JOURNAL ARTICLE

TEXT CONDITIONED SYMBOLIC DRUMBEAT GENERATION USING LATENT DIFFUSION MODELS

JAJORIA, Pushkar; KLAKOW, Dietrich; MCDERMOTT, James

Year: 2025
Journal: Zenodo (CERN European Organization for Nuclear Research)
Publisher: European Organization for Nuclear Research

Abstract

The challenge of generating coherent and novel MIDI drumbeats conditioned on text prompts remains largely unsolved, primarily due to the scarcity of well-annotated datasets linking text and MIDI drumbeats. Existing models have made strides in AI-generated music, yet they often fall short in producing drumbeats that are high in quality or that align well with textual prompts. This study introduces a text-conditioned approach to generating drumbeats with Latent Diffusion Models (LDMs). We use informative conditioning text extracted from training data filenames. By pretraining text and drumbeat encoders through contrastive learning within a multimodal network, we align the text and music modalities closely. Additionally, we examine an alternative text encoder based on multi-hot text encodings. Inspired by music’s multi-resolution nature, we train the MIDI autoencoder using a novel LSTM variant, MultiResolutionLSTM (MRLSTM), designed to operate at various resolutions independently. As in recent LDMs for image generation, we run diffusion in the autoencoder latent space, which reduces the generation time for a single drumbeat to 1.1 seconds. We demonstrate the originality and variety of the generated drumbeats by measuring distances, over binary pianorolls and in the latent space, from the training dataset and among the generated drumbeats. We also assess the generated drumbeats through a listening test focused on questions of quality, aptness for the text prompt, and novelty. The BERT model achieved quality and aptness scores comparable to the dataset drumbeats (differences of -1.99% and +5.64% respectively), while exhibiting a 22.26% improvement in novelty.
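The originality and variety measurements over binary pianorolls mentioned in the abstract can be illustrated with a small sketch. The abstract does not name the distance function, so the normalized Hamming distance used here is an assumption, as are the function names and the pianoroll layout (rows as time steps, columns as drum instruments); this is not the authors' implementation.

```python
from typing import List, Sequence

# Assumed layout: rows are time steps, columns are drum instruments, entries are 0 or 1.
Pianoroll = Sequence[Sequence[int]]


def hamming_distance(a: Pianoroll, b: Pianoroll) -> float:
    """Fraction of (time step, instrument) cells in which two pianorolls differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("pianorolls must have the same shape")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("pianorolls must not be empty")
    differing = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return differing / total


def nearest_training_distance(generated: Pianoroll, training: List[Pianoroll]) -> float:
    """Originality proxy: distance from a generated beat to its closest training beat."""
    return min(hamming_distance(generated, t) for t in training)


def mean_pairwise_distance(beats: List[Pianoroll]) -> float:
    """Variety proxy: average distance over all unordered pairs of beats."""
    if len(beats) < 2:
        raise ValueError("need at least two beats")
    pairs = [(i, j) for i in range(len(beats)) for j in range(i + 1, len(beats))]
    return sum(hamming_distance(beats[i], beats[j]) for i, j in pairs) / len(pairs)
```

Under these assumptions, a larger nearest-training distance suggests a generated drumbeat is less likely to be a copy of a training example, and a larger mean pairwise distance among generated drumbeats suggests more variety. The same two functions could be applied to latent vectors by swapping in a continuous distance such as Euclidean distance.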

Keywords:
Autoencoder; MIDI; Process (computing); Encoder; Deep learning; Binary number; Modality (human–computer interaction); Artificial neural network

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.61

Topics

Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Generative Adversarial Networks and Image Synthesis
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Music Technology and Sound Studies
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
