TEXT CONDITIONED SYMBOLIC DRUMBEAT GENERATION USING LATENT DIFFUSION MODELS

JAJORIA, Pushkar; KLAKOW, Dietrich; MCDERMOTT, James

doi:10.5281/zenodo.15838007

JOURNAL ARTICLE

Year: 2025
Journal: Zenodo (CERN European Organization for Nuclear Research)
Publisher: European Organization for Nuclear Research

Abstract

The challenge of generating coherent and novel MIDI drumbeats conditioned on text prompts remains largely unsolved, primarily due to the scarcity of well-annotated datasets linking text and MIDI drumbeats. Existing models have made strides in AI-generated music, yet they often fall short in producing drumbeats that are high in quality or that align well with textual prompts. This study introduces a text-conditioned approach to generating drumbeats with Latent Diffusion Models (LDMs). We use informative conditioning text extracted from training data filenames. By pretraining a text encoder and a drumbeat encoder through contrastive learning within a multimodal network, we closely align the text and music modalities. Additionally, we examine an alternative text encoder based on multi-hot text encodings. Inspired by music’s multi-resolution nature, we train the MIDI autoencoder using a novel LSTM variant, MultiResolutionLSTM (MRLSTM), designed to operate at various resolutions independently. In common with recent LDMs for image generation, we run diffusion in the autoencoder latent space, which brings the generation time for a single drumbeat down to 1.1 seconds. We demonstrate the originality and variety of the generated drumbeats by measuring their distance (both over binary pianorolls and in the latent space) from the training dataset and from one another. We also assess the generated drumbeats through a listening test focused on questions of quality, aptness for the text prompt, and novelty. The BERT model achieved quality and aptness scores comparable to the dataset drumbeats (differences of -1.99% and +5.64%, respectively), while exhibiting a 22.26% improvement in novelty.
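
The distance-based originality and variety checks mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the distance function, so normalized Hamming distance is assumed here, and the pianoroll layout (time steps by drum instruments, 0/1 cells) as well as the names `hamming`, `novelty`, and `diversity` are hypothetical choices for illustration, not the authors' implementation.

```python
from itertools import combinations

# Assumed layout: a pianoroll is a list of time steps, each a list of
# 0/1 flags with one flag per drum instrument.

def hamming(a, b):
    """Fraction of (step, instrument) cells in which two pianorolls differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("pianorolls must have the same shape")
    cells = sum(len(row) for row in a)
    diff = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / cells

def novelty(generated, training):
    """Distance from one generated beat to its nearest training beat."""
    return min(hamming(generated, t) for t in training)

def diversity(beats):
    """Mean pairwise distance among a set of generated beats."""
    pairs = list(combinations(beats, 2))
    if not pairs:
        raise ValueError("need at least two beats")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

# Toy data: 4 time steps x 2 instruments (kick, snare).
train = [
    [[1, 0], [0, 0], [0, 1], [0, 0]],
    [[1, 0], [1, 0], [0, 1], [0, 1]],
]
gen = [
    [[1, 0], [0, 0], [0, 1], [0, 1]],  # one cell away from each training beat
    [[1, 0], [0, 0], [0, 1], [0, 0]],  # exact copy of train[0]
]

print(novelty(gen[0], train))  # 0.125
print(novelty(gen[1], train))  # 0.0
print(diversity(gen))          # 0.125
```

A novelty of 0.0 flags a generated beat that copies a training example, while a low diversity value would suggest the generated set has collapsed onto a few patterns. The same two functions could in principle be applied to latent vectors by swapping `hamming` for a continuous distance.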

Keywords:

Autoencoder; MIDI; Process (computing); Encoder; Deep learning; Binary number; Modality (human–computer interaction); Artificial neural network

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Citation Normalized Percentile: 0.61

Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Generative Adversarial Networks and Image Synthesis

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Music Technology and Sound Studies

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
