JOURNAL ARTICLE

Audio-Journey: Open Domain Latent Diffusion Based Text-To-Audio Generation

Abstract

Despite recent progress, machine learning (ML) models for open-domain audio generation lag behind generative models for image, text, speech, and music. The main reason for this gap is the lack of massive open-domain audio datasets; we overcome this challenge with a novel data augmentation approach. We leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich the captions in a weakly-labeled audio dataset. We then use a SOTA video-captioning model to caption the videos from which the audio originated, and again use LLMs to merge the audio and video captions into a rich, large-scale dataset. We experimentally evaluate the quality of our audio-visual captions, showing a 12.5% gain in semantic score over baselines. Using the augmented dataset, we train a Latent Diffusion Model that generates in the latent space of an EnCodec encoder. Our model is novel in the current SOTA audio generation landscape in its generation space, text encoder, noise schedule, and attention mechanism. Together, these innovations yield competitive open-domain audio generation. Samples, models, and implementation will be available at https://audiojourney.github.io.
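The augmentation pipeline the abstract describes (weak audio label, automatic video caption, LLM merge) can be sketched as prompt construction. This is a minimal illustration, not the paper's implementation: the function name, prompt wording, and example captions are all hypothetical.

```python
def build_merge_prompt(audio_caption: str, video_caption: str) -> str:
    """Build a hypothetical LLM prompt that merges a weak audio label
    with a generated video caption into one rich audio description.
    (Illustrative only; not the prompt used in the paper.)"""
    return (
        "Combine the following two descriptions of the same clip into a "
        "single, detailed caption of the audio content.\n"
        f"Audio label: {audio_caption}\n"
        f"Video caption: {video_caption}\n"
        "Merged caption:"
    )

# Example: a weak AudioSet-style label plus an automatic video caption.
prompt = build_merge_prompt(
    "dog barking",
    "a dog runs across a park chasing a ball",
)
print(prompt)
```

In practice the returned prompt would be sent to an LLM, and the completion after "Merged caption:" would become the training caption for the corresponding audio clip.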

Keywords:
Computer science; Closed captioning; Speech recognition; Encoder; Artificial intelligence; Language model; Natural language processing; Information retrieval

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 2.85
References: 36
Citation Normalized Percentile: 0.83

Topics

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music Technology and Sound Studies (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)