Abstract

Despite recent progress in text-to-audio (TTA) generation, we show that state-of-the-art models such as AudioLDM, when trained on datasets with an imbalanced class distribution such as AudioCaps, are biased in their generation performance. Specifically, they excel at generating common audio classes but underperform on rare ones, which degrades overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address it, we propose a simple retrieval-augmented approach for TTA models. Given an input text prompt, we first use a Contrastive Language-Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved text-audio pairs are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with the proposed approach and denote the resulting system Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Fréchet Audio Distance (FAD) of 1.37, outperforming existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.
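
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings below are toy vectors standing in for CLAP text embeddings, and the names `cosine_similarity`, `retrieve_top_k`, the datastore layout, and the value of `k` are all hypothetical. The sketch ranks stored text-audio pairs by cosine similarity to the prompt embedding and returns the features of the closest pairs, which a TTA model could then take as additional conditions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, datastore, k=3):
    """Return the k datastore entries whose text embedding is closest to the query.

    Assumption: each entry is a dict with 'caption', 'text_embedding' and
    'audio_features' keys. This layout is illustrative only.
    """
    scored = [
        (cosine_similarity(query_embedding, item["text_embedding"]), item)
        for item in datastore
    ]
    # Highest similarity first; sort on the score only so dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


if __name__ == "__main__":
    # Toy datastore: in practice the embeddings would come from a pretrained
    # CLAP model rather than being written by hand.
    datastore = [
        {"caption": "a dog barks", "text_embedding": [1.0, 0.0, 0.0],
         "audio_features": [0.9, 0.1]},
        {"caption": "rain falls on a roof", "text_embedding": [0.0, 1.0, 0.0],
         "audio_features": [0.2, 0.8]},
        {"caption": "a puppy whines", "text_embedding": [0.6, 0.4, 0.0],
         "audio_features": [0.7, 0.3]},
    ]
    prompt_embedding = [0.9, 0.1, 0.0]  # hypothetical embedding of the input prompt

    retrieved = retrieve_top_k(prompt_embedding, datastore, k=2)
    # The retrieved captions and audio features would be passed to the
    # generator as extra conditioning signals alongside the prompt.
    for item in retrieved:
        print(item["caption"], item["audio_features"])
```

A linear scan is enough for a toy example; for a datastore the size of a real training set, an approximate nearest-neighbour index would be the more usual choice.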

Keywords:
Computer science, Augmented reality, Speech recognition, Multimedia, Artificial intelligence, Information retrieval, Natural language processing

Metrics

Cited By: 17
FWCI (Field Weighted Citation Impact): 12.11
Refs: 28
Citation Normalized Percentile: 0.98

Topics

Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence
Speech Recognition and Synthesis
Physical Sciences → Computer Science → Artificial Intelligence
