Abstract

Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
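
The train/inference asymmetry described in the abstract can be illustrated with a minimal sketch of the noise-injection variant. This is not the authors' implementation: the function names, the noise standard deviation of 0.1, and the re-normalization after adding noise are all assumptions made for illustration, and plain Python lists stand in for real CLAP embeddings.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inject_noise(text_embedding, std, rng):
    """Add zero-mean Gaussian noise to a text embedding, then re-normalize.

    Re-normalizing is an assumption of this sketch, not a detail taken
    from the abstract.
    """
    noisy = [x + rng.gauss(0.0, std) for x in text_embedding]
    return l2_normalize(noisy)


def decoder_input(embedding, training, std=0.1, rng=None):
    """Build the conditioning vector handed to the caption decoder.

    Training:  `embedding` comes from the text encoder and is perturbed,
               so the decoder learns to tolerate nearby points in the
               shared space.
    Inference: `embedding` comes from the audio encoder and is used as is.
    The default std of 0.1 is a hypothetical value.
    """
    embedding = l2_normalize(embedding)
    if training:
        return inject_noise(embedding, std, rng or random.Random())
    return embedding


if __name__ == "__main__":
    rng = random.Random(0)
    text_emb = [3.0, 4.0, 0.0]   # stand-in for a text-encoder output
    audio_emb = [2.5, 4.2, 0.3]  # stand-in for an audio-encoder output

    train_vec = decoder_input(text_emb, training=True, std=0.1, rng=rng)
    infer_vec = decoder_input(audio_emb, training=False)
    print("training input :", train_vec)
    print("inference input:", infer_vec)
```

The only point of the sketch is that the decoder never sees a clean text embedding during training, which is one way to make it less sensitive to the offset between text and audio embeddings at inference time. The learnable adapter mentioned in the abstract would replace `inject_noise` with a trained mapping and is not shown here.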

Keywords:
Closed captioning, Computer science, Speech recognition, Encoder, Audio mining, Natural language processing, Task (project management), Inference, Artificial intelligence, Acoustic model, Speech processing

Metrics

Cited By: 14
FWCI (Field Weighted Citation Impact): 9.98
Refs: 35
Citation Normalized Percentile: 0.97

Topics

Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Speech Recognition and Synthesis
Physical Sciences → Computer Science → Artificial Intelligence
Natural Language Processing Techniques
Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Diverse Audio Captioning Via Adversarial Training

Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

Journal:   ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Pages: 8882-8886
JOURNAL ARTICLE

Audio captioning

Kuzmin, Nikita; Dyakonov, Alexander

Journal:   Zenodo (CERN European Organization for Nuclear Research) Year: 2020