JOURNAL ARTICLE

Weakly-supervised Automated Audio Captioning via text only training

Kouzelis, Theodoros; Katsouros, Vassilis

Year: 2023 Journal: Zenodo (CERN European Organization for Nuclear Research) Publisher: European Organization for Nuclear Research

Abstract

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, a task known as Automated Audio Captioning (AAC). However, collecting a sufficient number of paired audio clips and captions is labor-intensive and time-consuming. Motivated by recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach that trains an AAC model using only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings, we employ strategies to bridge the gap during the training and inference stages. We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to ~ compared to fully supervised approaches trained with paired target data.
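The train-on-text, decode-on-audio idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which gap-bridging strategies are used, so the Gaussian noise injection below is an assumed stand-in, and the function names (`training_prefix`, `inference_prefix`) and the toy vectors are hypothetical. Real embeddings would come from a pre-trained CLAP model, and the resulting prefix would condition a text decoder.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so embeddings can be compared by cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def training_prefix(text_embedding, noise_std, rng):
    """Decoder input at training time: the caption's text embedding.

    Assumption: Gaussian noise is added to make the decoder tolerant of the
    offset between text and audio embeddings. The abstract only mentions
    'strategies' and does not confirm this particular one.
    """
    noisy = [x + rng.gauss(0.0, noise_std) for x in text_embedding]
    return l2_normalize(noisy)


def inference_prefix(audio_embedding):
    """Decoder input at inference time: the audio embedding replaces the text embedding."""
    return l2_normalize(audio_embedding)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for CLAP outputs (hypothetical values, not real embeddings).
    text_emb = [0.2, -0.5, 0.1, 0.7]
    audio_emb = [0.25, -0.4, 0.0, 0.8]
    train_in = training_prefix(text_emb, noise_std=0.1, rng=rng)
    test_in = inference_prefix(audio_emb)
    print("similarity of training and inference inputs:",
          round(cosine_similarity(train_in, test_in), 3))
```

The decoder only ever sees text-derived vectors during training, so the approach works only to the extent that the audio embedding of a clip lies close to the text embedding of its caption in the shared space. The noise term is one way to make the decoder less sensitive to the remaining offset.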

Keywords:
Closed captioning; Modality (human–computer interaction); Inference; Similarity (geometry); Bridge (graph theory); Training set; Audio signal

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.31

Topics

Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Towards Weakly Supervised Text-to-Audio Grounding

Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu

Journal: IEEE Transactions on Multimedia Year: 2024 Vol: 26 Pages: 11126-11138

JOURNAL ARTICLE

Sound to Text: Automated Audio Captioning using Deep Learning

Mei, Xinhao

Journal: Surrey Open Research repository (University of Surrey) Year: 2024