Recap: Retrieval-Augmented Audio Captioning

Sreyan Ghosh; Sonal Kumar; Chandra Kiran Reddy Evuru; Ramani Duraiswami; Dinesh Manocha

doi:10.1109/icassp48485.2024.10448030

JOURNAL ARTICLE

Year: 2024 Pages: 1161-1165

Abstract

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio sample and on captions similar to that audio retrieved from a datastore. Our proposed method can also transfer to any domain without additional fine-tuning. To generate a caption for an audio sample, we use the audio-text model CLAP [1] to retrieve similar captions from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition caption generation on the audio. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, because it can exploit a large datastore of text captions alone in a training-free fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audio containing multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
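The retrieval and prompt-construction steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors stand in for CLAP embeddings, the function names `retrieve_captions` and `build_prompt` are hypothetical, and the prompt template is an assumed wording chosen for illustration. The sketch only shows the general idea of ranking datastore captions by cosine similarity to the audio embedding and joining the top results into a text prompt for a decoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_captions(audio_emb, datastore, k=2):
    """Return the k captions whose embeddings are closest to audio_emb.

    datastore is a list of (caption, embedding) pairs. In a real system the
    embeddings would come from an audio-text model such as CLAP; here they
    are toy vectors (assumption for illustration).
    """
    ranked = sorted(
        datastore,
        key=lambda item: cosine(audio_emb, item[1]),
        reverse=True,
    )
    return [caption for caption, _ in ranked[:k]]


def build_prompt(captions):
    """Join retrieved captions into a decoder prompt.

    The template wording is hypothetical, not taken from the paper.
    """
    context = " ".join(c.rstrip(".") + "." for c in captions)
    return f"Audios similar to this audio sound like: {context} This audio sounds like:"


# Toy datastore: captions with made-up 3-d embeddings.
datastore = [
    ("A dog barks while cars pass by", [0.9, 0.1, 0.0]),
    ("Rain falls on a tin roof", [0.0, 0.2, 0.9]),
    ("A puppy yelps in a park", [0.8, 0.3, 0.1]),
]

audio_emb = [1.0, 0.2, 0.0]  # stand-in for the embedding of the input audio
top = retrieve_captions(audio_emb, datastore, k=2)
prompt = build_prompt(top)
print(prompt)
```

Because the datastore is just a list passed in at call time, it can be swapped for captions from another domain without retraining anything, which mirrors the "replaceable datastore" property the abstract highlights. The cross-attention conditioning on audio features is not covered by this sketch.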

Keywords:

Closed captioning; Computer science; Leverage (statistics); Encoder; Domain (mathematical analysis); Benchmark (surveying); Speech recognition; Artificial intelligence; Information retrieval; Natural language processing; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 9.98

Citation Normalized Percentile: 0.97

Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Retrieval-augmented Image Captioning

Retrieval-Augmented Egocentric Video Captioning

DistillCaps: Enhancing Audio-Language Alignment in Captioning via Retrieval-Augmented Knowledge Distillation

Understanding Retrieval Robustness for Retrieval-augmented Image Captioning

Retrieval-Augmented Transformer for Image Captioning