Abstract

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. Accurately recognizing ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help describe ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
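
As a rough illustration of what adaptively integrating audio and visual context can look like, the sketch below attends over each modality with a decoder query and mixes the two context vectors with a learned sigmoid gate. This is a generic gated-fusion pattern, not the mechanism from the paper, whose details the abstract does not give. The names attend, fuse_audio_visual and w_gate, the feature sizes, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, feats):
    # Scaled dot-product attention of one query (d,) over features (T, d).
    weights = softmax(feats @ query / np.sqrt(query.shape[0]))
    return weights @ feats, weights

def fuse_audio_visual(query, audio_feats, visual_feats, w_gate):
    # Hypothetical fusion: a per-dimension gate decides how much of each
    # modality's context to keep (assumed design, not the authors' method).
    audio_ctx, _ = attend(query, audio_feats)
    visual_ctx, _ = attend(query, visual_feats)
    gate_in = np.concatenate([query, audio_ctx, visual_ctx])
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ gate_in)))
    return gate * audio_ctx + (1.0 - gate) * visual_ctx, gate

# Toy inputs standing in for encoder outputs (random, illustrative only).
rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)               # decoder state at one step
audio_feats = rng.normal(size=(10, d))   # 10 audio frames
visual_feats = rng.normal(size=(5, d))   # 5 video frames
w_gate = 0.1 * rng.normal(size=(d, 3 * d))

fused, gate = fuse_audio_visual(query, audio_feats, visual_feats, w_gate)
print(fused.shape, gate.min(), gate.max())
```

A gate value near 1 in some dimension keeps the audio context there, and a value near 0 keeps the visual context, which is one simple way a model could suppress a redundant modality.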

Keywords:
Closed captioning; Computer science; Audio visual; Multimedia; Speech recognition; Human–computer interaction; Artificial intelligence; Image (mathematics)

Metrics

Cited By: 15
FWCI (Field Weighted Citation Impact): 2.73
Refs: 31
Citation Normalized Percentile: 0.88

Topics

Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Subtitles and Audiovisual Media
Social Sciences →  Arts and Humanities →  Language and Linguistics

Related Documents

JOURNAL ARTICLE

Graph Attention for Automated Audio Captioning

Feiyang Xiao, Jian Guan, Qiaoxi Zhu, Wenwu Wang

Journal: IEEE Signal Processing Letters, Year: 2023, Vol: 30, Pages: 413-417
JOURNAL ARTICLE

Visual Content Captioning and Audio Conversion using CNN-RNN with Attention Model

Agus Hermanto, Giat Karyono, Imam Tahyudin, Boby Sandityas Prahasto

Journal: Journal of Innovation Information Technology and Application (JINITA), Year: 2025, Vol: 7 (1), Pages: 163-170