Abstract

This study introduces audio difference learning, a novel training paradigm for improving audio captioning. The core idea is to build a feature representation space that preserves the relationships between audio signals, enabling the generation of captions that capture intricate audio details. The method takes a reference audio alongside the input audio and transforms both into feature representations with a shared encoder. Captions are then generated from the differential features between the two to describe how they differ. Furthermore, a technique is proposed in which the input audio is mixed with additional audio and the additional audio is used as the reference. The difference between the mixed audio and the reference audio then corresponds to the original input audio, so the original input's caption can serve as the caption for the difference, eliminating the need for additional annotations of differences. In experiments using the Clotho and ESC50 datasets, the proposed method improved the SPIDEr score by 7% compared to conventional methods.
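
The mixing technique described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_encoder` is a hypothetical linear stand-in for the shared encoder (the actual encoder is presumably a neural network), the differential feature is assumed to be a plain subtraction of the two encodings, and the caption string is a placeholder rather than dataset content. With a linear encoder the difference recovers the encoding of the original input exactly; with a learned nonlinear encoder this would hold only approximately, which is the relationship the training is meant to encourage.

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_encoder(audio, weights):
    """Hypothetical shared encoder: split the waveform into frames and project linearly."""
    frames = audio.reshape(-1, weights.shape[0])  # (num_frames, frame_len)
    return frames @ weights                       # (num_frames, feat_dim)


def make_difference_example(input_audio, additional_audio, caption, weights):
    """Build one (differential feature, target caption) pair without new annotation."""
    mixed = input_audio + additional_audio        # mixed audio fed as the input
    reference = additional_audio                  # the additional audio is the reference
    # Assumed differencing operation: subtract the encodings from the shared encoder.
    diff = toy_encoder(mixed, weights) - toy_encoder(reference, weights)
    # The caption of the original input is reused as the caption of the difference.
    return diff, caption


frame_len, feat_dim = 160, 8                      # arbitrary illustrative sizes
weights = rng.standard_normal((frame_len, feat_dim))
input_audio = rng.standard_normal(1600)           # placeholder waveform
additional_audio = rng.standard_normal(1600)      # placeholder waveform

diff, target = make_difference_example(
    input_audio, additional_audio, "a dog barks", weights
)
recovered = np.allclose(diff, toy_encoder(input_audio, weights))
```

In this toy setting `recovered` is true because the encoder is linear, which is why the original caption is a valid target for the difference.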

Keywords:
Computer science, Closed captioning, Audio analyzer, Speech recognition, Feature (linguistics), Audio signal, Encoder, Speech coding, Artificial intelligence, Audio mining, Audio signal processing, Multimedia, Acoustic model, Speech processing, Image (mathematics)

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 2.14
Refs: 20
Citation Normalized Percentile: 0.77

Topics

Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Audio Difference Learning Framework for Audio Captioning

Tatsuya Komatsu, Kazuya Takeda, Tomoki Toda

Journal:   APSIPA Transactions on Signal and Information Processing Year: 2025 Vol: 14 (1)
JOURNAL ARTICLE

Transfer Learning for Audio Captioning

Yanxi Chen, Shuguo Yang

Year: 2025 Pages: 88-93
JOURNAL ARTICLE

Audio captioning

Kuzmin, Nikita; Dyakonov, Alexander

Journal:   Zenodo (CERN European Organization for Nuclear Research) Year: 2020