EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Jaeyeon Kim; Jae‐Yoon Jung; Jinjoo Lee; Sang Hoon Woo

doi:10.1109/icassp48485.2024.10446672

JOURNAL ARTICLE

Year: 2024 Pages: 6735-6739

Abstract

We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP.

Keywords:

Computer science; Closed captioning; Codec; Embedding; Speech recognition; Speech coding; Language model; Source code; Natural language processing; Code (set theory); Artificial intelligence; Programming language; Computer hardware; Image (mathematics)

Metrics

FWCI (Field Weighted Citation Impact): 7.84

Citation Normalized Percentile: 0.95

Topics

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Neural Audio Codec

Discrete Audio Representations for Automated Audio Captioning

Sound to Text: Automated Audio Captioning using Deep Learning

Weakly-supervised Automated Audio Captioning via text only training
