JOURNAL ARTICLE

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Abstract

We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP.
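
The abstract names masked codec modeling but does not describe its masking scheme. As a rough illustration only, the sketch below shows a generic masked-token corruption step applied to a sequence of discrete codec codes: spans of codes are hidden, and the original values are kept as prediction targets. The function name, the MASK_ID sentinel, the mask rate, and the span length are all hypothetical choices made for this example, not details taken from the paper or its repository.

```python
import random

MASK_ID = -1  # hypothetical sentinel marking a hidden position (assumption)


def mask_codec_tokens(codes, mask_prob=0.15, span=3, seed=0):
    """Hide spans of discrete codec codes and record what was hidden.

    Returns (corrupted, targets). `corrupted` is a copy of `codes` with some
    spans replaced by MASK_ID. `targets` holds the original code at each
    masked position and None elsewhere, so a training loss would only be
    computed where the input was hidden. Roughly `mask_prob` of the
    positions are masked in expectation.
    """
    rng = random.Random(seed)
    corrupted = list(codes)
    targets = [None] * len(codes)
    i = 0
    while i < len(codes):
        # Start a new masked span with probability mask_prob / span.
        if rng.random() < mask_prob / span:
            for j in range(i, min(i + span, len(codes))):
                targets[j] = codes[j]
                corrupted[j] = MASK_ID
            i += span
        else:
            i += 1
    return corrupted, targets


if __name__ == "__main__":
    # Toy code sequence standing in for one codebook of a neural audio codec.
    toy_codes = [512, 17, 903, 44, 44, 260, 781, 5, 5, 660, 128, 39]
    corrupted, targets = mask_codec_tokens(toy_codes, mask_prob=0.3)
    print(corrupted)
    print(targets)
```

In a setup like this, a model would receive the corrupted sequence and be trained to recover the entries of `targets` that are not None. How EnCLAP actually combines such an objective with captioning is described in the paper itself, not here.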

Keywords:
Computer science, Closed captioning, Codec, Embedding, Speech recognition, Speech coding, Language model, Source code, Natural language processing, Code (set theory), Artificial intelligence, Programming language, Computer hardware, Image (mathematics)

Metrics

Cited By: 11
FWCI (Field Weighted Citation Impact): 7.84
Refs: 38
Citation Normalized Percentile: 0.95

Topics

Music and Audio Processing
Physical Sciences → Computer Science → Signal Processing
Speech Recognition and Synthesis
Physical Sciences → Computer Science → Artificial Intelligence
Speech and Audio Processing
Physical Sciences → Computer Science → Signal Processing

Related Documents

JOURNAL ARTICLE

Neural Audio Codec

Tapendra Pandey

Year: 2023

JOURNAL ARTICLE

Sound to Text: Automated Audio Captioning using Deep Learning

Mei, Xinhao

Journal: Surrey Open Research repository (University of Surrey)
Year: 2024

JOURNAL ARTICLE

Weakly-supervised Automated Audio Captioning via text only training

Kouzelis, Theodoros; Katsouros, Vassilis

Journal: Zenodo (CERN European Organization for Nuclear Research)
Year: 2023

JOURNAL ARTICLE

Weakly-supervised Automated Audio Captioning via text only training

Kouzelis, Theodoros; Katsouros, Vasilis

Journal: Zenodo (CERN European Organization for Nuclear Research)
Year: 2023