This thesis investigates the potential of Neural Audio Codecs (NACs) to enrich the audio representation capabilities of Contrastive Language-Audio Pretraining (CLAP) models. We introduce an evaluation approach to systematically compare CLAP configurations that use distinct audio encoder modules on the text-to-audio retrieval task. Our experimental analysis indicates that NAC-based modules offer superior feature discrimination and retrieval efficacy. The research presents a methodological framework for NAC integration in CLAP models, sets new performance benchmarks, and outlines future directions, emphasizing the development of universal audio embeddings and refined pre-training techniques. Our code is available at https://github.com/duduOliver/SMC_CodecCLAP.
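As a point of reference for the text-to-audio retrieval evaluation mentioned above, the standard protocol ranks audio clips by cosine similarity to each text query in the shared CLAP embedding space and reports recall@k. The sketch below is a minimal, hypothetical illustration of that metric (the function name and toy data are assumptions, not the authors' code):

```python
import numpy as np

def recall_at_k(text_emb, audio_emb, k=5):
    """Fraction of text queries whose paired audio clip (same row index)
    appears among the top-k audio matches by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    sims = t @ a.T                              # (n_text, n_audio)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of top-k audio per query
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

# Toy sanity check: identical embeddings yield perfect top-1 retrieval.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
print(recall_at_k(emb, emb, k=1))  # 1.0
```

In practice the two embedding matrices would come from the CLAP text encoder and the audio encoder under comparison (e.g., a NAC-based module), evaluated on a held-out captioned-audio test set.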