Text-to-audio grounding (TAG) aims to detect sound events described by natural language in an audio clip. Strongly-supervised TAG requires extensive human annotation of event on- and off-sets. To mitigate the reliance on strongly-annotated data, weakly-supervised TAG (WSTAG) has been proposed, which trains TAG models on audio captioning data via contrastive learning. However, crucial components of WSTAG, namely pooling strategies and loss functions, remain unexplored. Directly borrowing the corresponding components from closely-related tasks, such as sound event detection (SED) and audio-text retrieval, does not necessarily fit this task, due to TAG's unique requirement of fine-grained alignment via free text. In this work, we first improve the TAG dataset to obtain a more reliable performance indicator, AudioGrounding v2. We then extensively investigate the effects of these components on WSTAG. Results on the refined dataset demonstrate that the pooling strategy is crucial to model performance, while the loss function has much less influence. By combining proper pooling strategies and loss functions, we arrive at a more effective WSTAG framework that significantly enhances the ability to detect events, especially short-duration ones.¹

¹ The code and data are available at https://github.com/wsntxxn/TextToAudioGrounding
Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu
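To make the role of the pooling strategy concrete, below is a minimal sketch of how frame-level audio-phrase similarities can be aggregated into a clip-level score for weakly-supervised contrastive training. It is an illustration, not the authors' exact implementation: the linear-softmax pooling function, embedding dimensions, and the BCE objective over matched/mismatched audio-text pairs are all assumptions chosen for clarity.

```python
# Illustrative sketch (not the paper's exact implementation): pooling
# frame-level audio-phrase similarities into a clip-level score for
# weakly-supervised contrastive training. Names and dimensions are assumed.
import torch
import torch.nn.functional as F


def frame_similarities(audio_emb, text_emb):
    """Cosine similarity between each audio frame and a phrase embedding.

    audio_emb: (batch, frames, dim) frame-level audio embeddings
    text_emb:  (batch, dim) phrase/caption embeddings
    returns:   (batch, frames) per-frame similarity scores in [-1, 1]
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return torch.einsum("btd,bd->bt", audio_emb, text_emb)


def linear_softmax_pool(sim):
    """Linear-softmax pooling: frames with higher scores receive higher
    weights, which favours short, salient events over plain mean pooling."""
    prob = sim.sigmoid()  # map each frame score to (0, 1)
    return (prob * prob).sum(-1) / prob.sum(-1).clamp(min=1e-7)


# Toy usage: matched (label 1) and mismatched (label 0) audio-text pairs
# are scored with the pooled similarity and trained with a BCE loss.
audio = torch.randn(4, 100, 256)  # 4 clips, 100 frames, 256-dim embeddings
text = torch.randn(4, 256)        # 4 phrase embeddings
clip_score = linear_softmax_pool(frame_similarities(audio, text))
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
loss = F.binary_cross_entropy(clip_score, labels)
```

Because only clip-level (weak) supervision is available, the choice of pooling determines how the clip-level training signal is redistributed to frames at inference time, which is why it matters most for short-duration events.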