JOURNAL ARTICLE

Investigating Pooling Strategies and Loss Functions for Weakly-Supervised Text-to-Audio Grounding via Contrastive Learning

Abstract

Text-to-audio grounding (TAG) aims to detect sound events described by natural language in an audio clip. Strongly-supervised TAG requires extensive human annotation of the events' onsets and offsets. To reduce the reliance on strongly-annotated data, weakly-supervised TAG (WSTAG) has been proposed, which trains TAG models on audio captioning data using contrastive learning. However, crucial components of WSTAG, namely pooling strategies and loss functions, remain unexplored. Directly borrowing their counterparts from closely related tasks, such as sound event detection (SED) and audio-text retrieval, does not necessarily suit this task, because TAG uniquely requires fine-grained alignment with free text. In this work, we first refine the TAG dataset into AudioGrounding v2 to obtain a more reliable indicator of TAG performance. Then we extensively investigate the effects of these components on WSTAG. Results on the refined dataset demonstrate that the pooling strategy is crucial to model performance, while the loss function has much less influence. By combining proper pooling strategies and loss functions, we obtain a more effective WSTAG framework that significantly improves event detection, especially for short-duration events. The code and data are available at https://github.com/wsntxxn/TextToAudioGrounding
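
To make the two components named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a model has already produced one text-audio matching score in [0, 1] per audio frame, and it shows three pooling functions commonly used in sound event detection (mean, max, and linear softmax) together with a generic InfoNCE-style contrastive loss over clip-level scores. All function names, the temperature value, and the example scores are hypothetical.

```python
import math


def mean_pool(frame_scores):
    """Average the frame-level scores; every frame contributes equally."""
    return sum(frame_scores) / len(frame_scores)


def max_pool(frame_scores):
    """Keep only the single highest frame-level score."""
    return max(frame_scores)


def linear_softmax_pool(frame_scores):
    """Weight each score by itself: sum(s^2) / sum(s)."""
    total = sum(frame_scores)
    if total == 0:
        return 0.0
    return sum(s * s for s in frame_scores) / total


def contrastive_loss(clip_scores, temperature=0.1):
    """InfoNCE-style loss in the audio-to-text direction only.

    clip_scores[i][j] is the pooled score between audio clip i and
    caption j; matched pairs are assumed to lie on the diagonal.
    The temperature is an illustrative value, not taken from the paper.
    """
    n = len(clip_scores)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in clip_scores[i]]
        m = max(logits)  # subtract the maximum for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denominator - logits[i]
    return loss / n


# Hypothetical short event: only 2 of 10 frames match the text query.
short_event = [0.9, 0.8] + [0.05] * 8

print(mean_pool(short_event))
print(max_pool(short_event))
print(linear_softmax_pool(short_event))
```

On this made-up short event, mean pooling yields a much lower clip-level score than max or linear softmax pooling, since the few matching frames are averaged with many non-matching ones. This is one plausible intuition for why the choice of pooling could matter for short-duration events; the paper's own experiments are the authority on which strategy works best.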

Keywords:
Pooling; Computer science; Closed captioning; Artificial intelligence; Code (set theory); Natural language processing; Function (biology); Natural language; Speech recognition; Machine learning; Information retrieval; Image (mathematics); Programming language

Metrics

Cited By: 5
FWCI (Field Weighted Citation Impact): 1.34
Refs: 27
Citation Normalized Percentile: 0.77

Topics

Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Towards Weakly Supervised Text-to-Audio Grounding

Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu

Journal: IEEE Transactions on Multimedia Year: 2024 Vol: 26 Pages: 11126-11138

JOURNAL ARTICLE

Counterfactual contrastive learning for weakly supervised temporal sentence grounding

Yenan Xu, Wanru Xu, Zhenjiang Miao

Journal: Neurocomputing Year: 2025 Vol: 624 Pages: 129508-129508

JOURNAL ARTICLE

Weakly Supervised Contrastive Learning

Mingkai Zheng, Fei Wang, Shan You, Chen Qian, Changshui Zhang, Xiaogang Wang, Chang Xu

Journal: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) Year: 2021 Pages: 10022-10031

JOURNAL ARTICLE

Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning

Minghang Zheng, Yanjie Huang, Qing-Chao Chen, Yuxin Peng, Yang Liu

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Year: 2022 Pages: 15534-15543