Text-to-audio grounding (TAG) aims to detect sound events described by natural language in an audio clip. Strongly-supervised TAG requires extensive human annotation of event on- and off-sets. To mitigate the reliance on strongly-annotated data, weakly-supervised TAG (WSTAG) has been proposed, which trains TAG models on audio captioning data via contrastive learning. However, crucial components of WSTAG, namely pooling strategies and loss functions, remain unexplored. Directly borrowing the corresponding components from closely-related tasks, such as sound event detection (SED) and audio-text retrieval, does not necessarily fit this task, due to TAG's unique requirement of fine-grained alignment via free text. In this work, we first improve the TAG dataset to obtain a more reliable performance indicator, AudioGrounding v2. We then extensively investigate the effects of these components on WSTAG. Results on the refined dataset demonstrate that the pooling strategy is crucial to model performance, while the loss function has much less influence. By combining proper pooling strategies and loss functions, we arrive at a more effective WSTAG framework that significantly enhances the ability to detect events, especially short-duration ones.¹

¹ The code and data are available at https://github.com/wsntxxn/TextToAudioGrounding
Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu
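To make the role of the pooling strategy concrete, below is a minimal sketch of how frame-level audio-phrase similarities can be aggregated into a clip-level score for weakly-supervised contrastive training. It is an illustration, not the authors' exact implementation: the linear-softmax pooling function, embedding dimensions, and the BCE objective over matched/mismatched audio-text pairs are all assumptions chosen for clarity.

```python
# Illustrative sketch (not the paper's exact implementation): pooling
# frame-level audio-phrase similarities into a clip-level score for
# weakly-supervised contrastive training. Names and dimensions are assumed.
import torch
import torch.nn.functional as F


def frame_similarities(audio_emb, text_emb):
    """Cosine similarity between each audio frame and a phrase embedding.

    audio_emb: (batch, frames, dim) frame-level audio embeddings
    text_emb:  (batch, dim) phrase/caption embeddings
    returns:   (batch, frames) per-frame similarity scores in [-1, 1]
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return torch.einsum("btd,bd->bt", audio_emb, text_emb)


def linear_softmax_pool(sim):
    """Linear-softmax pooling: frames with higher scores receive higher
    weights, which favours short, salient events over plain mean pooling."""
    prob = sim.sigmoid()  # map each frame score to (0, 1)
    return (prob * prob).sum(-1) / prob.sum(-1).clamp(min=1e-7)


# Toy usage: matched (label 1) and mismatched (label 0) audio-text pairs
# are scored with the pooled similarity and trained with a BCE loss.
audio = torch.randn(4, 100, 256)  # 4 clips, 100 frames, 256-dim embeddings
text = torch.randn(4, 256)        # 4 phrase embeddings
clip_score = linear_softmax_pool(frame_similarities(audio, text))
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
loss = F.binary_cross_entropy(clip_score, labels)
```

Because only clip-level (weak) supervision is available, the choice of pooling determines how the clip-level training signal is redistributed to frames at inference time, which is why it matters most for short-duration events.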