Kang Lin, Wei Zhou, Zhijie Zheng, Dihu Chen, Tao Su
Weakly-Supervised Temporal Action Localization (WTAL) aims to identify the temporal boundaries of actions and classify them in untrimmed videos using only video-level labels during training. Despite recent progress, many existing approaches follow a localization-by-classification pipeline that treats snippets as independent instances and thus exploits only limited contextual information. Moreover, these methods struggle to capture multi-scale temporal information and neglect both the internal temporal structure within videos and the semantic consistency across videos, resulting in misclassification and inaccurate localization. To address these limitations, we introduce a novel Temporal and Semantic Correlation Network (TSC-Net) for the WTAL task, which can be trained end-to-end. First, we propose a Multi-Scale Features Integration Pyramid (MFIP) module to integrate multi-scale temporal features, effectively addressing the missed detections caused by short action durations. Second, we design a Temporal Correlation Enhancement (TCE) branch that strengthens segment correlations using video-level temporal structure, improving the completeness of action localization. Finally, a Dataset-Wide Semantic Awareness (DSA) branch constructs and propagates a dataset-level action semantics bank, enhancing the model's awareness of semantic consistency among actions. Extensive experiments show that TSC-Net outperforms most existing WTAL methods, achieving an average mAP of 46.3% on the THUMOS-14 dataset and 26.5% on the ActivityNet-1.2 dataset. Detailed ablation studies further confirm the effectiveness of each component of our model. The code and models are publicly available at https://github.com/linkang-els/TSC-Net-main.