Kang Lin, Wei Zhou, Zhijie Zheng, Dihu Chen, Tao Su
Weakly-Supervised Temporal Action Localization (WTAL) aims to identify the temporal boundaries of actions and classify them in untrimmed videos using only video-level labels during training. Despite recent progress, many existing approaches follow a localization-by-classification pipeline that treats snippets as independent instances and thus exploits only limited contextual information. Moreover, these methods struggle to capture multi-scale temporal information and neglect both the internal temporal structure within videos and the semantic consistency across videos, resulting in misclassification and inaccurate localization. To address these limitations, we introduce a novel Temporal and Semantic Correlation Network (TSC-Net) for the WTAL task, which can be trained end-to-end. First, we propose a Multi-Scale Features Integration Pyramid (MFIP) module to integrate multi-scale temporal features, effectively addressing the missed detections caused by short action durations. Second, we design a Temporal Correlation Enhancement (TCE) branch that strengthens segment correlations using video-level temporal structure, improving the completeness of action localization. Finally, a Dataset-Wide Semantic Awareness (DSA) branch constructs and propagates a dataset-level action semantics bank, enhancing the model's awareness of semantic consistency among actions. Extensive experiments show that TSC-Net outperforms most existing WTAL methods, achieving an average mAP of 46.3% on the THUMOS-14 dataset and 26.5% on the ActivityNet-1.2 dataset. Detailed ablation studies further confirm the effectiveness of each component of our model. The code and models are publicly available at https://github.com/linkang-els/TSC-Net-main.