JOURNAL ARTICLE

Toward an Effective Action-Region Tracking Framework for Fine-Grained Video Action Recognition

Baoli Sun, Yihan Wang, Xinzhu Ma, Zhihui Wang, Kun Lu, Zhiyong Wang

Year: 2025 | Journal: IEEE Transactions on Neural Networks and Learning Systems | Vol: 37 (1) | Pages: 176-190 | Publisher: Institute of Electrical and Electronics Engineers

Abstract

Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions as they evolve over time. In this work, we introduce the action-region tracking (ART) framework, a novel solution that leverages a query-response mechanism to discover and track the dynamics of distinctive local details, enabling similar actions to be distinguished effectively. Specifically, we propose a region-specific semantic activation module that employs discriminative, text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction across the spatial and temporal dimensions of the corresponding video features. The captured region responses are then organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries are designed to expressly encode nuanced semantic representations derived from the textual descriptions of action labels, as extracted by the language branches of vision-language models (VLMs). To optimize the generated action tracklets, we design a multilevel tracklet contrastive constraint over multiple region responses at the spatial and temporal levels, which effectively distinguishes individual region responses within each video frame (spatial level) and establishes correlations between similar region responses in adjacent video frames (temporal level). In addition, we implement a task-specific fine-tuning mechanism to refine textual semantics during training, ensuring that the semantic representations encoded by VLMs are not only preserved but also optimized for the preferences of the specific task. Comprehensive experiments on several widely used action recognition benchmarks, i.e., FineGym, Diving48, NTU RGB+D, Kinetics, and Something-Something, clearly demonstrate its superiority over previous state-of-the-art baselines.
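To make the query-response mechanism concrete, the sketch below is a minimal illustration, not the authors' implementation: the module names, dimensions, learned queries, and simplified losses are assumptions made for illustration only. It shows text-constrained queries cross-attending over per-frame spatial tokens to produce region responses, which are stacked over time into tracklets, together with a toy spatial/temporal contrastive term in the spirit of the multilevel tracklet constraint.

```python
# Minimal sketch (illustrative only): text-constrained queries attend over
# per-frame video features to produce region responses, which are stacked
# across time into action tracklets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionQueryAttention(nn.Module):
    """Cross-attention: K queries -> per-frame region responses."""
    def __init__(self, dim=512, num_queries=8, num_heads=8):
        super().__init__()
        # In the paper the queries are constrained by VLM text embeddings of the
        # action-label descriptions; here they are plain learned parameters.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, N, D) -- N spatial tokens per frame, D channels.
        B, T, N, D = frame_feats.shape
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)   # (B*T, K, D)
        kv = frame_feats.reshape(B * T, N, D)                  # (B*T, N, D)
        responses, _ = self.attn(q, kv, kv)                    # (B*T, K, D)
        # Each query index, stacked over time, forms one action tracklet.
        return responses.reshape(B, T, -1, D)                  # (B, T, K, D)

def tracklet_contrastive_loss(tracklets):
    """Toy stand-in for the multilevel tracklet contrastive constraint."""
    B, T, K, D = tracklets.shape
    z = F.normalize(tracklets, dim=-1)
    # Spatial level: push different region responses apart within each frame
    # (mean off-diagonal cosine similarity).
    sim = torch.einsum('btkd,btjd->btkj', z, z)                # (B, T, K, K)
    eye = torch.eye(K, dtype=torch.bool, device=z.device)
    spatial = sim.masked_fill(eye, 0.0).sum() / (B * T * K * (K - 1))
    # Temporal level: pull the same region response together across adjacent frames.
    temporal = -(z[:, :-1] * z[:, 1:]).sum(-1).mean()
    return spatial + temporal

# Usage with random features standing in for a video backbone's output:
feats = torch.randn(2, 16, 49, 512)        # batch 2, 16 frames, 7x7 tokens
model = RegionQueryAttention()
tracklets = model(feats)                   # (2, 16, 8, 512)
loss = tracklet_contrastive_loss(tracklets)
```

In this reading, each query plays the role of a region-specific semantic probe, and linking its per-frame responses over time yields the tracklet whose dynamics the contrastive constraint regularizes.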

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
References: 80
Citation Normalized Percentile: 0.38

Topics

Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Anomaly Detection Techniques and Applications (Physical Sciences → Computer Science → Artificial Intelligence)
Video Surveillance and Tracking Methods (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)