JOURNAL ARTICLE

Toward an Effective Action-Region Tracking Framework for Fine-Grained Video Action Recognition

Baoli Sun, Yihan Wang, Xinzhu Ma, Zhihui Wang, Kun Lu, Zhiyong Wang

Year: 2025 | Journal: IEEE Transactions on Neural Networks and Learning Systems | Vol: 37 (1) | Pages: 176-190 | Publisher: Institute of Electrical and Electronics Engineers

Abstract

Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions as they evolve over time. In this work, we introduce the action-region tracking (ART) framework, a novel solution that leverages a query-response mechanism to discover and track the dynamics of distinctive local details, enabling similar actions to be distinguished effectively. Specifically, we propose a region-specific semantic activation module that employs discriminative, text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction across the spatial and temporal dimensions of the corresponding video features. The captured region responses are then organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries are designed to expressly encode nuanced semantic representations derived from the textual descriptions of action labels, as extracted by the language branches of vision-language models (VLMs). To optimize the generated action tracklets, we design a multilevel tracklet contrastive constraint over multiple region responses at the spatial and temporal levels, which effectively distinguishes individual region responses within each video frame (spatial level) and establishes correlations between similar region responses in adjacent video frames (temporal level). In addition, we implement a task-specific fine-tuning mechanism to refine textual semantics during training, ensuring that the semantic representations encoded by VLMs are not only preserved but also optimized for the preferences of the specific task. Comprehensive experiments on several widely used action recognition benchmarks, i.e., FineGym, Diving48, NTU RGB+D, Kinetics, and Something-Something, clearly demonstrate its superiority over previous state-of-the-art baselines.
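To make the query-response mechanism concrete, the sketch below is a minimal illustration, not the authors' implementation: the module names, dimensions, learned queries, and simplified losses are assumptions made for illustration only. It shows text-constrained queries cross-attending over per-frame spatial tokens to produce region responses, which are stacked over time into tracklets, together with a toy spatial/temporal contrastive term in the spirit of the multilevel tracklet constraint.

```python
# Minimal sketch (illustrative only): text-constrained queries attend over
# per-frame video features to produce region responses, which are stacked
# across time into action tracklets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionQueryAttention(nn.Module):
    """Cross-attention: K queries -> per-frame region responses."""
    def __init__(self, dim=512, num_queries=8, num_heads=8):
        super().__init__()
        # In the paper the queries are constrained by VLM text embeddings of the
        # action-label descriptions; here they are plain learned parameters.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, N, D) -- N spatial tokens per frame, D channels.
        B, T, N, D = frame_feats.shape
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)   # (B*T, K, D)
        kv = frame_feats.reshape(B * T, N, D)                  # (B*T, N, D)
        responses, _ = self.attn(q, kv, kv)                    # (B*T, K, D)
        # Each query index, stacked over time, forms one action tracklet.
        return responses.reshape(B, T, -1, D)                  # (B, T, K, D)

def tracklet_contrastive_loss(tracklets):
    """Toy stand-in for the multilevel tracklet contrastive constraint."""
    B, T, K, D = tracklets.shape
    z = F.normalize(tracklets, dim=-1)
    # Spatial level: push different region responses apart within each frame
    # (mean off-diagonal cosine similarity).
    sim = torch.einsum('btkd,btjd->btkj', z, z)                # (B, T, K, K)
    eye = torch.eye(K, dtype=torch.bool, device=z.device)
    spatial = sim.masked_fill(eye, 0.0).sum() / (B * T * K * (K - 1))
    # Temporal level: pull the same region response together across adjacent frames.
    temporal = -(z[:, :-1] * z[:, 1:]).sum(-1).mean()
    return spatial + temporal

# Usage with random features standing in for a video backbone's output:
feats = torch.randn(2, 16, 49, 512)        # batch 2, 16 frames, 7x7 tokens
model = RegionQueryAttention()
tracklets = model(feats)                   # (2, 16, 8, 512)
loss = tracklet_contrastive_loss(tracklets)
```

In this reading, each query plays the role of a region-specific semantic probe, and linking its per-frame responses over time yields the tracklet whose dynamics the contrastive constraint regularizes.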

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
References: 80
Citation Normalized Percentile: 0.38

Topics

Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Anomaly Detection Techniques and Applications (Physical Sciences → Computer Science → Artificial Intelligence)
Video Surveillance and Tracking Methods (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)