JOURNAL ARTICLE

Exploring Coarse-to-Fine Action Token Localization and Interaction for Fine-grained Video Action Recognition

Abstract

Vision transformers have achieved impressive performance in video action recognition owing to their strong capability for modeling long-range dependencies among spatio-temporal tokens. However, for fine-grained actions, the subtle, discriminative differences lie mainly in the actor regions, so directly applying vision transformers without removing irrelevant tokens compromises recognition performance and incurs high computational cost. In this paper, we propose a coarse-to-fine action token localization and interaction network, namely C2F-ALIN, that dynamically localizes the most informative tokens at a coarse granularity and then partitions the localized tokens at a fine granularity for sufficient fine-grained spatio-temporal interaction. Specifically, in the coarse stage, we devise a discriminative token localization module that identifies informative tokens and discards irrelevant ones; because each localized token corresponds to a large spatial region, the continuity of action regions is effectively preserved. In the fine stage, we further partition only the tokens localized in the coarse stage into a finer granularity and then characterize fine-grained token interactions in two ways: (1) using vanilla transformers to learn compact dependencies among all discriminative tokens; and (2) proposing a global contextual interaction module that enables each fine-grained token to communicate with all the spatio-temporal tokens and to embed the global context. As a result, our coarse-to-fine strategy identifies more relevant tokens and integrates global context for high recognition accuracy while maintaining high efficiency. Comprehensive experimental results on four widely used action recognition benchmarks, namely FineGym, Diving48, Kinetics and Something-Something, demonstrate the advantages of our proposed method over other state-of-the-art methods.
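The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the C2F-ALIN implementation: the abstract does not specify the scoring rule, patch sizes, or attention design, so everything named here is an assumption made for illustration. In particular, `select_coarse` scores tokens by their distance from the frame's mean token as a stand-in for the discriminative token localization module, and `cross_attention` is a single-head attention without learned projections standing in for the global contextual interaction module, with all fine-granularity patches of the video used as the global context.

```python
import numpy as np

def patchify(video, p):
    """Split a (T, H, W, C) video into non-overlapping p x p patches.

    Returns an array of shape (T, H // p, W // p, p * p * C).
    """
    T, H, W, C = video.shape
    x = video.reshape(T, H // p, p, W // p, p, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(T, H // p, W // p, p * p * C)

def select_coarse(coarse, keep):
    """Return indices of the `keep` highest-scoring coarse tokens per frame.

    ASSUMPTION: the score (distance from the frame's mean token) is a
    placeholder, not the paper's localization module.
    """
    T, gh, gw, D = coarse.shape
    flat = coarse.reshape(T, gh * gw, D)
    score = np.linalg.norm(flat - flat.mean(axis=1, keepdims=True), axis=-1)
    idx = np.argsort(-score, axis=1)[:, :keep]
    return np.sort(idx, axis=1)  # shape (T, keep), in raster order

def fine_tokens(video, idx, coarse_p, fine_p):
    """Re-partition only the selected coarse regions into fine patches."""
    T, H, W, C = video.shape
    grid_w = W // coarse_p
    out = []
    for t in range(T):
        for i in idx[t]:
            r, c = divmod(int(i), grid_w)
            region = video[t,
                           r * coarse_p:(r + 1) * coarse_p,
                           c * coarse_p:(c + 1) * coarse_p]
            out.append(patchify(region[None], fine_p)
                       .reshape(-1, fine_p * fine_p * C))
    return np.concatenate(out, axis=0)

def cross_attention(q, kv):
    """Single-head scaled dot-product attention (no learned projections).

    ASSUMPTION: a stand-in for global contextual interaction; each query
    token attends to every token in `kv`.
    """
    logits = q @ kv.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

rng = np.random.default_rng(0)
video = rng.standard_normal((4, 64, 64, 3))   # toy clip: 4 frames of 64x64 RGB

coarse = patchify(video, 32)                  # 2x2 coarse tokens per frame
idx = select_coarse(coarse, keep=1)           # keep 1 coarse token per frame
fine = fine_tokens(video, idx, 32, 16)        # 4 fine tokens per kept region
all_fine = patchify(video, 16).reshape(-1, 16 * 16 * 3)
ctx = cross_attention(fine, all_fine)         # fine tokens gather global context
```

In this toy setting only a quarter of each frame is re-partitioned at the fine granularity, so the fine stage handles 16 tokens rather than the 64 a uniform fine grid would produce, which is the kind of saving the coarse stage is meant to provide.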

Keywords:
Security token; Computer science; Discriminative model; Granularity; Action recognition; Artificial intelligence; Transformer; Context (archaeology); Context model; Pattern recognition (psychology); Computer network; Voltage

Metrics

Cited By: 7
FWCI (Field Weighted Citation Impact): 1.27
Refs: 22
Citation Normalized Percentile: 0.77

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Gait Recognition and Analysis
Physical Sciences →  Engineering →  Biomedical Engineering
Anomaly Detection Techniques and Applications
Physical Sciences →  Computer Science →  Artificial Intelligence
