JOURNAL ARTICLE

Spatio-Temporal Convolution-Attention Video Network

Abstract

We present the Spatio-Temporal Convolution-Attention Video Network (STCA), a hierarchical neural network that combines convolutional and attention modeling for short- and long-range video reasoning. The method learns appearance and temporal cues in two stages with different temporal depths, so that both short-range and long-range video sequences are used effectively. It draws on the strengths of convolutional and attention networks in exploiting spatial and temporal cues for spatio-temporal sequence modeling. The architecture is a novel mixer that retains robust properties of convolution, such as translational equivariance, while offering the generalization and sequential modeling ability of transformers for handling dynamic variations in videos. The network exploits spatio-temporal information in two stages: (1) a Short Clip Stage (SCS) and (2) a Long Video Stage (LVS). SCS handles spatio-temporal cues in short-range video clips, operating on video frames with 3D convolutions and multi-headed self-attention. Because SCS operates on video frames, the quadratic complexity of the self-attention operation is reduced. LVS addresses the difficulty of modeling long-range temporal self-attention: it performs long-range temporal reasoning over the representations (i.e., tokens) obtained from SCS. LVS comprises variants of long-range temporal modeling mechanisms that learn compact and robust global temporal representations of the entire video. We conduct experiments on six challenging video recognition datasets: HVU, Kinetics (400, 600, 700), Something-Something V2, and the Long Video Understanding dataset. Extensive evaluations and ablation studies show strong performance relative to state-of-the-art methods on these datasets.
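The complexity argument in the abstract can be illustrated with a rough count of pairwise attention interactions. The sketch below is not the authors' implementation; it assumes (hypothetically) that the short-clip stage attends only within fixed-length clips and hands a small number of summary tokens per clip to the long-video stage. The function names, the clip length, and the token counts are all illustrative choices, not values from the paper.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Pairwise interactions if one self-attention layer sees every token of the video."""
    n = num_frames * tokens_per_frame
    return n * n


def two_stage_attention_pairs(num_frames: int, tokens_per_frame: int,
                              clip_len: int, tokens_per_clip: int = 1) -> int:
    """Pairwise interactions for clip-level attention followed by video-level attention.

    Assumption (not from the paper): the short-clip stage attends only within
    a clip and passes `tokens_per_clip` summary tokens per clip onward.
    """
    if num_frames % clip_len != 0:
        raise ValueError("num_frames must be divisible by clip_len in this sketch")
    num_clips = num_frames // clip_len
    # Short-clip stage: each clip attends over its own tokens only.
    short_stage = num_clips * (clip_len * tokens_per_frame) ** 2
    # Long-video stage: attention over the per-clip summary tokens.
    long_stage = (num_clips * tokens_per_clip) ** 2
    return short_stage + long_stage


if __name__ == "__main__":
    # Hypothetical sizes: 64 frames, 196 tokens per frame, clips of 8 frames.
    full = full_attention_pairs(64, 196)
    staged = two_stage_attention_pairs(64, 196, clip_len=8)
    print(f"full: {full}, two-stage: {staged}, ratio: {full / staged:.2f}")
```

Under these assumptions the short-clip term dominates, so the saving is roughly a factor equal to the number of clips; the long-video term stays small because it sees only a handful of tokens per clip.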

Keywords:
Computer science, Artificial intelligence, Convolutional neural network, Convolution (computer science), Exploit, Pattern recognition (psychology), Computer vision, Artificial neural network

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 0.55
Refs: 84
Citation Normalized Percentile: 0.63

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Vision and Imaging
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Hybrid video coding scheme based on VVC and spatio-temporal attention convolution neural network

Gang He, Kepeng Xu, Chang Wu, Zijia Ma, Xing Wen, Ming Sun

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) Year: 2022 Pages: 1790-1793
BOOK-CHAPTER

Spatio-Temporal Deformable Attention Network for Video Deblurring

Huicong Zhang, Haozhe Xie, Hongxun Yao

Lecture notes in computer science Year: 2022 Pages: 581-596
JOURNAL ARTICLE

Spatio-Temporal Graph Attention Convolution Network for Traffic Flow Forecasting

Kun Liu, Yifan Zhu, Xiao Wang, Hongya Ji, Chengfei Huang

Journal: Transportation Research Record: Journal of the Transportation Research Board Year: 2024 Vol: 2678 (9) Pages: 136-149
JOURNAL ARTICLE

Spatio-Temporal Attention Graph Convolution Network for Functional Connectome Classification

Wenhan Wang, Youyong Kong, Zhenghua Hou, Chunfeng Yang, Yonggui Yuan

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Pages: 1486-1490