JOURNAL ARTICLE

Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Jiahao Tang, Jianguo Hu, Wenjun Huang, Shengzhi Shen, Jiakai Pan, De-Ming Wang, Yanyu Ding

Year: 2024; Journal: IEEE Access; Vol: 12; Pages: 131664-131680; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Currently, video question answering (VideoQA) algorithms that rely on video-text pretraining models employ intricate unimodal encoders and multimodal fusion Transformers, which often reduce efficiency in tasks such as visual reasoning. Conversely, VideoQA algorithms based on graph neural networks often perform suboptimally in video description and reasoning because of their simplistic graph construction and cross-modal interaction designs, and they need additional pretraining data to close the gap. In this work, we introduce the Spatio-temporal Graph Convolution Transformer (STCT) model for VideoQA. By leveraging Spatio-temporal Graph Convolution (STGC) and dynamic graph Transformers, our model explicitly captures the spatio-temporal relationships among visual objects, thereby facilitating dynamic interactions and enhancing visual reasoning capabilities. Moreover, our model introduces a novel cross-modal interaction approach that uses dynamic graph attention mechanisms to adjust the attention weights of visual objects based on the posed question, thereby augmenting multimodal cooperative perception. Through carefully designed graph structures and cross-modal interaction mechanisms, our model addresses the limitation that graph-based algorithms depend on pretraining for performance gains, and it achieves superior performance in visual description and reasoning tasks with simpler unimodal encoders and multimodal fusion modules. Comprehensive analyses and comparisons of the model’s performance across multiple datasets, including NExT-QA, MSVD-QA, and MSRVTT-QA, confirm its robust capabilities in video reasoning and description.

Keywords:
Computer science, Question answering, Transformer, Overlap–add method, Convolution (computer science), Graph, Artificial intelligence, Theoretical computer science, Mathematics, Voltage, Electrical engineering, Fourier transform

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 1.59
Refs: 43
Citation Normalized Percentile: 0.75

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
