JOURNAL ARTICLE

Progressive Graph Attention Network for Video Question Answering

Abstract

Video question answering (Video-QA) is the task of answering a natural language question about the content of a video. Existing methods generally explore a single type of interaction, either between objects or between frames, which is insufficient for the sophisticated scenes in videos. To tackle this problem, we propose a novel model, termed Progressive Graph Attention Network (PGAT), which jointly explores multiple visual relations at the object level, frame level and clip level. Specifically, for object-level relation encoding, we design two complementary graphs: one learns the spatial and semantic relations between objects in the same frame, and the other models the temporal relations of the same object across different frames. The frame-level graph explores the interactions between diverse frames to record fine-grained appearance changes, while the clip-level graph models the temporal and semantic relations between the actions in different clips. These graphs at different levels are concatenated in a progressive manner to learn visual relations from low level to high level. Furthermore, we identify for the first time that TGIF-QA, a very large Video-QA dataset, has serious answer biases, and we reconstruct a new dataset based on it, called TGIF-QA-R, to overcome these biases. We evaluate the proposed model on three benchmark datasets and the new TGIF-QA-R, and the experimental results demonstrate that our model significantly outperforms other state-of-the-art models. Our code and dataset are available at https://github.com/PengLiang-cn/PGAT.
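To make the object-to-frame-to-clip progression more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes plain dot-product attention without learned weights, fully connected graphs, mean pooling between levels, and that object slot k refers to the same object in every frame. The function names (graph_attention, progressive_encode) and the frames_per_clip parameter are hypothetical, and the sketch chains the levels rather than reproducing how PGAT combines them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_attention(nodes, adjacency):
    """One dot-product attention step over a graph (assumed form).

    nodes:     (n, d) node features.
    adjacency: (n, n) boolean mask, True where an edge exists.
               Every row needs at least one True entry (e.g. a self-loop).
    """
    d = nodes.shape[-1]
    scores = nodes @ nodes.T / np.sqrt(d)
    scores = np.where(adjacency, scores, -np.inf)  # mask out non-edges
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ nodes


def progressive_encode(objects, frames_per_clip):
    """Chain object-level, frame-level and clip-level graphs.

    objects: (T, K, d) features for K objects in each of T frames.
    Returns (frame_nodes, clip_nodes).
    """
    T, K, d = objects.shape
    full_k = np.ones((K, K), dtype=bool)
    full_t = np.ones((T, T), dtype=bool)

    # Object level, graph 1: relations between objects in the same frame.
    spatial = np.stack([graph_attention(objects[t], full_k) for t in range(T)])

    # Object level, graph 2: the same object slot across different frames.
    temporal = np.stack(
        [graph_attention(spatial[:, k], full_t) for k in range(K)], axis=1
    )

    # Frame level: pool objects into one node per frame (assumption: mean).
    frame_nodes = graph_attention(temporal.mean(axis=1), full_t)

    # Clip level: pool consecutive frames into clip nodes (assumption: mean).
    C = T // frames_per_clip
    clip_nodes = frame_nodes[: C * frames_per_clip]
    clip_nodes = clip_nodes.reshape(C, frames_per_clip, d).mean(axis=1)
    clip_nodes = graph_attention(clip_nodes, np.ones((C, C), dtype=bool))
    return frame_nodes, clip_nodes


rng = np.random.default_rng(0)
frame_nodes, clip_nodes = progressive_encode(
    rng.normal(size=(8, 5, 16)), frames_per_clip=4
)
```

With 8 frames, 5 objects per frame and 16-dimensional features, the sketch yields one node per frame and one node per 4-frame clip. A trained model would add learned projections and question conditioning, which are omitted here.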

Keywords:
Computer science; Question answering; Scene graph; Graph; Benchmark (surveying); Artificial intelligence; Frame (networking); Spatial relation; Object (grammar); Information retrieval; Natural language processing; Machine learning; Theoretical computer science

Metrics

Cited By: 45
FWCI (Field Weighted Citation Impact): 4.09
Refs: 38
Citation Normalized Percentile: 0.95

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence