JOURNAL ARTICLE

Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Yi Cheng, Hehe Fan, Dongyun Lin, Ying Sun, Mohan Kankanhalli, Joo-Hwee Lim

Year: 2023; Journal: IEEE Transactions on Multimedia; Vol: 26; Pages: 6131-6141; Publisher: Institute of Electrical and Electronics Engineers

Abstract

The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle the spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the interference between spatial and temporal relation reasoning. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of our KRST over multiple state-of-the-art methods.
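The abstract names three ingredients: keyword-weighted question encoding, relative relation modeling between nodes, and a split into an object-level spatial graph followed by a frame-level temporal graph. The pure-Python sketch below shows one way these pieces could fit together under our own simplifying assumptions; it is not the authors' KRST implementation. The learned attention vector `w_att`, the distance-penalty form of the relative term, the `spatial_decay` and `temporal_decay` parameters, elementwise question gating, and mean-pooling of objects into frame nodes are all hypothetical choices made for illustration. Word and object features are assumed to share one dimension.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def keyword_aware_question(word_feats, w_att):
    """Attention pooling over words; w_att is a hypothetical learned vector.
    Words scoring high against w_att (keywords) dominate the question vector."""
    alphas = softmax([dot(w_att, h) for h in word_feats])
    return alphas, weighted_sum(alphas, word_feats)

def relative_attention(feats, positions, query, decay):
    """One round of question-guided message passing. Each edge score mixes
    feature affinity with a penalty on the relative offset between the two
    nodes (hypothetical form); assumes feats and query share one dimension."""
    out = []
    for i, xi in enumerate(feats):
        guided = [q * x for q, x in zip(query, xi)]  # condition node i on the question
        scores = [dot(guided, xj) - decay * math.dist(positions[i], positions[j])
                  for j, xj in enumerate(feats)]
        out.append(weighted_sum(softmax(scores), feats))
    return out

def mean_pool(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def krst_style_forward(video, word_feats, w_att, spatial_decay=1.0, temporal_decay=0.5):
    """video: list of frames; each frame is a list of (feature, (cx, cy)) objects."""
    _, q = keyword_aware_question(word_feats, w_att)
    frame_nodes = []
    for objects in video:
        feats = [f for f, _ in objects]
        centres = [c for _, c in objects]
        # Object-level spatial graph: edges only between objects of the same frame.
        updated = relative_attention(feats, centres, q, spatial_decay)
        frame_nodes.append(mean_pool(updated))
    # Frame-level temporal graph: edges only between frame nodes, offset = index gap.
    times = [(t,) for t in range(len(frame_nodes))]
    temporal = relative_attention(frame_nodes, times, q, temporal_decay)
    return mean_pool(temporal)  # one video-level vector for an answer head
```

Keeping the spatial step strictly inside each frame and the temporal step strictly across pooled frame nodes is what "disentangled" means in this sketch: neither graph ever mixes an object in one frame with an object in another.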

Keywords:
Computer science, Graph, Spatial relation, Question answering, Temporal database, Relation (database), Artificial intelligence, Data mining, Theoretical computer science

Metrics

Cited by: 9
FWCI (Field Weighted Citation Impact): 1.64
References: 55
Citation Normalized Percentile: 0.82

Topics

Multimodal Machine Learning Applications
Video Analysis and Summarization
Advanced Image and Video Retrieval Techniques
Each topic is classified under Physical Sciences → Computer Science → Computer Vision and Pattern Recognition.
