JOURNAL ARTICLE

Harnessing Representative Spatial-Temporal Information for Video Question Answering

Yuanyuan Wang, Meng Liu, Xuemeng Song, Liqiang Nie

Year: 2024   Journal: ACM Transactions on Multimedia Computing, Communications, and Applications   Vol: 20 (10)   Pages: 1-20   Publisher: Association for Computing Machinery

Abstract

Video question answering, which aims to answer a natural language question about a given video, has become a prevalent research topic in the past few years. Although remarkable improvements have been achieved, existing methods still suffer from insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances video understanding by summarizing only the representative visual information. To explore representative object information, we advance an adaptive attention mechanism based on uncertainty estimation. Meanwhile, to capture representative frame-level and clip-level visual information, we iteratively construct a much more compact set of representations in an expectation-maximization manner to suppress noisy information. Both quantitative and qualitative results on the NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
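The abstract gives no implementation details, so the following is only a minimal sketch of what building a compact set of representations in an expectation-maximization manner could look like. It is a generic soft-clustering loop, not the authors' network. The function name `em_compact_bases`, the dot-product similarity, the `temperature` parameter, and the evenly spaced initialization are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def em_compact_bases(features, num_bases, num_iters=3, temperature=1.0):
    """Summarize many feature vectors into a few bases (hypothetical helper).

    features: list of frame-level or clip-level vectors (lists of floats).
    Returns (bases, responsibilities), where responsibilities[n][k] is the
    soft assignment of feature n to base k from the final E-step.
    """
    n, dim = len(features), len(features[0])
    # Assumption: initialize bases from evenly spaced input features.
    bases = [list(features[k * n // num_bases]) for k in range(num_bases)]
    resp = []
    for _ in range(num_iters):
        # E-step: soft-assign every feature to the current bases.
        resp = [
            softmax([dot(x, b) / temperature for b in bases])
            for x in features
        ]
        # M-step: each base becomes the responsibility-weighted mean.
        for k in range(num_bases):
            weight = sum(r[k] for r in resp)
            bases[k] = [
                sum(r[k] * x[d] for r, x in zip(resp, features)) / weight
                for d in range(dim)
            ]
    return bases, resp


# Toy usage: four 2-D "frame features" summarized by two bases.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bases, resp = em_compact_bases(frames, num_bases=2)
```

In this sketch the bases act as the compact summary: features that match no base strongly contribute little to any weighted mean, which is one plausible reading of how noisy information could be down-weighted.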

Keywords: Question answering, Computer science, Spatial analysis, Information retrieval, Geography, Remote sensing

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.53
Refs: 39
Citation Normalized Percentile: 0.53

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
