JOURNAL ARTICLE

Explore Multi-Step Reasoning in Video Question Answering

Abstract

This invited talk revisits, in more detail, the paper accepted at ACM-MM 2018. Video question answering (VideoQA) always involves visual reasoning. When answering questions composed of multiple logical correlations, models need to perform multi-step reasoning. In this paper, we formulate multi-step reasoning in VideoQA as a new task: answering compositional, logically structured questions based on video content. Existing VideoQA datasets are inadequate as benchmarks for multi-step reasoning because of limitations such as a lack of logical structure and the presence of language biases. We therefore design a system to automatically generate a large-scale dataset, SVQA (Synthetic Video Question Answering). Compared with other VideoQA datasets, SVQA contains exclusively long, structured questions with various spatial and temporal relations between objects. More importantly, questions in SVQA can be decomposed into human-readable logical tree or chain layouts, each node of which represents a sub-task requiring a reasoning operation such as comparison or arithmetic. Towards automatic question answering on SVQA, we develop a new VideoQA model. In particular, we construct a new attention module that contains a spatial attention mechanism to address the crucial and multiple logical sub-tasks embedded in questions, as well as a refined GRU called ta-GRU (temporal-attention GRU) to capture long-term temporal dependencies and gather complete visual cues. Experimental results demonstrate the multi-step reasoning capability of SVQA and the effectiveness of our model compared with existing models.
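
To make the idea of a logical tree layout concrete, the following is a minimal sketch of how a compositional question could be broken into nodes, each performing one reasoning operation. Everything in it is an assumption made for illustration: the `Node` class, the operation names (`scene`, `filter`, `count`, `greater`), the attribute schema, and the toy scene are hypothetical and are not the SVQA generator or its actual layout format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """One sub-task in a hypothetical question layout (not the SVQA format)."""
    op: str                      # name of the reasoning operation
    arg: str = ""                # optional argument, e.g. "color=red"
    children: List["Node"] = field(default_factory=list)


def evaluate(node: Node, scene: List[Dict[str, str]]) -> Any:
    """Evaluate a layout bottom-up: children first, then this node's operation."""
    inputs = [evaluate(child, scene) for child in node.children]
    if node.op == "scene":       # leaf: all objects in the video
        return list(scene)
    if node.op == "filter":      # keep objects whose attribute matches
        key, value = node.arg.split("=")
        return [obj for obj in inputs[0] if obj[key] == value]
    if node.op == "count":       # arithmetic step
        return len(inputs[0])
    if node.op == "greater":     # comparison step joining two branches
        return inputs[0] > inputs[1]
    raise ValueError(f"unknown operation: {node.op}")


def count_of(attribute: str) -> Node:
    """Chain layout: scene -> filter(attribute) -> count."""
    return Node("count", children=[Node("filter", attribute, [Node("scene")])])


# Toy stand-in for the objects annotated in one synthetic video (assumed schema).
SCENE = [
    {"shape": "cube", "color": "red", "action": "moving"},
    {"shape": "sphere", "color": "red", "action": "still"},
    {"shape": "cube", "color": "blue", "action": "moving"},
]

# "Are there more red objects than blue objects?" as a two-branch tree.
question = Node("greater", children=[count_of("color=red"), count_of("color=blue")])
answer = evaluate(question, SCENE)
```

Each `count_of` branch is a chain of three steps, and the `greater` node at the root turns the two chains into a tree, so answering the question requires several dependent reasoning steps rather than a single lookup.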

Keywords:
Question answering; Computer science; Construct (python library); Task (project management); Spatial intelligence; Artificial intelligence; Dependency (UML); Reasoning system; Tree (set theory); Logical reasoning; Natural language processing; Programming language

Metrics

Cited By: 11
FWCI (Field Weighted Citation Impact): 0.58
Refs: 0
Citation Normalized Percentile: 0.68

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Efficient multi-step reasoning attention network for visual question answering

Haotian Zhang, Wei Biao Wu, Meng Zhang

Journal: Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021) Year: 2022 Pages: 38-38

JOURNAL ARTICLE

Differentiated Attention with Multi-modal Reasoning for Video Question Answering

Shentao Yao, Kun Li, Kun Xing, Kewei Wu, Zhao Xie, Dan Guo

Journal:   2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA) Year: 2022 Vol: 35 Pages: 525-530