Video Question Answering via Hierarchical Spatio-Temporal Attention Networks

Zhou Zhao; Qifan Yang; Deng Cai; Xiaofei He; Yueting Zhuang

doi:10.24963/ijcai.2017/492

Year: 2017 Pages: 3518-3524

Abstract

Open-ended video question answering is a challenging problem in visual information retrieval: given a question, a natural-language answer must be generated automatically from the referenced video content. Existing visual question answering work focuses only on static images and may transfer poorly to video question answering because of the temporal dynamics of video content. In this paper, we consider the problem of open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video content according to the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset. Extensive experiments show the effectiveness of our method.
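To make the idea of question-guided hierarchical attention concrete, the following is a minimal sketch and not the authors' model. It assumes fixed dot-product scoring instead of learned parameters, omits the recurrent encoder and decoder entirely, and uses hypothetical names (`attend`, `hierarchical_attention`) and toy feature vectors. It attends first over regions within each frame (spatial level), then over the resulting frame vectors (temporal level).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(features, query):
    """Dot-product attention of `query` over `features` (assumed scoring rule).

    Returns the attention weights and the weighted sum of the features.
    """
    weights = softmax([dot(f, query) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, context


def hierarchical_attention(video, question):
    """`video` is a list of frames; each frame is a list of region vectors.

    Spatial level: summarize each frame by attending over its regions.
    Temporal level: attend over the per-frame summaries.
    """
    frame_vectors = [attend(regions, question)[1] for regions in video]
    temporal_weights, video_vector = attend(frame_vectors, question)
    return temporal_weights, video_vector


# Toy usage: two frames, two regions each, 2-dimensional features.
toy_video = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0: one region aligned with the question
    [[0.0, 1.0], [0.0, 1.0]],  # frame 1: no region aligned with the question
]
toy_question = [1.0, 0.0]
toy_weights, toy_vector = hierarchical_attention(toy_video, toy_question)
```

In this toy input, the first frame contains a region aligned with the question vector, so it receives the larger temporal weight. In a trained model the scoring function and the question encoding would be learned rather than fixed.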

Keywords:

Question answering; Computer science; Construct (python library); Encoder; Artificial intelligence; Representation (politics); Focus (optics); Recurrent neural network; Natural language; Artificial neural network

Metrics

Cited By: 104

FWCI (Field Weighted Citation Impact): 7.63

Citation Normalized Percentile: 0.97

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Spatio-Temporal Context Networks for Video Question Answering

Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering

Video Question Answering with Spatio-Temporal Reasoning

Question Answering with Hierarchical Attention Networks

Hierarchical Relational Attention for Video Question Answering