Video question answering~(Video-QA) is the task of answering a natural language question about the content of a video. Existing methods generally explore only single-level interactions, between objects or between frames, which are insufficient for the sophisticated scenes in videos. To tackle this problem, we propose a novel model, termed Progressive Graph Attention Network (PGAT), which jointly explores visual relations at the object, frame and clip levels. Specifically, for object-level relation encoding, we design two complementary graphs: one learns the spatial and semantic relations between objects within the same frame, and the other models the temporal relations of the same object across frames. The frame-level graph explores interactions between frames to capture fine-grained appearance changes, while the clip-level graph models the temporal and semantic relations between actions across clips. These different-level graphs are cascaded in a progressive manner to learn visual relations from low level to high level. Furthermore, we are the first to identify serious answer biases in TGIF-QA, a very large Video-QA dataset, and we reconstruct a new dataset, called TGIF-QA-R, based on it to overcome these biases. We evaluate the proposed model on three benchmark datasets and the new TGIF-QA-R, and the experimental results demonstrate that our model significantly outperforms other state-of-the-art models. Our code and dataset are available at https://github.com/PengLiang-cn/PGAT.
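To make the progressive structure concrete, the sketch below shows one way the cascade could be wired: a generic graph attention layer applied at the object, frame and clip levels, with each lower level's pooled output conditioning the next. This is a minimal illustration, not the authors' implementation; the class names (`GraphAttentionLayer`, `ProgressiveGraphStack`), the single-head attention, and the mean-pooling hand-off between levels are all assumptions. The actual spatial, semantic and temporal graphs are in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over a set of node features.

    Hypothetical simplification of one relation graph in PGAT: each node
    attends to its neighbors (optionally masked by an adjacency matrix)
    and is updated with a residual connection.
    """
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes, adj=None):
        # nodes: (batch, num_nodes, dim); adj: optional (num_nodes, num_nodes) 0/1 mask
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        scores = torch.matmul(q, k.transpose(-2, -1)) / nodes.size(-1) ** 0.5
        if adj is not None:
            scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return nodes + torch.matmul(attn, v)  # residual node update

class ProgressiveGraphStack(nn.Module):
    """Chains object-, frame- and clip-level graphs so that each level's
    relation features feed the next level (the 'progressive' idea).
    The pooling-based hand-off between levels is an assumption."""
    def __init__(self, dim):
        super().__init__()
        self.object_graph = GraphAttentionLayer(dim)
        self.frame_graph = GraphAttentionLayer(dim)
        self.clip_graph = GraphAttentionLayer(dim)

    def forward(self, object_feats, frame_feats, clip_feats):
        # object_feats: (B, num_objects, dim); frame_feats: (B, num_frames, dim);
        # clip_feats: (B, num_clips, dim)
        obj = self.object_graph(object_feats)
        frame = self.frame_graph(frame_feats + obj.mean(dim=1, keepdim=True))
        clip = self.clip_graph(clip_feats + frame.mean(dim=1, keepdim=True))
        return clip

# Usage with random features of a shared hidden size
model = ProgressiveGraphStack(dim=256)
out = model(torch.randn(2, 10, 256), torch.randn(2, 16, 256), torch.randn(2, 4, 256))
print(out.shape)  # torch.Size([2, 4, 256])
```

The design choice illustrated here is the low-to-high cascade: object relations are encoded first, then summarized into the frame-level graph, and finally into the clip-level graph, mirroring the progressive connection described in the abstract.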