Video question answering~(Video-QA) is the task of answering a natural language question about the content of a video. Existing methods generally explore only single-level interactions, between objects or between frames, which are insufficient for the sophisticated scenes in videos. To tackle this problem, we propose a novel model, termed Progressive Graph Attention Network (PGAT), which jointly explores visual relations at the object, frame and clip levels. Specifically, for object-level relation encoding, we design two complementary graphs: one learns the spatial and semantic relations between objects within the same frame, and the other models the temporal relations of the same object across frames. The frame-level graph explores interactions between frames to capture fine-grained appearance changes, while the clip-level graph models the temporal and semantic relations between actions across clips. These different-level graphs are cascaded in a progressive manner to learn visual relations from low level to high level. Furthermore, we are the first to identify serious answer biases in TGIF-QA, a very large Video-QA dataset, and we reconstruct a new dataset, called TGIF-QA-R, based on it to overcome these biases. We evaluate the proposed model on three benchmark datasets and the new TGIF-QA-R, and the experimental results demonstrate that our model significantly outperforms other state-of-the-art models. Our code and dataset are available at https://github.com/PengLiang-cn/PGAT.
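To make the progressive structure concrete, the sketch below shows one way the cascade could be wired: a generic graph attention layer applied at the object, frame and clip levels, with each lower level's pooled output conditioning the next. This is a minimal illustration, not the authors' implementation; the class names (`GraphAttentionLayer`, `ProgressiveGraphStack`), the single-head attention, and the mean-pooling hand-off between levels are all assumptions. The actual spatial, semantic and temporal graphs are in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over a set of node features.

    Hypothetical simplification of one relation graph in PGAT: each node
    attends to its neighbors (optionally masked by an adjacency matrix)
    and is updated with a residual connection.
    """
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes, adj=None):
        # nodes: (batch, num_nodes, dim); adj: optional (num_nodes, num_nodes) 0/1 mask
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        scores = torch.matmul(q, k.transpose(-2, -1)) / nodes.size(-1) ** 0.5
        if adj is not None:
            scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return nodes + torch.matmul(attn, v)  # residual node update

class ProgressiveGraphStack(nn.Module):
    """Chains object-, frame- and clip-level graphs so that each level's
    relation features feed the next level (the 'progressive' idea).
    The pooling-based hand-off between levels is an assumption."""
    def __init__(self, dim):
        super().__init__()
        self.object_graph = GraphAttentionLayer(dim)
        self.frame_graph = GraphAttentionLayer(dim)
        self.clip_graph = GraphAttentionLayer(dim)

    def forward(self, object_feats, frame_feats, clip_feats):
        # object_feats: (B, num_objects, dim); frame_feats: (B, num_frames, dim);
        # clip_feats: (B, num_clips, dim)
        obj = self.object_graph(object_feats)
        frame = self.frame_graph(frame_feats + obj.mean(dim=1, keepdim=True))
        clip = self.clip_graph(clip_feats + frame.mean(dim=1, keepdim=True))
        return clip

# Usage with random features of a shared hidden size
model = ProgressiveGraphStack(dim=256)
out = model(torch.randn(2, 10, 256), torch.randn(2, 16, 256), torch.randn(2, 4, 256))
print(out.shape)  # torch.Size([2, 4, 256])
```

The design choice illustrated here is the low-to-high cascade: object relations are encoded first, then summarized into the frame-level graph, and finally into the clip-level graph, mirroring the progressive connection described in the abstract.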