Contrastive Video Question Answering via Video Graph Transformer

Junbin Xiao; Pan Zhou; Angela Yao; Yicong Li; Richang Hong; Shuicheng Yan; Tat‐Seng Chua

doi:10.1109/tpami.2023.3292266


JOURNAL ARTICLE


Year: 2023
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Vol: 45 (11)
Pages: 13265-13280
Publisher: IEEE Computer Society


Abstract

We propose to perform video question answering (VideoQA) in a contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification; fine-grained video-text communication is handled by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, and between the relevant and irrelevant questions, respectively. With superior video encoding and QA solution, we show that CoVGT achieves much better performance than prior art on video reasoning tasks. Its performance even surpasses that of models pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining.
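The contrastive answer selection described in point 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `contrastive_qa_loss`, the embedding size, the use of cosine similarity, and the random vectors standing in for encoder outputs are all assumptions made for illustration. The idea shown is only that the candidate answers are scored by similarity to a video-question representation, and a softmax cross-entropy over those scores pulls the correct answer closer than the incorrect ones.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_qa_loss(video_query, answer_embs, correct_idx):
    """Hypothetical helper: cross-entropy over video-to-answer similarities.

    video_query: (dim,) vector standing in for the video (plus question) encoding.
    answer_embs: (num_candidates, dim) matrix standing in for encoded answers.
    correct_idx: index of the correct candidate answer.
    """
    v = l2_normalize(video_query)
    a = l2_normalize(answer_embs)
    scores = a @ v                       # one cosine similarity per candidate
    shifted = scores - scores.max()      # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    loss = -log_probs[correct_idx]       # low when the correct answer scores highest
    prediction = int(np.argmax(scores))  # answer chosen at inference time
    return loss, prediction

# Toy data (assumed): random vectors in place of real encoder outputs.
rng = np.random.default_rng(0)
dim, num_candidates = 8, 5
video_query = rng.normal(size=dim)
answer_embs = rng.normal(size=(num_candidates, dim))
answer_embs[2] = video_query             # make candidate 2 the matching answer

loss, prediction = contrastive_qa_loss(video_query, answer_embs, correct_idx=2)
print(prediction, float(loss))
```

In this sketch, answering a question reduces to picking the candidate with the highest similarity, so no fixed answer vocabulary or classification layer is needed. The self-supervised objective over relevant and irrelevant questions mentioned in the abstract would follow the same pattern with question embeddings in place of answer embeddings.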

Keywords: Computer science, Transformer, Question answering, Artificial intelligence, Modal, Machine learning, Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 6.37

Refs: 116

Citation Normalized Percentile: 0.96

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence


Related Documents

Video Graph Transformer for Video Question Answering

Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Graph Prompts: Adapting Video Graph for Video Question Answering

Video-Context Aligned Transformer for Video Question Answering