Video Question Answering With Semantic Disentanglement and Reasoning

Jin Liu; Guoxiang Wang; Jialong Xie; Fengyu Zhou; Huijuan Xu

doi:10.1109/tcsvt.2023.3317447

JOURNAL ARTICLE

Year: 2023; Journal: IEEE Transactions on Circuits and Systems for Video Technology; Vol: 34 (5); Pages: 3663-3673; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Video question answering aims to provide correct answers given complex videos and related questions, placing high demands on comprehension in both video and language processing. Existing work frames this task as a multi-modal fusion process that aligns the video context with the whole question, ignoring the rich semantic details that nouns and verbs separately contribute to the multi-modal reasoning that derives the final answer. To fill this gap, in addition to the semantic alignment of the whole sentence, we propose to disentangle the semantic understanding of language and reason over the corresponding frame-level and motion-level video features. We design a unified multi-granularity language module with a residual structure that adapts semantic understanding across granularities (e.g., word level and sentence level) through context exchange. To enhance holistic question understanding for answer prediction, we also design a contrastive sampling approach that selects irrelevant questions as negative samples to break the intrinsic correlations between questions and answers within the dataset. Notably, our model handles both multiple-choice and open-ended video question answering. We further employ a pre-trained language model to retrieve relevant knowledge as candidate answer context to facilitate open-ended VideoQA. Extensive quantitative and qualitative experiments on four public datasets (NextQA, MSVD, MSRVTT, and TGIF-QA-R) demonstrate the effectiveness and superior performance of our proposed model. Our code will be released upon the paper's acceptance.
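The contrastive sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how "irrelevant" questions are chosen, so the sketch assumes that a question is irrelevant when it belongs to a different video. The function name `sample_negative_questions` and the record fields `video_id`, `question`, and `answer` are hypothetical.

```python
import random


def sample_negative_questions(samples, index, num_negatives, rng):
    """Pick questions from other videos as negatives for samples[index].

    Assumption (not from the paper): "irrelevant" means the question
    was asked about a different video than the anchor sample.
    """
    anchor = samples[index]
    # Candidates: questions attached to a different video, with different text.
    candidates = [
        s["question"]
        for s in samples
        if s["video_id"] != anchor["video_id"]
        and s["question"] != anchor["question"]
    ]
    # Deduplicate while preserving order so seeded sampling is reproducible.
    candidates = list(dict.fromkeys(candidates))
    k = min(num_negatives, len(candidates))
    return rng.sample(candidates, k)


# Toy records; the contents are made up for illustration only.
samples = [
    {"video_id": "v1", "question": "what is the man holding", "answer": "guitar"},
    {"video_id": "v2", "question": "why did the dog jump", "answer": "catch ball"},
    {"video_id": "v3", "question": "how many people are dancing", "answer": "two"},
    {"video_id": "v1", "question": "where is the man sitting", "answer": "sofa"},
]

rng = random.Random(0)
negatives = sample_negative_questions(samples, index=0, num_negatives=2, rng=rng)
print(negatives)
```

In a training loop, each negative question would be paired with the anchor's video and answer candidates so that the model is penalized for predicting the answer without attending to the actual question; how the paper scores these pairs is not stated in the abstract.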

Keywords:

Computer science; Natural language processing; Question answering; Sentence; Context (archaeology); Artificial intelligence; Process (computing); Information retrieval; Noun phrase; Comprehension; Noun

Metrics

FWCI (Field Weighted Citation Impact): 3.82

Citation Normalized Percentile: 0.92

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Collaborative Aware Bidirectional Semantic Reasoning for Video Question Answering

HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering

Multi-Semantic Alignment Co-Reasoning Network for Video Question Answering

Video Question Answering with Spatio-Temporal Reasoning

Temporally Multi-Modal Semantic Reasoning with Spatial Language Constraints for Video Question Answering