JOURNAL ARTICLE

Video Question Answering With Semantic Disentanglement and Reasoning

Jin Liu, Guoxiang Wang, Jialong Xie, Fengyu Zhou, Huijuan Xu

Year: 2023 Journal: IEEE Transactions on Circuits and Systems for Video Technology Vol: 34 (5) Pages: 3663-3673 Publisher: Institute of Electrical and Electronics Engineers

Abstract

Video question answering (VideoQA) aims to provide correct answers given complex videos and related questions, placing high demands on comprehension in both video and language processing. Existing works frame this task as a multi-modal fusion process that aligns the video context with the whole question, ignoring the rich semantic details that nouns and verbs separately contribute to the multi-modal reasoning that derives the final answer. To fill this gap, in addition to the semantic alignment of the whole sentence, we propose to disentangle the semantic understanding of language and reason over the corresponding frame-level and motion-level video features. We design a unified multi-granularity language module with a residual structure to adapt semantic understanding at different granularities (e.g., word level and sentence level) with context exchange. To enhance holistic question understanding for answer prediction, we also design a contrastive sampling approach that selects irrelevant questions as negative samples to break the intrinsic correlations between questions and answers within the dataset. Notably, our model handles both multiple-choice and open-ended video question answering. We further employ a pre-trained language model to retrieve relevant knowledge as candidate answer context to facilitate open-ended VideoQA. Extensive quantitative and qualitative experiments on four public datasets (NextQA, MSVD, MSRVTT, and TGIF-QA-R) demonstrate the effectiveness and superior performance of our proposed model. Our code will be released upon the paper's acceptance.
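The contrastive sampling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how "irrelevant" questions are chosen or which loss is used, so everything below is an assumption: a question is treated as irrelevant if it comes from a different video and has different text, and an InfoNCE-style loss (a common choice, not necessarily the authors') scores the true question against the sampled negatives. The names `QASample`, `sample_negative_questions`, and `contrastive_loss` are hypothetical, and the data is a made-up toy set.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QASample:
    video_id: str
    question: str
    answer: str


def sample_negative_questions(dataset, anchor, k, rng):
    """Pick up to k questions treated as irrelevant to the anchor.

    Assumption (not from the paper): a question is irrelevant if it
    belongs to a different video and its text differs from the anchor's.
    """
    candidates = sorted({
        s.question
        for s in dataset
        if s.video_id != anchor.video_id and s.question != anchor.question
    })
    return rng.sample(candidates, min(k, len(candidates)))


def contrastive_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive score."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy data, invented purely for illustration.
dataset = [
    QASample("v1", "what is the man holding?", "guitar"),
    QASample("v1", "where is the man sitting?", "sofa"),
    QASample("v2", "what does the dog do after jumping?", "run"),
    QASample("v3", "how many kids are playing?", "three"),
    QASample("v3", "what is the man holding?", "ball"),
]

anchor = dataset[0]
negatives = sample_negative_questions(dataset, anchor, k=2, rng=random.Random(0))

# Hypothetical video-question matching scores; a real model would compute these.
loss_good = contrastive_loss(0.9, [0.1, 0.2])  # true question scored highest
loss_bad = contrastive_loss(0.1, [0.9, 0.2])   # a negative scored highest
```

In this sketch the loss is small when the model scores the true question above the sampled negatives and large otherwise, which is one way a model could be discouraged from predicting answers without attending to the question.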

Keywords:
Computer science; Natural language processing; Question answering; Sentence; Context (archaeology); Artificial intelligence; Process (computing); Information retrieval; Noun phrase; Comprehension; Noun

Metrics

Cited By: 21
FWCI (Field Weighted Citation Impact): 3.82
Refs: 73
Citation Normalized Percentile: 0.92

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Collaborative Aware Bidirectional Semantic Reasoning for Video Question Answering

Xize Wu, Jiasong Wu, Lei Zhu, Lotfi Senhadji, Huazhong Shu

Journal: IEEE Transactions on Circuits and Systems for Video Technology Year: 2024 Vol: 35 (3) Pages: 2074-2086

JOURNAL ARTICLE

HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering

Fei Liu, Jing Liu, Ning Wang, Hanqing Lu

Journal: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) Year: 2021 Pages: 1678-1687

JOURNAL ARTICLE

Video Question Answering with Spatio-Temporal Reasoning

Yunseok Jang, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, Gunhee Kim

Journal: International Journal of Computer Vision Year: 2019 Vol: 127 (10) Pages: 1385-1412