JOURNAL ARTICLE

Parse, Align and Aggregate: Graph-driven Compositional Reasoning for Video Question Answering

Jiangtong Li, Zhaohe Liao, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Haohua Zhao, Li Niu, Guang Chen, Liqing Zhang, Changjun Jiang

Year: 2026 Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence Vol: PP Pages: 1-18 Publisher: IEEE Computer Society

Abstract

Video Question-Answering (VideoQA) enables machines to interpret and respond to complex video content, advancing human-computer interaction. However, existing multimodal large language models (MLLMs) often provide incomplete or opaque explanations, and existing benchmarks mainly focus on the correctness of final answers, limiting insight into their reasoning processes and hindering both transparency and verifiability. To address this gap, we propose the Question Parsing, Video Alignment and Answer Aggregation framework (QPVA3), which leverages a compositional graph to drive visual and logical reasoning in VideoQA. Specifically, QPVA3 consists of three core components, the planner, the executor, and the reasoner, which generate the compositional graph and conduct graph-driven reasoning. For the original question, the planner parses it into a compositional graph, capturing the underlying reasoning logic and structuring it as a series of interconnected questions. For each question in the compositional graph, the executor aligns the video by selecting relevant clips and generates answers, ensuring accurate, context-specific responses. For each question and its first-order descendants, the reasoner aggregates answers by integrating reasoning logic with visual evidence, resolving conflicts to produce a coherent and accurate response. Moreover, to assess the performance of existing MLLMs on the reasoning processes of VideoQA, we introduce novel compositional consistency metrics and construct a VideoQA benchmark (QPVA3Bench) with 3,492 question-video tuples, each annotated with detailed compositional graphs and fine-grained answers. We evaluate the QPVA3 framework on QPVA3Bench and five other VideoQA benchmarks. Experimental results demonstrate that our framework improves both consistency and accuracy compared to baselines, leading to a more transparent and verifiable VideoQA system.
This approach has the potential to advance the field, as supported by our comprehensive evaluation and benchmarking efforts. Code and dataset are available at https://github.com/QPVA3/QPVA3-PAMI.
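The abstract's three-stage pipeline (planner parses the question into a compositional graph, executor answers each sub-question against aligned video clips, reasoner aggregates each question with its first-order descendants) can be sketched as follows. This is a minimal illustrative sketch only: the node structure, the stand-in `clip_answers` lookup, and the conflict-resolution rule are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the QPVA3 planner/executor/reasoner loop.
# All names and the aggregation rule below are illustrative assumptions.

@dataclass
class QNode:
    question: str
    children: list = field(default_factory=list)  # first-order descendants
    answer: Optional[str] = None  # filled in by the executor

def plan(question: str) -> QNode:
    """Planner: parse the question into a compositional graph.
    A toy hard-coded decomposition stands in for the real parser."""
    root = QNode(question)
    root.children = [QNode("What object is shown?"),
                     QNode("What action occurs?")]
    return root

def execute(node: QNode, clip_answers: dict) -> None:
    """Executor: answer every question in the graph against the video.
    `clip_answers` stands in for a clip-grounded QA model."""
    for child in node.children:
        execute(child, clip_answers)
    node.answer = clip_answers.get(node.question, node.answer)

def reason(node: QNode) -> Optional[str]:
    """Reasoner: aggregate a node's answer with its first-order
    descendants' answers (toy conflict rule: keep the node's own
    answer unless it is missing, then fall back to a child's)."""
    child_answers = [reason(c) for c in node.children]
    if node.answer is None and child_answers:
        node.answer = child_answers[0]
    return node.answer

graph = plan("What is the person doing with the ball?")
execute(graph, {"What object is shown?": "a ball",
                "What action occurs?": "kicking",
                "What is the person doing with the ball?": "kicking it"})
print(reason(graph))  # aggregated final answer: "kicking it"
```

A compositional consistency metric in this spirit would then compare the aggregated root answer against the sub-answers along each path of the graph, rather than scoring the final answer alone.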

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.83

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Advanced Graph Neural Networks (Physical Sciences → Computer Science → Artificial Intelligence)

Related Documents

JOURNAL ARTICLE

Event Graph Guided Compositional Spatial–Temporal Reasoning for Video Question Answering

Ziyi Bai, Ruiping Wang, Difei Gao, Xilin Chen

Journal: IEEE Transactions on Image Processing Year: 2024 Vol: 33 Pages: 1109-1121
JOURNAL ARTICLE

Graph-based relational reasoning network for video question answering

Tao Tan, Guanglu Sun

Journal: Machine Vision and Applications Year: 2024 Vol: 36 (1)
JOURNAL ARTICLE

Reasoning with Heterogeneous Graph Alignment for Video Question Answering

Pin Jiang, Yahong Han

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2020 Vol: 34 (07) Pages: 11109-11116