JOURNAL ARTICLE

Parse, Align and Aggregate: Graph-driven Compositional Reasoning for Video Question Answering

Jiangtong Li, Zhaohe Liao, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Haohua Zhao, Li Niu, Guang Chen, Liqing Zhang, Changjun Jiang

Year: 2026 Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence Vol: PP Pages: 1-18 Publisher: IEEE Computer Society

Abstract

Video Question-Answering (VideoQA) enables machines to interpret and respond to complex video content, advancing human-computer interaction. However, existing multimodal large language models (MLLMs) often provide incomplete or opaque explanations, and existing benchmarks mainly focus on the correctness of final answers, limiting insight into their reasoning processes and hindering both transparency and verifiability. To address this gap, we propose the Question Parsing, Video Alignment and Answer Aggregation framework (QPVA3), which leverages a compositional graph to drive visual and logical reasoning in VideoQA. Specifically, QPVA3 consists of three core components, the planner, the executor, and the reasoner, which generate the compositional graph and conduct graph-driven reasoning. For the original question, the planner parses it into a compositional graph, capturing the underlying reasoning logic and structuring it as a series of interconnected questions. For each question in the compositional graph, the executor aligns the video by selecting relevant clips and generates answers, ensuring accurate, context-specific responses. For each question and its first-order descendants, the reasoner aggregates answers by integrating reasoning logic with visual evidence, resolving conflicts to produce a coherent and accurate response. Moreover, to assess the performance of existing MLLMs on the reasoning processes of VideoQA, we introduce novel compositional consistency metrics and construct a VideoQA benchmark (QPVA3Bench) with 3,492 question-video tuples, each annotated with detailed compositional graphs and fine-grained answers. We evaluate the QPVA3 framework on QPVA3Bench and five other VideoQA benchmarks. Experimental results demonstrate that our framework improves both consistency and accuracy compared to baselines, leading to a more transparent and verifiable VideoQA system.
This approach has the potential to advance the field, as supported by our comprehensive evaluation and benchmarking efforts. Code and dataset are available at https://github.com/QPVA3/QPVA3-PAMI.
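The abstract's three-stage pipeline (planner parses the question into a compositional graph, executor answers each sub-question against aligned video clips, reasoner aggregates each question with its first-order descendants) can be sketched as follows. This is a minimal illustrative sketch only: the node structure, the stand-in `clip_answers` lookup, and the conflict-resolution rule are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the QPVA3 planner/executor/reasoner loop.
# All names and the aggregation rule below are illustrative assumptions.

@dataclass
class QNode:
    question: str
    children: list = field(default_factory=list)  # first-order descendants
    answer: Optional[str] = None  # filled in by the executor

def plan(question: str) -> QNode:
    """Planner: parse the question into a compositional graph.
    A toy hard-coded decomposition stands in for the real parser."""
    root = QNode(question)
    root.children = [QNode("What object is shown?"),
                     QNode("What action occurs?")]
    return root

def execute(node: QNode, clip_answers: dict) -> None:
    """Executor: answer every question in the graph against the video.
    `clip_answers` stands in for a clip-grounded QA model."""
    for child in node.children:
        execute(child, clip_answers)
    node.answer = clip_answers.get(node.question, node.answer)

def reason(node: QNode) -> Optional[str]:
    """Reasoner: aggregate a node's answer with its first-order
    descendants' answers (toy conflict rule: keep the node's own
    answer unless it is missing, then fall back to a child's)."""
    child_answers = [reason(c) for c in node.children]
    if node.answer is None and child_answers:
        node.answer = child_answers[0]
    return node.answer

graph = plan("What is the person doing with the ball?")
execute(graph, {"What object is shown?": "a ball",
                "What action occurs?": "kicking",
                "What is the person doing with the ball?": "kicking it"})
print(reason(graph))  # aggregated final answer: "kicking it"
```

A compositional consistency metric in this spirit would then compare the aggregated root answer against the sub-answers along each path of the graph, rather than scoring the final answer alone.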

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.83

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Advanced Graph Neural Networks (Physical Sciences → Computer Science → Artificial Intelligence)

Related Documents

JOURNAL ARTICLE

Event Graph Guided Compositional Spatial–Temporal Reasoning for Video Question Answering

Ziyi Bai, Ruiping Wang, Difei Gao, Xilin Chen

Journal: IEEE Transactions on Image Processing Year: 2024 Vol: 33 Pages: 1109-1121
JOURNAL ARTICLE

Graph-based relational reasoning network for video question answering

Tao Tan, Guanglu Sun

Journal: Machine Vision and Applications Year: 2024 Vol: 36 (1)
JOURNAL ARTICLE

Reasoning with Heterogeneous Graph Alignment for Video Question Answering

Pin Jiang, Yahong Han

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2020 Vol: 34 (07) Pages: 11109-11116