Visual reasoning, as an advanced cognitive ability of models, has been widely studied. In visual question answering, symbolic reasoning based on task decomposition allows a model to reason over visual content following human logical patterns, and this approach has achieved impressive performance on various descriptive question types. A typical visual question answering task pairs a question with a static image. Compared with static visual reasoning over images, dynamic visual reasoning over video content poses greater challenges in logic, temporal comprehension, and causal structure, making it difficult for prior methods to capture the dynamic interrelationships among objects in dynamic scenes. In this study, we propose a task-guided dynamic visual reasoning method for visual question answering, which models the spatiotemporal states of objects in dynamic scenes, decomposes questions into task steps, and performs reasoning over the constructed spatiotemporal dynamic scene graph neural network. We evaluated our method on two benchmarks, CLEVRER and CATER; the results show that our model effectively extracts spatiotemporal features of objects in dynamic scenes, performs well on descriptive questions, and improves accuracy on explanatory and predictive questions compared with baseline models.
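The abstract's pipeline (model object states over time, decompose the question into task steps, execute the steps over a scene representation) can be illustrated with a minimal toy sketch. All class and function names here are hypothetical illustrations, not the paper's actual implementation; in particular, the paper uses a spatiotemporal scene graph neural network, whereas this sketch uses a plain symbolic scene structure for clarity.

```python
from dataclasses import dataclass

@dataclass
class ObjectState:
    """Hypothetical per-object record: identity plus positions over time."""
    obj_id: str
    color: str
    positions: dict  # frame index -> (x, y)

class Scene:
    """Toy stand-in for a spatiotemporal scene graph."""
    def __init__(self, objects):
        self.objects = list(objects)

    def filter_color(self, color):
        return [o for o in self.objects if o.color == color]

    def moved(self, obj, t0, t1):
        return obj.positions[t0] != obj.positions[t1]

def execute(program, scene, t0, t1):
    """Run a decomposed question (a list of task steps) over the scene.

    Each step consumes the previous step's output, mirroring
    task-decomposition-style symbolic reasoning.
    """
    result = None
    for op, arg in program:
        if op == "filter_color":
            result = scene.filter_color(arg)
        elif op == "count_moving":
            result = sum(scene.moved(o, t0, t1) for o in result)
    return result
```

For example, "How many red objects moved?" might decompose into the step sequence `[("filter_color", "red"), ("count_moving", None)]`, which first selects red objects and then counts those whose position changed between two frames.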
Xinyu Liu, Chenchen Jing, Mingliang Zhai, Yuwei Wu, Yunde Jia