JOURNAL ARTICLE

Attention Based Multi-Modal Fusion Architecture for Open-Ended Video Question Answering Systems

Abstract

Open-ended video question answering is a challenging problem with widespread real-world applications. Existing systems tend to focus on single-word answers and cannot be easily extended to generate open-ended answers. In this paper, we propose applying an architecture widely used for video captioning to the problem of open-ended video question answering. To generate good answers, the model must attend to each frame individually and also learn how to link information across frames. The model must also account for the different modalities and adapt accordingly while processing both the videos and the questions. We propose an attention-based multimodal fusion architecture for Video Question Answering (AMF-VQA) that applies an attention mechanism at every time step to output a word. This mechanism allows the model to focus on different frames, as well as on different modalities, while generating each word. The proposed model is flexible: other modalities, such as audio features or captions, can be added to the existing model when they are available, and the model can then be fine-tuned to obtain improved results.
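The two levels of attention described in the abstract (over frames within a modality, then over modalities) can be illustrated with a minimal NumPy sketch of a single decoding step. This is not the authors' implementation: the abstract does not specify the scoring function, so additive (tanh-based) scoring is assumed here, and all function names, parameter names, and dimensions are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def temporal_attention(features, state, W_f, W_s, v):
    """Attend over the frames of one modality (assumed additive scoring).

    features: (T, d_f) per-frame features; state: (d_s,) decoder state.
    Returns the context vector (d_f,) and the frame weights (T,).
    """
    scores = np.tanh(features @ W_f + state @ W_s) @ v  # (T,)
    weights = softmax(scores)
    return weights @ features, weights


def modality_attention(contexts, state, projections, W_s, v):
    """Attend over per-modality context vectors.

    Each context is first projected to a shared size d_c so that
    modalities with different feature sizes can be mixed.
    Returns the fused vector (d_c,) and the modality weights (K,).
    """
    projected = np.stack([c @ P for c, P in zip(contexts, projections)])  # (K, d_c)
    scores = np.tanh(projected + state @ W_s) @ v  # (K,)
    weights = softmax(scores)
    return weights @ projected, weights


# Toy data: 8 frames, two hypothetical modalities, random untrained parameters.
rng = np.random.default_rng(0)
T, d_s, d_a, d_c = 8, 10, 12, 14
modalities = {"visual": rng.normal(size=(T, 16)), "audio": rng.normal(size=(T, 6))}
state = rng.normal(size=d_s)  # stands in for the decoder state at one time step

contexts, projections = [], []
for name, feats in modalities.items():
    d_f = feats.shape[1]
    ctx, alpha = temporal_attention(
        feats,
        state,
        rng.normal(size=(d_f, d_a)),
        rng.normal(size=(d_s, d_a)),
        rng.normal(size=d_a),
    )
    contexts.append(ctx)
    projections.append(rng.normal(size=(d_f, d_c)))
    print(name, "frame weights shape:", alpha.shape)

fused, beta = modality_attention(
    contexts, state, projections, rng.normal(size=(d_s, d_c)), rng.normal(size=d_c)
)
print("modality weights shape:", beta.shape, "fused vector shape:", fused.shape)
```

In a full model the fused vector would feed the word decoder and the weights would be recomputed at every time step. Under this design, adding a modality such as captions only requires one more entry in `modalities` plus its own parameters, which matches the flexibility the abstract describes.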

Keywords:
Computer science, Question answering, Focus (optics), Closed captioning, Modalities, Frame (networking), Architecture, Word (group theory), Artificial intelligence, Modal, Information retrieval, Natural language processing, Human–computer interaction, Image (mathematics), Linguistics

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 0.31
Refs: 18
Citation Normalized Percentile: 0.55

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence