Attention Based Multi-Modal Fusion Architecture for Open-Ended Video Question Answering Systems

Sumedh Pendurkar; Sameer Kolpekwar; Shreyas Dhoot; Yashodhara Haribhakta; Biplab Banerjee

doi:10.1016/j.procs.2020.04.047

JOURNAL ARTICLE

Year: 2020 Journal: Procedia Computer Science Vol: 171 Pages: 446-455 Publisher: Elsevier BV

Abstract

Open-ended video question answering is a very challenging problem with widespread real-life applications. Existing systems tend to focus on single-word video question answering, which cannot easily be extended to open-ended answers. In this paper, we propose using an architecture popularly used for video captioning to solve the problem of open-ended video question answering. To generate good answers, the model must focus on each frame separately and also learn how to link information from different frames. The model also needs to account for the different modalities and adapt accordingly while processing both the videos and the questions. We propose an attention-based multimodal fusion architecture for Video Question Answering (AMF-VQA) that applies an attention mechanism at every time step to output a word. This mechanism allows the model to focus on different frames as well as on different modalities while outputting each word. The proposed model is also flexible: other modalities, such as audio features or captions, can be added to the existing model when they are available, and the model can then be fine-tuned to obtain improved results.
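The two levels of attention described in the abstract (over frames, then over modalities, at each decoding step) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bilinear scoring, the weight matrices `W` and `V`, the modality names, the random features and the shared feature size `d` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_attention(frames, state, W):
    # frames: (T, d) per-frame features of one modality
    # state:  (h,)   decoder state at the current time step
    # W:      (d, h) hypothetical bilinear scoring weights
    alpha = softmax(frames @ W @ state)   # (T,) one weight per frame
    return alpha @ frames, alpha          # context vector (d,), weights

def modality_attention(contexts, state, V):
    # contexts: (M, d) one context vector per modality
    # V:        (d, h) hypothetical bilinear scoring weights
    beta = softmax(contexts @ V @ state)  # (M,) one weight per modality
    return beta @ contexts, beta          # fused vector (d,), weights

rng = np.random.default_rng(0)
T, d, h = 6, 8, 5  # frames, feature size, decoder state size (assumed)

# Random stand-ins for extracted features; real features would come from
# pretrained encoders and be projected to a common size d.
modalities = {
    "appearance": rng.normal(size=(T, d)),
    "motion": rng.normal(size=(T, d)),
}
W = {name: rng.normal(size=(d, h)) for name in modalities}
V = rng.normal(size=(d, h))
state = rng.normal(size=h)  # decoder state for one output word

contexts, frame_weights = [], {}
for name, feats in modalities.items():
    context, alpha = temporal_attention(feats, state, W[name])
    contexts.append(context)
    frame_weights[name] = alpha

fused, modality_weights = modality_attention(np.stack(contexts), state, V)
```

In this sketch, adding a modality such as audio amounts to adding one more entry to `modalities` and `W`. A full decoder would recompute `fused` from the updated state before emitting each word.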

Keywords:

Computer science; Question answering; Focus (optics); Closed captioning; Modalities; Frame (networking); Architecture; Word (group theory); Artificial intelligence; Modal; Information retrieval; Natural language processing; Human–computer interaction; Image (mathematics); Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 0.31

Citation Normalized Percentile: 0.55

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Open-Ended Multi-Modal Relational Reasoning for Video Question Answering

Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks

Caption based Co-attention Architecture for Open-Ended Visual Question Answering

Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

A RAG Approach for Multi-Modal Open-ended Lifelog Question-Answering