JOURNAL ARTICLE

Open-Ended Multi-Modal Relational Reasoning for Video Question Answering

Abstract

In this paper, we introduce a robotic agent designed to analyze its external environment and answer participants' questions. The agent's primary purpose is to assist individuals through language-based interaction in video-based scenes. Our method integrates video recognition and natural language processing models within the robotic agent. We investigate the factors that most affect human-robot interaction by examining the issues that arise between participants and the robotic agent. Our experimental findings reveal a positive relationship between trust and interaction efficiency. Furthermore, our model improves performance by 2% to 3% over other benchmark methods.

Keywords:
Computer science; Benchmark (surveying); Focus (optics); Question answering; Modal; Artificial intelligence; Robot; Human–computer interaction; Natural language; Natural language processing

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 0.36
Refs: 56
Citation Normalized Percentile: 0.55

Topics

Multimodal Machine Learning Applications
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences → Computer Science → Artificial Intelligence