Abstract

How far can we go with textual representations for understanding pictures? Deep visual features extracted by object recognition models are widely used across multiple tasks, especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image the way we humans do. Meanwhile, given recent progress in language models, descriptive text may offer an alternative. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA.

Keywords:
Question answering, Computer science, Artificial intelligence, Context, Natural language processing, Object, Image, Information retrieval, Visualization, History

Metrics

Cited By: 6
FWCI (Field Weighted Citation Impact): 0.61
Refs: 27
Citation Normalized Percentile: 0.69

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Domain Adaptation and Few-Shot Learning (Physical Sciences → Computer Science → Artificial Intelligence)
Advanced Image and Video Retrieval Techniques (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Related Documents

BOOK-CHAPTER

Visual Question Answering with Satellite Images

Gaurav Dubey, Sarthak Sharma, Vinayak Mishra, Rishikesh

Lecture Notes in Networks and Systems, Year: 2025, Pages: 455-465
JOURNAL ARTICLE

Elevating Textual Question Answering with On-Demand Visual Augmentation

Sina Ehsani, Jian Liu

Journal: ACM Transactions on Multimedia Computing, Communications and Applications, Year: 2025, Vol: 21 (10), Pages: 1-25