Zero-Shot Knowledge-Based Visual Question Answering with Frozen Language Models

Jing Liu; Lizong Zhang; Cao Chen; Yinong Shi; Chong Mu; Jiaxin Li

doi:10.26599/bdma.2025.9020032

Year: 2025; Journal: Big Data Mining and Analytics; Vol: 8 (6); Pages: 1418-1431; Publisher: Tsinghua University Press

Abstract

Knowledge-based Visual Question Answering (VQA) is a challenging task that requires models to access external knowledge for reasoning. Large Language Models (LLMs) have recently been employed for zero-shot knowledge-based VQA because of their inherent knowledge storage and in-context learning capabilities. However, LLMs are commonly treated only as implicit knowledge bases, and their generative and in-context learning potential remains underutilized. Existing work demonstrates that the performance of in-context learning depends strongly on the quality and order of the demonstrations in a prompt. In light of this, we propose Knowledge Generation with Frozen Language Models (KGFLM), a novel method for generating explicit knowledge statements to improve zero-shot knowledge-based VQA. Our knowledge generation strategy identifies effective demonstrations and determines their optimal order, thereby prompting the frozen LLM to produce more useful knowledge statements for better predictions. The generated knowledge statements can also serve as interpretable rationales. In our method, demonstrations are selected and arranged for each question according to their semantic similarity to the question and their quality, without requiring additional annotations. Experiments on the A-OKVQA and OKVQA datasets show that our method outperforms some strong zero-shot knowledge-based VQA methods.
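
The abstract describes selecting and ordering demonstrations per question by semantic similarity and quality, but gives no formula. The following is only a minimal sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the bag-of-words cosine similarity (a stand-in for a real sentence encoder), the `quality` field, the `alpha` weighting, the ascending ordering that places the best demonstration closest to the query, and the prompt template.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts; a toy stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_demonstrations(question, pool, k=2, alpha=0.7):
    """Pick the k best demonstrations and order them for the prompt.

    Hypothetical scoring: alpha * similarity + (1 - alpha) * quality.
    Hypothetical ordering: ascending score, so the highest-scoring
    demonstration ends up last, right before the query.
    """
    q_vec = bow(question)
    scored = []
    for demo in pool:
        sim = cosine(q_vec, bow(demo["question"]))
        scored.append((alpha * sim + (1 - alpha) * demo["quality"], demo))
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return [demo for _, demo in sorted(top, key=lambda pair: pair[0])]


def build_prompt(question, demos):
    """Assemble a knowledge-generation prompt for a frozen LLM (template is assumed)."""
    blocks = [f"Question: {d['question']}\nKnowledge: {d['knowledge']}" for d in demos]
    blocks.append(f"Question: {question}\nKnowledge:")
    return "\n\n".join(blocks)
```

In this sketch the frozen LLM would be asked to complete the final `Knowledge:` line, and the completion would then be passed along with the question to the answer-prediction step. A real system would likely replace `bow` with a learned text encoder and derive `quality` from some automatic signal, since the abstract states that no additional annotations are required.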

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models

Diff-ZsVQA: Zero-shot Visual Question Answering with Frozen Large Language Models Using Diffusion Model

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

Zero-shot Visual Question Answering with Language Model Feedback
