From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models

Jiaxian Guo; Junnan Li; Dongxu Li; Anthony Meng Huat Tiong; Boyang Li; Dacheng Tao; Steven C. H. Hoi

doi:10.1109/cvpr52729.2023.01046

Year: 2023; Pages: 10867-10877

Abstract

Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnect and task disconnect between the LLM and VQA tasks. End-to-end training on multimodal data may bridge the disconnects, but is inflexible and computationally expensive. To address this issue, we propose Img2LLM, a plug-and-play module that provides LLM prompts to enable LLMs to perform zero-shot VQA tasks without end-to-end training. We develop LLM-agnostic models that describe image content as exemplar question-answer pairs, which prove to be effective LLM prompts. Img2LLM offers the following benefits: 1) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo [3] by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method outperforms few-shot methods by as much as 20%. 2) It flexibly interfaces with a wide range of LLMs to perform VQA. 3) It eliminates the need to specialize LLMs using end-to-end finetuning and to serve highly specialized LLMs to end users, thereby reducing cost. Code is available via the LAVIS [28] framework at https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa.
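
The idea of passing image content to a frozen LLM as exemplar question-answer pairs can be illustrated with a minimal sketch. The abstract does not give the prompt template, so the `Question:`/`Answer:` layout, the function name `build_vqa_prompt`, and the example pairs below are all hypothetical; in Img2LLM the pairs would be produced from the image by the LLM-agnostic models, whereas here they are hard-coded.

```python
from typing import List, Tuple


def build_vqa_prompt(exemplars: List[Tuple[str, str]], question: str) -> str:
    """Assemble a text-only prompt from exemplar QA pairs and a target question.

    The template is an assumption for illustration, not the one used by Img2LLM.
    """
    lines = []
    for q, a in exemplars:
        lines.append(f"Question: {q}")
        lines.append(f"Answer: {a}")
    # The user's question goes last, with the answer left open for the LLM.
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical exemplar pairs that an image-to-text module might produce.
exemplars = [
    ("What animal is in the picture?", "dog"),
    ("What is the dog lying on?", "sofa"),
]
prompt = build_vqa_prompt(exemplars, "What color is the sofa?")
print(prompt)
```

Because the resulting prompt is plain text, it could in principle be sent to any text-only LLM without changing that model's weights, which is the plug-and-play property the abstract describes.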

Keywords:

Computer science; Shot (pellet); Task (project management); Question answering; Code (set theory); Generalization; Artificial intelligence; Tree (set theory); Natural language processing; Programming language; Engineering

Metrics

Cited By: 117

FWCI (Field Weighted Citation Impact): 21.29

Citation Normalized Percentile: 0.99

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Zero-Shot Knowledge-Based Visual Question Answering with Frozen Language Models

ZVQAF: Zero-shot visual question answering with feedback from large language models

Diff-ZsVQA: Zero-shot Visual Question Answering with Frozen Large Language Models Using Diffusion Model

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models