Focal Visual-Text Attention for Memex Question Answering

Junwei Liang; Lu Jiang; Liangliang Cao; Yannis Kalantidis; Li-Jia Li; Alexander G. Hauptmann

doi:10.1109/tpami.2018.2890628

JOURNAL ARTICLE

Year: 2019
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Vol: 41 (8)
Pages: 1893-1908
Publisher: IEEE Computer Society

Abstract

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photo albums, we have to look at whole collections with sequences of photos. This paper proposes a new multimodal MemexQA task: given a sequence of photos from a user, the goal is to automatically answer questions that help users recover their memory about an event captured in these photos. In addition to a text answer, a few grounding photos are also given to justify the answer. The grounding photos are necessary because they help users quickly verify the answer. Towards solving the task, we 1) present the MemexQA dataset, the first publicly available multimodal question answering dataset consisting of real personal photo albums; and 2) propose an end-to-end trainable network that uses a hierarchical process to dynamically determine which media and which time to focus on in the sequential data to answer the question. Experimental results on the MemexQA dataset demonstrate that our model outperforms strong baselines and yields the most relevant grounding photos on this challenging task.
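The abstract does not give the network's equations, so the following is only a rough illustration of the general idea of deciding "which media and which time to focus on". It is a minimal NumPy sketch of generic dot-product attention taken jointly over photo positions (time) and modalities, with the most-attended photos returned as candidate grounding photos. The function name `attend`, the tensor shapes, the dot-product scoring, and the top-k selection are all assumptions made for illustration; they are not taken from the paper and should not be read as its model.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over all entries of x."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(question, context, top_k=2):
    """Toy joint attention over time steps and modalities (illustrative only).

    question: (d,) question embedding (hypothetical input)
    context:  (T, M, d) one embedding per photo position and per modality
    Returns the attended summary vector, the (T, M) attention weights,
    and the indices of the top_k photos by total attention.
    """
    scores = context @ question                 # (T, M) dot-product relevance
    weights = softmax(scores)                   # joint distribution over time and modality
    summary = np.einsum("tm,tmd->d", weights, context)
    per_photo = weights.sum(axis=1)             # marginal attention per photo
    grounding = np.argsort(-per_photo)[:top_k]  # most-attended photos first
    return summary, weights, grounding
```

Under these assumptions, `attend(q, C)` with `C` of shape `(T, M, d)` yields one weight per photo position and modality; summing the weights over the modality axis gives a per-photo score that can be used to rank candidate grounding photos.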

Keywords:

Question answering; Computer science; Metadata; Focus (optics); Information retrieval; Artificial intelligence; Artificial neural network; Natural language processing; World Wide Web

Metrics

FWCI (Field Weighted Citation Impact): 5.45

Citation Normalized Percentile: 0.96

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Focal Visual-Text Attention for Visual Question Answering

Video Question Answering Using Clip-Guided Visual-Text Attention

Text-Guided Dual-Branch Attention Network for Visual Question Answering

Question-Agnostic Attention for Visual Question Answering

Visual Localization and Text-Visual Interaction Attention for Medical Visual Question Localized-Answering