JOURNAL ARTICLE

Auto-encoding and Distilling Scene Graphs for Image Captioning

Xu Yang, Hanwang Zhang, Jianfei Cai

Year: 2020   Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence   Vol: 44 (5)   Pages: 1-1   Publisher: IEEE Computer Society

Abstract

We propose a scene graph auto-encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework to produce more human-like captions. Intuitively, we humans use this inductive bias to compose collocations and contextual inferences in discourse. For example, when we see the relation "a person on a bike", it is natural to replace "on" with "ride" and infer "a person riding a bike on a road", even when the "road" is not evident. Therefore, exploiting such bias as a language prior is expected to help conventional encoder-decoder models reason as humans do and generate more descriptive captions. Specifically, we use the scene graph, a directed graph (G) where an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both the image (I) and the sentence (S). In the language domain, we use SGAE to learn a dictionary set (D) that helps reconstruct sentences in the S → G → D → S auto-encoding pipeline, where D encodes the desired language prior and the decoder learns to caption from that prior; in the vision-language domain, we share D in the I → G → D → S pipeline and distill the knowledge of the auto-encoder's language decoder into the encoder-decoder based image captioner to transfer the language inductive bias. In this way, the shared D provides hidden embeddings of descriptive collocations to the encoder-decoder, and the distillation strategy teaches the encoder-decoder to transform these embeddings into human-like captions, as the auto-encoder does. Thanks to the scene graph representation, the shared dictionary set, and the knowledge distillation strategy, the inductive bias is transferred across domains in principle.
We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, where our SGAE-based single model achieves a new state-of-the-art 129.6 CIDEr-D on the Karpathy split and a competitive 126.6 CIDEr-D (c40) on the official server, which is even comparable to ensemble models. Furthermore, we validate the transferability of SGAE in two more challenging settings: transferring inductive bias from other language corpora, and unpaired image captioning. Once again, the results of both settings confirm the superiority of SGAE. The code is released at https://github.com/yangxuntu/SGAE.
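The shared-dictionary step described in the abstract — re-encoding scene-graph node embeddings against a learned dictionary set D so they absorb the language prior — can be sketched as scaled dot-product attention over the dictionary entries. The following is a minimal, hypothetical NumPy illustration; the function name, dimensions, and exact parameterization are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def reencode_with_dictionary(x, d_mem):
    """Re-encode node embeddings x (n, dim) against a learned
    dictionary d_mem (k, dim) via scaled dot-product attention.
    Each output row is a convex combination of dictionary entries,
    so the re-encoded features live in the span of the prior."""
    dim = x.shape[1]
    scores = x @ d_mem.T / np.sqrt(dim)          # (n, k) similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over dictionary
    return attn @ d_mem                          # (n, dim) re-encoded nodes

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))       # 5 scene-graph node embeddings
d_mem = rng.standard_normal((16, 8))  # dictionary of 16 learned entries
x_hat = reencode_with_dictionary(x, d_mem)
print(x_hat.shape)  # (5, 8)
```

In training, d_mem would be learned end-to-end in the S → G → D → S pipeline and then shared by the I → G → D → S captioner, so both decoders caption from the same re-encoded representation.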

Keywords:
Computer science, Scene graph, Closed captioning, Encoder, Artificial intelligence, Sentence, Pipeline (software), Natural language processing, Natural language, Decoding methods, Inductive bias, Encoding (memory), Factor graph, Language model, Speech recognition, Graph, Computer vision, Image (mathematics), Theoretical computer science, Algorithm, Programming language

Metrics

Cited By: 65
FWCI (Field Weighted Citation Impact): 3.99
Refs: 106
Citation Normalized Percentile: 0.95 (in top 1% and top 10%)

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence
