Graph Alignment Transformer for More Grounded Image Captioning

Canwei Tian; Haiyang Hu; Zhongji Li

doi:10.1109/iiotbdsc57192.2022.00028

JOURNAL ARTICLE

Year: 2022 Vol: 6314 Pages: 95-102

Abstract

The Industrial Internet of Things (IIoT) generates massive amounts of data that are the cornerstone for companies to increase productivity and provide reliable services. Based on these data, predictive and in-depth analysis can be used to identify weaknesses and make improvements. How to analyze these data efficiently, effectively, and safely remains to be explored. We aim to exploit these data using deep learning methods, and image captioning is one such meaningful task. Image captioning aims to describe a given image in natural language. It is widely believed that mining relationships between objects improves reasoning performance. Existing methods often extract general relational expressions from a separate visual relationship benchmark, which usually introduces redundant connections between region pairs. In this paper, we propose a novel Graph Alignment Transformer (GAT) that models visual relationships in an unsupervised way to perform multimodal representation. Without relying on pre-training to obtain explicit relational expressions, our model still achieves comparable results. Furthermore, we design a Graph Alignment (GA) module to explore semantic and visual alignment at the node level and the graph level, leading to accurate captions. We evaluate our method on the benchmark MSCOCO image captioning dataset and conduct ablation studies to investigate its effectiveness both quantitatively and qualitatively. Compared to state-of-the-art methods, our proposed approach yields impressive results.

Keywords:

Closed captioning; Computer science; Exploit; Transformer; Artificial intelligence; Graph; Benchmark (surveying); Machine learning; Pairwise comparison; Natural language processing; Visualization; Deep learning; Data mining; Information retrieval; Image (mathematics); Theoretical computer science

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 0.00

Refs

Citation Normalized Percentile: 0.16

Is in top 1%

Is in top 10%

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Relational Graph Reasoning Transformer for Image Captioning

Image captioning with transformer and knowledge graph

Consensus Graph Representation Learning for Better Grounded Image Captioning

More Grounded Image Captioning by Distilling Image-Text Matching Model

Multi-Modal Graph Aggregation Transformer for image captioning