JOURNAL ARTICLE

Graph Alignment Transformer for More Grounded Image Captioning

Abstract

The Industrial Internet of Things (IIoT) generates massive amounts of data that are the cornerstone for companies to increase productivity and provide reliable services. Predictive and in-depth analysis of these data can identify weaknesses and guide improvements. How to analyze these data efficiently, effectively, and safely remains to be explored. We aim to exploit these data with deep learning methods, and image captioning is one of the meaningful tasks. Image captioning aims to describe a given image in natural language. It is widely believed that mining relationships between objects improves reasoning performance. Existing methods often extract general relational expressions from a separate visual relationship benchmark, which usually introduces redundant connections between region pairs. In this paper, we propose a novel Graph Alignment Transformer (GAT) that models visual relationships in an unsupervised way to perform multimodal representation. Without pre-training to obtain explicit relational expressions, our model still achieves comparable results. Furthermore, we design a Graph Alignment (GA) module to explore semantic and visual alignment at the node level and the graph level, leading to more accurate captions. We evaluate our method on the benchmark MSCOCO image captioning dataset and conduct ablation studies to investigate its effectiveness both quantitatively and qualitatively. Compared with state-of-the-art methods, our proposed approach yields impressive results.
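To make the idea of aligning semantic and visual graphs at two granularities more concrete, the following is a minimal illustrative sketch, not the authors' GA module. It assumes that each graph node is a plain feature vector, that node-level alignment can be approximated by the best cosine match per semantic node, and that graph-level alignment can be approximated by the cosine similarity of mean-pooled node features. All function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(nodes):
    """Average a list of node vectors into a single graph-level vector."""
    n = len(nodes)
    return [sum(col) / n for col in zip(*nodes)]


def node_level_alignment(semantic_nodes, visual_nodes):
    """Assumed node-level score: best visual match per semantic node, averaged."""
    best = [max(cosine(s, v) for v in visual_nodes) for s in semantic_nodes]
    return sum(best) / len(best)


def graph_level_alignment(semantic_nodes, visual_nodes):
    """Assumed graph-level score: cosine similarity of the mean-pooled graphs."""
    return cosine(mean_pool(semantic_nodes), mean_pool(visual_nodes))


# Toy example: two visual region nodes and two semantic word nodes (hypothetical values).
visual = [[1.0, 0.0], [0.0, 1.0]]
semantic = [[2.0, 0.0], [0.0, 3.0]]

node_score = node_level_alignment(semantic, visual)
graph_score = graph_level_alignment(semantic, visual)
print(node_score, graph_score)
```

In this toy case every semantic node points in the same direction as one visual node, so the node-level score is at its maximum, while the graph-level score is slightly lower because pooling mixes nodes of different magnitudes. A trained model would presumably learn such features and scores rather than compute them with fixed cosine similarity.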

Keywords:
Closed captioning, Computer science, Exploit, Transformer, Artificial intelligence, Graph, Benchmark (surveying), Machine learning, Pairwise comparison, Natural language processing, Visualization, Deep learning, Data mining, Information retrieval, Image (mathematics), Theoretical computer science

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 43
Citation Normalized Percentile: 0.16

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Relational Graph Reasoning Transformer for Image Captioning

Xinyu Xiao, Zixun Sun, Tingtian Li, Yipeng Yu

Journal: 2022 IEEE International Conference on Multimedia and Expo (ICME) Year: 2022
JOURNAL ARTICLE

Image captioning with transformer and knowledge graph

Yu Zhang, Xinyu Shi, Siya Mi, Xu Yang

Journal: Pattern Recognition Letters Year: 2021 Vol: 143 Pages: 43-49
JOURNAL ARTICLE

Consensus Graph Representation Learning for Better Grounded Image Captioning

Wenqiao Zhang, Haochen Shi, Siliang Tang, Jun Xiao, Qiang Yu, Yueting Zhuang

Journal: Proceedings of the AAAI Conference on Artificial Intelligence Year: 2021 Vol: 35 (4) Pages: 3394-3402
JOURNAL ARTICLE

Multi-Modal Graph Aggregation Transformer for image captioning

Lizhi Chen, Kesen Li

Journal: Neural Networks Year: 2024 Vol: 181 Pages: 106813-106813