JOURNAL ARTICLE

PCATNet: Position-Class Awareness Transformer for Image Captioning

Ziwei Tang, Yaohua Yi, Changhui Yu, Aiguo Yin

Year: 2023 Journal: Computers, Materials & Continua Vol: 75 (3) Pages: 6007-6022

Abstract

Existing image captioning models usually generate captions by relating visual information to words, but they lack spatial information and object classes. To address this issue, we propose a novel Position-Class Awareness Transformer (PCAT) network, which serves as a bridge between visual features and captions by embedding spatial information and awareness of object classes. We construct the PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE maps object regions to grids, calculates the relative distances among objects, and quantizes them. We also modify self-attention to accommodate GMPE. Then, we propose a Classes Semantic Quantization strategy to extract semantic information from the object classes, which is used to facilitate feature embedding and to refine the encoder-decoder framework. To capture the interaction between multi-modal features, we propose Object Classes Awareness (OCA) to refine the encoder and the decoder, yielding OCAE and OCAD, respectively. Finally, we apply GMPE, OCAE and OCAD in various combinations to complete the entire PCAT. We evaluate our method on the MSCOCO dataset. The results demonstrate that PCAT outperforms other competitive methods.
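
The three GMPE steps named in the abstract (mapping object regions to grids, computing relative distances among objects, and quantizing them) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the details, so the choice of box centers as the mapped point, the square grid, the signed row/column offsets as the "relative distance", and the function names boxes_to_grid_cells and quantized_relative_positions are all assumptions made for illustration.

```python
import numpy as np


def boxes_to_grid_cells(boxes, image_size, grid_size):
    """Map each box (x1, y1, x2, y2) to the (row, col) grid cell holding its center.

    Assumption: the box center stands in for the object region; the paper's
    abstract does not say how regions are mapped to grids.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    width, height = image_size
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    # Scale centers to grid units, truncate, and clip points on the far edge.
    col = np.clip((cx / width * grid_size).astype(int), 0, grid_size - 1)
    row = np.clip((cy / height * grid_size).astype(int), 0, grid_size - 1)
    return np.stack([row, col], axis=1)


def quantized_relative_positions(cells, grid_size):
    """Turn pairwise (row, col) offsets between cells into integer bucket ids.

    Assumption: "relative distance" is taken as the signed grid offset, and
    "quantization" as flattening that offset into one index, which could then
    select a learned embedding or an attention bias.
    """
    delta = cells[:, None, :] - cells[None, :, :]  # (N, N, 2), values in [-(G-1), G-1]
    shifted = delta + (grid_size - 1)              # values in [0, 2G-2]
    return shifted[..., 0] * (2 * grid_size - 1) + shifted[..., 1]


# Hypothetical example: two objects in a 100x100 image on a 4x4 grid.
example_boxes = [[0, 0, 10, 10], [90, 90, 100, 100]]
example_cells = boxes_to_grid_cells(example_boxes, (100, 100), 4)
example_ids = quantized_relative_positions(example_cells, 4)
```

With a grid of size G, the bucket ids fall in the range 0 to (2G-1)^2 - 1, so a lookup table of that size would be enough to attach one position term to every object pair in a self-attention layer. How PCAT actually adapts self-attention to GMPE is described in the full paper, not in this abstract.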

Keywords:
Computer science; Encoder; Closed captioning; Embedding; Quantization (signal processing); Transformer; Artificial intelligence; Grid; Object (grammar); Data mining; Theoretical computer science; Information retrieval; Computer vision; Image (mathematics); Mathematics

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 0.73
Refs: 45
Citation Normalized Percentile: 0.65

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Position-guided transformer for image captioning

Juntao Hu, You Yang, Yao Lu, Yongzhi An, Longyue Pan

Journal: Image and Vision Computing Year: 2022 Vol: 128 Pages: 104575

JOURNAL ARTICLE

A Position-Aware Transformer for Image Captioning

Zelin Deng, Bo Zhou, Pei He, Jianfeng Huang, Osama Alfarraj, Amr Tolba

Journal: Computers, Materials & Continua Year: 2021 Vol: 70 (1) Pages: 2065-2081

JOURNAL ARTICLE

Double-Stream Position Learning Transformer Network for Image Captioning

Weitao Jiang, Wei Zhou, Haifeng Hu

Journal: IEEE Transactions on Circuits and Systems for Video Technology Year: 2022 Vol: 32 (11) Pages: 7706-7718

JOURNAL ARTICLE

Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning

A.K. Liu, Lingwu Meng, Liang Xiao

Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Year: 2024 Vol: 17 Pages: 20026-20040