PCATNet: Position-Class Awareness Transformer for Image Captioning

Ziwei Tang; Yaohua Yi; Changhui Yu; Aiguo Yin

doi:10.32604/cmc.2023.037861

Year: 2023; Journal: Computers, Materials & Continua; Vol: 75 (3); Pages: 6007-6022

Abstract

Existing image captioning models usually generate captions by relating visual information to words, but they lack spatial information and awareness of object classes. To address this issue, we propose a novel Position-Class Awareness Transformer (PCAT) network, which serves as a bridge between visual features and captions by embedding spatial information and object-class awareness. We construct the PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE maps object regions to grids, calculates the relative distances among objects, and applies quantization. We also modify self-attention to accommodate GMPE. Then, we propose a Classes Semantic Quantization strategy to extract semantic information from the object classes, which is used to facilitate feature embedding and to refine the encoder-decoder framework. To capture the interaction between multi-modal features, we propose Object Classes Awareness (OCA) to refine the encoder and decoder, namely OCA_E and OCA_D, respectively. Finally, we apply GMPE, OCA_E, and OCA_D in various combinations to complete the entire PCAT. We evaluate our method on the MSCOCO dataset. The results demonstrate that PCAT outperforms other competitive methods.
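The three GMPE steps named in the abstract (grid mapping, relative distance, quantization) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the center-based cell assignment, the signed row/column offsets, the single-integer index, and the names `box_to_grid_cell`, `quantized_relative_positions`, and `grid_size` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_to_grid_cell(box: Box, image_w: float, image_h: float,
                     grid_size: int) -> Tuple[int, int]:
    """Map a box to the (row, col) grid cell containing its center (assumed rule)."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


def quantized_relative_positions(boxes: List[Box], image_w: float,
                                 image_h: float,
                                 grid_size: int = 8) -> List[List[int]]:
    """Return an N x N table of integer indices for pairwise grid offsets.

    Each signed offset in [-(G-1), G-1] is shifted to [0, 2G-2], and the
    (row, col) offsets are packed into one index in [0, (2G-1)^2 - 1].
    """
    cells = [box_to_grid_cell(b, image_w, image_h, grid_size) for b in boxes]
    span = 2 * grid_size - 1
    table = []
    for ri, ci in cells:
        row_out = []
        for rj, cj in cells:
            d_row = rj - ri + grid_size - 1
            d_col = cj - ci + grid_size - 1
            row_out.append(d_row * span + d_col)
        table.append(row_out)
    return table


# Two hypothetical objects in a 400 x 400 image, on a 4 x 4 grid.
example_boxes = [(0, 0, 100, 100), (300, 300, 400, 400)]
example_index = quantized_relative_positions(example_boxes, 400, 400, grid_size=4)
```

In a Transformer, an index table of this kind would typically select learned position embeddings or bias terms that are added to the self-attention scores. Whether PCAT modifies self-attention in this way is not stated in the abstract.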

Keywords:

Computer science; Encoder; Closed captioning; Embedding; Quantization (signal processing); Transformer; Artificial intelligence; Grid; Object (grammar); Data mining; Theoretical computer science; Information retrieval; Computer vision; Image (mathematics); Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 0.73

Citation Normalized Percentile: 0.65

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Position-guided transformer for image captioning

A Position-Aware Transformer for Image Captioning

Dual Position Relationship Transformer for Image Captioning

Double-Stream Position Learning Transformer Network for Image Captioning

Visual Rotated Position Encoding Transformer for Remote Sensing Image Captioning