Image Captioning Method Based on Transformer Visual Features Fusion

Xuebing BAI, Jin CHE, Jinman WU, Yumin CHEN

JOURNAL ARTICLE

Year: 2024

Journal: DOAJ (Directory of Open Access Journals)

Abstract

Existing image captioning methods rely only on regional visual features to generate descriptions and overlook the importance of grid visual features. Moreover, because these methods are two-stage approaches, captioning quality suffers. To address these issues, this study proposes an end-to-end image captioning method based on Transformer visual feature fusion. First, in the feature extraction stage, a visual feature extractor extracts both regional and grid visual features. Second, in the feature fusion stage, a visual feature fusion module concatenates the regional and grid visual features. Finally, the fused visual features are sent to the language generator to produce the caption. All components are implemented with the Transformer model, which makes the method one-stage. Experimental results on the MS-COCO dataset show that the proposed method fully utilizes the respective advantages of regional and grid visual features, reaching 83.1%, 41.5%, 30.2%, 60.1%, 140.3%, and 23.9% on the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics, respectively. These results indicate that the proposed method outperforms mainstream image captioning methods and generates more accurate and richer descriptions.
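The fusion stage described in the abstract can be illustrated with a minimal sketch. The abstract states only that regional and grid visual features are concatenated by a fusion module; it does not give the module's internals. The sketch below therefore assumes that both feature sets are sequences of token vectors with the same dimension and that concatenation happens along the token axis. The function name `fuse_visual_features` and the toy inputs are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def fuse_visual_features(region_feats: List[Vector],
                         grid_feats: List[Vector]) -> List[Vector]:
    """Concatenate region tokens and grid tokens along the token axis.

    Assumption (not confirmed by the abstract): both inputs share one
    feature dimension, so the fused result is a single longer sequence.
    """
    if not region_feats or not grid_feats:
        raise ValueError("both feature sets must be non-empty")
    dim = len(region_feats[0])
    for vec in region_feats + grid_feats:
        if len(vec) != dim:
            raise ValueError("all feature vectors must share one dimension")
    # Copy each vector so the fused sequence does not alias the inputs.
    return [list(v) for v in region_feats] + [list(v) for v in grid_feats]


# Hypothetical toy inputs: 2 region tokens and 3 grid tokens, dimension 4.
region_feats = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
grid_feats = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]]

fused = fuse_visual_features(region_feats, grid_feats)
print(len(fused), len(fused[0]))  # number of fused tokens, feature dimension
```

In a Transformer pipeline of this kind, a sequence like `fused` would typically be what the language generator attends to; how the paper's module weights or refines the two feature types beyond concatenation is not stated in the abstract.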

Keywords:

Closed captioning; Transformer; Feature extraction; Feature (linguistics); Visualization; Image fusion; Grid

Metrics

Cited By

FWCI (Field Weighted Citation Impact): 0.00

Refs

Citation Normalized Percentile: 0.47

Is in top 1%

Is in top 10%

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Visual Attention and Saliency Detection

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Visual Image Captioning through Transformer

An Image Captioning Method Based on Transformer for Multi-feature Fusion

Semantically Enhanced Dual Visual Fusion Transformer for accurate image captioning

Recurrent fusion transformer for image captioning

Dual visual align-cross attention-based image captioning transformer