JOURNAL ARTICLE

Image Captioning Method Based on Transformer Visual Features Fusion

Xuebing BAI, Jin CHE, Jinman WU, Yumin CHEN

Year: 2024 Journal: DOAJ (Directory of Open Access Journals)

Abstract

Existing image captioning methods generate descriptions from regional visual features alone and overlook grid visual features. In addition, because these methods are two-stage approaches, captioning quality suffers. To address these issues, this study proposes an end-to-end image captioning method based on Transformer visual feature fusion. First, in the feature extraction stage, a visual feature extractor extracts both regional and grid visual features. Second, in the feature fusion stage, a visual feature fusion module concatenates the regional and grid visual features. Finally, the resulting visual features are passed to a language generator, which produces the caption. All components are built on the Transformer model, making this a one-stage method. Experimental results on the MS-COCO dataset show that the proposed method makes full use of the respective advantages of regional and grid visual features: it reaches 83.1%, 41.5%, 30.2%, 60.1%, 140.3%, and 23.9% on the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics, respectively. These results indicate that the proposed method outperforms mainstream image captioning methods and generates more accurate and richer descriptions.
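The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes that "concatenation" means joining the regional and grid feature vectors along the token axis so that a Transformer layer can attend over both sets at once. The abstract does not specify the module's internals, so the function names (`fuse_features`, `self_attention`), the identity query/key/value projections, and the toy dimensions below are all hypothetical and are not the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(region_feats, grid_feats):
    """Concatenate region and grid tokens along the token axis.

    Hypothetical helper: each argument is a list of equal-length vectors.
    """
    dims = {len(v) for v in region_feats + grid_feats}
    if len(dims) != 1:
        raise ValueError("all feature vectors must share one dimension")
    return region_feats + grid_feats


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch dependency-free.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return out


if __name__ == "__main__":
    random.seed(0)
    d = 8  # toy feature dimension (assumption)
    regions = [[random.random() for _ in range(d)] for _ in range(3)]
    grids = [[random.random() for _ in range(d)] for _ in range(4)]
    fused = fuse_features(regions, grids)
    attended = self_attention(fused)
    print(len(fused), len(attended), len(attended[0]))
```

After concatenation, every region token can attend to every grid token and vice versa, which is one plausible way for the two feature types to complement each other before the language generator consumes them.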

Keywords:
Closed captioning, Transformer, Feature extraction, Feature (linguistics), Visualization, Image fusion, Grid

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.47

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Visual Attention and Saliency Detection
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Visual Image Captioning through Transformer

Muneeb Nabi, Rohit Pachauri, Shouaib Ahmad, Kanishk Varshney, Prachi Goel, Apurva Jain

Journal: International Journal for Research in Applied Science and Engineering Technology Year: 2023 Vol: 11 (12) Pages: 2047-2050

JOURNAL ARTICLE

Recurrent fusion transformer for image captioning

Zhenping Mou, Qiao Yuan, Tianqi Song

Journal: Signal Image and Video Processing Year: 2024 Vol: 19 (1)

JOURNAL ARTICLE

Dual visual align-cross attention-based image captioning transformer

Yonggong Ren, Jinghan Zhang, Wenqiang Xu, Yuzhu Lin, Bo Fu, Dang N. H. Thanh

Journal: Multimedia Tools and Applications Year: 2024 Vol: 84 (12) Pages: 10645-10664