JOURNAL ARTICLE

Swin Transformer-based Image Captioning with Feature Enhancement and Multi-stage Fusion

Abstract

Image captioning aims to enable computers to automatically generate human-like sentences that describe a given image. To address insufficiently accurate image feature extraction and underuse of visual information, we propose a Swin Transformer-based image captioning model with feature enhancement and multi-stage fusion. First, the Swin Transformer is used as the encoder to extract image features, and feature enhancement is applied to capture more information from those features. Then, a multi-stage image and semantic fusion module is constructed to exploit the semantic information from past time steps. Finally, an LSTM decodes the semantic and image information to generate captions. The proposed model achieves better results in baseline comparisons on the public Flickr8K and Flickr30K datasets.
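
The pipeline described in the abstract (encoder features, feature enhancement, fusion with semantic information from past time steps, LSTM decoding) can be illustrated with a rough NumPy sketch. The abstract does not specify how enhancement or fusion are computed, so everything below is an assumption made for illustration: the encoder output is replaced by random numbers shaped like a 49 x 768 grid of patch features, `enhance` mixes a mean-pooled global feature into each patch, and the fusion step combines an attention context with the average of earlier hidden states. The names `enhance`, `greedy_caption`, `init_params` and all weight names are hypothetical and the weights are untrained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def enhance(feats, w_enh):
    # ASSUMPTION: enhancement = append the mean-pooled global feature to every
    # patch feature, then project back to the original width.
    g = feats.mean(axis=0, keepdims=True)                                 # (1, d)
    joined = np.concatenate([feats, np.repeat(g, len(feats), axis=0)], axis=1)
    return np.tanh(joined @ w_enh)                                        # (n, d)

def lstm_step(x, h, c, w, b):
    # Standard LSTM cell; gates are stacked as [input, forget, output, candidate].
    z = np.concatenate([x, h]) @ w + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def init_params(rng, d, d_h, d_e, vocab, scale=0.01):
    def w(*shape):
        return scale * rng.standard_normal(shape)
    return {
        "w_enh": w(2 * d, d),
        "w_att": w(d, d_h),
        "w_fuse": w(d + d_h, d_h),
        "embed": w(vocab, d_e),
        "w_lstm": w(d_e + 2 * d_h, 4 * d_h),
        "b_lstm": np.zeros(4 * d_h),
        "w_out": w(d_h, vocab),
    }

def greedy_caption(feats, params, max_len=5, bos=0):
    d_h = params["w_out"].shape[0]
    h, c = np.zeros(d_h), np.zeros(d_h)
    past = []                      # hidden states from earlier time steps
    token, tokens = bos, []
    for _ in range(max_len):
        # Image side: attention over patch features, queried by the hidden state.
        scores = feats @ params["w_att"] @ h                              # (n,)
        ctx = softmax(scores) @ feats                                     # (d,)
        # Semantic side (ASSUMPTION): average of past hidden states.
        sem = np.mean(past, axis=0) if past else np.zeros(d_h)
        fused = np.tanh(np.concatenate([ctx, sem]) @ params["w_fuse"])    # (d_h,)
        x = np.concatenate([params["embed"][token], fused])
        h, c = lstm_step(x, h, c, params["w_lstm"], params["b_lstm"])
        past.append(h)
        token = int(np.argmax(h @ params["w_out"]))   # greedy word choice
        tokens.append(token)
    return tokens

rng = np.random.default_rng(0)
n, d, d_h, d_e, vocab = 49, 768, 32, 16, 10
feats = rng.standard_normal((n, d))      # stand-in for Swin encoder output
params = init_params(rng, d, d_h, d_e, vocab)
enhanced = enhance(feats, params["w_enh"])
caption_ids = greedy_caption(enhanced, params)
print(caption_ids)
```

The printed token ids carry no meaning because nothing is trained; the sketch only shows how the three stages could pass data to one another. In a real system the random `feats` would come from a pretrained Swin Transformer and the weights would be learned on a captioning dataset such as Flickr8K or Flickr30K.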

Keywords:
Closed captioning; Computer science; Transformer; Artificial intelligence; Encoder; Feature extraction; Feature (linguistics); Image (mathematics); Computer vision; Image fusion; Semantic feature; Pattern recognition (psychology); Engineering; Voltage

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 0.73
Refs: 35
Citation Normalized Percentile: 0.67

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Swin-Caption: Swin Transformer-Based Image Captioning with Feature Enhancement and Multi-Stage Fusion

Lei Liu, Yidi Jiao, Xiaoran Li, Jing Li, Haitao Wang, Xinyu Cao

Journal: International Journal of Computational Intelligence and Applications Year: 2024 Vol: 24 (03)

JOURNAL ARTICLE

Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning

Jing Zhang, Zhongjun Fang, Zhe Wang

Journal: Applied Intelligence Year: 2022 Vol: 53 (11) Pages: 13398-13414

JOURNAL ARTICLE

MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion

Jing Wang, Haotian Fan, Xiaoxia Hou, Yitian Xu, Tao Li, Xuechao Lu, Lean Fu

Journal: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) Year: 2022 Pages: 1268-1277