Swin Transformer-based Image Captioning with Feature Enhancement and Multi-stage Fusion

Lei Liu; Yidi Jiao; Xiaoran Li; Jing Li; Haitao Wang; Xinyu Cao

doi:10.1109/icnc-fskd59587.2023.10281090

JOURNAL ARTICLE

Year: 2023 Pages: 1-7

Abstract

Image captioning aims to enable computers to automatically generate human-like sentences that describe a given image. To address insufficiently accurate image feature extraction and the underutilization of visual information, we propose a Swin Transformer-based image captioning model with feature enhancement and multi-stage fusion. First, the Swin Transformer serves as the encoder to extract image features, and feature enhancement is applied to capture more information from those features. Then, a multi-stage image and semantic fusion module is constructed to exploit semantic information from past time steps. Finally, an LSTM decodes the semantic and image information to generate captions. The proposed model achieves better results than the baselines in tests on the public datasets Flickr8K and Flickr30K.
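The data flow described in the abstract (encode, enhance, fuse with past time steps, decode with an LSTM) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' method: the region features stand in for Swin Transformer outputs, `enhance` and `fuse` are hypothetical placeholders for the paper's feature enhancement and multi-stage fusion modules, the weights are random rather than trained, and the previous word embedding that a real decoder would feed to the LSTM is omitted for brevity.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def enhance(features):
    # Hypothetical enhancement: add the global mean feature to every region.
    dim = len(features[0])
    mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    return [[f[i] + mean[i] for i in range(dim)] for f in features]

def fuse(features, past_states):
    # Hypothetical fusion: average all past hidden states into a query,
    # then take a softmax-attention-weighted sum of the region features.
    dim = len(features[0])
    query = [sum(h[i] for h in past_states) / len(past_states) for i in range(dim)]
    weights = softmax([sum(q * v for q, v in zip(query, f)) for f in features])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

def lstm_step(x, h, c, W):
    # Standard LSTM cell without biases; each gate matrix acts on [x, h].
    z = x + h
    i = [sigmoid(v) for v in matvec(W["i"], z)]
    f = [sigmoid(v) for v in matvec(W["f"], z)]
    o = [sigmoid(v) for v in matvec(W["o"], z)]
    g = [math.tanh(v) for v in matvec(W["g"], z)]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new

def generate(features, vocab, max_len=5, seed=0):
    # Greedy decoding with untrained random weights (illustrative only).
    rng = random.Random(seed)
    dim = len(features[0])

    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    W = {gate: rand(dim, 2 * dim) for gate in "ifog"}
    W_out = rand(len(vocab), dim)
    feats = enhance(features)
    h, c = [0.0] * dim, [0.0] * dim
    past, caption = [h], []
    for _ in range(max_len):
        context = fuse(feats, past)        # image + past-step semantics
        h, c = lstm_step(context, h, c, W)
        past.append(h)
        logits = matvec(W_out, h)
        word = vocab[logits.index(max(logits))]
        if word == "<end>":
            break
        caption.append(word)
    return caption

# Three toy "region features" standing in for encoder outputs.
regions = [[0.1, 0.4, -0.2], [0.3, -0.1, 0.5], [0.0, 0.2, 0.1]]
print(generate(regions, ["<end>", "a", "dog", "runs"], max_len=4))
```

Because the weights are random, the printed caption is meaningless; the sketch only shows where each stage named in the abstract would sit in a decoding loop.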

Keywords:

Closed captioning; Computer science; Transformer; Artificial intelligence; Encoder; Feature extraction; Feature (linguistics); Image (mathematics); Computer vision; Image fusion; Semantic feature; Pattern recognition (psychology); Engineering; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 0.73

Citation Normalized Percentile: 0.67

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Domain Adaptation and Few-Shot Learning

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Swin-Caption: Swin Transformer-Based Image Captioning with Feature Enhancement and Multi-Stage Fusion

Multi-feature fusion image super-resolution network based on Swin Transformer

An Image Captioning Method Based on Transformer for Multi-feature Fusion

Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning

MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion