Abstract

Image captioning has long been approached with a convolutional neural network (CNN) for feature extraction paired with a recurrent neural network (RNN) for text generation; in the era of widespread Transformer use, and especially for the Thai language, this approach needs further development. This paper proposes ThaiTC, an end-to-end image captioning model for Thai that combines a pretrained Vision Transformer (ViT) with a pretrained Thai text Transformer, leveraging the Transformer architecture on both the vision and the language side. We experiment to find the pretrained vision Transformer and Thai text Transformer best suited to Thai image captioning, and evaluate on three Thai image captioning datasets with different challenges: 1) Travel, 2) Food, and 3) Flickr30k (translated). We also examine freezing the Vision Transformer weights when training on captioning datasets with fewer images. In our experiments, ThaiTC performed much better on the Food and Flickr30k datasets than on the Travel dataset, allowing us to automatically generate captions for food and travel images.
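The abstract describes a ViT encoder feeding a Thai text Transformer decoder, with the option of freezing the vision weights in low-data settings. The sketch below illustrates that encoder-decoder pattern in plain PyTorch; the class name `ThaiCaptioner`, the dimensions, and the layer counts are illustrative assumptions, not the paper's actual configuration, and randomly initialized layers stand in for the pretrained models.

```python
import torch
import torch.nn as nn

class ThaiCaptioner(nn.Module):
    """Illustrative sketch: ViT-style encoder + Thai text decoder.

    All sizes here (d_model=256, 196 patches, vocab of 8000) are
    assumptions for the example, not values from the paper.
    """
    def __init__(self, vocab_size=8000, d_model=256, patch_dim=768):
        super().__init__()
        # stand-in for a pretrained ViT: patch projection + encoder stack
        self.patch_embed = nn.Linear(patch_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # text decoder with cross-attention over the image features
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def freeze_encoder(self):
        # the low-data option from the abstract: keep vision weights fixed
        for p in self.patch_embed.parameters():
            p.requires_grad = False
        for p in self.encoder.parameters():
            p.requires_grad = False

    def forward(self, patches, tokens):
        memory = self.encoder(self.patch_embed(patches))
        tgt = self.tok_embed(tokens)
        # causal mask so each caption token attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)

model = ThaiCaptioner()
model.freeze_encoder()
patches = torch.randn(2, 196, 768)          # dummy image patch features
tokens = torch.randint(0, 8000, (2, 12))    # dummy Thai token ids
logits = model(patches, tokens)             # shape: (2, 12, 8000)
```

In practice one would load actual pretrained ViT and Thai language-model weights into the two halves; the structural point is simply that only the decoder (and the LM head) receives gradients when the encoder is frozen.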

Keywords:
Closed captioning, Transformer, Computer science, Convolutional neural network, Feature extraction, Artificial intelligence, Natural language processing, Speech recognition, Image (mathematics), Engineering, Electrical engineering
