JOURNAL ARTICLE

Biomedical image classification using vision transformer

Abstract

Biomedical image classification plays a significant role in advancing the Sustainable Development Goals (SDGs), particularly in public health and welfare. Classifying medical images is a challenging and rapidly developing area of computer vision and artificial intelligence. Deep learning techniques have driven significant progress in the classification of medical images, including MRI, CT, X-ray, and histopathological tissue images. Convolutional neural networks (CNNs) and their variants are the best-known deep learning architectures for this task. Their limitations in medical image classification stem from the focus of convolution operations on local features, which makes it difficult to model global relationships between distant regions of an image. Furthermore, CNNs often require many layers to capture a wide spatial context, which can lead to the loss of important information and increased model complexity. The Vision Transformer (ViT) addresses these shortcomings with a self-attention mechanism that models global relationships among visual components from an early stage. By partitioning the image into patches and processing them concurrently within transformer blocks, ViT efficiently captures long-range dependencies and intricate spatial structures, surpassing CNN performance. This study reviews the application of ViT architectures to biomedical image classification. Ten key studies were analyzed, encompassing tasks such as breast and brain tumor classification, COVID-19 detection, and lung nodule identification. ViT-based models consistently achieved high performance: peak accuracies ranged from 95.1% to 99.6%, with complementary metrics (sensitivity, specificity, AUC) exceeding 90% in most cases. Despite their promise, ViTs face challenges related to extensive data requirements and computational complexity. Emerging solutions, including hybrid architectures, self-supervised pretraining, and hierarchical embeddings, aim to mitigate these limitations. Future directions involve developing lightweight, privacy-preserving ViT variants and enhancing model explainability to support trustworthy clinical adoption.
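
The two mechanisms named in the abstract, partitioning an image into patches and relating every patch to every other patch through self-attention, can be illustrated with a minimal sketch. It is not the implementation used in any of the reviewed studies. The function names (patchify, self_attention), the 4 x 4 toy image, and the identity query/key/value projections are assumptions made for illustration; a real ViT learns these projections, adds position embeddings and a class token, and stacks many multi-head blocks. Plain Python lists are used only to keep the sketch free of dependencies.

```python
import math


def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained ViT uses learned projections.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all value tokens.
    outputs = [[sum(wt * value[j] for wt, value in zip(row, tokens))
                for j in range(dim)]
               for row in weights]
    return outputs, weights


# Hypothetical 4 x 4 single-channel "image" with intensities in [0, 1).
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)            # four patches, each a 4-value vector
outputs, weights = self_attention(tokens)
```

Each row of weights sums to 1 and has a non-zero entry for every patch, which is the sense in which a single attention layer already relates all image regions to one another, whereas a convolution layer only mixes pixels inside its kernel window.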

Keywords:
Deep learning; Medical imaging; Transformer; Image processing; Pattern recognition (psychology); Contextual image classification; Machine vision; Facial recognition system

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.41

Topics

Geochemistry and Geologic Mapping
Physical Sciences →  Computer Science →  Artificial Intelligence
Geological Modeling and Analysis
Physical Sciences →  Earth and Planetary Sciences →  Geochemistry and Petrology
Electrical and Electromagnetic Research
Physical Sciences →  Physics and Astronomy →  Atomic and Molecular Physics, and Optics

Related Documents

BOOK-CHAPTER

Vehicle Image Classification Method Using Vision Transformer

Youssef Taki, El Moukhtar Zemmouri

Lecture notes in networks and systems Year: 2023 Pages: 221-230

JOURNAL ARTICLE

Privacy-Preserving Image Classification Using Vision Transformer

Qi Zheng, AprilPyone MaungMaung, Yuma Kinoshita, Hitoshi Kiya

Journal: 2022 30th European Signal Processing Conference (EUSIPCO) Year: 2022 Pages: 543-547

BOOK-CHAPTER

Image Quality Distortion Classification Using Vision Transformer

Nay Chi Lynn, Tetsuya Shimamura

Lecture notes on data engineering and communications technologies Year: 2024 Pages: 353-361