Hapsari Peni Agustin Tjahyaningtijas, Muhammad Aamir Nashrullah, Pradini Puspitaningayu, Lusia Rakhmawati, Yuni Yamasari, Jessie R. Paragas
Biomedical image classification plays a significant role in advancing the Sustainable Development Goals (SDGs), particularly in the areas of public health and welfare. Classifying medical images is a challenging and rapidly developing area of computer vision and artificial intelligence. Deep learning techniques have driven significant progress in the classification of medical images, including MRI, CT, X-ray, and histopathological tissue images. Convolutional neural networks (CNNs) and their variants are well-known deep learning architectures for this task. Their limitations in medical image classification stem from the focus on local features through convolution operations, which makes it hard to model global relationships across regions of an image. Furthermore, CNNs sometimes require many layers to capture a wide spatial context, which leads to loss of important information and increased model complexity. The Vision Transformer (ViT) addresses these shortcomings with a self-attention mechanism that models global relationships among visual components from the earliest layers. By partitioning the image into patches and processing them in parallel within transformer blocks, ViT efficiently captures long-range dependencies and intricate spatial structures, surpassing CNN performance. This study reviews the application of ViT architectures to biomedical image classification. Ten key studies were analyzed, covering tasks such as breast and brain tumor classification, COVID-19 detection, and lung nodule identification. ViT-based models consistently achieved high performance: peak accuracies ranged from 95.1% to 99.6%, with complementary metrics (sensitivity, specificity, AUC) exceeding 90% in most cases. Despite their promise, ViTs face challenges related to extensive data requirements and computational complexity. Emerging solutions, including hybrid architectures, self-supervised pretraining, and hierarchical embeddings, aim to mitigate these limitations. Future directions include developing lightweight, privacy-preserving ViT variants and enhancing model explainability to support trustworthy clinical adoption.
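The patch-and-attend mechanism described above can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the function names `patchify` and `self_attention`, the 4x4 toy image, and the 2x2 patch size are all assumptions made for illustration, and the attention step uses identity query/key/value projections with no positional embeddings, class token, or multiple heads, all of which a real ViT would include.

```python
import math


def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained ViT learns separate projections.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted mix of ALL patch tokens.
    outputs = [[sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
               for row in weights]
    return outputs, weights


# Toy 4x4 "image" with intensities scaled to [0, 1] (hypothetical data).
image = [[float(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)          # four patches, each flattened to 4 values
outputs, weights = self_attention(tokens)
```

Each row of `weights` holds one patch's attention over every patch in the image, including distant ones, which is the sense in which a single attention layer has a global receptive field, whereas a convolution only sees a local neighborhood.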