Sithembiso Ntanzi, Serestina Viriri
There have been significant breakthroughs in developing models for segmenting 3D medical images, with many promising results attributed to the incorporation of Vision Transformers (ViTs). However, the fundamental mechanism of transformers, self-attention, has quadratic complexity in the input sequence length, which significantly increases computational requirements, especially for 3D medical images. In this paper, we investigate the UNETR++ model and propose a voxel-focused attention mechanism inspired by the pixel-focused attention of TransNeXt. The core component of UNETR++ is the Efficient Paired Attention (EPA) block, which learns from two interdependent branches: spatial and channel attention. A deficiency of UNETR++ lies in its spatial-attention branch, which projects the keys and values into lower dimensions; this improves efficiency but risks information loss. Our contribution is to replace this projection with a voxel-focused attention design that achieves linear complexity with respect to the input sequence length without low-dimensional projection, reducing the parameter count while preserving representational power, competitive performance, and inference speed. On the Synapse dataset, the enhanced UNETR++ model contains 21.42 M parameters, a 50% reduction from the original 42.96 M, while achieving a competitive Dice score of 86.72%.
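To make the linear-complexity idea concrete, below is a minimal PyTorch sketch of a voxel-focused attention layer in the spirit described above: each voxel query attends only to a fixed k×k×k neighbourhood of keys and values, so cost grows linearly with the number of voxels rather than quadratically. The class name, the window size, and the sliding-window formulation are illustrative assumptions, not the authors' code; the TransNeXt-style design additionally aggregates pooled global tokens, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelFocusedAttention(nn.Module):
    """Illustrative sketch: each voxel attends to a fixed k*k*k local
    window, giving O(N * k^3) cost, linear in the number of voxels N.
    Hypothetical simplification of a voxel-focused spatial branch."""

    def __init__(self, dim: int, window: int = 3):  # window assumed odd
        super().__init__()
        self.window = window
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W, C) dense voxel grid of features
        B, D, H, W, C = x.shape
        k = self.window
        q, key, v = self.qkv(x).chunk(3, dim=-1)           # each (B, D, H, W, C)

        def local_windows(t: torch.Tensor) -> torch.Tensor:
            # Gather the k^3 neighbours of every voxel -> (B, N, k^3, C)
            t = t.permute(0, 4, 1, 2, 3)                    # (B, C, D, H, W)
            t = F.pad(t, [k // 2] * 6)                      # pad W, H, D symmetrically
            t = t.unfold(2, k, 1).unfold(3, k, 1).unfold(4, k, 1)
            return t.reshape(B, C, D * H * W, k ** 3).permute(0, 2, 3, 1)

        key_w, v_w = local_windows(key), local_windows(v)   # (B, N, k^3, C)
        q = q.reshape(B, D * H * W, 1, C)                   # one query per voxel
        attn = (q * self.scale) @ key_w.transpose(-2, -1)   # (B, N, 1, k^3)
        out = (attn.softmax(dim=-1) @ v_w).reshape(B, D, H, W, C)
        return self.proj(out)

# Usage: layer = VoxelFocusedAttention(dim=32); y = layer(torch.randn(1, 8, 8, 8, 32))
```

Because the attended window has a fixed size, no low-dimensional projection of keys and values is needed, which is the trade-off the abstract describes: fewer parameters than projection-based spatial attention, without discarding spatial information.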