Achieving accurate medical image segmentation requires considering both global contextual information and local regional details. Compared to traditional convolutional neural networks (CNNs), which extract representative features from local receptive fields via convolutional layers, Transformer-based methods can capture long-range spatial interactions through the attention mechanism. However, the weakness of Transformers in learning from small datasets hinders their application to medical image segmentation, since such tasks have much smaller datasets than natural image segmentation due to data privacy concerns and the large effort required for data annotation. In this paper, we propose CTranS, which combines a CNN and a Transformer in a multi-resolution manner to achieve both effective local feature learning and global contextual information fusion for medical image segmentation. The proposed method was evaluated and compared against state-of-the-art (SOTA) CNN-based and Transformer-based methods on three public datasets covering skin lesion, polyp, and cell segmentation. Our method achieves the best performance in terms of the Dice coefficient and the average symmetric surface distance. Compared to other Transformer-based methods, it has the fewest model parameters and requires no pretraining. GitHub link: https://github.com/naisops/CTranS.
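The core idea of combining local convolutional features with global attention can be illustrated with a minimal, framework-free sketch. This is a hypothetical toy example (not the authors' CTranS implementation): a 3x3 convolution supplies local detail, a single-head self-attention over the flattened feature map supplies global context, and the two branches are fused additively.

```python
# Hypothetical sketch of CNN + Transformer feature fusion (NOT the authors'
# CTranS code): a local convolutional branch and a global self-attention
# branch are applied to one feature map and summed.
import numpy as np

def conv3x3(x, w):
    """Local branch: 3x3 convolution with zero padding.
    x: (H, W, C) feature map, w: (3, 3, C, C_out) kernel."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]       # 3x3 local neighbourhood
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

def self_attention(x, wq, wk, wv):
    """Global branch: single-head self-attention over all positions.
    x: (N, C) flattened tokens; wq/wk/wv: (C, C) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])       # pairwise interactions
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)            # softmax over positions
    return a @ v

rng = np.random.default_rng(0)
H = W = 8
C = 4
feat = rng.standard_normal((H, W, C))

local_feat = conv3x3(feat, rng.standard_normal((3, 3, C, C)))
tokens = feat.reshape(H * W, C)
global_feat = self_attention(
    tokens, *(rng.standard_normal((C, C)) for _ in range(3))
)
# Simple additive fusion of local detail and global context.
fused = local_feat + global_feat.reshape(H, W, C)
print(fused.shape)  # (8, 8, 4)
```

In the actual method, such branches would operate at multiple resolutions of the encoder, with learned fusion rather than a plain sum; the sketch only shows why the two branches are complementary.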
Baosheng Zou, Zong-Guang Zhou, Ying Han, Kang Li, Guotai Wang