Text segmentation is a challenging computer vision task with many downstream applications. Current text segmentation models must be trained with pixel-level annotations, which are labor-intensive to produce. In this paper, we make the first attempt at weakly-supervised text instance segmentation by bridging text recognition and text segmentation. We observe that text recognition models can localize each text instance through their attention maps. Based on this observation, we propose a two-stage Text Adaptive Refinement (TAR) module that generates pseudo labels from the attention map of a text recognizer. We also develop a text segmentation module that takes the rough attention location as input and predicts segmentation masks, which are supervised by these pseudo labels. In addition, we introduce mask-augmented contrastive learning, which treats the segmentation result as an augmented version of the input text image, thereby improving the visual representation and further enhancing the performance of both recognition and segmentation. Experimental results demonstrate that the proposed method outperforms state-of-the-art (SOTA) weakly-supervised generic segmentation methods by 18.95% and 17.80% in fgIoU on ICDAR13-FST and TextSeg, respectively. On MLT-S, COCO-TS and Total-Text, the proposed method reaches about 82% of the performance of fully-supervised methods. When evaluated on instance segmentation, the proposed method exceeds existing SOTA methods by 23.32% and 21.34% on ICDAR13-FST and TextSeg, respectively. Code and supplementary materials are available at https://github.com/FudanVI/FudanOCR/tree/main/weakly-text-segmentation.
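For readers unfamiliar with the fgIoU metric quoted above, the following is a minimal sketch of foreground intersection-over-union between a predicted and a ground-truth binary mask. The function name `fg_iou`, the nested-list mask format, and the handling of two empty masks are assumptions made for illustration; the released code may compute the metric differently, for example by accumulating intersection and union over a whole dataset rather than per image.

```python
def fg_iou(pred, target):
    """Foreground IoU between two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration only, not taken from the released code.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            inter += p and t   # pixel is text in both masks
            union += p or t    # pixel is text in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
print(fg_iou(pred, target))  # 2 shared text pixels out of 4 in the union -> 0.5
```

Only foreground (text) pixels enter the ratio, so large background regions do not inflate the score.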