Referring Image Segmentation (RIS) is a cross-modal task that aims to segment the instance described by a natural language expression. Recent methods use large-scale pretrained unimodal models as backbones, together with fusion techniques for joint reasoning across modalities. However, the inherently cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we build on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage the rich alignment knowledge in CLIP's shared image-text embedding space. RISCLIP achieves outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.
Mingxing Pu, Bing Luo, Chao Zhang, Li Xu, Fayou Xu, Mingming Kong
Sen Lei, Xinyu Xiao, Tianlin Zhang, Heng-Chao Li, Zhenwei Shi, Qing Zhu
Yaxiong Chen, Minghong Wei, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Jing Liu, Huajie Jiang, Yandong Bi, Yongli Hu, Baocai Yin
Fang Liu, Yuhao Liu, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin, Gerhard P. Hancke, Rynson W. H. Lau