Referring image segmentation aims to segment the particular object in an image that is referred to by a natural language expression. One major challenge of this task is understanding and aligning vision and language so that the referent can be distinguished. Another major challenge is refining the segmentation mask of the referent. In this paper, we focus on dissecting and enhancing the interaction between the two modalities to address these challenges. Specifically, we propose a Structured Multimodal Fusion Network (SMFN), which consists of a multimodal tree, a cross-modal transformer, and a mask refinement module. SMFN first exploits multimodal fusion structures to deeply integrate visual and linguistic features so that the referent can be accurately distinguished, and then utilizes the mask refinement module to aggregate multi-scale visual features and clarify object boundaries. Extensive experiments on four benchmark datasets show that SMFN achieves new state-of-the-art performance under different evaluation metrics.
Liang Lin, Pengxiang Yan, Xiaoqian Xu, Sibei Yang, Kun Zeng, Guanbin Li