Human parsing is a challenging and important task in various applications, such as dress collocation, clothing recommendation and action analysis. However, the existing methods are easily affected by pose variation and occlusion with requiring massive intensive annotations for fine-grained human segmentation. In this paper, we design a cascaded segmentation network with three stages to solve the above problems. Given a human image, we firstly predict the human joints as pose features. Secondly, these features along with the input image are fed into the first stage to obtain a primitive segmentation map to separate the human and the background. The primitive segmentation is then fed into the second stage with the original image to give a rough segmentation of human body. This procedure is repeated in the third stage to acquire a refined segmentation. Experimental results demonstrate the proposed method achieve superior performance than state-of-the-arts and show great generalization ability.
Tianfei ZhouYi YangWenguan Wang
Yikemaiti SataerYunlong FanBin LiMiao GaoChuanqi ShiZhiqiang Gao
Fangting XiaPeng WangLiang-Chieh ChenAlan Yuille