In recent years, many segmentation methods based on the encoder-decoder structure have achieved real-time semantic segmentation by using an improved lightweight classification network as the encoder. However, for the segmentation of complex street scenes, the receptive field of such encoders is often insufficient. To alleviate this issue, several methods incorporate a multi-scale feature extraction module into the encoder to capture varied feature information while expanding the receptive field. We observe, however, that well-designed multi-scale feature extraction within the decoding stage can achieve better segmentation performance. We believe that different decoding stages have different demands for feature information, and that carefully designing multi-scale receptive fields for each decoding stage can not only enhance the semantic understanding of the network but also reduce the number of network parameters and the amount of redundant computation. Therefore, we propose a series of novel stage-aware multi-scale feature extraction (SMFE) modules. These modules extract multi-scale feature information across the decoding stages by employing distinct combinations of receptive fields. Leveraging the SMFE modules, we design a lightweight and efficient decoder; the network that uses this decoder is called SMFENet. Experiments demonstrate that SMFENet strikes an effective balance between speed and accuracy. Using ResNet-18 as the encoder on a single NVIDIA GeForce 1080Ti GPU, SMFENet achieves 78.2% mIoU at 39 FPS on the Cityscapes dataset at a resolution of 1,024 × 2,048, and 74.8% mIoU at 114.2 FPS on the CamVid dataset at a resolution of 720 × 960.
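To make the notion of "distinct combinations of receptive fields" concrete, the following minimal sketch computes the spatial extent covered by parallel dilated 3 × 3 convolution branches, using the standard relation k_eff = k + (k − 1)(d − 1). It assumes that the multi-scale branches are realized with dilated convolutions; the stage names and dilation rates in `STAGE_DILATIONS` are hypothetical placeholders chosen for illustration and are not the configuration of the SMFE modules.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Spatial extent (in pixels) covered by a single dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stage_receptive_fields(dilations, kernel_size=3):
    """Effective kernel extent of each parallel branch in one multi-scale module."""
    return [effective_kernel(kernel_size, d) for d in dilations]


# Hypothetical dilation rates per decoding stage (illustrative assumption only,
# not taken from the SMFENet design).
STAGE_DILATIONS = {
    "stage_1": (1, 2, 4),
    "stage_2": (1, 3, 6),
    "stage_3": (1, 6, 12),
}

if __name__ == "__main__":
    for stage, dilations in STAGE_DILATIONS.items():
        print(stage, stage_receptive_fields(dilations))
```

Under these assumed rates, each stage sees a different mix of small and large extents, which is the kind of per-stage variation a stage-aware design could exploit.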
Xi Weng, Yan Yan, Si Chen, Jing‐Hao Xue, Hanzi Wang
Yan Zhou, Xihong Zheng, Yin Yang, Jianxun Li, Jinzhen Mu, Richard Irampaye
Kaige Li, Qichuan Geng, Zhong Zhou
Wenrui Zhang, Zongju Peng, Lian Huang, Fen Chen, Honglin Tan