This paper presents a multimodal deep learning network based on SqueezeSeg. We extend the standard SqueezeSeg architecture to fuse camera and lidar data using a sensor processing method we term pixel-block point-cloud fusion. Given co-registered camera and lidar sensors, the input stage of the proposed network builds a feature vector for each point in the point cloud that falls within the camera's field of view by extracting a block of RGB pixels around the point's projection into the image. In effect, each lidar point is paired with its neighboring RGB data, giving the feature extractor more meaningful information from the image. Because the pixel blocks capture not only the color but also the local texture of objects, this fusion enriches the lidar data and improves overall performance. The proposed pixel-block point-cloud fusion method yields better results than single-pixel fusion.
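The pixel-block extraction step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function name, the 3x3 default block size, and the assumption that points are already expressed in the camera frame with a known intrinsic matrix `K` are all hypothetical choices for the sketch.

```python
import numpy as np

def pixel_block_fusion(points, image, K, block=3):
    """Pair each lidar point with a (block x block) patch of RGB pixels.

    points: (N, 3) lidar points, assumed already in the camera frame.
    image:  (H, W, 3) RGB image co-registered with the lidar.
    K:      (3, 3) camera intrinsic matrix.
    Returns fused features of shape (M, 3 + block*block*3) for the M points
    whose whole pixel block lies inside the image, plus their indices.
    """
    H, W, _ = image.shape
    r = block // 2

    # Project each point into the image plane: homogeneous pixel = K @ p.
    uv = (K @ points.T).T                      # (N, 3)
    z = uv[:, 2]
    in_front = z > 0                           # points behind the camera are invalid
    u = np.round(uv[:, 0] / np.where(in_front, z, 1.0)).astype(int)
    v = np.round(uv[:, 1] / np.where(in_front, z, 1.0)).astype(int)

    # Keep only points whose full block fits inside the camera field of view.
    valid = in_front & (u >= r) & (u < W - r) & (v >= r) & (v < H - r)
    idx = np.flatnonzero(valid)

    feats = []
    for i in idx:
        # Flatten the RGB block and append it to the point's xyz coordinates.
        patch = image[v[i] - r : v[i] + r + 1, u[i] - r : u[i] + r + 1]
        feats.append(np.concatenate([points[i], patch.reshape(-1)]))
    return np.asarray(feats), idx
```

The resulting per-point vectors (xyz plus the flattened RGB block) would then be fed to the network's feature extractor, which is where the color and texture cues enter the model.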
Yan Zhang, Kang Liu, Hong Bao, Ying Zheng, Yi Yang