With the rapid progress in autonomous driving technology, the integration of multiple sensors into autonomous driving systems has become crucial. Existing methods often use point-level fusion, where LiDAR point clouds are projected onto the image plane and fused with RGB features. However, this point-level fusion leads to a loss of semantic density in the RGB features during the projection. To overcome this limitation, recent methods transform RGB pixels into 3D space using depth prediction, generating virtual point clouds. While this preserves the semantic density of camera features, it introduces challenges such as increased computational load and depth-completion inaccuracies. In this paper, we propose a novel fusion method that unifies the representation of multimodal features in the Bird’s Eye View (BEV) space, preserving both geometric and semantic information. We introduce a BEV feature fusion module to effectively integrate rich semantic features from RGB data into voxel features. Furthermore, we employ the Focal Sparse Convolution module to stabilize feature learning through position-weighted predictions, thereby improving point cloud feature extraction. Our fusion approach thus retains semantic features while strengthening point cloud feature extraction. Experimental results on the public nuScenes dataset demonstrate superior performance in 3D object detection and tracking, highlighting the practical potential of this approach for autonomous driving systems.
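Since the abstract does not detail the internals of the BEV feature fusion module, the following PyTorch sketch illustrates one common way to combine camera and LiDAR features once both have been rasterized onto the same BEV grid: channel-wise concatenation followed by a convolutional block. The class name, channel dimensions, and the concatenate-then-convolve design are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class BEVFeatureFusion(nn.Module):
    """Fuses camera and LiDAR features that share a common BEV grid.

    Assumption-based sketch: the concatenate-then-convolve design and all
    channel sizes are hypothetical, not the paper's exact module.
    """

    def __init__(self, cam_channels: int = 80, lidar_channels: int = 256,
                 out_channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            # Mix the concatenated modalities with a 3x3 convolution.
            nn.Conv2d(cam_channels + lidar_channels, out_channels,
                      kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # Both inputs must have the same BEV resolution: (B, C, H, W).
        fused = torch.cat([cam_bev, lidar_bev], dim=1)
        return self.fuse(fused)


if __name__ == "__main__":
    fusion = BEVFeatureFusion()
    cam = torch.randn(2, 80, 180, 180)     # camera semantic BEV features
    lidar = torch.randn(2, 256, 180, 180)  # LiDAR voxel/BEV features
    print(fusion(cam, lidar).shape)        # torch.Size([2, 256, 180, 180])
```

Concatenation keeps the two modalities' information intact before mixing, which is why it is a frequent baseline choice for BEV-space fusion; attention-based or gated variants are also possible but are not implied by the abstract.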