Yunfei Zhang, Feipeng Da, Shaoyan Gai
High-precision 3D object detection in autonomous driving requires effective LiDAR-camera fusion. However, the heterogeneous nature of these modalities makes it challenging to fully integrate geometric and semantic information. Existing methods adopt either sparse or dense fusion: sparse fusion retains geometric accuracy but lacks semantic richness, while dense fusion offers richer semantics but suffers from inefficiency and noise sensitivity. To address this, we propose multimodal sparse-dense fusion (MMSDF), a complementary framework that combines both fusion strategies. It comprises (1) a sparse fusion attention (SFA) module that projects non-empty LiDAR voxels onto the image plane to extract local semantic features; (2) a dense bird’s eye view (BEV) feature alignment (BFA) module that uses optical flow and frequency-domain convolutions to align LiDAR and image BEV features; and (3) an RoI point-voxel fusion attention (RPVFA) module that enhances RoI representations via cross-attention between point and multiscale voxel features. Experiments on KITTI show that MMSDF achieves 88.21% and 84.26% accuracy on the validation and test sets, respectively, with ablation studies confirming the effectiveness of each module.
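To make the sparse-fusion step concrete, below is a minimal PyTorch sketch of the mechanism the abstract attributes to SFA: projecting non-empty LiDAR voxel centers onto the image plane and sampling local semantic features there. It assumes a standard pinhole projection via a combined LiDAR-to-image matrix and bilinear sampling; all names (`sample_image_features`, `lidar2img`, etc.) are hypothetical and this is not the authors' implementation.

```python
# Hypothetical sketch of sparse LiDAR-to-image fusion (SFA-style):
# project non-empty voxel centers into the image and bilinearly sample
# per-voxel semantic features. Assumed, not the authors' released code.
import torch
import torch.nn.functional as F

def sample_image_features(voxel_centers, img_feats, lidar2img, img_hw):
    """voxel_centers: (N, 3) xyz of non-empty voxels in the LiDAR frame.
    img_feats: (1, C, H, W) image feature map.
    lidar2img: (4, 4) combined extrinsic + intrinsic projection matrix.
    img_hw: (img_h, img_w) of the image the matrix projects into.
    Returns (N, C) per-voxel semantic features (zeros if off-image)."""
    n = voxel_centers.shape[0]
    # Homogeneous coordinates, then project to the image plane.
    pts = torch.cat([voxel_centers, voxel_centers.new_ones(n, 1)], dim=1)
    cam = pts @ lidar2img.T                      # (N, 4)
    depth = cam[:, 2].clamp(min=1e-5)
    uv = cam[:, :2] / depth.unsqueeze(1)         # pixel coordinates (u, v)
    # Normalize (u, v) to [-1, 1] for grid_sample.
    img_h, img_w = img_hw
    grid = torch.stack([uv[:, 0] / (img_w - 1) * 2 - 1,
                        uv[:, 1] / (img_h - 1) * 2 - 1], dim=1)
    sampled = F.grid_sample(img_feats, grid.view(1, 1, n, 2),
                            align_corners=True)  # (1, C, 1, N)
    feats = sampled.squeeze(0).squeeze(1).T      # (N, C)
    # Zero out voxels behind the camera or outside the image bounds.
    valid = (cam[:, 2] > 0) & (grid.abs() <= 1).all(dim=1)
    return feats * valid.unsqueeze(1)
```

Gathering image features only at non-empty voxels in this way keeps the fusion sparse, which reflects the efficiency and geometric fidelity the abstract ascribes to sparse fusion, while the dense BFA branch handles full-BEV semantic alignment.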