Recently, there has been growing research interest in extracting Bird's-Eye-View (BEV) features from images and LiDAR to improve 3D object detection. However, existing methods mainly combine the two sets of features mechanically, which limits how effectively the BEV features are used. To address this limitation, we draw inspiration from TransFusion and design a two-layer transformer decoder to fuse LiDAR and camera BEV features. This design omits the reference-point backprojection and feature-sampling steps, which yields better correlation between the fused LiDAR and image features and greater robustness to errors in the calibration matrix. Furthermore, we add 3D position encoding to the BEV features to compensate for their lack of height information, and we propose a length-width-height modulated attention mechanism to incorporate scale information. Comprehensive experiments verify the effectiveness of our methods.
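To make the fusion idea concrete, the following is a minimal sketch of cross-attention between LiDAR and camera BEV cells, with a position encoding added before attention. It is an illustration under stated assumptions, not the decoder described above: the function names (`encode_position`, `cross_attention`, `fuse_bev`), the toy sinusoidal encoding, the single attention head, and the residual sum are all hypothetical, and the sketch omits learned projections, the second decoder layer, and the length-width-height modulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_position(xyz, dim):
    """Toy sinusoidal encoding of an (x, y, z) position into `dim` channels.

    Hypothetical stand-in for a learned 3D position encoding.
    """
    return [math.sin(xyz[i % 3] * (i // 3 + 1)) for i in range(dim)]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends to all keys."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    channels = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)]
        )
    return outputs


def fuse_bev(lidar_feats, camera_feats, positions):
    """Fuse flattened BEV cells: LiDAR cells query camera cells (assumed layout).

    Each argument is a list with one entry per BEV cell; both modalities are
    assumed to share the same grid, so `positions` applies to both.
    """
    dim = len(lidar_feats[0])
    encodings = [encode_position(p, dim) for p in positions]
    queries = [[f + e for f, e in zip(feat, enc)]
               for feat, enc in zip(lidar_feats, encodings)]
    keys = [[f + e for f, e in zip(feat, enc)]
            for feat, enc in zip(camera_feats, encodings)]
    attended = cross_attention(queries, keys, camera_feats)
    # Residual connection: keep the LiDAR feature and add the attended camera feature.
    return [[l + a for l, a in zip(lf, af)]
            for lf, af in zip(lidar_feats, attended)]
```

Because every LiDAR cell attends over all camera cells directly in BEV space, no per-query projection into the image plane is needed in this sketch, which is the property the abstract attributes to fusing at the BEV level.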
James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, Romain Mueller
Peicheng Shi, Zhiqiang Liu, Xinlong Dong, Aixi Yang
Byeong-Jun Yu, Dongkyu Lee, Jae-Seol Lee, Seok-Cheol Kee
Yuhao Xiao, Xiaohong Chen, Yingkai Wang, Zhongliang Fu