JOURNAL ARTICLE

A Frustum-Aware Fusion Network With Cross-Attention for Multi-Modal 3D Detection

Jingkun Xu, Ke Xiao, Wenjie Ji, Chunlin Li

Year: 2025   Journal: IEEE Access   Vol: 13   Pages: 200777-200790   Publisher: Institute of Electrical and Electronics Engineers

Abstract

Precise 3D object detection via LiDAR–camera fusion is essential for autonomous driving in complex traffic environments. However, existing methods face two substantial challenges: spatial misalignment among heterogeneous sensor modalities and the loss of fine-grained geometric detail during cross-modal feature fusion. To address these issues, we propose a frustum-aware fusion network with cross-attention. The network uses a unified point–voxel representation to integrate image and point cloud data, improving the quality of multi-modal fusion for 3D object detection. Specifically, we present a multi-modal alignment module (MAM) that achieves more accurate cross-modal feature alignment by integrating depth-aware geometric structures from point clouds with high-level semantic features extracted from images. A sliding frustum mechanism then dynamically partitions the point cloud within target regions and extracts its key features, refining the representation of local geometry. In addition, a region-aligned attention module (RAAM) uses 2D detection boxes as queries to guide the refinement and focusing of 3D features, enabling adaptive fusion of multi-scale point–voxel representations. Finally, extensive experiments on the KITTI, NuScenes, and Waymo datasets validate the effectiveness of the proposed framework. On the KITTI test set, our method achieves 3D mAPs of 90.88%, 84.47%, and 80.99% for the Car category under the Easy, Moderate, and Hard settings, respectively. On the NuScenes benchmark, the model attains APs of 89.2% and 64.9% for Pedestrian and Bicycle detection, respectively. On the Waymo dataset, our framework achieves 3D mAPH scores of 82.07% (L1) and 76.38% (L2) across all object categories, a significant improvement over current state-of-the-art approaches.
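
To illustrate the query-guided fusion idea described for the RAAM, the following is a minimal PyTorch sketch (not the authors' implementation) of cross-attention in which embedded 2D detection boxes act as queries over 3D point/voxel features. The module name, box-embedding scheme, and all dimensions below are illustrative assumptions, not details taken from the paper.

# Minimal sketch, assuming 2D boxes given as (x1, y1, x2, y2) and 3D features
# already projected into a shared feature dimension; all names are hypothetical.
import torch
import torch.nn as nn

class BoxQueryCrossAttention(nn.Module):
    def __init__(self, feat_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Embed each 2D box into the shared feature space used by the 3D branch.
        self.box_embed = nn.Sequential(
            nn.Linear(4, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        # Cross-attention: box embeddings query the point/voxel features.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, boxes_2d: torch.Tensor, feats_3d: torch.Tensor) -> torch.Tensor:
        # boxes_2d: (B, Q, 4) 2D detection boxes used as queries
        # feats_3d: (B, N, C) point/voxel features used as keys and values
        queries = self.box_embed(boxes_2d)                 # (B, Q, C)
        refined, _ = self.attn(queries, feats_3d, feats_3d)
        return self.norm(refined + queries)                # (B, Q, C) region-refined features

# Usage example with random tensors.
if __name__ == "__main__":
    raam = BoxQueryCrossAttention()
    out = raam(torch.rand(2, 10, 4), torch.rand(2, 500, 128))
    print(out.shape)  # torch.Size([2, 10, 128])

The residual connection and layer norm are standard choices for attention blocks and are included here only to make the sketch self-contained; the paper's actual fusion and multi-scale handling may differ.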
