JOURNAL ARTICLE

A Frustum-Aware Fusion Network With Cross-Attention for Multi-Modal 3D Detection

Jingkun Xu, Ke Xiao, Wenjie Ji, Chunlin Li

Year: 2025   Journal: IEEE Access   Vol: 13   Pages: 200777-200790   Publisher: Institute of Electrical and Electronics Engineers

Abstract

Precise 3D object detection via LiDAR–camera fusion is essential for autonomous driving in complex traffic environments. However, existing methods face two substantial challenges: spatial misalignment among heterogeneous sensor modalities and the loss of fine-grained geometric detail during cross-modal feature fusion. To address these issues, we propose a frustum-aware fusion network with cross-attention. The network uses a unified point–voxel representation to integrate image and point cloud data, improving the quality of multi-modal fusion for 3D object detection. Specifically, we present a multi-modal alignment module (MAM) that achieves more accurate cross-modal feature alignment by integrating depth-aware geometric structures from point clouds with high-level semantic features extracted from images. A sliding frustum mechanism then dynamically partitions the point cloud within target regions and extracts its key features, refining the representation of local geometry. In addition, a region-aligned attention module (RAAM) uses 2D detection boxes as queries to guide the refinement and focusing of 3D features, enabling adaptive fusion of multi-scale point–voxel representations. Finally, extensive experiments on the KITTI, NuScenes, and Waymo datasets validate the effectiveness of the proposed framework. On the KITTI test set, our method achieves 3D mAPs of 90.88%, 84.47%, and 80.99% for the Car category under the Easy, Moderate, and Hard settings, respectively. On the NuScenes benchmark, the model attains APs of 89.2% and 64.9% for Pedestrian and Bicycle detection, respectively. On the Waymo dataset, our framework achieves 3D mAPH scores of 82.07% (L1) and 76.38% (L2) across all object categories, a significant improvement over current state-of-the-art approaches.
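
To illustrate the query-guided fusion idea described for the RAAM, the following is a minimal PyTorch sketch (not the authors' implementation) of cross-attention in which embedded 2D detection boxes act as queries over 3D point/voxel features. The module name, box-embedding scheme, and all dimensions below are illustrative assumptions, not details taken from the paper.

# Minimal sketch, assuming 2D boxes given as (x1, y1, x2, y2) and 3D features
# already projected into a shared feature dimension; all names are hypothetical.
import torch
import torch.nn as nn

class BoxQueryCrossAttention(nn.Module):
    def __init__(self, feat_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Embed each 2D box into the shared feature space used by the 3D branch.
        self.box_embed = nn.Sequential(
            nn.Linear(4, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        # Cross-attention: box embeddings query the point/voxel features.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, boxes_2d: torch.Tensor, feats_3d: torch.Tensor) -> torch.Tensor:
        # boxes_2d: (B, Q, 4) 2D detection boxes used as queries
        # feats_3d: (B, N, C) point/voxel features used as keys and values
        queries = self.box_embed(boxes_2d)                 # (B, Q, C)
        refined, _ = self.attn(queries, feats_3d, feats_3d)
        return self.norm(refined + queries)                # (B, Q, C) region-refined features

# Usage example with random tensors.
if __name__ == "__main__":
    raam = BoxQueryCrossAttention()
    out = raam(torch.rand(2, 10, 4), torch.rand(2, 500, 128))
    print(out.shape)  # torch.Size([2, 10, 128])

The residual connection and layer norm are standard choices for attention blocks and are included here only to make the sketch self-contained; the paper's actual fusion and multi-scale handling may differ.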
