ABSTRACT Accurate perception and understanding of the three-dimensional environment are crucial for autonomous vehicles to navigate efficiently and make sound decisions. However, in complex real-world scenarios, the information obtained by a single-modal sensor is often incomplete, severely degrading the detection accuracy of occluded targets. To address this issue, this paper proposes a novel adaptive multi-scale attention aggregation strategy that efficiently fuses multi-scale feature representations of heterogeneous data to accurately capture the shape details and spatial relationships of targets in three-dimensional space. The strategy utilises learnable sparse keypoints to dynamically align heterogeneous features in a data-driven manner, adaptively modelling the cross-modal mapping between keypoints and their corresponding multi-scale image features. Because accurate three-dimensional shape information is essential for understanding the size and rotation pose of occluded targets, this paper adopts a shape-prior-based constraint method and a data augmentation strategy to guide the model towards perceiving the complete three-dimensional shape and rotation pose of occluded targets more accurately. Experimental results show that the proposed model improves the 3D R40 mAP score by 2.15%, 3.24% and 2.75% at the easy, moderate and hard difficulty levels, respectively, compared with MVXNet, significantly enhancing the detection accuracy and robustness for occluded targets in complex scenarios.