Recent Transformer-based 3D object detectors learn point cloud features from either point- or voxel-based representations. However, the former requires time-consuming sampling while the latter introduces quantization errors. In this paper, we present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD) that takes advantage of both representations. Specifically, we first use voxel-based sparse convolutions for efficient feature encoding. Then, we propose a Point-Voxel Transformer (PVT) module that cheaply obtains long-range contexts from voxels while attaining accurate positions from points. The key to associating the two different representations is our input-dependent Query Initialization module, which efficiently generates reference points and content queries. PVT then adaptively fuses long-range contextual and local geometric information around the reference points into the content queries. Further, to quickly find the neighboring points of reference points, we design a Virtual Range Image module that generalizes the native range image to multi-sensor and multi-frame settings. Experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method. Code will be available.