Point cloud semantic segmentation is important for road scene perception, a prerequisite for driverless vehicles to achieve full-fledged autonomy. In this work, we introduce Mask Point Transformer Network (MPT-Net), a novel architecture for point cloud segmentation that is simple to implement. MPT-Net consists of a local and global feature encoder and a transformer-based decoder: a 3D Point-Voxel Convolution encoder backbone with voxel self-attention to encode features, and a Mask Point Transformer (MPT) module to decode point features and segment the point cloud. Firstly, we introduce the novel MPT, designed specifically for point cloud segmentation. MPT offers two benefits: it attends to every point in the point cloud using mask tokens to extract class-specific features globally via cross-attention, and it provides inter-class feature information exchange via self-attention on the learned mask tokens. Secondly, we design a backbone that combines sparse point-voxel convolutional blocks with a transformer self-attention block to learn local and global contextual features. We evaluate MPT-Net on large-scale outdoor driving scene point cloud datasets, SemanticKITTI and nuScenes. Our experiments show that by replacing the standard segmentation head with MPT, MPT-Net improves over our baseline approach by 3.8% on SemanticKITTI, achieving state-of-the-art performance, and is highly effective at detecting 'stuff' classes in point clouds.
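The decoding scheme described above — learnable per-class mask tokens that cross-attend to all point features and then exchange information via self-attention — can be sketched as follows. This is a minimal single-head numpy illustration under assumed shapes and names (`mask_tokens`, `mpt_decode`, feature dimension 64), not the authors' implementation, which omits multi-head attention, layer normalization, and feed-forward sublayers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def mpt_decode(point_feats, mask_tokens):
    """point_feats: (N, D) encoded point features.
    mask_tokens: (C, D) one learnable token per semantic class."""
    # Cross-attention: each class token attends to every point,
    # pooling class-specific features globally.
    tokens = mask_tokens + attention(mask_tokens, point_feats, point_feats)
    # Self-attention on the tokens: inter-class feature exchange.
    tokens = tokens + attention(tokens, tokens, tokens)
    # Per-point class logits from similarity to the refined tokens.
    return point_feats @ tokens.T  # (N, C)

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 64))  # 1024 points, 64-dim features
tokens = rng.normal(size=(20, 64))    # e.g. 20 classes (SemanticKITTI-like)
logits = mpt_decode(points, tokens)
print(logits.shape)  # (1024, 20)
```

Segmentation then reduces to an argmax over the class axis of `logits`; the cross-attention step is what lets each class token see the entire point cloud rather than a local neighborhood.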
Xiangqian Li, Xin Tan, Zhizhong Zhang, Yuan Xie, Lizhuang Ma