Autonomous systems require a thorough understanding of their surroundings, encompassing both semantics and 3D geometry. This study focuses on advancing camera-based 3D semantic scene completion. Building on VoxFormer [1], which is recognized for its state-of-the-art performance in 3D semantic scene completion, our approach involves two distinct stages. In the first stage, scene completion is performed from depth images; in the second stage, the final 3D scene completion is performed using a masked autoencoder. To enhance the performance of VoxFormer, we introduce two key modifications. First, we modify the first stage using multi-scale feature maps. Second, we further modify the first stage using a masked autoencoder. Experimental results based on the adapted VoxFormer model in both stages are presented. Both proposed modifications yield notable improvements, particularly on small objects. However, further investigation is needed to optimize and refine them.
Di Lin, Haotian Dong, Enhui Ma, Lubo Wang, Ping Li
Ruochong Fu, Hang Wu, Mengxiang Hao, Yubin Miao
Xinhang Song, Shuqiang Jiang, Luis Herranz