Compared with other sensors, high-quality depth estimation from a monocular camera is highly competitive and widely applied in intelligent transportation and related fields. Although unsupervised learning has greatly lowered the barrier to training, most related works are still based on convolutional neural networks (CNNs), which cannot capture full-stage global information or retain high-resolution features while extracting multi-scale features. To break this predicament, we introduce the vision transformer. However, image embedding produces long token sequences, which makes the vision transformer's attention operation computationally expensive. This work therefore proposes a new pure-transformer backbone, named pooling pyramid vision transformer (PPViT), which simultaneously produces multi-scale features and reduces the sequence length used in the attention operation. We provide two backbone configurations, PPViT10 and PPViT18, whose parameter counts are close to those of the common ResNet18 and ResNet50, respectively. Experiments on the KITTI dataset demonstrate that our work shows great potential for improving model capability and produces results superior to previous CNN-based works. Equally important, our model has lower latency than the related transformer-based work.
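To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of pooling-based attention in the spirit described above: keys and values are spatially average-pooled before attention, shrinking the token sequence and hence the attention cost, similar to the spatial-reduction attention used in pyramid vision transformers. The module name PoolingAttention and the parameter pool_ratio are hypothetical names chosen for illustration.

import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    """Illustrative multi-head attention whose K/V sequence is pooled.

    For N = h * w tokens and pooling ratio r, attention cost drops from
    O(N^2) to O(N * N / r^2). This is a sketch of the general technique,
    not the paper's exact block.
    """

    def __init__(self, dim, num_heads=4, pool_ratio=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Average pooling shrinks the key/value sequence length.
        self.pool = nn.AvgPool2d(pool_ratio, stride=pool_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence from patch embedding, with N = h * w.
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)

        # Reshape tokens back to a spatial map and pool it, so K and V
        # attend over a shorter sequence of length N / r^2.
        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.pool(x_).reshape(b, c, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, c // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N/r^2, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 56 * 56, 64)  # stage-1 tokens for a 224x224 input
    attn = PoolingAttention(dim=64, num_heads=4, pool_ratio=4)
    print(attn(x, 56, 56).shape)  # torch.Size([2, 3136, 64])

Stacking such blocks with strided downsampling between stages would yield the multi-scale feature pyramid that the abstract describes, while the pooled K/V sequence keeps each stage's attention tractable.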