JOURNAL ARTICLE

Pooling Pyramid Vision Transformer for Unsupervised Monocular Depth Estimation

Abstract

Compared with other sensors, high-quality depth estimation based on a monocular camera is highly competitive and widely applicable in fields such as intelligent transportation. Although unsupervised learning has greatly lowered the barrier to training, most related works are still based on convolutional neural networks (CNNs), which fall short in capturing global information at every stage and in preserving high-resolution features while extracting multi-scale features. To overcome this limitation, we introduce a vision transformer. However, the long token sequences produced by image embedding make the attention operation computationally expensive. This work therefore proposes a new pure-transformer backbone named pooling pyramid vision transformer (PPViT), which simultaneously produces multi-scale features and reduces the sequence length used for the attention operation. We provide two backbone settings, PPViT10 and PPViT18, whose parameter counts are close to those of the common ResNet18 and ResNet50, respectively. Experiments on the KITTI dataset demonstrate that our approach shows great potential for improving model capability and produces results superior to previous CNN-based works. Equally important, it has lower latency than the related transformer-based work.
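To make the sequence-length concern in the abstract concrete, the following minimal sketch counts the entries of a single attention score matrix with and without pooling. It is an illustration only, not the PPViT implementation: the abstract does not say where pooling is applied, so the sketch assumes it shrinks the key/value token grid, and the patch size, pooling stride, and 640x192 input resolution are hypothetical values chosen for the arithmetic.

```python
def attention_score_entries(height, width, patch=4, kv_pool=1):
    """Count entries in one attention score matrix (queries x keys).

    patch   -- side length of each square embedding patch (assumed value)
    kv_pool -- pooling stride applied to the key/value token grid (assumed);
               1 means no pooling, i.e. plain global self-attention
    """
    grid_h, grid_w = height // patch, width // patch
    num_queries = grid_h * grid_w
    # Pooling shrinks the key/value grid by kv_pool along each axis.
    num_keys = (grid_h // kv_pool) * (grid_w // kv_pool)
    return num_queries, num_keys, num_queries * num_keys


# Input resolution, patch size, and pooling stride are illustrative assumptions.
full = attention_score_entries(192, 640, patch=4, kv_pool=1)
pooled = attention_score_entries(192, 640, patch=4, kv_pool=8)

print(full)                    # (7680, 7680, 58982400)
print(pooled)                  # (7680, 120, 921600)
print(full[2] // pooled[2])    # 64
```

Under these assumptions, pooling the key/value grid by a factor of 8 per axis cuts the score matrix by a factor of 64 while leaving the number of query tokens, and hence the output resolution of the stage, unchanged.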

Keywords:
Computer science; Transformer; Artificial intelligence; Pooling; Convolutional neural network; Embedding; Computer vision; Pattern recognition (psychology); Engineering; Voltage

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.12
Refs: 31
Citation Normalized Percentile: 0.38

Topics

Advanced Vision and Imaging
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Processing Techniques and Applications
Physical Sciences →  Engineering →  Media Technology
Optical measurement and interference techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition