Compared with other sensors, high-quality depth estimation from a monocular camera is highly competitive and widely applied in intelligent transportation and related fields. Although unsupervised learning has greatly lowered the barrier to training, most related works are still based on convolutional neural networks (CNNs), which cannot capture full-stage global information or retain high-resolution features while extracting multi-scale features. To break this predicament, we introduce the vision transformer. However, image embedding produces long token sequences, which makes the vision transformer's attention operation computationally expensive. This work therefore proposes a new pure-transformer backbone, named pooling pyramid vision transformer (PPViT), which simultaneously produces multi-scale features and reduces the sequence length used in the attention operation. We provide two backbone configurations, PPViT10 and PPViT18, whose parameter counts are close to those of the common ResNet18 and ResNet50, respectively. Experiments on the KITTI dataset demonstrate that our work shows great potential for improving model capability and produces results superior to previous CNN-based works. Equally important, our model has lower latency than the related transformer-based work.
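To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of pooling-based attention in the spirit described above: keys and values are spatially average-pooled before attention, shrinking the token sequence and hence the attention cost, similar to the spatial-reduction attention used in pyramid vision transformers. The module name PoolingAttention and the parameter pool_ratio are hypothetical names chosen for illustration.

import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    """Illustrative multi-head attention whose K/V sequence is pooled.

    For N = h * w tokens and pooling ratio r, attention cost drops from
    O(N^2) to O(N * N / r^2). This is a sketch of the general technique,
    not the paper's exact block.
    """

    def __init__(self, dim, num_heads=4, pool_ratio=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Average pooling shrinks the key/value sequence length.
        self.pool = nn.AvgPool2d(pool_ratio, stride=pool_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence from patch embedding, with N = h * w.
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)

        # Reshape tokens back to a spatial map and pool it, so K and V
        # attend over a shorter sequence of length N / r^2.
        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.pool(x_).reshape(b, c, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, c // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N/r^2, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 56 * 56, 64)  # stage-1 tokens for a 224x224 input
    attn = PoolingAttention(dim=64, num_heads=4, pool_ratio=4)
    print(attn(x, 56, 56).shape)  # torch.Size([2, 3136, 64])

Stacking such blocks with strided downsampling between stages would yield the multi-scale feature pyramid that the abstract describes, while the pooled K/V sequence keeps each stage's attention tractable.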