Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder

Seung-Jun Hwang; Sung Jun Park; Joong-Hwan Baek; Byungkyu Kim

doi:10.1109/jsen.2022.3199265

JOURNAL ARTICLE

Year: 2022; Journal: IEEE Sensors Journal; Vol: 22 (19); Pages: 18762-18770; Publisher: IEEE Sensors Council

Abstract

Depth estimation from monocular camera sensors is an important technique in computer vision. Supervised monocular depth estimation requires large amounts of ground-truth data acquired from depth sensors, but acquiring depth data is expensive and is sometimes impossible because of sensor limitations. View synthesis-based depth estimation is a self-supervised learning approach that does not require depth data for supervision. Previous studies mainly use convolutional neural network (CNN)-based encoders. CNNs are well suited to extracting local features through the convolution operation, whereas recent vision transformers (ViTs) are well suited to extracting global features through multi-head self-attention modules. In this article, we propose a hybrid network that combines CNN and ViT networks for self-supervised monocular depth estimation. We design an encoder-decoder structure that uses CNNs in the earlier stages to extract local features and a ViT in the later stages to extract global features. We evaluate the proposed network through experiments on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) and Cityscapes datasets. The results show higher performance than previous studies with fewer parameters and less computation. Code and trained models are available at https://github.com/fogfog2/manydepthformer.
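
To make the "CNN first, ViT later" idea concrete, the following is a minimal NumPy sketch of a hybrid encoder: strided convolutions extract local features in the early stages, and one multi-head self-attention layer mixes information globally in the last stage. It is an illustration under stated assumptions, not the architecture from the paper or the linked repository. The function names, channel widths, random weights, number of heads, and the single attention layer are all hypothetical choices made for the example.

```python
import numpy as np

def conv3x3(x, w):
    """Local stage: 3x3 convolution, stride 2, zero padding 1, then ReLU.
    x: (C_in, H, W) with even H and W; w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    oh, ow = h // 2, wd // 2
    out = np.zeros((w.shape[0], oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = xp[:, 2 * i:2 * i + 3, 2 * j:2 * j + 3]
            # Each output pixel only sees a 3x3 neighbourhood (local features).
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return np.maximum(out, 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, wq, wk, wv, wo, heads):
    """Global stage: every token attends to every other token.
    tokens: (N, D); D must be divisible by heads."""
    n, d = tokens.shape
    dh = d // heads

    def split(t):  # (N, D) -> (heads, N, dh)
        return t.reshape(n, heads, dh).transpose(1, 0, 2)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return tokens + out @ wo, attn  # residual connection

def init_params(rng, channels=(3, 8, 16)):
    """Hypothetical random weights; channel widths are assumptions."""
    conv = [rng.normal(0, 0.1, (co, ci, 3, 3))
            for ci, co in zip(channels[:-1], channels[1:])]
    d = channels[-1]
    attn = [rng.normal(0, 0.1, (d, d)) for _ in range(4)]  # wq, wk, wv, wo
    return {"conv": conv, "attn": attn}

def hybrid_encoder(image, params, heads=4):
    """Return multi-scale features: CNN stages first, attention stage last."""
    feats = []
    x = image
    for w in params["conv"]:
        x = conv3x3(x, w)
        feats.append(x)
    c, h, wd = x.shape
    tokens = x.reshape(c, h * wd).T  # one token per spatial location: (N, C)
    tokens, attn = multi_head_self_attention(tokens, *params["attn"], heads=heads)
    feats.append(tokens.T.reshape(c, h, wd))
    return feats, attn

rng = np.random.default_rng(0)
image = rng.random((3, 32, 64))  # stand-in for an RGB frame
feats, attn = hybrid_encoder(image, init_params(rng))
print([f.shape for f in feats], attn.shape)
```

In a depth network of this kind, a decoder would consume the returned feature maps at each scale to predict depth; that part, along with the view-synthesis training loss, is omitted here.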

Keywords: Encoder; Artificial intelligence; Computer science; Monocular; Computer vision; Transformer; Pattern recognition (psychology); Engineering; Electrical engineering; Voltage

Metrics

FWCI (Field Weighted Citation Impact): 2.72

Citation Normalized Percentile: 0.90

Topics

Advanced Vision and Imaging

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Optical measurement and interference techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Processing Techniques and Applications

Physical Sciences → Engineering → Media Technology

Related Documents

Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation

HCTNet: Hybrid CNN-Transformer Architecture Network for Self-Supervised Monocular Depth Estimation

Self-Supervised Monocular Depth Estimation Using Hybrid CNN-VMamba Architecture

Multiple prior representation learning for self-supervised monocular depth estimation via hybrid transformer

A Dual Encoder–Decoder Network for Self-Supervised Monocular Depth Estimation