Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Zhe Wang; Yongjia Zou; Jin Lv; 洋 大草; Hongfei Yu

doi:10.1109/access.2024.3494872

JOURNAL ARTICLE

Year: 2024; Journal: IEEE Access; Vol: 12; Pages: 167934-167943; Publisher: Institute of Electrical and Electronics Engineers

Abstract

Self-supervised monocular depth estimation is a promising research area because models can be trained without expensive, hard-to-obtain ground-truth depth labels. Models in this domain typically use Convolutional Neural Networks (CNNs) and Transformers for feature extraction. CNNs excel at capturing local features but struggle with global information because of their limited receptive field; Transformers capture global features but are computationally expensive. To balance performance and computational efficiency, this paper proposes a lightweight self-supervised monocular depth estimation model that integrates CNN and Transformer architectures. The model introduces an Agent Attention mechanism to model global context while significantly reducing computational complexity. In addition, spatial and channel restructured convolution techniques are used to minimize the computational cost of redundant feature extraction in visual tasks. Validation on the KITTI dataset shows that the model reaches an Absolute Relative Error of 0.104 and a Squared Relative Error of 0.757 while keeping the number of parameters nearly constant. Accuracy improved to 0.889, computational complexity (FLOPs) fell to 4.993G, and training time decreased from 15.5 hours to 13.5 hours. With only 3.0M parameters and low computational complexity, the model also generalized well to the Make3D dataset, indicating its suitability for resource-constrained devices.
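
The error figures quoted in the abstract correspond to metrics commonly reported for KITTI depth evaluation. The following is a minimal sketch of how such metrics are usually computed, assuming that the abstract's "accuracy" refers to the threshold accuracy δ < 1.25 (the abstract does not say which accuracy is meant). The function name `depth_metrics` and the sample depth values are hypothetical and are not taken from the paper.

```python
def depth_metrics(pred, gt, threshold=1.25):
    """Return (abs_rel, sq_rel, delta_acc) for paired, strictly positive depths.

    abs_rel   : mean of |pred - gt| / gt
    sq_rel    : mean of (pred - gt)^2 / gt
    delta_acc : fraction of pixels with max(pred/gt, gt/pred) < threshold
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    if any(p <= 0 or g <= 0 for p, g in zip(pred, gt)):
        raise ValueError("depth values must be strictly positive")

    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    delta_acc = sum(max(p / g, g / p) < threshold for p, g in zip(pred, gt)) / n
    return abs_rel, sq_rel, delta_acc


# Hypothetical predicted and ground-truth depths in metres (illustration only).
pred = [2.0, 4.5, 9.0, 30.0]
gt = [2.0, 5.0, 8.0, 20.0]
abs_rel, sq_rel, delta_acc = depth_metrics(pred, gt)
print(f"abs_rel={abs_rel:.4f} sq_rel={sq_rel:.4f} delta<1.25={delta_acc:.2f}")
```

Lower values are better for the two error metrics, and higher is better for the threshold accuracy. In practice these quantities are averaged over all valid ground-truth pixels of a test set rather than over a short list.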

Keywords:

Computer science; Artificial intelligence; Monocular; Computer vision; Transformer; Pattern recognition (psychology); Engineering; Electrical engineering

Metrics

FWCI (Field Weighted Citation Impact): 1.84

Citation Normalized Percentile: 0.82

Topics

Image Processing Techniques and Applications

Physical Sciences → Engineering → Media Technology

Industrial Vision Systems and Defect Detection

Physical Sciences → Engineering → Industrial and Manufacturing Engineering

Advanced Vision and Imaging

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

TinyDepth: Lightweight self-supervised monocular depth estimation based on transformer

LEDepth: A Lightweight Self-Supervised Monocular Depth Estimation Network Combining CNN and Transformer

Spatial-Aware Dynamic Lightweight Self-Supervised Monocular Depth Estimation

Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation