JOURNAL ARTICLE

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Zhe Wang, Yongjia Zou, Jin Lv, 洋 大草, Hongfei Yu

Year: 2024 | Journal: IEEE Access | Vol: 12 | Pages: 167934-167943 | Publisher: Institute of Electrical and Electronics Engineers

Abstract

Self-supervised monocular depth estimation is a promising research area because models can be trained without expensive, hard-to-obtain ground-truth depth labels. Models in this domain typically use Convolutional Neural Networks (CNNs) and Transformers for feature extraction. CNNs excel at capturing local features but struggle with global information because of their limited receptive field. Transformers capture global features but are computationally expensive. To balance accuracy and computational efficiency, this paper proposes a lightweight self-supervised monocular depth estimation model that integrates CNN and Transformer architectures. The model introduces an Agent Attention mechanism to model global context while significantly reducing computational complexity. In addition, spatial and channel restructured convolution techniques are used to cut the computational cost of redundant feature extraction in visual tasks. On the KITTI dataset, the model reaches an Absolute Relative Error of 0.104 and a Squared Relative Error of 0.757 while keeping the number of parameters nearly constant. Accuracy improved to 0.889, computational complexity (FLOPs) was reduced to 4.993G, and training time decreased from 15.5 hours to 13.5 hours. The model also generalized well to the Make3D dataset. With only 3.0M parameters and low computational complexity, it is suitable for resource-constrained devices.
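For readers unfamiliar with the error measures quoted in the abstract, the following is a minimal sketch of how Absolute Relative Error, Squared Relative Error, and threshold accuracy are conventionally computed in depth-estimation benchmarks. It is not the authors' evaluation code. The function name `depth_metrics`, the 1.25 threshold, and the sample values are assumptions for illustration, and the abstract does not state which accuracy threshold the reported 0.889 refers to, nor any depth cap, cropping, or scale alignment used.

```python
def depth_metrics(pred, gt, threshold=1.25):
    """Return (abs_rel, sq_rel, delta_acc) for paired positive depths.

    pred, gt: equal-length sequences of predicted and ground-truth depths.
    threshold: assumed accuracy threshold (1.25 is the common convention).
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    n = len(gt)
    pairs = list(zip(pred, gt))
    # Absolute Relative Error: mean of |d_pred - d_gt| / d_gt
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Squared Relative Error: mean of (d_pred - d_gt)^2 / d_gt
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # Threshold accuracy: fraction of points with max(p/g, g/p) < threshold
    delta_acc = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n
    return abs_rel, sq_rel, delta_acc


# Hypothetical depths, chosen only to make the arithmetic easy to follow.
pred = [10.0, 22.0, 30.0, 8.0]
gt = [10.0, 20.0, 30.0, 4.0]
abs_rel, sq_rel, delta_acc = depth_metrics(pred, gt)
print(abs_rel, sq_rel, delta_acc)
```

Lower values are better for the two error measures, and higher is better for threshold accuracy. In practice these quantities are averaged over the valid pixels of each test image rather than over a short list.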

Keywords:
Computer science; Artificial intelligence; Monocular; Computer vision; Transformer; Pattern recognition (psychology); Engineering; Electrical engineering

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 1.84
Refs: 33
Citation Normalized Percentile: 0.82

Topics

Image Processing Techniques and Applications
Physical Sciences →  Engineering →  Media Technology
Industrial Vision Systems and Defect Detection
Physical Sciences →  Engineering →  Industrial and Manufacturing Engineering
Advanced Vision and Imaging
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition