JOURNAL ARTICLE

Enhancing Monocular Depth Estimation Using Attention in Hybrid Network

Abstract

In deep learning-based Monocular Depth Estimation (MDE), Vision Transformers (ViTs) have attracted substantial attention as a network framework owing to their structure and attention mechanism. However, compared with Convolutional Neural Networks (CNNs), ViTs are limited in capturing spatial features and are less sensitive to local information. This study addresses two gaps: the underutilization of valuable local information by ViTs, and the overlooked contribution of the decoder to performance. We propose a hybrid network that combines the strengths of ViTs and CNNs to strengthen both local information and long-range dependencies. In the encoder stage, we introduce a patch attention mechanism that captures varying levels of attention across different regions. In the decoder stage, we design a cross-attention mechanism to improve feature fusion. In extensive experiments on diverse datasets, including KITTI, DIW, DIODE, and Sintel, our approach learns more effective and representative features and improves performance by up to 13.98% over state-of-the-art methods on the MDE task.
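The abstract does not specify how the decoder's cross-attention is built, so the following is only a minimal sketch of generic scaled dot-product cross-attention, not the authors' design. It assumes the decoder features act as queries and the encoder features act as keys and values; the names `cross_attention`, `decoder_tokens`, and `encoder_tokens` are hypothetical, and learned projections and multiple heads are left out for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of n_q vectors (assumed here: decoder features)
    keys, values: lists of n_k vectors (assumed here: encoder features)
    Returns (fused, weights): n_q fused vectors and an n_q x n_k weight matrix.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(values[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each fused vector is a convex combination of the value vectors.
        fused.append([sum(wj * v[c] for wj, v in zip(w, values)) for c in range(d_v)])
    return fused, weights


# Toy example: 2 decoder tokens attend over 3 encoder tokens of dimension 2.
decoder_tokens = [[1.0, 0.0], [0.0, 1.0]]
encoder_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention(decoder_tokens, encoder_tokens, encoder_tokens)
```

Each row of `weights` sums to 1, so every fused token is a weighted average of the encoder tokens, with more weight on the encoder tokens most similar to the querying decoder token.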

Keywords:
Computer science; Encoder; Artificial intelligence; Convolutional neural network; Fusion mechanism; Monocular; Deep learning; Task (project management); Artificial neural network; Feature (linguistics); Machine learning; Fusion; Engineering

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 32
Citation Normalized Percentile: 0.35

Topics

Industrial Vision Systems and Defect Detection
Physical Sciences →  Engineering →  Industrial and Manufacturing Engineering
Advanced Vision and Imaging
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Optical measurement and interference techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Dual-Stream Multiscale Attention Monocular Depth Estimation Network

Ying Zou, Zhe Chen, Fuliang Yin

Journal: IEEE Internet of Things Journal, Year: 2025, Vol: 12 (13), Pages: 23073-23084
JOURNAL ARTICLE

Patch-Wise Attention Network for Monocular Depth Estimation

Sihaeng Lee, Janghyeon Lee, Byungju Kim, Eojindl Yi, Junmo Kim

Journal: Proceedings of the AAAI Conference on Artificial Intelligence, Year: 2021, Vol: 35 (3), Pages: 1873-1881
JOURNAL ARTICLE

Depth estimation from single monocular images using deep hybrid network

Aleksei Grigorev, Feng Jiang, Seungmin Rho, Worku J. Sori, Shaohui Liu, S.V. Sai

Journal: Multimedia Tools and Applications, Year: 2016, Vol: 76 (18), Pages: 18585-18604
JOURNAL ARTICLE

Efficient unsupervised monocular depth estimation using attention guided generative adversarial network

Sumanta Bhattacharyya, Ju Shen, Stephen Welch, Chen Chen

Journal: Journal of Real-Time Image Processing, Year: 2021, Vol: 18 (4), Pages: 1357-1368