In deep learning-based Monocular Depth Estimation (MDE), Vision Transformers (ViTs) have attracted substantial attention as a network backbone owing to their distinctive structure and powerful attention mechanism. However, compared to Convolutional Neural Networks (CNNs), ViTs are limited in capturing spatial features and are less sensitive to local information. This study addresses the underutilization of valuable local information by ViTs and the often-overlooked contribution of the decoder to overall performance. To tackle these challenges, we propose a hybrid network that leverages the strengths of both ViTs and CNNs to capture local information as well as long-range dependencies. In the encoder, we introduce a patch attention mechanism that assigns varying levels of attention to different regions. In the decoder, a cross-attention mechanism is devised to enhance feature fusion. Extensive experiments on diverse datasets, including KITTI, DIW, DIODE, and Sintel, show that our approach produces more effective and representative features, yielding performance improvements of up to 13.98% over state-of-the-art benchmarks on the MDE task.
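To illustrate the kind of decoder-side fusion the abstract describes, the following is a minimal sketch of a cross-attention fusion block in which decoder tokens query a CNN skip feature map. This is not the authors' released implementation; the class name `CrossAttentionFusion`, the dimensions, and the residual/MLP layout are illustrative assumptions.

```python
# Hypothetical sketch of cross-attention feature fusion (not the paper's exact module):
# decoder (ViT-branch) tokens act as queries, CNN skip features as keys/values.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, dec_tokens: torch.Tensor, cnn_feat: torch.Tensor) -> torch.Tensor:
        # dec_tokens: (B, N, C) decoder tokens used as queries.
        # cnn_feat:   (B, C, H, W) CNN skip feature map used as keys/values.
        kv = cnn_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        q = self.norm_q(dec_tokens)
        kv = self.norm_kv(kv)
        fused, _ = self.attn(q, kv, kv)            # cross-attention
        x = dec_tokens + fused                     # residual fusion
        return x + self.mlp(x)


# Toy usage: a 16x16 token grid attends over a 32x32 CNN skip map.
tokens = torch.randn(2, 16 * 16, 256)
skip = torch.randn(2, 256, 32, 32)
print(CrossAttentionFusion()(tokens, skip).shape)  # torch.Size([2, 256, 256])
```

Under these assumptions, the cross-attention lets each decoder token selectively aggregate fine-grained CNN detail, which is one plausible way to realize the feature fusion the abstract refers to.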