JOURNAL ARTICLE

Sparse Transformer-based bins and Polarized Cross Attention decoder for monocular depth estimation

Hai‐Kun Wang, Jiahui Du, Kechen Song, Limin Cui

Year: 2024
Journal: Engineering Science and Technology, an International Journal
Vol: 54
Article no.: 101705
Publisher: Elsevier BV

Abstract

Estimating depth from a single image is a crucial problem with applications in many computer vision domains. Whereas some recent works obtain the depth map directly through complex and powerful networks, we aim to combine the encoder and decoder feature maps more effectively. To this end, we propose a novel U-Net-like network whose encoder is based on the Swin Transformer. For the decoder, we propose Polarized Cross Attention, which fuses encoder and decoder features by optimizing the initialization of the key (K) and value (V) vectors. To perform a more thorough global analysis of the decoded output, we further propose a Sparse Transformer post-processing module. In this module, Kullback–Leibler divergence is used to obtain a sparse query (Q) matrix, reducing both time complexity and memory usage to O(hw·ln(hw)) for an h × w feature map. Experiments on the KITTI and NYUv2 datasets show that the proposed method improves the accuracy of monocular depth estimation compared with state-of-the-art methods.
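The abstract states only that Kullback–Leibler divergence is used to obtain a sparse Q matrix at O(hw·ln(hw)) cost. That description matches the ProbSparse self-attention of Informer (Zhou et al., 2021), so this minimal NumPy sketch assumes the module works the same way; it is not the authors' implementation. Each query q_i is scored with the max-minus-mean measure M(q_i, K) = max_j(q_i·k_j/√d) − mean_j(q_i·k_j/√d), which approximates the KL divergence between that query's attention distribution and a uniform one. Only the u = ⌈c·ln L⌉ highest-scoring queries receive full softmax attention, where L = hw is the number of tokens in a flattened h × w feature map; the remaining "lazy" queries output the mean of V. The sampling factor c = 5, the single random key subset shared by all queries, and the function names are assumptions for illustration.

```python
import numpy as np


def sparsity_measure(q, k_sample):
    """Max-minus-mean score per query (hypothetical helper).

    Approximates how far each query's attention distribution is, in KL
    divergence, from the uniform distribution, as in Informer's ProbSparse
    attention. Higher scores mean more "active" (peaked) queries.
    """
    scores = q @ k_sample.T / np.sqrt(q.shape[-1])  # (L_q, n_sample)
    return scores.max(axis=1) - scores.mean(axis=1)


def prob_sparse_attention(q, k, v, c=5.0, rng=None):
    """Softmax attention computed only for the top-u queries (assumed design).

    q: (L_q, d), k: (L_k, d), v: (L_k, d_v). Returns (output, selected indices).
    The factor c and the shared random key subset are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    L_q, d = q.shape
    L_k = k.shape[0]
    # Keep u = c*ln(L_q) queries; rank them using n_sample = c*ln(L_k) keys.
    u = max(1, min(L_q, int(np.ceil(c * np.log(L_q)))))
    n_sample = max(1, min(L_k, int(np.ceil(c * np.log(L_k)))))

    # Rank queries against a random key subset: O(L_q * ln L_k) dot products.
    key_idx = rng.choice(L_k, size=n_sample, replace=False)
    m = sparsity_measure(q, k[key_idx])
    top = np.argsort(m)[-u:]  # indices of the u highest-scoring queries

    # Lazy queries (near-uniform attention) fall back to the mean of V.
    out = np.tile(v.mean(axis=0), (L_q, 1))

    # Full softmax attention only for the u selected queries: O(u * L_k).
    scores = q[top] @ k.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    out[top] = weights @ v
    return out, top


if __name__ == "__main__":
    # Flatten a hypothetical 16x16 feature map with 32 channels into hw = 256 tokens.
    rng = np.random.default_rng(0)
    h, w, d = 16, 16, 32
    tokens = rng.standard_normal((h * w, d))
    out, top = prob_sparse_attention(tokens, tokens, tokens, rng=rng)
    print(out.shape, len(top))  # -> (256, 32) 28
```

Ranking costs O(L·ln L) dot products (L queries against c·ln L sampled keys), and the full attention step costs O(u·L) = O(L·ln L), which gives the O(hw·ln(hw)) bound when L = hw. The mean-of-V fallback is what Informer uses for unmasked self-attention; the paper may handle lazy queries differently.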

Keywords:
Monocular, Transformer, Computer science, Artificial intelligence, Engineering

Metrics

Cited by: 2
FWCI (Field Weighted Citation Impact): 1.06
References: 50
Citation Normalized Percentile: 0.66

Topics

Advanced Vision and Imaging
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Image Processing Techniques and Applications
Physical Sciences → Engineering → Media Technology
Image Enhancement Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition