WANG Sicheng, JIANG Hao, CHEN Xiao
At present, deep multi-view stereo (MVS) methods widely introduce Transformers into cascade networks to achieve high-resolution depth estimation, thereby ensuring highly accurate and complete 3D reconstruction. However, Transformer-based methods are limited by their computational cost and cannot be extended to the more refined stages. To solve this problem, this paper proposes a novel cross-scale Transformer-based MVS network that manages feature representations at different stages without incurring additional computation. Specifically, this study introduces an Adaptive Matching-aware Transformer (AMT) that applies different combinations of interactive attention at multiple scales, enabling the proposed network to capture contextual information within each image and to strengthen feature relationships across images. In addition, this study proposes Dual Feature Guided Aggregation (DFGA), which embeds coarse global semantic information into the construction of finer cost volumes, further enhancing the perception of global and local features. A feature metric loss is also designed to evaluate feature deviation before and after the Transformer, thereby reducing the impact of feature mismatch on depth estimation. Experimental results show that the proposed network achieves completeness and overall scores of 0.264 and 0.302, respectively, on the DTU dataset, and average reconstruction scores of 64.28 and 38.03 on the intermediate and advanced scenarios of the Tanks and Temples benchmark, respectively.
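To make the described components concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract: a self-/cross-attention pair of the kind AMT combines at each scale, a DFGA-style fusion that injects upsampled coarse semantic features into a finer stage before cost-volume construction, and an L1 feature metric loss between features before and after the Transformer. All class, function, and argument names here (InteractiveAttentionBlock, dfga_fuse, feature_metric_loss) are hypothetical illustrations under these assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveAttentionBlock(nn.Module):
    """One self-attention + cross-attention pair over flattened (B, H*W, C)
    feature maps. AMT presumably combines such pairs differently at each
    scale; this block only illustrates the basic interaction."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref, src):
        # Intra-image attention: capture contextual information within a view.
        ref = self.norm1(ref + self.self_attn(ref, ref, ref)[0])
        # Inter-image attention: relate reference features to a source view.
        ref = self.norm2(ref + self.cross_attn(ref, src, src)[0])
        return ref

def dfga_fuse(coarse_feat, fine_feat, proj):
    """DFGA-style fusion (hypothetical): upsample coarse global features and
    inject them into the finer stage before cost-volume construction."""
    up = F.interpolate(coarse_feat, size=fine_feat.shape[-2:],
                       mode="bilinear", align_corners=False)
    return proj(torch.cat([fine_feat, up], dim=1))

def feature_metric_loss(feat_before, feat_after):
    """Hypothetical feature metric loss: mean L1 deviation between features
    before and after the Transformer, penalizing feature mismatch."""
    return F.l1_loss(feat_after, feat_before)

# Toy usage at one scale: two views of 32x40 features with 64 channels.
B, H, W, C = 2, 32, 40, 64
ref = torch.randn(B, H * W, C)
src = torch.randn(B, H * W, C)
out = InteractiveAttentionBlock(C)(ref, src)
print("feature metric loss:", feature_metric_loss(ref, out).item())

coarse = torch.randn(B, C, H // 2, W // 2)
fine = torch.randn(B, C, H, W)
fused = dfga_fuse(coarse, fine, nn.Conv2d(2 * C, C, kernel_size=1))
print("fused shape:", tuple(fused.shape))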