Multi-View Stereo (MVS) is a long-standing area of interest in computer vision research. Learning-based MVS approaches typically consist of four steps: 2D CNN feature extraction, variance-based cost aggregation by homography warping, 3D CNN cost regularisation, and depth regression. Existing MVS methods often owe their accuracy to heavy backbones, at the expense of model size, so designing lightweight yet effective models is crucial for applications on low-configuration devices. In this paper, LTMVSNet is proposed for small scenes, with a focus on lightweight feature extraction and cost aggregation. With a lightweight Feature Extraction Transformer (FET) and internal attention, LTMVSNet aggregates global contextual information and improves the handling of low-texture, non-Lambertian, and severely occluded regions. For cost aggregation, LTMVSNet uses epipolar constraints to construct 3D associations between 2D features, reducing the number of depth hypotheses and eliminating the need for additional parameters. Depth maps are propagated through a coarse-to-fine cascade structure. Extensive experiments show that LTMVSNet achieves state-of-the-art performance on the DTU dataset as well as on the Tanks and Temples intermediate set.
Changfei Kong, Ziyi Zhang, Jiafa Mao, Sixian Chan, Weigou Sheng
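Two of the four steps named in the abstract, variance-based cost aggregation and depth regression, can be illustrated with a minimal NumPy sketch. This shows the generic formulation used by conventional learning-based MVS pipelines, not LTMVSNet's own epipolar-constraint aggregation, whose details the abstract does not give. The function names, the array layout `(N, C, D, H, W)`, and the assumption that the source-view features have already been warped onto the reference view's depth planes are all illustrative choices.

```python
import numpy as np


def variance_cost_volume(warped_features):
    """Aggregate N per-view feature volumes into one cost volume.

    warped_features: array of shape (N, C, D, H, W), assumed to be already
    warped onto the reference view's D depth planes (hypothetical layout).
    Returns the per-element variance across views, shape (C, D, H, W).
    """
    feats = np.asarray(warped_features, dtype=np.float64)
    mean = feats.mean(axis=0)
    # Var[x] = E[x^2] - (E[x])^2, computed over the view axis.
    return (feats ** 2).mean(axis=0) - mean ** 2


def soft_argmin_depth(cost, depth_values):
    """Regress a depth map from a regularised cost volume.

    cost: array of shape (D, H, W); lower cost means a better match.
    depth_values: array of shape (D,), the depth hypotheses.
    Returns the expected depth under softmax(-cost), shape (H, W).
    """
    logits = -np.asarray(cost, dtype=np.float64)
    # Subtract the per-pixel maximum for numerical stability.
    logits -= logits.max(axis=0, keepdims=True)
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    depths = np.asarray(depth_values, dtype=np.float64)
    # Weighted sum of depth hypotheses along the depth axis.
    return np.tensordot(depths, prob, axes=(0, 0))
```

Under this formulation, features that agree across all views yield zero variance, so a low cost indicates a photo-consistent depth hypothesis. Reducing the number of depth hypotheses, as the abstract describes, shrinks the `D` axis of both arrays.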