JOURNAL ARTICLE

LTMVSNet: A Lightweight Transformer Network for Multi-View Stereo

Abstract

Multi-View Stereo (MVS) is a long-standing topic in computer vision research. Learning-based MVS approaches typically consist of four steps: 2D CNN feature extraction, variance-based cost aggregation via homography warping, 3D CNN cost regularisation, and depth regression. Existing MVS methods often rely on heavy backbones, gaining accuracy at the expense of model size, so designing lightweight yet effective models is crucial for applications on low-configuration devices. In this paper, LTMVSNet is proposed for small scenes, with a focus on lightweight feature extraction and cost aggregation. With a lightweight Feature Extraction Transformer (FET) and internal attention, LTMVSNet aggregates global contextual information and improves the handling of low-texture, non-Lambertian, and severely occluded regions. For cost aggregation, LTMVSNet uses epipolar constraints to construct 3D associations between 2D features, reducing the number of depth hypotheses and eliminating the need for additional parameters. Depth maps are propagated through a coarse-to-fine cascade structure, and extensive experiments show that LTMVSNet achieves state-of-the-art performance on the DTU dataset as well as the Tanks and Temples intermediate set.
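As a rough illustration of the conventional pipeline the abstract refers to, the sketch below computes a variance-based matching cost for a single pixel and then regresses a depth value with a soft argmin. It is not LTMVSNet's epipolar aggregation or the authors' code. The function names, the toy feature values, and the omission of homography warping and 3D CNN regularisation are all assumptions made for brevity.

```python
import math


def variance_cost(view_features):
    """Matching cost at one pixel for one depth hypothesis.

    view_features: list of V feature vectors (lists of C floats), one per
    view, assumed to be already warped into the reference view.
    Returns the variance across views, averaged over channels.
    """
    num_views = len(view_features)
    num_channels = len(view_features[0])
    total = 0.0
    for c in range(num_channels):
        values = [feat[c] for feat in view_features]
        mean = sum(values) / num_views
        total += sum((v - mean) ** 2 for v in values) / num_views
    return total / num_channels


def soft_argmin_depth(costs, depth_values):
    """Expected depth under softmax(-cost), i.e. a soft argmin over depths."""
    lowest = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - lowest)) for c in costs]
    norm = sum(weights)
    return sum(w * d for w, d in zip(weights, depth_values)) / norm


# Hypothetical toy data: 3 views, 2 channels, 5 depth hypotheses.
# The views agree exactly at the middle hypothesis and drift apart elsewhere.
depth_values = [1.0, 1.25, 1.5, 1.75, 2.0]
costs = []
for k in range(len(depth_values)):
    drift = abs(k - 2) * 3.0
    features = [[1.0 + v * drift, 2.0 - v * drift] for v in range(3)]
    costs.append(variance_cost(features))

# No 3D CNN here: the raw costs are passed directly to the regression step.
depth = soft_argmin_depth(costs, depth_values)
print(costs, depth)
```

In a real network the per-pixel costs form a volume over all pixels and depth hypotheses, and a learned regulariser smooths that volume before the regression step. The abstract states that LTMVSNet replaces the variance step with an aggregation based on epipolar constraints.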

Keywords:
Computer science; Image warping; Artificial intelligence; Feature extraction; Transformer; Computer vision; Pattern recognition (psychology); Voltage

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 34
Citation Normalized Percentile: 0.23

Topics

Advanced Vision and Imaging
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Optical measurement and interference techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Robotics and Sensor-Based Localization
Physical Sciences →  Engineering →  Aerospace Engineering