Monocular video-based 3D reconstruction is a fundamental yet challenging problem in computer vision because of depth ambiguity, scale uncertainty, and limited viewpoint coverage. Traditional geometry-based approaches, including Structure-from-Motion (SfM), Multi-View Stereo (MVS), and SLAM, offer only partial solutions and often yield incomplete or noisy reconstructions. Neural Radiance Fields (NeRF) broke with this paradigm by modelling the scene as a continuous volumetric function that takes 3D coordinates and viewing directions as inputs and outputs colour and volume density, from which photorealistic novel-view images are rendered. This review traces the development of NeRF and its early extensions to monocular video, covering sparse-view adaptations (PixelNeRF, DietNeRF, RegNeRF), dynamic and deformable scene modelling (D-NeRF, NSFF, NeRF-T), and optimization strategies for pose estimation, regularization, and efficiency. We discuss evaluation protocols, datasets, and applications in AR/VR, robotics, cultural heritage, and digital content creation. Finally, we critically reflect on the limitations of NeRF and identify future directions, such as improved priors for monocular input, faster inference, generalizable architectures, and lightweight models. The paper provides a detailed overview of the methods underlying NeRF-based monocular-video reconstruction and of the prerequisites for further progress in this direction.