Learning Monocular Cues in 3D Reconstruction
Abstract
3D reconstruction is a fundamental problem in computer vision. While a wide range of methods has been proposed within the community, most of them have fixed input and output modalities, limiting their usefulness. In an attempt to build bridges between different 3D reconstruction methods, we focus on the most widely available input element -- a single image. While 3D reconstruction from a single 2D image is an ill-posed problem, monocular cues -- e.g. texture gradients and vanishing points -- allow us to build a plausible 3D reconstruction of the scene.
The goal of this thesis is to propose new techniques for learning monocular cues and to demonstrate how they can be used in various 3D reconstruction tasks. We take a data-driven approach and learn monocular cues by training deep neural networks to predict pixel-wise surface normals and depth. For surface normal estimation, we propose a new parameterisation of the surface normal probability distribution and use it to estimate the aleatoric uncertainty associated with the prediction. We also introduce an uncertainty-guided training scheme to improve performance on small structures and near object boundaries. Surface normals provide useful constraints on how the depth should change around each pixel. By using surface normals to propagate depths between pixels, we demonstrate how depth refinement and upsampling can be formulated as a classification problem: choosing which neighbouring pixel to propagate from.
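To illustrate how a surface normal constrains the depth of nearby pixels, the following is a minimal sketch of normal-guided depth propagation. It assumes a pinhole camera and that the neighbouring pixel lies on the tangent plane of the source pixel. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the thesis. The learned step described above, choosing which neighbour to propagate from, is not shown.

```python
import numpy as np

def pixel_ray(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a camera ray with unit z-component.

    With this convention, depth * ray gives the 3D point, and depth is
    the z-coordinate of that point (assumed pinhole model).
    """
    return np.array([(u - cx) / fx, (v - cy) / fy, 1.0])

def propagate_depth(depth_i, normal_i, ray_i, ray_j, eps=1e-8):
    """Depth at pixel j, assuming j lies on the tangent plane of pixel i.

    The plane through X_i = depth_i * ray_i with normal n_i satisfies
    n_i . X = n_i . X_i. Substituting X = depth_j * ray_j gives
    depth_j = depth_i * (n_i . ray_i) / (n_i . ray_j).
    """
    denom = float(np.dot(normal_i, ray_j))
    if abs(denom) < eps:
        return None  # ray j is (nearly) parallel to the tangent plane
    return depth_i * float(np.dot(normal_i, ray_i)) / denom
```

Each neighbour yields one such depth hypothesis for the target pixel, so refinement reduces to selecting (or weighting) among a small set of candidates.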
We then address three challenging 3D reconstruction tasks to demonstrate the usefulness of the learned monocular cues. The first is multi-view depth estimation, where we use the single-view depth probability distribution to improve the efficiency of depth candidate sampling and to enforce consistency between the multi-view depth and the single-view predictions. We also propose an iterative multi-view stereo framework in which the per-pixel depth distributions are updated via sparse multi-view matching. We then address human foot reconstruction and CAD model alignment to show how monocular cues can be exploited in prior-based object reconstruction. The shape of the human foot is parameterised by a generative model, while the CAD model shape is known a priori. We substantially improve the accuracy of both tasks by encouraging the rendered shape to be consistent with the single-view depth and normal predictions.
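As an illustration of how a single-view depth distribution can narrow the search for multi-view depth candidates, the sketch below draws a few per-pixel candidates around the predicted mean. It assumes a per-pixel Gaussian with mean `mu` and standard deviation `sigma`, and uses uniform spacing within `beta` standard deviations. The function name, the parameters and the spacing rule are illustrative assumptions, not the sampling scheme used in the thesis.

```python
import numpy as np

def sample_depth_candidates(mu, sigma, num_candidates=5, beta=3.0, min_depth=1e-3):
    """Per-pixel depth candidates within mu +/- beta * sigma.

    mu, sigma: arrays of shape (H, W) describing an assumed Gaussian
    single-view depth distribution at each pixel.
    Returns an array of shape (H, W, num_candidates), sorted along the
    last axis and clamped to stay positive.
    """
    mu = np.asarray(mu, dtype=np.float64)
    sigma = np.asarray(sigma, dtype=np.float64)
    # Offsets in units of standard deviation, shared by all pixels.
    offsets = np.linspace(-beta, beta, num_candidates)
    candidates = mu[..., None] + offsets * sigma[..., None]
    return np.maximum(candidates, min_depth)
```

Pixels with a confident prediction (small `sigma`) get tightly packed candidates, while uncertain pixels search a wider range, so far fewer candidates are needed than when sweeping the full depth range uniformly.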