Learning Birds-Eye View Representations for Autonomous Driving
Over the past few years, progress towards the ambitious goal of widespread fully-autonomous vehicles on our roads has accelerated dramatically. This progress has been spurred largely by the success of highly accurate LiDAR sensors, as well the use of detailed high-resolution maps, which together allow a vehicle to navigate its surroundings effectively. Often, however, one or both of these resources may be unavailable, whether due to cost, sensor failure, or the need to operate in an unmapped environment. The aim of this thesis is therefore to demonstrate that it is possible to build detailed three-dimensional representations of traffic scenes using only 2D monocular camera images as input. Such an approach faces many challenges: most notably that 2D images do not provide explicit 3D structure. We overcome this limitation by applying a combination of deep learning and geometry to transform image-based features into an orthographic birds-eye view representation of the scene, allowing algorithms to reason in a metric, 3D space. This approach is applied to solving two challenging perception tasks central to autonomous driving.
The first part of this thesis addresses the problem of monocular 3D object detection, which involves determining the size and location of all objects in the scene. Our solution was based on a novel convolutional network architecture that processed features in both the image and birds-eye view perspective. Results on the KITTI dataset showed that this network outperformed existing works at the time, and although more recent works have improved on these results, we conducted extensive analysis to find that our solution performed well in many difficult edge-case scenarios such as objects close to or distant from the camera.
In the second part of the thesis, we consider the related problem of semantic map prediction. This consists of estimating a birds-eye view map of the world visible from a given camera, encoding both static elements of the scene such as pavement and road layout, as well as dynamic objects such as vehicles and pedestrians. This was accomplished using a second network that built on the experience from the previous work and achieved convincing performance on two real-world driving datasets. By formulating the maps as an occupancy grid map (a widely used representation from robotics), we were able to demonstrate how predictions could be accumulated across multiple frames, and that doing so further improved the robustness of maps produced by our system.