Current object class detection methods typically target 2D bounding box localization, encouraged by benchmark data sets such as Pascal VOC. While this seems suitable for the detection of individual objects, higher-level applications, such as autonomous driving and 3D scene understanding, would benefit from more detailed and richer object hypotheses. In this talk, I will present our recent work on building more detailed object class detectors, bridging the gap between higher-level tasks and state-of-the-art object detectors. In the first part, I will present a 3D object class detection method that can reliably estimate the 3D position, orientation, and 3D shape of objects from a single image. Based on state-of-the-art CNN features, the method is a carefully designed 3D detection pipeline in which each step is tuned for better performance, resulting in a registered CAD model for every object in the image.
In the second part of the talk, I will focus on our work on what is holding back convolutional neural nets for detection. We analyze the R-CNN object detection pipeline in combination with state-of-the-art network architectures (AlexNet, GoogLeNet, and VGG16). Focusing on two central questions, what did the convnets learn and what can they learn, we illustrate that the three network architectures suffer from the same weaknesses, and that these weaknesses cannot be alleviated by simply introducing more data. We therefore conclude that architectural changes are needed. Furthermore, we show that additional, synthetically generated training data, sampled from the modes of the data distribution, can further increase the overall detection performance, although the networks still suffer from the same weaknesses. Finally, we hint at the complementary nature of the features of the three network architectures considered in this work.