Supervised learning with deep convolutional networks is the workhorse of the majority of computer vision research today. While much progress has already been made by exploiting deep architectures with standard components, enormous datasets, and massive computational power, I will argue that it pays to scrutinize some of the components of modern deep networks. I will begin by looking at the common pooling operation and show how we can replace standard pooling layers with a perceptually motivated alternative, with consistent gains in accuracy. Next, I will show how we can leverage self-similarity, a well-known concept from the study of natural images, to derive non-local layers for various vision tasks that boost discriminative power. Finally, I will present a lightweight approach to obtaining predictive probabilities in deep networks, allowing us to judge the reliability of a prediction.
Organizers: Michael Black
This talk argues for a fine-grained perspective on human-object interactions in video sequences. I will present approaches for understanding ‘what’ objects one interacts with during daily activities, ‘when’ we should label the temporal boundaries of interactions, ‘which’ semantic labels one can use to describe such interactions and ‘who’ is better when contrasting people performing the same interaction. I will detail my group’s latest work on sub-topics related to: (1) assessing action ‘completion’, that is, when an interaction is attempted but not completed [BMVC 2018], (2) determining skill or expertise from video sequences [CVPR 2018] and (3) finding unequivocal semantic representations for object interactions [ongoing work]. I will also introduce EPIC-KITCHENS 2018, the recently released largest dataset of object interactions in people’s homes, recorded using wearable cameras. The dataset includes 11.5M frames fully annotated with objects and actions, based on unique annotations from the participants narrating their own videos, thus reflecting true intention. Three open challenges are now available on object detection, action recognition and action anticipation [http://epic-kitchens.github.io].
Organizers: Mohamed Hassan
In this talk, I will take an autobiographical approach to explain both where we have come from in computer graphics, starting with the early days of rendering, and where we are going in this new world of smartphones and social media. We are at a point in history where the ability to express oneself with media is unparalleled. The ubiquity and power of mobile devices, coupled with new algorithmic paradigms, are opening new expressive possibilities weekly. At the same time, these new creative media (composite imagery, augmented imagery, short-form video, 3D photos) also offer unprecedented abilities to move freely between what is real and unreal. I will focus on the spaces in between images and video, and in between objective and subjective reality. Finally, I will close with some lessons learned along the way.
In this talk I will present recent work on combining ideas from deformable models with deep learning. I will start by describing DenseReg and DensePose, two recently introduced systems for establishing dense correspondences between 2D images and 3D surface models "in the wild", namely in the presence of background, occlusions, and multiple objects. For DensePose in particular we introduce DensePose-COCO, a large-scale dataset for dense pose estimation, and DensePose-RCNN, a system which operates at multiple frames per second on a single GPU while handling multiple humans simultaneously. I will then present Deforming AutoEncoders, a method for unsupervised dense correspondence estimation. We show that we can disentangle deformations from appearance variation in an entirely unsupervised manner, and also provide promising results for a more thorough disentanglement of images into deformations, albedo and shading. Time permitting, we will discuss a parallel line of work aiming at combining grouping with deep learning, and see how both grouping and correspondence can be understood as establishing associations between neurons.
Organizers: Vassilis Choutas
The reconstruction of 3D scenes and their appearance from imagery is one of the longest-standing problems in computer vision. Originally developed to support robotics and artificial intelligence applications, it has found some of its most widespread use in support of interactive 3D scene visualization. One of the keys to this success has been the melding of 3D geometric and photometric reconstruction with a heavy re-use of the original imagery, which produces more realistic rendering than a pure 3D model-driven approach. In this talk, I give a retrospective of two decades of research in this area, touching on topics such as sparse and dense 3D reconstruction, the fundamental concepts in image-based rendering and computational photography, applications to virtual reality, as well as ongoing research in the areas of layered decompositions and 3D-enabled video stabilization.
Organizers: Mohamed Hassan
Humans act upon their environment through motion; the ability to plan their movements is therefore an essential component of their autonomy. In recent decades, motion planning has been widely studied in robotics and computer graphics. Nevertheless, robots still fail to achieve human reactivity and coordination. The need for more efficient motion planning algorithms has been present throughout my own research on "human-aware" motion planning, which aims to take the surrounding humans explicitly into account. I believe imitation learning is the key to this particular problem, as it allows learning both new motion skills and predictive models, two capabilities that are at the heart of "human-aware" robots, while simultaneously holding the promise of faster and more reactive motion generation. In this talk I will present my work in this direction.
Two talks for the price of one! I will present my recent work on the challenging problem of stereo matching of scenes with little or no surface texture, attacking the problem from two very different angles. First, I will discuss how surface orientation priors can be added to the popular semi-global matching (SGM) algorithm, which significantly reduces errors on slanted weakly-textured surfaces. The orientation priors serve as a soft constraint during matching and can be derived in a variety of ways, including from low-resolution matching results and from monocular analysis and Manhattan-world assumptions. Second, we will examine the pathological case of Mondrian Stereo -- synthetic scenes consisting solely of solid-colored planar regions, resembling paintings by Piet Mondrian. I will discuss assumptions that allow disambiguating such scenes, present a novel stereo algorithm employing symbolic reasoning about matched edge segments, and discuss how similar ideas could be utilized in robust real-world stereo algorithms for untextured environments.
Organizers: Anurag Ranjan
Non-planar object deformations result in challenging but informative signal variations. We aim to recover this information in a feedforward manner by employing discriminatively trained convolutional networks. We formulate the task as a regression problem and train our networks by leveraging manually annotated correspondences between images and 3D surfaces. In this talk, the focus will be on our recent work "DensePose", where we form the "COCO-DensePose" dataset by introducing an efficient annotation pipeline to collect correspondences between 50K persons appearing in the COCO dataset and the SMPL 3D deformable human-body model. We use our dataset to train CNN-based systems that deliver dense correspondences 'in the wild', namely in the presence of background, occlusions, multiple objects and scale variations. We experiment with fully-convolutional networks and a region-based DensePose-RCNN model and observe that the latter performs better; we further improve accuracy through cascading, obtaining a system that delivers highly accurate results in real time (http://densepose.org).
Organizers: Georgios Pavlakos
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. In this talk, I will present my past and current work on Zero-Shot Learning, Vision and Language for Generative Modeling and Explainable Artificial Intelligence, covering (1) how we can generalize image classification models to cases where no visual training data is available, (2) how to generate images and image features using detailed visual descriptions, and (3) how our models focus on discriminating properties of the visible object, jointly predict a class label, and explain why the predicted label is appropriate for the image whereas another label is not.
Organizers: Andreas Geiger
Complex shapes can be summarized using a coarsely defined structure which is consistent and robust across a variety of observations. However, existing synthesis techniques do not consider structural decomposition during synthesis, causing the generation of implausible or structurally unrealistic shapes. We explore how structure-aware reasoning can benefit existing generative techniques for complex 2D and 3D shapes. We evaluate our methodology on a 3D dataset of chairs and a 2D dataset of typefaces.
Organizers: Sergi Pujades