Non-planar object deformations result in challenging but informative signal variations. We aim to recover this information in a feedforward manner by employing discriminatively trained convolutional networks. We formulate the task as a regression problem and train our networks using manually annotated correspondences between images and 3D surfaces. In this talk, the focus will be on our recent work "DensePose", where we form the "COCO-DensePose" dataset by introducing an efficient annotation pipeline to collect correspondences between 50K persons appearing in the COCO dataset and the SMPL 3D deformable human-body model. We use our dataset to train CNN-based systems that deliver dense correspondences 'in the wild', namely in the presence of background, occlusions, multiple objects and scale variations. We experiment with fully-convolutional networks and the region-based DensePose-RCNN model and find the latter to be superior; we further improve accuracy through cascading, obtaining a system that delivers highly accurate results in real time (http://densepose.org).
Organizers: Georgios Pavlakos
Clearly explaining the rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account the class-discriminative image aspects which justify visual predictions. In this talk, I will present my past and current work on Zero-Shot Learning, Vision and Language for Generative Modeling and Explainable Artificial Intelligence, addressing (1) how we can generalize image classification models to cases where no visual training data is available, (2) how to generate images and image features using detailed visual descriptions, and (3) how our models focus on discriminating properties of the visible object, jointly predict a class label, and explain why the predicted label is appropriate for the image whereas another label is not.
Organizers: Andreas Geiger
Complex shapes can be summarized using a coarsely defined structure which is consistent and robust across a variety of observations. However, existing synthesis techniques do not consider structural decomposition during synthesis, which leads to implausible or structurally unrealistic shapes. We explore how structure-aware reasoning can benefit existing generative techniques for complex 2D and 3D shapes. We evaluate our methodology on a 3D dataset of chairs and a 2D dataset of typefaces.
Organizers: Sergi Pujades
Organizers: Ahmed Osman
Visual perception involves a complex interaction between feedforward and feedback processes. A mechanistic understanding of this processing, and of its limitations, is a necessary first step towards elucidating key aspects of perceptual functions and dysfunctions. In this talk, I will review our ongoing effort towards understanding how feedback visual processing operates at the level of the thalamus, a dynamic relay station halfway between the retina and the cortex. I will present experimental evidence from several recent electrophysiology studies performed on subjects engaged in visual detection tasks. The results show that modulatory driving provided by top-down processes (the feedback from primary visual cortex) critically influences the ongoing thalamic activity and shapes the message to be delivered to the cortex. When neuromodulatory techniques (Transcranial Magnetic Stimulation or static magnetic fields) are used to transiently disrupt cortical activity, two very interesting effects show up: (1) stimulus detection is altered and (2) the spatial properties of thalamic receptive fields are dramatically modified. Finally, I will show how sensory information can be a powerful tool to interact with the motor system and re-organize altered patterns of movement in neurological disorders such as Parkinson's disease.
Organizers: Daniel Cudeiro
Disney Research has been actively pushing the state of the art in digitizing humans over the past decade, impacting both academia and industry. In this talk I will give an overview of a few selected projects in this area, from research into production. I will be talking about photogrammetric shape acquisition and dense performance capture for faces, eye and teeth scanning and parameterization, as well as physically based capture and modelling for hair and volumetric tissues.
Organizers: Timo Bolkart
The definition of art has been debated for more than 1000 years, and continues to be a puzzle. While scientific investigations offer hope of resolving this puzzle, machine learning classifiers that discriminate art from non-art images generally do not provide an explicit definition, and brain imaging and psychological theories are at present too coarse to provide a formal characterization. In this work, rather than approaching the problem using a machine learning approach trained on existing artworks, we hypothesize that art can be defined in terms of preexisting properties of the visual cortex. Specifically, we propose that a broad subset of visual art can be defined as patterns that are exciting to a visual brain. Resting on the finding that artificial neural networks trained on visual tasks can provide predictive models of processing in the visual cortex, our definition is operationalized by using a trained deep net as a surrogate “visual brain”, where “exciting” is defined as the activation energy of particular layers of this net. We find that this definition easily discriminates a variety of art from non-art, and further provides a ranking of art genres that is consistent with our subjective notion of ‘visually exciting’. By applying a deep net visualization technique, we can also validate the definition by generating example images that would be classified as art. The images synthesized under our definition resemble visually exciting art such as Op Art and other human-created artistic patterns.
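To make the notion of "activation energy" concrete, here is a minimal sketch. It rests on assumptions that are not part of the abstract: the energy is taken to be the mean squared ReLU activation of a single layer, and the "layer" is a pair of hand-made edge filters standing in for a trained deep net. The function names `conv2d_valid` and `activation_energy` are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Valid cross-correlation of a (H, W) image with (K, kh, kw) kernels."""
    _, kh, kw = kernels.shape
    # windows has shape (H - kh + 1, W - kw + 1, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(image, (kh, kw))
    return np.einsum("ijab,kab->kij", windows, kernels)

def activation_energy(image, kernels):
    """Mean squared ReLU activation of one layer (an assumed energy measure)."""
    activations = np.maximum(conv2d_valid(image, kernels), 0.0)
    return float(np.mean(activations ** 2))

# Stand-in "layer": two hand-made edge filters, NOT weights of a trained net.
kernels = np.array([
    [[-1.0, 1.0], [-1.0, 1.0]],   # responds to dark-to-bright vertical edges
    [[-1.0, -1.0], [1.0, 1.0]],   # responds to dark-to-bright horizontal edges
])

flat = np.full((8, 8), 0.5)             # uniform gray image
stripes = np.tile([0.0, 1.0], (8, 4))   # high-contrast vertical stripes, 8 x 8

flat_energy = activation_energy(flat, kernels)
stripes_energy = activation_energy(stripes, kernels)
print(flat_energy, stripes_energy)
```

Under these assumptions the uniform image produces no activation at all, while the striped pattern, loosely reminiscent of Op Art, produces a clearly positive energy. Which layers and which energy measure the actual work uses is not stated here.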
Organizers: Michael Black
One of the central problems of artificial intelligence is machine perception, i.e., the ability to understand the visual world based on input from sensors such as cameras. In this talk, I will present recent progress with respect to data generation using weak annotations, motion information and synthetic data. I will also discuss our recent results for action recognition, where human tubes and tubelets have proven successful. Our tubelets move away from state-of-the-art frame-based approaches and improve classification and localization by relying on joint information from several frames. I will also show how to extend this type of method to weakly supervised learning of actions, which allows us to scale to large amounts of data with sparse manual annotation. Furthermore, I will discuss several recent extensions, including 3D pose estimation.
Organizers: Ahmed Osman
Quantifying behavior is crucial for many applications in neuroscience. Videography provides easy methods for the observation and recording of animal behavior in diverse settings, yet extracting particular aspects of a behavior for further analysis can be highly time-consuming. In motor control studies, humans or other animals are often marked with reflective markers to assist with computer-based tracking, yet markers are intrusive (especially for smaller animals), and the number and location of the markers must be determined a priori. Here, we present a highly efficient method for markerless tracking based on transfer learning with deep neural networks that achieves excellent results with minimal training data. We demonstrate the versatility of this framework by tracking various body parts in a broad collection of experimental settings: odor trail-tracking in mice, egg-laying behavior in Drosophila, and mouse hand articulation in a skilled forelimb task. For example, during the skilled reaching behavior, individual joints can be automatically tracked (and a confidence score is reported). Remarkably, even when only a small number of frames are labeled (≈200), the algorithm achieves excellent tracking performance on test frames that is comparable to human accuracy.
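The idea of reporting a confidence score alongside each tracked joint can be sketched as follows. This is an illustration under assumptions, not the method's actual output format: it supposes the network yields one score map per body part with values in [0, 1], takes the peak as the predicted location, and uses the peak value as the confidence. The function `locate_body_parts` and the threshold `min_confidence` are hypothetical.

```python
import numpy as np

def locate_body_parts(score_maps, min_confidence=0.5):
    """Return (row, col, confidence, is_reliable) for each body-part score map.

    score_maps: array of shape (num_parts, H, W), values assumed in [0, 1].
    """
    results = []
    for part_map in score_maps:
        # The location of the highest score is taken as the predicted position.
        row, col = np.unravel_index(int(np.argmax(part_map)), part_map.shape)
        confidence = float(part_map[row, col])
        results.append((int(row), int(col), confidence,
                        confidence >= min_confidence))
    return results

# Toy score maps for two body parts on a 5 x 5 frame.
maps = np.zeros((2, 5, 5))
maps[0, 1, 3] = 0.9   # clearly visible part
maps[1, 4, 0] = 0.2   # weak response, e.g. an occluded part

detections = locate_body_parts(maps)
print(detections)
```

A downstream analysis could then drop or interpolate frames in which a part is flagged as unreliable.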
Organizers: Melanie Feldhofer
Human shape estimation is an important task for video editing, animation and the fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of these yields a performance improvement, as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.
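As a rough illustration of losses (i) and (ii), the sketch below assumes that the network predicts a per-voxel occupancy probability, that the volumetric loss is a binary cross-entropy over voxels, and that re-projection is an orthographic projection obtained by taking the maximum along one axis of the grid. These choices, the axis convention, and all function names are assumptions made for the example; the abstract does not specify them.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))

def volumetric_loss(pred_voxels, gt_voxels):
    """Per-voxel loss on a (height, width, depth) occupancy grid."""
    return binary_cross_entropy(pred_voxels, gt_voxels)

def reprojection_loss(pred_voxels, gt_front, gt_side):
    """Compare projected silhouettes from two views with ground truth."""
    front = pred_voxels.max(axis=2)   # collapse depth -> (height, width)
    side = pred_voxels.max(axis=1)    # collapse width -> (height, depth)
    return (binary_cross_entropy(front, gt_front)
            + binary_cross_entropy(side, gt_side))

# Toy ground truth: a 2 x 2 x 2 solid block inside a 4 x 4 x 4 grid.
gt = np.zeros((4, 4, 4))
gt[1:3, 1:3, 1:3] = 1.0
gt_front, gt_side = gt.max(axis=2), gt.max(axis=1)

perfect = gt.copy()              # prediction identical to ground truth
uniform = np.full_like(gt, 0.5)  # maximally uncertain prediction

perfect_total = (volumetric_loss(perfect, gt)
                 + reprojection_loss(perfect, gt_front, gt_side))
uniform_total = (volumetric_loss(uniform, gt)
                 + reprojection_loss(uniform, gt_front, gt_side))
print(perfect_total, uniform_total)
```

In this toy setting the perfect prediction drives both terms close to zero, while the uncertain one is penalized by both; how the two terms are weighted against each other and against the intermediate supervision is left open here.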