Current solutions to discriminative and generative tasks in computer vision exist separately and often lack interpretability and explainability. Using faces as our application domain, here we present an architecture that is based around two core ideas that address these issues: first, our framework learns an unsupervised, low-dimensional embedding of faces using an adversarial autoencoder that is able to synthesize high-quality face images. Second, a supervised disentanglement splits the low-dimensional embedding vector into four sub-vectors, each of which contains separated information about one of four major face attributes (pose, identity, expression, and style) that can be used both for discriminative tasks and for manipulating all four attributes in an explicit manner. The resulting architecture achieves state-of-the-art image quality, good discrimination and face retrieval results on each of the four attributes, and supports various face editing tasks using a face representation of only 99 dimensions. Finally, we apply the architecture's robust image synthesis capabilities to visually debug label-quality issues in an existing face dataset.
Organizers: Timo Bolkart
Optics with long focal lengths have been used extensively for shooting 2D cinema and television, either to virtually get closer to the scene or to produce an aesthetic effect through the deformation of the perspective. However, in 3D cinema or television, the use of long focal lengths either creates a "cardboard effect" or causes visual divergence. To overcome this problem, state-of-the-art methods use disparity mapping techniques, which are a generalization of view interpolation, and generate new stereoscopic pairs from the two image sequences. We propose to use more than two cameras to solve the remaining issues in disparity mapping methods. In the first part of the talk, we briefly review the causes of visual fatigue and visual discomfort when viewing a stereoscopic film. We model the depth perception from stereopsis of a 3D scene shot with two cameras, and projected in a movie theater or on a 3DTV. We mathematically characterize the resulting 3D distortion, and derive the mathematical constraints associated with the causes of visual fatigue and discomfort. We illustrate these 3D distortions with a new interactive software tool, "The Virtual Projection Room". In order to generate the desired stereoscopic images, we propose to use image-based rendering. These techniques usually proceed in two stages. First, the input images are warped into the target view, and then the warped images are blended together. The warps are usually computed with the help of a geometric proxy (either implicit or explicit). Image blending has been extensively addressed in the literature and a few heuristics have proven to achieve very good performance. Yet the combination of the heuristics is not straightforward, and requires manual adjustment of many parameters. We present a new Bayesian approach to the problem of novel view synthesis, based on a generative model taking into account the uncertainty of the image warps in the image formation model.
The Bayesian formalism allows us to deduce the energy of the generative model and to compute the desired images as the Maximum a Posteriori estimate. The method outperforms state-of-the-art image-based rendering techniques on challenging datasets. Moreover, the energy equations provide a formalization of the heuristics widely used in image-based rendering techniques. In addition, the proposed generative model also addresses the problem of super-resolution, allowing images to be rendered at a higher resolution than the initial ones. In the last part of the presentation, we apply the new rendering technique to the case of the stereoscopic zoom.
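The link between on-screen disparity, perceived depth, and visual divergence mentioned in this abstract can be illustrated with the standard similar-triangles model of a viewer in front of a screen. The following is only a minimal sketch under stated assumptions, not the speaker's model: the function name, the parameter names, and the default values (65 mm eye separation, 10 m viewing distance) are chosen here for illustration.

```python
import math


def perceived_depth(disparity_m: float,
                    eye_separation_m: float = 0.065,
                    screen_distance_m: float = 10.0) -> float:
    """Distance from the viewer at which a point is perceived.

    disparity_m is the horizontal offset on the screen between the
    right-eye and left-eye images of the point (positive = behind the
    screen, negative = in front of it). All parameters are illustrative
    assumptions, in metres.
    """
    if disparity_m > eye_separation_m:
        # The two lines of sight no longer intersect in front of the
        # viewer: the eyes would have to diverge.
        raise ValueError("disparity exceeds eye separation (divergence)")
    if disparity_m == eye_separation_m:
        # Parallel lines of sight: the point is perceived at infinity.
        return math.inf
    # Similar triangles: d / b = (Z - H) / Z  =>  Z = b * H / (b - d)
    return eye_separation_m * screen_distance_m / (eye_separation_m - disparity_m)


on_screen = perceived_depth(0.0)        # zero disparity: on the screen plane
in_front = perceived_depth(-0.065)      # crossed disparity: in front of the screen
at_infinity = perceived_depth(0.065)    # disparity equal to the eye separation
```

Under this simplified model, any disparity larger than the viewer's eye separation forces divergence, which is why disparities that are acceptable on a small display can become uncomfortable when the same images are projected on a large screen.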
The visual effects and entertainment industries are now a fundamental part of the computer graphics and vision landscapes - as well as having an impact across society in general. Key issues in this area include the creation of realistic characters, creating assets for production, and improving workflow. Advances in computer graphics, vision and rendering have underpinned much of the success of these industries, built on top of academic advances. However, there are still many unsolved problems. In this talk I will outline some of the challenges we have faced in transferring academic research into the visual effects industry. In particular, I will attempt to distinguish between the academic challenges and the industrial demands we have experienced - and how this has impacted projects. This draws on experience in several themes involving leading visual effects and entertainment companies. Our work has been in several diverse areas, including on-set capture, digital doubles, real-time animation and motion capture retargeting. I will describe how many of these problems led us to step back and focus on first solving more fundamental computer vision research problems - particularly in the areas of optical flow, non-rigid tracking and shadow removal - and how these opened up other opportunities. Some of these projects are supported through our Centre for Digital Entertainment (CDE) - which has 60 PhD-level students embedded across the creative industries in the UK. Others are more specific to partners at The Imaginarium and Double Negative Visual Effects. Attempting to draw these experiences together, we are now starting a new Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA), with leading partners across entertainment, elite sport and rehabilitation.
Organizers: Silvia Zuffi
Current object class detection methods typically target 2D bounding box localization, encouraged by benchmark data sets such as Pascal VOC. While this seems suitable for the detection of individual objects, higher-level applications, such as autonomous driving and 3D scene understanding, would benefit from more detailed and richer object hypotheses. In this talk I will present our recent work on building more detailed object class detectors, bridging the gap between higher-level tasks and state-of-the-art object detectors. I will present a 3D object class detection method that can reliably estimate the 3D position, orientation and 3D shape of objects from a single image. Based on state-of-the-art CNN features, the method is a carefully designed 3D detection pipeline where each step is tuned for better performance, resulting in a registered CAD model for every object in the image. In the second part of the talk, I will focus on our work on what is holding back convolutional neural nets for detection. We analyze the R-CNN object detection pipeline in combination with state-of-the-art network architectures (AlexNet, GoogLeNet and VGG16). Focusing on two central questions, what did the convnets learn and what can they learn, we illustrate that the three network architectures suffer from the same weaknesses, and that these weaknesses cannot be alleviated by simply introducing more data. Therefore we conclude that architectural changes are needed. Furthermore, we show that additional, synthetically generated training data, sampled from the modes of the data distribution, can further increase the overall detection performance, while still suffering from the same weaknesses. Last, we hint at the complementary nature of the features of the three network architectures considered in this work.
Most computer vision systems cannot take advantage of the abundance of Internet videos as training data. This is because current methods typically learn under strong supervision and require expensive manual annotations (e.g. videos need to be temporally trimmed to cover the duration of a specific action, objects need to be annotated with bounding boxes, etc.). In this talk, I will present two techniques that can lead to learning the behavior and the structure of articulated object classes (e.g. animals) from videos, with as little human supervision as possible. First, we discover the characteristic motion patterns of an object class from videos of objects performing natural, unscripted behaviors, such as tigers in the wild. Our method generates temporal intervals that are automatically trimmed to one instance of the discovered behavior, and clusters them by type (e.g. running, turning head, drinking water). Second, we automatically recover thousands of spatiotemporal correspondences within the discovered clusters of behavior, which allow mapping pixels of an instance in one video to those of a different instance in a different video. Both techniques rely on a novel motion descriptor modeling the relative displacement of pairs of trajectories, which is more suitable for articulated objects than state-of-the-art descriptors using single trajectories. We provide extensive quantitative evaluation on our new dataset of tiger videos, which contains more than 100k fully annotated frames.
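To give a feel for why a descriptor built on pairs of trajectories suits articulated objects, here is a minimal sketch. It is not the descriptor presented in the talk: the function name and the specific definition (frame-to-frame change of the offset between two tracked points) are assumptions made purely for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pair_displacement_descriptor(traj_a: List[Point],
                                 traj_b: List[Point]) -> List[Point]:
    """Frame-to-frame change of the offset between two point trajectories.

    Hypothetical illustration: each trajectory is a list of (x, y)
    positions, one per frame, covering the same frames.
    """
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must cover the same frames")
    # Offset of point b relative to point a in every frame.
    offsets = [(bx - ax, by - ay)
               for (ax, ay), (bx, by) in zip(traj_a, traj_b)]
    # How that offset changes between consecutive frames.
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(offsets, offsets[1:])]


# Two points on the same rigid part, translating together.
rigid = pair_displacement_descriptor(
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 1), (2, 1)],
)

# The second point sits on a different part that moves away from the first.
articulated = pair_displacement_descriptor(
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 2), (2, 3)],
)
```

In this toy setting the shared translation cancels out, so `rigid` is all zeros while `articulated` records only the relative motion between the two parts, which a descriptor based on a single trajectory would mix with the global motion.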
Organizers: Laura Sevilla
The external world is represented in the brain as spatiotemporal patterns of electrical activity. Sensory signals, such as light, sound, and touch, are transduced at the periphery and subsequently transformed by various stages of neural circuitry, resulting in increasingly abstract representations through the sensory pathways of the brain. It is these representations that ultimately give rise to sensory perception. Deciphering the messages conveyed in the representations is often referred to as “reading the neural code”. True understanding of the neural code requires knowledge of not only the representation of the external world at one particular stage of the neural pathway, but ultimately how sensory information is communicated from the periphery to successive downstream brain structures. Our laboratory has focused on various challenges posed by this problem, some of which I will discuss. In contrast, prosthetic devices designed to augment or replace sensory function rely on the principle of artificially activating neural circuits to induce a desired perception, which we might refer to as “writing the neural code”. This poses significant challenges not only in biomaterials and interfaces, but also in knowing precisely what to tell the brain to do. Our laboratory has begun some preliminary work in this direction that I will discuss. Taken together, an understanding of these complexities and others is critical for understanding how information about the outside world is acquired and communicated to downstream brain structures, for relating spatiotemporal patterns of neural activity to sensory perception, and for the development of engineered devices for replacing or augmenting sensory function lost to trauma or disease.
Organizers: Jonas Wulff
Learning of layered or "deep" representations has provided significant advances in computer vision in recent years, but has traditionally been limited to fully supervised settings with very large amounts of training data. New results show that such methods can also excel when learning in sparse/weakly labeled settings across modalities and domains. I'll present our recent long-term recurrent network model which can learn cross-modal translation and can provide open-domain video to text transcription. I'll also describe state-of-the-art models for fully convolutional pixel-dense segmentation from weakly labeled input, and finally will discuss new methods for adapting deep recognition models to new domains with few or no target labels for categories of interest.
Organizers: Jonas Wulff
I will talk about two types of machine learning problems, which are important but have received little attention. The first type comprises problems naturally formulated as learning a one-to-many mapping, which can handle the inherent ambiguity in tasks such as generating segmentations or captions for images. The second type involves learning representations that are invariant to certain nuisance or sensitive factors of variation in the data while retaining as much of the remaining information as possible. The primary approach we formulate for both problems is a constrained form of joint embedding in a deep generative model that can develop informative representations of sentences and images. Applications discussed will include image captioning, question-answering, segmentation, classification without discrimination, and domain adaptation.
Organizers: Gerard Pons-Moll
During the last three decades computer graphics has established itself as a core discipline within computer science and information technology. Two decades ago, most digital content was textual. Today it has expanded to include audio, images, video, and a variety of graphical representations. New and emerging technologies such as multimedia, social networks, digital television, digital photography and the rapid development of new sensing devices, telecommunication and telepresence, virtual reality, or 3D-internet further indicate the potential of computer graphics in the years to come. Typical of the field is the coincidence of very large data sets with the demand for fast, and possibly interactive, high-quality visual feedback. Furthermore, the user should be able to interact with the environment in a natural and intuitive way. In order to address the challenges mentioned above, a new and more integrated scientific view of computer graphics is required. In contrast to the classical approach to computer graphics, which takes as input a scene model -- consisting of a set of light sources, a set of objects (specified by their shape and material properties), and a camera -- and uses simulation to compute an image, we prefer to take the more integrated view of '3D Image Analysis and Synthesis' for our research. We consider the whole pipeline, from data acquisition through data processing to rendering, in our work. In our opinion, this point of view is necessary in order to exploit the capabilities and perspectives of modern hardware, both on the input (sensors, scanners, digital photography, digital video) and output (graphics hardware, multiple platforms) side.
Our vision and long term goal is the development of methods and tools to efficiently handle the huge amount of data during the acquisition process, to extract structure and meaning from the abundance of digital data, and to turn this into graphical representations that facilitate further processing, rendering, and interaction. In this presentation I will highlight some of our ongoing research by means of examples. Topics covered include 3D reconstruction and digital geometry processing, shape analysis and shape design, motion and performance capture, and 3D video processing.
Learnable representations, and deep convolutional neural networks (CNNs) in particular, have become the preferred way of extracting visual features for image understanding tasks, from object recognition to semantic segmentation. In this talk I will discuss several recent advances in deep representations for computer vision. After reviewing modern CNN architectures, I will give an example of a state-of-the-art network in text spotting; in particular, I will show that, by using only synthetic data and a sufficiently large deep model, it is possible to directly map image regions to English words, a classification problem with 90K classes, obtaining in this manner state-of-the-art performance in text spotting. I will also briefly touch on other applications of deep learning to object recognition and discuss feature universality and transfer learning. In the last part of the talk I will move to the problem of understanding deep networks, which remain largely black boxes, presenting two possible approaches to their analysis. The first consists of visualisation techniques that can investigate the information retained and learned by a visual representation. The second is a method that allows exploring how representations capture geometric notions such as image transformations, and finding whether and how different representations are related.
Recent progress in computer-based visual recognition heavily relies on machine learning methods trained using large-scale annotated datasets. While such data has made advances in model design and evaluation possible, it does not necessarily provide insights or constraints into those intermediate levels of computation, or deep structure, perceived as ultimately necessary in order to design reliable computer vision systems. This is noticeable in the accuracy of state-of-the-art systems trained with such annotations, which still lags behind human performance in similar tasks. Nor does the existing data make it immediately possible to exploit insights from a working system - the human eye - to derive potentially better features, models or algorithms. In this talk I will present a mix of perceptual and computational insights resulting from the analysis of large-scale human eye movement and 3D body motion capture datasets, collected in the context of visual recognition tasks (Human3.6M available at http://vision.imar.ro/human3.6m/, and Actions in the Eye available at http://vision.imar.ro/eyetracking/). I will show that attention models (fixation detectors, scan-path estimators, weakly supervised object detector response functions and search strategies) can be learned from human eye movement data, and can produce state-of-the-art results when used in end-to-end automatic visual recognition systems. I will also describe recent work in large-scale human pose estimation, showing the feasibility of pixel-level body part labeling from RGB, as well as promising 2D and 3D human pose estimation results in monocular images. In this context, I will discuss recent, perhaps surprising, quantitative perceptual experiments revealing that humans may not be significantly better than computers at perceiving 3D articulated poses from monocular images. Such findings may challenge established definitions of computer vision 'tasks' and their expected levels of performance.