Department Talks

Appearance Modeling for 4D Multi-view Representations

Talk
  • 15 December 2017 • 12:00–12:45
  • Vagia Tsiminaki
  • PS Seminar Room (N3.022)

The emergence of multi-view capture systems has yielded a tremendous amount of video sequences. The task of capturing spatio-temporal models from real-world imagery (4D modeling) should arguably benefit from this enormous amount of visual information. To achieve highly realistic representations, both geometry and appearance need to be modeled with high precision. Yet, despite great progress in geometric modeling, the appearance aspect has not been fully explored and visual quality can still be improved. I will explain how we can optimally exploit the redundant visual information of the captured video sequences to provide a temporally coherent, super-resolved, view-independent appearance representation. I will further discuss how to exploit the interdependency of geometry and appearance as separate modalities to enhance visual perception, and finally how to decompose appearance representations into intrinsic components (shading & albedo) and super-resolve them jointly to allow for more realistic renderings.

Organizers: Despoina Paschalidou

Perceptual Grouping using Superpixels

Talk
  • 11 November 2013 • 02:00:00
  • Sven Dickinson
  • MPH Lecture Hall

Perceptual grouping played a prominent role in support of early object recognition systems, which typically took an input image and a database of shape models and identified which of the models was visible in the image.  When the database was large, local features were not sufficiently distinctive to prune down the space of models to a manageable number that could be verified.  However, when causally related shape features were grouped, using intermediate-level shape priors, e.g., cotermination, symmetry, and compactness, they formed effective shape indices and allowed databases to grow in size.  In recent years, the recognition (categorization) community has focused on the object detection problem, in which the input image is searched for a specific target object.  Since indexing is not required to select the target model, perceptual grouping is not required to construct a discriminative shape index; the existence of a much stronger object-level shape prior precludes the need for a weaker intermediate-level shape prior.  As a result, perceptual grouping activity at our major conferences has diminished. However, there are clear signs that the recognition community is moving from appearance back to shape, and from detection back to unexpected object recognition. Shape-based perceptual grouping will play a critical role in facilitating this transition.  But while causally related features must be grouped, they also need to be abstracted before they can be matched to categorical models.   In this talk, I will describe our recent progress on the use of intermediate shape priors in segmenting, grouping, and abstracting shape features. Specifically, I will describe the use of symmetry and non-accidental attachment to detect and group symmetric parts, the use of closure to separate figure from background, and the use of a vocabulary of simple shape models to group and abstract image contours.


  • Padmanabhan Anandan
  • MPH Lecture Hall

T.b.a.


Exploring and editing the appearance of outdoor scenes

Talk
  • 11 October 2013 • 09:30:00
  • Pierre-Yves Laffont
  • MRZ seminar

The appearance of outdoor scenes changes dramatically with lighting and weather conditions, time of day, and season. Specific conditions, such as the "golden hours" characterized by warm light, can be hard to capture because many scene properties are transient -- they change over time. Despite significant advances in image editing software, common image manipulation tasks such as lighting editing require considerable expertise to achieve plausible results.
 
In this talk, we first explore the appearance of outdoor scenes with an approach based on crowdsourcing and machine learning. We relate visual changes to scene attributes, which are human-nameable concepts used for high-level description of scenes. We collect a dataset containing thousands of outdoor images, annotate them with transient attributes, and train classifiers to recognize these properties in new images. We develop new interfaces for browsing photo collections, based on these attributes.
 
We then focus specifically on extracting and manipulating the lighting in a photograph. Intrinsic image decomposition separates a photograph into independent layers: reflectance, which represents the color of the materials, and illumination, which encodes the effect of lighting at each pixel. We tackle this ill-posed problem by leveraging additional information provided by multiple photographs of the scene. The methods we describe enable advanced image manipulations such as lighting-aware editing, insertion of virtual objects, and image-based illumination transfer between photographs of a collection.
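The per-pixel intrinsic model described above can be made concrete with a small numpy sketch (all values are synthetic; this illustrates only the forward model and an edit, not the ill-posed estimation the talk addresses):

```python
import numpy as np

# Toy sketch of the intrinsic image model: a photograph I factors
# per pixel into reflectance R (material colour) and illumination S
# (effect of lighting), I = R * S.
rng = np.random.default_rng(0)
R = rng.uniform(0.2, 0.9, size=(4, 4, 3))   # reflectance layer
S = rng.uniform(0.1, 1.0, size=(4, 4, 1))   # grayscale shading layer
I = R * S                                   # observed image

# Lighting-aware editing: scale only the illumination layer,
# leaving material colours untouched, then recompose.
I_relit = R * (S * 1.5)

# If R were known, recovering S would be a per-pixel division; the
# hard, ill-posed part is estimating R and S from I alone.
S_est = (I / R).mean(axis=2, keepdims=True)
assert np.allclose(S_est, S)
```

Scaling S while holding R fixed is exactly the kind of lighting edit that becomes possible once the decomposition is recovered.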
 


Inference in highly-connected CRFs

Talk
  • 01 October 2013 • 08:00:00
  • Neill Campbell
  • MPH lecture hall

This talk presents recent work from CVPR that looks at inference for pairwise CRF models in the highly (or fully) connected case rather than simply a sparse set of neighbours used ubiquitously in many computer vision tasks. Recent work has shown that fully-connected CRFs, where each node is connected to every other node, can be solved very efficiently under the restriction that the pairwise term is a Gaussian kernel over a Euclidean feature space. The method presented generalises this model to allow arbitrary, non-parametric models (which can be learnt from training data and conditioned on test data) to be used for the pairwise potentials. This greatly increases the expressive power of such models whilst maintaining efficient inference.
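As a rough illustration of inference in such models, here is a brute-force numpy sketch of mean-field updates for a fully-connected Potts CRF with a Gaussian pairwise kernel (the function name and toy data are my own; the efficient method referenced above replaces the O(N²) kernel sum with high-dimensional filtering, and the talk's generalisation replaces the Gaussian kernel with learned non-parametric models):

```python
import numpy as np

def dense_crf_mean_field(unary, feats, sigma=1.0, w=1.0, iters=5):
    """Mean-field inference for a fully-connected Potts CRF whose
    pairwise strength is a Gaussian kernel over per-pixel features.
    unary: (N, L) negative log-likelihoods; feats: (N, D) features."""
    N, L = unary.shape
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = w * np.exp(-d2 / (2 * sigma ** 2))   # pairwise kernel matrix
    np.fill_diagonal(K, 0.0)                 # no self-interaction
    Q = np.exp(-unary)
    Q /= Q.sum(1, keepdims=True)
    for _ in range(iters):
        msg = K @ Q                          # expected neighbour labels
        # Potts penalty: cost of disagreeing with similar pixels
        Q = np.exp(-unary - (msg.sum(1, keepdims=True) - msg))
        Q /= Q.sum(1, keepdims=True)
    return Q

# Four "pixels" in two feature clusters; pixel 1 has an ambiguous
# unary and is pulled toward the confident label of its neighbour.
feats = np.array([[0.0], [0.1], [5.0], [5.1]])
unary = -np.log(np.array([[0.9, 0.1],
                          [0.5, 0.5],
                          [0.1, 0.9],
                          [0.1, 0.9]]))
Q = dense_crf_mean_field(unary, feats)
```

Because every node talks to every other node through the kernel, ambiguous pixels borrow evidence from all similar pixels in the image, not just a sparse neighbourhood.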


Shape Knowledge in Segmentation and Tracking

Talk
  • 23 September 2013 • 09:15:00
  • Victor Adrian Prisacariu
  • MRZ seminar room

In this talk I will detail methods for simultaneous 2D/3D segmentation, tracking and reconstruction which incorporate high level shape information. I base my work on the assumption that the space of possible 2D object shapes can be either generated by projecting down known rigid 3D shapes or learned from 2D shape examples. I minimise the discrimination between statistical foreground and background appearance models with respect to the parameters governing the shape generative process (the 6 degree-of-freedom 3D pose of the 3D shape or the parameters of the learned space). The foreground region is delineated by the zero level set of a signed distance function, and I define an energy over this region and its immediate background surroundings based on pixel-wise posterior membership probabilities. I obtain the differentials of this energy with respect to the parameters governing shape and conduct searches for the correct shape using standard non-linear minimisation techniques.

This methodology first leads to a novel rigid 3D object tracker. For a known 3D shape, the optimisation here aims to find the 3D pose that leads to the 2D projection that best segments a given image. I also extend my approach to track multiple objects from multiple views and show how depth (such as may be available from a Kinect sensor) can be integrated in a straightforward manner.

Next, I explore deformable 2D/3D object tracking. I use a non-linear and probabilistic dimensionality reduction, called Gaussian Process Latent Variable Models, to learn spaces of shape. Segmentation becomes a minimisation of an image-driven energy function in the learned space. I can represent both 2D and 3D shapes which I compress with Fourier-based transforms, to keep inference tractable. I extend this method by learning joint shape-parameter spaces, which, novel to the literature, enable simultaneous segmentation and generic parameter recovery. These can describe anything from 3D articulated pose to eye gaze.
Finally, I will also be discussing various applications of the proposed techniques, ranging from (limited) articulated hand tracking to semantic SLAM.
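The pixel-wise posterior memberships mentioned above can be sketched as a simple Bayes computation over foreground/background appearance histograms (a minimal hypothetical version of my own; the actual formulation couples these posteriors with the level-set energy and the shape parameters):

```python
import numpy as np

def pixelwise_posteriors(pixels, hist_f, hist_b, eta_f, eta_b):
    """Per-pixel foreground membership probability via Bayes' rule:
    fg/bg appearance histograms over intensity bins, with region
    areas eta_f/eta_b acting as priors."""
    pf = eta_f * hist_f[pixels]
    pb = eta_b * hist_b[pixels]
    return pf / (pf + pb)

# Toy appearance models: foreground favours bright bins, background
# favours dark bins; query one sample pixel per intensity bin.
hist_f = np.array([0.05, 0.15, 0.80])
hist_b = np.array([0.70, 0.25, 0.05])
pixels = np.array([0, 1, 2])
post = pixelwise_posteriors(pixels, hist_f, hist_b, eta_f=50, eta_b=150)
```

In the framework above, these per-pixel posteriors feed an energy defined over the zero level set of the embedding function, which is then differentiated with respect to pose or latent shape parameters.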


Human perception of material properties in the real world

Talk
  • 23 September 2013 • 13:15:00
  • Bei Xiao
  • MRC Aquarium

Humans are very good at recognizing objects as well as the materials that they are made of. We can easily tell cheese from butter, silk from linen and snow from ice just by looking. Understanding material perception is important for many real-world applications. For instance, a robot cooking in the kitchen will benefit from the knowledge of material perception when deciding if food is cooked or raw. In this talk, I will present studies that are motivated by two important applications of material perception: online shopping and computer graphics (CG) rendering. First, I will discuss the image cues that allow humans to infer tactile and mechanical information about deformable materials. I will present an experiment in which subjects were asked to match their tactile and visual perception of fabrics. I will show that image cues such as 3D folds and color are important for predicting subjects' tactile perception. Not only do these findings have immediate practical implications (e.g., improving online shopping interfaces for fabrics), but they also have theoretical implications: image-based visual cues affect tactile perception. Second, I will present a project on the visual perception of translucent materials (e.g., wax, milk, and jade) using computer-rendered stimuli. Humans are very sensitive to subtle differences in translucency (e.g., baby skin vs. adult skin); however, it is difficult to render translucent materials realistically. I will show how we measured the perceptual dimensions of physical scattering parameter space and used those measurements to produce more realistic renderings of materials like marble and jade. Taken together, my findings highlight the importance of material perception in the real world, and demonstrate how human perception can contribute to applications in computer vision and graphics.


  • Alexander Schwing
  • MPH lecture hall

Sensors acquire an increasing amount of diverse information, posing two challenges: first, how can we efficiently deal with such large amounts of data, and second, how can we benefit from this diversity? In this talk I will first present an approach to deal with large graphical models. The presented method distributes and parallelizes the computation and memory requirements while preserving the convergence and optimality guarantees of existing inference and learning algorithms. I will demonstrate the effectiveness of the approach on stereo reconstruction from high-resolution imagery. In the second part I will present a unified framework for structured prediction with latent variables which includes hidden conditional random fields and latent structured support vector machines as special cases. This framework allows us to linearly combine different sources of information, and I will demonstrate its efficacy on the problem of estimating the 3D room layout given a single image. For the latter problem I will, in a third part, introduce a globally optimal yet efficient inference algorithm based on branch-and-bound.


Depth, You, and the World

Talk
  • 10 September 2013 • 11:15:00
  • Jamie Shotton
  • Max Planck Haus Lecture Hall

Consumer level depth cameras such as Kinect have changed the landscape of 3D computer vision.  In this talk we will discuss two approaches that both learn to directly infer correspondences between observed depth image pixels and 3D model points.  These correspondences can then be used to drive an optimization of a generative model to explain the data.  The first approach, the "Vitruvian Manifold", aims to fit an articulated 3D human model to a depth camera image, and extends our original Body Part Recognition algorithm used in Kinect.  It applies a per-pixel regression forest to infer direct correspondences between image pixels and points on a human mesh model.  This allows an efficient “one-shot” continuous optimization of the model parameters to recover the human pose.  The second approach, "Scene Coordinate Regression", addresses the problem of camera pose relocalization.  It uses a similar regression forest, but now aims to predict correspondences between observed image pixels and 3D world coordinates in an arbitrary 3D scene.  These correspondences are again used to drive an efficient optimization of the camera pose to a highly accurate result from a single input frame.
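When depth is available, fitting a camera pose to such pixel-to-3D correspondences reduces, in its simplest least-squares form, to rigid 3D-3D alignment. The sketch below is my own illustration of that step using the standard Kabsch/SVD solution, not the talk's actual robust optimisation over forest-predicted correspondences:

```python
import numpy as np

def kabsch(cam_pts, world_pts):
    """Least-squares rigid transform (R, t) mapping camera-space
    points onto predicted world coordinates, via the SVD of the
    cross-covariance of the centred point sets."""
    mu_c, mu_w = cam_pts.mean(0), world_pts.mean(0)
    H = (cam_pts - mu_c).T @ (world_pts - mu_w)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_w - R @ mu_c
    return R, t

# Synthetic check: recover a known rotation about z and translation.
rng = np.random.default_rng(1)
P = rng.normal(size=(20, 3))                 # camera-space points
theta = 0.4
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
W = P @ R_true.T + t_true                    # "predicted" world coords
R_est, t_est = kabsch(P, W)
```

With noisy, partially wrong forest predictions, a single least-squares fit is not enough; the pipelines described above embed such a fit inside a robust, RANSAC-style optimisation.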


Efficient Algorithms for Semantic Scene Parsing

Talk
  • 09 September 2013 • 12:00:00
  • Raquel Urtasun
  • MPI Lecture Hall

Developing autonomous systems that are able to assist humans in everyday tasks is one of the grand challenges in modern computer science. Notable examples are personal robotics for the elderly and people with disabilities, as well as autonomous driving systems which can help decrease fatalities caused by traffic accidents. In order to perform tasks such as navigation, recognition and manipulation of objects, these systems should be able to efficiently extract 3D knowledge of their environment. In this talk, I'll show how Markov random fields provide a powerful mathematical formalism to extract this knowledge. In particular, I'll focus on a few examples, namely 3D reconstruction, 3D layout estimation, 2D holistic parsing and object detection, and show representations and inference strategies that allow us to achieve state-of-the-art performance as well as speed-ups of several orders of magnitude.


  • Sanja Fidler
  • MRZ

Object detection is one of the main challenges of computer vision. In the standard setting, we are given an image and the goal is to place bounding boxes around the objects and recognize their classes. In robotics, estimating additional information such as accurate viewpoint or detailed segmentation is important for planning and interaction. In this talk, I'll approach detection in three scenarios: purely 2D, 3D from 2D and 3D from 3D and show how different types of information can be used to significantly boost the current state-of-the-art in detection.