This talk will highlight recent progress on two fronts. First, we will talk about a novel image-conditioned person model that allows for effective articulated pose estimation in realistic scenarios. Second, we describe our work towards activity recognition and the ability to describe video content with natural language.
Both efforts are part of a longer-term agenda towards visual scene understanding. While visual scene understanding has long been advocated as the "holy grail" of computer vision, we believe it is time to address this challenge again, based on the progress in recent years.
In this talk, I will show that, given probabilities of presence of people at various locations in individual time frames, finding the most likely set of trajectories amounts to solving a linear program that depends on very few parameters.
This can be done without requiring appearance information and in real time, by using the K-Shortest Paths algorithm (KSP). However, this can result in unwarranted identity switches in complex scenes. In such cases, sparse image information can be used within the Linear Programming framework to keep track of people's identities, even when their paths come close to each other or intersect. By sparse, we mean that the appearance needs to be discriminative in only a very limited number of frames, which makes our approach widely applicable.
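To make the idea of turning per-frame presence probabilities into a trajectory concrete, here is a minimal sketch. It is not the KSP algorithm or the linear program from the talk: it recovers only one trajectory, by dynamic programming over a small trellis, and it forces that trajectory to span every frame. The log-odds cost, the `max_step` motion limit, and the toy probabilities are all assumptions made for illustration.

```python
import math


def detection_cost(p, eps=1e-6):
    """Assumed cost of occupying a cell with presence probability p.

    Uses the negative log-odds -log(p / (1 - p)), so likely cells are cheap.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p / (1.0 - p))


def best_trajectory(prob, max_step=1):
    """Find the single cheapest path through all frames.

    prob[t][k] is the probability that a person is at location k in frame t.
    A person may move at most max_step locations between consecutive frames.
    Returns (total_cost, path) where path[t] is the location at frame t.
    """
    n_frames, n_locs = len(prob), len(prob[0])
    cost = [detection_cost(p) for p in prob[0]]
    back = []  # back[t - 1][k] = best predecessor of location k at frame t
    for t in range(1, n_frames):
        new_cost, pointers = [], []
        for k in range(n_locs):
            lo, hi = max(0, k - max_step), min(n_locs, k + max_step + 1)
            prev = min(range(lo, hi), key=lambda j: cost[j])
            new_cost.append(cost[prev] + detection_cost(prob[t][k]))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheapest final location.
    end = min(range(n_locs), key=lambda k: cost[k])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[end], path


# Toy input: one person walking from location 0 to location 3.
probs = [
    [0.9, 0.1, 0.1, 0.1],
    [0.2, 0.8, 0.1, 0.1],
    [0.1, 0.3, 0.9, 0.1],
    [0.1, 0.1, 0.2, 0.7],
]
total, path = best_trajectory(probs)
print(path, round(total, 3))
```

Extracting several non-overlapping trajectories at once, and letting them start and end at arbitrary frames, is what requires the flow formulation described above; this sketch covers only the one-path special case.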
Manifold learning techniques attempt to map a high-dimensional space onto a lower-dimensional one. From a mathematical point of view, a manifold is a topological Hausdorff space that is locally Euclidean. From a machine learning point of view, we can interpret this embedded manifold as the underlying support of the data distribution. When dealing with high-dimensional data sets, nonlinear dimensionality reduction methods can provide a more faithful data representation than linear ones. However, the local geometrical distortion induced by the nonlinear mapping leads to a loss of information and affects interpretability, with a negative impact on the model visualization results.
This talk will discuss an approach which involves probabilistic nonlinear dimensionality reduction through Gaussian Process Latent Variable Models. The main focus is on the intrinsic geometry of the model itself as a tool to improve the exploration of the latent space and to recover information lost due to dimensionality reduction. We aim to analytically quantify and visualize the distortion due to dimensionality reduction in order to improve the performance of the model and to interpret data in a more faithful way.
In collaboration with: N.D. Lawrence (University of Sheffield), A. Vellido (UPC)
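One common way to quantify the local distortion of a nonlinear mapping is the metric it induces on the latent space, G = J^T J, where J is the Jacobian of the mapping, together with the magnification factor sqrt(det G). The sketch below computes both for a hand-picked deterministic toy mapping f(u, v) = (u, v, u^2 + v^2). The mapping and all function names are assumptions for illustration; a Gaussian Process Latent Variable Model has a random Jacobian, so the talk's analysis would work with an expected metric rather than this deterministic one.

```python
import math


def jacobian(u, v):
    """Analytic Jacobian of the toy mapping f(u, v) = (u, v, u**2 + v**2)."""
    return [
        [1.0, 0.0],
        [0.0, 1.0],
        [2.0 * u, 2.0 * v],
    ]


def pullback_metric(J):
    """Return the 2x2 metric G = J^T J induced on the latent space."""
    g = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(2):
        for b in range(2):
            g[a][b] = sum(row[a] * row[b] for row in J)
    return g


def magnification_factor(u, v):
    """sqrt(det G): how much a small latent area is stretched in data space."""
    g = pullback_metric(jacobian(u, v))
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return math.sqrt(det)


for point in [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]:
    print(point, magnification_factor(*point))
```

For this mapping the factor equals sqrt(1 + 4u^2 + 4v^2), so it is 1 at the origin (no distortion) and grows away from it; plotting it over the latent space gives a simple distortion map.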
Perceptual grouping played a prominent role in support of early object recognition systems, which typically took an input image and a database of shape models and identified which of the models was visible in the image. When the database was large, local features were not sufficiently distinctive to prune down the space of models to a manageable number that could be verified. However, when causally related shape features were grouped, using intermediate-level shape priors, e.g., cotermination, symmetry, and compactness, they formed effective shape indices and allowed databases to grow in size. In recent years, the recognition (categorization) community has focused on the object detection problem, in which the input image is searched for a specific target object. Since indexing is not required to select the target model, perceptual grouping is not required to construct a discriminative shape index; the existence of a much stronger object-level shape prior precludes the need for a weaker intermediate-level shape prior. As a result, perceptual grouping activity at our major conferences has diminished. However, there are clear signs that the recognition community is moving from appearance back to shape, and from detection back to unexpected object recognition. Shape-based perceptual grouping will play a critical role in facilitating this transition. But while causally related features must be grouped, they also need to be abstracted before they can be matched to categorical models. In this talk, I will describe our recent progress on the use of intermediate shape priors in segmenting, grouping, and abstracting shape features. Specifically, I will describe the use of symmetry and non-accidental attachment to detect and group symmetric parts, the use of closure to separate figure from background, and the use of a vocabulary of simple shape models to group and abstract image contours.
This talk presents recent work from CVPR that looks at inference for pairwise CRF models in the highly (or fully) connected case, rather than the sparse set of neighbours used ubiquitously in computer vision tasks. Recent work has shown that inference in fully-connected CRFs, where each node is connected to every other node, can be performed very efficiently under the restriction that the pairwise term is a Gaussian kernel over a Euclidean feature space. The method presented generalises this model to allow arbitrary, non-parametric models (which can be learnt from training data and conditioned on test data) to be used for the pairwise potentials. This greatly increases the expressive power of such models whilst maintaining efficient inference.
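As a point of reference for the restricted Gaussian-kernel case, the following is a minimal sketch of naive mean-field inference in a tiny fully-connected CRF. It evaluates every pairwise term explicitly, so it costs O(N^2) per iteration and has none of the efficiency of the filtering-based methods mentioned above, and it does not implement the generalised non-parametric potentials of the talk. The Potts label compatibility, the parallel update schedule, and all names and parameter values are assumptions for illustration.

```python
import math


def gaussian_kernel(f_i, f_j, bandwidth=1.0):
    """Gaussian similarity between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(f_i, f_j))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


def softmax(scores):
    """Numerically stable normalisation of log-scores into probabilities."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [x / z for x in e]


def mean_field(unary, features, weight=1.0, bandwidth=1.0, n_iters=10):
    """Naive mean-field for a fully-connected CRF with Potts compatibility.

    unary[i][l] is the cost of giving label l to node i.
    Returns q, where q[i][l] approximates the marginal of label l at node i.
    """
    n, n_labels = len(unary), len(unary[0])
    # Dense kernel matrix; the diagonal is zero so a node ignores itself.
    k = [[0.0 if i == j else gaussian_kernel(features[i], features[j], bandwidth)
          for j in range(n)] for i in range(n)]
    q = [softmax([-c for c in row]) for row in unary]
    for _ in range(n_iters):
        new_q = []
        for i in range(n):
            scores = []
            for l in range(n_labels):
                # Potts: pay k[i][j] for the mass node j puts on labels != l.
                penalty = sum(k[i][j] * (1.0 - q[j][l]) for j in range(n))
                scores.append(-unary[i][l] - weight * penalty)
            new_q.append(softmax(scores))
        q = new_q  # parallel update of all nodes
    return q


# Two well-separated clusters; nodes 1 and 3 have uninformative unaries.
features = [[0.0], [0.1], [5.0], [5.1]]
unary = [[0.0, 3.0], [1.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
q = mean_field(unary, features, weight=2.0)
labels = [row.index(max(row)) for row in q]
print(labels)
```

In the toy run, the two nodes with uninformative unary costs take the label of the confident node that is close to them in feature space, which is the smoothing behaviour the pairwise term is meant to provide.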
In this talk I will detail methods for simultaneous 2D/3D segmentation, tracking and reconstruction which incorporate high-level shape information. I base my work on the assumption that the space of possible 2D object shapes can be either generated by projecting down known rigid 3D shapes or learned from 2D shape examples. I maximise the discrimination between statistical foreground and background appearance models with respect to the parameters governing the shape generative process (the 6 degree-of-freedom 3D pose of the 3D shape or the parameters of the learned space). The foreground region is delineated by the zero level set of a signed distance function, and I define an energy over this region and its immediate background surroundings based on pixel-wise posterior membership probabilities. I obtain the differentials of this energy with respect to the parameters governing shape and conduct searches for the correct shape using standard non-linear minimisation techniques. This methodology first leads to a novel rigid 3D object tracker. For a known 3D shape, the optimisation here aims to find the 3D pose that leads to the 2D projection that best segments a given image. I also extend my approach to track multiple objects from multiple views and show how depth (such as may be available from a Kinect sensor) can be integrated in a straightforward manner. Next, I explore deformable 2D/3D object tracking. I use a non-linear and probabilistic dimensionality reduction technique, Gaussian Process Latent Variable Models, to learn spaces of shape. Segmentation becomes a minimisation of an image-driven energy function in the learned space. I can represent both 2D and 3D shapes, which I compress with Fourier-based transforms to keep inference tractable. I extend this method by learning joint shape-parameter spaces, which, novel to the literature, enable simultaneous segmentation and generic parameter recovery. These can describe anything from 3D articulated pose to eye gaze.
Finally, I will also be discussing various applications of the proposed techniques, ranging from (limited) articulated hand tracking to semantic SLAM.
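The level-set energy described above can be illustrated with a deliberately simplified one-dimensional sketch. It assumes a signed distance function that is positive inside the foreground, a smoothed Heaviside function, equal class priors, and two-bin appearance models; none of these choices is taken from the talk. Instead of differentiating the energy with respect to pose or latent shape parameters, it only compares the energy of two candidate segmentations.

```python
import math


def heaviside(phi, sharpness=1.0):
    """Smoothed step: close to 1 where phi > 0 (inside), close to 0 outside."""
    return 0.5 + math.atan(phi * sharpness) / math.pi


def posteriors(lik_fg, lik_bg, prior_fg=0.5):
    """Pixel-wise posterior membership probabilities (foreground, background)."""
    num_fg = lik_fg * prior_fg
    num_bg = lik_bg * (1.0 - prior_fg)
    z = num_fg + num_bg
    return num_fg / z, num_bg / z


def energy(phi, pixels, fg_model, bg_model):
    """Negative log-probability of the shape given the image (lower is better)."""
    total = 0.0
    for phi_x, y in zip(phi, pixels):
        p_f, p_b = posteriors(fg_model[y], bg_model[y])
        h = heaviside(phi_x)
        total -= math.log(h * p_f + (1.0 - h) * p_b)
    return total


def signed_distance_1d(n, left, right):
    """Signed distance to the pixel interval [left, right], positive inside."""
    return [min(x - left, right - x) + 0.5 for x in range(n)]


# Toy 1D "image": value 1 is bright (object), value 0 is dark (background).
pixels = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
fg_model = {0: 0.1, 1: 0.9}  # assumed likelihood of each value under foreground
bg_model = {0: 0.9, 1: 0.1}  # assumed likelihood of each value under background

good = energy(signed_distance_1d(10, 3, 6), pixels, fg_model, bg_model)
bad = energy(signed_distance_1d(10, 1, 4), pixels, fg_model, bg_model)
print(good, bad)
```

The interval that matches the bright pixels yields the lower energy; a tracker in the spirit of the abstract would reach such a minimum by following the derivative of the energy with respect to the shape parameters rather than by enumerating candidates.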
Humans are very good at recognizing objects as well as the materials that they are made of. We can easily tell cheese from butter, silk from linen and snow from ice just by looking. Understanding material perception is important for many real-world applications. For instance, a robot cooking in the kitchen will benefit from the knowledge of material perception when deciding if food is cooked or raw. In this talk, I will present studies that are motivated by two important applications of material perception: online shopping and computer graphics (CG) rendering. First, I will discuss the image cues that allow humans to infer tactile and mechanical information about deformable materials. I will present an experiment in which subjects were asked to match their tactile and visual perception of fabrics. I will show that image cues such as 3D folds and color are important for predicting subjects' tactile perception. Not only do these findings have immediate practical implications (e.g., improving online shopping interfaces for fabrics), but they also have theoretical implications: image-based visual cues affect tactile perception. Second, I will present a project on the visual perception of translucent materials (e.g., wax, milk, and jade) using computer-rendered stimuli. Humans are very sensitive to subtle differences in translucency (e.g., baby skin vs. adult skin); however, it is difficult to render translucent materials realistically. I will show how we measured the perceptual dimensions of physical scattering parameter space and used those measurements to produce more realistic renderings of materials like marble and jade. Taken together, my findings highlight the importance of material perception in the real world, and demonstrate how human perception can contribute to applications in computer vision and graphics.
Sensors acquire an increasing amount of diverse information, which poses two challenges. First, how can we deal efficiently with such a large amount of data? Second, how can we benefit from this diversity? In this talk I will first present an approach to deal with large graphical models. The presented method distributes and parallelizes the computation and memory requirements while preserving the convergence and optimality guarantees of existing inference and learning algorithms. I will demonstrate the effectiveness of the approach on stereo reconstruction from high-resolution imagery. In the second part I will present a unified framework for structured prediction with latent variables, which includes hidden conditional random fields and latent structured support vector machines as special cases. This framework allows different sources of information to be combined linearly, and I will demonstrate its efficacy on the problem of estimating the 3D room layout given a single image. For the latter problem, in a third part, I will introduce a globally optimal yet efficient inference algorithm based on branch-and-bound.