Since Hubel and Wiesel's seminal findings in the primary visual cortex (V1) more than 50 years ago, progress in vision science along the established frameworks and schools of thought has been very limited. Have we been asking the right questions? I will show observations motivating a new path. First, a drastic information bottleneck forces the brain to process only a tiny fraction of the massive visual input; this selection is called attentional selection, and how this tiny fraction is selected is critical. Second, a large body of evidence has been accumulating to suggest that the primary visual cortex (V1) is where this selection starts, which implies that the visual cortical areas beyond V1 along the visual pathway must be investigated in light of this selection in V1. Placing attentional selection at center stage, a new path to understanding vision is proposed (articulated in my book "Understanding vision: theory, models, and data", Oxford University Press 2014). I will show a first example of using this new path, which aims to ask new questions and make fresh progress. I will relate our insights to artificial vision systems to discuss issues like top-down feedback in hierarchical processing, analysis-by-synthesis, and image understanding.
Fitting statistical 2D and 3D shape models to images is necessary for a variety of tasks, such as video editing and face recognition. Much progress has been made on local fitting from an initial guess, but determining a close enough initial guess is still an open problem. One approach is to detect distinct landmarks in the image and initialize the model fit from these correspondences. This is difficult, because detection of landmarks based only on their local appearance is inherently ambiguous, making it necessary to use global shape information for the detections. We propose a method to solve the combinatorial problem of selecting out of a large number of candidate landmark detections the configuration which is best supported by a shape model.
Our method, as opposed to previous approaches, always finds the globally optimal configuration. The algorithm can be applied to a very general class of shape models and is independent of the underlying feature point detector.
This talk concerns the use of physics-based models for human pose tracking and scene inference. We outline our motivation for physics-based models, some results with monocular pose tracking in terms of biomechanically inspired controllers, and recent results on the inference of scene interactions. We show that physics-based models facilitate the estimation of physically plausible human motion with little or no mocap data required. Scene interactions play an integral role in modeling sources of external forces acting on the body.
In spite of the significant effort that has been devoted to the core problems of object and action recognition in images and videos, the recognition performance of state-of-the-art algorithms is well below what would be required for successful deployment in many applications. Additionally, there are challenging combinatorial problems associated with constructing globally “optimal” descriptions of images and videos in terms of potentially very large collections of object and action models. The constraints that are utilized in these optimization procedures are loosely referred to as “context.” For example, vehicles are generally supported by the ground, so an estimate of ground plane location parameters in an image constrains the positions and apparent sizes of vehicles. Another source of context is the set of everyday spatial and temporal relationships between objects and actions; for example, keyboards are typically “on” tables and not “on” cats.
The first part of the talk will discuss how visually grounded models of object appearance and relations between objects can be simultaneously learned from weakly labeled images (images which are linguistically but not spatially annotated – i.e., we are told there is a car in the image, but not where the car is located).
Next, I will discuss how these models can be more efficiently learned using active learning methods. Once these models are acquired, one approach to inferring what objects appear in a new image is to segment the image into pieces, construct a graph based on the regions in the segmentation and the relationships modeled, and then apply probabilistic inference to the graph. However, this typically results in a very dense graph with many “noisy” edges, leading to inefficient and inaccurate inference. I will briefly describe a learning approach that can construct smaller and more informative graphs for inference.
Finally, I will relax the (unreasonable) assumption that one can segment an image into regions that correspond to objects, and describe an approach that can simultaneously construct instances of objects out of collections of connected segments that look like objects, while also softly enforcing contextual constraints.
Organizers: Michel Besserve
Human pose estimation from monocular images is one of the most challenging and computationally demanding problems in computer vision. Standard models such as Pictorial Structures consider interactions between kinematically-connected joints or limbs, leading to inference quadratic in the number of pixels.
As a result, researchers and practitioners have restricted themselves to simple models which only measure the quality of limb-pair possibilities by their 2D geometric plausibility. In this talk, we propose novel methods which allow for efficient inference in richer models with data-dependent interaction cliques.
First, we introduce structured prediction cascades, a structured analog of binary cascaded classifiers, which learn to focus computational effort where it is needed, filtering out many states cheaply while ensuring the correct output is unfiltered.
Second, we propose a way to decompose models of human pose with cyclic dependencies into a collection of tree models, and provide novel methods to impose model agreement. These techniques allow for sparse and efficient inference on the order of minutes per image or video clip.
As a result, we can afford to model pairwise interaction potentials much more richly with data-dependent features such as contour continuity, segmentation alignment, color consistency, optical flow and more.
Finally, we apply these techniques to higher-order cliques, extending the idea of poselets to structured models. We show empirically that these richer models are worthwhile, obtaining significantly more accurate pose estimation on popular datasets.
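The filtering step of a structured prediction cascade can be illustrated with a minimal sketch on a chain of parts. Everything here is an assumption made for illustration, not the speaker's implementation: the function names, the shared number of states per part, and the threshold rule (a convex combination of the best score and the mean max-marginal, controlled by a hypothetical `alpha`). The nested loops over state pairs are what makes inference quadratic in the number of states.

```python
def max_marginals(unary, pairwise):
    """Max-sum forward/backward pass on a chain.

    unary[i][s]       : score of state s for part i
    pairwise[i][s][t] : score of state s at part i together with state t at part i+1
    Returns mm[i][s], the best total score of any assignment that puts part i in state s.
    """
    n, k = len(unary), len(unary[0])
    fwd = [[0.0] * k for _ in range(n)]
    bwd = [[0.0] * k for _ in range(n)]
    for i in range(1, n):
        for t in range(k):
            # Quadratic in the number of states: every (s, t) pair is scored.
            fwd[i][t] = max(fwd[i - 1][s] + unary[i - 1][s] + pairwise[i - 1][s][t]
                            for s in range(k))
    for i in range(n - 2, -1, -1):
        for s in range(k):
            bwd[i][s] = max(bwd[i + 1][t] + unary[i + 1][t] + pairwise[i][s][t]
                            for t in range(k))
    return [[fwd[i][s] + unary[i][s] + bwd[i][s] for s in range(k)] for i in range(n)]


def filter_states(unary, pairwise, alpha=0.5):
    """Keep only states whose max-marginal reaches the threshold.

    The threshold rule is an assumption for this sketch. Because the threshold
    never exceeds the best score (for alpha in [0, 1]), the states on the
    highest-scoring assignment always survive.
    """
    mm = max_marginals(unary, pairwise)
    best = max(mm[0])
    flat = [v for row in mm for v in row]
    threshold = alpha * best + (1.0 - alpha) * sum(flat) / len(flat)
    return [[s for s, v in enumerate(row) if v >= threshold] for row in mm]


# Toy example: three parts, three candidate states each, no pairwise preference.
unary = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
pairwise = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
print(filter_states(unary, pairwise))  # [[0], [1], [2]]
```

A later, richer model in the cascade would then only need to score the surviving states, which is where the savings come from.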
Organizers: Michel Besserve
Pose estimation and tracking have been a focus of computer vision research for many years. Despite many successes, however, most approaches to date are still not able to recover physically realistic (natural looking) 3D motions and are restricted to captures indoors or with simplified backgrounds. In the first part of this talk, I will briefly introduce a class of models that use physics to constrain the motion of the subject to more realistic interpretations.
In particular, we formulate the pose tracking problem as one of inference of control mechanisms which implicitly (through physical simulation) generate the kinematic motion matching the image observations. This formulation of the problem has a number of benefits with respect to more traditional kinematic models. In the second part of the talk, I will describe a new proof-of-concept framework for capturing human motion in outdoor environments where traditional motion capture systems, including marker-less motion systems, would typically be inapplicable.
The proposed system consists of a number of small body-mounted cameras, placed on all major segments of the body, and is capable of recovering the underlying skeletal motion by observing the scene as it changes, within each camera view, with the motion of the subject's body.
Organizers: Michel Besserve
Shape analysis aims to describe either a single shape or a population of shapes in an efficient and informative way. This is a key problem in various applications such as mesh deformation and animation, object recognition, and mesh parameterization.
I will present a number of approaches to process shapes that are nearly isometric. The first approach computes the correspondence information between a population of shapes in this setting. Second and third are approaches to morph between two shapes and to segment a population of shapes into near-rigid components. Next, I will present an approach for isometry-invariant shape description and feature extraction.
Furthermore, I will present an algorithm to compute the correspondence information between human bodies in varying postures. In addition to being nearly isometric, human body shapes share the same geometric structure, and we can take advantage of this prior geometric information to find accurate correspondences. Finally, I will discuss some applications of shape analysis in computer-aided design.
We propose a geometric approach to articulated tracking, where the human pose representation is expressed on the Riemannian manifold of joint positions. This is in contrast to conventional methods where the problem is phrased in terms of intrinsic parameters of the human pose. Our model is based on a physically natural metric that also has strong links to neurological models of human motion planning. Among the benefits of the model are that it allows for easy modeling of interaction with the environment, for data-driven optimization schemes, and for well-posed low-pass filtering.
To apply the Riemannian model in practice, we derive simulation schemes for Brownian motion on manifolds as well as computationally efficient approximation schemes. The resulting algorithms seem to outperform gold standards both in terms of accuracy and running times.
Organizers: Michel Besserve
A pure refinement procedure for non-rigid registration can be highly effective for establishing dense correspondences between pairs of scanned data, even for significant deformations. I will explain how to design robust non-rigid registration algorithms and why it is important to couple the optimization of correspondence positions, warping field, and overlapping regions. I will show several applications where such registration has been successfully applied, ranging from film/game production to radiation oncology. One particular interest of mine is facial animation. I will present a fully integrated system for real-time facial performance capture and expression transfer and give a live demo of our latest technology, faceshift. At the end of the talk I
Organizers: Gerard Pons-Moll
Many machine vision/image processing algorithms are designed to be real-time and fully automatic. These attributes are essential, e.g., for stereo robotics vision applications. Visual Effects Studios, however, possess giant server farms and command armies of artists to perform intelligent initialization or provide guidance to algorithms. On the other hand, motion pictures have very high accuracy requirements, and the ability to influence an algorithm manually is often more important than other factors generally considered crucial in Academia. In this talk I will highlight some scenarios where Academia and the Visual Effects industry disagree.
In the era of perpetually increasing computational capabilities, multi-camera acquisition systems are being increasingly used to capture parameterization-free articulated 3D shapes. These systems allow marker-less shape acquisition and are useful for a wide range of applications in the entertainment, sports, and surveillance industries, and also in interactive and augmented reality systems. The availability of vast amounts of 3D shape data has increased interest in 3D shape analysis methods. Segmentation and matching are two important shape analysis tasks. 3D shape segmentation is a subjective task that involves dividing a given shape into constituent parts by assigning each part a unique segment label.
In the case of 3D shape matching, a dense vertex-to-vertex correspondence between two shapes is desired. However, 3D shape analysis is particularly difficult in the case of articulated shapes due to complex kinematic poses. These poses induce self-occlusions and shadow effects which cause topological changes such as merging and splitting. In this work we propose robust segmentation and matching methods for articulated 3D shapes represented as mesh-graphs using graph spectral methods.
This talk is divided into two parts. Part one of the talk will focus on 3D shape segmentation, attempted both in an unsupervised and semi-supervised setting by analysing the properties of discrete Laplacian eigenspaces of mesh-graphs. In the second part, 3D shape matching is analysed in a multi-scale heat-diffusion framework derived from Laplacian eigenspace. We believe that this framework is well suited to handle large topological changes and we substantiate our belief by showing promising results on various publicly available real mesh datasets.
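To make the idea of multi-scale heat diffusion on a mesh-graph concrete, here is a minimal sketch under stated assumptions. It uses an unweighted combinatorial Laplacian on a toy graph, and it evaluates the heat kernel exp(-tL) with a truncated Taylor series instead of an explicit eigendecomposition (the two are mathematically equivalent; the series just keeps the example dependency-free). The function names and the per-vertex descriptor built from the kernel diagonal at several scales are illustrative choices, not necessarily those used in the talk.

```python
def graph_laplacian(n, edges):
    """Combinatorial Laplacian L = D - A of an undirected, unweighted graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def matmul(A, B):
    """Dense square matrix product for small matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def heat_kernel(L, t, terms=40):
    """Heat kernel H_t = exp(-t L), via a truncated Taylor series.

    Adequate for small graphs and moderate t; larger problems would use an
    eigendecomposition of L instead.
    """
    n = len(L)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in H]
    for k in range(1, terms):
        # term becomes (-t L)^k / k!
        term = [[-t * x / k for x in row] for row in matmul(term, L)]
        H = [[H[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return H


def multiscale_descriptor(L, scales):
    """Per-vertex descriptor: the heat remaining at each vertex at every scale t."""
    kernels = [heat_kernel(L, t) for t in scales]
    return [[H[v][v] for H in kernels] for v in range(len(L))]


# Toy "mesh-graph": a path with four vertices, 0 - 1 - 2 - 3.
L = graph_laplacian(4, [(0, 1), (1, 2), (2, 3)])
for v, d in enumerate(multiscale_descriptor(L, scales=[0.5, 2.0])):
    print(v, [round(x, 3) for x in d])
```

On this toy graph the two end vertices receive identical descriptors, as do the two interior vertices, which reflects the symmetry of the path; small scales capture local connectivity while large scales reflect more global structure.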
Organizers: Sebastian Trimpe