Two talks for the price of one! I will present my recent work on the challenging problem of stereo matching of scenes with little or no surface texture, attacking the problem from two very different angles. First, I will discuss how surface orientation priors can be added to the popular semi-global matching (SGM) algorithm, which significantly reduces errors on slanted weakly-textured surfaces. The orientation priors serve as a soft constraint during matching and can be derived in a variety of ways, including from low-resolution matching results and from monocular analysis and Manhattan-world assumptions. Second, we will examine the pathological case of Mondrian Stereo -- synthetic scenes consisting solely of solid-colored planar regions, resembling paintings by Piet Mondrian. I will discuss assumptions that allow disambiguating such scenes, present a novel stereo algorithm employing symbolic reasoning about matched edge segments, and discuss how similar ideas could be utilized in robust real-world stereo algorithms for untextured environments.
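As a rough illustration of how an orientation prior can enter SGM's cost aggregation, here is a minimal single-scanline recursion in NumPy. The `slant` parameter is a hypothetical stand-in for the prior: it shifts which disparity transition is penalty-free, so a slanted surface no longer pays the fronto-parallel smoothness penalty. The actual algorithm aggregates over many path directions and derives the priors as described in the abstract.

```python
import numpy as np

def sgm_scanline(cost, P1=1.0, P2=8.0, slant=0.0):
    """Aggregate matching costs along one scanline, SGM-style.

    `slant` is the expected disparity change per pixel (the orientation
    prior): transitions that follow the slant incur no penalty, small
    deviations pay P1 and large ones P2. slant=0 recovers the classic
    fronto-parallel smoothness term.
    cost: (W, D) array of per-pixel matching costs.
    """
    W, D = cost.shape
    d = np.arange(D)
    # penalty[i, j]: cost of moving from disparity j to disparity i
    jump = np.abs(d[:, None] - (d[None, :] + slant))
    penalty = np.where(jump < 0.5, 0.0, np.where(jump < 1.5, P1, P2))
    L = np.empty_like(cost, dtype=float)
    L[0] = cost[0]
    for x in range(1, W):
        prev = L[x - 1]
        L[x] = cost[x] + np.min(prev[None, :] + penalty, axis=1) - prev.min()
    return L
```

With a cost volume whose minima follow a unit-slope slanted surface, setting `slant=1.0` lets the aggregated costs track the slope without paying P1/P2 at every pixel.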
Organizers: Anurag Ranjan
Humans act upon their environment through motion; the ability to plan their movements is therefore an essential component of their autonomy. In recent decades, motion planning has been widely studied in robotics and computer graphics. Nevertheless, robots still fail to achieve human-level reactivity and coordination. The need for more efficient motion planning algorithms has been present throughout my own research on "human-aware" motion planning, which aims to take the surrounding humans explicitly into account. I believe imitation learning is the key to this particular problem, as it allows learning both new motion skills and predictive models, two capabilities that are at the heart of "human-aware" robots, while simultaneously holding the promise of faster and more reactive motion generation. In this talk I will present my work in this direction.
In this talk I will first outline my different research projects. I will then focus on one project with applications in healthcare and introduce the Inter-Battery Topic Model (IBTM). Our approach extends traditional topic models by learning a factorized latent variable representation. The structured representation leads to a model that marries benefits traditionally associated with a discriminative approach, such as feature selection, with those of a generative model, such as principled regularization and the ability to handle missing data. The factorization is provided by representing data in terms of aligned pairs of observations as different views. This provides a means for selecting a representation that separately models topics shared by both views and topics unique to a single view. This structured consolidation allows for efficient and robust inference and provides a compact and efficient representation.
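To make the shared/private factorization concrete, here is a toy linear analogue of the inter-battery idea: an SVD of the cross-covariance of two paired views extracts a shared subspace, and what each view cannot express through it is left as a view-private residual. This is only a sketch of the factorization principle; IBTM itself is a topic model with an entirely different (probabilistic) parameterization.

```python
import numpy as np

def shared_private_split(X, Y, k):
    """Split two paired views into a shared subspace plus per-view
    private residuals via an SVD of the cross-covariance.
    A linear toy analogue of the inter-battery factorization, not IBTM.
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    U, s, Vt = np.linalg.svd(Xc.T @ Yc / len(X), full_matrices=False)
    Wx, Wy = U[:, :k], Vt[:k].T               # shared loadings per view
    Zx, Zy = Xc @ Wx, Yc @ Wy                 # shared representations
    Rx, Ry = Xc - Zx @ Wx.T, Yc - Zy @ Wy.T   # view-private residuals
    return Zx, Zy, Rx, Ry
```

On data where both views are driven by a common factor plus noise, the recovered shared coordinates of the two views are strongly correlated, while view-specific structure stays in the residuals.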
Understanding people in images and videos is a problem studied intensively in computer vision. While continuous progress has been made, occlusions, cluttered backgrounds, complex poses and the large variety of appearance remain challenging, especially in crowded scenes. In this talk, I will explore algorithms and tools that enable computers to interpret people's positions, motions and articulated poses in challenging real-world images and videos. More specifically, I will discuss an optimization problem whose feasible solutions define a decomposition of a given graph. I will highlight the applications of this problem in computer vision, which range from multi-person tracking [1,2,3] to motion segmentation [4]. I will also cover an extended optimization problem whose feasible solutions define both a decomposition of a given graph and a labeling of its nodes, with application to multi-person pose estimation [5].
References:
[1] Subgraph Decomposition for Multi-Object Tracking; S. Tang, B. Andres, M. Andriluka and B. Schiele; CVPR 2015
[2] Multi-Person Tracking by Multicut and Deep Matching; S. Tang, B. Andres, M. Andriluka and B. Schiele; arXiv 2016
[3] Multi-Person Tracking by Lifted Multicut and Person Re-identification; S. Tang, B. Andres, M. Andriluka and B. Schiele
[4] A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects; M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox and B. Schiele; arXiv 2016
[5] DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation; L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele; CVPR 2016
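The decomposition problem in question is the minimum-cost multicut: components of a valid decomposition correspond to tracked persons or motion segments. As a sketch of the formulation (not of the solvers used in the cited papers), the following brute-force enumerates all partitions of a tiny complete graph, paying `cut_cost[(u, v)]` whenever `u` and `v` land in different components; negative costs reward cutting.

```python
import itertools

def min_cost_multicut(n, cut_cost):
    """Brute-force the minimum-cost decomposition of a small complete
    graph. cut_cost[(u, v)] is paid when u and v end up in different
    components; negative values reward cutting. Real instances require
    ILP or heuristic multicut solvers instead of enumeration.
    """
    best, best_lab = float("inf"), None
    for lab in itertools.product(range(n), repeat=n):
        c = sum(w for (u, v), w in cut_cost.items() if lab[u] != lab[v])
        if c < best:
            best, best_lab = c, lab
    return best, best_lab
```

In tracking terms, nodes are detections, attractive (positive cut cost) edges link detections of the same person, and repulsive (negative) edges push different people into different components; the optimal decomposition emerges without fixing the number of tracks in advance.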
Organizers: Naureen Mahmood
Hand motion capture with an RGB-D sensor has recently gained a lot of research attention; however, even the most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a-priori knowledge of the object's shape and skeleton. In the case of unknown object shape, existing 3D reconstruction methods capitalize on distinctive geometric or texture features. These methods, though, fail for textureless and highly symmetric objects like household articles, mechanical parts or toys. We show that extracting 3D hand motion for in-hand scanning effectively facilitates the reconstruction of such objects, and we fuse the rich additional information of hands into a 3D reconstruction pipeline. Finally, although shape reconstruction is sufficient for rigid objects, there is a lack of tools that build rigged models of articulated objects that deform realistically. We propose a method that creates a fully rigged model, consisting of a watertight mesh, an embedded skeleton and skinning weights, by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.
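A minimal sketch of what "all components unified in a single objective" means in practice: below, a data term pulls two sphere centers toward observed positions while a collision penalty forbids interpenetration, and the sum is minimized with a plain finite-difference gradient descent. The terms, weights and optimizer here are illustrative stand-ins; the actual energy also includes salient-point and physics terms over a full articulated hand model.

```python
import numpy as np

def energy(theta, obs, radius=1.0, w_col=10.0):
    """Toy unified tracking objective: data term pulling two sphere
    centers toward observations + a collision penalty (illustrative
    weights, not the paper's energy)."""
    p, q = theta[:2], theta[2:]
    data = np.sum((p - obs[0]) ** 2) + np.sum((q - obs[1]) ** 2)
    overlap = max(0.0, 2 * radius - np.linalg.norm(p - q))
    return data + w_col * overlap ** 2

def descend(f, x0, step=0.02, iters=500, eps=1e-4):
    # plain finite-difference gradient descent, standing in for a
    # standard off-the-shelf optimizer
    x = np.asarray(x0, float).copy()
    for _ in range(iters):
        g = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                      for e in np.eye(len(x))])
        x = x - step * g
    return x
```

With observations only one radius apart, the optimum settles at a compromise: the spheres approach their observations but stop just short of interpenetrating, which is exactly the physically plausible behavior the combined objective is meant to produce.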
Organizers: Javier Romero
Matching between two sets arises in various areas of computer vision, such as feature point matching for 3D reconstruction, person re-identification for surveillance, or data association for multi-target tracking. Most previous work focused either on designing suitable features and matching cost functions, or on developing faster and more accurate solvers for quadratic or higher-order problems. In the first part of my talk, I will present a strategy for improving state-of-the-art solutions by efficiently computing the marginals of the joint matching probability. The second part of my talk will revolve around our recent work on online multi-target tracking using recurrent neural networks (RNNs). I will mention some fundamental challenges we encountered and present our current solution.
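To see what "marginals of the joint matching probability" refers to, here is the brute-force definition for a tiny assignment problem: a Gibbs distribution over permutations induced by the matching costs, with the marginal `M[i, j]` giving the probability that `i` is matched to `j` under any joint matching. Enumeration is only feasible for a handful of elements; the point of the talk is computing these quantities efficiently.

```python
import itertools
import numpy as np

def matching_marginals(cost, beta=1.0):
    """Exact marginals of p(pi) ∝ exp(-beta * sum_i cost[i, pi(i)])
    over all permutations pi (tiny instances only)."""
    n = cost.shape[0]
    M = np.zeros((n, n))
    Z = 0.0
    for pi in itertools.permutations(range(n)):
        w = np.exp(-beta * sum(cost[i, pi[i]] for i in range(n)))
        Z += w
        for i in range(n):
            M[i, pi[i]] += w
    return M / Z
```

The resulting matrix is doubly stochastic, and unlike the single best assignment it exposes how confident each individual match is, which is what makes marginals useful for improving hard matching solutions.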
The accurate reconstruction of facial shape is important for applications such as telepresence and gaming. It can be solved efficiently with the help of statistical shape models that constrain the shape of the reconstruction. In this talk, several methods to statistically analyze static and dynamic 3D face data are discussed. When statistically analyzing faces, various challenges arise from noisy, corrupt, or incomplete data. To overcome the limitations imposed by the poor data quality, we leverage redundancy in the data for shape processing. This is done by processing entire motion sequences in the case of dynamic data, and by jointly processing large databases in a groupwise fashion in the case of static data. First, a fully automatic approach to robustly register and statistically analyze facial motion sequences using a multilinear face model as statistical prior is proposed. Further, a statistical face model is discussed, which consists of many localized, decorrelated multilinear models. The localized and multi-scale nature of this model allows for recovery of fine-scale details while retaining robustness to severe noise and occlusions. Finally, the learning of statistical face models is formulated as a groupwise optimization framework that aims to learn a multilinear model while jointly optimizing the correspondence, or correcting the data.
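As a minimal sketch of the multilinear prior underlying these methods (the talk's models add localization and robust groupwise registration on top), a Tucker-style face model contracts a core tensor with identity and expression weights to produce one shape, and identity weights can be recovered from a shape by least squares. The tensor layout and helper names below are illustrative.

```python
import numpy as np

def multilinear_face(core, w_id, w_exp):
    """Evaluate a Tucker-style multilinear face model: the core tensor
    (stacked vertex coordinates x identity mode x expression mode)
    contracted with identity and expression weights gives one face."""
    return np.einsum('vie,i,e->v', core, w_id, w_exp)

def fit_identity(core, shape, w_exp):
    # least-squares identity weights for a given shape at a fixed expression
    B = np.einsum('vie,e->vi', core, w_exp)   # identity basis for this expression
    return np.linalg.lstsq(B, shape, rcond=None)[0]
```

Used as a statistical prior, such a model constrains a noisy or incomplete scan to the space of plausible faces while keeping identity and expression variation separated.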
In this talk we present recent results on human action recognition in videos. We first show how to use human pose for action recognition. To this end we propose a new pose-based convolutional neural network descriptor for action recognition, which aggregates motion and appearance information along tracks of human body parts. Next, we present an approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame level and then tracks high-scoring proposals through the video. Our tracker relies simultaneously on instance-level and class-level detectors. Actions are localized in time with a sliding-window approach at the track level. Finally, we show how to extend this method to weakly supervised learning of actions, which allows scaling to large amounts of data without manual annotation.
Typical human actions such as hand-shaking and drinking last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of single frames or short video clips and fail to model actions at their full temporal scale. In this work we learn video representations using neural networks with long-term temporal convolutions. We demonstrate that CNN models with increased temporal extents improve the accuracy of action recognition despite reduced spatial resolution. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition, UCF101 and HMDB51.
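The mechanics of "increased temporal extent" are easiest to see in a bare 3D convolution: the temporal size of the kernel (and of the input clip) determines how many frames one output value integrates, e.g. 60-frame clips instead of 16. The naive single-channel loop below is for clarity only; the actual networks use learned multi-channel 3D convolutions stacked in a CNN.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive valid 3D convolution over a (T, H, W) clip. The temporal
    kernel extent t is what long-term temporal convolutions enlarge."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out
```

Enlarging `t` while shrinking `H`, `W` trades spatial resolution for temporal context at roughly constant compute, which is the trade-off the abstract reports as beneficial for action recognition.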
Proper handling of occlusions is a major challenge for model-based reconstruction; in multi-view motion capture, for example, a key difficulty is the handling of occluding body parts. We propose a smooth volumetric scene representation, which implicitly turns occlusion into a smooth and differentiable phenomenon (ICCV 2015). Our ray-tracing image formation model makes it possible to express the objective as a single closed-form expression. This is in contrast to existing surface (mesh) representations, where occlusion is a local effect that causes non-differentiability and is difficult to optimize. We demonstrate improvements for multi-view scene reconstruction, rigid object tracking, and motion capture. Moreover, I will show an application of motion tracking to the interactive control of virtual characters (SIGGRAPH Asia 2015).
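The key property can be sketched with generic emission-absorption compositing along a single ray: transmittance decays smoothly with accumulated density, so a sample's visibility is a smooth, differentiable function of the volume rather than a hard in-front/behind test. This is a generic sketch of the principle, not the paper's exact image formation model.

```python
import numpy as np

def composite_ray(densities, colors, dt=1.0):
    """Emission-absorption compositing along one ray through a density
    volume. Returns (pixel color, residual transmittance behind the
    volume). Occlusion here is smooth in the densities, unlike the hard
    visibility test of a mesh renderer."""
    sigma = np.asarray(densities, float)
    alpha = 1.0 - np.exp(-sigma * dt)                        # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha))) # transmittance
    weights = trans[:-1] * alpha                             # per-sample contribution
    return weights @ np.asarray(colors, float), trans[-1]
```

Because every sample's weight varies smoothly with every density along the ray, the rendering objective stays differentiable even as one body part slides in front of another, which is what makes gradient-based optimization of occluded scenes tractable.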
The core focus of my research is on robot perception. Within this broad categorization, I am mainly interested in understanding how teams of robots and sensors can cooperate and/or collaborate to improve the perception of themselves (self-localization) as well as of their surroundings (target tracking, mapping, etc.). In this talk I will describe the inter-dependencies of such perception modules and present state-of-the-art methods to perform unified cooperative state estimation. The trade-off between accuracy of estimation and computational speed will be highlighted through a new optimization-based method for unified state estimation. Furthermore, I will also describe how perception-based multi-robot formation control can be achieved. Towards the end, I will present some recent results on cooperative vision-based target tracking and a few comments on our ongoing work regarding cooperative aerial mapping with human-in-the-loop.
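As a baseline illustration of cooperative estimation, here is the information-form fusion rule for combining several robots' Gaussian estimates of the same state: sum the information matrices and information vectors, then convert back. This toy rule assumes the estimates are independent; a central difficulty of unified cooperative estimation, and part of what the methods in the talk address, is that robots' estimates become correlated through shared observations.

```python
import numpy as np

def fuse_estimates(means, covs):
    """Information-form fusion of *independent* Gaussian estimates of
    one state: information (inverse-covariance) matrices and vectors
    simply add. Returns (fused mean, fused covariance)."""
    means = [np.asarray(m, float) for m in means]
    infos = [np.linalg.inv(C) for C in covs]
    P = np.linalg.inv(sum(infos))                 # fused covariance
    return P @ sum(I @ m for I, m in zip(infos, means)), P
```

Each additional estimate shrinks the fused covariance, so a team of robots observing a common target localizes it better than any single robot; more confident estimates (smaller covariance) dominate the fused mean.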
Modeling and reconstruction of shape and motion are problems of fundamental importance in computer vision. Inverse problem theory constitutes a powerful mathematical framework for dealing with ill-posed problems such as those typically arising in shape and motion modeling. In this talk, I will present methods inspired by inverse problem theory for dealing with four different shape and motion modeling problems. In particular, in the context of shape modeling, I will present a method for component-wise modeling of articulated objects and its application in computing 3D models of animals. Additionally, I will discuss the problem of modeling specular surfaces via the properties of their material, and I will also present a model for confidence-driven depth image fusion based on total variation regularization. Regarding motion, I will discuss a method for the recognition of human actions from motion capture data based on nonparametric Bayesian models.
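The flavor of confidence-driven, total-variation-regularized fusion can be shown in 1D: several depth hypotheses with per-sample confidences are fused by minimizing a confidence-weighted data term plus a TV penalty on the result. The smoothed-TV gradient descent and all parameters below are illustrative; the actual method is a 2D variational formulation with a proper TV solver.

```python
import numpy as np

def tv_fuse(depths, conf, lam=0.5, step=0.01, iters=2000, eps=1e-2):
    """1D toy of confidence-driven depth fusion with TV regularization:
    minimize sum_k conf_k * (u - depth_k)^2 + lam * sum |u[i+1] - u[i]|
    by gradient descent on an eps-smoothed TV term.
    depths, conf: (K, N) arrays of depth hypotheses and confidences."""
    depths = np.asarray(depths, float)
    conf = np.asarray(conf, float)
    u = (conf * depths).sum(0) / conf.sum(0)   # init: weighted average
    for _ in range(iters):
        grad_data = 2.0 * (conf * (u - depths)).sum(0)
        d = np.diff(u)
        g = d / np.sqrt(d * d + eps)           # smoothed sign of each jump
        grad_tv = np.zeros_like(u)
        grad_tv[:-1] -= g
        grad_tv[1:] += g
        u -= step * (grad_data + lam * grad_tv)
    return u
```

Low-confidence outliers are suppressed far below their naive weighted average, while genuine depth discontinuities survive because TV penalizes a jump's presence, not its height.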