Estimating human pose, shape, and motion from images and video are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.
Organizers: Dimitrios Tzionas
From gait and dance to martial arts, human movements provide rich, complex, yet coherent spatiotemporal patterns that reflect characteristics of a group or an individual. We develop computer algorithms to automatically learn such high-quality discriminative features from multimodal data. In this talk, I present a trilogy on learning from human movements: (1) Gait analysis from video data: based on frieze patterns (7 frieze groups), a video sequence of silhouettes is mapped into a pair of spatiotemporal patterns that are near-periodic along the time axis. A group-theoretical analysis of periodic patterns allows us to determine the dynamic time warping and affine scaling that align two gait sequences from similar viewpoints for human identification. (2) Dance analysis and synthesis (mocap, music, ratings from Mechanical Turk workers): we explore the complex relationships between dance movements and perceived dance quality and dancer's gender, respectively. As a feasibility study, we construct a computational framework for an analysis-synthesis-feedback loop using a novel multimedia dance-texture representation for joint angular displacement, velocity and acceleration. Furthermore, we integrate crowdsourcing, music and motion-capture data, and machine learning-based methods for dance segmentation, analysis and synthesis of new dancers. A quantitative validation of this framework on a motion-capture dataset of 172 dancers evaluated by more than 400 independent online raters demonstrates significant correlation between human perception and the algorithmically intended dance quality or gender of the synthesized dancers. (3) Tai Chi performance evaluation (mocap + video): I shall also discuss the feasibility of utilizing spatiotemporal synchronization and, ultimately, machine learning to evaluate Tai Chi routines performed by different subjects in our current project, “Tai Chi + Advanced Technology for Smart Health”.
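As a rough illustration of the dynamic time warping mentioned in part (1), the following is a minimal sketch of the textbook algorithm on 1-D signals. It is not the speaker's method, which derives the warping from a group-theoretical analysis of frieze patterns; the function name `dtw_distance` and the toy signals are assumptions made for illustration only.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical gait-like signal and a copy played at half speed.
base = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
aligned_cost = dtw_distance(base, stretched)
```

Because the second signal is only a time-stretched copy of the first, the warping absorbs the speed difference and the alignment cost is zero; sequences from different walkers would generally leave a nonzero residual.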
There has been significant prior work on learning realistic, articulated, 3D statistical shape models of the human body. In contrast, there are few such models for animals, despite their many applications in biology, neuroscience, agriculture, and entertainment. The main challenge is that animals are much less cooperative subjects than humans: the best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. In the talk I will illustrate how we extend a state-of-the-art articulated 3D human body model (SMPL) to animals by learning from toys a multi-family shape space that can represent lions, cats, dogs, horses, cows and hippos. The generalization of the model is illustrated by fitting it to images of real animals, where it captures realistic animal shapes, even for new species not seen in training.
In this talk I am going to present the work we have been doing at the Computer Vision Lab of the Technical University of Munich, which started as an attempt to better deal with videos (and therefore the time domain) within neural network architectures.
Organizers: Joel Janai
Kathleen is the creator of the well-known CAESAR anthropometric dataset and is an expert on body shape and apparel fit.
Organizers: Javier Romero
In this talk I will present the portfolio of work we conduct in our lab, and I will present three recent bodies of work in more detail. The first is our work on learning 6D object pose estimation and camera localization from RGB or RGB-D images. I will show that by utilizing the concept of uncertainty and learning to score hypotheses, we can improve the state of the art. Secondly, I will present a new approach for inferring multiple diverse labelings in a graphical model. Besides guaranteeing an exact solution, our method is also faster than existing techniques. Finally, I will present recent work in which we show that popular Auto-context Decision Forests can be mapped to Deep ConvNets for semantic segmentation. We use this to detect the spine of a zebrafish in cases where little training data is available.
Organizers: Aseem Behl
We propose a new computational framework for combinatorial problems arising in machine learning and computer vision. This framework is a special case of Lagrangian (dual) decomposition, but allows for efficient dual ascent (message passing) optimization. In a sense, one can understand both the framework and the optimization technique as a generalization of those for standard undirected graphical models (conditional random fields). We will give an overview of our recent results and plans for the near future.
Organizers: Aseem Behl
In this talk I will first outline my different research projects. I will then focus on one project with applications in health, and introduce the Inter-Battery Topic Model (IBTM). Our approach extends traditional topic models by learning a factorized latent variable representation. The structured representation leads to a model that marries benefits traditionally associated with a discriminative approach, such as feature selection, with those of a generative model, such as principled regularization and the ability to handle missing data. The factorization is provided by representing data in terms of aligned pairs of observations as different views. This provides a means of selecting a representation that models topics that exist in both views separately from topics that are unique to a single view. This structured consolidation allows for efficient and robust inference and provides a compact and efficient representation.
Understanding people in images and videos is a problem studied intensively in computer vision. While continuous progress has been made, occlusions, cluttered backgrounds, complex poses and a large variety of appearances remain challenging, especially in crowded scenes. In this talk, I will explore the algorithms and tools that enable computers to interpret people's position, motion and articulated poses in challenging real-world images and videos. More specifically, I will discuss an optimization problem whose feasible solutions define a decomposition of a given graph. I will highlight the applications of this problem in computer vision, which range from multi-person tracking [1,2,3] to motion segmentation [4]. I will also cover an extended optimization problem whose feasible solutions define a decomposition of a given graph and a labeling of its nodes, with an application to multi-person pose estimation [5].
References:
[1] Subgraph Decomposition for Multi-Object Tracking; S. Tang, B. Andres, M. Andriluka and B. Schiele; CVPR 2015
[2] Multi-Person Tracking by Multicut and Deep Matching; S. Tang, B. Andres, M. Andriluka and B. Schiele; arXiv 2016
[3] Multi-Person Tracking by Lifted Multicut and Person Re-identification; S. Tang, B. Andres, M. Andriluka and B. Schiele
[4] A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects; M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox and B. Schiele; arXiv 2016
[5] DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation; L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele; CVPR 2016
Organizers: Naureen Mahmood
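To make the graph decomposition problem from the multi-person tracking abstract above more concrete, here is a minimal brute-force sketch on a tiny graph. It assumes the common minimum cost multicut convention in which each edge carries a cost that is paid when the edge is cut, so negative costs reward separating its endpoints. The function name, the cost values and the exhaustive search are illustrative assumptions; the cited papers rely on specialized solvers, not enumeration.

```python
from itertools import product


def best_decomposition(num_nodes, edge_costs):
    """Exhaustively search node labelings for the cheapest set of cut edges.

    edge_costs maps (u, v) -> cost paid when the edge is cut (assumed
    convention). Deriving cut edges from node labels guarantees that the
    cut set is consistent with some decomposition of the graph.
    """
    best_cost, best_labels = float("inf"), None
    for labels in product(range(num_nodes), repeat=num_nodes):
        # An edge is cut exactly when its endpoints carry different labels.
        cost = sum(c for (u, v), c in edge_costs.items() if labels[u] != labels[v])
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


# Hypothetical 4-cycle: detections 0-1 and 2-3 look alike (cutting is costly),
# while 1-2 and 0-3 look different (cutting is rewarded).
edge_costs = {(0, 1): 2.0, (2, 3): 3.0, (1, 2): -1.5, (0, 3): -0.5}
cost, labels = best_decomposition(4, edge_costs)
```

On this toy instance the search groups nodes 0 and 1 into one component and nodes 2 and 3 into another. The number of components is not fixed in advance; it falls out of the costs, which is what makes the formulation attractive when the number of people in a scene is unknown.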
Hand motion capture with an RGB-D sensor has recently gained a lot of research attention; however, even the most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a priori knowledge of the object's shape and skeleton. When the object's shape is unknown, there are existing 3D reconstruction methods that capitalize on distinctive geometric or texture features. These methods, however, fail for textureless and highly symmetric objects such as household articles, mechanical parts or toys. We show that extracting 3D hand motion for in-hand scanning effectively facilitates the reconstruction of such objects, and we fuse the rich additional information from the hands into a 3D reconstruction pipeline. Finally, although shape reconstruction is enough for rigid objects, there is a lack of tools that build rigged models of articulated objects that deform realistically. We propose a method that creates a fully rigged model consisting of a watertight mesh, an embedded skeleton and skinning weights by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.
Organizers: Javier Romero
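The hand motion capture abstract above describes a rigged model made of a mesh, an embedded skeleton and skinning weights. As a minimal sketch of how such weights are typically used, the example below applies linear blend skinning in 2D. Linear blend skinning is an assumption here, since the abstract does not name the deformation model, and the helper names `rotation` and `skin` as well as the two-bone setup are purely illustrative.

```python
import math


def rotation(angle, pivot):
    """Return a function that rotates a 2D point by `angle` radians about `pivot`."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = pivot

    def apply(p):
        x, y = p[0] - px, p[1] - py
        return (c * x - s * y + px, s * x + c * y + py)

    return apply


def skin(vertex, weights, bone_transforms):
    """Linear blend skinning: weighted average of the vertex moved by each bone."""
    moved = [t(vertex) for t in bone_transforms]
    x = sum(w * m[0] for w, m in zip(weights, moved))
    y = sum(w * m[1] for w, m in zip(weights, moved))
    return (x, y)


# Hypothetical two-bone chain along the x-axis with a joint at x = 1.
identity = rotation(0.0, (0.0, 0.0))
bend = rotation(math.pi / 2, (1.0, 0.0))  # second bone bends 90 degrees at the joint

tip = skin((2.0, 0.0), [0.0, 1.0], [identity, bend])         # follows the second bone only
near_joint = skin((1.0, 0.0), [0.5, 0.5], [identity, bend])  # blended between both bones
```

A vertex weighted entirely to the bent bone swings with it, while a vertex sitting on the joint stays in place regardless of how its weights are split. Estimating per-vertex weights so that the deformation looks realistic is the part the rigging method in the abstract addresses.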