We propose a new architecture for the learning of predictive spatio-temporal motion models from data alone. Our approach, dubbed the Dropout Autoencoder LSTM, is capable of synthesizing natural looking motion sequences over long time horizons without catastrophic drift or mo- tion degradation. The model consists of two components, a 3-layer recurrent neural network to model temporal aspects and a novel auto-encoder that is trained to implicitly recover the spatial structure of the human skeleton via randomly removing information about joints during train- ing time. This Dropout Autoencoder (D-AE) is then used to filter each predicted pose of the LSTM, reducing accumulation of error and hence drift over time. Furthermore, we propose new evaluation protocols to assess the quality of synthetic motion sequences even for which no groundtruth data exists. The proposed protocols can be used to assess generated sequences of arbitrary length. Finally, we evaluate our proposed method on two of the largest motion- capture datasets available to date and show that our model outperforms the state-of-the-art on a variety of actions, including cyclic and acyclic motion, and that it can produce natural looking sequences over longer time horizons than previous methods.
Organizers: Gerard Pons-Moll
Estimating 3D shape from monocular 2D images is a challenging and ill-posed problem. Some of these challenges can be alleviated if 3D shape priors are taken into account. In the field of human body shape estimation, research has shown that accurate 3D body estimations can be achieved through optimization, by minimizing error functions on image cues, such as e.g. the silhouette. These methods though, tend to be slow and typically require manual interactions (e.g. for pose estimation). In this talk, we present some recent works that try to overcome such limitations, achieving interactive rates, by learning mappings from 2D image to 3D shape spaces, utilizing data-driven priors, generated from statistically learned parametric shape models. We demonstrate this, either by extracting handcrafted features or directly utilizing CNN-s. Furthermore, we introduce the notion and application of cross-modal or multi-view learning, where abundance of data coming from various views representing the same object at training time, can be leveraged in a semi-supervised setting to boost estimations at test time. Additionally, we show similar applications of the above techniques for the task of 3D garment estimation from a single image.
Organizers: Gerard Pons-Moll
Human observers can classify photographs of real-world scenes after only a very brief exposure to the image (Potter & Levy, 1969; Thorpe, Fize, Marlot, et al., 1996; VanRullen & Thorpe, 2001). Line drawings of natural scenes have been shown to capture essential structural information required for successful scene categorization (Walther et al., 2011). Here, we investigate how the spatial relationships between lines and line segments in the line drawings affect scene classification. In one experiment, we tested the effect of removing either the junctions or the middle segments between junctions. Surprisingly, participants performed better when shown the middle segments (47.5%) than when shown the junctions (42.2%). It appeared as if the images with middle segments tended to maintain the most parallel/locally symmetric portions of the contours. In order to test this hypothesis, in a second experiment, we either removed the most symmetric half of the contour pixels or the least symmetric half of the contour pixels using a novel method of measuring the local symmetry of each contour pixel in the image. Participants were much better at categorizing images containing the most symmetric contour pixels (49.7%) than the least symmetric (38.2%). Thus, results from both experiments demonstrate that local contour symmetry is a crucial organizing principle in complex real-world scenes. Joint work with John Wilder (UofT CS, Psych), Morteza Rezanejad (McGill CS), Kaleem Siddiqi (McGill CS), Allan Jepson (UofT CS), and Dirk Bernhardt-Walther (UofT Psych), to be presented at VSS 2017.
Organizers: Ahmed Osman
Dynamic events such as family gatherings, concerts or sports events are often photographed by a group of people. The set of still images obtained this way is rich in dynamic content. We consider the question of whether such a set of still images, rather the traditional video sequences, can be used for analyzing the dynamic content of the scene. This talk will describe several instances of this problem, their solutions and directions for future studies. In particular, we will present a method to extend epipolar geometry to predict location of a moving feature in CrowdCam images. The method assumes that the temporal order of the set of images, namely photo-sequencing, is given. We will briefly describe our method to compute photo-sequencing using geometric considerations and rank aggregation. We will also present a method for identifying the moving regions in a scene, which is a basic component in dynamic scene analysis. Finally, we will consider a new vision of developing collaborative CrowdCam, and a first step toward this goal.
Organizers: Jonas Wulff
This talk addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a “visual memory” in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given video frames as input, our approach first assigns each pixel an object or background label obtained with an encoder-decoder network that takes as input optical flow and is trained on synthetic data. Next, a “visual memory” specific to the video is acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allows to propagate spatial information over time. We evaluate our method extensively on two benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. This is joint work with K. Alahari and P. Tokmakov.
Organizers: Osman Ulusoy
I'll present my master thesis "Biquadratic Forms and Semi-Definite Relaxations". It is about biquadratic optimization programs (which are NP-hard generally) and examines a condition under which there exists an algorithm that finds a solution to every instance of the problem in polynomial time. I'll present a counterexample for which this is not possible generally and face the question of what happens if further knowledge about the variables over which we optimise is applied.
Organizers: Fatma Güney
A large part of image analysis is about breaking things into pieces. Decompositions of a graph are a mathematical abstraction of the possible outcomes. This talk is about optimization problems whose feasible solutions define decompositions of a graph. One example is the correlation clustering problem whose feasible solutions relate one-to-one to the decompositions of a graph, and whose objective function puts a cost or reward on neighboring nodes ending up in distinct components. This talk shows applications of this problem and proposed generalizations to diverse image analysis tasks. It sketches algorithms for finding feasible solutions for large instances in practice, solutions that are often superior in the metrics of application-specific benchmarks. It also sketches algorithms for finding lower bounds and points to new findings and open problems of polyhedral geometry in this context.
Organizers: Christoph Lassner
Estimating human pose, shape, and motion from images and video are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.
Organizers: Dimitris Tzionas
From gait, dance to martial art, human movements provide rich, complex yet coherent spatiotemporal patterns reflecting characteristics of a group or an individual. We develop computer algorithms to automatically learn such quality discriminative features from multimodal data. In this talk, I present a trilogy on learning from human movements: (1) Gait analysis from video data: based on frieze patterns (7 frieze groups), a video sequence of silhouettes is mapped into a pair of spatiotemporal patterns that are near-periodic along the time axis. A group theoretical analysis of periodic patterns allows us to determine the dynamic time warping and affine scaling that aligns two gait sequences from similar viewpoints for human identification. (2) Dance analysis and synthesis (mocap, music, ratings from Mechanical Turks): we explore the complex relationship between perceived dance quality/dancer's gender and dance movements respectively. As a feasibility study, we construct a computational framework for an analysis-synthesis-feedback loop using a novel multimedia dance-texture representation for joint angular displacement, velocity and acceleration. Furthermore, we integrate crowd sourcing, music and motion-capture data, and machine learning-based methods for dance segmentation, analysis and synthesis of new dancers. A quantitative validation of this framework on a motion-capture dataset of 172 dancers evaluated by more than 400 independent on-line raters demonstrates significant correlation between human perception and the algorithmically intended dance quality or gender of the synthesized dancers. (3) Tai Chi performance evaluation (mocap + video): I shall also discuss the feasibility of utilizing spatiotemporal synchronization and, ultimately, machine learning to evaluate Tai Chi routines performed by different subjects in our current project of “Tai Chi + Advanced Technology for Smart Health”.
There has been significant prior work on learning realistic, articulated, 3D statistical shape models of the human body. In contrast, there are few such models for animals, despite their many applications in biology, neuroscience, agriculture, and entertainment. The main challenge is that animals are much less cooperative subjects than humans: the best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. In the talk I will illustrate how we extend a state-of-the-art articulated 3D human body model (SMPL) to animals learning from toys a multi-family shape space that can represent lions, cats, dogs, horses, cows and hippos. The generalization of the model is illustrated by fitting it to images of real animals, where it captures realistic animal shapes, even for new species not seen in training.