There has been significant prior work on learning realistic, articulated, 3D statistical shape models of the human body. In contrast, there are few such models for animals, despite their many applications in biology, neuroscience, agriculture, and entertainment. The main challenge is that animals are much less cooperative subjects than humans: the best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. In the talk I will illustrate how we extend a state-of-the-art articulated 3D human body model (SMPL) to animals, learning from toys a multi-family shape space that can represent lions, cats, dogs, horses, cows and hippos. The generalization of the model is illustrated by fitting it to images of real animals, where it captures realistic animal shapes, even for new species not seen in training.
Computer vision problems often involve optimization of two quantities, one of which is time. Such problems can be formulated as time-constrained optimization or as a performance-constrained search for the fastest algorithm. We show that it is possible to obtain quasi-optimal time-constrained solutions to some vision problems by applying Wald's theory of sequential decision-making. Wald assumes independence of observations, which rarely holds in computer vision. We address the problem by combining Wald's sequential probability ratio test and AdaBoost. The solution, called WaldBoost, can be viewed as a principled way to build a close-to-optimal “cascade of classifiers” of the Viola-Jones type. The approach will be demonstrated on four tasks: (i) face detection, (ii) establishing reliable correspondences between images, (iii) real-time detection of interest points and (iv) model search and outlier detection using RANSAC. In the face detection problem, the objective is to learn the fastest detector satisfying constraints on false positive and false negative rates. Correspondence pruning addresses the problem of fast selection with a predefined false negative rate. In the interest point problem, we show how a fast implementation of known detectors can be obtained by WaldBoost. The “mimicked” detectors provide a training set of positive and negative examples of interest points, and WaldBoost learns a detector, formed as a linear combination of efficiently computable features, that is (significantly) faster than the providers of the training set. In RANSAC, we show how to exploit Wald's test in a randomised model verification procedure to obtain an algorithm significantly faster than deterministic verification yet with equivalent probabilistic guarantees of correctness.
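As a rough illustration of the sequential probability ratio test mentioned above, the following minimal sketch runs Wald's test on a stream of independent binary observations. It is not WaldBoost itself: the Bernoulli observation model, the function name `sprt`, and the parameters `p0`, `p1`, `alpha` and `beta` are assumptions made for this example, and the thresholds are Wald's standard approximations.

```python
import math

def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for Bernoulli data (illustrative assumption, not WaldBoost).

    Tests H0: P(x=1) = p0 against H1: P(x=1) = p1.
    alpha is the target false positive rate, beta the target false negative rate.
    Returns (decision, number_of_observations_used).
    """
    # Wald's approximate decision thresholds on the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this value
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this value

    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Accumulate the log-likelihood ratio of the new observation.
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        # Stop as soon as either threshold is crossed.
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    # Ran out of data before reaching a decision.
    return "undecided", n
```

The point of the test is that it stops early: with clearly separated hypotheses, only a few observations are consumed before a decision is made, which is the source of the speed-ups described in the abstract.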
Organizers: Gerard Pons-Moll
Stereo matching -- establishing correspondences between images taken from nearby viewpoints -- is one of the oldest problems in computer vision. While impressive progress has been made over the last two decades, most current stereo methods do not scale to the high-resolution images taken by today's cameras since they require searching the full space of all possible disparity hypotheses over all pixels.
In this talk I will describe a new scalable stereo method that only evaluates a small portion of the search space. The method first generates plane hypotheses from matched sparse features, which are then refined into surface hypotheses using local slanted plane sweeps over a narrow disparity range. Finally, each pixel is assigned to one of the local surface hypotheses. The technique is significantly faster than previous algorithms and achieves state-of-the-art accuracy on high-resolution stereo pairs of up to 19 megapixels.
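To make the idea of a plane hypothesis concrete, here is a minimal sketch under a loudly stated assumption: a hypothesis is modelled as a slanted disparity plane d = a*x + b*y + c through three matched features, and the narrow search range is a fixed band around that plane. The abstract does not specify how hypotheses are actually generated or refined, so the helper names `plane_from_matches` and `disparity_band` are hypothetical.

```python
def plane_from_matches(m0, m1, m2):
    """Fit d = a*x + b*y + c through three matches given as (x, y, d).

    Hypothetical helper for illustration. Returns (a, b, c), or None when
    the three image points are (nearly) collinear and define no plane.
    """
    (x0, y0, d0), (x1, y1, d1), (x2, y2, d2) = m0, m1, m2
    # Solve the 2x2 system for the disparity gradient (a, b) by Cramer's rule.
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if abs(det) < 1e-12:
        return None
    a = ((d1 - d0) * (y2 - y0) - (d2 - d0) * (y1 - y0)) / det
    b = ((x1 - x0) * (d2 - d0) - (x2 - x0) * (d1 - d0)) / det
    c = d0 - a * x0 - b * y0
    return a, b, c


def disparity_band(plane, x, y, half_width=2.0):
    """Narrow disparity interval around the plane at pixel (x, y)."""
    a, b, c = plane
    d = a * x + b * y + c
    return d - half_width, d + half_width
```

Searching only such a band per pixel, instead of the full disparity range, is what keeps the evaluated portion of the search space small in this simplified picture.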
I will also present a new dataset of high-resolution stereo pairs with subpixel-accurate ground truth, and provide a brief outlook on the upcoming new version of the Middlebury stereo benchmark.
This talk will give an overview of some of the research in the Image and Video Computing Group at Boston University related to image- and video-based analysis of humans and their behavior, including: tracking humans, localizing and classifying actions in space-time, exploiting contextual cues in action classification, estimating human pose from images, analyzing the communicative behavior of children in video, and sign language recognition and retrieval.
Collaborators in this work include (in alphabetical order): Vassilis Athitsos, Qinxun Bai, Margrit Betke, R. Gokberk Cinbis, Kun He, Nazli Ikizler-Cinbis, Hao Jiang, Liliana Lo Presti, Shugao Ma, Joan Nash, Carol Neidle, Agata Rozga, Tai-peng Tian, Ashwin Thangali, Zheng Wu, and Jianming Zhang.
Organizers: Gerard Pons-Moll
This talk presents our 3D video production method, with which a user can watch a real game from any free viewpoint. Players in the game are captured by 10 cameras and reproduced three-dimensionally in real time using a billboard-based representation. In producing the 3D video, we have also worked on a good user interface that enables people to move the camera intuitively. As the speaker also works on a wide variety of topics from computer vision to augmented reality, selected recent works will also be introduced briefly.
Dr. Yoshinari Kameda began his research with human pose estimation, the topic of his Ph.D. thesis, and has since expanded his interests to computer vision, human interfaces, and augmented reality.
He is now an associate professor at University of Tsukuba.
He is also a member of the Center for Computational Science of U-Tsukuba, where some outstanding supercomputers are in operation.
He served the International Symposium on Mixed and Augmented Reality as an area chair for four years (2007-2010).
3D reconstruction from 2D still images (Structure-from-Motion, SfM) has reached maturity, and together with new image acquisition devices such as Micro Aerial Vehicles (MAVs), new and interesting application scenarios arise. However, acquiring an image set that is suited for a complete and accurate reconstruction is a non-trivial task, even for expert users. To overcome this problem, we propose two different methods. In the first part of the talk, we will present an SfM method that performs sparse reconstruction from 10Mpx still images and surface extraction from sparse and noisy 3D point clouds in real time. To this end, we developed a novel, efficient image localisation method and a robust surface extraction method that works in a fully incremental manner directly on sparse 3D points, without a densification step. The real-time feedback on the reconstruction quality then enables the user to control the acquisition process interactively. In the second part, we will present ongoing work on a novel view planning method that is designed to deliver a set of images that can be processed by today's multi-view reconstruction pipelines.
This talk will highlight recent progress on two fronts. First, we will talk about a novel image-conditioned person model that allows for effective articulated pose estimation in realistic scenarios. Second, we describe our work towards activity recognition and the ability to describe video content with natural language.
Both efforts are part of a longer-term agenda towards visual scene understanding. While visual scene understanding has long been advocated as the "holy grail" of computer vision, we believe it is time to address this challenge again, based on the progress in recent years.
In this talk, I will show that, given probabilities of presence of people at various locations in individual time frames, finding the most likely set of trajectories amounts to solving a linear program that depends on very few parameters.
This can be done without requiring appearance information and in real-time, by using the K-Shortest Paths algorithm (KSP). However, this can result in unwarranted identity switches in complex scenes. In such cases, sparse image information can be used within the Linear Programming framework to keep track of people's identities, even when their paths come close to each other or intersect. By sparse, we mean that the appearance needs only be discriminative in a very limited number of frames, which makes our approach widely applicable.
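The link between occupancy probabilities and shortest paths can be sketched for the simplest case of a single person. The example below is an illustration under assumptions, not the KSP method itself: it finds one minimum-cost path through a frame-by-location grid by dynamic programming, with the per-location cost assumed to be -log(p / (1 - p)) and a user-supplied `neighbors` map (a hypothetical input) describing which locations are reachable between consecutive frames.

```python
import math

def best_trajectory(prob, neighbors):
    """Single most likely trajectory through a time/location grid.

    prob[t][i]   : probability that location i is occupied in frame t.
    neighbors[i] : locations reachable from i in one frame (assumed input).
    Returns one location index per frame. This is a one-path simplification
    for illustration; it does not handle multiple people or identities.
    """
    def cost(p):
        # Clamp to avoid log(0); likely locations get negative cost.
        p = min(max(p, 1e-6), 1 - 1e-6)
        return -math.log(p / (1 - p))

    n = len(prob[0])
    best = [cost(p) for p in prob[0]]
    back = []  # back[t-1][j] = predecessor of location j at frame t
    for t in range(1, len(prob)):
        new_best = [math.inf] * n
        ptr = [-1] * n
        for i in range(n):
            for j in neighbors[i]:
                c = best[i] + cost(prob[t][j])
                if c < new_best[j]:
                    new_best[j] = c
                    ptr[j] = i
        back.append(ptr)
        best = new_best

    # Trace back from the cheapest final location.
    end = min(range(n), key=best.__getitem__)
    path = [end]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

For example, with three locations in a row and a person moving from the first to the last over three frames, `best_trajectory([[0.9, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.9]], {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]})` follows the high-probability cells. Extending this to several non-overlapping paths is where the K-Shortest Paths formulation comes in.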
Manifold learning techniques attempt to map a high-dimensional space onto a lower-dimensional one. From a mathematical point of view, a manifold is a topological Hausdorff space that is locally Euclidean. From a machine learning point of view, we can interpret this embedded manifold as the underlying support of the data distribution. When dealing with high-dimensional data sets, nonlinear dimensionality reduction methods can provide a more faithful data representation than linear ones. However, the local geometrical distortion induced by the nonlinear mapping leads to a loss of information and affects interpretability, with a negative impact on the model visualization results.
This talk will discuss an approach which involves probabilistic nonlinear dimensionality reduction through Gaussian Process Latent Variable Models. The main focus is on the intrinsic geometry of the model itself as a tool to improve the exploration of the latent space and to recover information lost due to dimensionality reduction. We aim to analytically quantify and visualize the distortion due to dimensionality reduction in order to improve the performance of the model and to interpret data in a more faithful way.
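One standard way to express such local distortion, given here as a hedged sketch rather than as the formulation used in the talk, is the metric that a smooth mapping f from the latent space to the data space induces on the latent space. The symbols f, J, G and MF are assumptions introduced for this illustration.

```latex
% Assumed notation: f maps latent points x in R^q to data points y in R^D.
\[
  J(x) = \frac{\partial f(x)}{\partial x} \in \mathbb{R}^{D \times q},
  \qquad
  G(x) = J(x)^{\top} J(x) \in \mathbb{R}^{q \times q}
\]
% Length of a latent curve c(t), measured in data space:
\[
  \operatorname{Length}(c) = \int_{0}^{1}
  \sqrt{\dot{c}(t)^{\top}\, G\bigl(c(t)\bigr)\, \dot{c}(t)}\; dt
\]
% Local area or volume magnification at x:
\[
  \mathrm{MF}(x) = \sqrt{\det G(x)}
\]
```

Where G(x) is close to the identity the latent space is locally undistorted, while large values of MF(x) mark regions where small latent steps correspond to large changes in data space.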
In collaboration with: N.D. Lawrence (University of Sheffield), A. Vellido (UPC)
Perceptual grouping played a prominent role in support of early object recognition systems, which typically took an input image and a database of shape models and identified which of the models was visible in the image. When the database was large, local features were not sufficiently distinctive to prune down the space of models to a manageable number that could be verified. However, when causally related shape features were grouped, using intermediate-level shape priors, e.g., cotermination, symmetry, and compactness, they formed effective shape indices and allowed databases to grow in size. In recent years, the recognition (categorization) community has focused on the object detection problem, in which the input image is searched for a specific target object. Since indexing is not required to select the target model, perceptual grouping is not required to construct a discriminative shape index; the existence of a much stronger object-level shape prior precludes the need for a weaker intermediate-level shape prior. As a result, perceptual grouping activity at our major conferences has diminished. However, there are clear signs that the recognition community is moving from appearance back to shape, and from detection back to unexpected object recognition. Shape-based perceptual grouping will play a critical role in facilitating this transition. But while causally related features must be grouped, they also need to be abstracted before they can be matched to categorical models. In this talk, I will describe our recent progress on the use of intermediate shape priors in segmenting, grouping, and abstracting shape features. Specifically, I will describe the use of symmetry and non-accidental attachment to detect and group symmetric parts, the use of closure to separate figure from background, and the use of a vocabulary of simple shape models to group and abstract image contours.