

2020


Learning to Dress 3D People in Generative Clothing

Ma, Q., Yang, J., Ranjan, A., Pujades, S., Pons-Moll, G., Tang, S., Black, M. J.

In Computer Vision and Pattern Recognition (CVPR), June 2020 (inproceedings)

Abstract
Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shape. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term on SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses.

arXiv project page [BibTex]



Generating 3D People in Scenes without People

Zhang, Y., Hassan, M., Neumann, H., Black, M. J., Tang, S.

In Computer Vision and Pattern Recognition (CVPR), June 2020 (inproceedings)

Abstract
We present a fully-automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene. Given a 3D scene without people, humans can easily imagine how people could interact with the scene and the objects in it. However, this is a challenging task for a computer, as solving it requires that (1) the generated human bodies be semantically plausible within the 3D environment, e.g. people sitting on the sofa or cooking near the stove; and (2) the generated human-scene interaction be physically feasible, in the sense that the human body and scene do not interpenetrate while, at the same time, body-scene contact supports physical interactions. To that end, we make use of the surface-based 3D human model SMPL-X. We first train a conditional variational autoencoder to predict semantically plausible 3D human pose conditioned on latent scene representations, then we further refine the generated 3D bodies using scene constraints to enforce feasible physical interaction. We show that our approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with the 3D environment. We perform extensive experiments demonstrating that our generative framework compares favorably with existing methods, both qualitatively and quantitatively. We believe that our scene-conditioned 3D human generation pipeline will be useful for numerous applications; e.g. to generate training data for human pose estimation, in video games and in VR/AR.

PDF link (url) [BibTex]



Learning Physics-guided Face Relighting under Directional Light

Nestmeyer, T., Lalonde, J., Matthews, I., Lehrmann, A. M.

In Conference on Computer Vision and Pattern Recognition, IEEE/CVF, June 2020 (inproceedings) Accepted

Abstract
Relighting is an essential step in realistically transferring objects from a captured image into another environment. For example, authentic telepresence in Augmented Reality requires faces to be displayed and relit consistent with the observer's scene lighting. We investigate end-to-end deep learning architectures that both de-light and relight an image of a human face. Our model decomposes the input image into intrinsic components according to a diffuse physics-based image formation model. We enable non-diffuse effects including cast shadows and specular highlights by predicting a residual correction to the diffuse render. To train and evaluate our model, we collected a portrait database of 21 subjects with various expressions and poses. Each sample is captured in a controlled light stage setup with 32 individual light sources. Our method creates precise and believable relighting results and generalizes to complex illumination conditions and challenging poses, including when the subject is not looking straight at the camera.

Paper [BibTex]



VIBE: Video Inference for Human Body Pose and Shape Estimation

Kocabas, M., Athanasiou, N., Black, M. J.

In Computer Vision and Pattern Recognition (CVPR), June 2020 (inproceedings)

Abstract
Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose “Video Inference for Body Pose and Shape Estimation” (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE

arXiv code [BibTex]



From Variational to Deterministic Autoencoders

Ghosh*, P., Sajjadi*, M. S. M., Vergari, A., Black, M. J., Schölkopf, B.

8th International Conference on Learning Representations (ICLR), April 2020, *equal contribution (conference) Accepted

Abstract
Variational Autoencoders (VAEs) provide a theoretically-backed framework for deep generative models. However, they often produce “blurry” images, which is linked to their training objective. Sampling in the most popular implementation, the Gaussian VAE, can be interpreted as simply injecting noise to the input of a deterministic decoder. In practice, this simply enforces a smooth latent space structure. We challenge the adoption of the full VAE framework on this specific point in favor of a simpler, deterministic one. Specifically, we investigate how substituting stochasticity with other explicit and implicit regularization schemes can lead to a meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism for sampling new data points, we propose to employ an efficient ex-post density estimation step that can be readily adopted both for the proposed deterministic autoencoders as well as to improve sample quality of existing VAEs. We show in a rigorous empirical study that regularized deterministic autoencoding achieves state-of-the-art sample quality on the common MNIST, CIFAR-10 and CelebA datasets.

arXiv [BibTex]
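
The ex-post density estimation step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the simplest possible estimator, a single full-covariance Gaussian fitted to latent codes; the abstract does not say which estimator the authors use. The function names and the synthetic `train_latents` array are hypothetical stand-ins for the outputs of a trained deterministic encoder.

```python
import numpy as np

def fit_gaussian(latents):
    """Fit a full-covariance Gaussian to an (n, d) array of latent codes."""
    mean = latents.mean(axis=0)
    cov = np.cov(latents, rowvar=False)
    return mean, cov

def sample_latents(mean, cov, n, rng):
    """Draw n new latent codes from the fitted density."""
    return rng.multivariate_normal(mean, cov, size=n)

rng = np.random.default_rng(0)
# Stand-in for encoder outputs on the training set (hypothetical data;
# a real pipeline would run a trained deterministic encoder here).
train_latents = rng.normal(loc=2.0, scale=0.5, size=(1000, 8))

mean, cov = fit_gaussian(train_latents)
new_latents = sample_latents(mean, cov, n=16, rng=rng)
# new_latents would then be passed through the trained decoder to
# obtain new data points.
```

The point of the sketch is that the density is fitted after training, so the latent space is never forced to match a prior chosen in advance.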



Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations

Rueegg, N., Lassner, C., Black, M. J., Schindler, K.

In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), February 2020 (inproceedings)

Abstract
The goal of many computer vision systems is to transform image pixels into 3D representations. Recent popular models use neural networks to regress directly from pixels to 3D object parameters. Such an approach works well when supervision is available, but in problems like human pose and shape estimation, it is difficult to obtain natural images with 3D ground truth. To go one step further, we propose a new architecture that facilitates unsupervised, or lightly supervised, learning. The idea is to break the problem into a series of transformations between increasingly abstract representations. Each step involves a cycle designed to be learnable without annotated training data, and the chain of cycles delivers the final solution. Specifically, we use 2D body part segments as an intermediate representation that contains enough information to be lifted to 3D, and at the same time is simple enough to be learned in an unsupervised way. We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images. We also explore varying amounts of paired data and show that cycling greatly alleviates the need for paired data. While we present results for modeling humans, our formulation is general and can be applied to other vision problems.

pdf [BibTex]



Learning Multi-Human Optical Flow

Ranjan, A., Hoffmann, D. T., Tzionas, D., Tang, S., Romero, J., Black, M. J.

International Journal of Computer Vision (IJCV), January 2020 (article)

Abstract
The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem. However, the training data they use does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on this dataset. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they can generalize well to real image sequences. The code, trained models and the dataset are available for research.

Paper Publisher Version poster link (url) DOI [BibTex]


General Movement Assessment from videos of computed 3D infant body models is equally effective compared to conventional RGB Video rating

Schroeder, S., Hesse, N., Weinberger, R., Tacke, U., Gerstl, L., Hilgendorff, A., Heinen, F., Arens, M., Bodensteiner, C., Dijkstra, L. J., Pujades, S., Black, M., Hadders-Algra, M.

Early Human Development, 2020 (article)

Abstract
Background: General Movement Assessment (GMA) is a powerful tool to predict Cerebral Palsy (CP). Yet, GMA requires substantial training hampering its implementation in clinical routine. This inspired a world-wide quest for automated GMA. Aim: To test whether a low-cost, marker-less system for three-dimensional motion capture from RGB depth sequences using a whole body infant model may serve as the basis for automated GMA. Study design: Clinical case study at an academic neurodevelopmental outpatient clinic. Subjects: Twenty-nine high-risk infants were recruited and assessed at their clinical follow-up at 2-4 month corrected age (CA). Their neurodevelopmental outcome was assessed regularly up to 12-31 months CA. Outcome measures: GMA according to Hadders-Algra by a masked GMA-expert of conventional and computed 3D body model (“SMIL motion”) videos of the same GMs. Agreement between both GMAs was assessed, and sensitivity and specificity of both methods to predict CP at ≥12 months CA. Results: The agreement of the two GMA ratings was substantial, with κ=0.66 for the classification of definitely abnormal (DA) GMs and an ICC of 0.887 (95% CI 0.762;0.947) for a more detailed GM-scoring. Five children were diagnosed with CP (four bilateral, one unilateral CP). The GMs of the child with unilateral CP were twice rated as mildly abnormal. DA-ratings of both videos predicted bilateral CP well: sensitivity 75% and 100%, specificity 88% and 92% for conventional and SMIL motion videos, respectively. Conclusions: Our computed infant 3D full body model is an attractive starting point for automated GMA in infants at risk of CP.

[BibTex]
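
The sensitivity and specificity figures in the abstract above follow from simple ratios, which the sketch below works through. The abstract gives only percentages, so the counts are an assumption: they are inferred from 29 infants, 4 of whom had bilateral CP (leaving 25 without it), and are the whole-number counts that reproduce the reported values.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of affected children that the rating flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of unaffected children that the rating clears."""
    return true_neg / (true_neg + false_pos)

# Counts below are NOT reported in the abstract. They are inferred from
# 29 infants, 4 with bilateral CP, and the published percentages.
conventional = {"true_pos": 3, "false_neg": 1, "true_neg": 22, "false_pos": 3}
smil_motion = {"true_pos": 4, "false_neg": 0, "true_neg": 23, "false_pos": 2}

for name, c in [("conventional", conventional), ("SMIL motion", smil_motion)]:
    sens = sensitivity(c["true_pos"], c["false_neg"])
    spec = specificity(c["true_neg"], c["false_pos"])
    print(f"{name}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Under these assumed counts, the ratios are 3/4 and 22/25 for the conventional videos and 4/4 and 23/25 for the SMIL motion videos, matching the 75%/88% and 100%/92% stated in the abstract.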


2002


Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter

Wu, W., Black, M. J., Gao, Y., Bienenstock, E., Serruya, M., Donoghue, J. P.

In SAB’02-Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices, pages: 66-73, Edinburgh, Scotland (UK), August 2002 (inproceedings)

pdf [BibTex]

Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter

Wu, W., Black, M., Gao, Y., Bienenstock, E., Serruya, M., Donoghue, J.

Program No. 357.5. 2002 Abstract Viewer/Itinerary Planner, Society for Neuroscience, Washington, DC, 2002, Online (conference)

abstract [BibTex]



Probabilistic inference of hand motion from neural activity in motor cortex

Gao, Y., Black, M. J., Bienenstock, E., Shoham, S., Donoghue, J.

In Advances in Neural Information Processing Systems 14, pages: 221-228, MIT Press, 2002 (inproceedings)

Abstract
Statistical learning and probabilistic inference techniques are used to infer the hand position of a subject from multi-electrode recordings of neural activity in motor cortex. First, an array of electrodes provides training data of neural firing conditioned on hand kinematics. We learn a non-parametric representation of this firing activity using a Bayesian model and rigorously compare it with previous models using cross-validation. Second, we infer a posterior probability distribution over hand motion conditioned on a sequence of neural test data using Bayesian inference. The learned firing models of multiple cells are used to define a non-Gaussian likelihood term which is combined with a prior probability for the kinematics. A particle filtering method is used to represent, update, and propagate the posterior distribution over time. The approach is compared with traditional linear filtering methods; the results suggest that it may be appropriate for neural prosthetic applications.

pdf [BibTex]
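
The particle filtering idea in the abstract above (a prior on kinematics combined with a non-Gaussian likelihood from several cells) can be sketched on a toy problem. This is not the paper's model: the log-linear Poisson tuning curves `B0` and `B1`, the random-walk prior, and the one-dimensional state are all hypothetical choices made to keep the example short, whereas the abstract describes a learned non-parametric firing model.

```python
import numpy as np

# Hypothetical log-linear tuning: cell i fires at rate exp(B0[i] + B1[i] * x).
B0 = np.array([0.5, 1.0, 0.2, 0.8, 0.3])
B1 = np.array([1.0, -0.8, 0.6, -1.2, 0.9])

def poisson_loglik(particles, counts):
    """Log-likelihood of the spike counts for each particle (up to a constant)."""
    log_rate = B0 + B1 * particles[:, None]      # shape (n_particles, n_cells)
    return (counts * log_rate - np.exp(log_rate)).sum(axis=1)

def filter_step(particles, counts, rng, step_std=0.05):
    """One predict / update / resample cycle of a bootstrap particle filter."""
    n = len(particles)
    # Predict: propagate particles through an assumed random-walk prior.
    particles = particles + rng.normal(0.0, step_std, size=n)
    # Update: weight each particle by the likelihood of the new counts.
    log_w = poisson_loglik(particles, counts)
    w = np.exp(log_w - log_w.max())              # subtract max to avoid underflow
    w /= w.sum()
    # Resample: draw particles in proportion to their weights.
    return particles[rng.choice(n, size=n, p=w)]

rng = np.random.default_rng(0)
true_x = 1.0                                      # simulated hand position
particles = rng.normal(0.0, 1.0, size=500)        # broad initial belief
for _ in range(50):
    counts = rng.poisson(np.exp(B0 + B1 * true_x))
    particles = filter_step(particles, counts, rng)
estimate = particles.mean()                       # posterior mean estimate
```

Because the posterior is carried by samples rather than by a mean and covariance, the likelihood can take any form that is cheap to evaluate per particle.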



Automatic detection and tracking of human motion with a view-based representation

Fablet, R., Black, M. J.

In European Conf. on Computer Vision, ECCV 2002, 1, pages: 476-491, LNCS 2353, (Editors: A. Heyden and G. Sparr and M. Nielsen and P. Johansen), Springer-Verlag, 2002 (inproceedings)

Abstract
This paper proposes a solution for the automatic detection and tracking of human motion in image sequences. Due to the complexity of the human body and its motion, automatic detection of 3D human motion remains an open, and important, problem. Existing approaches for automatic detection and tracking focus on 2D cues and typically exploit object appearance (color distribution, shape) or knowledge of a static background. In contrast, we exploit 2D optical flow information which provides rich descriptive cues, while being independent of object and background appearance. To represent the optical flow patterns of people from arbitrary viewpoints, we develop a novel representation of human motion using low-dimensional spatio-temporal models that are learned using motion capture data of human subjects. In addition to human motion (the foreground) we probabilistically model the motion of generic scenes (the background); these statistical models are defined as Gibbsian fields specified from the first-order derivatives of motion observations. Detection and tracking are posed in a principled Bayesian framework which involves the computation of a posterior probability distribution over the model parameters (i.e., the location and the type of the human motion) given a sequence of optical flow observations. Particle filtering is used to represent and predict this non-Gaussian posterior distribution over time. The model parameters of samples from this distribution are related to the pose parameters of a 3D articulated model (e.g. the approximate joint angles and movement direction). Thus the approach proves suitable for initializing more complex probabilistic models of human motion. As shown by experiments on real image sequences, our method is able to detect and track people under different viewpoints with complex backgrounds.

pdf [BibTex]



A layered motion representation with occlusion and compact spatial support

Fleet, D. J., Jepson, A., Black, M. J.

In European Conf. on Computer Vision, ECCV 2002, 1, pages: 692-706, LNCS 2353, (Editors: A. Heyden and G. Sparr and M. Nielsen and P. Johansen), Springer-Verlag, 2002 (inproceedings)

Abstract
We describe a 2.5D layered representation for visual motion analysis. The representation provides a global interpretation of image motion in terms of several spatially localized foreground regions along with a background region. Each of these regions comprises a parametric shape model and a parametric motion model. The representation also contains depth ordering so visibility and occlusion are rightly included in the estimation of the model parameters. Finally, because the number of objects, their positions, shapes and sizes, and their relative depths are all unknown, initial models are drawn from a proposal distribution, and then compared using a penalized likelihood criterion. This allows us to automatically initialize new models, and to compare different depth orderings.

pdf [BibTex]



Implicit probabilistic models of human motion for synthesis and tracking

Sidenbladh, H., Black, M. J., Sigal, L.

In European Conf. on Computer Vision, 1, pages: 784-800, 2002 (inproceedings)

Abstract
This paper addresses the problem of probabilistically modeling 3D human motion for synthesis and tracking. Given the high dimensional nature of human motion, learning an explicit probabilistic model from available training data is currently impractical. Instead we exploit methods from texture synthesis that treat images as representing an implicit empirical distribution. These methods replace the problem of representing the probability of a texture pattern with that of searching the training data for similar instances of that pattern. We extend this idea to temporal data representing 3D human motion with a large database of example motions. To make the method useful in practice, we must address the problem of efficient search in a large training set; efficiency is particularly important for tracking. Towards that end, we learn a low dimensional linear model of human motion that is used to structure the example motion database into a binary tree. An approximate probabilistic tree search method exploits the coefficients of this low-dimensional representation and runs in sub-linear time. This probabilistic tree search returns a particular sample human motion with probability approximating the true distribution of human motions in the database. This sampling method is suitable for use with particle filtering techniques and is applied to articulated 3D tracking of humans within a Bayesian framework. Successful tracking results are presented, along with examples of synthesizing human motion using the model.

pdf [BibTex]



Robust parameterized component analysis: Theory and applications to 2D facial modeling

De la Torre, F., Black, M. J.

In European Conf. on Computer Vision, ECCV 2002, 4, pages: 653-669, LNCS 2353, Springer-Verlag, 2002 (inproceedings)

pdf [BibTex]


1995


Robust estimation of multiple surface shapes from occluded textures

Black, M. J., Rosenholtz, R.

In International Symposium on Computer Vision, pages: 485-490, Miami, FL, November 1995 (inproceedings)

pdf [BibTex]

The PLAYBOT Project

Tsotsos, J. K., Dickinson, S., Jenkin, M., Milios, E., Jepson, A., Down, B., Amdur, E., Stevenson, S., Black, M., Metaxas, D., Cooperstock, J., Culhane, S., Nuflo, F., Verghese, G., Wai, W., Wilkes, D., Ye, Y.

In Proc. IJCAI Workshop on AI Applications for Disabled People, Montreal, August 1995 (inproceedings)

abstract [BibTex]



Recognizing facial expressions under rigid and non-rigid facial motions using local parametric models of image motion

Black, M. J., Yacoob, Y.

In International Workshop on Automatic Face- and Gesture-Recognition, Zurich, July 1995 (inproceedings)

video abstract [BibTex]



Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion

Black, M. J., Yacoob, Y.

In Fifth International Conf. on Computer Vision, ICCV’95, pages: 347-381, Boston, MA, June 1995 (inproceedings)

Abstract
This paper explores the use of local parametrized models of image motion for recovering and recognizing the non-rigid and articulated motion of human faces. Parametric flow models (for example affine) are popular for estimating motion in rigid scenes. We observe that within local regions in space and time, such models not only accurately model non-rigid facial motions but also provide a concise description of the motion in terms of a small number of parameters. These parameters are intuitively related to the motion of facial features during facial expressions and we show how expressions such as anger, happiness, surprise, fear, disgust and sadness can be recognized from the local parametric motions in the presence of significant head motion. The motion tracking and expression recognition approach performs with high accuracy in extensive laboratory experiments involving 40 subjects as well as in television and movie sequences.

pdf video publisher site [BibTex]
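
The affine flow model that the abstract above gives as an example of a parametric model can be written out in a few lines. The sketch assumes one common six-parameter ordering (`a0` to `a5`); the parameter names and the summary quantities are illustrative conventions, not taken from the abstract.

```python
import numpy as np

def affine_flow(params, x, y):
    """Evaluate an affine flow field (u, v) at image coordinates (x, y)."""
    a0, a1, a2, a3, a4, a5 = params
    u = a0 + a1 * x + a2 * y      # horizontal velocity
    v = a3 + a4 * x + a5 * y      # vertical velocity
    return u, v

def summarize(params):
    """Combine the linear terms into quantities that are easier to interpret."""
    a0, a1, a2, a3, a4, a5 = params
    return {
        "divergence": a1 + a5,    # du/dx + dv/dy: expansion or contraction
        "curl": a4 - a2,          # dv/dx - du/dy: rotation about the view axis
        "deformation": a1 - a5,   # stretching along one axis vs. the other
    }

# A 3x3 grid of coordinates centred on the region of interest.
ys, xs = np.mgrid[-1:2, -1:2].astype(float)
params = (0.0, 0.1, 0.0, 0.0, 0.0, 0.1)   # hypothetical pure-expansion motion
u, v = affine_flow(params, xs, ys)
summary = summarize(params)
```

A handful of such parameters per region is what makes the description concise: an expanding region (positive divergence), for instance, is summarized by one number rather than by a dense flow field.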

A computational model for shape from texture for multiple textures

Black, M. J., Rosenholtz, R.

Investigative Ophthalmology and Visual Science Supplement, Vol. 36, No. 4, pages: 2202, March 1995 (conference)

abstract [BibTex]


1994


Estimating multiple independent motions in segmented images using parametric models with local deformations

Black, M. J., Jepson, A.

In Workshop on Non-rigid and Articulate Motion, pages: 220-227, Austin, Texas, November 1994 (inproceedings)

pdf abstract [BibTex]



Time to contact from active tracking of motion boundaries

Ju, X., Black, M. J.

In Intelligent Robots and Computer Vision XIII: 3D Vision, Product Inspection, and Active Vision, pages: 26-37, Proc. SPIE 2354, Boston, Massachusetts, November 1994 (inproceedings)

pdf abstract [BibTex]



A computational and evolutionary perspective on the role of representation in computer vision

Tarr, M. J., Black, M. J.

CVGIP: Image Understanding, 60(1):65-73, July 1994 (article)

Abstract
Recently, the assumed goal of computer vision, reconstructing a representation of the scene, has been criticized as unproductive and impractical. Critics have suggested that the reconstructive approach should be supplanted by a new purposive approach that emphasizes functionality and task driven perception at the cost of general vision. In response to these arguments, we claim that the recovery paradigm central to the reconstructive approach is viable, and, moreover, provides a promising framework for understanding and modeling general purpose vision in humans and machines. An examination of the goals of vision from an evolutionary perspective and a case study involving the recovery of optic flow support this hypothesis. In particular, while we acknowledge that there are instances where the purposive approach may be appropriate, these are insufficient for implementing the wide range of visual tasks exhibited by humans (the kind of flexible vision system presumed to be an end-goal of artificial intelligence). Furthermore, there are instances, such as recent work on the estimation of optic flow, where the recovery paradigm may yield useful and robust results. Thus, contrary to certain claims, the purposive approach does not obviate the need for recovery and reconstruction of flexible representations of the world.

pdf [BibTex]



Reconstruction and purpose

Tarr, M. J., Black, M. J.

CVGIP: Image Understanding, 60(1):113-118, July 1994 (article)

pdf [BibTex]



The outlier process: Unifying line processes and robust statistics

Black, M., Rangarajan, A.

In IEEE Conf. on Computer Vision and Pattern Recognition, CVPR’94, pages: 15-22, Seattle, WA, June 1994 (inproceedings)

pdf abstract [BibTex]



Recursive non-linear estimation of discontinuous flow fields

Black, M.

In Proc. Third European Conf. on Computer Vision, ECCV’94, pages: 138-145, LNCS 800, Springer Verlag, Sweden, May 1994 (inproceedings)

pdf abstract [BibTex]


1991


Dynamic motion estimation and feature extraction over long image sequences

Black, M. J., Anandan, P.

In Proc. IJCAI Workshop on Dynamic Scene Understanding, Sydney, Australia, August 1991 (inproceedings)

[BibTex]



Robust dynamic motion estimation over time

(IEEE Computer Society Outstanding Paper Award)

Black, M. J., Anandan, P.

In Proc. Computer Vision and Pattern Recognition, CVPR-91, pages: 296-302, Maui, Hawaii, June 1991 (inproceedings)

Abstract
This paper presents a novel approach to incrementally estimating visual motion over a sequence of images. We start by formulating constraints on image motion to account for the possibility of multiple motions. This is achieved by exploiting the notions of weak continuity and robust statistics in the formulation of the minimization problem. The resulting objective function is non-convex. Traditional stochastic relaxation techniques for minimizing such functions prove inappropriate for the task. We present a highly parallel incremental stochastic minimization algorithm which has a number of advantages over previous approaches. The incremental nature of the scheme makes it truly dynamic and permits the detection of occlusion and disocclusion boundaries.

pdf video abstract [BibTex]
