

2020


Grasping Field: Learning Implicit Representations for Human Grasps

Karunratanakul, K., Yang, J., Zhang, Y., Black, M., Muandet, K., Tang, S.

In International Conference on 3D Vision (3DV), November 2020 (inproceedings)

Abstract
Robotic grasping of household objects has made remarkable progress in recent years. Yet, human grasps are still difficult to synthesize realistically. There are several key reasons: (1) the human hand has many degrees of freedom (more than robotic manipulators); (2) the synthesized hand should conform to the surface of the object; and (3) it should interact with the object in a semantically and physically plausible manner. To make progress in this direction, we draw inspiration from the recent progress on learning-based implicit representations for 3D object reconstruction. Specifically, we propose an expressive representation for human grasp modelling that is efficient and easy to integrate with deep neural networks. Our insight is that every point in a three-dimensional space can be characterized by the signed distances to the surface of the hand and the object, respectively. Consequently, the hand, the object, and the contact area can be represented by implicit surfaces in a common space, in which the proximity between the hand and the object can be modelled explicitly. We name this 3D-to-2D mapping the Grasping Field, parameterize it with a deep neural network, and learn it from data. We demonstrate that the proposed grasping field is an effective and expressive representation for human grasp generation. Specifically, our generative model is able to synthesize high-quality human grasps given only a 3D object point cloud. Extensive experiments demonstrate that our generative model compares favorably with a strong baseline and approaches the level of natural human grasps. Furthermore, based on the grasping field representation, we propose a deep network for the challenging task of 3D hand-object interaction reconstruction from a single RGB image. Our method improves the physical plausibility of the hand-object contact reconstruction and achieves comparable performance for 3D hand reconstruction compared to state-of-the-art methods. Our model and code are available for research purposes at https://github.com/korrawe/grasping_field.
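
To make the two-channel distance idea concrete, here is a minimal NumPy sketch; it is illustrative only, using unsigned nearest-sample distances in place of the learned signed distances that the paper's network predicts, and all names are made up for this sketch.

```python
import numpy as np

def nearest_distance(query, surface_samples):
    # Unsigned distance from each query point to the closest sampled surface point.
    diffs = query[:, None, :] - surface_samples[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=1)

def grasping_field(query, hand_samples, object_samples):
    # Two distances per 3D point: one to the hand surface, one to the object
    # surface. The learned Grasping Field replaces these with signed distances
    # predicted jointly by a neural network.
    return np.stack([nearest_distance(query, hand_samples),
                     nearest_distance(query, object_samples)], axis=-1)
```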

pdf arXiv code [BibTex]


GIF: Generative Interpretable Faces

Ghosh, P., Gupta, P. S., Uziel, R., Ranjan, A., Black, M. J., Bolkart, T.

In International Conference on 3D Vision (3DV), November 2020 (inproceedings)

Abstract
Photo-realistic visualization and animation of expressive human faces have been a long-standing challenge. 3D face modeling methods provide parametric control but generate unrealistic images; on the other hand, generative 2D models like GANs (Generative Adversarial Networks) output photo-realistic face images but lack explicit control. Recent methods gain partial control, either by attempting to disentangle different factors in an unsupervised manner, or by adding control post hoc to a pre-trained model. Unconditional GANs, however, may entangle factors that are hard to undo later. We condition our generative model on pre-defined control parameters to encourage disentanglement in the generation process. Specifically, we condition StyleGAN2 on FLAME, a generative 3D face model. While conditioning on FLAME parameters yields unsatisfactory results, we find that conditioning on rendered FLAME geometry and photometric details works well. This gives us a generative 2D face model named GIF (Generative Interpretable Faces) that offers FLAME's parametric control. Here, interpretable refers to the semantic meaning of different parameters. Given FLAME parameters for shape, pose, and expressions, parameters for appearance and lighting, and an additional style vector, GIF outputs photo-realistic face images. We perform an AMT-based perceptual study to quantitatively and qualitatively evaluate how well GIF follows its conditioning. The code, data, and trained model are publicly available for research purposes at http://gif.is.tue.mpg.de

pdf project code [BibTex]

PLACE: Proximity Learning of Articulation and Contact in 3D Environments

Zhang, S., Zhang, Y., Ma, Q., Black, M. J., Tang, S.

In International Conference on 3D Vision (3DV), November 2020 (inproceedings)

Abstract
High-fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically equip such environments with realistic human bodies. Existing work utilizes images, depth, or semantic maps to represent the scene, and parametric human models to represent 3D bodies. While straightforward, their generated human-scene interactions often lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. To synthesize realistic human-scene interactions, it is essential to effectively represent the physical contact and proximity between the body and the world. To that end, we propose a novel interaction generation method, named PLACE (Proximity Learning of Articulation and Contact in 3D Environments), which explicitly models the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the minimum distances from the basis points to the human body surface. The generated proximal relationship indicates which regions of the scene are in contact with the person. Furthermore, based on such synthesized proximity, we are able to effectively obtain expressive 3D human bodies that interact with the 3D scene naturally. Our perceptual study shows that PLACE significantly improves over the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method is an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. The code and model are available for research at https://sanweiliti.github.io/PLACE/PLACE.html
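
The per-basis-point proximity feature that the conditional VAE learns to generate can be sketched as follows; this is a simplified illustration under the assumption that the body surface is approximated by mesh vertices, and the function names are placeholders, not the released PLACE code.

```python
import numpy as np
from scipy.spatial import cKDTree

def body_proximity_feature(basis_points, body_vertices):
    # For each scene basis point (N, 3), compute the minimum distance to the
    # body surface, here approximated by the body mesh vertices (V, 3).
    # PLACE's cVAE is trained to synthesize this kind of distance vector.
    tree = cKDTree(body_vertices)
    dists, _ = tree.query(basis_points)
    return dists  # shape (N,)
```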

pdf arXiv project code [BibTex]

AirCapRL: Autonomous Aerial Human Motion Capture Using Deep Reinforcement Learning

Tallamraju, R., Saini, N., Bonetto, E., Pabst, M., Liu, Y. T., Black, M., Ahmad, A.

IEEE Robotics and Automation Letters, 5(4):6678-6685, IEEE, October 2020, also accepted and presented at the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (article)

Abstract
In this letter, we introduce a deep reinforcement learning (DRL) based multi-robot formation controller for the task of autonomous aerial human motion capture (MoCap). We focus on vision-based MoCap, where the objective is to estimate the trajectory, body pose, and shape of a single moving person using multiple micro aerial vehicles. State-of-the-art solutions to this problem are based on classical control methods, which depend on hand-crafted system and observation models. Such models are difficult to derive and to generalize across different systems. Moreover, the non-linearities and non-convexities of these models lead to sub-optimal controls. In our work, we formulate this problem as a sequential decision-making task to achieve the vision-based motion capture objectives, and solve it using a deep neural network-based RL method. We leverage proximal policy optimization (PPO) to train a stochastic decentralized control policy for formation control. The neural network is trained in a parallelized setup in synthetic environments. We performed extensive simulation experiments to validate our approach. Finally, real-robot experiments demonstrate that our policies generalize to real-world conditions.

link (url) DOI [BibTex]

Learning a statistical full spine model from partial observations

Meng, D., Keller, M., Boyer, E., Black, M., Pujades, S.

In Shape in Medical Imaging, pages: 122-133, (Editors: Reuter, Martin and Wachinger, Christian and Lombaert, Hervé and Paniagua, Beatriz and Goksel, Orcun and Rekik, Islem), Springer International Publishing, October 2020 (inproceedings)

Abstract
The study of the morphology of the human spine has attracted research attention for its many potential applications, such as image segmentation, bio-mechanics or pathology detection. However, as of today there is no publicly available statistical model of the 3D surface of the full spine. This is mainly due to the lack of openly available 3D data where the full spine is imaged and segmented. In this paper we propose to learn a statistical surface model of the full-spine (7 cervical, 12 thoracic and 5 lumbar vertebrae) from partial and incomplete views of the spine. In order to deal with the partial observations we use probabilistic principal component analysis (PPCA) to learn a surface shape model of the full spine. Quantitative evaluation demonstrates that the obtained model faithfully captures the shape of the population in a low dimensional space and generalizes to left out data. Furthermore, we show that the model faithfully captures the global correlations among the vertebrae shape. Given a partial observation of the spine, i.e. a few vertebrae, the model can predict the shape of unseen vertebrae with a mean error under 3 mm. The full-spine statistical model is trained on the VerSe 2019 public dataset and is publicly made available to the community for non-commercial purposes. (https://gitlab.inria.fr/spine/spine_model)
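
For reference, the standard probabilistic PCA generative model underlying the approach is shown below in its generic textbook form, not the authors' exact notation; with partial scans, the likelihood is evaluated only over the observed vertices (e.g., via EM), which is what allows learning from incomplete views.

```latex
% Probabilistic PCA: a full-spine surface x is generated from a
% low-dimensional latent code z (generic form).
x = W z + \mu + \epsilon, \qquad
z \sim \mathcal{N}(0, I), \qquad
\epsilon \sim \mathcal{N}(0, \sigma^2 I)
```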

Gitlab Code PDF DOI [BibTex]

STAR: Sparse Trained Articulated Human Body Regressor

Osman, A. A. A., Bolkart, T., Black, M. J.

In European Conference on Computer Vision (ECCV), August 2020 (inproceedings)

Abstract
The SMPL body model is widely used for the estimation, synthesis, and analysis of 3D human pose and shape. While popular, we show that SMPL has several limitations and introduce STAR, which is quantitatively and qualitatively superior to SMPL. First, SMPL has a huge number of parameters resulting from its use of global blend shapes. These dense pose-corrective offsets relate every vertex on the mesh to all the joints in the kinematic tree, capturing spurious long-range correlations. To address this, we define per-joint pose correctives and learn the subset of mesh vertices that are influenced by each joint movement. This sparse formulation results in more realistic deformations and significantly reduces the number of model parameters to 20% of SMPL. When trained on the same data as SMPL, STAR generalizes better despite having many fewer parameters. Second, SMPL factors pose-dependent deformations from body shape while, in reality, people with different shapes deform differently. Consequently, we learn shape-dependent pose-corrective blend shapes that depend on both body pose and BMI. Third, we show that the shape space of SMPL is not rich enough to capture the variation in the human population. We address this by training STAR with an additional 10,000 scans of male and female subjects, and show that this results in better model generalization. STAR is compact, generalizes better to new bodies and is a drop-in replacement for SMPL. STAR is publicly available for research purposes at http://star.is.tue.mpg.de.
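
A schematic of per-joint, sparsely supported pose correctives follows; this is a hypothetical structure written for illustration only (the array names, feature choice, and shapes are assumptions, not the released STAR model).

```python
import numpy as np

def sparse_pose_correctives(joint_feats, regressors, masks):
    # joint_feats: (J, F) per-joint pose features (e.g. quaternion elements)
    # regressors:  (J, F, V, 3) per-joint linear maps from features to vertex offsets
    # masks:       (J, V) learned sparse activations, zero outside each joint's support
    offsets = np.einsum('jf,jfvc->jvc', joint_feats, regressors)  # per-joint offsets
    offsets = offsets * masks[:, :, None]                         # restrict to local support
    return offsets.sum(axis=0)                                    # (V, 3) corrective to the template
```

The key contrast with dense global blend shapes is that each joint only moves the vertices inside its learned mask, which is what reduces the parameter count and removes spurious long-range correlations.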

Project Page Code Video paper supplemental [BibTex]


3D Morphable Face Models - Past, Present and Future

Egger, B., Smith, W. A. P., Tewari, A., Wuhrer, S., Zollhoefer, M., Beeler, T., Bernard, F., Bolkart, T., Kortylewski, A., Romdhani, S., Theobalt, C., Blanz, V., Vetter, T.

ACM Transactions on Graphics, 39(5), August 2020 (article)

Abstract
In this paper, we provide a detailed survey of 3D Morphable Face Models over the 20 years since they were first proposed. The challenges in building and applying these models, namely capture, modeling, image formation, and image analysis, are still active research topics, and we review the state-of-the-art in each of these areas. We also look ahead, identifying unsolved challenges, proposing directions for future research and highlighting the broad range of current and future applications.

project page pdf preprint DOI [BibTex]

Monocular Expressive Body Regression through Body-Driven Attention

Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M. J.

In Computer Vision – ECCV 2020, Springer International Publishing, Cham, August 2020 (inproceedings)

Abstract
To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face- and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de.

code Short video Long video arxiv pdf suppl link (url) Project Page [BibTex]


GRAB: A Dataset of Whole-Body Human Grasping of Objects

Taheri, O., Ghorbani, N., Black, M. J., Tzionas, D.

In Computer Vision – ECCV 2020, Springer International Publishing, Cham, August 2020 (inproceedings)

Abstract
Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset, that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application; we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de.

pdf suppl video (long) video (short) link (url) DOI [BibTex]

Analysis of motor development within the first year of life: 3-D motion tracking without markers for early detection of developmental disorders

Parisi, C., Hesse, N., Tacke, U., Rocamora, S. P., Blaschek, A., Hadders-Algra, M., Black, M. J., Heinen, F., Müller-Felber, W., Schroeder, A. S.

Bundesgesundheitsblatt - Gesundheitsforschung - Gesundheitsschutz, 63, pages: 881–890, July 2020 (article)

Abstract
Children with motor development disorders benefit greatly from early interventions. An early diagnosis in pediatric preventive care (U2–U5) can be improved by automated screening. Current approaches to automated motion analysis, however, are expensive, require lots of technical support, and cannot be used in broad clinical application. Here we present an inexpensive, marker-free video analysis tool (KineMAT) for infants, which digitizes 3‑D movements of the entire body over time allowing automated analysis in the future. Three-minute video sequences of spontaneously moving infants were recorded with a commercially available depth-imaging camera and aligned with a virtual infant body model (SMIL model). The virtual image generated allows any measurements to be carried out in 3‑D with high precision. We demonstrate seven infants with different diagnoses. A selection of possible movement parameters was quantified and aligned with diagnosis-specific movement characteristics. KineMAT and the SMIL model allow reliable, three-dimensional measurements of spontaneous activity in infants with a very low error rate. Based on machine-learning algorithms, KineMAT can be trained to automatically recognize pathological spontaneous motor skills. It is inexpensive and easy to use and can be developed into a screening tool for preventive care for children.

pdf on-line w/ sup mat DOI [BibTex]

Learning to Dress 3D People in Generative Clothing

Ma, Q., Yang, J., Ranjan, A., Pujades, S., Pons-Moll, G., Tang, S., Black, M. J.

In Computer Vision and Pattern Recognition (CVPR), pages: 6468-6477, IEEE, June 2020 (inproceedings)

Abstract
Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shape. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term on SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses.

Project page Code Short video Long video arXiv DOI [BibTex]

GENTEL : GENerating Training data Efficiently for Learning to segment medical images

Thakur, R. P., Rocamora, S. P., Goel, L., Pohmann, R., Machann, J., Black, M. J.

Congrès Reconnaissance des Formes, Image, Apprentissage et Perception (RFAIP), June 2020 (conference)

Abstract
Accurately segmenting MRI images is crucial for many clinical applications. However, manually segmenting images with accurate pixel precision is a tedious and time-consuming task. In this paper we present a simple, yet effective method to improve the efficiency of the image segmentation process. We propose to transform the image annotation task into a binary choice task. We start by using classical image processing algorithms with different parameter values to generate multiple, different segmentation masks for each input MRI image. Then, instead of segmenting the pixels of the images, the user only needs to decide whether a segmentation is acceptable or not. This method allows us to efficiently obtain high-quality segmentations with minor human intervention. With the selected segmentations, we train a state-of-the-art neural network model. For the evaluation, we use a second MRI dataset (1.5T Dataset), acquired with a different protocol and containing annotations. We show that the trained network (i) is able to automatically segment cases where none of the classical methods obtain a high-quality result; (ii) generalizes to the second MRI dataset, which was acquired with a different protocol and was never seen at training time; and (iii) enables detection of mis-annotations in this second dataset. Quantitatively, the trained network obtains very good results: DICE score (mean 0.98, median 0.99) and Hausdorff distance in pixels (mean 4.7, median 2.0).
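
The binary-choice annotation idea can be illustrated with a minimal sketch; plain intensity thresholding stands in for the unspecified classical algorithms, and the names are hypothetical rather than the authors' pipeline.

```python
import numpy as np

def candidate_masks(image, thresholds):
    # Run a classical segmentation (here: intensity thresholding) at several
    # parameter values; the annotator then only accepts or rejects each mask
    # instead of labelling pixels one by one.
    return [(t, image > t) for t in thresholds]

# Hypothetical usage on a normalized slice:
# masks = candidate_masks(mri_slice, np.linspace(0.2, 0.8, 7))
```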

Project Page PDF [BibTex]

Generating 3D People in Scenes without People

Zhang, Y., Hassan, M., Neumann, H., Black, M. J., Tang, S.

In Computer Vision and Pattern Recognition (CVPR), pages: 6194-6204, June 2020 (inproceedings)

Abstract
We present a fully automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene. Given a 3D scene without people, humans can easily imagine how people could interact with the scene and the objects in it. However, this is a challenging task for a computer as solving it requires that (1) the generated human bodies be semantically plausible within the 3D environment (e.g. people sitting on the sofa or cooking near the stove), and (2) the generated human-scene interaction be physically feasible such that the human body and scene do not interpenetrate while, at the same time, body-scene contact supports physical interactions. To that end, we make use of the surface-based 3D human model SMPL-X. We first train a conditional variational autoencoder to predict semantically plausible 3D human poses conditioned on latent scene representations, then we further refine the generated 3D bodies using scene constraints to enforce feasible physical interaction. We show that our approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with the 3D environment. We perform extensive experiments demonstrating that our generative framework compares favorably with existing methods, both qualitatively and quantitatively. We believe that our scene-conditioned 3D human generation pipeline will be useful for numerous applications, e.g. to generate training data for human pose estimation, in video games, and in VR/AR. Our project page for data and code can be seen at https://vlg.inf.ethz.ch/projects/PSI/

Code PDF DOI [BibTex]

Learning Physics-guided Face Relighting under Directional Light

Nestmeyer, T., Lalonde, J., Matthews, I., Lehrmann, A. M.

In Conference on Computer Vision and Pattern Recognition, pages: 5123-5132, IEEE/CVF, June 2020 (inproceedings)

Abstract
Relighting is an essential step in realistically transferring objects from a captured image into another environment. For example, authentic telepresence in Augmented Reality requires faces to be displayed and relit consistent with the observer's scene lighting. We investigate end-to-end deep learning architectures that both de-light and relight an image of a human face. Our model decomposes the input image into intrinsic components according to a diffuse physics-based image formation model. We enable non-diffuse effects including cast shadows and specular highlights by predicting a residual correction to the diffuse render. To train and evaluate our model, we collected a portrait database of 21 subjects with various expressions and poses. Each sample is captured in a controlled light stage setup with 32 individual light sources. Our method creates precise and believable relighting results and generalizes to complex illumination conditions and challenging poses, including when the subject is not looking straight at the camera.

Paper [BibTex]

Learning and Tracking the 3D Body Shape of Freely Moving Infants from RGB-D sequences

Hesse, N., Pujades, S., Black, M., Arens, M., Hofmann, U., Schroeder, S.

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10):2540-2551, 2020 (article)

Abstract
Statistical models of the human body surface are generally learned from thousands of high-quality 3D scans in predefined poses to cover the wide variety of human body shapes and articulations. Acquisition of such data requires expensive equipment, calibration procedures, and is limited to cooperative subjects who can understand and follow instructions, such as adults. We present a method for learning a statistical 3D Skinned Multi-Infant Linear body model (SMIL) from incomplete, low-quality RGB-D sequences of freely moving infants. Quantitative experiments show that SMIL faithfully represents the RGB-D data and properly factorizes the shape and pose of the infants. To demonstrate the applicability of SMIL, we fit the model to RGB-D sequences of freely moving infants and show, with a case study, that our method captures enough motion detail for General Movements Assessment (GMA), a method used in clinical practice for early detection of neurodevelopmental disorders in infants. SMIL provides a new tool for analyzing infant shape and movement and is a step towards an automated system for GMA.

pdf Journal DOI [BibTex]

VIBE: Video Inference for Human Body Pose and Shape Estimation

Kocabas, M., Athanasiou, N., Black, M. J.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 5252-5262, IEEE, June 2020 (inproceedings)

Abstract
Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose “Video Inference for Body Pose and Shape Estimation” (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE

arXiv code video supplemental video DOI Project Page [BibTex]

Machine learning systems and methods of estimating body shape from images

Black, M., Rachlin, E., Heron, N., Loper, M., Weiss, A., Hu, K., Hinkle, T., Kristiansen, M.

(US Patent 10,679,046), June 2020 (patent)

Abstract
Disclosed is a method including receiving an input image including a human, predicting, based on a convolutional neural network that is trained using examples consisting of pairs of sensor data, a corresponding body shape of the human and utilizing the corresponding body shape predicted from the convolutional neural network as input to another convolutional neural network to predict additional body shape metrics.

[BibTex]

General Movement Assessment from videos of computed 3D infant body models is equally effective compared to conventional RGB Video rating

Schroeder, S., Hesse, N., Weinberger, R., Tacke, U., Gerstl, L., Hilgendorff, A., Heinen, F., Arens, M., Bodensteiner, C., Dijkstra, L. J., Pujades, S., Black, M., Hadders-Algra, M.

Early Human Development, 144, May 2020 (article)

Abstract
Background: General Movement Assessment (GMA) is a powerful tool to predict Cerebral Palsy (CP). Yet, GMA requires substantial training hampering its implementation in clinical routine. This inspired a world-wide quest for automated GMA. Aim: To test whether a low-cost, marker-less system for three-dimensional motion capture from RGB depth sequences using a whole body infant model may serve as the basis for automated GMA. Study design: Clinical case study at an academic neurodevelopmental outpatient clinic. Subjects: Twenty-nine high-risk infants were recruited and assessed at their clinical follow-up at 2-4 month corrected age (CA). Their neurodevelopmental outcome was assessed regularly up to 12-31 months CA. Outcome measures: GMA according to Hadders-Algra by a masked GMA-expert of conventional and computed 3D body model (“SMIL motion”) videos of the same GMs. Agreement between both GMAs was assessed, and sensitivity and specificity of both methods to predict CP at ≥12 months CA. Results: The agreement of the two GMA ratings was substantial, with κ=0.66 for the classification of definitely abnormal (DA) GMs and an ICC of 0.887 (95% CI 0.762;0.947) for a more detailed GM-scoring. Five children were diagnosed with CP (four bilateral, one unilateral CP). The GMs of the child with unilateral CP were twice rated as mildly abnormal. DA-ratings of both videos predicted bilateral CP well: sensitivity 75% and 100%, specificity 88% and 92% for conventional and SMIL motion videos, respectively. Conclusions: Our computed infant 3D full body model is an attractive starting point for automated GMA in infants at risk of CP.

DOI [BibTex]

Learning Multi-Human Optical Flow

Ranjan, A., Hoffmann, D. T., Tzionas, D., Tang, S., Romero, J., Black, M. J.

International Journal of Computer Vision (IJCV), 128:873-890, April 2020 (article)

Abstract
The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem. However, the training data used by them does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on this dataset. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they can generalize well to real image sequences. The code, trained models, and the dataset are available for research.

pdf DOI poster link (url) DOI [BibTex]

From Variational to Deterministic Autoencoders

Ghosh*, P., Sajjadi*, M. S. M., Vergari, A., Black, M. J., Schölkopf, B.

8th International Conference on Learning Representations (ICLR), April 2020, *equal contribution (conference)

Abstract
Variational Autoencoders (VAEs) provide a theoretically-backed framework for deep generative models. However, they often produce “blurry” images, which is linked to their training objective. Sampling in the most popular implementation, the Gaussian VAE, can be interpreted as simply injecting noise to the input of a deterministic decoder. In practice, this simply enforces a smooth latent space structure. We challenge the adoption of the full VAE framework on this specific point in favor of a simpler, deterministic one. Specifically, we investigate how substituting stochasticity with other explicit and implicit regularization schemes can lead to a meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism for sampling new data points, we propose to employ an efficient ex-post density estimation step that can be readily adopted both for the proposed deterministic autoencoders as well as to improve sample quality of existing VAEs. We show in a rigorous empirical study that regularized deterministic autoencoding achieves state-of-the-art sample quality on the common MNIST, CIFAR-10 and CelebA datasets.
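
The ex-post density estimation step can be sketched as follows; this is a simplified illustration (the `decoder` callable and function names are placeholders, not the paper's code), fitting a simple mixture model over the latent codes of an already-trained deterministic autoencoder and sampling from it to generate new data.

```python
from sklearn.mixture import GaussianMixture

def fit_latent_prior(latents, n_components=10):
    # Fit a density model to the encoder outputs of the training set.
    return GaussianMixture(n_components=n_components).fit(latents)

def sample_images(decoder, prior, n=16):
    # Sample latent codes from the fitted density and decode them.
    z, _ = prior.sample(n)
    return decoder(z)
```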

arXiv link (url) [BibTex]

Attractiveness and Confidence in Walking Style of Male and Female Virtual Characters

Thaler, A., Bieg, A., Mahmood, N., Black, M. J., Mohler, B. J., Troje, N. F.

In IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages: 678-679, March 2020 (inproceedings)

Abstract
Animated virtual characters are essential to many applications. Little is known so far about biological and personality inferences made from a virtual character’s body shape and motion. Here, we investigated how sex-specific differences in walking style relate to the perceived attractiveness and confidence of male and female virtual characters. The characters were generated by reconstructing body shape and walking motion from optical motion capture data. The results suggest that sexual dimorphism in walking style plays a different role in attributing biological and personality traits to male and female virtual characters. This finding has important implications for virtual character animation.

pdf DOI [BibTex]

Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations

Rueegg, N., Lassner, C., Black, M. J., Schindler, K.

In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), pages: 5561-5569, February 2020 (inproceedings)

Abstract
The goal of many computer vision systems is to transform image pixels into 3D representations. Recent popular models use neural networks to regress directly from pixels to 3D object parameters. Such an approach works well when supervision is available, but in problems like human pose and shape estimation, it is difficult to obtain natural images with 3D ground truth. To go one step further, we propose a new architecture that facilitates unsupervised, or lightly supervised, learning. The idea is to break the problem into a series of transformations between increasingly abstract representations. Each step involves a cycle designed to be learnable without annotated training data, and the chain of cycles delivers the final solution. Specifically, we use 2D body part segments as an intermediate representation that contains enough information to be lifted to 3D, and at the same time is simple enough to be learned in an unsupervised way. We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images. We also explore varying amounts of paired data and show that cycling greatly alleviates the need for paired data. While we present results for modeling humans, our formulation is general and can be applied to other vision problems.

pdf [BibTex]

Machine learning systems and methods for augmenting images

Black, M., Rachlin, E., Lee, E., Heron, N., Loper, M., Weiss, A., Smith, D.

(US Patent 10,529,137 B1), January 2020 (patent)

Abstract
Disclosed is a method including receiving visual input comprising a human within a scene, detecting a pose associated with the human using a trained machine learning model that detects human poses to yield a first output, estimating a shape (and optionally a motion) associated with the human using a trained machine learning model associated that detects shape (and optionally motion) to yield a second output, recognizing the scene associated with the visual input using a trained convolutional neural network which determines information about the human and other objects in the scene to yield a third output, and augmenting reality within the scene by leveraging one or more of the first output, the second output, and the third output to place 2D and/or 3D graphics in the scene.

[BibTex]

Real Time Trajectory Prediction Using Deep Conditional Generative Models

Gomez-Gonzalez, S., Prokudin, S., Schölkopf, B., Peters, J.

IEEE Robotics and Automation Letters, 5(2):970-976, IEEE, January 2020 (article)

arXiv DOI [BibTex]

2005


Representing cyclic human motion using functional analysis

Ormoneit, D., Black, M. J., Hastie, T., Kjellström, H.

Image and Vision Computing, 23(14):1264-1276, December 2005 (article)

Abstract
We present a robust automatic method for modeling cyclic 3D human motion such as walking using motion-capture data. The pose of the body is represented by a time-series of joint angles which are automatically segmented into a sequence of motion cycles. The mean and the principal components of these cycles are computed using a new algorithm that enforces smooth transitions between the cycles by operating in the Fourier domain. Key to this method is its ability to automatically deal with noise and missing data. A learned walking model is then exploited for Bayesian tracking of 3D human motion.
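
For intuition, a generic truncated Fourier representation of one joint-angle trajectory over a cycle of period T is shown below; this is the standard form such Fourier-domain cycle models build on, not the paper's exact algorithm.

```latex
% One joint angle over a motion cycle of period T, K harmonics:
\theta(t) \approx a_0 + \sum_{k=1}^{K}
  \Big[ a_k \cos\!\big(\tfrac{2\pi k t}{T}\big)
      + b_k \sin\!\big(\tfrac{2\pi k t}{T}\big) \Big]
```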

pdf pdf from publisher DOI [BibTex]

A quantitative evaluation of video-based 3D person tracking

Balan, A. O., Sigal, L., Black, M. J.

In The Second Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, VS-PETS, pages: 349-356, October 2005 (inproceedings)

pdf [BibTex]

Inferring attentional state and kinematics from motor cortical firing rates

Wood, F., Prabhat, , Donoghue, J. P., Black, M. J.

In Proc. IEEE Engineering in Medicine and Biology Society, pages: 1544-1547, September 2005 (inproceedings)

pdf [BibTex]

Motor cortical decoding using an autoregressive moving average model

Fisher, J., Black, M. J.

In Proc. IEEE Engineering in Medicine and Biology Society, pages: 1469-1472, September 2005 (inproceedings)

pdf [BibTex]

Fields of Experts: A framework for learning image priors

Roth, S., Black, M. J.

In IEEE Conf. on Computer Vision and Pattern Recognition, 2, pages: 860-867, June 2005 (inproceedings)

pdf [BibTex]

A Flow-Based Approach to Vehicle Detection and Background Mosaicking in Airborne Video

Yalcin, H., Collins, R., Black, M. J., Hebert, M.

IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Video Proceedings, pages: 1202, 2005 (inproceedings)

YouTube pdf [BibTex]

On the spatial statistics of optical flow

(Marr Prize, Honorable Mention)

Roth, S., Black, M. J.

In International Conf. on Computer Vision, pages: 42-49, 2005 (inproceedings)

pdf [BibTex]

Modeling neural population spiking activity with Gibbs distributions

Wood, F., Roth, S., Black, M. J.

In Advances in Neural Information Processing Systems 18, pages: 1537-1544, 2005 (inproceedings)

pdf [BibTex]

Energy-based models of motor cortical population activity

Wood, F., Black, M.

Program No. 689.20. 2005 Abstract Viewer/Itinerary Planner, Society for Neuroscience, Washington, DC, 2005 (conference)

abstract [BibTex]


1997


Robust anisotropic diffusion and sharpening of scalar and vector images

Black, M. J., Sapiro, G., Marimont, D., Heeger, D.

In Int. Conf. on Image Processing, ICIP, 1, pages: 263-266, Vol. 1, Santa Barbara, CA, October 1997 (inproceedings)

Abstract
Relations between anisotropic diffusion and robust statistics are described. We show that anisotropic diffusion can be seen as a robust estimation procedure that estimates a piecewise smooth image from a noisy input image. The "edge-stopping" function in the anisotropic diffusion equation is closely related to the error norm and influence function in the robust estimation framework. This connection leads to a new "edge-stopping" function based on Tukey's biweight robust estimator, that preserves sharper boundaries than previous formulations and improves the automatic stopping of the diffusion. The robust statistical interpretation also provides a means for detecting the boundaries (edges) between the piecewise smooth regions in the image. We extend the framework to vector-valued images and show applications to robust image sharpening.
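
For reference, the diffusion equation and the Tukey-biweight edge-stopping function discussed in the abstract can be written, up to constant scaling, as follows; the key relation is that the edge-stopping function g is tied to a robust error norm rho via g(x) = rho'(x)/x.

```latex
% Anisotropic diffusion with a robust edge-stopping function g:
I_t = \operatorname{div}\!\big( g(\lVert \nabla I \rVert)\, \nabla I \big), \qquad
g_{\text{Tukey}}(x) =
\begin{cases}
\big[\,1 - (x/\sigma)^2\,\big]^2 & |x| \le \sigma \\[2pt]
0 & \text{otherwise}
\end{cases}
```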

pdf publisher site [BibTex]

Robust anisotropic diffusion: Connections between robust statistics, line processing, and anisotropic diffusion

Black, M. J., Sapiro, G., Marimont, D., Heeger, D.

In Scale-Space Theory in Computer Vision, Scale-Space’97, pages: 323-326, LNCS 1252, Springer Verlag, Utrecht, the Netherlands, July 1997 (inproceedings)

pdf [BibTex]

Learning parameterized models of image motion

Black, M. J., Yacoob, Y., Jepson, A. D., Fleet, D. J.

In IEEE Conf. on Computer Vision and Pattern Recognition, CVPR-97, pages: 561-567, Puerto Rico, June 1997 (inproceedings)

Abstract
A framework for learning parameterized models of optical flow from image sequences is presented. A class of motions is represented by a set of orthogonal basis flow fields that are computed from a training set using principal component analysis. Many complex image motions can be represented by a linear combination of a small number of these basis flows. The learned motion models may be used for optical flow estimation and for model-based recognition. For optical flow estimation we describe a robust, multi-resolution scheme for directly computing the parameters of the learned flow models from image derivatives. As examples we consider learning motion discontinuities, non-rigid motion of human mouths, and articulated human motion.
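
The learned-basis idea can be summarized by the standard linear motion model below (generic form): a flow field is a linear combination of a small number of basis flow fields obtained by PCA of training flows, with the coefficients estimated robustly from image derivatives.

```latex
% Flow at pixel x as a combination of K learned basis flow fields M_k:
\mathbf{u}(\mathbf{x}; \mathbf{a}) = \sum_{k=1}^{K} a_k\, \mathbf{M}_k(\mathbf{x})
```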

pdf [BibTex]

Analysis of gesture and action in technical talks for video indexing

Ju, S. X., Black, M. J., Minneman, S., Kimber, D.

In IEEE Conf. on Computer Vision and Pattern Recognition, pages: 595-601, CVPR-97, Puerto Rico, June 1997 (inproceedings)

Abstract
In this paper, we present an automatic system for analyzing and annotating video sequences of technical talks. Our method uses a robust motion estimation technique to detect key frames and segment the video sequence into subsequences containing a single overhead slide. The subsequences are stabilized to remove motion that occurs when the speaker adjusts their slides. Any changes remaining between frames in the stabilized sequences may be due to speaker gestures such as pointing or writing and we use active contours to automatically track these potential gestures. Given the constrained domain we define a simple "vocabulary" of actions which can easily be recognized based on the active contour shape and motion. The recognized actions provide a rich annotation of the sequence that can be used to access a condensed version of the talk from a web page.

pdf [BibTex]

Modeling appearance change in image sequences

Black, M. J., Yacoob, Y., Fleet, D. J.

In Advances in Visual Form Analysis, pages: 11-20, Proceedings of the Third International Workshop on Visual Form, Capri, Italy, May 1997 (inproceedings)

abstract [BibTex]

Recognizing facial expressions in image sequences using local parameterized models of image motion

Black, M. J., Yacoob, Y.

Int. Journal of Computer Vision, 25(1):23-48, 1997 (article)

Abstract
This paper explores the use of local parametrized models of image motion for recovering and recognizing the non-rigid and articulated motion of human faces. Parametric flow models (for example affine) are popular for estimating motion in rigid scenes. We observe that within local regions in space and time, such models not only accurately model non-rigid facial motions but also provide a concise description of the motion in terms of a small number of parameters. These parameters are intuitively related to the motion of facial features during facial expressions and we show how expressions such as anger, happiness, surprise, fear, disgust, and sadness can be recognized from the local parametric motions in the presence of significant head motion. The motion tracking and expression recognition approach performed with high accuracy in extensive laboratory experiments involving 40 subjects as well as in television and movie sequences.

pdf pdf from publisher abstract video [BibTex]


Recognizing human motion using parameterized models of optical flow

Black, M. J., Yacoob, Y., Ju, X. S.

In Motion-Based Recognition, pages: 245-269, (Editors: Mubarak Shah and Ramesh Jain), Kluwer Academic Publishers, Boston, MA, 1997 (incollection)

pdf [BibTex]


1996


Cardboard people: A parameterized model of articulated motion

Ju, S. X., Black, M. J., Yacoob, Y.

In 2nd Int. Conf. on Automatic Face- and Gesture-Recognition, pages: 38-44, Killington, Vermont, October 1996 (inproceedings)

Abstract
We extend the work of Black and Yacoob on the tracking and recognition of human facial expressions using parameterized models of optical flow to deal with the articulated motion of human limbs. We define a "cardboard person model" in which a person's limbs are represented by a set of connected planar patches. The parameterized image motion of these patches is constrained to enforce articulated motion and is solved for directly using a robust estimation technique. The recovered motion parameters provide a rich and concise description of the activity that can be used for recognition. We propose a method for performing view-based recognition of human activities from the optical flow parameters that extends previous methods to cope with the cyclical nature of human motion. We illustrate the method with examples of tracking human legs over long image sequences.

pdf [BibTex]

Estimating optical flow in segmented images using variable-order parametric models with local deformations

Black, M. J., Jepson, A.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):972-986, October 1996 (article)

Abstract
This paper presents a new model for estimating optical flow based on the motion of planar regions plus local deformations. The approach exploits brightness information to organize and constrain the interpretation of the motion by using segmented regions of piecewise smooth brightness to hypothesize planar regions in the scene. Parametric flow models are estimated in these regions in a two step process which first computes a coarse fit and estimates the appropriate parameterization of the motion of the region (two, six, or eight parameters). The initial fit is refined using a generalization of the standard area-based regression approaches. Since the assumption of planarity is likely to be violated, we allow local deformations from the planar assumption in the same spirit as physically-based approaches which model shape using coarse parametric models plus local deformations. This parametric+deformation model exploits the strong constraints of parametric approaches while retaining the adaptive nature of regularization approaches. Experimental results on a variety of images indicate that the parametric+deformation model produces accurate flow estimates while the incorporation of brightness segmentation provides precise localization of motion boundaries.
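
For reference, the highest-order (eight-parameter) planar motion model referred to in the abstract has the standard form below; the two- and six-parameter cases are obtained by dropping the higher-order terms.

```latex
% Image flow of a rigid planar region (standard eight-parameter model):
u(x, y) = a_0 + a_1 x + a_2 y + a_6 x^2 + a_7 x y, \qquad
v(x, y) = a_3 + a_4 x + a_5 y + a_6 x y + a_7 y^2
```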

pdf pdf from publisher [BibTex]

On the unification of line processes, outlier rejection, and robust statistics with applications in early vision

Black, M., Rangarajan, A.

International Journal of Computer Vision, 19(1):57-92, July 1996 (article)

Abstract
The modeling of spatial discontinuities for problems such as surface recovery, segmentation, image reconstruction, and optical flow has been intensely studied in computer vision. While “line-process” models of discontinuities have received a great deal of attention, there has been recent interest in the use of robust statistical techniques to account for discontinuities. This paper unifies the two approaches. To achieve this we generalize the notion of a “line process” to that of an analog “outlier process” and show how a problem formulated in terms of outlier processes can be viewed in terms of robust statistics. We also characterize a class of robust statistical problems for which an equivalent outlier-process formulation exists and give a straightforward method for converting a robust estimation problem into an outlier-process formulation. We show how prior assumptions about the spatial structure of outliers can be expressed as constraints on the recovered analog outlier processes and how traditional continuation methods can be extended to the explicit outlier-process formulation. These results indicate that the outlier-process approach provides a general framework which subsumes the traditional line-process approaches as well as a wide class of robust estimation problems. Examples in surface reconstruction, image segmentation, and optical flow are presented to illustrate the use of outlier processes and to show how the relationship between outlier processes and robust statistics can be exploited. An appendix provides a catalog of common robust error norms and their equivalent outlier-process formulations.
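
The unification can be stated schematically as below: applying a robust error norm rho to a residual r is equivalent to a quadratic penalty modulated by an analog outlier process z, plus a penalty Psi(z) for discarding data (schematic form, not the paper's full notation).

```latex
% Robust norm <-> analog outlier process:
\rho(r) = \min_{0 \le z \le 1} \big[\, z\, r^2 + \Psi(z) \,\big]
```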

pdf pdf from publisher DOI [BibTex]


Skin and Bones: Multi-layer, locally affine, optical flow and regularization with transparency

(Nominated: Best paper)

Ju, S., Black, M. J., Jepson, A. D.

In IEEE Conf. on Computer Vision and Pattern Recognition, CVPR’96, pages: 307-314, San Francisco, CA, June 1996 (inproceedings)

pdf [BibTex]

EigenTracking: Robust matching and tracking of articulated objects using a view-based representation

Black, M. J., Jepson, A.

In Proc. Fourth European Conf. on Computer Vision, ECCV’96, pages: 329-342, LNCS 1064, Springer Verlag, Cambridge, England, April 1996 (inproceedings)

pdf video [BibTex]

Mixture Models for Image Representation

Jepson, A., Black, M.

PRECARN ARK Project Technical Report ARK96-PUB-54, March 1996 (techreport)

Abstract
We consider the estimation of local greylevel image structure in terms of a layered representation. This type of representation has recently been successfully used to segment various objects from clutter using either optical flow or stereo disparity information. We argue that the same type of representation is useful for greylevel data in that it allows for the estimation of properties for each of several different components without prior segmentation. Our emphasis in this paper is on the process used to extract such a layered representation from a given image. In particular, we consider a variant of the EM algorithm for the estimation of the layered model and consider a novel technique for choosing the number of layers to use. We briefly consider the use of a simple version of this approach for image segmentation and suggest two potential applications to the ARK project.

pdf [BibTex]

The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields

Black, M. J., Anandan, P.

Computer Vision and Image Understanding, 63(1):75-104, January 1996 (article)

Abstract
Most approaches for estimating optical flow assume that, within a finite image region, only a single motion is present. This single motion assumption is violated in common situations involving transparency, depth discontinuities, independently moving objects, shadows, and specular reflections. To robustly estimate optical flow, the single motion assumption must be relaxed. This paper presents a framework based on robust estimation that addresses violations of the brightness constancy and spatial smoothness assumptions caused by multiple motions. We show how the robust estimation framework can be applied to standard formulations of the optical flow problem thus reducing their sensitivity to violations of their underlying assumptions. The approach has been applied to three standard techniques for recovering optical flow: area-based regression, correlation, and regularization with motion discontinuities. This paper focuses on the recovery of multiple parametric motion models within a region, as well as the recovery of piecewise-smooth flow fields, and provides examples with natural and synthetic image sequences.
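
A schematic of the robust formulation for the area-based case is given below: the brightness-constancy residual is penalized with a robust norm rho, for example the Lorentzian, rather than a quadratic, so that pixels violating the single-motion assumption are down-weighted.

```latex
% Robust area-based objective over a region R, with the Lorentzian norm:
E(u, v) = \sum_{\mathbf{x} \in R} \rho\big( I_x u + I_y v + I_t,\ \sigma \big), \qquad
\rho(r, \sigma) = \log\!\Big( 1 + \tfrac{1}{2}\big(\tfrac{r}{\sigma}\big)^2 \Big)
```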

pdf pdf from publisher [BibTex]
