SPIN is a state-of-the-art deep network for regressing SMPL body shape and pose parameters directly from an image. SPIN uses a novel training method that combines a bottom-up deep network with a top-down, model-based, fitting method. SMPLify model fitting is used in the loop with the DNN training to provide SMPL parameters used in the training loss. Code is available.
human pose 3D body SMPL deep learning SMPLify
AMASS is a large dataset of human motions - 45 hours and growing. AMASS enables the training of deep neural networks to model human motion. AMASS unifies multiple datasets by fitting the SMPL body model to mocap markers. The dataset includes SMPL-H body shapes and poses as well as DMPL soft tissue motions. If you want to include your own mocap sequences in the dataset, please contact us. The release includes tutorial code for training DNNs with AMASS.
mocap motion capture 3D body SMPL MoSh deep learning
We present the first method to perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild. In particular, we focus on the problem of capturing 3D information about Grevy's zebras from a collection of images. We integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation. We couple 3D pose and shape prediction with the task of texture synthesis, obtaining a full texture map of the animal from a single image. The predicted texture map allows a novel per-instance unsupervised optimization over the network features. We call the method SMALST (SMAL with learned Shape and Texture).
3D animal pose estimation; animal shape
Competitive Collaboration is a generic framework in which networks learn to collaborate and compete, thereby achieving specific goals. Competitive Collaboration is a three-player game in which two players compete for a resource that is regulated by a third player, the moderator. This framework is similar in spirit to expectation-maximization (EM) but is formulated for neural network training.
unsupervised-learning; depth; optical-flow; odometry; camera-motion; segmentation; competitive; collaboration
Estimating hand-object manipulation is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work we regularize the joint reconstruction of hands and objects with manipulation constraints. We provide an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. To train and evaluate the model, we also provide a new large-scale synthetic dataset, ObMan, with hand-object manipulations. Our approach significantly improves grasp quality metrics over baselines on synthetic and real datasets, using RGB images as input.
SMPL-X is a major update to the SMPL body model that adds an expressive face and fully articulated hands. If you use SMPL, this is a straightforward upgrade that improves realism and allows you to capture facial expressions and gestures. We also provide SMPLify-X to estimate SMPL-X from a single image. This is a major update to SMPLify in several senses: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and select the appropriate body model (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy.
Code: We provide the inference code of RingNet. Please check the repository, which is self-explanatory. NoW Benchmark Dataset and Challenge: Please check the external link to download the data and participate in the challenge.
VOCA (Voice Operated Character Animation) is a framework that takes a speech signal as input and realistically animates a wide range of adult faces.
Code: We provide Python demo code that outputs a 3D head animation given a speech signal and a static 3D head mesh. The codebase also provides animation control to alter the speaking style, identity-dependent facial shape, and head pose (i.e. head rotation around the neck) during animation. The code further demonstrates how to sample 3D head meshes from the publicly available FLAME model, which can then be animated with the provided code.
Dataset: We capture a unique 4D face dataset (VOCASET) with about 29 minutes of 3D scans captured at 60 fps and synchronized audio from 12 speakers. We provide the raw 3D scans, registrations in FLAME topology, and unposed registrations (i.e. registrations in "zero pose").
This is the code for our SIGGRAPH Asia 2018 project
SMIL is a learned 3D model of infant body shape and pose that can be animated and fit to data. It is based on SMPL but the shape space is adapted to capture the body shape of babies.
infant body model pose shape baby RGB-D movement mocap
The "3D Poses in the Wild dataset" is the first dataset with monocular hand-held video together with accurate 3D human poses for evaluation. Our method combines video and IMU to recover accurate 3D human body models and their projection into the video sequences. The dataset includes: 60 video sequences; 2D pose annotations; 3D poses obtained with our method; Camera poses for every frame in the sequences; 3D body scans; and 18 3D human models with different clothing variations.
IMU pose 3D ground truth
The code allows one to build convolutional networks on mesh structures, analogous to CNNs on images. It includes mesh convolutions and introduces downsampling and upsampling operators that can be applied directly to the mesh structure. The code implements a Convolutional Mesh Autoencoder using these mesh processing operators and achieves state-of-the-art results on generating 3D facial meshes.
face mesh convolutions autoencoder
When working in 3D graphics, one needs to load raw data, process it in various ways, visualize the results to aid understanding, and then save the output in a variety of formats. Here we release the Mesh Library to facilitate all of these operations. This library is built on top of OpenGL and CGAL, with an easy-to-use Python interface. Beyond basic usage such as data I/O and interactive visualization, it also supports more complex functionality such as texture rendering, visibility computation, and geometry arithmetic. We hope the release of this tool makes entry into the 3D world smoother for interested people.
mesh library 3d graphics
The optical flow of humans is well known to be useful for the analysis of human action. Given this, we devise an optical flow algorithm specifically for human motion and show that it is superior to generic flow methods. Designing a method by hand is impractical, so we develop a new training database of image sequences with ground truth optical flow. For this we use a 3D model of the human body and motion capture data to synthesize realistic flow fields. We then train a convolutional neural network to estimate human flow fields from pairs of images. Since many applications in human motion analysis depend on speed, and we anticipate mobile applications, we base our method on SPyNet with several modifications. We demonstrate that our trained network is more accurate than a wide range of top methods on held-out test data and that it generalizes well to real image sequences. When combined with a person detector/tracker, the approach provides a full solution to the problem of 2D human flow estimation. Both the code and the dataset are available for research.
The SMALR release includes an updated SMAL model of animals and 3D animal models recovered from images. SMALR is the Skinned Multi-Animal Linear Model with Refinement. All the 3D shapes from the CVPR paper are available for download as 3D meshes, which can be posed and animated. As we create new meshes, they will be added here.
Trained model to estimate 3D human shape and pose directly from an image. The input is pixels, and the output is a 3D body in SMPL format (shape parameters and pose parameters). Also provided is the code and data needed to train the model.
Model-based reconstruction of 3D SMPL body shape and pose from multi-view images. 2D joints and silhouettes from multiple views are used in the fitting process, and a DCT-based temporal prior regularizes the recovered 3D joint trajectories.
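To illustrate what a DCT-based temporal prior can look like, here is a minimal sketch. It is not the released code: the function names, the single-coordinate trajectory, and the choice to penalize every coefficient above a cutoff are all assumptions made for illustration. The idea shown is that a smooth joint trajectory has most of its energy in low-frequency DCT coefficients, so energy in the high-frequency coefficients can serve as a smoothness penalty.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1D sequence (direct O(n^2) evaluation)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (t + 0.5) * k / n)
            for t, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def dct_smoothness_penalty(trajectory, num_low_freq):
    """Energy in the DCT coefficients above the first `num_low_freq`.

    `trajectory` is one coordinate of one joint over time (hypothetical
    input layout); a real fitting pipeline would sum this over joints
    and coordinates.
    """
    coeffs = dct_ii(trajectory)
    return sum(c * c for c in coeffs[num_low_freq:])


# A smooth trajectory (a single low-frequency cosine) versus the same
# trajectory with frame-to-frame jitter added.
frames = 16
smooth = [math.cos(math.pi * (t + 0.5) / frames) for t in range(frames)]
jittery = [x + 0.2 * (-1) ** t for t, x in enumerate(smooth)]

smooth_penalty = dct_smoothness_penalty(smooth, num_low_freq=2)
jittery_penalty = dct_smoothness_penalty(jittery, num_low_freq=2)
```

In this toy setting the jittery trajectory receives a much larger penalty than the smooth one, which is the behaviour a temporal prior of this kind relies on.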
Data, code and model. This includes over 1000 3D hand scans and aligned meshes, the learned 3D hand shape model, the full articulated hand model with pose-dependent blend shapes. Also included is the SMPL body model with the hands attached to it, providing a realistic hand and body model.
FLAME is a lightweight and expressive generic head model learned from over 33,000 accurately aligned 3D scans. We provide the trained 3D face models, registrations for the dynamic D3DFACS dataset, and demo code in Chumpy and TensorFlow to load and sample the model, and to fit the model to 3D landmarks.
head model face model morphable-model
First large-scale person dataset to generate depth, body parts, optical flow, 2D/3D pose, surface normals ground truth for RGB video input. The dataset contains 6M frames of synthetic humans. The images are photo-realistic renderings of people under large variations in shape, texture, view-point and pose. To ensure realism, the synthetic bodies are created using the SMPL body model, whose parameters are fit by the MoSh method given raw 3D MoCap marker data. Trained CNNs are also provided.
High quality 4D dataset of people in clothing with ground truth 3D shape. The BUFF dataset consists of 5 subjects, 3 male and 2 female wearing 2 clothing styles: a) t-shirt and long pants and b) a soccer outfit. They perform 3 different motions i) hips ii) tilt_twist_left iii) shoulders_mill.
This dataset is a unique resource containing over 40,000 4D scans of multiple people; 4D means 3D scans over time. Processing 4D data is challenging, so we provide aligned data in which we have registered a common template mesh to all scans. This alignment process takes into account geometry and surface texture to make it accurate. The dataset includes the raw scan data, registered template meshes, and masks that say where the template mesh is sufficiently accurate to be considered ground truth.
We provide the SMAL model of animal shapes and demo code. We also provide all the results from the CVPR paper of animal shapes estimated from images. We do not provide the 3D scans of the toy animals for copyright reasons but do provide a shopping list so that you can purchase the same toys that we used.
The dataset includes annotations of common human pose datasets. These include 3D body pose, 91 surface and joint landmarks, foreground segmentation, and body part segments. Together with the images, these can be used to train neural networks for human pose estimation tasks, including 3D pose estimation. The 3D body is represented by SMPL. Training code is provided.
Code for the paper "Optical Flow in Mostly Rigid Scenes" by Jonas Wulff, Laura Sevilla-Lara, Michael Black, CVPR 2017. This is one of the best performing methods across different datasets. In rigid parts of the scene, a plane-plus-parallax model is used. The method segments out the non-rigid regions and uses a more generic flow method there.
We provide an image-based generative model of people in clothing for the full body. The training dataset is built on top of Chictopia10K. We provide processed annotations as well as the SMPL body model fit to the images. We also provide our trained models for download.
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. Besides, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state-of-the-art in optical flow under various levels of motion blur.
We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Check the website for updates; we provide code for the original SPyNet as well as an end-to-end trainable version.
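The coarse-to-fine loop described above can be sketched on 1D signals. This is an illustrative toy, not the released implementation: the helper names are hypothetical, signal lengths are assumed to be divisible by two at every level, and `search_update` is a brute-force stand-in for the per-level network, which in the actual method is a trained deep network.

```python
import math


def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0
            for i in range(0, len(signal) - 1, 2)]


def warp(signal, flow):
    """Sample `signal` at x + flow[x], linear interpolation, clamped borders."""
    n = len(signal)
    out = []
    for x in range(n):
        pos = min(max(x + flow[x], 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


def search_update(first, warped_second,
                  candidates=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Stand-in for the per-level network: pick the best global residual shift."""
    n = len(first)

    def error(shift):
        shifted = warp(warped_second, [shift] * n)
        return sum((a - b) ** 2 for a, b in zip(first, shifted))

    best = min(candidates, key=error)
    return [best] * n


def coarse_to_fine_flow(first, second, num_levels, update_fn=search_update):
    """Estimate flow such that first[x] is close to second[x + flow[x]]."""
    pyramid = [(first, second)]
    for _ in range(num_levels - 1):
        f, s = pyramid[-1]
        pyramid.append((downsample(f), downsample(s)))

    flow = [0.0] * len(pyramid[-1][0])
    for level, (f, s) in enumerate(reversed(pyramid)):
        if level > 0:
            # Upsample the flow and double it to match the finer grid.
            flow = [2.0 * v for v in flow for _ in range(2)]
        warped = warp(s, flow)           # warp by the current estimate
        update = update_fn(f, warped)    # residual flow at this level
        flow = [v + u for v, u in zip(flow, update)]
    return flow


# Toy example: a bump displaced by four samples, larger than any single
# update the stand-in can produce at the finest level.
first = [math.exp(-((x - 12) / 4.0) ** 2) for x in range(32)]
second = [math.exp(-((x - 16) / 4.0) ** 2) for x in range(32)]
flow = coarse_to_fine_flow(first, second, num_levels=3)
```

Each level only has to explain a small residual motion because the coarser levels have already accounted for most of the displacement, which is why large motions become tractable.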
Webpages for the GCPR 2013, GCPR 2014, ICCV 2015, IJCV 2016, ECCVw 2016 papers. The data contains: (IJCV 2016, GCPR 2014) annotated RGB-D and multicamera-RGB dataset of one or two hands interacting with each other and/or with a rigid or an articulated object, (ICCV 2015) RGB-D dataset of a hand rotating a rigid object for 3d scanning, (GCPR 2013) synthetic dataset of two hands interacting with each other, (ECCVw 2016) RGB-D dataset of an object under manipulation.
Given a single image, extract the 3D SMPL pose and shape parameters. We provide the Python demo code needed to run SMPLify. We also provide results from the ECCV paper for comparison. For all the datasets we used (LSP, HumanEva-I, Human3.6M) we provide the detected joints and our results as SMPL model parameters and as a mesh (vertices and faces). The code package includes an example script showing how to load results. Please see the README in the code package and the FAQ.
This website provides a tool to explore 3D body shape and linguistic descriptions of shape. We provide a set of shape sliders and linguistic sliders that can be used to change body shape. This allows you to explore how people think about body shape and how shape and adjectives are correlated.
Data and code necessary to reproduce results from the CVPR 2016 paper on semantic optical flow. Semantic scene segmentation enables different flow models to be used in different regions and then composed using a locally layered approach.
Matlab implementation of the paper "Video Segmentation via Object Flow" by Yi-Hsuan Tsai, Ming-Hsuan Yang, and Michael J. Black, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
SMPL is like a PDF format for 3D bodies. It is a realistic 3D model of the human body, based on blend skinning and blend shapes, that is learned from thousands of 3D body scans. It is fully portable, works with many existing game engines, and is useful for computer vision. This site provides resources to learn about SMPL, including example FBX files with animated SMPL models, and code for using SMPL in Python, Maya and Unity. The Python code shows how to use SMPL in computer vision problems. Maya and Unity scripts help set up the model for animation in these 3D environments. We provide regular updates with new features such as dynamic blend shapes, animated mocap sequences, and model improvements.
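For readers unfamiliar with the two ingredients named above, the following toy sketch shows blend shapes and linear blend skinning in 2D. It does not use the SMPL code or its API; all function names and the data are hypothetical, each joint rotates independently rather than through a kinematic chain, and pose-dependent blend shapes are left out.

```python
import math


def apply_blend_shapes(template, shape_dirs, betas):
    """Add a linear combination of per-vertex offsets to the template.

    shape_dirs[k][i] is the 2D offset of vertex i for shape direction k,
    and betas[k] is the coefficient of that direction.
    """
    return [
        tuple(
            v[d] + sum(b * dirs[i][d] for b, dirs in zip(betas, shape_dirs))
            for d in range(2)
        )
        for i, v in enumerate(template)
    ]


def rotate_about(point, joint, angle):
    """Rotate a 2D point around a joint location by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = point[0] - joint[0], point[1] - joint[1]
    return (joint[0] + c * x - s * y, joint[1] + s * x + c * y)


def linear_blend_skinning(vertices, joints, angles, weights):
    """Blend the per-joint transformed positions with per-vertex weights."""
    posed = []
    for v, w in zip(vertices, weights):
        moved = [rotate_about(v, j, a) for j, a in zip(joints, angles)]
        posed.append(
            tuple(sum(wj * m[d] for wj, m in zip(w, moved)) for d in range(2))
        )
    return posed


# Two vertices, two joints; the second vertex is influenced by both joints.
template = [(1.0, 0.0), (2.0, 0.0)]
joints = [(0.0, 0.0), (1.0, 0.0)]
weights = [[1.0, 0.0], [0.5, 0.5]]

shaped = apply_blend_shapes(template, [[(0.0, 0.1), (0.0, 0.2)]], [2.0])
posed = linear_blend_skinning(template, joints, [0.0, math.pi / 2], weights)
```

The shape step changes the rest geometry linearly in the coefficients, while the skinning step moves each vertex by a weighted mix of joint transforms. Because both steps are simple linear operations on vertices, a model of this kind is easy to port to standard animation tools.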
Marker-based motion capture (mocap) is widely criticized as producing lifeless animations. MoSh (Motion and Shape capture) automatically extracts detail present in the original mocap marker data. MoSh estimates body shape and pose together using sparse marker data by exploiting a parametric model of the human body. The dataset contains: 1) The original .c3d files with mocap marker data. 2) Estimated 3D shape meshes. 3) 3D scans from a high resolution scanner for comparison.
To look human, digital full-body avatars need to have soft tissue deformations like those of real people. Current methods for physics simulation of soft tissue lack realism, are computationally expensive, or are hard to tune. Learning soft tissue motion from example, however, has been limited by the lack of dense, high-resolution, training data. We address this using a 4D capture system and a method for accurately registering 3D scans across time to a template mesh. Using over 40,000 scans of ten subjects, we compute how soft tissue motion causes mesh triangles to deform relative to a base 3D body model and learn a low-dimensional linear subspace approximating this soft-tissue deformation. This dataset contains all 40,000 training meshes which have the same mesh topology. See the Dynamic FAUST dataset for the raw scans and improved registered meshes.
This release contains code and data. Most mocap datasets are too small or too constrained to capture the full range of human motions. In particular, they are too small to explore joint angle limits. Here we provide a mocap dataset in which the subjects are gymnasts who are able to explore a wide range of human poses. The dataset allows one to develop pose priors that obey these limits and to model how these joint limits actually vary with pose. We include code to learn joint angle limits and to estimate 3D pose from 2D joint locations.
KITTI is one of the most popular datasets for evaluation of vision algorithms, particularly in the context of street scenes and autonomous driving. The stereo 2015 / flow 2015 / scene flow 2015 benchmark consists of 200 training scenes and 200 test scenes (4 color images per scene, saved in lossless PNG format). Compared to the stereo 2012 and flow 2012 benchmarks, it comprises dynamic scenes for which the ground truth has been established in a semi-automatic process.
The Stitched Puppet (SP) is a realistic part-based 3D body model of the human body. It offers the best features of part-based body models used in Computer Vision and statistical body models used in Computer Graphics. The release includes data and code to fit the SP model to 3D scans.
This software package contains two algorithms for the computation of optical flow, as described in Wulff & Black, "Efficient Sparse-to-Dense Optical Flow Estimation using a Learned Basis and Layers" (CVPR 2015). PCA-Flow computes approximate optical flow extremely quickly, by making the assumption that optical flow lies on a low-dimensional subspace. PCA-Layers extends this to a layered model to increase accuracy, especially at boundaries. It is the most accurate layered model on the MPI Sintel dataset.
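The subspace assumption behind PCA-Flow can be illustrated with a small sketch: if the dense flow is a linear combination of a few basis flow fields, the coefficients can be estimated from a handful of sparse matches and then expanded back to a dense field. The code below is a toy under stated assumptions, not the released package: it works on a 1D flow field, uses a hand-written two-vector basis in place of a learned PCA basis, and uses plain least squares via the normal equations, with hypothetical function names throughout.

```python
def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_flow_from_sparse_matches(basis, match_indices, match_flow):
    """Least-squares basis coefficients from sparse matches, then dense flow.

    basis[j][i] is the value of basis flow field j at pixel i.
    """
    k = len(basis)
    rows = [[basis[j][i] for j in range(k)] for i in match_indices]
    normal = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
              for a in range(k)]
    rhs = [sum(r[a] * u for r, u in zip(rows, match_flow)) for a in range(k)]
    coeffs = solve_linear_system(normal, rhs)
    num_pixels = len(basis[0])
    dense = [sum(w * basis[j][i] for j, w in enumerate(coeffs))
             for i in range(num_pixels)]
    return coeffs, dense


# Hand-written basis: a constant field and a linear ramp over 8 pixels.
basis = [[1.0] * 8, [float(i) for i in range(8)]]
# Three sparse matches generated from flow(i) = 2.0 + 0.5 * i.
coeffs, dense_flow = fit_flow_from_sparse_matches(
    basis, match_indices=[1, 4, 6], match_flow=[2.5, 4.0, 5.0]
)
```

Only as many unknowns as basis vectors have to be estimated, regardless of image size, which is why an approach of this kind can be very fast.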
OpenDR is the first open source differentiable renderer. It provides a simple Python interface for defining an objective function with a forward generative process and then automatically differentiating and optimizing this. OpenDR allows for quick design and testing of generative models in computer vision. The code provides examples. OpenDR has been widely used.
FAUST contains 300 real, high-resolution human scans of 10 different subjects in 30 different poses, with automatically computed ground-truth correspondences. We provide a training set with scans and ground truth correspondence. We also provide a separate test set of scans with an evaluation website that compares results of mesh correspondence.
The Grassmann Averages PCA is a method for extracting the principal components from a set of vectors, with the following nice properties: 1) it is of linear complexity with respect to the dimension of the vectors and the size of the data, which makes the method highly scalable, 2) it is more robust to outliers than PCA in the sense that it minimizes an L1 norm instead of the L2 norm of the standard PCA. It comes with two variants: 1) the standard computation, which coincides with PCA for normally distributed data, also referred to as the GA, 2) a trimmed variant, which is more robust to outliers, referred to as the TGA. We provide implementations for the Grassmann Average, the Trimmed Grassmann Average, and the Grassmann Median. The simplest is the Matlab implementation used in the CVPR 2014 paper, but we also provide a faster C++ implementation, which can be used either directly from C++ or through a Matlab wrapper interface. The repository contains the following:
Matlab code for robust optical flow -- Classic++ and Classic-NL -- as described in the IJCV paper "A Quantitative Analysis of Current Practices in Optical Flow Estimation and the Principles behind Them". This code is widely used as a baseline and starting point for "classical" flow methods. Matlab version of the "Black and Anandan" robust flow method: http://cs.brown.edu/~dqsun/code/ba.zip Matlab version of "Horn and Schunck": http://cs.brown.edu/~dqsun/code/hs.zip Original implementation from CVPR'2010 paper: http://cs.brown.edu/~dqsun/code/cvpr10_flow_code.zip
A fully annotated data set for human actions and human poses. It is based on the HMDB human motion dataset but includes optical flow on the person, the segmentation of the person, joint locations, action labels, and meta data.
Code for the ICCV'13 paper on "Estimating Human Pose with Flowing Puppets". This addresses the problem of upper-body human pose estimation in uncontrolled monocular video sequences, without manual initialization. The "flowing puppets" model integrates image evidence across frames to improve pose inference. We provide the code used for the experiments in the paper. We also provide the "puppet flow" annotation tool.
This website helps people understand body mass index through a novel visualization of 3D body shape. Enter height and weight to see a 3D body shape with these properties and see the corresponding BMI. Move a slider to change BMI and see how body shape changes.
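For reference, BMI itself is a simple function of height and weight: weight in kilograms divided by the square of height in meters. The snippet below only illustrates that standard formula; the function name is hypothetical and it is unrelated to how the website maps BMI to 3D body shape.

```python
def body_mass_index(weight_kg, height_m):
    """Standard BMI: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / (height_m ** 2)


# Example: a person who is 1.75 m tall and weighs 70 kg.
example_bmi = body_mass_index(70.0, 1.75)
```

Because the formula uses only height and weight, people with quite different body shapes can share the same BMI.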
The KITTI dataset is the de-facto standard for developing and testing computer vision algorithms for real-world autonomous driving scenarios and more.
The MPI Sintel Dataset is one of the most widely used datasets for training and evaluating optical flow algorithms. It is the first synthetic dataset to achieve wide use because it represents natural scenes and motions well. It is also extremely challenging and current methods have still not fully "solved" the problem of flow estimation for Sintel. Sintel is designed to encourage research on long-range motion, motion blur, multi-frame analysis, and non-rigid motion. Algorithms are evaluated on held-out test data and results are displayed for comparison. The dataset contains flow fields, motion boundaries, unmatched regions, and image sequences. The image sequences are rendered with different levels of difficulty. We also provide ground truth depth, stereo, and camera motions. Sintel is an open source animated short film produced by Ton Roosendaal and the Blender Foundation. Here we have modified the film in many ways to make it useful for optical flow evaluation.
This code supports the core representation needed for Lie Bodies as described in Freifeld and Black, ECCV 2012. Currently this is only a partial version of what is presented in the paper. The code takes pairs of triangles and computes the "Q" matrices and the corresponding (R,A,S) decompositions that are the foundation of Lie Bodies (see paper for details).
body shape scape manifold statistics deformation pose shape Lie algebra
This web-based tool lets users enter information about body measurements (height, waist, inseam, etc) and visualize a 3D body shape that corresponds to these measurements.
HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. This is the repository for the widely used HumanEva dataset. This was the first dataset to include multi-camera video capture of people with ground truth 3D human pose. It established the quantitative evaluation of human pose estimation using well-defined metrics in 2D and 3D. The dataset was developed at Brown University and is hosted by MPI.
The Middlebury flow dataset has been a de-facto standard in the field since 2007. The dataset introduced several innovations. It is the first dataset to contain real image sequences with independent motions and ground truth optical flow. It also provides realistically rendered synthetic scenes with ground truth flow, and it includes a frame interpolation task using real video sequences. While, by today's standards, the dataset is small and the sequences somewhat simple, it remains a useful tool for evaluating the generality of optical flow methods.