Faces, their shape, and their motion are essential to communication. Consequently, we want a model of the face that can capture the full range of face shapes and expressions. Such a model should be realistic, easy to animate, easy to fit to data, and should support inference about human emotion and speech. We also need tools to estimate faces from images: their shape, pose, expression, gaze, and movement.
To that end, we trained a 3D face model called FLAME [ ] from 4D scans. Because it is learned from large-scale, expressive data of real people, it is more realistic than previous models. FLAME uses a linear shape space trained from 3800 scans of human heads and combines this with an articulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global expression blendshapes. The pose- and expression-dependent articulations are learned from 4D face sequences to which we accurately register a template mesh. In total, the model is trained from over 33,000 scans.
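As a rough illustration of how such a model composes (a minimal sketch with illustrative names, not the actual FLAME code; the articulated jaw, neck, eyeballs, and skinning are omitted), vertex positions can be formed by adding identity, expression, and pose-corrective blendshape offsets to a template mesh:

```python
import numpy as np

def flame_like_vertices(template, shape_basis, expr_basis, pose_basis,
                        beta, psi, theta_feat):
    """Illustrative FLAME-style vertex synthesis (articulation omitted).

    template:    (V, 3) mean head mesh
    shape_basis: (V, 3, S) identity (shape) blendshapes
    expr_basis:  (V, 3, E) global expression blendshapes
    pose_basis:  (V, 3, P) pose-dependent corrective blendshapes
    beta, psi, theta_feat: coefficient vectors of length S, E, P
    """
    v = template.copy()
    v += shape_basis @ beta        # identity variation
    v += expr_basis @ psi          # expression variation
    v += pose_basis @ theta_feat   # pose-dependent correctives
    return v

# Tiny synthetic example (random bases stand in for the learned ones)
rng = np.random.default_rng(0)
V, S, E, P = 100, 10, 5, 4
template = rng.normal(size=(V, 3))
Bs = rng.normal(size=(V, 3, S))
Be = rng.normal(size=(V, 3, E))
Bp = rng.normal(size=(V, 3, P))
verts = flame_like_vertices(template, Bs, Be, Bp,
                            rng.normal(size=S), rng.normal(size=E),
                            rng.normal(size=P))
print(verts.shape)  # (100, 3)
```

The additive structure is what makes the model easy to fit: each coefficient vector enters linearly, so gradients with respect to shape, expression, and pose correctives are straightforward.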
While FLAME is expressive, its low-dimensional linear subspace struggles to capture the non-linear deformations of extreme expressions. Neural networks are a natural choice for representing such deformations in a low-dimensional latent space, but existing convolutional neural networks do not generalize to 3D meshes in a straightforward way. To address this, we introduce a versatile encoder-decoder framework for meshes using spectral convolutions on the mesh surface [ ]. We also introduce mesh up- and down-sampling operations that enable a hierarchical mesh representation capturing non-linear variations in shape at multiple scales. Our CoMA mesh convolution algorithm is generic and now widely used.
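A Chebyshev-polynomial spectral convolution of the kind such mesh networks build on can be sketched as follows (a simplified NumPy illustration with hypothetical names, not the released implementation; the up- and down-sampling operators, which act as sparse matrix multiplies on vertex features, are omitted):

```python
import numpy as np

def normalized_laplacian(A):
    # Symmetric normalized graph Laplacian: I - D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(A.shape[0]) - (A * d_is[:, None]) * d_is[None, :]

def chebyshev_conv(X, A, W):
    """Spectral convolution on a mesh graph via truncated Chebyshev
    polynomials of the Laplacian (localized K-hop filters).

    X: (V, Fin) per-vertex features; A: (V, V) vertex adjacency;
    W: (K, Fin, Fout) filter weights, K = polynomial order.
    """
    K = W.shape[0]
    L = normalized_laplacian(A)
    # Rescale so eigenvalues lie roughly in [-1, 1] (assume lambda_max ~ 2).
    L_hat = L - np.eye(L.shape[0])
    Tx = [X, L_hat @ X]                         # T_0(L)X, T_1(L)X
    for _ in range(2, K):
        Tx.append(2 * L_hat @ Tx[-1] - Tx[-2])  # Chebyshev recurrence
    return sum(Tx[k] @ W[k] for k in range(K))

# Toy mesh graph: 4 vertices on a path, 2 input features, 3 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
W = rng.normal(size=(3, 2, 3))
out = chebyshev_conv(X, A, W)
print(out.shape)  # (4, 3)
```

Because the filter is a polynomial in the Laplacian, each output vertex depends only on its K-hop neighborhood, giving the locality that ordinary image convolutions provide on a pixel grid.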
To capture, model, and understand facial expressions, we must estimate the parameters of our face models from images and videos. Training a neural network to regress model parameters from image pixels is difficult because we lack paired training data of images and the true 3D face. To address this, we learn the mapping using only 2D image features. The key is to leverage multiple images of a person with a novel loss that encourages the estimated face shape to be similar across images of the same person and different across people. FLAME enables the network to factor out changes in expression so that it can exploit this shape constancy.
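One way to sketch such a shape-consistency loss (a simplified, triplet-style hinge on FLAME shape coefficients; the function name and margin value are illustrative, not the exact published loss):

```python
import numpy as np

def shape_consistency_loss(anchor, positives, negative, margin=0.5):
    """Encourage estimated shape codes of the same person to agree
    and to differ from those of another person.

    anchor:    (S,) shape coefficients from one image of person A
    positives: list of (S,) shape codes from other images of person A
    negative:  (S,) shape code from an image of a different person
    """
    loss = 0.0
    for p in positives:
        d_same = np.sum((anchor - p) ** 2)        # same identity: small
        d_diff = np.sum((anchor - negative) ** 2) # other identity: large
        loss += max(0.0, d_same - d_diff + margin)  # hinge with margin
    return loss / len(positives)

# Same-person codes agree and the other person's code is far: zero loss
a = np.zeros(3)
print(shape_consistency_loss(a, [np.zeros(3)], 2.0 * np.ones(3)))  # 0.0
```

Because expression and pose are handled by separate FLAME parameters, the shape code can be penalized this way without forcing the expressions in the images to match.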