Deep learning has brought rapid progress to computer vision in recent years. However, training deep models in a supervised fashion requires large datasets with annotated ground truth. Human annotation is reasonably efficient for tasks such as sparse 2D joint estimation, but for tasks such as dense optical flow estimation or 3D pose estimation it is intractable.
Meanwhile, progress in computer graphics has made the rendering of synthetic scenes and people progressively more efficient and realistic, bringing them ever closer to real scenes and people. Synthetic rendering is appealing for creating training datasets, as it is easily scalable and automatically generates ground truth for a wide variety of modalities.
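The key appeal of rendering is that ground truth falls out of the rendering process itself. The following toy sketch (not the actual pipeline described here; the scene, shapes, and labels are invented for illustration) shows a crude rasterizer that, in the same pass that draws a synthetic figure, also writes a pixel-perfect part-segmentation mask, with no human annotation involved:

```python
# Toy illustration: rendering a synthetic scene yields ground truth
# "for free".  A crude rasterizer draws two discs (stand-ins for body
# parts) into an image and simultaneously writes a segmentation mask.
import numpy as np

def render_scene(h=64, w=64):
    """Render a toy scene; return (image, part_segmentation_mask)."""
    yy, xx = np.mgrid[0:h, 0:w]
    image = np.zeros((h, w), dtype=np.float32)   # grayscale render
    mask = np.zeros((h, w), dtype=np.uint8)      # 0 = background
    # Each hypothetical "part": (center_y, center_x, radius, intensity, label).
    parts = [(24, 32, 10, 0.8, 1),   # e.g. torso
             (44, 32, 6, 0.6, 2)]    # e.g. head
    for cy, cx, r, intensity, label in parts:
        inside = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
        image[inside] = intensity
        mask[inside] = label  # the label comes from the renderer itself
    return image, mask

image, mask = render_scene()
labels, counts = np.unique(mask, return_counts=True)
```

Every rendered pixel carries an exact label, and the same principle extends to depth maps, optical flow, and 3D pose in a full renderer.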
We focus on learning from synthetic data while using as many real elements as possible, such as motion, body shapes, body textures, and backgrounds. We create the SURREAL dataset (Synthetic hUmans foR REAL tasks) and learn a deep model for human depth estimation and body part segmentation [ ]. We further create the Sintel and Human-Flow [ ] datasets for learning optical flow in a general setting and specifically for human bodies, respectively.
Our current work focuses on extending synthetic rendering and inference to multiple people in a single image, for tasks such as optical flow and 2D and 3D pose estimation. We further focus on rendering and reconstructing hand-object interactions with realistic hand shapes and poses, object shapes and textures, as well as realistic hand-object grasps. We then plan to extend synthetic data generation to more complex and realistic scenes to reduce the domain gap between real and synthetic data.