The goal of this project is to perform on-board, online, real-time pose estimation and offline shape estimation of humans and animals in outdoor scenarios, using a team of micro aerial vehicles (MAVs) equipped only with on-board, monocular RGB cameras. Realizing such an outdoor motion capture system requires addressing both control-related and perception-related research challenges. In this project we address only the perception problem; the control-related challenges, with the perception problem in the loop, are tackled in a separate, ongoing project.
The perception functionality of AirCap is split into two phases, namely, i) an online data acquisition phase, and ii) an offline pose and shape estimation phase.
During the online data acquisition phase, the MAVs detect and track the 3D position of a subject while following them. To this end, they perform online, on-board person detection using a deep neural network (DNN)-based detector. DNNs often fail to detect objects that appear small in the image or are far from the camera, both of which are typical of scenarios with aerial robots. In our solution [ ], mutual world knowledge about the tracked person is jointly acquired by our multi-MAV system during cooperative person tracking. Leveraging this, our method actively selects, in the image from each MAV, the region of interest (ROI) that supplies the highest information content. This not only reduces the information loss incurred by down-sampling the high-resolution images, but also increases the chance that the tracked person is completely within the field of view (FOV) of all MAVs. The data acquired in this phase consists of the images captured by all MAVs (see, for example, the left image above) and their camera extrinsic and intrinsic parameters.
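The core of the ROI selection can be sketched as follows. This is a minimal, hypothetical illustration rather than our actual implementation: it projects the cooperatively estimated 3D person position into one MAV's camera using the known intrinsics and extrinsics, then crops a fixed-size ROI around the projection instead of down-sampling the full high-resolution frame. All names (`select_roi`, the image and ROI sizes) are assumptions made for this example.

```python
import numpy as np

def select_roi(person_pos_world, K, R, t,
               roi_size=(512, 512), img_size=(4096, 2160)):
    """Pick a fixed-size crop of this MAV's image around the projection of
    the jointly tracked 3D person position (hypothetical sketch).

    person_pos_world : (3,) person position in the world frame
    K                : (3, 3) camera intrinsic matrix
    R, t             : world-to-camera rotation (3, 3) and translation (3,)
    """
    # Transform the tracked position into the camera frame and project it
    # with the pinhole model: [u, v, 1]^T ~ K (R x + t).
    p_cam = R @ person_pos_world + t
    u, v = (K @ (p_cam / p_cam[2]))[:2]

    # Centre the ROI on the projection, clamped to the image borders.
    w, h = roi_size
    W, H = img_size
    x0 = int(np.clip(u - w / 2, 0, W - w))
    y0 = int(np.clip(v - h / 2, 0, H - h))
    return x0, y0, w, h  # detector runs on image[y0:y0+h, x0:x0+w]
```

Because the detector then operates on the crop at native resolution, a person who is small or distant in the full frame is not lost to down-sampling.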
In the second, offline phase, human pose and shape are estimated as functions of time using only the acquired RGB images and the MAVs' self-localization poses (the camera extrinsics). State-of-the-art methods such as VNect or HMR yield only noisy 3D estimates, and these estimates from separate cameras cannot easily be fused due to scale and perspective ambiguities. In our recently submitted work we instead leverage such methods as noisy sensors for 2D joint positions and show how these can be efficiently fused to obtain a consistent 3D pose and shape estimate.
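The fusion idea can be illustrated with plain multi-view triangulation. The sketch below is not the method of our submitted work, only a simplified stand-in: each camera's noisy 2D detection of a joint, together with that camera's known projection matrix, contributes two linear constraints on the joint's 3D position, which is then recovered by linear least squares (the direct linear transform). The function name and variables are hypothetical.

```python
import numpy as np

def triangulate_joint(joints_2d, proj_mats):
    """Fuse noisy 2D detections of one joint from several calibrated cameras
    into a 3D point via linear (DLT) triangulation (hypothetical sketch).

    joints_2d : list of (u, v) pixel detections, one per camera
    proj_mats : list of 3x4 projection matrices P = K [R | t], one per camera
    """
    rows = []
    for (u, v), P in zip(joints_2d, proj_mats):
        # Each view gives two linear constraints on the homogeneous point X:
        #   u (P[2] . X) = P[0] . X   and   v (P[2] . X) = P[1] . X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)

    # Least-squares solution: the right singular vector associated with the
    # smallest singular value of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize to a 3D point
```

The SVD solution minimizes an algebraic reprojection error; a common extension is to weight each pair of rows by the detector's confidence, so that less reliable 2D measurements contribute less to the fused estimate.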