Purposeful and robust manipulation requires good hand-eye coordination. To a certain extent, this can be achieved using information from joint encoders and known kinematics. However, for many robots a significant error of several centimeters remains in the pose of the end-effector and fingers. This poses a challenge, especially for fine manipulation tasks.
To achieve the desired accuracy, we aim to visually track the arm in the camera frame, which is the same frame in which we usually detect the target object. Given these estimates, we can then control manipulation tasks with techniques such as visual servoing.
In this project, we propose to frame the problem of marker-less robot arm pose estimation as a learning problem. The only input to the method is the depth image from an RGB-D sensor. The output is the joint configuration of the robot arm. We learn the mapping from a large number of synthetically generated and labeled depth images.
In [ ], we treat this problem as a pixel-wise classification problem using a random decision forest. For each leaf node, a set of offsets that vote for relative joint positions is learned from all training samples that end up at that leaf. Pooling these votes over all foreground pixels and then clustering them gives us an estimate of the true joint positions. Due to the intrinsic parallelism of pixel-wise classification, this approach can run at more than 30 Hz. The approach is a frame-by-frame method and, unlike ICP-style or tracking methods, for example, does not require any initialization.
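The vote pooling and clustering step can be illustrated with a minimal sketch. It is not the implementation from the cited work: the forest itself is left out, the per-pixel offsets are synthetic stand-ins for what the leaves would return, and Gaussian-kernel mean shift is assumed as the clustering method, since the text does not name one. The function names `pool_votes` and `mean_shift_mode`, the bandwidth, and all numbers are hypothetical.

```python
import numpy as np


def pool_votes(pixel_points, offsets):
    """Each foreground pixel casts one vote: its 3D point plus a leaf offset."""
    return pixel_points + offsets


def mean_shift_mode(votes, bandwidth=0.05, iters=30, tol=1e-6):
    """Find a density mode of the votes with Gaussian-kernel mean shift.

    Mean shift is an assumed choice; the source only says "clustering".
    """
    estimate = np.median(votes, axis=0)  # robust starting point
    for _ in range(iters):
        sq_dist = np.sum((votes - estimate) ** 2, axis=1)
        weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
        new_estimate = weights @ votes / weights.sum()
        if np.linalg.norm(new_estimate - estimate) < tol:
            estimate = new_estimate
            break
        estimate = new_estimate
    return estimate


# Synthetic stand-in data (assumption): 500 foreground pixels near one joint.
rng = np.random.default_rng(0)
true_joint = np.array([0.3, -0.1, 0.8])  # meters, camera frame
pixel_points = true_joint + rng.normal(0.0, 0.1, size=(500, 3))

# Offsets a trained forest might return: mostly accurate, with small noise.
offsets = true_joint - pixel_points + rng.normal(0.0, 0.01, size=(500, 3))
# Pixels that landed in uninformative leaves cast poor votes.
offsets[:50] += rng.uniform(-0.5, 0.5, size=(50, 3))

votes = pool_votes(pixel_points, offsets)
joint_estimate = mean_shift_mode(votes)
print("estimated joint position:", joint_estimate)
```

Seeking a mode rather than averaging all votes keeps the outlier votes from pulling the estimate away from the dense cluster, which is one plausible reason to cluster the pooled votes at all.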