The emergence of multi-view capture systems has yielded a tremendous amount of video sequences. The task of capturing spatio-temporal models from real-world imagery (4D modeling) should arguably benefit from this enormous amount of visual information. To achieve highly realistic representations, both geometry and appearance need to be modeled with high precision. Yet, even with the great progress in geometric modeling, the appearance aspect has not been fully explored and visual quality can still be improved. I will explain how we can optimally exploit the redundant visual information of the captured video sequences and provide a temporally coherent, super-resolved, view-independent appearance representation. I will further discuss how to exploit the interdependency of geometry and appearance as separate modalities to enhance visual perception, and finally how to decompose appearance representations into intrinsic components (shading & albedo) and super-resolve them jointly to allow for more realistic renderings.
Organizers: Despoina Paschalidou
Considerable research has demonstrated that the representation is not equally faithful throughout the visual field; representation appears to be coarser in peripheral vision, perhaps as a strategy for dealing with an information bottleneck in visual processing. In the last few years, a convergence of evidence has suggested that in peripheral and unattended regions, the information available consists of summary statistics.
For a complex set of statistics, such a representation can provide a rich and detailed percept of many aspects of a visual scene. However, such a representation is also lossy; we would expect the inherent ambiguities and confusions to have profound implications for vision.
For example, a complex pattern, viewed peripherally, might be poorly represented by its summary statistics, leading to the degraded recognition experienced under conditions of visual crowding. Difficult visual search might occur when summary statistics could not adequately discriminate between a target-present and distractor-only patch of the stimuli. Certain illusory percepts might arise from valid interpretations of the available – lossy – information. It is precisely visual tasks upon which a statistical representation has significant impact that provide the evidence for such a representation in early vision. I will summarize recent evidence that early vision computes summary statistics based upon such tasks.
Human body movements are highly complex spatio-temporal patterns, and their control and recognition represent challenging problems for technical as well as neural systems. The talk will present an overview of recent work by our group, exploiting biologically inspired, learning-based representations for the recognition and synthesis of body motion.
The first part of the talk will present a neural theory for the visual processing of goal-directed actions, which reproduces, and in part correctly predicts, electrophysiological results from action-selective cortical neurons in monkey cortex. In particular, we show that the same neural circuits might account for the recognition of natural and abstract action stimuli.
In the second part of the talk, different techniques for the learning of structured, online-capable synthesis models for complex body movements are discussed. One approach is based on the learning of kinematic primitives, exploiting anechoic demixing, and the generation of such primitives by networks of canonical dynamical systems.
An approach for designing stable overall system dynamics for such nonlinear networks is discussed. The second approach is the learning of hierarchical models for interactive movements, combining Gaussian Process Latent Variable Models and Gaussian Process Dynamical Models, and resulting in animations that pass the Turing test of computer graphics. The presented work was funded by the DFG and the EC FP7 projects SEARISE, TANGO and AMARSI.
Variations in lighting can have a significant effect on the appearance of an object. Modeling these variations is important for object recognition and shape reconstruction, particularly of smooth, textureless objects. The past decade has seen significant progress in handling Lambertian objects. In that context, I will present our work on using harmonic representations to represent the reflectance of Lambertian objects under complex lighting configurations, and their application to photometric stereo and prior-assisted shape from shading. In addition, I will present preliminary results in handling specular objects and methods for dealing with moving objects.
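To make the harmonic representation concrete, here is a minimal sketch assuming the commonly used second-order approximation, in which the intensity of a Lambertian point is its albedo times a linear combination of nine real spherical harmonics evaluated at the surface normal. The function names, the ordering of the basis, and the convention that the lighting coefficients already absorb the Lambertian kernel are assumptions made for illustration, not details taken from the talk.

```python
import math


def harmonic_basis(normal):
    """Evaluate the nine real spherical harmonics (orders 0 to 2) at a normal.

    The ordering of the returned list is an illustrative choice.
    """
    x, y, z = normal
    length = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / length, y / length, z / length  # work with a unit normal

    c0 = 0.5 * math.sqrt(1.0 / math.pi)      # order 0
    c1 = math.sqrt(3.0 / (4.0 * math.pi))    # order 1
    c2 = 0.5 * math.sqrt(15.0 / math.pi)     # order 2, mixed terms
    c3 = 0.25 * math.sqrt(5.0 / math.pi)     # order 2, zonal term
    c4 = 0.25 * math.sqrt(15.0 / math.pi)    # order 2, x^2 - y^2 term

    return [
        c0,
        c1 * y, c1 * z, c1 * x,
        c2 * x * y, c2 * y * z,
        c3 * (3.0 * z * z - 1.0),
        c2 * x * z,
        c4 * (x * x - y * y),
    ]


def lambertian_intensity(albedo, normal, lighting):
    """Approximate intensity as albedo * <lighting, basis(normal)>.

    `lighting` holds nine coefficients that are assumed to already include
    the Lambertian kernel weights (hypothetical convention).
    """
    basis = harmonic_basis(normal)
    return albedo * sum(l * b for l, b in zip(lighting, basis))
```

Because the model is linear in the lighting coefficients for known normals, and low-dimensional in the normals for known lighting, it lends itself to the kind of least-squares formulations used in photometric stereo.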
Dimensionality reduction applied to neural ensemble data has led to the concept of a 'neural trajectory', a low-dimensional representation of how the state of the network evolves over time. Here we present a novel neural trajectory extraction algorithm which combines spike train distance metrics (Victor and Purpura, 1996) with dimensionality reduction based on local neighborhood statistics (van der Maaten and Hinton, 2008). We apply this technique to describe and quantify the activity of primate ventral premotor cortex neuronal ensembles in the context of a cued reaching and grasping task with instructed delay.
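As an illustration of the first ingredient, the following is a minimal sketch of the Victor-Purpura spike-time distance, computed by dynamic programming: deleting or inserting a spike costs 1, and shifting a spike by dt costs q * |dt|. The function and parameter names are hypothetical, and how the talk's algorithm combines such distances across neurons and time windows is not specified here.

```python
def victor_purpura_distance(train_a, train_b, cost_per_second):
    """Spike-time distance between two sorted lists of spike times (seconds).

    cost_per_second is the shift cost q; a large q makes timing matter more.
    """
    n, m = len(train_a), len(train_b)
    # dist[i][j]: cheapest edit of the first i spikes of a into the first j of b
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)  # delete i spikes
    for j in range(1, m + 1):
        dist[0][j] = float(j)  # insert j spikes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            shift = cost_per_second * abs(train_a[i - 1] - train_b[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,        # delete a spike from a
                dist[i][j - 1] + 1.0,        # insert a spike from b
                dist[i - 1][j - 1] + shift,  # move a spike in time
            )
    return dist[n][m]


def pairwise_distances(trains, cost_per_second):
    """Symmetric distance matrix over a list of spike trains."""
    return [
        [victor_purpura_distance(a, b, cost_per_second) for b in trains]
        for a in trains
    ]
```

A matrix of this kind is the sort of input a neighborhood-based embedding method can work from, since such methods need only pairwise dissimilarities rather than explicit feature vectors.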
Humans interact with their environment in a highly flexible manner. One important component for the successful control of such flexible interactions is an internal body model. To maintain a consistent internal body model, the brain appears to continuously and probabilistically integrate multiple sources of information, including various sensory modalities but also anticipatory, re-afferent information about current body motion. A modular, multimodal arm model (MMM) is presented.
The model represents a seven-degree-of-freedom arm in various interactive modality frames. The modality frames distinguish between proprioceptive, limb-relative orientation, head-relative orientation, and head-relative location frames. Each arm limb is represented separately but highly interactively in each of these modality frames. Incoming sensory and motor feedback information is continuously exchanged in a rigorous, probabilistic fashion, while a consistent overall arm model is maintained due to the local interactions.
The model is able to automatically identify sensory failures and sensory noise. Moreover, it is able to mimic the rubber hand illusion phenomenon. Currently, we are endowing the model with neural representations for each modality frame so that it can play out its full potential for planning and goal-directed control.
The amount of digital video content available on sites such as YouTube is growing daily. Recent statistics on the YouTube website show that around 48 hours of video are uploaded every minute. This massive data production calls for automatic analysis.
In this talk we present some recent results for action recognition in videos. Bag-of-features representations have shown very good performance for action recognition in videos. We briefly review the underlying principles and introduce trajectory-based video features, which have been shown to outperform the state of the art. These trajectory features are obtained by dense point sampling and tracking based on displacement information from a dense optical flow field. Trajectory descriptors are obtained with motion boundary histograms, which are robust to camera motion. We then show how to integrate temporal structure into a bag-of-features representation based on an actom sequence model. Actom sequence models localize actions based on sequences of atomic actions, i.e., they represent the temporal structure by sequences of histograms of actom-anchored visual features. This representation is flexible, sparse and discriminative. The resulting actom sequence model is shown to significantly improve performance over existing methods for temporal action localization.
Finally, we show how to move towards more structured representations by explicitly modeling human-object interactions. We learn how to represent human actions as interactions between persons and objects. We localize in space and track over time both the object and the person, and represent an action as the trajectory of the object with respect to the person's position, i.e., our human-object interaction features capture the relative trajectory of the object with respect to the human. This is joint work with A. Gaidon, V. Ferrari, Z. Harchaoui, A. Klaeser, A. Prest, H. Wang.
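The bag-of-features principle mentioned above can be sketched in a few lines: each local descriptor is assigned to its nearest codeword, and the video (or a temporal segment of it) is summarized by the normalized histogram of assignments. This is a generic illustration with hypothetical names; the codebook is assumed to have been learned beforehand, and the descriptors here are plain lists of numbers rather than actual trajectory or motion boundary features.

```python
def bag_of_features(descriptors, codebook):
    """Normalized histogram of nearest-codeword assignments.

    descriptors: list of feature vectors extracted from one video segment.
    codebook:    list of codeword vectors (assumed precomputed, e.g. by clustering).
    """
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        # Index of the codeword with the smallest squared Euclidean distance.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum(
                (x - c) ** 2 for x, c in zip(descriptor, codebook[k])
            ),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]
```

Computing one such histogram per temporal part, rather than one per video, is the general idea behind keeping temporal structure in a sequence of histograms.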
The supervision of public spaces aims at multiple objectives, such as early acquisition of targets, their identification and pursuit throughout the supervised area. To achieve these, typical sensors such as pan-tilt-zoom cameras need to either focus on individuals, or provide a broad field of view, which are conflicting control settings. We address this problem in an information-theoretic manner: by phrasing each of the objectives in terms of mutual information, they become comparable. The problem turns into maximisation of information, which is predicted for the next time step and phrased as a decision process.
Our approach results in decisions that on average satisfy objectives in desired proportions. At the end of the talk I will address an application of information maximisation to aid in the interactive calibration of cameras.
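The idea of making objectives comparable through mutual information can be illustrated with a small sketch. It computes the mutual information (in bits) of a discrete joint distribution and picks the candidate setting whose predicted joint distribution is most informative. The setting names and the representation of predictions as joint probability tables are assumptions for illustration only, not the formulation used in the talk.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint distribution given as a 2D table.

    joint[i][j] is P(state = i, observation = j); entries must sum to 1.
    """
    p_state = [sum(row) for row in joint]
    p_obs = [sum(column) for column in zip(*joint)]
    information = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                information += p * math.log2(p / (p_state[i] * p_obs[j]))
    return information


def most_informative_setting(candidates):
    """Return the key of the candidate with the highest mutual information.

    candidates maps a (hypothetical) camera setting name to the joint
    distribution predicted for the next time step under that setting.
    """
    return max(candidates, key=lambda name: mutual_information(candidates[name]))
```

For example, a setting under which the observation is independent of the state carries 0 bits, while one under which the observation determines a binary state carries 1 bit, so the second would be chosen.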
Recovering the depth of a scene is important for bridging the gap between the real and the virtual world, but also for tasks such as segmenting objects in cluttered scenes. Very cheap single-view depth imaging cameras, i.e., Time-of-Flight (ToF) cameras or Microsoft's Kinect system, are entering the mass consumer market. In general, the acquired images have a low spatial resolution and suffer from noise as well as technology-specific artifacts. In this talk I will present algorithmic solutions to the entire depth imaging pipeline, ranging from preprocessing to depth image analysis. For enhancing image intensity and depth maps, a higher-order total variation based approach has been developed which exhibits superior results compared to current state-of-the-art approaches. This performance has been achieved by allowing jumps across object boundaries, computed both from the image gradients and the depth maps. Within objects, staircasing effects as observed in standard total variation approaches are circumvented by higher-order regularization. The 2.5D motion, or range flow, of the observed scenes is computed by a combined global-local approach.
On Kinect data in particular, the best results were achieved by discarding information at object edges, which is prone to errors due to the data acquisition process. In conjunction with a calibration procedure, this leads to very accurate and robust motion estimation. On these computed range flow data, we have developed the estimation of robust, scale- and rotation-invariant features. These make it feasible to use our algorithms for a novel approach to gesture recognition for man-machine interaction. This step is currently work in progress, and I will present very promising first results.
For evaluating the results of our algorithms, we plan to use realistic simulations and renderings. We have made significant advances in analyzing the feasibility of these synthetic test images and data. The bidirectional reflectance distribution functions (BRDFs) of several objects have been measured using a purpose-built “light-dome” setup. This, together with the development of an accurate stereo-acquisition system for measuring 3D objects, lays the groundwork for performing realistic renderings. Additionally, we have started to create a test-image database with ground truth for depth, segmentation and light-field data.
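A small 1D example may help show why higher-order regularization counters staircasing. The sketch below, with hypothetical names, measures the total variation of a signal using first- or second-order finite differences. It is only a toy comparison of the two penalties, not the regularizer or the solver developed in this work.

```python
def total_variation(signal, order=1):
    """Sum of absolute finite differences of the given order."""
    differences = list(signal)
    for _ in range(order):
        differences = [b - a for a, b in zip(differences, differences[1:])]
    return sum(abs(d) for d in differences)


ramp = [0, 1, 2, 3, 4]        # smooth slope, e.g. a slanted surface
staircase = [0, 0, 2, 2, 4]   # same end points, piecewise constant
step = [0, 0, 0, 4, 4]        # a single jump, e.g. an object boundary

# First-order TV cannot tell these apart: each has total variation 4,
# so it gives no reason to prefer the ramp over the staircase.
first_order = [total_variation(s, 1) for s in (ramp, staircase, step)]

# Second-order TV is 0 for the ramp, 6 for the staircase and 8 for the step.
second_order = [total_variation(s, 2) for s in (ramp, staircase, step)]
```

The second-order penalty favors the ramp within an object, but it also penalizes the genuine step, which is consistent with the need to explicitly allow jumps across object boundaries.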
3D scanning of moving objects has many applications, for example, marker-less motion capture, analysis of fluid dynamics, object explosions, and so on. One approach to acquiring accurate shape is a projector-camera system; in particular, methods that reconstruct a shape from a single image with a static pattern are suitable for capturing fast-moving objects. In this research, we propose a method that uses a grid pattern consisting of sets of parallel lines. The pattern is spatially encoded by a periodic color pattern. While the information in the camera image is sparse, the proposed method extracts dense (pixel-wise) phase information from the sparse pattern.
As a result, continuous regions in the camera images can be extracted by analyzing the phase. Since one DOF remains for each region, we propose a linear solution that eliminates this DOF using geometric information about the devices, i.e., the epipolar constraint. In addition, because the projected pattern consists of parallel lines with equal intervals, the solution space is finite and the linear equation can be solved efficiently by an integer least-squares method.
In the experiments, we built a scanning system that can capture an object in fast motion using a high-speed camera, and we show sequences of dense shapes of an exploding balloon and other objects at more than 1000 fps.