Funded Post-doc position at INRIA and INSA-Lyon

Deep Learning and Deep-Reinforcement Learning for Human Centered Vision and Robotics


Christian Wolf
Julien Mille
Cordelia Schmid


Lyon, France
INRIA-Chroma and INRIA-Thoth

When, duration

Begin: Early 2018
Duration: 12 months


Learning of deep hierarchical representations (“deep learning”) is established as a powerful methodology in computer vision, capable of learning complex prediction models from large amounts of data. This Post-doc position builds on previous work on deep learning for human motion understanding at the LIRIS laboratory in Lyon, France ([1-5] and others) and at INRIA THOTH, Grenoble [6-8]. The candidate will work on applications in computer vision related to understanding humans, in particular the recognition of complex activities.

Human perception focuses selectively on parts of the scene to acquire information at specific places and times. In machine learning, this kind of process is referred to as an attention mechanism, and it has drawn increasing interest for language, image and other data. Integrating attention can potentially improve overall accuracy, as the system can focus on the parts of the data that are most relevant to the task. In particular, mechanisms of visual attention play an important role in many current vision tasks [3][9-13].
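As an illustration only, the idea of focusing on the most relevant parts of the data can be sketched as a generic soft attention step. This is a minimal sketch and not the method of any of the cited works: the function name `soft_attention`, the dot-product scoring, and the toy `regions` features are all assumptions made for the example.

```python
import math

def soft_attention(query, features):
    """Return (weights, context) for one query over a list of feature vectors.

    Hypothetical helper: scores each feature vector against the query,
    normalizes the scores with a softmax, and averages the features.
    """
    # Relevance score of each region: dot product with the query (an assumption).
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Softmax turns the scores into positive weights that sum to 1.
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of the features.
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, context

# Three hypothetical image regions, each described by a 2-D feature vector.
regions = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights, context = soft_attention([1.0, 0.0], regions)
```

Here the third region aligns best with the query, so it receives the largest weight and dominates the context vector. Hard attention variants, which select a single location instead of averaging, are typically trained with reinforcement learning.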

The objective of this post-doc is to advance the state of the art in human-centered vision and robotics through visual attention mechanisms for human understanding. A particular focus will be put on “physical” attention mechanisms, where the agent is not virtual but physical (embodied computer vision). This translates to tasks where mobile robots optimize their location and navigation in order to solve complex visual tasks (see Figure 1).

In terms of methodological contributions, this research will focus on deep learning and deep reinforcement learning for agent control [14] and for vision [9,15].

Figure 1: a mobile robot jointly observes a complex visual scene, maps the environment and optimizes its position with respect to a task: recognizing activities in the scene.


[1] Natalia Neverova, Christian Wolf, Graham W. Taylor and Florian Nebout. ModDrop: adaptive multi-modal gesture recognition. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016.

[2] Natalia Neverova, Christian Wolf, Florian Nebout, Graham W. Taylor. Hand Pose Estimation through Weakly-Supervised Learning of a Rich Intermediate Representation. In Computer Vision and Image Understanding, 2017.

[3] Fabien Baradel, Christian Wolf, Julien Mille. Pose-conditioned Spatio-Temporal Attention for Human Action Recognition. arXiv:1703.10106, 2017.

[4] Christian Wolf, Eric Lombardi, Julien Mille, Oya Celiktutan, Mingyuan Jiu, Emre Dogan, Gonen Eren, Moez Baccouche, Emmanuel Dellandréa, Charles-Edmond Bichot, Christophe Garcia, Bülent Sankur. Evaluation of video activity localizations integrating quality and quantity measurements. In Computer Vision and Image Understanding (127):14-30, 2014.

[5] Moez Baccouche, Frank Mamalet, Christian Wolf, Christophe Garcia, Atilla Baskurt. Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification. In the Proceedings of the British Machine Vision Conference (BMVC), 2012.

[6] P. Tokmakov, K. Alahari, C. Schmid. Learning Motion Patterns in Videos, CVPR 2017.

[7] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, C. Schmid. Action Tubelet Detector for Spatio-Temporal Action Localization, ICCV 2017.

[8] H. Wang, A. Kläser, C. Schmid, C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition, International Journal of Computer Vision, 2013.

[9] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.

[10] Jason Kuen, Zhenhua Wang, and Gang Wang. Recurrent Attentional Networks for Saliency Detection. In CVPR, 2016.

[11] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action Recognition using Visual Attention. ICLR Workshop track, 2016.

[12] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. Pre-print: arXiv:1611.06067, 2016.

[13] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. In CVPR, 2016.

[14] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, pages 529–533, 2015.

[15] M. Gygli, M. Norouzi and A. Angelova. Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs. arXiv, March 2017.