Language Grounding Learning with Multimodal Learning and Interaction

In this talk, I will present my recent work on visually grounded language learning. As a first step, I will introduce GuessWhat?!, a two-player guessing game designed as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. As this game requires complex image and language understanding, it lets us explore the following questions:
- Can reinforcement learning methods be used to better understand and generate language? To address this, we will see how self-play can be used on GuessWhat?! and compare supervised and RL results.
- Do recent VQA neural architectures work with simple but highly grounded questions? After a quick discussion, I will introduce the FiLM layer as a new tool that improves multimodal learning.
Finally, I will quickly go through some future work and perspectives building upon these concepts.

Friday, October 12, 2018 - 14:00
Inria, room A00
Florian Strub
Inria SequeL