
Beyond Optimism: Information-Directed Exploration in Bandits


Contextual bandit algorithms are popular methods for trading off exploration and exploitation, typically used in applications
where we seek the best-performing parameter from a small number of experimental trials. The standard assumption is that the learner exactly
observes the context that is chosen by the environment. In many applications, however, the context is itself a noisy measurement or the
output of some forecasting mechanism, and therefore not known exactly. To address this, we introduce a novel stochastic contextual bandit model
where at each step, the adversary chooses a distribution over a context set. The learner observes only the context distribution, while the exact
context realization remains hidden. We show that this setting reduces to an instance of the heteroscedastic bandit problem in which the
observation noise depends on the chosen action. While it is possible to apply the UCB algorithm with a homoscedastic worst-case noise bound, I
will argue that strategies based on the "optimism principle" fail to explore efficiently when the observation noise is heteroscedastic. As an
alternative design principle, I will introduce Frequentist Information Directed Sampling (IDS), a framework that links algorithm design and
regret analysis, and allows for a more fine-grained information-regret trade-off. Finally, I will briefly touch on a collaboration with the
Paul Scherrer Institute, where we use bandit methods to optimize the beam intensity on the Swiss Free Electron Laser.
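The information-regret trade-off behind IDS can be illustrated on a toy heteroscedastic Gaussian bandit. This is only a minimal sketch, not the estimators or the frequentist IDS algorithm from the talk: the arm means, noise levels, confidence-width formula, and information-gain surrogate below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed Gaussian bandit with action-dependent (heteroscedastic) noise.
# Means and noise levels are illustrative assumptions, not values from the talk.
true_means = np.array([1.0, 0.9, 0.5])
noise_sd = np.array([0.1, 2.0, 0.1])  # arm 1 is nearly optimal but very noisy

counts = np.ones(3)  # one initial pull per arm
sums = true_means + noise_sd * rng.standard_normal(3)

for t in range(1000):
    means = sums / counts
    # Per-arm confidence width: noisier, less-sampled arms are more uncertain.
    width = noise_sd * np.sqrt(2.0 * np.log(t + 2) / counts)
    # Pessimistic regret estimate (gap) against the best upper confidence bound.
    gap = np.maximum((means + width).max() - (means - width), 1e-12)
    # Surrogate information gain: precision added to the arm's mean estimate
    # by one more pull, which shrinks with the arm's noise level.
    info = np.log1p(1.0 / (noise_sd**2 * counts))
    # IDS rule: minimize squared estimated regret per unit of information.
    a = int(np.argmin(gap**2 / info))
    counts[a] += 1
    sums[a] += true_means[a] + noise_sd[a] * rng.standard_normal()
```

This deterministic variant is a simplification; the frequentist IDS framework discussed in the talk optimizes the information-regret ratio over randomized action distributions and comes with a matching regret analysis.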

Friday, November 29, 2019 - 11:00
Inria, room A00
Johannes Kirschner
ETH Zürich