Choisir la langue :

Exploration–Exploitation in MDPs with Options

The option framework (Sutton et al., 1999) is a simple yet powerful model to introduce temporally-extended actions and hierarchical structures in reinforcement learning (RL) (Sutton and Barto, 1998). An important feature of this framework is that Markov decision process (MDP) planning and learning algorithms can be easily extended to accommodate options, thus obtaining algorithms such as option value iteration and Q-learning (Sutton et al., 1999), LSTD (Sorg and Singh, 2010), and actor-critic (Bacon and Precup, 2015). While options may significantly improve the performance w.r.t. learning with primitive actions, a theoretical understanding of their actual impact on the learning performance is still fairly limited. Notable exceptions are the sample complexity analysis of approximate value iteration with options (Mann and Mannor, 2014) and the PAC-MDP analysis by Brunskill and Li (2014). During this seminar, I will present the first regret analysis of learning with options. Relying on the fact that using options in an MDP induces a semi-Markov decision process (SMDP), I will first introduce a variant of the UCRL algorithm (Jaksch et al., 2010) for SMDPs and I will give upper and lower bounds on its regret. While this result is of independent interest for learning in SMDPs, its most interesting aspect is that it can be translated into a regret bound for learning with options in MDPs and it provides a first understanding on the sufficient conditions for a set of options to reduce the regret w.r.t. learning with primitive actions. I will also describe simple scenarios in which the regret of learning with options can be provably much smaller than the regret suffered when learning with primitive actions. Finally, I will present experimental results supporting the theoretical findings.


Friday, November 4, 2016 - 11:00
Inria, room A00
Ronan Fruit
Inria SequeL