In this talk, I will present two projects, MERL and SAUNA, which share a common denominator (policy gradient methods with function approximation) and a common goal: to provide reinforcement learning agents with fast training, robust learning, and high performance in complex environments.

1/ When targeting these objectives, a natural idea is to make use of prior knowledge. Instead, we consider problem knowledge signals, which we define as any relevant indicator useful for solving a task, e.g., self-performance assessment and accurate expectations. MERL is a multi-head reinforcement learning framework for generalized auxiliary tasks, aimed at structuring reinforcement learning by injecting environment-agnostic quantities into policy gradient updates. We will see that MERL provides faster convergence in single-task learning and also improves transfer learning.
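
As a rough illustration of the multi-head idea, the sketch below combines a policy loss with the losses of several auxiliary heads into a single training objective. The head names, the coefficients, and the function `multi_head_loss` are hypothetical and chosen only for illustration; the weighted sum is an assumption about how such heads might be combined, not MERL's exact formulation.

```python
def multi_head_loss(policy_loss, aux_losses, aux_coefs):
    """Combine a policy loss with weighted auxiliary-head losses.

    aux_losses and aux_coefs are dicts keyed by (hypothetical) head name.
    """
    if aux_losses.keys() != aux_coefs.keys():
        raise ValueError("every auxiliary head needs exactly one coefficient")
    # Each auxiliary head contributes its own loss, scaled by its coefficient.
    aux_term = sum(aux_coefs[name] * aux_losses[name] for name in aux_losses)
    return policy_loss + aux_term


# Illustrative values only: two auxiliary heads on top of the policy loss.
total = multi_head_loss(
    policy_loss=0.8,
    aux_losses={"self_performance": 0.5, "expectation_error": 0.2},
    aux_coefs={"self_performance": 0.5, "expectation_error": 1.0},
)
```

In a sketch like this, the auxiliary quantities do not depend on any particular environment, so the same heads could be reused when moving from one task to another.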

2/ Policy gradient algorithms in reinforcement learning optimize the policy directly and rely on efficiently sampling an environment. In this second project, we hypothesized that modifying the sampling procedure before each policy update could yield better performance. SAUNA is a modified PPO algorithm in which transitions are excluded from the gradient update if they fail a particular criterion, and kept otherwise. This criterion measures the discrepancy between a model (the value function) and actual samples. We will see that SAUNA not only provides a reliable assessment for selecting the samples that will positively impact learning, but also refines sampling and improves the policy gradient algorithm.
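
The rejection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-transition score (one minus the squared gap between the sampled return and the value prediction, normalized by the variance of the returns), the threshold, and the function name `filter_transitions` are all hypothetical and stand in for SAUNA's actual criterion.

```python
from statistics import pvariance


def filter_transitions(returns, values, threshold=0.0):
    """Return the indices of transitions kept for the gradient update.

    returns: sampled returns, one per transition.
    values: value-function predictions for the same transitions.
    threshold: assumed cut-off; transitions scoring below it are rejected.
    """
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        # No spread in the returns: the score is undefined, so keep everything.
        return list(range(len(returns)))
    kept = []
    for i, (ret, val) in enumerate(zip(returns, values)):
        # Score near 1 when the value function matches the sample closely.
        score = 1.0 - (ret - val) ** 2 / var_returns
        if score >= threshold:
            kept.append(i)
    return kept


# Illustrative batch: the third prediction is far from its sampled return.
returns = [1.0, 2.0, 3.0, 4.0]
values = [1.1, 2.0, 0.0, 3.9]
kept = filter_transitions(returns, values)
```

Only the transitions whose indices appear in `kept` would then be passed to the PPO update; the rest are dropped for that iteration.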