Sequentially learning to place items in multi-position displays or lists is a task that can be cast into the multiple-play semi-bandit setting. However, a major difficulty in this context is that the system cannot always decide whether the user feedback for each item is actually exploitable. Indeed, much of the content may simply have been ignored by the user: while a click can be directly interpreted as positive feedback, the absence of a click remains ambiguous about the user's real taste. The present work proposes to exploit available information regarding the display position bias under the so-called Position-based click model (PBM).
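To make the censoring effect concrete, here is a minimal sketch of how feedback is generated under the PBM: each position has an examination probability, and a click requires both examination of the position and attraction to the item. The parameter values below are purely illustrative assumptions, not taken from the paper.

```python
import random

def simulate_pbm_clicks(display, theta, kappa, rng=None):
    """Simulate user feedback under the Position-based click model (PBM).

    The item shown in position l is examined with probability kappa[l] and,
    if examined, clicked with probability theta[item]. An absent click is
    therefore ambiguous: the item may be unattractive, or simply unseen.
    """
    rng = rng or random.Random(0)
    clicks = []
    for position, item in enumerate(display):
        examined = rng.random() < kappa[position]
        clicked = examined and rng.random() < theta[item]
        clicks.append(int(clicked))
    return clicks

# Hypothetical parameters for illustration only:
theta = [0.9, 0.6, 0.4, 0.2, 0.1]   # per-item attraction probabilities
kappa = [1.0, 0.6, 0.3]             # decreasing examination bias per position
feedback = simulate_pbm_clicks([0, 1, 2], theta, kappa)
```

Note that a zero in `feedback` conflates "examined but not clicked" with "never examined", which is precisely the censoring discussed above.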
In the present talk, we give insight into how the complexity of a bandit problem is affected by the censoring effect of the observations. These remarks are based on lower bounds on the regret in both the censored and uncensored situations, whose proofs will be detailed. We then recall the multiple-plays setting with semi-bandit feedback and show how to cast the PBM into a bandit problem. We finally present algorithms relying on the ‘optimism-in-the-face-of-uncertainty’ principle that provably reach optimal performance.
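The optimism principle mentioned above can be sketched with a simplified, generic UCB-style index policy, assuming the examination probabilities are known so that observations can be reweighted by position. This is a hypothetical illustration of the principle only, not the exact indices or analysis of the paper.

```python
import math

def pbm_ucb_round(counts, rewards, L, t):
    """Select L items to display at round t, optimistically.

    counts[k]  : kappa-weighted number of effective observations of item k
    rewards[k] : total clicks collected on item k
    Uses a generic Hoeffding-style bonus (an assumption for illustration,
    not the specific indices analysed in the paper).
    """
    K = len(counts)
    indices = []
    for k in range(K):
        if counts[k] == 0:
            indices.append(float('inf'))  # force initial exploration
        else:
            mean = rewards[k] / counts[k]
            bonus = math.sqrt(1.5 * math.log(t) / counts[k])
            indices.append(mean + bonus)
    # Display the L items with the largest optimistic indices,
    # the most promising one in the most-examined position.
    ranked = sorted(range(K), key=lambda k: indices[k], reverse=True)
    return ranked[:L]

def pbm_update(counts, rewards, display, clicks, kappa):
    """Update statistics after one round, discounting each observation
    by the examination probability of the position it was shown in."""
    for position, item in enumerate(display):
        counts[item] += kappa[position]
        rewards[item] += clicks[position]
```

Reweighting the observation counts by `kappa` is how the position bias enters the estimation: a non-click in a rarely examined position counts for much less evidence than a non-click at the top of the list.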
This presentation is based on "Multiple-Plays Bandits in the Position-Based Model", P. Lagrée, C. Vernade and O. Cappé, to appear in NIPS 2016.