We present a new algorithm for a general bandit problem of Active Exploration in the fixed confidence setting. This general problem encompasses several Active Exploration problems, such as Best Arm Identification and Thresholding Bandits.
An optimal proportion of draws typically appears in the instance-dependent asymptotic lower bound for such problems, and matching this lower bound requires sampling asymptotically according to that optimal proportion. A natural strategy is therefore to sample according to the optimal proportion of draws associated with the current estimate of the true parameter, together with some extra exploration. This strategy has a major drawback: computing the optimal proportion of draws involves solving an often complicated concave optimization problem. Since this problem must be solved at each step, the resulting strategy can be computationally inefficient.
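To make this concrete, here is a minimal sketch, not the paper's algorithm, of such a "plan at every step" strategy for Best Arm Identification with unit-variance Gaussian arms. The characteristic rate function, the `oracle_weights` helper, and the square-root forced-exploration rule are illustrative assumptions; the key point is that `oracle_weights` runs a full concave solve at every round.

```python
import numpy as np
from scipy.optimize import minimize

def characteristic_value(w, mu):
    """Rate F(w, mu) for Gaussian Best Arm Identification: the smallest
    transportation cost, under proportions w, of confusing the best arm
    with a suboptimal one (illustrative form, unit variance)."""
    best = int(np.argmax(mu))
    vals = []
    for a in range(len(mu)):
        if a == best:
            continue
        s = w[best] + w[a]
        vals.append(0.0 if s == 0 else
                    w[best] * w[a] / s * (mu[best] - mu[a]) ** 2 / 2)
    return min(vals)

def oracle_weights(mu):
    """Solve the concave program max_w F(w, mu) over the simplex.
    This full numerical solve is what makes the naive strategy costly."""
    k = len(mu)
    res = minimize(lambda w: -characteristic_value(w, mu),
                   x0=np.full(k, 1.0 / k),
                   bounds=[(1e-9, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x / res.x.sum()

def naive_tracking(mu_true, horizon=500, seed=0):
    """Pull the arm whose empirical count lags its oracle proportion,
    with forced exploration of arms sampled fewer than sqrt(t) times."""
    rng = np.random.default_rng(seed)
    k = len(mu_true)
    counts, sums = np.zeros(k), np.zeros(k)
    for a in range(k):                       # pull each arm once to start
        sums[a] += rng.normal(mu_true[a]); counts[a] += 1
    for t in range(k, horizon):
        mu_hat = sums / counts
        under = np.flatnonzero(counts < np.sqrt(t))
        if under.size:                       # forced exploration
            a = int(under[0])
        else:                                # re-solve the plan EVERY step
            w = oracle_weights(mu_hat)
            a = int(np.argmax(t * w - counts))
        sums[a] += rng.normal(mu_true[a]); counts[a] += 1
    return counts / horizon, int(np.argmax(sums / counts))

props, guess = naive_tracking(np.array([0.5, 0.4, 0.3]))
```

The inner call to `oracle_weights` is the bottleneck the abstract refers to: every round pays the full price of a constrained concave optimization.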
We propose instead to solve the optimization problem online via gradient ascent, thereby merging the Active Exploration problem with the computation of the optimal proportion of draws. We show that this leads to an asymptotically optimal and, most importantly, computationally efficient algorithm.
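The idea of replacing the per-step solve with a single gradient step can be sketched as follows, again for unit-variance Gaussian Best Arm Identification. This is a hedged illustration of the general principle, not the paper's precise algorithm: the supergradient formula, the step size `lr / sqrt(t)`, and the forced-exploration rule are all illustrative choices.

```python
import numpy as np

def grad_characteristic(w, mu):
    """Supergradient of F(w, mu) = min over suboptimal a of
    f_a(w) = w_b * w_a / (w_b + w_a) * (mu_b - mu_a)^2 / 2, b the best
    arm: the gradient of the piece achieving the minimum."""
    best = int(np.argmax(mu))
    k = len(mu)
    best_val, g = np.inf, np.zeros(k)
    for a in range(k):
        if a == best:
            continue
        gap2 = (mu[best] - mu[a]) ** 2 / 2
        s = w[best] + w[a] + 1e-12           # guard against zero weights
        val = w[best] * w[a] / s * gap2
        if val < best_val:
            best_val = val
            g = np.zeros(k)
            g[best] = (w[a] / s) ** 2 * gap2  # d f_a / d w_best
            g[a] = (w[best] / s) ** 2 * gap2  # d f_a / d w_a
    return g

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.flatnonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def gradient_tracking(mu_true, horizon=2000, lr=0.5, seed=0):
    """One projected-gradient step on w per round instead of a full
    solve: the optimization runs online, interleaved with sampling."""
    rng = np.random.default_rng(seed)
    k = len(mu_true)
    counts, sums = np.zeros(k), np.zeros(k)
    w = np.full(k, 1.0 / k)
    for a in range(k):                       # pull each arm once to start
        sums[a] += rng.normal(mu_true[a]); counts[a] += 1
    for t in range(k, horizon):
        mu_hat = sums / counts
        w = project_simplex(w + lr / np.sqrt(t) * grad_characteristic(w, mu_hat))
        under = np.flatnonzero(counts < np.sqrt(t))
        a = int(under[0]) if under.size else int(np.argmax(t * w - counts))
        sums[a] += rng.normal(mu_true[a]); counts[a] += 1
    return counts / horizon, int(np.argmax(sums / counts))

props, guess = gradient_tracking(np.array([0.5, 0.4, 0.3]))
```

Each round now costs a single gradient evaluation and a simplex projection, both linear-time operations up to the sort, instead of a full constrained concave solve.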