Regret of Narendra Shapiro Bandit Algorithms

S. Gadat, Toulouse School of Economics
Joint work with F. Panloup and S. Saadane.
Toulouse, 5 April 2015

Outline

I - Introduction
  I - 1 Motivations
  I - 2 Stochastic multi-armed bandit model
II Narendra Shapiro algorithm
  II - 1 An historical algorithm (1969)
  II - 2 Improvement through penalization
IV Conclusion

I - 1 Motivations - Stochastic Bandit Games

Problem: you want to earn as much as possible in a casino.

- You are in a casino and want to play the slot machines.
- Each one can give you a potential gain, but these gains are not equivalent.
- You sequentially play one of the arms of the bandit machine.

How can we design a good policy to sequentially optimize the gain?

I - 1 Motivations - Dynamic Resource Allocation

Problem: optimization of a sequence of clinical trials. Imagine you are a doctor:

- Patients visit you sequentially (one after another) for a given disease.
- You choose one treatment/drug among (say) 5 available ones.
- The treatments are not equivalent.
- You do not know which drug is the best, but you observe the effect of the prescribed treatment on each patient.
- You expect to find the best drug despite some uncertainty on the effect of each treatment.

How can we design a good sequence of clinical trials?

I - 1 Motivations - Dynamic Resource Allocation

Problem: fast-fashion retailer. Source: Farias & Madan, Operations Research, Vol. 59, No 2, 2011.

Imagine you are a firm selling clothes:

- Customers visit you sequentially (one after another) each week/day.
- You observe weekly/daily sales and measure each item's popularity.
- You want to restock popular items and weed out unpopular ones on-line.
- You expect to maximize your profit while finding the best items.

How can we design a good sequence of fast-fashion operations?

I - 1 Motivations - Dynamic Resource Allocation

Other motivating examples:

- Pricing a product with uncertain demand to maximize revenue.
- Trading (sequentially allocate a share of the fund to the more efficient trader).
- Recommender systems: advertisement, website optimization, news, blog posts.
- Computer experiments: a code can be simulated in order to optimize a criterion. The simulation depends on a set of parameters; it is costly, and only a few choices of parameters are possible.

I - 2 Stochastic multi-armed bandit model

Environment. At your disposal: $d$ arms with unknown parameters $\theta_1, \dots, \theta_d$. For any time $t$ and any choice $I_t \in \{1, \dots, d\}$, you receive a reward $A^{I_t}_t$. For any arm $i$, the rewards are i.i.d.: $(A^i_t)_{t \ge 0} \sim \nu_{\theta_i}$.

Reward distribution. In general, the reward distributions $\nu_\theta$ belong to a parametric family (Exponential, Poisson, ...). In this talk, we consider the simplest case of Bernoulli rewards $\nu_p = \mathcal{B}(p)$: you obtain a gain of 1 with probability $p$, and 0 otherwise (with probability $1 - p$). The probabilities of success $(p_1, \dots, p_d)$ are unknown. Without loss of generality, we assume that
$$p_1 > \max_{2 \le j \le d} p_j.$$

Admissible policy. The agent's actions follow a dynamical strategy, which is defined on-line:
$$I_t = \pi\big(A^{I_{t-1}}_{t-1}, \dots, A^{I_1}_1\big).$$

Final goal. Maximize (in expectation) the cumulative reward:
$$\mathbb{E}\Big[\sum_{t=1}^{n} A^{I_t}_t\Big].$$
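The model above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names (`pull`, `play`, `uniform_policy`), the success probabilities, and the uniform policy are assumptions made for the example and do not come from the talk.

```python
import random

def pull(p, arm, rng):
    """Draw a Bernoulli reward: 1 with probability p[arm], 0 otherwise."""
    return 1 if rng.random() < p[arm] else 0

def play(p, policy, n, seed=0):
    """Play n rounds. The policy only sees the past (arm, reward) pairs,
    which mirrors the on-line (admissible) constraint of the model."""
    rng = random.Random(seed)
    history = []
    for _ in range(n):
        arm = policy(history, rng)
        history.append((arm, pull(p, arm, rng)))
    return history

def uniform_policy(d):
    """Hypothetical baseline: ignore the history, pick an arm uniformly."""
    return lambda history, rng: rng.randrange(d)

# Illustrative success probabilities (arm 0 is the best one here).
p = [0.7, 0.5, 0.3]
history = play(p, uniform_policy(len(p)), n=1000)
total = sum(reward for _, reward in history)  # cumulative reward
```

Arms are indexed from 0 in the code, whereas the slides index them from 1.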

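I - 2 Regret of Stochastic multi-armed bandit algorithms

Regret of an algorithm. Maximizing the cumulative reward amounts to minimizing the expected regret $\mathbb{E} R_n$:
$$\mathbb{E} R_n = \mathbb{E} \max_{1 \le j \le d} \sum_{t=1}^{n} A^j_t - \mathbb{E} \sum_{t=1}^{n} A^{I_t}_t = \mathbb{E} \max_{1 \le j \le d} \sum_{t=1}^{n} \big(A^j_t - A^{I_t}_t\big).$$
The expectation of the maximum makes the regret difficult to handle, but...

Proposition (Pseudo-regret). If we define
$$\bar{R}_n := \max_{1 \le j \le d} \mathbb{E}\Big[\sum_{t=1}^{n} \big(A^j_t - A^{I_t}_t\big)\Big],$$
one has $\bar{R}_n \le \mathbb{E} R_n$. This bound is useful since
$$\mathbb{E} R_n \le \bar{R}_n + \sqrt{\frac{n \log d}{2}}.$$

Proposition (Lower bound - (Auer, Cesa-Bianchi, Freund, Schapire 2002)). Uniformly among all policies $\pi$ and among all Bernoulli reward distributions, up to a universal constant:
$$\min_{\pi} \max_{(p_j)_{1 \le j \le d}} \mathbb{E} R_n \gtrsim \sqrt{nd}.$$

Conclusion: upper bounds on $\bar{R}_n$ of order $\sqrt{nd}$ are competitive (optimal).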
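For Bernoulli rewards and a policy that chooses $I_t$ from past observations only, $\mathbb{E}[A^{I_t}_t \mid I_t] = p_{I_t}$, so the pseudo-regret equals the expected sum of the gaps $\max_j p_j - p_{I_t}$ over the arms played. A minimal sketch of this computation, assuming the sequence of played arms is available; the function name `gap_sum` and the numbers are illustrative, not from the talk:

```python
import math

def gap_sum(p, arms):
    """Sum over rounds of (best success probability - probability of the
    arm played). Its expectation over the policy's randomness is the
    pseudo-regret for Bernoulli rewards."""
    best = max(p)
    return sum(best - p[a] for a in arms)

p = [0.7, 0.5]
always_best = gap_sum(p, [0, 0, 0, 0])   # never plays the bad arm
half_wrong = gap_sum(p, [0, 1, 0, 1])    # plays the bad arm twice

# Minimax scale sqrt(n * d) against which upper bounds are compared.
n, d = 10_000, len(p)
minimax_scale = math.sqrt(n * d)
```

A policy that keeps playing a suboptimal arm a constant fraction of the time has a gap sum growing linearly in $n$, which is far above the $\sqrt{nd}$ scale.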

I - 3 Roadmap

In this talk, we will:

- Briefly describe a standard old-fashioned method:
$$X_{t+1} = X_t + \gamma_{t+1} h(X_t) + \gamma_{t+1} \Delta M_{t+1}.$$
- Introduce a new one whose regret will be studied in depth from a non-asymptotic point of view:
$$\forall n \in \mathbb{N}, \quad \bar{R}_n \le C \sqrt{n}.$$
- Provide an asymptotic limit of this penalized bandit, up to a correct scaling:
$$\beta_n (X_n - \delta_1) \xrightarrow{\;w\;} \mu \quad \text{as } n \to +\infty.$$
- Describe ergodic properties of the rescaled process (Piecewise Deterministic Markov Process).

II - 1 An historical algorithm (1969)

The so-called Narendra-Shapiro bandit algorithm (NS bandit for short) defines a probability vector of the simplex $S_d$:
$$X_t = (X^1_t, \dots, X^d_t), \qquad \sum_{j=1}^{d} X^j_t = 1.$$

Idea: use $X_t$ to sample one arm at step $t$, and update this probability according to the reward obtained at time $t$.

In the two-armed situation with $p_2 < p_1$, denote $X_t = (x_t, 1 - x_t)$:
$$x_{t+1} = x_t + \begin{cases} \gamma_{t+1} (1 - x_t) & \text{if arm 1 is selected and wins,} \\ -\gamma_{t+1} x_t & \text{if arm 2 is selected and wins,} \\ 0 & \text{otherwise.} \end{cases}$$

Multi-armed situation. With $I_t$ the arm sampled at time $t$ and $A^{I_t}_t$ the obtained reward, the update is
$$\forall j \in \{1, \dots, d\}, \qquad X^j_t = X^j_{t-1} + \gamma_t \big[\mathbf{1}_{\{I_t = j\}} - X^j_{t-1}\big] A^{I_t}_t.$$

To sum up:

- If you win: reinforce the probability of sampling $I_t$ with respect to the remaining weights $(X^j_t)_{j \ne I_t}$, and decrease the probability of sampling the other arms accordingly.
- If you lose ($A^{I_t}_t = 0$): do nothing.

Common step size:
$$\gamma_t = \Big(\frac{1}{1 + t/C}\Big)^{\alpha}, \qquad \alpha \in (0, 1), \text{ with large enough } C.$$
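The multi-armed update above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the default values `C=10.0` and `alpha=0.5`, and the uniform initialization of $X_0$ are assumptions made for the example.

```python
import random

def step_size(t, C=10.0, alpha=0.5):
    """gamma_t = (1 / (1 + t/C))**alpha, with alpha in (0, 1)."""
    return (1.0 + t / C) ** (-alpha)

def ns_update(X, arm, reward, gamma):
    """X^j <- X^j + gamma * (1{arm == j} - X^j) * reward.
    With reward == 0 the vector is left unchanged."""
    return [x + gamma * ((1.0 if j == arm else 0.0) - x) * reward
            for j, x in enumerate(X)]

def run_ns(p, n, seed=0, C=10.0, alpha=0.5):
    """Run n steps of the NS bandit on Bernoulli arms with success
    probabilities p, starting from the uniform vector (an assumption)."""
    rng = random.Random(seed)
    d = len(p)
    X = [1.0 / d] * d
    for t in range(1, n + 1):
        arm = rng.choices(range(d), weights=X)[0]   # sample I_t from X
        reward = 1 if rng.random() < p[arm] else 0  # Bernoulli reward
        X = ns_update(X, arm, reward, step_size(t, C, alpha))
    return X

X_final = run_ns([0.7, 0.5, 0.3], n=2000)
```

The update keeps $X_t$ in the simplex: the increments sum to $\gamma_t (1 - \sum_j X^j_{t-1}) A^{I_t}_t = 0$, and each coordinate stays non-negative because $\gamma_t < 1$.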

II - 1 An historical algorithm (1969)

A few words about the NS bandit:

- Recursive stochastic algorithm
- Anytime policy
- Involves nontrivial mathematical difficulties

It can be written as mean drift + martingale increment:
$$X_{t+1} = X_t + \gamma_{t+1} h(X_t) + \gamma_{t+1} \Delta M_{t+1}.$$
In the 2-armed setting ($p_2 < p_1$ and $X_t = (x_t, 1 - x_t)$):
$$h(x) = (p_1 - p_2)\, x (1 - x).$$
O.D.E. approximation $\dot{x} = h(x)$: since $h > 0$ on $(0, 1)$, there is a local trap at $\{0\}$ and a stable equilibrium at $\{1\}$.

But: the conditional variance term vanishes at 0 and 1, which rules out Duflo's argument about the escape from local traps. Indeed, for any sequence $\gamma_t = \big(\frac{C}{t + C}\big)^{\alpha}$, $\alpha \in (0, 1)$, the algorithm is fallible:
$$P(\lim x_t = 0) > 0 \implies \mathbb{E} R_n \ge c\, n \gg \sqrt{n} \quad \text{for some constant } c > 0.$$
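The drift $h$ can be recovered directly from the two-armed update rule. The short derivation below is a sketch that conditions on $x_t = x$ and uses only the facts that arm 1 is selected with probability $x$ and wins with probability $p_1$, while arm 2 is selected with probability $1 - x$ and wins with probability $p_2$:

```latex
\begin{aligned}
\mathbb{E}\big[x_{t+1} - x_t \mid x_t = x\big]
  &= \gamma_{t+1}\Big[\underbrace{x\,p_1}_{\text{arm 1 selected, wins}}(1 - x)
     \;-\; \underbrace{(1 - x)\,p_2}_{\text{arm 2 selected, wins}}\,x\Big] \\
  &= \gamma_{t+1}\,(p_1 - p_2)\,x(1 - x)
   \;=\; \gamma_{t+1}\,h(x).
\end{aligned}
```

At $x = 0$ only arm 2 is sampled and its increment $-\gamma_{t+1} x$ is zero; at $x = 1$ only arm 1 is sampled and its increment $\gamma_{t+1}(1 - x)$ is zero. The increment is therefore deterministic (equal to 0) at both endpoints, which is the vanishing conditional variance mentioned above.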

II - 2 Improvement through penalization

What's wrong with the NS bandit?

Gittins, JRSS(B) 79: good regret properties only occur with an exploration/exploitation trade-off... The NS bandit is a pure exploitation method: there is no exploration term to exit local traps.

Main idea: introduce a penalty term [Pages & Lamberton, EJP 09].

In the 2-armed setting ($p_2 < p_1$ and $X_t = (x_t, 1 - x_t)$, where $X_t$ below denotes the first coordinate):
$$X_{t+1} = X_t + \begin{cases} +\gamma_{t+1} (1 - X_t) & \text{if arm 1 is selected and wins,} \\ -\gamma_{t+1} X_t & \text{if arm 2 is selected and wins,} \\ -\rho_{t+1} \gamma_{t+1} X_t & \text{if arm 1 is selected and loses,} \\ +\rho_{t+1} \gamma_{t+1} (1 - X_t) & \text{if arm 2 is selected and loses.} \end{cases}$$

When one arm fails, decrease the probability of sampling it.

LP 09: the penalized 2-armed bandit with $\rho_t = \rho_1 t^{-\beta}$ and $\gamma_t = \gamma_1 t^{-\alpha}$ is infallible (a.s. convergence to the good target) iff $0 < \beta \le \alpha$ and $\alpha + \beta \le 1$.
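The four cases of the penalized update can be sketched in Python as follows. The function names, the starting point $x = 0.5$, and the constants `gamma1 = rho1 = 0.5`, `alpha = beta = 0.5` (which satisfy $0 < \beta \le \alpha$ and $\alpha + \beta \le 1$) are illustrative assumptions, not values from the talk.

```python
import random

def penalized_update(x, arm, reward, gamma, rho):
    """One step of the penalized two-armed NS bandit.
    x is the probability of sampling arm 1; arm is 1 or 2."""
    if arm == 1:
        # Win: reinforce arm 1. Loss: penalize arm 1.
        return x + gamma * (1 - x) if reward else x - rho * gamma * x
    # Win: reinforce arm 2. Loss: penalize arm 2 (x goes up).
    return x - gamma * x if reward else x + rho * gamma * (1 - x)

def run_penalized(p1, p2, n, gamma1=0.5, rho1=0.5,
                  alpha=0.5, beta=0.5, seed=0):
    """Simulate n steps with gamma_t = gamma1 * t**(-alpha) and
    rho_t = rho1 * t**(-beta); the t-th update uses (gamma_t, rho_t)."""
    rng = random.Random(seed)
    x = 0.5
    for t in range(1, n + 1):
        gamma, rho = gamma1 * t ** (-alpha), rho1 * t ** (-beta)
        arm = 1 if rng.random() < x else 2
        reward = rng.random() < (p1 if arm == 1 else p2)
        x = penalized_update(x, arm, reward, gamma, rho)
    return x

x_final = run_penalized(0.7, 0.5, n=2000)
```

Unlike the plain NS update, a loss now moves $x$, so the process is no longer frozen when the sampled arm fails.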

II - 3 Over-penalized NS bandit

This additional penalty term is still inefficient from the minimax regret point of view. As a last resort, increase the penalty effect to reinforce the escape from local traps:
$$X_{t+1} = X_t + \begin{cases} +\gamma_{t+1} (1 - X_t) - \rho_{t+1} \gamma_{t+1} X_t & \text{if arm 1 is selected and wins,} \\ -\gamma_{t+1} X_t + \rho_{t+1} \gamma_{t+1} (1 - X_t) & \text{if arm 2 is selected and wins,} \\ -\rho_{t+1} \gamma_{t+1} X_t & \text{if arm 1 is selected and loses,} \\ +\rho_{t+1} \gamma_{t+1} (1 - X_t) & \text{if arm 2 is selected and loses.} \end{cases}$$

Whatever happens with the selected arm, it is penalized (escape from local traps).
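The over-penalized rule differs from the penalized one only in the winning cases, where the penalty is applied on top of the reinforcement. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def over_penalized_update(x, arm, reward, gamma, rho):
    """One step of the over-penalized two-armed NS bandit.
    x is the probability of sampling arm 1; arm is 1 or 2.
    The selected arm is penalized whether it wins or loses."""
    if arm == 1:
        gain = gamma * (1 - x) if reward else 0.0
        return x + gain - rho * gamma * x        # penalty on arm 1
    gain = gamma * x if reward else 0.0
    return x - gain + rho * gamma * (1 - x)      # penalty on arm 2

# Starting from x = 0.5 with gamma = 0.2 and rho = 0.5:
win_1 = over_penalized_update(0.5, 1, True, 0.2, 0.5)    # 0.5 + 0.1 - 0.05
lose_1 = over_penalized_update(0.5, 1, False, 0.2, 0.5)  # 0.5 - 0.05
win_2 = over_penalized_update(0.5, 2, True, 0.2, 0.5)    # 0.5 - 0.1 + 0.05
lose_2 = over_penalized_update(0.5, 2, False, 0.2, 0.5)  # 0.5 + 0.05
```

Even at $x = 1$, a win of arm 1 now moves the process to $1 - \rho \gamma < 1$, so the endpoints are no longer absorbing.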

IV Conclusion

Important remarks:

- Importance of the statistical framework (classification / discriminant analysis).
- The margin assumption accelerates the rate of the risk.
- The key is to understand the neighborhood structure and the size of the small probabilistic balls around each observation.
- k-nearest neighbors should be used with caution (descriptive variables, randomness).
- ECG data: the classification error drops from 25% to less than 2% when using the invariants.

Mathematical extensions:

- No use of the regularity of η (cf. Samworth 12). Weighting?
- Non-optimal result in the Gaussian case (minimax loss in d n instead of n2/(2+d)).
- Non-optimal (but almost optimal) rate in discriminant analysis, with an extra log(n). A better concentration inequality?
- Alternative approach to Poissonization through N.A. variables (this is almost the case for η_{n,k} as n → +∞).
- Input variables perturbed by a partially known/unknown operator; from a mathematical statistics point of view, semi-parametric or nonparametric aspects.
- Lower bound in the functional approach...
- Coupling with a sparse PCA (reduction of the dimension and of the bias in nearest neighbors)...

Thank you for your attention!
