Two generic principles in modern bandits: the optimistic principle and Thompson sampling
1 Two generic principles in modern bandits: the optimistic principle and Thompson sampling Rémi Munos INRIA Lille, France CSML Lunch Seminars, September 12, 2014
2 Outline Two principles: The optimistic principle Thompson sampling In 3 settings: The stochastic multi-armed bandit Linear bandits Bandits in graphs
3 The stochastic multi-armed bandit problem
Setting: a set of $K$ arms, defined by distributions $\nu_k$, initially unknown. At each time $t$, choose an arm $I_t$ and receive an i.i.d. reward $x_t \sim \nu_{I_t}$.
Goal: maximize the sum of rewards.
Exploration-exploitation tradeoff: Explore: learn about the environment. Exploit: act optimally according to our current beliefs.
4 Definition of regret
Definitions: Let $\mu_k = E[\nu_k]$ be the expected value of arm $k$, and $\mu^* = \max_k \mu_k$ the best expected value. The cumulative expected regret:
$$R_n \stackrel{\text{def}}{=} \sum_{t=1}^n (\mu^* - \mu_{I_t}) = \sum_{k=1}^K \Delta_k\, T_k(n),$$
where $\Delta_k \stackrel{\text{def}}{=} \mu^* - \mu_k$, and $T_k(n)$ is the number of times arm $k$ has been played up to round $n$.
Equivalent goal: minimize $R_n$.
5 Proposed solutions This is an old problem! [Robbins, 1952] Maybe surprisingly, not fully solved yet! Many proposed strategies: ɛ-greedy Softmax exploration (e.g., EXP3) Follow the perturbed leader Bayesian exploration (Gittins index)
6 Proposed solutions
This is an old problem! [Robbins, 1952] Maybe surprisingly, not fully solved yet! Many proposed strategies: ɛ-greedy, softmax exploration (e.g., EXP3), follow the perturbed leader, Bayesian exploration (Gittins index).
Here we will consider:
Optimism in the face of uncertainty: play optimally in the best possible world.
Thompson sampling: play optimally in a randomly selected world.
7 The UCB1 algorithm
Upper Confidence Bound algorithm [Auer, Cesa-Bianchi, Fischer, 2002]: select the arm with highest index
$$B_{k,t} \stackrel{\text{def}}{=} \hat\mu_k(t) + \sqrt{\frac{2\log t}{T_k(t)}},$$
where $\hat\mu_k(t)$ is the empirical mean of the rewards collected from arm $k$ up to time $t$, and $T_k(t)$ is the number of times arm $k$ has been pulled up to time $t$.
Pull an arm either because it looks good or because it is uncertain.
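To make the index concrete, here is a minimal Python sketch of UCB1 under the usual bounded-reward assumption; `arms` is a hypothetical list of callables returning one stochastic reward each (my interface, not from the slides):

```python
import numpy as np

def ucb1(arms, n):
    """Minimal UCB1 sketch: play the arm maximizing mu_hat_k + sqrt(2 log t / T_k)."""
    K = len(arms)
    counts = np.zeros(K)           # T_k(t): number of pulls of arm k
    sums = np.zeros(K)             # cumulative reward of arm k
    for k in range(K):             # initialization: pull each arm once
        sums[k] += arms[k]()
        counts[k] += 1
    for t in range(K + 1, n + 1):
        index = sums / counts + np.sqrt(2 * np.log(t) / counts)
        k = int(np.argmax(index))  # optimism: highest upper confidence bound
        sums[k] += arms[k]()
        counts[k] += 1
    return counts                  # sub-optimal arms end up with O(log n) pulls

# Example: three Bernoulli arms (p=p binds the loop variable in each lambda)
pulls = ucb1([lambda p=p: float(np.random.rand() < p) for p in (0.3, 0.5, 0.7)], 10_000)
```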
8 Can we stay a long time playing a bad arm?
9 Can we stay a long time playing a bad arm?
No, since: the more we pull an arm $k$, the smaller the size of its confidence interval; but with high probability, it cannot be pulled once its UCB is smaller than $\mu^*$.
Thus a sub-optimal arm $k$ can only be pulled a number of times $T_k(n)$ such that
$$\sqrt{\frac{2\log n}{T_k(n)}} \geq \mu^* - \mu_k = \Delta_k.$$
10 Regret bound for UCB1
Proposition 1. Each sub-optimal arm $k$ is visited, in average, at most
$$\mathbb{E}[T_k(n)] \leq \frac{8\log n}{\Delta_k^2} + 1 + \frac{\pi^2}{3}$$
times (where $\Delta_k \stackrel{\text{def}}{=} \mu^* - \mu_k > 0$).
Theorem 1. Thus the expected regret is bounded by
$$\mathbb{E}[R_n] = \sum_k \Delta_k\, \mathbb{E}[T_k(n)] \leq \Big(8 \sum_{k:\Delta_k>0} \frac{1}{\Delta_k}\Big)\log n + K\Big(1 + \frac{\pi^2}{3}\Big).$$
11 Intuition of the proof
Let $k$ be a sub-optimal arm, and $k^*$ be an optimal arm. At time $t$, if arm $k$ is selected, this means that $B_{k,t} \geq B_{k^*,t}$, i.e.
$$\hat\mu_{k,t} + \sqrt{\frac{2\log t}{T_k(t)}} \geq \hat\mu_{k^*,t} + \sqrt{\frac{2\log t}{T_{k^*}(t)}} \geq \mu^* \quad \text{with high probability}.$$
Since, with high probability, $\hat\mu_{k,t} \leq \mu_k + \sqrt{\frac{2\log t}{T_k(t)}}$, this gives $\mu_k + 2\sqrt{\frac{2\log t}{T_k(t)}} \geq \mu^*$, i.e. $T_k(t) \leq \frac{8\log t}{\Delta_k^2}$.
Thus, if $T_k(t) > \frac{8\log t}{\Delta_k^2}$, then there is only a small probability that arm $k$ can be selected.
12 Full proof of Proposition 1
Write $u = \big\lceil \frac{8\log n}{\Delta_k^2} \big\rceil$. We have:
$$T_k(n) \leq u + \sum_{t=u+1}^n \mathbb{1}\{I_t = k;\ T_k(t) > u\}$$
$$\leq u + \sum_{t=u+1}^n \Big[\mathbb{1}\Big\{\hat\mu_{k,t} \geq \mu_k + \sqrt{\tfrac{2\log t}{T_k(t)}}\Big\} + \mathbb{1}\Big\{\hat\mu_{k^*,t} \leq \mu^* - \sqrt{\tfrac{2\log t}{T_{k^*}(t)}}\Big\}\Big].$$
Now, taking the expectation of both sides, each indicator event has probability at most $2t^{-2}$, so
$$\mathbb{E}[T_k(n)] \leq u + \sum_{t=u+1}^n \frac{2}{t^2} \leq \frac{8\log n}{\Delta_k^2} + 1 + \frac{\pi^2}{3}.$$
13 Lower bound
We have proven that UCB1 has a regret bounded as
$$\mathbb{E}[R_n] \leq \Big(8 \sum_{k:\Delta_k>0} \frac{1}{\Delta_k}\Big)\log n + O(1).$$
Lower bound [Burnetas, Katehakis, 1996], [Lai, Robbins, 1985]:
$$\mathbb{E}[R_n] \geq \Big(\sum_{k:\Delta_k>0} \frac{\Delta_k}{K_{\inf}(\nu_k, \mu^*)}\Big)\log n + o(\log n),$$
where $K_{\inf}(\nu_k, \mu^*) = \inf\{KL(\nu_k, \nu') : \nu' \in \mathcal{D} \text{ and } E(\nu') > \mu^*\}$.
14 UCB with variance estimate
Tighter bounds lead to better performance. UCB-V [Audibert, Munos, Szepesvári, 2007]: define the UCB as
$$B_{k,t} \stackrel{\text{def}}{=} \hat\mu_{k,t} + \sqrt{\frac{2\hat\sigma_{k,t}^2 \log(1.2t)}{T_k(t)}} + \frac{3\log(1.2t)}{T_k(t)}.$$
Then the expected regret is bounded as
$$\mathbb{E}[R_n] \leq 10\Big(\sum_{k:\Delta_k>0} \frac{\sigma_k^2}{\Delta_k} + 2\Big)\log n.$$
15 KL-UCB
Use the full empirical distribution. KL-UCB: given a class of distributions $\mathcal{D}$,
$$B_{k,t} \stackrel{\text{def}}{=} \sup\Big\{E[\nu] : \nu \in \mathcal{D} \text{ and } KL(\hat\nu_k(t), \nu) \leq \frac{\log t}{T_k(t)}\Big\}.$$
(Figure: $B_{k,t}$ is the largest mean $x \geq \hat\mu_k(t)$ such that $kl(\hat\mu_k(t), x) \leq \frac{\log t}{T_k(t)}$.)
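For Bernoulli rewards the KL-UCB index has no closed form, but it is easy to compute by bisection since $kl(\hat\mu, \cdot)$ is increasing on $[\hat\mu, 1]$. A minimal sketch (the function names are mine, not from the slides):

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) between Bernoulli(p) and Bernoulli(q), clipped for numerical safety."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, t, T_k, iters=40):
    """Largest q in [mu_hat, 1] with kl(mu_hat, q) <= log(t) / T_k, by bisection."""
    level = math.log(t) / T_k
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid      # mid still satisfies the KL constraint: move up
        else:
            hi = mid
    return lo
```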
16 Regret of KL-UCB
Theorem 2 (Cappé, Garivier, Maillard, Munos, Stoltz, 2013). The regret of KL-UCB is bounded as
$$\mathbb{E}[R_n] = \Big(\sum_{k:\Delta_k>0} \frac{\Delta_k}{K_{\inf}(\nu_k, \mu^*)}\Big)\log n + o(\log n).$$
This reaches the lower bounds of [Lai, Robbins, 1985], [Burnetas, Katehakis, 1996] for exponential families (Bernoulli, Gaussian, Gamma, Dirichlet, Poisson, ...) and finitely supported distributions.
17 Thompson sampling
The first bandit algorithm ever [Thompson, 1933], only recently rediscovered: efficient in practice [Chapelle, Li, 2011], ...
Recent analyses: frequentist: [Agrawal, Goyal], [Kaufmann, Korda, Munos]; Bayesian: [Russo, Van Roy, 2013], [Bubeck, Liu, 2013].
Principle:
Choose a prior on the set of unknown parameters.
Update the posterior according to the observed rewards.
Draw a sample from the posterior and play optimally in the sampled model.
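For Bernoulli arms these three steps reduce to the classical Beta-Bernoulli scheme. A minimal sketch (the uniform Beta(1,1) prior is my choice here, not specified on the slide):

```python
import numpy as np

def thompson_bernoulli(arms, n, rng=None):
    """Beta-Bernoulli Thompson sampling: sample a world, act optimally in it."""
    rng = rng or np.random.default_rng()
    K = len(arms)
    a = np.ones(K)                 # posterior Beta parameters: successes + 1
    b = np.ones(K)                 # posterior Beta parameters: failures + 1
    for _ in range(n):
        theta = rng.beta(a, b)     # one sample per arm = a randomly selected world
        k = int(np.argmax(theta))  # play optimally in that world
        r = arms[k]()              # observe a 0/1 reward
        a[k] += r                  # conjugate posterior update
        b[k] += 1 - r
    return a, b
```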
18 (Frequentist) Analysis of Thompson sampling
Theorem 3 (Korda, Kaufmann, Munos, 2013). Assume the arm distributions belong to an exponential family, and use the Jeffreys prior. Then
$$\mathbb{E}[R_n] = \Big(\sum_{k:\Delta_k>0} \frac{\Delta_k}{K_{\inf}(\nu_k, \mu^*)}\Big)\log n + o(\log n).$$
This reaches the lower bounds of [Burnetas, Katehakis, 1996], [Lai, Robbins, 1985].
19 Three ingredients for analysing Thompson sampling
1. Concentration of the posterior distributions around the true mean.
2. The optimal arm is often pulled: for any $b \in (0, 1)$, $T_{k^*}(t) = \Omega(t^b)$. This is achieved by proving that $\mathbb{P}(\theta^\pi_{k^*,t} > \mu^*) \geq c$ (anti-concentration of the posterior).
3. Comparison of the sample $\theta^\pi_{k,t}$ to the quantile of $\pi_{k,t}$ at level $1 - \frac{1}{T_k(t)}$.
20 Conclusions on multi-armed bandits
Two generic principles:
Optimistic principle: act optimally in the best possible world compatible with observations.
Thompson sampling: act optimally in a world randomly selected from the posterior.
KL-UCB and Thompson sampling are currently among the best algorithms for multi-armed bandits. Those principles extend to more complicated settings: linear bandits, bandits in graphs.
21 Linear bandits
The set of arms $X$ is a subset of $\mathbb{R}^D$. At each time step $t$:
Select $x_t \in X$.
Observe $r_t = x_t^\top \alpha + \epsilon_t$, where $\alpha \in \mathbb{R}^D$ is unknown.
Define the regret:
$$R_n = \sum_{t=1}^n (x^* - x_t)^\top \alpha, \quad \text{with } x^* = \arg\max_{x\in X} x^\top \alpha.$$
22 The optimistic principle
The reward $r_t = x_t^\top \alpha + \epsilon_t$ provides information about $\alpha$ along direction $x_t$.
Idea: build a high-probability confidence set $E_t \subset \mathbb{R}^D$ such that $\alpha \in E_t$ w.h.p., then play the arm $x \in X$ that maximizes $\max_{\alpha' \in E_t} x^\top \alpha'$.
(Figure: the confidence ellipsoid $E_t$ around $\hat\alpha_t$ in $\mathbb{R}^D$, containing the true $\alpha$.)
23 A bit more precisely...
UCB idea: define a least-squares estimate $\hat\alpha_t$ of $\alpha$:
$$\hat\alpha_t = \arg\min_{\alpha\in\mathbb{R}^D} \Big[\sum_{s=1}^{t-1} \big(r_s - x_s^\top \alpha\big)^2 + \|\alpha\|^2\Big],$$
and a confidence ellipsoid $E_t$ around $\hat\alpha_t$:
$$E_t = \big\{\alpha \in \mathbb{R}^D : \|\alpha - \hat\alpha_t\|_{V_t} \leq \rho(t)\big\},$$
where $\rho(t) = c\sqrt{D\log(t/\delta)}$ and $V_t = \sum_{s=1}^{t-1} x_s x_s^\top + I$.
Property: w.p. $1-\delta$, $\alpha \in E_t$ for all $t \geq 1$.
Algorithm:
$$x_t = \arg\max_{x\in X} \max_{\alpha\in E_t} x^\top \alpha \quad \Big(\text{or } x_t = \arg\max_{x\in X}\ x^\top\hat\alpha_t + \rho(t)\|x\|_{V_t^{-1}}\Big).$$
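A minimal sketch of this UCB rule in Python, using the closed form $x_t = \arg\max_x x^\top\hat\alpha_t + \rho\|x\|_{V_t^{-1}}$; the constant `rho` stands in for $\rho(t)$, and `reward` is a hypothetical noisy oracle for $x^\top\alpha$ (both my assumptions):

```python
import numpy as np

def lin_ucb(X, reward, n, rho=1.0):
    """Linear UCB sketch: ridge estimate plus exploration width ||x||_{V^{-1}}."""
    K, D = X.shape                 # X holds the K arms as rows
    V = np.eye(D)                  # V_t = sum_s x_s x_s^T + I
    b = np.zeros(D)                # sum_s r_s x_s
    for _ in range(n):
        V_inv = np.linalg.inv(V)
        alpha_hat = V_inv @ b      # regularized least-squares estimate
        widths = np.sqrt(np.einsum('kd,de,ke->k', X, V_inv, X))  # ||x_k||_{V^{-1}}
        k = int(np.argmax(X @ alpha_hat + rho * widths))
        r = reward(X[k])           # observe r_t = x_t^T alpha + noise
        V += np.outer(X[k], X[k])
        b += r * X[k]
    return alpha_hat
```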
24 Thompson sampling
[Agrawal, Goyal, 2013]: use a Gaussian prior $\mathcal{G}(0, I)$. At each $t$, draw a sample from the posterior:
$$\tilde\alpha_t \sim \mathcal{G}\big(\hat\alpha_t, \rho(t)^2 V_t^{-1}\big),$$
and select $x_t = \arg\max_{x\in X} x^\top \tilde\alpha_t$.
Remarks: the Gaussian prior and Gaussian likelihood model are just there for the design of the TS algorithm. The computational complexity is generally lower than UCB.
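The Thompson-sampling variant replaces the ellipsoid maximization with a single posterior draw, which is why it is cheaper per step. A sketch of one round, reusing the $V_t$, $\hat\alpha_t$, $\rho(t)$ of the previous slide:

```python
import numpy as np

def linear_ts_step(V, b, X, rho, rng=None):
    """One round of linear TS: draw alpha ~ N(alpha_hat, rho^2 V^{-1}), play greedily."""
    rng = rng or np.random.default_rng()
    V_inv = np.linalg.inv(V)
    alpha_hat = V_inv @ b
    alpha_tilde = rng.multivariate_normal(alpha_hat, rho**2 * V_inv)
    return int(np.argmax(X @ alpha_tilde))  # no max over the ellipsoid: cheaper than UCB
```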
25 Regret analysis
UCB algorithms: with high probability, $R_n = O(D\sqrt{n})$ or $O(\sqrt{Dn\log|X|})$.
Thompson sampling: with high probability, $R_n = O(D^{3/2}\sqrt{n})$ or $O(D\sqrt{n\log|X|})$.
Lower bound: there exists a set $X$ such that for any algorithm, $R_n = \Omega(D\sqrt{n})$.
Ref: [Auer, 2002], [Dani, Hayes, Kakade, 2008], [Rusmevichientong, Tsitsiklis, 2010], [Li, Chu, Langford, Schapire, 2010], [Abbasi-Yadkori, Pál, Szepesvári, 2011], [Agrawal, Goyal, 2013].
26 Bandits in graphs
Examples: advertising campaigns, recommender systems, ... The number of arms (nodes) is larger than the number of rounds.
27 Bandit in a graph
Let $G$ be a known graph with $K$ nodes $\{1, 2, \ldots, K\}$, and let $f$ be an unknown function defined on the set of nodes.
For $t = 1$ to $n$: select a node $I_t$ and observe reward $r_t = f(I_t) + \epsilon_t$.
Goal: maximize the sum of expected rewards; equivalently, minimize the regret
$$R_n = \sum_{t=1}^n \big(f^* - f(I_t)\big), \quad \text{where } f^* = \max_{1\leq i\leq K} f(i).$$
We care about the case when $K > n$.
28 Smooth graph function
Neighboring nodes have similar values. Smoothness of the function:
$$S_G(f) = \frac{1}{2} \sum_{i,j\leq K} w_{i,j} (f_i - f_j)^2,$$
where $w_{i,j}$ is the weight of the edge between nodes $i$ and $j$.
29 Graph Laplacian
Graph Laplacian: $L = D - W$, where
$W$ is the adjacency matrix (edge weights $w_{i,j}$),
$D$ is the diagonal matrix with entries $d_i = \sum_j w_{i,j}$.
(Example figure: a 5-node graph with edge weights $w_{1,2}=1$, $w_{2,3}=3$, $w_{1,5}=2$, $w_{1,4}=1$, $w_{2,4}=4$, $w_{4,5}=5$, $w_{3,4}=2$.)
Spectral decomposition: $L = Q\Lambda Q^\top$, where $\Lambda$ is diagonal, containing the eigenvalues of $L$, and $Q$ is orthogonal, its columns being the eigenvectors of $L$.
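A short sketch reproducing the slide's 5-node example in Python (nodes 0..4 stand for 1..5), ending with a check of the smoothness identity stated on the next slide:

```python
import numpy as np

# Edge weights of the slide's example graph
edges = {(0, 1): 1, (1, 2): 3, (0, 4): 2, (0, 3): 1, (1, 3): 4, (3, 4): 5, (2, 3): 2}
K = 5
W = np.zeros((K, K))
for (i, j), w in edges.items():
    W[i, j] = W[j, i] = w                 # symmetric adjacency matrix

D = np.diag(W.sum(axis=1))                # degrees d_i = sum_j w_ij
L = D - W                                 # graph Laplacian

lam, Q = np.linalg.eigh(L)                # eigenvalues (Lambda) and eigenvectors (columns of Q)
f = np.random.randn(K)
alpha = Q.T @ f                           # change of basis: f = Q alpha
assert np.isclose(f @ L @ f, np.sum(lam * alpha**2))  # S_G(f) = sum_i lambda_i alpha_i^2
```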
30 Alternative representation
Change of basis: let $f = Q\alpha$. Then $\alpha = Q^\top f$. We can learn $\alpha$ instead of $f$.
Smoothness of $f$:
$$S_G(f) = \frac{1}{2}\sum_{i,j\leq K} w_{i,j}(f_i - f_j)^2 = f^\top L f = f^\top Q\Lambda Q^\top f = \alpha^\top \Lambda \alpha = \|\alpha\|_\Lambda^2 = \sum_{i=1}^K \lambda_i \alpha_i^2.$$
$f$ is smooth when $\alpha_i$ is small for large $\lambda_i$: $\alpha$ lies in a thin ellipsoid.
31 Problem reformulation
Eigendecomposition of the graph Laplacian: $L = Q\Lambda Q^\top$, where $Q = [q_1, \ldots, q_K]$ has rows $x_1^\top, \ldots, x_K^\top$. The $(q_i)_{1\leq i\leq K}$ are orthonormal, as well as the $(x_i)_{1\leq i\leq K}$.
Notice that $f_i = (Q\alpha)_i = x_i^\top \alpha$.
Thus this is a linear bandit problem where the set of arms is $\{x_1, \ldots, x_K\} \subset \mathbb{R}^K$ and $\alpha$ is the unknown parameter.
32 Spectral UCB
[Valko, Munos, Kveton, Kocák, 2014]. Follows the optimistic principle. Define a penalized least-squares estimate
$$\hat\alpha_t = \arg\min_{\alpha\in\mathbb{R}^K} \Big[\sum_{s=1}^{t-1}\big(r_s - x_s^\top\alpha\big)^2 + \|\alpha\|_\Lambda^2\Big].$$
Select the next point
$$x_t = \arg\max_{x\in X}\ \underbrace{x^\top\hat\alpha_t + \rho(t)\|x\|_{V_t^{-1}}}_{\text{UCB on } x^\top\alpha}, \quad \text{where } V_t \stackrel{\text{def}}{=} \sum_{s=1}^{t-1} x_s x_s^\top + \Lambda,$$
and observe reward $r_t = x_t^\top \alpha + \epsilon_t$.
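The only change from the linear UCB sketch above is the regularizer: $\Lambda$ replaces $I$ in $V_t$ and in the penalty. A sketch of one round (assuming $\Lambda$ has been made invertible, e.g. $\Lambda + \epsilon I$):

```python
import numpy as np

def spectral_ucb_step(X, V, b, rho):
    """One Spectral UCB round; V must be initialized to Lambda (not the identity)."""
    V_inv = np.linalg.inv(V)
    alpha_hat = V_inv @ b                 # Lambda-penalized least squares
    widths = np.sqrt(np.einsum('kd,de,ke->k', X, V_inv, X))
    return int(np.argmax(X @ alpha_hat + rho * widths))
```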
33 Spectral TS
[Kocák, Valko, Munos, Agrawal, 2014]. Idea: incorporate the smoothness assumption into the prior. Set $V_1 = \Lambda$, $\hat\alpha_1 = 0$. For $t = 1$ to $n$:
Sample $\tilde\alpha_t \sim \mathcal{G}\big(\hat\alpha_t, \rho(t)^2 V_t^{-1}\big)$.
Select $x_t = \arg\max_x x^\top \tilde\alpha_t$.
Observe reward $r_t = x_t^\top \alpha + \epsilon_t$.
Update the posterior mean $\hat\alpha_{t+1}$ and covariance matrix $V_{t+1}$.
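A sketch of the full loop, assuming `reward(k)` is a hypothetical noisy oracle for $f(k) = x_k^\top\alpha$ and that `Lam` is invertible (in practice $\Lambda + \epsilon I$):

```python
import numpy as np

def spectral_ts(X, reward, Lam, n, rho, rng=None):
    """Spectral TS sketch: linear Thompson sampling with smoothness prior V_1 = Lambda."""
    rng = rng or np.random.default_rng()
    V = Lam.copy()                        # V_1 = Lambda encodes the prior
    b = np.zeros(X.shape[1])
    for _ in range(n):
        V_inv = np.linalg.inv(V)
        alpha_tilde = rng.multivariate_normal(V_inv @ b, rho**2 * V_inv)
        k = int(np.argmax(X @ alpha_tilde))
        r = reward(k)                     # observe r_t = x_t^T alpha + noise
        V += np.outer(X[k], X[k])         # posterior covariance update
        b += r * X[k]                     # posterior mean update (via V^{-1} b)
    return np.linalg.inv(V) @ b           # final posterior mean alpha_hat
```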
34 Spectral UCB and Spectral TS regret bound
Theorem 4. Both the regret of Spectral UCB and the regret of Spectral TS are bounded, with probability $1-\delta$, as
$$R_n = O\Big(\big(d + \|\alpha\|_\Lambda\big)\sqrt{nd\log(n/\delta)}\Big),$$
where $d$ is the effective dimension: the largest $d$ such that $d\,\lambda_d \leq \frac{n}{\log n}$.
$d$ is small when the $(\lambda_i)$ grow rapidly; it is related to the number of non-negligible dimensions.
35 Effective dimension vs. ambient dimension
(Figures: effective dimension as a function of time $T$, for a Flixster graph and for a Barabási-Albert graph.) Usually $d \ll K$ in real-world graphs.
37 Synthetic experiment
Figure: Barabási-Albert random graph results ($K = 250$, basis size 3, effective $d = 1$): cumulative regret and computational time (in seconds) for SpectralTS, LinearTS, SpectralUCB, and LinUCB.
38 Experiments with the MovieLens dataset
Figure: Results on the MovieLens dataset [Lam, Herlocker, 2012] ($K = 2019$, averaged over 10 users, $T = 200$, $d = 5$): cumulative regret and computational time (in seconds) for SpectralUCB, LinUCB, SpectralTS, and LinearTS.
Note: the graph has been learnt based on a low-rank matrix factorization of the $10^6$ ratings matrix.
39 Conclusion on graph bandits
Given a known graph, we assume the unknown function to be smooth w.r.t. the graph structure. Spectral UCB and Spectral TS achieve a regret bound of order $\tilde O(d\sqrt{n})$, where the effective dimension $d \ll K$.
Computational complexity per step is $O(K^3)$ for Spectral UCB and $O(K^2)$ for Spectral TS. Approximating the first $J$ eigenvectors takes $O(Jm\log m)$ time, where $m$ is the number of edges; then the complexity per step is $O(JK^2)$ for Spectral UCB and $O(JK)$ for Spectral TS.
40 Conclusions Bandits = a great source of inspiration: Optimistic approach: act optimally in best possible world compatible with observations Thompson sampling: act optimally in any world randomly selected from the posterior Multi-armed bandit = building block Many extensions in bandits: linear, convex, Lipschitz, Gaussian, contextual, combinatorial,... and in Reinforcement Learning
41 Thanks!!!