Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Sébastien Bubeck Theory Group

1 Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems Sébastien Bubeck Theory Group

2 Part 1: i.i.d., adversarial, and Bayesian bandit models

3–7 i.i.d. multi-armed bandit, Robbins [1952]

Known parameters: number of arms n and (possibly) number of rounds T ≥ n.

Unknown parameters: n probability distributions ν_1, ..., ν_n on [0, 1] with means µ_1, ..., µ_n (notation: µ* = max_{i∈[n]} µ_i).

Protocol: For each round t = 1, 2, ..., T, the player chooses I_t ∈ [n] based on past observations and receives a reward/observation Y_t ∼ ν_{I_t} (independently from the past).

Performance measure: The cumulative regret is the difference between the player's accumulated reward and the maximum the player could have obtained had she known all the parameters,

R_T = T µ* − E ∑_{t∈[T]} Y_t.

Fundamental tension between exploration and exploitation. Many applications!
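
To make the protocol concrete, here is a minimal simulation sketch in Python (not from the slides): it draws Bernoulli arms, lets an arbitrary policy choose I_t, and reports the pseudo-regret T µ* − ∑_t µ_{I_t}. All names (run_bandit, the uniform policy) are illustrative.

```python
import random

def run_bandit(means, policy, T, seed=0):
    """Simulate the i.i.d. bandit protocol and return the pseudo-regret.

    means  : list of arm means mu_1..mu_n (Bernoulli rewards in [0, 1])
    policy : function (t, history) -> arm index in 0..n-1
    """
    rng = random.Random(seed)
    history = []                      # list of (arm, reward) pairs
    total_reward_mean = 0.0
    for t in range(T):
        i = policy(t, history)
        y = 1.0 if rng.random() < means[i] else 0.0   # Y_t ~ nu_{I_t}
        history.append((i, y))
        total_reward_mean += means[i]                 # mu_{I_t}, for the pseudo-regret
    return T * max(means) - total_reward_mean

# Example: a policy that ignores observations and plays uniformly at random over 2 arms.
uniform = lambda t, h: random.randrange(2)
print(run_bandit([0.5, 0.6], uniform, T=10000))       # roughly 0.05 * T
```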

8–12 i.i.d. multi-armed bandit: fundamental limitations

How small can we expect R_T to be? Consider the 2-armed case where ν_1 = Ber(1/2) and ν_2 = Ber(1/2 + ξΔ), where ξ ∈ {−1, 1} is unknown.

With τ expected observations from the second arm there is a probability at least exp(−τΔ²) to make the wrong guess on the value of ξ. Let τ(t) be the expected number of pulls of arm 2 up to time t when ξ = −1. Then one has

R_T(ξ = +1) + R_T(ξ = −1) ≳ Δ τ(T) + Δ T exp(−τ(T)Δ²) ≥ Δ min_{t∈[T]} (t + T exp(−tΔ²)) ≳ log(TΔ²)/Δ.

See Bubeck, Perchet and Rigollet [2012] for the details.

For Δ fixed the lower bound is of order log(T)/Δ, and for the worst Δ (≈ 1/√T) it is of order √T (Auer, Cesa-Bianchi, Freund and Schapire [1995]: √(Tn) for the n-armed case).

13–16 i.i.d. multi-armed bandit: fundamental limitations

Notation: Δ_i = µ* − µ_i and N_i(t) is the number of pulls of arm i up to time t. Then one has R_T = ∑_{i=1}^n Δ_i E N_i(T).

For p, q ∈ [0, 1], kl(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)).

Theorem (Lai and Robbins [1985])
Consider a strategy s.t. for all a > 0, we have E N_i(T) = o(T^a) if Δ_i > 0. Then for any Bernoulli distributions,

lim inf_{T→+∞} R_T / log(T) ≥ ∑_{i: Δ_i > 0} Δ_i / kl(µ_i, µ*).

Note that 2Δ_i² ≤ kl(µ_i, µ*) ≤ Δ_i²/(µ*(1 − µ*)), so up to a variance-like term the Lai and Robbins lower bound is ∑_{i: Δ_i > 0} log(T)/Δ_i.
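
As a quick numerical illustration (not from the slides), the snippet below computes kl(p, q) and the Lai–Robbins constant ∑_{i: Δ_i>0} Δ_i / kl(µ_i, µ*) for a small Bernoulli instance, together with the cruder ∑_i 1/Δ_i proxy.

```python
from math import log

def kl(p, q):
    """Bernoulli KL divergence kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    eps = 1e-12  # avoid log(0) at the boundary
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

means = [0.5, 0.45, 0.3]
mu_star = max(means)
lai_robbins = sum((mu_star - mu) / kl(mu, mu_star) for mu in means if mu < mu_star)
proxy = sum(1.0 / (mu_star - mu) for mu in means if mu < mu_star)
print(lai_robbins, proxy)  # the regret grows like lai_robbins * log(T)
```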

17–20 i.i.d. multi-armed bandit: fundamental strategy

Hoeffding's inequality: w.p. ≥ 1 − 1/T, for all t ∈ [T] and i ∈ [n],

µ_i ≤ (1/N_i(t)) ∑_{s<t: I_s=i} Y_s + √(2 log(T) / N_i(t)) =: UCB_i(t).

UCB (Upper Confidence Bound) strategy (Lai and Robbins [1985], Agrawal [1995], Auer, Cesa-Bianchi and Fischer [2002]):

I_t ∈ argmax_{i∈[n]} UCB_i(t).

Simple analysis: on a 1 − 2/T probability event one has

N_i(t) ≥ 8 log(T)/Δ_i²  ⇒  UCB_i(t) < µ* ≤ UCB_{i*}(t),

so that E N_i(T) ≲ log(T)/Δ_i², and in fact R_T ≲ ∑_{i: Δ_i > 0} log(T)/Δ_i.
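
A minimal sketch of the UCB strategy from this slide, using the √(2 log(T)/N_i(t)) bonus above (the Bernoulli environment and helper names are illustrative):

```python
import math, random

def ucb(means, T, seed=0):
    """Play UCB with bonus sqrt(2 log T / N_i(t)) on Bernoulli arms; return pseudo-regret."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n          # N_i(t)
    sums = [0.0] * n          # sum of rewards observed from arm i
    regret = 0.0
    for t in range(T):
        if t < n:
            i = t             # play each arm once to initialize
        else:
            i = max(range(n), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(T) / counts[j]))
        y = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += y
        regret += max(means) - means[i]
    return regret

print(ucb([0.5, 0.45, 0.3], T=20000))  # grows like sum_i log(T)/Delta_i
```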

21–23 i.i.d. multi-armed bandit: going further

1. Optimal constant (replacing 8 by 1/2 in the UCB regret bound) and Lai and Robbins variance-like term (replacing Δ_i by kl(µ_i, µ*)): see Cappé, Garivier, Maillard, Munos and Stoltz [2013].

2. In many applications one is merely interested in finding the best arm (instead of maximizing cumulative reward): this is the best arm identification problem. For the fundamental strategies see Even-Dar, Mannor and Mansour [2006] for the fixed-confidence setting (see also Jamieson and Nowak [2014] for a recent short survey) and Audibert, Bubeck and Munos [2010] for the fixed-budget setting. Key takeaway: one needs of order H := ∑_i 1/Δ_i² rounds to find the best arm.

3. The UCB analysis extends to sub-Gaussian reward distributions. For heavy-tailed distributions, say with a moment of order 1 + ε for some ε ∈ (0, 1], one can get a regret that scales with Δ_i^{−1/ε} (instead of 1/Δ_i) by using a robust mean estimator, see Bubeck, Cesa-Bianchi and Lugosi [2012].

24–26 Adversarial multi-armed bandit, Auer, Cesa-Bianchi, Freund and Schapire [1995, 2001]

For t = 1, ..., T, the player chooses I_t ∈ [n] based on previous observations, and simultaneously an adversary chooses a loss vector ℓ_t ∈ [0, 1]^n. The player's loss/observation is ℓ_t(I_t).

The regret and pseudo-regret are defined as:

R_T = max_{i∈[n]} ∑_{t∈[T]} (ℓ_t(I_t) − ℓ_t(i)),    R̄_T = max_{i∈[n]} E ∑_{t∈[T]} (ℓ_t(I_t) − ℓ_t(i)).

Obviously E R_T ≥ R̄_T, and there is equality in the oblivious case (the adversary's choices are independent of the player's choices). The case where ℓ_1, ..., ℓ_T is an i.i.d. sequence corresponds to the i.i.d. case we just studied. In particular we have a √(Tn) lower bound.

27–32 Adversarial multi-armed bandit, fundamental strategy

Exponential weights strategy for full information (ℓ_t is observed at the end of round t): play I_t at random from p_t, where

p_{t+1}(i) = (1/Z_{t+1}) p_t(i) exp(−η ℓ_t(i)).

In five lines one can show R̄_T ≤ √(2 T log(n)) with p_1(i) = 1/n:

Ent(δ_j ‖ p_t) − Ent(δ_j ‖ p_{t+1}) = log (p_{t+1}(j)/p_t(j)) = log(1/Z_{t+1}) − η ℓ_t(j)

ψ_t := log E_{I∼p_t} exp(−η (ℓ_t(I) − E_{I'∼p_t} ℓ_t(I'))) = η E_{I∼p_t} ℓ_t(I) + log(Z_{t+1})

η ∑_t ( ∑_i p_t(i) ℓ_t(i) − ℓ_t(j) ) = Ent(δ_j ‖ p_1) − Ent(δ_j ‖ p_{T+1}) + ∑_t ψ_t

Using that ℓ_t ≥ 0 one has ψ_t ≤ (η²/2) E ℓ_t(I)², thus

R̄_T ≤ log(n)/η + ηT/2 ≤ √(2 T log(n)) for η = √(2 log(n)/T).
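
A minimal sketch of this full-information exponential weights update, with the uniform initialization p_1(i) = 1/n and the learning rate η = √(2 log(n)/T) used above (the function and variable names are mine):

```python
import math, random

def hedge(loss_vectors, seed=0):
    """Full-information exponential weights: the whole loss vector is observed each round."""
    rng = random.Random(seed)
    T, n = len(loss_vectors), len(loss_vectors[0])
    eta = math.sqrt(2.0 * math.log(n) / T)
    w = [1.0] * n                                   # unnormalized weights
    player_loss = 0.0
    for loss in loss_vectors:                       # loss = l_t in [0,1]^n
        p = [wi / sum(w) for wi in w]
        i = rng.choices(range(n), weights=p)[0]     # play I_t ~ p_t
        player_loss += loss[i]
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
    best = min(sum(l[i] for l in loss_vectors) for i in range(n))
    return player_loss - best                       # realized regret

losses = [[random.random() for _ in range(3)] for _ in range(1000)]
print(hedge(losses))   # typically of order sqrt(2 T log n)
```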

33–36 Adversarial multi-armed bandit, fundamental strategy

Exp3: replace ℓ_t by ℓ̃_t in the exponential weights strategy, where

ℓ̃_t(i) = (ℓ_t(I_t) / p_t(i)) 1{i = I_t}.

Key property: E_{I_t∼p_t} ℓ̃_t(i) = ℓ_t(i). Thus with the analysis from the previous slide:

R̄_T ≤ log(n)/η + (η/2) E ∑_t E_{I∼p_t} ℓ̃_t(I)².

Amazingly the variance term is automatically controlled:

E_{I_t,I∼p_t} ℓ̃_t(I)² ≤ E_{I_t,I∼p_t} 1{I = I_t}/p_t(I_t)² = E_{I∼p_t} 1/p_t(I) = n.

Thus with η = √(2 log(n)/(Tn)) one gets R̄_T ≤ √(2 T n log(n)).
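
A minimal sketch of Exp3 as described on these slides, with the importance-weighted loss estimate ℓ̃_t and η = √(2 log(n)/(Tn)); only ℓ_t(I_t) is revealed each round (names are illustrative):

```python
import math, random

def exp3(loss_vectors, seed=0):
    """Exp3: exponential weights on importance-weighted loss estimates (bandit feedback)."""
    rng = random.Random(seed)
    T, n = len(loss_vectors), len(loss_vectors[0])
    eta = math.sqrt(2.0 * math.log(n) / (T * n))
    w = [1.0] * n
    player_loss = 0.0
    for loss in loss_vectors:
        p = [wi / sum(w) for wi in w]
        i = rng.choices(range(n), weights=p)[0]     # I_t ~ p_t
        player_loss += loss[i]                       # only l_t(I_t) is observed
        est = loss[i] / p[i]                         # \tilde l_t(I_t); other coords are 0
        w[i] *= math.exp(-eta * est)
    best = min(sum(l[i] for l in loss_vectors) for i in range(n))
    return player_loss - best

losses = [[random.random() for _ in range(5)] for _ in range(5000)]
print(exp3(losses))   # typically of order sqrt(2 T n log n)
```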

37–41 Adversarial multi-armed bandit, going further

1. With the modified loss estimate (ℓ_t(I_t) 1{i = I_t} + β)/p_t(i) one can prove high-probability bounds on R_T, and by integrating the deviations one can show E R_T = O(√(T n log(n))).

2. The extraneous logarithmic factor in the pseudo-regret upper bound can be removed, see Audibert and Bubeck [2009]. Conjecture: one cannot remove the log factor for the expected regret, that is for any strategy there exists an adaptive adversary such that E R_T = Ω(√(T n log(n))).

3. T can be replaced by various measures of variance in the loss sequence, see e.g., Hazan and Kale [2009].

4. There exist strategies which guarantee simultaneously R̄_T = Õ(√(Tn)) in the adversarial model and R̄_T = Õ(∑_i 1/Δ_i) in the i.i.d. model, see Bubeck and Slivkins [2012].

5. Graph feedback structure, regret with respect to S switches, label efficient, switching cost...

42–45 Bayesian multi-armed bandit, Thompson [1933]

Set of models {(ν_1(θ), ..., ν_n(θ)), θ ∈ Θ} and prior distribution π_0 over Θ. The Bayesian regret is defined as

BR_T(π_0) = E_{θ∼π_0} R_T(ν_1(θ), ..., ν_n(θ)).

In principle the strategy minimizing the Bayesian regret can be computed by dynamic programming on the potentially huge state space P(Θ). The celebrated Gittins index theorem gives a sufficient condition to dramatically reduce the computational complexity of implementing the optimal Bayesian strategy, under a strong product assumption on π_0.

Notation: π_t denotes the posterior distribution on θ at time t.

46–48 Bayesian multi-armed bandit, Gittins index

Theorem (Gittins [1979])
Consider the product and γ-discounted case: Θ = ⊗_i Θ_i, ν_i(θ) := ν_i(θ_i), π_0 = ⊗_i π_0(i), and furthermore one is interested in maximizing E ∑_{t≥0} γ^t Y_t. The optimal Bayesian strategy is to pick at time s the arm maximizing

sup { λ ∈ R : sup_τ E ( ∑_{t<τ} γ^t X_t + (γ^τ/(1 − γ)) λ ) ≥ (1/(1 − γ)) λ },

where the expectation is over (X_t) drawn from ν(θ) with θ ∼ π_s(i), and the supremum is taken over all stopping times τ.

For much more (implementation for exponential families, interpretation as a multitoken Markov game, ...) see Dumitriu, Tetali and Winkler [2003], Gittins, Glazebrook, Weber [2011], and Kaufmann [2014].

49–53 Bayesian multi-armed bandit, Gittins index

Weber [1992] gives an exquisite proof of Gittins theorem. Let

λ_t(i) := sup { λ ∈ R : sup_τ E ∑_{t<τ} γ^t (X_t − λ) ≥ 0 }

be the Gittins index of arm i at time t, which we interpret as the maximum charge one is willing to pay to play arm i given the current information. The prevailing charge is defined as min_{s≤t} λ_s(i) (i.e. whenever the prevailing charge is too high we just drop it to the fair level).

1. The discounted sum of prevailing charges for the played arms is an upper bound (in expectation) on the discounted sum of rewards.

2. Since the prevailing charge is nonincreasing, the discounted sum of prevailing charges is maximized if we always pick the arm with maximum prevailing charge.

3. The Gittins index strategy does exactly 2., and in this case 1. is an equality. Q.E.D.
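
The index λ_t(i) above can be approximated numerically. The sketch below is my own illustration (not from the slides): for a Bernoulli arm with a Beta(a, b) posterior it truncates the stopping problem sup_τ E ∑_{t<τ} γ^t (X_t − λ) at a finite depth and bisects over λ; the truncation depth and tolerance are arbitrary choices, so this is only an approximation of the true index.

```python
from functools import lru_cache

def gittins_index_beta(a, b, gamma=0.9, depth=50, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with Beta(a, b) posterior."""

    def stopping_value(lam):
        # V(a_, b_, d) = truncated value of sup_tau E sum_{t<tau} gamma^t (X_t - lam),
        # with the remaining value set to 0 after `depth` steps.
        @lru_cache(maxsize=None)
        def V(a_, b_, d):
            if d == depth:
                return 0.0
            p = a_ / (a_ + b_)                       # posterior mean of the arm
            cont = p - lam + gamma * (p * V(a_ + 1, b_, d + 1)
                                      + (1 - p) * V(a_, b_ + 1, d + 1))
            return max(0.0, cont)                    # option to stop right now
        return V(a, b, 0)

    lo, hi = 0.0, 1.0                                # the index lies in [0, 1] for {0,1} rewards
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if stopping_value(mid) > 0:                  # still profitable to play at charge mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(gittins_index_beta(1, 1))   # noticeably above the posterior mean 0.5: the value of exploring
```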

54–57 Bayesian multi-armed bandit, Thompson Sampling (TS)

In machine learning we want (i) strategies that can deal with complicated priors, and (ii) guarantees for misspecified priors. This is why we have to go beyond the Gittins index theory.

In his 1933 paper Thompson proposed the following strategy: sample θ' ∼ π_t and play I_t ∈ argmax_{i∈[n]} µ_i(θ').

Theoretical guarantees for this highly practical strategy have long remained elusive. Recently Agrawal and Goyal [2012] and Kaufmann, Korda and Munos [2012] proved that TS with Bernoulli reward distributions and uniform prior on the parameters achieves R_T = O(∑_i log(T)/Δ_i) (note that this is the frequentist regret!).

Guha and Munagala [2014] conjecture that, for product priors, TS is a 2-approximation to the optimal Bayesian strategy for the objective of minimizing the number of pulls on suboptimal arms.
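
A minimal sketch of Thompson Sampling for Bernoulli rewards with independent uniform (Beta(1,1)) priors, the setting of the Agrawal–Goyal / Kaufmann–Korda–Munos guarantee quoted above (helper names are mine):

```python
import random

def thompson_bernoulli(means, T, seed=0):
    """Thompson Sampling with Beta(1,1) priors on Bernoulli arms; returns pseudo-regret."""
    rng = random.Random(seed)
    n = len(means)
    a = [1] * n   # 1 + number of observed successes of arm i
    b = [1] * n   # 1 + number of observed failures of arm i
    regret = 0.0
    for _ in range(T):
        theta = [rng.betavariate(a[i], b[i]) for i in range(n)]  # sample theta' ~ pi_t
        i = max(range(n), key=lambda j: theta[j])                # play argmax_i mu_i(theta')
        y = 1 if rng.random() < means[i] else 0
        a[i] += y
        b[i] += 1 - y
        regret += max(means) - means[i]
    return regret

print(thompson_bernoulli([0.5, 0.45, 0.3], T=20000))  # grows like sum_i log(T)/Delta_i
```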

58–61 Bayesian multi-armed bandit, Russo and Van Roy [2014] information ratio analysis

Assume a prior in the adversarial model, that is a prior over (ℓ_1, ..., ℓ_T) ∈ [0, 1]^{n×T}, and let E_t denote the posterior expectation (given ℓ_1(I_1), ..., ℓ_{t−1}(I_{t−1})). We introduce

r_t(i) = E_t(ℓ_t(i) − ℓ_t(i*)),  and  v_t(i) = Var_t(E_t(ℓ_t(i) | i*)).

Key observation (next slide):

E ∑_{t∈[T]} v_t(I_t) ≤ ½ H(i*),

which implies:

E_t r_t(I_t) ≤ C √(E_t v_t(I_t))  ⇒  E ∑_{t=1}^T r_t(I_t) ≤ C ∑_{t=1}^T √(E v_t(I_t)) ≤ C √(T ∑_t E v_t(I_t))  ⇒  BR_T ≤ C √(T H(i*)/2).

62–64 Bayesian multi-armed bandit, accumulation of information

v_t(i) = Var_t(E_t(ℓ_t(i) | i*)),  π_t(j) = P_t(i* = j),  goal: E ∑_{t∈[T]} v_t(I_t) ≤ ½ H(i*).

Equipped with Pinsker's inequality and basic information theory concepts (such as the mutual information I) one has:

v_t(i) = ∑_j π_t(j) (E_t(ℓ_t(i) | i* = j) − E_t(ℓ_t(i)))²
  ≤ ½ ∑_j π_t(j) Ent( L_t(ℓ_t(i) | i* = j) ‖ L_t(ℓ_t(i)) )
  = ½ I_t(ℓ_t(i); i*) = ½ (H_t(i*) − H_t(i* | ℓ_t(i))).

Thus E v_t(I_t) ≤ ½ E (H_t(i*) − H_{t+1}(i*)).

65–67 Bayesian multi-armed bandit, TS information ratio

Let ℓ̄_t(i) = E_t ℓ_t(i) and ℓ̄_t(i, j) = E_t(ℓ_t(i) | i* = j). Then

E_t r_t(I_t) ≤ C √(E_t v_t(I_t))  ⇔  E_t ℓ̄_t(I_t) − ∑_i π_t(i) ℓ̄_t(i, i) ≤ C √( E_t ∑_j π_t(j) (ℓ̄_t(I_t, j) − ℓ̄_t(I_t))² ).

For TS the following shows that one can take C = √n:

E_t ℓ̄_t(I_t) − ∑_i π_t(i) ℓ̄_t(i, i) = ∑_i π_t(i) (ℓ̄_t(i) − ℓ̄_t(i, i))
  ≤ √( n ∑_i π_t(i)² (ℓ̄_t(i) − ℓ̄_t(i, i))² )
  ≤ √( n ∑_{i,j} π_t(i) π_t(j) (ℓ̄_t(i) − ℓ̄_t(i, j))² ).

Thus TS always satisfies BR_T ≤ √(T n H(i*)) ≤ √(T n log(n)).

Side note: by the minimax theorem this implies there exists a strategy for the oblivious adversarial model with regret √(T n log(n)).

68 Summary of basic results

1. In the i.i.d. model UCB attains a regret of O(∑_i log(T)/Δ_i), and by the Lai and Robbins lower bound this is optimal (up to a multiplicative variance term).

2. In the adversarial model Exp3 attains a regret of O(√(T n log(n))) and this is optimal up to the logarithmic term.

3. In the Bayesian model, Gittins index gives an optimal strategy for the case of product priors. For general priors Thompson Sampling is a more flexible strategy. Its Bayesian regret is controlled by the entropy of the optimal decision. Moreover TS with an uninformative prior has frequentist guarantees comparable to UCB.

69 Part 2: Linear, non-linear, and contextual bandit

70–74 The linear bandit problem, Auer [2002]

Known parameters: compact action set A ⊂ R^n, adversary's action set L ⊂ R^n, number of rounds T.

Protocol: For each round t = 1, 2, ..., T, the adversary chooses a loss vector ℓ_t ∈ L and simultaneously the player chooses a_t ∈ A based on past observations and receives a loss/observation Y_t = ℓ_t^⊤ a_t.

R_T = E ∑_{t=1}^T ℓ_t^⊤ a_t − min_{a∈A} E ∑_{t=1}^T ℓ_t^⊤ a.

Other models: In the i.i.d. model we assume that there is some underlying θ ∈ L such that E(Y_t | a_t) = θ^⊤ a_t. In the Bayesian model we assume that we have a prior distribution ν over the sequence (ℓ_1, ..., ℓ_T) (in this case the expectation in R_T is also over (ℓ_1, ..., ℓ_T) ∼ ν). Alternatively we could assume a prior over θ.

Example: Part 1 was about A = {e_1, ..., e_n} and L = [0, 1]^n.

Assumption: unless specified otherwise we assume L = A° := {ℓ : sup_{a∈A} |ℓ^⊤ a| ≤ 1}.

75–83 Example: path planning

[Figure: an adversary assigns a loss ℓ_i to each edge of a graph; the player picks a path and suffers the sum of the losses on its edges, here ℓ_2 + ℓ_7 + ... + ℓ_n.]

Feedback models:
Full Info: ℓ_1, ℓ_2, ..., ℓ_n
Semi-Bandit: ℓ_2, ℓ_7, ..., ℓ_n (the losses of the edges on the chosen path)
Bandit: ℓ_2 + ℓ_7 + ... + ℓ_n (only the total loss of the chosen path)

84–87 Thompson Sampling for linear bandit after RVR14

Assume A = {a_1, ..., a_{|A|}}. Recall from Part 1 that TS satisfies

∑_i π_t(i) (ℓ̄_t(i) − ℓ̄_t(i, i)) ≤ C √( ∑_{i,j} π_t(i) π_t(j) (ℓ̄_t(i, j) − ℓ̄_t(i))² )  ⇒  R_T ≤ C √(T log(|A|)/2),

where ℓ̄_t(i) = E_t ℓ_t(i) and ℓ̄_t(i, j) = E_t(ℓ_t(i) | i* = j).

Writing ℓ̄_t(i) = a_i^⊤ ℓ̄_t, ℓ̄_t(i, j) = a_i^⊤ ℓ̄_t^j, and M_{i,j} = √(π_t(i) π_t(j)) a_i^⊤ (ℓ̄_t − ℓ̄_t^j), we want to show that Tr(M) ≤ C ‖M‖_F.

Using the eigenvalue formula for the trace and the Frobenius norm one can see that Tr(M)² ≤ rank(M) ‖M‖_F². Moreover the rank of M is at most n since M = U V^⊤ where U, V ∈ R^{|A|×n} (the i-th row of U is √(π_t(i)) a_i and for V it is √(π_t(i)) (ℓ̄_t − ℓ̄_t^i)).

88–90 Thompson Sampling for linear bandit after RVR14

1. TS satisfies R_T ≤ √(n T log(|A|)). To appreciate the improvement recall that without the linear structure one would get a regret of order √(|A| T), and that |A| can be exponential in the dimension n (think of the path planning example).

2. Provided that one can efficiently sample from the posterior on ℓ_t (or on θ), TS just requires at each step one linear optimization over A.

3. TS's regret bound is optimal in the following sense. W.l.o.g. one can assume |A| ≤ (10 T)^n and thus TS satisfies R_T = O(n √(T log(T))) for any action set. Furthermore one can show that there exists an action set and a prior such that for any strategy one has R_T = Ω(n √T), see Dani, Hayes and Kakade [2008], Rusmevichientong and Tsitsiklis [2010], and Audibert, Bubeck and Lugosi [2011, 2014].
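
A sketch of Thompson Sampling in the i.i.d. linear model under a Gaussian prior/noise assumption, so that the posterior on θ is Gaussian and easy to sample; the Gaussian modeling, λ, σ and the finite action set are my assumptions for illustration, not part of the slides:

```python
import numpy as np

def linear_ts(actions, theta, T, sigma=0.1, lam=1.0, seed=0):
    """Thompson Sampling for the i.i.d. linear bandit with a Gaussian posterior on theta."""
    rng = np.random.default_rng(seed)
    n = actions.shape[1]
    Sigma = lam * np.eye(n)          # lam * I + sum_s a_s a_s^T
    b = np.zeros(n)                  # sum_s a_s * Y_s
    regret, best = 0.0, (actions @ theta).min()
    for _ in range(T):
        mean = np.linalg.solve(Sigma, b)                      # posterior mean
        cov = sigma**2 * np.linalg.inv(Sigma)                 # posterior covariance
        theta_sample = rng.multivariate_normal(mean, cov)     # sample from the posterior
        a = actions[np.argmin(actions @ theta_sample)]        # one linear optimization over A
        y = a @ theta + sigma * rng.standard_normal()         # observed loss Y_t
        Sigma += np.outer(a, a)
        b += y * a
        regret += a @ theta - best
    return regret

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(linear_ts(A, theta=np.array([0.2, 0.5]), T=2000))
```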

91–94 Adversarial linear bandit after Dani, Hayes, Kakade [2008]

Recall from Part 1 that exponential weights satisfies, for any ℓ̃_t such that E ℓ̃_t(i) = ℓ_t(i) and ℓ̃_t(i) ≥ 0,

R̄_T ≤ max_i Ent(δ_i ‖ p_1)/η + (η/2) E ∑_t E_{I∼p_t} ℓ̃_t(I)².

DHK08 proposed the following (beautiful) unbiased estimator for the linear case:

ℓ̃_t = Σ_t^{−1} a_t a_t^⊤ ℓ_t  where  Σ_t = E_{a∼p_t}(a a^⊤).

Again, amazingly, the variance is automatically controlled:

E( E_{a∼p_t} (ℓ̃_t^⊤ a)² ) = E ℓ̃_t^⊤ Σ_t ℓ̃_t ≤ E a_t^⊤ Σ_t^{−1} a_t = E Tr(Σ_t^{−1} a_t a_t^⊤) = n.

Up to the issue that ℓ̃_t can take negative values this suggests the optimal √(n T log(|A|)) regret bound.
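
A small numerical sketch (my own, with illustrative names) of the DHK08 estimator ℓ̃_t = Σ_t^{−1} a_t a_t^⊤ ℓ_t: it checks unbiasedness E ℓ̃_t = ℓ_t and the identity E a_t^⊤ Σ_t^{−1} a_t = n by averaging over draws a_t ∼ p_t.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])   # finite action set A
p = np.array([0.5, 0.3, 0.2])                              # current sampling distribution p_t
ell = np.array([0.4, 0.9])                                 # true loss vector l_t

Sigma = sum(pi * np.outer(a, a) for pi, a in zip(p, actions))   # E_{a~p_t} a a^T
Sigma_inv = np.linalg.inv(Sigma)

est_mean, quad, N = np.zeros(2), 0.0, 50000
for _ in range(N):
    a = actions[rng.choice(len(actions), p=p)]              # a_t ~ p_t
    est_mean += Sigma_inv @ np.outer(a, a) @ ell / N        # \tilde l_t = Sigma^{-1} a a^T l_t
    quad += a @ Sigma_inv @ a / N

print(est_mean, ell)   # est_mean ~ ell (unbiasedness)
print(quad)            # ~ n = 2 (the variance control above)
```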

95–98 Adversarial linear bandit, further development

1. The non-negativity issue of ℓ̃_t is a manifestation of the need for an added exploration. DHK08 used a suboptimal exploration which led to an additional √n in the regret. This was later improved in Bubeck, Cesa-Bianchi, and Kakade [2012] with an exploration based on John's ellipsoid (smallest ellipsoid containing A).

2. Sampling the exponential weights distribution is usually computationally difficult, see Cesa-Bianchi and Lugosi [2009] for some exceptions.

3. Abernethy, Hazan and Rakhlin [2008] proposed an alternative (beautiful) strategy based on mirror descent. The key idea is to use an n-self-concordant barrier for conv(A) as a mirror map and to sample points uniformly in Dikin ellipses. This method's regret is suboptimal by a factor √n and the computational efficiency depends on the barrier being used.

4. Bubeck and Eldan [2014]'s entropic barrier allows for a much more information-efficient sampling than AHR08. This gives another strategy with optimal regret which is efficient when A is convex (and one can do linear optimization on A).

99–104 Adversarial combinatorial bandit after Audibert, Bubeck and Lugosi [2011, 2014]

Combinatorial setting: A ⊂ {0, 1}^n, max_{a∈A} ‖a‖_1 = m, L = [0, 1]^n.

1. The full information case goes back to the end of the 90's (Warmuth and co-authors); semi-bandit and bandit were introduced in Audibert, Bubeck and Lugosi [2011] (following several papers that studied specific sets A).

2. This is a natural setting to study FPL-type (Follow the Perturbed Leader) strategies, see e.g. Kalai and Vempala [2004] and more recently Devroye, Lugosi and Neu [2013].

3. ABL11: Exponential weights is provably suboptimal in this setting! This is in sharp contrast with the case where L = A°.

4. Optimal regret in the semi-bandit case is √(m n T) and it can be achieved with mirror descent and the natural unbiased estimator for the semi-bandit situation.

5. For the bandit case the bound for exponential weights from the previous slides gives m √(m n T). However the lower bound from ABL14 is m √(n T), which is conjectured to be tight.

105–109 Preliminaries for the i.i.d. case: a primer on least squares

Assume Y_t = θ^⊤ a_t + ξ_t where (ξ_t) is an i.i.d. sequence of centered and sub-Gaussian real-valued random variables. The (regularized) least squares estimator for θ based on Y^t = (Y_1, ..., Y_{t−1}) is, with A_t = (a_1 ... a_{t−1}) ∈ R^{n×(t−1)} and Σ_t = λ I_n + ∑_{s=1}^{t−1} a_s a_s^⊤:

θ̂_t = Σ_t^{−1} A_t Y^t.

Observe that we can also write θ = Σ_t^{−1} (A_t (Y^t + ε_t) + λ θ) where

ε_t = (E(Y_1 | a_1) − Y_1, ..., E(Y_{t−1} | a_{t−1}) − Y_{t−1}),

so that

‖θ − θ̂_t‖_{Σ_t} = ‖A_t ε_t + λ θ‖_{Σ_t^{−1}} ≤ ‖A_t ε_t‖_{Σ_t^{−1}} + √λ ‖θ‖.

A basic martingale argument (see e.g., Abbasi-Yadkori, Pál and Szepesvári [2011]) shows that w.p. ≥ 1 − δ, for all t ≥ 1,

‖A_t ε_t‖_{Σ_t^{−1}} ≲ √( logdet(Σ_t) + log(1/(δ² λ^n)) ).

Note that logdet(Σ_t) ≤ n log(Tr(Σ_t)/n) ≤ n log(λ + t/n) (w.l.o.g. we assumed ‖a_t‖ ≤ 1).
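
A short sketch of this regularized least squares estimator and the associated confidence width β ‖a‖_{Σ_t^{−1}}; the class and the specific constants inside `width` are my own choices, instantiating the n log(λ + t/n) bound above:

```python
import numpy as np

class RidgeEstimator:
    """Regularized least squares: theta_hat = Sigma^{-1} A Y with Sigma = lam*I + sum a a^T."""

    def __init__(self, n, lam=1.0):
        self.Sigma = lam * np.eye(n)
        self.AY = np.zeros(n)        # running sum of a_s * Y_s
        self.lam = lam
        self.t = 1

    def update(self, a, y):
        self.Sigma += np.outer(a, a)
        self.AY += y * a
        self.t += 1

    def theta_hat(self):
        return np.linalg.solve(self.Sigma, self.AY)

    def width(self, a, delta=0.01, theta_norm=1.0):
        # beta * ||a||_{Sigma^{-1}}, with beta from the slide's high-probability bound
        n = len(a)
        beta = np.sqrt(n * np.log(self.lam + self.t / n) + 2 * np.log(1 / delta)) \
               + np.sqrt(self.lam) * theta_norm
        return beta * np.sqrt(a @ np.linalg.solve(self.Sigma, a))

# Usage: observe noisy linear losses and track the estimate.
rng = np.random.default_rng(0)
theta = np.array([0.3, -0.2])
est = RidgeEstimator(n=2)
for _ in range(500):
    a = rng.standard_normal(2); a /= np.linalg.norm(a)    # keep ||a_t|| <= 1
    est.update(a, a @ theta + 0.1 * rng.standard_normal())
print(est.theta_hat(), theta)
```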

110–114 i.i.d. linear bandit after DHK08, RT10, AYPS11

Let β ≈ 2 √(n log(T)), and E_t = {θ' : ‖θ' − θ̂_t‖_{Σ_t} ≤ β}. We showed that w.p. ≥ 1 − 1/T² one has θ ∈ E_t for all t ∈ [T].

The appropriate generalization of UCB is to select:

(θ̃_t, a_t) = argmin_{(θ', a) ∈ E_t × A} θ'^⊤ a

(this optimization is NP-hard in general, more on that next slide). Then one has on the high-probability event:

∑_{t=1}^T θ^⊤ (a_t − a*) ≤ ∑_{t=1}^T (θ − θ̃_t)^⊤ a_t ≤ β ∑_{t=1}^T ‖a_t‖_{Σ_t^{−1}} ≤ β √( T ∑_t ‖a_t‖²_{Σ_t^{−1}} ).

To control the sum of squares we observe that:

det(Σ_{t+1}) = det(Σ_t) det(I_n + Σ_t^{−1/2} a_t (Σ_t^{−1/2} a_t)^⊤) = det(Σ_t) (1 + ‖a_t‖²_{Σ_t^{−1}}),

so that (assuming λ ≥ 1)

logdet(Σ_{T+1}) − logdet(Σ_1) = ∑_t log(1 + ‖a_t‖²_{Σ_t^{−1}}) ≥ ½ ∑_t ‖a_t‖²_{Σ_t^{−1}}.

Putting things together we see that the regret is O(n log(T) √T).
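
A sketch of this optimistic strategy for a finite action set, where the joint minimization over (θ', a) ∈ E_t × A reduces, per action, to θ̂_t^⊤ a − β ‖a‖_{Σ_t^{−1}}; the environment, the β constant and the action set are illustrative assumptions:

```python
import numpy as np

def lin_ucb(actions, theta, T, lam=1.0, sigma=0.1, seed=0):
    """Optimism for the i.i.d. linear bandit: pick a_t minimizing the optimistic loss."""
    rng = np.random.default_rng(seed)
    n = actions.shape[1]
    Sigma = lam * np.eye(n)
    AY = np.zeros(n)
    beta = 2.0 * np.sqrt(n * np.log(T))          # confidence width from the slide
    regret, best = 0.0, (actions @ theta).min()
    for _ in range(T):
        theta_hat = np.linalg.solve(Sigma, AY)
        Sigma_inv = np.linalg.inv(Sigma)
        # smallest loss achievable by each action over the confidence ellipsoid E_t
        optimistic = actions @ theta_hat - beta * np.sqrt(
            np.einsum('ij,jk,ik->i', actions, Sigma_inv, actions))
        a = actions[np.argmin(optimistic)]
        y = a @ theta + sigma * rng.standard_normal()
        Sigma += np.outer(a, a)
        AY += y * a
        regret += a @ theta - best
    return regret

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.5, -0.5]])
print(lin_ucb(A, theta=np.array([0.2, 0.5]), T=3000))
```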

115–116 What's the point of i.i.d. linear bandit?

So far we did not get any real benefit from the i.i.d. assumption (the regret guarantee we obtained is the same as for the adversarial model). To me the key benefit is in the simplicity of the i.i.d. algorithm, which makes it easy to incorporate further assumptions.

1. Sparsity of θ: instead of regularization with the ℓ_2-norm to define θ̂ one could regularize with the ℓ_1-norm, see e.g., Johnson, Sivakumar and Banerjee [2016].


Improved Algorithms for Linear Stochastic Bandits Improved Algorithms for Linear Stochastic Bandits Yasin Abbasi-Yadkori abbasiya@ualberta.ca Dept. of Computing Science University of Alberta Dávid Pál dpal@google.com Dept. of Computing Science University

More information

Multiple Identifications in Multi-Armed Bandits

Multiple Identifications in Multi-Armed Bandits Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao

More information

Bandits : optimality in exponential families

Bandits : optimality in exponential families Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities

More information

THE first formalization of the multi-armed bandit problem

THE first formalization of the multi-armed bandit problem EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can

More information

Sparse Linear Contextual Bandits via Relevance Vector Machines

Sparse Linear Contextual Bandits via Relevance Vector Machines Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,

More information

Csaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008

Csaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008 LEARNING THEORY OF OPTIMAL DECISION MAKING PART I: ON-LINE LEARNING IN STOCHASTIC ENVIRONMENTS Csaba Szepesvári 1 1 Department of Computing Science University of Alberta Machine Learning Summer School,

More information

Two optimization problems in a stochastic bandit model

Two optimization problems in a stochastic bandit model Two optimization problems in a stochastic bandit model Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishnan Journées MAS 204, Toulouse Outline From stochastic optimization

More information

arxiv: v4 [cs.lg] 22 Jul 2014

arxiv: v4 [cs.lg] 22 Jul 2014 Learning to Optimize Via Information-Directed Sampling Daniel Russo and Benjamin Van Roy July 23, 2014 arxiv:1403.5556v4 cs.lg] 22 Jul 2014 Abstract We propose information-directed sampling a new algorithm

More information

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse

More information

Learning to Optimize via Information-Directed Sampling

Learning to Optimize via Information-Directed Sampling Learning to Optimize via Information-Directed Sampling Daniel Russo Stanford University Stanford, CA 94305 djrusso@stanford.edu Benjamin Van Roy Stanford University Stanford, CA 94305 bvr@stanford.edu

More information

Bandit Convex Optimization: T Regret in One Dimension

Bandit Convex Optimization: T Regret in One Dimension Bandit Convex Optimization: T Regret in One Dimension arxiv:1502.06398v1 [cs.lg 23 Feb 2015 Sébastien Bubeck Microsoft Research sebubeck@microsoft.com Tomer Koren Technion tomerk@technion.ac.il February

More information

Anytime optimal algorithms in stochastic multi-armed bandits

Anytime optimal algorithms in stochastic multi-armed bandits Rémy Degenne LPMA, Université Paris Diderot Vianney Perchet CREST, ENSAE REMYDEGENNE@MATHUNIV-PARIS-DIDEROTFR VIANNEYPERCHET@NORMALESUPORG Abstract We introduce an anytime algorithm for stochastic multi-armed

More information

Minimax Policies for Combinatorial Prediction Games

Minimax Policies for Combinatorial Prediction Games Minimax Policies for Combinatorial Prediction Games Jean-Yves Audibert Imagine, Univ. Paris Est, and Sierra, CNRS/ENS/INRIA, Paris, France audibert@imagine.enpc.fr Sébastien Bubeck Centre de Recerca Matemàtica

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

Regret in Online Combinatorial Optimization

Regret in Online Combinatorial Optimization Regret in Online Combinatorial Optimization Jean-Yves Audibert Imagine, Université Paris Est, and Sierra, CNRS/ENS/INRIA audibert@imagine.enpc.fr Sébastien Bubeck Department of Operations Research and

More information

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3

More information

New Algorithms for Contextual Bandits

New Algorithms for Contextual Bandits New Algorithms for Contextual Bandits Lev Reyzin Georgia Institute of Technology Work done at Yahoo! 1 S A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire Contextual Bandit Algorithms with Supervised

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Adversarial bandits Definition: sequential game. Lower bounds on regret from the stochastic case. Exp3: exponential weights

More information

Bandit Optimization: Theory and Applications

Bandit Optimization: Theory and Applications 1 / 40 Bandit Optimization: Theory and Applications Richard Combes 1 and Alexandre Proutière 2 1 Centrale-Supélec / L2S, France 2 KTH, Royal Institute of Technology, Sweden. SIGMETRICS 2015 2 / 40 Outline

More information

Ordinal Optimization and Multi Armed Bandit Techniques

Ordinal Optimization and Multi Armed Bandit Techniques Ordinal Optimization and Multi Armed Bandit Techniques Sandeep Juneja. with Peter Glynn September 10, 2014 The ordinal optimization problem Determining the best of d alternative designs for a system, on

More information

The multi armed-bandit problem

The multi armed-bandit problem The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization

More information

Introduction to Reinforcement Learning Part 3: Exploration for decision making, Application to games, optimization, and planning

Introduction to Reinforcement Learning Part 3: Exploration for decision making, Application to games, optimization, and planning Introduction to Reinforcement Learning Part 3: Exploration for decision making, Application to games, optimization, and planning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/

More information

Learning to play K-armed bandit problems

Learning to play K-armed bandit problems Learning to play K-armed bandit problems Francis Maes 1, Louis Wehenkel 1 and Damien Ernst 1 1 University of Liège Dept. of Electrical Engineering and Computer Science Institut Montefiore, B28, B-4000,

More information

An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes

An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes Prokopis C. Prokopiou, Peter E. Caines, and Aditya Mahajan McGill University

More information

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Sandeep Juneja Tata Institute of Fundamental Research Mumbai, India joint work with Peter Glynn Applied

More information

Multi-Armed Bandit Formulations for Identification and Control

Multi-Armed Bandit Formulations for Identification and Control Multi-Armed Bandit Formulations for Identification and Control Cristian R. Rojas Joint work with Matías I. Müller and Alexandre Proutiere KTH Royal Institute of Technology, Sweden ERNSI, September 24-27,

More information

arxiv: v2 [stat.ml] 19 Jul 2012

arxiv: v2 [stat.ml] 19 Jul 2012 Thompson Sampling: An Asymptotically Optimal Finite Time Analysis Emilie Kaufmann, Nathaniel Korda and Rémi Munos arxiv:105.417v [stat.ml] 19 Jul 01 Telecom Paristech UMR CNRS 5141 & INRIA Lille - Nord

More information

From Bandits to Experts: A Tale of Domination and Independence

From Bandits to Experts: A Tale of Domination and Independence From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A

More information

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement

More information

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, joint work with Emilie Kaufmann CNRS, CRIStAL) to be presented at COLT 16, New York

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Bandit Algorithms for Pure Exploration: Best Arm Identification and Game Tree Search. Wouter M. Koolen

Bandit Algorithms for Pure Exploration: Best Arm Identification and Game Tree Search. Wouter M. Koolen Bandit Algorithms for Pure Exploration: Best Arm Identification and Game Tree Search Wouter M. Koolen Machine Learning and Statistics for Structures Friday 23 rd February, 2018 Outline 1 Intro 2 Model

More information

Stochastic bandits: Explore-First and UCB

Stochastic bandits: Explore-First and UCB CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this

More information

arxiv: v2 [cs.lg] 3 Nov 2012

arxiv: v2 [cs.lg] 3 Nov 2012 arxiv:1204.5721v2 [cs.lg] 3 Nov 2012 Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems Sébastien Bubeck 1 and Nicolò Cesa-Bianchi 2 1 Department of Operations Research and Financial

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

Bandit Convex Optimization

Bandit Convex Optimization March 7, 2017 Table of Contents 1 (BCO) 2 Projection Methods 3 Barrier Methods 4 Variance reduction 5 Other methods 6 Conclusion Learning scenario Compact convex action set K R d. For t = 1 to T : Predict

More information

Regional Multi-Armed Bandits

Regional Multi-Armed Bandits School of Information Science and Technology University of Science and Technology of China {wzy43, zrd7}@mail.ustc.edu.cn, congshen@ustc.edu.cn Abstract We consider a variant of the classic multiarmed

More information

Better Algorithms for Benign Bandits

Better Algorithms for Benign Bandits Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com

More information

Learning, Games, and Networks

Learning, Games, and Networks Learning, Games, and Networks Abhishek Sinha Laboratory for Information and Decision Systems MIT ML Talk Series @CNRG December 12, 2016 1 / 44 Outline 1 Prediction With Experts Advice 2 Application to

More information

Online Learning with Gaussian Payoffs and Side Observations

Online Learning with Gaussian Payoffs and Side Observations Online Learning with Gaussian Payoffs and Side Observations Yifan Wu 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science University of Alberta 2 Department of Electrical and Electronic

More information

The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist

More information

Stochastic Regret Minimization via Thompson Sampling

Stochastic Regret Minimization via Thompson Sampling JMLR: Workshop and Conference Proceedings vol 35:1 22, 2014 Stochastic Regret Minimization via Thompson Sampling Sudipto Guha Department of Computer and Information Sciences, University of Pennsylvania,

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations Tomáš Kocák, Gergely Neu, Michal Valko, Rémi Munos SequeL team, INRIA Lille - Nord Europe, France SequeL INRIA Lille

More information

arxiv: v3 [cs.lg] 30 Jun 2012

arxiv: v3 [cs.lg] 30 Jun 2012 arxiv:05874v3 [cslg] 30 Jun 0 Orly Avner Shie Mannor Department of Electrical Engineering, Technion Ohad Shamir Microsoft Research New England Abstract We consider a multi-armed bandit problem where the

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION

KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION Submitted to the Annals of Statistics arxiv: math.pr/0000000 KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION By Olivier Cappé 1, Aurélien Garivier 2, Odalric-Ambrym Maillard

More information

An adaptive algorithm for finite stochastic partial monitoring

An adaptive algorithm for finite stochastic partial monitoring An adaptive algorithm for finite stochastic partial monitoring Gábor Bartók Navid Zolghadr Csaba Szepesvári Department of Computing Science, University of Alberta, AB, Canada, T6G E8 bartok@ualberta.ca

More information

A minimax and asymptotically optimal algorithm for stochastic bandits

A minimax and asymptotically optimal algorithm for stochastic bandits Proceedings of Machine Learning Research 76:1 15, 017 Algorithmic Learning heory 017 A minimax and asymptotically optimal algorithm for stochastic bandits Pierre Ménard Aurélien Garivier Institut de Mathématiques

More information

The Multi-Arm Bandit Framework

The Multi-Arm Bandit Framework The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94

More information

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models c Qing Zhao, UC Davis. Talk at Xidian Univ., September, 2011. 1 Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models Qing Zhao Department of Electrical and Computer Engineering University

More information

From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning

From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning Rémi Munos To cite this version: Rémi Munos. From Bandits to Monte-Carlo Tree Search: The Optimistic

More information

Combinatorial Multi-Armed Bandit: General Framework, Results and Applications

Combinatorial Multi-Armed Bandit: General Framework, Results and Applications Combinatorial Multi-Armed Bandit: General Framework, Results and Applications Wei Chen Microsoft Research Asia, Beijing, China Yajun Wang Microsoft Research Asia, Beijing, China Yang Yuan Computer Science

More information

Importance Weighting Without Importance Weights: An Efficient Algorithm for Combinatorial Semi-Bandits

Importance Weighting Without Importance Weights: An Efficient Algorithm for Combinatorial Semi-Bandits Journal of Machine Learning Research 17 (2016 1-21 Submitted 3/15; Revised 7/16; Published 8/16 Importance Weighting Without Importance Weights: An Efficient Algorithm for Combinatorial Semi-Bandits Gergely

More information

Thompson Sampling for Contextual Bandits with Linear Payoffs

Thompson Sampling for Contextual Bandits with Linear Payoffs Thompson Sampling for Contextual Bandits with Linear Payoffs Shipra Agrawal Microsoft Research Navin Goyal Microsoft Research Abstract Thompson Sampling is one of the oldest heuristics for multi-armed

More information

Lecture 5: Regret Bounds for Thompson Sampling

Lecture 5: Regret Bounds for Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/2/6 Lecture 5: Regret Bounds for Thompson Sampling Instructor: Alex Slivkins Scribed by: Yancy Liao Regret Bounds for Thompson Sampling For each round t, we defined

More information

The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem Università degli Studi di Milano The bandit problem [Robbins, 1952]... K slot machines Rewards X i,1, X i,2,... of machine i are i.i.d. [0, 1]-valued random variables An allocation policy prescribes which

More information

Online learning with feedback graphs and switching costs

Online learning with feedback graphs and switching costs Online learning with feedback graphs and switching costs A Proof of Theorem Proof. Without loss of generality let the independent sequence set I(G :T ) formed of actions (or arms ) from to. Given the sequence

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations omáš Kocák Gergely Neu Michal Valko Rémi Munos SequeL team, INRIA Lille Nord Europe, France {tomas.kocak,gergely.neu,michal.valko,remi.munos}@inria.fr

More information

Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors

Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conugate Priors Yichi Zhou 1 Jun Zhu 1 Jingwe Zhuo 1 Abstract Thompson sampling has impressive empirical performance for many multi-armed

More information

University of Alberta. The Role of Information in Online Learning

University of Alberta. The Role of Information in Online Learning University of Alberta The Role of Information in Online Learning by Gábor Bartók A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree

More information

More Adaptive Algorithms for Adversarial Bandits

More Adaptive Algorithms for Adversarial Bandits Proceedings of Machine Learning Research vol 75: 29, 208 3st Annual Conference on Learning Theory More Adaptive Algorithms for Adversarial Bandits Chen-Yu Wei University of Southern California Haipeng

More information

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models Revisiting the Exploration-Exploitation Tradeoff in Bandit Models joint work with Aurélien Garivier (IMT, Toulouse) and Tor Lattimore (University of Alberta) Workshop on Optimization and Decision-Making

More information

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre

More information

Pure Exploration Stochastic Multi-armed Bandits

Pure Exploration Stochastic Multi-armed Bandits C&A Workshop 2016, Hangzhou Pure Exploration Stochastic Multi-armed Bandits Jian Li Institute for Interdisciplinary Information Sciences Tsinghua University Outline Introduction 2 Arms Best Arm Identification

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

Piecewise-stationary Bandit Problems with Side Observations

Piecewise-stationary Bandit Problems with Side Observations Jia Yuan Yu jia.yu@mcgill.ca Department Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada. Shie Mannor shie.mannor@mcgill.ca; shie@ee.technion.ac.il Department Electrical

More information

Tor Lattimore & Csaba Szepesvári

Tor Lattimore & Csaba Szepesvári Bandit Algorithms Tor Lattimore & Csaba Szepesvári Outline 1 From Contextual to Linear Bandits 2 Stochastic Linear Bandits 3 Confidence Bounds for Least-Squares Estimators 4 Improved Regret for Fixed,

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

Dynamic resource allocation: Bandit problems and extensions

Dynamic resource allocation: Bandit problems and extensions Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014 The Bandit Model Roadmap 1 The Bandit

More information