EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University


1 EASINESS IN BANDITS Gergely Neu Pompeu Fabra University

5 THE BANDIT PROBLEM
Play for T rounds, attempting to maximize rewards / minimize losses.
Need to balance exploration and exploitation.
Motivation: advertising, clinical trials, ...

6 EASINESS IN BANDITS: A TUTORIAL
Hardness in bandits: worst-case upper & lower bounds
Easiness in bandits:
- Higher-order bounds
- Stochastic bandits and the best of both worlds
- Prior-dependent bounds

8 NON-STOCHASTIC BANDITS
Parameters: number of arms K, number of rounds T.
Interaction: for each round t = 1, 2, ..., T:
- Learner chooses action I_t ∈ [K]
- Environment chooses losses ℓ_{t,i} ∈ [0,1] for all i
- Learner incurs and observes the loss ℓ_{t,I_t}
Goal: minimize the expected regret
R_T = E[ ∑_{t=1}^T ℓ_{t,I_t} ] - min_{i ∈ [K]} ∑_{t=1}^T ℓ_{t,i}

10 NON-STOCHASTIC BANDITS: LOWER BOUNDS
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): In the worst case, any algorithm suffers a regret of Ω(√(KT)).
This result also holds for stochastic bandits, as the counterexample is stochastic.
This talk: how to go beyond this.

11 NON-STOCHASTIC BANDITS: UPPER BOUNDS
EXP3 (Auer, Cesa-Bianchi, Freund and Schapire, 1995, 2002)
Parameter: η > 0. Initialization: for all i, set w_{1,i} = 1.
For each round t = 1, 2, ..., T:
- For all i, let p_{t,i} = w_{t,i} / ∑_j w_{t,j}.
- Draw I_t ~ p_t.
- For all i, let ℓ̂_{t,i} = (ℓ_{t,i} / p_{t,i}) · 1{I_t = i}.
- For all i, update the weight as w_{t+1,i} = w_{t,i} e^{-η ℓ̂_{t,i}}.

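A minimal Python sketch of the EXP3 loop above. The environment callback loss_fn and the Bernoulli test losses are illustrative assumptions, not part of the slides:

```python
import numpy as np

def exp3(loss_fn, K, T, eta, seed=0):
    """Minimal EXP3 sketch: importance-weighted loss estimates + exponential weights."""
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        probs = weights / weights.sum()      # p_{t,i} = w_{t,i} / sum_j w_{t,j}
        arm = rng.choice(K, p=probs)         # draw I_t ~ p_t
        loss = loss_fn(t, arm)               # observe only the played arm's loss in [0,1]
        total_loss += loss
        est = np.zeros(K)
        est[arm] = loss / probs[arm]         # unbiased estimate: l_{t,i} 1{I_t = i} / p_{t,i}
        weights *= np.exp(-eta * est)        # multiplicative update with the loss estimate
    return total_loss

# Example run on a hypothetical Bernoulli environment.
if __name__ == "__main__":
    K, T = 5, 10000
    eta = np.sqrt(2 * np.log(K) / (K * T))   # tuning suggested by the regret analysis
    means = np.linspace(0.2, 0.7, K)
    rng = np.random.default_rng(1)
    print(exp3(lambda t, i: float(rng.random() < means[i]), K, T, eta))
```
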
14 THE REGRET OF EXP3
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): The regret of EXP3 satisfies R_T ≤ √(2KT log K).
Proof sketch:
R_T ≤ log K / η + (η/2) E[ ∑_{t=1}^T ∑_{i=1}^K p_{t,i} ℓ̂_{t,i}² ]
    ≤ log K / η + (η/2) ∑_{t=1}^T ∑_{i=1}^K ℓ_{t,i}²
    ≤ log K / η + η K T / 2
    = √(2KT log K)   for η = √(2 log K / (KT)).

17 HEY, BUT THAT'S NOT MINIMAX!
Exp3 is strictly suboptimal: you can't remove the log K (Audibert, Bubeck and Lugosi, 2014).
A minimax algorithm: PolyINF
p_t = arg min_{p ∈ Δ_K} { η ⟨p, L̂_{t-1}⟩ + S_α(p) },
where S_α(p) is the Tsallis entropy: S_α(p) = (1/(1-α)) (1 - ∑_{i=1}^K p_i^α).
Theorem (Audibert and Bubeck, 2009; Audibert, Bubeck and Lugosi, 2014; Abernethy, Lee and Tewari, 2015): The regret of PolyINF satisfies R_T ≤ 2√(KT).

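A small sketch of the PolyINF/FTRL update for the popular case α = 1/2, where the first-order conditions give p_{t,i} = 1/(η L̂_{t-1,i} + λ)² with λ found by bisection so that the probabilities sum to one. The bisection bounds, iteration count, and the surrounding bandit loop are my own choices for illustration, not taken from the slides:

```python
import numpy as np

def tsallis_half_probs(cum_loss_est, eta, iters=100):
    """FTRL distribution for the 1/2-Tsallis entropy:
    p_i = 1 / (eta * Lhat_i + lam)^2, with lam found by bisection so that sum(p) = 1."""
    L = np.asarray(cum_loss_est, dtype=float)
    lo = -eta * L.min()                      # probabilities blow up as lam -> lo from above
    hi = np.sqrt(len(L)) - eta * L.min()     # here every p_i <= 1/K, so the sum is <= 1
    for _ in range(iters):
        lam = (lo + hi) / 2.0
        s = np.sum(1.0 / (eta * L + lam) ** 2)
        lo, hi = (lam, hi) if s > 1.0 else (lo, lam)
    p = 1.0 / (eta * L + (lo + hi) / 2.0) ** 2
    return p / p.sum()                       # final renormalization for numerical safety

def poly_inf(loss_fn, K, T, eta, seed=0):
    """Bandit loop around the 1/2-Tsallis update, with importance-weighted loss estimates."""
    rng = np.random.default_rng(seed)
    Lhat = np.zeros(K)
    for t in range(T):
        p = tsallis_half_probs(Lhat, eta)
        arm = rng.choice(K, p=p)
        Lhat[arm] += loss_fn(t, arm) / p[arm]   # same unbiased estimate as in EXP3
    return Lhat
```
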
18 BEYOND MINIMAX #1: HIGHER-ORDER BOUNDS

20 HIGHER-ORDER BOUNDS
Full information vs. bandit (i* denotes the best arm; the bandit column gets filled in as the talk proceeds):
- minimax: full info R_T = O(√(T log K)); bandit R_T = O(√(KT))
- first-order, L_{T,i} = ∑_t ℓ_{t,i}: full info R_T = O(√(L_{T,i*} log K)); bandit ?
- second-order, S_{T,i} = ∑_t ℓ_{t,i}²: full info R_T = O(√(S_{T,i*} log K)) (Cesa-Bianchi, Mansour, Stoltz, 2005); bandit ? Easy!
- variance, V_{T,i} = ∑_t (ℓ_{t,i} - μ_{T,i})²: full info R_T = O(√(V_{T,i*} log K)) (Hazan and Kale, 2010, with a little cheating); bandit ?

23 SECOND-ORDER BOUNDS
The Exp3 proof again:
R_T ≤ log K / η + (η/2) E[ ∑_{t=1}^T ∑_{i=1}^K p_{t,i} ℓ̂_{t,i}² ] ≤ log K / η + (η/2) ∑_{t=1}^T ∑_{i=1}^K ℓ_{t,i}²
Easy!
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): The regret of EXP3 satisfies
R_T ≤ √( 2 log K ∑_{t=1}^T ∑_{i=1}^K ℓ_{t,i}² ).

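The step from the display above to the theorem is the usual tuning of η; a worked version in LaTeX (the remark about adaptive tuning is my addition, since the sum is not known in advance):

```latex
% Minimize  a/\eta + \eta b/2  over \eta > 0, where
%   a = \log K  and  b = \sum_{t=1}^T \sum_{i=1}^K \ell_{t,i}^2.
% The minimizer is \eta^\star = \sqrt{2a/b}, with value \sqrt{2ab}:
\[
  R_T \;\le\; \frac{\log K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T\sum_{i=1}^K \ell_{t,i}^2
  \;=\; \sqrt{2\,\log K \sum_{t=1}^T\sum_{i=1}^K \ell_{t,i}^2}
  \qquad\text{for } \eta = \sqrt{2\log K \Big/ \textstyle\sum_{t,i}\ell_{t,i}^2}.
\]
% In practice the optimal \eta depends on the unknown sum, so one would tune it
% adaptively (e.g., with a doubling trick); this remark is not from the slide.
```
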
26 HIGHER-ORDER BOUNDS
Full information vs. bandit (i* denotes the best arm):
- minimax: full info R_T = O(√(T log K)); bandit R_T = O(√(KT))
- first-order, L_{T,i} = ∑_t ℓ_{t,i}: full info R_T = O(√(L_{T,i*} log K)); bandit ?
- second-order, S_{T,i} = ∑_t ℓ_{t,i}²: full info R_T = O(√(S_{T,i*} log K)) (Cesa-Bianchi, Mansour, Stoltz, 2005); bandit R_T = O(√(∑_i S_{T,i})) (Auer et al., 2002, + some hacking)
- variance, V_{T,i} = ∑_t (ℓ_{t,i} - μ_{T,i})²: full info R_T = O(√(V_{T,i*} log K)) (Hazan and Kale, 2010, with a little cheating); bandit ? not so easy

29 VARIANCE BOUNDS: not so easy
Need to replace ∑_t ∑_i ℓ_{t,i}² by ∑_t ∑_i (ℓ_{t,i} - μ_{T,i})², where μ_{T,i} = (1/T) ∑_t ℓ_{t,i}.
Hazan and Kale (2011), heavily paraphrased:
- Replace μ_{T,i} by μ_{t,i} (easy)
- Estimate μ_{t,i} by an appropriate μ̂_{t,i}: reservoir sampling in exploration rounds
- Use Exp3 with loss estimates ℓ̂_{t,i} = (ℓ_{t,i} - μ̂_{t,i}) · 1{I_t = i} / p_{t,i} + μ̂_{t,i}
But that doesn't work!

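A sketch of the two ingredients in the recipe as paraphrased above: a reservoir-sampling estimate of each arm's running mean, and the shifted importance-weighted loss estimate built from it. The reservoir size and the way it would be fed in exploration rounds are illustrative assumptions, not the paper's actual parameters:

```python
import numpy as np

class ReservoirMean:
    """Fixed-size uniform sample of an arm's observed losses (reservoir sampling);
    its average serves as the running mean estimate mu_hat_{t,i}."""
    def __init__(self, size, rng):
        self.size, self.rng, self.buf, self.seen = size, rng, [], 0

    def update(self, loss):
        self.seen += 1
        if len(self.buf) < self.size:
            self.buf.append(loss)
        else:
            j = self.rng.integers(self.seen)   # keep the new item with probability size/seen
            if j < self.size:
                self.buf[j] = loss

    def mean(self):
        return float(np.mean(self.buf)) if self.buf else 0.0

def shifted_estimate(loss, arm, probs, mu_hat):
    """Loss estimates  l_hat_{t,i} = (l_{t,i} - mu_hat_{t,i}) 1{I_t = i} / p_{t,i} + mu_hat_{t,i}."""
    est = np.array(mu_hat, dtype=float)            # every arm gets its mean estimate as a baseline
    est[arm] += (loss - mu_hat[arm]) / probs[arm]  # importance-weighted correction on the played arm
    return est
```
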
33 THE RIGHT WAY TO GET VARIANCE BOUNDS
Instead of Exp3, use SCRiBLe:
p_t = arg min_{p ∈ Δ_K} { ⟨p, L̃_{t-1}⟩ + Ψ(p) },   with   L̃_{t-1,i} = ∑_{s=1}^{t-1} ( ĉ_{s,i} + μ̂_{s,i} ),
where Ψ is a self-concordant regularizer and ĉ_{t,i} is an appropriate unbiased estimate of ℓ_{t,i} - μ_{t,i}.
Theorem (Hazan and Kale, 2011): The regret of the above algorithm satisfies
R_T = O( K² √( ∑_{t=1}^T ∑_{i=1}^K (ℓ_{t,i} - μ_{T,i})² ) ).

36 HIGHER-ORDER BOUNDS
Full information vs. bandit (i* denotes the best arm):
- minimax: full info R_T = O(√(T log K)); bandit R_T = O(√(KT))
- first-order, L_{T,i} = ∑_t ℓ_{t,i}: full info R_T = O(√(L_{T,i*} log K)); bandit ? should be easy?
- second-order, S_{T,i} = ∑_t ℓ_{t,i}²: full info R_T = O(√(S_{T,i*} log K)) (Cesa-Bianchi, Mansour, Stoltz, 2005); bandit R_T = O(√(∑_i S_{T,i})) (Auer et al., 2002, + some hacking)
- variance, V_{T,i} = ∑_t (ℓ_{t,i} - μ_{T,i})²: full info R_T = O(√(V_{T,i*} log K)) (Hazan and Kale, 2010, with a little cheating); bandit R_T = O(K² √(∑_i V_{T,i})) (Hazan and Kale, 2011)

42 FIRST-ORDER BOUNDS (should be easy?)
Small-gain bounds: consider the gain game with g_{t,i} = 1 - ℓ_{t,i}.
Auer, Cesa-Bianchi, Freund and Schapire (2002): R_T = O(√(K G_{T,i*} log K)), where G_{T,i} = ∑_t g_{t,i}.
Problem: only good if the best expert is bad!
A little trickier analysis gives R_T = O(√( (∑_t ∑_i g_{t,i}) log K )) or R_T = O(√( (∑_t ∑_i ℓ_{t,i}) log K )).
Problem: one misbehaving action ruins the bound!
Actual first-order bounds (in terms of L_T* = min_i L_{T,i}):
- Stoltz (2005): K √(L_T*)
- Allenberg, Auer, Györfi and Ottucsák (2006): √(K L_T*)
- Rakhlin and Sridharan (2013): K^{3/2} √(L_T*)

44 THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Green (Allenberg, Auer, Györfi and Ottucsák, 2006) is EXP3 with one extra thresholding step (the slide shows it side by side with EXP3; everything else is unchanged):
Parameters: η > 0, γ ∈ (0,1). Initialization: for all i, set w_{1,i} = 1.
For each round t = 1, 2, ..., T:
- For all i, let p_{t,i} = w_{t,i} / ∑_j w_{t,j}, and set p_{t,i} = 0 if p_{t,i} < γ.
- Draw I_t ~ p_t.
- For all i, let ℓ̂_{t,i} = (ℓ_{t,i} / p_{t,i}) · 1{I_t = i}.
- For all i, update the weight as w_{t+1,i} = w_{t,i} e^{-η ℓ̂_{t,i}}.

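A minimal sketch of Green as described above: EXP3 plus the thresholding step. Renormalizing after zeroing out the small probabilities is my assumption about the intended behaviour; the slide only states that those probabilities are set to zero:

```python
import numpy as np

def green(loss_fn, K, T, eta, gamma, seed=0):
    """EXP3 with probability thresholding (Allenberg et al., 2006), sketched.
    Assumes gamma < 1/K, so at least one arm always survives the thresholding."""
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    for t in range(T):
        probs = weights / weights.sum()
        probs[probs < gamma] = 0.0        # drop arms whose probability fell below gamma
        probs /= probs.sum()              # assumption: renormalize the surviving arms
        arm = rng.choice(K, p=probs)
        loss = loss_fn(t, arm)
        est = np.zeros(K)
        est[arm] = loss / probs[arm]      # importance weight is now at most 1/gamma
        weights *= np.exp(-eta * est)
    return weights
```
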
47 THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Analysis idea:
- As long as p_{t,i} ≥ γ for an arm i, we have L̂_{t-1,i} ≤ L̂_{t-1,j} + O(log(1/γ)/η): the loss estimates are not too far apart.
- Once p_{t,i} < γ occurs, L̂_{t,i} stops growing, so L̂_{T,i} ≤ L̂_{T,j} + O(log(1/γ)/η) + O(1/γ).

52 THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Getting back to the Exp3 proof:
R_T ≤ log K / η + (η/2) E[ ∑_{t=1}^T ∑_{i=1}^K p_{t,i} ℓ̂_{t,i}² ]
    ≤ log K / η + (η/2) E[ ∑_{i=1}^K L̂_{T,i} ]          (since p_{t,i} ℓ̂_{t,i}² ≤ ℓ̂_{t,i})
    ≤ log K / η + (η/2) E[ K L̂_{T,i*} + O(K) ]           (the estimates are not too far apart)
    ≤ log K / η + (η/2) ( K L_{T,i*} + O(K) ).
Theorem (Allenberg et al., 2006): The regret of Green satisfies R_T = O( √(K L_T*) + K ).

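Tuning η in the last display is the same optimization as before. A worked version, writing L_T* for the loss of the best arm; note that this simple tuning yields an extra √(log K) factor compared to the bound as stated on the slide:

```latex
% Minimize  \log K/\eta + (\eta/2)(K L_T^* + O(K))  over \eta > 0:
\[
  R_T \;\le\; \frac{\log K}{\eta} + \frac{\eta}{2}\bigl(K L_T^* + O(K)\bigr)
      \;\le\; \sqrt{2\log K\,\bigl(K L_T^* + O(K)\bigr)}
      \;=\; O\!\Bigl(\sqrt{K L_T^* \log K} + K\Bigr),
\]
% where the middle step uses \eta = \sqrt{2\log K/(K L_T^* + O(K))}, and the last step
% uses \sqrt{a+b} \le \sqrt{a}+\sqrt{b} together with \sqrt{K\log K} = O(K).
```
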
55 A SIMPLER ALGORITHM: EXP3-IX
EXP3-IX (Kocák et al., 2014; Neu, 2015a; Neu, 2015b) is EXP3 with a biased, "implicitly exploring" loss estimate (again shown side by side with EXP3 on the slides; only the estimator changes):
Parameters: η > 0, γ > 0. Initialization: for all i, set w_{1,i} = 1.
For each round t = 1, 2, ..., T:
- For all i, let p_{t,i} = w_{t,i} / ∑_j w_{t,j}.
- Draw I_t ~ p_t.
- For all i, let ℓ̂_{t,i} = ℓ_{t,i} · 1{I_t = i} / (p_{t,i} + γ).
- For all i, update the weight as w_{t+1,i} = w_{t,i} e^{-η ℓ̂_{t,i}}.
Theorem (Neu, 2015): The regret of Exp3-IX satisfies R_T = O( √(K L_T*) + K ).

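A sketch of EXP3-IX; compared with the EXP3 sketch earlier, only the construction of the loss estimate changes (the γ in the denominator):

```python
import numpy as np

def exp3_ix(loss_fn, K, T, eta, gamma, seed=0):
    """EXP3 with implicit exploration: divide by p_{t,i} + gamma instead of p_{t,i}."""
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    for t in range(T):
        probs = weights / weights.sum()
        arm = rng.choice(K, p=probs)
        loss = loss_fn(t, arm)
        est = np.zeros(K)
        est[arm] = loss / (probs[arm] + gamma)   # IX estimate: biased downward, but with controlled variance
        weights *= np.exp(-eta * est)
    return weights
```
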
60 IMPLICIT EXPLORATION IN ACTION
ℓ̂_{t,i} = ℓ_{t,i} · 1{I_t = i} / (p_{t,i} + γ)
[Figure sequence: histograms comparing the true losses, the standard unbiased importance-weighted estimates, the unbiased estimates with added uniform exploration, and the IX estimates.]

62 HIGHER-ORDER BOUNDS
Full information vs. bandit (i* denotes the best arm):
- minimax: full info R_T = O(√(T log K)); bandit R_T = O(√(KT))
- first-order, L_{T,i} = ∑_t ℓ_{t,i}: full info R_T = O(√(L_{T,i*} log K)); bandit R_T = O(√(K L_{T,i*}))
- second-order, S_{T,i} = ∑_t ℓ_{t,i}²: full info R_T = O(√(S_{T,i*} log K)) (Cesa-Bianchi, Mansour, Stoltz, 2005); bandit R_T = O(√(∑_i S_{T,i})) (Auer et al., 2002, + some hacking)
- variance, V_{T,i} = ∑_t (ℓ_{t,i} - μ_{T,i})²: full info R_T = O(√(V_{T,i*} log K)) (Hazan and Kale, 2010, with a little cheating); bandit R_T = O(K² √(∑_i V_{T,i})) (Hazan and Kale, 2011)

64 HIGHER-ORDER LOWER BOUNDS
Gerchinovitz and Lattimore (2016), heavily paraphrased:
Theorem: No algorithm can do better than R_T = Ω(√(K L_T*)).
Theorem: No algorithm can do better than R_T = Ω(√(∑_i V_{T,i})).

65 BEYOND MINIMAX #2: STOCHASTIC LOSSES AND THE BEST OF BOTH WORLDS

66-67 [Figure-only slides; no text to transcribe.]

68 TL;DR: R_T = O(C_ν log T) is achievable for i.i.d. losses, where C_ν is a constant depending on the loss distribution ν.

70 THE BEST OF BOTH WORLDS
Is it possible to come up with an algorithm with R_T = O(√(KT)) for non-stochastic losses and R_T = O(C_ν log T) for stochastic losses?
YES*!!   (*almost)

72 THE BEST OF BOTH WORLDS: ALGORITHMS
Bubeck and Slivkins (2012):
- Assume that the environment is stochastic, act aggressively
- If the losses fail a stochasticity test, fall back to Exp3
- Regret: O(√(KT)) on adversarial losses, O(log² T) on stochastic losses
Auer and Chiang (2016), see Peter's talk tomorrow:
- Better test, better algorithm for stochastic losses
- Regret: O(√(KT log K)) on adversarial losses, O(C_ν log T) on stochastic losses

74 A SIMPLE ALGORITHM: EXP3++ (SELDIN AND SLIVKINS, 2014), PARAPHRASED
EXP3++ (SS, 2014) is EXP3 with arm-dependent extra exploration (the "++" part):
Parameters: learning rates η_t > 0 and exploration rates ε_{t,i} > 0.
Initialization: for all i, set w_{1,i} = 1.
For each round t = 1, 2, ..., T:
- For all i, let p_{t,i} = (1 - ∑_j ε_{t,j}) · w_{t,i} / ∑_j w_{t,j} + ε_{t,i}.
- Draw I_t ~ p_t.
- For all i, let ℓ̂_{t,i} = (ℓ_{t,i} / p_{t,i}) · 1{I_t = i}.
- For all i, update the weight as w_{t+1,i} = exp(-η_t L̂_{t,i}), where L̂_{t,i} = ∑_{s=1}^t ℓ̂_{s,i}.

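A rough sketch of the EXP3++ loop as paraphrased above. The specific schedules used below for η_t and ε_{t,i} (a √(log K / (tK)) learning rate and an exploration rate driven by an empirical gap estimate) are only meant to illustrate the shape of the algorithm; they are not the exact constants of Seldin and Slivkins (2014):

```python
import numpy as np

def exp3pp(loss_fn, K, T, seed=0):
    """Sketch of EXP3++: exponential weights on cumulative loss estimates,
    mixed with arm-dependent extra exploration driven by empirical gap estimates."""
    rng = np.random.default_rng(seed)
    Lhat = np.zeros(K)                       # cumulative importance-weighted loss estimates
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(K) / (t * K))   # illustrative learning-rate schedule
        gaps = (Lhat - Lhat.min()) / t       # empirical gap estimate Delta_hat_{t,i}
        eps = np.minimum(1.0 / (2 * K),
                         np.log(t + 1) / (t * np.maximum(gaps, 1e-12) ** 2))  # illustrative
        q = np.exp(-eta * (Lhat - Lhat.min()))
        q /= q.sum()
        p = (1.0 - eps.sum()) * q + eps      # mix exponential weights with extra exploration
        p /= p.sum()                         # guard against floating-point drift
        arm = rng.choice(K, p=p)
        Lhat[arm] += loss_fn(t, arm) / p[arm]
    return Lhat
```
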
77 EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Theorem (SS, 2014): The regret of Exp3++ satisfies R_T ≤ 4√(TK log K).
Proof idea: the ε_{t,i}'s are small enough not to change the standard Exp3 analysis: ε_{t,i} = O(√(log K / (KT))).
Theorem (SS, 2014): In the stochastic case, the regret of Exp3++ satisfies R_T = O(C_ν log³ T) plus a distribution-dependent additive constant.

81 EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Proof ideas: Let Δ_i = μ_i - μ*.
Wishful thinking: if we had full information, then
p_{t,i} ≈ e^{-t η_t Δ_i} / ∑_j e^{-t η_t Δ_j} ≤ e^{-t η_t Δ_i}
would hold for all suboptimal arms i.
Thus, the expected number of suboptimal draws would be
∑_{t=1}^T p_{t,i} ≤ ∑_{t=1}^T e^{-t η_t Δ_i} = O(K / Δ_i²).
But we don't have full info :(

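The last sum can be checked by a quick calculation, assuming the schedule η_t = √(log K / (tK)) (the schedule is my assumption; the slide only uses some decreasing η_t):

```latex
% With \eta_t = \sqrt{\log K/(tK)} we have t\eta_t\Delta_i = \Delta_i\sqrt{t\log K/K}, so
\[
  \sum_{t=1}^{T} e^{-t\eta_t\Delta_i}
  \;=\; \sum_{t=1}^{T} e^{-\Delta_i\sqrt{t\,\log K/K}}
  \;\le\; \int_{0}^{\infty} e^{-\Delta_i\sqrt{s\,\log K/K}}\,ds
  \;=\; \frac{2K}{\Delta_i^2\,\log K}
  \;=\; O\!\left(\frac{K}{\Delta_i^2}\right).
\]
% (Substituting u = \Delta_i\sqrt{s\log K/K} turns the integral into
%  \frac{2K}{\Delta_i^2\log K}\int_0^\infty u\,e^{-u}\,du = \frac{2K}{\Delta_i^2\log K}.)
```
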
87 EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Idea: ensure that the estimated gap is "reasonable":
t Δ̂_{t,i} := L̂_{t,i} - L̂_t* ≥ t Δ_i - o(t).
This is ensured by the exploration parameters ε_{t,i}!
For large enough t (t ≥ t*), we then have t Δ̂_{t,i} ≥ t Δ_i / 2.
This gives
p_{t,i} ≈ e^{-t η_t Δ̂_{t,i}} / ∑_j e^{-t η_t Δ̂_{t,j}} ≤ e^{-t η_t Δ_i / 2}
for all suboptimal arms i. Thus,
∑_{t=1}^T p_{t,i} ≤ t* + ∑_{t=t*}^T e^{-t η_t Δ_i / 2} = t* + O(K / Δ_i²).
The rest is grinding out the asymptotics.

90 EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Bottom line: if there is a linear gap between L_{t,i} and L_t*, this should be exposed in the estimated gap t Δ̂_{t,i}.
Corollaries: strong bounds whenever there is such a gap:
- contaminated stochastic
- adversarial with a gap
But that's the exact opposite of what we need for first-order bounds!

93 OPEN QUESTIONS
- Is there a way to exploit gaps that grow more slowly than linearly?
- Is there a way to improve the asymptotics? (In SS'14, t* is horribly big!)
- So far, all positive results hold only for oblivious adversaries; is it possible to extend them to adaptive ones? See Peter's talk tomorrow!

94 BEYOND MINIMAX #3: PRIOR-DEPENDENT BOUNDS

97 PRIOR-DEPENDENT BOUNDS FOR FULL INFO
Theorem (Luo and Schapire, 2015; Koolen and Van Erven, 2015; Orabona and Pal, 2016): There exist algorithms guaranteeing
R_T(ρ) = O( √( T (1 + RE(ρ ∥ π)) ) )
for any fixed prior π ∈ Δ_K and any comparator ρ ∈ Δ_K.
Theorem (Even-Dar et al., 2007; Sani et al., 2014): There exist algorithms guaranteeing R_T(i) = const for any fixed i, while also guaranteeing R_T = O(√T).
Anything similar possible for bandits?? NO* :(   (*not quite)

98 PRIOR-DEPENDENT BOUNDS FOR BANDITS
Theorem (Lattimore, 2015), paraphrased: The regrets R_T(i) need to satisfy
R_T(i) ≥ min{ T, ∑_{j ≠ i} T / R_T(j) }.
In particular, R_T(i) = const implies R_T(j) = Ω(T) for every j ≠ i, since the constant must dominate T / R_T(j).
Fixing a prior π and getting a bound R_T(ρ) = O(√(T ∑_j ρ_j / π_j)) is not possible.

99 PRIOR-DEPENDENT BOUNDS: POSITIVE RESULTS
Lattimore (2015): For any regret profile satisfying the condition above, there exists an algorithm achieving it in the stochastic setting.
In particular, a bound of order √(T ∑_j ρ_j / π_j) is achievable (see also Rosin, 2011).
Neu (2016, made up on the flight here): For non-stochastic bandits, there is an algorithm with R_T(i) = O(√(KT) · softmax(π)_i).

100 BEYOND MINIMAX: CONCLUSIONS

101 CONCLUSIONS
Higher-order bounds:
- First-order bounds are possible, like in full info
- Second-order bounds: much weaker than in full info
Best-of-both-worlds bounds:
- Possible and strong against oblivious adversaries
- Only weak guarantees for adaptive adversaries
Prior-dependent bounds:
- Nothing fancy is possible

102 THANKS!!!
