EASINESS IN BANDITS
Gergely Neu, Pompeu Fabra University
THE BANDIT PROBLEM
Play for T rounds attempting to maximize rewards / minimize losses.
Need to balance exploration and exploitation.
Motivation: advertising, clinical trials, ...
EASINESS IN BANDITS: A TUTORIAL
- Hardness in bandits: worst-case upper & lower bounds
- Easiness in bandits:
  - Higher-order bounds
  - Stochastic bandits and the best of both worlds
  - Prior-dependent bounds
NON-STOCHASTIC BANDITS
Parameters: number of arms K, number of rounds T.
Interaction: for each round t = 1, 2, ..., T:
- Learner chooses action $I_t \in [K]$
- Environment chooses losses $\ell_{t,i} \in [0,1]$ for all i
- Learner incurs and observes loss $\ell_{t,I_t}$
Goal: minimize the expected regret
$$R_T = \mathbb{E}\Big[\sum_{t=1}^T \ell_{t,I_t}\Big] - \min_{i\in[K]} \sum_{t=1}^T \ell_{t,i}$$
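Concretely, the protocol and the regret computation above can be sketched as follows (a minimal illustration: the loss matrix and the uniform-random "learner" are made-up placeholders, and any actual bandit algorithm would replace the `actions` line):

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 1000, 3

# Oblivious environment: the whole loss sequence is fixed up front.
losses = rng.uniform(0.0, 1.0, size=(T, K))   # losses[t, i] in [0, 1]

# Placeholder learner: plays uniformly at random (no learning at all).
actions = rng.integers(0, K, size=T)          # I_t for t = 1, ..., T

# Regret: the learner's total loss minus that of the best fixed arm in hindsight.
learner_loss = losses[np.arange(T), actions].sum()
best_fixed_arm_loss = losses.sum(axis=0).min()
regret = learner_loss - best_fixed_arm_loss
```

Note that the comparator is the best single arm chosen with full hindsight, while the learner only ever observes `losses[t, actions[t]]`.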
NON-STOCHASTIC BANDITS: LOWER BOUNDS
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): In the worst case, any algorithm will suffer a regret of $\Omega\big(\sqrt{KT}\big)$.
This result also holds for stochastic bandits, as the counterexample is stochastic.
This talk: how to go beyond this.
NON-STOCHASTIC BANDITS: UPPER BOUNDS
EXP3 (Auer, Cesa-Bianchi, Freund and Schapire, 1995, 2002)
Parameter: $\eta > 0$. Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \frac{w_{t,i}}{\sum_j w_{t,j}}$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = w_{t,i}\,e^{-\eta\hat\ell_{t,i}}$.
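The pseudocode above translates almost line by line into NumPy; a minimal sketch (the Bernoulli environment and the horizon are made up for the demo, and the learning rate uses the tuning from the regret theorem):

```python
import numpy as np

def exp3(loss_fn, K, T, eta, rng):
    """EXP3 for losses in [0, 1]: exponential weights over importance-weighted
    loss estimates that use only the observed entry loss_fn(t, I_t)."""
    w = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                 # p_{t,i} = w_{t,i} / sum_j w_{t,j}
        I = rng.choice(K, p=p)          # draw I_t ~ p_t
        loss = loss_fn(t, I)            # observe only the played arm's loss
        total_loss += loss
        ell_hat = np.zeros(K)
        ell_hat[I] = loss / p[I]        # unbiased estimate: E[ell_hat_i] = ell_i
        w *= np.exp(-eta * ell_hat)     # multiplicative weight update
    return total_loss

# Toy stochastic environment: arm 0 is clearly best (mean loss 0.2 vs 0.8).
rng = np.random.default_rng(0)
means = np.array([0.2, 0.8, 0.8])
T = 5000
eta = np.sqrt(2 * np.log(len(means)) / (len(means) * T))
total = exp3(lambda t, i: float(rng.random() < means[i]), len(means), T, eta, rng)
```

On this instance the total loss ends up close to that of the best arm (about 0.2 per round) rather than the 0.6 per round of uniform play.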
THE REGRET OF EXP3
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): The regret of EXP3 satisfies $R_T \le \sqrt{2KT\log K}$.
Proof:
$$R_T \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^T\sum_{i=1}^K p_{t,i}\hat\ell_{t,i}^2\Big] \le \frac{\log K}{\eta} + \frac{\eta KT}{2},$$
and setting $\eta = \sqrt{\frac{2\log K}{KT}}$ gives the claimed bound. ∎
HEY, BUT THAT'S NOT MINIMAX!
Exp3 is strictly suboptimal: you can't remove the $\sqrt{\log K}$ (Audibert, Bubeck and Lugosi, 2014).
A minimax algorithm: PolyINF,
$$p_t = \arg\min_{p\in\Delta_K}\Big\{\eta\,\langle p, \hat L_{t-1}\rangle + S_\alpha(p)\Big\},$$
where $S_\alpha(p) = \frac{1}{1-\alpha}\Big(1-\sum_{i=1}^K p_i^\alpha\Big)$ is the Tsallis entropy.
Theorem (Audibert and Bubeck, 2009, Audibert, Bubeck and Lugosi, 2014, Abernethy, Lee and Tewari, 2015): The regret of PolyINF satisfies $R_T \le 2\sqrt{KT}$.
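For $\alpha = 1/2$ the PolyINF optimization has an almost closed form: the first-order conditions of the minimization above give $p_i = (\eta \hat L_i + \lambda)^{-2}$, with the Lagrange multiplier $\lambda$ chosen so the probabilities sum to one. A sketch (the bisection bracket and iteration count are implementation choices, not from the slides):

```python
import numpy as np

def polyinf_probs(L_hat, eta):
    """Sampling distribution of PolyINF with alpha = 1/2.

    Solves p = argmin_{p in simplex} eta*<p, L_hat> + S_{1/2}(p).
    The KKT conditions give p_i = (eta*L_hat_i + lam)^(-2); the
    multiplier lam is found by bisection so that sum_i p_i = 1."""
    L = np.asarray(L_hat, dtype=float)
    K = len(L)
    # Positivity needs lam > -eta*min(L); at lam = -eta*min(L) + sqrt(K) + 1
    # every p_i < 1/K-ish, so the root lies inside the bracket below.
    lo = -eta * L.min() + 1e-9               # total mass is huge here
    hi = -eta * L.min() + np.sqrt(K) + 1.0   # total mass is below 1 here
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        if np.sum((eta * L + lam) ** -2.0) > 1.0:
            lo = lam                         # mass too large: increase lam
        else:
            hi = lam
    p = (eta * L + 0.5 * (lo + hi)) ** -2.0
    return p / p.sum()                       # renormalize away bisection error
```

With equal cumulative losses this returns the uniform distribution, and arms with larger $\hat L_i$ receive polynomially (rather than exponentially) smaller probabilities, which is what removes the $\sqrt{\log K}$ factor.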
BEYOND MINIMAX #1: HIGHER-ORDER BOUNDS
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | ? |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | Easy! |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | ? |
SECOND-ORDER BOUNDS (Easy!)
The Exp3 proof:
$$R_T \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^T\sum_{i=1}^K p_{t,i}\hat\ell_{t,i}^2\Big].$$
Since $\mathbb{E}\big[p_{t,i}\hat\ell_{t,i}^2\big] = \ell_{t,i}^2$, tuning $\eta$ immediately gives:
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): The regret of EXP3 satisfies
$$R_T \le \sqrt{2\log K \sum_{t=1}^T\sum_{i=1}^K \ell_{t,i}^2}.$$
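The key identity here, $\mathbb{E}\big[\sum_i p_{t,i}\hat\ell_{t,i}^2\big] = \sum_i \ell_{t,i}^2$, can be checked exactly by enumerating the K possible draws of $I_t$ (the particular numbers below are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # sampling distribution p_t
ell = np.array([0.9, 0.1, 0.4])   # true losses ell_t

# Exact expectation over the draw I_t: if I_t = j, then
# ell_hat_j = ell_j / p_j and all other estimates are 0, so the
# inner sum collapses to p_j * (ell_j / p_j)^2.
expectation = sum(
    p[j] * (p[j] * (ell[j] / p[j]) ** 2)   # P(I_t = j) * (p_j * ell_hat_j^2)
    for j in range(len(p))
)

assert np.isclose(expectation, np.sum(ell ** 2))
```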
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | ? |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | $R_T = O\big(\sqrt{\sum_i S_{T,i}\log K}\big)$ (Auer et al., 2002 + some hacking) |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | not so easy |
VARIANCE BOUNDS (not so easy)
Need to replace $\sum_t\sum_i \ell_{t,i}^2$ by $\sum_t\sum_i (\ell_{t,i}-\mu_{T,i})^2$, where $\mu_{T,i} = \frac{1}{T}\sum_t \ell_{t,i}$.
Hazan and Kale (2011), heavily paraphrased:
- Replace $\mu_{T,i}$ by $\mu_{t,i}$ (easy)
- Estimate $\mu_{t,i}$ by an appropriate $\tilde\mu_{t,i}$: reservoir sampling in exploration rounds
- Use Exp3 with loss estimates $\hat\ell_{t,i} = \frac{\ell_{t,i}-\tilde\mu_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}} + \tilde\mu_{t,i}$
But that doesn't work!
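Reservoir sampling, the tool used to maintain the mean estimates from occasional exploration rounds, keeps a fixed-size uniform sample of a stream in a single pass. A textbook sketch (Vitter's Algorithm R; the reservoir size and the stream are arbitrary choices for the demo):

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of size k from a stream of unknown
    length: item number t replaces a reservoir slot with probability k/t."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(x)       # fill the reservoir first
        else:
            j = rng.randrange(t)      # uniform in {0, ..., t-1}
            if j < k:                 # happens with probability k/t
                reservoir[j] = x
    return reservoir

rng = random.Random(0)
sample = reservoir_sample(range(10_000), k=32, rng=rng)
estimate = sum(sample) / len(sample)  # crude estimate of the stream mean
```

Averaging the reservoir gives exactly the kind of running mean estimate $\tilde\mu_{t,i}$ the paraphrase above calls for.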
THE RIGHT WAY TO GET VARIANCE BOUNDS
Instead of Exp3, use SCRiBLe:
$$p_t = \arg\min_{p\in\Delta_K}\Big\{\langle p, \tilde L_{t-1}\rangle + \Psi(p)\Big\}, \qquad \tilde L_{t-1,i} = \sum_{s=1}^{t-1}\big(\tilde c_{s,i} + \tilde\mu_{s,i}\big),$$
where $\tilde c_{t,i}$ is an appropriate unbiased estimate of $\ell_{t,i}-\tilde\mu_{t,i}$ and $\Psi$ is a self-concordant regularizer.
Theorem (Hazan and Kale, 2011): The regret of the above algorithm satisfies
$$R_T = O\bigg(K^2\sqrt{\sum_{t=1}^T\sum_{i=1}^K\big(\ell_{t,i}-\mu_{T,i}\big)^2}\bigg).$$
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | should be easy? |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | $R_T = O\big(\sqrt{\sum_i S_{T,i}\log K}\big)$ (Auer et al., 2002 + some hacking) |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | $R_T = O\big(K^2\sqrt{\sum_i V_{T,i}}\big)$ (Hazan and Kale, 2011) |
FIRST-ORDER BOUNDS (should be easy?)
Small-gain bounds: consider the gain game with $g_{t,i} = 1-\ell_{t,i}$.
Auer, Cesa-Bianchi, Freund and Schapire (2002): $R_T = O\big(\sqrt{K\,G_{T,i^*}\log K}\big)$, where $G_{T,i} = \sum_t g_{t,i}$.
Problem: only good if the best expert is bad!
A slightly trickier analysis gives $R_T = O\big(\sqrt{\sum_t\sum_i g_{t,i}\log K}\big)$, or $R_T = O\big(\sqrt{\sum_t\sum_i \ell_{t,i}\log K}\big)$.
Problem: one misbehaving action ruins the bound!
Actual first-order bounds, with $L_T^* = \min_i L_{T,i}$:
- Stoltz (2005): $K\sqrt{L_T^*}$
- Allenberg, Auer, Györfi and Ottucsák (2006): $\sqrt{K L_T^*}$
- Rakhlin and Sridharan (2013): $K^{3/2}\sqrt{L_T^*}$
THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Green (Allenberg, Auer, Györfi and Ottucsák, 2006), a variant of EXP3:
Parameters: $\eta > 0$, $\gamma \in (0,1)$. Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \frac{w_{t,i}}{\sum_j w_{t,j}}$, and set $p_{t,i} = 0$ if $p_{t,i} \le \gamma$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = w_{t,i}\,e^{-\eta\hat\ell_{t,i}}$.
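The only change relative to EXP3 is the thresholding step; one way the sampling distribution can be formed is sketched below (whether the surviving mass is redistributed exactly by renormalization is an implementation detail of this paraphrase, so treat it as an assumption):

```python
import numpy as np

def green_probs(w, gamma):
    """Sampling distribution of Green: start from the exponential-weights
    distribution, zero out arms whose probability is at most gamma, and
    renormalize over the surviving arms."""
    p = w / w.sum()
    p = np.where(p <= gamma, 0.0, p)   # drop arms with p_{t,i} <= gamma
    return p / p.sum()                 # renormalize the survivors
```

For example, with weights `[1.0, 1.0, 0.001]` and `gamma = 0.1`, the third arm is zeroed out and the first two each get probability 0.5; once an arm is dropped it receives no loss estimates, which is exactly why its estimated cumulative loss "stops growing" in the analysis below.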
THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Analysis idea:
- As long as $p_{t,i} \ge \gamma$ for an arm i, we have $\hat L_{t-1,i} \le \hat L_{t-1,j} + O\big(\log(1/\gamma)/\eta\big)$ for all j: the loss estimates are not too far apart.
- Once $p_{t,i} \le \gamma$ occurs, $\hat L_{t,i}$ stops growing, so $\hat L_{T,i} \le \hat L_{T,j} + O\big(\log(1/\gamma)/\eta\big) + O(1/\gamma)$.
THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Getting back to the Exp3 proof:
$$R_T \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^T\sum_{i=1}^K p_{t,i}\hat\ell_{t,i}^2\Big] \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{i=1}^K \hat L_{T,i}\Big] \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\big[K\hat L_{T,i^*}\big] + O(K) \le \frac{\log K}{\eta} + \frac{\eta}{2}\,K L_T^* + O(K),$$
using $p_{t,i}\hat\ell_{t,i}^2 \le \hat\ell_{t,i}$ in the second step and the closeness of the loss estimates in the third.
Theorem (Allenberg et al., 2006): The regret of Green satisfies $R_T = O\big(\sqrt{K L_T^* \log K} + K\big)$.
A SIMPLER ALGORITHM: EXP3-IX
EXP3-IX (Kocák et al., 2014, Neu 2015a, Neu 2015b), a variant of EXP3:
Parameters: $\eta > 0$, $\gamma > 0$. Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \frac{w_{t,i}}{\sum_j w_{t,j}}$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}+\gamma}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = w_{t,i}\,e^{-\eta\hat\ell_{t,i}}$.
Theorem (Neu, 2015): The regret of Exp3-IX satisfies $R_T = O\big(\sqrt{K L_T^* \log K} + K\big)$.
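Since EXP3-IX differs from EXP3 only in the denominator of the loss estimate, it is a one-line change in code. A minimal sketch (the environment, horizon, and the particular $\eta, \gamma$ values are made up for the demo; the paper couples $\gamma$ to $\eta$, but any small pair illustrates the behaviour):

```python
import numpy as np

def exp3_ix(loss_fn, K, T, eta, gamma, rng):
    """EXP3-IX: exponential weights with the implicit-exploration (IX)
    loss estimate ell/(p + gamma), which is slightly biased downward but
    has far better-behaved variance than the vanilla estimate."""
    w = np.ones(K)
    pulls = np.zeros(K, dtype=int)
    for t in range(T):
        p = w / w.sum()
        I = rng.choice(K, p=p)
        pulls[I] += 1
        ell_hat = np.zeros(K)
        ell_hat[I] = loss_fn(t, I) / (p[I] + gamma)   # the IX estimate
        w *= np.exp(-eta * ell_hat)
    return pulls

rng = np.random.default_rng(0)
means = np.array([0.1, 0.9, 0.9])   # arm 0 is clearly best
T = 3000
pulls = exp3_ix(lambda t, i: float(rng.random() < means[i]),
                K=3, T=T, eta=0.05, gamma=0.025, rng=rng)
```

On this instance the pull counts concentrate heavily on arm 0.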
IMPLICIT EXPLORATION IN ACTION
$$\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}+\gamma}\mathbb{1}_{\{I_t=i\}}$$
[Figures: true losses vs. unbiased estimates; unbiased estimates + uniform exploration; IX estimates]
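Unlike the vanilla estimate, the IX estimate is biased: taking the expectation over $I_t$ gives $\mathbb{E}\big[\hat\ell_{t,i}\big] = \ell_{t,i}\,\frac{p_{t,i}}{p_{t,i}+\gamma} \le \ell_{t,i}$, so losses are optimistically underestimated, which acts like exploration without explicit uniform mixing. This can be checked exactly (the numbers are arbitrary):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])     # sampling distribution p_t
ell = np.array([0.8, 0.5, 0.2])   # true losses
gamma = 0.05

# E[ell_hat_i] over the draw I_t: arm i is observed with probability p_i,
# in which case the estimate equals ell_i / (p_i + gamma).
expected_ix = p * ell / (p + gamma)

assert np.all(expected_ix <= ell)  # biased downward for every arm...
# ...and the bias is exactly ell_i * gamma / (p_i + gamma):
assert np.allclose(ell - expected_ix, ell * gamma / (p + gamma))
```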
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | $R_T = O\big(\sqrt{K L_{T,i^*}}\big)$ |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | $R_T = O\big(\sqrt{\sum_i S_{T,i}\log K}\big)$ (Auer et al., 2002 + some hacking) |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | $R_T = O\big(K^2\sqrt{\sum_i V_{T,i}}\big)$ (Hazan and Kale, 2011) |
HIGHER-ORDER LOWER BOUNDS
Gerchinovitz and Lattimore (2016), heavily paraphrased:
Theorem: No algorithm can do better than $R_T = \Omega\big(\sqrt{L_T^* K}\big)$.
Theorem: No algorithm can do better than $R_T = \Omega\big(\sqrt{\sum_i V_{T,i}}\big)$.
BEYOND MINIMAX #2: STOCHASTIC LOSSES AND THE BEST OF BOTH WORLDS
TL;DR: $R_T = O(C_\nu \log T)$ is achievable for i.i.d. losses, where $C_\nu$ is a distribution-dependent constant.
THE BEST OF BOTH WORLDS
Is it possible to come up with an algorithm with $R_T = O(\sqrt{KT})$ for non-stochastic losses and $R_T = O(C_\nu \log T)$ for stochastic losses?
YES*!! (*almost)
THE BEST OF BOTH WORLDS: ALGORITHMS
Bubeck and Slivkins (2012):
- Assume that the environment is stochastic, act aggressively
- If the losses fail a stochasticity test, fall back to Exp3
- Regret: $O(\sqrt{KT})$ on adversarial losses, $O(\log^2 T)$ on stochastic losses
Auer and Chiang (2016), see Peter's talk tomorrow:
- Better test, better algorithm for stochastic losses
- Regret: $O(\sqrt{KT\log K})$ on adversarial losses, $O(C_\nu \log T)$ on stochastic losses
A SIMPLE ALGORITHM: EXP3++ (SELDIN AND SLIVKINS, 2014), PARAPHRASED
EXP3++ (SS, 2014), a variant of EXP3:
Parameters: learning rates $\eta_t > 0$, exploration rates $\varepsilon_{t,i} > 0$ (the "++").
Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \Big(1-\sum_j \varepsilon_{t,j}\Big)\frac{w_{t,i}}{\sum_j w_{t,j}} + \varepsilon_{t,i}$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = \exp\big(-\eta_t \hat L_{t,i}\big)$.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Theorem (SS, 2014): The regret of Exp3++ satisfies $R_T \le 4\sqrt{TK\log K}$.
Proof idea: the $\varepsilon_{t,i}$ are small enough not to change the standard Exp3 analysis: $\varepsilon_{t,i} = O\big(\sqrt{\log K/(KT)}\big)$.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Theorem (SS, 2014): The regret of Exp3++ satisfies $R_T = O\big(C_\nu \log^3 T + C_\nu\big)$ in the stochastic case.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Proof ideas: Let $\Delta_i = \mu_i - \mu^*$.
Wishful thinking: if we had full information, then
$$p_{t,i} = \frac{e^{-t\eta_t\Delta_i}}{\sum_j e^{-t\eta_t\Delta_j}} \le e^{-t\eta_t\Delta_i}$$
would hold for all suboptimal arms i.
Thus, the expected number of suboptimal draws would be $\sum_{t=1}^T p_{t,i} \le \sum_{t=1}^T e^{-t\eta_t\Delta_i} = O\big(K/\Delta_i^2\big)$.
But we don't have full info :(
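To see where the $O\big(K/\Delta_i^2\big)$ comes from, the sum can be bounded by an integral; assuming the standard anytime tuning $\eta_t = \sqrt{\log K/(tK)}$ (so that $t\eta_t = \sqrt{t\log K/K}$), a sketch:

```latex
\sum_{t=1}^{T} e^{-t\eta_t \Delta_i}
  = \sum_{t=1}^{T} e^{-\Delta_i \sqrt{t \log K / K}}
  \le \int_0^\infty e^{-\Delta_i \sqrt{s \log K / K}} \, ds
  % substitute u = \Delta_i \sqrt{s \log K / K}, so ds = \frac{2Ku}{\Delta_i^2 \log K}\, du
  = \frac{2K}{\Delta_i^2 \log K} \int_0^\infty u\, e^{-u} \, du
  = \frac{2K}{\Delta_i^2 \log K}
  = O\!\left(\frac{K}{\Delta_i^2}\right).
```

The first inequality uses that the summand is decreasing in t, so the sum is dominated by the integral starting at 0.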
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Idea: ensure that the estimated gap is "reasonable":
$$t\hat\Delta_{t,i} \equiv \hat L_{t,i} - \hat L_t^* \ge t\Delta_i - o(t),$$
which is ensured by the exploration parameters $\varepsilon_{t,i}$!
For large enough t (i.e., $t \ge t^*$), we have $t\hat\Delta_{t,i} \ge t\Delta_i/2$.
This gives
$$p_{t,i} = \frac{e^{-t\eta_t\hat\Delta_{t,i}}}{\sum_j e^{-t\eta_t\hat\Delta_{t,j}}} \le e^{-t\eta_t\Delta_i/2}$$
for all suboptimal arms i.
Thus, $\sum_{t=1}^T p_{t,i} \le t^* + \sum_{t=t^*}^T e^{-t\eta_t\Delta_i/2} = t^* + O\big(K/\Delta_i^2\big)$.
The rest is grinding out the asymptotics.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Bottom line: if there is a linear gap between $L_{t,i}$ and $L_t^*$, this should be exposed in the estimated gap $t\hat\Delta_{t,i}$.
Corollaries: strong bounds whenever there is such a gap:
- contaminated stochastic
- adversarial with a gap
But that's the exact opposite of what we need for first-order bounds!
OPEN QUESTIONS
- Is there a way to exploit gaps that grow slower than linearly?
- Is there a way to improve the asymptotics? (In SS'14, $t^*$ is horribly big!)
- So far, all positive results hold only for oblivious adversaries; is it possible to extend them to adaptive ones? (See Peter's talk tomorrow!)
BEYOND MINIMAX #3: PRIOR-DEPENDENT BOUNDS
PRIOR-DEPENDENT BOUNDS FOR FULL INFO
Theorem (Luo and Schapire, 2015, Koolen and Van Erven, 2015, Orabona and Pal, 2016): There exist algorithms guaranteeing $R_T(\rho) = O\Big(\sqrt{T\big(1+\mathrm{RE}(\rho\|\pi)\big)}\Big)$ for any fixed prior $\pi \in \Delta_K$ and any comparator $\rho \in \Delta_K$.
Theorem (Even-Dar et al., 2007, Sani et al., 2014): There exist algorithms guaranteeing $R_T(i) = \mathrm{const}$ for any fixed i, while also guaranteeing $R_T = O(\sqrt{T})$.
Anything similar possible for bandits? NO* :( (*not quite)
PRIOR-DEPENDENT BOUNDS FOR BANDITS
Theorem (Lattimore, 2015), paraphrased: The regrets $R_T(i)$ need to satisfy $R_T(i) \ge \min\Big\{T,\ \sum_{j\neq i}\frac{T}{R_T(j)}\Big\}$. In particular, $R_T(i) = \mathrm{const}$ implies $R_T(j) = \Omega(T)$.
Fixing a prior $\pi$ and getting a bound $R_T(\rho) = O\Big(\sqrt{T\sum_j \rho_j/\pi_j}\Big)$ is not possible.
PRIOR-DEPENDENT BOUNDS: POSITIVE RESULTS
Lattimore (2015): For any regret bound satisfying the condition, there exists an algorithm achieving it in the stochastic setting. In particular, $\sqrt{T\sum_j \rho_j/\pi_j}$ is achievable (see also Rosin, 2011).
Neu (2016, made up on the flight here): For non-stochastic bandits, there is an algorithm with $R_T(i) = O\big(\sqrt{KT}\,\operatorname{softmax}(\pi)/\pi_i\big)$.
BEYOND MINIMAX: CONCLUSIONS
CONCLUSIONS
Higher-order bounds:
- First-order bounds are possible, like in full info
- Second-order bounds: much weaker than in full info
Best-of-both-worlds bounds:
- Possible and strong against oblivious adversaries
- Only weak guarantees for adaptive adversaries
Prior-dependent bounds:
- Nothing fancy is possible
THANKS!!!
More informationLecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem
Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:
More informationAdvanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12
Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview
More informationLearning with Exploration
Learning with Exploration John Langford (Yahoo!) { With help from many } Austin, March 24, 2011 Yahoo! wants to interactively choose content and use the observed feedback to improve future content choices.
More informationGambling in a rigged casino: The adversarial multi-armed bandit problem
Gambling in a rigged casino: The adversarial multi-armed bandit problem Peter Auer Institute for Theoretical Computer Science University of Technology Graz A-8010 Graz (Austria) pauer@igi.tu-graz.ac.at
More informationLecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora
princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: (Today s notes below are
More informationAlgorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning
Proceedings of Machine Learning Research vol 65:1 17, 2017 Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Nicolò Cesa-Bianchi Università degli Studi di Milano, Milano,
More informationAlgorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning
Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, Sébastien Gerchinovitz To cite this version: Nicolò Cesa-Bianchi,
More informationToward a Classification of Finite Partial-Monitoring Games
Toward a Classification of Finite Partial-Monitoring Games Gábor Bartók ( *student* ), Dávid Pál, and Csaba Szepesvári Department of Computing Science, University of Alberta, Canada {bartok,dpal,szepesva}@cs.ualberta.ca
More informationLecture 3: Lower Bounds for Bandit Algorithms
CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and
More informationCombinatorial Bandits
Combinatorial Bandits Nicolò Cesa-Bianchi Università degli Studi di Milano, Italy Gábor Lugosi 1 ICREA and Pompeu Fabra University, Spain Abstract We study sequential prediction problems in which, at each
More informationThe Multi-Armed Bandit Problem
The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist
More informationPartial monitoring classification, regret bounds, and algorithms
Partial monitoring classification, regret bounds, and algorithms Gábor Bartók Department of Computing Science University of Alberta Dávid Pál Department of Computing Science University of Alberta Csaba
More informationNo-Regret Algorithms for Unconstrained Online Convex Optimization
No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com
More informationOnline Learning with Gaussian Payoffs and Side Observations
Online Learning with Gaussian Payoffs and Side Observations Yifan Wu 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science University of Alberta 2 Department of Electrical and Electronic
More informationThe Online Approach to Machine Learning
The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I
More informationBetter Algorithms for Benign Bandits
Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com
More informationAdaptive Online Learning in Dynamic Environments
Adaptive Online Learning in Dynamic Environments Lijun Zhang, Shiyin Lu, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology Nanjing University, Nanjing 210023, China {zhanglj, lusy, zhouzh}@lamda.nju.edu.cn
More information0.1 Motivating example: weighted majority algorithm
princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes
More informationTHE first formalization of the multi-armed bandit problem
EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can
More informationIntroduction to Bandit Algorithms. Introduction to Bandit Algorithms
Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,
More informationA Drifting-Games Analysis for Online Learning and Applications to Boosting
A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire
More informationMulti-armed Bandits in the Presence of Side Observations in Social Networks
52nd IEEE Conference on Decision and Control December 0-3, 203. Florence, Italy Multi-armed Bandits in the Presence of Side Observations in Social Networks Swapna Buccapatnam, Atilla Eryilmaz, and Ness
More informationOnline learning with feedback graphs and switching costs
Online learning with feedback graphs and switching costs A Proof of Theorem Proof. Without loss of generality let the independent sequence set I(G :T ) formed of actions (or arms ) from to. Given the sequence
More informationMistake Bounds on Noise-Free Multi-Armed Bandit Game
Mistake Bounds on Noise-Free Multi-Armed Bandit Game Atsuyoshi Nakamura a,, David P. Helmbold b, Manfred K. Warmuth b a Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan b University
More informationOnline learning with noisy side observations
Online learning with noisy side observations Tomáš Kocák Gergely Neu Michal Valko Inria Lille - Nord Europe, France DTIC, Universitat Pompeu Fabra, Barcelona, Spain Inria Lille - Nord Europe, France SequeL
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationThe No-Regret Framework for Online Learning
The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,
More informationCombinatorial Multi-Armed Bandit: General Framework, Results and Applications
Combinatorial Multi-Armed Bandit: General Framework, Results and Applications Wei Chen Microsoft Research Asia, Beijing, China Yajun Wang Microsoft Research Asia, Beijing, China Yang Yuan Computer Science
More informationTwo generic principles in modern bandits: the optimistic principle and Thompson sampling
Two generic principles in modern bandits: the optimistic principle and Thompson sampling Rémi Munos INRIA Lille, France CSML Lunch Seminars, September 12, 2014 Outline Two principles: The optimistic principle
More informationLogarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract
More informationReinforcement Learning
Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce bandit optimisation: the
More informationAn adaptive algorithm for finite stochastic partial monitoring
An adaptive algorithm for finite stochastic partial monitoring Gábor Bartók Navid Zolghadr Csaba Szepesvári Department of Computing Science, University of Alberta, AB, Canada, T6G E8 bartok@ualberta.ca
More informationarxiv: v3 [cs.lg] 30 Jun 2012
arxiv:05874v3 [cslg] 30 Jun 0 Orly Avner Shie Mannor Department of Electrical Engineering, Technion Ohad Shamir Microsoft Research New England Abstract We consider a multi-armed bandit problem where the
More informationBandit Convex Optimization: T Regret in One Dimension
Bandit Convex Optimization: T Regret in One Dimension arxiv:1502.06398v1 [cs.lg 23 Feb 2015 Sébastien Bubeck Microsoft Research sebubeck@microsoft.com Tomer Koren Technion tomerk@technion.ac.il February
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationBandit View on Continuous Stochastic Optimization
Bandit View on Continuous Stochastic Optimization Sébastien Bubeck 1 joint work with Rémi Munos 1 & Gilles Stoltz 2 & Csaba Szepesvari 3 1 INRIA Lille, SequeL team 2 CNRS/ENS/HEC 3 University of Alberta
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Nicolò Cesa-Bianchi Università degli Studi di Milano Joint work with: Noga Alon (Tel-Aviv University) Ofer Dekel (Microsoft Research) Tomer Koren (Technion and Microsoft
More informationAnytime optimal algorithms in stochastic multi-armed bandits
Rémy Degenne LPMA, Université Paris Diderot Vianney Perchet CREST, ENSAE REMYDEGENNE@MATHUNIV-PARIS-DIDEROTFR VIANNEYPERCHET@NORMALESUPORG Abstract We introduce an anytime algorithm for stochastic multi-armed
More informationRegret Minimization for Branching Experts
JMLR: Workshop and Conference Proceedings vol 30 (2013) 1 21 Regret Minimization for Branching Experts Eyal Gofer Tel Aviv University, Tel Aviv, Israel Nicolò Cesa-Bianchi Università degli Studi di Milano,
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More informationOnline Optimization : Competing with Dynamic Comparators
Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online
More informationLearning for Contextual Bandits
Learning for Contextual Bandits Alina Beygelzimer 1 John Langford 2 IBM Research 1 Yahoo! Research 2 NYC ML Meetup, Sept 21, 2010 Example of Learning through Exploration Repeatedly: 1. A user comes to
More informationPure Exploration Stochastic Multi-armed Bandits
C&A Workshop 2016, Hangzhou Pure Exploration Stochastic Multi-armed Bandits Jian Li Institute for Interdisciplinary Information Sciences Tsinghua University Outline Introduction 2 Arms Best Arm Identification
More informationBandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University
Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3
More informationMulti-Armed Bandit Formulations for Identification and Control
Multi-Armed Bandit Formulations for Identification and Control Cristian R. Rojas Joint work with Matías I. Müller and Alexandre Proutiere KTH Royal Institute of Technology, Sweden ERNSI, September 24-27,
More informationBetter Algorithms for Benign Bandits
Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com
More informationUniversity of Alberta. The Role of Information in Online Learning
University of Alberta The Role of Information in Online Learning by Gábor Bartók A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree
More informationBandits for Online Optimization
Bandits for Online Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Bandits for Online Optimization 1 / 16 The multiarmed bandit problem... K slot machines Each
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationAccelerating Online Convex Optimization via Adaptive Prediction
Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework
More informationStochastic bandits: Explore-First and UCB
CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this
More informationarxiv: v3 [cs.lg] 6 Jun 2017
Proceedings of Machine Learning Research vol 65:1 27, 2017 Corralling a Band of Bandit Algorithms arxiv:1612.06246v3 [cs.lg 6 Jun 2017 Alekh Agarwal Microsoft Research, New York Haipeng Luo Microsoft Research,
More informationInformational Confidence Bounds for Self-Normalized Averages and Applications
Informational Confidence Bounds for Self-Normalized Averages and Applications Aurélien Garivier Institut de Mathématiques de Toulouse - Université Paul Sabatier Thursday, September 12th 2013 Context Tree
More informationOptimal and Adaptive Online Learning
Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research 1 Microsoft Way, Redmond,
More informationDynamic Regret of Strongly Adaptive Methods
Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret
More informationPAC Subset Selection in Stochastic Multi-armed Bandits
In Langford, Pineau, editors, Proceedings of the 9th International Conference on Machine Learning, pp 655--66, Omnipress, New York, NY, USA, 0 PAC Subset Selection in Stochastic Multi-armed Bandits Shivaram
More informationOn the Complexity of Best Arm Identification with Fixed Confidence
On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, joint work with Emilie Kaufmann CNRS, CRIStAL) to be presented at COLT 16, New York
More informationMulti-Armed Bandits with Metric Movement Costs
Multi-Armed Bandits with Metric Movement Costs Tomer Koren 1, Roi Livni 2, and Yishay Mansour 3 1 Google; tkoren@google.com 2 Princeton University; rlivni@cs.princeton.edu 3 Tel Aviv University; mansour@cs.tau.ac.il
More informationOnline Optimization with Gradual Variations
JMLR: Workshop and Conference Proceedings vol (0) 0 Online Optimization with Gradual Variations Chao-Kai Chiang, Tianbao Yang 3 Chia-Jung Lee Mehrdad Mahdavi 3 Chi-Jen Lu Rong Jin 3 Shenghuo Zhu 4 Institute
More informationFrom Bandits to Experts: A Tale of Domination and Independence
From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A
More informationMultiple Identifications in Multi-Armed Bandits
Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao
More information