EASINESS IN BANDITS
Gergely Neu, Pompeu Fabra University
THE BANDIT PROBLEM
Play for T rounds attempting to maximize rewards / minimize losses.
Need to balance exploration and exploitation.
Motivation: advertising, clinical trials, ...
EASINESS IN BANDITS: A TUTORIAL
- Hardness in bandits: worst-case upper & lower bounds
- Easiness in bandits:
  - Higher-order bounds
  - Stochastic bandits and the best of both worlds
  - Prior-dependent bounds
NON-STOCHASTIC BANDITS
Parameters: number of arms K, number of rounds T.
Interaction: for each round t = 1, 2, ..., T:
- Learner chooses action $I_t \in [K]$
- Environment chooses losses $\ell_{t,i} \in [0,1]$ for all i
- Learner incurs and observes loss $\ell_{t,I_t}$
Goal: minimize the expected regret
$$R_T = \mathbb{E}\Big[\sum_{t=1}^T \ell_{t,I_t}\Big] - \min_{i\in[K]} \sum_{t=1}^T \ell_{t,i}$$
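Concretely, the protocol and the regret computation above can be sketched as follows (a minimal illustration: the loss matrix and the uniform-random "learner" are made-up placeholders, and any actual bandit algorithm would replace the `actions` line):

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 1000, 3

# Oblivious environment: the whole loss sequence is fixed up front.
losses = rng.uniform(0.0, 1.0, size=(T, K))   # losses[t, i] in [0, 1]

# Placeholder learner: plays uniformly at random (no learning at all).
actions = rng.integers(0, K, size=T)          # I_t for t = 1, ..., T

# Regret: the learner's total loss minus that of the best fixed arm in hindsight.
learner_loss = losses[np.arange(T), actions].sum()
best_fixed_arm_loss = losses.sum(axis=0).min()
regret = learner_loss - best_fixed_arm_loss
```

Note that the comparator is the best single arm chosen with full hindsight, while the learner only ever observes `losses[t, actions[t]]`.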
NON-STOCHASTIC BANDITS: LOWER BOUNDS
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): In the worst case, any algorithm will suffer a regret of $\Omega\big(\sqrt{KT}\big)$.
This result also holds for stochastic bandits, as the counterexample is stochastic.
This talk: how to go beyond this.
NON-STOCHASTIC BANDITS: UPPER BOUNDS
EXP3 (Auer, Cesa-Bianchi, Freund and Schapire, 1995, 2002)
Parameter: $\eta > 0$. Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \frac{w_{t,i}}{\sum_j w_{t,j}}$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = w_{t,i}\,e^{-\eta\hat\ell_{t,i}}$.
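The pseudocode above translates almost line by line into NumPy; a minimal sketch (the Bernoulli environment and the horizon are made up for the demo, and the learning rate uses the tuning from the regret theorem):

```python
import numpy as np

def exp3(loss_fn, K, T, eta, rng):
    """EXP3 for losses in [0, 1]: exponential weights over importance-weighted
    loss estimates that use only the observed entry loss_fn(t, I_t)."""
    w = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                 # p_{t,i} = w_{t,i} / sum_j w_{t,j}
        I = rng.choice(K, p=p)          # draw I_t ~ p_t
        loss = loss_fn(t, I)            # observe only the played arm's loss
        total_loss += loss
        ell_hat = np.zeros(K)
        ell_hat[I] = loss / p[I]        # unbiased estimate: E[ell_hat_i] = ell_i
        w *= np.exp(-eta * ell_hat)     # multiplicative weight update
    return total_loss

# Toy stochastic environment: arm 0 is clearly best (mean loss 0.2 vs 0.8).
rng = np.random.default_rng(0)
means = np.array([0.2, 0.8, 0.8])
T = 5000
eta = np.sqrt(2 * np.log(len(means)) / (len(means) * T))
total = exp3(lambda t, i: float(rng.random() < means[i]), len(means), T, eta, rng)
```

On this instance the total loss ends up close to that of the best arm (about 0.2 per round) rather than the 0.6 per round of uniform play.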
THE REGRET OF EXP3
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): The regret of EXP3 satisfies $R_T \le \sqrt{2KT\log K}$.
Proof:
$$R_T \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^T\sum_{i=1}^K p_{t,i}\hat\ell_{t,i}^2\Big] \le \frac{\log K}{\eta} + \frac{\eta KT}{2},$$
and setting $\eta = \sqrt{\frac{2\log K}{KT}}$ gives the claimed bound. ∎
HEY, BUT THAT'S NOT MINIMAX!
Exp3 is strictly suboptimal: you can't remove the $\sqrt{\log K}$ (Audibert, Bubeck and Lugosi, 2014).
A minimax algorithm: PolyINF,
$$p_t = \arg\min_{p\in\Delta_K}\Big\{\eta\,\langle p, \hat L_{t-1}\rangle + S_\alpha(p)\Big\},$$
where $S_\alpha(p) = \frac{1}{1-\alpha}\Big(1-\sum_{i=1}^K p_i^\alpha\Big)$ is the Tsallis entropy.
Theorem (Audibert and Bubeck, 2009, Audibert, Bubeck and Lugosi, 2014, Abernethy, Lee and Tewari, 2015): The regret of PolyINF satisfies $R_T \le 2\sqrt{KT}$.
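For $\alpha = 1/2$ the PolyINF optimization has an almost closed form: the first-order conditions of the minimization above give $p_i = (\eta \hat L_i + \lambda)^{-2}$, with the Lagrange multiplier $\lambda$ chosen so the probabilities sum to one. A sketch (the bisection bracket and iteration count are implementation choices, not from the slides):

```python
import numpy as np

def polyinf_probs(L_hat, eta):
    """Sampling distribution of PolyINF with alpha = 1/2.

    Solves p = argmin_{p in simplex} eta*<p, L_hat> + S_{1/2}(p).
    The KKT conditions give p_i = (eta*L_hat_i + lam)^(-2); the
    multiplier lam is found by bisection so that sum_i p_i = 1."""
    L = np.asarray(L_hat, dtype=float)
    K = len(L)
    # Positivity needs lam > -eta*min(L); at lam = -eta*min(L) + sqrt(K) + 1
    # every p_i < 1/K-ish, so the root lies inside the bracket below.
    lo = -eta * L.min() + 1e-9               # total mass is huge here
    hi = -eta * L.min() + np.sqrt(K) + 1.0   # total mass is below 1 here
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        if np.sum((eta * L + lam) ** -2.0) > 1.0:
            lo = lam                         # mass too large: increase lam
        else:
            hi = lam
    p = (eta * L + 0.5 * (lo + hi)) ** -2.0
    return p / p.sum()                       # renormalize away bisection error
```

With equal cumulative losses this returns the uniform distribution, and arms with larger $\hat L_i$ receive polynomially (rather than exponentially) smaller probabilities, which is what removes the $\sqrt{\log K}$ factor.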
BEYOND MINIMAX #1: HIGHER-ORDER BOUNDS
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | ? |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | Easy! |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | ? |
SECOND-ORDER BOUNDS (Easy!)
The Exp3 proof:
$$R_T \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^T\sum_{i=1}^K p_{t,i}\hat\ell_{t,i}^2\Big].$$
Since $\mathbb{E}\big[p_{t,i}\hat\ell_{t,i}^2\big] = \ell_{t,i}^2$, tuning $\eta$ immediately gives:
Theorem (Auer, Cesa-Bianchi, Freund and Schapire, 2002): The regret of EXP3 satisfies
$$R_T \le \sqrt{2\log K \sum_{t=1}^T\sum_{i=1}^K \ell_{t,i}^2}.$$
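The key identity here, $\mathbb{E}\big[\sum_i p_{t,i}\hat\ell_{t,i}^2\big] = \sum_i \ell_{t,i}^2$, can be checked exactly by enumerating the K possible draws of $I_t$ (the particular numbers below are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # sampling distribution p_t
ell = np.array([0.9, 0.1, 0.4])   # true losses ell_t

# Exact expectation over the draw I_t: if I_t = j, then
# ell_hat_j = ell_j / p_j and all other estimates are 0, so the
# inner sum collapses to p_j * (ell_j / p_j)^2.
expectation = sum(
    p[j] * (p[j] * (ell[j] / p[j]) ** 2)   # P(I_t = j) * (p_j * ell_hat_j^2)
    for j in range(len(p))
)

assert np.isclose(expectation, np.sum(ell ** 2))
```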
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | ? |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | $R_T = O\big(\sqrt{\sum_i S_{T,i}\log K}\big)$ (Auer et al., 2002 + some hacking) |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | not so easy |
VARIANCE BOUNDS (not so easy)
Need to replace $\sum_t\sum_i \ell_{t,i}^2$ by $\sum_t\sum_i (\ell_{t,i}-\mu_{T,i})^2$, where $\mu_{T,i} = \frac{1}{T}\sum_t \ell_{t,i}$.
Hazan and Kale (2011), heavily paraphrased:
- Replace $\mu_{T,i}$ by $\mu_{t,i}$ (easy)
- Estimate $\mu_{t,i}$ by an appropriate $\tilde\mu_{t,i}$: reservoir sampling in exploration rounds
- Use Exp3 with loss estimates $\hat\ell_{t,i} = \frac{\ell_{t,i}-\tilde\mu_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}} + \tilde\mu_{t,i}$
But that doesn't work!
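Reservoir sampling, the tool used to maintain the mean estimates from occasional exploration rounds, keeps a fixed-size uniform sample of a stream in a single pass. A textbook sketch (Vitter's Algorithm R; the reservoir size and the stream are arbitrary choices for the demo):

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of size k from a stream of unknown
    length: item number t replaces a reservoir slot with probability k/t."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(x)       # fill the reservoir first
        else:
            j = rng.randrange(t)      # uniform in {0, ..., t-1}
            if j < k:                 # happens with probability k/t
                reservoir[j] = x
    return reservoir

rng = random.Random(0)
sample = reservoir_sample(range(10_000), k=32, rng=rng)
estimate = sum(sample) / len(sample)  # crude estimate of the stream mean
```

Averaging the reservoir gives exactly the kind of running mean estimate $\tilde\mu_{t,i}$ the paraphrase above calls for.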
THE RIGHT WAY TO GET VARIANCE BOUNDS
Instead of Exp3, use SCRiBLe:
$$p_t = \arg\min_{p\in\Delta_K}\Big\{\langle p, \tilde L_{t-1}\rangle + \Psi(p)\Big\}, \qquad \tilde L_{t-1,i} = \sum_{s=1}^{t-1}\big(\tilde c_{s,i} + \tilde\mu_{s,i}\big),$$
where $\tilde c_{t,i}$ is an appropriate unbiased estimate of $\ell_{t,i}-\tilde\mu_{t,i}$ and $\Psi$ is a self-concordant regularizer.
Theorem (Hazan and Kale, 2011): The regret of the above algorithm satisfies
$$R_T = O\bigg(K^2\sqrt{\sum_{t=1}^T\sum_{i=1}^K\big(\ell_{t,i}-\mu_{T,i}\big)^2}\bigg).$$
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | should be easy? |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | $R_T = O\big(\sqrt{\sum_i S_{T,i}\log K}\big)$ (Auer et al., 2002 + some hacking) |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | $R_T = O\big(K^2\sqrt{\sum_i V_{T,i}}\big)$ (Hazan and Kale, 2011) |
FIRST-ORDER BOUNDS (should be easy?)
Small-gain bounds: consider the gain game with $g_{t,i} = 1-\ell_{t,i}$.
Auer, Cesa-Bianchi, Freund and Schapire (2002): $R_T = O\big(\sqrt{K\,G_{T,i^*}\log K}\big)$, where $G_{T,i} = \sum_t g_{t,i}$.
Problem: only good if the best expert is bad!
A slightly trickier analysis gives $R_T = O\big(\sqrt{\sum_t\sum_i g_{t,i}\log K}\big)$, or $R_T = O\big(\sqrt{\sum_t\sum_i \ell_{t,i}\log K}\big)$.
Problem: one misbehaving action ruins the bound!
Actual first-order bounds, with $L_T^* = \min_i L_{T,i}$:
- Stoltz (2005): $K\sqrt{L_T^*}$
- Allenberg, Auer, Györfi and Ottucsák (2006): $\sqrt{K L_T^*}$
- Rakhlin and Sridharan (2013): $K^{3/2}\sqrt{L_T^*}$
THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Green (Allenberg, Auer, Györfi and Ottucsák, 2006), a variant of EXP3:
Parameters: $\eta > 0$, $\gamma \in (0,1)$. Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \frac{w_{t,i}}{\sum_j w_{t,j}}$, and set $p_{t,i} = 0$ if $p_{t,i} \le \gamma$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = w_{t,i}\,e^{-\eta\hat\ell_{t,i}}$.
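The only change relative to EXP3 is the thresholding step; one way the sampling distribution can be formed is sketched below (whether the surviving mass is redistributed exactly by renormalization is an implementation detail of this paraphrase, so treat it as an assumption):

```python
import numpy as np

def green_probs(w, gamma):
    """Sampling distribution of Green: start from the exponential-weights
    distribution, zero out arms whose probability is at most gamma, and
    renormalize over the surviving arms."""
    p = w / w.sum()
    p = np.where(p <= gamma, 0.0, p)   # drop arms with p_{t,i} <= gamma
    return p / p.sum()                 # renormalize the survivors
```

For example, with weights `[1.0, 1.0, 0.001]` and `gamma = 0.1`, the third arm is zeroed out and the first two each get probability 0.5; once an arm is dropped it receives no loss estimates, which is exactly why its estimated cumulative loss "stops growing" in the analysis below.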
THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Analysis idea:
- As long as $p_{t,i} \ge \gamma$ for an arm i, we have $\hat L_{t-1,i} \le \hat L_{t-1,j} + O\big(\log(1/\gamma)/\eta\big)$ for all j: the loss estimates are not too far apart.
- Once $p_{t,i} \le \gamma$ occurs, $\hat L_{t,i}$ stops growing, so $\hat L_{T,i} \le \hat L_{T,j} + O\big(\log(1/\gamma)/\eta\big) + O(1/\gamma)$.
THE GREEN ALGORITHM (ALLENBERG ET AL., 2006)
Getting back to the Exp3 proof:
$$R_T \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^T\sum_{i=1}^K p_{t,i}\hat\ell_{t,i}^2\Big] \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{i=1}^K \hat L_{T,i}\Big] \le \frac{\log K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\big[K\hat L_{T,i^*}\big] + O(K) \le \frac{\log K}{\eta} + \frac{\eta}{2}\,K L_T^* + O(K),$$
using $p_{t,i}\hat\ell_{t,i}^2 \le \hat\ell_{t,i}$ in the second step and the closeness of the loss estimates in the third.
Theorem (Allenberg et al., 2006): The regret of Green satisfies $R_T = O\big(\sqrt{K L_T^* \log K} + K\big)$.
A SIMPLER ALGORITHM: EXP3-IX
EXP3-IX (Kocák et al., 2014, Neu 2015a, Neu 2015b), a variant of EXP3:
Parameters: $\eta > 0$, $\gamma > 0$. Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \frac{w_{t,i}}{\sum_j w_{t,j}}$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}+\gamma}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = w_{t,i}\,e^{-\eta\hat\ell_{t,i}}$.
Theorem (Neu, 2015): The regret of Exp3-IX satisfies $R_T = O\big(\sqrt{K L_T^* \log K} + K\big)$.
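Since EXP3-IX differs from EXP3 only in the denominator of the loss estimate, it is a one-line change in code. A minimal sketch (the environment, horizon, and the particular $\eta, \gamma$ values are made up for the demo; the paper couples $\gamma$ to $\eta$, but any small pair illustrates the behaviour):

```python
import numpy as np

def exp3_ix(loss_fn, K, T, eta, gamma, rng):
    """EXP3-IX: exponential weights with the implicit-exploration (IX)
    loss estimate ell/(p + gamma), which is slightly biased downward but
    has far better-behaved variance than the vanilla estimate."""
    w = np.ones(K)
    pulls = np.zeros(K, dtype=int)
    for t in range(T):
        p = w / w.sum()
        I = rng.choice(K, p=p)
        pulls[I] += 1
        ell_hat = np.zeros(K)
        ell_hat[I] = loss_fn(t, I) / (p[I] + gamma)   # the IX estimate
        w *= np.exp(-eta * ell_hat)
    return pulls

rng = np.random.default_rng(0)
means = np.array([0.1, 0.9, 0.9])   # arm 0 is clearly best
T = 3000
pulls = exp3_ix(lambda t, i: float(rng.random() < means[i]),
                K=3, T=T, eta=0.05, gamma=0.025, rng=rng)
```

On this instance the pull counts concentrate heavily on arm 0.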
IMPLICIT EXPLORATION IN ACTION
$$\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}+\gamma}\mathbb{1}_{\{I_t=i\}}$$
[Figures: true losses vs. unbiased estimates; unbiased estimates + uniform exploration; IX estimates]
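Unlike the vanilla estimate, the IX estimate is biased: taking the expectation over $I_t$ gives $\mathbb{E}\big[\hat\ell_{t,i}\big] = \ell_{t,i}\,\frac{p_{t,i}}{p_{t,i}+\gamma} \le \ell_{t,i}$, so losses are optimistically underestimated, which acts like exploration without explicit uniform mixing. This can be checked exactly (the numbers are arbitrary):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])     # sampling distribution p_t
ell = np.array([0.8, 0.5, 0.2])   # true losses
gamma = 0.05

# E[ell_hat_i] over the draw I_t: arm i is observed with probability p_i,
# in which case the estimate equals ell_i / (p_i + gamma).
expected_ix = p * ell / (p + gamma)

assert np.all(expected_ix <= ell)  # biased downward for every arm...
# ...and the bias is exactly ell_i * gamma / (p_i + gamma):
assert np.allclose(ell - expected_ix, ell * gamma / (p + gamma))
```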
HIGHER-ORDER BOUNDS

| | Full information | Bandit |
|---|---|---|
| minimax | $R_T = O(\sqrt{T\log K})$ | $R_T = O(\sqrt{KT})$ |
| first-order: $L_{T,i} = \sum_t \ell_{t,i}$ | $R_T = O(\sqrt{L_{T,i^*}\log K})$ | $R_T = O\big(\sqrt{K L_{T,i^*}}\big)$ |
| second-order: $S_{T,i} = \sum_t \ell_{t,i}^2$ | $R_T = O(\sqrt{S_{T,i^*}\log K})$ (Cesa-Bianchi, Mansour, Stoltz, 2005) | $R_T = O\big(\sqrt{\sum_i S_{T,i}\log K}\big)$ (Auer et al., 2002 + some hacking) |
| variance: $V_{T,i} = \sum_t (\ell_{t,i}-m)^2$ | $R_T = O(\sqrt{V_{T,i^*}\log K})$ (Hazan and Kale, 2010; with a little cheating) | $R_T = O\big(K^2\sqrt{\sum_i V_{T,i}}\big)$ (Hazan and Kale, 2011) |
HIGHER-ORDER LOWER BOUNDS
Gerchinovitz and Lattimore (2016), heavily paraphrased:
Theorem: No algorithm can do better than $R_T = \Omega\big(\sqrt{L_T^* K}\big)$.
Theorem: No algorithm can do better than $R_T = \Omega\big(\sqrt{\sum_i V_{T,i}}\big)$.
BEYOND MINIMAX #2: STOCHASTIC LOSSES AND THE BEST OF BOTH WORLDS
TL;DR: $R_T = O(C_\nu \log T)$ is achievable for i.i.d. losses, where $C_\nu$ is a distribution-dependent constant.
THE BEST OF BOTH WORLDS
Is it possible to come up with an algorithm with $R_T = O(\sqrt{KT})$ for non-stochastic losses and $R_T = O(C_\nu \log T)$ for stochastic losses?
YES*!! (*almost)
THE BEST OF BOTH WORLDS: ALGORITHMS
Bubeck and Slivkins (2012):
- Assume that the environment is stochastic, act aggressively
- If the losses fail a stochasticity test, fall back to Exp3
- Regret: $O(\sqrt{KT})$ on adversarial losses, $O(\log^2 T)$ on stochastic losses
Auer and Chiang (2016), see Peter's talk tomorrow:
- Better test, better algorithm for stochastic losses
- Regret: $O(\sqrt{KT\log K})$ on adversarial losses, $O(C_\nu \log T)$ on stochastic losses
A SIMPLE ALGORITHM: EXP3++ (SELDIN AND SLIVKINS, 2014), PARAPHRASED
EXP3++ (SS, 2014), a variant of EXP3:
Parameters: learning rates $\eta_t > 0$, exploration rates $\varepsilon_{t,i} > 0$ (the "++").
Initialization: for all i, set $w_{1,i} = 1$.
For each round t = 1, 2, ..., T:
- For all i, let $p_{t,i} = \Big(1-\sum_j \varepsilon_{t,j}\Big)\frac{w_{t,i}}{\sum_j w_{t,j}} + \varepsilon_{t,i}$.
- Draw $I_t \sim p_t$.
- For all i, let $\hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}}\mathbb{1}_{\{I_t=i\}}$.
- For all i, update the weight as $w_{t+1,i} = \exp\big(-\eta_t \hat L_{t,i}\big)$.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Theorem (SS, 2014): The regret of Exp3++ satisfies $R_T \le 4\sqrt{TK\log K}$.
Proof idea: the $\varepsilon_{t,i}$ are small enough not to change the standard Exp3 analysis: $\varepsilon_{t,i} = O\big(\sqrt{\log K/(KT)}\big)$.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Theorem (SS, 2014): The regret of Exp3++ satisfies $R_T = O\big(C_\nu \log^3 T + C_\nu\big)$ in the stochastic case.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Proof ideas: Let $\Delta_i = \mu_i - \mu^*$.
Wishful thinking: if we had full information, then
$$p_{t,i} = \frac{e^{-t\eta_t\Delta_i}}{\sum_j e^{-t\eta_t\Delta_j}} \le e^{-t\eta_t\Delta_i}$$
would hold for all suboptimal arms i.
Thus, the expected number of suboptimal draws would be $\sum_{t=1}^T p_{t,i} \le \sum_{t=1}^T e^{-t\eta_t\Delta_i} = O\big(K/\Delta_i^2\big)$.
But we don't have full info :(
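To see where the $O\big(K/\Delta_i^2\big)$ comes from, the sum can be bounded by an integral; assuming the standard anytime tuning $\eta_t = \sqrt{\log K/(tK)}$ (so that $t\eta_t = \sqrt{t\log K/K}$), a sketch:

```latex
\sum_{t=1}^{T} e^{-t\eta_t \Delta_i}
  = \sum_{t=1}^{T} e^{-\Delta_i \sqrt{t \log K / K}}
  \le \int_0^\infty e^{-\Delta_i \sqrt{s \log K / K}} \, ds
  % substitute u = \Delta_i \sqrt{s \log K / K}, so ds = \frac{2Ku}{\Delta_i^2 \log K}\, du
  = \frac{2K}{\Delta_i^2 \log K} \int_0^\infty u\, e^{-u} \, du
  = \frac{2K}{\Delta_i^2 \log K}
  = O\!\left(\frac{K}{\Delta_i^2}\right).
```

The first inequality uses that the summand is decreasing in t, so the sum is dominated by the integral starting at 0.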
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Idea: ensure that the estimated gap is "reasonable":
$$t\hat\Delta_{t,i} \equiv \hat L_{t,i} - \hat L_t^* \ge t\Delta_i - o(t),$$
which is ensured by the exploration parameters $\varepsilon_{t,i}$!
For large enough t (i.e., $t \ge t^*$), we have $t\hat\Delta_{t,i} \ge t\Delta_i/2$.
This gives
$$p_{t,i} = \frac{e^{-t\eta_t\hat\Delta_{t,i}}}{\sum_j e^{-t\eta_t\hat\Delta_{t,j}}} \le e^{-t\eta_t\Delta_i/2}$$
for all suboptimal arms i.
Thus, $\sum_{t=1}^T p_{t,i} \le t^* + \sum_{t=t^*}^T e^{-t\eta_t\Delta_i/2} = t^* + O\big(K/\Delta_i^2\big)$.
The rest is grinding out the asymptotics.
EXP3++ ANALYSIS (HEAVILY PARAPHRASED)
Bottom line: if there is a linear gap between $L_{t,i}$ and $L_t^*$, this should be exposed in the estimated gap $t\hat\Delta_{t,i}$.
Corollaries: strong bounds whenever there is such a gap:
- contaminated stochastic
- adversarial with a gap
But that's the exact opposite of what we need for first-order bounds!
OPEN QUESTIONS
- Is there a way to exploit gaps that grow slower than linearly?
- Is there a way to improve the asymptotics? (In SS'14, $t^*$ is horribly big!)
- So far, all positive results hold only for oblivious adversaries; is it possible to extend them to adaptive ones? (See Peter's talk tomorrow!)
BEYOND MINIMAX #3: PRIOR-DEPENDENT BOUNDS
PRIOR-DEPENDENT BOUNDS FOR FULL INFO
Theorem (Luo and Schapire, 2015, Koolen and Van Erven, 2015, Orabona and Pal, 2016): There exist algorithms guaranteeing $R_T(\rho) = O\Big(\sqrt{T\big(1+\mathrm{RE}(\rho\|\pi)\big)}\Big)$ for any fixed prior $\pi \in \Delta_K$ and any comparator $\rho \in \Delta_K$.
Theorem (Even-Dar et al., 2007, Sani et al., 2014): There exist algorithms guaranteeing $R_T(i) = \mathrm{const}$ for any fixed i, while also guaranteeing $R_T = O(\sqrt{T})$.
Anything similar possible for bandits? NO* :( (*not quite)
PRIOR-DEPENDENT BOUNDS FOR BANDITS
Theorem (Lattimore, 2015), paraphrased: The regrets $R_T(i)$ need to satisfy $R_T(i) \ge \min\Big\{T,\ \sum_{j\neq i}\frac{T}{R_T(j)}\Big\}$. In particular, $R_T(i) = \mathrm{const}$ implies $R_T(j) = \Omega(T)$.
Fixing a prior $\pi$ and getting a bound $R_T(\rho) = O\Big(\sqrt{T\sum_j \rho_j/\pi_j}\Big)$ is not possible.
PRIOR-DEPENDENT BOUNDS: POSITIVE RESULTS
Lattimore (2015): For any regret bound satisfying the condition, there exists an algorithm achieving it in the stochastic setting. In particular, $\sqrt{T\sum_j \rho_j/\pi_j}$ is achievable (see also Rosin, 2011).
Neu (2016, made up on the flight here): For non-stochastic bandits, there is an algorithm with $R_T(i) = O\big(\sqrt{KT}\,\operatorname{softmax}(\pi)/\pi_i\big)$.
BEYOND MINIMAX: CONCLUSIONS
CONCLUSIONS
Higher-order bounds:
- First-order bounds are possible, like in full info
- Second-order bounds: much weaker than in full info
Best-of-both-worlds bounds:
- Possible and strong against oblivious adversaries
- Only weak guarantees for adaptive adversaries
Prior-dependent bounds:
- Nothing fancy is possible
THANKS!!!
More informationLecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem
Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:
More informationAdvanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12
Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview
More informationLearning with Exploration
Learning with Exploration John Langford (Yahoo!) { With help from many } Austin, March 24, 2011 Yahoo! wants to interactively choose content and use the observed feedback to improve future content choices.
More informationGambling in a rigged casino: The adversarial multi-armed bandit problem
Gambling in a rigged casino: The adversarial multi-armed bandit problem Peter Auer Institute for Theoretical Computer Science University of Technology Graz A-8010 Graz (Austria) pauer@igi.tu-graz.ac.at
More informationLecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora
princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: (Today s notes below are
More informationAlgorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning
Proceedings of Machine Learning Research vol 65:1 17, 2017 Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Nicolò Cesa-Bianchi Università degli Studi di Milano, Milano,
More informationAlgorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning
Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, Sébastien Gerchinovitz To cite this version: Nicolò Cesa-Bianchi,
More informationToward a Classification of Finite Partial-Monitoring Games
Toward a Classification of Finite Partial-Monitoring Games Gábor Bartók ( *student* ), Dávid Pál, and Csaba Szepesvári Department of Computing Science, University of Alberta, Canada {bartok,dpal,szepesva}@cs.ualberta.ca
More informationLecture 3: Lower Bounds for Bandit Algorithms
CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and
More informationCombinatorial Bandits
Combinatorial Bandits Nicolò Cesa-Bianchi Università degli Studi di Milano, Italy Gábor Lugosi 1 ICREA and Pompeu Fabra University, Spain Abstract We study sequential prediction problems in which, at each
More informationThe Multi-Armed Bandit Problem
The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist
More informationPartial monitoring classification, regret bounds, and algorithms
Partial monitoring classification, regret bounds, and algorithms Gábor Bartók Department of Computing Science University of Alberta Dávid Pál Department of Computing Science University of Alberta Csaba
More informationNo-Regret Algorithms for Unconstrained Online Convex Optimization
No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com
More informationOnline Learning with Gaussian Payoffs and Side Observations
Online Learning with Gaussian Payoffs and Side Observations Yifan Wu 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science University of Alberta 2 Department of Electrical and Electronic
More informationThe Online Approach to Machine Learning
The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I
More informationBetter Algorithms for Benign Bandits
Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com
More informationAdaptive Online Learning in Dynamic Environments
Adaptive Online Learning in Dynamic Environments Lijun Zhang, Shiyin Lu, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology Nanjing University, Nanjing 210023, China {zhanglj, lusy, zhouzh}@lamda.nju.edu.cn
More information0.1 Motivating example: weighted majority algorithm
princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes
More informationTHE first formalization of the multi-armed bandit problem
EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can
More informationIntroduction to Bandit Algorithms. Introduction to Bandit Algorithms
Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,
More informationA Drifting-Games Analysis for Online Learning and Applications to Boosting
A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire
More informationMulti-armed Bandits in the Presence of Side Observations in Social Networks
52nd IEEE Conference on Decision and Control December 0-3, 203. Florence, Italy Multi-armed Bandits in the Presence of Side Observations in Social Networks Swapna Buccapatnam, Atilla Eryilmaz, and Ness
More informationOnline learning with feedback graphs and switching costs
Online learning with feedback graphs and switching costs A Proof of Theorem Proof. Without loss of generality let the independent sequence set I(G :T ) formed of actions (or arms ) from to. Given the sequence
More informationMistake Bounds on Noise-Free Multi-Armed Bandit Game
Mistake Bounds on Noise-Free Multi-Armed Bandit Game Atsuyoshi Nakamura a,, David P. Helmbold b, Manfred K. Warmuth b a Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan b University
More informationOnline learning with noisy side observations
Online learning with noisy side observations Tomáš Kocák Gergely Neu Michal Valko Inria Lille - Nord Europe, France DTIC, Universitat Pompeu Fabra, Barcelona, Spain Inria Lille - Nord Europe, France SequeL
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationThe No-Regret Framework for Online Learning
The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,
More informationCombinatorial Multi-Armed Bandit: General Framework, Results and Applications
Combinatorial Multi-Armed Bandit: General Framework, Results and Applications Wei Chen Microsoft Research Asia, Beijing, China Yajun Wang Microsoft Research Asia, Beijing, China Yang Yuan Computer Science
More informationTwo generic principles in modern bandits: the optimistic principle and Thompson sampling
Two generic principles in modern bandits: the optimistic principle and Thompson sampling Rémi Munos INRIA Lille, France CSML Lunch Seminars, September 12, 2014 Outline Two principles: The optimistic principle
More informationLogarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract
More informationReinforcement Learning
Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce bandit optimisation: the
More informationAn adaptive algorithm for finite stochastic partial monitoring
An adaptive algorithm for finite stochastic partial monitoring Gábor Bartók Navid Zolghadr Csaba Szepesvári Department of Computing Science, University of Alberta, AB, Canada, T6G E8 bartok@ualberta.ca
More informationarxiv: v3 [cs.lg] 30 Jun 2012
arxiv:05874v3 [cslg] 30 Jun 0 Orly Avner Shie Mannor Department of Electrical Engineering, Technion Ohad Shamir Microsoft Research New England Abstract We consider a multi-armed bandit problem where the
More informationBandit Convex Optimization: T Regret in One Dimension
Bandit Convex Optimization: T Regret in One Dimension arxiv:1502.06398v1 [cs.lg 23 Feb 2015 Sébastien Bubeck Microsoft Research sebubeck@microsoft.com Tomer Koren Technion tomerk@technion.ac.il February
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationBandit View on Continuous Stochastic Optimization
Bandit View on Continuous Stochastic Optimization Sébastien Bubeck 1 joint work with Rémi Munos 1 & Gilles Stoltz 2 & Csaba Szepesvari 3 1 INRIA Lille, SequeL team 2 CNRS/ENS/HEC 3 University of Alberta
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Nicolò Cesa-Bianchi Università degli Studi di Milano Joint work with: Noga Alon (Tel-Aviv University) Ofer Dekel (Microsoft Research) Tomer Koren (Technion and Microsoft
More informationAnytime optimal algorithms in stochastic multi-armed bandits
Rémy Degenne LPMA, Université Paris Diderot Vianney Perchet CREST, ENSAE REMYDEGENNE@MATHUNIV-PARIS-DIDEROTFR VIANNEYPERCHET@NORMALESUPORG Abstract We introduce an anytime algorithm for stochastic multi-armed
More informationRegret Minimization for Branching Experts
JMLR: Workshop and Conference Proceedings vol 30 (2013) 1 21 Regret Minimization for Branching Experts Eyal Gofer Tel Aviv University, Tel Aviv, Israel Nicolò Cesa-Bianchi Università degli Studi di Milano,
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More informationOnline Optimization : Competing with Dynamic Comparators
Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online
More informationLearning for Contextual Bandits
Learning for Contextual Bandits Alina Beygelzimer 1 John Langford 2 IBM Research 1 Yahoo! Research 2 NYC ML Meetup, Sept 21, 2010 Example of Learning through Exploration Repeatedly: 1. A user comes to
More informationPure Exploration Stochastic Multi-armed Bandits
C&A Workshop 2016, Hangzhou Pure Exploration Stochastic Multi-armed Bandits Jian Li Institute for Interdisciplinary Information Sciences Tsinghua University Outline Introduction 2 Arms Best Arm Identification
More informationBandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University
Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3
More informationMulti-Armed Bandit Formulations for Identification and Control
Multi-Armed Bandit Formulations for Identification and Control Cristian R. Rojas Joint work with Matías I. Müller and Alexandre Proutiere KTH Royal Institute of Technology, Sweden ERNSI, September 24-27,
More informationBetter Algorithms for Benign Bandits
Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com
More informationUniversity of Alberta. The Role of Information in Online Learning
University of Alberta The Role of Information in Online Learning by Gábor Bartók A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree
More informationBandits for Online Optimization
Bandits for Online Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Bandits for Online Optimization 1 / 16 The multiarmed bandit problem... K slot machines Each
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationAccelerating Online Convex Optimization via Adaptive Prediction
Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework
More informationStochastic bandits: Explore-First and UCB
CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this
More informationarxiv: v3 [cs.lg] 6 Jun 2017
Proceedings of Machine Learning Research vol 65:1 27, 2017 Corralling a Band of Bandit Algorithms arxiv:1612.06246v3 [cs.lg 6 Jun 2017 Alekh Agarwal Microsoft Research, New York Haipeng Luo Microsoft Research,
More informationInformational Confidence Bounds for Self-Normalized Averages and Applications
Informational Confidence Bounds for Self-Normalized Averages and Applications Aurélien Garivier Institut de Mathématiques de Toulouse - Université Paul Sabatier Thursday, September 12th 2013 Context Tree
More informationOptimal and Adaptive Online Learning
Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research 1 Microsoft Way, Redmond,
More informationDynamic Regret of Strongly Adaptive Methods
Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret
More informationPAC Subset Selection in Stochastic Multi-armed Bandits
In Langford, Pineau, editors, Proceedings of the 9th International Conference on Machine Learning, pp 655--66, Omnipress, New York, NY, USA, 0 PAC Subset Selection in Stochastic Multi-armed Bandits Shivaram
More informationOn the Complexity of Best Arm Identification with Fixed Confidence
On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, joint work with Emilie Kaufmann CNRS, CRIStAL) to be presented at COLT 16, New York
More informationMulti-Armed Bandits with Metric Movement Costs
Multi-Armed Bandits with Metric Movement Costs Tomer Koren 1, Roi Livni 2, and Yishay Mansour 3 1 Google; tkoren@google.com 2 Princeton University; rlivni@cs.princeton.edu 3 Tel Aviv University; mansour@cs.tau.ac.il
More informationOnline Optimization with Gradual Variations
JMLR: Workshop and Conference Proceedings vol (0) 0 Online Optimization with Gradual Variations Chao-Kai Chiang, Tianbao Yang 3 Chia-Jung Lee Mehrdad Mahdavi 3 Chi-Jen Lu Rong Jin 3 Shenghuo Zhu 4 Institute
More informationFrom Bandits to Experts: A Tale of Domination and Independence
From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A
More informationMultiple Identifications in Multi-Armed Bandits
Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao
More information