Reducing contextual bandits to supervised learning


Daniel Hsu (Columbia University). Based on joint work with A. Agarwal, S. Kale, J. Langford, L. Li, and R. Schapire.

Learning to interact: example #1
Practicing physician. Loop:
1. Patient arrives with symptoms, medical history, genome, ...
2. Prescribe treatment.
3. Observe impact on patient's health (e.g., improves, worsens).
Goal: prescribe treatments that yield good health outcomes.

Learning to interact: example #2
Website operator. Loop:
1. User visits website with profile, browsing history, ...
2. Choose content to display on website.
3. Observe user reaction to content (e.g., click, "like").
Goal: choose content that yields desired user behavior.

Contextual bandit problem
For t = 1, 2, ..., T:
0. Nature draws (x_t, r_t) from a distribution D over X × [0,1]^A.
1. Observe context x_t ∈ X. [e.g., user profile, search query]
2. Choose action a_t ∈ A. [e.g., ad to display]
3. Collect reward r_t(a_t) ∈ [0,1]. [e.g., 1 if click, 0 otherwise]
Task: choose the a_t's to yield high expected reward (w.r.t. D).
Contextual: use the features x_t to choose good actions a_t.
Bandit: r_t(a) for a ≠ a_t is not observed.
(Non-bandit setting: the whole reward vector r_t ∈ [0,1]^A is observed.)
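To make the protocol concrete, here is a minimal Python sketch of the interaction loop as a simulation; the `learner` and `draw_context_and_rewards` interfaces are illustrative assumptions for this writeup, not part of the talk.

```python
def run_contextual_bandit(learner, draw_context_and_rewards, T):
    """Simulate the protocol above for T rounds.

    draw_context_and_rewards() plays the role of nature's draw from D:
    it returns a context x_t and a dict r_t mapping each action to its
    reward.  Only r_t[a_t] is revealed to the learner (bandit feedback).
    """
    total_reward = 0.0
    for _ in range(T):
        x_t, r_t = draw_context_and_rewards()   # step 0 (hidden from the learner)
        a_t = learner.choose_action(x_t)        # steps 1-2
        reward = r_t[a_t]                       # step 3: bandit feedback only
        learner.observe(x_t, a_t, reward)
        total_reward += reward
    return total_reward / T
```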

Challenges
1. Exploration vs. exploitation. Use what you've already learned (exploit), but also learn about actions that could be good (explore). Must balance the two to get good statistical performance.
2. Must use context. Want to do as well as the best policy (i.e., decision rule) π : context x ↦ action a from some policy class Π (a set of decision rules). Computationally constrained with large Π (e.g., all decision trees).
3. Selection bias, especially while exploiting.

Learning objective
Regret (i.e., relative performance) with respect to a policy class Π:
\[
\underbrace{\max_{\pi \in \Pi} \frac{1}{T}\sum_{t=1}^{T} r_t(\pi(x_t))}_{\text{average reward of best policy}}
\;-\;
\underbrace{\frac{1}{T}\sum_{t=1}^{T} r_t(a_t)}_{\text{average reward of learner}}
\]
Strong benchmark if Π contains a policy with high expected reward!
Goal: regret → 0 as fast as possible as T → ∞.
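A small sketch of the empirical version of this quantity, assuming a simulation in which the full reward vector was logged each round (the function name and log format are assumptions for illustration):

```python
def empirical_average_regret(log, policies):
    """Empirical regret of a run.  Computable only in simulation, where the
    full reward vector r_t (a dict action -> reward) was logged alongside the
    chosen action a_t for every round."""
    T = len(log)
    learner_avg = sum(r_t[a_t] for (_, r_t, a_t) in log) / T
    best_policy_avg = max(
        sum(r_t[pi(x_t)] for (x_t, r_t, _) in log) / T for pi in policies
    )
    return best_policy_avg - learner_avg
```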

Our result
New fast and simple algorithm for contextual bandits.
Operates via a computationally efficient reduction to supervised learning.
Statistically (near) optimal regret bound.

Need for exploration
No-exploration approach:
1. Using historical data, learn a reward predictor r̂(a | x) for each action a ∈ A based on the context x ∈ X.
2. Then deploy the policy π̂ given by π̂(x) := arg max_{a ∈ A} r̂(a | x), and collect more data.
Suffers from selection bias.
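A minimal sketch of this no-exploration loop, assuming a small discrete context space and a trivial averaging predictor in place of a learned regression model; the class name and interface are illustrative only (they match the simulation sketch earlier).

```python
from collections import defaultdict

class NoExplorationLearner:
    """Greedy, no-exploration learner: estimate r(a | x) by the average
    observed reward for the (x, a) pair and always act greedily on the
    current estimates."""

    def __init__(self, actions):
        self.actions = actions
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def estimate(self, x, a):
        n = self.counts[(x, a)]
        return self.sums[(x, a)] / n if n > 0 else 0.0

    def choose_action(self, x):
        # Deployed policy: exploit the current estimates, never explore.
        return max(self.actions, key=lambda a: self.estimate(x, a))

    def observe(self, x, a, reward):
        self.sums[(x, a)] += reward
        self.counts[(x, a)] += 1
```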

Using no-exploration
Example: two contexts {X, Y} and two actions {A, B}. Suppose the initial policy says π̂(X) = A and π̂(Y) = B.
Observed rewards so far: 0.7 for action A in context X, and 0.1 for action B in context Y; the other (context, action) pairs are unobserved.
The resulting reward estimates yield a new policy: π̂'(X) = π̂'(Y) = A.
Deploying π̂', we keep observing only action A's rewards, so action B is never tried in context X. Since under the true rewards B is the better action in context X, this incurs Ω(1) regret.

Dealing with policies
Feedback in round t: the reward r_t(a_t) of the chosen action. Tells us about policies π ∈ Π such that π(x_t) = a_t. Not informative about other policies!
Possible approach: track the average reward of each π ∈ Π.
Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995).
Statistically optimal regret bound O(√(K log N / T)) for K := |A| actions and N := |Π| policies after T rounds.
Explicit bookkeeping is computationally intractable for large N.
But perhaps the policy class Π has some structure...

Hypothetical full-information setting
If we observed the rewards for all actions...
Like supervised learning, we would have labeled data after t rounds: (x_1, ρ_1), ..., (x_t, ρ_t) ∈ X × R^A.
Correspondence: context ↔ features; actions ↔ classes; rewards ↔ costs; policy ↔ classifier.
Can often exploit the structure of Π to get tractable algorithms.
Abstraction: arg-max oracle (AMO):
\[
\mathrm{AMO}\big(\{(x_i, \rho_i)\}_{i=1}^{t}\big) := \arg\max_{\pi \in \Pi} \sum_{i=1}^{t} \rho_i(\pi(x_i)).
\]
Can't directly use this in the bandit setting.
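To make the abstraction concrete, here is a brute-force sketch of an AMO over an explicitly enumerated policy list; this enumeration is purely for illustration, since the point of the oracle is that real implementations exploit the structure of Π (e.g., a cost-sensitive classification learner) instead of enumerating.

```python
def amo(policies, dataset):
    """Brute-force arg-max oracle.

    dataset: list of (x_i, rho_i) with rho_i a dict action -> reward.
    Returns the policy in `policies` maximizing sum_i rho_i(pi(x_i)).
    """
    return max(policies, key=lambda pi: sum(rho[pi(x)] for x, rho in dataset))
```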

Using AMO with some exploration
Explore-then-exploit:
1. In the first τ rounds, choose a_t ∈ A uniformly at random to get unbiased estimates r̂_t of r_t for all t ≤ τ.
2. Get π̂ := AMO({(x_t, r̂_t)}_{t=1}^τ).
3. Henceforth use a_t := π̂(x_t), for t = τ+1, τ+2, ..., T.
Regret bound with the best choice of τ: on the order of T^{-1/3} (sub-optimal). (Dependences on |A| and |Π| hidden.)
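A hedged sketch of explore-then-exploit under the same simulation assumptions as above (the `env` and `amo` interfaces are the illustrative ones introduced earlier, not the talk's notation):

```python
import random

def explore_then_exploit(actions, amo, policies, env, T, tau):
    """Uniform exploration for tau rounds, one AMO call, then pure exploitation."""
    K = len(actions)
    dataset = []
    # Phase 1: explore uniformly at random; p_t(a) = 1/K, so the unbiased
    # importance-weighted estimate is K * r_t(a_t) on the chosen action, 0 elsewhere.
    for _ in range(tau):
        x_t, r_t = env()
        a_t = random.choice(actions)
        r_hat = {a: (K * r_t[a_t] if a == a_t else 0.0) for a in actions}
        dataset.append((x_t, r_hat))
    # Phase 2: one oracle call, then exploit the learned policy.
    pi_hat = amo(policies, dataset)
    total = 0.0
    for _ in range(tau, T):
        x_t, r_t = env()
        total += r_t[pi_hat(x_t)]
    return pi_hat, total / max(T - tau, 1)
```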

Previous contextual bandit algorithms
Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995): optimal regret, but explicitly enumerates Π.
Epoch-Greedy (Langford & Zhang, NIPS 2007): sub-optimal regret, but one call to AMO.
Monster (Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, & Zhang, UAI 2011): near-optimal regret, but O(T^6) calls to AMO.

Our result
Let K := |A| and N := |Π|. Our result: a new, fast and simple algorithm.
Regret bound: Õ(√(K log N / T)). Near optimal.
# calls to AMO: Õ(√(TK / log N)). Less than once per round!

Rest of the talk
Components of the new algorithm (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS):
1. Classical tricks: randomization, inverse propensity weighting.
2. Efficient algorithm for balancing exploration/exploitation.
3. Additional tricks: warm-start and epoch structure.

1. Classical tricks

What would've happened if I had done X?
For t = 1, 2, ..., T:
0. Nature draws (x_t, r_t) from a distribution D over X × [0,1]^A.
1. Observe context x_t ∈ X. [e.g., user profile, search query]
2. Choose action a_t ∈ A. [e.g., ad to display]
3. Collect reward r_t(a_t) ∈ [0,1]. [e.g., 1 if click, 0 otherwise]
Q: How do I learn about r_t(a) for actions a I don't actually take?
A: Randomize. Draw a_t ∼ p_t for some pre-specified probability distribution p_t.

Inverse propensity weighting (Horvitz & Thompson, JASA 1952)
Importance-weighted estimate of the reward from round t, for each a ∈ A:
\[
\hat r_t(a) := \frac{r_t(a_t)\,\mathbf{1}\{a = a_t\}}{p_t(a_t)}
= \begin{cases} r_t(a_t)/p_t(a_t) & \text{if } a = a_t,\\ 0 & \text{otherwise.}\end{cases}
\]
Unbiasedness:
\[
\mathbb{E}_{a_t \sim p_t}\big[\hat r_t(a)\big]
= \sum_{a' \in A} p_t(a')\,\frac{r_t(a')\,\mathbf{1}\{a = a'\}}{p_t(a')} = r_t(a).
\]
Range and variance: upper-bounded by 1/p_t(a).
Estimate of a policy's average reward: \widehat{Rew}_t(π) := (1/t) Σ_{i=1}^t r̂_i(π(x_i)).
How should we choose the p_t?
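A minimal Python sketch of these two estimators, under the same dict-based conventions as the earlier sketches (the function names are assumptions, not the talk's):

```python
def ips_reward_estimate(actions, a_t, reward, p_t):
    """Inverse propensity weighted estimate of the full reward vector from a
    single round: reward / p_t[a_t] on the chosen action, 0 elsewhere."""
    return {a: (reward / p_t[a_t] if a == a_t else 0.0) for a in actions}


def estimated_policy_reward(pi, ips_log):
    """Estimate of a policy's average reward: average the IPS estimates along
    the policy's own choices.  ips_log is a list of (x_i, r_hat_i) pairs."""
    return sum(r_hat[pi(x)] for x, r_hat in ips_log) / len(ips_log)
```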

Hedging over policies
Get action distributions via policy distributions:
(Q, x) [policy distribution, context] ↦ p [action distribution].
Policy distribution: Q = (Q(π) : π ∈ Π), a probability distribution over the policies π in the policy class Π.
1: Pick initial distribution Q_1 over policies Π.
2: for round t = 1, 2, ... do
3:   Nature draws (x_t, r_t) from distribution D over X × [0,1]^A.
4:   Observe context x_t.
5:   Compute distribution p_t over A (using Q_t and x_t).
6:   Pick action a_t ∼ p_t.
7:   Collect reward r_t(a_t).
8:   Compute new distribution Q_{t+1} over policies Π.
9: end for

2. Efficient construction of good policy distributions

Our approach
Q: How do we choose Q_t for good exploration/exploitation?
Caveat: Q_t must be efficiently computable and representable!
Our approach:
1. Define a convex feasibility problem (over distributions Q on Π) such that solutions yield (near) optimal regret bounds.
2. Design an algorithm that finds a sparse solution Q.
The algorithm only accesses Π via calls to the AMO, so nnz(Q) = O(# AMO calls).

The good policy distribution problem
Convex feasibility problem for the policy distribution Q:
\[
\sum_{\pi \in \Pi} Q(\pi)\, \widehat{Reg}_t(\pi) \;\le\; \sqrt{\frac{K \log N}{t}} \qquad \text{(Low regret)}
\]
\[
\widehat{\mathrm{var}}\big(\widehat{Rew}_t(\pi)\big) \;\le\; b(\pi) \quad \forall\, \pi \in \Pi \qquad \text{(Low variance)}
\]
Using a feasible Q_t in round t gives near-optimal regret.
But there are |Π| variables and more than |Π| constraints, ...

Solving the convex feasibility problem
Solver for the good policy distribution problem. Start with some Q (e.g., Q := 0), then repeat:
1. If the low-regret constraint is violated, fix it by rescaling: Q := cQ for some c < 1.
2. Find the most violated low-variance constraint, say corresponding to policy π̃, and update Q(π̃) := Q(π̃) + α. (If no such violated constraint exists, stop and return Q.)
(c < 1 and α > 0 have closed-form formulae.)
(Technical detail: Q can be a sub-distribution that sums to less than one.)

Implementation via AMO
Finding a low-variance constraint violation:
1. Create fictitious rewards for each i = 1, 2, ..., t: r̃_i(a) := r̂_i(a) + μ / Q(a | x_i) for all a ∈ A, where μ ≈ √(log N / (Kt)).
2. Obtain π̃ := AMO({(x_i, r̃_i)}_{i=1}^t).
3. \widehat{Rew}_t(π̃) > threshold iff π̃'s low-variance constraint is violated.
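The sketch below is one plausible rendering of this check, under the conventions of the earlier sketches; the exact threshold and constant factors come from the paper's analysis and are taken here as parameters, and comparing the returned policy's average fictitious reward against that threshold is an assumption of this rendering rather than the paper's precise recipe.

```python
import math

def find_violated_low_variance_constraint(amo, policies, history, Q_cond,
                                          K, N, threshold):
    """history: list of (x_i, r_hat_i) IPS-estimated rounds.
    Q_cond(a, x): smoothed action probability Q(a | x), assumed > 0.
    Returns a policy whose low-variance constraint appears violated, or None."""
    t = len(history)
    mu = math.sqrt(math.log(N) / (K * t))
    # Fictitious rewards: r~_i(a) = r^_i(a) + mu / Q(a | x_i).
    fictitious = [
        (x_i, {a: r_hat_i[a] + mu / Q_cond(a, x_i) for a in r_hat_i})
        for x_i, r_hat_i in history
    ]
    pi_tilde = amo(policies, fictitious)                      # one oracle call
    score = sum(r[pi_tilde(x)] for x, r in fictitious) / t
    return pi_tilde if score > threshold else None
```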

Iteration bound
The solver is coordinate descent for minimizing the potential function
\[
\Phi(Q) := c_1\, \hat{\mathbb{E}}_x\Big[\mathrm{RE}\big(\text{uniform}\,\big\|\,Q(\cdot \mid x)\big)\Big] + c_2 \sum_{\pi \in \Pi} Q(\pi)\, \widehat{Reg}_t(\pi).
\]
(The partial derivative with respect to Q(π) is the low-variance constraint for π.)
It returns a feasible solution after Õ(√(Kt / log N)) steps.
(Actually use (1 − ε)Q + ε·uniform inside the RE expression.)

Algorithm
1: Pick initial distribution Q_1 over policies Π.
2: for round t = 1, 2, ... do
3:   Nature draws (x_t, r_t) from distribution D over X × [0,1]^A.
4:   Observe context x_t.
5:   Compute action distribution p_t := Q_t(· | x_t).
6:   Pick action a_t ∼ p_t.
7:   Collect reward r_t(a_t).
8:   Compute new policy distribution Q_{t+1} using coordinate descent + AMO.
9: end for
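A minimal Python skeleton of this loop, assuming the simulation interfaces from the earlier sketches; `recompute_Q` stands in for the coordinate-descent + AMO routine of the previous slides, and `mu` is a smoothing parameter ensuring every action gets positive probability (both are assumptions of this sketch, not the paper's exact parameterization).

```python
import random

def run_full_algorithm(policies, actions, env, recompute_Q, T, mu):
    """Skeleton of the per-round loop: project Q onto actions, sample,
    form an IPS estimate, and hand the history to the distribution update."""
    Q = {pi: 1.0 / len(policies) for pi in policies}   # initial Q_1
    history = []
    K = len(actions)
    for _ in range(T):
        x_t, r_t = env()                               # nature draws (x_t, r_t)
        # p_t := smoothed projection of Q onto actions, given x_t.
        proj = {a: 0.0 for a in actions}
        for pi, w in Q.items():
            proj[pi(x_t)] += w
        p_t = {a: (1 - K * mu) * proj[a] + mu for a in actions}
        a_t = random.choices(actions, weights=[p_t[a] for a in actions])[0]
        reward = r_t[a_t]                              # bandit feedback only
        r_hat = {a: (reward / p_t[a_t] if a == a_t else 0.0) for a in actions}
        history.append((x_t, r_hat))
        Q = recompute_Q(Q, history)                    # coordinate descent + AMO
    return Q
```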

Recap
A feasible solution to the good policy distribution problem gives a near-optimal regret bound.
New coordinate descent algorithm: repeatedly find a violated constraint and adjust Q to satisfy it.
Analysis: in round t, nnz(Q_{t+1}) = O(# AMO calls) = Õ(√(Kt / log N)).

3. Additional tricks: warm-start and epoch structure

Total complexity over all rounds
In round t, coordinate descent for computing Q_{t+1} requires Õ(√(Kt / log N)) AMO calls.
To compute Q_{t+1} in every round t = 1, 2, ..., T, we would need Õ(√(K / log N) · T^{1.5}) AMO calls over T rounds.

Warm start
To compute Q_{t+1} using coordinate descent, initialize with Q_t.
1. The total epoch-to-epoch increase in the potential is Õ(√(T/K)) over all T rounds (w.h.p., exploiting the i.i.d. assumption).
2. Each coordinate descent step decreases the potential by Ω(log N / K).
3. Over all T rounds, the total # calls to AMO is Õ(√(KT / log N)).
But we still need an AMO call just to check whether Q_t is feasible!

Epoch trick
Regret analysis: Q_t has low instantaneous per-round regret; this also crucially relies on the i.i.d. assumption.
⇒ the same Q_t can be used for O(t) more rounds!
Epoch trick: split the T rounds into epochs, and only compute Q_t at the start of each epoch.
Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, ...; about log T updates, so Õ(√(KT / log N)) AMO calls overall.
Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, ...; about √T updates, so Õ(√(K / log N)) AMO calls per update, on average.
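A short sketch of the two epoch schedules (the function name and interface are illustrative assumptions):

```python
def epoch_update_rounds(T, schedule="squares"):
    """Rounds at which Q_t is recomputed under the two epoch schedules above:
    powers of two (about log T updates) or perfect squares (about sqrt(T))."""
    if schedule == "doubling":
        rounds, r = [], 2
        while r <= T:
            rounds.append(r)
            r *= 2
        return rounds
    if schedule == "squares":
        return [i * i for i in range(1, int(T ** 0.5) + 1)]
    raise ValueError("unknown schedule: " + schedule)
```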

Warm start + epoch trick
Over all T rounds: update the policy distribution on rounds 1^2, 2^2, 3^2, 4^2, ..., i.e., about √T times in total.
Total # calls to AMO: Õ(√(KT / log N)).
# AMO calls per update (on average): Õ(√(K / log N)).

4. Closing remarks and open problems

Recap
1. New algorithm for general contextual bandits.
2. Accesses the policy class Π only via the AMO.
3. Defined a convex feasibility problem over policy distributions that are good for exploration/exploitation: regret Õ(√(K log N / T)). Coordinate descent finds an Õ(√(KT / log N))-sparse solution.
4. The epoch structure allows the policy distribution to change very infrequently; combined with warm start, this gives the computational improvements.

Open problems
1. Empirical evaluation.
2. Adaptive algorithms that take advantage of problem easiness.
3. Alternatives to the AMO.
Thanks!


Projections of policy distributions
Given a policy distribution Q and a context x, for each a ∈ A:
Q(a | x) := Σ_{π ∈ Π} Q(π) 1{π(x) = a}
(so Q ↦ Q(· | x) is a linear map).
We actually use p_t := Q_t^{μ_t}(· | x_t) := (1 − Kμ_t) Q_t(· | x_t) + μ_t · 1, so every action has probability at least μ_t (to be determined).
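A minimal sketch of this projection and smoothing step, using the same dict-based conventions as the earlier sketches (names are illustrative):

```python
def smoothed_action_distribution(Q, x, actions, mu):
    """Q^mu(. | x): project the policy distribution onto actions for context x,
    then mix in mu of mass per action so every action has probability >= mu.
    Q is a dict policy -> weight."""
    proj = {a: 0.0 for a in actions}
    for pi, w in Q.items():
        proj[pi(x)] += w        # Q(a | x) = sum_pi Q(pi) * 1{pi(x) = a}
    K = len(actions)
    return {a: (1 - K * mu) * proj[a] + mu for a in actions}
```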

The potential function
\[
\Phi(Q) := \frac{t\mu_t}{1 - K\mu_t}\, \hat{\mathbb{E}}_{x \sim H_t}\Big[\mathrm{RE}\big(\text{uniform}\,\big\|\,Q^{\mu_t}(\cdot \mid x)\big)\Big]
+ \frac{1}{K t \mu_t} \sum_{\pi \in \Pi} Q(\pi)\, \widehat{Reg}_t(\pi).
\]
