Reducing contextual bandits to supervised learning


Daniel Hsu (Columbia University). Based on joint work with A. Agarwal, S. Kale, J. Langford, L. Li, and R. Schapire.

Learning to interact: example #1
Practicing physician. Loop:
1. Patient arrives with symptoms, medical history, genome, ...
2. Prescribe treatment.
3. Observe impact on patient's health (e.g., improves, worsens).
Goal: prescribe treatments that yield good health outcomes.

Learning to interact: example #2
Website operator. Loop:
1. User visits website with profile, browsing history, ...
2. Choose content to display on website.
3. Observe user reaction to content (e.g., click, "like").
Goal: choose content that yields desired user behavior.

Contextual bandit problem
For t = 1, 2, ..., T:
0. Nature draws (x_t, r_t) from a distribution D over X × [0,1]^A.
1. Observe context x_t ∈ X. [e.g., user profile, search query]
2. Choose action a_t ∈ A. [e.g., ad to display]
3. Collect reward r_t(a_t) ∈ [0,1]. [e.g., 1 if click, 0 otherwise]
Task: choose the a_t's to yield high expected reward (w.r.t. D).
Contextual: use the features x_t to choose good actions a_t.
Bandit: r_t(a) for a ≠ a_t is not observed.
(Non-bandit setting: the whole reward vector r_t ∈ [0,1]^A is observed.)
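To make the protocol concrete, here is a minimal Python sketch of the interaction loop as a simulation; the `learner` and `draw_context_and_rewards` interfaces are illustrative assumptions for this writeup, not part of the talk.

```python
def run_contextual_bandit(learner, draw_context_and_rewards, T):
    """Simulate the protocol above for T rounds.

    draw_context_and_rewards() plays the role of nature's draw from D:
    it returns a context x_t and a dict r_t mapping each action to its
    reward.  Only r_t[a_t] is revealed to the learner (bandit feedback).
    """
    total_reward = 0.0
    for _ in range(T):
        x_t, r_t = draw_context_and_rewards()   # step 0 (hidden from the learner)
        a_t = learner.choose_action(x_t)        # steps 1-2
        reward = r_t[a_t]                       # step 3: bandit feedback only
        learner.observe(x_t, a_t, reward)
        total_reward += reward
    return total_reward / T
```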

Challenges
1. Exploration vs. exploitation. Use what you've already learned (exploit), but also learn about actions that could be good (explore). Must balance the two to get good statistical performance.
2. Must use context. Want to do as well as the best policy (i.e., decision rule) π : context x ↦ action a from some policy class Π (a set of decision rules). Computationally constrained with large Π (e.g., all decision trees).
3. Selection bias, especially while exploiting.

Learning objective
Regret (i.e., relative performance) with respect to a policy class Π:
\[
\underbrace{\max_{\pi \in \Pi} \frac{1}{T}\sum_{t=1}^{T} r_t(\pi(x_t))}_{\text{average reward of best policy}}
\;-\;
\underbrace{\frac{1}{T}\sum_{t=1}^{T} r_t(a_t)}_{\text{average reward of learner}}
\]
Strong benchmark if Π contains a policy with high expected reward!
Goal: regret → 0 as fast as possible as T → ∞.
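A small sketch of the empirical version of this quantity, assuming a simulation in which the full reward vector was logged each round (the function name and log format are assumptions for illustration):

```python
def empirical_average_regret(log, policies):
    """Empirical regret of a run.  Computable only in simulation, where the
    full reward vector r_t (a dict action -> reward) was logged alongside the
    chosen action a_t for every round."""
    T = len(log)
    learner_avg = sum(r_t[a_t] for (_, r_t, a_t) in log) / T
    best_policy_avg = max(
        sum(r_t[pi(x_t)] for (x_t, r_t, _) in log) / T for pi in policies
    )
    return best_policy_avg - learner_avg
```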

Our result
New fast and simple algorithm for contextual bandits.
Operates via a computationally efficient reduction to supervised learning.
Statistically (near) optimal regret bound.

Need for exploration
No-exploration approach:
1. Using historical data, learn a reward predictor r̂(a | x) for each action a ∈ A based on the context x ∈ X.
2. Then deploy the policy π̂ given by π̂(x) := arg max_{a ∈ A} r̂(a | x), and collect more data.
Suffers from selection bias.
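A minimal sketch of this no-exploration loop, assuming a small discrete context space and a trivial averaging predictor in place of a learned regression model; the class name and interface are illustrative only (they match the simulation sketch earlier).

```python
from collections import defaultdict

class NoExplorationLearner:
    """Greedy, no-exploration learner: estimate r(a | x) by the average
    observed reward for the (x, a) pair and always act greedily on the
    current estimates."""

    def __init__(self, actions):
        self.actions = actions
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def estimate(self, x, a):
        n = self.counts[(x, a)]
        return self.sums[(x, a)] / n if n > 0 else 0.0

    def choose_action(self, x):
        # Deployed policy: exploit the current estimates, never explore.
        return max(self.actions, key=lambda a: self.estimate(x, a))

    def observe(self, x, a, reward):
        self.sums[(x, a)] += reward
        self.counts[(x, a)] += 1
```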

Using no-exploration
Example: two contexts {X, Y} and two actions {A, B}. Suppose the initial policy says π̂(X) = A and π̂(Y) = B.
Observed rewards so far: 0.7 for action A in context X, and 0.1 for action B in context Y; the other (context, action) pairs are unobserved.
The resulting reward estimates yield a new policy: π̂'(X) = π̂'(Y) = A.
Deploying π̂', we keep observing only action A's rewards, so action B is never tried in context X. Since under the true rewards B is the better action in context X, this incurs Ω(1) regret.

Dealing with policies
Feedback in round t: the reward r_t(a_t) of the chosen action. Tells us about policies π ∈ Π such that π(x_t) = a_t. Not informative about other policies!
Possible approach: track the average reward of each π ∈ Π.
Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995).
Statistically optimal regret bound O(√(K log N / T)) for K := |A| actions and N := |Π| policies after T rounds.
Explicit bookkeeping is computationally intractable for large N.
But perhaps the policy class Π has some structure...

Hypothetical full-information setting
If we observed the rewards for all actions...
Like supervised learning, we would have labeled data after t rounds: (x_1, ρ_1), ..., (x_t, ρ_t) ∈ X × R^A.
Correspondence: context ↔ features; actions ↔ classes; rewards ↔ costs; policy ↔ classifier.
Can often exploit the structure of Π to get tractable algorithms.
Abstraction: arg-max oracle (AMO):
\[
\mathrm{AMO}\big(\{(x_i, \rho_i)\}_{i=1}^{t}\big) := \arg\max_{\pi \in \Pi} \sum_{i=1}^{t} \rho_i(\pi(x_i)).
\]
Can't directly use this in the bandit setting.
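To make the abstraction concrete, here is a brute-force sketch of an AMO over an explicitly enumerated policy list; this enumeration is purely for illustration, since the point of the oracle is that real implementations exploit the structure of Π (e.g., a cost-sensitive classification learner) instead of enumerating.

```python
def amo(policies, dataset):
    """Brute-force arg-max oracle.

    dataset: list of (x_i, rho_i) with rho_i a dict action -> reward.
    Returns the policy in `policies` maximizing sum_i rho_i(pi(x_i)).
    """
    return max(policies, key=lambda pi: sum(rho[pi(x)] for x, rho in dataset))
```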

Using AMO with some exploration
Explore-then-exploit:
1. In the first τ rounds, choose a_t ∈ A uniformly at random to get unbiased estimates r̂_t of r_t for all t ≤ τ.
2. Get π̂ := AMO({(x_t, r̂_t)}_{t=1}^τ).
3. Henceforth use a_t := π̂(x_t), for t = τ+1, τ+2, ..., T.
Regret bound with the best choice of τ: on the order of T^{-1/3} (sub-optimal). (Dependences on |A| and |Π| hidden.)
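A hedged sketch of explore-then-exploit under the same simulation assumptions as above (the `env` and `amo` interfaces are the illustrative ones introduced earlier, not the talk's notation):

```python
import random

def explore_then_exploit(actions, amo, policies, env, T, tau):
    """Uniform exploration for tau rounds, one AMO call, then pure exploitation."""
    K = len(actions)
    dataset = []
    # Phase 1: explore uniformly at random; p_t(a) = 1/K, so the unbiased
    # importance-weighted estimate is K * r_t(a_t) on the chosen action, 0 elsewhere.
    for _ in range(tau):
        x_t, r_t = env()
        a_t = random.choice(actions)
        r_hat = {a: (K * r_t[a_t] if a == a_t else 0.0) for a in actions}
        dataset.append((x_t, r_hat))
    # Phase 2: one oracle call, then exploit the learned policy.
    pi_hat = amo(policies, dataset)
    total = 0.0
    for _ in range(tau, T):
        x_t, r_t = env()
        total += r_t[pi_hat(x_t)]
    return pi_hat, total / max(T - tau, 1)
```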

Previous contextual bandit algorithms
Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995): optimal regret, but explicitly enumerates Π.
Epoch-Greedy (Langford & Zhang, NIPS 2007): sub-optimal regret, but one call to AMO.
Monster (Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, & Zhang, UAI 2011): near-optimal regret, but O(T^6) calls to AMO.

Our result
Let K := |A| and N := |Π|. Our result: a new, fast and simple algorithm.
Regret bound: Õ(√(K log N / T)). Near optimal.
# calls to AMO: Õ(√(TK / log N)). Less than once per round!

Rest of the talk
Components of the new algorithm (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS):
1. Classical tricks: randomization, inverse propensity weighting.
2. Efficient algorithm for balancing exploration/exploitation.
3. Additional tricks: warm-start and epoch structure.

1. Classical tricks

What would've happened if I had done X?
For t = 1, 2, ..., T:
0. Nature draws (x_t, r_t) from a distribution D over X × [0,1]^A.
1. Observe context x_t ∈ X. [e.g., user profile, search query]
2. Choose action a_t ∈ A. [e.g., ad to display]
3. Collect reward r_t(a_t) ∈ [0,1]. [e.g., 1 if click, 0 otherwise]
Q: How do I learn about r_t(a) for actions a I don't actually take?
A: Randomize. Draw a_t ∼ p_t for some pre-specified probability distribution p_t.

Inverse propensity weighting (Horvitz & Thompson, JASA 1952)
Importance-weighted estimate of the reward from round t, for each a ∈ A:
\[
\hat r_t(a) := \frac{r_t(a_t)\,\mathbf{1}\{a = a_t\}}{p_t(a_t)}
= \begin{cases} r_t(a_t)/p_t(a_t) & \text{if } a = a_t,\\ 0 & \text{otherwise.}\end{cases}
\]
Unbiasedness:
\[
\mathbb{E}_{a_t \sim p_t}\big[\hat r_t(a)\big]
= \sum_{a' \in A} p_t(a')\,\frac{r_t(a')\,\mathbf{1}\{a = a'\}}{p_t(a')} = r_t(a).
\]
Range and variance: upper-bounded by 1/p_t(a).
Estimate of a policy's average reward: \widehat{Rew}_t(π) := (1/t) Σ_{i=1}^t r̂_i(π(x_i)).
How should we choose the p_t?
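A minimal Python sketch of these two estimators, under the same dict-based conventions as the earlier sketches (the function names are assumptions, not the talk's):

```python
def ips_reward_estimate(actions, a_t, reward, p_t):
    """Inverse propensity weighted estimate of the full reward vector from a
    single round: reward / p_t[a_t] on the chosen action, 0 elsewhere."""
    return {a: (reward / p_t[a_t] if a == a_t else 0.0) for a in actions}


def estimated_policy_reward(pi, ips_log):
    """Estimate of a policy's average reward: average the IPS estimates along
    the policy's own choices.  ips_log is a list of (x_i, r_hat_i) pairs."""
    return sum(r_hat[pi(x)] for x, r_hat in ips_log) / len(ips_log)
```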

Hedging over policies
Get action distributions via policy distributions:
(Q, x) [policy distribution, context] ↦ p [action distribution].
Policy distribution: Q = (Q(π) : π ∈ Π), a probability distribution over the policies π in the policy class Π.
1: Pick initial distribution Q_1 over policies Π.
2: for round t = 1, 2, ... do
3:   Nature draws (x_t, r_t) from distribution D over X × [0,1]^A.
4:   Observe context x_t.
5:   Compute distribution p_t over A (using Q_t and x_t).
6:   Pick action a_t ∼ p_t.
7:   Collect reward r_t(a_t).
8:   Compute new distribution Q_{t+1} over policies Π.
9: end for

2. Efficient construction of good policy distributions

Our approach
Q: How do we choose Q_t for good exploration/exploitation?
Caveat: Q_t must be efficiently computable and representable!
Our approach:
1. Define a convex feasibility problem (over distributions Q on Π) such that solutions yield (near) optimal regret bounds.
2. Design an algorithm that finds a sparse solution Q.
The algorithm only accesses Π via calls to the AMO, so nnz(Q) = O(# AMO calls).

The good policy distribution problem
Convex feasibility problem for the policy distribution Q:
\[
\sum_{\pi \in \Pi} Q(\pi)\, \widehat{Reg}_t(\pi) \;\le\; \sqrt{\frac{K \log N}{t}} \qquad \text{(Low regret)}
\]
\[
\widehat{\mathrm{var}}\big(\widehat{Rew}_t(\pi)\big) \;\le\; b(\pi) \quad \forall\, \pi \in \Pi \qquad \text{(Low variance)}
\]
Using a feasible Q_t in round t gives near-optimal regret.
But there are |Π| variables and more than |Π| constraints, ...

Solving the convex feasibility problem
Solver for the good policy distribution problem. Start with some Q (e.g., Q := 0), then repeat:
1. If the low-regret constraint is violated, fix it by rescaling: Q := cQ for some c < 1.
2. Find the most violated low-variance constraint, say corresponding to policy π̃, and update Q(π̃) := Q(π̃) + α. (If no such violated constraint exists, stop and return Q.)
(c < 1 and α > 0 have closed-form formulae.)
(Technical detail: Q can be a sub-distribution that sums to less than one.)

Implementation via AMO
Finding a low-variance constraint violation:
1. Create fictitious rewards for each i = 1, 2, ..., t: r̃_i(a) := r̂_i(a) + μ / Q(a | x_i) for all a ∈ A, where μ ≈ √(log N / (Kt)).
2. Obtain π̃ := AMO({(x_i, r̃_i)}_{i=1}^t).
3. \widehat{Rew}_t(π̃) > threshold iff π̃'s low-variance constraint is violated.
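The sketch below is one plausible rendering of this check, under the conventions of the earlier sketches; the exact threshold and constant factors come from the paper's analysis and are taken here as parameters, and comparing the returned policy's average fictitious reward against that threshold is an assumption of this rendering rather than the paper's precise recipe.

```python
import math

def find_violated_low_variance_constraint(amo, policies, history, Q_cond,
                                          K, N, threshold):
    """history: list of (x_i, r_hat_i) IPS-estimated rounds.
    Q_cond(a, x): smoothed action probability Q(a | x), assumed > 0.
    Returns a policy whose low-variance constraint appears violated, or None."""
    t = len(history)
    mu = math.sqrt(math.log(N) / (K * t))
    # Fictitious rewards: r~_i(a) = r^_i(a) + mu / Q(a | x_i).
    fictitious = [
        (x_i, {a: r_hat_i[a] + mu / Q_cond(a, x_i) for a in r_hat_i})
        for x_i, r_hat_i in history
    ]
    pi_tilde = amo(policies, fictitious)                      # one oracle call
    score = sum(r[pi_tilde(x)] for x, r in fictitious) / t
    return pi_tilde if score > threshold else None
```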

Iteration bound
The solver is coordinate descent for minimizing the potential function
\[
\Phi(Q) := c_1\, \hat{\mathbb{E}}_x\Big[\mathrm{RE}\big(\text{uniform}\,\big\|\,Q(\cdot \mid x)\big)\Big] + c_2 \sum_{\pi \in \Pi} Q(\pi)\, \widehat{Reg}_t(\pi).
\]
(The partial derivative with respect to Q(π) is the low-variance constraint for π.)
It returns a feasible solution after Õ(√(Kt / log N)) steps.
(Actually use (1 − ε)Q + ε·uniform inside the RE expression.)

Algorithm
1: Pick initial distribution Q_1 over policies Π.
2: for round t = 1, 2, ... do
3:   Nature draws (x_t, r_t) from distribution D over X × [0,1]^A.
4:   Observe context x_t.
5:   Compute action distribution p_t := Q_t(· | x_t).
6:   Pick action a_t ∼ p_t.
7:   Collect reward r_t(a_t).
8:   Compute new policy distribution Q_{t+1} using coordinate descent + AMO.
9: end for
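A minimal Python skeleton of this loop, assuming the simulation interfaces from the earlier sketches; `recompute_Q` stands in for the coordinate-descent + AMO routine of the previous slides, and `mu` is a smoothing parameter ensuring every action gets positive probability (both are assumptions of this sketch, not the paper's exact parameterization).

```python
import random

def run_full_algorithm(policies, actions, env, recompute_Q, T, mu):
    """Skeleton of the per-round loop: project Q onto actions, sample,
    form an IPS estimate, and hand the history to the distribution update."""
    Q = {pi: 1.0 / len(policies) for pi in policies}   # initial Q_1
    history = []
    K = len(actions)
    for _ in range(T):
        x_t, r_t = env()                               # nature draws (x_t, r_t)
        # p_t := smoothed projection of Q onto actions, given x_t.
        proj = {a: 0.0 for a in actions}
        for pi, w in Q.items():
            proj[pi(x_t)] += w
        p_t = {a: (1 - K * mu) * proj[a] + mu for a in actions}
        a_t = random.choices(actions, weights=[p_t[a] for a in actions])[0]
        reward = r_t[a_t]                              # bandit feedback only
        r_hat = {a: (reward / p_t[a_t] if a == a_t else 0.0) for a in actions}
        history.append((x_t, r_hat))
        Q = recompute_Q(Q, history)                    # coordinate descent + AMO
    return Q
```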

Recap
A feasible solution to the good policy distribution problem gives a near-optimal regret bound.
New coordinate descent algorithm: repeatedly find a violated constraint and adjust Q to satisfy it.
Analysis: in round t, nnz(Q_{t+1}) = O(# AMO calls) = Õ(√(Kt / log N)).

3. Additional tricks: warm-start and epoch structure

Total complexity over all rounds
In round t, coordinate descent for computing Q_{t+1} requires Õ(√(Kt / log N)) AMO calls.
To compute Q_{t+1} in every round t = 1, 2, ..., T, we would need Õ(√(K / log N) · T^{1.5}) AMO calls over T rounds.

Warm start
To compute Q_{t+1} using coordinate descent, initialize with Q_t.
1. The total epoch-to-epoch increase in the potential is Õ(√(T/K)) over all T rounds (w.h.p., exploiting the i.i.d. assumption).
2. Each coordinate descent step decreases the potential by Ω(log N / K).
3. Over all T rounds, the total # calls to AMO is Õ(√(KT / log N)).
But we still need an AMO call just to check whether Q_t is feasible!

Epoch trick
Regret analysis: Q_t has low instantaneous per-round regret; this also crucially relies on the i.i.d. assumption.
⇒ the same Q_t can be used for O(t) more rounds!
Epoch trick: split the T rounds into epochs, and only compute Q_t at the start of each epoch.
Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, ...; about log T updates, so Õ(√(KT / log N)) AMO calls overall.
Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, ...; about √T updates, so Õ(√(K / log N)) AMO calls per update, on average.
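A short sketch of the two epoch schedules (the function name and interface are illustrative assumptions):

```python
def epoch_update_rounds(T, schedule="squares"):
    """Rounds at which Q_t is recomputed under the two epoch schedules above:
    powers of two (about log T updates) or perfect squares (about sqrt(T))."""
    if schedule == "doubling":
        rounds, r = [], 2
        while r <= T:
            rounds.append(r)
            r *= 2
        return rounds
    if schedule == "squares":
        return [i * i for i in range(1, int(T ** 0.5) + 1)]
    raise ValueError("unknown schedule: " + schedule)
```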

Warm start + epoch trick
Over all T rounds: update the policy distribution on rounds 1^2, 2^2, 3^2, 4^2, ..., i.e., about √T times in total.
Total # calls to AMO: Õ(√(KT / log N)).
# AMO calls per update (on average): Õ(√(K / log N)).

4. Closing remarks and open problems

Recap
1. New algorithm for general contextual bandits.
2. Accesses the policy class Π only via the AMO.
3. Defined a convex feasibility problem over policy distributions that are good for exploration/exploitation: regret Õ(√(K log N / T)). Coordinate descent finds an Õ(√(KT / log N))-sparse solution.
4. The epoch structure allows the policy distribution to change very infrequently; combined with warm start, this gives the computational improvements.

Open problems
1. Empirical evaluation.
2. Adaptive algorithms that take advantage of problem easiness.
3. Alternatives to the AMO.
Thanks!


Projections of policy distributions
Given a policy distribution Q and a context x, for each a ∈ A:
Q(a | x) := Σ_{π ∈ Π} Q(π) 1{π(x) = a}
(so Q ↦ Q(· | x) is a linear map).
We actually use p_t := Q_t^{μ_t}(· | x_t) := (1 − Kμ_t) Q_t(· | x_t) + μ_t · 1, so every action has probability at least μ_t (to be determined).
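A minimal sketch of this projection and smoothing step, using the same dict-based conventions as the earlier sketches (names are illustrative):

```python
def smoothed_action_distribution(Q, x, actions, mu):
    """Q^mu(. | x): project the policy distribution onto actions for context x,
    then mix in mu of mass per action so every action has probability >= mu.
    Q is a dict policy -> weight."""
    proj = {a: 0.0 for a in actions}
    for pi, w in Q.items():
        proj[pi(x)] += w        # Q(a | x) = sum_pi Q(pi) * 1{pi(x) = a}
    K = len(actions)
    return {a: (1 - K * mu) * proj[a] + mu for a in actions}
```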

The potential function
\[
\Phi(Q) := \frac{t\mu_t}{1 - K\mu_t}\, \hat{\mathbb{E}}_{x \sim H_t}\Big[\mathrm{RE}\big(\text{uniform}\,\big\|\,Q^{\mu_t}(\cdot \mid x)\big)\Big]
+ \frac{1}{K t \mu_t} \sum_{\pi \in \Pi} Q(\pi)\, \widehat{Reg}_t(\pi).
\]
