1 Learning to Interact
John Langford, Microsoft Research (with help from many)
Slides at:
For the demo: raw RCV1 CCAT-or-not data:
Simple converter: wget
Vowpal Wabbit for learning:

2 Examples of Interactive Learning
Repeatedly:
1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account).
2. Microsoft chooses information to present (urls, ads, news stories).
3. The user reacts to the presented information (clicks on something, doesn't click, comes back and clicks again, ...).
Microsoft wants to interactively choose content and use the observed feedback to improve future content choices.

3 Another Example: Clinical Decision Making Repeatedly: 1 A patient comes to a doctor with symptoms, medical history, test results 2 The doctor chooses a treatment 3 The patient responds to it The doctor wants a policy for choosing targeted treatments for individual patients.

4 The Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. The learner chooses an action a ∈ A.
3. The world reacts with reward r_a ∈ [0, 1].
Goal: Learn a good policy for choosing actions given context.
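As a reading aid (not from the slides), here is a minimal Python sketch of that interaction protocol; the `world` and `learner` interfaces are hypothetical stand-ins.

```python
# Minimal sketch of the contextual bandit protocol (interfaces are illustrative).
def run_contextual_bandit(world, learner, T):
    """world.context() -> x; world.reward(x, a) -> r in [0, 1];
    learner.act(x) -> (a, p); learner.update(x, a, r, p) records feedback."""
    total_reward = 0.0
    for t in range(T):
        x = world.context()        # 1. the world produces a context
        a, p = learner.act(x)      # 2. the learner chooses an action (with probability p)
        r = world.reward(x, a)     # 3. the world reveals the reward of the chosen action only
        learner.update(x, a, r, p)
        total_reward += r
    return total_reward / T
```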

5 The Evaluation Problem Let π : X → A be a policy mapping features to actions. How do we evaluate it?

6 The Evaluation Problem Let π : X → A be a policy mapping features to actions. How do we evaluate it? Method 1: Deploy the algorithm in the world. Very expensive!

7 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a).

8 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
      a_1   a_2
x_1
x_2

9 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed:
      a_1   a_2
x_1   .8    ?
x_2   ?     .2

10 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated:
      a_1     a_2
x_1   .8/.8   ?/.5
x_2   ?/.5    .2/.2

11 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated:
      a_1     a_2
x_1   .8/.8   ?/.5
x_2   .3/.5   .2/.2

12 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated:
      a_1     a_2
x_1   .8/.8   ?/.514
x_2   .3/.3   .2/.014

13 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2

14 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2
Basic observation 1: Generalization alone is not sufficient.

15 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2
Basic observation 2: Exploration is required to succeed.

16 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2
Basic observation 3: Prediction errors are not controlled without exploration.

17 Outline
1. Using Exploration
   1. Problem Definition
   2. Direct Method fails
   3. Importance Weighting
   4. Missing Probabilities
   5. Doubly Robust
2. Doing Exploration

18 Method 3: The Importance Weighting Trick Let π : X → A be a policy mapping features to actions. How do we evaluate it?

19 Method 3: The Importance Weighting Trick Let π : X → A be a policy mapping features to actions. How do we evaluate it? One answer: Collect T exploration samples of the form (x, a, r_a, p_a), where
  x = context
  a = action
  r_a = reward for the action
  p_a = probability of action a
then evaluate:
  Value(π) = Average( r_a · 1(π(x) = a) / p_a )
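A small Python sketch of this estimator (not from the slides; the log format (x, a, r, p) follows the definition above):

```python
# Sketch of the importance-weighted (IPS) value estimator.
# `logs` is a list of exploration samples (x, a, r, p) as defined above.
def ips_value(policy, logs):
    total = 0.0
    for x, a, r, p in logs:
        if policy(x) == a:     # indicator 1(pi(x) = a)
            total += r / p     # importance-weight the observed reward
    return total / len(logs)
```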

20 The Importance Weighting Trick
Theorem: For all policies π and all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:
  E_{(x,r) ~ D}[ r_{π(x)} ] = E[ Value(π) ],
with deviations bounded by
  O( 1 / (√T · min_x p_{π(x)}) ).
Proof [Part 1]:
  E_{a ~ p}[ r_a · 1(π(x) = a) / p_a ] = Σ_a p_a · r_a · 1(π(x) = a) / p_a = r_{π(x)}.

21 What if you don't know probabilities? Suppose p was:
1. Misrecorded: we randomized some actions, but then the Business Logic did something else.
2. Not recorded: we randomized some scores which had an unclear impact on actions.
3. Nonexistent: on Tuesday we did A and on Wednesday B.

22 What if you don't know probabilities? Suppose p was:
1. Misrecorded: we randomized some actions, but then the Business Logic did something else.
2. Not recorded: we randomized some scores which had an unclear impact on actions.
3. Nonexistent: on Tuesday we did A and on Wednesday B.
Learn a predictor p̂(a|x) on the (x, a) data. Define a new estimator:
  V̂(π) = Ê_{x,a,r_a}[ r_a · 1(π(x) = a) / max{τ, p̂(a|x)} ]
where τ = a small number.

23 What if you don't know probabilities? Suppose p was:
1. Misrecorded: we randomized some actions, but then the Business Logic did something else.
2. Not recorded: we randomized some scores which had an unclear impact on actions.
3. Nonexistent: on Tuesday we did A and on Wednesday B.
Learn a predictor p̂(a|x) on the (x, a) data. Define a new estimator:
  V̂(π) = Ê_{x,a,r_a}[ r_a · 1(π(x) = a) / max{τ, p̂(a|x)} ]
where τ = a small number.
Theorem: For all IID D and all policies π with p(a|x) > τ,
  | Value(π) - E[ V̂(π) ] | ≤ √(reg(p̂)) / τ
where reg(p̂) = E_{x ~ D, a ~ p(a|x)}[ (p(a|x) - p̂(a|x))² ] = the squared-loss regret.
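A hedged Python sketch of this clipped, estimated-propensity estimator (the p_hat model and the value of τ are illustrative assumptions):

```python
# Sketch of IPS with an estimated propensity p_hat(a, x), clipped at tau.
# `logs` holds (x, a, r) triples; p_hat is a learned model of p(a | x).
def clipped_ips_value(policy, logs, p_hat, tau=0.05):
    total = 0.0
    for x, a, r in logs:
        if policy(x) == a:
            total += r / max(tau, p_hat(a, x))   # clip small propensities at tau
    return total / len(logs)
```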

24 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?

25 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )

26 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )
Why does this work?

27 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )
Why does this work?
  E_{a ~ p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x), so
  0 = E_{a ~ p}[ -r̂(a, x) · 1(π(x) = a) / p_a + r̂(π(x), x) ]

28 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )
Why does this work?
  E_{a ~ p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x), so
  0 = E_{a ~ p}[ -r̂(a, x) · 1(π(x) = a) / p_a + r̂(π(x), x) ]
So it's safe. It helps because when r_a - r̂(a, x) is small, the variance is reduced.
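Here is a small Python sketch of this doubly robust estimator (not from the slides; r_hat is whatever reward model you have):

```python
# Sketch of the doubly robust value estimator.
# `logs` holds (x, a, r, p) samples; r_hat(a, x) is any reward predictor.
def dr_value(policy, logs, r_hat):
    total = 0.0
    for x, a, r, p in logs:
        correction = (r - r_hat(a, x)) / p if policy(x) == a else 0.0
        total += correction + r_hat(policy(x), x)
    return total / len(logs)
```

When r_hat is accurate, the importance-weighted correction is near zero, which is exactly the variance reduction claimed above.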

29 How do you test things? Contextual Bandit datasets tend to be highly proprietary. What can you do?

30 How do you test things? Contextual Bandit datasets tend to be highly proprietary. What can you do?
1. Pick a classification dataset.
2. Generate (x, a, r, p) quads via uniform random exploration of actions.

31 How do you test things? Contextual Bandit datasets tend to be highly proprietary. What can you do?
1. Pick a classification dataset.
2. Generate (x, a, r, p) quads via uniform random exploration of actions.
Apply the transform to the RCV1 dataset:
  wget
  wget
Output format is: action:cost:probability | features
Example: 1:1:0.5 | tuesday year million short compan vehicl line stat nanc commit exchang plan corp subsid credit issu debt pay gold bureau prelimin ren billion telephon time draw basic relat le spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip...
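A Python sketch of step 2, turning a multiclass dataset into bandit data by uniform random exploration (the function and names are illustrative, not the converter referenced on the title slide):

```python
# Sketch: simulate bandit feedback from a supervised multiclass dataset.
import random

def supervised_to_bandit(examples, num_actions):
    """examples: list of (features, true_label); returns (x, a, r, p) quads."""
    logs = []
    for x, y in examples:
        a = random.randrange(1, num_actions + 1)   # uniform random action
        r = 1.0 if a == y else 0.0                 # reward 1 iff the chosen class is correct
        p = 1.0 / num_actions                      # logged probability of that action
        logs.append((x, a, r, p))
    return logs
```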

32 How do you train?
1. Learn r̂(a, x).
2. For each x, compute the doubly robust estimate for each action a' ∈ {1, ..., K}:
   (r - r̂(a', x)) · 1(a' = a) / p(a|x) + r̂(a', x)
3. Learn π using a cost-sensitive classifier. We'll use Vowpal Wabbit:
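Before the Vowpal Wabbit commands on the next slide, here is a small Python sketch of step 2, building per-action costs from one logged quad with the doubly robust estimate (turning estimated rewards into costs by negation is my assumption, not something the slides specify):

```python
# Sketch: doubly robust per-action costs for cost-sensitive learning.
# One logged quad (x, a, r, p); r_hat(a, x) is the learned reward predictor.
def dr_costs(x, a, r, p, r_hat, K):
    costs = []
    for a_prime in range(1, K + 1):
        est = r_hat(a_prime, x)
        if a_prime == a:                 # importance-weighted correction for the logged action
            est += (r - r_hat(a_prime, x)) / p
        costs.append(-est)               # cost-sensitive learners minimize cost = -reward
    return costs                         # one cost per action for this x
```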

33 How do you train?
1. Learn r̂(a, x).
2. For each x, compute the doubly robust estimate for each action a' ∈ {1, ..., K}:
   (r - r̂(a', x)) · 1(a' = a) / p(a|x) + r̂(a', x)
3. Learn π using a cost-sensitive classifier. We'll use Vowpal Wabbit:
vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
  Progressive 0/1 loss:
vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l
  Progressive 0/1 loss:
vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l
  Progressive 0/1 loss:

34 Experimental Results
IPS = Inverse probability. DR = Doubly Robust, with r̂(a, x) = w_a · x. Filter Tree = cost-sensitive multiclass classifier. Offset Tree = earlier method for CB learning with the same representation.
[Figure: classification error of IPS (Filter Tree), DR (Filter Tree), and Offset Tree on the ecoli, glass, letter, optdigits, page-blocks, pendigits, satimage, vehicle, and yeast datasets.]

35 Summary of methods
1. Deployment, aka A/B testing. Gold standard for measurement and cost.
2. Direct Method. Often used by people who don't know what they are doing. Some value when used in conjunction with careful exploration.
3. Inverse probability. Unbiased, but possibly high variance.
4. Inverse propensity score. For when you don't know or don't trust recorded probabilities. A debugging tool that gives hints, but caution is in order.
5. Offset Tree (not discussed). The only known logarithmic-time method.
6. Doubly robust. Best known offline method. Unbiased + reduced variance.

36 Reminder: Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. The learner chooses an action a ∈ A.
3. The world reacts with reward r_a ∈ [0, 1].
Goal: Learn a good policy for choosing actions given context.
What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:
  Regret = max_{π ∈ Π} average_t ( r_{π(x)} - r_a )

37 What is exploration? Exploration = Choosing not-obviously best actions to gather information for better performance in the future.

38 What is exploration? Exploration = Choosing not-obviously best actions to gather information for better performance in the future. There are two kinds: 1 Deterministic. Choose action A, then B, then C, then A, then B,... 2 Randomized. Choose random actions according to some distribution over actions.

39 What is exploration? Exploration = choosing not-obviously-best actions to gather information for better performance in the future. There are two kinds:
1. Deterministic. Choose action A, then B, then C, then A, then B, ...
2. Randomized. Choose random actions according to some distribution over actions.
We discuss Randomized here.
1. There are no good deterministic exploration algorithms in this setting.
2. It supports off-policy evaluation. (See the first half.)
3. Randomize = robust to delayed updates, which are very common in practice.

40 Outline
1. Using Exploration
   1. Problem Definition
   2. Direct Method fails
   3. Importance Weighting
   4. Missing Probabilities
   5. Doubly Robust
2. Doing Exploration
   1. Exploration First
   2. ɛ-greedy
   3. Epoch Greedy
   4. Policy Elimination
   5. Other things: Thompson Sampling and Covering

41 Explore τ then Follow the Leader (Explore-τ)

42 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
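A Python sketch of this algorithm (not from the slides), using the uniform-exploration IPS estimate from the first half to pick the empirical best policy:

```python
# Sketch of Explore-tau then Follow the Leader.
import random

def explore_tau(world, policies, actions, T, tau):
    h = []                                          # exploration history
    for t in range(tau):                            # first tau rounds: explore uniformly
        x = world.context()
        a = random.choice(actions)
        r = world.reward(x, a)
        h.append((x, a, r))
    k = len(actions)
    def empirical_value(pi):                        # IPS estimate with p = 1/k
        return sum(r * k for x, a, r in h if pi(x) == a) / max(1, len(h))
    best = max(policies, key=empirical_value)       # follow the leader
    for t in range(T - tau):                        # remaining rounds: exploit
        x = world.context()
        world.reward(x, best(x))
    return best
```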

43 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.

44 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.
Proof: After τ rounds, a large deviation bound implies that the empirical average value of a policy deviates from its expectation E_{(x,r) ~ D}[ r_{π(x)} ] by at most √(A ln(|Π|/δ) / τ), so the regret is bounded by τ/T + ((T - τ)/T) · √(A ln(|Π|/δ) / τ).

45 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.
Proof: After τ rounds, a large deviation bound implies that the empirical average value of a policy deviates from its expectation E_{(x,r) ~ D}[ r_{π(x)} ] by at most √(A ln(|Π|/δ) / τ), so the regret is bounded by τ/T + ((T - τ)/T) · √(A ln(|Π|/δ) / τ).
At the optimal τ?

46 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.
Proof: After τ rounds, a large deviation bound implies that the empirical average value of a policy deviates from its expectation E_{(x,r) ~ D}[ r_{π(x)} ] by at most √(A ln(|Π|/δ) / τ), so the regret is bounded by τ/T + ((T - τ)/T) · √(A ln(|Π|/δ) / τ).
At the optimal τ? O( (A ln|Π| / T)^{1/3} )
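For the curious, a short LaTeX sketch (my addition, not from the slides) of where that exponent comes from, by balancing the two terms of the bound:

```latex
% Balancing exploration cost tau/T against estimation error sqrt(A ln|Pi| / tau).
\[
\frac{d}{d\tau}\left(\frac{\tau}{T} + \sqrt{\frac{A\ln|\Pi|}{\tau}}\right)
  = \frac{1}{T} - \frac{1}{2}\sqrt{A\ln|\Pi|}\;\tau^{-3/2} = 0
  \quad\Longrightarrow\quad
  \tau = \Theta\!\left(T^{2/3}\,(A\ln|\Pi|)^{1/3}\right),
\]
\[
\text{so the regret is}\quad
  O\!\left(\frac{\tau}{T}\right) = O\!\left(\left(\frac{A\ln|\Pi|}{T}\right)^{1/3}\right).
\]
```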

47 What does this mean?
1. +Easiest approach: offline prerecorded exploration can feed into any learning algorithm. See the first half.
2. -Doesn't adapt when the world changes.
3. -Underexploration is common. Think of clinical trials.

48 What does this mean?
1. +Easiest approach: offline prerecorded exploration can feed into any learning algorithm. See the first half.
2. -Doesn't adapt when the world changes.
3. -Underexploration is common. Think of clinical trials.
Can we do better?

49 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).

50 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
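A Python sketch of one ɛ-greedy round (world and learner are hypothetical stand-ins; learner.predict returns the currently learned best action; at least two actions are assumed):

```python
# Sketch of one epsilon-greedy round.
import random

def epsilon_greedy_step(world, learner, actions, epsilon):
    x = world.context()
    greedy = learner.predict(x)                 # the currently learned action
    if random.random() < 1 - epsilon:
        a, p = greedy, 1 - epsilon              # exploit
    else:
        others = [b for b in actions if b != greedy]
        a = random.choice(others)               # explore a different action uniformly
        p = epsilon / (len(actions) - 1)
    r = world.reward(x, a)
    learner.update(x, a, r, p)                  # learn with (x, a, r, p)
```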

51 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
Theorem: ɛ-greedy has regret O( ɛ + √(A ln|Π| / (T ɛ)) ).

52 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
Theorem: ɛ-greedy has regret O( ɛ + √(A ln|Π| / (T ɛ)) ). For the optimal ɛ?

53 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
Theorem: ɛ-greedy has regret O( ɛ + √(A ln|Π| / (T ɛ)) ). For the optimal ɛ? O( (A ln|Π| / T)^{1/3} )

54 What does this mean?
1. -Harder approach: needs an online learning algorithm.
2. +Adapts when the world changes.
3. -Overexploration is common. Bad possibilities keep being explored.

55 What does this mean?
1. -Harder approach: needs an online learning algorithm.
2. +Adapts when the world changes.
3. -Overexploration is common. Bad possibilities keep being explored.
Can we do better?

56 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.

57 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.
1. Observe x.
2. With probability 1 - ɛ_t:
   1. Choose the learned a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, 1 - ɛ_t).

58 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.
1. Observe x.
2. With probability 1 - ɛ_t:
   1. Choose the learned a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, 1 - ɛ_t).
   With probability ɛ_t:
   1. Choose a uniform random other a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, ɛ_t/(|A| - 1)).
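A Python sketch of one Epoch Greedy round; the t^{-1/3} schedule for ɛ_t below is an illustrative stand-in for the precision estimate, not the exact rule from the Langford-Zhang paper:

```python
# Sketch of one Epoch-Greedy round with a decaying exploration probability.
import random

def epoch_greedy_step(world, learner, actions, t, complexity):
    """`complexity` stands in for A*ln|Pi|; eps_t tracks the estimate's precision."""
    eps_t = min(1.0, (complexity / max(1, t)) ** (1.0 / 3.0))   # illustrative schedule
    x = world.context()
    greedy = learner.predict(x)
    if random.random() < 1 - eps_t:
        a, p = greedy, 1 - eps_t
    else:
        others = [b for b in actions if b != greedy]
        a = random.choice(others)
        p = eps_t / (len(actions) - 1)
    r = world.reward(x, a)
    learner.update(x, a, r, p)
```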

59 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.
1. Observe x.
2. With probability 1 - ɛ_t:
   1. Choose the learned a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, 1 - ɛ_t).
   With probability ɛ_t:
   1. Choose a uniform random other a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, ɛ_t/(|A| - 1)).
Theorem: Epoch Greedy has regret O( (A ln|Π| / T)^{1/3} ) with high probability. Autotuning!

60 What does this mean?
1. -Harder approach: needs an online learning algorithm + keeping track of a deviation bound.
2. +Adapts when the world changes.
3. +Neither under- nor over-exploration.

61 What does this mean?
1. -Harder approach: needs an online learning algorithm + keeping track of a deviation bound.
2. +Adapts when the world changes.
3. +Neither under- nor over-exploration.
Is it possible to do better?
  Supervised: Regret O( (ln|Π| / T)^{1/2} )
  τ-first / ɛ-greedy / epoch-greedy: O( (A ln|Π| / T)^{1/3} )

62 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. for every remaining policy π, the expected variance of a value estimate is small.
2. Observe x.
3. Let p(a) = the fraction of P choosing a given x.
4. Choose a ~ p and observe reward r.
5. Let Π_t = the remaining near-empirical-best policies.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

63 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. ∀π ∈ Π_{t-1}:
   E_{x ~ D_X}[ 1 / ((1 - Kµ_t) · Pr_{π' ~ P}(π'(x) = π(x)) + µ_t) ] ≤ 2K.
2. Observe x.
3. Let p(a) = the fraction of P choosing a given x.
4. Choose a ~ p and observe reward r.
5. Let Π_t = the remaining near-empirical-best policies.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

64 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. ∀π ∈ Π_{t-1}:
   E_{x ~ D_X}[ 1 / ((1 - Kµ_t) · Pr_{π' ~ P}(π'(x) = π(x)) + µ_t) ] ≤ 2K.
2. Observe x.
3. Let p(a) = (1 - Kµ_t) · Pr_{π ~ P}(π(x) = a) + µ_t.
4. Choose a ~ p and observe reward r.
5. Let Π_t = the remaining near-empirical-best policies.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

65 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. ∀π ∈ Π_{t-1}:
   E_{x ~ D_X}[ 1 / ((1 - Kµ_t) · Pr_{π' ~ P}(π'(x) = π(x)) + µ_t) ] ≤ 2K.
2. Observe x.
3. Let p(a) = (1 - Kµ_t) · Pr_{π ~ P}(π(x) = a) + µ_t.
4. Choose a ~ p and observe reward r.
5. Let Π_t = {π ∈ Π_{t-1} : η_t(π) ≥ max_{π' ∈ Π_{t-1}} η_t(π') - Kµ_t}.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

66 What does this mean?
1. -Doesn't adapt when the world changes.
2. ++Much more efficient exploration (only efficient in special cases).
3. -Much harder approach: need to keep track of policies, which is often intractable.

67 What does this mean?
1. -Doesn't adapt when the world changes.
2. ++Much more efficient exploration (only efficient in special cases).
3. -Much harder approach: need to keep track of policies, which is often intractable.
Adapting algorithms exist (EXP4). More efficient versions exist (RUCB), but they are not yet efficient enough.

68 Can you do better?

69 Can you do better? Not in general. Theorem: For all algorithms, there exist problems imposing regret Ω( √(A ln|Π| / T) ).

70 The current state
  Starter | Baseline | Purring | Shiny | Something to try | Shiny New

71 The current state
  Explore-τ (Simplest Possible) | Baseline | Purring | Shiny | Something to try | Shiny New

72 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Purring | Shiny | Something to try | Shiny New

73 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Shiny | Something to try | Shiny New

74 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Policy Elimination (Optimal Impractical) | Something to try | Shiny New

75 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Policy Elimination (Optimal Impractical) | Thompson Sampling (Sometimes Excellent) | Shiny New

76 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Policy Elimination (Optimal Impractical) | Thompson Sampling (Sometimes Excellent) | Covering (The answer???)
You can see the edge of the understood world here. We hope to see further soon. Further discussion:

77 Bibliography: Using Exploration
[Inverse] An old technique, not sure where it was first used.
[Nonrand] J. Langford, A. Strehl, and J. Wortman, Exploration Scavenging, ICML 2008.
[Offset] A. Beygelzimer and J. Langford, The Offset Tree for Learning with Partial Labels, KDD 2009.
[Implicit] A. Strehl, J. Langford, S. Kakade, and L. Li, Learning from Logged Implicit Exploration Data, NIPS 2010.
[DRobust] M. Dudik, J. Langford, and L. Li, Doubly Robust Policy Evaluation and Learning, ICML 2011.

78 Bibliography: Doing Exploration
[Tau-first] Unclear first use?
[ɛ-greedy] Unclear first use?
[Epoch] J. Langford and T. Zhang, The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits, NIPS.
[EXP4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, The Nonstochastic Multiarmed Bandit Problem, SIAM Journal of Computing, 32(1):48-77, 2002.
[EXP4++] B. McMahan and M. Streeter, Tighter Bounds for Multi-Armed Bandits with Expert Advice, COLT.
[PolyElim] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, Efficient Optimal Learning for Contextual Bandits, UAI 2011.

79 Bibliography: Doing Exploration II
[Realizable] A. Agarwal, M. Dudik, S. Kale, and J. Langford, Contextual Bandit Learning with Predictable Rewards, AIStats 2012.
[Thompson] W. R. Thompson, On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Biometrika, 25(3-4):285-294, 1933.
[Prior fail] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire, An Optimal High Probability Algorithm for the Contextual Bandit Problem, AIStats 2011.
[Empirical] O. Chapelle and L. Li, An Empirical Evaluation of Thompson Sampling, NIPS 2011.
[Cover] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits.
