1 Learning to Interact
John Langford, Microsoft Research (with help from many)
Slides at:
For the demo: raw RCV1 CCAT-or-not data:
Simple converter: wget
Vowpal Wabbit for learning:

2 Examples of Interactive Learning
Repeatedly:
1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account).
2. Microsoft chooses information to present (urls, ads, news stories).
3. The user reacts to the presented information (clicks on something, doesn't click, comes back and clicks again, ...).
Microsoft wants to interactively choose content and use the observed feedback to improve future content choices.

3 Another Example: Clinical Decision Making Repeatedly: 1 A patient comes to a doctor with symptoms, medical history, test results 2 The doctor chooses a treatment 3 The patient responds to it The doctor wants a policy for choosing targeted treatments for individual patients.

4 The Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. The learner chooses an action a ∈ A.
3. The world reacts with reward r_a ∈ [0, 1].
Goal: Learn a good policy for choosing actions given context.
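As a reading aid (not from the slides), here is a minimal Python sketch of that interaction protocol; the `world` and `learner` interfaces are hypothetical stand-ins.

```python
# Minimal sketch of the contextual bandit protocol (interfaces are illustrative).
def run_contextual_bandit(world, learner, T):
    """world.context() -> x; world.reward(x, a) -> r in [0, 1];
    learner.act(x) -> (a, p); learner.update(x, a, r, p) records feedback."""
    total_reward = 0.0
    for t in range(T):
        x = world.context()        # 1. the world produces a context
        a, p = learner.act(x)      # 2. the learner chooses an action (with probability p)
        r = world.reward(x, a)     # 3. the world reveals the reward of the chosen action only
        learner.update(x, a, r, p)
        total_reward += r
    return total_reward / T
```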

5 The Evaluation Problem Let π : X → A be a policy mapping features to actions. How do we evaluate it?

6 The Evaluation Problem Let π : X → A be a policy mapping features to actions. How do we evaluate it? Method 1: Deploy the algorithm in the world. Very expensive!

7 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a).

8 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
      a_1   a_2
x_1
x_2

9 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed:
      a_1   a_2
x_1   .8    ?
x_2   ?     .2

10 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated:
      a_1     a_2
x_1   .8/.8   ?/.5
x_2   ?/.5    .2/.2

11 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated:
      a_1     a_2
x_1   .8/.8   ?/.5
x_2   .3/.5   .2/.2

12 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated:
      a_1     a_2
x_1   .8/.8   ?/.514
x_2   .3/.3   .2/.014

13 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2

14 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2
Basic observation 1: Generalization alone is not sufficient.

15 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2
Basic observation 2: Exploration is required to succeed.

16 The Direct method Use past data to learn a reward predictor r̂(x, a), and act according to argmax_a r̂(x, a). Example: The deployed policy always takes a_1 on x_1 and a_2 on x_2.
Observed/Estimated/True:
      a_1        a_2
x_1   .8/.8/.8   ?/.514/1
x_2   .3/.3/.3   .2/.014/.2
Basic observation 3: Prediction errors are not controlled without exploration.

17 Outline
1. Using Exploration
   1. Problem Definition
   2. Direct Method fails
   3. Importance Weighting
   4. Missing Probabilities
   5. Doubly Robust
2. Doing Exploration

18 Method 3: The Importance Weighting Trick Let π : X → A be a policy mapping features to actions. How do we evaluate it?

19 Method 3: The Importance Weighting Trick Let π : X → A be a policy mapping features to actions. How do we evaluate it? One answer: Collect T exploration samples of the form (x, a, r_a, p_a), where
  x = context
  a = action
  r_a = reward for the action
  p_a = probability of action a
then evaluate:
  Value(π) = Average( r_a · 1(π(x) = a) / p_a )
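A small Python sketch of this estimator (not from the slides; the log format (x, a, r, p) follows the definition above):

```python
# Sketch of the importance-weighted (IPS) value estimator.
# `logs` is a list of exploration samples (x, a, r, p) as defined above.
def ips_value(policy, logs):
    total = 0.0
    for x, a, r, p in logs:
        if policy(x) == a:     # indicator 1(pi(x) = a)
            total += r / p     # importance-weight the observed reward
    return total / len(logs)
```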

20 The Importance Weighting Trick
Theorem: For all policies π and all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:
  E_{(x,r) ~ D}[ r_{π(x)} ] = E[ Value(π) ],
with deviations bounded by
  O( 1 / (√T · min_x p_{π(x)}) ).
Proof [Part 1]:
  E_{a ~ p}[ r_a · 1(π(x) = a) / p_a ] = Σ_a p_a · r_a · 1(π(x) = a) / p_a = r_{π(x)}.

21 What if you don't know probabilities? Suppose p was:
1. Misrecorded: we randomized some actions, but then the Business Logic did something else.
2. Not recorded: we randomized some scores which had an unclear impact on actions.
3. Nonexistent: on Tuesday we did A and on Wednesday B.

22 What if you don't know probabilities? Suppose p was:
1. Misrecorded: we randomized some actions, but then the Business Logic did something else.
2. Not recorded: we randomized some scores which had an unclear impact on actions.
3. Nonexistent: on Tuesday we did A and on Wednesday B.
Learn a predictor p̂(a|x) on the (x, a) data. Define a new estimator:
  V̂(π) = Ê_{x,a,r_a}[ r_a · 1(π(x) = a) / max{τ, p̂(a|x)} ]
where τ = a small number.

23 What if you don't know probabilities? Suppose p was:
1. Misrecorded: we randomized some actions, but then the Business Logic did something else.
2. Not recorded: we randomized some scores which had an unclear impact on actions.
3. Nonexistent: on Tuesday we did A and on Wednesday B.
Learn a predictor p̂(a|x) on the (x, a) data. Define a new estimator:
  V̂(π) = Ê_{x,a,r_a}[ r_a · 1(π(x) = a) / max{τ, p̂(a|x)} ]
where τ = a small number.
Theorem: For all IID D and all policies π with p(a|x) > τ,
  | Value(π) - E[ V̂(π) ] | ≤ √(reg(p̂)) / τ
where reg(p̂) = E_{x ~ D, a ~ p(a|x)}[ (p(a|x) - p̂(a|x))² ] = the squared-loss regret.
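A hedged Python sketch of this clipped, estimated-propensity estimator (the p_hat model and the value of τ are illustrative assumptions):

```python
# Sketch of IPS with an estimated propensity p_hat(a, x), clipped at tau.
# `logs` holds (x, a, r) triples; p_hat is a learned model of p(a | x).
def clipped_ips_value(policy, logs, p_hat, tau=0.05):
    total = 0.0
    for x, a, r in logs:
        if policy(x) == a:
            total += r / max(tau, p_hat(a, x))   # clip small propensities at tau
    return total / len(logs)
```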

24 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?

25 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )

26 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )
Why does this work?

27 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )
Why does this work?
  E_{a ~ p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x), so
  0 = E_{a ~ p}[ -r̂(a, x) · 1(π(x) = a) / p_a + r̂(π(x), x) ]

28 Can we do better? Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
  Value'(π) = Average( (r_a - r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )
Why does this work?
  E_{a ~ p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x), so
  0 = E_{a ~ p}[ -r̂(a, x) · 1(π(x) = a) / p_a + r̂(π(x), x) ]
So it's safe. It helps because when r_a - r̂(a, x) is small, the variance is reduced.
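Here is a small Python sketch of this doubly robust estimator (not from the slides; r_hat is whatever reward model you have):

```python
# Sketch of the doubly robust value estimator.
# `logs` holds (x, a, r, p) samples; r_hat(a, x) is any reward predictor.
def dr_value(policy, logs, r_hat):
    total = 0.0
    for x, a, r, p in logs:
        correction = (r - r_hat(a, x)) / p if policy(x) == a else 0.0
        total += correction + r_hat(policy(x), x)
    return total / len(logs)
```

When r_hat is accurate, the importance-weighted correction is near zero, which is exactly the variance reduction claimed above.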

29 How do you test things? Contextual Bandit datasets tend to be highly proprietary. What can you do?

30 How do you test things? Contextual Bandit datasets tend to be highly proprietary. What can you do?
1. Pick a classification dataset.
2. Generate (x, a, r, p) quads via uniform random exploration of actions.

31 How do you test things? Contextual Bandit datasets tend to be highly proprietary. What can you do?
1. Pick a classification dataset.
2. Generate (x, a, r, p) quads via uniform random exploration of actions.
Apply the transform to the RCV1 dataset:
  wget
  wget
Output format is: action:cost:probability | features
Example: 1:1:0.5 | tuesday year million short compan vehicl line stat nanc commit exchang plan corp subsid credit issu debt pay gold bureau prelimin ren billion telephon time draw basic relat le spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip...
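A Python sketch of step 2, turning a multiclass dataset into bandit data by uniform random exploration (the function and names are illustrative, not the converter referenced on the title slide):

```python
# Sketch: simulate bandit feedback from a supervised multiclass dataset.
import random

def supervised_to_bandit(examples, num_actions):
    """examples: list of (features, true_label); returns (x, a, r, p) quads."""
    logs = []
    for x, y in examples:
        a = random.randrange(1, num_actions + 1)   # uniform random action
        r = 1.0 if a == y else 0.0                 # reward 1 iff the chosen class is correct
        p = 1.0 / num_actions                      # logged probability of that action
        logs.append((x, a, r, p))
    return logs
```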

32 How do you train?
1. Learn r̂(a, x).
2. For each x, compute the doubly robust estimate for each action a' ∈ {1, ..., K}:
   (r - r̂(a', x)) · 1(a' = a) / p(a|x) + r̂(a', x)
3. Learn π using a cost-sensitive classifier. We'll use Vowpal Wabbit:
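Before the Vowpal Wabbit commands on the next slide, here is a small Python sketch of step 2, building per-action costs from one logged quad with the doubly robust estimate (turning estimated rewards into costs by negation is my assumption, not something the slides specify):

```python
# Sketch: doubly robust per-action costs for cost-sensitive learning.
# One logged quad (x, a, r, p); r_hat(a, x) is the learned reward predictor.
def dr_costs(x, a, r, p, r_hat, K):
    costs = []
    for a_prime in range(1, K + 1):
        est = r_hat(a_prime, x)
        if a_prime == a:                 # importance-weighted correction for the logged action
            est += (r - r_hat(a_prime, x)) / p
        costs.append(-est)               # cost-sensitive learners minimize cost = -reward
    return costs                         # one cost per action for this x
```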

33 How do you train?
1. Learn r̂(a, x).
2. For each x, compute the doubly robust estimate for each action a' ∈ {1, ..., K}:
   (r - r̂(a', x)) · 1(a' = a) / p(a|x) + r̂(a', x)
3. Learn π using a cost-sensitive classifier. We'll use Vowpal Wabbit:
vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
  Progressive 0/1 loss:
vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l
  Progressive 0/1 loss:
vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l
  Progressive 0/1 loss:

34 Experimental Results
IPS = Inverse probability. DR = Doubly Robust, with r̂(a, x) = w_a · x. Filter Tree = cost-sensitive multiclass classifier. Offset Tree = earlier method for CB learning with the same representation.
[Figure: classification error of IPS (Filter Tree), DR (Filter Tree), and Offset Tree on the ecoli, glass, letter, optdigits, page-blocks, pendigits, satimage, vehicle, and yeast datasets.]

35 Summary of methods
1. Deployment, aka A/B testing. Gold standard for measurement and cost.
2. Direct Method. Often used by people who don't know what they are doing. Some value when used in conjunction with careful exploration.
3. Inverse probability. Unbiased, but possibly high variance.
4. Inverse propensity score. For when you don't know or don't trust recorded probabilities. A debugging tool that gives hints, but caution is in order.
5. Offset Tree (not discussed). The only known logarithmic-time method.
6. Doubly robust. Best known offline method. Unbiased + reduced variance.

36 Reminder: Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. The learner chooses an action a ∈ A.
3. The world reacts with reward r_a ∈ [0, 1].
Goal: Learn a good policy for choosing actions given context.
What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:
  Regret = max_{π ∈ Π} average_t ( r_{π(x)} - r_a )

37 What is exploration? Exploration = Choosing not-obviously best actions to gather information for better performance in the future.

38 What is exploration? Exploration = Choosing not-obviously best actions to gather information for better performance in the future. There are two kinds: 1 Deterministic. Choose action A, then B, then C, then A, then B,... 2 Randomized. Choose random actions according to some distribution over actions.

39 What is exploration? Exploration = choosing not-obviously-best actions to gather information for better performance in the future. There are two kinds:
1. Deterministic. Choose action A, then B, then C, then A, then B, ...
2. Randomized. Choose random actions according to some distribution over actions.
We discuss Randomized here.
1. There are no good deterministic exploration algorithms in this setting.
2. It supports off-policy evaluation. (See the first half.)
3. Randomize = robust to delayed updates, which are very common in practice.

40 Outline
1. Using Exploration
   1. Problem Definition
   2. Direct Method fails
   3. Importance Weighting
   4. Missing Probabilities
   5. Doubly Robust
2. Doing Exploration
   1. Exploration First
   2. ɛ-greedy
   3. Epoch Greedy
   4. Policy Elimination
   5. Other things: Thompson Sampling and Covering

41 Explore τ then Follow the Leader (Explore-τ)

42 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
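A Python sketch of this algorithm (not from the slides), using the uniform-exploration IPS estimate from the first half to pick the empirical best policy:

```python
# Sketch of Explore-tau then Follow the Leader.
import random

def explore_tau(world, policies, actions, T, tau):
    h = []                                          # exploration history
    for t in range(tau):                            # first tau rounds: explore uniformly
        x = world.context()
        a = random.choice(actions)
        r = world.reward(x, a)
        h.append((x, a, r))
    k = len(actions)
    def empirical_value(pi):                        # IPS estimate with p = 1/k
        return sum(r * k for x, a, r in h if pi(x) == a) / max(1, len(h))
    best = max(policies, key=empirical_value)       # follow the leader
    for t in range(T - tau):                        # remaining rounds: exploit
        x = world.context()
        world.reward(x, best(x))
    return best
```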

43 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.

44 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.
Proof: After τ rounds, a large deviation bound implies that the empirical average value of a policy deviates from its expectation E_{(x,r) ~ D}[ r_{π(x)} ] by at most √(A ln(|Π|/δ) / τ), so the regret is bounded by τ/T + ((T - τ)/T) · √(A ln(|Π|/δ) / τ).

45 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.
Proof: After τ rounds, a large deviation bound implies that the empirical average value of a policy deviates from its expectation E_{(x,r) ~ D}[ r_{π(x)} ] by at most √(A ln(|Π|/δ) / τ), so the regret is bounded by τ/T + ((T - τ)/T) · √(A ln(|Π|/δ) / τ).
At the optimal τ?

46 Explore τ then Follow the Leader (Explore-τ)
Initially, h = ∅.
For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r) to h.
For the next T - τ rounds, use the empirical best.
Suppose all examples are drawn from a fixed distribution D(x, r).
Theorem: For all D and Π, Explore-τ has regret O( τ/T + √(A ln|Π| / τ) ) with high probability.
Proof: After τ rounds, a large deviation bound implies that the empirical average value of a policy deviates from its expectation E_{(x,r) ~ D}[ r_{π(x)} ] by at most √(A ln(|Π|/δ) / τ), so the regret is bounded by τ/T + ((T - τ)/T) · √(A ln(|Π|/δ) / τ).
At the optimal τ? O( (A ln|Π| / T)^{1/3} )
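For the curious, a short LaTeX sketch (my addition, not from the slides) of where that exponent comes from, by balancing the two terms of the bound:

```latex
% Balancing exploration cost tau/T against estimation error sqrt(A ln|Pi| / tau).
\[
\frac{d}{d\tau}\left(\frac{\tau}{T} + \sqrt{\frac{A\ln|\Pi|}{\tau}}\right)
  = \frac{1}{T} - \frac{1}{2}\sqrt{A\ln|\Pi|}\;\tau^{-3/2} = 0
  \quad\Longrightarrow\quad
  \tau = \Theta\!\left(T^{2/3}\,(A\ln|\Pi|)^{1/3}\right),
\]
\[
\text{so the regret is}\quad
  O\!\left(\frac{\tau}{T}\right) = O\!\left(\left(\frac{A\ln|\Pi|}{T}\right)^{1/3}\right).
\]
```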

47 What does this mean?
1. +Easiest approach: offline prerecorded exploration can feed into any learning algorithm. See the first half.
2. -Doesn't adapt when the world changes.
3. -Underexploration is common. Think of clinical trials.

48 What does this mean?
1. +Easiest approach: offline prerecorded exploration can feed into any learning algorithm. See the first half.
2. -Doesn't adapt when the world changes.
3. -Underexploration is common. Think of clinical trials.
Can we do better?

49 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).

50 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
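A Python sketch of one ɛ-greedy round (world and learner are hypothetical stand-ins; learner.predict returns the currently learned best action; at least two actions are assumed):

```python
# Sketch of one epsilon-greedy round.
import random

def epsilon_greedy_step(world, learner, actions, epsilon):
    x = world.context()
    greedy = learner.predict(x)                 # the currently learned action
    if random.random() < 1 - epsilon:
        a, p = greedy, 1 - epsilon              # exploit
    else:
        others = [b for b in actions if b != greedy]
        a = random.choice(others)               # explore a different action uniformly
        p = epsilon / (len(actions) - 1)
    r = world.reward(x, a)
    learner.update(x, a, r, p)                  # learn with (x, a, r, p)
```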

51 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
Theorem: ɛ-greedy has regret O( ɛ + √(A ln|Π| / (T ɛ)) ).

52 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
Theorem: ɛ-greedy has regret O( ɛ + √(A ln|Π| / (T ɛ)) ). For the optimal ɛ?

53 ɛ-greedy
1. Observe x.
2. With probability 1 - ɛ:
   1. Choose the learned a.
   2. Observe r, and learn with (x, a, r, 1 - ɛ).
   With probability ɛ:
   1. Choose a uniform random other a.
   2. Observe r, and learn with (x, a, r, ɛ/(|A| - 1)).
Theorem: ɛ-greedy has regret O( ɛ + √(A ln|Π| / (T ɛ)) ). For the optimal ɛ? O( (A ln|Π| / T)^{1/3} )

54 What does this mean?
1. -Harder approach: needs an online learning algorithm.
2. +Adapts when the world changes.
3. -Overexploration is common. Bad possibilities keep being explored.

55 What does this mean?
1. -Harder approach: needs an online learning algorithm.
2. +Adapts when the world changes.
3. -Overexploration is common. Bad possibilities keep being explored.
Can we do better?

56 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.

57 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.
1. Observe x.
2. With probability 1 - ɛ_t:
   1. Choose the learned a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, 1 - ɛ_t).

58 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.
1. Observe x.
2. With probability 1 - ɛ_t:
   1. Choose the learned a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, 1 - ɛ_t).
   With probability ɛ_t:
   1. Choose a uniform random other a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, ɛ_t/(|A| - 1)).
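A Python sketch of one Epoch Greedy round; the t^{-1/3} schedule for ɛ_t below is an illustrative stand-in for the precision estimate, not the exact rule from the Langford-Zhang paper:

```python
# Sketch of one Epoch-Greedy round with a decaying exploration probability.
import random

def epoch_greedy_step(world, learner, actions, t, complexity):
    """`complexity` stands in for A*ln|Pi|; eps_t tracks the estimate's precision."""
    eps_t = min(1.0, (complexity / max(1, t)) ** (1.0 / 3.0))   # illustrative schedule
    x = world.context()
    greedy = learner.predict(x)
    if random.random() < 1 - eps_t:
        a, p = greedy, 1 - eps_t
    else:
        others = [b for b in actions if b != greedy]
        a = random.choice(others)
        p = eps_t / (len(actions) - 1)
    r = world.reward(x, a)
    learner.update(x, a, r, p)
```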

59 Epoch Greedy At every timestep t, the learned policy has an empirical performance known up to some precision ɛ_t, which can be estimated.
1. Observe x.
2. With probability 1 - ɛ_t:
   1. Choose the learned a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, 1 - ɛ_t).
   With probability ɛ_t:
   1. Choose a uniform random other a.
   2. Observe r, update ɛ_t, and learn with (x, a, r, ɛ_t/(|A| - 1)).
Theorem: Epoch Greedy has regret O( (A ln|Π| / T)^{1/3} ) with high probability. Autotuning!

60 What does this mean?
1. -Harder approach: needs an online learning algorithm + keeping track of a deviation bound.
2. +Adapts when the world changes.
3. +Neither under- nor over-exploration.

61 What does this mean?
1. -Harder approach: needs an online learning algorithm + keeping track of a deviation bound.
2. +Adapts when the world changes.
3. +Neither under- nor over-exploration.
Is it possible to do better?
  Supervised: Regret O( (ln|Π| / T)^{1/2} )
  τ-first / ɛ-greedy / epoch-greedy: O( (A ln|Π| / T)^{1/3} )

62 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. for every remaining policy π, the expected variance of a value estimate is small.
2. Observe x.
3. Let p(a) = the fraction of P choosing a given x.
4. Choose a ~ p and observe reward r.
5. Let Π_t = the remaining near-empirical-best policies.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

63 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. ∀π ∈ Π_{t-1}:
   E_{x ~ D_X}[ 1 / ((1 - Kµ_t) · Pr_{π' ~ P}(π'(x) = π(x)) + µ_t) ] ≤ 2K.
2. Observe x.
3. Let p(a) = the fraction of P choosing a given x.
4. Choose a ~ p and observe reward r.
5. Let Π_t = the remaining near-empirical-best policies.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

64 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. ∀π ∈ Π_{t-1}:
   E_{x ~ D_X}[ 1 / ((1 - Kµ_t) · Pr_{π' ~ P}(π'(x) = π(x)) + µ_t) ] ≤ 2K.
2. Observe x.
3. Let p(a) = (1 - Kµ_t) · Pr_{π ~ P}(π(x) = a) + µ_t.
4. Choose a ~ p and observe reward r.
5. Let Π_t = the remaining near-empirical-best policies.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

65 Better 1: Policy Elimination
Policy_Elimination: Let Π_0 = Π, µ_t = 1/√(Kt), and η_t(π) = empirical reward.
For each t = 1, 2, ...:
1. Choose a distribution P over Π_{t-1} s.t. ∀π ∈ Π_{t-1}:
   E_{x ~ D_X}[ 1 / ((1 - Kµ_t) · Pr_{π' ~ P}(π'(x) = π(x)) + µ_t) ] ≤ 2K.
2. Observe x.
3. Let p(a) = (1 - Kµ_t) · Pr_{π ~ P}(π(x) = a) + µ_t.
4. Choose a ~ p and observe reward r.
5. Let Π_t = {π ∈ Π_{t-1} : η_t(π) ≥ max_{π' ∈ Π_{t-1}} η_t(π') - Kµ_t}.
Theorem: With high probability, Policy_Elimination has regret O( √(A ln|Π| / T) ).

66 What does this mean?
1. -Doesn't adapt when the world changes.
2. ++Much more efficient exploration (only efficient in special cases).
3. -Much harder approach: need to keep track of policies, which is often intractable.

67 What does this mean?
1. -Doesn't adapt when the world changes.
2. ++Much more efficient exploration (only efficient in special cases).
3. -Much harder approach: need to keep track of policies, which is often intractable.
Adapting algorithms exist (EXP4). More efficient versions exist (RUCB), but they are not yet efficient enough.

68 Can you do better?

69 Can you do better? Not in general. Theorem: For all algorithms, there exist problems imposing regret Ω( √(A ln|Π| / T) ).

70 The current state
  Starter | Baseline | Purring | Shiny | Something to try | Shiny New

71 The current state
  Explore-τ (Simplest Possible) | Baseline | Purring | Shiny | Something to try | Shiny New

72 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Purring | Shiny | Something to try | Shiny New

73 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Shiny | Something to try | Shiny New

74 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Policy Elimination (Optimal Impractical) | Something to try | Shiny New

75 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Policy Elimination (Optimal Impractical) | Thompson Sampling (Sometimes Excellent) | Shiny New

76 The current state
  Explore-τ (Simplest Possible) | ɛ-greedy (Simplest Adaptive) | Epoch Greedy (Unequivocal Improvement) | Policy Elimination (Optimal Impractical) | Thompson Sampling (Sometimes Excellent) | Covering (The answer???)
You can see the edge of the understood world here. We hope to see further soon. Further discussion:

77 Bibliography: Using Exploration
[Inverse] An old technique, not sure where it was first used.
[Nonrand] J. Langford, A. Strehl, and J. Wortman, Exploration Scavenging, ICML 2008.
[Offset] A. Beygelzimer and J. Langford, The Offset Tree for Learning with Partial Labels, KDD 2009.
[Implicit] A. Strehl, J. Langford, S. Kakade, and L. Li, Learning from Logged Implicit Exploration Data, NIPS 2010.
[DRobust] M. Dudik, J. Langford, and L. Li, Doubly Robust Policy Evaluation and Learning, ICML 2011.

78 Bibliography: Doing Exploration
[Tau-first] Unclear first use?
[ɛ-greedy] Unclear first use?
[Epoch] J. Langford and T. Zhang, The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits, NIPS.
[EXP4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, The Nonstochastic Multiarmed Bandit Problem, SIAM Journal of Computing, 32(1):48-77, 2002.
[EXP4++] B. McMahan and M. Streeter, Tighter Bounds for Multi-Armed Bandits with Expert Advice, COLT.
[PolyElim] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, Efficient Optimal Learning for Contextual Bandits, UAI 2011.

79 Bibliography: Doing Exploration II
[Realizable] A. Agarwal, M. Dudik, S. Kale, and J. Langford, Contextual Bandit Learning with Predictable Rewards, AIStats 2012.
[Thompson] W. R. Thompson, On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Biometrika, 25(3-4):285-294, 1933.
[Prior fail] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire, An Optimal High Probability Algorithm for the Contextual Bandit Problem, AIStats 2011.
[Empirical] O. Chapelle and L. Li, An Empirical Evaluation of Thompson Sampling, NIPS 2011.
[Cover] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits.
