STAT 263/363: Experimental Design                                   Winter 2016/17
Lecture 4: January 23
Lecturer: Art B. Owen                                   Scribe: Zachary del Rosario

4.1 Bandits

Bandits are a form of online (adaptive) experiment; i.e., samples are drawn sequentially, and each sample is chosen based on the results of previous samples. They don't usually show up in DOE texts, but they should!

4.1.1 The Usual (Offline) Approach

We have k treatments and wish to determine which is best on average. Obtain samples X_{ij} for i = 1, ..., n_j and j = 1, ..., k, and compute

    \hat\mu_j = \bar X_j = \frac{1}{n_j} \sum_{i=1}^{n_j} X_{ij}

for each j = 1, ..., k. We then pick

    j^* = \arg\max_{1 \le j \le k} \hat\mu_j

and take this as our best treatment.

4.1.2 Alternative (Online) Approach

The idea behind the online approach is that we learn about the treatments as we perform the experiment. We continue to try all j = 1, ..., k, but choose the best-looking option more often. This is the idea behind the bandit. [1] Thompson proposed the following approach in 1933, in the context of clinical trials: following the notation above, choose j with probability Pr(j is best). We will give a concrete example of this in a (simple) setting below. Bandits are starting to gain popularity, and they enjoy optimality properties.

A simpler approach, suggested by Box, Hunter, and Hunter, is to perform a sort of moving experiment (sequential optimization): choose a set of samples in one region of parameter space, update based on the results, and repeat. This was originally proposed in the context of industrial processes.

[1] The term "bandit" comes from "one-armed bandit," another name for a slot machine. Old slot machines have a single lever, to which the "one arm" refers. They are bandits in the sense that they take your money. While one-armed bandits are not very interesting, k-armed bandits are somewhat more interesting: if you are forced to play, which arm should you pull to minimize your loss? Mathematicians formulated and studied this problem; e.g., Lai and Robbins.
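
To make the offline recipe concrete, here is a minimal Python sketch; the treatment means, sample sizes, and random data are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: samples[j] holds the n_j observations for treatment j.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.2, 0.5, 0.35)]

# Offline approach: estimate each treatment mean, then take the largest.
mu_hat = np.array([x.mean() for x in samples])   # \hat\mu_j = \bar X_j
j_star = int(np.argmax(mu_hat))                  # j^* = argmax_j \hat\mu_j
print(mu_hat, j_star)
```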

Bandits have also been used to serve advertisements alongside articles. They have similar applications in auctions, etc.

4.1.3 Bandit Concepts

Suppose we have k options to pick from at time t; we choose an option i, which pays off randomly. Define the payoff of option i at time t to be X_{it} ~ ν_i, where ν_i is the payoff distribution for choice i. We choose I_t ∈ {1, ..., k} sequentially and obtain X_{I_t t}. Thus, we can only learn about the ν_i through the X_{I_t t} we observe. We define the best choice to be

    i^* = \arg\max_{1 \le i \le k} \mu_i,  where  \mu_i \equiv E(X_{it}).

The choice i^* pays \mu^* = \max_{1 \le i \le k} \mu_i on average. The aim of a bandit is to pick the I_t intelligently, in order to minimize our regret. Regret can be defined in various ways; if we play for n total timesteps, we may define

    \bar R_n = n \mu^* - E\Big( \sum_{t=1}^{n} \mu_{I_t} \Big),
    R_n = \max_{1 \le i \le k} \sum_{t=1}^{n} (X_{it} - X_{I_t t}),

where R_n is the regret and \bar R_n is the pseudo-regret. There are various ways to optimize regret; one such method is the Upper Confidence Bound (UCB) heuristic.

4.1.4 Upper Confidence Bound Heuristic

Figure 4.1 depicts a possible scenario in the middle of a bandit procedure. Based on the data collected so far, we have the depicted confidence intervals for the three treatment groups. Treatment A has the greatest possibility: it may be the case that the true distribution for A is skewed higher than we've seen so far. Treatment B has the highest mean, so it seems like a better bet on average. Treatment C has the highest lower bound, so it seems like a less risky bet than the others. Which treatment should we choose next?

The Upper Confidence Bound (UCB) heuristic suggests we pick A in this case. There is both the potential for greater gains and the opportunity to learn more about A (tightening its confidence bounds): we will discover whether A is good. Asymptotically, the UCB heuristic is optimal in both R_n and \bar R_n.

Figure 4.1. Example treatment confidence intervals, based on observations taken so far; note that A has the greatest upper bound, B the greatest mean, and C the greatest lower bound.

Note that our confidence level drops in the finite-n setting. For example, if we use 95% confidence intervals, we have

    \bar X_i \pm 1.96 \, S_i / \sqrt{n_i},

where the bounds shrink as n_i grows. Alternatively, in the n \to \infty setting, we may compute an alternative (discounted) regret

    \sum_{t=1}^{\infty} \big( \mu^* - E(\mu_{I_t}) \big) \, \theta^t,

where 0 < θ < 1 is a chosen discount factor. This approach yields a fixed confidence level. (I didn't understand this fully, and am probably explaining it poorly.)
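
As a concrete illustration of the UCB heuristic, here is a minimal Python sketch that uses the normal-theory upper bound \bar X_i + 1.96 S_i / \sqrt{n_i} from the note above; the recorded rewards are invented for illustration, and this is not the specific UCB variant behind the optimality results mentioned earlier.

```python
import numpy as np

def ucb_index(rewards, z=1.96):
    """Upper end of the ~95% confidence interval for one arm's mean."""
    x = np.asarray(rewards, dtype=float)
    n = len(x)
    if n < 2:                       # force at least two pulls per arm
        return np.inf
    return x.mean() + z * x.std(ddof=1) / np.sqrt(n)

# Hypothetical observations so far for treatments A, B, C.
history = {
    "A": [0.1, 0.9, 0.4],
    "B": [0.6, 0.7, 0.65, 0.7],
    "C": [0.55, 0.6, 0.58, 0.57, 0.59],
}

# Pick the arm with the greatest upper confidence bound; an arm with a wide
# interval (like A) can win even if its sample mean is not the largest.
next_arm = max(history, key=lambda a: ucb_index(history[a]))
print(next_arm)
```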

4.1.5 Modern Bandit Versions

There are many modern versions of the bandit procedure, which deal with various issues:

- Many arms, e.g. news stories
- Adversarial payoffs, e.g. spam
- Many parameters, e.g. whether a story is about the weather
- Constraints, e.g. a budget
- Drifting payoffs, so-called restless bandits
- Linked payoffs, e.g. the electric grid

4.1.6 Thompson Sampling

Thompson Sampling is a Bayesian approach to bandits; this implies that we will define a prior and a likelihood function, and obtain a posterior distribution. In this approach, at each timestep t we pick I_t = i with probability Pr(µ_i is largest). In the special case of Bernoulli bandits, we can derive an analytic posterior distribution; this is detailed below.

Suppose that for i = 1, ..., k we have X_{it} ∈ {0, 1} with unknown Bernoulli parameter µ_i. Then our variables follow

    X_{it} = 0 with probability 1 - µ_i,
    X_{it} = 1 with probability µ_i,

and we choose a uniform prior µ_i ~ U(0, 1) = Beta(1, 1). We use the equivalent Beta distribution because it gives us a conjugate prior. [2] Recall that a Beta(α, β) distribution has a PDF proportional to x^{α-1}(1 - x)^{β-1} with α, β > 0, and has expectation

    E(Beta(α, β)) = \frac{α}{α + β}.

Note that if we have µ ~ Beta(α, β) and X ~ Ber(µ), then µ | X ~ Beta(α + X, β + 1 - X); this is the aforementioned conjugate prior property. In our Thompson Sampling procedure, after S_i successes and F_i failures on arm i, we will have the posterior µ_i ~ Beta(1 + S_i, 1 + F_i). Combining these elements, we can define the following algorithm.

[2] A conjugate prior is a distribution chosen such that our posterior distribution remains in the same family as our prior. This is a useful property!
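
Here is a minimal sketch of the conjugate update just described, assuming SciPy is available; the observation sequence is an arbitrary illustration. Starting from the Beta(1, 1) prior, each Bernoulli observation simply increments one of the two Beta parameters.

```python
from scipy import stats

alpha, beta = 1.0, 1.0           # uniform prior: Beta(1, 1)
observations = [1, 0, 1, 1, 0]   # hypothetical successes/failures for one arm

for x in observations:
    alpha += x                   # mu | X ~ Beta(alpha + X, beta + 1 - X)
    beta += 1 - x

posterior = stats.beta(alpha, beta)
print(posterior.mean())          # E(Beta(4, 3)) = 4/7
```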

    Initialize: for i = 1, ..., k: set S_i ← 0, F_i ← 0
    Main loop: for t = 1, ..., n:
        for i = 1, ..., k: draw θ_i ~ Beta(S_i + 1, F_i + 1)
        I_t ← argmax_i θ_i
        observe X_{I_t, t}
        S_{I_t} ← S_{I_t} + X_{I_t, t}
        F_{I_t} ← F_{I_t} + (1 - X_{I_t, t})

Note: Drawing θ_i as above allows us to pick each arm with exactly the posterior probability that it is best. Also note: this approach is easier than messing with confidence intervals, and we can modify the procedure to update offline rather than online; if pre-existing data are available, one may incorporate them into the prior. Also also note: there is theory to support this approach; some results are summarized below, and a simulation sketch of the procedure appears after Section 4.1.7.

4.1.7 Theory: Agrawal & Goyal (2012, Microsoft Research)

Agrawal and Goyal analyzed the Thompson Sampling procedure for multi-armed bandits. To illustrate one of their results, consider the k = 2 case, and without loss of generality assume arm 1 is best, with Δ = µ_1 - µ_2 ≥ 0. Then Thompson Sampling gives

    E(R_n) = O\Big( \frac{\log n}{\Delta} + \frac{1}{\Delta^3} \Big).

Note that:

1. We have E(R_n)/n = O(\log n / n), which tends to zero as n → ∞.
2. We have worse E(R_n) for small Δ.

Recent theory (Grid Science 2017) has shown that the Thompson Sampling procedure is actually even better than this. Generally, for k ≥ 3 we have

    E(R_n) = O\Big( \Big( \sum_{i=2}^{k} \frac{1}{\Delta_i^2} \Big)^2 \log n \Big),

where Δ_i = µ_1 - µ_i. Aside: the UCB heuristic is also O(\log n / \Delta).
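
The following is a minimal simulation sketch of the Bernoulli Thompson Sampling procedure above (the true arm means and the horizon n are invented for illustration); it also tracks the pseudo-regret \bar R_n, which by the results of Section 4.1.7 should grow only logarithmically in n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.6, 0.4])         # hypothetical true Bernoulli means
k, n = len(mu), 10_000
S = np.zeros(k)                        # successes per arm
F = np.zeros(k)                        # failures per arm
pseudo_regret = 0.0

for t in range(n):
    theta = rng.beta(S + 1, F + 1)     # one posterior draw per arm
    i = int(np.argmax(theta))          # I_t = argmax_i theta_i
    x = rng.binomial(1, mu[i])         # observe X_{I_t, t}
    S[i] += x
    F[i] += 1 - x
    pseudo_regret += mu.max() - mu[i]  # running contribution to \bar R_n

print(S + F)                           # pull counts: most pulls go to the best arm
print(pseudo_regret)                   # grows roughly like log n
```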

4.1.8 Modified Algorithm for a Bounded Variable

Suppose our response is continuous rather than Bernoulli. We can attempt to modify the procedure to handle this case. If our possible response values lie in some finite interval(s), we can normalize so that 0 ≤ X_{it} ≤ 1 and compute:

    Initialize: for i = 1, ..., k: set S_i ← 0, F_i ← 0
    Main loop: for t = 1, ..., n:
        for i = 1, ..., k: draw θ_i ~ Beta(S_i + 1, F_i + 1)
        I_t ← argmax_i θ_i
        observe X_{I_t, t}
        draw Y ~ Ber(X_{I_t, t})        (a Bernoulli draw based on the response)
        S_{I_t} ← S_{I_t} + Y
        F_{I_t} ← F_{I_t} + (1 - Y)

Note: there is (asymptotic?) theory to support this approach, but it is an open question whether or not it is a good idea in practice. (A sketch of the binarization step appears after Section 4.1.9 below.)

4.1.9 Bandits vs A/B Testing

On any post about A/B testing on any hacker site, there will inevitably be someone claiming that bandits are a superior approach. [3] The best approach is almost surely problem dependent. One advantage of the bandit approach is that the procedure adapts automatically. However, one could argue that this is also a disadvantage, as it requires that we waste effort discovering behavior: our procedure is not optimized a priori.

Additionally, we may miss some insight by formulating a bandit problem. One of Prof. Owen's students worked at Microsoft on their search engine. During their tenure there, a bug found its way into the search results: the ranked items were shown in reverse order. Since the top results were not appealing, users tended to click on ads rather than actual search results! This insight was discovered through A/B testing; a randomized controlled experiment can give us a causal story. An adaptive procedure may converge to optimized results, but it does not guarantee causal inference.

Finally, bandits have not caught on in some fields. For instance, ethical issues (and the FDA) discourage using a bandit formulation in clinical trials. This is curious, as Thompson originally proposed this procedure in the context of clinical trials!

[3] The converse is also true.
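
Returning to the bounded-response modification of Section 4.1.8, here is a minimal sketch of the one step that changes relative to the Bernoulli version: the observed response, assumed already normalized to [0, 1], is converted into a synthetic Bernoulli outcome before the usual Beta update. The example reward value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def binarized_update(S, F, i, x):
    """Beta update for arm i given a normalized response x in [0, 1]."""
    y = rng.binomial(1, x)   # Y ~ Ber(X_{I_t, t}): binarize the response
    S[i] += y
    F[i] += 1 - y

S, F = np.zeros(3), np.zeros(3)
binarized_update(S, F, i=1, x=0.73)   # e.g. a normalized continuous reward of 0.73
print(S, F)
```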

4.2 Benefits of Randomization

Randomization is great! Let's illustrate why with an example. Suppose we have 10 plots of land, and we aim to compare tomato varieties A and B. Consider the experimental designs depicted in Table 4.1.

    Plot:      1  2  3  4  5  6  7  8  9  10
    Design 1:  A  A  A  A  A  B  B  B  B  B
    Design 2:  A  B  A  B  A  B  A  B  A  B
    Design 3:  A  B  B  B  A  A  B  A  A  B

    Table 4.1. Example designs for the tomato experiment.

Design 1 is a terrible idea: what if there is a fertility gradient in the land? This would strongly bias the results. Design 2 should negate such a gradient, but it is still susceptible to periodic effects. Design 3 is randomly generated; we expect that it will perform better than the other designs in some sense. To quantify the differences between these designs, let's study a test statistic one might use to infer differences between the two treatments.

4.2.1 Example t-test Comparison

Consider the (un-normalized) t statistic

    (\bar Y_A - \bar Y_B) - (\mu_A - \mu_B) = \bar e_A - \bar e_B,

where the \bar e terms are the treatment-averaged error terms. Our null hypothesis H_0 is that treatments A and B are the same; under this assumption the statistic above should have mean zero. However, our experimental design may bias the statistic away from zero; this could lead us to incorrectly reject H_0!

As a concrete example, suppose there exists a fertility gradient which manifests as E(e_i) = c(i - 5.5) for plots i = 1, ..., 10 and some constant c ∈ R; that is, there is a linear fertility gradient across the field. Under this gradient, the statistic above suffers the biases shown in Table 4.2.

    Design   Bias
    1        -5c
    2        -c
    3        3c/5

    Table 4.2. Summary of bias values for the fertility gradient example.

As expected, Design 1 exhibits a very strong bias. Design 2 has a comparatively weaker bias, which is expected, as it was designed to counter the gradient. However, Design 3 has an even smaller (in magnitude) bias! Randomization tends to give designs which are robust against these effects. (See also the numerical check at the end of this section.)

For another comparison, suppose the plots are correlated with their nearest neighbors in the following way:

    Corr(e_i, e_j) ≡ E_{ij} = { 1 if i = j;  ρ if |i - j| = 1;  0 otherwise }.

This leads to a variance of the treatment difference given by

    V(\bar Y_A - \bar Y_B) = \frac{\sigma^2}{25} \nu^T E \nu
                           = \frac{2}{5} \sigma^2 + \frac{\rho \sigma^2}{25} \kappa
                           = \frac{\sigma^2}{25} (10 + \rho \kappa),

where ν ∈ {±1}^{10} is a vector which describes the experimental design (+1 for plots assigned A, -1 for plots assigned B), and κ = 2 \sum_{i=1}^{9} ν_i ν_{i+1} varies with the design. The quantities for the different designs are summarized in Table 4.3.

    Design   ν^T                                           κ
    1        (+1, +1, +1, +1, +1, -1, -1, -1, -1, -1)      14
    2        (+1, -1, +1, -1, +1, -1, +1, -1, +1, -1)      -18
    3        (+1, -1, -1, -1, +1, +1, -1, +1, +1, -1)      -2

    Table 4.3. Summary of results for the variance example.

Design 1 has the largest variance; it is overall a very poor design. Design 2 has the least variance, while Design 3 sits in the middle. [4] Selecting between Design 2 and Design 3 requires that we make a tradeoff between bias and variance. However, in general, randomization tends to balance adjacency effects. Box, Hunter, and Hunter note that randomization allows us to perform our usual statistical tests without accounting for the complicated minutiae of our design: randomization makes t-tests safer.

[4] An observant reader may worry that Design 2 has a negative variance. This is not the case, as the requirement that E be positive semidefinite puts constraints on the value of ρ.
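
To check the two tables numerically, here is a minimal Python sketch that recomputes, for each design in Table 4.1, the gradient bias of Table 4.2 (in units of c) and the adjacency term κ of Table 4.3.

```python
import numpy as np

designs = {
    1: "AAAAABBBBB",
    2: "ABABABABAB",
    3: "ABBBAABAAB",
}
plots = np.arange(1, 11)        # plot indices i = 1, ..., 10
gradient = plots - 5.5          # E(e_i) = c (i - 5.5), with c factored out

for d, layout in designs.items():
    nu = np.array([+1 if s == "A" else -1 for s in layout])
    # Bias of Ybar_A - Ybar_B under the fertility gradient, in units of c.
    bias = gradient[nu == 1].mean() - gradient[nu == -1].mean()
    # Adjacency term, so that nu^T E nu = 10 + rho * kappa.
    kappa = 2 * np.sum(nu[:-1] * nu[1:])
    print(d, bias, kappa)       # Design 1: -5.0, 14; Design 2: -1.0, -18; Design 3: 0.6, -2
```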