Multi-Armed Bandits: Non-adaptive and Adaptive Sampling


CSE 547/Stat 548: Machine Learning for Big Data
Lecture: Multi-Armed Bandits: Non-adaptive and Adaptive Sampling
Instructor: Sham Kakade

1 The (stochastic) multi-armed bandit problem

The basic paradigm is as follows:

- $K$ independent arms: $\{1, \ldots, K\}$.
- Each arm $a$ returns a random reward $R_a$ if pulled. (Simpler case:) assume the distribution of $R_a$ is not time varying.
- Game: you choose arm $a_t$ at time $t$. You then observe $X_t = R_{a_t}$, where $R_{a_t}$ is sampled from the underlying distribution of that arm.
- Critically, the distribution of $R_a$ is not known.

1.1 Regret: an online performance measure

Our objective is to maximize our long-term reward. We have a (possibly randomized) sequential strategy/algorithm $A$, which is of the form
$$a_t = A(a_1, X_1, a_2, X_2, \ldots, a_{t-1}, X_{t-1}).$$
In $T$ rounds, our reward is
$$\mathbb{E}\Big[\sum_{t=1}^{T} X_t \;\Big|\; A\Big],$$
where the expectation is with respect to the reward process and our algorithm. Suppose $\mu_a = \mathbb{E}[R_a]$, and let us assume $0 \le \mu_a \le 1$. Also, define
$$\mu^* = \max_a \mu_a.$$
In $T$ rounds and in expectation, the best we can do is obtain $\mu^* T$. We will measure our performance by our expected regret, defined as follows. In $T$ rounds, our (observed) regret is
$$\mu^* T - \sum_{t=1}^{T} X_t,$$
and our expected regret is
$$\mu^* T - \mathbb{E}\Big[\sum_{t=1}^{T} X_t \;\Big|\; A\Big],$$
where the expectation is with respect to the randomness in our outcomes (and possibly our algorithm, if it is randomized).
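To make the protocol concrete, here is a minimal simulation sketch. It is not part of the original notes: the Bernoulli reward model, the example arm means, and the names `run_bandit` and `policy` are illustrative assumptions. It draws $X_t = R_{a_t}$ for whatever arm a policy chooses and reports the observed regret $\mu^* T - \sum_t X_t$.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(policy, mu, T):
    """Play T rounds of a stochastic K-armed bandit with Bernoulli(mu[a]) rewards.

    `policy(history)` maps the list of past (arm, reward) pairs to the next arm a_t.
    Returns the observed regret  mu* T - sum_t X_t.
    """
    history, total_reward = [], 0.0
    for _ in range(T):
        a = policy(history)              # a_t = A(a_1, X_1, ..., a_{t-1}, X_{t-1})
        x = float(rng.random() < mu[a])  # X_t = R_{a_t}, a Bernoulli draw from arm a_t
        history.append((a, x))
        total_reward += x
    return max(mu) * T - total_reward    # observed regret: mu* T - sum_t X_t

# A deliberately weak baseline: pick an arm uniformly at random each round.
mu = [0.3, 0.5, 0.7]
print(run_bandit(lambda history: int(rng.integers(len(mu))), mu, T=1000))
```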

1.2 Caveat

Our presentation in these notes will be loose in terms of $\log(\cdot)$ factors, in both $K$ and $T$. There are multiple good treatments that provide improvements in terms of these factors.

2 Review: Hoeffding's bound

With $N$ samples, denote the sample mean as
$$\hat\mu = \frac{1}{N} \sum_{t=1}^{N} X_t.$$

Lemma 2.1. Supposing that the $X_t$'s have an i.i.d. distribution and are bounded between 0 and 1, then, with probability greater than $1 - \delta$, we have that
$$|\hat\mu - \mu| \le \sqrt{\frac{\log(2/\delta)}{2N}}.$$

3 Warmup: A non-adaptive strategy

Suppose we first pull each arm $N$ times, in an exploration phase. Then, for the remainder of the $T$ steps, we pull the arm which had the best observed reward during the exploration phase.

By the union bound, with probability greater than $1 - \delta$, for all actions $a$,
$$|\hat\mu_a - \mu_a| \le O\Big(\sqrt{\frac{\log(K/\delta)}{N}}\Big).$$
To see this, we simply set each arm's error probability to $\delta/K$, so the total error probability is $\delta$. Thus all the confidence intervals will hold.

During the exploration rounds, our cumulative regret is at most $KN$, a trivial upper bound. During the exploitation rounds, let us bound our cumulative regret over the remaining $T - KN$ steps. Note that for the arm $\hat i$ that we pull, we must have that
$$\hat\mu_{\hat i} \ge \hat\mu_{i^*},$$
where $i^*$ is an optimal arm. This implies that
$$\mu_{\hat i} \ge \mu^* - c\sqrt{\frac{\log(K/\delta)}{N}},$$
where $c$ is a universal constant. To see this, note that by construction of the algorithm $\hat\mu_{\hat i} \ge \hat\mu_{i^*}$, which implies
$$\mu_{\hat i} \ge \hat\mu_{\hat i} - |\hat\mu_{\hat i} - \mu_{\hat i}| \ge \hat\mu_{i^*} - |\hat\mu_{\hat i} - \mu_{\hat i}| \ge \mu_{i^*} - |\hat\mu_{i^*} - \mu_{i^*}| - |\hat\mu_{\hat i} - \mu_{\hat i}|,$$
and the claim follows using the confidence interval bounds.
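As a concrete illustration, here is a minimal sketch of this non-adaptive (explore-then-commit) strategy, written against the hypothetical `run_bandit` interface from the earlier snippet. The round-robin exploration order and the helper name `explore_then_commit` are assumptions for illustration, not part of the notes.

```python
import numpy as np

def explore_then_commit(K, N):
    """Non-adaptive strategy: pull each arm N times (exploration), then commit to
    the arm with the best empirical mean for all remaining rounds (exploitation)."""
    def policy(history):
        t = len(history)
        if t < K * N:
            return t % K                      # exploration: round-robin over the K arms
        sums = np.zeros(K)
        for a, x in history[:K * N]:          # use only the exploration-phase samples
            sums[a] += x
        return int(np.argmax(sums / N))       # commit to  arg max_a  mu_hat_a
    return policy

# Usage with the run_bandit sketch above; N = K^{-2/3} T^{2/3} (log KT)^{1/3} pulls
# per arm, the choice optimized in Lemma 3.1 below.
mu, T = [0.3, 0.5, 0.7], 10_000
K = len(mu)
N = max(1, int(K ** (-2 / 3) * T ** (2 / 3) * np.log(K * T) ** (1 / 3)))
print(run_bandit(explore_then_commit(K, N), mu, T))
```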

Putting the exploration and exploitation rounds together: with probability greater than $1 - \delta$, our total regret is bounded as
$$\mu^* T - \sum_{t=1}^{T} X_t \le KN + O\Big(\sqrt{\frac{\log(K/\delta)}{N}}\Big)(T - KN).$$
Now let us optimize for $N$.

Lemma 3.1 (Regret of the non-adaptive strategy). The total expected regret of the non-adaptive strategy is
$$\mu^* T - \mathbb{E}\Big[\sum_{t=1}^{T} X_t\Big] \le c\, K^{1/3} T^{2/3} (\log T)^{1/3},$$
where $c$ is a universal constant.

Proof. Choose $N = K^{-2/3} T^{2/3} (\log(KT))^{1/3}$ (this choice balances the exploration and exploitation terms) and $\delta = 1/T^2$. Note that with probability greater than $1 - 1/T^2$, our regret is bounded by $O(K^{1/3} T^{2/3} (\log(KT))^{1/3})$. Also, if we fail, the largest regret we can pay is $T$, and this occurs with probability less than $1/T^2$, so the regret is:
$$\text{exp. regret} \le \Pr(\text{no failure event}) \cdot O\big(K^{1/3} T^{2/3} (\log(KT))^{1/3}\big) + \Pr(\text{failure event}) \cdot T \le c\,(1 - 1/T^2) K^{1/3} T^{2/3} (\log(KT))^{1/3} + \frac{1}{T}.$$
This shows that the regret is bounded as $O(K^{1/3} T^{2/3} (\log(KT))^{1/3})$. For $T > K$, $\log(KT) \le 2 \log T$ (and for $T \le K$, the claimed regret bound is trivially true, since the regret is at most $T \le K^{1/3} T^{2/3}$). This completes the proof (for a different universal constant).

3.1 A (minimax) optimal adaptive algorithm

We will now provide an optimal (up to log factors) algorithm (optimal under the i.i.d. assumption on how the rewards are distributed, and using that the rewards are upper bounded by 1).

Let $N_{a,t}$ be the number of times we pulled arm $a$ up to time $t$. The question is: which arm should we pull at time $t + 1$?

3.2 Confidence bounds

If we don't care about log factors, then the following is a straightforward argument to see that our confidence bounds will simultaneously hold for all times $t$ (from $0$ to $\infty$) and all $K$ arms.

Lemma 3.2. With probability greater than $1 - \delta$, we have that for all times $t \ge K$ and all $a \in [K]$,
$$|\hat\mu_{a,t} - \mu_a| \le c\sqrt{\frac{\log(t/\delta)}{N_{a,t}}},$$
where $c$ is a universal constant.

Proof. We will actually prove a stronger statement: supposing that we observe the outcome of every arm, we will first provide a probabilistic statement for the confidence intervals of all the arms (and for all sample sizes). Let us apply Hoeffding's bound with an error probability of $\delta/(K N_a^2)$. Specifically, for arm $a$ with $N_a$ samples, we have that with probability greater than $1 - \delta/(K N_a^2)$,
$$|\hat\mu_{a,N_a} - \mu_a| \le c\sqrt{\frac{\log(K N_a/\delta)}{N_a}}$$

(by a straightforward application of Hoeffding's bound). Note that the total error probability, over all arms and over all sample sizes, is
$$\sum_{a=1}^{K} \sum_{N_a=1}^{\infty} \frac{\delta}{K N_a^2} = \delta\, \pi^2/6$$
(the $\pi^2/6$ is from the Basel problem). Note the sum is finite, which means the total error probability for all of these confidence intervals is less than a constant times $\delta$. We have thus shown the following (note the quantifiers): with probability greater than $1 - \delta$, for all arms $a$ and all sample sizes $N_a \ge 1$,
$$|\hat\mu_{a,N_a} - \mu_a| \le c\sqrt{\frac{\log(N_a K/\delta)}{N_a}}$$
(for a possibly different constant $c$). Observe that the confidence bound that any algorithm uses at time $t$ is due to having $N_{a,t}$ samples, so we can now apply the above bound in this case, where
$$c\sqrt{\frac{\log(N_{a,t} K/\delta)}{N_{a,t}}} \le c\sqrt{\frac{\log(tK/\delta)}{N_{a,t}}}$$
since $N_{a,t} \le t$. This shows that these confidence bounds are valid for all times $t$ and all arms. The proof is completed by noting that, for $t \ge K$, $\log(Kt) \le 2\log t$, so $\log(tK/\delta) \le 2\log(t/\delta)$ (absorbed into the constant $c$).

3.3 The Upper Confidence Bound (UCB) Algorithm

At each time $t$:

- Pull arm
$$a_t = \arg\max_a \; \hat\mu_{a,t} + c\sqrt{\frac{\log(t/\delta)}{N_{a,t}}} := \arg\max_a \; \hat\mu_{a,t} + \mathrm{ConfBound}_{a,t}$$
(where $c \le 10$ is a constant).
- Observe reward $X_t$.
- Update $\hat\mu_{a,t}$, $N_{a,t}$, and $\mathrm{ConfBound}_{a,t}$.

With probability greater than $1 - \delta$, all the confidence bounds will hold for all arms and all times $t$.

3.4 Analysis of UCB

If we pull arm $a$ at time $t$ (i.e. $a_t = a$), what is our instantaneous regret, i.e. what is $\mu^* - \mu_a$?

Let $i^*$ be an optimal arm. Note that by construction of the algorithm, if we pull arm $a$ at time $t$, then
$$\hat\mu_{a,t} + \mathrm{ConfBound}_{a,t} \ge \hat\mu_{i^*,t} + \mathrm{ConfBound}_{i^*,t} \ge \mu_{i^*},$$
where the last step follows because $\mu_{i^*}$ is contained within the confidence interval for $i^*$. Using this, we have that
$$\mu^* - \mu_a \le \hat\mu_{a,t} + \mathrm{ConfBound}_{a,t} - \mu_a \le 2\,\mathrm{ConfBound}_{a,t},$$
where the last step uses that $\mu_a \ge \hat\mu_{a,t} - \mathrm{ConfBound}_{a,t}$.
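Here is a minimal sketch of this UCB rule in the same hypothetical `run_bandit` interface. The default constant `c=2.0` and the initialization of pulling each arm once are illustrative assumptions (the notes only require a universal constant, e.g. $c \le 10$).

```python
import numpy as np

def ucb(K, delta, c=2.0):
    """UCB: pull each arm once, then play  arg max_a  mu_hat_{a,t} + ConfBound_{a,t}
    with  ConfBound_{a,t} = c * sqrt(log(t / delta) / N_{a,t})."""
    def policy(history):
        t = len(history)
        if t < K:
            return t                                    # initialization: pull every arm once
        counts, sums = np.zeros(K), np.zeros(K)
        for a, x in history:                            # N_{a,t} and running reward sums
            counts[a] += 1
            sums[a] += x
        mu_hat = sums / counts                          # empirical means  mu_hat_{a,t}
        conf = c * np.sqrt(np.log(t / delta) / counts)  # ConfBound_{a,t}
        return int(np.argmax(mu_hat + conf))
    return policy

# Usage with the run_bandit sketch above; delta = 1/T^2 as in the analysis below.
mu, T = [0.3, 0.5, 0.7], 10_000
print(run_bandit(ucb(len(mu), delta=1.0 / T ** 2), mu, T))
```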

Theorem 3.3 (UCB regret). The total expected regret of UCB is
$$\mu^* T - \mathbb{E}\Big[\sum_{t=1}^{T} X_t\Big] \le c\sqrt{KT \log T}$$
for an appropriately chosen universal constant $c$.

Proof. On the event that all confidence bounds hold (which has probability greater than $1 - \delta$), the regret is bounded as
$$\mu^* T - \sum_{t=1}^{T} X_t \le 2\sum_{t=1}^{T} \mathrm{ConfBound}_{a_t,t} \le 2c\sum_{t=1}^{T} \sqrt{\frac{\log(T/\delta)}{N_{a_t,t}}} \le 2c\sum_{a} \sqrt{N_{a,T}\,\log(T/\delta)}, \qquad (1)$$
where the last step uses that $\sum_{m=1}^{N} 1/\sqrt{m} \le 2\sqrt{N}$ (with the extra factor of 2 absorbed into $c$). Note that the following constraint on the $N_{a,T}$'s must hold:
$$\sum_{a} N_{a,T} = T.$$
One can now show that the worst-case setting of the $N_{a,T}$'s, the one that makes Equation 1 as large as possible (subject to this constraint), is $N_{a,T} = T/K$ for every arm $a$: by Cauchy-Schwarz, $\sum_a \sqrt{N_{a,T}} \le \sqrt{K \sum_a N_{a,T}} = \sqrt{KT}$. Finally, to obtain the expected regret bound, the proof is identical to that of the previous argument (in the non-adaptive case), where we choose $\delta = 1/T^2$.
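For a rough sanity check of the two rates ($T^{2/3}$ for the non-adaptive strategy versus $\sqrt{T}$ for UCB), one can run both sketches at a few horizons. This is only an illustrative experiment, assuming the `run_bandit`, `explore_then_commit`, and `ucb` sketches above; the exact numbers depend on the random seed.

```python
import numpy as np

# Compare observed regret of the two strategies at a few horizons T.
mu = [0.3, 0.5, 0.7]
K = len(mu)
for T in (1_000, 10_000, 100_000):
    N = max(1, int(K ** (-2 / 3) * T ** (2 / 3) * np.log(K * T) ** (1 / 3)))
    r_nonadaptive = run_bandit(explore_then_commit(K, N), mu, T)
    r_ucb = run_bandit(ucb(K, delta=1.0 / T ** 2), mu, T)
    print(f"T={T:>7}   non-adaptive regret ~ {r_nonadaptive:8.1f}   UCB regret ~ {r_ucb:8.1f}")
```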