Multi-armed Bandits and the Gittins Index


1 Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge A talk to accompany Lecture 6

2 Two-armed Bandit
Arm 1: 3, 10, 4, 9, 12, 1, ...
Arm 2: 5, 6, 2, 15, 2, 7, ...

3 Two-armed Bandit
Pull arm 2 first and collect the reward 5; the remaining unobserved sequences are 3, 10, 4, 9, 12, 1, ... and 6, 2, 15, 2, 7, ...

4 Two-armed Bandit
Pull arm 2 again and collect the reward 6; the rewards collected so far are 5, 6.

5 Two-armed Bandit
Continuing in this way (arm 2 twice, arm 1 five times, arm 2 twice more), the rewards collected are 5, 6, 3, 10, 4, 9, 12, 2, 15, leaving 1, ... on arm 1 and 2, 7, ... on arm 2, so that
$$\text{Reward} = 5 + 6\beta + 3\beta^2 + 10\beta^3 + \cdots, \qquad 0 < \beta < 1.$$
Of course, in practice we must choose which arms to pull without knowing the future sequences of rewards.
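To make the objective concrete, here is a tiny Python sketch (the variable names and the value of β are my own illustrative choices) that evaluates the discounted reward of this particular pull sequence:

```python
beta = 0.9  # discount factor, 0 < beta < 1 (illustrative value)

# Rewards collected above: arm 2 twice, then arm 1 five times, then arm 2 twice.
rewards = [5, 6, 3, 10, 4, 9, 12, 2, 15]

# Reward = 5 + 6*beta + 3*beta**2 + 10*beta**3 + ...
total = sum(x * beta**t for t, x in enumerate(rewards))
print(round(total, 3))
```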

6 Dynamic Effort Allocation Research projects: how should I allocate my research time amongst my favorite open problems so as to maximize the value of my completed research? Job Scheduling: in what order should I work on the tasks in my in-tray? Searching for information: shall I spend more time browsing the web, or go to the library, or ask a friend? Dating strategy: should I contact a new prospect, or try another date with someone I have dated before?

7 Information vs. Immediate Payoff In all these problems one wishes to learn about the effectiveness of alternative strategies, while simultaneously wishing to use the best strategy in the short term. "Bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff." (Peter Whittle, 1989) Exploration versus exploitation.

8 Multi-armed Bandit Allocation Indices, 2nd edition, 11 March 2011, Gittins, Glazebrook and Weber.

9 Clinical Trials

10 Bernoulli Bandits One of $N$ drugs is to be administered at each of times $t = 0, 1, \ldots$. The $s$th time drug $i$ is used it is successful, $X_i(s) = 1$, or unsuccessful, $X_i(s) = 0$, with $P(X_i(s) = 1) = \theta_i$. $X_i(1), X_i(2), \ldots$ are i.i.d. samples. $\theta_i$ is unknown, but has a prior distribution, perhaps uniform on $[0,1]$: $f(\theta_i) = 1$, $0 \le \theta_i \le 1$.

11 Bernoulli Bandits Having seen $s_i$ successes and $f_i$ failures, the posterior is
$$f(\theta_i \mid s_i, f_i) = \frac{(s_i+f_i+1)!}{s_i!\, f_i!}\, \theta_i^{s_i} (1-\theta_i)^{f_i}, \qquad 0 \le \theta_i \le 1,$$
with mean $(s_i+1)/(s_i+f_i+2)$. We wish to maximize the expected total discounted number of successes.
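As a quick check of these formulas, a minimal Python sketch (the function names are my own):

```python
from math import factorial

def posterior_mean(s, f):
    """Posterior mean of theta after s successes and f failures (uniform prior)."""
    return (s + 1) / (s + f + 2)

def posterior_density(theta, s, f):
    """Posterior density f(theta | s, f), i.e. Beta(s + 1, f + 1), on [0, 1]."""
    norm = factorial(s + f + 1) / (factorial(s) * factorial(f))
    return norm * theta**s * (1 - theta)**f

print(posterior_mean(2, 3))          # 3/7, about 0.4286
print(posterior_density(0.5, 2, 3))
```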

12 Multi-armed Bandit $N$ independent arms, with known states $x_1(t), \ldots, x_N(t)$. At each time $t \in \{0, 1, 2, \ldots\}$, one arm is to be activated (pulled/continued). If arm $i$ is activated then it changes state, $x \to y$ with probability $P_i(x, y)$, and produces reward $r_i(x_i(t))$. The other arms are to be passive (not pulled/frozen). Objective: maximize the expected total $\beta$-discounted reward
$$E\left[\sum_{t=0}^{\infty} r_{i_t}(x_{i_t}(t))\, \beta^t\right],$$
where $i_t$ is the arm pulled at time $t$ ($0 < \beta < 1$).

13 Dynamic Programming Solution The dynamic programming equation is
$$F(x_1, \ldots, x_N) = \max_i \left\{ r_i(x_i) + \beta \sum_y P_i(x_i, y)\, F(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_N) \right\}.$$
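This equation can be solved directly by value iteration on the joint state space, although that space grows as the product of the arms' state spaces. A minimal sketch for a toy two-arm instance (the transition matrices, rewards and iteration count are my own illustrative choices):

```python
import itertools
import numpy as np

# Toy instance: two arms, each with two states.
# P[i] is arm i's transition matrix when pulled; r[i] its reward vector.
P = [np.array([[0.7, 0.3], [0.4, 0.6]]),
     np.array([[0.5, 0.5], [0.2, 0.8]])]
r = [np.array([1.0, 0.0]), np.array([0.6, 0.3])]
beta = 0.9

states = list(itertools.product(range(2), range(2)))  # joint states (x1, x2)
F = {x: 0.0 for x in states}

# Iterate the dynamic programming equation to (approximate) convergence.
for _ in range(500):
    F_new = {}
    for x in states:
        best = -np.inf
        for i in range(2):             # candidate arm to pull
            xi = x[i]
            cont = 0.0
            for y in range(2):         # arm i moves to y; the other arm is frozen
                nxt = list(x)
                nxt[i] = y
                cont += P[i][xi, y] * F[tuple(nxt)]
            best = max(best, r[i][xi] + beta * cont)
        F_new[x] = best
    F = F_new

print({x: round(v, 3) for x, v in F.items()})
```

With many arms this direct approach quickly becomes impractical, which is what makes the index result below so useful.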

14 Single Machine Scheduling $N$ jobs are to be processed successively on one machine. Job $i$ has a known processing time $t_i$, a positive integer. On completion of job $i$ a reward $r_i$ is obtained. If job 1 is processed immediately before job 2 the sum of discounted rewards from the two jobs is $r_1 \beta^{t_1} + r_2 \beta^{t_1+t_2}$. Now
$$r_1 \beta^{t_1} + r_2 \beta^{t_1+t_2} > r_2 \beta^{t_2} + r_1 \beta^{t_2+t_1} \iff G_1 = \frac{(1-\beta)\, r_1 \beta^{t_1}}{1-\beta^{t_1}} > \frac{(1-\beta)\, r_2 \beta^{t_2}}{1-\beta^{t_2}} = G_2.$$
So total discounted reward is maximized by the index policy which processes jobs in decreasing order of the indices $G_i$.
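A small numerical check of this index policy on a toy instance (the rewards, processing times and function names are my own choices), comparing it against brute force over all orderings:

```python
from itertools import permutations

def index_G(r, t, beta):
    """Index of a job with reward r and processing time t."""
    return (1 - beta) * r * beta**t / (1 - beta**t)

def discounted_reward(order, r, t, beta):
    """Total discounted reward when the jobs are processed in the given order."""
    total, elapsed = 0.0, 0
    for i in order:
        elapsed += t[i]
        total += r[i] * beta**elapsed
    return total

r = [10, 7, 3, 8]   # rewards (illustrative)
t = [4, 2, 1, 5]    # integer processing times (illustrative)
beta = 0.9

index_order = sorted(range(len(r)), key=lambda i: -index_G(r[i], t[i], beta))
best = max(permutations(range(len(r))), key=lambda o: discounted_reward(o, r, t, beta))

print(index_order, round(discounted_reward(index_order, r, t, beta), 4))
print(list(best), round(discounted_reward(best, r, t, beta), 4))  # same value
```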

15 Gittins Index Solution Theorem [Gittins, 74, 79, 89] Reward is maximized by always continuing the bandit having greatest value of the dynamic allocation index
$$G_i(x_i) = \sup_{\tau \ge 1} \frac{E\left[\sum_{t=0}^{\tau-1} r_i(x_i(t))\, \beta^t \,\middle|\, x_i(0) = x_i\right]}{E\left[\sum_{t=0}^{\tau-1} \beta^t \,\middle|\, x_i(0) = x_i\right]},$$
where $\tau$ is a (past-measurable) stopping time. $G_i(x_i)$ is called the Gittins index.

16 Gittins Index
$$G_i(x_i) = \sup_{\tau \ge 1} \frac{E\left[\sum_{t=0}^{\tau-1} r_i(x_i(t))\, \beta^t \,\middle|\, x_i(0) = x_i\right]}{E\left[\sum_{t=0}^{\tau-1} \beta^t \,\middle|\, x_i(0) = x_i\right]} = \frac{\text{discounted reward up to } \tau}{\text{discounted time up to } \tau}.$$
In the scheduling problem $\tau = t_i$ and
$$G_i = \frac{r_i \beta^{t_i}}{1 + \beta + \cdots + \beta^{t_i - 1}} = \frac{(1-\beta)\, r_i \beta^{t_i}}{1-\beta^{t_i}}.$$

17 Calibration Alternatively,
$$G_i(x_i) = \sup\left\{ \lambda : \sum_{t=0}^{\infty} \beta^t \lambda \le \sup_{\tau \ge 1} E\left[\sum_{t=0}^{\tau-1} \beta^t r_i(x_i(t)) + \sum_{t=\tau}^{\infty} \beta^t \lambda \,\middle|\, x_i(0) = x_i\right]\right\}.$$
That is, we consider a problem with two bandit processes: bandit process $B_i$ and a calibrating bandit process, say $\Lambda$, which pays out a known reward $\lambda$ at each step it is continued. The Gittins index of $B_i$ is the value of $\lambda$ for which we are indifferent as to which of $B_i$ and $\Lambda$ to continue initially. Notice that once we decide, at time $\tau$, to switch from continuing $B_i$ to continuing $\Lambda$, then information about $B_i$ does not change, and so it must be optimal to stick with continuing $\Lambda$ ever after.

18 Gittins Indices for Bernoulli Bandits, $\beta = 0.9$ [The slide shows a table of index values indexed by $s$ and $f$.]
$(s_1, f_1) = (2, 3)$: posterior mean $= 3/7 \approx 0.4286$, index $=$ …
$(s_2, f_2) = (6, 7)$: posterior mean $= 7/15 \approx 0.4667$, index $=$ …
So we prefer to use drug 1 next, even though it has the smaller probability of success.
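The index values omitted above can be approximated numerically. Here is a minimal Python sketch, assuming the calibration formulation of slide 17 with a truncated horizon (the function name, horizon and tolerance are my own choices), which estimates the Gittins index of a Bernoulli bandit in posterior state $(s, f)$:

```python
import numpy as np

def gittins_index_bernoulli(s, f, beta=0.9, horizon=200, tol=1e-6):
    """Approximate Gittins index of a Bernoulli arm in posterior state (s, f),
    with a uniform prior, by bisection on the calibrating reward lambda."""

    def stopping_value(lam):
        # Optimal stopping value of playing the arm against a constant charge lam:
        #   Q(s,f) = max(0, m - lam + beta*[m*Q(s+1,f) + (1-m)*Q(s,f+1)]),
        # computed by backward induction, truncated `horizon` observations ahead.
        Q = np.zeros(horizon + 1)
        for d in range(horizon - 1, -1, -1):
            newQ = np.zeros(d + 1)
            for i in range(d + 1):
                si, fi = s + i, f + (d - i)
                m = (si + 1) / (si + fi + 2)   # posterior mean of theta
                cont = m - lam + beta * (m * Q[i + 1] + (1 - m) * Q[i])
                newQ[i] = max(0.0, cont)
            Q = newQ
        return Q[0]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stopping_value(mid) > 0:   # still profitable to play on: index exceeds mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for state in [(2, 3), (6, 7)]:
        print(state, round(gittins_index_bernoulli(*state), 4))
```

Because of the truncation the values are slight underestimates, but they should confirm that the $(2, 3)$ arm has the larger index despite its smaller posterior mean.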

19 Gittins Index Theorem is Surprising Peter Whittle tells the story: A colleague of high repute asked an equally well-known colleague: "What would you say if you were told that the multi-armed bandit problem had been solved?" "Sir, the multi-armed bandit problem is not of such a nature that it can be solved."

20 Proofs of the Index Theorem Since Gittins (1974, 1979), many researchers have reproved, remodelled and resituated the index theorem: Beale (1979); Karatzas (1984); Varaiya, Walrand, Buyukkoc (1985); Chen, Katehakis (1986); Kallenberg (1986); Katehakis, Veinott (1986); Eplett (1986); Kertz (1986); Tsitsiklis (1986); Mandelbaum (1986, 1987); Lai, Ying (1988); Whittle (1988); Weber (1992); El Karoui, Karatzas (1993); Ishikida and Varaiya (1994); Tsitsiklis (1994); Bertsimas, Niño-Mora (1996); Glazebrook, Garbe (1996); Kaspi, Mandelbaum (1998); Bäuerle, Stidham (2001); Dumitriu, Tetali, Winkler (2003).

21 Proof of the Index Theorem Start with a problem in which only bandit process $B_i$ is available. Define the fair charge, $\gamma_i(x_i)$, as the maximum amount that a gambler would be willing to pay per step to be permitted to continue $B_i$ for at least one more step, with subsequent freedom to stop continuing it whenever he likes thereafter. This is
$$\gamma_i(x_i) = \sup\left\{ \lambda : 0 \le \sup_{\tau \ge 1} E\left[\sum_{t=0}^{\tau-1} \beta^t \big(r_i(x_i(t)) - \lambda\big) \,\middle|\, x_i(0) = x_i\right]\right\}.$$
It is easy to check that $\gamma_i(x_i) = G_i(x_i)$. The stopping time $\tau$ will be the first time that $G_i(x_i(\tau)) < G_i(x_i(0))$, i.e. the first time that the fair charge becomes too expensive. The gambler would rather stop than pay this charge.

22 Prevailing charges Suppose that when $G_i(x_i(\tau)) < G_i(x_i(0))$, we reduce the charge to a level such that it remains just profitable for the gambler to keep on playing. This is called the prevailing charge, say $g_i(t) = \min_{s \le t} G_i(x_i(s))$. $g_i(t)$ is a nonincreasing function of $t$ and its value depends only on the states through which bandit $i$ evolves. Observation 1. Suppose that in the problem with $n$ alternative bandit processes, $B_1, \ldots, B_n$, the gambler not only collects $r_{i_t}(x_{i_t}(t))$, but also pays the prevailing charge $g_{i_t}(x_{i_t}(t))$ of the bandit $B_{i_t}$ that he chooses to continue at time $t$. Then he cannot do better than just break even (i.e. expected profit $\le 0$). This is because he could only make a strictly positive profit (in expected value) if this were to happen for at least one bandit. Yet the prevailing charge has been defined so that if he pays the prevailing charges he can only just break even.

23 Observation 2. He maximizes the expected discounted sum of the prevailing charges that he pays by always continuing the bandit with the greatest prevailing charge. This is because he thereby interleaves the $n$ nonincreasing sequences of prevailing charges $g_i$ into one nonincreasing sequence of prevailing charges. This way of interleaving them maximizes their discounted sum. For example, prevailing charges of
$g_1$: 10, 10, 9, 5, 5, 3, ...
$g_2$: 20, 15, 7, 4, 2, 2, ...
are best interleaved (so as to maximize the discounted charge paid) as
20, 15, 10, 10, 9, 7, 5, 5, 4, 3, 2, 2, ...
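A quick numerical check of this observation, using the two charge sequences above (a small sketch; the helper names and β are my own choices):

```python
from itertools import combinations

beta = 0.9
g1 = [10, 10, 9, 5, 5, 3]
g2 = [20, 15, 7, 4, 2, 2]

def discounted(seq, beta):
    """Discounted sum of a sequence of charges paid at times 0, 1, 2, ..."""
    return sum(c * beta**t for t, c in enumerate(seq))

def interleave(slots_for_g1):
    """Interleave g1 and g2, playing bandit 1 at the time slots in slots_for_g1."""
    i = j = 0
    out = []
    for t in range(len(g1) + len(g2)):
        if t in slots_for_g1:
            out.append(g1[i]); i += 1
        else:
            out.append(g2[j]); j += 1
    return out

# Greedy rule: always take the larger remaining prevailing charge.  Since both
# sequences are nonincreasing, this is the merged sequence in decreasing order.
greedy = sorted(g1 + g2, reverse=True)
print(greedy, round(discounted(greedy, beta), 3))

# Compare with every other way of interleaving the two sequences.
n = len(g1) + len(g2)
best = max(discounted(interleave(set(s)), beta) for s in combinations(range(n), len(g1)))
print(round(best, 3))  # equals the greedy value: no interleaving does better
```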

24 Observation 3. Using this strategy he also just breaks even (because once he starts continuing $B_i$ he does so until its prevailing charge decreases). So this strategy (of always continuing the bandit with the greatest $G_i(x_i)$) must maximize the expected discounted sum of the rewards that he can obtain from the bandits. I.e., for any policy, the definition of prevailing charges implies
$$0 \ge E\left[\sum_{t=0}^{\infty} \beta^t \big( r_{i_t}(x_{i_t}(t)) - g_{i_t}(x_{i_t}(t)) \big) \,\middle|\, x(0)\right]$$
and so
$$E\left[\sum_{t=0}^{\infty} \beta^t r_{i_t}(x_{i_t}(t)) \,\middle|\, x(0)\right] \le E\left[\sum_{t=0}^{\infty} \beta^t g_{i_t}(x_{i_t}(t)) \,\middle|\, x(0)\right].$$
But the right hand side is maximized by always continuing the bandit with greatest $g_i(x_i(t))$. This policy also achieves equality in the above, and so it must maximize the expected discounted sum of rewards (the left hand side).

25 Restless Bandits Spinning Plates

26 Restless Bandits [Whittle 88] Two actions are available: active ($a = 1$) or passive ($a = 0$). Rewards, $r(x, a)$, and transitions, $P(y \mid x, a)$, depend on the state and the action taken. Objective: maximize the time-average reward from $n$ restless bandits under a constraint that only $m$ ($m < n$) of them receive the active action simultaneously. Examples of active vs. passive actions: work (increasing fatigue) vs. rest (recovery); high speed vs. low speed; inspection vs. no inspection.

27 Questions
