Multi-armed Bandits and the Gittins Index
1 Multi-armed Bandits and the Gittins Index
Richard Weber, Statistical Laboratory, University of Cambridge
A talk to accompany Lecture 6
2–5 Two-armed Bandit
Arm 1: 3, 10, 4, 9, 12, 1, ...
Arm 2: 5, 6, 2, 15, 2, 7, ...
(Successive slides animate a sequence of pulls: arm 2 twice, then arm 1 five times, then arm 2 twice, collecting the rewards 5, 6, 3, 10, 4, 9, 12, 2, 15 in that order.)
Reward = 5 + 6β + 3β^2 + 10β^3 + ⋯, where 0 < β < 1.
Of course, in practice we must choose which arms to pull without knowing the future sequences of rewards.
6 Dynamic Effort Allocation
Research projects: how should I allocate my research time amongst my favorite open problems so as to maximize the value of my completed research?
Job scheduling: in what order should I work on the tasks in my in-tray?
Searching for information: shall I spend more time browsing the web, or go to the library, or ask a friend?
Dating strategy: should I contact a new prospect, or try another date with someone I have dated before?
7 Information vs. Immediate Payoff
In all these problems one wishes to learn about the effectiveness of alternative strategies, while simultaneously wishing to use the best strategy in the short term.
"Bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff." (Peter Whittle, 1989)
This is the trade-off of exploration versus exploitation.
8 Multi-armed Bandit Allocation Indices, 2nd Edition, 11 March 2011. Gittins, Glazebrook and Weber.
9 Clinical Trials
10 Bernoulli Bandits
One of N drugs is to be administered at each of times t = 0, 1, .... The s-th time drug i is used it is successful, X_i(s) = 1, or unsuccessful, X_i(s) = 0, with P(X_i(s) = 1) = θ_i. X_i(1), X_i(2), ... are i.i.d. samples. θ_i is unknown, but has a prior distribution, perhaps uniform on [0, 1]: f(θ_i) = 1, 0 ≤ θ_i ≤ 1.
11 Bernoulli Bandits
Having seen s_i successes and f_i failures, the posterior is
f(θ_i | s_i, f_i) = ((s_i + f_i + 1)! / (s_i! f_i!)) θ_i^{s_i} (1 − θ_i)^{f_i}, 0 ≤ θ_i ≤ 1,
with mean (s_i + 1)/(s_i + f_i + 2). We wish to maximize the expected total discounted number of successes.
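As a quick sanity check of the posterior mean formula, here is a tiny Python illustration (mine, not from the talk):

```python
def posterior_mean(s, f):
    # Mean of the Beta(s + 1, f + 1) posterior after s successes and f failures,
    # starting from a uniform prior on theta.
    return (s + 1) / (s + f + 2)

print(posterior_mean(2, 3))  # 3/7 = 0.42857...
```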
12 Multi-armed Bandit
N independent arms, with known states x_1(t), ..., x_N(t). At each time t ∈ {0, 1, 2, ...}, one arm is to be activated (pulled/continued). If arm i is activated then it changes state, x → y with probability P_i(x, y), and produces reward r_i(x_i(t)). The other arms are passive (not pulled/frozen).
Objective: maximize the expected total β-discounted reward
E[ Σ_{t=0}^∞ r_{i_t}(x_{i_t}(t)) β^t ],
where i_t is the arm pulled at time t, and 0 < β < 1.
13 Dynamic Programming Solution
The dynamic programming equation is
F(x_1, ..., x_N) = max_i { r_i(x_i) + β Σ_y P_i(x_i, y) F(x_1, ..., x_{i−1}, y, x_{i+1}, ..., x_N) }.
Note that the state of this equation is the vector of all arms' states, so the state space is the product of the arms' state spaces and grows exponentially in N.
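To make the equation concrete, here is a minimal sketch (my own, not from the talk) that solves it by value iteration, with two made-up two-state arms. The point to notice is the product state space, which is why this direct approach does not scale and why the index result on the next slides is remarkable.

```python
import numpy as np
from itertools import product

beta = 0.9
# Two made-up arms, each with 2 states: (reward vector r_i, transition matrix P_i)
arms = [
    (np.array([1.0, 0.2]), np.array([[0.3, 0.7], [0.6, 0.4]])),
    (np.array([0.5, 0.9]), np.array([[0.8, 0.2], [0.5, 0.5]])),
]

# The DP state is the vector of all arms' states, i.e. the product state space.
states = list(product(range(2), repeat=len(arms)))
F = {x: 0.0 for x in states}
for _ in range(500):  # value iteration: F converges geometrically at rate beta
    F = {
        x: max(
            r[x[i]] + beta * sum(P[x[i], y] * F[x[:i] + (y,) + x[i + 1:]]
                                 for y in range(2))
            for i, (r, P) in enumerate(arms)
        )
        for x in states
    }
print({x: round(v, 3) for x, v in F.items()})
```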
14 Single Machine Scheduling
N jobs are to be processed successively on one machine. Job i has a known processing time t_i, a positive integer. On completion of job i a reward r_i is obtained. If job 1 is processed immediately before job 2, the sum of discounted rewards from the two jobs is r_1 β^{t_1} + r_2 β^{t_1 + t_2}. Job 1 should come first if and only if
r_1 β^{t_1} + r_2 β^{t_1 + t_2} > r_2 β^{t_2} + r_1 β^{t_2 + t_1},
which rearranges to
G_1 = (1 − β) r_1 β^{t_1} / (1 − β^{t_1}) > (1 − β) r_2 β^{t_2} / (1 − β^{t_2}) = G_2.
So total discounted reward is maximized by the index policy which processes jobs in decreasing order of the indices G_i.
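As a check of this index rule, here is a minimal Python sketch (mine, with made-up jobs) that computes the indices and verifies against brute-force enumeration of all orders that processing in decreasing index order maximizes total discounted reward:

```python
from itertools import permutations

beta = 0.9
jobs = [(10, 3), (7, 1), (12, 5)]  # made-up (reward r_i, processing time t_i)

def gittins(r, t):
    # G_i = (1 - beta) r_i beta^{t_i} / (1 - beta^{t_i})
    return (1 - beta) * r * beta**t / (1 - beta**t)

def discounted_reward(order):
    # Reward r_i arrives on completion, so it is discounted by beta^(elapsed time).
    total, elapsed = 0.0, 0
    for r, t in order:
        elapsed += t
        total += r * beta**elapsed
    return total

by_index = sorted(jobs, key=lambda j: gittins(*j), reverse=True)
best = max(permutations(jobs), key=discounted_reward)
print(by_index, discounted_reward(by_index))
assert list(best) == by_index  # index order agrees with brute force
```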
15 Gittins Index Solution
Theorem [Gittins, '74, '79, '89]. Reward is maximized by always continuing the bandit having the greatest value of the 'dynamic allocation index'
G_i(x_i) = sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ],
where τ is a (past-measurable) stopping time. G_i(x_i) is called the Gittins index.
16 Gittins Index
In
G_i(x_i) = sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ],
the numerator is the expected discounted reward up to τ, and the denominator is the expected discounted time up to τ.
In the scheduling problem the supremum is attained at τ = t_i, and
G_i = r_i β^{t_i} / (1 + β + ⋯ + β^{t_i − 1}) = (1 − β) r_i β^{t_i} / (1 − β^{t_i}).
17 Calibration
Alternatively,
G_i(x_i) = sup{ λ : Σ_{t=0}^∞ β^t λ ≤ sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} β^t r_i(x_i(t)) + Σ_{t=τ}^∞ β^t λ | x_i(0) = x_i ] }.
That is, we consider a problem with two bandit processes: bandit process B_i and a calibrating bandit process, say Λ, which pays out a known reward λ at each step it is continued. The Gittins index of B_i is the value of λ for which we are indifferent as to which of B_i and Λ to continue initially. Notice that once we decide, at time τ, to switch from continuing B_i to continuing Λ, the information about B_i does not change, and so it must be optimal to stick with continuing Λ ever after.
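The calibration idea translates directly into a numerical method. Below is a minimal sketch (my own illustration, not code from the talk) that computes G_i(x) for a small finite-state Markov bandit by bisection on λ: for each trial λ it solves the optimal stopping problem "continue B_i or retire to Λ forever" by value iteration, then narrows in on the λ at which continuing and retiring are equally attractive. The rewards and transition matrix are made-up example data.

```python
import numpy as np

beta = 0.9
# Made-up example bandit: 3 states, reward vector r and transition matrix P.
r = np.array([1.0, 2.0, 0.5])
P = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

def continue_value(lam, n_iter=1000):
    # Value of the game "continue the bandit, or retire and collect lam forever",
    # solved by value iteration. Retiring is worth lam / (1 - beta).
    retire = lam / (1 - beta)
    V = np.zeros(len(r))
    for _ in range(n_iter):
        V = np.maximum(retire, r + beta * P @ V)
    return V

def gittins_index(state, tol=1e-6):
    # Bisect on lam; the index lies between the smallest and largest one-step rewards.
    lo, hi = r.min(), r.max()
    while hi - lo > tol:
        lam = (lo + hi) / 2
        if continue_value(lam)[state] > lam / (1 - beta) + 1e-12:
            lo = lam   # continuing strictly beats retiring: the charge is too low
        else:
            hi = lam   # indifferent or retiring: the charge is too high
    return (lo + hi) / 2

print([round(gittins_index(x), 4) for x in range(3)])
```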
18 Gittins Indices for Bernoulli Bandits, β = 0.9
[Slide shows a table of Gittins indices indexed by s and f; the values were not preserved in this transcription.]
(s_1, f_1) = (2, 3): posterior mean = 3/7 ≈ 0.4286, index = ….
(s_2, f_2) = (6, 7): posterior mean = 7/15 ≈ 0.4667, index = ….
So we prefer to use drug 1 next, even though it has the smaller posterior probability of success: its index is higher because less is known about it.
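Indices like these can be computed numerically with the calibration idea from slide 17. Here is a minimal sketch (mine, not necessarily the method behind the slide's figures): for each trial charge λ, solve the "continue or retire" problem by backward induction over the lattice of posterior states (s, f), truncated at a depth T where β^T is negligible, and bisect on λ. It should reproduce the slide's qualitative conclusion that the (2, 3) arm has the higher index despite its lower posterior mean.

```python
import numpy as np

beta, T = 0.9, 200  # truncate the (s, f) lattice at depth T; beta^200 is negligible

def value(lam):
    # V[s, f]: value of "continue the Bernoulli arm or retire on lam forever",
    # computed by backward induction over total observations n = s + f.
    retire = lam / (1 - beta)
    V = np.full((T + 2, T + 2), retire)  # beyond depth T, approximate by retiring
    for n in range(T, -1, -1):
        for s in range(n + 1):
            f = n - s
            p = (s + 1) / (n + 2)  # posterior mean of theta under a uniform prior
            cont = p * (1 + beta * V[s + 1, f]) + (1 - p) * beta * V[s, f + 1]
            V[s, f] = max(retire, cont)
    return V

def bernoulli_index(s, f, tol=1e-5):
    lo, hi = 0.0, 1.0  # success probabilities, hence indices, lie in [0, 1]
    while hi - lo > tol:
        lam = (lo + hi) / 2
        if value(lam)[s, f] > lam / (1 - beta) + 1e-12:
            lo = lam
        else:
            hi = lam
    return (lo + hi) / 2

for s, f in [(2, 3), (6, 7)]:
    print((s, f), "mean", round((s + 1) / (s + f + 2), 4),
          "index", round(bernoulli_index(s, f), 4))
```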
19 Gittins Index Theorem is Surprising
Peter Whittle tells the story: A colleague of high repute asked an equally well-known colleague: "What would you say if you were told that the multi-armed bandit problem had been solved?" The reply: "Sir, the multi-armed bandit problem is not of such a nature that it can be solved."
20 Proofs of the Index Theorem
Since Gittins (1974, 1979), many researchers have reproved, remodelled and resituated the index theorem: Beale (1979); Karatzas (1984); Varaiya, Walrand, Buyukkoc (1985); Chen, Katehakis (1986); Kallenberg (1986); Katehakis, Veinott (1986); Eplett (1986); Kertz (1986); Tsitsiklis (1986); Mandelbaum (1986, 1987); Lai, Ying (1988); Whittle (1988); Weber (1992); El Karoui, Karatzas (1993); Ishikida and Varaiya (1994); Tsitsiklis (1994); Bertsimas, Niño-Mora (1996); Glazebrook, Garbe (1996); Kaspi, Mandelbaum (1998); Bäuerle, Stidham (2001); Dumitriu, Tetali, Winkler (2003).
21 Proof of the Index Theorem
Start with a problem in which only bandit process B_i is available. Define the 'fair charge', γ_i(x_i), as the maximum amount that a gambler would be willing to pay per step to be permitted to continue B_i for at least one more step, and with subsequent freedom to stop continuing it whenever he likes thereafter. This is
γ_i(x_i) = sup{ λ : 0 ≤ sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} β^t ( r_i(x_i(t)) − λ ) | x_i(0) = x_i ] }.
It is easy to check that γ_i(x_i) = G_i(x_i). The optimal stopping time τ is the first time that G_i(x_i(τ)) < G_i(x_i(0)), i.e. the first time that the fair charge becomes too expensive. The gambler would rather stop than pay this charge.
22 Prevailing charges
Suppose that when G_i(x_i(τ)) < G_i(x_i(0)) we reduce the charge to a level such that it remains just profitable for the gambler to keep on playing. This is called the 'prevailing charge', say g_i(t) = min_{s ≤ t} G_i(x_i(s)). Note that g_i(t) is a nonincreasing function of t and its value depends only on the states through which bandit i evolves.
Observation 1. Suppose that in the problem with n alternative bandit processes, B_1, ..., B_n, the gambler not only collects r_{i_t}(x_{i_t}(t)), but also pays the prevailing charge g_{i_t}(t) of the bandit B_{i_t} that he chooses to continue at time t. Then he cannot do better than just break even (i.e. expected profit ≤ 0). This is because an overall strictly positive expected profit would require a strictly positive expected profit from at least one of the bandits; yet the prevailing charge has been defined so that, playing any single bandit and paying its prevailing charges, he can only just break even.
23 Observation 2. He maximizes the expected discounted sum of the prevailing charges that he pays by always continuing the bandit with the greatest prevailing charge. This is because he thereby interleaves the n nonincreasing sequences of prevailing charges g_i into one nonincreasing sequence, and this way of interleaving them maximizes their discounted sum. For example, prevailing charges of
g_1: 10, 10, 9, 5, 5, 3, ...
g_2: 20, 15, 7, 4, 2, 2, ...
are best interleaved (so as to maximize the discounted charge paid) as
20, 15, 10, 10, 9, 7, 5, 5, 4, 3, 2, 2, ...
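Observation 2 is easy to check numerically. The following sketch (mine) enumerates every way of interleaving the slide's two charge sequences and confirms that none beats the greedy merge that always pays the largest remaining charge:

```python
from itertools import combinations

beta = 0.9
g1 = [10, 10, 9, 5, 5, 3]
g2 = [20, 15, 7, 4, 2, 2]

def discounted(seq):
    return sum(c * beta**t for t, c in enumerate(seq))

def interleaving(slots):
    # slots: the set of time slots at which g1's charges are paid, in order;
    # g2's charges fill the remaining slots, also in order.
    out, i, j = [], iter(g1), iter(g2)
    for t in range(len(g1) + len(g2)):
        out.append(next(i) if t in slots else next(j))
    return out

best = max((interleaving(set(c)) for c in combinations(range(12), 6)),
           key=discounted)
greedy = sorted(g1 + g2, reverse=True)  # always pay the largest remaining charge
print(greedy, round(discounted(greedy), 3))
assert abs(discounted(greedy) - discounted(best)) < 1e-9
```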
24 Observation 3. Using this strategy he also just breaks even (because once he starts continuing B_i he does so until its prevailing charge decreases). So this strategy (of always continuing the bandit with the greatest G_i(x_i)) must maximize the expected discounted sum of the rewards that he can obtain from the bandits. I.e., for any policy, the definition of prevailing charges implies
0 ≥ E[ Σ_{t=0}^∞ β^t ( r_{i_t}(x_{i_t}(t)) − g_{i_t}(t) ) | x(0) ]
and so
E[ Σ_{t=0}^∞ β^t r_{i_t}(x_{i_t}(t)) | x(0) ] ≤ E[ Σ_{t=0}^∞ β^t g_{i_t}(t) | x(0) ].
But the right-hand side is maximized by always continuing the bandit with the greatest prevailing charge. This policy also achieves equality in the above, and so it must maximize the expected discounted sum of rewards (the left-hand side).
25 Restless Bandits Spinning Plates
26 Restless Bandits [Whittle 88]
Two actions are available: active (a = 1) or passive (a = 0). Rewards, r(x, a), and transitions, P(y | x, a), depend on the state and the action taken.
Objective: maximize the time-average reward from n restless bandits under a constraint that only m (m < n) of them receive the active action simultaneously.
Examples of active/passive action pairs:
active (a = 1): work (increasing fatigue); high speed; inspection
passive (a = 0): rest (recovery); low speed; no inspection
27 Questions