Multi-Armed Bandits. Credit: David Silver. Google DeepMind. Presenter: Tianlu Wang

Transcription:

Multi-Armed Bandits (Slide 1 / 27)
Credit: David Silver, Google DeepMind. Presenter: Tianlu Wang.

Outline (Slide 2 / 27)
1 Introduction
  - Exploration vs. Exploitation Dilemma
  - How to do Exploration
2 Multi-Armed Bandits
  - The Multi-Armed Bandit
  - Regret
  - Greedy and ε-greedy algorithms
  - Lower Bound
  - Upper Confidence Bound

Exploration vs. Exploitation Dilemma (Slide 4 / 27)
Online decision-making involves a fundamental choice:
- Exploitation: make the best decision given current information
- Exploration: gather more information
The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.
Examples:
- Online banner advertisements: exploitation shows the most successful advert; exploration shows a different advert
- Game playing: exploitation plays the move you believe is best; exploration plays an experimental move

How to do Exploration (Slide 6 / 27)
Random exploration:
- ε-greedy: pull a randomly chosen arm a fraction ε of the time; the other 1 - ε of the time, pull the arm estimated to be the most profitable (i.e. devote a fraction ε of resources to testing)

How to do Exploration (Slide 7 / 27)
Optimism in the face of uncertainty:
- Estimate the uncertainty on each value
- Prefer to explore states/actions with the highest uncertainty

How to do Exploration (Slide 8 / 27)
Information state space (most correct, but computationally difficult):
- Consider the agent's information as part of its state
- Look ahead to see how information helps reward

The Multi-Armed Bandit (Slide 10 / 27)
- A multi-armed bandit is a tuple $\langle \mathcal{A}, \mathcal{R} \rangle$
- $\mathcal{A}$ is a known set of $m$ actions (or "arms")
- $\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards
- At each step $t$ the agent selects an action $a_t \in \mathcal{A}$
- The environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
- The goal is to maximise the cumulative reward $\sum_{\tau=1}^{t} r_\tau$
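
A minimal sketch of such an environment, assuming Bernoulli reward distributions; the class and method names are illustrative, not from the slides:

    import numpy as np

    class BernoulliBandit:
        """Multi-armed bandit sketch: each arm a has an unknown Bernoulli
        reward distribution R^a with mean probs[a]."""
        def __init__(self, probs, seed=0):
            self.probs = np.asarray(probs, dtype=float)   # true (hidden) arm means
            self.rng = np.random.default_rng(seed)

        @property
        def n_arms(self):
            return len(self.probs)

        def pull(self, a):
            # The environment generates a reward r_t ~ R^{a_t}
            return float(self.rng.random() < self.probs[a])

    # Example: 3 arms; the agent does not get to see these probabilities.
    bandit = BernoulliBandit([0.3, 0.5, 0.7])
    r = bandit.pull(2)   # sample one reward from arm 2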

Regret (Slide 12 / 27)
- The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$
- The optimal value is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$
- The regret is the opportunity loss for one step: $I_t = \mathbb{E}[V^* - Q(a_t)]$
- The total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \big(V^* - Q(a_\tau)\big)\right]$
- Maximising cumulative reward ≡ minimising total regret

Counting Regret (Slide 13 / 27)
- The count $N_t(a)$ is the expected number of selections of action $a$
- The gap $\Delta_a$ is the difference in value between action $a$ and the optimal action $a^*$: $\Delta_a = V^* - Q(a)$
- Regret is a function of the gaps and the counts:
  $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \big(V^* - Q(a_\tau)\big)\right] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)] \big(V^* - Q(a)\big) = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)] \, \Delta_a \quad (1)$
- A good algorithm ensures small counts for large gaps
- Problem: the gaps are not known.
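
A small numeric illustration of decomposition (1), using made-up action-values and counts rather than anything from the slides:

    # Hypothetical true action-values Q(a) and selection counts N_t(a) after t = 1000 steps.
    Q = {"a1": 0.75, "a2": 0.50, "a3": 0.25}
    N = {"a1": 900,  "a2": 60,   "a3": 40}

    V_star = max(Q.values())
    gaps = {a: V_star - q for a, q in Q.items()}       # Delta_a = V* - Q(a)
    total_regret = sum(N[a] * gaps[a] for a in Q)      # L_t = sum_a N_t(a) * Delta_a
    print(gaps)           # {'a1': 0.0, 'a2': 0.25, 'a3': 0.5}
    print(total_regret)   # 60*0.25 + 40*0.5 = 35.0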

Linear or Sublinear Regret (Slide 14 / 27)
- If an algorithm explores forever, it will have linear total regret
- If an algorithm never explores, it will have linear total regret
- Is it possible to achieve sublinear total regret?

Greedy Algorithm (Slide 16 / 27)
- We consider algorithms that estimate $\hat{Q}_t(a) \approx Q(a)$
- Estimate the value of each action by Monte-Carlo evaluation:
  $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \, \mathbf{1}(a_\tau = a)$
- The greedy algorithm selects the action with the highest value:
  $a_t = \operatorname{argmax}_{a \in \mathcal{A}} \hat{Q}_t(a)$
- Greedy can lock onto a suboptimal action forever
- Greedy has linear total regret
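
A minimal greedy-agent sketch, assuming rewards come from a `pull(a)` callable such as the bandit sketch above; the running-mean update is simply an incremental form of the Monte-Carlo estimate:

    import numpy as np

    def greedy_bandit(pull, n_arms, T, seed=0):
        """Pure greedy: always pick the arm with the highest empirical mean."""
        rng = np.random.default_rng(seed)
        N = np.zeros(n_arms)          # N_t(a): pull counts
        Q = np.zeros(n_arms)          # Q_hat_t(a): empirical mean rewards
        for _ in range(T):
            a = int(rng.choice(np.flatnonzero(Q == Q.max())))  # greedy, random tie-break
            r = pull(a)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]  # running mean = Monte-Carlo estimate above
        return Q, N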

ε-greedy Algorithm (Slide 17 / 27)
- The ε-greedy algorithm continues to explore forever
- With probability $1 - \epsilon$ select $a = \operatorname{argmax}_{a \in \mathcal{A}} \hat{Q}(a)$
- With probability $\epsilon$ select a random action
- A constant $\epsilon$ ensures a minimum per-step regret:
  $I_t \geq \frac{\epsilon}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Delta_a$
- ε-greedy has linear total regret
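
Relative to the greedy sketch above, only the action-selection step changes; a sketch of that step (the function name is illustrative, not from the slides):

    import numpy as np

    def eps_greedy_action(Q, eps, rng):
        """With probability eps pick a random arm, otherwise the greedy arm."""
        if rng.random() < eps:
            return int(rng.integers(len(Q)))                       # explore
        return int(rng.choice(np.flatnonzero(Q == Q.max())))       # exploit, random tie-break

    rng = np.random.default_rng(0)
    a = eps_greedy_action(np.array([0.2, 0.5, 0.4]), eps=0.1, rng=rng)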

Optimistic Initialisation (Slide 18 / 27)
- Initialise $Q(a)$ to a high value
- Update the action value by incremental Monte-Carlo evaluation:
  $\hat{Q}_t(a_t) = \hat{Q}_{t-1}(a_t) + \frac{1}{N_t(a_t)} \big(r_t - \hat{Q}_{t-1}(a_t)\big)$
- Encourages systematic exploration early on
- But can still lock onto a suboptimal action
- Greedy (or ε-greedy) + optimistic initialisation still has linear total regret
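
A sketch of optimistic initialisation with the incremental update. It assumes the optimistic value is counted as one pseudo-pull so the first real reward does not erase it; that convention is my assumption, not stated on the slide:

    import numpy as np

    def optimistic_init(n_arms, q0=1.0):
        """Start every estimate at an optimistic value q0 (>= the largest plausible reward)."""
        Q = np.full(n_arms, q0)
        N = np.ones(n_arms)   # count the optimistic prior as one pseudo-pull (assumption)
        return Q, N

    def incremental_update(Q, N, a, r):
        """Q_t(a_t) = Q_{t-1}(a_t) + (r_t - Q_{t-1}(a_t)) / N_t(a_t)."""
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
        return Q, N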

Decaying ε_t-Greedy Algorithm (Slide 19 / 27)
- Pick a decay schedule for $\epsilon_1, \epsilon_2, \ldots$
- Consider the following schedule:
  $c > 0, \quad d = \min_{a : \Delta_a > 0} \Delta_a, \quad \epsilon_t = \min\left\{1, \frac{c\,|\mathcal{A}|}{d^2 t}\right\}$
- This gives logarithmic asymptotic total regret
- But it requires advance knowledge of the gaps
- Goal: find an algorithm with sublinear regret for any multi-armed bandit (without knowledge of $\mathcal{R}$)
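
A sketch of this schedule, assuming c = 1 and hypothetical gap values; as the slide notes, the gaps would not actually be known in practice:

    def decaying_eps_schedule(gaps, n_arms, c=1.0):
        """Return eps(t) = min(1, c*|A| / (d^2 * t)), where d is the smallest positive gap."""
        d = min(g for g in gaps if g > 0)
        def eps(t):                      # t = 1, 2, ...
            return min(1.0, c * n_arms / (d ** 2 * t))
        return eps

    eps = decaying_eps_schedule(gaps=[0.0, 0.2, 0.4], n_arms=3)
    print([round(eps(t), 3) for t in (1, 10, 100, 1000)])  # stays at 1 early, then decays like 1/t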

Lower Bound (Slide 21 / 27)
- The performance of any algorithm is determined by the similarity between the optimal arm and the other arms
- Hard problems have similar-looking arms with different means
- This is described formally by the gap $\Delta_a$ and the similarity in distributions $KL(\mathcal{R}^a \,\|\, \mathcal{R}^{a^*})$
- Theorem (Lai and Robbins): asymptotic total regret is at least logarithmic in the number of steps:
  $\lim_{t \to \infty} L_t \geq \log t \sum_{a : \Delta_a > 0} \frac{\Delta_a}{KL(\mathcal{R}^a \,\|\, \mathcal{R}^{a^*})}$
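
For intuition only: a sketch that evaluates the Lai-Robbins constant for hypothetical Bernoulli arms, where the KL divergence between arm distributions has a simple closed form:

    import numpy as np

    def bernoulli_kl(p, q):
        """KL(Bernoulli(p) || Bernoulli(q)) in nats."""
        p, q = np.clip(p, 1e-12, 1 - 1e-12), np.clip(q, 1e-12, 1 - 1e-12)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    mu = np.array([0.3, 0.5, 0.7])        # hypothetical Bernoulli arm means
    mu_star = mu.max()
    lower_bound_constant = sum((mu_star - m) / bernoulli_kl(m, mu_star)
                               for m in mu if m < mu_star)
    print(lower_bound_constant)           # regret must grow at least ~ this * log t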

Upper Confidence Bound (Slide 23 / 27)
Which action should we pick?
- The more uncertain we are about an action-value, the more important it is to explore that action
- It could turn out to be the best action

Upper Confidence Bound (Slide 24 / 27)
- Estimate an upper confidence $\hat{U}_t(a)$ for each action value
- Such that $Q(a) \leq \hat{Q}_t(a) + \hat{U}_t(a)$ with high probability
- This depends on the number of times $N_t(a)$ that action $a$ has been selected:
  - Small $N_t(a)$ ⇒ large $\hat{U}_t(a)$ (estimated value is uncertain)
  - Large $N_t(a)$ ⇒ small $\hat{U}_t(a)$ (estimated value is accurate)
- Select the action maximising the Upper Confidence Bound (UCB):
  $a_t = \operatorname{argmax}_{a \in \mathcal{A}} \left( \hat{Q}_t(a) + \hat{U}_t(a) \right)$

Upper Confidence Bound (Slide 25 / 27)
Theorem (Hoeffding's Inequality): Let $X_1, \ldots, X_t$ be i.i.d. random variables in $[0,1]$, and let $\bar{X}_t = \frac{1}{t} \sum_{\tau=1}^{t} X_\tau$ be the sample mean. Then
  $\mathbb{P}\big[\mathbb{E}[X] > \bar{X}_t + u\big] \leq e^{-2tu^2}$
We apply Hoeffding's Inequality to the rewards of the bandit, conditioned on selecting action $a$:
  $\mathbb{P}\big[Q(a) > \hat{Q}_t(a) + U_t(a)\big] \leq e^{-2 N_t(a) U_t(a)^2}$

Calculating Upper Confidence Bounds (Slide 26 / 27)
- Pick a probability $p$ that the true value exceeds the UCB
- Now solve for $U_t(a)$:
  $e^{-2 N_t(a) U_t(a)^2} = p \quad \Rightarrow \quad U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$
- Reduce $p$ as we observe more rewards, e.g. $p = t^{-4}$
- This ensures we select the optimal action as $t \to \infty$
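
Substituting the schedule $p = t^{-4}$ into the expression above gives the exploration bonus used by UCB1 on the next slide:

  $U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}} = \sqrt{\frac{-\log t^{-4}}{2 N_t(a)}} = \sqrt{\frac{4 \log t}{2 N_t(a)}} = \sqrt{\frac{2 \log t}{N_t(a)}}$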

UCB1 (Slide 27 / 27)
- This leads to the UCB1 algorithm:
  $a_t = \operatorname{argmax}_{a \in \mathcal{A}} \left( \hat{Q}_t(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right)$
- Theorem: the UCB algorithm achieves logarithmic asymptotic total regret:
  $\lim_{t \to \infty} L_t \leq 8 \log t \sum_{a : \Delta_a > 0} \Delta_a$
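
A self-contained sketch of UCB1, assuming rewards in [0, 1] (the range the Hoeffding-based bound requires) and a hypothetical `pull(a)` reward function; a minimal illustration, not the lecture's reference implementation:

    import numpy as np

    def ucb1(pull, n_arms, T):
        """UCB1 sketch: play each arm once, then pick
        argmax_a Q_hat(a) + sqrt(2 * log(t) / N(a))."""
        Q = np.zeros(n_arms)                  # empirical means Q_hat_t(a)
        N = np.zeros(n_arms)                  # pull counts N_t(a)
        for a in range(n_arms):               # initialise: pull every arm once
            Q[a], N[a] = pull(a), 1
        for t in range(n_arms + 1, T + 1):
            a = int(np.argmax(Q + np.sqrt(2 * np.log(t) / N)))
            r = pull(a)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]         # incremental mean update
        return Q, N

    # Usage against a Bernoulli bandit with hypothetical arm means:
    probs = [0.3, 0.5, 0.7]
    rng = np.random.default_rng(1)
    Q, N = ucb1(lambda a: float(rng.random() < probs[a]), n_arms=3, T=5000)
    print(N)   # pulls should concentrate on the best arm (index 2)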