Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University

Outline: Multi-Armed Bandits (MAB); Exploration-First; Epsilon-Greedy; Softmax; UCB; Thompson Sampling; Adversarial Bandits; Exp3; Contextual Bandits; LinUCB

K Slot Machines [figure]

K Slot Machines. Choose a machine and get a reward. Only have T chances. Goal: maximize the cumulative rewards. How to choose the machines (arms)?

Two Phases: Explore-Exploit. Explore: try different arms. Exploit: play the most rewarding arm. A/B Testing: draw the arms uniformly during the first T/2 rounds, then play the empirically best arm until the end; this jumps from pure exploration to pure exploitation. Bandits: transition from exploration to exploitation more smoothly.

A/B Testing vs. Bandits [figure]

Multi-Armed Bandits (MAB). A set of K arms, denoted by $\mathcal{A}$. The number of rounds T is fixed. At round t, the player picks an arm $a_{i_t} \in \mathcal{A}$. The player gets reward $r_{i_t} \in [0, 1]$ for the chosen arm only (no full information). Assume for each arm a, its rewards are drawn iid from $D_a$ (unknown). People usually call this iid setting stochastic bandits.
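To make the stochastic setting concrete, here is a minimal sketch (not part of the original slides) of a Bernoulli bandit environment; the class name BernoulliBandit and its pull/K interface are illustrative assumptions, with each arm's rewards drawn iid from an unknown distribution $D_a$.

```python
import numpy as np

class BernoulliBandit:
    """Toy stochastic bandit: arm a yields iid Bernoulli(means[a]) rewards in {0, 1}.

    Illustrative interface (not from the slides): K arms, pull(arm) returns one reward.
    """

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # unknown to the player
        self.rng = np.random.default_rng(seed)

    @property
    def K(self):
        return len(self.means)

    def pull(self, arm):
        """Bandit feedback: reveal only the chosen arm's reward."""
        return float(self.rng.random() < self.means[arm])

# Example instance: K = 5 arms, the last arm is optimal.
env = BernoulliBandit([0.10, 0.25, 0.30, 0.45, 0.60])
```

The later algorithm sketches reuse this env object and interface.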

Regret. Regret: $R(T) = T\mu^* - \sum_{t=1}^{T} \mu(a_t)$, where $\mu(a_i) := \mathbb{E}(r_i)$ denotes the mean reward of arm $a_i$ and $\mu^* := \max_{a \in \mathcal{A}} \mu(a)$ is the optimal mean reward. Maximizing rewards $\Leftrightarrow$ minimizing regret.
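As a small worked illustration of the definition (my own, assuming an evaluator who knows the true means), the helper below computes R(T) from a sequence of pulled arms.

```python
import numpy as np

def regret(means, pulled_arms):
    """R(T) = T * mu_star - sum_t mu(a_t), evaluated with the true mean rewards."""
    means = np.asarray(means, dtype=float)
    mu_star = means.max()
    return len(pulled_arms) * mu_star - means[np.asarray(pulled_arms)].sum()

# Always pulling the second-best arm (mean 0.45) for T = 100 rounds:
print(regret([0.10, 0.25, 0.30, 0.45, 0.60], [3] * 100))  # 100 * (0.60 - 0.45) = 15.0
```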

MAB Algorithms

Exploration-First Algorithm (A/B testing style). Exploration phase: try each arm N times; select the arm $a_{\mathrm{opt}}$ with the highest average reward. Exploitation phase: play $a_{\mathrm{opt}}$ in all remaining rounds. Regret bound: with a proper value of N, $R(T) \le 1 + \frac{2K \log(2KT)}{\Delta_{\min}^2}$, where $\Delta_{\min} = \min_{a \in \mathcal{A}\setminus\{a^*\}} \big(\mu^* - \mu(a)\big)$.
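A minimal sketch of the exploration-first strategy, using the illustrative BernoulliBandit interface from above; here N is left as a free parameter rather than the tuned value behind the bound.

```python
import numpy as np

def explore_first(env, T, N):
    """Exploration-first: pull each arm N times, then commit (assumes N * env.K <= T)."""
    sums = np.zeros(env.K)
    # Exploration phase: N pulls per arm, uniformly.
    for arm in range(env.K):
        for _ in range(N):
            sums[arm] += env.pull(arm)
    a_opt = int(np.argmax(sums / N))      # empirically best arm
    total = sums.sum()
    # Exploitation phase: play a_opt in all remaining rounds.
    for _ in range(T - N * env.K):
        total += env.pull(a_opt)
    return a_opt, total
```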

Epsilon-Greedy Algorithm. Exploit: with probability $1 - \epsilon$, play the best arm so far. Explore: with probability $\epsilon$, choose an arm randomly.

Epsilon-Greedy Algorithm. If $\epsilon$ is held constant, only a linear bound on R(T) can be achieved. When $\epsilon$ decreases with time, e.g. $\epsilon_t = O(1/t)$, it can obtain $R(T) \le O\!\left(\frac{K \log T}{\Delta_{\min}^2}\right)$.
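A sketch of epsilon-greedy with a decaying schedule of the form $\epsilon_t = O(1/t)$; the constant c in the schedule and the argmax tie-breaking are illustrative choices, and env is the environment sketched earlier.

```python
import numpy as np

def epsilon_greedy(env, T, c=5.0, seed=0):
    """Epsilon-greedy with eps_t = min(1, c / t)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(env.K)
    means = np.zeros(env.K)
    for t in range(1, T + 1):
        eps_t = min(1.0, c / t)
        if rng.random() < eps_t:
            arm = int(rng.integers(env.K))   # explore: uniformly random arm
        else:
            arm = int(np.argmax(means))      # exploit: current empirical best
        r = env.pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
    return means
```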

Adaptive Exploration. A big flaw: the exploration is completely random and does not depend on the history of the observed rewards. Adaptive vs. non-adaptive exploration.

Softmax Algorithm. Idea: pick each arm with a probability proportional to its average reward. Given initial empirical means $\hat\mu_i$ for $i = 1, \dots, K$: $P(\text{play arm } k) = \frac{\hat\mu_k}{\sum_{i=1}^{K} \hat\mu_i}$. Softmax Algorithm (Boltzmann Exploration): $P(\text{play arm } k) = \frac{\exp(\hat\mu_k/\tau)}{\sum_{i=1}^{K} \exp(\hat\mu_i/\tau)}$. $\tau$: a temperature parameter controlling the randomness of the choice. Annealing: dynamically decrease $\tau$ over time.
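A sketch of Boltzmann exploration with a fixed temperature $\tau$ (an annealed version would shrink $\tau$ over time); initializing with one pull per arm and the default value of $\tau$ are illustrative choices.

```python
import numpy as np

def boltzmann(env, T, tau=0.1, seed=0):
    """Boltzmann exploration: play arm k with probability proportional to exp(mu_hat_k / tau)."""
    rng = np.random.default_rng(seed)
    counts = np.ones(env.K)                               # one initial pull per arm
    means = np.array([env.pull(a) for a in range(env.K)])
    for _ in range(env.K, T):
        logits = means / tau
        p = np.exp(logits - logits.max())                 # subtract the max for numerical stability
        p /= p.sum()
        arm = int(rng.choice(env.K, p=p))
        r = env.pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means
```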

Epsilon-Greedy and Softmax. Tend to select the currently best arm; sometimes decide to explore and choose an option that is not currently best. Drawbacks: they ignore the randomness (noise) of rewards and can be easily misled by negative experiences: non-robust.

UCB1 Algorithm. Given initial empirical means $\hat\mu_i$ for $i = 1, \dots, K$, play the j-th arm with $j = \arg\max_{i=1,\dots,K} \mathrm{UCB}(a_i) := \hat\mu_i + \sqrt{\frac{\alpha \log t}{n_i}}$, where $n_i$ is the number of times $a_i$ has been played so far. Usually, $\alpha = 2$.
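A sketch of UCB1 as stated on the slide, with $\alpha = 2$; pulling each arm once to initialize the empirical means is the usual convention and an assumption here, and env is the illustrative environment from earlier.

```python
import numpy as np

def ucb1(env, T, alpha=2.0):
    """UCB1: play the arm maximizing mu_hat_i + sqrt(alpha * log(t) / n_i)."""
    counts = np.ones(env.K)                               # n_i, after one initial pull per arm
    means = np.array([env.pull(a) for a in range(env.K)])
    for t in range(env.K + 1, T + 1):
        ucb = means + np.sqrt(alpha * np.log(t) / counts)
        arm = int(np.argmax(ucb))
        r = env.pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means
```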

UCB: Upper Confidence Bounds. $\mathrm{UCB}(a_i) := \hat\mu_i + \sqrt{\frac{\alpha \log t}{n_i}}$. By Hoeffding's inequality, $P\!\left(\hat\mu_i + \sqrt{\frac{\alpha \log t}{n_i}} \le \mu_i\right) \le \exp\!\left(-2 n_i \cdot \frac{\alpha \log t}{n_i}\right) = \frac{1}{t^{2\alpha}}$.

Regret Bound for UCB1 (Auer et al., 2002): $\mathbb{E}(R(T)) \le 8 \sum_{i:\, \Delta_i > 0} \frac{\log T}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right) \sum_{i=1}^{K} \Delta_i$, where $\Delta_i = \mu^* - \mu_i$. This is $O\!\left(\frac{K \log T}{\Delta_{\min}}\right)$. No need to set any parameter!

UCB1-Tuned. A variant which takes into account the variance of each arm. Play the j-th arm with $j = \arg\max_{i=1,\dots,K} \hat\mu_i + \sqrt{\frac{\log t}{n_i} \min\!\left(\tfrac{1}{4}, V_i\right)}$, where $V_i = \hat\sigma_i^2 + \sqrt{\frac{2 \log t}{n_i}}$, and $\hat\sigma_i^2$ can be computed from the historical rewards.
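A sketch of UCB1-Tuned following the index above; the empirical variance is maintained from running sums, and the one-pull-per-arm initialization is the same illustrative convention as before.

```python
import numpy as np

def ucb1_tuned(env, T):
    """UCB1-Tuned: replace alpha with min(1/4, V_i), where V_i uses the empirical variance."""
    counts = np.ones(env.K)
    means = np.array([env.pull(a) for a in range(env.K)])
    sq_sums = means ** 2                                  # running sums of squared rewards
    for t in range(env.K + 1, T + 1):
        var = sq_sums / counts - means ** 2               # empirical variance sigma_hat_i^2
        V = var + np.sqrt(2 * np.log(t) / counts)
        ucb = means + np.sqrt(np.log(t) / counts * np.minimum(0.25, V))
        arm = int(np.argmax(ucb))
        r = env.pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        sq_sums[arm] += r ** 2
    return means
```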

Thompson Sampling. Bayesian point of view: quantify the reward in terms of a distribution rather than a point estimate. Prior: assume rewards follow $\mathrm{Beta}(S, F)$; S: wins, F: fails. Set $S_i = F_i = 0$ for $i = 1, \dots, K$. At the t-th round: sample $\theta_i \sim \mathrm{Beta}(S_i + 1, F_i + 1)$ for all i; play arm $j_t := \arg\max_i \theta_i$ and receive reward $r_t$; generate $w_t \sim \mathrm{Ber}(r_t)$; update $S_{j_t} \leftarrow S_{j_t} + 1$ if $w_t = 1$, otherwise $F_{j_t} \leftarrow F_{j_t} + 1$.
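A sketch of Beta-Bernoulli Thompson sampling following the update rule above; the env interface is the illustrative one from earlier, and a reward in [0, 1] is converted to a Bernoulli "win" exactly as on the slide.

```python
import numpy as np

def thompson_sampling(env, T, seed=0):
    """Thompson sampling with Beta(S_i + 1, F_i + 1) posteriors over Bernoulli success rates."""
    rng = np.random.default_rng(seed)
    S = np.zeros(env.K)                       # per-arm success counts
    F = np.zeros(env.K)                       # per-arm failure counts
    for _ in range(T):
        theta = rng.beta(S + 1, F + 1)        # one posterior sample per arm
        arm = int(np.argmax(theta))
        r = env.pull(arm)                     # reward in [0, 1]
        w = rng.random() < r                  # Bernoulli trial with success probability r
        if w:
            S[arm] += 1
        else:
            F[arm] += 1
    return S / np.maximum(S + F, 1)           # empirical success rate per arm
```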

Regret Bound for Thompson Sampling (Agrawal and Goyal, 2012): $\mathbb{E}(R(T)) \le O\!\left(\Big(\sum_{i:\, \Delta_i > 0} \frac{1}{\Delta_i^2}\Big)^2 \log T\right)$. The rate in T is the same as for UCB1, but the bound is inferior in terms of constant factors and the dependence on $\Delta$. Thompson sampling is cheap in computation and has competitive empirical performance compared to UCB1.

Comparison. We generated Bernoulli rewards with K = 5, T = 10000. Figure: average of 100 runs.

Adversarial Bandits. Non-stochastic: sometimes the rewards cannot be modeled by a stationary distribution. Adversarial bandit game: at time t, the player chooses arm $a_{i_t}$; simultaneously, an adversary chooses a vector of rewards $[r_1^t, r_2^t, \dots, r_K^t]$; the player only receives the reward $r_{i_t}^t$. We still assume $r \in [0, 1]$ for simplicity.

Adversarial Bandits. For any deterministic algorithm there exists a sequence of rewards such that $R(T) \ge T/2$. The idea is to add randomization to the selection of the arm.

Exp3 Algorithm. Set $p_i^0 = 1/K$ for $i = 1, \dots, K$ and $G_i^0 = 0$ (estimated cumulative rewards). At round t ($t \ge 1$): sample arm $i_t$ according to the probabilities $p^{t-1}$ and observe $r_{i_t}^t$; estimate rewards for all actions: $g^t = \big[0, \dots, 0, \frac{r_{i_t}^t}{p_{i_t}^{t-1}}, 0, \dots, 0\big]$; update cumulative rewards: $G^t = G^{t-1} + g^t$; update the sampling probabilities: $p^t = \frac{\exp(\eta G^t)}{\langle \mathbf{1}, \exp(\eta G^t)\rangle}$.

Exp3 Algorithm. Unbiased estimate of unseen rewards: $\mathbb{E}[g_j^t] = \mathbb{E}\!\left[\frac{r_j^t}{p_j^{t-1}} \mathbf{1}_{\{j = i_t\}}\right] = \frac{r_j^t}{p_j^{t-1}} \mathbb{E}[\mathbf{1}_{\{j = i_t\}}] = \frac{r_j^t}{p_j^{t-1}}\, p_j^{t-1} = r_j^t$. A variant: $p^t = (1 - \gamma)\frac{\exp(\eta G^t)}{\langle \mathbf{1}, \exp(\eta G^t)\rangle} + \frac{\gamma}{K}$, $\gamma \in (0, 1]$: a mixture of the uniform distribution (exploration) and exponential weights (exploitation).
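A sketch of Exp3 including the optional uniform-mixing variant ($\gamma = 0$ recovers the plain version); the learning rate $\eta$, the value of $\gamma$, and the reward_fn(t, arm) interface used to stand in for the adversary are illustrative assumptions.

```python
import numpy as np

def exp3(K, reward_fn, T, eta=0.05, gamma=0.0, seed=0):
    """Exp3: exponential weights on importance-weighted reward estimates, optional gamma-mixing."""
    rng = np.random.default_rng(seed)
    G = np.zeros(K)                                   # estimated cumulative rewards
    for t in range(T):
        w = np.exp(eta * (G - G.max()))               # shift by max for numerical stability
        p = (1 - gamma) * w / w.sum() + gamma / K
        arm = int(rng.choice(K, p=p))
        r = reward_fn(t, arm)                         # only the chosen arm's reward is revealed
        G[arm] += r / p[arm]                          # unbiased importance-weighted estimate
    return G

# Illustrative non-stationary reward sequence: arm 0 pays early, arm 1 pays later.
rewards = lambda t, a: float(a == 0) if t < 500 else float(a == 1)
G_hat = exp3(K=2, reward_fn=rewards, T=1000)
```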

Contextual Bandits. Some additional information is available at each round; this information (context) could help with the arm choices. Example: a web article recommendation system, with contextual information about visitors: demographics, browsing history, location.

Contextual Bandits. Each round t proceeds as follows: the player observes a context $x_t$; the player chooses an arm $a_{i_t}$; a reward $r_t \in [0, 1]$ is realized. The reward distribution could be stochastic (iid) or adversarial.

Example: Web Article Recommendation. 3 articles but only one space: K = 3. 2 user features: if they had clicked on a sports article or a politics article in the past. Find which articles are best for people given their past click behaviors.

arm  clk sports  clk politics  reward
 1        1            0        0.58
 1        1            1        0.69
 2        0            0        0.19
 3        0            1        0.51

LinUCB. A combination of supervised learning and UCB. At each round, fit a ridge regression for each arm: $\hat\theta_a = (X_a^\top X_a + I)^{-1} X_a^\top r_a$, where $X_a$ is the feature matrix for arm a so far and $r_a$ is the reward vector from choosing arm a so far. For the toy data above: $X_1 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, $X_2 = \begin{bmatrix} 0 & 0 \end{bmatrix}$, $X_3 = \begin{bmatrix} 0 & 1 \end{bmatrix}$, and $r_1 = \begin{bmatrix} 0.58 \\ 0.69 \end{bmatrix}$, $r_2 = \begin{bmatrix} 0.19 \end{bmatrix}$, $r_3 = \begin{bmatrix} 0.51 \end{bmatrix}$.
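The snippet below (a sketch with NumPy, using the toy click data from the table above) reproduces the per-arm ridge estimates $\hat\theta_a$.

```python
import numpy as np

# Per-arm design matrices and reward vectors from the toy click data.
X1 = np.array([[1.0, 0.0], [1.0, 1.0]]); r1 = np.array([0.58, 0.69])
X2 = np.array([[0.0, 0.0]]);             r2 = np.array([0.19])
X3 = np.array([[0.0, 1.0]]);             r3 = np.array([0.51])

def ridge_theta(X, r):
    """theta_hat = (X^T X + I)^{-1} X^T r."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(d), X.T @ r)

for a, (X, r) in enumerate([(X1, r1), (X2, r2), (X3, r3)], start=1):
    print(f"arm {a}: theta_hat =", ridge_theta(X, r))
```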

LinUCB. Choose the arm that gives the largest UCB for a newly observed context $x_t$: $a^+ = \arg\max_{a \in \mathcal{A}} \underbrace{x_t^\top \hat\theta_a}_{\text{prediction}} + \alpha \underbrace{\sqrt{x_t^\top (X_a^\top X_a + I)^{-1} x_t}}_{\text{standard deviation}}$.
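A sketch of the LinUCB arm-selection rule above, applied to the same toy data; the value of $\alpha$ and the function name linucb_choose are illustrative.

```python
import numpy as np

X_by_arm = [np.array([[1.0, 0.0], [1.0, 1.0]]),  # arm 1: two past observations
            np.array([[0.0, 0.0]]),              # arm 2
            np.array([[0.0, 1.0]])]              # arm 3
r_by_arm = [np.array([0.58, 0.69]), np.array([0.19]), np.array([0.51])]

def linucb_choose(x_t, X_by_arm, r_by_arm, alpha=1.0):
    """Pick argmax_a of x^T theta_hat_a + alpha * sqrt(x^T (X_a^T X_a + I)^{-1} x)."""
    d = len(x_t)
    scores = []
    for X, r in zip(X_by_arm, r_by_arm):
        A_inv = np.linalg.inv(X.T @ X + np.eye(d))   # (X_a^T X_a + I)^{-1}
        theta = A_inv @ X.T @ r                      # ridge estimate for this arm
        pred = x_t @ theta                           # prediction term
        width = np.sqrt(x_t @ A_inv @ x_t)           # confidence ("standard deviation") term
        scores.append(pred + alpha * width)
    return int(np.argmax(scores)) + 1                # arms are numbered 1..K in the example

# A new visitor who clicked sports but not politics:
x_new = np.array([1.0, 0.0])
print("recommend arm", linucb_choose(x_new, X_by_arm, r_by_arm))
```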

LinUCB. Feature engineering is extremely important. One could use both user features and arm features. Hybrid-LinUCB allows arms to share contextual variables. GLM-UCB handles rewards following distributions from the exponential family.