EECS598: Prediction and Learning: It's Only a Game                                Fall 2013

Lecture 19: UCB Algorithm and Adversarial Bandit Problem

Prof. Jacob Abernethy                                        Scribe: Hossein Keshavarz

Announcements

There is no class on Wednesday, November 27.
Good job on project proposals so far!

19.1 Review of the stochastic multi-armed bandit problem

Consider a gambler playing against an $N$-armed slot machine who sequentially pulls arms in order to minimize the expected regret. Each arm $i \in \{1,\dots,N\}$ has a probability distribution $D_i$ supported on the closed interval $[0,1]$. Let $\mu_i$ be the expected loss of arm $i$, and define $\tilde{I} = \arg\min_{1 \le j \le N} \mu_j$ as the index of the arm with the smallest expected loss. Suppose that the gambler selects arm $I_t$ at round $t$; then the expected regret of the gambler up to round $T$ is defined as
$$\mathbb{E}[\mathrm{regret}] = \mathbb{E}\left[\sum_{t=1}^{T} X_{I_t,t} - \sum_{t=1}^{T} X_{\tilde{I},t}\right],$$
where $X_{i,t} \sim D_i$ denotes the loss associated with choosing arm $i$ at round $t$. It is worth mentioning that the expectation is taken over the distributions $\{D_i\}_{i=1}^{N}$ and over the randomness of the arm choices.
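To make the protocol concrete, the following is a minimal Python sketch (not from the notes) of the stochastic bandit game and a Monte Carlo estimate of the expected regret defined above. The Bernoulli loss model and all names (`run_bandit`, `uniform_policy`) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the stochastic bandit protocol: N arms with Bernoulli
# losses of means mu_i, a policy queried once per round under bandit feedback,
# and a Monte Carlo estimate of the expected regret over T rounds.

def run_bandit(mu, policy, T, rng):
    """Play T rounds; return the realized regret sum_t (X_{I_t,t} - X_{Itilde,t})."""
    mu = np.asarray(mu, dtype=float)
    best = int(np.argmin(mu))                 # optimal arm Itilde (smallest mean loss)
    history = []                              # (arm pulled, loss observed) pairs
    regret = 0.0
    for t in range(T):
        arm = policy(t, history, len(mu))     # bandit feedback: only past pulls visible
        x = rng.binomial(1, mu)               # draw X_{i,t} ~ D_i for every arm
        history.append((arm, x[arm]))
        regret += x[arm] - x[best]
    return regret

def uniform_policy(t, history, n_arms):
    """Baseline policy that ignores all feedback."""
    return np.random.randint(n_arms)

rng = np.random.default_rng(0)
mu = [0.3, 0.5, 0.6]                          # expected losses of three arms
runs = [run_bandit(mu, uniform_policy, T=1000, rng=rng) for _ in range(50)]
print("Monte Carlo estimate of E[regret] for uniform play:", np.mean(runs))
```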

19.2 Analysis of the greedy algorithm

This section states and proves a theorem giving an upper bound on the expected regret of the greedy algorithm introduced in the last lecture: play each arm $m$ times (the sampling phase), then play the empirically best arm for the remaining rounds (the exploitation phase).

Theorem 19.1. Suppose that there exists a positive scalar $\Delta$ such that $\mu_j - \mu_{\tilde{I}} \ge \Delta$ for any $j \ne \tilde{I}$. Then the expected regret of the greedy algorithm has the following upper bound:
$$\mathbb{E}[\mathrm{regret}] \le 1 + \frac{2N\log(2NT)}{\Delta^2}.$$

Proof. Let us decompose the expected regret into a sum of two terms, where the first term is the regret incurred during the sampling phase and the second term corresponds to the exploitation phase:
$$
\mathbb{E}[\mathrm{regret}]
= \mathbb{E}\left[\sum_{t=1}^{Nm}\left(X_{I_t,t}-X_{\tilde{I},t}\right)\right]
+ \mathbb{E}\left[\sum_{t=Nm+1}^{T}\left(X_{I_t,t}-X_{\tilde{I},t}\right)\right]
\overset{(a)}{\le} Nm + \sum_{t=Nm+1}^{T}\mathbb{P}\left(I_t \ne \tilde{I}\right)
\overset{(b)}{\le} Nm + T\max_{Nm+1\le t\le T}\mathbb{P}\left(I_t \ne \tilde{I}\right). \quad (19.1)
$$
Note that inequalities (a) and (b) are immediate consequences of the fact that $X_{k,t} \in [0,1]$ for all $1 \le k \le N$ and $1 \le t \le T$. Therefore, in order to complete the proof, we need to control the probability $\mathbb{P}(I_t \ne \tilde{I})$ uniformly from above. If $I_t \ne \tilde{I}$ then $\hat{\mu}_{I_t} \le \hat{\mu}_{\tilde{I}}$; hence, using some straightforward algebraic manipulation, we obtain
$$
\Delta \le \mu_{I_t} - \mu_{\tilde{I}}
= \left(\mu_{I_t} - \hat{\mu}_{I_t}\right) + \left(\hat{\mu}_{I_t} - \hat{\mu}_{\tilde{I}}\right) + \left(\hat{\mu}_{\tilde{I}} - \mu_{\tilde{I}}\right)
\le \left(\mu_{I_t} - \hat{\mu}_{I_t}\right) + 0 + \left(\hat{\mu}_{\tilde{I}} - \mu_{\tilde{I}}\right)
\le 2\max_{1\le j\le N}\left|\mu_j - \hat{\mu}_j\right|.
$$
In other words, $\mathbb{P}(I_t \ne \tilde{I}) \le \mathbb{P}\left(\max_{1\le j\le N}|\mu_j - \hat{\mu}_j| \ge \Delta/2\right)$ for any $t > Nm$. Combining the union bound (inequality (a) below) and Hoeffding's inequality (used to prove inequality (b) below) gives the desired upper bound:
$$
\max_{Nm+1\le t\le T}\mathbb{P}\left(I_t \ne \tilde{I}\right)
\le \mathbb{P}\left(\max_{1\le j\le N}\left|\mu_j - \hat{\mu}_j\right| \ge \frac{\Delta}{2}\right)
\overset{(a)}{\le} N\max_{1\le j\le N}\mathbb{P}\left(\left|\mu_j - \hat{\mu}_j\right| \ge \frac{\Delta}{2}\right)
\overset{(b)}{\le} 2N\exp\left(-\frac{m\Delta^2}{2}\right). \quad (19.2)
$$
Letting the exploration parameter $m = \frac{2}{\Delta^2}\log\frac{2N}{\delta}$ and substituting inequality (19.2) into inequality (19.1) yields
$$\mathbb{E}[\mathrm{regret}] \le Nm + T\delta = \frac{2N}{\Delta^2}\log\frac{2N}{\delta} + T\delta.$$
Choosing $\delta = 1/T$ terminates the proof. Note that the reason $\Delta^2$ appears in the denominator of the regret bound is that the inequality $\mathbb{E}\left[\sum_{t=1}^{Nm}\left(X_{I_t,t}-X_{\tilde{I},t}\right)\right] \le Nm$ is not tight enough.

Lemma 19.2 (Hoeffding's inequality). Let $Z_1,\dots,Z_n$ be independent and identically distributed random variables such that $Z_i \in [0,1]$ almost surely and $\mathbb{E}[Z_i] = \mu$. Then
$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \mu\right| \ge \epsilon\right) \le 2\exp\left(-2n\epsilon^2\right).$$
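As a companion to Theorem 19.1, here is a hedged sketch of the greedy (explore-then-exploit) policy, written for the `run_bandit` protocol sketched above. The choice $m = \lceil (2/\Delta^2)\log(2NT) \rceil$ mirrors our reconstruction of the proof; `make_greedy_policy` and `delta_gap` are illustrative names, and the gap $\Delta$ is assumed to be known in advance.

```python
import numpy as np

# Greedy (explore-then-exploit) policy: pull every arm m times, then commit to
# the empirically best arm. m follows the choice made in the proof of Theorem 19.1.

def make_greedy_policy(n_arms, T, delta_gap):
    m = int(np.ceil((2.0 / delta_gap**2) * np.log(2 * n_arms * T)))
    def policy(t, history, n):
        if t < n * m:                          # sampling phase: round-robin over arms
            return t % n
        # exploitation phase: empirical mean loss of each arm from the sampling phase
        sums = np.zeros(n)
        counts = np.zeros(n)
        for arm, loss in history[: n * m]:
            sums[arm] += loss
            counts[arm] += 1
        return int(np.argmin(sums / counts))
    return policy

rng = np.random.default_rng(1)
mu = [0.2, 0.6, 0.7]                           # gap Delta = 0.4 assumed known
greedy_runs = [run_bandit(mu, make_greedy_policy(3, 5000, 0.4), T=5000, rng=rng)
               for _ in range(20)]
print("Monte Carlo estimate of E[regret] for greedy:", np.mean(greedy_runs))
```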

19.3 Upper Confidence Bound (UCB) algorithm

In order to introduce the UCB algorithm, let us first introduce some notation. Define
$$T_j(t) = \sum_{s=1}^{t}\mathbb{I}\{I_s = j\}$$
as the number of times that the gambler has played arm $j$ up to time $t$, and let the empirical average loss of arm $j$ be
$$\hat{\mu}_{j,t} = \frac{1}{T_j(t)}\sum_{s=1}^{t} X_{j,s}\,\mathbb{I}\{I_s = j\}. \quad (19.3)$$

Algorithm 1. UCB algorithm for the multi-armed bandit problem
  Initialization: play each arm once.
  For t = 1 to T:
    For j = 1 to N: update $\hat{\mu}_{j,t}$ according to identity (19.3).
    Play the arm
    $$I_t = \arg\min_{1\le j\le N}\left\{\hat{\mu}_{j,t} - \sqrt{\frac{3\log t}{T_j(t)}}\right\}.$$

The arm-selection criterion of the UCB algorithm encourages choosing rarely played arms. The next theorem characterizes the upper bound on the expected regret of the UCB algorithm. The interested reader is referred to [1] for further details and for the proof of Theorem 19.3.

Theorem 19.3. Let $\Delta_j = \mu_j - \mu_{\tilde{I}}$ and suppose $\Delta_j > 0$ for any $j \ne \tilde{I}$. Then the expected regret of the UCB algorithm can be upper bounded as
$$\mathbb{E}[\mathrm{regret}] \le O\left(\sum_{j\ne\tilde{I}}\frac{\log T}{\Delta_j}\right).$$

Theorem 19.3 shows that the expected regret of the UCB algorithm is far better than that of the greedy method for small $\Delta_j$'s. Strictly speaking, if $\Delta_{\min} = \min_{j\ne\tilde{I}}\Delta_j$, then
$$\mathbb{E}[\mathrm{regret}] \le O\left(\frac{N\log T}{\Delta_{\min}}\right).$$
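The selection rule of Algorithm 1 can be sketched as a policy for the same `run_bandit` protocol used above. This is a minimal illustration assuming the bonus term $\sqrt{3\log t / T_j(t)}$ as read from the notes; `make_ucb_policy` is an illustrative name, and the $\log(t+1)$ shift is only to keep the bonus well-defined from the first adaptive round.

```python
import numpy as np

# UCB policy: maintain per-arm counts T_j(t) and loss sums, play each arm once,
# then select the arm minimizing the lower confidence bound on its mean loss.

def make_ucb_policy(n_arms):
    counts = np.zeros(n_arms)             # T_j(t): number of pulls of arm j so far
    sums = np.zeros(n_arms)               # cumulative observed loss of arm j

    def policy(t, history, n):
        if history:                       # fold in the most recent observation
            arm, loss = history[-1]
            counts[arm] += 1
            sums[arm] += loss
        if t < n:                         # initialization: play each arm once
            return t
        mu_hat = sums / counts
        bonus = np.sqrt(3.0 * np.log(t + 1.0) / counts)
        return int(np.argmin(mu_hat - bonus))   # lower confidence bound on the loss

    return policy

rng = np.random.default_rng(2)
mu = [0.2, 0.6, 0.7]
ucb_runs = [run_bandit(mu, make_ucb_policy(3), T=5000, rng=rng) for _ in range(20)]
print("Monte Carlo estimate of E[regret] for UCB:", np.mean(ucb_runs))
```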

19.4 Adversarial Bandit

Unlike the stochastic multi-armed bandit setting, where the loss of each action has a stationary distribution over time, in the adversarial bandit problem there is no statistical assumption on the process generating the losses. In this new formulation, the loss of each arm is determined at each round by an adversary, and the player only observes the losses of its previously chosen actions. The only assumption on the loss vector is that $l_t \in [0,1]^N$ for each round $t$. Since the adversary can assign a high loss to the arm a deterministic player is about to select and low losses to the other arms, a deterministic arm-selection policy cannot guarantee small expected regret. Finally, we mention that the distribution of the action $I_t$ depends only on the losses of the previous actions, $\{l_s(I_s)\}_{s=1}^{t-1}$.

Due to the lack of full information about the losses of all arms, the player cannot directly run the exponential weights algorithm (EWA) to control the regret. One possible solution is to form an estimate $\tilde{l}_t$ of the loss vector $l_t$ based on the observed single component $l_t(I_t)$. Using the tower property of conditional expectation, we can show that if $\tilde{l}_t$ is an unbiased estimator of $l_t$, i.e. $\mathbb{E}[\tilde{l}_t \mid \mathcal{F}_t] = l_t$ where $\mathcal{F}_t$ is the $\sigma$-field generated by the observations before round $t$ (so that $p_t$ is $\mathcal{F}_t$-measurable), then the expected regrets with respect to $\tilde{l}_t$ and $l_t$ are the same for any fixed distribution $p$ over the arms:
$$\mathbb{E}\left[\tilde{l}_t\cdot(p_t-p)\right] = \mathbb{E}\left[\mathbb{E}\left[\tilde{l}_t\cdot(p_t-p)\mid\mathcal{F}_t\right]\right] = \mathbb{E}\left[\mathbb{E}\left[\tilde{l}_t\mid\mathcal{F}_t\right]\cdot(p_t-p)\right] = \mathbb{E}\left[l_t\cdot(p_t-p)\right].$$

19.4.1 Unbiased estimation of $l_t$

We claim that the following procedure, called the exponential weights algorithm with $\epsilon$-exploration, generates an unbiased estimator of $l_t$:

1. With probability $\epsilon$, choose $p_t = \left(\frac{1}{N},\dots,\frac{1}{N}\right)$, select $I_t \sim p_t$ uniformly at random, and let $\tilde{l}_t = \left(0,\dots,0,\frac{N\,l_t(I_t)}{\epsilon},0,\dots,0\right)$, with the nonzero entry in coordinate $I_t$.

2. With probability $1-\epsilon$, choose $p_t$ by the exponential weights algorithm on the loss vectors $\{\tilde{l}_s\}_{s=1}^{t-1}$ and let $\tilde{l}_t = 0$.

Proof of claim.
$$\mathbb{E}\left[\tilde{l}_t\right] = (1-\epsilon)\cdot 0 + \epsilon\cdot\frac{1}{N}\sum_{j=1}^{N}\left(0,\dots,0,\frac{N\,l_t(j)}{\epsilon},0,\dots,0\right) = l_t.$$

Since $\tilde{l}_t$ is an unbiased estimator of $l_t$, at first glance it seems that the expected regret of the above algorithm should be exactly equal to the expected regret of EWA. However, an attentive reader will notice that $\tilde{l}_t$ no longer lies in the closed cube $[0,1]^N$. Recalling the proof of the regret bound for EWA from the earlier lecture notes, one can show that
$$\mathbb{E}[\text{regret of EWA with }\epsilon\text{-exploration}] \le T\epsilon + \frac{\log N}{\eta} + \eta\sum_{t=1}^{T}\|\tilde{l}_t\|_\infty^2 = O\left(T\epsilon + \frac{\log N}{\eta} + \frac{\eta T N^2}{\epsilon^2}\right).$$
Now, choosing the regularization parameters $\epsilon = \left(\frac{N^2\log N}{T}\right)^{1/4}$ and $\eta = \frac{\epsilon}{N}\sqrt{\frac{\log N}{T}}$, the optimal upper bound is given by
$$\mathbb{E}[\text{regret of EWA with }\epsilon\text{-exploration}] \le O\left(\sqrt{N}\,T^{3/4}(\log N)^{1/4}\right). \quad (19.4)$$

It is worth mentioning that although $\max_{1\le t\le T}\|\tilde{l}_t\|_\infty \le \frac{N}{\epsilon}$, most of the time (with probability $1-\epsilon$) we have $\tilde{l}_t = 0 \in [0,1]^N$. Hence, the proof can be slightly modified in a smart way to obtain the upper bound
$$\mathbb{E}[\text{regret of EWA with }\epsilon\text{-exploration}] \le O\left(T\epsilon + \frac{\log N}{\eta} + \frac{\eta T N}{\epsilon}\right),$$
which leads to $\mathbb{E}[\text{regret of EWA with }\epsilon\text{-exploration}] \le O\left((N\log N)^{1/3}T^{2/3}\right)$.

Question: Is there any algorithm with expected regret $O(\sqrt{NT})$? Yes! The EXP3 algorithm.
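The $\epsilon$-exploration procedure above can be sketched as follows for an arbitrary (here randomly generated) sequence of loss vectors. `ewa_with_exploration` and the parameter values are illustrative assumptions, and shifting the weights by the minimum cumulative loss is only for numerical stability.

```python
import numpy as np

# EWA with epsilon-exploration for adversarial losses l_t in [0,1]^N.
# Exponential weights are maintained on the estimated losses; with probability
# epsilon a round is an exploration round, and only then is a scaled loss
# estimate recorded (so E[l~_t] = l_t, as in the proof of claim above).

def ewa_with_exploration(loss_matrix, epsilon, eta, rng):
    """loss_matrix: T x N array of adversarial losses. Returns realized regret."""
    T, N = loss_matrix.shape
    L_tilde = np.zeros(N)                        # cumulative estimated losses
    total_loss = 0.0
    for t in range(T):
        if rng.random() < epsilon:               # exploration round
            p_t = np.full(N, 1.0 / N)
            I_t = rng.integers(N)
            l_tilde = np.zeros(N)
            l_tilde[I_t] = N * loss_matrix[t, I_t] / epsilon   # unbiased estimate
        else:                                    # exploitation round: EWA on L_tilde
            w = np.exp(-eta * (L_tilde - L_tilde.min()))       # shift for stability
            p_t = w / w.sum()
            I_t = rng.choice(N, p=p_t)
            l_tilde = np.zeros(N)                # no estimate recorded this round
        L_tilde += l_tilde
        total_loss += loss_matrix[t, I_t]
    return total_loss - loss_matrix.sum(axis=0).min()

rng = np.random.default_rng(3)
losses = rng.random((5000, 5))                   # a stand-in "adversary"
print("regret:", ewa_with_exploration(losses, epsilon=0.1, eta=0.05, rng=rng))
```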

19.5 EXP3 Algorithm

Let $\tilde{L}_j(t) = \sum_{s=1}^{t}\tilde{l}_s(j)$ be the cumulative estimated loss of arm $j$ up to round $t$.

Algorithm 2. EXP3 algorithm for the adversarial multi-armed bandit problem
  Input: regularization parameter $\eta$.
  Initialization: set an initial value for $p_1$ (e.g., the uniform distribution).
  For t = 1 to T:
    Sample $I_t$ based on the distribution $p_t$.
    Observe $l_t(I_t)$.
    Let $\tilde{l}_t = \left(0,\dots,0,\frac{l_t(I_t)}{p_t(I_t)},0,\dots,0\right)$, with the nonzero entry in coordinate $I_t$.
    Update $p_{t+1}$ by
    $$p_{t+1}(j) = \frac{\exp\left(-\eta\tilde{L}_j(t)\right)}{\sum_{j'=1}^{N}\exp\left(-\eta\tilde{L}_{j'}(t)\right)}.$$

The EXP3 algorithm will be analyzed in the next class.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multi-armed bandit problem," Machine Learning, 47(2-3):235-256, 2002.

[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "Gambling in a rigged casino: The adversarial multi-armed bandit problem," Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pp. 322-331, IEEE, 1995.