Multi-Armed Bandit Formulations for Identification and Control


Multi-Armed Bandit Formulations for Identification and Control
Cristian R. Rojas
Joint work with Matías I. Müller and Alexandre Proutiere
KTH Royal Institute of Technology, Sweden
ERNSI, September 24-27, 2017

Outline
- Motivation: identification and control
- Introduction to multi-armed bandits
- Lower bounds and optimal algorithms
- Application to H∞-norm estimation
- Summary

Motivation: Identification and control
[Workflow: input design → experiment → identification method → controller design → controller testing]
- Choices in the identification procedure need to be made with the end goal in mind. E.g., input design can emphasize system properties of interest, while properties of little or no interest can be hidden.
- However, designing a good input requires knowing what we do not know yet: the true system!
- This can be addressed by adaptively tuning the input:
  S.D. Silvey, Optimal Design: An Introduction to the Theory for Parameter Estimation. Chapman and Hall, 1980
  L. Pronzato, Optimal experimental design and some related control problems. Automatica, 44(2):303-325, 2008
  L. Gerencsér, H. Hjalmarsson and L. Huang. Adaptive input design for LTI systems. IEEE Transactions on Automatic Control, 62(5):2390-2405, 2017

Introduction to Multi-Armed Bandits
- Popular machine learning framework for adaptive control (but where the plant is static)
- Name coined in 1952 by Herbert Robbins, in the context of the sequential design of experiments
- Exploration vs. exploitation dilemma
- First asymptotically optimal solution proposed in 1985 by T.L. Lai and H. Robbins

Introduction to Multi-Armed Bandits (cont.)
Basic setup:
- There are K arms (slot machines) to choose from
- One can play one arm in each round
- Each arm j gives a reward X_{j,t} ∈ {0, 1} (t: round)
  (Note: only the reward of the selected arm j is revealed)
Problem: which arm should one play in each round?
The performance of a strategy is measured in terms of the expected cumulative regret

  R(T) = Σ_{i=1}^{T} ( µ* − E[µ_{a(i)}] ),

where:
- T: number of rounds
- µ* = max_i µ_i: best mean reward
- a(i): arm chosen at round i
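
To make the regret definition concrete, here is a minimal Python sketch (not from the talk; the `policy(t, counts, sums)` interface and all names are illustrative) that simulates a stochastic Bernoulli bandit and accumulates R(T) for an arbitrary arm-selection rule:

```python
import numpy as np

def simulate_regret(mu, policy, T, rng=np.random.default_rng(0)):
    """Play a Bernoulli bandit with mean rewards `mu` for T rounds and return
    the cumulative (expected) regret sum_i (mu_star - mu[a(i)])."""
    mu = np.asarray(mu, dtype=float)
    mu_star = mu.max()
    counts = np.zeros(len(mu), dtype=int)   # T_j(t): times each arm was played
    sums = np.zeros(len(mu))                # cumulative observed reward per arm
    regret = np.zeros(T)
    for t in range(T):
        a = policy(t, counts, sums)          # arm chosen at round t
        reward = rng.binomial(1, mu[a])      # only this reward is observed
        counts[a] += 1
        sums[a] += reward
        regret[t] = mu_star - mu[a]          # per-round expected regret
    return regret.cumsum()

# Example usage with a (poor) uniformly random policy: regret grows linearly.
rng = np.random.default_rng(1)
random_policy = lambda t, counts, sums: int(rng.integers(len(counts)))
print(simulate_regret([0.9, 0.8, 0.5], random_policy, T=1000)[-1])
```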

Introduction to Multi-Armed Bandits (cont.)
Different formulations:
- Stochastic: rewards sampled from an unknown distribution (independently between rounds)
  Example: X_{j,t}: i.i.d. Bernoulli variables with unknown mean µ_j
- Adversarial: rewards chosen by an adversary
  - Oblivious adversary: X_{j,t} chosen a priori (at round 0)
  - Adaptive adversary: X_{j,t} chosen based on the history of selected arms and rewards so far
- Markovian: rewards are Markov processes (evolving only when the respective arm is chosen)
  Large literature since the 1970s based on Gittins indices
We will focus mostly on stochastic MABs (for applications of adversarial MABs to identification, see G. Rallo et al., Data-driven H∞-norm estimation via expert advice. CDC 2017).

Introduction to Multi-Armed Bandits (cont.)
Applications:
- Clinical trials
- Ad placement on webpages
- Recommender systems
- Computer game-playing

Lower bounds and optimal algorithms
- The performance of a strategy depends on the true reward distribution of the arms
- To obtain reasonable problem-dependent lower bounds on the achievable performance, one needs to restrict the class of strategies
Consider a Bernoulli MAB:
Definition (Uniformly good strategy). A strategy is called uniformly good (or efficient) if, for any mean reward vector (µ_1, ..., µ_K), the number of times T_j(t) that any suboptimal arm j (µ_j < µ*) is chosen up to round t satisfies E[T_j(t)] = o(t^α), for all α > 0.
Note: this definition should be suitably adapted for other types of MABs.

Lower bounds and optimal algorithms (cont.)
Theorem (Lower bound; Lai & Robbins, 1985). For any uniformly good strategy and any suboptimal arm j,

  lim inf_{t→∞} T_j(t) / log t ≥ 1 / I(µ_j, µ*)   with probability 1,

where

  I(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y))

is the KL divergence between two Bernoulli distributions with means x and y. Therefore,

  lim inf_{t→∞} R(t) / log t ≥ Σ_{j: µ_j < µ*} (µ* − µ_j) / I(µ_j, µ*).

Idea of proof: transform the problem into hypothesis testing: a uniformly good strategy should detect the best arm quickly, but for that it needs to collect enough samples of every suboptimal arm. The Chernoff-Stein lemma provides a lower bound on T_j(t) needed to achieve consistent detection.
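
As a quick numerical illustration of the bound (a sketch of my own, not part of the slides), the following snippet evaluates the Bernoulli KL divergence I(x, y) and the constant multiplying log t in the regret lower bound for a given vector of mean rewards:

```python
import numpy as np

def kl_bernoulli(x, y, eps=1e-12):
    """KL divergence I(x, y) between Bernoulli(x) and Bernoulli(y)."""
    x = np.clip(x, eps, 1 - eps)
    y = np.clip(y, eps, 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def lai_robbins_constant(mu):
    """Coefficient c of log t in the Lai-Robbins lower bound R(t) >= c log t (asymptotically)."""
    mu = np.asarray(mu, dtype=float)
    mu_star = mu.max()
    return sum((mu_star - m) / kl_bernoulli(m, mu_star)
               for m in mu if m < mu_star)

# Example: for means (0.9, 0.8, 0.5), print the unavoidable log-t regret constant.
print(lai_robbins_constant([0.9, 0.8, 0.5]))
```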

Lower bounds and optimal algorithms (cont.)
Two large families of asymptotically optimal algorithms:
- Upper confidence bound (UCB)
- Thompson sampling

Lower bounds and optimal algorithms (cont.)
UCB algorithm. At each round t:
- Construct a confidence interval around µ_j for each arm j, of significance level α_t
- Choose the arm whose upper confidence bound is the largest (optimism in the face of uncertainty)
The significance level α_t should be carefully tuned (α_t → 1) to obtain an asymptotically optimal strategy. The resulting upper bounds are

  b_j(t) = µ̂_j(t) + √( 2 log(t) / T_j(t) ),

where µ̂_j(t) is the average reward of arm j and T_j(t) is the number of times arm j has been played up to round t.
Similar to the Bet on the Best (BoB) principle of S. Bittanti and M.C. Campi (Comm. Inf. & Syst., 6(4):299-320, 2006).
[Slide illustration: the UCB-Classif algorithm applied to selecting motor-imagery tasks, from J. Fruitet et al., J. Neural Eng. 10 (2013) 016012.]
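
A minimal sketch of the UCB index above, written so it can be plugged into the `simulate_regret` helper from the earlier snippet (the `(t, counts, sums)` interface is an assumption of these sketches, not something defined in the slides):

```python
import numpy as np

def ucb_policy(t, counts, sums):
    """Pick the arm with the largest index b_j(t) = mu_hat_j(t) + sqrt(2 log t / T_j(t)).
    Arms not yet played are tried first (their index is effectively infinite)."""
    if np.any(counts == 0):
        return int(np.argmin(counts))             # initialisation: play each arm once
    mu_hat = sums / counts                        # empirical mean reward of each arm
    bonus = np.sqrt(2.0 * np.log(t + 1) / counts) # exploration bonus shrinks as T_j grows
    return int(np.argmax(mu_hat + bonus))

# Example usage (assuming simulate_regret from the earlier sketch is in scope):
# print(simulate_regret([0.9, 0.8, 0.5], ucb_policy, T=1000)[-1])  # logarithmic regret
```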

Lower bounds and optimal algorithms (cont.)
- For Bernoulli rewards, the UCB algorithm gives logarithmic regret, but its regret does not exactly match the lower bound
- A variant, called KL-UCB, does match the lower bound; the upper bound used is

  b_j(t) = max{ q ≤ 1 : T_j(t) I(µ̂_j(t), q) ≤ f(t) },

  where f(t) = log(t) + 3 log(log(t)) sets the confidence level
Interpretation:
- For optimal performance, an algorithm has to sample each suboptimal arm as many times as dictated by the lower bound; b_j(t) keeps track of how far arm j is from this sampling quota
- The term 3 log(log(t)) accounts for the uncertainty in µ̂_j and in the identity of the optimal arm
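
The KL-UCB index has no closed form, but since T_j(t) I(µ̂_j(t), q) is increasing in q for q ≥ µ̂_j(t), it can be found by bisection. A rough sketch (illustrative only; the helper below repeats the Bernoulli KL from the earlier snippet so it runs standalone):

```python
import numpy as np

def kl_bernoulli(x, y, eps=1e-12):
    """Bernoulli KL divergence I(x, y), as defined on the previous slide."""
    x, y = np.clip(x, eps, 1 - eps), np.clip(y, eps, 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def klucb_index(mu_hat, n_pulls, t):
    """KL-UCB index: largest q <= 1 with n_pulls * I(mu_hat, q) <= f(t),
    where f(t) = log t + 3 log log t, located by bisection on [mu_hat, 1]."""
    f_t = np.log(t) + 3.0 * np.log(max(np.log(t), 1e-12))  # crude guard for tiny t
    lo, hi = float(mu_hat), 1.0
    for _ in range(50):                 # 50 bisection steps are plenty for a sketch
        q = 0.5 * (lo + hi)
        if n_pulls * kl_bernoulli(mu_hat, q) <= f_t:
            lo = q                      # q still inside the confidence region: move up
        else:
            hi = q                      # q outside the region: move down
    return lo
```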

Lower bounds and optimal algorithms (cont.)
Thompson sampling (Thompson, 1933):
- Much older than UCB; conceived for adaptive clinical trials
- Bayesian origin: assume a uniform prior on µ_j for every j, and update the posterior p_{µ_j} based on the samples collected up to round t
- At round t, sample µ̂_j from the posterior p_{µ_j} for each arm, and pick the arm for which µ̂_j is largest
- Empirically, TS has better finite-sample mean performance than UCB algorithms, but its variance can be higher
- Kaufmann, Korda & Munos (ALT, 2012) showed that Thompson sampling is asymptotically optimal
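
For Bernoulli rewards the uniform prior is Beta(1, 1), so the posterior after s successes and f failures is Beta(1 + s, 1 + f). A minimal Thompson sampling policy in the same illustrative `(t, counts, sums)` interface used in the earlier sketches:

```python
import numpy as np

def thompson_policy(t, counts, sums, rng=np.random.default_rng(2)):
    """Thompson sampling for Bernoulli rewards with uniform Beta(1, 1) priors:
    sample a mean for each arm from its posterior and play the argmax."""
    successes = sums                    # observed 1-rewards per arm
    failures = counts - sums            # observed 0-rewards per arm
    samples = rng.beta(1.0 + successes, 1.0 + failures)
    return int(np.argmax(samples))

# Example usage (assuming simulate_regret from the earlier sketch is in scope):
# print(simulate_regret([0.9, 0.8, 0.5], thompson_policy, T=1000)[-1])
```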

Application to H∞-norm estimation
Applications of MABs to control problems are scarce. Some examples:
- P.R. Kumar. An adaptive controller inspired by recent results on learning from experts. In K.J. Åström, G.C. Goodwin & P.R. Kumar (eds.), Adaptive Control, Filtering, and Signal Processing, Springer, 1995
- M. Raginsky, A. Rakhlin, and S. Yüksel. Online convex programming and regularization in adaptive control. CDC, 2010
Our goal: apply MAB theory to problems of iterative identification.

Application to H∞-norm estimation (cont.)
Setup: y_τ = G(q) u_τ + e_τ (plant G driven by the designed input u_τ and the noise e_τ)
- Work with data batches of length N, sufficiently spaced in time
- At each iteration τ, an input batch u_τ = (u_1, ..., u_N) is designed and applied to the system
- The output of the system, y_τ = (y_1, ..., y_N), is collected
Goal: determine the H∞ norm of the system as accurately as possible.
Why? The H∞ norm is important for bounding model error (needed for robust control, etc.).

Application to H∞-norm estimation (cont.)
Main idea: design u_τ in the frequency domain, considering each frequency 2πk/N as an arm!
This is a standard MAB problem, except that:
- More than one arm can be pulled at once (in fact, we can choose a distribution over the arms!)
- The outcomes are complex-valued Gaussian distributed (with variance inversely proportional to the applied power)
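
To illustrate the arm construction only (a naive uniform-allocation sketch under my own assumptions, not the weighted Thompson sampling algorithm from the talk): each arm k corresponds to the frequency 2πk/N, pulling it returns a complex-Gaussian-corrupted sample of G(e^{j2πk/N}), and the H∞-norm estimate is the largest averaged magnitude over the candidate frequencies.

```python
import numpy as np

def estimate_hinf(G, N, pulls_per_arm=10, noise_std=0.1, seed=0):
    """Crude H-infinity norm estimate: treat each frequency 2*pi*k/N as an arm,
    pull every arm the same number of times (no bandit allocation yet), and
    return max_k of the averaged magnitude |G_hat(e^{j 2 pi k / N})|."""
    rng = np.random.default_rng(seed)
    estimates = []
    for k in range(N // 2 + 1):                      # frequencies in [0, pi]
        omega = 2.0 * np.pi * k / N
        noise = noise_std * (rng.standard_normal(pulls_per_arm)
                             + 1j * rng.standard_normal(pulls_per_arm))
        measurements = G(np.exp(1j * omega)) + noise  # complex Gaussian outcomes
        estimates.append(np.abs(measurements.mean()))
    return max(estimates)

# Example: G(z) = 1 / (1 - 0.9 z^{-1}) has H-infinity norm 1 / 0.1 = 10.
print(estimate_hinf(lambda z: 1.0 / (1.0 - 0.9 / z), N=256))
```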

Application to H∞-norm estimation (cont.)
- Derived a lower bound for the problem, which shows that choosing only one frequency per batch is (asymptotically in τ) no more restrictive than allowing a continuous spectrum for u_τ
- Proposed a weighted Thompson sampling algorithm with better regret than standard TS
- Still... power iterations have a better initial transient than the MAB algorithms!
[Figures: expected cumulative regret vs. number of rounds T, weighted Thompson sampling against Thompson sampling; mean-squared error vs. number of rounds T, weighted Thompson sampling against power iterations.]
More information on Matías' poster!

Summary
- MABs are a useful approach to adaptive control
- Standard theory is applicable to some problems of iterative identification and control
- A relevant example: H∞-norm estimation
- Control applications require non-trivial extensions of the basic MAB framework: interesting research directions!

Some references
- S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. NOW, 2012
- N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006
- B. Christian and T. Griffiths. Algorithms to Live By: The Computer Science of Human Decisions. William Collins, 2016

Thank you for your attention!