Linear Scalarized Knowledge Gradient in the Multi-Objective Multi-Armed Bandits Problem

ESANN 2014 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 23-25 April 2014, i6doc.com publ.

Saba Yahyaa, Madalina M. Drugan and Bernard Manderick
Computational Modeling group, Artificial Intelligence Lab, Computer Science Department
Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
{syahyaa, mdrugan, bmanderi}@vub.ac.be

Abstract. The multi-objective, multi-armed bandits (MOMABs) problem is a Markov decision process with stochastic rewards. Each arm generates a vector of rewards instead of a single reward, and these multiple rewards might be conflicting. The agent has a set of optimal arms, and its goal is not only to find the optimal arms but also to play them fairly. To find the optimal arm set, the agent uses a linear scalarized (LS) function, which converts the multi-objective arms into one-objective arms. The LS function is simple, but it cannot find the complete optimal arm set. We therefore extend the knowledge gradient (KG) policy to the LS function and propose two variants of linear scalarized-KG: LS-KG across arms and LS-KG across dimensions. We compare the two variants experimentally: LS-KG across arms finds the optimal arm set, while LS-KG across dimensions plays the optimal arms fairly.

1 Introduction

The one-objective multi-armed bandits (MABs) problem is a sequential Markov decision process in which an agent tries to optimize its decisions while improving its knowledge about the arms among which it has to choose. At each time step $t$, the agent pulls one arm from the available arm set $A$ and receives a reward signal. That reward is independent of the past rewards of the pulled arm and of all other arms. The rewards of arm $i$ are drawn from a normal distribution $N(\mu_i, \sigma_i^2)$ with mean $\mu_i$ and variance $\sigma_i^2$. The goal of the agent is to find the best arm $i^*$, the one with the maximum mean $\mu^* = \max_{i=1,\ldots,|A|} \mu_i$, so as to minimize the loss, or total expected regret, $R_L = L\mu^* - \sum_{t=1}^{L} \mu_i(t)$, of not pulling the best arm $i^*$ all the time, where $L$ is a fixed number of time steps and $\mu_i(t)$ is the mean of the arm selected at time step $t$. However, the mean $\mu_i$ and variance $\sigma_i^2$ are unknown to the agent. Thus, by pulling each arm, the agent improves its estimates $\hat{\mu}_i$ and $\hat{\sigma}_i^2$ of the true mean $\mu_i$ and variance $\sigma_i^2$, respectively. To find the optimal arm as soon as possible, the agent can use one of several policies, e.g. the Knowledge Gradient (KG) policy [1]. Intuitively, KG finds the optimal arm by adding an exploration bonus to the estimated mean of each arm $i$ and selecting the arm with the maximum estimated mean plus bonus.
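
As a purely illustrative aid (not part of the original paper), the following Python sketch samples a toy one-objective bandit with hypothetical means and variances and computes the total expected regret $R_L$ of an arbitrary pulling policy exactly as defined above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-objective bandit: |A| arms with Gaussian rewards N(mu_i, sigma_i^2).
mu = np.array([0.2, 0.5, 0.8])      # hypothetical true means (unknown to the agent)
sigma = np.array([0.1, 0.1, 0.1])   # hypothetical true standard deviations
L = 1000                            # horizon

pulls = rng.integers(0, len(mu), size=L)       # stand-in policy: uniformly random pulls
rewards = rng.normal(mu[pulls], sigma[pulls])  # one reward signal per pull

# Total expected regret R_L = L * mu_star - sum_t mu_{i(t)} of this pulling policy.
regret = L * mu.max() - mu[pulls].sum()
print(f"realized total reward: {rewards.sum():.1f}, expected regret: {regret:.1f}")
```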

In this paper, we extend the KG policy to the multi-objective, multi-armed bandits problem (MOMABs). In the multi-objective (MO) setting there is a set of Pareto optimal arms that are incomparable, i.e. that cannot be ordered using a designed partial order relationship. The Pareto optimal arm set (Pareto front) can be found by using a linear scalarized (LS) function, which converts the MO space into a single-objective space, i.e. the mean vectors are transformed into scalar values [2]. The LS function is simple but cannot find all the optimal arms when the mean vector set of the Pareto front is a non-convex (concave) set. As a result, we extend the KG policy to be used inside the linear scalarization function in order to find the optimal arms of a concave mean vector set.

This paper is organized as follows. First, we give background information on the algorithms used (Section 2). We then introduce the MOMAB framework, propose two variants of the linear KG scalarization function, linear scalarized-KG across arms and across dimensions, and present the scalarized multi-objective bandits algorithm (Section 3). We describe the experimental setup, followed by the experimental results (Section 4). Finally, we conclude (Section 5).

2 Background

Here, we introduce the LS function and the regret measures for the MOMAB problem. Consider a MOMAB problem with $|A|$ arms ($|A| \geq 2$) and $D$ objectives (dimensions) per arm. The mean vector of the rewards of arm $i$, $1 \leq i \leq |A|$, is $\boldsymbol{\mu}_i = (\mu_i^1, \ldots, \mu_i^D)^T$, where $T$ denotes the transpose. Each objective $d$, $1 \leq d \leq D$, has a specific value, and the objectives might conflict with each other: the reward of arm $i$ can be better than that of another arm $j$ in one objective and worse in another.

Linear Scalarized (LS) Function. The LS function converts the MO problem into a one-objective problem [2]. However, solving a MO optimization problem means finding the whole Pareto front, so we need a set of scalarized functions $S$ to generate the variety of elements belonging to the Pareto front. The LS function assigns to each value $\mu_i^d$ of the mean vector $\boldsymbol{\mu}_i$ of arm $i$ a weight $w^d$, and the result is the sum of these weighted mean values. The LS function $f^j$ is defined as

  $f^j(\boldsymbol{\mu}_i) = w^1 \mu_i^1 + \cdots + w^D \mu_i^D$    (1)

where $f^j$ is a linear function with scalarization $j$, $1 \leq j \leq |S|$. Each $j$ has a different set of $D$ predefined weights $\boldsymbol{w}^j = (w^1, \ldots, w^D)$, such that $\sum_{d=1}^{D} w^d = 1$. Linear scalarization is very popular because of its simplicity, but it cannot find all the arms in the Pareto optimal arm set $A^*$ if the corresponding mean vector set is concave. After transforming the multi-objective problem into a single-objective one, the LS function selects the arm $i^*$ with the maximum function value, i.e. $i^* = \operatorname{argmax}_{i \in A} f^j(\boldsymbol{\mu}_i)$.

Regret Metrics. To measure the performance of the LS function, [3] proposed two regret criteria. The scalarized regret measures the distance between the maximum value of a scalarized function and the scalarized value of the arm pulled at time step $t$: it is the difference between the maximum value of the LS function $f^j$ on the arm set $A$ and the scalarized value of the arm $k$ pulled by $f^j$ at time step $t$,

  $R_{scalarized}^j(t) = \max_{i \in A} f^j(\boldsymbol{\mu}_i) - f^j(\boldsymbol{\mu}_k)(t)$.

The unfairness regret is related to the variance in drawing all the optimal arms, i.e. the variance of the number of times the arms in $A^*$ are pulled:

  $R_{unfairness}(t) = \frac{1}{|A^*|} \sum_{i^* \in A^*} \left( N_{i^*}(t) - \bar{N}_{A^*}(t) \right)^2$

where $R_{unfairness}(t)$ is the unfairness regret at time step $t$, $|A^*|$ is the number of optimal arms, $N_{i^*}(t)$ is the number of times the optimal arm $i^*$ has been selected up to time step $t$, and $\bar{N}_{A^*}(t)$ is the average number of times the optimal arms $i^* = 1, \ldots, |A^*|$ have been selected up to time step $t$.
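
To make the two measures concrete, here is a small Python sketch (an illustration, not the authors' code) of the linear scalarization of Eq. (1) and of the two regret metrics; the arm means, weights and pull counts are hypothetical:

```python
import numpy as np

def linear_scalarization(mean_vectors, weights):
    """Eq. (1): f^j(mu_i) = sum_d w^d * mu_i^d, evaluated for every arm at once."""
    return mean_vectors @ weights

def scalarized_regret(mean_vectors, weights, pulled_arm):
    """Gap between the best scalarized value and the scalarized value of the pulled arm."""
    values = linear_scalarization(mean_vectors, weights)
    return values.max() - values[pulled_arm]

def unfairness_regret(pull_counts, optimal_arms):
    """Variance of the pull counts over the Pareto optimal arm set A*."""
    n = np.asarray(pull_counts, dtype=float)[list(optimal_arms)]
    return float(np.mean((n - n.mean()) ** 2))

# Hypothetical bi-objective toy problem with three arms and one weight set.
means = np.array([[0.55, 0.50],
                  [0.53, 0.51],
                  [0.50, 0.57]])
w = np.array([0.5, 0.5])                      # weights sum to 1
print(scalarized_regret(means, w, pulled_arm=1))
print(unfairness_regret([340, 330, 330], optimal_arms=[0, 1, 2]))
```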

Knowledge Gradient (KG) Policy. KG is an index policy that computes for each arm $i$ the index $V_i^{KG}$ as follows [1]:

  $V_i^{KG} = \bar{\sigma}_i \, x\!\left( - \left| \frac{\hat{\mu}_i - \max_{j \neq i,\, j \in A} \hat{\mu}_j}{\bar{\sigma}_i} \right| \right)$    (2)

where $\bar{\sigma}_i = \hat{\sigma}_i / \sqrt{N_i}$ is the root mean square error (RMSE) of the estimated mean $\hat{\mu}_i$ of arm $i$, and $x(\zeta) = \zeta \Phi(\zeta) + \phi(\zeta)$, where $\Phi$ and $\phi$ are the cumulative distribution and the density of the standard normal distribution $N(0,1)$, respectively. KG chooses the arm $i^{KG}$ with the largest estimated mean plus index, i.e. $i^{KG} = \operatorname{argmax}_{i \in A} \hat{\mu}_i + (L - t) V_i^{KG}$, where $L$ is the horizon of the experiment. The KG policy prefers arms about which comparatively little is known, i.e. arms whose reward distribution around the estimated mean $\hat{\mu}_i$ has a larger estimated variance $\hat{\sigma}_i^2$. Thus, KG prefers arm $i$ over its alternatives if its confidence in the estimated mean $\hat{\mu}_i$ is low. In [5], it is shown that the KG policy outperforms other policies on one-objective MABs in terms of the average frequency of optimal selection; moreover, the KG policy has no parameter that needs to be tuned. For these reasons, we extend it to a scalarized multi-objective KG.
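
A minimal Python sketch of the one-objective KG decision rule of Eq. (2), as reconstructed above, might look as follows (the estimates and counts are hypothetical; scipy supplies $\Phi$ and $\phi$):

```python
import numpy as np
from scipy.stats import norm

def kg_choose(mu_hat, sigma_hat, counts, t, horizon):
    """One-objective KG (Eq. 2): pick argmax_i  mu_hat_i + (L - t) * V_i^KG."""
    sigma_bar = sigma_hat / np.sqrt(counts)            # RMSE of the estimated means
    v = np.empty_like(mu_hat)
    for i in range(len(mu_hat)):
        best_other = np.max(np.delete(mu_hat, i))
        zeta = -abs(mu_hat[i] - best_other) / sigma_bar[i]
        v[i] = sigma_bar[i] * (zeta * norm.cdf(zeta) + norm.pdf(zeta))  # x(z) = z*Phi(z) + phi(z)
    return int(np.argmax(mu_hat + (horizon - t) * v))

# Hypothetical estimates after a few pulls of a 3-armed bandit.
mu_hat = np.array([0.52, 0.50, 0.45])
sigma_hat = np.array([0.10, 0.30, 0.05])
counts = np.array([5, 2, 5])
print(kg_choose(mu_hat, sigma_hat, counts, t=12, horizon=1000))  # selects the poorly explored arm 1
```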

3 MOMAB Framework

In this section, we present the linear scalarized knowledge gradient (scalarized-KG) functions, which make use of the estimated means and variances. At each time step $t$, the agent selects one arm $i$ and receives a reward vector. The reward vector is drawn from a normal distribution $N(\boldsymbol{\mu}_i, \boldsymbol{\sigma}_i^2)$, where $\boldsymbol{\mu}_i = (\mu_i^1, \ldots, \mu_i^D)^T$ is the mean vector and $\boldsymbol{\sigma}_i^2 = (\sigma_i^{2,1}, \ldots, \sigma_i^{2,D})^T$ is the diagonal of the covariance matrix of arm $i$, since the reward distributions of the different objectives are assumed to be independent. These parameters are unknown to the agent, but by drawing arm $i$ the agent can update its estimates $\hat{\boldsymbol{\mu}}_i$ and $\hat{\boldsymbol{\sigma}}_i^2$ in each dimension $d$ as follows [4]:

  $N_i^{t+1} = N_i^t + 1, \qquad \hat{\mu}_{t+1}^d = \left(1 - \frac{1}{N_i^{t+1}}\right) \hat{\mu}_t^d + \frac{1}{N_i^{t+1}}\, r_{t+1}^d$    (3)

  $\hat{\sigma}_{t+1}^{2,d} = \left(1 - \frac{1}{N_i^{t+1}}\right) \hat{\sigma}_t^{2,d} + \frac{1}{N_i^{t+1}} \left( r_{t+1}^d - \hat{\mu}_t^d \right)^2$    (4)

where $r_{t+1}^d$ is the reward collected from arm $i$ in dimension $d$, $N_i^{t+1}$ is the updated number of times arm $i$ has been selected, and $\hat{\mu}_{t+1}^d$ and $\hat{\sigma}_{t+1}^{2,d}$ are the updated estimated mean and variance of arm $i$ in dimension $d$, respectively.

Linear scalarized-KG across arms (LS1-KG) immediately converts the MO estimated mean $\hat{\boldsymbol{\mu}}_i$ and estimated variance $\hat{\boldsymbol{\sigma}}_i^2$ of each arm to one dimension and then computes the corresponding exploration bound $ExpB_i$. At each time step $t$, LS1-KG weighs both the estimated mean vector $[\hat{\mu}_i^1, \ldots, \hat{\mu}_i^D]^T$ and the estimated variance vector $[\hat{\sigma}_i^{2,1}, \ldots, \hat{\sigma}_i^{2,D}]^T$ of each arm $i$ and converts the MO vectors into one-objective values by summing the elements of each weighted vector. We thus obtain a one-dimensional MAB, for which KG computes for each arm a bound that depends on all the other arms and selects the arm with the maximum estimated mean plus bound. LS1-KG works as follows:

  $\tilde{\mu}_i = f^j(\hat{\boldsymbol{\mu}}_i) = w^1 \hat{\mu}_i^1 + \cdots + w^D \hat{\mu}_i^D, \qquad \tilde{\sigma}_i^2 = f^j(\hat{\boldsymbol{\sigma}}_i^2) = w^1 \hat{\sigma}_i^{2,1} + \cdots + w^D \hat{\sigma}_i^{2,D}$

  $\bar{\tilde{\sigma}}_i = \tilde{\sigma}_i / \sqrt{N_i}, \qquad v_i = \bar{\tilde{\sigma}}_i \, x\!\left( - \left| \frac{\tilde{\mu}_i - \max_{j \neq i,\, j \in A} \tilde{\mu}_j}{\bar{\tilde{\sigma}}_i} \right| \right)$    (5)

where $f^j$ is an LS function with a predefined weight set $(w^1, \ldots, w^D)^j$, $\tilde{\mu}_i$ and $\tilde{\sigma}_i^2$ are the scalarized estimated mean and variance of arm $i$, which are scalar values, $\bar{\tilde{\sigma}}_i$ is the RMSE of arm $i$, and $v_i$ is the KG index of arm $i$. LS1-KG selects the optimal arm $i^*$ according to

  $i^{*}_{LS_1\text{-}KG} = \operatorname{argmax}_{i \in A} \left( \tilde{\mu}_i + ExpB_i \right) = \operatorname{argmax}_{i \in A} \left( \tilde{\mu}_i + (L - t)\, |A|\, D\, v_i \right)$

where $ExpB_i$ is the exploration bound of arm $i$ and $D$ is the number of dimensions.

Linear scalarized-KG across dimensions (LS2-KG) computes a bound vector $\boldsymbol{ExpB}_i = [ExpB_i^1, \ldots, ExpB_i^D]^T$ for each arm, adds $\boldsymbol{ExpB}_i$ to the corresponding estimated mean vector $\hat{\boldsymbol{\mu}}_i$, and only then converts the multi-objective problem into a one-objective one. At each time step $t$, LS2-KG computes a bound for every dimension of each arm, sums the estimated mean of each dimension with its corresponding bound, weighs each dimension, and converts the multi-dimensional vector into a one-dimensional value by summing over the vector of each arm. LS2-KG works as follows (both selection rules are also illustrated in the code sketch after Figure 1):

  $f^j(\hat{\boldsymbol{\mu}}_i) = w^1 \left( \hat{\mu}_i^1 + ExpB_i^1 \right) + \cdots + w^D \left( \hat{\mu}_i^D + ExpB_i^D \right)$, where

  $ExpB_i^d = (L - t)\, |A|\, D\, v_i^d, \qquad v_i^d = \bar{\sigma}_i^d \, x\!\left( - \left| \frac{\hat{\mu}_i^d - \max_{j \neq i,\, j \in A} \hat{\mu}_j^d}{\bar{\sigma}_i^d} \right| \right), \quad 1 \leq d \leq D$    (6)

where $v_i^d$, $\hat{\mu}_i^d$, $\bar{\sigma}_i^d$, and $ExpB_i^d$ are the KG index, the estimated mean, the RMSE, and the bound of arm $i$ in dimension $d$, respectively. LS2-KG selects the optimal arm $i^*$ with the maximum value of $f^j(\hat{\boldsymbol{\mu}}_i)$, i.e. $i^{*}_{LS_2\text{-}KG} = \operatorname{argmax}_{i = 1, \ldots, |A|} f^j(\hat{\boldsymbol{\mu}}_i)$.

1.  Input: length of trajectory $L$; type of scalarized function $f$; set of scalarized functions $S = (f^1, \ldots, f^{|S|})$; rewards $r^d \sim N(\mu^d, \sigma_r^2)$.
2.  Initialize: For $s = 1$ to $|S|$: play each arm "Initial" steps; observe $(\boldsymbol{r}_i)^s$; update $N_i^s$, $(\hat{\boldsymbol{\mu}}_i)^s$, $(\hat{\boldsymbol{\sigma}}_i)^s$. End
3.  Repeat
4.    Select a scalarized function $s \in S$ uniformly at random
5.    Select the optimal arm $i^*$ that maximizes $f^s$
6.    Observe the reward vector $\boldsymbol{r}_{i^*} = [r_{i^*}^1, \ldots, r_{i^*}^D]^T$
7.    Update $N_{i^*}^s$, $\hat{\boldsymbol{\mu}}_{i^*}$, $\hat{\boldsymbol{\sigma}}_{i^*}$
8.    Compute the unfairness regret and the scalarized regret
9.  Until $L$
10. Output: unfairness regret; scalarized regret.

Fig. 1: Algorithm (scalarized multi-objective function).

The scalarized multi-objective bandits. The pseudocode of the scalarized MOMAB algorithm is given in Figure 1. It is given the type of the scalarized function $f$ ($f$ is either LS1-KG or LS2-KG) and the scalarized function set $(f^1, \ldots, f^{|S|})$, where each scalarized function $f^s$ has a different weight set $\boldsymbol{w}^s = (w^{1,s}, \ldots, w^{D,s})$. The algorithm first plays each arm under each scalarized function $f^s$ for "Initial" plays (step 2). $N_i^s$ is the number of times arm $i$ has been pulled under the scalarized function $f^s$; $(\boldsymbol{r}_i)^s$ is the reward vector of the pulled arm $i$, drawn from a normal distribution $N(\boldsymbol{\mu}, \boldsymbol{\sigma}_r^2)$, where $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\sigma}_r^2$ is the variance vector of the reward; and $(\hat{\boldsymbol{\mu}}_i)^s$ and $(\hat{\boldsymbol{\sigma}}_i)^s$ are the estimated mean and standard deviation vectors of arm $i$ under the scalarized function $s$, respectively. After this initial phase, the algorithm repeatedly chooses one of the scalarized functions uniformly at random (step 4), selects the optimal arm $i^*$ that maximizes this scalarized function (step 5), simulates the selected arm $i^*$, and updates $N_{i^*}^s$, $(\hat{\boldsymbol{\mu}}_{i^*})^s$ and $(\hat{\boldsymbol{\sigma}}_{i^*})^s$ (step 7). This procedure is repeated until the end of the $L$ plays. Note that the proposed algorithm is an adapted version of the one in [3], but here it can be applied to KG with normally distributed rewards.
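
The two selection rules can be sketched in Python as follows (an illustration of Eqs. (5) and (6) as stated above, not the authors' reference implementation). `mu_hat` and `var_hat` are |A| x D arrays of per-dimension estimates, `counts` holds the pull counts, `weights` is one weight vector, and the small KG helpers are repeated so that the sketch stands alone:

```python
import numpy as np
from scipy.stats import norm

def x_fn(zeta):
    """x(z) = z * Phi(z) + phi(z), used by the KG index."""
    return zeta * norm.cdf(zeta) + norm.pdf(zeta)

def kg_indices(mu_hat, sigma_bar):
    """KG index of every arm for scalar estimates mu_hat with RMSE sigma_bar."""
    v = np.empty_like(mu_hat)
    for i in range(len(mu_hat)):
        best_other = np.max(np.delete(mu_hat, i))
        v[i] = sigma_bar[i] * x_fn(-abs(mu_hat[i] - best_other) / sigma_bar[i])
    return v

def ls1_kg_choose(mu_hat, var_hat, counts, weights, t, horizon):
    """LS1-KG: scalarize means and variances first, then apply one KG bound (Eq. 5)."""
    n_arms, n_dims = mu_hat.shape
    mu_tilde = mu_hat @ weights                    # scalarized estimated means
    var_tilde = var_hat @ weights                  # scalarized estimated variances
    sigma_bar = np.sqrt(var_tilde) / np.sqrt(counts)
    v = kg_indices(mu_tilde, sigma_bar)
    bound = (horizon - t) * n_arms * n_dims * v    # ExpB_i
    return int(np.argmax(mu_tilde + bound))

def ls2_kg_choose(mu_hat, var_hat, counts, weights, t, horizon):
    """LS2-KG: one KG bound per dimension, added to the mean, then scalarize (Eq. 6)."""
    n_arms, n_dims = mu_hat.shape
    bounded = np.empty_like(mu_hat)
    for d in range(n_dims):
        sigma_bar_d = np.sqrt(var_hat[:, d]) / np.sqrt(counts)
        v_d = kg_indices(mu_hat[:, d], sigma_bar_d)
        bounded[:, d] = mu_hat[:, d] + (horizon - t) * n_arms * n_dims * v_d
    return int(np.argmax(bounded @ weights))
```

The $(L - t)\,|A|\,D\,v$ term is what matters here: with a concave mean vector set the scalarized means alone would never rank the inner Pareto optimal arms first, which is exactly the limitation of plain linear scalarization discussed in Section 2.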

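The outer loop of Figure 1 can then be sketched as follows (schematic only, with hypothetical names; `select_arm` stands for either `ls1_kg_choose` or `ls2_kg_choose` from the previous sketch, the number of initial plays is an assumed parameter, and the computation of the regret metrics in step 8 is omitted):

```python
import numpy as np

def scalarized_momab(true_means, reward_std, weight_sets, select_arm,
                     horizon=1000, initial_plays=2, seed=0):
    """Schematic version of the loop in Fig. 1 for one type of scalarized function."""
    rng = np.random.default_rng(seed)
    n_arms, n_dims = true_means.shape
    n_funcs = len(weight_sets)
    counts = np.zeros((n_funcs, n_arms), dtype=int)     # N_i^s
    mu_hat = np.zeros((n_funcs, n_arms, n_dims))        # estimated means per function s
    var_hat = np.zeros((n_funcs, n_arms, n_dims))       # estimated variances per function s

    def pull(s, i):
        r = rng.normal(true_means[i], reward_std)       # observe a reward vector (step 6)
        counts[s, i] += 1
        n = counts[s, i]
        delta = r - mu_hat[s, i]
        var_hat[s, i] = (1 - 1 / n) * var_hat[s, i] + delta ** 2 / n   # Eq. (4)
        mu_hat[s, i] = (1 - 1 / n) * mu_hat[s, i] + r / n              # Eq. (3)

    for s in range(n_funcs):                            # step 2: initial plays
        for i in range(n_arms):
            for _ in range(initial_plays):
                pull(s, i)

    for t in range(horizon):                            # steps 3-9
        s = int(rng.integers(n_funcs))                  # step 4: pick a weight set at random
        i = select_arm(mu_hat[s], var_hat[s], counts[s], weight_sets[s], t, horizon)
        pull(s, i)                                      # steps 6-7
        # step 8 (scalarized and unfairness regrets) is omitted in this sketch
    return counts
```

Under these assumptions, calling `scalarized_momab` with the mean vectors of the experiments in the next section, the corresponding reward standard deviations, a list of weight vectors and one of the two choosers would produce pull-count statistics of the kind reported in Table 1.
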
4 Experiments

We experimentally compare the two linear scalarized-KG variants of Section 3, across arms and across dimensions, on a MOMAB with a concave mean vector set. The performance measures are: 1) the average number of times the optimal arms are pulled and the average number of times each individual optimal arm is drawn, averaged over $M$ experiments; and 2) the scalarized and the unfairness average regret (Section 2) at each time step, averaged over $M$ experiments. The number of experiments $M$ and the horizon of each experiment $L$ are both 1000. The rewards of each arm $i$ in each dimension $d$, $1 \leq d \leq D$, are drawn from a normal distribution $N(\boldsymbol{\mu}_i, \boldsymbol{\sigma}_{i,r}^2)$, where $\boldsymbol{\mu}_i = [\mu_i^1, \ldots, \mu_i^D]^T$ is the mean vector and $\boldsymbol{\sigma}_{i,r}^2 = [\sigma_{i,r}^{2,1}, \ldots, \sigma_{i,r}^{2,D}]^T$ is the reward variance vector. The variance of each arm in each dimension equals 0.01, and the mean of each arm in each dimension lies in [0, 1]. The means and the variances of the arms are unknown to the agent. KG needs an estimated variance for each arm, $\hat{\boldsymbol{\sigma}}_i^2$; therefore, each arm is initially played "Initial" times to estimate its variance. The number of Pareto optimal arms $|A^*|$ is unknown to the agent, therefore we set $|A^*| = |A|$.

Table 1: Average number of times the optimal arms are pulled, and average number of times each individual optimal arm is pulled, on the concave 2-objective, 6-armed problem.

  Functions   A*            a1*           a2*          a3*           a4*
  LS1-KG      999.9 ± .04   ± 9.7         .6 ± 7.4     30.5 ± 4.4    353.8 ± .
  LS2-KG      999.7 ± .33   368. ± 7.6    303. ± 8.    96 ± 9.3      3.4 ± 8.5

Experiment 1: We use the same example as in [3], which contains a concave mean vector set with $|A| = 6$ and $D = 2$. The true mean vector set is $(\boldsymbol{\mu}_1 = [0.55, 0.5]^T$, $\boldsymbol{\mu}_2 = [0.53, 0.51]^T$, $\boldsymbol{\mu}_3 = [0.52, 0.54]^T$, $\boldsymbol{\mu}_4 = [0.5, 0.57]^T$, $\boldsymbol{\mu}_5 = [0.51, 0.51]^T$, $\boldsymbol{\mu}_6 = [0.5, 0.5]^T)$. Note that the Pareto optimal arm set (Pareto front) is $A^* = (a_1^*, a_2^*, a_3^*, a_4^*)$, where $a_i^*$ refers to optimal arm $i$. The suboptimal arm $a_5$ is not dominated by the two optimal arms $a_1^*$ and $a_4^*$, but $a_2^*$ and $a_3^*$ dominate $a_5$, while $a_6$ is dominated by all the other mean vectors. We consider 11 weight sets, i.e. $w = \{(1, 0)^T, (0.9, 0.1)^T, \ldots, (0.1, 0.9)^T, (0, 1)^T\}$.

Table 1 gives the average number of times (± confidence interval bounds) the optimal arms are selected (column A*) and each individual optimal arm is pulled (columns a1*, a2*, a3*, and a4*), for the scalarized functions listed in column Functions. Table 1 shows that the KG policy is able to explore all the optimal arms. LS1-KG performs better than LS2-KG in selecting and playing the optimal arms fairly: LS1-KG prefers the optimal arm a4*, while LS2-KG prefers the optimal arm a1*.
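
As an illustration of the dominance relations described above (not part of the paper, and using the mean vectors as reconstructed here), a few lines of Python recover the Pareto front of this example:

```python
import numpy as np

def dominates(u, v):
    """u Pareto-dominates v when u is >= v in every objective and > in at least one."""
    return np.all(u >= v) and np.any(u > v)

# Mean vectors of the 2-objective, 6-armed example (as listed above).
means = np.array([
    [0.55, 0.50],  # a1
    [0.53, 0.51],  # a2
    [0.52, 0.54],  # a3
    [0.50, 0.57],  # a4
    [0.51, 0.51],  # a5
    [0.50, 0.50],  # a6
])

pareto_front = [
    i for i in range(len(means))
    if not any(dominates(means[j], means[i]) for j in range(len(means)) if j != i)
]
print("Pareto optimal arms:", [f"a{i + 1}" for i in pareto_front])  # expected: a1, a2, a3, a4
```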

Increasing arms and dimensions: In order to compare the performance of the linear scalarized-KG variants on a more complex MOMAB problem, we add another 14 arms and another 3 dimensions to Experiment 1, resulting in a 5-objective, 20-armed problem. The Pareto optimal arm set $A^*$ now contains 7 arms. We use a set of weight vectors $w$. Figure 2 gives the average scalarized and unfairness regret performance. The x-axis is the horizon of each experiment; the y-axis is either the average scalarized or the average unfairness regret, averaged over 1000 experiments. Figure 2 shows that LS1-KG performs better than LS2-KG according to the scalarized regret, while LS2-KG performs better than LS1-KG according to the unfairness regret.

Fig. 2: Performance on the 5-objective, 20-armed problem with $|A^*| = 7$. The left sub-figure shows the scalarized regret; the right sub-figure shows the unfairness regret.

5 Conclusions

We presented the MOMAB problem, the linear scalarized (LS) function, the regret measures and the KG policy, and we extended the KG policy to the MOMABs. We proposed two variants of linear scalarized-KG: across arms (LS1-KG) and across dimensions (LS2-KG). Finally, we compared LS1-KG and LS2-KG and concluded that: 1) the KG policy is able to find the Pareto optimal arm set in a concave mean vector set; and 2) when the number of arms and dimensions is increased, LS1-KG outperforms LS2-KG according to the scalarized regret, while LS2-KG outperforms LS1-KG according to the unfairness regret.

References

[1] I. O. Ryzhov, W. B. Powell and P. I. Frazier. The knowledge-gradient policy for a general class of online learning problems. Operations Research, 2012.

[2] G. Eichfelder, editor. Adaptive Scalarization Methods in Multiobjective Optimization. Springer-Verlag, Berlin Heidelberg, 2008.

[3] M. M. Drugan and A. Nowé. Designing multi-objective multi-armed bandits algorithms: A study. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2013), 2013.

[4] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley and Sons, New York, USA, 2007.

[5] S. Q. Yahyaa and B. Manderick. The exploration vs exploitation trade-off in the multi-armed bandit problem: An empirical study. In M. Verleysen, editor, Proceedings of the 20th European Symposium on Artificial Neural Networks (ESANN 2012), d-side publ., pages 549-554, Belgium, April 2012.