Alireza Shafaei. Machine Learning Reading Group, The University of British Columbia, Summer 2017.



Outline: Online Convex Optimization (OCO); Bandit Convex Optimization; Stochastic Bandits; Bernoulli Bandits.

Online Convex Optimization (OCO). At each iteration $t$, the player chooses $x_t \in K$. A convex loss function $f_t \in F : K \to \mathbb{R}$ is revealed, and a cost $f_t(x_t)$ is incurred. $F$ is a set of bounded functions. $f_t$ is revealed only after $x_t$ has been chosen, and $f_t$ can be adversarially chosen.

Regret. Given an algorithm $A$, the regret of $A$ after $T$ iterations is defined as
$$\mathrm{regret}_T(A) := \sup_{\{f_i\}_{i=1}^{T} \subseteq F} \left\{ \sum_{t=1}^{T} f_t(x_t) - \min_{x \in K} \sum_{t=1}^{T} f_t(x) \right\}.$$
Online Gradient Descent achieves regret $O(GD\sqrt{T})$, where $G$ bounds the gradient norms and $D$ is the diameter of $K$. If each $f_t$ is $\alpha$-strongly convex, this improves to $O\!\left(\frac{G^2}{2\alpha}(1 + \log T)\right)$.
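To make the update concrete, here is a minimal sketch of projected Online Gradient Descent (not from the slides); the Euclidean-ball decision set, the step size $\eta_t = D/(G\sqrt{t})$, and the `loss_grads` interface are assumptions chosen for illustration.

```python
import numpy as np

def project_onto_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius} (our illustrative choice of K)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def online_gradient_descent(loss_grads, dim, D=2.0, G=1.0):
    """Projected OGD: play x_t, receive the gradient of f_t at x_t, step and project.

    `loss_grads` is an iterable of callables; each returns the (sub)gradient of f_t
    at the queried point. With eta_t = D / (G * sqrt(t)) the regret is O(G * D * sqrt(T)).
    """
    x = np.zeros(dim)
    plays = []
    for t, grad_ft in enumerate(loss_grads, start=1):
        plays.append(x.copy())                               # commit to x_t before f_t is revealed
        g = grad_ft(x)                                       # feedback: gradient of f_t at x_t
        eta = D / (G * np.sqrt(t))                           # decaying step size
        x = project_onto_ball(x - eta * g, radius=D / 2.0)   # gradient step, then project back into K
    return plays

# Example: quadratic losses f_t(x) = ||x - c_t||^2 with random targets c_t.
rng = np.random.default_rng(0)
targets = [rng.uniform(-1, 1, size=3) for _ in range(100)]
grads = [lambda x, c=c: 2 * (x - c) for c in targets]
xs = online_gradient_descent(grads, dim=3)
```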

Bandit Convex Optimization (BCO). (Material from [1].) In OCO we had access to the gradient $\nabla f_t(x_t)$; in BCO we only observe the incurred cost $f_t(x_t)$. This limited feedback further constrains the BCO setting.

In data networks, the decision maker can measure the round-trip delay (RTD) of a packet, but rarely has access to the congestion pattern of the entire network. In ad placement, the search engine can inspect which ads were clicked through, but cannot know whether different ads, had they been chosen, would have been clicked through or not. Given a fixed budget, how should one allocate resources among research projects whose outcomes are only partially known at the time of allocation and may change over time? (Wikipedia.) Originally considered by Allied scientists in World War II, the problem proved so intractable that, according to Peter Whittle, it was proposed that it be dropped over Germany so that German scientists could also waste their time on it.

Multi-Armed Bandits (MAB). On each iteration $t$, the player chooses an action $i_t$ from a predefined set of discrete actions $\{1,\dots,n\}$. An adversary, independently, chooses a loss in $[0,1]$ for each action. Only the loss associated with $i_t$ is then revealed to the player. There are a variety of MAB specifications with various assumptions and constraints. This definition is similar to the multi-expert problem, except that we do not observe the losses of the other experts.

MAB as a BCO. The algorithms usually choose an action with respect to a distribution over the actions. If we define $K = \Delta_n$, the $n$-dimensional simplex, then
$$f_t(x) = \ell_t^\top x = \sum_{i=1}^{n} \ell_t(i)\,x(i), \qquad x \in K.$$
We have an exploration-exploitation trade-off. A simple approach would be:
Exploration: with some probability, explore by choosing an action uniformly at random, and construct an estimate of the action losses from the feedback.
Exploitation: otherwise, use the estimates to make a decision.

An algorithm can be constructed along these lines (a minimal sketch is given below):
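This sketch assumes an exploration probability delta, uniform exploration, and a greedy exploitation rule over the accumulated importance-weighted estimates; the exact algorithm from the slide is not reproduced here, so treat the details as illustrative.

```python
import numpy as np

def explore_exploit_bandit(loss_fn, n, T, delta=0.1, rng=None):
    """Simple MAB scheme: with probability delta explore uniformly and build an
    unbiased loss estimate hat_l_t(i) = (n / delta) * l_t(i) * 1[i_t == i];
    otherwise exploit by playing the arm with the smallest estimated total loss.

    `loss_fn(t, i)` returns the loss in [0, 1] of arm i at round t; only the
    played arm's loss is ever queried (bandit feedback).
    """
    rng = rng or np.random.default_rng(0)
    est_total_loss = np.zeros(n)   # running sum of importance-weighted loss estimates
    total_loss = 0.0
    for t in range(T):
        explore = rng.random() < delta
        if explore:
            i_t = int(rng.integers(n))              # uniform exploration
        else:
            i_t = int(np.argmin(est_total_loss))    # exploit the current estimates
        loss = loss_fn(t, i_t)                      # bandit feedback for the played arm only
        total_loss += loss
        if explore:
            # unbiased estimator: E[hat_l_t(i)] = delta * (1/n) * (n/delta) * l_t(i) = l_t(i)
            est_total_loss[i_t] += (n / delta) * loss
    return total_loss
```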

Analysis. This algorithm guarantees
$$\mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t(i_t) - \min_i \sum_{t=1}^{T} \ell_t(i)\right] \le O(T^{3/4}\, n).$$
Writing $b_t = 1$ for the event that round $t$ is an exploration round (which happens with probability $\delta$), the loss estimate is unbiased:
$$\mathbb{E}[\hat\ell_t(i)] = \mathbb{P}[b_t = 1]\;\mathbb{P}[i_t = i \mid b_t = 1]\;\frac{n}{\delta}\,\ell_t(i) = \ell_t(i),$$
and its magnitude is bounded, $\|\hat\ell_t\|_2 \le \frac{n}{\delta}\,\ell_t(i_t) \le \frac{n}{\delta}$. Consequently $\mathbb{E}[\hat f_t] = f_t$.


EXP3 has a worst-case near-optimal regret bound of $O(\sqrt{Tn \log n})$. See p. 104 of the OCO book [1] for the proof.
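For comparison with the simple scheme above, here is a minimal EXP3 sketch for losses in [0, 1]; the fixed learning rate eta = sqrt(log(n) / (T * n)) is a standard textbook choice rather than something taken from the slides.

```python
import numpy as np

def exp3(loss_fn, n, T, rng=None):
    """EXP3: exponential weights over arms with importance-weighted loss estimates.

    `loss_fn(t, i)` returns the adversarial loss in [0, 1] of arm i at round t.
    The expected regret is O(sqrt(T * n * log n)).
    """
    rng = rng or np.random.default_rng(0)
    eta = np.sqrt(np.log(n) / (T * n))          # standard fixed learning rate
    weights = np.ones(n)
    total_loss = 0.0
    for t in range(T):
        probs = weights / weights.sum()          # sampling distribution over arms
        i_t = int(rng.choice(n, p=probs))        # play a random arm
        loss = loss_fn(t, i_t)                   # bandit feedback
        total_loss += loss
        est = loss / probs[i_t]                  # importance-weighted estimate of l_t(i_t)
        weights[i_t] *= np.exp(-eta * est)       # exponential update for the played arm only
    return total_loss
```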

Stochastic Bandits. On each iteration $t$, the player chooses an action $i_t$ from a predefined set of discrete actions $\{1,\dots,n\}$. Each action $i$ has an underlying (fixed) probability distribution $P_i$ with mean $\mu_i$. The loss associated with $i_t$ is then revealed to the player (a sample drawn from $P_{i_t}$). Each $P_i$ could be a simple Bernoulli variable. A more complex version could assume a Markov process for each action, in which the state of one or all processes changes after each iteration. We still face the exploration-exploitation trade-off.

Bernoulli Bandits. (Material from [2].)
N[a]: the number of times arm a has been pulled.
Q[a]: the running average of rewards for arm a.
S[a]: the number of successes for arm a.
F[a]: the number of failures for arm a.
We use the same notion of regret, except that the optimal strategy is to always pull the arm with the largest mean $\mu^* = \max_a \mu_a$.

Regret of common strategies:
Random selection: O(T).
Greedy selection: O(T).
ε-greedy selection: O(T).
Boltzmann exploration: O(T).
Upper Confidence Bound (UCB): O(ln T).
Thompson Sampling: O(ln T).
Sketches of the two logarithmic-regret strategies are given below.
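Using the N[a], Q[a], S[a], F[a] bookkeeping from above, here is a minimal sketch of UCB and Thompson Sampling; the UCB1 bonus sqrt(2 ln t / N[a]) and the Beta(S[a]+1, F[a]+1) posterior are the standard textbook choices and may differ in detail from the variants in [2].

```python
import numpy as np

def run_bernoulli_bandit(true_means, T, strategy="ucb1", rng=None):
    """Bernoulli bandit loop with UCB1 or Thompson Sampling arm selection."""
    rng = rng or np.random.default_rng(0)
    n = len(true_means)
    N = np.zeros(n)   # pulls per arm
    Q = np.zeros(n)   # running average reward per arm
    S = np.zeros(n)   # successes per arm
    F = np.zeros(n)   # failures per arm
    for t in range(1, T + 1):
        if strategy == "ucb1":
            if t <= n:
                a = t - 1                              # pull each arm once first
            else:
                bonus = np.sqrt(2 * np.log(t) / N)     # optimism in the face of uncertainty
                a = int(np.argmax(Q + bonus))
        else:  # Thompson Sampling
            samples = rng.beta(S + 1, F + 1)           # sample from each arm's Beta posterior
            a = int(np.argmax(samples))
        reward = float(rng.random() < true_means[a])   # Bernoulli reward
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]                 # incremental mean update
        S[a] += reward
        F[a] += 1 - reward
    return T * max(true_means) - (Q * N).sum()         # empirical regret against the best arm
```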

Empirical Evaluation.

Contextual Bandits. There are scenarios in which we have access to more information; the extra information can be encoded as a context vector. In online advertising, for instance, the behaviour of each user or the search context can provide valuable information. One simple approach is to treat each context as having its own bandit problem. Variations of the previous algorithms relate the context vector to the expected reward through linear models, neural networks, kernels, or random forests (a linear-model sketch follows).
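As one concrete instance of the linear-model approach, here is a minimal LinUCB-style sketch (disjoint per-arm ridge regression with an optimistic bonus); the parameter alpha and the identity prior are assumptions, and this is not necessarily the variant the slides refer to.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression with an optimistic (UCB) bonus on the predicted reward."""
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(dim)      # X^T X + I (ridge regularizer)
        self.b = np.zeros(dim)    # X^T rewards

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                                    # ridge estimate of the reward weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)    # predicted mean + exploration bonus

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose_arm(arms, context):
    """Pick the arm with the highest upper confidence bound for this context vector."""
    scores = [arm.ucb(context) for arm in arms]
    return int(np.argmax(scores))
```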

Thanks! Questions?

References
[1] E. Hazan. Introduction to Online Convex Optimization. http://ocobook.cs.princeton.edu/ocobook.pdf
[2] S. Raja. Bandits and Exploration Strategies. https://sudeepraja.github.io/s/