Yevgeny Seldin. University of Copenhagen


Classical (Batch) Machine Learning: Collect Data → Data → Machine Learning → Prediction rule; New situation → Decision/Action. Assumption: the samples are independent identically distributed (i.i.d.).

How is Online different from Batch? Batch Learning: Collect Data → Analyze → Act. Online Learning: Act → Analyze → Get More Data → Act → ...

Examples: investment in the stock market, online advertising/personalization, online routing, games, robotics.

When do we need Online Learning? Until recently, statistical theory has been restricted to the design and analysis of sampling experiments in which the size and composition of the samples are completely determined before the experimentation begins. The reasons for this are partly historical, dating back to the time when the statistician was consulted, if at all, only after the experiment was over, and partly intrinsic in the mathematical difficulty of working with anything but a fixed number of independent random variables. A major advance now appears to be in the making with the creation of a theory of the sequential design of experiments, in which the size and composition of the samples are not fixed in advance but are functions of the observations themselves. (Robbins, 1952)

When do we need Online Learning? Interactive learning; adversarial game-theoretic settings; no i.i.d. assumption; intelligent data collection / experiment design; large-scale data analysis.

The Space of Online Learning Problems [diagram: problems with limited (bandit) feedback; examples: stock market, medical treatments, advertising]. Limited feedback brings in the Exploration-Exploitation Trade-off.

Exploration-Exploitation Trade-off: which drug to give to a new patient when the candidates so far are never tried, 0/2 and 6/10, and there are more patients to come. Act → Analyze → More Data: we are building the dataset for ourselves.

The Space of Online Learning Problems [diagram: environmental-resistance axis from i.i.d. to adversarial; examples: spam filtering, stock market, medical treatments, weather prediction; bandit feedback].

Learning in Adversarial Environments: a game-theoretic setting that cannot be treated by batch learning (Collect Data → Analyze → Act); online learning instead loops Act → Analyze → More Data → Act → ... Evaluation measure: regret, the difference in performance compared to the best choice in hindsight (out of a limited set), e.g. investment revenue vs. the best stock in hindsight.

The Space of Online Learning Problems [diagram: environmental-resistance axis (i.i.d. to adversarial) and structural-complexity axis (stateless bandit, contextual, Markov Decision Processes (MDP)); examples: medical records of different patients, subsequent treatments of the same patient; Delayed Feedback].

Delayed Feedback: "Would you like to have a drink?" Yes / No ... "Yes!!!" ... The next morning.

The Space of Online Learning Problems [diagram as before, with Reinforcement Learning marked over the region of MDPs with delayed feedback].

The Space of Online Learning Problems [diagram as before, annotated: classical batch ML would be here].

Simplicities along the axes [diagram as before]. The more of these simplicities a problem has, the simpler it is. Along the environmental-resistance axis: gaps (between action outcomes) and variance/range (of action outcomes). Along the structural-complexity axis: reducibility of the state space (context relevance) and mixing time (in MDPs).

Some standard algorithms [marked on the diagram: prediction with expert advice and adversarial bandits on the adversarial side; stochastic bandits on the i.i.d. side].

Typical performance scaling [marked on the same diagram]: √(T ln K) for prediction with expert advice, √(KT) for adversarial bandits, Σ_a ln T / Δ(a) for stochastic bandits. Here K is the number of actions, T the number of rounds, and Δ(a) the suboptimality gap of action a.

Standard algorithms [same diagram and rates as above] assume a certain form of simplicity and exploit it.


Prediction with limited advice / Bandits with paid observations [Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML, 2014]. Examples: multiple algorithms and/or parametrizations but limited computational resources, so only a subset can be executed; expensive resources.

Environmental resistance in the full-information setting [Koolen & van Erven, COLT, 2015; Luo & Schapire, COLT, 2015; Wintenberger, 2015; van Erven, Kotłowski & Warmuth, COLT, 2014; Gaillard, Stoltz & van Erven, COLT, 2014; Cesa-Bianchi, Mansour & Stoltz, MLJ, 2007]. Examples: small deviations from i.i.d. rather than full adversariality; adaptation to i.i.d.; adaptation to the effective outcome range.

Environmental resistance in bandits [Bubeck & Slivkins, COLT, 2012; Seldin & Slivkins, ICML, 2014; Auer & Chiang, COLT, 2016; Seldin & Lugosi, COLT, 2017; Wei & Luo, COLT, 2018; Zimmert & Seldin, AISTATS, 2019]. Examples: contaminated observations; adaptation to i.i.d. Adaptation to the effective outcome range is impossible! [Gerchinovitz & Lattimore, NIPS, 2016]

Prediction with Limited Advice [Thune & Seldin, NIPS, 2018]. Starting from as few as 2 observations per round: adaptation to i.i.d. and adaptation to the effective outcome range.

Prediction with Limited Advice: Problem Setting. K actions; at each time t action i has loss ℓ_t^i. Adversarial regime: ℓ_t^i arbitrary in [0,1]. I.I.D. regime: E[ℓ_t^i] = μ_i, with gaps Δ_i = μ_i − min_j μ_j. Effective loss range ε: for all i, j, t, |ℓ_t^i − ℓ_t^j| ≤ ε. Evaluation measure: pseudo-regret R_T = E[Σ_{t=1}^T ℓ_t^{A_t}] − min_{i ∈ {1,...,K}} E[Σ_{t=1}^T ℓ_t^i].
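For concreteness, here is a minimal sketch (my own illustration, not from the paper) of how the evaluation measure is typically computed in experiments: the expectations in the pseudo-regret above are replaced by a single realization, i.e. the loss of the played actions minus that of the best fixed action in hindsight on the realized loss matrix.

import numpy as np

def empirical_regret(loss_matrix, actions):
    # loss_matrix: array of shape (T, K) with losses in [0, 1]
    # actions: the actions A_1, ..., A_T played by the learner (0-indexed)
    T = len(actions)
    incurred = loss_matrix[np.arange(T), actions].sum()
    best_fixed = loss_matrix[:T].sum(axis=0).min()  # best single action in hindsight
    return incurred - best_fixed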

SODA: Second Order Difference Adjustment [Thune & Seldin, NIPS, 2018]. The primary action A_t is sampled according to p_t(i) ∝ exp(η_t D_{t−1}^i + η_t² S_{t−1}^i). The secondary observation B_t is sampled uniformly from {1, ..., K} \ {A_t}. Unbiased loss-difference estimates: Δℓ_t^i = (K−1) · 1(B_t = i) · (ℓ_t^{A_t} − ℓ_t^{B_t}). The loss-difference estimator and its second moment: D_t^i = Σ_{s=1}^t Δℓ_s^i and S_t^i = Σ_{s=1}^t (Δℓ_s^i)². The learning rate: η_t = √(ln K / max_i S_{t−1}^i).
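The following is a minimal NumPy sketch of the sampling and update rules as reconstructed above; it is not the authors' implementation, and details such as the sign conventions, the treatment of the first rounds and the exact learning-rate schedule may differ from the paper.

import numpy as np

def soda(loss_matrix, rng=None):
    # Sketch of SODA. loss_matrix: shape (T, K) with losses in [0, 1].
    rng = np.random.default_rng() if rng is None else rng
    T, K = loss_matrix.shape
    D = np.zeros(K)  # cumulative loss-difference estimates D_{t-1}^i (large = arm i looks good)
    S = np.zeros(K)  # their cumulative second moments S_{t-1}^i
    actions = []
    for t in range(T):
        # learning rate eta_t = sqrt(ln K / max_i S_{t-1}^i), guarded before any observation
        eta = np.sqrt(np.log(K) / max(S.max(), 1e-12))
        # primary action: p_t(i) proportional to exp(eta * D_i + eta^2 * S_i)
        logits = eta * D + eta ** 2 * S
        p = np.exp(logits - logits.max())
        p /= p.sum()
        A = int(rng.choice(K, p=p))
        # secondary observation: uniform over the remaining K - 1 arms
        B = int(rng.choice([i for i in range(K) if i != A]))
        # unbiased loss-difference estimate, nonzero only in coordinate B
        diff = (K - 1) * (loss_matrix[t, A] - loss_matrix[t, B])
        D[B] += diff
        S[B] += diff ** 2
        actions.append(A)
    return actions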

SODA: Regret Guarantees [Thune & Seldin, NIPS, 2018]. Adversarial: R_T = O(ε √(KT ln K)). Stochastic: R_T = O(K ε² ln K · Σ_{i: Δ_i > 0} 1/Δ_i). Knowledge of i.i.d./adversarial is not required! Knowledge of ε is not required! Simultaneous adaptation to two types of easiness!

An optimal algorithm for i.i.d. and adversarial bandits [Zimmert & Seldin, AISTATS, 2019].

I.I.D. and Adversarial Bandits: Problem Setting. K actions; at each time t action i has loss ℓ_t^i. Evaluation measure: pseudo-regret (as before) R_T = E[Σ_{t=1}^T ℓ_t^{A_t}] − min_{i ∈ {1,...,K}} E[Σ_{t=1}^T ℓ_t^i]. Adversarial regime: as before. Stochastically constrained adversary [Wei & Luo, 2018]: with i* the best action, E[ℓ_t^i − ℓ_t^{i*}] = Δ_i ≥ 0 for all t. I.I.D.: the special case when E[ℓ_t^i] = μ_i.
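To make the stochastically constrained adversary concrete, here is a small loss generator (my own example, not from the paper): the gap between the expected loss of arm i and of the best arm equals Δ_i in every round, while the common loss level drifts over time.

import numpy as np

def stochastically_constrained_losses(T, gaps, rng=None):
    # gaps[i] = Delta_i, with gaps[best arm] = 0; returns a (T, K) matrix of {0, 1} losses
    rng = np.random.default_rng() if rng is None else rng
    gaps = np.asarray(gaps, dtype=float)
    # common level drifting within [0, 1 - max gap]; any drift schedule would do
    base = (1.0 - gaps.max()) * 0.5 * (1.0 + np.sin(np.linspace(0.0, 10.0, T)))
    means = base[:, None] + gaps[None, :]  # E[l_t^i] - E[l_t^{i*}] = Delta_i for every t
    return rng.binomial(1, means).astype(float)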

α-Tsallis-INF [Zimmert & Seldin, AISTATS, 2019]. Tsallis entropy: H_α(x) = (1/(1−α)) (Σ_i x_i^α − 1). INF = Implicitly Normalized Forecaster [Audibert & Bubeck, 2009].
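As a quick illustration of the definition (my own snippet, not from the talk), the Tsallis entropy recovers the Shannon entropy in the limit α → 1:

import numpy as np

def tsallis_entropy(x, alpha):
    # H_alpha(x) = (sum_i x_i^alpha - 1) / (1 - alpha); Shannon entropy in the limit alpha -> 1
    x = np.asarray(x, dtype=float)
    if np.isclose(alpha, 1.0):
        return float(-np.sum(x * np.log(x)))
    return float((np.sum(x ** alpha) - 1.0) / (1.0 - alpha))

p = np.ones(4) / 4
print(tsallis_entropy(p, 0.5), tsallis_entropy(p, 0.999), np.log(4))  # 2.0, ~1.386, ~1.386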

α-Tsallis-INF [Zimmert & Seldin, AISTATS, 2019]. w_t = argmin_{w ∈ Δ_K} { ⟨w, L_{t−1}⟩ − Σ_i w_i^α/(α η_{t,i}) }. Sample A_t ∼ w_t. Update the importance-weighted cumulative loss estimates: L_t^i = L_{t−1}^i + ℓ_t^i · 1(A_t = i)/w_t^i. α = 1 corresponds to entropic regularization (the EXP3 algorithm); α = 0 corresponds to the log-barrier.
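The argmin over the simplex is what gives INF its name: the first-order conditions yield w_t^i = (η_{t,i} (L_{t−1}^i + λ))^{−1/(1−α)} for an implicit normalization constant λ chosen so that the weights sum to one. Below is a minimal sketch with a single learning rate and a simple bisection for λ (my own illustration; the paper's implementation, e.g. the root-finding method and refinements of the loss estimator, may differ).

import numpy as np

def tsallis_inf_weights(L_hat, eta, alpha=0.5, iters=60):
    # Solve w = argmin_{w in simplex} <w, L_hat> - sum_i w_i^alpha / (alpha * eta), 0 < alpha < 1:
    # w_i(lam) = (eta * (L_hat_i + lam))^(-1/(1-alpha)), with lam such that sum_i w_i(lam) = 1.
    L = np.asarray(L_hat, dtype=float)
    L = L - L.min()  # the argmin is invariant to shifting all losses by a constant
    K = len(L)
    def w(lam):
        return (eta * (L + lam)) ** (-1.0 / (1.0 - alpha))
    lo, hi = 1e-12, K ** (1.0 - alpha) / eta  # sum(w) >= 1 at lo and <= 1 at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if w(mid).sum() > 1.0 else (lo, mid)
    weights = w(hi)
    return weights / weights.sum()  # tidy up numerical rounding

def tsallis_inf(loss_matrix, alpha=0.5, rng=None):
    # Full loop with eta_t = 1/sqrt(t) and the importance-weighted update from the slide.
    rng = np.random.default_rng() if rng is None else rng
    T, K = loss_matrix.shape
    L_hat = np.zeros(K)
    actions = []
    for t in range(1, T + 1):
        w_t = tsallis_inf_weights(L_hat, eta=1.0 / np.sqrt(t), alpha=alpha)
        A = int(rng.choice(K, p=w_t))
        L_hat[A] += loss_matrix[t - 1, A] / w_t[A]  # l_t^i * 1(A_t = i) / w_t^i
        actions.append(A)
    return actions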

½-Tsallis-INF: Regret Guarantees [Zimmert & Seldin, AISTATS, 2019], with η_t = 1/√t. Adversarial: R_T ≤ 4√(KT) + 1. Stochastically constrained adversary (i.i.d. is a special case): R_T ≤ Σ_{i ≠ i*} (4 ln T + 20)/Δ_i + 4√K. Both results match the lower bounds (up to constants)! No knowledge of i.i.d./adversarial is required!

Tsallis-INF in relation to prior work [Zimmert & Seldin, AISTATS, 2019] (regime, and how far the upper bound is from the lower bound):
BROAD [Wei & Luo, 2018], log-barrier + doubling (α = 0): i.i.d. O(K²); adversarial O(ln T).
½-Tsallis-INF (α = 1/2): i.i.d. & adversarial O(1).
EXP3++ [Seldin & Lugosi, 2017], entropic regularization with extra exploration mixed in (α = 1): i.i.d. O(ln T); adversarial O(√(ln K)).

½-Tsallis-INF: Experiments [Zimmert & Seldin, AISTATS, 2019] [plots: regret curves in the i.i.d. regime and against a stochastically constrained adversary].

½-Tsallis-INF: Additional Results [Zimmert & Seldin, AISTATS, 2019]: optimality in the moderately contaminated stochastic regime; i.i.d. and adversarial optimality in utility-based dueling bandits.


Check my homepage https://sites.google.com/site/yevgenyseldin/ (or Google me): papers, tutorials, lecture notes, etc. Join our Advanced Topics in Machine Learning course, co-taught by Christian Igel, September-October 2019. WARNING: math-heavy course.

α-Tsallis-INF: Some Considerations [Zimmert & Seldin, AISTATS, 2019]. Exploration rate of a suboptimal arm: w_t^i ≈ (η_t (L_{t−1}^i − L_{t−1}^{i*}))^{−1/(1−α)}. For i.i.d. data E[L_t^i − L_t^{i*}] = Δ_i t, so w_t^i ≈ (η_t Δ_i t)^{−1/(1−α)}. To get a regret of order (ln T)/Δ_i the exploration rate must be ≈ 1/(Δ_i² t), which requires η_t^i = Δ_i^{1−2α} t^{−α}. α = 1/2 is the only value for which η_t^i requires no tuning by the (unknown) Δ_i.
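Spelling out the algebra in the middle step (my own elaboration of the slide's argument), in LaTeX:

\[
(\eta_t \Delta_i t)^{-\frac{1}{1-\alpha}} = \frac{1}{\Delta_i^2\, t}
\;\Longleftrightarrow\;
\eta_t \Delta_i t = \left(\Delta_i^2\, t\right)^{1-\alpha}
\;\Longleftrightarrow\;
\eta_t = \Delta_i^{\,1-2\alpha}\, t^{-\alpha},
\]

so for \(\alpha = 1/2\) the dependence on \(\Delta_i\) cancels and \(\eta_t = 1/\sqrt{t}\).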

α-Tsallis-INF: Some Considerations [Zimmert & Seldin, AISTATS, 2019]. E[L_t^i − L_t^{i*}] = Δ_i t, but the variance of L_t^i − L_t^{i*} is of order Δ_i² t², so L_t^i − L_t^{i*} cannot be efficiently controlled by Bernstein's inequality. Our analysis is based on a self-bounding property of the regret.