Tutorial: PART 1. Online Convex Optimization, A Game-Theoretic Approach to Learning.


Tutorial: PART 1. Online Convex Optimization, A Game-Theoretic Approach to Learning. http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan, Princeton University; Satyen Kale, Yahoo Research

Agenda
1. Motivating examples
2. Unified framework & why it makes sense
3. Algorithms & main results / techniques
4. Research directions & open questions

Section 1: Motivating Examples. Inherently adversarial & online.

Sequential Spam Classification (online & adversarial learning). Observe $n$ features (words): $a_t \in \mathbb{R}^n$. Predict label $\hat{b}_t \in \{-1, 1\}$, then receive feedback $b_t$. Objective: average error → best linear model in hindsight:
$$\min_x \frac{1}{T} \sum_t \log\left(1 + e^{-b_t \, x^\top a_t}\right)$$
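
The per-round loss above is the standard logistic loss. A minimal sketch of one round of the protocol; the helper name `logistic_loss` and the toy data are illustrative, not from the tutorial:

```python
import numpy as np

def logistic_loss(x, a, b):
    # Per-round loss from the slide: log(1 + exp(-b * <a, x>)),
    # with label b in {-1, +1}.
    return np.log1p(np.exp(-b * np.dot(a, x)))

rng = np.random.default_rng(0)
x = np.zeros(5)                    # current linear model
a_t = rng.normal(size=5)           # observed features (word counts, say)
b_t = 1.0                          # true label, revealed as feedback
print(logistic_loss(x, a_t, b_t))  # log(2) ~ 0.693 when x = 0
```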

Spam Classifier Selection. Stream of emails with features $a \in \mathbb{R}^d$; the candidate classifiers are regularized logistic models, $\log(1 + e^{-b\, a^\top x}) + \lambda_1 \|x\|^2$, $\log(1 + e^{-b\, a^\top x}) + \lambda_2 \|x\|^2$, ..., $\log(1 + e^{-b\, a^\top x}) + \lambda_n \|x\|^2$, one per regularization parameter. Objective: make predictions with accuracy → best classifier in hindsight.

Universal portfolio selection. Price relatives: $r_t(i) = \text{closing price}/\text{opening price}$ for commodity $i$; $x_t$ = distribution of wealth over commodities. Market: no statistical assumptions. Example: for price-relatives vector $r_t = (1.5,\, 1,\, 1,\, 1/2)$ and wealth distribution $x_t = (1/2,\, 0,\, 1/2,\, 0)$, the wealth multiplier is $\tfrac12 \cdot 1.5 + 0 \cdot 1 + \tfrac12 \cdot 1 + 0 \cdot \tfrac12 = 1\tfrac14$. Log-wealth evolves as
$$\log W_{t+1} = \log W_t + \log(r_t^\top x_t)$$
Objective: choose portfolios s.t. avg. log-wealth → avg. log-wealth of the best CRP.

Constant rebalanced portfolio. Two assets whose price relatives alternate between 0.1 and 2 each day. Any single currency depreciates exponentially ($0.1 \times 2 = 0.2$ every two days), yet the 50/50 Constant Rebalanced Portfolio (CRP) makes 5% every day: $\tfrac12 \cdot 0.1 + \tfrac12 \cdot 2 = 1.05$.
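
A quick numeric check of the slide's arithmetic, assuming (as the repeated figures suggest) that the two assets' price relatives alternate between $(0.1, 2)$ and $(2, 0.1)$:

```python
# Each asset alone loses: 0.1 * 2 = 0.2 every two days.
# The 50/50 CRP gains 0.5*0.1 + 0.5*2 = 1.05 every day.
days = 10
wealth_single, wealth_crp = 1.0, 1.0
for t in range(days):
    r = (0.1, 2.0) if t % 2 == 0 else (2.0, 0.1)
    wealth_single *= r[0]                  # hold asset 1 only
    wealth_crp *= 0.5 * r[0] + 0.5 * r[1]  # rebalance to 50/50 daily
print(wealth_single)  # 0.2**5 = 0.00032  (exponential decay)
print(wealth_crp)     # 1.05**10 ~ 1.629  (5% per day)
```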

Recommendation systems. (Figure: a 480,000 users × 18,000 movies matrix with a few observed ±1 ratings and most entries unknown.) Objective: predict ratings with accuracy → best low-rank matrix. Computational challenge: optimizing over low-rank matrices.

Ad Selection. Yahoo, Google, etc. display ads on pages; clicks generate revenue. Abstraction: observe a stream of users; given a user, select one ad to display; feedback: click (or not) on the ad shown. Objective: average revenue per ad → best policy in a given class. Additional difficulty: partial (bandit) feedback, i.e. feedback only for the ad actually shown.

2. Methodology: Online Convex Optimization. Definition & relation to PAC learning; go through the examples; argue why it makes sense.

Statistical (PAC) learning. Nature: examples $(a_1, b_1), \ldots, (a_m, b_m)$ drawn i.i.d. from a distribution $D$ over $A \times B = \{(a, b)\}$. Learner: hypothesis $h \in H$, with loss, e.g., $\ell(h, (a, b)) = (h(a) - b)^2$, and generalization error $\mathrm{err}(h) = \mathbb{E}_{(a,b) \sim D}[\ell(h, (a, b))]$. A hypothesis class $H: X \to Y$ is learnable if for all $\varepsilon, \delta > 0$ there exists an algorithm s.t. after seeing $m$ examples, for $m = \mathrm{poly}(1/\delta, 1/\varepsilon, \mathrm{dimension}(H))$, it finds $h$ s.t. w.p. $1 - \delta$:
$$\mathrm{err}(h) \le \min_{h^* \in H} \mathrm{err}(h^*) + \varepsilon$$

Online Learning in Games. Iteratively, for $t = 1, 2, \ldots, T$: the Player chooses $h_t \in H$; the Adversary chooses $(a_t, b_t) \in A \times B$; the loss is $\ell(h_t, (a_t, b_t))$. Goal: minimize (average, expected) regret:
$$\frac{1}{T} \left[ \sum_t \ell(h_t, (a_t, b_t)) - \min_{h \in H} \sum_t \ell(h, (a_t, b_t)) \right] \xrightarrow{T \to \infty} 0$$
Can we learn in games efficiently? Regret → $o(T)$? (i.e., an analogue of the fundamental theorem of statistical learning)

Online vs. Batch (PAC, Statistical).
Online convex optimization & regret minimization: learning against an adversary; regret as the benchmark; streaming setting, examples processed sequentially; fewer assumptions, stronger guarantees; harder algorithmically.
Batch (PAC/statistical) learning: nature is benign, i.i.d. samples from an unknown distribution; generalization error as the benchmark; the dataset is processed multiple times; weaker guarantees (e.g. under a changing distribution); easier algorithmically.

Online Convex Optimization. Adversary: convex cost function $f_t$. Online Player: point $x_t$ in a convex set $K \subseteq \mathbb{R}^n$; loss $f_t(x_t)$; total loss $\sum_t f_t(x_t)$. What access do we have to $K$ and $f_t$?
$$\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$
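
As a sketch, the protocol is a generic loop; `player` and `adversary` below are assumed interfaces for illustration, not an API from the tutorial:

```python
def run_oco(player, adversary, T):
    """One run of the OCO protocol above. player(past_fs) returns a point
    x_t in K; adversary(x_t) returns a convex cost f_t (the adversary may
    move after seeing x_t). Both are assumed interfaces."""
    xs, fs = [], []
    for t in range(T):
        x_t = player(fs)     # may depend on all previously revealed costs
        f_t = adversary(x_t)
        xs.append(x_t)
        fs.append(f_t)
    return xs, fs

def regret_vs(xs, fs, x_star):
    """sum_t f_t(x_t) - sum_t f_t(x_star) for one fixed comparator;
    minimizing over x_star in K gives the regret defined above."""
    return sum(f(x) for f, x in zip(fs, xs)) - sum(f(x_star) for f in fs)
```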

Prediction from expert advice. Decision set = set of all distributions over $n$ experts: $K = \Delta_n = \{x \in \mathbb{R}^n : \sum_i x_i = 1, \; x_i \ge 0\}$. Cost functions? Let $c_t(i)$ = loss of expert $i$ in round $t$ and $f_t(x) = c_t^\top x = \sum_i c_t(i)\, x(i)$, the expected loss when choosing experts by distribution $x_t$. Regret = difference in expected loss vs. the best expert:
$$\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$

Parameterized Hypothesis Classes. $H$ is parameterized by a vector $x$ in a convex set $K \subseteq \mathbb{R}^n$. In round $t$, define the loss function based on the example $(a_t, b_t)$:
1. Online Linear Spam Filtering: $K = \{x : \|x\| \le \omega\}$, loss function $f_t(x) = (a_t^\top x - b_t)^2$.
2. Online Soft Margin SVM: $K = \mathbb{R}^n$, loss function $f_t(x) = \max(0,\, 1 - b_t \, a_t^\top x) + \lambda \|x\|^2$.
3. Online Matrix Completion: $K = \{X \in \mathbb{R}^{n \times n} : \|X\|_* \le k\}$, matrices with bounded nuclear norm. At time $t$, if $a_t = (i_t, j_t)$, then the loss function is $f_t(X) = (X(i_t, j_t) - b_t)^2$.

Universal portfolio selection. Recall the change in log-wealth: $\log W_{t+1} = \log W_t + \log(r_t^\top x_t)$. Thus: $K = \Delta_n = \{x \in \mathbb{R}^n : \sum_i x_i = 1, \; x_i \ge 0\}$ and $f_t(x) = -\log(r_t^\top x)$. A universal portfolio algorithm satisfies
$$\mathrm{Regret}_T = \frac{1}{T} \sum_t f_t(x_t) - \frac{1}{T} \min_{x^*} \sum_t f_t(x^*) \longrightarrow 0$$

Bandit Online Convex Optimization. Same as OCO, except $f_t$ is not revealed to the learner; only $f_t(x_t)$ is. E.g. ad selection: a bandit version of prediction with expert advice. Decision set $K$ = set of distributions over possible ads to be shown: $K = \Delta_n = \{x \in \mathbb{R}^n : \sum_i x_i = 1, \; x_i \ge 0\}$. Loss function: let $c_t(i) = 0$ if ad $i$ would be clicked if it were shown, and 1 otherwise; $f_t(x) = c_t^\top x = \sum_i c_t(i)\, x(i)$.

Algorithms (Part 1: foundations):
randomized weighted majority
follow-the-leader
online gradient descent, regularization
log-regret, Online Newton Step
bandit algorithms (FKM, barriers, volumetric spanners)
projection-free methods (FW algorithm, online FW)

Prediction from expert advice. A stream of emails (features $a \in \mathbb{R}^d$); loss 0 if correct, 1 for a mistake; $n$ experts (different regularizations). Can we perform as well as the best expert? (non-stochastic, adversarial setting)

Weighted Majority [Littlestone-Warmuth 89]. Binary classification with {0, 1} loss; $n$ experts with weights $x_t(1), x_t(2), \ldots, x_t(n)$. Initially, all $x_1(h) = 1$. In round $t$, predict by weighted majority vote. Update weights:
$x_{t+1}(h) = (1 - \eta)\, x_t(h)$ if expert $h$ errs; $x_{t+1}(h) = x_t(h)$ otherwise.
Thm: Total errors $\le 2(1 + \eta) \cdot$ (best expert's errors) $+ \frac{2 \log n}{\eta}$.
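
A minimal sketch of the algorithm, assuming experts output {0,1} predictions; the helper name `weighted_majority` and the array layout are ours:

```python
import numpy as np

def weighted_majority(expert_preds, labels, eta=0.5):
    """Deterministic Weighted Majority as on the slide. expert_preds:
    T x n array of {0,1} predictions; labels: length-T {0,1} outcomes."""
    T, n = expert_preds.shape
    w = np.ones(n)                   # initially all weights are 1
    mistakes = 0
    for t in range(T):
        # Predict by weighted majority vote.
        pred = int(w @ expert_preds[t] >= w.sum() / 2)
        mistakes += int(pred != labels[t])
        # Multiply the weight of each expert that erred by (1 - eta).
        w[expert_preds[t] != labels[t]] *= 1 - eta
    return mistakes
```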

Weighted Majority: Analysis. Total weight: $\Phi_t = \sum_h x_t(h)$.
1. If the algorithm makes a mistake, at least half the weight is on experts who erred, so $\Phi_{t+1} \le \frac12 \Phi_t (1 - \eta) + \frac12 \Phi_t = \Phi_t (1 - \eta/2)$.
2. If $h^*$ is the best expert, $\Phi_T \ge x_T(h^*) = (1 - \eta)^{\#(h^* \text{ errors})} = (1 - \eta)^{\#(\text{best expert errors})}$.
Thus: $(1 - \eta)^{\#(\text{best expert errors})} \le \Phi_T \le \Phi_0 (1 - \eta/2)^{\#(\text{alg errors})} = n (1 - \eta/2)^{\#(\text{alg errors})}$, which gives
$$\#(\text{alg errors}) \le 2(1 + \eta)\, \#(\text{best expert errors}) + \frac{2 \log n}{\eta}$$

Randomized Weighted Majority [LW 89]. General setting: expert $h$ at time $t$ incurs loss $c_t(h) \in [0, 1]$; $n$ experts with weights $x_t(1), \ldots, x_t(n)$, initially all $x_1(h) = 1$. In round $t$, predict $h$ w.p. $x_t(h) / \sum_{h'} x_t(h')$. Update weights: $x_{t+1}(h) = e^{-\eta\, c_t(h)}\, x_t(h)$.
Thm: For $\eta = \sqrt{\frac{\log n}{T}}$, Regret $= O(\sqrt{T \log n})$.
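
A sketch in the same vein; keeping log-weights is an implementation choice of ours for numerical stability, not from the slide:

```python
import numpy as np

def randomized_weighted_majority(costs, eta):
    """RWM as on the slide: play expert h w.p. proportional to x_t(h),
    then update x_{t+1}(h) = exp(-eta * c_t(h)) * x_t(h).
    costs: T x n array with entries in [0, 1]. Returns expected regret."""
    T, n = costs.shape
    log_w = np.zeros(n)                # log-weights, all weights start at 1
    expected_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                   # distribution over experts
        expected_loss += p @ costs[t]  # expected loss this round
        log_w -= eta * costs[t]        # multiplicative-weights update
    return expected_loss - costs.sum(axis=0).min()

# With eta = sqrt(log(n) / T) the theorem gives regret O(sqrt(T log n)).
rng = np.random.default_rng(0)
T, n = 1000, 10
c = rng.random((T, n))
print(randomized_weighted_majority(c, eta=np.sqrt(np.log(n) / T)))
```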

Reminder: Online Convex Optimization. Adversary: convex cost function $f_t$. Online Player: point $x_t$ in a convex set $K \subseteq \mathbb{R}^n$; loss $f_t(x_t)$; total loss $\sum_t f_t(x_t)$. What access do we have to $f_t$? What benchmark?
$$\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$

Minimize regret: best-in-hindsight. Most natural: follow-the-leader,
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$
Provably works with a random perturbation, even for non-convex sets [Kalai-Vempala 05]. Is $x_t$ close to $x_{t+1} = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x)$? The decision may be unstable! (teaser: this can be fixed with regularization)
$$x_t = \arg\min_{x \in K} \left\{ \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x) \right\}$$
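
As a sketch of why regularization stabilizes the leader: for linear losses $f_i(x) = g_i^\top x$ with $R(x) = \frac12 \|x\|^2$ and no constraint set, the regularized argmin has a closed form and consecutive iterates differ by only $\eta g_t$. The linear/unconstrained setting and the helper name are our simplifying assumptions:

```python
import numpy as np

def ftrl_quadratic(grads, eta):
    """Follow-the-regularized-leader for linear losses f_i(x) = g_i . x
    with R(x) = ||x||^2 / 2, unconstrained. The minimizer of
        sum_{i<t} g_i . x + (1/eta) R(x)
    is x_t = -eta * sum_{i<t} g_i, so each round shifts x by -eta * g_t."""
    x = np.zeros_like(grads[0], dtype=float)
    iterates = []
    for g in grads:
        iterates.append(x.copy())
        x -= eta * g          # accumulate: x_{t+1} = -eta * sum_{i<=t} g_i
    return iterates
```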

Offline: Steepest Descent. Move in the direction of steepest descent, the negative gradient, where $[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$. (Figure: iterates $p_1, p_2, p_3, \ldots$ converging to the minimizer $p^*$.)

Online gradient descent [Zinkevich 03]:
$$y_{t+1} = x_t - \eta\, \nabla f_t(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: Regret $= O(\sqrt{T})$.
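
A minimal sketch of the two-step update; the projection, toy loss, and step-size schedule below are illustrative choices of ours:

```python
import numpy as np

def online_gradient_descent(grad, project, x0, etas):
    """OGD as on the slide. grad(t, x) returns the gradient of f_t at x
    (revealed after playing x); project maps a point back onto K."""
    x = np.asarray(x0, dtype=float)
    iterates = []
    for t, eta in enumerate(etas):
        iterates.append(x.copy())
        y = x - eta * grad(t, x)   # gradient step on the revealed f_t
        x = project(y)             # Euclidean projection onto K
    return iterates

# Toy run on the unit ball with f_t(x) = ||x - a_t||^2 / 2 and the
# eta_t ~ 1/sqrt(t) step sizes behind the O(sqrt(T)) regret theorem.
project_ball = lambda y: y / max(1.0, float(np.linalg.norm(y)))
rng = np.random.default_rng(0)
T = 100
targets = rng.normal(size=(T, 3))
xs = online_gradient_descent(
    grad=lambda t, x: x - targets[t],
    project=project_ball,
    x0=np.zeros(3),
    etas=[1.0 / np.sqrt(t + 1) for t in range(T)],
)
```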

Analysis.
Observation 1: $\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta\, \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$.
Observation 2 (Pythagoras): $\|x_{t+1} - x^*\| \le \|y_{t+1} - x^*\|$.
Thus: $\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - 2\eta\, \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$.
Convexity: $\sum_t [f_t(x_t) - f_t(x^*)] \le \sum_t \nabla_t^\top (x_t - x^*) \le \frac{1}{2\eta} \sum_t \left( \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \right) + \frac{\eta}{2} \sum_t \|\nabla_t\|^2 \le \frac{1}{2\eta} \|x_1 - x^*\|^2 + \frac{\eta}{2}\, T G^2 = O(\sqrt{T})$,
with $\|\nabla_t\| \le G$ and $\eta \sim 1/\sqrt{T}$.

Lower bound. 2 experts, $T$ iterations: the first expert has random loss in $\{-1, 1\}$; the second expert's loss is the first's $\times (-1)$. Expected loss = 0 (for any algorithm). Regret (think of the first expert only): $\mathbb{E}[\, |\#1\text{'s} - \#(-1)\text{'s}| \,] = \Omega(\sqrt{T})$, so Regret $= \Omega(\sqrt{T})$.
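
A quick Monte Carlo check of the slide's quantity: for $\pm 1$ coin flips, $\mathbb{E}[|\#1\text{'s} - \#(-1)\text{'s}|] \approx \sqrt{2T/\pi} = \Theta(\sqrt{T})$ (the closed form is a standard fact about random walks, not from the slide):

```python
import numpy as np

# Estimate E[ |sum of T random +-1 losses| ] and compare to sqrt(2T/pi).
rng = np.random.default_rng(0)
for T in (100, 400, 1600):
    flips = rng.choice([-1, 1], size=(10_000, T))
    print(T, np.abs(flips.sum(axis=1)).mean(), np.sqrt(2 * T / np.pi))
```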

Algorithms:
randomized weighted majority
follow-the-leader
online gradient descent, regularization
log-regret, Online Newton Step
bandit algorithms (FKM, barriers, volumetric spanners)
projection-free methods (FW algorithm, online FW)
Part 2 coming up.