EECS 598-005: Theoretical Foundations of Machine Learning, Fall 2015
Lecture 23: Online convex optimization
Lecturer: Jacob Abernethy    Scribes: Vikas Dhiman
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

23.1 Online convex optimization: generalization of several algorithms

Many of the algorithms that we studied in this course are special cases of online convex optimization. We first recall a few of the algorithms studied so far; later we will argue how each of them is a special case of online convex optimization.

Expert setting

1  for t = 1, 2, ..., T do
2    Algorithm chooses w_t ∈ Δ_n (the probability simplex);
3    Nature chooses loss ℓ_t ∈ [0, 1]^n;
4  end
Algorithm 23.1: Expert setting

Linear Perceptron

1  for t = 1, 2, ..., T do
2    Algorithm chooses w_t ∈ R^n;
3    Nature chooses (x_t, y_t) ∈ R^n × {−1, +1};
4    Algorithm experiences loss ℓ_t = max(0, −y_t (w_t^⊤ x_t)), a convex surrogate for the mistake indicator 1[y_t (w_t^⊤ x_t) < 0];
5  end
Algorithm 23.2: Linear Perceptron

Online linear regression

1  for t = 1, 2, ..., T do
2    Algorithm chooses w_t ∈ R^n;
3    Nature chooses x_t ∈ R^n and y_t ∈ R;
4    Algorithm experiences loss ℓ_t = (w_t^⊤ x_t − y_t)^2;
5  end
Algorithm 23.3: Online linear regression

Portfolio selection

1  for t = 1, 2, ..., T do
2    Algorithm chooses w_t ∈ Δ_n;
3    Nature chooses the price Price_t(i) for every asset i;
4    Algorithm experiences gain g_t = w_t^⊤ x_t, where x_t(i) = Price_t(i) / Price_{t−1}(i);
5  end
Algorithm 23.4: Portfolio selection
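To make the shared interaction pattern concrete, here is a minimal Python sketch of the expert setting in Algorithm 23.1. It is not part of the original notes: the random loss vectors, the uniform-weights placeholder learner, and the convention that the round loss is the linear loss w_t · ℓ_t are illustrative assumptions.

```python
import numpy as np

def expert_setting(T=100, n=4, seed=0):
    """Run the expert-setting protocol of Algorithm 23.1 with a placeholder learner.

    The learner here simply plays the uniform distribution every round; any rule
    for choosing w_t in the simplex (e.g. exponential weights) could be plugged in.
    """
    rng = np.random.default_rng(seed)
    cum_loss = 0.0
    cum_expert_loss = np.zeros(n)
    for t in range(T):
        w = np.full(n, 1.0 / n)                     # Algorithm chooses w_t in the simplex Delta_n
        loss_vec = rng.uniform(0.0, 1.0, size=n)    # Nature chooses l_t in [0, 1]^n
        cum_loss += w @ loss_vec                    # assumed round loss: the linear loss w_t . l_t
        cum_expert_loss += loss_vec                 # running total of each expert's loss
    return cum_loss, cum_expert_loss.min()          # learner's total vs. the best single expert

print(expert_setting())
```

Comparing the learner's cumulative loss with that of the best single expert is exactly the kind of comparison that the regret, defined in the next section, formalizes.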

23.2 Online convex optimization

We now introduce the general form of online convex optimization, which generalizes all of the examples above. We are given:

  A convex, compact decision set X ⊆ R^n;
  A class of loss functions L = {ℓ : X → R | ℓ is convex and Lipschitz continuous}.

1  for t = 1, 2, ..., T do
2    Algorithm chooses w_t ∈ X;
3    Nature chooses ℓ_t ∈ L;
4    Algorithm experiences loss ℓ_t(w_t);
5    Algorithm updates w_{t+1};
6  end
Algorithm 23.5: Generalized pattern of online convex optimization

Our goal is to design an update rule (step 5 of Algorithm 23.5) that minimizes the regret

    Regret_T(Alg) := Σ_{t=1}^T ℓ_t(w_t) − min_{w* ∈ X} Σ_{t=1}^T ℓ_t(w*).                      (23.1)

All of the problems listed above are instances of this online convex optimization template. [1] proposed online gradient descent (OGD) with the regret bound Regret_T(OGD) ≤ DG√T, where D is the L2 diameter of the convex compact set X and G is the L2 Lipschitz bound on the losses in L.

23.3 Online gradient descent

Initialize: w_1 to the center of the set X. Let η > 0 be a step-size parameter.

1  for t = 1, 2, ..., T do
2    Algorithm chooses w_t ∈ X;
3    Nature chooses ℓ_t ∈ L;
4    Algorithm experiences loss ℓ_t(w_t);
5    w̃_{t+1} ← w_t − η ∇ℓ_t(w_t);
6    w_{t+1} ← Projection_X(w̃_{t+1}) := argmin_{w ∈ X} ‖w − w̃_{t+1}‖_2;
7  end
Algorithm 23.6: Online gradient descent (OGD)

Theorem 23.1. The regret defined in (23.1) for Algorithm 23.6, with a suitable choice of η, has an upper bound of DG√T.
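Before turning to the proof, here is a minimal Python sketch of Algorithm 23.6. It is not from the notes: the choice of X as a Euclidean ball, the quadratic loss sequence, and the helper names are illustrative assumptions.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Projection_X for X = {w : ||w||_2 <= radius}: rescale onto the ball if needed."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ogd(grads, w1, radius, eta):
    """Online gradient descent (Algorithm 23.6).

    grads: callables; grads[t](w) returns the (sub)gradient of l_t at w,
           revealed only after w_t has been played.
    Returns the sequence of plays w_1, ..., w_T.
    """
    w = np.array(w1, dtype=float)                   # w_1: a point of X (e.g. its center)
    plays = []
    for grad_t in grads:
        plays.append(w.copy())                      # Algorithm chooses w_t
        g = grad_t(w)                               # Nature reveals l_t; only its gradient is needed
        w = project_l2_ball(w - eta * g, radius)    # steps 5-6: gradient step, then projection
    return plays

# Illustrative use: losses l_t(w) = ||w - z_t||^2 on the unit ball, with the
# step size eta = D / (G * sqrt(T)) suggested by Theorem 23.1.
rng = np.random.default_rng(0)
T, n, D = 100, 3, 2.0                               # D = diameter of the unit ball
zs = [project_l2_ball(rng.normal(size=n), 1.0) for _ in range(T)]
G = 4.0                                             # ||2(w - z)||_2 <= 4 on the unit ball
grads = [(lambda w, z=z: 2.0 * (w - z)) for z in zs]
plays = ogd(grads, w1=np.zeros(n), radius=1.0, eta=D / (G * np.sqrt(T)))
print(plays[-1])
```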

Proof: Let w* be a comparator achieving the minimum in (23.1) and define the potential function φ_t = (1/(2η)) ‖w* − w_t‖_2^2. The change in potential at every iteration is given by

    φ_{t+1} − φ_t = (1/(2η)) (‖w* − w_{t+1}‖_2^2 − ‖w* − w_t‖_2^2)                              (23.2)
                  = (1/(2η)) (‖w* − Projection_X(w_t − η ∇ℓ_t(w_t))‖_2^2 − ‖w* − w_t‖_2^2)       (23.3)
                  ≤ (1/(2η)) (‖w* − w_t + η ∇ℓ_t(w_t)‖_2^2 − ‖w* − w_t‖_2^2)                     (23.4)
                  = (1/(2η)) (2η ∇ℓ_t(w_t)^⊤ (w* − w_t) + η^2 ‖∇ℓ_t(w_t)‖_2^2)                   (23.5)
                  = ∇ℓ_t(w_t)^⊤ (w* − w_t) + (η/2) ‖∇ℓ_t(w_t)‖_2^2                               (23.6)
                  ≤ ℓ_t(w*) − ℓ_t(w_t) + (η/2) ‖∇ℓ_t(w_t)‖_2^2                                   (23.7)
                  ≤ ℓ_t(w*) − ℓ_t(w_t) + (η/2) G^2.                                               (23.8)

Here (23.4) is due to the fact that projecting a point onto X can only bring it closer to any point inside X, (23.7) is due to the convexity of ℓ_t ∈ L (for any convex function f, ∇f(x)^⊤ (y − x) ≤ f(y) − f(x)), and (23.8) is because of the G-Lipschitz continuity of ℓ_t ∈ L.

Rearranging (23.8) and summing from t = 1 to t = T, we get

    Regret_T(OGD) = Σ_{t=1}^T (ℓ_t(w_t) − ℓ_t(w*)) ≤ Σ_{t=1}^T (φ_t − φ_{t+1} + (η/2) G^2)       (23.9)
                  = φ_1 − φ_{T+1} + (ηT/2) G^2                                                    (23.10)
                  ≤ φ_1 + (ηT/2) G^2 = (1/(2η)) ‖w* − w_1‖_2^2 + (ηT/2) G^2                       (23.11)
                  ≤ D^2/(2η) + (ηT/2) G^2.                                                        (23.12)

Inequality (23.12) holds because D is the diameter of the convex set X. Since (23.12) is true for every η > 0, we choose η = D/(G√T) to get

    Regret_T(OGD) ≤ DG√T.                                                                         (23.13)

Note that the Perceptron is OGD with ℓ_t(w_t) = max(0, −y_t (w_t^⊤ x_t)) and (sub)gradient given by

    ∇ℓ_t(w_t) = { 0           if y_t (w_t^⊤ x_t) > 0
                { −y_t x_t    otherwise.                                                          (23.14)
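To connect this back to Algorithm 23.2, here is a minimal Python sketch, not from the notes, of OGD run with the surrogate loss ℓ_t(w) = max(0, −y_t (w^⊤ x_t)) and the subgradient in (23.14). The synthetic separable data stream and the assumption X = R^n (so the projection step is vacuous) are illustrative; with these choices the gradient step reduces to the classical Perceptron update on mistake rounds.

```python
import numpy as np

def perceptron_as_ogd(stream, n, eta=1.0):
    """OGD on the loss l_t(w) = max(0, -y_t * (w @ x_t)) with no projection (X = R^n).

    Using the subgradient (23.14), the update only fires when y_t * (w @ x_t) <= 0,
    i.e. exactly on Perceptron mistake rounds.
    """
    w = np.zeros(n)
    mistakes = 0
    for x_t, y_t in stream:
        if y_t * (w @ x_t) <= 0:         # nonzero subgradient: -y_t * x_t
            w = w - eta * (-y_t * x_t)   # OGD step == classical Perceptron update w <- w + eta*y*x
            mistakes += 1
        # otherwise the subgradient is 0 and w is left unchanged
    return w, mistakes

# Illustrative linearly separable stream (made up for the example).
rng = np.random.default_rng(1)
w_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = np.sign(X @ w_star)
w_hat, m = perceptron_as_ogd(zip(X, y), n=3)
print(m, w_hat)
```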

23.4 Further Generalization

A more general algorithm is a modification of Follow The Regularized Leader (FTRL). In FTRL the update step is

    w_{t+1} = argmin_{w ∈ X}  Σ_{s=1}^t ℓ_s(w) + (1/η) R(w),                                      (23.15)

where R(w) is the regularizer. Define a new algorithm, FTRL run on the linearized losses ℓ̃_s(w) := ∇ℓ_s(w_s)^⊤ w, with the update step

    w_{t+1} = argmin_{w ∈ X}  Σ_{s=1}^t ℓ̃_s(w) + (1/η) R(w).                                      (23.16)

Note that OGD is a special case of this linearized FTRL with R(w) = (1/2) ‖w‖_2^2: for X = R^n, setting the gradient in (23.16) to zero gives w_{t+1} = −η Σ_{s=1}^t ∇ℓ_s(w_s), which is exactly the unprojected OGD iterate started from w_1 = 0. Similarly, the Exponential Weights Algorithm (EWA) is linearized FTRL with the entropic regularizer R(w) = Σ_{i=1}^k w_i log(w_i). These observations are fairly recent; only around 2008 did the equivalence of OGD, EWA, and FTRL become well understood. The upper bound on the regret of FTRL is given by

    Regret_T(FTRL) ≤ (1/η) (R(w*) − R(w_1)) + Σ_{t=1}^T D_R(w_t, w_{t+1}),                        (23.17)

where D_R(w_t, w_{t+1}) is the Bregman divergence associated with R. For more details, please refer to the surveys on the course website.

23.5 Online to Batch Conversion

Standard stochastic setting: assume that there is an unknown distribution D on X × Y, and let (x_t, y_t) ∼ D, drawn i.i.d., for all t ∈ {1, ..., T}. Run any online learning algorithm on this sequence.

input:  (x_t, y_t) ∼ D for all t ∈ {1, ..., T}
batch output:  w̄_T = (1/T) Σ_{t=1}^T w_t
1  for t = 1, 2, ..., T do
2    Algorithm chooses w_t ∈ X;
3    Algorithm observes (x_t, y_t);
4    Algorithm experiences loss given by the convex loss function ℓ(w_t^⊤ x_t, y_t);
5    w_{t+1} ← Projection_X(w_t − η ∇_w ℓ(w_t^⊤ x_t, y_t));
6  end
Algorithm 23.7: Online to batch conversion

Note that we output the average weight w̄_T rather than the last weight w_T. This difference arises because the objective function differs from regret: for the batch problem we want to minimize the risk,

    Risk(w) := E_{(x,y)∼D} [ℓ(w^⊤ x, y)].                                                          (23.18)
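Before stating the guarantee for this conversion, here is a minimal Python sketch of Algorithm 23.7. It is not from the notes: the squared loss, the unit-ball decision set X, the fixed step size, the synthetic Gaussian distribution D, and the Monte Carlo risk estimate are all illustrative assumptions.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Projection_X for X = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_to_batch(T=2000, n=5, eta=0.05, seed=0):
    """Algorithm 23.7 sketch: projected OGD on i.i.d. samples, returning the averaged iterate."""
    rng = np.random.default_rng(seed)
    w_true = project_l2_ball(rng.normal(size=n))    # hidden parameter defining D (illustrative)
    w = np.zeros(n)                                 # w_1
    w_sum = np.zeros(n)
    for t in range(T):
        w_sum += w                                  # accumulate the iterate w_t actually played
        x = rng.normal(size=n)                      # (x_t, y_t) ~ D
        y = w_true @ x + 0.1 * rng.normal()
        grad = 2.0 * (w @ x - y) * x                # gradient of the convex loss (w^T x - y)^2
        w = project_l2_ball(w - eta * grad)         # step 5 of Algorithm 23.7
    return w_sum / T, w_true                        # batch output w_bar_T (and the truth, for testing)

def estimated_risk(w, w_true, n_test=5000, seed=1):
    """Monte Carlo estimate of Risk(w) under the same illustrative distribution."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_test, len(w)))
    y = X @ w_true + 0.1 * rng.normal(size=n_test)
    return np.mean((X @ w - y) ** 2)

w_bar, w_true = online_to_batch()
print(estimated_risk(w_bar, w_true))
```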

Theorem 23.2. For every w* ∈ X, the output of the online to batch conversion enjoys

    Risk(w̄_T) − Risk(w*) ≤ E[ Regret_T(Alg) / T ].

Proof: From the definition of the risk of the averaged weight vector we get

    Risk(w̄_T) = E_{(x,y)∼D}[ℓ(w̄_T^⊤ x, y)] = E_{(x,y)∼D}[ℓ(((1/T) Σ_{t=1}^T w_t)^⊤ x, y)]          (23.19)
              ≤ (1/T) Σ_{t=1}^T E_{(x,y)∼D}[ℓ(w_t^⊤ x, y)]                                           (23.20)
              = (1/T) Σ_{t=1}^T E_{(x_t,y_t)∼D}[ℓ(w_t^⊤ x_t, y_t)] = E[(1/T) Σ_{t=1}^T ℓ(w_t^⊤ x_t, y_t)]   (23.21)
              ≤ E[(1/T) Σ_{t=1}^T ℓ(w*^⊤ x_t, y_t) + Regret_T(Alg)/T]                                (23.22)
              = Risk(w*) + E[ Regret_T(Alg) / T ].                                                    (23.23)

Here (23.20) is due to the convexity of the loss function with respect to w (Jensen's inequality). Note that the random variables (x, y) ∼ D are replaced with the identically distributed (x_t, y_t) ∼ D in (23.21); the terms ℓ(w_t^⊤ x_t, y_t) are exactly the online losses ℓ_t(w_t). Equation (23.22) uses the definition of Regret_T(Alg) given in (23.1).

References

[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.