OLSO: Online Learning and Stochastic Optimization
Yoram Singer, Google Research
August 10, 2016

References
- Introduction to Online Convex Optimization, Elad Hazan, Princeton University
- Online Learning and Online Convex Optimization, Shai Shalev-Shwartz, Hebrew University

Why Convex in a Non-Convex World?
- Building blocks for solving non-convex problems
- Better understanding of properties of convex & non-convex problems
- Algorithms for convex problems work well in non-convex settings

A Gentle Start

Using Experts' Advice: The Consistent Algorithm
- We consult with n weather forecasters h_1, ..., h_n
- One of the forecasters is always correct in her forecasts
- On day t forecaster j predicts: h_j(x_t) = [temp(Lima) ≥ 25°C] (+1 or −1)
- We keep a list of forecasters who have been consistent so far
- On day t+1 we choose a forecaster from the list for prediction
- By the end of day t+1, if y_{t+1} ≠ h_i(x_{t+1}), remove i from the list

The Consistent Algorithm
V_1 = [n] = {1, ..., n}
For t = 1, 2, ..., T, ...:
  Pick i ∈ V_t and predict ŷ_t = h_i(x_t)
  If ŷ_t ≠ y_t then V_{t+1} ← V_t \ {i}, else V_{t+1} ← V_t
Analysis: There is one perfect forecaster. In the worst case, every erroneous forecaster is selected once and then discarded. Thus Consistent makes at most n − 1 mistakes.
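A minimal Python sketch of this loop (illustrative only: experts is assumed to be a list of functions returning +1 or −1, and stream to yield (x_t, y_t) pairs):

    def consistent(experts, stream):
        """Run the Consistent algorithm and count its mistakes."""
        V = set(range(len(experts)))     # V_t: forecasters consistent so far
        mistakes = 0
        for x, y in stream:
            i = next(iter(V))            # pick any i in V_t
            if experts[i](x) != y:       # prediction error: discard i
                V.discard(i)
                mistakes += 1
        return mistakes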

Using Experts' Advice: The Halving Algorithm
We can suffer a much smaller number of errors by forming a consensus. In the following we use y_t, ŷ_t ∈ {−1, +1}.
V_1 = [n] = {1, ..., n}
For t = 1, 2, ..., T, ...:
  Set ŷ_t = sign( ∑_{i ∈ V_t} h_i(x_t) )   (majority vote)
  V_{t+1} ← V_t \ {i : h_i(x_t) ≠ y_t}
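A matching sketch of Halving, under the same assumed interface as above:

    def halving(experts, stream):
        """Run the Halving algorithm and count its mistakes."""
        V = set(range(len(experts)))
        mistakes = 0
        for x, y in stream:
            preds = {i: experts[i](x) for i in V}
            y_hat = 1 if sum(preds.values()) >= 0 else -1   # majority vote
            if y_hat != y:
                mistakes += 1
            V = {i for i, p in preds.items() if p == y}     # keep correct experts
        return mistakes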

The Halving Algorithm
Claim: Halving makes at most log_2(n) mistakes.
Proof: Whenever Halving makes a prediction error on round t, at least half of the forecasters in V_t are wrong. Therefore, on the next round, V_{t+1} is at most half the size of V_t. Rounds without a prediction mistake may or may not reduce the size of V_t, but we do not care.

Neat, but... What if a single consistent predictor does not exist?
Claim: Assume that the experts are deterministic, and so is the master algorithm that combines their predictions. Then, in the worst case, the best expert makes at most T/2 mistakes while the master algorithm makes T mistakes.
Proof: Assume there are 2 experts such that one always predicts +1 while the other always predicts −1. The master algorithm makes a deterministic prediction ŷ_t, and then an adversary sets y_t = −ŷ_t. Therefore, one of the experts makes at most T/2 mistakes while the master is always mistaken.

Notation
- Scalars: a, b, c, ..., i, j, k
- Column vectors: x, w, u, ...
- Standard vector space R^d with inner product ⟨w, v⟩ = w·v = wᵀv = ∑_j w_j v_j
- Matrices: A, B, C, ... (except for T)
- Eigenvalues of A: λ_max(A) ≝ λ_1(A) ≥ λ_2(A) ≥ ... ≥ λ_d(A) ≝ λ_min(A)
- Trace of A: Tr(A) ≝ ∑_i A_{i,i} = ∑_i λ_i(A)
- PSD: A ≽ 0 ⟺ ∀w: wᵀAw ≥ 0 ⟺ λ_d(A) ≥ 0
- Condition number of A: λ_max(A)/λ_min(A)

Conventions
- Input dimension: d
- Number of examples: m
- Step number / iteration index: t
- Number of iterations: T

Concepts in Convex Analysis

Convex Sets
A set Ω ⊆ R^d is convex if for any vectors u, v ∈ Ω, the line segment from u to v is in Ω:
∀α ∈ [0, 1]: αu + (1 − α)v ∈ Ω
(figure: a non-convex set vs. a convex set)

Convex Functions
Let Ω be a convex set. A function f: Ω → R is convex if for every u, v ∈ Ω and α ∈ [0, 1],
f(αu + (1 − α)v) ≤ αf(u) + (1 − α)f(v)
(figure: the chord from (u, f(u)) to (v, f(v)) lies above the graph of f between u and v)

Local Minimum = Global Minimum (Proof)
Let B(u, r) = {v : ‖v − u‖ ≤ r}.
f(u) is a local minimum ⟺ ∃r > 0 s.t. ∀v ∈ B(u, r): f(v) ≥ f(u)
For any v (not necessarily in B), there is α > 0 such that u + α(v − u) ∈ B(u, r), and therefore f(u) ≤ f(u + α(v − u))
Since f is convex,
f(u + α(v − u)) = f(αv + (1 − α)u) ≤ (1 − α)f(u) + αf(v)
Therefore, f(u) ≤ (1 − α)f(u) + αf(v) ⟹ f(u) ≤ f(v)
Since this holds for all v, f(u) is also a global minimum of f.

Gradients, or What Lies Beneath
Gradient of f at w: ∇f(w) = (∂f(w)/∂w_1, ..., ∂f(w)/∂w_d)
If f is convex & differentiable: ∀u, f(u) ≥ f(w) + ⟨∇f(w), u − w⟩
(figure: the tangent line at w lower-bounds f)

Subgradients
v is a subgradient of f at w if ∀u: f(u) ≥ f(w) + ⟨v, u − w⟩
The differential set ∂f(w) is the set of subgradients of f at w
Lemma: f is convex iff for every w, ∂f(w) ≠ ∅
(figure: several linear lower bounds supporting f at a non-differentiable point)

Lipschitzness, Strong Convexity, & Smoothness
A function f: Ω → R is ρ-Lipschitz if ∀w_1, w_2 ∈ Ω,
|f(w_1) − f(w_2)| ≤ ρ‖w_1 − w_2‖

Lipschitzness, Strong Convexity, & Smoothness
A function f: Ω → R is α-strongly convex if
f(w) ≥ f(u) + ∇f(u)ᵀ(w − u) + (α/2)‖w − u‖²

Lipschitzness, Strong Convexity, & Smoothness
A function f: Ω → R is β-smooth if
f(w) ≤ f(u) + ∇f(u)ᵀ(w − u) + (β/2)‖w − u‖²
which is equivalent to a Lipschitz condition on the gradients:
‖∇f(w) − ∇f(v)‖ ≤ β‖w − v‖
If f is twice differentiable, strong convexity & smoothness imply ∀w: αI ≼ ∇²f(w) ≼ βI
f is γ-well-conditioned where γ = α/β ≤ 1 (the reciprocal of the condition number)

Strong Convexity & Smoothness
f(u) + ∇f(u)ᵀ(w − u) + (β/2)‖w − u‖² ≥ f(w) ≥ f(u) + ∇f(u)ᵀ(w − u) + (α/2)‖w − u‖²
(figure: f squeezed between the two quadratic bounds)

Projection Onto a Convex Set
Projection of u onto Ω: P_Ω(u) ∈ argmin_{w ∈ Ω} ‖u − w‖
Theorem (Pythagoras, 500 BC): Let Ω ⊆ R^d be a convex set, u ∈ R^d, & w = P_Ω(u). Then for any v ∈ Ω we have ‖u − v‖ ≥ ‖w − v‖.
Unconstrained minimization: ∇f(w) = 0 ⟺ w ∈ argmin_{u ∈ R^d} f(u)
Theorem (Karush-Kuhn-Tucker): Let Ω ⊆ R^d be a convex set and w* ∈ argmin_{w ∈ Ω} f(w). If f is convex, then for any u ∈ Ω we have ∇f(w*)ᵀ(u − w*) ≥ 0.
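The projection is used repeatedly below; when Ω is an L2 ball it has a simple closed form. A minimal numpy sketch, assuming Ω = {w : ‖w‖ ≤ r} (the domain of the later exercises):

    import numpy as np

    def project_l2_ball(u, r):
        """P_Omega(u) for Omega = {w : ||w|| <= r}: rescale u if it lies outside."""
        norm = np.linalg.norm(u)
        return u if norm <= r else (r / norm) * u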

KKT Theorem - Illustration
(figure: a convex set Ω containing the constrained optimum w*, with −∇f(w*) pointing out of Ω toward the unconstrained optimum, and a point u ∈ Ω)

Deterministic Gradient Descent

Gradient Descent Algorithm (Basic)
Input: f, T, initial point w_1 ∈ Ω, step sizes {η_t}
Loop: For t = 1 to T
  Gradient stepping: w_{t+1} ← w_t − η_t ∇f(w_t)
  Optional projection: w_{t+1} ← P_Ω(w_{t+1})
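A minimal numpy sketch of this loop; grad, steps, and project are assumed callables standing in for ∇f, the schedule η_t, and P_Ω:

    import numpy as np

    def gradient_descent(grad, w1, steps, T, project=None):
        """Basic gradient descent with an optional projection onto Omega."""
        w = np.asarray(w1, dtype=float)
        for t in range(1, T + 1):
            w = w - steps(t) * grad(w)   # gradient step with eta_t = steps(t)
            if project is not None:
                w = project(w)           # optional projection
        return w

    # Example: f(w) = ||w||^2 / 2, so grad(w) = w; the iterates shrink toward 0.
    w_T = gradient_descent(lambda w: w, np.ones(3), steps=lambda t: 0.5, T=50)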

Convergence of GD
Theorem: Let f(w) be γ-well-conditioned (α-strongly convex & β-smooth) and set η_t = 1/β. Then unconstrained gradient descent converges exponentially fast:
f(w_{t+1}) − f(w*) ≤ (f(w_1) − f(w*)) e^{−γt}
Proof: From strong convexity,
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (α/2)‖x − y‖²
     ≥ min_z { f(x) + ∇f(x)ᵀ(z − x) + (α/2)‖x − z‖² }
     = f(x) − (1/(2α))‖∇f(x)‖²   [setting z = x − (1/α)∇f(x)]
Define ∇_t ≝ ∇f(w_t) and Δ_t ≝ f(w_t) − f(w*). Plugging in x ← w_t, y ← w* gives
(⋆) ‖∇_t‖² ≥ 2αΔ_t

Convergence of GD (cont.)
Δ_{t+1} − Δ_t = f(w_{t+1}) − f(w_t)
  ≤ ∇_tᵀ(w_{t+1} − w_t) + (β/2)‖w_{t+1} − w_t‖²   [β-smoothness]
  = −η_t‖∇_t‖² + (β/2)η_t²‖∇_t‖²   [GD step]
  = −(1/(2β))‖∇_t‖²   [η_t = 1/β]
  ≤ −(α/β)Δ_t   [from (⋆)]
Therefore Δ_{t+1} ≤ Δ_t (1 − α/β) ≤ Δ_1 (1 − γ)^t ≤ Δ_1 e^{−γt}

What if f is not strongly convex?
- Suppose that f(w) = log(1 + exp(−w·x))
- Exercise: prove that f is not strongly convex (consider the scalar function of z = w·x ∈ R)
- Smooth f by adding a quadratic term: f̃(w) = f(w) + (α̃/2)‖w‖²
- Use GD (with or without a domain constraint) as before
Lemma: Let w̃* = argmin_w f̃(w). Then ∃b s.t. ‖w̃*‖ ≤ b ≝ √(2f(0)/α̃)

Obtaining Strong Convexity by Smoothing
Thm: Assume f is β-smooth and set α̃ = β log(t)/(b² t). Then
Δ_t = O( β log(t)/t )
Proof outline (complete the details):
- Properties: f̃ is (α̃ + β)-smooth, α̃-strongly convex, and γ = α̃/(α̃ + β) well-conditioned
- Progress: Δ_t(f) ≤ Δ_t(f̃) + α̃b²
- Recursion: as before
Note: O(β/t) can be obtained via a direct analysis.

GD For General Convex Functions
Rates of Convergence
  General:             1/√T
  α-strongly-convex:   1/(αT)
  β-smooth:            β/T
  γ-well-conditioned:  e^{−γT}

Convex Learning & Optimization

Convex Losses
A convex loss function is denoted l(w, z), with l: Ω × Z → R_+ and the restriction w ∈ Ω ⊆ R^d
- Linear regression: Z = R^d × R, l(w, (x, y)) = (w·x − y)²
- Logistic regression: Z = R^d × {−1, +1}, l(w, (x, y)) = log(1 + exp(−y w·x))
- Hinge loss (SVM): Z = R^d × {−1, +1}, l(w, (x, y)) = [1 − y w·x]_+ where [z]_+ = max{z, 0}
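For concreteness, a minimal numpy sketch of these three losses and their (sub)gradients in w (an illustration, not code from the talk):

    import numpy as np

    def squared_loss(w, x, y):
        r = w @ x - y
        return r * r, 2 * r * x                     # loss and gradient

    def logistic_loss(w, x, y):
        m = y * (w @ x)                             # margin y<w, x>
        return np.log1p(np.exp(-m)), -y * x / (1.0 + np.exp(m))

    def hinge_loss(w, x, y):
        m = y * (w @ x)
        g = -y * x if m < 1 else np.zeros_like(w)   # a subgradient at the kink
        return max(0.0, 1.0 - m), g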

Predicting Using Experts' Advice
- Consistent & Halving combined the predictions of individual experts: Z = {−1, +1}^d × {−1, +1}
- The domain Ω of w is not convex: Ω = {0, 1}^d
- The loss is the 0-1 loss: l(w, (x, y)) = 1[sign(w·x) ≠ y]
- Finding w which minimizes the 0-1 loss over a sequence of examples is NP-hard

Why do Consistent and Halving work?
Paradigm: Online Learning and Optimization
- Loss relaxation: replace the 0-1 loss with the hinge loss, l_hinge ≥ l_{0-1}
  (figure: hinge loss vs. 0-1 loss as functions of the margin y⟨w, x⟩)
- Convexify the domain: Ω = {0, 1}^d → [0, 1]^d

Convex Learning Problem
(Ω, Z, l) is termed a convex learning problem if Ω is convex and l(w, z) is convex in w ∈ Ω for all z ∈ Z
Exercise: prove that linear / logistic regression & SVM employ convex losses
Example z_i induces a convex function f_i: Ω → R_+, with f_i(w) ≝ l(w, z_i)
Empirical Risk Minimization (aka "batch learning"):
min_{w ∈ Ω} (1/m) ∑_{i=1}^m l(w, z_i) = min_{w ∈ Ω} (1/m) ∑_{i=1}^m f_i(w)

Regret

Online Learning, or Online Convex Optimization
- Examples arrive one at a time
- Initialization: choose w_1 ∈ Ω
- On each round t:
    Loss: f_t(w_t)
    Update: w_{t+1} ← T_Ω(w_t, f_t(·))
- Focus on updates T which are gradient-based

Regret Bound Model
regret_T = ∑_{t=1}^T f_t(w_t) − min_{w ∈ Ω} ∑_{t=1}^T f_t(w)
- Online learning: w_1 → w_2 → ... → w_t → ...
- Compared to the fixed w* = argmin_{w ∈ Ω} ∑_{t=1}^T f_t(w)
- Makes no distributional assumptions on f_t(·)
- Pessimistic: w* is the minimizer in hindsight, and the bound holds for all T

Why OL & OCO?
- Simple to describe
- Simple to analyze
- Coupling between algorithm & analysis
- OCO ⟹ stochastic optimization & generalization

Online Gradient Descent

Online Gradient Descent
Convention: ∇_t ≝ ∇f_t(w_t) = ∇f_t(w)|_{w = w_t}
Input: domain Ω, step sizes {η_t}
Initialize: w_1 ∈ Ω
For t = 1, ..., T do:
  Predict using w_t
  Suffer loss f_t(w_t)
  Update w_{t+1} ← w_t − η_t ∇_t
  Project w_{t+1} ← P_Ω(w_{t+1})
Output: w̄_T = (1/T) ∑_{s=1}^T w_s   (SGM)
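A minimal numpy sketch of this loop; grads is assumed to yield one gradient function per round (g_t(w) = ∇f_t(w)), with steps and project as before:

    import numpy as np

    def online_gradient_descent(grads, w1, steps, project):
        """OGD with projection; also returns the averaged iterate (SGM output)."""
        w = np.asarray(w1, dtype=float)
        w_sum, T = np.zeros_like(w), 0
        for t, g in enumerate(grads, start=1):
            w_sum += w                         # accumulate w_t before updating
            w = project(w - steps(t) * g(w))   # gradient step, then P_Omega
            T = t
        return w_sum / T, w                    # (bar w_T, w_{T+1})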

GD vs. SG vs. SG with Averaging
(figure: iterate trajectories of the three methods)

Analysis of Online Gradient Descent
Assumptions: the f_t(·) are ρ-Lipschitz, max_{w ∈ Ω} ‖w − w*‖ ≤ r, and ∀t: η_t = η
Pythagoras to the rescue:
‖w_{t+1} − w*‖² = ‖P_Ω(w_t − η∇_t) − w*‖² ≤ ‖w_t − η∇_t − w*‖²
⟹ ‖w_{t+1} − w*‖² ≤ ‖w_t − w*‖² + η²‖∇_t‖² − 2η ∇_t·(w_t − w*)
⟹ 2∇_t·(w_t − w*) ≤ (‖w_t − w*‖² − ‖w_{t+1} − w*‖²)/η + ηρ²
Next we sum over t and telescope the summands...

Analysis of Online Gradient Descent (cont.)
∑_{t=1}^T ∇_t·(w_t − w*) ≤ (1/(2η)) (‖w_1 − w*‖² − ‖w_{T+1} − w*‖²) + Tηρ²/2 ≤ r²/(2η) + Tηρ²/2
Choosing η = r/(ρ√T) we get ∑_{t=1}^T ∇_t·(w_t − w*) ≤ rρ√T
From convexity, f_t(w_t) − f_t(w*) ≤ ∇_t·(w_t − w*)
Regret bound: regret_T = ∑_{t=1}^T (f_t(w_t) − f_t(w*)) ≤ rρ√T

Properties of Online Gradient Descent
- Infinite horizon: setting η_t = r/(ρ√t) yields, for all t ≥ 1, regret_t ≤ 3rρ√t/2
- If the f_t(·) are α-strongly convex, the regret of SG with η_t = 1/(αt) is regret_t ≤ ρ²(1 + log(t))/(2α)
- Lower bound! Algorithms for online convex optimization incur Ω(rρ√T) regret in the worst case, even when the f_t(·) are generated from an i.i.d. distribution

Regret vs. Convergence
- Regret bound for α-strongly convex: O(log(T)/α)
- Convergence rate for α-strongly convex: O(1/(αT))
- Next we discuss how Online Gradient Descent can be converted into the Stochastic Gradient Method for empirical risk minimization, yielding an approximation error of O(log(T)/T)
- Slower: a convergence factor of log(T)
- Two sets of parameters: the iterates w_t & the averages (1/t) ∑_{s ≤ t} w_s
- Much faster: O(1) vs. m gradient evaluations per step

Stochastic Gradient Method

Stochastic Gradient Method
- Define f(w) = (1/m) ∑_i l(w, z_i)
- ∇_i ≝ ∇l(w, z_i) is a stochastic estimate of ∇f(w), where i is chosen with probability 1/m
- Linearity of expectation: E[∇_i] = ∇f and E[‖∇_i‖²] ≤ ρ²
Theorem: E[f(w̄_T)] ≤ min_{w ∈ Ω} f(w) + 3rρ/(2√T), where the last term is regret_T / T
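A minimal numpy sketch of SGM, with the fixed step η = r/(ρ√T) from the OGD analysis; loss_grad and the L2-ball domain are assumptions for the example:

    import numpy as np

    def sgm(loss_grad, Z, w1, T, r, rho, rng=np.random.default_rng(0)):
        """Each step uses grad l(w, z_i) for a uniform i as an estimate of grad f."""
        eta = r / (rho * np.sqrt(T))
        w = np.asarray(w1, dtype=float)
        w_sum = np.zeros_like(w)
        for _ in range(T):
            w_sum += w
            z = Z[rng.integers(len(Z))]     # i chosen with probability 1/m
            w = w - eta * loss_grad(w, z)   # stochastic gradient step
            norm = np.linalg.norm(w)        # project onto {w : ||w|| <= r}
            if norm > r:
                w *= r / norm
        return w_sum / T                    # averaged iterate, bar w_T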

Exercise Time
- Prove or disprove: the hinge & logistic losses are strongly convex
- Find: strong convexity & Lipschitz constants for the logistic loss when ‖x‖ ≤ b
- Derive: a projection algorithm onto Ω = {w : ‖w‖ ≤ r}
- Generate synthetic data (see the sketch below): m = 10000, d = 1000; w*, x_i ∼ N(0, (1/√d) I); w* ← P_Ω(w*) with r = 1/2; y_i = sign(w*·x_i); flip y_i ← −y_i w.p. 0.05
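One plausible reading of the data-generation recipe as numpy code; the N(0, (1/√d) I) scaling and r = 1/2 are assumptions where the slide is ambiguous:

    import numpy as np

    rng = np.random.default_rng(0)
    m, d, r, flip_p = 10_000, 1_000, 0.5, 0.05

    w_star = rng.normal(0.0, d ** -0.5, size=d)   # assumed: w* ~ N(0, (1/sqrt(d)) I)
    norm = np.linalg.norm(w_star)
    if norm > r:                                  # w* <- P_Omega(w*), r = 1/2 assumed
        w_star *= r / norm

    X = rng.normal(0.0, d ** -0.5, size=(m, d))   # assumed: x_i ~ N(0, (1/sqrt(d)) I)
    y = np.sign(X @ w_star)
    flips = rng.random(m) < flip_p                # flip each label w.p. 0.05
    y[flips] = -y[flips]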

Coding Time
- Implement: GD, OGD, and SGM for logistic regression
- Implement: projection onto Ω
- Run: all algorithms
- Check: convergence and regret bounds by comparing w_T & w̄_T to w* (in terms of losses)
- Repeat: for the hinge loss (Pegasos); see the sketch below
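For the last item, a minimal Pegasos-style sketch: averaged SGM on the α-strongly-convex objective (α/2)‖w‖² + hinge loss, with η_t = 1/(αt) as in the strongly convex regret bound above (an illustration, not the reference Pegasos implementation):

    import numpy as np

    def pegasos(X, y, alpha, T, rng=np.random.default_rng(0)):
        """Averaged SGM for the L2-regularized hinge loss."""
        m, d = X.shape
        w, w_sum = np.zeros(d), np.zeros(d)
        for t in range(1, T + 1):
            i = rng.integers(m)
            eta = 1.0 / (alpha * t)
            active = y[i] * (X[i] @ w) < 1   # hinge subgradient active at w_t?
            w *= 1.0 - eta * alpha           # step along the regularizer's gradient
            if active:
                w += eta * y[i] * X[i]       # step along the hinge subgradient
            w_sum += w
        return w_sum / T                     # averaged iterate, bar w_T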