OLSO: Online Learning and Stochastic Optimization
Yoram Singer, Google Research
August 10, 2016
References
Elad Hazan, Introduction to Online Convex Optimization, Princeton University
Shai Shalev-Shwartz, Online Learning and Online Convex Optimization, Hebrew University
Why Convex in a Non-Convex World?
Building blocks for solving non-convex problems
Better understanding of properties of convex & non-convex problems
Algorithms for convex problems often work well in non-convex settings
A Gentle Start
Using Experts' Advice: The Consistent Algorithm
We consult with n weather forecasters h_1, ..., h_n
One of the forecasters is always correct in her forecasts
On day t forecaster j predicts h_j(x_t) ∈ {+1, −1}, e.g. [temp(Lima) ≥ 25°C]
We keep a list of forecasters who have been consistent so far
On day t + 1 we choose a forecaster from the list for prediction
By the end of day t + 1, if y_{t+1} ≠ h_i(x_{t+1}), remove i from the list
The Consistent Algorithm
V_1 = [n] = {1, ..., n}
For t = 1, 2, ..., T, ...:
  Pick i ∈ V_t and predict ŷ_t = h_i(x_t)
  If ŷ_t ≠ y_t then V_{t+1} ← V_t \ {i}, else V_{t+1} ← V_t
Analysis
There is one perfect forecaster. In the worst case each of the erroneous forecasters is selected once and then discarded. Thus Consistent is going to make at most n − 1 mistakes.
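A minimal Python sketch of Consistent, assuming the experts are given as a list of prediction functions and labeled examples arrive as a stream (the interface and names are illustrative, not from the slides):

```python
def consistent(experts, stream):
    """Consistent: keep the version space V of experts with no mistakes yet.

    experts: list of functions h_i(x) -> +1 or -1
    stream:  iterable of (x_t, y_t) pairs; assumes one expert is perfect
    Returns the number of mistakes made by the master.
    """
    V = set(range(len(experts)))   # V_1 = [n]
    mistakes = 0
    for x, y in stream:
        i = next(iter(V))          # pick any still-consistent expert
        if experts[i](x) != y:
            mistakes += 1
            V.discard(i)           # discard the expert that erred
    return mistakes
```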
Using Experts' Advice: The Halving Algorithm
We can suffer a much smaller number of mistakes by forming a consensus. In the following we use y_t, ŷ_t ∈ {−1, +1}.
V_1 = [n] = {1, ..., n}
For t = 1, 2, ..., T, ...:
  Set ŷ_t = sign( Σ_{i ∈ V_t} h_i(x_t) )   (majority vote)
  V_{t+1} ← V_t \ {i : h_i(x_t) ≠ y_t}
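A matching sketch of Halving under the same assumed interface; the only changes are the majority vote and the removal of every erring expert:

```python
def halving(experts, stream):
    """Halving: predict with the majority vote of the version space V.

    Makes at most log2(n) mistakes when a perfect expert exists.
    """
    V = set(range(len(experts)))
    mistakes = 0
    for x, y in stream:
        votes = sum(experts[i](x) for i in V)
        y_hat = 1 if votes >= 0 else -1           # majority vote, ties -> +1
        if y_hat != y:
            mistakes += 1
        V = {i for i in V if experts[i](x) == y}  # drop all erring experts
    return mistakes
```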
The Halving Algorithm
Claim: Halving is going to make at most log₂(n) mistakes.
Proof: Whenever Halving makes a prediction error on round t, at least half of the forecasters in V_t are wrong. Therefore, on the next round V_{t+1} is at most half the size of V_t. The perfect forecaster is never removed, so |V_t| ≥ 1 always, and the size can be halved at most log₂(n) times. Rounds without a prediction mistake may or may not reduce the size of V_t, but we do not care.
Neat, but...
What if a single consistent predictor does not exist?
Claim: Assume that the experts are deterministic, and so is the master algorithm which combines their predictions. Then there is a sequence on which the best expert makes at most T/2 mistakes while the master algorithm makes T mistakes.
Proof: Assume that there are 2 experts such that one always predicts +1 while the other always predicts −1. The master algorithm makes a deterministic prediction ŷ_t and then an adversary sets y_t = −ŷ_t. Therefore, one of the experts makes at most T/2 mistakes while the master is always mistaken.
Notation
Scalars: a, b, c, ..., i, j, k
Column vectors: x, w, u, ...
Standard vector space R^d with inner product ⟨w, v⟩ = w · v = wᵀv = Σ_j w_j v_j
Matrices: A, B, C, ... (except for T)
Eigenvalues of A: λ_max(A) := λ_1(A) ≥ λ_2(A) ≥ ... ≥ λ_d(A) =: λ_min(A)
Trace of A: Tr(A) := Σ_i A_{i,i} = Σ_i λ_i(A)
PSD: A ⪰ 0 ⟺ ∀w : wᵀAw ≥ 0 ⟺ λ_d(A) ≥ 0
Condition number of A: λ_max(A)/λ_min(A)
Conventions
Input dimension: d
Number of examples: m
Step number / iteration index: t
Number of iterations: T
Concepts in Convex Analysis
Convex Sets
A set Ω ⊆ R^d is convex if for any vectors u, v ∈ Ω, the line segment from u to v is in Ω:
∀α ∈ [0, 1] : αu + (1 − α)v ∈ Ω
[Figure: a non-convex set and a convex set]
Convex Functions
Let Ω be a convex set. A function f : Ω → R is convex if for every u, v ∈ Ω and α ∈ [0, 1],
f(αu + (1 − α)v) ≤ αf(u) + (1 − α)f(v)
[Figure: the chord value αf(u) + (1 − α)f(v) lies above f(αu + (1 − α)v) between u and v]
Local Minimum ⟹ Global Minimum (Proof)
Let B(u, r) = {v : ‖v − u‖ ≤ r}
f(u) is a local minimum ⟺ ∃r > 0 s.t. ∀v ∈ B(u, r) : f(v) ≥ f(u)
For any v (not necessarily in B), ∃α > 0 such that u + α(v − u) ∈ B(u, r), and therefore f(u) ≤ f(u + α(v − u))
Since f is convex, f(u + α(v − u)) = f(αv + (1 − α)u) ≤ (1 − α)f(u) + αf(v)
Therefore, f(u) ≤ (1 − α)f(u) + αf(v) ⟹ f(u) ≤ f(v)
Since this holds for all v, f(u) is also a global minimum of f.
Gradients or What Lies Beneath?
Gradient of f at w: ∇f(w) = (∂f(w)/∂w_1, ..., ∂f(w)/∂w_d)
f is convex & differentiable: ∀u, f(u) ≥ f(w) + ⟨∇f(w), u − w⟩
[Figure: the tangent f(w) + ⟨u − w, ∇f(w)⟩ lower-bounds f(u)]
Subgradients
v is a subgradient of f at w if ∀u, f(u) ≥ f(w) + ⟨v, u − w⟩
The differential set ∂f(w) is the set of subgradients of f at w
Lemma: f is convex iff for every w, ∂f(w) ≠ ∅
[Figure: a linear lower bound f(w) + ⟨u − w, v⟩ supporting f at w]
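As a concrete illustration (not from the slides), f(w) = |w| is non-differentiable at 0, where any v ∈ [−1, 1] is a valid subgradient; a minimal sketch:

```python
def subgradient_abs(w: float) -> float:
    """Return one subgradient of f(w) = |w|.

    Away from 0 the subgradient is the ordinary derivative;
    at w = 0 any v in [-1, 1] satisfies |u| >= |0| + v * (u - 0).
    """
    if w > 0:
        return 1.0
    if w < 0:
        return -1.0
    return 0.0  # an arbitrary valid choice from [-1, 1]
```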
Lipschitzness, Strong Convexity, & Smoothness
A function f : Ω → R is ρ-Lipschitz if ∀w_1, w_2 ∈ Ω,
|f(w_1) − f(w_2)| ≤ ρ ‖w_1 − w_2‖
Lipschitzness, Strong Convexity, & Smoothness
A function f : Ω → R is α-strongly convex if
f(w) ≥ f(u) + ∇f(u)ᵀ(w − u) + (α/2) ‖w − u‖²
Lipschitzness, Strong Convexity, & Smoothness
A function f : Ω → R is β-smooth if
f(w) ≤ f(u) + ∇f(u)ᵀ(w − u) + (β/2) ‖w − u‖²
which is equivalent to a Lipschitz condition on the gradients:
‖∇f(w) − ∇f(v)‖ ≤ β ‖w − v‖
If f is twice differentiable, strong convexity & smoothness imply
∀w : αI ⪯ ∇²f(w) ⪯ βI
f is γ-well-conditioned where γ = α/β ≤ 1 (the condition number of f)
Strong Convexity & Smoothness
[Figure: f(w) is sandwiched between the lower bound f(u) + ∇f(u)ᵀ(w − u) + (α/2)‖w − u‖² and the upper bound f(u) + ∇f(u)ᵀ(w − u) + (β/2)‖w − u‖²]
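A standard concrete example (an addition, not on the slide): for a quadratic f(w) = ½ wᵀAw with A ⪰ 0, we have ∇f(w) = Aw and ∇²f(w) = A, so λ_min(A) I ⪯ ∇²f(w) ⪯ λ_max(A) I; hence f is λ_min(A)-strongly convex and λ_max(A)-smooth, with γ = λ_min(A)/λ_max(A).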
Projection Onto a Convex Set
Projection of u onto Ω: P_Ω(u) ∈ argmin_{w ∈ Ω} ‖u − w‖
Theorem (Pythagoras, c. 500 BC): Let Ω ⊆ R^d be a convex set, u ∈ R^d, & w = P_Ω(u). Then for any v ∈ Ω we have ‖u − v‖ ≥ ‖w − v‖.
Unconstrained minimization: ∇f(w) = 0 ⟺ w ∈ argmin_{u ∈ R^d} f(u)
Theorem (Karush-Kuhn-Tucker): Let Ω ⊆ R^d be a convex set and w* ∈ argmin_{w ∈ Ω} f(w). If f is convex, then for any u ∈ Ω we have ∇f(w*)ᵀ(u − w*) ≥ 0.
KKT Theorem - Illustration
[Figure: a convex set Ω; the constrained optimum w* lies on the boundary, −∇f(w*) points out of Ω toward the unconstrained optimum, so every u ∈ Ω satisfies ∇f(w*)ᵀ(u − w*) ≥ 0]
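For simple sets the projection has a closed form. A minimal sketch for the box [0, 1]^d (the set used later to convexify the experts' domain); the code is illustrative, not from the slides:

```python
import numpy as np

def project_box(u: np.ndarray, lo: float = 0.0, hi: float = 1.0) -> np.ndarray:
    """Euclidean projection onto the box [lo, hi]^d.

    argmin_{w in box} ||u - w|| decomposes coordinate-wise,
    so the projection is simply clipping each coordinate.
    """
    return np.clip(u, lo, hi)
```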
Deterministic Gradient Descent
Gradient Descent Algorithm (Basic)
Input: f, T, initial point w_1 ∈ Ω, step sizes {η_t}
Loop: For t = 1 to T
  Gradient Step: w_{t+1} ← w_t − η_t ∇f(w_t)
  Optional Projection: w_{t+1} ← P_Ω(w_{t+1})
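A direct Python sketch of this loop; the gradient oracle, step-size schedule, and optional projection are passed in as functions (names are illustrative):

```python
import numpy as np

def gradient_descent(grad, w1, step, T, project=None):
    """Basic (projected) gradient descent.

    grad:    function w -> gradient of f at w
    w1:      initial point in Omega
    step:    function t -> step size eta_t (t is 1-based)
    project: optional function w -> P_Omega(w)
    """
    w = np.asarray(w1, dtype=float)
    for t in range(1, T + 1):
        w = w - step(t) * grad(w)   # gradient step
        if project is not None:
            w = project(w)          # optional projection onto Omega
    return w
```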
Convergence of GD
Theorem: Let f(w) be γ-well-conditioned (α-strongly convex & β-smooth) and set η_t = 1/β. Then unconstrained gradient descent converges exponentially fast:
f(w_{t+1}) − f(w*) ≤ Δ_1 e^{−γt},  where Δ_t := f(w_t) − f(w*)
Proof: From strong convexity, for any x, y:
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (α/2)‖x − y‖²
     ≥ min_z { f(x) + ∇f(x)ᵀ(z − x) + (α/2)‖x − z‖² }
     = f(x) − ‖∇f(x)‖²/(2α)   [setting z = x − (1/α)∇f(x)]
Define ∇_t := ∇f(w_t) and Δ_t := f(w_t) − f(w*). Taking x = w_t, y = w* gives
(★) Δ_t ≤ ‖∇_t‖²/(2α)
Convergence of GD (cont.)
Δ_{t+1} − Δ_t = f(w_{t+1}) − f(w_t)
  ≤ ∇_tᵀ(w_{t+1} − w_t) + (β/2)‖w_{t+1} − w_t‖²   [β-smoothness]
  = −η_t ‖∇_t‖² + (β/2) η_t² ‖∇_t‖²               [GD step]
  = −(1/(2β)) ‖∇_t‖²                               [η_t = 1/β]
  ≤ −(α/β) Δ_t                                     [from (★)]
Therefore Δ_{t+1} ≤ Δ_t (1 − α/β) ≤ Δ_1 (1 − γ)^t ≤ Δ_1 e^{−γt}
What if f is not strongly convex?
Suppose that f(w) = log(1 + exp(−wᵀx))
Prove that f is not strongly convex (consider it as a function of z = wᵀx ∈ R)
Smooth f by adding a quadratic term: f̃(w) = f(w) + (α̃/2)‖w‖²
Use GD (with or without a domain constraint) as before
Lemma: Let w̃* = argmin_w f̃(w); then ∃b s.t. ‖w̃*‖ ≤ b = √(2f(0)/α̃)
Obtaining Strong Convexity by Smothering
Thm: Assume f is β-smooth and set α̃ = β log(t)/(b² t). Then
Δ_t = O( β log(t)/t )
Proof outline (complete details omitted):
Properties: f̃ is (α̃ + β)-smooth, α̃-strongly convex, and γ = α̃/(α̃ + β)-well-conditioned
Progress: Δ_t(f) ≤ Δ_t(f̃) + α̃b²
Recursion: as before
Note: O(β/t) can be obtained via a direct analysis
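To make the construction concrete, a sketch of f̃ and its gradient for a single example x (the single-example setting and names are illustrative assumptions):

```python
import numpy as np

def f_tilde(w, x, alpha):
    """Regularized logistic objective: log(1 + e^{-w.x}) + (alpha/2)||w||^2."""
    return np.logaddexp(0.0, -(w @ x)) + 0.5 * alpha * (w @ w)

def grad_f_tilde(w, x, alpha):
    """Gradient: the logistic-loss gradient plus the alpha * w term."""
    z = w @ x
    sigma = 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable sigmoid(z)
    return (sigma - 1.0) * x + alpha * w    # d/dz log(1+e^{-z}) = sigma(z) - 1
```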
GD For General Convex Functions: Rates of Convergence
General: 1/√T
α-strongly convex: 1/(αT)
β-smooth: β/T
γ-well-conditioned: e^{−γT}
Convex Learning & Optimization
Convex Losses
Convex loss function denoted ℓ(w, z), ℓ : Ω × Z → R₊
Restriction: w ∈ Ω ⊆ R^d
Linear Regression: Z = R^d × R, ℓ(w, (x, y)) = (wᵀx − y)²
Logistic Regression: Z = R^d × {−1, +1}, ℓ(w, (x, y)) = log(1 + exp(−y wᵀx))
Hinge Loss (SVM): Z = R^d × {−1, +1}, ℓ(w, (x, y)) = [1 − y wᵀx]₊, where [z]₊ = max{z, 0}
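The three losses transcribed directly into Python (a plain reading of the formulas above):

```python
import numpy as np

def squared_loss(w, x, y):
    """Linear regression: (w.x - y)^2."""
    return (w @ x - y) ** 2

def logistic_loss(w, x, y):
    """Logistic regression, y in {-1,+1}: log(1 + exp(-y w.x))."""
    return np.logaddexp(0.0, -y * (w @ x))  # stable log(1 + e^{-z})

def hinge_loss(w, x, y):
    """SVM hinge, y in {-1,+1}: max(1 - y w.x, 0)."""
    return max(1.0 - y * (w @ x), 0.0)
```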
Predicting Using Experts' Advice
Consistent & Halving combined the predictions of individual experts: Z = {−1, +1}^d × {−1, +1}
The domain Ω of w is not convex: Ω = {0, 1}^d
The loss is the 0-1 loss: ℓ(w, (x, y)) = 1[sign(wᵀx) ≠ y]
Finding w which minimizes the 0-1 loss over a sequence of examples is NP-hard
Why do Consistent and Halving work?
Paradigm: Online Learning and Optimization
Loss Relaxation: replace the 0-1 loss with the hinge loss
[Figure: ℓ_hinge upper-bounds ℓ_{0-1} as a function of y⟨w, x⟩]
Convexify the Domain: Ω = {0, 1}^d → [0, 1]^d
Convex Learning Problem
(Ω, Z, ℓ) is termed convex-learning if Ω is convex and ℓ(w, z) is convex in w ∈ Ω for all z ∈ Z
Prove that linear / logistic regression & SVM employ convex losses
Example z_i induces a convex function f_i : Ω → R₊, f_i(w) := ℓ(w, z_i)
Empirical Risk Minimization (aka Batch Learning):
min_{w ∈ Ω} (1/m) Σ_{i=1}^m ℓ(w, z_i) = min_{w ∈ Ω} (1/m) Σ_{i=1}^m f_i(w)
Regret
Online Learning or Online Convex Optimization
Examples arrive one at a time
Initialization: choose w_1 ∈ Ω
On each round t:
  Loss: suffer f_t(w_t)
  Update: w_{t+1} ← T_Ω(w_t, f_t(·))
Focus on updates T which are gradient-based
Regret Bound Model
regret_T = Σ_{t=1}^T f_t(w_t) − min_{w ∈ Ω} Σ_{t=1}^T f_t(w)
Online learning: w_1 → w_2 → ... → w_t → ...
Compared to the fixed w* = argmin_{w ∈ Ω} Σ_{t=1}^T f_t(w)
Makes no distributional assumptions on f_t(·)
Pessimistic: w* is the minimizer in hindsight, and the bound holds for all T
Why OL & OCO?
Simple to describe
Simple to analyze
Coupling between algorithm & analysis
OCO ⟹ Stochastic Optimization & Generalization
Online Gradient Descent
Online Gradient
Convention: ∇_t := ∇f_t(w_t) = ∇f_t(w)|_{w=w_t}
Input: domain Ω, step sizes {η_t}
Initialize: w_1 ∈ Ω
For t = 1, ..., T do:
  Predict using w_t
  Suffer loss f_t(w_t)
  Update w_{t+1} ← w_t − η_t ∇_t
  Project w_{t+1} ← P_Ω(w_{t+1})
Output: w̄_T = (1/T) Σ_{s=1}^T w_s   (SGM)
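A sketch of the loop above, including the averaged output w̄_T used by the SGM analysis later (interface and names are illustrative):

```python
import numpy as np

def online_gradient_descent(grads, w1, step, project):
    """Online (projected) gradient descent with iterate averaging.

    grads:   sequence of functions; grads[t-1] maps w -> gradient of f_t at w
    w1:      initial point in Omega
    step:    function t -> eta_t (t is 1-based)
    project: function w -> P_Omega(w)
    Returns (last iterate w_{T+1}, average (1/T) sum_t w_t).
    """
    w = np.asarray(w1, dtype=float)
    avg = np.zeros_like(w)
    T = len(grads)
    for t in range(1, T + 1):
        avg += w                                 # accumulate w_t before update
        w = project(w - step(t) * grads[t - 1](w))
    return w, avg / T
```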
GD vs. SG vs. SG with Averaging
[Figure: comparison of the iterates of GD, SG, and SG with averaging]
Analysis of Online Gradient Descent
Assumptions: the f_t(·) are ρ-Lipschitz, max_{w ∈ Ω} ‖w − w*‖ ≤ r, ∀t : η_t = η
Pythagoras to the rescue:
‖w_{t+1} − w*‖² = ‖P_Ω(w_t − η_t ∇_t) − w*‖² ≤ ‖w_t − η_t ∇_t − w*‖²
Expanding:
‖w_{t+1} − w*‖² ≤ ‖w_t − w*‖² + η_t² ‖∇_t‖² − 2η_t ∇_tᵀ(w_t − w*)
Rearranging (and using ‖∇_t‖ ≤ ρ):
2 ∇_tᵀ(w_t − w*) ≤ (‖w_t − w*‖² − ‖w_{t+1} − w*‖²)/η + ηρ²
Next we sum over t and telescope the summands...
Analysis of Online Gradient Descent (cont.)
Σ_{t=1}^T ∇_tᵀ(w_t − w*) ≤ (1/(2η)) (‖w_1 − w*‖² − ‖w_{T+1} − w*‖²) + Tηρ²/2
                          ≤ r²/(2η) + Tηρ²/2
Choosing η = r/(ρ√T) we get Σ_{t=1}^T ∇_tᵀ(w_t − w*) ≤ rρ√T
From convexity: f_t(w_t) − f_t(w*) ≤ ∇_tᵀ(w_t − w*)
Regret bound: regret_T = Σ_{t=1}^T (f_t(w_t) − f_t(w*)) ≤ rρ√T
Properties of Online Gradient Descent
Infinite horizon: setting η_t = r/(ρ√t) yields ∀t ≥ 1 : regret_t ≤ 3rρ√t/2
If the f_t(·) are α-strongly convex, the regret of SG with η_t = 1/(αt) is
regret_t ≤ (ρ²/(2α)) (1 + log(t))
Lower Bound! Algorithms for online convex optimization incur Ω(rρ√T) regret in the worst case, even when the f_t(·) are generated from an i.i.d. distribution
Regret vs. Convergence
Regret bound for α-strongly convex: O(log(T)/α)
Convergence rate for α-strongly convex: O(1/(αT))
Next we discuss how Online Gradient Descent can be converted to the Stochastic Gradient Method for empirical risk minimization, yielding an approximation error of O(log(T)/T)
Slower: a convergence factor of log(T)
Two sets of parameters: the iterates w_t & the averages (1/t) Σ_{s≤t} w_s
Much faster iterations: O(1) work per step vs. m
Stochastic Gradient Method
Stochastic Gradient Method
Define f(w) = (1/m) Σ_i ℓ(w, z_i)
∇_i := ∇ℓ(w, z_i) is a stochastic estimate of ∇f(w), where i is chosen with probability 1/m
By linearity of expectation E[∇_i] = ∇f, and assume E[‖∇_i‖²] ≤ ρ²
Theorem: E[f(w̄_T)] ≤ min_{w ∈ Ω} f(w) + 3rρ/(2√T), where the last term equals regret_T/T
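A sketch of SGM for empirical risk minimization, sampling one example uniformly per step (names are illustrative; it reuses the loss gradients from the earlier sketches):

```python
import numpy as np

def sgm(loss_grad, data, w1, step, project, T, seed=0):
    """Stochastic Gradient Method for min_w (1/m) sum_i l(w, z_i).

    loss_grad: function (w, z) -> gradient of l(w, z) with respect to w
    data:      list of examples z_1, ..., z_m
    Returns the averaged iterate w_bar_T.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w1, dtype=float)
    avg = np.zeros_like(w)
    for t in range(1, T + 1):
        z = data[rng.integers(len(data))]  # i chosen with probability 1/m
        avg += w
        w = project(w - step(t) * loss_grad(w, z))
    return avg / T
```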
Exercise Time
Prove or disprove: the hinge & logistic losses are strongly convex
Find: strong convexity & Lipschitz constants for the logistic loss when ‖x‖ ≤ b
Derive: a projection algorithm onto Ω = {w : ‖w‖ ≤ r}
Generate synthetic data:
  m = 10000, d = 1000
  w*, x_i ∼ N(0, (1/√d) I_d)
  w* ← P_K(w*) for r = 1/(2√d)
  y_i = sign((w*)ᵀ x_i); y_i ← −y_i w.p. 0.05
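A data-generation sketch under the reconstruction above; the variance 1/√d and the radius r = 1/(2√d) are read off the garbled slide and may differ from the original intent:

```python
import numpy as np

def make_data(m=10_000, d=1_000, flip=0.05, seed=0):
    """Synthetic linear-classification data with 5% label noise."""
    rng = np.random.default_rng(seed)
    std = d ** -0.25                         # entries ~ N(0, 1/sqrt(d))
    w_star = rng.normal(0.0, std, size=d)
    X = rng.normal(0.0, std, size=(m, d))
    r = 1.0 / (2.0 * np.sqrt(d))             # assumed radius of K
    norm = np.linalg.norm(w_star)
    if norm > r:
        w_star *= r / norm                   # w* <- P_K(w*)
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    flips = rng.random(m) < flip             # flip each label w.p. 0.05
    y[flips] = -y[flips]
    return X, y, w_star
```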
Coding Time
Implement: GD, OGD, and SGM for logistic regression
Implement: projection onto Ω
Run: all algorithms
Check: convergence and regret bounds by comparing w_T & w̄_T to w* (in terms of losses)
Repeat: for the hinge loss (Pegasos)