MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017
About this class We introduce a class of learning algorithms based on Tikhonov regularization and study the computational aspects of these algorithms. MLCC 2017 2
Empirical Risk Minimization (ERM) Empirical Risk Minimization (ERM): probably the most popular approach to designing learning algorithms. General idea: consider the empirical error
Ê(f) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i))
as a proxy for the expected error
E(f) = E[ℓ(y, f(x))] = ∫ dx dy p(x, y) ℓ(y, f(x)). MLCC 2017 3
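To make the proxy concrete, here is a minimal NumPy sketch (not from the slides; the function names and the square loss used in the example are illustrative choices):

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """Empirical error: average of loss(y_i, f(x_i)) over the n samples."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

# Example: square loss and a fixed linear predictor
square_loss = lambda y, t: (y - t) ** 2
w_true = np.array([1.0, -2.0, 0.5])
X = np.random.randn(100, 3)
y = X @ w_true + 0.1 * np.random.randn(100)
print(empirical_risk(lambda x: x @ w_true, square_loss, X, y))
```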
The Expected Risk is Not Computable Recall that ℓ measures the price we pay for predicting f(x) when the true label is y. E(f) cannot be directly computed, since p(x, y) is unknown. MLCC 2017 4
From Theory to Algorithms: The Hypothesis Space To turn the above idea into an actual algorithm, we fix a suitable hypothesis space H and minimize Ê over H. H should allow feasible computations and be rich, since the complexity of the problem is not known a priori. MLCC 2017 5
Example: Space of Linear Functions The simplest example of H is the space of linear functions:
H = {f : R^d → R : ∃ w ∈ R^d such that f(x) = x^T w, ∀ x ∈ R^d}.
Each function f is defined by a vector w: f_w(x) = x^T w. MLCC 2017 6
Rich Hs May Require Regularization If H is rich enough, solving ERM may cause overfitting (solutions highly dependent on the data). Regularization techniques restore stability and ensure generalization. MLCC 2017 7
Tikhonov Regularization Consider the Tikhonov regularization scheme
min_{w ∈ R^d} Ê(f_w) + λ‖w‖²   (1)
It describes a large class of methods sometimes called Regularization Networks. MLCC 2017 8
The Regularizer ‖w‖² is called the regularizer. It controls the stability of the solution and prevents overfitting. λ balances the error term and the regularizer. MLCC 2017 9
Loss Functions Different loss functions ℓ induce different classes of methods. We will see common aspects and differences that arise when considering different loss functions. There exists no general computational scheme to solve Tikhonov regularization: the solution depends on the considered loss function. MLCC 2017 10
The Regularized Least Squares Algorithm Regularized Least Squares: Tikhonov regularization
min_{w ∈ R^D} Ê(f_w) + λ‖w‖²,   Ê(f_w) = (1/n) Σ_{i=1}^n ℓ(y_i, f_w(x_i))   (2)
with the square loss function
ℓ(y, f_w(x)) = (y − f_w(x))².
We then obtain the RLS optimization problem (linear model):
min_{w ∈ R^D} (1/n) Σ_{i=1}^n (y_i − w^T x_i)² + λ w^T w,   λ ≥ 0.   (3)
MLCC 2017 11
Matrix Notation Define the n × d matrix X_n, whose rows are the input points, and the n × 1 vector Y_n, whose entries are the corresponding outputs. With this notation,
(1/n) Σ_{i=1}^n (y_i − w^T x_i)² = (1/n) ‖Y_n − X_n w‖².
MLCC 2017 12
Gradients of the ER and of the Regularizer By direct computation: the gradient of the empirical risk w.r.t. w is −(2/n) X_n^T (Y_n − X_n w); the gradient of the regularizer w.r.t. w is 2w. MLCC 2017 13
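As a quick sanity check of these expressions (a sketch, not part of the slides), one can compare the closed-form gradient with finite differences on random data:

```python
import numpy as np

np.random.seed(0)
n, d, lam = 50, 4, 0.1
Xn, Yn, w = np.random.randn(n, d), np.random.randn(n), np.random.randn(d)

def objective(w):
    return np.mean((Yn - Xn @ w) ** 2) + lam * (w @ w)

grad = -2.0 / n * Xn.T @ (Yn - Xn @ w) + 2.0 * lam * w        # closed-form gradient
eps = 1e-6
grad_fd = np.array([(objective(w + eps * e) - objective(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.max(np.abs(grad - grad_fd)))                          # should be ~1e-8 or less
```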
The RLS Solution By setting the gradient to zero, the solution of RLS solves the linear system
(X_n^T X_n + λnI) w = X_n^T Y_n.
λ controls the invertibility of (X_n^T X_n + λnI). MLCC 2017 14
Choosing the Cholesky Solver Several methods can be used to solve the above linear system. Cholesky decomposition is the method of choice, since X_n^T X_n + λnI is symmetric and positive definite. MLCC 2017 15
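A minimal sketch of the resulting training routine, assuming NumPy/SciPy are available (rls_train and rls_predict are illustrative names, not from the slides):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rls_train(Xn, Yn, lam):
    """Solve (Xn^T Xn + lam * n * I) w = Xn^T Yn via a Cholesky factorization."""
    n, d = Xn.shape
    A = Xn.T @ Xn + lam * n * np.eye(d)   # symmetric and positive definite for lam > 0
    c, low = cho_factor(A)
    return cho_solve((c, low), Xn.T @ Yn)

def rls_predict(w, X):
    """Linear model predictions f_w(x) = x^T w."""
    return X @ w
```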
Time Complexity Time complexity of the method: Training: O(nd²) (assuming n ≫ d); Testing: O(d). MLCC 2017 16
Dealing with an Offset For linear models, especially in low dimensional spaces, it is useful to consider an offset: w^T x + b. How to estimate b from data? MLCC 2017 17
Idea: Augmenting the Dimension of the Input Space Simple idea: augment the dimension of the input space, considering x̃ = (x, 1) and w̃ = (w, b). This is fine if we do not regularize, but if we do then this method tends to prefer linear functions passing through the origin (zero offset), since the regularizer becomes
‖w̃‖² = ‖w‖² + b².
MLCC 2017 18
Avoiding to Penalize the Solutions with Offset We want to regularize considering only ‖w‖², without penalizing the offset. The modified regularized problem becomes:
min_{(w,b) ∈ R^{D+1}} (1/n) Σ_{i=1}^n (y_i − w^T x_i − b)² + λ‖w‖².
MLCC 2017 19
Solution with Offset: Centering the Data It can be proved that a solution (w*, b*) of the above problem is given by
b* = ȳ − x̄^T w*
where ȳ = (1/n) Σ_{i=1}^n y_i and x̄ = (1/n) Σ_{i=1}^n x_i. MLCC 2017 20
Solution with Offset: Centering the Data w* solves
min_{w ∈ R^D} (1/n) Σ_{i=1}^n (y_i^c − w^T x_i^c)² + λ‖w‖²,
where y_i^c = y_i − ȳ and x_i^c = x_i − x̄ for all i = 1, ..., n. Note: this corresponds to centering the data and then applying the standard RLS algorithm. MLCC 2017 21
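A short sketch of the offset variant following the formulas above (function name is illustrative): center inputs and outputs, solve standard RLS, then recover b.

```python
import numpy as np

def rls_train_offset(Xn, Yn, lam):
    """Center the data, solve standard RLS on it, then recover the offset b."""
    n, d = Xn.shape
    x_bar, y_bar = Xn.mean(axis=0), Yn.mean()
    Xc, Yc = Xn - x_bar, Yn - y_bar                   # centered data
    w = np.linalg.solve(Xc.T @ Xc + lam * n * np.eye(d), Xc.T @ Yc)
    b = y_bar - x_bar @ w                             # b* = y_bar - x_bar^T w*
    return w, b
```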
Introduction: Regularized Logistic Regression Regularized logistic regression: Tikhonov regularization
min_{w ∈ R^d} Ê(f_w) + λ‖w‖²,   Ê(f_w) = (1/n) Σ_{i=1}^n ℓ(y_i, f_w(x_i))   (4)
with the logistic loss function
ℓ(y, f_w(x)) = log(1 + e^{−y f_w(x)}).
MLCC 2017 22
The Logistic Loss Function Figure: Plot of the logistic regression loss function MLCC 2017 23
Minimization Through Gradient Descent The logistic loss function is differentiable. A natural candidate for computing a minimizer is the gradient descent (GD) algorithm. MLCC 2017 24
Regularized Logistic Regression (RLR) The regularized ERM problem associated with the logistic loss is called regularized logistic regression. Its solution can be computed via gradient descent. Note:
∇Ê(f_{w_{t−1}}) = −(1/n) Σ_{i=1}^n x_i y_i e^{−y_i x_i^T w_{t−1}} / (1 + e^{−y_i x_i^T w_{t−1}}) = −(1/n) Σ_{i=1}^n x_i y_i / (1 + e^{y_i x_i^T w_{t−1}}).
MLCC 2017 25
RLR: Gradient Descent Iteration For w_0 = 0, the GD iteration applied to
min_{w ∈ R^d} Ê(f_w) + λ‖w‖²
is
w_t = w_{t−1} − γ ( −(1/n) Σ_{i=1}^n y_i x_i / (1 + e^{y_i x_i^T w_{t−1}}) + 2λ w_{t−1} )
for t = 1, ..., T, where the term in parentheses is the gradient of Ê(f_w) + λ‖w‖² evaluated at w_{t−1}. MLCC 2017 26
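A minimal NumPy sketch of this iteration (the step size gamma and the number of iterations T are arbitrary illustrative choices):

```python
import numpy as np

def rlr_train(X, y, lam, gamma=0.1, T=1000):
    """Gradient descent for regularized logistic regression; labels y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)                                   # w_0 = 0
    for _ in range(T):
        grad_risk = -(1.0 / n) * (X.T @ (y / (1.0 + np.exp(y * (X @ w)))))
        w = w - gamma * (grad_risk + 2.0 * lam * w)   # step on E_hat + lam * ||w||^2
    return w
```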
Logistic Regression and Confidence Estimation The solution of logistic regression has a probabilistic interpretation. It can be derived from the following model:
p(1|x) = e^{x^T w} / (1 + e^{x^T w}),
where the right-hand side, seen as a function of x^T w, is called the logistic function. This can be used to compute a confidence for each prediction. MLCC 2017 27
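A one-function sketch of this confidence estimate (illustrative names), using the equivalent form e^a / (1 + e^a) = 1 / (1 + e^{−a}):

```python
import numpy as np

def predict_proba(w, X):
    """Confidence p(1|x) = e^{x^T w} / (1 + e^{x^T w}) for each row x of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def predict_label(w, X, threshold=0.5):
    """Predict +1 when p(1|x) exceeds the threshold, -1 otherwise."""
    return np.where(predict_proba(w, X) >= threshold, 1, -1)
```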
Support Vector Machines Formulation in terms of Tikhonov regularization:
min_{w ∈ R^d} Ê(f_w) + λ‖w‖²,   Ê(f_w) = (1/n) Σ_{i=1}^n ℓ(y_i, f_w(x_i))   (5)
with the hinge loss function
ℓ(y, f_w(x)) = |1 − y f_w(x)|_+,
where |a|_+ = max(a, 0). Figure: plot of the hinge loss as a function of y·f(x). MLCC 2017 28
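The hinge loss is not differentiable at y f_w(x) = 1, so the gradient descent scheme used for logistic regression does not directly apply; one simple alternative (a sketch, not discussed in these slides) is subgradient descent on the regularized objective:

```python
import numpy as np

def svm_subgradient(X, y, lam, gamma=0.01, T=2000):
    """Subgradient descent on (1/n) sum_i |1 - y_i w^T x_i|_+ + lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        active = y * (X @ w) < 1                      # examples with positive hinge loss
        subgrad = -(1.0 / n) * (X[active].T @ y[active]) + 2.0 * lam * w
        w = w - gamma * subgrad
    return w
```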
A more classical formulation (linear case) With λ = 1/C,
w* = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n |1 − y_i w^T x_i|_+ + λ‖w‖².
MLCC 2017 29
A more classical formulation (linear case)
w* = argmin_{w ∈ R^d, ξ_i ≥ 0} ‖w‖² + (C/n) Σ_{i=1}^n ξ_i   subject to y_i w^T x_i ≥ 1 − ξ_i, ∀ i ∈ {1, ..., n}.
MLCC 2017 30
A geometric intuition - classification In general there are many solutions. What do you select? MLCC 2017 31
A geometric intuition - classification Intuitively, I would choose an equidistant line. MLCC 2017 32
Maximum margin classifier I want the classifier that classifies the dataset perfectly and maximizes the distance from its closest examples. MLCC 2017 34
Point-Hyperplane distance How to do it mathematically? Let w define our separating hyperplane. We can decompose any point as
x = αw + x_⊥,   with α = x^T w / ‖w‖² and x_⊥ = x − αw.
Point-hyperplane distance: d(x, w) = ‖x − x_⊥‖ = ‖αw‖ = |x^T w| / ‖w‖. MLCC 2017 35
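As a quick worked check of these formulas (an added example, not from the slides): with w = (3, 4) and x = (1, 2), we get α = x^T w / ‖w‖² = 11/25, x_⊥ = x − αw = (−8/25, 6/25), which is indeed orthogonal to w, and d(x, w) = ‖αw‖ = |x^T w| / ‖w‖ = 11/5.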
Margin A hyperplane w correctly classifies an example (x_i, y_i) if y_i = 1 and w^T x_i > 0, or y_i = −1 and w^T x_i < 0; therefore x_i is correctly classified iff y_i w^T x_i > 0. Margin: m_i = y_i w^T x_i. Note that x_⊥ = x_i − (y_i m_i / ‖w‖²) w. MLCC 2017 36
Maximum margin classifier definition I want the classifier that classifies the dataset perfectly and maximizes the distance from its closest examples:
w* = argmax_{w ∈ R^d} min_{1 ≤ i ≤ n} d(x_i, w)²   subject to m_i > 0, ∀ i ∈ {1, ..., n}.
Calling µ the smallest m_i, we have
w* = argmax_{w ∈ R^d, µ ≥ 0} min_{1 ≤ i ≤ n} (x_i^T w)² / ‖w‖²   subject to y_i w^T x_i ≥ µ, ∀ i ∈ {1, ..., n}.
MLCC 2017 37
Computation of w*
w* = argmax_{w ∈ R^d, µ ≥ 0} µ² / ‖w‖²   subject to y_i w^T x_i ≥ µ, ∀ i ∈ {1, ..., n}.
MLCC 2017 38
Computation of w*
w* = argmax_{w ∈ R^d, µ ≥ 0} µ² / ‖w‖²   subject to y_i w^T x_i ≥ µ, ∀ i ∈ {1, ..., n}.
Note that if y_i w^T x_i ≥ µ, then y_i (αw)^T x_i ≥ αµ, and µ²/‖w‖² = (αµ)²/‖αw‖² for any α > 0. Therefore we have to fix the scale, and in particular we choose µ = 1. MLCC 2017 39
Computation of w*
w* = argmax_{w ∈ R^d} 1 / ‖w‖²   subject to y_i w^T x_i ≥ 1, ∀ i ∈ {1, ..., n}.
MLCC 2017 40
Computation of w*
w* = argmin_{w ∈ R^d} ‖w‖²   subject to y_i w^T x_i ≥ 1, ∀ i ∈ {1, ..., n}.
MLCC 2017 41
What if the problem is not separable? We relax the constraints and penalize the relaxation. Recall the separable case:
w* = argmin_{w ∈ R^d} ‖w‖²   subject to y_i w^T x_i ≥ 1, ∀ i ∈ {1, ..., n}.
MLCC 2017 42
What if the problem is not separable? We relax the constraints and penalize the relaxation:
w* = argmin_{w ∈ R^d, ξ_i ≥ 0} ‖w‖² + (C/n) Σ_{i=1}^n ξ_i   subject to y_i w^T x_i ≥ 1 − ξ_i, ∀ i ∈ {1, ..., n},
where C is a penalization parameter for the average error (1/n) Σ_{i=1}^n ξ_i. MLCC 2017 43
Dual formulation It can be shown that the solution of the SVM problem is of the form
w* = Σ_{i=1}^n α_i y_i x_i
where the α_i are given by the solution of the following quadratic programming problem:
max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j x_i^T x_j   subject to α_i ≥ 0, i = 1, ..., n.
The solution requires the estimation of n rather than D coefficients. The α_i are often sparse; the input points associated with non-zero coefficients are called support vectors. MLCC 2017 44
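As an illustration (a sketch only, not the slides' recommended solver), the dual can be handed to a generic bound-constrained optimizer; a dedicated QP or SMO solver is what one would use in practice. The upper bound C on each α_i is an extra assumption matching the soft-margin formulation; drop it (bound (0, None)) for exactly the constraints stated above.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, C=1.0):
    """Solve max_a sum(a) - 0.5 a^T G a with a_i >= 0, where G_ij = y_i y_j x_i^T x_j."""
    n = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T

    neg_dual = lambda a: 0.5 * a @ G @ a - a.sum()   # minimize the negated dual
    grad = lambda a: G @ a - np.ones(n)

    res = minimize(neg_dual, np.zeros(n), jac=grad,
                   bounds=[(0.0, C)] * n, method="L-BFGS-B")
    alpha = res.x
    w = X.T @ (alpha * y)                            # w* = sum_i alpha_i y_i x_i
    support = np.where(alpha > 1e-6)[0]              # support vectors: non-zero alpha_i
    return w, alpha, support
```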
Wrapping up Regularized Empirical Risk Minimization:
w* = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, w^T x_i) + λ‖w‖².
Examples of Regularization Networks: ℓ(y, t) = (y − t)² (square loss) leads to Least Squares; ℓ(y, t) = log(1 + e^{−yt}) (logistic loss) leads to Logistic Regression; ℓ(y, t) = |1 − yt|_+ (hinge loss) leads to the Maximum Margin Classifier. MLCC 2017 45
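A tiny sketch emphasizing the common template (the loss dictionary and function name are illustrative choices, not from the slides):

```python
import numpy as np

losses = {
    "square":   lambda y, t: (y - t) ** 2,
    "logistic": lambda y, t: np.log1p(np.exp(-y * t)),
    "hinge":    lambda y, t: np.maximum(0.0, 1.0 - y * t),
}

def regularized_erm(w, X, y, lam, loss="square"):
    """Common objective: (1/n) sum_i loss(y_i, w^T x_i) + lam * ||w||^2."""
    return np.mean(losses[loss](y, X @ w)) + lam * (w @ w)
```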
Next class... beyond linear models! MLCC 2017 46