MLCC 2017 Regularization Networks I: Linear Models

Size: px

Start display at page:

Download "MLCC 2017 Regularization Networks I: Linear Models"

Amos Evan Franklin
6 years ago
Views:

1 MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017

2 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational aspects of these algorithms. MLCC

3 Empirical Risk Minimization (ERM) Empirical Risk Minimization (ERM): probably the most popular approach to design learning algorithms. General idea: considering the empirical error Ê(f) = 1 n n l(y i, f(x i )), i=1 as a proxy for the expected error E(f) = E[l(y, f(x))] = dxdyp(x, y)l(y, f(x)). MLCC

4 The Expected Risk is Not Computable Recall that l measures the price we pay predicting f(x) when the true label is y E(f) cannot be directly computed, since p(x, y) is unknown MLCC

5 From Theory to Algorithms: The Hypothesis Space To turn the above idea into an actual algorithm, we: Fix a suitable hypothesis space H Minimize Ê over H H should allow feasible computations and be rich, since the complexity of the problem is not known a priori. MLCC

6 Example: Space of Linear Functions The simplest example of H is the space of linear functions: H = {f : R d R : w R d such that f(x) = x T w, x R d }. Each function f is defined by a vector w f w (x) = x T w. MLCC

7 Rich Hs May Require Regularization If H is rich enough, solving ERM may cause overfitting (solutions highly dependent on the data) Regularization techniques restore stability and ensure generalization MLCC

8 Tikhonov Regularization Consider the Tikhonov regularization scheme, min Ê(f w ) + λ w 2 (1) w R d It describes a large class of methods sometimes called Regularization Networks. MLCC

9 The Regularizer w 2 is called regularizer It controls the stability of the solution and prevents overfitting λ balances the error term and the regularizer MLCC

10 Loss Functions Different loss functions l induce different classes of methods We will see common aspects and differences in considering different loss functions There exists no general computational scheme to solve Tikhonov Regularization The solution depends on the considered loss function MLCC

11 The Regularized Least Squares Algorithm Regularized Least Squares: Tikhonov regularization min w R D Ê(f w ) + λ w 2, Ê(f w ) = 1 n Square loss function: n l(y i, f w (x i )) (2) i=1 l(y, f w (x)) = (y f w (x)) 2 We then obtain the RLS optimization problem (linear model): 1 min w R D n n (y i w T x i ) 2 + λw T w, λ 0. (3) i=1 MLCC

12 Matrix Notation The n d matrix X n, whose rows are the input points The n 1 vector Y n, whose entries are the corresponding outputs. With this notation, 1 n n (y i w T x i ) 2 = 1 n Y n X n w 2. i=1 MLCC

13 Gradients of the ER and of the Regularizer By direct computation, Gradient of the empirical risk w. r. t. w 2 n XT n (Y n X n w) Gradient of the regularizer w. r. t. w 2w MLCC

14 The RLS Solution By setting the gradient to zero, the solution of RLS solves the linear system (X T n X n + λni)w = X T n Y n. λ controls the invertibility of (X T n X n + λni) MLCC

15 Choosing the Cholesky Solver Several methods can be used to solve the above linear system Cholesky decomposition is the method of choice, since Xn T X n + λi is symmetric and positive definite. MLCC

16 Time Complexity Time complexity of the method : Training: O(nd 2 ) (assuming n >> d) Testing: O(d) MLCC

17 Dealing with an Offset For linear models, especially in low dimensional spaces, it is useful to consider an offset: How to estimate b from data? w T x + b MLCC

18 Idea: Augmenting the Dimension of the Input Space Simple idea: augment the dimension of the input space, considering x = (x, 1) and w = (w, b). This is fine if we do not regularize, but if we do then this method tends to prefer linear functions passing through the origin (zero offset), since the regularizer becomes: w 2 = w 2 + b 2. MLCC

19 Avoiding to Penalize the Solutions with Offset We want to regularize considering only w 2, without penalizing the offset. The modified regularized problem becomes: 1 min (w,b) R D+1 n n (y i w T x i b) 2 + λ w 2. i=1 MLCC

20 Solution with Offset: Centering the Data It can be proved that a solution w, b of the above problem is given by b = ȳ x T w where ȳ = 1 n x = 1 n n i=1 n i=1 y i x i MLCC

21 Solution with Offset: Centering the Data w solves 1 min w R D n n (yi c w T x c i) 2 + λ w 2. i=1 where y c i = y ȳ and xc i = x x for all i = 1,..., n. Note: This corresponds to centering the data and then applying the standard RLS algorithm. MLCC

22 Introduction: Regularized Logistic Regression Regularized logistic regression: Tikhonov regularization min w R d Ê(f w ) + λ w 2, Ê(f w ) = 1 n With the logistic loss function: n l(y i, f w (x i )) (4) i=1 l(y, f w (x)) = log(1 + e yfw(x) ) MLCC

23 The Logistic Loss Function Figure: Plot of the logistic regression loss function MLCC

24 Minimization Through Gradient Descent The logistic loss function is differentiable The candidate to compute a minimizer is the gradient descent (GD) algorithm MLCC

25 Regularized Logistic Regression (RLR) The regularized ERM problem associated with the logistic loss is called regularized logistic regression Its solution can be computed via gradient descent Note: Ê(f w) = 1 n n i=1 x i y i e yixt i wt e yixt i wt 1 = 1 n n i=1 x i y i 1 + e yixt i wt 1 MLCC

26 RLR: Gradient Descent Iteration For w 0 = 0, the GD iteration applied to is w t = w t 1 γ for t = 1,... T, where min Ê(f w ) + λ w 2 w R d ( ) 1 n y i x i + 2λw t 1 n 1 + e yixt i wt 1 i=1 }{{} a a = (Ê(f w) + λ w 2 ) MLCC

27 Logistic Regression and Confidence Estimation The solution of logistic regression has a probabilistic interpretation It can be derived from the following model p(1 x) = where h is called logistic function. ext w 1 + e xt w }{{} h This can be used to compute a confidence for each prediction MLCC

28 Support Vector Machines Formulation in terms of Tikhonov regularization: min w R d Ê(f w ) + λ w 2, Ê(f w ) = 1 n With the Hinge loss function: n l(y i, f w (x i )) (5) i=1 l(y, f w (x)) = 1 yf w (x) Hinge Loss y * f(x) MLCC

29 A more classical formulation (linear case) with λ = 1 C w 1 = min w R d n n 1 y i w x i + + λ w 2 i=1 MLCC

30 A more classical formulation (linear case) w = min w R d,ξ i 0 w 2 + C n n i=1 ξ i y i w x i 1 ξ i i {1... n} subject to MLCC

31 A geometric intuition - classification In general do you have many solutions What do you select? MLCC

32 A geometric intuition - classification Intuitively I would choose an equidistant line MLCC

33 A geometric intuition - classification Intuitively I would choose an equidistant line MLCC

34 Maximum margin classifier I want the classifier that classifies perfectly the dataset maximize the distance from its closest examples MLCC

35 Point-Hyperplane distance How to do it mathematically? Let w our separating hyperplane. We have with α = x w w and x = x αw. x = αw + x Point-Hyperplane distance: d(x, w) = x MLCC

36 Margin An hyperplane w well classifies an example (x i, y i ) if y i = 1 and w x i > 0 or y i = 1 and w x i < 0 therefore x i is well classified iff y i w x i > 0 Margin: m i = y i w x i Note that x = x yimi w w MLCC

37 Maximum margin classifier definition I want the classifier that classifies perfectly the dataset maximize the distance from its closest examples w = max min d(x i, w) 2 1 i n w R d m i > 0 i {1... n} Let call µ the smallest m i thus we have subject to w = max min x i (x i w)2 w R d 1 i n,µ 0 w 2 subject to that is y i w x i µ i {1... n} MLCC

38 Computation of w w = max min µ2 w R d µ 0 w 2 subject to y i w x i µ i {1... n} MLCC

39 Computation of w w = max w R d, µ 0 µ 2 w 2 subject to y i w x i µ i {1... n} Note that if y i w x i µ, then y i (αw) x i αµ and µ2 w = (αµ)2 2 αw for 2 any α 0. Therefore we have to fix the scale parameter, in particular we choose µ = 1. MLCC

40 Computation of w w 1 = max w R d w 2 subject to y i w x i 1 i {1... n} MLCC

41 Computation of w w = min w R d w 2 subject to y i w x i 1 i {1... n} MLCC

42 What if the problem is not separable? We relax the constraints and penalize the relaxation w = min w R d w 2 subject to y i w x i 1 i {1... n} MLCC

43 What if the problem is not separable? We relax the constraints and penalize the relaxation w = min w R d,ξ i 0 w 2 + C n n i=1 ξ i subject to y i w x i 1 ξ i i {1... n} where C is a penalization parameter for the average error 1 n n i=1 ξ i. MLCC

44 Dual formulation It can be shown that the solution of the SVM problem is of the form w = n α i y i x i i=1 where α i are given by the solution of the following quadratic programming problem: n max α R n i=1 α i 1 n 2 i,j=1 y iy j α i α j x T i x j subj to α i 0 i = 1,..., n The solution requires the estimate of n rather than D coefficients α i are often sparse. The input points associated with non-zero coefficients are called support vectors MLCC

45 Wrapping up Regularized Empirical Risk Minimization w 1 = min w R d n n l(y i, w x i ) + λ w 2 i=1 Examples of Regularization Networks l(y, t) = (y t) 2 (Square loss) leads to Least Squares l(y, t) = log(1 + e yt ) (Logistic loss) leads to Logistic Regression l(y, t) = 1 yt + (Hinge loss) leads to Maximum Margin Classifier MLCC

46 Next class... beyond linear models! MLCC

Introductory Machine Learning Notes 1

Introductory Machine Learning Notes 1 Lorenzo Rosasco DIBRIS, Universita degli Studi di Genova LCSL, Massachusetts Institute of Technology and Istituto Italiano di Tecnologia lrosasco@mit.edu December