DATA MINING AND MACHINE LEARNING
Lecture 5: Regularization and loss functions
Lecturer: Simone Scardapane
Academic Year 2016/2017
Table of contents
Loss functions
  Loss functions for regression problems
  Loss functions for binary classification
Sparse regularization
  The LASSO algorithm
  Two intuitive explanations
  Optimization for sparse penalties
  Some extensions
Regularized risk minimization
Recall the generic form of regularized risk minimization:
$$I_{\mathrm{reg}}[f] = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + C \, R[f],$$
We already saw two examples of loss functions: the squared loss for regression, and the cross-entropy loss for classification. As we will see shortly, suitably varying the loss function can change the characteristics of our learning problem. For regression, most loss functions depend exclusively on a single scalar quantity, the error:
$$e = y - f(x).$$
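As a minimal sketch, the regularized risk above can be written as a short Python function; the names `f`, `loss`, and `regularizer` below are illustrative placeholders, not part of any specific library.

```python
import numpy as np

def regularized_risk(f, X, y, loss, regularizer, C):
    """Average loss over the dataset plus C times the regularization term."""
    predictions = np.array([f(x) for x in X])
    empirical_risk = np.mean(loss(y, predictions))
    return empirical_risk + C * regularizer(f)
```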
Absolute loss
The squared loss is simple analytically, but it is not robust with respect to outliers, i.e., points which are very far from the other observations. Alternatively, we can penalize each error linearly using the absolute loss:
$$L(e) = |e|. \qquad (1)$$
Linear regression with the absolute loss is known as least absolute deviations (LAD) or Laplace regression. Solving the optimization problem in its basic form requires some care due to the non-differentiable points.
Linear programming for LAD
For optimization, LAD can be reformulated as a linear programming problem (assuming no regularization):
$$\begin{aligned}
\min_{\mathbf{w}, \mathbf{a}} \quad & \sum_{i=1}^{N} a_i \\
\text{subject to} \quad & a_i \ge y_i - f(x_i), \quad i = 1, \ldots, N \\
& a_i \ge f(x_i) - y_i, \quad i = 1, \ldots, N, \qquad (2)
\end{aligned}$$
where we have introduced one new auxiliary variable $a_i$ for each example. Linear programming libraries can solve the previous problem even for large-scale datasets.
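As a sketch of the reformulation (2) for a linear model $f(x) = \mathbf{w}^T x$, the problem can be passed to a generic LP solver; the code below assumes SciPy's linprog and is meant as an illustration rather than a large-scale implementation.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """Least absolute deviations via the LP (2): min sum_i a_i
    subject to a_i >= y_i - x_i^T w and a_i >= x_i^T w - y_i."""
    N, d = X.shape
    # Decision variables are [w (d entries), a (N entries)]; minimize sum(a).
    c = np.concatenate([np.zeros(d), np.ones(N)])
    # a_i >= y_i - x_i^T w   <=>   -Xw - a <= -y
    # a_i >= x_i^T w - y_i   <=>    Xw - a <=  y
    A_ub = np.block([[-X, -np.eye(N)],
                     [ X, -np.eye(N)]])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * d + [(0, None)] * N  # w free, a >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]
```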
ε-insensitive loss
Suppose we know that the outputs in our training dataset are corrupted by some amount of noise, which we are able to approximately quantify. As we saw in a previous section, this introduces a constant amount of expected risk which cannot be eliminated, regardless of the learning algorithm we use. Under this assumption, it makes sense to disregard any error which is lower than some user-defined threshold ε:
$$L(e) = \max\{0, |e| - \varepsilon\}. \qquad (3)$$
The previous loss is known as ε-insensitive loss. It is particularly important for support vector algorithms.
Huber loss
We can combine the resistance to outliers of the absolute loss with the smoothness of the squared loss using the Huber loss:
$$L(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le \delta \\ \delta \left( |e| - \frac{1}{2}\delta \right) & \text{otherwise,} \end{cases} \qquad (4)$$
for some scalar coefficient δ > 0 chosen by the user. The penalization is linear for large errors, and quadratic for small errors, where "small" depends on δ. It is also differentiable everywhere.
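For concreteness, here is a small sketch of the four regression losses discussed so far, written with NumPy; the default thresholds mirror the ε = 0.5 and δ = 1.5 used in Figure 1.

```python
import numpy as np

def squared_loss(e):
    return e ** 2

def absolute_loss(e):
    return np.abs(e)

def eps_insensitive_loss(e, eps=0.5):
    return np.maximum(0.0, np.abs(e) - eps)

def huber_loss(e, delta=1.5):
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

e = np.linspace(-2, 2, 5)
print(huber_loss(e))  # quadratic near 0, linear for |e| > 1.5
```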
Plot of loss functions for regression
Figure 1: Plot of several loss functions for regression (squared loss, absolute loss, 0.5-insensitive loss, and Huber loss with δ = 1.5) as a function of the error e. The ε for the ε-insensitive loss is relatively high for better visualization.
OLS vs. Huber regression
Figure 2: A comparison of OLS and Huber regression for a linear model, on a 1-D dataset corrupted with a single outlier point.
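The comparison of Figure 2 can be sketched with scikit-learn's LinearRegression and HuberRegressor; the synthetic 1-D dataset and the single outlier below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 8, 30).reshape(-1, 1)
y = 0.5 * X.ravel() + 1.0 + 0.1 * rng.standard_normal(30)
y[-1] += 5.0  # a single outlier corrupting the dataset

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)
print("OLS slope:  ", ols.coef_[0])    # pulled towards the outlier
print("Huber slope:", huber.coef_[0])  # close to the true slope 0.5
```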
Loss functions for binary classification
Margin of classification
Let us go back to the problem of binary classification, with $y_i \in \{-1, +1\}$. In this case, most loss functions can be defined in terms of the margin of the prediction, defined as:
$$m = y \, f(x).$$
The absolute value of the margin tells us how confident we are in a prediction, the sign tells us whether we are correct. For example, we can rewrite the 0/1 loss as:
$$L(m) = \begin{cases} 0 & \text{if } m > 0, \\ 1 & \text{if } m \le 0. \end{cases}$$
The 0/1 loss is extremely difficult to optimize, being non-convex and non-differentiable.
The hinge loss
We can make a convex approximation as follows:
- Any prediction such that m ≥ 1 gets 0 error.
- Anything else gets an error which is linear with respect to m.
In this way, we also penalize correct predictions which are not confident enough. The result is the so-called hinge loss:
$$L(m) = \max\{0, 1 - m\}.$$
To obtain a loss that is differentiable everywhere, we can penalize the errors quadratically, obtaining the squared hinge loss:
$$L(m) = \max\{0, 1 - m\}^2.$$
Squared hinge loss and log loss
Another popular loss function is the log loss:
$$L(m) = \log\left(1 + \exp(-m)\right).$$
All of these losses can be interpreted as convex approximations to the 0/1 loss. For logistic regression, the log loss is equivalent to the cross-entropy loss. This can be trivially shown by noting that:
$$p(y = -1 \mid x) = 1 - \sigma(s) = \sigma(-s),$$
so that $p(y \mid x) = \sigma(y\,s)$ for both classes, and the negative log-likelihood of an example is exactly $\log(1 + \exp(-m))$ with $m = y\,s$.
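A compact sketch of the margin-based losses above, assuming labels in {−1, +1} and a prediction score s = f(x), so that m = y·s:

```python
import numpy as np

def zero_one_loss(m):
    return (m <= 0).astype(float)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def squared_hinge_loss(m):
    return np.maximum(0.0, 1.0 - m) ** 2

def log_loss(m):
    # log(1 + exp(-m)), written with logaddexp for numerical stability
    return np.logaddexp(0.0, -m)

m = np.array([-1.0, 0.0, 0.5, 2.0])
print(hinge_loss(m))  # [2. 1. 0.5 0.]
```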
Plot of loss functions for binary classification
Figure 3: A comparison of several losses for binary classification (0/1 loss, hinge loss, squared hinge loss, log loss) when expressed as a function of the margin m. The log loss is rescaled by 1/log(2) to make it pass through the point (0, 1).
Sparse regularization: the LASSO algorithm
Sparse learning and LASSO
Consider again linear regression, where we replace the Euclidean norm on the weights with an absolute norm, as follows:
$$J(\mathbf{w}) = \frac{1}{2N} \|\mathbf{y} - X\mathbf{w}\|_2^2 + C \|\mathbf{w}\|_1,$$
where:
$$\|\mathbf{w}\|_1 = \sum_{i=1}^{d} |w_i|.$$
The new term will push several elements of w to be exactly zero, making the result a sparse parameter vector. The algorithm is called least absolute shrinkage and selection operator (LASSO).
The importance of sparsity
The previous term can also be used with other models/loss functions, obtaining similar effects. Having a sparse vector of parameters has several beneficial effects:
1. It can improve generalization, particularly when the model is over-parameterized.
2. Sparse vectors require less storage space and can make the model more efficient to evaluate.
3. For a linear model, each weight is associated with a single feature, so that LASSO implicitly performs feature selection on the dataset (see the sketch below).
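As a brief illustration of point 3, the sketch below fits LASSO with scikit-learn on synthetic data and reads off which features survive; the regularization strength (called alpha in scikit-learn) is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 0.7]            # only 3 informative features
y = X @ w_true + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of non-zero weights
print("Selected features:", selected)    # typically the first three
```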
Two intuitive explanations
LASSO with orthonormal inputs
We can gain some intuition on the behavior of LASSO if we assume that the matrix X has orthonormal columns. Denote by $\hat{w}_i$ the ith coefficient found by (unregularized) OLS. Ridge regression in this case will give as solution:
$$w_i = \frac{\hat{w}_i}{1 + C}, \qquad (5)$$
while LASSO will give as solution:
$$w_i = \mathrm{sgn}(\hat{w}_i) \max\{0, |\hat{w}_i| - \lambda\}, \qquad (6)$$
where sgn(·) takes the sign. These are plotted on the next slide.
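The two shrinkage rules (5) and (6) are easy to sketch in code, assuming the unregularized OLS coefficients ŵ are already available; the values of C and λ below are illustrative.

```python
import numpy as np

def ridge_shrinkage(w_hat, C):
    return w_hat / (1.0 + C)                       # uniform rescaling

def lasso_soft_threshold(w_hat, lam):
    return np.sign(w_hat) * np.maximum(0.0, np.abs(w_hat) - lam)

w_hat = np.array([-1.0, -0.2, 0.1, 0.8])
print(ridge_shrinkage(w_hat, C=1.0))         # every coefficient halved
print(lasso_soft_threshold(w_hat, lam=0.5))  # small coefficients set exactly to 0
```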
Ridge and LASSO solutions (visualization)
Figure 4: We plot the values obtained by ridge regression and LASSO in the case of orthonormal inputs (corrected solution on the vertical axis), when compared to a basic OLS with no regularization (OLS solution, gray line).
Another intuition
Alternatively, we can note that the previous problem can be reformulated as follows:
$$\min_{\mathbf{w}} \; \frac{1}{2N} \|\mathbf{y} - X\mathbf{w}\|_2^2 \quad \text{subject to} \quad \|\mathbf{w}\|_1 \le \mu, \qquad (7)$$
where μ depends on the value of C. This reformulation is always possible as long as the overall optimization problem is convex (and feasible). This shows that the optimal weight vector must lie inside (or on the border of) the shape defined by the inequality $\|\mathbf{w}\|_1 \le \mu$.
Visualizing the shapes
Figure 5: A visualization of the shapes defined by the regularization terms in (a) ridge regression and (b) LASSO regularization. With LASSO, the solution has a large probability of ending up on a vertex, corresponding to one or more features set to zero.
Optimization for sparse penalties
Non-differentiable costs and subgradients
The absolute value is not differentiable at 0, so our cost function is not differentiable and we cannot directly apply anything from our lecture on optimization, not even the condition defining the optimum of the problem! Formally, we need an extension of the gradient called a subgradient. A vector $\mathbf{g}$ is a subgradient of a convex function p(·) at the point $\mathbf{a}$ if:
$$p(\mathbf{b}) \ge p(\mathbf{a}) + \mathbf{g}^T (\mathbf{b} - \mathbf{a}), \quad \forall \mathbf{b}.$$
The function is differentiable at $\mathbf{a}$ if it has a unique subgradient.
Subgradients (visualization)
Figure 6: Comparison of subgradients on a differentiable function and a non-differentiable one [taken from Tibshirani, CS 10-725 (Optimization), Lecture 6: Subgradient method].
Optimality conditions for non-differentiable costs
The set of all subgradients of p(·) at the point $\mathbf{a}$ is called the subdifferential, and it is denoted as $\partial p(\mathbf{a})$. A vector $\mathbf{w}$ is a minimum of p(·) if and only if:
$$\mathbf{0} \in \partial p(\mathbf{w}). \qquad (8)$$
For a differentiable cost function, we only have a single subgradient and the previous expression reduces to the classical first-order optimality condition. We can devise a subgradient descent for optimization by iteratively following the negative of any subgradient:
$$\mathbf{w}_{n+1} = \mathbf{w}_n - \alpha_n \mathbf{g}_n, \quad \mathbf{g}_n \in \partial p(\mathbf{w}_n). \qquad (9)$$
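As a rough sketch, subgradient descent can be applied to the LASSO cost J(w) from the previous slides; taking sign(0) = 0 is a valid choice of subgradient for |w_i| at zero, and the fixed step size below is an illustrative simplification.

```python
import numpy as np

def lasso_subgradient_descent(X, y, C, step=0.01, n_iters=2000):
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residual = X @ w - y
        # Subgradient of (1/2N)||y - Xw||^2 + C ||w||_1
        g = X.T @ residual / N + C * np.sign(w)
        w = w - step * g
    return w
```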
An example of LASSO
Figure 7: Example of application of LASSO on the Diabetes dataset: (a) MSE on test data and (b) sparsity [%] of the solution, as a function of C. For C = 10⁻³, the MSE is very close to the optimal one, but the output parameter vector has 30% of zero elements.
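The experiment of Figure 7 can be roughly sketched with scikit-learn's Diabetes dataset, using its alpha parameter in place of C; since no preprocessing is applied here, the actual MSE values will differ from those in the figure.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for C in np.logspace(-6, 0, 7):
    model = Lasso(alpha=C, max_iter=10000).fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    sparsity = 100.0 * np.mean(model.coef_ == 0)
    print(f"C = {C:.0e}: test MSE = {mse:.1f}, sparsity = {sparsity:.0f}%")
```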
Some extensions
Elastic net penalty
We can apply both types of regularization together:
$$J(\mathbf{w}) = \frac{1}{2N} \|\mathbf{y} - X\mathbf{w}\|_2^2 + C_1 \|\mathbf{w}\|_1 + C_2 \|\mathbf{w}\|_2^2,$$
where we now have two scalar coefficients $C_1, C_2$. This is called the elastic net penalty. The ℓ₂ penalty is used to further stabilize weights that are not set to 0 by the ℓ₁ penalty. This idea is relatively general: you can combine several penalizations to obtain a wide range of effects. For more than 2 penalties, however, setting their relative weights might not be immediate, particularly if their effects are interdependent.
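A short sketch of the elastic net with scikit-learn; note that its parameterization uses an overall strength alpha and a mixing coefficient l1_ratio, rather than two separate coefficients C₁ and C₂.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

# alpha controls the overall strength, l1_ratio the l1/l2 mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Non-zero weights:", np.count_nonzero(enet.coef_))
```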
Subset selection
If we care about sparse weight vectors, a natural formulation is to penalize directly the number of non-zero entries in the vector:
$$J(\mathbf{w}) = \frac{1}{2N} \|\mathbf{y} - X\mathbf{w}\|_2^2 + C \sum_{i=1}^{d} \big(1 - I_0(w_i)\big),$$
where I₀(·) is the indicator function, which is one if its argument is zero, and zero otherwise. Sometimes, this is called the ℓ₀ norm, although mathematically it is not a norm. It is extremely difficult to optimize directly in a numerical fashion. The previous problem is called best subset selection. It is more naturally solved by applying some heuristic feature selection algorithms; see the sketch below.
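To make the combinatorial nature of the problem concrete, the sketch below performs exhaustive best subset selection with plain OLS fits; this is only feasible for a handful of features, which is exactly why heuristic algorithms are used in practice.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Return the subset of exactly k features with the lowest training SSE."""
    d = X.shape[1]
    best_sse, best_idx = np.inf, None
    for idx in combinations(range(d), k):
        Xs = X[:, idx]
        w, sse, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        sse = sse[0] if sse.size else np.sum((y - Xs @ w) ** 2)
        if sse < best_sse:
            best_sse, best_idx = sse, idx
    return best_idx, best_sse

# Example usage: idx, sse = best_subset(X, y, k=3), an exhaustive search over C(d, 3) subsets.
```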
Group sparse regularization
Suppose that our variables can be grouped logically in T groups $\mathbf{w}_1, \ldots, \mathbf{w}_T$, such that:
$$\mathbf{w} = \bigcup_{k=1}^{T} \mathbf{w}_k.$$
We can set to zero entire groups together by using the so-called group LASSO regularization:
$$J(\mathbf{w}) = \frac{1}{2N} \|\mathbf{y} - X\mathbf{w}\|_2^2 + C \sum_{k=1}^{T} \sqrt{d_k} \, \|\mathbf{w}_k\|_2,$$
where $d_k$ is the size of the kth group.
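A minimal sketch of the group LASSO penalty term, assuming the groups are given as lists of coefficient indices:

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """sum_k sqrt(d_k) * ||w_k||_2 over the given groups of indices."""
    return sum(np.sqrt(len(g)) * np.linalg.norm(w[g]) for g in groups)

w = np.array([0.0, 0.0, 0.0, 1.0, -2.0, 0.5])
groups = [[0, 1, 2], [3, 4], [5]]   # first group entirely zero
print(group_lasso_penalty(w, groups))
```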
Further readings
The LASSO algorithm and some subset selection methods are presented in Chapter 4 of the book. A very good starting point for a theoretical analysis of the properties of loss functions is:
Rosasco, L., De Vito, E., Caponnetto, A., Piana, M. and Verri, A., 2004. Are loss functions all the same? Neural Computation, 16(5), pp. 1063-1076.
The following is a comprehensive monograph on optimization for sparsity-inducing penalties (in the convex case):
Bach, F., Jenatton, R., Mairal, J. and Obozinski, G., 2012. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1), pp. 1-106.