DATA MINING AND MACHINE LEARNING
Lecture 5: Regularization and loss functions
Lecturer: Simone Scardapane
Academic Year 2016/2017

Table of contents
Loss functions
  Loss functions for regression problems
  Loss functions for binary classification
Sparse regularization
  The LASSO algorithm
  Two intuitive explanations
  Optimization for sparse penalties
  Some extensions

Regularized risk minimization
Recall the generic form of regularized risk minimization:

$I_{\text{reg}}[f] = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + C \, R[f].$

We already saw two examples of loss functions: the squared loss for regression, and the cross-entropy loss for classification. As we will see shortly, suitably varying the loss function can change the characteristics of our learning problem. For regression, most loss functions depend exclusively on a single scalar quantity computing the error: $e = y - f(x)$.
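As a small illustration (not part of the original slides), the sketch below evaluates this generic objective for a linear model $f(x) = w^T x$, using the squared loss and the squared-$\ell_2$ regularizer $R[f] = \|w\|_2^2$ as placeholder choices; all names and data are assumptions made for the example.

```python
import numpy as np

def regularized_risk(w, X, y, C):
    """Regularized empirical risk of a linear model f(x) = w^T x."""
    errors = y - X @ w
    data_term = np.mean(0.5 * errors ** 2)   # (1/N) * sum_i L(y_i, f(x_i)), squared loss
    penalty = C * np.dot(w, w)               # C * R[f], with R[f] = ||w||_2^2
    return data_term + penalty

# Illustrative usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(regularized_risk(np.zeros(3), X, y, C=0.1))
```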

Absolute loss
The squared loss is simple analytically, but it is not robust with respect to outliers, i.e., points which are very far from the other observations. Alternatively, we can penalize each error linearly using the absolute loss:

$L(e) = |e|. \quad (1)$

Linear regression with the absolute loss is known as least absolute deviations (LAD) or Laplace regression. Solving the optimization problem in its basic form requires some care due to the non-differentiable points.

Linear programming for LAD
For optimization, LAD can be reformulated as a linear programming problem (assuming no regularization):

$\min_{w,\,a} \; \sum_{i=1}^{N} a_i \quad \text{subject to} \quad a_i \geq y_i - f(x_i), \;\; a_i \geq f(x_i) - y_i, \quad i = 1, \dots, N, \quad (2)$

where we have introduced one new auxiliary variable $a_i$ for each example. Linear programming libraries can solve the previous problem even for large-scale datasets.
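As an illustration, the sketch below solves the linear program above for a linear model $f(x) = w^T x$ with scipy.optimize.linprog; the synthetic data and all variable names are assumptions made for the example, not part of the lecture.

```python
import numpy as np
from scipy.optimize import linprog

# Synthetic regression data with heavy-tailed noise (illustrative only).
rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.laplace(scale=0.3, size=N)

# Decision variables: [w (d weights), a (N auxiliary variables)].
c = np.concatenate([np.zeros(d), np.ones(N)])   # objective: minimize sum_i a_i
I = np.eye(N)
A_ub = np.block([[-X, -I],                      #  y_i - w^T x_i <= a_i
                 [ X, -I]])                     #  w^T x_i - y_i <= a_i
b_ub = np.concatenate([-y, y])
bounds = [(None, None)] * d + [(0, None)] * N   # w free, a_i >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("LAD weights:", np.round(res.x[:d], 3))
```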

ε-insensitive loss
Suppose we know that the outputs in our training dataset are corrupted by some amount of noise, which we are able to approximately quantify. As we saw in a previous section, this introduces a constant amount of expected risk which cannot be eliminated, regardless of the learning algorithm we use. Under this assumption, it makes sense to disregard any error whose magnitude is lower than some user-defined threshold ε:

$L(e) = \max \{0, |e| - \varepsilon\}. \quad (3)$

The previous loss is known as the ε-insensitive loss. It is particularly important for support vector algorithms.

Huber loss
We can combine the resistance to outliers of the absolute loss with the smoothness of the squared loss using the Huber loss:

$L(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \leq \delta, \\ \delta \left( |e| - \frac{\delta}{2} \right) & \text{otherwise,} \end{cases} \quad (4)$

for some scalar coefficient δ > 0 chosen by the user. The penalization is linear for large errors, and quadratic for small errors, where "small" depends on δ. It is also differentiable everywhere.
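A small numpy sketch of the regression losses above, written as functions of the error $e = y - f(x)$; the 1/2 scaling of the squared loss and the default values of ε and δ are conventional choices made for the example, not prescribed by the slides.

```python
import numpy as np

def squared_loss(e):
    return 0.5 * e ** 2

def absolute_loss(e):
    return np.abs(e)

def eps_insensitive_loss(e, eps=0.5):
    return np.maximum(0.0, np.abs(e) - eps)

def huber_loss(e, delta=1.5):
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

e = np.linspace(-2.0, 2.0, 5)
print(huber_loss(e))   # quadratic near zero, linear in the tails
```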

Plot of loss functions for regression
Figure 1: Plot of several loss functions for regression (squared loss, absolute loss, 0.5-insensitive loss, and Huber loss with δ = 1.5) as a function of the error e. The ε for the ε-insensitive loss is chosen relatively high for better visualization.

OLS vs. Huber regression
Figure 2: A comparison of OLS and Huber regression for a linear model, on a 1-D dataset corrupted with a single outlier point.

Loss functions for binary classification

Margin of classification
Let us go back to the problem of binary classification, with $y_i \in \{-1, +1\}$. In this case, most loss functions can be defined in terms of the margin of the prediction, defined as $m = y \, f(x)$. The absolute value of the margin tells us how confident we are in a prediction, while the sign tells us whether we are correct. For example, we can rewrite the 0/1 loss as:

$L(m) = \begin{cases} 0 & \text{if } m > 0, \\ 1 & \text{if } m \leq 0. \end{cases}$

The 0/1 loss is extremely difficult to optimize, being non-convex and non-differentiable.

The hinge loss
We can make a convex approximation as follows: any prediction such that $m \geq 1$ gets 0 error; anything else gets an error which is linear with respect to $m$. In this way, we also penalize correct predictions which are not confident enough. The result is the so-called hinge loss:

$L(m) = \max \{0, 1 - m\}.$

To obtain a loss that is differentiable everywhere, we can penalize the errors quadratically, obtaining the squared hinge loss:

$L(m) = \max \{0, 1 - m\}^2.$

Squared hinge loss and log loss
Another popular loss function is the log loss:

$L(m) = \log \left( 1 + \exp(-m) \right).$

All of these losses can be interpreted as convex approximations to the 0/1 loss. For logistic regression, the log loss is equivalent to the cross-entropy loss. This can be trivially shown by noting that:

$p(y = -1 \mid x) = 1 - \sigma(s) = \sigma(-s).$
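A small numpy sketch of the margin-based classification losses above, written as functions of the margin $m = y \, f(x)$; everything here is illustrative and not part of the original slides.

```python
import numpy as np

def zero_one_loss(m):
    return (m <= 0).astype(float)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def squared_hinge_loss(m):
    return np.maximum(0.0, 1.0 - m) ** 2

def log_loss(m):
    return np.log1p(np.exp(-m))       # log(1 + exp(-m))

m = np.linspace(-2.0, 2.0, 5)
print(hinge_loss(m))
print(log_loss(m) / np.log(2.0))      # rescaled version used in Figure 3
```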

Plot of loss functions for binary classification
Figure 3: A comparison of several losses for binary classification (0/1 loss, hinge loss, squared hinge loss, log loss) expressed as a function of the margin m. The log loss is rescaled by 1/log(2) to make it pass through the point (0, 1).

Sparse regularization: the LASSO algorithm

Sparse learning and LASSO
Consider again linear regression, where we replace the Euclidean norm on the weights with the $\ell_1$ norm, as follows:

$J(w) = \frac{1}{2N} \| y - Xw \|_2^2 + C \| w \|_1, \quad \text{where} \quad \| w \|_1 = \sum_{i=1}^{d} |w_i|.$

The new term will push several elements of $w$ to be exactly zero, making the result a sparse parameter vector. The algorithm is called the least absolute shrinkage and selection operator (LASSO).
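For concreteness, here is a minimal sketch of fitting LASSO with scikit-learn on synthetic data; scikit-learn's alpha parameter plays the role of C in the objective above (it uses the same 1/(2N) scaling of the squared error), and the data-generating choices are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data where only the first three features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)          # alpha plays the role of C
print("non-zero weights:", int(np.sum(lasso.coef_ != 0)))
```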

The importance of sparsity
The previous term can also be used with other models/loss functions, obtaining similar effects. Having a sparse vector of parameters has several beneficial effects:
1. It can improve generalization, particularly when the model is over-parameterized.
2. Sparse vectors require less storage space and can make the model more efficient to evaluate.
3. For a linear model, each weight is associated with a single feature, so that LASSO implicitly performs feature selection on the dataset.

Two intuitive explanations

LASSO with orthonormal inputs
We can gain some intuition on the behavior of LASSO if we assume that the matrix $X$ has orthonormal columns. Denote by $\hat{w}_i$ the $i$-th coefficient found by (unregularized) OLS. Ridge regression in this case will give as solution:

$w_i = \frac{\hat{w}_i}{1 + C}, \quad (5)$

while LASSO will give as solution:

$w_i = \operatorname{sgn}(\hat{w}_i) \max \{0, |\hat{w}_i| - \lambda\}, \quad (6)$

where $\operatorname{sgn}(\cdot)$ takes the sign. These are plotted on the next slide.
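The two closed-form solutions above are easy to implement; the sketch below applies them to an illustrative vector of OLS coefficients (the values of C and λ are arbitrary choices for the example).

```python
import numpy as np

def ridge_shrink(w_ols, C):
    """Closed-form ridge solution with orthonormal inputs, Eq. (5)."""
    return w_ols / (1.0 + C)

def soft_threshold(w_ols, lam):
    """Closed-form LASSO solution with orthonormal inputs, Eq. (6)."""
    return np.sign(w_ols) * np.maximum(0.0, np.abs(w_ols) - lam)

w_ols = np.array([-1.2, -0.3, 0.1, 0.4, 1.0])   # illustrative OLS coefficients
print(ridge_shrink(w_ols, C=0.5))       # every coefficient shrunk, none exactly zero
print(soft_threshold(w_ols, lam=0.5))   # small coefficients set exactly to zero
```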

Ridge and LASSO solutions (visualization)
Figure 4: The values obtained by ridge regression and LASSO in the case of orthonormal inputs (corrected solution on the vertical axis), compared to a basic OLS with no regularization (gray line, OLS solution on the horizontal axis).

Another intuition
Alternatively, we can note that the previous problem can be reformulated as follows:

$\min_{w} \; \frac{1}{2N} \| y - Xw \|_2^2 \quad \text{subject to} \quad \| w \|_1 \leq \mu, \quad (7)$

where $\mu$ depends on $C$ (a larger $C$ corresponds to a smaller $\mu$). This reformulation is always possible as long as the overall optimization problem is convex (and feasible). This shows that the optimal weight vector must lie inside (or on the boundary of) the region defined by the inequality $\| w \|_1 \leq \mu$.

Visualizing the shapes
Figure 5: A visualization of the shapes defined by the regularization terms in (a) ridge regression and (b) LASSO. The LASSO solution has a large probability of ending up on a vertex, corresponding to one or more features being set to zero.

Optimization for sparse penalties

Non-differentiable costs and subgradients
The absolute value is not differentiable at 0, so our cost function is not differentiable and we cannot directly apply anything from our lecture on optimization - not even the condition defining the optimum of the problem! Formally, we need an extension of the gradient called a subgradient. A vector $g$ is a subgradient of a convex function $p(\cdot)$ at the point $a$ if:

$p(b) \geq p(a) + g^T (b - a), \quad \forall b.$

The function is differentiable at $a$ if it has a unique subgradient.
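As a quick sanity check of the definition (an illustrative addition, not from the slides), the sketch below verifies numerically that every $g \in [-1, 1]$ is a subgradient of $p(x) = |x|$ at $a = 0$, while values outside that interval are not.

```python
import numpy as np

# Subgradient inequality at a = 0 for p(x) = |x|: we need |b| >= g * b for all b.
b = np.linspace(-2.0, 2.0, 401)
for g in [-1.0, -0.3, 0.0, 0.7, 1.0]:
    assert np.all(np.abs(b) >= g * b), f"g = {g} violates the subgradient inequality"
print("every g in [-1, 1] is a subgradient of |x| at 0")

# A value outside [-1, 1] fails the inequality, e.g. g = 1.5 for b > 0.
print(np.all(np.abs(b) >= 1.5 * b))   # False
```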

Subgradients (visualization)
Figure 6: Comparison of subgradients for a differentiable function and a non-differentiable one [taken from Tibshirani, CS 10-725 (Optimization), Lecture 6: Subgradient method].

Optimality conditions for non-differentiable costs
The set of all subgradients of $p(\cdot)$ at the point $a$ is called the subdifferential, and it is denoted as $\partial p(a)$. A vector $w$ is a minimum of $p(\cdot)$ if and only if:

$0 \in \partial p(w). \quad (8)$

For a differentiable cost function, we only have a single subgradient and the previous expression reduces to the classical first-order optimality condition. We can devise a subgradient descent method for optimization by iteratively following the negative of any subgradient:

$w_{n+1} = w_n - \alpha_n g_n, \quad g_n \in \partial p(w_n). \quad (9)$
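The sketch below applies this idea to the LASSO objective $J(w) = \frac{1}{2N}\|y - Xw\|_2^2 + C\|w\|_1$, using $C \cdot \operatorname{sgn}(w)$ as one valid subgradient of the $\ell_1$ term (0 is a valid choice at $w_i = 0$); the step size, iteration count, and synthetic data are arbitrary illustrative settings.

```python
import numpy as np

def lasso_subgradient_descent(X, y, C=0.1, step=0.01, n_iter=2000):
    """Minimize (1/2N)||y - Xw||^2 + C||w||_1 by plain subgradient descent."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        grad_smooth = X.T @ (X @ w - y) / N   # gradient of the quadratic term
        subgrad_l1 = C * np.sign(w)           # one valid subgradient of C*||w||_1
        w = w - step * (grad_smooth + subgrad_l1)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([1.5, -2.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_subgradient_descent(X, y), 2))
```

Note that plain subgradient descent rarely produces coefficients that are exactly zero; in practice, proximal (soft-thresholding) updates of the kind seen in Eq. (6) are usually preferred for sparse penalties.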

An example of LASSO
Figure 7: Example of application of LASSO on the Diabetes dataset: (a) MSE on the test data and (b) sparsity [%] of the solution, both as a function of the C value. For C = 10^-3, the MSE is very close to the optimal one, but the output parameter vector has 30% of zero elements.

Some extensions

Elastic net penalty
We can apply both types of regularization together:

$J(w) = \frac{1}{2N} \| y - Xw \|_2^2 + C_1 \| w \|_1 + C_2 \| w \|_2^2,$

where we now have two scalar coefficients $C_1, C_2$. This is called the elastic net penalty. The $\ell_2$ penalty is used to further stabilize the weights that are not set to 0 by the $\ell_1$ penalty. This idea is relatively general: you can combine several penalties to obtain a wide range of effects. For more than 2 penalties, however, setting their relative weights might not be immediate, particularly if their effects are interdependent.
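A minimal scikit-learn sketch of the elastic net; its (alpha, l1_ratio) parametrization corresponds to $C_1 = \alpha \cdot \text{l1\_ratio}$ and $C_2 = \frac{\alpha}{2}(1 - \text{l1\_ratio})$ in the objective above, and the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                 # only a few informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
print("non-zero weights:", int(np.sum(enet.coef_ != 0)))
```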

Subset selection
If we care about sparse weight vectors, a natural formulation is to penalize directly the number of non-zero entries in the vector:

$J(w) = \frac{1}{2N} \| y - Xw \|_2^2 + C \sum_{i=1}^{d} \big( 1 - I_0(w_i) \big),$

where $I_0(\cdot)$ is the indicator function of zero, equal to one if its argument is zero and zero otherwise, so that the sum counts the non-zero entries of $w$. Sometimes this penalty is called the $\ell_0$ norm, although mathematically it is not a norm, and it is extremely difficult to optimize directly in a numerical fashion. The previous problem is called best subset selection. It is more naturally solved by applying some heuristic feature selection algorithms.

Group sparse regularization
Suppose that our variables can be logically grouped into $T$ groups $w_1, \dots, w_T$, such that $w$ is the concatenation of the subvectors $w_1, \dots, w_T$. We can set entire groups to zero together by using the so-called group LASSO regularization:

$J(w) = \frac{1}{2N} \| y - Xw \|_2^2 + C \sum_{k=1}^{T} \sqrt{d_k} \, \| w_k \|_2,$

where $d_k$ is the size of the $k$-th group.
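One way to see the block-sparsity effect is through the proximal operator associated with the group penalty, sketched below: each group is either shrunk as a block or set entirely to zero. The group structure, threshold, and values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Shrink each group as a block; drop it entirely if its norm is small."""
    w = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        threshold = lam * np.sqrt(len(idx))   # threshold scaled by group size, as in the penalty
        w[idx] = 0.0 if norm <= threshold else (1.0 - threshold / norm) * w[idx]
    return w

w = np.array([0.1, -0.2, 1.5, 2.0, 0.05])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(group_soft_threshold(w, groups, lam=0.3))   # first and last groups are zeroed out
```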

Further readings
The LASSO algorithm and some subset selection methods are presented in Chapter 4 of the book.

A very good starting point for a theoretical analysis of the properties of loss functions is:
Rosasco, L., De Vito, E., Caponnetto, A., Piana, M. and Verri, A., 2004. Are loss functions all the same? Neural Computation, 16(5), pp. 1063-1076.

The following is a comprehensive monograph on optimization for sparsity-inducing penalties (in the convex case):
Bach, F., Jenatton, R., Mairal, J. and Obozinski, G., 2012. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1), pp. 1-106.