MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017

About this class: We introduce a class of learning algorithms based on Tikhonov regularization and study the computational aspects of these algorithms.

Empirical Risk Minimization (ERM): probably the most popular approach to designing learning algorithms. The general idea is to consider the empirical error
$\hat{E}(f) = \frac{1}{n}\sum_{i=1}^n l(y_i, f(x_i))$
as a proxy for the expected error
$E(f) = \mathbb{E}[l(y, f(x))] = \int dx\, dy\, p(x, y)\, l(y, f(x)).$
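
As a quick illustration of this idea (our own sketch, not part of the slides), the snippet below computes the empirical risk of a fixed predictor under the square loss on simulated data; since p(x, y) is known here, one can check that $\hat{E}(f)$ stabilizes around the expected risk as n grows. The function and variable names are ours.

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """E_hat(f) = (1/n) * sum_i loss(y_i, f(x_i))."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

# Toy check: data simulated from a known p(x, y); for a fixed predictor f,
# the empirical risk approaches its expected risk as n grows.
rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

square_loss = lambda yi, t: (yi - t) ** 2
f = lambda x: x @ (0.9 * w_true)  # some fixed (imperfect) predictor
print(empirical_risk(f, square_loss, X, y))
```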

The Expected Risk is Not Computable: Recall that l measures the price we pay for predicting f(x) when the true label is y. E(f) cannot be computed directly, since p(x, y) is unknown.

From Theory to Algorithms: The Hypothesis Space. To turn the above idea into an actual algorithm, we fix a suitable hypothesis space H and minimize $\hat{E}$ over H. H should allow feasible computations and be rich, since the complexity of the problem is not known a priori.

Example: Space of Linear Functions. The simplest example of H is the space of linear functions:
$H = \{f : \mathbb{R}^d \to \mathbb{R} \,:\, \exists\, w \in \mathbb{R}^d \text{ such that } f(x) = x^T w, \ \forall x \in \mathbb{R}^d\}.$
Each function f is defined by a vector w: $f_w(x) = x^T w$.

A Rich H May Require Regularization: If H is rich enough, solving ERM may cause overfitting (solutions highly dependent on the data). Regularization techniques restore stability and ensure generalization.

Tikhonov Regularization: Consider the Tikhonov regularization scheme
$\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2. \qquad (1)$
It describes a large class of methods sometimes called Regularization Networks.

The Regularizer: $\|w\|^2$ is called the regularizer; it controls the stability of the solution and prevents overfitting. $\lambda$ balances the error term and the regularizer.

Loss Functions: Different loss functions l induce different classes of methods. We will see common aspects and differences when considering different loss functions. There is no general computational scheme to solve Tikhonov regularization: the solution depends on the loss function considered.

The Regularized Least Squares Algorithm: Regularized Least Squares is Tikhonov regularization,
$\min_{w \in \mathbb{R}^D} \hat{E}(f_w) + \lambda \|w\|^2, \qquad \hat{E}(f_w) = \frac{1}{n}\sum_{i=1}^n l(y_i, f_w(x_i)), \qquad (2)$
with the square loss function
$l(y, f_w(x)) = (y - f_w(x))^2.$
We then obtain the RLS optimization problem (linear model):
$\min_{w \in \mathbb{R}^D} \frac{1}{n}\sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda w^T w, \qquad \lambda \ge 0. \qquad (3)$

Matrix Notation: Let $X_n$ be the $n \times d$ matrix whose rows are the input points, and $Y_n$ the $n \times 1$ vector whose entries are the corresponding outputs. With this notation,
$\frac{1}{n}\sum_{i=1}^n (y_i - w^T x_i)^2 = \frac{1}{n}\|Y_n - X_n w\|^2.$
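
A small numerical sanity check of the matrix form (our own sketch, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
Xn = rng.standard_normal((n, d))   # rows are the input points
Yn = rng.standard_normal(n)        # corresponding outputs
w = rng.standard_normal(d)

sum_form = np.mean([(Yn[i] - w @ Xn[i]) ** 2 for i in range(n)])
matrix_form = np.linalg.norm(Yn - Xn @ w) ** 2 / n
assert np.isclose(sum_form, matrix_form)
```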

Gradients of the Empirical Risk and of the Regularizer: By direct computation, the gradient of the empirical risk with respect to w is
$-\frac{2}{n} X_n^T (Y_n - X_n w),$
and the gradient of the regularizer with respect to w is $2w$.

The RLS Solution: Setting the gradient to zero, the RLS solution solves the linear system
$(X_n^T X_n + \lambda n I)\, w = X_n^T Y_n.$
$\lambda$ controls the invertibility of $(X_n^T X_n + \lambda n I)$.

Choosing the Cholesky Solver: Several methods can be used to solve the above linear system. Cholesky decomposition is the method of choice, since $X_n^T X_n + \lambda n I$ is symmetric and positive definite.
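
A minimal sketch of the resulting solver in Python, assuming NumPy/SciPy; function names such as `rls_fit` are ours, not from the course material.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rls_fit(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y via a Cholesky factorization;
    the matrix is symmetric and positive definite for lam > 0."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # forming X^T X costs O(n d^2)
    return cho_solve(cho_factor(A), X.T @ y)

def rls_predict(w, X_new):
    return X_new @ w                    # O(d) per test point
```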

Time Complexity: Training: $O(nd^2)$ (assuming $n \gg d$). Testing: $O(d)$.

Dealing with an Offset: For linear models, especially in low-dimensional spaces, it is useful to consider an offset, $w^T x + b$. How do we estimate b from the data?

Idea: Augmenting the Dimension of the Input Space. A simple idea is to augment the dimension of the input space, considering $\tilde{x} = (x, 1)$ and $\tilde{w} = (w, b)$. This is fine if we do not regularize, but if we do, this method tends to prefer linear functions passing through the origin (zero offset), since the regularizer becomes
$\|\tilde{w}\|^2 = \|w\|^2 + b^2.$

Avoiding Penalizing Solutions with an Offset: We want to regularize using only $\|w\|^2$, without penalizing the offset. The modified regularized problem becomes
$\min_{(w, b) \in \mathbb{R}^{D+1}} \frac{1}{n}\sum_{i=1}^n (y_i - w^T x_i - b)^2 + \lambda \|w\|^2.$

Solution with Offset: Centering the Data. It can be proved that a solution $(w^*, b^*)$ of the above problem is given by
$b^* = \bar{y} - \bar{x}^T w^*,$
where $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.

Solution with Offset: Centering the Data. $w^*$ solves
$\min_{w \in \mathbb{R}^D} \frac{1}{n}\sum_{i=1}^n (y_i^c - w^T x_i^c)^2 + \lambda \|w\|^2,$
where $y_i^c = y_i - \bar{y}$ and $x_i^c = x_i - \bar{x}$ for all $i = 1, \dots, n$. Note: this corresponds to centering the data and then applying the standard RLS algorithm.
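
A sketch of the offset variant (center the data, solve standard RLS on the centered data with a generic linear solve, then recover b; names are ours):

```python
import numpy as np

def rls_fit_offset(X, y, lam):
    """RLS with an unpenalized offset: center the data, run standard RLS on
    the centered data, then recover b = y_bar - x_bar^T w."""
    n, d = X.shape
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar
    w = np.linalg.solve(Xc.T @ Xc + lam * n * np.eye(d), Xc.T @ yc)
    b = y_bar - x_bar @ w
    return w, b   # predict with X_new @ w + b
```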

Introduction: Regularized Logistic Regression. Regularized logistic regression is Tikhonov regularization,
$\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2, \qquad \hat{E}(f_w) = \frac{1}{n}\sum_{i=1}^n l(y_i, f_w(x_i)), \qquad (4)$
with the logistic loss function
$l(y, f_w(x)) = \log\!\big(1 + e^{-y f_w(x)}\big).$

The Logistic Loss Function. Figure: plot of the logistic loss function.

Minimization Through Gradient Descent: The logistic loss function is differentiable, so the natural candidate for computing a minimizer is the gradient descent (GD) algorithm.

Regularized Logistic Regression (RLR): The regularized ERM problem associated with the logistic loss is called regularized logistic regression, and its solution can be computed via gradient descent. Note that
$\nabla \hat{E}(f_{w_{t-1}}) = \frac{1}{n}\sum_{i=1}^n -x_i y_i \frac{e^{-y_i x_i^T w_{t-1}}}{1 + e^{-y_i x_i^T w_{t-1}}} = -\frac{1}{n}\sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i x_i^T w_{t-1}}}.$

RLR: Gradient Descent Iteration. For $w_0 = 0$, the GD iteration applied to
$\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2$
is
$w_t = w_{t-1} - \gamma \left( -\frac{1}{n}\sum_{i=1}^n \frac{y_i x_i}{1 + e^{y_i x_i^T w_{t-1}}} + 2\lambda w_{t-1} \right)$
for $t = 1, \dots, T$, where the term in parentheses is $a = \nabla\big(\hat{E}(f_w) + \lambda \|w\|^2\big)$ evaluated at $w_{t-1}$.
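
A minimal gradient-descent sketch following the iteration above; the step size `gamma`, the number of iterations `T`, and the function name are our choices, not prescribed by the slides. Labels are assumed to be in {-1, +1}.

```python
import numpy as np

def rlr_fit(X, y, lam, gamma=0.1, T=1000):
    """Gradient descent for min_w (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        # gradient of the empirical risk: -(1/n) sum_i y_i x_i / (1 + exp(y_i x_i^T w))
        grad_risk = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        w = w - gamma * (grad_risk + 2.0 * lam * w)
    return w
```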

Logistic Regression and Confidence Estimation: The solution of logistic regression has a probabilistic interpretation. It can be derived from the model
$p(1 \mid x) = \frac{e^{x^T w}}{1 + e^{x^T w}} = h(x),$
where h is called the logistic function. This can be used to compute a confidence for each prediction.
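
The corresponding confidence estimate, as a sketch (the logistic function of the linear score; naming is ours):

```python
import numpy as np

def rlr_confidence(w, X_new):
    """p(1 | x) = e^{x^T w} / (1 + e^{x^T w}), i.e. the logistic function of x^T w."""
    return 1.0 / (1.0 + np.exp(-(X_new @ w)))
```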

Support Vector Machines: Formulation in terms of Tikhonov regularization,
$\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2, \qquad \hat{E}(f_w) = \frac{1}{n}\sum_{i=1}^n l(y_i, f_w(x_i)), \qquad (5)$
with the hinge loss function
$l(y, f_w(x)) = |1 - y f_w(x)|_+.$
[Figure: plot of the hinge loss as a function of $y \cdot f(x)$.]
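
Since the hinge loss is not differentiable at $y f_w(x) = 1$, plain gradient descent does not directly apply to this formulation; one simple option (our sketch, not the method prescribed by the slides) is subgradient descent:

```python
import numpy as np

def svm_subgradient_fit(X, y, lam, gamma=0.01, T=2000):
    """Subgradient descent for min_w (1/n) sum_i |1 - y_i w^T x_i|_+ + lam * ||w||^2.
    Labels y in {-1, +1}; gamma and T are tuning choices."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        active = margins < 1.0                     # points with non-zero hinge loss
        subgrad = -(X[active].T @ y[active]) / n   # a subgradient of the hinge term
        w = w - gamma * (subgrad + 2.0 * lam * w)
    return w
```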

A More Classical Formulation (linear case): with $\lambda = \frac{1}{C}$,
$w^* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n |1 - y_i w^T x_i|_+ + \lambda \|w\|^2.$

A More Classical Formulation (linear case):
$w^* = \arg\min_{w \in \mathbb{R}^d,\, \xi_i \ge 0} \|w\|^2 + \frac{C}{n}\sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i w^T x_i \ge 1 - \xi_i, \ i \in \{1, \dots, n\}.$

A Geometric Intuition (classification): In general you have many possible solutions. [Figure: two classes of points in the plane, with several separating lines.] What do you select?

A Geometric Intuition (classification): Intuitively I would choose an equidistant line. [Figure: the two classes with an equidistant separating line.]

Maximum Margin Classifier: I want the classifier that classifies the dataset perfectly and maximizes the distance from its closest examples. [Figure: the two classes with a maximum-margin separating line.]

Point-Hyperplane Distance: How do we express this mathematically? Let w be our separating hyperplane. Decompose
$x = \alpha w + x_\perp,$
with $\alpha = \frac{x^T w}{\|w\|^2}$ and $x_\perp = x - \alpha w$ (the projection of x onto the hyperplane). The point-hyperplane distance is then
$d(x, w) = \|x - x_\perp\| = \|\alpha w\| = \frac{|x^T w|}{\|w\|}.$
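
A small numerical check (ours) of the decomposition and of the resulting distance formula:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal(4)
x = rng.standard_normal(4)

alpha = (x @ w) / (w @ w)     # component of x along w
x_perp = x - alpha * w        # projection of x onto the hyperplane {z : w^T z = 0}
assert np.isclose(x_perp @ w, 0.0)

dist = np.linalg.norm(x - x_perp)                        # = |alpha| * ||w||
assert np.isclose(dist, abs(x @ w) / np.linalg.norm(w))  # point-hyperplane distance
```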

Margin: A hyperplane w correctly classifies an example $(x_i, y_i)$ if $y_i = 1$ and $w^T x_i > 0$, or $y_i = -1$ and $w^T x_i < 0$; therefore $x_i$ is correctly classified iff $y_i w^T x_i > 0$. Margin: $m_i = y_i w^T x_i$. Note that $x_\perp = x_i - \frac{y_i m_i}{\|w\|^2} w$.

Maximum Margin Classifier, Definition: I want the classifier that classifies the dataset perfectly and maximizes the distance from its closest examples:
$w^* = \arg\max_{w \in \mathbb{R}^d} \min_{1 \le i \le n} d(x_i, w)^2 \quad \text{subject to} \quad m_i > 0, \ i \in \{1, \dots, n\}.$
Calling $\mu$ the smallest $m_i$, we have
$w^* = \arg\max_{w \in \mathbb{R}^d,\, \mu \ge 0} \min_{1 \le i \le n} \frac{(x_i^T w)^2}{\|w\|^2} \quad \text{subject to} \quad y_i w^T x_i \ge \mu, \ i \in \{1, \dots, n\}.$

Computation of $w^*$:
$w^* = \arg\max_{w \in \mathbb{R}^d,\, \mu \ge 0} \frac{\mu^2}{\|w\|^2} \quad \text{subject to} \quad y_i w^T x_i \ge \mu, \ i \in \{1, \dots, n\}.$

Computation of $w^*$:
$w^* = \arg\max_{w \in \mathbb{R}^d,\, \mu \ge 0} \frac{\mu^2}{\|w\|^2} \quad \text{subject to} \quad y_i w^T x_i \ge \mu, \ i \in \{1, \dots, n\}.$
Note that if $y_i w^T x_i \ge \mu$, then $y_i (\alpha w)^T x_i \ge \alpha\mu$ and $\frac{\mu^2}{\|w\|^2} = \frac{(\alpha\mu)^2}{\|\alpha w\|^2}$ for any $\alpha > 0$. Therefore we have to fix the scale, and in particular we choose $\mu = 1$.

Computation of $w^*$:
$w^* = \arg\max_{w \in \mathbb{R}^d} \frac{1}{\|w\|^2} \quad \text{subject to} \quad y_i w^T x_i \ge 1, \ i \in \{1, \dots, n\}.$

Computation of $w^*$:
$w^* = \arg\min_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{subject to} \quad y_i w^T x_i \ge 1, \ i \in \{1, \dots, n\}.$

What if the problem is not separable? We relax the constraints and penalize the relaxation. Starting from
$w^* = \arg\min_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{subject to} \quad y_i w^T x_i \ge 1, \ i \in \{1, \dots, n\},$

we obtain the relaxed problem
$w^* = \arg\min_{w \in \mathbb{R}^d,\, \xi_i \ge 0} \|w\|^2 + \frac{C}{n}\sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i w^T x_i \ge 1 - \xi_i, \ i \in \{1, \dots, n\},$
where C is a penalization parameter for the average error $\frac{1}{n}\sum_{i=1}^n \xi_i$.

Dual Formulation: It can be shown that the solution of the SVM problem is of the form
$w^* = \sum_{i=1}^n \alpha_i y_i x_i,$
where the $\alpha_i$ are given by the solution of the following quadratic programming problem:
$\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j x_i^T x_j \quad \text{subject to} \quad \alpha_i \ge 0, \ i = 1, \dots, n.$
The solution requires the estimation of n rather than D coefficients. The $\alpha_i$ are often sparse; the input points associated with non-zero coefficients are called support vectors.
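
The slides do not prescribe a solver for this quadratic program. As an illustration only, the sketch below attacks the dual with projected gradient ascent, projecting onto the constraint set $\alpha_i \ge 0$ at every step. It follows the constraint set as written on the slide and so assumes linearly separable data (for the soft-margin problem the dual usually also carries an upper bound on each $\alpha_i$); in practice one would call an off-the-shelf QP solver instead. All names are ours.

```python
import numpy as np

def svm_dual_fit(X, y, steps=5000):
    """Projected gradient ascent on the SVM dual:
    max_a sum_i a_i - 1/2 sum_ij y_i y_j a_i a_j x_i^T x_j,  subject to a_i >= 0."""
    n = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j
    eta = 1.0 / (np.linalg.norm(G, 2) + 1e-12)  # step size from the curvature of G
    a = np.zeros(n)
    for _ in range(steps):
        grad = 1.0 - G @ a                      # gradient of the dual objective
        a = np.maximum(0.0, a + eta * grad)     # ascent step + projection onto a_i >= 0
    w = X.T @ (a * y)                           # w = sum_i a_i y_i x_i
    return w, a                                 # non-zero a_i mark the support vectors
```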

Wrapping Up: Regularized empirical risk minimization,
$w^* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n l(y_i, w^T x_i) + \lambda \|w\|^2.$
Examples of Regularization Networks: $l(y, t) = (y - t)^2$ (square loss) leads to Least Squares; $l(y, t) = \log(1 + e^{-yt})$ (logistic loss) leads to Logistic Regression; $l(y, t) = |1 - yt|_+$ (hinge loss) leads to the Maximum Margin Classifier.
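
The three losses of this class side by side, as a small sketch (names are ours):

```python
import numpy as np

# Losses as functions of the label y and of the linear score t = w^T x
# (y in {-1, +1} for the logistic and hinge losses).
losses = {
    "square":   lambda y, t: (y - t) ** 2,                  # -> Least Squares
    "logistic": lambda y, t: np.log1p(np.exp(-y * t)),      # -> Logistic Regression
    "hinge":    lambda y, t: np.maximum(0.0, 1.0 - y * t),  # -> Maximum Margin Classifier
}
```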

Next class... beyond linear models!