Linear Regression with Shrinkage
Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.
Linear Regression The linear regression model: $f(X) = w_0 + \sum_{j=1}^{M} w_j X_j$. The $w_j$ are unknown parameters, and the $X_j$ can come from different sources, such as: inputs; transformations of inputs (log, square root, etc.); basis expansions.
Linear Regression Typically we have a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ from which to estimate the parameters. Each $x_i = (x_{i1}, \ldots, x_{iM})$ is a vector of measurements for the $i$th case, or $h(x_i) = \{h_0(x_i), \ldots, h_M(x_i)\}$ is a basis expansion of $x_i$: $y(x) = f(x; w) = \sum_{j=0}^{M} w_j h_j(x) = w^t h(x)$ (define $h_0(x) = 1$).
Basis Functions There are many basis functions we can use, e.g.: Polynomial $h_j(x) = x^j$; Radial basis functions $h_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$; Sigmoidal $h_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$; Splines, Fourier, Wavelets, etc.
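To make the basis-expansion idea concrete, here is a minimal sketch (not from the slides) that builds design matrices for polynomial and Gaussian radial basis functions in NumPy; the function names, centers, and widths are illustrative choices.

```python
import numpy as np

def polynomial_design(x, degree):
    """Columns [1, x, x^2, ..., x^degree] for 1-D inputs x (h_j(x) = x^j)."""
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

def rbf_design(x, centers, s):
    """Gaussian RBF columns h_j(x) = exp(-(x - mu_j)^2 / (2 s^2)),
    plus a leading column of ones for h_0(x) = 1."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centers, dtype=float)[None, :]
    H = np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), H])

x = np.linspace(-1.0, 1.0, 5)
print(polynomial_design(x, 3).shape)                          # (5, 4)
print(rbf_design(x, centers=[-0.5, 0.0, 0.5], s=0.3).shape)   # (5, 4)
```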
Linear Regression Estimation Minimize the residual prediction loss, in terms of the mean squared error on the n training samples (the empirical squared loss): $J_n(w) = \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i; w))^2 = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \sum_{j} w_j x_{ij}\right)^2 = \frac{1}{n}(y - Xw)^t (y - Xw)$, where $X$ is the matrix whose n rows are the basis-function measurements of the samples, $w$ is the vector of the M weights, and $y$ is the vector of the n output measurements.
Linear Regression Solution By setting the derivatives of $(y - Xw)^t (y - Xw)$ with respect to $w$ to zero, we get the solution (as we did for MSE): $\hat{w} = (X^t X)^{-1} X^t y$. The solution is a linear function of the outputs y.
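A minimal NumPy sketch of this closed-form solution on synthetic data of our own choosing; in practice np.linalg.lstsq (QR/SVD based) is numerically preferable to forming $X^t X$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 50, 3
X = rng.normal(size=(n, M))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: w_hat = (X^t X)^{-1} X^t y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, numerically more stable
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_lstsq)   # essentially identical for a well-conditioned X
```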
Statistical view of linear regression In a statistical regression model we model both the function and the noise: Observed output = function + noise, i.e. $y(x) = f(x; w) + \varepsilon$, where, e.g., $\varepsilon \sim N(0, \sigma^2)$. Whatever we cannot capture with our chosen family of functions will be interpreted as noise.
Statistical view of linear regression $f(x; w)$ is trying to capture the mean of the observations y given the input x: $E[y \mid x] = E[f(x; w) + \varepsilon \mid x] = f(x; w)$, where $E[y \mid x]$ is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution of X).
Statistical view of linear regression According to our statistical model $y(x) = f(x; w) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, the outputs y given x are normally distributed with mean $f(x; w)$ and variance $\sigma^2$: $p(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - f(x; w))^2}{2\sigma^2}\right)$ (we model the uncertainty in the predictions, not just the mean).
Maximum likelihood estimation Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ we find the parameters w that maximize the likelihood of the outputs: $L(w, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}\right)$. Maximizing the log-likelihood $\log L(w, \sigma^2)$ is equivalent to minimizing $\sum_{i=1}^{n} (y_i - f(x_i; w))^2$.
Maximum likelihood estimation Thus $\hat{w}_{MLE} = \arg\min_w \sum_{i=1}^{n} (y_i - f(x_i; w))^2$. But the empirical squared loss is $J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2$: Least-squares Linear Regression is MLE for Gaussian noise!
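A quick numerical check of this equivalence (our own synthetic example, not from the slides): minimizing the Gaussian negative log-likelihood with a generic optimizer recovers the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([0.7, -1.2]) + 0.3 * rng.normal(size=100)
sigma2 = 0.3 ** 2   # assume the noise variance is known here

def neg_log_likelihood(w):
    # -log prod_i N(y_i | f(x_i; w), sigma2), dropping w-independent constants
    return 0.5 * np.sum((y - X @ w) ** 2) / sigma2

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_mle, w_ls, atol=1e-4))   # True: the two estimates coincide
```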
Linear Regression (is it good?) Simple model. Straightforward solution. BUT...
Linear Regression - Example In this example the output is generated by the model $y = 0.8\, x_1 + 0.2\, x_2 + \varepsilon$ (where $\varepsilon$ is a small white Gaussian noise). Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.
Linear Regression Example And the results are: [Figure: least-squares fits for the three training sets.] $x_1, x_2$ uncorrelated: $\hat{w} \approx (0.81, 0.09)$, RSS $\approx 0.09$. $x_1, x_2$ correlated: $\hat{w} \approx (0.43, 0.57)$, RSS $\approx 0.05$. $x_1, x_2$ strongly correlated: coefficients on the order of $+30$ and $-30$, RSS $\approx 0.08$. Strong correlation can cause the coefficients to become very large, and they will cancel each other out to achieve a good RSS.
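The qualitative effect is easy to reproduce. Below is a hedged sketch with our own data and noise (so the exact numbers will differ from those above), assuming the generating model $y = 0.8 x_1 + 0.2 x_2 + \varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_ls(rho, n=20):
    """Draw two inputs with correlation rho, generate y = 0.8*x1 + 0.2*x2 + noise,
    and return the least-squares coefficients and the RSS."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = X @ np.array([0.8, 0.2]) + 0.1 * rng.normal(size=n)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, np.sum((y - X @ w) ** 2)

for rho in (0.0, 0.9, 0.999):
    w, rss = fit_ls(rho)
    print(f"rho={rho:5.3f}  w={np.round(w, 2)}  RSS={rss:.3f}")
# As rho -> 1, the individual coefficients can grow large and nearly cancel,
# while the RSS stays small.
```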
Linear Regression What can be done?
Shrinkage methods Ridge Regression Lasso PCR (Principal Components Regression) PLS (Partial Least Squares)
Shrinkage methods Before we proceed: since the following methods are not invariant under input scale, we will assume the inputs are standardized (mean 0, variance 1): $x_{ij} \leftarrow \frac{x_{ij} - \bar{x}_j}{\sigma_j}$. The offset $w_0$ is always estimated as $\hat{w}_0 = \frac{1}{N}\sum_i y_i$, and we will work with centered outputs (meaning $y \leftarrow y - \hat{w}_0$).
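A tiny sketch of this preprocessing step (the helper name is ours):

```python
import numpy as np

def standardize(X, y):
    """Scale inputs to mean 0 / variance 1, and center the outputs.
    The offset w_0 is just the mean of y and is handled separately."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - x_mean) / x_std
    w0 = y.mean()
    return Xs, y - w0, w0
```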
Ridge Regression Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay). In ridge regression, we add a quadratic penalty on the weights: $J(w) = \sum_{i=1}^{N} \left(y_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j\right)^2 + \lambda \sum_{j=1}^{M} w_j^2$, where $\lambda \geq 0$ is a tuning parameter that controls the amount of shrinkage. The size constraint prevents the phenomenon of wildly large coefficients that cancel each other.
Ridge Regression Solution Ridge regression in matrix form: $J(w) = (y - Xw)^t (y - Xw) + \lambda w^t w$. The solution is $\hat{w}_{ridge} = (X^t X + \lambda I_M)^{-1} X^t y$, compared with $\hat{w}_{LS} = (X^t X)^{-1} X^t y$. The solution adds a positive constant to the diagonal of $X^t X$ before inversion. This makes the problem non-singular even if X does not have full column rank. For orthonormal inputs the ridge estimates are just a scaled version of the least squares estimates: $\hat{w}_{ridge} = \frac{\hat{w}_{LS}}{1 + \lambda}$.
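A minimal NumPy sketch of the closed-form ridge solution (inputs assumed standardized and outputs centered as above; the data and λ values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^t X + lam * I)^{-1} X^t y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(size=30)
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 3))
# The coefficients shrink toward 0 as lambda grows.
```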
Ridge Regression (insights) The matrix X can be represented by its SVD: $X = U D V^t$, where U is an $N \times M$ matrix whose columns span the column space of X, D is an $M \times M$ diagonal matrix of singular values, and V is an $M \times M$ matrix whose columns span the row space of X. Let's see how this method looks in the principal-components coordinates of X.
Ridge Regression (insights) Rotate into the principal-components coordinates: $X' = XV$ and $\hat{w}' = V^t \hat{w}$. The least squares solution becomes $\hat{w}'_{LS} = V^t (V D U^t U D V^t)^{-1} V D U^t y = D^{-1} U^t y$. The ridge regression solution is similarly $\hat{w}'_{ridge} = V^t (V D U^t U D V^t + \lambda I)^{-1} V D U^t y = (D^2 + \lambda I)^{-1} D U^t y = (D^2 + \lambda I)^{-1} D^2 \hat{w}'_{LS}$, where $(D^2 + \lambda I)^{-1} D^2$ is diagonal.
Ridge Regression (insights) ridge ˆ ' j d j d j ˆ ' In the PCA axes, the ridge coefficients are just scaled LS coefficients! The coefficients that correspond to smaller input variance directions are scaled don more. ls j d d
Ridge Regression In the following simulations, quantities are plotted versus the quantity $df(\lambda) = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}$. This monotonically decreasing function of $\lambda$ is the effective degrees of freedom of the ridge regression fit.
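A short sketch computing $df(\lambda)$ from the singular values (illustrative data):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom of the ridge fit: sum_j d_j^2 / (d_j^2 + lambda)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d ** 2 / (d ** 2 + lam))

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
for lam in (0.0, 1.0, 10.0, 100.0, 1e6):
    print(lam, round(ridge_df(X, lam), 2))
# df decreases monotonically from M = 4 (at lambda = 0) toward 0 as lambda grows.
```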
Ridge Regression Simulation results, 4 dimensions (true weights w = 1, 2, 3, 4). [Figure: ridge coefficient paths and RSS versus effective degrees of freedom, for uncorrelated inputs and for strongly correlated inputs.]
Ridge regression is MAP with a Gaussian prior $J(w) = -\log P(D \mid w) P(w) = -\log \prod_{i=1}^{n} N(y_i \mid w^t x_i, \sigma^2)\, N(w \mid 0, \tau^2 I) = \frac{1}{2\sigma^2}(y - Xw)^t (y - Xw) + \frac{1}{2\tau^2} w^t w + \text{const}$. This is the same objective function that ridge solves, using $\lambda = \sigma^2 / \tau^2$. Ridge: $J(w) = (y - Xw)^t (y - Xw) + \lambda w^t w$.
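A numerical sanity check of this equivalence (our own synthetic data; the values of $\sigma^2$ and $\tau^2$ are illustrative): the MAP estimate under a Gaussian prior matches the closed-form ridge solution with $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.5 * rng.normal(size=50)
sigma2, tau2 = 0.25, 1.0   # noise variance and prior variance (illustrative)

def neg_log_posterior(w):
    # -log [ prod_i N(y_i | x_i^t w, sigma2) * N(w | 0, tau2 I) ], constants dropped
    return 0.5 * np.sum((y - X @ w) ** 2) / sigma2 + 0.5 * (w @ w) / tau2

w_map = minimize(neg_log_posterior, np.zeros(3)).x
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-3))   # True: MAP with a Gaussian prior is ridge
```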
Lasso Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by $\hat{w}_{lasso} = \arg\min_w \sum_{i=1}^{N} \left(y_i - \sum_{j=1}^{M} x_{ij} w_j\right)^2$ subject to $\sum_{j=1}^{M} |w_j| \leq t$. There is a similarity to the ridge regression problem: the $L_2$ ridge penalty $\sum_j w_j^2$ is replaced by the $L_1$ lasso penalty $\sum_j |w_j|$. The $L_1$ penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
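The constrained problem above is a quadratic program, as the slide notes. An alternative that is common in practice (not from the slides) solves the equivalent penalized form $\min_w \frac{1}{2}\|y - Xw\|^2 + \lambda \sum_j |w_j|$ by coordinate descent with soft-thresholding; a hedged sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_w 0.5*||y - Xw||^2 + lam*sum_j |w_j|
    (X assumed standardized, y centered)."""
    n, M = X.shape
    w = np.zeros(M)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(M):
            r_j = y - X @ w + X[:, j] * w[j]            # residual excluding feature j
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=20.0), 2))
# Irrelevant coefficients are driven exactly to 0 -- unlike ridge, which only shrinks them.
```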
Lasso If t is chosen to be larger than $t_0 = \sum_{j=1}^{M} |\hat{w}_j^{LS}|$, then the lasso estimates are identical to the least squares estimates. On the other hand, for say $t = t_0 / 2$, the least squares coefficients are shrunk by about 50% on average. For convenience we will plot the simulation results versus the shrinkage factor $s = \frac{t}{t_0} = \frac{t}{\sum_{j=1}^{M} |\hat{w}_j^{LS}|}$.
Lasso Simulation results: [Figure: lasso coefficient paths and RSS versus the shrinkage factor s, for uncorrelated inputs and for strongly correlated inputs.]
L1 vs L2 penalties [Figure: contours of the LS error function together with the $L_1$ constraint region $|w_1| + |w_2| \leq t$ and the $L_2$ constraint region $w_1^2 + w_2^2 \leq t^2$.] In lasso the constraint region has corners; when the solution hits a corner the corresponding coefficient becomes 0 (and when M > 2, possibly more than one).
Problem: The figure plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.
Select all the appropriate models for each column.