Linear Regression with Shrinkage
Introduction. Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.
Linear Regression. The linear regression model: f(x) = w_0 + Σ_{j=1}^M w_j x_j. The w_j are unknown parameters, and the x_j can come from different sources, such as: inputs; transformations of inputs (log, square root, etc.); basis expansions.
Linear Regression. Typically we have a set of training data (x_1, y_1), ..., (x_n, y_n) from which to estimate the parameters. Each x_i = (x_{i1}, ..., x_{iM}) is a vector of measurements for the i-th case; alternatively, h(x_i) = {h_0(x_i), ..., h_M(x_i)} is a basis expansion of x_i. The model is then y(x) ≈ f(x; w) = w_0 + Σ_{j=1}^M w_j h_j(x) = w^T h(x) (define h_0(x) = 1).
Basis Functions. There are many basis functions we can use, e.g.: Polynomial: h_j(x) = x^j. Sigmoidal: h_j(x) = 1 / (1 + exp(−(x − μ_j)/s)). Radial basis functions: h_j(x) = exp(−(x − μ_j)² / (2s²)). Splines, Fourier bases, wavelets, etc.
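The three basis families above can be sketched as design-matrix builders. This is a minimal numpy illustration; the function names (`poly_design`, `rbf_design`, `sigmoid_design`) and the test data are our own, not from the slides.

```python
import numpy as np

def poly_design(x, M):
    # Columns h_j(x) = x**j for j = 0..M (h_0 is the constant term).
    return np.vstack([x**j for j in range(M + 1)]).T

def rbf_design(x, centers, s):
    # Gaussian bumps h_j(x) = exp(-(x - mu_j)**2 / (2 s**2)).
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

def sigmoid_design(x, centers, s):
    # Sigmoidal ridges h_j(x) = 1 / (1 + exp(-(x - mu_j)/s)).
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(-1, 1, 5)
print(poly_design(x, 2).shape)  # one row per sample, one column per basis function
```

Any of these matrices can be plugged in as X in the least-squares machinery that follows; the model stays linear in w even when the basis functions are nonlinear in x.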
Linear Regression – Estimation. Minimize the residual prediction loss in terms of mean squared error on the n training samples (the empirical squared loss): J_n(w) = (1/n) Σ_{i=1}^n (y_i − f(x_i; w))² = (1/n) (y − Xw)^T (y − Xw), where X is the n×M matrix of basis-function values (one row per sample, one column per basis function), w is the vector of M weights, and y is the vector of n output measurements.
Linear Regression – Solution. By setting the derivative of (y − Xw)^T (y − Xw) with respect to w to zero, we get the solution (as we did for MSE): ŵ = (X^T X)^{-1} X^T y. The solution is a linear function of the outputs y.
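The closed-form solution above can be checked in a few lines of numpy on synthetic data (the data and seed here are illustrative, not from the slides). Note that solving the normal equations with `solve` is preferred to forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

# Normal equations: w_hat = (X^T X)^{-1} X^T y, via a linear solve.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's built-in least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))  # both give the same least-squares solution
```

With low noise and well-conditioned X the estimate lands close to `w_true`, as expected for an unbiased estimator.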
Statistical view of linear regression. In a statistical regression model we model both the function and the noise: observed output = function + noise, i.e., y(x) = f(x; w) + ε, where, e.g., ε ~ N(0, σ²). Whatever we cannot capture with our chosen family of functions will be interpreted as noise.
Statistical view of linear regression. f(x; w) is trying to capture the mean of the observations y given the input x: E[y | x] = E[f(x; w) + ε | x] = f(x; w), where E[y | x] is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution of X).
Statistical view of linear regression. According to our statistical model y(x) = f(x; w) + ε, ε ~ N(0, σ²), the outputs y given x are normally distributed with mean f(x; w) and variance σ²: p(y | x, w, σ²) = (1/√(2πσ²)) exp(−(y − f(x; w))² / (2σ²)). (We model the uncertainty in the predictions, not just the mean.)
Maximum likelihood estimation. Given observations D = {(x_1, y_1), ..., (x_n, y_n)}, we find the parameters that maximize the likelihood of the outputs: L(w, σ) = Π_{i=1}^n p(y_i | x_i, w, σ) = Π_{i=1}^n (1/√(2πσ²)) exp(−(y_i − f(x_i; w))² / (2σ²)). Maximizing the log-likelihood log L(w, σ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − f(x_i; w))² is therefore equivalent to minimizing the squared error.
Maximum likelihood estimation. Thus w_MLE = arg min_w Σ_{i=1}^n (y_i − f(x_i; w))². But the empirical squared loss is J_n(w) = (1/n) Σ_{i=1}^n (y_i − f(x_i; w))². Least-squares linear regression is MLE for Gaussian noise!
Linear Regression (is it good?). Simple model, straightforward solution. BUT…
Linear Regression – Example. In this example the output is generated by the model y = 0.8x₁ + 0.2x₂ + ε, where ε is small white Gaussian noise. Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.
Linear Regression – Example. And the results are: [Figure: three fitted surfaces over (x₁, x₂).] x₁, x₂ uncorrelated: RSS ≈ 0.090, w ≈ (0.81, 0.09). x₁, x₂ correlated: RSS ≈ 0.050, w ≈ (0.43, 0.57). x₁, x₂ strongly correlated: RSS ≈ 0.075, w ≈ (30.36, −29.96). Strong correlation can cause the coefficients to become very large, and they will cancel each other to achieve a good RSS.
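The effect in the panels above is easy to reproduce: as x₂ becomes a near-copy of x₁, only the sum of the two coefficients is well determined, so the individual coefficients drift apart while the residual stays small. A sketch with our own synthetic data (seed and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
for noise in (1.0, 0.01):                 # weak vs. near-perfect correlation
    x2 = x1 + noise * rng.normal(size=n)  # x2 tracks x1 up to `noise`
    X = np.column_stack([x1, x2])
    y = 0.8 * x1 + 0.2 * x2 + 0.05 * rng.normal(size=n)
    w = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ w) ** 2)
    # Individual coefficients become unstable, but w[0] + w[1] stays near 1.0
    # and the RSS stays small in both cases.
    print(noise, np.round(w, 2), round(float(rss), 3))
```

This is exactly the failure mode that motivates the shrinkage methods below: the fit looks fine, but the coefficients are unreliable.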
Linear Regression. What can be done?
Shrinkage methods: Ridge Regression; Lasso; PCR (Principal Components Regression); PLS (Partial Least Squares).
Shrinkage methods. Before we proceed: since the following methods are not invariant under input scale, we will assume the input is standardized (mean 0, variance 1): x_{ij} ← (x_{ij} − x̄_j) / σ_j. The offset w_0 is always estimated as w_0 = (Σ_i y_i) / N, and we will work with centered y (meaning y ← y − w_0).
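The standardization convention above amounts to a couple of numpy lines; this sketch (data invented for illustration) centers and scales each input column and centers y by the estimated offset.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(30, 4))  # inputs with nonzero mean/scale
y = rng.normal(size=30) + 10.0                    # outputs with an offset

Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, variance 1
w0 = y.mean()                              # offset estimate w_0 = (sum y_i)/N
yc = y - w0                                # centered outputs

print(np.allclose(Xs.mean(axis=0), 0), np.allclose(Xs.std(axis=0), 1))
```

After this preprocessing, the penalized objectives below can omit the intercept entirely and penalize only the slope coefficients.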
Ridge Regression. Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay). In ridge regression, we add a quadratic penalty on the weights: J(w) = Σ_{i=1}^N (y_i − w_0 − Σ_{j=1}^M x_{ij} w_j)² + λ Σ_{j=1}^M w_j², where λ ≥ 0 is a tuning parameter that controls the amount of shrinkage. The size constraint prevents the phenomenon of wildly large coefficients that cancel each other.
Ridge Regression – Solution. Ridge regression in matrix form: J(w) = (y − Xw)^T (y − Xw) + λ w^T w. The solution is ŵ_ridge = (X^T X + λ I_M)^{-1} X^T y, compared with ŵ_LS = (X^T X)^{-1} X^T y. The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem non-singular even if X does not have full column rank. For orthogonal inputs the ridge estimates are a scaled version of the least squares estimates: ŵ_ridge = γ ŵ_LS, 0 ≤ γ ≤ 1.
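The closed-form ridge solution is one line of numpy. The sketch below (toy data of our own) also shows the shrinkage: the ridge coefficient vector always has a smaller norm than the least-squares one, since each component is scaled by a factor in (0, 1] in the rotated coordinates derived later.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=60)

def ridge(X, y, lam):
    # w_ridge = (X^T X + lam * I)^{-1} X^T y; lam = 0 recovers least squares.
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

w_ls = ridge(X, y, 0.0)
w_r = ridge(X, y, 10.0)
print(np.linalg.norm(w_r) < np.linalg.norm(w_ls))  # True: ridge shrinks ||w||
```

In practice λ is chosen by cross-validation; larger λ means more shrinkage and a simpler effective model.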
Ridge Regression (insights). The matrix X can be represented by its SVD: X = U D V^T. U is an N×M matrix; its columns span the column space of X. D is an M×M diagonal matrix of singular values. V is an M×M matrix; its columns span the row space of X. Let's see how this method looks in the principal-component coordinates of X.
Ridge Regression (insights). Rotate into the principal-component coordinates: X' = XV and w' = V^T w, so that X'w' = (XV)(V^T w) = Xw. The least squares solution is then ŵ'_LS = V^T ŵ_LS = V^T (V D U^T U D V^T)^{-1} V D U^T y = (D²)^{-1} D U^T y = D^{-1} U^T y. The ridge regression solution is similarly ŵ'_ridge = V^T ŵ_ridge = V^T (V D U^T U D V^T + λI)^{-1} V D U^T y = (D² + λI)^{-1} D U^T y = (D² + λI)^{-1} D² ŵ'_LS, i.e., a diagonal scaling of the LS solution.
Ridge Regression (insights). Componentwise, ŵ'_ridge,j = (d_j² / (d_j² + λ)) ŵ'_LS,j. In the PCA axes, the ridge coefficients are just scaled LS coefficients! The coefficients that correspond to smaller input-variance directions are scaled down more.
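The componentwise scaling factors d_j²/(d_j² + λ) can be verified directly against numpy's SVD, on arbitrary data (the toy matrices below are our own):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(d) V^T
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# In the principal-component coordinates (multiply by V^T), ridge coefficients
# are the LS coefficients scaled by d_j^2 / (d_j^2 + lam).
scale = d**2 / (d**2 + lam)
print(np.allclose(Vt @ w_ridge, scale * (Vt @ w_ls)))  # True
```

Since every scale factor lies in (0, 1) and shrinks toward 0 as d_j shrinks, the low-variance directions are suppressed the most, which is exactly what stabilizes the correlated-input example.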
Ridge Regression. In the following simulations, quantities are plotted versus df(λ) = Σ_{j=1}^M d_j² / (d_j² + λ). This monotonically decreasing function of λ is the effective degrees of freedom of the ridge regression fit.
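A quick numeric check of df(λ) on a random design (our own toy data): it equals M at λ = 0 and decreases monotonically toward 0 as λ grows.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 4))
d = np.linalg.svd(X, compute_uv=False)  # singular values of X

def df(lam):
    # Effective degrees of freedom of the ridge fit.
    return np.sum(d**2 / (d**2 + lam))

print(df(0.0))  # equals M = 4 when lam = 0
vals = [df(lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print(all(a > b for a, b in zip(vals, vals[1:])))  # True: monotone decreasing
```

Plotting against df(λ) rather than λ itself makes ridge and lasso paths comparable on a common "model complexity" axis.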
Ridge Regression. Simulation results, 4 dimensions. [Figure: ridge regression coefficients w and RSS plotted versus effective dimension df(λ), for uncorrelated inputs (left) and strongly correlated inputs (right).]
Ridge regression is MAP with a Gaussian prior. J(w) = −log P(D | w) P(w) = −log Π_{i=1}^n N(y_i | w^T x_i, σ²) N(w | 0, τ² I) = (1/σ²)(y − Xw)^T (y − Xw) + (1/τ²) w^T w + const. This is the same objective function that ridge solves, with λ = σ²/τ². Ridge: J(w) = (y − Xw)^T (y − Xw) + λ w^T w.
Lasso. Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by ŵ_lasso = arg min_w Σ_{i=1}^N (y_i − w_0 − Σ_{j=1}^M x_{ij} w_j)² subject to Σ_{j=1}^M |w_j| ≤ t. There is a similarity to the ridge regression problem: the L2 ridge penalty is replaced by the L1 lasso penalty. The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
Lasso. If t is chosen to be larger than t₀ = Σ_{j=1}^M |ŵ_j^LS|, then the lasso estimates are identical to the least squares estimates. On the other hand, for, say, t = t₀/2, the least squares coefficients are shrunk by about 50% on average. For convenience we will plot the simulation results versus the shrinkage factor s = t / t₀.
Lasso. Simulation results: [Figure: lasso coefficients w and RSS plotted versus shrinkage factor s, for uncorrelated inputs (left) and strongly correlated inputs (right).]
L1 vs L2 penalties. [Figure: contours of the LS error function, overlaid with the L1 constraint region |β₁| + |β₂| ≤ t and the L2 constraint region β₁² + β₂² ≤ t².] In lasso the constraint region has corners; when the solution hits a corner, the corresponding coefficient becomes 0 (and when M > 2, more than one can).
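The corner effect has a clean closed form in one special case: for an orthonormal design, the penalized-form lasso solution is a soft-thresholding of the least-squares coefficients (a standard result, not derived in these slides), which produces exact zeros, while ridge merely rescales every coefficient. A sketch with illustrative coefficient values of our own:

```python
import numpy as np

def soft_threshold(w, lam):
    # Lasso for orthonormal X (penalized form): shrink toward 0, clip at 0.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w_ls = np.array([3.0, 0.5, -1.5, 0.1])
w_lasso = soft_threshold(w_ls, 1.0)  # small coefficients are zeroed out exactly
w_ridge = w_ls / (1.0 + 1.0)         # ridge for orthonormal X: uniform scaling

print(w_lasso)  # exact zeros where |w_ls| <= lam
print(w_ridge)  # every coefficient shrunk, none exactly zero
```

This is why lasso is used for feature selection and ridge is not: the L1 corners set coefficients exactly to zero, the L2 ball never does.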
Problem: the figure plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.
Select all the appropriate models for each column. [Answer grid with X marks, one per column.]