Linear Regression with Shrinkage


Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.

Linear Regression The linear regression model: $f(X) = w_0 + \sum_{j=1}^{M} X_j w_j$. The $w_j$ are unknown parameters, and the $X_j$ can come from different sources such as: inputs, transformations of inputs (log, square root, etc.), basis expansions.

Linear Regression Typically we have a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ from which to estimate the parameters $w$. Each $x_i = (x_{i1}, \ldots, x_{iM})$ is a vector of measurements for the $i$-th case, or $h(x_i) = \{h_0(x_i), \ldots, h_M(x_i)\}$ is a basis expansion of $x_i$: $y(x) \approx f(x; w) = w_0 + \sum_{j=1}^{M} w_j h_j(x) = w^t h(x)$ (define $h_0(x) = 1$).

Basis Functions There are many basis functions we can use, e.g.: polynomial $h_j(x) = x^j$; radial basis functions $h_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$; sigmoidal $h_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$; splines, Fourier bases, wavelets, etc.
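
As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of two such basis expansions; the function names, the choice of centers, and the width s are assumptions made for the example.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns h_j(x) = x**j for j = 0..degree (h_0(x) = 1 is the offset column)."""
    return np.vander(x, degree + 1, increasing=True)

def rbf_basis(x, centers, s):
    """Gaussian bumps h_j(x) = exp(-(x - mu_j)**2 / (2 s**2)), plus a constant column."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

x = np.linspace(-3, 3, 50)
H_poly = polynomial_basis(x, degree=3)                        # shape (50, 4)
H_rbf = rbf_basis(x, centers=np.linspace(-2, 2, 5), s=1.0)    # shape (50, 6)
```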

Linear Regression Estimation Minimize the residual prediction loss, in terms of mean squared error, on the $n$ training samples (the empirical squared loss): $J_n(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i; w)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^t x_i\right)^2 = \frac{1}{n}(y - Xw)^t(y - Xw)$, where $X$ is the $n \times M$ matrix whose rows are the $n$ samples evaluated on the $M$ basis functions, $w$ is the vector of $M$ weights, and $y$ is the vector of $n$ measurements.

Linear Regression Solution By setting the derivative of $(y - Xw)^t(y - Xw)$ with respect to $w$ to zero, we get the solution (as we did for MSE): $\hat{w} = (X^t X)^{-1} X^t y$. The solution is a linear function of the outputs $y$.
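
To make the closed-form solution concrete, here is a hedged numpy sketch; the synthetic design matrix, true weights, and noise level are invented for the example, and np.linalg.lstsq is used instead of forming the inverse explicitly, for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 50, 3
X = rng.normal(size=(n, M))                  # design matrix, one row per sample
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy linear outputs

# Least squares solution w_hat = (X^t X)^{-1} X^t y, computed stably.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions are a linear function of y: y_hat = X (X^t X)^{-1} X^t y.
y_hat = X @ w_hat
```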

Statistical view of linear regression In a statistical regression model we model both the function and the noise: observed output = function + noise, i.e. $y(x) = f(x; w) + \varepsilon$ where, e.g., $\varepsilon \sim N(0, \sigma^2)$. Whatever we cannot capture with our chosen family of functions will be interpreted as noise.

Statistical view of linear regression $f(x; w)$ is trying to capture the mean of the observations $y$ given the input $x$: $E[y \mid x] = E[f(x; w) + \varepsilon \mid x] = f(x; w)$, where $E[y \mid x]$ is the conditional expectation of $y$ given $x$, evaluated according to the model (not according to the underlying distribution of $X$).

Statistical view of linear regression According to our statistical model $y(x) = f(x; w) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, the outputs $y$ given $x$ are normally distributed with mean $f(x; w)$ and variance $\sigma^2$: $p(y \mid x, w, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - f(x; w)\right)^2\right)$ (we model the uncertainty in the predictions, not just the mean).

Maximum likelihood estimation Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we find the parameters $w$ that maximize the likelihood of the outputs: $L(w, \sigma) = \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y_i - f(x_i; w)\right)^2\right)$. Maximizing the log-likelihood $\log L(w, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - f(x_i; w)\right)^2$ with respect to $w$ amounts to minimizing $\sum_{i=1}^{n}\left(y_i - f(x_i; w)\right)^2$.

Maximum likelihood estimation Thus $w_{MLE} = \arg\min_w \sum_{i=1}^{n}\left(y_i - f(x_i; w)\right)^2$. But the empirical squared loss is $J_n(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i; w)\right)^2$. Least-squares linear regression is MLE for Gaussian noise!

Linear Regression (is it good?) Simple model. Straightforward solution. BUT...

Linear Regression - Example In this example the output is generated by the model (where $\varepsilon$ is small white Gaussian noise): $y = 0.8\,x_1 + 0.1\,x_2 + \varepsilon$. Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.

Linear Regression Example And the results are: [Figure: least squares fits for the three training sets. With $x_1, x_2$ uncorrelated, RSS ≈ 0.09 and $\hat{w} \approx (0.81, 0.09)$; with $x_1, x_2$ correlated, RSS ≈ 0.05 and $\hat{w} \approx (0.43, 0.57)$; with $x_1, x_2$ strongly correlated, the RSS stays small but $\hat{w}_1 \approx 30.4$ with a correspondingly large negative $\hat{w}_2$.] Strong correlation can cause the coefficients to be very large, and they will cancel each other to achieve a good RSS.
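
The instability can be reproduced with a short simulation. This is a sketch under assumptions (sample size, noise level, and the way correlation is injected are choices made for the illustration; the true coefficients follow the example above): as the correlation between $x_1$ and $x_2$ approaches 1, the individual least squares coefficients become large and unstable even though the RSS stays small.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
for rho in [0.0, 0.9, 0.9999]:                     # correlation between the two inputs
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    y = 0.8 * x1 + 0.1 * x2 + 0.05 * rng.normal(size=n)   # model as in the example above
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w_hat) ** 2)
    print(f"rho={rho:6.4f}  w_hat={np.round(w_hat, 3)}  RSS={rss:.4f}")
```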

Linear Regression What can be done?

Shrinkage methods: Ridge Regression, Lasso, PCR (Principal Components Regression), PLS (Partial Least Squares).

Shrinkage methods Before we proceed: since the following methods are not invariant under input scale, we will assume the inputs are standardized (mean 0, variance 1): $x_{ij} \leftarrow \frac{x_{ij} - \bar{x}_j}{\sigma_j}$. The offset $w_0$ is always estimated as $\hat{w}_0 = \left(\sum_i y_i\right)/N$, and we will work with centered $y$ (meaning $y \leftarrow y - \hat{w}_0$).
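
A minimal sketch of this preprocessing step (the function name is an assumption; it simply standardizes each input column and centers the outputs as described above):

```python
import numpy as np

def standardize_and_center(X, y):
    """Standardize each input column (mean 0, variance 1) and center y.
    Returns the transformed data plus the estimated offset w0 = mean(y)."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - x_mean) / x_std
    w0 = y.mean()          # the offset is always estimated as the mean of y
    return Xs, y - w0, w0
```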

Ridge Regression Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay). In ridge regression, we add a quadratic penalty on the weights: $J(w) = \sum_{i=1}^{N}\left(y_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j\right)^2 + \lambda \sum_{j=1}^{M} w_j^2$, where $\lambda \ge 0$ is a tuning parameter that controls the amount of shrinkage. The size constraint prevents the phenomenon of wildly large coefficients that cancel each other from occurring.

Ridge Regression Solution Ridge regression in matrix form: $J(w) = (y - Xw)^t (y - Xw) + \lambda\, w^t w$. The solution is $\hat{w}^{ridge} = (X^t X + \lambda I_M)^{-1} X^t y$, compared with the least squares solution $\hat{w}^{LS} = (X^t X)^{-1} X^t y$. The ridge solution adds a positive constant to the diagonal of $X^t X$ before inversion. This makes the problem non-singular even if $X$ does not have full column rank. For orthonormal inputs the ridge estimates are a scaled version of the least squares estimates: $\hat{w}^{ridge} = \gamma\, \hat{w}^{LS}$, $0 \le \gamma \le 1$.
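
A hedged sketch of the closed-form ridge solution (it assumes X has already been standardized and y centered as above; the function name and the use of np.linalg.solve are choices made for the example):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^t X + lam * I)^{-1} X^t y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# lam = 0 recovers ordinary least squares (when X^t X is invertible);
# lam > 0 keeps the system well-conditioned even for collinear inputs.
```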

Ridge Regression (insights) The matrix $X$ can be represented by its SVD: $X = U D V^T$. $U$ is an $N \times M$ matrix; its columns span the column space of $X$. $D$ is an $M \times M$ diagonal matrix of singular values. $V$ is an $M \times M$ matrix; its columns span the row space of $X$. Let's see how this method looks in the principal-components coordinates of $X$.

Ridge Regression (insights) Rotate to the principal-components coordinates: $X' = XV$, $w' = V^T w$, so that $Xw = (XV)(V^T w) = X'w'$. The least squares solution is given by: $\hat{w}'^{\,ls} = V^T \hat{w}^{ls} = V^T (V D U^T U D V^T)^{-1} V D U^T y = D^{-1} U^T y$. The ridge regression solution is similarly given by: $\hat{w}'^{\,ridge} = V^T \hat{w}^{ridge} = V^T (V D U^T U D V^T + \lambda I)^{-1} V D U^T y = (D^2 + \lambda I)^{-1} D\, U^T y = (D^2 + \lambda I)^{-1} D^2\, \hat{w}'^{\,ls}$, where $(D^2 + \lambda I)^{-1} D^2$ is diagonal.

Ridge Regression (insights) Componentwise, $\hat{w}_j'^{\,ridge} = \frac{d_j^2}{d_j^2 + \lambda}\, \hat{w}_j'^{\,ls}$. In the PCA axes, the ridge coefficients are just scaled LS coefficients! The coefficients that correspond to smaller-input-variance directions are scaled down more.
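
The shrinkage factors $d_j^2 / (d_j^2 + \lambda)$ can be computed directly from the SVD. This is an illustrative sketch (it assumes X has full column rank so that the LS coefficients in the rotated coordinates are well defined):

```python
import numpy as np

def ridge_via_svd(X, y, lam):
    """Ridge coefficients via the SVD X = U D V^t: each LS coefficient in the
    principal-components coordinates is shrunk by d_j^2 / (d_j^2 + lam)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    w_ls_rot = (U.T @ y) / d                 # LS solution in the PCA axes
    shrink = d ** 2 / (d ** 2 + lam)         # per-direction shrinkage factors
    return Vt.T @ (shrink * w_ls_rot)        # rotate back to the original axes
```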

Ridge Regression In the following simulations, quantities are plotted versus the quantity $df(\lambda) = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}$. This monotonically decreasing function of $\lambda$ is the effective degrees of freedom of the ridge regression fit.
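
The effective degrees of freedom are easy to compute from the singular values; a small sketch (the function name is an assumption):

```python
import numpy as np

def effective_df(X, lam):
    """Effective degrees of freedom of the ridge fit: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d ** 2 / (d ** 2 + lam))

# When X has full column rank, df(0) = M (ordinary least squares);
# df(lam) decreases monotonically toward 0 as lam grows.
```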

Ridge Regression Simulation results: 4 dimensions ($w = 1, 2, 3, 4$). [Figure: ridge coefficients $w_j$ and RSS plotted versus the effective dimension $df(\lambda)$, for uncorrelated and for strongly correlated inputs.]

Ridge regression is MAP with a Gaussian prior: $J(w) = -\log P(D \mid w)\,P(w) = -\log \prod_{i=1}^{n} N(y_i \mid w^t x_i, \sigma^2)\, N(w \mid 0, \tau^2 I) = \frac{1}{2\sigma^2}(y - Xw)^t (y - Xw) + \frac{1}{2\tau^2} w^t w + \text{const}$. This is the same objective function that ridge solves, using $\lambda = \sigma^2 / \tau^2$. Ridge: $J(w) = (y - Xw)^t(y - Xw) + \lambda\, w^t w$.

Lasso Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by $\hat{w}^{lasso} = \arg\min_w \sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{M} x_{ij} w_j\right)^2$ subject to $\sum_{j=1}^{M} |w_j| \le t$. There is a similarity to the ridge regression problem: the L2 ridge penalty is replaced by the L1 lasso penalty. The L1 penalty makes the solution non-linear in $y$ and requires a quadratic programming algorithm to compute it.
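
As an illustration (not from the slides), here is a hedged sketch using scikit-learn's Lasso. Note that scikit-learn solves the equivalent penalized (Lagrangian) form, $\frac{1}{2n}\lVert y - Xw\rVert^2 + \alpha \lVert w\rVert_1$, rather than the constrained form above; the synthetic data and the value of $\alpha$ are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
w_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.5 * rng.normal(size=100)

# Penalized form: (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1.
# For a suitable alpha this corresponds to some constraint level t in the slide's formulation.
lasso = Lasso(alpha=0.3)
lasso.fit(X, y)
print(lasso.coef_)   # the L1 penalty typically drives several coefficients exactly to zero
```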

Lasso If $t$ is chosen to be larger than $t_0 = \sum_{j=1}^{M} |\hat{w}_j^{\,ls}|$, then the lasso estimate is identical to the least squares estimate. On the other hand, for say $t = t_0/2$, the least squares coefficients are shrunk by about 50% on average. For convenience we will plot the simulation results versus the shrinkage factor $s = t / t_0 = t / \sum_{j=1}^{M} |\hat{w}_j^{\,ls}|$.

Lasso Simulation results: [Figure: lasso coefficients $w_j$ and RSS plotted versus the shrinkage factor $s$, for uncorrelated and for strongly correlated inputs.]

L1 vs L2 penalties [Figure: contours of the LS error function together with the L1 constraint region $|\beta_1| + |\beta_2| \le t$ and the L2 constraint region $\beta_1^2 + \beta_2^2 \le t^2$.] In lasso the constraint region has corners; when the solution hits a corner the corresponding coefficient becomes 0 (and when $M > 2$, more than one can become 0).

Problem: The figure plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.

Select all the appropriate models for each column. [The candidate plots and the answer grid are not reproduced in this transcription.]