Linear Regression with Shrinkage


Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: predicting vehicle fuel efficiency (mpg) from 8 attributes.

Linear Regression The linear regression model: f(X) = w_0 + Σ_{j=1}^{M} w_j X_j. The w_j are unknown parameters, and the X_j can come from different sources such as: inputs, transformations of inputs (log, square root, etc.), basis expansions.

Linear Regression Typically we have a set of training data (x_1, y_1), ..., (x_n, y_n) from which to estimate the parameters. Each x_i = (x_{i1}, ..., x_{iM}) is a vector of measurements for the i-th case, or h(x_i) = {h_0(x_i), ..., h_M(x_i)} is a basis expansion of x_i. Then y(x) ≈ f(x; w) = Σ_{j=0}^{M} w_j h_j(x) = w^T h(x) (define h_0(x) = 1).

Basis Functions There are many basis functions we can use, e.g.: polynomial h_j(x) = x^j, radial basis functions h_j(x) = exp(-(x - μ_j)^2 / (2 s^2)), sigmoidal h_j(x) = σ((x - μ_j) / s), and splines, Fourier, wavelets, etc.
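As a concrete illustration (not from the original slides), here is a minimal NumPy sketch that builds a basis-expanded design matrix from Gaussian radial basis functions; the centers and width are arbitrary choices of my own:

```python
import numpy as np

def rbf_design_matrix(x, centers, s):
    """Design matrix H with rows h(x_i) = [1, h_1(x_i), ..., h_M(x_i)],
    using Gaussian radial basis functions h_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    ones = np.ones((len(x), 1))                                    # h_0(x) = 1
    rbf = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([ones, rbf])

# Example: 5 RBF centers on [0, 1]. Linear regression on H is linear in the
# weights w, but nonlinear in the original input x.
x = np.linspace(0, 1, 50)
H = rbf_design_matrix(x, centers=np.linspace(0, 1, 5), s=0.2)
```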

Linear Regression Estimation Minimize the residual prediction loss in terms of mean squared error on the n training samples (the empirical squared loss): J_n(w) = (1/n) Σ_{i=1}^{n} (y_i - f(x_i; w))^2 = (1/n) Σ_{i=1}^{n} (y_i - Σ_j w_j x_{ij})^2 = (1/n) (y - Xw)^T (y - Xw), where X is the n×(M+1) matrix whose rows are the (basis-expanded) input vectors, w is the vector of M+1 weights, and y is the vector of the n measured outputs.

Linear Regression Solution By setting the derivative of (y - Xw)^T (y - Xw) with respect to w to zero, we get the solution (as we did for MSE): ŵ = (X^T X)^{-1} X^T y. The solution is a linear function of the outputs y.
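To make the closed-form solution concrete, here is a minimal NumPy sketch on synthetic data (the data and variable names are my own, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n samples, M inputs, plus a constant column for w_0.
n, M = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, M))])
w_true = np.array([1.0, 0.5, -2.0, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal-equations solution w_hat = (X^T X)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem more stably.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)
```

In practice one avoids forming an explicit inverse and uses a linear solver or an orthogonal factorization, which is what lstsq does internally.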

Statistical view of linear regression In a statistical regression model we model both the function and the noise: observed output = function + noise, y(x) = f(x; w) + ε, where, e.g., ε ~ N(0, σ^2). Whatever we cannot capture with our chosen family of functions will be interpreted as noise.

Statistical view of linear regression f(x; w) is trying to capture the mean of the observations y given the input x: E[y | x] = E[f(x; w) + ε | x] = f(x; w), where E[y | x] is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution of X).

Statistical view of linear regression According to our statistical model y(x) = f(x; w) + ε, ε ~ N(0, σ^2), the outputs y given x are normally distributed with mean f(x; w) and variance σ^2: p(y | x, w, σ^2) = (1 / √(2πσ^2)) exp(-(y - f(x; w))^2 / (2σ^2)). (We model the uncertainty in the predictions, not just the mean.)

Maximum likelihood estimation Given observations D = {(x_1, y_1), ..., (x_n, y_n)}, we find the parameters that maximize the likelihood of the outputs: L(w, σ^2) = Π_{i=1}^{n} p(y_i | x_i, w, σ^2) = (2πσ^2)^{-n/2} exp(-(1/(2σ^2)) Σ_{i=1}^{n} (y_i - f(x_i; w))^2). Maximizing the log-likelihood, log L(w, σ^2) = -(n/2) log(2πσ^2) - (1/(2σ^2)) Σ_{i=1}^{n} (y_i - f(x_i; w))^2, is equivalent to minimizing Σ_{i=1}^{n} (y_i - f(x_i; w))^2.

Maximum likelihood estimation Thus ŵ_MLE = arg min_w Σ_{i=1}^{n} (y_i - f(x_i; w))^2. But the empirical squared loss is J_n(w) = (1/n) Σ_{i=1}^{n} (y_i - f(x_i; w))^2, which has the same minimizer. Least-squares linear regression is MLE for Gaussian noise!
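A quick numerical check of this equivalence (a sketch with made-up data and an assumed known noise variance, not from the slides): minimizing the Gaussian negative log-likelihood recovers the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 0.8, 0.1]) + 0.2 * rng.normal(size=n)
sigma2 = 0.04  # assume the noise variance is known

def neg_log_lik(w):
    r = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * (r @ r) / sigma2

w_mle = minimize(neg_log_lik, x0=np.zeros(3)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_mle, w_ls, atol=1e-3))  # True: same minimizer
```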

Linear Regression (is it good?) Simple model. Straightforward solution. BUT...

Linear Regression - Example In this example the output is generated by a linear model in two inputs, y = w_1 x_1 + w_2 x_2 + ε, with w_1 = 0.8 and a small w_2 (here ε is a small white Gaussian noise). Three training sets with different correlations between the two inputs were randomly chosen, and the linear regression solution was applied.

Linear Regression Example And the results are: (Figure: the three fitted models, for x_1, x_2 uncorrelated, correlated, and strongly correlated; the RSS is small in all three cases, but the fitted coefficients grow from roughly their true values in the uncorrelated case to large values of opposite sign, around ±30, in the strongly correlated case.) Strong correlation can cause the coefficients to be very large, and they will cancel each other out to achieve a good RSS.
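The effect is easy to reproduce. Below is a small sketch with my own synthetic setup (the coefficients 0.8 and 0.1, the sample size, and the correlation levels are illustrative choices, not the slide's exact data) that fits least squares on inputs with increasing correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10  # small training set (illustrative choice)

def fit_with_correlation(rho):
    # Draw x1, x2 with correlation rho, generate y, and fit least squares.
    cov = [[1.0, rho], [rho, 1.0]]
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 0.8 * X[:, 0] + 0.1 * X[:, 1] + 0.1 * rng.normal(size=n)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

for rho in [0.0, 0.99, 0.9999]:
    print(rho, fit_with_correlation(rho))
# As rho -> 1 the training RSS stays small, but the two coefficient estimates
# become unstable and tend to grow with opposite signs, cancelling each other.
```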

Linear Regression What can be done?

Shrinkage methods: Ridge Regression, Lasso, PCR (Principal Components Regression), PLS (Partial Least Squares).

Shrinkage methods Before we proceed: since the following methods are not invariant under input scale, we will assume the input is standardized (mean 0, variance 1): x_{ij} ← (x_{ij} - mean(x_j)) / std(x_j). The offset w_0 is always estimated as w_0 = (Σ_i y_i) / N, and we will work with centered y (meaning y ← y - w_0).
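A minimal sketch of this preprocessing step (my own helper, assuming a NumPy design matrix X and target vector y):

```python
import numpy as np

def standardize_and_center(X, y):
    """Standardize each input column (mean 0, variance 1) and center y.

    Returns the transformed data together with the statistics needed to
    undo the transformation; the offset w_0 is just the mean of y.
    """
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - x_mean) / x_std
    w0 = y.mean()          # offset estimated as the mean of y
    yc = y - w0            # centered targets
    return Xs, yc, x_mean, x_std, w0
```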

Ridge Regression Ridge regression shrinks the regression coefficients by imposing a penalty on their size (also called weight decay). In ridge regression, we add a quadratic penalty on the weights: J(w) = Σ_{i=1}^{N} (y_i - w_0 - Σ_{j=1}^{M} x_{ij} w_j)^2 + λ Σ_{j=1}^{M} w_j^2, where λ ≥ 0 is a tuning parameter that controls the amount of shrinkage. The size penalty prevents the phenomenon of wildly large coefficients that cancel each other from occurring.

Ridge Regression Solution Ridge regression in matrix form: J(w) = (y - Xw)^T (y - Xw) + λ w^T w. The solution is ŵ_ridge = (X^T X + λ I_M)^{-1} X^T y (compare ŵ_LS = (X^T X)^{-1} X^T y). The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem non-singular even if X does not have full column rank. For orthonormal inputs the ridge estimates are just a scaled version of the least squares estimates: ŵ_ridge = ŵ_LS / (1 + λ).
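A minimal NumPy sketch of the closed-form ridge solution (assuming standardized inputs and centered targets, as above; the helper name and lam are my own):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^{-1} X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# lam = 0 recovers ordinary least squares (when X^T X is invertible);
# larger lam shrinks the coefficients toward zero.
```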

Ridge Regression (insights) The matrix X can be represented by its SVD: X = U D V^T. U is an N×M matrix whose columns span the column space of X; D is an M×M diagonal matrix of singular values; V is an M×M matrix whose columns span the row space of X. Let's see how this method looks in the principal-components coordinates of X.

Ridge Regression (insights) Rotate to the principal-components coordinates: X' = XV and w' = V^T w, so Xw = X'w'. The least squares solution is then ŵ'_ls = V^T (V D U^T U D V^T)^{-1} V D U^T y = D^{-1} U^T y. The ridge regression solution is similarly ŵ'_ridge = V^T (V D U^T U D V^T + λI)^{-1} V D U^T y = (D^2 + λI)^{-1} D U^T y, which is a diagonal rescaling.

Ridge Regression (insights) Componentwise, ŵ'_ridge,j = (d_j^2 / (d_j^2 + λ)) ŵ'_ls,j. In the PCA axes, the ridge coefficients are just scaled LS coefficients! The coefficients that correspond to smaller input-variance directions are scaled down more.
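This view translates directly into code. A sketch of my own that computes the ridge solution through the SVD and exposes the per-direction shrinkage factors d_j^2 / (d_j^2 + λ):

```python
import numpy as np

def ridge_via_svd(X, y, lam):
    # Thin SVD: X = U diag(d) V^T.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)       # per-direction shrinkage factors
    w_ls_pc = (U.T @ y) / d            # LS coefficients in the PCA axes
    return Vt.T @ (shrink * w_ls_pc)   # rotate back to the original axes

# Agrees with the closed form (X^T X + lam I)^{-1} X^T y; directions with
# small singular values d_j (low input variance) are shrunk the most.
```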

Ridge Regression In the following simulations, quantities are plotted versus the quantity df(λ) = Σ_{j=1}^{M} d_j^2 / (d_j^2 + λ). This monotonically decreasing function of λ is the effective degrees of freedom of the ridge regression fit.
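A small helper of my own that computes df(λ) from the singular values of X:

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom of the ridge fit: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))

# df(0) = M (the number of inputs, if X has full column rank), and
# df(lam) -> 0 as lam -> infinity.
```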

Ridge Regression Simulation results: 4 dimensions (w = 1, 2, 3, 4), with no correlation and with strong correlation between the inputs. (Figure: ridge coefficients and RSS plotted versus the effective degrees of freedom, for the uncorrelated and the strongly correlated case.)

Ridge regression is MAP with a Gaussian prior: J(w) = -log [ P(D | w) P(w) ] = -log [ Π_{i=1}^{n} N(y_i | w^T x_i, σ^2) · N(w | 0, τ^2 I) ] = (1/(2σ^2)) (y - Xw)^T (y - Xw) + (1/(2τ^2)) w^T w + const. This is the same objective function that ridge solves, using λ = σ^2 / τ^2. Ridge: J(w) = (y - Xw)^T (y - Xw) + λ w^T w.
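A quick numerical check of the MAP claim (a sketch with made-up data and assumed values for σ^2 and τ^2, not from the slides): minimizing the negative log-posterior gives exactly the ridge solution with λ = σ^2 / τ^2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.3 * rng.normal(size=50)

sigma2, tau2 = 0.09, 1.0   # assumed noise variance and prior variance

def neg_log_posterior(w):
    r = y - X @ w
    return (r @ r) / (2 * sigma2) + (w @ w) / (2 * tau2)

w_map = minimize(neg_log_posterior, np.zeros(4)).x
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-3))  # True
```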

Lasso Lasso is a shrinkage method like ridge, with a subtle but important difference. It is defined by ŵ_lasso = arg min_w Σ_{i=1}^{N} (y_i - w_0 - Σ_{j=1}^{M} x_{ij} w_j)^2 subject to Σ_{j=1}^{M} |w_j| ≤ t. There is a similarity to the ridge regression problem: the L2 ridge penalty is replaced by the L1 lasso penalty. The L1 penalty makes the solution non-linear in y and requires a quadratic programming algorithm to compute it.
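The slides mention quadratic programming; a common practical alternative is to solve the equivalent Lagrangian form, (1/2)||y - Xw||^2 + λ Σ_j |w_j|, by coordinate descent with soft-thresholding. A minimal sketch of my own implementation (assuming standardized X and centered y; λ plays the role of the Lagrange multiplier of the constraint):

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2) * ||y - Xw||^2 + lam * sum_j |w_j|."""
    M = X.shape[1]
    w = np.zeros(M)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(M):
            r_j = y - X @ w + X[:, j] * w[j]          # residual without feature j
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# For large enough lam, some coefficients are set exactly to zero,
# which is the variable-selection behaviour discussed below.
```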

Lasso If t is chosen to be larger than t_0 = Σ_{j=1}^{M} |ŵ_j^ls|, then the lasso estimates are identical to the least squares estimates. On the other hand, for say t = t_0 / 2, the least squares coefficients are shrunk by about 50% on average. For convenience we will plot the simulation results versus the shrinkage factor s = t / t_0 = t / Σ_{j=1}^{M} |ŵ_j^ls|.

Lasso Simulation results: no correlation and strong correlation between the inputs. (Figure: lasso coefficients and RSS plotted versus the shrinkage factor s, for the uncorrelated and the strongly correlated case.)

L1 vs L2 penalties (Figure: contours of the LS error function together with the L1 constraint region Σ_j |w_j| ≤ t and the L2 constraint region Σ_j w_j^2 ≤ t.) In the lasso the constraint region has corners; when the solution hits a corner the corresponding coefficient becomes 0 (and when M > 2, more than one can become 0).

Problem: The figure plots linear regression results on the basis of only three data points. We used various types of regularization to obtain the plots (see below) but got confused about which plot corresponds to which regularization method. Please assign each plot to one (and only one) of the following regularization methods.

Select all the appropriate models for each column.