Regression, Ridge Regression, Lasso

1 Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018

2 A general definition Regression studies the relationship between a response variable $Y$ and covariates $X_1, \dots, X_n$. A covariate is also called a feature or a predictor. The term regression is usually employed when $Y$ is a continuous variable; that is, when variables are quantitative rather than qualitative.

3 Basic model and basic questions Model: $Y = f(X) + \epsilon$, where $X$ are the covariates and $\epsilon$ is some noise. Questions: Is there a relationship? A linear relationship? How strong? Are all covariates useful? How can we predict $Y$? Answering such questions is usually referred to as statistical inference.

4 Linear regression Linear regression adopts a linear relationship: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$.

5 Linear regression Linear regression adopts a linear relationship: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$. Textbook, Figures 3.1 and 3.4 (pages 62 and 73).

6 The least-squares solution Suppose we have observations $x_{1,j}, x_{2,j}, \dots, x_{n,j}, y_j$ for $j \in \{1, \dots, N\}$.

7 The least-squares solution Suppose we have observations $x_{1,j}, x_{2,j}, \dots, x_{n,j}, y_j$ for $j \in \{1, \dots, N\}$. Consider the residual sum of squares
$$\mathrm{RSS} = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j} - \cdots - \beta_n x_{n,j})^2.$$

8 The least-squares solution Suppose we have observations $x_{1,j}, x_{2,j}, \dots, x_{n,j}, y_j$ for $j \in \{1, \dots, N\}$. Consider the residual sum of squares
$$\mathrm{RSS} = \sum_{j=1}^{N} (y_j - \beta_0 - \beta_1 x_{1,j} - \cdots - \beta_n x_{n,j})^2.$$
We might choose the $\hat\beta_i$ to minimize RSS.

9 The solution for a single covariate Suppose we have $Y = \beta_0 + \beta_1 X_1$.

10 The solution for a single covariate Suppose we have $Y = \beta_0 + \beta_1 X_1$. To minimize RSS, we must set:
$$\hat\beta_0 = \sum_j y_j / N - \hat\beta_1 \sum_j x_{1,j} / N, \qquad \hat\beta_1 = \frac{\sum_j \left(x_{1,j} - \sum_k x_{1,k}/N\right)\left(y_j - \sum_k y_k/N\right)}{\sum_j \left(x_{1,j} - \sum_k x_{1,k}/N\right)^2}.$$

11 Often used notation Adopt
$$\bar x_1 = \frac{\sum_{j=1}^N x_{1,j}}{N} \quad \text{and} \quad \bar y = \frac{\sum_{j=1}^N y_j}{N}.$$
Then:
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x_1, \qquad \hat\beta_1 = \frac{\sum_j (x_{1,j} - \bar x_1)(y_j - \bar y)}{\sum_j (x_{1,j} - \bar x_1)^2}.$$
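To make the closed-form solution concrete, here is a minimal NumPy sketch; the synthetic data and all variable names are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2 + 3*x + noise
N = 100
x = rng.uniform(0, 10, N)
y = 2.0 + 3.0 * x + rng.normal(0, 1, N)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least-squares estimates for a single covariate
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # should be close to 2 and 3
```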

12 Deriving the expressions Differentiate $\mathrm{RSS} = \sum_{j=1}^N (y_j - \beta_0 - \beta_1 x_{1,j})^2$:
$$\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2 \sum_j (y_j - \beta_0 - \beta_1 x_{1,j}), \qquad \frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2 \sum_j x_{1,j} (y_j - \beta_0 - \beta_1 x_{1,j}).$$
Solve:
$$\bar y - \beta_0 - \beta_1 \bar x_1 = 0, \qquad \overline{x_1 y} - \beta_0 \bar x_1 - \beta_1 \overline{x_1^2} = 0.$$
Get:
$$\beta_0 = \bar y - \beta_1 \bar x_1, \qquad \beta_1 = \frac{\overline{x_1 y} - \bar x_1 \bar y}{\overline{x_1^2} - \bar x_1^2}.$$

13 A probabilistic version Suppose we assume $Y = \beta_0 + \beta_1 X_1 + Z$, where $Z \sim N(0, \sigma^2)$ is a probabilistic disturbance.

14 A probabilistic version Suppose we assume $Y = \beta_0 + \beta_1 X_1 + Z$, where $Z \sim N(0, \sigma^2)$ is a probabilistic disturbance. Then the previous $\hat\beta_0$ and $\hat\beta_1$ are the maximum likelihood estimates of $\beta_0$ and $\beta_1$; moreover,
$$\hat\sigma^2 = \frac{1}{N} \sum_j (y_j - \hat y_j)^2$$
maximizes the likelihood, where $\hat y_j = \hat\beta_0 + \hat\beta_1 x_{1,j}$.

15 Maximum likelihood versus unbiasedness Funny: the maximum likelihood estimator for $\sigma^2$ is not unbiased; often the following unbiased estimator is used instead:
$$\hat\sigma^2 = \frac{1}{N-2} \sum_j (y_j - \hat y_j)^2.$$
(The square root of this quantity is often called the residual standard error, RSE.)

16 Some properties The maximum likelihood estimators of $\beta_0$ and $\beta_1$ are consistent and asymptotically normal. There are closed-form expressions for the variances of these estimators and for confidence intervals (details, not covered in this course, in the Textbook, page 66).

17 Checking whether there is a relationship There is a standard (asymptotic) test for $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$. Reject $H_0$ when
$$\frac{|\hat\beta_1|}{\sqrt{\hat\sigma^2 / \sum_j (x_{1,j} - \bar x_1)^2}} > z_{\alpha/2},$$
where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal and $\hat\sigma^2$ is (usually) the unbiased estimate of $\sigma^2$.

18 And the p-value The test is of the form $T > c$, for a statistic $T$ that depends on the data and some constant $c$. So, the p-value is the probability that $T$ is larger than the observed value when $H_0$ is true (that is, when $\beta_1 = 0$). A small p-value is evidence against $H_0$.
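A sketch of this test and its p-value, under the asymptotic normal approximation described above (synthetic data; all names are ours):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(0, 10, N)
y = 2.0 + 3.0 * x + rng.normal(0, 1, N)

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Unbiased estimate of the noise variance (divide by N - 2)
resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (N - 2)

# Test statistic for H0: beta1 = 0
T = np.abs(beta1_hat) / np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))

# Two-sided p-value: probability of a larger T when H0 holds
p_value = 2 * norm.sf(T)
print(T, p_value)  # huge T, tiny p-value: strong evidence against H0
```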

19 Usual measure of fit The $R^2$ statistic
$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^N (y_j - \bar y)^2}$$
(the proportion of the variability of $Y$ that can be explained by the covariates; between 0 and 1, and higher is better).

20 Usual measure of fit The $R^2$ statistic
$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^N (y_j - \bar y)^2}$$
(the proportion of the variability of $Y$ that can be explained by the covariates; between 0 and 1, and higher is better). For one covariate, $R^2$ is equal to the square of the correlation between $X_1$ and $Y$:

21 Usual measure of fit The $R^2$ statistic
$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^N (y_j - \bar y)^2}$$
(the proportion of the variability of $Y$ that can be explained by the covariates; between 0 and 1, and higher is better). For one covariate, $R^2$ is equal to the square of the correlation between $X_1$ and $Y$:
$$\left( \frac{\sum_j (x_{1,j} - \bar x_1)(y_j - \bar y)}{\sqrt{\sum_j (x_{1,j} - \bar x_1)^2} \sqrt{\sum_j (y_j - \bar y)^2}} \right)^2.$$
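A quick numerical check of this identity, on synthetic data of our choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.uniform(0, 10, N)
y = 2.0 + 3.0 * x + rng.normal(0, 2, N)

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y_bar) ** 2)
r2 = 1 - rss / tss

# For a single covariate, R^2 equals the squared sample correlation
corr = np.corrcoef(x, y)[0, 1]
print(r2, corr ** 2)  # the two values agree up to rounding
```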

22 Many covariates Now suppose $Y = \beta_1 X_1 + \cdots + \beta_n X_n + Z$, where $E[Z] = 0$ (we can take $X_1$ identically equal to one, so that $\beta_1$ plays the role of the intercept).

23 Many covariates Now suppose $Y = \beta_1 X_1 + \cdots + \beta_n X_n + Z$, where $E[Z] = 0$ (we can take $X_1$ identically equal to one, so that $\beta_1$ plays the role of the intercept). The RSS is then $(C - AB)^T (C - AB)$, where
$$A = \begin{pmatrix} x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\ \vdots & \vdots & & \vdots \\ x_{1,N} & x_{2,N} & \cdots & x_{n,N} \end{pmatrix}, \quad B = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_n \end{pmatrix}, \quad C = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}.$$

24 Regression for many covariates For $Y = \beta_1 X_1 + \cdots + \beta_n X_n + Z$, we obtain
$$\hat B = (A^T A)^{-1} A^T C.$$
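A minimal sketch of the matrix solution (synthetic design matrix; in practice `np.linalg.lstsq` is preferred to explicitly inverting $A^T A$):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 200, 3

# Design matrix A with a column of ones as X_1 (the intercept trick above)
A = np.column_stack([np.ones(N), rng.normal(size=(N, n - 1))])
B_true = np.array([1.0, 2.0, -3.0])
C = A @ B_true + rng.normal(0, 0.5, N)

# Normal equations: solve (A^T A) B = A^T C rather than forming the inverse
B_hat = np.linalg.solve(A.T @ A, A.T @ C)

# lstsq gives the same answer and is numerically more robust
B_lstsq, *_ = np.linalg.lstsq(A, C, rcond=None)
print(B_hat, B_lstsq)  # both close to B_true
```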

25 Some properties This estimator is consistent. If $Z$ has variance $\sigma^2$, then $\hat B$ has covariance matrix $\sigma^2 (A^T A)^{-1}$. The estimator $\hat B$ is approximately Gaussian for large $N$.

26 Checking whether there is a relationship There is a standard (asymptotic) test for $H_0: \beta_1 = \cdots = \beta_n = 0$ versus the alternative $H_1$.

27 Checking whether there is a relationship There is a standard (asymptotic) test for $H_0: \beta_1 = \cdots = \beta_n = 0$ versus the alternative $H_1$. The test: reject $H_0$ when the $F$-statistic is large.

28 Checking whether there is a relationship There is a standard (asymptotic) test for $H_0: \beta_1 = \cdots = \beta_n = 0$ versus the alternative $H_1$. The test: reject $H_0$ when the $F$-statistic is large. The $F$-statistic:
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/n}{\mathrm{RSS}/(N - n - 1)},$$
where TSS is the total sum of squares: $\mathrm{TSS} = \sum_{j=1}^N (y_j - \bar y)^2$.

29 Checking whether there is a relationship There is a standard (asymptotic) test for $H_0: \beta_1 = \cdots = \beta_n = 0$ versus the alternative $H_1$. The test: reject $H_0$ when the $F$-statistic is large. The $F$-statistic:
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/n}{\mathrm{RSS}/(N - n - 1)},$$
where TSS is the total sum of squares: $\mathrm{TSS} = \sum_{j=1}^N (y_j - \bar y)^2$. Under $H_0$, the $F$-statistic has an $F$-distribution...
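A sketch of this global F-test on synthetic data (all names and data are illustrative):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
N, n = 200, 3
X = rng.normal(size=(N, n))
y = 1.0 + X @ np.array([2.0, 0.0, -1.5]) + rng.normal(0, 1, N)

# Fit the full model (intercept plus n covariates) by least squares
A = np.column_stack([np.ones(N), X])
B_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

rss = np.sum((y - A @ B_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for H0: all slope coefficients are zero
F = ((tss - rss) / n) / (rss / (N - n - 1))
p_value = f_dist.sf(F, n, N - n - 1)
print(F, p_value)  # large F, tiny p-value: reject H0
```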

30 Checking a subset of parameters There are also tests to verify whether subsets of the parameters are equal to zero. Take $H_0: \beta_{n-m+1} = \beta_{n-m+2} = \cdots = \beta_n = 0$.

31 Checking a subset of parameters There are also tests to verify whether subsets of the parameters are equal to zero. Take $H_0: \beta_{n-m+1} = \beta_{n-m+2} = \cdots = \beta_n = 0$. Reject $H_0$ when the following $F$-statistic is large:
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/m}{\mathrm{RSS}/(N - n - 1)},$$
where $\mathrm{RSS}_0$ is the residual sum of squares for the model containing only $\beta_1, \dots, \beta_{n-m}$.

32 Checking a subset of parameters There are also tests to verify whether subsets of the parameters are equal to zero. Take $H_0: \beta_{n-m+1} = \beta_{n-m+2} = \cdots = \beta_n = 0$. Reject $H_0$ when the following $F$-statistic is large:
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/m}{\mathrm{RSS}/(N - n - 1)},$$
where $\mathrm{RSS}_0$ is the residual sum of squares for the model containing only $\beta_1, \dots, \beta_{n-m}$. It is possible to check one parameter at a time ($m = 1$).

33 Usual measure of fit The $R^2$ statistic
$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^N (y_j - \bar y)^2}$$
(the proportion of the variability of $Y$ that can be explained by the covariates; between 0 and 1, and higher is better).

34 Usual measure of fit The $R^2$ statistic
$$R^2 = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^N (y_j - \bar y)^2}$$
(the proportion of the variability of $Y$ that can be explained by the covariates; between 0 and 1, and higher is better). But: the more covariates, the higher the $R^2$ statistic.

35 Feature selection Usually we must discard some covariates. Too many covariates lead to small bias but large variance, while too few covariates lead to small variance but large bias (the bias-variance trade-off). In particular, a covariate should not be a linear function of the other covariates... We must score each set of covariates to choose the best one. A possible score is the empirical error (estimated by cross-validation).

36 The Akaike Information Criterion One popular score is the AIC:
$$L_S - |S|,$$
where $S$ is the set of covariates in the scored model, and $L_S$ is the log-likelihood of the model with covariates in $S$, evaluated at the maximum likelihood estimates. That is: goodness of fit minus model complexity.

37 Another score The BIC (Bayesian Information Criterion):
$$L_S - (|S|/2) \log N.$$
For large $N$, the posterior probability of the scored model is proportional to $e^{\mathrm{BIC}}$, when all possible models get identical prior probability.

38 Yet another score Mallow's $C_p$:
$$2 |S| \hat\sigma^2 + \sum_{j=1}^N (\hat y_j^S - y_j)^2,$$
where $\hat\sigma^2$ is the estimate of the variance with all covariates, while the $\hat y_j^S$ are produced with the covariates in $S$. This score estimates the test error that is expected from a model with covariates in $S$.
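The sketch below scores every covariate subset with the AIC and BIC forms given above, for a Gaussian linear model. Parameter-counting conventions vary; here $|S|$ is taken as the number of fitted regression coefficients, including the intercept, which is an assumption of ours.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
N, n = 200, 4
X = rng.normal(size=(N, n))
y = X @ np.array([2.0, 0.0, -1.5, 0.0]) + rng.normal(0, 1, N)

def gaussian_scores(X_S, y):
    """AIC and BIC (as defined in the slides) for a Gaussian linear model."""
    N = len(y)
    A = np.column_stack([np.ones(N), X_S])
    B, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ B) ** 2)
    sigma2 = rss / N  # maximum likelihood variance estimate
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    k = A.shape[1]  # coefficients counted, including the intercept
    return loglik - k, loglik - 0.5 * k * np.log(N)  # (AIC, BIC)

# Score every nonempty subset and report the best one by AIC
subsets = (S for r in range(1, n + 1) for S in combinations(range(n), r))
best = max(subsets, key=lambda S: gaussian_scores(X[:, list(S)], y)[0])
print("best subset by AIC:", best)  # typically the informative covariates 0 and 2
```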

39 Structure search Forward stepwise regression: start with no covariates, add the one that leads to the largest score, then add the next one that leads to the best score, etc. Backward stepwise regression: start with all covariates, drop the one that leads to the largest score, etc.

40 Structure search Forward stepwise regression: start with no covariates, add the one that leads to the largest score, then add the next one that leads to the best score, etc. Backward stepwise regression: start with all covariates, drop the one that leads to the largest score, etc. There are many other search schemes; additional details, not covered in this course, in the Textbook, Section 6.1. A sketch of forward stepwise search follows.
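For simplicity the sketch greedily minimizes RSS for a fixed subset size; in practice a penalized score such as AIC would also decide when to stop. All names are ours.

```python
import numpy as np

def rss_of(X_S, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_S])
    B, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ B) ** 2)

def forward_stepwise(X, y, max_size):
    """Greedily add the covariate that most reduces RSS at each step."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_size:
        best_j = min(remaining, key=lambda j: rss_of(X[:, selected + [j]], y))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(0, 1, 200)
print(forward_stepwise(X, y, max_size=3))  # informative covariates come first
```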

41 Qualitative features in linear regression Use some encoding. If the values are ordered: turn them into numbers. If it is not appropriate to do so: use one-hot encoding.
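For example, with pandas (the data frame below is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # ordered: map to numbers
    "color": ["red", "blue", "green", "red"],        # unordered: one-hot encode
})

# Ordered categories become integers that preserve the ordering
df["size_num"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Unordered categories become one indicator column per value
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```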

42 Other challenges in linear regression The relationship may not be linear: detect this with tests and residual plots. Disturbances may be correlated, or may vary with the features. Outliers and high-leverage points (these must be tested for and possibly discarded). Collinearity (leads to numerical problems and higher variance). These issues, not covered in this course, are discussed in the Textbook, Section

43 Introducing non-linear features Suppose we have $X_1$ and $X_2$. We can assume:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2,$$
or any other set of functions of $X_1$ and $X_2$.

44 Introducing non-linear features Suppose we have $X_1$ and $X_2$. We can assume:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2,$$
or any other set of functions of $X_1$ and $X_2$. If the functions are polynomials of the covariates, we have polynomial regression. If the functions are all produced by a set of transformations, they are referred to as basis functions.

45 Introducing non-linear features Suppose we have $X_1$ and $X_2$. We can assume:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2,$$
or any other set of functions of $X_1$ and $X_2$. If the functions are polynomials of the covariates, we have polynomial regression. If the functions are all produced by a set of transformations, they are referred to as basis functions. Other functions lead to Generalized Additive Models (not discussed in this course; see Textbook, Section 7.7).
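A sketch showing that such models are still fit by ordinary least squares, since they remain linear in the coefficients (data and coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200
X1 = rng.uniform(-1, 1, N)
X2 = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * X1 - X2 + 4.0 * X1 * X2 + rng.normal(0, 0.1, N)

# Columns of the design matrix are basis functions of X1 and X2:
# here 1, X1, X2 and the interaction X1*X2, matching the model above
A = np.column_stack([np.ones(N), X1, X2, X1 * X2])

# Ordinary least squares still applies: the model is linear in the betas
B_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(B_hat)  # approximately [1, 2, -1, 4]
```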

46 Nonparametric regression We might assume that $Y$ depends on splines ($Y$ is a piecewise but smooth function of the covariates). Or we might assume that $Y$ is a function of neighboring points (kNN regression): (1) to obtain the $\hat y$ corresponding to given covariates, weigh each point in the neighborhood; (2) then run a weighted regression with those points only.

47 Nonparametric regression We might assume that $Y$ depends on splines ($Y$ is a piecewise but smooth function of the covariates). Or we might assume that $Y$ is a function of neighboring points (kNN regression): (1) to obtain the $\hat y$ corresponding to given covariates, weigh each point in the neighborhood; (2) then run a weighted regression with those points only. These topics are not covered in this course; they are discussed in the Textbook, Sections 7.5 and 7.6, and Section 3.5.
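A minimal kNN-regression sketch with uniform weights, a simplified variant of the two-step scheme above (all names and data are ours):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """Predict y at x_query by averaging the k nearest training responses.

    Uniform weights; distance-based weights followed by a local weighted
    regression would refine this, as in the scheme described above.
    """
    dist = np.abs(x_train - x_query)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(8)
x = rng.uniform(0, 2 * np.pi, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)
print(knn_predict(x, y, x_query=np.pi / 2))  # close to sin(pi/2) = 1
```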

48 Shrinkage methods: Regularization The idea is to penalize the size of the parameters, to shrink them, and thereby to reduce variance (at the cost of an increase in bias...). Useful to reduce overfitting, particularly when there are too many covariates. Two main strategies: ridge regression, and the lasso (least absolute shrinkage and selection operator).

49 Ridge regression Suppose we minimize
$$\sum_{j=1}^N \left( y_j - \sum_{i=1}^n \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^n \beta_i^2.$$
The larger the parameter $\lambda$, the smaller the values of the $\hat\beta_i$. Tuning $\lambda$: usually by cross-validation. The solution is easy (but biased!):
$$\hat B = (A^T A + \lambda I)^{-1} A^T C.$$
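A sketch of the closed-form ridge solution (synthetic data; we assume the columns of $A$ are already standardized, as discussed next):

```python
import numpy as np

rng = np.random.default_rng(9)
N, n = 100, 10
A = rng.normal(size=(N, n))          # pretend these columns are standardized
C = A @ rng.normal(size=n) + rng.normal(0, 1, N)

def ridge(A, C, lam):
    """Closed-form ridge estimate (A^T A + lambda I)^{-1} A^T C."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ C)

# Larger lambda shrinks the coefficients toward zero
for lam in [0.0, 1.0, 100.0]:
    print(lam, np.round(ridge(A, C, lam)[:3], 3))
```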

50 Standardization In ridge regression, the measurement scale of the covariates is important... In plain linear regression, multiplying a covariate by a constant simply multiplies its estimate by the inverse constant; ridge estimates do not behave so simply. Usual assumption: data are standardized (mean 0, variance 1):
$$\tilde x_{i,j} = \frac{x_{i,j}}{\sqrt{(1/N) \sum_{j=1}^N (x_{i,j} - \bar x_i)^2}}.$$
Usually $Y$ is centered (its mean is subtracted).

51 The bias-variance trade-off Ridge regression increases bias and decreases variance (compared to plain linear regression). So it is useful when the variance is large; for instance, when the number of covariates is very large. Textbook, Figure 6.5.

52 The Bayesian interpretation Suppose we have a Gaussian likelihood and a prior proportional to $e^{-\lambda \sum_i \beta_i^2}$. Then maximizing the posterior (optimal for 0-1 loss) leads to minimization of
$$\sum_{j=1}^N \left( y_j - \sum_{i=1}^n \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^n \beta_i^2.$$

53 Lasso I Suppose we minimize
$$\sum_{j=1}^N \left( y_j - \sum_{i=1}^n \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^n |\beta_i|.$$
The larger the parameter $\lambda$, the smaller the values of the $\hat\beta_i$. Tuning $\lambda$: usually by cross-validation.

54 Lasso II Minimize
$$\sum_{j=1}^N \left( y_j - \sum_{i=1}^n \beta_i x_{i,j} \right)^2 + \lambda \sum_{i=1}^n |\beta_i|.$$
Usual assumption: the $X_i$ are standardized (mean 0, variance 1), and $Y$ is centered. Amazing fact: if $\lambda$ is large enough, many $\beta_i$ are set exactly to zero, so covariate selection is done automatically! The result is a sparse solution.

55 Working with the lasso There is no closed-form solution, but convex optimization handles it. Gains in accuracy have been observed, particularly when too many covariates are present. Too many covariates can overfit any input data... With too many covariates, many may have the same predictive power, and many may be highly correlated and useless. Thus the automatic selection of covariates is a big plus.
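In practice one calls a library solver. A sketch with scikit-learn's `Lasso`, which solves the penalized problem by coordinate descent; its `alpha` plays the role of $\lambda$, up to a $1/(2N)$ scaling of the RSS term in scikit-learn's objective.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
N, n = 100, 20
X = rng.normal(size=(N, n))
beta = np.zeros(n)
beta[:3] = [4.0, -3.0, 2.0]          # only three covariates matter
y = X @ beta + rng.normal(0, 1, N)

# A sufficiently large alpha drives most coefficients exactly to zero
model = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
```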

56 Another perspective, with some intuition It is possible to show that both ridge regression and the lasso minimize RSS subject to a constraint. Ridge regression: subject to $\sum_{i=1}^n \beta_i^2 \leq \lambda$. Lasso: subject to $\sum_{i=1}^n |\beta_i| \leq \lambda$. Textbook: Figures 6.7 and 6.10.
