Linear regression methods

Most of our intuition about statistical methods stems from linear regression. For observations $i = 1, \dots, n$, the model is

$$Y_i = \sum_{j=1}^{p} X_{ij}\beta_j + \varepsilon_i,$$

where $Y_i$ is the response and $X_i = (X_{i1}, \dots, X_{ip})^T$ is the $p$-vector of predictors. The parameter of interest is $\beta = (\beta_1, \dots, \beta_p)^T$. For all methods, assume that each covariate and the response have been centered and scaled to have mean zero and variance one.

Outline of methods:

1. Large p: subset selection (stepwise, etc.), dimension reduction (PCA, etc.), and penalized regression (lasso, etc.)
2. Large n: meta-analysis, parallelization
3. Streaming: Kalman filter
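
The centering and scaling assumed throughout is a one-liner in numpy; here is a minimal sketch on synthetic data (the data-generating step is purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 10
    X = rng.normal(size=(n, p))                       # raw covariates (synthetic)
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)    # synthetic response

    # Center and scale each covariate and the response: mean zero, variance one.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = (y - y.mean()) / y.std()

The later sketches reuse this standardized X and y.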

Least squares

The ordinary least squares (OLS) estimate of $\beta$ minimizes the residual sum of squares:

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n \Big(Y_i - \sum_{j=1}^p X_{ij}\beta_j\Big)^2.$$

In matrix notation, the solution can be written

$$\hat\beta = (X^TX)^{-1}X^TY.$$

This solution exists only if $X$ has full column rank, i.e., $p \le n$ and no covariate is an exact linear combination of the others. If $p > n$, the least squares solution does not exist (is not unique)! Assuming the mean really is linear in the covariates, this estimate is unbiased. If the errors are iid with mean zero and constant variance, the least squares estimate is the Best Linear Unbiased Estimator (BLUE) by the Gauss-Markov theorem. If we further assume the errors are normal, the sampling distribution of $\hat\beta$ is

$$\hat\beta \sim \mathrm{N}\big(\beta,\; \sigma^2 (X^TX)^{-1}\big).$$
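
As a sketch, the OLS estimate and its estimated sampling covariance in numpy, reusing the standardized X and y from above (np.linalg.lstsq solves the least squares problem stably rather than inverting $X^TX$):

    import numpy as np

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS fit
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (len(y) - X.shape[1])  # unbiased error variance
    cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)       # estimated Var(beta_hat)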

Large p - screening

Large p poses many problems. One approach is to find a good subset of the covariates. There are several traditional search algorithms:

1. All-subsets: fit all $2^p$ candidate models and keep the best under some criterion.
2. Forward selection: start from the empty model and repeatedly add the covariate that most improves the fit.
3. Backward selection: start from the full model and repeatedly remove the covariate that contributes least.
4. Stepwise selection: alternate forward and backward moves, so variables can both enter and leave the model.

A sketch of forward selection appears below.
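
This is a minimal greedy forward-selection sketch (unoptimized; it refits OLS for every candidate covariate at every step):

    import numpy as np

    def forward_selection(X, y, k):
        # Greedily add the covariate that most reduces the residual sum of squares.
        n, p = X.shape
        active = []
        for _ in range(k):
            best_j, best_rss = None, np.inf
            for j in range(p):
                if j in active:
                    continue
                Xs = X[:, active + [j]]
                b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                rss = np.sum((y - Xs @ b) ** 2)
                if rss < best_rss:
                    best_j, best_rss = j, rss
            active.append(best_j)
        return active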

How to pick the best model?

Cross-validation: estimate out-of-sample prediction error by holding out part of the data, fitting on the rest, and averaging the validation error over folds.
AIC: $-2\log\hat L + 2k$, where $k$ is the number of parameters.
BIC: $-2\log\hat L + k\log n$; for moderate to large $n$ this penalizes model size more heavily than AIC.
Cp: Mallows' $C_p = \mathrm{RSS}/\hat\sigma^2 - n + 2k$, an estimate of prediction error for linear models.
Post-selection inference: p-values and intervals computed as if the selected model had been fixed in advance are invalid; valid inference must account for the selection step.

For the Gaussian linear model the information criteria reduce to simple functions of the RSS, as in the sketch below.
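
For the Gaussian model, $-2\log\hat L = n\log(\mathrm{RSS}/n)$ up to an additive constant that does not involve $k$, so a sketch of both criteria is:

    import numpy as np

    def aic_bic(rss, n, k):
        # -2 log-likelihood of the Gaussian linear model, up to a constant.
        neg2loglik = n * np.log(rss / n)
        return neg2loglik + 2 * k, neg2loglik + np.log(n) * k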

Large p - sure independence screening (SIS)

When $p$ is gigantic, even forward selection can be slow. If the covariates were independent, we could simply rank them by their correlation with the response and include the top $q$ variables. This is much faster than traditional search algorithms for massive $p$. It will not find the optimal model when the covariates are correlated, because the importance of a covariate depends on the other covariates included in the model. However, it can be a useful way to screen out the obviously bad predictors. Say you start with $p = 50{,}000$; you might screen down to the $q = 500$ covariates with the highest correlation with the response, and then use some other method to select the best of the remaining $q$.

Formally: with standardized covariates, compute the marginal correlations $\omega_j = |X_j^T Y|/n$ and retain the set $\hat{\mathcal{A}} = \{j : \omega_j \text{ is among the } q \text{ largest}\}$.
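
Because the covariates are standardized, the screening statistic is just the vector of marginal correlations $X^TY/n$; a sketch:

    import numpy as np

    def sis(X, y, q):
        # Keep the q covariates with the largest |marginal correlation| with y.
        omega = np.abs(X.T @ y) / len(y)
        return np.argsort(omega)[::-1][:q]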

Theorem (sure screening property; Fan and Lv, 2008): under regularity conditions on the design and the signal strength, the probability that the retained set contains all truly important covariates tends to one as $n \to \infty$, even when $p$ grows exponentially in $n$.

Large p - penalized regression

Penalized regression is an alternative to screening. If $p < n$, the least squares estimate is unbiased. However, when $p$ approaches $n$, the sampling distribution has huge variance and the estimates are unstable. Penalized regression attempts to stabilize the estimates by shrinking them toward zero. The first such approach is ridge regression (RR):

$$\hat\beta_{RR} = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2.$$

In matrix notation the solution is

$$\hat\beta_{RR} = (X^TX + \lambda I_p)^{-1}X^TY.$$

Unlike least squares, this exists even when $p > n$ for any $\lambda > 0$, because $X^TX + \lambda I_p$ is positive definite. When $X^TX = I_p$,

$$\hat\beta_{RR} = \frac{\hat\beta_{OLS}}{1+\lambda}.$$
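
A direct sketch of the ridge solution (the linear system is solvable for any lam > 0, even when p > n):

    import numpy as np

    def ridge(X, y, lam):
        # Solve (X'X + lam I) beta = X'y; the matrix is positive definite for lam > 0.
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)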

RR is equivalent to a constrained regression problem:

$$\hat\beta_{RR} = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i^T\beta)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le t,$$

with a one-to-one correspondence between the penalty $\lambda$ and the budget $t$.

The ridge plot shows the estimates as a function of $\lambda$:

When $\lambda = 0$: the ridge estimate equals the OLS estimate.
When $\lambda = \infty$: all estimates are shrunk to exactly zero.

How to pick $\lambda$? Typically by cross-validation, as in the sketch below.
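
A sketch using scikit-learn's RidgeCV, which by default selects the penalty by efficient leave-one-out cross-validation (the grid below is an arbitrary illustrative choice):

    import numpy as np
    from sklearn.linear_model import RidgeCV

    lambdas = np.logspace(-3, 3, 50)         # illustrative grid
    fit = RidgeCV(alphas=lambdas).fit(X, y)  # X, y standardized as above
    print(fit.alpha_)                        # selected penalty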

The sampling distribution for a fixed $\lambda$ (under normal errors) is

$$\hat\beta_{RR} = W\hat\beta_{OLS} \sim \mathrm{N}\big(W\beta,\; \sigma^2 W(X^TX)^{-1}W^T\big), \quad W = (X^TX + \lambda I_p)^{-1}X^TX.$$

This can be used to derive CIs and p-values (though it suppresses uncertainty in $\lambda$). In what sense is this better than usual least squares? Mean squared error (MSE) is a reasonable metric for comparison:

$$\mathrm{MSE}(\hat\beta) = E\big[\|\hat\beta - \beta\|^2\big] = \sum_{j=1}^p \mathrm{Var}(\hat\beta_j) + \sum_{j=1}^p \mathrm{Bias}(\hat\beta_j)^2.$$

Say $X^TX = I_p$. Then $\hat\beta_{RR} = \hat\beta_{OLS}/(1+\lambda)$, and

$$\mathrm{MSE}(\hat\beta_{RR}) = \frac{p\sigma^2 + \lambda^2\|\beta\|^2}{(1+\lambda)^2},$$

compared with $\mathrm{MSE}(\hat\beta_{OLS}) = p\sigma^2$. The derivative of the ridge MSE at $\lambda = 0$ is $-2p\sigma^2 < 0$, so there is always some $\lambda > 0$ whose MSE beats OLS.

RR can improve MSE for $\beta$ compared to OLS. This can also improve prediction, because $\hat Y = X\hat\beta$. A drawback is that all variables are included in the model, which is problematic when $p$ is large and only a few covariates matter: the fit is hard to interpret and no selection is performed.

The Least Absolute Shrinkage and Selection Operator (LASSO) is an alternative to RR. It performs shrinkage (like RR) and selection (like forward selection) simultaneously, by modifying the ridge penalty so that some of the estimates are exactly zero. The estimate is

$$\hat\beta_{L} = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|.$$

When $\lambda = 0$: the LASSO estimate equals the OLS estimate.
When $\lambda = \infty$: all estimates are exactly zero.

LASSO is equivalent to a constrained regression problem:

$$\hat\beta_{L} = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i^T\beta)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t.$$

The $\ell_1$ constraint region is a diamond with corners on the coordinate axes, which is why some coordinates of the solution can be exactly zero.

The LASSO solution when $X^TX = I_p$ is soft thresholding of the OLS estimate:

$$\hat\beta_{L,j} = \mathrm{sign}(\hat\beta_{OLS,j})\big(|\hat\beta_{OLS,j}| - \lambda/2\big)_+.$$

In general, the solution does not have a closed form. However, we can use coordinate descent (CD). The optimal value for $\beta_j$ holding all other coefficients fixed is

$$\hat\beta_j = \frac{\mathrm{sign}(z_j)\big(|z_j| - \lambda/2\big)_+}{\sum_{i=1}^n X_{ij}^2}, \quad z_j = \sum_{i=1}^n X_{ij}\Big(Y_i - \sum_{k\neq j} X_{ik}\beta_k\Big),$$

i.e., soft thresholding of the univariate least squares coefficient on the partial residual. A sketch appears below.
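
This is a minimal sketch of the coordinate descent update above (plain cyclic sweeps with a fixed iteration count; production code would check convergence):

    import numpy as np

    def lasso_cd(X, y, lam, n_sweeps=100):
        # Minimize sum (y - X beta)^2 + lam * sum |beta_j| by cyclic coordinate descent.
        n, p = X.shape
        beta = np.zeros(p)
        col_ss = (X ** 2).sum(axis=0)
        for _ in range(n_sweeps):
            for j in range(p):
                r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
                z = X[:, j] @ r
                # Soft-threshold the univariate least squares solution.
                beta[j] = np.sign(z) * max(abs(z) - lam / 2, 0.0) / col_ss[j]
        return beta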

Least Angle Regression (LAR) provides the LASSO solution for a whole sequence of $\lambda$ values at once. This is called the solution path: $\{\hat\beta(\lambda) : \lambda \ge 0\}$.

This also provides a nice connection between the LASSO and forward selection.

Forward selection: adds the covariate most correlated with the residual and fits its coefficient fully by least squares before adding the next.
Forward stagewise regression: takes many tiny steps, each time nudging the coefficient of the covariate most correlated with the current residual.

The LAR solution path is piecewise linear in $\lambda$, with breakpoints where covariates enter the active set.

FACT: with one small modification (a covariate is dropped from the active set when its coefficient crosses zero), this is equivalent to the LASSO solution path!

How to select $\lambda$? Usually by cross-validation over a grid, as in the sketch below. How to perform statistical inference? This is harder: naive standard errors ignore both the shrinkage and the selection, and valid post-selection inference for the LASSO is an active research area.
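
A sketch with scikit-learn's LassoCV (note that scikit-learn minimizes $\frac{1}{2n}\|y - X\beta\|^2 + \alpha\|\beta\|_1$, so its alpha is a rescaling of the $\lambda$ used in these notes):

    from sklearn.linear_model import LassoCV

    fit = LassoCV(cv=5).fit(X, y)   # 5-fold CV over an automatic alpha grid
    print(fit.alpha_)               # selected penalty
    print(fit.coef_)                # sparse coefficient estimate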

There are many extensions to the LASSO. One of the most useful is the adaptive LASSO. The LASSO penalizes all coefficients equally; this overshrinks the really important ones and causes bias. The adaptive LASSO allows each coefficient its own shrinkage weight:

$$\hat\beta_{AL} = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \lambda\sum_{j=1}^p w_j|\beta_j|.$$

The weights are chosen adaptively (using the data): $w_j = 1/|\tilde\beta_j|^\gamma$ for some $\gamma > 0$, where $\tilde\beta$ is an initial estimate such as OLS or ridge. Covariates with large initial estimates get small weights and are barely shrunk; covariates with small initial estimates are pushed to zero.
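
The adaptive LASSO can be computed with ordinary LASSO software by rescaling each column by its weight and unscaling the fitted coefficients afterward. A sketch (the ridge penalty 0.1 for the initial estimate and the small constant guarding against division by zero are arbitrary illustrative choices):

    import numpy as np
    from sklearn.linear_model import Lasso

    def adaptive_lasso(X, y, lam, gamma=1.0):
        p = X.shape[1]
        # Initial estimate: ridge, so this is well-defined even when p > n.
        beta0 = np.linalg.solve(X.T @ X + 0.1 * np.eye(p), X.T @ y)
        w = 1.0 / (np.abs(beta0) ** gamma + 1e-8)   # adaptive weights
        # LASSO on rescaled columns solves the weighted-penalty problem.
        theta = Lasso(alpha=lam).fit(X / w, y).coef_
        return theta / w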

The adaptive LASSO possesses the oracle property: with suitably chosen $\lambda$ and weights, as $n \to \infty$ it selects exactly the true set of nonzero coefficients with probability tending to one, and its estimates of the nonzero coefficients have the same asymptotic distribution as OLS applied to the true submodel (Zou, 2006).

Proof:

There are many other LASSO extensions for various settings:

Elastic net for correlated covariates: adds a ridge term, penalty $\lambda_1\sum_j|\beta_j| + \lambda_2\sum_j\beta_j^2$, so correlated covariates tend to enter the model together (see the sketch below).
OSCAR for clustered covariates: a pairwise $\ell_\infty$ penalty encourages highly correlated covariates to share exactly the same coefficient.
Fused LASSO for covariates in sequence: penalizes $\sum_j|\beta_j - \beta_{j-1}|$ so neighboring coefficients are similar.
Grouped LASSO for groups: penalizes $\sum_g \|\beta_g\|_2$ so predefined groups of covariates enter or leave the model together.
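
A sketch of the elastic net with scikit-learn, where l1_ratio mixes the two penalties (1.0 is the lasso, 0.0 is ridge); the values below are illustrative:

    from sklearn.linear_model import ElasticNet

    fit = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(fit.coef_)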

Large p - other forms of dimension reduction

When variables are highly correlated it is hard to identify their individual effects; most subset selection methods will pick one member of a highly correlated group of predictors and discard the rest. Principal components analysis (PCA) explores the covariance structure of the covariates.

The first PC is the linear combination $Z_1 = Xa_1$ with $a_1 = \arg\max_{\|a\|=1}\mathrm{Var}(Xa)$.
The $j$th PC maximizes the same variance subject to being orthogonal to the first $j-1$ PCs.
The spectral decomposition of the covariance matrix is $\Sigma = \Gamma\Lambda\Gamma^T = \sum_{k=1}^p \lambda_k\gamma_k\gamma_k^T$, with eigenvalues $\lambda_1 \ge \dots \ge \lambda_p \ge 0$ and orthonormal eigenvectors $\gamma_k$.
The proportion of variance explained by the first $j$ eigenvectors is $\sum_{k=1}^j \lambda_k \big/ \sum_{k=1}^p \lambda_k$.
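
The PCs can be computed directly from the spectral decomposition of the sample covariance; a sketch:

    import numpy as np

    S = np.cov(X, rowvar=False)                       # sample covariance of covariates
    eigvals, eigvecs = np.linalg.eigh(S)              # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()    # variance explained by first j PCs
    Z = X @ eigvecs                                   # principal component scores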

Instead of reducing the dimension of the model by selecting a subset of the predictors, principal components regression (PCR) reduces the dimension by extracting a few linear combinations of the covariates that explain most of the variability in the covariates. PCR regresses $Y$ on the first $k$ principal components $Z_1, \dots, Z_k$ rather than on the original $p$ covariates.

Advantages: the components are orthogonal, and $k \ll p$ gives a stable, low-variance fit.
Disadvantages: the components are built without looking at the response, every original covariate contributes to each component, and the coefficients are harder to interpret.
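
A sketch of PCR as a scikit-learn pipeline (the number of components, 3 here, is an illustrative choice that would be made by cross-validation in practice):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y)
    y_hat = pcr.predict(X)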

A potential drawback of PCA is that the response is not used when computing the eigenvectors; there is no guarantee that the best combination of the covariates for predicting the response appears in the first few PCs. Partial least squares (PLS) uses both the covariance of the predictors and their correlation with the response to construct derived covariates.

The first PLS vector is $z_1 = \sum_{j=1}^p \hat\varphi_{1j}X_j$ with weights $\hat\varphi_{1j} = \langle X_j, Y\rangle$, so each covariate is weighted by the strength of its univariate association with the response.
The $j$th PLS vector is constructed the same way after orthogonalizing the covariates with respect to $z_1, \dots, z_{j-1}$.

These are then used in regression just like the PCs.
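
A sketch of PLS with scikit-learn (again the number of components is an illustrative choice); comparing its fit with the PCR sketch above shows the effect of using the response when constructing the components:

    from sklearn.cross_decomposition import PLSRegression

    pls = PLSRegression(n_components=3).fit(X, y)
    y_hat = pls.predict(X)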
