Linear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman


Byron Baldwin

1 Linear Regression Models Based on Chapter 3 of Hastie, Tibshirani and Friedman

2 Linear Regression Models

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$

Here the X's might be:
- Raw predictor variables (continuous or coded-categorical)
- Transformed predictors ($X_4 = \log X_3$)
- Basis expansions ($X_4 = X_3^2$, $X_5 = X_3^3$, etc.)
- Interactions ($X_4 = X_2 X_3$)

Popular choice for estimation is least squares:

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2$$

4 Least Squares

$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$$

$$\hat\beta = (X^T X)^{-1} X^T y$$

$$\hat y = X \hat\beta = X (X^T X)^{-1} X^T y$$

The matrix $X (X^T X)^{-1} X^T$ is the "hat matrix".

Often assume that the Y's are independent and normally distributed, leading to various classical statistical tests and confidence intervals.
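The formulas on this slide can be checked numerically. The following is a minimal sketch in base R on simulated data; the sample size, the coefficients, and the variable names are assumptions chosen only for illustration.

```r
# Simulated data; the sizes and coefficients are arbitrary illustrations.
set.seed(1)
N <- 50
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + 2 * x1 - x2 + rnorm(N, sd = 0.5)

# Design matrix with an explicit intercept column.
X <- cbind(1, x1, x2)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values.
H <- X %*% solve(crossprod(X)) %*% t(X)
y_hat <- H %*% y

# Residual sum of squares at the least squares solution.
rss <- sum((y - y_hat)^2)

# Cross-check against R's built-in least squares fit.
fit <- lm(y ~ x1 + x2)
print(cbind(normal_equations = as.numeric(beta_hat), lm = as.numeric(coef(fit))))
```

The hat matrix is a projection, so `H %*% H` reproduces `H` and its trace equals the number of columns of `X`. Forming `H` explicitly is only sensible for small N; it is done here to mirror the slide.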

5 Gauss-Markov Theorem

Consider any linear combination of the β's:

$$\theta = a^T \beta$$

The least squares estimate of θ is:

$$\hat\theta = a^T \hat\beta = a^T (X^T X)^{-1} X^T y$$

If the linear model is correct, this estimate is unbiased (X fixed):

$$E(\hat\theta) = E\big(a^T (X^T X)^{-1} X^T y\big) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta$$

Gauss-Markov states that for any other linear unbiased estimator $\tilde\theta = c^T y$, i.e., one with $E(c^T y) = a^T \beta$:

$$\mathrm{Var}(a^T \hat\beta) \le \mathrm{Var}(c^T y)$$

Of course, there might be a biased estimator with lower MSE.

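6 Bias-variance

For any estimator $\tilde\theta$:

$$\mathrm{MSE}(\tilde\theta) = E(\tilde\theta - \theta)^2 = E\big(\tilde\theta - E(\tilde\theta) + E(\tilde\theta) - \theta\big)^2 = E\big(\tilde\theta - E(\tilde\theta)\big)^2 + \big(E(\tilde\theta) - \theta\big)^2 = \mathrm{Var}(\tilde\theta) + \big(E(\tilde\theta) - \theta\big)^2$$

The last term is the squared bias.

Note MSE is closely related to prediction error:

$$E(Y_0 - x_0^T \tilde\theta)^2 = E(Y_0 - x_0^T \theta)^2 + E(x_0^T \tilde\theta - x_0^T \theta)^2 = \sigma^2 + \mathrm{MSE}(x_0^T \tilde\theta)$$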
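The decomposition can be illustrated by simulation. This is a sketch under assumed settings: the true value `theta`, the sample size `n`, and the shrinkage factor `shrink` are hypothetical and were picked so that the biased estimator comes out ahead.

```r
set.seed(2)
theta <- 0.5        # true parameter (assumed for the simulation)
n <- 10             # observations per simulated dataset
reps <- 5000        # number of simulated datasets
shrink <- 0.8       # hypothetical shrinkage factor

# Each row is one dataset drawn from N(theta, 1).
samples <- matrix(rnorm(reps * n, mean = theta), nrow = reps)
unbiased <- rowMeans(samples)      # sample mean: unbiased for theta
shrunken <- shrink * unbiased      # biased, but less variable

# Empirical MSE, variance, and squared bias over the simulated datasets.
decompose <- function(est, theta) {
  variance <- mean((est - mean(est))^2)
  bias_sq <- (mean(est) - theta)^2
  c(mse = mean((est - theta)^2), variance = variance, bias_sq = bias_sq)
}

d_unbiased <- decompose(unbiased, theta)
d_shrunken <- decompose(shrunken, theta)
print(rbind(unbiased = d_unbiased, shrunken = d_shrunken))
```

In each row the `mse` entry equals `variance + bias_sq`. With these settings the shrunken estimator trades a small squared bias for a larger drop in variance; with a larger `theta` the comparison can go the other way.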

7 Too Many Predictors?

When there are lots of X's, we get models with high variance and prediction suffers. Three solutions:

1. Subset selection. Score: AIC, BIC, etc. Search: all-subsets + leaps-and-bounds, stepwise methods
2. Shrinkage/Ridge Regression
3. Derived Inputs

8 Subset Selection

Standard all-subsets finds the subset of size k, k = 1, ..., p, that minimizes RSS.

Choice of subset size requires a tradeoff: AIC, BIC, marginal likelihood, cross-validation, etc.

Leaps and bounds is an efficient algorithm to do all-subsets.
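For small p, all-subsets selection can be sketched by brute-force enumeration. The code below is not the leaps-and-bounds algorithm; it simply tries every subset of each size on simulated data, and the data-generating coefficients are assumptions for illustration.

```r
set.seed(3)
N <- 60
p <- 4
X <- matrix(rnorm(N * p), N, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, 1] - 1.5 * X[, 3] + rnorm(N)   # only x1 and x3 carry signal

# RSS of the least squares fit (with intercept) on a given set of columns.
subset_rss <- function(cols) {
  fit <- lm.fit(cbind(1, X[, cols, drop = FALSE]), y)
  sum(fit$residuals^2)
}

# For each size k, enumerate every subset and keep the one with the smallest RSS.
best <- lapply(1:p, function(k) {
  subsets <- combn(p, k, simplify = FALSE)
  rss <- vapply(subsets, subset_rss, numeric(1))
  list(size = k, vars = subsets[[which.min(rss)]], rss = min(rss))
})

for (b in best) {
  cat(b$size, ":", paste0("x", b$vars, collapse = " "), " RSS =", round(b$rss, 2), "\n")
}
```

The best RSS can only decrease as k grows, which is why RSS alone cannot pick the subset size and a score such as AIC, BIC, or cross-validation is needed. Enumeration visits 2^p - 1 subsets, so it is practical only for small p.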

9 Cross-Validation

e.g. 10-fold cross-validation:
- Randomly divide the data into ten parts
- Train the model using nine tenths and compute the prediction error on the remaining tenth
- Do this for each tenth of the data
- Average the 10 prediction error estimates

One-standard-error rule: pick the simplest model within one standard error of the minimum.
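The procedure above, including the one-standard-error rule, can be sketched as follows. The candidate models here are a hypothetical nested sequence (the first k columns of X, k = 1, ..., p), chosen only to have a clear notion of "simplest"; the data are simulated.

```r
set.seed(4)
N <- 100
p <- 6
X <- matrix(rnorm(N * p), N, p)
y <- 2 * X[, 1] + X[, 2] + rnorm(N)        # only the first two predictors matter

K <- 10
fold <- sample(rep(1:K, length.out = N))   # random assignment to ten parts

# cv_err[f, k]: mean squared prediction error on fold f for the model that
# uses the first k columns of X (a hypothetical nested sequence of models).
cv_err <- matrix(NA_real_, K, p)
for (f in 1:K) {
  test <- fold == f
  for (k in 1:p) {
    Xtr <- cbind(1, X[!test, 1:k, drop = FALSE])
    Xte <- cbind(1, X[test, 1:k, drop = FALSE])
    beta <- lm.fit(Xtr, y[!test])$coefficients
    cv_err[f, k] <- mean((y[test] - Xte %*% beta)^2)
  }
}

cv_mean <- colMeans(cv_err)                 # average of the 10 error estimates
cv_se <- apply(cv_err, 2, sd) / sqrt(K)     # standard error of that average

k_min <- which.min(cv_mean)
# One-standard-error rule: smallest k whose CV error is within one SE of the minimum.
k_1se <- min(which(cv_mean <= cv_mean[k_min] + cv_se[k_min]))
cat("minimum CV error at k =", k_min, "; one-SE choice k =", k_1se, "\n")
```

By construction the one-SE choice is never more complex than the minimizer of the CV error.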

10 Shrinkage Methods

Subset selection is a discrete process: individual variables are either in or out. This method can have high variance; a different dataset from the same source can result in a totally different model.

Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.

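11 Ridge Regression

$$\hat\beta^{ridge} = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$

Equivalently:

$$\hat\beta^{ridge} = \arg\min_\beta \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}$$

This leads to:

$$\hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y$$

which works even when $X^T X$ is singular.

Choose λ by cross-validation. Predictors should be centered.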
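A minimal sketch of the closed-form ridge solution in base R follows. The helper name `ridge`, the simulated data, and the value λ = 10 are assumptions for illustration; in practice λ would be chosen by cross-validation as the slide says.

```r
set.seed(5)
N <- 40
p <- 5
X <- matrix(rnorm(N * p), N, p)
X[, 5] <- X[, 4] + rnorm(N, sd = 0.01)   # two nearly collinear predictors
y <- X[, 1] + 0.5 * X[, 4] + rnorm(N)

# Center the predictors and the response so the intercept drops out of the
# penalized problem (the intercept is then estimated by mean(y)).
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# beta_ridge = (X^T X + lambda I)^{-1} X^T y
ridge <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

beta_ols <- ridge(Xc, yc, 0)       # lambda = 0 gives ordinary least squares
beta_ridge <- ridge(Xc, yc, 10)
print(round(cbind(ols = as.numeric(beta_ols), ridge = as.numeric(beta_ridge)), 3))

# Exactly duplicated column: X^T X is singular, but X^T X + lambda I is not.
Xdup <- cbind(Xc, Xc[, 1])
beta_dup <- ridge(Xdup, yc, 10)
```

With the duplicated column, ridge splits the weight equally between the two copies instead of failing, which illustrates the remark that the estimator works even when $X^T X$ is singular.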

12 [Figure; axis label: effective number of X's]

13 Ridge Regression = Bayesian Regression

$$y_i \sim N(\beta_0 + x_i^T \beta, \sigma^2)$$

$$\beta_j \sim N(0, \tau^2)$$

The posterior mode is the same as ridge with $\lambda = \sigma^2 / \tau^2$.
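The correspondence can be seen from the negative log posterior. This is a short derivation sketch, assuming independent priors on the $\beta_j$, a flat prior on $\beta_0$, and known $\sigma^2$ and $\tau^2$:

```latex
-\log p(\beta \mid y)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \big( y_i - \beta_0 - x_i^T \beta \big)^2
  + \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const}

% Multiplying by 2\sigma^2 does not change the minimizer:
2\sigma^2 \big( -\log p(\beta \mid y) \big)
  = \sum_{i=1}^{N} \big( y_i - \beta_0 - x_i^T \beta \big)^2
  + \frac{\sigma^2}{\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const}
```

The right-hand side is the penalized ridge criterion with $\lambda = \sigma^2 / \tau^2$. Because the posterior is Gaussian, its mode and mean coincide.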

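14 The Lasso

$$\hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$

Quadratic programming algorithm needed to solve for the parameter estimates. Choose s via cross-validation.

More generally:

$$\tilde\beta = \arg\min_\beta \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}$$

q = 0: variable selection
q = 1: lasso
q = 2: ridge

Learn q?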
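The slide mentions quadratic programming. As a simpler illustration, the sketch below solves the penalized (Lagrangian) form of the lasso by cyclic coordinate descent with soft-thresholding, which is a different algorithm swapped in here for brevity. The function names `soft_threshold` and `lasso_cd`, the fixed number of sweeps, and the simulated data are all assumptions.

```r
# Soft-thresholding operator: shrinks z toward zero by t and clips at zero.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Minimizes sum((y - X b)^2) + lambda * sum(abs(b)) for centered X and y.
lasso_cd <- function(X, y, lambda, n_sweeps = 200) {
  p <- ncol(X)
  beta <- numeric(p)
  col_ss <- colSums(X^2)
  r <- y                                   # residual for the current beta (all zeros)
  for (sweep in 1:n_sweeps) {
    for (j in 1:p) {
      r <- r + X[, j] * beta[j]            # residual with predictor j left out
      beta[j] <- soft_threshold(sum(X[, j] * r), lambda / 2) / col_ss[j]
      r <- r - X[, j] * beta[j]            # residual with the updated beta[j]
    }
  }
  beta
}

set.seed(6)
N <- 80
p <- 6
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(N)
y <- y - mean(y)

beta_ols <- lasso_cd(X, y, lambda = 0)     # no penalty: least squares
beta_lasso <- lasso_cd(X, y, lambda = 40)
print(round(cbind(ols = beta_ols, lasso = beta_lasso), 3))
```

The threshold is `lambda / 2` because the criterion on the slide has no factor of 1/2 in front of the residual sum of squares. Unlike ridge, the soft-threshold step can set coefficients exactly to zero, and a large enough λ zeroes all of them.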

16 [Figure, plotted as a function of 1/lambda]

17 Principal Component Regression

Consider an eigen-decomposition of $X^T X$ (and hence of the covariance matrix of X):

$$X^T X = V D^2 V^T$$

(X is first centered; X is N x p.)

The eigenvectors $v_j$ are called the principal component directions of X.

D is diagonal with entries $d_1 \ge d_2 \ge \dots \ge d_p$.

$X v_1$ has the largest sample variance amongst all normalized linear combinations of the columns of X:

$$\mathrm{var}(X v_1) = \frac{d_1^2}{N}$$

$X v_k$ has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.

19 Principal Component Regression

PC Regression regresses on the first M principal components, where M < p.

Similar to ridge regression in some respects; see HTF, p. 66.
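Principal component regression can be sketched in base R with `eigen()`. The simulated data and the choice M = 3 are assumptions for illustration; M would normally be chosen by cross-validation.

```r
set.seed(7)
N <- 100
p <- 5
X <- matrix(rnorm(N * p), N, p)
X[, 2] <- X[, 1] + rnorm(N, sd = 0.1)          # correlated columns
y <- X[, 1] + X[, 3] + rnorm(N)

Xc <- scale(X, center = TRUE, scale = FALSE)   # X is centered first
yc <- y - mean(y)

# Eigen-decomposition X^T X = V D^2 V^T; for a symmetric matrix eigen()
# returns the eigenvalues d_j^2 in decreasing order.
eig <- eigen(crossprod(Xc), symmetric = TRUE)
V <- eig$vectors
d2 <- eig$values

Z <- Xc %*% V                                  # derived inputs z_j = X v_j

M <- 3                                         # number of components kept (arbitrary here)
# The z_j are orthogonal with z_j^T z_j = d_j^2, so each coefficient is a
# univariate regression of y on z_j.
theta <- crossprod(Z[, 1:M], yc) / d2[1:M]
beta_pcr <- V[, 1:M] %*% theta                 # coefficients on the original X scale

# Keeping all p components recovers the least squares fit.
beta_full <- V %*% (crossprod(Z, yc) / d2)
print(round(cbind(pcr = as.numeric(beta_pcr), ols = as.numeric(beta_full)), 3))
```

Dropping the low-variance components discards the directions in which the least squares coefficients are least stable, which is the sense in which PC regression resembles ridge regression.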

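22

    # Forward stagewise demo with two predictors: repeatedly nudge the
    # coefficient of the predictor most correlated with the residual.
    x1 <- rnorm(10)
    x2 <- rnorm(10)
    y <- (3 * x1) + x2 + rnorm(10, 0.1)

    # One panel per predictor, each with its no-intercept least squares line
    par(mfrow = c(1, 2))
    plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
    abline(lm(y ~ -1 + x1))
    plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
    abline(lm(y ~ -1 + x2))

    epsilon <- 0.1      # step size
    r <- y              # current residual
    beta <- c(0, 0)
    numIter <- 5
    for (i in 1:numIter) {
      cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
      # Compare absolute correlations and update exactly one coefficient
      if (abs(cor(x1, r)) > abs(cor(x2, r))) {
        delta <- epsilon * sign(sum(r * x1))   # step in the sign of the inner product
        beta[1] <- beta[1] + delta
        r <- r - (delta * x1)
        par(mfg = c(1, 1))
        abline(0, beta[1], col = "red")
      } else {
        delta <- epsilon * sign(sum(r * x2))
        beta[2] <- beta[2] + delta
        r <- r - (delta * x2)
        par(mfg = c(1, 2))
        abline(0, beta[2], col = "green")
      }
    }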

25 LARS

- Start with all coefficients $b_j = 0$
- Find the predictor $x_j$ most correlated with y
- Increase $b_j$ in the direction of the sign of its correlation with y. Take residuals $r = y - \hat y$ along the way. Stop when some other predictor $x_k$ has as much correlation with r as $x_j$ has
- Increase $(b_j, b_k)$ in their joint least squares direction until some other predictor $x_m$ has as much correlation with the residual r
- Continue until all predictors are in the model

27 Fused Lasso

If there are many correlated features, lasso gives non-zero weight to only one of them.

Maybe correlated features (e.g. time-ordered) should have similar coefficients?

Tibshirani et al. (2005)
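For reference, the fused lasso criterion adds a second constraint on differences between neighbouring coefficients. The form below is a sketch written in this deck's notation; it assumes the features have a natural ordering $j = 1, \dots, p$, and the bounds $s_1$ and $s_2$ are tuning parameters.

```latex
\hat\beta^{fused} = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
\quad \text{subject to} \quad
\sum_{j=1}^{p} |\beta_j| \le s_1
\quad \text{and} \quad
\sum_{j=2}^{p} |\beta_j - \beta_{j-1}| \le s_2
```

The first constraint encourages sparse coefficients as in the lasso; the second encourages adjacent coefficients to be equal, so the solution tends to be piecewise constant in j.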

29 Group Lasso

Suppose you represent a categorical predictor with indicator variables.

Might want the set of indicators to be in or out together.

Regular lasso versus group lasso: see Yuan and Lin (2006).
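The two criteria being contrasted can be written as follows. This is a sketch in the Lagrangian form, not a transcription of the slide's formulas: the predictors are assumed to be partitioned into $L$ groups, with $X_l$ the columns of group $l$, $\beta_l$ its coefficient vector, and $p_l$ its size, and the scaling of $\lambda$ is a convention that varies between references.

```latex
% Regular lasso: each coefficient is penalized separately.
\min_\beta \; \Big\| y - \sum_{j=1}^{p} x_j \beta_j \Big\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|

% Group lasso: the Euclidean norm of each group's coefficient vector is penalized.
\min_\beta \; \Big\| y - \sum_{l=1}^{L} X_l \beta_l \Big\|_2^2 + \lambda \sum_{l=1}^{L} \sqrt{p_l} \, \| \beta_l \|_2
```

Because the norm $\| \beta_l \|_2$ is not squared, the penalty can set an entire group to zero at once, so the indicators of one categorical predictor enter or leave the model together. When every group has size one, the group lasso reduces to the regular lasso.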
