Linear Regression Models
Based on Chapter 3 of Hastie, Tibshirani and Friedman
Linear Regression Models

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$

Here the $X_j$ might be:
- Raw predictor variables (continuous or coded-categorical)
- Transformed predictors ($X_4 = \log X_3$)
- Basis expansions ($X_4 = X_3^2$, $X_5 = X_3^3$, etc.)
- Interactions ($X_4 = X_2 \cdot X_3$)

A popular choice for estimation is least squares:

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2$$
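A minimal R sketch of how such a design matrix arises; the data frame d and the variables x2, x3 are illustrative, not from the slides:

d <- data.frame(x2 = runif(20, 1, 5), x3 = runif(20, 1, 5))
# Raw predictor, log transform, quadratic basis expansion, interaction
X <- model.matrix(~ x2 + log(x3) + I(x3^2) + x2:x3, data = d)
head(X)   # first column is the intercept (beta_0)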
Least Squares

$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$$

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

$$\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y$$

The matrix $H = X(X^T X)^{-1} X^T$ is the "hat" matrix.

Often we assume that the $Y$'s are independent and normally distributed, leading to various classical statistical tests and confidence intervals.
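A minimal R sketch of these formulas on simulated data (all names are illustrative):

set.seed(1)
N <- 50
X <- cbind(1, matrix(rnorm(N * 3), N, 3))   # first column = intercept
beta <- c(1, 2, -1, 0.5)
y <- drop(X %*% beta + rnorm(N))

beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
H <- X %*% solve(crossprod(X)) %*% t(X)            # hat matrix
y_hat <- H %*% y                                   # fitted values
coef(lm(y ~ X - 1))                                # same estimates via lm()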
Gauss-Markov Theorem

Consider any linear combination of the $\beta$'s: $\theta = a^T \beta$.

The least squares estimate of $\theta$ is: $\hat{\theta} = a^T \hat{\beta} = a^T (X^T X)^{-1} X^T y$

If the linear model is correct, this estimate is unbiased ($X$ fixed):

$$E(\hat{\theta}) = E\big(a^T (X^T X)^{-1} X^T y\big) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta$$

Gauss-Markov states that for any other linear unbiased estimator $\tilde{\theta} = c^T y$, i.e. with $E(c^T y) = a^T \beta$:

$$\mathrm{Var}(a^T \hat{\beta}) \le \mathrm{Var}(c^T y)$$

Of course, there might be a biased estimator with lower MSE.
Bias-Variance

For any estimator $\tilde{\theta}$:

$$\mathrm{MSE}(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2 = E\big(\tilde{\theta} - E\tilde{\theta} + E\tilde{\theta} - \theta\big)^2 = \mathrm{Var}(\tilde{\theta}) + \big(E\tilde{\theta} - \theta\big)^2$$

The second term is the squared bias. Note that MSE is closely related to prediction error at a point $x_0$:

$$E(Y_0 - \tilde{\theta}^T x_0)^2 = E(Y_0 - \theta^T x_0)^2 + E(\theta^T x_0 - \tilde{\theta}^T x_0)^2 = \sigma^2 + \mathrm{MSE}(\tilde{\theta}^T x_0)$$
Too Many Predictors?

When there are lots of $X$'s, we get models with high variance, and prediction suffers. Three solutions:

1. Subset selection (scores: AIC, BIC, etc.; all-subsets + leaps-and-bounds; stepwise methods)
2. Shrinkage/ridge regression
3. Derived inputs
Subset Selection

Standard all-subsets finds the subset of size $k$, for $k = 1, 2, \ldots, p$, that minimizes RSS.

The choice of subset size requires a tradeoff: AIC, BIC, marginal likelihood, cross-validation, etc.

Leaps and bounds is an efficient algorithm for all-subsets selection (see the R sketch below).
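A minimal sketch of all-subsets selection, assuming the leaps package is installed (the data are simulated for illustration):

library(leaps)

set.seed(1)
X <- matrix(rnorm(100 * 8), 100, 8)
y <- X[, 1] - 2 * X[, 2] + 0.5 * X[, 3] + rnorm(100)

fit <- regsubsets(x = X, y = y, nvmax = 8)   # exhaustive leaps-and-bounds search
summary(fit)$bic                             # BIC of the best model of each size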
Cross-Validation

e.g., 10-fold cross-validation:
- Randomly divide the data into ten parts
- Train the model using 9 tenths and compute the prediction error on the remaining 1 tenth
- Do this for each tenth of the data
- Average the 10 prediction error estimates

One-standard-error rule: pick the simplest model within one standard error of the minimum.
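A minimal sketch of 10-fold cross-validation for a linear model, written from scratch (all names are illustrative):

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

fold <- sample(rep(1:10, length.out = nrow(d)))   # random fold labels
cv_err <- sapply(1:10, function(k) {
  fit <- lm(y ~ x, data = d[fold != k, ])         # train on 9 tenths
  pred <- predict(fit, newdata = d[fold == k, ])  # predict held-out tenth
  mean((d$y[fold == k] - pred)^2)
})
mean(cv_err)           # CV estimate of prediction error
sd(cv_err) / sqrt(10)  # standard error, for the one-SE rule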
Shrinkage Methods

Subset selection is a discrete process: individual variables are either in or out. This method can have high variance; a different dataset from the same source can result in a totally different model.

Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included, but with a shrunken coefficient.
Ridge Regression

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s$$

Equivalently:

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigg\}$$

This leads to:

$$\hat{\beta}^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$

which works even when $X^T X$ is singular. Choose $\lambda$ by cross-validation. Predictors should be centered.
[Figure: ridge coefficient profiles plotted against the effective number of parameters, $\mathrm{df}(\lambda) = \mathrm{tr}\big[X (X^T X + \lambda I)^{-1} X^T\big]$]
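A minimal R sketch of the closed-form ridge solution on centered, simulated data (the choice lambda = 1 is arbitrary here; in practice it would be chosen by cross-validation):

set.seed(1)
X <- scale(matrix(rnorm(50 * 5), 50, 5), center = TRUE, scale = FALSE)
y <- drop(X[, 1] + rnorm(50))
y <- y - mean(y)                     # center the response as well

lambda <- 1
# Ridge solution: (X'X + lambda I)^{-1} X'y
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))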
Ridge Regression = Bayesian Regression

$$y_i \sim N\big(\beta_0 + x_i^T \beta, \; \sigma^2\big)$$

$$\beta_j \sim N(0, \tau^2)$$

The posterior mode for $\beta$ is the same as the ridge estimate, with $\lambda = \sigma^2 / \tau^2$.
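To see why, note that the log-posterior is, up to constants, the negative of the penalized residual sum of squares:

$$\log p(\beta \mid y) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \big( y_i - \beta_0 - x_i^T \beta \big)^2 - \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const}$$

Multiplying through by $-2\sigma^2$ recovers the ridge criterion with $\lambda = \sigma^2/\tau^2$, so maximizing the posterior is exactly ridge regression.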
The Lasso

$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$

A quadratic programming algorithm is needed to solve for the parameter estimates. Choose $s$ via cross-validation.

More generally:

$$\tilde{\beta} = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Bigg\}$$

- $q = 0$: variable selection
- $q = 1$: lasso
- $q = 2$: ridge
- Learn $q$?
[Figure: lasso coefficient profiles as a function of $1/\lambda$]
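A minimal sketch of fitting the lasso path and choosing the penalty by cross-validation, assuming the glmnet package is installed (data simulated for illustration):

library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)

fit <- glmnet(X, y, alpha = 1)       # alpha = 1 gives the lasso penalty
plot(fit)                            # coefficient profiles along the path

cvfit <- cv.glmnet(X, y, alpha = 1)  # 10-fold CV over lambda
coef(cvfit, s = "lambda.1se")        # one-standard-error rule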
Principal Component Regression

Consider an eigen-decomposition of $X^T X$ (and hence, up to scaling, of the covariance matrix of $X$), where $X$ is $N \times p$ and is first centered:

$$X^T X = V D^2 V^T$$

- The eigenvectors $v_j$ are called the principal components (directions) of $X$
- $D^2$ is diagonal with entries $d_1^2 \ge d_2^2 \ge \cdots \ge d_p^2$
- $Xv_1$ has the largest sample variance amongst all normalized linear combinations of the columns of $X$: $\mathrm{Var}(Xv_1) = d_1^2 / N$
- $Xv_k$ has the largest sample variance amongst all normalized linear combinations of the columns of $X$, subject to being orthogonal to all the earlier ones
Principal Component Regression

PC regression regresses $y$ on the first $M$ principal components, where $M < p$.

Similar to ridge regression in some respects (see HTF, p. 66).
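A minimal sketch of PC regression using base R (M = 2 components is an arbitrary illustration; in practice M would be chosen by cross-validation):

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)

pc <- prcomp(X, center = TRUE, scale. = FALSE)
M <- 2
Z <- pc$x[, 1:M]          # first M principal component scores
fit <- lm(y ~ Z)          # regress on the derived inputs

# Map back to coefficients on the original (centered) X scale
beta_pcr <- pc$rotation[, 1:M] %*% coef(fit)[-1]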
www.r-project.org/user-2006/slides/hesterberg+fraley.pdf
# Incremental (forward-stagewise) regression with two predictors:
# repeatedly nudge the coefficient of the predictor most correlated
# with the current residual by a small step epsilon.
x1 <- rnorm(10)
x2 <- rnorm(10)
y <- 3*x1 + 2*x2 + rnorm(10, 0, 0.1)

par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))

epsilon <- 0.1
r <- y                 # current residual
beta <- c(0, 0)
numIter <- 25
for (i in 1:numIter) {
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  if (cor(x1, r) > cor(x2, r)) {
    # step beta_1 by epsilon in the direction of the sign of r'x1
    delta <- epsilon * (2 * (drop(r %*% x1) > 0) - 1)
    beta[1] <- beta[1] + delta
    r <- r - delta * x1
    par(mfg = c(1, 1))                 # draw on the left panel
    abline(0, beta[1], col = "red")
  } else {
    delta <- epsilon * (2 * (drop(r %*% x2) > 0) - 1)
    beta[2] <- beta[2] + delta
    r <- r - delta * x2
    par(mfg = c(1, 2))                 # draw on the right panel
    abline(0, beta[2], col = "green")
  }
}
LARS

- Start with all coefficients $b_j = 0$
- Find the predictor $x_j$ most correlated with $y$
- Increase $b_j$ in the direction of the sign of its correlation with $y$. Take residuals $r = y - \hat{y}$ along the way. Stop when some other predictor $x_k$ has as much correlation with $r$ as $x_j$ has
- Increase $(b_j, b_k)$ in their joint least squares direction until some other predictor $x_m$ has as much correlation with the residual $r$
- Continue until all predictors are in the model
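A minimal sketch using the lars package, assuming it is installed (data simulated for illustration):

library(lars)

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)

fit <- lars(X, y, type = "lar")   # type = "lasso" gives the lasso path instead
plot(fit)                         # piecewise-linear coefficient paths
coef(fit)                         # coefficients at each step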
Fused Lasso

If there are many correlated features, the lasso tends to give non-zero weight to only one of them. Maybe correlated features (e.g., time-ordered ones) should have similar coefficients?

Tibshirani et al. (2005)
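The fused lasso of Tibshirani et al. (2005) encodes this by constraining both the coefficients and their successive differences:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s_1 \;\;\text{and}\;\; \sum_{j=2}^{p} |\beta_j - \beta_{j-1}| \le s_2$$

The second constraint encourages adjacent (e.g., time-ordered) coefficients to be equal.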
Group Lasso

Suppose you represent a categorical predictor with indicator variables. You might want the whole set of indicators to be in or out of the model together. With the coefficients partitioned into groups $\beta_1, \ldots, \beta_G$ of sizes $p_1, \ldots, p_G$:

regular lasso penalty: $\lambda \sum_{j=1}^{p} |\beta_j|$

group lasso penalty: $\lambda \sum_{g=1}^{G} \sqrt{p_g} \, \|\beta_g\|_2$

Yuan and Lin (2006)
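A minimal sketch, assuming the grpreg package is installed and that its interface is roughly as shown (the grouping of columns is illustrative):

library(grpreg)

set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
grp <- c(1, 1, 1, 2, 2, 3)   # columns 1-3 form one group, 4-5 another, etc.
y <- X[, 1] + X[, 2] + rnorm(100)

fit <- grpreg(X, y, group = grp, penalty = "grLasso")
plot(fit)   # whole groups enter or leave the model together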