Data Mining Stat 588



1 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September

2 Regression Problem Quantitative generic output variable $Y$. Generic input vector $X = (X_1, \ldots, X_p)^T$. Regression function $\hat{Y} = f(X)$ to predict $Y$. If the pair $(X, Y)$ has a joint probability distribution, and we take the loss function as
$$L(Y, f(X)) = E_{X,Y}\,(Y - f(X))^2,$$
then the optimal regression function is
$$f(x) = E_{Y|X}(Y \mid X = x).$$
Target: typically we have a set of training data $(x_1, y_1), \ldots, (x_N, y_N)$, from which we want to learn the regression function $f(x)$ or, in statistical language, estimate $f(x)$.

3 Linear Regression Models The linear regression model assumes $f(X)$ takes the form
$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j.$$
Under the decision theory framework, the linear model assumes that either the regression function $E(Y \mid X)$ is linear, or that the linear form is a reasonable approximation. The variables $X_j$ can come from different sources: (i) quantitative inputs; (ii) transformations of quantitative inputs, such as log, square-root or square; (iii) basis expansions, such as $X_2 = X_1^2$, $X_3 = X_1^3$, leading to a polynomial representation; (iv) numeric or dummy coding of the levels of qualitative inputs. For example, if $G$ is a five-level factor input, we might create $X_j$, $j = 1, \ldots, 5$, such that $X_j = I(G = j)$. Together this group of $X_j$ represents the effect of $G$ by a set of level-dependent constants; (v) interactions between variables, for example, $X_3 = X_1 X_2$.

4 Least Square Methods Training set: $(x_1, y_1), \ldots, (x_N, y_N)$, with $x_i = (x_{i1}, \ldots, x_{ip})^T$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$. The most popular estimation method is least squares, in which we pick the coefficients $\beta$ to minimize the residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2.$$

5 Figure: plot with axes $X_1$, $X_2$ and $Y$.

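6 Normal Equation $\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{Nj})^T$ denotes the $N$ measurements on the $j$th feature/input. $\mathbf{1}$ is an $N$-dimensional vector with all entries equal to one. $\mathbf{X} = (\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)$ is an $N \times (p + 1)$ matrix. The optimal solution $\hat\beta$ satisfies the normal equation
$$\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y},$$
and is given by
$$\hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.$$
The fitted values at the training inputs are
$$\hat{\mathbf{y}} = \mathbf{X} \hat\beta = \underbrace{\mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T}_{\text{hat matrix}} \, \mathbf{y}.$$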
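The normal equation and the hat matrix can be checked numerically. The following is a minimal sketch on simulated data, not part of the lecture; the function name `least_squares_fit`, the sample sizes and the true coefficients are assumptions chosen for illustration.

```python
import numpy as np

def least_squares_fit(X, y):
    """Solve X^T X beta = X^T y; X must already contain the column of ones."""
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)   # (X^T X)^{-1} X^T y
    H = X @ np.linalg.solve(XtX, X.T)          # hat matrix X (X^T X)^{-1} X^T
    return beta_hat, H

# Simulated training set (illustrative values only).
rng = np.random.default_rng(0)
N, p = 50, 3
inputs = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), inputs])      # N x (p + 1) design matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

beta_hat, H = least_squares_fit(X, y)
y_hat = H @ y                                  # fitted values at the training inputs
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly; the hat matrix is built here only to mirror the formula, and in practice one would usually compute `X @ beta_hat` directly.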

7 Geometric Interpretation Figure: the vectors $\mathbf{y}$, $\mathbf{x}_1$, $\mathbf{x}_2$ and the fitted vector $\hat{\mathbf{y}}$.

8 Statistical Inference Assume the $x_i$ are fixed and the $y_i$ are uncorrelated with constant variance $\sigma^2$. These assumptions are usually enough for large sample (asymptotic) theory. The variance matrix of the least squares parameter estimates is given by
$$\mathrm{Var}(\hat\beta) = (\mathbf{X}^T \mathbf{X})^{-1} \sigma^2.$$
The usual estimate of the variance $\sigma^2$ is
$$\hat\sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.$$

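9 Exact Distribution Theory Generic model:
$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \epsilon.$$
The error $\epsilon \sim N(0, \sigma^2)$. For the training data, $\epsilon_i$, $1 \le i \le N$, are i.i.d. Under this model assumption,
$$\hat\beta \sim N\big(\beta, (\mathbf{X}^T \mathbf{X})^{-1} \sigma^2\big) \quad \text{and} \quad (N - p - 1)\,\hat\sigma^2 \sim \sigma^2 \chi^2_{N-p-1}.$$
Moreover, $\hat\beta$ and $\hat\sigma^2$ are independent.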

10 Hypothesis Testing To test the hypothesis that a particular coefficient $\beta_j = 0$, we use the Z-score
$$z_j = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}} \sim t_{N-p-1} \quad \text{under the null},$$
where $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T \mathbf{X})^{-1}$. To test for the significance of groups of coefficients simultaneously, we use the F-statistic
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)} \sim F_{p_1 - p_0,\, N - p_1 - 1} \quad \text{under the null},$$
where $\mathrm{RSS}_1$ is the residual sum of squares for the least squares fit of the bigger model with $p_1 + 1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller model with $p_0 + 1$ parameters, having $p_1 - p_0$ parameters constrained to be zero.
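As an illustration of the Z-score and F-statistic, here is a small sketch on simulated data. The helper names `ols_inference` and `f_statistic` and all data values are assumptions made for this example; the design matrices are assumed to include the column of ones.

```python
import numpy as np

def ols_inference(X, y):
    """Return beta_hat, sigma^2 estimate, Z-scores and RSS for a design X with intercept column."""
    N, k = X.shape                       # k = p + 1 parameters
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    rss = float(resid @ resid)
    sigma2_hat = rss / (N - k)           # divides by N - p - 1
    v = np.diag(XtX_inv)                 # v_j: diagonal of (X^T X)^{-1}
    z = beta_hat / np.sqrt(sigma2_hat * v)
    return beta_hat, sigma2_hat, z, rss

def f_statistic(X_small, X_big, y):
    """F-statistic for a nested pair of models (both designs include the intercept)."""
    N = len(y)
    p0 = X_small.shape[1] - 1
    p1 = X_big.shape[1] - 1
    rss0 = ols_inference(X_small, y)[3]
    rss1 = ols_inference(X_big, y)[3]
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))

# Simulated data in which the last coefficient is truly zero.
rng = np.random.default_rng(1)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(size=N)

beta_hat, sigma2_hat, z, rss1 = ols_inference(X, y)
F = f_statistic(X[:, :-1], X, y)         # drop only the last predictor
```

When a single coefficient is dropped, the F-statistic equals the square of that coefficient's Z-score, which gives a convenient consistency check between the two tests.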

11 Confidence Region When the sample size $N$ is sufficiently large, the distribution of $(\hat\beta_j - \beta_j)/(\hat\sigma \sqrt{v_j})$ is well approximated by $N(0, 1)$, and a $1 - \alpha$ confidence interval for $\beta_j$ is given by
$$\big( \hat\beta_j - z^{(1-\alpha/2)} \hat\sigma \sqrt{v_j},\; \hat\beta_j + z^{(1-\alpha/2)} \hat\sigma \sqrt{v_j} \big),$$
where $z^{(1-\alpha/2)}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution. When $N$ is not very large, it should be replaced by $t^{(1-\alpha/2)}_{N-p-1}$, the $(1 - \alpha/2)$ quantile of the $t_{N-p-1}$ distribution. Similarly we can obtain an approximate confidence set for $\beta$,
$$C_\beta = \big\{ \beta : (\hat\beta - \beta)^T \mathbf{X}^T \mathbf{X} (\hat\beta - \beta) \le \hat\sigma^2 \, {\chi^2_{p+1}}^{(1-\alpha)} \big\},$$
where ${\chi^2_{p+1}}^{(1-\alpha)}$ is the $(1 - \alpha)$ quantile of $\chi^2_{p+1}$.
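The large-sample interval can be sketched with NumPy and the standard library's `statistics.NormalDist` for the normal quantile. The function name `normal_confidence_intervals` and the simulated data are assumptions for illustration; for small $N$ the normal quantile would be replaced by the $t_{N-p-1}$ quantile, which is not done here.

```python
import numpy as np
from statistics import NormalDist

def normal_confidence_intervals(X, y, alpha=0.05):
    """Large-sample 1 - alpha intervals for each coefficient; X includes the column of ones."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - k))
    z_quantile = NormalDist().inv_cdf(1 - alpha / 2)   # (1 - alpha/2) normal quantile
    half_width = z_quantile * sigma_hat * np.sqrt(np.diag(XtX_inv))
    return beta_hat - half_width, beta_hat + half_width

# Simulated data (illustrative values only).
rng = np.random.default_rng(2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=N)

lower, upper = normal_confidence_intervals(X, y, alpha=0.05)
```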

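12 Gauss-Markov Theorem An estimate $\tilde\beta$ is called a linear unbiased estimate (LUE) of $\beta$ if (i) it is linear in $\mathbf{y}$, that is, $\tilde\beta = C^T \mathbf{y}$ for some $C \in \mathbb{R}^{N \times (p+1)}$; (ii) it is unbiased, that is, $E_\beta \tilde\beta = \beta$ for all $\beta \in \mathbb{R}^{p+1}$. Theorem (Gauss-Markov Theorem) The least square estimate $\hat\beta$ has smallest variance among all LUE of $\beta$. (a) $\hat\beta$ is a LUE. (b) If $\tilde\beta$ is another LUE, then $\mathrm{Var}(\hat\beta) \le \mathrm{Var}(\tilde\beta)$ in the sense that the matrix $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is positive semi-definite.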

13 Bias-Variance Tradeoff Figure: Test and training error as functions of model complexity. Prediction error is plotted against model complexity (low to high) for the test sample and the training sample; low complexity corresponds to high bias and low variance, high complexity to low bias and high variance.

14 Subset Selection There are two reasons why we are often not satisfied with the least squares estimates. The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero. By doing so we accept a little bias in exchange for a reduction in the variance of the predicted values, and hence may improve the overall prediction accuracy. The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the big picture, we are willing to sacrifice some of the small details.

15 Best Subset Selection For each $k \in \{0, 1, 2, \ldots, p\}$, find the subset of size $k$ that gives the smallest residual sum of squares. The best subset of size 2, for example, need not include the variable that was in the best subset of size 1. The smallest residual sum of squares is necessarily non-increasing as a function of $k$, so it cannot be used to select the subset size $k$. The question of how to choose $k$ involves the tradeoff between bias and variance, along with the more subjective desire for parsimony. Typically we choose the smallest model that minimizes an estimate of the expected prediction error. Many of the other approaches are similar, in that they use the training data to produce a sequence of models varying in complexity and indexed by a single parameter. Popular methods for selecting the right parameter (the subset size here) include cross-validation, AIC and BIC. More details later.
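A brute-force version of best subset selection can be sketched as follows. It enumerates all subsets, so it is only practical for small $p$; the helper names `rss_of_subset` and `best_subsets` and the simulated data are assumptions for illustration, and no particular efficient search algorithm is implied.

```python
import numpy as np
from itertools import combinations

def rss_of_subset(X, y, subset):
    """RSS of the least squares fit using an intercept plus the columns in `subset`."""
    N = len(y)
    design = np.column_stack([np.ones(N)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def best_subsets(X, y):
    """For each size k = 0..p, return (smallest RSS, best subset) by exhaustive search."""
    p = X.shape[1]
    best = {}
    for k in range(p + 1):
        candidates = ((rss_of_subset(X, y, s), s) for s in combinations(range(p), k))
        best[k] = min(candidates, key=lambda pair: pair[0])
    return best

# Simulated data: only predictors 0 and 2 carry signal (illustrative values).
rng = np.random.default_rng(3)
N, p = 60, 4
X = rng.normal(size=(N, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=N)

best = best_subsets(X, y)
rss_by_size = [best[k][0] for k in range(p + 1)]
```

The list `rss_by_size` never increases with $k$, which is why the training RSS alone cannot pick the subset size.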

16 Best Subset Selection Figure: residual sum of squares of the subset models.

17 Forward-Stepwise Selection Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. It can exploit the QR decomposition for the current fit to rapidly establish the next candidate. It produces a sequence of models indexed by $k$, the subset size, which must be determined. Pros: computationally efficient; smaller variance as compared to best subset selection, but perhaps more bias. Cons: errors made at the beginning cannot be corrected later.
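The greedy search can be sketched as below. This version simply refits each candidate model by least squares instead of updating a QR decomposition, so it illustrates the selection rule rather than the fast implementation; the function names and the simulated data are assumptions.

```python
import numpy as np

def subset_rss(X, y, subset):
    """RSS of the least squares fit using an intercept plus the columns in `subset`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the sequence of (active set, RSS) pairs for subset sizes k = 1..p."""
    p = X.shape[1]
    active, remaining, path = [], list(range(p)), []
    while remaining:
        # Try adding each remaining predictor and keep the one with the smallest RSS.
        best_rss, best_j = min((subset_rss(X, y, active + [j]), j) for j in remaining)
        active.append(best_j)
        remaining.remove(best_j)
        path.append((list(active), best_rss))
    return path

# Simulated data: predictor 0 has the strongest effect, then predictor 2.
rng = np.random.default_rng(4)
N, p = 100, 4
X = rng.normal(size=(N, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=N)

path = forward_stepwise(X, y)
```

Each active set contains the previous one, which is the sense in which an early mistake cannot be undone.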

18 Backward-Stepwise Selection Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score in absolute value. Pros: can throw out the right predictor by looking at the full model. Cons: computationally inefficient (it starts with the full model); cannot be used when $p > N$.

19 Hybrid-stepwise Selection Hybrid-stepwise selection considers both forward and backward moves at each step, and selects the best of the two. Pros: computationally efficient; errors made at an earlier stage can be corrected later. It needs a criterion to decide whether to add or drop at each step, e.g. AIC, which takes proper account of both the number of parameters and how well the model fits.

20 Forward-Stagewise Regression Forward-stagewise regression starts with an intercept equal to $\bar{y}$, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. It can take many more than $p$ steps to reach the least squares fit, so it was historically viewed as inefficient. It is nevertheless quite competitive in very high dimensional problems.
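The steps described above can be sketched directly. The function name `forward_stagewise`, the step limit `n_steps`, the stopping tolerance `tol` and the simulated data are assumptions for illustration.

```python
import numpy as np

def forward_stagewise(X, y, n_steps=5000, tol=1e-10):
    """Forward-stagewise regression on centered predictors; returns (intercept, beta)."""
    Xc = X - X.mean(axis=0)              # centered predictors
    intercept = y.mean()                 # intercept starts (and stays) at y-bar
    beta = np.zeros(X.shape[1])
    resid = y - intercept
    col_ss = (Xc ** 2).sum(axis=0)       # squared norms of the centered columns
    for _ in range(n_steps):
        # Proportional to the correlation of each predictor with the current residual.
        corr = Xc.T @ resid / np.sqrt(col_ss)
        j = int(np.argmax(np.abs(corr)))
        # Simple linear regression coefficient of the residual on predictor j.
        delta = Xc[:, j] @ resid / col_ss[j]
        if abs(delta) < tol:
            break
        beta[j] += delta
        resid = resid - delta * Xc[:, j]
    return intercept, beta

# Simulated data (illustrative values only).
rng = np.random.default_rng(5)
N, p = 200, 4
X = rng.normal(size=(N, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=N)

intercept, beta = forward_stagewise(X, y)
```

On this small, well-conditioned example the iterates settle on the least squares coefficients, but only after revisiting the same variables several times, which illustrates why more than $p$ steps are needed.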

21 Comparison of Four Subset Selection Methods Figure: expected error of $\hat\beta(k)$ relative to $\beta$, plotted against subset size $k$, for best subset, forward stepwise, backward stepwise and forward stagewise.

22 Shrinkage Methods Subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. It is a discrete process, which means variables are either retained or discarded. So it often exhibits high variance, and doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.

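23 Motivation: James-Stein's Estimate Suppose $y_1, y_2, \ldots, y_N$ are i.i.d. $N(\mu, I_p)$, where $\mu \in \mathbb{R}^p$ is unknown and is to be estimated. The sample mean $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$ is sufficient, and is both the MLE and the BLUE. We say an estimate $\hat\mu$ of $\mu$ is inadmissible if there exists another estimate $\tilde\mu$ such that (i) $E_\mu \|\tilde\mu - \mu\|_2^2 \le E_\mu \|\hat\mu - \mu\|_2^2$ for all $\mu \in \mathbb{R}^p$; (ii) for some $\mu \in \mathbb{R}^p$, $E_\mu \|\tilde\mu - \mu\|_2^2 < E_\mu \|\hat\mu - \mu\|_2^2$. Otherwise $\hat\mu$ is said to be admissible. Theorem (Stein 1956) (a) If $p \le 2$, then $\bar{y}$ is admissible. (b) If $p > 2$, then $\bar{y}$ is inadmissible. Theorem (James-Stein, 1961) If $p \ge 3$, then
$$\tilde\mu_{JS} = \Big[ 1 - \frac{p - 2}{N \|\bar{y}\|_2^2} \Big] \bar{y}$$
has everywhere smaller MSE than $\bar{y}$. The factor $N$ accounts for $\bar{y} \sim N(\mu, I_p / N)$; for $N = 1$ the estimate reduces to $[1 - (p - 2)/\|\bar{y}\|_2^2]\,\bar{y}$.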
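A short Monte Carlo sketch can make the James-Stein result concrete. It assumes the shrinkage term is scaled by $N$ because $\bar{y}$ has covariance $I_p / N$ (with $N = 1$ this is the familiar $[1 - (p-2)/\|\bar{y}\|^2]\,\bar{y}$); the function name `james_stein`, the choice of $\mu$ and the number of replications are illustrative assumptions.

```python
import numpy as np

def james_stein(y_bar, N):
    """James-Stein shrinkage of sample means y_bar (last axis has length p >= 3).

    Assumes y_bar ~ N(mu, I_p / N), hence the factor N in the shrinkage term.
    """
    p = y_bar.shape[-1]
    norm2 = np.sum(y_bar ** 2, axis=-1, keepdims=True)
    return (1.0 - (p - 2) / (N * norm2)) * y_bar

rng = np.random.default_rng(6)
p, N, n_rep = 10, 5, 20000
mu = np.full(p, 0.5)                      # hypothetical true mean

# Draw the sample mean directly: y_bar ~ N(mu, I_p / N), one row per replication.
y_bar = mu + rng.normal(size=(n_rep, p)) / np.sqrt(N)

mse_mean = np.mean(np.sum((y_bar - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((james_stein(y_bar, N) - mu) ** 2, axis=1))
```

The estimated risk of $\bar{y}$ should be close to $p/N$, while the James-Stein risk comes out smaller; the gap shrinks as $\|\mu\|$ grows.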

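24 Ridge Regression The ridge regression solves the optimization problem
$$\hat\beta^{\text{ridge}} = \arg\min_\beta \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}$$
for some $\lambda \ge 0$. An equivalent form is
$$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,$$
for some $t \ge 0$.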

25 Ridge Regression Here $\lambda \ge 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving the optimization problem. The intercept $\beta_0$ has been left out of the penalty term. We can solve the optimization problem in two steps. (1) Estimate $\beta_0$ by $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$. (2) Center $\mathbf{y}$ and normalize each $\mathbf{x}_j$ for $1 \le j \le p$. The remaining coefficients get estimated by a ridge regression without intercept, using the normalized $\mathbf{x}_j$.

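26 Ridge Regression We assume (i) the output vector $\mathbf{y}$ is centered, that is, $\sum_{i=1}^{N} y_i = 0$; (ii) each predictor $\mathbf{x}_j$, $1 \le j \le p$, is normalized, i.e.
$$\sum_{i=1}^{N} x_{ij} = 0 \quad \text{and} \quad \sum_{i=1}^{N} x_{ij}^2 = 1, \quad 1 \le j \le p;$$
(iii) the input matrix $\mathbf{X}$ has $p$ (rather than $p + 1$) columns; and solve the problem (here $\beta = (\beta_1, \ldots, \beta_p)^T$)
$$\hat\beta^{\text{ridge}} = \arg\min_{\beta \in \mathbb{R}^p} \big\{ \|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_2^2 \big\}.$$
Ridge regression has a closed form solution
$$\hat\beta^{\text{ridge}} = (\mathbf{X}^T \mathbf{X} + \lambda I)^{-1} \mathbf{X}^T \mathbf{y}.$$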
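The two-step recipe and the closed form can be sketched as follows. The helper names `standardize` and `ridge`, the value of `lam` and the simulated data are assumptions for illustration.

```python
import numpy as np

def standardize(X, y):
    """Center y; center each column of X and scale it to unit sum of squares."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    return Xs, y - y.mean()

def ridge(Xs, yc, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y, no intercept."""
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

# Simulated data (illustrative values only).
rng = np.random.default_rng(7)
N, p = 80, 5
X = rng.normal(size=(N, p)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])  # mixed scales
y = 4.0 + X @ np.array([1.0, 0.2, -3.0, 0.0, 0.5]) + rng.normal(size=N)

Xs, yc = standardize(X, y)
beta0_hat = y.mean()                       # step (1): intercept estimate
lam = 0.5
beta_ridge = ridge(Xs, yc, lam)            # step (2): ridge on standardized data
beta_ls = ridge(Xs, yc, 0.0)               # lam = 0 recovers least squares
```

Setting `lam = 0` gives the least squares fit, and increasing `lam` reduces the Euclidean norm of the coefficient vector.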

27 Figure: coefficients of lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45 plotted against df($\lambda$).

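28 Lasso The LASSO solves the optimization problem
$$\hat\beta^{\text{lasso}} = \arg\min_\beta \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,$$
for some $t \ge 0$. The equivalent Lagrangian form is
$$\hat\beta^{\text{lasso}} = \arg\min_\beta \Big\{ \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}$$
for some $\lambda \ge 0$.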

29 Comparison: Subset Selection, Ridge Regression and Lasso When the columns of $\mathbf{X}$ are orthonormal, the formulas of the different methods are given by

Table: Estimators of $\beta_j$ in the case of orthonormal columns of $\mathbf{X}$.

Estimator                  Formula
Best subset (size $M$)     $\hat\beta_j \cdot I(|\hat\beta_j| \ge |\hat\beta_{(M)}|)$
Ridge                      $\hat\beta_j / (1 + \lambda)$
Lasso                      $\mathrm{sign}(\hat\beta_j)\,(|\hat\beta_j| - \lambda)_+$

$M$ and $\lambda$ are constants chosen by the corresponding techniques; sign denotes the sign of its argument ($\pm 1$), and $x_+$ denotes the positive part of $x$. Below the table, estimators are shown by broken red lines. The 45° line in gray shows the unrestricted estimate for reference.

Figure: three panels (Best Subset, Ridge, Lasso) showing each estimator as a function of the least squares estimate, with $|\hat\beta_{(M)}|$ and $\lambda$ marked and the origin at (0,0).
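The three formulas for the orthonormal case can be checked numerically. In this sketch the orthonormal design is obtained from a QR decomposition of a random matrix; the function names, `M`, `lam` and the data are assumptions for illustration, and the lasso formula corresponds to the Lagrangian form with the factor $\frac{1}{2}$ on the squared error.

```python
import numpy as np

def best_subset_orthonormal(beta_ls, M):
    """Keep the M least squares coefficients that are largest in absolute value."""
    keep = np.argsort(-np.abs(beta_ls))[:M]
    out = np.zeros_like(beta_ls)
    out[keep] = beta_ls[keep]
    return out

def ridge_orthonormal(beta_ls, lam):
    """Proportional shrinkage: beta_j / (1 + lam)."""
    return beta_ls / (1.0 + lam)

def lasso_orthonormal(beta_ls, lam):
    """Soft thresholding: sign(beta_j) * (|beta_j| - lam)_+."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

# Orthonormal design from a QR decomposition (illustrative data).
rng = np.random.default_rng(9)
Q, _ = np.linalg.qr(rng.normal(size=(30, 4)))   # Q has orthonormal columns
y = Q @ np.array([2.0, -0.3, 1.0, 0.1]) + 0.2 * rng.normal(size=30)

beta_ls = Q.T @ y                               # least squares estimate when Q^T Q = I
M, lam = 2, 0.5
beta_subset = best_subset_orthonormal(beta_ls, M)
beta_ridge = ridge_orthonormal(beta_ls, lam)
beta_lasso = lasso_orthonormal(beta_ls, lam)
```

Best subset applies hard thresholding, ridge scales every coefficient by the same factor, and the lasso translates each coefficient toward zero by $\lambda$ and truncates at zero.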

30 Comparison: Ridge Regression and Lasso Figure: Estimation picture for the lasso (left) and ridge regression (right), with axes $\beta_1$ and $\beta_2$ and the least squares estimate $\hat\beta$ marked. Shown are contours of the error and constraint functions.
