Lecture 11: Regression Methods I (Linear Regression) Fall, 2017 1 / 40
Outline
Linear Model Introduction
1 Regression: Supervised Learning with Continuous Responses
2 Linear Models and Multiple Linear Regression
  - Ordinary Least Squares
  - Statistical inferences
  - Computational algorithms
2 / 40
Regression Models
If the response Y takes real values, we refer to this type of supervised learning problem as a regression problem.
- linear regression models (parametric models)
- nonparametric regression: splines, kernel estimators, local polynomial regression
- semiparametric regression
- Broad coverage: penalized regression, regression trees, support vector regression, quantile regression
3 / 40
Linear Regression Models
A standard linear regression model assumes
    y_i = x_i^T β + ε_i,  ε_i i.i.d., E(ε_i) = 0, Var(ε_i) = σ²,
where
- y_i is the response for the ith observation,
- x_i ∈ R^d is the covariate vector,
- β ∈ R^d is the d-dimensional parameter vector.
Common model assumptions:
- independence of errors
- constant error variance (homoscedasticity)
- ε independent of X.
Normality is not needed.
4 / 40
About Linear Models
Linear models have been a mainstay of statistics for the past 30 years and remain one of the most important tools.
The covariates may come from different sources:
- quantitative inputs
- dummy coding of qualitative inputs
- transformed inputs: log(X), X², √X, ...
- basis expansions: X_1, X_1², X_1³, ... (polynomial representation)
- interactions between variables: X_1 X_2, ...
5 / 40
Review on Matrix Theory (I)
Let A be an m × m matrix, and let I_m be the identity matrix of size m.
- The determinant of A is det(A) = |A|.
- The trace of A is tr(A) = the sum of the diagonal elements.
- The roots of the mth-degree polynomial equation in λ, |λI_m − A| = 0, denoted by λ_1, ..., λ_m, are called the eigenvalues of A. The collection {λ_1, ..., λ_m} is called the spectrum of A.
- Any nonzero m × 1 vector x_i ≠ 0 such that Ax_i = λ_i x_i is an eigenvector of A corresponding to the eigenvalue λ_i.
- If B is another m × m matrix, then |AB| = |A||B| and tr(AB) = tr(BA).
6 / 40
Review on Matrix Theory (II)
The following are equivalent:
- |A| ≠ 0
- rank(A) = m
- A⁻¹ exists.
7 / 40
Orthogonal Matrix
An m × m matrix A is symmetric if A^T = A.
An m × m matrix P is called an orthogonal matrix if PP^T = P^T P = I_m, or equivalently P⁻¹ = P^T.
If P is an orthogonal matrix, then |PP^T| = |P||P^T| = |P|² = |I_m| = 1, so |P| = ±1.
For any m × m matrix A, we have tr(PAP^T) = tr(AP^T P) = tr(A).
PAP^T and A have the same eigenvalues, since
    |λI_m − PAP^T| = |λPP^T − PAP^T| = |P|² |λI_m − A| = |λI_m − A|.
8 / 40
Spectral Decomposition of a Symmetric Matrix
For any m × m symmetric matrix A, there exists an orthogonal matrix P such that
    P^T A P = Λ = diag{λ_1, ..., λ_m},
where the λ_i's are the eigenvalues of A. The corresponding eigenvectors of A are the column vectors of P.
Denote the m × 1 unit vectors by e_i, i = 1, ..., m, where e_i has 1 in the ith position and zeros elsewhere. The ith column of P is p_i = Pe_i, i = 1, ..., m. Note PP^T = Σ_{i=1}^m p_i p_i^T = I_m.
The spectral decomposition of A is
    A = PΛP^T = Σ_{i=1}^m λ_i p_i p_i^T,
with tr(A) = tr(Λ) = Σ_{i=1}^m λ_i and |A| = |Λ| = Π_{i=1}^m λ_i.
9 / 40
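The spectral decomposition above can be checked numerically. A minimal sketch with NumPy (the matrix A here is an illustrative example, not from the slides):

```python
import numpy as np

# Spectral decomposition of a symmetric matrix: A = P Λ P^T.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric matrices: eigenvalues in ascending
# order, eigenvectors as orthonormal columns of P.
eigvals, P = np.linalg.eigh(A)
Lam = np.diag(eigvals)

# A = P Λ P^T, tr(A) = Σ λ_i, |A| = Π λ_i
assert np.allclose(P @ Lam @ P.T, A)
assert np.isclose(np.trace(A), eigvals.sum())
assert np.isclose(np.linalg.det(A), eigvals.prod())
```

For this A the eigenvalues are 1 and 3, so the trace (4) and determinant (3) match the sum and product of the spectrum.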
Idempotent Matrices
An m × m matrix A is idempotent if A² = AA = A.
The eigenvalues of an idempotent matrix are either zero or one:
    λx = Ax = A(Ax) = A(λx) = λ²x  ⟹  λ = λ².
A symmetric idempotent matrix A is also referred to as a projection matrix. For any x ∈ R^m:
- the vector y = Ax is the orthogonal projection of x onto the subspace of R^m generated by the column vectors of A;
- the vector z = (I − A)x is the orthogonal projection of x onto the complementary subspace, so that x = y + z = Ax + (I − A)x.
10 / 40
Projection Matrices
If A is symmetric idempotent, then:
- If rank(A) = r, then A has r eigenvalues equal to 1 and m − r zero eigenvalues.
- tr(A) = rank(A).
- I_m − A is also symmetric idempotent, of rank m − r.
11 / 40
Matrix Notation for Linear Regression
- The response vector y = (y_1, ..., y_n)^T.
- The design matrix X. Assume the first column of X is 1. The dimension of X is n × (1 + d).
- The regression coefficients β = (β_0, β_1^T)^T.
- The error vector ε = (ε_1, ..., ε_n)^T.
The linear model is written as
    y = Xβ + ε,
with the estimated coefficients β̂ and the predicted response ŷ = Xβ̂.
12 / 40
Ordinary Least Squares (OLS)
The most popular method for fitting the linear model is ordinary least squares (OLS):
    min_β RSS(β) = (y − Xβ)^T (y − Xβ).
Normal equations: X^T (y − Xβ) = 0, giving
    β̂ = (X^T X)⁻¹ X^T y  and  ŷ = X(X^T X)⁻¹ X^T y.
The residual vector is r = y − ŷ = (I − P_X)y, and the residual sum of squares is RSS = r^T r.
13 / 40
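The normal equations can be solved directly without forming (X^T X)⁻¹. A small sketch on simulated data (the sample size, dimension, and true coefficients are illustrative choices):

```python
import numpy as np

# Simulate a linear model y = X beta + eps with an intercept column.
rng = np.random.default_rng(0)
n, d = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # first column is 1
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y (avoid the explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
r = y - y_hat
RSS = r @ r

# The residual vector is orthogonal to col(X): X^T r = 0.
assert np.allclose(X.T @ r, 0.0)
```

With small noise, beta_hat lands close to beta_true; the orthogonality check X^T r = 0 is exactly the normal equations restated.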
Projection Matrix
Call the following square matrix the projection or hat matrix:
    P_X = X(X^T X)⁻¹ X^T.
Properties:
- symmetric and non-negative definite;
- idempotent: P_X² = P_X; the eigenvalues are 0's and 1's.
We have X^T P_X = X^T and X^T (I − P_X) = 0.
Note r = (I − P_X)y, RSS = y^T (I − P_X)y, and
    X^T r = X^T (I − P_X)y = 0.
The residual vector is orthogonal to the column space spanned by X, col(X).
14 / 40
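The listed hat-matrix properties are easy to verify numerically; a sketch with an arbitrary random design:

```python
import numpy as np

# Hat matrix P_X = X (X^T X)^{-1} X^T for a random 20x3 design.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
P = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(P, P.T)      # symmetric
assert np.allclose(P @ P, P)    # idempotent
# trace equals rank: the number of eigenvalues equal to 1
assert np.isclose(np.trace(P), np.linalg.matrix_rank(X))
```

This also illustrates the projection-matrix facts from the earlier slides: tr(P_X) = rank(X) = d here, since the design has full column rank.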
[Figure 3.2 from The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001): the N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The projection ŷ represents the vector of the least squares predictions.]
15 / 40
Sampling Properties of β̂
    Var(β̂) = σ²(X^T X)⁻¹.
The variance σ² can be estimated as σ̂² = RSS/(n − d − 1). This is an unbiased estimator, i.e., E(σ̂²) = σ².
16 / 40
Inferences for Gaussian Errors
Under the Normal assumption on the error ε, we have:
- β̂ ~ N(β, σ²(X^T X)⁻¹);
- (n − d − 1)σ̂²/σ² ~ χ²_{n−d−1};
- β̂ is independent of σ̂².
To test H_0: β_j = 0, we use:
- if σ² is known, z_j = β̂_j / (σ √v_j), which has a Z distribution under H_0;
- if σ² is unknown, t_j = β̂_j / (σ̂ √v_j), which has a t_{n−d−1} distribution under H_0;
where v_j is the jth diagonal element of (X^T X)⁻¹.
17 / 40
Confidence Interval for Individual Coefficients
Under the Normal assumption, the 100(1 − α)% C.I. of β_j is
    β̂_j ± t_{n−d−1; α/2} σ̂ √v_j,
where t_{k; ν} is the 1 − ν percentile of the t_k distribution.
In practice, we use the approximate 100(1 − α)% C.I. of β_j,
    β̂_j ± z_{α/2} σ̂ √v_j,
where z_{α/2} is the 1 − α/2 percentile of the standard Normal distribution.
Even if the Gaussian assumption does not hold, this interval is approximately right, with coverage probability → 1 − α as n → ∞.
18 / 40
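A sketch of the approximate (Normal-percentile) interval on simulated data; the sample size, coefficients, and α = 0.05 are illustrative choices, and the Normal percentile comes from the standard library:

```python
import numpy as np
from statistics import NormalDist

# Simulate a simple linear regression and build approximate 95% CIs.
rng = np.random.default_rng(2)
n, alpha = 200, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)        # RSS/(n - d - 1) with d = 1
v = np.diag(np.linalg.inv(X.T @ X))         # v_j, jth diagonal of (X^T X)^{-1}
z = NormalDist().inv_cdf(1 - alpha / 2)     # 1 - alpha/2 percentile (about 1.96)
lower = beta_hat - z * np.sqrt(sigma2_hat * v)
upper = beta_hat + z * np.sqrt(sigma2_hat * v)
```

For an exact interval one would replace z by the t_{n−d−1; α/2} percentile, which requires a t-quantile routine (e.g. from SciPy); for n = 200 the two are nearly identical.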
Review on Multivariate Normal Distributions
Distributions of quadratic forms (non-central χ²):
- If X ~ N_p(µ, I_p), then W = X^T X = Σ_{i=1}^p X_i² ~ χ²_p(λ), with λ = µ^T µ / 2.
  Special case: if X ~ N_p(0, I_p), then W = X^T X ~ χ²_p.
- If X ~ N_p(µ, V) where V is nonsingular, then W = X^T V⁻¹ X ~ χ²_p(λ), with λ = µ^T V⁻¹ µ / 2.
- If X ~ N_p(µ, V) with V nonsingular, and A is symmetric with AV idempotent of rank s, then W = X^T A X ~ χ²_s(λ), with λ = µ^T A µ / 2.
19 / 40
Cochran's Theorem
Let y ~ N_n(µ, σ² I_n) and let A_j, j = 1, ..., J, be symmetric idempotent matrices with rank s_j. Furthermore, assume that Σ_{j=1}^J A_j = I_n and Σ_{j=1}^J s_j = n. Then:
(i) W_j = y^T A_j y / σ² ~ χ²_{s_j}(λ_j), where λ_j = µ^T A_j µ / (2σ²);
(ii) the W_j's are mutually independent.
Essentially, we decompose y^T y into the (scaled) sum of quadratic forms:
    Σ_{i=1}^n y_i² = y^T I_n y = Σ_{j=1}^J y^T A_j y.
20 / 40
Application of Cochran's Theorem to Linear Models
Example: Assume y ~ N_n(Xβ, σ² I_n). Define A = I − P_X, and
- the residual sum of squares: RSS = y^T A y = ||r||²;
- the sum of squares regression: SSR = y^T P_X y = ||ŷ||².
By Cochran's Theorem, we have:
(i) RSS/σ² ~ χ²_{n−d−1} and SSR/σ² ~ χ²_{d+1}(λ), where λ = (Xβ)^T (Xβ)/(2σ²);
(ii) RSS is independent of SSR. (Note r ⊥ ŷ.)
21 / 40
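The decomposition behind this example is the Pythagorean identity y^T y = RSS + SSR with orthogonal pieces; a quick numerical check on an arbitrary random design:

```python
import numpy as np

# Decompose y^T y = RSS + SSR using A = I - P_X and P_X.
rng = np.random.default_rng(8)
X = rng.normal(size=(25, 3))
y = rng.normal(size=25)

P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
y_hat = P @ y
r = y - y_hat
RSS, SSR = r @ r, y_hat @ y_hat

assert np.isclose(y @ y, RSS + SSR)     # Pythagorean decomposition
assert np.isclose(r @ y_hat, 0.0)       # r is orthogonal to y_hat
```

The orthogonality r ⊥ ŷ is what makes the independence claim in (ii) plausible; Cochran's theorem supplies the distributional statement under Normal errors.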
F Distribution
If U_1 ~ χ²_p, U_2 ~ χ²_q, and U_1 ⊥ U_2, then
    F = (U_1/p) / (U_2/q) ~ F_{p,q}.
If U_1 ~ χ²_p(λ), U_2 ~ χ²_q, and U_1 ⊥ U_2, then
    F = (U_1/p) / (U_2/q) ~ F_{p,q}(λ)  (noncentral F).
Example: Assume y ~ N_n(Xβ, σ² I_n). Let A = I − P_X, and
    RSS = y^T A y = ||r||²,  SSR = y^T P_X y = ||ŷ||².
Then
    F = [SSR/(d + 1)] / [RSS/(n − d − 1)] ~ F_{d+1, n−d−1}(λ),  λ = ||Xβ||²/(2σ²).
22 / 40
Making Inferences about Multiple Parameters
Assume X = [X_0, X_1], where X_0 consists of the first k columns. Correspondingly, β = [β_0^T, β_1^T]^T. To test H_0: β_0 = 0, we use
    F = [(RSS_1 − RSS)/k] / [RSS/(n − d − 1)],
where
- RSS_1 = y^T (I − P_{X_1})y (reduced model);
- RSS = y^T (I − P_X)y (full model), with RSS/σ² ~ χ²_{n−d−1};
- RSS_1 − RSS = y^T (P_X − P_{X_1})y.
23 / 40
Testing Multiple Parameters
Applying Cochran's Theorem to the decomposition y^T y = RSS + (RSS_1 − RSS) + y^T P_{X_1} y:
- RSS and RSS_1 − RSS are independent;
- they respectively follow (noncentral) χ² distributions, with noncentralities 0 and (Xβ)^T (P_X − P_{X_1})(Xβ)/(2σ²); RSS_1 itself has noncentrality (Xβ)^T (I − P_{X_1})(Xβ)/(2σ²).
Then we have F ~ F_{k, n−d−1}(λ), with λ = (Xβ)^T (P_X − P_{X_1})(Xβ)/(2σ²).
Under H_0, we have Xβ = X_1 β_1, so F ~ F_{k, n−d−1}.
24 / 40
Nested Model Selection
To test for significance of a group of coefficients simultaneously, we use the F-statistic
    F = [(RSS_0 − RSS_1)/(d_1 − d_0)] / [RSS_1/(n − d_1 − 1)],
where
- RSS_1 is the RSS for the bigger model with d_1 + 1 parameters;
- RSS_0 is the RSS for the nested smaller model with d_0 + 1 parameters, i.e., with d_1 − d_0 parameters constrained to zero.
The F-statistic measures the change in RSS per additional parameter in the bigger model, normalized by σ̂².
Under the assumption that the smaller model is correct, F ~ F_{d_1−d_0, n−d_1−1}.
25 / 40
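A sketch of the nested-model F-statistic on simulated data where the smaller model is in fact correct (the data-generating coefficients and sizes are illustrative):

```python
import numpy as np

# Simulate data where x2 is irrelevant, then compare nested models.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # smaller model is correct

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

X0 = np.column_stack([np.ones(n), x1])        # smaller model, d0 = 1
X1 = np.column_stack([np.ones(n), x1, x2])    # bigger model,  d1 = 2
RSS0, RSS1 = rss(X0, y), rss(X1, y)
d0, d1 = 1, 2
F = ((RSS0 - RSS1) / (d1 - d0)) / (RSS1 / (n - d1 - 1))
```

Since the models are nested, RSS_0 ≥ RSS_1 always; under H_0 this F is a draw from F_{1, 97}, so values near 1 are typical and large values are evidence against the smaller model.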
Confidence Set
The approximate confidence set of β is
    C_β = {β : (β̂ − β)^T (X^T X)(β̂ − β) ≤ σ̂² χ²_{d+1; 1−α}},
where χ²_{k; 1−α} is the 1 − α percentile of the χ²_k distribution.
The confidence interval for the true function f(x) = x^T β is
    {x^T β : β ∈ C_β}.
26 / 40
Gauss-Markov Theorem
Assume s^T β is linearly estimable, i.e., there exists a linear estimator b + c^T y such that E(b + c^T y) = s^T β. A function s^T β is linearly estimable iff s = X^T a for some a.
Theorem: If s^T β is linearly estimable, then s^T β̂ is the best linear unbiased estimator (BLUE) of s^T β: for any c^T y satisfying E(c^T y) = s^T β, we have
    Var(s^T β̂) ≤ Var(c^T y).
In fact, s^T β̂ is the best among all unbiased estimators. (It is a function of the complete and sufficient statistic (y^T y, X^T y).)
Question: Is it possible to find a slightly biased linear estimator with smaller variance? (Trade a little bias for a large reduction in variance.)
27 / 40
Linear Regression with Orthogonal Design
If X is univariate, the least squares estimate is
    β̂ = Σ_i x_i y_i / Σ_i x_i² = <x, y> / <x, x>.
If X = [x_1, ..., x_d] has orthogonal columns, i.e., <x_j, x_k> = 0 for j ≠ k, or equivalently X^T X = diag(||x_1||², ..., ||x_d||²), then the OLS estimates are
    β̂_j = <x_j, y> / <x_j, x_j>,  j = 1, ..., d.
Each input has no effect on the estimation of the other parameters: multiple linear regression reduces to univariate regressions.
28 / 40
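A tiny sketch of this reduction: for a design with orthogonal columns, the joint OLS fit coincides with the coordinate-wise univariate estimates (the 4 × 2 design and response here are illustrative):

```python
import numpy as np

# A design with orthogonal columns (their inner product is zero).
X = np.array([[ 1.0,  1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [-1.0, -1.0]])
y = np.array([2.0, 0.0, 1.0, -1.0])

# Joint multiple regression...
beta_joint = np.linalg.lstsq(X, y, rcond=None)[0]
# ...equals the univariate estimates <x_j, y> / <x_j, x_j>.
beta_univ = np.array([X[:, j] @ y / (X[:, j] @ X[:, j]) for j in range(2)])
assert np.allclose(beta_joint, beta_univ)
```

Here <x_1, y>/<x_1, x_1> = 2/4 = 0.5 and <x_2, y>/<x_2, x_2> = 4/4 = 1.0, and the joint fit returns exactly these values.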
How to Orthogonalize X?
Consider the simple linear regression y = β_0 1 + β_1 x + ε.
Orthogonalization process: we regress x onto 1 and obtain the residual z = x − x̄1. The residual z is orthogonal to the regressor 1.
The column space of X is span{1, x}. Note ŷ ∈ span{1, x} = span{1, z}, because
    β_0 1 + β_1 x = β_0 1 + β_1 [x̄1 + (x − x̄1)] = (β_0 + β_1 x̄)1 + β_1 z = η_0 1 + β_1 z.
{1, z} form an orthogonal basis for the column space of X.
29 / 40
How to Orthogonalize X? (continued)
Estimation process:
First, we regress y onto z for the OLS estimate of the slope β̂_1:
    β̂_1 = <y, z> / <z, z> = <y, x − x̄1> / <x − x̄1, x − x̄1>.
Second, we regress y onto 1 and get the coefficient η̂_0 = ȳ.
The OLS fit is given as
    ŷ = η̂_0 1 + β̂_1 z = η̂_0 1 + β̂_1 (x − x̄1) = (η̂_0 − β̂_1 x̄)1 + β̂_1 x.
Therefore, the OLS intercept is obtained as β̂_0 = η̂_0 − β̂_1 x̄ = ȳ − β̂_1 x̄.
30 / 40
[Figure 3.4 from The Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001): least squares regression by orthogonalization of the inputs. The vector x_2 is regressed on the vector x_1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x_2. Adding together the projections of y on each of x_1 and z gives the least squares fit ŷ.]
31 / 40
How to Orthogonalize X? (three covariates)
Consider y = β_1 x_1 + β_2 x_2 + β_3 x_3 + ε.
Orthogonalization process:
1 We regress x_2 onto x_1 and compute the residual z_1 = x_2 − γ_12 x_1. (Note z_1 ⊥ x_1.)
2 We regress x_3 onto (x_1, z_1) and compute the residual z_2 = x_3 − γ_13 x_1 − γ_23 z_1. (Note z_2 ⊥ {x_1, z_1}.)
Note span{x_1, x_2, x_3} = span{x_1, z_1, z_2}, because
    β_1 x_1 + β_2 x_2 + β_3 x_3 = β_1 x_1 + β_2 (γ_12 x_1 + z_1) + β_3 (γ_13 x_1 + γ_23 z_1 + z_2)
                                = (β_1 + β_2 γ_12 + β_3 γ_13) x_1 + (β_2 + β_3 γ_23) z_1 + β_3 z_2
                                = η_1 x_1 + η_2 z_1 + β_3 z_2.
32 / 40
Estimation Process
We project y onto the orthogonal basis {x_1, z_1, z_2} one by one, and then recover the coefficients corresponding to the original columns of X.
First, we regress y onto z_2 for the OLS estimate of the slope β̂_3:
    β̂_3 = <y, z_2> / <z_2, z_2>.
Second, we regress y onto z_1, leading to the coefficient η̂_2, and
    β̂_2 = η̂_2 − β̂_3 γ_23.
Third, we regress y onto x_1, leading to the coefficient η̂_1, and
    β̂_1 = η̂_1 − β̂_3 γ_13 − β̂_2 γ_12.
33 / 40
Gram-Schmidt Procedure (Successive Orthogonalization)
1 Initialize z_0 = x_0 = 1.
2 For j = 1, ..., d: regress x_j on z_0, z_1, ..., z_{j−1} to produce the coefficients
    γ̂_kj = <z_k, x_j> / <z_k, z_k>,  k = 0, ..., j − 1,
  and the residual vector z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k. ({z_0, z_1, ..., z_{j−1}} are orthogonal.)
3 Regress y on the residual z_d to get
    β̂_d = <y, z_d> / <z_d, z_d>.
4 Compute β̂_j, j = d − 1, ..., 0, in that order successively.
{z_0, z_1, ..., z_d} forms an orthogonal basis for col(X).
The multiple regression coefficient β̂_j is the additional contribution of x_j to y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_d.
34 / 40
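The procedure above can be sketched directly in code; the simulated design and coefficients are illustrative. The key check is that β̂_d computed from the last residual z_d agrees with the coefficient from the full multiple regression:

```python
import numpy as np

# Gram-Schmidt successive orthogonalization of the columns of X.
rng = np.random.default_rng(4)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # x_0 = 1
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=n)

Z = np.zeros_like(X)
Z[:, 0] = X[:, 0]                       # z_0 = x_0 = 1
for j in range(1, d + 1):
    zj = X[:, j].copy()
    for k in range(j):                  # regress x_j on z_0, ..., z_{j-1}
        gamma = Z[:, k] @ X[:, j] / (Z[:, k] @ Z[:, k])
        zj -= gamma * Z[:, k]
    Z[:, j] = zj                        # residual, orthogonal to earlier z_k

# beta_d from the last residual matches the full multiple regression.
beta_d = Z[:, d] @ y / (Z[:, d] @ Z[:, d])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.isclose(beta_d, beta_full[d])
```

This is exactly the interpretation on the slide: β̂_d is the effect of x_d after adjusting for all the other columns.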
Collinearity Issue
The dth coefficient is
    β̂_d = <z_d, y> / <z_d, z_d>.
If x_d is highly correlated with some of the other x_j's, then:
- the residual vector z_d is close to zero;
- the coefficient β̂_d will be very unstable;
- the variance estimate is Var(β̂_d) = σ² / ||z_d||².
The precision for estimating β̂_d depends on the length of z_d, i.e., on how much of x_d is unexplained by the other x_k's.
35 / 40
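A small sketch of this effect, comparing an independent covariate with a nearly collinear one (the correlation level 0.01 and sample size are illustrative):

```python
import numpy as np

# Compare ||z_d||^2 for an independent vs. a nearly collinear column.
rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)               # unrelated to x1
x2_coll = x1 + 0.01 * rng.normal(size=n)    # highly correlated with x1

def resid_norm_sq(x, on):
    """Squared norm of the residual of x after regressing x on `on`."""
    gamma = on @ x / (on @ on)
    z = x - gamma * on
    return z @ z

# The collinear column leaves a near-zero residual, so
# Var(beta_d_hat) = sigma^2 / ||z_d||^2 blows up.
assert resid_norm_sq(x2_coll, x1) < resid_norm_sq(x2_indep, x1)
```

Since Var(β̂_d) = σ²/||z_d||², the tiny residual norm of the collinear column translates directly into a huge coefficient variance.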
Two Computational Algorithms for Multiple Regression
Consider the normal equations X^T X β = X^T y. We would like to avoid computing (X^T X)⁻¹ directly.
1 QR decomposition of X:
    X = QR, where Q is orthonormal and R is upper triangular.
  Essentially, a process of orthogonal matrix triangularization.
2 Cholesky decomposition of X^T X:
    X^T X = RR^T, where R is lower triangular.
36 / 40
Matrix Formulation of Orthogonalization
In Step 2 of the Gram-Schmidt procedure, for j = 1, ..., d,
    z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k  ⟹  x_j = Σ_{k=0}^{j−1} γ̂_kj z_k + z_j.
In matrix form, with X = [x_1, ..., x_d] and Z = [z_1, ..., z_d],
    X = ZΓ.
- The columns of Z are orthogonal to each other.
- The matrix Γ is upper triangular, with 1's on the diagonal.
Standardizing Z using D = diag{||z_1||, ..., ||z_d||},
    X = ZΓ = ZD⁻¹ DΓ ≡ QR, with Q = ZD⁻¹, R = DΓ.
37 / 40
QR Decomposition
The columns of Q consist of an orthonormal basis for the column space of X:
- Q is an n × d matrix with orthonormal columns, satisfying Q^T Q = I;
- R is a d × d upper triangular matrix of full rank.
Then
    X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R,
and the least squares solutions are
    β̂ = (X^T X)⁻¹ X^T y = R⁻¹ (R^T)⁻¹ R^T Q^T y = R⁻¹ Q^T y,
    ŷ = Xβ̂ = (QR)(R⁻¹ Q^T y) = QQ^T y.
38 / 40
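A sketch of the QR route to the OLS solution, compared against the normal equations (the random design is illustrative; NumPy's `qr` returns the reduced factorization by default):

```python
import numpy as np

# OLS via QR: beta = R^{-1} Q^T y, y_hat = Q Q^T y.
rng = np.random.default_rng(6)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

Q, R = np.linalg.qr(X)                        # Q: 30x4 orthonormal, R: 4x4 upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)         # solve the triangular system R beta = Q^T y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, for comparison

assert np.allclose(beta_qr, beta_ne)
assert np.allclose(X @ beta_qr, Q @ (Q.T @ y))   # y_hat = Q Q^T y
```

In practice the QR route is preferred because it never forms X^T X, whose condition number is the square of that of X.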
QR Algorithm for Normal Equations
Regard β̂ as the solution of the linear system Rβ = Q^T y:
1 Conduct the QR decomposition X = QR (Gram-Schmidt orthogonalization).
2 Compute Q^T y.
3 Solve the triangular system Rβ = Q^T y.
The computational complexity is O(nd²).
39 / 40
Cholesky Decomposition Algorithm
For any positive definite square matrix A, we have A = RR^T, where R is a lower triangular matrix of full rank.
1 Compute X^T X and X^T y.
2 Factor X^T X = RR^T, so that β̂ = (R^T)⁻¹ R⁻¹ X^T y.
3 Solve the triangular system Rw = X^T y for w.
4 Solve the triangular system R^T β = w for β.
The computational complexity is d³ + nd²/2 (can be faster than QR for small d, but can be less stable).
For a new point x_0, Var(ŷ_0) = Var(x_0^T β̂) = σ²(x_0^T (R^T)⁻¹ R⁻¹ x_0).
40 / 40
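The four steps above can be sketched with NumPy, whose `cholesky` returns exactly the lower triangular factor used on this slide (the random design is illustrative):

```python
import numpy as np

# Cholesky algorithm for the normal equations X^T X beta = X^T y.
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)

A, b = X.T @ X, X.T @ y              # Step 1: form X^T X and X^T y
R = np.linalg.cholesky(A)            # Step 2: A = R R^T, R lower triangular
w = np.linalg.solve(R, b)            # Step 3: forward substitution, R w = X^T y
beta_chol = np.linalg.solve(R.T, w)  # Step 4: back substitution, R^T beta = w

# Agrees with a direct least squares solve.
assert np.allclose(beta_chol, np.linalg.lstsq(X, y, rcond=None)[0])
```

Forming X^T X squares the condition number of X, which is the stability caveat mentioned on the slide; for well-conditioned designs with small d, this route is cheap and simple.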