Lecture 11: Regression Methods I (Linear Regression)


Fall, 2017

Outline
1. Regression: Supervised Learning with Continuous Responses
2. Linear Models and Multiple Linear Regression
   - Ordinary Least Squares
   - Statistical inferences
   - Computational algorithms

Regression Models
If the response Y takes real values, we refer to this type of supervised learning problem as a regression problem.
- linear regression models: parametric models
- nonparametric regression: splines, kernel estimators, local polynomial regression
- semiparametric regression
Broad coverage: penalized regression, regression trees, support vector regression, quantile regression.

Linear Regression Models
A standard linear regression model assumes
    y_i = x_i^T β + ε_i,   ε_i i.i.d., E(ε_i) = 0, Var(ε_i) = σ²,
where y_i is the response for the ith observation, x_i ∈ R^d are the covariates, and β ∈ R^d is the d-dimensional parameter vector.
Common model assumptions:
- independence of errors
- constant error variance (homoscedasticity)
- ε independent of X.
Normality is not needed.

About Linear Models
Linear models have been a mainstay of statistics for the past 30 years and remain one of the most important tools.
The covariates may come from different sources:
- quantitative inputs
- dummy coding of qualitative inputs
- transformed inputs: log(X), X², √X, ...
- basis expansions: X₁, X₁², X₁³, ... (polynomial representation)
- interactions between variables: X₁·X₂, ...

Review on Matrix Theory (I)
Let A be an m × m matrix and let I_m be the identity matrix of size m.
- The determinant of A is det(A) = |A|.
- The trace of A, tr(A), is the sum of the diagonal elements.
- The roots of the mth-degree polynomial equation in λ, |λ I_m − A| = 0, denoted λ_1, ..., λ_m, are called the eigenvalues of A. The collection {λ_1, ..., λ_m} is called the spectrum of A.
- Any nonzero m × 1 vector x_i ≠ 0 such that A x_i = λ_i x_i is an eigenvector of A corresponding to the eigenvalue λ_i.
- If B is another m × m matrix, then |AB| = |A| |B| and tr(AB) = tr(BA).
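
The short numpy sketch below (not part of the original slides; the matrices and seed are arbitrary) checks these determinant, trace, and eigenvalue facts numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.normal(size=(m, m))
B = rng.normal(size=(m, m))

# Eigenvalues are the roots of det(lambda*I - A) = 0
eigvals = np.linalg.eigvals(A)

# det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# det(A) is the product of the eigenvalues, tr(A) their sum
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)
assert np.isclose(np.trace(A), np.sum(eigvals).real)
```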

Review on Matrix Theory (II)
The following are equivalent:
- |A| ≠ 0
- rank(A) = m
- A^{-1} exists.

Orthogonal Matrix
An m × m matrix A is symmetric if A^T = A.
An m × m matrix P is called an orthogonal matrix if P P^T = P^T P = I_m, or equivalently P^{-1} = P^T.
If P is an orthogonal matrix, then |P P^T| = |P| |P^T| = |P|² = |I| = 1, so |P| = ±1.
For any m × m matrix A, we have tr(P A P^T) = tr(A P^T P) = tr(A).
P A P^T and A have the same eigenvalues, since |λ I_m − P A P^T| = |λ P P^T − P A P^T| = |P|² |λ I_m − A| = |λ I_m − A|.

Spectral Decomposition of a Symmetric Matrix
For any m × m symmetric matrix A, there exists an orthogonal matrix P such that
    P^T A P = Λ = diag{λ_1, ..., λ_m},
where the λ_i's are the eigenvalues of A. The corresponding eigenvectors of A are the column vectors of P.
Denote the m × 1 unit vectors by e_i, i = 1, ..., m, where e_i has 1 in the ith position and zeros elsewhere. The ith column of P is p_i = P e_i, i = 1, ..., m. Note P P^T = Σ_{i=1}^m p_i p_i^T = I_m.
The spectral decomposition of A is
    A = P Λ P^T = Σ_{i=1}^m λ_i p_i p_i^T,
with tr(A) = tr(Λ) = Σ_{i=1}^m λ_i and |A| = |Λ| = Π_{i=1}^m λ_i.
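
A minimal numpy sketch (added here for illustration; the symmetric matrix is randomly generated) of the spectral decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
M = rng.normal(size=(m, m))
A = (M + M.T) / 2            # a symmetric matrix

lam, P = np.linalg.eigh(A)   # eigenvalues (ascending) and orthonormal eigenvectors

# P is orthogonal: P^T P = I
assert np.allclose(P.T @ P, np.eye(m))
# Spectral decomposition: A = P Lambda P^T = sum_i lam_i p_i p_i^T
assert np.allclose(A, P @ np.diag(lam) @ P.T)
# tr(A) = sum of eigenvalues, det(A) = product of eigenvalues
assert np.isclose(np.trace(A), lam.sum())
assert np.isclose(np.linalg.det(A), lam.prod())
```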

Idempotent Matrices
An m × m matrix A is idempotent if A² = AA = A.
The eigenvalues of an idempotent matrix are either zero or one: if Ax = λx with x ≠ 0, then
    λx = Ax = A(Ax) = A(λx) = λ²x,  so λ = λ².
A symmetric idempotent matrix A is also referred to as a projection matrix. For any x ∈ R^m:
- the vector y = Ax is the orthogonal projection of x onto the subspace of R^m generated by the column vectors of A;
- the vector z = (I − A)x is the orthogonal projection of x onto the complementary subspace, so that x = y + z = Ax + (I − A)x.

Projection Matrices
If A is symmetric idempotent, then:
- if rank(A) = r, then A has r eigenvalues equal to 1 and m − r eigenvalues equal to 0, so tr(A) = rank(A);
- I_m − A is also symmetric idempotent, of rank m − r.

Matrix Notations for Linear Regression
- The response vector y = (y_1, ..., y_n)^T.
- The design matrix X; assume the first column of X is 1. The dimension of X is n × (1 + d).
- The regression coefficients β = (β_0, β_1^T)^T (intercept β_0 and slope vector β_1).
- The error vector ε = (ε_1, ..., ε_n)^T.
The linear model is written as y = Xβ + ε, with estimated coefficients β̂ and predicted response ŷ = X β̂.

Ordinary Least Squares (OLS)
The most popular method for fitting the linear model is ordinary least squares (OLS):
    min_β RSS(β) = (y − Xβ)^T (y − Xβ).
Normal equations: X^T (y − X β̂) = 0, so
    β̂ = (X^T X)^{-1} X^T y  and  ŷ = X (X^T X)^{-1} X^T y.
The residual vector is r = y − ŷ = (I − P_X) y, and the residual sum of squares is RSS = r^T r.
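
A small self-contained numpy sketch (added; the simulated data and coefficient values are arbitrary) fitting OLS by solving the normal equations. In practice a QR- or SVD-based solver such as np.linalg.lstsq is preferred for numerical stability, as discussed at the end of the lecture:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # first column is the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                 # arbitrary illustrative values
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat            # fitted values
r = y - y_hat                   # residual vector
RSS = r @ r                     # residual sum of squares

print(beta_hat, RSS)
```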

Projection Matrix
Call the following square matrix the projection or hat matrix:
    P_X = X (X^T X)^{-1} X^T.
Properties:
- symmetric and non-negative definite;
- idempotent: P_X² = P_X; the eigenvalues are 0's and 1's;
- X^T P_X = X^T and X^T (I − P_X) = 0.
We have r = (I − P_X) y and RSS = y^T (I − P_X) y. Note X^T r = X^T (I − P_X) y = 0: the residual vector is orthogonal to the column space spanned by X, col(X).
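
These hat-matrix properties can be verified numerically; the sketch below (added, self-contained, simulated data) forms P_X explicitly only for illustration, since building the full n × n hat matrix is not how one would compute in practice:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

P_X = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix (inv() used only for illustration)
r = (np.eye(n) - P_X) @ y                  # residual vector

assert np.allclose(P_X, P_X.T)             # symmetric
assert np.allclose(P_X @ P_X, P_X)         # idempotent
assert np.allclose(X.T @ P_X, X.T)         # X^T P_X = X^T
assert np.allclose(X.T @ r, 0)             # residuals orthogonal to col(X)
assert np.isclose(np.trace(P_X), 1 + d)    # tr(P_X) = rank(X) = 1 + d
```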

[Figure: Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2001), Figure 3.2. The N-dimensional geometry of least squares regression with two predictors: the outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2; the projection ŷ represents the vector of the least squares predictions.]

Sampling Properties of β̂
    Var(β̂) = σ² (X^T X)^{-1}.
The variance σ² can be estimated as σ̂² = RSS/(n − d − 1). This is an unbiased estimator, i.e., E(σ̂²) = σ².

Inferences for Gaussian Errors
Under the Normal assumption on the error ε, we have
- β̂ ~ N(β, σ² (X^T X)^{-1}),
- (n − d − 1) σ̂² / σ² ~ χ²_{n−d−1},
- β̂ is independent of σ̂².
To test H_0: β_j = 0, we use
- if σ² is known, z_j = β̂_j / (σ √v_j), which has a standard Normal distribution under H_0;
- if σ² is unknown, t_j = β̂_j / (σ̂ √v_j), which has a t_{n−d−1} distribution under H_0,
where v_j is the jth diagonal element of (X^T X)^{-1}.
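
As an illustration (added; simulated data, with the true coefficient of the last predictor set to zero), σ̂², the standard errors σ̂ √v_j, the t-statistics, and two-sided p-values can be computed as follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)    # true beta_2 = 0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
r = y - X @ beta_hat
sigma2_hat = (r @ r) / (n - d - 1)                        # unbiased estimate of sigma^2

XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))               # standard errors sigma_hat * sqrt(v_j)
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - d - 1)  # two-sided p-values

print(np.column_stack([beta_hat, se, t_stats, p_values]))
```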

Confidence Interval for Individual Coefficients
Under the Normal assumption, the 100(1 − α)% C.I. for β_j is
    β̂_j ± t_{n−d−1; α/2} σ̂ √v_j,
where t_{k; ν} is the (1 − ν) percentile of the t_k distribution.
In practice, we use the approximate 100(1 − α)% C.I.
    β̂_j ± z_{α/2} σ̂ √v_j,
where z_{α/2} is the (1 − α/2) percentile of the standard Normal distribution.
Even if the Gaussian assumption does not hold, this interval is approximately correct, with coverage probability approaching 1 − α as n → ∞.

Review on Multivariate Normal Distributions
Distributions of quadratic forms (non-central χ²):
- If X ~ N_p(µ, I_p), then W = X^T X = Σ_{i=1}^p X_i² ~ χ²_p(λ), with λ = ½ µ^T µ. Special case: if X ~ N_p(0, I_p), then W = X^T X ~ χ²_p.
- If X ~ N_p(µ, V) where V is nonsingular, then W = X^T V^{-1} X ~ χ²_p(λ), with λ = ½ µ^T V^{-1} µ.
- If X ~ N_p(µ, V) with V nonsingular, and A is symmetric with AV idempotent of rank s, then W = X^T A X ~ χ²_s(λ), with λ = ½ µ^T A µ.

Cochran's Theorem
Let y ~ N_n(µ, σ² I_n) and let A_j, j = 1, ..., J, be symmetric idempotent matrices with rank s_j. Furthermore, assume that Σ_{j=1}^J A_j = I_n and Σ_{j=1}^J s_j = n. Then
(i) W_j = (1/σ²) y^T A_j y ~ χ²_{s_j}(λ_j), where λ_j = (1/(2σ²)) µ^T A_j µ;
(ii) the W_j's are mutually independent.
Essentially, we decompose y^T y into the (scaled) sum of quadratic forms:
    Σ_{i=1}^n y_i² = y^T I_n y = Σ_{j=1}^J y^T A_j y.

Application of Cochran's Theorem to Linear Models
Example: Assume y ~ N_n(Xβ, σ² I_n). Define A = I − P_X and
- the residual sum of squares RSS = y^T A y = ||r||²,
- the sum of squares regression SSR = y^T P_X y = ||ŷ||².
By Cochran's Theorem, we have
(i) RSS/σ² ~ χ²_{n−d−1} and SSR/σ² ~ χ²_{d+1}(λ), where λ = (Xβ)^T (Xβ)/(2σ²);
(ii) RSS is independent of SSR. (Note r ⊥ ŷ.)
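
A small Monte Carlo sketch (added; sample size, parameters, and number of replicates are arbitrary) illustrating that RSS/σ² follows a χ²_{n−d−1} distribution when the data come from the Gaussian linear model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d, sigma = 30, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
P_X = X @ np.linalg.inv(X.T @ X) @ X.T
A = np.eye(n) - P_X
mu = X @ np.array([1.0, -0.5, 2.0])

# Draw many replicates of RSS/sigma^2 and compare with chi^2_{n-d-1}
reps = 5000
rss_scaled = np.empty(reps)
for b in range(reps):
    y = mu + sigma * rng.normal(size=n)
    rss_scaled[b] = y @ A @ y / sigma**2

print(rss_scaled.mean(), n - d - 1)                       # mean of chi^2_k is k
print(stats.kstest(rss_scaled, "chi2", args=(n - d - 1,)).pvalue)
```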

F Distribution
If U_1 ~ χ²_p, U_2 ~ χ²_q, and U_1 ⊥ U_2, then F = (U_1/p)/(U_2/q) ~ F_{p,q}.
If U_1 ~ χ²_p(λ), U_2 ~ χ²_q, and U_1 ⊥ U_2, then F = (U_1/p)/(U_2/q) ~ F_{p,q}(λ) (noncentral F).
Example: Assume y ~ N_n(Xβ, σ² I_n). Let A = I − P_X, RSS = y^T A y = ||r||², and SSR = y^T P_X y = ||ŷ||². Then
    F = [SSR/(d + 1)] / [RSS/(n − d − 1)] ~ F_{d+1, n−d−1}(λ),  λ = ||Xβ||²/(2σ²).

Making Inferences about Multiple Parameters
Assume X = [X_0, X_1], where X_0 consists of the first k columns. Correspondingly, β = (β_0^T, β_1^T)^T. To test H_0: β_0 = 0, use
    F = [(RSS_1 − RSS)/k] / [RSS/(n − d − 1)],
where
- RSS_1 = y^T (I − P_{X_1}) y (reduced model),
- RSS = y^T (I − P_X) y (full model), with RSS/σ² ~ χ²_{n−d−1},
- RSS_1 − RSS = y^T (P_X − P_{X_1}) y.

Testing Multiple Parameters
Applying Cochran's Theorem to RSS and RSS_1 − RSS (and to RSS_1):
- RSS and RSS_1 − RSS are independent;
- they follow (noncentral) χ² distributions with noncentralities 0 and (Xβ)^T (P_X − P_{X_1})(Xβ)/(2σ²), respectively, while RSS_1 has noncentrality (Xβ)^T (I − P_{X_1})(Xβ)/(2σ²).
Then F ~ F_{k, n−d−1}(λ), with λ = (Xβ)^T (P_X − P_{X_1})(Xβ)/(2σ²).
Under H_0, we have Xβ = X_1 β_1, so F ~ F_{k, n−d−1}.

Nested Model Selection
To test for the significance of a group of coefficients simultaneously, we use the F-statistic
    F = [(RSS_0 − RSS_1)/(d_1 − d_0)] / [RSS_1/(n − d_1 − 1)],
where
- RSS_1 is the RSS for the bigger model with d_1 + 1 parameters,
- RSS_0 is the RSS for the nested smaller model with d_0 + 1 parameters, i.e., with d_1 − d_0 parameters constrained to zero.
The F-statistic measures the change in RSS per additional parameter in the bigger model, normalized by σ̂².
Under the assumption that the smaller model is correct, F ~ F_{d_1−d_0, n−d_1−1}.
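
A self-contained sketch of the nested-model F-test (added; which coefficients are truly zero, the sample size, and the seed are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, d1, d0 = 150, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d1))])
beta = np.array([1.0, 2.0, -1.0, 0.0, 0.0])       # last two coefficients are truly zero
y = X @ beta + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares from an OLS fit via lstsq."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_hat
    return r @ r

RSS1 = rss(X, y)                  # bigger model: intercept + d1 predictors
RSS0 = rss(X[:, :d0 + 1], y)      # smaller model: intercept + first d0 predictors

F = ((RSS0 - RSS1) / (d1 - d0)) / (RSS1 / (n - d1 - 1))
p_value = stats.f.sf(F, d1 - d0, n - d1 - 1)
print(F, p_value)
```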

Confidence Set
The approximate confidence set for β is
    C_β = {β : (β̂ − β)^T (X^T X)(β̂ − β) ≤ σ̂² χ²_{d+1; 1−α}},
where χ²_{k; 1−α} is the (1 − α) percentile of the χ²_k distribution.
The confidence interval for the true function f(x) = x^T β is {x^T β : β ∈ C_β}.

Gauss-Markov Theorem
Assume s^T β is linearly estimable, i.e., there exists a linear estimator b + c^T y such that E(b + c^T y) = s^T β. A function s^T β is linearly estimable iff s = X^T a for some a.
Theorem: If s^T β is linearly estimable, then s^T β̂ is the best linear unbiased estimator (BLUE) of s^T β: for any c^T y satisfying E(c^T y) = s^T β, we have Var(s^T β̂) ≤ Var(c^T y).
s^T β̂ is the best among all unbiased estimators. (It is a function of the complete and sufficient statistic (y^T y, X^T y).)
Question: Is it possible to find a slightly biased linear estimator with smaller variance? (Trade a little bias for a large reduction in variance.)

Linear Regression with Orthogonal Design
If X is univariate (a single predictor with no intercept), the least squares estimate is
    β̂ = Σ_i x_i y_i / Σ_i x_i² = <x, y> / <x, x>.
If X = [x_1, ..., x_d] has orthogonal columns, i.e., <x_j, x_k> = 0 for j ≠ k, or equivalently X^T X = diag(||x_1||², ..., ||x_d||²), then the OLS estimates are
    β̂_j = <x_j, y> / <x_j, x_j>,  j = 1, ..., d.
Each input has no effect on the estimation of the other parameters: multiple linear regression reduces to univariate regression.
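
A brief numpy check (added; the orthogonal design is built synthetically from a QR factor) that with orthogonal columns the multivariate OLS fit coincides with the coordinate-wise estimates <x_j, y>/<x_j, x_j>:

```python
import numpy as np

rng = np.random.default_rng(11)
n, d = 50, 3
# Build an orthogonal design from the Q factor of a random matrix
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = Q * np.array([2.0, 5.0, 0.5])            # orthogonal columns with different lengths
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(size=n)

beta_multi, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_uni = (X.T @ y) / np.sum(X**2, axis=0)  # <x_j, y> / <x_j, x_j>, column by column

assert np.allclose(beta_multi, beta_uni)
print(beta_multi)
```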

How to Orthogonalize X?
Consider the simple linear regression y = β_0 1 + β_1 x + ε.
Orthogonalization process: we regress x onto 1 and obtain the residual z = x − x̄ 1. The residual z is orthogonal to the regressor 1. The column space of X is span{1, x}.
Note ŷ ∈ span{1, x} = span{1, z}, because
    β_0 1 + β_1 x = β_0 1 + β_1 [x̄ 1 + (x − x̄ 1)] = β_0 1 + β_1 [x̄ 1 + z] = (β_0 + β_1 x̄) 1 + β_1 z = η_0 1 + β_1 z.
Thus {1, z} forms an orthogonal basis for the column space of X.

How to Orthogonalize X? (continued)
Estimation process:
First, we regress y onto z to get the OLS estimate of the slope,
    β̂_1 = <y, z> / <z, z> = <y, x − x̄ 1> / <x − x̄ 1, x − x̄ 1>.
Second, we regress y onto 1 and get the coefficient η̂_0 = ȳ.
The OLS fit is given as
    ŷ = η̂_0 1 + β̂_1 z = η̂_0 1 + β̂_1 (x − x̄ 1) = (η̂_0 − β̂_1 x̄) 1 + β̂_1 x.
Therefore, the OLS intercept is obtained as β̂_0 = η̂_0 − β̂_1 x̄ = ȳ − β̂_1 x̄.

[Figure: Elements of Statistical Learning, Hastie, Tibshirani & Friedman (2001), Figure 3.4. Least squares regression by orthogonalization of the inputs: the vector x2 is regressed on the vector x1, leaving the residual vector z; the regression of y on z gives the multiple regression coefficient of x2; adding together the projections of y on each of x1 and z gives the least squares fit ŷ.]

How to Orthogonalize X? (d = 2)
Consider y = β_1 x_1 + β_2 x_2 + β_3 x_3 + ε.
Orthogonalization process:
1. Regress x_2 onto x_1 and compute the residual z_1 = x_2 − γ_12 x_1. (Note z_1 ⊥ x_1.)
2. Regress x_3 onto (x_1, z_1) and compute the residual z_2 = x_3 − γ_13 x_1 − γ_23 z_1. (Note z_2 ⊥ {x_1, z_1}.)
Note span{x_1, x_2, x_3} = span{x_1, z_1, z_2}, because
    β_1 x_1 + β_2 x_2 + β_3 x_3 = β_1 x_1 + β_2 (γ_12 x_1 + z_1) + β_3 (γ_13 x_1 + γ_23 z_1 + z_2)
                                = (β_1 + β_2 γ_12 + β_3 γ_13) x_1 + (β_2 + β_3 γ_23) z_1 + β_3 z_2
                                = η_1 x_1 + η_2 z_1 + β_3 z_2.

Estimation Process
We project y onto the orthogonal basis {x_1, z_1, z_2} one by one, and then recover the coefficients corresponding to the original columns of X.
First, we regress y onto z_2 to get the OLS estimate of the slope β̂_3:
    β̂_3 = <y, z_2> / <z_2, z_2>.
Second, we regress y onto z_1, giving the coefficient η̂_2, and β̂_2 = η̂_2 − β̂_3 γ_23.
Third, we regress y onto x_1, giving the coefficient η̂_1, and β̂_1 = η̂_1 − β̂_3 γ_13 − β̂_2 γ_12.

Gram-Schmidt Procedure (Successive Orthogonalization)
1. Initialize z_0 = x_0 = 1.
2. For j = 1, ..., d: regress x_j on z_0, z_1, ..., z_{j−1} to produce the coefficients
       γ̂_kj = <z_k, x_j> / <z_k, z_k>,  k = 0, ..., j − 1,
   and the residual vector z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k. ({z_0, z_1, ..., z_{j−1}} are orthogonal.)
3. Regress y on the residual z_d to get β̂_d = <y, z_d> / <z_d, z_d>.
4. Compute β̂_j for j = d − 1, ..., 0 successively, in that order.
{z_0, z_1, ..., z_d} forms an orthogonal basis for col(X). The multiple regression coefficient β̂_j is the additional contribution of x_j to y, after x_j has been adjusted for x_0, x_1, ..., x_{j−1}, x_{j+1}, ..., x_d.
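
The following is a compact, self-contained implementation sketch of the successive-orthogonalization procedure (added; it follows the four steps above rather than a numerically hardened QR routine), checked against np.linalg.lstsq on simulated data:

```python
import numpy as np

def gram_schmidt_ols(X, y):
    """OLS coefficients via successive orthogonalization of the columns of X.

    X is assumed to have its intercept column (all ones) first, i.e. x_0 = 1.
    Returns beta_hat in the original column order.
    """
    n, p = X.shape                      # p = 1 + d columns, including the intercept
    Z = np.empty_like(X, dtype=float)
    Gamma = np.eye(p)                   # upper-triangular coefficients gamma_{kj}
    for j in range(p):
        z = X[:, j].astype(float)
        for k in range(j):
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            z = z - Gamma[k, j] * Z[:, k]
        Z[:, j] = z
    # Regress y on each orthogonal z_j, then recover beta in reverse order (X = Z Gamma)
    eta = (Z.T @ y) / np.sum(Z**2, axis=0)
    beta = np.zeros(p)
    for j in range(p - 1, -1, -1):
        beta[j] = eta[j] - Gamma[j, j + 1:] @ beta[j + 1:]
    return beta

rng = np.random.default_rng(2)
n, d = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.normal(size=n)

assert np.allclose(gram_schmidt_ols(X, y), np.linalg.lstsq(X, y, rcond=None)[0])
```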

Collinearity Issue
The dth coefficient is β̂_d = <z_d, y> / <z_d, z_d>.
If x_d is highly correlated with some of the other x_j's, then:
- the residual vector z_d is close to zero;
- the coefficient β̂_d will be very unstable;
- the variance is Var(β̂_d) = σ² / ||z_d||².
The precision for estimating β̂_d depends on the length of z_d, i.e., on how much of x_d is unexplained by the other x_k's.

Two Computational Algorithms for Multiple Regression
Consider the normal equations X^T X β = X^T y. We would like to avoid computing (X^T X)^{-1} directly.
1. QR decomposition of X: X = QR, where Q has orthonormal columns and R is upper triangular. Essentially, a process of orthogonal matrix triangularization.
2. Cholesky decomposition of X^T X: X^T X = R R^T, where R is lower triangular.

Matrix Formulation of Orthogonalization
In Step 2 of the Gram-Schmidt procedure, for j = 1, ..., d,
    z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k,  i.e.,  x_j = Σ_{k=0}^{j−1} γ̂_kj z_k + z_j.
In matrix form, with X = [x_1, ..., x_d] and Z = [z_1, ..., z_d],
    X = ZΓ,
where the columns of Z are orthogonal to each other and the matrix Γ is upper triangular with 1's on the diagonal.
Standardizing Z using D = diag{||z_1||, ..., ||z_d||},
    X = ZΓ = Z D^{-1} D Γ ≡ QR,  with Q = Z D^{-1}, R = DΓ.

QR Decomposition
The columns of Q form an orthonormal basis for the column space of X. Q is an n × d matrix with orthonormal columns, satisfying Q^T Q = I. R is an upper triangular d × d matrix of full rank.
    X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R.
The least squares solutions are
    β̂ = (X^T X)^{-1} X^T y = R^{-1} R^{-T} R^T Q^T y = R^{-1} Q^T y,
    ŷ = X β̂ = (QR)(R^{-1} Q^T y) = Q Q^T y.

QR Algorithm for Normal Equations
Regard β̂ as the solution of the linear system Rβ = Q^T y:
1. Compute the QR decomposition X = QR (Gram-Schmidt orthogonalization).
2. Compute Q^T y.
3. Solve the triangular system Rβ = Q^T y.
The computational complexity: nd².
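
A sketch of this algorithm with numpy/scipy (added; it uses simulated data and the library QR routine rather than hand-rolled Gram-Schmidt):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(5)
n, d = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ rng.normal(size=d + 1) + rng.normal(size=n)

Q, R = np.linalg.qr(X)                     # reduced QR: Q is n x (d+1), R is (d+1) x (d+1)
beta_hat = solve_triangular(R, Q.T @ y)    # solve the upper-triangular system R beta = Q^T y

assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```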

Cholesky Decomposition Algorithm
For any positive definite square matrix A, we have A = R R^T, where R is a lower triangular matrix of full rank.
1. Compute X^T X and X^T y.
2. Factor X^T X = R R^T; then β̂ = (R^T)^{-1} R^{-1} X^T y.
3. Solve the triangular system Rw = X^T y for w.
4. Solve the triangular system R^T β = w for β.
The computational complexity: d³ + nd²/2 (can be faster than QR for small d, but can be less stable).
Prediction variance: Var(ŷ_0) = Var(x_0^T β̂) = σ² x_0^T (R^T)^{-1} R^{-1} x_0.
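
And a matching sketch of the Cholesky route (added; same simulated-data caveats as the QR sketch above):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(6)
n, d = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ rng.normal(size=d + 1) + rng.normal(size=n)

XtX = X.T @ X
Xty = X.T @ y

R = np.linalg.cholesky(XtX)                       # lower triangular, XtX = R R^T
w = solve_triangular(R, Xty, lower=True)          # solve R w = X^T y
beta_hat = solve_triangular(R.T, w, lower=False)  # solve R^T beta = w

assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```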