Linear Regression

For $(X, Y)$ a pair of random variables with values in $\mathbb{R}^p \times \mathbb{R}$ we assume that
\[
E(Y \mid X) = \beta_0 + \sum_{j=1}^p X_j \beta_j = (1, X^T)\beta
\]
with $\beta \in \mathbb{R}^{p+1}$. This model of the conditional expectation is linear in the parameters. The predictor function for a given $\beta$ is $f_\beta(x) = (1, x^T)\beta$.

A more practical and relaxed attitude towards linear regression is to say that
\[
E(Y \mid X) \approx \beta_0 + \sum_{j=1}^p X_j \beta_j = (1, X^T)\beta,
\]
where the precision of the approximation of the conditional mean by a linear function is swept under the carpet. All mathematical derivations rely on assuming that the conditional mean is exactly linear, but in reality we will almost always regard linear regression as an approximation.

Linearity is, for differentiable functions, a good local approximation, and it may extend reasonably to the convex hull of the $x_i$'s. But we must make an attempt to check that by some sort of model control. Having done so, interpolation (prediction for a new $x$ in the convex hull of the observed $x_i$'s) is usually OK, whereas extrapolation is always dangerous. Extrapolation is strongly model dependent, and we do not have data to justify that the model is adequate for extrapolation.

Least Squares

With $X$ the $N \times (p+1)$ data matrix, including the column of 1's, the predicted values for a given $\beta$ are $X\beta$. The residual sum of squares is
\[
\mathrm{RSS}(\beta) = \sum_{i=1}^N (y_i - (1, x_i^T)\beta)^2 = \|y - X\beta\|^2.
\]
The least squares estimate of $\beta$ is
\[
\hat{\beta} = \operatorname{argmin}_\beta \mathrm{RSS}(\beta).
\]
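
To make the definitions concrete, here is a minimal numpy sketch, not part of the original notes, that simulates data from a linear model and computes the least squares estimate and $\mathrm{RSS}(\hat{\beta})$; the sample size, design and true $\beta$ are arbitrary choices for the illustration.

```python
# Minimal sketch: simulate (X, y), compute the least squares estimate and RSS.
# N, p, beta_true and the noise level are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1), with the column of 1's
beta_true = np.array([1.0, 2.0, 0.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)             # minimizes ||y - X beta||^2
rss = np.sum((y - X @ beta_hat) ** 2)
print(beta_hat, rss)
```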

Figure 3.1 Geometry

The linear regression seeks a $p$-dimensional affine representation, a hyperplane, of the $(p+1)$-dimensional variable $(X, Y)$. The direction of the $Y$-variable plays a distinctive role: the error of the approximating hyperplane is measured parallel to this axis.

The Solution the Calculus Way

Since $\mathrm{RSS}(\beta) = (y - X\beta)^T(y - X\beta)$,
\[
D_\beta \mathrm{RSS}(\beta) = -2(y - X\beta)^T X.
\]
The derivative is a $1 \times (p+1)$ matrix, a row vector. The gradient is $\nabla_\beta \mathrm{RSS}(\beta) = D_\beta \mathrm{RSS}(\beta)^T$. The second derivative is
\[
D_\beta^2 \mathrm{RSS}(\beta) = 2X^TX.
\]
If $X$ has rank $p+1$, $D_\beta^2 \mathrm{RSS}(\beta)$ is (globally) positive definite and there is a unique minimizer found by solving $D_\beta \mathrm{RSS}(\beta) = 0$. The solution is
\[
\hat{\beta} = (X^TX)^{-1}X^Ty.
\]
For the differentiation it may be useful to think of RSS as a composition of the function $a(\beta) = y - X\beta$ from $\mathbb{R}^{p+1}$ to $\mathbb{R}^N$ with derivative $D_\beta a(\beta) = -X$, and the function $b(z) = \|z\|^2 = z^Tz$ from $\mathbb{R}^N$ to $\mathbb{R}$ with derivative $D_z b(z) = 2z^T$. Then, by the chain rule,
\[
D_\beta \mathrm{RSS}(\beta) = D_z b(a(\beta)) D_\beta a(\beta) = -2(y - X\beta)^T X.
\]

An alternative way to solve the optimization problem is by geometry, in the spirit of Figure 3.2. With $V = \{X\beta \mid \beta \in \mathbb{R}^{p+1}\}$ the column space of $X$, the quantity $\mathrm{RSS}(\beta) = \|y - X\beta\|^2$ is minimized whenever $X\beta$ is the orthogonal projection of $y$ onto $V$. The column space projection equals
\[
P = X(X^TX)^{-1}X^T
\]
whenever $X$ has full rank $p+1$. In this case $X\beta = Py$ has the unique solution $\hat{\beta} = (X^TX)^{-1}X^Ty$.

One verifies that $P$ is, in fact, the projection by verifying three characterizing properties:

- $PV = V$
- $P^2 = X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = X(X^TX)^{-1}X^T = P$
- $P^T = (X(X^TX)^{-1}X^T)^T = X(X^TX)^{-1}X^T = P$
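
As a sanity check on the closed form solution and the projection characterization, the following sketch (simulated data again; not from the notes) solves the normal equations and verifies the three properties of $P$ numerically.

```python
# Sketch: normal equations and the projection P = X (X^T X)^{-1} X^T on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)    # solves the normal equations X^T X beta = X^T y
P = X @ np.linalg.solve(XtX, X.T)           # projection onto the column space of X

assert np.allclose(P @ X, X)                # P V = V
assert np.allclose(P @ P, P)                # P^2 = P
assert np.allclose(P, P.T)                  # P^T = P
assert np.allclose(X @ beta_hat, P @ y)     # X beta_hat is the projection of y onto V
```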

If $X$ does not have full rank $p+1$ the projection is still well defined and can be written as
\[
P = X(X^TX)^{-}X^T,
\]
where $(X^TX)^{-}$ denotes a generalized inverse. A generalized inverse of a matrix $A$ is a matrix $A^{-}$ with the property that $AA^{-}A = A$, and using this property one easily verifies the same three conditions for the projection. In this case, however, there is not a unique solution to $X\beta = Py$.

Distributional Results Conditionally on X

Let
\[
\varepsilon_i = Y_i - (1, X_i^T)\beta.
\]
Assumption 1: $\varepsilon_1, \ldots, \varepsilon_N$ are, conditionally on $X_1, \ldots, X_N$, uncorrelated with mean value 0 and the same variance $\sigma^2$. Define
\[
\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^N (Y_i - (1, X_i^T)\hat{\beta})^2 = \frac{1}{N-p-1}\|Y - X\hat{\beta}\|^2 = \frac{\mathrm{RSS}(\hat{\beta})}{N-p-1}.
\]
Then $V(Y \mid X) = \sigma^2 I_N$ and
\[
E(\hat{\beta} \mid X) = (X^TX)^{-1}X^TX\beta = \beta,
\]
\[
V(\hat{\beta} \mid X) = (X^TX)^{-1}X^T \sigma^2 I_N X(X^TX)^{-1} = \sigma^2 (X^TX)^{-1},
\]
\[
E(\hat{\sigma}^2 \mid X) = \sigma^2.
\]
The expectation of $\hat{\sigma}^2$ can be computed by noting that $E(Y_i - (1, X_i^T)\hat{\beta}) = 0$, hence $E(\mathrm{RSS}(\hat{\beta}))$ is the sum of the variances of $Y_i - (1, X_i^T)\hat{\beta}$. Since $Y$ has variance matrix $\sigma^2 I_N$, the variance matrix of $Y - X\hat{\beta} = Y - PY = (I - P)Y$ is
\[
\sigma^2 (I-P)(I-P)^T = \sigma^2 (I-P),
\]
where we have used that $I - P$ is a projection. The sum of the diagonal elements equals the dimension of the subspace that $I - P$ projects onto, which is $N - p - 1$.

Distributional Results Conditionally on X

Assumption 2: $\varepsilon_1, \ldots, \varepsilon_N$ are, conditionally on $X_1, \ldots, X_N$, i.i.d. $N(0, \sigma^2)$. Then
\[
\hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1}), \qquad \frac{(N-p-1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{N-p-1}.
\]
The standardized Z-score satisfies
\[
Z_j = \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}\sqrt{(X^TX)^{-1}_{jj}}} \sim t_{N-p-1},
\]
or, more generally, for any $a \in \mathbb{R}^{p+1}$,
\[
\frac{a^T\hat{\beta} - a^T\beta}{\hat{\sigma}\sqrt{a^T(X^TX)^{-1}a}} \sim t_{N-p-1}.
\]
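
These distributional results translate directly into standard errors and t-tests. Below is a sketch (simulated data; scipy is assumed available) computing $\hat{\sigma}^2$, the standard errors from $\hat{\sigma}^2 (X^TX)^{-1}$, and the Z-scores for the null hypotheses $\beta_j = 0$ with their two-sided p-values.

```python
# Sketch: sigma-hat^2, standard errors and t-tests on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (N - p - 1)                      # unbiased estimate of sigma^2
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))         # sqrt of the diagonal of sigma-hat^2 (X^T X)^{-1}
z = beta_hat / se                                   # Z_j for the hypothesis beta_j = 0
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)  # two-sided test in the t_{N-p-1} distribution
print(np.column_stack([beta_hat, se, z, p_values]))
```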

Minimal Variance, Unbiased Estimators

Does there exist a minimal variance, unbiased estimator of $\beta$? We consider linear estimators only,
\[
\tilde{\beta} = C^TY
\]
for some $N \times (p+1)$ matrix $C$, requiring, for unbiasedness, that $\beta = C^TX\beta$ for all $\beta$. That is, $C^TX = I_{p+1} = X^TC$. Under Assumption 1,
\[
V(\tilde{\beta} \mid X) = \sigma^2 C^TC,
\]
and we have
\[
V(\hat{\beta} - \tilde{\beta} \mid X) = V\big(((X^TX)^{-1}X^T - C^T)Y \mid X\big) = \sigma^2 ((X^TX)^{-1}X^T - C^T)((X^TX)^{-1}X^T - C^T)^T = \sigma^2 (C^TC - (X^TX)^{-1}).
\]
The matrix $C^TC - (X^TX)^{-1}$ is therefore positive semidefinite, i.e. for any $a \in \mathbb{R}^{p+1}$,
\[
a^T(X^TX)^{-1}a \leq a^TC^TCa.
\]

Gauss-Markov's Theorem

Theorem 1. Under Assumption 1 the least squares estimator of $\beta$ has minimal variance among all linear, unbiased estimators of $\beta$.

This means that for any $a \in \mathbb{R}^{p+1}$, $a^T\hat{\beta}$ has minimal variance among all estimators of $a^T\beta$ of the form $a^T\tilde{\beta}$, where $\tilde{\beta}$ is a linear, unbiased estimator. It also means that $V(\tilde{\beta}) - \sigma^2(X^TX)^{-1}$ is positive semidefinite, or, in the partial ordering on positive semidefinite matrices, $\sigma^2(X^TX)^{-1} \preceq V(\tilde{\beta})$.

Why look any further, when we have found the optimal estimator ...?

Biased Estimators

The mean squared error is
\[
\mathrm{MSE}_\beta(\tilde{\beta}) = E_\beta(\|\tilde{\beta} - \beta\|^2).
\]
By Gauss-Markov's Theorem $\hat{\beta}$ is optimal for all $\beta$ among the linear, unbiased estimators. Allowing for biased, possibly still linear, estimators we can achieve improvements of the MSE for some $\beta$, perhaps at the expense of some other $\beta$.

The Stein estimator is a non-linear, biased estimator, which under Assumption 2 has uniformly smaller MSE than $\hat{\beta}$ whenever $p \geq 3$.
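
To get a feeling for the Stein phenomenon, the sketch below simulates the classical mean estimation setting, a single observation $Z \sim N(\theta, I_p)$ with $p \geq 3$, and compares the MSE of the MLE with that of the James-Stein shrinkage estimator. This is a simplified illustration, not the regression formulation discussed here; the dimension, true mean and number of replications are arbitrary.

```python
# Sketch of the Stein phenomenon for estimating a mean vector theta from Z ~ N(theta, I_p).
import numpy as np

rng = np.random.default_rng(1)
p, reps = 10, 20000
theta = np.full(p, 1.0)                       # hypothetical true mean vector
Z = theta + rng.normal(size=(reps, p))        # the MLE of theta in each replication

norm2 = np.sum(Z ** 2, axis=1)
js = (1 - (p - 2) / norm2)[:, None] * Z       # James-Stein shrinkage towards 0

mse_mle = np.mean(np.sum((Z - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(mse_mle, mse_js)                        # the shrunken estimator has smaller MSE
```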

You can find more information on the Stein estimator (the James-Stein estimator, named after Willard James and Charles Stein) in the Wikipedia article. The result was not expected at the time. In the Gaussian case it can be hard to imagine that there is an estimator that, in terms of MSE, performs uniformly better than $\hat{\beta}$. Digging into the result, it turns out to be a consequence of the geometry of $\mathbb{R}^p$ in dimension 3 and above. In terms of MSE the estimator $\hat{\beta}$ is inadmissible.

The consequences that people have drawn from the result have somewhat divided statisticians. Some have seen it as the ultimate example of the nonsensical estimators we can get as optimal, or at least admissible, estimators: if $\hat{\beta}$, the MLE in the Gaussian setup, is not admissible, then there is something wrong with the concept of admissibility. Another reaction is that the Stein estimator illustrates that the MLE (and perhaps unbiasedness) is problematic for small sample sizes, and that to get better performing estimators we should consider biased estimators. However, there are some rather arbitrary choices in the formulation of the Stein estimator, which makes it hard to accept as an estimator we would use in practice.

We will in this course consider a number of biased estimators. The reason, however, is not so much the Stein result. It is much more a consequence of a favorable bias-variance tradeoff when $p$ is large compared to $N$, which can improve prediction accuracy.

Shrinkage Estimators

If $\tilde{\beta} = C^TY$ is some, biased, linear estimator of $\beta$ we define the estimator
\[
\beta_\gamma = \gamma\hat{\beta} + (1 - \gamma)\tilde{\beta}, \qquad \gamma \in [0, 1].
\]
It is biased. The mean squared error is a quadratic function in $\gamma$, and the optimal shrinkage parameter $\gamma(\beta)$ can be found, but it depends upon $\beta$! Using the plug-in principle we get the estimator
\[
\beta_{\gamma(\hat{\beta})} = \gamma(\hat{\beta})\hat{\beta} + (1 - \gamma(\hat{\beta}))\tilde{\beta}.
\]
It could be uniformly better, but the point is that for $\beta$ where $\tilde{\beta}$ is not too biased it can be a substantial improvement over $\hat{\beta}$.

Take home message: Bias is a way to introduce soft model restrictions with a locally, though not globally, favorable bias-variance tradeoff.

There are other methods than the plug-in principle that can be useful. It depends on the concrete biased estimator whether we can find a useful, closed form expression for $\gamma(\beta)$ or whether we will need some additional estimation techniques.

Regression in practice

How does the computer do multiple linear regression? Does it compute the matrix $(X^TX)^{-1}X^T$? NO!

If the columns $x_1, \ldots, x_p$ are orthogonal the solution is
\[
\hat{\beta}_i = \frac{\langle y, x_i \rangle}{\|x_i\|^2}.
\]
Here either $1$ is included as one of the columns and is orthogonal to the other $x_i$'s, or all the variables have first been centered.

We use here the inner product notation $\langle y, x_i \rangle$ as in the book, which in terms of matrix multiplication could just be written $y^Tx_i$. The norm is also given in terms of the inner product, $\|x_i\|^2 = \langle x_i, x_i \rangle = x_i^Tx_i$.
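
The orthogonal-design formula is easy to check numerically. In the sketch below (not from the notes) the design is constructed artificially, by centering a random matrix and taking the Q factor of its QR-decomposition, so that the column of 1's and the remaining columns are all mutually orthogonal; the inner product formula then agrees with the least squares fit.

```python
# Sketch: for mutually orthogonal columns, beta_i = <y, x_i> / ||x_i||^2.
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 4
A = rng.normal(size=(N, p))
A = A - A.mean(axis=0)                        # centring makes the columns orthogonal to 1
Q, _ = np.linalg.qr(A)                        # orthonormal columns, still orthogonal to 1
X = np.column_stack([np.ones(N), Q])          # all p+1 columns are mutually orthogonal
y = X @ np.array([0.5, 1.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.3, size=N)

beta_inner = X.T @ y / np.sum(X ** 2, axis=0)       # <y, x_i> / ||x_i||^2, column by column
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_inner, beta_lstsq))          # True: the two computations agree
```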

Orthogonalization

If $x_1, \ldots, x_p$ are not orthogonal, Gram-Schmidt orthogonalization produces an orthogonal basis $z_1, \ldots, z_p$ spanning the same column space (Algorithm 3.1). Thus
\[
\hat{y} = X\hat{\beta} = Z\tilde{\beta} \quad \text{with} \quad \tilde{\beta}_i = \frac{\langle y, z_i \rangle}{\|z_i\|^2}.
\]
By Gram-Schmidt, $\mathrm{span}\{x_1, \ldots, x_i\} = \mathrm{span}\{z_1, \ldots, z_i\}$, hence $z_p \perp x_1, \ldots, x_{p-1}$ and $x_p = z_p + w$ with $w \perp z_p$. Hence
\[
\|z_p\|^2 \hat{\beta}_p = z_p^T x_p \hat{\beta}_p = z_p^T X \hat{\beta} = z_p^T Z \tilde{\beta} = \|z_p\|^2 \tilde{\beta}_p,
\]
so $\hat{\beta}_p = \tilde{\beta}_p = \langle y, z_p \rangle / \|z_p\|^2$.

It is assumed above that the $p$ column vectors are linearly independent and thus span a space of dimension $p$. Equivalently, the matrix $X$ has rank $p$, which in particular requires that $p \leq N$. If we have centered the variables first, the columns are orthogonal to $1$, and we need $p + 1 \leq N$. Since we can permute the columns into any order prior to this argument, the conclusion is not special to $\hat{\beta}_p$: any of the coefficients can be seen as the residual effect of $x_i$ when we correct for all the other variables.

Figure 3.4 Gram-Schmidt

By Gram-Schmidt the multiple regression coefficient $\hat{\beta}_p$ equals the coefficient for $z_p$. Its variance is
\[
V(\hat{\beta}_p) = \frac{\sigma^2}{\|z_p\|^2},
\]
so if $\|z_p\|^2$ is small the variance is large and the estimate is uncertain. For observational $x_i$'s this problem occurs for highly correlated observables.
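
The fact that $\hat{\beta}_p$ is the coefficient of $y$ on the orthogonalized variable $z_p$ can be checked directly: regress $x_p$ on all the other columns, take the residual as $z_p$, and compare $\langle y, z_p \rangle / \|z_p\|^2$ with the last coefficient of the full multiple regression. A sketch on simulated, deliberately correlated data (not from the notes):

```python
# Sketch: the multiple regression coefficient of the last column equals <y, z_p>/||z_p||^2.
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
X[:, 3] += 0.8 * X[:, 1]                     # make the columns correlated on purpose
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

others, xp = X[:, :-1], X[:, -1]
gamma, *_ = np.linalg.lstsq(others, xp, rcond=None)
z_p = xp - others @ gamma                    # residual of x_p regressed on the other columns
beta_p_gs = z_p @ y / (z_p @ z_p)            # coefficient on the orthogonalized variable

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_p_gs, beta_hat[-1]))  # True: identical to the multiple regression coefficient
```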

The QR-decomposition

The matrix version of Gram-Schmidt is the decomposition $X = Z\Gamma$, where the columns in $Z$ are orthogonal and the matrix $\Gamma$ is upper triangular. With $D = \mathrm{diag}(\|z_1\|, \ldots, \|z_p\|)$ we can write
\[
X = \underbrace{ZD^{-1}}_{Q}\,\underbrace{D\Gamma}_{R} = QR.
\]
This is the QR-decomposition, where $Q$ has orthonormal columns and $R$ is upper triangular.

Using the QR-decomposition for Estimation

If $X = QR$ is the QR-decomposition we get that
\[
\hat{\beta} = (R^TQ^TQR)^{-1}R^TQ^Ty = R^{-1}(R^T)^{-1}R^TQ^Ty = R^{-1}Q^Ty.
\]
Or we can write that $\hat{\beta}$ is the solution of $R\hat{\beta} = Q^Ty$, which is easy to solve as $R$ is upper triangular. We also get $\hat{y} = X\hat{\beta} = QQ^Ty$.
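
The sketch below (simulated data; scipy is assumed available for the triangular solver) follows this route: form the QR-decomposition, solve $R\hat{\beta} = Q^Ty$ by back-substitution, and compute the fitted values as $QQ^Ty$ without ever forming $(X^TX)^{-1}$.

```python
# Sketch: least squares via the QR-decomposition.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

Q, R = np.linalg.qr(X)                       # Q: N x (p+1) with orthonormal columns, R upper triangular
beta_hat = solve_triangular(R, Q.T @ y)      # solves R beta = Q^T y by back-substitution
y_hat = Q @ (Q.T @ y)                        # fitted values, y_hat = Q Q^T y

assert np.allclose(y_hat, X @ beta_hat)
print(beta_hat)
```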

Best Subset

For each $k \in \{0, \ldots, p\}$ there are $\binom{p}{k}$ different models with $k$ predictors, excluding the intercept, and the remaining $p - k$ parameters equal to 0. There are in total
\[
\sum_{k=0}^p \binom{p}{k} = 2^p
\]
different models. For the prostate dataset with $2^8 = 256$ possible models we can go through all models in a split second. With $2^{40}$ = 1,099,511,627,776 models we approach the boundary of what is computationally feasible.

Subset Selection: A Constrained Optimization Problem

Let $L^k_r$ for $r = 1, \ldots, \binom{p}{k}$ denote the $k$-dimensional subspaces of the form
\[
L^k_r = \{\beta \mid p - k \text{ of the coordinates in } \beta \text{ are } 0\},
\]
and let
\[
\hat{\beta}_k = \operatorname{argmin}_{\beta \in \cup_r L^k_r} \mathrm{RSS}(\beta).
\]
The set $\cup_r L^k_r$ is not convex, so local optimality does not imply global optimality. We can essentially only solve this problem by solving all the $\binom{p}{k}$ subproblems, which are convex optimization problems. Conclusion: subset selection scales computationally badly with the dimension $p$. Branch-and-bound algorithms can help a little ...

Figure 3.5 Best Subset Selection

The residual sum of squares $\mathrm{RSS}(\hat{\beta}_k)$ is a monotonically decreasing function of $k$. The selected models are in general not nested. One cannot use $\mathrm{RSS}(\hat{\beta}_k)$ to select the appropriate subset size, only the best model of subset size $k$ for each $k$. Model selection criteria such as AIC and cross-validation can be used; these are major topics later in the course.

Test Based Selection

Fix $L^l_s \subseteq L^k_r$ with $l < k$ and set
\[
\hat{\beta}_{k,r} = \operatorname{argmin}_{\beta \in L^k_r} \mathrm{RSS}(\beta).
\]
Then
\[
F = \frac{(N - k)\,[\mathrm{RSS}(\hat{\beta}_{l,s}) - \mathrm{RSS}(\hat{\beta}_{k,r})]}{(k - l)\,\mathrm{RSS}(\hat{\beta}_{k,r})}
\]
follows under Assumption 2 an F-distribution with $(k - l, N - k)$ degrees of freedom if $\beta \in L^l_s$. $L^k_r$ is preferred over $L^l_s$ if $P(F_{k-l, N-k} > F) \leq 0.05$, say; the deviation from $L^l_s$ is then unlikely to be explained by randomness alone.

Take home message: Test statistics are useful for quantifying whether a simple model is inadequate compared to a complex model, but not for general model searching and selection strategies. Some of the problems with using test based criteria for model selection are that:

- Non-nested models are incomparable.
- We only control the type I error for two a priori specified, nested models. We do not understand the distributions of multiple a priori non-nested test statistics.
- The control of the type I error for a single test depends on the level $\alpha$ chosen.

Using $\alpha = 0.05$, a rejection of the simple model does not in itself provide overwhelming evidence that the simple model is inadequate. However, we typically compute a p-value, which we regard as a quantification of how inadequate the simple model is. In reality, many tests are rejected with p-values that are much smaller than 0.05, in which case we can say that there is clear evidence against the simple model. If the p-value is close to 0.05, we must draw our conclusions with greater caution.

Formal statistical tests belong most naturally in the world of confirmatory statistical analysis, where we want to confirm, or document, a specific hypothesis given upfront, for instance the effect or superiority of a given drug when compared to a (simpler) null hypothesis. The theory of hypothesis testing is not developed for model selection.

Strategies for Approximating Best Subset Solutions

If we can't go through all possible models we need approximations.

Forward stepwise selection. Initiate using only the intercept. In the $k$'th step include the variable among the $p - k$ remaining that improves RSS the most. (A sketch of this strategy is given below.)

Backward stepwise selection. Initiate by fitting the full model, which requires $N \geq p$. In the $k$'th step exclude the variable among the $k$ remaining that increases RSS the least.

Mixed stepwise strategies. For instance, initiate by fitting the full model. In the $r$'th step include or exclude a variable according to a tradeoff criterion between the best improvement and the least increase in RSS.
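
As a concrete version of the forward strategy described above, here is a sketch of forward stepwise selection by greedy RSS improvement. It is written from the description in the notes, not taken from them; the data, the function names and the choice to simply record the order in which all variables enter are made up for the illustration.

```python
# Sketch: forward stepwise selection by greedy reduction of RSS.
import numpy as np

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def forward_stepwise(X, y):
    """Order in which predictors are included, starting from the intercept-only model."""
    N, p = X.shape
    active = []                              # indices of the included predictors
    for _ in range(p):
        remaining = [j for j in range(p) if j not in active]
        # include the remaining variable that improves RSS the most
        best = min(remaining, key=lambda j: rss(
            np.column_stack([np.ones(N)] + [X[:, i] for i in active + [j]]), y))
        active.append(best)
    return active

rng = np.random.default_rng(5)
N, p = 100, 6
X = rng.normal(size=(N, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=N)
print(forward_stepwise(X, y))                # typically starts with variables 0 and 3
```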