Linear Regression

For $(X, Y)$ a pair of random variables with values in $\mathbb{R}^p \times \mathbb{R}$ we assume that
\[
E(Y \mid X) = \beta_0 + \sum_{j=1}^p X_j \beta_j = (1, X^T)\beta
\]
with $\beta \in \mathbb{R}^{p+1}$. This model of the conditional expectation is linear in the parameters. The predictor function for a given $\beta$ is $f_\beta(x) = (1, x^T)\beta$.

A more practical and relaxed attitude towards linear regression is to say that
\[
E(Y \mid X) \approx \beta_0 + \sum_{j=1}^p X_j \beta_j = (1, X^T)\beta,
\]
where the precision of the approximation of the conditional mean by a linear function is swept under the carpet. All mathematical derivations rely on assuming that the conditional mean is exactly linear, but in reality we will almost always regard linear regression as an approximation.

Linearity is, for differentiable functions, a good local approximation, and it may extend reasonably to the convex hull of the $x_i$'s. But we must make an attempt to check that by some sort of model control. Having done so, interpolation (prediction for a new $x$ in the convex hull of the observed $x_i$'s) is usually OK, whereas extrapolation is always dangerous. Extrapolation is strongly model dependent, and we do not have data to justify that the model is adequate for extrapolation.

Least Squares

With $X$ the $N \times (p+1)$ data matrix, including the column of 1's, the predicted values for a given $\beta$ are $X\beta$. The residual sum of squares is
\[
\mathrm{RSS}(\beta) = \sum_{i=1}^N (y_i - (1, x_i^T)\beta)^2 = \|y - X\beta\|^2.
\]
The least squares estimate of $\beta$ is
\[
\hat{\beta} = \operatorname{argmin}_\beta \mathrm{RSS}(\beta).
\]
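
To make the definitions concrete, here is a minimal numpy sketch, not part of the original notes, that simulates data from a linear model and computes the least squares estimate and $\mathrm{RSS}(\hat{\beta})$; the sample size, design and true $\beta$ are arbitrary choices for the illustration.

```python
# Minimal sketch: simulate (X, y), compute the least squares estimate and RSS.
# N, p, beta_true and the noise level are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1), with the column of 1's
beta_true = np.array([1.0, 2.0, 0.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)             # minimizes ||y - X beta||^2
rss = np.sum((y - X @ beta_hat) ** 2)
print(beta_hat, rss)
```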

Figure 3.1 Geometry

The linear regression seeks a $p$-dimensional affine representation, a hyperplane, of the $(p+1)$-dimensional variable $(X, Y)$. The direction of the $Y$-variable plays a distinctive role: the error of the approximating hyperplane is measured parallel to this axis.

The Solution the Calculus Way

Since $\mathrm{RSS}(\beta) = (y - X\beta)^T(y - X\beta)$,
\[
D_\beta \mathrm{RSS}(\beta) = -2(y - X\beta)^T X.
\]
The derivative is a $1 \times (p+1)$ matrix, a row vector. The gradient is $\nabla_\beta \mathrm{RSS}(\beta) = D_\beta \mathrm{RSS}(\beta)^T$. The second derivative is
\[
D_\beta^2 \mathrm{RSS}(\beta) = 2X^TX.
\]
If $X$ has rank $p+1$, $D_\beta^2 \mathrm{RSS}(\beta)$ is (globally) positive definite and there is a unique minimizer found by solving $D_\beta \mathrm{RSS}(\beta) = 0$. The solution is
\[
\hat{\beta} = (X^TX)^{-1}X^Ty.
\]
For the differentiation it may be useful to think of RSS as a composition of the function $a(\beta) = y - X\beta$ from $\mathbb{R}^{p+1}$ to $\mathbb{R}^N$ with derivative $D_\beta a(\beta) = -X$, and the function $b(z) = \|z\|^2 = z^Tz$ from $\mathbb{R}^N$ to $\mathbb{R}$ with derivative $D_z b(z) = 2z^T$. Then, by the chain rule,
\[
D_\beta \mathrm{RSS}(\beta) = D_z b(a(\beta)) D_\beta a(\beta) = -2(y - X\beta)^T X.
\]

An alternative way to solve the optimization problem is by geometry, in the spirit of Figure 3.2. With $V = \{X\beta \mid \beta \in \mathbb{R}^{p+1}\}$ the column space of $X$, the quantity $\mathrm{RSS}(\beta) = \|y - X\beta\|^2$ is minimized whenever $X\beta$ is the orthogonal projection of $y$ onto $V$. The column space projection equals
\[
P = X(X^TX)^{-1}X^T
\]
whenever $X$ has full rank $p+1$. In this case $X\beta = Py$ has the unique solution $\hat{\beta} = (X^TX)^{-1}X^Ty$.

One verifies that $P$ is, in fact, the projection by verifying three characterizing properties:

- $PV = V$
- $P^2 = X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = X(X^TX)^{-1}X^T = P$
- $P^T = (X(X^TX)^{-1}X^T)^T = X(X^TX)^{-1}X^T = P$
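
As a sanity check on the closed form solution and the projection characterization, the following sketch (simulated data again; not from the notes) solves the normal equations and verifies the three properties of $P$ numerically.

```python
# Sketch: normal equations and the projection P = X (X^T X)^{-1} X^T on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)    # solves the normal equations X^T X beta = X^T y
P = X @ np.linalg.solve(XtX, X.T)           # projection onto the column space of X

assert np.allclose(P @ X, X)                # P V = V
assert np.allclose(P @ P, P)                # P^2 = P
assert np.allclose(P, P.T)                  # P^T = P
assert np.allclose(X @ beta_hat, P @ y)     # X beta_hat is the projection of y onto V
```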

If $X$ does not have full rank $p+1$ the projection is still well defined and can be written as
\[
P = X(X^TX)^{-}X^T,
\]
where $(X^TX)^{-}$ denotes a generalized inverse. A generalized inverse of a matrix $A$ is a matrix $A^{-}$ with the property that $AA^{-}A = A$, and using this property one easily verifies the same three conditions for the projection. In this case, however, there is not a unique solution to $X\beta = Py$.

Distributional Results Conditionally on X

Let
\[
\varepsilon_i = Y_i - (1, X_i^T)\beta.
\]
Assumption 1: $\varepsilon_1, \ldots, \varepsilon_N$ are, conditionally on $X_1, \ldots, X_N$, uncorrelated with mean value 0 and the same variance $\sigma^2$. Define
\[
\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^N (Y_i - (1, X_i^T)\hat{\beta})^2 = \frac{1}{N-p-1}\|Y - X\hat{\beta}\|^2 = \frac{\mathrm{RSS}(\hat{\beta})}{N-p-1}.
\]
Then $V(Y \mid X) = \sigma^2 I_N$ and
\[
E(\hat{\beta} \mid X) = (X^TX)^{-1}X^TX\beta = \beta,
\]
\[
V(\hat{\beta} \mid X) = (X^TX)^{-1}X^T \sigma^2 I_N X(X^TX)^{-1} = \sigma^2 (X^TX)^{-1},
\]
\[
E(\hat{\sigma}^2 \mid X) = \sigma^2.
\]
The expectation of $\hat{\sigma}^2$ can be computed by noting that $E(Y_i - (1, X_i^T)\hat{\beta}) = 0$, hence $E(\mathrm{RSS}(\hat{\beta}))$ is the sum of the variances of $Y_i - (1, X_i^T)\hat{\beta}$. Since $Y$ has variance matrix $\sigma^2 I_N$, the variance matrix of $Y - X\hat{\beta} = Y - PY = (I - P)Y$ is
\[
\sigma^2 (I-P)(I-P)^T = \sigma^2 (I-P),
\]
where we have used that $I - P$ is a projection. The sum of the diagonal elements equals the dimension of the subspace that $I - P$ projects onto, which is $N - p - 1$.

Distributional Results Conditionally on X

Assumption 2: $\varepsilon_1, \ldots, \varepsilon_N$ are, conditionally on $X_1, \ldots, X_N$, i.i.d. $N(0, \sigma^2)$. Then
\[
\hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1}), \qquad \frac{(N-p-1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{N-p-1}.
\]
The standardized Z-score satisfies
\[
Z_j = \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}\sqrt{(X^TX)^{-1}_{jj}}} \sim t_{N-p-1},
\]
or, more generally, for any $a \in \mathbb{R}^{p+1}$,
\[
\frac{a^T\hat{\beta} - a^T\beta}{\hat{\sigma}\sqrt{a^T(X^TX)^{-1}a}} \sim t_{N-p-1}.
\]
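
These distributional results translate directly into standard errors and t-tests. Below is a sketch (simulated data; scipy is assumed available) computing $\hat{\sigma}^2$, the standard errors from $\hat{\sigma}^2 (X^TX)^{-1}$, and the Z-scores for the null hypotheses $\beta_j = 0$ with their two-sided p-values.

```python
# Sketch: sigma-hat^2, standard errors and t-tests on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (N - p - 1)                      # unbiased estimate of sigma^2
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))         # sqrt of the diagonal of sigma-hat^2 (X^T X)^{-1}
z = beta_hat / se                                   # Z_j for the hypothesis beta_j = 0
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)  # two-sided test in the t_{N-p-1} distribution
print(np.column_stack([beta_hat, se, z, p_values]))
```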

Minimal Variance, Unbiased Estimators

Does there exist a minimal variance, unbiased estimator of $\beta$? We consider linear estimators only,
\[
\tilde{\beta} = C^TY
\]
for some $N \times (p+1)$ matrix $C$, requiring, for unbiasedness, that $\beta = C^TX\beta$ for all $\beta$. That is, $C^TX = I_{p+1} = X^TC$. Under Assumption 1,
\[
V(\tilde{\beta} \mid X) = \sigma^2 C^TC,
\]
and we have
\[
V(\hat{\beta} - \tilde{\beta} \mid X) = V\big(((X^TX)^{-1}X^T - C^T)Y \mid X\big) = \sigma^2 ((X^TX)^{-1}X^T - C^T)((X^TX)^{-1}X^T - C^T)^T = \sigma^2 (C^TC - (X^TX)^{-1}).
\]
The matrix $C^TC - (X^TX)^{-1}$ is therefore positive semidefinite, i.e. for any $a \in \mathbb{R}^{p+1}$,
\[
a^T(X^TX)^{-1}a \leq a^TC^TCa.
\]

Gauss-Markov's Theorem

Theorem 1. Under Assumption 1 the least squares estimator of $\beta$ has minimal variance among all linear, unbiased estimators of $\beta$.

This means that for any $a \in \mathbb{R}^{p+1}$, $a^T\hat{\beta}$ has minimal variance among all estimators of $a^T\beta$ of the form $a^T\tilde{\beta}$, where $\tilde{\beta}$ is a linear, unbiased estimator. It also means that $V(\tilde{\beta}) - \sigma^2(X^TX)^{-1}$ is positive semidefinite, or, in the partial ordering on positive semidefinite matrices, $\sigma^2(X^TX)^{-1} \preceq V(\tilde{\beta})$.

Why look any further, when we have found the optimal estimator ...?

Biased Estimators

The mean squared error is
\[
\mathrm{MSE}_\beta(\tilde{\beta}) = E_\beta(\|\tilde{\beta} - \beta\|^2).
\]
By Gauss-Markov's Theorem $\hat{\beta}$ is optimal for all $\beta$ among the linear, unbiased estimators. Allowing for biased, possibly still linear, estimators we can achieve improvements of the MSE for some $\beta$, perhaps at the expense of some other $\beta$.

The Stein estimator is a non-linear, biased estimator, which under Assumption 2 has uniformly smaller MSE than $\hat{\beta}$ whenever $p \geq 3$.
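
To get a feeling for the Stein phenomenon, the sketch below simulates the classical mean estimation setting, a single observation $Z \sim N(\theta, I_p)$ with $p \geq 3$, and compares the MSE of the MLE with that of the James-Stein shrinkage estimator. This is a simplified illustration, not the regression formulation discussed here; the dimension, true mean and number of replications are arbitrary.

```python
# Sketch of the Stein phenomenon for estimating a mean vector theta from Z ~ N(theta, I_p).
import numpy as np

rng = np.random.default_rng(1)
p, reps = 10, 20000
theta = np.full(p, 1.0)                       # hypothetical true mean vector
Z = theta + rng.normal(size=(reps, p))        # the MLE of theta in each replication

norm2 = np.sum(Z ** 2, axis=1)
js = (1 - (p - 2) / norm2)[:, None] * Z       # James-Stein shrinkage towards 0

mse_mle = np.mean(np.sum((Z - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(mse_mle, mse_js)                        # the shrunken estimator has smaller MSE
```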

You can find more information on the Stein estimator (the James-Stein estimator, named after Willard James and Charles Stein) in the Wikipedia article. The result was not expected at the time. In the Gaussian case it can be hard to imagine that there is an estimator that, in terms of MSE, performs uniformly better than $\hat{\beta}$. Digging into the result, it turns out to be a consequence of the geometry of $\mathbb{R}^p$ in dimension 3 and above. In terms of MSE the estimator $\hat{\beta}$ is inadmissible.

The consequences that people have drawn from the result have somewhat divided statisticians. Some have seen it as the ultimate example of the nonsensical estimators we can get as optimal, or at least admissible, estimators: if $\hat{\beta}$, the MLE in the Gaussian setup, is not admissible, then there is something wrong with the concept of admissibility. Another reaction is that the Stein estimator illustrates that the MLE (and perhaps unbiasedness) is problematic for small sample sizes, and that to get better performing estimators we should consider biased estimators. However, there are some rather arbitrary choices in the formulation of the Stein estimator, which makes it hard to accept as an estimator we would use in practice.

We will in this course consider a number of biased estimators. The reason, however, is not so much the Stein result. It is much more a consequence of a favorable bias-variance tradeoff when $p$ is large compared to $N$, which can improve prediction accuracy.

Shrinkage Estimators

If $\tilde{\beta} = C^TY$ is some, biased, linear estimator of $\beta$ we define the estimator
\[
\beta_\gamma = \gamma\hat{\beta} + (1 - \gamma)\tilde{\beta}, \qquad \gamma \in [0, 1].
\]
It is biased. The mean squared error is a quadratic function in $\gamma$, and the optimal shrinkage parameter $\gamma(\beta)$ can be found, but it depends upon $\beta$! Using the plug-in principle we get the estimator
\[
\beta_{\gamma(\hat{\beta})} = \gamma(\hat{\beta})\hat{\beta} + (1 - \gamma(\hat{\beta}))\tilde{\beta}.
\]
It could be uniformly better, but the point is that for $\beta$ where $\tilde{\beta}$ is not too biased it can be a substantial improvement over $\hat{\beta}$.

Take home message: Bias is a way to introduce soft model restrictions with a locally, though not globally, favorable bias-variance tradeoff.

There are other methods than the plug-in principle that can be useful. It depends on the concrete biased estimator whether we can find a useful, closed form expression for $\gamma(\beta)$ or whether we will need some additional estimation techniques.

Regression in practice

How does the computer do multiple linear regression? Does it compute the matrix $(X^TX)^{-1}X^T$? NO!

If the columns $x_1, \ldots, x_p$ are orthogonal the solution is
\[
\hat{\beta}_i = \frac{\langle y, x_i \rangle}{\|x_i\|^2}.
\]
Here either $1$ is included as one of the columns and is orthogonal to the other $x_i$'s, or all the variables have first been centered.

We use here the inner product notation $\langle y, x_i \rangle$ as in the book, which in terms of matrix multiplication could just be written $y^Tx_i$. The norm is also given in terms of the inner product, $\|x_i\|^2 = \langle x_i, x_i \rangle = x_i^Tx_i$.
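
The orthogonal-design formula is easy to check numerically. In the sketch below (not from the notes) the design is constructed artificially, by centering a random matrix and taking the Q factor of its QR-decomposition, so that the column of 1's and the remaining columns are all mutually orthogonal; the inner product formula then agrees with the least squares fit.

```python
# Sketch: for mutually orthogonal columns, beta_i = <y, x_i> / ||x_i||^2.
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 4
A = rng.normal(size=(N, p))
A = A - A.mean(axis=0)                        # centring makes the columns orthogonal to 1
Q, _ = np.linalg.qr(A)                        # orthonormal columns, still orthogonal to 1
X = np.column_stack([np.ones(N), Q])          # all p+1 columns are mutually orthogonal
y = X @ np.array([0.5, 1.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.3, size=N)

beta_inner = X.T @ y / np.sum(X ** 2, axis=0)       # <y, x_i> / ||x_i||^2, column by column
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_inner, beta_lstsq))          # True: the two computations agree
```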

Orthogonalization

If $x_1, \ldots, x_p$ are not orthogonal, Gram-Schmidt orthogonalization produces an orthogonal basis $z_1, \ldots, z_p$ spanning the same column space (Algorithm 3.1). Thus
\[
\hat{y} = X\hat{\beta} = Z\tilde{\beta} \quad \text{with} \quad \tilde{\beta}_i = \frac{\langle y, z_i \rangle}{\|z_i\|^2}.
\]
By Gram-Schmidt, $\mathrm{span}\{x_1, \ldots, x_i\} = \mathrm{span}\{z_1, \ldots, z_i\}$, hence $z_p \perp x_1, \ldots, x_{p-1}$ and $x_p = z_p + w$ with $w \perp z_p$. Hence
\[
\|z_p\|^2 \hat{\beta}_p = z_p^T x_p \hat{\beta}_p = z_p^T X \hat{\beta} = z_p^T Z \tilde{\beta} = \|z_p\|^2 \tilde{\beta}_p,
\]
so $\hat{\beta}_p = \tilde{\beta}_p = \langle y, z_p \rangle / \|z_p\|^2$.

It is assumed above that the $p$ column vectors are linearly independent and thus span a space of dimension $p$. Equivalently, the matrix $X$ has rank $p$, which in particular requires that $p \leq N$. If we have centered the variables first, the columns are orthogonal to $1$, and we need $p + 1 \leq N$. Since we can permute the columns into any order prior to this argument, the conclusion is not special to $\hat{\beta}_p$: any of the coefficients can be seen as the residual effect of $x_i$ when we correct for all the other variables.

Figure 3.4 Gram-Schmidt

By Gram-Schmidt the multiple regression coefficient $\hat{\beta}_p$ equals the coefficient for $z_p$. Its variance is
\[
V(\hat{\beta}_p) = \frac{\sigma^2}{\|z_p\|^2},
\]
so if $\|z_p\|^2$ is small the variance is large and the estimate is uncertain. For observational $x_i$'s this problem occurs for highly correlated observables.
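
The fact that $\hat{\beta}_p$ is the coefficient of $y$ on the orthogonalized variable $z_p$ can be checked directly: regress $x_p$ on all the other columns, take the residual as $z_p$, and compare $\langle y, z_p \rangle / \|z_p\|^2$ with the last coefficient of the full multiple regression. A sketch on simulated, deliberately correlated data (not from the notes):

```python
# Sketch: the multiple regression coefficient of the last column equals <y, z_p>/||z_p||^2.
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
X[:, 3] += 0.8 * X[:, 1]                     # make the columns correlated on purpose
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

others, xp = X[:, :-1], X[:, -1]
gamma, *_ = np.linalg.lstsq(others, xp, rcond=None)
z_p = xp - others @ gamma                    # residual of x_p regressed on the other columns
beta_p_gs = z_p @ y / (z_p @ z_p)            # coefficient on the orthogonalized variable

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_p_gs, beta_hat[-1]))  # True: identical to the multiple regression coefficient
```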

The QR-decomposition

The matrix version of Gram-Schmidt is the decomposition $X = Z\Gamma$, where the columns in $Z$ are orthogonal and the matrix $\Gamma$ is upper triangular. With $D = \mathrm{diag}(\|z_1\|, \ldots, \|z_p\|)$ we can write
\[
X = \underbrace{ZD^{-1}}_{Q}\,\underbrace{D\Gamma}_{R} = QR.
\]
This is the QR-decomposition, where $Q$ has orthonormal columns and $R$ is upper triangular.

Using the QR-decomposition for Estimation

If $X = QR$ is the QR-decomposition we get that
\[
\hat{\beta} = (R^TQ^TQR)^{-1}R^TQ^Ty = R^{-1}(R^T)^{-1}R^TQ^Ty = R^{-1}Q^Ty.
\]
Or we can write that $\hat{\beta}$ is the solution of $R\hat{\beta} = Q^Ty$, which is easy to solve as $R$ is upper triangular. We also get $\hat{y} = X\hat{\beta} = QQ^Ty$.
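
The sketch below (simulated data; scipy is assumed available for the triangular solver) follows this route: form the QR-decomposition, solve $R\hat{\beta} = Q^Ty$ by back-substitution, and compute the fitted values as $QQ^Ty$ without ever forming $(X^TX)^{-1}$.

```python
# Sketch: least squares via the QR-decomposition.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=N)

Q, R = np.linalg.qr(X)                       # Q: N x (p+1) with orthonormal columns, R upper triangular
beta_hat = solve_triangular(R, Q.T @ y)      # solves R beta = Q^T y by back-substitution
y_hat = Q @ (Q.T @ y)                        # fitted values, y_hat = Q Q^T y

assert np.allclose(y_hat, X @ beta_hat)
print(beta_hat)
```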

Best Subset

For each $k \in \{0, \ldots, p\}$ there are $\binom{p}{k}$ different models with $k$ predictors, excluding the intercept, and the remaining $p - k$ parameters equal to 0. There are in total
\[
\sum_{k=0}^p \binom{p}{k} = 2^p
\]
different models. For the prostate dataset with $2^8 = 256$ possible models we can go through all models in a split second. With $2^{40}$ = 1,099,511,627,776 models we approach the boundary of what is computationally feasible.

Subset Selection: A Constrained Optimization Problem

Let $L^k_r$ for $r = 1, \ldots, \binom{p}{k}$ denote the $k$-dimensional subspaces of the form
\[
L^k_r = \{\beta \mid p - k \text{ of the coordinates in } \beta \text{ are } 0\},
\]
and let
\[
\hat{\beta}_k = \operatorname{argmin}_{\beta \in \cup_r L^k_r} \mathrm{RSS}(\beta).
\]
The set $\cup_r L^k_r$ is not convex, so local optimality does not imply global optimality. We can essentially only solve this problem by solving all the $\binom{p}{k}$ subproblems, which are convex optimization problems. Conclusion: subset selection scales computationally badly with the dimension $p$. Branch-and-bound algorithms can help a little ...

Figure 3.5 Best Subset Selection

The residual sum of squares $\mathrm{RSS}(\hat{\beta}_k)$ is a monotonically decreasing function of $k$. The selected models are in general not nested. One cannot use $\mathrm{RSS}(\hat{\beta}_k)$ to select the appropriate subset size, only the best model of subset size $k$ for each $k$. Model selection criteria such as AIC and cross-validation can be used; these are major topics later in the course.

Test Based Selection

Fix $L^l_s \subseteq L^k_r$ with $l < k$ and set
\[
\hat{\beta}_{k,r} = \operatorname{argmin}_{\beta \in L^k_r} \mathrm{RSS}(\beta).
\]
Then
\[
F = \frac{(N - k)\,[\mathrm{RSS}(\hat{\beta}_{l,s}) - \mathrm{RSS}(\hat{\beta}_{k,r})]}{(k - l)\,\mathrm{RSS}(\hat{\beta}_{k,r})}
\]
follows under Assumption 2 an F-distribution with $(k - l, N - k)$ degrees of freedom if $\beta \in L^l_s$. $L^k_r$ is preferred over $L^l_s$ if $P(F_{k-l, N-k} > F) \leq 0.05$, say; the deviation from $L^l_s$ is then unlikely to be explained by randomness alone.

Take home message: Test statistics are useful for quantifying whether a simple model is inadequate compared to a complex model, but not for general model searching and selection strategies. Some of the problems with using test based criteria for model selection are that:

- Non-nested models are incomparable.
- We only control the type I error for two a priori specified, nested models. We do not understand the distributions of multiple a priori non-nested test statistics.
- The control of the type I error for a single test depends on the level $\alpha$ chosen.

Using $\alpha = 0.05$, a rejection of the simple model does not in itself provide overwhelming evidence that the simple model is inadequate. However, we typically compute a p-value, which we regard as a quantification of how inadequate the simple model is. In reality, many tests are rejected with p-values that are much smaller than 0.05, in which case we can say that there is clear evidence against the simple model. If the p-value is close to 0.05, we must draw our conclusions with greater caution.

Formal statistical tests belong most naturally in the world of confirmatory statistical analysis, where we want to confirm, or document, a specific hypothesis given upfront, for instance the effect or superiority of a given drug when compared to a (simpler) null hypothesis. The theory of hypothesis testing is not developed for model selection.

Strategies for Approximating Best Subset Solutions

If we can't go through all possible models we need approximations.

Forward stepwise selection. Initiate using only the intercept. In the $k$'th step include the variable among the $p - k$ remaining that improves RSS the most. (A sketch of this strategy is given below.)

Backward stepwise selection. Initiate by fitting the full model, which requires $N \geq p$. In the $k$'th step exclude the variable among the $k$ remaining that increases RSS the least.

Mixed stepwise strategies. For instance, initiate by fitting the full model. In the $r$'th step include or exclude a variable according to a tradeoff criterion between the best improvement and the least increase in RSS.
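
As a concrete version of the forward strategy described above, here is a sketch of forward stepwise selection by greedy RSS improvement. It is written from the description in the notes, not taken from them; the data, the function names and the choice to simply record the order in which all variables enter are made up for the illustration.

```python
# Sketch: forward stepwise selection by greedy reduction of RSS.
import numpy as np

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def forward_stepwise(X, y):
    """Order in which predictors are included, starting from the intercept-only model."""
    N, p = X.shape
    active = []                              # indices of the included predictors
    for _ in range(p):
        remaining = [j for j in range(p) if j not in active]
        # include the remaining variable that improves RSS the most
        best = min(remaining, key=lambda j: rss(
            np.column_stack([np.ones(N)] + [X[:, i] for i in active + [j]]), y))
        active.append(best)
    return active

rng = np.random.default_rng(5)
N, p = 100, 6
X = rng.normal(size=(N, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=N)
print(forward_stepwise(X, y))                # typically starts with variables 0 and 3
```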