Regression: Lecture 2

Niels Richard Hansen

April 26, 2012

Contents

1 Linear regression and least squares estimation
  1.1 Distributional results
2 Non-linear effects and basis expansions
  2.1 Basis expansions and splines in R
3 Missing data
A Derivation of the least squares solution

1 Linear regression and least squares estimation

General setup

Observations y_1, ..., y_n are the responses or outcomes of independent random variables Y_1, ..., Y_n, collected in the vectors

    y = (y_1, ..., y_n)^T,    Y = (Y_1, ..., Y_n)^T.

The distribution of Y_i depends upon explanatory variables, regressors or covariates x_i ∈ R^m and an unknown parameter θ. The n × p design matrix, with p = m + 1, has i'th row (1, x_i^T):

    X = [ 1  x_1^T ]
        [ ⋮    ⋮   ]
        [ 1  x_n^T ]

Least squares regression

If Y_i is real valued with mean μ_i(θ) = μ_i(x_i, θ), the sum of squares

    S = S(θ) = Σ_i (y_i − μ_i(x_i, θ))² = ‖y − μ(θ)‖² = (y − μ(θ))^T (y − μ(θ))
is a measure of fit as a function of θ. Minimization of S(θ) gives the least squares estimator of θ. A generalization is the weighted least squares estimator, obtained by minimizing the weighted sum of squares

    S(θ) = (y − μ(θ))^T V^{−1} (y − μ(θ))

where V is a positive definite matrix. An n × n matrix V is positive definite if y^T V y > 0 for all y ∈ R^n with y ≠ 0. A special type of weight matrix is a diagonal matrix,

    V^{−1} = diag(w_1, w_2, ..., w_n),

in which case

    S(θ) = Σ_i w_i (y_i − μ_i(x_i, θ))².

Linear models

If θ = β ∈ R^p and

    μ_i(β) = β_0 + Σ_{j=1}^m x_{ij} β_j = (1, x_i^T)β,

the model is linear in the parameters. The weighted sum of squares can be expressed using the design matrix:

    S(β) = (y − Xβ)^T V^{−1} (y − Xβ).

The function S(β) is a quadratic function in β and in fact a convex function. Linear modeling is very popular and useful, and it is possible to derive many theoretical results about the resulting estimator, the linear (weighted) least squares estimator. All these mathematical derivations rely on assuming that the mean is exactly linear in β. In practice one should always regard the linear model as an approximation,

    μ_i(β) ≈ β_0 + Σ_{j=1}^m x_{ij} β_j,

where the precision of the approximation is swept under the carpet. For differentiable functions, linearity is a good local approximation, and it may extend reasonably to the convex hull of the x_i's. However, we should always make a serious attempt to check that by some
sort of model control. Having done so, and having computed the estimate β̂, interpolation, that is, approximating the mean of Y as (1, x^T)β̂ for x ∈ R^m in the convex hull of the observed x_i's, is usually OK, whereas extrapolation is always dangerous. Extrapolation is strongly model dependent, and we do not have data to justify that the model is adequate for extrapolation.

The least squares solution

The least squares solution(s) is characterized as the solution to the linear equation

    X^T V^{−1} X β = X^T V^{−1} y.

If X has full rank p there is a unique solution; otherwise the solution is not unique. In practice, the matrix inverse of X^T V^{−1} X is not needed to solve the equation. Implementations will typically solve the equation directly, perhaps using a QR decomposition of X or a Cholesky decomposition of X^T V^{−1} X.

1.1 Distributional results

The errors are

    ε_i = Y_i − (1, x_i^T)β.

Assumption 1: ε_1, ..., ε_n are (conditionally on x_1, ..., x_n) uncorrelated with mean value 0 and the same variance σ².

Assumption 2: ε_1, ..., ε_n are (conditionally on x_1, ..., x_n) independent and identically distributed N(0, σ²).

Under Assumption 1 the least squares estimator (with V = I) is unbiased,

    E(β̂) = β,    var(β̂) = σ² (X^T X)^{−1},

and

    σ̂² = (1/(n − p)) Σ_{i=1}^n (y_i − (1, x_i^T)β̂)² = S(β̂)/(n − p)

is an unbiased estimator of the variance σ².

Proof: Under Assumption 1, E(Y) = Xβ and var(Y) = σ² I_n, hence

    E(β̂) = (X^T X)^{−1} X^T Xβ = β,
    var(β̂) = (X^T X)^{−1} X^T σ² I_n X (X^T X)^{−1} = σ² (X^T X)^{−1}.

Under Assumption 2, where the errors are assumed independent and normally distributed, we can strengthen the conclusions and have

    β̂ ∼ N(β, σ² (X^T X)^{−1})
and

    (n − p)σ̂²/σ² ∼ χ²_{n−p}.

The standardized Z-score satisfies

    Z_j = (β̂_j − β_j) / (σ̂ √((X^T X)^{−1}_{jj})) ∼ t_{n−p},

or more generally, for any a ∈ R^p,

    (a^T β̂ − a^T β) / (σ̂ √(a^T (X^T X)^{−1} a)) ∼ t_{n−p}.

More distributional results

A 95% confidence interval for a^T β is thus obtained as

    a^T β̂ ± z_{n−p} σ̂ √(a^T (X^T X)^{−1} a)    (1)

where σ̂ √(a^T (X^T X)^{−1} a) is the estimated standard error of a^T β̂ and z_{n−p} is the 97.5% quantile of the t_{n−p}-distribution. The test of a general r-dimensional linear restriction on β (the assumption that β lies in a (p − r)-dimensional linear subspace) can be carried out using the F-test with (r, n − p) degrees of freedom. See EH, Chapter 10, for the details.

Using lm in R

Linear models are fitted in R using the lm function. The model is specified by a formula of the form

    a ~ b + c + d

describing the response a, the explanatory variables b, c and d, and how they enter the model. The formula combined with the data results in a design matrix, a model matrix in R terminology.

2 Non-linear effects and basis expansions

Transformations and non-linear effects

Observed or measured explanatory variables need not enter directly in the linear regression model. Any prespecified transformation is possible. The R language allows for model formulas of the form
    a ~ b + I(b^2) + log(c) + exp(d)

The expansion of the b term as a linear combination of b and b^2 effectively means that we model a quadratic relation between the response variable a and the explanatory variable b. Note the need for the special "as is" operator I in the R model formula above. It protects the squaring from being interpreted in a model formula context (where ^ relates to interactions) so that it takes its usual arithmetic meaning.

It is common not to know a good transformation upfront. Instead, we can attempt a basis expansion of a variable, using several terms in the model formula for a given variable, just as in the quadratic polynomial expansion of b above. Books can be written on the details of the various sets of basis functions that can be used to expand a non-linear effect. We will take a very direct approach here and focus on spline basis expansions.

Splines

Define h_1(x) = 1, h_2(x) = x and, for knots ξ_1, ..., ξ_K,

    h_{l+2}(x) = (x − ξ_l)_+,    l = 1, ..., K,

where t_+ = max{0, t}. Then

    f(x) = Σ_{m=1}^{K+2} β_m h_m(x)

is a piecewise linear, continuous function. One order-M spline basis with knots ξ_1, ..., ξ_K is

    h_1(x) = 1, ..., h_M(x) = x^{M−1},    h_{M+l}(x) = (x − ξ_l)_+^{M−1},  l = 1, ..., K.

Natural cubic splines

Splines of order M are polynomials of degree M − 1 beyond the boundary knots ξ_1 and ξ_K. The natural cubic splines are the splines of order 4 that are linear beyond the two boundary knots. With

    f(x) = β_0 + β_1 x + β_2 x² + β_3 x³ + Σ_{k=1}^K θ_k (x − ξ_k)³_+

the restriction is that β_2 = β_3 = 0, Σ_{k=1}^K θ_k = 0 and Σ_{k=1}^K θ_k ξ_k = 0. The functions

    N_1(x) = 1,    N_2(x) = x,

    N_{2+l}(x) = ((x − ξ_l)³_+ − (x − ξ_K)³_+)/(ξ_K − ξ_l) − ((x − ξ_{K−1})³_+ − (x − ξ_K)³_+)/(ξ_K − ξ_{K−1})

for l = 1, ..., K − 2 form a basis. Obviously β_2 = β_3 = 0 for these functions, and beyond the last knot the second derivative of f is

    f''(x) = Σ_{k=1}^K 6θ_k (x − ξ_k) = 6x Σ_{k=1}^K θ_k − 6 Σ_{k=1}^K θ_k ξ_k,
which is zero for all x if and only if the conditions above are fulfilled. For N_{2+l} we see that

    θ_l = 1/(ξ_K − ξ_l),    θ_{K−1} = −1/(ξ_K − ξ_{K−1}),    θ_K = 1/(ξ_K − ξ_{K−1}) − 1/(ξ_K − ξ_l),

and the conditions are easily verified. By evaluating the functions in the knots, say, it is on the other hand easy to see that the K different functions are linearly independent. Therefore they must span the space of natural cubic splines, which has co-dimension 4 in the set of cubic splines.

Interactions and tensor products

Interactions between continuous variables are difficult, the simplest model being

    a ~ b + c + b:c

which, for a given value of c, is linear in b and vice versa, but jointly a quadratic function. If the effects of both variables are expanded in a basis h_1, ..., h_K, the bivariate functions

    h_i ⊗ h_j(x, y) = h_i(x) h_j(y)

form the tensor product basis for expanding interaction effects.

Statistics with basis expansions

Expanding the effect of a variable using K basis functions h_1, ..., h_K should generally be understood and analyzed as follows:

- The estimated function ĥ = Σ_k β̂_k h_k is interpretable and informative, whereas the individual parameters β̂_k typically are not.
- One should consider combined F-tests of the hypotheses that h is 0, or that h is linear, against the model with a fully expanded h, and not parallel or successive tests of individual parameters.
- Pointwise confidence intervals for ĥ(x) are easily computed by observing that ĥ(x) = a^T β̂ with a_k = h_k(x), cf. (1).

2.1 Basis expansions and splines in R

It is possible to expand the effect explicitly in the model formula, but it is typically not desirable. We would rather have a single term in the model formula that describes our intention to expand a given variable using a function basis. This is possible by using a function in the formula specification that, given a vector of x-values, returns a matrix of basis function evaluations with one column per basis function.
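As a concrete illustration of such a basis-matrix function, here is a minimal stdlib-only Python sketch (an illustration, not the actual implementation of R's poly or ns) that evaluates the truncated power basis h_1(x) = 1, h_2(x) = x, h_{2+l}(x) = (x − ξ_l)_+ for a piecewise linear spline, one column per basis function:

```python
def linear_spline_basis(xs, knots):
    """Truncated power basis for a piecewise linear spline.

    Returns one row per x-value: [h_1(x), ..., h_{K+2}(x)] with
    h_1(x) = 1, h_2(x) = x and h_{2+l}(x) = max(x - knots[l-1], 0).
    """
    return [[1.0, x] + [max(x - k, 0.0) for k in knots] for x in xs]

knots = [1.0, 2.0]
beta = [0.5, 1.0, -2.0, 3.0]   # illustrative coefficients, one per basis function

def f(x):
    """f(x) = sum_m beta_m h_m(x): continuous and piecewise linear."""
    return sum(b * h for b, h in zip(beta, linear_spline_basis([x], knots)[0]))

# The slope is beta[1] = 1 before the first knot and changes by beta[2] = -2
# across it, so it is 1 - 2 = -1 between the two knots.
print((f(0.5) - f(0.0)) / 0.5)   # slope on [0, 1]: 1.0
print((f(1.5) - f(1.0)) / 0.5)   # slope on [1, 2]: -1.0
```

Fitting the coefficients of such an expansion by least squares is then an ordinary linear regression with this matrix as (part of) the design matrix, which is exactly what the formula interface in R automates.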
The poly function is one such example, which can be used to expand the effect in orthogonal polynomials. Another is ns, which produces a B-spline basis for the natural cubic splines. In either case the number of basis functions that one wants to use has to be specified.
For polynomials this is specified in terms of the total degree of the fitted polynomial, and for the splines either in terms of the number of basis functions (the degrees of freedom, and implicitly the number of knots) or in terms of the specific positioning of the knots. The rcs function from Harrell's rms package gives a different set of basis functions for the natural cubic splines. Expansions of interaction effects using tensor product bases can be done by including a term in the formula like rcs(a):rcs(b). The very convenient consequence of using functions like poly, ns and rcs is that we can then use anova to test the combined effect of the expanded term, and we can easily obtain term estimates and confidence bands using the predict function.

3 Missing data

In this discussion of missing data we formulate the concepts only in terms of the explanatory variables x_i. There is no real loss of generality, as we can simply regard the distinguished response in these circumstances as just another explanatory variable. Moreover, we consider each observation individually and will drop the subscript index.

Missing completely at random (MCAR)

In words: The reason that x_j is missing is unrelated to the actual value of the entire vector x.

Mechanism: For the j'th variable, a (loaded) coin flip with probability p_j of missingness determines if the variable is missing.

Mathematical: If Z_j is an indicator variable of whether x_j is missing, then Z_j ⊥⊥ (X_1, ..., X_m).

Deleting observations with missing variables under MCAR will not bias the result, but it can lead to inefficient usage of the available data.

Missing at random (MAR)

In words: The reason that x_j is missing is unrelated to the actual values of the missing variables in x.

Mechanism: For each index set z (a subset of {1, ..., m}), the probability that z is the set of indices of the observed variables is p(z | x_z). Here x_z denotes the observed part of x corresponding to z.
Mathematical: If Z is the random index set of the observed variables in x, then

    (Z = z) ⊥⊥ X | X_z

for all z.
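To see the practical difference between MCAR and MAR, the following stdlib-only Python sketch (a made-up simulation, not from the lecture) generates correlated pairs (x1, x2) and deletes x2 under the two mechanisms. Complete-case analysis recovers the mean of x2 under MCAR but is biased under MAR, where the deletion probability depends on the observed x1:

```python
import random

random.seed(1)

# x2 = x1 + noise, so the true mean of x2 is 0.
n = 100_000
data = [(x1, x1 + random.gauss(0.0, 1.0))
        for x1 in (random.gauss(0.0, 1.0) for _ in range(n))]

# MCAR: x2 is deleted by a fair coin flip, independent of all values.
mcar_kept = [x2 for x1, x2 in data if random.random() < 0.5]

# MAR: x2 is deleted with a probability depending only on the observed x1
# (kept with probability 0.9 when x1 < 0, with probability 0.1 otherwise).
mar_kept = [x2 for x1, x2 in data if random.random() < (0.9 if x1 < 0 else 0.1)]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(mcar_kept), 2))   # close to the true mean 0
print(round(mean(mar_kept), 2))    # clearly negative: complete cases are biased
```

The bias in the second mean is exactly why, under MAR, one should model x2 given the observed x1 and impute rather than simply delete incomplete observations.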
MAR allows us to predict, or impute, the missing variables using the conditional distribution of X_{z^c} given X_z = x_z, e.g. continuous variables as

    X̂_{z^c} = E(X_{z^c} | X_z = x_z).

If the X variables are all discrete, the conditional independence can be expressed in terms of point probabilities, and the joint distribution takes the form

    p(z | x_z) q(x_{z^c} | x_z) r(x_z).

For the general case we can take such a factorization of the joint density w.r.t. a suitable product measure as the definition.

Imputation

Assuming MAR we build regression models from the available data and:

- Impute a single best guess from the regression model.
- Impute a randomly sampled value from the regression model, which reflects the variability in the remaining data set.
- Do multiple imputations, each time using the resulting data set with imputed values for the desired regression analysis, and then average the parameter estimates.

General advice: It is better to invent data (impute) than to discard data.

A Derivation of the least squares solution

The least squares solution: the geometric way

With L = {Xβ | β ∈ R^p} the column space of X, the quantity

    S(β) = ‖y − Xβ‖²_V

is minimized whenever Xβ is the orthogonal projection of y onto L in the inner product given by V^{−1}. Here ‖·‖_V is used to denote the corresponding norm induced by this inner product. The column space projection equals

    P = X(X^T V^{−1} X)^{−1} X^T V^{−1}

whenever X has full rank p. In this case Xβ = Py has the unique solution

    β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} y.
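The projection characterization can be checked numerically. The following stdlib-only Python sketch (with V = I and a small made-up data set) solves the normal equations X^T Xβ = X^T y for a simple linear regression by Gaussian elimination, without forming a matrix inverse, and verifies that the residual y − Xβ̂ is orthogonal to the columns of X:

```python
def solve(A, b):
    """Solve the linear system A beta = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]          # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            M[r] = [a - fac * c for a, c in zip(M[r], M[col])]
    beta = [0.0] * n
    for i in reversed(range(n)):                           # back substitution
        beta[i] = (M[i][n] - sum(M[i][j] * beta[j]
                                 for j in range(i + 1, n))) / M[i][i]
    return beta

# Simple linear regression: intercept column plus one covariate.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
X = [[1.0, x] for x in xs]

# Normal equations X^T X beta = X^T y, solved directly.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(2)]
beta = solve(XtX, Xty)

# The residual is orthogonal to the column space: X^T (y - X beta) = 0.
resid = [yi - (beta[0] + beta[1] * x) for x, yi in zip(xs, y)]
print(beta)
print([sum(row[j] * e for row, e in zip(X, resid)) for j in range(2)])  # both ≈ 0
```

Production implementations would use a QR or Cholesky decomposition instead of plain elimination, as noted earlier, but the orthogonality of the residual is the same geometric fact.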
One verifies that P is in fact the orthogonal projection by verifying three characterizing properties:

    P L = L,

    P² = X(X^T V^{−1} X)^{−1} X^T V^{−1} X (X^T V^{−1} X)^{−1} X^T V^{−1} = X(X^T V^{−1} X)^{−1} X^T V^{−1} = P,

    P^T V^{−1} = (X(X^T V^{−1} X)^{−1} X^T V^{−1})^T V^{−1} = V^{−1} X(X^T V^{−1} X)^{−1} X^T V^{−1} = V^{−1} P.

The latter property is self-adjointness w.r.t. the inner product given by V^{−1}. If X does not have full rank p, the projection is still well defined and can be written as

    P = X(X^T V^{−1} X)^− X^T V^{−1}

where (X^T V^{−1} X)^− denotes a generalized inverse. A generalized inverse of a matrix A is a matrix A^− with the property that A A^− A = A, and using this property one easily verifies the same three conditions for the projection. In this case, however, there is not a unique solution to Xβ = Py.

The least squares solution: the calculus way

Since

    S(β) = (y − Xβ)^T V^{−1} (y − Xβ),

we get

    D_β S(β) = −2(y − Xβ)^T V^{−1} X.

The derivative is a 1 × p dimensional matrix, a row vector. The gradient is ∇_β S(β) = D_β S(β)^T. Moreover,

    D²_β S(β) = 2 X^T V^{−1} X.

If X has rank p, D²_β S(β) is (globally) positive definite, and there is a unique minimizer found by solving D_β S(β) = 0. The solution is

    β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} y.

For the differentiation it may be useful to think of S(β) as a composition of the function a(β) = y − Xβ from R^p to R^n, with derivative D_β a(β) = −X, and the function b(z) = ‖z‖²_V = z^T V^{−1} z from R^n to R, with derivative D_z b(z) = 2z^T V^{−1}. By the chain rule,

    D_β S(β) = D_z b(a(β)) D_β a(β) = −2(y − Xβ)^T V^{−1} X.
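As a quick numerical sanity check of the sign in this derivative, the following stdlib-only Python sketch (with V = I and made-up data, an illustration only) compares the analytic derivative D_β S(β) = −2(y − Xβ)^T X with central finite differences of S:

```python
# Made-up small design and response; V = I.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.5, 1.7, 2.1, 3.9]
beta = [0.3, -0.8]                       # an arbitrary evaluation point

def S(b):
    """S(beta) = ||y - X beta||^2."""
    return sum((yi - (row[0] * b[0] + row[1] * b[1])) ** 2
               for row, yi in zip(X, y))

# Analytic derivative: D_beta S(beta) = -2 (y - X beta)^T X, a row vector.
resid = [yi - (row[0] * beta[0] + row[1] * beta[1]) for row, yi in zip(X, y)]
grad = [-2.0 * sum(row[j] * e for row, e in zip(X, resid)) for j in range(2)]

# Central finite differences of S, one coordinate at a time.
h = 1e-6
num = []
for j in range(2):
    bp, bm = beta[:], beta[:]
    bp[j] += h
    bm[j] -= h
    num.append((S(bp) - S(bm)) / (2.0 * h))

print(grad)
print(num)    # agrees with grad to roughly the finite-difference accuracy
```

Dropping the minus sign from D_β a(β) = −X would flip both coordinates of the analytic gradient and the check would fail, which is precisely the error such a check catches.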