Regression: Lecture 2

Niels Richard Hansen
April 26, 2012

Contents

1 Linear regression and least squares estimation
  1.1 Distributional results
2 Non-linear effects and basis expansions
  2.1 Basis expansions and splines in R
3 Missing data
A Derivation of the least squares solution

1 Linear regression and least squares estimation

General setup

The observations y_1, ..., y_n are the responses, or outcomes, of independent random variables Y_1, ..., Y_n. We collect them in the vectors

    y = (y_1, ..., y_n)^T,    Y = (Y_1, ..., Y_n)^T.

The distribution of Y_i depends upon explanatory variables, regressors or covariates x_i ∈ R^m and an unknown parameter θ. The n × p design matrix, with p = m + 1, is

    X = ( 1  x_1^T
          ⋮   ⋮
          1  x_n^T ).

Least squares regression

If Y_i is real valued with mean µ_i(θ) = µ_i(x_i, θ), the sum of squares

    S(θ) = Σ_{i=1}^n (y_i − µ_i(x_i, θ))^2 = ||y − µ(θ)||^2 = (y − µ(θ))^T (y − µ(θ))

is a measure of fit as a function of θ. Minimization of S(θ) gives the least squares estimator of θ.
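
As a minimal illustration of the general setup, the following R sketch minimizes S(θ) numerically for a made-up exponential mean µ_i(θ) = θ_1 exp(θ_2 x_i); the data and the mean function are hypothetical and only serve to show that least squares estimation is nothing but minimization of S.

    ## Hypothetical data with a non-linear mean; x, y and the mean function
    ## mu are made up for illustration only.
    set.seed(1)
    x <- runif(100, 0, 2)
    y <- 3 * exp(0.7 * x) + rnorm(100, sd = 0.5)

    mu <- function(x, theta) theta[1] * exp(theta[2] * x)

    ## The sum of squares S(theta) as a function of the parameter.
    S <- function(theta) sum((y - mu(x, theta))^2)

    ## Least squares estimation = numerical minimization of S.
    theta_hat <- optim(c(1, 1), S)$par
    theta_hat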

A generalization is the weighted least squares estimator obtained by minimizing the weighted sum of squares

    S(θ) = (y − µ(θ))^T V^{-1} (y − µ(θ)),

where V is a positive definite matrix. An n × n matrix V is positive definite if y^T V y > 0 for all y ∈ R^n with y ≠ 0. A special type of weight matrix is a diagonal matrix,

    V^{-1} = diag(w_1, ..., w_n),

in which case S(θ) = Σ_i w_i (y_i − µ_i(x_i, θ))^2.

Linear models

If θ = β ∈ R^p and

    µ_i(β) = β_0 + Σ_{j=1}^m x_{ij} β_j = (1, x_i^T) β,

the model is linear in the parameters. The weighted sum of squares can then be expressed using the design matrix:

    S(β) = (y − Xβ)^T V^{-1} (y − Xβ).

The function S(β) is a quadratic, and in fact convex, function of β. Linear modeling is very popular and useful, and it is possible to derive many theoretical results about the resulting estimator, the linear (weighted) least squares estimator. All these mathematical derivations rely on assuming that the mean is exactly linear in β. In practice one should always regard the linear model as an approximation,

    µ_i(β) ≈ β_0 + Σ_{j=1}^m x_{ij} β_j,

where the precision of the approximation is swept under the carpet. For differentiable mean functions linearity is a good local approximation, and it may extend reasonably to the convex hull of the x_i's. However, we should always make a serious attempt to check this by some sort of model control.
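
A minimal sketch, on made-up data, of the linear weighted least squares estimator with a diagonal weight matrix: lm with a weights argument minimizes Σ_i w_i (y_i − (1, x_i^T)β)^2, and the result can be compared with the solution of the normal equations stated below.

    ## Hypothetical data and weights, for illustration only.
    set.seed(1)
    n <- 100
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n, sd = abs(x) + 0.5)
    w <- 1 / (abs(x) + 0.5)^2          # diagonal of V^{-1}

    ## Weighted least squares via lm.
    fit <- lm(y ~ x, weights = w)

    ## The same estimate from the normal equations
    ## (X^T V^{-1} X) beta = X^T V^{-1} y with V^{-1} = diag(w).
    X <- cbind(1, x)
    beta_hat <- solve(crossprod(X, w * X), crossprod(X, w * y))

    cbind(coef(fit), beta_hat)   # the two columns agree
    ## Note: lm does not form X^T V^{-1} X; it uses a QR decomposition internally.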

Having done so, and having computed the estimate β̂, interpolation, that is, approximation of the mean of Y by (1, x^T)β̂ for an x ∈ R^m in the convex hull of the observed x_i's, is usually OK, whereas extrapolation is always dangerous. Extrapolation is strongly model dependent, and we do not have data to justify that the model is adequate for extrapolation.

The least squares solution

The least squares solution(s) is characterized as the solution to the linear equation

    X^T V^{-1} X β = X^T V^{-1} y.

If X has full rank p there is a unique solution, otherwise the solution is not unique. In practice, the matrix inverse of X^T V^{-1} X is not needed to solve the equation. Implementations will typically solve the equation directly, perhaps using a QR decomposition of X or a Cholesky decomposition of X^T V^{-1} X.

1.1 Distributional results

The errors are

    ε_i = Y_i − (1, x_i^T)β.

Assumption 1: ε_1, ..., ε_n are (conditionally on x_1, ..., x_n) uncorrelated with mean value 0 and the same variance σ^2.

Assumption 2: ε_1, ..., ε_n are (conditionally on x_1, ..., x_n) independent and identically distributed N(0, σ^2).

Under Assumption 1 the least squares estimator (here with V = I_n, the ordinary least squares estimator) is unbiased,

    E(β̂) = β,    var(β̂) = σ^2 (X^T X)^{-1},

and

    σ̂^2 = 1/(n − p) Σ_{i=1}^n (y_i − (1, x_i^T)β̂)^2 = S(β̂)/(n − p)

is an unbiased estimator of the variance σ^2.

Proof: Under Assumption 1, E(Y) = Xβ and var(Y) = σ^2 I_n, hence

    E(β̂) = (X^T X)^{-1} X^T X β = β,
    var(β̂) = (X^T X)^{-1} X^T σ^2 I_n X (X^T X)^{-1} = σ^2 (X^T X)^{-1}.

Under Assumption 2, where the errors are assumed independent and normally distributed, we can strengthen the conclusions and have

    β̂ ∼ N(β, σ^2 (X^T X)^{-1})    and    (n − p) σ̂^2 / σ^2 ∼ χ^2_{n−p}.
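
These quantities are what lm reports. A small sketch on simulated data, assuming a reasonably recent R where stats::sigma is available, checking that sigma(fit)^2 equals S(β̂)/(n − p) and that vcov(fit) equals σ̂^2 (X^T X)^{-1}.

    ## Simulated data, for illustration only.
    set.seed(2)
    n <- 50
    x <- rnorm(n)
    y <- 1 - 0.5 * x + rnorm(n, sd = 0.3)

    fit <- lm(y ~ x)
    X   <- model.matrix(fit)
    p   <- ncol(X)

    ## sigma(fit)^2 equals S(beta_hat) / (n - p).
    c(sigma(fit)^2, sum(residuals(fit)^2) / (n - p))

    ## vcov(fit) equals sigma_hat^2 * (X^T X)^{-1}.
    vcov(fit)
    sigma(fit)^2 * solve(crossprod(X))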

The standardized Z-score satisfies

    Z_j = (β̂_j − β_j) / (σ̂ √((X^T X)^{-1}_{jj})) ∼ t_{n−p},

or, more generally, for any a ∈ R^p,

    (a^T β̂ − a^T β) / (σ̂ √(a^T (X^T X)^{-1} a)) ∼ t_{n−p}.

More distributional results

A 95% confidence interval for a^T β is thus obtained as

    a^T β̂ ± z_{n−p} σ̂ √(a^T (X^T X)^{-1} a),                    (1)

where σ̂ √(a^T (X^T X)^{-1} a) is the estimated standard error of a^T β̂ and z_{n−p} is the 97.5% quantile of the t_{n−p}-distribution.

The test of a general r-dimensional linear restriction on β (the hypothesis that β lies in a (p − r)-dimensional linear subspace) can be carried out using the F-test with (r, n − p) degrees of freedom. See EH, Chapter 10, for the details.

Using lm in R

Linear models are fitted in R using the lm function. The model is specified by a formula of the form

    a ~ b + c + d

describing the response a, the explanatory variables b, c and d, and how they enter the model. The formula combined with the data results in a design matrix, a model matrix in R terminology.
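
A sketch of the confidence interval (1) computed by hand from an lm fit; the data are made up, and a = (1, 1)^T is an arbitrary choice picking out β_0 + β_1 as an example of a linear combination a^T β. For the individual coordinates, confint gives intervals of the same form.

    ## Made-up data, for illustration only.
    set.seed(3)
    d <- data.frame(x = rnorm(40))
    d$y <- 2 + 0.8 * d$x + rnorm(40)

    fit <- lm(y ~ x, data = d)

    ## Confidence interval (1) for a^T beta with a = (1, 1), i.e. beta_0 + beta_1.
    ## Note that vcov(fit) is sigma_hat^2 (X^T X)^{-1}.
    a   <- c(1, 1)
    est <- sum(a * coef(fit))
    se  <- sqrt(drop(t(a) %*% vcov(fit) %*% a))
    z   <- qt(0.975, df = fit$df.residual)
    c(lower = est - z * se, upper = est + z * se)

    ## The coordinate-wise intervals of the same form:
    confint(fit)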

2 Non-linear effects and basis expansions

Transformations and non-linear effects

Observed or measured explanatory variables need not enter directly in the linear regression model. Any prespecified transformation is possible. The R language allows for model formulas of the form

    a ~ b + I(b^2) + log(c) + exp(d)

The expansion of the b term as a linear combination of b and b^2 effectively means that we model a quadratic relation between the response variable a and the explanatory variable b. Note the need for the special "as is" operator I in the R model formula above. It protects the squaring from being interpreted in a model formula context (where ^ relates to interactions) so that it takes its usual arithmetic meaning.

It is common not to know a good transformation upfront. Instead, we can attempt a basis expansion of a variable, using several terms in the model formula for a given variable just as in the quadratic polynomial expansion of b above. Books can be written on the details of the various sets of basis functions that can be used to expand a non-linear effect. We will take a very direct approach here and focus on spline basis expansions.

Splines

Define h_1(x) = 1, h_2(x) = x and, for knots ξ_1, ..., ξ_K,

    h_{m+2}(x) = (x − ξ_m)_+,    m = 1, ..., K,

where t_+ = max{0, t}. Then

    f(x) = Σ_{m=1}^{K+2} β_m h_m(x)

is a piecewise linear, continuous function. More generally, one order-M spline basis with knots ξ_1, ..., ξ_K is

    h_1(x) = 1, ..., h_M(x) = x^{M−1},    h_{M+l}(x) = (x − ξ_l)^{M−1}_+,  l = 1, ..., K.
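
A small sketch, on simulated data with arbitrarily chosen knots, of the piecewise linear truncated power basis fitted with lm; the basis columns (x − ξ_m)_+ are built by hand.

    ## Simulated data with a non-linear mean, for illustration only.
    set.seed(4)
    x  <- sort(runif(200, 0, 10))
    y  <- sin(x) + rnorm(200, sd = 0.3)
    xi <- c(2.5, 5, 7.5)                  # arbitrary knots

    ## Truncated power basis for a continuous, piecewise linear function:
    ## columns are (x - xi_1)_+, ..., (x - xi_K)_+; lm adds 1 and x itself.
    H <- sapply(xi, function(k) pmax(x - k, 0))

    fit <- lm(y ~ x + H)
    plot(x, y, col = "grey")
    lines(x, fitted(fit), lwd = 2)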

Natural Cubic Splines

Splines of order M are polynomials of degree M − 1 beyond the boundary knots ξ_1 and ξ_K. The natural cubic splines are the splines of order 4 that are linear beyond the two boundary knots. Writing a cubic spline as

    f(x) = β_0 + β_1 x + β_2 x^2 + β_3 x^3 + Σ_{k=1}^K θ_k (x − ξ_k)^3_+,

the restriction is that β_2 = β_3 = 0, Σ_{k=1}^K θ_k = 0 and Σ_{k=1}^K θ_k ξ_k = 0, and the functions

    N_1(x) = 1,    N_2(x) = x,

    N_{2+l}(x) = ((x − ξ_l)^3_+ − (x − ξ_K)^3_+) / (ξ_K − ξ_l)
               − ((x − ξ_{K−1})^3_+ − (x − ξ_K)^3_+) / (ξ_K − ξ_{K−1}),

for l = 1, ..., K − 2, form a basis.

To see this, note that for the N functions obviously β_2 = β_3 = 0, and that beyond the last knot the second derivative of a cubic spline with β_2 = β_3 = 0 is

    f''(x) = Σ_{k=1}^K 6 θ_k (x − ξ_k) = 6x Σ_{k=1}^K θ_k − 6 Σ_{k=1}^K θ_k ξ_k,

which is zero for all x if and only if the conditions above are fulfilled. For N_{2+l} we see that

    θ_l = 1/(ξ_K − ξ_l),    θ_{K−1} = −1/(ξ_K − ξ_{K−1}),    θ_K = 1/(ξ_K − ξ_{K−1}) − 1/(ξ_K − ξ_l),

and the conditions are easily verified. By evaluating the functions in the knots, say, it is on the other hand easy to see that the K functions N_1, ..., N_K are linearly independent. Therefore they must span the space of natural cubic splines, which has co-dimension 4 in the (K + 4)-dimensional space of cubic splines with knots ξ_1, ..., ξ_K.
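
As a sanity check of the construction, the following sketch builds N_1, ..., N_K directly from the formula above, on simulated data with arbitrarily chosen knots, and compares the resulting lm fit with one based on the B-spline basis from splines::ns. The two bases differ, so the individual coefficients differ, but they span the same space of natural cubic splines, so the fitted values should agree up to numerical error.

    library(splines)

    ## Simulated data and arbitrarily chosen knots, for illustration only.
    set.seed(5)
    x  <- runif(300, 0, 10)
    y  <- sin(x) + rnorm(300, sd = 0.3)
    xi <- c(1, 3, 5, 7, 9)               # xi_1, ..., xi_K with K = 5
    K  <- length(xi)

    ## N_{2+l} from the truncated power representation above.
    d <- function(x, k) (pmax(x - xi[k], 0)^3 - pmax(x - xi[K], 0)^3) / (xi[K] - xi[k])
    N <- sapply(1:(K - 2), function(l) d(x, l) - d(x, K - 1))

    fit1 <- lm(y ~ x + N)                                        # hand-built basis
    fit2 <- lm(y ~ ns(x, knots = xi[2:(K - 1)],
                      Boundary.knots = xi[c(1, K)]))             # B-spline basis via ns

    all.equal(fitted(fit1), fitted(fit2))   # same column space, same fit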

Interactions and tensor products

Interactions between continuous variables are difficult, the simplest model being

    a ~ b + c + b:c

which, for a given value of c, is linear in b and vice versa, but jointly is a quadratic function. If the effects of both variables are expanded in a basis h_1, ..., h_K, the bivariate functions

    (h_i ⊗ h_j)(x, y) = h_i(x) h_j(y)

form the tensor product basis for expanding interaction effects.

Statistics with basis expansions

Expanding the effect of a variable using K basis functions h_1, ..., h_K should generally be understood and analyzed as follows:

- The estimated function ĥ = Σ_k β̂_k h_k is interpretable and informative, whereas the individual parameters β̂_k typically are not.
- One should consider combined F-tests of the hypotheses that h is 0, or that h is linear, against the model with a fully expanded h, and not parallel or successive tests of individual parameters.
- Pointwise confidence intervals for ĥ(x) are easily computed by observing that ĥ(x) = a^T β̂ with a_k = h_k(x), cf. (1).

2.1 Basis expansions and splines in R

It is possible to expand the effect explicitly in the model formula, but it is typically not desirable. We would rather have a single term in the model formula that describes our intention to expand a given variable using a function basis. This is possible by using a function in the formula specification that, given a vector of x-values, returns a matrix of basis function evaluations with one column per basis function. The poly function is one such example, which expands the effect in orthogonal polynomials. Another is ns, which produces a B-spline basis for the natural cubic splines. In either case the amount of flexibility has to be specified: for polynomials in terms of the total degree of the fitted polynomial, and for the splines either in terms of the number of basis functions (the degrees of freedom, and implicitly the number of knots) or by the specific positioning of the knots.

The rcs function from Harrell's rms package gives a different set of basis functions for the natural cubic splines. Expansions of interaction effects using tensor product bases can be done by including a term in the formula like rcs(a):rcs(b). The very convenient consequence of using functions like poly, ns and rcs is that we can then use anova to test the combined effect of the expanded term, and we can easily obtain term estimates and confidence bands using the predict function; see the sketch below.
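
A sketch of this workflow on simulated data, using ns from the splines package: a natural cubic spline expansion specified as a single term in the formula, a combined F-test of non-linearity with anova, and pointwise confidence bands from predict.

    library(splines)

    ## Simulated data, for illustration only.
    set.seed(6)
    d <- data.frame(x = runif(200, 0, 10))
    d$y <- sin(d$x) + rnorm(200, sd = 0.3)

    fit0 <- lm(y ~ x, data = d)             # linear effect of x
    fit1 <- lm(y ~ ns(x, df = 5), data = d) # natural cubic spline expansion

    ## Combined F-test of the expanded term against the linear model.
    anova(fit0, fit1)

    ## Pointwise 95% confidence bands for the estimated function.
    newd <- data.frame(x = seq(0, 10, length.out = 100))
    pred <- predict(fit1, newdata = newd, interval = "confidence")
    matplot(newd$x, pred, type = "l", lty = c(1, 2, 2), col = 1,
            xlab = "x", ylab = "fitted")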

3 Missing data

In this discussion of missing data we formulate the concepts only in terms of the explanatory variables x_i. There is no real loss of generality, as we can simply regard the distinguished response as just another explanatory variable in these considerations. Moreover, we consider each observation individually and drop the subscript index.

Missing completely at random (MCAR)

In words: The reason that x_j is missing is unrelated to the actual value of the entire vector x.

Mechanism: For the j'th variable a (loaded) coin flip with probability p_j of missingness determines whether the variable is missing.

Mathematically: If Z_j is an indicator variable of whether x_j is missing, then Z_j ⊥⊥ (X_1, ..., X_m).

Deleting observations with missing variables under MCAR will not bias the result, but it can lead to inefficient usage of the available data.

Missing at random (MAR)

In words: The reason that x_j is missing is unrelated to the actual values of the missing variables in x.

Mechanism: For every index set z (a subset of {1, ..., m}), the probability that z is the set of indices of the observed variables is p(z | x_z), where x_z denotes the observed part of x corresponding to z.

Mathematically: If Z is the random index set of the observed variables in x, then

    (Z = z) ⊥⊥ X | X_z    for all z.

MAR allows us to predict, or impute, the missing variables using the conditional distribution of X_{z^c} given X_z = x_z, e.g. continuous variables as

    X̂_{z^c} = E(X_{z^c} | X_z = x_z).

If the X variables are all discrete, the conditional independence can be expressed in terms of point probabilities, and the joint distribution takes the form

    p(z | x_z) q(x_{z^c} | x_z) r(x_z).

For the general case we can take such a factorization of the joint density, w.r.t. a suitable product measure, as the definition.

Imputation

Assuming MAR we build regression models from the available data and either:

- impute a single best guess from the regression model;
- impute a randomly sampled value from the regression model, which reflects the variability in the remaining data set; or
- do multiple imputations, each time using the resulting data set, with imputed values, for the desired regression analysis, and then average the parameter estimates.

General advice: It is better to invent data (impute) than to discard data. A sketch of single regression imputation is given below.
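
A bare-bones sketch of single imputation by regression, on simulated data where the covariate x2 is missing at random given the fully observed x1. The imputation model conditions on all observed variables, including the response, cf. the remark above that the response is just another variable in this context. In practice one would prefer multiple imputation, but the mechanics are the same.

    ## Simulated data where x2 is missing at random (missingness depends
    ## only on the observed x1); for illustration only.
    set.seed(7)
    n  <- 200
    x1 <- rnorm(n)
    x2 <- 0.5 * x1 + rnorm(n)
    y  <- 1 + x1 + x2 + rnorm(n)
    x2[runif(n) < plogis(x1)] <- NA       # MAR: missingness depends on x1

    d <- data.frame(y = y, x1 = x1, x2 = x2)

    ## Impute a single best guess for x2 from a regression on the observed data.
    imp <- lm(x2 ~ x1 + y, data = d)      # fitted on the complete cases
    d$x2_imp <- ifelse(is.na(d$x2), predict(imp, newdata = d), d$x2)

    ## Main regression on the imputed data set vs. complete-case analysis.
    coef(lm(y ~ x1 + x2_imp, data = d))
    coef(lm(y ~ x1 + x2, data = d))       # complete cases only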

A Derivation of the least squares solution

The least squares solution: the geometric way

With L = {Xβ | β ∈ R^p} the column space of X, the quantity

    S(β) = ||y − Xβ||^2_V

is minimized whenever Xβ is the orthogonal projection of y onto L in the inner product given by V^{-1}. Here ||·||_V denotes the norm induced by this inner product. Whenever X has full rank p, the projection onto the column space equals

    P = X (X^T V^{-1} X)^{-1} X^T V^{-1}.

In this case Xβ = Py has the unique solution

    β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y.

One verifies that P is in fact the orthogonal projection by verifying three characterizing properties:

    PL = L,

    P^2 = X (X^T V^{-1} X)^{-1} X^T V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}
        = X (X^T V^{-1} X)^{-1} X^T V^{-1} = P,

    P^T V^{-1} = (X (X^T V^{-1} X)^{-1} X^T V^{-1})^T V^{-1}
               = V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1} = V^{-1} P.

The latter property is self-adjointness w.r.t. the inner product given by V^{-1}. If X does not have full rank p, the projection is still well defined and can be written as

    P = X (X^T V^{-1} X)^− X^T V^{-1},

where (X^T V^{-1} X)^− denotes a generalized inverse. A generalized inverse of a matrix A is a matrix A^− with the property that A A^− A = A, and using this property one easily verifies the same three conditions for the projection. In this case, however, there is not a unique solution to Xβ = Py.

The least squares solution: the calculus way

Since S(β) = (y − Xβ)^T V^{-1} (y − Xβ),

    D_β S(β) = −2 (y − Xβ)^T V^{-1} X.

The derivative is a 1 × p matrix, a row vector; the gradient is ∇_β S(β) = D_β S(β)^T. Moreover,

    D^2_β S(β) = 2 X^T V^{-1} X.

If X has rank p, then D^2_β S(β) is (globally) positive definite, and there is a unique minimizer found by solving D_β S(β) = 0. The solution is

    β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y.

For the differentiation it may be useful to think of S(β) as the composition of the function a(β) = y − Xβ from R^p to R^n, with derivative D_β a(β) = −X, and the function b(z) = ||z||^2_V = z^T V^{-1} z from R^n to R, with derivative D_z b(z) = 2 z^T V^{-1}. By the chain rule,

    D_β S(β) = D_z b(a(β)) D_β a(β) = −2 (y − Xβ)^T V^{-1} X.
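
A closing numerical sketch, with a made-up design matrix, response and diagonal V^{-1}, checking the characterizing properties of P and that Py equals Xβ̂.

    ## Made-up design matrix, response and diagonal weight matrix, for illustration.
    set.seed(8)
    n <- 20
    X <- cbind(1, rnorm(n), rnorm(n))
    y <- rnorm(n)
    Vinv <- diag(runif(n, 0.5, 2))            # V^{-1}

    A <- solve(t(X) %*% Vinv %*% X)
    P <- X %*% A %*% t(X) %*% Vinv            # projection onto the column space of X

    all.equal(P %*% X, X)                     # P acts as the identity on L = span(X)
    all.equal(P %*% P, P)                     # idempotent
    all.equal(t(P) %*% Vinv, Vinv %*% P)      # self-adjoint w.r.t. V^{-1}

    beta_hat <- A %*% t(X) %*% Vinv %*% y
    all.equal(drop(P %*% y), drop(X %*% beta_hat))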