Linear models and their mathematical foundations: Simple linear regression
Steffen Unkel
Department of Medical Statistics, University Medical Center Göttingen, Germany
Winter term 2018/19
Introduction
The simple linear regression model can be written as
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n, \]
where $\beta_0$ and $\beta_1$ are unknown parameters. The designation "simple" indicates that there is only one $x$ to predict the response. $Y_i$ and $\varepsilon_i$ are random variables and the values of $x_i$ are known constants (the case in which the $x_i$ are random variables is treated later).
Assumptions
To complete the model, we make the following assumptions:
A1: $E(\varepsilon_i) = 0$, $i = 1, \ldots, n$.
A2: $\mathrm{Var}(\varepsilon_i) = \sigma^2$, $i = 1, \ldots, n$.
A3: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.
Occasionally, we will make use of the following additional assumption:
A4: $\varepsilon_i \sim N(0, \sigma^2)$, $i = 1, \ldots, n$.
Any of these assumptions may fail to hold with real data.
Methods of estimation
Given a random sample of $n$ observations $y_1, \ldots, y_n$ and fixed values $x_1, \ldots, x_n$, one can estimate the parameters $\beta_0$, $\beta_1$, and the error variance $\sigma^2$.
To obtain the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, we may use the method of least squares, which does not require any of the assumptions A1–A4, or maximum likelihood estimation using assumptions A1–A4.
Estimation of $\sigma^2$:
- Least squares does not yield an estimator of $\sigma^2$.
- Maximum likelihood estimation.
Least squares estimation
The least squares approach seeks estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ which minimize the sum of squares of the residuals $y_i - \hat{y}_i$ of the $n$ observed $y_i$'s from their fitted values $\hat{E}(y_i) = \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$:
\[ \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2 \to \min_{\beta_0, \beta_1}. \]
Note that $\hat{\beta}_0 + \hat{\beta}_1 x_i$ estimates $\beta_0 + \beta_1 x_i$ and not $\beta_0 + \beta_1 x_i + \varepsilon_i$.
To find the solution to this optimization problem, we differentiate the objective function with respect to $\beta_0$ and $\beta_1$, set the resulting equations equal to zero, and solve for the unknowns.
Least squares solution
The least squares solution is given by
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^n x_i^2 - n \bar{x}^2} \]
and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.
To verify that the estimators above minimize the objective function of interest, we can examine the second derivatives.
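As a sketch, the closed-form solution above can be computed directly in pure Python; the data values below are hypothetical and only illustrate the formulas:

```python
def ols_fit(x, y):
    """Least squares estimates for simple linear regression:
    beta1_hat = S_xy / S_xx and beta0_hat = ybar - beta1_hat * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                      # sum (x_i - xbar)^2
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # sum (x_i - xbar)(y_i - ybar)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical example data (not from the lecture):
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_fit(x, y)   # b1 is approximately 1.99, b0 approximately 0.05
```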
Properties of least squares estimates
Using assumptions A1–A3, we obtain the following means and variances of $\hat{\beta}_0$ and $\hat{\beta}_1$:
\[ E(\hat{\beta}_0) = \beta_0, \qquad E(\hat{\beta}_1) = \beta_1, \]
\[ \mathrm{Var}(\hat{\beta}_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right], \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}. \]
If, for $n \to \infty$, $\sum_{i=1}^n (x_i - \bar{x})^2 \to \infty$, then $\hat{\beta}_0$ and $\hat{\beta}_1$ are also consistent estimators.
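The unbiasedness of $\hat{\beta}_1$ can be illustrated with a small simulation; the data-generating values (true parameters, design points, number of replications) are arbitrary choices for this sketch:

```python
import random

random.seed(42)
beta0, beta1, sigma = 1.0, 2.0, 1.0        # assumed true parameter values
x = list(range(1, 21))                      # fixed design points
xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)

estimates = []
for _ in range(2000):
    # draw y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ N(0, sigma^2)
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    ybar = sum(y) / len(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    estimates.append(sxy / sxx)             # beta1_hat for this replication

mean_b1 = sum(estimates) / len(estimates)   # should be close to beta1 = 2
```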
Estimation of $\sigma^2$
To estimate $\sigma^2$, recall that $\sigma^2 = E\left( [Y_i - E(Y_i)]^2 \right)$. We estimate $\sigma^2$ by an average from the sample, that is,
\[ s^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2}, \]
where $SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2$ denotes the residual (or error) sum of squares.
It holds that $E(s^2) = \sigma^2$.
Coefficient of determination
The coefficient of determination, $R^2$, is defined as
\[ R^2 = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \]
where $SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ is the regression sum of squares and $SST = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares that can be partitioned as follows: $SST = SSR + SSE$.
$R^2$ is the square of the sample correlation coefficient between $y$ and $x$:
\[ R^2 = \frac{s_{xy}^2}{s_x^2 s_y^2} = \frac{\left[ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \right]^2}{\left[ \sum_{i=1}^n (x_i - \bar{x})^2 \right] \left[ \sum_{i=1}^n (y_i - \bar{y})^2 \right]}. \]
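The sums of squares and their partition can be checked numerically; this is a self-contained sketch with hypothetical example data:

```python
x = [1, 2, 3, 4, 5]                      # hypothetical example data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx                           # least squares slope
b0 = ybar - b1 * xbar                    # least squares intercept
yhat = [b0 + b1 * xi for xi in x]        # fitted values

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares

s2 = sse / (n - 2)                       # unbiased estimate of sigma^2
r2 = 1 - sse / sst                       # coefficient of determination
# The partition SST = SSR + SSE holds up to floating-point error.
```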
ANOVA table for simple linear regression

Source of variation | d.f.  | Sum of squares | Mean square              | F statistic
Regression          | 1     | SSR            | MSR = SSR/1              | F = MSR/MSE
Residual            | n - 2 | SSE            | MSE = SSE/(n - 2) = s^2  |
Total               | n - 1 | SST            |                          |
Confidence intervals for $\beta_0$ and $\beta_1$
Assuming A4, $\varepsilon_i \sim N(0, \sigma^2)$, it holds for $j = 0, 1$:
\[ \hat{\beta}_j \sim N(\beta_j, \sigma^2_{\hat{\beta}_j}), \qquad \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}_{\hat{\beta}_j}} \sim t(n-2), \]
where $\hat{\sigma}_{\hat{\beta}_j} = \widehat{\mathrm{Var}}(\hat{\beta}_j)^{1/2}$ is the standard error of $\hat{\beta}_j$.
$(1 - \alpha) \cdot 100\%$ confidence intervals for $\beta_0$ and $\beta_1$:
\[ \left[ \hat{\beta}_j \pm \hat{\sigma}_{\hat{\beta}_j} \, t_{1-\alpha/2}(n-2) \right], \quad j = 0, 1. \]
For sufficiently large $n$: replace quantiles of the $t(n-2)$ distribution by quantiles of the $N(0, 1)$ distribution.
Hypothesis tests for $\beta_0$ and $\beta_1$
Example: $H_0\colon \beta_1 = 0$ versus $H_1\colon \beta_1 \neq 0$.
Test statistic:
\[ T = \frac{\hat{\beta}_1 - 0}{\hat{\sigma}_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\beta}_1}} \sim t(n-2), \]
where $\hat{\sigma}_{\hat{\beta}_1} = s / \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}$.
Rejection region (at significance level $\alpha$): $|T| > t_{1-\alpha/2; n-2}$, where $t_{1-\alpha/2; n-2}$ is the $1-\alpha/2$ quantile of a $t$-distribution with $n-2$ degrees of freedom (d.f.).
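The two-sided test of $H_0\colon \beta_1 = 0$ can be sketched as follows; the data are hypothetical, and the critical value $t_{0.975;3} = 3.182$ is taken from standard $t$-tables:

```python
import math

x = [1, 2, 3, 4, 5]                      # hypothetical example data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))             # residual standard deviation
se_b1 = s / math.sqrt(sxx)               # standard error of beta1_hat

T = b1 / se_b1                           # test statistic for H0: beta1 = 0

t_crit = 3.182                           # 0.975 quantile of t(n - 2) = t(3), from tables
reject = abs(T) > t_crit                 # True here: strong evidence against H0
```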
Confidence interval for $\sigma^2$
Note that
\[ \frac{(n-2) s^2}{\sigma^2} \sim \chi^2_{n-2}. \]
A $100 \cdot (1-\alpha)\%$ confidence interval for $\sigma^2$ is given by
\[ \left[ \frac{(n-2) s^2}{\chi^2_{1-\alpha/2; n-2}}, \; \frac{(n-2) s^2}{\chi^2_{\alpha/2; n-2}} \right], \]
where $\chi^2_{1-\alpha/2; n-2}$ is the $1-\alpha/2$ quantile of a $\chi^2$-distribution with $n-2$ d.f.
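A minimal sketch of this interval, with hypothetical data and the $\chi^2(3)$ quantiles hard-coded from standard tables:

```python
x = [1, 2, 3, 4, 5]                      # hypothetical example data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)                       # point estimate of sigma^2

# chi-square quantiles with n - 2 = 3 d.f. (alpha = 0.05), from standard tables:
chi2_hi = 9.348                          # 0.975 quantile of chi^2(3)
chi2_lo = 0.216                          # 0.025 quantile of chi^2(3)

lower = (n - 2) * s2 / chi2_hi           # lower bound of the 95% CI for sigma^2
upper = (n - 2) * s2 / chi2_lo           # upper bound of the 95% CI for sigma^2
```

Note that the interval is not symmetric around $s^2$, since the $\chi^2$ distribution is skewed.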
Conditional normal model
In the simple regression model we have discussed, the values of the predictor variable, $x_1, \ldots, x_n$, have been fixed, known constants. The conditional normal model with the assumptions A1–A4 is the most common simple linear regression model:
\[ Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2), \quad i = 1, \ldots, n. \]
Thus the population regression function is $E(Y \mid x) = \beta_0 + \beta_1 x$.
Imposing A4, the uncorrelatedness of $Y_1, \ldots, Y_n$ (under A1–A3) is strengthened to independence. Moreover, the exact form of the joint pdf of $Y_1, \ldots, Y_n$ is now specified.
Bivariate normal model
Sometimes it is more reasonable to assume that the values of the predictor variable are actually observed values of random variables. In the bivariate normal model the observed values $(y_1, x_1), \ldots, (y_n, x_n)$ are realizations of the bivariate random vectors $(Y_1, X_1), \ldots, (Y_n, X_n)$. The random vectors are assumed to be independent and
\[ (Y_i, X_i) \sim N_2(\mu_y, \mu_x, \sigma^2_y, \sigma^2_x, \rho), \quad i = 1, \ldots, n. \]
The joint pdf of $(Y_1, X_1), \ldots, (Y_n, X_n)$ is the product of the bivariate pdfs.
Bivariate normal model (2)
For a bivariate normal model, the conditional distribution of $Y$ given $X = x$ is normal. The model implies that the population regression function, $E(Y \mid x)$, is a linear function of $x$.
Linear regression analysis is almost always carried out using the conditional distribution of $(Y_1, \ldots, Y_n)$ given $X_1 = x_1, \ldots, X_n = x_n$, rather than the unconditional distribution of $(Y_1, X_1), \ldots, (Y_n, X_n)$.
Inference based on point estimators, intervals, or tests is the same for the conditional normal model and the bivariate normal model, at least with respect to the parts that are relevant for the present course.
Regression with errors in variables
A more complicated model with stochastic regressors than the bivariate normal model is the measurement error model, or errors-in-variables (EIV) model. We observe independent pairs $(Y_i, X_i)$, $i = 1, \ldots, n$, according to
\[ Y_i = \beta_0 + \beta_1 \xi_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2_\varepsilon), \]
\[ X_i = \xi_i + \delta_i, \quad \delta_i \sim N(0, \sigma^2_\delta). \]
The variables $\xi_i$ and $\eta_i$ are sometimes called latent variables. If $\delta_i = 0$, then the model becomes simple linear regression.
Functional and structural relationships
There are two different types of relationship that can be specified in the EIV model: one that specifies a functional linear relationship, and one describing a structural linear relationship. The different relationship specifications can lead to different estimators with different properties.
Linear functional relationship model
This is the model as presented on slide 17, where we have random variables $X_i$ and $Y_i$, with $E(X_i) = \xi_i$ and $E(Y_i) = \eta_i$, and we assume the functional relationship
\[ \eta_i = \beta_0 + \beta_1 \xi_i. \]
The $\xi_i$ are fixed, unknown parameters and the $\varepsilon_i$ and $\delta_i$ are independent. The parameters of interest are $\beta_0$ and $\beta_1$, and inference on these parameters is made using the joint distribution of $((Y_1, X_1), \ldots, (Y_n, X_n))$, conditional on $\xi_1, \ldots, \xi_n$.
Linear structural relationship model
Now we assume that $\xi_1, \ldots, \xi_n$ are a random sample from a common population (e.g. $\xi_i \sim N(\xi, \sigma^2_\xi)$). Thus, conditional on $\xi_1, \ldots, \xi_n$, we observe pairs $(Y_i, X_i)$ ($i = 1, \ldots, n$) according to the model as presented on slide 17. As before, the $\varepsilon_i$ and $\delta_i$ are independent, but they are also independent of the $\xi_i$. Inference on $\beta_0$ and $\beta_1$ is made using the joint distribution of $((Y_1, X_1), \ldots, (Y_n, X_n))$, unconditional on $\xi_1, \ldots, \xi_n$.
Orthogonal least squares
Let us try to find the best line through the points $(y_i, x_i)$ ($i = 1, \ldots, n$). If the $x_i$'s are measured without error, it makes sense to consider minimization of vertical distances (ordinary least squares). In an EIV model, one performs orthogonal (total) least squares instead, that is, one finds the line that minimizes the orthogonal distances of the points to the line. Such a distance measure does not favour the $x$ variable but rather treats both variables equitably.
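A minimal sketch of the orthogonal (total) least squares line: when the error variances in $x$ and $y$ are assumed equal, the slope has a closed form in terms of the sums of squares and cross-products. The data below are hypothetical points lying exactly on $y = 2x + 1$, for which the fitted line recovers the true slope and intercept:

```python
import math

x = [1, 2, 3, 4]                         # hypothetical data on the line y = 2x + 1
y = [3, 5, 7, 9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Total least squares slope: minimizes the sum of squared orthogonal
# distances to the line, assuming equal error variances in x and y.
b1 = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
b0 = ybar - b1 * xbar                    # the TLS line passes through (xbar, ybar)
```

For perfectly collinear data the orthogonal and ordinary least squares lines coincide; with noisy $x$, the two fits differ, and the TLS slope is larger in absolute value than the OLS slope.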