Chapter 14. Linear least squares

Size: px

Start display at page:

Download "Chapter 14. Linear least squares"

Poppy Hawkins
5 years ago
Views:

1 Serik Sagitov, Chalmers and GU, March 5, 2018 Chapter 14 Linear least squares 1 Simple linear regression model A linear model for the random response Y = Y (x) to an independent variable X = x For a given set of values (x 1,, x n ) of the independent variable put Y i = β 0 + β 1 x i + ɛ i, i = 1,, n, assuming that the noise vector (ɛ 1,, ɛ n ) has independent N(0,σ 2 ) random components Given the data (y 1,, y n ), the model is characterised by the likelihood function of three parameters β 0, β 1, σ 2 L(β 0, β 1, σ 2 ) = n i=1 1 { exp (y i β 0 β 1 x i ) 2 } = (2π) n/2 σ n e S(β 0,β 1 ) 2πσ 2σ 2 2σ 2, where S(β 0, β 1 ) = n i=1 (y i β 0 β 1 x i ) 2 Observe that n 1 S(β 0, β 1 ) = n 1 n (y i β 0 β 1 x i ) 2 = β β 0 β 1 x 2β 0 ȳ 2β 1 xy + β1x y 2 i=1 Least squares estimates Regression lines: true y = β 0 + β 1 x and fitted y = b 0 + b 1 x We want to find (b 0, b 1 ) such that the observed responses y i are approximated by the predicted responses ŷ i = b 0 + b 1 x i in an optimal way Least squares method: find (b 0, b 1 ) minimising the sum of squares S(b 0, b 1 ) = (y i ŷ i ) 2 From S/ b 0 = 0 and S/ b 1 = 0 we get the so-called Normal Equations: xy xȳ { b0 + b 1 x = ȳ b 0 x + b 1 x 2 = xy implying b 1 = x 2 x 2 b 0 = ȳ b 1 x = rs y The least square regression line y = b 0 + b 1 x takes the form y = ȳ + r sy (x x) sample variances s 2 x = 1 (xi x) 2, s 2 n 1 y = 1 (yi ȳ) 2, n 1 sample covariance y = 1 (xi x)(y n 1 i ȳ), sample correlation coefficient r = sxy s y The least square estimates (b 0, b 1 ) are the maximum likelihood estimates of (β 0, β 1 ) The least square estimates (b 0, b 1 ) are not robust: outliers exert leverage on the fitted line 2 Residuals The estimated regression line predicts the responses to the values of the explanatory variable by ŷ i = ȳ + r s y (x i x) 1

2 The noise components of the observed responses y i is represented by the residuals which are dependent via e i = y i ŷ i = y i ȳ r s y (x i x), e e n = 0, x 1 e x n e n = 0, ŷ 1 e ŷ n e n = 0 Residuals e i have normal distributions with zero mean and k Var(e i ) = σ 2( 1 (x k x i ) 2 ), Cov(ei, e n(n 1)s 2 j ) = σ 2 k (x k x i )(x k x j ) x n(n 1)s 2 x The error sum of squares SSE = (y i ŷ i ) 2 = i e 2 i can be expressed as SSE = i (y i ȳ) 2 2r s y n(xy ȳ x) + r 2 s2 y s 2 x (x i x) 2 = (n 1)s 2 y(1 r 2 ) i This leads to the corrected maximum likelihood estimate of σ 2 : s 2 = SSE n 2 = n 1 n 2 s2 y(1 r 2 ) Using y i ȳ = ŷ i ȳ + e i we obtain a decomposition SST = SSR + SSE, where SST = i (y i ȳ) 2 = (n 1)s 2 y is the total sum of squares, SSR = i (ŷ i ȳ) 2 = (n 1)b 2 1s 2 x is the regression sum of squares Coefficient of determination r 2 = SSR SST = 1 SSE SST Coefficient of determination is the proportion of variation in Y explained by main factor X Thus r 2 has a more transparent meaning than the correlation coefficient r To test the normality assumption use the normal distribution plot for the standardized residuals e i where s i = s 1 are the estimated standard deviations of e i k (x k x i ) 2 n(n 1)s 2 x The expected plot of the standardised residuals versu i is a horizontal blur (linearity), variance does not depend on x (homoscedasticity) Example (flow rate vs stream depth) For this example with n = 10, the scatter plot looks slightly non-linear The residual plot gives a clearer picture having the U-shape After the log-log transformation, the scatter plot is closer to linear and the residual plot has a horizontal profile s i, 2

3 3 Confidence intervals and hypothesis testing The list square estimators (b 0, b 1 ) are unbiased and consistent Due to the normality assumption we have the following exact distributions b 0 N(β 0, σ 2 0), σ0 2 σ2 x = 2 i, n(n 1)s 2 x b 1 N(β 1, σ1), 2 σ1 2 = σ2, (n 1)s 2 x b 0 β 0 s b0 t n 2, s b0 = s x 2 i, n(n 1) b 1 β 1 s s b1 t n 2, s b1 = n 1 Weak dependence between the two estimators: Cov(b 0, b 1 ) = σ2 x (n 1)s 2 x Exact 100(1 α)% CI for β i : b i ± t n 2 ( α 2 ) s b i Hypothesis testing H 0 : β i = β i0 : test statistic T = b i β i0 s bi, exact null distribution T t n 2 Model utility test and zero-intercept test H 0 : β 1 = 0 (no relationship between X and Y ), test statistic T = b 1 /s b1, null distribution T t n 2 H 0 : β 0 = 0, test statistic T = b 0 /s b0, null distribution T t n 2 Intervals for individual observations Given x predict the value y for the random variable Y = β 0 +β 1 x+ɛ Its expected value µ = β 0 +β 1 x has the least square estimate ˆµ = b 0 + b 1 x The standard error of ˆµ is computed as the square root of Var(ˆµ) = σ2 + σ2 ( x x n n 1 ) 2 Exact 100(1 α)% confidence interval for the mean µ: b 0 + b 1 x ± t n 2 ( α 2 ) s 1 n + 1 n 1 ( x x ) 2 Exact 100(1 α)% prediction interval for y: b 0 + b 1 x ± t n 2 ( α 2 ) s n + 1 n 1 ( x x ) 2 Prediction interval has wider limits since it contains the uncertainty due the noise factors: Var(Y ˆµ) = σ 2 + Var(ˆµ) = σ 2 (1 + 1 n + 1 n 1 ( x x ) 2 ) Compare these two formulas by drawing the confidence bands around the regression line both for the individual observation y and the mean µ 4 Linear regression and ANOVA Recall the two independent samples case from Chapter 11: first sample µ 1 + ɛ 1,, µ 1 + ɛ n, second sample µ 2 + ɛ n+1,, µ 2 + ɛ n+m, where the noise variables are independent and identically distributed ɛ i N(0, σ 2 ) This setting is equivalent to the simple linear regression model with Y i = β 0 + β 1 x i + ɛ i, x 1 = = x n = 0, x n+1 = = x n+m = 1, µ 1 = β 0, µ 2 = β 0 + β 1 The model utility test H 0 : β 1 = 0 is equivalent to the equality test H 0 : µ 1 = µ 2 3

4 More generally, for the one-way ANOVA setting with I = p levels for the main factor and n = pj observations β 0 + ɛ i, i = 1,, J, β 0 + β 1 + ɛ i, i = J + 1,, 2J, we need a multiple linear regression model β 0 + β p 1 + ɛ i, i = (p 1)J + 1,, n, Y i = β 0 + β 1 x i,1 + + β p 1 x i,p 1 + ɛ i, i = 1,, n with dummy variable i,j taking values 0 and 1 so that x i,1 = 1 only for i = J + 1, 2J, x i,2 = 1 only for i = 2J + 1, 3J, x i,p 1 = 1 only for i = (p 1)J + 1,, n The corresponding design matrix has the following block form X = , where each block is a column vector of length J whose components are either or ones or zeros 5 Multiple linear regression Consider a linear regression model Y = β 0 + β 1 x β p 1 x p 1 + ɛ, ɛ N(0, σ 2 ) with p 1 explanatory variables and a homoscedastic noise This is an extension of the simple linear regression model with p = 2 The corresponding data set consists of observations (y 1,, y n ) with n > p, which are realisations of n independent random variables Y 1 = β 0 + β 1 x 1,1 + + β p 1 x 1,p 1 + ɛ 1, Y n = β 0 + β 1 x n,1 + + β p 1 x n,p 1 + ɛ n In the matrix notation the column vector y = (y 1,, y n ) T is a realisation of Y = Xβ + ɛ, where Y = (Y 1,, Y n ) T, β = (β 0,, β p 1 ) T, ɛ = (ɛ 1,, ɛ n ) T, 4

5 are column vectors, and X is the so called design matrix 1 x 1,1 x 1,p 1 X = 1 x n,1 x n,p 1 assumed to have rank p Least square estimates b = (b 0,, b p 1 ) T minimise S(b) = y Xb 2, where a is the length of a vector a Solving the normal equations X T Xb = X T y we find the least squares estimates being b = MX T y, M = (X T X) 1 Least squares multiple regression: predicted responses ŷ = Xb = Py, where P = XMX T Covariance matrix for the least square estimates Σ bb = σ 2 M is a p p matrix with elements Cov(b i, b j ) The vector of residuals e = y ŷ = (I P)y have a covariance matrix Σ ee = σ 2 (I P) An unbiased estimate of σ 2 is given by s 2 = SSE n p, where SSE = e 2 The standard error of b i is computed as s bj = s m jj, where m jj is a diagonal element of M Exact sampling distributions b j β j s bj t n p, j = 1,, p 1 Inspect the normal probability plot for the standardised residuals elements of P y i ŷ i s 1 p ii, where p ii are the diagonal Coefficient of multiple determination can be computed similarly to the simple linear regression model as R 2 = 1 SSE, where SST = (n SST 1)s2 y The problem with R 2 is that it increases even if irrelevant variables are added to the model To punish for irrelevant variables it is better to use the adjusted coefficient of multiple determination The adjustment factor n 1 n p Ra 2 = 1 n 1 SSE = 1 s2 n p SST gets larger for the larger values of p Example (flow rate vs stream depth) The multiple linear regression framework works for the quadratic model y = β 0 + β 1 x + β 2 x 2 The residuals show no sign of systematic misfit Linear and quadratic terms are statistically significant Coefficient Estimate Standard Error t Value β β β Emperical relationship developed in a region might break down, if extrapolated to a wider region in which no data been observed s 2 y 5

6 Example (catheter length) Doctors want predictions on heart catheter length depending on child s height and weight The pairwise scatterplots for the data of size n = 12 suggests two simple linear regressions Estimate Height t Value Weight t Value b 0 (s b0 ) 121(43) (20) 128 b 1 (s b1 ) 060(010) (004) 70 s r 2 (Ra) (076) 080 (078) The plots of standardised residuals do not contradict the normality assumptions The simple regression models should be compared to the multiple regression model L = β 0 + β 1 H + β 2 W, which gives b 0 = 21, s b0 = 88, b 0 /s b0 = 239, b 1 = 020, s b1 = 036, b 1 /s b1 = 056, b 2 = 019, s b2 = 017, b 2 /s b2 = 112, s = 39, R 2 = 081, Ra 2 = 077 In contrast to the simple models, we can not reject neither H 1 : β 1 = 0 nor H 2 : β 2 = 0 This paradox is explained by different meaning of the slope parameters in the simple and multiple regression models In the multiple model β 1 is the expected change in L when H increased by one unit and W held constant Collinearity problem: height and weight have a strong linear relationship The fitted plane has a well resolved slope along the line about which the (H, W ) points fall and poorly resolved slopes along the H and W axes Conclusion: since the simple model L = β 0 + β 1 W gives the highest adjusted coefficient of determination, there is little or no gain from adding H to the regression model model with a single explanatory variable W 6

Simple Linear Regression

Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.