STA 114: Statistics
Notes 1. Linear Regression

Introduction

One of the most widely used statistical analyses is regression, where one tries to explain a response variable $Y$ by an explanatory variable $X$ based on paired data $(X_1, Y_1), \ldots, (X_n, Y_n)$. The most common way to model the dependence of $Y$ on $X$ is to look for a linear relationship with additional noise,
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,$$
with $\epsilon_1, \ldots, \epsilon_n$ taken to be independent and identically distributed random variables with mean 0 and variance $\sigma^2$. The unknown quantities are $(\beta_0, \beta_1, \sigma^2)$.

Many pairs of natural measurements exhibit such linear relationships. The figure below shows atmospheric pressure (in inches of mercury) against the boiling point of water (in degrees Fahrenheit), based on 17 pairs of observations. Although water's boiling point and atmospheric pressure should have a precise physical relationship, there will always be some deviation in actual measurements due to factors that are hard to control.

[Figure: scatter plot of Pressure (inches of mercury) against Temp (degrees F), with the least squares line drawn through the points.]

Least squares line

The straight line you see in the figure above is the line that best fits the data. It is found as follows. For any line $y = b_0 + b_1 x$, we can find the residuals $e_i = y_i - b_0 - b_1 x_i$ if we tried
to explain the observed values of $Y$ by those of $X$ using this line. The total deviation can be measured by the sum of squares of the residuals
$$d(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2.$$
We find the $b_0$ and $b_1$ that minimize $d(b_0, b_1)$. This is easily done using calculus: we set
$$\frac{\partial}{\partial b_0} d(b_0, b_1) = 0, \qquad \frac{\partial}{\partial b_1} d(b_0, b_1) = 0$$
and solve for $b_0, b_1$. In particular,
$$0 = \frac{\partial}{\partial b_0} d(b_0, b_1) = \sum_{i=1}^n 2(y_i - b_0 - b_1 x_i)(-1) = -2n(\bar y - b_0 - b_1 \bar x),$$
$$0 = \frac{\partial}{\partial b_1} d(b_0, b_1) = \sum_{i=1}^n 2(y_i - b_0 - b_1 x_i)(-x_i) = -2\Big(\sum_{i=1}^n x_i y_i - n b_0 \bar x - b_1 \sum_{i=1}^n x_i^2\Big).$$
These are two linear equations in the two unknowns $b_0, b_1$. The solutions are
$$\hat b_1 = \frac{\sum_i (y_i - \bar y)(x_i - \bar x)}{\sum_i (x_i - \bar x)^2} = \frac{\sum_i y_i (x_i - \bar x)}{\sum_i (x_i - \bar x)^2}, \qquad \hat b_0 = \bar y - \hat b_1 \bar x.$$
Let us denote $s_{xy} = \frac{1}{n-1}\sum_i (y_i - \bar y)(x_i - \bar x)$ [this is the sample covariance between $Y$ and $X$]. Then, using our old notation $s_x^2 = \frac{1}{n-1}\sum_i (x_i - \bar x)^2$, we can write the least squares solution as
$$\hat b_0 = \bar y - s_{xy}\bar x / s_x^2, \qquad \hat b_1 = s_{xy}/s_x^2.$$
The method of least squares was used by scientists working on astronomical measurements in the early 19th century. A statistical framework was developed much later. The main import of the statistical development, as usual, has been to incorporate a notion of uncertainty.

Statistical analysis of simple linear regression

To put the linear regression relationship $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ into a statistical model, we need a distribution on the $\epsilon_i$'s. The most common choice is a normal distribution $\mathrm{Normal}(0, \sigma^2)$. This can be justified as follows: the additional factors that give rise to the noise term are many in number and act independently of each other, each making a little contribution. By the central limit theorem the aggregate of such numerous, independent, small contributions should behave like a normal variable. The mean is fixed at zero because any non-zero mean can be absorbed into the intercept $\beta_0$. So our statistical model is
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \epsilon_i \stackrel{iid}{\sim} \mathrm{Normal}(0, \sigma^2),$$
with model parameters $\beta_0 \in (-\infty, \infty)$, $\beta_1 \in (-\infty, \infty)$, and $\sigma^2 > 0$.
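The least squares formulas above can be checked numerically. Below is a minimal sketch with made-up data (not the 17 boiling-point pairs), comparing the covariance-based solution against `np.polyfit`, which minimizes the same sum of squares:

```python
import numpy as np

# Made-up paired data standing in for (x_1, y_1), ..., (x_n, y_n).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
xbar, ybar = x.mean(), y.mean()
s_xy = ((x - xbar) * (y - ybar)).sum() / (n - 1)   # sample covariance s_xy
s_x2 = ((x - xbar) ** 2).sum() / (n - 1)           # sample variance s_x^2

b1 = s_xy / s_x2          # slope:     b1_hat = s_xy / s_x^2
b0 = ybar - b1 * xbar     # intercept: b0_hat = ybar - b1_hat * xbar

# np.polyfit with deg=1 minimizes the same residual sum of squares,
# so it must return the same line (coefficients come highest degree first).
b1_np, b0_np = np.polyfit(x, y, deg=1)
```

Any routine that minimizes $d(b_0, b_1)$ must agree with the closed form, which is a handy sanity check when implementing the formulas by hand.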
We also need to assume that the error terms $\epsilon_i$ are independent of the explanatory variables $X_i$ (because the errors account for additional factors beyond the explanatory variable). Then the log-likelihood function in the model parameters is given by
$$\ell_{x,y}(\beta_0, \beta_1, \sigma^2) = \text{const} - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$
To find the MLE, first notice that for every $\sigma^2$, the log-likelihood function is maximized at $(\beta_0, \beta_1)$ equal to the least squares solutions $(\bar y - s_{xy}\bar x/s_x^2,\; s_{xy}/s_x^2)$, and so we must have
$$\hat\beta_{0,\mathrm{MLE}}(x, y) = \bar y - s_{xy}\bar x/s_x^2, \qquad \hat\beta_{1,\mathrm{MLE}}(x, y) = s_{xy}/s_x^2.$$
To shorten notation we'll just write $\hat\beta_0$ and $\hat\beta_1$ for $\hat\beta_{0,\mathrm{MLE}}(x, y)$ and $\hat\beta_{1,\mathrm{MLE}}(x, y)$ respectively.
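The fact that the MLE of $(\beta_0, \beta_1)$ coincides with the least squares solution can also be verified numerically. The sketch below (simulated data; $\sigma$ is parametrized on the log scale so the optimization is unconstrained) minimizes the negative log-likelihood directly and compares with the closed form:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
n = x.size
y = 1.5 + 0.8 * x + rng.normal(0.0, 0.5, size=n)  # true beta0=1.5, beta1=0.8

def neg_log_lik(theta):
    # theta = (b0, b1, log_sigma); minimizing this maximizes the likelihood
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)
    resid = y - b0 - b1 * x
    return n * np.log(sigma) + (resid ** 2).sum() / (2.0 * sigma ** 2)

fit = minimize(neg_log_lik, x0=np.zeros(3))

# Closed-form least squares solution for comparison.
s_xy = np.cov(x, y, ddof=1)[0, 1]
s_x2 = x.var(ddof=1)
b1_ls = s_xy / s_x2
b0_ls = y.mean() - b1_ls * x.mean()
```

The optimizer's $(\hat\beta_0, \hat\beta_1)$ should agree with the closed-form solution up to numerical tolerance.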
Consequently, the MLE of $\sigma^2$ is found by setting $\frac{\partial}{\partial\sigma^2}\ell_{x,y}(\hat\beta_0, \hat\beta_1, \sigma^2) = 0$. This gives
$$\hat\sigma^2_{\mathrm{MLE}}(x, y) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$
It is more common to estimate $\sigma^2$ by
$$\hat\sigma^2 = s^2_{y|x} = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2,$$
where the $n - 2$ indicates that two unknown quantities ($\beta_0$ and $\beta_1$) had to be estimated to define the residuals.

Sampling theory

To construct confidence intervals and test procedures for the model parameters, we'll first need to look at their sampling theory. We'll explore this treating the explanatory variable as non-random, i.e., as if we picked and fixed the observations $x_1, \ldots, x_n$. You can interpret this as the conditional sampling theory of the estimated model parameters given the observed explanatory variables. It turns out that when $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ with $\epsilon_i \stackrel{iid}{\sim} \mathrm{Normal}(0, \sigma^2)$, for any two numbers $a$ and $b$:

1. $a\hat\beta_0 + b\hat\beta_1 \sim \mathrm{Normal}\Big(a\beta_0 + b\beta_1,\; \sigma^2\Big\{\frac{a^2}{n} + \frac{(a\bar x - b)^2}{(n-1)s_x^2}\Big\}\Big)$,
2. $(n-2)\hat\sigma^2/\sigma^2 \sim \chi^2(n-2)$,
3. and these two random variables are mutually independent.

A proof of this is given at the end of this handout. It is only for those who are curious to know why this happens. You may ignore it if you're not interested.

Two special cases of the above result will be important to us. First, with $a = 1$ and $b = 0$ the result gives $\hat\beta_0 \sim \mathrm{Normal}\big(\beta_0,\; \sigma^2\big\{\frac{1}{n} + \frac{\bar x^2}{(n-1)s_x^2}\big\}\big)$, independent of $\hat\sigma^2$. Second, with $a = 0$ and $b = 1$ we have $\hat\beta_1 \sim \mathrm{Normal}\big(\beta_1,\; \frac{\sigma^2}{(n-1)s_x^2}\big)$, independent of $\hat\sigma^2$.

ML confidence interval

We shall only look at confidence intervals for parameters that can be written as $\eta = a\beta_0 + b\beta_1$ for some real numbers $a$ and $b$ (again, the interesting cases are $(a = 1, b = 0)$ corresponding to $\beta_0$ and $(a = 0, b = 1)$ corresponding to $\beta_1$). From the result above it follows that $\hat\eta = a\hat\beta_0 + b\hat\beta_1 \sim \mathrm{Normal}(\eta, \sigma_\eta^2)$ where $\sigma_\eta^2 = \sigma^2\big[\frac{a^2}{n} + \frac{(a\bar x - b)^2}{(n-1)s_x^2}\big]$, and $\hat\eta$ is independent of $\hat\sigma^2$. Let $\hat\sigma_\eta^2 = \hat\sigma^2\big[\frac{a^2}{n} + \frac{(a\bar x - b)^2}{(n-1)s_x^2}\big]$. A bit of algebra shows that
$$\frac{\hat\eta - \eta}{\hat\sigma_\eta} \sim t(n-2).$$
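The sampling theory can be checked by Monte Carlo: fix the $x_i$'s once, repeatedly redraw the errors, and compare the empirical mean and standard deviation of $\hat\beta_1$ with $\beta_1$ and $\sigma/\sqrt{(n-1)s_x^2}$. A sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 20)        # fixed design points
n = x.size
beta0, beta1, sigma = 2.0, -1.0, 0.7
s_x2 = x.var(ddof=1)

reps = 20000
b1_hats = np.empty(reps)
for r in range(reps):
    # new errors each replication, same x's (conditional sampling theory)
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    b1_hats[r] = ((x - x.mean()) * (y - y.mean())).sum() / ((n - 1) * s_x2)

theoretical_sd = sigma / np.sqrt((n - 1) * s_x2)
# b1_hats.mean() should be close to beta1, b1_hats.std() to theoretical_sd
```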
Therefore a $100(1-\alpha)\%$ confidence interval for $\eta$ is given by
$$B(x, y) = \hat\eta \pm \hat\sigma_\eta\, t_{n-2}(\alpha/2).$$
This confidence interval is also the ML confidence interval, i.e., it matches $\{\eta : \ell^*_{x,y}(\eta) \ge \max_\eta \ell^*_{x,y}(\eta) - c_\alpha/2\}$ for some $c_\alpha > 0$, where $\ell^*_{x,y}(\eta)$ is the profile log-likelihood of $\eta$, but we will not pursue a proof here.

Hypothesis testing

Again we restrict ourselves to parameters that can be written as $\eta = a\beta_0 + b\beta_1$. To test $H_0: \eta = \eta_0$ against $H_1: \eta \ne \eta_0$, the size-$\alpha$ ML test is the one that rejects $H_0$ whenever $\eta_0$ falls outside the $100(1-\alpha)\%$ ML confidence interval $\hat\eta \pm \hat\sigma_\eta\, t_{n-2}(\alpha/2)$. The p-value based on these tests is precisely the $\alpha$ for which $\eta_0$ is just on the boundary of this interval. Of particular interest is testing $H_0: \beta_1 = 0$ against $H_1: \beta_1 \ne 0$. The null hypothesis says that $X$ offers no explanation of $Y$ within the linear relationship framework. This can be dealt with as above using $a = 0$ and $b = 1$, thus checking whether 0 falls outside the interval $\hat\beta_1 \pm \hat\sigma\, t_{n-2}(\alpha/2)\big/\sqrt{(n-1)s_x^2}$.

Prediction

Suppose you want to predict the value $Y^*$ of $Y$ corresponding to an $X = x^*$ that is known to you. The model says $Y^* = \beta_0 + \beta_1 x^* + \epsilon^*$ where $\epsilon^* \sim \mathrm{Normal}(0, \sigma^2)$ and is independent of $\epsilon_1, \ldots, \epsilon_n$. If we knew $\beta_0, \beta_1, \sigma^2$, we could give the interval $\beta_0 + \beta_1 x^* \pm \sigma z(\alpha/2)$, taking into account the variability $\sigma^2$ of $\epsilon^*$. Our best guess for $\beta_0 + \beta_1 x^*$ is the fitted value $\hat y^* = \hat\beta_0 + \hat\beta_1 x^*$ (the point on the least squares line with x-coordinate $x^*$), with additional associated variability $\sigma^2\big\{\frac{1}{n} + \frac{(\bar x - x^*)^2}{(n-1)s_x^2}\big\}$ [this follows from the result with $a = 1$, $b = x^*$]. Also, we have an estimate $\hat\sigma^2$ for $\sigma^2$. Putting all these together, it follows that the predictive interval
$$\hat\beta_0 + \hat\beta_1 x^* \pm \hat\sigma\Big[1 + \frac{1}{n} + \frac{(\bar x - x^*)^2}{(n-1)s_x^2}\Big]^{1/2} t_{n-2}(\alpha/2)$$
has $100(1-\alpha)\%$ coverage.

Testing goodness of fit

Once we fit the model, we can construct our residuals $e_i = Y_i - \hat\beta_0 - \hat\beta_1 x_i$, $i = 1, \ldots, n$. The histogram of these residuals should match the $\mathrm{Normal}(0, \sigma^2)$ pdf, with $\sigma^2$ estimated by $\hat\sigma^2$.
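Putting the interval formulas above into code, the sketch below computes a 95% confidence interval for $\beta_1$ and a 95% prediction interval at a new point $x^*$; the data are invented, loosely mimicking the temperature-pressure example:

```python
import numpy as np
from scipy import stats

# Invented (temperature, pressure) pairs; any paired data work the same way.
x = np.array([194.5, 197.9, 199.4, 201.3, 204.6, 208.6, 210.8, 212.2])
y = np.array([20.8, 23.2, 24.0, 25.1, 26.6, 28.5, 29.2, 30.1])
n = len(x)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
resid = y - b0 - b1 * x
sigma_hat = np.sqrt((resid ** 2).sum() / (n - 2))   # s_{y|x}, divisor n - 2
s_x2 = x.var(ddof=1)

tcrit = stats.t.ppf(0.975, df=n - 2)                # t_{n-2} critical value

# 95% confidence interval for beta1 (case a = 0, b = 1)
se_b1 = sigma_hat / np.sqrt((n - 1) * s_x2)
ci_b1 = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)

# 95% prediction interval at x*; the extra "1 +" is the new error's variance
x_star = 205.0
y_hat = b0 + b1 * x_star
se_pred = sigma_hat * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / ((n - 1) * s_x2))
pi = (y_hat - tcrit * se_pred, y_hat + tcrit * se_pred)
```

The prediction interval is always wider than the corresponding confidence interval for $\beta_0 + \beta_1 x^*$, because of the extra 1 accounting for the new observation's own noise.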
So we could carry out a Pearson's goodness-of-fit test with these residuals in the same manner we did for iid normal data. The only difference would be that the p-value range would now be given by $1 - F_{\chi^2_{k-1}}(Q(x))$ to $1 - F_{\chi^2_{k-4}}(Q(x))$, to reflect that 3 parameters are being estimated ($\beta_0$, $\beta_1$ and $\sigma^2$).

Technical details (ignore if you're not interested)

To dig into the sampling distribution of $\hat\beta_0$, $\hat\beta_1$ and $\hat\sigma^2$, we first need to recall an important result we had seen a long time back (Notes 7 to be precise). We saw that if $Z_1, \ldots, Z_n$ are iid
$\mathrm{Normal}(0, 1)$ and $W_1, \ldots, W_n$ relate to $Z_1, \ldots, Z_n$ through the system of equations
$$W_1 = a_{11} Z_1 + \cdots + a_{1n} Z_n$$
$$W_2 = a_{21} Z_1 + \cdots + a_{2n} Z_n$$
$$\vdots$$
$$W_n = a_{n1} Z_1 + \cdots + a_{nn} Z_n,$$
then $W_1, \ldots, W_n$ are also independent $\mathrm{Normal}(0, 1)$ variables with $W_1^2 + \cdots + W_n^2 = Z_1^2 + \cdots + Z_n^2$, provided the coefficients on each row form a vector of unit length and these vectors are orthogonal to each other. We had used this result to show that if $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Normal}(0, 1)$ then $\bar X \sim \mathrm{Normal}(0, 1/n)$ and $\sum_i (X_i - \bar X)^2 \sim \chi^2(n-1)$, and these two are independent of each other. We will pursue a similar track here. First we note that when $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ with $\epsilon_i \stackrel{iid}{\sim} \mathrm{Normal}(0, \sigma^2)$,
$$\bar Y = \frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1 x_i + \epsilon_i) = \beta_0 + \beta_1 \bar x + \bar\epsilon$$
and, with $u_i = (x_i - \bar x)\big/\sum_j (x_j - \bar x)^2$,
$$\hat\beta_1 = \frac{\sum_i Y_i (x_i - \bar x)}{\sum_i (x_i - \bar x)^2} = \sum_i u_i Y_i = \sum_i u_i (\beta_0 + \beta_1 x_i + \epsilon_i) = \beta_1 + \sum_i u_i \epsilon_i$$
because $\sum_i u_i = 0$ and $\sum_i u_i x_i = 1$. Define $\zeta_1 = \epsilon_1/\sigma, \ldots, \zeta_n = \epsilon_n/\sigma$; then $\zeta_1, \ldots, \zeta_n \stackrel{iid}{\sim} \mathrm{Normal}(0, 1)$ and, from the calculations above, $\bar Y = \beta_0 + \beta_1 \bar x + \sigma\bar\zeta$ and $\hat\beta_1 = \beta_1 + \sigma\sum_i u_i \zeta_i$. Now take $W_1, \ldots, W_n$ as
$$W_1 = \frac{1}{\sqrt n}\zeta_1 + \cdots + \frac{1}{\sqrt n}\zeta_n = \sqrt n\,\bar\zeta,$$
$$W_2 = \frac{x_1 - \bar x}{\sqrt{\sum_j (x_j - \bar x)^2}}\zeta_1 + \cdots + \frac{x_n - \bar x}{\sqrt{\sum_j (x_j - \bar x)^2}}\zeta_n,$$
$$W_3 = a_{31}\zeta_1 + \cdots + a_{3n}\zeta_n,$$
$$\vdots$$
$$W_n = a_{n1}\zeta_1 + \cdots + a_{nn}\zeta_n,$$
where the rows are unit length and mutually orthogonal (the first two rows are so by design, and can be extended to a set of $n$ such rows by what is known as Gram-Schmidt orthogonalization). So we have $W_i \stackrel{iid}{\sim} \mathrm{Normal}(0, 1)$ and $W_1^2 + \cdots + W_n^2 = \zeta_1^2 + \cdots + \zeta_n^2$. This leads to a rich collection of results. First we see that $\hat\beta_1 = \beta_1 + \sigma W_2\big/\sqrt{\sum_i (x_i - \bar x)^2}$ with $W_2 \sim \mathrm{Normal}(0, 1)$. Hence we must have
$$\hat\beta_1 \sim \mathrm{Normal}\Big(\beta_1,\; \frac{\sigma^2}{\sum_i (x_i - \bar x)^2}\Big).$$
Next,
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar x = \beta_0 + \sigma\Big[\frac{W_1}{\sqrt n} - \frac{\bar x W_2}{\sqrt{\sum_i (x_i - \bar x)^2}}\Big]$$
with $W_1, W_2$ independent $\mathrm{Normal}(0, 1)$. Therefore,
$$\hat\beta_0 \sim \mathrm{Normal}\Big(\beta_0,\; \sigma^2\Big[\frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}\Big]\Big).$$
Furthermore, for any two numbers $a$ and $b$,
$$a\hat\beta_0 + b\hat\beta_1 = a\beta_0 + b\beta_1 + \sigma\Big[\frac{aW_1}{\sqrt n} - \frac{(a\bar x - b)W_2}{\sqrt{\sum_i (x_i - \bar x)^2}}\Big] \sim \mathrm{Normal}\Big(a\beta_0 + b\beta_1,\; \sigma^2\Big\{\frac{a^2}{n} + \frac{(a\bar x - b)^2}{(n-1)s_x^2}\Big\}\Big).$$
Next we look at $\hat\sigma^2$. First note that
$$Y_i - \hat\beta_0 - \hat\beta_1 x_i = \epsilon_i - (\hat\beta_0 - \beta_0) - (\hat\beta_1 - \beta_1)x_i = \sigma\Big[\zeta_i - \frac{W_1}{\sqrt n} - \frac{(x_i - \bar x)W_2}{\sqrt{\sum_j (x_j - \bar x)^2}}\Big],$$
which leads to the following identity (with a bit of algebra that you can ignore):
$$\frac{\sum_i (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\sigma^2} = \sum_i \zeta_i^2 - W_1^2 - W_2^2 = W_3^2 + \cdots + W_n^2,$$
because $\sum_i \zeta_i^2 = \sum_i W_i^2$. But $W_3^2 + \cdots + W_n^2$ is the sum of squares of $n - 2$ independent $\mathrm{Normal}(0, 1)$ variables, so
$$\frac{(n-2)\hat\sigma^2}{\sigma^2} = \frac{\sum_i (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\sigma^2} \sim \chi^2(n-2),$$
and because $W_3, \ldots, W_n$ are independent of $W_1, W_2$, we can conclude that $\hat\sigma^2$ is independent of $\hat\beta_0$ and $\hat\beta_1$.
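The identity $\sum_i (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2/\sigma^2 = \sum_i \zeta_i^2 - W_1^2 - W_2^2$ is exact algebra, so it can be verified to machine precision on a single simulated data set. A sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
x = rng.uniform(0.0, 10.0, size=n)
beta0, beta1, sigma = 1.0, 2.0, 1.5
zeta = rng.normal(0.0, 1.0, size=n)                # standardized errors
y = beta0 + beta1 * x + sigma * zeta

S = ((x - x.mean()) ** 2).sum()                    # sum of (x_i - xbar)^2
W1 = np.sqrt(n) * zeta.mean()                      # first orthonormal coordinate
W2 = ((x - x.mean()) / np.sqrt(S) * zeta).sum()    # second orthonormal coordinate

# Least squares fit and residual sum of squares.
b1 = ((x - x.mean()) * y).sum() / S
b0 = y.mean() - b1 * x.mean()
rss_over_sigma2 = ((y - b0 - b1 * x) ** 2).sum() / sigma ** 2

identity_rhs = (zeta ** 2).sum() - W1 ** 2 - W2 ** 2
# rss_over_sigma2 and identity_rhs agree up to floating point roundoff
```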