Bias-Variance Trade-off

The mean squared error of an estimator,
$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2],$$
can be re-expressed as
$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + B(\hat{\theta})^2$$
MSE = VAR + BIAS$^2$

Proof:
$$\begin{aligned}
\mathrm{MSE}(\hat{\theta}) &= E[(\hat{\theta} - \theta)^2] \\
&= E\big(([\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta])^2\big) \\
&= E[\hat{\theta} - E(\hat{\theta})]^2 + 2E\big([E(\hat{\theta}) - \theta][\hat{\theta} - E(\hat{\theta})]\big) + E[E(\hat{\theta}) - \theta]^2 \\
&= \mathrm{Var}(\hat{\theta}) + 2E\big(E(\hat{\theta})[\hat{\theta} - E(\hat{\theta})] - \theta[\hat{\theta} - E(\hat{\theta})]\big) + B(\hat{\theta})^2 \\
&= \mathrm{Var}(\hat{\theta}) + 2(0 - 0) + B(\hat{\theta})^2 \\
&= \mathrm{Var}(\hat{\theta}) + B(\hat{\theta})^2
\end{aligned}$$
since $E[\hat{\theta} - E(\hat{\theta})] = 0$.
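As a sanity check, the decomposition can be verified numerically. The sketch below uses illustrative numbers and a deliberately biased shrinkage estimator ($0.9\bar{Y}$, an assumption for demonstration, not from the lecture) and confirms that the empirical MSE equals the empirical variance plus squared bias:

```python
import numpy as np

# Monte Carlo check of MSE = Var + Bias^2 for a deliberately biased
# estimator: theta_hat = 0.9 * sample mean (shrinkage toward zero).
# All numbers here are illustrative.
rng = np.random.default_rng(0)
theta, sigma, n, reps = 5.0, 2.0, 20, 100_000

samples = rng.normal(theta, sigma, size=(reps, n))
theta_hat = 0.9 * samples.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
var = np.var(theta_hat)                       # empirical variance (ddof=0)
bias_sq = (np.mean(theta_hat) - theta) ** 2   # squared empirical bias

print(mse, var + bias_sq)  # the two quantities agree to floating-point precision
```

Note the identity holds exactly for the empirical moments (with the population-style variance), which is why the agreement is to machine precision rather than Monte Carlo error.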
$s^2$ estimator for $\sigma^2$ for a Single Population

Sum of squares: $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
Sample variance estimator:
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$$
$s^2$ is an unbiased estimator of $\sigma^2$. The sum of squares has $n-1$ degrees of freedom associated with it; one degree of freedom is lost by using $\bar{Y}$ as an estimate of the unknown population mean $\mu$.
Estimating the Error Term Variance $\sigma^2$ for the Regression Model

In the regression model, the variance of each observation $Y_i$ is $\sigma^2$ (the same as for the error term $\epsilon_i$). However, each $Y_i$ comes from a different probability distribution, with a mean that depends on the level $X_i$. The deviation of an observation $Y_i$ must therefore be calculated around its own estimated mean.
$s^2$ estimator for $\sigma^2$

$$s^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2}$$
MSE is an unbiased estimator of $\sigma^2$: $E(\mathrm{MSE}) = \sigma^2$. The sum of squares SSE has $n-2$ degrees of freedom associated with it; two degrees of freedom are lost when estimating $\beta_0$ and $\beta_1$. Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
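A small simulation can illustrate the unbiasedness claim: averaging $\mathrm{SSE}/(n-2)$ over many simulated regressions recovers $\sigma^2$. All settings below ($\beta_0$, $\beta_1$, $\sigma$, the $X$ grid) are made-up illustrative values:

```python
import numpy as np

# Sketch: check by simulation that MSE = SSE/(n-2) is (approximately)
# unbiased for sigma^2. All model settings are illustrative choices.
rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 3.0
X = np.linspace(0.0, 10.0, 25)
n = X.size
Sxx = ((X - X.mean()) ** 2).sum()

# 5000 replicate data sets with X held fixed
Y = beta0 + beta1 * X + rng.normal(0.0, sigma, size=(5000, n))
b1 = ((X - X.mean()) * Y).sum(axis=1) / Sxx   # slope per replicate
b0 = Y.mean(axis=1) - b1 * X.mean()           # intercept per replicate
resid = Y - (b0[:, None] + b1[:, None] * X)
mse = (resid ** 2).sum(axis=1) / (n - 2)

print(mse.mean())  # close to sigma^2 = 9
```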
Normal Error Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
- $Y_i$ is the value of the response variable in the $i$th trial
- $\beta_0$ and $\beta_1$ are parameters
- $X_i$ is a known constant, the value of the predictor variable in the $i$th trial
- $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ (note this is different: now we know the distribution)
- $i = 1, \dots, n$
Maximum Likelihood Estimator(s)

- $\beta_0$: $b_0$, same as in the least squares case
- $\beta_1$: $b_1$, same as in the least squares case
- $\sigma^2$: $\hat{\sigma}^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n}$

Note that the ML estimator $\hat{\sigma}^2$ is biased, since $s^2$ is unbiased and $s^2 = \mathrm{MSE} = \frac{n}{n-2}\hat{\sigma}^2$.
Inference in the Normal Error Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
- $Y_i$ is the value of the response variable in the $i$th trial
- $\beta_0$ and $\beta_1$ are parameters
- $X_i$ is a known constant, the value of the predictor variable in the $i$th trial
- $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
- $i = 1, \dots, n$
Inference concerning $\beta_1$

Tests concerning $\beta_1$ (the slope) are often of interest, particularly
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$
The null hypothesis model $Y_i = \beta_0 + (0)X_i + \epsilon_i$ implies that there is no relationship between $Y$ and $X$. Note that the means of all the $Y_i$'s are then equal at all levels of $X_i$.
Sampling Dist. of $b_1$

The point estimator of $\beta_1$ is
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
The sampling distribution of $b_1$ is the distribution of values of $b_1$ that arises from the variability of $b_1$ when the predictor variables $X_i$ are held fixed and the observed outputs are repeatedly sampled. Note that the sampling distribution we derive for $b_1$ will be highly dependent on our modeling assumptions.
Sampling Dist. of $b_1$ in the Normal Regression Model

For a normal error regression model the sampling distribution of $b_1$ is normal, with mean and variance given by
$$E(b_1) = \beta_1 \qquad \mathrm{Var}(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$
To show this we need to go through a number of algebraic steps.
First step

To show $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum (X_i - \bar{X})Y_i$ we observe
$$\begin{aligned}
\sum (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum (X_i - \bar{X})Y_i - \sum (X_i - \bar{X})\bar{Y} \\
&= \sum (X_i - \bar{X})Y_i - \bar{Y}\sum (X_i - \bar{X}) \\
&= \sum (X_i - \bar{X})Y_i - \bar{Y}\sum X_i + \bar{Y}\,n\,\frac{\sum X_i}{n} \\
&= \sum (X_i - \bar{X})Y_i
\end{aligned}$$
$b_1$ as a linear combination of the $Y_i$'s

$b_1$ can be expressed as a linear combination of the $Y_i$'s:
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum (X_i - \bar{X})Y_i}{\sum (X_i - \bar{X})^2} \quad \text{(from the previous slide)} \quad = \sum k_i Y_i$$
where
$$k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$$
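The equivalence of the two forms of $b_1$ is easy to confirm numerically on made-up data; the model parameters below are illustrative assumptions:

```python
import numpy as np

# Numerical check that the least-squares slope equals sum(k_i * Y_i),
# with k_i = (X_i - Xbar) / sum((X_i - Xbar)^2); the data are made up.
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 10.0, 30)
Y = 1.5 + 0.7 * X + rng.normal(0.0, 1.0, 30)

Sxx = ((X - X.mean()) ** 2).sum()
b1_direct = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
k = (X - X.mean()) / Sxx
b1_linear = (k * Y).sum()

print(b1_direct, b1_linear)  # identical up to rounding error
```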
Properties of the $k_i$'s

It can be shown (possible homework) that
$$\sum k_i = 0 \qquad \sum k_i X_i = 1 \qquad \sum k_i^2 = \frac{1}{\sum (X_i - \bar{X})^2}$$
We will use these properties to prove various properties of the sampling distributions of $b_1$ and $b_0$.
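The three properties can be checked on any set of $X$ values with nonzero spread; the values below are arbitrary illustrative numbers:

```python
import numpy as np

# Checking the three stated properties of the k_i weights on an
# arbitrary illustrative set of X values.
X = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
Sxx = ((X - X.mean()) ** 2).sum()
k = (X - X.mean()) / Sxx

print(k.sum())                  # sum k_i   = 0
print((k * X).sum())            # sum k_i X_i = 1
print((k ** 2).sum(), 1 / Sxx)  # sum k_i^2 = 1 / Sxx
```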
Normality of $b_1$'s Sampling Distribution

Useful fact: a linear combination of independent normal random variables is normally distributed. More formally: when $Y_1, \dots, Y_n$ are independent normal random variables, the linear combination $a_1 Y_1 + a_2 Y_2 + \dots + a_n Y_n$ is normally distributed, with mean $\sum a_i E(Y_i)$ and variance $\sum a_i^2 \mathrm{Var}(Y_i)$.
Normality of $b_1$'s Sampling Distribution

Since $b_1$ is a linear combination of the $Y_i$'s and each $Y_i$ is an independent normal random variable, $b_1$ is distributed normally as well:
$$b_1 = \sum k_i Y_i, \qquad k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$$
From the previous slide,
$$E(b_1) = \sum k_i E(Y_i), \qquad \mathrm{Var}(b_1) = \sum k_i^2 \mathrm{Var}(Y_i)$$
$b_1$ is an unbiased estimator

This can be seen using two of the $k_i$ properties:
$$E(b_1) = E\Big(\sum k_i Y_i\Big) = \sum k_i E(Y_i) = \sum k_i(\beta_0 + \beta_1 X_i) = \beta_0 \sum k_i + \beta_1 \sum k_i X_i = \beta_0(0) + \beta_1(1) = \beta_1$$
Variance of $b_1$

Since the $Y_i$ are independent random variables with variance $\sigma^2$ and the $k_i$'s are constants, we get
$$\mathrm{Var}(b_1) = \mathrm{Var}\Big(\sum k_i Y_i\Big) = \sum k_i^2 \mathrm{Var}(Y_i) = \sigma^2 \sum k_i^2 = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$
However, in most cases $\sigma^2$ is unknown.
Estimated variance of $b_1$

When we don't know $\sigma^2$ we have to replace it with the MSE estimate. Let
$$s^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-2}, \qquad \mathrm{SSE} = \sum e_i^2, \qquad e_i = Y_i - \hat{Y}_i$$
Plugging in, we replace
$$\mathrm{Var}(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} \quad \text{with} \quad \widehat{\mathrm{Var}}(b_1) = \frac{s^2}{\sum (X_i - \bar{X})^2}$$
Recap

We now have an expression for the sampling distribution of $b_1$ when $\sigma^2$ is known:
$$b_1 \sim N\Big(\beta_1, \frac{\sigma^2}{\sum (X_i - \bar{X})^2}\Big) \qquad (1)$$
When $\sigma^2$ is unknown we have an unbiased point estimator of $\mathrm{Var}(b_1)$:
$$\widehat{\mathrm{Var}}(b_1) = \frac{s^2}{\sum (X_i - \bar{X})^2}$$
As $n \to \infty$ (i.e. as the number of observations grows large), $\widehat{\mathrm{Var}}(b_1) \to \mathrm{Var}(b_1)$ and we can use Eqn. 1. Questions: When is $n$ big enough? What if $n$ isn't big enough?
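The recap can be checked by simulation: holding $X$ fixed and resampling $Y$ repeatedly, the empirical mean and variance of $b_1$ should match $\beta_1$ and $\sigma^2/\sum(X_i - \bar{X})^2$. All settings below are illustrative assumptions:

```python
import numpy as np

# Sketch: the spread of b1 over repeated samples (X held fixed)
# matches the theoretical variance sigma^2 / sum((X_i - Xbar)^2).
# All model settings here are illustrative.
rng = np.random.default_rng(3)
beta0, beta1, sigma = 2.0, -1.0, 1.5
X = np.linspace(-3.0, 3.0, 15)
Sxx = ((X - X.mean()) ** 2).sum()

# 20000 replicate data sets, slope computed for each
Y = beta0 + beta1 * X + rng.normal(0.0, sigma, size=(20_000, X.size))
b1s = ((X - X.mean()) * Y).sum(axis=1) / Sxx

print(np.mean(b1s), np.var(b1s), sigma ** 2 / Sxx)
```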
Digression: the Gauss-Markov Theorem

In a regression model where $E(\epsilon_i) = 0$, $\mathrm{Var}(\epsilon_i) = \sigma^2 < \infty$, and $\epsilon_i$ and $\epsilon_j$ are uncorrelated for all $i \neq j$, the least squares estimators $b_0$ and $b_1$ are unbiased and have minimum variance among all unbiased linear estimators. Remember
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$
Proof

The theorem states that $b_1$ has minimum variance among all unbiased linear estimators of the form
$$\hat{\beta}_1 = \sum c_i Y_i$$
As this estimator must be unbiased we have
$$E(\hat{\beta}_1) = \sum c_i E(Y_i) = \sum c_i(\beta_0 + \beta_1 X_i) = \beta_0 \sum c_i + \beta_1 \sum c_i X_i = \beta_1$$
Proof cont.

Given this constraint,
$$\beta_0 \sum c_i + \beta_1 \sum c_i X_i = \beta_1,$$
which must hold whatever the values of $\beta_0$ and $\beta_1$, clearly it must be the case that $\sum c_i = 0$ and $\sum c_i X_i = 1$. The variance of this estimator is
$$\mathrm{Var}(\hat{\beta}_1) = \sum c_i^2 \mathrm{Var}(Y_i) = \sigma^2 \sum c_i^2$$
Proof cont.

Now define $c_i = k_i + d_i$, where the $k_i$ are the constants we already defined and the $d_i$ are arbitrary constants:
$$k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$$
Let's look at the variance of the estimator:
$$\mathrm{Var}(\hat{\beta}_1) = \sum c_i^2 \mathrm{Var}(Y_i) = \sigma^2 \sum (k_i + d_i)^2 = \sigma^2\Big(\sum k_i^2 + \sum d_i^2 + 2\sum k_i d_i\Big)$$
Note we just demonstrated that $\sigma^2 \sum k_i^2 = \mathrm{Var}(b_1)$.
Proof cont.

Now by showing that $\sum k_i d_i = 0$ we're almost done:
$$\begin{aligned}
\sum k_i d_i &= \sum k_i(c_i - k_i) = \sum k_i c_i - \sum k_i^2 \\
&= \sum c_i\Big(\frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}\Big) - \frac{1}{\sum (X_i - \bar{X})^2} \\
&= \frac{\sum c_i X_i - \bar{X}\sum c_i}{\sum (X_i - \bar{X})^2} - \frac{1}{\sum (X_i - \bar{X})^2} = 0
\end{aligned}$$
using the constraints $\sum c_i = 0$ and $\sum c_i X_i = 1$.
Proof end

So we are left with
$$\mathrm{Var}(\hat{\beta}_1) = \sigma^2\Big(\sum k_i^2 + \sum d_i^2\Big) = \mathrm{Var}(b_1) + \sigma^2 \sum d_i^2$$
which is minimized when all the $d_i = 0$. This means that the least squares estimator $b_1$ has minimum variance among all unbiased linear estimators.
Sampling Distribution of $(b_1 - \beta_1)/\hat{S}(b_1)$

$b_1$ is normally distributed, so $(b_1 - \beta_1)/\sqrt{\mathrm{Var}(b_1)}$ is a standard normal variable. We don't know $\mathrm{Var}(b_1)$, so it must be estimated from data; we have already denoted its estimate $\hat{V}(b_1)$. Using the estimate $\hat{S}(b_1) = \sqrt{\hat{V}(b_1)}$, it can be shown that
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} \sim t(n-2)$$
Where does this come from?

For now we need to rely upon the following theorem: for the normal error regression model,
$$\frac{\mathrm{SSE}}{\sigma^2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{\sigma^2} \sim \chi^2(n-2)$$
and is independent of $b_0$ and $b_1$. Intuitively this follows the standard result for the sum of squared normal random variables. Here there are two linear constraints imposed by the regression parameter estimation, each reducing the number of degrees of freedom by one. We will revisit this subject soon.
Another useful fact: $t$-distributed random variables

Let $z$ and $\chi^2(\nu)$ be independent random variables (standard normal and $\chi^2$ respectively). The following random variable is a $t$-distributed random variable:
$$t(\nu) = \frac{z}{\sqrt{\frac{\chi^2(\nu)}{\nu}}}$$
This version of the $t$ distribution has one parameter, the degrees of freedom $\nu$.
Distribution of the studentized statistic

To derive the distribution of this statistic, first we do the following rewrite:
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} = \frac{b_1 - \beta_1}{S(b_1)} \Big/ \frac{\hat{S}(b_1)}{S(b_1)}, \qquad \frac{\hat{S}(b_1)}{S(b_1)} = \sqrt{\frac{\hat{V}(b_1)}{\mathrm{Var}(b_1)}}$$
Studentized statistic cont.

And note the following:
$$\frac{\hat{V}(b_1)}{\mathrm{Var}(b_1)} = \frac{\mathrm{MSE}/\sum (X_i - \bar{X})^2}{\sigma^2/\sum (X_i - \bar{X})^2} = \frac{\mathrm{MSE}}{\sigma^2} = \frac{\mathrm{SSE}}{\sigma^2(n-2)}$$
where we know (by the given theorem) that the distribution of the last term is $\chi^2$ and independent of $b_1$ and $b_0$:
$$\frac{\mathrm{SSE}}{\sigma^2(n-2)} \sim \frac{\chi^2(n-2)}{n-2}$$
Studentized statistic final

But by the given definition of the $t$ distribution we have our result:
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} \sim t(n-2)$$
because, putting everything together, we can see that
$$\frac{b_1 - \beta_1}{\hat{S}(b_1)} \sim \frac{z}{\sqrt{\frac{\chi^2(n-2)}{n-2}}}$$
Confidence Intervals and Hypothesis Tests

Now that we know the sampling distribution of $b_1$ ($t$ with $n-2$ degrees of freedom) we can construct confidence intervals and hypothesis tests easily. Things to think about:
- What does the $t$-distribution look like?
- Why is the estimator distributed according to a $t$-distribution rather than a normal distribution?
- When performing tests, why does this matter?
- When is it safe to cheat and use a normal approximation?
Quick Review: Hypothesis Testing

Elements of a statistical test:
- Null hypothesis, $H_0$
- Alternative hypothesis, $H_a$
- Test statistic
- Rejection region
Quick Review: Hypothesis Testing - Errors

- A type I error is made if $H_0$ is rejected when $H_0$ is true. The probability of a type I error is denoted by $\alpha$; the value of $\alpha$ is called the level of the test.
- A type II error is made if $H_0$ is accepted when $H_a$ is true. The probability of a type II error is denoted by $\beta$.
P-value The p-value, or attained significance level, is the smallest level of significance α for which the observed data indicate that the null hypothesis should be rejected.
Hypothesis Testing Example: Court Room Trial

A statistical test procedure is comparable to a trial: a defendant is considered innocent as long as his guilt is not proven. The prosecutor tries to prove the guilt of the defendant, and the defendant is convicted only when there is enough incriminating evidence. At the start of the procedure there are two hypotheses:
- $H_0$ (null hypothesis): the defendant is innocent
- $H_a$ (alternative hypothesis): the defendant is guilty

             | H_0 is true    | H_a is true
Accept H_0   | Right decision | Type II error
Reject H_0   | Type I error   | Right decision
Hypothesis Testing Example: Court Room Trial, cont.

The hypothesis of innocence is rejected only when an error is very unlikely, because one doesn't want to condemn an innocent defendant. Such an error is called an error of the first kind (the condemnation of an innocent person), and its occurrence is controlled to be rare. As a consequence of this asymmetric behaviour, the probability of an error of the second kind (setting a guilty person free) is often rather large.
Null Hypothesis

If the null hypothesis is that $\beta_1 = 0$, then $b_1$ should fall in a range around zero. The further it is from 0, the less plausible the null hypothesis.
Alternative Hypothesis: Least Squares Fit

If we find that our estimated value of $b_1$ deviates from 0, then we have to determine whether or not that deviation would be surprising given the model and the sampling distribution of the estimator. If it is sufficiently different (where "sufficiently" is defined by a confidence level), then we reject the null hypothesis.
Testing This Hypothesis

- We only have a finite sample.
- Different finite sets of samples (from the same population/source) will (almost always) produce different point estimates $b_0$, $b_1$ of $\beta_0$ and $\beta_1$, given the same estimation procedure.
- Key point: $b_0$ and $b_1$ are random variables whose sampling distributions can be statistically characterized.
- Hypothesis tests about $\beta_0$ and $\beta_1$ can be constructed using these distributions.
- The same techniques for deriving the sampling distribution of $b = [b_0, b_1]$ are used in multiple regression.
Confidence Interval Example

A machine fills cups with margarine, and is supposed to be adjusted so that the content of the cups is $\mu = 250$g of margarine. Observed random variable: $X \sim \mathrm{Normal}(250, 2.5)$.
Confidence Interval Example, Cont.

$X_1, \dots, X_{25}$ is a random sample from $X$. The natural estimator is the sample mean: $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Suppose the sample shows actual weights $X_1, \dots, X_{25}$ with mean
$$\bar{X} = \frac{1}{25}\sum_{i=1}^{25} X_i = 250.2 \text{ grams}.$$
Say we want to get a confidence interval for $\mu$. By standardizing, we get a random variable
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - \mu}{0.5}$$
with $P(-z \le Z \le z) = 1 - \alpha = 0.95$. The number $z$ follows from the cumulative distribution function:
$$\Phi(z) = P(Z \le z) = 1 - \frac{\alpha}{2} = 0.975,$$
$$z = \Phi^{-1}(\Phi(z)) = \Phi^{-1}(0.975) = 1.96.$$
Now we get:
$$\begin{aligned}
0.95 = 1 - \alpha &= P(-z \le Z \le z) = P\Big(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\Big) \\
&= P\Big(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\Big) \\
&= P\big(\bar{X} - 1.96 \times 0.5 \le \mu \le \bar{X} + 1.96 \times 0.5\big) \\
&= P\big(\bar{X} - 0.98 \le \mu \le \bar{X} + 0.98\big).
\end{aligned}$$
This might be interpreted as: with probability 0.95 we will find a confidence interval, with stochastic endpoints $\bar{X} - 0.98$ and $\bar{X} + 0.98$, that contains the parameter $\mu$.
Therefore, our 0.95 confidence interval becomes:
$$(\bar{X} - 0.98,\ \bar{X} + 0.98) = (250.2 - 0.98,\ 250.2 + 0.98) = (249.22,\ 251.18).$$
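The arithmetic of the margarine example can be reproduced directly (the values below are exactly those from the example):

```python
import math

# Reproducing the margarine example: n = 25, sigma = 2.5 (standard
# error 0.5), sample mean 250.2, z = 1.96 for a 95% interval.
xbar, sigma, n, z = 250.2, 2.5, 25, 1.96

half_width = z * sigma / math.sqrt(n)  # 1.96 * 0.5 = 0.98
interval = (xbar - half_width, xbar + half_width)
print(interval)  # (249.22, 251.18)
```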
Recap

We know that the point estimator of $\beta_1$ is
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
Last class we derived the sampling distribution of $b_1$: it is $N(\beta_1, \mathrm{Var}(b_1))$ (when $\sigma^2$ is known), with
$$\mathrm{Var}(b_1) = \sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}$$
And we suggested that an estimate of $\mathrm{Var}(b_1)$ could be arrived at by substituting the MSE for $\sigma^2$ when $\sigma^2$ is unknown:
$$s^2\{b_1\} = \frac{\mathrm{MSE}}{\sum (X_i - \bar{X})^2} = \frac{\mathrm{SSE}/(n-2)}{\sum (X_i - \bar{X})^2}$$
Sampling Distribution of $(b_1 - \beta_1)/s\{b_1\}$

Since $b_1$ is normally distributed, $(b_1 - \beta_1)/\sigma\{b_1\}$ is a standard normal variable $N(0, 1)$. We don't know $\mathrm{Var}(b_1)$, so it must be estimated from data; we have already denoted its estimate $s^2\{b_1\}$. Using this estimate, it can be shown that
$$\frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2), \qquad s\{b_1\} = \sqrt{s^2\{b_1\}}$$
It is from this fact that our confidence intervals and tests will derive.
Where does this come from?

We need to rely upon (but will not derive) the following theorem: for the normal error regression model,
$$\frac{\mathrm{SSE}}{\sigma^2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{\sigma^2} \sim \chi^2(n-2)$$
and is independent of $b_0$ and $b_1$. Here two linear constraints,
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \sum_i k_i Y_i, \quad k_i = \frac{X_i - \bar{X}}{\sum_i (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X},$$
are imposed by the regression parameter estimation, each reducing the number of degrees of freedom by one (two in total).
Reminder: normal (non-regression) estimation

Intuitively the regression result from the previous slide follows the standard result for the sum of squared standard normal random variables. First, with $\sigma$ and $\mu$ known,
$$\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \Big(\frac{Y_i - \mu}{\sigma}\Big)^2 \sim \chi^2(n)$$
and then, with $\mu$ unknown and
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$
$$\frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^{n} \Big(\frac{Y_i - \bar{Y}}{\sigma}\Big)^2 \sim \chi^2(n-1)$$
and $\bar{Y}$ and $S^2$ are independent.
Reminder: normal (non-regression) estimation cont.

With both $\mu$ and $\sigma$ unknown,
$$\sqrt{n}\,\Big(\frac{\bar{Y} - \mu}{S}\Big) \sim t(n-1)$$
because
$$\sqrt{n}\Big(\frac{\bar{Y} - \mu}{S}\Big) = \frac{\sqrt{n}(\bar{Y} - \mu)/\sigma}{\sqrt{[(n-1)S^2/\sigma^2]/(n-1)}} = \frac{N(0, 1)}{\sqrt{\chi^2(n-1)/(n-1)}}$$
Another useful fact: Student-$t$ distribution

Let $Z$ and $\chi^2(\nu)$ be independent random variables (standard normal and $\chi^2$ respectively). We then define a $t$ random variable as follows:
$$t(\nu) = \frac{Z}{\sqrt{\frac{\chi^2(\nu)}{\nu}}}$$
This version of the $t$ distribution has one parameter, the degrees of freedom $\nu$.
Studentized statistic

But by the given definition of the $t$ distribution we have our result:
$$\frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2)$$
because, putting everything together, we can see that
$$\frac{b_1 - \beta_1}{s\{b_1\}} \sim \frac{z}{\sqrt{\frac{\chi^2(n-2)}{n-2}}}$$
Confidence Intervals and Hypothesis Tests

Now that we know the sampling distribution of $b_1$ ($t$ with $n-2$ degrees of freedom) we can construct confidence intervals and hypothesis tests easily.
Confidence Interval for $\beta_1$

Since the studentized statistic follows a $t$ distribution we can make the following probability statement:
$$P\Big(t(\alpha/2;\, n-2) \le \frac{b_1 - \beta_1}{s\{b_1\}} \le t(1 - \alpha/2;\, n-2)\Big) = 1 - \alpha$$
Remember

Density: $f(y) = \frac{dF(y)}{dy}$
Distribution (CDF): $F(y) = P(Y \le y) = \int_{-\infty}^{y} f(t)\,dt$
Inverse CDF: $F^{-1}(p) = y$ such that $\int_{-\infty}^{y} f(t)\,dt = p$
Interval arriving from picking $\alpha$

Note that by symmetry $t(\alpha/2;\, n-2) = -t(1-\alpha/2;\, n-2)$. Rearranging terms and using this fact we have
$$P\big(b_1 - t(1-\alpha/2;\, n-2)\,s\{b_1\} \le \beta_1 \le b_1 + t(1-\alpha/2;\, n-2)\,s\{b_1\}\big) = 1 - \alpha$$
And now we can use a table to look up the critical value and produce confidence intervals.
Using tables for Computing Intervals

The tables in the book (Table B.2 in the appendix) give $t(1-\alpha/2;\, \nu)$, where
$$P\{t(\nu) \le t(1-\alpha/2;\, \nu)\} = A$$
This provides the inverse CDF of the $t$-distribution.
$1-\alpha$ confidence limits for $\beta_1$

The $1-\alpha$ confidence limits for $\beta_1$ are
$$b_1 \pm t(1-\alpha/2;\, n-2)\,s\{b_1\}$$
Note that this quantity can be used to calculate confidence intervals given $n$ and $\alpha$. Fixing $\alpha$ can guide the choice of sample size if a particular confidence interval width is desired, and given a sample size it determines the interval width. It is also useful for hypothesis testing.
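As a hedged sketch, the limits can be computed for made-up values of $b_1$, $s\{b_1\}$, and $n$, with the critical value read from a $t$ table:

```python
# Sketch of the 1 - alpha confidence limits for beta_1; the estimate
# b1, its standard error s_b1, and the sample size (n - 2 = 23 df)
# are made-up illustrative numbers, not from the lecture.
b1, s_b1 = 0.7, 0.12
t_crit = 2.069  # t(0.975; 23) read from a t table

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print((lower, upper))
```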
Tests Concerning $\beta_1$

Example 1: two-sided test
$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$
Test statistic:
$$t = \frac{b_1 - 0}{s\{b_1\}}$$
Tests Concerning $\beta_1$

We have an estimate of the sampling distribution of $b_1$ from the data. If the null hypothesis holds, then the $b_1$ estimate coming from the data should be within the 95% confidence interval of the sampling distribution centered at 0 (in this case):
$$t = \frac{b_1 - 0}{s\{b_1\}}$$
Decision rules

- if $|t| \le t(1-\alpha/2;\, n-2)$, accept $H_0$
- if $|t| > t(1-\alpha/2;\, n-2)$, reject $H_0$

The absolute values make the test two-sided.
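A minimal sketch of the two-sided decision rule; the estimate, standard error, and table value below are illustrative assumptions:

```python
# Minimal sketch of the two-sided decision rule for H0: beta_1 = 0;
# the estimate, standard error, and critical value are illustrative.
b1, s_b1 = 0.7, 0.12
t_stat = (b1 - 0) / s_b1  # studentized test statistic
t_crit = 2.069            # t(0.975; 23) read from a t table

reject_h0 = abs(t_stat) > t_crit
print(t_stat, reject_h0)
```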
Calculating the p-value The p-value, or attained significance level, is the smallest level of significance α for which the observed data indicate that the null hypothesis should be rejected. This can be looked up using the CDF of the test statistic.
p-value Example

An experiment is performed to determine whether a coin flip is fair (50% chance, each, of landing heads or tails) or unfairly biased ($\neq$ 50% chance of one of the outcomes).
Outcome: suppose that the experimental results show the coin turning up heads 14 times out of 20 total flips.
p-value: the p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips.
Calculation:
$$P(14\text{ heads}) + P(15\text{ heads}) + \dots + P(20\text{ heads}) = \frac{1}{2^{20}}\left[\binom{20}{14} + \binom{20}{15} + \dots + \binom{20}{20}\right] = \frac{60{,}460}{1{,}048{,}576} \approx 0.058$$
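The calculation can be reproduced exactly with integer binomial coefficients:

```python
from math import comb

# The coin example computed exactly:
# P(at least 14 heads in 20 fair flips).
favorable = sum(comb(20, k) for k in range(14, 21))  # 60,460 outcomes
p_value = favorable / 2 ** 20                        # out of 1,048,576
print(favorable, p_value)  # 60460, ~0.0577
```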
Inferences Concerning $\beta_0$

Largely, inference procedures regarding $\beta_0$ can be performed in the same way as those for $\beta_1$. Remember the point estimator $b_0$ for $\beta_0$:
$$b_0 = \bar{Y} - b_1 \bar{X}$$
Sampling distribution of $b_0$

The sampling distribution of $b_0$ refers to the different values of $b_0$ that would be obtained with repeated sampling when the levels of the predictor variable $X$ are held constant from sample to sample. For the normal regression model, the sampling distribution of $b_0$ is normal.
Sampling distribution of $b_0$

When the error variance is known:
$$E(b_0) = \beta_0, \qquad \sigma^2\{b_0\} = \sigma^2\Big(\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\Big)$$
When the error variance is unknown:
$$s^2\{b_0\} = \mathrm{MSE}\Big(\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\Big)$$
Confidence interval for $\beta_0$

The $1-\alpha$ confidence limits for $\beta_0$ are obtained in the same manner as those for $\beta_1$:
$$b_0 \pm t(1-\alpha/2;\, n-2)\,s\{b_0\}$$
Considerations on Inferences on $\beta_0$ and $\beta_1$

- Effects of departures from normality: the estimators of $\beta_0$ and $\beta_1$ have the property of asymptotic normality; their distributions approach normality as the sample size increases (under general conditions).
- Spacing of the $X$ levels: the variances of $b_0$ and $b_1$ (for a given $n$ and $\sigma^2$) depend strongly on the spacing of $X$.
Sampling distribution of point estimator of mean response

Let $X_h$ be the level of $X$ for which we would like an estimate of the mean response; it needs to be one of the observed $X$'s. The mean response when $X = X_h$ is denoted by $E(Y_h)$. The point estimator of $E(Y_h)$ is
$$\hat{Y}_h = b_0 + b_1 X_h$$
We are interested in the sampling distribution of this quantity.
Sampling Distribution of $\hat{Y}_h$

We have $\hat{Y}_h = b_0 + b_1 X_h$. Since this quantity is itself a linear combination of the $Y_i$'s, its sampling distribution is itself normal. The mean of the sampling distribution is
$$E\{\hat{Y}_h\} = E\{b_0\} + E\{b_1\}X_h = \beta_0 + \beta_1 X_h$$
Biased or unbiased?
Sampling Distribution of $\hat{Y}_h$

To derive the variance of the sampling distribution of the mean response, we first show that $b_1$ and $(1/n)\sum Y_i$ are uncorrelated and, hence, for the normal error regression model, independent. We start with the definitions
$$\bar{Y} = \sum \Big(\frac{1}{n}\Big) Y_i, \qquad b_1 = \sum k_i Y_i, \quad k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$$
Sampling Distribution of $\hat{Y}_h$

We want to show that the mean response and the estimate $b_1$ are uncorrelated:
$$\mathrm{Cov}(\bar{Y}, b_1) = \sigma^2\{\bar{Y}, b_1\} = 0$$
To do this we need the following result (A.32):
$$\sigma^2\Big\{\sum_{i=1}^{n} a_i Y_i,\ \sum_{i=1}^{n} c_i Y_i\Big\} = \sum_{i=1}^{n} a_i c_i \sigma^2\{Y_i\}$$
when the $Y_i$ are independent.
Sampling Distribution of $\hat{Y}_h$

Using this fact we have
$$\sigma^2\Big\{\sum_{i=1}^{n} \frac{1}{n} Y_i,\ \sum_{i=1}^{n} k_i Y_i\Big\} = \sum_{i=1}^{n} \frac{1}{n} k_i \sigma^2\{Y_i\} = \frac{\sigma^2}{n}\sum_{i=1}^{n} k_i = 0$$
So $\bar{Y}$ and $b_1$ are uncorrelated.
Sampling Distribution of $\hat{Y}_h$

This means that we can write down the variance
$$\sigma^2\{\hat{Y}_h\} = \sigma^2\{\bar{Y} + b_1(X_h - \bar{X})\}$$
(an alternative and equivalent form of the regression function). But we know that $\bar{Y}$ and $b_1$ are uncorrelated, so
$$\sigma^2\{\hat{Y}_h\} = \sigma^2\{\bar{Y}\} + \sigma^2\{b_1\}(X_h - \bar{X})^2$$
Sampling Distribution of $\hat{Y}_h$

We know (from last lecture)
$$\sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}, \qquad s^2\{b_1\} = \frac{\mathrm{MSE}}{\sum (X_i - \bar{X})^2}$$
And we can find
$$\sigma^2\{\bar{Y}\} = \frac{1}{n^2}\sum \sigma^2\{Y_i\} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
Sampling Distribution of $\hat{Y}_h$

So, plugging in, we get
$$\sigma^2\{\hat{Y}_h\} = \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum (X_i - \bar{X})^2}(X_h - \bar{X})^2$$
or
$$\sigma^2\{\hat{Y}_h\} = \sigma^2\Big(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\Big)$$
Sampling Distribution of $\hat{Y}_h$

Since we often won't know $\sigma^2$ we can, as usual, plug in our estimate $S^2 = \mathrm{SSE}/(n-2)$ to get our estimate of this sampling distribution variance:
$$s^2\{\hat{Y}_h\} = S^2\Big(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\Big)$$
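The formula can be illustrated on a small made-up data set; note that choosing $X_h$ at the mean of $X$ makes the second term vanish, so the variance there reduces to $\mathrm{MSE}/n$:

```python
import numpy as np

# Illustrative computation of s^2{Yhat_h} on a small made-up data
# set; X_h is chosen at the mean of X, where the second term vanishes
# and the estimated variance is smallest.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = X.size

Sxx = ((X - X.mean()) ** 2).sum()
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
b0 = Y.mean() - b1 * X.mean()
mse = ((Y - (b0 + b1 * X)) ** 2).sum() / (n - 2)

X_h = 3.5  # equals the mean of X for this data set
s2_yhat = mse * (1 / n + (X_h - X.mean()) ** 2 / Sxx)
print(s2_yhat)  # reduces to mse / n at the mean of X
```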
No surprise...

The studentized point estimator for the mean response is distributed as a $t$-distribution with $n-2$ degrees of freedom:
$$\frac{\hat{Y}_h - E\{Y_h\}}{s\{\hat{Y}_h\}} \sim t(n-2)$$
This means that we can construct confidence intervals in the same manner as before.
Confidence Intervals for $E(Y_h)$

The $1-\alpha$ confidence intervals for $E(Y_h)$ are
$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\,s\{\hat{Y}_h\}$$
From this, hypothesis tests can be constructed as usual.
Comments

- The variance of the estimator for $E(Y_h)$ is smallest near the mean of $X$. Designing studies such that the mean of $X$ is near $X_h$ will improve inference precision.
- When $X_h$ is zero, the variance of the estimator for $E(Y_h)$ reduces to the variance of the estimator $b_0$ for $\beta_0$.
Prediction interval for a single new observation

This essentially follows the sampling distribution arguments for $E(Y_h)$. If all regression parameters are known, then the $1-\alpha$ prediction interval for a new observation $Y_h$ is
$$E\{Y_h\} \pm z(1-\alpha/2)\,\sigma$$
Prediction interval for a single new observation

If the regression parameters are unknown, the $1-\alpha$ prediction interval for a new observation $Y_h$ is given by the following theorem:
$$\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\,s\{\mathrm{pred}\}$$
This is very nearly the same as the interval for the mean response, but it includes a correction for the additional variability arising from the fact that the new observation was not used in the original estimates of $b_1$, $b_0$, and $s^2$.
Prediction interval for a single new observation

We have
$$\sigma^2\{\mathrm{pred}\} = \sigma^2\{Y_h - \hat{Y}_h\} = \sigma^2\{Y_h\} + \sigma^2\{\hat{Y}_h\} = \sigma^2 + \sigma^2\{\hat{Y}_h\}$$
An unbiased estimator of $\sigma^2\{\mathrm{pred}\}$ is $s^2\{\mathrm{pred}\} = \mathrm{MSE} + s^2\{\hat{Y}_h\}$, which is given by
$$s^2\{\mathrm{pred}\} = \mathrm{MSE}\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$$
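A sketch of the full prediction interval on a small made-up data set; the critical value 2.776 is the table value $t(0.975;\, 4)$ for $n - 2 = 4$ degrees of freedom:

```python
import numpy as np

# Sketch of the prediction interval for a new observation at X_h on
# a small made-up data set; the critical value 2.776 is the table
# value t(0.975; 4) for n - 2 = 4 degrees of freedom.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
n = X.size

Sxx = ((X - X.mean()) ** 2).sum()
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
b0 = Y.mean() - b1 * X.mean()
mse = ((Y - (b0 + b1 * X)) ** 2).sum() / (n - 2)

X_h = 4.0
y_hat = b0 + b1 * X_h
s_pred = np.sqrt(mse * (1 + 1 / n + (X_h - X.mean()) ** 2 / Sxx))
t_crit = 2.776  # t(0.975; 4) read from a t table
print((y_hat - t_crit * s_pred, y_hat + t_crit * s_pred))
```

Note that $s\{\mathrm{pred}\}$ is always larger than $s\{\hat{Y}_h\}$ because of the extra MSE term for the new observation's own error.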