Introduction to Statistical Inference
Lecture 8: Linear regression, tests and confidence intervals
Contents
Linear model
Consider n observations (pairs) (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) of (x, y). Assume that the values y_i are observed values of a random variable y, and that the values x_i are observed non-random values of x. Assume that the values y_i depend linearly on the values x_i. The simple (one explanatory variable) linear model can be presented in the following way:

y_i = b_0 + b_1 x_i + ε_i,  i = 1, ..., n,

where the regression coefficients b_0 and b_1 are unknown constants and the expected value of the residuals ε_i is E[ε_i] = 0.
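As a concrete illustration, the least-squares estimates of b_0 and b_1 can be computed directly from their closed-form expressions. A minimal sketch in Python; the data below are hypothetical and chosen to lie exactly on a line:

```python
def fit_simple_ols(x, y):
    """Least-squares estimates of the intercept b0 and slope b1
    for the model y_i = b0 + b1 * x_i + eps_i."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data lying exactly on the line y = 2 + 3x
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)  # → 2.0 3.0
```

With noise-free data the estimates recover the true coefficients exactly; with noisy data they are the usual least-squares estimates.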
Linear model, assumptions for parametric tests and confidence intervals
We now consider testing the parameters of a linear regression model and calculating confidence intervals for the estimated parameters under the classical assumptions:
- Measurement of the values x_i is error-free.
- The residuals are independent of the values x_i.
- The residuals are independently and identically distributed (iid).
- The expected value of the residuals is E[ε_i] = 0, i ∈ {1, ..., n}.
- The residuals have the same variance E[ε_i²] = σ², i ∈ {1, ..., n}.
- The residuals are uncorrelated, i.e. ρ(ε_i, ε_j) = 0, i ≠ j.
- The residuals are normally distributed.
Testing the slope of the regression
The null hypothesis is H_0: b_1 = b_1^0. (Typically the null hypothesis b_1 = 0 is tested.) Possible alternative hypotheses: H_1: b_1 > b_1^0 (one-tailed), H_1: b_1 < b_1^0 (one-tailed) or H_1: b_1 ≠ b_1^0 (two-tailed).
Testing the slope of the regression
The t test statistic is

t = (b̂_1 − b_1^0) / (s / (√(n−1) · s_x)),

where s² = var(ε̂) (see Lecture 7) and s_x² is the sample variance of the variable x. Under the null hypothesis H_0, the test statistic follows Student's t-distribution with n−2 degrees of freedom. Under the null hypothesis H_0, the expected value of the test statistic is E[t] = 0. Large absolute values of the test statistic suggest that the null hypothesis H_0 does not hold. The null hypothesis H_0 is rejected if the p-value is small enough.
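The statistic can be computed directly from data. A sketch, assuming (as in Lecture 7) that s² is the residual variance with n − 2 in the denominator and s_x² the sample variance with n − 1; the data are made up for illustration:

```python
import math
import statistics

def slope_t(x, y, b1_null=0.0):
    """t statistic for H0: b1 = b1_null in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))  # residual sd, n-2 df
    s_x = statistics.stdev(x)  # sample standard deviation, n-1 denominator
    return (b1 - b1_null) / (s / (math.sqrt(n - 1) * s_x))

# Made-up data; the resulting t (about 13.3 with 2 df) is strong
# evidence against H0: b1 = 0
x = [1, 2, 3, 4]
y = [2, 4, 6, 9]
t = slope_t(x, y)
```

Note that √(n−1) · s_x = √(Σ(x_i − x̄)²), so the denominator is the usual standard error of the slope.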
Testing the slope of the regression
Note that if we have a simple linear model (one response variable and one explanatory variable), and we wish to test whether the coefficient of determination is 0, we can do that by testing the null hypothesis H_0: b_1 = 0.
Slope of the regression, confidence interval
Under the normality assumption, a (1 − α) · 100% confidence interval for the slope of the regression can be given as

(b̂_1 − t_{n−2,α/2} · s / (√(n−1) · s_x),  b̂_1 + t_{n−2,α/2} · s / (√(n−1) · s_x)),

where s² = var(ε̂), s_x² is the sample variance of the variable x, t_{n−2} is the Student's t-distribution with n−2 degrees of freedom, and t_{n−2,α/2} is the (1 − α/2) · 100th percentile of the t(n−2)-distribution.
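The interval can be computed once the critical value t_{n−2,α/2} is known. A minimal sketch; the data are made up, and the critical value (t_{2,0.025} = 4.303 for n = 4) is taken from a t table rather than computed:

```python
import math

def slope_ci(x, y, t_crit):
    """Confidence interval for the slope b1; t_crit = t_{n-2, alpha/2}
    must be supplied, e.g. from a t table."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2
                      for xi, yi in zip(x, y)) / (n - 2))
    half = t_crit * s / math.sqrt(s_xx)  # = t_crit * s / (sqrt(n-1) * s_x)
    return b1 - half, b1 + half

# Made-up data; n = 4, so n - 2 = 2 degrees of freedom
lo, hi = slope_ci([1, 2, 3, 4], [2, 4, 6, 9], t_crit=4.303)
```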
Constant term
Testing the constant term of the regression
The null hypothesis is H_0: b_0 = b_0^0. Possible alternative hypotheses: H_1: b_0 > b_0^0 (one-tailed), H_1: b_0 < b_0^0 (one-tailed) or H_1: b_0 ≠ b_0^0 (two-tailed).
Testing the constant term of the regression
The t test statistic is

t = (b̂_0 − b_0^0) / (s · √(Σ_{i=1}^n x_i²) / (√(n(n−1)) · s_x)),

where s² = var(ε̂) and s_x² is the sample variance of the variable x. Under the null hypothesis H_0, the test statistic follows Student's t-distribution with n−2 degrees of freedom. Under the null hypothesis H_0, the expected value of the test statistic is E[t] = 0. Large absolute values of the test statistic suggest that the null hypothesis H_0 does not hold. The null hypothesis H_0 is rejected if the p-value is small enough.
Constant term, confidence interval
Under the normality assumption, a (1 − α) · 100% confidence interval for the constant term of the regression can be given as

(b̂_0 − t_{n−2,α/2} · s · √(Σ_{i=1}^n x_i²) / (√(n(n−1)) · s_x),  b̂_0 + t_{n−2,α/2} · s · √(Σ_{i=1}^n x_i²) / (√(n(n−1)) · s_x)),

where s² = var(ε̂), s_x² is the sample variance of the variable x, t_{n−2} is the Student's t-distribution with n−2 degrees of freedom, and t_{n−2,α/2} is the (1 − α/2) · 100th percentile of the t(n−2)-distribution.
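The test and the interval for the constant term share the same standard error, so both fit in one sketch. The data and the tabulated critical value t_{2,0.025} = 4.303 are illustrative choices, not from the source:

```python
import math

def intercept_test_and_ci(x, y, t_crit, b0_null=0.0):
    """t statistic for H0: b0 = b0_null and the matching confidence
    interval; t_crit = t_{n-2, alpha/2} from a t table."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2
                      for xi, yi in zip(x, y)) / (n - 2))
    # se(b0) = s * sqrt(sum x_i^2) / (sqrt(n(n-1)) * s_x)
    #        = s * sqrt(sum x_i^2 / (n * s_xx))
    se_b0 = s * math.sqrt(sum(xi * xi for xi in x) / (n * s_xx))
    t = (b0 - b0_null) / se_b0
    return t, (b0 - t_crit * se_b0, b0 + t_crit * se_b0)

# Made-up data; n = 4, so 2 degrees of freedom
t0, (lo0, hi0) = intercept_test_and_ci([1, 2, 3, 4], [2, 4, 6, 9],
                                       t_crit=4.303)
```

Here |t| is small and the interval contains 0, so H_0: b_0 = 0 would not be rejected for this toy data.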
Predicting the values of variable y
A prediction ỹ for the value of the variable y, when x has the value x̃, can be given as

ỹ_x̃ = b̂_0 + b̂_1 x̃.

The more observations there are, the smaller the variance σ² is, and the closer x̃ is to the sample mean of x, the better (more accurate) the prediction is. Note that x̃ should be within the range of the observed values of the variable x.
Predicting the values of variable y
Under the normality assumption, a (1 − α) · 100% confidence interval for the value of y, when x has the value x̃, can be given as

b̂_0 + b̂_1 x̃ ± t_{n−2,α/2} · s · √(1 + 1/n + (x̃ − x̄)² / ((n−1) s_x²)),

where s² = var(ε̂), s_x² is the sample variance of the variable x, t_{n−2} is the Student's t-distribution with n−2 degrees of freedom, and t_{n−2,α/2} is the (1 − α/2) · 100th percentile of the t(n−2)-distribution.
Predicting the expected value of variable y
A prediction μ̂_y for the expected value E[y], when x has the value x̃, can be given as

μ̂_{y|x̃} = b̂_0 + b̂_1 x̃.
Note that ỹ_x̃ estimates the value of a random variable, whereas μ̂_{y|x̃} estimates an expected value (a constant). The estimate ỹ_x̃ estimates the values of the variable y on the individual level, when x has the value x̃. The estimate μ̂_{y|x̃} estimates the mean value of the variable y, when x has the value x̃. Even though the point estimates are the same, the corresponding confidence intervals are not! The confidence interval for the value of y is wider. It is easier to predict average behaviour than to predict individual values.
Predicting the expected value of variable y
Under the normality assumption, a (1 − α) · 100% confidence interval for E[y], when x has the value x̃, can be given as

b̂_0 + b̂_1 x̃ ± t_{n−2,α/2} · s · √(1/n + (x̃ − x̄)² / ((n−1) s_x²)),

where s² = var(ε̂), s_x² is the sample variance of the variable x, t_{n−2} is the Student's t-distribution with n−2 degrees of freedom, and t_{n−2,α/2} is the (1 − α/2) · 100th percentile of the t(n−2)-distribution.
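The two intervals differ only in the leading 1 under the square root, so the interval for an individual value is always the wider one. A sketch comparing the half-widths at the same x̃ (made-up data, critical value hardcoded from a t table):

```python
import math

def interval_half_widths(x, y, x_new, t_crit):
    """Half-widths of (a) the prediction interval for an individual y
    and (b) the confidence interval for E[y], both at x = x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)   # equals (n-1) * s_x^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2
                      for xi, yi in zip(x, y)) / (n - 2))
    common = 1 / n + (x_new - x_bar) ** 2 / s_xx
    pred_half = t_crit * s * math.sqrt(1 + common)  # individual value of y
    mean_half = t_crit * s * math.sqrt(common)      # expected value E[y]
    return pred_half, mean_half

# Made-up data; x_new at the sample mean, t_{2, 0.025} = 4.303
pred_half, mean_half = interval_half_widths([1, 2, 3, 4], [2, 4, 6, 9],
                                            x_new=2.5, t_crit=4.303)
```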
Numerical example
The numerical example from the Lecture 7 slides continues... A summer trainee is asked to predict the sales of Jack's cookies when 5500 units of Charles' cookies are sold. She should also calculate a 95% confidence interval for the sales.
Using the regression coefficients b̂_0 = 10723.87 and b̂_1 = −0.9386, the prediction of the sales of Jack's cookies, on the condition that c̃ = 5500 units of Charles' cookies are sold, can be given as

j̃_c̃ = b̂_0 + b̂_1 c̃ = 10723.87 − 0.9386 · 5500 = 5561.57.

The corresponding confidence interval can be given as

b̂_0 + b̂_1 c̃ ± t_{n−2,α/2} · s · √(1 + 1/n + (c̃ − c̄)² / ((n−1) s_c²)),

where t_{n−2,α/2} = t_{10,0.025} = 2.228, c̄ = 5567.833, s_c = 302.95 and s² = 11948.42.
The 95% confidence interval is

b̂_0 + b̂_1 c̃ ± t_{n−2,α/2} · s · √(1 + 1/n + (c̃ − c̄)² / ((n−1) s_c²))
= 5561.57 ± 2.228 · √11948.42 · √(1 + 1/12 + (5500 − 5567.833)² / (11 · 302.95²))
= (5308.093, 5815.047).

If 5500 units of Charles' cookies are sold, then the prediction for the sales of Jack's cookies is 5562 units. A 95% confidence interval for the prediction is (5308, 5815). The prediction seems reasonable. What about the confidence interval? Is there anything suspicious in the calculated confidence interval?
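The arithmetic can be checked with a few lines of Python using the summary values quoted above. Because those inputs are rounded, the recomputed endpoints differ slightly (by well under one unit) from the printed interval; only the point prediction is reproduced exactly:

```python
import math

# Summary values quoted on the slides (rounded)
b0, b1 = 10723.87, -0.9386
n, t_crit = 12, 2.228          # t_{10, 0.025}
c_bar, s_c = 5567.833, 302.95
s2 = 11948.42                  # residual variance s^2
c_new = 5500

pred = b0 + b1 * c_new         # point prediction: 5561.57
half = t_crit * math.sqrt(s2) * math.sqrt(
    1 + 1 / n + (c_new - c_bar) ** 2 / ((n - 1) * s_c ** 2))
lo, hi = pred - half, pred + half   # close to the printed (5308.093, 5815.047)
```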
Bootstrap confidence intervals for the regression coefficients
Consider the estimated residuals ε̂_1, ε̂_2, ..., ε̂_n and the fitted values ŷ_1, ŷ_2, ..., ŷ_n of the regression model. Collect a new sample ε̌_1, ε̌_2, ..., ε̌_n by picking n data points randomly with replacement from ε̂_1, ε̂_2, ..., ε̂_n. Form a bootstrap sample (x_1, y̌_1), (x_2, y̌_2), ..., (x_n, y̌_n), where y̌_i = ŷ_i + ε̌_i. Calculate estimates for the regression coefficients b_0 and b_1 from the bootstrap sample. Repeat this several times, for example 999 times. Now order all the estimates (the original ones and the 999 bootstrap estimates) from the smallest to the largest. An estimate for the 90% confidence interval (l, u) is obtained by choosing the 50th ordered estimate as l and the 951st estimate as u. An estimate for the 95% confidence interval (l, u) is obtained by choosing the 25th estimate as l and the 976th estimate as u.
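The steps above can be sketched in Python for the slope b_1. The data are made up, and with 999 resamples plus the original estimate the 95% rule picks the 25th and 976th ordered values, exactly as on the slide:

```python
import random

def residual_bootstrap_slope_ci(x, y, n_boot=999, seed=1):
    """95% residual-bootstrap confidence interval for the slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    rng = random.Random(seed)
    estimates = [b1]                       # include the original estimate
    for _ in range(n_boot):
        # resample residuals with replacement, rebuild y, refit the slope
        y_star = [fi + rng.choice(residuals) for fi in fitted]
        yb = sum(y_star) / n
        b1_star = sum((xi - x_bar) * (yi - yb)
                      for xi, yi in zip(x, y_star)) / s_xx
        estimates.append(b1_star)
    estimates.sort()                       # 1000 ordered estimates
    return estimates[24], estimates[975]   # 25th and 976th ordered values

# Made-up noisy data around the line y = 2 + 3x
x = list(range(10))
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.5, 0.4, -0.1]
y = [2 + 3 * xi + e for xi, e in zip(x, noise)]
lo_b, hi_b = residual_bootstrap_slope_ci(x, y)
```

For this data the interval is narrow and centred near the true slope 3.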
Prediction, bootstrap confidence intervals
A prediction μ̂_y for the expected value E[y], when x has the value x̃, was given as μ̂_{y|x̃} = b̂_0 + b̂_1 x̃. Consider bootstrap estimates for the regression coefficients b_0 and b_1. One can calculate bootstrap confidence intervals for μ̂_{y|x̃} by replacing b̂_0 and b̂_1 by bootstrap estimates in the formula above. That is then repeated, for example, 999 times. After that, all the 1000 predictions are ordered and bootstrap confidence intervals are obtained.
Coefficient of determination, bootstrap confidence intervals
The bootstrap samples (x_1, y̌_1), (x_2, y̌_2), ..., (x_n, y̌_n) can also be used for calculating bootstrap confidence intervals for the coefficient of determination of the model. The coefficient of determination is estimated (separately) from every bootstrap sample. One can use, for example, 999 or 9999 bootstrap samples. After that, all the 1000 or 10000 estimates are ordered and bootstrap confidence intervals are obtained.
Bootstrap, alternative approach
Instead of bootstrapping from the estimated residuals, one may take bootstrap samples directly from the original observations (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Parameter estimates are then calculated from the bootstrap samples, the estimates are ordered, and bootstrap confidence intervals are obtained.
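This pairs (case-resampling) variant can be sketched similarly; here whole (x_i, y_i) pairs are resampled with replacement, and the data are again made up:

```python
import random

def pairs_bootstrap_slope_ci(x, y, n_boot=999, seed=1):
    """95% bootstrap CI for the slope, resampling observation pairs."""
    def slope(xs, ys):
        n = len(xs)
        xb = sum(xs) / n
        yb = sum(ys) / n
        return (sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys))
                / sum((xi - xb) ** 2 for xi in xs))

    rng = random.Random(seed)
    n = len(x)
    estimates = [slope(x, y)]              # original estimate first
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # with replacement
        estimates.append(slope([x[i] for i in idx],
                               [y[i] for i in idx]))
    estimates.sort()
    return estimates[24], estimates[975]   # 25th and 976th of 1000

# Made-up noisy data around the line y = 2 + 3x
x = list(range(10))
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.5, 0.4, -0.1]
y = [2 + 3 * xi + e for xi, e in zip(x, noise)]
lo_p, hi_p = pairs_bootstrap_slope_ci(x, y)
```

Unlike the residual bootstrap, this variant does not rely on the residuals being identically distributed, at the cost of treating the x_i as random.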
Non-linear dependence
What should be done if the dependence between two variables is non-linear?
- Try to linearise the variables.
- Try piecewise linear analysis.
- Try to fit some other shape, for example a parabola.
- If the dependent variable is binary, use logistic regression.
What should one do, if...
- the residuals are not normally distributed?
- the residuals are not independent of the values of the variable x?
- the residuals are heteroscedastic?
- the residuals are correlated?
Literature
- J. S. Milton, J. C. Arnold: Introduction to Probability and Statistics, McGraw-Hill, 1995.
- R. V. Hogg, J. W. McKean, A. T. Craig: Introduction to Mathematical Statistics, Pearson Education, 2005.
- A. C. Davison, D. V. Hinkley: Bootstrap Methods and their Applications, Cambridge University Press, 2009.
- P. Laininen: Todennäköisyys ja sen tilastollinen soveltaminen (Probability and its statistical application), Otatieto, 1998, number 586.
- I. Mellin: Tilastolliset menetelmät (Statistical methods), http://math.aalto.fi/opetus/sovtoda/materiaali.html.