Regression. Marc H. Mehlman University of New Haven


Regression Marc H. Mehlman marcmehlman@yahoo.com University of New Haven "The statistician knows that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world." George Box (University of New Haven) Regression 1 / 41

Table of Contents 1 Simple Regression 2 Confidence Intervals and Significance Tests 3 Variation 4 Chapter #10 R Assignment (University of New Haven) Regression 2 / 41

Simple Regression (University of New Haven) Regression 3 / 41

Simple Regression Let X = the predictor (independent) variable and Y = the response (dependent) variable. Given a bivariate random variable, (X, Y), is there a linear (straight-line) association between X and Y (plus some randomness)? And if so, what is it and how much randomness? Definition (Statistical Model of Simple Linear Regression) Given a predictor, x, the response is y = β_0 + β_1 x + ε_x, where β_0 + β_1 x is the mean response at x. The noise terms, the ε_x's, are assumed to be independent of each other and to be randomly sampled from N(0, σ). The parameters of the model are β_0, β_1 and σ. (University of New Haven) Regression 4 / 41
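As a quick illustration of this model, here is a minimal R sketch (with hypothetical parameter values β_0 = 3, β_1 = 0.5, σ = 2) that simulates responses and overlays the population regression line:
> set.seed(1)                                  # for reproducibility
> beta0 <- 3; beta1 <- 0.5; sigma <- 2         # hypothetical parameter values
> x <- runif(50, 0, 10)                        # 50 predictor values
> y <- beta0 + beta1*x + rnorm(50, 0, sigma)   # y = beta_0 + beta_1*x + noise
> plot(x, y, main = "Simulated simple regression data")
> abline(beta0, beta1, col = "red")            # the true population regression line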

Simple Regression Conditions for Regression Inference The figure below shows the regression model when the conditions are met. The line in the figure is the population regression line μ_y = β_0 + β_1 x. For each possible value of the explanatory variable x, the mean of the responses μ(y|x) moves along this line. The Normal curves show how y will vary when x is held fixed at different values. All the curves have the same standard deviation σ, so the variability of y is the same for all values of x. The value of σ determines whether the points fall close to the population regression line (small σ) or are widely scattered (large σ). (University of New Haven) Regression 5 / 41

Simple Regression [Four example scatterplots, each with the fitted line y = 3 + 0.5x:] (1) Moderate linear association; regression OK. (2) Obvious nonlinear relationship; regression inappropriate. (3) One extreme outlier, requiring further examination. (4) Only two values for x; a redesign is due here. (University of New Haven) Regression 6 / 41

Simple Regression Given a bivariate random sample from the simple linear regression model, (x_1, y_1), (x_2, y_2), …, (x_n, y_n), one wishes to estimate the parameters of the model, (β_0, β_1, σ). Given an arbitrary line, y = mx + b, define the sum of squares of errors to be Σ_{i=1}^{n} [y_i − (m x_i + b)]². Using calculus, one can find the least squares regression line, y = b_0 + b_1 x, that minimizes the sum of squares of errors. (University of New Haven) Regression 7 / 41

Simple Regression Theorem (Estimating β_0 and β_1) Given the bivariate random sample, (x_1, y_1), …, (x_n, y_n), the least squares regression line, y = b_0 + b_1 x, is obtained by letting b_1 = r (s_y / s_x) and b_0 = ȳ − b_1 x̄, where b_0 is an unbiased estimator of β_0 and b_1 is an unbiased estimator of β_1. Note: The point (x̄, ȳ) always lies on the regression line, though there is no reason to believe that (x̄, ȳ) is one of the data points. One can also calculate b_1 using b_1 = [n Σ_{j=1}^{n} x_j y_j − (Σ_{j=1}^{n} x_j)(Σ_{j=1}^{n} y_j)] / [n Σ_{j=1}^{n} x_j² − (Σ_{j=1}^{n} x_j)²]. (University of New Haven) Regression 8 / 41
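A minimal R sketch of these formulas, applied to the built-in trees data used in the examples below (Height as predictor, Girth as response) and checked against lm():
> x  <- trees$Height; y <- trees$Girth
> b1 <- cor(x, y) * sd(y) / sd(x)          # b_1 = r * (s_y / s_x)
> b0 <- mean(y) - b1 * mean(x)             # b_0 = ybar - b_1 * xbar
> c(intercept = b0, slope = b1)            # about -6.188 and 0.256
> coef(lm(Girth ~ Height, data = trees))   # lm() returns the same values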

Simple Regression Example > plot(trees$Girth~trees$Height,main="Girth vs Height") > abline(lm(trees$Girth ~ trees$Height), col="red") [Scatterplot of Girth (about 8 to 20) against Height (about 65 to 85), with the fitted regression line.] Since both variables come from trees, in order for the R command lm (linear model) to work, trees has to be in the R format, data.frame. > class(trees) # "trees" is in data.frame format - lm will work. [1] "data.frame" > g.lm=lm(Girth~Height,data=trees) > coef(g.lm) (Intercept) Height -6.1883945 0.2557471 (University of New Haven) Regression 9 / 41


Simple Regression Definition The predicted value of y at x_j is defined to be ŷ_j = b_0 + b_1 x_j. The predicted value, ŷ, is an unbiased estimator of the mean response, μ_y. Example Using the R dataset trees, one wants the predicted girth of three trees, of heights 74, 83 and 91 respectively. One uses the regression model Girth ~ Height for our predictions. The work below is done in R. > g.lm=lm(Girth~Height,data=trees) > predict(g.lm,newdata=data.frame(Height=c(74,83,91))) 1 2 3 12.73689 15.03862 17.08459 (University of New Haven) Regression 10 / 41

Simple Regression "Never make forecasts, especially about the future." Samuel Goldwyn The regression line only has predictive value for y at x if 1. ρ ≠ 0 (if there is no significant linear correlation, don't use the regression line for predictions; if ρ = 0, then ȳ is the best predictor of y at x), and 2. one only predicts y for x's within the range of the x_j's (one does not predict the girth of a tree with a height of 1000 feet). Interpolate, don't extrapolate. r (or r²) is a measure of how well the regression equation fits the data: bigger |r| ⇒ the data fit the regression line better ⇒ better prediction. (University of New Haven) Regression 11 / 41

Simple Regression Definition The variance of the observed y_j's about the predicted ŷ_j's is s² = Σ (y_j − ŷ_j)² / (n − 2) = [Σ y_j² − b_0 Σ y_j − b_1 Σ x_j y_j] / (n − 2), which is an unbiased estimator of σ². The standard error of estimate (also called the residual standard error) is s, an estimator of σ. Note: (b_0, b_1, s) is an estimator of the parameters of the simple linear regression model, (β_0, β_1, σ). Furthermore, b_0, b_1 and s² are unbiased estimators of β_0, β_1 and σ². (University of New Haven) Regression 12 / 41
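A small R sketch of this definition for the trees model, compared with the residual standard error that R itself reports:
> g.lm <- lm(Girth ~ Height, data = trees)
> n <- nrow(trees)
> sqrt(sum(residuals(g.lm)^2) / (n - 2))   # s, about 2.728 (see the summary output later)
> summary(g.lm)$sigma                      # R reports the same value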

Simple Regression Outliers and influential points Outlier: An observation that lies outside the overall pattern. Influential individual: An observation that markedly changes the regression if removed. This is often an isolated point. Child 19 = outlier (large residual): Child 19 is an outlier of the relationship (it is unusually far from the regression line, vertically). Child 18 = potential influential individual: Child 18 is isolated from the rest of the points, and might be an influential point. (University of New Haven) Regression 13 / 41

Simple Regression [Three fits are compared: all data, without Child 18, and without Child 19.] Influential: Child 18 changes the regression line substantially when it is removed. So, Child 18 is indeed an influential point. Child 19 is an outlier of the relationship, but it is not influential (the regression line changes very little when it is removed). (University of New Haven) Regression 14 / 41
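One practical way to check influence is to refit without the suspect observation and compare the coefficients; a sketch using the trees model (the row index i below is purely hypothetical), with cooks.distance() as a standard numerical measure:
> g.lm  <- lm(Girth ~ Height, data = trees)
> i     <- 31                                      # hypothetical suspect observation
> g.lm2 <- lm(Girth ~ Height, data = trees[-i, ])  # refit without observation i
> rbind(all = coef(g.lm), dropped = coef(g.lm2))   # a large change suggests influence
> cooks.distance(g.lm)[i]                          # a common single-number influence measure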

Simple Regression Definition Given a data point, (x_j, y_j), the residual of that point is y_j − ŷ_j. Note: 1. Outliers are data points with large residuals. 2. The residuals should be approximately N(0, σ). (University of New Haven) Regression 15 / 41

Simple Regression R command for finding residuals: Example
> g.lm=lm(Girth~Height,data=trees)
> residuals(g.lm)
         1          2          3          4          5          6          7
-3.4139043 -1.8351687 -1.1236745 -1.7253986 -3.8271227 -4.2386170  0.3090842
         8          9         10         11         12         13         14
-1.9926400 -3.1713756 -1.7926400 -2.7156285 -1.8483871 -1.8483871  0.2418428
        15         16         17         18         19         20         21
-0.9926400  0.1631072 -2.6501112 -2.5058584  1.7303485  3.6205784  0.2401187
        22         23         24         25         26         27         28
-0.0713756  1.7631072  3.7746014  2.7958658  2.7728773  2.7171301  3.6286244
        29         30         31
 3.7286244  3.7286244  4.5383945
(University of New Haven) Regression 16 / 41

Simple Regression Definition Given bivariate data, (x_1, y_1), …, (x_n, y_n), the residual plot is a plot of the residuals against the x_j's. If (X, Y) is bivariate normal, the residuals satisfy the Homoscedasticity Assumption: Definition (Homoscedasticity Assumption) The assumption that the variance around the regression line is the same for all values of the predictor variable X. In other words, the pattern of the spread of the residual points around the x-axis does not change as one travels from left to right along the x-axis. There should be no discernible patterns in the residual plot. (University of New Haven) Regression 17 / 41
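A short R sketch of such a residual plot for the trees model used above; under homoscedasticity the vertical spread should look roughly constant as one moves along the x-axis:
> g.lm <- lm(Girth ~ Height, data = trees)
> plot(trees$Height, residuals(g.lm),
+      xlab = "Height", ylab = "Residual", main = "Residual plot")
> abline(h = 0, col = "red")   # residuals should scatter evenly about this line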

Simple Regression R commands for checking whether the linear model applies (residuals approximately N(0, σ)). Example > g.lm=lm(Girth~Height,data=trees) > par(mfrow=c(2,2)) # visualize four graphs at once > plot(g.lm) > par(mfrow=c(1,1)) # reset the graphics defaults [Four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours).] (University of New Haven) Regression 18 / 41

Confidence Intervals and Significance Tests (University of New Haven) Regression 19 / 41

Confidence Intervals and Significance Tests Theorem (Hypothesis Tests and Confidence Intervals for β_0 and β_1) Let SE_b1 = s / √(Σ_{j=1}^{n} (x_j − x̄)²) and SE_b0 = s √(1/n + x̄² / Σ_{j=1}^{n} (x_j − x̄)²). SE_b0 and SE_b1 are the standard errors of the intercept, β_0, and the slope, β_1, for the least squares regression line. To test the hypothesis H_0: β_1 = 0, use the test statistic t = b_1 / SE_b1 ~ t(n − 2). A level (1 − α)100% confidence interval for the slope β_1 is b_1 ± t*(n − 2) SE_b1. To test the hypothesis H_0: β_0 = b, use the test statistic t = (b_0 − b) / SE_b0 ~ t(n − 2). A level (1 − α)100% confidence interval for the intercept β_0 is b_0 ± t*(n − 2) SE_b0, where t*(n − 2) is the upper α/2 critical value of t(n − 2). Accepting H_0: β_1 = 0 is equivalent to accepting H_0: ρ = 0. (University of New Haven) Regression 20 / 41
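A brief R sketch of how these quantities are obtained in practice for the trees model: summary() reports the standard errors and t tests, and confint() gives the confidence intervals.
> g.lm <- lm(Girth ~ Height, data = trees)
> summary(g.lm)$coefficients   # estimates, SE_b0, SE_b1, t statistics, p-values
> confint(g.lm, level = 0.95)  # 95% CIs: b_0 +/- t*(n-2) SE_b0 and b_1 +/- t*(n-2) SE_b1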

Confidence Intervals and Significance Tests Example Example Infants who cry easily may be more easily stimulated than others. This may be a sign of higher IQ. Child development researchers explored the relationship between the crying of infants 4 to 10 days old and their later IQ test scores. A snap of a rubber band on the sole of the foot caused the infants to cry. The researchers recorded the crying and measured its intensity by the number of peaks in the most active 20 seconds. They later measured the children's IQ at age three years using the Stanford-Binet IQ test. A scatterplot and Minitab output for the data from a random sample of 38 infants are shown below. Do these data provide convincing evidence that there is a positive linear relationship between crying counts and IQ in the population of infants? (University of New Haven) Regression 21 / 41

Confidence Intervals and Significance Tests Example (cont.) Example We want to perform a test of H_0: β_1 = 0 versus H_a: β_1 > 0, where β_1 is the true slope of the population regression line relating crying count to IQ score. The scatterplot suggests a moderately weak positive linear relationship between crying peaks and IQ. The residual plot shows a random scatter of points about the residual = 0 line. IQ scores of individual infants should be independent. The Normal probability plot of the residuals shows a slight curvature, which suggests that the responses may not be Normally distributed about the line at each x-value. With such a large sample size (n = 38), however, the t procedures are robust against departures from Normality. The residual plot shows a fairly equal amount of scatter around the horizontal line at 0 for all x-values. (University of New Haven) Regression 22 / 41

Confidence Intervals and Significance Tests Example (cont.) Example With no obvious violations of the conditions, we proceed to inference. The test statistic and P-value can be found in the Minitab output: t = b_1 / SE_b1 = 1.4929 / 0.4870 = 3.07. The Minitab output gives P = 0.004 as the P-value for a two-sided test. The P-value for the one-sided test is half of this, P = 0.002. The P-value, 0.002, is less than our α = 0.05 significance level, so we have enough evidence to reject H_0 and conclude that there is a positive linear relationship between intensity of crying and IQ score in the population of infants. (University of New Haven) Regression 23 / 41
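One can recover these P-values from the t statistic with R's t distribution functions (here df = n − 2 = 36); a small sketch:
> t.stat <- 1.4929 / 0.4870                     # about 3.07
> 2 * pt(t.stat, df = 36, lower.tail = FALSE)   # two-sided P-value, about 0.004
> pt(t.stat, df = 36, lower.tail = FALSE)       # one-sided P-value, about 0.002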

Confidence Intervals and Significance Tests Given x*, the mean response is μ_y = β_0 + β_1 x*. However, since β_0 and β_1 are not known, one uses μ̂_y = ŷ = b_0 + b_1 x* as an estimator of μ_y. Theorem ((1 − α)100% Confidence Interval for the mean response, μ_y) A (1 − α)100% confidence interval for the mean response, μ_y, when x takes on the value x* is μ̂_y ± m, where the margin of error is m = t_{α/2}(n − 2) · s · √(1/n + (x* − x̄)² / Σ_{j=1}^{n} (x_j − x̄)²). The quantity multiplying t_{α/2}(n − 2) is SE_μ̂, the standard error of the mean response. (University of New Haven) Regression 24 / 41
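In R this confidence interval for the mean response comes from predict() with interval = "confidence"; a sketch for the trees model at a hypothetical x* = 76:
> g.lm <- lm(Girth ~ Height, data = trees)
> predict(g.lm, newdata = data.frame(Height = 76),
+         interval = "confidence", level = 0.95)   # fit, lwr, upr for mu_y at Height 76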

Confidence Intervals and Significance Tests A confidence interval for μ_y: [Figure: the population regression line μ_y = β_0 + β_1 x, with the estimate μ̂_y = ŷ used to predict μ_y at x = x*.] (University of New Haven) Regression 25 / 41

Confidence Intervals and Significance Tests Definition Let y be a future observation corresponding to x*. A (1 − α)100% prediction interval for y is an interval that contains y (1 − α)100% of the time. A prediction interval is a confidence interval that not only has to contend with the variability of the response variable, but also with the fact that β_0 and β_1 can only be approximated. Theorem ((1 − α)100% Prediction Interval for y given x = x*) A (1 − α)100% prediction interval for y given x = x* is ŷ ± m, where ŷ = b_0 + b_1 x* and the margin of error is m = t_{α/2}(n − 2) · s · √(1 + 1/n + (x* − x̄)² / Σ_{j=1}^{n} (x_j − x̄)²) = t_{α/2}(n − 2) · SE_ŷ. (University of New Haven) Regression 26 / 41

Confidence Intervals and Significance Tests A confidence interval for y: (University of New Haven) Regression 27 / 41

Confidence Intervals and Significance Tests R commands: Example
> g.lm=lm(Girth~Height,data=trees)
> predict(g.lm,newdata=data.frame(Height=c(74,83,91)),interval="prediction",level=.90)
       fit       lwr      upr
1 12.73689  8.020516 17.45327
2 15.03862 10.238843 19.83839
3 17.08459 11.971691 22.19750
> summary(g.lm)
Call: lm(formula = Girth ~ Height, data = trees)
Residuals:
    Min      1Q  Median      3Q     Max
-4.2386 -1.9205 -0.0714  2.7450  4.5384
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.18839    5.96020  -1.038  0.30772
Height       0.25575    0.07816   3.272  0.00276 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.728 on 29 degrees of freedom
Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758
(University of New Haven) Regression 28 / 41

Variation (University of New Haven) Regression 29 / 41

Variation Variation: y_j − ȳ = (ŷ_j − ȳ) + (y_j − ŷ_j), that is, total deviation = explained deviation + unexplained deviation. From here, using some math, one gets the following sums of squares (SS): Σ_{j=1}^{n} (y_j − ȳ)² = Σ_{j=1}^{n} (ŷ_j − ȳ)² + Σ_{j=1}^{n} (y_j − ŷ_j)², that is, SS_TOT (total variation) = SS_A (explained variation) + SS_E (unexplained variation). (University of New Haven) Regression 30 / 41
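A small R sketch verifying this decomposition for the trees model (SS_A and SS_E here should agree with the anova() output shown later, roughly 79.7 and 215.8):
> g.lm   <- lm(Girth ~ Height, data = trees)
> y      <- trees$Girth
> yhat   <- fitted(g.lm)
> SS_TOT <- sum((y - mean(y))^2)       # total variation
> SS_A   <- sum((yhat - mean(y))^2)    # explained variation
> SS_E   <- sum((y - yhat)^2)          # unexplained variation
> c(SS_TOT, SS_A + SS_E)               # the two agree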

Variation Definition The coefficient of determination is the portion of the variation in y explained by the regression equation: r² = SS_A / SS_TOT = Σ_{j=1}^{n} (ŷ_j − ȳ)² / Σ_{j=1}^{n} (y_j − ȳ)². Properties of the Coefficient of Determination: 1. r² = (r)² = (correlation coefficient)². 2. r² = proportion of the variation of Y that is explained by the linear relationship between X and Y. Example Using R, since > (cor(trees$Girth,trees$Height))^2 [1] 0.2696518 one concludes that approximately 27% of the variation in tree Girth is explained by tree Height and 73% by other factors. (University of New Haven) Regression 31 / 41

Variation r = 0.3, r² = 0.09, or 9%: the regression model explains less than 10% of the variation in y. r = 0.7, r² = 0.49, or 49%: the regression model explains nearly half of the variation in y. r = 0.99, r² = 0.9801, or ~98%: the regression model explains almost all of the variation in y. (University of New Haven) Regression 32 / 41

Variation With each of the sums of squares is associated a degrees of freedom, where df of SS_TOT = df of SS_A + df of SS_E. Also associated with SS_A and SS_E are the mean squares, each of which equals the sum of squares divided by its degrees of freedom.
Source   SS       df       MS
Model    SS_A     1        MS_A = SS_A / 1
Error    SS_E     n − 2    MS_E = s² = Σ_{j=1}^{n} (y_j − ŷ_j)² / (n − 2) = SS_E / (n − 2)
Total    SS_TOT   n − 1
The above is a partial ANOVA table. ANOVA is short for analysis of variance. (University of New Haven) Regression 33 / 41

Variation Theorem (ANOVA F Test for Simple Regression) In the simple linear regression model, consider H_0: β_1 = 0 versus H_A: β_1 ≠ 0. If H_0 holds, f = MS_A / MS_E is from F(1, n − 2), and one uses a right-sided test. Remember, H_0: β_1 = 0 is equivalent to H_0: ρ = 0. The following is an ANOVA table for simple linear regression:
Source   SS       df       MS      ANOVA F statistic   p value
Model    SS_A     1        MS_A    f                   P(F(1, n − 2) ≥ f)
Error    SS_E     n − 2    MS_E
Total    SS_TOT   n − 1
(University of New Haven) Regression 34 / 41
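A quick sketch of this test done by hand in R, using the mean squares from the trees ANOVA table on the next slide:
> MS_A <- 79.665; MS_E <- 7.440; n <- 31
> f <- MS_A / MS_E                                  # about 10.7
> pf(f, df1 = 1, df2 = n - 2, lower.tail = FALSE)   # right-tail p-value, about 0.0028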

Variation Example (cont.)
> g.lm=lm(Girth~Height,data=trees)
> anova(g.lm)
Analysis of Variance Table
Response: Girth
          Df  Sum Sq Mean Sq F value   Pr(>F)
Height     1  79.665  79.665  10.707 0.002758 **
Residuals 29 215.772   7.440
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(University of New Haven) Regression 35 / 41

Variation Since β_1 = 0 ⟺ ρ = 0, the following is equivalent to the ANOVA F test. Theorem (Test for correlation) Assuming that X and Y are bivariate normal (the conditions for simple linear regression), consider the hypotheses H_0: ρ = 0 vs H_A: ρ ≠ 0. The test statistic is t = r √((n − 2) / (1 − r²)) ~ t(n − 2) under H_0. Remember, accepting H_0: β_1 = 0 is equivalent to accepting H_0: ρ = 0. It can be shown that F = t². Also, it makes no difference whether X or Y is the independent or dependent variable - the test is for correlation. An advantage of using the above t test is that one can test one-sided alternative hypotheses. R command: > cor.test(x,y) (one can also do one-sided tests with R). (University of New Haven) Regression 36 / 41


Variation Using R Example (cont.)
> cor.test(trees$Girth,trees$Height)
Pearson's product-moment correlation
data: trees$Girth and trees$Height
t = 3.2722, df = 29, p-value = 0.002758
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: 0.2021327 0.7378538
sample estimates: cor 0.5192801
Note that one is assuming that the (trees$Height, trees$Girth) are sampled from a bivariate normal distribution. (University of New Haven) Regression 37 / 41

Variation Example Each day, for the last 63 days, measurements of the time Joe spends sleeping and the time he spends watching TV are taken. Assume time spent sleeping and time spent watching TV form a bivariate normal random variable. A sample correlation of r = 0.12 is calculated. Find the p value of H_0: ρ = 0 versus H_A: ρ ≠ 0. Solution: > tstat=0.12*sqrt((63-2)/(1-0.12^2)) > tstat [1] 0.9440518 > 2*(1-pt(tstat,61)) [1] 0.3488675 There is little evidence that the time Joe spends sleeping and the time Joe spends watching TV are correlated. (University of New Haven) Regression 38 / 41


Chapter #10 R Assignment (University of New Haven) Regression 39 / 41

Chapter #10 R Assignment (from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition, Duxbury, 1990)) Fifteen alligators were captured and two measurements were made on each of the alligators. The weight (in pounds) was recorded along with the snout vent length (in inches; this is the distance from the back of the head to the end of the nose). The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length: lnlength ~ lnweight. The authors analyzed the data on the log scale (natural logarithms) and we will follow their approach for consistency. > lnlength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76, + 3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78) > lnweight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50, + 3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25) (University of New Haven) Regression 40 / 41

Chapter #10 R Assignment 1. Create a scatterplot of lnlength ~ lnweight, complete with the regression line. 2. What are the slope and y-intercept of the regression line? 3. Predict lnlength when lnweight is five. 4. Use graphs to decide if lnlength ~ lnweight satisfies the requirements for being a linear model. 5. Find a 95% prediction interval for lnlength when lnweight is five. 6. What is the p value of a test of H_0: β_1 = 0 versus H_A: β_1 ≠ 0? 7. What is the standard error of estimate? 8. What is the coefficient of determination, R²? 9. What are the explained variation, the unexplained variation and the total variation? 10. What is the F statistic of H_0: β_1 = 0 versus H_A: β_1 ≠ 0, and what are its degrees of freedom? 11. Using the correlation test, what is the p value of a test of H_0: ρ = 0 versus H_A: ρ ≠ 0? (University of New Haven) Regression 41 / 41