Chapter 16: Understanding Relationships — Numerical Data

These notes reflect material from our text, Statistics: Learning from Data, First Edition, by Roxy Peck, published by CENGAGE Learning, 2015.

Linear models

For two quantitative variables, it is often convenient to distinguish between an explanatory (predictor) variable and a response (predicted) variable, denoted x and y, respectively. The means, µ_x and µ_y, standard deviations, σ_x and σ_y, and correlation coefficient, ρ, describe a population. Fitting y ~ x results in a linear model, y = β_0 + β_1 x, describing the population.

An association between the variables x and y is characterized by its direction (positive or negative), its form (linear or non-linear), and its strength (which, for linear relationships, is measured by the correlation).

The sample means, x̄ and ȳ, sample standard deviations, s_x and s_y, and sample correlation coefficient, r, describe a sample taken from the population. Point estimates for β_0 and β_1 are determined from the sample and are denoted b_0 and b_1. The linear model for the sample takes the form ŷ = b_0 + b_1 x. The residual, e_i = y_i − ŷ_i, measures the distance between the observed value, y_i, and the predicted value, ŷ_i, corresponding to a particular x_i. Regression analysis uses properties of a linear model constructed from a sample to infer properties of the linear relationship in the corresponding population.

Least squares line

Conditions for least squares: (1) a nearly linear relationship, (2) nearly normal residuals, (3) nearly constant variability.

Formulas for the regression coefficients:

    b_1 = r (s_y / s_x),    b_0 = ȳ − b_1 x̄.

Use the least squares line to predict y from x: ŷ = b_0 + b_1 x.

The center of mass of the sample lies on the least squares line: ȳ = b_0 + b_1 x̄.

The squared correlation, r², gives the proportion of the variance of the response variable explained by the explanatory variable.

Two quantitative variables

We illustrate simple regression with one of the examples explored by Agresti and Franklin in chapter 12, a data set describing 57 female high school athletes and their performances in several athletic activities. Read in the data set, select two athletic activities, and generate a scatterplot. We use x and y to describe these activities, rather than more descriptive names, to suggest that this type of analysis is widely applicable.

athletes <- read.csv("high_school_female_athletes.csv", header=TRUE)
head(athletes)
str(athletes)
summary(athletes)

x <- athletes$BRTF..60.          # number of 60 lb bench presses
y <- athletes$X1RM.BENCH..lbs.   # maximum bench press

plot(x, y, pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes")

[Figure: scatterplot of maximum bench press (lbs) versus number of 60 lb bench presses, "Female High School Athletes".]

Is there a suggestion of a linear relationship here? Use R's lm function to fit a linear model to these data.

plot(x, y, pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes")
athletes.lm <- lm(y ~ x)
abline(athletes.lm, col="orange")
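As a cross-check of the coefficient formulas from the Least squares line section, the slope and intercept can be recomputed directly from the sample statistics. This is a minimal sketch, assuming x, y, and athletes.lm are as defined above; the names r, b1, and b0 are just local variables for the illustration.

# recompute the least squares coefficients from summary statistics
r  <- cor(x, y)
b1 <- r * sd(y) / sd(x)         # slope:     b_1 = r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)    # intercept: b_0 = y-bar - b_1 * x-bar
c(b0, b1)
coefficients(athletes.lm)       # should agree with b0 and b1

r^2                             # proportion of the variance of y explained by x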

[Figure: the same scatterplot with the fitted least squares line added, "Female High School Athletes (lm)".]

A linear relationship in this context is described by an equation of the form ŷ = a + bx, where the coefficients a and b are part of the linear model. Create a function that calculates ŷ given x and use it to calculate a point along the regression line. The second student in the data set had an x value of 12. What value of y would this linear model predict for the second student?

coefficients(athletes.lm)
# (Intercept)           x
#   63.536856    1.491053

predict.y.hat <- function(x){
    a <- coefficients(athletes.lm)[1]
    b <- coefficients(athletes.lm)[2]
    y.hat <- as.numeric(a + b * x)
    return(y.hat)
}
predict.y.hat(12)
# 81.42949

We can use R's predict function to do the same calculation.

# use predict
new.data <- data.frame(x=12)
predict(athletes.lm, new.data)
#        1
# 81.42949

R's predict can calculate the predictions for every x in the data set.

# calculate y.hat for each student
y.hat <- predict(athletes.lm, data.frame(x, y))
head(data.frame(x, y, y.hat))
#    x  y    y.hat
# 1 10 80 78.44739
# 2 12 85 81.42949
# 3 20 85 93.35792
# 4  5 65 70.99212
# 5 12 95 81.42949
# 6 10 75 78.44739

A residual is the difference between an actual y and the predicted ŷ. Verify that the second student's residual is e = y − ŷ = 85 − 81.42949 = 3.570507.

Testing for association

Do the data plausibly cluster around this least squares line? Just how much evidence is there of a linear relationship in these data? We will test the null hypothesis that there is no linear relationship against the alternative hypothesis that there is one. If the population regression line is horizontal, then knowing something about x gives no usable information about y, so there would be no association between the two variables. The key idea, therefore, is to determine whether the slope of the actual (population) regression line could plausibly be 0 or, equivalently, whether the correlation between the two variables is 0. We organize the discussion as a two-sided hypothesis test. The key statistics are contained in the summary of the linear model fit to the sample.

H_0 : β = 0
H_a : β ≠ 0

# are the two variables associated?
summary(athletes.lm)

# Call:
# lm(formula = y ~ x)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -17.9205  -5.9027  -0.7237   5.4989  19.0973
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  63.5369     1.9565  32.475  < 2e-16 ***
# x             1.4911     0.1497   9.958 6.48e-14 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 8.003 on 55 degrees of freedom
# Multiple R-squared:  0.6432, Adjusted R-squared:  0.6368
# F-statistic: 99.17 on 1 and 55 DF,  p-value: 6.481e-14

The value of the slope b in the linear model for the sample, ŷ = a + bx, is the Estimate to the right of x. Its standard error is the next number to the right in that row, under the heading Std. Error. Use b and its SE to calculate the test statistic and then determine its p-value.

# HT
# H_0 : beta == 0
# H_a : beta != 0
b <- 1.4911
se <- 0.1497
t <- (b - 0) / se                     # 9.960588
n <- length(x)
p.value <- 2 * (1 - pt(t, df=n-2))    # 6.439294e-14

The p-value is very small, so we reject the null hypothesis in favor of the alternative and conclude that the two quantitative variables are associated.

A confidence interval centered on the statistic b provides a range of plausible values for the slope β of the (population) regression line.

alpha <- 0.05
t.star <- qt(1 - alpha/2, df=n-2)     # 2.004045
ci <- b + t.star * se * c(-1, 1); ci
# 1.191094 1.791106

So we are 95% confident that our confidence interval [1.191094, 1.791106] contains the population parameter β. Note that this interval does not contain the value 0, so we once again conclude that these two quantitative variables are associated.

The F statistic reported in the summary of the simple linear regression model is an alternative test statistic for the hypothesis H_0 : β = 0, and in fact it is equal to the square of the t statistic that we have used for the same purpose. The p-value obtained from the F statistic is exactly the same as the p-value obtained from the t statistic. F distributions will play a larger role in multiple linear regression.

Strength of the association

When working with categorical variables, we used the chi-square test to determine whether the variables were associated, and then we turned to measures of association, such as differences of proportions and relative risk, to determine the strength of the association. For quantitative variables, the correlation measures the strength of the association. The correlation is a number between −1 and 1. Values near 1 and −1 reflect the strongest (positive and negative, respectively) associations. A correlation of 0 means that the two variables are not linearly associated.

# correlation
cor(x, y)
# 0.8020251
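As a cross-check, R's cor.test function (in the base stats package) tests H_0 : ρ = 0 directly from the correlation; in simple linear regression its t statistic and p-value agree with the slope test above. A minimal sketch, assuming x and y are as defined above:

# test H_0 : rho == 0 directly from the correlation
cor.test(x, y)
# expected to report t = 9.958 on 55 degrees of freedom and
# p-value ~ 6.48e-14, matching the t test for the slope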

Correlation matrix

R's cor function can also return a matrix of correlations. Let's add two more athletic activities to the mix, a leg press and a 40 yard dash. Which activities are most strongly associated? Which have the weakest association? Can you imagine why? What is the interpretation of the negative numbers in this matrix?

# matrix of correlations
# x   bench press
# y   max bench press
# add two more exercises
z <- athletes$LP.RTF..200.   # leg press
w <- athletes$X40.YD..sec.   # 40 yd run
corr.matrix <- cor(data.frame(x, y, z, w))
#             x           y          z           w
# x  1.00000000  0.80202510 0.61107645 -0.06509459
# y  0.80202510  1.00000000 0.57791717 -0.08076663
# z  0.61107645  0.57791717 1.00000000  0.09756962
# w -0.06509459 -0.08076663 0.09756962  1.00000000

Interpret this visualization of the correlation matrix.

library(corrplot)
corrplot(corr.matrix, method="circle")

[Figure: corrplot visualization of the correlation matrix for x, y, z, and w.]
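A scatterplot matrix is another way to look at the same four variables side by side; pairs is base R, so no extra package is needed. A minimal sketch, assuming x, y, z, and w are as defined above (the plot title is illustrative):

# scatterplot matrix of the four athletic activities
pairs(data.frame(x, y, z, w), pch=19, col="darkred",
      main="Female High School Athletes, scatterplot matrix")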

Regression toward the Mean

The equation of the regression line is ŷ = b_0 + b_1 x, where b_0 = ȳ − b_1 x̄ and b_1 = r s_y / s_x, so we can rewrite it as

    ŷ − ȳ = b_1 (x − x̄)
          = r (s_y / s_x) (x − x̄).

Now choose x one standard deviation to the right of x̄, so x − x̄ = s_x. The corresponding predicted value ŷ is given by ŷ − ȳ = r s_y, so the predicted value ŷ is r times one standard deviation s_y above ȳ, and of course |r| ≤ 1. Therefore, if x moves one standard deviation to the right of its mean, x = x̄ + s_x, then the predicted ŷ moves only r s_y above its mean, ŷ = ȳ + r s_y.

Sons of tall fathers are likely shorter than their dads. Sons of short fathers are likely taller than their dads. This was first noticed by the famous pioneer of statistics, Francis Galton (1822-1911), and it is called regression toward the mean.

[Figure: "Regression toward the Mean" — the line y = x and the flatter regression line ŷ = a + bx through (x̄, ȳ); a step of s_x in x produces a step of only r s_y in ŷ.]
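A quick numerical check of this fact on the athletes data; a minimal sketch, assuming x, y, and athletes.lm are as defined above.

# predict at x one standard deviation above its mean;
# the prediction should sit only r standard deviations of y above y-bar
r <- cor(x, y)
y.hat.right <- predict(athletes.lm, data.frame(x = mean(x) + sd(x)))
as.numeric(y.hat.right) - mean(y)    # should equal r * sd(y)
r * sd(y)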

Standardized residuals

How do the data vary around the regression line? Residuals tell the story, but standardized residuals are more informative, in the same way that a z-score tells how many standard deviations a given result lies from a reference value.

standardized.residuals <- rstandard(athletes.lm)
hist(standardized.residuals, col="orangered")

[Figure: histogram of standardized.residuals, with values ranging from about −2 to 2.]
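Since the inference above assumes nearly normal residuals, a normal quantile-quantile plot is a common companion to the histogram; qqnorm and qqline are base R. A minimal sketch, assuming athletes.lm is as defined above:

# check the nearly-normal-residuals condition with a QQ plot
standardized.residuals <- rstandard(athletes.lm)
qqnorm(standardized.residuals, pch=19, col="darkred")
qqline(standardized.residuals, col="orange")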

MSE and RSE

A basic assumption of simple linear regression is that for each fixed x, the y values are normally distributed with mean µ_y = β_0 + β_1 x and standard deviation σ. A single value of σ describes the spread of these normal distributions about their means, one for each value of x. The value of σ can be estimated from the data. The mean square error, MSE, estimates the common variance of those normal distributions, and the square root of MSE, known as the residual standard error, RSE, is the very important estimate of σ. The RSE and related statistics appear in the output of R's aov function (analysis of variance). The MSE is the residual sum of squares, Residual SS, divided by its degrees of freedom, n − 2, and the RSE is the square root of the MSE.

aov(athletes.lm)
# Call:
#    aov(formula = athletes.lm)
#
# Terms:
#                        x Residuals
# Sum of Squares  6351.755  3522.806
# Deg. of Freedom        1        55
#
# Residual standard error: 8.003188
# Estimated effects may be unbalanced

residual.ss <- 3522.806
df <- 55
mse <- residual.ss / df
rse <- sqrt(mse)    # 8.003188

Prediction

Two types of prediction are important in this context. Given x, we would like to predict plausible values for µ_y (the population mean of y at that x) with a confidence interval, CI, and we would like to predict y values for individuals sharing that value of x with a prediction interval, PI. The PI will be wider than the associated CI because the PI must encompass a lot of individual variation, while the CI is a confidence interval for a (much more constrained) mean. In the following approximate formulas (Agresti and Franklin, 3e, p. 611), the RSE plays the role of σ, so these formulas resemble previous confidence intervals for means and values.

# approximate CI for the population mean mu_y
ci <- y.hat + t.star * rse / sqrt(n) * c(-1, 1)

# approximate PI for individual y values
pi <- y.hat + t.star * rse * c(-1, 1)

Here t* is calculated with an R command such as t.star <- qt(0.975, df = n - 2), and the residual standard error, RSE, is obtained from the summary of the linear model or by calling aov on the linear model: summary(athletes.lm) or aov(athletes.lm).
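Putting the approximate formulas to work at x = 12; a minimal sketch, assuming x, athletes.lm, and the RSE above. The exact intervals from predict in the next section should be close, but not identical, to these.

# approximate intervals at x = 12, using RSE in place of sigma
n <- length(x)
t.star <- qt(0.975, df = n - 2)
rse <- 8.003188
y.hat.12 <- predict(athletes.lm, data.frame(x = 12))
ci.approx <- y.hat.12 + t.star * rse / sqrt(n) * c(-1, 1)   # approximate CI for mu_y
pi.approx <- y.hat.12 + t.star * rse * c(-1, 1)             # approximate PI for a new y
ci.approx
pi.approx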

Confidence and Prediction Intervals Using Predict

For more accurate confidence and prediction intervals, use R's predict.

# confidence and prediction intervals using predict
?predict

# 95% CI for mu_y given x == 12
new.data <- data.frame(x=12)
predict(athletes.lm, new.data, interval="confidence")
#        fit      lwr      upr
# 1 81.42949 79.28328 83.57571

# 95% PI for y given x == 12
predict(athletes.lm, new.data, interval="prediction")
#        fit      lwr     upr
# 1 81.42949 65.24778 97.6112

Using predict to calculate confidence and prediction intervals for a whole range of x values produces confidence and prediction bands. Notice that the confidence band is narrowest near (x̄, ȳ) = (10.98, 79.91).

[Figure: "Female High School Athletes, confidence and prediction bands" — maximum bench press (lbs) versus number of 60 lb bench presses, with the fitted line and both bands.]
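One way such bands might be drawn; a minimal sketch, assuming x, y, and athletes.lm are as defined above. The grid of x values, colors, and line types are illustrative choices, not taken from the original figure.

# evaluate the confidence and prediction intervals over a grid of x values
x.grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
conf.band <- predict(athletes.lm, x.grid, interval = "confidence")
pred.band <- predict(athletes.lm, x.grid, interval = "prediction")

plot(x, y, pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes, confidence and prediction bands")
abline(athletes.lm, col="orange")
lines(x.grid$x, conf.band[, "lwr"], lty=2, col="blue")
lines(x.grid$x, conf.band[, "upr"], lty=2, col="blue")
lines(x.grid$x, pred.band[, "lwr"], lty=3, col="darkgreen")
lines(x.grid$x, pred.band[, "upr"], lty=3, col="darkgreen")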

Outline for Presenting an Hypothesis Test

Agresti and Franklin suggest a five-step outline for presenting hypothesis tests such as the one we are using in this chapter. Here is a sketch of the approach they recommend.

Assumptions
We assume randomization, normal conditional distributions for y given x, a linear trend for the means of these distributions, and a common standard deviation for all of them.

Hypotheses
The null hypothesis is that the variables are independent, and the alternative hypothesis is that they are dependent (associated).

    H_0 : β = 0
    H_a : β ≠ 0

Test Statistic
The slope b of the sample regression line and its standard error, SE, are found in the Coefficients section of the summary of the linear model; t = b / SE.

p-value
The p-value is calculated with an R command such as p.value <- 2 * (1 - pt(t, df = n - 2)).

Conclusion in Context
Is there sufficient evidence to reject H_0 or not? What does this mean in the context of this particular investigation?

Outline for Presenting a Confidence Interval

Confidence Interval
A 95% confidence interval for the population parameter β is given by b ± t* · SE, where b and SE are as in the associated hypothesis test, and t* is calculated with an R command such as t.star <- qt(0.975, df = n - 2).

Conclusion in Context
The confidence interval provides a range of plausible values for the population parameter β. State clearly what this means in the context of the present study.
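For the confidence-interval step, R can also produce the interval directly; confint is a standard companion to lm. A minimal sketch, assuming athletes.lm is as defined above:

# exact 95% confidence intervals for the intercept and the slope
confint(athletes.lm, level = 0.95)
# the row for x should be close to the hand-computed interval [1.191, 1.791]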

Analyzing Association

Associations involve explanatory variables and response variables. Order them like this: explanatory → response. An R-oriented sketch of these cases follows the list.

categorical → categorical (Peck, chapter 15)
- r × c contingency table, test for independence
- 1 × c contingency table, goodness of fit
- Test for independence or goodness of fit with a χ² test statistic

quantitative → quantitative (Peck, chapters 4, 16)
- Linear model for the population: µ_y = β_0 + β_1 x_1 + β_2 x_2 + ⋯
- Linear model describing the sample: ŷ = b_0 + b_1 x_1 + b_2 x_2 + ⋯
- Test for relevance of the model with an F test statistic: H_0 : all β_i's are 0
- Estimate the parameters β_i with t statistics and confidence intervals

(quantitative and categorical) → quantitative
- Subsume this case into the previous one with indicator variables

categorical → quantitative (Peck, chapter 17)
- The categorical variable divides the quantitative measurements into groups, and the question becomes one of comparing the mean responses of the groups
- Test that all of the means are the same with an F test (ANOVA): H_0 : β_1 = ⋯ = β_g
- Find which means differ with t tests and confidence intervals for β_i − β_j
- Control the significance level for multiple comparisons with Tukey HSD

quantitative → categorical (Peck, chapter 4)
- Use quantitative variables to predict a categorical variable with logistic regression
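Here is a rough map from these cases to standard R functions. This is a minimal sketch using a small simulated data frame dat with made-up variables g, g2, x1, x2, y, and g01; none of these names or data come from the text, and the calls are illustrative rather than a recipe.

# simulated data, purely for illustration
set.seed(1)
dat <- data.frame(
  g  = factor(rep(c("A", "B", "C"), each = 20)),             # categorical explanatory
  g2 = factor(sample(c("yes", "no"), 60, replace = TRUE)),    # categorical response
  x1 = rnorm(60),
  x2 = rnorm(60)                                              # quantitative explanatory
)
dat$y   <- 5 + 2 * dat$x1 + rnorm(60)                         # quantitative response
dat$g01 <- factor(ifelse(dat$x1 + rnorm(60) > 0, "high", "low"))  # categorical response

# categorical -> categorical: chi-square test on a contingency table
chisq.test(table(dat$g, dat$g2))

# quantitative -> quantitative: linear regression, overall F test and per-coefficient t tests
summary(lm(y ~ x1 + x2, data = dat))

# categorical -> quantitative: one-way ANOVA, then Tukey HSD multiple comparisons
fit <- aov(y ~ g, data = dat)
summary(fit)
TukeyHSD(fit)

# quantitative -> categorical: logistic regression
summary(glm(g01 ~ x1, family = binomial, data = dat))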

Exercises

We will attempt to solve some of the following exercises as a community project in class today. Finish these solutions as homework exercises, write them up carefully and clearly, and hand them in at the beginning of class next Friday.

Homework 16a, regression. Exercises from Chapter 16: 16.2 (house price), 16.3 (house price), 16.9 (cancer), 16.10 (marketing), 16.16 (R&D).

Homework 16b, regression. Exercises from Chapter 16: 16.18 (money), 16.19 (grasslands), 16.22 (shrimp), 16.28 (skulls), 16.31 (turtles).