Chapter 16: Understanding Relationships — Numerical Data

These notes reflect material from our text, Statistics: Learning from Data, First Edition, by Roxy Peck, published by CENGAGE Learning, 2015.

Linear models

For two quantitative variables, it is often convenient to distinguish between an explanatory (predictor) variable and a response (predicted) variable, denoted x and y, respectively. The means, µ_x and µ_y, standard deviations, σ_x and σ_y, and correlation coefficient, ρ, describe a population. Fitting y ~ x results in a linear model, y = β_0 + β_1 x, describing the population.

An association between the variables x and y is characterized by its direction (positive or negative), its form (linear or non-linear), and its strength (which, for linear relationships, is measured by the correlation).

The sample means, x̄ and ȳ, sample standard deviations, s_x and s_y, and sample correlation coefficient, r, describe a sample taken from the population. Point estimates for β_0 and β_1 are determined from the sample and are denoted b_0 and b_1. The linear model for the sample takes the form ŷ = b_0 + b_1 x. The residual, e_i = y_i − ŷ_i, measures the distance between the observed value, y_i, and the predicted value, ŷ_i, corresponding to a particular x_i. Regression analysis uses properties of a linear model constructed from a sample to infer properties of the linear relationship in the corresponding population.

Least squares line

Conditions for least squares: (1) a nearly linear relationship, (2) nearly normal residuals, (3) nearly constant variability.

Formulas for the regression coefficients:

    b_1 = r (s_y / s_x),    b_0 = ȳ − b_1 x̄.

Use the least squares line to predict y from x: ŷ = b_0 + b_1 x.

The center of mass of the sample lies on the least squares line: ȳ = b_0 + b_1 x̄.

The squared correlation, r², gives the proportion of the variance of the response variable explained by the explanatory variable.

Two quantitative variables

We illustrate simple regression with one of the examples explored by Agresti and Franklin in chapter 12, a data set describing 57 female high school athletes and their performances in several athletic activities. Read in the data set, select two athletic activities, and generate a scatterplot. We use x and y to describe these activities, rather than more descriptive names, to suggest that this type of analysis is widely applicable.

athletes <- read.csv("high_school_female_athletes.csv", header=TRUE)
head(athletes)
str(athletes)
summary(athletes)

x <- athletes$BRTF..60.          # number of 60 lb bench presses
y <- athletes$X1RM.BENCH..lbs.   # maximum bench press

plot(x, y, pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes")

[Figure: scatterplot of maximum bench press (lbs) versus number of 60 lb bench presses, "Female High School Athletes".]

Is there a suggestion of a linear relationship here? Use R's lm function to fit a linear model to these data.

plot(x, y, pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes")
athletes.lm <- lm(y ~ x)
abline(athletes.lm, col="orange")
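As a cross-check of the coefficient formulas from the Least squares line section, the slope and intercept can be recomputed directly from the sample statistics. This is a minimal sketch, assuming x, y, and athletes.lm are as defined above; the names r, b1, and b0 are just local variables for the illustration.

# recompute the least squares coefficients from summary statistics
r  <- cor(x, y)
b1 <- r * sd(y) / sd(x)         # slope:     b_1 = r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)    # intercept: b_0 = y-bar - b_1 * x-bar
c(b0, b1)
coefficients(athletes.lm)       # should agree with b0 and b1

r^2                             # proportion of the variance of y explained by x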

[Figure: the same scatterplot with the fitted least squares line added, "Female High School Athletes (lm)".]

A linear relationship in this context is described by an equation of the form ŷ = a + bx, where the coefficients a and b are part of the linear model. Create a function that calculates ŷ given x and use it to calculate a point along the regression line. The second student in the data set had an x value of 12. What value of y would this linear model predict for the second student?

coefficients(athletes.lm)
# (Intercept)           x
#   63.536856    1.491053

predict.y.hat <- function(x){
    a <- coefficients(athletes.lm)[1]
    b <- coefficients(athletes.lm)[2]
    y.hat <- as.numeric(a + b * x)
    return(y.hat)
}
predict.y.hat(12)
# 81.42949

We can use R's predict function to do the same calculation.

# use predict
new.data <- data.frame(x=12)
predict(athletes.lm, new.data)
#        1
# 81.42949

R's predict can calculate the predictions for every x in the data set.

# calculate y.hat for each student
y.hat <- predict(athletes.lm, data.frame(x, y))
head(data.frame(x, y, y.hat))
#    x  y    y.hat
# 1 10 80 78.44739
# 2 12 85 81.42949
# 3 20 85 93.35792
# 4  5 65 70.99212
# 5 12 95 81.42949
# 6 10 75 78.44739

A residual is the difference between an actual y and the predicted ŷ. Verify that the second student's residual is e = y − ŷ = 85 − 81.42949 = 3.570507.

Testing for association

Do the data plausibly cluster around this least squares line? Just how much evidence is there of a linear relationship in these data? We will test the null hypothesis that there is no linear relationship against the alternative hypothesis that there is one. If the population regression line is horizontal, then knowing something about x gives no usable information about y, so there would be no association between the two variables. The key idea, therefore, is to determine whether the slope of the actual (population) regression line could plausibly be 0 or, equivalently, whether the correlation between the two variables is 0. We organize the discussion as a two-sided hypothesis test. The key statistics are contained in the summary of the linear model fit to the sample.

H_0 : β = 0
H_a : β ≠ 0

# are the two variables associated?
summary(athletes.lm)

# Call:
# lm(formula = y ~ x)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -17.9205  -5.9027  -0.7237   5.4989  19.0973
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  63.5369     1.9565  32.475  < 2e-16 ***
# x             1.4911     0.1497   9.958 6.48e-14 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 8.003 on 55 degrees of freedom
# Multiple R-squared:  0.6432, Adjusted R-squared:  0.6368
# F-statistic: 99.17 on 1 and 55 DF,  p-value: 6.481e-14

The value of the slope b in the linear model for the sample, ŷ = a + bx, is the Estimate to the right of x. Its standard error is the next number to the right in that row, under the heading Std. Error. Use b and its SE to calculate the test statistic and then determine its p-value.

# HT
# H_0 : beta == 0
# H_a : beta != 0
b <- 1.4911
se <- 0.1497
t <- (b - 0) / se                     # 9.960588
n <- length(x)
p.value <- 2 * (1 - pt(t, df=n-2))    # 6.439294e-14

The p-value is very small, so we reject the null hypothesis in favor of the alternative and conclude that the two quantitative variables are associated.

A confidence interval centered on the statistic b provides a range of plausible values for the slope β of the (population) regression line.

alpha <- 0.05
t.star <- qt(1 - alpha/2, df=n-2)     # 2.004045
ci <- b + t.star * se * c(-1, 1); ci
# 1.191094 1.791106

So we are 95% confident that our confidence interval [1.191094, 1.791106] contains the population parameter β. Note that this interval does not contain the value 0, so we once again conclude that these two quantitative variables are associated.

The F statistic reported in the summary of the simple linear regression model is an alternative test statistic for the hypothesis H_0 : β = 0, and in fact it is equal to the square of the t statistic that we have used for the same purpose. The p-value obtained from the F statistic is exactly the same as the p-value obtained from the t statistic. F distributions will play a larger role in multiple linear regression.

Strength of the association

When working with categorical variables, we used the chi-square test to determine whether the variables were associated, and then we turned to measures of association, such as differences of proportions and relative risk, to determine the strength of the association. For quantitative variables, the correlation measures the strength of the association. The correlation is a number between −1 and 1. Values near 1 and −1 reflect the strongest (positive and negative, respectively) associations. A correlation of 0 means that the two variables are not linearly associated.

# correlation
cor(x, y)
# 0.8020251
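As a cross-check, R's cor.test function (in the base stats package) tests H_0 : ρ = 0 directly from the correlation; in simple linear regression its t statistic and p-value agree with the slope test above. A minimal sketch, assuming x and y are as defined above:

# test H_0 : rho == 0 directly from the correlation
cor.test(x, y)
# expected to report t = 9.958 on 55 degrees of freedom and
# p-value ~ 6.48e-14, matching the t test for the slope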

Correlation matrix

R's cor function can also return a matrix of correlations. Let's add two more athletic activities to the mix, a leg press and a 40 yard dash. Which activities are most strongly associated? Which have the weakest association? Can you imagine why? What is the interpretation of the negative numbers in this matrix?

# matrix of correlations
# x   bench press
# y   max bench press
# add two more exercises
z <- athletes$LP.RTF..200.   # leg press
w <- athletes$X40.YD..sec.   # 40 yd run
corr.matrix <- cor(data.frame(x, y, z, w))
#             x           y          z           w
# x  1.00000000  0.80202510 0.61107645 -0.06509459
# y  0.80202510  1.00000000 0.57791717 -0.08076663
# z  0.61107645  0.57791717 1.00000000  0.09756962
# w -0.06509459 -0.08076663 0.09756962  1.00000000

Interpret this visualization of the correlation matrix.

library(corrplot)
corrplot(corr.matrix, method="circle")

[Figure: corrplot visualization of the correlation matrix for x, y, z, and w.]
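A scatterplot matrix is another way to look at the same four variables side by side; pairs is base R, so no extra package is needed. A minimal sketch, assuming x, y, z, and w are as defined above (the plot title is illustrative):

# scatterplot matrix of the four athletic activities
pairs(data.frame(x, y, z, w), pch=19, col="darkred",
      main="Female High School Athletes, scatterplot matrix")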

Regression toward the Mean

The equation of the regression line is ŷ = b_0 + b_1 x, where b_0 = ȳ − b_1 x̄ and b_1 = r s_y / s_x, so we can rewrite it as

    ŷ − ȳ = b_1 (x − x̄)
          = r (s_y / s_x) (x − x̄).

Now choose x one standard deviation to the right of x̄, so x − x̄ = s_x. The corresponding predicted value ŷ is given by ŷ − ȳ = r s_y, so the predicted value ŷ is r times one standard deviation s_y above ȳ, and of course |r| ≤ 1. Therefore, if x moves one standard deviation to the right of its mean, x = x̄ + s_x, then the predicted ŷ moves only r s_y above its mean, ŷ = ȳ + r s_y.

Sons of tall fathers are likely shorter than their dads. Sons of short fathers are likely taller than their dads. This was first noticed by the famous pioneer of statistics, Francis Galton (1822-1911), and it is called regression toward the mean.

[Figure: "Regression toward the Mean" — the line y = x and the flatter regression line ŷ = a + bx through (x̄, ȳ); a step of s_x in x produces a step of only r s_y in ŷ.]
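A quick numerical check of this fact on the athletes data; a minimal sketch, assuming x, y, and athletes.lm are as defined above.

# predict at x one standard deviation above its mean;
# the prediction should sit only r standard deviations of y above y-bar
r <- cor(x, y)
y.hat.right <- predict(athletes.lm, data.frame(x = mean(x) + sd(x)))
as.numeric(y.hat.right) - mean(y)    # should equal r * sd(y)
r * sd(y)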

Standardized residuals

How do the data vary around the regression line? Residuals tell the story, but standardized residuals are more informative, in the same way that a z-score tells how many standard deviations a given result lies from a reference value.

standardized.residuals <- rstandard(athletes.lm)
hist(standardized.residuals, col="orangered")

[Figure: histogram of standardized.residuals, with values ranging from about −2 to 2.]
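Since the inference above assumes nearly normal residuals, a normal quantile-quantile plot is a common companion to the histogram; qqnorm and qqline are base R. A minimal sketch, assuming athletes.lm is as defined above:

# check the nearly-normal-residuals condition with a QQ plot
standardized.residuals <- rstandard(athletes.lm)
qqnorm(standardized.residuals, pch=19, col="darkred")
qqline(standardized.residuals, col="orange")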

MSE and RSE

A basic assumption of simple linear regression is that for each fixed x, the y values are normally distributed with mean µ_y = β_0 + β_1 x and standard deviation σ. A single value of σ describes the spread of these normal distributions about their means, one for each value of x. The value of σ can be estimated from the data. The mean square error, MSE, estimates the common variance of those normal distributions, and the square root of MSE, known as the residual standard error, RSE, is the very important estimate of σ. The RSE and related statistics appear in the output of R's aov function (analysis of variance). The MSE is the residual sum of squares, Residual SS, divided by its degrees of freedom, n − 2, and the RSE is the square root of the MSE.

aov(athletes.lm)
# Call:
#    aov(formula = athletes.lm)
#
# Terms:
#                        x Residuals
# Sum of Squares  6351.755  3522.806
# Deg. of Freedom        1        55
#
# Residual standard error: 8.003188
# Estimated effects may be unbalanced

residual.ss <- 3522.806
df <- 55
mse <- residual.ss / df
rse <- sqrt(mse)    # 8.003188

Prediction

Two types of prediction are important in this context. Given x, we would like to predict plausible values for µ_y (the population mean of y at that x) with a confidence interval, CI, and we would like to predict y values for individuals sharing that value of x with a prediction interval, PI. The PI will be wider than the associated CI because the PI must encompass a lot of individual variation, while the CI is a confidence interval for a (much more constrained) mean. In the following approximate formulas (Agresti and Franklin, 3e, p. 611), the RSE plays the role of σ, so these formulas resemble previous confidence intervals for means and values.

# approximate CI for the population mean mu_y
ci <- y.hat + t.star * rse / sqrt(n) * c(-1, 1)

# approximate PI for individual y values
pi <- y.hat + t.star * rse * c(-1, 1)

Here t* is calculated with an R command such as t.star <- qt(0.975, df = n - 2), and the residual standard error, RSE, is obtained from the summary of the linear model or by calling aov on the linear model: summary(athletes.lm) or aov(athletes.lm).
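Putting the approximate formulas to work at x = 12; a minimal sketch, assuming x, athletes.lm, and the RSE above. The exact intervals from predict in the next section should be close, but not identical, to these.

# approximate intervals at x = 12, using RSE in place of sigma
n <- length(x)
t.star <- qt(0.975, df = n - 2)
rse <- 8.003188
y.hat.12 <- predict(athletes.lm, data.frame(x = 12))
ci.approx <- y.hat.12 + t.star * rse / sqrt(n) * c(-1, 1)   # approximate CI for mu_y
pi.approx <- y.hat.12 + t.star * rse * c(-1, 1)             # approximate PI for a new y
ci.approx
pi.approx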

Confidence and Prediction Intervals Using Predict

For more accurate confidence and prediction intervals, use R's predict.

# confidence and prediction intervals using predict
?predict

# 95% CI for mu_y given x == 12
new.data <- data.frame(x=12)
predict(athletes.lm, new.data, interval="confidence")
#        fit      lwr      upr
# 1 81.42949 79.28328 83.57571

# 95% PI for y given x == 12
predict(athletes.lm, new.data, interval="prediction")
#        fit      lwr     upr
# 1 81.42949 65.24778 97.6112

Using predict to calculate confidence and prediction intervals for a whole range of x values produces confidence and prediction bands. Notice that the confidence band is narrowest near (x̄, ȳ) = (10.98, 79.91).

[Figure: "Female High School Athletes, confidence and prediction bands" — maximum bench press (lbs) versus number of 60 lb bench presses, with the fitted line and both bands.]
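One way such bands might be drawn; a minimal sketch, assuming x, y, and athletes.lm are as defined above. The grid of x values, colors, and line types are illustrative choices, not taken from the original figure.

# evaluate the confidence and prediction intervals over a grid of x values
x.grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
conf.band <- predict(athletes.lm, x.grid, interval = "confidence")
pred.band <- predict(athletes.lm, x.grid, interval = "prediction")

plot(x, y, pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes, confidence and prediction bands")
abline(athletes.lm, col="orange")
lines(x.grid$x, conf.band[, "lwr"], lty=2, col="blue")
lines(x.grid$x, conf.band[, "upr"], lty=2, col="blue")
lines(x.grid$x, pred.band[, "lwr"], lty=3, col="darkgreen")
lines(x.grid$x, pred.band[, "upr"], lty=3, col="darkgreen")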

Outline for Presenting an Hypothesis Test

Agresti and Franklin suggest a five-step outline for presenting hypothesis tests such as the one we are using in this chapter. Here is a sketch of the approach they recommend.

Assumptions
We assume randomization, normal conditional distributions for y given x, a linear trend for the means of these distributions, and a common standard deviation for all of them.

Hypotheses
The null hypothesis is that the variables are independent, and the alternative hypothesis is that they are dependent (associated).

    H_0 : β = 0
    H_a : β ≠ 0

Test Statistic
The slope b of the sample regression line and its standard error, SE, are found in the Coefficients section of the summary of the linear model; t = b / SE.

p-value
The p-value is calculated with an R command such as p.value <- 2 * (1 - pt(t, df = n - 2)).

Conclusion in Context
Is there sufficient evidence to reject H_0 or not? What does this mean in the context of this particular investigation?

Outline for Presenting a Confidence Interval

Confidence Interval
A 95% confidence interval for the population parameter β is given by b ± t* · SE, where b and SE are as in the associated hypothesis test, and t* is calculated with an R command such as t.star <- qt(0.975, df = n - 2).

Conclusion in Context
The confidence interval provides a range of plausible values for the population parameter β. State clearly what this means in the context of the present study.
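For the confidence-interval step, R can also produce the interval directly; confint is a standard companion to lm. A minimal sketch, assuming athletes.lm is as defined above:

# exact 95% confidence intervals for the intercept and the slope
confint(athletes.lm, level = 0.95)
# the row for x should be close to the hand-computed interval [1.191, 1.791]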

Analyzing Association

Associations involve explanatory variables and response variables. Order them like this: explanatory → response. An R-oriented sketch of these cases follows the list.

categorical → categorical (Peck, chapter 15)
- r × c contingency table, test for independence
- 1 × c contingency table, goodness of fit
- Test for independence or goodness of fit with a χ² test statistic

quantitative → quantitative (Peck, chapters 4, 16)
- Linear model for the population: µ_y = β_0 + β_1 x_1 + β_2 x_2 + ⋯
- Linear model describing the sample: ŷ = b_0 + b_1 x_1 + b_2 x_2 + ⋯
- Test for relevance of the model with an F test statistic: H_0 : all β_i's are 0
- Estimate the parameters β_i with t statistics and confidence intervals

(quantitative and categorical) → quantitative
- Subsume this case into the previous one with indicator variables

categorical → quantitative (Peck, chapter 17)
- The categorical variable divides the quantitative measurements into groups, and the question becomes one of comparing the mean responses of the groups
- Test that all of the means are the same with an F test (ANOVA): H_0 : β_1 = ⋯ = β_g
- Find which means differ with t tests and confidence intervals for β_i − β_j
- Control the significance level for multiple comparisons with Tukey HSD

quantitative → categorical (Peck, chapter 4)
- Use quantitative variables to predict a categorical variable with logistic regression
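Here is a rough map from these cases to standard R functions. This is a minimal sketch using a small simulated data frame dat with made-up variables g, g2, x1, x2, y, and g01; none of these names or data come from the text, and the calls are illustrative rather than a recipe.

# simulated data, purely for illustration
set.seed(1)
dat <- data.frame(
  g  = factor(rep(c("A", "B", "C"), each = 20)),             # categorical explanatory
  g2 = factor(sample(c("yes", "no"), 60, replace = TRUE)),    # categorical response
  x1 = rnorm(60),
  x2 = rnorm(60)                                              # quantitative explanatory
)
dat$y   <- 5 + 2 * dat$x1 + rnorm(60)                         # quantitative response
dat$g01 <- factor(ifelse(dat$x1 + rnorm(60) > 0, "high", "low"))  # categorical response

# categorical -> categorical: chi-square test on a contingency table
chisq.test(table(dat$g, dat$g2))

# quantitative -> quantitative: linear regression, overall F test and per-coefficient t tests
summary(lm(y ~ x1 + x2, data = dat))

# categorical -> quantitative: one-way ANOVA, then Tukey HSD multiple comparisons
fit <- aov(y ~ g, data = dat)
summary(fit)
TukeyHSD(fit)

# quantitative -> categorical: logistic regression
summary(glm(g01 ~ x1, family = binomial, data = dat))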

Exercises

We will attempt to solve some of the following exercises as a community project in class today. Finish these solutions as homework exercises, write them up carefully and clearly, and hand them in at the beginning of class next Friday.

Homework 16a, regression. Exercises from Chapter 16: 16.2 (house price), 16.3 (house price), 16.9 (cancer), 16.10 (marketing), 16.16 (R&D).

Homework 16b, regression. Exercises from Chapter 16: 16.18 (money), 16.19 (grasslands), 16.22 (shrimp), 16.28 (skulls), 16.31 (turtles).