AMS 7: Correlation and Regression
Lecture 8
Department of Applied Mathematics and Statistics, University of California, Santa Cruz
Summer 2014
Correlation
- Correlation concerns pairs of continuous observations.
- Correlation exists between two variables when one of them is related to the other in some way, e.g. height and weight of people, temperature and altitude, quiz scores and midterm score.
- Query 1: Do the two variables change together?
- Query 2: Can changes in one variable predict changes in the other?
- Query 3: How do we measure the strength of the relationship between two quantitative variables?
Scatterplot: a graph of the paired sample data.
The linear correlation coefficient, r, measures the linear association between two variables.
Properties
- −1 ≤ r ≤ 1
- It does not change if we change the scale of measurement.
- It is sensitive to outliers.
- You need to understand the concept, but you don't need to know the formula or how to compute it by hand.
- If r = 1, there is a perfect positive linear relationship.
- If r = −1, there is a perfect negative linear relationship.
- If r = 0, there is no linear relationship.
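In this course r is computed with JMP; purely as an illustration, here is a minimal Python sketch (the height/weight numbers are made up) showing how r can be computed and that it is unchanged when the measurements are rescaled.

```python
import numpy as np

# Made-up height (inches) and weight (pounds) data for six people
height = np.array([61, 64, 66, 68, 70, 73])
weight = np.array([120, 135, 140, 155, 165, 180])

# Pearson linear correlation coefficient r (np.corrcoef returns a 2x2 matrix)
r = np.corrcoef(height, weight)[0, 1]
print(round(r, 3))            # close to +1: strong positive linear association

# r is unchanged by a change of scale (inches -> cm, pounds -> kg)
r_rescaled = np.corrcoef(height * 2.54, weight / 2.205)[0, 1]
print(round(r_rescaled, 3))   # same value as r
```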
Correlation is not causation!!!
Coefficient of determination, r²: gives the proportion of the variation in variable 1 that is explained by the linear association between the two variables.
- 0 ≤ r² ≤ 1
- 0 indicates no linear relationship, while 1 indicates a perfect linear relationship.
Review of lines
- y = 1 + 2x
- slope = 2 = Δy/Δx: for each one-unit change in x, y changes by 2 units.
- intercept = 1: the value of y when x = 0
- In general, y = b0 + b1 x, where b0 is the intercept and b1 is the slope.
Linear Regression
- Fitting a line to data, to model the relationship between two quantitative variables.
- Lots of lines can be fit to the data; which do we choose?
- Fitted line = regression line = least squares line
- Fitted values = predicted values = values predicted by the line for a particular value of x
Example: The fitted line is ŷ = 1 + (1/2)x. The fitted values would be:
- x = 1: ŷ = 1 + 1/2 = 3/2
- x = 2: ŷ = 1 + (1/2)(2) = 2
- x = 3: ŷ = 1 + (1/2)(3) = 5/2
- x = 4: ŷ = 1 + (1/2)(4) = 3
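The same arithmetic as a tiny Python sketch, computing the fitted values of ŷ = 1 + (1/2)x at x = 1, 2, 3, 4:

```python
# Fitted values from the line y-hat = 1 + (1/2) x at x = 1, 2, 3, 4
b0, b1 = 1.0, 0.5
for x in [1, 2, 3, 4]:
    print(x, b0 + b1 * x)   # 1.5, 2.0, 2.5, 3.0 (i.e. 3/2, 2, 5/2, 3)
```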
Regression is the prediction of Y from X, assuming a linear relationship.
- X and Y are not treated the same. We are predicting Y from X!
- The regression line (least-squares line) is the one that minimizes the sum of squared errors in predicting Y (the sum of squared residuals): b0 and b1 are chosen to minimize
  Σ_{i=1}^{n} (ŷ_i − y_i)² = Σ_{i=1}^{n} (b0 + b1 x_i − y_i)²
- The regression line always goes through (x̄, ȳ).
- Note that this does not minimize the perpendicular distance to the line.
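As an illustration only, a short Python sketch (with made-up x, y values) that fits the least-squares line, checks that it passes through (x̄, ȳ), and computes the sum of squared residuals that the fit minimizes.

```python
import numpy as np

# Made-up (x, y) data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.4, 4.2, 4.8])

# Least-squares slope b1 and intercept b0 (np.polyfit returns highest degree first)
b1, b0 = np.polyfit(x, y, 1)

# The regression line always passes through (x-bar, y-bar)
print(np.isclose(b0 + b1 * x.mean(), y.mean()))   # True

# The quantity the fit minimizes: the sum of squared residuals
residuals = y - (b0 + b1 * x)
print((residuals ** 2).sum())
```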
You don't need to know how to compute b0 and b1 by hand. You will need to know how to interpret JMP output, compute predicted values, and do hypothesis tests with JMP.
Some data examples
How good is a regression model?
- Statistical significance: test whether β1 = 0
- Practical significance: r²
- Check model assumptions: residual plots
Hypothesis Testing for Regression
The model is y = β0 + β1 x, where β0 and β1 are population parameters. If there is a linear relationship between x and y, then β1 ≠ 0. This is a t-test with n − 2 degrees of freedom.
1. H0: β1 = 0 vs. H1: β1 ≠ 0
2. Level of significance α = 0.05
3. Test statistic: t = (b1 − 0) / s_{b1} (sampling distribution under H0 is t with n − 2 df)
4. Compute t and its p-value with JMP
5. Reject H0 if p-value < 0.05
6. Draw conclusions about the linear relationship
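In the course the test statistic and p-value are read from JMP output; purely as an illustrative sketch (with made-up data), scipy.stats.linregress in Python reports the same quantities: b1, its standard error s_{b1}, and the two-sided p-value for H0: β1 = 0.

```python
import numpy as np
from scipy import stats

# Made-up (x, y) data; in this course these numbers come from JMP output
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.8, 3.6, 3.9, 5.1, 5.4])

res = stats.linregress(x, y)

# res.slope is b1 and res.stderr is s_{b1}, so the test statistic is t = b1 / s_{b1}
t = res.slope / res.stderr
print(t, res.pvalue)       # two-sided p-value for H0: beta1 = 0 (t with n - 2 df)
print(res.pvalue < 0.05)   # True -> reject H0 at alpha = 0.05
```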
r² = square of the correlation between x and y
   = % of the variability in y that is explained by predicting from x
   = Σ_{i=1}^{n} (ŷ_i − ȳ)² / Σ_{i=1}^{n} (y_i − ȳ)² = explained variation / total variation
Recall that s_y² = (1 / (n − 1)) Σ_{i=1}^{n} (y_i − ȳ)²
0 ≤ r² ≤ 1
Gives a measure of practical significance.
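A short Python check (same made-up data as the earlier sketch) that the explained-variation ratio equals the squared correlation between x and y:

```python
import numpy as np

# Made-up data and its least-squares fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.4, 4.2, 4.8])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

explained = ((y_hat - y.mean()) ** 2).sum()   # explained variation
total = ((y - y.mean()) ** 2).sum()           # total variation
r_squared = explained / total

# Same number as the squared correlation between x and y
print(np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2))   # True
```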
Model assumptions:
1. y is normally distributed with mean β0 + β1 x and standard deviation σ.
2. The relationship between x and y is linear.
3. σ is the same for all observations.
4. The observation (x_i, y_i) is independent of (x_j, y_j) (conditional on β0, β1).
How do we check these? Hypothesis test for (1) and (2). Residual analysis for (2), (3), and (4).
Residuals: e_i = y_i − ŷ_i
- Plot x_i vs. e_i or ŷ_i vs. e_i (BUT not y_i vs. e_i, which are correlated).
- Make sure there are no patterns in the plot:
  - check for non-linearity
  - check for changes in variability (heteroscedasticity)
- Patterns indicate violations of the assumptions!
- Prediction is valid only when the regression is statistically significant and there are no problems with the residuals.
- Prediction interval: a confidence interval for a predicted value; get it from JMP.
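Residual plots are produced in JMP in this course; for illustration only, a minimal Python/matplotlib sketch (made-up data) that plots fitted values against residuals, the kind of plot where you look for curvature or a funnel shape:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data, least-squares fit, and residuals e_i = y_i - y-hat_i
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.8, 3.6, 3.9, 5.1, 5.4])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
e = y - y_hat

# Plot fitted values against residuals; look for curvature (non-linearity)
# or a funnel shape (changing variability)
plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```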
Key Concepts!!!!!
- Correlation
- Slope and intercept
- Fitted vs. predicted values
- Test for a linear relationship
- r²
- Residual analysis