Lecture: Relationship between Measurement Variables
Statistics (Colin Rundel) — February, 20

Announcements

In-class Quiz #2 at the end of class.
Midterm #1 on Friday, in class; review on Wednesday.
Today's material: we probably will not finish it today, and it will not be on Midterm #1.

Intro to Regression

Poverty vs. HS graduate rate

The scatterplot below shows the relationship between the HS graduate rate in all 50 states and the District of Columbia and the % of residents who live below the poverty line (income below $22,350 for a family of 4).

[Scatterplot: % in poverty vs. % HS grad]

Response vs. explanatory

What is the response variable and what is the explanatory variable for these data?
Eyeballing the line

Which of the following appears to be the line that best fits the linear relationship between % HS grad and % in poverty?

[Scatterplot with four candidate lines: (a) (b) (c) (d)]

Residuals

Residuals are the leftovers from the model fit:

    Data = Fit + Residual

Residuals (cont.)

A residual is the difference between the observed and the predicted y:

    e_i = y_i − ŷ_i

[Scatterplot with the residuals for DC and RI marked]

The % living in poverty in DC is 5.44% more than predicted; the % living in poverty in RI is 4.16% less than predicted.

Describing the relationship

What to include: shape, direction, and strength.

How would you describe the relationship between % HS grad and % in poverty?
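The DC and RI residuals quoted above can be reproduced from the least squares line ŷ = 64.68 − 0.62x that the lecture derives later. A minimal Python sketch (the course itself uses R; the state values are the slide's DC and RI figures):

```python
# Residual: e_i = y_i - yhat_i (observed minus predicted).
# The fitted line yhat = 64.68 - 0.62 * x is the one derived later in the lecture.

def predict(hs_grad):
    """Predicted % in poverty for a given % HS grad rate."""
    return 64.68 - 0.62 * hs_grad

def residual(hs_grad, poverty):
    """Observed % in poverty minus predicted % in poverty."""
    return poverty - predict(hs_grad)

# DC: 86% HS grad, 16.8% in poverty -> residual about +5.44 (point above the line)
# RI: 81% HS grad, 10.3% in poverty -> residual about -4.16 (point below the line)
e_dc = residual(86, 16.8)
e_ri = residual(81, 10.3)
```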
Quantifying the relationship

Correlation describes the strength and direction of the linear relationship between two variables. It takes values between −1 (perfect negative relationship) and +1 (perfect positive relationship); a value of 0 indicates no linear relationship.

Guessing the correlation

Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) 0.6  (b) −0.75  (c) −0.1  (d) 0.02  (e) −1.5

Guessing the correlation

Which of the following is the best guess for the correlation between % in poverty and % female householder, no husband present?

(a) 0.1  (b) −0.6  (c) −0.4  (d) 0.9  (e) 0.5

Calculating the correlation

Using computation (in R):

    cor(poverty$poverty, poverty$graduates)

Using a formula:

    R = (1 / (n − 1)) · Σ_{i=1}^{n} [(x_i − x̄) / s_x] · [(y_i − ȳ) / s_y]

Note: You won't be asked to calculate the correlation coefficient by hand, because nobody does it by hand. But you might be given a scatterplot and asked to guess the correlation.
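The formula above translates almost line for line into code. A Python sketch of the same computation, applied to hypothetical toy data (the lecture's `cor()` call is R):

```python
from math import sqrt

def correlation(x, y):
    """R = (1/(n-1)) * sum over i of ((x_i - xbar)/s_x) * ((y_i - ybar)/s_y)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # Sample standard deviations (divide by n - 1, matching the formula)
    s_x = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / s_x) * ((yi - ybar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

# Toy data lying exactly on the decreasing line y = 10 - 2x: correlation is -1
r = correlation([1, 2, 3, 4], [8, 6, 4, 2])
```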
Assessing the correlation

Which of the following has the strongest correlation, i.e. the correlation coefficient closest to +1 or −1?

[Four scatterplots: (a) (b) (c) (d)]

Play the game!

http://istics.net/stat/correlations/

Best line

A measure for the best line

We want a line that has small residuals.

One option: minimize the sum of the magnitudes (absolute values) of the residuals:

    |e_1| + |e_2| + ... + |e_n|

Another option: minimize the sum of the squared residuals:

    e_1² + e_2² + ... + e_n²

The line that minimizes the sum of squared residuals is the least squares line.

Why minimize squares?

1. It is the most commonly used approach.
2. It is easier to compute by hand and using software.
3. In many applications, a residual twice as large as another is more than twice as bad.
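Point 3 is the key conceptual reason. A small numeric illustration with made-up residual values (not from the poverty data): two fits with the same total absolute error are distinguished by the squared criterion, which penalizes the single large miss more heavily.

```python
big_miss   = [4, 0]   # one large residual, one perfect fit
two_misses = [2, 2]   # two moderate residuals

def sum_abs(errors):
    """Sum of magnitudes: |e_1| + ... + |e_n|."""
    return sum(abs(e) for e in errors)

def sum_sq(errors):
    """Sum of squares: e_1^2 + ... + e_n^2."""
    return sum(e * e for e in errors)

print(sum_abs(big_miss), sum_abs(two_misses))  # 4 4: absolute loss cannot tell them apart
print(sum_sq(big_miss), sum_sq(two_misses))    # 16 8: squared loss prefers two moderate misses
```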
The least squares line

    ŷ = β_0 + β_1 · x

where ŷ is the predicted response, x is the explanatory variable, β_0 is the intercept, and β_1 is the slope.

Notation:
    Intercept: parameter β_0, point estimate b_0
    Slope: parameter β_1, point estimate b_1

Given...

                % HS grad (x)    % in poverty (y)
    mean        x̄ = 86.01        ȳ = 11.35
    sd          s_x = 3.73       s_y = 3.1
    correlation          R = −0.75

Slope

The slope of the regression line can be calculated as

    b_1 = (s_y / s_x) · R

In context:

    b_1 = (3.1 / 3.73) × (−0.75) = −0.62

Interpretation: for each additional percentage point in the HS graduate rate, we would expect the % living in poverty to be lower on average by 0.62 percentage points.

Intercept

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through (x̄, ȳ):

    ȳ = b_0 + b_1 · x̄   ⇒   b_0 = ȳ − b_1 · x̄ = 11.35 − (−0.62)(86.01) = 64.68

[Figure: regression line extended to x = 0; the intercept is where the line crosses the y-axis]
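The slope and intercept formulas can be checked in a few lines of Python from the summary statistics on the slide (rounding b_1 to two decimals before computing b_0, as the slide does):

```python
# Summary statistics from the slide's "Given..." table
x_bar, y_bar = 86.01, 11.35   # mean % HS grad, mean % in poverty
s_x, s_y = 3.73, 3.1          # standard deviations
R = -0.75                     # correlation

b1 = round(s_y / s_x * R, 2)       # slope: (s_y / s_x) * R, about -0.62
b0 = round(y_bar - b1 * x_bar, 2)  # intercept: ybar - b1 * xbar, about 64.68
```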
Regression line

    ŷ = 64.68 − 0.62 × (% HS grad)

[Scatterplot of % in poverty vs. % HS grad with the least squares line drawn]

Interpreting regression line parameter estimates

Interpretation of slope and intercept:

Intercept: when x = 0, y is expected to equal the intercept.
Slope: for each unit increase in x, y is expected to increase/decrease on average by the slope.

Extrapolation

Applying a model estimate to values outside of the realm of the original data is called extrapolation. Sometimes the intercept might be an extrapolation.

[Figure: regression line extended far beyond the observed % HS grad values, down to x = 0]
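Extrapolation is easy to guard against in code. A hypothetical helper (the range bounds below are illustrative assumptions, not values from the slides) that refuses to predict outside the observed % HS grad range:

```python
# Assumed observed range of % HS grad across the 50 states + DC; illustrative only.
X_MIN, X_MAX = 77.0, 92.0

def predict_poverty(hs_grad):
    """Predict % in poverty, refusing to extrapolate beyond the data's range."""
    if not (X_MIN <= hs_grad <= X_MAX):
        raise ValueError(
            "extrapolation: %.1f is outside [%s, %s]" % (hs_grad, X_MIN, X_MAX))
    return 64.68 - 0.62 * hs_grad

# predict_poverty(86) is fine; predict_poverty(0) would raise, because the
# intercept (x = 0) lies far outside the observed data.
```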
Examples of extrapolation

[Figures: models used to predict far outside the range of the original data]

Conditions: (1) Linearity

1. Linearity: the relationship between the explanatory and the response variable should be linear. Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class. Check using a scatterplot of the data, or a residuals plot.
2. Nearly normal residuals (next).
3. Constant variability (next).
Residuals plot

RI: % HS grad = 81, % in poverty = 10.3
    ŷ = 64.68 − 0.62 × 81 = 14.46
    e = % in poverty − ŷ = 10.3 − 14.46 = −4.16

DC: % HS grad = 86, % in poverty = 16.8
    ŷ = 64.68 − 0.62 × 86 = 11.36
    e = % in poverty − ŷ = 16.8 − 11.36 = 5.44

[Residuals plot: residuals vs. % HS grad, scattered around the 0 line]

Conditions: (2) Nearly normal residuals

The residuals should be nearly normal. This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data. Check using a histogram or a normal probability plot of the residuals.

[Figures: histogram of the residuals and normal Q-Q plot (sample quantiles vs. theoretical quantiles)]

Conditions: (3) Constant variability

The variability of points around the least squares line should be roughly constant. This implies that the variability of the residuals around the 0 line should be roughly constant as well. This is also called homoscedasticity. Check using a residuals plot.

Checking conditions

What condition is this linear model obviously violating?

[Scatterplot and residuals plot showing an obvious violation]
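These condition checks are done by eye on a residuals plot. One algebraic fact that underlies reading such a plot: the residuals of a least squares fit always average to zero, so any systematic drift away from the 0 line signals a violated condition. A self-contained Python sketch with toy data:

```python
def least_squares(x, y):
    """Least squares fit: returns (b0, b1) for yhat = b0 + b1 * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]   # toy data, roughly linear
b0, b1 = least_squares(xs, ys)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys)]
# sum(resid) is (numerically) zero for any least squares fit.
```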
Checking conditions

What condition is this linear model obviously violating?

[Scatterplot and residuals plot showing an obvious violation]

R²

The strength of the fit of a linear model is most commonly evaluated using R². R² is calculated as the square of the correlation coefficient. It tells us what percent of the variability in the response variable is explained by the model; the remainder of the variability is due to variables not included in the model. For the model we've been working with, R² = (−0.75)² ≈ 0.56.

Interpretation of R²

Which of the below is the correct interpretation of R = −0.75, R² ≈ 0.56?
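The R² computation is one line; a sketch using the correlation from the slides (note that R² is the square of the correlation coefficient, not of the slope):

```python
R = -0.75           # correlation between % HS grad and % in poverty (slide value)
r_squared = R ** 2  # 0.5625: about 56% of the variability in % in poverty is
                    # explained by % HS grad; the rest is due to other variables
```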