Review of Regression Basics When describing a Bivariate Relationship: Make a Scatterplot Strength, Direction, Form Model: y-hat=a+bx Interpret slope in context Make Predictions Residual = Observed-Predicted Assess the Model Interpret r Residual Plot 4 4 3 3 The Endangered Manatee The Endangered Manatee 4 4 3 3 The Endangered Manatee The Endangered Manatee 4 4 7 7 4 4 Registrations ( thousand ) 7 7 Registrations ( thousand ) 4 4 4 4 7 7 7 Registrations ( thousand ) 7 Registrations ( thousand ) Killed = (.12 thousand^-1)registrations - 41.4 ; r Killed = (.12 thousand^-1)registrations - 41.4 ; 2 r 2 =.8 =.8 The The Endangered Endangered Manatee Scatter Manatee Plot 4 4 3 3 4 4 4 4 7 7 7 7 Registrations (( thousand thousand ) ) - - 4 4 4 4 7 7 7 7 Registrations (( thousand thousand )) Killed Killed = (.12 (.12 thousand^-1)registrations -- 41 41 ;; rr 2 2 =.8.8
Reading Minitab Output Regression Analysis: Fat gain versus NEA The regression equation is FatGain = ****** + ******(NEA) Predictor Coef SE Coef T P Constant 3.1.33.4. NEA -.344.74141-4.4. S=.7383 R-Sq =.% R- Sq(adj)=7.8% Regression equations aren t always as easy to spot as they are on your TI-84. Can you find the slope and intercept above?
Age Age at at First First Word Word and and Gesell Gesell Score Score Child Child Age Age Score Score 1 1 1 1 2 2 2 2 2 2 71 71 3 3 3 3 83 83 4 4 4 4 1 1 2 2 87 87 7 7 7 7 18 18 3 3 8 8 8 8 8 8 4 4 4 4 7 7 3 3 12 12 12 12 13 13 13 13 83 83 14 14 14 14 84 84 2 2 1 1 1 1 17 17 17 17 12 12 18 18 18 18 42 42 7 7 1 1 1 1 17 17 121 121 8 8 13 21 21 21 21 13 1 Outliers/Influential Points Does the age of a child s first word predict his/her mental ability? Consider the following data on (age of first word, Gesell Adaptive Score) for 21 children. Age Age at at First First Word Word and and Gesell Gesell Score Scatter Score Scatter Plot Plot 13 13 1 1 1 1 Influential? 8 8 7 7 2 2 3 3 3 3 4 4 4 4 Age Age () () Age at First Word and Gesell Score Age at First Word and Gesell Score 13 13 1 1 1 1 8 8 7 7 2 2 3 3 3 3 4 4 4 Age ( ) 4 Age ( ) Score = (-1.13 ^-1)Age + 1 ; r Score = (-1.13 ^-1)Age + 1 2 ; r 2 =.41 =.41 Age at First Word and Gesell Score Age at First Word and Gesell Score 1 1 1 8 8 7 7 2 2 3 3 3 3 4 4 4 Age ( ) 4 Score = (-.77 ^-1)Age + Age ( ) ; r 2 Score = (-.77 ^-1)Age + ; r 2 =. =. Does the highlighted point markedly affect the equation of the LSRL? If so, it is influential. Test by removing the point and finding the new LSRL.
Explanatory vs. Response The Distinction Between Explanatory and Response variables is essential in regression. Switching the distinction results in a different leastsquares regression line. Hubble 12 data Hubble 12 data 1 1 8 8 4 4 - - -4-4...2.2.4.4...8.8 1. 1. 1.2 1.2 1.4 1.4 1. 1. 1.8 1.8 2. 2. 2.2 r 2.2 r v = 44r - 38 v = 44r - 38 ; r ; r 2 2 =.3 =.3 Hubble 12 data Hubble 12 data 2.2 2.2 2. 2. 1.8 1.8 1. 1. 1.4 1.4 1.2 1.2 1. 1..8.8...4.4.2.2.. -4-4 - - 4 4 8 8 1 v 1 v r =.13v +.3 r =.13v +.3 ; r ; r 2 2 =.3 =.3 Note: The correlation value, r, does NOT depend on the distinction between Explanatory and Response.
Correlation Beer Beer and and Blood Blood Alcohol Scatter P.2.2.1.1.1.1.1.1.1.1.1.1.......... 1 2 3 4 7 8 Beers BAC BAC =.18Beers ; r; 2 r 2 =.8.8 --.13.13 There is a weak, positive, linear relationship between x and y. However, there is a strong nonlinear relationship. r measures the strength of linearity... The correlation, r, describes the strength of the straight-line relationship between x and y. Ex: There is a strong, positive, LINEAR relationship between # of beers and BAC. Collection 1 - - - - - - - - Scatter P 2 4 8 12 12 x y = x ;- r; - 2 r 2 =.14.14
Coefficient of Determination The coefficient of determination, r 2, describes the percent of variability in y that is explained by the linear regression on x. Wine Consumption and Heart Scatter Dise P 3 3 2 1 2 3 4 7 8 Alcohol (L/yr )) DeathRate = (-23. yr/l)alcohol ;; r 2 r =.71 71% of the variability in death rates due to heart disease can be explained by the LSRL on alcohol consumption. That is, alcohol consumption provides us with a fairly good prediction of death rate due to heart disease, but other factors contribute to this rate, so our prediction will be off somewhat.
Cautions Correlation and Regression are NOT RESISTANT to outliers and Influential Points! Correlations based on averaged data tend to be higher than correlations based on all raw data. Extrapolating beyond the observed data can result in predictions that are unreliable.
Correlation vs. Causation Consider the following historical data: Collection 1 Year MinistersRum 1 2 3 4 7 8 12 12 18 3 3 837 837 18 48 48 4 4 187 3 3 7 7 187 4 4 848 848 188 72 72 188 8 8 4 18 8 8 2 18 7 7 7 1 8 8 4 1 83 83 1 1 1388 1 14 14 18 Collection 1 Scatter P 18 18 1 1 14 14 14 1 1 1 8 8 8 8 4 4 4 4 4 4 841 4 4 841 x y = 132x 132x ;; r 2 r+ = 1. 33 1. 33 x There is an almost perfect linear relationship between x and y. (r=.7) x = # Methodist Ministers in New England y = # of Barrels of Rum Imported to Boston CORRELATION DOES NOT IMPLY CAUSATION!
Summary The The Endangered Endangered Manatee Manatee Scatter Scatter P 4 4 3 3 4477 4477 Registrations Registrations (thousand (thousand ) ) The The Endangered Endangered Manatee Manatee Scatter Scatter P 4 4 3 3 4477 4477 Registrations Registrations (thousand (thousand ) ) Killed Killed = = (.12 (.12 thousand^-1)reg thousand^-1)reg ; r ; 2 r 2 =.8 =.8 The The Endangered Endangered Manatee Manatee Scatter Scatter P 4 4 3 3 4477 4477 Registrations Registrations (thousand (thousand ) ) - - 4477 4477 Registrations Registrations (thousand (thousand )) Killed Killed = = (.12 (.12 thousand^-1)reg thousand^-1)reg ; r; 2 r 2 =.8 =.8