Correlation and Regression Notes

Relationship Hypothesis Tests

Categorical / Categorical Relationship (Chi-Squared Independence Test)
Ho: The categorical variables are independent (the distributions of conditional probabilities are the same).
Ha: The categorical variables are dependent (the distributions of conditional probabilities are different).

Categorical / Quantitative Relationship (ANOVA)
Ho: µ1 = µ2 = µ3 = µ4 = µ5 = µ6 (the categorical variable and the quantitative variable are independent, i.e. not related)
Ha: at least one mean is different (≠) (the categorical variable and the quantitative variable are dependent, i.e. related)

Quantitative / Quantitative Relationship (Correlation Hypothesis Test)

Regression
Correlation: Check whether there is a linear relationship between two different quantitative variables. The study of that relationship is often called Correlation and Regression.
Scatterplot: A graph for visually checking whether a correlation is present.

I. Choosing your variables
Choose which variable will be x (the explanatory or independent variable) and which variable will be y (the response or dependent variable).
Is one of the variables a natural response variable?
Ex) Year (time) and unemployment rates in the U.S. Let the explanatory variable x be time (years) and let the response variable y be the unemployment rate. Unemployment responds to time, but not the other way around.

If the variables respond to each other, pick the response variable to be the one you are most interested in or may want to make predictions about.
Ex) The unemployment rate in the U.S. and the national debt in the U.S. If you are studying the national debt and factors that may be related to it, then you should make the national debt your response variable y (which means the unemployment rate would be the explanatory variable x).

II. Graphing your data (Scatterplot and Correlation coefficient r)
Make ordered pairs (x, y) from your x and y data and create a scatterplot.
Statcato: Graph → Scatterplot → pick columns for x and y → Show regression curve → Linear → OK
StatCrunch: Graph → Scatterplot → pick columns for x and y → Compute

Correlation Study: See how well ordered-pair quantitative data fit a line (the regression line).
Correlation Coefficient (r): A number between -1 and +1 that measures the strength and direction of correlation. Always look at the scatterplot together with the r value; do not rely on the r value alone.

r close to +1 (Ex: r = +0.893): Strong positive correlation. The line goes up from left to right (positive slope) and the points in the scatterplot are close to the line. (r values of +0.6, +0.7, +0.8, +0.9 usually indicate a pretty strong positive correlation.)
r close to -1 (Ex: r = -0.916): Strong negative correlation. The line goes down from left to right (negative slope) and the points in the scatterplot are close to the line. (r values of -0.6, -0.7, -0.8, -0.9 usually indicate a pretty strong negative correlation.)
r close to 0 (Ex: r = +0.037 or r = -0.009): No linear correlation. The points in the scatterplot do not follow any linear pattern (but the relationship could still be nonlinear). (r values of ±0.0, ±0.1 usually indicate no linear correlation.)
r of ±0.2, ±0.3: Usually indicates a very weak linear correlation. There is some linear pattern, but the points are very far from the regression line.
r of ±0.4, ±0.5: Usually indicates a moderate linear correlation. There is a linear pattern and the points are only moderately close to the regression line.
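The r guidelines above can be sketched in a few lines of Python. This is only an illustration: the function names `pearson_r` and `describe_strength` are made up for this sketch, the data are hypothetical, and the exact cutoffs (0.15, 0.35, 0.55) are an assumption chosen to sit between the rounded r values listed in the notes. The scatterplot should still be checked alongside any label this produces.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired quantitative data (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe_strength(r):
    """Rough label for r, following the rule-of-thumb ranges in the notes.

    The cutoffs are assumed midpoints between the listed values
    (0.0-0.1 none, 0.2-0.3 very weak, 0.4-0.5 moderate, 0.6+ strong).
    """
    size = abs(r)
    if size < 0.15:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    if size < 0.35:
        return "very weak " + direction
    if size < 0.55:
        return "moderate " + direction
    return "strong " + direction

# Hypothetical ordered pairs, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(xs, ys)
print(round(r, 3), describe_strength(r))
```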

III. R-Squared (Squaring the correlation coefficient r)
R-squared: The percentage of variability in y (response) that can be explained by the linear relationship with x (explanatory).
Confounding Variables: Variables other than the explanatory variable (x) we are studying that might influence the response variable (y).

IV. Standard deviation of the residual errors (Se) (two meanings: average distance from the line and prediction error)
1. The average distance that the points are from the regression line.
2. If we use the regression line to make a prediction, the standard deviation of the residuals tells us how much error we can expect in that prediction on average.
Residual: How far a point is above or below the regression line.

Regression Line (Line of Best Fit, Line of Least Squares)
ŷ = A + Bx (OLI book)
ŷ = b0 + b1 X (most stat books)
b0 is the y-intercept (where the line crosses the y axis), the starting value.
b1 is the slope (the average rate of change).
Note: Remember that in a linear equation, the number in front of X is the slope.
Note: ŷ refers to the predicted y value, that is, a y value predicted by the regression line equation and not an actual y value from one of the ordered pairs in the scatterplot.
Definition of Slope (b1): The amount of increase (+) or decrease (-) in the y-variable for every 1 unit increase in the x-variable (per unit of x).
Definition of Y-intercept (b0): The predicted y value when x is zero. It can also be thought of as an initial value of y.
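To make the definitions above concrete, here is a minimal Python sketch that fits ŷ = b0 + b1 X by least squares and reports R-squared and Se. The function name `regression_summary` and the data are hypothetical. The sketch assumes the usual textbook formula Se = sqrt(SSE / (n - 2)); the notes describe Se only informally as an "average distance," so treat the formula as an assumption about what the software computes.

```python
import math

def regression_summary(xs, ys):
    """Least-squares line y-hat = b0 + b1*x with R-squared and Se."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b1 = sxy / sxx               # slope: change in y per 1 unit of x
    b0 = mean_y - b1 * mean_x    # y-intercept: predicted y when x = 0

    # Residual = actual y minus predicted y (above the line is positive).
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    sst = sum((y - mean_y) ** 2 for y in ys)

    r_squared = 1 - sse / sst        # share of variability in y explained
    se = math.sqrt(sse / (n - 2))    # assumed formula for Se
    return {"b0": b0, "b1": b1, "r_squared": r_squared, "se": se}

# Hypothetical data, for illustration only.
summary = regression_summary([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
for name, value in summary.items():
    print(name, round(value, 4))
```

Because Se divides by n - 2, at least three ordered pairs are needed for this sketch to run.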

Statcato Directions: Statistics → Correlation and Regression → Linear → pick x and y columns → Show scatterplot and residual plots → OK
StatCrunch Directions: Stat → Regression → Simple Linear → pick x and y columns → Compute

Example 1: (Health Data) Is a woman's age related to her diastolic blood pressure?
Pick x and y (blood pressure responds to age, but age does not respond to blood pressure).
X (explanatory or independent variable): Woman's age
Y (response or dependent variable): Diastolic blood pressure

Statcato Scatterplot and Correlation/Regression Printout

The scatterplot and r-value show a strong positive correlation (r = 0.6359).
r-squared = 0.4044 = 40.44%
r-squared sentence: 40.4% of the variability in a woman's diastolic blood pressure (in mm of Hg) can be explained by the linear relationship with the woman's age in years.
Confounding variables (that influence BP)? Race, ethnicity, stress, genetics, diet
Standard deviation of the residual errors (Se) = 9.0898 mm of Hg
Two sentences for Se:
1. Points in the scatterplot are 9.1 mm of Hg away from the regression line on average.

2. If we use the regression line to predict a woman's diastolic blood pressure from her age, we could have an average error of 9.1 mm of Hg.

Slope of regression line? 0.5937 (rate of change between x and y)
Slope = (change in Y) / (change in X) = (+0.59 mm of Hg) / (+1 year)
Slope sentence: Women's diastolic blood pressure increases 0.59 mm of Hg per year of age on average.

Y-intercept? 47.7 (predicted y value when x is zero)
Y-intercept sentence: When a woman is zero years old (just born), we predict her diastolic blood pressure to be 47.7 mm of Hg.

Note: Predicted y values are only accurate within the scope of the x-values in the data. Many formulas are not designed to have zero plugged in for x, so y-intercepts don't always make sense in context. In this example, the women in the data set had ages between 12 and 59. Zero is not in this scope, so the y-intercept is an extrapolation and may not be very accurate.
Extrapolation: Plugging a number into a formula that is outside the scope of the data, that is, a number the formula was never designed to handle.

Use the regression line to predict the diastolic blood pressure of a 50-year-old woman. (Replace x with 50 and work it out.)
Note: 50 is in the scope of the x-values (between 12 and 59), so this would not be an extrapolation.
ŷ = 47.6999 + 0.5937x
ŷ = 47.6999 + 0.5937(50)
ŷ = 47.6999 + 29.685
ŷ = 77.3849 ≈ 77.4 mm of Hg

How much error is there in that prediction? (Use Se = 9.0898.) The prediction could have an average error of about 9.1 mm of Hg.
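The prediction and the extrapolation check in Example 1 can be sketched in Python as follows. The intercept, slope, Se, and age range are the values quoted in these notes; the function name `predict_diastolic_bp` and the idea of returning an extrapolation flag are illustrative assumptions, not part of Statcato or StatCrunch.

```python
# Values taken from the Example 1 printout quoted in the notes.
INTERCEPT = 47.6999      # b0, in mm of Hg
SLOPE = 0.5937           # b1, in mm of Hg per year of age
SE = 9.0898              # standard deviation of the residuals, mm of Hg
AGE_MIN, AGE_MAX = 12, 59  # scope of the x-values in the data set

def predict_diastolic_bp(age):
    """Return (predicted BP, is_extrapolation) for a given age in years."""
    predicted = INTERCEPT + SLOPE * age
    # An age outside the observed range is an extrapolation.
    is_extrapolation = not (AGE_MIN <= age <= AGE_MAX)
    return predicted, is_extrapolation

for age in (50, 0):
    bp, extrapolated = predict_diastolic_bp(age)
    note = "extrapolation, may be inaccurate" if extrapolated else "within scope"
    print(f"age {age}: predicted {bp:.1f} mm of Hg "
          f"(average error about {SE:.1f}); {note}")
```

Returning the flag with the prediction keeps the scope warning attached to the number, so a y-intercept style prediction (age 0) is never reported without its caveat.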