Business Statistics Lecture 10: Correlation and Linear Regression
Scatterplot A scatterplot shows the relationship between two quantitative variables measured on the same individuals. It displays the form (linear or nonlinear?), direction (positive or negative?), and strength (none, weak, or strong?) of the relationship.
Scatterplot [Figure: example scatterplots, panels "Linear" and "Nonlinear". Source: Statistics for Managers, 4th Edition, Prentice-Hall, 2004]
Scatterplot [Figure: example scatterplots, panels "Negative" and "Positive" (direction), "Strong" and "Weak" (strength)]
Scatterplot [Figure: example scatterplot, panel "No relationship"]
Scatterplot A scatterplot is a graphical display of the relationship between two quantitative variables. Judging a relationship by eye is often not precise enough, so we need a numerical measure to supplement the graph. Correlation is the measure we use.
Correlation The correlation measures the direction and strength of the linear relationship between two quantitative variables. Sample correlation coefficient: r. Population correlation coefficient: ρ (rho).
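Correlation: Concept & Computation The Pearson correlation coefficient is standardized covariance: $r = \frac{\mathrm{Cov}(X, Y)}{s_X s_Y}$, where $s_X$ and $s_Y$ are the sample standard deviations of X and Y. Covariance: $\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}$ Covariance indicates the degree to which X and Y vary together.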
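The two formulas on this slide can be traced by hand on a small data set. The sketch below is only an illustration: the five (X, Y) pairs are hypothetical, and the helper names `mean`, `sample_cov`, and `pearson_r` are not from the lecture.

```python
import math

# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def sample_cov(u, v):
    """Sample covariance: sum of cross-deviations divided by N - 1."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / (len(u) - 1)

def pearson_r(u, v):
    """Covariance divided by the product of the two sample standard deviations."""
    s_u = math.sqrt(sample_cov(u, u))  # Cov(U, U) is the sample variance of U
    s_v = math.sqrt(sample_cov(v, v))
    return sample_cov(u, v) / (s_u * s_v)

cov_xy = sample_cov(x, y)
r = pearson_r(x, y)
print(cov_xy, round(r, 4))
```

For this data the covariance is 1.5 and r is about 0.77, a fairly strong positive linear relationship.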
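Interpreting Covariance Covariance between two variables: Cov(X, Y) > 0: X and Y tend to move in the same direction. Cov(X, Y) < 0: X and Y tend to move in opposite directions. Cov(X, Y) = 0: X and Y have no linear relationship. (Independent variables have zero covariance, but zero covariance does not by itself imply independence.)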
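A covariance of zero rules out only a linear relationship. As a minimal sketch with hypothetical data, take Y = X² on values of X that are symmetric around zero: Y is completely determined by X, yet the covariance comes out as zero.

```python
# Hypothetical data: y depends on x exactly, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance with the N - 1 denominator.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
print(cov_xy)
```

The positive and negative cross-deviations cancel, so covariance (and therefore correlation) misses this curved relationship entirely. A scatterplot would show it at once.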
Correlation $-1 \le r \le +1$. Unit free. The closer to -1, the stronger the negative linear relationship. The closer to +1, the stronger the positive linear relationship. The closer to 0, the weaker any linear relationship. It is NOT an indicator of a causal relationship between variables!
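"Unit free" means that changing the measurement units of either variable leaves r unchanged. The sketch below illustrates this with hypothetical income and spending figures; the variable names and values are assumptions made for the example, not data from the lecture.

```python
import math

def pearson_r(u, v):
    """Pearson r; the N - 1 factors cancel, so sums of deviations are enough."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

income_thousands = [20, 35, 50, 65, 80]    # hypothetical, in thousands
spending = [110, 120, 150, 155, 180]       # hypothetical
income_units = [1000 * v for v in income_thousands]  # same incomes, different unit

r_thousands = pearson_r(income_thousands, spending)
r_units = pearson_r(income_units, spending)
print(r_thousands, r_units)
```

Both calls return the same value up to floating-point rounding, whereas the covariance would be 1000 times larger in the second case.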
Scatterplots with Correlation Coefficients [Figure: six example scatterplots, panels r = -1, r = -.6, r = 0, r = +1, r = +.3, r = 0]
Causation? A strong correlation between two variables does not mean that changes in one variable cause changes in the other. The only legitimate way to try to establish a causal connection statistically is through the use of designed experiments.
Linear Regression Correlation treats two variables X and Y as equals. It shows a symmetric linear relationship between them. In many cases, we want to study an asymmetric linear relationship between X and Y. One variable (X) influences (or predicts) the other variable (Y). X = IV or predictor variable. Y = DV or outcome variable.
Linear Regression Analysis Describes how the DV (Y) changes as a single independent variable (X) changes (the effect of X on Y). With a single IV, it is called simple linear regression analysis. It summarizes the relationship between two variables if the form of the relationship is linear. It is often used as a mathematical model to predict the value of the DV (Y) based on a value of the IV (X): the linear regression model.
What is Linear? [Figure: example scatterplots, panels "Linear" and "Nonlinear"]
What is fitting a line to data? When a scatterplot displays a linear pattern, we can describe the overall pattern by drawing a straight line through the points. The equation of a line fitted to data gives a compact description of the dependence of the DV on the IV. It is a mathematical model for the straight-line relationship.
What is the equation of a line? A straight line relating Y to X has an equation of the form $Y = a + bX$, where a = intercept and b = slope. [Figure: a straight line with the intercept a and the slope b marked]
Simple Regression When we have a scatterplot with a linear relationship between the DV (Y) and a single IV (X), we are often interested in summarizing the overall pattern. We can do this by drawing a line on the graph. This type of line is called a regression line. A regression line is a straight line that describes how Y changes as X changes.
Simple Regression model Finding a regression line that explains the relationship between two variables well: $\hat{Y} = a + bX$. a = intercept: the mean value of the DV when the IV is zero. b = slope: the amount by which the DV changes on average when the IV changes by one unit.
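Simple regression model Fitted line: $\widehat{PS} = 81.54 + 1.22 \times \text{Income}$ [Figure: scatterplot of Purchase Spending (vertical axis, 80 to 220) against Income (horizontal axis, 20 to 90) with the fitted regression line]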
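To make the interpretation of the intercept and slope concrete, the fitted line can be used for prediction. This sketch assumes the fitted equation reads PS = 81.54 + 1.22 × Income (the operator is not legible in the slide text, but a plus sign is the reading consistent with the plotted axis ranges). The function name `predict_spending` is made up for the example.

```python
def predict_spending(income):
    """Predicted purchase spending for a given income.

    Coefficients are taken from the slide; the plus sign is an assumption.
    """
    intercept = 81.54  # predicted spending when income is zero
    slope = 1.22       # average change in spending per one-unit change in income
    return intercept + slope * income

print(predict_spending(50))
print(predict_spending(51) - predict_spending(50))
```

Raising income by one unit raises predicted spending by the slope, 1.22, whatever the starting income.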
How to determine the best regression line? The best regression line is the one that comes closest to the data points in the vertical direction. There are several ways to define "closest". The method of least squares is the most common. The least-squares regression line of Y on X is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Concept of least squares Vertical distance between a data point and the line (residual): $Y_i - \hat{Y}_i$. Sum of the squares of all vertical distances: $SS(\text{Error}) = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$
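As a rough illustration of the criterion, SS(Error) can be computed for several candidate lines and compared. The data and the candidate lines below are hypothetical; the line a = 2.2, b = 0.6 is the one the least-squares formulas give for this particular data set.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sse(a, b):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

sse_ls = sse(2.2, 0.6)     # least-squares line for this data
sse_flat = sse(4.0, 0.0)   # horizontal line at the mean of y
sse_steep = sse(1.0, 1.0)  # an arbitrary steeper line

print(sse_ls, sse_flat, sse_steep)
```

The least-squares line has the smallest SS(Error) of the three. No other straight line does better on this data.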
Least squares method: Slope In the least-squares method, the slope b is calculated by: $b = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}$ Numerator: the degree to which X and Y vary together (covariability). Denominator: the degree to which X varies separately (separate variability). Interpretation: for a one-unit change in X, how much does Y change?
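Least squares method: Intercept Once b is estimated, the intercept a is calculated by: $a = \bar{Y} - b\bar{X}$ This is derived from $Y_i = a + bX_i$ (i = 1, ..., N): averaging over the N observations gives $\bar{Y} = a + b\bar{X}$, so the fitted line passes through the point of means $(\bar{X}, \bar{Y})$.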
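The slope and intercept formulas can be applied directly. The sketch below uses hypothetical data and plain Python. The names `s_xy` and `s_xx` are just labels for the numerator and denominator of the slope formula.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Numerator: how X and Y vary together. Denominator: how X varies on its own.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

print(a, b)
```

Here s_xy = 6 and s_xx = 10, so b = 0.6 and a = 4 - 0.6 × 3 = 2.2, and the fitted line passes through the point of means (3, 4).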
Statistical test for Slope We can test whether the linear relationship between the two variables in simple linear regression is statistically significant. $H_0: \beta = 0$ (β = population slope): there is no linear relationship between X and Y. $H_1: \beta \neq 0$: there is a linear relationship. What can we use for testing this? Again, a t-test for a single parameter!
Statistical test for Slope Still remember this? $t = \frac{M - \mu}{s_M}$ We can easily adapt this formula for the slope: $t = \frac{b - \beta}{s_b}$, where b = sample slope, β = population slope (= 0 under $H_0$), and $s_b$ = standard error of the regression slope (the estimated standard deviation of b).
Statistical test for Slope Here, the standard error $s_b$ is given by: $s_b = \frac{s}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}}$, where $s = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - 2}}$
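Putting the pieces together, the test statistic for the slope can be computed step by step. This is a minimal sketch on hypothetical data, with β = 0 under the null hypothesis. The variable names are chosen for the example only.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual standard deviation s, using N - 2 degrees of freedom.
ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(ss_error / (n - 2))

s_b = s / math.sqrt(s_xx)  # standard error of the slope
t = (b - 0) / s_b          # test statistic under H0: beta = 0
df = n - 2

print(round(s_b, 4), round(t, 4), df)
```

The resulting t would then be compared with the critical value of the t distribution with N - 2 degrees of freedom.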
Inferences about the Slope: t Test Example
$H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$, d.f. = 10 - 2 = 8, α/2 = .025, critical values ±2.3060

              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

Test statistic: $t = b_1 / s_{b_1} = 3.329$
[Figure: t distribution with rejection regions below -2.3060 and above 2.3060 (α/2 = .025 in each tail); the observed t = 3.329 falls in the upper rejection region]
Decision: Reject $H_0$.
Conclusion: There is sufficient evidence that square footage affects house price.
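The numbers in this example can be checked with a few lines of arithmetic. The sketch below takes the coefficients, standard errors, and critical value exactly as printed on the slide; nothing is re-estimated from raw data.

```python
# Values copied from the regression output on the slide.
b1 = 0.10977          # slope for Square Feet
se_b1 = 0.03297       # standard error of the slope
t_critical = 2.3060   # two-sided critical value, d.f. = 8, alpha = .05

t_stat = b1 / se_b1                   # test statistic under H0: beta1 = 0
reject_h0 = abs(t_stat) > t_critical  # two-sided decision rule

# The same ratio for the intercept row of the table.
t_intercept = 98.24833 / 58.03348

print(round(t_stat, 3), reject_h0, round(t_intercept, 3))
```

The slope's t statistic of about 3.329 exceeds 2.3060, so the null hypothesis is rejected, which agrees with the reported p-value of 0.01039 being below .05. The intercept's t of about 1.693 does not exceed the critical value.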
Statistical test for Slope If the absolute value of the observed t is greater than the critical value of t with DF = N - 2 and α = .05 (two-sided), we reject the null hypothesis. This indicates that there is a significant linear relationship between X and Y.
Assessing the goodness-of-fit of the regression model: Coefficient of Determination ($R^2$) Proportion of the total variation in Y accounted for by the regression model. $R^2 = \frac{SS(\text{Regression})}{SS(T)}$ Ranges from 0 to 1. The larger $R^2$, the more variance of the DV is explained. 0 = no explanation at all. 1 = perfect explanation. In simple regression, $r^2 = R^2$.
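A short numerical check can make both claims on this slide concrete: that $R^2$ is the share of total variation captured by the regression, and that it equals the squared correlation in simple regression. The data below are hypothetical and the variable names are chosen for the example.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)                # SS(T)
ss_regression = sum((fi - y_bar) ** 2 for fi in y_hat)       # SS(Regression)
ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # SS(Error)

r_squared = ss_regression / ss_total
r = s_xy / math.sqrt(s_xx * ss_total)  # Pearson correlation

print(round(r_squared, 4), round(r ** 2, 4))
```

For this data SS(T) = 6 splits into SS(Regression) = 3.6 and SS(Error) = 2.4, so $R^2$ = 0.6, which matches $r^2$.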
Examples of $R^2$ Values $R^2 = 1$: Perfect linear relationship between x and y: 100% of the variation in y is explained by variation in x. [Figure: two scatterplots of y against x with every point exactly on a straight line, each labelled $R^2 = 1$]
Examples of $R^2$ Values $0 < R^2 < 1$: Weaker linear relationship between x and y: some but not all of the variation in y is explained by variation in x. [Figure: two scatterplots of y against x with points scattered around a straight line]
Examples of $R^2$ Values $R^2 = 0$: No linear relationship between x and y. The value of y does not depend on x. (None of the variation in y is explained by variation in x.) [Figure: scatterplot of y against x with a horizontal fitted line, labelled $R^2 = 0$]