Introduction to Statistical Data Analysis Lecture 8: Correlation and Simple Regression



James V. Lambers
Department of Mathematics
The University of Southern Mississippi

Introduction
In the previous lecture, we learned how to determine whether two random variables were statistically dependent on one another, using the chi-square goodness-of-fit test. However, that test alone does not give us any indication of how the variables are related. In this lecture, we will learn how to use correlation and regression to gain some insight into the nature of the relationship between two variables.

Independent and Dependent Variables
In the following discussion, we classify one of the variables, x, as the independent variable, and the other variable, y, as the dependent variable. This means that x serves as the input and y serves as the output. Mathematically, y is a function of x, meaning that y is determined from x in some systematic way. Therefore, for each value of x, there is only one value of y, whereas one value of y can correspond to more than one value of x.

Correlation
Correlation measures the strength and direction of the relationship between x and y. Types of correlation are: positive linear correlation, which means that as x increases, y increases linearly; negative linear correlation, which means that as x increases, y decreases linearly; nonlinear correlation, which means that there is a clear relationship between x and y, but the dependence of y on x cannot be described graphically using a straight line; and no correlation, which means that there is no clear relationship between x and y. In the remainder of this discussion, we will limit ourselves to linear correlation.

Correlation Coefficient
To determine the correlation between two variables x and y, for which we have n observations of each, we compute the correlation coefficient, which is defined by
\[
r = \frac{n \sum_{i=1}^n x_i y_i - \left( \sum_{i=1}^n x_i \right) \left( \sum_{i=1}^n y_i \right)}{\sqrt{\left[ n \sum_{i=1}^n x_i^2 - \left( \sum_{i=1}^n x_i \right)^2 \right] \left[ n \sum_{i=1}^n y_i^2 - \left( \sum_{i=1}^n y_i \right)^2 \right]}}.
\]
Geometrically, r is the cosine of the angle between the vector of x-values and the vector of y-values, with their respective means subtracted. It follows from this interpretation that $|r| \le 1$.

Interpretation
If r > 0, then x and y have a positive linear correlation, whereas if r < 0, then x and y have a negative linear correlation. If r = 0, then there is no linear correlation between x and y. In the extreme cases, r = ±1, the data lie exactly on a straight line: $y - \bar{y} = c(x - \bar{x})$ for some constant c that is positive (r = 1) or negative (r = −1). The benefit of knowing whether two variables are linearly correlated is that we can, at least approximately, predict values of the dependent variable y from values of the independent variable x. Of course, the accuracy of this prediction depends on |r|; if |r| is nearly zero, such a prediction is not likely to be reliable.

Testing the Significance of r
Suppose we have determined that x and y are linearly correlated, based on the value of the correlation coefficient r obtained from a sample. How do we know whether a similar correlation applies to the entire population? We can answer this question by performing a hypothesis test on the population correlation coefficient, which we denote by ρ. If we only wish to test whether ρ is nonzero, then we can use a two-tail test, with null hypothesis $H_0: \rho = 0$ and alternative hypothesis $H_1: \rho \ne 0$. On the other hand, if we wish to test for a positive linear correlation, we can perform a one-tail test with null hypothesis $H_0: \rho \le 0$ and alternative hypothesis $H_1: \rho > 0$; testing for a negative linear correlation is similar.

Performing the Test
For this test, we use the Student t-distribution. The test statistic is
\[
t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}},
\]
where, as before, n is the sample size for each variable, d.f. = n − 2 is the number of degrees of freedom, and $\sqrt{(1 - r^2)/(n - 2)}$ is the standard error of the correlation coefficient. For the one-tail test with $H_0: \rho \le 0$, we reject $H_0$ and conclude that x and y have a positive linear correlation if $t > t_\alpha$. For the two-tail test with $H_0: \rho = 0$, we reject $H_0$ and conclude that x and y are linearly correlated if $|t| > t_{\alpha/2}$.
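As a minimal sketch of the test just described, the following R code computes r from the summation formula, forms the t statistic, and carries out the two-tail test. It reuses the x and y vectors from the lsfit example in this lecture; the significance level alpha = 0.05 and the variable names are assumptions made for illustration. The built-in functions cor and cor.test are used only as a cross-check.

```r
# Data from the lecture's lsfit example
x <- 1:10
y <- c(8, 6, 10, 6, 10, 13, 9, 11, 15, 17)
n <- length(x)

# Correlation coefficient from the summation formula
r <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

# Test statistic, with n - 2 degrees of freedom
t_stat <- r / sqrt((1 - r^2) / (n - 2))

# Two-tail test of H0: rho = 0 (alpha = 0.05 is an assumed level)
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 2)
reject_h0 <- abs(t_stat) > t_crit

# Cross-check against R's built-in correlation test
builtin <- cor.test(x, y)
print(c(r = r, t = t_stat, t_crit = t_crit))
print(reject_h0)
```

For a one-tail test of $H_0: \rho \le 0$, the critical value would instead be qt(1 - alpha, df = n - 2), compared against t_stat without the absolute value.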

Correlation vs. Causation
Always keep in mind: correlation does not imply causation! It often occurs that variables exhibit a correlation with one another even though neither has any influence whatsoever on the other. Even if there is a causal relationship, it's not always clear which is the cause and which is the effect!

Reverse Causality
Case in point: the effect of Course Signals on student retention at Purdue University. Purdue developed Course Signals to use analytics to alert faculty and staff to potential problems for students. Purdue claimed that when students took at least two courses that used Course Signals, retention improved by 21%! This conclusion was supported by appropriate data, so what could be the problem?

Look for Anomalies!
It was observed from the data that taking two Course Signals courses greatly improved retention, whereas taking only one did not help at all. Also, an initial bump in retention rate quickly faded after Course Signals had been in use for a few years. What the data was really showing was that students were taking more Course Signals courses because they were taking more courses overall (that is, the analysis did not control for freshmen dropping out early). In other words, it was retention that led to increased use of Course Signals, not the other way around!
Reference: "What the Course Signals Kerfuffle is About, and What it Means to You" by Michael Caulfield, posted at educause.edu

Causal Inference
Given that two variables are correlated, the ideal approach to establishing causation is to understand the mechanism by which it acts. Failing that, another approach, if less effective, is to perform a controlled intervention study. Establishing causation based solely on observations is much less reliable, but more broadly applicable. In fact, this is impossible without making assumptions about the data.
Reference: Max Planck Institute

Inferring Causation via Probability
Reichenbach's theory of causation: C is a cause of E if and only if (1) $P(E \mid C) > P(E \mid \overline{C})$, where $\overline{C}$ denotes the event that C does not occur, and (2) there is no event B such that $P(E \mid B \cap C) = P(E \mid B)$ (that is, no event B screens off C from E). Equivalently, condition (2) says that there is no event B such that C and E are independent given B. This theory has several shortcomings, which are somewhat rectified by Cartwright and Skyrms using background contexts (other causes of E that are controlled) in place of screening-off events. Eells further refined this theory to define positive and negative causation probabilistically.

Proof
Claim: if B screens off C from E, then C and E are independent, given B. Equivalently: if $P(E \mid B \cap C) = P(E \mid B)$, then $P(C \cap E \mid B) = P(C \mid B) P(E \mid B)$.
Proof: by conditional probability and the multiplication rule,
\[
P(E \mid B \cap C) = \frac{P(E \cap B \cap C)}{P(B \cap C)} = \frac{P(E \cap B \cap C)}{P(C \mid B) P(B)}.
\]
But $P(E \cap B \cap C) = P(C \cap E \mid B) P(B)$. Therefore, substituting and using $P(E \mid B \cap C) = P(E \mid B)$,
\[
P(C \cap E \mid B) P(B) = P(C \mid B) P(B) P(E \mid B),
\]
and dividing both sides by $P(B) > 0$ gives the claim.
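The claim can be checked numerically on a small example. The R sketch below assumes a hypothetical common-cause model in which B influences both C and E, and C and E are independent once B is known; all of the probabilities are made up for illustration. It shows that C raises the probability of E unconditionally, yet B screens off C from E.

```r
# Hypothetical common-cause model (all numbers are illustrative assumptions)
p_Bs      <- c(B = 0.5, notB = 0.5)   # P(B), P(not B)
p_C_given <- c(B = 0.8, notB = 0.2)   # P(C | B), P(C | not B)
p_E_given <- c(B = 0.9, notB = 0.1)   # P(E | B), P(E | not B)

# Joint probabilities, split by whether B occurs
p_CE_b    <- p_C_given * p_E_given * p_Bs        # P(C and E and b)
p_C_b     <- p_C_given * p_Bs                    # P(C and b)
p_notCE_b <- (1 - p_C_given) * p_E_given * p_Bs  # P(not C and E and b)
p_notC_b  <- (1 - p_C_given) * p_Bs              # P(not C and b)

# Condition (1): C raises the probability of E
p_E_given_C    <- sum(p_CE_b) / sum(p_C_b)
p_E_given_notC <- sum(p_notCE_b) / sum(p_notC_b)

# Screening off: P(E | B and C) equals P(E | B)
p_E_given_BC <- p_CE_b[["B"]] / p_C_b[["B"]]
p_E_given_B  <- p_E_given[["B"]]

# Conditional independence given B, as in the claim
p_CE_given_B <- p_CE_b[["B"]] / p_Bs[["B"]]

print(c(p_E_given_C = p_E_given_C, p_E_given_notC = p_E_given_notC))
print(c(p_E_given_BC = p_E_given_BC, p_E_given_B = p_E_given_B))
```

In this model, condition (1) of Reichenbach's theory holds but condition (2) fails, so C would not count as a cause of E.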

Simple Regression
If x and y are found to be linearly correlated, then we can use simple regression to find the straight line that best fits the ordered pairs $(x_i, y_i)$, i = 1, 2, ..., n. The equation of this line is ŷ = a + bx, where ŷ is the predicted value of y obtained from x. The y-intercept a and slope b need to be determined.

The Least Squares Method
To find the values of a and b such that the line ŷ = a + bx best fits the sample data, we use the least squares method. In this method, we compute a and b so as to minimize
\[
\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - a - b x_i)^2.
\]
The name of the method comes from the fact that we are trying to minimize a sum of squares of the deviations between y and ŷ. The line ŷ = a + bx that minimizes this sum of squares, and therefore best fits the data, is called the regression line.

Solving the Least Squares Problem
The criterion of minimizing the sum of squares is chosen because it is differentiable, and is therefore suitable for minimization techniques from calculus. The minimizing coefficients are
\[
b = \frac{n \sum_{i=1}^n x_i y_i - \left( \sum_{i=1}^n x_i \right) \left( \sum_{i=1}^n y_i \right)}{n \sum_{i=1}^n x_i^2 - \left( \sum_{i=1}^n x_i \right)^2}, \qquad a = \bar{y} - b \bar{x},
\]
where $\bar{x}$ and $\bar{y}$ are the sample means
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i.
\]
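The closed-form coefficients can be evaluated directly. The following R sketch applies the formulas to the x and y vectors from the lsfit example in this lecture and compares the result with the built-in function lm, which is used here only as a cross-check; the variable names are illustrative.

```r
# Data from the lecture's lsfit example
x <- 1:10
y <- c(8, 6, 10, 6, 10, 13, 9, 11, 15, 17)
n <- length(x)

# Slope and intercept from the closed-form least squares solution
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
a <- mean(y) - b * mean(x)

# Cross-check against R's linear model fit
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

For this data set the sums are small integers, so the slope can also be verified by hand: b = (10·658 − 55·105)/(10·385 − 55²) = 805/825.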

Discussion
It should be noted that b is closely related to the correlation coefficient r; the formulas have the same numerator. It follows that the slope is positive if and only if the correlation coefficient indicates that x and y have a positive linear correlation. In R, the least squares method is implemented in the function lsfit. Its simplest usage is to specify two arguments, which are vectors consisting of the x- and y-values, respectively. It returns a data structure called a named list, which includes the coefficients a and b of the regression line.

Example
The following code illustrates the use of lsfit, including extraction of the y-intercept a and slope b. Then, both the data points and regression line are plotted.

> x=c(1:10)                        # x-values 1, 2, ..., 10
> y=c(8,6,10,6,10,13,9,11,15,17)   # observed y-values
> lslist=lsfit(x,y)                # least squares fit; returns a named list
> coefs=lslist[["coefficients"]]   # extract the fitted coefficients
> coefs
Intercept         X 
5.1333333 0.9757576 

Extracting the Coefficients

> a=coefs[["Intercept"]]   # y-intercept
> b=coefs[["X"]]           # slope
> a
[1] 5.133333
> b
[1] 0.9757576
> plot(x,y)                # scatter plot of the data
> abline(a,b)              # add the regression line

Code Dissection
The first two statements specify vectors of x- and y-values; the x-values are the integers 1 through 10, specified concisely using the colon operator. Note the use of double square brackets to extract elements of a named list; the names of elements of a list returned by a built-in R function are listed in the documentation. The element coefs extracted from the result of lsfit is a named numeric vector, the elements of which are the y-intercept a and slope b; double square brackets extract its elements by name as well. The plot command plots the individual data points, and abline adds a line to the current plot, with the first argument specifying the y-intercept and the second argument specifying the slope.

Plot of Regression Line
It is merely coincidence that in this example, the regression line happens to pass through one of the points; in general this does not happen, as the goal of the least squares method is to minimize the sum of the squared distances between the predicted y-values and the observed y-values.

Confidence Interval for the Regression Line
To measure how well the regression line fits the data, we can construct a confidence interval. We use the standard error of the estimate,
\[
s_e = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{\sum_{i=1}^n y_i^2 - a \sum_{i=1}^n y_i - b \sum_{i=1}^n x_i y_i}{n - 2}},
\]
which measures the amount of dispersion of the observations around the regression line. The smaller $s_e$ is, the closer the points are to the regression line. It is worth noting the similarity between this formula and the sample standard deviation; the number of degrees of freedom is n − 2 since two degrees of freedom are taken away by the coefficients a and b of the regression line.
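Both forms of the standard error of the estimate can be evaluated and compared. The R sketch below uses the data from the lsfit example in this lecture; lm is used here to obtain a and b, and summary(fit)$sigma (R's residual standard error) serves as a cross-check. The variable names are illustrative.

```r
# Data from the lecture's lsfit example
x <- 1:10
y <- c(8, 6, 10, 6, 10, 13, 9, 11, 15, 17)
n <- length(x)

# Fitted intercept and slope
fit <- lm(y ~ x)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["x"]]

# Definition: root of the residual sum of squares over n - 2
y_hat <- a + b * x
se_def <- sqrt(sum((y - y_hat)^2) / (n - 2))

# Computational shortcut using only sums of the data
se_short <- sqrt((sum(y^2) - a * sum(y) - b * sum(x * y)) / (n - 2))

print(c(se_def = se_def, se_short = se_short, sigma = summary(fit)$sigma))
```

The shortcut form subtracts quantities of similar size, so for data with large values the definition-based form is likely to be the more numerically reliable of the two.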

Testing the Slope of the Regression Line
We need to determine whether the slope b of the regression line is indicative of the slope β for the population. To that end, we can perform a hypothesis test. For example, we can use the null hypothesis $H_0: \beta = \beta_0$ and the alternative hypothesis $H_1: \beta \ne \beta_0$ for a two-tail test. If $\beta_0 = 0$, then we are testing whether there is any linear relationship between x and y, and rejection of $H_0$ would imply that this is the case.

Standard Error of the Slope
The standard error of the slope is
\[
s_b = \frac{s_e}{\sqrt{\sum_{i=1}^n x_i^2 - n \bar{x}^2}},
\]
where $s_e$ is the standard error of the estimate, defined earlier. Note that $s_b$ is the dispersion of the y-values about the regression line divided by $\sqrt{n}$ times the standard deviation of the x-values (computed with divisor n), which intuitively makes sense because we are testing the slope, which is the change in y divided by the change in x.

Test Statistic
As with the test of the correlation coefficient, we use the Student's t-distribution to determine the critical value. The test statistic is
\[
t = \frac{b - \beta_0}{s_b}.
\]
This is compared to the critical value $t_{\alpha/2, n-2}$, the t-value satisfying $P(T_{n-2} > t_{\alpha/2, n-2}) = \alpha/2$. If $|t| > t_{\alpha/2, n-2}$, then we reject $H_0$ and conclude $\beta \ne \beta_0$. If $\beta_0 = 0$, then our conclusion is that x and y are linearly correlated.
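The slope test can be sketched in R as follows, again using the data from the lsfit example in this lecture. The hypothesized slope beta0 = 0 and the level alpha = 0.05 are assumptions for illustration; the coefficient table from summary(lm(...)) is used only to cross-check the hand computation.

```r
# Data from the lecture's lsfit example
x <- 1:10
y <- c(8, 6, 10, 6, 10, 13, 9, 11, 15, 17)
n <- length(x)

fit <- lm(y ~ x)
b <- coef(fit)[["x"]]

# Standard error of the estimate and of the slope
s_e <- sqrt(sum(residuals(fit)^2) / (n - 2))
s_b <- s_e / sqrt(sum(x^2) - n * mean(x)^2)

# Two-tail test of H0: beta = beta0 (beta0 and alpha are assumed values)
beta0 <- 0
alpha <- 0.05
t_stat <- (b - beta0) / s_b
t_crit <- qt(1 - alpha / 2, df = n - 2)
reject_h0 <- abs(t_stat) > t_crit

# Cross-check against the coefficient table reported by R
coef_table <- summary(fit)$coefficients
print(c(s_b = s_b, t = t_stat, t_crit = t_crit))
print(reject_h0)
```

With beta0 = 0, the t statistic coincides with the one obtained when testing the significance of r, which is why the two tests lead to the same conclusion in simple regression.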

Interpretation
It is important to keep in mind that correlation does not imply causation. That is, even if there is a strong correlation between x and y, that does not necessarily mean that a change in y is caused by a change in x. It could be mere coincidence, or that some other variable influences both x and y in a similar way.

The Coefficient of Determination
The strength of the relationship between x and y can be measured by the coefficient of determination, which is defined to be $r^2$, where r is the correlation coefficient. More precisely, the coefficient of determination measures the proportion of the variation in y that can be explained by the regression line.
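The "explained variation" reading of $r^2$ can be checked directly. This R sketch, using the data from the lsfit example in this lecture, compares $r^2$ with one minus the ratio of the residual sum of squares to the total sum of squares; the names sse and sst are illustrative.

```r
# Data from the lecture's lsfit example
x <- 1:10
y <- c(8, 6, 10, 6, 10, 13, 9, 11, 15, 17)

r <- cor(x, y)
fit <- lm(y ~ x)

# Variation left unexplained by the line, and total variation in y
sse <- sum(residuals(fit)^2)
sst <- sum((y - mean(y))^2)
r2_from_ss <- 1 - sse / sst

print(c(r_squared = r^2, from_sums_of_squares = r2_from_ss))
```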

Assumptions
For the least squares method to be valid, we need to make the following assumptions: (1) the individual differences between $y_i$ and $\hat{y}_i$, i = 1, 2, ..., n, are independent of one another; (2) the observed values of y are normally distributed around ŷ; (3) the variation of y around the regression line is equal for all values of x.

Polynomial Regression
In linear regression, we are trying to find constants a and b such that the function y = a + bx best fits the data $(x_i, y_i)$, i = 1, 2, ..., n, in the least-squares sense. The method of least squares can readily be generalized to the problem of finding constants $c_0, c_1, \ldots, c_m$ such that the function
\[
y = c_0 + c_1 x + c_2 x^2 + \cdots + c_m x^m,
\]
a polynomial of degree m, best fits the data.

System Set-up
We define the n × (m + 1) matrix
\[
A = \begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^m \\
1 & x_2 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^m
\end{bmatrix},
\]
and the vectors
\[
c = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_m \end{bmatrix}, \qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
\]
A is known as a Vandermonde matrix.

The Normal Equations
Then, by solving the normal equations
\[
A^T A c = A^T y,
\]
we obtain the coefficients of the best-fitting polynomial of degree m. Note that $A^T$ is the transpose of A, which is obtained by changing rows into columns; that is, $(A^T)_{ij} = a_{ji}$.
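A minimal R sketch of this procedure follows: it builds the Vandermonde matrix for a quadratic (m = 2), solves the normal equations, and compares the result with lm. The data are the x and y vectors from the lsfit example earlier in this lecture, chosen only for illustration; the names A and c_hat are assumptions.

```r
# Data from the lecture's lsfit example
x <- 1:10
y <- c(8, 6, 10, 6, 10, 13, 9, 11, 15, 17)

# n x (m + 1) Vandermonde matrix with columns 1, x, x^2, ..., x^m
m <- 2
A <- outer(x, 0:m, "^")

# Solve the normal equations  A^T A c = A^T y
c_hat <- solve(t(A) %*% A, t(A) %*% y)

# Cross-check against lm with the monomial basis
fit <- lm(y ~ poly(x, m, raw = TRUE))
print(as.vector(c_hat))
print(unname(coef(fit)))
```

Forming $A^T A$ explicitly is fine for small, well-scaled problems like this one. For high degrees or widely spread x-values the normal equations become ill-conditioned, and lm avoids this by using a QR factorization of A instead.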

Example
The following R statements construct data vectors x and y, and then call the function lm (short for "linear model") to obtain the coefficients of the best-fitting quadratic polynomial.

> x=c(0.6291,0.2956,0.6170,0.9885,0.3440,0.2396,0.0004,...
> y=c(0.7487,0.6169,0.1834,0.8436,0.7160,0.6518,0.6128,...
> lm(y~poly(x,2,raw=TRUE))

Call:
lm(formula = y ~ poly(x, 2, raw = TRUE))

Coefficients:
(Intercept)  poly(x, 2, raw = TRUE)1  poly(x, 2, raw = TRUE)2

That is, the quadratic function that best fits the data is $y = c_0 + c_1 x + c_2 x^2$, where $c_0$, $c_1$, and $c_2$ are the three coefficients reported by lm, in that order.

Code Dissection
The expression y~poly(x,2,raw=TRUE) specifies that y is to be treated as a quadratic function of x. That is, the second argument to poly is the degree. The third argument to poly, raw=TRUE, specifies that the monomial basis $1, x, x^2, \ldots$ is to be used, instead of the default behavior of poly, which is to use orthogonal polynomials. This is done in order to facilitate interpretation of the coefficients returned by lm.

Multiple Linear Regression
A similar approach can be used for multiple linear regression, in which we seek a model of the form
\[
y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_m x_m.
\]
Let $x_{ij}$ be the ith observation of $x_j$. We define the matrix A by
\[
A = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1m} \\
1 & x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{bmatrix}.
\]
Then, we solve the normal equations $A^T A c = A^T y$ to obtain the coefficients $c_0, c_1, \ldots, c_m$.

Example
Suppose that we have a set of n observations $(x_{i1}, x_{i2}, y_i)$, i = 1, 2, ..., n, and seek the coefficients $c_0$, $c_1$, $c_2$ so that the model $y = c_0 + c_1 x_1 + c_2 x_2$ best fits the data in the least-squares sense.

Getting the Job Done in R
The following R statements obtain these coefficients.

> x1=c(0.4092,0.9977,0.6238,0.3532,0.1827,0.3209,...
> x2=c(0.9525,0.8742,0.1622,0.1467,0.6498,0.7901,...
> y=c(0.2549,0.9122,0.3675,0.0380,0.6508,0.8164,...
> lm(y~x1+x2)

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2

That is, $c_0$, $c_1$, and $c_2$ are the values reported under (Intercept), x1, and x2, respectively.
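To connect lm with the normal equations, the R sketch below solves $A^T A c = A^T y$ directly for a two-predictor model and compares the result with lm. The data are synthetic and purely hypothetical (generated with a fixed seed from assumed coefficients 0.5, 2, and −1), since the full data vectors from the example are not available.

```r
# Synthetic, hypothetical data: y depends linearly on x1 and x2 plus noise
set.seed(1)
n  <- 20
x1 <- runif(n)
x2 <- runif(n)
y  <- 0.5 + 2 * x1 - 1 * x2 + rnorm(n, sd = 0.1)

# Matrix A with a column of ones followed by the predictors
A <- cbind(1, x1, x2)

# Normal equations: crossprod(A) is A^T A, crossprod(A, y) is A^T y
c_hat <- solve(crossprod(A), crossprod(A, y))

# Cross-check against lm
fit <- lm(y ~ x1 + x2)
print(as.vector(c_hat))
print(unname(coef(fit)))
```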

Exponential Regression
The least squares method can also be used for models of the form $y = b e^{ax}$, where a and b are coefficients that are to be determined. Taking the natural logarithm of both sides yields
\[
\ln y = \ln b + ax,
\]
so we can apply the method of least squares to the model z = c + ax, where z = ln y and c = ln b, and then compute $b = e^c$.
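The log transformation can be sketched in R as follows. The data are hypothetical, chosen to lie roughly along $y = 2e^{0.3x}$, and the variable names are illustrative; lm stands in for the least squares step.

```r
# Hypothetical data lying roughly along y = 2 * exp(0.3 * x)
x <- 1:8
y <- c(2.7, 3.6, 4.9, 6.7, 8.9, 12.2, 16.3, 22.1)

# Fit the linearized model z = c + a * x, where z = ln(y)
z <- log(y)
fit <- lm(z ~ x)
c_hat <- coef(fit)[["(Intercept)"]]
a <- coef(fit)[["x"]]

# Recover b = e^c and evaluate the fitted exponential model
b <- exp(c_hat)
y_fit <- b * exp(a * x)

print(c(a = a, b = b))
```

Note that this minimizes the sum of squared deviations in ln y rather than in y itself, so the result generally differs slightly from a direct nonlinear least squares fit of the exponential model. The approach also requires all y-values to be positive.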

Maximum Likelihood
Let $x_1, x_2, \ldots, x_n$ be a sample of n i.i.d. (independent and identically distributed) observations, coming from an unknown distribution with probability density (or mass) function of the form $f(x \mid \theta)$. The method of maximum likelihood is used to obtain an estimate $\hat{\theta}$ of the unknown parameter θ. Because the observations are independent, we have
\[
f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) f(x_2 \mid \theta) \cdots f(x_n \mid \theta).
\]
The maximum likelihood estimator (MLE) is the value $\hat{\theta}$ of θ that maximizes the average log-likelihood
\[
\hat{\ell} = \frac{1}{n} \sum_{i=1}^n \ln f(x_i \mid \theta).
\]

Example
Let the n observations be coin flips of an unfair coin, and let h be the number of heads. The number of heads follows a binomial distribution
\[
f(X = h \mid \theta) = \binom{n}{h} \theta^h (1 - \theta)^{n - h}
\]
with unknown probability of success θ. The MLE $\hat{\theta}$ maximizes
\[
\frac{1}{n} \ln \left[ \binom{n}{h} \theta^h (1 - \theta)^{n - h} \right] = \frac{1}{n} \left[ \ln \binom{n}{h} + h \ln \theta + (n - h) \ln(1 - \theta) \right],
\]
which, through calculus, is maximized at $\hat{\theta} = h/n$: setting the derivative with respect to θ to zero gives $h/\theta - (n - h)/(1 - \theta) = 0$.
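The calculus result can be confirmed numerically. The R sketch below maximizes the average log-likelihood with the built-in one-dimensional optimizer optimize; the values n = 50 and h = 32 are hypothetical, and the search interval is kept slightly inside (0, 1) to avoid taking the logarithm of zero.

```r
# Hypothetical experiment: h heads observed in n flips
n <- 50
h <- 32

# Average log-likelihood of the binomial model as a function of theta
avg_loglik <- function(theta) {
  (lchoose(n, h) + h * log(theta) + (n - h) * log(1 - theta)) / n
}

# Numerical maximization over theta in (0, 1)
opt <- optimize(avg_loglik, interval = c(0.001, 0.999), maximum = TRUE)
theta_hat <- opt$maximum

print(c(numerical = theta_hat, closed_form = h / n))
```

The numerical maximizer agrees with h/n only up to the tolerance of optimize, which is why the comparison should be made approximately rather than exactly.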
