Introduction to Statistical Data Analysis
Lecture 7: The Chi-Square Distribution
James V. Lambers
Department of Mathematics, The University of Southern Mississippi
Introduction

In this lecture, we will use hypothesis testing for two new purposes: to determine whether a given data set follows a specific probability distribution, and to determine whether two random variables are statistically independent.
Review of Data Measurement Scales

Recall from Lecture 1 that there are four data measurement scales: nominal, ordinal, interval, and ratio. The hypothesis testing techniques presented in Lecture 6 apply only to the more quantitative scales, interval and ratio. Now, though, we can use hypothesis testing for data measured on nominal or ordinal scales as well. This is because we are working with frequency distributions, which can be constructed from any data set, regardless of its measurement scale.
The chi-square goodness-of-fit test uses a sample to determine whether the frequency distribution of the population conforms to a particular probability distribution that it is believed to follow.

Example: Suppose that a six-sided die is rolled 150 times, and the result of each roll is recorded. The number of rolls that come up 1, 2, 3, 4, 5, or 6 should follow a uniform distribution. A chi-square goodness-of-fit test can be used to compare the observed number of rolls for each value, from 1 to 6, to the expected number of rolls for each value, which is 150/6 = 25.
For the chi-square goodness-of-fit test, the null hypothesis H_0 is that the population does follow the predicted distribution, and the alternative hypothesis H_1 is that it does not.
The chi-square goodness-of-fit test works with two frequency distributions with the same classes, and frequencies denoted by {O_i} and {E_i}, respectively. Each frequency O_i is the actual number of observations from the sample that belong to the ith class. Each frequency E_i is the expected number of observations that should belong to class i, assuming H_0 is true. It is essential that the total numbers of observations in both frequency distributions are equal; that is,

∑_{i=1}^n O_i = ∑_{i=1}^n E_i,

where n is the number of classes.
The test statistic for the chi-square goodness-of-fit test, also known as the chi-square score, is given by

χ² = ∑_{i=1}^n (O_i − E_i)² / E_i,

where, as before, n is the number of classes.
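As a quick illustration, the score for the die example can be computed directly. This is a sketch in Python rather than the R used later in these slides, and the observed counts here are hypothetical:

```python
# Chi-square goodness-of-fit score for the die example.
# Observed counts are hypothetical; under H0, a fair die rolled
# 150 times gives an expected count of 150/6 = 25 per face.
observed = [22, 19, 28, 31, 24, 26]   # hypothetical roll counts, summing to 150
expected = [150 / 6] * 6              # 25 expected rolls per face

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)  # 3.68 for these hypothetical counts
```

Since 3.68 is well below the critical value for 5 degrees of freedom at α = 0.05 (11.070), these hypothetical counts would not lead us to reject the fairness of the die.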
Once we have computed the test statistic, we compare it against the critical value χ²_c, which can be obtained in one of two ways. It can be looked up in a table of right-tail areas for the chi-square distribution, with degrees of freedom d.f. = n − 1 and chosen significance level α. Alternatively, one can use the R function qchisq with first parameter 1 − α and second parameter d.f. = n − 1; this function returns the value whose left-tail area equals its first parameter, in contrast to the right-tail table given in Appendix A, which is why 1 − α is given as the first parameter instead of α. If the chi-square score χ² is greater than this critical value χ²_c, then we reject H_0; otherwise we do not reject H_0. Because the test statistic and critical value are always positive, the chi-square goodness-of-fit test is always a one-tail test.
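For readers without R at hand, the critical value can also be approximated without a table. The sketch below uses the Wilson-Hilferty approximation, which is not the method used in these slides but agrees closely with qchisq for moderate degrees of freedom:

```python
import math

# Wilson-Hilferty approximation: if X follows a chi-square distribution
# with df degrees of freedom, then (X/df)^(1/3) is approximately normal
# with mean 1 - 2/(9*df) and variance 2/(9*df).
def chi_square_critical(df, z):
    """Approximate right-tail critical value; z is the standard normal
    quantile for the chosen significance level alpha."""
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

# alpha = 0.05 corresponds to z = 1.6449 (one-tail standard normal quantile)
print(chi_square_critical(10, 1.6449))  # about 18.29; qchisq(0.95,10) gives 18.307
```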
The chi-square distribution is of a very different character than the other distributions we have seen. If Z_1, Z_2, ..., Z_n are independent standard normal random variables, then the random variable Q defined by

Q = ∑_{i=1}^n Z_i²

follows the chi-square distribution with n degrees of freedom. It is not symmetric; rather, its values are skewed toward zero, which is the leftmost value of the distribution. However, as the number of degrees of freedom (d.f.) increases, the distribution becomes more symmetric.
Characteristics, cont'd

The probability density function for this distribution is

f_n(x) = 1 / (2^{n/2} Γ(n/2)) · x^{n/2 − 1} e^{−x/2}, for x > 0,

where n is the degrees of freedom and Γ is the gamma function.
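The density can be transcribed directly from the formula above. As a sanity check, this sketch numerically integrates it and confirms the total area is close to 1 (the choice of n = 5 and the integration range are arbitrary):

```python
import math

# Chi-square probability density function, transcribed from the formula above.
def chi_square_pdf(x, n):
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

# Sanity check: the density should integrate to about 1.
# Midpoint rule over [0, 60] for n = 5 degrees of freedom;
# the tail beyond 60 is negligible.
dx = 0.001
area = sum(chi_square_pdf((k + 0.5) * dx, 5) * dx for k in range(60000))
print(round(area, 4))  # close to 1
```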
Suppose a coin is flipped 10 times, and the number of times it comes up heads is recorded. Then, this process is repeated several times, for a total of 100 sequences of 10 flips each. Since coin flips are Bernoulli trials, the number of heads follows a binomial distribution, which yields the expected number of sequences that produce k heads.
Observed and Expected Values

Number of heads   Observed sequences   Expected sequences
0                 1                    0.098
1                 2                    0.977
2                 3                    4.395
3                 9                    11.719
4                 18                   20.508
5                 26                   24.609
6                 21                   20.508
7                 13                   11.719
8                 5                    4.395
9                 2                    0.977
10                0                    0.098
Performing the Chi-Square Test

Our null hypothesis H_0 is that the number of heads does in fact follow a binomial distribution. The chi-square score is

χ² = ∑_{i=0}^{10} (O_i − E_i)² / E_i
   = (1 − 0.098)²/0.098 + (2 − 0.977)²/0.977 + (3 − 4.395)²/4.395 + (9 − 11.719)²/11.719 + (18 − 20.508)²/20.508 + (26 − 24.609)²/24.609 + (21 − 20.508)²/20.508 + (13 − 11.719)²/11.719 + (5 − 4.395)²/4.395 + (2 − 0.977)²/0.977 + (0 − 0.098)²/0.098
   = 12.274.
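This arithmetic can be checked in a few lines. The Python sketch below computes the expected counts exactly from the binomial distribution, rather than from the rounded table values:

```python
from math import comb

# Reproduce the coin-flip example: expected counts come from the
# binomial distribution with 10 trials and p = 0.5, scaled by the
# 100 observed sequences.
observed = [1, 2, 3, 9, 18, 26, 21, 13, 5, 2, 0]
expected = [100 * comb(10, k) * 0.5 ** 10 for k in range(11)]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 4))  # 12.2743, matching the score computed above
```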
And the Verdict is...

This is compared to the critical value χ²_c, with degrees of freedom d.f. = n − 1 = 10, since there are n = 11 classes, and level of significance α = 0.05. We can use the R expression qchisq(1-0.05,10) to obtain χ²_c = 18.307. Since χ² < χ²_c, we do not reject H_0, and conclude that the distribution of the number of heads from each sequence of 10 flips follows a binomial distribution, as expected.
Chi-Square Goodness-of-fit Test in R

> obs=c(1,2,3,9,18,26,21,13,5,2,0)
> pexp=dbinom(0:10,10,0.5)
> chisq.test(obs,p=pexp)

        Chi-squared test for given probabilities

data:  obs
X-squared = 12.2743, df = 10, p-value = 0.2671
Now, we use the chi-square distribution to test whether two given random variables are statistically independent. For this test, the null hypothesis H_0 is that the variables are independent, while the alternative hypothesis H_1 is that they are not.
Contingency Tables

To compute the test statistic, we construct a contingency table, which is a two-dimensional array, or matrix, in which each cell contains an observed frequency of an ordered pair of values of the two variables. That is, the entry in row i, column j, which we denote by O_i,j, contains the number of observations that fall into class i of the first variable and class j of the second. The frequencies in this table are the observed frequencies for the chi-square goodness-of-fit test.
Computing Expected Frequencies

Next, for each row i and each column j, we compute E_i,j, which is (sum of entries in row i) × (sum of entries in column j), divided by the total number of observations; these are the expected frequencies for the chi-square goodness-of-fit test.
Relation to Independent Events

That is, if the contingency table has m rows and n columns, then

E_i,j = (∑_{k=1}^n O_{i,k}) (∑_{l=1}^m O_{l,j}) / (∑_{l=1}^m ∑_{k=1}^n O_{l,k}).

It should be noted that this quantity, divided again by the total number of observations, is exactly P(A_i)P(B_j), where A_i is the event that the first variable falls into class i, and B_j is the event that the second variable falls into class j. By the multiplication rule, this probability would equal P(A_i ∩ B_j) if the variables were independent.
The Test Statistic

Then, the test statistic is

χ² = ∑_{i=1}^m ∑_{j=1}^n (O_{i,j} − E_{i,j})² / E_{i,j}.

We then obtain the critical value χ²_c using d.f. = (m − 1)(n − 1) and our chosen level of significance α. As before, if χ² > χ²_c, then we reject H_0 and conclude that the variables are in fact statistically dependent.
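The two steps, expected frequencies and then the score, can be sketched as a single function. The table used to exercise it here is a small hypothetical 2×2 example, not the voter data that follows:

```python
# Sketch of the independence test statistic from a contingency table,
# following the formulas above: E_ij = (row i total)(column j total)/N.
def chi_square_independence(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected frequency
            chi_square += (o - e) ** 2 / e
    return chi_square

table = [[10, 20],
         [20, 10]]   # hypothetical 2x2 table
print(chi_square_independence(table))  # 100/15, about 6.667 for this table
```

Note that this is the uncorrected statistic as defined above; R's chisq.test applies a continuity correction to 2×2 tables by default.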
Example

Suppose that 300 voters were surveyed and classified according to gender and political affiliation: Democrat, Republican, or Independent. The contingency table for these classifications is as follows:

Gender    Democrat   Republican   Independent   Total
Female    68         56           32            156
Male      52         72           20            144
Total     120        128          52            300

That is, 68 of the voters are female and Democrat, 72 of the voters are male and Republican, and so on. The entry in row i and column j is the observation O_i,j.
Computing Expected Frequencies

Let G_i be the event that the voter is of the gender for row i, i = 1, 2, and let A_j be the event that the voter's affiliation corresponds to column j, j = 1, 2, 3. Then, we compute the expected frequencies as follows:

(i, j)   G_i, A_j               E_i,j
(1, 1)   Female, Democrat       (156)(120)/300 = 62.4
(1, 2)   Female, Republican     (156)(128)/300 = 66.56
(1, 3)   Female, Independent    (156)(52)/300 = 27.04
(2, 1)   Male, Democrat         (144)(120)/300 = 57.60
(2, 2)   Male, Republican       (144)(128)/300 = 61.44
(2, 3)   Male, Independent      (144)(52)/300 = 24.96
The Test Statistic

Then, the test statistic is

χ² = ∑_{i=1}^2 ∑_{j=1}^3 (O_{i,j} − E_{i,j})² / E_{i,j}
   = (68 − 62.4)²/62.4 + (56 − 66.56)²/66.56 + (32 − 27.04)²/27.04 + (52 − 57.60)²/57.60 + (72 − 61.44)²/61.44 + (20 − 24.96)²/24.96
   = 6.433.

We compare this value against the critical value χ²_c, with degrees of freedom d.f. = (2 − 1)(3 − 1) = 2 and significance level 0.05. Since this value is χ²_c = 5.991, and χ² > χ²_c, we reject the null hypothesis that gender and political affiliation are independent.
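The worked computation can be verified directly with a short Python sketch:

```python
# Verify the worked example: chi-square score for the voter data.
table = [[68, 56, 32],    # Female: Democrat, Republican, Independent
         [52, 72, 20]]    # Male:   Democrat, Republican, Independent
row_totals = [sum(row) for row in table]           # 156, 144
col_totals = [sum(col) for col in zip(*table)]     # 120, 128, 52
total = sum(row_totals)                            # 300

chi_square = sum((table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
                 / (row_totals[i] * col_totals[j] / total)
                 for i in range(2) for j in range(3))
print(round(chi_square, 3))  # 6.433, matching the score computed above
```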
Independence Test in R

> M=matrix(c(68,52,56,72,32,20),nrow=2,ncol=3)
> chisq.test(M)

        Pearson's Chi-squared test

data:  M
X-squared = 6.4329, df = 2, p-value = 0.0401