Chi-Square Analyses Stat 251

Size: px

Start display at page:

Download "Chi-Square Analyses Stat 251"

Penelope Hunter
6 years ago
Views:

1 Chi-Square Analyses Stat 251 While we have analyses for comparing more than 2 means, we cannot use them when trying to compare more than one proportion. However, there is a distribution that is related to the standard normal distribution (z) that works for comparing more than two proportions and can be used to ask some of the following questions about populations: Do the data follow a specified distribution? Are the categories independent of each other? Are the probabilities distributed the same over all categories? Rather than a test statistic for each pair of proportions, we d rather like to use just one. What we do is measure the distance each sample value is from the average (from the norm ). If we had a z-score for each pair, the sum of the squared z-scores would be a new (new to you) distribution called Chi-square (pronounced ky as in sky ), denoted by χ 2. The distribution is a skewed distribution (skewed right) so it is not a symmetric distribution like z or t. χ 2 n 1 = n zi 2 = zi 2 + z zn 2 i The following graph will show how the χ 2 distribution changes shape with increasing df. x=rchisq(1000,df=2); hist(x,prob=t,ylim=c(0,.5),xlim=c(0,15),main='') title('chi-square Distributions \nwith 3 different df') curve(dchisq(x,df=2),add=t,lwd=3,col="blue") curve(dchisq(x,df=4),add=t,lwd=3,col="red") curve(dchisq(x,df=9),add=t,lwd=3,col='green') dfs=c("df = 2 ","df = 4 ","df = 9 "); colours=c('blue','red','green') legend('topright',legend=dfs,lwd=c(3,3,3),col=colours) Chi Square Distributions with 3 different df Density df = 2 df = 4 df = x 1

2 Assumptions of any Chi-square test 1. The data must be counts from categories 2. Independence of observations 3. E i 5; each individual expected value (E i ) must be at least 5 Goodness-of-Fit (GoF) Chi-square for a one-way table (a table that has categories and counts for each category): In evaluating whether there is sufficient evidence that a set of observed counts, O 1, O 2,, O k in k categories are unusually different from what would be expected under a null hypothesis. The expected values under the null hypothesis, called E 1, E 2,..., E k. The hypotheses for GoF test are: Null hypothesis H 0 : p 1 = p 01, p 2 = p 02,, p k = p 0k Or H 0 : p 1 = p 2 = = p k = p 0 Or H 0 : The data follows theoretical distribution Alternative hypothesis H a : At least one p i differs from theoretical probability Or Expected values H a : The data does not follow theoretical distribution E i = np i You will need to find the probabilities associated with the null hypothesized distribution, then multiple each category value by the probability to get the expected value. Test statistic Rejection region Reject H 0 if χ 2 calc χ2 α,df χ 2 = (O i E i ) 2 = (observed expected) 2 E i expected where df = k 1 (k = number of categories). Conclusion (in context) When the null hypothesis is rejected, in terms of the context of the data, it means that we think that the data does not follow the theoretical (specified) distribution. When we fail to reject the null hypothesis, we are maintaining that the data does follow the theoretical (specified) distribution. 2

3 GoF example A paper 1 about an experiment reported the following data on phenotypes resulting from crossing tall cut-leaf tomatoes with dwarf potato-leaf tomatoes. We wish to investigate if the frequencies below are consistent with Mendel s laws of inheritance which implies that the phenotypes should occur in a 9:3:3:1 ratio. Tall cut-leaf Tall potato-leaf Dwarf cut-leaf Dwarf potato-leaf Ratio Count A 9:3:3:1 ratio means the probabilities are 9/16, 3/16 ( 2) and 1/16 (the 16 comes from the sum of all the numbers in the ratio). Is there sufficient evidence that the tomato plants follow Mendel s Law? Keep in mind that when you have to do this in R, I have provided the code to get the data in so make sure you pay most attention to parts of code that have the chisq.test() commands and examples obs <- c(926,288,293,104) types <- c("tall cut","tall potato","dwarf cut","dwarf potato") rnames=c("obs","exp") k <- length(types) mendel <- c(9,3,3,1); mendel [1] probs <- mendel/sum(mendel); probs [1] n <- sum(obs); n [1] 1611 exp <- round(n*probs,2); exp [1] tomatoes <- data.frame(types,obs,exp,probs); tomatoes types obs exp probs 1 Tall cut Tall potato Dwarf cut Dwarf potato Linkage Studies of the Tomato (Transactions of the Royal Canadian Institute, 1931: 1-19) 3

4 # barplot of data (barplot since data is categorical) barplot(obs,names.arg=types,col=rainbow(4),ylab="observed",ylim=c(0,1000)) title('barplot of Tomato Plant \ndihybrids (obs)') Barplot of Tomato Plant dihybrids (obs) Observed Tall cut Dwarf cut barplot(exp,names.arg=types,col=rainbow(4),ylab="expected",ylim=c(0,1000)) title('barplot of Tomato Plant \ndihybrids (exp)') Barplot of Tomato Plant dihybrids (exp) Expected Tall cut Dwarf cut 4

5 # barplot with obs and exp counts <- rbind(obs, exp); colnames(counts) <- types colors=c("cornflowerblue","darkorchid4") barplot(counts,col=colors,legend=rnames,beside=t,ylim=c(0,1000)) title('barplot of Tomato Plant \ndihybrids Obs and Exp') Barplot of Tomato Plant dihybrids Obs and Exp obs exp Tall cut Dwarf cut Hypotheses H 0 : p 1 = 9 16, p 2 = p 3 = 3 16, p 4 = 1 16 or H 0: data follows Mendel s Law H a : At least one p i differs or H a : data does not follow Mendel s Law Expected values E 1 = 1611(9/16) = E 2 = 1611(3/16) = E 3 = 1611(3/16) = E 4 = 1611(1/16) = Here we can check to see all E i 5 Test Statistic E i = n(p i ) χ 2 = (O i E i ) 2 = (observed expected) 2 E i expected = ( ) ( ) ( ) ( ) = df = k 1 where k = 4 so df = 4 1 = 3 Rejection Region 1. Critical Value approach: Reject H 0 if χ 2 calc χ2 α,df where χ2 α,df = χ2.05,3 = pvalue approach: Reject H 0 if pvalue α Results We are doing the critical value approach in the by-hand example. So χ 2.05,3 = so we will fail to reject H 0. Conclusion (in context) We failed to reject H 0 so that tells us that the plants do follow Mendel s law (maintaining the 9:3:3:1 ratio). 5

6 GoF in R The command to use is: chisq.test(x,p=probs,simulate.p.value=f) Where: x: x is either a vector of counts or a minimum of at least a 2X2 table p=probs: the probabilities of the theoretical (specified) distribution simulate.p.value: F (default); T when E i 5 assumption is not met # HERE!!! Pay attention here!!! # Seriously, this part below is what you need after you # get the data in and change names of objects (dataset) # with chisq.test() GoF=chisq.test(obs,p=probs); GoF Chi-squared test for given probabilities data: obs X-squared = , df = 3, p-value = GoF$expected [1] # long way without chisq.test() chisq.calc <- sum((obs-exp)^2/exp); chisq.calc [1] df=k-1 pvalue <- 1-pchisq(chisq.calc,3); pvalue [1] Since the pvalue = α(0.05), we fail to reject H 0 (see conclusion in context from by-hand example). Test of Independence The test of Independence explores whether two categorical random variables are independent or whether some level of dependency existst between them. Each dataset will be constructed into a table with I rows and J columns. Let n ij denote the number of individuals in the sample falling in the (i, j) th cell (of row i, column j) of the table. The following is a prototype of a general table that displays the counts (n ij ) and is called a two-way contingency table. I and J (capital I,J) are the row and column totals, respectively. 6

7 j... J 1 n 11 n n 1j... n 1J 2 n i n i1... n ij.... I n I1... n IJ The hypotheses for Independence test are: Null hypothesis H 0 : p ij = (p i )(p j ) i = 1, 2,..., I and j = 1, 2,..., J Or H 0 : The row context and column context are independent Alternative hypothesis H a : H 0 is not true Expected values E ij = n in j n = (rtotal)(ctotal) grandtotal Test statistic n j n i χ 2 = i=1 j=1 (O ij E ij ) 2 E ij = all r c (observed expected) 2 expected = (O ij E ij ) 2 E ij Rejection region Reject H 0 if χ 2 calc χ2 α,df where df = (r 1)(c 1) (r = number of rows, c = number of columns). Conclusion (in context) When the null hypothesis is rejected, in terms of the context of the data, it means that we think that the context of the rows and context of the columns are dependent (there is a dependency). When we fail to reject the null hypothesis, we are maintaining that the context of the rows and context of the columns are dependent (there is no relationship). Independence Test example A study of the relationship between facility conditions at gasoline stations and aggressiveness in the pricing of gasoline 2 reports the accompanying data based on a sample of n = 441 stations. Does the data suggest that facility conditions and pricing policy are independent of one another? gas <- as.table(rbind(c(24,15,17), c(52,73,80),c(58,86,36))) cond=c("substandard","standard","modern"); price=c("aggressive","neutral", "Nonaggressive") dimnames(gas) <- list(condition=cond,pricing.policy=price); gas 2 An Analysis of Price Aggressiveness in Gasoline Marketing, (Journal Marketing Research, 1970: 36-42) 7

8 Pricing.Policy Condition Aggressive Neutral Nonaggressive Substandard Standard Modern gaspr=as.matrix(gas) # row totals n1j=sum(gas[1,]); n2j=sum(gas[2,]); n3j=sum(gas[3,]); n123j=c(n1j,n2j,n3j) # column totals ni1=sum(gas[,1]); ni2=sum(gas[,2]); ni3=sum(gas[,3]); ni123=c(ni1,ni2,ni3) # grand total n=sum(gas) # Expected value for row1,col1 exp11 <- round((n1j*ni1)/n,2); exp11 [1] exp=round((ni123 %*% t(n123j))/n,2) dimnames(exp) <- list(condition=cond,pricing.policy=price) expt=matrix(exp,ncol=3,byrow=t) # barplot of data (barplot since data is categorical) barplot(gas,names.arg=rownames(gas),col=rainbow(9),ylab="observed",ylim=c(0,225)) title('condition by Pricing (obs)') Condition by Pricing (obs) Observed Substandard Modern 8

9 barplot(exp,names.arg=rownames(gas),col=rainbow(9),ylab="expected",ylim=c(0,225)) title('condition by Pricing (exp)') Condition by Pricing (exp) Expected Substandard Modern # barplot with obs and exp counts <- rbind(gaspr,expt) barplot(counts,legend=rownames(gas),beside=t,ylim=c(0,100)) title('condition by Pricing Obs and Exp') Condition by Pricing Obs and Exp Substandard Standard Modern Aggressive Neutral Null Hypothesis H 0 : p ij = (p i )(p j ) i = 1, 2,..., I and j = 1, 2,..., J (pricing and conditions in gasoline marketing are independent) Alternative hypothesis H a : H 0 is not true (pricing and conditions in gasoline are dependent) Expected values E ij = n in j n = (rtotal)(ctotal) grandtotal E 11 = (56 134/441) = E 12 = (56 174/441) =

10 E 13 = (56 133/441) = E 21 = ( /441) = 22.1 E 22 = ( /441) = E 23 = ( /441) = E 31 = ( /441) = E 32 = ( /441) = E 33 = ( /441) = Here we can check to see all E ij 5 Test Statistic (same as the Independence Test Statistic calculation) n j n i χ 2 = i=1 j=1 (O ij E ij ) 2 E ij = all r c (observed expected) 2 expected = (O ij E ij ) 2 E ij = ( ) ( ) ( ) = df = (r 1)(c 1) where r, c = (3, 3) so df = (3 1)(3 1) = 4 Rejection Region 1. Critical Value approach: Reject H 0 if χ 2 calc χ2 α,df where χ2 α,df = χ2.05,4 = pvalue approach: Reject H 0 if pvalue α Results We are doing the critical value approach in the by-hand example. So χ 2.05,4 = so we will reject H 0. Conclusion (in context) We rejected H 0 so that tells us that knowledge of a stations s pricing policy does give information about the condition of facilities at the station. Stations with an aggressive pricing policy appear to have more substandard facilities than stations with a neutral or nonaggressive policy. # HERE!!! Pay attention here!!! # Seriously, this part below is what you need after you # get the data in and change names of objects (dataset) # with chisq.test() indep=chisq.test(gas); indep Pearson's Chi-squared test data: gas X-squared = , df = 4, p-value = # long way without chisq.test() chisq.calc <- round(sum((gaspr-expt)^2/expt),3); chisq.calc [1] pvalue <- 1-pchisq(chisq.calc,4); pvalue [1]

11 indep$expected Pricing.Policy Condition Aggressive Neutral Nonaggressive Substandard Standard Modern Since the pvalue = α(0.05), we reject H 0 (see conclusion in context from by-hand example). Homogeneous Test We are assuming that each individual in every one of the I populations belongs in exactly one of J categories. The hypotheses for Homogeneous test are: Null hypothesis H 0 : p 1j = p 2j =... = p Ij j=1,2,...,j, i=1,2,...,i Or Alternative hypothesis Expected values Same as Independence Test Test statistic Same as Independence Test Rejection region Same as Independence Test H 0 : The row context is distributed homogeneously (the same) over the column context H a : H 0 is not true Conclusion (in context) When the null hypothesis is rejected, in terms of the context of the data, it means that we think that the context of the rows are distributed differently across the context of the columns. When we fail to reject the null hypothesis, we are maintaining that the context of the rows are distributed similarly across the context of the columns. Homogeneous example A company packages a particular product in cans of three different sizes, each one using a different production line. Most cans conform to specifications, but a quality control engineer has identified blemish on a can, crack in the can, improper pull tab location, pull tab missing or some other as main reasons for nonconformities. A sample of nonconforming units is selected from each of the three lines, and each unit is categorical according to reason for nonconformity. Production line Blemish Crack Location Missing Other

12 Do the data suggest that the proportions failing in the various nonoconformance categories are not the same for the three lines? defects <- as.table(rbind(c(34,65,17,21,13), c(23,52,25,19,6),c(32,28,16,14,10))) prodl=1:3; nonconf=c("blemish","crack", "Location","Missing","Other") dimnames(defects) <- list(productionline=prodl,nonconformity=nonconf) defects Nonconformity ProductionLine Blemish Crack Location Missing Other fail=as.matrix(defects) # row totals; n1j=sum(defects[1,]); n2j=sum(defects[2,]); n3j=sum(defects[3,]); n123j=c(n1j,n2j,n3j) # column totals and grand total ni1=sum(defects[,1]); ni2=sum(defects[,2]); ni3=sum(defects[,3]); ni4=sum(defects[,4]) ni5=sum(defects[,5]); ni12345=c(ni1,ni2,ni3,ni4,ni5); n=sum(defects) # Expected value for row1,col1 exp11 <- round((n1j*ni1)/n,2); exp11 [1] 35.6 # barplot of data (barplot since data is categorical); first do data prep for barplots expij=round((ni12345 %*% t(n123j))/n,2) exp=matrix(expij,ncol=5,byrow=t) dimnames(exp) <- list(production.line=prodl,nonconformity=nonconf) barplot(defects,names.arg=colnames(defects),col=rainbow(15),ylab="observed",ylim=c(0,175)) title('production Line Nonconformities \n(obs)') Production Line Nonconformities (obs) Observed Blemish Location Other 12

13 barplot(exp,names.arg=colnames(defects),col=rainbow(15),ylab="expected",ylim=c(0,160)) title('production Line Nonconformities \n(exp)') Production Line Nonconformities (exp) Expected Blemish Location Other # barplot with obs and exp counts <- rbind(fail,exp) barplot(counts,legend=rownames(defects),beside=t,ylim=c(0,75)) title('production Line Nonconformities \n Obs and Exp') Production Line Nonconformities Obs and Exp Blemish Location Other 13

14 Hypotheses Null hypothesis H 0 : p 1j = p 2j =... = p Ij j=1,2,...,j, i=1,2,...,i Or H 0 : Blemish types have a homogeneous distribution over the prouction lines (no one production line has more types of blemishes than any other production line) Alternative hypothesis Expected values E 11 = (158)(97) = 35.6 E 12 = (158)(145) = 58 E 13 = (158)(58) = 23.2 E 14 = (158)(54) = 21.6 E 15 = (158)(29) = 11.6 E 21 = (125)(97) = E 22 = (125)(145) = E 23 = (125)(58) = E 24 = (125)(54) = 18 E 25 = (125)(29) = 9.67 E 31 = (100)(97) = E 32 = (100)(145) = E 33 = (100)(58) = E 34 = (100)(54) = 14.4 E 35 = (100)(29) = 7.73 Here we can check to see all E ij 5 Test Statistic H a : H 0 is not true E ij = (n i)(n j ) n = (rtotal)(ctotal) grandtotal n j n i χ 2 = i=1 j=1 (O ij E ij ) 2 E ij = all r c (observed expected) 2 expected = (O ij E ij ) 2 E ij = ( ) (65 58) ( ) = df = (r 1)(c 1) where r, c = (3, 5) so df = (3 1)(5 1) = 2 4 = 8 Rejection Region 1. Critical Value approach: Reject H 0 if χ 2 calc χ2 α,df where χ2 α,df = χ2.05,8 = pvalue approach: Reject H 0 if pvalue α Results We are doing the critical value approach in the by-hand example. So χ 2.05,8 = so we will fail to reject H 0. Conclusion (in context) We failed to reject H 0 so that tells us that the production lines have the same rates of different types of nonconformities, so no one production line produces more of a specific type of nonconformity. 14

15 # HERE!!! Pay attention here!!! # Seriously, this part below is what you need after you # get the data in and change names of objects (dataset) # with chisq.test() hgeneous=chisq.test(defects); hgeneous Pearson's Chi-squared test data: defects X-squared = , df = 8, p-value = hgeneous$expected; exp Nonconformity ProductionLine Blemish Crack Location Missing Other Nonconformity Production.Line Blemish Crack Location Missing Other # long way without chisq.test() chisq.calc <- round(sum((fail-exp)^2/exp),3); chisq.calc [1] pvalue <- 1-pchisq(chisq.calc,8); pvalue [1]

Chi-square (χ 2 ) Tests

Math 442 - Mathematical Statistics II April 30, 2018 Chi-square (χ 2 ) Tests Common Uses of the χ 2 test. 1. Testing Goodness-of-fit. 2. Testing Equality of Several Proportions. 3. Homogeneity Test. 4.