Department of Economics. Business Statistics. Chapter 12 Chi-square test of independence & Analysis of Variance ECON 509. Dr.

Department of Economics Business Statistics Chapter 1 Chi-square test of independence & Analysis of Variance ECON 509 Dr. Mohammad Zainal

Chapter Goals After completing this chapter, you should be able to: Set up a contingency analysis table and perform a chisquare test of independence Recognize situations in which to use analysis of variance Perform a single-factor hypothesis test and interpret results

Contingency Tables Contingency Tables Situations involving multiple population proportions Used to classify sample observations according to two or more characteristics Also called a crosstabulation table. Male Female Smoke 150 0 170 Don t Smoke 70 160 30 0 180 400

Contingency Tables It can be of any size: x3, 3x, 3x3, or 4x The first digit refers to the number of rows, and the second refers to the number of columns. In general, R C table contains R rows and C columns. The number of cells in a contingency table is obtained by multiplying the number of rows by the number of columns.

Contingency Tables We may want to know if there is an association between being a male or female and smoking or not. The null hypothesis that the two characteristics of the elements of a given population are not related (independent) Against the alternative hypothesis that the two characteristics are related (dependent). Degrees of freedom for the test are: df = (R-1)(C-1)

Contingency Table Example Left-Handed vs. Gender Dominant Hand: Left vs. Right Gender: Male vs. Female H 0 : Hand preference is independent of gender H A : Hand preference is not independent of gender

Contingency Table Example Sample results organized in a contingency table: (continued) sample size = n = 300: 10 Females, 1 were left handed 180 Males, 4 were left handed Hand Preference Gender Left Right Female 1 108 10 Male 4 156 180 36 64 300

Logic of the Test H 0 : Hand preference is independent of gender H A : Hand preference is not independent of gender If H 0 is true, then the proportion of left-handed females should be the same as the proportion of left-handed males The two proportions above should be the same as the proportion of left-handed people overall

Finding Expected Frequencies 10 Females, 1 were left handed 180 Males, 4 were left handed Overall: P(Left Handed) = 36/300 =.1 If independent, then P(Left Handed Female) = P(Left Handed Male) =.1 So we would expect 1% of the 10 females and 1% of the 180 males to be left handed i.e., we would expect (10)(.1) = 14.4 females to be left handed (180)(.1) = 1.6 males to be left handed

Expected Cell Frequencies Expected cell frequencies: (continued) e ij (i th th Row total)(j Columntotal) Total samplesize Example: e 11 (10)(36) 300 14.4

Observed vs. Expected Frequencies Observed frequencies vs. expected frequencies: Gender Female Hand Preference Left Right Observed = 1 Observed = 108 Expected = 14.4 Expected = 105.6 10 Male Observed = 4 Expected = 1.6 Observed = 156 Expected = 158.4 180 36 64 300

The Chi-Square Test Statistic The Chi-square contingency test statistic is: r i1 c j1 (o ij e e ij ij ) with d.f. (r 1)(c 1) where: o ij = observed frequency in cell (i, j) e ij = expected frequency in cell (i, j) r = number of rows c = number of columns

Observed vs. Expected Frequencies Gender Female Male Hand Preference Left Right Observed = 1 Observed = 108 Expected = 14.4 Expected = 105.6 Observed = 4 Observed = 156 Expected = 1.6 Expected = 158.4 10 180 36 64 300 (1 14.4) 14.4 (108105.6) 105.6 (4 1.6) 1.6 (156158.4) 158.4 0.6848

Contingency Analysis 0.6848 with d.f. (r -1)(c -1) (1)(1) 1 Decision Rule: If > 3.841, reject H 0, otherwise, do not reject H 0 = 0.05.05 = 3.841 Do not reject H 0 Reject H 0 Here, = 0.6848 < 3.841, so we do not reject H 0 and conclude that gender and hand preference are independent

Problem A random sample of 300 adults was selected, and they were asked if they favor giving more freedom to schoolteachers to punish students for violence and lack of discipline. Based on the results of the survey, the two-way classification of the responses of these adults is presented in the following table In Favor (F) Against (A) No Opinion (N) Men (M) 93 70 1 Women (W) 87 3 6

Step 1. State the null and alternative hypotheses Ho: the two attributes are independent. H 1 : these attributes are dependent. Step. Select the distribution to use. We use the chi-square distribution to make a test of independence for a contingency table.

Step 3. Determine the rejection and nonrejection regions. The area of the rejection region is.01, and it falls in the right tail of the chi-square distribution curve. The degrees of freedom are df = (-1)(3-1) =

e ij (i th th Row total)(j Columntotal) Total samplesize In Favor (F) Against (A) No Opinion(N) Row total Men (M) 93 70 1 175 105 59.5 10.5 Women (W) 87 3 6 15 75 4.5 7.5 Column Total 180 10 18 300

Step 4. Calculate the value of the test statistic. Step 5. Make a decision. The value of the test statistic χ = 8.5 is less than the critical value of χ = 9.10, and it falls in the nonrejection region. So, we fail to reject Ho

Chapter Overview Analysis of Variance (ANOVA) One-Way ANOVA F-test Tukey- Kramer test Randomized Complete Block ANOVA F-test Fisher s Least Significant Difference test Two-factor ANOVA with replication

General ANOVA Setting Investigator controls one or more independent variables Called factors (or treatment variables) Each factor contains two or more levels (or categories/classifications) Observe effects on dependent variable Response to levels of independent variable Experimental design: the plan used to test hypothesis

One-Way Analysis of Variance Evaluate the difference among the means of three or more populations Examples: Accident rates for 1 st, nd, and 3 rd shift Expected mileage for five brands of tires Assumptions Populations are normally distributed Populations have equal variances Samples are randomly and independently drawn

Completely Randomized Design Experimental units (subjects) are assigned randomly to treatments Only one factor or independent variable With two or more treatment levels Analyzed by One-factor analysis of variance (one-way ANOVA) Called a Balanced Design if all factor levels have equal sample size

Hypotheses of One-Way ANOVA H0 : μ1 μ μ3 μk H A All population means are equal i.e., no treatment effect (no variation in means among groups) :Not allof the populationmeansare the same At least one population mean is different i.e., there is a treatment effect Does not mean that all population means are different (some pairs may be the same)

One-Factor ANOVA H H 0 : μ1 μ μ3 μk A :Not allμ i are the same All Means are the same: The Null Hypothesis is True (No Treatment Effect) μ 1 μ μ3

One-Factor ANOVA H H 0 : μ1 μ μ3 μk A :Not allμ i are the same (continued) At least one mean is different: The Null Hypothesis is NOT true (Treatment Effect is present) or μ 1 μ μ3 μ1 μ μ3

Partitioning the Variation Total variation can be split into two parts: SST = SSB + SSW SST = Total Sum of Squares SSB = Sum of Squares Between SSW = Sum of Squares Within

Partitioning the Variation SST = SSB + SSW (continued) Total Variation (SST) = the aggregate dispersion of the individual data values across the various factor levels Between-Sample Variation (SSB) = dispersion among the factor sample means Within-Sample Variation (SSW) = dispersion that exists among the data values within a particular factor level

Partition of Total Variation Total Variation (SST) Variation Due to Factor (SSB) = + Commonly referred to as: Sum of Squares Between Sum of Squares Among Sum of Squares Explained Among Groups Variation Variation Due to Random Sampling (SSW) Commonly referred to as: Sum of Squares Within Sum of Squares Error Sum of Squares Unexplained Within Groups Variation

Total Sum of Squares SST = SSB + SSW Where: SST k n i1 j1 i (x ij x) SST = Total sum of squares k = number of populations (levels or treatments) n i = sample size from population i x ij = j th measurement from population i x = grand mean (mean of all data values)

Total Variation (continued) SST (x 11 x) (x 1 x)... (x kn k x) Response, X x Group 1 Group Group 3

Sum of Squares Between SST = SSB + SSW Where: SSB k n i(x i1 x) SSB = Sum of squares between k = number of populations n i = sample size from population i x i = sample mean from population i x = grand mean (mean of all data values) i

Between-Group Variation SSB k n i(x i1 Variation Due to Differences Among Groups i x) SSB MSB k 1 Mean Square Between = SSB/degrees of freedom i j

Between-Group Variation (continued) SSB n (x 1 1 x) n (x x)... n k (x k x) Response, X x x1 x 3 x Group 1 Group Group 3

Sum of Squares Within SST = SSB + SSW Where: SSW k i1 n j j1 (x SSW = Sum of squares within k = number of populations n i = sample size from population i x i = sample mean from population i x ij = j th measurement from population i ij x i )

Within-Group Variation SSW k i1 j1 Summing the variation within each group and then adding over all groups n j (x ij x i ) MSW SSW n k T Mean Square Within = SSW/degrees of freedom i

Within-Group Variation (continued) SSW (x 11 x 1 ) (x 1 x )... (x kn k x k ) Response, X x 3 x 1 x Group 1 Group Group 3

One-Way ANOVA Table Source of Variation SS df MS F ratio Between Samples Within Samples SSB SSB k - 1 MSB = k - 1 SSW n T - k MSW = Total SST = n T - 1 SSB+SSW SSW n T - k F = MSB MSW k = number of populations n T = sum of the sample sizes from all populations df = degrees of freedom

One-Factor ANOVA F Test Statistic H 0 : μ 1 = μ = = μ k H A : At least two population means are different Test statistic F MSB MSW Degrees of freedom MSB is mean squares between variances MSW is mean squares within variances df 1 = k 1 (k = number of populations) df = n T k (n T = sum of sample sizes from all populations)

Interpreting One-Factor ANOVA F Statistic The F statistic is the ratio of the between estimate of variance and the within estimate of variance The ratio must always be positive df 1 = k -1 will typically be small df = n T - k will typically be large The ratio should be close to 1 if H 0 : μ 1 = μ = = μ k is true The ratio will be larger than 1 if H 0 : μ 1 = μ = = μ k is false

One-Factor ANOVA F Test Example You want to see if three different golf clubs yield different distances. You randomly select five measurements from trials on an automated driving machine for each club. At the.05 significance level, is there a difference in mean distance? Club 1 Club Club 3 54 34 00 63 18 41 35 197 37 7 06 51 16 04

One-Factor ANOVA Example: Scatter Diagram Club 1 Club Club 3 54 34 00 63 18 41 35 197 37 7 06 51 16 04 x1 49. x 6.0 x3 x 7.0 05.8 Distance 70 60 50 40 30 0 10 00 190 x 1 x x 3 x 1 3 Club

One-Factor ANOVA Example Computations Club 1 Club Club 3 54 34 00 63 18 41 35 197 37 7 06 51 16 04 x 1 = 49. x = 6.0 x 3 = 05.8 x = 7.0 n 1 = 5 n = 5 n 3 = 5 n T = 15 k = 3 SSB = 5 [ (49. 7) + (6 7) + (05.8 7) ] = 4716.4 SSW = (54 49.) + (63 49.) + + (04 05.8) = 1119.6 MSB = 4716.4 / (3-1) = 358. MSW = 1119.6 / (15-3) = 93.3 F 358. 93.3 5.75

One-Factor ANOVA Example Solution 0 H 0 : μ 1 = μ = μ 3 H A : μ i not all equal =.05 df 1 = df = 1 Do not reject H 0 Critical Value: F = 3.885 =.05 Reject H 0 F.05 = 3.885 F F = 5.75 Test Statistic: MSB MSW Decision: Reject H 0 at = 0.05 Conclusion: 358. 93.3 5.75 There is evidence that at least one μ i differs from the rest

ANOVA -- Single Factor: Excel Output EXCEL: tools data analysis ANOVA: single factor SUMMARY Groups Count Sum Average Variance Club 1 5 146 49. 108. Club 5 1130 6 77.5 Club 3 5 109 05.8 94. ANOVA Source of Variation Between Groups Within Groups SS df MS F P-value F crit 4716.4 358. 5.75 4.99E-05 3.885 1119.6 1 93.3 Total 5836.0 14

Chapter Summary Described one-way analysis of variance The logic of ANOVA ANOVA assumptions F test for difference in k means The Tukey-Kramer procedure for multiple comparisons Described randomized complete block designs F test Fisher s least significant difference test for multiple comparisons Described two-way analysis of variance Examined effects of multiple factors and interaction

Problems Fifteen fourth-grade students were randomly assigned to three groups to experiment with three different methods of teaching arithmetic. At the end of the semester, the same test was given to all 15 students. The table gives the scores of students in the three groups. At the 1% significance level, can we reject the null hypothesis that the mean arithmetic score of all fourthgrade students taught by each of these three methods is the same? Assume that all the assumptions required to apply the one-way ANOVA procedure hold true. Method I Method II Method III 48 55 84 73 85 68 51 70 95 65 69 74 87 90 67

Step 1. State the null and alternative hypotheses Let μ 1, μ, and μ 3 be the mean arithmetic scores of all fourth-grade students who are taught, respectively, by Methods I, II, and III. The null and alternative hypotheses are: Ho: μ 1 = μ = μ 3 Vs. H1: Not all the means are equal Step. Select the distribution to use. Since we are comparing the means for three normally distributed populations, we use the F distribution to make this test.

Step 3. Determine the rejection and nonrejection regions The significance level is.01. ANOVA test is always right-tailed, the area in the right tail of the F distribution curve is.01 Numerator degrees of freedom = k 1 = 3 1 = Denominator degrees of freedom = n k = 15 3 =

Step 4. Calculate the value of the test statistic. T 1 = 48 + 73 + 51 + 65 + 87 = 34 T = 55 + 85 + 70 + 69 + 90 = 369 T 3 = 84 + 68 + 95 + 74 + 67 = 388 x = T 1 + T + T 3 = 1081 n = 5 + 5 + 5 = 15 x = (48) + (73) + (51) + (74) + (67) = 80,709 SSB 34 369 388 1081 5 SSW 80,709 5 5 15 34 369 388 5 5 5 43.1333 37.8000

MSB SSB 43.1333 16.0667 k 1 31 MSW SSW 37.8000 197.7333 n k 15 3 MSB 16.0667 F MSW 197.7333 Step 5. Make a decision The test statistic F = 1.09 is less than the F critical (6.93) We fail to reject the null hypothesis and conclude that the means of the three populations are equal

For convenience, all these calculations are often recorded in a table called the ANOVA table. Anova: Single Factor SUMMARY Groups Count Sum Average Variance Method I 5 34 64.8 58. Method II 5 369 73.8 194.7 Method III 5 388 77.6 140.3 ANOVA Source of Variation SS df MS F P-value F crit Between Groups 43.1333 16.0667 1.093 0.366 6.97 Within Groups 37.8 1 197.7333 Total 804.9333 14

Copyright The materials of this presentation were mostly taken from the PowerPoint files accompanied Business Statistics: A Decision-Making Approach, 7e 008 Prentice-Hall, Inc.