Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

Introduction to Statistical Data Analysis
Lecture 7: The Chi-Square Distribution
James V. Lambers
Department of Mathematics
The University of Southern Mississippi

Introduction
In this lecture, we will use hypothesis testing for new purposes: to determine whether a given data set follows a specific probability distribution, and to determine whether two random variables are statistically independent.

Review of Data Measurement Scales
Recall from Lecture 1 that there are four data measurement scales: nominal, ordinal, interval, and ratio. The hypothesis testing techniques presented in Lecture 6 only apply to the scales that are more quantitative, interval and ratio. Now, though, we can use hypothesis testing for data measured in nominal or ordinal scales as well. This is because we are working with frequency distributions, which can be constructed from any data set, regardless of its measurement scale.

The chi-square goodness-of-fit test uses a sample to determine whether the frequency distribution of the population conforms to a particular probability distribution that it is believed to follow.

Example
Suppose that a six-sided die is rolled 150 times, and the result of each roll is recorded. The number of rolls that are a 1, 2, 3, 4, 5, or 6 should follow a uniform distribution. A chi-square goodness-of-fit test can be used to compare the observed number of rolls for each value, from 1 to 6, to the expected number of rolls for each value, which is 150/6 = 25.

For the chi-square goodness-of-fit test, the null hypothesis H_0 is that the population does follow the predicted distribution, and the alternative hypothesis H_1 is that it does not.

The chi-square goodness-of-fit test works with two frequency distributions that have the same classes, with frequencies denoted by {O_i} and {E_i}, respectively. Each frequency O_i is the actual number of observations from the sample that belong to the ith class. Each frequency E_i is the expected number of observations that should belong to class i, assuming H_0 is true. It is essential that the total number of observations is the same in both frequency distributions; that is,

\sum_{i=1}^n O_i = \sum_{i=1}^n E_i,

where n is the number of classes.

The test statistic for the chi-square goodness-of-fit test, also known as the chi-square score, is given by

\chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i},

where, as before, n is the number of classes.

Once we have computed the test statistic, we compare it against the critical value χ²_c, which can be obtained in one of two ways: it can be looked up in a table of right-tail areas for the chi-square distribution, with degrees of freedom d.f. = n − 1 and chosen significance level α, or one can use the R function qchisq with first parameter 1 − α and second parameter d.f. = n − 1. The qchisq function works with left-tail areas (it returns the value whose left-tail area equals its first parameter), in contrast to the table given in Appendix A, which is why 1 − α is given as the first parameter instead of α. If the chi-square score χ² is greater than this critical value χ²_c, then we reject H_0; otherwise we do not reject H_0. Because the test statistic and critical value are always positive, and only large values of χ² indicate disagreement between observed and expected frequencies, the chi-square goodness-of-fit test is always a one-tail test.
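To make the procedure concrete, here is a small R sketch applying it to the die-rolling example from earlier. The observed counts are hypothetical, chosen only for illustration; everything else follows the formulas above.

# Goodness-of-fit test for a six-sided die rolled 150 times (hypothetical observed counts).
observed <- c(22, 28, 31, 19, 24, 26)                 # hypothetical counts for faces 1 through 6
expected <- rep(150/6, 6)                             # 25 rolls expected per face under H0
chi2 <- sum((observed - expected)^2 / expected)       # chi-square score
crit <- qchisq(1 - 0.05, df = length(observed) - 1)   # critical value, alpha = 0.05, d.f. = 5
chi2 > crit                                           # TRUE would mean reject H0
chisq.test(observed)                                  # built-in version; equal probabilities are the default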

The chi-square distribution is of a very different character than the other distributions that we have seen. If Z_1, Z_2, ..., Z_n are independent standard normal random variables, then the random variable Q defined by

Q = \sum_{i=1}^n Z_i^2

follows the chi-square distribution with n degrees of freedom. It is not symmetric; rather, its values are skewed toward zero, which is the leftmost value of the distribution. However, as the number of degrees of freedom (d.f.) increases, the distribution becomes more symmetric.
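This definition can be checked empirically. The short R sketch below (not part of the original slides, with an arbitrary choice of n) simulates sums of squared standard normals and compares them to the chi-square distribution.

# Monte Carlo check: sums of n squared standard normals follow the chi-square(n) distribution.
set.seed(1)                              # for reproducibility
n <- 5                                   # degrees of freedom (arbitrary choice)
q <- replicate(10000, sum(rnorm(n)^2))   # simulated values of Q
mean(q)                                  # should be close to n, the mean of the chi-square distribution
quantile(q, 0.95)                        # should be close to qchisq(0.95, n)
qchisq(0.95, n)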

Characteristics, cont'd
The probability density function for this distribution is

f_n(x) = \frac{1}{2^{n/2} \Gamma(n/2)} x^{n/2 - 1} e^{-x/2},

where n is the degrees of freedom and Γ is the gamma function.
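As a quick sanity check (again, not part of the original slides), this density formula can be compared against R's built-in dchisq; the values of x and n below are arbitrary.

# Evaluate the chi-square density directly from the formula and compare to dchisq().
n <- 4                                   # degrees of freedom (arbitrary choice)
x <- seq(0.5, 10, by = 0.5)              # arbitrary evaluation points
f <- x^(n/2 - 1) * exp(-x/2) / (2^(n/2) * gamma(n/2))
all.equal(f, dchisq(x, df = n))          # should be TRUE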

Suppose a coin is flipped 10 times, and the number of times it comes up heads is recorded. Then, this process is repeated several times, for a total of 100 sequences of 10 flips each. Since coin flips are Bernoulli trials, the number of heads follows a binomial distribution, which yields the expected number of sequences that produce k heads.

Observed and Expected Values

Number of heads   Observed Sequences   Expected Sequences
 0                       1                   0.098
 1                       2                   0.977
 2                       3                   4.395
 3                       9                  11.719
 4                      18                  20.508
 5                      26                  24.609
 6                      21                  20.508
 7                      13                  11.719
 8                       5                   4.395
 9                       2                   0.977
10                       0                   0.098

Performing the Chi-Square Test
Our null hypothesis H_0 is that the number of heads does in fact follow a binomial distribution. The chi-square score is

\chi^2 = \sum_{i=0}^{10} \frac{(O_i - E_i)^2}{E_i}
       = \frac{(1 - 0.098)^2}{0.098} + \frac{(2 - 0.977)^2}{0.977} + \frac{(3 - 4.395)^2}{4.395} + \frac{(9 - 11.719)^2}{11.719}
       + \frac{(18 - 20.508)^2}{20.508} + \frac{(26 - 24.609)^2}{24.609} + \frac{(21 - 20.508)^2}{20.508} + \frac{(13 - 11.719)^2}{11.719}
       + \frac{(5 - 4.395)^2}{4.395} + \frac{(2 - 0.977)^2}{0.977} + \frac{(0 - 0.098)^2}{0.098}
       = 12.274.

And the Verdict is...
This is compared to the critical value χ²_c, with degrees of freedom d.f. = n − 1 = 10, since there are n = 11 classes, and level of significance α = 0.05. We can use the R expression qchisq(1-0.05,10) to obtain χ²_c = 18.307. Since χ² < χ²_c, we do not reject H_0, and conclude that the distribution of the number of heads from each sequence of 10 flips follows a binomial distribution, as expected.
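The same numbers can be reproduced directly from the definition of the chi-square score; here is a brief R sketch using the observed counts from the table above.

# Reproduce the chi-square score and critical-value comparison by hand.
obs <- c(1, 2, 3, 9, 18, 26, 21, 13, 5, 2, 0)          # observed sequences with 0 through 10 heads
expected <- 100 * dbinom(0:10, size = 10, prob = 0.5)  # expected sequences under H0
chi2 <- sum((obs - expected)^2 / expected)             # approximately 12.274
chi2 > qchisq(0.95, df = 10)                           # FALSE, since 12.274 < 18.307: do not reject H0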

Chi-Square Goodness-of-fit Test in R
> obs=c(1,2,3,9,18,26,21,13,5,2,0)
> pexp=dbinom(0:10,10,0.5)
> chisq.test(obs,p=pexp)

Chi-squared test for given probabilities

data: obs
X-squared = 12.2743, df = 10, p-value = 0.2671

The p-value 0.2671 is greater than α = 0.05, which is consistent with our decision not to reject H_0.

Now, we use the chi-square distribution to test whether two given random variables are statistically independent. For this test, the null hypothesis H_0 is that the variables are independent, while the alternative hypothesis H_1 is that they are not.

Contingency Tables
To compute the test statistic, we construct a contingency table, which is a two-dimensional array, or a matrix, in which each cell contains an observed frequency of an ordered pair of values of the two variables. That is, the entry in row i, column j, which we denote by O_{i,j}, contains the number of observations that fall into class i of the first variable and class j of the second. The frequencies in this table are the observed frequencies for the chi-square goodness-of-fit test.

Computing Expected Frequencies
Next, for each row i and each column j, we compute E_{i,j}, which is (sum of entries in row i) × (sum of entries in column j), divided by the total number of observations, to get the expected frequencies for the chi-square goodness-of-fit test.

Relation to Independent Events That is, if the contingency table has m rows and n columns, then ( n ) ( m ) O i,k O l,j E i,j = k=1 m l=1 l=1 n k=1 O l,k It should be noted that this quantity, divided again by the total number of observations, is exactly P(A i )P(B j ), where A i is the event that the first variable falls into class i, and B j is the event that the second variable falls into class j. By the multiplication rule, this probability would equal P(A i B j ) if the variables were independent.. James V. Lambers Statistical Data Analysis 19 / 24

The Test Statistic
Then, the test statistic is

\chi^2 = \sum_{i=1}^m \sum_{j=1}^n \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}.

We then obtain the critical value χ²_c using d.f. = (m − 1)(n − 1) and our chosen level of significance α. As before, if χ² > χ²_c, then we reject H_0 and conclude that the variables are in fact statistically dependent.
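Here is a minimal R sketch of these two formulas, assuming the observed counts are stored in an m-by-n matrix O; the example entries are arbitrary and only for illustration.

# Expected frequencies and test statistic for a chi-square test of independence.
O <- matrix(c(30, 10, 20, 40), nrow = 2)           # arbitrary 2-by-2 table of observed counts
E <- outer(rowSums(O), colSums(O)) / sum(O)        # E[i,j] = (row i total)(column j total)/N
chi2 <- sum((O - E)^2 / E)                         # chi-square statistic
crit <- qchisq(1 - 0.05, df = (nrow(O) - 1) * (ncol(O) - 1))
chi2 > crit                                        # TRUE means reject H0 at alpha = 0.05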

Example
Suppose that 300 voters were surveyed, and classified according to gender and political affiliation: Democrat, Republican, or Independent. The contingency table for these classifications is as follows:

Gender    Democrat   Republican   Independent   Total
Female        68          56            32        156
Male          52          72            20        144
Total        120         128            52        300

That is, 68 of the voters are female and Democrat, 72 of the voters are male and Republican, and so on. The entry in row i and column j is the observation O_{i,j}.

Computing Expected Frequencies
Let G_i be the event that the voter is of the gender for row i, i = 1, 2, and let A_j be the event that the voter's affiliation corresponds to column j, j = 1, 2, 3. Then, we compute the expected frequencies as follows:

(i, j)   G_i, A_j              E_{i,j}
(1, 1)   Female, Democrat      (156)(120)/300 = 62.4
(1, 2)   Female, Republican    (156)(128)/300 = 66.56
(1, 3)   Female, Independent   (156)(52)/300  = 27.04
(2, 1)   Male, Democrat        (144)(120)/300 = 57.60
(2, 2)   Male, Republican      (144)(128)/300 = 61.44
(2, 3)   Male, Independent     (144)(52)/300  = 24.96

The Test Statistic
Then, the test statistic is

\chi^2 = \sum_{i=1}^2 \sum_{j=1}^3 \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}
       = \frac{(68 - 62.4)^2}{62.4} + \frac{(56 - 66.56)^2}{66.56} + \frac{(32 - 27.04)^2}{27.04}
       + \frac{(52 - 57.60)^2}{57.60} + \frac{(72 - 61.44)^2}{61.44} + \frac{(20 - 24.96)^2}{24.96}
       = 6.433.

We compare this value against the critical value χ²_c, with degrees of freedom d.f. = (2 − 1)(3 − 1) = 2 and significance level 0.05. Since this value is χ²_c = 5.991, and χ² > χ²_c, we reject the null hypothesis that gender and political affiliation are independent.

Independence Test in R
> M=matrix(c(68,52,56,72,32,20),nrow=2,ncol=3)
> chisq.test(M)

Pearson's Chi-squared test

data: M
X-squared = 6.4329, df = 2, p-value = 0.0401

Since the p-value 0.0401 is less than α = 0.05, this agrees with our earlier decision to reject H_0.