Chapter 10. Discrete Data Analysis

Similar documents
Topic 21 Goodness of Fit

STAT Chapter 13: Categorical Data. Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure).

ij i j m ij n ij m ij n i j Suppose we denote the row variable by X and the column variable by Y ; We can then re-write the above expression as

TUTORIAL 8 SOLUTIONS #

Sociology 6Z03 Review II

Formulas and Tables by Mario F. Triola

Module 10: Analysis of Categorical Data Statistics (OA3102)

13.1 Categorical Data and the Multinomial Experiment

Chapter 10. Chapter 10. Multinomial Experiments and. Multinomial Experiments and Contingency Tables. Contingency Tables.

11-2 Multinomial Experiment

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

Unit 9: Inferences for Proportions and Count Data

Statistics 3858 : Contingency Tables

Unit 9: Inferences for Proportions and Count Data

10.2: The Chi Square Test for Goodness of Fit

F O R SOCI AL WORK RESE ARCH

Summary of Chapters 7-9

Practice Questions: Statistics W1111, Fall Solutions

Example. χ 2 = Continued on the next page. All cells

GEOMETRIC -discrete A discrete random variable R counts number of times needed before an event occurs

Formulas and Tables. for Elementary Statistics, Tenth Edition, by Mario F. Triola Copyright 2006 Pearson Education, Inc. ˆp E p ˆp E Proportion

15: CHI SQUARED TESTS

Introduction to Survey Analysis!

Statistics for Managers Using Microsoft Excel

Testing Independence

POLI 443 Applied Political Research

Confidence Intervals, Testing and ANOVA Summary

y ˆ i = ˆ " T u i ( i th fitted value or i th fit)

Section 4.6 Simple Linear Regression

Inferences for Proportions and Count Data

Lecture 25: Models for Matched Pairs

Chi-Squared Tests. Semester 1. Chi-Squared Tests

Lecture 41 Sections Mon, Apr 7, 2008

STP 226 ELEMENTARY STATISTICS NOTES

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

Lecture 8: Summary Measures

(x t. x t +1. TIME SERIES (Chapter 8 of Wilks)

Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Hypothesis testing. Anna Wegloop Niels Landwehr/Tobias Scheffer

HYPOTHESIS TESTING: THE CHI-SQUARE STATISTIC

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j.

Categorical Data Analysis Chapter 3

Statistics. Statistics

We know from STAT.1030 that the relevant test statistic for equality of proportions is:

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Formulas and Tables. for Essentials of Statistics, by Mario F. Triola 2002 by Addison-Wesley. ˆp E p ˆp E Proportion.

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

T i t l e o f t h e w o r k : L a M a r e a Y o k o h a m a. A r t i s t : M a r i a n o P e n s o t t i ( P l a y w r i g h t, D i r e c t o r )

Discrete Multivariate Statistics

Central Limit Theorem ( 5.3)

Rama Nada. -Ensherah Mokheemer. 1 P a g e

The goodness-of-fit test Having discussed how to make comparisons between two proportions, we now consider comparisons of multiple proportions.

Lecture 22. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Slides Prepared by JOHN S. LOUCKS St. Edward s s University Thomson/South-Western. Slide

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Formulas and Tables for Elementary Statistics, Eighth Edition, by Mario F. Triola 2001 by Addison Wesley Longman Publishing Company, Inc.

Review of One-way Tables and SAS

Chi-Square. Heibatollah Baghi, and Mastee Badii

1.0 Hypothesis Testing

LOOKING FOR RELATIONSHIPS

The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests

Practice Problems Section Problems

Analysis of data in square contingency tables

Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses.

Multiple comparisons - subsequent inferences for two-way ANOVA

Part 1.) We know that the probability of any specific x only given p ij = p i p j is just multinomial(n, p) where p k1 k 2

probability of k samples out of J fall in R.

Lecture 7: Hypothesis Testing and ANOVA

On Selecting Tests for Equality of Two Normal Mean Vectors

7.2 One-Sample Correlation ( = a) Introduction. Correlation analysis measures the strength and direction of association between

Math 152. Rumbos Fall Solutions to Exam #2

Quantitative Analysis and Empirical Methods

Chi-square (χ 2 ) Tests

Ordinal Variables in 2 way Tables

2.3 Analysis of Categorical Data

Section VII. Chi-square test for comparing proportions and frequencies. F test for means

Probability theory and inference statistics! Dr. Paola Grosso! SNE research group!! (preferred!)!!

Ch. 11 Inference for Distributions of Categorical Data

Power and Sample Size for Log-linear Models 1

Review of Statistics 101

The Multinomial Model

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Sampling, Confidence Interval and Hypothesis Testing

2.6.3 Generalized likelihood ratio tests

Pump failure data. Pump Failures Time

Describing Contingency tables

Chi-square (χ 2 ) Tests

2 Describing Contingency Tables

Evaluation. Andrea Passerini Machine Learning. Evaluation

Some comments on Partitioning

Homework 9 Sample Solution

The Chi-Square Distributions

Correspondence Analysis

Session 3 The proportional odds model and the Mann-Whitney test

The Components of a Statistical Hypothesis Testing Problem

Goodness of Fit Goodness of fit - 2 classes


Transcription:

Chapter 1. Discrete Data Analysis 1.1 Inferences on a Population Proportion 1. Comparing Two Population Proportions 1.3 Goodness of Fit Tests for One-Way Contingency Tables 1.4 Testing for Independence in Two-Way Contingency Tables 1.5 Supplementary Problems NIPRL

1.1 Inferences on a Population Proportion Population Proportion p with characteristic Random sample of size n With characteristic Without characteristic Cell probability p Cell probability 1-p Cell frequency x Cell frequency n-x Sample proportion µ p X ~ B( n, p) µ x p = n E( µ p) = p and Var( µ p) = For large n, µ æ p(1 p) ö p ~ N - ç p, çè n ø µ p- p ~ N(,1) p(1 - p) n p(1 - p) n NIPRL

1.1.1 Confidence Intervals for Population Proportions µ p- p Since ~ N(,1) for large n, two-sided 1- a confidence interval is p(1 - p) n æ µ p(1 - p) µ p(1 - p) ö p Î p z /, p z ç - a + a / çè n n ø µ x µ p(1 - p) where p =, se..( p) =. n n µ µ p (1 - µ p ) 1 xn ( - x ) But it is customary that se..( p) = =. n n n Two-Sided Confidence Intervals for a Population Proportion X ~ Bn (, p) Approximate two-sided 1- a confidence interval for p is æ µ za / xn ( - x) µ za / xn ( - x) ö p Î p, p ç - +. çè n n n n ø The approximation is reasonable as long as both x and n - x are larger than 5. NIPRL 3

1.1.1 Confidence Intervals for Population Proportions Example 55 : Building Tile Cracks Random sample n = 15 of tiles in a certain group of downtown building for cracking. x = 98 are found to be cracked. µ x 98 p = = =.784, z.5 =.576 n 15 99% two-sided confidence interval æ µ za / xn ( - x) µ za / xn ( - x) ö p Î p, p ç - + çè n n n n ø æ.575 98 (15-98).575 98 (15-98) ö =.784,.784 ç - + çè 15 15 15 15 ø = (.588,.98)» (.6,.1) Total number of cracked tiles is between.6 5,, = 3, and.1 5,, = 5, NIPRL 4

1.1.1 Confidence Intervals for Population Proportions µ p - p Since ~ N(,1) for large n, one-sided 1- a confidence interval p(1 - p) n æ µ p(1 - p) ö for a lower bound is p Î p za,1 ç -. çè n ø One-Sided Confidence Intervals for a Population Proportion X ~ Bn (, p) Approximate two-sided 1- a confidence interval for p is æ µ za xn ( - x) ö æ µ za xn ( - x) ö p Î p,1 and p, p - Î -. ç n n è ø çè n n ø The approximation is reasonable as long as both x and n - x are larger than 5. NIPRL 5

1.1. Hypothesis Tests on a Population Proportion Two-Sided Hypothesis Tests H : p = p versus H : p ¹ p A pvalue - = P( X ³ x) if µ p = x / n > p pvalue - = P( X x) if µ p = x / n < p where X ~ Bn (, p ). When np and n(1 - p ) are both larger than 5, a normal approximation can be used to give the p-value of pvalue - = F (- z ) where F ( ) is the standard normal c.d.f. and µ p- p x- np z = =. p(1 - p) np(1 - p) n NIPRL 6

1.1. Hypothesis Tests on a Population Proportion Continuity Correction In order to improve the normal approximation the value in the numerator of the z-statistic x- np -.5 may be used when x- np >.5 - +.5 may x np x np be used when - <.5. A size a hypothesis test accepts H when and rejects H when z z or p - value ³ a / z > z or p - value < a. a / a NIPRL 7

1.1. Hypothesis Tests on a Population Proportion One-Sided Hypothesis Tests for a Population Proportion H : p ³ p versus H : p < p A pvalue - = P( X x) where X ~ Bn (, p ). - The normal approximation to this is x- np + - ( ) and. pvalue = F z z =.5 np (1- p ) - Accept H if z ³ - z and reject H if z < - z. a a For H : p < p versus H : p ³ p, A pvalue - = P( X ³ x) where X ~ Bn (, p ). - The normal approximation to this is x- np -.5 pvalue - = 1 - F ( z) and z =. np (1- p ) a - Accept H if z - z and reject H if z > - z. a NIPRL 8

1.1. Hypothesis Tests on a Population Proportion Example 55 : Building Tile Cracks 1% or more of the building tiles are cracked? H : p ³.1 versus H A : p <.1 n = 15, x = 98, p =.1. z x- np +.5 = = - np (1- p ).5. pvalue - = F (-.5) =.6. z = -.5 NIPRL 9

1.1.3 Sample Size Calculations The most convenient way to assess the amount of precision afforded by a sample size n is to consider L of two-sided confidence interval for p. µ µ p(1 - p) 4 z µ µ a / p(1 - p) L = za / Þ n =. n L Since µ p (1 - µ p ) is unknown, if we take µ p (1 - µ p ) the largest, za / µ 1 n = and p =. L However, if µ p is far from.5, we can take µ p = p* from prior information for p and n 4 z p *(1- p* ).(either p* <.5 or p* >.5) L a / ; NIPRL 1

1.1.3 Sample Size Calculations Example 59 : Political Polling To determine the proportion p of people who agree with the statement The city mayor is doing a good job. within 3% accuracy. (-3% ~ +3%), how many people do they need to poll? Construct a 99% confidence interval for p with a length no large thanl = 6% ( µ p - 3% ~ µ p + 3%) za /.576 Þ n = = = 1843.3 ( a =.1) L.6 Þ representative sample : at least 1844 If "1 respondents, ± 3% sampling error", Þ z = nl = 1.6 = 1.9, F (- 1.9) =.87. Þ a a / ;.57 Þ 95% confidence interval NIPRL 11

1.1.3 Sample Size Calculations Example 55 : Building Tile Cracks Recall a 99% confidence interval for p is (.588,.98). We need to know p within 1% with 99% confidence.( L = %) Based on the above interval, let p* =.1=.5. 4 za / p*(1 - p*) Þ n ; = 597. or about 6. L Þ representative sample : 475 (initial sample = 15) After the secondary sampling, n = 6 andx = 98+ 38 = 46. µ 46 p = =.677 and 6 æ µ za / xn ( - x) µ za / xn ( - x) ö p Î p, p ç - + çè n n n n ø = (.593,.761) NIPRL 1

1. Comparing Two Population Proportions Comparing two population proportions Assume X ~ Bn (, p ), Y ~ B( m, p ) and X ^ Y. A µ x µ y µ µ x y pa = and pb = Þ pa - pb = -. n m n m µ µ µ µ pa(1 - pa) pb(1 - pb ) Var( pa - pb) = Var( pa) + Var( pb) = +. n m ( µ p µ A - pb) - ( pa - p µ µ B ) ( pa - pb )- ( pa - pb) z = =. µ p (1 µ ) µ (1 µ ) ( ) ( ) A - pa pb - p xn- x y m- y B + + 3 3 n m n m If H : p = p, the pooled estimate is A B µ µ µ x + y pa - pb p = and z =. n + m µ µ æ1 1 ö p(1 - p) ç + çè n m ø B NIPRL 13

1..1 Confidence Intervals for the Difference Between Two Population Proportions Confidence Intervals for the Difference Between Two Population Proportions Assume X ~ Bn (, p ), Y ~ B( m, p ) and X ^ Y. Approximate two-sided 1-a A B confidence level confidence interval is æ µ µ µ p µ µ µ µ µ µ µ ö A(1 - pa) pb(1 - pb) pa - pb Î pa - pb - z µ µ pa(1 - pa) pb(1 - pb ) a /, pa pb z + - - a / + n m n m ç è ø æ µ µ xn ( - x) y( m- y) µ µ xn ( - x) y( m- y) ö = pa pb za /, p 3 3 A pb z ç - - + - - a / + 3 3 çè n m n m ø Approximate one-sided 1-a confidence level confidence interval is æ µ µ xn ( - x) y( m- y) ö pa - pb Î pa pb z 3 3, 1 ç - - a + and çè n m ø æ µ µ xn ( - x) y( m- y) ö pa - pb Î 1, pa pb z ç - - - a +. 3 3 çè n m ø The approximations are reasonable as long as x, n - x, y andm -y are all larger than 5. NIPRL 14

1..1 Confidence Intervals for the Difference Between Two Population Proportions Example 55 : Building Tile Cracks Building A : 46 cracked tiles out of n = 6. Building B : 83 cracked tiles out of m =. µ x 46 µ y 83 p = A.677 and pb.415. n = 6 = = m = = æ µ µ xn ( - x) y( m- y) µ µ xn ( - x) y( m- y) ö pa - pb Î pa pb z /, p 3 3 A pb z ç - - a + - - a / + 3 3 çè n m n m ø = (.1,.44) wherea =.1. The fact this confidence interval contains only positive values indicates that p > p. A B NIPRL 15

1.. Hypothesis Tests on the Difference Between Two Population Proportions Hypothesis Tests of the Equality of Two Population Proportions Assume X ~ Bn (, p ), Y ~ Bm (, p ) and X ^ Y. A B H : p = p versus H : p ¹ p has p -value = F (- z ) A B A A wherez = B p A - pb and µ p = µ µ æ1 1 ö p(1 - p) ç + çè n m ø - AcceptH if z z and RejectH if z > z. a / a / x + y n + m H : p ³ p versus H : p < p has p - value = F ( z) A B A A B AcceptH if z ³ - z and RejectH if z < - z. a a H : p p versus H : p > p has p -value = 1 - F ( z) A B A A B AcceptH if z - z and RejectH if z > - z. a a NIPRL 16

1.. Hypothesis Tests on the Difference Between Two Population Proportions Example 59 : Political Polling µ 67 µ 41 pa = =.65 and pb = =.44 95 143 Population age 18-39 age >= 4 A B H : p = p versus H : p ¹ p A B A A B µ 67 + 41 Pooled estimate p = =.55 95 + 143 z = 11.39, p -value = F (- 11.39) ; Random sample n=95 Random sample m=143 The city mayor is doing a good job. - Two-sided 99% confidence interval is p - p Î A B (.199,.311) Agree : x = 67 Disagree : n-x = 35 Agree : y = 41 Disagree : m-y = 6 NIPRL 17

Summary problems (1) Why do we assume large sample sizes for statistical inferences concerning proportions? So that the Normal approximation is a reasonable approach. α () Can you find an exact size test concerning proportions? No, in general. NIPRL 18

1.3 Goodness of Fit Tests for One-Way Contingency Tables 1.3.1 One-Way Classifications Each of n observations is classified into one of k categories or cells. cell frequencies : x, x, ¼, x withx + x + L + x = n x ~ Bn (, p ) Þ µ p = 1 k 1 cell probabilities : p, p, ¼, p with p + p + L + p = 1 i i i Hypothesis Test 1 k 1 xi n H : p = p 1 i k versus H : H is false * i A The null hypothesis of homogeneity is that p * i i 1 = for 1 i k k k k NIPRL 19

1.3.1 One-Way Classifications Example 1 : Machine Breakdowns n = 46 machine breakdowns. x 1 = 9 : electrical problems x = 4 : mechanical problems x 3 = 13 : operator misuse It is suggested that the probabilities of these three kinds are p* 1 =., p* =.5, p* 3 =.3. NIPRL

1.3.1 One-Way Classifications Goodness of Fit Tests for One-Way Contingency Tables Consider a multinomial distribution with k-cells and p1 p ¼ p k 1 ¼, k with 1 a set of unknown cell probabilities,,, and a set of observed cell frequencies x, x, H p = p i k * : i 1 i p - value = P( c ³ X ) or p - value = P( c ³ G ) k- 1 k- 1 ( x - e ) æx ö wherex = andg = x ln ç ø k k i i i å å i i= 1 ei i= 1 çè ei * with ei np, 1 = i k i The p -value are appropriate as long ase a, k- 1 a, k- 1 a, k- 1 a, k- 1 x x + x + L + x = n ³ 5, 1 i k. At size a, accepth if X c (or if G c ) and rejecth if X > c (or if G > c ). i k NIPRL 1

1.3.1 One-Way Classifications Example 1 : Machine Breakdowns H : p 1 =., p =.5, p 3 =.3 Observed cell freq. Electrical Mechanical Operator misuse x 1 = 9 x = 4 x 3 = 13 n = 46 Expected e 1 = 46*. e = 46*.5 e 3 = 46*.3 n = 46 cell freq. = 9. =3. =13.8 X G (9.- 9.) (4.- 3.) (13. - 13.8) = + + =.94 9. 3. 13.8 æ æ9. ö æ4. ö æ13. ö = 9. ln + 4. ln + 13. ln =.945 ç è çè 9. ø çè 3. ø èç 13.8 ø p - value = P( c ³.94) ;.95 ( k - 1= 3-1= ) NIPRL

1.3.1 One-Way Classifications H is plausible from the p - value = P( c ³.94) ;.95. Consider 95% confidence interval for p. p æ4 1.96 4 (46-4) 4 1.96 4 (46-4) ö Î, ç - + çè 46 46 46 46 46 46 ø = (.378,.666) Check for the homogeneity. Let p = p = p =. 1 1 3 3 Thene = e = e = = 15.33. 46 1 3 3 (9.- 15.33) (4.- 15.33) (13. - 15.33) X = + + = 15.33 15.33 15.33 p - value = Pc ( ³ 7.87) ;. Þ 1 3 ( ) Not plausible. Also notice Ï.378,.666. 7.87 NIPRL 3

1.3. Testing Distributional Assumptions Example 3 : Software Errors For some of expected values are smaller than 5, it is appropriate to group the cells. Test if the data are from a Poisson distribution with mean=3. NIPRL 4

1.3. Testing Distributional Assumptions X (17. - 16.93) (.- 19.4) (5.- 19.4) = + + 16.93 19.4 19.4 (14. - 14.8) (6.- 8.57) (3.- 7.14) + + + 14.8 8.57 7.14 = 5.1 p - value = Pc ( ³ 5.1) =.4 Þ 5 Plausible that the software errors have a Poisson distribution with meanl = 3. If : the software errors have a Poisson distribution, e would be calculated usingl = x =.76 and p -value would i H be calculated from a chi-square distribution with k - 1-1= 4 degrees of freedom. NIPRL 5

1.4 Testing for Independence in Two-Way Contingency Tables 1.4.1 Two-Way Classifications A two-way (r x c) contingency table. Second Categorization Level 1 Level Level j Level c First Categorization Level 1 x 11 x 1 x 1c x 1. Level x 1 x x c x. Level i x ij x i. Level r x r1 x r x rc x r. x. 1 x. x. j x. c n = x.. Column marginal frequencies Row marginal frequencies NIPRL 6

1.4.1 Two-Way Classifications Example 55 : Building Tile Cracks Location Building A Building B Tile Condition Undamaged x 11 = 5594 x 1 = 1917 x 1. = 7511 Cracked x 1 = 46 x = 83 x. = 489 x. 1 = 6 x. = n = x.. = 8 Notice that the column marginal frequencies are fixed. ( x. 1 = 6, x. = ) NIPRL 7

1.4. Testing for Independence Testing for Independence in a Two-Way Contingency Table H : Two categorizations are independent. Test statistics r c ( x ) r c ij - e æ ij x ij ö X = org xij ln å å = å å i= 1 j= 1 e ç ij i= 1 j= 1 çè eij ø xx i.. j whereeij = n p -value = P( c ³ X ) or p - value = P( c ³ G ) where n = ( r - 1) ( c- 1) A sizea hypothesis test accepts H if c > X (or G ) rejects H n if c < X (or G ) a, n n a, n NIPRL 8

1.4. Testing for Independence Example 55 : Building Tile Cracks Building A Building B Undamaged x 11 = 5594 e 11 = 5633.5 x 1 = 1917 e 1 = 1877.75 x 1. = 7511 Cracked x 1 = 46 e 1 = 366.75 x = 83 e = 1.5 x. = 489 x. 1 = 6 x. = n = x.. = 8 e e 7511 6 7511 = = 5633.5, e = = 1877.75 8 8 489 6 489 = = 366.75, e = = 1.5 8 8 11 1 1 NIPRL 9

1.4. Testing for Independence X (5594.- 5633.5) (1917. - 1877.75) = + 5633.5 1877.75 (46.- 366.75) (83.- 1.5) + + 366.75 1.5 = 17.896 p - value = P( c ³ 17.896) ; 1 Consequently, it is not plausible that becomes cracked) and does not contain zero. p B p A (the probability that a tile on Building A (the probability that a tile on Building B becomes cracked) are equal. This is consistent that the confidence interval of p and p p - p Î A B (.1,.44) A formula for the Pearson chi-square statistic for table is X = n( x x - x x ) x x x x 11 1 1 1g g1 g g A B NIPRL 3

Summary problems 1. Construct a goodness-of-fit test for testing a distributional assumption of a normal distribution by applying the one-way classification method.. In testing H : p = p vsh : noth, what is the df of the Pearson * ij ij 1 chi-square statistic, wheni = 1,..., r, j = 1,..., c? NIPRL 31