Chapter 10. Discrete Data Analysis

Chapter 1. Discrete Data Analysis 1.1 Inferences on a Population Proportion 1. Comparing Two Population Proportions 1.3 Goodness of Fit Tests for One-Way Contingency Tables 1.4 Testing for Independence in Two-Way Contingency Tables 1.5 Supplementary Problems NIPRL

1.1 Inferences on a Population Proportion Population Proportion p with characteristic Random sample of size n With characteristic Without characteristic Cell probability p Cell probability 1-p Cell frequency x Cell frequency n-x Sample proportion µ p X ~ B( n, p) µ x p = n E( µ p) = p and Var( µ p) = For large n, µ æ p(1 p) ö p ~ N - ç p, çè n ø µ p- p ~ N(,1) p(1 - p) n p(1 - p) n NIPRL

1.1.1 Confidence Intervals for Population Proportions µ p- p Since ~ N(,1) for large n, two-sided 1- a confidence interval is p(1 - p) n æ µ p(1 - p) µ p(1 - p) ö p Î p z /, p z ç - a + a / çè n n ø µ x µ p(1 - p) where p =, se..( p) =. n n µ µ p (1 - µ p ) 1 xn ( - x ) But it is customary that se..( p) = =. n n n Two-Sided Confidence Intervals for a Population Proportion X ~ Bn (, p) Approximate two-sided 1- a confidence interval for p is æ µ za / xn ( - x) µ za / xn ( - x) ö p Î p, p ç - +. çè n n n n ø The approximation is reasonable as long as both x and n - x are larger than 5. NIPRL 3

1.1.1 Confidence Intervals for Population Proportions Example 55 : Building Tile Cracks Random sample n = 15 of tiles in a certain group of downtown building for cracking. x = 98 are found to be cracked. µ x 98 p = = =.784, z.5 =.576 n 15 99% two-sided confidence interval æ µ za / xn ( - x) µ za / xn ( - x) ö p Î p, p ç - + çè n n n n ø æ.575 98 (15-98).575 98 (15-98) ö =.784,.784 ç - + çè 15 15 15 15 ø = (.588,.98)» (.6,.1) Total number of cracked tiles is between.6 5,, = 3, and.1 5,, = 5, NIPRL 4

1.1.1 Confidence Intervals for Population Proportions µ p - p Since ~ N(,1) for large n, one-sided 1- a confidence interval p(1 - p) n æ µ p(1 - p) ö for a lower bound is p Î p za,1 ç -. çè n ø One-Sided Confidence Intervals for a Population Proportion X ~ Bn (, p) Approximate two-sided 1- a confidence interval for p is æ µ za xn ( - x) ö æ µ za xn ( - x) ö p Î p,1 and p, p - Î -. ç n n è ø çè n n ø The approximation is reasonable as long as both x and n - x are larger than 5. NIPRL 5

1.1. Hypothesis Tests on a Population Proportion Two-Sided Hypothesis Tests H : p = p versus H : p ¹ p A pvalue - = P( X ³ x) if µ p = x / n > p pvalue - = P( X x) if µ p = x / n < p where X ~ Bn (, p ). When np and n(1 - p ) are both larger than 5, a normal approximation can be used to give the p-value of pvalue - = F (- z ) where F ( ) is the standard normal c.d.f. and µ p- p x- np z = =. p(1 - p) np(1 - p) n NIPRL 6

1.1. Hypothesis Tests on a Population Proportion Continuity Correction In order to improve the normal approximation the value in the numerator of the z-statistic x- np -.5 may be used when x- np >.5 - +.5 may x np x np be used when - <.5. A size a hypothesis test accepts H when and rejects H when z z or p - value ³ a / z > z or p - value < a. a / a NIPRL 7

1.1. Hypothesis Tests on a Population Proportion One-Sided Hypothesis Tests for a Population Proportion H : p ³ p versus H : p < p A pvalue - = P( X x) where X ~ Bn (, p ). - The normal approximation to this is x- np + - ( ) and. pvalue = F z z =.5 np (1- p ) - Accept H if z ³ - z and reject H if z < - z. a a For H : p < p versus H : p ³ p, A pvalue - = P( X ³ x) where X ~ Bn (, p ). - The normal approximation to this is x- np -.5 pvalue - = 1 - F ( z) and z =. np (1- p ) a - Accept H if z - z and reject H if z > - z. a NIPRL 8

1.1. Hypothesis Tests on a Population Proportion Example 55 : Building Tile Cracks 1% or more of the building tiles are cracked? H : p ³.1 versus H A : p <.1 n = 15, x = 98, p =.1. z x- np +.5 = = - np (1- p ).5. pvalue - = F (-.5) =.6. z = -.5 NIPRL 9

1.1.3 Sample Size Calculations The most convenient way to assess the amount of precision afforded by a sample size n is to consider L of two-sided confidence interval for p. µ µ p(1 - p) 4 z µ µ a / p(1 - p) L = za / Þ n =. n L Since µ p (1 - µ p ) is unknown, if we take µ p (1 - µ p ) the largest, za / µ 1 n = and p =. L However, if µ p is far from.5, we can take µ p = p* from prior information for p and n 4 z p *(1- p* ).(either p* <.5 or p* >.5) L a / ; NIPRL 1

1.1.3 Sample Size Calculations Example 59 : Political Polling To determine the proportion p of people who agree with the statement The city mayor is doing a good job. within 3% accuracy. (-3% ~ +3%), how many people do they need to poll? Construct a 99% confidence interval for p with a length no large thanl = 6% ( µ p - 3% ~ µ p + 3%) za /.576 Þ n = = = 1843.3 ( a =.1) L.6 Þ representative sample : at least 1844 If "1 respondents, ± 3% sampling error", Þ z = nl = 1.6 = 1.9, F (- 1.9) =.87. Þ a a / ;.57 Þ 95% confidence interval NIPRL 11

1.1.3 Sample Size Calculations Example 55 : Building Tile Cracks Recall a 99% confidence interval for p is (.588,.98). We need to know p within 1% with 99% confidence.( L = %) Based on the above interval, let p* =.1=.5. 4 za / p*(1 - p*) Þ n ; = 597. or about 6. L Þ representative sample : 475 (initial sample = 15) After the secondary sampling, n = 6 andx = 98+ 38 = 46. µ 46 p = =.677 and 6 æ µ za / xn ( - x) µ za / xn ( - x) ö p Î p, p ç - + çè n n n n ø = (.593,.761) NIPRL 1

1. Comparing Two Population Proportions Comparing two population proportions Assume X ~ Bn (, p ), Y ~ B( m, p ) and X ^ Y. A µ x µ y µ µ x y pa = and pb = Þ pa - pb = -. n m n m µ µ µ µ pa(1 - pa) pb(1 - pb ) Var( pa - pb) = Var( pa) + Var( pb) = +. n m ( µ p µ A - pb) - ( pa - p µ µ B ) ( pa - pb )- ( pa - pb) z = =. µ p (1 µ ) µ (1 µ ) ( ) ( ) A - pa pb - p xn- x y m- y B + + 3 3 n m n m If H : p = p, the pooled estimate is A B µ µ µ x + y pa - pb p = and z =. n + m µ µ æ1 1 ö p(1 - p) ç + çè n m ø B NIPRL 13

1..1 Confidence Intervals for the Difference Between Two Population Proportions Confidence Intervals for the Difference Between Two Population Proportions Assume X ~ Bn (, p ), Y ~ B( m, p ) and X ^ Y. Approximate two-sided 1-a A B confidence level confidence interval is æ µ µ µ p µ µ µ µ µ µ µ ö A(1 - pa) pb(1 - pb) pa - pb Î pa - pb - z µ µ pa(1 - pa) pb(1 - pb ) a /, pa pb z + - - a / + n m n m ç è ø æ µ µ xn ( - x) y( m- y) µ µ xn ( - x) y( m- y) ö = pa pb za /, p 3 3 A pb z ç - - + - - a / + 3 3 çè n m n m ø Approximate one-sided 1-a confidence level confidence interval is æ µ µ xn ( - x) y( m- y) ö pa - pb Î pa pb z 3 3, 1 ç - - a + and çè n m ø æ µ µ xn ( - x) y( m- y) ö pa - pb Î 1, pa pb z ç - - - a +. 3 3 çè n m ø The approximations are reasonable as long as x, n - x, y andm -y are all larger than 5. NIPRL 14

1..1 Confidence Intervals for the Difference Between Two Population Proportions Example 55 : Building Tile Cracks Building A : 46 cracked tiles out of n = 6. Building B : 83 cracked tiles out of m =. µ x 46 µ y 83 p = A.677 and pb.415. n = 6 = = m = = æ µ µ xn ( - x) y( m- y) µ µ xn ( - x) y( m- y) ö pa - pb Î pa pb z /, p 3 3 A pb z ç - - a + - - a / + 3 3 çè n m n m ø = (.1,.44) wherea =.1. The fact this confidence interval contains only positive values indicates that p > p. A B NIPRL 15

1.. Hypothesis Tests on the Difference Between Two Population Proportions Hypothesis Tests of the Equality of Two Population Proportions Assume X ~ Bn (, p ), Y ~ Bm (, p ) and X ^ Y. A B H : p = p versus H : p ¹ p has p -value = F (- z ) A B A A wherez = B p A - pb and µ p = µ µ æ1 1 ö p(1 - p) ç + çè n m ø - AcceptH if z z and RejectH if z > z. a / a / x + y n + m H : p ³ p versus H : p < p has p - value = F ( z) A B A A B AcceptH if z ³ - z and RejectH if z < - z. a a H : p p versus H : p > p has p -value = 1 - F ( z) A B A A B AcceptH if z - z and RejectH if z > - z. a a NIPRL 16

1.. Hypothesis Tests on the Difference Between Two Population Proportions Example 59 : Political Polling µ 67 µ 41 pa = =.65 and pb = =.44 95 143 Population age 18-39 age >= 4 A B H : p = p versus H : p ¹ p A B A A B µ 67 + 41 Pooled estimate p = =.55 95 + 143 z = 11.39, p -value = F (- 11.39) ; Random sample n=95 Random sample m=143 The city mayor is doing a good job. - Two-sided 99% confidence interval is p - p Î A B (.199,.311) Agree : x = 67 Disagree : n-x = 35 Agree : y = 41 Disagree : m-y = 6 NIPRL 17

Summary problems (1) Why do we assume large sample sizes for statistical inferences concerning proportions? So that the Normal approximation is a reasonable approach. α () Can you find an exact size test concerning proportions? No, in general. NIPRL 18

1.3 Goodness of Fit Tests for One-Way Contingency Tables 1.3.1 One-Way Classifications Each of n observations is classified into one of k categories or cells. cell frequencies : x, x, ¼, x withx + x + L + x = n x ~ Bn (, p ) Þ µ p = 1 k 1 cell probabilities : p, p, ¼, p with p + p + L + p = 1 i i i Hypothesis Test 1 k 1 xi n H : p = p 1 i k versus H : H is false * i A The null hypothesis of homogeneity is that p * i i 1 = for 1 i k k k k NIPRL 19

1.3.1 One-Way Classifications Example 1 : Machine Breakdowns n = 46 machine breakdowns. x 1 = 9 : electrical problems x = 4 : mechanical problems x 3 = 13 : operator misuse It is suggested that the probabilities of these three kinds are p* 1 =., p* =.5, p* 3 =.3. NIPRL

1.3.1 One-Way Classifications Goodness of Fit Tests for One-Way Contingency Tables Consider a multinomial distribution with k-cells and p1 p ¼ p k 1 ¼, k with 1 a set of unknown cell probabilities,,, and a set of observed cell frequencies x, x, H p = p i k * : i 1 i p - value = P( c ³ X ) or p - value = P( c ³ G ) k- 1 k- 1 ( x - e ) æx ö wherex = andg = x ln ç ø k k i i i å å i i= 1 ei i= 1 çè ei * with ei np, 1 = i k i The p -value are appropriate as long ase a, k- 1 a, k- 1 a, k- 1 a, k- 1 x x + x + L + x = n ³ 5, 1 i k. At size a, accepth if X c (or if G c ) and rejecth if X > c (or if G > c ). i k NIPRL 1

1.3.1 One-Way Classifications Example 1 : Machine Breakdowns H : p 1 =., p =.5, p 3 =.3 Observed cell freq. Electrical Mechanical Operator misuse x 1 = 9 x = 4 x 3 = 13 n = 46 Expected e 1 = 46*. e = 46*.5 e 3 = 46*.3 n = 46 cell freq. = 9. =3. =13.8 X G (9.- 9.) (4.- 3.) (13. - 13.8) = + + =.94 9. 3. 13.8 æ æ9. ö æ4. ö æ13. ö = 9. ln + 4. ln + 13. ln =.945 ç è çè 9. ø çè 3. ø èç 13.8 ø p - value = P( c ³.94) ;.95 ( k - 1= 3-1= ) NIPRL

1.3.1 One-Way Classifications H is plausible from the p - value = P( c ³.94) ;.95. Consider 95% confidence interval for p. p æ4 1.96 4 (46-4) 4 1.96 4 (46-4) ö Î, ç - + çè 46 46 46 46 46 46 ø = (.378,.666) Check for the homogeneity. Let p = p = p =. 1 1 3 3 Thene = e = e = = 15.33. 46 1 3 3 (9.- 15.33) (4.- 15.33) (13. - 15.33) X = + + = 15.33 15.33 15.33 p - value = Pc ( ³ 7.87) ;. Þ 1 3 ( ) Not plausible. Also notice Ï.378,.666. 7.87 NIPRL 3

1.3. Testing Distributional Assumptions Example 3 : Software Errors For some of expected values are smaller than 5, it is appropriate to group the cells. Test if the data are from a Poisson distribution with mean=3. NIPRL 4

1.3. Testing Distributional Assumptions X (17. - 16.93) (.- 19.4) (5.- 19.4) = + + 16.93 19.4 19.4 (14. - 14.8) (6.- 8.57) (3.- 7.14) + + + 14.8 8.57 7.14 = 5.1 p - value = Pc ( ³ 5.1) =.4 Þ 5 Plausible that the software errors have a Poisson distribution with meanl = 3. If : the software errors have a Poisson distribution, e would be calculated usingl = x =.76 and p -value would i H be calculated from a chi-square distribution with k - 1-1= 4 degrees of freedom. NIPRL 5

1.4 Testing for Independence in Two-Way Contingency Tables 1.4.1 Two-Way Classifications A two-way (r x c) contingency table. Second Categorization Level 1 Level Level j Level c First Categorization Level 1 x 11 x 1 x 1c x 1. Level x 1 x x c x. Level i x ij x i. Level r x r1 x r x rc x r. x. 1 x. x. j x. c n = x.. Column marginal frequencies Row marginal frequencies NIPRL 6

1.4.1 Two-Way Classifications Example 55 : Building Tile Cracks Location Building A Building B Tile Condition Undamaged x 11 = 5594 x 1 = 1917 x 1. = 7511 Cracked x 1 = 46 x = 83 x. = 489 x. 1 = 6 x. = n = x.. = 8 Notice that the column marginal frequencies are fixed. ( x. 1 = 6, x. = ) NIPRL 7

1.4. Testing for Independence Testing for Independence in a Two-Way Contingency Table H : Two categorizations are independent. Test statistics r c ( x ) r c ij - e æ ij x ij ö X = org xij ln å å = å å i= 1 j= 1 e ç ij i= 1 j= 1 çè eij ø xx i.. j whereeij = n p -value = P( c ³ X ) or p - value = P( c ³ G ) where n = ( r - 1) ( c- 1) A sizea hypothesis test accepts H if c > X (or G ) rejects H n if c < X (or G ) a, n n a, n NIPRL 8

1.4. Testing for Independence Example 55 : Building Tile Cracks Building A Building B Undamaged x 11 = 5594 e 11 = 5633.5 x 1 = 1917 e 1 = 1877.75 x 1. = 7511 Cracked x 1 = 46 e 1 = 366.75 x = 83 e = 1.5 x. = 489 x. 1 = 6 x. = n = x.. = 8 e e 7511 6 7511 = = 5633.5, e = = 1877.75 8 8 489 6 489 = = 366.75, e = = 1.5 8 8 11 1 1 NIPRL 9

1.4. Testing for Independence X (5594.- 5633.5) (1917. - 1877.75) = + 5633.5 1877.75 (46.- 366.75) (83.- 1.5) + + 366.75 1.5 = 17.896 p - value = P( c ³ 17.896) ; 1 Consequently, it is not plausible that becomes cracked) and does not contain zero. p B p A (the probability that a tile on Building A (the probability that a tile on Building B becomes cracked) are equal. This is consistent that the confidence interval of p and p p - p Î A B (.1,.44) A formula for the Pearson chi-square statistic for table is X = n( x x - x x ) x x x x 11 1 1 1g g1 g g A B NIPRL 3

Summary problems 1. Construct a goodness-of-fit test for testing a distributional assumption of a normal distribution by applying the one-way classification method.. In testing H : p = p vsh : noth, what is the df of the Pearson * ij ij 1 chi-square statistic, wheni = 1,..., r, j = 1,..., c? NIPRL 31