Inferences for Proportions and Count Data


1 Inferences for Proportions and Count Data
Corresponds to Chapter 9 of Tamhane and Dunlop.
Slides prepared by Elizabeth Newton (MIT), with some slides by Ramón V. León (University of Tennessee).

2 Inference for Proportions
Data = {0, 1, 1, 1, 0, 0, ..., 1, 0}, Bernoulli(p).
Goal: estimate $p$, the probability of success (or the proportion of the population with a certain attribute).
$\hat{p} = x/n$, where $x$ = number of successes in $n$ trials.
$\mathrm{Var}(\hat{p}) = p(1-p)/n = pq/n$.
The variance depends on the mean.

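3 Large Sample Confidence Interval for Proportion
Recall that
$$\frac{\hat{p} - p}{\sqrt{pq/n}} \approx N(0,1)$$
if $n$ is large ($q = 1 - p$, $n\hat{p} \ge 10$ and $n(1-\hat{p}) \ge 10$).
It follows that:
$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
Confidence interval for $p$:
$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$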
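As an illustration of the large-sample interval on this slide, here is a minimal Python sketch (Python is a stand-in; the slides themselves use S-Plus). The function name `wald_ci` and the data (360 successes in 800 trials) are assumptions chosen for the example, not taken from the textbook.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, conf=0.95):
    """Large-sample CI for p: p_hat +/- z_{alpha/2} * sqrt(p_hat * q_hat / n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical data: 360 successes in 800 trials.
print(wald_ci(360, 800))
```

The sketch does not check the rule of thumb $n\hat{p} \ge 10$ and $n(1-\hat{p}) \ge 10$; a caller would need to verify that separately.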

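4 A Better Confidence Interval for Proportion
Use this probability statement:
$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{pq/n}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
Solve for $p$ using the quadratic equation.
CI for $p$:
$$\frac{\hat{p} + \dfrac{z^2}{2n} - \sqrt{\dfrac{\hat{p}\hat{q}z^2}{n} + \dfrac{z^4}{4n^2}}}{1 + \dfrac{z^2}{n}} \le p \le \frac{\hat{p} + \dfrac{z^2}{2n} + \sqrt{\dfrac{\hat{p}\hat{q}z^2}{n} + \dfrac{z^4}{4n^2}}}{1 + \dfrac{z^2}{n}}$$
where $z = z_{\alpha/2}$.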
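The quadratic-equation interval can be sketched in Python as follows. This is an illustrative implementation under the assumption that the slide's formula is the usual score interval obtained by solving for $p$ with $pq/n$ in the denominator; `score_ci` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

def score_ci(x, n, conf=0.95):
    """CI for p from solving (p_hat - p)^2 <= z^2 * p * (1 - p) / n for p."""
    p_hat = x / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z = z_{alpha/2}
    center = p_hat + z**2 / (2 * n)
    half = sqrt(p_hat * q_hat * z**2 / n + z**4 / (4 * n**2))
    denom = 1 + z**2 / n
    return (center - half) / denom, (center + half) / denom

# Hypothetical small sample: 0 successes in 10 trials.
print(score_ci(0, 10))
```

Unlike the simple large-sample interval, this one stays inside [0, 1] even when $\hat{p}$ is 0 or 1.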

5 Example
See Example 9.1 on page 301 of the course textbook.

6 Binomial CI in S-Plus
> qbinom(.975, 800, 0.45)
[1] 388
> qbinom(.025, 800, 0.45)
[1] 332
95% CI for the proportion of gun owners: $332/800 \le p \le 388/800$, i.e. $0.415 \le p \le 0.485$.
This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
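For readers without S-Plus, the binomial quantile calls can be mimicked in plain Python. This is a sketch under the assumption that `qbinom(q, n, p)` returns the smallest $k$ with $P(X \le k) \ge q$; the names `binom_cdf` and `binom_quantile` are hypothetical helpers, not S-Plus functions.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def binom_quantile(q, n, p):
    """Smallest k such that P(X <= k) >= q (assumed qbinom convention)."""
    total = 0.0
    for k in range(n + 1):
        total += comb(n, k) * p**k * (1 - p)**(n - k)
        if total >= q:
            return k
    return n

lower = binom_quantile(0.025, 800, 0.45)
upper = binom_quantile(0.975, 800, 0.45)
print(lower / 800, upper / 800)  # endpoints of the binomial CI as proportions
```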

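7 Sample Size Determination for a Confidence Interval for Proportion
Want a $(1-\alpha)$-level two-sided CI: $\hat{p} \pm E$, where $E$ is the margin of error.
Then
$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}.$$
Solving for $n$ gives
$$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \hat{p}\hat{q}.$$
The largest value of $pq$ is $\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}$, so the conservative sample size is:
$$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \frac{1}{4} \quad \text{(Formula 9.5)}$$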

8 Example 9.2: Presidential Poll
See Example 9.2 on page 302 of the course textbook.
A threefold increase in precision requires a ninefold increase in sample size.
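The conservative sample size (Formula 9.5) and the ninefold remark can be checked numerically. The sketch below is illustrative only; `sample_size_ci` is a hypothetical name and the margins of error 0.03 and 0.01 are assumed values, not necessarily those used in Example 9.2.

```python
from math import ceil
from statistics import NormalDist

def sample_size_ci(E, conf=0.95, p_guess=0.5):
    """n = (z_{alpha/2} / E)^2 * p*q; p_guess = 0.5 gives the conservative n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z / E) ** 2 * p_guess * (1 - p_guess))

n_3pct = sample_size_ci(0.03)  # margin of error 0.03
n_1pct = sample_size_ci(0.01)  # three times more precise
print(n_3pct, n_1pct, n_1pct / n_3pct)
```

Because $n$ is proportional to $1/E^2$, dividing $E$ by 3 multiplies $n$ by 9 (up to rounding).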

9 Large Sample Hypothesis Test on Proportion
$H_0: p = p_0$ vs. $H_1: p \ne p_0$
Best test statistic:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
Acceptance region: $p_0 \pm cd$, where $c = z_{\alpha/2}$ and $d = (p_0 q_0 / n)^{0.5}$.
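A minimal Python sketch of this z-test follows. The function name `one_prop_ztest` and the data (300 successes in 400 trials, $p_0 = 0.7$) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_ztest(x, n, p0):
    """Return (z, two-sided P-value) for H0: p = p0 vs. H1: p != p0."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # null variance p0*q0/n
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 300 successes in 400 trials, testing p0 = 0.7.
print(one_prop_ztest(300, 400, 0.7))
```

Note that the standard error uses $p_0 q_0$ rather than $\hat{p}\hat{q}$, since the statistic is computed under the null hypothesis.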

10 Basketball Problem: z-test
See Example 9.3 on page 303 of the course textbook, including the P-value.

11 Exact Binomial Test in S-Plus
> pbinom(299, 400, .7)
[Figure: plot of dbinom(x, 400, 0.7) against x.]
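An exact upper-tail probability of the kind this slide computes can be sketched in Python. The observed count 300 is an assumption inferred from the `pbinom(299, 400, .7)` call (so that $P(X \ge 300) = 1 - P(X \le 299)$); `binom_upper_tail` is a hypothetical helper name.

```python
from math import comb

def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), i.e. 1 - P(X <= x - 1)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# Assumed observed count: 300 successes out of 400 under H0: p = 0.7.
print(binom_upper_tail(300, 400, 0.7))
```

This is a one-sided exact P-value; it does not rely on the normal approximation.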

12 Sample Size for Z-Test of Proportion
$H_0: p \le p_0$ vs. $H_1: p > p_0$
Test based on:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$$
Suppose that the power for rejecting $H_0$ must be at least $1 - \beta$ when the true proportion is $p = p_1 > p_0$. Let $\delta = p_1 - p_0$. Then
$$n = \left[\frac{z_\alpha\sqrt{p_0 q_0} + z_\beta\sqrt{p_1 q_1}}{\delta}\right]^2$$
Replace $z_\alpha$ by $z_{\alpha/2}$ for the two-sided test sample size.
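The sample size formula for the z-test can be evaluated with a short Python sketch. The name `sample_size_ztest` and the inputs ($p_0 = 0.5$, $p_1 = 0.6$, $\alpha = 0.05$, $\beta = 0.10$) are hypothetical values for illustration, not those of the textbook example.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_ztest(p0, p1, alpha=0.05, beta=0.10, two_sided=False):
    """n = [(z_a*sqrt(p0*q0) + z_b*sqrt(p1*q1)) / delta]^2, rounded up."""
    nd = NormalDist()
    # Use z_{alpha/2} instead of z_alpha for the two-sided test.
    z_a = nd.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = nd.inv_cdf(1 - beta)
    delta = p1 - p0
    n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / delta) ** 2
    return ceil(n)

print(sample_size_ztest(0.5, 0.6))                  # one-sided
print(sample_size_ztest(0.5, 0.6, two_sided=True))  # two-sided
```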

13 Example 9.4: Pizza Testing
See Example 9.4 on page 305 of the course textbook.
$$n = \left[\frac{z_\alpha\sqrt{p_0 q_0} + z_\beta\sqrt{p_1 q_1}}{\delta}\right]^2$$

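14 Comparing Two Proportions: Independent Sample Design
If $n_1 p_1,\ n_1 q_1,\ n_2 p_2,\ n_2 q_2 \ge 10$, then
$$Z = \frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}} \approx N(0,1)$$
Confidence interval:
$$\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \le p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$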
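The two-sample interval can be sketched in Python as below. `two_prop_ci` is a hypothetical name and the counts (60 of 100 vs. 40 of 100) are made-up data for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ci(x1, n1, x2, n2, conf=0.95):
    """Large-sample CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Hypothetical data: 60/100 successes in sample 1, 40/100 in sample 2.
print(two_prop_ci(60, 100, 40, 100))
```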

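15 Test for Equality of Proportions (Large n): Independent Sample Design
$H_0: p_1 = p_2$ vs. $H_1: p_1 \ne p_2$
Test statistic:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where the pooled estimate of $p$ is
$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{x + y}{n_1 + n_2}.$$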
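A short Python sketch of the pooled test is given here for illustration. `two_prop_ztest` is a hypothetical name; `x` and `y` are the success counts in the two samples, and the data are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(x, n1, y, n2):
    """Return (z, two-sided P-value) for H0: p1 = p2 using the pooled estimate."""
    p1_hat, p2_hat = x / n1, y / n2
    p_pool = (x + y) / (n1 + n2)  # pooled estimate of the common p
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: 60/100 vs. 40/100 successes.
print(two_prop_ztest(60, 100, 40, 100))
```

The pooled estimate is used because, under $H_0$, both samples estimate the same proportion.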

16 Example 9.6: Comparing Two Leukemia Therapies
See Example 9.6 on page 310 of the course textbook.

17 Inference for Small Samples: Fisher's Exact Test
Calculates the probability of obtaining the observed 2x2 table, or any more extreme table, with the margins fixed.
Uses the hypergeometric distribution:
$$P(X = x \mid N, M, K) = \frac{\dbinom{M}{x}\dbinom{N-M}{K-x}}{\dbinom{N}{K}}$$
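The hypergeometric probability and a one-sided Fisher P-value can be sketched in Python. The helper names and the tiny example ($N = 8$, $M = 4$, $K = 4$, observed $x = 3$) are assumptions for illustration; "more extreme" is taken here to mean larger values of $X$, which is one common convention.

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x | N, M, K) = C(M, x) * C(N - M, K - x) / C(N, K)."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

def fisher_upper_p(x, N, M, K):
    """One-sided P-value: probability of the observed x or any larger value."""
    return sum(hypergeom_pmf(i, N, M, K) for i in range(x, min(M, K) + 1))

# Hypothetical 2x2 table with all margins equal to 4 and observed x = 3.
print(fisher_upper_p(3, 8, 4, 4))
```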

18 Inference for Count Data
Data = cell counts = number of observations in each of several (>2) categories: $n_i$, $i = 1, \ldots, c$, with $\sum n_i = n$.
The joint distribution of the corresponding r.v.'s is multinomial.
Goal: determine whether the probabilities of belonging to each of the categories are equal to the hypothesized values $p_{i0}$.
Test statistic: $\chi^2 = \sum (\text{observed} - \text{expected})^2 / \text{expected}$, where observed $= n_i$ and expected $= n p_{i0}$.
$\chi^2$ has a chi-square distribution when the sample size is large.
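As a sketch of this statistic, the following Python function computes $\chi^2$ from observed counts and hypothesized probabilities. The name `chi_square_gof` and the counts are made up; the degrees of freedom $c - 1$ is the standard value when all $p_{i0}$ are fully specified, which the slide does not state explicitly.

```python
def chi_square_gof(observed, p0):
    """Return (chi-square statistic, degrees of freedom) for H0: p_i = p_i0."""
    n = sum(observed)
    expected = [n * p for p in p0]  # expected count n * p_i0 per category
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1  # df = c - 1 (assumed: no estimated parameters)

# Hypothetical counts in three categories, tested against equal probabilities.
print(chi_square_gof([10, 20, 30], [1 / 3, 1 / 3, 1 / 3]))
```

The statistic would then be compared with a chi-square critical value from a table such as the one in the course textbook.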

19 Multinomial Test of Proportions
See Example 9.10 on page 316 of the course textbook.

20 Inferences for Two-Way Count Data
[Table: annual salary (x; rows: Less than $10,000, $10,000-25,000, $25,000-50,000, More than $50,000, Column Sum) by job satisfaction (y; columns: Very Dissatisfied, Slightly Dissatisfied, Slightly Satisfied, Very Satisfied, Row Sum). Cell counts not recovered.]
Sampling Model 1: Multinomial Model (Total Sample Size Fixed)
A sample of 824 from a single population is cross-classified.
The null hypothesis is that X and Y are independent:
$$H_0: p_{ij} = P(X = i, Y = j) = P(X = i)\,P(Y = j) = p_{i\cdot}\,p_{\cdot j} \quad \text{for all } i, j$$

21 Sampling Model 1 (Total Sample Size Fixed)
Based on Table 9.10 in the course textbook.
[Table: annual salary (x) by job satisfaction (y), with row and column sums. Cell counts not recovered.]
Estimated expected frequency (cell 1,1) $= n\,\hat{p}_{1\cdot}\,\hat{p}_{\cdot 1}$, with $n = 824$.

22 Chi-Square Statistic
See Example 9.13, page 324, for instructions on calculating the chi-square statistic.
$$\chi^2 = \sum_{i=1}^{c} \frac{(n_i - e_i)^2}{e_i}$$
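For a two-way table, the estimated expected counts and the chi-square statistic can be sketched in Python as follows. The function name `chi_square_independence` and the 2x2 counts are hypothetical; the job satisfaction counts from the textbook are not reproduced here.

```python
def chi_square_independence(table):
    """Return (statistic, df, expected) for a two-way table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    # Estimated expected count: n * (row_i / n) * (col_j / n) = row_i * col_j / n
    expected = [[r * c / n for c in col_sums] for r in row_sums]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, df, expected

# Hypothetical 2x2 table of counts.
print(chi_square_independence([[10, 20], [30, 40]]))
```

For a 4x4 table such as the job satisfaction example, the same function would return df = (4-1)(4-1) = 9.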

23 Chi-Square Test Critical Value
The d.f. for this $\chi^2$ statistic is $(4-1)(4-1) = 9$.
Since $\chi^2_{9,.05} = 16.919$, the calculated $\chi^2$ is not sufficiently large to reject the hypothesis of independence at the $\alpha = .05$ level.
Based on Table A.5, critical values $\chi^2_{\nu,\alpha}$ for the chi-square distribution, in the course textbook.

24 S-Plus job satisfaction example
Call: crosstabs(formula = c(jobsat) ~ c(row(jobsat)) + c(col(jobsat)))
901 cases in table
Cell contents: N, N/RowTotal, N/ColTotal, N/Total
[Output table: c(row(jobsat)) by c(col(jobsat)), with RowTotl and ColTotl margins. Cell values not recovered.]
Test for independence of all factors
Chi^2 = d.f.= 9 (p= )
Yates' correction not used

25 Product Multinomial Model: Row Totals Fixed
(See Table 9.2 in the course textbook.)
Sampling Model 2: Product Multinomial
The total number of patients in each drug group is fixed.
The null hypothesis is that the probability of each column response (success or failure) is the same, regardless of the row population:
$$H_0: P(Y = j \mid X = i) = p_j$$

26 S-Plus leukemia trial
Call: crosstabs(formula = c(leuk) ~ c(row(leuk)) + c(col(leuk)))
63 cases in table
Cell contents: N, N/RowTotal, N/ColTotal, N/Total
[Output table: c(row(leuk)) by c(col(leuk)), columns 1 and 2, with RowTotl and ColTotl margins. Cell values not recovered.]
Test for independence of all factors
Chi^2 = d.f.= 1 (p= )
Yates' correction not used
Some expected values are less than 5, don't trust stated p-value

27 Remarks About Chi-Square Test
The distribution of the chi-square statistic under the null hypothesis is approximately chi-square only when the sample sizes are large.
The rule of thumb is that all expected cell counts should be greater than 1, and no more than 1/5 of the expected cell counts should be less than 5.
Combine sparse cells (those with small expected cell counts) with adjacent cells. Unfortunately, this has the drawback of losing some information.
Never stop with the chi-square test. Look at cells with large values of (O-E), as in the job satisfaction example.
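The rule of thumb for expected cell counts is easy to check mechanically. The following Python sketch is one possible reading of it; `chi_square_rule_ok` is a hypothetical helper and the tables of expected counts are made up.

```python
def chi_square_rule_ok(expected):
    """True if all expected counts exceed 1 and at most 1/5 of them are below 5."""
    cells = [e for row in expected for e in row]
    n_small = sum(1 for e in cells if e < 5)
    return all(e > 1 for e in cells) and n_small <= len(cells) / 5

# Hypothetical tables of expected counts.
print(chi_square_rule_ok([[12, 18], [28, 42]]))  # no small cells
print(chi_square_rule_ok([[3, 10], [10, 10]]))   # 1 of 4 cells is below 5
```

When the check fails, the slide's advice applies: combine sparse cells with adjacent ones, or use an exact method for small samples.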

28 Odds Ratio as a Measure of Association for a 2x2 Table
Sampling Model I: Multinomial
$$\psi = \frac{p_{11}/p_{12}}{p_{21}/p_{22}}$$
The numerator is the odds of the column 1 outcome vs. the column 2 outcome for row 1, and the denominator is the same odds for row 2; hence the name odds ratio.

29 Odds Ratio as a Measure of Association for a 2x2 Table
Sampling Model II: Product Multinomial
$$\psi = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}$$
If the two column outcomes are labeled as success and failure, then $\psi$ is the odds of success for the row 1 population vs. the odds of success for the row 2 population.
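A sample version of the odds ratio can be computed directly from a 2x2 table of counts. This Python sketch is illustrative; `odds_ratio` is a hypothetical name, the counts are made up, and plugging sample counts in for the probabilities is an assumption about how $\psi$ would be estimated.

```python
def odds_ratio(table):
    """Sample odds ratio (n11 / n12) / (n21 / n22) for a 2x2 table of counts."""
    (n11, n12), (n21, n22) = table
    return (n11 / n12) / (n21 / n22)

# Hypothetical counts: rows are populations, columns are success / failure.
print(odds_ratio([[10, 20], [30, 40]]))
```

The same number results under either sampling model, since the row totals cancel in $\hat{p}_i / (1 - \hat{p}_i)$. The function raises `ZeroDivisionError` if `n12` or `n21` is zero, and returns infinity is never produced, so such tables need separate handling.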
