Social Studies 201
Notes for March 23, 2005

Estimation of a population proportion

Section 8.5, p. 521.

For the most part, we have dealt with means and standard deviations this semester. This section of the notes deals with using data from a random sample to estimate the proportion of a population with a particular characteristic. Using the same methods and procedures that were used for estimating means, it is also possible to obtain estimates of a population proportion.

Notation. Let $p$ represent the proportion of a population with a particular characteristic and $q$ denote the proportion of the population not having this characteristic. Since members of the population must either have this characteristic or not have it, $p + q = 1$. That is, the proportion with the characteristic and the proportion without it together make up the whole population. Since $p + q = 1$, the proportion of those without the characteristic is $q = 1 - p$. These are the values that will be estimated using the method of confidence interval estimates.

If a random sample of size $n$ is selected from this population, let $X$ represent the number of cases in the sample with this same characteristic. The sample proportion, or the proportion of cases in the sample with the characteristic, is $X/n$, and this is denoted by $\hat{p}$ in these notes. That is,

$$\hat{p} = \frac{X}{n}.$$

The point estimate of the population proportion $p$ is $\hat{p}$.

For example, in opinion polls prior to an election, a pollster obtains estimates of the proportion of the population saying they will support each political party. These proportions represent point estimates of the proportion of electors who actually will vote for the different political parties. In the case of the CBC poll prior to the November 5, 2003 provincial election, Western Opinion Research reported that 42% of those surveyed said they would vote NDP while 39% said they would vote Saskatchewan Party. Converted into proportions of 0.42 and 0.39, these represent point estimates of the proportion who said they would vote NDP and Saskatchewan Party, respectively. See page 2 of the March 7-9, 2005 notes for these data.

Sampling distribution of a proportion

To obtain interval estimates for a population proportion, it is necessary to know the sampling distribution of the sample proportion $\hat{p}$. That is, a researcher needs to know how $\hat{p}$ behaves as repeated random samples are drawn from a population. While this distribution can be obtained by considering the normal approximation to the binomial distribution (sections 6.3 and 6.5 of Chapter 6 of the text), another way is to consider a proportion as a special case of a mean. Then for large sample size $n$, just as the sample mean $\bar{X}$ is normally distributed (by the central limit theorem), the sample proportion is also normally distributed. This result can be stated as follows.

Sampling distribution of a sample proportion $\hat{p}$. If random samples of size $n$ are drawn from a population with a proportion $p$ of the population having a particular characteristic, and if the sample sizes are large, then the sample proportions $\hat{p}$ are normally distributed with mean $p$ and standard deviation $\sqrt{pq/n}$. That is,

$$\hat{p} \text{ is Nor}\left(p, \sqrt{\frac{pq}{n}}\right).$$

This is essentially the same as the result from the central limit theorem, where large random samples yield a sampling distribution of sample means:

$$\bar{X} \text{ is Nor}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$

In the above, the sample proportion $\hat{p}$ replaces $\bar{X}$. The statistic $\hat{p}$ is a special case of the mean when there are only two values for $X$: with the characteristic and without the characteristic. Similarly, the standard deviation of the sampling distribution is $\sqrt{pq/n}$ for a proportion, as opposed to $\sigma/\sqrt{n}$ for a mean.

Estimate of standard error of $\hat{p}$. In estimating the standard deviation of the sample mean, $\sigma/\sqrt{n}$, it was usually necessary to replace the unknown population standard deviation $\sigma$ with the known sample standard deviation $s$. Similarly, when estimating a proportion, the population proportion is not known, yet the standard error $\sqrt{pq/n}$ must be estimated. In order to provide this estimate, a researcher has two choices.

1. One possibility is to use $p = 0.5$ and $q = 0.5$ when estimating $\sqrt{pq/n}$. Since $p$ and $q$ must sum to one, it can be demonstrated that the maximum value of $p \times q$, when $p + q = 1$, occurs when $p = q = 0.5$. Selecting these values for $p$ and $q$ in the estimate of the standard error $\sqrt{pq/n}$ produces the largest possible standard error for any given sample size. If a researcher wishes to ensure that he or she has not underestimated the sampling error, then it is best to use $p = q = 0.5$ in the expression $\sqrt{pq/n}$.

2. Another possibility, once a sample has been obtained, is to use $p = \hat{p}$ and $q = \hat{q} = 1 - \hat{p}$ in the expression $\sqrt{pq/n}$. This is comparable to using the sample standard deviation $s$ as an estimate of $\sigma$ when estimating the population mean. This tends to produce estimates of $\sqrt{pq/n}$ slightly smaller than the earlier approach. It also builds on the knowledge the researcher already has about the possible value of the population proportion, by using the point estimate $\hat{p}$.

Before a sample is drawn, it is likely that the researcher would employ the first approach above; after the sample has been obtained, it is more common to use the second approach.

Large $n$. The sampling distribution of the sample proportion is normal so long as the sample is a random sample and the sample size is reasonably large. In the case of a proportion, a large sample size occurs when

$$n > \frac{5}{\text{smaller of } p \text{ or } q}.$$

This rule is somewhat different than in the case of a large sample size for a sample mean, where the rule is that a large $n$ is more than $n = 30$. In the case of a proportion, if $p$ is near 0.5, then a large sample size is any $n$ larger than $5/0.5 = 10$. However, if a characteristic of a population is uncommon, so the proportion of the population with this characteristic is small, say 0.01, or 1 in 100, then the sample size required is $n > 5/0.01 = 500$, much larger than in the case of the central limit theorem.

Interval estimate for $\hat{p}$

The method of constructing a confidence interval estimate for the population proportion is the same as for the population mean. First, clearly define what population proportion $p$ is being considered. Then the five steps are:

1. Obtain the sample size $n$ and the sample proportion $\hat{p}$. From $\hat{p}$, the sample proportion without the characteristic is $\hat{q} = 1 - \hat{p}$.

2. If the sample is a random sample with large $n$ (more than 5 divided by the smaller of $p$ or $q$), then $\hat{p}$ is Nor$\left(p, \sqrt{pq/n}\right)$.

3. Select a confidence level, C%.

4. Determine the Z-value associated with the confidence level of C%.

5. The interval estimates for the population proportion $p$ are

$$\hat{p} \pm Z\sqrt{\frac{pq}{n}} \quad \text{or} \quad \left(\hat{p} - Z\sqrt{\frac{pq}{n}},\ \hat{p} + Z\sqrt{\frac{pq}{n}}\right).$$

That is, C% of the intervals constructed in this manner will contain the population proportion $p$.

Example: estimate of proportion supporting Saskatchewan Party

From Saskatchewan Election Polls and Results (page 2 of the March 7-9, 2005 notes), in the CBC poll, support for the Saskatchewan Party was 39% or, as a proportion, $\hat{p} = 0.39$. This is the point estimate of $p$, the proportion of Saskatchewan electors who support the Saskatchewan Party. A 95% interval estimate of $p$ is obtained by using the five steps.

1. The sample size is $n = 800$ and the sample proportion of supporters is $\hat{p} = 0.39$, meaning that the sample proportion of non-supporters of the Saskatchewan Party is $\hat{q} = 1 - \hat{p} = 1 - 0.39 = 0.61$.

2. The sample size of $n = 800$ is large since it is much greater than 5 divided by an estimate of the smaller of $p$ or $q$, here $\hat{p} = 0.39$. That is, $5/0.39 = 12.8$ and this is much less than $n = 800$. The sample size of 800 is very large and more than sufficient to ensure that $\hat{p}$ is Nor$\left(p, \sqrt{pq/n}\right)$.

3. The handout from Western Opinion Research states the sampling error is ±3.5% nineteen times out of twenty. This is $(19/20) \times 100\% = 95\%$, so the C = 95% confidence level is used.

4. For the 95% confidence level and a normal distribution, the Z-value is 1.96 (95% of the area is in the middle of the normal distribution, between $Z = -1.96$ and $Z = +1.96$).

5. The interval estimates are:

$$
\begin{aligned}
\hat{p} \pm Z\sqrt{\frac{pq}{n}} &= \hat{p} \pm 1.96\sqrt{\frac{0.39 \times 0.61}{800}} \\
&= \hat{p} \pm 1.96\sqrt{\frac{0.2379}{800}} \\
&= \hat{p} \pm 1.96\sqrt{0.0002974} \\
&= \hat{p} \pm (1.96 \times 0.0172) \\
&= \hat{p} \pm 0.0338
\end{aligned}
$$

and the interval estimate is 0.39 ± 0.034, that is, (0.356, 0.424) or 35.6% to 42.4%. This is slightly different from the sampling error stated in the Western Opinion Research handout, since $\hat{p}$ and $\hat{q}$ were used in the estimate of the standard error, rather than $p = q = 0.5$.

Note that the CBC poll provided a very accurate estimate of the proportion who actually voted for the Saskatchewan Party on November 5, with 39.35% voting this way. The interval from 35.6% to 42.4% certainly included this $p = 0.3935$.
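The five steps of the Saskatchewan Party example can also be checked with a few lines of Python. This is only a minimal sketch and not part of the text; the function name `proportion_interval` is hypothetical, and the large-sample check uses the sample values $\hat{p}$ and $\hat{q}$ in place of the unknown $p$ and $q$, as in step 2 of the example.

```python
from math import sqrt


def proportion_interval(p_hat, n, z):
    """Return (lower, upper, margin) for p_hat +/- z * sqrt(p_hat * q_hat / n).

    Assumes a simple random sample. The standard error is estimated with
    the sample proportions p_hat and q_hat (the second of the two choices
    described in these notes).
    """
    if not 0 < p_hat < 1:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    q_hat = 1 - p_hat
    # Large-n rule from the notes: n must exceed 5 / (smaller of p or q).
    if n <= 5 / min(p_hat, q_hat):
        raise ValueError("sample size too small for the normal approximation")
    margin = z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin, margin


# CBC poll: p_hat = 0.39, n = 800, Z = 1.96 for 95% confidence.
lower, upper, margin = proportion_interval(0.39, 800, 1.96)
print(f"{lower:.3f} to {upper:.3f} (margin {margin:.4f})")
# prints: 0.356 to 0.424 (margin 0.0338)
```

The printed interval matches the hand calculation, and the actual election result of 0.3935 falls inside it.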

Similar interval estimates for each of the other parties could also be obtained. The sampling errors for the NDP and the Liberal Party are each around ±3.5% as well, so the actual percentage of voters who voted for the NDP lies within the respective 95% confidence interval estimate. The actual percentage of electors who voted Liberal is just outside the interval. For the Cutler poll, there are similar interval estimates and all the actual election results are within the respective confidence intervals.

Margin of error in Saskatoon. In their press release, Western Opinion Research also states that a sample of $n = 400$ voters was obtained in the city of Saskatoon, and the margin of error for this sample is ±4.9%, nineteen times out of twenty. In this case, the sample proportion $\hat{p}$ is not specified, yet it is possible to use the method of interval estimates to obtain the 4.9% sampling error. The method used is exactly the same as for the province as a whole.

For determining whether the sample size is large, an estimate of $p = 0.5$ can be used in the rule that $n$ must exceed 5 divided by the smaller of $p$ or $q$. Using $p = q = 0.5$ avoids the issue of which of these two values is smaller. The sample size of $n = 400$ exceeds $5/0.5 = 10$, so the normal distribution for $\hat{p}$ can be used as before.

For purposes of estimating $\sqrt{pq/n}$ in the interval estimates, $p = q = 0.5$ can be used. That is, $\hat{p}$ and $\hat{q}$ are not specified, so an estimate of the maximum possible sampling error for a sample size of $n = 400$ is obtained by using $p = q = 0.5$ in the estimate of $\sqrt{pq/n}$. The interval estimates are

$$
\begin{aligned}
\hat{p} \pm Z\sqrt{\frac{pq}{n}} &= \hat{p} \pm 1.96\sqrt{\frac{0.5 \times 0.5}{400}} \\
&= \hat{p} \pm 1.96\sqrt{\frac{0.25}{400}} \\
&= \hat{p} \pm 1.96\sqrt{0.000625} \\
&= \hat{p} \pm (1.96 \times 0.025) \\
&= \hat{p} \pm 0.049
\end{aligned}
$$

or ±4.9%, as stated in the handout.
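The maximum sampling error quoted by a pollster can be checked in the same way. The short Python sketch below is an illustration only; the function name `max_margin_of_error` is hypothetical, and it assumes a simple random sample with $p = q = 0.5$ used in the standard error.

```python
from math import sqrt


def max_margin_of_error(n, z=1.96):
    """Largest possible sampling error for a random sample of size n.

    Uses p = q = 0.5, which maximizes p * q and so gives the widest
    interval for a given n. The default z = 1.96 corresponds to 95%
    confidence ("nineteen times out of twenty").
    """
    return z * sqrt(0.5 * 0.5 / n)


# Saskatoon sample (n = 400) and province-wide sample (n = 800).
for n in (400, 800):
    print(n, round(100 * max_margin_of_error(n), 1))
# prints:
# 400 4.9
# 800 3.5
```

Doubling the sample size from 400 to 800 reduces the margin from 4.9 to about 3.5 percentage points. It does not cut the margin in half, because $n$ enters the formula under a square root.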

Conclusion. From these results it can be seen that a sample of $n = 800$ results in a sampling error of approximately ±3.5%, while a sample size of only $n = 400$ results in a sampling error of just under 5 per cent, both with 95% confidence. That is, if a researcher obtains such random samples, he or she can be confident that 95 out of 100 sample proportions will be within 3.5 percentage points of the population proportion if the sample size is 800. When random samples of size 400 are drawn from a population, a researcher can also be 95% sure that sample proportions are within about 5 percentage points of the population proportion.

Note that the interval will be wider:

1. the larger the confidence level: the Z-value is larger and there is greater certainty that the intervals contain the population proportion.

2. the smaller the sample size: a smaller $n$ reduces the denominator of $\sqrt{pq/n}$ and increases its overall value.

3. if the values for $p$ and $q$ are close to $p = q = 0.5$. If the researcher is fairly certain that the true proportion is either much greater or much less than 0.5, then $\hat{p}$ and $\hat{q}$ can be used in $\sqrt{pq/n}$, and this will generally produce a narrower interval.

Determining sample size for estimation of a population proportion

Section 8.6.2, p. 541.

As indicated in the notes for November 17, when sample size is larger, the interval estimate is narrower and sampling error is reduced, compared with smaller sample size. This section of the notes outlines how to obtain the sample size required to estimate a population proportion for any specified sampling error and confidence level.

Notation. Let $p$ represent the proportion of a population with a particular characteristic and $q$ denote the proportion of the population not having this characteristic. Since members of the population must either have this characteristic or not, $p + q = 1$ and $q = 1 - p$.

Let the size of the sampling error be given the symbol $E$. That is, the C% confidence level will result in interval estimates of $\hat{p} \pm E$ if the required sample size is obtained. And if the required sample size is obtained, C% of these intervals will contain the population proportion $p$.

Note that the units for $E$ are proportions. For example, if the proportion of population members with a particular characteristic is to be estimated to within ±2 percentage points, the value of $E$ will be 0.02. That is, the point estimate of $p$ will be a proportion $\hat{p}$, and this will be accurate to within ±0.02, so that the intervals will be $\hat{p} - 0.02$ to $\hat{p} + 0.02$.

Formula for determining sample size

As with the interval estimates for a population proportion $p$, determining sample size begins by considering the sampling distribution of the sample proportion $\hat{p}$. Suppose that random samples of large sample size $n$ are taken from a population with a proportion $p$ of members having a particular characteristic. The sample proportions $\hat{p}$ are normally distributed with mean $p$ and standard deviation $\sqrt{pq/n}$. That is,

$$\hat{p} \text{ is Nor}\left(p, \sqrt{\frac{pq}{n}}\right).$$

This is the case so long as $n$ exceeds 5 divided by the smaller of $p$ or $q$. Larger sample sizes yield normal distributions of $\hat{p}$ that are more concentrated; smaller sample sizes yield normal distributions of $\hat{p}$ that are more dispersed.

For any given confidence level C and associated Z-value, the aim is to find a distribution where the confidence interval estimates

$$\hat{p} \pm Z\sqrt{\frac{pq}{n}}$$

match the intervals associated with the specified sampling error $E$:

$$\hat{p} \pm E.$$

That is, the C% intervals are constructed so that they extend $Z\sqrt{pq/n}$ on either side of $\hat{p}$. But the researcher specifies that these are to be intervals of amount $E$ on either side of $\hat{p}$. The desired error of estimate $E$ and the confidence intervals are the same when a sample size $n$ is selected so that

$$E = Z\sqrt{\frac{pq}{n}}.$$

When this latter expression is solved for $n$ (squaring both sides gives $E^2 = Z^2 pq/n$), the required sample size is

$$n = \left(\frac{Z}{E}\right)^2 pq.$$

This is the formula for the required sample size for a specified error of estimate $E$ and for a Z-value associated with the specified confidence level.

The procedure for estimating sample size is to select a confidence level C and an error of estimate $E$ that the researcher wishes to obtain. From the confidence level, the Z-value can be determined from the table of the normal distribution. In the above formula, the only other parts in question are the values of $p$ and $q$. As stated earlier, when $p + q = 1$, the maximum value of the product of $p$ and $q$ occurs when $p = q = 0.5$. If a researcher wishes to determine a sample size that is sufficient to obtain sampling error $E$ with confidence level C, then this is obtained when $p = q = 0.5$. In this circumstance, the formula for obtaining the required sample size becomes simply

$$n = \left(\frac{Z}{E}\right)^2 0.25$$

since $pq = 0.5 \times 0.5 = 0.25$.

If a researcher has some knowledge that $p$ and $q$ are quite different from 0.5 each, then these alternate estimates for $p$ and $q$ can be used in the formula

$$n = \left(\frac{Z}{E}\right)^2 pq.$$

This will result in a smaller required sample size, and it may be easier or less costly for the researcher to obtain this smaller sample. The concern a researcher might have, though, is that this smaller sample size may not be sufficient to produce intervals with the required error of estimate. Resulting interval estimates may be wider than desired.

Examples. Suppose a researcher wishes to estimate the proportion of a population who support legalizing marijuana, correct to within (a) 5 percentage points, or (b) 2 percentage points, with probability 0.99. What are the required sample sizes?

Answer. This is an estimate of a proportion: the proportion $p$ of the population who support the legalization of marijuana. Since the sample size will likely be fairly large, it can be assumed that the sample proportions $\hat{p}$, of those who support legalization of marijuana, will be normally distributed. The distribution of the sample proportions is

$$\hat{p} \text{ is Nor}\left(p, \sqrt{\frac{pq}{n}}\right).$$

The formula for sample size is

$$n = \left(\frac{Z}{E}\right)^2 pq$$

where $E = 0.05$ for part (a). The confidence level specified is 99% (0.99 probability) and the associated Z-value is 2.575. Letting $p = q = 0.5$, the required sample size is

$$n = \left(\frac{2.575}{0.05}\right)^2 \times 0.5 \times 0.5 = (51.5)^2 \times 0.25 = 2{,}652.25 \times 0.25 = 663.1$$

The required sample size is 664, since the result is always rounded up to the next whole number.

For an accuracy of 2 percentage points, $E = 0.02$ and the required sample size is

$$n = \left(\frac{2.575}{0.02}\right)^2 \times 0.5 \times 0.5 = (128.75)^2 \times 0.25 = 16{,}576.562 \times 0.25 = 4{,}144.1$$

or 4,145. This latter sample size is very large, so it is unlikely that most research projects could obtain a sample with accuracy of ±2 percentage points with probability 0.99.

Conclusion. A few concluding points concerning the determination of sample size for estimation of a proportion are as follows.

1. The formula for determining sample size in the case of estimation of a proportion,

$$n = \left(\frac{Z}{E}\right)^2 pq,$$

has advantages over the formula for estimating a population mean in that the values of $p$ and $q$ can always be set to 0.5 each. This will always produce a sample size sufficient to produce the required accuracy $E$ at whatever confidence level the researcher specifies. In the case of estimating the population mean, the researcher needed some knowledge of the variability of the population being sampled; that is, an estimate of $\sigma$ was required in order to determine sample size. In the case of a proportion, this is not necessary; a researcher can always use $p = q = 0.5$ and be sure this will produce a large enough sample size.

2. All of the above results apply to random sampling from a population. While researchers consider larger sample sizes to be better than smaller sample sizes, strictly speaking this may be the case only if the samples are random, or chosen using the principles of probability. If samples are judgment or snowball samples, large samples may not be all that much better than smaller samples. If other forms of probability samples are used, for example, cluster or stratified samples, formulas similar to the one used in this section can be developed. But the formula in this section applies only to random sampling.

3. If a researcher considers the sample size too large when $p = q = 0.5$, different estimates of $p$ and $q$ can be used. In the example, if a researcher thinks that only 15% of the population oppose the legalization of marijuana, so that the researcher is willing to work with $\hat{p} = 0.85$ and $\hat{q} = 0.15$ when estimating $\sqrt{pq/n}$, the required sample size for (b) would be

$$n = \left(\frac{2.575}{0.02}\right)^2 \times 0.85 \times 0.15 = (128.75)^2 \times 0.1275 = 16{,}576.562 \times 0.1275 = 2{,}113.5$$

or 2,114. This is much less than the earlier sample size of $n = 4{,}145$. The only danger here is that if the proportions supporting or opposing legalization of marijuana are closer to 0.5 than 0.85 and 0.15, then this sample size may produce a confidence interval estimate that has a sampling error greater than 0.02.

4. Given that $p = q = 0.5$ can always be used in order to determine sample size, it is possible to construct tables of required sample size for different confidence levels C and accuracy of estimate $E$. Table 8.8, p. 544 of the text is reproduced here as Table 1. Using $p = q = 0.5$ and the above formula, you should be able to verify all the sample sizes in this table.

Table 1: Sample Sizes for a Proportion, Common Levels of Accuracy and Confidence

Level of                 Confidence Level
Accuracy (E)        90%        95%        99%
0.05                271        385        664
0.04                423        601      1,037
0.03                752      1,068      1,842
0.02              1,692      2,401      4,145
0.01              6,766      9,604     16,577

From Table 1, note that as the researcher is more demanding in terms of accuracy (smaller $E$), required sample size is greater. Similarly, as a researcher is more demanding in terms of requiring greater confidence that the intervals will contain the population proportion, sample size is again increased.

In practice, the actual sample size selected is likely to be informed by the considerations of this section, but may depend more on the budget and time available for the researcher. With limited budget and time for a survey, a researcher may just have to live with the lesser accuracy associated with a smaller sample size.

Last edited March 24, 2005.
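The sample sizes in Table 1 and in the marijuana example can be verified with a short Python sketch. This is offered only as an illustration, under stated assumptions: the function names are hypothetical, the Z-value of 1.645 for 90% confidence is the usual table value and is not stated in these notes, and exact fractions are used so that results such as 2,401 are not pushed up to 2,402 by floating-point rounding.

```python
from fractions import Fraction
from math import ceil

# Z-values by confidence level. 1.96 and 2.575 appear in these notes;
# 1.645 for 90% is an assumption taken from the usual normal table.
Z_VALUES = {"90%": "1.645", "95%": "1.96", "99%": "2.575"}
ACCURACIES = ["0.05", "0.04", "0.03", "0.02", "0.01"]


def required_sample_size(z, e, p="0.5"):
    """Return n = (Z / E)^2 * p * q, rounded up to a whole number.

    Arguments are passed as decimal strings so that Fraction can
    represent them exactly.
    """
    z, e, p = Fraction(z), Fraction(e), Fraction(p)
    q = 1 - p
    return ceil((z / e) ** 2 * p * q)


def sample_size_table():
    """Rebuild Table 1: one row of sample sizes per accuracy level E."""
    return {
        e: [required_sample_size(z, e) for z in Z_VALUES.values()]
        for e in ACCURACIES
    }


if __name__ == "__main__":
    print("E     " + "  ".join(f"{level:>6}" for level in Z_VALUES))
    for e, row in sample_size_table().items():
        print(f"{e}  " + "  ".join(f"{n:>6}" for n in row))
    # Part (b) of the example with p = 0.85 and q = 0.15 instead of 0.5.
    print(required_sample_size("2.575", "0.02", p="0.85"))
```

With $p = q = 0.5$ the function gives 664 and 4,145 for parts (a) and (b) of the example, and 2,114 for part (b) when $p = 0.85$ is used instead.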