Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013

Size: px

Start display at page:

Download "Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013"

Linda Evans
5 years ago
Views:

1 Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing the simplest metho of survey sampling, calle simple ranom sampling. Suppose that we are conucting election polling an are intereste in estimating the proportion of voters who support Obama in a battle groun state, say Floria. There are N voters in this state an we call this population of voters P. Thus, N enotes the population size. We assume that the complete list of N voters is available, allowing us to sample a subset of them from this list. Such a list is calle a sampling frame. Before we begin our sampling, we etermine the total number of voters we are going to interview. This number represents the sample size an is enote by n. Simple ranom sampling refers to the proceure in which a researcher ranomly samples n voters from the list of N voters with equal probability. There are two key characteristics of this proceure. First, we are sampling exactly n voters without replacement. That is, every voter gets sample at most once. Seconly, each voter has an equal probability of being sample. Clearly, this sampling probability is equal to n/n for every voter on the sampling frame. Below, we introuce two inferential approaches, one esign-base an the other moel-base, to survey sampling. In the simple case we are consiering, both approaches give substantially similar estimation proceures. For example, from both perspectives, the sample proportion of Obama s supporters in a poll is a goo estimate of the population proportion. However, the two approaches are conceptually quite ifferent. The basic iea of the esign-base approach is that all statistical properties of one s estimator are base solely on the actual ata collection proceure employe by the researcher here, simple ranom sampling). Thus, this approach faithfully follow the research esign. In contrast, the moel-base approach requires the researcher to specify a probability moel. Such a moel represents an approximation to the actual ata generating process. As escribe in more etail below, each approach has its avantages an isavantages. 1 Design-base Inference Statistical inference is about learning what we o not observe, calle parameters, from what we o observe, calle ata. What sets apart statistical inference from other mathematical or non-mathematical) metho of inference is that it can calibrate the inferential uncertainty using formal quantities such as stanar errors an confience intervals. But where o these uncertainty measures come from? The esign-base inference is an approach where all statistical properties of estimation proceures are erive solely from the key features of research esign impose by the researcher. In the current context, the fact that a researcher collect the ata via simple 1

2 ranom sampling allows us to characterize the egree of accuracy for the resulting estimate. The esign-base inference is attractive because it faithfully respects the key features of ata collection methos employe by researchers. The approach also provies intuition about how statistical properties are relate to actual research. 1.1 Setup To formalize the iea, let X i represents a binary variable inicating whether or not voter i in the population P supports Obama. We efine this quantity for each voter in the population. That is, for i = 1, 2,..., N, X i = 1 X i = 0) implies that voter i supports oes not support) Obama. We now introuce the sampling inicator variable Z i for each voter in the population, representing whether or not the voter is sample. Thus, Z i = 1 Z i = 0) inicates that voter i is sample not sample). Accoring to our simple ranom sampling proceure, we sample exactly n voters with equal probability an without replacement. This implies that N Z i = n an PrZ i = 1) = n/n. How many unique samples of voters can this simple ranom sampling prouce? To answer this question, it is helpful to consier a scenario where we sample an interview one voter at a time. For the first interview, we have N choices, an we choose the secon interviewee from N 1 voters, an finally the nth voter will be selecte from N n + 1 remaining voters. Thus, all together, we have N N 1) N n + 1) ways of interviewing n voters from a population of N voters. What we just i is to count all permutations by assuming that the orer of sampling matters. However, we actually o not care about the orer of sampling an hence nee to remove ouble-counting. How many uplicates o we have? We have a total of n voters in our sample, an using the same logic of permutations above, there are a total of n! = n n 1) 1 ways to arrage these n voters of any istinct sample. Thus, all together we arrive at the stanar formula for combinations ) N n = N N 1) N n + 1) n! = N! n!n n)! Since each istinct sample is equally likely, the probability that a particular combination of n voters is selecte equals 1/ N n) = n!n n)!/n!. 1.2 Estimation Our quantity of interest, or estiman, is the proportion of Obama supporters in the population, an this quantity can be written as the population mean of X i, i.e., X = N X i/n, because the population P consists of N voters. Our goal here is to estimate the population proportion of Obama supporters using our sample of n voters. The sample proportion appears to be a reasonable estimator for the population proportion. In the current notation, we can write the sample proportion or sample mean) as x = N Z ix i /n since those who are not in the sample an have Z i = 0 will not contribute to this quantity. What are the statistical properties of this estimator? We consier how this estimator performs over hypothetical repeate sampling. We will never know how close the proportion of Obama supporters in our actual sample to the population proportion, but we can say something about how our estimator behave if we were to repeat the same sampling process many times. Of course, we can never aminister the same survey to the same population at the same point of time 1) 2

3 more than once. Nevertheless, by knowing how one s estimator performs uner this hypothetical scenario, we can gain a better unerstaning of how reasonable our estimator is. To erive the statistical properties of the sample mean uner the esign-base framework, we take the position that the only source of ranomness comes from the proceure of simple ranom sampling. Formally, for each voter i, Z i is regare as a ranom variable with probability istribution efine by simple ranom sampling whereas X i is assume to be a fixe but possibly unobserve epening on the realization of Z i ) quantity. The iea is that each voter i is either a supporter of Obama, in which case we have X i = 1, or not, i.e,. X i = 0, an this quantity is fixe. While X i is a fixe variable for any given voter, it has a istribution over the population of voters because a certain number of voters are Obama supporters while others are not. Now, suppose that X i represents one s income. Income is a fixe quantity for each voter at the time of survey, but we can still consier a istribution of income over the population of voters. Given this istinction between fixe an ranom variables, we compute the expecte value of our estimator over repeate sampling. What oes this mean? As iscusse above, there exist a total of N n) possible samples one can en up with when applying simple ranom sampling. Since each sample contains ifferent combinations of voters with ifferent realizations of Z i, its sample mean may be ifferent an may not even be close to the population mean, which we woul like to estimate. However, the following calculation shows that on average we get the right answer, E x) = 1 n EZ i X i ) = 1 n EZ i )X i = 1 n n N X i = X 2) where the first equality follows from the fact that the expectation is a linear operator, the secon equalitiy hols because X i is a fixe quantity, an the thir equality comes from the binary nature of the sampling inicator variable, i.e., EZ i ) = PrZ i = 1). Note that the expectation is taken with respect to Z i, which is the only ranom variable in this setting. We have just shown that the sample mean is unbiase for the population mean. That is, even though the sample mean we compute from our actual sample may iffer from the population parameter for any particular raw, the average value of this estimator over hypothetical) repeate sampling equals exactly the estiman of interest. The unbiaseness of the sample mean over repeate sampling is helpful, but how close is my actual estimate to the population mean? While one cannot know exactly how far off one s estimate is from the truth, we can characterize the average istance from the truth, again, over repeate sample. One such measure is the variance of sampling istribution of the estimator. Uner the esign-base framework, this variance is given by, V x) = 1 n ) S 2 N n where 1 n/n) is the finite sample correction an S 2 = N X i X) 2 /N 1) is the population variance. The finite sample correction term ajusts the variance by taking into account the sample size relative to the population size. If the sample size equals the population size, then the sample mean equals the population mean an hence the variance is zero. In most cases, the sample size is small relative to the population size, which means that this term is close to one, resulting in a little impact on the variance. The variance expression given in equation 3) is a theoretical result because it is a function of the population variance of X i, which is an unknown quantity. However, we can estimate S 2 in the 3 3)

4 exactly same way in which we estimate the population mean using the sample mean. In fact, we can use the sample variance of X i, s 2 = N Z ix i x) 2 /n 1), as an unbiase estimator for its population variance, S 2. An unbiase estimator for the variance of the sampling istribution, then, is given by, ˆσ 2 = 1 n ) s 2 N n where Eˆσ 2 ) = V x). We have seen that the sample mean is a goo estimator of the population mean an the sample variance is a goo estimator of the population variance. Di you notice any pattern here? Inee, this is calle plug-in principle: a sample analogue of a population parameter is a goo estimator because it usually equals the population parameter in expectation. How o we erive the variance expression given in equation 3)? The calculation require for erivation is quite similar to what we i for showing the unbiaseness of the sample mean, though of course it is much more involve the slies provie etails about each step of the erivation). The basic iea is still the same, however. Recall that the variance represents the average square istance from its mean, i.e., V x) = E{ x E x)} 2 where in this case E x) = X. Thus, the erivation amounts to taking the expectation with respect to the sampling inicator variable Z i, which again is the only source of ranomness uner the esign-base approach. 1.3 Unequal Probability Sampling an Sampling Weights So far, we have focuse on the simplest probability sampling methos, simple ranom sampling. However, in practice, this proceure may not be appropriate an researchers may wish to sample responents with ifferent sampling probabilities. For example, it is often a common practice to over-sample minorities in orer to ensure that the resulting sample contains enough minority responents. Moreover, even if one applies simple ranom sampling, it is often the case that the refusal to participate in survey i.e., unit non-response) results in unequal sampling probability across ifferent iniviuals. This means that when making inference about the population, we must make aitional statistical ajustments. The sample mean is no longer unbiase for the population mean because the sample is not representative of the population. The most common way to make an ajustment is through weighting. We will weight each observation by the inverse of sampling probability. This makes sense because, for example, oversample minorities nee to be unerweighte so that their sample proportion will resemble their population proportion. For example, if an African American voter is twice more likely to be sample when compare to other voters, then the sample proportion of African American voters is, on average, twice as large as their population proportion. Thus, the African American voters in a sample nee to be ownweighte by a factor of two in orer for the sample to be representative of the population. This simple iea is the basis of the following classical weighting estimator by Horvitz an Thompson, 4) x = 1 N Z i X i π i 5) where π i = PrZ i = 1) is the sampling probability for each voter i in the population. This probability is inexe by i in orer to allow for the possibility that the sampling probability varies 4

5 from one voter to another. In the case of simple ranom sampling, everyone has the same sampling probability so that π i = n/n for all i. Showing that this inverse probability weighting leas to an unbiase estimator can be one in the exactly same manner as before. Taking the expectation with respect to Z i an treating X i as a fixe quantity, we have, E x) = 1 N EZ i )X i π i = X. 6) where EZ i ) = π i. The erivation of the variance of this estimator is similar except that, again, it is much more involve than eriving its expectation. Although we have so far assume that the sampling probability, whether it is constant or varying across units, is known, in practice we often must estimate it from ata. As mentione earlier, a main reason for this is the existence of unit nonresponse, which results in ifferent an unknown sampling probabilities across units. We note that survey weights typically available in many social science surveys are constructe using the metho of post-stratification. Uner this methoology, we first obtain the population istribution of certain covariates e.g., emographics from the census). Then, survey weights are calculate such that after weighting the observations of a sample the sample istribution of these covariates approximates the population istribution. While this approach guarantees that the sample is representative of the population with respect to these covariates, it is not necessarily the case that there may be significant ifferences in other observe an unobserve covariates between the sample an the population. Later in this course, we will have a more complete iscussion of the issues that arise from such sample selection. 2 Moel-base Inference The esign-base inference introuce above is clean an intuitive, an oes not require any strong assumptions. This approach has limite applicability, however, since the erivation of variance of the sample mean is challenging even uner simple ranom sampling. It is easy to imagine that in more complex but realistic situations, the esign-base approach may face analytical an other ifficulties. To aress this issue, we introuce the moel-base inference. This approach requires researchers to irectly moel the ata generating process by specifying a probability istribution that characterize the population we are intereste in. The avantage is that uner this approach all the useful tools of probability theory can be use to erive the statistical properties of one s estimator. The moel-base inference gives much greater analytical flexibility than the esign-base approach because the latter is not allowe to eviate from the actual ata collection mechanisms. A major isavantage of the moel-base approach, however, is that in some cases a moel selecte by researchers oes not approximate the actual ata generating process well. Constructing a goo moel requires a careful balancing act. As famously put by the statistician George Box, All moels are false but some are useful. Therefore, the creibility of moel-base inference relies upon the critical question of whether one s moel is a useful approximation of the real worl. 5

6 2.1 Estimation Let s go back to the polling example. A moel-base approach may assume that the number of Obama supporters in one s sample, enote by X = n X i, follows a Binomial istribution with probability p an size n. Is this a reasonable moel? In substantive terms, the Binomial moel assumes that n inepenent raws are sample from an infinite population of voters where the proportion of Obama supporters is exactly p. This moel is a goo approximation to the simple ranom sampling of voters in Floria for two reasons. First, the number of voters in Floria is quite large an hence assuming that the population size is infinite is reasonable. Seconly, using p as the proportion of Obama supporters in the population of Floria voters, obtaining n inepenent raws from the Binomial istribution resembles the simple ranom sampling proceure of n voters. Given this setup, we estimate the population proportion of Obama supporters p using the sample proportion ˆp = X /n. Now, recall that the mean an variance of a Binomial ranom variable are EX ) = np an VX ) = np1 p), respectively. Using this fact, it is straightforwar to show the unbiaseness of this estimator an its variance as follows, Eˆp) = np n = p an Vˆp) = VX ) p1 p) = 7) n 2 n Again, the variance is a theoretical quantity an must be estimate from ata because p is unknown. This is easily one by following the plug-in principle introuce above: we replace p with its unbiase estimate ˆp in the variance expression Vˆp) given in equation 7). In aition to the mean an variance of the sampling istribution, the moel-base approach also enables inference base on asymptotic approximation where the sample size is assume to approach infinity in the limit. Of course, in practice, we will never have infinite amount of ata. Asymptotic approximation is useful because it allows us to use powerful theorems of probability theory to erive the statistical properties of various estimators in the limit as the sample size tens to infinity. Such a calculation is often ifficult in finite sample settings. Thus, when the sample size is large enough, one can apply these asymptotic approximations. One important asymptotic theorem is the weak) law of large numbers. The theorem states that the sample mean approaches the population mean as the sample size increases. More formally, if {X i } n is a sequence of i.i.. ranom variables with mean µ an finite variance σ 2, then letting X n = n X i/n enote the sample mean, we have X n p µ 8) where p enotes the convergence in probability, which is formally efine as lim n Pr X n µ > ɛ) = 0 for any ɛ > 0. Thus, convergence in probability states that for any arbitrarily small positive number ɛ, the probability that the eviation of the sample mean eviates from the population mean is greater than ɛ becomes negligible as the sample size approaches infinity. In the case of survey sampling, this implies that the sample mean is not only unbiase for the population mean, i.e., EX n ) = µ, but it is also consistent, becoming arbitrarily close to the population mean as the sample size increases. Note that unbiase an consistency are ifferent properties. For example, consier another estimator of the population mean X n +1/n. This rather silly) estimator is biase because EX n + 1/n) = µ + 1/n µ. However, it is still consistent for the population mean as the bias term 1/n tens to zero as the sample size approches infinity. One coul also say that this estimator is asymptotically unbiase though it is biase in a finite sample) because the bias isappears asymptotically. In fact, consistency implies asymptotic unbiaseness, while the converse is true so long as the variance goes to zero as the sample size increases. 6

7 2.2 Confience Intervals Another important probability theorem use for asymptotic approximation is the central limit theorem. This theorem enables us to erive the asymptotic) sampling istribution of the sample mean, going beyon its mean an variance given in equation 7). Using the knowlege of sampling istribution, we can make probabilistic statements about the properties of estimators. Specifically, the theorem states that the sample mean is istribute over repeate sampling) accoring to a normal istribution. Formally, if {X i } n is a sequence of i.i.. ranom variables with mean µ an finite variance σ 2, then we can write the central limit theorem as, nxn µ) σ N 0, 1) 9) where represents the convergence in istribution. We say that a sequence of ranom variable, X n, converges to another ranom variable X, if lim n P X n x) = P X x) for all x with P X x) being continuous at every x. In wors, the istribution function of X n converges to that of X as the sample size approaches infinity. What is the intuition behin equation 9)? Uner the moel-base approach, we can erive the variance of the sample mean as VX n ) = n VX i)/n 2 = σ 2 /n using the assumption that X i is an i.i.. ranom variable. Given this result, it is clear that the left han sie of equation 9) represents the stanarize version of the sample mean X n where the mean of the sampling istribution, i.e., µ, is first subtracte from it an then this ifference is ivie by the stanar eviation, i.e., σ/ n. This stanarize version of the sample mean, accoring to the theorem, has the stanar normal istribution in the limit as the sample size tens to infinity. We present another way to unerstan the central limit theorem. First, rewrite the theorem as nxn µ) N 0, σ 2 ) by multiplying the ranom variable in the left han sie of equation 9) with a constant σ. We know that by law of large numbers, X n approaches µ as the sample size increases, implying that X n µ in the numerator becomes smaller an smaller as n tens to infinity. This is off set exactly by the increase of n term an as the sample size increases the istribution of nx n µ) becomes the normal istribution with mean zero an variance σ 2. For this reason, n is calle the rate of convergence. How can we use the central limit theorem uner the moel-base inference? The theorem provies a irect large-sample i.e., asymptotic) approximation to the sampling istribution of the sample mean. To o this, we assume that for large enough n, equation 9) hols approximately, i.e., nx n µ)/σ approx. N 0, 1). Diviing this ranom variable by σ/ n an then aing µ, we have the following approximate sampling istribution of X n, X n approx. Thus, in aition to its mean an variance, we now know the sampling istribution in the limit as the sample size increases. In typical situations, we o not know σ 2 an hence estimate it from the sample. Again, we apply the plug-in principle an use the sample variance, which is an unbiase an consistent estimator for the population variance σ 2. Does the above approximation still hol when we replace the population variance with its estimate? The answer turns out to be yes. That is, we N µ, σ 2 n ) 10) 7

8 have nx n µ)/ˆσ N 0, 1). This result is base on another important result in probability p theory, calle the Slutzky Theorem, which states that if X n x an Y n Y, then X n + Y n x + Y an X n Y n xy. 11) Specifically, we apply the Slutzky Theorem in the following manner, nxn µ) ˆσ = nxn µ) }{{ σ } N 0,1) σˆσ }{{} p 1 N 0, 1) 12) where the first term converges to the stanar normal ranom variable because of the central limit theorem an the secon term converges in probability to one because ˆσ 2 is a consistent estimator of σ 2. Together, the result irectly implies that the so-calle z-score follows the stanar normal istribution in a large sample, z score = X n s.e. = X n ˆσ/ n approx. N 0, 1) 13) Finally, we erive the asymptotic confience intervals, which are often use as a way to express the uncertainty of an estimator. The basic iea is to construct an interval such that over repeate sampling it covers the true value, say, 95% of time. Note that what is ranom here is the confience interval rather than the truth. The probability that any given confience interval reporte by the researcher contains the true value is either zero or one since the truth is fixe. However, over hypothetical repeate sampling, the proceure shoul prouce confience intervals which vary from one sample to another) that contain the truth for the pre-specifie proportion. In the current context, this means that our 95% confience interval from polling ata may or may not contain the true proportion of Obama supporters in Floria but if we were to conuct polling in the exactly same manner, over this hypothetical repeate sampling, 95% of such confience intervals will contain the true proportion of Obama supporters. To formally construct such confience intervals, we use the result about z-score given in equation 13) an the efinition of convergence in istribution. lim Pr z X n µ n s.e. ) z = Pr z Z z) 14) where Z is the stanar normal ranom variable an z is a pre-etermine quantity calle critical value. Now, for 95% confience intervals, choose z α/ such that Pr z α/2 Z z α/2 ) = Φz α/2 ) = 1 α = 0.95 where α = By rearranging the left han sie of equation 14), we obtain, Pr z α/2 X ) n µ z α/2 = Pr z α/2 s.e. X n µ z α/2 s.e. ) 15) s.e. = Pr X n z α/2 s.e. µ X n + z α/2 s.e. ) 16) where the secon line is obtaine by subtracting X n from each of the three terms an then multiplying them by negative one an therefore switching the irection of inequalities). Thus, in 8

9 a large sample, over repeate sampling, the confience intervals, [X n z α/2 s.e., X n +z α/2 s.e.], contains the true value α)% of time. To sum up, the moel-base approach allows us to erive the mean an variance of the estimator by using all available rules of probability theory. In aition, two powerful asymptotic theorems, i.e., law of large numbers an central limit theorem, enables large sample approximation with a large class of ata generating processes. This asymptotic approximation tells us the sampling istribution of our estimator, which in turn lets us characterize the estimation uncertainty with confience intervals. 9

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information