Sampling: A Brief Review. Workshop on Respondent-driven Sampling Analyst Software

Transcription:

Sampling: A Brief Review. Workshop on Respondent-driven Sampling Analyst Software, 2011

Purpose: To review some of the influences on estimates in design-based inference in classic survey sampling methods, and to give context to some of the challenges of estimation in respondent-driven sampling.

Reprise: Lessons from Statistics 101. Suppose we tossed a coin 100 times and it came up heads 20 times. We would estimate probability(head) as 0.2, and using a formula we could look up in our textbook, we could calculate a confidence interval. Figure 1: The width of confidence intervals decreases with increasing sample size (all other things being equal). Figures 2 and 3: There is more variability as the probability approaches 0.5.
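
The textbook formula the slide alludes to is the usual normal-approximation ("Wald") confidence interval for a proportion. A minimal sketch of that arithmetic, not the workshop's own code:

```python
import math

# 100 tosses, 20 heads: p_hat = 0.2, with a normal-approximation 95% confidence interval
n_tosses = 100
n_heads = 20

p_hat = n_heads / n_tosses                      # point estimate of probability(head)
se = math.sqrt(p_hat * (1 - p_hat) / n_tosses)  # standard error of the sample proportion
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimate = {p_hat:.2f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
# Quadrupling the number of tosses halves se, which is the narrowing Figure 1 illustrates,
# and se = sqrt(p(1-p)/n) is largest when p = 0.5, the pattern shown in Figures 2 and 3.
```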

Lessons from Statistics 101. Another view: Figure 4. Now, instead of rigging things so that I find p = .2 in the sample (for which p = .2 in the population is the best point estimate), I set p = .2 in the population (so to speak). So if I toss a coin with probability(heads) = .2, 100 times, I will not get 20 heads every time; instead there will be variability. I did this 1000 times for each sample size. The black dot shows the median of these 1000 trials of "sample size" coin tosses. The box shows the 25th and 75th percentiles: 50% of the 1000 different estimates of p fell within this box and the other 50% outside of it. The two blue lines show the 2.5th and 97.5th percentiles: 95% of the 1000 different estimates fell within these lines.
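
A sketch of the repeated-tossing simulation described on this slide; the specific sample sizes are illustrative assumptions, since the slide does not list the ones plotted:

```python
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.2      # probability(heads) set in the "population"
n_reps = 1000     # 1000 trials per sample size, as on the slide

for n in (25, 50, 100, 200, 400, 800):          # illustrative sample sizes
    # each trial is n tosses; the estimate of p is the observed fraction of heads
    estimates = rng.binomial(n, p_true, size=n_reps) / n
    q = np.percentile(estimates, [2.5, 25, 50, 75, 97.5])
    print(f"n={n:4d}  median={q[2]:.3f}  box=({q[1]:.3f}, {q[3]:.3f})  blue lines=({q[0]:.3f}, {q[4]:.3f})")
```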

Design-based Inference (beyond Statistics 101). The population is viewed as fixed, and the values of the variables of interest are fixed. Sampling is the only stochastic process, the only source of uncertainty. This is in contrast to model-based inference, which posits an underlying data-generating model, so that the variables of interest in the population are viewed as random variables. Survey sampling methods are typically design-based. To date, published RDS methods are design-based.

Probability Sample (necessary for design-based inference)
1. Every individual in the population must have a non-zero probability of ending up in the sample (written π_i > 0 for every individual i).
2. That probability must be known for every individual who does end up in the sample.
3. Every pair of individuals in the sample must have a non-zero probability of both ending up in the sample (written π_ij > 0 for every pair of individuals (i, j)).
4. That probability must be known for every pair that does end up in the sample. Lumley (2010)
This is the minimal requirement for estimating the mean and variance of any statistic. Ideally one would like to know the probability for any set of individuals in the sample, not just pairs.

Simple Random Sampling (random sampling without replacement) is a sampling design in which n distinct units are selected from the N units in the population in such a way that every possible combination of n units is equally likely to be the sample selected. Thompson (2002). In without-replacement sampling, the sampling probability for an individual changes at each draw (the draw-wise probability). The probability that a particular individual z is chosen on the first draw is p(z) = 1/N. If z is chosen on the first draw, then the probability that z is chosen on the second draw is 0. If z is not chosen on the 1st or 2nd draw, then the probability that z is chosen on the third draw is p(z) = 1/(N - 2). However, the list-wise inclusion probability for individual z, the probability that z will be in any given sample regardless of the draw on which it is selected, is p(z in sample) = n/N for a simple random sample.
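
A quick simulation can confirm the difference between the draw-wise and list-wise probabilities. The population and sample sizes below are illustrative, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 10, 4          # illustrative population and sample sizes
z = 0                 # follow one particular individual
n_reps = 100_000

in_sample = 0
first_draw = 0
for _ in range(n_reps):
    sample = rng.permutation(N)[:n]   # simple random sample without replacement, in draw order
    first_draw += sample[0] == z      # was z selected on the first draw?
    in_sample += z in sample          # did z end up in the sample at all?

print("P(z chosen on first draw) ~", first_draw / n_reps, "(theory: 1/N =", 1 / N, ")")
print("P(z in sample)            ~", in_sample / n_reps, "(theory: n/N =", n / N, ")")
```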

Repeated Simple Random Sampling from a Finite Population. Figure 5: The line at Sampling Fraction Percent = 50 is the result of creating a population of size 1000, that is, a population size for which 500 is 50%. In this population, p = .2; that is, 20% of the individuals are assigned a value of 1 (disease) and 80% the value 0 (non-disease). Then I sample 500 individuals without replacement from this population and calculate the proportion in the sample with the disease, as this is my best estimate of the proportion in the population. I repeat this 1000 times and record the estimate each time. As before, the black dot shows the median of these 1000 trials of 500 draws without replacement from the finite population. The box shows the 25th and 75th percentiles, and the two blue lines show the 2.5th and 97.5th percentiles: 95% of the 1000 different estimates fell within these lines.
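
A sketch of the repeated-sampling experiment described on this slide, assuming exactly 200 of the 1000 individuals are assigned the disease:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, n_reps = 1000, 500, 1000
population = np.zeros(N, dtype=int)
population[:200] = 1                  # 20% of the population assigned the value 1 (disease)

estimates = np.empty(n_reps)
for r in range(n_reps):
    sample = rng.choice(population, size=n, replace=False)   # 500 draws without replacement
    estimates[r] = sample.mean()      # sample proportion = estimate of the population proportion

print("median (black dot):     ", np.median(estimates))
print("25th/75th pct (box):    ", np.percentile(estimates, [25, 75]))
print("2.5th/97.5th pct (blue):", np.percentile(estimates, [2.5, 97.5]))
```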

Repeated Simple Random Sampling from a Finite Population. Figure 5: As the sampling fraction increases, holding the sample size constant, the variability in the estimate due to sampling decreases. This makes sense: in the limit, a sampling fraction of 100% would be the entire population, and then the variability would be 0. So what does a sampling fraction of 0 mean? Conceptually, an infinite population. This is what we were doing in Figures 1 through 4; in some sense, we generated those figures using a data-generating process, not a sampling process. In real life, we need to sample from a known finite population, so the 0 sampling fraction on this plot stands in for a very large population relative to the sample size, that is, a sampling fraction small enough to provide negligible reduction in the variability of the estimate.

Repeated Simple Random Sampling from a Finite Population. The standard error of the mean is reduced by the factor sqrt((N - n)/(N - 1)), approximately sqrt(1 - n/N) for large N. This is called the finite population correction.
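
A small numerical check of the correction, reusing the N = 1000, n = 500 setup from the previous slides (a sketch, not the workshop's code):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, n_reps = 1000, 500, 5000
population = np.zeros(N)
population[:200] = 1                                  # 20% diseased, as before

estimates = np.array([rng.choice(population, size=n, replace=False).mean()
                      for _ in range(n_reps)])

se_infinite = np.sqrt(0.2 * 0.8 / n)                  # standard error ignoring the finite population
fpc = np.sqrt((N - n) / (N - 1))                      # finite population correction factor
print("empirical SE of the estimates:", estimates.std(ddof=1))
print("sqrt(p(1-p)/n) * fpc:         ", se_infinite * fpc)
```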

Weighted Sampling. An individual's sampling weight is the inverse of that individual's sampling probability: w_i = 1/π_i. The fundamental statistical idea behind all of design-based inference is that an individual sampled with a sampling probability of π_i represents 1/π_i individuals in the population. Lumley (2010). Equal-probability sampling (equal weights) results in lower variance than unequal-weight sampling (all other things being equal). Higher variability in the weights in a population results in higher variance of the sample estimators.
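
The "represents 1/π_i individuals" idea is what the Horvitz-Thompson estimator makes precise. The sketch below assumes a Poisson sampling design (each unit included independently with its own probability), which is not the design used in the workshop's simulations, but it makes the weighting idea easy to see:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
y = (np.arange(N) < 200).astype(float)        # 20% diseased population, as on the other slides

# Hypothetical unequal inclusion probabilities, chosen only for illustration
pi = rng.uniform(0.2, 0.8, size=N)

ht_totals = []
for _ in range(2000):
    included = rng.random(N) < pi             # Poisson sampling: independent inclusion decisions
    # Horvitz-Thompson: each sampled unit stands in for 1/pi_i population units
    ht_totals.append(np.sum(y[included] / pi[included]))

print("true number diseased:       ", y.sum())
print("average of the HT estimates:", np.mean(ht_totals))   # close to 200: the estimator is unbiased
```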

Weighted Sampling. Figure 6: This plot was created the same way as the others, that is, with 1000 repeated samplings of each sample size from a population of twice the sample size (50% sampling fraction). Once again, I fix the number of individuals with disease at 20% of the population. What makes this weighted sampling is that each individual in the population has a sampling weight that influences that individual's probability of selection into the sample.

Weighted Sampling. Figures 7a through 7f: The first step in creating the weights was to start with a distribution similar to the degree distributions found in some RDS samples. These degrees were drawn from a Conway-Maxwell-Poisson distribution, a distribution that produces counts but, unlike the Poisson, has a separate parameter governing the standard deviation; that is, it can be under-dispersed or over-dispersed relative to a Poisson distribution. Then the sampling without replacement of half of the population was done with individual sampling probabilities proportional to each individual's degree. The weights are the inverse of the sampling probability, normalized so they sum to the population total.
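
A rough sketch of sampling with probabilities proportional to degree. NumPy does not provide a Conway-Maxwell-Poisson generator, so an over-dispersed negative binomial is used here purely as a stand-in degree distribution, and the sequential weighted draws give inclusion probabilities that are only approximately proportional to degree:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n_sample = 1000, 500

# Stand-in for the Conway-Maxwell-Poisson degrees used on the slide (over-dispersed counts)
degree = rng.negative_binomial(2, 0.2, size=N) + 1          # +1 so no one has degree zero

y = (rng.random(N) < 0.2).astype(float)                     # 20% diseased, independent of degree

# Draw half the population without replacement, with selection probability proportional to degree
p_select = degree / degree.sum()
sample = rng.choice(N, size=n_sample, replace=False, p=p_select)

# Weights: inverse of the selection probability, normalized to sum to the population size
w = 1.0 / p_select[sample]
w *= N / w.sum()

print("weighted estimate of the disease proportion:", np.sum(w * y[sample]) / N)
```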

Weighted Sampling. Figure 6: The first thing to notice is that the medians do not all lie on the population value of 0.2 and that they are not identical to the means (the dashes). Asymptotically, that is, with enough repeated samplings, the medians and means would equal the population value, but apparently 1000 repetitions is not enough, unlike in the simple random sampling case.

Weighted Sampling. Figure 8: This is another view of the same weighted sampling process, for a sample size of 500 and a population size of 1000. Each line shows the 95% confidence interval of the estimate (dot) of the population proportion of diseased individuals for one sample; there are 100 such samples shown on this plot. How many confidence intervals would you expect not to contain the true value of 0.2? The interesting thing about weighted sampling is that the same point estimate does not necessarily produce the same confidence interval as it would in a simple (unweighted) sample. Because of the weights, many different samples, with different sets of weights and outcomes, can result in the same point estimate of the disease proportion in the population. The more variability in the sets of weights that produce a given sample mean, the larger the confidence interval.

Cluster Sampling. In both the simple and weighted sampling described thus far, sampled units (individuals) are statistically independent. In cluster sampling, dependence between observations is introduced. This effectively reduces the sample size, that is, reduces the power to detect effects, that is, increases the variance of the estimators. In this example, there are 20 clusters of 100 individuals each, for a total population size of 2000. Each cluster has a different proportion of the disease, ranging from 0.1 to 0.3, with the proportion in the total population being 0.2. 10 of the 20 clusters are sampled without replacement with equal probability, and 50 of the 100 individuals from each sampled cluster are sampled without replacement with equal probability. The overall sampling fraction is 25%.
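
A sketch of the two-stage cluster design described on this slide, with cluster proportions spanning .10 to .30 and skipping .20, as in Figure 9a:

```python
import numpy as np

rng = np.random.default_rng(5)

# 20 clusters of 100 individuals; cluster disease proportions 0.10-0.30 (skipping 0.20), mean 0.20
props = [0.10 + 0.01 * k for k in range(21) if k != 10]
clusters = [np.concatenate([np.ones(round(p * 100)), np.zeros(100 - round(p * 100))])
            for p in props]

estimates = []
for _ in range(1000):
    chosen = rng.choice(20, size=10, replace=False)                # stage 1: 10 of the 20 clusters
    obs = np.concatenate([rng.choice(clusters[c], size=50, replace=False)   # stage 2: 50 of 100
                          for c in chosen])
    estimates.append(obs.mean())          # equal probabilities at both stages, so a simple mean

print("median estimate:", np.median(estimates))
print("2.5th/97.5th percentiles:", np.percentile(estimates, [2.5, 97.5]))
```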

Cluster Sampling. Figure 9a: This shows the variability in the point estimates and confidence intervals in 100 different cluster samples. In this case, the proportion with disease in each cluster varied from .10 to .30 in increments of .01 (skipping .20, in order to make 20 clusters with an overall population proportion of .20). Note the variability in the size of the confidence intervals even when the point estimates are similar.

Cluster Sampling. Figure 9b: Same conditions as Figure 9a except for the distribution of the disease proportion across clusters. In this case, half (10) of the clusters had a disease proportion of .10 and the other half had a disease proportion of .30, again giving an overall population proportion of .20. Note that the confidence intervals in Figure 9a are smaller than those in Figure 9b. This is a reflection of the difference in the variability of the proportion of disease between the clusters: the standard deviation of the cluster disease proportions is .064 for Figure 9a and .103 for Figure 9b.
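
The two quoted standard deviations can be reproduced directly; the check below assumes cluster proportions of .10 and .30 for Figure 9b, which is also what keeps the overall population proportion at .20:

```python
import numpy as np

props_9a = np.array([0.10 + 0.01 * k for k in range(21) if k != 10])   # .10-.30, skipping .20
props_9b = np.array([0.10] * 10 + [0.30] * 10)                          # half .10, half .30

print(np.std(props_9a, ddof=1))   # ~0.064, as quoted for Figure 9a
print(np.std(props_9b, ddof=1))   # ~0.103, as quoted for Figure 9b
```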

Sampling Methods Comparison. Figure 10: Simple (equal-weight) random sampling from a fixed population does best. Unequal weights add to the variability in the estimate. Cluster sampling with equal weights may do better or worse than weighted sampling, depending on the variability in the weights and the between-cluster variability. Cluster sampling with unequal weights would have higher variability still (not shown).

Summary: All other things being equal, the following increase the uncertainty in design-based estimation of population quantities:
- Small sample sizes
- Population proportions approaching 0.5
- Small sampling fractions
- High variability in sampling weights
- Clustering (dependence)
- High variability between cluster populations (in the quantities of interest)

Samples from RDS:
- Small sample sizes: Yes, especially for subpopulations.
- Population proportions approaching 0.5: Sometimes "hidden populations" are studied because of their relatively high proportion of the disease being studied.
- Small sampling fractions? Sometimes the sampling fractions are large, but the population size is typically unknown, so estimating N adds to the uncertainty.
- High variability in sampling weights: Sometimes there is high variability in the degrees (network sizes) of respondents.
- Clustering (dependence): Dependence between egos and alters? Between respondents from the same seed?

Samples from RDS:
- High variability between cluster populations (in the quantities of interest): Might this be a result of trying to start with diverse seeds?

Recommended Books
- Complex Surveys: A Guide to Analysis Using R, by Thomas S. Lumley, Wiley Series in Survey Methodology, 2010.
- Sampling (2nd edition), by Steven K. Thompson, Wiley Series in Probability and Statistics, 2002.