A SAS MACRO FOR ESTIMATING SAMPLING ERRORS OF MEANS AND PROPORTIONS IN THE CONTEXT OF STRATIFIED MULTISTAGE CLUSTER SAMPLING

Mamadou Thiam, Alfredo Aliaga, Macro International, Inc.
Mamadou Thiam, Macro International, Inc., 11785 Beltsville Drive, Suite 300, Calverton, MD 20705

Key Words: Weighting, Stratification, Clustering, Sampling error.

Abstract

In producing estimates from sample surveys, it is essential to quantify the extent of sampling error. Survey data are frequently collected based on complex sample designs featuring unequal weighting, stratification and clustering. Most standard statistical software packages handle weights correctly in producing point estimates, but few have options to account for the complexity of the sample design in estimating sampling errors. This user-friendly SAS macro computes sampling error estimates for means and proportions, for the total sample as well as for specified subclasses, in the context of stratified multistage cluster sampling with unequal probabilities of selection. Using the Taylor series approximation to estimate the sampling error from the sample, the macro also produces the unweighted and weighted sample sizes, the sampling error under simple random sampling, the design effect, the relative error and the confidence interval of the estimate. The coefficient of variation of the sizes of the primary sampling units is also given, to allow a check of the validity of the assumption underlying the Taylor series approximation method.

Introduction

Researchers widely derive estimates of population characteristics using a probability sample drawn from the population under study. Two types of errors affect these estimates: non-sampling errors and sampling errors. Non-sampling errors generally result from mistakes made during the implementation of data collection and data processing. Sampling errors, the only type addressed in this paper, arise from limiting the inquiry to a sample of the population.
If a probability sample is drawn, the particular sample obtained is only one of the many samples that could have been selected from the same population using the same sample design. Each of these samples would yield estimates that differ somewhat from the estimates derived from the sample that is actually available. The sampling error of an estimate is a measure of the variability between estimates from all possible samples of the same size and design under the same essential survey conditions. The degree of this variability can be estimated from the one probability sample that is selected. The computational procedure must take into account the complexity of the sample design, particularly the effects of unequal weighting, stratification, and clustering, which influence the extent of the sampling error.

Several sampling error estimation programs are currently available, and they all produce similar results when the sample size is large. This paper presents a SAS macro, %STDERROR, that takes into account the complexity of the sample design in estimating the sampling error of means and proportions for the total sample as well as for specified subclasses. The macro also produces certain statistics derived from the estimated sampling error.

Computational method and formulae

The %STDERROR macro uses the Taylor linearization method of variance estimation to estimate sampling errors of means and proportions under an ultimate cluster sample selection model. Under the ultimate cluster sampling model, elements within primary sampling units (PSUs) are divided into ultimate clusters and then one ultimate cluster is taken from every PSU. The Taylor linearization method treats any estimator as a ratio estimator, obtains a linear approximation of the estimator, and then uses the variance of the approximation to estimate the variance of the estimator itself.
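As a brief sketch of the linearization step for a ratio, the standard first-order Taylor expansion can be written out. The symbols $Y$, $X$ and $R = Y/X$ (the population counterparts of the sample quantities $y$, $x$ and $r = y/x$) are notation introduced here for illustration only:

$$ r = \frac{y}{x} \approx R + \frac{1}{X}\,(y - R\,x), \qquad \operatorname{var}(r) \approx \frac{1}{X^2}\,\operatorname{var}(y - R\,x) $$

In practice $R$ and $X$ are unknown and are replaced by their sample estimates $r$ and $x$, so the variance of the ratio is estimated from the variability of the linear quantity $y - r\,x$.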
This method also assumes that two or more primary sampling units (PSUs) are selected independently and with replacement from each stratum. In practice, PSUs are usually selected without replacement and variance estimates obtained using the Taylor linearization method are then overstated, but the overstatement should be negligible if the sampling fraction for PSUs is small. With the Taylor linearization method, aggregated quantities are computed only at the PSU level; the design information at subsequent stages is not used.

It is worth noting that the strata used for computing sampling errors are not necessarily the explicit strata used in sample design and selection. Where they differ, the original explicit strata are of no interest. Strata to be used for the computation of sampling errors may be created by grouping PSUs based on some characteristic of similarity so as to maintain homogeneity within each stratum.

To supply the basic formulae, we consider a ratio estimator $r = y/x$, where $y$ and $x$ are weighted sums over the whole sample or a subclass. Suppose there are $H$ strata in the sample or subclass, and $m_h$ PSUs are selected from stratum $h$. For any PSU $i$ in stratum $h$, the notation is as follows:

$y_{hij}$ = value of variable $y$ for element $j$ in PSU $i$ in stratum $h$;

$y_{hi}$ = weighted sum of the values $y_{hij}$ for all elements in PSU $i$, i.e.,

$$ y_{hi} = \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij} $$

where $w_{hij}$ is the sample weight for element $j$ in PSU $i$ in stratum $h$, and $m_{hi}$ is the number of elements in PSU $i$;

$y_h$ = sum of values for stratum $h$, i.e.,

$$ y_h = \sum_{i=1}^{m_h} y_{hi} $$

$y$ = sum over the whole sample or subclass, i.e.,

$$ y = \sum_{h=1}^{H} y_h $$

Similar notation applies for variable $x$.

The variance of the ratio estimator $r$ is computed using the formula given below, with the sampling error being the square root of the variance:

$$ SE^2(r) = \operatorname{var}(r) = \frac{1-f}{x^2} \sum_{h=1}^{H} \left[ \frac{m_h}{m_h - 1} \left( \sum_{i=1}^{m_h} z_{hi}^2 - \frac{z_h^2}{m_h} \right) \right] $$

in which $z_{hi} = y_{hi} - r\,x_{hi}$ and $z_h = y_h - r\,x_h$, and where $f$ is the overall sampling fraction, which is so small that it can be ignored.

The estimate of the sampling error given by the Taylor linearization method is not unbiased. The magnitude of this bias depends on the coefficient of variation of PSU sizes, $CV(x)$, and may be ignored when that coefficient is below 0.2. %STDERROR prints the coefficient of variation of PSU sizes for the entire sample as well as for every subclass. This is computed using the following formula:

$$ CV(x) = \frac{1}{x} \sqrt{ \sum_{h=1}^{H} \frac{m_h}{m_h - 1} \left( \sum_{i=1}^{m_h} x_{hi}^2 - \frac{x_h^2}{m_h} \right) } $$

where $x_{hi}$ is the weighted size (weighted number of elements) of PSU $i$ in stratum $h$.

The design effect, DEFT, printed by %STDERROR is the ratio of the estimated sampling error under the actual sample design to the sampling error that would be obtained under simple random sampling. Assuming simple random sampling, the sampling error for the ratio estimator $r = y/x$ is given by the following formula:

$$ SE_{srs}^2(r) = \frac{1-f}{n-1} \cdot \frac{\displaystyle\sum_{h=1}^{H} \sum_{i=1}^{m_h} \sum_{j=1}^{m_{hi}} w_{hij}\, z_{hij}^2}{\displaystyle\sum_{h=1}^{H} \sum_{i=1}^{m_h} \sum_{j=1}^{m_{hi}} w_{hij}} $$

where $z_{hij} = y_{hij} - r\,x_{hij}$ and $n$ is the unweighted number of cases.
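The formulae of this section can be sketched in SAS for a single variable. The block below is an illustration under stated assumptions, not the source of %STDERROR: it uses the data set and variable names from the paper's examples (TEST, RWEIGHT, V021, V022, EVBORN), the intermediate table names (RATIO, ELEM, PSU_TOT, STRAT, RESULT) are hypothetical, the sampling fraction f is ignored, every stratum is assumed to contain at least two PSUs, and the relative error is assumed to be the sampling error divided by the estimate.

```sas
/* Sketch only: linearized and SRS sampling errors for the mean of EVBORN. */
proc sql;
   /* r = y/x; for a mean, x is the sum of the weights */
   create table ratio as
   select sum(rweight*evborn) / sum(rweight) as r
   from test
   where evborn is not missing;

   /* Element level: weight w and deviation z_hij = y_hij - r (x_hij = 1) */
   create table elem as
   select t.v022, t.v021, t.rweight as w, t.evborn - q.r as z
   from test as t, ratio as q
   where t.evborn is not missing;

   /* PSU level: z_hi = y_hi - r*x_hi and weighted size x_hi */
   create table psu_tot as
   select v022, v021, sum(w*z) as z_hi, sum(w) as x_hi
   from elem
   group by v022, v021;

   /* Stratum level: m_h/(m_h-1) * (sum of squares - square of sum / m_h) */
   create table strat as
   select v022,
          (count(*)/(count(*)-1)) * (sum(z_hi*z_hi) - sum(z_hi)**2/count(*)) as v_z,
          (count(*)/(count(*)-1)) * (sum(x_hi*x_hi) - sum(x_hi)**2/count(*)) as v_x,
          sum(x_hi) as x_h
   from psu_tot
   group by v022;

   /* Combine: SE(r), CV(x), SE under SRS, DEFT and relative error */
   create table result as
   select q.r                as mean,
          s.se               as stderror,
          e.se_srs           as srs,
          s.se / e.se_srs    as deft,
          s.se / q.r         as relerror,
          s.cv_x             as cv_x
   from (select sqrt(sum(v_z)) / sum(x_h) as se,
                sqrt(sum(v_x)) / sum(x_h) as cv_x
         from strat) as s,
        (select sqrt( sum(w*z*z) / sum(w) / (count(*)-1) ) as se_srs
         from elem) as e,
        ratio as q;
quit;

proc print data=result;
run;
```

Keeping r in a one-row table rather than a macro variable avoids the rounding that occurs when PROC SQL writes a numeric value into a macro variable with its default format.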

Syntax and Input

The following statement, in which the key parameters are to be specified, invokes and executes the %STDERROR macro within a SAS program:

    %STDERROR (DATA=, WEIGHT=, PSU=, STRATA=, BYVAR=, ALLVARS=);

In this macro call, the DATA= parameter names the input data set, which should contain all variables for which the sampling error is requested and any other variables identified in the WEIGHT=, PSU=, STRATA=, and BYVAR= parameters. The WEIGHT= parameter identifies the weight variable. The PSU= parameter specifies the primary sampling unit to which each ultimate cluster in the sample belongs, whereas STRATA= names the stratification variable to be used for estimating sampling errors. If sampling errors are to be estimated for subclasses, the BYVAR= parameter must name the variable defining the subclasses. All variables for which sampling errors are to be computed should be listed, separated by spaces, in the ALLVARS= parameter. All parameters except BYVAR=, which is optional, are required for the macro to work properly. If subclass estimates are requested, the macro produces estimates for all subclasses as well as for the whole sample.

Output

%STDERROR creates and prints a SAS data set containing the following results for every variable for which the sampling error was requested: the coefficient of variation of PSU sizes, the variable name, the items listed below, and the lower and upper bounds of the 95% confidence interval.

    LABEL     Variable label, if any. The maximum length is 40 characters including spaces.
    TYPE      Type of statistic (mean or proportion)
    MEAN      Weighted sample mean or proportion
    STDERROR  Sampling error of the sample mean or proportion
    N_UNWGT   Unweighted number of cases used in the calculations
    N_WGT     Weighted number of cases used in the calculations
    SRS       Sampling error of the mean or proportion under simple random sampling
    DEFT      Design effect
    RELERROR  Relative error of the sample mean or proportion

The resulting estimates were compared with estimates obtained through the Taylor linearization method used in SUDAAN and in the sampling error module of the Integrated System for Survey Analysis (ISSA). All three programs produced the same estimates.

Examples

The examples in this paper use the data set for a country in which a Demographic and Health Survey (DHS) was conducted. The survey objectives include the estimation of the prevalence of contraceptive use in women 15-49 years old, and of fertility and childhood mortality rates. For the purpose of the illustration, the following selected variables are used:

    V102  type of residence (urban, rural)
    V106  highest education level attended
    V201  total number of children ever born to the woman
    V502  marital status
    V613  ideal number of children

A sample of 7060 women with completed interviews was obtained using a stratified two-stage design. In the first stage, 270 PSUs were selected. A complete listing of households within the selected PSUs was carried out. The lists of households obtained served as the sampling frame for the selection of households in the second stage. All eligible women identified in the selected households were included in the sample. The sample take within each PSU corresponds to the ultimate cluster.

Categorical variables need to be recoded in order to estimate the proportions that are of interest to the survey. The following statements create the data set TEST that will contain the new variables to be used in the macro call.

    /***** RECODINGS ******/
    DATA TEST;
       SET MDIR;
       RWEIGHT = V005/1000000;                              /* weight variable */
       EVBORN = V201;                                       /* children ever born */
       IF V102 = 1 THEN URBAN = 1; ELSE URBAN = 0;          /* 1 = urban residence */
       IF V106 IN (2,3) THEN EDUC = 1; ELSE EDUC = 0;       /* 1 = secondary or higher */
       IF 0 <= V613 <= 45 THEN IDEAL = V613; ELSE IDEAL = .; /* other codes set to missing */
       IF V502 = 1 THEN CURMAR = 1; ELSE CURMAR = 0;        /* 1 = currently married */
       LABEL URBAN  = 'Urban residence'
             EDUC   = 'With secondary education or higher'
             CURMAR = 'Currently married (in union)'
             EVBORN = 'Children ever born'
             IDEAL  = 'Ideal number of children';
    RUN;

Example 1: Sampling errors for subclasses are not requested.

    %INCLUDE 'C:\MEANS\STDERROR.SAS';
    %STDERROR (DATA=test, WEIGHT=rweight, PSU=v021, STRATA=v022,
               ALLVARS=urban educ curmar evborn ideal);

In this example, the %INCLUDE statement brings into the SAS program the %STDERROR macro, which is stored in the directory 'c:\means'. Sampling errors are produced when the macro call %STDERROR is executed.

Example 2: Sampling errors for subclasses are requested.

    %INCLUDE 'C:\MEANS\STDERROR.SAS';
    %STDERROR (DATA=test, WEIGHT=rweight, PSU=v021, STRATA=v022, BYVAR=v102,
               ALLVARS=urban educ curmar evborn ideal);

The output files for these examples are shown in the annex.

Acknowledgments

We are grateful to Pedro Saavedra for his comments and suggestions.

References

Verma, V., Pearce, M. (1986), Clusters, A Package Program for the Computation of Sampling Errors for Clustered Samples, Version 3.0.

DHS-III Basic Documentation Number 6 (1996), Sampling Manual.

Authors

Mamadou Thiam, Demographic and Health Surveys program, Macro International, Inc., 11785 Beltsville Drive, Calverton, MD 20705. Phone (301) 572 0495, Fax (301) 0572 0999. E-mail: mthiam@macroint.com

Alfredo Aliaga, Demographic and Health Surveys program, Macro International, Inc., 11785 Beltsville Drive, Calverton, MD 20705. Phone (301) 572 0940, Fax (301) 0572 0999. E-mail: aliaga@macroint.com

Output 1: Sampling errors for subclasses are not requested

COEFFICIENT OF VARIATION FOR CLUSTER SIZES: ENTIRE SAMPLE

    CURMAR  0.029
    EDUC    0.029
    EVBORN  0.029
    IDEAL   0.029
    URBAN   0.029

SAMPLING ERRORS: ENTIRE SAMPLE

    Variable  LABEL                               TYPE     MEAN  STDERROR  N_UNWGT  N_WGT    SRS   DEFT  RELERROR  Lower 95%  Upper 95%
    CURMAR    Currently married (in union)                0.628     0.008     7060   7060  0.006  1.328     0.012      0.613      0.643
    EDUC      With secondary education or higher          0.269     0.012     7060   7060  0.005  2.310     0.045      0.244      0.293
    EVBORN    Children ever born                          3.215     0.050     7060   7060  0.038  1.308     0.016      3.115      3.315
    IDEAL     Ideal number of children                    5.306     0.082     6447   6358  0.035  2.303     0.015      5.142      5.469
    URBAN     Urban residence                             0.281     0.011     7060   7060  0.005  2.007     0.038      0.260      0.303

Output 2: Sampling errors for subclasses are requested

COEFFICIENT OF VARIATION FOR CLUSTER SIZES: ENTIRE SAMPLE

    CURMAR  0.029
    EDUC    0.029
    EVBORN  0.029
    IDEAL   0.029
    URBAN   0.029

SAMPLING ERRORS: ENTIRE SAMPLE

    Variable  LABEL                               TYPE     MEAN  STDERROR  N_UNWGT  N_WGT    SRS   DEFT  RELERROR  Lower 95%  Upper 95%
    CURMAR    Currently married (in union)                0.628     0.008     7060   7060  0.006  1.328     0.012      0.613      0.643
    EDUC      With secondary education or higher          0.269     0.012     7060   7060  0.005  2.310     0.045      0.244      0.293
    EVBORN    Children ever born                          3.215     0.050     7060   7060  0.038  1.308     0.016      3.115      3.315
    IDEAL     Ideal number of children                    5.306     0.082     6447   6358  0.035  2.303     0.015      5.142      5.469
    URBAN     Urban residence                             0.281     0.011     7060   7060  0.005  2.007     0.038      0.260      0.303

COEFFICIENT OF VARIATION FOR SAMPLING UNIT SIZES: SUBCLASSES

V102=1

    CURMAR  0.038
    EDUC    0.038
    EVBORN  0.038
    IDEAL   0.039
    URBAN   0.038

V102=2

    CURMAR  0.037
    EDUC    0.037
    EVBORN  0.037
    IDEAL   0.037
    URBAN   0.037

SAMPLING ERRORS: SUBCLASSES

V102=1

    Variable  LABEL                               TYPE     MEAN  STDERROR  N_UNWGT  N_WGT    SRS   DEFT  RELERROR  Lower 95%  Upper 95%
    CURMAR    Currently married (in union)                0.570     0.015     2376   1987  0.010  1.450     0.026      0.540      0.599
    EDUC      With secondary education or higher          0.526     0.031     2376   1987  0.010  3.043     0.059      0.463      0.588
    EVBORN    Children ever born                          2.496     0.070     2376   1987  0.057  1.225     0.028      2.356      2.636
    IDEAL     Ideal number of children                    4.181     0.115     2255   1857  0.046  2.469     0.027      3.951      4.410
    URBAN     Urban residence                             1.000     0.000     2376   1987  0.000             0.000      1.000      1.000

V102=2

    Variable  LABEL                               TYPE     MEAN  STDERROR  N_UNWGT  N_WGT    SRS   DEFT  RELERROR  Lower 95%  Upper 95%
    CURMAR    Currently married (in union)                0.651     0.009     4684   5073  0.007  1.314     0.014      0.633      0.669
    EDUC      With secondary education or higher          0.168     0.010     4684   5073  0.005  1.884     0.061      0.147      0.189
    EVBORN    Children ever born                          3.496     0.061     4684   5073  0.048  1.260     0.017      3.374      3.619
    IDEAL     Ideal number of children                    5.770     0.105     4192   4501  0.046  2.306     0.018      5.560      5.980
    URBAN     Urban residence                             0.000     0.000     4684   5073  0.000                        0.000      0.000
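
Figures such as those above could, in principle, be cross-checked against SAS's own survey procedure where it is available. The sketch below assumes a SAS/STAT release that includes PROC SURVEYMEANS and reuses the data set and variable names from the examples; it is not part of %STDERROR, and small differences in confidence limits or in the treatment of subclasses should be expected.

```sas
/* Cross-check sketch with PROC SURVEYMEANS; not part of %STDERROR.       */
/* Taylor linearization is the procedure's default variance method, and   */
/* no finite population correction is applied unless RATE= or TOTAL= is   */
/* specified.                                                             */
proc surveymeans data=test mean stderr clm deff;
   strata  v022;      /* strata used for variance estimation */
   cluster v021;      /* primary sampling units              */
   weight  rweight;   /* sample weight                       */
   var     urban educ curmar evborn ideal;
   domain  v102;      /* subclasses: type of residence       */
run;
```

PROC SURVEYMEANS reports the design effect as a ratio of variances (DEFF), so its square root is the quantity comparable to the DEFT column printed by %STDERROR.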