Machine Made Sampling Designs: Applying Machine Learning Methods for Generating Stratified Sampling Designs

Similar documents
gender mains treaming in Polis h practice

A L A BA M A L A W R E V IE W

EKOLOGIE EN SYSTEMATIEK. T h is p a p e r n o t to be c i t e d w ith o u t p r i o r r e f e r e n c e to th e a u th o r. PRIMARY PRODUCTIVITY.

STEEL PIPE NIPPLE BLACK AND GALVANIZED

GALLUP NEWS SERVICE JUNE WAVE 1

c. What is the average rate of change of f on the interval [, ]? Answer: d. What is a local minimum value of f? Answer: 5 e. On what interval(s) is f

Class Diagrams. CSC 440/540: Software Engineering Slide #1

Stochastic calculus for summable processes 1

MN 400: Research Methods. CHAPTER 7 Sample Design

Use precise language and domain-specific vocabulary to inform about or explain the topic. CCSS.ELA-LITERACY.WHST D

Software Architecture. CSC 440: Software Engineering Slide #1

Photo. EPRI s Power System and Railroad Electromagnetic Compatibility Handbook

3. When a researcher wants to identify particular types of cases for in-depth investigation; purpose less to generalize to larger population than to g

OBESITY AND LOCATION IN MARION COUNTY, INDIANA MIDWEST STUDENT SUMMIT, APRIL Samantha Snyder, Purdue University

Introduction to Survey Data Analysis

APPLICATION INSTRUC TIONS FOR THE

Sampling. Sampling. Sampling. Sampling. Population. Sample. Sampling unit

The Accuracy of Small Area Sampling of Wireless Telephone Numbers

Grain Reserves, Volatility and the WTO

OUTBREAK INVESTIGATION AND SPATIAL ANALYSIS OF SURVEILLANCE DATA: CLUSTER DATA ANALYSIS. Preliminary Programme / Draft_2. Serbia, Mai 2013

Sample size and Sampling strategy

LU N C H IN C LU D E D

Part 3: Inferential Statistics

C o r p o r a t e l i f e i n A n c i e n t I n d i a e x p r e s s e d i t s e l f

CROSS SECTIONAL STUDY & SAMPLING METHOD

Module 16. Sampling and Sampling Distributions: Random Sampling, Non Random Sampling

Module 4 Approaches to Sampling. Georgia Kayser, PhD The Water Institute

New Developments in Econometrics Lecture 9: Stratified Sampling

Cook Islands Household Expenditure Survey

Lesson Ten. What role does energy play in chemical reactions? Grade 8. Science. 90 minutes ENGLISH LANGUAGE ARTS

heliozoan Zoo flagellated holotrichs peritrichs hypotrichs Euplots, Aspidisca Amoeba Thecamoeba Pleuromonas Bodo, Monosiga

Sampling. General introduction to sampling methods in epidemiology and some applications to food microbiology study October Hanoi

S U E K E AY S S H A R O N T IM B E R W IN D M A R T Z -PA U L L IN. Carlisle Franklin Springboro. Clearcreek TWP. Middletown. Turtlecreek TWP.

Σ x i. Sigma Notation

Detroit Area Study November, 1956 University of -Michigan #1176

International Atomic Energy Agency Learning programme: Radon gas. Module 4: Developing and Implementing a Representative Indoor Radon Survey

APPLICATION INSTRUCTIONS FOR THE

M a n a g e m e n t o f H y d ra u lic F ra c tu rin g D a ta

Neighborhood social characteristics and chronic disease outcomes: does the geographic scale of neighborhood matter? Malia Jones

600 Billy Smith Road, Athens, VT

APPLICATION INSTRUC TIONS FOR THE

Information System Desig

MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION

Functional pottery [slide]

Guilty of committing ecological fallacy?

Georgia Kayser, PhD. Module 4 Approaches to Sampling. Hello and Welcome to Monitoring Evaluation and Learning: Approaches to Sampling.

Sampling Theory in Statistics Explained - SSC CGL Tier II Notes in PDF

Certificate Sound reduction of building elements

Lecturer: Dr. Adote Anum, Dept. of Psychology Contact Information:

Transaction Cost Economics of Port Performance: A Composite Frontier Analysis

Methodological Framework for the Metropolitan Power Diffusion Index

Gov 2002: 4. Observational Studies and Confounding

Acknowledgments xiii Preface xv. GIS Tutorial 1 Introducing GIS and health applications 1. What is GIS? 2

APPLICATION INSTRUCTIONS FOR THE

Sampling and Estimation in Agricultural Surveys

C ENTRA L PA RK 1,2 7 5 u nits M A RQU EE. PHASE I I 2 67 u nits u nits. PA RK PL AC E A PA RTM ENTS PHASE I 98 9 u nits H VI LL A SI ENNA O R

Exploring the Association Between Family Planning and Developing Telecommunications Infrastructure in Rural Peru

ARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2

Applications of GIS in Health Research. West Nile virus

You are allowed 3? sheets of notes and a calculator.

Medical GIS: New Uses of Mapping Technology in Public Health. Peter Hayward, PhD Department of Geography SUNY College at Oneonta

DATA DISAGGREGATION BY GEOGRAPHIC

Beechwood Music Department Staff

Texas Student Assessment Program. Student Data File Format for Student Registration and Precoding

Hayward, P. (2007). Mexican-American assimilation in U.S. metropolitan areas. The Pennsylvania Geographer, 45(1), 3-15.

Lecture 2: Repetition of probability theory and statistics

BALANCED HALF-SAMPLE REPLICATION WITH AGGREGATION UNITS


GIS Lecture 5: Spatial Data

What if we want to estimate the mean of w from an SS sample? Let non-overlapping, exhaustive groups, W g : g 1,...G. Random

7.2 P rodu c t L oad/u nload Sy stem s

Form and content. Iowa Research Online. University of Iowa. Ann A Rahim Khan University of Iowa. Theses and Dissertations

176 5 t h Fl oo r. 337 P o ly me r Ma te ri al s

Statistical Sampling for Munitions Response Projects A Layman s Refresher and Food for Thought. Tamir Klaff, P.Gp, PG

Sampling Techniques. Esra Akdeniz. February 9th, 2016

geographic patterns and processes are captured and represented using computer technologies

Part 7: Glossary Overview

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.

Inclusion of Non-Street Addresses in Cancer Cluster Analysis

M E T R O P O L IT A N A R E A S, U N ITED S T A T E S A N D R E G IO N A L S U M M A R IE S,

MOLINA HEALTHCARE, INC. (Exact name of registrant as specified in its charter)

Estimation of change in a rotation panel design

Exercise on Using Census Data UCSB, July 2006

2/7/2018. Module 4. Spatial Statistics. Point Patterns: Nearest Neighbor. Spatial Statistics. Point Patterns: Nearest Neighbor

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

AGRICULTURE SYLLABUS

Correlation and regression

The Joys of Geocoding (from a Spatial Statistician s Perspective)

CATAVASII LA NAȘTEREA DOMNULUI DUMNEZEU ȘI MÂNTUITORULUI NOSTRU, IISUS HRISTOS. CÂNTAREA I-A. Ήχος Πα. to os se e e na aș te e e slă ă ă vi i i i i

Purpose Study conducted to determine the needs of the health care workforce related to GIS use, incorporation and training.

Dynamic equality of opportunity. John E. Roemer Yale University. Burak Ünveren Yıldız Teknik Üniversitesi

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Lecture 24: Partial correlation, multiple regression, and correlation

REFUGEE AND FORCED MIGRATION STUDIES

Uganda - National Panel Survey

SPEW: Online Materials

Prepared by: DR. ROZIAH MOHD RASDI Faculty of Educational Studies Universiti Putra Malaysia

Chapter 7 Sampling Distributions

Joint Modeling of Longitudinal Item Response Data and Survival

Transcription:

Machine Made Sampling Designs: Applying Machine Learning Methods for Generating Stratified Sampling Designs Trent D. Buskirk, UMass Boston Center for Survey Research Todd Bear, University of Pittsburgh School of Public Health Jeff Bareham, Marketing Systems Group BIGSURV18 Barcelona October 25, 2018

Machines Vs. The Rest of Us 2

1 Background and Goals Study Goals Landline RDD Sampling Number Generation

Survey Background and Goals The Allegheny County Health Survey (ACHS) is a population based, telephone survey of health related behaviors for adults living in Allegheny County, PA. The survey is modelled after the Centers for Disease Control and Prevention s BRFSS More information about the survey can be found here: http:/ / b it.ly/ ACHS-20 16 Goal was to tie health information to political units (council districts defined by geopolitical boundaries) Sampling from 13 c o unc il d is tric ts would provide higher resolution than previous surveys that simply s tra tifie d b y RACE o r INCOME 4

Study Goals and Background Main Task: Develop a stratified RDD sampling plan to select households from the 13 council districts in Allegheny County, PA. 5

Allegheny County, PA 13 Council Districts each defined to have roughly the same population 6

2 Me thod s Clustering (not cluster sampling)

Our Approach Landline Numbers organized into 1K Banks Admin. data available for some landline numbers allowing identification of local geography (LLKGs) Council District assignments appended based on known boundaries Machine learning algorithm (k-means clustering) applied to determine strata 8

Our method in pictures 9

Telephone Number Generation in the U.S. Area Code Prefix 10K Bank Last 2 of Phone # (ABC)-DEF-GH XY 1000 Bank (1K Bank) 10

Distribution of 1K Ba n ks b y Nu m b e r o f LLKGs Landline Numbers with Known Geographies (LLKGs) 11

A closer look at the LLKG s 2,410 1K banks 507 with 5 or fewer LLKGs 1,903 1K banks 367,085 LLKGs 988 from eliminated 1K banks 366,097 LLKGs 12

Data used in k-means clustering τ 1,1 τ 1,2 τ 1,3 τ 1,13 τ 2,1 τ 2,2 τ 2,3 τ 2,13 τ 1903,1 τ 1903,2 τ 1903,3 τ 1903,13 τ i,j = proportion of LLKGs in 1K bank i within CD j ω i = total # of LLKGs in 1K bank i falling in CDs 13

Identifying the Clusters The k-means clustering algorithm optimizes groupings based on a within cluster SS criterion based on a multivariate set of covariates. We used a scree type plot created using results of a grid se arch for k possible cluste rs ranging from 2 to 12. 14

From Clusters to Strata Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 HD 1 HD 2 HD 3 HD 4 HD 12 HD 13 Stratum 1 Stratum? Stratum? Using counts of the LLKGs within each cell of the cross tabulation. 15

3 Results From Clusters to Strata

Results 17

Distrib u tion of LLKGs b y Cluster and Health District (HD) Cluster CD1 CD2 CD3 CD4 CD5 CD6 CD7 CD8 CD9 CD10 CD11 CD12 CD13 3 9 2 8 85 341 20,404 4 6 342 2 14 4,183 8 5 1,504 36 29 22,354 24,258 4,115 14 6 14 36 1,535 18,234 2,440 1 2 3 28-2 1,223 5 17 18,147 2 9 2 4 2 22,632 21,460 24,237 2,215 18 78 142 68 33 32 33 47 10,954 4 6 3 11 2 10 30 7 389 2,169 47 12,024 17 31 6 9 6 12 7 4 6 8,551 17,928 1,882 936 25 4 26 7 20 23 104 21 31 18 5,506 2,234 15 15,377 5,579 353 5,191 8-2 689-3 - 11,158 300-4 - - 1 18

Fina l Distrib u tion of LLKGs b y Stra ta a nd Cou nc il Distric t Council District --> CD4 CD5 CD6 CD12 CD1 CD2 CD3 CD7 CD8 CD9 CD10 CD11 CD13 LLKGs in Stratum 1 22439 24599 24519 22417 1513 38 37 18 12 356 38 1549 2448 LLKGs in Stratum 2 2245 68 1355 423 22669 21497 25081 25369 20936 22246 16398 17670 16207 Percentage of LLKGs in Council Districts Covered by Stratum 1 90.9% 99.7% 94.8% 98.1% 6.3% 0.2% 0.1% 0.1% 0.1% 1.6% 0.2% 8.1% 13.1% Percentage of LLKGs in Council Districts Covered by Stratum 2 9.1% 0.3% 5.2% 1.9% 93.7% 99.8% 99.9% 99.9% 99.9% 98.4% 99.8% 91.9% 86.9% 19

From Clusters to Strata Stratum 1 combined two of the 8 clusters to cover Counc il Dis tric ts 4, 5, 6 a nd 12 Stratum 2 combined the remaining 6 clusters to cover Counc il Distric ts: 1,2,3,7,8,9,10,11,13 The two strata were further partitioned into Listed LL and Unlisted LL substrata. 95.9% Estimated Coverage Rate 96.7% Estimated Coverage Rate 20

Universe and Sample Sizes & Yield Stratum Sub-stratum Landline Sample Universe Size Respondents Listed 122,858 9,277 1,155 (12%) 1 Unlisted 288,593 14,543 97 (1%) Overall 411,450 23,820 1,252 (5%) 2 Listed 292,867 23,660 2,217 (9%) Unlisted 733,151 39,310 248 (1%) Overall 1,026,018 62,970 2,465 (4%) Cell N/A 1,892,417 79,200 5,315 (7%) 21

Overall Sample Yield by Assigned Stratum 22

Geographic Coverage of Strata 23

Accuracy/ Coverage of Strata Overall Accuracy based on 3,684 Respondents on LL Frame Assigned to Stratum Actual Stratum 1 2 1 1192 49 2 62 2381 P(Assign to Stratum 1 Stratum 1 Resident) = 95.1% P(Assign to Stratum 2 Stratum 2 Resident) = 97.8% Overall Accuracy: Assigned vs. Actual Stratum = 97.0% 24

Accuracy/ Coverage of Strata: Liste d Overall Accuracy based on 3,341 Respondents on LL Frame from Listed Sub-Strata Assigned to Stratum Actual Stratum 1 2 1 1103 42 2 54 2142 P(Assign to Stratum 1 Stratum 1 Resident) = 95.3% P(Assign to Stratum 2 Stratum 2 Resident) = 98.1% Overall Accuracy: Assigned vs. Actual Stratum = 97.1% 25

Accuracy/ Coverage of Strata: Unliste d Overall Accuracy based on 3,341 Respondents on LL Frame from Listed Sub-Strata Assigned to Stratum Actual Stratum 1 2 1 89 7 2 8 239 P(Assign to Stratum 1 Stratum 1 Resident) = 91.8% P(Assign to Stratum 2 Stratum 2 Resident) = 97.2% Overall Accuracy: Assigned vs. Actual Stratum = 95.6% 26

4 Im p lic a tions What about misclassification and survey estimates? Percentage Ever told of Depression Percentage in Very Good or Excellent Health Mean BMI

Percentage Ever Told of Depression 1.6 pp 28

Percentage in Very Good/ Excellent He alth 2.1 pp 29

Ave ra g e Bod y Ma ss Ind e x (BMI) 0.2 pp 30

5 Discussion and Future Research Conclusions, Limitations and Next Steps

Discussion The number of clusters that are deemed optimal using the k-means algorithm and criterion may not be optimal for the number of sampling strata for theoretical/ estimation and practical/ field concerns. The variables available for clustering may not result in strata that are efficient from a sampling design perspective i.e. being close on the clustering variables may not imply homogeneity in strata on outcomes of interest, especially if multiple clusters need to be combined. 32

Limitations/ Future Work Using s im p le GIS info rm a tio n a nd la nd line lis te d number density we were able to successfully cover subsets of council districts whose geographies don t conform to 1K bank boundaries. Unfortunately, we were not able to separate or create RDD sampling frames that could cover each council district or smaller groups of them. We were also not able to apply this approach to the cellular frame because of limited g e o g ra p hic a l a uxilia ry info rm a tio n. Cons um e r c e ll s o urc e s with GIS info rm a tio n a re ra p id ly im p ro ving s o this m ig ht b e p o s s ib le in the future. 33

Thank You! tdbuskirk@gmail.com +1-781-964-4997 #trentbuskirk www.linkedin.com/in/tbuskirk/ 34