CRISP: Capture-Recapture Interactive Simulation Package

George Volichenko
Carnegie Mellon University, Pittsburgh, PA
gvoliche@andrew.cmu.edu
December 17, 2012

Contents
1 Executive Summary
2 Introduction
3 Methods
  3.1 Simulation Components
  3.2 Computing Issues
4 Results
5 Discussion/Future Work
6 Theory
  6.1 Petersen Estimator
  6.2 Horvitz-Thompson Style Estimator
  6.3 Petersen Estimator Bias

1 Executive Summary

Capture-recapture estimation is a technique commonly used to estimate population sizes without having to enumerate every single unit. This family of estimation methods relies heavily on a set of assumptions, most of which are not realistically satisfied, especially when estimating the population of a nation. I construct a simulation architecture that provides a very flexible way to study the adverse effects that violations of these assumptions can have on the final estimates. To demonstrate its functionality, two types of capture-recapture estimators, Petersen and Horvitz-Thompson, are compared. When capture probabilities vary as a function of some demographic variable (e.g. age), the Petersen estimator produces consistently inaccurate estimates, while a Horvitz-Thompson style estimator tackles this issue by adjusting for the covariate. More generally, the package can be used by Census Bureau employees and researchers alike to test the performance of various capture-recapture estimators given the specifics of their population of interest.

2 Introduction

Capture-recapture (CRC) estimation methods are widely used to estimate population sizes without resorting to the more expensive census enumeration. Alternatively, CRC can be used alongside the census to improve its accuracy. To compute such an estimate, at least two capture events are required. A good illustration of this method is attempting to estimate the size of a fish population in a pond. A researcher would cast a net, then count and mark all the fish caught. After a short period of time, the researcher goes back to the pond, casts the net again, and counts the number of fish that were previously marked as well as the ones that appear for the first time. One of the CRC estimators can then be applied to estimate the fish population size. The U.S. Census Bureau utilizes CRC methodology by essentially following a similar process in select areas of the country. The second capture event is called the CCM (Census Coverage Measurement), and its purpose is to test the accuracy of the national census figures, which correspond to the first capture event.

Traditional CRC estimators make strong assumptions about the population of interest:

1. Closed population. The true population to be enumerated stays fixed over time. This assumption excludes the possibility of births, deaths, and migration during the generation of lists (capture events).

2. Perfect matching. Every individual observed at least once requires a complete capture history. In the case of the CCM, people surveyed in the CCM (capture list 2) need to be matched with their records from the main census (list 1). There are numerous record linkage algorithms available; however, none of them guarantees that everyone's records will be perfectly matched.

3. Homogeneity. All the members of the target population are equally likely to be captured. The diversity of a national population makes this assumption seem like a crude oversimplification, and in this simulation a lot of attention is paid to breaking it by introducing covariates, such as age or race, for a more realistic heterogeneity structure.

4. List independence. The capture of an individual on one list has to be independent of that individual's capture on any other list. This assumption can be violated through various list effects, which are implemented in the package under the umbrella term correlation bias.

These and other assumptions are very often violated in real-life populations. As a result, the estimates and the corresponding confidence intervals can have questionable accuracy. I construct a simulation framework that provides insight into the sampling distributions of CRC estimators under a variety of data generation scenarios. The scenarios of most interest are naturally the ones where the assumptions are intentionally violated.

3 Methods

3.1 Simulation Components

CRISP is implemented in the R and C programming languages and consists of the following three main components:

1) Capture Probabilities

The two essential arguments that the user has to specify are the desired true population size (N) and the number of capture events, or lists (K). The N x K matrix of capture probabilities can then be generated. By default, all of these probabilities are set to 0.5; however, there are multiple arguments that can be specified to include a heterogeneity or dependence structure. First, the user specifies whether covariates should be present. These can come from the list of three main covariates (age, race, and gender) or can be user-defined.
The package provides a great deal of flexibility for defining covariates and, more importantly, for specifying how they affect the capture probabilities.
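To make this concrete, here is a minimal R sketch of how a covariate could drive the capture probabilities. The function name and the logistic age effect are illustrative assumptions, not the package's actual interface.

```r
# Minimal sketch (illustrative, not CRISP's actual API): build an N x K matrix
# of capture probabilities that depend on an age covariate.
set.seed(1)
N <- 10000                                 # true population size
K <- 2                                     # number of capture events (lists)

age <- sample(0:89, N, replace = TRUE)     # hypothetical age distribution

# Hypothetical covariate function: capture probability increases with age
age_effect <- function(age) plogis(-0.5 + 0.02 * age)

# Same covariate function for every list here; a different function (or a
# different set of arguments) could be used for each list instead
probs <- matrix(age_effect(age), nrow = N, ncol = K)
```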

The list independence assumption can be intentionally violated by including a correlation bias between lists. So far, the only type of list effect implemented is the behavioral effect, which makes the capture probability on each list contingent on the capture outcome of the previous list.

2) Captures

After generating capture probabilities for all of the individuals on every list, the actual capture process can be simulated. Essentially, there are N x K Bernoulli variables with known probabilities, and the result is an N x K indicator matrix of 0s and 1s, where 1 means that the particular individual was captured on that list.

3) Estimator function

The top-level estimator function carries out the previous two steps and then applies the selected estimator to the capture matrix to produce an estimate. The built-in estimators so far are the Petersen and Horvitz-Thompson estimators; however, any other estimator function should be compatible. For our purposes it is important to understand the variability of the estimates, so this top-level function includes functionality for replication. It is also possible to apply the estimator function over a range of true population sizes. In that case, the population attributes (covariates) are generated incrementally: when going from a population of 5,000 to 6,000, the simulation only generates new covariate values for the extra 1,000 individuals while keeping all the previous ones.
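Here is a minimal R sketch of these last two components, capture generation and a replicated estimator. The function names are illustrative and not the package's actual interface, and `probs` refers to the probability matrix from the previous sketch.

```r
# Minimal sketch (illustrative): step 2) draws the N x K Bernoulli capture
# matrix; step 3) applies an estimator, replicated to study its variability.
# `probs` is the N x K probability matrix from the previous sketch.
simulate_captures <- function(probs) {
  matrix(rbinom(length(probs), size = 1, prob = probs),
         nrow = nrow(probs), ncol = ncol(probs))
}

# Two-list Petersen estimate computed directly from a capture matrix
petersen <- function(caps) {
  n1  <- sum(caps[, 1])                # captured on list 1
  n2  <- sum(caps[, 2])                # captured on list 2
  n11 <- sum(caps[, 1] & caps[, 2])    # captured on both lists
  n1 * n2 / n11
}

# Replicate the whole capture-and-estimate cycle
estimates <- replicate(500, petersen(simulate_captures(probs)))
summary(estimates / nrow(probs))       # ideally centred near 1
```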

3.2 Computing Issues

After implementing all of the functionality in R, the run times were quite unsatisfactory, especially when the desired population is large and multiple covariates need to be generated. To alleviate the issue, the capture matrix generation was implemented in the C programming language. The run time improvement was very substantial (see Figure 1). C handles iterative procedures much more efficiently because it is a compiled language, as opposed to R, which is interpreted.

[Figure 1: Comparison of run times of the R and C scripts for capture matrix generation as a function of the population size N.]

Further Improvements

As a result of implementing these simulation components, especially given that two programming languages are interfaced, the code contained numerous unnecessary or excessive parts. Most of these were violations of general guidelines of structured programming. For instance, for default covariates like age, race, and gender, there were default probability-generating functions implemented as part of the top-level function that produces the main capture matrix itself. I realize now that it makes much more sense to implement these as separate functions that can be called from within the top-level function when needed (i.e., when the user does not specify any other covariate function). This reflects the principles of structured programming, where a hierarchy of functions is maintained for simplicity and flexibility.

Another big improvement addressed the functionality of the simulation package. The original intention was to make the package as flexible as possible, so that the user can study various heterogeneity structures of interest. For example, the user may specify some covariate function (for generating capture probabilities) but also supply a separate set of arguments for every list. Alternatively, a different function can be specified for every list. The bottom line is that the simulation is intended to accommodate any combination of covariates and the ways they define capture probabilities.

For convenience, the estimator functions are called from within the top-level function. The user simply specifies which estimates are of interest by supplying the corresponding function names. This can be any subset of the built-in estimators or external estimator functions (e.g. the log-linear models from the Rcapture package).

Large Populations

As mentioned previously, capture-recapture estimation is often used to estimate human populations, such as that of a whole nation (e.g. the national census). Therefore, the package needs to cope well with very large population sizes. Of course, there is a general limitation imposed by the computer's memory capacity; however, I found that even with very limited memory, code efficiency can be improved to achieve consistently low run times. In R, matrices are stored in memory by column: a matrix is essentially a vector of length N times K (N rows, K columns). The way I originally set up the R-C interface involved feeding the probability matrix into the C script and then accessing it by rows to generate K capture outcomes for each person in the population. This is the way a human would naturally think about the problem, but it is inefficient from a computational perspective. When the cells of a matrix created in R are accessed by row, they are not adjacent in memory and may even sit in different memory blocks (which are used for faster access of related memory elements). I was not aware of this issue until I tried running the simulation with larger population sizes (Figure 2).

[Figure 2: Run times of the C script by population size N: the original by-row access (hollow points) versus the probability matrix unwrapped by column in R (solid points).]

The hollow points on the plot show that beyond approximately 5 million, there is apparent stochasticity in the run times, which becomes an even more serious issue for populations of over 10 million. My solution was to simplify the C script by supplying it the probability matrix as a vector already unwrapped by columns in R. The C script does not put it back in matrix form, but simply produces a vector of identical length with 0s and 1s for the capture events. The performance of this script is shown by the solid black points and has a clear linear trend without any anomalies.

4 Results

The analysis below is just an example of the kind of insight that CRISP provides. It compares the two implemented estimators; however, other estimators can be studied as well. Using the graphing tool that was also implemented as part of the package, I plotted the replicated values of the Petersen estimator for a range of population sizes from 1,000 to 50,000 (Figure 3). The confidence intervals and the mean of the replicated estimates are also included. The vertical axis is the ratio of the estimated population size to the true population size; ideally, this should be 1, which is visualized by the horizontal dashed line. In this most basic scenario, where no assumptions are violated, the Petersen estimator performs fairly well.

[Figure 3: Petersen estimator applied to a range of homogeneous populations (no covariates); Nhat/N plotted against population size, with replicate estimates, their mean, 95% CI, and the true-population reference line.]

The plot becomes much more interesting when a heterogeneity structure is introduced by including the age covariate (ages are sampled based on census demographic information):

[Figure 4: Petersen estimator applied to heterogeneous populations (age covariate); Nhat/N plotted against population size, with replicate estimates, their mean, 95% CI, the true-population reference line, and N = 14,000 marked.]

Heterogeneity resulted in major underestimation by the Petersen estimator. It is apparent that once the true population size exceeds about 14,000, even the upper confidence band curves below the true Nhat/N = 1 line. This was an expected result: homogeneity is one of the main assumptions of the Petersen estimator, and since it does not model the covariate in any way, the resulting estimates are of questionable quality. The reason the covariate scenario shows smaller variance in the estimates is the nature of the age heterogeneity structure: the average capture probability in this case was around 0.7, substantially higher than the 0.5 of the default case.

After exploring the properties of the Petersen estimator in detail (see the Theory section), I implemented a more advanced CRC estimator, the Horvitz-Thompson style estimator. The idea is that, by modeling the covariates (see Theory), it provides more accurate estimates. To tackle the issue of heterogeneity in capture probabilities, the U.S. Census Bureau uses a combination of the Petersen estimator with post-stratification and logistic regression; however, I chose to explore the appropriateness of a Horvitz-Thompson style estimator. Here is a visualization of how this estimator solves the underestimation problem in heterogeneous populations:

[Figure 5: Horvitz-Thompson style estimator applied to heterogeneous populations (age covariate); Nhat/N plotted against population size, with replicate estimates, their mean, 95% CI, and the true-population reference line.]
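For a rough illustration of the computation behind this estimator (the formulas are given in Section 6.2), here is a minimal R sketch. It assumes the per-individual cell probabilities have already been estimated, for example by a GAM fitted to the observed capture histories; the function name is illustrative and not the package's actual interface.

```r
# Minimal sketch (illustrative, not CRISP's code): given estimated cell
# probabilities pi10, pi01, pi11 for each individual captured at least once,
# estimate the probability of being missed by both lists under list
# independence given the covariate, and sum reciprocal detection probabilities.
ht_style_estimate <- function(pi10, pi01, pi11) {
  pi00 <- pi10 * pi01 / pi11                       # implied "missed by both"
  phi  <- (pi10 + pi01 + pi11) / (pi10 + pi01 + pi11 + pi00)
  sum(1 / phi)                                     # Horvitz-Thompson style N-hat
}
```

As a sanity check, with two homogeneous independent lists of capture probability 0.5, every captured individual gets phi = 0.75, so the estimate is simply the number of captured individuals divided by 0.75.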

5 Discussion/Future Work

The simulation framework described here can be used by researchers, as well as Census Bureau employees, to explore the properties of capture-recapture estimators. I used it to look into the performance of the Petersen estimator in the most basic scenario and in cases where its assumptions are violated. It turns out to produce severe underestimates in the presence of a heterogeneity structure in the capture probabilities. A Horvitz-Thompson style estimator, however, provides much better estimates in the presence of a covariate.

Another advantage of this simulation architecture is that it can be used by anyone who needs to simulate a large population with some set of demographic data (age, race, gender, etc.). For example, an epidemiologist could be interested in simulating a large population with disease contraction probabilities instead of capture probabilities. This could be done easily and quickly using CRISP. With some additional work on formalizing the code and writing documentation, this could be turned into a complete R package available for download from CRAN. My next immediate steps will include:

1. Cleaning up and organizing the code.
2. Implementing more iterative procedures in C.
3. Including more functionality, such as imperfect matching and more list effects.
4. Using elegant ggplot2 graphics.
5. Preparing software documentation.

6 Theory

The table below defines the notation for the two-list case. In the original document, the quantities that can be measured are set in bold (non-italic); since individuals missed by both lists are never observed, N_22 and therefore N_2+, N_+2, and the total N are unknown.

                        Enumerated in List 2?
Enumerated in List 1?   YES     NO      Total
YES                     N_11    N_12    N_1+
NO                      N_21    N_22    N_2+
Total                   N_+1    N_+2    N

6.1 Petersen Estimator

The Petersen estimator uses the following approximation:

\frac{N_{11}}{N_{1+}} \approx \frac{N_{+1}}{N}    (1)

Then, the estimate for the population size is:

\hat{N} = N_{1+} \left( \frac{N_{+1}}{N_{11}} \right)    (2)
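As a quick worked example with made-up counts: if list 1 enumerates N_{1+} = 400 individuals, list 2 enumerates N_{+1} = 300, and N_{11} = 60 appear on both, then

\hat{N} = 400 \left( \frac{300}{60} \right) = 2000.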

6.2 Horvitz-Thompson Style Estimator

The Horvitz-Thompson (HT) style estimator models the probabilities of the observable capture patterns (\pi_{10}, \pi_{01}, \pi_{11}) as a function of the covariate x in order to find the detection probability of each individual. A generalized additive model (GAM) is used to approximate the covariate function. The estimate can then be computed by summing the reciprocals of the detection probabilities of everyone captured at least once:

\pi_{00}(x) = \frac{\pi_{10}(x)\,\pi_{01}(x)}{\pi_{11}(x)}, \qquad
\hat{\phi}_i = \frac{\pi_{10}(x_i) + \pi_{01}(x_i) + \pi_{11}(x_i)}{\pi_{10}(x_i) + \pi_{01}(x_i) + \pi_{11}(x_i) + \pi_{00}(x_i)}, \qquad
\hat{N} = \sum_{i:\, m_i = 1} \frac{1}{\hat{\phi}_i}    (3)

where m_i = 1 indicates that individual i was captured on at least one list.

6.3 Petersen Estimator Bias

An important characteristic of any estimator is its bias. I was interested in finding out whether the Petersen estimator is biased when its assumptions are satisfied. The issue with doing this in a purely analytical way is that the expected value of the Petersen estimator, \hat{N} = N_{1+} (N_{+1} / N_{11}), is not defined: the denominator, which is the number of people appearing on both lists, can potentially be 0, in which case the estimator becomes infinite. Even though such an outcome is highly unlikely, it still has to be considered, and thus I cannot use a purely analytical approach to find the expected value of the Petersen estimator.

Instead, a quasi-analytical approach was used. The expectation of a random variable is the sum of all the values it can take, weighted by the respective probabilities of those outcomes:

E[f(n)] = \sum_{n} f(n) P(n) = \sum_{n:\, n_{11} \neq 0} \frac{(n_{11} + n_{10})(n_{11} + n_{01})}{n_{11}} \, P(n = \{n_{11}, n_{10}, n_{01}, n_{00}\} \mid n_{11} \neq 0)    (4)

The first factor inside the sum is the same Petersen estimator as before, except expressed in terms of disjoint events (previously both N_{1+} and N_{+1} contained the entirety of N_{11}). To find the expected value, we sum the product of the value of the Petersen estimator and the respective probability across all possible outcomes for a given population size. The probability term can be calculated using the multinomial formula:

P(n) = \binom{N}{n_{11},\, n_{10},\, n_{01},\, n_{00}} P_{11}^{n_{11}} P_{10}^{n_{10}} P_{01}^{n_{01}} P_{00}^{n_{00}}
     = \frac{N!}{n_{11}!\, n_{10}!\, n_{01}!\, n_{00}!} P_{11}^{n_{11}} P_{10}^{n_{10}} P_{01}^{n_{01}} P_{00}^{n_{00}}    (5)
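This quasi-analytical expectation is easy to evaluate numerically for small N. Below is a minimal R sketch of the computation, assuming two independent lists with capture probability 0.5 each (the package default); the function name is illustrative, and evaluating the Chapman modification from equation (6) below over the same conditional distribution is my own assumption about how the two curves were compared.

```r
# Minimal sketch (illustrative, not the package's code) of the quasi-analytical
# bias computation in equations (4)-(5), assuming two independent lists with
# capture probabilities p1 and p2. Returns E[N-hat]/N for the Petersen
# estimator and the Chapman modification, both conditional on n11 > 0.
petersen_bias <- function(N, p1 = 0.5, p2 = 0.5) {
  # Multinomial cell probabilities for the four capture histories
  P <- c(p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2))

  # Enumerate all outcomes with n11 >= 1 (n00 is determined by the others)
  grid <- expand.grid(n11 = 1:N, n10 = 0:(N - 1), n01 = 0:(N - 1))
  grid <- grid[grid$n11 + grid$n10 + grid$n01 <= N, ]
  n00  <- N - (grid$n11 + grid$n10 + grid$n01)

  prob <- apply(cbind(grid, n00), 1, dmultinom, prob = P)
  prob <- prob / sum(prob)   # condition on n11 > 0

  petersen <- (grid$n11 + grid$n10) * (grid$n11 + grid$n01) / grid$n11
  chapman  <- (grid$n11 + grid$n10 + 1) * (grid$n11 + grid$n01 + 1) /
              (grid$n11 + 1) - 1

  c(petersen = sum(petersen * prob) / N,
    chapman  = sum(chapman  * prob) / N)
}

petersen_bias(50)   # Petersen slightly above 1 (upward bias), Chapman closer to 1
```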

The bias curve for population sizes below 100 is shown below.

[Figure: Petersen estimator bias, E[N-hat]/N plotted against the true N (20 to 100) for the Petersen estimator and the Chapman modification.]

The vertical axis is the ratio of the computed expectation of the Petersen estimator to the true population size. We can see that the regular Petersen estimator (green line) is biased upwards for small populations but is asymptotically unbiased, as seen from the bias curve approaching the horizontal dotted line. The blue line represents the bias of a modification of the Petersen estimator proposed by D. G. Chapman in 1951. It approaches the unbiased line much more rapidly, simply by adding 1 to every factor of the product and subtracting 1 from the result:

\hat{N} = (N_{1+} + 1) \left( \frac{N_{+1} + 1}{N_{11} + 1} \right) - 1    (6)

Acknowledgements

I would like to acknowledge Bill Eddy, the U.S. Census Bureau, and the Statistics Department at Carnegie Mellon University for the opportunity to perform this research. I would also like to thank Zach Kurtz for being very helpful and encouraging.