Population size estimation by means of capture-recapture with special emphasis on heterogeneity using Chao and Zelterman bounds


Dankmar Böhning
Quantitative Biology and Applied Statistics, School of Biological Sciences, University of Reading
Ege University, Izmir, June 3, 2008 (prepared: June 2, 2008)

Outline
- Introduction
- Some Applications
- Ways to a Realistic Solution
- Generalized Chao Bounds and a Monotonicity Property
- Zelterman Estimation
- Applications
- Outlook
  - Zelterman as MLE
  - Zelterman can be extended to case data
  - Zelterman extended to higher counts
  - Zelterman extended: truncation or censoring?
  - Some simulation results

Introduction: Formulation of the Problem
- a population has $N$ units, of which $n$ are identified by some mechanism (trap, register, police database, ...)
- the probability of identifying a unit is $(1 - p_0)$, so that
$$N = (1 - p_0)N + p_0 N = n + p_0 N$$
- and the Horvitz-Thompson estimator follows:
$$\hat N = \frac{n}{1 - p_0}$$

Introduction: Formulation of the Problem
- $\hat N = n / (1 - p_0)$ is fine, BUT $p_0$ is assumed to be known
- usually an estimate of $p_0$ is required

Introduction: Formulation of the Problem as Frequencies of Frequencies
- a common setting for estimating $p_0$ is the Frequencies of Frequencies:
- the identifying mechanism provides a count $Y$ of repeated identifications (w.r.t. a reference period), but zero counts are not observed
- this leads to frequencies $f_1, f_2, \dots, f_m$, where $m$ is the largest observed count and $f_j$ is the frequency of units with exactly $j$ counts

Introduction: Formulation of the Problem as Frequencies of Frequencies
- we have: $f_0$ is not observed
- recall that $N = f_0 + n = f_0 + f_1 + f_2 + \dots + f_m$, so that $\hat f_0$ leads to $\hat N$


Some Applications: McKendrick's Data on Cholera in India
- McKendrick (1926) had the following frequency of households with $j$ cases of cholera in an Indian village:
[Table: $f_0$, $f_1$, $f_2$, $f_3$, $f_4$, $n$; values not preserved]
- How many households $f_0$ are affected by the epidemic, but have no cases?

Some Applications: Matthews's Data on Estimating the Dystrophin Density in the Human Muscle
- Cullen et al. (1990) attempted to locate dystrophin, a gene product of possible importance in muscular dystrophies, within the muscle fibres of biopsy specimens taken from normal patients
- units (epitopes) of dystrophin cannot be detected by the electron microscope until they have been labelled by a suitable electron-dense substance; the technique used gold-conjugated antibodies which adhere to the dystrophin

Some Applications: Matthews's Data on Estimating the Dystrophin Density in the Human Muscle
- not all units are labelled, and it is important to account for all labelled and unlabelled units to achieve an unbiased estimate of the dystrophin density
- more than one antibody molecule may attach to a dystrophin unit; what is observed is a count variable $Y$ counting the number of antibody molecules on each dystrophin unit
- $Y = 0$ means that the unit is unlabelled and not observed

Some Applications: Matthews's Data on Estimating the Dystrophin Density in the Human Muscle
- the frequency distribution of the antibody count attached to a dystrophin unit:
[Table: $f_0$, $f_1$, $f_2$, $f_3$, $f_4$, $f_5$, $n$; values not preserved]

Some Applications: Del Rio Vilas's Data on Estimating Hidden Scrapie in Great Britain 2005
- sheep are kept in holdings in Great Britain (and elsewhere)
- the occurrence of scrapie is monitored in the Compulsory Scrapie Flocks Scheme (CSFS), summarizing abattoir survey, stock survey and the statutory reporting of clinical cases
- CSFS established since 2004
- the frequency distribution of the scrapie count within each holding for the year 2005:
[Table: $f_0$, $f_1$, ..., $f_8$, $n$; values not preserved]

Some Applications: Hser's Data on Estimating Hidden Intravenous Drug Users in Los Angeles 1989
- intravenous drug users in L.A. county were entered into the California Drug Abuse Data System (CAL-DADS)
- the frequency distribution of the episode count per drug user for the year 1989 (only partly preserved):

| $f_0$ | $f_1$ | $f_2$ | $f_3$ | $f_4$ | $f_5$ to $f_{12}$ | $n$ |
|---|---|---|---|---|---|---|
| - | 11,982 | 3,893 | 1,959 | 1,… | … | 20,198 |

Some Applications: Drakos's Data on Estimating Hidden Transnational Terrorist Activity
- data on terrorism are provided by various databases including the RAND terrorism chronology, the terrorism indictment and DFI International research on terrorist organizations
- the data cover 153 countries and the period …
- terrorism is violence or the threat of violence, calculated to create an atmosphere of fear and alarm

Some Applications: Drakos's Data on Estimating Hidden Transnational Terrorist Activity
- the frequency distribution (Drakos 2007) of the count of transnational terrorist activity $Y_{it}$ in country $i$ and year $t$:
[Table: $f_0$, $f_1$, ..., $f_{136}$, $n$; values not preserved]
- similar to the McKendrick data there is an observed $f_0 = 1{,}357$
- however, it is thought that there is a hidden number of terrorist activities, which is of interest to be estimated
- estimate $f_0$, the frequency of periods and countries with unrecorded terrorist activities


Ways to a Realistic Solution: Formulation of the Problem and the Idea for its Solution
- suppose we can find some model for the count probabilities $p_j = p_j(\lambda)$
- then estimate $\lambda$ by some method (truncated likelihood)
- and then use the model for $p_0$:
$$\hat N = \frac{n}{1 - p_0(\hat\lambda)}$$

Ways to a Realistic Solution: Formulation of the Problem and the Idea for its Solution
- only to illustrate: Poisson model for the count probabilities
$$p_j = p_j(\lambda) = \exp(-\lambda)\lambda^j / j!$$
- then estimate $\lambda$ and arrive at:
$$\hat N = \frac{n}{1 - \hat p_0} = \frac{n}{1 - \exp(-\hat\lambda)}$$
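The homogeneous Poisson illustration can be sketched in code. This is a minimal sketch, not the author's implementation: it assumes the "truncated likelihood" is the zero-truncated Poisson likelihood, whose score equation is $\bar y = \lambda / (1 - \exp(-\lambda))$, and solves it by bisection. The function name and the frequencies are hypothetical.

```python
import math

def truncated_poisson_mle(freqs):
    """freqs maps each observed count j >= 1 to its frequency f_j."""
    n = sum(freqs.values())
    mean = sum(j * f for j, f in freqs.items()) / n
    if mean <= 1.0:
        raise ValueError("need at least one count above 1")
    # Solve lam / (1 - exp(-lam)) = mean; the left side increases in lam
    # from 1 (lam -> 0) and exceeds mean at lam = mean.
    lo, hi = 1e-12, mean
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    # Horvitz-Thompson step: N_hat = n / (1 - p0(lam_hat))
    return lam, n / -math.expm1(-lam)

freqs = {1: 60, 2: 25, 3: 10, 4: 5}  # hypothetical frequencies of frequencies
lam_hat, n_hat = truncated_poisson_mle(freqs)
print(lam_hat, n_hat)
```

Bisection is used only because it needs no derivatives; any root finder would do.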

Ways to a Realistic Solution: Formulation of the Problem and the Idea for its Solution
- however: using a simple Poisson model for the count probabilities is not appropriate, since every unit is different
$$p_j = p_j(\lambda) = \exp(-\lambda)\lambda^j / j!$$
- there is population heterogeneity, so that more realistic is
$$p_j = p_j(\lambda) = \int_0^\infty \exp(-t)\, t^j / j!\; \lambda(t)\, dt$$
where $\lambda(t)$ stands for the heterogeneity distribution of the Poisson parameter

Ways to a Realistic Solution: Formulation of the Problem and the Idea for its Solution
instead of providing an estimate $\hat\lambda(t)$ by means of nonparametric mixture models (Böhning and Schön 2005, JRSSC), interest is on two alternatives:
1. lower bound approach by Chao (1987, 1989, Biometrics)
   - monotonicity of the ratios of consecutive mixed Poisson probabilities
   - diagnostic device for presence of a mixed Poisson
2. robust approach of Zelterman (1988, JSPI)
   - likelihood framework for the Zelterman estimate
   - variance via Fisher information and covariate modelling via logistic regression (Böhning and Del Rio Vilas 2008, JABES)
   - bias investigation


Generalized Chao Bounds and a Monotonicity Property: Chao's Lower Bound Estimate
Poisson mixture for $j = 0, 1, 2, \dots$:
$$p_j = \int_0^\infty \exp(-t)\, t^j / j!\; \lambda(t)\, dt$$
with unknown $\lambda(t)$ for $t > 0$. Then, by the Cauchy-Schwarz inequality,
$$E(XY)^2 \le E(X^2)\,E(Y^2)$$
where $X = \exp(-t/2)$ and $Y = \exp(-t/2)\,t$, and expected values are w.r.t. $\lambda(t)$:
$$\left(\int_0^\infty \exp(-t)\,t\,\lambda(t)\,dt\right)^2 \le \int_0^\infty \exp(-t)\,\lambda(t)\,dt \;\cdot\; 2\int_0^\infty \exp(-t)\,\frac{t^2}{2}\,\lambda(t)\,dt$$
that is,
$$p_1^2 \le p_0 \cdot 2 p_2 \quad\Longleftrightarrow\quad \frac{p_1^2}{2 p_2} \le p_0$$
which leads to Chao's lower bound estimate (truly nonparametric)
$$\hat f_0 = \frac{f_1^2}{2 f_2}$$
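As a worked check of the lower bound, the sketch below plugs in the values that the slides report for the L.A. drug user data ($n = 20{,}198$, $f_1 = 11{,}982$, $f_2 = 3{,}893$). The function name is hypothetical.

```python
def chao_lower_bound(n, f1, f2):
    """Chao's lower bound for N: n + f1^2 / (2 * f2)."""
    f0_hat = f1 ** 2 / (2 * f2)  # estimated number of unobserved units
    return n + f0_hat

n_chao = chao_lower_bound(n=20198, f1=11982, f2=3893)
print(round(n_chao))  # 38637, the Chao value reported for Drug Use L.A.
```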

Generalized Chao Bounds and a Monotonicity Property: A Monotonicity Property
Cauchy-Schwarz is more generally applicable: use $X = \exp(-t/2)\,t^{(j-1)/2}$ and $Y = \exp(-t/2)\,t^{(j+1)/2}$, so that $E(XY)^2 \le E(X^2)E(Y^2)$ reads
$$\left(\int_0^\infty \exp(-t)\,t^{j}\,\lambda(t)\,dt\right)^2 \le \int_0^\infty \exp(-t)\,t^{j-1}\,\lambda(t)\,dt \int_0^\infty \exp(-t)\,t^{j+1}\,\lambda(t)\,dt$$
$$(j!\,p_j)^2 \le (j-1)!\,p_{j-1}\,(j+1)!\,p_{j+1}$$
$$j\,\frac{p_j}{p_{j-1}} \le (j+1)\,\frac{p_{j+1}}{p_j}$$
which says: ratios of consecutive mixed Poissons are monotone

Generalized Chao Bounds and a Monotonicity Property: Application of the Monotonicity Property: Ratio Plot
- plot $j\,p_j / p_{j-1}$ against $j$ for $j = 1, 2, \dots, m - 1$
- replace $p_j$ by the observed frequency $f_j$, so that: plot $j\,f_j / f_{j-1}$ against $j$ for $j = 1, 2, \dots, m - 1$
- monotonicity is indicative of a mixture (heterogeneity)
- conceptually related to the Poisson plot (Hoaglin 1980 American Statistician, Gart 1970, Rao 1971)
- difference: the Poisson plot looks for a horizontal line, the ratio plot looks for a monotone increasing pattern
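The ratio plot coordinates can be computed as in the following sketch. It is written for zero-truncated data and therefore uses the shifted form $(j+1) f_{j+1} / f_j$ for $j = 1, \dots, m-1$, which traces the same sequence of ratios without needing the unobserved $f_0$. The function name and the frequencies are hypothetical.

```python
def ratio_plot_points(freqs):
    """Return (j, (j + 1) * f_{j+1} / f_j) for j = 1, ..., m - 1."""
    m = max(freqs)
    points = []
    for j in range(1, m):
        fj, fj1 = freqs.get(j, 0), freqs.get(j + 1, 0)
        # Skip ratios that involve an empty cell.
        if fj > 0 and fj1 > 0:
            points.append((j, (j + 1) * fj1 / fj))
    return points

freqs = {1: 120, 2: 40, 3: 20, 4: 12, 5: 8}  # hypothetical frequencies
points = ratio_plot_points(freqs)
ratios = [r for _, r in points]
# An increasing sequence is the pattern expected under a Poisson mixture.
is_monotone = all(a <= b for a, b in zip(ratios, ratios[1:]))
print(points, is_monotone)
```

A roughly constant sequence would instead point to a homogeneous Poisson, since then every ratio estimates the same $\lambda$.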

[Figure: Ratio Plot for Matthews Data; ratio against j]

[Figure: Ratio Plot for Scrapie Data; ratio against j]

[Figure: Ratio Plot for Drug User Data; ratio against j]

[Figure: Ratio Plot for Terrorist Activity Data; ratio against j]

Generalized Chao Bounds and a Monotonicity Property: Conclusions from the Ratio Plot
- frequently, we find evidence in count data sets for heterogeneity in the form of a mixture
- the concept is applicable to both zero-truncated and untruncated count data (the normalizing constant cancels out!)


Zelterman Estimation: The Idea of Zelterman (1988)
he noted that
$$\lambda = \frac{\lambda^{j+1}}{\lambda^{j}} = (j+1)\,\frac{\lambda^{j+1}/(j+1)!}{\lambda^{j}/j!} = (j+1)\,\frac{Po(j+1;\lambda)}{Po(j;\lambda)}$$
leading to the proposal
$$\hat\lambda_j = (j+1)\,\frac{f_{j+1}}{f_j}$$
and in particular for $j = 1$
$$\hat\lambda = \hat\lambda_1 = 2\,\frac{f_2}{f_1}$$

Zelterman Estimation
$\hat\lambda = 2 f_2 / f_1$ is robust in the sense that
- it is not affected by any changes in counts larger than 2
- the count distribution need only behave like a Poisson for counts of 1 or 2
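A minimal sketch of the Zelterman estimate, again using the values the slides report for the L.A. drug user data ($n = 20{,}198$, $f_1 = 11{,}982$, $f_2 = 3{,}893$). The function name is hypothetical.

```python
import math

def zelterman(n, f1, f2):
    """Zelterman estimate: lam = 2 * f2 / f1, N = n / (1 - exp(-lam))."""
    lam = 2 * f2 / f1
    return lam, n / (1 - math.exp(-lam))

lam_hat, n_z = zelterman(n=20198, f1=11982, f2=3893)
print(round(n_z))  # 42268, the Zelterman value reported for Drug Use L.A.
```

Only $f_1$, $f_2$ and $n$ enter, which is exactly the robustness property stated above: changing any $f_j$ with $j > 2$ while holding $n$ fixed leaves the estimate unchanged.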

Zelterman Estimation
frequently: evidence for a mixture (Scrapie data)
[Figure: Frequency Distribution; frequency against count, with series Observed Frequency, Fitted Poisson Mixture, Fitted Simple Poisson]

Zelterman Estimation
frequently: evidence for a 2-component mixture model

| Example | Non-parametric mixture model |
|---|---|
| McKendrick | homogeneity |
| Matthews | 2-component |
| Scrapie | 2-component |
| Drug Use L.A. | 3-component |
| terrorist activity | 6-component |

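Zelterman Estimation: Population Size Estimates for the Examples

| Example | n | simple MLE | Chao | Zelterman |
|---|---|---|---|---|
| McKendrick | … | … | … | … |
| Matthews | … | … | … | … |
| Scrapie | … | … | … | … |
| Drug Use L.A. | 20,198 | 26,425 | 38,637 | 42,268 |
| terrorist activity | 785 | … | 1,144 | 1,429 |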

Zelterman Estimation: Bias for Zelterman in a 2-component mixture model
- assume that for $j = 0, 1, 2, \dots$
$$p_j = (1 - p)\,Po(j;\lambda) + p\,Po(j;\mu)$$
- the bias of $\hat N = n / (1 - \hat p_0)$ is determined by the bias in $\hat p_0$

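Zelterman Estimation: Zelterman and non-parametric mixture

| Example | n | Chao | Zelterman | NPMLE of mixture (components) |
|---|---|---|---|---|
| McKendrick | … | … | … | … (1) |
| Matthews | … | … | … | … (2) |
| Scrapie | … | … | … | … (2) |
| Drug Use L.A. | 20,198 | 38,637 | 42,268 | 39,173 (2); 56,836 (3) |
| terrorist activity | 785 | 1,144 | 1,429 | - |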

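Zelterman Estimation: Bias for Zelterman in a 2-component mixture model
for Zelterman:
$$\hat p_0 = \exp(-\hat\lambda) = \exp\left(-2\,\frac{f_2}{f_1}\right)$$
replacing frequencies by expected values:
$$E(\hat p_0) \approx \exp\left(-2\,\frac{p_2}{p_1}\right)$$
bias of $\hat p_0$:
$$E(\hat p_0) - p_0 \approx \exp\left(-2\,\frac{p_2}{p_1}\right) - \left[(1-p)e^{-\lambda} + p e^{-\mu}\right]$$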

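Zelterman Estimation: Bias for Zelterman in a 2-component mixture model
bias of $\hat p_0$:
$$\exp\left(-2\,\frac{p_2}{p_1}\right) - \left[(1-p)e^{-\lambda} + p e^{-\mu}\right] = \exp\left(-\frac{(1-p)\lambda^2 e^{-\lambda} + p\mu^2 e^{-\mu}}{(1-p)\lambda e^{-\lambda} + p\mu e^{-\mu}}\right) - \left[(1-p)e^{-\lambda} + p e^{-\mu}\right]$$
$$\xrightarrow{\;\mu \to \infty\;} e^{-\lambda} - (1-p)e^{-\lambda} = p\,e^{-\lambda}$$
small amount of contamination $\Rightarrow$ small bias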

Zelterman Estimation: Bias for Zelterman in a 2-component mixture model
- for large $\mu$ the bias of $\hat p_0$ is $p\,e^{-\lambda} > 0$
- Zelterman overestimates $p_0$ and hence $N$ (upper bound)
- small amount of contamination $\Rightarrow$ small bias

Zelterman Estimation: For Comparison: Bias of simple MLE in a 2-component mixture model
- the bias of $\hat p_0 = \exp(-\bar Y)$ is $\le 0$, by Jensen's inequality:
$$e^{-[(1-p)\lambda + p\mu]} - \left[(1-p)e^{-\lambda} + p e^{-\mu}\right] \le 0$$
- so the simple homogeneity model underestimates for all mixture models
- for large $\mu$ the bias of $\hat p_0$ is $-(1-p)\,e^{-\lambda}$
- small amount of contamination $\Rightarrow$ large bias

Zelterman Estimation
next graph shows (for large $\mu$):
- bias of Zelterman $\hat p_0$: $p\,e^{-\lambda}$
- bias of simple MLE $\hat p_0$: $-(1-p)\,e^{-\lambda}$

[Figure: Asymptotic Bias; bias of Zelterman and simple MLE plotted against p]

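Zelterman Estimation
next graphs show:
- exact bias of Zelterman $\hat p_0$:
$$\exp\left(-\frac{(1-p)\lambda^2 e^{-\lambda} + p\mu^2 e^{-\mu}}{(1-p)\lambda e^{-\lambda} + p\mu e^{-\mu}}\right) - \left[(1-p)e^{-\lambda} + p e^{-\mu}\right]$$
- exact bias of simple MLE $\hat p_0$:
$$e^{-[(1-p)\lambda + p\mu]} - \left[(1-p)e^{-\lambda} + p e^{-\mu}\right]$$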
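The two bias expressions are easy to evaluate numerically. The sketch below is a direct transcription under the 2-component mixture $p_j = (1-p)Po(j;\lambda) + pPo(j;\mu)$; the function names and the chosen parameter values are hypothetical.

```python
import math

def p0_true(p, lam, mu):
    """True zero probability under the 2-component Poisson mixture."""
    return (1 - p) * math.exp(-lam) + p * math.exp(-mu)

def zelterman_bias(p, lam, mu):
    """exp(-2 * p2 / p1) - p0, with p1 and p2 from the mixture."""
    num = (1 - p) * lam ** 2 * math.exp(-lam) + p * mu ** 2 * math.exp(-mu)
    den = (1 - p) * lam * math.exp(-lam) + p * mu * math.exp(-mu)
    return math.exp(-num / den) - p0_true(p, lam, mu)

def simple_mle_bias(p, lam, mu):
    """exp(-mixture mean) - p0; non-positive by Jensen's inequality."""
    return math.exp(-((1 - p) * lam + p * mu)) - p0_true(p, lam, mu)

p, lam = 0.05, 1.0  # hypothetical contamination level and main component
for mu in (1.0, 3.0, 10.0, 1000.0):
    print(mu, zelterman_bias(p, lam, mu), simple_mle_bias(p, lam, mu))
```

For very large $\mu$ the printed values approach $p\,e^{-\lambda}$ and $-(1-p)\,e^{-\lambda}$ respectively, and both biases vanish when $\mu = \lambda$.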

[Figure: two surface panels, Zelterman Bias and Bias under Homogeneous Poisson, each plotted over μ and λ; p = 0.05]

[Figure: two surface panels, Zelterman Bias and Bias under Homogeneous Poisson, each plotted over μ and λ; p = 0.5]


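Applications: Zelterman larger than Chao?
$$\hat N_Z = \frac{n}{1 - \exp(-\hat\lambda)} = n + \frac{n}{\exp(\hat\lambda) - 1} \approx n + \frac{n}{1 + \hat\lambda + \hat\lambda^2/2 - 1} = n + \frac{n}{\hat\lambda + \hat\lambda^2/2}$$
$$= n + \frac{n}{2\frac{f_2}{f_1} + \frac{1}{2}\left(2\frac{f_2}{f_1}\right)^2} = n + \left(\frac{f_1^2}{2 f_2}\right)\frac{n}{f_1 + f_2} \ge n + \frac{f_1^2}{2 f_2} = \hat N_C$$
yes, if $\hat\lambda$ is small (Böhning and Brittain 2008)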
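The comparison can be checked numerically, here with the values reported in the slides for the L.A. drug user data. One caveat made explicit in this sketch: since $\exp(\lambda) - 1 \ge \lambda + \lambda^2/2$ for $\lambda \ge 0$, the second-order expression is an upper bound on $\hat N_Z$, which is why the argument only holds when $\hat\lambda$ is small enough for the approximation to be tight. The function name is hypothetical.

```python
import math

def compare_chao_zelterman(n, f1, f2):
    lam = 2 * f2 / f1
    n_chao = n + f1 ** 2 / (2 * f2)
    n_zelt = n / (1 - math.exp(-lam))
    # Second-order expansion: exp(lam) - 1 replaced by lam + lam**2 / 2
    n_approx = n + (f1 ** 2 / (2 * f2)) * n / (f1 + f2)
    return n_chao, n_zelt, n_approx

n_chao, n_zelt, n_approx = compare_chao_zelterman(20198, 11982, 3893)
print(n_chao, n_zelt, n_approx)
```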

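Applications: Zelterman larger than Chao?
approximating term: $\left(\frac{f_1^2}{2 f_2}\right)\frac{n}{f_1 + f_2}$

| Example | n | Chao | Zelterman |
|---|---|---|---|
| McKendrick | … | … | … |
| Matthews | … | … | … |
| Scrapie | … | … | … |
| Drug Use L.A. | 20,198 | 38,637 | 42,… |
| terrorist activity | 785 | 1,144 | 1,… |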

Outlook: Zelterman as MLE
Zelterman Estimation offers Flexibility
the Zelterman estimate truncates all counts different from 1 or 2: write
$$1 - p = p_1 = \frac{\exp(-\lambda)\lambda}{\exp(-\lambda)\lambda + \exp(-\lambda)\lambda^2/2} = \frac{1}{1 + \lambda/2}$$
$$p = p_2 = \frac{\exp(-\lambda)\lambda^2/2}{\exp(-\lambda)\lambda + \exp(-\lambda)\lambda^2/2} = \frac{\lambda/2}{1 + \lambda/2}$$
and consider the associated binomial log-likelihood
$$f_1 \log(p_1) + f_2 \log(p_2) = f_1 \log(1 - p) + f_2 \log(p)$$
which is maximized for $\hat p = \hat p_2 = \frac{f_2}{f_1 + f_2}$, or
$$\hat\lambda = \frac{2\hat p_2}{1 - \hat p_2} = \frac{2 f_2}{f_1}$$

Outlook: Zelterman as MLE
Zelterman Estimation offers Flexibility
a likelihood framework offers generalizations:
- (correct) variance estimate of the Zelterman estimator (Fisher information) (Böhning 2008, Statistical Methodology)
- extension of the estimator for case data
- incorporation of covariates (binomial logistic regression with log-link function to the Poisson parameter) (Böhning and van der Heijden 2008, Ann. Appl. Statist.)
- efficiency

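Outlook: Zelterman can be extended to case data
Zelterman Estimation: Extension to Case Data
Table: Illustration of Case Data with Individual Recapture Counts

| Unit $i$ | Count $y_i$ | $\delta_i$ | Sex$_i$ | Age$_i$ |
|---|---|---|---|---|
| … | … | … | Male | … |
| … | … | … | Male | … |
| … | … | … | Female | … |
| … | … | … | Male | … |
| … | … | … | Female | … |
| … | … | … | Female | … |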

Outlook: Zelterman can be extended to case data
Zelterman Estimation offers Flexibility
- the binomial likelihood for grouped data
$$f_1 \log(1 - p) + f_2 \log(p)$$
- becomes for case data (with $\delta_i = 1$ if $y_i = 2$ and $\delta_i = 0$ if $y_i = 1$)
$$\sum_i \left[(1 - \delta_i)\log(1 - p) + \delta_i \log(p)\right]$$
- which becomes, with covariate information on case $i$, a logistic regression model
$$p_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}$$

Outlook: Zelterman can be extended to case data
Zelterman Estimation offers Flexibility
- covariate information on case $i$:
$$p_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}$$
- compare with the parameterization in the capture probability:
$$p_i = \frac{\lambda_i/2}{1 + \lambda_i/2}$$
- it follows that $\lambda_i = 2\exp(\beta^T x_i)$, and the generalization of the Horvitz-Thompson estimator is
$$\hat N = \sum_{i=1}^{n} \frac{1}{1 - \exp\left(-2 e^{\beta^T x_i}\right)}$$
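The covariate version can be sketched end to end. This is an illustration under stated assumptions, not the authors' code: the case data are hypothetical, there is a single binary covariate $x_i$ (for example sex), and the logistic regression is fitted by a hand-written Newton-Raphson so the example has no dependencies. Units with counts above 2 are excluded from the fit but still contribute to the Horvitz-Thompson sum.

```python
import math

def fit_logistic(xs, deltas, iters=50):
    """Newton-Raphson for p_i = expit(b0 + b1 * x_i); returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d in zip(xs, deltas):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = p * (1 - p)
            g0 += d - p          # score for the intercept
            g1 += (d - p) * x    # score for the slope
            h00 += w             # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical case data: (count y_i, covariate x_i)
units = ([(1, 0)] * 30 + [(2, 0)] * 10 + [(3, 0)] * 2 +
         [(1, 1)] * 40 + [(2, 1)] * 20 + [(3, 1)] * 5)

# Fit on units with y in {1, 2}; delta_i = 1 if y_i = 2, else 0.
fit = [(x, 1 if y == 2 else 0) for y, x in units if y in (1, 2)]
b0, b1 = fit_logistic([x for x, _ in fit], [d for _, d in fit])

# Generalized Horvitz-Thompson estimator over all n observed units,
# with lambda_i = 2 * exp(b0 + b1 * x_i).
n_hat = sum(1 / (1 - math.exp(-2 * math.exp(b0 + b1 * x))) for _, x in units)
print(b0, b1, n_hat)
```

With a single binary covariate the model is saturated, so the result coincides with applying the plain Zelterman estimate $2 f_2 / f_1$ separately within each covariate group.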

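Outlook: Zelterman extended to higher counts
Generalizing the Idea of Zelterman: Improving Efficiency
not only
$$\lambda = (j+1)\,\frac{Po(j+1;\lambda)}{Po(j;\lambda)}$$
but also
$$\lambda = \lambda \cdot \overbrace{\frac{\sum_{i=1}^{j} \lambda^i/i!}{\sum_{i=1}^{j} \lambda^i/i!}}^{1} = \frac{\sum_{i=1}^{j} (i+1)\,\lambda^{i+1}/(i+1)!}{\sum_{i=1}^{j} \lambda^i/i!} = \frac{\sum_{i=1}^{j} (i+1)\,Po(i+1;\lambda)}{\sum_{i=1}^{j} Po(i;\lambda)}$$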

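Outlook: Zelterman extended to higher counts
$$\lambda = \frac{\sum_{i=1}^{j} (i+1)\,Po(i+1;\lambda)}{\sum_{i=1}^{j} Po(i;\lambda)}$$
leads to the proposal
$$\hat\lambda_j = \frac{\sum_{i=1}^{j} (i+1)\,f_{i+1}}{\sum_{i=1}^{j} f_i}$$
and in particular for $j = 1$, $j = 2$, $j = 3$, $j = 4$
$$\hat\lambda = \hat\lambda_1 = \frac{2 f_2}{f_1}, \quad \hat\lambda_2 = \frac{2 f_2 + 3 f_3}{f_1 + f_2}, \quad \hat\lambda_3 = \frac{2 f_2 + 3 f_3 + 4 f_4}{f_1 + f_2 + f_3}, \quad \hat\lambda_4 = \frac{2 f_2 + 3 f_3 + 4 f_4 + 5 f_5}{f_1 + f_2 + f_3 + f_4}$$
$$Z_i = \frac{n}{1 - \exp(-\hat\lambda_i)}$$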
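The family $\hat\lambda_j$ and the resulting estimators $Z_j$ can be computed in a few lines. A minimal sketch with hypothetical function names and frequencies:

```python
import math

def zelterman_j(freqs, j):
    """lambda_j = sum_{i=1..j} (i + 1) * f_{i+1} / sum_{i=1..j} f_i."""
    num = sum((i + 1) * freqs.get(i + 1, 0) for i in range(1, j + 1))
    den = sum(freqs.get(i, 0) for i in range(1, j + 1))
    return num / den

def population_size(n, lam):
    """Horvitz-Thompson step: n / (1 - exp(-lam))."""
    return n / (1 - math.exp(-lam))

freqs = {1: 120, 2: 40, 3: 20, 4: 12, 5: 8}  # hypothetical frequencies
n = sum(freqs.values())
estimates = {j: population_size(n, zelterman_j(freqs, j)) for j in (1, 2, 3, 4)}
print(estimates)
```

Setting `j = 1` recovers the conventional Zelterman estimate $2 f_2 / f_1$; larger `j` uses more of the data at the cost of relying on the Poisson shape over a wider range of counts.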

Outlook: Zelterman extended: truncation or censoring?
Generalizing the idea of Zelterman: truncation or censoring?
- disadvantage of conventional Zelterman: uses only $f_1$ and $f_2$
- idea of truncation: ignore all counts different from 1 and 2
- idea of censoring: use marginal likelihood for all counts of 2 and larger
$$p_1 = P(Y = 1) = \frac{\exp(-\lambda)}{1 - \exp(-\lambda)}\,\lambda$$
$$p_{2+} = P(Y > 1) = \frac{\exp(-\lambda)}{1 - \exp(-\lambda)}\left[\lambda^2/2! + \lambda^3/3! + \dots\right] = 1 - \frac{\exp(-\lambda)}{1 - \exp(-\lambda)}\,\lambda$$

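Outlook: Zelterman extended: truncation or censoring?
Generalizing the idea of Zelterman: truncation or censoring?
leads to the binomial likelihood
$$f_1 \log(p_1) + f_{2+} \log(p_{2+})$$
which leads to $\hat p_1 = f_1 / n$; since we have
$$\frac{\exp(-\lambda)}{1 - \exp(-\lambda)}\,\lambda = \frac{1}{\exp(\lambda) - 1}\,\lambda \approx \frac{1}{\lambda + \lambda^2/2}\,\lambda = \frac{1}{1 + \lambda/2}$$
equating this to $f_1 / n$ leads to
$$\hat\lambda_C = \frac{2(n - f_1)}{f_1}$$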

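Outlook: Zelterman extended: truncation or censoring?
Bias reduced estimator
$$\hat\lambda_C = \frac{2(n - f_1)}{f_1} = \frac{2 f_2 + 2 f_3 + \dots}{f_1}$$
is biased, since
$$E(\hat\lambda_C) \approx \frac{2\lambda^2/2 + 2\lambda^3/6 + \dots}{\lambda} \approx \lambda + \lambda^2/3$$
equate $\hat\lambda_C = \lambda + \lambda^2/3$ and solve for $\lambda$ (positive root of the quadratic) to provide
$$\hat\lambda_{C,bias} = \frac{3}{2}\left(\sqrt{1 + \tfrac{4}{3}\hat\lambda_C} - 1\right)$$
denote
$$Z_{C,bias} = \frac{n}{1 - \exp(-\hat\lambda_{C,bias})}$$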
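A minimal sketch of the censored estimator and its bias-reduced version. The correction takes the positive root of the quadratic $\lambda + \lambda^2/3 = \hat\lambda_C$; the function names and the values of $n$ and $f_1$ are hypothetical.

```python
import math

def censored_lambda(n, f1):
    """lambda_C = 2 * (n - f1) / f1."""
    return 2 * (n - f1) / f1

def bias_corrected_lambda(lam_c):
    """Positive root of lam + lam**2 / 3 = lam_c."""
    return 1.5 * (math.sqrt(1 + 4 * lam_c / 3) - 1)

n, f1 = 200, 120  # hypothetical: 200 observed units, 120 seen exactly once
lam_c = censored_lambda(n, f1)
lam_cb = bias_corrected_lambda(lam_c)
z_c = n / (1 - math.exp(-lam_c))
z_cb = n / (1 - math.exp(-lam_cb))
print(lam_c, lam_cb, z_c, z_cb)
```

Because the correction shrinks $\hat\lambda_C$, the corrected population size estimate is always larger than the uncorrected one.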

Outlook: Some simulation results
Simulation Experiment
Goal: compare
$$Z_j = \frac{n}{1 - \exp(-\hat\lambda_j)}$$
as well as
$$Z_C = \frac{n}{1 - \exp(-\hat\lambda_C)}$$
and its bias corrected version, and Chao's estimator
$$N_C = n + \frac{f_1^2}{2 f_2}$$
- count data: $f_j$ arise from $0.5\,Po(j; 0.5) + 0.5\,Po(j; \mu)$ for $\mu = 1, 2, \dots, 7$ and $j = 0, 1, 2, \dots$
- population size: $N = f_0 + f_1 + \dots + f_m = 100$
- $f_0$ is truncated
- $N$ estimated using $Z_1$, $Z_2$, $Z_3$, $Z_C$, $Z_{C,bias}$ and $N_C$
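The design of the experiment can be sketched as follows. This is not the author's simulation code: it covers only Chao's estimator and $Z_1$, uses Knuth's multiplication method for Poisson draws, and skips replications with $f_1 = 0$ or $f_2 = 0$ (an assumption; the slides do not say how such cases were handled). All function names, the seed and the number of replications are hypothetical.

```python
import math
import random

def poisson_draw(rng, lam):
    # Knuth's multiplication method; adequate for small lam.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_once(rng, mu, N=100):
    # Each unit comes from Po(0.5) or Po(mu) with probability 0.5 each.
    counts = [poisson_draw(rng, 0.5 if rng.random() < 0.5 else mu)
              for _ in range(N)]
    observed = [y for y in counts if y > 0]  # zero counts are truncated
    n = len(observed)
    f1, f2 = observed.count(1), observed.count(2)
    if f1 == 0 or f2 == 0:
        return None  # estimators undefined; replication skipped
    chao = n + f1 ** 2 / (2 * f2)
    z1 = n / (1 - math.exp(-2 * f2 / f1))
    return chao, z1

def mean_estimates(mu, reps=2000, seed=1):
    rng = random.Random(seed)
    runs = [r for r in (simulate_once(rng, mu) for _ in range(reps)) if r]
    return (sum(c for c, _ in runs) / len(runs),
            sum(z for _, z in runs) / len(runs))

for mu in range(1, 8):
    print(mu, mean_estimates(mu))
```

Subtracting $N = 100$ from each mean gives a Monte Carlo estimate of the bias; the standard deviation and RMSE follow from the same replications.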

[Figure: Bias; mean of each estimator (Chao, Z_C, Z_C_Bias, Z1, Z2, Z3) plotted against lambda]

[Figure: Standard Deviation; SD of each estimator (Chao, Z_C, Z_C_Bias, Z1, Z2, Z3) plotted against lambda]

[Figure: RMSE; RMSE of each estimator (Chao, Z_C, Z_C_Bias, Z1, Z2, Z3) plotted against lambda]

Outlook: Some simulation results
Main Conclusions
- Generalized Chao Bounds lead to a diagnostic device for the presence of heterogeneity
- Zelterman Estimation appears appropriate since it offers a flexible likelihood concept
- further generalizations of Zelterman estimation (covariate modelling and increasing efficiency by including higher counts) are easily possible
