Statistical Models for Defective Count Data

Size: px

Start display at page:

Download "Statistical Models for Defective Count Data"

Adela Daniel
6 years ago
Views:

1 Statistical Models for Defective Count Data Gerhard Neubauer a, Gordana -Duraš a, and Herwig Friedl b a Statistical Applications, Joanneum Research, Graz, Austria b Institute of Statistics, University of Technology, Graz, Austria Statistik Tage 2011, Graz, September 7 th -9 th

2 Introduction Statistik Tage Graz, September 7-9,

3 Defective Count Data Defective: reported... too much and/or too few is counted, recorded, Most prominent example: Criminology Crimes associated with shame (sexual offences, domestic violence) Theft of low-value goods Likely to be not reported to the police, official numbers are too low Under-Reporting Statistik Tage Graz, September 7-9,

4 Defective Count Data Less considered: Insurance companies suspect that theft may be reported, but only to make an insurance claim Most popular with bicycles, skiing equipment Likely to be reported to the police, official numbers are too high Over-Reporting More precisely: Under-Reporting fraud Over-Reporting theft Statistik Tage Graz, September 7-9,

5 Criminology: Bicycle Theft Data Weekly counts of bicycle theft in an Austrian region Week Statistik Tage Graz, September 7-9,

6 Bicycle Theft? Statistik Tage Graz, September 7-9,

7 Defective Count Data Health data Registers for infectious diseases (HIV), chronic diseases (diabetes) Cause of death registers (heart attack) Any miss-classification will cause both errors Under-Reporting in the right category Over-Reporting in the wrong category Statistik Tage Graz, September 7-9,

8 Health: Heart Attack Data Monthly counts of heart attack discharges from Styrian hospitals Count Month Statistik Tage Graz, September 7-9,

9 Defective Count Data Criminology Health data Insurance data: traffic accidents with minor damage Production: number of defective goods in production Warranty: number of goods sent back for warranty claim... Any sample of count data may be defective due to recording failures Statistik Tage Graz, September 7-9,

10 Under-Reporting How to estimate the total number of cases? Statistik Tage Graz, September 7-9,

11 Bernoulli Sampling One observation λ observations Case reported Case reported Yes No Yes No R 1 R Y D R Bernoulli(π) Y = λ R i Binomial(λ, π) i=1 Y... (random) number of reported cases D... (random) number of unreported cases λ... total number of cases π... reporting probability Statistik Tage Graz, September 7-9,

12 Binomial Model Case reported Yes No Y D Y = λ R i Binomial(λ, π) i=1 E(Y ) = µ = λπ var(y ) = σ 2 = λπ(1 π) Both λ and π have to be estimated Statistik Tage Graz, September 7-9,

13 Binomial Model Case reported Yes No Y D Y = λ R i Binomial(λ, π) i=1 E(Y ) = µ = λπ var(y ) = σ 2 = λπ(1 π) Both λ and π have to be estimated No longer a member of 1-parameter exponential family Limitation to data with s 2 < y Statistik Tage Graz, September 7-9,

14 Is iid Assumption Appropriate? Week Count Month Trend/Seasonality Regression approach Statistik Tage Graz, September 7-9,

15 Regression Approach Consider Y t ind Binomial(λ t, π), t = 1,..., T, where λ t = exp(x tβ), β R d, π = to ensure λ t > 0 and 0 < π < 1 Likelihood contribution of y t L(α, β y t ) = exp(α) 1 + exp(α), α R, ( λt y t ) π y t (1 π) λ t y t Statistik Tage Graz, September 7-9,

16 Maximum Likelihood Estimation Sample Log-Likelihood l(α, β y) = = T t=1 T t=1 log L(α, β y t ) log [( λt y t )π y t (1 π) λ t y t ] Score and observed Information are analytically evaluated Newton-Raphson procedure till convergence Statistik Tage Graz, September 7-9,

17 Maximum Likelihood Estimation Sample Log-Likelihood l(α, β y) = = T t=1 T t=1 log L(α, β y t ) log [( λt y t )π y t (1 π) λ t y t ] Score and observed Information are analytically evaluated Newton-Raphson procedure till convergence Still problems with overdispersion! Statistik Tage Graz, September 7-9,

18 Randomized Parameter Models Statistik Tage Graz, September 7-9,

19 Beta-Binomial Model Randomize reporting probability π We use the parameterization Y t P t Binomial(λ t, P t ) P t Beta(γ, δ) Y t Beta-Binomial(λ t, γ, δ) θ = γ + δ and π = γ γ + δ = E(P t) Hence E(Y t ) = µ t = λ t π and var(y t ) = µ t (1 π)φ, where φ = λ+γ+δ 1+γ+δ 1 Statistik Tage Graz, September 7-9,

20 Poisson Models Randomize total number of cases λ Model not identified Winkelmann (2000): Pogit Model Y t L t Binomial(L t, π) L t Poisson(λ t ) Y t Poisson(λ t π) Y t Poisson(λ t π t ) λ t = exp(x tβ) and π t = with two (disjoint) sets of regressors x t and z t exp(z tα) 1 + exp(z tα) Statistik Tage Graz, September 7-9,

21 Negative Binomial Model Randomize total number of cases λ Y t L t Binomial(L t, π) L t K t Poisson(K t λ t ) K t Gamma(ω t, ω t ) Y t Negative Binomial(ω t, 1 π) where ω t = λ t (1 π) is the expected number of unreported cases E(Y t ) = µ t = λ t π var(y t ) = µ t + µ2 t ω t Statistik Tage Graz, September 7-9,

22 Beta-Poisson Model Randomize π and λ We use the parameterization Y t L t, P t Binomial(L t, P t ) L t Poisson(λ t ) P t Beta(γ, δ) Y t Beta-Poisson(λ t, γ, δ) θ = γ + δ and π = γ γ + δ = E(P t) Hence E(Y t ) = µ t = λ t π and var(y t ) = µ t φ t, where φ t = 1 + λ t(1 π) 1+γ+δ 1 Statistik Tage Graz, September 7-9,

23 Estimation and Inference Statistik Tage Graz, September 7-9,

24 Estimation We use Maximum Likelihood (ML), where the solution of score equations α log L (Ω M y t, x t ) = 0 and β log L (Ω M y t, x t ) = 0 t t Ω M the parameter vector of given model is found by the Newton-Raphson algorithm For the Beta-mixtures we also use the Hybrid algorithm Cycle between 1. ML for α, β given θ, and 2. MoM for θ given α, β Choice of starting values sometimes crucial Statistik Tage Graz, September 7-9,

25 Inference Normality of parameter estimates Simulation studies: for reasonable settings ˆα and ˆβ approximately normal Problems when π 0, π 1, i.e. Poisson limit perfect reporting system Inference within distribution - usual model selection methods e.g. t-values, LRT,... Statistik Tage Graz, September 7-9,

26 Inference Inference between distributions - Non-nested testing (Allcroft and Glasbey, 2003) Comparison of various models M = (M k ), k = 1,..., K 1. find ˆθ k for data y = (y t ) by optimizing GoF criterion c = (c k (y)) 2. simulate samples y (s) (ˆθ k ), s = 1,..., S, (e.g. S = 100) and obtain all S matrices C (s) of dimension K K 3. obtain mean C of S matrices and compare c to the k-th column c k by multivariate normality assumption If D k χ 2 K, D k = (c c k ) V 1 k (c c k) kth model correct Statistik Tage Graz, September 7-9,

27 Real Data Applications Statistik Tage Graz, September 7-9,

28 Bicycle Theft Distribution log L Pearson BIC D p(d) ˆπ Negative-Binomial Beta-Binomial Beta-Poisson Model selected: Negative Binomial ˆπ = 0.61, p(d) = 0.59 Statistik Tage Graz, September 7-9,

29 Bicycle Theft Data Week Statistik Tage Graz, September 7-9,

30 Bicycle Theft Model Week Estimated mean (solid), estimated total number of thefts with confidence interval (dashed) Statistik Tage Graz, September 7-9,

31 Heart Attack Data Distribution log L Pearson BIC D p(d) ˆπ Negative Binomial Beta-Binomial <e Beta-Poisson <e Model selected: Negative Binomial ˆπ = 0.55, p(d) = 0.63 Statistik Tage Graz, September 7-9,

32 Heart Attack Data Count Month Estimated mean (solid), estimated total number of heart attacks with confidence interval (dashed) Statistik Tage Graz, September 7-9,

33 Heart Attack Counts + Cause of Death Counts Count Month Estimated mean (black solid), estimated total number of heart attacks with confidence interval (red dashed) Statistik Tage Graz, September 7-9,

34 Over-Reporting How to estimate the correct number of cases? Statistik Tage Graz, September 7-9,

35 Cameron & Trivedi, 1998 Model dependence between the two Bernoulli variables T 0 : true not reported T 1 : true reported O: over-reported U: under-reported Y : observed count E = Event and R = Recording Bivariate Binomial Random Variable R = 0 R = 1 E = 0 T 0 O N 0 E = 1 U T 1 N 1 N Y Y N Statistik Tage Graz, September 7-9,

36 Independent Sampling In Criminology T 0 does not exist. Therefore we consider independent sampling. R = 0 R = 1 E = 0 R 0 E = 1 U R 1 N and Y = R 0 + R 1 Specify under-reporting model for R 1 and a count model for R 0 Model Y by convolution of both Statistik Tage Graz, September 7-9,

37 A Family of Poisson Convolutions Assume that R 1 is generated by under-reporting, i.e. R 1 N, P Binomial(N, P ), and Then p(y = y = r 0 + r 1 ) = R 0 Poisson(α) r 0 j=0 p 0 (j)p 1 (y j) = r 1 j=0 p 0 (y j)p 1 (j) with E(Y ) = µ 0 + µ 1 = α + λπ var(y ) = σ σ 2 1 = α + var(r 1 ) Statistik Tage Graz, September 7-9,

38 Negative-Binomial-Poisson Convolution For R 1 Negative Binomial(ω, π) Y Delaporte(λ, π, α), λ = ω/(1 π) with where and p(y = y = r 1 + r 0 ) = (1 π)ω α y exp( α) S Γ(ω) S = r 1 j=0 Γ(j + ω) ( π ) j Γ(j + 1)Γ(y j + 1) α E(Y ) = µ 0 + µ 1 = α + λπ var(y ) = σ σ 2 1 = α + λπ(1 π) 1 Statistik Tage Graz, September 7-9,

39 Application to bicycle theft data Data and Mean in grey Week Statistik Tage Graz, September 7-9,

40 Application to bicycle theft data Data and Mean in grey, E(R_0) in red, E(R_1) in green, E(U) in blue Week Statistik Tage Graz, September 7-9,

41 Averaged Results Application to bicycle theft data R = 0 R = 1 E = 0 Ê(R 0 ) = E = 1 Ê(U) = 5.08 Ê(R 1 ) = Ê(N) = and Ê(Y ) = Ê(R 0 + R 1 ) = Reporting probability in the under-reporting model: ˆπ = 0.77 Fraud rate: 14.35/35.53 = 0.40 Statistik Tage Graz, September 7-9,

42 Conclusion highly relevant methodology for wide-spread applications models based on Bernoulli sampling conditional binomial models suitable for cases when var(y ) > µ estimation relies on ML and Hybrid ML non-nested technique for model selection Limitations π 0, i.e. Poisson limit π 1, perfect reporting system Statistik Tage Graz, September 7-9,

43 References Allcroft, D.J. and Glasby, Ch.A. (2003). A simulation-based method for model evaluation, Statistical Modelling, 3, Cameron, A.C. and Trivedi, P.K. (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press. Neubauer, G., Djuraš, G. and Friedl, H. (2011). Models for Underreporting: A Bernoulli Sampling Approach for Reported Counts. Austrian Journal of Statistics, 40, Winkelmann, R. (2000). Econometric Analysis of Count Data. Berlin: Springer. Statistik Tage Graz, September 7-9,

Models for Underreporting: A Bernoulli Sampling Approach for Reported Counts

AUSTRIAN JOURNAL OF STATISTICS Volume 40 (2011), Number 1 & 2, 85-92 Models for Underreporting: A Bernoulli Sampling Approach for Reported Counts Gerhard Neubauer 1, Gordana Djuraš 1 and Herwig Friedl