The LmB Conferences on Multivariate Count Analysis

1 The LmB Conferences on Multivariate Count Analysis Title: On Poisson-exponential-Tweedie regression models for ultra-overdispersed count data Rahma ABID, C.C. Kokonendji & A. Masmoudi Address: Besançon:

2 Some related works:
- Abid et al. (2018a). Geometric dispersion models with quadratic v-functions. Submission needing revision.
- Abid et al. (2018b). Geometric Tweedie regression models for continuous and semicontinuous data with variation phenomenon. Submitted for publication.
- Abid et al. (2018c). On Poisson-exponential-Tweedie regression models for ultra-overdispersed count data. To be submitted.
- Abid et al. (2018d). Multivariate Poisson-exponential-Tweedie regression models for the analysis of maintenance building data. Work in progress.
Rahma ABID, Poisson-exponential-Tweedie regression models 2

3 Outline:
1. Introduction: Motivations
2. Poisson-exponential-Tweedie (PET) models (= Geometric Poisson-Tweedie)
3. PET regression models
4. Simulation studies and applications
5. Conclusion & Perspectives

4 1. Introduction: Motivations
Overdispersion is frequent and can be induced by zero-inflation; both phenomena are defined with respect to the Poisson distribution; see, e.g., [a, b, c, d]. Count models have been built through compounding and mixing of Poisson distributions.
1) What should be done when the degree of overdispersion is very high?
2) Should its measure be relativized with respect to a reference count distribution other than the Poisson?
3) How can a new family of ultra-overdispersed count models be built?
a Hinde, J. and Demétrio, C.G.B. (1998). Overdispersion: Models and Estimation. Associação Brasileira de Estatística, São Paulo.
b Kokonendji, C.C., Dossou-Gbété, S. and Demétrio, C.G.B. (2004). Some discrete exponential dispersion models: Poisson-Tweedie and Hinde-Demétrio classes. Statistics and Operations Research Transactions 28.
c Kokonendji, C.C., Demétrio, C.G.B. and Zocchi, S.S. (2007). On Hinde-Demétrio regression models for overdispersed count data. Statistical Methodology 4.
d Bonat, W.H., Jørgensen, B., Kokonendji, C.C., Hinde, J. and Demétrio, C.G.B. (2018). Extended Poisson-Tweedie: properties and regression models for count data. Statistical Modelling 18.

5 Example of application
Application to repairable systems in reliability [e]: data on the number of buildings subject to maintenance (e.g., plumbing, roof, heating, cooling system, etc.). The buildings belong to one system that is 46 years old: $(Y_1, \ldots, Y_i, \ldots, Y_{46})$.
Dispersion index (variance / mean): the data are ultra-overdispersed.
Some related references:
- Jørgensen, B. and Kokonendji, C.C. (2011). Dispersion models for geometric sums. Brazilian Journal of Probability and Statistics 25.
- Jørgensen, B. and Kokonendji, C.C. (2016). Discrete dispersion models and their Tweedie asymptotics. AStA Advances in Statistical Analysis 100.
e Yeoeman, A. (1987). Forecasting Building Maintenance Using the Weibull Process. M.S. Thesis, University of Missouri-Rolla, United States.

7 Geometric sums of count random variables
A geometric sum of Poisson-Tweedie models is defined by
$$Y = \sum_{l=1}^{G} PT_l,$$
where $PT_1, PT_2, \ldots$ are i.i.d. copies of a Poisson-Tweedie random variable $PT$ and $G \sim \mathrm{Geom}(q)$.
Interpretation: the number of building maintenance actions $Y_i$ at age $i$ decomposes as a geometric sum of maintenance actions per building.
For $p \in \{0\} \cup [1, \infty)$, the Poisson-Tweedie class $PTw_p(\tilde{m}, \phi)$ is defined by
$$Z \sim Tw_p(\tilde{m}, \phi) \quad\text{and}\quad PT \mid Z \sim \mathrm{Poisson}(Z),$$
so that $PT \sim PTw_p(\tilde{m}, \phi)$ has moments $E\,PT = \tilde{m}$ and $\mathrm{Var}\,PT = \tilde{m} + \phi\,\tilde{m}^p$.
Given the expectation $m = E\,Y$, the variance of $Y$ is of the form $\mathrm{Var}\,Y = m + m^2 + \phi m^p$.

8 2. Poisson-exponential-Tweedie (PET) models (= Geometric Poisson-Tweedie)
The class of Exponential-Poisson-Tweedie:
$$X \sim \mathrm{Exp}(1), \quad [Y \mid X] \mid Z \sim \mathrm{Poisson}(Z) \quad\text{and}\quad Z \sim Tw_p(Xm, X^{1-p}\phi). \qquad (1)$$
The class of Poisson-exponential-Tweedie ($PETw_p(m, \phi)$):
$$Y \mid Z \sim \mathrm{Poisson}(Z), \quad Z \sim Tw_p(Xm, X^{1-p}\phi) \quad\text{and}\quad X \sim \mathrm{Exp}(1). \qquad (2)$$
Proposition (Abid et al., 2018c). Let $Y_1$ and $Y_2$ be two random variables defined by (1) and (2), respectively. Then
(i) $Y_1$ and $Y_2$ have the same distribution;
(ii) $\mathrm{Var}\,Y_1 = m + m^2 + \phi m^p$.
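Part (ii) can be checked with the law of total variance. The sketch below is not taken from the slides; it assumes the standard Tweedie moments $E\,Z = \mu$ and $\mathrm{Var}\,Z = \phi\mu^p$ for $Z \sim Tw_p(\mu, \phi)$, and uses $E\,X = \mathrm{Var}\,X = 1$ for $X \sim \mathrm{Exp}(1)$.

```latex
\begin{aligned}
E(Y \mid X) &= E(Z \mid X) = Xm, \\
\mathrm{Var}(Y \mid X) &= E(Z \mid X) + \mathrm{Var}(Z \mid X)
   = Xm + X^{1-p}\phi\,(Xm)^p = X\,(m + \phi m^p), \\
\mathrm{Var}\,Y &= E\{\mathrm{Var}(Y \mid X)\} + \mathrm{Var}\{E(Y \mid X)\}
   = (m + \phi m^p)\,E\,X + m^2\,\mathrm{Var}\,X
   = m + m^2 + \phi m^p .
\end{aligned}
```

The $m^2$ term comes entirely from the exponential mixing variable $X$, which is what moves the reference point from the Poisson variance $m$ to the geometric-type variance $m + m^2$.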

10 Table: Summary of PET models, with support $S_p = \mathbb{N}$.

| Type(s) of PET = Geometric PT | p | Type(s) of Tweedie |
|---|---|---|
| Geometric Hermite | p = 0 | Gaussian |
| [Do not exist] | 0 < p < 1 | [Do not exist] |
| Geometric Neyman Type A | p = 1 | Poisson |
| Geometric Poisson compound Poisson | 1 < p < 2 | Gamma compound Poisson |
| Geometric Pólya-Aeppli | p = 3/2 | Non-central gamma |
| Geometric negative binomial | p = 2 | Gamma |
| Geometric Poisson positive stable | p > 2 | Positive stable |
| Geometric Poisson-inverse Gaussian | p = 3 | Inverse Gaussian |

$Y \sim PETw_p(m, \phi)$ has pmf
$$P(Y = y) = \int_0^\infty \int_0^\infty \frac{\exp\{-z - x\}\, z^y}{y!}\; Tw_p(mx, \phi x^{1-p})(z)\, dz\, dx.$$
- No closed form is available.
- Approximation by Monte Carlo integration and Tweedie simulations, rtweedie() in R (Dunn, 2013).
- Estimation and inference based on the likelihood function are difficult.
- Model selection: estimation of parameters by regression.
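The Monte Carlo approximation of the pmf can be sketched in Python rather than with rtweedie(). This is a minimal illustration restricted to the gamma case p = 2, where $Tw_2(xm, \phi/x)$ is a gamma law with shape $x/\phi$ and scale $\phi m$, so only the standard library is needed. The function names `draw_z`, `rpoisson`, `rpet_gamma`, `pmf_mc` and `p0_exact` are hypothetical helpers, not part of any package, and the closed form for $P(Y = 0)$ is derived here for p = 2 only as a sanity check.

```python
import math
import random


def draw_z(m, phi, rng):
    """Draw Z from the mixing hierarchy with p = 2: X ~ Exp(1), Z | X ~ gamma."""
    x = rng.expovariate(1.0)
    # Tw_2(x*m, phi/x) has mean x*m and variance phi*x*m^2: gamma(shape=x/phi, scale=phi*m).
    return rng.gammavariate(x / phi, phi * m) if x > 0 else 0.0


def rpoisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rpet_gamma(n, m, phi, rng):
    """Simulate n PET counts (p = 2 only): Y | Z ~ Poisson(Z)."""
    return [rpoisson(draw_z(m, phi, rng), rng) for _ in range(n)]


def pmf_mc(y, m, phi, n_draws, rng):
    """Monte Carlo estimate of P(Y = y): average the Poisson(z) pmf over simulated z."""
    total = 0.0
    for _ in range(n_draws):
        z = draw_z(m, phi, rng)
        if z > 0:
            total += math.exp(-z + y * math.log(z) - math.lgamma(y + 1))
        elif y == 0:
            total += 1.0  # a Poisson(0) variable is 0 with probability one
    return total / n_draws


def p0_exact(m, phi):
    """P(Y = 0) for p = 2, from E[(1 + phi*m)^(-X/phi)] with X ~ Exp(1)."""
    return 1.0 / (1.0 + math.log(1.0 + phi * m) / phi)


rng = random.Random(2018)
sample = rpet_gamma(20000, m=2.0, phi=0.5, rng=rng)
mean = sum(sample) / len(sample)
var = sum((v - mean) ** 2 for v in sample) / (len(sample) - 1)
print(mean, var)  # theory: m = 2 and m + m^2 + phi*m^2 = 8
print(pmf_mc(0, 2.0, 0.5, 20000, rng), p0_exact(2.0, 0.5))
```

For other values of p the gamma draw would have to be replaced by a general Tweedie sampler, which is the role rtweedie() plays in the slides.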

12 3. PET regression models
Dispersion and zero-inflation indexes with respect to the Poisson distribution [f]:
$$\text{P-DI} = \frac{\mathrm{Var}\,Y}{E\,Y} \quad\text{and}\quad \text{P-ZI} = E\,Y + \log P(Y = 0).$$
Dispersion and zero-inflation indexes with respect to the negative binomial distribution:
$$\text{NB-DI} = \frac{\mathrm{Var}\,Y}{E\,Y + (E\,Y)^2} \quad\text{and}\quad \text{NB-ZI} = \log(1 + E\,Y) + \log P(Y = 0).$$
The heavy-tail index is independent of the reference model:
$$HT = \frac{P(Y = y + 1)}{P(Y = y)} \quad\text{for } y \to \infty.$$
Proposition (Abid et al., 2018c). PET is overdispersed and zero-inflated with respect to P and NB.
f Other definitions: P-DI = $(\mathrm{Var}\,Y - E\,Y)/E\,Y$ and P-ZI = $1 + \log P(Y = 0)/E\,Y$.
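As an illustration, the four indexes can be estimated from a sample by plugging in the sample mean, the sample variance and the observed proportion of zeros. This is a minimal sketch; the function name `count_indexes` is hypothetical, and the use of the unbiased (n - 1) sample variance is an assumption, since the slides do not say which estimator was used.

```python
import math
import statistics


def count_indexes(y):
    """Plug-in estimates of the Poisson and negative binomial reference indexes."""
    mean = statistics.fmean(y)
    var = statistics.variance(y)  # unbiased sample variance (divides by n - 1)
    p0 = sum(1 for v in y if v == 0) / len(y)
    if p0 == 0:
        raise ValueError("no zeros observed: log P(Y = 0) cannot be estimated")
    return {
        "P-DI": var / mean,                          # equals 1 under the Poisson
        "NB-DI": var / (mean + mean ** 2),           # equals 1 when Var = m + m^2
        "P-ZI": mean + math.log(p0),                 # equals 0 under the Poisson
        "NB-ZI": math.log(1 + mean) + math.log(p0),  # equals 0 when P(Y=0) = 1/(1+m)
    }


print(count_indexes([0, 0, 1, 3]))
```

Values of P-DI or NB-DI above 1 indicate overdispersion relative to the corresponding reference, and positive P-ZI or NB-ZI indicate zero-inflation relative to it.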

13 Dispersion indexes of PET w.r. to Poisson and NB
Figure: Dispersion indexes of the PET distribution as a function of m, by dispersion and power parameters.

14 Zero-inflation indexes of PET w.r. to Poisson and NB
Figure: Zero-inflation indexes of the PET distribution as a function of m, by dispersion and power parameters.

16 Estimation and inference: Quasi-likelihood approach
Consider a cross-sectional data set $(y_i, x_i)$, $i = 1, \ldots, n$, where the $y_i$ are independent realizations of $Y_i \sim PETw_p(m_i, \phi)$, and $x_i$ and $\beta$ are $(Q \times 1)$ vectors of known covariates and unknown regression parameters, respectively.
$$E\,Y_i = m_i = \exp(x_i^\top \beta), \qquad \mathrm{Var}\,Y_i = m_i + m_i^2 + \phi m_i^p = V_i.$$
- Models with $-m_i^{2-p} - m_i^{1-p} < \phi < 0$ are permitted, since $V_i$ remains positive.
- In that case there may be no specific probability distribution behind the model.
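The lower limit on $\phi$ is simply the condition $V_i > 0$ solved for $\phi$. A small sketch of that check follows; the function names `pet_variance` and `phi_lower_bound` are hypothetical, and taking the maximum over the fitted means is an assumption about how the constraint would be applied to a whole data set.

```python
def pet_variance(m, phi, p):
    """Variance function V = m + m^2 + phi * m^p of the PET regression model."""
    return m + m ** 2 + phi * m ** p


def phi_lower_bound(means, p):
    """Smallest admissible phi: V_i > 0 for every mean requires phi above this value."""
    return max(-(m ** (1 - p) + m ** (2 - p)) for m in means)


means = [1.0, 2.0]
bound = phi_lower_bound(means, p=2)
print(bound)                                         # the binding mean here is m = 2
print([pet_variance(m, -1.0, 2) for m in means])     # positive, since -1.0 > bound
```

Any negative $\phi$ above the bound gives a variance smaller than $m + m^2$, which corresponds to underdispersion relative to the negative binomial reference.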

17 Dispersion indexes of PET w.r. to Poisson and NB for φ < 0
Figure: Dispersion indexes of the PET distribution, by negative dispersion and power parameters.

18 Dominant features of PET models
Table: Reference models and dominant features, by dispersion and power parameter values, with respect to the Poisson and negative binomial models (ZI: zero-inflation; HT: heavy tail).

| Reference / PET model | Dominant features | Dispersion | Power |
|---|---|---|---|
| Poisson/negative binomial | Equi/Equi | - | |
| Geometric Hermite | Over, under | φ ≷ 0 | p = 0 |
| Geometric Neyman Type A | Over, under, ZI | φ ≷ 0 | p = 1 |
| Geometric Poisson compound Poisson | Over, under, ZI | φ ≷ 0 | 1 < p < 2 |
| Geometric Pólya-Aeppli | Over, under, ZI | φ ≷ 0 | p = 1.5 |
| Geometric negative binomial | Over, under | φ ≷ 0 | p = 2 |
| Geometric Poisson positive stable | Over, HT | φ > 0 | p > 2 |
| Geometric Poisson-inverse Gaussian | Over, HT | φ > 0 | p = 3 |

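19 Estimating function approach
The quasi-score function for $\beta$:
$$\psi_\beta(\beta, \gamma) = \left( \sum_{i=1}^n \frac{\partial m_i}{\partial \beta_1} V_i^{-1} (y_i - m_i), \; \ldots, \; \sum_{i=1}^n \frac{\partial m_i}{\partial \beta_Q} V_i^{-1} (y_i - m_i) \right).$$
The Pearson estimating function for the variance parameters $\gamma = (\phi, p)$:
$$\psi_\gamma(\beta, \gamma) = \left( -\sum_{i=1}^n \frac{\partial V_i^{-1}}{\partial \phi} \{(y_i - m_i)^2 - V_i\}, \; -\sum_{i=1}^n \frac{\partial V_i^{-1}}{\partial p} \{(y_i - m_i)^2 - V_i\} \right).$$
The chaser algorithm (Jørgensen & Knudsen, 2004) is defined by
$$\beta^{(i+1)} = \beta^{(i)} - S_\beta^{-1}\, \psi_\beta(\beta^{(i)}, \gamma^{(i)}), \qquad \gamma^{(i+1)} = \gamma^{(i)} - \alpha\, S_\gamma^{-1}\, \psi_\gamma(\beta^{(i+1)}, \gamma^{(i)}),$$
with
$$S_{\beta_{jk}} = E\left( \frac{\partial}{\partial \beta_k} \psi_{\beta_j}(\beta, \gamma) \right) \quad\text{and}\quad S_{\gamma_{jk}} = -\sum_{i=1}^n \frac{\partial V_i^{-1}}{\partial \gamma_j} V_i \frac{\partial V_i^{-1}}{\partial \gamma_k} V_i.$$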
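The alternating updates can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the power parameter p is held fixed so that $\gamma$ reduces to $\phi$, the first column of the design matrix is assumed to be an intercept (used only for the starting value), $\alpha$ is read as a step-size constant, and a fixed number of iterations replaces a convergence test. The names `solve` and `chaser_fit` are hypothetical. With the log link, $\partial m_i / \partial \beta_j = m_i x_{ij}$, so $S_{\beta_{jk}} = -\sum_i m_i x_{ij} V_i^{-1} m_i x_{ik}$, and for $\phi$ the Pearson function becomes $\sum_i m_i^p \{(y_i - m_i)^2 - V_i\} / V_i^2$ with sensitivity $-\sum_i m_i^{2p} / V_i^2$.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def chaser_fit(y, X, p, phi0=1.0, alpha=1.0, iters=50):
    """Alternate a quasi-score step for beta and a Pearson step for phi (p fixed)."""
    q = len(X[0])
    beta = [0.0] * q
    beta[0] = math.log(sum(y) / len(y))  # assumes column 0 of X is the intercept
    phi = phi0

    def moments(beta, phi):
        m = [math.exp(sum(b * xj for b, xj in zip(beta, xi))) for xi in X]
        V = [mi + mi ** 2 + phi * mi ** p for mi in m]
        return m, V

    for _ in range(iters):
        m, V = moments(beta, phi)
        # Quasi-score psi_beta and its sensitivity matrix S_beta.
        psi = [sum(mi * xi[j] * (yi - mi) / Vi for yi, xi, mi, Vi in zip(y, X, m, V))
               for j in range(q)]
        S = [[-sum(mi * xi[j] * mi * xi[k] / Vi for xi, mi, Vi in zip(X, m, V))
              for k in range(q)] for j in range(q)]
        beta = [b - s for b, s in zip(beta, solve(S, psi))]
        # Pearson step for phi, evaluated at the updated beta.
        m, V = moments(beta, phi)
        psi_phi = sum(mi ** p * ((yi - mi) ** 2 - Vi) / Vi ** 2
                      for yi, mi, Vi in zip(y, m, V))
        S_phi = -sum(mi ** (2 * p) / Vi ** 2 for mi, Vi in zip(m, V))
        phi = phi - alpha * psi_phi / S_phi
    return beta, phi


# Intercept-only check: the estimating equations then reduce to m = mean(y)
# and V = mean squared residual, so the fit can be verified by hand.
beta_hat, phi_hat = chaser_fit([0, 1, 2, 3, 14], [[1.0]] * 5, p=2)
print(beta_hat, phi_hat)
```

In the intercept-only case the mean estimate is the sample mean and $\phi$ follows from matching $V$ to the mean squared residual, which gives a direct way to check the updates. Estimating p as well would add the second component of $\psi_\gamma$ and a 2 x 2 sensitivity matrix $S_\gamma$.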

20 4.1 Simulation studies
The expectation and the variance of the PET random variable are given by
$$m_i = \exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}) \quad\text{and}\quad V_i = m_i + m_i^2 + \phi m_i^p,$$
where $x_1$ and $x_2$ are sequences from -1 to 1 with length equal to the sample size.
- The regression coefficients were fixed at the values $\beta_0 = 1$, $\beta_1 = 1$ and $\beta_2 = 0.9$.
- We use different sample sizes (n = 500, 1000 and 5000), generating 1000 data sets in each case.
- We considered three values of the Tweedie power parameter, p = 1.5, 2, 3, combined with three values of the dispersion parameter, φ = 0.5, 1, 1.5.

21 Average bias for parameters
Figure: Average bias for each parameter, by sample size and simulation scenario.

22 Confidence intervals for parameters
Figure: Confidence intervals for each parameter, by simulation scenario.

24 4.2 Three Applications
Accidents of private cars in Switzerland (Klugman et al., 2004): data analysed in Aryuyuen and Bodhisuwan (2013) using the NB-GE distribution.
- P-DI = 1.154 (P-overdispersed); NB-DI: NB-equidispersed.
- P-ZI = 0.154 (P-zero-inflated); NB-ZI: NB-zero-deflated.
Table: Parameter estimates for different models. Columns: number of accidents; observed frequencies; fitted frequencies under the Poisson, NB, NB-GE, PT and PET distributions. Estimated parameters: λ (Poisson); r, q (NB); r, α, β (NB-GE); p, φ, m (PT); p, φ, m (PET). The last rows give the chi-square statistics and the p-values.

25 Accident occurrence in car insurance on Tunisian data: data from an insurer operating in the automobile insurance market in Tunisia.
- P-DI = 9.100; NB-DI: NB-overdispersed.
- P-ZI = 0.268; NB-ZI: NB-zero-inflated.
Table: Parameter estimates and standard errors (SE, in parentheses) for PET and PT models; pAIC for models.

| Parameter | PET | PT |
|---|---|---|
| Intercept | (0.111) | (0.111) |
| Car age | (0.003) | (0.003) |
| Car power | (0.009) | (0.009) |
| Driver age | (0.002) | (0.002) |
| φ | (0.093) | (0.096) |
| p | (0.149) | (0.075) |
| pAIC | | |

26 Buildings maintenance data (Yeoeman, 1987): data on the number of occurrences of repairs for buildings.
The number of maintenance actions for all buildings is available for four years: 1982, 1983, 1984 and 1985. For a given year i, the total number of building maintenance actions $Y_i$ in the i-th time frame follows the PET model $PETw_p(m_i, \phi)$, $i = 1, \ldots, 46$.
Table: Estimated dispersion and zero-inflation indexes of the datasets.

| Dataset | P-DI | NB-DI | P-ZI | NB-ZI |
|---|---|---|---|---|
| No 1 (1982) | | | | |
| No 2 (1983) | | | | |
| No 3 (1984) | | | | |
| No 4 (1985) | | | | |

27 Table: Parameter estimates and standard errors (SE, in parentheses) for PET and PT, the latter indicated by italic symbols; pAIC for models.

| Parameter | No 1 (1982) | No 2 (1983) | No 3 (1984) | No 4 (1985) |
|---|---|---|---|---|
| Intercept | (0.299) | (0.048) | (0.298) | (0.300) |
| | (0.299) | (0.477) | (0.298) | (0.300) |
| Age | (0.010) | (0.058) | (0.111) | (0.111) |
| | (0.010) | (0.144) | (0.110) | (0.111) |
| φ | (0.000) | (0.001) | (0.001) | (0.000) |
| | (0.312) | (1.336) | (0.372) | (0.131) |
| p | (0.000) | (0.035) | (0.004) | (0.000) |
| | (0.007) | (0.014) | (0.002) | (0.000) |
| pAIC | | | | |

29 5. Conclusion & Perspectives
1. Model selection within the PET class to deal with ultra-overdispersed count data.
2. Negative binomial dispersion and zero-inflation indexes relativize the ultra-overdispersion and the excess of zeros.
3. Multivariate version of PET sums:
$$S(Q; PT) = \left( \sum_{l=1}^{G_1(Q)} PT_{l1}, \; \ldots, \; \sum_{l=1}^{G_k(Q)} PT_{lk} \right),$$
where $Q = \{q_{ij}\}_{i,j=1}^{k}$ is a suitable matrix of parameters. Independent or correlated components $(G_j(Q))_{j=1}^{k}$?
4. Let $Y_1, \ldots, Y_n$ be response vectors on $\mathbb{N}^d$, $d \geq 1$, with
$$E\,Y_i = m_i = G^{-1}(X_i \beta), \qquad \mathrm{cov}(Y_i, Y_j) = \Sigma_i^{1/2} (\rho_{ij} I_d)\, \Sigma_j^{1/2},$$
$$\Sigma_i = \mathrm{diag}_d(m_i) + \mathrm{diag}_d(m_i^2) + \mathrm{diag}_d(m_i^p)^{1/2}\, \mathrm{diag}_d(\phi)\, \mathrm{diag}_d(m_i^p)^{1/2}.$$

30 Further references
1. Abid, R., Kokonendji, C.C. and Masmoudi, A. (2018a). Geometric dispersion models with quadratic v-functions. Submission needing revision.
2. Abid, R., Kokonendji, C.C. and Masmoudi, A. (2018b). Geometric Tweedie regression models for continuous and semicontinuous data with variation phenomenon. Submitted for publication.
3. Abid, R., Kokonendji, C.C. and Masmoudi, A. (2018c). On Poisson-exponential-Tweedie regression models for ultra-overdispersed count data. To be submitted.
4. Dunn, P.K. (2013). Tweedie exponential family models. R package.
5. Jørgensen, B. and Knudsen, S.J. (2004). Parameter orthogonality and bias adjustment for estimating functions. Scandinavian Journal of Statistics 31.
6. Klugman, S.A., Panjer, H.H. and Willmot, G.E. (2004). Loss Models: From Data to Decisions, 2nd edn. Wiley, Hoboken, NJ.
7. Tweedie, M.C.K. (1984). An index which distinguishes between some important exponential families. In Statistics: Applications and New Directions. Proceedings of the Indian Statistical Institute Golden Jubilee International Conference (J. K. Ghosh and J. Roy, eds.), Indian Statistical Institute, Calcutta.
Thank you

31 Geometric dispersion models
(0) Let $\mu$ be a probability measure.
(1) The geometric cumulant function of $\mu$ (Jørgensen & Kokonendji, 2011) is
$$C_\mu(\theta) = 1 - \frac{1}{L_\mu(\theta)} \quad\text{on}\quad \Theta(\mu) = \{\theta \in \mathbb{R};\; 0 < L_\mu(\theta) < \infty\}.$$
(2) In general, the map $\theta \mapsto C'_\mu(\theta)$ is not strictly monotone on $\Theta(\mu)$.
- Let $\widetilde{\Theta}(\mu) \subseteq \Theta(\mu)$ be an interval on which $C'_\mu$ is strictly monotone.
- The map $\theta \mapsto C'_\mu(\theta)$ is a diffeomorphism between $\widetilde{\Theta}(\mu)$ and $C'_\mu(\widetilde{\Theta}(\mu)) =: \Phi_\mu$. Denote $\varphi_\mu := (C'_\mu)^{-1}$.
(3) v-function: $m \mapsto V_\mu(m) = C''_\mu(\varphi_\mu(m))$ on $\Phi_\mu$.
- Denote $\Phi^-_\mu := C'_\mu(\Theta^-(\mu))$ and $\Phi^+_\mu := C'_\mu(\Theta^+(\mu))$. Then $V_\mu(m) < 0$ on $\Phi^-_\mu$ and $V_\mu(m) > 0$ on $\Phi^+_\mu$.
Note: $V_\mu = \mathrm{Var} - E^2$.

32 How to derive GDMs from EDMs with $\Theta^{\pm}(\mu) = \Theta(\mu)$?
Let $\mu$ be a probability measure.
(1) If there exists a probability measure $\nu$ such that $C_\mu(\theta) = K_\nu(\theta)$, then $\Theta^+(\mu) = \{\theta \in \Theta(\mu);\; C''_\mu(\theta) = K''_\nu(\theta) > 0\} = \Theta(\mu)$. For $\nu(\mu)$, see the Proposition below.
(2) If there exists a probability measure $\nu$ such that $C_\mu(\theta) = -K_\nu(\theta)$, then $\Theta^-(\mu) = \{\theta \in \Theta(\mu);\; C''_\mu(\theta) = -K''_\nu(\theta) < 0\} = \Theta(\mu)$. $\nu(\mu)$?!
Proposition (Exponential mixture distributions). Let $\mu$ be a probability measure and $\nu$ an infinitely divisible σ-finite positive measure. Consider $F(\nu) = \{P(\theta, \nu);\; \theta \in \Theta(\nu)\}$. The following assertions are equivalent:
(i) For all $m \in \Phi^+_\mu$, $V_\mu(m) = V_{F(\nu)}(m)$.
(ii) The measure $\mu$ is an exponential mixture measure
$$\mu(dy) = \int_0^\infty e^{-x}\, P(\gamma, \nu_x)(dy)\, dx, \qquad (3)$$
with $\gamma \in \Theta(\nu)$, where $\nu_x$ denotes the x-th convolution power of $\nu$, that is, $L_{\nu_x}(\theta) = (L_\nu(\theta))^x$.
Note: Under the assumption of infinite divisibility, the corresponding exponential mixture has a v-function identical to the variance function $V_{F(\nu)}$.
