Reparametrization of COM-Poisson Regression Models with Applications in the Analysis of Experimental Count Data

Size: px

Start display at page:

Download "Reparametrization of COM-Poisson Regression Models with Applications in the Analysis of Experimental Count Data"

Samantha Holland
5 years ago
Views:

1 Reparametrization of COM-Poisson Regression Models with Applications in the Analysis of Experimental Count Data Eduardo Elias Ribeiro Junior 1 2 Walmes Marques Zeviani 1 Wagner Hugo Bonat 1 Clarice Garcia Borges Demétrio 2 John Hinde 3 1 Statistics and Geoinformation Laboratory (LEG-UFPR) 2 Department of Exact Sciences (ESALQ-USP) 3 School of Mathematics, Statistics and Applied Mathematics (NUI-Galway) 27th June 2018 jreduardo@usp.br edujrrib@gmail.com

2 Outline 1. Background 2. Reparametrization 3. Simulation study Final remarks Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 0

3 Background 1 Background Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 0

4 Count data Background Number of times an event occurs in the observation unit. Random variables that assume non-negative integer values. Let Y be a counting random variable, so that y = 0, 1, 2,... Examples in experimental researches: number of grains produced by a plant; number of fruits produced by a tree; number of insects on a particular cell; others. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 1

5 Background Poisson model and limitations GLM framework (Nelder & Wedderburn 1972) Provide suitable distribution for a counting random variables; Efficient algorithm for estimation and inference; Implemented in many software. Poisson model Relationship between mean and variance, E(Y) = Var(Y); Main limitations Overdispersion (more common), E(Y) < Var(Y) Underdispersion (less common), E(Y) > Var(Y) Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 2

6 Background COM-Poisson distribution Probability mass function (Shmueli et al. 2005) takes the form Pr(Y = y λ, ν) = where λ > 0 and ν 0. Moments are not available in closed form; λ y (y!) ν Z(λ, ν), Z(λ, ν) = λ j (j!) j=0 ν, Expectation and variance can be closely approximated by E(Y) λ 1/ν ν 1 2ν and Var(Y) λ1/ν ν with accurate approximations for ν 1 or λ > 10 ν (Shmueli et al. 2005, Sellers et al. 2012). Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 3

7 Background COM-Poisson regression models Model definition Modelling the relationship between E(Y i ) and x i indirectly (Sellers & Shmueli 2010); Y i x i COM-Poisson(λ i, ν) η(e(y i x i )) = log(λ i ) = x i β Main goals Study distribution properties in terms of i) modelling real count data and ii) inference aspects. Propose a reparametrization in order to model the expectation of the response variable as a function of the covariate values directly. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 4

8 Reparametrization 2 Reparametrization Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 4

9 Reparametrization Reparametrized COM-Poisson Reparametrization Introduced new parameter µ, using the mean approximation µ = λ 1/ν ν 1 ( ) (ν 1) ν λ = µ + ; 2ν 2ν Precision parameter is taken on the log scale to avoid restrictions on the parameter space Probability mass function φ = log(ν) φ R; Replacing λ and ν as function of µ and φ in the pmf of COM-Poisson ( ) ye φ Pr(Y = y µ, φ) = µ + eφ 1 (y!) e φ 2e φ Z(µ, φ). Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 5

10 Reparametrization Study of the moments approximations µ 15 µ (a) φ 5 (b) φ Figure: Quadratic errors for the approximation of the (a) expectation and (b) variance. Dotted lines represent the restriction for suitable approximations given by Shmueli et al. (2005). Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 6

11 Reparametrization COM-Poisson µ distribution φ = 0.7 φ = 0.0 φ = 0.7 µ = 2 µ = 8 µ = P(Y = y) y Figure: Shapes of the COM-Poisson distribution for different parameter values. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 7

12 Reparametrization Properties of COM-Poisson distribution To explore the flexibility of the COM-Poisson distribution, we consider the follow indexes: Dispersion index: DI = Var(Y)/E(Y); Zero-inflation index: ZI = 1 + log Pr(Y = 0)/E(Y); Heavy-tail index: HT = Pr(Y = y + 1)/ Pr(Y = y), for y. These indexes are interpreted in relation to the Poisson distribution: over- (DI > 1), under- (DI < 1) and equidispersion (DI = 1); zero-inflation (ZI > 0) and zero-deflation (ZI < 0) and heavy-tail distribution for HT 1 when y. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 8

13 Reparametrization Properties of COM-Poisson distribution φ φ φ φ Mean Variance Dispersion index Zero inflation index Heavy tail index µ (a) µ (b) µ (c) y Figure: Indexes for COM-Poisson distribution. (a) Mean and variance relationship, (b d) dispersion, zero-inflation and heavy-tail indexes for different parameter values. Dotted lines represents the Poisson special case. (d) Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 9

14 Reparametrization COM-Poisson µ regression models Let y i a set of independent observations from the COM-Poisson and x i = (x i1, x i2,..., x ip ) is a vector of known covariates, i = 1, 2,..., n. Model definition Modelling relationship between E(Y i ) and x i directly Y i x i COM-Poisson µ (µ i, φ) log(e(y i x i )) = log(µ i ) = x i β Log-likelihood function (l = l(β, φ y)) [ n ( ) ] l = e φ y i log µ i + eφ n 1 2e i=1 φ log(y i!) where µ i = exp(x i β) i=1 n i=1 log(z(µ i, φ)) Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 10

15 Reparametrization Estimation and inference The estimation and inference is based on the method of maximum likelihood. Let θ = (β, φ) the model parameters. Parameter estimates are obtained by numerical maximization of the log-likelihood function (by BFGS algorithm); l( ˆθ) = max l(θ), θ R p+1 ; Standard errors for regression coefficients are obtained based on the observed information matrix; Var( ˆθ) = H 1, where H is the matrix of second partial derivatives at ˆθ; Confidence intervals for ˆµ i are obtained by delta method.. Var[g( ˆθ)] = G Var( ˆθ)G, where G = ( g/ β 1,..., g/ β p ) ; The Hessian matrix H is obtained numerically by finite differences. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 11

16 Simulation study 3 Simulation study Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 11

17 Simulation study Definitions on the simulation study Objective: assess the properties of maximum likelihood estimators and orthogonality in the reparametrized model; Simulation: we consider counts generated according a regression model with a continuous and categorical covariates and different dispersion scenarios. Algorithm 1: Steps in simulation study. for n {50, 100, 300, 1000} do set x 1 as a sequence, with n elements, between 0 and 1; set x 2 as a repetition, with n elements, of three categories; compute µ using µ = exp(β 0 + β 1 x 1 + β 21 x 21 + β 22 x 22 ); for φ { 1.6, 1.0, 0.0, 1.8} do repeat simulate y from COM-Poisson distribution with µ and φ parameters; fit COM-Poisson µ regression model to simulated y; get ˆθ = ( ˆφ, ˆβ 0, ˆβ 1, ˆβ 21, ˆβ 22 ); get confidence intervals for ˆθ based on the observed information matrix. until 1000 times; Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 12

18 Simulation study Definitions on the simulation study Level of categorical variable (x 2 ) a b c φ = φ = Average count Dispersion index 4 3 φ = 0 φ = Values of continous variable (x 1 ) Values of continous variable (x 1 ) Figure: Average counts (left) and dispersion indexes (right) for each scenario considered in the simulation study. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 13

19 Simulation study Bias of the estimators Sample size n=50 n=100 n=300 n= φ = 1.6 φ = 1 φ = 0 φ = 1.8 ^ β22 ^ β21 ^ β1 ^ β0 ^ φ Standardized Bias Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 14

20 Simulation study Coverage rate of the confidence intervals φ = 1.6 φ = 1 φ = 0 φ = 1.8 ^ φ ^ β0 ^ β ^ β21 ^ β22 Coverage rate Sample size Figure: Coverage rate based on confidence intervals obtained by quadratic approximation for different sample sizes and dispersion levels. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 15

21 Simulation study Orthogonality property of the MLEs φ φ Original parametrization Proposed parametrization φ = φ = φ = φ = λ µ φ = φ = φ = φ = Figure: Deviance surfaces contour plots under original and proposed parametrization. The ellipses are confidence regions (90, 95 and 99%), dotted lines are the maximum likelihood estimates, and points are the real parameters used in the simulation. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 16

22 4 Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 16

23 Motivating data sets and data analysis Three illustrative examples of count data analysis are reported. Assessing toxicity of nitrofen in aquatic systems, an equidispersed example; Soil moisture and potassium doses on soybean culture, an overdispersed example; and Artificial defoliation in cotton phenology, an underdispersed example. In the data analysis, we consider the COM-Poisson model in the two forms (original and new parametrization) and the quasi-poisson regression model as alternative models for the standard Poisson regression model. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 17

24 Artificial defoliation in cotton phenology 4.1 Artificial defoliation in cotton phenology Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 17

Cotton bolls data Artificial defoliation in cotton phenology Aim: to assess the effects of five defoliation levels on the bolls produced at five growth stages; Design: factorial 5 5, with 5

25 Cotton bolls data Artificial defoliation in cotton phenology Aim: to assess the effects of five defoliation levels on the bolls produced at five growth stages; Design: factorial 5 5, with 5 replicates; Experimental unit: a plot with 2 plants; Factors: Artificial defoliation (des): Growth stage (est): Response variable: Total number of cotton bolls; Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 18

26 Model specification Artificial defoliation in cotton phenology Linear predictor: following Zeviani et al. (2014) log(µ ij ) = β 0 + β 1j def i + β 2j def 2 i i varies in the levels of artificial defoliation; j varies in the levels of growth stages. Alternative models: Poisson (µ ij ); COM-Poisson (λ ij = η(µ ij ), φ) COM-Poisson µ (µ ij, φ) Quasi-Poisson (var(y ij ) = σµ ij ) Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 19

27 Parameter estimates Artificial defoliation in cotton phenology Table: Parameter estimates (Est) and ratio between estimate and standard error (SE). Poisson COM-Poisson COM-Poisson µ Quasi-Poisson Est Est/SE Est Est/SE Est Est/SE Est Est/SE φ, σ β β β β β β β β β β β LogLik AIC BIC Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 20

28 Fitted curves Artificial defoliation in cotton phenology vegetative Poisson COM Poisson COM Poisson µ Quasi Poisson flower bud blossom boll boll open Number of bolls produced Artificial defoliation level Figure: Scatterplots of the observed data and curves of fitted values with 95% confidence intervals as functions of the defoliation level for each growth stage. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 21

29 Soil moisture and potassium doses on soybean culture 4.2 Soil moisture and potassium doses on soybean culture Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 21

Soybean data Soil moisture and potassium doses on soybean culture Aim: evaluate the effects of potassium doses applied to soil in different soil moisture levels; Design: factorial 5 3 in a randomized

30 Soybean data Soil moisture and potassium doses on soybean culture Aim: evaluate the effects of potassium doses applied to soil in different soil moisture levels; Design: factorial 5 3 in a randomized complete block design (5 blocks); Experimental unit: a pot with a plant; Factors: Potassium fertilization dose (K): Soil moisture level (umid): Response variable: Total number of bean seeds per pot; Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 22

31 Model specification Soil moisture and potassium doses on soybean culture Linear predictor: based on descriptive analysis, log(µ ijk ) = β 0 + γ i + τ j + β 1 K k + β 2 K 2 k + β 3jK k i varies according the blocks; j varies in the levels of soil moisture; k varies in the levels of potassium fertilization. Alternative models: Poisson (µ ij ); COM-Poisson (λ ij = η(µ ij ), φ) COM-Poisson µ (µ ij, φ) Quasi-Poisson (var(y ij ) = σµ ij ) Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 23

32 Parameter estimates Soil moisture and potassium doses on soybean culture Table: Parameter estimates (Est) and ratio between estimate and standard error (SE). Poisson COM-Poisson COM-Poisson µ Quasi-Poisson Est Est/SE Est Est/SE Est Est/SE Est Est/SE φ, σ β γ γ γ γ τ τ β β β β LogLik AIC BIC Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 24

33 Fitted curves Soil moisture and potassium doses on soybean culture Poisson COM Poisson COM Poisson µ Quasi Poisson moisture : 37.5% moisture : 50% moisture : 62.5% Number of grains per pot Potassium fertilization level Figure: Dispersion diagrams of been seeds counts as function of potassium doses and humidity levels with fitted curves and confidence intervals (95%). Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 25

34 Assessing toxicity of nitrofen in aquatic systems 4.3 Assessing toxicity of nitrofen in aquatic systems Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 25

Nitrofen data Assessing toxicity of nitrofen in aquatic systems Aim: measure the reproductive toxicity of the herbicide nitrofen on a species of zooplankton (Ceriodaphnia dubia); Design: completely

35 Nitrofen data Assessing toxicity of nitrofen in aquatic systems Aim: measure the reproductive toxicity of the herbicide nitrofen on a species of zooplankton (Ceriodaphnia dubia); Design: completely randomized design, with 10 replicates; Experimental unit: zooplankton animal; Factors: herbicide nitrofen dose (dose); Response variable: Total number of live offspring; Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 26

36 Model specification Assessing toxicity of nitrofen in aquatic systems Linear predictors: Linear: log(µ i ) = β 0 + β 1 dose i, Quadratic: log(µ i ) = β 0 + β 1 dose i + β 2 dose 2 i and Cubic: log(µ i ) = β 0 + β 1 dose i + β 2 dose 2 i + β 3dose 3 i. Alternative models: Poisson (µ ij ); COM-Poisson (λ ij = η(µ ij ), φ) COM-Poisson µ (µ ij, φ) Quasi-Poisson (var(y ij ) = σµ ij ) Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 27

37 Likelihood ratio tests Assessing toxicity of nitrofen in aquatic systems Table: Model fit measures and comparisons between linear predictors. Poisson np l AIC 2(diff l) diff np P(> χ 2 ) Linear Quadratic E 16 Cubic E 02 COM-Poisson np l AIC 2(diff l) diff np P(> χ 2 ) ˆφ Linear Quadratic E Cubic E COM-Poisson µ np l AIC 2(diff l) diff np P(> χ 2 ) ˆφ Linear Quadratic E Cubic E Quasi-Poisson np QDev AIC F diff np P(> F) ˆσ Linear Quadratic E Cubic E Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 28

38 Assessing toxicity of nitrofen in aquatic systems Parameter estimates and fitted values Table: Parameter estimates (Est) and ratio between estimate and standard error (SE). Poisson COM-Poisson COM-Poisson µ Quasi-Poisson Est Est/SE Est Est/SE Est Est/SE Est Est/SE β β β β Number of live offspring Poisson COM Poisson COM Poisson µ Quasi Poisson Nitrofen concentration level Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 29

39 4.4 Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 29

40 Additional results To compare the computational times on the two parametrizations we repeat the fitting 50 times. Time (seconds) 1.5e e e e e+09 Cotton (under) COM Poisson COM Poisson µ Nitrofen (equi) COM Poisson COM Poisson µ 1.0e e e+10 Soybean (over) COM Poisson COM Poisson µ Figure: Computational times to fit the models under original and reparametrized versions based on the fifty repetitions. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 30

41 Final remarks 5 Final remarks Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 30

42 Final remarks Concluding remarks Summary Over/under-dispersion needs caution; COM-Poisson is a suitable choice for these situations; The proposed reparametrization, COM-Poisson µ has some advantages: Simple transformation of the parameter space; Leads to the orthogonality of the parameters (seen empirically); Full parametric approach; Empirical correlation between the estimators was practically null; Faster for fitting; Allows interpretation of the coefficients directly (like GLM-Poisson model). Future work Simulation study to assess model robustness against distribution miss specification; Assess theoretical approximations for Z(λ, ν) (or Z(µ, φ)), in order to avoid the selection of sum s upper bound; Propose a mixed GLM based on the COM-Poisson µ model. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 31

Notes Full-text article is available on arxiv https://arxiv.

09795 All codes (in R) and source files are available on GitHub

com/jreduardo/article-reparcmp Acknowledgements National

43 Notes Full-text article is available on arxiv All codes (in R) and source files are available on GitHub Acknowledgements National Council for Scientific and Technological Development (CNPq), for their support. Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 32

44 References Bibliography Nelder, J. A. & Wedderburn, R. W. M. (1972), Generalized Linear Models, Journal of the Royal Statistical Society. Series A (General) 135, Sellers, K. F., Borle, S. & Shmueli, G. (2012), The com-poisson model for count data: a survey of methods and applications, Applied Stochastic Models in Business and Industry 28(2), Sellers, K. F. & Shmueli, G. (2010), A flexible regression model for count data, Annals of Applied Statistics 4(2), Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S. & Boatwright, P. (2005), A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution, Journal of the Royal Statistical Society. Series C: Applied Statistics 54(1), Zeviani, W. M., Ribeiro Jr, P. J., Bonat, W. H., Shimakura, S. E. & Muniz, J. A. (2014), The Gamma-count distribution in the analysis of experimental underdispersed data, Journal of Applied Statistics pp Clarice Garcia Borges Demétrio Reparametrization of COM-Poisson Regression Models Slide 33

Extended Poisson Tweedie: Properties and regression models for count data

Extended Poisson Tweedie: Properties and regression models for count data Wagner H. Bonat 1,2, Bent Jørgensen 2,Célestin C. Kokonendji 3, John Hinde 4 and Clarice G. B. Demétrio 5 1 Laboratory of Statistics