Flexible Tweedie regression models for continuous data


arXiv: v1 [stat.ME] 12 Sep 2016

Wagner H. Bonat and Célestin C. Kokonendji

Abstract

Tweedie regression models provide a flexible family of distributions to deal with non-negative highly right-skewed data as well as symmetric and heavy tailed data, and can handle continuous data with probability mass at zero. The estimation and inference of Tweedie regression models based on the maximum likelihood method are challenged by the presence of an infinite sum in the probability function and non-trivial restrictions on the power parameter space. In this paper, we propose two approaches for fitting Tweedie regression models, namely, quasi- and pseudo-likelihood. We discuss the asymptotic properties of the two approaches and perform simulation studies to compare our methods with the maximum likelihood method. In particular, we show that the quasi-likelihood method provides asymptotically efficient estimation for regression parameters. The computational implementation of the alternative methods is faster and easier than the orthodox maximum likelihood, relying on a simple Newton scoring algorithm. Simulation studies showed that the quasi- and pseudo-likelihood approaches present estimates, standard errors and coverage rates similar to the maximum likelihood method. Furthermore, the second-moment assumptions required by the quasi- and pseudo-likelihood methods enable us to extend the Tweedie regression models to the class of quasi-Tweedie regression models in Wedderburn's style. Moreover, they allow us to eliminate the non-trivial restriction on the power parameter space, and thus provide a flexible regression model to deal with continuous data. We provide an R implementation and illustrate the application of Tweedie regression models using three data sets.

Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark. Department of Statistics, Paraná Federal University, Curitiba, Paraná, Brazil. wbonat@ufpr.br
Université de Franche-Comté, Laboratoire de Mathématiques de Besançon, Besançon, France. celestin.kokonendji@univ-fcomte.fr

1 Introduction

Statistical modelling is one of the most important areas of applied statistics, with applications in many fields of scientific research, such as sociology, economy, ecology, agronomy, insurance and medicine, to cite but a few. There exists an infinity of statistical modelling frameworks, but the class of Generalized Linear Models (GLM) (Nelder and Wedderburn; 1972) has been the most used over the last four decades. The success of this approach is due to its ability to deal with different types of response variables, such as binary, count and continuous, in a general framework with a powerful scheme for estimation and inference based on the likelihood paradigm. Special cases of the GLM class include the Gaussian linear model to deal with continuous data, gamma and inverse Gaussian regression models for handling positive continuous data, and logistic and Poisson regression models for dealing with binary or binomial and count data, respectively.

These models are linked, since they belong to the class of exponential dispersion models (Jørgensen; 1987, 1997), and share the property of being described by their first two moments, mean and variance. Furthermore, the variance function plays an important role in the context of exponential dispersion models, since it describes the relationship between the mean and variance and characterizes the distribution (Jørgensen; 1997).

Let $Y$ denote the response variable and assume that the probability density function of $Y$ belongs to the class of exponential dispersion models. Furthermore, we assume that $E(Y) = \mu$ and $\mathrm{Var}(Y) = \phi V(\mu) = \phi\mu^p$; then $Y \sim Tw_p(\mu, \phi)$, where $Tw_p(\mu, \phi)$ denotes a Tweedie (Tweedie; 1984; Jørgensen; 1997) random variable with mean $\mu$ and variance $\phi\mu^p$, such that $\phi > 0$ and $p \in (-\infty, 0] \cup [1, \infty)$ are the dispersion and power parameters, respectively. The support of the distribution depends on the value of the power parameter. For $p \geq 2$, $1 < p < 2$ and $p = 0$ the support corresponds to the positive, non-negative and real values, respectively. In these cases $\mu \in \Omega$, where $\Omega$ is the convex support (i.e. the interior of the closed convex hull of the corresponding distribution support). Finally, for $p < 0$ the support corresponds to the real values; however, the expectation $\mu$ is positive.

For practical data analysis, the Tweedie distribution is interesting, since it has as special cases the Gaussian ($p = 0$), Poisson ($p = 1$), non-central gamma ($p = 3/2$), gamma ($p = 2$) and inverse Gaussian ($p = 3$) distributions (Jørgensen; 1987, 1997). Another important case, often applied in the context of insurance data (Smyth and Jørgensen; 2002; Jørgensen and Paes De Souza; 1994), corresponds to the compound Poisson distribution, obtained when $1 < p < 2$. The compound Poisson distribution is a frequent choice for the modelling of non-negative, highly right-skewed data with probability mass at zero.
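The mean-variance relationship and the role of the power parameter described above can be illustrated with a minimal Python sketch. The function names `tweedie_variance` and `describe_power` are hypothetical helpers introduced only for this illustration; they are not part of the authors' R implementation. The support attached to the Poisson case is a standard fact added here, not a statement from the text.

```python
import math

def tweedie_variance(mu, phi, p):
    """Var(Y) = phi * mu**p for Y ~ Tw_p(mu, phi), with mu > 0 and phi > 0."""
    if mu <= 0 or phi <= 0:
        raise ValueError("mu and phi must be positive")
    return phi * mu ** p

def describe_power(p):
    """Return (distribution name, support) for a power parameter p.

    Tweedie distributions exist only for p in (-inf, 0] U [1, inf).
    """
    if 0 < p < 1:
        raise ValueError("no Tweedie distribution exists for 0 < p < 1")
    if p < 0:
        return ("Tweedie (p < 0)", "real")
    if p == 0:
        return ("Gaussian", "real")
    if p == 1:
        # Assumption of this sketch: support stated for the Poisson case.
        return ("Poisson", "non-negative counts")
    if p < 2:
        return ("compound Poisson", "non-negative")
    if p == 2:
        return ("gamma", "positive")
    if p == 3:
        return ("inverse Gaussian", "positive")
    return ("Tweedie (p > 2)", "positive")

if __name__ == "__main__":
    for p in (-1, 0, 1, 1.5, 2, 3):
        name, support = describe_power(p)
        print(p, name, support, tweedie_variance(2.0, 0.5, p))
```

Under second-moment assumptions only (the quasi-Tweedie setting discussed later), the `ValueError` for $0 < p < 1$ would not apply, since no distribution needs to exist.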

The power parameter plays an important role in the context of Tweedie models, since it is an index which distinguishes between some important continuous distributions. The algorithms we shall propose in Section 3 allow us to estimate the power parameter, which works as an automatic distribution selection. Although the estimation of the regression parameters is less affected by the dispersion structure, the standard errors associated with the regression parameters are determined by the dispersion structure, which justifies dedicating attention to the estimation of the power and dispersion parameters.

The orthodox approach is based on the likelihood paradigm, which in turn is an efficient estimation method. However, a particularity of the Tweedie distribution is that, outside the special cases, its probability density function cannot be written in closed form and requires numerical methods for its evaluation. Dunn and Smyth (2005, 2008) proposed methods to evaluate the density function of the Tweedie distribution, but these methods are computationally demanding and show different levels of accuracy for different regions of the parameter space. Furthermore, the parameter space associated with the power parameter presents non-trivial restrictions. Current software implementations (Dunn; 2013) are restricted to dealing with $p \geq 1$. These facts make inference based on the likelihood paradigm difficult and sometimes slow.

The main goal of this paper is to propose alternative methods for estimation and inference of Tweedie regression models. In particular, we discuss the quasi-likelihood (Jørgensen and Knudsen; 2004; Bonat and Jørgensen; 2016) and pseudo-likelihood (Gourieroux et al.; 1984) approaches. These methods are fast and computationally simple, because they employ only the first two moments, thereby avoiding the evaluation of the probability density function.
Moreover, the second-moment assumptions required by the quasi- and pseudo-likelihood methods allow us to extend the Tweedie regression models to the class of quasi-Tweedie regression models in the style of Wedderburn (1974). The weaker assumptions of the second-moment specification eliminate the restrictions on the parameter space of the power parameter. Hence, it is possible to estimate negative values and values between zero and one for the power parameter. In this way, we overcome the main restrictions of current software implementations and provide a flexible regression model to deal with continuous data.

We present the theoretical development of the quasi- and pseudo-likelihood methods in the context of Tweedie regression models. In particular, we show analytically that the quasi-likelihood approach provides asymptotically efficient estimation for regression parameters. We present efficient and stable fitting algorithms based on the two new approaches and provide an R computational implementation. We employed simulation studies to compare the properties of our approaches with the maximum likelihood method in a finite sample scenario. We compare the approaches in terms of bias, efficiency and coverage rate of the confidence intervals. Furthermore, we explore the flexibility of Tweedie regression models to deal with heavy tailed distributions.

Tweedie distributions are extensively used in statistical modelling, thereby motivating the study of their estimation in a more general framework. Applications include Lee and Whitmore (1993); Barndorff-Nielsen and Shephard (2001); Vinogradov (2004), who applied Tweedie distributions for describing the chaotic behaviour of stock price movements. Further applications include property and casualty insurance, where Jørgensen and Paes De Souza (1994) and Smyth and Jørgensen (2002) fit the Tweedie family to automobile insurance claims data. Tweedie distributions have also found applications in biology (Kendal; 2004; Kendall; 2007), fisheries research (Foster and Bravington; 2013; Hiroshi; 2008), genetics and medicine (Kendal et al.; 2000). Chen and Tang (2010) presented Bayesian semiparametric models based on the reproductive form of exponential dispersion models. Zhang (2013) discussed maximum likelihood and Bayesian estimation for Tweedie compound Poisson linear mixed models. For a recent application and further references see Bonat and Jørgensen (2016).

The rest of the paper is organized as follows. In the next section, we provide some background about Tweedie regression models. Section 3 discusses the approaches to estimation and inference. Section 4 presents the main results from our simulation studies. Section 5 presents the application of Tweedie regression models to three data sets. The first one concerns daily precipitation in Curitiba, Paraná State, Brazil. This dataset illustrates the analysis of positive continuous data with probability mass at zero. The second data set corresponds to a cross-sectional study developed for studying the income dynamics in Australia.
This dataset shows the analysis of a positive, highly right-skewed response variable. The last data set illustrates the analysis of symmetric positive data, where current implementations have problems dealing with a power parameter smaller than 1. Finally, Section 6 reports some final remarks. The R implementation is available in the supplementary material.

2 Tweedie regression models

The Tweedie distribution belongs to the class of exponential dispersion models (EDM) (Jørgensen; 1987, 1997). Thus, for a random variable $Y$ which follows an EDM, the density function can be written as:

$$f_Y(y; \mu, \phi, p) = a(y, \phi, p) \exp\{(y\psi - k(\psi))/\phi\},$$

where $\mu = E(Y) = k'(\psi)$ is the mean, $\phi > 0$ is the dispersion parameter, $\psi$ is the canonical parameter and $k(\psi)$ is the cumulant function. The function $a(y, \phi, p)$ cannot be written in closed form apart from the special cases cited. The variance is given by $\mathrm{Var}(Y) = \phi V(\mu)$, where $V(\mu) = k''(\psi)$ is called the variance function. Tweedie densities are characterized by power variance functions of the form $V(\mu) = \mu^p$, where $p \in (-\infty, 0] \cup [1, \infty)$ is the index determining the distribution. Although Tweedie densities are not known in closed form, their cumulant generating function is simple. The cumulant generating function is given by

$$K(t) = \{k(\psi + \phi t) - k(\psi)\}/\phi,$$

where $k(\psi)$ is the cumulant function,

$$\psi = \begin{cases} \dfrac{\mu^{1-p}}{1-p}, & p \neq 1 \\ \log \mu, & p = 1 \end{cases} \qquad \text{and} \qquad k(\psi) = \begin{cases} \dfrac{\mu^{2-p}}{2-p}, & p \neq 2 \\ \log \mu, & p = 2. \end{cases}$$

The remaining factor in the density, $a(y, \phi, p)$, needs to be evaluated numerically. Jørgensen (1997) presents two series expressions for evaluating the density, for $1 < p < 2$ and for $p > 2$. In the first case it can be shown that

$$P(Y = 0) = \exp\left\{-\frac{\mu^{2-p}}{\phi(2-p)}\right\}$$

and for $y > 0$ that

$$a(y, \phi, p) = \frac{1}{y} W(y, \phi, p),$$

with $W(y, \phi, p) = \sum_{k=1}^{\infty} W_k$ and

$$W_k = \frac{y^{-k\alpha}(p-1)^{\alpha k}}{\phi^{k(1-\alpha)}(2-p)^k \, k! \, \Gamma(-k\alpha)},$$

where $\alpha = (2-p)/(1-p)$. A similar series expansion exists for $p > 2$ and it is given by

$$a(y, \phi, p) = \frac{1}{\pi y} V(y, \phi, p),$$

with $V = \sum_{k=1}^{\infty} V_k$ and

$$V_k = \frac{\Gamma(1 + \alpha k)\,\phi^{k(\alpha-1)}(p-1)^{\alpha k}}{\Gamma(1+k)(p-2)^k \, y^{\alpha k}} (-1)^k \sin(-k\pi\alpha).$$
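To make the series for $1 < p < 2$ concrete, the following Python sketch evaluates the compound Poisson Tweedie density by truncating $W(y, \phi, p)$ at a fixed number of terms and summing in log space. It is only an illustration under stated assumptions: the function names and the truncation point `kmax` are choices made for this sketch, and it does not reproduce the term-selection strategy of Dunn and Smyth (2005) or the `tweedie` R package.

```python
import math

def log_W(y, phi, p, kmax=200):
    """log of W(y, phi, p) = sum_{k=1}^{kmax} W_k, for 1 < p < 2 and y > 0."""
    alpha = (2.0 - p) / (1.0 - p)  # negative when 1 < p < 2, so -k*alpha > 0
    log_terms = []
    for k in range(1, kmax + 1):
        log_terms.append(
            -k * alpha * math.log(y)
            + alpha * k * math.log(p - 1.0)
            - k * (1.0 - alpha) * math.log(phi)
            - k * math.log(2.0 - p)
            - math.lgamma(k + 1.0)      # log k!
            - math.lgamma(-k * alpha)   # log Gamma(-k*alpha)
        )
    m = max(log_terms)                  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def tweedie_cp_density(y, mu, phi, p, kmax=200):
    """Density of Tw_p(mu, phi) for 1 < p < 2; returns P(Y = 0) when y == 0."""
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch covers only 1 < p < 2")
    if y < 0:
        return 0.0
    k_psi = mu ** (2.0 - p) / (2.0 - p)          # cumulant function k(psi)
    if y == 0:
        return math.exp(-k_psi / phi)            # probability mass at zero
    psi = mu ** (1.0 - p) / (1.0 - p)            # canonical parameter
    log_f = -math.log(y) + log_W(y, phi, p, kmax) + (y * psi - k_psi) / phi
    return math.exp(log_f)

if __name__ == "__main__":
    mu, phi, p = 1.0, 1.0, 1.5
    h = 0.01
    # Midpoint rule on (0, 25] plus the point mass at zero.
    total = tweedie_cp_density(0.0, mu, phi, p)
    for i in range(2500):
        total += tweedie_cp_density((i + 0.5) * h, mu, phi, p, kmax=100) * h
    print("total probability (should be close to 1):", total)
```

A fixed truncation is adequate only when the dominant terms lie well below `kmax`; for very small $\phi$ or $p$ close to 1 the dominant index grows, which is one reason the published algorithms locate the largest term first.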

Dunn and Smyth (2005) presented detailed studies about these series and an algorithm to evaluate the Tweedie density function based on series expansions. The algorithm is implemented in the package tweedie (Dunn; 2013) for the statistical software R (R Core Team; 2016) through the function dtweedie.series. Dunn and Smyth (2008) also studied two alternative methods to evaluate the density function of the Tweedie distributions, one based on Fourier inversion of the cumulant generating function and the other on the saddlepoint approximation; for more details see Dunn (2013). In this paper, we used only the approach described in this section, i.e. the one based on series expansions.

We now turn to Tweedie regression models. Consider a cross-sectional dataset, $(y_i, x_i)$, $i = 1, \ldots, n$, where the $y_i$'s are independent realizations of $Y_i$ according to $Y_i \sim Tw_p(\mu_i, \phi)$ and $g(\mu_i) = \eta_i = x_i^\top \beta$, where $x_i$ and $\beta$ are $(Q \times 1)$ vectors of known covariates and unknown regression parameters, respectively. It is straightforward to see that $E(Y_i) = \mu_i = g^{-1}(x_i^\top \beta)$ and $\mathrm{Var}(Y_i) = C_i = \phi\mu_i^p$. Hence, the model is equivalently specified by its joint distribution and by its first two moments. The Tweedie regression model is parametrized by $\theta = (\beta^\top, \lambda^\top)^\top$, where $\lambda = (\phi = \exp(\delta), p)^\top$. Note that we introduce the reparametrization $\phi = \exp(\delta)$ for computational convenience. Finally, in this paper we adopt the orthodox logarithm link function.

3 Estimation and Inference

This section is devoted to estimation and inference of Tweedie regression models. In what follows, we shall discuss the maximum likelihood, quasi-likelihood and pseudo-likelihood methods.

3.1 Maximum likelihood estimation

The maximum likelihood estimator (MLE) for the parameter vector $\theta$, denoted by $\hat{\theta}_M$, is obtained by maximizing the following log-likelihood function,

$$L(\theta) = \sum_{i=1}^n \log\{a(y_i; \lambda)\} + \frac{1}{\exp(\delta)} \sum_{i=1}^n (y_i\psi_i - k(\psi_i)). \qquad (1)$$

As we shall show below, the vectors $\beta$ and $\lambda$ are orthogonal; hence it is sensible to discuss each of them separately.
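The second-moment specification of the regression model, with logarithm link and $\phi = \exp(\delta)$, can be sketched in a few lines of Python. The helper name `tweedie_moments` and the list-of-rows layout of the design matrix are assumptions of this sketch, not part of the authors' R code.

```python
import math

def tweedie_moments(X, beta, delta, p):
    """Return (mu, C) for a Tweedie regression with log link.

    mu_i = exp(x_i' beta) and C_i = exp(delta) * mu_i**p, where X is a list of
    covariate rows, beta the regression coefficients, delta the log-dispersion
    and p the power parameter.
    """
    phi = math.exp(delta)
    mu = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]
    C = [phi * m ** p for m in mu]
    return mu, C

if __name__ == "__main__":
    # Illustrative values only: an intercept and one covariate.
    X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
    mu, C = tweedie_moments(X, beta=[0.5, 0.8], delta=math.log(2.0), p=1.5)
    for m, c in zip(mu, C):
        print(round(m, 4), round(c, 4))
```

Only these two quantities enter the quasi- and pseudo-likelihood methods, which is why those methods never need the density factor $a(y_i; \lambda)$.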
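The score function for the regression parameters $\beta = (\beta_1, \ldots, \beta_Q)^\top$ is given by

$$U_\beta(\beta, \lambda) = \left( \frac{\partial L(\theta)}{\partial \beta_1}, \ldots, \frac{\partial L(\theta)}{\partial \beta_Q} \right)^\top,$$

where

$$\frac{\partial L(\theta)}{\partial \beta_j} = \sum_{i=1}^n \frac{\partial L(\theta)}{\partial \psi_i} \frac{\partial \psi_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \beta_j} = \sum_{i=1}^n \mu_i x_{ij} \left[ \frac{1}{\exp(\delta)\mu_i^p} \right] (y_i - \mu_i),$$

for $j = 1, \ldots, Q$. The entry $(j, k)$ of the $Q \times Q$ Fisher information matrix $F_\beta$ for the regression coefficients is given by

$$F_{\beta_{jk}} = -E\left\{ \frac{\partial^2 L(\theta)}{\partial \beta_j \partial \beta_k} \right\} = \sum_{i=1}^n \mu_i x_{ij} \left[ \frac{1}{\exp(\delta)\mu_i^p} \right] \mu_i x_{ik}. \qquad (2)$$

Similarly, the score function for the dispersion parameters $\lambda = (\exp(\delta), p)^\top$ is given by

$$U_\lambda(\lambda, \beta) = \left( \frac{\partial L(\theta)}{\partial \delta}, \frac{\partial L(\theta)}{\partial p} \right)^\top,$$

whose components are given by

$$\frac{\partial L(\theta)}{\partial \delta} = \sum_{i=1}^n \frac{\partial}{\partial \delta} \log a(y_i; \lambda) - \frac{1}{\exp(\delta)} \sum_{i=1}^n (y_i\psi_i - k(\psi_i)) \qquad (3)$$

and

$$\frac{\partial L(\theta)}{\partial p} = \sum_{i=1}^n \frac{\partial}{\partial p} \log a(y_i; \lambda) + \frac{1}{\exp(\delta)} \sum_{i=1}^n \left[ y_i \frac{\partial \psi_i}{\partial p} - \frac{\partial k(\psi_i)}{\partial p} \right]. \qquad (4)$$

The entry $(j, k)$ of the $2 \times 2$ Fisher information matrix $F_\lambda$ for the dispersion parameters is given by

$$F_{\lambda_{jk}} = -E\left\{ \frac{\partial^2 L(\theta)}{\partial \lambda_j \partial \lambda_k} \right\}. \qquad (5)$$

The derivatives in equations (3), (4) and (5) depend on the derivative of the infinite sum $a(y_i; \lambda)$, which cannot be expressed in closed form. Hence, numerical methods are required for approximating these derivatives. Let $\tilde{U}_\lambda$ and $\tilde{F}_\lambda$ denote the approximated score function and observed information matrix for the dispersion parameters, respectively. In this paper, we adopted the Richardson method (Fornberg and Sloan; 1994), as implemented in the R package numDeriv (Gilbert and Varadhan; 2015), for computing these approximations. Furthermore, the cross entries of the Fisher information matrix are given by

$$F_{\beta_j \delta} = -E\left\{ \frac{\partial U_{\beta_j}(\beta, \lambda)}{\partial \delta} \right\} = -E\left\{ \sum_{i=1}^n \mu_i x_{ij} \frac{\partial}{\partial \delta}\left[ \frac{1}{\exp(\delta)\mu_i^p} \right] (y_i - \mu_i) \right\} = 0$$

and

$$F_{\beta_j p} = -E\left\{ \frac{\partial U_{\beta_j}(\beta, \lambda)}{\partial p} \right\} = -E\left\{ \sum_{i=1}^n \mu_i x_{ij} \frac{\partial}{\partial p}\left[ \frac{1}{\exp(\delta)\mu_i^p} \right] (y_i - \mu_i) \right\} = 0.$$

Hence, the vectors $\beta$ and $\lambda$ are orthogonal. The joint Fisher information matrix for $\theta$ is given by

$$F_\theta = \begin{pmatrix} F_\beta & 0 \\ 0 & F_\lambda \end{pmatrix},$$

whose entries are defined by (2) and (5). Finally, the asymptotic distribution of $\hat{\theta}_M$ is $\hat{\theta}_M \sim N(\theta, F_\theta^{-1})$, where $F_\theta^{-1}$ denotes the inverse of the Fisher information matrix. In practice the entry $F_\lambda$ is replaced by the approximation $\tilde{F}_\lambda$.

In order to solve the system of equations $U_\beta = 0$ and $\tilde{U}_\lambda = 0$, we employ the two-step Newton scoring algorithm, defined by

$$\begin{aligned} \beta^{(i+1)} &= \beta^{(i)} + F_\beta^{-1} U_\beta(\beta^{(i)}, \lambda^{(i)}) \\ \lambda^{(i+1)} &= \lambda^{(i)} + \tilde{F}_\lambda^{-1} \tilde{U}_\lambda(\beta^{(i+1)}, \lambda^{(i)}), \end{aligned} \qquad (6)$$

which in turn explicitly uses the orthogonality between $\beta$ and $\lambda$.

The numerical evaluation of the derivatives required in equations (3), (4) and (5) can be inaccurate, mainly for $p$ close to 1, i.e. the border of the parameter space. Thus, an alternative approach is to maximize the log-likelihood function in equation (1) directly, using a derivative-free algorithm such as the Nelder-Mead method (Nelder and Mead; 1965). A more computationally efficient approach is to use the Nelder-Mead algorithm for maximizing only the profile log-likelihood for the dispersion parameters, which in turn is obtained by inserting the first equation of the two-step Newton scoring algorithm (6) in the log-likelihood function (1). Note that, with this approach, each evaluation of the profile likelihood requires solving a maximization problem for the regression parameters.

We implemented these three approaches to obtain the maximum likelihood estimator. The direct maximization of the log-likelihood function using the Nelder-Mead algorithm is slow, mainly for a large number of regression coefficients. The two-step Newton scoring algorithm presented many convergence problems for small values of the power parameter. Finally, the profile likelihood approach is the fastest and most stable implementation. However, the profile likelihood approach presented problems in computing the standard errors associated with the dispersion estimates for $p$ close to 1. In this paper, we used only the approach based on the profile log-likelihood, but we also provide R code for the other two approaches.

3.2 Quasi-likelihood estimation

We shall now introduce the quasi-likelihood estimation using terminology and results from Jørgensen and Knudsen (2004); Holst and Jørgensen (2015); Bonat and Jørgensen (2016). The quasi-likelihood approach adopted in this paper combines the quasi-score and Pearson estimating functions for the estimation of regression and dispersion parameters, respectively. The approach is also discussed in the context of estimating functions; see Liang and Zeger (1995); Jørgensen and Knudsen (2004) for further details.

The quasi-score function for $\beta$ has the following form,

$$U^q_\beta(\beta, \lambda) = \left( \sum_{i=1}^n \frac{\partial \mu_i}{\partial \beta_1} C_i^{-1} (y_i - \mu_i), \ldots, \sum_{i=1}^n \frac{\partial \mu_i}{\partial \beta_Q} C_i^{-1} (y_i - \mu_i) \right)^\top,$$

where $\partial \mu_i / \partial \beta_j = \mu_i x_{ij}$ for $j = 1, \ldots, Q$. The entry $(j, k)$ of the $Q \times Q$ sensitivity matrix for $U^q_\beta$ is given by

$$S_{\beta_{jk}} = E\left( \frac{\partial}{\partial \beta_k} U^q_{\beta_j}(\beta, \lambda) \right) = -\sum_{i=1}^n \mu_i x_{ij} \left[ \frac{1}{\exp(\delta)\mu_i^p} \right] x_{ik} \mu_i. \qquad (7)$$

In a similar way, the entry $(j, k)$ of the $Q \times Q$ variability matrix for $U^q_\beta$ is given by

$$V_{\beta_{jk}} = \mathrm{Var}(U^q_\beta(\beta, \lambda)) = \sum_{i=1}^n \mu_i x_{ij} \left[ \frac{1}{\exp(\delta)\mu_i^p} \right] x_{ik} \mu_i.$$

Following Jørgensen and Knudsen (2004); Bonat and Jørgensen (2016), the Pearson estimating function for the dispersion parameters has the following form,

$$U^q_\lambda(\lambda, \beta) = \left( \sum_{i=1}^n W_{i\delta} \left[ (y_i - \mu_i)^2 - C_i \right], \sum_{i=1}^n W_{ip} \left[ (y_i - \mu_i)^2 - C_i \right] \right)^\top,$$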

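where $W_{i\delta} = -\partial C_i^{-1} / \partial \delta$ and $W_{ip} = -\partial C_i^{-1} / \partial p$. The Pearson estimating functions are unbiased estimating functions for $\lambda$ based on the squared residuals $(y_i - \mu_i)^2$ with mean $C_i$. This is equivalent to treating the squared residual as a gamma variable, and is hence close in spirit to Perry's gamma regression method (Jørgensen et al.; 2011; Park and Cho; 2004).

We shall now calculate the sensitivity matrix for the dispersion parameters. The entry $(j, k)$ of the $2 \times 2$ sensitivity matrix is given by

$$S_{\lambda_{jk}} = E\left( \frac{\partial}{\partial \lambda_k} U^q_{\lambda_j}(\lambda, \beta) \right) = -\sum_{i=1}^n W_{i\lambda_j} C_i W_{i\lambda_k} C_i,$$

where $\lambda_1$ and $\lambda_2$ denote either $\delta$ or $p$, giving

$$S_\lambda = -\begin{pmatrix} n & \sum_{i=1}^n \log(\mu_i) \\ \sum_{i=1}^n \log(\mu_i) & \sum_{i=1}^n \log(\mu_i)^2 \end{pmatrix}. \qquad (8)$$

Similarly, the cross entries of the sensitivity matrix are given by

$$S_{\beta_j \lambda_k} = E\left( \frac{\partial}{\partial \lambda_k} U^q_{\beta_j}(\beta, \lambda) \right) = 0 \qquad (9)$$

and

$$S_{\lambda_j \beta_k} = E\left( \frac{\partial}{\partial \beta_k} U^q_{\lambda_j}(\lambda, \beta) \right) = -\sum_{i=1}^n W_{i\lambda_j} C_i W_{i\beta_k} C_i, \qquad (10)$$

where $W_{i\beta_k} = -\partial C_i^{-1} / \partial \beta_k$. Finally, the joint sensitivity matrix for the parameter vector $\theta$ is given by

$$S_\theta = \begin{pmatrix} S_\beta & 0 \\ S_{\lambda\beta} & S_\lambda \end{pmatrix},$$

whose entries are defined by (7), (8), (9) and (10).

We shall now calculate the asymptotic variance of the quasi-likelihood estimators, denoted by $\hat{\theta}_{QL}$, as obtained from the inverse Godambe information matrix, whose general form is $J_\theta^{-1} = S_\theta^{-1} V_\theta S_\theta^{-\top}$ for a vector of parameters $\theta$, where $-\top$ denotes inverse transpose. The variability matrix for $\theta$ has the form

$$V_\theta = \begin{pmatrix} V_\beta & V_{\beta\lambda} \\ V_{\lambda\beta} & V_\lambda \end{pmatrix}, \qquad (11)$$

where $V_{\lambda\beta} = V_{\beta\lambda}^\top$ and $V_\lambda$ depend on the third and fourth moments of $Y_i$, respectively. In order to avoid this dependence on high-order moments, we propose to use the empirical versions of $V_\lambda$ and $V_{\lambda\beta}$, whose entries are given by

$$\tilde{V}_{\lambda_{jk}} = \sum_{i=1}^n U^q_{\lambda_j}(\lambda, \beta)_i \, U^q_{\lambda_k}(\lambda, \beta)_i \qquad \text{and} \qquad \tilde{V}_{\lambda_j \beta_k} = \sum_{i=1}^n U^q_{\lambda_j}(\lambda, \beta)_i \, U^q_{\beta_k}(\lambda, \beta)_i.$$

Finally, the asymptotic distribution of $\hat{\theta}_{QL}$ is given by $\hat{\theta}_{QL} \sim N(\theta, J_\theta^{-1})$. Using standard results for the inverse of a partitioned matrix, we may show that

$$J_\theta^{-1} = \begin{pmatrix} S_\beta^{-1} V_\beta S_\beta^{-1} & S_\beta^{-1} (-V_\beta S_\beta^{-1} S_{\lambda\beta}^\top + V_{\lambda\beta}^\top) S_\lambda^{-1} \\ S_\lambda^{-1} (-S_{\lambda\beta} S_\beta^{-1} V_\beta + V_{\lambda\beta}) S_\beta^{-1} & S_\lambda^{-1} (L + V_\lambda) S_\lambda^{-1} \end{pmatrix},$$

where $L = S_{\lambda\beta} S_\beta^{-1} (V_\beta S_\beta^{-1} S_{\lambda\beta}^\top - V_{\lambda\beta}^\top) - V_{\lambda\beta} S_\beta^{-1} S_{\lambda\beta}^\top$. Moreover, note that $S_\beta^{-1} V_\beta S_\beta^{-1} = V_\beta^{-1}$, since $S_\beta = -V_\beta$. This shows that, for known dispersion parameters, the asymptotic variance of the quasi-likelihood regression estimators reaches the Cramér-Rao lower bound, which in turn shows that the quasi-likelihood approach provides asymptotically efficient estimators for the regression coefficients.

Jørgensen and Knudsen (2004) proposed the modified chaser algorithm to solve the system of equations $U^q_\beta = 0$ and $U^q_\lambda = 0$, defined by

$$\begin{aligned} \beta^{(i+1)} &= \beta^{(i)} - S_\beta^{-1} U^q_\beta(\beta^{(i)}, \lambda^{(i)}) \\ \lambda^{(i+1)} &= \lambda^{(i)} - S_\lambda^{-1} U^q_\lambda(\beta^{(i+1)}, \lambda^{(i)}). \end{aligned}$$

The modified chaser algorithm uses the insensitivity property (9), which allows us to use two separate equations to update $\beta$ and $\lambda$.

3.3 Pseudo-likelihood estimation

We shall now present the pseudo-likelihood approach using terminology and results from Gourieroux et al. (1984). The pseudo-likelihood approach considers the properties of estimators obtained by maximizing a likelihood function associated with a family of probability distributions which does not necessarily contain the true distribution. In particular, for the estimation of Tweedie regression models in this paper, we adopted the Gaussian pseudo-likelihood, whose logarithm is given by

$$L^{\mathrm{p}}(\theta) = -\frac{n}{2}\log(2\pi) - \frac{n\delta}{2} - \frac{p}{2}\sum_{i=1}^n \log \mu_i - \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{2\exp(\delta)\mu_i^p}. \qquad (12)$$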
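Equation (12) is simply the sum of Gaussian log-densities with mean $\mu_i$ and variance $C_i = \exp(\delta)\mu_i^p$, which the following Python sketch evaluates under the logarithm link. The function name `gaussian_pseudo_loglik` and the list-based data layout are assumptions made for this illustration and are not taken from the authors' R implementation.

```python
import math

def gaussian_pseudo_loglik(y, X, beta, delta, p):
    """Gaussian pseudo log-likelihood (12) with log link.

    y: responses; X: list of covariate rows; beta: regression coefficients;
    delta: log-dispersion (phi = exp(delta)); p: power parameter, which may be
    any real number because only the first two moments are used.
    """
    n = len(y)
    total = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * n * delta
    for yi, row in zip(y, X):
        eta = sum(x * b for x, b in zip(row, beta))  # eta_i = log(mu_i)
        mu = math.exp(eta)
        total += -0.5 * p * eta
        total -= (yi - mu) ** 2 / (2.0 * math.exp(delta) * mu ** p)
    return total

if __name__ == "__main__":
    # Illustrative values only; note the negative response and negative p.
    y = [0.5, -1.0, 3.2]
    X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
    print(gaussian_pseudo_loglik(y, X, beta=[0.3, 0.7], delta=0.4, p=-0.5))
```

Because no Tweedie density is evaluated, the function accepts negative responses and values of $p$ outside $(-\infty, 0] \cup [1, \infty)$, in line with the quasi-Tweedie extension described in the Introduction.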

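The pseudo-score function for $\theta$ is given by

$$U^{\mathrm{p}}_\theta(\beta, \lambda) = \left( \frac{\partial L^{\mathrm{p}}(\theta)}{\partial \beta_1}, \ldots, \frac{\partial L^{\mathrm{p}}(\theta)}{\partial \beta_Q}, \frac{\partial L^{\mathrm{p}}(\theta)}{\partial \delta}, \frac{\partial L^{\mathrm{p}}(\theta)}{\partial p} \right)^\top,$$

whose components have the following form

$$\frac{\partial L^{\mathrm{p}}(\theta)}{\partial \beta_j} = -\frac{p}{2}\sum_{i=1}^n x_{ij} + \sum_{i=1}^n \frac{p\,(y_i - \mu_i)^2}{2\exp(\delta)\mu_i^p} x_{ij} + \sum_{i=1}^n \frac{(y_i - \mu_i)}{\exp(\delta)\mu_i^{p-1}} x_{ij}, \qquad (13)$$

$$\frac{\partial L^{\mathrm{p}}(\theta)}{\partial \delta} = -\frac{n}{2} + \frac{1}{2\exp(\delta)} \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{\mu_i^p} \qquad (14)$$

and

$$\frac{\partial L^{\mathrm{p}}(\theta)}{\partial p} = -\frac{1}{2}\sum_{i=1}^n \log(\mu_i) + \frac{1}{2\exp(\delta)} \sum_{i=1}^n \frac{\log(\mu_i)}{\mu_i^p} (y_i - \mu_i)^2. \qquad (15)$$

We note in passing that Equation (13) is an unbiased estimating function for $\beta_j$ based on the linear and squared residuals. Similarly, note that Equations (14) and (15) are unbiased estimating functions for $\delta$ and $p$ based on the squared residuals.

Gourieroux et al. (1984) showed, under classical assumptions, that the pseudo-likelihood estimators, denoted by $\hat{\theta}_{PL}$ and obtained by maximizing Equation (12), converge almost surely to $\theta$. Furthermore, $\hat{\theta}_{PL}$ converges in distribution to $N(\theta, S_\theta^{-1} V_\theta S_\theta^{-1})$, where

$$S_\theta = E\left( \frac{\partial^2 L^{\mathrm{p}}(\theta)}{\partial \theta \, \partial \theta^\top} \right) \qquad \text{and} \qquad V_\theta = E\left( U^{\mathrm{p}}_\theta(\beta, \lambda) \, U^{\mathrm{p}}_\theta(\beta, \lambda)^\top \right).$$

Similarly to the variability matrix (11) in the context of quasi-likelihood estimation, the matrix $V_\theta$ depends on third and fourth moments. Hence, we propose to use the empirical version of $V_\theta$, which is given by

$$\tilde{V}_\theta = \sum_{i=1}^n U^{\mathrm{p}}_\theta(\theta)_i \, U^{\mathrm{p}}_\theta(\theta)_i^\top,$$

where the sum is understood to be element-wise. We shall now compute the components of $S_\theta$. First, note that the matrix $S_\theta$ can be partitioned as

$$S_\theta = \begin{pmatrix} S_\beta & S_{\beta\delta} & S_{\beta p} \\ S_{\delta\beta} & S_\delta & S_{\delta p} \\ S_{p\beta} & S_{p\delta} & S_p \end{pmatrix}.$$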

The entry $(j, k)$ of the $Q \times Q$ matrix $S_\beta$ is given by

$$S_{\beta_{jk}} = -\sum_{i=1}^n \left( \frac{p^2 x_{ij} x_{ik}}{2} + \frac{x_{ij} x_{ik}}{\exp(\delta)\mu_i^{p-2}} \right).$$

Similarly, the entries $S_\delta$ and $S_p$ are respectively given by

$$S_\delta = -\frac{n}{2} \qquad \text{and} \qquad S_p = -\sum_{i=1}^n \frac{\log(\mu_i)^2}{2}.$$

Furthermore, the cross entries have the form

$$S_{\beta_j \delta} = -\sum_{i=1}^n \frac{p\, x_{ij}}{2}, \qquad S_{\beta_j p} = -\sum_{i=1}^n \frac{p \log(\mu_i) x_{ij}}{2} \qquad \text{and} \qquad S_{\delta p} = -\sum_{i=1}^n \frac{\log(\mu_i)}{2}.$$

Finally, we propose the Newton scoring algorithm to solve the system of equations $U^{\mathrm{p}}_\theta(\beta, \lambda) = 0$, defined by

$$\theta^{(i+1)} = \theta^{(i)} - S_\theta^{-1} U^{\mathrm{p}}_\theta(\beta^{(i)}, \lambda^{(i)}).$$

In that case, we have to update $\beta$ and $\lambda$ together, since the cross entries of $S_\theta$ are not zero.

4 Simulation studies

In this section we shall present two simulation studies designed to i) check the asymptotic properties of the maximum, quasi- and pseudo-likelihood estimators in a finite sample scenario and ii) check the robustness of the Tweedie regression models in the case of misspecification by heavy tailed distributions.

4.1 Fitting Tweedie regression models

In this section we present a simulation study that was conducted to compare the properties of the estimation methods. We evaluated the expected bias, consistency, coverage rate and efficiency for the maximum likelihood (MLE), quasi-likelihood (QMLE) and pseudo-likelihood (PMLE) estimators. We generated 1000 data sets considering four sample sizes, 100, 250, 500 and 1000. We considered five values of the power parameter, 0, 1.01, 1.5, 2 and 3, combined with three amounts of variation.

We used the average coefficient of variation to measure the amount of variation introduced in the data. We defined small, medium and large variation data sets as those generated using a coefficient of variation equal to 15%, 50% and 80%, respectively. The values of the power parameter were chosen to include non-standard situations, such as the cases $p = 0$ and $p = 1.01$, where we expect that the MLE does not work. The case $p = 2$ is also difficult for maximum likelihood estimation, since the probability density function should be evaluated using two different infinite sums, for $p < 2$ and $p > 2$. The cases $p = 1.5$ and $p = 3$ represent the standard compound Poisson and inverse Gaussian distributions, respectively. In these cases, we expect that the MLE works well, so we have safe results to compare with our two alternative approaches.

All scenarios consider models with an intercept ($\beta_0 = 2$) and slopes ($\beta_1 = 0.8$, $\beta_2 = 1.5$). The covariates are a sequence from $-1$ to 1, representing a continuous covariate, and a factor with two levels (0 and 1), both with length equal to the sample size. For $p = 0$ the dispersion parameter values are $\phi = (75, 850, 2100)$ corresponding, respectively, to small (15%), medium (50%) and large (80%) variation. Similarly, for $p = 1.01$, $p = 1.5$, $p = 2$ and $p = 3$ the dispersion parameter values are $\phi = (1.5, 15, 40)$, $\phi = (0.2, 2, 5.3)$, $\phi = (0.023, 0.25, 0.65)$ and $\phi = (0.0003, \ldots, \ldots)$, respectively.

Fig. 1 shows the expected bias plus and minus the expected standard error for the parameters on each model and scenario. The scales are standardized for each parameter by dividing the expected bias and the limits of the confidence intervals by the standard error obtained on the sample of size 100. The results in Fig. 1 show that for the quasi- and pseudo-likelihood methods and all simulation scenarios, both the expected bias and standard error tend to 0 as the sample size is increased. This shows the consistency and unbiasedness of our estimators.
As expected, the maximum likelihood method did not work for $p = 0$ and $p = 1.01$ in the medium and large variation scenarios. In these cases, the algorithm failed for all simulated data sets. In the cases of small variation the algorithm converged for 132 and 326 data sets for $p = 0$ and $p = 1.01$, respectively. In these scenarios, despite the large bias for the dispersion parameters, the regression coefficients were consistently estimated.

Fig. 2 presents the coverage rate by estimation methods, sample size and simulation scenarios. The results presented in Fig. 2 show that in general for large samples the coverage rates are close to the nominal level (95%) for all parameters and simulation scenarios. The MLE presented coverage rate zero for the dispersion parameters when $p = 0$ and $p = 1.01$ in all simulation scenarios (not shown). The quasi-likelihood method presented coverage rates closer to the nominal level than the pseudo-likelihood method, mainly for dispersion parameters and large values of the power parameter ($p \geq 1.5$). Regarding the estimation methods, as expected the MLE presented coverage rates close to the nominal level for large values of the power parameter. The alternative approaches worked well in all simulation scenarios, including the cases where the MLE did not work.

Figure 1: Expected bias and confidence interval on a standardized scale by estimation methods (maximum likelihood (MLE), pseudo-likelihood (PMLE) and quasi-likelihood (QMLE)), sample size and different values of the power and dispersion parameters ($p$; $\phi$).

Figure 2: Coverage rate for each parameter ($\beta_0$, $\beta_1$, $\beta_2$, $\phi$, $p$) by estimation methods (maximum likelihood (MLE), pseudo-likelihood (PMLE) and quasi-likelihood (QMLE)), sample size and different values of the power and dispersion parameters ($p$; $\phi$).

Finally, Fig. 3 presents the empirical efficiency of the quasi- and pseudo-likelihood estimators. The empirical efficiency was computed as the ratio between the variance of the MLE and the variance obtained by the alternative approaches. We computed the efficiency only for the cases where $p \geq 1.5$, since for the other cases the MLE presented no reliable results. The results in Fig. 3 show that for the regression coefficients both the QMLE and PMLE approaches presented efficiency close to 1 in all simulation scenarios. Concerning the dispersion parameters, for the small variation scenario the QMLE and PMLE presented efficiency close to 1. However, when the variation increased these estimators lost efficiency; the worst scenario appears for $p = 1.5$ and large variation, where the efficiency presented values around 20%. In general the PMLE is more efficient than the QMLE for the dispersion and power parameters.

4.2 Robustness of Tweedie regression models

In this subsection we present a simulation study that was conducted to evaluate the robustness of the Tweedie regression models in the case of model misspecification by heavy tailed distributions. We generated 1000 data sets considering four sample sizes, 100, 250, 500 and 1000, following two heavy tailed distributions, namely, t-student and slash. The parametrization adopted was the one implemented in the R package heavy (Osorio; 2016). For both distributions, we designed three simulation scenarios according to the amount of variation introduced in the data. We defined small, medium and large variation data sets as those generated using a dispersion parameter equal to 100, 500 and 1000, respectively. In order to simulate challenging data sets, we used 2 degrees of freedom. The mean structure was specified as in subsection 4.1.
In the case of heavy tailed distributions, we expect negative values for the power parameter. Thus, we fitted the Tweedie regression models by using the quasi- and pseudo-likelihood approaches. In order to compute the empirical efficiency of the quasi- and pseudo-likelihood estimators, we fitted t-student regression models along with the logarithm link function, as implemented in the package gamlss (family TF) (Rigby and Stasinopoulos; 2005). Although there is an extensive literature on robust estimation methods, in this paper we adopted the t-student regression model, since it is a frequent choice for the analysis of heavy tailed data (Huber and Ronchetti; 2009) and can be fitted using the orthodox maximum likelihood method. Furthermore, since there is no software available for fitting slash regression models using the logarithm link function, the t-student regression models were used as the basis of comparison for both t-student and slash data sets.

Figure 3: Empirical efficiency for each parameter ($\beta_0$, $\beta_1$, $\beta_2$, $\phi$, $p$) by estimation methods (maximum likelihood (MLE), pseudo-likelihood (PMLE) and quasi-likelihood (QMLE)), sample size and different values of the power and dispersion parameters ($p$; $\phi$).

Fig. 4 shows the expected bias plus and minus the expected standard error for the regression parameters by estimation methods, sample size and simulation scenarios. The results presented in Fig. 4 show that the three estimation methods provide unbiased and consistent estimates of the regression parameters in all simulation scenarios. As expected, the standard errors associated with the regression parameters increase as the amount of variation introduced in the data increases.

Fig. 5 presents the coverage rate by estimation methods, sample size and simulation scenarios. The empirical coverage rate presented values close to the nominal specified level of 95% for all estimation methods and simulation scenarios. The MLE method presented coverage rates closer to the nominal level than the QMLE and PMLE methods; however, the difference is no larger than 3%. The coverage rates of the QMLE and PMLE were virtually the same for all regression parameters, sample sizes and simulation scenarios.

Finally, Fig. 6 presents the empirical efficiency of the QMLE and PMLE estimators for the regression parameters. The empirical efficiency was computed as the ratio between the variance of the MLE obtained by fitting the t-student regression models and the variance of the QMLE and PMLE estimators obtained by fitting the Tweedie regression models. The empirical efficiency presented values close to 1 for the small variation simulation scenarios; however, when the amount of variation increases both QMLE and PMLE lose efficiency. The losses were around 10% and 20% for the medium and large variation scenarios, respectively. The results are worse for large samples. The PMLE presents efficiency slightly closer to 1 than the QMLE.

5 Data analyses

In this section we shall present three illustrative examples of Tweedie regression models. The data that are analysed and the programs that were used to analyse them can be obtained from: http://

Figure 4: Expected bias and confidence interval by estimation methods (quasi-likelihood (Q), pseudo-likelihood (P) and maximum likelihood (MLE)), sample size and simulation scenarios.

Figure 5: Coverage rate for regression parameters by estimation methods (quasi-likelihood (Q), pseudo-likelihood (P) and maximum likelihood (MLE)), sample size and simulation scenarios.

Figure 6: Empirical efficiency for regression parameters by estimation methods (quasi-likelihood (Q), pseudo-likelihood (P) and maximum likelihood (MLE)), sample size and simulation scenarios.
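The two summaries plotted in Figs. 5 and 6, coverage rate and empirical efficiency, are simple functions of the simulation replicates. The sketch below is not the authors' R code; it only illustrates the arithmetic. The function names `empirical_efficiency` and `coverage_rate`, the use of Wald-type intervals (estimate ± 1.96 SE) and the toy Gaussian replicates are assumptions made for illustration.

```python
import random
import statistics


def empirical_efficiency(reference_estimates, alternative_estimates):
    """Variance of the reference estimator divided by the variance of the
    alternative estimator; values below 1 indicate a loss of efficiency."""
    return (statistics.variance(reference_estimates)
            / statistics.variance(alternative_estimates))


def coverage_rate(estimates, standard_errors, true_value, z=1.959964):
    """Share of Wald-type intervals (estimate +/- z * SE) containing the
    true parameter value. The Wald form is an assumption of this sketch."""
    hits = sum(
        1
        for est, se in zip(estimates, standard_errors)
        if est - z * se <= true_value <= est + z * se
    )
    return hits / len(estimates)


# Toy replicates (hypothetical numbers, not taken from the paper):
# a reference estimator with SD 0.10 and an alternative with SD 0.11.
rng = random.Random(2016)
true_beta = 1.0
n_rep = 2000
reference = [rng.gauss(true_beta, 0.10) for _ in range(n_rep)]
alternative = [rng.gauss(true_beta, 0.11) for _ in range(n_rep)]

print("empirical efficiency:", empirical_efficiency(reference, alternative))
print("coverage rate:", coverage_rate(alternative, [0.11] * n_rep, true_beta))
```

With these toy settings the efficiency should land near (0.10 / 0.11)^2 and the coverage near the nominal 95%, up to Monte Carlo error.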

Figure 7: Time series plot for Curitiba rainfall data with fitted values (A). Vertical black lines indicate January 1st. Histogram of daily rainfall for the whole period (B). Boxplots for year (C) and season (D).

5.1 Smoothing time series of rainfall in Curitiba, Paraná, Brazil

This example concerns daily rainfall data in Curitiba, Paraná State, Brazil. The data were collected for the period from 2010 to 2015, corresponding to 2191 days. The main goal is to smooth the time series to help us better see patterns or trends. The analysis of rainfall data is in general challenged by the presence of many zeroes and the highly right-skewed distribution of the data. The plots shown in Fig. 7 illustrate some of these features for the Curitiba rainfall data. In particular, Fig. 7(B) highlights the right-skewed distribution and the considerable proportion of exact 0s (51%).

In order to smooth the Curitiba rainfall time series, we fitted a Tweedie regression model with linear predictor expressed in terms of B-splines (de Boor; 1972). The natural basis regression smoothing framework was used to select the degree of smoothness (Wood; 2006). In that case, we found that 14 degrees of freedom were enough to smooth the time series. The models were fitted by using the three estimation methods, namely, maximum likelihood (MLE), quasi- (Q) and pseudo-likelihood (P). Table 1 presents estimates and standard errors for the dispersion and power parameters.

Table 1: Dispersion and power parameter estimates and standard errors (SE) by estimation methods for the Curitiba rainfall data. Columns: Estimate and SE for each of the MLE, Q and P methods. Rows: δ and p.

The results in Table 1 show slightly different estimates for the dispersion and power parameters, depending on the estimation method used. However, the confidence intervals obtained by the Q and P approaches contain the MLE. The standard errors obtained by the alternative approaches are larger than the ones obtained by the MLE.

To evaluate the effect of the estimation methods on the regression coefficients, Fig. 8 shows estimates and confidence intervals for each regression coefficient by estimation methods. The scales were standardized for each parameter by dividing the estimate and the limits of the confidence interval by the estimate obtained by the maximum likelihood method. The results in Fig. 8 show that the Q method presented estimates and confidence intervals more similar to the MLE than the P method. The relative average difference between the MLE and Q estimates was 3.36%. On the other hand, the relative average difference between the MLE and P estimates was 14.58%. Similarly, the confidence intervals obtained by the Q method were on average 3.33% wider than the corresponding MLE intervals, whereas the confidence intervals obtained by the P approach were 39.98% wider than the MLE intervals.

For all estimation methods, the power parameter estimates are in the interval 1 < p < 2, suggesting a compound Poisson distribution, as expected, since the response variable is continuous with exact 0s. The fitted values and 95% confidence interval obtained by the quasi-likelihood method are shown in Fig. 7 above. The fitted values obtained by the MLE and P approaches were similar to the ones obtained by the Q (not shown).
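To see why a power parameter between 1 and 2 goes together with exact zeros, recall that a Tweedie variable with mean μ, dispersion φ and 1 < p < 2 can be written as a Poisson number of gamma-distributed amounts. The sketch below simulates from this representation using only the Python standard library. It is an illustration under stated assumptions, not part of the authors' analysis: the function names are hypothetical, the parameter values (μ = 4, φ = 2, p = 1.5) are arbitrary and unrelated to the rainfall fit, and the Poisson sampler is Knuth's simple method, which is adequate only for small rates.

```python
import math
import random


def tweedie_cp_parameters(mu, phi, p):
    """Map (mu, phi, p) with 1 < p < 2 to the Poisson rate and the gamma
    shape and scale of the compound Poisson representation."""
    if not 1.0 < p < 2.0:
        raise ValueError("compound Poisson representation requires 1 < p < 2")
    rate = mu ** (2.0 - p) / (phi * (2.0 - p))
    shape = (2.0 - p) / (p - 1.0)
    scale = phi * (p - 1.0) * mu ** (p - 1.0)
    return rate, shape, scale


def poisson_draw(rng, rate):
    """Knuth's multiplication method for a Poisson draw (small rates only)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def tweedie_cp_draw(rng, mu, phi, p):
    """One Tweedie draw: the sum of a Poisson number of gamma amounts.
    Zero events give an exact zero."""
    rate, shape, scale = tweedie_cp_parameters(mu, phi, p)
    n_events = poisson_draw(rng, rate)
    return sum((rng.gammavariate(shape, scale) for _ in range(n_events)), 0.0)


# Arbitrary illustrative values: mean 4, variance phi * mu**p = 16.
rng = random.Random(1)
rate, _, _ = tweedie_cp_parameters(4.0, 2.0, 1.5)
draws = [tweedie_cp_draw(rng, 4.0, 2.0, 1.5) for _ in range(20000)]
zero_share = sum(1 for d in draws if d == 0.0) / len(draws)
print("sample mean:", sum(draws) / len(draws))
print("share of exact zeros:", zero_share, "theoretical:", math.exp(-rate))
```

The probability of an exact zero is exp(-rate), so it is governed jointly by μ, φ and p rather than by a separate zero-inflation parameter.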
The smooth function captures the swing in the data and highlights the seasonal behaviour with dry and wet months around the winter and summer seasons, respectively.

In order to compare the computational times required by each approach for fitting the Tweedie regression model for this data set, we used the package rbenchmark (Kusnierczyk; 2012). The computations were done on a standard personal computer at 2.90 GHz with 8 GB RAM by using the R software version … for ten replications. The results showed that the Q approach is 37 and 0.22 times faster than the MLE and P approaches, respectively.

Figure 8: Regression parameter estimates and 95% confidence intervals by estimation methods for the Curitiba rainfall data.

5.2 Income dynamics in Australia

We consider some aspects of a cross-section study on earnings of 595 individuals for the year 1982 in Australia. The data set is available in the package AER (Kleiber and Zeileis; 2008) for the statistical software R. The response variable wage is known to be highly right-skewed. The data set has 12 covariates: experience, years of full-time work experience; weeks, weeks worked; occupation, factor with two levels (white-collar, blue-collar); industry, factor with two levels (no; yes) indicating if the individual works in a manufacturing industry; south, factor with two levels (no; yes) indicating if the individual resides in the south; smsa, factor with two levels (no; yes) indicating if the individual resides in a standard metropolitan area; gender, factor indicating gender (male, female); union, factor with two levels (no, yes) indicating if the individual's wage is set by a union contract; ethnicity, factor indicating ethnicity, African-American (afam) or not (other). The main goal of the investigation was to assess the effect of the covariates on the wage.

We fitted the Tweedie regression model with linear predictor composed of all covariates by using the three estimation methods. Table 2 shows the estimates and standard errors for the regression, dispersion and power parameters. The results in Table 2 show that the MLE and Q approaches strongly agree in terms of estimates and standard errors for the regression coefficients. The P approach presents estimates slightly different from the MLE and Q approaches. Regarding the dispersion parameters, despite the slight differences in terms of estimates and standard errors, the confidence intervals from the Q and P approaches contain the MLE.
Concerning the effect of the covariates, the MLE and Q approaches agree that the covariates weeks and south are non-significant. On the other hand, the P approach also indicated that the covariates industry and married are non-significant. Regarding the other covariates, the three approaches agree that they are significant.

In order to compare the fit of the Tweedie regression model with more standard approaches, we also fitted the Gaussian, gamma and inverse Gaussian regression models for the income dynamics data set. The maximized values of the log-likelihood function were …, … and … for the Gaussian, gamma and inverse Gaussian models, respectively. Furthermore, the maximized value of the log-likelihood function for the Tweedie regression model was …, which in turn shows the better fit of the Tweedie regression model, as expected. In terms of computational time for this data set, the Q approach was 45 and 0.15 times faster than the MLE and P approaches, respectively.

Table 2: Regression, dispersion and power parameter estimates and standard errors (SE) by estimation methods for the income dynamics data. Columns: Estimate and SE for each of the MLE, Q and P methods. Rows: Intercept, experience, weeks, occupation, industry, south, smsa, married, gender, union, education, ethnicity, δ and p.

5.3 Gain in weight of rats

The third example concerns a standard Gaussian regression model. The goal of this example is to show that the quasi- and pseudo-likelihood approaches can estimate values of the power parameter between 0 and 1, where the maximum likelihood estimator does not exist. We used the weightgain data set available in the HSAUR package (Everitt and Hothorn; 2015). This data set corresponds to an experiment to study the gain in weight of rats fed on four different diets, distinguished by the amount of protein (low and high) and by source of protein (beef and cereal). The data set has 40 observations. We fitted the Gaussian, gamma, inverse Gaussian and Tweedie regression models

for the weightgain data set. The linear predictor was composed of the two main covariates source and type along with the interaction term, for all models. The values of the maximized log-likelihood were …, …, … and … for the Gaussian, gamma, inverse Gaussian and Tweedie models, respectively. These results showed that the Gaussian distribution provides the best fit for this data set, judging by the maximized log-likelihood value. In that case, the MLE method is not able to indicate the best fit. This is due to the non-trivial restriction on the power parameter space. Thus, we fitted the model using the Q and P approaches. Table 3 presents the estimates and standard errors for the regression, dispersion and power parameters, obtained by the MLE, Q and P approaches.

Table 3: Regression, dispersion and power parameter estimates and standard errors (SE) by estimation methods for the gain in weight of rats data. Columns: Estimate and SE for each of the MLE, Q and P methods. Rows: Intercept, source, type, source:type, δ and p.

The results in Table 3 show that the three approaches strongly agree in terms of estimates and standard errors for the regression coefficients. The value of the power parameter was estimated smaller than 1 by the Q and P approaches, as expected, since the Gaussian distribution provides the best fit for this data. On the other hand, the maximum likelihood method estimated the power parameter close to 1, the border of the parameter space, and in that case gives a non-optimal model. All approaches presented large standard errors for the power and dispersion parameters. In terms of computation time, for this application the P approach was 94 and 0.15 times faster than the MLE and Q approaches, respectively.

6 Discussion

In this paper, we adopted the quasi- and pseudo-likelihood approaches to estimation and inference of Tweedie regression models. These approaches employ merely second-moment assumptions, allowing us to extend the Tweedie regression models to the class of quasi-Tweedie regression models, which in turn offer robust and flexible models to deal with continuous data. Characteristics such as symmetry or asymmetry, heavy tails and excess 0s are easily handled because of the flexibility of the model class. These features indicate that the Tweedie model is a potentially useful tool for the modeling of continuous data. The main advantage in practical terms is that we have one model for virtually all kinds of continuous data. Thus, model selection is done automatically when fitting the model.

The main advantages of the alternative estimation approaches in relation to the orthodox maximum likelihood method are their easy implementation and computational speed. Furthermore, by employing only second-moment assumptions, we eliminated the non-trivial restriction on the parameter space of the power parameter, making the fitting algorithm simple and efficient. It also allows us to apply the Tweedie regression models to symmetric and heavy-tailed data, as in the cases of Gaussian and t-student data, where in general the power parameter presents negative or close to 0 values. Another potential application of the Tweedie regression model is the analysis of left-skewed data, where we also expect negative values for the power parameter.

The theoretical development in Section 3 showed that the quasi-likelihood approach has much in common with the orthodox maximum likelihood method. The quasi-score function employed in the context of quasi-likelihood estimation coincides with the score function for Tweedie distributions, which also implies that it will coincide for all exponential dispersion models. The asymptotic variance of the quasi-likelihood estimators for the regression parameters coincides with the asymptotic variance of the maximum likelihood estimators, in the case of known power and dispersion parameters.
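As an illustration of why the quasi-score needs only the first two moments, the following minimal sketch runs Fisher scoring on the quasi-score for the regression parameters with the power and dispersion parameters held fixed. It is not the authors' R implementation. It assumes a log link, a single covariate plus intercept, and the Tweedie variance function Var(Y) = φ μ^p; the function names, the starting values and the simulated gamma data are hypothetical choices for the example.

```python
import math
import random


def quasi_score_step(beta, x, y, p, phi=1.0):
    """One Fisher scoring update for (beta0, beta1) under a log link and
    Var(Y) = phi * mu**p. Returns the updated beta and the quasi-score
    evaluated at the input beta."""
    b0, b1 = beta
    s0 = s1 = 0.0          # quasi-score components
    a00 = a01 = a11 = 0.0  # entries of the 2 x 2 sensitivity matrix
    for xi, yi in zip(x, y):
        mu = math.exp(b0 + b1 * xi)
        w_score = mu ** (1.0 - p) * (yi - mu) / phi  # (dmu/deta) (y - mu) / Var
        w_info = mu ** (2.0 - p) / phi               # (dmu/deta)^2 / Var
        s0 += w_score
        s1 += w_score * xi
        a00 += w_info
        a01 += w_info * xi
        a11 += w_info * xi * xi
    det = a00 * a11 - a01 * a01
    d0 = (a11 * s0 - a01 * s1) / det
    d1 = (a00 * s1 - a01 * s0) / det
    return (b0 + d0, b1 + d1), (s0, s1)


def fit_quasi_score(x, y, p, phi=1.0, tol=1e-10, max_iter=100):
    """Iterate the scoring step until the update is smaller than tol."""
    beta = (math.log(sum(y) / len(y)), 0.0)  # start at the intercept-only fit
    for _ in range(max_iter):
        new_beta, _ = quasi_score_step(beta, x, y, p, phi)
        if max(abs(a - b) for a, b in zip(new_beta, beta)) < tol:
            return new_beta
        beta = new_beta
    return beta


# Simulated gamma responses (p = 2) with mean exp(0.5 + 1.0 * x).
rng = random.Random(42)
x = [rng.random() for _ in range(2000)]
y = [rng.gammavariate(4.0, math.exp(0.5 + 1.0 * xi) / 4.0) for xi in x]

print("p = 2.0:", fit_quasi_score(x, y, p=2.0))
print("p = 1.5:", fit_quasi_score(x, y, p=1.5))
```

In this sketch φ cancels from the update, so the regression estimates do not depend on it, and fitting with a different fixed p still targets the same regression parameters because the quasi-score remains unbiased whenever the mean is correctly specified.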
Hence, the quasi-likelihood approach provides asymptotically efficient estimation for the regression parameters. Furthermore, the quasi-likelihood approach as used in this paper, combining the quasi-score and Pearson estimating functions, presents the insensitivity property (see Eq. 9), which is an analogue to the orthogonality property in the context of maximum likelihood estimation. The insensitivity property allows us to apply the two-step Newton scoring algorithm, using two separate equations to update the regression and dispersion parameters. A similar procedure can be used in the maximum likelihood framework, since the vectors β and λ are orthogonal.

In the context of quasi-likelihood estimation, in this paper, we used the unbiased Pearson estimating function for estimation of the power and dispersion parameters. The discussion about efficiency in that case is difficult, since we cannot obtain a closed form for the Fisher information matrix. The fact that the sensitivity and variability matrices associated with the dispersion parameters
