On Statistical Methods for Zero-Inflated Models


U.U.D.M. Project Report 2015:9

On Statistical Methods for Zero-Inflated Models

Julia Eggers

Degree project in mathematics (Examensarbete i matematik), 15 credits
Supervisor and examiner: Silvelyn Zwanzig
June 2015

Department of Mathematics, Uppsala University


Abstract

Data with excess zeros arise in many contexts. Conventional probability distributions often cannot explain large proportions of zero observations. In this paper we study statistical models which take large proportions of zero observations into account. We consider both discrete and continuous distributions.

Contents

1 Introduction
2 Models for Zero-Inflated Data
3 Models for Semicontinuous Data with Excess Zeros
   3.1 Tobit Models
   3.2 Sample Selection Models
   3.3 Double Hurdle Models
   3.4 Two-Part Models
4 Inference in Models for Zero-Inflated Data
   4.1 The Likelihood Function
   4.2 Maximum Likelihood Estimators
   4.3 Moment Estimators
   4.4 Cold Spells in Uppsala
   4.5 Exponential Family
5 Inference in Two-Part Models
   5.1 The Likelihood Function
   5.2 Maximum Likelihood Estimators
   5.3 Moment Estimators
   5.4 Exponential Family
   5.5 Hypothesis Testing
References

1. Introduction

In this paper we will study models for data with a large proportion of zeros. We begin by introducing some terminology.

Definition. Discrete probability distributions with a large probability mass at zero are said to be zero-inflated.

Conventional distributions usually cannot explain the large proportion of zeros in zero-inflated data. For this reason, models which can account for a large proportion of zero observations must be applied instead.

Definition. Probability distributions which are continuous on the entire sample space, with the exception of one value at which there is a positive point mass, are said to be semicontinuous.

In this paper we will study models for zero-inflated distributions as well as for semicontinuous distributions with a positive probability mass at zero. We will only consider distributions with non-negative support.

Remark. Unlike in the case of left-censored data, zeros in semicontinuous data correspond to actual observations and do not represent negative or missing values which have been coded as zero.

Data with excess zeros may arise in many different contexts. We start by giving a few examples.

Examples of zero-inflated data

Cold spells. In his paper on trends for warm and cold spells in Uppsala, Jesper Rydén studied the yearly number of cold spells in Uppsala, Sweden, for the period from 1840 to 2012 [07]. He defined a cold spell as a period of at least six consecutive days during which the daily minimum temperature was less

than −13.4 °C. The threshold of −13.4 °C was chosen as it corresponds to the 5%-quantile of daily minimum temperatures for a reference period. The yearly number of cold spells in Uppsala appears to be zero-inflated, as can be seen from the data below: there is a large proportion of zero observations, i.e. years during which there were no cold spells.

Figure 1.1: Yearly number of cold spells in Uppsala, 1840–2012.

Defects in manufacturing. In manufacturing processes, defects usually only occur when manufacturing equipment is not properly aligned. If the equipment is misaligned, defects can be found to occur according to a Poisson distribution [08]. This implies that defects in manufacturing occur according to a Poisson distribution with inflation at zero.

Examples of semicontinuous data with excess zeros

Household expenditures on durable goods. The amount of money a household spends monthly on certain durable goods, such as cars or appliances like washing machines or refrigerators, is distributed according to a semicontinuous distribution. During most months no such goods are purchased and the expense is zero. If durable goods are purchased, the household expenditure on durable goods for that

month amounts to some positive value, namely the price of the purchased items.

Alcohol consumption. Consider the alcohol consumption of a population during a certain period of study. Some people belonging to the population may not drink any alcohol at all, thus consuming zero liters of alcohol. These people account for a point mass at zero. People who do consume alcohol may consume arbitrarily large, but positive, amounts. Thus we have a continuous distribution for positive values. Similarly, tobacco consumption, or the consumption of drugs in general, is semicontinuously distributed.

Insurance benefits. The Swedish Social Insurance Agency Försäkringskassan publishes annual reports on its expenditures. The publication Social Insurance in Figures 2014 states that in 2013 a total of approximately 24.1 million SEK was paid out as sickness benefits. These sickness benefits are meant to compensate the insured for the inability to work due to illness. The number of people in Sweden who received sickness benefits in 2013 corresponds to around 9% of all insured between the ages of 16 and 64.

The amounts of sickness benefits paid out to the insured during the year 2013 are semicontinuously distributed: 91% of all insured received no such benefits, so we have a positive probability mass at zero. Those people who did receive sickness benefits got positive amounts which varied according to factors like income and time spent on sick leave. Therefore, we have a continuous distribution for positive values. Table 1.1 below gives an account of the average amounts of sickness benefits paid out to the insured depending on gender and age group.

Table 1.1: Sickness benefits, 2013

We know from the data how many insured received sickness benefits in 2013 and how many received no such benefits. Since 24.1 million SEK was paid out in total, we can compute the approximate average positive amount paid out per person that year. Assuming that the paid-out benefits are exponentially distributed with parameter λ given that they are positive, we may estimate λ̂ as the reciprocal of this average positive amount. Generating a sample from this distribution in R, we may illustrate how the sickness benefits may have been distributed.

Figure 1.2: A possible distribution of the amount of sickness benefits paid out to insured during the year 2013.
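The semicontinuous structure described here is easy to simulate. Below is a minimal sketch (in Python rather than the R used in the report): the 9% recipient share is taken from the text, while the mean benefit amount is a placeholder, since the actual average does not survive in this copy.

```python
import random
import statistics

# Recipient share (~9%) is from the text; the exponential mean is an
# assumed placeholder value, not the figure computed in the report.
P_RECEIVE = 0.09
MEAN_BENEFIT = 45_000.0  # hypothetical average positive amount, SEK

def draw_benefit(rng: random.Random) -> float:
    """Draw one observation from the semicontinuous (two-part) model."""
    if rng.random() >= P_RECEIVE:
        return 0.0  # point mass at zero: no benefits received
    return rng.expovariate(1.0 / MEAN_BENEFIT)  # continuous positive part

rng = random.Random(1)
sample = [draw_benefit(rng) for _ in range(100_000)]
zero_share = sum(x == 0.0 for x in sample) / len(sample)
mean_positive = statistics.fmean(x for x in sample if x > 0)
print(f"share of zeros: {zero_share:.3f}")                # close to 0.91
print(f"mean positive amount: {mean_positive:,.0f} SEK")  # close to 45,000
```

The simulated sample shows the two ingredients of the distribution directly: a point mass at zero and an exponential tail for positive amounts.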

Healthcare expenditures. Healthcare expenditures in general can be found to be semicontinuously distributed. Individuals may or may not choose to seek medical treatment during a certain period of study. There are no costs arising for people who do not seek medical treatment. If medical treatment is sought, however, the cost of the treatment amounts to some positive value. Thus healthcare expenditures are continuously distributed for positive values and have a positive probability mass at zero.

2. Models for Zero-Inflated Data

The models for zero-inflated data which we will present here are variations of the following mixture model:

Y = Δ Z_1 + (1 − Δ) Z_2

with Δ ~ Ber(p), Z_1 ~ P_{Z_1} and Z_2 ~ P_{Z_2}. If we let P_{Z_2} = δ_{0} and assume Z_1 to be discrete, we obtain a model for zero-inflated data. When modeling count data we have the additional assumption that P(Z_1 ≥ 0) = 1. For the above model we have that Y ~ δ_{0} with probability 1 − p and Y ~ P_{Z_1} with probability p. Letting p_{Z_1} denote the probability mass function of the random variable Z_1, we obtain

P(Y = y) = 1 − p + p·p_{Z_1}(0), if y = 0,
P(Y = y) = p·p_{Z_1}(y), if y > 0.

When modeling count data, the negative binomial and the Poisson distribution are common distributions for Z_1. If Z_1 ~ Po(λ), the above model is referred to as the zero-inflated Poisson model, abbreviated ZIP.

Zero-Inflated Poisson Regression

Zero-inflated Poisson regression is an extension of the zero-inflated Poisson model which was proposed by Diane Lambert in 1992 [08]. The model assumes Y = (Y_1, ..., Y_n) to be a sample of independent, but not necessarily identically distributed, random variables Y_i. In this model we assume Y_i ~ Po(λ_i) with probability p_i. Thus

P(Y_i = y_i) = 1 − p_i + p_i exp(−λ_i), if y_i = 0,
P(Y_i = y_i) = p_i λ_i^{y_i} exp(−λ_i) / y_i!, if y_i > 0.

The parameters p = (p_1, ..., p_n) and λ = (λ_1, ..., λ_n) are assumed to satisfy

log(λ) = Bβ and logit(p) = log(p/(1 − p)) = Gγ

with B and G denoting matrices of explanatory variables, and β and γ denoting vectors of coefficients describing the linear dependency of log(λ) and logit(p) on B and G respectively. If p and λ depend on the same explanatory variables, the number of model parameters may be reduced by expressing p as a function of λ. Lambert proposes the relation logit(p) = −τ log(λ) for some τ ∈ R. This implies that p_i = 1/(1 + λ_i^τ). The resulting model is denoted by ZIP(τ).
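As a sketch of the plain ZIP model (not the regression extension), the probability mass function and a simple sampler can be written as follows. The parameter values and sample size are purely illustrative:

```python
import math
import random

def zip_pmf(y: int, p: float, lam: float) -> float:
    """P(Y = y) in the ZIP model: point mass 1-p at zero mixed with Po(lam)."""
    pois = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return (1 - p) * (y == 0) + p * pois

def zip_sample(n: int, p: float, lam: float, rng: random.Random) -> list:
    """Draw n ZIP observations: structural zero with prob. 1-p, else Poisson."""
    out = []
    for _ in range(n):
        if rng.random() >= p:
            out.append(0)              # structural zero
            continue
        u, k = rng.random(), 0         # inverse-CDF draw from Po(lam)
        term = cdf = math.exp(-lam)
        while u > cdf:
            k += 1
            term *= lam / k
            cdf += term
        out.append(k)
    return out

rng = random.Random(7)
p, lam = 0.6, 1.39                     # illustrative parameter values
ys = zip_sample(200_000, p, lam, rng)
print(ys.count(0) / len(ys))           # near P(Y=0) = 1-p+p*exp(-lam) ≈ 0.549
print(sum(ys) / len(ys))               # near p*lam ≈ 0.834
```

Note that the empirical frequency of zeros matches the model probability 1 − p + p e^{−λ}, even though individual zeros cannot be attributed to the structural or the Poisson component.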

3. Models for Semicontinuous Data with Excess Zeros

There are a number of different models which can be applied to semicontinuous data with excess zeros. We will present a few of the most common ones. In all of these models we let Y denote the observed variable. The models we present are all special kinds of two-component mixture models. A mixture model with two components has the form

Y = Δ Z_1 + (1 − Δ) Z_2

with Δ ~ Ber(p), Z_1 ~ P_{Z_1} and Z_2 ~ P_{Z_2}. In the models we will present, we have that P_{Z_2} = δ_{0}.

3.1 Tobit Models

The Tobit model, which was proposed by James Tobin in 1958 [03], assumes that Y can be expressed in terms of a latent variable Y* which can only be observed for values greater than zero. The random variable Y is defined as follows:

Y = Y*, if Y* > 0,
Y = 0, if Y* ≤ 0.

The latent variable Y* is assumed to be linearly dependent on a number of explanatory (and observable) variables and can be expressed as a linear combination of these, i.e.

Y* = Xβ + ε

where X is a row vector containing the explanatory variables and β is a column vector with the corresponding coefficients describing the linear dependency of Y* on X. The error terms ε are assumed to be independently and identically distributed

according to N(0, σ²). Thus the Tobit model assumes an underlying normal distribution. The probability that Y takes the value zero is given by

P(Y = 0) = P(Y* ≤ 0) = P(Xβ + ε ≤ 0) = P(ε ≤ −Xβ) = P(ε/σ ≤ −Xβ/σ) = Φ(−Xβ/σ) = 1 − Φ(Xβ/σ).

This part of the Tobit model corresponds to the so-called Probit model. The name Tobit alludes to the model having been proposed by Tobin and being based on the Probit model. The likelihood function L of the uncensored positive values of Y is given by the probability density function of the latent variable Y* given that it is positive, i.e.

L(y | y > 0) = (1/σ) φ((y − Xβ)/σ).

In Tobit models the probability of a zero observation depends on the same random variable that determines the magnitude of the observation given that it is positive. Note that in the Tobit model zeros do not represent actual responses. The Tobit model is therefore not appropriate for semicontinuous data. It is, however, often applied to such data in spite of this.

Remark. There are many variations of the Tobit model. Censoring can, for instance, be performed at values other than zero. There are also models where censoring is done from above instead of below, or from both above and below.

Remark. The mixture model above corresponds to the Tobit model if Z_1 = Y* and p = P(Y* > 0) = Φ(Xβ/σ).

3.2 Sample Selection Models

The sample selection model was first proposed by J. Heckman in 1979 as an extension of the Tobit model [11]. Sample selection models are based on two latent variables Y_1* and Y_2*. The first latent variable Y_1* is assumed to be of the form Y_1* = X_1 β_1 + ε_1, with X_1 being a row vector of explanatory variables and β_1 being the corresponding vector of coefficients describing the linear dependency of Y_1* on X_1. Similarly, the second latent variable Y_2* is assumed to be of the form Y_2* = X_2 β_2 + ε_2, again with X_2 being a row vector of explanatory variables and β_2 being the corresponding vector of coefficients.
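The censoring probability P(Y = 0) = 1 − Φ(Xβ/σ) is straightforward to evaluate with only the standard library, using the identity Φ(x) = (1 + erf(x/√2))/2. The numerical inputs below are invented for illustration:

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal CDF Φ(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tobit_zero_prob(xb: float, sigma: float) -> float:
    """P(Y = 0) = 1 - Φ(Xβ/σ) in the Tobit model (xb stands for Xβ)."""
    return 1.0 - std_normal_cdf(xb / sigma)

# If Xβ = 0, the latent variable is centred exactly at the censoring point,
# so half of the probability mass is censored to zero.
print(tobit_zero_prob(0.0, 1.0))   # 0.5
print(tobit_zero_prob(1.0, 1.0))   # ≈ 0.159
```

This is the Probit part of the model: the same index Xβ/σ that shifts the positive observations also controls how much mass is censored to zero.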

Sample selection models thus allow the latent variables to depend on different covariates. The error terms (ε_1, ε_2) are assumed to be independently and identically distributed according to a bivariate normal distribution. They may thus be correlated. The observed variable is defined as

Y = Y_2*, if Y_1* > 0,
Y = 0, if Y_1* ≤ 0.

The sample selection model coincides with the Tobit model if X_1 = X_2 and β_1 = β_2 (i.e. Y_1* = Y_2*).

Remark. The mixture model above corresponds to the sample selection model if Z_1 = Y_2* and p = P(Y_1* > 0).

3.3 Double Hurdle Models

Similarly to sample selection models, double hurdle models are based on two latent variables Y_1* and Y_2*. These latent variables are again assumed to be of the form Y_1* = X_1 β_1 + ε_1 and Y_2* = X_2 β_2 + ε_2, with X_1 and X_2 denoting row vectors with observed values of explanatory variables, and β_1 and β_2 denoting column vectors containing the corresponding coefficients describing the linear dependency of Y_1* and Y_2* on X_1 and X_2 respectively. The error terms (ε_1, ε_2) are again assumed to be independently and identically distributed according to a bivariate normal distribution. In double hurdle models, the observed variable is defined as

Y = Y_2*, if Y_1* > 0 and Y_2* > 0,
Y = 0, otherwise.

To illustrate the idea behind double hurdle models, we apply it to the example of tobacco consumption. We thus let Y denote the amount of tobacco consumed by an individual during a certain period of time. The first latent variable Y_1* may determine whether an individual is a smoker or a non-smoker. This may depend on certain socioeconomic factors, which can be accounted for by the dependency of Y_1* on X_1. The second latent variable Y_2* may thereafter be used to determine how much tobacco is consumed by an individual given that the individual is a smoker. This quantity may depend on other covariates than the ones that affected the probability of the individual being a smoker in the first place. Note that it is possible for a smoker not to consume any tobacco during the period of the study; in other words, we may have Y_2* ≤ 0 while Y_1* > 0.

We see that in order to observe positive values of Y, two hurdles need to be overcome: the individual must be a smoker, and must smoke during the period of the study. Hence the name double hurdle model.

Remark. The mixture model above corresponds to the double hurdle model if Z_1 = Y_2* and p = P(Y_1* > 0, Y_2* > 0).

3.4 Two-Part Models

As the name suggests, two-part models consist of two parts. In the first part of the model, a random variable Δ determines whether the observation is zero or positive. In the second part, another random variable Z determines the magnitude of the observation given that it is positive. The value of the random variable Z is not observed if Δ has taken the value zero. The random variables Δ and Z are assumed to be independent. Moreover, we assume that P(Z > 0) = 1. In other words, we have the following model for the random variable Y:

Y = 1_{1}(Δ) Z = Δ Z,   (3.1)

with Δ ~ Ber(θ_1) and Z ~ P_Z ∈ {P_{θ_2}} being independent and P(Z > 0) = 1. Thus Y ~ P_Y ∈ {P_θ, θ = (θ_1, θ_2)}.

Two-part models do not assume an underlying normal distribution and can therefore be applied to a wider range of data than, for instance, Tobit models. Note that in two-part models we do not have a latent variable. Zeros correspond to actual observations and are not the result of censoring as in the previously presented models. Consequently, two-part models are more appropriate for modeling semicontinuous data than the other models we have presented. In the following, we will therefore restrict ourselves to the study of two-part models.

Remark. The mixture model above corresponds to the two-part model if Z_1 = Z.

Remark. Note that in all the models for semicontinuous data with excess zeros we have that P(Z_1 = 0) = 0 and P(Z_2 = 0) = 1. We can therefore distinguish between observations from Z_1 and Z_2. For zero-inflated count data, however, we have that P(Z_2 = 0) = 1 and P(Z_1 = 0) > 0. Here we are unable to distinguish between zero observations from Z_1 and zero observations from Z_2.

4. Inference in Models for Zero-Inflated Data

Let Y denote the observed variable. We assume the following model, P_Y ∈ {P_θ, θ = (p, λ)}, for Y:

Y = Δ Z_1 + (1 − Δ) Z_2

with Δ ~ Ber(p), Z_1 ~ P_{Z_1} ∈ {P_λ} and Z_2 ~ δ_{0} being independent. Moreover, we assume that Z_1 is discrete and that P(Z_1 ≥ 0) = 1. Note that the observed variable Y has non-negative support; thus the above model can be applied to, for instance, count data. For this model we have

P(Y = y) = 1 − p + p·p_{Z_1}(0), if y = 0,
P(Y = y) = p·p_{Z_1}(y), if y > 0,

with p_{Z_1} denoting the probability mass function of Z_1. In the case that Z_1 ~ Po(λ) we obtain a zero-inflated Poisson model with

P(Y = y) = 1 − p + p exp(−λ), if y = 0,
P(Y = y) = p λ^y exp(−λ) / y!, if y > 0.

Theorem. The expected value E[Y] and variance Var[Y] of Y are given by

E[Y] = p E[Z_1] and Var[Y] = p Var[Z_1] + p(1 − p) E[Z_1]².

Proof. Since Z_2 ~ δ_{0}, we have Y = Δ Z_1 almost surely. By the independence of Δ and Z_1, the expected value of Y is given by

E[Y] = E[Δ Z_1] = E[Δ] E[Z_1] = p E[Z_1].

The variance of Y is given by

Var[Y] = Var[Δ Z_1] = E[Δ²] E[Z_1²] − E[Δ]² E[Z_1]²
= (Var[Δ] + E[Δ]²)(Var[Z_1] + E[Z_1]²) − E[Δ]² E[Z_1]²
= (p(1 − p) + p²)(Var[Z_1] + E[Z_1]²) − p² E[Z_1]²
= p Var[Z_1] + p(1 − p) E[Z_1]².

Corollary. In the zero-inflated Poisson model the expected value E[Y] and variance Var[Y] of Y are given by E[Y] = pλ and Var[Y] = pλ(1 + λ − pλ).

Proof. In the zero-inflated Poisson model Z_1 ~ Po(λ), so E[Z_1] = Var[Z_1] = λ. Plugging these values into the expressions for E[Y] and Var[Y] yields the above result.

4.1 The Likelihood Function

Definition. The likelihood function L(θ, y) : Θ → R_+ for an observation y of a random variable Y with probability function p(θ, y) is given by

L(θ, y) := p(θ, y).   (4.1)

For a sample Y = (Y_1, ..., Y_n) of independent and identically distributed random variables the likelihood function is given by

L(θ, y) := ∏_{i=1}^{n} p(θ, y_i).   (4.2)

We will now consider a sample y = (y_1, ..., y_n) of independent and identically distributed random variables Y_i ~ P_Y. Let r denote the number of zero observations in the sample y. The likelihood function L(p, λ, y) of the sample y is given by

L(p, λ, y) = ∏_{i=1}^{n} P(Y_i = y_i) = ∏_{y_i=0} (1 − p + p·p_{Z_1}(0)) ∏_{y_i>0} p·p_{Z_1}(y_i)
= (1 − p + p·p_{Z_1}(0))^r p^{n−r} ∏_{y_i>0} p_{Z_1}(y_i).

If Z_1 ~ Po(λ) we have

L(p, λ, y) = (1 − p + p exp(−λ))^r p^{n−r} λ^{Σ y_i} exp(−λ(n − r)) / ∏_{i=1}^{n} y_i!.

4.2 Maximum Likelihood Estimators

Definition. The maximum likelihood estimator θ̂_MLE of a parameter θ is a value of θ which maximizes the likelihood function, i.e.

θ̂_MLE ∈ argmax_{θ ∈ Θ} L(θ, y), y ∈ χ,   (4.3)

with χ denoting the sample space of Y.

Theorem. Let p ∈ (0, 1) and Z_1 ~ Po(λ), i.e. assume a zero-inflated Poisson model for the sample y. The maximum likelihood estimators p̂_MLE(Y) and λ̂_MLE(Y) are given by

p̂_MLE(Y) = (n − r) / (n(1 − e^{−λ̂_MLE}))

and

(1 − e^{−λ̂_MLE}) Σ_{i=1}^{n} y_i = λ̂_MLE (n − r).

Proof. The likelihood function L(p, λ, y) of the sample y is given by

L(p, λ, y) = (1 − p + p exp(−λ))^r p^{n−r} λ^{Σ y_i} exp(−λ(n − r)) / ∏_{i=1}^{n} y_i!.

The values of p for which L(p, λ, y) is maximized satisfy

∂/∂p L(p, λ, y) = 0
⇔ ∂/∂p ln(L(p, λ, y)) = 0
⇔ ∂/∂p ( r ln(1 − p + p e^{−λ}) + (n − r) ln(p) + Σ y_i ln(λ) − λ(n − r) − ln(∏_i y_i!) ) = 0
⇔ r(e^{−λ} − 1)/(1 − p + p e^{−λ}) + (n − r)/p = 0
⇔ p = (n − r) / (n(1 − e^{−λ})).

The values of λ which maximize L(p, λ, y) satisfy

∂/∂λ L(p, λ, y) = 0
⇔ ∂/∂λ ( r ln(1 − p + p e^{−λ}) + (n − r) ln(p) + Σ y_i ln(λ) − λ(n − r) − ln(∏_i y_i!) ) = 0.

Inserting p = (n − r)/(n(1 − e^{−λ})) gives

(1 − e^{−λ}) Σ_{i=1}^{n} y_i = λ(n − r).

Remark. Numerical methods must be applied to solve the equation

(1 − e^{−λ}) Σ y_i = λ(n − r).

4.3 Moment Estimators

Definition. Let Y = (Y_1, Y_2, ..., Y_n) be a sample of independent and identically distributed random variables with distributions depending on a parameter θ. The moment estimator of order k for θ is given by the value of θ for which

E[Y^k] = g(θ) = (1/n) Σ_{i=1}^{n} Y_i^k
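Following the remark, the equation for λ̂_MLE can be solved with a simple bisection. The sample summaries below (n, r and Σ y_i) are invented for illustration:

```python
import math

def zip_lambda_mle(sum_y: float, n: int, r: int) -> float:
    """Solve (1 - exp(-lam)) * sum_y = lam * (n - r) for lam by bisection.
    A positive root exists when sum_y > n - r."""
    f = lambda lam: (1.0 - math.exp(-lam)) * sum_y - lam * (n - r)
    lo, hi = 1e-12, sum_y / (n - r)  # f(lo) > 0 and f(hi) < 0 bracket the root
    for _ in range(200):             # bisection halves the bracket each step
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Invented sample summaries: n observations, r zeros, sum of the counts.
n, r, sum_y = 100, 40, 150
lam_hat = zip_lambda_mle(sum_y, n, r)
p_hat = (n - r) / (n * (1 - math.exp(-lam_hat)))
print(f"lambda_hat = {lam_hat:.4f}, p_hat = {p_hat:.4f}")
```

Once λ̂ is found, p̂ follows in closed form from the first equation of the theorem.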

where g is the function expressing the moment E[Y^k] in terms of θ.

Theorem. Let Z_1 ~ Po(λ), i.e. assume a zero-inflated Poisson model for the sample y. The moment estimators p̂_MME(Y) and λ̂_MME(Y) are given by

p̂_MME(Y) = (Σ Y_i)² / ( n (Σ Y_i² − Σ Y_i) )

and

λ̂_MME(Y) = Σ Y_i² / Σ Y_i − 1.

Proof. The moment estimators p̂_MME(Y) and λ̂_MME(Y) are given by values of p and λ which satisfy

E[Y] = (1/n) Σ Y_i = pλ,
E[Y²] = (1/n) Σ Y_i² = Var[Y] + E[Y]² = pλ(1 + λ − pλ) + p²λ² = pλ(1 + λ).

Dividing the second equation by the first yields

1 + λ = Σ Y_i² / Σ Y_i, so λ = Σ Y_i² / Σ Y_i − 1,

and inserting this into the first equation gives

p = ((1/n) Σ Y_i) / λ = (Σ Y_i)² / ( n (Σ Y_i² − Σ Y_i) ).

4.4 Cold Spells in Uppsala

We now assume a zero-inflated Poisson model for the data regarding cold spells in Uppsala (see Chapter 1, Figure 1.1) [07]. We let y = (y_1, ..., y_169) denote the corresponding sample and assume that the observations are independent and identically distributed. Note that, in reality, the numbers of cold spells that occur during two consecutive years may not actually be independent, so this assumption may not hold. For this sample we have

Σ_{i=1}^{169} y_i = 142 and Σ_{i=1}^{169} y_i² = 340.

We thus obtain the following moment estimates for the model parameters p and λ.

p̂_MME(y) = (Σ y_i)² / ( n (Σ y_i² − Σ y_i) ) = 142² / (169 · (340 − 142)) ≈ 0.60,

λ̂_MME(y) = Σ y_i² / Σ y_i − 1 = 340/142 − 1 ≈ 1.39.

Note that p̂_MME(y) ≈ 0.6 < 1, so the yearly number of cold spells in Uppsala does indeed appear to be zero-inflated.

Theorem. An approximate level α test for the testing problem

H_0: p = 1, λ = 1.39 versus H_1: 0 < p < 1, λ = 1.39

is given by

φ(y) = 1, if −2 ln(Λ(y)) ≥ χ²_α(1),
φ(y) = 0, if −2 ln(Λ(y)) < χ²_α(1),

where

Λ(y) = (1 − p̂ + p̂ e^{−1.39})^{−r} p̂^{−(n−r)} e^{−1.39 r} with p̂ = (n − r)/(n(1 − e^{−1.39})).

Proof. The likelihood ratio Λ(Y) is given by

Λ(Y) = max{L(p, 1.39, y) : p = 1} / max{L(p, 1.39, y) : 0 < p ≤ 1}
= e^{−1.39 r} / max_{0<p≤1} (1 − p + p e^{−1.39})^r p^{n−r}
= (1 − p̂ + p̂ e^{−1.39})^{−r} p̂^{−(n−r)} e^{−1.39 r} with p̂ = (n − r)/(n(1 − e^{−1.39})).

According to Wilks' theorem, −2 ln(Λ(Y)) is approximately χ²(1)-distributed as n → ∞. We can thus reject H_0 at significance level α if −2 ln(Λ(Y)) ≥ χ²_α(1). This yields the above test.

For the sample y we obtain

−2 ln(Λ(y)) = 66.92 ≥ χ²_{0.05}(1) = 3.84.

It therefore follows that H_0 can be rejected at significance level 0.05. Moreover, the p-value for the test,

p = P_0(−2 ln(Λ(Y)) ≥ 66.92) = 1 − P_0(−2 ln(Λ(Y)) < 66.92),

is vanishingly small, so H_0 can be rejected at any reasonable significance level α. We conclude that the yearly number of cold spells in Uppsala is zero-inflated.
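The computations in this section can be reproduced directly from the reported summary statistics. The χ²(1) tail probability is evaluated via the complementary error function, using P(χ²(1) ≥ w) = erfc(√(w/2)):

```python
import math

# Summary statistics for the cold-spell sample, as given in the text.
n, sum_y, sum_y2 = 169, 142, 340

# Moment estimates from Section 4.3.
lam_mme = sum_y2 / sum_y - 1               # λ̂ = Σy²/Σy − 1
p_mme = sum_y**2 / (n * (sum_y2 - sum_y))  # p̂ = (Σy)²/(n(Σy² − Σy))
print(round(lam_mme, 2), round(p_mme, 2))  # 1.39 0.6

# Likelihood-ratio test of H0: p = 1 via Wilks' theorem.
W = 66.92                                  # -2 ln Λ(y) for this sample
chi2_crit = 3.84                           # χ²_{0.05}(1)
p_value = math.erfc(math.sqrt(W / 2.0))    # P(χ²(1) >= W) = erfc(√(W/2))
print(W >= chi2_crit, p_value < 1e-10)     # True True
```

The tail probability for W = 66.92 is on the order of 1e-16, which is why H_0 is rejected at any reasonable level.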

4.5 Exponential Family

Definition. A class of probability measures P = {P_θ : θ ∈ Θ} is called an exponential family if

L(y; θ) = A(θ) exp( Σ_{j=1}^{k} ζ_j(θ) T_j(y) ) h(y)   (4.4)

for some k ∈ N, real-valued functions ζ_1, ..., ζ_k on Θ, real-valued statistics T_1, ..., T_k and a function h on the sample space χ.

Theorem. If the class of probability measures P = {P_θ : θ ∈ Θ} forms an exponential family, then all P_θ are pairwise equivalent, i.e. for any P, Q ∈ P we have P(N) = 0 iff Q(N) = 0 [06].

Theorem. If p ∈ [0, 1], then P_Y does not form an exponential family.

Proof. Consider the probability measures P_{(0,λ_1)} and P_{(p_2,λ_2)} with p_2 ∈ (0, 1]. We have that P_{(0,λ_1)}(R\{0}) = 0 but P_{(p_2,λ_2)}(R\{0}) > 0. The two measures are therefore not equivalent, and it follows that P_Y does not form an exponential family.

Theorem. If P_{Z_1} does not form an exponential family, then P_Y does not form an exponential family.

Proof. The likelihood L(p, λ, y) of y is given by

L(p, λ, y) = (1 − p + p·p_{Z_1}(0))^r p^{n−r} ∏_{y_i>0} p_{Z_1}(y_i).

Since P_{Z_1} does not form an exponential family, ∏_{y_i>0} p_{Z_1}(y_i) is not of the form (4.4). Thus L(p, λ, y) is not of the form (4.4). Consequently, P_Y is not an exponential family.

Theorem. If p ∈ (0, 1] and P_{Z_1} is a k-parameter exponential family with natural parameters ζ_j(λ) and sufficient statistics T_j(z), j = 1, ..., k, then P_Y is a (k+1)-parameter exponential family with natural parameters ζ_j(λ), j = 1, ..., k, and ζ_{k+1}(p) = ln((1 − p + p·p_{Z_1}(0))/p), and sufficient statistics T_j(y), j = 1, ..., k, and T_{k+1}(y) = r.

Proof. We have that

L(p, λ, y) = (1 − p + p·p_{Z_1}(0))^r p^{n−r} ∏_{y_i>0} p_{Z_1}(y_i)
= exp( r ln(1 − p + p·p_{Z_1}(0)) + (n − r) ln(p) ) ∏_{y_i>0} p_{Z_1}(y_i)
= p^n exp( r ln((1 − p + p·p_{Z_1}(0))/p) ) ∏_{y_i>0} p_{Z_1}(y_i)
= p^n A(λ) exp( r ln((1 − p + p·p_{Z_1}(0))/p) + Σ_{j=1}^{k} ζ_j(λ) T_j(y) ) h(y).

Thus P_Y ∈ {P_{(p,λ)}} forms an exponential family with natural parameters ζ_j(λ), j = 1, ..., k, and ζ_{k+1}(p) = ln((1 − p + p·p_{Z_1}(0))/p), and sufficient statistics T_j(y), j = 1, ..., k, and T_{k+1}(y) = r.

5. Inference in Two-Part Models

Consider a random variable Y ~ P_Y ∈ {P_θ, θ = (θ_1, θ_2)} distributed according to the two-part model, i.e. let

Y = 1_{1}(Δ) Z = Δ Z,

with Δ and Z being independent, Δ ~ Ber(θ_1), Z ~ P_Z ∈ {P_{θ_2}} and P(Z > 0) = 1.

Theorem. The expected value E[Y] and variance Var[Y] of Y are given by

E[Y] = θ_1 E[Z] and Var[Y] = θ_1 Var[Z] + (1 − θ_1) θ_1 E[Z]².

Proof. The expected value of Y is given by E[Y] = E[Δ Z] = E[Δ] E[Z] = θ_1 E[Z]. Since Δ and Z are independent, the variance of Y is given by

Var[Y] = Var[Δ Z] = E[Δ²] E[Z²] − E[Δ]² E[Z]²
= (Var[Δ] + E[Δ]²)(Var[Z] + E[Z]²) − E[Δ]² E[Z]²
= Var[Δ] Var[Z] + Var[Δ] E[Z]² + E[Δ]² Var[Z]
= (Var[Δ] + E[Δ]²) Var[Z] + Var[Δ] E[Z]²
= ((1 − θ_1) θ_1 + θ_1²) Var[Z] + (1 − θ_1) θ_1 E[Z]²
= θ_1 Var[Z] + (1 − θ_1) θ_1 E[Z]².

It can be shown that if Ê[Z], Ê[Z]² and V̂ar[Z] are unbiased estimators of E[Z], E[Z]² and Var[Z] respectively (computed from the subsample of positive observations), then

Ê[Y] = ((n − r)/n) Ê[Z], if n − r > 0, and Ê[Y] = 0, if n − r = 0,

and

V̂ar[Y] = ((n − r)/n) V̂ar[Z] + ((n − r) r / (n(n − 1))) Ê[Z]², if n − r > 0, and V̂ar[Y] = 0, if n − r = 0,

are unbiased estimators of E[Y] and Var[Y] respectively, see [10].
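The unbiasedness of Ê[Y] can be illustrated by Monte Carlo: averaging the estimator over many small samples should recover the true mean E[Y] = θ_1 E[Z]. The true parameter values below are arbitrary, and the exponential choice for Z is only one possibility:

```python
import random
import statistics

rng = random.Random(3)
theta1, theta2 = 0.6, 0.5      # illustrative true values; Z ~ Exp(theta2)
n, n_rep = 20, 40_000          # small samples, many replications

estimates = []
for _ in range(n_rep):
    ys = [rng.expovariate(theta2) if rng.random() < theta1 else 0.0
          for _ in range(n)]
    pos = [y for y in ys if y > 0]
    if pos:
        e_z_hat = statistics.fmean(pos)            # sample mean of positives
        estimates.append(len(pos) / n * e_z_hat)   # Ê[Y] = ((n-r)/n) Ê[Z]
    else:
        estimates.append(0.0)                      # Ê[Y] = 0 when n - r = 0

true_mean = theta1 / theta2                        # E[Y] = θ1 E[Z] = θ1/θ2
print(statistics.fmean(estimates), true_mean)      # both near 1.2
```

Note that ((n − r)/n)·Ê[Z] is simply the overall sample mean (1/n) Σ y_i, which makes its unbiasedness for E[Y] immediate.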

5.1 The Likelihood Function

Theorem. Let y = (y_1, ..., y_n) be a sample of independent and identically distributed random variables Y_i distributed according to the two-part model. Moreover, let r denote the number of zero observations in the sample y and let z = (z_1, ..., z_{n−r}) be the subsample of positive observations of y. The likelihood function L(θ_1, θ_2, y) of the sample y is given by

L(θ_1, θ_2, y) = L_1(θ_1, y) L_2(θ_2, y)

where L_1(θ_1, y) = (1 − θ_1)^r θ_1^{n−r} and L_2(θ_2, y) = L(θ_2, z).

Proof. The likelihood function of the sample y is given by

L(θ_1, θ_2, y) = ∏_{y_i=0} P(Y_i = 0) ∏_{y_i>0} P(Y_i > 0) f(y_i | y_i > 0)
= ∏_{y_i=0} (1 − θ_1) ∏_{y_i>0} θ_1 f(y_i | y_i > 0)
= (1 − θ_1)^r θ_1^{n−r} ∏_{y_i>0} f(y_i | y_i > 0)

with f(y_i | y_i > 0) denoting the probability density function of the observations given that they are positive, i.e. the probability density function of Z. Thus f(y_i | y_i > 0) = f(z_i). It follows that L(θ_1, θ_2, y) = L_1(θ_1, y) L_2(θ_2, y) with L_1(θ_1, y) = (1 − θ_1)^r θ_1^{n−r} and

L_2(θ_2, y) = ∏_{y_i>0} f(y_i | y_i > 0) = ∏_{i=1}^{n−r} f(z_i) = L(θ_2, z).

Remark. Here L(θ_2, z) denotes the likelihood of the subsample z.

5.2 Maximum Likelihood Estimators

Theorem. Let θ_1 ∈ (0, 1). The maximum likelihood estimates of θ_1 and θ_2 are given by

θ̂_{1,MLE} = (n − r)/n and θ̂_{2,MLE} ∈ argmax_{θ_2 ∈ Θ_2} L(θ_2, z).

Proof. Since the likelihood function L of the sample y can be expressed as a product of two functions L_1 and L_2 that each depend on only θ_1 or θ_2, L may be maximized by maximizing L_1 and L_2 separately. In other words, we have

max_{θ ∈ Θ} L(θ_1, θ_2, y) = max_{θ_1 ∈ Θ_1} L_1(θ_1, y) · max_{θ_2 ∈ Θ_2} L_2(θ_2, y).

Maximum likelihood estimation of θ_1. We have that θ̂_{1,MLE} ∈ argmax_{θ_1 ∈ Θ_1} L_1(θ_1, y). In other words, the maximum likelihood estimate θ̂_{1,MLE} of θ_1 is a value of θ_1 which maximizes L_1(θ_1, y) = (1 − θ_1)^r θ_1^{n−r}. The values of θ_1 for which L_1(θ_1, y) is maximized satisfy the following equation:

∂/∂θ_1 L_1(θ_1, y) = 0
⇔ ∂/∂θ_1 ln(L_1(θ_1, y)) = 0
⇔ ∂/∂θ_1 ( r ln(1 − θ_1) + (n − r) ln(θ_1) ) = 0
⇔ θ_1 = (n − r)/n.

Maximum likelihood estimation of θ_2. The maximum likelihood estimate θ̂_{2,MLE} of θ_2 is a value of θ_2 which maximizes L_2(θ_2, y) = L(θ_2, z), i.e. θ̂_{2,MLE} ∈ argmax_{θ_2 ∈ Θ_2} L(θ_2, z). To obtain a maximum likelihood estimate of θ_2 we therefore only need to consider the subsample z of positive observations of y.

5.3 Moment Estimators

Theorem. The moment estimators for θ_1 and θ_2 satisfy

θ_1 E[Z] = (1/n) Σ Y_i,
θ_1 Var[Z] + θ_1 E[Z]² = (1/n) Σ Y_i².

Proof. The first moment of Y is given by E[Y] = (1/n) Σ Y_i = θ_1 E[Z]. The second moment is given by

E[Y²] = (1/n) Σ Y_i² = Var[Y] + E[Y]² = θ_1 Var[Z] + (1 − θ_1) θ_1 E[Z]² + θ_1² E[Z]² = θ_1 Var[Z] + θ_1 E[Z]².

This yields the above result.

Corollary. If Z ~ Exp(θ_2), the moment estimators θ̂_{1,MME}(Y) and θ̂_{2,MME}(Y) are given by

θ̂_{1,MME}(Y) = 2 (Σ Y_i)² / (n Σ Y_i²)

and

θ̂_{2,MME}(Y) = 2 Σ Y_i / Σ Y_i².

Proof. If Z ~ Exp(θ_2), then E[Z] = 1/θ_2 and Var[Z] = 1/θ_2². Plugging these values into the equations for the first and second moments, we obtain

θ_1/θ_2 = (1/n) Σ Y_i,
2 θ_1/θ_2² = (1/n) Σ Y_i².

Dividing the second equation by the first gives 2/θ_2 = Σ Y_i² / Σ Y_i, so

θ_2 = 2 Σ Y_i / Σ Y_i²,

and inserting this into the first equation yields

θ_1 = (θ_2/n) Σ Y_i = 2 (Σ Y_i)² / (n Σ Y_i²).

We will now consider the two parts of the two-part model separately. First we consider the part of the two-part model which determines whether the observation is zero or positive, i.e. the part corresponding to the random variable Δ. Consider the sample δ = (δ_1, ..., δ_n) defined by δ_i := 0 if y_i = 0, and δ_i := 1 if y_i > 0. We have that δ is a sample from i.i.d. random variables Δ_i ~ Ber(θ_1). The moment estimator of order 1 for θ_1 can be determined as follows:

E[Δ] = θ_1 = (1/n) Σ δ_i.

The moment estimator θ̂_{1,MME} for θ_1 is thus given by θ̂_{1,MME}(Δ) = (1/n) Σ Δ_i.

Now consider the second part of the two-part model. Again letting z = (z_1, ..., z_{n−r}) denote the subsample of positive observations of y, we get that Z ~ Exp(θ_2). The first moment estimator for θ_2 is given by

E[Z] = 1/θ_2 = (1/(n − r)) Σ_{i=1}^{n−r} Z_i ⇔ θ_2 = (n − r) / Σ_{i=1}^{n−r} Z_i.

The moment estimator of order 1 for θ_2 is given by

θ̂_{2,MME}(Z) = (n − r) / Σ_{i=1}^{n−r} Z_i.
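Both sets of estimators for the Bernoulli-Exponential two-part model can be compared on simulated data. The true parameter values below are arbitrary illustrations:

```python
import random

rng = random.Random(42)
theta1, theta2 = 0.7, 0.02     # illustrative true values; Z ~ Exp(theta2)
n = 200_000
ys = [rng.expovariate(theta2) if rng.random() < theta1 else 0.0
      for _ in range(n)]

pos = [y for y in ys if y > 0]
r = n - len(pos)

# Maximum likelihood: the two parts are estimated separately.
theta1_mle = (n - r) / n               # θ̂1 = (n-r)/n
theta2_mle = len(pos) / sum(pos)       # θ̂2 = (n-r)/Σz_i

# Method of moments for the joint model (corollary above).
sum_y, sum_y2 = sum(ys), sum(y * y for y in ys)
theta1_mme = 2 * sum_y**2 / (n * sum_y2)   # θ̂1 = 2(Σy)²/(nΣy²)
theta2_mme = 2 * sum_y / sum_y2            # θ̂2 = 2Σy/Σy²

print(theta1_mle, theta1_mme)   # both near 0.7
print(theta2_mle, theta2_mme)   # both near 0.02
```

Both approaches recover the true values on a large sample, although the joint moment estimators use the second moment and are typically noisier than the separated maximum likelihood estimators.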

Remark. We see that the moment estimators for the separate parts of the model are not the same as the moment estimators for the joint model. When deriving moment estimates for the joint model, the two parts of the two-part model may therefore not be considered separately.

5.4 Exponential Family

Let Y ~ P_Y ∈ {P_θ, θ = (θ_1, θ_2)} be distributed according to the two-part model, i.e. let Y = Δ Z with Δ and Z being independent, Δ ~ Ber(θ_1), Z ~ P_Z ∈ {P_{θ_2}} and P(Z > 0) = 1. Let θ_1 ∈ Θ_1 and θ_2 ∈ Θ_2. Moreover, let y be a random sample of size n of independent and identically distributed random variables Y_i, and let r denote the number of zero observations in the sample.

Theorem. If Θ_1 = [0, 1], then P_Y ∈ {P_θ} does not form an exponential family.

Proof. Consider the probability measures P_{(1,α_1)}, P_{(0,α_2)} and P_{(β,α_3)} with α_1, α_2, α_3 ∈ Θ_2 and β ∈ (0, 1). We have that P_{(1,α_1)}({0}) = 0 but P_{(β,α_3)}({0}) = 1 − β > 0. Moreover, P_{(0,α_2)}(R\{0}) = 0 but P_{(β,α_3)}(R\{0}) > 0. The measures are therefore not pairwise equivalent, and it follows that P_Y does not form an exponential family.

Theorem. If P_Z ∈ {P_{θ_2}} is not an exponential family, then P_Y ∈ {P_θ} does not form an exponential family.

Proof. We have that the likelihood L(θ_1, θ_2, y) of the sample y is of the form L(θ_1, θ_2, y) = L_1(θ_1, y) L_2(θ_2, y) with L_1(θ_1, y) = (1 − θ_1)^r θ_1^{n−r} and L_2(θ_2, y) = ∏_{y_i>0} f(y_i | y_i > 0) = ∏_{i=1}^{n−r} f(z_i) = L(θ_2, z). Since P_Z ∈ {P_{θ_2}} is not an exponential family, L(θ_2, z) is not of the form (4.4). Thus L(θ_1, θ_2, y) is not of the form (4.4) either. This implies that P_Y ∈ {P_θ} does not form an exponential family.

Theorem. Let P_Z ∈ {P_{θ_2}} form a k-parameter exponential family with natural parameters ζ_j(θ_2) and sufficient statistics T_j(z), j = 1, ..., k. Moreover, let Θ_1 = (0, 1). Then P_Y ∈ {P_θ} with θ = (θ_1, θ_2) forms a (k+1)-parameter exponential family with natural parameters ζ_j(θ_2), j = 1, ..., k, and ζ_{k+1}(θ_1) = ln((1 − θ_1)/θ_1), and sufficient statistics T_j(y), j = 1, ..., k, and T_{k+1}(y) = r.

Proof. We have that L(θ_1, θ_2, y) = L_1(θ_1, y) L_2(θ_2, y) with L_1(θ_1, y) = (1 − θ_1)^r θ_1^{n−r} and L_2(θ_2, y) = ∏_{y_i>0} f(y_i | y_i > 0) = ∏_{i=1}^{n−r} f(z_i) = L(θ_2, z). Since P_Z ∈ {P_{θ_2}} is a k-parameter exponential family,

L(θ_2, z) = L_2(θ_2, y) = A(θ_2) exp( Σ_{j=1}^{k} ζ_j(θ_2) T_j(z) ) h(z).

Moreover, we have that

L_1(θ_1, y) = (1 − θ_1)^r θ_1^{n−r} = θ_1^n exp( r ln((1 − θ_1)/θ_1) ).

It follows that

L(θ_1, θ_2, y) = θ_1^n A(θ_2) exp( r ln((1 − θ_1)/θ_1) + Σ_{j=1}^{k} ζ_j(θ_2) T_j(y) ) h(y),

which yields the above result.

5.5 Hypothesis Testing

We assume the following two-part model for the sample y = (y_1, ..., y_n) of observations of independent and identically distributed random variables Y_i:

Y_i = 1_{1}(Δ_i) Z_i = Δ_i Z_i,

where Δ_i ~ Ber(θ_1), θ_1 ∈ (0, 1), and Z_i ~ Exp(θ_2), θ_2 ∈ R_+\{0}, are independent random variables. Thus Y_i ~ P_θ where θ = (θ_1, θ_2). Note that P_θ belongs to an exponential family. Let r denote the number of zero observations in the sample y. Note that r is an observation of the random variable R ~ Bin(n, 1 − θ_1).

Neyman-Pearson tests for simple hypotheses and known θ_2

Theorem. The Neyman-Pearson test of size α for the testing problem

H_0: Y_i ~ P_0, i.e. θ_1 = α_1, θ_2 = β,
H_1: Y_i ~ P_1, i.e. θ_1 = α_2, θ_2 = β,

is given by

φ(y) = 1, if R < c,
φ(y) = γ = (α − P_0(R < c)) / P_0(R = c), if R = c,
φ(y) = 0, if R > c.

The value of c is given by the solution to P_0(R < c) = α or, if no such c exists, by the value of c satisfying P_0(R < c) < α < P_0(R ≤ c). Note that R ~ Bin(n, 1 − α_1) under H_0.

    φ(y) = 1 if p_0(y) < k p_1(y),   γ if p_0(y) = k p_1(y),   0 if p_0(y) > k p_1(y).

We have that

    p_0(y) = (1 − α_1)^r α_1^(n−r) β^(n−r) e^(−β Σ y_i),
    p_1(y) = (1 − α_2)^r α_2^(n−r) β^(n−r) e^(−β Σ y_i).

To obtain a test of size α, k is chosen so that P_0(p_0(Y) < k p_1(Y)) = α or, if no such k exists, so that P_0(p_0(Y) < k p_1(Y)) < α < P_0(p_0(Y) ≤ k p_1(Y)). Since the β-terms cancel in the likelihood ratio,

    P_0(p_0(Y) < k p_1(Y)) = P_0( p_0(Y)/p_1(Y) < k )
    = P_0( (1 − α_1)^r α_1^(n−r) / ((1 − α_2)^r α_2^(n−r)) < k )
    = P_0( ((1 − α_1)α_2 / ((1 − α_2)α_1))^r < k (α_2/α_1)^n )
    = P_0( r ln((1 − α_1)α_2 / ((1 − α_2)α_1)) < ln(k (α_2/α_1)^n) )
    = P_0( r < ln(k (α_2/α_1)^n) / ln((1 − α_1)α_2 / ((1 − α_2)α_1)) ),

where the last step assumes α_2 > α_1, so that the logarithm we divide by is positive (for α_2 < α_1 the inequality is reversed and the test rejects for large r instead).

Let c = ln(k (α_2/α_1)^n) / ln((1 − α_1)α_2 / ((1 − α_2)α_1)). Under H_0, r is an observation of the random variable R ~ Bin(n, 1 − α_1). We have that

    P_0(p_0(Y) < k p_1(Y)) = P_0(R < c)

and, similarly, P_0(p_0(Y) ≤ k p_1(Y)) = P_0(R ≤ c). The value for c such that either P_0(R < c) = α or P_0(R < c) < α < P_0(R ≤ c) can easily be computed. From this we can derive

    γ = (α − P_0(R < c)) / P_0(R = c).

Finally, k can be obtained through

    k = ((1 − α_1)α_2 / ((1 − α_2)α_1))^c (α_1/α_2)^n.

Remark. The Neyman-Pearson test is the most powerful test of size α for the above testing problem [06].
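To make the construction of c, γ and k concrete, the following sketch (the values of n, α, α_1, α_2 are arbitrary illustrations, not from the text) computes the critical value and randomisation probability from the Bin(n, 1 − α_1) distribution under H_0, checks that the resulting test has size exactly α, and confirms that the likelihood ratio p_0(y)/p_1(y) at r = c equals the stated k:

```python
from math import comb

def binom_pmf(j, n, p):
    """P(R = j) for R ~ Bin(n, p)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def np_test(n, alpha, alpha1):
    """Return (c, gamma) so that phi = 1 if R < c, gamma if R = c, 0 if R > c
    has size alpha when R ~ Bin(n, 1 - alpha1) under H0."""
    p0 = 1 - alpha1
    c, below = 0, 0.0                               # below = P0(R < c)
    while below + binom_pmf(c, n, p0) <= alpha:     # advance while P0(R <= c) <= alpha
        below += binom_pmf(c, n, p0)
        c += 1
    gamma = (alpha - below) / binom_pmf(c, n, p0)   # P0(R < c) + gamma P0(R = c) = alpha
    return c, gamma

n, alpha, alpha1, alpha2 = 20, 0.05, 0.4, 0.7       # illustrative values (alpha2 > alpha1)
c, gamma = np_test(n, alpha, alpha1)

# Size check: P0(R < c) + gamma * P0(R = c) equals alpha by construction.
p0 = 1 - alpha1
size = sum(binom_pmf(j, n, p0) for j in range(c)) + gamma * binom_pmf(c, n, p0)

# k from the formula above, and the likelihood ratio at r = c, which must agree.
k = (((1 - alpha1) * alpha2) / ((1 - alpha2) * alpha1))**c * (alpha1 / alpha2)**n
ratio_at_c = ((1 - alpha1) / (1 - alpha2))**c * (alpha1 / alpha2)**(n - c)
```

Because R is discrete, γ is generically strictly between 0 and 1: without randomisation at R = c, no test of the form "reject if R < c" would attain size α exactly.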

References

[01] Försäkringskassan (2014). Social Insurance in Figures 2014.

[02] Duan, N., Manning, W. G., Morris, C. N. and Newhouse, J. P. (1983). A Comparison of Alternative Models of the Demand for Medical Care. Journal of Business & Economic Statistics, Vol. 1.

[03] Tobin, J. (1958). Estimation of Relationships for Limited Dependent Variables. Econometrica, Vol. 26.

[04] Olsen, M. K. and Schafer, J. L. A Class of Models for Semicontinuous Longitudinal Data.

[05] Min, Y. and Agresti, A. (2002). Modeling Nonnegative Data with Clumping at Zero: A Survey. JIRSS, Vol. 1, Nos. 1-2.

[06] Liero, H. and Zwanzig, S. (2012). Introduction to the Theory of Statistical Inference.

[07] Rydén, J. (2014). A Statistical Analysis of Trends for Warm and Cold Spells in Uppsala by Means of Counts. Geografiska Annaler: Series A, Physical Geography.

[08] Lambert, D. (1992). Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics, Vol. 34, No. 1.

[09] Gao, J. (2007). Modeling Individual Healthcare Expenditures by Extending the Two-part Model.

[10] Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association, Vol. 50, No. 271.

[11] Heckman, J. (1979). Sample Selection Bias as a Specification Error. Econometrica, Vol. 47.


More information

Diploma Part 2. Quantitative Methods. Examiners Suggested Answers

Diploma Part 2. Quantitative Methods. Examiners Suggested Answers Diploma Part 2 Quantitative Methods Examiners Suggested Answers Q1 (a) A frequency distribution is a table or graph (i.e. a histogram) that shows the total number of measurements that fall in each of a

More information

SPRING 2007 EXAM C SOLUTIONS

SPRING 2007 EXAM C SOLUTIONS SPRING 007 EXAM C SOLUTIONS Question #1 The data are already shifted (have had the policy limit and the deductible of 50 applied). The two 350 payments are censored. Thus the likelihood function is L =

More information

Review of Discrete Probability (contd.)

Review of Discrete Probability (contd.) Stat 504, Lecture 2 1 Review of Discrete Probability (contd.) Overview of probability and inference Probability Data generating process Observed data Inference The basic problem we study in probability:

More information

MC3: Econometric Theory and Methods. Course Notes 4

MC3: Econometric Theory and Methods. Course Notes 4 University College London Department of Economics M.Sc. in Economics MC3: Econometric Theory and Methods Course Notes 4 Notes on maximum likelihood methods Andrew Chesher 25/0/2005 Course Notes 4, Andrew

More information

ST495: Survival Analysis: Hypothesis testing and confidence intervals

ST495: Survival Analysis: Hypothesis testing and confidence intervals ST495: Survival Analysis: Hypothesis testing and confidence intervals Eric B. Laber Department of Statistics, North Carolina State University April 3, 2014 I remember that one fateful day when Coach took

More information

Linear Regression With Special Variables

Linear Regression With Special Variables Linear Regression With Special Variables Junhui Qian December 21, 2014 Outline Standardized Scores Quadratic Terms Interaction Terms Binary Explanatory Variables Binary Choice Models Standardized Scores:

More information

Two-stage Adaptive Randomization for Delayed Response in Clinical Trials

Two-stage Adaptive Randomization for Delayed Response in Clinical Trials Two-stage Adaptive Randomization for Delayed Response in Clinical Trials Guosheng Yin Department of Statistics and Actuarial Science The University of Hong Kong Joint work with J. Xu PSI and RSS Journal

More information

Bivariate Relationships Between Variables

Bivariate Relationships Between Variables Bivariate Relationships Between Variables BUS 735: Business Decision Making and Research 1 Goals Specific goals: Detect relationships between variables. Be able to prescribe appropriate statistical methods

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part 1: Sample Problems for the Elementary Section of Qualifying Exam in Probability and Statistics https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part 2: Sample Problems for the Advanced Section

More information