Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes 1 JunXuJ.ScottLong Indiana University 2005-02-03 1 General Formula The delta method is a general approach for computing confidence intervals for functions of maximum likelihood estimates. The delta method takes a function that is too complex for analytically computing the variance, creates a linear approximation of that function, then computes the variance of the simpler linear function that can be used for large sample inference. We begin with general result for maximum likelihood theory. Under stard regularity conditions, if β b is a vector of ML estimates, then ³ h ³ n bβ β d N 0,V ar bβ i. (1) Let G (β) be some function, such as predicted probabilities from a logit or ordinal logit model. The Taylor series expansion of G( b β) is G( β) b G(β)+( β b β) 0 G 0 (β)+( β b β) 0 G 00 (β )( β b β)/2 (2) G(β)+( β b β) 0 G 0 (β)+o( β b β) 0, where G 0 (β) G 00 (β) are matrices of first second partial derivatives with respect to β, β is some value between β b β. Then, h n G( β) b i G(β) n( β b β) 0 G 0 (β)+o p (1), (3) which leads to G( β) b ³ N G(β), G(β) Var( b β) G(β) (Greene 2000; Agresti 2002). To 0 G(β x) estimate the variance, we evaluate the partials at the ML estimates,,which β 0 β leads to ³ Var G( β) b G(b β) b 0 Var( β) b G(b β) b. (4) Forexample,considerthelogitmodelwith G(β) Pr(y 1 x) Λ (x 0 β). (5) 1 For information on related programs future updates to this program, please check www.indiana.edu/~jslsoc/spost.htm. 1
To compute the confidence interval for Pr (y 1 x), we need the gradient vector Since Λ is a cdf, Λ(x0 β) G(β) h Λ(x0 β) 0 Λ(x 0 β) 1 Λ(x0 β) k x 0 β x 0 β k λ (x 0 β) x k,then Λ(x 0 β) K i 0. (6) G(β) λ (x 0 β) λ (x 0 β) x 1 λ (x 0 0 β) x K, (7) where x 0 1. To compute the confidence interval for a change in the probability as the independent variables change from x a to x b,weusethefunction G (β) Λ(β x a ) Λ (β x b ), (8) where G(β) [Λ(β x a) Λ(β x b )] Λ(β x a) Λ(β x b) Substituting this result into equation 4, Λ(β xa ) Var(G(β)) 0 Var( β) b Λ(β x a) Λ(β xa ) 0 Var( β) b Λ(β x b) Λ(β xb ) 0 Var( β) b Λ(β x a) Λ(β xb ) + 0 Var( β) b Λ(β x b) We now apply these formula to the models considered in prvalue2. 2 Binary Models In binary models, G(β) Pr(y 1 x) F (x 0 β) where F is the cdf for the logistic, normal, or cloglog function. The gradient is F(x 0 β) F(x0 β) x 0 β f(x 0 β)x k x 0 k, (11) β k where f is the pdf corresponding to F.Forthevectorx it follows that F(x 0 β) f(x 0 β)x. (12) From equation 4, Var [Pr(y 1 x)] f(x 0 β)x 0 Var( β)xf(x b 0 β)f(x 0 β) 2 x 0 Var( β)x b. The variances of Pr(y 0 x) Pr(y 0 x) are the equal since [1 F (x 0 β)] f(x 0 β)x (13) Var [Pr(y 0 x)] [ f(x 0 β)] 2 x 0 Var( β)x b. (14) 2. (9) (10).
3 Ordered Logit Probit Assume that there are m 1,J outcome categories, where Pr (y m x) F (τ m x 0 β) F (τ m 1 x 0 β) for j 1,J. (15) Since we assume that τ 0 τ J, F (τ 0 x 0 β)0 F (τ J x 0 β)1. To compute the gradient, It follows that F (τ m x 0 β) F (τ m x 0 β) k (τ m x 0 β) F (τ m x 0 β) F (τ m x 0 β) τ j (τ m x 0 β) (τ m x 0 β) k f (τ m x 0 β)( x k ) (16) (τ m x 0 β) τ j. (17) F (τ m x 0 β) τ j f (τ m x 0 β) if j m (18) F (τ m x 0 β) 0if j 6 m. τ j (19) Using these results with equation 15, Pr (y i m x i ) k [f (τ m x 0 β)( x k )] [f (τ m 1 x 0 β)( x k )] (20) x k f (τ m x 0 β) [f (τ m 1 x 0 β)] then Pr (y i m x i ) F (τ m x 0 β) F (τ m 1 x 0 β) (21) τ j τ j τ j f (τ m x 0 β) if j m f (τ m 1 x 0 β) if j m 1 0 otherwise. For example, with three categories: Pr (y 1 x) F (τ 1 x 0 β) 0 (22) Pr (y 2 x) F (τ 2 x 0 β) F (τ 1 x 0 β) (23) Pr (y 3 x) 1 F (τ 2 x 0 β), (24) Pr (y i 1 x i ) k x k [f (τ 1 x 0 β)] (25) Pr (y i 2 x i ) k x k [f (τ 2 x 0 β) f (τ 1 x 0 β)] (26) Pr (y i 3 x i ) k x k [ f (τ 2 x 0 β)]. (27) 3
With respect to τ, Pr (y i 1 x i ) τ 1 f (τ 1 x 0 β) (28) Pr (y i 1 x i ) τ 2 0 (29) Pr (y i 2 x i ) τ 1 f (τ 1 x 0 β) (30) Pr (y i 2 x i ) τ 2 f (τ 2 x 0 β) (31) Pr (y i 3 x i ) τ 1 0 (32) Pr (y i 3 x i ) τ 2 0. (33) To implement these procedures in Stata, we create the augmented matrices: β β 0 τ 1 τ J 1 0 (34) x 1 x 0 1 0 0 0 x 2 x 0 0 1 0 0 (35). x J 1 x 0 0 0 1 0, such that x 0 j β τ j xβ. (36) We then create the gradients described above. 4 Generalized Ordered Logit The generalized ordered logit model is identical to the ordinal logit model except that the coefficients associated with x differ for each outcome. Since there is an intercept for each outcome, the τ s are fixed to zero β J 0 for identification. Then, Pr (y i 1 x i ) F ( x 0 β m ) (37) Pr (y i m x i ) F ( x 0 β m ) F x 0 β m 1 for m 2,J 1 (38) Pr (y i J x i ) F x 0 β m 1. (39) The gradient with respect to the β s is F ( x 0 β m ) m,k f ( x 0 β m )( x k ), (40) 4
while no gradient for thresholds is needed. Then, Pr (y i m x i ) x k f ( x 0 β m ) f x 0 β m 1. (41) m,k 5 Multinomial Logit Assuming outcomes 1 through J, Pr(y m x) exp (xβ m) JP exp, (42) xβ j where without loss of generality we assume that β 1 0 to identify the model ( accordingly, the derivatives below do not apply to the partial with respect to β 1 ). To simplify notation, let P exp xβ j. The derivative of the probability of m with respect to βn is Using the quotient rule, n n j1 exp (xβ m) 1 n. (43) exp (xβ m) exp (xβ m ) 2. (44) n n Examining each partial in turn. exp (xβ m ) exp (xβ m) m (45) n m n exp(xβ m ) x if m n 0 if m 6 n P J exp xβ j j1 n JP exp xβ j j1 (46) n exp(xβ n ) x, where the last equality follows since the partial of exp xβ j with respect to βn is 0 unless j n. Combining these results. If m n, exp (xβ m ) x exp (xβ m ) 2 x 2 (47) m exp (xβ m ) exp (xβ m ) 2 2 x exp (xβm ) exp (xβ m) exp (xβ m ) x [Pr(y m) Pr (y m)pr(y m)] x Pr(y m)[1 Pr (y m)] x. 5
For m 6 n, For example, for two x s m 1: [0 exp (xβ m )exp(xβ n ) x] 2 (48) n exp (xβ m) exp (xβ n ) x Pr(y m)pr(y n) x. p m (1 p m ) x 1 p m (1 p m ) x 2 p m (1 p m ) 0 (49) m 0 p m p n x 1 p m p n x 2 p m p n. (50) n6m 6 Poisson Negative Binomial Regression In the Poisson regression model, so that Using matrices, μ i exp(x 0 iβ), (51) μ exp (x0 β) (52) k k exp(x0 β) x 0 β x 0 β k exp(x 0 β)x k μx k μ exp (x0 β) exp(x0 β) x 0 β x 0 β μx (53) The probability of a given count is so we can compute the gradient as: Pr (y x) exp ( μ) μy y!, (54) exp ( μ) μ y /y! 1 exp ( μ) μ y μ. (55) k y! μ k 6
Since the last term was computed above, we only need to derive: exp ( μ) μ y exp( μ) μy exp ( μ) + μy μ μ μ exp( μ) yμ y 1 μ y exp ( μ) (56) which leads to: Using matrices, Pr (y x) 1 k y! μ exp ( μ) yμ y 1 μ y exp ( μ) x k (57) exp ( μ) yμy μ y+1 exp ( μ) x k y! yμy μ y+1 exp (μ) y! x k. Pr (y x) The negative binomial model is specified as exp ( μ) yμy μ y+1 exp ( μ) x (58) y! yμy μ y+1 exp (μ) y! x. μ exp(x 0 β + ε) (59) exp(x 0 β)exp(ε), where ε has a gamma distribution with variance α. The counts have a negative binomial distribution Pr (y i x i ) Γ(y µ ν µ yi i + ν) ν μi, (60) y i! Γ(ν) ν + μ i ν + μ i where ν α 1. The derivatives of the log-likelihood are given in Stata Reference, Version 8, page 10. To simplify notation, we define τ lnα, m 1/α, p 1/ (1 + αμ), μ exp(xβ) (61) with ψ (z) beingthedigammafunctionevaluatedatz, τ Then by the chain rule, p (y μ) (62) α (μ y) m ln (1 + αμ)+ψ (y + m) ψ (m). 1+αμ (63) Pr (y x) Pr (y x) Pr(y x) 1 Pr (y x), (64) 7
so that Similarly for τ, Pr (y x) Pr (y x) τ τ Pr (y x). (65) Pr (y x). (66) 7 References Agresti, Alan. 2002. Categorical Data Analysis. 2nd Edition. New York: Wiley. Greene, William H. 2000. Econometric Analysis, 4th Ed. New York: Prentice Hall. File prvalue2-delta-050203.tex 8