Generalized Linear Models
Kurt Hornik
Motivation

Assuming normality, the linear model $y = X\beta + \varepsilon$ has
\[
y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),
\]
so that $y_i \sim N(\mu_i, \sigma^2)$ with $E(y_i) = \mu_i = x_i^\top \beta$.

Various generalizations exist, including the general linear model $Y = XB + E$ (with $E$ normal, allowing flexible error covariance structures).

But what if normality is not appropriate (e.g., skewed, bounded, or discrete responses)? Transformations or generalized linear models.
Exponential Dispersion Models

Densities with respect to a reference measure $m$ of the form
\[
f(y \mid \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right)
\]
(alternatively, write $a(\phi)$ instead of $\phi$ in the denominator). For fixed $\phi$, this is an exponential family in $\theta$.
Exponential Dispersion Models

Differentiate $\int f(y \mid \theta, \phi)\, dm(y) = 1$ with respect to $\theta$ (and assume interchanging integration and differentiation is justified):
\[
0 = \frac{\partial}{\partial\theta} \int f(y \mid \theta, \phi)\, dm(y)
  = \int \frac{\partial f(y \mid \theta, \phi)}{\partial\theta}\, dm(y)
  = \int \frac{y - b'(\theta)}{\phi}\, f(y \mid \theta, \phi)\, dm(y)
  = \frac{E_{\theta,\phi}(y) - b'(\theta)}{\phi}
\]
so that
\[
E_{\theta,\phi}(y) = b'(\theta)
\]
(which does not depend on $\phi$!).
Exponential Dispersion Models

Differentiate once more:
\[
0 = \int \left( -\frac{b''(\theta)}{\phi} + \left( \frac{y - b'(\theta)}{\phi} \right)^2 \right) f(y \mid \theta, \phi)\, dm(y)
  = -\frac{b''(\theta)}{\phi} + \frac{\mathrm{var}_{\theta,\phi}(y)}{\phi^2}
\]
so that
\[
\mathrm{var}_{\theta,\phi}(y) = \phi\, b''(\theta)
\]
(which shows that $\phi$ is a dispersion parameter).
Exponential Dispersion Models

We can thus write $E(y) = \mu = b'(\theta)$. If $\mu = b'(\theta)$ defines a one-to-one relation between $\mu$ and $\theta$ (which it does: can be shown using convex analysis), we can write $b''(\theta) = V(\mu)$, formally
\[
V(\mu) = b''((b')^{-1}(\mu)),
\]
where $V$ is the variance function of the family. Thus:
\[
\mathrm{var}(y) = \phi\, b''(\theta) = \phi\, V(\mu).
\]
Example: Bernoulli Family

Take $y$ binary with $P(y = 1) = p$. With $m$ the counting measure on $\{0, 1\}$,
\[
f(y) = p^y (1-p)^{1-y}
     = (1-p) \left( \frac{p}{1-p} \right)^y
     = \exp\left( y \log\frac{p}{1-p} + \log(1-p) \right).
\]
I.e., an exponential dispersion model with $\phi = 1$ (hence, in fact, an exponential family) and
\[
\theta = \log\frac{p}{1-p} = \mathrm{logit}(p)
\]
(the quantile function of the standard logistic distribution).
Example: Bernoulli Family

Inverting $\theta = \mathrm{logit}(p)$ gives
\[
p = \frac{e^\theta}{1 + e^\theta}
\]
(the distribution function of the standard logistic distribution) and hence
\[
1 - p = \frac{1}{1 + e^\theta},
\]
so that $b(\theta) = -\log(1-p) = \log(1 + e^\theta)$. Altogether (note that there is a problem for $p \in \{0, 1\}$):
\[
f(y \mid \theta) = \exp\big( y\theta - \log(1 + e^\theta) \big), \qquad \theta = \mathrm{logit}(p).
\]
Example: Bernoulli Family

Differentiation gives
\[
b'(\theta) = \frac{e^\theta}{1 + e^\theta} = p = \mu
\]
and
\[
b''(\theta) = \frac{e^\theta (1 + e^\theta) - e^\theta e^\theta}{(1 + e^\theta)^2}
            = \frac{e^\theta}{1 + e^\theta} \cdot \frac{1}{1 + e^\theta}
            = p(1-p) = \mu(1-\mu).
\]
Necessary? We know that $E(y) = p$ and $\mathrm{var}(y) = p(1-p)$. Hence,
\[
b'(\theta) = p = \frac{e^\theta}{1 + e^\theta}
\quad \Longrightarrow \quad
b(\theta) = \log(1 + e^\theta).
\]
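As a quick numerical sanity check (an illustration added here, not part of the original slides), finite-difference derivatives of $b(\theta) = \log(1 + e^\theta)$ can be compared with $p$ and $p(1-p)$:

```python
import numpy as np

def b(theta):
    # Cumulant function of the Bernoulli family: b(theta) = log(1 + e^theta)
    return np.log1p(np.exp(theta))

theta = 0.7
eps = 1e-6
p = np.exp(theta) / (1 + np.exp(theta))

# Finite-difference approximations of b'(theta) and b''(theta)
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2

print(b1, p)            # both close to 0.668: b'(theta) = p = mu
print(b2, p * (1 - p))  # both close to 0.222: b''(theta) = p(1 - p) = V(mu)
```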
Generalized Linear Models

For $i = 1, \ldots, n$ we have responses $y_i$ from an exponential dispersion family with the same $b$, and covariates $x_i$, such that for $E(y_i) = \mu_i = b'(\theta_i)$ we have
\[
g(\mu_i) = x_i^\top \beta = \eta_i,
\]
where $g$ is the link function and $\eta_i$ is the linear predictor. Alternatively,
\[
\mu_i = h(x_i^\top \beta) = h(\eta_i),
\]
where $h$ is the response function ($g$ and $h$ are inverses of each other if invertible).

Why useful? General conceptual framework for estimation and inference.
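For instance (an illustrative remark added here, tying this to the Bernoulli example above): taking Bernoulli responses with $g = \mathrm{logit}$ gives
\[
\mathrm{logit}(\mu_i) = \log\frac{\mu_i}{1 - \mu_i} = x_i^\top \beta,
\]
i.e., the logistic regression model.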
Maximum Likelihood Estimation

The log-likelihood is
\[
l = l(\beta) = \sum_{i=1}^n \left( \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i) \right),
\qquad \text{where } g(\mu_i) = g(b'(\theta_i)) = x_i^\top \beta.
\]
Differentiating the latter with respect to $\beta_j$:
\[
x_{ij} = \frac{\partial}{\partial\beta_j} x_i^\top \beta
       = \frac{\partial}{\partial\beta_j} g(b'(\theta_i))
       = g'(b'(\theta_i))\, b''(\theta_i)\, \frac{\partial\theta_i}{\partial\beta_j}
       = g'(\mu_i)\, V(\mu_i)\, \frac{\partial\theta_i}{\partial\beta_j}.
\]
Maximum Likelihood Estimation

Hence,
\[
\frac{\partial\theta_i}{\partial\beta_j} = \frac{x_{ij}}{g'(\mu_i) V(\mu_i)}
\]
and
\[
\frac{\partial l}{\partial\beta_j}
= \sum_{i=1}^n \frac{y_i - b'(\theta_i)}{\phi_i} \cdot \frac{x_{ij}}{g'(\mu_i) V(\mu_i)}
= \sum_{i=1}^n \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)}.
\]
MLE is typically performed by solving the score equations $\partial l / \partial\beta_j = 0$. For Newton-type algorithms, we need the Hessian $H(\beta) = [\partial^2 l / \partial\beta_j \partial\beta_k]$. As $\mu_i = b'(\theta_i)$,
\[
\frac{\partial\mu_i}{\partial\beta_j} = b''(\theta_i) \frac{\partial\theta_i}{\partial\beta_j}
= V(\mu_i)\, \frac{x_{ij}}{g'(\mu_i) V(\mu_i)} = \frac{x_{ij}}{g'(\mu_i)}.
\]
Maximum Likelihood Estimation

Hence:
\begin{align*}
\frac{\partial^2 l}{\partial\beta_j \partial\beta_k}
&= \sum_{i=1}^n \frac{x_{ij}}{\phi_i} \frac{\partial}{\partial\beta_k} \frac{y_i - \mu_i}{V(\mu_i) g'(\mu_i)} \\
&= \sum_{i=1}^n \frac{x_{ij}}{\phi_i} \left(
     -\frac{1}{V(\mu_i) g'(\mu_i)} \frac{\partial\mu_i}{\partial\beta_k}
     - (y_i - \mu_i)\, \frac{\partial (V(\mu_i) g'(\mu_i)) / \partial\beta_k}{(V(\mu_i) g'(\mu_i))^2}
   \right) \\
&= -\sum_{i=1}^n \frac{x_{ij} x_{ik}}{\phi_i V(\mu_i) g'(\mu_i)^2}
   - \sum_{i=1}^n \frac{(y_i - \mu_i)\, x_{ij} x_{ik}}{\phi_i V(\mu_i)^2 g'(\mu_i)^3}
     \left( V'(\mu_i) g'(\mu_i) + V(\mu_i) g''(\mu_i) \right).
\end{align*}
Maximum Likelihood Estimation

The second term looks complicated, but has expectation zero. Hence, drop it and use only the first term for the Newton-type iteration: the Fisher scoring algorithm. Equivalently, replace the observed information matrix (the negative Hessian of the log-likelihood) by its expectation (the Fisher information matrix).

Next problem: what about $\phi$? Assume that $\phi_i = \phi / w_i$ with known case weights $w_i$.
Maximum Likelihood Estimation

Then the Fisher information matrix is
\[
\frac{1}{\phi} \sum_{i=1}^n \frac{w_i}{V(\mu_i) g'(\mu_i)^2}\, x_{ij} x_{ik}
= \frac{1}{\phi}\, X^\top W(\beta) X,
\]
where $X$ is the usual regressor matrix (with $x_i^\top$ as row $i$) and
\[
W(\beta) = \mathrm{diag}\left( \frac{w_i}{V(\mu_i) g'(\mu_i)^2} \right),
\qquad g(\mu_i) = x_i^\top \beta.
\]
Similarly, the score function is
\[
\frac{1}{\phi} \sum_{i=1}^n \frac{w_i}{V(\mu_i) g'(\mu_i)^2}\, g'(\mu_i)(y_i - \mu_i)\, x_{ij}
= \frac{1}{\phi}\, X^\top W(\beta)\, r(\beta),
\]
where $r(\beta)$ has elements $g'(\mu_i)(y_i - \mu_i)$: the so-called working residuals.
Maximum Likelihood Estimation

Remember: Newton updates for maximizing $l(\beta)$ are
\[
\beta_{\mathrm{new}} \leftarrow \beta - \big( H(l)(\beta) \big)^{-1} \nabla l(\beta).
\]
Thus, the Fisher scoring update (with the above approximation for $H$) uses
\[
\beta_{\mathrm{new}} \leftarrow \beta + (X^\top W(\beta) X)^{-1} X^\top W(\beta)\, r(\beta)
= (X^\top W(\beta) X)^{-1} X^\top W(\beta)\, (X\beta + r(\beta))
= (X^\top W(\beta) X)^{-1} X^\top W(\beta)\, z(\beta),
\]
where the working response $z(\beta)$ has elements
\[
x_i^\top \beta + g'(\mu_i)(y_i - \mu_i), \qquad g(\mu_i) = x_i^\top \beta.
\]
I.e., the update is computed by a weighted least squares regression of $z(\beta)$ on $X$ (weighting rows by the square roots of the diagonal of $W(\beta)$): the Fisher scoring algorithm for obtaining the MLEs is an iterative weighted least squares (IWLS) algorithm.

Note: the common dispersion parameter $\phi$ is not used!
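A minimal sketch of this IWLS iteration, assuming the Bernoulli family with the logit link and simulated data (the function name and the data are hypothetical, added purely for illustration):

```python
import numpy as np

def iwls_logistic(X, y, w=None, n_iter=25, tol=1e-10):
    """IWLS / Fisher scoring for a Bernoulli GLM with logit link.

    For the logit link, g'(mu) = 1 / (mu (1 - mu)) and V(mu) = mu (1 - mu), so
    the IWLS weights are w_i mu_i (1 - mu_i) and the working response is
    z_i = eta_i + (y_i - mu_i) / (mu_i (1 - mu_i)).
    """
    n, p = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
        W = w * mu * (1 - mu)                     # diagonal of W(beta)
        z = eta + (y - mu) / (mu * (1 - mu))      # working response
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Hypothetical simulated data, just to exercise the sketch
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0])))))
beta_hat = iwls_logistic(X, y)
print(beta_hat)   # should be close to (-0.5, 1.0)
```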
Canonical Links

The canonical link is given by $g = (b')^{-1}$, so that
\[
\eta_i = g(\mu_i) = g(b'(\theta_i)) = \theta_i,
\qquad
g'(\mu) = \frac{d}{d\mu} (b')^{-1}(\mu) = \frac{1}{b''((b')^{-1}(\mu))} = \frac{1}{V(\mu)},
\]
so that $g'(\mu) V(\mu) \equiv 1$, and hence
\[
\frac{\partial l}{\partial\beta_j} = \frac{1}{\phi} \sum_{i=1}^n w_i (y_i - \mu_i)\, x_{ij},
\qquad
\frac{\partial^2 l}{\partial\beta_j \partial\beta_k} = -\frac{1}{\phi} \sum_{i=1}^n w_i V(\mu_i)\, x_{ij} x_{ik}.
\]
Thus: observed and expected information coincide, and the IWLS Fisher scoring algorithm is the same as Newton's algorithm.
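For the Bernoulli example above this can be verified directly (an added illustration): since $b'(\theta) = e^\theta / (1 + e^\theta)$,
\[
(b')^{-1}(\mu) = \log\frac{\mu}{1-\mu} = \mathrm{logit}(\mu),
\]
so the logit is the canonical link of the Bernoulli family, with $g'(\mu) = 1/(\mu(1-\mu)) = 1/V(\mu)$ as required.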
Inference

Under suitable conditions, the MLE $\hat\beta$ is asymptotically $N(\beta, I(\beta)^{-1})$ with expected Fisher information matrix
\[
I(\beta) = \frac{1}{\phi}\, X^\top W(\beta) X.
\]
Thus, standard errors can be computed as the square roots of the diagonal elements of
\[
\widehat{\mathrm{cov}}(\hat\beta) = \phi\, (X^\top W(\hat\beta) X)^{-1},
\]
where $X^\top W(\hat\beta) X$ is a by-product of the final IWLS iteration.
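Continuing the hypothetical logistic sketch above (where $\phi = 1$ for the Bernoulli family), the standard errors fall out of the final IWLS weights:

```python
# Standard errors from the final IWLS weight matrix; phi = 1 for Bernoulli
mu_hat = 1 / (1 + np.exp(-(X @ beta_hat)))
W_hat = mu_hat * (1 - mu_hat)
cov_beta = np.linalg.inv(X.T @ (W_hat[:, None] * X))   # phi * (X' W X)^{-1}
se = np.sqrt(np.diag(cov_beta))
print(se)
```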
Inference

This needs an estimate of $\phi$ (unless it is known). Estimation by ML is practically difficult; hence, $\phi$ is usually estimated by the method of moments. Remember that $\mathrm{var}(y_i) = \phi_i V(\mu_i) = \phi V(\mu_i) / w_i$. Hence, if $\beta$ were known, an unbiased estimate of $\phi$ would be
\[
\frac{1}{n} \sum_{i=1}^n \frac{w_i (y_i - \mu_i)^2}{V(\mu_i)}.
\]
Taking into account that $\beta$ is estimated, the estimate is
\[
\hat\phi = \frac{1}{n - p} \sum_{i=1}^n \frac{w_i (y_i - \hat\mu_i)^2}{V(\hat\mu_i)}
\]
(where $p$ is the number of $\beta$ parameters).
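A one-line sketch of this moment estimator (a hypothetical helper, assuming fitted means, the variance function, and case weights are available from earlier snippets):

```python
def dispersion_estimate(y, mu_hat, V, w, p):
    # Method-of-moments (Pearson-based) estimate of the dispersion phi
    return np.sum(w * (y - mu_hat) ** 2 / V(mu_hat)) / (len(y) - p)
```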
Deviance

A quality-of-fit statistic for model fitting achieved by ML, generalizing the idea of using the sum of squares of residuals in ordinary least squares:
\[
D = 2\phi\, (l_{\mathrm{sat}} - l_{\mathrm{mod}})
\]
(assuming a common $\phi$, perhaps after taking out weights), where the saturated model uses separate parameters for each observation so that the data is fitted exactly. For GLMs: $y_i = \mu_i = b'(\theta_i)$ achieves zero scores. The contribution of observation $i$ to $l_{\mathrm{sat}} - l_{\mathrm{mod}}$ is
\[
\frac{y_i \theta_i - b(\theta_i)}{\phi} - \frac{y_i \hat\theta_i - b(\hat\theta_i)}{\phi}
= \frac{1}{\phi} \Big[ y_i \theta - b(\theta) \Big]_{\hat\theta_i}^{\theta_i},
\]
where $\theta_i$ solves $b'(\theta_i) = y_i$ and $\hat\theta_i$ is obtained from the fitted model (i.e., $g(b'(\hat\theta_i)) = x_i^\top \hat\beta$).
Deviance

We can write
\[
\Big[ y\theta - b(\theta) \Big]_{\hat\theta}^{\theta}
= \int_{\hat\theta}^{\theta} \frac{d}{dt}\big( y t - b(t) \big)\, dt
= \int_{\hat\theta}^{\theta} \big( y - b'(t) \big)\, dt.
\]
Substituting $\mu = b'(t)$: $d\mu = b''(t)\, dt$, i.e., $dt = V(\mu)^{-1}\, d\mu$, so that
\[
\int_{\hat\theta}^{\theta} \big( y - b'(t) \big)\, dt = \int_{\hat\mu}^{y} \frac{y - \mu}{V(\mu)}\, d\mu,
\]
and the deviance contribution of observation $i$ is
\[
2\phi\, \frac{\Big[ y_i \theta - b(\theta) \Big]_{\hat\theta_i}^{\theta_i}}{\phi}
= 2 \int_{\hat\mu_i}^{y_i} \frac{y_i - \mu}{V(\mu)}\, d\mu.
\]
This can be taken to define the deviance and to introduce quasi-likelihood models.
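As an added worked illustration for the Bernoulli family, where $V(\mu) = \mu(1-\mu)$ and $y \in \{0, 1\}$, the integral evaluates to the familiar binomial deviance contribution:
\[
2 \int_{\hat\mu}^{y} \frac{y - \mu}{\mu(1-\mu)}\, d\mu
= 2 \Big[ y \log\frac{\mu}{1-\mu} + \log(1-\mu) \Big]_{\hat\mu}^{y}
= 2 \left( y \log\frac{y}{\hat\mu} + (1-y) \log\frac{1-y}{1-\hat\mu} \right),
\]
with the convention $0 \log 0 = 0$.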
Residuals

Several kinds of residuals can be defined for GLMs:

- response: $y_i - \hat\mu_i$
- working: from the working response in IWLS, i.e., $g'(\hat\mu_i)(y_i - \hat\mu_i)$
- Pearson: $r_i^P = \dfrac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}$, so that $\sum_i (r_i^P)^2$ equals the generalized Pearson statistic
- deviance: $r_i^D = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{d_i}$, with $d_i$ the deviance contribution of observation $i$, so that $\sum_i (r_i^D)^2$ equals the deviance (see above)

(All definitions are equivalent for the Gaussian family.)
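Continuing the hypothetical logistic sketch, the residual types can be computed as follows (assuming the objects from the earlier snippets):

```python
# Residual types for the fitted Bernoulli/logit model
eta_hat = X @ beta_hat
mu_hat = 1 / (1 + np.exp(-eta_hat))
V_hat = mu_hat * (1 - mu_hat)

r_response = y - mu_hat
r_working = (y - mu_hat) / V_hat                  # g'(mu) = 1 / (mu (1 - mu)) for the logit
r_pearson = (y - mu_hat) / np.sqrt(V_hat)
d = -2 * (y * np.log(mu_hat) + (1 - y) * np.log(1 - mu_hat))   # deviance contributions
r_deviance = np.sign(y - mu_hat) * np.sqrt(d)
print(np.sum(r_pearson ** 2), np.sum(d))          # generalized Pearson statistic, deviance
```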
Generalized Linear Mixed Models

Augment the linear predictor by (unknown) random effects $b_i$:
\[
\eta_i = x_i^\top \beta + z_i^\top b_i,
\]
where the $b_i$ come from a suitable family of distributions and the $z_i$ (as well as the $x_i$, of course) are known covariates. Typically, $b_i \sim N(0, G(\vartheta))$.

Conditionally on $b_i$, $y_i$ is taken to follow an exponential dispersion model with
\[
g(E(y_i \mid b_i)) = \eta_i = x_i^\top \beta + z_i^\top b_i.
\]
The marginal likelihood function of the observed $y$ is obtained by integrating the joint likelihood of the $y_i$ and $b_i$ with respect to the marginal distribution of the $b_i$. If the $b_i$ are independent across observation units,
\[
L(\beta, \phi, \vartheta) = \prod_{i=1}^n \int f(y_i \mid \beta, \phi, \vartheta, b_i)\, f(b_i \mid \vartheta)\, db_i.
\]
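The integral generally has no closed form. Below is a minimal sketch of evaluating this marginal log-likelihood for a random-intercept logistic GLMM by Gauss-Hermite quadrature, under assumptions added here (scalar random intercept $b \sim N(0, \sigma^2)$ per group, grouped Bernoulli responses, hypothetical function and argument names):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def marginal_loglik(beta, sigma, X, y, groups, n_quad=20):
    """Marginal log-likelihood of a random-intercept Bernoulli/logit GLMM.

    `groups` is a list of index arrays, one per observation unit; the unit's
    random intercept b ~ N(0, sigma^2) is integrated out by Gauss-Hermite
    quadrature (the nodes/weights are for integrals against exp(-t^2)).
    """
    nodes, weights = hermgauss(n_quad)
    ll = 0.0
    for idx in groups:
        eta0 = X[idx] @ beta
        vals = np.empty(n_quad)
        for k, t in enumerate(nodes):
            mu = 1 / (1 + np.exp(-(eta0 + np.sqrt(2) * sigma * t)))
            # conditional likelihood of the unit's responses given b = sqrt(2) * sigma * t
            vals[k] = np.prod(mu ** y[idx] * (1 - mu) ** (1 - y[idx]))
        ll += np.log(np.sum(weights * vals) / np.sqrt(np.pi))
    return ll
```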