Introduction to General and Generalized Linear Models

Size: px

Start display at page:

Download "Introduction to General and Generalized Linear Models"

Easter Lamb
5 years ago
Views:

1 Introduction to General and Generalized Linear Models Mixed effects models - Part IV Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby January 2011 Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

2 This lecture General mixed effects models Laplace approximation Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

3 General mixed effects models General mixed effects models Let us now look at methods to deal with nonlinear and non-normal mixed effects models. In general it will be impossible to obtain closed form solutions and hence numerical methods must be used. Estimation and inference will be based on likelihood principles. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

4 General mixed effects models General mixed effects models The general mixed effects model can be represented by its likelihood function: L M (θ;y) = L(θ;u,y)du R q where y is the observed random variables, θ is the model parameters to be estimated, and U is the q unobserved random variables or effects. The likelihood function L is the joint likelihood of both the observed and the unobserved random variables. The likelihood function for estimating θ is the marginal likelihood L M obtained by integrating out the unobserved random variables. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

5 General mixed effects models General mixed effects models The integral shown on the previous slide is generally difficult to solve if the number of unobserved random variables is more than a few, i.e. for large values of q. A large value of q significantly increases the computational demands due to the product rule which states that if an integral is sampled in m points per dimension to evaluate it, the total number of samples needed is m q, which rapidly becomes infeasible even for a limited number of random effects. The likelihood function gives a very broad definition of mixed models: the only requirement for using mixed modeling is to define a joint likelihood function for the model of interest. In this way mixed modeling can be applied to any likelihood based statistical modeling. Examples of applications are linear mixed models (LMM) and nonlinear mixed models (NLMM), generalized linear mixed models, but also models based on Markov chains, ODEs or SDEs. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

6 Hierarchical models General mixed effects models As for the Gaussian linear mixed models it is useful to formulate the model as a hierarchical model containing a first stage model f Y u (y;u,β) which is a model for the data given the random effects, and a second stage model f U (u;ψ) which is a model for the random effects. The total set of parameters is θ = (β,ψ). Hence the joint likelihood is given as L(β,Ψ;u,y) = f Y u (y;u,β)f U (u;ψ) Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

7 Hierarchical models General mixed effects models To obtain the likelihood for the model parameters (β, Ψ) the unobserved random effects are again integrated out. The likelihood function for estimating (β, Ψ) is as before the marginal likelihood L M (β,ψ;y) = L(β,Ψ;u,y)du R q where q is the number of random effects, and β and Ψ are the parameters to be estimated. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

8 General mixed effects models Grouping structures and nested effects For nonlinear mixed models where no closed form solution to the likelihood function is available it is necessary to invoke some form of numerical approximation to be able to estimate the model parameters. The complexity of this problem is mainly dependent on the dimensionality of the integration problem which in turn is dependent on the dimension of U and in particular the grouping structure in the data for the random effects. These structures include a single grouping, nested grouping, partially crossed and crossed random effects. For problems with only one level of grouping the marginal likelihood can be simplified as L M (β,ψ;y) = M i=1 R q i f Y ui (y;u i,β)f Ui (u i ;Ψ)du i where q i is the number of random effects for group i and M is the number of groups. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

9 General mixed effects models Grouping structures and nested effects Instead of having to solve an integral of dimension q it is only necessary to solve M smaller integrals of dimension q i. In typical applications there is often just one or only a few random effects for each group, and this thus greatly reduces the complexity of the integration problem. If the data has a nested grouping structure a reduction of the dimensionality of the integral similar to that shown on the previous slide can be performed. An example of a nested grouping structure is data collected from a number of schools, a number of classes within each school and a number of students from each class. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

10 General mixed effects models Grouping structures and nested effects If the nonlinear mixed model is extended to include any structure of random effects such as crossed or partially crossed random effects it is required to evaluate the full multi-dimensional integral Estimation in these models can efficiently be handled using the multivariate Laplace approximation, which only samples the integrand in one point common to all dimensions. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

11 The Laplace approximation For a given set of model parameters θ the joint log-likelihood l(θ,u,y) = log(l(θ,u,y)) is approximated by a second order Taylor approximation around the optimum ũ = û θ of the log-likelihood function w.r.t. the unobserved random variables u, i.e. l(θ,u,y) l(θ,ũ,y) 1 2 (u ũ)t H(ũ)(u ũ) where the first-order term of the Taylor expansion disappears since the expansion is done around the optimum ũ and H(ũ) = l uu(θ,u,y) u=ũ is the negative Hessian of the joint log-likelihood evaluated at ũ which will simply be referred to as the Hessian. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

12 The Laplace approximation Using the approximation in the Laplace approximation of the marginal log-likelihood becomes l M,LA (θ,y) = log exp ( l(θ,ũ,y) 1 R q 2 (u ũ)t H(ũ)(u ũ) ) du = l(θ,ũ,y) 1 2 log H(ũ) 2π Yhe integral is eliminated by transforming it to an integration of a multivariate Gaussian density with mean ũ and covariance H 1 (ũ). Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

13 The Laplace approximation The Laplace likelihood only approximates the marginal likelihood for mixed models with nonlinear random effects and thus maximizing the Laplace likelihood will result in some amount of error in the resulting estimates. It can be shown that joint log-likelihood converges to a quadratic function of the random effect for increasing number of observations per random effect and thus that the Laplace approximation is asymptotically exact. In practical applications the accuracy of the Laplace approximation may still be of concern, but often improved numerical approximation of the marginal likelihood (such as Gaussian quadrature) may easily be computationally infeasible to perform. Another option for improving the accuracy is Importance sampling. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

14 Two-level hierarchical model For the two-level or hierarchical model it is readily seen that the joint log-likelihood is l(θ,u,y) = l(β,ψ,u,y) = logf Y u (y;u,β)+logf U (u;ψ) which implies that the Laplace approximation becomes l M,LA (θ,y) = logf Y u (y;ũ,β)+logf U (ũ;ψ) 1 2 log H(ũ) 2π It is clear that as long as a likelihood function of the random effects and model parameters can be defined it is possible to use the Laplace likelihood for estimation in a mixed model framework. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

15 Gaussian second stage model Let us assume that the second stage model is zero mean Gaussian, i.e. u N(0,Ψ) which means that the random effect distribution is completely described by its covariance matrix Ψ. In this case the Laplace likelihood in becomes l M,LA (θ,y) = logf Y u (y;ũ,β) 1 2 log Ψ 1 2ũT Ψ 1 ũ 1 2 log H(ũ) where it is seen that we still have no assumptions on the first stage model f Y u (y;u,β). Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

16 Gaussian second stage model If we furthermore assume that the first stage model is Gaussian Y U = u N(µ(β,u),Σ) then the Laplace likelihood can be further specified. For the hierarchical Gaussian model it is rather easy to obtain a numerical approximation of the Hessian H at the optimum, ũ H(ũ) µ u Σ 1 µ T u +Ψ 1 where µ u is the partial derivative with respect to u. The approximation in is called Gauss-Newton approximation In some contexts estimation using this approximation is also called the First Order Conditional Estimation (FOCE) method. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

17 Automatic differentiation A simpel and efficient way to use the Laplace transformation technique outlined is via the open source software package AD Model Builder, which takes advantages of automatic differentiation. Any calculation done via a computer program can be broken down to a long chain of simple operations like +,,, /, exp, log, sin, cos, tan,, and so on. It is simple to write down the analytical derivative of each of these operations by themselves. If our log-likelihood function l consisted of only a few of these simple operations, then it would be tractable to use the chain rule (f g) (x) = f (g(x))g (x) to find the analytical gradient l of the log-likelihood function l. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

18 Automatic differentiation Automatic differentiation is a technique where the chain rule is used by the computer program itself. When the program evaluates the log-likelihood it keeps track of all the operations used along the way, and then runs the program backwards (reverse mode automatic differentiation) and uses the chain rule to update the derivatives one simple operation at a time. Automatic differentiation is accurate, and the computational cost of evaluating the gradient is surprisingly low. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

19 Automatic differentiation Theorem The computational cost of evaluating the gradient of the log-likelihood l with reverse mode automatic differentiation is less than four times the computational cost of evaluating the log-likelihood function l itself. This holds no matter how many parameters the model contain. It is surprising that computational cost does not depend on how many parameters the model contain. There is however a practical concern. The computational cost mentioned above is measured in the number of operations, but reverse mode automatic differentiation requires all the intermediate variables in the calculation of the negative log-likelihood to be stored in the computer s memory, so if the calculation is lengthy, for instance consisting of a long iterative procedure, then the memory requirements can be enormous. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

20 Automatic differentiation combined with the Laplace approximation Finding the gradient of the Laplace approximation of the marginal log-likelihood is challenging, because the approximation itself includes the result of a function minimization, and not just a straightforward sequence of simple operations. It is however possible, but requires up to third order derivatives to be computed internally by clever successive application of automatic differentiation. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

21 Importance sampling Importance sampling is a re-weighting technique for approximating integrals w.r.t. a density f by simulation in cases where it is not feasible to simulate from the distribution with density f. Instead it uses samples from a different distribution with density g, where the support of g includes the support of f. For general mixed effects models it is possible to simulate from the distribution with density proportional to the second order Taylor approximation L(θ,û θ,y ) = exp { l(θ,û θ,y ) 1 2 (u û θ) T ( l uu (θ,u,y) u=û θ )(u û θ ) } as, apart from a normalization constant, it is the density φûθ,ˆv θ (u) of a multivariate normal with mean û θ and covariance ˆV θ = H 1 (û θ ) = ( l uu (θ,u,y ) u=û θ ) 1. Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

22 Importance sampling Laplace approximation The integral to be approximated can be rewritten as: L(θ,u,Y ) L M (θ,y ) = L(θ,u,Y )du = φûθ,ˆv θ (u) φ û θ,ˆv θ (u)du. So if u (i), i = 1,...,N is simulated from the multivariate normal distribution with mean û θ and covariance ˆV θ, then the integral can be approximated by the mean of the importance weights L M (θ,y ) = 1 N L(θ,u (i),y ) φûθ,ˆv θ (u (i) ) Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

23 AD Model Builder AD Model Builder is a programming language that builds on C++. It includes helper functions for reading in data, defining model parameters, and implementing and optimizing the negative log-likelihood function. The central feature is automatic differentiation (AD), which is implemented in such a way that the user rarely has to think about it at all. AD Model Builder can be used for fixed effects models, but in addition it includes Laplace approximation and importance sampling for dealing with general mixed effects models. AD Model Builder is developed by Dr. Dave Fournier and was a commercial product for many years. Recently AD Model Builder has been placed in the public domain (see Henrik Madsen Poul Thyregod (IMM-DTU) Chapman & Hall January / 23

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Mixed effects models - Part II Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby