Background. GLM with clustered data. The problem. Solutions. A fixed effects approach


GLM with clustered data
A fixed effects approach

Göran Broström
Department of Statistics, Umeå University, SE Umeå, Sweden

Background

Poisson or Binomial data with the following properties:

A large data set, partitioned into many relatively small groups, and where members within groups have something in common.

The problem

The number of parameters tends to increase with sample size. This fact causes the standard assumptions underlying asymptotic results to be violated.

Solutions

There are (at least) two possible solutions to the problem:
1. a random intercepts model, and
2. a fixed effects model, with asymptotics replaced by simulation.

Packages in R

The package Matrix has lmer, the MASS package has glmmPQL, Jim Lindsey's glmm in his repeated package, Myles and Clayton's GLMMGibbs for fitting mixed models by Gibbs sampling. Adding to that glmmML and glmmboot in the package glmmML.

Data structure

$n$ clusters of sizes $n_i$, $i = 1, \ldots, n$. For each cluster $i$, $i = 1, \ldots, n$, observe responses $(y_{i1}, \ldots, y_{in_i})$ and vectors of explanatory variables $(x_{i1}, \ldots, x_{in_i})$, where the $x_{ij}$ are $p$-dimensional vectors with the first element identically equal to unity, corresponding to the mean value of the random intercepts.

The random parts $u_i$ of the intercepts are normal with mean zero and variance $\sigma^2$, and it is assumed that $u_1, \ldots, u_n$ are independent.

The conditional distribution

Given the random intercepts $\beta_1 + u_i$, $i = 1, \ldots, n$:

$$\Pr(Y_{ij} = y_{ij} \mid u_i; x) = P(\beta x_{ij} + u_i, y_{ij}), \quad y_{ij} = 0, 1, \ldots; \; j = 1, \ldots, n_i, \; i = 1, \ldots, n.$$

Bernoulli distribution, logit link:

$$P(x, y) = \frac{e^{xy}}{1 + e^{x}}, \quad y = 0, 1; \; -\infty < x < \infty.$$

Bernoulli distribution, cloglog link:

$$P(x, y) = \left(1 - \exp(-e^{x})\right)^{y} \exp\left(-(1 - y) e^{x}\right), \quad y = 0, 1; \; -\infty < x < \infty.$$

Poisson distribution with log link:

$$P(x, y) = \frac{e^{xy}}{y!} \, e^{-e^{x}}, \quad y = 0, 1, 2, \ldots; \; -\infty < x < \infty.$$

Likelihood function

In the fixed effects model (and in the conditional random effects model), the likelihood function is

$$L\big((\beta, \gamma); y, x\big) = \prod_{i=1}^{n} \prod_{j=1}^{n_i} P(\beta x_{ij} + \gamma_i, y_{ij}).$$

The log likelihood function is

$$\ell\big((\beta, \gamma); y, x\big) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \log P(\beta x_{ij} + \gamma_i, y_{ij}).$$
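As an illustration of the three probability functions and the fixed effects log likelihood, here is a minimal R sketch. The function names (`P_logit`, `P_cloglog`, `P_poisson`, `loglik_fixed`) and the data layout (one row of `X` per observation, no intercept column because $\gamma_i$ plays that role, and `cluster` coded 1, ..., n) are assumptions made for this example; they are not taken from the glmmML package.

```r
# P(x, y) for the three families listed on the slides.
P_logit   <- function(x, y) exp(x * y) / (1 + exp(x))
P_cloglog <- function(x, y) (1 - exp(-exp(x)))^y * exp(-(1 - y) * exp(x))
P_poisson <- function(x, y) exp(x * y) / factorial(y) * exp(-exp(x))

# Fixed effects log likelihood: sum of log P(beta x_ij + gamma_i, y_ij).
# X: N x p covariate matrix (hypothetical layout), cluster: values in 1..n.
loglik_fixed <- function(beta, gamma, y, X, cluster, P) {
  eta <- drop(X %*% beta) + gamma[cluster]
  sum(log(P(eta, y)))
}

# Made-up data: 3 clusters of 4 Bernoulli observations, one covariate.
set.seed(1)
cluster <- rep(1:3, each = 4)
X <- matrix(rnorm(12), ncol = 1)
y <- rbinom(12, 1, 0.5)
gamma <- c(-0.2, 0, 0.4)
ll <- loglik_fixed(0.3, gamma, y, X, cluster, P_logit)
```

The value `ll` should agree with what R's built-in `dbinom` gives for the same linear predictor, which is a convenient sanity check on the formula.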

Tests of cluster effect

Testing is performed via a simple bootstrap (glmmboot).

Under the null hypothesis of no grouping effect, the grouping factor can be randomly permuted without changing the probability distribution (the conditional approach), or a parametric bootstrap approach can be used: simulate observations from the fitted model under the null hypothesis (the unconditional approach).

Computational aspects

A profiling approach reduces an optimizing problem in high dimensions to a problem consisting of solving several one-variable equations followed by optimization in low dimensions.

The score vector

The partial derivatives with respect to $\beta_m$, $m = 1, \ldots, p$, of the log likelihood function are:

$$U_m(\beta, \gamma) = \frac{\partial}{\partial \beta_m} \ell\big((\beta, \gamma); y, x\big) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} x_{ijm} \, G(\beta x_{ij} + \gamma_i, y_{ij}), \quad m = 1, \ldots, p.$$

Cluster components of the score

The partial derivatives with respect to $\gamma_i$, $i = 1, \ldots, n$, are

$$U_{p+i}(\beta, \gamma) = \frac{\partial}{\partial \gamma_i} \ell\big((\beta, \gamma); y, x\big) = \sum_{j=1}^{n_i} G(\beta x_{ij} + \gamma_i, y_{ij}), \quad i = 1, \ldots, n,$$

where

$$G(x, y) = \frac{\partial}{\partial x} \log P(x, y) = \frac{\frac{\partial}{\partial x} P(x, y)}{P(x, y)}.$$
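The conditional (permutation) approach can be sketched in a few lines of R for the simplest case, a Bernoulli response with no covariates. This is only an illustration of the idea and does not reproduce glmmboot's internals: the test statistic (twice the gain in maximized log likelihood from giving each cluster its own intercept) and the helper names `xlogy`, `cluster_loglik` and `perm_test` are choices made for this example.

```r
# x * log(y) with the convention 0 * log(0) = 0.
xlogy <- function(x, y) ifelse(x == 0, 0, x * log(y))

# Maximized Bernoulli log likelihood with one free intercept per cluster
# and no covariates: the fitted probability in a cluster is its mean response.
cluster_loglik <- function(y, cluster) {
  s <- tapply(y, cluster, sum)
  m <- tapply(y, cluster, length)
  sum(xlogy(s, s / m) + xlogy(m - s, 1 - s / m))
}

# Permutation test of "no cluster effect": permute the grouping factor,
# recompute the statistic, and compare with the observed value.
perm_test <- function(y, cluster, boot = 999) {
  null_ll <- cluster_loglik(y, rep(1, length(y)))
  stat <- function(cl) 2 * (cluster_loglik(y, cl) - null_ll)
  observed <- stat(cluster)
  permuted <- replicate(boot, stat(sample(cluster)))
  list(statistic = observed,
       p.value = (1 + sum(permuted >= observed)) / (boot + 1))
}

# Made-up data: 20 clusters of size 5 with a real cluster effect.
set.seed(11)
cluster <- rep(1:20, each = 5)
y <- rbinom(100, 1, plogis(rnorm(20, sd = 2)[cluster]))
res <- perm_test(y, cluster, boot = 499)
```

For the unconditional approach, the `sample(cluster)` step would be replaced by drawing new responses from the model fitted under the null hypothesis.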

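With profiling

Setting $U_{p+i}(\beta, \gamma) = 0$ defines $\gamma$ implicitly as functions of $\beta$, $\gamma_i = \gamma_i(\beta)$, $i = 1, \ldots, n$:

$$F\big(\beta, \gamma_i(\beta)\big) = \sum_{j=1}^{n_i} G\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big) = 0, \quad i = 1, \ldots, n.$$

Profile score

From

$$\frac{d}{d\beta_m} F\big(\beta, \gamma_i(\beta)\big) = \frac{\partial \gamma_i}{\partial \beta_m} \frac{\partial F}{\partial \gamma_i} + \frac{\partial F}{\partial \beta_m} = 0$$

we get

$$\frac{\partial \gamma_i(\beta)}{\partial \beta_m} = -\frac{\partial F / \partial \beta_m}{\partial F / \partial \gamma_i} = -\frac{\sum_{j=1}^{n_i} x_{ijm} H(\beta x_{ij} + \gamma_i, y_{ij})}{\sum_{j=1}^{n_i} H(\beta x_{ij} + \gamma_i, y_{ij})}, \quad i = 1, \ldots, n; \; m = 1, \ldots, p,$$

where $H(x, y) = \frac{\partial}{\partial x} G(x, y)$. This derivative is needed when calculating the score corresponding to the profile likelihood.

Profile loglihood

Replacing $\gamma$ by $\gamma(\beta)$ gives the profile log likelihood $\ell^{(P)}$:

$$\ell^{(P)}(\beta; y, x) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \log P\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big),$$

as a function of $\beta$ alone.

Profile partial derivatives

The partial derivatives with respect to $\beta_m$, $m = 1, \ldots, p$, of the log profile likelihood function become:

$$U^{(P)}_m(\beta) = \frac{\partial}{\partial \beta_m} \ell^{(P)}(\beta; y, x) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \left( x_{ijm} + \frac{\partial \gamma_i(\beta)}{\partial \beta_m} \right) G\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big)$$

$$= U_m\big(\beta, \gamma(\beta)\big) + \sum_{i=1}^{n} \frac{\partial \gamma_i(\beta)}{\partial \beta_m} \sum_{j=1}^{n_i} G\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big) = U_m\big(\beta, \gamma(\beta)\big),$$

because the inner sum over $j$ is zero by the definition of $\gamma_i(\beta)$. Thus we get back the unprofiled partial derivatives.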
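The implicit-function step can be checked numerically. The sketch below, for the logit link and a single cluster, solves the cluster score equation for $\gamma_i(\beta)$ with R's `uniroot` and compares the closed-form derivative $-\sum_j x_{ijm} H_{ij} / \sum_j H_{ij}$ with a central finite difference. Taking $H$ as the $x$-derivative of $G$, the function names, the search interval and the data are all assumptions made for this example.

```r
# Logit link: G(x, y) = y - P(x, 1), and H(x, y) = dG/dx = -P(x, 1) * (1 - P(x, 1)).
G <- function(x, y) y - plogis(x)
H <- function(x, y) -plogis(x) * (1 - plogis(x))

# gamma_i(beta) for one cluster: root of sum_j G(beta x_ij + gamma_i, y_ij) = 0.
# Requires 0 < sum(y) < length(y); otherwise the root is infinite.
gamma_of_beta <- function(beta, y, X) {
  lin <- drop(X %*% beta)
  uniroot(function(g) sum(G(lin + g, y)),
          interval = c(-30, 30), tol = 1e-12)$root
}

# Implicit derivative: -sum_j x_ijm H_ij / sum_j H_ij, for m = 1, ..., p.
dgamma_dbeta <- function(beta, y, X) {
  eta <- drop(X %*% beta) + gamma_of_beta(beta, y, X)
  h <- H(eta, y)
  -colSums(X * h) / sum(h)
}

# Made-up data for a single cluster of six observations, two covariates.
set.seed(3)
X <- matrix(rnorm(12), ncol = 2)
y <- c(0, 1, 1, 0, 1, 0)
beta <- c(0.5, -0.3)

analytic <- dgamma_dbeta(beta, y, X)
eps <- 1e-5
numeric_diff <- sapply(1:2, function(m) {
  e <- replace(numeric(2), m, eps)
  (gamma_of_beta(beta + e, y, X) - gamma_of_beta(beta - e, y, X)) / (2 * eps)
})
```

The two vectors `analytic` and `numeric_diff` should agree up to finite-difference error.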

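Profile hessian

With $I_{ms}(\beta, \gamma) = -\frac{\partial}{\partial \beta_s} U_m(\beta, \gamma)$,

$$I^{(P)}_{ms}(\beta) = -\frac{\partial}{\partial \beta_s} U_m\big(\beta, \gamma(\beta)\big) = -\sum_{i=1}^{n} \sum_{j=1}^{n_i} x_{ijm} \left( x_{ijs} + \frac{\partial \gamma_i(\beta)}{\partial \beta_s} \right) H\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big)$$

$$= I_{ms}\big(\beta, \gamma(\beta)\big) + \sum_{i=1}^{n} \frac{\left(\sum_{j=1}^{n_i} x_{ijm} H_{ij}\right) \left(\sum_{j=1}^{n_i} x_{ijs} H_{ij}\right)}{\sum_{j=1}^{n_i} H_{ij}}, \quad m, s = 1, \ldots, p,$$

where $H_{ij} = H\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big)$, $j = 1, \ldots, n_i$; $i = 1, \ldots, n$.

At the maximum

Justifying the use of the profile likelihood:

Theorem 1 (Patefield). The inverse hessians from the full likelihood and from the profile likelihood for $\beta$ are equal when $(\gamma, \beta) = (\hat\gamma, \hat\beta)$.

Preparation for R

$$\ell^{(P)}(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \log P\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big),$$

$$U^{(P)}_m(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} x_{ijm} \, G\big(\beta x_{ij} + \gamma_i(\beta), y_{ij}\big), \quad m = 1, \ldots, p.$$

For fixed $\beta$, $\gamma_i(\beta)$ is found by solving

$$\sum_{j=1}^{n_i} G(\beta x_{ij} + \gamma_i, y_{ij}) = 0$$

with respect to $\gamma_i$, $i = 1, \ldots, n$.

Implementation in R

The maximization is performed by optim, via the C function vmmin, available as an entry point in the C code of R.

Implemented in the package glmmML in R. Covers three cases:
1. Binomial with logit link,
2. Binomial with cloglog link,
3. Poisson with log link.

The function is glmmboot. Testing of cluster effect is done by simulation (a simple form of bootstrapping), conditionally or unconditionally.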
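The recipe above (solve one equation per cluster for $\gamma_i(\beta)$, then maximize over $\beta$ alone) can be sketched at the R level for the logit link. This is a rough stand-in written for illustration, not the glmmML C implementation: `profile_fit`, the root-search interval and the tolerances are assumptions, and `optim` with `method = "BFGS"` is used as the R-level route to a variable metric optimizer.

```r
profile_fit <- function(y, X, cluster) {
  # Drop clusters whose responses are all 0 or all 1 (gamma_i would be infinite).
  s <- ave(as.numeric(y), cluster)          # cluster means of y
  keep <- s > 0 & s < 1
  y <- y[keep]; X <- X[keep, , drop = FALSE]; cluster <- factor(cluster[keep])
  idx <- split(seq_along(y), cluster)

  # gamma_i(beta): one root-finding problem per cluster.
  gamma_hat <- function(beta) {
    lin <- drop(X %*% beta)
    vapply(idx, function(j)
      uniroot(function(g) sum(y[j] - plogis(lin[j] + g)),
              interval = c(-30, 30), tol = 1e-10)$root, numeric(1))
  }
  eta <- function(beta) drop(X %*% beta) + gamma_hat(beta)[as.integer(cluster)]

  # Negative profile log likelihood and its gradient -U_m(beta, gamma(beta)).
  negll <- function(beta) -sum(dbinom(y, 1, plogis(eta(beta)), log = TRUE))
  negscore <- function(beta) -colSums(X * (y - plogis(eta(beta))))

  optim(rep(0, ncol(X)), negll, negscore, method = "BFGS",
        control = list(reltol = 1e-12))
}

# Made-up data: 40 clusters of size 5, one covariate.
set.seed(2)
cluster <- rep(1:40, each = 5)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.8 * x + rnorm(40)[cluster]))
fit <- profile_fit(y, cbind(x), cluster)
```

The estimate `fit$par` can be compared with the coefficient of `x` from `glm(y ~ x + factor(cluster), family = binomial)` fitted to the same non-degenerate clusters; the two should coincide up to optimizer tolerance, while the profile version optimizes over one parameter instead of one per cluster plus one.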

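Binomial with logit link

$$P(x, y) = \frac{\exp(xy)}{1 + \exp(x)}, \qquad G(x, y) = y - P(x, 1).$$

We get $(\gamma_1, \ldots, \gamma_n)$ by solving the equations

$$\sum_{j=1}^{n_i} y_{ij} = \sum_{j=1}^{n_i} \frac{\exp(\beta x_{ij} + \gamma_i)}{1 + \exp(\beta x_{ij} + \gamma_i)}$$

for $i = 1, \ldots, n$ (using the C version of uniroot).

Special cases: $\sum_j y_{ij} = 0$ or $n_i$, giving $\gamma_i = -\infty$ or $+\infty$, respectively. The corresponding cluster can be thrown out. (Should be used in glm?)

Binomial with cloglog link

$$P(x, y) = \left(1 - \exp(-\exp(x))\right)^{y} \exp\left(-(1 - y) \exp(x)\right), \qquad G(x, y) = \frac{\exp(x)}{P(x, 1)} \{ y - P(x, 1) \}.$$

We get $(\gamma_1, \ldots, \gamma_n)$ by solving the equations

$$\sum_{j=1}^{n_i} \frac{\exp(\beta x_{ij} + \gamma_i)}{P(\beta x_{ij} + \gamma_i, 1)} \left\{ y_{ij} - P(\beta x_{ij} + \gamma_i, 1) \right\} = 0$$

for $i = 1, \ldots, n$ (using the C version of uniroot).

Special cases: $\sum_j y_{ij} = 0$ or $n_i$, giving $\gamma_i = -\infty$ or $+\infty$, respectively. The corresponding cluster can be thrown out.

Poisson with log link

$$P(x, y) = \frac{e^{xy}}{y!} \exp(-\exp(x)), \qquad G(x, y) = y - e^{x}.$$

We get $(\gamma_1, \ldots, \gamma_n)$ from

$$\sum_{j=1}^{n_i} y_{ij} = e^{\gamma_i} \sum_{j=1}^{n_i} \exp(\beta x_{ij}), \quad i = 1, \ldots, n,$$

giving

$$\gamma_i = \log \left\{ \frac{\sum_j y_{ij}}{\sum_j \exp(\beta x_{ij})} \right\}, \quad i = 1, \ldots, n.$$

Special case: $\sum_j y_{ij} = 0$, giving $\gamma_i = -\infty$.

Simulation

Model:

$$P(Y_{ij} = 1 \mid \gamma_i) = 1 - P(Y_{ij} = 0 \mid \gamma_i) = \frac{e^{\gamma_i}}{1 + e^{\gamma_i}}, \quad j = 1, \ldots, 5; \; i = 1, \ldots, n,$$

where $\gamma_1, \ldots, \gamma_n$ are iid $N(0, \sigma)$.

Hypothesis: $\sigma = 0$.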
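In the Poisson case no root finding is needed, because $\gamma_i(\beta)$ has the closed form given above. A minimal R sketch follows; the function name `gamma_poisson` and the data are made up for illustration. It also shows the special case: a cluster with only zero counts yields $\gamma_i = -\infty$.

```r
# Closed-form gamma_i(beta) for the Poisson model with log link:
# gamma_i = log( sum_j y_ij / sum_j exp(beta x_ij) ).
gamma_poisson <- function(beta, y, X, cluster) {
  lin <- drop(X %*% beta)
  log(tapply(y, cluster, sum) / tapply(exp(lin), cluster, sum))
}

# Made-up data: 4 clusters of size 3; cluster 2 has only zero counts.
set.seed(4)
cluster <- rep(1:4, each = 3)
X <- matrix(rnorm(12), ncol = 1)
y <- c(2, 0, 1,  0, 0, 0,  3, 1, 4,  1, 0, 2)
beta <- 0.4

g <- gamma_poisson(beta, y, X, cluster)

# Cluster score sum_j G(beta x_ij + gamma_i, y_ij) with G(x, y) = y - exp(x);
# it should vanish in every cluster at gamma_i(beta).
fitted_mean <- exp(drop(X %*% beta) + as.vector(g[as.character(cluster)]))
cluster_scores <- tapply(y - fitted_mean, cluster, sum)
```

Here `g[["2"]]` is `-Inf`, and the fitted means in that cluster are exactly zero, so the cluster contributes nothing to the score for $\beta$.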

Simulation specifications

σ = 0, 0.5.
n = 5, 50, 500.

Four methods: glmmboot, unconditional and conditional, glmmML, glm (naively?).

[Figure: Null model (σ = 0); 5 clusters. One panel per method, including one labelled "conditional"; axis label F(p).]

[Figure: Null model (σ = 0); 50 clusters. One panel per method, including one labelled "conditional"; axis label F(p).]

[Figure: Null model (σ = 0); 500 clusters. One panel per method, including one labelled "conditional"; axis label F(p).]
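A data set matching these specifications can be generated as follows. This is a sketch under stated assumptions: σ is read as the standard deviation of the random intercepts (the slide writes N(0, σ)), the cluster size of 5 comes from the model slide, and the function name `simulate_clusters` and its column names are hypothetical.

```r
# Simulate n clusters of Bernoulli responses with random intercepts
# gamma_i ~ N(0, sigma) (sigma taken as a standard deviation) and
# P(Y_ij = 1 | gamma_i) = exp(gamma_i) / (1 + exp(gamma_i)).
simulate_clusters <- function(n, sigma, size = 5) {
  gamma <- rnorm(n, mean = 0, sd = sigma)
  cluster <- rep(seq_len(n), each = size)
  data.frame(y = rbinom(n * size, 1, plogis(gamma[cluster])),
             cluster = cluster)
}

set.seed(5)
null_data      <- simulate_clusters(n = 50, sigma = 0)    # no cluster effect
clustered_data <- simulate_clusters(n = 50, sigma = 0.5)  # with cluster effect
```

Repeating such draws many times and recording the p-value of each method would give the empirical distributions that the F(p) panels appear to display.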

[Figure: Clustering (σ = 0.5); 5 clusters. One panel per method, including one labelled "conditional"; axis label F(p).]

[Figure: Clustering (σ = 0.5); 50 clusters. One panel per method, including one labelled "conditional"; axis label F(p).]

[Figure: Clustering (σ = 0.5); 500 clusters. One panel per method, including one labelled "conditional"; axis label F(p).]

Timings, 5 clusters

> system.time(glmmboot(y ~ 1, cluster = cluster, data = timing, conditional = FALSE, boot = 2000))
[1]
> system.time(glmmboot(y ~ 1, cluster = cluster, data = timing, conditional = TRUE, boot = 2000))
[1]
> system.time(glmmML(y ~ 1, cluster = cluster, data = timing))
[1]
> system.time(glm(y ~ factor(cluster), data = timing, family = binomial))
[1]

Timings, 50 clusters

> system.time(glmmboot(y ~ 1, cluster = cluster, data = timing, conditional = FALSE, boot = 2000))
[1]
> system.time(glmmboot(y ~ 1, cluster = cluster, data = timing, conditional = TRUE, boot = 2000))
[1]
> system.time(glmmML(y ~ 1, cluster = cluster, data = timing))
[1]
> system.time(glm(y ~ factor(cluster), data = timing, family = binomial))
[1]

Timings, 500 clusters

> system.time(glmmboot(y ~ 1, cluster = cluster, data = timing, conditional = FALSE, boot = 2000))
[1]
> system.time(glmmboot(y ~ 1, cluster = cluster, data = timing, conditional = TRUE, boot = 2000))
[1]
> system.time(glmmML(y ~ 1, cluster = cluster, data = timing))
[1]
> system.time(glm(y ~ factor(cluster), data = timing, family = binomial))
[1]

glm vs. glmmboot(boot = 0)

[Table: Execution times. Columns: No. of clusters, glm, glmmboot.]

Conclusion: Profiling is numerically very efficient.
