Lecture 16: Mixtures of Generalized Linear Models


October 26, 2006

Setting

Often, a single GLM may be insufficiently flexible to characterize the data. For example, the exponential family assumption may be violated. A flexible solution is to define a mixture of GLMs.

Density Estimation

Suppose that we have data on a single continuous variable, $y_i$, $i = 1, \ldots, n$. Interest focuses on estimating the density of $y$ without assuming normality. One possibility is to use a mixture of normals:

$$f(y_i) = \int N(y_i; \mu_i, \sigma_i^2) \, dG(\mu_i, \sigma_i^2),$$

where $N(y; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-(y - \mu)^2 / (2\sigma^2)\}$ is the normal density and $G$ is the mixing distribution.

Mixture Distribution

Different choices of $G$ correspond to different mixture specifications. We can express the mixture of normals in hierarchical form:

$$(y_i \mid \mu_i, \sigma_i^2) \sim N(\mu_i, \sigma_i^2), \qquad (\mu_i, \sigma_i^2) \sim G.$$

A finite mixture is obtained by letting

$$G = \sum_{h=1}^{k} p_h \, \delta_{\theta_h}, \qquad \theta_h = (\mu_h, \sigma_h^2),$$

where $p_h$ is the probability of component $h$ and $\theta_h$ collects the parameters of component $h$.

Finite Mixtures

In the finite mixture of normals case, we have

$$f(y_i) = \sum_{h=1}^{k} p_h \, N(y_i; \mu_h, \sigma_h^2).$$

It is well known that mixtures of normals can approximate any smooth density. Finite mixtures with a sufficient number of components (say 5-7) are very flexible.
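
As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates this finite-mixture density at a set of points; the values of `p`, `mu`, and `sigma2` are arbitrary example choices:

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, p, mu, sigma2):
    """Evaluate f(y) = sum_h p_h N(y; mu_h, sigma2_h) at each point in y."""
    y = np.atleast_1d(y)
    # One column per component; rows index the evaluation points
    comps = norm.pdf(y[:, None], loc=mu, scale=np.sqrt(sigma2))
    return comps @ p

# Example: a bimodal two-component mixture
p = np.array([0.4, 0.6])
mu = np.array([-2.0, 1.5])
sigma2 = np.array([1.0, 0.5])
print(mixture_density([-2.0, 0.0, 1.5], p, mu, sigma2))
```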

Fitting Finite Mixtures

By treating the mixture component for individual $i$ as latent data, model fitting becomes straightforward. Letting $Z_i = h$ if individual $i$ is sampled from component $h$:

$$(y_i \mid Z_i = h) \sim N(\mu_h, \sigma_h^2), \qquad (Z_i \mid p) \sim \sum_{h=1}^{k} p_h \, \delta_h.$$

We can use an EM algorithm for maximum likelihood estimation or MCMC for posterior computation.
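
For concreteness, a minimal EM sketch for the univariate mixture of normals follows. It is illustrative only: it omits the usual safeguards (multiple restarts, protection against collapsing variances), and all function and variable names are my own.

```python
import numpy as np
from scipy.stats import norm

def em_mixture(y, k, n_iter=200, seed=0):
    """EM for a k-component univariate mixture of normals (minimal sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    p = np.full(k, 1.0 / k)
    mu = rng.choice(y, size=k, replace=False)   # initialize at random data points
    sigma2 = np.full(k, y.var())
    for _ in range(n_iter):
        # E-step: responsibilities r[i, h] = Pr(Z_i = h | y_i, current params)
        r = p * norm.pdf(y[:, None], loc=mu, scale=np.sqrt(sigma2))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates of weights, means, and variances
        nh = r.sum(axis=0)
        p = nh / n
        mu = (r * y[:, None]).sum(axis=0) / nh
        sigma2 = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nh
    return p, mu, sigma2

# Example on simulated two-component data
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(1.5, 0.7, 700)])
print(em_mixture(y, k=2))
```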

Prior Specification

Complete the specification of the model with priors for $p = (p_1, \ldots, p_k)$ and $\theta_h = (\mu_h, \sigma_h^2)$, for $h = 1, \ldots, k$. The traditional school of thought holds that constraints must be imposed for identifiability, with the most common choice being

$$\mu_1 < \mu_2 < \cdots < \mu_k.$$

It often works better to use a prior of the form

$$p \sim \text{Dirichlet}(\alpha/k, \ldots, \alpha/k), \qquad \theta_h \sim \text{Normal-Inv-Gamma}.$$

Gibbs Sampling

After augmentation with $Z = (Z_1, \ldots, Z_n)$, posterior computation is straightforward via Gibbs sampling.

Step 1 - Sample $Z_i$, $i = 1, \ldots, n$, from the multinomial full conditional posterior:

$$\Pr(Z_i = h \mid p, \theta) = \frac{p_h \, N(y_i; \mu_h, \sigma_h^2)}{\sum_{l=1}^{k} p_l \, N(y_i; \mu_l, \sigma_l^2)}.$$

Gibbs Steps (Continued)

Step 2 - Update $\theta_h$, for $h = 1, \ldots, k$, by sampling from its normal-inverse-gamma full conditional. This can be calculated for $\theta_h$ by simply updating the prior with the data from those subjects with $Z_i = h$.

Step 3 - Update $p$ by sampling from the conditionally-conjugate Dirichlet:

$$p \sim \text{Dirichlet}\!\left(\frac{\alpha}{k} + \sum_i 1(Z_i = 1), \ldots, \frac{\alpha}{k} + \sum_i 1(Z_i = k)\right).$$
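
Putting the three steps together, the sketch below implements this sampler. It assumes, as one illustrative choice, the conditionally conjugate prior $\mu_h \mid \sigma_h^2 \sim N(\mu_0, \sigma_h^2/\kappa_0)$, $\sigma_h^2 \sim \text{Inv-Gamma}(a_0, b_0)$; the hyperparameter names and defaults are mine, not from the slides, and label switching is ignored.

```python
import numpy as np

def gibbs_mixture(y, k, alpha=1.0, mu0=0.0, kappa0=0.01, a0=1.0, b0=1.0,
                  n_iter=1000, seed=0):
    """Gibbs sampler for a k-component mixture of normals (minimal sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    p = np.full(k, 1.0 / k)
    mu = rng.choice(y, size=k, replace=False)   # initialize at data points
    sigma2 = np.full(k, y.var())
    samples = []
    for _ in range(n_iter):
        # Step 1: sample labels Z_i from the multinomial full conditional
        logw = np.log(p) - 0.5 * np.log(sigma2) - 0.5 * (y[:, None] - mu) ** 2 / sigma2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        Z = (rng.random(n)[:, None] > np.cumsum(w, axis=1)).sum(axis=1).clip(max=k - 1)
        # Step 2: sample (mu_h, sigma2_h) from the normal-inv-gamma conditional
        for h in range(k):
            yh = y[Z == h]
            nh = len(yh)
            ybar = yh.mean() if nh > 0 else 0.0
            kn = kappa0 + nh
            an = a0 + nh / 2.0
            bn = (b0 + 0.5 * ((yh - ybar) ** 2).sum()
                  + kappa0 * nh * (ybar - mu0) ** 2 / (2.0 * kn))
            sigma2[h] = 1.0 / rng.gamma(an, 1.0 / bn)          # Inv-Gamma(an, bn)
            mu[h] = rng.normal((kappa0 * mu0 + nh * ybar) / kn,
                               np.sqrt(sigma2[h] / kn))
        # Step 3: sample p from the conditionally-conjugate Dirichlet
        counts = np.bincount(Z, minlength=k)
        p = rng.dirichlet(alpha / k + counts)
        samples.append((p.copy(), mu.copy(), sigma2.copy()))
    return samples
```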

What about predictors?

Now suppose that we have a continuous response, $y_i$, and predictors, $x_i = (x_{i1}, \ldots, x_{ip})'$. To avoid the normality assumption, model the residual distribution using a mixture of normals:

$$y_i = x_i'\beta + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2), \qquad \sigma_i^2 \sim G = \sum_{h=1}^{k} p_h \, \delta_{\tau_h}.$$

This is a scale mixture of normals.

Location vs Scale Mixtures

A location mixture is a mixture over a location parameter, e.g.,

$$f(y_i) = \sum_{h=1}^{k} p_h \, N(y_i; \mu_h, \sigma^2).$$

A scale mixture is a mixture over a scale parameter, e.g.,

$$f(y_i) = \sum_{h=1}^{k} p_h \, N(y_i; \mu, \sigma_h^2).$$

A location-scale mixture does both, e.g.,

$$f(y_i) = \sum_{h=1}^{k} p_h \, N(y_i; \mu_h, \sigma_h^2).$$

Scale Mixtures for Residual Densities

For residual densities it is often plausible to assume a symmetric form with mode at 0. By using a scale mixture of normals with mean 0, the density is automatically symmetric with mode at 0. Continuous scale mixtures can be used to obtain a heavier-tailed parametric form (e.g., a t-distribution instead of a normal). Finite mixtures have the advantage of additional flexibility.

Location Mixtures for Residual Densities

To allow multimodality and skewness of the residual density, a location or location-scale mixture can be used. In such cases, the intercept should be removed from the $x_i'\beta$ component. Otherwise, there is non-identifiability between the mean of the residual density (no longer restricted to be zero) and the intercept.

Gibbs Sampling with Predictors

Focusing on the model

$$y_i = x_i'\beta + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2), \qquad \sigma_i^2 \sim G = \sum_{h=1}^{k} p_h \, \delta_{\tau_h},$$

the Gibbs sampler described above can be trivially extended to do posterior computation. It just requires a step for updating $\beta$, and the use of $y_i - x_i'\beta$ in place of $y_i$ in the other steps.
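
The extra $\beta$ step is a standard heteroscedastic normal draw. The sketch below assumes a $N(\beta_0, \Sigma_0)$ prior for $\beta$, which is my choice for illustration; the slides do not specify one:

```python
import numpy as np

def update_beta(y, X, sigma2_i, beta0, Sigma0, rng):
    """Sample beta from its normal full conditional in y_i = x_i'beta + e_i,
    e_i ~ N(0, sigma2_i), under an assumed N(beta0, Sigma0) prior."""
    W = 1.0 / sigma2_i                          # per-observation precisions
    Sigma0_inv = np.linalg.inv(Sigma0)
    prec = Sigma0_inv + X.T @ (W[:, None] * X)  # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (Sigma0_inv @ beta0 + X.T @ (W * y))
    return rng.multivariate_normal(mean, cov)

# Example usage on toy data with two residual-variance components
rng = np.random.default_rng(0)
n, pdim = 100, 3
X = rng.normal(size=(n, pdim))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2_i = rng.choice([0.5, 2.0], size=n)       # scale-mixture variances
y = X @ beta_true + rng.normal(size=n) * np.sqrt(sigma2_i)
print(update_beta(y, X, sigma2_i, np.zeros(pdim), 100 * np.eye(pdim), rng))
```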

Multivariate Response Data

All of the above approaches can be straightforwardly applied when $y_i = (y_{i1}, \ldots, y_{iq})'$. In the absence of predictors, we would have the finite mixture

$$f(y_i) = \sum_{h=1}^{k} p_h \, N_q(y_i; \mu_h, \Sigma_h).$$

Instead of a normal-inverse-gamma, we can use a normal-inverse-Wishart as the conjugate prior for $\theta_h = (\mu_h, \Sigma_h)$.

What about Binary Response Data?

For binary response data and probit models, we have been relying on

$$y_i = 1(z_i > 0), \qquad z_i = x_i'\beta + \epsilon_i, \qquad \epsilon_i \sim N(0, 1).$$

Suppose we want to avoid assuming underlying normality, and so use a mixture of normals in place of $N(0, 1)$. Any dangers?

Complications of Binary Response Models

Focus initially on the simple case in which $y_i \in \{0, 1\}$, $i = 1, \ldots, n$, with no predictors. Then we may be tempted to fit the following mixture model:

$$y_i \sim \text{Bernoulli}(\pi_i), \qquad \pi_i \sim \sum_{h=1}^{k} p_h \, \delta_{\theta_h}.$$

Bernoulli Mixtures

However, this model is equivalent to

$$y_i \sim \text{Bernoulli}(\pi^*), \qquad \text{where } \pi^* = \sum_{h=1}^{k} p_h \theta_h.$$

Hence, a mixture of Bernoullis is Bernoulli, and no flexibility is gained!
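
To see the collapse numerically, a quick Monte Carlo check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.5, 0.2])        # mixture weights
theta = np.array([0.1, 0.6, 0.9])    # component success probabilities

# Draw from the mixture: pick a component, then a Bernoulli outcome
h = rng.choice(3, size=1_000_000, p=p)
y = rng.random(1_000_000) < theta[h]

print(y.mean())     # ~0.51 by simulation
print(p @ theta)    # 0.51 exactly: the collapsed Bernoulli probability pi*
```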

Mixtures of Probit Models

Then what are we gaining by allowing the residual density in the underlying-variable specification to be non-normal? Answer: uncertainty in the link function. Letting $y_i = 1(z_i > 0)$, we let

$$z_i = x_i'\beta + \mu_i + \epsilon_i, \qquad \mu_i \sim G, \qquad \epsilon_i \sim N(0, 1).$$

Then we can let $\mu_i \sim \sum_{h=1}^{k} p_h \, \delta_{\theta_h}$.
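
Marginalizing over $\mu_i$ and $\epsilon_i$ gives $\Pr(y_i = 1 \mid x_i) = \sum_{h=1}^{k} p_h \Phi(x_i'\beta + \theta_h)$, so the implied link is a mixture of shifted probit curves. A small sketch with illustrative values:

```python
import numpy as np
from scipy.stats import norm

def mixture_probit_link(eta, p, theta):
    """Pr(y = 1 | x) = sum_h p_h * Phi(eta + theta_h), where eta = x'beta.
    The discrete mixture on mu_i makes the marginal link a mixture of
    shifted probit curves."""
    eta = np.atleast_1d(eta)
    return norm.cdf(eta[:, None] + theta) @ p

# Illustrative values: two components shifting the probit curve left/right
p = np.array([0.7, 0.3])
theta = np.array([-1.0, 2.0])
print(mixture_probit_link([-1.0, 0.0, 1.0], p, theta))
```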

What about the Choice of k?

Until now, the focus has been on assuming $k$ known. In practice, $k$ may be unknown and difficult to choose. Ideally, we could avoid choosing the number of mixture components.

Unknown Number of Components

Let $\mu_i \sim G$, with $G$ the mixture distribution. Then we let $G = \sum_{h=1}^{k} p_h \, \delta_{\theta_h}$ and consider the following prior:

$$p \sim \text{Dirichlet}(\alpha/k, \ldots, \alpha/k), \qquad \theta_h \sim G_0.$$
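
A short sketch of drawing one realization of $G$ under this prior, taking $G_0 = N(0, 1)$ purely for illustration; with a small $\alpha/k$, most of the weight concentrates on a few atoms:

```python
import numpy as np

def sample_G(k, alpha, rng):
    """Draw one realization of G = sum_h p_h delta_{theta_h} under
    p ~ Dirichlet(alpha/k, ..., alpha/k) and theta_h ~ G0, with
    G0 = N(0, 1) as an illustrative base measure."""
    p = rng.dirichlet(np.full(k, alpha / k))
    theta = rng.normal(0.0, 1.0, size=k)   # atoms drawn iid from G0
    return p, theta

rng = np.random.default_rng(0)
p, theta = sample_G(k=50, alpha=1.0, rng=rng)
print(np.sort(p)[::-1][:5])   # the few dominant weights
```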

Upper Bound & Approximations

Note that for large $k$, not all of the mixture components will be occupied. Hence, $k$ effectively provides an upper bound on the number of components. For large $k$, $G$ is approximately assigned a Dirichlet process prior.
