Default Priors and Efficient Posterior Computation in Bayesian Factor Analysis. January 16, 2010

Presented by Eric Wang, Duke University

Outline: Background and Motivation; A Brief Review of Parameter Expansion Literature; Treating the Number of Factors as Fixed; Treating the Number of Factors as Unknown.

Background and Motivation. Factor models provide flexible and sparse representations of multivariate data and take the form

$y_i = \Lambda \eta_i + \epsilon_i, \qquad \epsilon_i \sim N_p(0, \Sigma)$   (1)

Markov chain Monte Carlo (MCMC) algorithms are commonly used for parameter inference. Often, conditionally conjugate priors are chosen for the model parameters to facilitate straightforward posterior computation by a Gibbs sampler. However, traditionally used priors lead to several challenges that limit the performance of Bayesian factor models.

Challenges posed by the standard Bayesian FA construction. Generally, knowledge of reasonable or plausible hyperparameter values is limited. A common hierarchical structure uses normal priors for the factor loadings and inverse-gamma priors for the residual variances. Known issues: improper posteriors arise in the limiting case as the variance of the normal or inverse-gamma prior grows large; proper but diffuse priors do not solve the problem; and slow mixing is observed even when informative priors are used.

Parameter expansion as a solution. This paper proposes a novel application of parameter expansion (PX) that yields default priors and leads to substantially improved mixing and reliable posterior computation. PX is attractive because it allows new families of priors to be introduced, in this case t and folded-t priors. The authors also propose an efficient PX Gibbs sampling scheme: draw samples from conventional conditionally conjugate full conditionals in an expanded working model, then use a post-processing step to transform the draws back to the inferential model in (1). The authors further propose a way to allow uncertainty in the number of factors.

Liu et al., 1998. Parameter Expansion to Accelerate EM: The PX-EM Algorithm (Liu et al., 1998) proposed PX as a way to accelerate EM inference in a simple model. The authors introduced an auxiliary variable to reduce coupling between variables in the original model. Using a simple hierarchical model, PX-EM performs a covariance adjustment at every M-step to correct for the imputed value of the auxiliary variable and its fixed expectation under the non-expanded model. The authors also showed that when the model parameters satisfy θ = θ_MLE, the deviation between the expanded and non-expanded models disappears.

Qi and Jaakkola, 2006. Parameter Expanded Variational Bayesian Methods (Qi and Jaakkola, 2006) proposed PX-VB for probit regression. The authors used the same concept as in PX-EM, suitably adapted to VB for probit regression. Significant efficiency gains were seen when the method was applied to the RVM. The authors report that, empirically, PX-VB solutions are similar to VB solutions, and include plots showing roughly a 15-fold reduction in the number of iterations.

Gelman, 2004. Parameterization and Bayesian Modeling (Gelman, 2004) showed that PX can induce new families of priors by applying a redundant multiplicative reparameterization of the original model. The reparameterization can induce an implicit folded noncentral t distribution. Appealing special cases of the folded noncentral t include the half-t, the uniform, and the proper half-Cauchy distributions. Although the folded noncentral t distribution is itself not conditionally conjugate in the Bayesian hierarchical setting, the individual components used to induce it are, so straightforward Gibbs sampling can be performed.
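
To make the mechanism concrete, the standard construction (recalled here from Gelman (2006) as background, not quoted from the present paper) multiplies a conditionally conjugate normal by the square root of a conditionally conjugate variance:

\[
\theta_j = \xi\,\eta_j, \qquad \xi \sim N(0, B^2), \qquad \eta_j \mid \sigma_\eta \sim N(0, \sigma_\eta^2), \qquad \sigma_\eta^2 \sim \text{Inv-Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu s^2}{2}\right)
\]
\[
\Rightarrow\quad \sigma_\theta = |\xi|\,\sigma_\eta \ \sim\ \text{half-}t_\nu \text{ with scale } Bs,
\]

and allowing a nonzero prior mean for the multiplier ξ gives the folded noncentral t.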

Recall the original model specification

$y_i = \Lambda \eta_i + \epsilon_i, \qquad \epsilon_i \sim N_p(0, \Sigma)$   (2)

where Λ is a p × k matrix of factor loadings, η_i = (η_i1, ..., η_ik)' ~ N(0, I_k) is a vector of latent factors, and ε_i is the residual with diagonal covariance matrix Σ = diag(σ_1^2, ..., σ_p^2). To ensure identifiability, assume Λ has a full-rank lower triangular structure. The diagonal elements of Λ have truncated normal priors, the lower triangular elements are given normal priors, and σ_1^2, ..., σ_p^2 have inverse-gamma priors. Note that if we marginalize out η_i, then y_i ~ N(0, Ω) where Ω = ΛΛ' + Σ.
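
To make the marginal form concrete, here is a minimal sketch (assuming numpy; all numerical values and names are illustrative, not from the paper) that simulates data from (2) and checks that the empirical covariance of y approaches Ω = ΛΛ' + Σ.

```python
# Minimal sketch: simulate from the inferential factor model and verify
# that the marginal covariance of y is Omega = Lambda Lambda' + Sigma.
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 7, 2, 50_000                           # dimensions (hypothetical)
Lambda = np.tril(rng.normal(size=(p, k)))        # lower-triangular loadings
d = np.arange(k)
Lambda[d, d] = np.abs(Lambda[d, d])              # positive diagonal (identifiability)
Sigma = np.diag(rng.uniform(0.05, 0.3, size=p))  # diagonal residual covariance

eta = rng.standard_normal((n, k))                               # eta_i ~ N(0, I_k)
eps = rng.standard_normal((n, p)) * np.sqrt(np.diag(Sigma))     # eps_i ~ N_p(0, Sigma)
y = eta @ Lambda.T + eps                                        # stacked y_i, shape (n, p)

Omega = Lambda @ Lambda.T + Sigma                # implied marginal covariance of y_i
print(np.max(np.abs(np.cov(y, rowvar=False) - Omega)))  # small for large n
```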

Inducing Priors through Parameter Expansion. The authors note that the priors on the previous slide yield computationally convenient posteriors via Gibbs sampling but are subject to issues such as slow mixing and unintentionally informative priors. In this paper, the authors use PX to induce a heavier-tailed, proper prior on the factor loadings, following Gelman (2004, 2006). The authors first introduce a redundantly overparameterized working model, which is then related to the inferential model above through a transformation.

PX-Factor Model. Define the following PX factor model:

$y_i = \Lambda^* \eta_i^* + \epsilon_i, \qquad \eta_i^* \sim N_k(0, \Psi), \qquad \epsilon_i \sim N_p(0, \Sigma)$   (3)

where Λ* is an unconstrained p × k lower triangular working factor loading matrix, η_i^* is a vector of working latent factors, Ψ = diag(ψ_1, ..., ψ_k), and Σ is as defined previously. Note the redundant overparameterization after marginalizing out η_i^*: y_i ~ N(0, Λ*ΨΛ*' + Σ).

To relate the working model parameters to the inferential model parameters, the authors employ the transformation

$\lambda_{jl} = S(\lambda^*_{ll})\,\lambda^*_{jl}\,\psi_l^{1/2}, \quad j = 1, \dots, p, \; l = 1, \dots, k, \qquad \eta_i = \Psi^{-1/2}\,\eta_i^*$   (4)

where S(x) denotes the sign of x. The following priors are then employed for the working model parameters:

$\lambda^*_{jl} \sim N(0, 1), \quad j = 1, \dots, p, \; l = 1, \dots, \min(j, k),$
$\lambda^*_{jl} \sim \delta_0, \quad j = 1, \dots, p, \; l = j + 1, \dots, k,$
$\psi_l \sim \text{Gamma}(a_l, b_l), \quad l = 1, \dots, k$   (5)

where δ_0 is a measure concentrated at 0.

Full PX-FA Model. Putting the previous two slides together, the full model is

$y_i = \Lambda^* \eta_i^* + \epsilon_i, \qquad \eta_i^* \sim N_k(0, \Psi), \qquad \epsilon_i \sim N_p(0, \Sigma),$
$\lambda^*_{jl} \sim N(0, 1), \quad j = 1, \dots, p, \; l = 1, \dots, \min(j, k),$
$\lambda^*_{jl} \sim \delta_0, \quad j = 1, \dots, p, \; l = j + 1, \dots, k,$
$\psi_l \sim \text{Gamma}(a_l, b_l), \quad l = 1, \dots, k$   (6)

with the transformation

$\lambda_{jl} = S(\lambda^*_{ll})\,\lambda^*_{jl}\,\psi_l^{1/2}, \quad j = 1, \dots, p, \; l = 1, \dots, k, \qquad \eta_i = \Psi^{-1/2}\,\eta_i^*$   (7)

where S(x) denotes the sign of x.
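
A minimal sketch of the post-processing step in (7), assuming numpy; the function name and array layout are illustrative, not the authors' code.

```python
# Minimal sketch: map one posterior draw of the working parameters
# (Lambda_star, psi, eta_star) back to the inferential parameterization.
import numpy as np

def px_to_inferential(Lambda_star, psi, eta_star):
    """Lambda_star: (p, k) working loadings; psi: (k,) working factor variances;
    eta_star: (n, k) working latent factors. Returns (Lambda, eta)."""
    signs = np.sign(np.diag(Lambda_star))        # S(lambda*_ll), one sign per column
    Lambda = Lambda_star * signs * np.sqrt(psi)  # lambda_jl = S(lambda*_ll) lambda*_jl psi_l^{1/2}
    eta = eta_star / np.sqrt(psi)                # eta_i = Psi^{-1/2} eta*_i
    # Applying `signs` to eta as well would keep Lambda @ eta_i identical to
    # Lambda_star @ eta_star_i when a diagonal working loading is negative.
    return Lambda, eta
```

In the sampler, this transformation is applied to each retained draw after burn-in, and the working-model draws are discarded, as described on the next slide.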

Full PX-FA Model (continued). The preceding model can be thought of as a generalization of Gelman (2006). Upon marginalizing out Ψ, we obtain t priors for the off-diagonal elements of Λ and half-t priors for the diagonal elements. Note from the transformation λ_jl = S(λ*_ll) λ*_jl ψ_l^{1/2} that the columns of Λ depend on Ψ: columns with large ψ_l will tend to have larger factor loadings, while columns with small ψ_l will tend to have smaller loadings. Inference in the PX-FA model proceeds by running a standard Gibbs sampler on the working model and transforming each iteration back to the inferential model, discarding the working model samples.

Inference. The PX-FA model can be written row-wise as

$y_{ij} = z_{ij}' \lambda^*_j + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma_j^2)$   (8)

where z_ij = (η*_i1, ..., η*_ik_j)', λ*_j = (λ*_j1, ..., λ*_jk_j)', and k_j = min(j, k). The full conditional posterior of the free loadings in row j is

$p(\lambda^*_j \mid \eta^*, \Psi, \Sigma, y) = N_{k_j}\big( (\Sigma_{0\lambda_j}^{-1} + \sigma_j^{-2} Z_j' Z_j)^{-1} (\Sigma_{0\lambda_j}^{-1} \lambda_{0j} + \sigma_j^{-2} Z_j' Y_j),\; (\Sigma_{0\lambda_j}^{-1} + \sigma_j^{-2} Z_j' Z_j)^{-1} \big),$

where λ_{0j} and Σ_{0λ_j} are the prior mean and covariance of λ*_j, Z_j = (z_1j, ..., z_nj)', and Y_j = (y_1j, ..., y_nj)'. (Continued on next slide.)

Inference (continued). The remaining full conditionals are

$p(\eta_i^* \mid \Lambda^*, \Psi, \Sigma, y) = N_k\big( (\Psi^{-1} + \Lambda^{*\prime} \Sigma^{-1} \Lambda^*)^{-1} \Lambda^{*\prime} \Sigma^{-1} y_i,\; (\Psi^{-1} + \Lambda^{*\prime} \Sigma^{-1} \Lambda^*)^{-1} \big)$   (9)

$p(\psi_l^{-1} \mid \eta^*, \Lambda^*, \Sigma, y) = \text{Gamma}\big( a_l + \tfrac{n}{2},\; b_l + \tfrac{1}{2} \sum_{i=1}^{n} \eta_{il}^{*2} \big)$   (10)

$p(\sigma_j^{-2} \mid \eta^*, \Lambda^*, \Psi, y) = \text{Gamma}\big( c_j + \tfrac{n}{2},\; d_j + \tfrac{1}{2} \sum_{i=1}^{n} (y_{ij} - z_{ij}' \lambda^*_j)^2 \big)$
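
The full conditionals above translate directly into a Gibbs scan over the working model. The following is a rough sketch under stated assumptions (numpy available; the N(0, 1) working priors of (5), so λ_0j = 0 and Σ_0λj = I; a, b, c, d are the Gamma hyperparameters; all names are illustrative), not the authors' implementation.

```python
import numpy as np

def gibbs_scan(y, Lam, eta, psi, sig2, a, b, c, d, prior_var=1.0, rng=None):
    """One scan over the working-model full conditionals. Arrays are updated in place:
    y (n, p); Lam (p, k) working loadings; eta (n, k) working factors;
    psi (k,) working factor variances; sig2 (p,) residual variances."""
    rng = rng or np.random.default_rng()
    n, p = y.shape
    k = Lam.shape[1]
    # 1. Update each row of the working loadings (lower-triangular part only).
    for j in range(p):
        kj = min(j + 1, k)
        Z = eta[:, :kj]                                   # z_ij = (eta*_i1, ..., eta*_ikj)
        V = np.linalg.inv(np.eye(kj) / prior_var + Z.T @ Z / sig2[j])
        m = V @ (Z.T @ y[:, j] / sig2[j])                 # prior mean assumed zero
        Lam[j, :kj] = rng.multivariate_normal(m, V)
        Lam[j, kj:] = 0.0                                 # elements above the diagonal fixed at 0
    # 2. Update the working latent factors eta*_i.
    prec = np.diag(1.0 / psi) + Lam.T @ (Lam / sig2[:, None])
    V = np.linalg.inv(prec)
    M = (y / sig2) @ Lam @ V                              # row i = conditional mean of eta*_i
    eta[:] = M + rng.standard_normal((n, k)) @ np.linalg.cholesky(V).T
    # 3. Update the working factor precisions psi_l^{-1} and residual precisions sigma_j^{-2}.
    psi[:] = 1.0 / rng.gamma(a + n / 2.0, 1.0 / (b + 0.5 * (eta ** 2).sum(axis=0)))
    resid = y - eta @ Lam.T
    sig2[:] = 1.0 / rng.gamma(c + n / 2.0, 1.0 / (d + 0.5 * (resid ** 2).sum(axis=0)))
    return Lam, eta, psi, sig2
```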

One Factor Model. Here the authors consider a simple example with p = 7 and n = 100, where λ = (0.995, 0.975, 0.949, 0.922, 0.894, 0.866, 0.837)' and diag(Σ) = (0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30). Traditional Gibbs sampler priors: N+(0, 1) for the diagonal element and N(0, 1) for the lower triangular elements of λ. The PX Gibbs sampler priors for λ are induced half-Cauchy and Cauchy priors; these are induced by choosing N(0, 1) priors on the free elements of Λ*, η* ~ N(0, Ψ), and Gamma(1/2, 1/2) priors on the diagonal elements of Ψ.

One Factor Model (continued). The prior on the noise precision σ_j^{-2} is Gamma(1, 0.2) for both samplers. Both Gibbs samplers were run for 25,000 iterations with a 5,000-iteration burn-in. Recall that Ω = ΛΛ' + Σ. The authors compare effective sample size (ESS) and the bias of the posterior means of Ω across 100 simulations. The results show that PX Gibbs sampling yields dramatic improvements in ESS and slightly lower bias. The authors attribute the improvement in ESS to the heavy-tailed induced Cauchy prior.
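
The paper does not dictate a particular ESS estimator, so the sketch below is only one common autocorrelation-based rule (assuming numpy; illustrative, not the authors' code): ESS = N / (1 + 2 × sum of positive-lag autocorrelations), truncated at the first non-positive autocorrelation.

```python
import numpy as np

def effective_sample_size(chain):
    """Crude ESS estimate for a single scalar MCMC chain."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Empirical autocorrelation at each lag (lag 0 normalized to 1).
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for lag in range(1, n):
        if acf[lag] <= 0:          # truncate at the first non-positive autocorrelation
            break
        tau += 2.0 * acf[lag]
    return n / tau
```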

One Factor Model: figure of ESS-PX/ESS-Traditional for the elements of Ω.

One Factor Model: figure of the bias for selected elements of Ω.

Three Factor Model. The authors carried out simulations identical to those above, but with known k = 3. The ESS-PX/ESS-Traditional results are shown below.

Model Selection - ISPX. The posterior probability of the model with h factors is

$\Pr(k = h \mid y) = \dfrac{\kappa_h\, \pi(y \mid k = h)}{\sum_{l=1}^{m} \kappa_l\, \pi(y \mid k = l)}$   (11)

where π(y | k = h) is the marginal likelihood under the h-factor model, obtained by integrating $\prod_i N_p(y_i; 0, \Lambda^{(h)} \Lambda^{(h)\prime} + \Sigma)$ over the priors for Λ^{(h)} and the residual variances Σ, and κ_h = Pr(k = h), h = 1, ..., m, is the prior probability. Instead of parameterizing all m models separately, the authors parameterize the k = m factor model and obtain a smaller model with k = h by marginalizing out columns (h + 1) through m.
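
A minimal sketch of (11), assuming numpy: given (log) marginal likelihood estimates and prior model probabilities κ_h, normalize to obtain posterior model probabilities. The names and the log-space stabilization are illustrative choices, not from the paper.

```python
import numpy as np

def posterior_model_probs(log_marglik, prior_probs):
    """log_marglik[h] = log pi(y | k = h+1); prior_probs[h] = kappa_{h+1}."""
    w = np.log(np.asarray(prior_probs, dtype=float)) + np.asarray(log_marglik, dtype=float)
    w -= w.max()                 # stabilize before exponentiating
    w = np.exp(w)
    return w / w.sum()           # Pr(k = h | y), h = 1, ..., m
```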

Model Selection - ISPX (continued). The posterior probability can be expressed as

$\Pr(k = h \mid y) = \dfrac{O[h : j]\, BF[h : j]}{\sum_{l=1}^{m} O[l : j]\, BF[l : j]}$   (12)

where O[h : j] = κ_h / κ_j is the prior odds and BF[h : j] = π(y | k = h) / π(y | k = j) is the Bayes factor. The Bayes factors can be estimated for h = 2, ..., m as

$\widehat{BF}[(h-1) : h] = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{p(y \mid \theta_i^{(h)}, k = h - 1)}{p(y \mid \theta_i^{(h)}, k = h)}$   (13)

where θ_i^{(h)} = (Λ_i^{(h)}, Σ_i), i = 1, ..., n, are samples from running the model with k = h.
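
A minimal sketch of the estimator in (13), assuming numpy and scipy; the data layout, function names, and the convention that the (h-1)-factor model is obtained by dropping the last column of Λ^{(h)} are assumptions for illustration, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_lik(y, Lam, Sigma_diag):
    """Marginal Gaussian log likelihood: y_i ~ N_p(0, Lam Lam' + diag(Sigma_diag))."""
    Omega = Lam @ Lam.T + np.diag(Sigma_diag)
    return multivariate_normal(mean=np.zeros(y.shape[1]), cov=Omega).logpdf(y).sum()

def estimate_log_bf(y, Lam_samples, Sigma_samples, h):
    """Estimate log BF[(h-1):h] from MCMC samples of the k = h model (Lam_samples[i] is p x h)."""
    log_ratios = []
    for Lam, Sig in zip(Lam_samples, Sigma_samples):
        num = log_lik(y, Lam[:, : h - 1], Sig)     # (h-1)-factor model: drop column h
        den = log_lik(y, Lam, Sig)                 # h-factor model
        log_ratios.append(num - den)
    log_ratios = np.array(log_ratios)
    # log of the average likelihood ratio, computed stably with log-sum-exp
    return np.logaddexp.reduce(log_ratios) - np.log(len(log_ratios))
```

Per the next slide, log BF[1 : m] is then the sum of the estimated log BF[(h-1) : h] over h = 2, ..., m.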

Model Selection - ISPX (continued). The Bayes factor comparing any two models can then be obtained by chaining, e.g. BF[1 : m] = BF[1 : 2] × BF[2 : 3] × ... × BF[(m-1) : m]. Setting κ_h = 1/m, the posterior probabilities Pr(k = h | y) can be estimated. The authors call this method Importance Sampling with Parameter Expansion (ISPX).

Model Selection - PSPX. The authors also adopt the path sampling based approach first used in Lee and Song (2002) for estimating log Bayes factors. This approach constructs a path indexed by a scalar t ∈ [0, 1] linking two models M_0 and M_1, a technique first suggested by Gelman and Meng (1998). Numerical integration is used to approximate the integral over t. This approach is highly accurate but computationally expensive, and is referred to as Path Sampling with Parameter Expansion (PSPX).
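
In generic terms (this is the Gelman and Meng identity, not the authors' specific implementation), the log Bayes factor equals the integral over t of the posterior expectation U(t) of the derivative of the log unnormalized density along the path; given MCMC estimates of U(t) on a grid, the trapezoidal rule gives the numerical approximation. A minimal sketch assuming numpy:

```python
import numpy as np

def log_bf_path_sampling(t_grid, u_hat):
    """t_grid: increasing grid of t values in [0, 1];
    u_hat: MCMC estimates of U(t) at each grid point."""
    t = np.asarray(t_grid, dtype=float)
    u = np.asarray(u_hat, dtype=float)
    return float(np.sum(0.5 * (u[1:] + u[:-1]) * np.diff(t)))   # trapezoidal rule
```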

Categorical Data. Suppose that y_i = (y_i1', y_i2')' where y_i1 is a p_1 × 1 vector of continuous variables and y_i2 is a p_2 × 1 vector of ordered categorical variables. The inferential (non-expanded) model can be generalized as

$y_{ij} = h_j(y_{ij}^*; \tau_j), \quad j = 1, \dots, p,$
$y_i^* = \alpha + \Lambda \eta_i + \epsilon_i, \qquad \eta_i \sim N(0, I_k), \qquad \epsilon_i \sim N(0, \Sigma),$

where α is a vector of Gaussian intercepts, h_j(·) is the identity link for j = 1, ..., p_1, and for j = p_1 + 1, ..., p it is a probit-type link

$h_j(z; \tau_j) = \sum_{c=1}^{L_j} c\, 1(\tau_{j,c-1} < z \le \tau_{j,c}).$
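
A minimal sketch of the probit-type link h_j above (assumed helper, assuming numpy, with τ_{j,0} = -∞ and τ_{j,L_j} = +∞), mapping a latent continuous value to its ordered category:

```python
import numpy as np

def ordinal_link(z, tau):
    """Return category c such that tau[c-1] < z <= tau[c], i.e. h_j(z; tau_j)."""
    return int(np.searchsorted(tau[1:-1], z, side="left") + 1)

# Example: three categories with (hypothetical) cut points at 0 and 1.5
tau = np.array([-np.inf, 0.0, 1.5, np.inf])
print([ordinal_link(z, tau) for z in (-0.7, 0.4, 2.2)])   # -> [1, 2, 3]
```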

Fertility Study. The underlying latent factor of interest in this study is the fertility score of the subjects. Concentration is defined as sperm count divided by semen volume. Each y_i has three variables based on three different techniques for counting sperm. In addition to the outcomes, each y_i is associated with a vector of covariates x_i = (x_i1, x_i2, x_i3)'. The inferential model includes the covariates at the latent variable level:

$y_{ij} = \alpha_j + \lambda_j \eta_i + \epsilon_{ij}, \qquad \eta_i = \beta' x_i + \delta_i, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i \sim N(0, 1).$

Fertility Study (continued). The covariate vector x_i is three dimensional, encoding the location of the subject and whether the time since last ejaculation was less than two hours. The PX model is

$y_{ij} = \alpha_j^* + \lambda_j^* \eta_i^* + \epsilon_{ij}, \qquad \eta_i^* = \mu^* + \beta^{*\prime} x_i + \delta_i^*, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i^* \sim N(0, \psi),$

and the transformations relating the working model parameters to the inferential model are

$\alpha_j = \alpha_j^* + \lambda_j^* \mu^*, \qquad \lambda_j = S(\lambda_j^*)\, \lambda_j^*\, \psi^{1/2}, \qquad \beta = \psi^{-1/2} \beta^*, \qquad \eta_i = \psi^{-1/2} (\eta_i^* - \mu^*), \qquad \delta_i = \psi^{-1/2} \delta_i^*.$

Fertility Study: prior specifications. The standard Gibbs sampler is applied to the inferential model

$y_{ij} = \alpha_j + \lambda_j \eta_i + \epsilon_{ij}, \qquad \eta_i = \beta' x_i + \delta_i, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i \sim N(0, 1),$

with priors α_j ~ N(0, 1), λ_j ~ N+(0, 1), τ_j ~ Gamma(1, 0.2) for j = 1, 2, 3, and β ~ N(0, 10 I_3). The PX Gibbs sampler is applied to the working model

$y_{ij} = \alpha_j^* + \lambda_j^* \eta_i^* + \epsilon_{ij}, \qquad \eta_i^* = \mu^* + \beta^{*\prime} x_i + \delta_i^*, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i^* \sim N(0, \psi),$

with priors α_j^* ~ N(0, 1), λ_j^* ~ N(0, 1), τ_j ~ Gamma(1, 0.2) for j = 1, 2, 3, µ^* ~ N(0, 1), β^* ~ N(0, 10 I_3), and ψ ~ Gamma(1/2, 1/2).

Trace plots of the intercept terms α_j.

Trace plots of the factor loadings λ_j.

Additional Results. The ESS-PX/ESS-Traditional ratios for the upper triangular elements of Ω = ΛΛ' + Σ are 209.3953, 213.7289, 208.9247, 206.767, 207.9695, and 156.3594, meaning that the traditional Gibbs sampler would have to run approximately 200 times longer than the PX Gibbs sampler to achieve the same mixing performance. The absolute biases of these parameters under traditional Gibbs sampling are 0.0184, 0.0219, 0.0202, 0.0183, 0.0206, 0.0182, while under the PX Gibbs sampler they are 0.0035, 0.0000, 0.0017, 0.0035, 0.0012, 0.0036.

Toxicology Study. The purpose of this study is to examine the effect of anthraquinone in female Fischer rats. Sixty animals were dosed with 0, 1875, 3750, 7500, 15000, or 30000 ppm of anthraquinone, and body weight along with organ weights were recorded. The small sample size makes estimation of the covariance matrix a significant challenge. The authors therefore apply both ISPX and PSPX to these data, setting the maximum number of factors to m = 3, with 100,000 iterations for ISPX and 25,000 iterations per grid point for PSPX. The posterior probabilities for the one, two, and three factor models were 0.4417, 0.3464, and 0.2120 under ISPX, and 0.9209, 0.0714, and 0.0077 under PSPX, respectively.

Toxicology Study (continued). Notice that both ISPX and PSPX favor the one factor model. This is supported by examining the magnitudes of the factor loadings: recall that, due to the prior dependence, the magnitude of the loadings tends to scale with the corresponding ψ_l. The two factor model is very close to the one factor model, and the authors state that the increase in the number of model parameters does not justify the small increase in likelihood, leading them to believe that the PSPX posterior probabilities are most likely correct.

Discussion. Factor models offer a flexible dimensionality reduction method for analyzing multivariate data, but the traditional specification of normal and inverse-gamma priors for computational convenience leads to slow mixing in MCMC. The authors proposed a default heavy-tailed prior for factor analysis models using parameter expansion, which yields substantially better mixing. Extensions of the model to categorical data were demonstrated to be straightforward. Two different methods for computing posterior model probabilities were proposed and shown to work well in practice.