Default Priors and Efficient Posterior Computation in Bayesian Factor Analysis. January 16, 2010

Presented by Eric Wang, Duke University

Outline: Background and Motivation; A Brief Review of Parameter Expansion Literature; Treating the Number of Factors as Fixed; Treating the Number of Factors as Unknown.

Background and Motivation. Factor models provide flexible and sparse representations of multivariate data and take the form

$y_i = \Lambda \eta_i + \epsilon_i, \qquad \epsilon_i \sim N_p(0, \Sigma)$   (1)

Markov chain Monte Carlo (MCMC) algorithms are commonly used for parameter inference. Often, conditionally conjugate priors are chosen for the model parameters to facilitate straightforward posterior computation by a Gibbs sampler. However, traditionally used priors lead to several challenges that limit the performance of Bayesian factor models.

Challenges posed by the standard Bayesian FA construction. Generally, knowledge of reasonable or plausible hyperparameter values is limited. A common hierarchical structure uses normal priors for the factor loadings and inverse-gamma priors for the residual variances. Known issues: improper posteriors arise in the limiting case as the variance of the normal or inverse-gamma prior grows large; proper but diffuse priors do not solve the problem; and slow mixing is observed even when informative priors are used.

Parameter expansion as a solution. This paper proposes a novel application of parameter expansion (PX) that yields default priors and leads to substantially improved mixing and reliable posterior computation. PX is attractive because it allows new families of priors to be introduced, in this case t and folded-t priors. The authors also propose an efficient PX Gibbs sampling scheme: draw samples from conventional conditionally conjugate full conditionals in an expanded working model, then use a post-processing step to transform the draws back to the inferential model in (1). The authors further propose a way to allow uncertainty in the number of factors.

Liu et al., 1998. Parameter Expansion to Accelerate EM: The PX-EM Algorithm (Liu et al., 1998) proposed PX as a way to accelerate EM inference in a simple model. The authors introduced an auxiliary variable to reduce coupling between variables in the original model. Using a simple hierarchical model, PX-EM performs a covariance adjustment at every M-step to correct for the imputed value of the auxiliary variable and its fixed expectation under the non-expanded model. The authors also showed that when the model parameters satisfy θ = θ_MLE, the deviation between the expanded and non-expanded models disappears.

Qi and Jaakkola, 2006. Parameter Expanded Variational Bayesian Methods (Qi and Jaakkola, 2006) proposed PX-VB for probit regression. The authors used the same concept as in PX-EM, suitably adapted to VB for probit regression. Significant efficiency gains were seen when the method was applied to the RVM. The authors report that, empirically, PX-VB solutions are similar to VB solutions, and include plots showing roughly a 15-fold reduction in the number of iterations.

Gelman, 2004. Parameterization and Bayesian Modeling (Gelman, 2004) showed that PX can induce new families of priors by applying a redundant multiplicative reparameterization of the original model. The reparameterization can induce an implicit folded noncentral t distribution. Appealing special cases of the folded noncentral t include the half-t, the uniform, and the proper half-Cauchy distributions. Although the folded noncentral t distribution is itself not conditionally conjugate in the Bayesian hierarchical setting, the individual components used to induce it are, so straightforward Gibbs sampling can be performed.
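
To make the mechanism concrete, the standard construction (recalled here from Gelman (2006) as background, not quoted from the present paper) multiplies a conditionally conjugate normal by the square root of a conditionally conjugate variance:

\[
\theta_j = \xi\,\eta_j, \qquad \xi \sim N(0, B^2), \qquad \eta_j \mid \sigma_\eta \sim N(0, \sigma_\eta^2), \qquad \sigma_\eta^2 \sim \text{Inv-Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu s^2}{2}\right)
\]
\[
\Rightarrow\quad \sigma_\theta = |\xi|\,\sigma_\eta \ \sim\ \text{half-}t_\nu \text{ with scale } Bs,
\]

and allowing a nonzero prior mean for the multiplier ξ gives the folded noncentral t.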

Recall the original model specification

$y_i = \Lambda \eta_i + \epsilon_i, \qquad \epsilon_i \sim N_p(0, \Sigma)$   (2)

where Λ is a p × k matrix of factor loadings, η_i = (η_i1, ..., η_ik)' ~ N(0, I_k) is a vector of latent factors, and ε_i is the residual with diagonal covariance matrix Σ = diag(σ_1^2, ..., σ_p^2). To ensure identifiability, assume Λ has a full-rank lower triangular structure. The diagonal elements of Λ have truncated normal priors, the lower triangular elements are given normal priors, and σ_1^2, ..., σ_p^2 have inverse-gamma priors. Note that if we marginalize out η_i, then y_i ~ N(0, Ω) where Ω = ΛΛ' + Σ.
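
To make the marginal form concrete, here is a minimal sketch (assuming numpy; all numerical values and names are illustrative, not from the paper) that simulates data from (2) and checks that the empirical covariance of y approaches Ω = ΛΛ' + Σ.

```python
# Minimal sketch: simulate from the inferential factor model and verify
# that the marginal covariance of y is Omega = Lambda Lambda' + Sigma.
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 7, 2, 50_000                           # dimensions (hypothetical)
Lambda = np.tril(rng.normal(size=(p, k)))        # lower-triangular loadings
d = np.arange(k)
Lambda[d, d] = np.abs(Lambda[d, d])              # positive diagonal (identifiability)
Sigma = np.diag(rng.uniform(0.05, 0.3, size=p))  # diagonal residual covariance

eta = rng.standard_normal((n, k))                               # eta_i ~ N(0, I_k)
eps = rng.standard_normal((n, p)) * np.sqrt(np.diag(Sigma))     # eps_i ~ N_p(0, Sigma)
y = eta @ Lambda.T + eps                                        # stacked y_i, shape (n, p)

Omega = Lambda @ Lambda.T + Sigma                # implied marginal covariance of y_i
print(np.max(np.abs(np.cov(y, rowvar=False) - Omega)))  # small for large n
```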

Inducing Priors through Parameter Expansion. The authors note that the priors on the previous slide yield computationally convenient posteriors via Gibbs sampling but are subject to issues such as slow mixing and unintentionally informative priors. In this paper, the authors use PX to induce a heavier-tailed, proper prior on the factor loadings, following Gelman (2004, 2006). The authors first introduce a redundantly overparameterized working model, which is then related to the inferential model above through a transformation.

PX-Factor Model. Define the following PX factor model:

$y_i = \Lambda^* \eta_i^* + \epsilon_i, \qquad \eta_i^* \sim N_k(0, \Psi), \qquad \epsilon_i \sim N_p(0, \Sigma)$   (3)

where Λ* is an unconstrained p × k lower triangular working factor loading matrix, η_i^* is a vector of working latent factors, Ψ = diag(ψ_1, ..., ψ_k), and Σ is as defined previously. Note the redundant overparameterization after marginalizing out η_i^*: y_i ~ N(0, Λ*ΨΛ*' + Σ).

To relate the working model parameters to the inferential model parameters, the authors employ the transformation

$\lambda_{jl} = S(\lambda^*_{ll})\,\lambda^*_{jl}\,\psi_l^{1/2}, \quad j = 1, \dots, p, \; l = 1, \dots, k, \qquad \eta_i = \Psi^{-1/2}\,\eta_i^*$   (4)

where S(x) denotes the sign of x. The following priors are then employed for the working model parameters:

$\lambda^*_{jl} \sim N(0, 1), \quad j = 1, \dots, p, \; l = 1, \dots, \min(j, k),$
$\lambda^*_{jl} \sim \delta_0, \quad j = 1, \dots, p, \; l = j + 1, \dots, k,$
$\psi_l \sim \text{Gamma}(a_l, b_l), \quad l = 1, \dots, k$   (5)

where δ_0 is a measure concentrated at 0.

Full PX-FA Model. Putting the previous two slides together, the full model is

$y_i = \Lambda^* \eta_i^* + \epsilon_i, \qquad \eta_i^* \sim N_k(0, \Psi), \qquad \epsilon_i \sim N_p(0, \Sigma),$
$\lambda^*_{jl} \sim N(0, 1), \quad j = 1, \dots, p, \; l = 1, \dots, \min(j, k),$
$\lambda^*_{jl} \sim \delta_0, \quad j = 1, \dots, p, \; l = j + 1, \dots, k,$
$\psi_l \sim \text{Gamma}(a_l, b_l), \quad l = 1, \dots, k$   (6)

with the transformation

$\lambda_{jl} = S(\lambda^*_{ll})\,\lambda^*_{jl}\,\psi_l^{1/2}, \quad j = 1, \dots, p, \; l = 1, \dots, k, \qquad \eta_i = \Psi^{-1/2}\,\eta_i^*$   (7)

where S(x) denotes the sign of x.
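
A minimal sketch of the post-processing step in (7), assuming numpy; the function name and array layout are illustrative, not the authors' code.

```python
# Minimal sketch: map one posterior draw of the working parameters
# (Lambda_star, psi, eta_star) back to the inferential parameterization.
import numpy as np

def px_to_inferential(Lambda_star, psi, eta_star):
    """Lambda_star: (p, k) working loadings; psi: (k,) working factor variances;
    eta_star: (n, k) working latent factors. Returns (Lambda, eta)."""
    signs = np.sign(np.diag(Lambda_star))        # S(lambda*_ll), one sign per column
    Lambda = Lambda_star * signs * np.sqrt(psi)  # lambda_jl = S(lambda*_ll) lambda*_jl psi_l^{1/2}
    eta = eta_star / np.sqrt(psi)                # eta_i = Psi^{-1/2} eta*_i
    # Applying `signs` to eta as well would keep Lambda @ eta_i identical to
    # Lambda_star @ eta_star_i when a diagonal working loading is negative.
    return Lambda, eta
```

In the sampler, this transformation is applied to each retained draw after burn-in, and the working-model draws are discarded, as described on the next slide.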

Full PX-FA Model (continued). The preceding model can be thought of as a generalization of Gelman (2006). Upon marginalizing out Ψ, we obtain t priors for the off-diagonal elements of Λ and half-t priors for the diagonal elements. Note from the transformation λ_jl = S(λ*_ll) λ*_jl ψ_l^{1/2} that the columns of Λ depend on Ψ: columns with large ψ_l will tend to have larger factor loadings, while columns with small ψ_l will tend to have smaller loadings. Inference in the PX-FA model proceeds by running a standard Gibbs sampler on the working model and transforming each iteration back to the inferential model, discarding the working model samples.

Inference. The PX-FA model can be written row-wise as

$y_{ij} = z_{ij}' \lambda^*_j + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma_j^2)$   (8)

where z_ij = (η*_i1, ..., η*_ik_j)', λ*_j = (λ*_j1, ..., λ*_jk_j)', and k_j = min(j, k). The full conditional posterior of the free loadings in row j is

$p(\lambda^*_j \mid \eta^*, \Psi, \Sigma, y) = N_{k_j}\big( (\Sigma_{0\lambda_j}^{-1} + \sigma_j^{-2} Z_j' Z_j)^{-1} (\Sigma_{0\lambda_j}^{-1} \lambda_{0j} + \sigma_j^{-2} Z_j' Y_j),\; (\Sigma_{0\lambda_j}^{-1} + \sigma_j^{-2} Z_j' Z_j)^{-1} \big),$

where λ_{0j} and Σ_{0λ_j} are the prior mean and covariance of λ*_j, Z_j = (z_1j, ..., z_nj)', and Y_j = (y_1j, ..., y_nj)'. (Continued on next slide.)

Inference (continued). The remaining full conditionals are

$p(\eta_i^* \mid \Lambda^*, \Psi, \Sigma, y) = N_k\big( (\Psi^{-1} + \Lambda^{*\prime} \Sigma^{-1} \Lambda^*)^{-1} \Lambda^{*\prime} \Sigma^{-1} y_i,\; (\Psi^{-1} + \Lambda^{*\prime} \Sigma^{-1} \Lambda^*)^{-1} \big)$   (9)

$p(\psi_l^{-1} \mid \eta^*, \Lambda^*, \Sigma, y) = \text{Gamma}\big( a_l + \tfrac{n}{2},\; b_l + \tfrac{1}{2} \sum_{i=1}^{n} \eta_{il}^{*2} \big)$   (10)

$p(\sigma_j^{-2} \mid \eta^*, \Lambda^*, \Psi, y) = \text{Gamma}\big( c_j + \tfrac{n}{2},\; d_j + \tfrac{1}{2} \sum_{i=1}^{n} (y_{ij} - z_{ij}' \lambda^*_j)^2 \big)$
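
The full conditionals above translate directly into a Gibbs scan over the working model. The following is a rough sketch under stated assumptions (numpy available; the N(0, 1) working priors of (5), so λ_0j = 0 and Σ_0λj = I; a, b, c, d are the Gamma hyperparameters; all names are illustrative), not the authors' implementation.

```python
import numpy as np

def gibbs_scan(y, Lam, eta, psi, sig2, a, b, c, d, prior_var=1.0, rng=None):
    """One scan over the working-model full conditionals. Arrays are updated in place:
    y (n, p); Lam (p, k) working loadings; eta (n, k) working factors;
    psi (k,) working factor variances; sig2 (p,) residual variances."""
    rng = rng or np.random.default_rng()
    n, p = y.shape
    k = Lam.shape[1]
    # 1. Update each row of the working loadings (lower-triangular part only).
    for j in range(p):
        kj = min(j + 1, k)
        Z = eta[:, :kj]                                   # z_ij = (eta*_i1, ..., eta*_ikj)
        V = np.linalg.inv(np.eye(kj) / prior_var + Z.T @ Z / sig2[j])
        m = V @ (Z.T @ y[:, j] / sig2[j])                 # prior mean assumed zero
        Lam[j, :kj] = rng.multivariate_normal(m, V)
        Lam[j, kj:] = 0.0                                 # elements above the diagonal fixed at 0
    # 2. Update the working latent factors eta*_i.
    prec = np.diag(1.0 / psi) + Lam.T @ (Lam / sig2[:, None])
    V = np.linalg.inv(prec)
    M = (y / sig2) @ Lam @ V                              # row i = conditional mean of eta*_i
    eta[:] = M + rng.standard_normal((n, k)) @ np.linalg.cholesky(V).T
    # 3. Update the working factor precisions psi_l^{-1} and residual precisions sigma_j^{-2}.
    psi[:] = 1.0 / rng.gamma(a + n / 2.0, 1.0 / (b + 0.5 * (eta ** 2).sum(axis=0)))
    resid = y - eta @ Lam.T
    sig2[:] = 1.0 / rng.gamma(c + n / 2.0, 1.0 / (d + 0.5 * (resid ** 2).sum(axis=0)))
    return Lam, eta, psi, sig2
```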

One Factor Model. Here the authors consider a simple example with p = 7 and n = 100, where λ = (0.995, 0.975, 0.949, 0.922, 0.894, 0.866, 0.837)' and diag(Σ) = (0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30). Traditional Gibbs sampler priors: N+(0, 1) for the diagonal element and N(0, 1) for the lower triangular elements of λ. The PX Gibbs sampler priors for λ are induced half-Cauchy and Cauchy priors; these are induced by choosing N(0, 1) priors on the free elements of Λ*, η* ~ N(0, Ψ), and Gamma(1/2, 1/2) priors on the diagonal elements of Ψ.

One Factor Model (continued). The prior on the noise precision σ_j^{-2} is Gamma(1, 0.2) for both samplers. Both Gibbs samplers were run for 25,000 iterations with a 5,000-iteration burn-in. Recall that Ω = ΛΛ' + Σ. The authors compare effective sample size (ESS) and the bias of the posterior means of Ω across 100 simulations. The results show that PX Gibbs sampling yields dramatic improvements in ESS and slightly lower bias. The authors attribute the improvement in ESS to the heavy-tailed induced Cauchy prior.
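
The paper does not dictate a particular ESS estimator, so the sketch below is only one common autocorrelation-based rule (assuming numpy; illustrative, not the authors' code): ESS = N / (1 + 2 × sum of positive-lag autocorrelations), truncated at the first non-positive autocorrelation.

```python
import numpy as np

def effective_sample_size(chain):
    """Crude ESS estimate for a single scalar MCMC chain."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Empirical autocorrelation at each lag (lag 0 normalized to 1).
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for lag in range(1, n):
        if acf[lag] <= 0:          # truncate at the first non-positive autocorrelation
            break
        tau += 2.0 * acf[lag]
    return n / tau
```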

One Factor Model: figure of ESS-PX/ESS-Traditional for the elements of Ω.

One Factor Model: figure of the bias for selected elements of Ω.

Three Factor Model. The authors carried out simulations identical to those above, but with known k = 3. The ESS-PX/ESS-Traditional results are shown below.

Model Selection - ISPX. The posterior probability of the model with h factors is

$\Pr(k = h \mid y) = \dfrac{\kappa_h\, \pi(y \mid k = h)}{\sum_{l=1}^{m} \kappa_l\, \pi(y \mid k = l)}$   (11)

where π(y | k = h) is the marginal likelihood under the h-factor model, obtained by integrating $\prod_i N_p(y_i; 0, \Lambda^{(h)} \Lambda^{(h)\prime} + \Sigma)$ over the priors for Λ^{(h)} and the residual variances Σ, and κ_h = Pr(k = h), h = 1, ..., m, is the prior probability. Instead of parameterizing all m models separately, the authors parameterize the k = m factor model and obtain a smaller model with k = h by marginalizing out columns (h + 1) through m.
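
A minimal sketch of (11), assuming numpy: given (log) marginal likelihood estimates and prior model probabilities κ_h, normalize to obtain posterior model probabilities. The names and the log-space stabilization are illustrative choices, not from the paper.

```python
import numpy as np

def posterior_model_probs(log_marglik, prior_probs):
    """log_marglik[h] = log pi(y | k = h+1); prior_probs[h] = kappa_{h+1}."""
    w = np.log(np.asarray(prior_probs, dtype=float)) + np.asarray(log_marglik, dtype=float)
    w -= w.max()                 # stabilize before exponentiating
    w = np.exp(w)
    return w / w.sum()           # Pr(k = h | y), h = 1, ..., m
```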

Model Selection - ISPX (continued). The posterior probability can be expressed as

$\Pr(k = h \mid y) = \dfrac{O[h : j]\, BF[h : j]}{\sum_{l=1}^{m} O[l : j]\, BF[l : j]}$   (12)

where O[h : j] = κ_h / κ_j is the prior odds and BF[h : j] = π(y | k = h) / π(y | k = j) is the Bayes factor. The Bayes factors can be estimated for h = 2, ..., m as

$\widehat{BF}[(h-1) : h] = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{p(y \mid \theta_i^{(h)}, k = h - 1)}{p(y \mid \theta_i^{(h)}, k = h)}$   (13)

where θ_i^{(h)} = (Λ_i^{(h)}, Σ_i), i = 1, ..., n, are samples from running the model with k = h.
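
A minimal sketch of the estimator in (13), assuming numpy and scipy; the data layout, function names, and the convention that the (h-1)-factor model is obtained by dropping the last column of Λ^{(h)} are assumptions for illustration, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_lik(y, Lam, Sigma_diag):
    """Marginal Gaussian log likelihood: y_i ~ N_p(0, Lam Lam' + diag(Sigma_diag))."""
    Omega = Lam @ Lam.T + np.diag(Sigma_diag)
    return multivariate_normal(mean=np.zeros(y.shape[1]), cov=Omega).logpdf(y).sum()

def estimate_log_bf(y, Lam_samples, Sigma_samples, h):
    """Estimate log BF[(h-1):h] from MCMC samples of the k = h model (Lam_samples[i] is p x h)."""
    log_ratios = []
    for Lam, Sig in zip(Lam_samples, Sigma_samples):
        num = log_lik(y, Lam[:, : h - 1], Sig)     # (h-1)-factor model: drop column h
        den = log_lik(y, Lam, Sig)                 # h-factor model
        log_ratios.append(num - den)
    log_ratios = np.array(log_ratios)
    # log of the average likelihood ratio, computed stably with log-sum-exp
    return np.logaddexp.reduce(log_ratios) - np.log(len(log_ratios))
```

Per the next slide, log BF[1 : m] is then the sum of the estimated log BF[(h-1) : h] over h = 2, ..., m.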

Model Selection - ISPX (continued). The Bayes factor comparing any two models can then be obtained by chaining, e.g. BF[1 : m] = BF[1 : 2] × BF[2 : 3] × ... × BF[(m-1) : m]. Setting κ_h = 1/m, the posterior probabilities Pr(k = h | y) can be estimated. The authors call this method Importance Sampling with Parameter Expansion (ISPX).

Model Selection - PSPX. The authors also adopt the path sampling based approach first used in Lee and Song (2002) for estimating log Bayes factors. This approach constructs a path indexed by a scalar t ∈ [0, 1] linking two models M_0 and M_1, a technique first suggested by Gelman and Meng (1998). Numerical integration is used to approximate the integral over t. This approach is highly accurate but computationally expensive, and is referred to as Path Sampling with Parameter Expansion (PSPX).
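
In generic terms (this is the Gelman and Meng identity, not the authors' specific implementation), the log Bayes factor equals the integral over t of the posterior expectation U(t) of the derivative of the log unnormalized density along the path; given MCMC estimates of U(t) on a grid, the trapezoidal rule gives the numerical approximation. A minimal sketch assuming numpy:

```python
import numpy as np

def log_bf_path_sampling(t_grid, u_hat):
    """t_grid: increasing grid of t values in [0, 1];
    u_hat: MCMC estimates of U(t) at each grid point."""
    t = np.asarray(t_grid, dtype=float)
    u = np.asarray(u_hat, dtype=float)
    return float(np.sum(0.5 * (u[1:] + u[:-1]) * np.diff(t)))   # trapezoidal rule
```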

Categorical Data. Suppose that y_i = (y_i1', y_i2')' where y_i1 is a p_1 × 1 vector of continuous variables and y_i2 is a p_2 × 1 vector of ordered categorical variables. The inferential (non-expanded) model can be generalized as

$y_{ij} = h_j(y_{ij}^*; \tau_j), \quad j = 1, \dots, p,$
$y_i^* = \alpha + \Lambda \eta_i + \epsilon_i, \qquad \eta_i \sim N(0, I_k), \qquad \epsilon_i \sim N(0, \Sigma),$

where α is a vector of Gaussian intercepts, h_j(·) is the identity link for j = 1, ..., p_1, and for j = p_1 + 1, ..., p it is a probit-type link

$h_j(z; \tau_j) = \sum_{c=1}^{L_j} c\, 1(\tau_{j,c-1} < z \le \tau_{j,c}).$
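
A minimal sketch of the probit-type link h_j above (assumed helper, assuming numpy, with τ_{j,0} = -∞ and τ_{j,L_j} = +∞), mapping a latent continuous value to its ordered category:

```python
import numpy as np

def ordinal_link(z, tau):
    """Return category c such that tau[c-1] < z <= tau[c], i.e. h_j(z; tau_j)."""
    return int(np.searchsorted(tau[1:-1], z, side="left") + 1)

# Example: three categories with (hypothetical) cut points at 0 and 1.5
tau = np.array([-np.inf, 0.0, 1.5, np.inf])
print([ordinal_link(z, tau) for z in (-0.7, 0.4, 2.2)])   # -> [1, 2, 3]
```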

Fertility Study. The underlying latent factor of interest in this study is the fertility score of the subjects. Concentration is defined as sperm count divided by semen volume. Each y_i has three variables based on three different techniques for counting sperm. In addition to the outcomes, each y_i is associated with a vector of covariates x_i = (x_i1, x_i2, x_i3)'. The inferential model includes the covariates at the latent variable level:

$y_{ij} = \alpha_j + \lambda_j \eta_i + \epsilon_{ij}, \qquad \eta_i = \beta' x_i + \delta_i, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i \sim N(0, 1).$

Fertility Study (continued). The covariate vector x_i is three dimensional, encoding the location of the subject and whether the time since last ejaculation was less than two hours. The PX model is

$y_{ij} = \alpha_j^* + \lambda_j^* \eta_i^* + \epsilon_{ij}, \qquad \eta_i^* = \mu^* + \beta^{*\prime} x_i + \delta_i^*, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i^* \sim N(0, \psi),$

and the transformations relating the working model parameters to the inferential model are

$\alpha_j = \alpha_j^* + \lambda_j^* \mu^*, \qquad \lambda_j = S(\lambda_j^*)\, \lambda_j^*\, \psi^{1/2}, \qquad \beta = \psi^{-1/2} \beta^*, \qquad \eta_i = \psi^{-1/2} (\eta_i^* - \mu^*), \qquad \delta_i = \psi^{-1/2} \delta_i^*.$

Fertility Study: prior specifications. The standard Gibbs sampler is applied to the inferential model

$y_{ij} = \alpha_j + \lambda_j \eta_i + \epsilon_{ij}, \qquad \eta_i = \beta' x_i + \delta_i, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i \sim N(0, 1),$

with priors α_j ~ N(0, 1), λ_j ~ N+(0, 1), τ_j ~ Gamma(1, 0.2) for j = 1, 2, 3, and β ~ N(0, 10 I_3). The PX Gibbs sampler is applied to the working model

$y_{ij} = \alpha_j^* + \lambda_j^* \eta_i^* + \epsilon_{ij}, \qquad \eta_i^* = \mu^* + \beta^{*\prime} x_i + \delta_i^*, \qquad \epsilon_{ij} \sim N(0, \tau_j^{-1}), \qquad \delta_i^* \sim N(0, \psi),$

with priors α_j^* ~ N(0, 1), λ_j^* ~ N(0, 1), τ_j ~ Gamma(1, 0.2) for j = 1, 2, 3, µ^* ~ N(0, 1), β^* ~ N(0, 10 I_3), and ψ ~ Gamma(1/2, 1/2).

Trace plots of the intercept terms α_j.

Trace plots of the factor loadings λ_j.

Additional Results. The ESS-PX/ESS-Traditional ratios for the upper triangular elements of Ω = ΛΛ' + Σ are 209.3953, 213.7289, 208.9247, 206.767, 207.9695, and 156.3594, meaning that the traditional Gibbs sampler would have to run approximately 200 times longer than the PX Gibbs sampler to achieve the same mixing performance. The absolute biases of these parameters under traditional Gibbs sampling are 0.0184, 0.0219, 0.0202, 0.0183, 0.0206, 0.0182, while under the PX Gibbs sampler they are 0.0035, 0.0000, 0.0017, 0.0035, 0.0012, 0.0036.

Toxicology Study. The purpose of this study is to examine the effect of anthraquinone in female Fischer rats. Sixty animals were dosed with 0, 1875, 3750, 7500, 15000, or 30000 ppm of anthraquinone, and body weight along with organ weights were recorded. The small sample size makes estimation of the covariance matrix a significant challenge. The authors therefore apply both ISPX and PSPX to these data, setting the maximum number of factors to m = 3, with 100,000 iterations for ISPX and 25,000 iterations per grid point for PSPX. The posterior probabilities for the one, two, and three factor models were 0.4417, 0.3464, and 0.2120 under ISPX, and 0.9209, 0.0714, and 0.0077 under PSPX, respectively.

Toxicology Study (continued). Notice that both ISPX and PSPX favor the one factor model. This is supported by examining the magnitudes of the factor loadings: recall that, due to the prior dependence, the magnitude of the loadings tends to scale with the corresponding ψ_l. The two factor model is very close to the one factor model, and the authors state that the increase in the number of model parameters does not justify the small increase in likelihood, leading them to believe that the PSPX posterior probabilities are most likely correct.

Discussion. Factor models offer a flexible dimensionality reduction method for analyzing multivariate data, but the traditional specification of normal and inverse-gamma priors for computational convenience leads to slow mixing in MCMC. The authors proposed a default heavy-tailed prior for factor analysis models using parameter expansion, which yields substantially better mixing. Extensions of the model to categorical data were demonstrated to be straightforward. Two different methods for computing posterior model probabilities were proposed and shown to work well in practice.