Bayesian Inference in the Multivariate Probit Model


Bayesian Inference in the Multivariate Probit Model: Estimation of the Correlation Matrix

by Aline Tabet

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Statistics)

The University of British Columbia
August 2007

© Aline Tabet, 2007

Abstract

Correlated binary data arise in many applications. Any analysis of this type of data should take into account the correlation structure among the variables. The multivariate Probit model (MVP), introduced by Ashford and Snowden (1970), is a popular class of models particularly suitable for the analysis of correlated binary data. In this class of models, the response is multivariate, correlated and discrete. Generally speaking, the MVP model assumes that, given a set of explanatory variables, the multivariate response is an indicator of the event that some unobserved latent variable falls within a certain interval. The latent variable is assumed to arise from a multivariate normal distribution. Difficulties with the multivariate Probit are mainly computational, as the likelihood of the observed discrete data is obtained by integrating over a multidimensional constrained space of latent variables. In this work, we adopt a Bayesian approach and develop an efficient Markov chain Monte Carlo algorithm for estimation in MVP models under the full correlation and the structured correlation assumptions. In addition to simulation results, we present an application of our method to the Six Cities data set. Our algorithm has many advantages over previous approaches: namely, it handles identifiability and uses a marginally uniform prior on the correlation matrix directly.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

Part I: Thesis
1 Introduction
  1.1 Motivation
  1.2 Outline
2 The Multivariate Probit Model
  2.1 Model Specification and Notation
  2.2 Difficulty with Multivariate Probit Regression: Identifiability
  2.3 Bayesian Inference in Multivariate Probit Models
    Prior Specification on β
    Prior Specification on the correlation matrix R
3 Correlation Estimation in the Saturated Model
  3.1 Introduction
  3.2 Parameter Expansion and Data Augmentation
    Data Augmentation
    Parameter Expansion for Data Augmentation
    Data Transformation
  3.3 Proposed Model
    Imputation Step
    Posterior Sampling Step
  3.4 Simulations
    Results for T = 3
    Results for T = 8
    Convergence Assessment
  3.5 Application: Six Cities Data
4 Correlation Estimation in the Structured Model
  4.1 Introduction
  4.2 Conditional Independence
  4.3 Gaussian Graphical Models
    Graph Theory
    The Hyper-inverse Wishart Distribution
  4.4 Marginally Uniform Prior for Structured Covariance
  4.5 PX-DA in Gaussian Graphical Models
  4.6 Simulations
    Loss Under the Saturated Model and the Structured Model
    Effect of Decreasing Sample Size
    Prediction Accuracy
  4.7 Application: Six Cities Data Revisited
5 Conclusion
  5.1 Summary
  5.2 Extensions, Applications, and Future Work
Bibliography

Part II: Appendices
A Distributions and Identities
  A.1 The Multivariate Normal (Gaussian) Distribution
  A.2 The Gamma Distribution
  A.3 The Standard Inverse Wishart Distribution
B Marginal Prior on R: proof from Barnard et al. (2000)
C Computation of the Jacobian J : Z → W
D Sampling from the Multivariate Truncated Gaussian
E Sampling from the Hyper-Inverse Wishart Distribution (Carvalho et al., 2007)
F Simulation Results

List of Tables

2.1 Summary of how identifiability has been handled in some previous work
3.1 Correlation results from simulations for T = 3
3.2 Regression coefficient results from simulations for T = 3
3.3 Correlation results from simulations for T = 8
3.4 Regression coefficient results from simulations when T = 8
3.5 Six Cities Data: posterior estimates using the marginal prior, MLE estimates using MCEM, and posterior estimates using the jointly uniform prior (Chib and Greenberg (1998))
4.1 Simulation results: entropy and quadratic loss averaged over 5 data sets generated by different correlation matrices with the same structure
4.2 Entropy and quadratic loss obtained by estimating the true correlation and partial correlation matrix with the PX-DA algorithm under the saturated and structured model assumptions
4.3 Simulation results on the unconstrained correlation coefficients corresponding to the model in 4.1, with T = 8
4.4 Simulation results on the constrained correlation coefficients corresponding to the model in 4.1, with T = 8
4.5 Simulation results on the unconstrained correlation coefficients corresponding to the model in 4.1, with a reduced sample size and T = 8
4.6 Simulation results on the constrained correlation coefficients corresponding to the model in 4.1, with a reduced sample size and T = 8
4.7 Six Cities Data: posterior estimates under the structured model assumption, MLE estimates using MCEM, and posterior estimates using the jointly uniform prior under a saturated model assumption (Chib and Greenberg (1998))
F.1 Simulation results: entropy and quadratic loss for 5 data sets generated by different correlation matrices with the same structure
F.2 Table F.1 continued

List of Figures

2.1 A graphical representation of the model in (2.3) under a full correlation structure; observed nodes are shaded
2.2 Marginal prior density for r_12 when T = 3 and for a larger T under the jointly uniform prior p(R) (figure reproduced from Barnard et al. (2000))
2.3 Marginal correlations obtained using the prior in (2.12) by sampling from a standard inverse Wishart with degrees of freedom ν = T + 1
3.1 Correlation estimates for ρ = 0.4, T = 3 and increasing sample size
3.2 Correlation estimates for ρ = 0.8, T = 3 and increasing sample size
3.3 β estimates for ρ = 0.4, T = 3 (smaller sample size)
3.4 β estimates for ρ = 0.4, T = 3 (larger sample size)
3.5 β estimates for ρ = 0.8, T = 3 (smaller sample size)
3.6 β estimates for ρ = 0.8, T = 3 (larger sample size)
3.7 Correlation estimates for ρ = 0.2, T = 8 and increasing sample size
3.8 Correlation estimates for ρ = 0.6, T = 8 and increasing sample size
3.9 β estimates for ρ = 0.2, T = 8 (smaller sample size)
3.10 β estimates for ρ = 0.2, T = 8 (larger sample size)
3.11 β estimates for ρ = 0.6, T = 8 (smaller sample size)
3.12 β estimates for ρ = 0.6, T = 8 (larger sample size)
3.13 T = 3: trace plots as the number of iterations increases post burn-in; the algorithm has started to converge shortly after burn-in
3.14 T = 3: autocorrelation plots of a randomly chosen parameter from the correlation matrices for the cases ρ = 0.2, 0.4, 0.6 and 0.8
3.15 Trace plots of the cumulative mean and cumulative standard deviation of randomly chosen parameters from the correlation matrices as ρ is varied over 0.2, 0.4, 0.6 and 0.8 (T = 3); the vertical line marks the burn-in value used in the simulations
3.16 Six Cities Data: trace plots and density plots of the correlation coefficients; vertical lines denote the 95% credible interval and the red line indicates the posterior mean reported by Chib and Greenberg (1998)
3.17 Six Cities Data: trace plots, density plots and autocorrelation plots of the regression coefficients; vertical lines denote the 95% credible interval and the red line indicates the posterior mean reported by Chib and Greenberg (1998)
4.1 A graphical representation of a structured MVP model for T = 3; the edge between Z_i1 and Z_i3 is missing, which is equivalent to r_13 = 0; this structure is typical of longitudinal models where each variable is strongly associated with the one before it and after it, given the other variables in the model
4.2 A graphical model with T = 7 vertices; Z_1 is a neighbor of Z_2; Z_3, Z_2 and Z_7 form a complete subgraph, or clique; this graph can be decomposed into two cliques {Z_1, Z_2, Z_3, Z_5, Z_4} and {Z_3, Z_6, Z_7}, and {Z_3} separates the two cliques
4.3 Marginal distributions of the prior on the correlation matrix corresponding to the model in 4.1
4.4 Illustration of the marginally uniform prior on the structure of the graph in Figure 4.2, which has unequal clique sizes |C_1| = 5 and |C_2| = 3
4.5 Box plots of the entropy and quadratic loss obtained by generating data from 5 correlation structures and computing the loss functions under a full correlation structure versus a structured correlation structure
4.6 Six Cities Data: correlation and partial correlation estimates
4.7 Six Cities Data: trace plots, density plots and autocorrelation plots of the regression coefficients under a structured model assumption; vertical lines denote the 95% credible interval and the red line indicates the posterior mean reported by Chib and Greenberg (1998)

Acknowledgements

I would like to thank my supervisors, Dr. Arnaud Doucet and Dr. Kevin Murphy. This work would not have been possible without their valued advice and suggestions. I also thank the staff and faculty members of the Statistics Department at UBC, in particular Dr. Paul Gustafson, Dr. Harry Joe and Dr. Matias Salibian-Barrera, for their help, advice and mentorship. I am forever grateful to my family, Salma, Ghassan, Najat, Sal and Rhea, for their continued support and encouragement. The numerous sacrifices they made over the last few years allowed me to pursue my aspirations and reach important milestones in my professional career. Finally, I want to thank my friends and fellow graduate students, both in the Statistics Department and in Computer Science, for providing theoretical advice, computer support and much help, but most importantly for making the last two years a memorable journey.

Dedication

To my mom and dad: your love and support make everything possible.

Part I
Thesis

Chapter 1
Introduction

1.1 Motivation

Correlated discrete data, whether binary, nominal or ordinal, arise in many applications. Examples range from the study of group randomized clinical trials to consumer behavior, panel data, sample surveys and longitudinal studies. Modeling dependencies between binary variables can be done using Markov random fields (e.g., Ising models). However, an attractive alternative is to use a latent variable model, where the observed binary variables are assumed independent given latent Gaussian variables, which are correlated. An example of such a model is the multivariate Probit model (MVP), introduced by Ashford and Snowden (1970). In this class of models, the response is multivariate, correlated and discrete. Generally speaking, the MVP model assumes that, given a set of explanatory variables, the multivariate response is an indicator of the event that some unobserved latent variable falls within a certain interval. The latent variable is assumed to arise from a multivariate normal distribution. The likelihood of the observed discrete data is then obtained by integrating over the multidimensional constrained space of latent variables:

P(Y_i = y_i | X_i, β, Σ) = ∫_{A_iT} ··· ∫_{A_i1} φ_T(Z_i | X_i β, R) dZ_i1 ··· dZ_iT   (1.1)

where i = 1, ..., n indexes the independent observations, j = 1, ..., T indexes the dimensions of the response, Y_i is a T-dimensional vector taking values in {0, 1}, A_ij is the interval (0, ∞) if Y_ij = 1 and the interval (−∞, 0] otherwise, β is the matrix of regression coefficients, Σ is the covariance matrix, and φ_T(Z_i | X_i β, R) is the probability density function of the multivariate normal distribution defined in A.1.
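The integral in (1.1) has no closed form, but it can be approximated by simple Monte Carlo over the latent variables. The following is a minimal sketch of that idea (my own illustration, not code from the thesis; function names and the example values are arbitrary):

```python
import numpy as np

def mvp_prob_mc(y, X, beta, R, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(Y_i = y_i | X_i, beta, R) in (1.1):
    draw latent Z_i ~ N(X_i beta, R) and count how often the sign
    pattern of Z_i matches the observed binary vector y_i."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(X @ beta, R, size=n_draws)   # latent Gaussian draws
    match = np.all((Z > 0) == (np.asarray(y) == 1), axis=1)  # membership in the A_ij intervals
    return match.mean()

# Example: T = 3 responses, p = 2 covariates (illustrative values)
X = np.array([[0.3, -0.1], [0.0, 0.4], [-0.2, 0.2]])   # T x p design matrix
beta = np.array([1.0, -1.0])
R = np.array([[1.0, 0.4, 0.4], [0.4, 1.0, 0.4], [0.4, 0.4, 1.0]])
print(mvp_prob_mc([1, 0, 1], X, beta, R))
```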

The MVP model has been proposed as an alternative to the multivariate logistic model, which is defined as:

P(Y_ij = 1 | X_i, β, Σ) = exp(x_i′ β_j) / Σ_{k=1}^T exp(x_i′ β_k)   (1.2)

The appeal of the Probit model is that it relaxes the independence of irrelevant alternatives (IIA) property assumed by the logit model. The IIA assumption states that if choice A is preferred to choice B out of the choice set {A, B}, then introducing a third alternative C, thus expanding the choice set to {A, B, C}, must not make B preferred to A. This means that adding or deleting alternative outcome categories does not affect the odds among the remaining outcomes. More specifically, in the logistic regression model the odds of choosing m versus n do not depend on which other outcomes are possible. That is, the odds are determined only by the coefficient vectors for m and n, namely β_m and β_n:

P(Y_im = 1 | X_i, β, Σ) / P(Y_in = 1 | X_i, β, Σ) = [exp(x_i′ β_m) / Σ_{k=1}^T exp(x_i′ β_k)] / [exp(x_i′ β_n) / Σ_{k=1}^T exp(x_i′ β_k)] = exp(x_i′ (β_m − β_n))   (1.3)

In many cases, this is considered to be an unrealistic assumption (see for example McFadden (1974)), particularly when the alternatives are similar or redundant, as is the case in many econometric applications. Until recently, estimation of MVP models, despite their appeal, has been difficult due to computational intractability, especially when the response is high dimensional. However, recent advances in computational and simulation methods have made this class of models more widely used. Both classical and Bayesian methods have been extensively developed for estimation of these models. For a low dimensional response, finding the maximum likelihood estimator numerically using quadrature methods for solving the multidimensional integral is possible, but this becomes intractable as the number of dimensions T increases, usually past 3.
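The IIA property in (1.3) is easy to verify numerically. Below is a small illustration (my own, with arbitrary coefficient values): adding a third category C leaves the odds of A versus B unchanged.

```python
import numpy as np

def choice_probs(x, betas):
    """Multinomial logit probabilities as in (1.2)."""
    u = np.exp(np.array([x @ b for b in betas]))
    return u / u.sum()

x = np.array([1.0, 0.5])
b_A, b_B, b_C = np.array([0.2, 0.4]), np.array([-0.1, 0.3]), np.array([0.5, -0.2])

p2 = choice_probs(x, [b_A, b_B])        # choice set {A, B}
p3 = choice_probs(x, [b_A, b_B, b_C])   # choice set {A, B, C}
print(p2[0] / p2[1], p3[0] / p3[1])     # identical odds: IIA in action
```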

Lerman and Manski (1981) suggest the method of simulated maximum likelihood (SML). This method is based on Monte Carlo simulations to approximate the high dimensional integral in order to estimate the probability of each choice. McFadden (1989) introduced the method of simulated moments (MSM). This method also requires simulating the probability of each outcome based on moment conditions. Natarajan et al. (2000) introduced a Monte Carlo variant of the Expectation Maximization algorithm (MCEM) to find the maximum likelihood estimator without solving the high dimensional integral. Other frequentist methods were also developed using Generalized Estimating Equations (GEE) (e.g., Chaganty and Joe (2004)). On the Bayesian side, Albert and Chib (1993) introduced a method that involves a Gibbs sampling algorithm using data augmentation for the univariate Probit model. McCulloch and Rossi (1994) extended this model to the multivariate case. The Bayesian method entails iteratively alternating between sampling the latent data and estimating the unknown parameters by drawing from their conditional distributions. The idea is that, under mild conditions, successive sampling from the conditional distributions produces a Markov chain which converges in distribution to the desired joint conditional distribution. Other work on the Bayesian side includes that of Chib and Greenberg (1998), and more recently Liu (2001), Liu and Daniels (2006), and Zhang et al. (2006). These methods will be examined in more detail in Chapter 2. Geweke et al. (1994) compared the performance of the classical frequentist methods SML and MSM with the Bayesian Gibbs sampling method and found the Bayesian method to be superior, especially when the covariates are correlated and the error variances vary across responses.

1.2 Outline

In this work we adopt a Bayesian approach for estimation in the multivariate Probit class of models. The multinomial and the ordinal models are generalizations of the binary case. The multivariate binary response is a special case of the multinomial response with only two categories.

The ordinal model is also a special case of the multinomial model, where the categories are expected to follow a certain order. All the methods herein are developed for the multivariate binary model, but they could easily be extended to include the multinomial and ordinal cases. The aim is to find a general framework to estimate the parameters required for inference in the MVP model, especially in high dimensional problems. We particularly focus on the estimation of an identifiable correlation matrix under a full correlation assumption and a constrained partial correlation assumption. This thesis is structured as follows. In Chapter 2, we introduce the notation that will be used throughout the thesis. We discuss the problem of identifiability in the MVP class of models. We briefly compare several possible choices of prior distributions for Bayesian modeling, and review some methods that have been proposed in the literature to deal with identifiability and prior selection. In Chapter 3, we detail a method for estimating an identifiable correlation matrix under the saturated model. The saturated model admits a full covariance matrix where all off-diagonal elements are assumed to be non-zero. We show simulation results on a low dimensional and a higher dimensional problem. Finally, we further investigate the method by applying it to a widely studied data set, the Six Cities data. In Chapter 4, we extend the method developed in Chapter 3 to the case where a structure on the partial correlation matrix is imposed. To do so, we motivate the use of Gaussian graphical models and the hyper-inverse Wishart distribution. We provide a general introduction to Gaussian graphical models, and we adapt the algorithm and the priors developed in Chapter 3 to the new framework. Throughout this chapter, we assume that the structure of the inverse correlation matrix is known and given. Simulation results are presented, as well as an application to the Six Cities data set from Chapter 3. We conclude in Chapter 5 by summarizing the work and the results. We also discuss possible extensions, applications and future work.

Chapter 2
The Multivariate Probit Model

2.1 Model Specification and Notation

The multivariate Probit model assumes that each subject has T distinct binary responses, and a matrix of covariates that can be any mixture of discrete and continuous variables. Specifically, let Y_i = (Y_i1, ..., Y_iT) denote the T-dimensional vector of observed binary 0/1 responses on the ith subject, i = 1, ..., n. Let X_i be a T × p design matrix, and let Z_i = (Z_i1, ..., Z_iT) denote a T-variate normal vector of latent variables such that

Z_i = X_i β + ε_i,   i = 1, ..., n   (2.1)

The relationship between Z_ij and Y_ij in the multivariate Probit model is given by

Y_ij = 1 if Z_ij > 0, and Y_ij = 0 otherwise,   j = 1, ..., T   (2.2)

so that

P(Y_i = 1 | β, Σ) = Φ(Z_i),   Z_i ∼ N(X_i β, Σ)   (2.3)

where Φ is the Probit link, which denotes the cumulative distribution function of the normal distribution as defined in A.1. Here β = (β_1, ..., β_T) is a p × T matrix of unknown regression coefficients, ε_i is a T × 1 vector of residual errors distributed as N_T(0, Σ), and Σ is the T × T correlation matrix of Z_i.
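To make the notation concrete, here is a short sketch of simulating data from (2.1)-(2.2) (my own illustration; the covariate range and coefficient values are assumptions for the example, not the thesis's settings):

```python
import numpy as np

def simulate_mvp(n, T, p, beta, R, seed=0):
    """Simulate from the MVP model (2.1)-(2.2): Z_i = X_i beta + eps_i with
    eps_i ~ N_T(0, R), and Y_ij = 1 exactly when Z_ij > 0."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-0.5, 0.5, size=(n, T, p))          # covariates (assumed range)
    eps = rng.multivariate_normal(np.zeros(T), R, size=n)
    Z = np.einsum('itp,p->it', X, beta) + eps            # latent Gaussian variables
    Y = (Z > 0).astype(int)                              # observed binary responses
    return X, Z, Y

R = 0.4 * np.ones((3, 3)) + 0.6 * np.eye(3)              # equicorrelated, rho = 0.4
X, Z, Y = simulate_mvp(n=500, T=3, p=2, beta=np.array([-1.0, 1.0]), R=R)
```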

[Figure 2.1: A graphical representation of the model in (2.3) under a full correlation structure, with nodes β, X_i1, X_i2, X_i3, Z_i1, Z_i2, Z_i3, Y_i1, Y_i2, Y_i3, Σ and plate i = 1:n. Observed nodes are shaded.]

The posterior distribution of Z_i is given by

f(Z_i | Y_i, β, R) ∝ φ_T(Z_i | X_i β, R) ∏_{j=1}^{T} { I(Z_ij > 0) I(Y_ij = 1) + I(Z_ij ≤ 0) I(Y_ij = 0) }   (2.4)

This is a multivariate truncated Gaussian, where φ_T(·) is the probability density function of the multivariate normal distribution as in A.1. The likelihood of the observed data Y is obtained by integrating over the latent variables Z:

P(Y_i = y_i | X_i, β, R) = ∫_{A_iT} ··· ∫_{A_i1} φ_T(Z_i | X_i β, R) dZ_i   (2.5)

where A_ij is the interval (0, ∞) if Y_ij = 1 and the interval (−∞, 0] otherwise. This formulation of the model is the most general, since it allows the regression parameters as well as the covariates to vary across the T categories. In this work we let the covariates vary across categories; however, we constrain the regression coefficients β to be fixed across categories by requiring β_1 = ... = β_T = β.

2.2 Difficulty with Multivariate Probit Regression: Identifiability

In the multivariate Probit model, the unknown parameters (β, Σ) are not identifiable from the observed-data model (e.g., Chib and Greenberg (1998), Keane (1992)). This can easily be seen: if we scale Z by a constant c > 0, we get

cZ = c(Xβ + ε)   (2.6)
   = X(cβ) + cε   (2.7)

From equation (2.2), Y clearly has the same value given Z and given cZ, which means that the likelihood of Y | X, β, Σ is the same as that of Y | X, cβ, c²Σ. Furthermore, we have no way of estimating the value of c. In order to handle this identifiability issue in the MVP model, restrictions need to be imposed on the covariance matrix. In the univariate case, this restriction is handled by setting the variance to one. However, imposing such a restriction in the multivariate case is a little more complicated. It is not uncommon to ignore the identifiability problem, perform the analysis on the unidentified model, and post-process the samples by scaling with the sampled variances using the separation strategy R = D⁻¹ΣD⁻¹, where D is a diagonal matrix with diagonal elements d_ii = √Σ_ii. This method is adopted by McCulloch and Rossi (1994), and is widely used (e.g., Edwards and Allenby (2003)). Many researchers are uncomfortable working with unidentified parameters.
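A minimal sketch of this post-processing step (my own illustration): rescale each draw of Σ to the identified correlation scale. The non-identifiability of the scale c is visible in the fact that Σ and c²Σ map to the same R.

```python
import numpy as np

def cov_to_corr(Sigma):
    """Separation strategy: R = D^{-1} Sigma D^{-1}, with D = diag(sqrt(Sigma_ii)).
    Maps an (unidentified) covariance draw to the identified correlation scale."""
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)

# The scale is not identified: Sigma and c^2 * Sigma give the same R.
Sigma = np.array([[2.0, 0.6, 0.2], [0.6, 1.5, 0.3], [0.2, 0.3, 0.5]])
assert np.allclose(cov_to_corr(Sigma), cov_to_corr(4.0 * Sigma))
```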

For instance, ignoring identifiability adds difficulty in the choice of prior distributions, since priors are placed on unidentified parameters. Therefore, if the prior is improper, it is difficult to verify that the scaled draws are from a proper posterior distribution. Koop (2003, p. 227) gives an empirical illustration of the effect of ignoring identifiability. From simulation results, he shows that unidentifiable parameters have higher standard errors, and furthermore, with non-informative priors there is nothing stopping estimates from going to infinity. McCulloch et al. (2000) address identifiability by setting the first diagonal element of the covariance matrix to σ_11 = 1. However, this means that the standard priors for covariance matrices can no longer be used. They propose a prior directly on the identified parameters, but their method is computationally expensive and slow to converge, as pointed out by Nobile (2000). Nobile suggests an alternative way of normalizing the covariance by drawing from an inverse Wishart conditional on σ_11 = 1 (Linardakis and Dellaportas, 2003). The approach of constraining one element of the covariance adds difficulty in the interpretability of the parameters and priors, and is computationally demanding and slow to converge. Other approaches impose constraints on Σ⁻¹, the precision matrix. Webb and Forster (2006) parametrize Σ⁻¹ in terms of its Cholesky decomposition: Σ⁻¹ = Ψ′ΛΨ. In this parametrization, Ψ is an upper triangular matrix with diagonal elements equal to 1, and Λ is a diagonal matrix. The elements of Ψ can be regarded as the regression coefficients obtained by regressing the latent variable on its predecessors. Each λ_jj is interpreted as the conditional precision of the latent data corresponding to variable j given the latent data for all the variables preceding j in the decomposition. Identifiability is addressed in this case by setting λ_jj to 1. This approach only works if the data follow a specific ordering, for example a time series. Dobra et al. (2004) propose an algorithm to search over possible orderings; however, this becomes very computationally expensive in high dimensions. Alternatively, identifiability can be handled by restricting the covariance matrix Σ to be a correlation matrix R (Chib and Greenberg (1998)). The correlation matrix admits additional constraints since, in addition to being positive semi-definite, it is required to have diagonal elements equal to 1 and off-diagonal elements in [−1, 1].

Furthermore, just as in the covariance case, the number of parameters to be estimated increases quadratically with the dimension of the matrix. Barnard et al. (2000) use the decomposition Σ = DRD, and place separate priors on R and D directly. They use a Griddy Gibbs sampler (Ritter and Tanner, 1992) to sample the correlation matrix. Their approach involves drawing the correlation elements one at a time and requires setting grid sizes and boundaries. This approach is inefficient, especially in high dimensions. Chib and Greenberg (1998) use a Metropolis-Hastings random walk algorithm to sample the correlation matrix. This is more efficient than the Griddy Gibbs approach because it draws the correlation coefficients in blocks. However, the resulting correlation matrix is not guaranteed to be positive definite, which requires the algorithm to have an extra rejection step. Furthermore, as with random walk algorithms in general, the mixing is slow in high dimensions. Alternatively, some approaches use parameter expansion as described in Liu and Wu (1999) together with data augmentation, for example Liu (2001), Zhang et al. (2006), Liu and Daniels (2006), and others. The idea is to propose an alternative parametrization that moves from the constrained correlation space to sampling a less constrained covariance matrix, which is then transformed back to a correlation matrix. These approaches differ mainly in the choice of priors and in how the covariance matrix is sampled. The different possibilities for priors will be discussed in more detail in the next section, and an in-depth explanation of the parameter expansion with data augmentation algorithm is given in the next chapter. Table 2.1 gives a summary of how identifiability has been handled in the Probit model.

2.3 Bayesian Inference in Multivariate Probit Models

A Bayesian framework treats parameters as random variables and therefore requires the computation of the posterior distribution of the unknown random parameters conditional on the data.

Table 2.1: Summary of how identifiability has been handled in some previous work

Identifiability                      Paper
Ignored                              McCulloch and Rossi (1994)
Restrict σ_11 = 1                    McCulloch et al. (2000); Nobile (2000)
Restrict λ_jj = 1 in Σ⁻¹ = Ψ′ΛΨ      Webb and Forster (2006)
Restrict Σ to R                      Barnard et al. (2000); Liu (2001); Liu and Daniels (2006); Zhang et al. (2006)

A straightforward application of Bayes' rule results in the posterior distribution of (β, R), where R is the correlation matrix, β is the matrix of regression coefficients, and D is the data:

π(β, R | D) ∝ f(D | β, R) π(β, R)   (2.8)

In order to estimate the posterior distribution, a prior distribution on the unknown parameters β and R needs to be specified. In the absence of prior knowledge, it is often desirable to have uninformative flat priors on the parameters we are estimating.

2.3.1 Prior Specification on β

It is common to assume that a priori β and R are independent. Liu (2001) proposes a prior on β that depends on R to facilitate computations. There are several other choices of priors in the literature for the regression coefficients β. The most common choice is a multivariate Gaussian distribution centered at B, with known diagonal covariance matrix Ψ_β. It is typical to choose large values for the diagonal elements of Ψ_β so that the prior on β is uninformative. This is the proper conjugate prior. In addition, without loss of generality,

we can set B to 0:

π(β̃) ∼ N_{pT}(0, Ψ_β ⊗ I_T)   (2.9)

where β̃ is the pT-dimensional vector obtained by stacking up the columns of the p × T regression coefficient matrix β. In this work, we constrain the regression parameter to be constant across the T categories.

2.3.2 Prior Specification on the Correlation Matrix R

To handle identifiability, we restrict the covariance matrix Σ to be a correlation matrix, which means that the standard conjugate inverse Wishart prior for covariances cannot be used. Instead, a prior needs to be placed on R directly. However, as mentioned previously, there does not exist a conjugate prior for correlation matrices. Barnard et al. (2000) discuss possible choices of diffuse priors on R. The first is the proper jointly uniform prior:

π(R) ∝ 1,   R ∈ ℛ_T   (2.10)

where the space of correlation matrices ℛ_T is a compact subspace of the hypercube [−1, 1]^{T(T−1)/2}. The posterior distribution resulting from this prior is not easy to sample from. Barnard et al. use the Griddy Gibbs approach (Ritter and Tanner, 1992), which is inefficient. The approach in Chib and Greenberg (1998) uses this prior as well. Liu and Daniels (2006) use this prior for inference; however, they use a different prior to generate their sampling proposal. It is important to note that using a jointly uniform prior does not result in uniform marginals on each r_ij. Barnard et al. (2000) show that a jointly uniform prior will tend to favor marginal correlations close to 0, making it highly informative, marginally. This problem becomes more apparent as T increases (see Figure 2.2).

Another commonly used uninformative prior is Jeffreys' prior

π(R) ∝ |R|^{−(p+1)/2}   (2.11)

This prior is used by Liu (2001). Liu and Daniels (2006) use it for generating their proposal.

[Figure 2.2: Marginal prior density of r_12 under the jointly uniform prior p(R), for T = 3 and for a larger T. (Figure reproduced from Barnard et al. (2000).)]

It has been shown that, in the context of parameter expansion, this prior helps facilitate computations. However, it suffers from the disadvantage of being improper. Improper priors are not guaranteed to yield a proper posterior distribution and, in addition, cannot be used for model selection due to Lindley's paradox. Furthermore, it has been shown that the use of improper priors on covariance matrices is in fact informative and tends to favor marginal correlations close to ±1 (Rossi et al., 2005, Chapter 2). Alternatively, Barnard et al. (2000) propose a prior on R such that marginally each r_ij is uniform on the interval [−1, 1]. This is achieved by taking the joint distribution of R to be:

π(R) ∝ |R|^{T(T−1)/2 − 1} ( ∏_i |R_ii| )^{−(T+1)/2}   (2.12)

where R_ii denotes the principal submatrix of R obtained by deleting its ith row and column. The above distribution is difficult to sample from directly. However, they show that sampling from it can be achieved by sampling from a standard inverse Wishart with degrees of freedom equal to ν = T + 1 and transforming back to a correlation matrix using the separation strategy (Σ = DRD). The proof is reproduced in Appendix B and the result is illustrated in Figure 2.3.

[Figure 2.3: Marginal densities of the correlations (ρ_12, ρ_13, ρ_23) obtained using the prior in (2.12) by sampling from a standard inverse Wishart with degrees of freedom ν = T + 1.]

The marginally uniform prior seems convenient, since it is proper and we are able to compute its normalizing constant. It does not push correlations toward 0 or ±1, even in high dimensions. Most importantly, because it is proper, it opens the possibility for Bayesian model selection. However, multiplying together the distribution of Z in equation (2.4) and the marginally uniform prior in (2.12) results in a posterior distribution that is complicated and not easily sampled from.
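As a quick illustration of this construction (a sketch, not the thesis's code): draw Σ from a standard inverse Wishart with ν = T + 1 degrees of freedom and an identity scale, then rescale to a correlation matrix; histograms of the resulting r_ij are flat on [−1, 1].

```python
import numpy as np
from scipy.stats import invwishart

def sample_marginally_uniform_R(T, rng):
    """One draw of R under the marginally uniform prior (2.12):
    Sigma ~ IW(nu = T + 1, I_T), then R = D^{-1} Sigma D^{-1}."""
    Sigma = invwishart.rvs(df=T + 1, scale=np.eye(T), random_state=rng)
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)

rng = np.random.default_rng(0)
draws = np.array([sample_marginally_uniform_R(5, rng)[0, 1] for _ in range(2000)])
print(draws.min(), draws.max())  # values spread roughly uniformly over (-1, 1)
```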

Nevertheless, we show in the next chapter that the marginal prior, when used in the context of parameter expansion, is actually computationally convenient for sampling from the posterior distribution.

Chapter 3
Correlation Estimation in the Saturated Model

3.1 Introduction

As we have seen in the previous chapter, inference in the MVP model is complicated by the identifiability issue, which requires constraining the covariance to be a correlation matrix. There is no conjugate prior for correlation matrices, and therefore the posterior is not easily sampled from. In this chapter, we build on previous work and adopt a Bayesian approach that uses a combination of Gibbs sampling and data augmentation. Furthermore, we use a re-parametrization leading to an expansion of the parameter space. This helps significantly with the computation of the posterior distribution. We focus on R being a full T × T correlation matrix.

3.2 Parameter Expansion and Data Augmentation

3.2.1 Data Augmentation

Data augmentation (DA) is an algorithm introduced by Tanner and Wong (1987), very popular in statistics and used mainly to facilitate computation. These methods center on the construction of iterative algorithms by introducing artificial variables, referred to as missing data or latent variables. These variables may or may not have a physical interpretation, but are mainly there for computational convenience. Let Y be the observed data and θ the unknown parameter of interest.

If we are interested in making draws from f(Y | θ), the idea is to find a latent variable Z such that the joint distribution f(Y, Z | θ) is easily sampled from. The distribution of the observed-data model is recovered by marginalizing out the latent variable:

f(Y | θ) = ∫ f(Y, Z | θ) dZ   (3.1)

Algorithm 3.1 Data Augmentation
At iteration i:
1. Draw Z ∼ f(Z | θ, Y) ∝ f(Y, Z | θ)
2. Draw θ ∼ f(θ | Z, Y) ∝ f(Y, Z | θ) f(θ)

The data augmentation algorithm 3.1 iterates between an imputation step, where the latent variables are sampled, and a posterior estimation step, until convergence. The samples of the unknown parameter θ can then be used for inference.

3.2.2 Parameter Expansion for Data Augmentation

Parameter Expansion for Data Augmentation (PX-DA), introduced by Liu and Wu (1999), is a technique useful for accelerating convergence. The idea is that if we can find a hidden parameter α in the complete-data model f(Y, Z | θ), we can expand this model to a larger model p(Y, W | θ, α) that preserves the distribution of the observed-data model:

∫ p(Y, W | θ, α) dW = f(Y | θ)   (3.2)

We adopt the notation used in Liu and Wu (1999), and use W instead of Z and p instead of f to denote the latent data and the distributions under the expanded model. To implement the DA algorithm in this setting, a joint prior on the expansion parameter α and the original parameter of interest θ needs to be specified, such that the prior on θ is the same under the original model and the expanded model (∫ p(θ, α) dα = f(θ)). This can be done by maintaining the prior for θ at f(θ) and specifying a prior p(α | θ).

By iterating through the steps of algorithm 3.2, we are able to achieve a faster rate of convergence than with the DA algorithm in 3.1.

Algorithm 3.2 PX-DA Algorithm
At iteration i:
1. Draw (α, W) jointly by drawing
   α ∼ p(α | θ)
   W ∼ p(W | θ, α, Y) ∝ p(Y, W | θ, α)
2. Draw (α, θ) jointly from
   α, θ | Y, W ∝ p(Y, W | θ, α) p(α | θ) f(θ)

3.2.3 Data Transformation

Under certain conditions, an alternative view of PX-DA treats W as the result of a transformation of the latent data Z induced by the expansion parameter α (Liu and Wu, 1999, Scheme 1). For this interpretation to hold, a transformation Z = t_α(W) needs to be defined such that, for any fixed value of α, t_α(W) is a one-to-one differentiable mapping between Z and W:

p(Y, W | θ, α) = f(Y, t_α(W) | θ) |J_α(W)|   (3.3)

where |J_α(W)| is the determinant of the Jacobian of the transformation t_α evaluated at W. The algorithm is detailed in 3.3. Note that in the second step of algorithm 3.3, α is sampled from its prior distribution. This interpretation of the PX-DA algorithm is particularly useful in the case of MVP regression.

3.3 Proposed Model

In the model we are proposing, we want to use PX-DA mainly to simplify computation. We adopt the scheme described in algorithm 3.3 (corresponding to Scheme 1 in Liu and Wu (1999)).

Algorithm 3.3 PX-DA Algorithm / Data Transformation (Scheme 1)
At iteration i:
1. Draw Z ∼ f(Z | Y, θ); compute W = t_α⁻¹(Z)
2. Draw (α, θ) jointly, conditional on the latent data:
   α, θ | Y, W ∝ p(Y, t_α(W) | θ) |J_α(W)| p(α | θ) f(θ)

3.3.1 Imputation Step

Let θ = (R, β) be the identifiable parameter of interest. The first step of algorithm 3.3 involves drawing Z conditional on the identifiable parameter θ. This is achieved by sampling from a multivariate truncated Gaussian as in equation (2.4). For the generation of multivariate truncated Gaussian variables, we follow the approach outlined in Appendix D. This approach uses Gibbs steps to cycle through a series of univariate truncated Gaussians. In each step Z_ij is simulated from Z_ij | Z_i,−j, β, R, which is a univariate Gaussian distribution truncated to [0, ∞) if Y_ij = 1 and to (−∞, 0] if Y_ij = 0. The parameters of the untruncated distribution Z_ij | Z_i,−j, β, R are obtained from the usual formulae for the moments of conditional Gaussians.
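A sketch of this Gibbs update for a single subject (my own illustration of the approach described in Appendix D; function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import truncnorm

def impute_latent(Z, y, mu, R, rng):
    """One Gibbs sweep over Z_i1, ..., Z_iT for a single subject.
    Each Z_ij is drawn from its conditional N(m_j, s_j^2), truncated to
    [0, inf) if y_j = 1 and (-inf, 0] if y_j = 0.  mu = X_i beta."""
    T = len(y)
    for j in range(T):
        o = [k for k in range(T) if k != j]              # the other coordinates
        R_oo_inv = np.linalg.inv(R[np.ix_(o, o)])
        r_jo = R[j, o]
        m = mu[j] + r_jo @ R_oo_inv @ (Z[o] - mu[o])     # conditional mean
        s = np.sqrt(R[j, j] - r_jo @ R_oo_inv @ r_jo)    # conditional std dev
        lo, hi = ((0.0 - m) / s, np.inf) if y[j] == 1 else (-np.inf, (0.0 - m) / s)
        Z[j] = truncnorm.rvs(lo, hi, loc=m, scale=s, random_state=rng)
    return Z
```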

3.3.2 Posterior Sampling Step

Given the latent data sampled in step 1, we would like to draw (α, θ) from its posterior distribution. In order to implement step 2 of algorithm 3.3, we need to find an expansion parameter α that is not identifiable from the observed-data model, but is identifiable from the complete-data model. Subsequently, we need to define a transformation of the latent data.

Defining the Expansion Parameter and the Transformation

Let

Z = t_α(W) = D⁻¹ W   (3.4)

or, alternatively, W = DZ, where D is a diagonal matrix with positive diagonal elements d_ii = √Σ_ii. The scale parameter D is not identifiable. For reasons which will become clear later, we can conveniently pick α = (α_1, ..., α_T) to be a function of D by taking

α_i = r^{ii} / (2 d_i²)   (3.5)

where r^{ii} is the ith diagonal element of R⁻¹ and d_i is the ith diagonal element of D. In this case, for any fixed value of α, D is a one-to-one function of α, and t_α(W) is a one-to-one differentiable mapping between Z and W. This choice of α is not arbitrary. It is conveniently picked so that, when combined with the prior on (R, β), the transformed likelihood and the Jacobian, it results in a posterior distribution that is easily sampled from.

The Transformed Complete Likelihood: p(Y, t_α(W) | θ) |J_α(W)|

For a given α, the determinant of the Jacobian of the transformation from Z to W is given by (see a 3 × 3 example in Appendix C):

|J : Z → W| = | ∂(Z_1, ..., Z_n) / ∂(W_1, ..., W_n) |   (3.6)
            = | I_n ⊗ D⁻¹ |   (3.7)
            = |D|⁻ⁿ   (3.8)

Combining the complete likelihood in equation (2.4) with the Jacobian, and after some algebra, we get:

p(Y, t_α(W) | β, R) |J : Z → W| = p(Y, Z | β, R) |J : Z → W|   (3.9)
  = |R|^{−n/2} exp( −½ Σ_{i=1}^n (Z_i − X_iβ)′ R⁻¹ (Z_i − X_iβ) ) |J : Z → W|
  = |D|⁻ⁿ |R|^{−n/2} exp( −½ Σ_{i=1}^n (D(Z_i − X_iβ))′ (DRD)⁻¹ (D(Z_i − X_iβ)) )
  = |DRD|^{−n/2} exp( −½ Σ_{i=1}^n (W_i − D X_iβ)′ (DRD)⁻¹ (W_i − D X_iβ) )

If we define

Σ = DRD   (3.10)
ε = D(Z − Xβ)   (3.11)

we can re-write the likelihood under the expanded-data model in (3.9) as

p(Y, t_α(W) | R, β) |J_α(W)| ∝ |Σ|^{−n/2} exp( −½ tr(Σ⁻¹ ε′ε) )   (3.12)

The Prior: p(α | θ) f(θ)

For Bayesian inference, we need to define a joint prior on θ = (β, R) and α. We assume that β and R are independent a priori, so that π(β, R, α) = p(α | R) f(R) f(β). Under the transformation Σ = DRD, Barnard et al. (2000) showed that if we take Σ to follow a standard inverse Wishart distribution as in A.4, we can re-write the distribution of Σ as in B.1:

π(Σ) = π(α, R) |J : Σ → (D, R)|,   where π(α, R) = f(R) p(α | R)   (3.13)

With a particular choice of parameters, namely ν = T + 1, the distribution f(R) is as in (2.12), so that each r_ij is uniform on the interval [−1, 1].

Furthermore, the distribution p(α | R) is Gamma with shape parameter (T + 1)/2 and rate parameter 1. Therefore, we are able to obtain the desired prior distribution π(α | R)π(R) by sampling Σ from a standard inverse Wishart with degrees of freedom ν = T + 1 and transforming using Σ = DRD. Here we point out that the prior distributions of both R and β are the same under the expanded model and the observed-data model. This is a condition required for the PX-DA algorithm. In addition, we note that R and α are not a priori independent. The independence of these parameters is a necessary condition only to prove the optimality of the convergence of algorithm 3.3. In this case, their independence is not key, since we are using PX-DA mainly for convenience, in that it results in a posterior distribution that is easily sampled from.

Posterior Distribution of (α, θ)

Now that we have specified the expanded likelihood and the prior on the parameters of interest (R, β) and the expansion parameter α, the joint posterior distribution of (β, R, α) conditional on the latent data can be computed:

β, R, α | Y, W ∝ p(Y, t_α(W) | β, R) |J_α(W)| f(R) f(β) p(α | R)   (3.14)

where t_α(W) = Z = D⁻¹W is the transformation of the latent data and |J_α(W)| is the determinant of the Jacobian of going from Z to W. Putting together the likelihood in (3.12), the marginally uniform prior on R in (2.12), the Gamma prior on α in (3.13), and the prior on β in (2.9), we get:

π(R, α, β | Y, W) ∝ |Σ|^{−n/2} exp( −½ tr(Σ⁻¹ ε′ε) ) · |R|^{T(T−1)/2 − 1} ( ∏_i |R_ii| )^{−(T+1)/2} · ∏_i Gamma(α_i; (T+1)/2, 1) · exp( −½ β′ Ψ_β⁻¹ β )   (3.15)

where the Gamma distribution is defined as in A.2. In order to sample from the joint posterior distribution in (3.15), we use a Gibbs sampling framework, where we sample β | Z, R and then sample R, α | W. Since, given R, the parameter β is identifiable, we sample it prior to transforming the data. Straightforward computations give the posterior distribution of β | Y, Z, R. The normal prior is conjugate, therefore the posterior distribution of β also follows a multivariate normal distribution, with covariance Ψ̄_β and mean β̄, where

Ψ̄_β = ( Ψ_β⁻¹ + Σ_{i=1}^n X_i′ R⁻¹ X_i )⁻¹
β̄ = Ψ̄_β ( Σ_{i=1}^n X_i′ R⁻¹ Z_i )

The joint posterior π(R, α | Y, W, β) can be obtained from (3.15):

π(R, α | Y, W, β) ∝ |Σ|^{−n/2} exp( −½ tr(Σ⁻¹ ε′ε) ) · |R|^{T(T−1)/2 − 1} ( ∏_i |R_ii| )^{−(T+1)/2} · ∏_i Gamma(α_i; (T+1)/2, 1)   (3.16)

We perform the change of variables Σ = DRD:

π(Σ | Y, W, β) ∝ π(R, α | Y, W, β) |J : (D, R) → Σ|
             = |Σ|^{−n/2} exp( −½ tr(Σ⁻¹ ε′ε) ) · |Σ|^{−(2(T+1))/2} exp( −½ tr(Σ⁻¹) )
             = |Σ|^{−(ν+T+1)/2} exp( −½ tr(Σ⁻¹ S) )   (3.17)

This is an inverse Wishart distribution with ν = n + T + 1 and S = I_T + ε′ε. The second line in the equation above is obtained by reversing the steps of the proof in Appendix B.

Algorithm 3.4 Full PX-DA Sampling Scheme in the Multivariate Probit Model
At iteration i:
1. Imputation Step
   Draw Z ∼ f(Z | Y, β, R) from a truncated multivariate normal distribution TMVN(Xβ, R), as described in Appendix D.
2. Posterior Sampling Step
   Draw (β, R, α) jointly, conditional on the latent data:
   - Draw β | Z, Y, R from a multivariate normal distribution, β ∼ MVN(β̄, Ψ̄_β).
   - Draw α ∼ p(α | R) from a Gamma distribution, α_i ∼ G((T + 1)/2, 1).
   - Compute the diagonal matrix D, where each diagonal element is d_i = √(r^{ii} / (2α_i)) and r^{ii} is the ith diagonal element of R⁻¹. Compute W = t_α⁻¹(Z) = DZ, or equivalently ε = D(Z − Xβ).
   - Draw Σ | β, Y, W from an inverse Wishart distribution, Σ ∼ IW(ν, S), where ν = n + T + 1 and S = I_T + ε′ε.
   - Compute R = D⁻¹ΣD⁻¹, where D is now the diagonal matrix with d_ii = √Σ_ii taken from the new draw of Σ.
Repeat until convergence.
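The posterior sampling step of Algorithm 3.4 can be sketched as follows. This is my own illustration under the assumptions above (a regression coefficient vector shared across the T categories, and the inverse Wishart scale I_T + ε′ε following the reconstruction of (3.17)); helper names are illustrative, not the thesis's code.

```python
import numpy as np
from scipy.stats import invwishart

def posterior_step(Z, X, R, Psi_beta_inv, rng):
    """One posterior-sampling step of Algorithm 3.4 (sketch).
    Z: (n, T) latent draws, X: (n, T, p) design arrays, R: current correlation,
    Psi_beta_inv: (p, p) prior precision of beta."""
    n, T = Z.shape
    R_inv = np.linalg.inv(R)

    # beta | Z, Y, R: conjugate multivariate normal update
    prec = Psi_beta_inv + np.einsum('itp,ts,isq->pq', X, R_inv, X)
    rhs = np.einsum('itp,ts,is->p', X, R_inv, Z)
    cov_beta = np.linalg.inv(prec)
    beta = rng.multivariate_normal(cov_beta @ rhs, cov_beta)

    # alpha | R ~ Gamma((T+1)/2, 1); expand with d_i = sqrt(r^{ii} / (2 alpha_i))
    alpha = rng.gamma(shape=(T + 1) / 2.0, scale=1.0, size=T)
    d = np.sqrt(np.diag(R_inv) / (2.0 * alpha))

    # W = D Z, residuals eps = D (Z - X beta); then Sigma | W ~ IW(n+T+1, I + eps'eps)
    eps = (Z - np.einsum('itp,p->it', X, beta)) * d     # row-wise scaling by D
    S = np.eye(T) + eps.T @ eps
    Sigma = invwishart.rvs(df=n + T + 1, scale=S, random_state=rng)

    # back to the identified correlation scale
    s = np.sqrt(np.diag(Sigma))
    return beta, Sigma / np.outer(s, s)
```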

3.4 Simulations

In order to test the performance of the algorithm developed in the previous section, we conduct several simulation studies, first with T = 3 and then increasing the dimension to T = 8. The data are simulated as follows: we generate a design matrix with p = 2 covariates drawn from a uniform distribution, we fix the regression coefficients β, and we generate random errors from a multivariate Gaussian distribution centered at 0 with a full correlation matrix R. We fix R such that all off-diagonal elements ρ_ij are of equal value, and we try different values of ρ, namely 0.2, 0.4, 0.6, and 0.8. The following two loss functions are considered to evaluate the accuracy of the estimated correlation matrix:

L_1(R̂, R) = tr(R̂R⁻¹) − log|R̂R⁻¹| − T   (3.18)

L_2(R̂, R) = tr( (R̂R⁻¹ − I)² )   (3.19)

where R̂ is the estimated correlation matrix and R is the true correlation matrix used to generate the data. The first loss function is the entropy loss and the second is the quadratic loss. These loss functions are discussed in more detail in Yang and Berger (1994). In each case, N Gibbs samples are drawn and the initial draws are discarded as burn-in. We tried multiple runs to ensure convergence of the results. The correlation matrix is always initialized at the identity matrix, and the latent variables are initialized at 0.

3.4.1 Results for T = 3

For T = 3, three parameters in the correlation matrix are estimated. Table 3.1 outlines results from the simulations for the correlation matrix. The posterior median estimate is reported, along with the number of parameters falling within the 95% credible interval, the average interval length, the entropy loss and the quadratic loss. The 95% credible intervals are calculated based on the 2.5% and 97.5% quantiles of the estimates. We can see that the likelihood carries more information with larger correlation values: estimation of the correlation becomes more accurate and credible intervals become smaller on average. Similarly, with more data, estimates become more precise and, furthermore, we see a decrease in both the entropy and the quadratic loss. Except in one case (r_ij = 0.2, n = 5), the true correlation coefficient was always included in the 95% credible interval. Figures 3.1 and 3.2 provide examples of trace plots and density plots for the correlation matrix with ρ_ij = 0.4 and ρ_ij = 0.8, respectively. Subfigures (a) and (b) in each case show how the density becomes narrower as the sample size increases. Furthermore, we see that the algorithm mixes very well and converges fast.
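For reference, the two losses in (3.18) and (3.19) are straightforward to compute; a small sketch (my own illustration):

```python
import numpy as np

def entropy_loss(R_hat, R):
    """L1(R_hat, R) = tr(R_hat R^{-1}) - log|R_hat R^{-1}| - T  (entropy loss)."""
    M = R_hat @ np.linalg.inv(R)
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - R.shape[0]

def quadratic_loss(R_hat, R):
    """L2(R_hat, R) = tr((R_hat R^{-1} - I)^2)  (quadratic loss)."""
    M = R_hat @ np.linalg.inv(R) - np.eye(R.shape[0])
    return np.trace(M @ M)
```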

[Table 3.1: Correlation results from simulations for T = 3; columns: sample size, r_ij, number of parameters whose 95% credible interval contains the truth, average CI length, entropy loss, quadratic loss.]

Table 3.2 shows simulation results for the regression coefficients β. For each coefficient, we report the median of the posterior distribution, a 95% credible interval and the standard error. The true regression coefficients seem always to fall within the 95% credible interval. Standard errors, and consequently credible interval lengths, tend to become smaller as the correlation increases and as the sample size increases. Figures 3.3, 3.4, 3.5, and 3.6 provide trace plots, density plots and autocorrelation plots for the regression coefficients in the cases where the correlation matrix has elements ρ_ij = 0.4 and ρ_ij = 0.8, with increasing sample size. The density becomes narrower with a larger sample size, and here too the algorithm appears to mix well.

[Table 3.2: Regression coefficient results from simulations for T = 3; columns: sample size, r_ij, β̂_1 with confidence interval and standard error, β̂_2 with confidence interval and standard error.]

[Figure 3.1: Correlation estimates for ρ = 0.4, T = 3, with increasing sample size; panels (a) and (b) show the smaller and larger sample sizes.]

[Figure 3.2: Correlation estimates for ρ = 0.8, T = 3, with increasing sample size; panels (a) and (b) show the smaller and larger sample sizes.]

[Figure 3.3: β estimates for ρ = 0.4, T = 3; trace plots, density plots and autocorrelation plots.]


More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

VCMC: Variational Consensus Monte Carlo

VCMC: Variational Consensus Monte Carlo VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object

More information

Bayesian Methods in Multilevel Regression

Bayesian Methods in Multilevel Regression Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information

Bayesian Inference. Chapter 9. Linear models and regression

Bayesian Inference. Chapter 9. Linear models and regression Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering

More information

Bayesian Inference: Probit and Linear Probability Models

Bayesian Inference: Probit and Linear Probability Models Utah State University DigitalCommons@USU All Graduate Plan B and other Reports Graduate Studies 5-1-2014 Bayesian Inference: Probit and Linear Probability Models Nate Rex Reasch Utah State University Follow

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual

More information

Gibbs Sampling for the Probit Regression Model with Gaussian Markov Random Field Latent Variables

Gibbs Sampling for the Probit Regression Model with Gaussian Markov Random Field Latent Variables Gibbs Sampling for the Probit Regression Model with Gaussian Markov Random Field Latent Variables Mohammad Emtiyaz Khan Department of Computer Science University of British Columbia May 8, 27 Abstract

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Bayesian inference. Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark. April 10, 2017

Bayesian inference. Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark. April 10, 2017 Bayesian inference Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark April 10, 2017 1 / 22 Outline for today A genetic example Bayes theorem Examples Priors Posterior summaries

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Kazuhiko Kakamu Department of Economics Finance, Institute for Advanced Studies. Abstract

Kazuhiko Kakamu Department of Economics Finance, Institute for Advanced Studies. Abstract Bayesian Estimation of A Distance Functional Weight Matrix Model Kazuhiko Kakamu Department of Economics Finance, Institute for Advanced Studies Abstract This paper considers the distance functional weight

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

variability of the model, represented by σ 2 and not accounted for by Xβ

variability of the model, represented by σ 2 and not accounted for by Xβ Posterior Predictive Distribution Suppose we have observed a new set of explanatory variables X and we want to predict the outcomes ỹ using the regression model. Components of uncertainty in p(ỹ y) variability

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Wageningen Summer School in Econometrics. The Bayesian Approach in Theory and Practice

Wageningen Summer School in Econometrics. The Bayesian Approach in Theory and Practice Wageningen Summer School in Econometrics The Bayesian Approach in Theory and Practice September 2008 Slides for Lecture on Qualitative and Limited Dependent Variable Models Gary Koop, University of Strathclyde

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative

More information

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 3 SIR models - more topics Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. What can be estimated? 2. Reparameterisation 3. Marginalisation

More information

ST 740: Linear Models and Multivariate Normal Inference

ST 740: Linear Models and Multivariate Normal Inference ST 740: Linear Models and Multivariate Normal Inference Alyson Wilson Department of Statistics North Carolina State University November 4, 2013 A. Wilson (NCSU STAT) Linear Models November 4, 2013 1 /

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous

More information

A BAYESIAN APPROACH TO SPATIAL CORRELATIONS IN THE MULTIVARIATE PROBIT MODEL

A BAYESIAN APPROACH TO SPATIAL CORRELATIONS IN THE MULTIVARIATE PROBIT MODEL A BAYESIAN APPROACH TO SPATIAL CORRELATIONS IN THE MULTIVARIATE PROBIT MODEL by Jervyn Ang B.Sc, Simon Fraser University, 2008 a Project submitted in partial fulfillment of the requirements for the degree

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Online Appendix to: Marijuana on Main Street? Estimating Demand in Markets with Limited Access

Online Appendix to: Marijuana on Main Street? Estimating Demand in Markets with Limited Access Online Appendix to: Marijuana on Main Street? Estating Demand in Markets with Lited Access By Liana Jacobi and Michelle Sovinsky This appendix provides details on the estation methodology for various speci

More information

MULTILEVEL IMPUTATION 1

MULTILEVEL IMPUTATION 1 MULTILEVEL IMPUTATION 1 Supplement B: MCMC Sampling Steps and Distributions for Two-Level Imputation This document gives technical details of the full conditional distributions used to draw regression

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Multivariate Gaussians Mark Schmidt University of British Columbia Winter 2019 Last Time: Multivariate Gaussian http://personal.kenyon.edu/hartlaub/mellonproject/bivariate2.html

More information

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

AMS-207: Bayesian Statistics

AMS-207: Bayesian Statistics Linear Regression How does a quantity y, vary as a function of another quantity, or vector of quantities x? We are interested in p(y θ, x) under a model in which n observations (x i, y i ) are exchangeable.

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

A Bayesian Probit Model with Spatial Dependencies

A Bayesian Probit Model with Spatial Dependencies A Bayesian Probit Model with Spatial Dependencies Tony E. Smith Department of Systems Engineering University of Pennsylvania Philadephia, PA 19104 email: tesmith@ssc.upenn.edu James P. LeSage Department

More information

Monte Carlo Composition Inversion Acceptance/Rejection Sampling. Direct Simulation. Econ 690. Purdue University

Monte Carlo Composition Inversion Acceptance/Rejection Sampling. Direct Simulation. Econ 690. Purdue University Methods Econ 690 Purdue University Outline 1 Monte Carlo Integration 2 The Method of Composition 3 The Method of Inversion 4 Acceptance/Rejection Sampling Monte Carlo Integration Suppose you wish to calculate

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals. John W. Mac McDonald & Alessandro Rosina

Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals. John W. Mac McDonald & Alessandro Rosina Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals John W. Mac McDonald & Alessandro Rosina Quantitative Methods in the Social Sciences Seminar -

More information

Supplementary Material for Analysis of Job Satisfaction: The Case of Japanese Private Companies

Supplementary Material for Analysis of Job Satisfaction: The Case of Japanese Private Companies Supplementary Material for Analysis of Job Satisfaction: The Case of Japanese Private Companies S1. Sampling Algorithms We assume that z i NX i β, Σ), i =1,,n, 1) where Σ is an m m positive definite covariance

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Bayesian inference for factor scores

Bayesian inference for factor scores Bayesian inference for factor scores Murray Aitkin and Irit Aitkin School of Mathematics and Statistics University of Newcastle UK October, 3 Abstract Bayesian inference for the parameters of the factor

More information

Practical Bayesian Quantile Regression. Keming Yu University of Plymouth, UK

Practical Bayesian Quantile Regression. Keming Yu University of Plymouth, UK Practical Bayesian Quantile Regression Keming Yu University of Plymouth, UK (kyu@plymouth.ac.uk) A brief summary of some recent work of us (Keming Yu, Rana Moyeed and Julian Stander). Summary We develops

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Monte Carlo Integration using Importance Sampling and Gibbs Sampling

Monte Carlo Integration using Importance Sampling and Gibbs Sampling Monte Carlo Integration using Importance Sampling and Gibbs Sampling Wolfgang Hörmann and Josef Leydold Department of Statistics University of Economics and Business Administration Vienna Austria hormannw@boun.edu.tr

More information