Bayesian inference for factor scores


Murray Aitkin and Irit Aitkin
School of Mathematics and Statistics, University of Newcastle, UK
October 2003

Abstract

Bayesian inference for the parameters of the factor model follows directly from the likelihood and the prior distributions for the model parameters. Inference about the factor scores themselves is more complex, but can be accommodated in the complete-data form of the model using Markov chain Monte Carlo methods. This approach has interesting connections to the EM algorithm approach to maximum likelihood estimation, and it casts light on the controversy over factor score estimation and factor indeterminacy.

1 Introduction

Bayesian methods are rapidly becoming more popular with the greatly increased power of Markov chain Monte Carlo (MCMC) methods for inference in complex models. Such features as missing or incomplete data, latent variables and non-conjugate prior distributions can be handled in a unified way. Bayesian methods solve inferential problems that are difficult to deal with in frequentist (repeated-sampling) theory: the inadequacy of asymptotic theory in small samples and the difficulties of second-order asymptotics; the difficulties of maximum likelihood methods with complex models and partial or incomplete data. The Bayesian solution of these problems, as for all models, requires a full prior specification for the model parameters and all other unobserved variables. Bayesian analysis of the factor model has been treated in considerable generality by Press and Shigemasu (1989, 1997), who give a good coverage of earlier work; in the context of this volume, Bayesian methods have been used for the problem of Heywood cases by Martin and McDonald (1975). A book by Rowe (2002) gives a comprehensive Bayesian treatment of the factor model. Data augmentation (DA) methods, including MCMC methods, are discussed in detail

in Tanner (1996), but these have apparently not been applied to the factor model apart from the maximum likelihood EM algorithm approach of Rubin and Thayer (1982). In this chapter we describe the fully Bayesian DA approach.

2 The single-factor model

We adopt the standard notation of upper-case letters for random variables and lower-case letters for their observed values. For the single common-factor model with p test variables Y and a single unobserved factor X,

$$Y \mid X = x \sim N(\mu + \lambda x, \Psi), \qquad X \sim N(0, 1),$$

and the marginal distribution of Y is

$$Y \sim N(\mu, \Sigma), \qquad \Sigma = \lambda\lambda' + \Psi,$$

where μ is a length-p column vector of means, λ is a length-p column vector of factor loadings, and Ψ = diag(ψ₁², ..., ψ_p²) is a p × p diagonal matrix of specific variances. We restrict X to be standard normal because of the unidentifiability of its mean and variance parameters. It follows immediately that the maximum likelihood estimate (MLE) of μ is ȳ. Maximum likelihood methods for the estimation of λ and Ψ from data y_i (i = 1, ..., n), together with large-sample standard errors from the information matrix, are implemented in many packages, and will not concern us here apart from the EM algorithm approach of Rubin and Thayer (1982). In this approach we regard the unobserved factor variables X_i as missing data; in the complete-data model in which the x_i are counterfactually observed, the complete-data log-likelihood is, omitting constants,

$$\ell = \log L(\mu, \lambda, \Psi) = -\frac{n}{2}\log|\Psi| - \frac{1}{2}\sum_{i=1}^n x_i^2 - \frac{1}{2}\sum_{i=1}^n (y_i - \mu - \lambda x_i)' \Psi^{-1} (y_i - \mu - \lambda x_i)$$

$$= -\frac{n}{2}\sum_{j=1}^p \log \psi_j^2 - \frac{1}{2}\sum_{i=1}^n x_i^2 - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^p (y_{ij} - \mu_j - \lambda_j x_i)^2/\psi_j^2,$$

which is equivalent to a sum of p separate log-likelihoods from the p regressions of Y_j on x with intercept μ_j, slope λ_j and variance ψ_j². The term Σ_i x_i² does not involve unknown parameters and can be omitted.
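The marginal covariance structure of the model can be checked directly by simulation. The following sketch uses numpy with hypothetical parameter values chosen purely for illustration (they are not the chapter's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, for illustration only.
p, n = 4, 200_000
mu = np.zeros(p)                              # mean vector
lam = np.array([0.6, 0.8, 0.9, 0.2])          # factor loadings
psi = np.array([0.64, 0.36, 0.19, 0.96])      # specific variances (diagonal of Psi)

# Complete-data model: X ~ N(0, 1), then Y | X = x ~ N(mu + lam*x, Psi).
x = rng.standard_normal(n)
y = mu + np.outer(x, lam) + rng.standard_normal((n, p)) * np.sqrt(psi)

# The implied marginal covariance of Y is Sigma = lam lam' + Psi.
sigma = np.outer(lam, lam) + np.diag(psi)
max_err = np.abs(np.cov(y, rowvar=False) - sigma).max()
```

For large n the sample covariance of the simulated y reproduces Σ to within sampling error, confirming that the single factor induces all of the covariation between the variables.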

The sufficient statistics in these regressions involve the x_i and x_i²; in the E step of the algorithm these are replaced by their conditional expectations given the current estimates of the parameters. Standard calculations give the conditional distribution of X given Y = y and the parameters as

$$X \mid Y = y \sim N\big(\lambda' \Sigma^{-1}(y - \mu),\; 1 - \lambda' \Sigma^{-1} \lambda\big).$$

Here ρ² = λ′Σ⁻¹λ is the squared multiple correlation of the factor X with the variables Y, so the conditional variance of X given Y = y is 1 − ρ². In the E step of EM the unobserved x_i are replaced by

$$\tilde{x}_i = \lambda' \Sigma^{-1} (y_i - \mu)$$

and the unobserved x_i² are replaced by

$$\widetilde{x_i^2} = \tilde{x}_i^2 + 1 - \rho^2,$$

where the parameters are replaced by their current estimates. In the M step of EM new estimates of the parameters are obtained from the p regressions by solving the score equations

$$\hat{\lambda}_j = \sum_i (y_{ij} - \bar{y}_j)\tilde{x}_i \Big/ \Big[\sum_i \tilde{x}_i^2 + n(1 - \rho^2)\Big],$$

$$\hat{\psi}_j^2 = \sum_i (y_{ij} - \bar{y}_j - \hat{\lambda}_j \tilde{x}_i)^2/n + (1 - \rho^2)\hat{\lambda}_j^2.$$

The EM algorithm may converge very slowly if the regression of Y on X is weak, and further numerical work is needed for the information matrix and (large-sample) standard errors of the parameter estimates.

3 Bayesian analysis

Bayesian analysis of the factor model, as of any other model, requires prior distributions for the model parameters; the product of the joint prior and the likelihood gives the joint posterior distribution (after normalization), and any marginal posterior distribution of interest may be computed by integrating out the remaining parameters. Prior distributions may be diffuse, representing little information, or informative, representing real external information relevant to the current sample. Conjugate prior distributions are widely used where they exist; by setting the ("hyper-") parameters in these distributions at appropriate values, they can be made to represent a range of information from diffuse to precise. Since the factor model is essentially a set of conditionally independent linear regressions, diffuse priors are the same as for a regression model: flat on μ, λ and log ψ_j².
Conjugate priors are normal for μ and λ, and inverse gamma for ψ_j².
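The E and M steps described above can be sketched numerically. The following is a minimal illustration of the Rubin-Thayer EM scheme (our own sketch, not the authors' code), run on data simulated with hypothetical parameter values:

```python
import numpy as np

def em_factor(y, n_iter=500):
    """EM for the single-factor model: the E step imputes the conditional
    moments of X, the M step solves p independent regression score equations."""
    n, p = y.shape
    yc = y - y.mean(axis=0)               # mu-hat = y-bar is fixed throughout
    lam = np.full(p, 0.5)                 # crude starting values
    psi = yc.var(axis=0)
    for _ in range(n_iter):
        # E step: x-tilde_i = lam' Sigma^{-1}(y_i - y-bar), rho^2 = lam' Sigma^{-1} lam
        b = np.linalg.solve(np.outer(lam, lam) + np.diag(psi), lam)
        rho2 = lam @ b
        xt = yc @ b
        sum_x2 = xt @ xt + n * (1.0 - rho2)       # sum of E(X_i^2 | y_i)
        # M step: slope and residual variance of each regression of y_j on x
        lam = (yc.T @ xt) / sum_x2
        psi = ((yc - np.outer(xt, lam)) ** 2).sum(axis=0) / n + (1.0 - rho2) * lam ** 2
    return lam, psi

# Hypothetical example: recover the loadings from simulated data.
rng = np.random.default_rng(1)
lam0 = np.array([0.6, 0.8, 0.9, 0.2])
psi0 = np.array([0.64, 0.36, 0.19, 0.96])
x0 = rng.standard_normal(5000)
y = np.outer(x0, lam0) + rng.standard_normal((5000, 4)) * np.sqrt(psi0)
lam_hat, psi_hat = em_factor(y)
```

Note that the sign of λ is not identified (λ, x and -λ, -x give the same likelihood), so the estimates recover the loadings only up to a common sign flip.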

The mean μ is of no inferential interest, so it is convenient to integrate it out immediately from the posterior distribution. The multivariate normal likelihood can be written

$$L(\mu, \Sigma) = \frac{1}{|\Sigma|^{n/2}} \exp\Big[-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)'\Sigma^{-1}(y_i - \mu)\Big]$$

$$= \frac{1}{|\Sigma|^{1/2}} \exp\Big[-\frac{n}{2}(\bar{y} - \mu)'\Sigma^{-1}(\bar{y} - \mu)\Big] \cdot \frac{1}{|\Sigma|^{(n-1)/2}} \exp\Big[-\frac{1}{2}\sum_{i=1}^n (y_i - \bar{y})'\Sigma^{-1}(y_i - \bar{y})\Big].$$

A flat prior on μ leaves this unchanged, and integrating out μ gives directly the marginal likelihood

$$M(\Sigma) = \frac{1}{|\Sigma|^{(n-1)/2}} \exp\Big[-\frac{n}{2}\,\mathrm{tr}\,S\Sigma^{-1}\Big],$$

where S = Σ_{i=1}^n (y_i − ȳ)(y_i − ȳ)′/n is the sample covariance matrix; as in frequentist theory, the analysis of the factor model may be based on this matrix. Because the structure Σ = Ψ + λλ′ does not lead to any simple form for the posterior distributions of the λ_j and ψ_j², it is simpler to approach the posterior distributions of these parameters indirectly, through the complete-data model. Since, conditional on the x_i, the regressions of y_j on x are independent with unrelated parameters, it follows immediately from standard Bayesian results for regression models that

$$(\mu_j, \lambda_j)' \mid x, \psi_j^2 \sim N\big((\hat{\mu}_j, \hat{\lambda}_j)',\; \psi_j^2 (X'X)^{-1}\big), \qquad (n-2)s_j^2/\psi_j^2 \mid x \sim \chi^2_{n-2},$$

where x = (x₁, ..., x_n)′, X = (1, x) is the n × 2 design matrix with

$$X'X = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix},$$

and

$$\hat{\mu}_j = \bar{y}_j - \hat{\lambda}_j\bar{x}, \qquad \hat{\lambda}_j = S_{jx}/S_{xx}, \qquad (n-2)s_j^2 = S_{jj} - S_{jx}^2/S_{xx},$$

$$S_{jj} = \sum_i (y_{ij} - \bar{y}_j)^2, \qquad S_{jx} = \sum_i (y_{ij} - \bar{y}_j)x_i, \qquad S_{xx} = \sum_i (x_i - \bar{x})^2.$$

Since the individual λ_j given x are conditionally independent, the joint conditional distribution of the λ_j given x and the ψ_j² is multivariate normal with a

diagonal covariance matrix, so integrating out the ψ_j², the joint distribution of the λ_j given x is multivariate t, with the marginal distributions of the individual λ_j, given x, being

$$\frac{(\lambda_j - \hat{\lambda}_j)\sqrt{S_{xx}}}{s_j} \sim t_{n-2}.$$

We cannot proceed further analytically: integrating out x as well gives an intractable marginal distribution for the λ_j and ψ_j² because of the complex appearance of x in the conditional distributions.

3.1 Inference about the factor scores

One standard approach to factor score estimation in repeated-sampling theory is to use the conditional mean x̃_i as the estimate of x_i: the regression estimate of the factor score. This estimate requires the ML estimates of μ, λ and Ψ to be substituted for the true values, introducing uncertainty which is difficult to allow for; though the delta method may be used to find large-sample standard errors for nonlinear functions like λ′Σ⁻¹, it does not give a reliable representation of uncertainty in small to medium samples. A further difficulty is that the regression estimate is only the (conditional) mean of the factor score distribution: the conditional variance is ignored in this representation. This underlies the criticism of regression estimates by Guttman and others, discussed below. In Bayes theory, inference about X, like that about the model parameters, is based on its posterior distribution. The conditional distribution of X given the parameters is normal, as given in Section 2, but in the Bayesian analysis we have to integrate the parameters out of this conditional distribution (not substitute the MLEs for the unknowns), with respect to their conditional distribution given the data. Unfortunately, integrating out λ and Ψ from the conditional distribution of X given these parameters is intractable, though Press and Shigemasu (1997) showed, in a more general model than ours, that the marginal posterior distribution of X is asymptotically multivariate matrix T.
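The plug-in regression estimate and the conditional variance it suppresses are straightforward to compute once estimates of λ and Ψ are available. The values below are hypothetical stand-ins for the MLEs, chosen only to illustrate the calculation:

```python
import numpy as np

# Hypothetical stand-ins for the ML estimates of lambda, Psi and mu.
lam_hat = np.array([0.6, 0.8, 0.9, 0.2])
psi_hat = np.array([0.64, 0.36, 0.19, 0.96])
mu_hat = np.zeros(4)

sigma_hat = np.outer(lam_hat, lam_hat) + np.diag(psi_hat)
b = np.linalg.solve(sigma_hat, lam_hat)       # lambda' Sigma^{-1}
rho2 = lam_hat @ b                            # squared multiple correlation

y_obs = np.array([1.2, 0.8, 1.1, 0.3])        # one subject's (made-up) scores
x_reg = b @ (y_obs - mu_hat)                  # regression estimate E(X | y)
cond_var = 1.0 - rho2                         # conditional variance, suppressed by the point estimate
```

A useful identity here (from the Woodbury formula) is λ′Σ⁻¹ = λ′Ψ⁻¹/(1 + λ′Ψ⁻¹λ), which shows that 0 < ρ² < 1 whenever the specific variances are positive.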
The attraction of Bayesian methods is that they can give exact (to within simulation error) posterior distributions without asymptotic approximations. We now consider simulation methods to obtain these.

3.2 Data augmentation

The close parallel between the EM algorithm approach to maximum likelihood, using the complete-data model, and the Bayesian analysis of the same model can be turned to advantage using a simulation approach, called Data Augmentation (DA) by Tanner and Wong (1987). We augment the observed data by the unobserved factor x, using the same complete-data model as for the EM algorithm. Write θ = (μ, λ, Ψ) for the full vector of parameters. Then the conditional posterior distribution of θ given

x and y is π(θ | x, y), and the conditional distribution of x given θ and y is π(x | θ, y). Our object is to compute the marginal posterior distribution π(θ | y) of θ given y, and the predictive distribution π(x | y) of x. The data augmentation (DA) algorithm achieves this by cycling between the two conditionals, much as the EM algorithm cycles between the E and M steps. However, in DA we perform not expectations as in the E step, but full simulations from the conditional distributions (Tanner 1996), and convergence is in distribution, rather than to a function maximum. One full cycle of the DA algorithm consists of:

Imputation step: generate a sample of M values x^[1], ..., x^[M] from the current approximation to the predictive distribution π(x | y).

Posterior step: update the current approximation to π(θ | y) to be the mixture of conditional posteriors of θ, given y and the augmented data x^[1], ..., x^[M]:

$$\pi(\theta \mid y) = \frac{1}{M}\sum_{m=1}^M \pi(\theta \mid x^{[m]}, y).$$

Generate a sample of M values θ^[1], ..., θ^[M] from π(θ | y). For each θ^[m], generate a random value x^[m] of x from π(x | θ^[m], y).

We repeat these cycles until the posterior distributions of θ and x converge, or stabilise. To assess this stability we track summary statistics of the posterior distributions of the parameters; we illustrate below with the medians and quartiles of the model parameters. These cycles can be carried out with relatively small M, like M = 5, to save computing time. Once the posterior distributions have converged, M may be increased to a larger number to give the posterior distribution to high accuracy. We use kernel density estimation to give a smooth graphical picture of the posteriors, though the values themselves may be used to make any needed probability statements about the parameters.
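For the single-factor model the two conditional distributions that DA cycles between can be written down directly from the results of Section 3. The following sketch (our own illustration, assuming diffuse priors) implements each draw; with M = 1 the scheme reduces to a Gibbs sampler:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_theta_given_x(y, x):
    """Posterior step: draw (mu_j, lambda_j, psi_j^2), j = 1..p, from their
    conditional posteriors given the imputed factor scores x (diffuse priors)."""
    n, p = y.shape
    X = np.column_stack([np.ones(n), x])          # design matrix [1, x]
    XtX_inv = np.linalg.inv(X.T @ X)
    mu, lam, psi = np.empty(p), np.empty(p), np.empty(p)
    for j in range(p):
        beta_hat = XtX_inv @ (X.T @ y[:, j])
        rss = np.sum((y[:, j] - X @ beta_hat) ** 2)
        psi[j] = rss / rng.chisquare(n - 2)       # scaled inverse-chi-square draw
        mu[j], lam[j] = rng.multivariate_normal(beta_hat, psi[j] * XtX_inv)
    return mu, lam, psi

def draw_x_given_theta(y, mu, lam, psi):
    """Imputation step: x_i ~ N(lambda' Sigma^{-1}(y_i - mu), 1 - rho^2)."""
    b = np.linalg.solve(np.outer(lam, lam) + np.diag(psi), lam)
    rho2 = lam @ b
    return (y - mu) @ b + np.sqrt(1.0 - rho2) * rng.standard_normal(len(y))
```

Cycling these two draws and retaining the values after burn-in gives simulated values from the joint posterior of θ and x.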
This process can be substantially accelerated by starting from an approximate posterior for θ based on the ML estimates θ̂ and information matrix I of the parameters from an ML routine, though it can also start from a random starting point, as we show in the example. At convergence we have the full (marginal over x) posterior distribution of θ, and the (marginal over θ) posterior distribution of X = (X₁, ..., X_n)′, so the marginal posterior distribution of any individual factor score X_i follows immediately. An obvious, but fundamental, point is that the inference about X_i is its distribution, conditional on the observed responses y_ij. From the M simulated values of X_i, we could compute the posterior mean and variance, and these would be corrected for the underestimation of variability in the plug-in conditional mean and variance. But this is unnecessary, because we have the full

(simulated) distribution of X: the distribution of the simulated values represents the real information about X. That is, X is not a parameter which can be estimated by ML with a standard error, but a random variable.

4 An example

We illustrate the DA analysis with n observations from a four-variate example, with μ = (0, 0, 0, 0)′, λ = ( , , , )′ and diag Ψ = (.36, , , .96). We generate values x_i randomly from N(0, 1), and compute the data values

$$y_{ij} = \mu_j + \lambda_j x_i + e_{ij}, \qquad i = 1, \ldots, n; \; j = 1, \ldots, 4,$$

where the e_ij are randomly generated from N(0, ψ_j²). To illustrate the power and capabilities of the DA approach, we do not use an ML factor analysis package to get initial estimates of the parameters, but begin with a set of M random values x_im, m = 1, ..., M, generated from N(0, 1) for each observation i. For each m we fit the regression of each y_j on x_m, obtaining MLEs of the model parameters. We then draw, for each m and each j, random values μ_jm, λ_jm and ψ_jm² from the respective current conditional posterior distributions of these parameters given y and x. The values of all the parameters for each m are conceptually assigned mass 1/M in the discrete joint posterior distribution of all these parameters; the updated posterior is the unweighted mean of the M individual conditional posteriors. This completes one posterior step of the DA algorithm. To generate random parameter values from the current marginal posterior, we draw a random integer m* in the range (1, M), and select the corresponding parameter vector indexed m* from the above discrete posterior distribution. Given the parameter vector (μ^[m*], λ^[m*], ψ²^[m*]) we compute the posterior distribution of x given the y_ij, and draw one random vector x^[m*] from this distribution. We repeat this process of random integer drawing and random generation of x M times, obtaining x_im, m = 1, ..., M. This completes one full cycle of the DA algorithm.
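Putting the steps of this section together, one possible self-contained implementation of the M-imputation scheme is the following sketch (with made-up parameter values and diffuse priors; it is our own illustration, not the authors' STATA program):

```python
import numpy as np

rng = np.random.default_rng(3)

def da_cycle(y, xs):
    """One full cycle of the DA algorithm: a posterior step over the M current
    imputations, then M imputation draws using randomly selected parameter vectors."""
    n, p = y.shape
    M = len(xs)
    thetas = []
    for x in xs:                                   # posterior step: p regressions per imputation
        X = np.column_stack([np.ones(n), x])
        XtX_inv = np.linalg.inv(X.T @ X)
        mu, lam, psi = np.empty(p), np.empty(p), np.empty(p)
        for j in range(p):
            beta_hat = XtX_inv @ (X.T @ y[:, j])
            rss = np.sum((y[:, j] - X @ beta_hat) ** 2)
            psi[j] = rss / rng.chisquare(n - 2)
            mu[j], lam[j] = rng.multivariate_normal(beta_hat, psi[j] * XtX_inv)
        thetas.append((mu, lam, psi))
    new_xs = []
    for _ in range(M):                             # imputation step: random index m*, then x | theta, y
        mu, lam, psi = thetas[rng.integers(M)]
        b = np.linalg.solve(np.outer(lam, lam) + np.diag(psi), lam)
        new_xs.append((y - mu) @ b + np.sqrt(1.0 - lam @ b) * rng.standard_normal(n))
    return thetas, new_xs

# Hypothetical data, and random starting imputations as in the chapter's example.
lam0, psi0 = np.array([0.6, 0.8, 0.9, 0.2]), np.array([0.64, 0.36, 0.19, 0.96])
n, M = 1000, 20
y = np.outer(rng.standard_normal(n), lam0) + rng.standard_normal((n, 4)) * np.sqrt(psi0)
xs = [rng.standard_normal(n) for _ in range(M)]
for _ in range(100):
    thetas, xs = da_cycle(y, xs)
```

Because the likelihood is invariant to the sign flip (λ, x) → (-λ, -x), different imputations may settle in opposite-sign modes; the magnitudes of the loadings and the specific variances are unaffected.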
We show in Figure 1 the median and upper and lower quartiles of the posterior distributions of the factor loadings λ_j and specific variances ψ_j² over the iterations of the DA algorithm. The distributions of the intercepts μ_j are very stable around zero and we do not show them. The iterations together required about 5 hours of computing time on a Dell workstation. All programming was done in STATA 8. It is clear that convergence of the algorithm for most of the λ_j requires a modest number of iterations (this initial period is known as the burn-in period in the Bayesian literature), but many more iterations are needed for the variances ψ_j², especially for the smallest variance and the corresponding loading: as for the EM algorithm, convergence of a parameter value near a zero boundary is much slower. The rate

of convergence is parametrization-dependent, an important issue in large-scale Markov chains for complex models; convergence may be substantially improved by choosing near-orthogonal parametrizations. From the retained draws for each parameter we compute a kernel density using a Gaussian kernel, choosing the bandwidth to give a smooth density. The kernel densities for all the parameters are shown in Figure 2. The density for the smallest specific variance is shown on the log scale, as the kernel method does not restrict the density estimate to positive values of ψ². All posterior densities have the true parameter values within the 95% credible region, though some are near the edges. The densities of the loadings λ_j are slightly skewed and have slightly longer tails than the normal distribution; those for the μ_j are very close to normality, and those for the ψ_j² are quite skewed as expected, especially for small ψ_j². We show in Figure 3 the kernel posterior density for x₁ (solid curve), together with the empirical Bayes normal density N(λ̂′V̂⁻¹(y − μ̂), 1 − ρ̂²) (dashed curve) using the maximum likelihood estimates of the parameters from a standard factor analysis routine; these are given in Table 1.

Table 1: MLEs of the parameters μ̂_j, λ̂_j, ψ̂_j²

For reference, the true value of x₁ is .94. The much greater dispersion of the posterior density is quite marked, showing the importance of allowing correctly for uncertainty in the parameters. The figure also shows the 2.5% (−.537) and 97.5% points of the x₁ distribution, giving the 95% credible interval for x₁. The corresponding interval from the plug-in normal distribution is much shorter: its coverage is only 76% under the true posterior distribution of x₁. We remark again that preliminary estimates of the parameters are not required: it is quite striking that the DA algorithm converges to stable posterior distributions relatively quickly, given the well-known slow convergence of the EM algorithm in this model.
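Summaries like those in Figures 2 and 3 come straight from the retained draws. The following sketch uses simulated stand-in draws (not the chapter's actual posterior sample) to show the credible interval, a hand-rolled Gaussian kernel density estimate, and the posterior of a comparison between two factor scores:

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in posterior draws for two factor scores (in practice, DA output).
x1 = -0.9 + 0.55 * rng.standard_normal(10_000)
x2 = 0.3 + 0.60 * rng.standard_normal(10_000)

# 95% equal-tailed credible interval for x_1 from the draws.
lo, hi = np.percentile(x1, [2.5, 97.5])

# Gaussian kernel density estimate on a grid (normal reference bandwidth).
h = 1.06 * x1.std() * len(x1) ** (-0.2)
grid = np.linspace(x1.min(), x1.max(), 200)
dens = np.exp(-0.5 * ((grid[:, None] - x1[None, :]) / h) ** 2).sum(axis=1) \
       / (len(x1) * h * np.sqrt(2 * np.pi))

# The posterior of any comparison, e.g. x_1 - x_2, is the distribution of
# that contrast over the draws.
prob_lower = (x1 - x2 < 0).mean()
```

Any probability statement about a factor score, or a function of several factor scores, is computed the same way: as a proportion or quantile of the simulated values.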
5 Relation to factor indeterminacy

The DA analysis casts light on the controversy over factor score estimation and factor indeterminacy (Guttman 1955, McDonald 1974). The standard practice at the time for factor score estimation was to use the mean of the conditional distribution of X | Y = y as a point estimate of x. The estimates of the parameters were substituted for their true values in this approach.

This substitution approach is well-documented in modern random effect models as the empirical Bayes approach: the posterior distribution (typically normal) of the random effects, depending on the unknown model parameters, is estimated by the same distribution with the unknown parameters replaced by their MLEs, called plug-in estimates. This distribution is used to make full distributional statements about the random effects, not just the mean. The distributional statements are deficient because the uncertainty in the MLEs is not allowed for, and so the true variance of the posterior distribution is underestimated (as in Figure 3). Guttman was however concerned with a different issue: the behavior of the regression estimate as a random variable in repeated sampling. He considered the sampling distribution of the regression estimate

$$\tilde{X} = \lambda' V^{-1}(Y - \mu),$$

where V = Σ is the marginal covariance matrix, averaged over the distribution of Y. Conditionally on Y = y,

$$X \mid Y = y \sim N\big(\lambda' V^{-1}(y - \mu),\; 1 - \rho^2\big),$$

but if we average the distribution of X̃ over Y, we have

$$\tilde{X} \sim N(0, \lambda' V^{-1} \lambda),$$

which is N(0, ρ²). So the unconditional distribution of X̃, averaged across varying sample values of Y, has zero mean but variance ρ², not 1. If regression estimates were really estimates of the factor scores, then it seemed axiomatic that they should have the right distribution: that of the true factor scores. They clearly failed this requirement. Guttman expressed this failure through a conceptual simulation experiment: given the true values of all the parameters, generate a random error term ε from N(0, 1 − ρ²) and add this to the regression estimate, giving a new estimate Z₁ = X̃ + ε. Then the unconditional distribution of Z₁ would be N(0, 1), like that of X. Now imagine two such error terms with opposite signs: given ε, we could equally well have observed −ε, and this could have been used to obtain Z₂ = X̃ − ε (with the same ε).
The correlation of Z₁ and Z₂, in repeated random generations, would be

$$r = \frac{\mathrm{cov}(Z_1, Z_2)}{\sqrt{\mathrm{var}(Z_1)\,\mathrm{var}(Z_2)}} = \mathrm{var}(\tilde{X}) - \mathrm{var}(\epsilon) = \rho^2 - (1 - \rho^2) = 2\rho^2 - 1,$$

which is negative for ρ² < .5, a high value for any regression model in psychology, let alone a factor model. Guttman argued that two equally plausible values of X which correlated negatively would cast doubt on the whole factor model, and he coined the term factor indeterminacy for this aspect of the factor model. Since the primary

role of the factor model was to make statements about individual unobserved abilities, Guttman concluded that the model could not be used in this way except for models with very high variable-factor score correlations. Factor analysis was based on a shaky assumption.

The Bayesian framework helps to explain why this criticism was overstated. The essential feature of Bayesian analysis is that the information about the model parameters and any other unobserved quantities is expressed through posterior distributions, conditional on the observed data. This applies to the unobserved factor variables, and provides a very natural conditional distribution for an individual subject's factor score. The regression estimate, the mean of this conditional distribution, is indeed an unsuitable summary representation of the factor score, as it suppresses the conditional variance, and more generally the whole conditional distribution, of the factor score. Guttman's criticism, viewed in this light, makes just this point, that the variance is being overlooked: conditional means cannot be used as a surrogate for the true values without some way of representing the conditional uncertainty. But this failure to represent correctly the uncertainty about factor scores does not cast doubt on the factor model, merely on the technical tools with which the uncertainty in factor scores is represented. One such representation would be to present the plug-in conditional variance as well as the plug-in conditional mean. But the Bayesian representation is more informative: it gives the full distribution of the uncertain factor score, and allows properly for the estimation of the parameters on which this distribution is based; the uncertainty in the factor loading and specific variance parameters is also correctly represented, in a much richer way than by the MLE and information matrix.
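Guttman's conceptual experiment is easy to reproduce by simulation. In this sketch ρ² = 0.4 is an arbitrary illustrative value below the .5 threshold at which the correlation turns negative:

```python
import numpy as np

rng = np.random.default_rng(5)

rho2 = 0.4                                        # hypothetical squared multiple correlation
n = 200_000
xt = np.sqrt(rho2) * rng.standard_normal(n)       # regression estimate: Xtilde ~ N(0, rho^2)
eps = np.sqrt(1 - rho2) * rng.standard_normal(n)  # completing error: N(0, 1 - rho^2)
z1, z2 = xt + eps, xt - eps                       # two equally plausible completions

# Both are N(0, 1) marginally, but their correlation is 2*rho^2 - 1,
# which is negative whenever rho^2 < 0.5.
r = np.corrcoef(z1, z2)[0, 1]
```

Here r comes out near 2(0.4) - 1 = -0.2, while each of z1 and z2 has unit variance, exactly the tension Guttman pointed to.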
The simulation approach is even more general than we have demonstrated: it can be applied to differences among individuals in ability, by simply computing the M values of any comparison x_i − x_i′ of interest. Such comparisons are widely used in small-area estimation and other empirical Bayes applications in multi-level models, where differences among regions, areas or centres are of importance. Such differences are always overstated by empirical Bayes methods because of the systematic underestimation of variability from the use of plug-in ML estimates. The full joint posterior distribution of the model parameters provides even more information, for which there is no frequentist equivalent. Figure 4 is a composite plot of each parameter against every other, for a random sub-sample of values drawn from the posterior (the sub-sampling is necessary for clarity). It is immediately clear that correlations are very high between the three smaller loadings, and that the factor loading and specific variance of one variable are highly negatively correlated. The correlation matrix of the parameters (from the full set of values), shown in Table 2, bears this out.

Table 2: Correlation coefficients of the parameters

6 Conclusion

The data augmentation algorithm, and more general Markov chain Monte Carlo methods, provide the Bayesian analysis of the factor model. No new issues arise in the general multiple common-factor case except for the rotational invariance problem: given a fixed covariance matrix for the factors, their posterior distribution can be simulated in the same way as for a single factor. The computing time required for the full Bayesian analysis is substantial, but the richness of information from the full joint posterior distribution more than compensates for the computational effort. MCMC packages like BUGS (Bayesian inference Using Gibbs Sampling) are widely available: we can look forward confidently to their future use with factor models, and other latent variable models, of much greater complexity than the simple model discussed here.

7 Acknowledgement

We have benefited from discussions with our colleague Darren Wilkinson, and from editorial suggestions from Albert Maydeu-Olivares.

8 References

Guttman, L. (1955) The determinacy of factor score matrices with implications for five other basic problems of common factor theory. British Journal of Statistical Psychology 8, 65-81.

McDonald, R.P. (1974) The measurement of factor indeterminacy. Psychometrika 39, 203-222.

Martin, J.K. and McDonald, R.P. (1975) Bayesian estimation in unrestricted factor analysis: a treatment for Heywood cases. Psychometrika 40, 505-517.

Press, S.J. and Shigemasu, K. (1989) Bayesian inference in factor analysis. In Gleser, L.J. et al. (eds.), Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin. Springer-Verlag, New York.

Press, S.J. and Shigemasu, K. (1997) Bayesian inference in factor analysis - revised. Technical Report No. 43, Department of Statistics, University of

California, Riverside.

Rubin, D.B. and Thayer, D. (1982) EM algorithms for ML factor analysis. Psychometrika 47, 69-76.

Rowe, D.B. (2002) Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. CRC Press, Boca Raton.

Tanner, M.A. (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer, New York.

Tanner, M.A. and Wong, W.H. (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82, 528-540.

Figure 1: Iteration history

Figure 2: Kernel densities

Figure 3: Posterior density of x₁ and empirical Bayes density

Figure 4: Scatter plots of all parameters


More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

PIRLS 2016 Achievement Scaling Methodology 1

PIRLS 2016 Achievement Scaling Methodology 1 CHAPTER 11 PIRLS 2016 Achievement Scaling Methodology 1 The PIRLS approach to scaling the achievement data, based on item response theory (IRT) scaling with marginal estimation, was developed originally

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Bayesian Inference for the Multivariate Normal

Bayesian Inference for the Multivariate Normal Bayesian Inference for the Multivariate Normal Will Penny Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK. November 28, 2014 Abstract Bayesian inference for the multivariate

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

Statistical Methods. Missing Data  snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23 1 / 23 Statistical Methods Missing Data http://www.stats.ox.ac.uk/ snijders/sm.htm Tom A.B. Snijders University of Oxford November, 2011 2 / 23 Literature: Joseph L. Schafer and John W. Graham, Missing

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Estimation of Operational Risk Capital Charge under Parameter Uncertainty

Estimation of Operational Risk Capital Charge under Parameter Uncertainty Estimation of Operational Risk Capital Charge under Parameter Uncertainty Pavel V. Shevchenko Principal Research Scientist, CSIRO Mathematical and Information Sciences, Sydney, Locked Bag 17, North Ryde,

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information

Basics of Modern Missing Data Analysis

Basics of Modern Missing Data Analysis Basics of Modern Missing Data Analysis Kyle M. Lang Center for Research Methods and Data Analysis University of Kansas March 8, 2013 Topics to be Covered An introduction to the missing data problem Missing

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School

More information

Stable Limit Laws for Marginal Probabilities from MCMC Streams: Acceleration of Convergence

Stable Limit Laws for Marginal Probabilities from MCMC Streams: Acceleration of Convergence Stable Limit Laws for Marginal Probabilities from MCMC Streams: Acceleration of Convergence Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham NC 778-5 - Revised April,

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Lecture Notes based on Koop (2003) Bayesian Econometrics

Lecture Notes based on Koop (2003) Bayesian Econometrics Lecture Notes based on Koop (2003) Bayesian Econometrics A.Colin Cameron University of California - Davis November 15, 2005 1. CH.1: Introduction The concepts below are the essential concepts used throughout

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE

FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE Donald A. Pierce Oregon State Univ (Emeritus), RERF Hiroshima (Retired), Oregon Health Sciences Univ (Adjunct) Ruggero Bellio Univ of Udine For Perugia

More information

Empirical Validation of the Critical Thinking Assessment Test: A Bayesian CFA Approach

Empirical Validation of the Critical Thinking Assessment Test: A Bayesian CFA Approach Empirical Validation of the Critical Thinking Assessment Test: A Bayesian CFA Approach CHI HANG AU & ALLISON AMES, PH.D. 1 Acknowledgement Allison Ames, PhD Jeanne Horst, PhD 2 Overview Features of the

More information

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017 Chalmers April 6, 2017 Bayesian philosophy Bayesian philosophy Bayesian statistics versus classical statistics: War or co-existence? Classical statistics: Models have variables and parameters; these are

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

More information

Bayesian Methods in Multilevel Regression

Bayesian Methods in Multilevel Regression Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

Bayesian Analysis of Latent Variable Models using Mplus

Bayesian Analysis of Latent Variable Models using Mplus Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative

More information

Multiple Imputation for Missing Data in Repeated Measurements Using MCMC and Copulas

Multiple Imputation for Missing Data in Repeated Measurements Using MCMC and Copulas Multiple Imputation for Missing Data in epeated Measurements Using MCMC and Copulas Lily Ingsrisawang and Duangporn Potawee Abstract This paper presents two imputation methods: Marov Chain Monte Carlo

More information

ASSESSING A VECTOR PARAMETER

ASSESSING A VECTOR PARAMETER SUMMARY ASSESSING A VECTOR PARAMETER By D.A.S. Fraser and N. Reid Department of Statistics, University of Toronto St. George Street, Toronto, Canada M5S 3G3 dfraser@utstat.toronto.edu Some key words. Ancillary;

More information

Rank Regression with Normal Residuals using the Gibbs Sampler

Rank Regression with Normal Residuals using the Gibbs Sampler Rank Regression with Normal Residuals using the Gibbs Sampler Stephen P Smith email: hucklebird@aol.com, 2018 Abstract Yu (2000) described the use of the Gibbs sampler to estimate regression parameters

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Environmentrics 00, 1 12 DOI: 10.1002/env.XXXX Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Regina Wu a and Cari G. Kaufman a Summary: Fitting a Bayesian model to spatial

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Bayesian inference for multivariate extreme value distributions

Bayesian inference for multivariate extreme value distributions Bayesian inference for multivariate extreme value distributions Sebastian Engelke Clément Dombry, Marco Oesting Toronto, Fields Institute, May 4th, 2016 Main motivation For a parametric model Z F θ of

More information

Alternative implementations of Monte Carlo EM algorithms for likelihood inferences

Alternative implementations of Monte Carlo EM algorithms for likelihood inferences Genet. Sel. Evol. 33 001) 443 45 443 INRA, EDP Sciences, 001 Alternative implementations of Monte Carlo EM algorithms for likelihood inferences Louis Alberto GARCÍA-CORTÉS a, Daniel SORENSEN b, Note a

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Factor Analysis. Robert L. Wolpert Department of Statistical Science Duke University, Durham, NC, USA

Factor Analysis. Robert L. Wolpert Department of Statistical Science Duke University, Durham, NC, USA Factor Analysis Robert L. Wolpert Department of Statistical Science Duke University, Durham, NC, USA 1 Factor Models The multivariate regression model Y = XB +U expresses each row Y i R p as a linear combination

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

Nonparametric Drift Estimation for Stochastic Differential Equations

Nonparametric Drift Estimation for Stochastic Differential Equations Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data Journal of Multivariate Analysis 78, 6282 (2001) doi:10.1006jmva.2000.1939, available online at http:www.idealibrary.com on Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone

More information

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction

More information

TAMS39 Lecture 10 Principal Component Analysis Factor Analysis

TAMS39 Lecture 10 Principal Component Analysis Factor Analysis TAMS39 Lecture 10 Principal Component Analysis Factor Analysis Martin Singull Department of Mathematics Mathematical Statistics Linköping University, Sweden Content - Lecture Principal component analysis

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Inferring biological dynamics Iterated filtering (IF)

Inferring biological dynamics Iterated filtering (IF) Inferring biological dynamics 101 3. Iterated filtering (IF) IF originated in 2006 [6]. For plug-and-play likelihood-based inference on POMP models, there are not many alternatives. Directly estimating

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

Online Appendix to: Marijuana on Main Street? Estimating Demand in Markets with Limited Access

Online Appendix to: Marijuana on Main Street? Estimating Demand in Markets with Limited Access Online Appendix to: Marijuana on Main Street? Estating Demand in Markets with Lited Access By Liana Jacobi and Michelle Sovinsky This appendix provides details on the estation methodology for various speci

More information

Measurement Error and Linear Regression of Astronomical Data. Brandon Kelly Penn State Summer School in Astrostatistics, June 2007

Measurement Error and Linear Regression of Astronomical Data. Brandon Kelly Penn State Summer School in Astrostatistics, June 2007 Measurement Error and Linear Regression of Astronomical Data Brandon Kelly Penn State Summer School in Astrostatistics, June 2007 Classical Regression Model Collect n data points, denote i th pair as (η

More information

Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study

Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study Gunter Spöck, Hannes Kazianka, Jürgen Pilz Department of Statistics, University of Klagenfurt, Austria hannes.kazianka@uni-klu.ac.at

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

Quantile POD for Hit-Miss Data

Quantile POD for Hit-Miss Data Quantile POD for Hit-Miss Data Yew-Meng Koh a and William Q. Meeker a a Center for Nondestructive Evaluation, Department of Statistics, Iowa State niversity, Ames, Iowa 50010 Abstract. Probability of detection

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Multistate Modeling and Applications

Multistate Modeling and Applications Multistate Modeling and Applications Yang Yang Department of Statistics University of Michigan, Ann Arbor IBM Research Graduate Student Workshop: Statistics for a Smarter Planet Yang Yang (UM, Ann Arbor)

More information

Lecture 16: Mixtures of Generalized Linear Models

Lecture 16: Mixtures of Generalized Linear Models Lecture 16: Mixtures of Generalized Linear Models October 26, 2006 Setting Outline Often, a single GLM may be insufficiently flexible to characterize the data Setting Often, a single GLM may be insufficiently

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

Quantitative Trendspotting. Rex Yuxing Du and Wagner A. Kamakura. Web Appendix A Inferring and Projecting the Latent Dynamic Factors

Quantitative Trendspotting. Rex Yuxing Du and Wagner A. Kamakura. Web Appendix A Inferring and Projecting the Latent Dynamic Factors 1 Quantitative Trendspotting Rex Yuxing Du and Wagner A. Kamakura Web Appendix A Inferring and Projecting the Latent Dynamic Factors The procedure for inferring the latent state variables (i.e., [ ] ),

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Bayesian Inference in the Multivariate Probit Model

Bayesian Inference in the Multivariate Probit Model Bayesian Inference in the Multivariate Probit Model Estimation of the Correlation Matrix by Aline Tabet A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science

More information