Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity


Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity

by Veronika Ročková and Edward I. George^1, The University of Pennsylvania

Revised April 20, 2015

Abstract

Rotational transformations have traditionally played a key role in enhancing the interpretability of factor analysis via post-hoc modifications of the factor model orientation. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we cross-fertilize these two paradigms within a unifying Bayesian framework. Our approach deploys intermediate factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity-inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection. By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations and (c) better oriented sparse solutions. For accurate recovery of factor loadings, we deploy a two-component refinement of the Laplace prior, the spike-and-slab LASSO prior. This prior is coupled with the Indian Buffet Process (IBP) prior to avoid the pre-specification of the factor cardinality. The ambient dimensionality is learned from the posterior, which is shown to reward sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional data, which would render posterior simulation impractical.

Keywords: Indian Buffet Process; Infinite-dimensional Factor Analysis; Factor Rotation; PXL-EM; Spike-and-slab LASSO.

^1 Veronika Ročková is Postdoctoral Research Associate, Department of Statistics, University of Pennsylvania, Philadelphia, PA, 19104, vrockova@wharton.upenn.edu. Edward I. George is Professor of Statistics, Department of Statistics, University of Pennsylvania, Philadelphia, PA, 19104, edgeorge@wharton.upenn.edu.

1 Bayesian Factor Analysis Revisited

Latent factor models aim to find regularities in the variation among multiple responses, and relate these to a set of hidden causes. This is typically done within a regression framework through a linear superposition of unobserved factors. The traditional setup for factor analysis consists of an n × G matrix Y = [y_1, ..., y_n]′ of n independent G-dimensional vector observations. For a fixed factor dimension K, the generic factor model is of the form

f(y_i | ω_i, B, Σ) ∼ N_G(B ω_i, Σ) (independently), ω_i ∼ N_K(0, I_K),  (1.1)

for 1 ≤ i ≤ n, where Σ = diag{σ_j²}_{j=1}^G is a diagonal matrix of unknown positive scalars, ω_i ∈ R^K is the i-th realization of the unobserved latent factors, and B ∈ R^{G×K} is the matrix of factor loadings that weight the contributions of the individual factors. Marginally, f(y_i | B, Σ) = N_G(0, BB′ + Σ), 1 ≤ i ≤ n, a decomposition which uses at most G(K + 1) parameters instead of the G(G + 1)/2 parameters in the unconstrained covariance matrix. Note that we have omitted an intercept term, assuming throughout that the responses have been centered. Fundamentally a multivariate regression with unobserved regressors, factor analysis is made more challenging by the uncertainty surrounding the factor dimensionality K and the orientation of the regressors.

A persistent difficulty associated with the factor model (1.1) has been that B is unidentified. In particular, any orthogonal transformation of the loading matrix and latent factors Bω_i = (BP)(P′ω_i) yields exactly the same distribution for Y. Although identifiability is not necessary for prediction or estimation of the marginal covariance matrix (Bhattacharya and Dunson, 2011), non-sparse orientations diminish the potential for interpretability, our principal focus here. Traditional approaches to obtaining interpretable loading patterns have entailed post-hoc rotations of the original solution. For instance, the varimax (Kaiser, 1958) post-processing step finds a rotation P that minimizes a complexity criterion and yields new loadings that are either very small (near zero) or large. As an alternative, regularization methods have been proposed that prioritize matrices with exact zeroes via a penalty/sparsity prior (Zou et al., 2006; Witten et al., 2009; Carvalho et al., 2008).

The main thrust of this contribution is to cross-fertilize these two paradigms within one unified framework. In our approach, model rotations are embedded within the learning process, greatly enhancing the effectiveness of regularization and providing an opportunity to search for a factor basis which best supports sparsity, when it in fact exists. Going further, our approach does not require pre-specification of the factor cardinality K. As with similar factor analyzers (Knowles and Ghahramani, 2011; Bhattacharya and Dunson, 2011), here the loading matrix B is extended to include infinitely many columns, where the number of effective factors remains finite with probability one.
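The marginal decomposition above can be checked numerically. The following minimal sketch simulates from (1.1) with an illustrative sparse loading matrix (the dimensions, loading pattern and seed are our own choices, not taken from the paper) and compares the sample covariance of Y with BB′ + Σ.

```python
# Minimal sketch: simulate n observations from the factor model (1.1) with a
# sparse loading matrix and check that the sample covariance approaches
# B B' + Sigma. All numeric settings below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, G, K = 2000, 10, 3

B = np.zeros((G, K))
B[0:4, 0] = 1.0      # factor 1 loads on variables 1-4
B[3:7, 1] = -0.8     # factor 2 loads on variables 4-7
B[6:10, 2] = 1.2     # factor 3 loads on variables 7-10
Sigma = np.diag(rng.uniform(0.5, 1.5, size=G))

Omega = rng.standard_normal((n, K))                 # latent factors omega_i ~ N(0, I_K)
Y = Omega @ B.T + rng.standard_normal((n, G)) @ np.sqrt(Sigma)

implied = B @ B.T + Sigma                           # marginal covariance B B' + Sigma
sample = np.cov(Y, rowvar=False)
print(np.max(np.abs(sample - implied)))             # small for large n
```

For large n the entrywise difference shrinks, illustrating that only BB′ + Σ, not B itself, is identified from the observed data.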

Our approach begins with a prior on the individual elements in B = {β_jk}_{j,k=1}^{G,∞} that induces posterior zeroes with high probability. Traditionally, this entails some variant of a spike-and-slab prior that naturally segregates the important coefficients from the ignorable (George and McCulloch, 1993; West, 2003; Carvalho et al., 2008; Rai and Daumé, 2008; Knowles and Ghahramani, 2011). The specification of such priors here relies on a latent binary allocation matrix Γ = {γ_jk}_{j,k=1}^{G,∞}, γ_jk ∈ {0, 1}, where γ_jk = 1 whenever the j-th variable is associated with the k-th factor. Given each γ_jk ∈ {0, 1} indicating whether β_jk should be ignored, a particularly appealing spike-and-slab variant has been

π(β_jk | γ_jk, λ_1) = (1 − γ_jk) δ_0(β_jk) + γ_jk φ(β_jk | λ_1),  (1.2)

where δ_0(·) is the spike distribution (an atom at zero) and φ(· | λ_1) is the absolutely continuous slab distribution with exponential tails or heavier, indexed by a hyper-parameter λ_1. Coupled with a suitable prior on γ_jk, the generative model (1.2) yields optimal rates of posterior concentration, both in linear regression (Castillo and van der Vaart, 2012; Castillo et al., 2015) and covariance matrix estimation (Pati et al., 2014). This methodological ideal, despite being amenable to posterior simulation, poses serious computational challenges in high-dimensional data. We address this challenge by developing a tractable inferential procedure that performs deterministic rather than stochastic posterior exploration.

At the heart of our approach is the spike-and-slab LASSO (SSL) prior, a feasible continuous relaxation of its limiting special case (1.2) (Rockova, 2015). This prior transforms the obstinate combinatorial search problem into one of optimization in continuous systems, permitting the use of EM algorithms (Dempster et al., 1977), a strategy we pursue here. The SSL prior is coupled with the Indian Buffet Process (IBP) prior, thereby providing an opportunity to learn about the ambient factor dimensionality. Confirming its potential for this purpose, the posterior induced with this SSL-IBP prior combination is shown to concentrate on sparse matrices.

The indeterminacy of (1.1) due to rotational invariance is ameliorated with the SSL prior, which anchors on sparse representations. This prior promotes orientations with many zero loadings by creating ridge-lines of posterior probability along coordinate axes, radically reducing posterior multimodality. The posterior modes thus perform variable selection through automatic thresholding. In contrast, empirical averages resulting from posterior simulation are non-sparse and must be thresholded for selection. Moreover, posterior simulation requires identifiability constraints on the allocation of the zero elements of B, such as lower-triangular forms (Lopes and West, 2004) and their generalizations (Frühwirth-Schnatter and Lopes, 2009), to prevent the aggregation of probability mass from multiple modes. Such constraints are not needed in our approach.

The search for promising sparse factor orientations is greatly enhanced with data augmentation by expanding the likelihood with an auxiliary transformation matrix. Exploiting the rotational invariance of the factor model, we propose a PXL-EM (parameter-expanded likelihood EM) algorithm, a variant

of the PX-EM algorithm of Liu et al. (1998) and the one-step-late PXL-EM of van Dyk and Tang (2003) for Bayesian factor analysis. PXL-EM automatically rotates the loading matrix as a part of the estimation process, gearing the EM trajectory along the orbits of equal likelihood. The SSL prior then guides the selection from the many possible likelihood maximizers. Moreover, the PXL-EM algorithm is far more robust against poor initializations, converging dramatically faster than the parent EM algorithm.

As with many other sparse factor analyzers (Knowles and Ghahramani, 2011; Frühwirth-Schnatter and Lopes, 2009; Carvalho et al., 2008), the number of factors will be inferred through the estimated patterns of sparsity. In over-parametrized models with many redundant factors, this can be hampered by the phenomenon of factor splitting, i.e. the smearing of factor loadings across multiple correlated factors. Such factor splitting is dramatically reduced with our approach, because our IBP construction prioritizes lower indexed loadings and the PXL-EM rotates towards independent factors.

A variety of further steps are proposed to enhance the effectiveness of our approach. To facilitate the search for higher posterior modes we implement a dynamic posterior exploration with a sequential reinitialization of PXL-EM along a ladder of increasing spike penalties. For selection from the set of identified posterior modes, we recommend an evaluation criterion motivated as an integral lower bound to a posterior probability of the implied sparsity pattern. Finally, we introduce an optional varimax rotation step within PXL-EM, to provide further robustification against local convergence issues.

The paper is structured as follows. Section 2 introduces our hierarchical prior formulation for infinite factor analysis, establishing some of its appealing theoretical properties. Section 3 develops the construction of our basic EM algorithm, which serves as a basis for the PXL-EM algorithm presented in Section 4. Section 5 describes the dynamic posterior exploration strategy for PXL-EM deployment, demonstrating its effectiveness on simulated data. Section 6 derives and illustrates our criterion for factor model comparison. Section 7 illustrates the potential of varimax robustification. Sections 8 and 9 present applications of our approach on real data. Section 10 concludes with a discussion.

2 Infinite Factor Model with Spike-and-Slab LASSO

The cornerstone of our Bayesian approach is a hierarchically structured prior on infinite-dimensional loading matrices, based on the spike-and-slab LASSO (SSL) prior of Rockova (2015). Independently for each loading β_jk, we consider a continuous relaxation of (1.2) with two Laplace components: a slab component with a common penalty λ_1, and a spike component with a penalty λ_0k that is potentially unique to the k-th factor. More formally,

π(β_jk | γ_jk, λ_0k, λ_1) = (1 − γ_jk) φ(β_jk | λ_0k) + γ_jk φ(β_jk | λ_1),  (2.1)

where φ(β | λ) = (λ/2) exp{−λ|β|} is a Laplace prior with mean 0 and variance 2/λ², and λ_0k ≫ λ_1 > 0, k = 1, 2, .... The prior (2.1) will be further denoted as SSL(λ_0k, λ_1).
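To fix ideas, the following minimal sketch evaluates this two-component density on a grid; the penalty values and mixing weight are illustrative choices, not recommendations from the paper.

```python
# Minimal sketch of the SSL(lambda_0, lambda_1) density in (2.1): a two-component
# mixture of Laplace densities. All numeric settings are illustrative.
import numpy as np

def laplace_pdf(beta, lam):
    # phi(beta | lambda) = (lambda / 2) exp(-lambda |beta|)
    return 0.5 * lam * np.exp(-lam * np.abs(beta))

def ssl_pdf(beta, theta, lam0, lam1):
    # density of beta_jk after mixing over gamma_jk ~ Bernoulli(theta)
    return (1 - theta) * laplace_pdf(beta, lam0) + theta * laplace_pdf(beta, lam1)

beta_grid = np.linspace(-3, 3, 601)
density = ssl_pdf(beta_grid, theta=0.1, lam0=20.0, lam1=1.0)   # sharp spike, diffuse slab
```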

The SSL priors form a continuum between a single Laplace (LASSO) prior, obtained with λ_0k = λ_1, and the point-mass mixture prior (1.2), obtained as a limiting case when λ_0k → ∞. Coupled with a prior on γ_jk, SSL(λ_0k, λ_1) generates a spike-and-slab posterior that adaptively segregates the informative loadings from the ignorable. Inducing a variant of selective shrinkage (Ishwaran and Rao, 2005, 2011; Rockova, 2015), posterior modes under the SSL prior are adaptively thresholded, with smaller values shrunk to exact zero. This is in sharp contrast to spike-and-slab priors with a Gaussian spike (George and McCulloch, 1993; Rockova and George, 2014), whose non-sparse posterior modes must be thresholded for variable selection. The exact sparsity here is crucial for anchoring on interpretable factor orientations and thus alleviating identifiability issues.

For a prior over the feature allocation matrix Γ = {γ_jk}_{j,k=1}^{G,∞}, we adopt the Indian Buffet Process (IBP) prior (Griffiths and Ghahramani, 2005; Thibaux and Jordan, 2007), which defines an exchangeable distribution over equivalence classes^2 [Γ] of infinite-dimensional binary matrices. Formally, the IBP with intensity parameter α > 0 arises from the beta-Bernoulli prior

π(γ_jk | θ_k) ∼ Bernoulli(θ_k) (independently), π(θ_k) ∼ B(α/K, 1) (independently),  (2.2)

by integrating out the θ_k's and by taking the limit K → ∞ (Griffiths and Ghahramani, 2011). Due to its ability to learn about the ambient dimensionality of the latent space, the IBP prior has been used in various factor analytic contexts (Knowles and Ghahramani, 2011; Rai and Daumé, 2008; Paisley and Carin, 2009). Our IBP deployment differs from these existing procedures in two important aspects. First, we deploy the IBP in concert with a continuous spike-and-slab prior rather than the point-mass mixture (compared with Knowles and Ghahramani (2011), Rai and Daumé (2008) and Zhou et al. (2009)). Second, the IBP process is deployed here on the matrix of factor loadings rather than on the matrix of latent factors (compared with Paisley et al. (2012)).

Whereas posterior simulation with the IBP is facilitated by marginalizing over θ_k in (2.2) (Knowles and Ghahramani, 2011; Rai and Daumé, 2008), we instead proceed conditionally on a particular ordering of the θ_k's. As with similar deterministic algorithms for sparse latent allocation models (Paisley and Carin, 2009; Doshi-Velez et al., 2009; Paisley et al., 2011), our EM algorithms capitalize on the stick-breaking representation of a beta process (Paisley et al., 2012). We use the following construction derived specifically for the IBP by Teh et al. (2007).

^2 Each equivalence class [Γ] contains all matrices Γ with the same left-ordered form, obtained by ordering the columns from left to right by their binary numbers.
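As a point of reference before stating that construction, the following minimal sketch simulates the finite beta-Bernoulli approximation (2.2); the dimensions G, K and the intensity α are illustrative choices, not values used in the paper.

```python
# Minimal sketch of the finite beta-Bernoulli approximation (2.2) to the IBP.
import numpy as np

rng = np.random.default_rng(0)
G, K, alpha = 50, 1000, 1.0

theta = rng.beta(alpha / K, 1.0, size=K)     # theta_k ~ B(alpha/K, 1)
Gamma = rng.random((G, K)) < theta           # gamma_jk ~ Bernoulli(theta_k)
active = Gamma.any(axis=0).sum()             # columns with at least one active indicator
print(active)                                # stays modest even for large K, mimicking the K -> infinity limit
```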

Theorem 2.1. (Teh et al., 2007) Let θ_(1) > θ_(2) > ... > θ_(K) be a decreasing ordering of θ = (θ_1, ..., θ_K), where each θ_k ∼ B(α/K, 1) (iid). In the limit as K → ∞, the θ_(k)'s obey the following stick-breaking law

θ_(k) = ∏_{l=1}^{k} ν_l, where ν_l ∼ B(α, 1) (iid).  (2.3)

Remark 2.1. The implicit ordering θ_(1) > θ_(2) > ... > θ_(K) induces a soft identifiability constraint against the permutational invariance of the factor model.

To sum up, our hierarchical prior on infinite factor loading matrices B ∈ R^{G×∞}, further referred to as SSL-IBP({λ_0k}; λ_1; α), is as follows:

π(β_jk | γ_jk) ∼ SSL(λ_0k, λ_1), γ_jk | θ_(k) ∼ Bernoulli[θ_(k)], θ_(k) = ∏_{l=1}^{k} ν_l, ν_l ∼ B(α, 1) (iid).  (2.4)

Note that the inclusion of loadings associated with the k-th factor is governed by the k-th largest inclusion probability θ_(k). According to Theorem 2.1, these θ_(k) decay exponentially with k, hampering the activation of binary indicators γ_jk when k is large, and thereby controlling the growth of the effective dimensionality. Thus, the most prominent features are associated with a small column index k. The actual factor dimensionality K^+, defined as the number of columns with at least one active indicator γ_jk, is finite with probability one (Griffiths and Ghahramani, 2011).

The prior specification is completed with a prior on the diagonal elements of Σ. We assume independent inverse gamma priors

σ_1², ..., σ_G² ∼ IG(η/2, ηξ/2) (iid)  (2.5)

with the relatively noninfluential choice η = 1 and ξ = 1. To avoid the lack of identifiability that would occur between a singleton factor loading (a single nonzero loading parameter in a factor) and its idiosyncratic variance, we restrict the set of feasible B to matrices with at least two nonzero factor loadings per factor (Frühwirth-Schnatter and Lopes, 2009). A practical implementation of this constraint is described in Section 6.

2.1 Truncated Stick-Breaking Approximation

The representation (2.3) provides a natural truncated stick-breaking approximation to the IBP under which θ_(k) = 0 for all k > K. By choosing K suitably large, and also assuming β_jk = 0 for all k > K, this approximation will play a key role in the implementation of our EM algorithms. The issue remains how to select the order of truncation K. The truncated loading matrix B_K yields a marginal covariance matrix Λ_K = B_K B_K′ + Σ, which can be made arbitrarily close to Λ by considering K large enough. Measuring the closeness

7 of such approximation in the sup-norm metric d Λ, Λ K ) = max 1 j,m G Λ jk Λ mk, Bhattacharya and Dunson 2011) derived a lower bound on K so that d Λ, Λ K ) < ε with large prior probability using a continuous shrinkage prior. A similar result can be obtained for the SSL-IBP prior. Lemma 2.1. Assume λ 2 0k = 1/a)k with 0 < a < 1. Then for any ε > 0, we have P[d Λ, Λ K ) ε] > 1 ε, { [ ] [ ] } whenever K > max log ε 2 1 a) λ 2 / loga); log 1 ε 2 1 µ) / logµ), where µ = ε[1 1 ε) 1/G ]. ) α 1+α and ε = Proof. Appendix Section A.1). Remark 2.2. The assumption λ 0k = 1/a) k with 0 < a < 1 guarantees that the implied covariance matrix B B + Σ has finite entries almost surely. Intuitively, higher order of truncation K is needed when the reciprocal penalty decay rate a and/or the mean breaking fraction µ = α 1+α are closer to one. Lemma 2.1 provides a justification for the stick-breaking approximation based solely on prior considerations. In the next section, we provide practical guidance for choosing the order of truncation K, based on the posterior tail distribution of the number of active factors. 2.2 Learning about Factor Dimensionality The SSL-IBP prior provides an opportunity to learn about the factor cardinality from the data. In this section, we investigate the ability of the posterior distribution to identify the ambient dimensionality, taking into account the rotational invariance of the likelihood. To begin, we show that the IBP prior concentrates on sparse matrices. Namely, for 0 < α 1, the IBP process induces an implicit prior distribution on the actual dimensionality K +, which decays exponentially. Theorem 2.2. Let [Γ] be distributed according to the IBP prior 2.4) with an intensity 0 < α 1. Let K + denote the actual factor dimension, that is the largest index K + N, so that γ jk = 0 for all k > K +, j = 1,..., G. Then [ PK + > k) < 2 Gα + 1) + 4 ] [ )] α + 1 exp k + 1) log 3 α Proof. Appendix Section A.2). Exponentially decaying priors on dimensionality are essential for obtaining well-behaved posteriors in sparse situations Castillo et al., 2015; Castillo and van der Vaart, 2012). In the context of factor 2.6) 6

8 analysis, such priors were deployed by Pati et al. 2014) to obtain rate-optimal posterior concentration around any true covariance matrix in spectral norm, when the dimension G = G n can be much larger than the sample size n. The following connection between the priors used by Pati et al. 2014) and the IBP prior is illuminating. Using a point-mass spike-and-slab mixture prior, Pati et al. 2014) deploy the following hierarchical model for the binary allocations. First, the effective factor dimension K + is picked from a distribution π K + which decays exponentially, i.e. π K +K + > k) exp C k) for some C > 0. Second, each binary indicator is sampled from a Bernoulli prior with a common inclusion probability θ arising from a beta prior B1, K 0n G n +1) with expectation 1/2+K 0n G n ), where K 0n is the true factor dimension. Our SSL-IBP prior instead uses factor-specific inclusion probabilities θ 1) > θ 2) >... obtained as a limit of ordered sequences of K independent beta random variables with expectation 1/1 + K/α), when K. The stick-breaking law then yields E[θ k) ] = 1 1+1/α) k. This parallel between the SSL- IBP and point-mass mixture prior of Pati et al. 2014) is instructive for choosing a suitable intensity parameter α. Selecting α 1/G leads to an IBP formulation that is more closely related to a prior yielding the desired theoretical properties. Such choice yields rapid exponential prior decay of the order e k+1)g. As will be shown later, this choice also yields exponential posterior decay. First, we introduce the notion of effective factor dimensionality. Note that the SSL-IBP prior puts zero mass on factor loading matrices with exact zeroes. In such contexts, Pati et al. 2014) and Bhattacharya et al. 2015) deploy a generalized notion of sparsity, regarding coefficients below a certain threshold as negligible. For our continuous SSL-IBP prior, we proceed similarly with the threshold δλ 1, λ 0, θ) = 1 λ 0 λ 1 log [ λ0 1 θ) λ 1 θ ], 2.7) which is the intersection point between the θ-weighted spike-and-slab LASSO densities with penalties λ 0, λ 1 ), respectively. Definition 2.1. Given a loading matrix B G, define the effective factor dimensionality KB) as the largest column index K such that β jk < δλ 1, λ 0k, θ k) ) for k > K and j = 1,..., G. Since δλ 1, λ 0k, θ k) ) decreases with λ 0k, the intersection point vanishes as k, when {λ 0k } increases fast enough. The following direct analogue of Theorem 2.2 shows that the prior effective dimensionality is also exponentially decaying. Corollary 2.1. Let B G be distributed according to the SSL-IBP{λ 0k }, λ 1, α) with an intensity parameter α = c/g and λ 1 < e 2, where 0 < c < G. Then we have for D > 0 P[KB) > k] < D e k log1+g/c). 2.8) 7

9 Proof. Appendix A.3 We now proceed to show the property 2.8) penetrates through the data into the posterior. Let us first introduce some additional notation and assumptions. Let B 0 = [β 1 0, β 2 0,... ] be the G ) true loading matrix with K 0 < G nonzero columns ordered to be the leftmost ones). Furthermore, assume that each of the K 0 active columns satisfies β k 0 0 S < G/2, 1 k K 0. Let s max resp. s min ) denote the largest resp. smallest) eigenvalue of Λ 0, the true covariance matrix, and let ρ n = 2 s max /s min. Let B 0 2 [β 2 0,..., β K 0 0 ] 2 be the spectral norm of the sub-matrix of nonzero the columns of B 0. We assume the following: A) n < G and n log n < S K 0 logg + 1); B) S K 0 /s min and S K0 /n/s min 0; C) B 0 2 < S; and D) σ1 2 = = σ2 G = 1. The assumption A) limits our considerations to high-dimensional scenarios. In addition, the growth of n should be reasonably slow relative to G and number of nonzero elements S K 0. Assumption C) is an analog of the assumption A3) of Pati et al. 2014) and holds in any case with probability at least 1 e C K 0. Assumption D) is here to avoid the singleton identifiability issue and can be relaxed as in Pati et al. 2014)). The following theorem states that the posterior also concentrates on sparse matrices. Namely, the average posterior probability that KB) overshoots the true dimensionality K 0, by a multiple of a sparsity level, goes to zero. We assume that G, S, K 0 ) increase with n and thereby supply the index n. Theorem 2.3. Assume model 1.1), where the true loading matrix B 0 has K 0n nonzero leftmost columns. For 1 k K 0n assume G n j=1 Iβ jk 0) = S nk and S nk /G n < 1/2. Assume B SSL-IBP {λ 0k }; λ 1 ; α) with α = c/g n, λ 1 < e 2 and λ 0k d G 2 nk 3 n/s nk, where 0 < d and 0 < c < G. Assume A) D). Denote by S n = K 0n k=1 S nk, then [ E B0 P KB) > C 1 K 0n Sn Y n)] ) n for some C 1 > 0. Proof. Appendix. Note that K 0n Sn is the actual number of nonzero components in B 0. Thus, the tail bound 2.9) reflects the ability of the posterior to identify the ambient sparsity level. However, it also acknowledges the uncertainty about how these nonzero components are distributed among the columns of B. This uncertainty is due to the rotational invariance of the likelihood. As a practical matter, one would select the threshold for truncation K large enough to accommodate the dimensionality excess when K 0n and S n is known. In the absence of knowledge about K 0nSn, one would deploy as many columns as needed to obtain at least one vacant factor zero loading column). Such an empirical strategy may 8

require gradually increasing the number of columns in repeated runs, or adapting the computational procedure to add extra columns throughout the calculation. This strategy is justified by Theorem 2.3, which tells us that only a finite number of columns is needed in the presence of sparsity. The failure to detect sparse structure is then an indication that the sparsity assumption is inappropriate for the data at hand. Note also that the threshold K should be chosen such that model parameters are guaranteed to be estimable, i.e. the number of nonzero loadings does not exceed G(G − 1)/2. This condition will be met for suitably sparse B even with large K.

3 The EM Approach to Bayesian Factor Analysis

We will leverage the resemblance between factor analysis and multivariate regression, and implement a sparse variant of the EM algorithm for probabilistic principal components (Tipping and Bishop, 1999). We will capitalize on EMVS, a fast method for posterior mode detection in linear regression under spike-and-slab priors (Rockova and George, 2014). To simplify notation, throughout the section we will denote the truncated approximation B_K by B, for some pre-specified K. Similarly, θ will be the finite vector of ordered inclusion probabilities θ = (θ_(1), ..., θ_(K)) and λ_0k = λ_0 for k = 1, ..., K.

Letting Δ = (B, Σ, θ), the goal of the proposed algorithm will be to find parameter values which are most likely (a posteriori) to have generated the data, i.e. Δ̂ = arg max_Δ log π(Δ | Y). This task would be trivial if we knew the hidden factors Ω = [ω_1, ..., ω_n] and the latent allocation matrix Γ. In that case the estimates would be obtained as a unique solution to a series of penalized linear regressions. On the other hand, if Δ were known, then Γ and Ω could be easily inferred. This chicken-and-egg problem can be resolved iteratively by alternating between two steps. Given Δ^(m) at the m-th iteration, the E-step computes expected sufficient statistics of the hidden/missing data (Γ, Ω). The M-step then follows to find the a-posteriori most likely Δ^(m+1), given the expected sufficient statistics. These two steps form the basis of a vanilla EM algorithm with a guaranteed monotone convergence to at least a local posterior mode.

More formally, the EM algorithm locates modes of π(Δ | Y) iteratively by maximizing the expected logarithm of the augmented posterior. Given an initialization Δ^(0), the (m+1)-st step of the algorithm outputs Δ^(m+1) = arg max_Δ Q(Δ), where

Q(Δ) = E_{Γ,Ω | Y, Δ^(m)} [log π(Δ, Γ, Ω | Y)],  (3.1)

with E_{Γ,Ω | Y, Δ^(m)}(·) denoting the conditional expectation given the observed data and current parameter estimates at the m-th iteration. Note that we have parametrized our posterior in terms of the ordered inclusion probabilities θ rather than the breaking fractions ν. These can be recovered using

the stick-breaking relationship ν_k = θ_(k)/θ_(k−1). This parametrization yields a feasible M-step, as will be seen below.

We now take a closer look at the objective function (3.1). For notational convenience, let ⟨X⟩ denote the conditional expectation E_{Γ,Ω | Y, Δ^(m)}(X). As a consequence of the hierarchical separation of model parameters, (B, Σ) and θ are conditionally independent given (Ω, Γ). Thereby Q(Δ) = Q_1(B, Σ) + Q_2(θ), where Q_1(B, Σ) = ⟨log π(B, Σ, Ω, Γ | Y)⟩ and Q_2(θ) = ⟨log π(θ, Γ | Y)⟩. The function Q_2(θ) can be further simplified by noting that the latent indicators γ_jk enter linearly and thereby can be directly replaced by their expectations ⟨γ_jk⟩, yielding Q_2(θ) = ⟨log π(Γ | θ)⟩ + log π(θ). The term Q_1(B, Σ) is also linear in Γ, but involves the quadratic terms ⟨ω_i ω_i′⟩, namely

Q_1(B, Σ) = C − (1/2) Σ_{i=1}^{n} { (y_i − B⟨ω_i⟩)′ Σ^{−1} (y_i − B⟨ω_i⟩) + tr[ B′ Σ^{−1} B (⟨ω_i ω_i′⟩ − ⟨ω_i⟩⟨ω_i⟩′) ] }
  − Σ_{j=1}^{G} Σ_{k=1}^{K} |β_jk| (λ_1 ⟨γ_jk⟩ + λ_0 (1 − ⟨γ_jk⟩)) − n Σ_{j=1}^{G} log σ_j − Σ_{j=1}^{G} 1/(2σ_j²),  (3.2)

where C is a constant not involving Δ. The E-step entails the computation of both first and second conditional moments of the latent factors. The updates are summarized below.

3.1 The E-step

The conditional posterior mean vector ⟨ω_i⟩ is obtained as a solution to a ridge-penalized regression of Σ^{(m)−1/2} y_i on Σ^{(m)−1/2} B^{(m)}. This yields

⟨Ω⟩ = (B^{(m)′} Σ^{(m)−1} B^{(m)} + I_K)^{−1} B^{(m)′} Σ^{(m)−1} Y′.  (3.3)

The conditional second moments are then obtained from ⟨ω_i ω_i′⟩ = M + ⟨ω_i⟩⟨ω_i⟩′, where M = (B^{(m)′} Σ^{(m)−1} B^{(m)} + I_K)^{−1} is the conditional covariance matrix of the latent factors, which does not depend on i. We note in passing that the covariance matrix M can be regarded as a kernel of a smoothing penalty.

The E-step then proceeds by updating the expectation of the binary allocations ⟨Γ⟩. The entries can be updated individually by noting that, conditionally on Δ^(m), the γ_jk's are independent. The model hierarchy separates the indicators from the data through the factor loadings, so that π(γ_jk | Δ, Y) = π(γ_jk | Δ) does not depend on Y. This leads to the rapidly computable updates

⟨γ_jk⟩ ≡ P(γ_jk = 1 | Δ^(m)) = θ_(k)^(m) φ(β_jk^(m) | λ_1) / [ θ_(k)^(m) φ(β_jk^(m) | λ_1) + (1 − θ_(k)^(m)) φ(β_jk^(m) | λ_0) ].  (3.4)
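The two E-step updates above reduce to a few matrix operations. The following is a minimal sketch, assuming the data are stored as an n × G array with rows y_i′ and that λ_0 and λ_1 are scalars; the function and variable names are our own, not the paper's code.

```python
# Minimal sketch of the E-step updates (3.3)-(3.4), assuming current estimates
# B (G x K), sigma2 (length-G idiosyncratic variances), theta (length-K ordered
# inclusion probabilities) and the SSL penalties lam0, lam1.
import numpy as np

def laplace_pdf(beta, lam):
    return 0.5 * lam * np.exp(-lam * np.abs(beta))

def e_step(Y, B, sigma2, theta, lam0, lam1):
    G, K = B.shape
    Bs = B / sigma2[:, None]                      # Sigma^{-1} B
    M = np.linalg.inv(B.T @ Bs + np.eye(K))       # conditional covariance of omega_i
    Omega = Y @ Bs @ M                            # rows are <omega_i>', the posterior factor means, cf. (3.3)
    slab = theta * laplace_pdf(B, lam1)
    spike = (1.0 - theta) * laplace_pdf(B, lam0)
    Gamma = slab / (slab + spike)                 # <gamma_jk>, the update (3.4)
    return Omega, M, Gamma
```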

As shown in the next section, the conditional inclusion probabilities ⟨γ_jk⟩ serve as adaptive mixing proportions between spike and slab penalties, determining the amount of shrinkage of the associated β_jk's.

3.2 The M-step

Once the latent sufficient statistics have been updated, the M-step consists of maximizing (3.1) with respect to the unknown parameters. Due to the separability of (B, Σ) and θ, these groups of parameters can be optimized independently. The next theorem explains how Q_1(B, Σ) can be interpreted as a log-posterior arising from a series of independent penalized regressions, facilitating the execution of the M-step. First, let us denote Y = [y_1, ..., y_G], ⟨Ω⟩ = [⟨ω_1⟩, ..., ⟨ω_n⟩] and let β_1, ..., β_G be the rows of B (the columns of B′).

Theorem 3.1. Denote by Ỹ = (Y′, 0_{G×K})′ ∈ R^{(n+K)×G} a zero-augmented data matrix with column vectors ỹ_1, ..., ỹ_G. Let Ω̃ = (⟨Ω⟩, √n M_L)′ ∈ R^{(n+K)×K}, where M_L is the lower Cholesky factor of M. Then

Q_1(B, Σ) = Σ_{j=1}^{G} Q_j(β_j, σ_j),  (3.5)

where

Q_j(β_j, σ_j) = − (1/(2σ_j²)) ||ỹ_j − Ω̃ β_j||² − Σ_{k=1}^{K} |β_jk| λ̄_jk − n log σ_j − 1/(2σ_j²)  (3.6)

with λ̄_jk = ⟨γ_jk⟩ λ_1 + (1 − ⟨γ_jk⟩) λ_0.

Proof. The statement follows by rearranging the terms in the first row of (3.2). Namely, in the likelihood term we replace the row summation by a column summation. Then, we rewrite (n/2) tr(B′ Σ^{−1} B M) = Σ_{j=1}^{G} (n/(2σ_j²)) β_j′ M β_j. This quadratic penalty can be embedded within the likelihood term by augmenting the data rows as stated in the theorem.

Remark 3.1. The proof of Theorem 3.1 explains why M can be regarded as a kernel of a Markov random field smoothing prior, penalizing linear combinations of loadings associated with correlated factors.

Based on the previous theorem, each β_j^(m+1) (the j-th row of the matrix B^(m+1)) can be obtained by deploying an adaptive LASSO computation (Zou, 2006) with a response ỹ_j and augmented data matrix Ω̃. Each coefficient β_jk is associated with a unique penalty parameter 2σ_j^{(m)2} λ̄_jk, which is proportional to λ̄_jk, an adaptive convex combination of the spike and slab LASSO penalties.
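For concreteness, here is a minimal sketch of this per-row update, written as a plain coordinate-descent adaptive LASSO on the augmented data; the paper itself relies on fast off-the-shelf LASSO implementations via reweighting, and the helper names and tolerances below are our own choices.

```python
# Minimal sketch of the per-row M-step implied by Theorem 3.1: an adaptive-LASSO
# regression of the zero-augmented response on the augmented factor matrix,
# solved here with plain coordinate descent.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def m_step_row(y_j, Omega, M, sigma2_j, lam_jk, n_iter=200, tol=1e-8):
    """y_j: length-n response for variable j; Omega: n x K posterior factor means;
    M: K x K conditional factor covariance; lam_jk: length-K adaptive penalties."""
    n, K = Omega.shape
    M_L = np.linalg.cholesky(M)                    # lower Cholesky factor of M
    X = np.vstack([Omega, np.sqrt(n) * M_L.T])     # augmented design, (n+K) x K
    y = np.concatenate([y_j, np.zeros(K)])         # zero-augmented response
    beta = np.zeros(K)
    col_ss = (X ** 2).sum(axis=0)
    r = y - X @ beta
    for _ in range(n_iter):
        beta_old = beta.copy()
        for k in range(K):
            r += X[:, k] * beta[k]                 # partial residual without coordinate k
            beta[k] = soft_threshold(X[:, k] @ r, sigma2_j * lam_jk[k]) / col_ss[k]
            r -= X[:, k] * beta[k]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

Because the G rows are updated independently given Σ^(m), calls like this one can be distributed across workers.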

Notably, each λ̄_jk yields a self-adaptive penalty, informed by the data through the most recent β_jk^(m) at the m-th iteration. The computation is made feasible with the very fast LASSO implementations (Friedman et al., 2010), which scale very well with both K and n. First, the data matrix is reweighted by the vector (1/λ̄_j1, ..., 1/λ̄_jK) and a standard LASSO computation is carried out with a penalty 2σ_j^{(m)2}. The resulting estimate is again reweighted by 1/λ̄_jk (Zou, 2006), yielding β_j^(m+1). Note that the updates (j = 1, ..., G) are obtained conditionally on Σ^(m) and are independent of each other, permitting the use of distributed computing. This step is followed by a closed-form update Σ^(m+1) with

σ_j^{(m+1)2} = (||ỹ_j − Ω̃ β_j^(m+1)||² + 1) / (n + 1),

conditionally on the new B^(m+1). Despite proceeding conditionally in the M-step, monotone convergence is still guaranteed (Meng and Rubin, 1993).

Now we continue with the update of the ordered inclusion probabilities θ from the stick-breaking construction. To motivate the benefits of the parametrization based on θ, let us for a moment assume that we are actually treating the breaking fractions ν as the parameters of interest. The corresponding objective function Q_2(ν) = ⟨log π(Γ | ν)⟩ + log π(ν) is

Q_2(ν) = Σ_{j=1}^{G} Σ_{k=1}^{K} { ⟨γ_jk⟩ log( ∏_{l=1}^{k} ν_l ) + (1 − ⟨γ_jk⟩) log( 1 − ∏_{l=1}^{k} ν_l ) } + Σ_{k=1}^{K} (α − 1) log(ν_k),  (3.7)

a nonlinear function that is difficult to optimize. Instead, we use the stick-breaking law and plug ν_k = θ_(k)/θ_(k−1) into (3.7). The objective function then becomes

Q_2(θ) = Σ_{j=1}^{G} Σ_{k=1}^{K} [ ⟨γ_jk⟩ log θ_(k) + (1 − ⟨γ_jk⟩) log(1 − θ_(k)) ] + (α − 1) log θ_(K),  (3.8)

whose maximum θ^(m+1) can be found by solving an optimization problem with a series of linear constraints θ_(k) − θ_(k−1) ≤ 0, k = 2, ..., K, and 0 ≤ θ_(k) ≤ 1, k = 1, ..., K. Had we assumed the finite beta-Bernoulli prior (2.2), the update of the (unordered) occurrence probabilities would simply become θ_k^(m+1) = (Σ_{j=1}^{G} ⟨γ_jk⟩ + a − 1) / (a + b + G − 2). Note that the ordering constraint here induces increasing shrinkage of higher-indexed factor loadings, thereby controlling the growth of the effective factor cardinality.
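The ordered update can be carried out with any constrained optimizer. Below is a minimal sketch that maximizes (3.8) subject to the monotonicity constraints with a generic solver; this is only one way to handle the constraint, and the paper's own optimization routine may differ.

```python
# Minimal sketch of the ordered theta update: maximize (3.8) subject to
# theta_(1) >= ... >= theta_(K), using a generic constrained solver.
import numpy as np
from scipy.optimize import minimize

def update_theta(Gamma_bar, alpha):
    """Gamma_bar: G x K matrix of conditional inclusion probabilities <gamma_jk>."""
    G, K = Gamma_bar.shape
    counts = Gamma_bar.sum(axis=0)                      # sum_j <gamma_jk> per factor

    def neg_Q2(theta):
        theta = np.clip(theta, 1e-10, 1 - 1e-10)
        ll = counts * np.log(theta) + (G - counts) * np.log(1 - theta)
        return -(ll.sum() + (alpha - 1) * np.log(theta[-1]))

    # monotonicity constraints theta_k - theta_{k+1} >= 0
    cons = [{"type": "ineq", "fun": (lambda t, k=k: t[k] - t[k + 1])} for k in range(K - 1)]
    theta0 = np.clip((counts + 1) / (G + 2), 1e-3, 1 - 1e-3)
    theta0 = np.maximum.accumulate(theta0[::-1])[::-1]  # feasible monotone starting point
    res = minimize(neg_Q2, theta0, bounds=[(1e-6, 1 - 1e-6)] * K,
                   constraints=cons, method="SLSQP")
    return res.x
```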

4 Rotational Ambiguity and Parameter Expansion

The EM algorithm outlined in the previous section is prone to entrapment in local modes in the vicinity of initialization. This local convergence issue is exacerbated by the rotational ambiguity of the likelihood, which induces highly multimodal posteriors, and by strong couplings between the updates of loadings and factors. These couplings cement the initial factor orientation, which may be suboptimal, and affect the speed of convergence with zigzagging update trajectories. These issues can be alleviated with additional augmentation in the parameter space that can dramatically accelerate the convergence (Liu et al., 1998; van Dyk and Meng, 2010, 2001; Liu and Wu, 1999; Lewandowski et al., 1999). By embedding the complete data model within a larger model with extra parameters, we derive a variant of a parameter-expanded EM algorithm (PX-EM by Liu et al. (1998)). This enhancement performs an automatic rotation to sparsity, gearing the algorithm towards orientations which best match the prior assumptions of independent latent components and sparse loadings. A key to our approach is to employ the parameter expansion only on the likelihood portion of the posterior, while using the SSL prior to guide the algorithm towards sparse factor orientations. We refer to our variant as parameter-expanded-likelihood EM (PXL-EM).

Our PXL-EM algorithm is obtained with the following parameter-expanded version of (1.1):

y_i | ω_i, B, Σ, A ∼ N_G(B A_L^{−1} ω_i, Σ) (independently), ω_i | A ∼ N_K(0, A), A ∼ π(A),  (4.1)

for 1 ≤ i ≤ n, where A_L denotes the lower Cholesky factor of A, the newly introduced parameter. This expansion was used for traditional factor analysis by Liu et al. (1998). The observed-data likelihood here is invariant under the parametrizations indexed by A. This is evident from the marginal distribution f(y_i | B, Σ, A) = N_G(0, BB′ + Σ), 1 ≤ i ≤ n, which does not depend on A. Although A is indeterminate from the observed data, it can be identified with the complete data. Note that the original factor model is preserved at the null value A_0 = I_K.

To exploit the invariance of the parameter-expanded likelihood, we impose the SSL prior (2.4) on B* = B A_L^{−1} rather than on B. That is,

β*_jk | γ_jk ∼ SSL(λ_0k, λ_1) (independently), γ_jk | θ_(k) ∼ Bernoulli[θ_(k)] (independently), θ_(k) = ∏_{l=1}^{k} ν_l, ν_l ∼ B(α, 1) (iid),  (4.2)

where the β*_jk are the transformed elements of B*. This yields an implicit prior on B that depends on A_L and therefore is not transformation invariant, a crucial property for anchoring sparse factor orientations. The original factor loadings B can be recovered from (B*, A) through the reduction function B = B* A_L. The prior (2.5) on Σ remains unchanged.

Just as the EM algorithm from Section 3, PXL-EM also targets (local) maxima of the posterior π(Δ | Y) implied by (1.1) and (2.1), but does so in a very different way. PXL-EM proceeds indirectly in terms of the parameter-expanded posterior π(Δ^X | Y) indexed by Δ^X = (B*, Σ, θ, A) and implied by (4.1) and (4.2). By iteratively optimizing the conditional expectation of the augmented log posterior log π(Δ^X, Ω, Γ | Y), PXL-EM yields a path of updates through the expanded parameter space.

15 This sequence corresponds to a trajectory in the original parameter space through the reduction function B = B A L. Importantly, the E-step of PXL-EM is taken with respect to the conditional distribution of Ω and Γ under the original model governed by B and A 0, rather than under the expanded model governed by B and A. Thus, the updated A is not carried forward throughout the iterations. Instead, each E-step is anchored on A = A 0. As is elaborated on in Section 4.4, A = A 0 upon convergence and thus the PXL-EM trajectory converges local modes of the original posterior π Y ). The prior πa) influences the orientation of the augmented feature matrix upon convergence. Whereas proper prior distributions πa) can be implemented within our framework, and may be a fruitful avenue for future research, here we use the improper prior πa) 1. The improper prior has an orthogonalization property which can be exploited for more efficient calculations Section 4.2). In contrast to marginal augmentation Liu and Wu, 1999; Meng and van Dyk, 1999), where improper working prior may cause instability in posterior simulation, here it is more innocuous. This is because PXL-EM does not use the update A for the next E-step. The PXL-EM M-step uses A to guide the trajectory, which can be very different from the classical EM. Recall that A indexes continuous transformations yielding the same marginal likelihood. Adding this extra dimension, each mode of the original posterior π Y ) corresponds to a curve in the expanded posterior π Y ), indexed by A. These ridge-lines of accumulated probability, or orbits of equal likelihood, serve as a bridges connecting remote posterior modes. Due to the thresholding ability of the SSL prior, promising modes are located at the intersection of the orbits with coordinate axes. Obtaining A and subsequently performing the reduction step B = B A L, the PXL-EM trajectory is geared along the orbits, taking larger steps over posterior valleys to conquer multimodality. More formally, the PXL-EM traverses the expanded parameter space and generates a trajectory { 1), 2),...}, where m) = B m), Σ m), θ m), A m) ). This trajectory corresponds to a sequence { 1), 2),...} in the reduced parameter space, where m) = B m), Σ m), θ m) ) and B m) = B m) A m) L. Beginning with the initialization 0), every step of the PXL-EM algorithm outputs an update m+1) = arg max Q X ), where Q X ) = E Ω,Γ Y, m),a 0 log π, Ω, Γ Y ). 4.3) Each such computation is facilitated by the separability of Q X with respect to B, Σ), θ and A, a consequence of the hierarchical structure of the Bayesian model. Thus we can write Q X ) = C X + Q 1 B, Σ) + Q 2 θ) + Q X 3 A). 4.4) 14

16 The function Q 1 is given in 3.5), Q 2 is given in 3.8) and Q X 3 A) = 1 2 n tr[a 1 E Ω m),a 0 ω i ω i)] n log A. 4.5) 2 i=1 4.1 The PXL E-step Recall that the expectation in 4.3) is taken with respect to the conditional distribution of Ω and Γ under the original model governed by m) and A 0. Formally, the calculations remain the same as before Section 3.1). However, the update B m) = B m) A m) L is now used instead of B m) throughout these expressions. The implications of this substitution are discussed in the following two intentionally simple examples. These examples convey the intuition of entries in A as penalties that encourage featurizations with fewer and more informative factors. The first example describes the scaling aspect of the transformation by A L, assuming A L is diagonal. Example 4.1. Diagonal A) We show that for A = diag{α 1,..., α K }, each α k plays a role of a penalty parameter, determining the size of new features as well as the amount of shrinkage. This is seen from the E-step, which a) creates new features Ω, b) determines penalties for variable selection Γ, c) creates a smoothing penalty matrix Cov ω i B, Σ). Here is how inserting B= B A L affects these three steps. For simplicity, assume Σ = I K, B B = I K and θ = 0.5,..., 0.5). From 3.3), the update of latent features is { E Ω Y,B Ω ) = A 1 L I K + A 1 ) 1 B Y αk = diag 1 + α k } B Y. 4.6) Note that 4.6) with α k = 1 k = 1,..., K) corresponds to no parameter expansion. The function fα) = α 1+α steeply increases up to its maximum at α = 1 and then slowly decreases. Before the convergence which corresponds to α k 1), PXL-EM performs shrinkage of features, which is more dramatic if the k th variance α k is close to zero. Regarding the Markov-field kernel smoothing penalty, the coordinates with higher variances α k are penalized less. This is seen from Cov ω i B, Σ)= A L A L + I K ) 1 = diag{1/1 + α k )}. The E-step is then completed with the update of variable selection penalty mixing weights Γ. Here E Γ B,θ γ jk ) = This probability is exponentially increasing in α k. [ 1 + λ 0 exp βjk λ α kλ 0 λ 1 ) ) ] 1. 1 Higher variances α k > 1 increase the inclusion probability as compared to no parameter expansion α k = 1. The coefficients of the newly created features with larger α k are more likely to be selected. 15

The next example illustrates the rotational aspect of A_L, where the off-diagonal elements perform linear aggregation.

Example 4.2. (Unit lower-triangular A_L) Suppose that A = (α_jk)_{j,k=1}^{K} with α_jk = min{j, k}. This matrix has a unit-lower-triangular Cholesky factor with entries (A_L)_jk = I(j ≥ k). For Σ = I_K and B*′B* = I_K, we have again E_{Ω | Y, B*}(Ω) = A_L^{−1}(I_K + A^{−1})^{−1} B*′Y, where now the matrix A_L^{−1}(I_K + A^{−1})^{−1} has positive elements in the upper triangle and negative elements in the lower triangle. The first feature thus aggregates information from all the columns of Y B*, whereas the last feature correlates negatively with all but the last column. The variable selection indicators are computed for linear aggregates of the coefficients B*, as determined by the entries of A_L, i.e.

E_{Γ | B*, θ}(γ_jk) = [ 1 + (λ_0/λ_1) exp{ −| Σ_{l≥k} β*_jl | (λ_0 − λ_1) } ]^{−1},

where θ = (0.5, ..., 0.5). The lower ordered inclusion indicators are likely to be higher since they aggregate more coefficients, a consequence of the lower-triangular form of A_L. As a result, lower ordered new features are more likely to be selected. The smoothness penalty matrix Cov(ω_i | B*, Σ) = (A_L′ A_L + I_K)^{−1} has increasing values on the diagonal and all the off-diagonal elements are negative. The quadratic penalty β_j′ Cov(ω_i | B*, Σ) β_j thus forces the loadings to be similar, due to the positive covariances between the factors as given in A, where the penalty is stronger between coefficients of higher-ordered factors.

Remark 4.1. Instead of A_L, we could as well deploy the square root A^{1/2}. We have chosen A_L due to its lower-triangular structure, whose benefits were highlighted in Example 4.2.

4.2 The PXL M-step

Conditionally on the imputed latent data, the M-step is then performed by maximizing Q^X(Δ^X) over Δ^X in the augmented space. The update of A^(m+1), obtained by maximizing (4.5), requires only a fast simple operation,

A^(m+1) = arg max_{A = A′, A ≻ 0} Q_3^X(A) = (1/n) Σ_{i=1}^{n} E_{Ω | Y, Δ^(m), A_0}(ω_i ω_i′) = (1/n) ⟨Ω Ω′⟩ = (1/n) ⟨Ω⟩⟨Ω⟩′ + M.  (4.7)

The updates of (B*^(m+1), Σ^(m+1)) and θ^(m+1) can be obtained by maximizing (3.5) and (3.8) as described in Section 3.2. The new coefficient updates in the reduced parameter space can then be obtained by the following step B^(m+1) = B*^(m+1) A_L^(m+1), a rotation along an orbit of equal likelihood.
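The update (4.7) and the subsequent rotation take only a few lines. The sketch below assumes the E-step outputs are stored as an n × K array of posterior factor means together with the K × K conditional covariance M; the function name and the toy inputs are our own placeholders.

```python
# Minimal sketch of the A-update (4.7) followed by the rotation
# B^(m+1) = B*^(m+1) A_L^(m+1) back to the original parameter space.
import numpy as np

def pxl_rotation(B_star, Omega_mean, M):
    """B_star: G x K loadings from the expanded M-step;
    Omega_mean: n x K matrix of posterior factor means <omega_i>;
    M: K x K conditional covariance of the factors."""
    n = Omega_mean.shape[0]
    A = Omega_mean.T @ Omega_mean / n + M      # (4.7): <Omega Omega'>/n = <Omega><Omega>'/n + M
    A_L = np.linalg.cholesky(A)                # lower Cholesky factor of A
    B = B_star @ A_L                           # rotate back to the original parameter space
    return B, A

# tiny illustrative call with random placeholders
rng = np.random.default_rng(2)
n, G, K = 100, 8, 3
Omega_mean = rng.standard_normal((n, K))
M = np.eye(K) * 0.1
B_star = rng.standard_normal((G, K))
B_new, A_new = pxl_rotation(B_star, Omega_mean, M)
```

As the iterations approach a fixed point, A^(m+1) approaches A_0 = I_K and the rotation reduces to the identity, so the rotated and unrotated trajectories coincide at convergence.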

18 Remark 4.2. Although A L is not strictly a rotation matrix in the sense of being orthonormal, we refer to its action of changing the factor model orientation as the rotation by A L. From the polar decomposition of A L = UP, transformation by A L is the composition of a rotation represented by the orthogonal matrix U = A L A L A L) 1/2, and a dilation represented by the symmetric matrix P. Thus, when we refer to the rotation by A L, what is meant is the rotational aspect of A L, namely the action of U. More insight into the role of A L can be gained by recasting the PXL-EM M-step in terms of the original model parameters B = B A L. From 3.6), the PXL M-step yields { } β m+1) j = arg max ỹ j Ωβ j 2 2σ m)2 K β j βjk λ jk, j for each j = 1,..., G. However, in terms of the original parameters B m), where β m+1) j these solutions become β m+1) j = arg max β j Thus, the rotated parameters β m+1) j ỹj ΩA 1 L k=1 )β j 2 2σ m)2 K K j k=1 A 1 L ) lkβ jl l k = A L β m+1) j, λ jk. 4.8) are solutions to modified penalized regressions of ỹ j on ΩA 1 L serves to orthogonalize the under a series of triangular linear constraints. As seen from 4.7), A 1 L factor basis. This observation can be exploited for obtaining a closed form approximate calculation for B m+1) Section 4.3). Because ΩA 1 L in 4.8) is orthogonal, B m+1) can be approximated using a closed form update, removing the need for first computing B m+1) and then rotating it by A L. By noting a) LASSO has a closed form solution in orthogonal designs, b) the system of constraints in 4.8) is triangular with dominant entries on the diagonal, we can deploy back-substitution to quickly obtain an approximate update B m+1) in just one sweep. Denote by z j = A 1 Ω L y j /n and z j k+ = l>k A 1 L ) lkβ jl /A 1 L ) kk. Then, β m+1) jk z σ m)2 j ) λ jk A 1 L ) kk/n signz) + zj k+, 4.9) where z = z j k + zj k+. This approximate step dramatically reduces the computational cost, as discussed in Section 4.5, and is therefore worthwhile deploying in large problems. Performing 4.9) instead of the proper M-step yields a slightly different trajectory. However, both PXL-EM and this trajectory have the same fixed points B, A 0 ). Towards convergence when A A 0, the approximation 4.9) becomes exact. To sum up, the default EM algorithm proceeds by finding B m) at the M-step, and then using that B m) for the next E-step. In contrast, the PXL-EM algorithm finds B m) at the M-step, but then 17

19 uses the value of B m) = B m) A m) L for the next E-step. Each transformation B m) = B m) A m) L decouples the most recent updates of the latent factors and factor loadings, enabling the EM trajectory to escape the attraction of suboptimal orientations. In this, the rotation induced by A m) L plays a crucial role for the detection of sparse representations which are tied to the orientation of the factors. 4.3 Modulating the Trajectory Our PXL-EM algorithm can be regarded as a one-step-late PX-EM van Dyk and Tang, 2003) or more generally as a one-step-late EM Green, 1990). The PXL-EM differs from the traditional PX-EM of Liu et al. 1998) by not requiring the SSL prior be invariant under transformations A L. PXL- EM purposefully leaves only the likelihood invariant, offering a) tremendous accelerations without sacrificing the computational simplicity, b) automatic rotation to sparsity and c) robustness against poor initializations. The price we pay is the guarantee of monotone convergence. Let m), A 0 ) be an update of at the m th iteration. It follows from the information inequality, that for any = B, Σ, θ), where B = B A L, log π Y ) log π m) Y ) Q X ) Q X m) ) + E Γ m),a 0 log πb ) A L, Γ) πb. 4.10), Γ) Whereas m+1) = arg max Q X ) increases the Q X function, the log prior ratio evaluated at B m+1), A m+1) ) is generally not positive. van Dyk and Tang 2003) proposed a simple adjustment to monotonize their one-step-late PX-EM, where the new proposal B m+1) = B m+1) A m+1) L is only accepted when the value of the right hand side of 4.10) is positive. Otherwise, the classical EM step is performed with B m+1) = B m+1) A 0. Although this adjustment guarantees the convergence towards the nearest stationary point Wu, 1983), poor initializations may gear the monotone trajectories towards peripheral modes. It may therefore be beneficial to perform the first couple of iterations according to PXL-EM to escape such initializations, not necessarily improving on the value of the objective, and then to switch to EM or to the monotone adjustment. Monitoring the criterion 4.10) throughout the iterations, we can track the steps in the trajectory that are guaranteed to be monotone. Apart from monotonization, one could also divert the PXL-EM trajectory with occasional jumps between orbits. Note that PXL-EM moves along orbits indexed by oblique rotations. One might also consider moves along orbits indexed by orthogonal rotations such as varimax Kaiser 1958)). One might argue that performing a varimax rotation instead of the oblique rotation throughout the EM computation would be equally, if not more, successful. However, the plain EM may fail to provide a sufficiently structured intermediate input for varimax. On the other hand, PXL-EM identifies enough structure early on in the trajectory and may benefit from further varimax rotations. The potential for further improvement with this optional step is demonstrated in Section 7. 18

20 In the next two sections we show that PXL-EM is an efficient scheme, i.e. computes fast. 4.4 Convergence Speed: EM versus PXL-EM it converges and The convergence speed of the PX-EM algorithm of Liu et al. 1998) is provably superior to the classical EM algorithm. Using similar arguments, we show that PXL-EM fares also provably better than EM. The speed of convergence of the EM algorithm for MAP estimation) is defined as the smallest eigenvalue of the matrix fraction of the observed information S = IaugI 1 obs, where I obs = 2 log π Y ) =, I aug = 2 log Q ) = 4.11) and where = B, Σ, θ) is a target posterior mode Dempster, Laird and Rubin 1968)). The speed matrix satisfies S = I DM, where DM is the Jacobian of the EM mapping t+1) = M t) ), evaluated at. The Jacobian governs the behavior of the EM algorithm near its fixed point. This traditional notion of convergence speed is not directly applicable here. Under the SSL prior, B is potentially very sparse and the complete and incomplete log-posteriors in 4.11) are not differentiable at zero. However, the mapping M ) is a soft-thresholding operator, setting to zero small loadings in the vicinity of origin. Therefore, near convergence, the EM trajectory is trapped inside a region of attraction of B, where all the zero directions have been thresholded out. In the neighborhood of B, the algorithm iterates only over the nonzero coordinates of B. Therefore, it is sensible to evaluate the speed of convergence by confining attention to nonzero elements on B. Using this notion, the global speed the smallest eigenvalue of a speed matrix S) will be obtained by taking only derivatives wrt nonzero directions of B. Remark 4.3. This notion of convergence speed supports the intuition: the sparser the mode, the faster the convergence. Taking a sub-matrix S 1 of a symmetric positive-semi definite matrix S 2, the smallest eigenvalues satisfy λ 1 S 1 ) λ 2 S 2 ). Hypothetically, if a mode B 2 is an augmentation of a sparser mode B 1 i.e. B2 contains the same nonzero entires as B 1 and some extra ones), both EM and PXL-EM will converge faster to B 1. To establish the speed of convergence of PXL-EM relative to EM, we need to change slightly the notation. Let =, A) = B, Σ, θ, A) be the parameters in the expanded space, where we use B = B A L, which has already been rotated, in place of B in from before. Clearly =, A 0 ) is a fixed point of PXL-EM. Similarly as?, to obtain the speed matrix near the fixed point we define 3 Iobs X = log π Y ) 2 =, IX aug = 2 log Q X ) 3 The derivatives in 4.12) are taken only with respect to nonzero directions in B! =. 4.12) 19


More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Bayesian Sparse Linear Regression with Unknown Symmetric Error

Bayesian Sparse Linear Regression with Unknown Symmetric Error Bayesian Sparse Linear Regression with Unknown Symmetric Error Minwoo Chae 1 Joint work with Lizhen Lin 2 David B. Dunson 3 1 Department of Mathematics, The University of Texas at Austin 2 Department of

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

The Spike-and-Slab LASSO

The Spike-and-Slab LASSO The Spike-and-Slab LASSO Veronika Ročková and Edward I. George Submitted on 7 th August 215 Abstract Despite the wide adoption of spike-and-slab methodology for Bayesian variable selection, its potential

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We

More information

Comment on Article by Scutari

Comment on Article by Scutari Bayesian Analysis (2013) 8, Number 3, pp. 543 548 Comment on Article by Scutari Hao Wang Scutari s paper studies properties of the distribution of graphs ppgq. This is an interesting angle because it differs

More information

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

More information

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Christopher Paciorek, Department of Statistics, University

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Bayesian non parametric approaches: an introduction

Bayesian non parametric approaches: an introduction Introduction Latent class models Latent feature models Conclusion & Perspectives Bayesian non parametric approaches: an introduction Pierre CHAINAIS Bordeaux - nov. 2012 Trajectory 1 Bayesian non parametric

More information

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Linear Algebra A Brief Reminder Purpose. The purpose of this document

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Partially Collapsed Gibbs Samplers: Theory and Methods

Partially Collapsed Gibbs Samplers: Theory and Methods David A. VAN DYK and Taeyoung PARK Partially Collapsed Gibbs Samplers: Theory and Methods Ever-increasing computational power, along with ever more sophisticated statistical computing techniques, is making

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Scalable MCMC for the horseshoe prior

Scalable MCMC for the horseshoe prior Scalable MCMC for the horseshoe prior Anirban Bhattacharya Department of Statistics, Texas A&M University Joint work with James Johndrow and Paolo Orenstein September 7, 2018 Cornell Day of Statistics

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13)

Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13) Stat 5100 Handout #26: Variations on OLS Linear Regression (Ch. 11, 13) 1. Weighted Least Squares (textbook 11.1) Recall regression model Y = β 0 + β 1 X 1 +... + β p 1 X p 1 + ε in matrix form: (Ch. 5,

More information

arxiv: v1 [stat.me] 11 Apr 2018

arxiv: v1 [stat.me] 11 Apr 2018 Sparse Bayesian Factor Analysis when the Number of Factors is Unknown Sylvia Frühwirth-Schnatter and Hedibert Freitas Lopes arxiv:1804.04231v1 [stat.me] 11 Apr 2018 April 13, 2018 Abstract Despite the

More information

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Sequential Importance Sampling for Rare Event Estimation with Computer Experiments

Sequential Importance Sampling for Rare Event Estimation with Computer Experiments Sequential Importance Sampling for Rare Event Estimation with Computer Experiments Brian Williams and Rick Picard LA-UR-12-22467 Statistical Sciences Group, Los Alamos National Laboratory Abstract Importance

More information

Approximation Metrics for Discrete and Continuous Systems

Approximation Metrics for Discrete and Continuous Systems University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science May 2007 Approximation Metrics for Discrete Continuous Systems Antoine Girard University

More information

Lecture 6: April 19, 2002

Lecture 6: April 19, 2002 EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2002 Lecturer: Jeff Bilmes Lecture 6: April 19, 2002 University of Washington Dept. of Electrical Engineering Scribe: Huaning Niu,Özgür Çetin

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

The Bayesian approach to inverse problems

The Bayesian approach to inverse problems The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Machine Learning for Economists: Part 4 Shrinkage and Sparsity

Machine Learning for Economists: Part 4 Shrinkage and Sparsity Machine Learning for Economists: Part 4 Shrinkage and Sparsity Michal Andrle International Monetary Fund Washington, D.C., October, 2018 Disclaimer #1: The views expressed herein are those of the authors

More information

Fitting Narrow Emission Lines in X-ray Spectra

Fitting Narrow Emission Lines in X-ray Spectra Outline Fitting Narrow Emission Lines in X-ray Spectra Taeyoung Park Department of Statistics, University of Pittsburgh October 11, 2007 Outline of Presentation Outline This talk has three components:

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology FE670 Algorithmic Trading Strategies Lecture 3. Factor Models and Their Estimation Steve Yang Stevens Institute of Technology 09/12/2012 Outline 1 The Notion of Factors 2 Factor Analysis via Maximum Likelihood

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

The Variational Gaussian Approximation Revisited

The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

"SYMMETRIC" PRIMAL-DUAL PAIR

SYMMETRIC PRIMAL-DUAL PAIR "SYMMETRIC" PRIMAL-DUAL PAIR PRIMAL Minimize cx DUAL Maximize y T b st Ax b st A T y c T x y Here c 1 n, x n 1, b m 1, A m n, y m 1, WITH THE PRIMAL IN STANDARD FORM... Minimize cx Maximize y T b st Ax

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Biostat 2065 Analysis of Incomplete Data

Biostat 2065 Analysis of Incomplete Data Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

The Expectation Maximization Algorithm

The Expectation Maximization Algorithm The Expectation Maximization Algorithm Frank Dellaert College of Computing, Georgia Institute of Technology Technical Report number GIT-GVU-- February Abstract This note represents my attempt at explaining

More information

Convergence Rate of Expectation-Maximization

Convergence Rate of Expectation-Maximization Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information