arxiv: v3 [stat.ml] 8 Jun 2015


Bayesian Manifold Learning: Locally Linear Latent Variable Models (LL-LVM)

Mijung Park (1), Wittawat Jitkrittum (1), Ahmad Qamar (2), Zoltán Szabó (1), Lars Buesing (3), Maneesh Sahani (1)
(1) University College London, (2) University of Toronto, (3) Columbia University

Abstract

We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. The model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data. Thus, the LL-LVM encapsulates the local-geometry preserving intuitions that underlie non-probabilistic methods such as locally linear embedding (LLE). Its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships, select the intrinsic dimensionality of the manifold, construct out-of-sample extensions, and combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold.

1 Introduction

Many high-dimensional datasets comprise points derived from a smooth, lower-dimensional manifold embedded within the high-dimensional space of measurements and possibly corrupted by noise. For instance, biological or medical imaging data might reflect the interplay of a small number of latent processes that all affect measurements non-linearly. Linear multivariate analyses such as principal component analysis (PCA) or multidimensional scaling (MDS) have long been used to estimate such underlying processes, but cannot always reveal low-dimensional structure when the mapping is non-linear (or, equivalently, the manifold is curved). Thus, there has been substantial recent interest in algorithms to identify non-linear manifolds in data.
Many more-or-less heuristic methods for non-linear manifold discovery are based on the idea of preserving the geometric properties of local neighbourhoods within the data, while embedding, unfolding or otherwise transforming the data to occupy fewer dimensions. Thus, algorithms such as locally linear embedding (LLE) and Laplacian eigenmaps attempt to preserve local linear relationships or to minimise the distortion of local derivatives [1, 2]. Others, like isometric feature mapping (Isomap) or maximum variance unfolding (MVU), preserve local distances, estimating global manifold properties by continuation across neighbourhoods before embedding to lower dimensions by classical methods such as PCA or MDS [3]. While generally hewing to this same intuitive path, the range of available algorithms has grown very substantially in recent years [4, 5].

However, these approaches do not define distributions over the data or over the manifold properties. Thus, they provide no measures of uncertainty on manifold structure or on the low-dimensional locations of the embedded points; they cannot be combined with a structured probabilistic model within the manifold to define a full likelihood relative to the high-dimensional observations; and they provide only heuristic methods to evaluate the manifold dimensionality. As others have pointed out, they also make it difficult to extend the manifold definition to out-of-sample points in a principled way [6].

An established alternative is to construct an explicit probabilistic model of the functional relationship between low-dimensional manifold coordinates and each measured dimension of the data, assuming that the functions instantiate draws from Gaussian-process priors. The original Gaussian process latent variable model (GP-LVM) required optimisation of the low-dimensional coordinates, and thus still did not provide uncertainties on these locations or allow evaluation of the likelihood of a model over them [7]; however, a recent extension exploits an auxiliary variable approach to optimise a more general variational bound, thus retaining approximate probabilistic semantics within the latent space [8]. The stochastic process model for the mapping functions also makes it straightforward to estimate the function at previously unobserved points, thus generalising out-of-sample with ease. However, the GP-LVM gives up on the intuitive preservation of local neighbourhood properties that underpin the non-probabilistic methods reviewed above. Instead, the expected smoothness or other structure of the manifold must be defined by the Gaussian process covariance function, chosen a priori.

Here, we introduce a new probabilistic model over high-dimensional observations, low-dimensional embedded locations and locally linear mappings between the low- and high-dimensional spaces within each neighbourhood, such that each group of variables is Gaussian distributed given the other two. This locally linear latent variable model (LL-LVM) thus respects the same intuitions as the common non-probabilistic manifold discovery algorithms, while still defining a full-fledged probabilistic model.
Indeed, variational inference in this model follows more directly and with fewer separate bounding operations than the sparse auxiliary-variable approach used with the GP-LVM. Thus, uncertainty in the low-dimensional coordinates and in the manifold shape (defined by the local maps) is captured naturally. A lower bound on the marginal likelihood of the model makes it possible to select between different latent dimensionalities and, perhaps most crucially, between different definitions of neighbourhood, thus addressing an important unsolved issue with neighbourhood-defined algorithms. Unlike existing probabilistic frameworks with locally linear models, such as mixtures of factor analysers (MFA)-based and local tangent space analysis (LTSA)-based methods [9, 10, 11], LL-LVM does not require an extra step to obtain the globally consistent alignment of low-dimensional local coordinates (see footnote 1).

This paper is organised as follows. In section 2, we introduce our generative model, LL-LVM, for which we derive the variational inference method in section 3. We briefly describe the out-of-sample extension for LL-LVM and mathematically describe the dissimilarity between LL-LVM and GP-LVM at the end of section 3. In section 4, we demonstrate the approach on several real world problems.

Notation: Let $\mathrm{blkdiag}(\cdot)$ output a block diagonal matrix of its arguments. We formulate a diagonal matrix via $\mathrm{diag}(v)$, where the diagonal elements are $v$. We denote a vector of ones by $\mathbf{1}$. The Euclidean norm of a vector is $\|v\|$, and the Frobenius norm of a matrix is $\|M\|_F$. We represent the Kronecker product of two matrices by $M \otimes N$. For a random vector $w$, we denote the normalisation constant in its probability density function by $Z_w$. The expectation of a random vector $w$ with respect to a density $q$ is $\langle w \rangle_q$. The Kronecker delta is denoted by $\delta_{ij} = 1$ if $i = j$, and 0 otherwise.

2 Our model: LL-LVM

Suppose we have $n$ data points denoted by $y = [y_1^\top, \dots, y_n^\top]^\top$, where $y_i \in \mathbb{R}^{d_y}$, and there is a neighbourhood graph denoted by $G$ for $y$.
As in many nonlinear dimensionality reduction techniques, we assume that there is a low-dimensional (latent) representation of the high-dimensional data, $x = [x_1^\top, \dots, x_n^\top]^\top$, where $x_i \in \mathbb{R}^{d_x}$. Our key assumption is that there is a locally linear mapping between tangent spaces defined in the low- and high-dimensional spaces (see Fig. 1). The tangent spaces are approximated by $\{y_j - y_i\}$ and $\{x_j - x_i\}$, the pairwise differences between the $i$th point and its neighbouring points $j$. The matrix $C_i \in \mathbb{R}^{d_y \times d_x}$ at the $i$th point linearly maps those tangent spaces as

$$y_j - y_i \approx C_i (x_j - x_i). \tag{1}$$

Under this assumption, we aim to find the distribution over the linear maps $C = [C_1, \dots, C_n] \in \mathbb{R}^{d_y \times n d_x}$ and the latent variables $x$ that best describe the data likelihood given a graph $G$:

$$\log p(y \mid G) = \log \iint p(y, C, x \mid G)\, dx\, dC. \tag{2}$$

The joint distribution can be written in terms of priors on $C$, $x$ and the likelihood of $y$ as

$$p(y, C, x \mid G) = p(y \mid C, x, G)\, p(C \mid G)\, p(x \mid G). \tag{3}$$

In the following, we highlight the essential components of our model, referred to as the Locally Linear Latent Variable Model (LL-LVM). Detailed derivations are given in the Appendix.

Figure 1: Locally linear mapping $C_i$ for the $i$th data point transforms the tangent space $T_{x_i}\mathcal{M}_x$ at $x_i$ in the low-dimensional space to the tangent space $T_{y_i}\mathcal{M}_y$ at the corresponding data point $y_i$ in the high-dimensional space. A neighbouring data point is denoted by $y_j$ and the corresponding latent variable by $x_j$.

Footnote 1: As an exception, the MFA-based method presented in [12] jointly finds the global coordinate and estimates the model parameters by maximising the lower bound.

Adjacency matrix and Laplacian matrix. The graph $G$ for $n$ data points specifies the $n \times n$ symmetric adjacency matrix $G$. The $(i, j)$th element of $G$ is written $\eta_{ij}$, and is 1 if $y_j$ and $y_i$ are neighbouring and 0 if not. We assume $\eta_{ii} = 0$ to avoid degeneracy. We denote the graph Laplacian matrix by $L = \mathrm{diag}(G\mathbf{1}) - G$.

Prior on x. We assume that the latent variables are zero-centered and not too large. Furthermore, we assume that neighbouring latent variables are similar in terms of Euclidean distance. Formally, the log prior on $x$ is

$$\log p(x \mid G, \alpha) = -\frac{1}{2} \sum_{i=1}^{n} \Big( \alpha \|x_i\|^2 + \sum_{j=1}^{n} \eta_{ij} \|x_i - x_j\|^2 \Big) - \log Z_x,$$

where the parameter $\alpha > 0$ controls the expected scale. An equivalent form of the prior on $x$ is a multivariate normal distribution $p(x \mid G, \alpha) = \mathcal{N}(0, \Pi)$, where $\Omega^{-1} = 2L \otimes I_{d_x}$ and $\Pi^{-1} = \alpha I_{n d_x} + \Omega^{-1}$.
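The equivalence between the sum form of the prior on x and its multivariate normal form can be checked numerically. The sketch below is illustrative only: the function names and the toy chain graph are assumptions, and it takes Ω⁻¹ = 2L ⊗ I (the factor 2 comes from the double sum visiting each neighbour pair twice).

```python
import numpy as np

def graph_laplacian(G):
    """L = diag(G 1) - G for a symmetric 0/1 adjacency matrix G with zero diagonal."""
    return np.diag(G.sum(axis=1)) - G

def prior_precision_x(G, alpha, dx):
    """Pi^{-1} = alpha * I + Omega^{-1}, taking Omega^{-1} = 2 L kron I_dx (assumed factor 2)."""
    n = G.shape[0]
    omega_inv = 2.0 * np.kron(graph_laplacian(G), np.eye(dx))
    return alpha * np.eye(n * dx) + omega_inv

def penalty_sum_form(X, G, alpha):
    """alpha * sum_i ||x_i||^2 + sum_ij eta_ij ||x_i - x_j||^2, with X of shape (n, dx)."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return alpha * (X ** 2).sum() + (G * sq_dist).sum()

# Hypothetical toy chain graph on 4 points: 0-1-2-3.
G = np.zeros((4, 4))
for i in range(3):
    G[i, i + 1] = G[i + 1, i] = 1.0

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
x = X.reshape(-1)                      # stacked vector [x_1; ...; x_n]
quad_form = x @ prior_precision_x(G, 0.5, 2) @ x
sum_form = penalty_sum_form(X, G, 0.5)
```

Both `quad_form` and `sum_form` evaluate the same quantity, namely minus twice the unnormalised log prior, so they should agree to numerical precision.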
Prior on C. We assume that the linear maps of neighbouring points are similar in terms of Frobenius norm and not too large:

$$\log p(C \mid G) = -\frac{\epsilon_c}{2} \sum_{i=1}^{n} \|C_i\|_F^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \eta_{ij} \|C_i - C_j\|_F^2 - \log Z_c = -\frac{1}{2} \mathrm{Tr}\big[ (\epsilon_c I_{n d_x} + \Omega^{-1})\, C^\top C \big] - \log Z_c, \tag{4}$$

which is in the form of the density of a matrix normal distribution. By introducing a prior covariance $U$ to better capture the correct scaling, we have the matrix normal prior $p(C \mid G, U) = \mathcal{MN}(C \mid 0, U, (\epsilon_c I_{n d_x} + \Omega^{-1})^{-1})$. In our implementation, we fix $\epsilon_c$ to a small value, since the magnitude of $C_i$ and $x_i$ can be controlled by the hyper-parameter $\alpha$, which is optimised in the M-step.
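To make the matrix normal prior on C concrete, the following sketch draws one sample of C using Cholesky factors of the row and column covariances. It is not the authors' code: the function names, the toy graph, and the values of ε_c and β are hypothetical, and Ω⁻¹ = 2L ⊗ I is assumed.

```python
import numpy as np

def sample_matrix_normal(M, U, V, rng):
    """Draw A ~ MN(M, U, V), where U is the row covariance and V the column covariance."""
    Lu = np.linalg.cholesky(U)
    Lv = np.linalg.cholesky(V)
    Z = rng.standard_normal(M.shape)   # i.i.d. standard normal entries
    return M + Lu @ Z @ Lv.T

def prior_column_cov_C(G, dx, eps_c):
    """(eps_c * I + Omega^{-1})^{-1}, with Omega^{-1} = 2 L kron I_dx (assumed factor 2)."""
    L = np.diag(G.sum(axis=1)) - G
    precision = eps_c * np.eye(G.shape[0] * dx) + 2.0 * np.kron(L, np.eye(dx))
    return np.linalg.inv(precision)

# Hypothetical toy setting: chain graph on n = 4 points, dx = 2, dy = 3, U = I / beta.
G = np.zeros((4, 4))
for i in range(3):
    G[i, i + 1] = G[i + 1, i] = 1.0
dx, dy, beta = 2, 3, 2.0

rng = np.random.default_rng(0)
U = np.eye(dy) / beta
V = prior_column_cov_C(G, dx, eps_c=1e-2)
C = sample_matrix_normal(np.zeros((dy, 4 * dx)), U, V, rng)   # C = [C_1, ..., C_n]
```

Because ε_c is small, most of the prior variance lies along the direction in which all C_i are equal, so sampled maps of neighbouring points tend to look alike.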

Figure 2: Graphical representation of the generative process in LL-LVM. Given a dataset, we construct a neighbourhood graph $G$. The distribution over the latent variable $x$ is controlled by the graph $G$ as well as the parameter $\alpha$. Similarly, the distribution over the linear map $C$ is governed by the graph $G$ and a prior covariance term $U$. The latent variable $x$ and the linear map $C$ together determine the data likelihood.

Likelihood. In accordance with Eq. (1), we penalise the approximation error, which yields the log likelihood

$$\log p(y \mid C, x, V, G) = -\frac{\epsilon_y}{2} \sum_{i=1}^{n} \|y_i\|^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \eta_{ij} (\Delta y_{j,i} - C_i \Delta x_{j,i})^\top V^{-1} (\Delta y_{j,i} - C_i \Delta x_{j,i}) - \log Z_y, \tag{5}$$

where $\Delta y_{j,i} = y_j - y_i$ and $\Delta x_{j,i} = x_j - x_i$. An equivalent form of the likelihood is a multivariate normal distribution in $y$, $p(y \mid C, x, V, G) = \mathcal{N}(\mu_y, \Sigma_y)$, where $\Sigma_y^{-1} = \epsilon_y I_{n d_y} + 2L \otimes V^{-1}$, $\mu_y = \Sigma_y e$, and $e = [e_1^\top, \dots, e_n^\top]^\top \in \mathbb{R}^{n d_y}$, whose $i$th vector is given by $e_i = -\sum_{j=1}^{n} \eta_{ji} V^{-1} (C_j + C_i) \Delta x_{j,i}$. We fix $\epsilon_y$ to a small value, since the magnitude of $y_i - C_i x_i$ can be adjusted by $\alpha$. The graphical representation of the generative process of LL-LVM is given in Fig. 2.

3 Variational inference

Our goal is to infer the latent variables $(x, C)$ as well as the parameters $\theta = \{\alpha, U, V\}$ in LL-LVM. We infer them by maximising the lower bound $\mathcal{L}$ of the marginal likelihood of the observations:

$$\log p(y \mid G, \theta) \geq \iint q(C, x) \log \frac{p(y, C, x \mid G, \theta)}{q(C, x)}\, dx\, dC := \mathcal{L}(q(C, x), \theta). \tag{6}$$

Following the common treatment for computational tractability, we assume the posterior over $(C, x)$ factorises as $q(C, x) = q(x) q(C)$ [13]. We maximise the lower bound w.r.t. $q(C, x)$ and $\theta$ by the variational expectation maximization algorithm [14], which consists of (1) the variational expectation step for computing $q(C, x)$ by

$$q(x) \propto \exp\Big[ \int q(C) \log p(y, C, x \mid G, \theta)\, dC \Big], \tag{7}$$

$$q(C) \propto \exp\Big[ \int q(x) \log p(y, C, x \mid G, \theta)\, dx \Big], \tag{8}$$

then (2) the maximization step for estimating $\theta$ by $\hat{\theta} = \arg\max_\theta \mathcal{L}(q(C, x), \theta)$.

Variational-E step. Computing $q(x)$ from Eq. (38) requires rewriting the likelihood in Eq. (22) as a quadratic function in $x$:

$$p(y \mid C, x, \theta, G) = \frac{1}{Z_x} \exp\Big[ -\frac{1}{2} \big( x^\top A^{-1} x - 2 x^\top b \big) \Big],$$

where the normaliser $Z_x$ has all the terms that do not depend on $x$ from Eq. (22). The matrix $A^{-1}$ is defined by $A^{-1} = [A^{-1}_{ij}]_{i,j=1}^{n} \in \mathbb{R}^{n d_x \times n d_x}$, and the $(i, j)$th block is given by

$$A^{-1}_{ij} = (\delta_{ij} - 1)\, \eta_{ij} \big( C_i^\top V^{-1} C_i + C_j^\top V^{-1} C_j \big) + \delta_{ij} \sum_{k=1}^{n} \eta_{ik} \big( C_k^\top V^{-1} C_k + C_i^\top V^{-1} C_i \big).$$

The vector $b$ is defined by $b = [b_1^\top, \dots, b_n^\top]^\top \in \mathbb{R}^{n d_x}$, whose $i$th vector of length $d_x$ is $b_i = -\sum_{j=1}^{n} \eta_{ij} (C_j + C_i)^\top V^{-1} \Delta y_{j,i}$. The likelihood combined with the prior on $x$ gives us the Gaussian posterior over $x$ (i.e., solving Eq. (38)):

$$q(x) = \mathcal{N}(x \mid \mu_x, \Sigma_x), \quad \text{where } \Sigma_x^{-1} = \langle A^{-1} \rangle_{q(C)} + \Pi^{-1}, \quad \mu_x = \Sigma_x \langle b \rangle_{q(C)}. \tag{9}$$

Similarly, computing $q(C)$ from Eq. (39) requires rewriting the likelihood in Eq. (22) as a quadratic function in $C$:

$$p(y \mid C, x, \theta, G) = \frac{1}{Z_C} \exp\Big[ -\frac{1}{2} \mathrm{Tr}\big( \Gamma^{-1} C^\top V^{-1} C - 2 C^\top V^{-1} H \big) \Big], \tag{10}$$

where the normaliser $Z_C$ has all the terms that do not depend on $C$ from Eq. (22). The block diagonal matrix $\Gamma^{-1}$ is defined by $\Gamma^{-1} = \mathrm{blkdiag}(\Gamma^{-1}_{11}, \dots, \Gamma^{-1}_{nn}) \in \mathbb{R}^{n d_x \times n d_x}$, and the $i$th diagonal block is $\Gamma^{-1}_{ii} = \sum_{j=1}^{n} \eta_{ij} \Delta x_{j,i} \Delta x_{j,i}^\top \in \mathbb{R}^{d_x \times d_x}$. The matrix $H$ is defined by $H = [H_1, \dots, H_n] \in \mathbb{R}^{d_y \times n d_x}$, whose $i$th block is $H_i = \sum_{j=1}^{n} \eta_{ij} \Delta y_{j,i} \Delta x_{j,i}^\top \in \mathbb{R}^{d_y \times d_x}$. The likelihood combined with the prior on $C$ gives us the Gaussian posterior over $c = \mathrm{vec}(C)$ (i.e., solving Eq. (39)):

$$q(c) = \mathcal{N}(c \mid \mu_c, \Sigma_c), \quad \text{where } \Sigma_c^{-1} = \langle \Gamma^{-1} \rangle_{q(x)} \otimes V^{-1} + (\epsilon_c I + \Omega^{-1}) \otimes U^{-1}, \quad \mu_c = \Sigma_c\, \mathrm{vec}\big( V^{-1} \langle H \rangle_{q(x)} \big). \tag{11}$$

The expected values of $A$, $b$, $\Gamma$ and $H$ are given in the Appendix.

Variational-M step. We set the parameters $\theta$ by maximising $\mathcal{L}(q(C, x), \theta)$ w.r.t. $\theta$, which is split into three terms based on dependence on each parameter: (1) the expected log-likelihood for updating $V$ by $\arg\max_V \mathbb{E}_{q(x)q(C)}[\log p(y \mid C, x, V, G)]$; (2) the negative KL divergence between the prior and the posterior on $C$ for updating $U$ by $\arg\max_U \mathbb{E}_{q(x)q(C)}[\log p(C \mid G, U) - \log q(C)]$; and (3) the negative KL divergence between the prior and the posterior on $x$ for updating $\alpha$ by $\arg\max_\alpha \mathbb{E}_{q(x)q(C)}[\log p(x \mid G, \alpha) - \log q(x)]$. Update rules for each hyperparameter are given in the Appendix.

The full EM algorithm starts with an initial value of $\theta$. In the E-step, given $q(C)$, compute $q(x)$ as in Eq. (9). Likewise, given $q(x)$, compute $q(C)$ as in Eq. (11). The parameters $\theta$ are updated in the M-step by maximising Eq. (37). The two steps are repeated until the variational lower bound in Eq. (37) saturates. To give a sense of how the algorithm works, we visualise fitting results for the two simulated examples in Fig. 3. Using the graph constructed from 3D observations, we run our EM algorithm. The posterior means of $x$ and $C$ resemble the true manifolds in 2D and 3D spaces, respectively.

Figure 3: Two simulated examples. A: 800 data points drawn from Swiss Roll. B: A neighbourhood graph (k = 6) constructed from data points shown in A, but visualised with true latent points in 2D. C: Posterior means of $C$, $x$ after 40 EM iterations. D: 800 data points drawn from a 3D Gaussian, which we used to construct a graph shown in E with k = 20. F: Posterior means of $C$, $x$ after 40 EM iterations.

Out-of-sample extension. In the LL-LVM model one can formulate a computationally efficient out-of-sample extension technique as follows. Given $n$ data points denoted by $\mathcal{D} = \{y_1, \dots, y_n\}$, the variational EM algorithm derived in the previous section converts $\mathcal{D}$ into the posterior $q(x, C)$: $\mathcal{D} \mapsto q(x) q(C)$. Now, given a new high-dimensional data point $y^*$, one can first find the neighbourhood of $y^*$ without changing the current neighbourhood graph. Then, it is possible to compute the distributions over the corresponding locally linear map and latent variable $q(C^*, x^*)$ by simply performing the E-step given $q(x) q(C)$ (freezing all other quantities) as $\mathcal{D} \cup \{y^*\} \mapsto q(x) q(C) q(x^*) q(C^*)$.

Figure 4: Resolving short-circuiting problems using the variational lower bound. A: Visualization of 800 samples drawn from a Swiss Roll in 3D space. Points 8 (red) and 9 (blue) are close to each other (dotted grey) in 3D. B: Visualization of the 800 samples on the latent 2D manifold. The distance between points 8 and 9 is seen to be large. C: Posterior mean of $x$ with/without short-circuiting the 8th and the 9th data points in the graph $G$. The EM algorithm without the shortcut (left) achieves a higher lower bound than that with the shortcut. The red and blue parts are mixed in the resulting estimate in 2D space (right) in the case with the shortcut. The lower bound is obtained after 30 EM iterations.

Comparison to GP-LVM. A closely related probabilistic dimensionality reduction algorithm to LL-LVM is GP-LVM [7]. GP-LVM defines the mapping from the latent space to data space using Gaussian processes. The likelihood of the observations $Y = [y_1, \dots, y_{d_y}] \in \mathbb{R}^{n \times d_y}$ given latent variables $X = [x_1, \dots, x_{d_x}] \in \mathbb{R}^{n \times d_x}$ is defined by $p(Y \mid X) = \prod_{k=1}^{d_y} \mathcal{N}(y_k \mid 0, K_{nn} + \beta^{-1} I_n)$, where the $(i, j)$th element of the covariance matrix is of the squared exponential form: $k(x_i, x_j) = \sigma_f^2 \exp\big[ -\frac{1}{2} \sum_{q=1}^{d_x} \alpha_q (x_{i,q} - x_{j,q})^2 \big]$. Here, the parameters $\alpha_q$ determine the relevant dimensionality of the latent space [8]. In LL-LVM, once we integrate out $C$ from Eq. (22), we also obtain the Gaussian likelihood given $x$,

$$p(y \mid x, G, \theta) = \int p(y \mid C, x, G, \theta)\, p(C \mid G, \theta)\, dC = \frac{1}{Z_{Y_y}} \exp\Big[ -\frac{1}{2} y^\top K_{LL}^{-1} y \Big].$$
In contrast to GP-LVM, the precision matrix $K_{LL}^{-1}$ takes the form $(2L \otimes V^{-1}) - (W \otimes V^{-1}) \Lambda (W^\top \otimes V^{-1})$, where both $W$ and $\Lambda$ are functions of the Laplacian matrix (their formulae are given in the Appendix). Therefore, under our model, the graph structure directly determines the functional form of the precision in the likelihood term.

4 Experiments

4.1 Mitigating the short-circuit problem

Like other neighbour-based methods, LL-LVM is sensitive to misspecified neighbourhoods; the prior, likelihood, and posterior all depend on the assumed graph. Unlike other methods, LL-LVM provides a natural way to evaluate possible short-circuits using the variational lower bound of Eq. (37). Fig. 4 shows 800 samples drawn from a Swiss Roll in 3D space (Fig. 4A). Two points, labelled 8 and 9, happen to fall close to each other in 3D, but are actually far apart on the latent (2D) surface (Fig. 4B). A k-nearest-neighbour graph might link these, distorting the recovered coordinates. However, evaluating the model with and without this edge yields a higher variational bound for the correct graph. Although it is prohibitive to evaluate every possible graph in this way, the availability of a principled criterion to test specific hypotheses is of obvious value.

In the following, we demonstrate LL-LVM on two real datasets: handwritten digits and climate data.
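As a rough illustration of how candidate graphs for such a comparison might be produced, the sketch below builds a symmetric k-nearest-neighbour adjacency matrix and returns a copy with one edge forced on or off. The helper names and the rule "link i and j if either is among the other's k nearest" are assumptions, not taken from the paper; scoring each candidate graph would still require running EM and evaluating the lower bound, which is not shown.

```python
import numpy as np

def knn_adjacency(Y, k):
    """Symmetric 0/1 k-NN adjacency (eta_ij) with zero diagonal; Y has shape (n, dy)."""
    n = Y.shape[0]
    sq_dist = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dist, np.inf)               # a point is never its own neighbour
    nearest = np.argsort(sq_dist, axis=1)[:, :k]    # indices of the k closest points
    G = np.zeros((n, n))
    G[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(G, G.T)                       # symmetrise (assumed "either" rule)

def toggle_edge(G, i, j, present):
    """Return a copy of G with the edge (i, j) forced present or absent."""
    H = G.copy()
    H[i, j] = H[j, i] = 1.0 if present else 0.0
    return H

# Hypothetical data: 30 random points in 3D, k = 4.
rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 3))
G_knn = knn_adjacency(Y, k=4)
G_without = toggle_edge(G_knn, 0, 5, present=False)   # candidate graph without a suspect edge
G_with = toggle_edge(G_knn, 0, 5, present=True)       # candidate graph with it
```

The two candidate graphs differ only in the suspect edge, so any difference in the resulting lower bounds can be attributed to that single neighbourhood hypothesis.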

Figure 5: USPS handwritten digit dataset described in section 4.2. A: Mean (in solid) and variance (1 standard deviation shading) of the variational lower bound across 10 different random starts of the EM algorithm with different k's. The highest lower bound is achieved when k = n/4. B: The posterior mean of $x$ in 2D. Each digit is colour coded. On the right side are reconstructions of $y^*$ for randomly chosen query points $x^*$. Using neighbouring $y$ and posterior means of $C$ we can recover $y^*$ successfully (see text). C: Fitting results by GP-LVM using the same data. D: ISOMAP and E: LLE with the same k. Using the extracted features (in 2D), we evaluated a 1-NN classifier for digit identity with 5-fold cross-validation (the same data divided into 5 training and test sets). The classification error is shown in F. LL-LVM and ISOMAP features yield the lowest error.

4.2 Modelling USPS handwritten digits

As a first real-data example, we test our method on a subset of 120 samples each of the digits 0, 1, 2, 3, 4 from the USPS digit dataset, where each digit is of size 16 × 16 (i.e., $n = 600$, $d_y = 256$). We follow [7], and represent the low-dimensional latent variables in 2D. In this experiment, we set $U^{-1} = \beta I$ and $V^{-1} = \gamma I$ for simplicity, so the parameters are $\theta = \{\alpha, \beta, \gamma\}$.

Fig. 5A shows variational lower bounds for different values of k, using 10 different EM initialisations. The posterior mean of $x$ under LL-LVM using the best k is illustrated in Fig. 5B, where grouping by digit identity (represented in colour) is notable. Fig. 5B also shows reconstructions of one randomly-selected example of each digit, using its 2D coordinates $x^*$ as well as the posterior mean coordinates $\hat{x}_i$, tangent spaces $\hat{C}_i$ and actual images $y_i$ of its k = n/4 closest neighbours. The reconstruction is based on the assumed tangent-space structure of the generative model (Eq. (22)), that is:

$$\hat{y}^* = \frac{1}{k} \sum_{i=1}^{k} \big[ y_i + \hat{C}_i (x^* - \hat{x}_i) \big].$$

A similar process could be used to reconstruct digits at out-of-sample locations. Finally, we quantify the relevance of the recovered subspace by computing the error incurred using a simple classifier to report digit identity using the 2D features obtained by LL-LVM and various competing methods (Fig. 5C-F). Classification with LL-LVM coordinates outperforms GP-LVM and LLE, and matches ISOMAP.

4.3 Mapping climate data

In a second experiment, we attempted to recover 2D geographical relationships between weather stations from recorded monthly precipitation patterns. Data were obtained by averaging month-by-month annual precipitation records at 609 weather stations scattered across the US (see Fig. 6 and footnote 2). Thus, the data set comprised 609 12-dimensional vectors. The goal of the experiment is to recover the two-dimensional topology of the weather stations (as given by their latitude and longitude) using only these 12-dimensional climatic measurements. As before, we compare the projected points obtained by LL-LVM with several widely used dimensionality reduction techniques. For the graph-based methods LL-LVM, LTSA, ISOMAP, and LLE, we used 12-NN with Euclidean distance to construct the neighbourhood graph.

Figure 6: Climate modelling problem as described in section 4.3. Panels: (a) 609 weather stations (latitude against longitude), (b) LLE, (c) LTSA, (d) ISOMAP, (e) GP-LVM, (f) LL-LVM. Each example corresponding to a weather station is a 12-dimensional vector of monthly precipitation measurements. Using only the measurements, the projection obtained from the proposed LL-LVM recovers the topological arrangement of the stations to a large degree.

The results are presented in Fig. 6. LL-LVM identified a more geographically accurate arrangement for the weather stations than the other algorithms. The fully probabilistic nature of LL-LVM and GP-LVM allowed these algorithms to handle the noise present in the measurements in a principled way. This contrasts with ISOMAP, which can be topologically unstable [16], i.e. vulnerable to short-circuit errors if the neighbourhood is too large (see also the Appendix). Perhaps coincidentally, LL-LVM also seems to respect local geography more fully in places than does GP-LVM.

5 Conclusion

We have demonstrated a new probabilistic approach to non-linear manifold discovery that embodies the central notion that local geometries are mapped linearly between manifold coordinates and high-dimensional observations. The approach offers a natural variational algorithm for learning, quantifies local uncertainty in the manifold, and permits evaluation of hypothetical neighbourhood relationships.
In the present study, we have described the LL-LVM model conditioned on a neighbourhood graph. In principle, it is also possible to extend LL-LVM so as to construct a distance matrix as in [17], by maximising the data likelihood. We leave this as a direction for future work.

Footnote 2: The dataset is made available by the National Climatic Data Center at gov/oa/climate/research/ushcn/. We use version 2.5 monthly data [15].

References

[1] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323-2326, 2000.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.
[3] J. B. Tenenbaum, V. Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319-2323, 2000.
[4] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik. Dimensionality reduction: A comparative review. jz/ dimensionality_reduction_a_comparative_review.pdf.
[5] L. Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep., pages 1-17.
[6] J. Platt. Fastmap, metricmap, and landmark MDS are all Nyström algorithms. In Proceedings of 10th International Workshop on Artificial Intelligence and Statistics, pages 261-268, 2005.
[7] N. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In NIPS, 2003.
[8] M. K. Titsias and N. D. Lawrence. Bayesian Gaussian process latent variable model. In AISTATS, 2010.
[9] S. Roweis, L. Saul, and G. Hinton. Global coordination of local linear models. In NIPS, 2002.
[10] M. Brand. Charting a manifold. In NIPS, 2003.
[11] Y. Zhan and J. Yin. Robust local tangent space alignment. In NIPS.
[12] J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8), 2006.
[13] C. Bishop. Pattern recognition and machine learning. Springer New York, 2006.
[14] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, University College London, 2003.
[15] M. Menne, C. Williams, and R. Vose. The U.S. historical climatology network monthly temperature data, version 2.5. Bulletin of the American Meteorological Society, 90(7), July 2009.
[16] M. Balasubramanian and E. L. Schwartz. The isomap algorithm and topological stability. Science, 295(5552):7, January 2002.
[17] N. Lawrence. Spectral dimensionality reduction via maximum entropy. In AISTATS, pages 51-59, 2011.
[18] K. B. Petersen and M. S. Pedersen. The matrix cookbook, Nov. 2012.

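Appendix

Notation. The vectorized version of a matrix is $\mathrm{vec}(M)$. We denote an identity matrix of size $m$ by $I_m$. Other notations are the same as used in the main text.

A Matrix normal distribution

The matrix normal distribution generalises the standard multivariate normal distribution to matrix-valued variables. A matrix $A \in \mathbb{R}^{n \times p}$ is said to follow a matrix normal distribution $\mathcal{MN}_{n,p}(M, U, V)$ with parameters $U$ and $V$ if its density is given by

$$p(A \mid M, U, V) = \frac{\exp\big( -\frac{1}{2} \mathrm{Tr}\big[ V^{-1} (A - M)^\top U^{-1} (A - M) \big] \big)}{(2\pi)^{np/2}\, |V|^{n/2}\, |U|^{p/2}}. \tag{12}$$

If $A \sim \mathcal{MN}(M, U, V)$, then $\mathrm{vec}(A) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U)$, a relationship we will use to simplify many expressions.

B Matrix normal expressions of priors and likelihood

Recall that $G_{ij} = \eta_{ij}$.

Prior on low dimensional latent variables

$$\log p(x \mid G, \alpha) = -\frac{\alpha}{2} \sum_{i=1}^{n} \|x_i\|^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \eta_{ij} \|x_i - x_j\|^2 - \log Z_x \tag{13}$$

$$= -\frac{1}{2} \log |2\pi\Pi| - \frac{1}{2} x^\top \Pi^{-1} x, \tag{14}$$

where $\Pi^{-1} := \alpha I_{n d_x} + \Omega^{-1}$, $\Omega^{-1} := 2L \otimes I_{d_x}$, and $L := \mathrm{diag}(G\mathbf{1}) - G$. $L$ is known as a graph Laplacian. It follows that $p(x \mid G, \alpha) = \mathcal{N}(0, \Pi)$. The prior covariance $\Pi$ can be rewritten as

$$\Pi^{-1} = \alpha I_n \otimes I_{d_x} + 2L \otimes I_{d_x} \tag{15}$$

$$= (\alpha I_n + 2L) \otimes I_{d_x}, \tag{16}$$

$$\Pi = (\alpha I_n + 2L)^{-1} \otimes I_{d_x}. \tag{17}$$

By the relationship between matrix normal and multivariate normal distributions described in section A, the equivalent prior for the matrix $X = [x_1\ x_2\ \cdots\ x_n] \in \mathbb{R}^{d_x \times n}$, constructed by reshaping $x$, is given by

$$p(X \mid G, \alpha) = \mathcal{MN}(X \mid 0, I_{d_x}, (\alpha I_n + 2L)^{-1}). \tag{18}$$

Prior on locally linear maps

Recall that $C = [C_1, \dots, C_n] \in \mathbb{R}^{d_y \times n d_x}$, where each $C_i \in \mathbb{R}^{d_y \times d_x}$. We formulate the log prior on $C$ as

$$\log p(C \mid G) = -\frac{\epsilon_c}{2} \sum_{i=1}^{n} \|C_i\|_F^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \eta_{ij} \|C_i - C_j\|_F^2 - \log Z_c,$$

$$= -\frac{\epsilon_c}{2} \mathrm{Tr}\big( C^\top C \big) - \frac{1}{2} \mathrm{Tr}\big( \Omega^{-1} C^\top C \big) - \log Z_c,$$

$$= -\frac{1}{2} \mathrm{Tr}\big[ (\epsilon_c I_{n d_x} + \Omega^{-1})\, C^\top C \big] - \log Z_c. \tag{19}$$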

In the first line, the first term imposes a soft constraint that each $C_i$ should not be too large. The second term encourages the locally linear maps of neighbouring points $i$ and $j$ to be similar in the sense of the Frobenius norm. Notice that the last line is in the form of the log of a matrix normal density with mean 0. By explicitly writing the normalizing constant $Z_c$, we have

$$\log p(C \mid G, U) = -\frac{1}{2} \mathrm{Tr}\big[ (\epsilon_c I_{n d_x} + \Omega^{-1})\, C^\top U^{-1} C \big] - \frac{n d_x d_y}{2} \log(2\pi) + \frac{d_y}{2} \log \big| \epsilon_c I_{n d_x} + \Omega^{-1} \big| - \frac{n d_x}{2} \log |U|, \tag{20}$$

where we introduced $U^{-1}$ to better capture the correct scaling. The expression is equivalent to

$$p(C \mid G, U) = \mathcal{MN}(C \mid 0, U, (\epsilon_c I_{n d_x} + \Omega^{-1})^{-1}). \tag{21}$$

For simplicity we use the diagonal form of the precision matrix $U^{-1} = \beta I_{d_y}$. In our implementation, we fix $\epsilon_c$ to a small value, since the magnitude of $C_i$ and $x_i$ can be controlled by the hyperparameter $\alpha$, which is optimized in the M-step.

Likelihood

We penalise the linear approximation error of the tangent spaces. Assume that the noise precision matrix is a scaled identity matrix, e.g., $V^{-1} = \gamma I_{d_y}$. Then

$$\log p(y \mid x, C, V, G) \tag{22}$$

$$= -\frac{\epsilon_y}{2} \sum_{i=1}^{n} \|y_i\|^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \eta_{ij} \big( (y_j - y_i) - C_i (x_j - x_i) \big)^\top V^{-1} \big( (y_j - y_i) - C_i (x_j - x_i) \big) - \log Z_y \tag{23}$$

$$= -\frac{1}{2} \big( y^\top \Sigma_y^{-1} y - 2 y^\top e + f \big) - \log Z_y, \tag{24}$$

where

$$\Sigma_y^{-1} = \epsilon_y I_{n d_y} + 2L \otimes V^{-1}, \tag{25}$$

$$e = [e_1^\top, \dots, e_n^\top]^\top \in \mathbb{R}^{n d_y}, \tag{26}$$

$$e_i = -\sum_{j=1}^{n} \eta_{ji} V^{-1} (C_j + C_i)(x_j - x_i), \tag{27}$$

$$f = \sum_{i=1}^{n} \sum_{j=1}^{n} \eta_{ij} (x_j - x_i)^\top C_i^\top V^{-1} C_i (x_j - x_i). \tag{28}$$

By completing the quadratic form in $y$, the likelihood is given by

$$p(y \mid x, C, V, G) = \mathcal{N}(\mu_y, \Sigma_y), \tag{29}$$

$$\mu_y = \Sigma_y e, \tag{30}$$

$$Z_y = |2\pi\Sigma_y|^{1/2}. \tag{31}$$

We can rewrite

$$\Sigma_y^{-1} = \epsilon_y I_{n d_y} + 2L \otimes V^{-1} \tag{32}$$

$$= (\epsilon_y I_n + 2\gamma L) \otimes I_{d_y}, \tag{33}$$

$$\Sigma_y = (\epsilon_y I_n + 2\gamma L)^{-1} \otimes I_{d_y}, \tag{34}$$

and therefore the equivalent expression in terms of a matrix normal distribution for $Y = [y_1, y_2, \dots, y_n] \in \mathbb{R}^{d_y \times n}$ is given by

$$p(Y \mid x, C, \gamma, G) = \mathcal{MN}(Y \mid M_y, I_{d_y}, (\epsilon_y I_n + 2\gamma L)^{-1}), \tag{35}$$

$$M_y = E (\epsilon_y I_n + 2\gamma L)^{-1}, \tag{36}$$

where $E = [e_1, \dots, e_n] \in \mathbb{R}^{d_y \times n}$. Similar to $\epsilon_c$, we also fix $\epsilon_y$ to a small value, since the magnitude of $y_i - C_i x_i$ can be adjusted by $\alpha$, which is optimized in the M-step.
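The Kronecker structure of the likelihood precision means that the mean can be computed by solving an n × n system instead of an (n·d_y) × (n·d_y) one. The sketch below compares the two routes numerically. It is only a check of the algebra: the function name is hypothetical, `E` is filled with random numbers rather than computed from C and x, and the factor 2 on L is assumed as in the prior.

```python
import numpy as np

def likelihood_mean_two_ways(G, E, gamma, eps_y):
    """Return mu_y from the full system and M_y from the matrix-normal form.

    G: (n, n) adjacency, E: (dy, n) matrix whose columns are e_1, ..., e_n.
    """
    n, dy = G.shape[0], E.shape[0]
    L = np.diag(G.sum(axis=1)) - G
    # Full (n*dy) x (n*dy) precision: eps_y * I + 2 L kron V^{-1}, with V^{-1} = gamma * I.
    Sigma_inv = eps_y * np.eye(n * dy) + 2.0 * np.kron(L, gamma * np.eye(dy))
    e = E.reshape(-1, order="F")                    # stack columns: [e_1; ...; e_n]
    mu_full = np.linalg.solve(Sigma_inv, e)
    # Small n x n system: M_y = E (eps_y * I_n + 2 gamma L)^{-1}.
    S_inv = eps_y * np.eye(n) + 2.0 * gamma * L
    M_y = np.linalg.solve(S_inv, E.T).T             # valid because S_inv is symmetric
    return mu_full, M_y

# Hypothetical toy setting: chain graph on 5 points, dy = 3.
G = np.zeros((5, 5))
for i in range(4):
    G[i, i + 1] = G[i + 1, i] = 1.0
rng = np.random.default_rng(0)
E = rng.standard_normal((3, 5))
mu_full, M_y = likelihood_mean_two_ways(G, E, gamma=2.0, eps_y=1e-3)
```

Stacking the columns of `M_y` reproduces `mu_full`, which is the statement that Eqs. (30) and (36) describe the same mean.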

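C Variational inference

In LL-LVM, the goal is to infer the latent variables $(x, C)$ as well as to learn the hyper-parameters $\theta = \{\alpha, \beta, \gamma\}$. We infer them by maximising the lower bound of the marginal likelihood of the observations $y$:

$$\log p(y \mid \theta, G) = \log \iint p(y, C, x \mid G, \theta)\, dx\, dC \geq \iint q(C, x) \log \frac{p(y, C, x \mid G, \theta)}{q(C, x)}\, dx\, dC = \mathcal{F}(q(C, x), \theta).$$

For computational tractability, we assume that the posterior over $(C, x)$ factorizes as

$$q(C, x) = q(x)\, q(C), \tag{37}$$

where $q(x)$ and $q(C)$ are multivariate normal distributions.

We maximize the lower bound w.r.t. $q(C, x)$ and $\theta$ by the variational expectation maximization algorithm, which consists of (1) the variational expectation step for determining $q(C, x)$ by

$$q(x) \propto \exp\Big[ \int q(C) \log p(y, C, x \mid G, \theta)\, dC \Big], \tag{38}$$

$$q(C) \propto \exp\Big[ \int q(x) \log p(y, C, x \mid G, \theta)\, dx \Big], \tag{39}$$

followed by (2) the maximization step for estimating $\theta$, $\hat{\theta} = \arg\max_\theta \mathcal{F}(q(C, x), \theta)$.

C.1 VE step

C.1.1 Computing q(x)

In the variational E-step, we compute $q(x)$ by integrating out $C$ from the total log joint distribution:

$$\log q(x) = \mathbb{E}_{q(C)}[\log p(y, C, x \mid G, \theta)] + \text{const}, \tag{40}$$

$$= \mathbb{E}_{q(C)}[\log p(y \mid C, x, G, \theta) + \log p(x \mid G, \theta) + \log p(C \mid G, \theta)] + \text{const}. \tag{41}$$

To determine $q(x)$, we first rewrite $p(y \mid C, x, G, \theta)$ as a quadratic function in $x$:

$$\log p(y \mid C, x, G, \theta) = -\frac{1}{2} \big( x^\top A^{-1} x - 2 x^\top b + d \big) - \log Z_y, \tag{42}$$

where

$$A^{-1} = \begin{bmatrix} A^{-1}_{11} & A^{-1}_{12} & \cdots & A^{-1}_{1n} \\ \vdots & & \ddots & \vdots \\ A^{-1}_{n1} & \cdots & & A^{-1}_{nn} \end{bmatrix} \in \mathbb{R}^{n d_x \times n d_x}, \tag{43}$$

$$A^{-1}_{ij} = (\delta_{ij} - 1)\, \eta_{ij} \big( C_i^\top V^{-1} C_i + C_j^\top V^{-1} C_j \big) + \delta_{ij} \sum_{k=1}^{n} \eta_{ik} \big( C_k^\top V^{-1} C_k + C_i^\top V^{-1} C_i \big) \in \mathbb{R}^{d_x \times d_x}, \tag{44}$$

$$b = [b_1^\top, \dots, b_n^\top]^\top \in \mathbb{R}^{n d_x}, \tag{45}$$

$$b_i = \sum_{j=1}^{n} \eta_{ij} \big( C_j^\top V^{-1} (y_i - y_j) - C_i^\top V^{-1} (y_j - y_i) \big) \in \mathbb{R}^{d_x}, \tag{46}$$

$$d = \sum_{i=1}^{n} \Big( \epsilon_y\, y_i^\top y_i + \sum_{j=1}^{n} \eta_{ij} (y_j - y_i)^\top V^{-1} (y_j - y_i) \Big), \tag{47}$$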

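and the Kronecker delta is denoted by $\delta_{ij} = 1$ if $i = j$, and 0 otherwise. With the likelihood expressed as a quadratic function of $x$, the log posterior over $x$ is given by

$$\log q(x) = -\frac{1}{2} \mathbb{E}_{q(C)}\big[ (x^\top A^{-1} x - 2 x^\top b + d) + x^\top \Pi^{-1} x \big] + \text{const}, \tag{48}$$

$$= -\frac{1}{2} \big[ x^\top \big( \langle A^{-1} \rangle_{q(C)} + \Pi^{-1} \big) x - 2 x^\top \langle b \rangle_{q(C)} \big] + \text{const}, \tag{49}$$

which can be recognized as a multivariate normal distribution. The posterior over $x$ is then given by

$$q(x) = \mathcal{N}(x \mid \mu_x, \Sigma_x), \tag{50}$$

where

$$\Sigma_x^{-1} = \langle A^{-1} \rangle_{q(C)} + \Pi^{-1}, \tag{51}$$

$$\mu_x = \Sigma_x \langle b \rangle_{q(C)}. \tag{52}$$

Notice that the parameters of $q(x)$ depend on the sufficient statistics $\langle A^{-1} \rangle_{q(C)}$ and $\langle b \rangle_{q(C)}$, whose explicit forms are given in section C.1.2.

C.1.2 Sufficient statistics A and b for q(x)

Given the posterior over $C$, the sufficient statistics $\langle A^{-1} \rangle_{q(C)}$ and $\langle b \rangle_{q(C)}$ necessary to characterise $q(x)$ are computed as follows:

$$\langle A^{-1}_{ij} \rangle_{q(C)} = \eta_{ij}\, \gamma\, (\delta_{ij} - 1) \langle C_i^\top C_i + C_j^\top C_j \rangle_{q(C)} + \gamma\, \delta_{ij} \sum_{k=1}^{n} \eta_{ik} \langle C_k^\top C_k + C_i^\top C_i \rangle_{q(C)} \in \mathbb{R}^{d_x \times d_x}, \tag{53}$$

$$\langle b_i \rangle_{q(C)} = \gamma \sum_{j=1}^{n} \eta_{ij} \big( \langle C_j \rangle_{q(C)}^\top (y_i - y_j) - \langle C_i \rangle_{q(C)}^\top (y_j - y_i) \big) \in \mathbb{R}^{d_x}, \tag{54}$$

where (55)

$$\langle C_i \rangle_{q(C)} = i\text{th chunk of } \mu_C, \text{ i.e., } \mu_C = [\langle C_1 \rangle_{q(C)}, \dots, \langle C_n \rangle_{q(C)}] \in \mathbb{R}^{d_y \times n d_x}, \tag{56}$$

$$\langle C_i^\top C_i \rangle_{q(C)} = d_y\, \Sigma_C^{(ii)} + \langle C_i \rangle_{q(C)}^\top \langle C_i \rangle_{q(C)}, \tag{57}$$

with $\Sigma_C^{(ii)}$ the $(i, i)$th $d_x \times d_x$ chunk of $\Sigma_C$. The factor $d_y$ arises because the $d_y$ rows of $C$ are independent under $q(C) = \mathcal{MN}(\mu_C, I_{d_y}, \Sigma_C)$, each with covariance $\Sigma_C$.

C.1.3 Computing q(C)

Next, we compute $q(C)$ by integrating out $x$ from the total log joint distribution:

$$\log q(C) = \mathbb{E}_{q(x)}[\log p(y, C, x \mid G, \theta)] + \text{const}, \tag{58}$$

$$= \mathbb{E}_{q(x)}[\log p(y \mid C, x, G, \theta) + \log p(x \mid G, \theta) + \log p(C \mid G, \theta)] + \text{const}. \tag{59}$$

We rewrite $p(y \mid C, x, G, \theta)$ as a quadratic function in $C$:

$$\log p(y \mid C, x, G, \theta) = -\frac{1}{2} \mathrm{Tr}\big( \Gamma^{-1} C^\top V^{-1} C - 2 C^\top V^{-1} H \big) - \frac{1}{2} d - \log Z_y, \tag{60}$$

where $d$ is as in Eq. (47) and

$$\Gamma^{-1} = \begin{bmatrix} \Gamma^{-1}_{11} & & 0 \\ & \ddots & \\ 0 & & \Gamma^{-1}_{nn} \end{bmatrix} \in \mathbb{R}^{n d_x \times n d_x}, \qquad \Gamma^{-1}_{ii} = \sum_{j=1}^{n} \eta_{ij} (x_j - x_i)(x_j - x_i)^\top \in \mathbb{R}^{d_x \times d_x}, \tag{61}$$

$$H = [H_1, \dots, H_n] \in \mathbb{R}^{d_y \times n d_x}, \tag{62}$$

$$H_i = \sum_{j=1}^{n} \eta_{ij} (y_j - y_i)(x_j - x_i)^\top \in \mathbb{R}^{d_y \times d_x}. \tag{63}$$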
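The blocks of Γ⁻¹ and H in Eqs. (61) to (63) are sums of outer products over neighbours, which vectorise naturally. The sketch below evaluates them at point values of x. It is an illustration under assumptions: the function name is hypothetical, and the E-step proper uses expectations under q(x), which add covariance terms not included here.

```python
import numpy as np

def gamma_inv_and_H(X, Y, G):
    """Point-value versions of Gamma^{-1}_ii and H_i.

    X: (n, dx) latent points, Y: (n, dy) observations, G: (n, n) adjacency.
    Returns Gamma_inv of shape (n, dx, dx) and H of shape (dy, n*dx).
    """
    dX = X[None, :, :] - X[:, None, :]              # dX[i, j] = x_j - x_i
    dY = Y[None, :, :] - Y[:, None, :]              # dY[i, j] = y_j - y_i
    Gamma_inv = np.einsum("ij,ija,ijb->iab", G, dX, dX)
    H_blocks = np.einsum("ij,ija,ijb->iab", G, dY, dX)   # (n, dy, dx)
    H = np.concatenate(list(H_blocks), axis=1)      # H = [H_1, ..., H_n]
    return Gamma_inv, H

# Hypothetical check: if y_i = C0 x_i for one global linear map C0, then H_i = C0 Gamma^{-1}_ii.
rng = np.random.default_rng(0)
n, dx, dy = 6, 2, 3
G = np.zeros((n, n))
for i in range(n - 1):
    G[i, i + 1] = G[i + 1, i] = 1.0
X = rng.standard_normal((n, dx))
C0 = rng.standard_normal((dy, dx))
Y = X @ C0.T
Gamma_inv, H = gamma_inv_and_H(X, Y, G)
```

In this noise-free linear case every block satisfies H_i = C0 Γ⁻¹_ii, so the posterior mean of each C_i is pulled towards the common map C0.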

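By expanding out $\log q(C)$, we have

\[ \log q(C) = -\frac{1}{2} \mathrm{Tr}\left[ \langle \Gamma^{-1} \rangle_{q(x)} C^\top V^{-1} C - 2 C^\top V^{-1} \langle H \rangle_{q(x)} + (\epsilon_c I_{n d_x} + \Omega^{-1})\, C^\top U^{-1} C \right] + \text{const}. \tag{64} \]

It follows that the approximate posterior over $C$ is given by $q(C) = \mathcal{MN}(\mu_C, I_{d_y}, \Sigma_C)$, where

\begin{align}
\Sigma_C^{-1} &= \gamma \langle \Gamma^{-1} \rangle_{q(x)} + \beta (\epsilon_c I_{n d_x} + \Omega^{-1}) \in \mathbb{R}^{n d_x \times n d_x}, \tag{65} \\
\mu_C &= V^{-1} \langle H \rangle_{q(x)} \Sigma_C \in \mathbb{R}^{d_y \times n d_x}. \tag{66}
\end{align}

The parameters of $q(C)$ depend on the sufficient statistics $\langle \Gamma^{-1} \rangle_{q(x)}$ and $\langle H \rangle_{q(x)}$, which are given in section C.1.4.

C.1.4 Sufficient statistics Γ and H

Given the posterior over $x$, the sufficient statistics $\langle \Gamma^{-1} \rangle_{q(x)}$ and $\langle H \rangle_{q(x)}$ necessary to characterise $q(C)$ are computed as follows:

\begin{align}
\langle \Gamma^{-1}_{ii} \rangle_{q(x)} &= \sum_{j=1}^{n} \eta_{ij} \langle (x_j - x_i)(x_j - x_i)^\top \rangle_{q(x)} \tag{67} \\
&= \sum_{j=1}^{n} \eta_{ij} \langle x_j x_j^\top - x_j x_i^\top - x_i x_j^\top + x_i x_i^\top \rangle_{q(x)}, \tag{68} \\
\langle H_i \rangle_{q(x)} &= \sum_{j=1}^{n} \eta_{ij} (y_j - y_i) \langle x_j - x_i \rangle_{q(x)}^\top \tag{69} \\
&= \sum_{j=1}^{n} \eta_{ij} \left( y_j \langle x_j \rangle_{q(x)}^\top - y_j \langle x_i \rangle_{q(x)}^\top - y_i \langle x_j \rangle_{q(x)}^\top + y_i \langle x_i \rangle_{q(x)}^\top \right), \tag{70}
\end{align}

where $\langle x_i x_j^\top \rangle_{q(x)} = \Sigma_x^{(ij)} + \langle x_i \rangle_{q(x)} \langle x_j \rangle_{q(x)}^\top$ and $\Sigma_x^{(ij)} = \mathrm{cov}(x_i, x_j)$.

C.2 VM step

We set the parameters $\theta = (\alpha, \beta, \gamma)$ by maximising the free energy w.r.t. $\theta$:

\begin{align}
\hat{\theta} &= \arg\max_{\theta} E_{q(x) q(C)}[\log p(y, C, x \mid G, \theta) - \log q(x, C)], \notag \\
&= \arg\max_{\theta} E_{q(x) q(C)}[\log p(y \mid C, x, G, \theta) + \log p(C \mid G, \theta) + \log p(x \mid G, \theta) - \log q(x) - \log q(C)]. \tag{71}
\end{align}

Update for γ

Recall that the precision matrix in the likelihood term is $V^{-1} = \gamma I_{d_y}$. For updating $\gamma$, it is sufficient to consider the log conditional likelihood integrating out $x$, $C$:

\[ E_{q(x) q(C)}[\log p(y \mid C, x, G, \theta)] = E_{q(x) q(C)}\left[ -\frac{1}{2} \mathrm{Tr}\left( \Gamma^{-1} C^\top V^{-1} C - 2 C^\top V^{-1} H \right) - \frac{1}{2} d - \log Z_y \right], \tag{72} \]

where the exponent term is given by

\begin{align*}
& -\frac{1}{2} E_{q(x) q(C)} \mathrm{Tr}\left( \Gamma^{-1} C^\top V^{-1} C - 2 C^\top V^{-1} H \right) - \frac{1}{2} d \\
&= -\frac{1}{2} E_{q(C)} \mathrm{Tr}\left( \langle \Gamma^{-1} \rangle_{q(x)} C^\top V^{-1} C - 2 C^\top V^{-1} \langle H \rangle_{q(x)} \right) - \frac{1}{2} d \\
&= -\frac{1}{2} E_{q(C)}\left[ c^\top \left( \langle \Gamma^{-1} \rangle_{q(x)} \otimes V^{-1} \right) c - 2 c^\top \mathrm{vec}\left( V^{-1} \langle H \rangle_{q(x)} \right) \right] - \frac{1}{2} d \\
&= -\frac{\gamma d_y}{2} \mathrm{Tr}\left( \langle \Gamma^{-1} \rangle_{q(x)} \Sigma_C \right) - \frac{\gamma}{2} \mathrm{Tr}\left( \langle \Gamma^{-1} \rangle_{q(x)} \mu_C^\top \mu_C \right) + \gamma\, \mathrm{Tr}\left( \mu_C^\top \langle H \rangle_{q(x)} \right) - \frac{1}{2} d,
\end{align*}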

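and the log normaliser is given by

\[ -\log Z_y = -\frac{n d_y}{2} \log(2\pi) + \frac{d_y}{2} \log \left| \epsilon_y I_n + 2 \gamma L \right|. \tag{73} \]

The update for $\gamma$ is given by the maximiser of Eq. (72). Unfortunately, there is no closed-form expression for the update.

Update for β

Recall that the left precision matrix of $p(C \mid G, U)$ is $U^{-1} = \beta I_{d_y}$. We update $\beta$ by maximizing the following term in Eq. (71):

\begin{align*}
& E_{q(x) q(C)}[\log p(C \mid G, \theta) - \log q(C)] \\
&= -\int dc\; \mathcal{N}(c \mid \mu_c, \Sigma_c) \log \frac{\mathcal{N}(c \mid \mu_c, \Sigma_c)}{\mathcal{N}\left(0, ((\epsilon_c I + \Omega^{-1}) \otimes U^{-1})^{-1}\right)}, \\
&= \frac{1}{2} \log \left| \Sigma_c \left( (\epsilon_c I + \Omega^{-1}) \otimes U^{-1} \right) \right| - \frac{1}{2} \mathrm{Tr}\left[ \Sigma_c \left( (\epsilon_c I + \Omega^{-1}) \otimes U^{-1} \right) - I \right] - \frac{1}{2} \mu_c^\top \left( (\epsilon_c I + \Omega^{-1}) \otimes U^{-1} \right) \mu_c, \\
&= \frac{1}{2} \log \left| \left( \Sigma_C (\epsilon_c I + \Omega^{-1}) \right) \otimes \beta I \right| - \frac{1}{2} \mathrm{Tr}\left[ \left( \Sigma_C (\epsilon_c I + \Omega^{-1}) \right) \otimes \beta I - I \right] - \frac{1}{2} \beta\, \mathrm{Tr}\left( (\epsilon_c I + \Omega^{-1}) \mu_C^\top \mu_C \right), \\
&= \frac{d_y}{2} \log \left| \Sigma_C (\epsilon_c I + \Omega^{-1}) \right| + \frac{n d_x d_y}{2} \log \beta - \frac{d_y}{2} \beta\, \mathrm{Tr}\left[ \Sigma_C (\epsilon_c I + \Omega^{-1}) \right] + \frac{1}{2} n d_x d_y - \frac{1}{2} \beta\, \mathrm{Tr}\left( (\epsilon_c I + \Omega^{-1}) \mu_C^\top \mu_C \right).
\end{align*}

Equating the derivative of the previous expression with respect to $\beta$ to 0 yields

\[ \beta^{-1} = \frac{1}{n d_x} \mathrm{Tr}\left[ (\epsilon_c I + \Omega^{-1}) \left( \Sigma_C + \frac{1}{d_y} \mu_C^\top \mu_C \right) \right]. \]

Update for α

We update $\alpha$ by maximizing the following term in Eq. (71):

\begin{align}
& E_{q(x) q(C)}[\log p(x \mid G, \theta) - \log q(x)] \tag{74} \\
&= -\int dx\; \mathcal{N}(x \mid \mu_x, \Sigma_x) \log \frac{\mathcal{N}(x \mid \mu_x, \Sigma_x)}{\mathcal{N}(x \mid 0, \Pi)}, \notag \\
&= \frac{1}{2} \log \left| \Sigma_x \Pi^{-1} \right| - \frac{1}{2} \mathrm{Tr}\left[ \Pi^{-1} \Sigma_x - I \right] - \frac{1}{2} \mu_x^\top \Pi^{-1} \mu_x, \tag{75} \\
&= \frac{1}{2} \log |\Sigma_x| + \frac{1}{2} \log \left| \alpha I + \Omega^{-1} \right| - \frac{\alpha}{2} \mathrm{Tr}[\Sigma_x] - \frac{1}{2} \mathrm{Tr}\left[ \Omega^{-1} \Sigma_x \right] - \frac{\alpha}{2} \mu_x^\top \mu_x - \frac{1}{2} \mu_x^\top \Omega^{-1} \mu_x + \text{const} \tag{76} \\
&:= f_\alpha(\alpha). \notag
\end{align}

The stationarity condition of $\alpha$ is given by

\begin{align}
\frac{\partial}{\partial \alpha} E_{q(x) q(C)}[\log p(x \mid G, \theta) - \log q(x)] &= \frac{1}{2} \mathrm{Tr}\left( (\alpha I + \Omega^{-1})^{-1} \right) - \frac{1}{2} \mathrm{Tr}[\Sigma_x] - \frac{1}{2} \mu_x^\top \mu_x \tag{77} \\
&= 0, \tag{78}
\end{align}

which has no closed-form solution and requires finding the root of the equation. For updating $\alpha$, we compute $\alpha = \arg\max_{\alpha} f_\alpha(\alpha)$, that is,

\[ \alpha = \arg\max_{\alpha}\; \log \left| \alpha I + \Omega^{-1} \right| - \alpha\, \mathrm{Tr}[\Sigma_x] - \alpha\, \mu_x^\top \mu_x. \tag{79} \]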
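The closed-form $\beta$ update can be checked numerically against the $\beta$-dependent part of the objective it was derived from. The sketch below does this on random positive definite stand-ins for $\Sigma_C$ and $\epsilon_c I + \Omega^{-1}$; the names `beta_update`, `beta_objective` and `M` are hypothetical, and the matrices are illustrative rather than produced by an actual VE step.

```python
import numpy as np

def beta_update(Sigma_C, mu_C, M):
    """Closed-form beta; M stands for eps_c * I + Omega^{-1}."""
    dy, ndx = mu_C.shape
    return ndx / np.trace(M @ (Sigma_C + mu_C.T @ mu_C / dy))

def beta_objective(beta, Sigma_C, mu_C, M):
    """Terms of E[log p(C) - log q(C)] that depend on beta."""
    dy, ndx = mu_C.shape
    return (0.5 * ndx * dy * np.log(beta)
            - 0.5 * dy * beta * np.trace(Sigma_C @ M)
            - 0.5 * beta * np.trace(M @ mu_C.T @ mu_C))

def beta_gradient(beta, Sigma_C, mu_C, M):
    """Derivative of beta_objective with respect to beta."""
    dy, ndx = mu_C.shape
    return (0.5 * ndx * dy / beta
            - 0.5 * dy * np.trace(Sigma_C @ M)
            - 0.5 * np.trace(M @ mu_C.T @ mu_C))

# Random symmetric positive definite stand-ins (illustrative only).
rng = np.random.default_rng(1)
ndx, dy = 6, 3
R = rng.normal(size=(ndx, ndx))
Sigma_C = R @ R.T + np.eye(ndx)
S = rng.normal(size=(ndx, ndx))
M = S @ S.T + 0.1 * np.eye(ndx)
mu_C = rng.normal(size=(dy, ndx))

beta_hat = beta_update(Sigma_C, mu_C, M)
print(beta_hat, beta_gradient(beta_hat, Sigma_C, mu_C, M))
```

Because the objective is a positive multiple of $\log \beta$ minus a linear term in $\beta$ with positive slope, it is strictly concave for $\beta > 0$, so the stationary point found this way is the unique maximiser.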

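Assume $\Omega^{-1} = E_\Omega V_\Omega E_\Omega^\top$ by eigen-decomposition and $V_\Omega = \mathrm{diag}(v_{11}, \cdots, v_{n d_x, n d_x})$. The main difficulty in optimizing $\alpha$ comes from the first term, which can be handled by rewriting it as

\begin{align}
\log \left| \alpha I + \Omega^{-1} \right| &\overset{(a)}{=} \log \left| \alpha E_\Omega E_\Omega^\top + E_\Omega V_\Omega E_\Omega^\top \right| \tag{80} \\
&= \log \left| E_\Omega (\alpha I + V_\Omega) E_\Omega^\top \right| \tag{81} \\
&\overset{(b)}{=} \log \left| \alpha I + V_\Omega \right| = \sum_{i=1}^{n d_x} \log(\alpha + v_{ii}) \tag{82} \\
&\overset{(c)}{=} d_x \sum_{i=1}^{n} \log(\alpha + 2 \omega_i), \tag{83}
\end{align}

where at (a) we use the fact that $E_\Omega$ is orthogonal. At (b), we use the facts that the determinant of a product is the product of the determinants and that an orthogonal matrix has determinant $\pm 1$, so that $|E_\Omega|\,|E_\Omega^\top| = 1$. Assume that $L = E_L V_L E_L^\top$ by eigen-decomposition and $V_L = \mathrm{diag}(\{\omega_i\}_{i=1}^{n})$. Recall that $\Omega^{-1} = 2L \otimes I_{d_x}$. By Theorem 1, $v_{ii} = 2\omega_i$ and each $\omega_i$ appears $d_x$ times for $i = 1, \ldots, n$. This explains the $d_x$ factor in (c). The eigen-decomposition of $L$ is needed only once in the beginning. We only need the eigenvalues of $L$, not the eigenvectors.

D Useful results

In this section, we summarize theorems and matrix identities useful for deriving the update equations of LL-LVM. The notation in this section is independent of the rest.

Theorem 1. Let $A \in \mathbb{R}^{n \times n}$ have eigenvalues $\lambda_i$, and let $B \in \mathbb{R}^{m \times m}$ have eigenvalues $\mu_j$. Then the $mn$ eigenvalues of $A \otimes B$ are $\lambda_1 \mu_1, \ldots, \lambda_1 \mu_m, \lambda_2 \mu_1, \ldots, \lambda_2 \mu_m, \ldots, \lambda_n \mu_m$.

Theorem 2. A graph Laplacian $L \in \mathbb{R}^{n \times n}$ is positive semi-definite. That is, its eigenvalues are non-negative.

D.1 Matrix identities

\[ x^\top (A \circ B)\, y = \mathrm{tr}\left( \mathrm{diag}(x)\, A\, \mathrm{diag}(y)\, B^\top \right) \tag{84} \]

From the matrix cookbook [18],

\[ \int \exp\left[ -\frac{1}{2} x^\top A x + c^\top x \right] dx = \sqrt{\det(2\pi A^{-1})}\, \exp\left[ \frac{1}{2} c^\top A^{-\top} c \right]. \tag{85} \]

Lemma 1. If $X = (x_1 \cdots x_n)$ and $C = (c_1 \cdots c_n)$, then

\[ \int \exp\left[ -\frac{1}{2} \mathrm{tr}(X^\top A X) + \mathrm{tr}(C^\top X) \right] dX = \det(2\pi A^{-1})^{n/2} \exp\left[ \frac{1}{2} \mathrm{tr}(C^\top A^{-1} C) \right]. \]

Woodbury matrix identity

\[ (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}. \tag{86} \]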
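Theorem 1 and the log-determinant rewrite used for the $\alpha$ update can be illustrated with a short numerical sketch. It compares the eigenvalue formula with a dense log-determinant on a small path graph, then solves the stationarity condition $\mathrm{Tr}((\alpha I + \Omega^{-1})^{-1}) = \mathrm{Tr}[\Sigma_x] + \mu_x^\top \mu_x$ by bisection, which is one possible root finder and not necessarily the one the authors used. The names are hypothetical; `scale` is the constant multiplying $L$ in $\Omega^{-1}$ (taken as 2 here), and `s` is an arbitrary stand-in for $\mathrm{Tr}[\Sigma_x] + \mu_x^\top \mu_x$.

```python
import numpy as np

def logdet_via_eigs(alpha, omega, dx, scale=2.0):
    """log|alpha*I + Omega^{-1}| for Omega^{-1} = scale * kron(L, I_dx),
    using only the eigenvalues omega of L."""
    return dx * np.sum(np.log(alpha + scale * omega))

def alpha_update(omega, dx, s, scale=2.0, lo=1e-12, hi=1e12, iters=200):
    """Root of dx * sum_i 1/(alpha + scale*omega_i) = s, found by bisection
    on a log scale. The left-hand side is strictly decreasing in alpha."""
    def g(a):
        return dx * np.sum(1.0 / (a + scale * omega)) - s
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Path graph on n nodes (illustrative choice of neighbourhood graph).
n, dx = 4, 2
eta = np.diag(np.ones(n - 1), 1)
eta = eta + eta.T
L = np.diag(eta.sum(axis=1)) - eta
# Eigenvalues are computed once; tiny negative round-off is clipped to zero.
omega = np.clip(np.linalg.eigvalsh(L), 0.0, None)

# Dense reference for the log-determinant.
Omega_inv = 2.0 * np.kron(L, np.eye(dx))
alpha0 = 0.3
dense = np.linalg.slogdet(alpha0 * np.eye(n * dx) + Omega_inv)[1]
fast = logdet_via_eigs(alpha0, omega, dx)

s = 5.0  # stand-in for Tr[Sigma_x] + mu_x^T mu_x
alpha_hat = alpha_update(omega, dx, s)
residual = dx * np.sum(1.0 / (alpha_hat + 2.0 * omega)) - s
print(dense, fast, alpha_hat, residual)
```

Since each term $\log(\alpha + 2\omega_i)$ is concave in $\alpha$, the objective in the $\alpha$ update is concave, and the root of the stationarity condition is its maximiser. The eigenvalue form costs $O(n)$ per evaluation once the eigenvalues of $L$ are available.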

[Figure 7: Climate modelling problem and the projections obtained from five methods. Panels: (a) 609 weather stations (axes: longitude, latitude), (b) LL-LVM, (c) LTSA, (d) GP-LVM, (e) ISOMAP, (f) LLE.]

E Detail of the climate data experiment

In this section, we give more detail regarding the climate data experiment described in the main text. In Fig. 8, we show the precipitation of each month as measured by each of the 609 stations.

[Figure 8: Precipitation of each month at 609 stations.]

The irregularity of the rainfall at each location (the column for that location) allows the measured precipitation to be used to distinguish different locations. For certain graph-based algorithms, including ISOMAP, the result depends strongly on the input neighbourhood graph. ISOMAP, for example, is vulnerable to short-circuit errors if the neighbourhood is too large [16]. This problem is illustrated in Fig. 7, where we set k = 50 for LL-LVM, LTSA, ISOMAP, and LLE.
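To make the dependence on the neighbourhood graph concrete, the sketch below builds a symmetric k-nearest-neighbour adjacency matrix and its graph Laplacian on a toy two-turn spiral, and counts edges that join points far apart along the curve as a rough proxy for short-circuits. This is an illustration only: the spiral, the values of k, the index-gap threshold and all function names are assumptions made for this example and have no connection to the climate data.

```python
import numpy as np

def knn_adjacency(Y, k):
    """Symmetric 0/1 adjacency: eta[i, j] = 1 if j is among the k nearest
    neighbours of i, or i is among the k nearest neighbours of j."""
    sq = np.sum(Y ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T  # squared distances
    np.fill_diagonal(D2, np.inf)                    # exclude self-matches
    idx = np.argsort(D2, axis=1)[:, :k]
    eta = np.zeros(D2.shape)
    rows = np.repeat(np.arange(len(Y)), k)
    eta[rows, idx.ravel()] = 1.0
    return np.maximum(eta, eta.T)

def graph_laplacian(eta):
    """Unnormalised Laplacian L = D - eta, with D the diagonal degree matrix."""
    return np.diag(eta.sum(axis=1)) - eta

def count_long_edges(eta, min_gap):
    """Number of edges whose endpoints differ by more than min_gap in index,
    i.e. in position along the curve for ordered samples."""
    i, j = np.nonzero(np.triu(eta, 1))
    return int(np.sum(j - i > min_gap))

# Two turns of an Archimedean spiral, sampled in order along the curve.
n = 200
t = np.linspace(2 * np.pi, 6 * np.pi, n)
Y = np.column_stack([t * np.cos(t), t * np.sin(t)])

eta_small = knn_adjacency(Y, k=4)
eta_large = knn_adjacency(Y, k=50)
short_small = count_long_edges(eta_small, min_gap=n // 4)
short_large = count_long_edges(eta_large, min_gap=n // 4)
min_eig = np.linalg.eigvalsh(graph_laplacian(eta_large)).min()
print(short_small, short_large, min_eig)
```

With the small neighbourhood every edge stays local along the spiral, whereas the large neighbourhood links adjacent turns. The smallest Laplacian eigenvalue is zero up to round-off, in line with Theorem 2.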

More information

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU,

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU, Eric Xing Eric Xing @ CMU, 2006-2010 1 Machine Learning Data visualization and dimensionality reduction Eric Xing Lecture 7, August 13, 2010 Eric Xing Eric Xing @ CMU, 2006-2010 2 Text document retrieval/labelling

More information

Missing Data in Kernel PCA

Missing Data in Kernel PCA Missing Data in Kernel PCA Guido Sanguinetti, Neil D. Lawrence Department of Computer Science, University of Sheffield 211 Portobello Street, Sheffield S1 4DP, U.K. 19th June 26 Abstract Kernel Principal

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II

More information

Dimensionality Reduction: A Comparative Review

Dimensionality Reduction: A Comparative Review Dimensionality Reduction: A Comparative Review L.J.P. van der Maaten, E.O. Postma, H.J. van den Herik MICC, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands. Abstract In recent

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Raquel Urtasun TTI Chicago August 2, 2013 R. Urtasun (TTIC) Gaussian Processes August 2, 2013 1 / 59 Motivation for Non-Linear Dimensionality Reduction USPS Data Set

More information

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

More information

Clustering with k-means and Gaussian mixture distributions

Clustering with k-means and Gaussian mixture distributions Clustering with k-means and Gaussian mixture distributions Machine Learning and Object Recognition 2017-2018 Jakob Verbeek Clustering Finding a group structure in the data Data in one cluster similar to

More information

Dimension Reduction and Low-dimensional Embedding

Dimension Reduction and Low-dimensional Embedding Dimension Reduction and Low-dimensional Embedding Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/26 Dimension

More information

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability

More information

Gaussian Processes for Big Data. James Hensman

Gaussian Processes for Big Data. James Hensman Gaussian Processes for Big Data James Hensman Overview Motivation Sparse Gaussian Processes Stochastic Variational Inference Examples Overview Motivation Sparse Gaussian Processes Stochastic Variational

More information

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Overview Introduction Linear Methods for Dimensionality Reduction Nonlinear Methods and Manifold

More information

Self-Organization by Optimizing Free-Energy

Self-Organization by Optimizing Free-Energy Self-Organization by Optimizing Free-Energy J.J. Verbeek, N. Vlassis, B.J.A. Kröse University of Amsterdam, Informatics Institute Kruislaan 403, 1098 SJ Amsterdam, The Netherlands Abstract. We present

More information

Week 3: The EM algorithm

Week 3: The EM algorithm Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent

More information