Dimensional reduction of clustered data sets
Guido Sanguinetti

5th February 2007

Abstract

We present a novel probabilistic latent variable model to perform linear dimensional reduction on data sets which contain clusters. The model simultaneously performs the clustering and the dimensional reduction in an unsupervised manner, providing an optimal visualisation of the clustering. We prove that the resulting dimensional reduction yields Fisher's linear discriminant analysis as a limiting case. Testing of the model on both real and artificial data shows marked improvements in the quality of the visualisation and clustering.

1 Introduction

Dimensional reduction techniques form an important chapter in statistical machine learning. Linear methods such as principal component analysis (PCA) and multidimensional scaling are classical statistical tools. While much current research focuses on non-linear dimensional reduction techniques, linear methods are still widely used in a number of applications, from computer vision to bioinformatics. In particular, PCA is still the preferred tool for dimensional reduction in applications such as microarray analysis, where the noisy nature of the data and the lack of knowledge of the underlying processes mean that linear models are generally more trusted.

A major advance in the understanding of dimensional reduction was provided by Tipping and Bishop's probabilistic interpretation of PCA [Tipping and Bishop, 1999b, PPCA]. The starting point for PPCA is a factor-analysis style latent variable model

y = W x + \mu + \epsilon.   (1)

Here y is a D-dimensional vector, W is a tall, thin matrix with D rows and q columns (with q < D), x is a q-dimensional latent variable, \mu is a constant (mean) vector and \epsilon \sim N(0, \sigma^2 I) is an error term. Placing a spherical unit-variance normal prior on the latent variable x, Tipping and Bishop proved that, having observed i.i.d. data y_i, at maximum likelihood the columns of W span the principal subspace of the data. Therefore, the generative model (1) can be said to provide a probabilistic equivalent of PCA.
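As an illustration of this generative view (not taken from the paper), the following sketch samples data from model (1) with arbitrary toy dimensions and parameters, and recovers the principal subspace of the samples with a standard SVD, mirroring the maximum likelihood result.

```python
import numpy as np

# Minimal sketch of the PPCA generative model y = W x + mu + eps (equation 1).
# Dimensions and parameter values are illustrative, not from the paper.
rng = np.random.default_rng(0)
D, q, N = 10, 2, 500                      # observed dim, latent dim, sample size
W = rng.normal(size=(D, q))               # tall, thin loading matrix
mu = rng.normal(size=D)                   # mean vector
sigma = 0.1                               # isotropic noise level

x = rng.normal(size=(N, q))               # spherical unit-variance latent variables
eps = sigma * rng.normal(size=(N, D))     # isotropic Gaussian noise
Y = x @ W.T + mu + eps                    # N observations, one per row

# At maximum likelihood the columns of W span the principal subspace, so PCA on Y
# should recover (approximately) the column space of W.
Yc = Y - Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Yc, full_matrices=False)
principal_subspace = Vt[:q].T             # D x q orthonormal basis from the data

# Cosines of the principal angles between the two subspaces should be close to 1.
W_orth, _ = np.linalg.qr(W)
print(np.linalg.svd(W_orth.T @ principal_subspace, compute_uv=False))
```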
This result opened the way for a number of applications and generalisations: it allowed a principled treatment of missing data and mixtures of PCA models [Tipping and Bishop, 1999a], noise propagation through the model [Sanguinetti et al., 2005] and, through different choices of priors, allowed the linear dimensional reduction framework to be extended to cases where the data cannot be modelled as Gaussian [Girolami and Breitling, 2004].

One major limitation of PCA, which is shared by its extensions, is that it is based on matching the covariance of the data. As such, it is not obvious how to incorporate information on the structure of the data. For example, if the data set has a cluster structure, the dimensional reduction obtained through PCA is not guaranteed to respect this structure. A specific application where this problem is particularly acute is bioinformatics: many microarray studies contain samples from different conditions (e.g. different types of tumour, or control versus diseased), so the data can be expected to be clustered according to conditions [e.g. Golub et al., 1999, Pomeroy et al., 2002]. Dimensional reduction using PCA does not lead to a good visualisation, and further information needs to be included (such as restricting the data set to include only genes that are significantly differentially expressed in the different conditions).

The problem is even more significant if we view dimensional reduction as a feature extraction technique. It is generally assumed that the genes capturing most of the variation in the data are the most important for the physiological process under study [Alter et al., 2000]. However, when the data set contains clusters, the most important features are clearly the ones that discriminate between clusters, which, in general, do not coincide with the directions of greatest variation in the whole data set.

While unsupervised techniques for dimensional reduction of data sets with clusters are relatively understudied, the supervised case leads to linear discriminant analysis (LDA). It was Fisher [1936] who first provided the solution to the problem of identifying the optimal projection in order to separate two classes. Under the assumption that the class conditional probabilities are Gaussian with identical covariance, he proved that the optimal one-dimensional projection to separate the data maximises the Rayleigh coefficient

J(e) = \frac{d^2}{\sigma_1^2 + \sigma_2^2}.   (2)

Here e is the unit vector that defines the projection, d is the projected distance between the (empirical) class centres and \sigma_i^2 is the projected empirical variance of class i. This criterion can be generalised to the general covariance, multi-class case [De la Torre and Kanade, 2005].
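To make the criterion concrete, here is a small illustrative sketch (the class means, covariance and sample sizes are arbitrary toy values, not from the paper) that computes J(e) for a candidate projection and for Fisher's closed-form direction e proportional to S_W^{-1}(\mu_1 - \mu_2), where S_W is the pooled within-class covariance.

```python
import numpy as np

# Illustrative computation of Fisher's criterion (equation 2) for two classes.
# The class means and shared covariance below are arbitrary toy values.
rng = np.random.default_rng(1)
mu1, mu2 = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 2.0]])             # shared class covariance
Y1 = rng.multivariate_normal(mu1, cov, size=200)
Y2 = rng.multivariate_normal(mu2, cov, size=200)

def rayleigh(e, Y1, Y2):
    """J(e) = d^2 / (sigma_1^2 + sigma_2^2) for a unit projection vector e."""
    e = e / np.linalg.norm(e)
    d = (Y1.mean(0) - Y2.mean(0)) @ e                 # projected distance of centres
    s1 = np.var(Y1 @ e)                               # projected class variances
    s2 = np.var(Y2 @ e)
    return d**2 / (s1 + s2)

# Fisher's solution: e proportional to S_W^{-1}(mu_1 - mu_2), with S_W the pooled
# within-class covariance; it maximises J(e).
Sw = np.cov(Y1.T, bias=True) + np.cov(Y2.T, bias=True)
e_fisher = np.linalg.solve(Sw, Y1.mean(0) - Y2.mean(0))
print(rayleigh(e_fisher, Y1, Y2), rayleigh(np.array([1.0, 0.0]), Y1, Y2))
```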
In this note we propose a latent variable model to perform dimensional reduction of clustered data sets in an unsupervised way. The maximum likelihood solution for the model fulfils an optimality criterion which is a generalisation of Fisher's criterion, and reduces to maximising Rayleigh's coefficient under particular limiting conditions.

The rest of the paper is organised as follows: in the next section we describe the latent variable model and provide an expectation-maximisation (EM) algorithm for finding the maximum likelihood solution. We then prove that, under certain assumptions, the likelihood of the model is a monotonic function of Rayleigh's coefficient, so that LDA can be retrieved as a limiting case of our model. Finally, we discuss the advantages of our model, as well as possible generalisations and relationships with existing models.

2 Latent Variable Model

Let our data set be composed of N D-dimensional vectors y_i, i = 1, ..., N. We have prior knowledge that the data set contains K clusters, and we wish to obtain a q-dimensional projection of the data (q < D) which is optimal with respect to the clustered structure of the data. The latent variable model formulation we present is closely related to and inspired by the extreme component analysis (XCA) model of Welling et al. [2003]. We explain the data using two sets of latent variables which are forced to map to orthogonal subspaces in the data space; the model then takes the form

y = E x + \mu + W z.   (3)

In this equation, E is a D x q matrix which defines the projection, so that E^T E = I_q, the identity matrix in q dimensions; \mu is a D-dimensional mean vector; and W is a D x (D - q) matrix whose columns span the orthogonal complement of the subspace spanned by the columns of E, i.e. W^T E = 0. Notice that a key difference from the PPCA model of equation (1) is the absence of a noise term; this is common to our model and the XCA model. Choosing spherical Gaussian prior distributions for the latent variables x and z returns the XCA model of Welling et al. [2003]. Instead, in order to incorporate the clustered structure of the data, we select the following prior distributions

z \sim N(0, I_{D-q}), \qquad x \sim \sum_{k=1}^{K} \pi_k N(m_k, \sigma^2 I_q), \qquad \sum_{k=1}^{K} \pi_k = 1.   (4)

Thus, we have a mixture of Gaussians generative distribution in the directions determined by the columns of E, and a general Gaussian covariance structure in the orthogonal directions. Intuitively, the mixture of Gaussians part will tend to optimise the fit to the clusters, which (given the spherical covariances) will be obtained when the subspace spanned by E interpolates the centres of the clusters in the optimal (least squares) way. Meanwhile, the general Gaussian covariance in the orthogonal directions will seek to maximise the variance of each cluster in the directions orthogonal to E, hence minimising the sum of the projected variances. Therefore, the projection defined by E will seek an optimal compromise between separating the cluster centres and obtaining tight clusters.
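For concreteness, the following sketch generates data from the model (3) with the priors (4); the dimensions, cluster centres, mixing proportions and spreads are illustrative choices, and a non-trivial covariance in the orthogonal complement is induced simply by rescaling the columns of W.

```python
import numpy as np

# Sketch of the generative model y = E x + W z (equations 3-4), centred data (mu = 0).
# Dimensions, cluster centres and mixing proportions are illustrative choices.
rng = np.random.default_rng(2)
D, q, K, N = 5, 2, 3, 600

# Orthonormal E (D x q) and a basis W for its orthogonal complement (D x (D-q)),
# with rescaled columns so the orthogonal covariance is not simply the identity.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
E = Q[:, :q]                                          # E^T E = I_q
W = Q[:, q:] * rng.uniform(0.5, 2.0, size=D - q)      # W^T E = 0

pi = np.array([0.5, 0.3, 0.2])                        # mixing proportions, sum to 1
m = rng.normal(size=(K, q), scale=3.0)
m -= m.mean(axis=0)                                   # enforce sum_k m_k = 0
sigma = 0.5                                           # within-cluster spread

labels = rng.choice(K, size=N, p=pi)
x = m[labels] + sigma * rng.normal(size=(N, q))       # mixture-of-Gaussians latent x
z = rng.normal(size=(N, D - q))                       # spherical Gaussian latent z
Y = x @ E.T + z @ W.T                                 # observed data, one row per point

# Since W^T E = 0 and E^T E = I_q, projecting onto E recovers x exactly.
print(np.abs(Y @ E - x).max())
```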
From the definitions of the model and the priors, it is clear that we can set the mean vector \mu to zero and consider centred data sets without loss of generality. We can therefore assume that the centres of the mixture components in the prior distribution for x sum to zero,

\sum_{k=1}^{K} m_k = 0.

2.1 Likelihood

Given the model (3) and the prior distributions on the latent variables (4), it is straightforward to marginalise the latent variables and obtain a likelihood for the model. Using the orthogonality of E and W, we readily obtain the log-likelihood

L(E, W, m_k, \pi_k, \sigma) = -Nq \log \sigma^2 + \sum_{j=1}^{N} \log\left[ \sum_{k=1}^{K} \pi_k \exp\left( -\frac{1}{2\sigma^2} \left\| E^T y_j - m_k \right\|^2 \right) \right] - N \log |C| - \frac{1}{2} \mathrm{tr}\left( C^{-1} S \right).   (5)

Here S = \frac{1}{N} \sum_{j=1}^{N} y_j y_j^T is the empirical covariance of the (centred) data, C = W W^T, and we have omitted a number of constants. With a slight notational abuse, we denote by C^{-1} = W (W^T W)^{-2} W^T the pseudoinverse of C, and by |C| the product of the non-zero eigenvalues of C.

We notice immediately that, as W is unconstrained (except for being orthogonal to E and of full column rank), at maximum likelihood W W^T will match exactly the covariance of the data in the directions orthogonal to E. Therefore, the last term in equation (5) is just a constant, -(D - q)/2, irrespective of the other parameters, and can be dropped from the likelihood. Also, we can use the orthogonality between E and W to rewrite |C| in terms of E alone. This can be done by observing that, if P and P_\perp represent two mutually orthogonal projections that sum to the identity, then the determinant of a symmetric matrix A can be expressed as |A| = |P A P^T| |P_\perp A P_\perp^T|. But since C matches exactly the covariance in the directions orthogonal to E, we obtain that |C| = |S| / |E^T S E|. Substituting these results into the log-likelihood (5), and omitting terms which do not depend on the parameters, we obtain

L(E, m_k, \pi_k, \sigma) = -Nq \log \sigma^2 + \sum_{j=1}^{N} \log\left[ \sum_{k=1}^{K} \pi_k \exp\left( -\frac{1}{2\sigma^2} \left\| E^T y_j - m_k \right\|^2 \right) \right] + N \log |E^T S E|.   (6)

This expression is simply a mixture of Gaussians likelihood in the projected space spanned by the columns of E, plus a term that depends only on the projection E. This suggests a simple iterative strategy to maximise the likelihood.
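A direct transcription of equation (6) into code may help fix the notation; the function below is a sketch (it assumes centred data with one observation per row and ignores additive constants), not a reference implementation.

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(Y, E, m, pi, sigma):
    """Sketch of the log-likelihood (6), up to additive constants.

    Y: (N, D) centred data; E: (D, q) with E^T E = I_q;
    m: (K, q) latent cluster means; pi: (K,) mixing proportions; sigma: scalar.
    """
    N, D = Y.shape
    q = E.shape[1]
    S = Y.T @ Y / N                                          # empirical covariance
    Yt = Y @ E                                               # projected data, (N, q)
    sq = ((Yt[:, None, :] - m[None, :, :]) ** 2).sum(-1)     # (N, K) squared distances
    mix = logsumexp(np.log(pi) - sq / (2 * sigma**2), axis=1).sum()
    _, logdet = np.linalg.slogdet(E.T @ S @ E)
    # The three terms of equation (6), as written there.
    return -N * q * np.log(sigma**2) + mix + N * logdet
```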
2.2 E-M algorithm

We can optimise the likelihood (6) using an E-M algorithm [Dempster et al., 1977]. This is an iterative procedure which is guaranteed to converge to a (possibly local) maximum of the likelihood. We introduce unobserved binary class-membership vectors c \in \mathbb{R}^K, and denote by \tilde{y}_j = E^T y_j the projected data. Introducing the responsibilities

\gamma_{jk} = \frac{ \pi_k \exp\left( -\frac{1}{2\sigma^2} \left\| E^T y_j - m_k \right\|^2 \right) }{ \sum_{k'=1}^{K} \pi_{k'} \exp\left( -\frac{1}{2\sigma^2} \left\| E^T y_j - m_{k'} \right\|^2 \right) }   (7)

we obtain a lower bound on the log-likelihood of the form

Q(E, m_k, \sigma, \pi_k) = \sum_{j=1}^{N} \sum_{k=1}^{K} \gamma_{jk} \log\left[ \pi_k \, p(\tilde{y}_j \mid c_k, m_k, \sigma) \right] + N \log |E^T S E|   (8)

where the class-conditional probabilities p(\tilde{y}_j \mid c_k, \Theta) are the Gaussian mixture components and we denote collectively by \Theta all the model parameters. Optimising the bound (8) with respect to the parameters leads to fixed point equations. For the latent means these are

m_k = \frac{ \sum_{j=1}^{N} \gamma_{jk} \tilde{y}_j }{ \sum_{j=1}^{N} \gamma_{jk} }.   (9)

This simply tells us that the latent means are the means of the projected data weighted by the responsibilities. Similarly, we obtain a fixed point equation for \sigma^2,

\sigma^2 = \frac{ \sum_{j=1}^{N} \sum_{k=1}^{K} \gamma_{jk} \left\| \tilde{y}_j - m_k \right\|^2 }{ Nq } = \frac{ \sum_{k=1}^{K} \sigma_k^2 }{ Nq }   (10)

where we have introduced the projected variances \sigma_k^2. Again, we can easily obtain an update for the mixing coefficients,

\pi_k = \frac{N_k}{N} = \frac{1}{N} \sum_{j=1}^{N} \gamma_{jk}.   (11)

Substituting equations (9) into equation (10), and equation (10) into equation (8), we obtain, ignoring constants,

Q(E) = \log \frac{ |E^T S E| }{ |E^T S_K E| }   (12)

where

S_K = \frac{1}{N} \sum_{k=1}^{K} \sum_{j=1}^{N} \gamma_{jk} \left( y_j - \frac{ \sum_{j'} \gamma_{j'k} y_{j'} }{ \sum_{j'} \gamma_{j'k} } \right) \left( y_j - \frac{ \sum_{j'} \gamma_{j'k} y_{j'} }{ \sum_{j'} \gamma_{j'k} } \right)^T.

Optimisation of the likelihood with respect to the projection directions E is then a matter of simple linear algebra, and is solved by a generalised eigenvalue problem.
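The following sketch puts the updates (7)-(11) and the E-update implied by (12) into a single schematic EM iteration; it is an illustrative implementation under the assumptions above (centred data, fixed K and q, S_K positive definite), with the generalised eigenvalue problem solved via scipy.linalg.eigh, not the author's reference code.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.special import logsumexp

def em_step(Y, E, m, pi, sigma):
    """One schematic EM iteration for the clustered projection model (a sketch)."""
    N, D = Y.shape
    K, q = m.shape[0], E.shape[1]
    S = Y.T @ Y / N                                        # empirical covariance
    Yt = Y @ E                                             # projected data

    # E-step: responsibilities, equation (7)
    log_r = np.log(pi) - ((Yt[:, None] - m) ** 2).sum(-1) / (2 * sigma**2)
    gamma = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))    # (N, K)

    # M-step: latent means (9), variance (10), mixing proportions (11)
    Nk = gamma.sum(0)
    m_new = (gamma.T @ Yt) / Nk[:, None]
    sigma2 = (gamma * ((Yt[:, None] - m_new) ** 2).sum(-1)).sum() / (N * q)
    pi_new = Nk / N

    # M-step for E: soft within-cluster scatter S_K, then the generalised
    # eigenvalue problem S e = lambda S_K e implied by maximising (12).
    mu_hat = (gamma.T @ Y) / Nk[:, None]                   # cluster means in data space
    diff = Y[:, None, :] - mu_hat[None, :, :]              # (N, K, D)
    S_K = np.einsum('nk,nkd,nke->de', gamma, diff, diff) / N
    vals, vecs = eigh(S, S_K)                              # ascending eigenvalues
    E_new = vecs[:, np.argsort(vals)[-q:]]                 # directions with largest ratio
    E_new, _ = np.linalg.qr(E_new)                         # re-orthonormalise
    return E_new, m_new, pi_new, np.sqrt(sigma2), gamma
```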
Equation (12) is noteworthy. The denominator is a soft-assignment, multidimensional generalisation of the sum of intra-cluster variances which appeared in the denominator of equation (2). Therefore, optimising the likelihood with respect to the projection subspace E will tend to make the projected clusters tighter. Furthermore, the matrix S can be decomposed as S_K + S_T [e.g. Bishop, 2006], where S_T is the generalisation of the inter-cluster variance appearing in the numerator of (2). Therefore, the likelihood bound is a monotonically increasing function of a generalisation of the Rayleigh coefficient, and the model can be viewed as a probabilistic version of LDA. In the next section we make this correspondence precise in the original Fisher's discriminant case, and show under which conditions the two techniques are equivalent.

3 Relationship with linear discriminant analysis

We now make a number of simplifying assumptions: we assume our data set to be composed of two clusters, and we choose to project it to one dimension, so that q = 1 and the projection is defined by a single vector e rather than a matrix E. If the clusters are well separated and balanced, the likelihood of equation (6) can be simplified by neglecting the probability that points in one cluster were generated by the other cluster, obtaining

L = -Nq \log \sigma^2 - \sum_{k=1}^{K} \sum_{j=1}^{N_k} \frac{1}{2\sigma^2} \left[ (y_j - e m_k)^T e \right]^2 + N \log e^T S e,   (13)

where we have replaced \pi_k = 1/2 and N_k is the number of points assigned to cluster k. Using the fact that the data are centred, \frac{1}{N} \sum_{j=1}^{N} y_j = 0, we define

y_j = \mu_i + v_j,   (14)

where \mu_i = \frac{1}{N_i} \sum_{j \in i} y_j is the empirical centre of cluster i. The intra-cluster covariances are then S_i = \frac{1}{N_i} \sum_{j \in i} v_j v_j^T. Since there are two clusters and the data are centred, we also have \mu_1 = -\mu_2 = \mu and m_1 = -m_2 = m. Using equation (14), we obtain that S decomposes as the sum of the intra-cluster covariances S_1 and S_2 plus a term proportional to \mu \mu^T. Defining \sigma_i^2 = e^T S_i e, the projected variance of cluster i, and d = 2\mu^T e, the projected distance separating the cluster centres, it is easy to compute the maximum likelihood estimates for m and \sigma^2, given respectively by m = d/2 and \sigma^2 = \alpha (\sigma_1^2 + \sigma_2^2), with \alpha a constant. Notice that these can also be obtained from equations (9)-(10) by taking the limit where the responsibilities are 0 or 1 (hard assignments).

Therefore, the likelihood (13) can be rewritten as

L = N \log\left[ 1 + c \, J(e) \right] + \mathrm{const},

where c is a positive constant and J(e) is the Rayleigh coefficient of equation (2). Therefore, the maximum likelihood solution is obtained at the maximum of the Rayleigh coefficient, showing that the model reduces to Fisher's discriminant in these conditions.
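As a quick numerical sanity check of this limit (not part of the paper), the sketch below scans one-dimensional projections of a toy two-cluster data set and verifies that the simplified likelihood, evaluated with hard assignments and the maximum likelihood variance, peaks in the same direction as the Rayleigh coefficient. The data, cluster separation and grid of directions are arbitrary choices.

```python
import numpy as np

# Toy check: for two well-separated, balanced clusters in 2-D, the simplified
# likelihood (13) and the Rayleigh coefficient J(e) should peak at the same e.
rng = np.random.default_rng(3)
mu = np.array([4.0, 0.0])
cov = np.array([[1.0, 0.6], [0.6, 3.0]])
Y = np.vstack([rng.multivariate_normal(mu, cov, 300),
               rng.multivariate_normal(-mu, cov, 300)])
Y -= Y.mean(0)
labels = np.repeat([0, 1], 300)
S = Y.T @ Y / len(Y)

def simplified_loglik(e):
    # At the ML sigma^2 the quadratic term of (13) is constant, so up to constants
    # L is proportional to log(e^T S e) - log(sigma^2) with hard assignments.
    e = e / np.linalg.norm(e)
    p = Y @ e
    sigma2 = np.mean([(p[labels == i] - p[labels == i].mean())**2 for i in (0, 1)])
    return len(Y) * (np.log(e @ S @ e) - np.log(sigma2))

def rayleigh(e):
    e = e / np.linalg.norm(e)
    p = Y @ e
    d = p[labels == 0].mean() - p[labels == 1].mean()
    return d**2 / (p[labels == 0].var() + p[labels == 1].var())

thetas = np.linspace(0, np.pi, 181)
L = [simplified_loglik(np.array([np.cos(t), np.sin(t)])) for t in thetas]
J = [rayleigh(np.array([np.cos(t), np.sin(t)])) for t in thetas]
print(thetas[np.argmax(L)], thetas[np.argmax(J)])   # should coincide (approximately)
```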
4 Discussion

In this paper we have addressed the question of selecting an optimal linear projection of a high dimensional data set, given prior knowledge of the presence of cluster structure in the data. The proposed solution is a generative model made up of two orthogonal latent variable models: one is an unconstrained, PPCA-like model [Tipping and Bishop, 1999b], the other is a projection (i.e. the loading vectors have unit length) with a mixture of Gaussians prior on the latent variables. The result of this rather artificial choice of priors is that the PPCA-like part will tend to maximise the variance of the data in the directions orthogonal to the projection, hence forcing the projected clusters to maximise the inter-cluster spread while minimising the intra-cluster variance. Indeed, we prove that the maximum likelihood estimator for the projection plane in our model satisfies an unsupervised, soft-assignment generalisation of the multidimensional criterion for LDA. We also prove that when the projection is one-dimensional and the clusters are well separated, the likelihood of the model reduces to a monotonic function of the Rayleigh coefficient used in Fisher's discriminant analysis.

Our model is related to several previously proposed models. It can be viewed as an adaptation of the XCA model of Welling et al. [2003] to the case where the data set has a clustered structure; it is remarkable, though, that the generalised LDA criterion is obtained without imposing that the variance in the directions orthogonal to the projection be maximal. Bishop and Tipping [1998] dealt with the problem of visualising clustered data sets by proposing a hierarchical model based on a mixture of PPCAs [Tipping and Bishop, 1999a]. While this approach successfully addresses the problem of analysing the detailed structure of each cluster, it does not provide a single optimal projection for visualising all the clusters simultaneously. Perhaps the contribution closest to ours is Girolami [2001]. This paper considered the clustering problem using an independent component analysis (ICA) model with one latent binary variable corrupted by Gaussian noise. The author then proved that the maximum likelihood solution returned an unsupervised analogue of Fisher's discriminant. This approach is somewhat complementary to ours in that it uses discrete rather than continuous latent variables, but the end result is remarkably close. In fact, in the one-dimensional, two-class case the objective function of the ICA model is the same as the likelihood (6). The major difference comes when considering latent spaces of more than one dimension, which is generally the most important case.
Generalising the ICA approach would enforce an independence constraint between the directions of the projection which is not plausible; this is not a problem in our approach. Also, the treatment of cases where the intra-cluster spread differs between clusters is much more straightforward in our model: in Girolami [2001]'s approach one has to introduce a class-dependent noise level, while in our approach it amounts to a straightforward modification of the E-M algorithm.

There are several interesting directions for generalising the present work. One would be to extend it to non-linear dimensional reduction techniques such as Bishop et al. [1998], Lawrence [2005]. While this is in principle possible due to the shared probabilistic nature of these models, in practice it will require considerable further work. Another important direction would be to automate model selection issues (such as the number of clusters or the number of latent dimensions) using Bayesian techniques such as Bayesian PCA and non-parametric Dirichlet process priors [Bishop and Tipping, 1998, Neal, 2000]. While this is feasible, careful consideration should be given to issues of numerical efficiency.

Acknowledgments

The author wishes to thank Mark Girolami for useful discussions and suggestions on ICA models for clustering.

References

O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18), 2000.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, Singapore, 2006.

C. M. Bishop, M. Svensen, and C. K. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1), 1998.

C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualisation. IEEE Transactions PAMI, 20(3), 1998.

F. De la Torre and T. Kanade. Multimodal oriented discriminant analysis. In Proceedings of the International Conference on Machine Learning (ICML), 2005.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-38, 1977.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 1936.

M. Girolami. Latent class and trait models for data classification and visualisation. In Independent Component Analysis: Principles and Practice. Cambridge University Press, Cambridge, 2001.

M. Girolami and R. Breitling. Biologically valid linear factor models of gene expression. Bioinformatics, 20(17), 2004.

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, 1999.

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 2005.

R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, 2000.

S. M. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 2002.

G. Sanguinetti, M. Milo, M. Rattray, and N. D. Lawrence. Accounting for probe-level noise in principal component analysis of microarray data. Bioinformatics, 21(19), 2005.

M. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2), 1999a.

M. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3), 1999b.

M. Welling, F. Agakov, and C. K. Williams. Extreme component analysis. In Advances in Neural Information Processing Systems, 2003.