EBEM: An Entropy-based EM Algorithm for Gaussian Mixture Models
|
|
- Adrian Melton
- 5 years ago
- Views:
Transcription
1 EBEM: An Entropy-based EM Algorithm for Gaussian Mixture Models Antonio Peñalver Benavent, Francisco Escolano Ruiz and Juan M. Sáez Martínez Robot Vision Group Alicante University Alicante, Spain {sco, Abstract In this paper we address the problem of estimating the parameters of a Gaussian mixture model. Although the EM algorithm yields the maximum-likelihood solution it requires a careful initialization of the parameters and the optimal number of kernels in the mixture may be unknown beforehand. We propose a criterion based on the entropy of the pdf (probability density function) associated to each kernel to measure the quality of a given mixture model. A novel method for estimating Shannon entropy based on Entropic Spanning Graphs is developed and a modification of the classical EM algorithm to find the optimal number of kernels in the mixture is presented. We test our algorithm in probability density estimation, pattern recognition and color image segmentation. 1 Introduction Gaussian Mixture models have been widely used for density estimation, pattern recognition and function approximation. One of the most common methods for fitting mixtures to data is the EM algorithm [5]. However, this algorithm is prone to initialization errors and it may converge to local maxima of the log-likelihood function. In addition, the algorithm requires that the number of elements (kernels) in the mixture is known beforehand (model-selection). The so called model-selection problem has been addressed in many ways [13][14][7][6]. In this paper we propose a method that starting with only one kernel, finds the maximum-likelihood solution. In order to do so, it tests whether the underlying pdf of each kernel is Gaussian and otherwise it replaces that kernel with two kernels adequately separated from each other. In order to detect non- Gaussianity we compare the entropy of the underlying pdf with the theoretical entropy of a Gaussian. After the kernel with worse degree of Gaussianity has been splited in two, new EM steps are performed in order to obtain a new maximum-likelihood solution. The paper is organized as follows: In section 2, we review the basic principles of Gaussian-mixture models and usual EM algorithm. In section 3, we introduce Entropic Spanning Graphs and how they can be related with the estimation of Renyi s entropy of order α and how Shannon entropy can be estimated through it. In section 4, we present our maximum-entropy based criterion approach and the modified EM algorithm that exploits it. In section 5, we show some experimental results. Conclusions and future work are presented in Section 6. 2 Gaussian-Mixture Models 2.1 Definition A d-dimensional random variable y follows a finitemixture distribution when its pdf p(y Θ) can be described by a weighted sum of known pdf s named kernels. When all these kernels are Gaussian, the mixture is named in the same way: K p(y Θ) = π i p(y Θ i ) (1) i=1 where 0 π i 1, i = 1,..., K, and K i=1 π i = 1, being K the number of kernels, π 1,..., π k the a priori probabilities of each kernel, and Θ i the parameters describing the kernel. In Gaussian mixtures, Θ i = {µ i, Σ i }, that is, the average vector and the covariance matrix. The set of parameters of a given mixture is Θ {Θ 1,..., Θ k, π 1,..., π k }. Obtaining the optimal set of parameters Θ is usually posed in terms of maximizing the log-likelihood of the pdf to be estimated: l(y Θ) = log p(y Θ) = log N n=1 p(y n Θ) = = n=1 log K k=1 π kp(y k Θ k ). With Θ = arg max Θ l(θ) and Y = {y 1,...y N } is a set of N i.i.d. samples of the variable Y. (2)
2 2.2 EM Algorithm The EM (Expectation-Maximization) algorithm [5] is an iterative procedure that allows us to find maximumlikelihood solutions to problems in which there are hidden variables. In the case of Gaussian mixtures [11], these variables are a set of N labels Z = {z 1,..., z N } associated to the samples. Each label is a binary vector z i = [z (i) 1,..., z(i) k ], with z(i) m = 1 and z p (i) = 0, if p m, indicating that y (i) has been generated by the kernel m. The EM algorithm generates a sequence of estimations of the set of parameters {Θ (t), t = 1, 2,...} by alternating an expectation step and the maximization one until convergence. The Expectation step consists of estimating the expected value of the hidden variables given the visible data Y and the current estimation of the parameters Θ (t). The probability of generating y n with the kernel k is given by p(k y n ) = π kp(y (n) k) Σ K j=1 π jp(y (n) k) In the Maximization step we re-estimate the new parameters Θ (t + 1) given the expected Z: p(k y )y n=1 n n p(k y, n=1 n ) N p(k y )(y n=1 n n µ k)(y n µ k ) T N p(k y, n=1 n ) π k = 1 N n=1 p(k y n ), µ k = Σ k = A detailed description of this classic algorithm is given in [11]. Here we focus on the fact that if K is unknown beforehand it cannot be estimated through maximizing the log-likelihood because l(θ) grows with K. In a classical EM algorithm with a fixed number of kernels density can be underestimated giving a poor description of the data. In the next section we describe the use of entropy to test whether a given kernel describes properly the underlying data. 3 Shannon entropy estimation Several methods for the nonparametric estimation of the differential entropy of a continuous random variable have been widely used (see [1] for a detailed overview). Entropic Spanning Graphs obtained from data to estimate Renyi s α- entropy [9] belong to a type of methods called non plugin. Renyi s α-entropy of a probability density function f is defined as: H α (f) = 1 1 α ln f α (z)dz (5) z for α (0, 1). The α entropy converges to the Shannon entropy f(z) ln f(z)dz as α 1, so it is possible to obtain the second one from the first one. (3) (4) 3.1 Renyi s Entropy and Entropic Spanning Graphs A graph G consists of a set of vertices X n = {x 1,..., x n }, with x n R d and edges {e} that connect vertices in graph: e ij = (x i, x j ). If we denote by M(X n ) the possible sets of edges in the class of acyclic graphs spanning X n (spanning trees), the total edge length functional of the Euclidean power weighted Minimal Spanning Tree is: L MST γ (X n ) = min e γ (6) M(X n) e M(X n) with γ (0, d) y e the euclidean distance between graph vertices. It is intuitive that the length of the MST spanning the more concentrated nonuniform set of points increases at a slower rate than does the MST spanning the uniformly distributed points. This fact, motivate the application of the MST as a way to test for randomness of a set of points[10]. In [8] it was showed that in d-dimensional feature space, with d 2: H α (X n ) = d γ [ ln L ] γ(x n ) n α ln β Lγ,d is an asymptotically unbiased, and almost surely consistent, estimator of the α-entropy of f where α = (d γ)/d and β Lγ,d is a constant bias correction depending on the graph minimization criterion, but independent of f. Closed form expressions are not available for β Lγ,d, only known approximations and bounds: (i) Monte Carlo simulation of uniform random samples on unit cube [0, 1] d ; (ii) Large d approximation: (γ/2) ln(d/(2πe)) in [2]. We can estimate H α (f) for different values of α = (d γ)/d by changing the edge weight exponent γ. As γ modifies the edge weights monotonically, the graph is the same for different values of γ, and only the total length in expression 7 needs to be recomputed. 3.2 Renyi Extrapolation of Shannon Entropy Entropic spanning graphs are suitable for estimating α- entropy with α [0, 1[, so Shannon entropy can not be directly estimated with this method. Figure 1 shows that the shape of function does not depend neither on the nature of data nor on their size. Since it is not possible to directly calculate the value of H α for α = 1, we will approximate this value by means of a continuous function that captures the tendency of H α in the environment of 1. From a value of α [0, 1[, we can calculate the tangent line y = mx + b to H α in this point, using m = H α, x = α and y = H α. In any case, this line (7)
3 Figure 1. H α for gaussian distributions with different covariance matrices. will be continuous and we will be able to calculate its value for x = 1. The point in α = 1 will be different depending on the α used. From now on, we will call α to the α value that generates the correct entropy value in α = 1, following the described procedure. Experimentally, we have verified which this value does not depend on the nature of the probability distribution associated to the data, but it depends on the number of samples used for the estimation and the problem dimension. As H α is a monotonous decreasing function, we can estimate α value in the Gaussian case by means of a dichotomic search between two well separated α values for a constant number of samples and problem dimension and different covariance matrices. Experimentally, we have verified that α is almost constant for diagonal covariance matrices with variance value greater than 0.5. In order to appreciate the effects of the dimension and the number of samples on the problem, we generated a new experiment in which we randomly calculated α for a set of 1000 distributions with 2 d 5. Experimentally we have verified that the shape of the underlying curve adjusts suitably to a function of the type: α a+b expcd = 1 N, where N is the number of samples, D is the problem dimension and a, b, c are three constants to estimate. In order to estimate these values, we used Monte Carlo Simulation, minimizing the mean square error between expression and data. We obtained a = 1.271, b = and c = Figure 2 shows the function form for different values of dimension and number of samples. 4 The EBEM Algorithm Given a set of kernels for the mixture (initially one kernel) we evaluate the real global entropy H(y) and the theoretical maximum entropy H max (y) of the mixture by considering the individual pairs of entropies for each ker- Figure 2. α for dimensions between 2 and 5 and different number of samples. nel, and the prior probabilities: H(Y ) = K k=1 π kh k (Y ) and H max (Y ) = K k=1 π kh maxk (Y ), with H maxk (Y ) = 1 2 log[(2πe)d Σ ]. If the ratio H(y)/H max (y) is above a given threshold we consider that all kernels are well fitted. Otherwise, we select the kernel with the lowest individual ratio and it is replaced by two other kernels that are conveniently placed and initialized. Then, a new EM with K + 1 kernels starts. 4.1 Introducing a New Kernel A low H(y)/H max (y) local ratio indicates that multimodality arises and thus the kernel must be replaced by two other kernels. The difficulty in the split step arises when the original covariance matrix needs to generate two new matrices with two constraints: overall dispersion must remain almost constant and the new matrices must be positive definite. In [12] the split is based on the constraints of preserving the first to moments before and after the operation in one-dimensional settings. This is an ill-posed problem because the number of equations is less than the number of unknowns and the requirement of retaining the positive definiteness of covariance matrices. 4.2 The Split Equations From definition of mixture in equation 1, considering that the K component is the one with lowest Gaussianity threshold, it must be decomposed into the K 1 and K 2 components with parameters Θ k1 = (µ k1, Σ k1 ) and Θ k2 = (µ k2, Σ k2 ). The corresponding priors, the mean vectors and the covariance matrices should satisfy the following split equations: π = π 1 + π 2 π µ = π 1 µ 1 + π 2 µ 2 π (Σ + µ µ T ) = π 1 (Σ 1 + µ 1 µ T 1 ) + π 2 (Σ 2 + µ 2 µ T 2 ) (8)
4 In [15] a RJMCMC based algorithm is presented. To solve the equations showed above, a spectral decomposition of the actual covariance matrix is used and the original problem is replaced by estimating the new eigenvalues and eigenvectors of new covariance matrices from the original one. The main limitation is that all covariance matrices of the mixture are constrained to share the same eigenvector matrix. More recently, in [4] new eigenvector matrices are obtained applying a rotation matrix. Let = V Λ V T be the spectral decomposition of the covariance matrix, with Λ = diag(λj 1,..., λj d ) a diagonal matrix containing the eigenvalues of with increasing order, the component with the lowest entropy ratio, π, π 1, π 2 the priors of both original and new components, µ, µ 1, µ 2 the means and, 1, 2 the covariance matrices. Let also be D a dxd rotation matrix with columns orthonormal unit vectors. D is built by generating its lower triangular matrix independently from d(d 1)/2 different uniform U(0, 1) densities. The proposed split operation is given by: π 1 = u 1 π, π 2 = (1 u 1 )π µ 1 = µ ( d i=1 ui 2 λ i V i ) π2 π 1 µ 2 = µ ( d i=1 ui 2 λ i V i ) π1 π 2 π Λ 1 = diag(u 3 )diag(ι u 2 )diag(ι + u 2 )Λ π 1 π Λ 2 = diag(ι u 3 )diag(ι u 2 )diag(ι + u 2 )Λ π 2 V 1 = DV, V 2 = D T V (9) where, ι is a d x 1 vector of ones, u 1, u 2 = (u 1 2, u 2 2,..., u d 2) T and u 3 = (u 1 3, u 2 3,..., u d 3) T are 2d + 1 random variables needed to construct priors, means and eigenvalues for the new component in the mixture. They are calculated as: u 1 be(2, 2), u 1 2 be(1, 2d), u j 2 U( 1, 1), u 1 3 be(1, d), u j 3 U(0, 1), with j = 2,..., d. and be() a Beta distribution. EBEM ALGORITHM Initialization: Start with a unique kernel. K 1. Θ 1 {µ 1, Σ 1} with random values. repeat: //Main loop repeat: //E, M Steps Estimate log-likelihood in iteration i: l i until: l i l i 1 < CONVERGENCE TH Evaluate H(Y ) and H max(y ) globally if (H(Y )/H max < ENTROPY TH) Select kernel K with the lowest ratio and decompose into K 1 and K 2 Initialize parameters Θ 1 and Θ 2(Eq.9) Initialize new averages: µ 1, µ 2 Initialize new eigenvalues matrices: Λ 1, Λ 2 Initialize new eigenvector matrices: V 1 and V 2 Set new priors: π 1 and π 2 else Final True until: Final = True Figure 4. Entropy Based EM algorithm. 5 Experiments In order to test our approach we have performed several experiments with synthetic and image data. In the first one we have generated 2500 samples from 5 bi-dimensional Gaussians with prior probabilities π k = 0.2 k. Their averages are: µ 1 = [ 1, 1] T, µ 2 = [6, 3] T, µ 3 = [3, 6] T, µ 4 = [2, 2] T, µ 5 = [0, 0] T. We have used a Gaussianity threshold of 0.15, and a convergence threshold of for the EM algorithm. Our algorithm converges after 30 iterations finding correctly k = 5. In Figure 5 we show the evolution of the algorithm. Figure 3. 2-D Example of splitting one kernel. A graphical description of the splitting process in the 2-D case is showed in Fig.3. Directions and magnitudes of variability are defined by eigenvectors and eigenvalues of the covariance matrix. Otherwise, a full algorithmic description of the process is showed in Fig. 4. Figure 5. Evolution of our algorithm.
5 We have also tested our approach in unsupervised segmentation of color images. At each pixel i in the image we compute a 3-dimensional feature vector x i with the components in the RGB color space. As a result of our algorithm we obtain the number of components (classes) M and y i [1, 2,..., M] to indicate from which class came the pixel i th. Therefore our image model sets that each pixel is generated by one of the Gaussian densities in the Gaussian mixture model. We have used different entropy thresholds Figure 6. Color image segmentation results. and a convergence threshold of 0.1 for the EM algorithm. In Fig. 6 we show some color image segmentation results with three different images with sizes of 189x189 pixels. In each row we show the original image and the resulting segmented image with increasing entropy threshold. The greater it is the demanded threshold the higher is the number of kernels generated and therefore the number of colors detected in the image. No postprocess has been applied to the resulting images. Finally, we have applied the proposed method to the well known Iris [3] data set, that contains 3 classes of 50 (4- dimensional) instances referred to a type of iris plant: Versicolor, Virginica and Setosa. We have generated 300 training samples from the averages and covariances of the original classes and we have checked the performance in a classification problem with the original 150 samples. Starting with K = 1, the method correctly selected K = 3. Then, a maximum a posteriori classifier was built, with classification performance of 98%. 6 Conclusions and future Work In this paper we have presented a method for finding the optimal number of kernels in a Gaussian mixture based on maximum entropy. We start the algorithm with only one kernel and then we decide to split it on the basis of the entropy of the underlying pdf. The algorithm converges in few iterations and it is suitable for density estimation, pattern recognition and unsupervised color image segmentation. We are currently exploring new methods of estimating the MST in O(logn) cost and how to select the data features efficiently. References [1] E. Beirlant, E. Dudewicz, L. Gyorfi, and E. V. der Meulen. Nonparametric entropy estimation. International Journal on Mathematical and Statistical Sciences, 6(1):17 39, [2] D. Bertsimas and G. V. Ryzin. An asymptotic determination of the minimum spanning tree and minimum matching constants in geometrical probability. Operations Research Letters, 9(1): , [3] C. Blake and C. Merz. Uci repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, [4] P. Dellaportas and I. Papageorgiou. Multivariate mixtures of normals with unknown number of components. Statistics and Computing, To appear. [5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from incomplete data via the em algorithm. Journal of The Royal Statistical Society, 39(1):1 38, [6] M. Figueiredo and A. Jain. Unsupervised selection and estimation of finite mixture models. In International Conference on Pattern Recognition. ICPR2000, Barcelona, Spain, IEEE. [7] M. Figueiredo, J. Leitao, and A. Jain. On fitting mixture models. Energy Minimization Methods in Computer Vision and Pattern Recognition. Lecture Notes in Computer Science, 1654(1):54 69, [8] A. Hero and O. Michel. Asymptotic theory of greedy aproximations to minnimal k-point random graphs. IEEE Transactions on Information Theory, 45(6): , [9] A. Hero and O. Michel. Applications of entropic graphs. IEEE Signal Processing Magazine, 19(5):85 95, [10] R. Hoffman and A. Jain. A test of randomness based on the minimal spanning tree. Pattern Recognition Letters, 1(1): , [11] R. Redner and H. Walker. Mixture densities, maximum likelihood, and the em algorithm. SIAM Review, 26(2): , [12] S. Richardson and P. Green. On bayesian analysis of mixtures with unknown number of components (with discussion). Journal of the Royal Statistical Society B, (1), [13] N. Vlassis and A. Likas. A kurtosis-based dynamic approach to gaussian mixture modeling. IEEE Trans. Systems, Man, and Cybernetics, 29(4): , [14] N. Vlassis and A. Likas. A greedy em algorithm for gaussian mixture learning. Neural Processing Letters, 15(1):77 87, [15] Z. Zhang, K. Chan, Y. Wu, and C. Chen. Learning a multivariate gaussian mixture models with the reversible jump mcmc algorithm. Statistics and Computing, (1), 2004.
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationClustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation
Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide
More informationK-Means, Expectation Maximization and Segmentation. D.A. Forsyth, CS543
K-Means, Expectation Maximization and Segmentation D.A. Forsyth, CS543 K-Means Choose a fixed number of clusters Choose cluster centers and point-cluster allocations to minimize error can t do this by
More informationAlgorithmisches Lernen/Machine Learning
Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines
More informationCS534 Machine Learning - Spring Final Exam
CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the
More informationEECS490: Digital Image Processing. Lecture #26
Lecture #26 Moments; invariant moments Eigenvector, principal component analysis Boundary coding Image primitives Image representation: trees, graphs Object recognition and classes Minimum distance classifiers
More informationJOURNAL OF PHYSICAL AGENTS, VOL. 4, NO. 2, MAY Large Scale Environment Partitioning in Mobile Robotics Recognition Tasks
JOURNAL OF PHYSICAL AGENTS, VOL. 4, NO., MAY Large Scale Environment Partitioning in Mobile Robotics Recognition Tasks Boyan Bonev and Miguel Cazorla Abstract In this paper we present a scalable machine
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationEstimation of Rényi Information Divergence via Pruned Minimal Spanning Trees 1
Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees Alfred Hero Dept. of EECS, The university of Michigan, Ann Arbor, MI 489-, USA Email: hero@eecs.umich.edu Olivier J.J. Michel
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationGaussian Mixture Distance for Information Retrieval
Gaussian Mixture Distance for Information Retrieval X.Q. Li and I. King fxqli, ingg@cse.cuh.edu.h Department of omputer Science & Engineering The hinese University of Hong Kong Shatin, New Territories,
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationMixture Models and EM
Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering
More informationLecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides
Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationStatistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart
Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart 1 Motivation and Problem In Lecture 1 we briefly saw how histograms
More informationEEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1
EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationMIXTURE MODELS AND EM
Last updated: November 6, 212 MIXTURE MODELS AND EM Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Simon Prince, University College London Sergios Theodoridis,
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationMachine learning for pervasive systems Classification in high-dimensional spaces
Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version
More informationDegenerate Expectation-Maximization Algorithm for Local Dimension Reduction
Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationExperiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition
Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition ABSTRACT It is well known that the expectation-maximization (EM) algorithm, commonly used to estimate hidden
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationInformation Theory in Computer Vision and Pattern Recognition
Francisco Escolano Pablo Suau Boyan Bonev Information Theory in Computer Vision and Pattern Recognition Foreword by Alan Yuille ~ Springer Contents 1 Introduction...............................................
More informationMACHINE LEARNING ADVANCED MACHINE LEARNING
MACHINE LEARNING ADVANCED MACHINE LEARNING Recap of Important Notions on Estimation of Probability Density Functions 2 2 MACHINE LEARNING Overview Definition pdf Definition joint, condition, marginal,
More informationECE 275B Homework #2 Due Thursday MIDTERM is Scheduled for Tuesday, February 21, 2012
Reading ECE 275B Homework #2 Due Thursday 2-16-12 MIDTERM is Scheduled for Tuesday, February 21, 2012 Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11,
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
University of Cambridge Engineering Part IIB Module 4F: Statistical Pattern Processing Handout 2: Multivariate Gaussians.2.5..5 8 6 4 2 2 4 6 8 Mark Gales mjfg@eng.cam.ac.uk Michaelmas 2 2 Engineering
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMulti-Layer Boosting for Pattern Recognition
Multi-Layer Boosting for Pattern Recognition François Fleuret IDIAP Research Institute, Centre du Parc, P.O. Box 592 1920 Martigny, Switzerland fleuret@idiap.ch Abstract We extend the standard boosting
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationMaximum Likelihood Estimation. only training data is available to design a classifier
Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional
More informationMINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava
MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationSupervised locally linear embedding
Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,
More informationSTATS 306B: Unsupervised Learning Spring Lecture 2 April 2
STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised
More informationMachine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.
Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationTwo-View Segmentation of Dynamic Scenes from the Multibody Fundamental Matrix
Two-View Segmentation of Dynamic Scenes from the Multibody Fundamental Matrix René Vidal Stefano Soatto Shankar Sastry Department of EECS, UC Berkeley Department of Computer Sciences, UCLA 30 Cory Hall,
More informationECE 275B Homework #2 Due Thursday 2/12/2015. MIDTERM is Scheduled for Thursday, February 19, 2015
Reading ECE 275B Homework #2 Due Thursday 2/12/2015 MIDTERM is Scheduled for Thursday, February 19, 2015 Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example
More informationMIXTURE OF EXPERTS ARCHITECTURES FOR NEURAL NETWORKS AS A SPECIAL CASE OF CONDITIONAL EXPECTATION FORMULA
MIXTURE OF EXPERTS ARCHITECTURES FOR NEURAL NETWORKS AS A SPECIAL CASE OF CONDITIONAL EXPECTATION FORMULA Jiří Grim Department of Pattern Recognition Institute of Information Theory and Automation Academy
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationPattern Recognition. Parameter Estimation of Probability Density Functions
Pattern Recognition Parameter Estimation of Probability Density Functions Classification Problem (Review) The classification problem is to assign an arbitrary feature vector x F to one of c classes. The
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationDensity Estimation: ML, MAP, Bayesian estimation
Density Estimation: ML, MAP, Bayesian estimation CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Maximum-Likelihood Estimation Maximum
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationMixtures of Gaussians with Sparse Structure
Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationProblem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30
Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationUncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization
Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Haiping Lu 1 K. N. Plataniotis 1 A. N. Venetsanopoulos 1,2 1 Department of Electrical & Computer Engineering,
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationAdaptive Crowdsourcing via EM with Prior
Adaptive Crowdsourcing via EM with Prior Peter Maginnis and Tanmay Gupta May, 205 In this work, we make two primary contributions: derivation of the EM update for the shifted and rescaled beta prior and
More informationCOMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics)
COMS 4771 Lecture 1 1. Course overview 2. Maximum likelihood estimation (review of some statistics) 1 / 24 Administrivia This course Topics http://www.satyenkale.com/coms4771/ 1. Supervised learning Core
More informationFinite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier
Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier Kaizhu Huang, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,
More informationRobust Laplacian Eigenmaps Using Global Information
Manifold Learning and its Applications: Papers from the AAAI Fall Symposium (FS-9-) Robust Laplacian Eigenmaps Using Global Information Shounak Roychowdhury ECE University of Texas at Austin, Austin, TX
More informationHierarchical Clustering of Dynamical Systems based on Eigenvalue Constraints
Proc. 3rd International Conference on Advances in Pattern Recognition (S. Singh et al. (Eds.): ICAPR 2005, LNCS 3686, Springer), pp. 229-238, 2005 Hierarchical Clustering of Dynamical Systems based on
More informationEM in High-Dimensional Spaces
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART B: CYBERNETICS, VOL. 35, NO. 3, JUNE 2005 571 EM in High-Dimensional Spaces Bruce A. Draper, Member, IEEE, Daniel L. Elliott, Jeremy Hayes, and Kyungim
More informationFeature selection and extraction Spectral domain quality estimation Alternatives
Feature selection and extraction Error estimation Maa-57.3210 Data Classification and Modelling in Remote Sensing Markus Törmä markus.torma@tkk.fi Measurements Preprocessing: Remove random and systematic
More informationSpectral Generative Models for Graphs
Spectral Generative Models for Graphs David White and Richard C. Wilson Department of Computer Science University of York Heslington, York, UK wilson@cs.york.ac.uk Abstract Generative models are well known
More informationLecture 3: Pattern Classification
EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures
More informationEfficient Greedy Learning of Gaussian Mixture Models
LETTER Communicated by John Platt Efficient Greedy Learning of Gaussian Mixture Models J. J. Verbeek jverbeek@science.uva.nl N. Vlassis vlassis@science.uva.nl B. Kröse krose@science.uva.nl Informatics
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationMathematical Formulation of Our Example
Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot
More informationReview of Maximum Likelihood Estimators
Libby MacKinnon CSE 527 notes Lecture 7, October 7, 2007 MLE and EM Review of Maximum Likelihood Estimators MLE is one of many approaches to parameter estimation. The likelihood of independent observations
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationLecture 3: Pattern Classification. Pattern classification
EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and
More informationA Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute
More informationLatent Variable Models
Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More informationLarge Scale Environment Partitioning in Mobile Robotics Recognition Tasks
Large Scale Environment in Mobile Robotics Recognition Tasks Boyan Bonev, Miguel Cazorla {boyan,miguel}@dccia.ua.es Robot Vision Group Department of Computer Science and Artificial Intelligence University
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationFinite Singular Multivariate Gaussian Mixture
21/06/2016 Plan 1 Basic definitions Singular Multivariate Normal Distribution 2 3 Plan Singular Multivariate Normal Distribution 1 Basic definitions Singular Multivariate Normal Distribution 2 3 Multivariate
More informationClustering using Mixture Models
Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior
More informationLinear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation
More informationThe Regularized EM Algorithm
The Regularized EM Algorithm Haifeng Li Department of Computer Science University of California Riverside, CA 92521 hli@cs.ucr.edu Keshu Zhang Human Interaction Research Lab Motorola, Inc. Tempe, AZ 85282
More informationReview. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationClassification Methods II: Linear and Quadratic Discrimminant Analysis
Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationLow-level Image Processing
Low-level Image Processing In-Place Covariance Operators for Computer Vision Terry Caelli and Mark Ollila School of Computing, Curtin University of Technology, Perth, Western Australia, Box U 1987, Emaihtmc@cs.mu.oz.au
More informationChapter 14 Combining Models
Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients
More informationMACHINE LEARNING ADVANCED MACHINE LEARNING
MACHINE LEARNING ADVANCED MACHINE LEARNING Recap of Important Notions on Estimation of Probability Density Functions 22 MACHINE LEARNING Discrete Probabilities Consider two variables and y taking discrete
More informationPattern Classification
Pattern Classification Introduction Parametric classifiers Semi-parametric classifiers Dimensionality reduction Significance testing 6345 Automatic Speech Recognition Semi-Parametric Classifiers 1 Semi-Parametric
More informationA Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier
A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe
More informationLecture: Face Recognition and Feature Reduction
Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab Lecture 11-1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed
More informationAn Analytic Distance Metric for Gaussian Mixture Models with Application in Image Retrieval
An Analytic Distance Metric for Gaussian Mixture Models with Application in Image Retrieval G. Sfikas, C. Constantinopoulos *, A. Likas, and N.P. Galatsanos Department of Computer Science, University of
More informationA Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier
A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationDetermining the number of components in mixture models for hierarchical data
Determining the number of components in mixture models for hierarchical data Olga Lukočienė 1 and Jeroen K. Vermunt 2 1 Department of Methodology and Statistics, Tilburg University, P.O. Box 90153, 5000
More information