Clustering on Unobserved Data using Mixture of Gaussians

Clustering on Unobserved Data using Mixture of Gaussians

Lu Ye and Minas E. Spetsakis
Technical Report CS, Oct.
Department of Computer Science
4700 Keele Street, North York, Ontario M3J 1P3, Canada

Clustering on Unobserved Data using Mixture of Gaussians

Lu Ye
Dept. of Mathematics, York University
4700 Keele Str., Toronto, ON M3J 1P3
lye@mathstat.yorku.ca

Minas E. Spetsakis
Dept. of Computer Science, Center for Vision Research, York University
4700 Keele Str., Toronto, ON M3J 1P3
minas@cs.yorku.ca

ABSTRACT

This report provides a review of Clustering using Mixture Models and the Expectation Maximization method [1] and then extends these concepts to the problem of clustering of unobserved data, where we cluster a set of vectors u_i, i = 1,...,N, for which we only know the probability distribution. This problem has several applications in Computer Vision where we want to cluster noisy data.


1. Clustering

Data clustering is an important statistical technique, closely related to unsupervised learning. Clustering is the process of grouping samples into clusters, so that samples with similar properties belong to the same cluster. A cluster is just a set whose members are similar to each other but different from the members of other clusters; for example, samples in a metric space where the distance between two samples of the same cluster is smaller than the distance between samples of different clusters. One of the most popular clustering techniques is based on Expectation Maximization (EM). EM is a method for estimating parameters using partially observed data [2, 3]. Since the clustering problem would be trivial if we already knew the membership of every sample to a particular cluster, we treat this membership information as the unobserved part of the data and apply EM. In this report we develop a clustering method that treats not only the membership of the data as unobserved but the data as well.

Edge grouping is a potential application for clustering. Edges can be considered as sets of many edgels (edge elements), and these edgels can be detected in an image using any of the standard edge detection techniques. If we happen to know that our edgels belong to straight lines and we have an estimate of the slope of the lines, then we can group the edgels into straight line edges using a clustering technique. One of the problems is that some of the information is not very accurate, so we can only assume that we know the probability distribution of the data but not the data itself. We essentially have partially observed data.

2. Mixture Model

The underlying model of EM clustering is the Mixture Model [4]. This model assumes that each sample comes from a cluster ω_j, where j = 1,...,K. Samples are generated from the mixture model as follows. We select one of the clusters ω_j with probability P(ω_j) = π_j and then generate a sample x from a probability distribution p(x | ω_j, θ), or p(x | θ_j), where θ = {θ_j, j = 1,...,K} and θ_j is the vector of parameters associated with the cluster ω_j, which in our case contains the mean μ_j, the covariance matrix C_j and the mixture probability π_j. One should notice that p(x | θ_j) and p(x | ω_j, θ) are the same: the first denotes the probability density of x given the parameters θ_j of the cluster ω_j, and the other denotes the probability density of x given the parameters θ of all the clusters and the fact that we have selected cluster ω_j. The density p(x | θ_j) is called the component density. The probability density function of a sample x from the mixture model is then

p(x | θ) = Σ_{j=1}^{K} p(x | ω_j, θ) π_j.    (2.1)

For the mixture probabilities π_j the following must hold:

Σ_{j=1}^{K} π_j = 1.
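As an illustration of this generative view, the following short sketch (hypothetical Python/NumPy code, not part of the original report; the parameter values are made up) draws samples by first picking a cluster with probability π_j and then drawing from the corresponding Gaussian component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with K = 3 clusters: mixture probabilities pi_j,
# means mu_j and covariances C_j (illustrative values only).
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [4.0, 4.0], [-3.0, 5.0]])
C = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])])

def sample_mixture(n):
    """Draw n samples from p(x|theta) = sum_j pi_j p(x|theta_j)."""
    # Step 1: select a cluster omega_j with probability P(omega_j) = pi_j.
    labels = rng.choice(len(pi), size=n, p=pi)
    # Step 2: generate x from the component density p(x|theta_j).
    x = np.array([rng.multivariate_normal(mu[j], C[j]) for j in labels])
    return x, labels

x, true_labels = sample_mixture(500)
```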

3. Maximum-Likelihood Estimation

The Maximum-Likelihood (ML) method can be used for parameter estimation from a set of samples x_i, i = 1,...,N. We assume that p(x | θ_j) is of known parametric form with parameter vectors θ_j which are unknown. The mixture probabilities π_j are also unknown. The maximum-likelihood estimate of the parameters θ related to a set of data D = {x_i, i = 1,...,N} is obtained by maximizing the log-likelihood of the parameters. The likelihood of the parameters is the probability of the data given the parameters,

L(θ) = p(D | θ) = Π_{i=1}^{N} p(x_i | θ),

assuming the data is independent. By substituting the probability density function of Eq. (2.1), we get

p(D | θ) = Π_{i=1}^{N} Σ_{j=1}^{K} p(x_i | ω_j, θ) π_j.

The reason to take the log-likelihood of the parameters is that p(D | θ) involves a product of terms; to simplify the calculation, we use the log-likelihood to change products into sums. We will use the Gaussian distribution [5] as our parametric form for p(x_i | θ_j). For multidimensional data x it takes the form

p(x | θ_j) = (1/√((2π)^m |C_j|)) e^{-(x-μ_j)^T C_j^{-1} (x-μ_j)/2}    (3.1)

where m is the number of dimensions of the vector x, C_j is the covariance matrix and μ_j is the mean of the cluster ω_j. The mixture probability, the covariance and the mean are bundled in the parameter vector θ_j = (μ_j, C_j, π_j).

In most applications of ML the next step is to maximize the likelihood function (or, more often, the log-likelihood). Usually the result is a simple analytic expression, which along with a host of other nice properties explains the popularity of ML. But in this case we have a product of sums and it is impossible to get a simple expression. The problem would become very simple if we knew the membership of every sample: then we would only have to solve the Gaussian parameter estimation problem for every cluster. Lacking this membership information, we can use Expectation Maximization.
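As a sketch of how Eq. (3.1) and the likelihood above translate into code (reusing the hypothetical arrays pi, mu, C and samples x of the previous sketch; this is an illustration under those assumptions, not the report's implementation):

```python
import numpy as np

def gaussian_pdf(x, mu_j, C_j):
    """Eq. (3.1): p(x|theta_j) evaluated at every row of x."""
    m = x.shape[1]
    diff = x - mu_j
    # Mahalanobis term (x - mu_j)^T C_j^{-1} (x - mu_j), one value per row.
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C_j), diff)
    norm = np.sqrt((2 * np.pi) ** m * np.linalg.det(C_j))
    return np.exp(-0.5 * maha) / norm

def mixture_log_likelihood(x, pi, mu, C):
    """Log-likelihood sum_i log sum_j pi_j p(x_i|theta_j)."""
    dens = np.stack([pi[j] * gaussian_pdf(x, mu[j], C[j])
                     for j in range(len(pi))], axis=1)
    return np.sum(np.log(dens.sum(axis=1)))
```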

4. Expectation Maximization Applied to Clustering

Expectation Maximization (EM) is an iterative algorithm that alternates between two steps: the Expectation step and the Maximization step. In the Expectation step, we first derive the log-likelihood of the unknown parameters as a function of the unobserved data and then we compute the expected value of this likelihood using the probability density of the unobserved data. Since the density of the unobserved data involves the parameters we want to estimate, we use a guess. In the Maximization step, we obtain the parameters that maximize the expected value of the likelihood. These new parameters become the guess to be used in the next iteration. The iteration goes on until convergence or until the termination conditions are satisfied.

The unknown parameter vector θ contains the information appropriate for the mixture model we use. Since we use a Gaussian mixture model, the parameter vector contains three components for each cluster: the mean, the covariance and the mixture probability. These are usually the unknowns of our problem. Once we know the mean, covariance and mixture probability of each cluster, we are able to group the samples. The number of clusters K is assumed known, but we can augment the algorithm with BIC or AIC (see below) and have K as an unknown too.

The straight application of ML to our clustering problem is extremely difficult because it involves logarithms of sums and many complicated functions. Luckily, it can easily be put in a form where we can apply EM. We do this as follows [6]. Let z_i be a vector of length K (the number of clusters) whose j-th element is 1 if the sample x_i was generated by cluster j and zero otherwise. Obviously there is exactly one 1 in the vector and the rest are zero, so

Σ_{j=1}^{K} z_ij = 1    (4.1)

where z_ij is the j-th element of z_i. Since we do not know the membership of every sample, z_i is the unobserved part of the data. Let D_y = {y_i, i = 1,...,N} be the complete data set, where y_i = (x_i, z_i), D_x = {x_i, i = 1,...,N} is the observed data set and D_z = {z_i, i = 1,...,N} is the unobserved data set. In the Expectation step, the expression for the expectation is

Q(θ, θ^t) = E_{D_z}{ ln p(D_y | θ) | D_x, θ^t }.

Although it looks intimidating at first, it can be tamed in a few simple steps. The first step is to derive the form of ln p(D_y | θ), which is a function of D_x, D_z and θ. Once we simplify ln p(D_y | θ), we can get its expected value by applying the standard rules. The subscript D_z in E_{D_z} means that the random variables are the elements of D_z and that the probability of D_z is conditioned on D_x and θ^t, where θ^t is just a guess, usually the result of the t-th (previous) iteration. The result is a function of θ (the unknown mixture parameters), θ^t (the guess for θ) and the observed data D_x, but for simplicity we drop D_x from Q(θ, θ^t). Using the rule of joint probabilities and assuming that the y_i's, the elements of D_y, are independent, we can rewrite p(D_y | θ) as follows:

p(D_y | θ) = Π_{i=1}^{N} p(y_i | θ).

Since y_i = (x_i, z_i), p(y_i | θ) can be written as

p(y_i | θ) = p(x_i, z_i | θ).

Based on the properties of conditional probability, p(x_i, z_i | θ) has the following expression:

p(x_i, z_i | θ) = p(x_i | z_i, θ) p(z_i | θ).

As we mentioned above, z_i is the vector whose j-th element is 1 if the sample x_i was generated by cluster j. Therefore the probability of z_i given the parameter vector θ is equal to the mixture probability π_j, i.e. the probability to select cluster j among the K clusters of the mixture model. The probability of x_i given z_i and θ, p(x_i | z_i, θ), has the alternative form p(x_i | θ_j), where θ_j is the vector of parameters associated with the cluster ω_j. So we have

p(x_i, z_i | θ) = p(x_i | θ_j) π_j.

Thus p(y_i | θ) can be written as

p(y_i | θ) = p(x_i | θ_j) π_j.    (4.2)

This equation only expresses p(y_i | θ) through the parameter vector of the j-th cluster, θ_j. Recalling that for the vector z_i we have z_ij = 1 and z_ik = 0 for k ≠ j, we get the following equivalent expression for p(y_i | θ):

p(y_i | θ) = Π_{k=1}^{K} ( p(x_i | θ_k) π_k )^{z_ik}.

The term ( p(x_i | θ_k) π_k )^{z_ik} is equal to p(x_i | θ_j) π_j when k = j (where z_ij = 1) and it is 1 when k ≠ j, so the product over all clusters is the same as Eq. (4.2). By collecting all the samples y_i in D_y, the probability density function of the set D_y given θ is

p(D_y | θ) = Π_{i=1}^{N} Π_{k=1}^{K} ( p(x_i | θ_k) π_k )^{z_ik}.

We obtain ln p(D_y | θ) by simply taking the logarithm of the above equation,

ln p(D_y | θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} z_ik ln( p(x_i | θ_k) π_k ),

and by applying the logarithm rule once more

ln p(D_y | θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} z_ik ln p(x_i | θ_k) + Σ_{i=1}^{N} Σ_{k=1}^{K} z_ik ln π_k.    (4.3)

Eq. (4.3) is our form for ln p(D_y | θ) and we can now calculate its expected value to conclude the Expectation step. Since ln p(D_y | θ) is the sum of two terms, its expected value is the sum of the expected values of these two terms. We also need to keep in mind that the random variables are the unobserved data z_ik only; the other terms of the equation are treated as constants in this situation, and the expected value of a constant is the constant itself.

Taking the expectation of Eq. (4.3) we get

E_{D_z}{ ln p(D_y | θ) | D_x, θ^t } = Σ_{i=1}^{N} Σ_{k=1}^{K} E_{D_z}{ z_ik | D_x, θ^t } ln p(x_i | θ_k) + Σ_{i=1}^{N} Σ_{k=1}^{K} E_{D_z}{ z_ik | D_x, θ^t } ln π_k.    (4.4)

Let us rename E_{D_z}{ z_ik | D_x, θ^t } as z̄_ik; then we have

E_{D_z}{ ln p(D_y | θ) | D_x, θ^t } = Σ_{i=1}^{N} Σ_{k=1}^{K} z̄_ik ln p(x_i | θ_k) + Σ_{i=1}^{N} Σ_{k=1}^{K} z̄_ik ln π_k.    (4.5)

Based on the definition of the z_i vector, we notice that z_ik takes only two values, zero and one. So

z̄_ik = E_{D_z}{ z_ik | D_x, θ^t } = 0·P(z_ik = 0 | D_x, θ^t) + 1·P(z_ik = 1 | D_x, θ^t) = P(z_ik = 1 | D_x, θ^t)

is the probability of z_ik being 1. This is the same as P(ω_k | x_i, θ^t), the probability that the sample x_i was generated by the k-th cluster among all K clusters. Applying the Bayes rule, we have

z̄_ik = P(ω_k | x_i, θ^t) = p(x_i | θ^t, ω_k) P(ω_k | θ^t) / Σ_{j=1}^{K} p(x_i | θ^t, ω_j) P(ω_j | θ^t) = p(x_i | θ_k^t) π_k^t / Σ_{j=1}^{K} p(x_i | θ_j^t) π_j^t.    (4.6)

Since we use a known parametric form for the probability (Gaussian in this case), we can compute p(x_i | θ_k^t), hence z̄_ik from Eq. (4.6), and then plug the value of z̄_ik into Eq. (4.5) to get the final result of the Expectation step.

In the Maximization step we maximize the above expectation in order to get the optimal solution for the unknown θ, which is then used in the next round as θ^t to estimate a new expectation function and from it a new θ. There is one constraint in our mixture model,

Σ_{j=1}^{K} π_j = 1.    (4.7)

Therefore we maximize the expectation function Q(θ, θ^t) subject to this constraint. We define a new Q'(θ, θ^t) which takes into account the constraint of Eq. (4.7) through a Lagrange multiplier:

Q'(θ, θ^t) = Σ_{i=1}^{N} Σ_{k=1}^{K} z̄_ik ln p(x_i | θ_k) + Σ_{i=1}^{N} Σ_{k=1}^{K} z̄_ik ln π_k + λ( 1 - Σ_{j=1}^{K} π_j ).

Using standard calculus, we find the maximum of a function subject to the constraint by taking derivatives with respect to the unknowns of the function and λ, setting the derivatives to zero, and solving the resulting equations for the unknowns and λ.
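Before moving to the derivatives, here is a minimal sketch of the Expectation step of Eq. (4.6); it assumes the hypothetical gaussian_pdf helper defined in the sketch of Section 3.

```python
import numpy as np

def e_step(x, pi_t, mu_t, C_t):
    """Eq. (4.6): zbar[i, k] = pi_k^t p(x_i|theta_k^t) / sum_j pi_j^t p(x_i|theta_j^t)."""
    K = len(pi_t)
    weighted = np.stack([pi_t[k] * gaussian_pdf(x, mu_t[k], C_t[k])
                         for k in range(K)], axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)
```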

In our case the unknown is θ, the vector containing all the parameters of the mixture model: π_j, μ_j and C_j for j = 1,...,K.

First, we take the partial derivative with respect to π_j. In Q'(θ, θ^t) only the last two terms involve the variable π_j, so the first term gives zero. Since we differentiate with respect to one particular π_j, only the j-th element of the sum over clusters has a non-zero derivative, so we can drop the summation over j = 1,...,K and we have

∂Q'/∂π_j = (1/π_j) Σ_{i=1}^{N} z̄_ij - λ = 0,   λ π_j = Σ_{i=1}^{N} z̄_ij,   π_j = (1/λ) Σ_{i=1}^{N} z̄_ij.    (4.8)

If we sum both sides over j = 1,...,K we can use Eq. (4.7) to eliminate π_j,

Σ_{j=1}^{K} Σ_{i=1}^{N} z̄_ij / λ = 1,

from which we get

λ = Σ_{j=1}^{K} Σ_{i=1}^{N} z̄_ij.

We plug this λ into Eq. (4.8) and get

π_j = Σ_{i=1}^{N} z̄_ij / Σ_{k=1}^{K} Σ_{i=1}^{N} z̄_ik.

We notice that the denominator in the above expression is

Σ_{k=1}^{K} Σ_{i=1}^{N} z̄_ik = Σ_{i=1}^{N} Σ_{k=1}^{K} z̄_ik = Σ_{i=1}^{N} 1 = N

after taking into account (the expectation of) Eq. (4.1). So finally

π_j = (1/N) Σ_{i=1}^{N} z̄_ij.    (4.9)

Second, we take the partial derivative with respect to the unknown μ_j, the mean of the j-th cluster. It appears only in the Gaussian component density of our mixture model, so we only need to be concerned with the first term of the expectation function when taking derivatives.

In the first term of the function, z̄_ij uses the parameters θ^t, which are the guess, so we consider it independent of θ and constant. For the same reason as above, we eliminate one summation over j = 1,...,K by differentiating with respect to the particular μ_j. The definition of the Gaussian density from Eq. (3.1) is

p(x | θ_j) = (1/√((2π)^m |C_j|)) e^{-(x-μ_j)^T C_j^{-1} (x-μ_j)/2}

and its logarithm is

ln (1/√((2π)^m |C_j|)) - (x_i-μ_j)^T C_j^{-1} (x_i-μ_j)/2.

The derivative then is

∂Q'/∂μ_j = ∂/∂μ_j [ Σ_{i=1}^{N} z̄_ij ln (1/√((2π)^m |C_j|)) - Σ_{i=1}^{N} z̄_ij (x_i-μ_j)^T C_j^{-1} (x_i-μ_j)/2 ] = 0.

The derivative of the first term is zero. The derivative of -(x_i-μ_j)^T C_j^{-1} (x_i-μ_j)/2 with respect to μ_j is C_j^{-1}(x_i-μ_j), so the derivative of the second part gives

∂Q'/∂μ_j = Σ_{i=1}^{N} z̄_ij C_j^{-1} (x_i-μ_j) = C_j^{-1} Σ_{i=1}^{N} z̄_ij (x_i-μ_j) = 0

which results in

Σ_{i=1}^{N} z̄_ij x_i = μ_j Σ_{i=1}^{N} z̄_ij

or

μ_j = Σ_{i=1}^{N} z̄_ij x_i / Σ_{i=1}^{N} z̄_ij

and, using Eq. (4.9),

μ_j = (1/(N π_j)) Σ_{i=1}^{N} z̄_ij x_i.

It is time to derive the last unknown of our model, C_j, in the Maximization step. The steps of the derivation are similar to the derivation of μ_j; we only work with the first term of the expectation function. The difference begins with the following step:

∂Q'/∂C_j = ∂/∂C_j [ Σ_{i=1}^{N} z̄_ij ln (1/√((2π)^m |C_j|)) - Σ_{i=1}^{N} z̄_ij (x_i-μ_j)^T C_j^{-1} (x_i-μ_j)/2 ] = 0.

The above differentiation involves the derivative of a determinant and of a quadratic product, which we can get from the Matrix Reference Manual:

∂/∂C_j [ (x_i-μ_j)^T C_j^{-1} (x_i-μ_j) ] = -C_j^{-1} (x_i-μ_j)(x_i-μ_j)^T C_j^{-1}

and, using the chain rule with ∂|C_j|/∂C_j = |C_j| C_j^{-1},

∂/∂C_j [ ln (1/√((2π)^m |C_j|)) ] = -(1/2) ∂ ln|C_j| / ∂C_j = -(1/2) C_j^{-1}

and we get

∂Q'/∂C_j = -(1/2) Σ_{i=1}^{N} z̄_ij [ C_j^{-1} - C_j^{-1} (x_i-μ_j)(x_i-μ_j)^T C_j^{-1} ] = 0.

After that, we multiply by C_j twice (once on each side) to eliminate the inverses and get

Σ_{i=1}^{N} z̄_ij ( -C_j + (x_i-μ_j)(x_i-μ_j)^T ) = 0

which gives

Σ_{i=1}^{N} z̄_ij C_j = Σ_{i=1}^{N} z̄_ij (x_i-μ_j)(x_i-μ_j)^T,   C_j = Σ_{i=1}^{N} z̄_ij (x_i-μ_j)(x_i-μ_j)^T / Σ_{i=1}^{N} z̄_ij.

Using Eq. (4.9), we have the C_j for the next iteration:

C_j = (1/(N π_j)) Σ_{i=1}^{N} z̄_ij (x_i-μ_j)(x_i-μ_j)^T.

In conclusion, the above derivation explains the internal steps of the EM algorithm. Every time we run the EM algorithm we can directly use the obtained equations to calculate the covariances, means and mixture probabilities. K-means clustering is a simple and classical example of applying the EM algorithm: it can be viewed as the problem of estimating the means of a Gaussian Mixture Model under the special assumption that the mixture probabilities π_j are equal and every Gaussian distribution has the same covariance.
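Putting the derived updates together, one full EM iteration for the observed-data case could be sketched as follows (a minimal illustration assuming the e_step sketch above; a practical implementation would add log-domain arithmetic, covariance regularization and a convergence test):

```python
import numpy as np

def m_step(x, zbar):
    """Eq. (4.9) and the mu_j, C_j updates derived above."""
    N, m = x.shape
    Nk = zbar.sum(axis=0)                    # sum_i zbar_ik = N * pi_k
    pi = Nk / N                              # pi_j = (1/N) sum_i zbar_ij
    mu = (zbar.T @ x) / Nk[:, None]          # mu_j = sum_i zbar_ij x_i / sum_i zbar_ij
    C = []
    for k in range(len(pi)):
        diff = x - mu[k]
        # C_j = sum_i zbar_ij (x_i - mu_j)(x_i - mu_j)^T / sum_i zbar_ij
        C.append((zbar[:, k, None] * diff).T @ diff / Nk[k])
    return pi, mu, np.array(C)

def em(x, pi, mu, C, iters=50):
    """Alternate Expectation and Maximization steps for a fixed iteration budget."""
    for _ in range(iters):
        zbar = e_step(x, pi, mu, C)
        pi, mu, C = m_step(x, zbar)
    return pi, mu, C
```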

5. Bayesian Information Criterion Applied to Clustering

The EM clustering algorithm that we outlined above assumes a Mixture of Gaussians as the underlying model. After the particular mixture model is defined, the EM algorithm is applied to find all the parameters except the number of clusters, so we have to provide the number of clusters through some other method. The Bayesian Information Criterion (BIC) is a widely used method for this purpose [7, 8] and can be applied in our case to find the number of clusters. Since we use EM to find the maximum mixture likelihood after each iteration, we can get a reliable approximation of the BIC. The expression for the BIC is

BIC = -2 L_M(x, θ) + m_M log(N)    (5.1)

where L_M(x, θ) is the maximized mixture log-likelihood for the model M, m_M is the number of independent parameters to be estimated in the model and N is the number of samples. The maximized mixture log-likelihood function is

L_M(x, θ) = log Π_{i=1}^{N} p(x_i | θ) = Σ_{i=1}^{N} log Σ_{j=1}^{K} p(x_i | θ_j) π_j.

Recalling the material discussed in the previous sections, p(x_i | θ_j) is the probability of x_i given that it belongs to the j-th cluster, which is a Gaussian distribution, and π_j is the mixture probability of the model M. The smaller the value of the BIC, the stronger the evidence for the model.

The BIC can be used not only to determine the number of clusters by comparing models, but also to help avoid local minima in our clustering problem. Since we use random starting points, usually by selecting random initial values for the parameters θ_j = (μ_j, C_j, π_j), j = 1,...,K, we get distinct clustering results when we run the EM algorithm several times. If it is our lucky day, we get reasonably good initial values for θ_j and desirable results on the first attempt; however, we cannot control our luck. To solve this problem, we apply EM clustering on the same data several times using random starting points and calculate the BIC of each complete run by applying Eq. (5.1). Then we select the run that has the minimum BIC.
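A sketch of Eq. (5.1) for a K-component Gaussian mixture in m dimensions, where m_M counts the K-1 free mixture probabilities, the K·m mean entries and the K·m(m+1)/2 free covariance entries (the helpers are the hypothetical ones from the earlier sketches):

```python
import numpy as np

def bic(x, pi, mu, C):
    """Eq. (5.1): BIC = -2 L_M(x, theta) + m_M log(N); smaller is better."""
    N, m = x.shape
    K = len(pi)
    L = mixture_log_likelihood(x, pi, mu, C)
    m_M = (K - 1) + K * m + K * m * (m + 1) // 2
    return -2.0 * L + m_M * np.log(N)

# Model/restart selection: run EM several times (possibly for several values of K)
# from random starting points and keep the run with the minimum BIC.
```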

6. Expectation Maximization Clustering Applied on Unobserved Data

In the previous sections we discussed the EM method and the clustering problem and applied the EM algorithm to the clustering of samples into clusters. We assumed that the underlying model is a Mixture of Gaussians and that the coordinates of the samples are given. Next we study the problem where the coordinates of the samples are not given directly and we only know the probability distributions of the samples, p(u_i | D_i), where u_i are the samples and D_i is the data on which these probabilities are based. The union of all the D_i is D. We will again use EM to cluster these samples. The general procedure of the EM algorithm remains the same, but we have different assumptions and preconditions.

As before, we have both a set of observed data and a set of unobserved data. We are given the information D, and we have the unobserved samples u_i, i = 1,...,N. As before, ψ_i is the membership vector of each sample, where ψ_ij = 1 if the sample u_i was generated by cluster ω_j and zero otherwise. The complete data for each sample is y_i = (D_i, ψ_i, u_i), so the complete data set is D_y = {y_i, i = 1,...,N}. The unobserved data set is D_z = {z_i, i = 1,...,N} where z_i = (ψ_i, u_i). Since we assume a Mixture of Gaussians, we have to compute the parameter vector θ = (θ_j, j = 1,...,K), where θ_j = (μ_j, C_j, π_j) contains the mean, covariance and mixture probability of each cluster.

6.1. Expectation

The first step in EM is to derive the expression for the expectation. In this problem D is the only given data, D_y is the complete data and D_z is the unobserved data. The general expression for the expectation for this problem is

Q(θ, θ^t) = E_{D_z}{ ln p(D_y | θ) | D, θ^t }

but now D_y and D_z contain different data sets and we need to derive ln p(D_y | θ) under different assumptions. As before, we write p(D_y | θ) as

p(D_y | θ) = Π_{i=1}^{N} p(y_i | θ).    (6.1)

Since y_i = (D_i, ψ_i, u_i), we can write p(y_i | θ) = p(D_i, ψ_i, u_i | θ) and we rewrite p(D_i, ψ_i, u_i | θ) as

p(D_i, ψ_i, u_i | θ) = p(ψ_i, u_i | θ) p(D_i | θ, ψ_i, u_i)

and p(ψ_i, u_i | θ) as

p(ψ_i, u_i | θ) = p(ψ_i | θ) p(u_i | θ, ψ_i),

so

p(y_i | θ) = p(ψ_i | θ) p(u_i | θ, ψ_i) p(D_i | θ, ψ_i, u_i).

We now introduce the quasi-dependence assumption: D (or its components) is not directly related to θ (or its components) but only through the corresponding u. So p(D_i | θ, ψ_i, u_i) = p(D_i | u_i) and p(y_i | θ) becomes

p(y_i | θ) = p(ψ_i | θ) p(u_i | θ, ψ_i) p(D_i | u_i).    (6.2)

The first term in Eq. (6.2) is p(ψ_i | θ), where ψ_i is the membership vector of the i-th sample, with

ψ_ij = 1 if u_i was generated by ω_j, and 0 otherwise,

and we can write

Σ_{j=1}^{K} ψ_ij = 1

and

p(ψ_i | θ) = Π_{k=1}^{K} π_k^{ψ_ik}.    (6.3)

The value of the above expression is π_j if the i-th sample is a member of the j-th cluster, i.e. ψ_ij = 1. We can apply the same technique to p(u_i | θ, ψ_i) and get

p(u_i | θ, ψ_i) = Π_{k=1}^{K} p(u_i | θ_k)^{ψ_ik}.

Then Eq. (6.2) can be rewritten as

p(y_i | θ) = Π_{k=1}^{K} π_k^{ψ_ik} Π_{k=1}^{K} p(u_i | θ_k)^{ψ_ik} p(D_i | u_i).

By taking the log of p(y_i | θ) we get

log p(y_i | θ) = Σ_{k=1}^{K} ψ_ik log π_k + Σ_{k=1}^{K} ψ_ik log p(u_i | θ_k) + log p(D_i | u_i)    (6.4)

and from Eq. (6.1)

log p(D_y | θ) = Σ_{i=1}^{N} log p(y_i | θ).

We substitute Eq. (6.4) into the above equation and get

log p(D_y | θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} ψ_ik log π_k + Σ_{i=1}^{N} Σ_{k=1}^{K} ψ_ik log p(u_i | θ_k) + Σ_{i=1}^{N} log p(D_i | u_i).    (6.5)

The final step in deriving the expectation function is to take the expected value of Eq. (6.5) with respect to the unobserved data D_z, given the data D and a guess θ^t of the clustering parameters. The expectation equation is

E_{D_z}{ log p(D_y | θ) | D, θ^t } = Σ_{i=1}^{N} Σ_{k=1}^{K} E{ψ_ik | θ^t, D} log π_k + Σ_{i=1}^{N} Σ_{k=1}^{K} E{ψ_ik log p(u_i | θ_k) | θ^t, D} + Σ_{i=1}^{N} E{ log p(D_i | u_i) | θ^t, D }    (6.6)

where we assume that all the expected values are taken over D_z. We first derive E{ψ_ik | θ^t, D}, which we name ψ̄_ik:

ψ̄_ik = E{ψ_ik | θ^t, D} = 0·P(ψ_ik = 0 | θ^t, D) + 1·P(ψ_ik = 1 | θ^t, D),

since ψ_ik takes only the values 0 and 1.

So

ψ̄_ik = P(ψ_ik = 1 | θ^t, D).

Since we need explicit dependence on u_i, we write

P(ψ_ik = 1 | θ^t, D) = ∫ p(ψ_ik = 1, u_i | θ^t, D) du_i = ∫ p(u_i | θ^t, D) P(ψ_ik = 1 | θ^t, D, u_i) du_i.

By invoking the quasi-dependence assumption again, we notice that ψ_ik does not depend on D once u_i is given, so we drop D; we also notice that p(u_i | θ^t, D) = p(u_i | θ^t, D_i). Thus

ψ̄_ik = ∫ p(u_i | θ^t, D_i) P(ψ_ik = 1 | θ^t, u_i) du_i = ∫ p(u_i | θ^t, D_i) P(ω_k | θ^t, u_i) du_i
     = ∫ [ p(D_i | θ^t, u_i) p(u_i | θ^t) / p(D_i | θ^t) ] [ p(u_i | θ_k^t) P(ω_k | θ^t) / p(u_i | θ^t) ] du_i
     = ∫ p(D_i | θ^t, u_i) p(u_i | θ_k^t) π_k^t / p(D_i | θ^t) du_i.

In p(D_i | θ^t, u_i), θ^t has no direct relationship with D_i but only through u_i, so we invoke the quasi-dependence assumption and drop θ^t: p(D_i | θ^t, u_i) = p(D_i | u_i), which is the likelihood of u_i given the data D_i and is assumed known. Since π_k^t and p(D_i | θ^t) do not contain the variable u_i, we get

ψ̄_ik = (π_k^t / p(D_i | θ^t)) ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i
     = π_k^t ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i / ∫ p(D_i, u_i | θ^t) du_i
     = π_k^t ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i / ∫ p(D_i | u_i) p(u_i | θ^t) du_i
     = π_k^t ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i / ∫ p(D_i | u_i) Σ_{j=1}^{K} π_j^t p(u_i | θ_j^t) du_i    (6.7)

and, if we exchange the summation and the integral,

ψ̄_ik = π_k^t ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i / Σ_{j=1}^{K} π_j^t ∫ p(D_i | u_i) p(u_i | θ_j^t) du_i.

We define

g_ik = p(D_i | θ_k^t) = ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i

and rewrite the expression for ψ̄_ik as

ψ̄_ik = π_k^t g_ik / Σ_{j=1}^{K} π_j^t g_ij.    (6.8)

We can compute g_ik by evaluating the integral using one of the methods described later, and from it we can compute ψ̄_ik.

The first term of Eq. (6.6) can now be written as

Σ_{i=1}^{N} Σ_{k=1}^{K} E{ψ_ik | θ^t, D} log π_k = Σ_{i=1}^{N} Σ_{k=1}^{K} ψ̄_ik log π_k.

We now simplify the second term of Eq. (6.6):

Σ_{i=1}^{N} Σ_{k=1}^{K} E_{D_z}{ψ_ik log p(u_i | θ_k) | θ^t, D} = Σ_{i=1}^{N} Σ_{k=1}^{K} [ 0 + E_{u_i}{ log p(u_i | θ_k) | D, θ^t, ψ_ik = 1 } P(ψ_ik = 1 | D, θ^t) ].    (6.9)

The second factor, P(ψ_ik = 1 | D, θ^t), is the same as the ψ̄_ik that we have already derived. Using the result of Eq. (6.7) directly, we have

E_{u_i}{ log p(u_i | θ_k) | D, θ^t, ψ_ik = 1 } ψ̄_ik = E_{u_i}{ log p(u_i | θ_k) | D_i, θ_k^t } ψ̄_ik = ψ̄_ik ∫ log p(u_i | θ_k) p(u_i | D_i, θ_k^t) du_i.

All the θ^t parameters are given and only log p(u_i | θ_k) contains μ_k and C_k, which will be estimated in the Maximization step. Since the third term of Eq. (6.6) does not contain any variables that need to be estimated in the Maximization step, it is irrelevant and we can ignore it.
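Given the g_ik integrals (computed by one of the methods of Section 6.3 below), the Expectation step of Eq. (6.8) is a one-liner. The sketch assumes g is an N×K array holding the g_ik values:

```python
import numpy as np

def psi_bar(g, pi_t):
    """Eq. (6.8): psi_ik = pi_k^t g_ik / sum_j pi_j^t g_ij."""
    weighted = g * pi_t[None, :]
    return weighted / weighted.sum(axis=1, keepdims=True)
```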

6.2. Maximization

In the Maximization step we use the same method as before to get the optimal solution for the parameters θ_j = (μ_j, C_j, π_j), again using the constraint Σ_j π_j = 1, and we use the symbols defined in the Expectation step to abbreviate some of the complex terms. Using a Lagrange multiplier, we get the function to maximize,

Q'(θ, θ^t) = Σ_{i=1}^{N} Σ_{k=1}^{K} ψ̄_ik log π_k + Σ_{i=1}^{N} Σ_{k=1}^{K} ψ̄_ik ∫ log p(u_i | θ_k) p(u_i | D_i, θ_k^t) du_i + λ( 1 - Σ_{k=1}^{K} π_k ).    (6.10)

We take partial derivatives of Q'(θ, θ^t) with respect to the components of θ_j = (μ_j, C_j, π_j) and λ. First, we take the partial derivative with respect to π_j. Only the first term and the Lagrange term in Eq. (6.10) contain the variable π_j, so the other terms vanish. Since the integral operator and the partial derivative are both linear operators, we can interchange them:

∂Q'/∂π_j = (1/π_j) Σ_{i=1}^{N} ψ̄_ij - λ = 0

and, after solving for π_j,

λ π_j = Σ_{i=1}^{N} ψ̄_ij,   π_j = (1/λ) Σ_{i=1}^{N} ψ̄_ij.    (6.11)

Using the constraint on the mixture probabilities (by taking the derivative with respect to λ) we get

Σ_{k=1}^{K} π_k = 1,   1 = (1/λ) Σ_{k=1}^{K} Σ_{i=1}^{N} ψ̄_ik,   λ = Σ_{k=1}^{K} Σ_{i=1}^{N} ψ̄_ik = N.

We plug this expression for λ into Eq. (6.11) and get

π_j = (1/N) Σ_{i=1}^{N} ψ̄_ij.    (6.12)

Next we take the partial derivative with respect to μ_j and set it to zero to get the optimal value of μ_j, which appears only in the log-Gaussian log p(u_i | θ_j); the other terms vanish.

∂Q'/∂μ_j = ∂/∂μ_j [ Σ_{i=1}^{N} ψ̄_ij ∫ log p(u_i | θ_j) p(u_i | D_i, θ_j^t) du_i ] = 0.    (6.13)

The distribution p(u_i | θ_j) is a Gaussian with probability density function

p(u_i | θ_j) = (1/√((2π)^m |C_j|)) e^{-(u_i-μ_j)^T C_j^{-1} (u_i-μ_j)/2}

and

log p(u_i | θ_j) = log (1/√((2π)^m |C_j|)) - (u_i-μ_j)^T C_j^{-1} (u_i-μ_j)/2.

We plug the expression for log p(u_i | θ_j) into Eq. (6.13) and get

∂Q'/∂μ_j = ∂/∂μ_j Σ_{i=1}^{N} ψ̄_ij ∫ [ log (1/√((2π)^m |C_j|)) - (u_i-μ_j)^T C_j^{-1} (u_i-μ_j)/2 ] p(u_i | D_i, θ_j^t) du_i = 0.

After changing the order of the summation, the integral and the derivative, and ignoring the term that vanishes, we get

∂Q'/∂μ_j = -Σ_{i=1}^{N} ψ̄_ij ∫ ∂/∂μ_j [ (u_i-μ_j)^T C_j^{-1} (u_i-μ_j)/2 ] p(u_i | D_i, θ_j^t) du_i = 0.

As in the previous problem, the derivative of -(u_i-μ_j)^T C_j^{-1} (u_i-μ_j)/2 with respect to μ_j is C_j^{-1}(u_i-μ_j). Thus we get

∂Q'/∂μ_j = Σ_{i=1}^{N} ψ̄_ij ∫ C_j^{-1}(u_i-μ_j) p(u_i | D_i, θ_j^t) du_i = Σ_{i=1}^{N} ψ̄_ij [ ∫ C_j^{-1} u_i p(u_i | D_i, θ_j^t) du_i - ∫ C_j^{-1} μ_j p(u_i | D_i, θ_j^t) du_i ] = 0.

Since μ_j can be moved out of the integral, since C_j^{-1} is not singular and can be eliminated, and since ∫ p(u_i | D_i, θ_j^t) du_i = 1, we get

Σ_{i=1}^{N} ψ̄_ij ∫ u_i p(u_i | D_i, θ_j^t) du_i = μ_j Σ_{i=1}^{N} ψ̄_ij

and

μ_j = Σ_{i=1}^{N} ψ̄_ij ∫ u_i p(u_i | D_i, θ_j^t) du_i / Σ_{i=1}^{N} ψ̄_ij.

Furthermore, we can write

p(u_i | D_i, θ_j^t) = p(D_i | u_i, θ_j^t) p(u_i | θ_j^t) / p(D_i | θ_j^t)

and from the quasi-dependence assumption we have p(D_i | u_i, θ_j^t) = p(D_i | u_i). Thus we rewrite the above expression as

μ_j = [ Σ_{i=1}^{N} (ψ̄_ij / p(D_i | θ_j^t)) ∫ p(D_i | u_i) p(u_i | θ_j^t) u_i du_i ] / Σ_{i=1}^{N} ψ̄_ij.

Recall that we denoted p(D_i | θ_j^t) by g_ij; similarly we denote ḡ_ij = ∫ p(D_i | u_i) p(u_i | θ_j^t) u_i du_i. Then we get

μ_j = Σ_{i=1}^{N} ψ̄_ij (ḡ_ij / g_ij) / Σ_{i=1}^{N} ψ̄_ij.

Finally, we need to derive the last unknown parameter of our model, C_j. As before, only log p(u_i | θ_j) contains C_j. We start the derivation from

∂Q'/∂C_j = ∂/∂C_j [ Σ_{i=1}^{N} ψ̄_ij ∫ log p(u_i | θ_j) p(u_i | D_i, θ_j^t) du_i ] = 0

which is

∂Q'/∂C_j = Σ_{i=1}^{N} ψ̄_ij ∫ ∂/∂C_j [ log (1/√((2π)^m |C_j|)) ] p(u_i | D_i, θ_j^t) du_i - (1/2) Σ_{i=1}^{N} ψ̄_ij ∫ ∂/∂C_j [ (u_i-μ_j)^T C_j^{-1} (u_i-μ_j) ] p(u_i | D_i, θ_j^t) du_i = 0.

Since the derivative of a determinant is

∂|C_j| / ∂C_j = |C_j| C_j^{-1},

the derivative of the logarithm of the determinant is

∂ log|C_j| / ∂C_j = C_j^{-1},

and the differentiation of the quadratic product gives

∂/∂C_j [ (u_i-μ_j)^T C_j^{-1} (u_i-μ_j) ] = -C_j^{-1} (u_i-μ_j)(u_i-μ_j)^T C_j^{-1}.

We combine the above and get

∂Q'/∂C_j = -(1/2) Σ_{i=1}^{N} ψ̄_ij ∫ [ C_j^{-1} - C_j^{-1} (u_i-μ_j)(u_i-μ_j)^T C_j^{-1} ] p(u_i | D_i, θ_j^t) du_i = 0.

After left and right multiplying both sides by C_j and omitting non-zero constants, we get

Σ_{i=1}^{N} ψ̄_ij [ C_j ∫ p(u_i | D_i, θ_j^t) du_i - ∫ (u_i-μ_j)(u_i-μ_j)^T p(u_i | D_i, θ_j^t) du_i ] = 0

and, since the first integral is equal to one,

C_j = Σ_{i=1}^{N} ψ̄_ij ∫ (u_i-μ_j)(u_i-μ_j)^T p(u_i | D_i, θ_j^t) du_i / Σ_{i=1}^{N} ψ̄_ij.

As before, we write p(u_i | D_i, θ_j^t) = p(D_i | u_i) p(u_i | θ_j^t) / p(D_i | θ_j^t) and denote

g̿_ij = ∫ p(D_i | u_i) p(u_i | θ_j^t) (u_i-μ_j)(u_i-μ_j)^T du_i.

Then we have

C_j = Σ_{i=1}^{N} ψ̄_ij (g̿_ij / g_ij) / Σ_{i=1}^{N} ψ̄_ij.

6.3. Computing g_ik, ḡ_ik and g̿_ik

All the computations described above are simple summations and averages of the g_ik's, ḡ_ik's and g̿_ik's. The only non-trivial computation involves g_ik, ḡ_ik and g̿_ik themselves, because they contain integrations. The actual method used to compute them depends on the form in which p(D_i | u_i) is given.

If we are given samples on a regular grid (which can be efficient for low-dimensional u_i), we can compute p(u_i | θ_k^t) on the same grid and the integral becomes a discrete summation. Let u_i^m, m = 1,...,M, be the discrete samples of u_i. We can then write

p(D_i | u_i) = Σ_{m=1}^{M} p(D_i | u_i^m) Sa(u_i^m - u_i)

where Sa is the sampling (interpolation) function, so the integral

g_ik = ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i = Σ_{m=1}^{M} p(D_i | u_i^m) ∫ Sa(u_i^m - u_i) p(u_i | θ_k^t) du_i

is transformed into a sum of several definite integrals that can be computed analytically. If we approximate the sampling function by a zero-mean Gaussian with covariance C_s, the integral becomes much simpler:

g_ik = Σ_{m=1}^{M} p(D_i | u_i^m) ∫ N(u_i^m - u_i; 0, C_s) N(u_i; μ_k^t, C_k^t) du_i = Σ_{m=1}^{M} p(D_i | u_i^m) N(u_i^m; μ_k^t, C_k^t + C_s)

where C_s is a matrix chosen so that the Gaussian approximates the sampling function; a diagonal matrix with every diagonal element equal to 0.4 is adequate for our purposes if the samples are on integer values of u_i. A similar procedure can be followed to compute the other g's:

ḡ_ik = Σ_{m=1}^{M} p(D_i | u_i^m) u_i^m N(u_i^m; μ_k^t, C_k^t + C_s)

g̿_ik = Σ_{m=1}^{M} p(D_i | u_i^m) (u_i^m - μ_k)(u_i^m - μ_k)^T N(u_i^m; μ_k^t, C_k^t + C_s).
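For the grid recipe above, a minimal sketch for a single sample i and cluster k might look as follows (it reuses the hypothetical gaussian_pdf helper; u_grid holds the grid points u_i^m as rows and p_D the tabulated values p(D_i | u_i^m)):

```python
import numpy as np

def grid_g(u_grid, p_D, mu_k, C_k, C_s):
    """g_ik, gbar_ik and gdblbar_ik as discrete sums over the grid points."""
    # w_m = p(D_i|u_i^m) N(u_i^m; mu_k^t, C_k^t + C_s)
    w = p_D * gaussian_pdf(u_grid, mu_k, C_k + C_s)
    g = w.sum()
    g_bar = w @ u_grid                       # sum_m w_m u_i^m
    d = u_grid - mu_k                        # centered on the cluster mean mu_k
    g_dblbar = (w[:, None] * d).T @ d        # sum_m w_m (u_i^m - mu_k)(u_i^m - mu_k)^T
    return g, g_bar, g_dblbar

# Gaussian approximation of the sampling function for an integer-spaced grid,
# as suggested in the text (diagonal entries of about 0.4).
C_s = 0.4 * np.eye(2)
```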

It is obvious that the above procedure is not suitable for high-dimensional u_i. An alternative is that p(D_i | u_i) is given in a standard analytic form which allows the analytic computation of the g's. There are many good candidate standard forms, but one of the most adaptable is a Weighted Sum of Gaussians (similar to a Mixture of Gaussians but without the requirement that the mixture probabilities sum to one). Let

p(D_i | u_i) = Σ_{m=1}^{M_i} q_im N(u_i; μ_im, C_im)

where q_im is the mixture weight of Gaussian m for data point u_i, and μ_im and C_im are the m-th mean and covariance for point u_i. As before, p(u_i | θ_k^t) is a Gaussian, which we denote N(u_i; μ_k^t, C_k^t). So we can write

g_ik = ∫ p(D_i | u_i) p(u_i | θ_k^t) du_i = ∫ Σ_{m=1}^{M_i} q_im N(u_i; μ_im, C_im) N(u_i; μ_k^t, C_k^t) du_i = Σ_{m=1}^{M_i} q_im ∫ N(u_i; μ_im, C_im) N(u_i; μ_k^t, C_k^t) du_i.

The product of two Gaussians is a simple function of the two sets of parameters C_im, C_k^t, μ_im and μ_k^t:

N(u_i; μ_im, C_im) N(u_i; μ_k^t, C_k^t) = N(u_i; μ_imk, C_imk) f_imk

where

C_imk = ( C_im^{-1} + (C_k^t)^{-1} )^{-1}
μ_imk = C_imk ( C_im^{-1} μ_im + (C_k^t)^{-1} μ_k^t )
f_imk = N(0; μ_im, C_im) N(0; μ_k^t, C_k^t) / N(0; μ_imk, C_imk)

and, since the integral of a Gaussian is unity,

g_ik = Σ_{m=1}^{M_i} q_im f_imk.
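The product-of-Gaussians identity is easy to evaluate numerically. The sketch below (a hypothetical helper, reusing gaussian_pdf from the earlier sketch) computes C_imk, μ_imk and f_imk for one pair of Gaussians; g_ik is then just the weighted sum Σ_m q_im f_imk.

```python
import numpy as np

def gaussian_product(mu_a, C_a, mu_b, C_b):
    """N(u; mu_a, C_a) N(u; mu_b, C_b) = f * N(u; mu_c, C_c)."""
    C_a_inv, C_b_inv = np.linalg.inv(C_a), np.linalg.inv(C_b)
    C_c = np.linalg.inv(C_a_inv + C_b_inv)
    mu_c = C_c @ (C_a_inv @ mu_a + C_b_inv @ mu_b)
    zero = np.zeros((1, len(mu_a)))
    # f evaluated through the identity at u = 0, as in the text;
    # equivalently f = N(mu_a; mu_b, C_a + C_b).
    f = (gaussian_pdf(zero, mu_a, C_a)[0] * gaussian_pdf(zero, mu_b, C_b)[0]
         / gaussian_pdf(zero, mu_c, C_c)[0])
    return mu_c, C_c, f
```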

In a very similar fashion we can derive the expressions for ḡ_ik and g̿_ik:

ḡ_ik = ∫ p(D_i | u_i) p(u_i | θ_k^t) u_i du_i = Σ_{m=1}^{M_i} q_im ∫ N(u_i; μ_im, C_im) N(u_i; μ_k^t, C_k^t) u_i du_i = Σ_{m=1}^{M_i} q_im f_imk ∫ N(u_i; μ_imk, C_imk) u_i du_i = Σ_{m=1}^{M_i} q_im f_imk μ_imk

g̿_ik = Σ_{m=1}^{M_i} q_im f_imk C_imk.

6.4. Summary of the Algorithm

While the derivation is rather complicated, the steps one needs to follow to execute one iteration of the algorithm are straightforward. We will follow the established practice and name the two distinct groups of steps Expectation and Maximization respectively, although this association is by no means direct. The Expectation step is

ψ̄_ik = π_k^t g_ik / Σ_{j=1}^{K} π_j^t g_ij

and the Maximization step is

π_j = (1/N) Σ_{i=1}^{N} ψ̄_ij
μ_j = Σ_{i=1}^{N} ψ̄_ij (ḡ_ij / g_ij) / Σ_{i=1}^{N} ψ̄_ij
C_j = Σ_{i=1}^{N} ψ̄_ij (g̿_ij / g_ij) / Σ_{i=1}^{N} ψ̄_ij.
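Assuming the g_ik, ḡ_ik and g̿_ik integrals have been evaluated by one of the two recipes of Section 6.3 (stored as arrays g of shape N×K, g_bar of shape N×K×m and g_dblbar of shape N×K×m×m), one iteration of the summarized algorithm could be sketched as below; this is an illustration of the update formulas above, not the report's own code. Since the second-moment integrals g̿_ij are formed around the cluster means, in practice they are re-evaluated after the μ_j update before computing C_j.

```python
import numpy as np

def em_unobserved_step(g, g_bar, pi_t):
    """Expectation step of Eq. (6.8) plus the pi_j and mu_j updates of Section 6.4."""
    weighted = g * pi_t[None, :]
    psi = weighted / weighted.sum(axis=1, keepdims=True)   # psi_ik
    Nk = psi.sum(axis=0)
    pi = Nk / g.shape[0]                                   # pi_j = (1/N) sum_i psi_ij
    # mu_j = sum_i psi_ij (gbar_ij / g_ij) / sum_i psi_ij
    mu = np.einsum('ik,ikm->km', psi / g, g_bar) / Nk[:, None]
    return pi, mu, psi

def covariance_update(g, g_dblbar, psi):
    """C_j = sum_i psi_ij (gdblbar_ij / g_ij) / sum_i psi_ij, with gdblbar
    evaluated around the updated means mu_j."""
    Nk = psi.sum(axis=0)
    return np.einsum('ik,ikmn->kmn', psi / g, g_dblbar) / Nk[:, None, None]
```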

After we compute the parameters of all the Gaussians, we can compute the g_ik's to be used in the next iteration. If p(D_i | u_i) is given in discrete form as a set of M samples on a regular grid in the space spanned by u_i, then

g_ik = Σ_{m=1}^{M} p(D_i | u_i^m) N(u_i^m; μ_k^t, C_k^t + C_s)
ḡ_ik = Σ_{m=1}^{M} p(D_i | u_i^m) u_i^m N(u_i^m; μ_k^t, C_k^t + C_s)
g̿_ik = Σ_{m=1}^{M} p(D_i | u_i^m) (u_i^m - μ_k)(u_i^m - μ_k)^T N(u_i^m; μ_k^t, C_k^t + C_s).

If, on the other hand, p(D_i | u_i) is given as a weighted sum of Gaussians, then

C_imk = ( C_im^{-1} + (C_k^t)^{-1} )^{-1}
μ_imk = C_imk ( C_im^{-1} μ_im + (C_k^t)^{-1} μ_k^t )
f_imk = N(0; μ_im, C_im) N(0; μ_k^t, C_k^t) / N(0; μ_imk, C_imk)
g_ik = Σ_{m=1}^{M_i} q_im f_imk
ḡ_ik = Σ_{m=1}^{M_i} q_im f_imk μ_imk
g̿_ik = Σ_{m=1}^{M_i} q_im f_imk C_imk.

7. Conclusion

In this report we briefly reviewed the concepts of Clustering, Expectation Maximization and Mixture of Gaussians and then developed the background for clustering of unobserved data under the Mixture of Gaussians model. We used Maximum Likelihood estimation, and since this problem involves hidden variables (the unobserved data and the membership) we used the Expectation Maximization (EM) method. Our fundamental assumption is that the data is not directly observed but that its probability distribution is known: each datum u_i is conditioned on a set of known parameters D_i, and u_i is independent of D_i' for i' ≠ i. We also assume that the D_i's do not depend directly on the clustering parameters, which gave rise to the quasi-dependence assumption used throughout the report. Armed with these assumptions, we developed the equations for the iterative estimation of the Gaussian clusters. If the probability distribution of the data is assumed to be of a parametric form, the formulas can be simplified further; we plan to do this in future research.

References

1. Frank Dellaert, The Expectation Maximization Algorithm, dellaert/html/publications.htm (Feb. 2002).
2. Dempster, A. P., Laird, N. M., and Rubin, D. B., Maximum likelihood from incomplete data via the EM Algorithm, Journal of the Royal Statistical Society B 39 (1977).
3. Geoffrey J. McLachlan and Thriyambakam Krishnan, The EM Algorithm and Extensions, Wiley-Interscience (1997).
4. Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, Wiley-Interscience (2001).
5. Athanasios Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill (1984).
6. George Bebis, Mathematical Methods Computer Vision, bebis/mathmethods/em/lecture.pdf.
7. C. Fraley and A. E. Raftery, How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis, Tech. Report.
8. Gideon Schwarz, Estimating the Dimension of a Model, The Annals of Statistics 6(2) (March 1978).

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart

Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart 1 Motivation and Problem In Lecture 1 we briefly saw how histograms

More information

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

More information

Estimating the parameters of hidden binomial trials by the EM algorithm

Estimating the parameters of hidden binomial trials by the EM algorithm Hacettepe Journal of Mathematics and Statistics Volume 43 (5) (2014), 885 890 Estimating the parameters of hidden binomial trials by the EM algorithm Degang Zhu Received 02 : 09 : 2013 : Accepted 02 :

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

Note Set 5: Hidden Markov Models

Note Set 5: Hidden Markov Models Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional

More information

Estimating Gaussian Mixture Densities with EM A Tutorial

Estimating Gaussian Mixture Densities with EM A Tutorial Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables

More information

1 Expectation Maximization

1 Expectation Maximization Introduction Expectation-Maximization Bibliographical notes 1 Expectation Maximization Daniel Khashabi 1 khashab2@illinois.edu 1.1 Introduction Consider the problem of parameter learning by maximizing

More information

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions

More information

Mixtures of Gaussians. Sargur Srihari

Mixtures of Gaussians. Sargur Srihari Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1) HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter

More information

A Note on the Expectation-Maximization (EM) Algorithm

A Note on the Expectation-Maximization (EM) Algorithm A Note on the Expectation-Maximization (EM) Algorithm ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign March 11, 2007 1 Introduction The Expectation-Maximization

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Aaron C. Courville Université de Montréal Note: Material for the slides is taken directly from a presentation prepared by Christopher M. Bishop Learning in DAGs Two things could

More information

Lecture 14. Clustering, K-means, and EM

Lecture 14. Clustering, K-means, and EM Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

The Expectation Maximization or EM algorithm

The Expectation Maximization or EM algorithm The Expectation Maximization or EM algorithm Carl Edward Rasmussen November 15th, 2017 Carl Edward Rasmussen The EM algorithm November 15th, 2017 1 / 11 Contents notation, objective the lower bound functional,

More information

Latent Variable View of EM. Sargur Srihari

Latent Variable View of EM. Sargur Srihari Latent Variable View of EM Sargur srihari@cedar.buffalo.edu 1 Examples of latent variables 1. Mixture Model Joint distribution is p(x,z) We don t have values for z 2. Hidden Markov Model A single time

More information

Mixture Models and Expectation-Maximization

Mixture Models and Expectation-Maximization Mixture Models and Expectation-Maximiation David M. Blei March 9, 2012 EM for mixtures of multinomials The graphical model for a mixture of multinomials π d x dn N D θ k K How should we fit the parameters?

More information

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2 STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

EXPECTATION- MAXIMIZATION THEORY

EXPECTATION- MAXIMIZATION THEORY Chapter 3 EXPECTATION- MAXIMIZATION THEORY 3.1 Introduction Learning networks are commonly categorized in terms of supervised and unsupervised networks. In unsupervised learning, the training set consists

More information

The Expectation Maximization Algorithm

The Expectation Maximization Algorithm The Expectation Maximization Algorithm Frank Dellaert College of Computing, Georgia Institute of Technology Technical Report number GIT-GVU-- February Abstract This note represents my attempt at explaining

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

K-Means, Expectation Maximization and Segmentation. D.A. Forsyth, CS543

K-Means, Expectation Maximization and Segmentation. D.A. Forsyth, CS543 K-Means, Expectation Maximization and Segmentation D.A. Forsyth, CS543 K-Means Choose a fixed number of clusters Choose cluster centers and point-cluster allocations to minimize error can t do this by

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College

More information

1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.

1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM. Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification

More information

Pattern Classification

Pattern Classification Pattern Classification Introduction Parametric classifiers Semi-parametric classifiers Dimensionality reduction Significance testing 6345 Automatic Speech Recognition Semi-Parametric Classifiers 1 Semi-Parametric

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Clustering and Gaussian Mixture Models

Clustering and Gaussian Mixture Models Clustering and Gaussian Mixture Models Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 25, 2016 Probabilistic Machine Learning (CS772A) Clustering and Gaussian Mixture Models 1 Recap

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 203 http://ce.sharif.edu/courses/9-92/2/ce725-/ Agenda Expectation-maximization

More information

p(x ω i 0.4 ω 2 ω

p(x ω i 0.4 ω 2 ω p( ω i ). ω.3.. 9 3 FIGURE.. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value given the pattern is in category ω i.if represents

More information

Computer Vision Group Prof. Daniel Cremers. 3. Regression

Computer Vision Group Prof. Daniel Cremers. 3. Regression Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the

More information

K-Means and Gaussian Mixture Models

K-Means and Gaussian Mixture Models K-Means and Gaussian Mixture Models David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 42 K-Means Clustering K-Means Clustering David

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

Variable selection for model-based clustering

Variable selection for model-based clustering Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition

More information

Lecture 6: April 19, 2002

Lecture 6: April 19, 2002 EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2002 Lecturer: Jeff Bilmes Lecture 6: April 19, 2002 University of Washington Dept. of Electrical Engineering Scribe: Huaning Niu,Özgür Çetin

More information

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous

More information

Latent Variable Models and EM Algorithm

Latent Variable Models and EM Algorithm SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

Mixture Models and EM

Mixture Models and EM Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

Mixtures of Rasch Models

Mixtures of Rasch Models Mixtures of Rasch Models Hannah Frick, Friedrich Leisch, Achim Zeileis, Carolin Strobl http://www.uibk.ac.at/statistics/ Introduction Rasch model for measuring latent traits Model assumption: Item parameters

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

More information

Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15

More information

Bayesian Networks Structure Learning (cont.)

Bayesian Networks Structure Learning (cont.) Koller & Friedman Chapters (handed out): Chapter 11 (short) Chapter 1: 1.1, 1., 1.3 (covered in the beginning of semester) 1.4 (Learning parameters for BNs) Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

More information

Mixtures of Gaussians continued

Mixtures of Gaussians continued Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others

More information

Lecture 6: Gaussian Mixture Models (GMM)

Lecture 6: Gaussian Mixture Models (GMM) Helsinki Institute for Information Technology Lecture 6: Gaussian Mixture Models (GMM) Pedram Daee 3.11.2015 Outline Gaussian Mixture Models (GMM) Models Model families and parameters Parameter learning

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models

U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models Jaemo Sung 1, Sung-Yang Bang 1, Seungjin Choi 1, and Zoubin Ghahramani 2 1 Department of Computer Science, POSTECH,

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori

More information

The Regularized EM Algorithm

The Regularized EM Algorithm The Regularized EM Algorithm Haifeng Li Department of Computer Science University of California Riverside, CA 92521 hli@cs.ucr.edu Keshu Zhang Human Interaction Research Lab Motorola, Inc. Tempe, AZ 85282

More information

Expectation maximization

Expectation maximization Expectation maximization Subhransu Maji CMSCI 689: Machine Learning 14 April 2015 Motivation Suppose you are building a naive Bayes spam classifier. After your are done your boss tells you that there is

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectation-Maximization Algorithm Francisco S. Melo In these notes, we provide a brief overview of the formal aspects concerning -means, EM and their relation. We closely follow the presentation in

More information

Dynamic Approaches: The Hidden Markov Model

Dynamic Approaches: The Hidden Markov Model Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Data Preprocessing. Cluster Similarity

Data Preprocessing. Cluster Similarity 1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2017

Cheng Soon Ong & Christian Walder. Canberra February June 2017 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2017 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 679 Part XIX

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

Lecture 3: Machine learning, classification, and generative models

Lecture 3: Machine learning, classification, and generative models EE E6820: Speech & Audio Processing & Recognition Lecture 3: Machine learning, classification, and generative models 1 Classification 2 Generative models 3 Gaussian models Michael Mandel

More information

Minimum Error-Rate Discriminant

Minimum Error-Rate Discriminant Discriminants Minimum Error-Rate Discriminant In the case of zero-one loss function, the Bayes Discriminant can be further simplified: g i (x) =P (ω i x). (29) J. Corso (SUNY at Buffalo) Bayesian Decision

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

EM Algorithm. Expectation-maximization (EM) algorithm.

EM Algorithm. Expectation-maximization (EM) algorithm. EM Algorithm Outline: Expectation-maximization (EM) algorithm. Examples. Reading: A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc.,

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Expectation-Maximization

Expectation-Maximization Expectation-Maximization Léon Bottou NEC Labs America COS 424 3/9/2010 Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification, clustering, regression,

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

MIXTURE MODELS AND EM

MIXTURE MODELS AND EM Last updated: November 6, 212 MIXTURE MODELS AND EM Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Simon Prince, University College London Sergios Theodoridis,

More information

We Prediction of Geological Characteristic Using Gaussian Mixture Model

We Prediction of Geological Characteristic Using Gaussian Mixture Model We-07-06 Prediction of Geological Characteristic Using Gaussian Mixture Model L. Li* (BGP,CNPC), Z.H. Wan (BGP,CNPC), S.F. Zhan (BGP,CNPC), C.F. Tao (BGP,CNPC) & X.H. Ran (BGP,CNPC) SUMMARY The multi-attribute

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

THESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION. Submitted by. Daniel L Elliott

THESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION. Submitted by. Daniel L Elliott THESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION Submitted by Daniel L Elliott Department of Computer Science In partial fulfillment of the requirements

More information

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 )

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 ) Expectation Maximization (EM Algorithm Motivating Example: Have two coins: Coin 1 and Coin 2 Each has it s own probability of seeing H on any one flip. Let p 1 = P ( H on Coin 1 p 2 = P ( H on Coin 2 Select

More information