Clustering on Unobserved Data using Mixture of Gaussians
Lu Ye, Minas E. Spetsakis
Technical Report CS Oct
Department of Computer Science
4700 Keele Street, North York, Ontario M3J 1P3, Canada
Lu Ye
Dept. of Mathematics, York University
4700 Keele Str., Toronto, ON M3J 1P3
lye@mathstat.yorku.ca

Minas E. Spetsakis
Dept. of Computer Science, Center for Vision Research, York University
4700 Keele Str., Toronto, ON M3J 1P3
minas@cs.yorku.ca

ABSTRACT

This report provides a review of Clustering using Mixture Models and the Expectation Maximization method [1] and then extends these concepts to the problem of clustering of unobserved data, where we cluster a set of vectors $u_i$ for $i = 1..N$ for which we only know the probability distribution. This problem has several applications in Computer Vision where we want to cluster noisy data.
1. Clustering

Data clustering is an important statistical technique, closely related to unsupervised learning. Clustering is the process of grouping samples into clusters, so that samples with similar properties belong to the same cluster. A cluster is just a set whose members are similar to each other but different from the members of other clusters. For example, the samples may be points in a particular space, with the distance between any two samples of a cluster being smaller than some threshold.

One of the most popular clustering techniques is based on Expectation Maximization (EM). EM is a method for estimating parameters using partially observed data [2, 3]. Since the clustering problem would be trivial if we already knew the membership of every sample to a particular cluster, we treat this membership information as the unobserved part of the data and apply EM. In this report we develop a clustering method that treats not only the membership of the data as unobserved but the data as well.

Edge grouping is a potential application for clustering. Edges can be considered as sets of many edgels (edge elements), and these edgels can be detected in an image using any of the standard edge detection techniques. If we happen to know that our edgels belong to straight lines and we have an estimate of the slope of the lines, then we can group the edgels into straight line edges using a clustering technique. One of the problems is that some of the information is not very accurate, and we can only assume that we know the probability distribution of the data, but not the data itself. So we essentially have partially observed data.

2. Mixture Model

The underlying model of EM clustering is the Mixture Model [4]. This model assumes that each sample comes from a cluster $\omega_j$, where $j = 1,\dots,K$. The way to generate samples from the mixture model is as follows.
We select one of the clusters $\omega_j$ with probability $P(\omega_j) = \pi_j$ and then generate a sample $x$ out of a probability distribution $p(x \mid \omega_j, \theta)$ or $p(x \mid \theta_j)$, where $\theta = \{\theta_j,\ j = 1,\dots,K\}$ and $\theta_j$ is the vector of parameters associated with the cluster $\omega_j$, which in our case contains the mean $\mu_j$, the covariance matrix $C_j$ and the mixture probability $\pi_j$. One should notice that $p(x \mid \theta_j)$ and $p(x \mid \omega_j, \theta)$ are the same. The first one denotes the probability density of $x$ given the parameters $\theta_j$ of the cluster $\omega_j$, and the other denotes the probability density of $x$ given the parameters $\theta$ of all the clusters and the fact that we have selected cluster $\omega_j$. The density $p(x \mid \theta_j)$ is called the component density. The probability density function of a sample $x$ from the mixture model is then

$$p(x \mid \theta) = \sum_{j=1}^{K} p(x \mid \omega_j, \theta)\,\pi_j. \tag{2.1}$$

For the mixture probabilities $\pi_j$, the following is true
$$\sum_{j=1}^{K} \pi_j = 1.$$

3. Maximum-Likelihood Estimation

The Maximum-Likelihood (ML) method can be used for parameter estimation from a set of samples $x_i$, $i = 1..N$. We assume that $p(x \mid \theta_j)$ is of known parametric form with parameter vectors $\theta_j$ which are unknown. The mixture probabilities $\pi_j$ are also unknown. The maximum-likelihood estimate of the parameters $\theta$ related to a set of data $D = \{x_i,\ i = 1,\dots,N\}$ is obtained by maximizing the log-likelihood of the parameters. The likelihood of the parameters is the probability of the data given the parameters,

$$L(\theta) = p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta),$$

assuming the data is independent. By substituting the probability density function of Eq. (2.1), we get

$$p(D \mid \theta) = \prod_{i=1}^{N} \sum_{j=1}^{K} p(x_i \mid \omega_j, \theta)\,\pi_j.$$

The reason to take the log-likelihood of the parameters is that $p(D \mid \theta)$ involves a product of terms. To simplify the calculation, we use the log-likelihood to change products to sums.

We will use the Gaussian distribution [5] as our parametric form for $p(x_i \mid \theta_j)$. For multidimensional data $x$ it takes the form

$$p(x \mid \theta_j) = \frac{1}{\sqrt{(2\pi)^m |C_j|}}\; e^{-\frac{(x - \mu_j)^T C_j^{-1} (x - \mu_j)}{2}} \tag{3.1}$$

where $m$ is the number of dimensions of the vector $x$, $C_j$ is the covariance matrix and $\mu_j$ is the mean for the cluster $\omega_j$. The mixture probabilities, the covariance and the mean are bundled in the parameter vector $\theta_j = (\mu_j, C_j, \pi_j)$.

In most applications of ML the next step is to maximize the likelihood function (or more often the log-likelihood). Usually the result is a simple analytic expression, which along with a host of other nice properties explains the popularity of ML. But in this case we have a product of sums and it is impossible to get a simple expression. The problem would become very simple if we knew the membership of every sample. Then we would only have to solve the Gaussian parameter estimation problem for every cluster. But lacking this membership information we can use Expectation Maximization.

4. Expectation Maximization Applied to Clustering

Expectation Maximization (EM) is an iterative algorithm that alternates between two steps: the Expectation step and the Maximization step. In the Expectation step, we
first derive the log-likelihood of the unknown parameters as a function of the unobserved data and then we compute the expected value of this likelihood using the probability density of the unobserved data. Since the density of the unobserved data involves the parameters we want to estimate, we use a guess. In the Maximization step, we obtain the parameters that maximize the expected value of the likelihood. These new parameters become the guess to be used in the next iteration. The iteration goes on until convergence or until the termination conditions are satisfied.

The unknown parameter vector $\theta$ contains information appropriate for the mixture model we use. Since we use the Gaussian distribution in the mixture model, the parameter vector normally contains three components for each cluster: the mean, the covariance and the mixture probability. These are usually the unknowns of our problem. Once we know the mean, covariance and mixture probability of each cluster, we are able to group the samples. The number of clusters $K$ is assumed known, but we can augment the algorithm with BIC or AIC (see below) and treat $K$ as an unknown too.

The straight application of ML to our clustering problem is extremely difficult because it involves logarithms of sums and many complicated functions. Luckily, it can be easily put in a form where we can apply EM. We do this as follows [6]. Let $z_i$ be a vector of length $K$ (the number of clusters) whose $j$th element is 1 if the sample $x_i$ was generated by cluster $j$ and zero otherwise. Obviously there is exactly one 1 in the vector and the rest are zero, so

$$\sum_{j=1}^{K} z_{ij} = 1 \tag{4.1}$$

where $z_{ij}$ is the $j$th element of $z_i$. Since we do not know the membership of every sample, $z_i$ is the unobserved part of the data.
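The generative procedure of the mixture model, together with the membership vectors $z_i$ just defined, can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `sample_mixture`, the use of NumPy, and the example parameter values are not part of the report.

```python
import numpy as np

def sample_mixture(pi, mu, C, n, seed=0):
    """Draw n samples from a Gaussian mixture and return them with one-hot memberships.

    pi: (K,) mixture probabilities, mu: (K, m) means, C: (K, m, m) covariances.
    All names here are illustrative assumptions, not notation fixed by the report.
    """
    rng = np.random.default_rng(seed)
    K = len(pi)
    # Select cluster omega_j with probability pi_j for every sample.
    labels = rng.choice(K, size=n, p=pi)
    # Generate each sample from the component density of its cluster.
    x = np.stack([rng.multivariate_normal(mu[j], C[j]) for j in labels])
    # z[i, j] = 1 if sample i was generated by cluster j, 0 otherwise (Eq. 4.1).
    z = np.eye(K, dtype=int)[labels]
    return x, z

# Hypothetical two-cluster example in two dimensions.
pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
C = np.array([np.eye(2), 0.5 * np.eye(2)])
x, z = sample_mixture(pi, mu, C, n=500)
```

In a clustering problem only `x` would be available; `z` is exactly the part of the data that EM treats as unobserved.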
Let $D_y = \{y_i,\ i = 1,\dots,N\}$ be the complete data set, where $y_i = (x_i, z_i)$, let $D_x = \{x_i,\ i = 1,\dots,N\}$ be the observed data set and $D_z = \{z_i,\ i = 1,\dots,N\}$ be the unobserved data set. In the expectation step, the expression for the expectation is

$$Q(\theta, \theta^t) = E_{D_z}\{\ln p(D_y \mid \theta) \mid D_x, \theta^t\}.$$

Although it looks intimidating at first, it can be tamed in a few simple steps. The first step is to derive the form of $\ln p(D_y \mid \theta)$, which is a function of $D_x$, $D_z$ and $\theta$. Once we simplify $\ln p(D_y \mid \theta)$, we can get its expected value by applying the standard rules. The subscript $D_z$ in $E_{D_z}$ means that the random variables are the elements of $D_z$, and the probability of $D_z$ is conditioned on $D_x$ and $\theta^t$, where $\theta^t$ is just a guess, usually the result of the $t$th (or previous) iteration. The result is a function of $\theta$, the unknown mixture parameters, of $\theta^t$, the guess for $\theta$, and of the observed data $D_x$, but for simplicity we drop $D_x$ from $Q(\theta, \theta^t)$.

Using the rule of joint probabilities and assuming that the $y_i$'s, the elements of $D_y$, are independent, we can rewrite $p(D_y \mid \theta)$ as follows:
$$p(D_y \mid \theta) = \prod_{i=1}^{N} p(y_i \mid \theta).$$

Since we have $y_i = (x_i, z_i)$, $p(y_i \mid \theta)$ can be written as

$$p(y_i \mid \theta) = p(x_i, z_i \mid \theta).$$

Based on the properties of conditional probability, $p(x_i, z_i \mid \theta)$ has the following expression:

$$p(x_i, z_i \mid \theta) = p(x_i \mid z_i, \theta)\,p(z_i \mid \theta).$$

As we mentioned above, $z_i$ is the vector whose $j$th element is 1 if the sample $x_i$ was generated by cluster $j$. Therefore, the probability of $z_i$ given the parameter vector $\theta$ is equal to the mixture probability $\pi_j$, i.e. the probability of selecting cluster $j$ among the $K$ clusters of the mixture model. The probability of $x_i$ given $z_i$ and $\theta$, $p(x_i \mid z_i, \theta)$, has an alternative form which is $p(x_i \mid \theta_j)$, where $\theta_j$ is the vector of parameters associated with the cluster $\omega_j$. So we have $p(x_i, z_i \mid \theta) = p(x_i \mid \theta_j)\,\pi_j$. Thus $p(y_i \mid \theta)$ can be written as

$$p(y_i \mid \theta) = p(x_i \mid \theta_j)\,\pi_j. \tag{4.2}$$

This equation only presents the correspondence of $p(y_i \mid \theta)$ to the parameter vector $\theta_j$ of the $j$th cluster. Recalling that in the vector $z_i$ we have $z_{ij} = 1$ and $z_{ik} = 0$ for $k \neq j$, we have the following new expression for $p(y_i \mid \theta)$:

$$p(y_i \mid \theta) = \prod_{k=1}^{K} \bigl(p(x_i \mid \theta_k)\,\pi_k\bigr)^{z_{ik}}.$$

The term $\bigl(p(x_i \mid \theta_k)\,\pi_k\bigr)^{z_{ik}}$ is equal to $p(x_i \mid \theta_j)\,\pi_j$ when $k = j$ (where $z_{ij} = 1$), and it is 1 when $k \neq j$. Thus, the product over all clusters is the same as Eq. (4.2). By collecting all the samples $y_i$ in $D_y$, the probability density function of the set $D_y$ given $\theta$ is

$$p(D_y \mid \theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \bigl(p(x_i \mid \theta_k)\,\pi_k\bigr)^{z_{ik}}.$$

We then obtain $\ln p(D_y \mid \theta)$ by simply taking the logarithm of the above equation,

$$\ln p(D_y \mid \theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \ln\bigl(p(x_i \mid \theta_k)\,\pi_k\bigr),$$

and by applying the logarithm rule once more

$$\ln p(D_y \mid \theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \ln p(x_i \mid \theta_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \ln \pi_k. \tag{4.3}$$

Eq. (4.3) is our form for $\ln p(D_y \mid \theta)$ and we can now calculate its expected value to conclude the Expectation step. Since $\ln p(D_y \mid \theta)$ is the sum of two terms, its expected value is the sum of the expected values of those two terms.
We also need to keep in mind that the random variables are the unobserved data $z_{ik}$ only. The other terms of the equation should be treated as constants in this situation. The expected values of constants are the constants themselves, so

$$E_z\{\ln p(D_y \mid \theta) \mid D_x, \theta^t\} = \sum_{i=1}^{N} \sum_{k=1}^{K} E_z\{z_{ik} \mid D_x, \theta^t\} \ln p(x_i \mid \theta_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} E_z\{z_{ik} \mid D_x, \theta^t\} \ln \pi_k. \tag{4.4}$$

Let's rename $E_z\{z_{ik} \mid D_x, \theta^t\}$ as $\bar{z}_{ik}$; then we have

$$E_z\{\ln p(D_y \mid \theta) \mid D_x, \theta^t\} = \sum_{i=1}^{N} \sum_{k=1}^{K} \bar{z}_{ik} \ln p(x_i \mid \theta_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} \bar{z}_{ik} \ln \pi_k. \tag{4.5}$$

Based on the definition of the vector $z_i$, we notice that $z_{ik}$ takes two values only: zero and one. So

$$\bar{z}_{ik} = E_z\{z_{ik} \mid D_x, \theta^t\} = 0 \cdot P(z_{ik} = 0 \mid D_x, \theta^t) + 1 \cdot P(z_{ik} = 1 \mid D_x, \theta^t) = P(z_{ik} = 1 \mid D_x, \theta^t)$$

is the probability of $z_{ik}$ being 1. This is the same as $P(\omega_k \mid x_i, D_x, \theta^t)$, the probability that the sample $x_i$ was generated by the $k$th cluster among all the clusters. Applying the Bayes rule, we express $\bar{z}_{ik}$ in terms of the whole model:

$$\begin{aligned}
\bar{z}_{ik} = E_z\{z_{ik} \mid D_x, \theta^t\} = P(\omega_k \mid x_i, \theta^t)
&= \frac{p(x_i \mid D_x, \theta^t, \omega_k)\,P(\omega_k \mid D_x, \theta^t)}{\sum_{j=1}^{K} p(x_i \mid D_x, \theta^t, \omega_j)\,P(\omega_j \mid D_x, \theta^t)} \\
&= \frac{p(x_i \mid \theta^t, \omega_k)\,P(\omega_k \mid D_x, \theta^t)}{\sum_{j=1}^{K} p(x_i \mid \theta^t, \omega_j)\,P(\omega_j \mid D_x, \theta^t)} \\
&= \frac{p(x_i \mid \theta_k^t)\,P(\omega_k \mid D_x, \theta^t)}{\sum_{j=1}^{K} p(x_i \mid \theta_j^t)\,P(\omega_j \mid D_x, \theta^t)}.
\end{aligned} \tag{4.6}$$

Here $P(\omega_k \mid D_x, \theta^t)$ is the current guess $\pi_k^t$ of the mixture probability. Since we use a known parametric form for the probability, Gaussian in this case, we can compute $p(x_i \mid \theta_k^t)$ and so compute $\bar{z}_{ik}$ from Eq. (4.6), and then we can plug the value of $\bar{z}_{ik}$ into Eq. (4.4) to get the final result of the Expectation step.

In the Maximization step, we maximize the above expectation function in order to get the optimal solution for the unknown $\theta$, which can be used in the next round as $\theta^t$ to estimate a new expectation function and from this to get the new $\theta$. There is one constraint in our mixture model, which is

$$\sum_{j=1}^{K} \pi_j = 1. \tag{4.7}$$

Therefore, we maximize the expectation function $Q(\theta, \theta^t)$ subject to this constraint. We define a new $Q'(\theta, \theta^t)$ which takes into account the constraint of Eq. (4.7) by using the Lagrange multiplier.
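$$Q'(\theta, \theta^t) = \sum_{i=1}^{N} \sum_{k=1}^{K} \bar{z}_{ik} \ln p(x_i \mid \theta_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} \bar{z}_{ik} \ln \pi_k + \lambda \Bigl(1 - \sum_{j=1}^{K} \pi_j\Bigr)$$

Using standard calculus procedures, we can find the maximum value of a function subject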
to the constraint by taking derivatives with respect to the unknowns of the function and $\lambda$, setting the derivatives to zero, and solving the resulting equations to find the values of the unknowns and $\lambda$ that maximize the function. In our case the unknown is $\theta$, the vector containing all the parameters of the mixture model, $\pi_j$, $\mu_j$ and $C_j$ for $j = 1..K$.

First, we take the partial derivative with respect to $\pi_j$. In $Q'(\theta, \theta^t)$, only the last two terms involve the variable $\pi_j$, so the first term will give zero. Since we take the derivative of $Q'$ with respect to one particular $\pi_j$, only the $j$th element of each sum over the clusters has a nonzero derivative. So we can get rid of the summation over the clusters, and we have

$$\frac{\partial Q'}{\partial \pi_j} = \frac{1}{\pi_j} \sum_{i=1}^{N} \bar{z}_{ij} - \lambda = 0,
\qquad
\lambda = \frac{1}{\pi_j} \sum_{i=1}^{N} \bar{z}_{ij},
\qquad
\pi_j = \frac{\sum_{i=1}^{N} \bar{z}_{ij}}{\lambda}. \tag{4.8}$$

If we take the summation over $j$ from 1 to $K$ on both sides, we can use Eq. (4.7) to eliminate $\pi_j$ and have

$$\frac{\sum_{j=1}^{K} \sum_{i=1}^{N} \bar{z}_{ij}}{\lambda} = 1$$

from which we get

$$\lambda = \sum_{i=1}^{N} \sum_{j=1}^{K} \bar{z}_{ij}.$$

So we plug $\lambda$ into Eq. (4.8), and get

$$\pi_j = \frac{\sum_{i=1}^{N} \bar{z}_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{K} \bar{z}_{ij}}.$$

We notice that the denominator in the above expression is

$$\sum_{i=1}^{N} \sum_{j=1}^{K} \bar{z}_{ij} = \sum_{i=1}^{N} 1 = N$$

after taking into account Eq. (4.1). So finally

$$\pi_j = \frac{1}{N} \sum_{i=1}^{N} \bar{z}_{ij}. \tag{4.9}$$

Second, we take the partial derivative with respect to the unknown $\mu_j$, the mean of the $j$th cluster. It is only involved in the Gaussian distribution for our mixture model, so
we only need to be concerned with the first term of the expectation function when taking derivatives. In the first term of the function, $\bar{z}_{ij}$ is computed from the parameters $\theta^t$, which is the guess, and we consider it independent of $\theta$ and constant. For the same reason as above, we eliminate the summation over the clusters by taking the derivative with respect to the particular $\mu_j$. The definition of the Gaussian density from Eq. (3.1) is

$$p(x \mid \theta_j) = \frac{1}{\sqrt{(2\pi)^m |C_j|}}\; e^{-\frac{(x - \mu_j)^T C_j^{-1} (x - \mu_j)}{2}}$$

and the log is

$$\ln p(x_i \mid \theta_j) = \ln \frac{1}{\sqrt{(2\pi)^m |C_j|}} - \frac{(x_i - \mu_j)^T C_j^{-1} (x_i - \mu_j)}{2}.$$

The derivative then is as follows:

$$\frac{\partial Q'}{\partial \mu_j} = \frac{\partial}{\partial \mu_j} \left[ \sum_{i=1}^{N} \bar{z}_{ij} \ln \frac{1}{\sqrt{(2\pi)^m |C_j|}} - \sum_{i=1}^{N} \bar{z}_{ij}\, \frac{(x_i - \mu_j)^T C_j^{-1} (x_i - \mu_j)}{2} \right] = 0.$$

The derivative of the first term is equal to zero. The derivative of $-\frac{(x_i - \mu_j)^T C_j^{-1} (x_i - \mu_j)}{2}$ with respect to $\mu_j$ is $C_j^{-1} (x_i - \mu_j)$. The derivative of the second part is therefore

$$\frac{\partial Q'}{\partial \mu_j} = \sum_{i=1}^{N} \bar{z}_{ij}\, C_j^{-1} (x_i - \mu_j) = C_j^{-1} \sum_{i=1}^{N} \bar{z}_{ij} (x_i - \mu_j) = 0$$

which results in

$$\sum_{i=1}^{N} \bar{z}_{ij}\, x_i = \mu_j \sum_{i=1}^{N} \bar{z}_{ij}
\qquad\text{or}\qquad
\mu_j = \frac{\sum_{i=1}^{N} \bar{z}_{ij}\, x_i}{\sum_{i=1}^{N} \bar{z}_{ij}}$$

and using Eq. (4.9) we get

$$\mu_j = \frac{1}{N \pi_j} \sum_{i=1}^{N} \bar{z}_{ij}\, x_i.$$

It is now time to derive the last unknown variable of our model in the Maximization step, $C_j$. The steps of the derivation are similar to the derivation of $\mu_j$. We only work with the first term of the expectation function. The difference begins with the following step:
$$\frac{\partial Q'}{\partial C_j} = \frac{\partial}{\partial C_j} \left[ \sum_{i=1}^{N} \bar{z}_{ij} \ln \frac{1}{\sqrt{(2\pi)^m |C_j|}} - \sum_{i=1}^{N} \bar{z}_{ij}\, \frac{(x_i - \mu_j)^T C_j^{-1} (x_i - \mu_j)}{2} \right] = 0.$$

The above differentiation involves derivatives of a determinant and of a quadratic product, which we can get from the Matrix Reference Manual:

$$\frac{\partial}{\partial C_j} (x_i - \mu_j)^T C_j^{-1} (x_i - \mu_j) = -C_j^{-1} (x_i - \mu_j)(x_i - \mu_j)^T C_j^{-1},
\qquad
\frac{\partial |C_j|}{\partial C_j} = |C_j|\, C_j^{-1},$$

and using the chain rule

$$\frac{\partial}{\partial C_j} \ln \frac{1}{\sqrt{(2\pi)^m |C_j|}} = -\frac{1}{2} \frac{\partial \ln |C_j|}{\partial C_j} = -\frac{1}{2} C_j^{-1}$$

and we get

$$\frac{\partial Q'}{\partial C_j} = \frac{1}{2} \sum_{i=1}^{N} \bar{z}_{ij} \left( -C_j^{-1} + C_j^{-1} (x_i - \mu_j)(x_i - \mu_j)^T C_j^{-1} \right) = 0.$$

After that, we multiply both sides of the above equation by $C_j$ from the left and from the right to eliminate $C_j^{-1}$, and we get the following:

$$\sum_{i=1}^{N} \bar{z}_{ij} \left( -C_j + (x_i - \mu_j)(x_i - \mu_j)^T \right) = 0$$

which gives

$$C_j \sum_{i=1}^{N} \bar{z}_{ij} = \sum_{i=1}^{N} \bar{z}_{ij} (x_i - \mu_j)(x_i - \mu_j)^T,
\qquad
C_j = \frac{\sum_{i=1}^{N} \bar{z}_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} \bar{z}_{ij}}.$$

By using Eq. (4.9), we have the $C_j$ for the next iteration:

$$C_j = \frac{1}{N \pi_j} \sum_{i=1}^{N} \bar{z}_{ij} (x_i - \mu_j)(x_i - \mu_j)^T.$$

In conclusion, the above derivation explains the internal steps of the EM algorithm in detail. Each time we run the EM algorithm, we can directly use the obtained equations to calculate the covariances, means and mixture probabilities.

K-means clustering is a simple and classical example of applying the EM algorithm. It can be viewed as the problem of estimating the means of a Gaussian Mixture Model. The special assumption of K-means is that the mixing probabilities $\pi_j$ are equal and each
Gaussian distribution has the same variance.

5. Bayesian Information Criterion Applied to Clustering

The EM clustering algorithm that we outlined above assumes a Mixture of Gaussians as the underlying model. After the particular mixture model is defined, the EM algorithm is applied to find all the parameters except the number of clusters. So we have to provide the number of clusters through some other method. The Bayesian Information Criterion (BIC) is a widely used method for this purpose [7, 8], and can be applied in our case to find the number of clusters. Since we use EM to find the maximum mixture likelihood, we can get a reliable approximation of the BIC. The expression of the BIC is

$$\mathrm{BIC} = -2 L_M(x, \theta) + m_M \log(N) \tag{5.1}$$

where $L_M(x, \theta)$ is the maximized mixture log-likelihood for the model $M$, $m_M$ is the number of independent parameters to be estimated in the model and $N$ is the number of samples. The maximized mixture log-likelihood function is

$$L_M(x, \theta) = \log \prod_{i=1}^{N} p(x_i \mid \theta) = \log \prod_{i=1}^{N} \sum_{j=1}^{K} p(x_i \mid \theta_j)\,\pi_j.$$

Recalling the material discussed in the previous sections, $p(x_i \mid \theta_j)$ is the probability of $x_i$ given that it belongs to the $j$th cluster, which is a Gaussian distribution, and $\pi_j$ is the mixture probability of the model $M$. The smaller the value of the BIC, the stronger the evidence for the model.

The BIC can be used not only to determine the number of clusters by comparing models, but also to help avoid local minima in our clustering problem. Since we use random starting points, usually by selecting random initial values for the parameters $\theta_j = (\mu_j, C_j, \pi_j)$, where $j = 1,\dots,K$, we get distinct clustering results when we run the EM algorithm several times. If we are lucky, we get reasonably good initial values for $\theta_j$ and desirable results at the first attempt. However, we cannot control our luck.
To solve this problem, we apply the EM clustering on the same data several times using random starting points and calculate the BIC for each complete run by applying Eq. (5.1). Then we select the run that has the minimum BIC.

6. Expectation Maximization Clustering Applied on Unobserved Data

In the previous sections, we discussed the EM method and the clustering problem and applied the EM algorithm to the clustering of $N$ samples into $K$ clusters. We assumed that the underlying model is a Mixture of Gaussians and that the coordinates of the samples are given. Next we study the problem where the coordinates of the samples are not given directly and we only know the probability distributions of the samples $p(u_i \mid D_i)$, where $u_i$ are the samples and $D_i$ is the data on which these probabilities are based. The union of all $D_i$ is $D$. We will again use EM to cluster these samples. The general procedure of the EM algorithm remains the same, but we have different assumptions and preconditions.
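Before moving to unobserved samples, the observed-data procedure of Sections 4 and 5 can be sketched in code as a point of reference. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the initialization (random data points as means, the overall data covariance for every cluster), the small regularization term `reg`, the fixed iteration count and the parameter count used for $m_M$ are all choices made here for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    """Gaussian density of Eq. (3.1), evaluated at every row of x."""
    m = x.shape[1]
    d = x - mu
    quad = np.einsum("ni,ij,nj->n", d, np.linalg.inv(C), d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** m * np.linalg.det(C))

def weighted_densities(x, pi, mu, C):
    """Matrix with entries pi_j * p(x_i | theta_j), shape (N, K)."""
    return np.stack([pi[j] * gaussian_pdf(x, mu[j], C[j]) for j in range(len(pi))], axis=1)

def em_gmm(x, K, n_iter=100, seed=0, reg=1e-6):
    """Run EM from a random start; return (pi, mu, C, log-likelihood)."""
    rng = np.random.default_rng(seed)
    N, m = x.shape
    mu = x[rng.choice(N, size=K, replace=False)]           # assumed initialization
    C = np.stack([np.cov(x.T).reshape(m, m) + reg * np.eye(m)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # Expectation step: expected memberships, Eq. (4.6).
        dens = weighted_densities(x, pi, mu, C)
        zbar = dens / dens.sum(axis=1, keepdims=True)
        # Maximization step: Eq. (4.9) and the updates for mu_j and C_j.
        Nj = zbar.sum(axis=0)
        pi = Nj / N
        mu = (zbar.T @ x) / Nj[:, None]
        for j in range(K):
            d = x - mu[j]
            C[j] = (zbar[:, j, None] * d).T @ d / Nj[j] + reg * np.eye(m)
    loglik = np.log(weighted_densities(x, pi, mu, C).sum(axis=1)).sum()
    return pi, mu, C, loglik

def bic(loglik, K, m, N):
    """Eq. (5.1); m_M counts means, symmetric covariances and K-1 free mixture weights."""
    n_params = K * (m + m * (m + 1) // 2) + (K - 1)
    return -2.0 * loglik + n_params * np.log(N)

# Hypothetical data: two well separated clusters in two dimensions.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(8.0, 1.0, size=(200, 2))])

# Several random starts for each candidate K; keep the run with the smallest BIC.
scores = {}
for K in (1, 2):
    for s in range(3):
        scores[(K, s)] = bic(em_gmm(x, K, seed=s)[3], K, x.shape[1], len(x))
best = min(scores, key=scores.get)   # (K, seed) of the selected run
```

The same loop serves both purposes described in Section 5: comparing values of $K$ and discarding runs that ended in a poor local optimum.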
As before, we have both a set of observed data and a set of unobserved data. We are given the information $D$, and we have unobserved samples $u_i$, $i = 1,\dots,N$. As before, $\psi_i$ is the membership vector of the sample, where $\psi_{ij} = 1$ if sample $u_i$ was generated by cluster $\omega_j$ and zero otherwise. The complete data for each sample is $y_i = (D_i, \psi_i, u_i)$. The complete data set is $D_y = (y_i,\ i = 1,\dots,N)$. The unobserved data is $D_z = (z_i,\ i = 1,\dots,N)$ where $z_i = (\psi_i, u_i)$. Since we assume a Mixture of Gaussians, we have to compute the parameter vector $\theta = (\theta_j,\ j = 1..K)$ where $\theta_j = (\mu_j, C_j, \pi_j)$ contains the mean, covariance and mixture probability of each cluster.

6.1. Expectation

The first step in EM is to derive the expression for the expectation. In this problem, $D$ is the only given data, $D_y$ is the complete data and $D_z$ is the unobserved data. The general expression for the expectation for this problem is

$$Q(\theta, \theta^t) = E_{D_z}\{\ln p(D_y \mid \theta) \mid D, \theta^t\}$$

but now $D_y$ and $D_z$ contain different data sets and we need to derive $\ln p(D_y \mid \theta)$ under different assumptions. As before, we write $p(D_y \mid \theta)$ as

$$p(D_y \mid \theta) = \prod_{i=1}^{N} p(y_i \mid \theta). \tag{6.1}$$

Since $y_i = (D_i, \psi_i, u_i)$, we can write

$$p(y_i \mid \theta) = p(D_i, \psi_i, u_i \mid \theta)$$

and we rewrite $p(D_i, \psi_i, u_i \mid \theta)$ as

$$p(D_i, \psi_i, u_i \mid \theta) = p(\psi_i, u_i \mid \theta)\,p(D_i \mid \theta, \psi_i, u_i)$$

and $p(\psi_i, u_i \mid \theta)$ as

$$p(\psi_i, u_i \mid \theta) = p(\psi_i \mid \theta)\,p(u_i \mid \theta, \psi_i)$$

so

$$p(y_i \mid \theta) = p(\psi_i \mid \theta)\,p(u_i \mid \theta, \psi_i)\,p(D_i \mid \theta, \psi_i, u_i).$$

We now introduce the quasi-dependence assumption that $D$ (or its components) is not directly related to $\theta$ (or its components) but only through the corresponding $u$. So $p(D_i \mid \theta, \psi_i, u_i) = p(D_i \mid u_i)$, and $p(y_i \mid \theta)$ becomes

$$p(y_i \mid \theta) = p(\psi_i \mid \theta)\,p(u_i \mid \theta, \psi_i)\,p(D_i \mid u_i). \tag{6.2}$$

The first term in Eq. (6.2) is $p(\psi_i \mid \theta)$, where $\psi_i$ is the membership vector for the $i$th sample. Then

$$\psi_{ij} = \begin{cases} 1 & \text{if } u_i \text{ was generated by } \omega_j \\ 0 & \text{otherwise} \end{cases}$$
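and

$$\sum_{j=1}^{K} \psi_{ij} = 1$$

and we can write

$$p(\psi_i \mid \theta) = \prod_{k=1}^{K} \pi_k^{\psi_{ik}}. \tag{6.3}$$

The value of the above expression is $\pi_j$ if the $i$th sample is a member of the $j$th cluster, i.e. $\psi_{ij} = 1$. We can apply the same technique to $p(u_i \mid \theta, \psi_i)$ and have

$$p(u_i \mid \theta, \psi_i) = \prod_{k=1}^{K} p(u_i \mid \theta_k)^{\psi_{ik}}.$$

Then Eq. (6.2) can be rewritten as

$$p(y_i \mid \theta) = \prod_{k=1}^{K} \pi_k^{\psi_{ik}} \prod_{k=1}^{K} p(u_i \mid \theta_k)^{\psi_{ik}}\; p(D_i \mid u_i).$$

By taking the log of $p(y_i \mid \theta)$ we get

$$\log p(y_i \mid \theta) = \sum_{k=1}^{K} \psi_{ik} \log \pi_k + \sum_{k=1}^{K} \psi_{ik} \log p(u_i \mid \theta_k) + \log p(D_i \mid u_i) \tag{6.4}$$

and from Eq. (6.1)

$$\log p(D_y \mid \theta) = \sum_{i=1}^{N} \log p(y_i \mid \theta).$$

Substituting Eq. (6.4) into the above equation, we get the expression for $\log p(D_y \mid \theta)$:

$$\log p(D_y \mid \theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \psi_{ik} \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} \psi_{ik} \log p(u_i \mid \theta_k) + \sum_{i=1}^{N} \log p(D_i \mid u_i). \tag{6.5}$$

The final step to derive the expectation function is to take the expected value of Eq. (6.5) with respect to the unobserved data $D_z$, given the data $D$ and a guess of the clustering parameters $\theta^t$. The expectation equation is

$$E_{D_z}\{\log p(D_y \mid \theta) \mid D, \theta^t\} = \sum_{i=1}^{N} \sum_{k=1}^{K} E\{\psi_{ik} \mid \theta^t, D\} \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} E\{\psi_{ik} \log p(u_i \mid \theta_k) \mid \theta^t, D\} + \sum_{i=1}^{N} E\{\log p(D_i \mid u_i) \mid \theta^t, D\} \tag{6.6}$$

where we assume that all the expected values are taken over $D_z$. We first derive $E\{\psi_{ik} \mid \theta^t, D\}$, which we name $\bar{\psi}_{ik}$:

$$\bar{\psi}_{ik} = E\{\psi_{ik} \mid \theta^t, D\} = 0 \cdot P(\psi_{ik} = 0 \mid \theta^t, D) + 1 \cdot P(\psi_{ik} = 1 \mid \theta^t, D)$$

since $\psi_{ik}$ takes only the values 0 and 1. So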
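$$\bar{\psi}_{ik} = P(\psi_{ik} = 1 \mid \theta^t, D).$$

Since we need explicit dependence on $u_i$ we write

$$P(\psi_{ik} = 1 \mid \theta^t, D) = \int P(\psi_{ik} = 1, u_i \mid \theta^t, D)\,du_i = \int p(u_i \mid \theta^t, D)\,P(\psi_{ik} = 1 \mid \theta^t, D, u_i)\,du_i.$$

By invoking again the quasi-dependence assumption we notice that $\psi_{ik}$ does not depend on $D$ once $u_i$ is given, so we drop $D$. We also notice that $p(u_i \mid \theta^t, D) = p(u_i \mid \theta^t, D_i)$. So

$$\begin{aligned}
\bar{\psi}_{ik}
&= \int p(u_i \mid \theta^t, D_i)\,P(\psi_{ik} = 1 \mid \theta^t, u_i)\,du_i
 = \int p(u_i \mid \theta^t, D_i)\,P(\omega_k \mid \theta^t, u_i)\,du_i \\
&= \int \frac{p(D_i \mid \theta^t, u_i)\,p(u_i \mid \theta^t)}{p(D_i \mid \theta^t)} \cdot \frac{p(u_i \mid \theta_k^t)\,P(\omega_k \mid \theta^t)}{p(u_i \mid \theta^t)}\,du_i \\
&= \int \frac{p(D_i \mid \theta^t, u_i)\,p(u_i \mid \theta_k^t)\,\pi_k^t}{p(D_i \mid \theta^t)}\,du_i.
\end{aligned}$$

In $p(D_i \mid \theta^t, u_i)$, $\theta^t$ has no direct relationship with $D_i$ but only through $u_i$, so we invoke the quasi-dependence assumption and drop $\theta^t$, so $p(D_i \mid \theta^t, u_i) = p(D_i \mid u_i)$, which is the likelihood of $u_i$ given the data $D_i$ and is assumed known. Since $\pi_k^t$ and $p(D_i \mid \theta^t)$ do not contain the variable $u_i$, we have

$$\bar{\psi}_{ik} = \int \frac{p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,\pi_k^t}{p(D_i \mid \theta^t)}\,du_i = \frac{\pi_k^t}{p(D_i \mid \theta^t)} \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i.$$

Then we get

$$\begin{aligned}
\bar{\psi}_{ik}
&= \frac{\pi_k^t \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i}{\int p(D_i, u_i \mid \theta^t)\,du_i}
 = \frac{\pi_k^t \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i}{\int p(D_i \mid u_i)\,p(u_i \mid \theta^t)\,du_i} \\
&= \frac{\pi_k^t \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i}{\int p(D_i \mid u_i) \sum_{j=1}^{K} \pi_j^t\, p(u_i \mid \theta_j^t)\,du_i}
\end{aligned} \tag{6.7}$$

and if we exchange the summation and the integral we get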
$$\bar{\psi}_{ik} = \frac{\pi_k^t \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i}{\sum_{j=1}^{K} \int p(D_i \mid u_i)\,\pi_j^t\, p(u_i \mid \theta_j^t)\,du_i} = \frac{\pi_k^t \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i}{\sum_{j=1}^{K} \pi_j^t \int p(D_i \mid u_i)\,p(u_i \mid \theta_j^t)\,du_i}.$$

We define $g_{ik}$ as

$$g_{ik} = p(D_i \mid \theta_k^t) = \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i$$

and we rewrite the expression of $\bar{\psi}_{ik}$ as

$$\bar{\psi}_{ik} = \frac{\pi_k^t\, g_{ik}}{\sum_{j=1}^{K} \pi_j^t\, g_{ij}}. \tag{6.8}$$

We can compute $g_{ik}$ by evaluating the integral using one of the methods described later, and we can then compute $\bar{\psi}_{ik}$. The first term of Eq. (6.6) can be written as follows:

$$\sum_{i=1}^{N} \sum_{k=1}^{K} E\{\psi_{ik} \mid \theta^t, D\} \log \pi_k = \sum_{i=1}^{N} \sum_{k=1}^{K} \bar{\psi}_{ik} \log \pi_k.$$

We now simplify the second term of Eq. (6.6):

$$\sum_{i=1}^{N} \sum_{k=1}^{K} E_{D_z}\{\psi_{ik} \log p(u_i \mid \theta_k) \mid \theta^t, D\} = \sum_{i=1}^{N} \sum_{k=1}^{K} \Bigl[\, 0 + E_{u_i}\{\log p(u_i \mid \theta_k) \mid D, \theta^t, \psi_{ik} = 1\}\, P(\psi_{ik} = 1 \mid D, \theta^t) \Bigr]. \tag{6.9}$$

The second factor, $P(\psi_{ik} = 1 \mid D, \theta^t)$, is the same as the $\bar{\psi}_{ik}$ that we have derived. We use the result of Eq. (6.7) directly, and have

$$E_{u_i}\{\log p(u_i \mid \theta_k) \mid D, \theta^t, \psi_{ik} = 1\}\,\bar{\psi}_{ik} = E_{u_i}\{\log p(u_i \mid \theta_k) \mid D, \theta_k^t\}\,\bar{\psi}_{ik} = \bar{\psi}_{ik} \int \log p(u_i \mid \theta_k)\,p(u_i \mid D_i, \theta_k^t)\,du_i.$$

All the $\theta^t$ parameters are given, and only $\log p(u_i \mid \theta_k)$ contains $\mu_k$ and $C_k$, which will be estimated in the Maximization step. Since the third term of Eq. (6.6) does not contain any variables that need to be estimated in the Maximization step, it is irrelevant and we can ignore it.

6.2. Maximization

In the Maximization step, we use the same method to get the optimal solution for the parameters in $\theta_j = \{\mu_j, C_j, \pi_j\}$. We also need to use the constraint $\sum_j \pi_j = 1$. We use the symbols defined above in place of some of the more complex terms of the Expectation step. By using the
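Lagrange multiplier, we get the function to maximize:

$$Q'(\theta, \theta^t) = \sum_{i=1}^{N} \sum_{k=1}^{K} \bar{\psi}_{ik} \left[ \log \pi_k + \int \log p(u_i \mid \theta_k)\,p(u_i \mid D_i, \theta_k^t)\,du_i \right] + \lambda \Bigl(1 - \sum_{k=1}^{K} \pi_k\Bigr). \tag{6.10}$$

We want to take partial derivatives of $Q'(\theta, \theta^t)$ with respect to the components of $\theta_j = \{\mu_j, C_j, \pi_j\}$ and $\lambda$.

First, we take the partial derivative with respect to $\pi_j$. Only the first term and the Lagrange term contain the variable $\pi_j$ in Eq. (6.10), so the other terms will vanish. Since the integral and the partial derivative are linear operators, we can interchange them:

$$\frac{\partial Q'}{\partial \pi_j} = \frac{1}{\pi_j} \sum_{i=1}^{N} \bar{\psi}_{ij} - \lambda = 0$$

and after solving for $\pi_j$

$$\lambda = \frac{1}{\pi_j} \sum_{i=1}^{N} \bar{\psi}_{ij},
\qquad
\pi_j = \frac{1}{\lambda} \sum_{i=1}^{N} \bar{\psi}_{ij}. \tag{6.11}$$

Using the constraint on the mixture probabilities (obtained by taking the derivative with respect to $\lambda$) we get

$$\sum_{k=1}^{K} \pi_k = 1,
\qquad
1 = \frac{1}{\lambda} \sum_{k=1}^{K} \sum_{i=1}^{N} \bar{\psi}_{ik},
\qquad
\lambda = \sum_{i=1}^{N} \sum_{k=1}^{K} \bar{\psi}_{ik}.$$

So we plug the expression of $\lambda$ into Eq. (6.11) and get

$$\pi_j = \frac{\sum_{i=1}^{N} \bar{\psi}_{ij}}{\sum_{i=1}^{N} \sum_{k=1}^{K} \bar{\psi}_{ik}}. \tag{6.12}$$

Next we take the partial derivative with respect to $\mu_j$ and set it to 0 to get the optimal value of $\mu_j$, which only appears in the log-Gaussian $\log p(u_i \mid \theta_j)$. The other terms will vanish.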
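Only the $k = j$ term of the sum over the clusters depends on $\mu_j$, so

$$\frac{\partial Q'}{\partial \mu_j} = \frac{\partial}{\partial \mu_j} \sum_{i=1}^{N} \bar{\psi}_{ij} \int \log p(u_i \mid \theta_j)\,p(u_i \mid D_i, \theta_j^t)\,du_i = 0. \tag{6.13}$$

The distribution $p(u_i \mid \theta_j)$ is a Gaussian with probability density function

$$p(u_i \mid \theta_j) = \frac{1}{\sqrt{(2\pi)^m |C_j|}}\; e^{-\frac{(u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)}{2}}$$

and

$$\log p(u_i \mid \theta_j) = \log \frac{1}{\sqrt{(2\pi)^m |C_j|}} - \frac{(u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)}{2}.$$

Then we plug the expression of $\log p(u_i \mid \theta_j)$ into Eq. (6.13) and we get

$$\frac{\partial Q'}{\partial \mu_j} = \frac{\partial}{\partial \mu_j} \sum_{i=1}^{N} \bar{\psi}_{ij} \int \left[ \log \frac{1}{\sqrt{(2\pi)^m |C_j|}} - \frac{(u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)}{2} \right] p(u_i \mid D_i, \theta_j^t)\,du_i = 0.$$

After changing the order of the summation, the integral and the derivative, and ignoring the term that vanishes, we get

$$\frac{\partial Q'}{\partial \mu_j} = -\sum_{i=1}^{N} \bar{\psi}_{ij} \int \frac{1}{2}\, \frac{\partial\, (u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)}{\partial \mu_j}\; p(u_i \mid D_i, \theta_j^t)\,du_i = 0.$$

As mentioned in the previous problem, the derivative of $-\frac{(u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)}{2}$ with respect to $\mu_j$ is $C_j^{-1} (u_i - \mu_j)$. Thus we get

$$\frac{\partial Q'}{\partial \mu_j} = \sum_{i=1}^{N} \bar{\psi}_{ij} \int C_j^{-1} (u_i - \mu_j)\,p(u_i \mid D_i, \theta_j^t)\,du_i = \sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t)\,C_j^{-1} u_i\,du_i - \sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t)\,C_j^{-1} \mu_j\,du_i = 0.$$

Since $\mu_j$ can be moved out of the integral and since $C_j^{-1}$ is not singular and can be eliminated, we get

$$\sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t)\,u_i\,du_i = \mu_j \sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t)\,du_i$$

and

$$\mu_j = \frac{\sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t)\,u_i\,du_i}{\sum_{i=1}^{N} \bar{\psi}_{ij}}$$

since $\int p(u_i \mid D_i, \theta_j^t)\,du_i = 1$. Furthermore, we can write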
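$$p(u_i \mid D_i, \theta_j^t) = \frac{p(D_i \mid u_i, \theta_j^t)\,p(u_i \mid \theta_j^t)}{p(D_i \mid \theta_j^t)}$$

and from the quasi-dependence assumption, we have $p(D_i \mid u_i, \theta_j^t) = p(D_i \mid u_i)$. Thus we rewrite the above expression as

$$\mu_j = \frac{1}{\sum_{i=1}^{N} \bar{\psi}_{ij}} \sum_{i=1}^{N} \frac{\bar{\psi}_{ij}}{p(D_i \mid \theta_j^t)} \int p(D_i \mid u_i)\,p(u_i \mid \theta_j^t)\,u_i\,du_i.$$

Recall that we denoted $p(D_i \mid \theta_k^t)$ as $g_{ik}$, and similarly we denote

$$g'_{ik} = \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,u_i\,du_i.$$

Then we get

$$\mu_j = \frac{\sum_{i=1}^{N} \bar{\psi}_{ij}\, \dfrac{g'_{ij}}{g_{ij}}}{\sum_{i=1}^{N} \bar{\psi}_{ij}}.$$

Finally, we need to derive the last unknown parameter, $C_j$. As before, only $\log p(u_i \mid \theta_j)$ contains $C_j$. We start the derivation from the following:

$$\frac{\partial Q'}{\partial C_j} = \frac{\partial}{\partial C_j} \sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t) \log \frac{1}{\sqrt{(2\pi)^m |C_j|}}\,du_i - \frac{\partial}{\partial C_j} \sum_{i=1}^{N} \bar{\psi}_{ij}\, \frac{1}{2} \int (u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)\,p(u_i \mid D_i, \theta_j^t)\,du_i = 0$$

which is

$$-\frac{1}{2} \sum_{i=1}^{N} \bar{\psi}_{ij} \int \frac{\partial \log |C_j|}{\partial C_j}\,p(u_i \mid D_i, \theta_j^t)\,du_i - \frac{1}{2} \sum_{i=1}^{N} \bar{\psi}_{ij} \int \frac{\partial\, (u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)}{\partial C_j}\,p(u_i \mid D_i, \theta_j^t)\,du_i = 0.$$

Since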
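$$\frac{\partial \log |C_j|}{\partial C_j} = \frac{1}{|C_j|} \frac{\partial |C_j|}{\partial C_j}$$

and the derivative of a determinant is

$$\frac{\partial |C_j|}{\partial C_j} = |C_j|\, C_j^{-1},$$

the derivative of the logarithm of the determinant is

$$\frac{\partial \log |C_j|}{\partial C_j} = C_j^{-1}.$$

The derivative of the quadratic product is

$$\frac{\partial\, (u_i - \mu_j)^T C_j^{-1} (u_i - \mu_j)}{\partial C_j} = -C_j^{-1} (u_i - \mu_j)(u_i - \mu_j)^T C_j^{-1}.$$

We combine the above two equations and get

$$\frac{\partial Q'}{\partial C_j} = -\frac{1}{2} \sum_{i=1}^{N} \bar{\psi}_{ij} \left[ C_j^{-1} \int p(u_i \mid D_i, \theta_j^t)\,du_i - \int p(u_i \mid D_i, \theta_j^t)\,C_j^{-1} (u_i - \mu_j)(u_i - \mu_j)^T C_j^{-1}\,du_i \right] = 0.$$

After multiplying both sides by $C_j$ from the left and from the right and omitting nonzero constants, we get

$$\sum_{i=1}^{N} \bar{\psi}_{ij} \left[ C_j \int p(u_i \mid D_i, \theta_j^t)\,du_i - \int p(u_i \mid D_i, \theta_j^t)\,(u_i - \mu_j)(u_i - \mu_j)^T\,du_i \right] = 0$$

and finally

$$C_j = \frac{\sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t)\,(u_i - \mu_j)(u_i - \mu_j)^T\,du_i}{\sum_{i=1}^{N} \bar{\psi}_{ij} \int p(u_i \mid D_i, \theta_j^t)\,du_i}.$$

The integral in the denominator is equal to one, so we get

$$C_j = \frac{1}{\sum_{i=1}^{N} \bar{\psi}_{ij}} \sum_{i=1}^{N} \bar{\psi}_{ij}\, \frac{\int p(D_i \mid u_i)\,p(u_i \mid \theta_j^t)\,(u_i - \mu_j)(u_i - \mu_j)^T\,du_i}{p(D_i \mid \theta_j^t)}.$$

As before, we denote

$$g''_{ik} = \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,(u_i - \mu_k)(u_i - \mu_k)^T\,du_i.$$

Then we have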
$$C_j = \frac{\sum_{i=1}^{N} \bar{\psi}_{ij}\, \dfrac{g''_{ij}}{g_{ij}}}{\sum_{i=1}^{N} \bar{\psi}_{ij}}.$$

6.3. Computing $g_{ik}$, $g'_{ik}$ and $g''_{ik}$

All the computations described above are simple summations and averages of the $g_{ik}$'s, $g'_{ik}$'s and $g''_{ik}$'s. The only nontrivial computation involves $g_{ik}$, $g'_{ik}$ and $g''_{ik}$ themselves. The reason is that they contain integration. The actual method used to compute them will depend on the form in which $p(D_i \mid u_i)$ is given. If we are given samples on a regular grid (which can be efficient for low dimensionality $u_i$) we can compute $p(u_i \mid \theta_k^t)$ on the same grid and the integral becomes a discrete summation. Let ${}^{m}u_i$ be the discrete samples of $u_i$ for $m = 1..M$. We can then write

$$p(D_i \mid u_i) = \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i)\,\mathrm{Sa}({}^{m}u_i - u_i)$$

where $\mathrm{Sa}$ is the sampling function, so the integral

$$g_{ik} = \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i = \int \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i)\,\mathrm{Sa}({}^{m}u_i - u_i)\,p(u_i \mid \theta_k^t)\,du_i = \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i) \int \mathrm{Sa}({}^{m}u_i - u_i)\,p(u_i \mid \theta_k^t)\,du_i$$

is transformed into a sum of several definite integrals that can be computed analytically. If we approximate the sampling function with a zero-mean Gaussian with covariance $C_s$, the integral becomes much simpler:

$$g_{ik} = \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i) \int \mathcal{N}({}^{m}u_i - u_i;\, 0, C_s)\,p(u_i \mid \theta_k^t)\,du_i = \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i) \int \mathcal{N}({}^{m}u_i - u_i;\, 0, C_s)\,\mathcal{N}(u_i;\, \mu_k^t, C_k^t)\,du_i = \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i)\,\mathcal{N}({}^{m}u_i;\, \mu_k^t, C_k^t + C_s)$$

where $\mathcal{N}(u;\, \mu, C)$ denotes the Gaussian density with mean $\mu$ and covariance $C$, and $C_s$ is a matrix chosen so that the Gaussian approximates the sampling function. A diagonal matrix with every member of the diagonal equal to 0.4 is adequate for our purposes if the samples are on integer values of $u_i$. A similar procedure can be followed to compute the other $g$'s
$$g'_{ik} = \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i)\,\mathcal{N}({}^{m}u_i;\, \mu_k^t, C_k^t + C_s)\;{}^{m}u_i$$

$$g''_{ik} = \sum_{m=1}^{M} p(D_i \mid {}^{m}u_i)\,\mathcal{N}({}^{m}u_i;\, \mu_k^t, C_k^t + C_s)\,({}^{m}u_i - \mu_k)({}^{m}u_i - \mu_k)^T$$

It is obvious that the above procedure is not suitable for high dimensionality $u_i$'s. An alternative is that $p(D_i \mid u_i)$ is given in a standard analytic form which would allow the analytic computation of the $g$'s. There are many good candidate standard forms, but one of the most adaptable is the Weighted Sum of Gaussians (similar to the Mixture of Gaussians but without the requirement that the mixture probabilities sum up to one). Let

$$p(D_i \mid u_i) = \sum_{m=1}^{m_i} q_{im}\,\mathcal{N}(u_i;\, \mu_{im}, C_{im})$$

where $q_{im}$ is the mixture weight of Gaussian $m$ for data point $u_i$, and $\mu_{im}$ and $C_{im}$ are the $m$th mean and covariance for point $u_i$. As before, $p(u_i \mid \theta_k^t)$ is a Gaussian, which we denote as $\mathcal{N}(u_i;\, \mu_k^t, C_k^t)$. So we can write

$$g_{ik} = \int p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,du_i = \int \sum_{m=1}^{m_i} q_{im}\,\mathcal{N}(u_i;\, \mu_{im}, C_{im})\,\mathcal{N}(u_i;\, \mu_k^t, C_k^t)\,du_i = \sum_{m=1}^{m_i} q_{im} \int \mathcal{N}(u_i;\, \mu_{im}, C_{im})\,\mathcal{N}(u_i;\, \mu_k^t, C_k^t)\,du_i.$$

The product of two Gaussians is a simple function of the two sets of parameters $C_{im}$, $C_k^t$, $\mu_{im}$ and $\mu_k^t$:

$$\mathcal{N}(u_i;\, \mu_{im}, C_{im})\,\mathcal{N}(u_i;\, \mu_k^t, C_k^t) = \mathcal{N}(u_i;\, \mu_{imk}, C_{imk})\,\frac{\mathcal{N}(0;\, \mu_{im}, C_{im})\,\mathcal{N}(0;\, \mu_k^t, C_k^t)}{\mathcal{N}(0;\, \mu_{imk}, C_{imk})} = \mathcal{N}(u_i;\, \mu_{imk}, C_{imk})\,f_{imk}$$

where

$$C_{imk} = \left( C_{im}^{-1} + (C_k^t)^{-1} \right)^{-1}$$

$$\mu_{imk} = C_{imk} \left( C_{im}^{-1} \mu_{im} + (C_k^t)^{-1} \mu_k^t \right)$$

$$f_{imk} = \frac{\mathcal{N}(0;\, \mu_{im}, C_{im})\,\mathcal{N}(0;\, \mu_k^t, C_k^t)}{\mathcal{N}(0;\, \mu_{imk}, C_{imk})}$$

and since we know that the integral of a Gaussian is unity

$$g_{ik} = \sum_{m=1}^{m_i} q_{im}\,f_{imk}.$$
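The product rule above can be checked numerically with a short sketch. It is only an illustration under stated assumptions: the helper names `gauss`, `product_params` and `g_terms` and the example parameter values are hypothetical, and the check against $\mathcal{N}(\mu_{im};\, \mu_k^t, C_{im} + C_k^t)$ relies on the standard closed form for the integral of a product of two Gaussians rather than on a formula stated in the report.

```python
import numpy as np

def gauss(u, mu, C):
    """Gaussian density N(u; mu, C) at a single point u."""
    m = len(mu)
    d = u - mu
    return np.exp(-0.5 * d @ np.linalg.solve(C, d)) / np.sqrt((2 * np.pi) ** m * np.linalg.det(C))

def product_params(mu_im, C_im, mu_k, C_k):
    """Return (mu_imk, C_imk, f_imk) for N(u; mu_im, C_im) * N(u; mu_k, C_k)."""
    P_im, P_k = np.linalg.inv(C_im), np.linalg.inv(C_k)
    C_imk = np.linalg.inv(P_im + P_k)
    mu_imk = C_imk @ (P_im @ mu_im + P_k @ mu_k)
    zero = np.zeros(len(mu_im))
    # Scale factor f_imk, obtained by evaluating both sides of the product rule at u = 0.
    f_imk = gauss(zero, mu_im, C_im) * gauss(zero, mu_k, C_k) / gauss(zero, mu_imk, C_imk)
    return mu_imk, C_imk, f_imk

def g_terms(q, mus, Cs, mu_k, C_k):
    """g_ik and g'_ik for one data point i (weighted sum of Gaussians) and one cluster k."""
    g, g1 = 0.0, np.zeros(len(mu_k))
    for q_m, mu_m, C_m in zip(q, mus, Cs):
        mu_imk, _, f = product_params(mu_m, C_m, mu_k, C_k)
        g += q_m * f              # integral of the m-th product
        g1 += q_m * f * mu_imk    # first moment of the m-th product
    return g, g1

# Hypothetical parameters for one component of p(D_i | u_i) and one cluster.
mu_a, C_a = np.array([1.0, -0.5]), np.array([[0.5, 0.1], [0.1, 0.3]])
mu_b, C_b = np.array([0.2, 0.4]), np.array([[1.0, -0.2], [-0.2, 0.8]])
mu_c, C_c, f = product_params(mu_a, C_a, mu_b, C_b)
u = np.array([0.3, 0.1])
lhs = gauss(u, mu_a, C_a) * gauss(u, mu_b, C_b)
rhs = f * gauss(u, mu_c, C_c)
g, g1 = g_terms([0.6, 0.4], [mu_a, mu_a + 1.0], [C_a, C_a], mu_b, C_b)
```

Evaluating the Gaussians at the origin can underflow when the means lie far from zero; in that situation working with logarithms, or with the equivalent closed form mentioned above, would be a safer choice.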
In a very similar fashion we can derive the expressions for $\vec{g}_{ik}$ and $\bar{g}_{ik}$:

$$\vec{g}_{ik} = \int_{u_i} p(D_i \mid u_i)\,p(u_i \mid \theta_k^t)\,u_i\,du_i = \int_{u_i} \sum_{m=1}^{m_i} q_{im}\,N(u_i; \mu_{im}, C_{im})\,N(u_i; \mu_k^t, C_k^t)\,u_i\,du_i = \int_{u_i} \sum_{m=1}^{m_i} q_{im}\,N(u_i; \mu_{imk}, C_{imk})\,f_{imk}\,u_i\,du_i = \sum_{m=1}^{m_i} q_{im}\,f_{imk} \int_{u_i} N(u_i; \mu_{imk}, C_{imk})\,u_i\,du_i = \sum_{m=1}^{m_i} q_{im}\,f_{imk}\,\mu_{imk}$$

and

$$\bar{g}_{ik} = \sum_{m=1}^{m_i} q_{im}\,f_{imk}\,C_{imk}$$

6.4. Summary of the Algorithm

While the derivation is rather complicated, the steps needed to execute one iteration of the algorithm are straightforward. We follow established practice and name the two distinct groups of steps Expectation and Maximization respectively, although this association is by no means direct. The Expectation step is

$$\psi_{ik} = \frac{\pi_k^t\,g_{ik}}{\sum_j \pi_j^t\,g_{ij}}$$

and the Maximization step is

$$\pi_k = \frac{\sum_i \psi_{ik}}{\sum_i \sum_j \psi_{ij}}$$

$$\mu_k = \frac{\sum_i \psi_{ik}\,\vec{g}_{ik}/g_{ik}}{\sum_i \psi_{ik}}$$

$$C_k = \frac{\sum_i \psi_{ik}\,\bar{g}_{ik}/g_{ik}}{\sum_i \psi_{ik}}$$
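One iteration of the algorithm can be sketched for the grid-sampled case of Section 6.3. This is an illustration under stated assumptions rather than the authors' code: it is restricted to one-dimensional $u_i$, the names `normal_pdf` and `em_iteration` and the toy data are hypothetical, the Maximization step is read as $\psi$-weighted averages of the ratios $\vec{g}_{ik}/g_{ik}$ and $\bar{g}_{ik}/g_{ik}$, and the second-moment term is centred on the newly updated mean.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """1-D Gaussian density N(x; mean, var); broadcasts over arrays."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_iteration(p_D, grid, pi, mu, var, var_s=0.4):
    """One EM iteration for 1-D data known only through p(D_i | u) on a grid.

    p_D  : (N, M) array with p_D[i, m] = p(D_i | u = grid[m])
    grid : (M,) sample positions
    pi, mu, var : (K,) current mixture weights, means and variances
    var_s : variance of the Gaussian approximating the sampling function
    """
    # w[i, m, k] = p(D_i | m) * N(m; mu_k, var_k + var_s)
    w = p_D[:, :, None] * normal_pdf(grid[None, :, None],
                                     mu[None, None, :],
                                     (var + var_s)[None, None, :])
    g = w.sum(axis=1)                               # scalar term, shape (N, K)
    g_vec = (w * grid[None, :, None]).sum(axis=1)   # first-moment term, shape (N, K)

    # Expectation step: membership weights psi_ik.
    denom = (pi * g).sum(axis=1, keepdims=True)
    psi = pi * g / denom

    # Maximization step. psi * g_vec / g is computed as pi * g_vec / denom,
    # which is the same quantity without dividing by a possibly tiny g.
    weight = psi.sum(axis=0)
    pi_new = weight / psi.sum()
    mu_new = (pi * g_vec / denom).sum(axis=0) / weight

    # Second-moment term, centred on the updated means (an assumption).
    dev2 = (grid[None, :, None] - mu_new[None, None, :]) ** 2
    g_mat = (w * dev2).sum(axis=1)
    var_new = (pi * g_mat / denom).sum(axis=0) / weight
    return psi, pi_new, mu_new, var_new

# Hypothetical toy data: ten unobserved points, each known only through a
# unit-variance bump on an integer grid; two groups centred near 10 and 30.
grid = np.arange(41.0)
centers = np.array([8, 9, 10, 11, 12, 28, 29, 30, 31, 32], dtype=float)
p_D = np.exp(-0.5 * (grid[None, :] - centers[:, None]) ** 2)

pi = np.array([0.5, 0.5])
mu = np.array([5.0, 35.0])
var = np.array([25.0, 25.0])
for _ in range(30):
    psi, pi, mu, var = em_iteration(p_D, grid, pi, mu, var)
```

Because `var_s` is added to every cluster variance before the Gaussian is evaluated, the effective variance never falls below `var_s`, which keeps the weights on the grid well behaved in this toy problem.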
After we compute the parameters of all the Gaussians we can compute the $g_{ik}$, $\vec{g}_{ik}$ and $\bar{g}_{ik}$ to be used in the next step. If $p(D_i \mid u_i)$ is given in discrete form as a set of $M$ samples on a regular grid in the space spanned by $u_i$, then

$$g_{ik} = \sum_{m=1}^{M} p(D_i \mid {}^m u_i)\,N({}^m u_i; \mu_k^t, C_k^t + C_s)$$

$$\vec{g}_{ik} = \sum_{m=1}^{M} p(D_i \mid {}^m u_i)\,{}^m u_i\,N({}^m u_i; \mu_k^t, C_k^t + C_s)$$

$$\bar{g}_{ik} = \sum_{m=1}^{M} p(D_i \mid {}^m u_i)\,({}^m u_i - \mu_k)^T ({}^m u_i - \mu_k)\,N({}^m u_i; \mu_k^t, C_k^t + C_s)$$

If, on the other hand, $p(D_i \mid u_i)$ is given as a weighted sum of Gaussians, then

$$C_{imk} = \left(C_{im}^{-1} + (C_k^t)^{-1}\right)^{-1}$$

$$\mu_{imk} = C_{imk}\left(C_{im}^{-1}\mu_{im} + (C_k^t)^{-1}\mu_k^t\right)$$

$$f_{imk} = \frac{N(0; \mu_{im}, C_{im})\,N(0; \mu_k^t, C_k^t)}{N(0; \mu_{imk}, C_{imk})}$$

$$g_{ik} = \sum_{m=1}^{m_i} q_{im}\,f_{imk}$$

$$\vec{g}_{ik} = \sum_{m=1}^{m_i} q_{im}\,f_{imk}\,\mu_{imk}$$

$$\bar{g}_{ik} = \sum_{m=1}^{m_i} q_{im}\,f_{imk}\,C_{imk}$$

7. Conclusion

In this report we briefly reviewed the concepts of Clustering, Expectation Maximization and Mixture of Gaussians, and then developed the background for clustering of unobserved data under the Mixture of Gaussians model. We used Maximum Likelihood estimation to do this, and since the problem involves hidden variables (the unobserved data and the membership) we used the Expectation Maximization (EM) method. Our fundamental assumption is that the data is not directly observed but that its probability distribution is known. Each datum $u_i$ is conditioned on a set of known parameters $D_i$, and datum $u_i$ is independent of $D_{i'}$ if $i \neq i'$. We also assume that the $D_i$ are not directly dependent on the clustering parameters, which gave rise to the quasi-dependence assumption we used throughout the report. Armed with these assumptions, we developed the equations for the iterative estimation of the Gaussian clusters. If the probability distribution of the data is assumed to have a parametric form, the formulas could be simplified further, and we plan to do this in future research.
References

1. Frank Dellaert, The Expectation Maximization Algorithm, dellaert/html/publications.htm (Feb. 2002).
2. A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM Algorithm, Journal of the Royal Statistical Society B 39 (1977).
3. Geoffrey J. McLachlan and Thriyambakam Krishnan, The EM Algorithm and Extensions, Wiley-Interscience (1997).
4. Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, Wiley-Interscience (2001).
5. Athanasios Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill (1984).
6. George Bebis, Mathematical Methods Computer Vision, bebis/mathmethods/em/lecture.pdf.
7. C. Fraley and A. E. Raftery, How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis, Tech. Report.
8. Gideon Schwarz, Estimating the Dimension of a Model, The Annals of Statistics 6(2) (March 1978).