EM algorithm: updating the mixing proportions $\{\pi_k\}$


Université du Sud Toulon - Var
Master Informatique
Probabilistic Learning and Data Analysis

TD: Model-based clustering
by Faicel CHAMROUKHI

Solution

The aim of this practical work is to show how the Classification EM (CEM) algorithm for the well-known Gaussian mixture model (GMM), under some constraints, is exactly K-means, so that CEM can be viewed as a probabilistic version of K-means. An additional feature of this work is to derive the EM algorithm to estimate a mixture of regressions. This can notably serve for curve clustering, in which case the data are curves rather than vectorial data.

1 EM algorithm: updating the mixing proportions $\{\pi_k\}$

Consider the problem of finding the maximum of the function

$$Q_\pi(\pi_1,\dots,\pi_K,\Psi^{(q)}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \tau_{ik}^{(q)} \log \pi_k$$

with respect to the mixing proportions $(\pi_1,\dots,\pi_K)$ subject to the constraint $\sum_{k=1}^{K}\pi_k = 1$, where the $\tau_{ik}^{(q)}$ are the posterior probabilities at the $q$th iteration of EM. To perform this constrained maximization, introduce the Lagrange multiplier $\lambda$ and derive the resulting unconstrained maximization problem (the Lagrangian function). To maximize the Lagrangian with respect to $\pi_k$ ($k = 1,\dots,K$), first set the derivative of the Lagrangian with respect to $\pi_k$ to zero, determine the Lagrange multiplier $\lambda$, and then the resulting value $\pi_k^{(q+1)}$ ($k = 1,\dots,K$) that corresponds to the maximum (the updating formula for the mixing proportions $\pi_k$, $k = 1,\dots,K$).

Solution

We maximize $Q_\pi(\pi_1,\dots,\pi_K,\Psi^{(q)}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \tau_{ik}^{(q)} \log \pi_k$ with respect to the mixing proportions $(\pi_1,\dots,\pi_K)$ subject to the constraint $\sum_{k=1}^{K}\pi_k = 1$. To perform this constrained maximization, we introduce the Lagrange multiplier $\lambda$ such that the resulting Lagrangian function is given by:

$$L(\pi_1,\dots,\pi_K) = \sum_{i=1}^{n}\sum_{k=1}^{K} \tau_{ik}^{(q)} \log \pi_k + \lambda\Big(1 - \sum_{k=1}^{K}\pi_k\Big). \tag{1}$$

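Taking the derivatives of the Lagrangian with respect to $\pi_k$ for $k = 1,\dots,K$, we obtain:

$$\frac{\partial L(\pi_1,\dots,\pi_K)}{\partial \pi_k} = \frac{\sum_{i=1}^{n}\tau_{ik}^{(q)}}{\pi_k} - \lambda, \quad \forall k \in \{1,\dots,K\}. \tag{2}$$

Then, setting these derivatives to zero yields:

$$\frac{\sum_{i=1}^{n}\tau_{ik}^{(q)}}{\pi_k} = \lambda, \quad \forall k \in \{1,\dots,K\}. \tag{3}$$

By multiplying each side of (3) by $\pi_k$ ($k = 1,\dots,K$) and summing over $k$ we get

$$\sum_{k=1}^{K} \pi_k \, \frac{\sum_{i=1}^{n}\tau_{ik}^{(q)}}{\pi_k} = \lambda \sum_{k=1}^{K} \pi_k \tag{4}$$

which implies that $\lambda = n$, since $\sum_{k=1}^{K}\tau_{ik}^{(q)} = 1$ for every $i$ and $\sum_{k=1}^{K}\pi_k = 1$. Finally, from (3) we get the updating formula for the mixing proportions $\pi_k$, that is

$$\pi_k^{(q+1)} = \frac{\sum_{i=1}^{n}\tau_{ik}^{(q)}}{\lambda} = \frac{\sum_{i=1}^{n}\tau_{ik}^{(q)}}{n}, \quad \forall k \in \{1,\dots,K\}. \tag{5}$$

2 CEM clustering as a probabilistic view for K-means clustering

Given an i.i.d. data set $(x_1,\dots,x_n)$, the aim is to automatically find a partition into $K$ clusters. Let us denote by $z = (z_1,\dots,z_n)$ the corresponding unknown cluster labels, where $z_i \in \{1,\dots,K\}$ denotes the cluster label of $x_i$. To achieve this clustering task, we have seen that several algorithms can be used, namely K-means, EM or CEM with GMMs, etc. Here we will consider the Classification EM (CEM) algorithm and the K-means algorithm. CEM is the classification version of EM. Let us recall that K-means minimizes the following distortion measure

$$J(\mu_1,\dots,\mu_K, z) = \sum_{k=1}^{K} \sum_{i \,|\, z_i = k} \|x_i - \mu_k\|^2 \tag{6}$$

simultaneously w.r.t. the cluster centres $(\mu_1,\dots,\mu_K)$ and the cluster labels $z$. CEM for the GMM

$$p(x_i;\Psi) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i;\mu_k,\Sigma_k)$$

however maximizes the complete-data log-likelihood $L_c(\Psi, z)$ simultaneously w.r.t. the Gaussian mixture parameters $\Psi = (\pi_1,\dots,\pi_K,\mu_1,\dots,\mu_K,\Sigma_1,\dots,\Sigma_K)$ and the cluster labels $z$.

1. Derive the expression of the optimized complete-data log-likelihood.
2. Show that, under the following constraints, maximizing $L_c$ by using CEM is equivalent to minimizing $J$ by using K-means:
- $\pi_k = \frac{1}{K} \ \forall k$ (same proportions)
- $\Sigma_k = \sigma^2 I \ \forall k$ (identical isotropic covariance matrices)

We recall that the multivariate Gaussian density $\mathcal{N}(x_i;\mu_k,\Sigma_k)$ is given by:

$$\mathcal{N}(x_i;\mu_k,\Sigma_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\Big(-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)\Big)$$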
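Before turning to the solution of this second exercise, the update formula (5) of Section 1 can be checked numerically. The following is a minimal sketch, not part of the original handout: the function names `update_mixing_proportions` and `q_pi` are hypothetical, and the posterior probabilities are simulated rather than produced by a real E-step. It compares the value of $Q_\pi$ at the closed-form update with its value at randomly drawn points of the simplex.

```python
import numpy as np

def update_mixing_proportions(tau):
    """Update (5): pi_k = (1/n) * sum_i tau_ik, for tau of shape (n, K)."""
    tau = np.asarray(tau, dtype=float)
    return tau.sum(axis=0) / tau.shape[0]

def q_pi(pi, tau):
    """Objective Q_pi = sum_i sum_k tau_ik * log(pi_k)."""
    return float(np.sum(tau * np.log(pi)))

rng = np.random.default_rng(0)

# Simulated posteriors (an assumption for illustration): each row sums to 1.
tau = rng.dirichlet(np.ones(3), size=50)

pi_new = update_mixing_proportions(tau)

# Evaluate Q_pi at many random points satisfying the constraint sum_k pi_k = 1.
candidates = rng.dirichlet(np.ones(3), size=1000)
best_random = max(q_pi(p, tau) for p in candidates)

print("closed-form update:", pi_new)
print("Q at update:", q_pi(pi_new, tau), " best random Q:", best_random)
```

No random candidate should exceed the closed-form value, which is what the Lagrangian argument predicts.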

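Solution

The complete-data log-likelihood for a GMM, which is the criterion optimized by the CEM algorithm, is given by:

\begin{align}
L_c(\Psi, z) &= \log p\big((x_1,z_1),\dots,(x_n,z_n);\Psi\big) = \log \prod_{i=1}^{n} p(x_i, z_i;\Psi) \nonumber\\
&= \sum_{i=1}^{n} \log \prod_{k=1}^{K} \big[p(z_i = k)\, p(x_i \mid z_i = k;\Psi_k)\big]^{z_{ik}} \nonumber\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik} \log\big[\pi_k \, p(x_i \mid z_i = k;\Psi_k)\big] \nonumber\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik} \log\big[\pi_k \, \mathcal{N}(x_i;\mu_k,\Sigma_k)\big] \tag{7}
\end{align}

where $\Psi_k = (\mu_k, \Sigma_k)$ and $z_{ik} = 1$ if $z_i = k$ and $z_{ik} = 0$ otherwise.

To show that a particular case of CEM yields the K-means algorithm, consider the spherical Gaussian mixture model where the classes have the same isotropic covariance matrix $\Sigma_k = \sigma^2 I$ and the same mixing proportions $\pi_k = \frac{1}{K}$. Then, the complete-data log-likelihood (7) takes the following form:

\begin{align}
L_c(\Psi, z) &= \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik} \log\big[\pi_k \, \mathcal{N}(x_i;\mu_k,\Sigma_k)\big] \nonumber\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik} \log\Big[\pi_k \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\Big(-\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\Big)\Big] \tag{8}\\
&= \sum_{i,k} z_{ik}\log\pi_k - \frac{d}{2}\sum_{i,k} z_{ik}\log 2\pi - \frac{1}{2}\sum_{i,k} z_{ik}\log|\Sigma_k| - \frac{1}{2}\sum_{i,k} z_{ik}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k) \tag{9}\\
&= \sum_{i,k} z_{ik}\log\frac{1}{K} - \frac{nd}{2}\log 2\pi - \frac{1}{2}\sum_{i,k} z_{ik}\log|\sigma^2 I| - \frac{1}{2}\sum_{i,k} z_{ik}(x_i-\mu_k)^T(\sigma^2 I)^{-1}(x_i-\mu_k) \tag{10}\\
&= -n\log K - \frac{nd}{2}\log 2\pi - \frac{1}{2}\sum_{i,k} z_{ik}\log(\sigma^{2d}) - \frac{1}{2}\sum_{i,k} z_{ik}(x_i-\mu_k)^T\Big(\frac{1}{\sigma^2} I\Big)(x_i-\mu_k) \tag{11}\\
&= -n\log K - \frac{nd}{2}\log 2\pi - \frac{nd}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i,k} z_{ik}(x_i-\mu_k)^T(x_i-\mu_k) \tag{12}
\end{align}

where $\sum_{i,k}$ stands for $\sum_{i=1}^{n}\sum_{k=1}^{K}$ and where we used:
- $\sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik} = n$ (from (9) to (10), from (10) to (11) and from (11) to (12))
- $|\sigma^2 I_d| = \sigma^{2d}$ (from (10) to (11))
- $(\sigma^2 I_d)^{-1} = \frac{1}{\sigma^2} I_d$ (from (10) to (11))

Since $\sigma^2$ is constant (we are not maximizing w.r.t. it), maximizing the complete-data log-likelihood (12) w.r.t. the means and the cluster labels $(\mu_1,\dots,\mu_K, z)$ is therefore equivalent to maximizing

$$-\sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}(x_i-\mu_k)^T(x_i-\mu_k) = -\sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\|x_i-\mu_k\|^2 = -\sum_{k=1}^{K}\sum_{i \,|\, z_i = k} \|x_i-\mu_k\|^2 \tag{13}$$

which is also equivalent to minimizing w.r.t. the means and the cluster labels $(\mu_1,\dots,\mu_K, z)$ the criterion

$$J(\mu_1,\dots,\mu_K, z) = \sum_{k=1}^{K}\sum_{i \,|\, z_i = k} \|x_i-\mu_k\|^2 \tag{14}$$

which is none other than the distortion criterion (6) minimized by the K-means algorithm.

3 EM for mixture of polynomial regressions

The aim here is to cluster $n$ i.i.d. curves $((x_1,y_1),\dots,(x_n,y_n))$ into $K$ clusters using a mixture of regressions and EM. Each curve consists of $m$ observations $y_i = (y_{i1},\dots,y_{im})$ regularly observed at the inputs $x_i = (x_{i1},\dots,x_{im})$ for all $i = 1,\dots,n$ (e.g., $x$ may represent the sampling time in a temporal context).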
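Returning briefly to the equivalence established in Section 2, identity (12) can be verified numerically. The sketch below is an illustration under stated assumptions, not part of the original handout: the data, means and labels are random toy values, and the helper names (`distortion`, `log_gauss_iso`, `complete_loglik`) are hypothetical. It checks that, with $\pi_k = 1/K$ and $\Sigma_k = \sigma^2 I$, the complete-data log-likelihood equals a constant minus $J/(2\sigma^2)$, and that the classification step of CEM assigns each point to its nearest centre, as the K-means assignment step does.

```python
import numpy as np

def distortion(X, mu, z):
    """K-means criterion J: sum of squared distances to the assigned centres."""
    return float(np.sum((X - mu[z]) ** 2))

def log_gauss_iso(X, mu, sigma2):
    """log N(x_i; mu_k, sigma2 * I) for all pairs (i, k); returns shape (n, K)."""
    d = X.shape[1]
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * d * np.log(2 * np.pi * sigma2) - 0.5 * sq / sigma2

def complete_loglik(X, mu, z, sigma2):
    """L_c under the constraints pi_k = 1/K and Sigma_k = sigma2 * I."""
    K = mu.shape[0]
    logp = np.log(1.0 / K) + log_gauss_iso(X, mu, sigma2)
    return float(logp[np.arange(X.shape[0]), z].sum())

# Toy values (assumptions for illustration only).
rng = np.random.default_rng(1)
n, d, K, sigma2 = 200, 2, 3, 0.7
X = rng.normal(size=(n, d))
mu = rng.normal(size=(K, d))
z = rng.integers(0, K, size=n)

J = distortion(X, mu, z)
Lc = complete_loglik(X, mu, z, sigma2)
const = -n * np.log(K) - 0.5 * n * d * np.log(2 * np.pi * sigma2)
print("L_c:", Lc, " const - J/(2 sigma^2):", const - J / (2 * sigma2))

# CEM classification step versus K-means assignment step.
z_cem = np.argmax(np.log(1.0 / K) + log_gauss_iso(X, mu, sigma2), axis=1)
z_kmeans = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
print("same assignments:", np.array_equal(z_cem, z_kmeans))
```

The constant collects the first three terms of (12), which depend on neither the means nor the labels.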

Model definition

The polynomial regression mixture model arises when we assume that each class $k$ of curves has a prior probability $\alpha_k$ and generates a curve according to a polynomial function (with polynomial coefficients $\beta_k$) corrupted by a zero-mean Gaussian noise with a variance $\sigma_k^2$:

$$y_i = X_i \beta_k + \epsilon_{ik} \tag{15}$$

where $y_i$ is an $m \times 1$ curve, $X_i$ is the $m \times (p+1)$ regression matrix (Vandermonde matrix) with rows $(1, x_{ij}, x_{ij}^2, \dots, x_{ij}^p)$, $p$ being the order of the polynomial, $\beta_k = (\beta_{k0},\dots,\beta_{kp})^T$ is the vector of regression coefficients for class $k$ and $\epsilon_{ik} \sim \mathcal{N}(0, \sigma_k^2 I_m)$ is its corresponding Gaussian noise.

1. From (15), derive the corresponding density for the observed curve $y_i$ given $x_i$ (mixture of polynomial regressions).
2. Derive the EM algorithm for estimating the model parameters $\Psi = (\alpha_1,\dots,\alpha_K,\beta_1,\dots,\beta_K,\sigma_1^2,\dots,\sigma_K^2)$ and provide the updating formula for $\Psi$.

Solution:

The conditional mixture density of a curve $y_i$ ($i = 1,\dots,n$) can be written as:

$$p(y_i \mid x_i;\Psi) = \sum_{k=1}^{K} p(z_i = k)\, p(y_i \mid x_i, z_i = k;\beta_k,\sigma_k^2) = \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(y_i; X_i\beta_k, \sigma_k^2 I_m), \tag{16}$$

where $\Psi = (\alpha_1,\dots,\alpha_K,\theta_1,\dots,\theta_K)$ with $\theta_k = (\beta_k,\sigma_k^2)$, $\sigma_k^2$ being the noise variance for cluster $k$. The unknown parameter vector $\Psi$ is estimated by maximum likelihood via the EM algorithm.

Parameter estimation via the EM algorithm

Given an i.i.d. training set of $n$ curves $Y = (y_1,\dots,y_n)$ regularly observed for the inputs $X = (x_1,\dots,x_n)$, the log-likelihood of $\Psi$ is given by:

$$L(\Psi) = \log \prod_{i=1}^{n} p(y_i \mid x_i;\Psi) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(y_i; X_i\beta_k, \sigma_k^2 I_m). \tag{17}$$

The log-likelihood is maximized by the EM algorithm. Before giving the EM steps, note that the complete-data log-likelihood is given by:

$$L_c(\Psi) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\log\alpha_k + \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\log \mathcal{N}(y_i; X_i\beta_k, \sigma_k^2 I_m) \tag{18}$$

where $z = (z_1,\dots,z_n)$ is the vector of cluster labels for the $n$ curves and $z_{ik}$ is a binary indicator variable such that $z_{ik} = 1$ if $z_i = k$ (i.e., if $y_i$ is generated by cluster $k$) and $z_{ik} = 0$ otherwise.
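The regression matrices $X_i$ and the log-likelihood (17) can be written down directly. The following is a minimal sketch, not part of the original handout: the function names (`design_matrices`, `log_weighted_densities`, `log_likelihood`) and the toy values are hypothetical. It builds the Vandermonde matrices for all curves at once and evaluates (17) with a log-sum-exp for numerical stability.

```python
import numpy as np

def design_matrices(x, p):
    """Vandermonde matrices X_i stacked in an array of shape (n, m, p+1).

    Row j of X_i is (1, x_ij, x_ij^2, ..., x_ij^p).
    """
    x = np.asarray(x, dtype=float)
    return x[..., None] ** np.arange(p + 1)

def log_weighted_densities(Y, X, alpha, beta, sigma2):
    """log(alpha_k * N(y_i; X_i beta_k, sigma2_k I_m)) for all (i, k); shape (n, K)."""
    alpha, sigma2 = np.asarray(alpha, float), np.asarray(sigma2, float)
    m = Y.shape[1]
    means = np.einsum("imj,kj->ikm", X, beta)          # X_i beta_k, shape (n, K, m)
    sq = ((Y[:, None, :] - means) ** 2).sum(axis=2)    # squared residual norms
    return np.log(alpha) - 0.5 * m * np.log(2 * np.pi * sigma2) - 0.5 * sq / sigma2

def log_likelihood(Y, X, alpha, beta, sigma2):
    """Log-likelihood (17), computed with a log-sum-exp over the K components."""
    logp = log_weighted_densities(Y, X, alpha, beta, sigma2)
    top = logp.max(axis=1, keepdims=True)
    return float(np.sum(top[:, 0] + np.log(np.exp(logp - top).sum(axis=1))))

# Toy check (assumed values): n = 2 flat curves, m = 3 points, K = 2, p = 2.
x = np.tile(np.linspace(0.0, 1.0, 3), (2, 1))
X = design_matrices(x, p=2)
Y = np.zeros((2, 3))
alpha = np.array([0.5, 0.5])
beta = np.zeros((2, 3))
sigma2 = np.ones(2)
ll = log_likelihood(Y, X, alpha, beta, sigma2)
print(X[0])
print(ll)
```

In this toy case both components are identical standard Gaussians centred on the zero curve, so the log-likelihood reduces to $-\frac{nm}{2}\log 2\pi = -3\log 2\pi$.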
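The EM algorithm for polynomial regression mixtures (PRMs) starts with an initial parameter vector $\Psi^{(0)}$ and alternates between the two following steps until convergence:

E-step: Compute the expected complete-data log-likelihood given the curves $Y$, the inputs $X$ and the current value of the parameter $\Psi$, denoted by $\Psi^{(q)}$:

\begin{align}
Q(\Psi,\Psi^{(q)}) &= E\big[L_c(\Psi) \mid Y, X;\Psi^{(q)}\big] \nonumber\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} E\big[z_{ik} \mid Y, X;\Psi^{(q)}\big]\log\alpha_k + \sum_{i=1}^{n}\sum_{k=1}^{K} E\big[z_{ik} \mid Y, X;\Psi^{(q)}\big]\log \mathcal{N}(y_i; X_i\beta_k, \sigma_k^2 I_m) \nonumber\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \tau_{ik}^{(q)}\log\alpha_k + \sum_{i=1}^{n}\sum_{k=1}^{K} \tau_{ik}^{(q)}\log \mathcal{N}(y_i; X_i\beta_k, \sigma_k^2 I_m) \tag{19}
\end{align}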

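where

$$\tau_{ik}^{(q)} = p(z_i = k \mid y_i, x_i;\Psi^{(q)}) = \frac{\alpha_k^{(q)} \, \mathcal{N}\big(y_i; X_i\beta_k^{(q)}, \sigma_k^{2(q)} I_m\big)}{\sum_{l=1}^{K} \alpha_l^{(q)} \, \mathcal{N}\big(y_i; X_i\beta_l^{(q)}, \sigma_l^{2(q)} I_m\big)} \tag{20}$$

is the posterior probability that the curve $y_i$ is generated by cluster $k$. This step therefore only requires the computation of the posterior cluster probabilities $\tau_{ik}^{(q)}$ ($i = 1,\dots,n$) for each of the $K$ clusters.

M-step: Compute the update $\Psi^{(q+1)}$ for $\Psi$ by maximizing the Q-function (19) with respect to $\Psi$. The two terms of the Q-function are maximized separately. The first term, that is the function $\sum_{i=1}^{n}\sum_{k=1}^{K} \tau_{ik}^{(q)}\log\alpha_k$, is maximized with respect to $(\alpha_1,\dots,\alpha_K)$ subject to the constraint $\sum_{k=1}^{K}\alpha_k = 1$ using Lagrange multipliers, which gives the following updates:

$$\alpha_k^{(q+1)} = \frac{1}{n}\sum_{i=1}^{n}\tau_{ik}^{(q)} \quad (k = 1,\dots,K). \tag{21}$$

The second term of (19) can also be decomposed as a sum of $K$ functions of $(\beta_k,\sigma_k^2)$, so that $K$ separate maximizations can be performed. The maximization of each of the $K$ functions, that is $\sum_{i=1}^{n} \tau_{ik}^{(q)}\log \mathcal{N}(y_i; X_i\beta_k, \sigma_k^2 I_m)$, corresponds therefore to solving a weighted least-squares problem. The solution of this problem is straightforward and is given by:

$$\beta_k^{(q+1)} = \big(X^T W_k^{(q)} X\big)^{-1} X^T W_k^{(q)} y \tag{22}$$

$$\sigma_k^{2(q+1)} = \frac{1}{m\sum_{i=1}^{n}\tau_{ik}^{(q)}} \big(y - X\beta_k^{(q+1)}\big)^T W_k^{(q)} \big(y - X\beta_k^{(q+1)}\big) \tag{23}$$

where $X = (X_1^T,\dots,X_n^T)^T$ is the $nm \times (p+1)$ matrix obtained by stacking the regression matrices, the vector $y$ is an $nm \times 1$ vector composed of the $n$ curves by stacking them one curve after another, that is $y = (y_1^T,\dots,y_n^T)^T$, and $W_k^{(q)}$ is the $nm \times nm$ diagonal matrix whose diagonal elements are

$$\big(\underbrace{\tau_{1k}^{(q)},\dots,\tau_{1k}^{(q)}}_{m \text{ times}},\dots,\underbrace{\tau_{nk}^{(q)},\dots,\tau_{nk}^{(q)}}_{m \text{ times}}\big).$$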
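The E-step (20) and the M-step updates (21) to (23) can be put together as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `em_prm`, the initialization (each $\beta_k$ fitted to one randomly drawn curve, equal proportions, a common initial variance), the stopping rule and the simulated toy curves are all choices made here for illustration, and there is no safeguard against empty clusters.

```python
import numpy as np

def em_prm(x, Y, K, p, n_iter=200, tol=1e-8, seed=0):
    """EM for a mixture of K polynomial regressions of order p.

    x, Y: arrays of shape (n, m) holding the inputs and the curves.
    Returns alpha, beta, sigma2, tau and the log-likelihood history.
    """
    rng = np.random.default_rng(seed)
    x, Y = np.asarray(x, float), np.asarray(Y, float)
    n, m = Y.shape
    X = x[..., None] ** np.arange(p + 1)                 # X_i, shape (n, m, p+1)
    Xs, ys = X.reshape(n * m, p + 1), Y.reshape(n * m)   # stacked X and y

    # Initialization (an assumption): one randomly drawn curve per cluster.
    idx = rng.choice(n, size=K, replace=False)
    beta = np.stack([np.linalg.lstsq(X[i], Y[i], rcond=None)[0] for i in idx])
    alpha = np.full(K, 1.0 / K)
    sigma2 = np.full(K, Y.var())

    history = []
    for _ in range(n_iter):
        # E-step: posterior probabilities (20), computed in the log domain.
        means = np.einsum("imj,kj->ikm", X, beta)
        sq = ((Y[:, None, :] - means) ** 2).sum(axis=2)
        logp = np.log(alpha) - 0.5 * m * np.log(2 * np.pi * sigma2) - 0.5 * sq / sigma2
        top = logp.max(axis=1, keepdims=True)
        lse = top + np.log(np.exp(logp - top).sum(axis=1, keepdims=True))
        tau = np.exp(logp - lse)
        history.append(float(lse.sum()))                 # log-likelihood (17)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol * abs(history[-2]):
            break

        # M-step: updates (21), (22) and (23).
        alpha = tau.mean(axis=0)
        for k in range(K):
            w = np.repeat(tau[:, k], m)                  # diagonal of W_k
            Xw = Xs * w[:, None]                         # W_k X
            beta[k] = np.linalg.solve(Xs.T @ Xw, Xw.T @ ys)
            r = ys - Xs @ beta[k]
            sigma2[k] = (w * r ** 2).sum() / (m * tau[:, k].sum())
    return alpha, beta, sigma2, tau, history

# Simulated toy curves (an assumption): two polynomial shapes plus Gaussian noise.
rng = np.random.default_rng(42)
x = np.tile(np.linspace(0.0, 1.0, 20), (30, 1))
truth = np.repeat([0, 1], 15)
clean = np.where(truth[:, None] == 0, 1 + 2 * x, -1 - 2 * x + 3 * x ** 2)
Y = clean + 0.1 * rng.normal(size=x.shape)

alpha, beta, sigma2, tau, history = em_prm(x, Y, K=2, p=2)
print("proportions:", alpha)
print("coefficients:\n", beta)
print("log-likelihood: first", history[0], "last", history[-1])
```

The log-likelihood recorded in `history` should never decrease from one iteration to the next, which is a convenient check when adapting the sketch. A hard partition of the curves can be read off with `tau.argmax(axis=1)`.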
