Exemplifying the application of hierarchical agglomerative clustering (single-, complete- and average-linkage)
Clustering
1. Exemplifying the application of hierarchical agglomerative clustering (single-, complete- and average-linkage). CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, HW4, pr. 2.a, extended by Liviu Ciortuz
2. The table below is a distance matrix for 6 objects, A, B, C, D, E, F. Show the final result of hierarchical clustering with single-, complete- and average-linkage by drawing the corresponding dendrograms.
Solution. Single-linkage: at each step the two closest clusters are merged, with d(X,Y) taken as the minimum over all cross-cluster pairs. The merge order is: {A,B}, then {C,D}, then {A,B} with {C,D}, then F joins (giving {A,B,C,D,F}), and finally E. The dendrogram has leaf order A, B, C, D, F, E.
Complete-linkage: here d(X,Y) is the maximum over all cross-cluster pairs. The merge order is: {A,B}, then {C,D}, then F joins {A,B} (giving {A,B,F}), then E joins {C,D} (giving {C,D,E}), and finally the two groups merge. The dendrogram has leaf order A, B, F, C, D, E.
Average-linkage: here d(X,Y) is the average of all cross-cluster distances. The merge order is: {A,B}, then {C,D}, then {A,B} with {C,D}, with the intermediate distances computed as
d(AB,C) = (d(A,C) + d(B,C))/2, d(AB,D) = (d(A,D) + d(B,D))/2, d(AB,E) = (d(A,E) + d(B,E))/2, d(AB,F) = (d(A,F) + d(B,F))/2,
d(AB,CD) =(*) (2 d(AB,C) + 2 d(AB,D))/4, d(CD,E) = (d(C,E) + d(D,E))/2, d(CD,F) = (d(C,F) + d(D,F))/2,
d(ABCD,E) =(*) (2 d(AB,E) + 2 d(CD,E))/4, d(ABCD,F) =(*) (2 d(AB,F) + 2 d(CD,F))/4.
Average-linkage (cont'd): then F joins {A,B,C,D}, and finally E, with
d(ABCDF,E) =(*) (4 d(ABCD,E) + d(F,E))/5.
The dendrogram has leaf order A, B, C, D, F, E.
Note: For the proof of the (*) relation, see problem CMU, 2010 fall, Aarti Singh, HW3, pr. 4.2:
d(X∪Y, Z) = (|X| d(X,Z) + |Y| d(Y,Z)) / (|X| + |Y|).
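The three linkage rules above can be exercised with a short, self-contained routine. Since the numeric entries of the homework's distance matrix are not reproduced here, the matrix below is an illustrative stand-in (its values are assumptions, not the original data), and the `agglomerate` helper is likewise written just for this sketch.

```python
from itertools import combinations

def agglomerate(D, linkage):
    """Naive agglomerative clustering over a symmetric distance dict.
    D maps frozenset({a, b}) -> distance; linkage is 'single',
    'complete' or 'average'. Returns the list of merges performed,
    each as (sorted member list, linkage distance at the merge)."""
    clusters = {frozenset([p]) for pair in D for p in pair}
    merges = []
    while len(clusters) > 1:
        def dist(X, Y):
            # all cross-cluster point distances, combined per the linkage rule
            ds = [D[frozenset({x, y})] for x in X for y in Y]
            return {'single': min, 'complete': max,
                    'average': lambda v: sum(v) / len(v)}[linkage](ds)
        # pick and merge the closest pair of clusters
        X, Y = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((sorted(X | Y), dist(X, Y)))
        clusters = (clusters - {X, Y}) | {X | Y}
    return merges

# Illustrative stand-in distance matrix over objects A..F
vals = {('A','B'): 2, ('A','C'): 7, ('A','D'): 9, ('A','E'): 10, ('A','F'): 3,
        ('B','C'): 6, ('B','D'): 8, ('B','E'): 9,  ('B','F'): 4,
        ('C','D'): 1, ('C','E'): 5, ('C','F'): 6,
        ('D','E'): 4, ('D','F'): 7, ('E','F'): 8}
D = {frozenset(k): v for k, v in vals.items()}
for link in ('single', 'complete', 'average'):
    print(link, agglomerate(D, link))
```

For real use, `scipy.cluster.hierarchy.linkage` implements the same rules far more efficiently.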
7. Exemplifying the application of hierarchical divisive clustering and the relationship between single-linkage hierarchies and Minimum Spanning Trees (MSTs). CMU, 2009 spring, Ziv Bar-Joseph, final exam, pr. 9.3
Hierarchical clustering may be bottom-up or top-down. In this problem we will see whether a top-down clustering algorithm can be exactly analogous to a bottom-up clustering algorithm. Consider the following top-down clustering algorithm:
1. Calculate the pairwise distance d(P_i, P_j) between every two objects P_i and P_j in the set of objects to be clustered, and build a complete graph on the set of objects, with edge weights being the corresponding distances.
2. Generate the Minimum Spanning Tree of the graph, i.e. choose the subset of edges E' with minimum sum of weights such that G' = (P, E') is a single connected tree.
3. Throw out the edge with the heaviest weight to generate two disconnected trees corresponding to the top-level clusters.
4. Repeat the previous step recursively on the lower-level clusters to generate a top-down clustering of the set of n objects.
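Steps 1-4 above can be sketched in a few lines, assuming the Euclidean dataset from the solution slides (P1(1,2) through P6(12,12)); the helper names `kruskal_mst` and `divisive_clusters` are invented for this illustration.

```python
import math

def kruskal_mst(points):
    """Kruskal's algorithm on the complete Euclidean graph over `points`
    (a dict name -> (x, y)). Returns MST edges in the order added."""
    parent = {p: p for p in points}
    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p
    names = sorted(points)
    edges = sorted((math.dist(points[a], points[b]), a, b)
                   for i, a in enumerate(names) for b in names[i + 1:])
    mst = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                        # edge joins two components
            parent[ra] = rb
            mst.append((w, a, b))
    return mst

def divisive_clusters(points, k):
    """Top-down clustering: drop the k-1 heaviest MST edges and return
    the connected components (this reproduces single-linkage clusters)."""
    mst = kruskal_mst(points)
    kept = sorted(mst)[:len(mst) - (k - 1)]   # discard the heaviest k-1 edges
    comp = {p: p for p in points}
    def find(p):
        while comp[p] != p:
            p = comp[p]
        return p
    for _, a, b in kept:
        comp[find(a)] = find(b)
    groups = {}
    for p in points:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())

# The dataset of the exercise (coordinates as in the solution slides)
pts = {'P1': (1, 2), 'P2': (2, 2), 'P3': (3, 6),
       'P4': (6, 4), 'P5': (6, 6), 'P6': (12, 12)}
print(divisive_clusters(pts, 2))   # [['P1', 'P2', 'P3', 'P4', 'P5'], ['P6']]
```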
a. Apply this algorithm on the dataset given in the nearby table, using the Euclidean distance.
b. Does this top-down algorithm perform analogously to any bottom-up algorithm that you have encountered in class? Why?
Point (x, y): P1 (1,2), P2 (2,2), P3 (3,6), P4 (6,4), P5 (6,6), P6 (12,12)
Solution: a.
Kruskal's algorithm adds the edges: 1. (P1,P2), cost 1; 2. (P4,P5), cost 2; 3. (P3,P5), cost 3; 4. (P2,P3), cost √17; 5. (P5,P6), cost 6√2.
Prim's algorithm adds the edges: 1. (P1,P2), cost 1; 2. (P2,P3), cost √17; 3. (P3,P5), cost 3; 4. (P5,P4), cost 2; 5. (P5,P6), cost 6√2.
[Figure: the six points plotted in the (x, y) plane, with the MST edges drawn.]
[Figure: the resulting dendrogram over P1 (1,2), P2 (2,2), P3 (3,6), P4 (6,4), P5 (6,6), P6 (12,12), with internal nodes at the successive merge costs.]
Note: If there is only one MST for the given dataset, then both Kruskal's and Prim's algorithms will find it. Otherwise, the two algorithms can produce different results. One can see (both on this dataset and also in general) that Kruskal's algorithm is exactly analogous to the single-linkage bottom-up clustering algorithm. Therefore, there is indeed a bottom-up equivalent of the top-down clustering algorithm presented in this exercise.
13. Hierarchical [bottom-up] clustering: Ward's metric. CMU, 2010 fall, Aarti Singh, HW3, pr. 4.1
In this problem you will analyze an alternative approach to quantifying the distance between two disjoint clusters, proposed by Joe H. Ward in 1963. We will call it Ward's metric. Ward's metric simply says that the distance between two disjoint clusters, X and Y, is how much the sum of squares will increase when we merge them. More formally,
Δ(X,Y) = Σ_{i∈X∪Y} ||x_i − µ_{X∪Y}||² − Σ_{i∈X} ||x_i − µ_X||² − Σ_{i∈Y} ||x_i − µ_Y||²   (1)
where µ_C is the centroid of cluster C and x_i is a data point. By definition, here we will consider µ_X = (1/n_X) Σ_{i∈X} x_i, where n_X is the number of elements in X.
Δ(X,Y) can be thought of as the merging cost of combining clusters X and Y into one cluster. That is, in agglomerative clustering, when Ward's metric is used as the closeness measure, the two clusters with the lowest merging cost are merged.
a. Can you reduce the formula in equation (1) for Δ(X,Y) to a simpler form? Give the simplified formula.
Hint: Your formula should be in terms of the cluster sizes (let's denote them n_X and n_Y) and the squared distance ||µ_X − µ_Y||² between the cluster centroids µ_X and µ_Y only.
Solution:
Δ(X,Y) = Σ_{i∈X∪Y} ||x_i − µ_{X∪Y}||² − Σ_{i∈X} ||x_i − µ_X||² − Σ_{i∈Y} ||x_i − µ_Y||²
Using µ_{X∪Y} = (n_X µ_X + n_Y µ_Y)/(n_X + n_Y) (relation (3) below) and the identity Σ_{i∈C} ||x_i − µ_C||² = Σ_{i∈C} ||x_i||² − n_C ||µ_C||² (which follows from relation (2)), the Σ ||x_i||² terms cancel and we get
Δ(X,Y) = n_X ||µ_X||² + n_Y ||µ_Y||² − ||n_X µ_X + n_Y µ_Y||² / (n_X + n_Y)
= [ (n_X + n_Y)(n_X ||µ_X||² + n_Y ||µ_Y||²) − ||n_X µ_X + n_Y µ_Y||² ] / (n_X + n_Y)
= n_X n_Y ( ||µ_X||² − 2 µ_X · µ_Y + ||µ_Y||² ) / (n_X + n_Y)
= (n_X n_Y / (n_X + n_Y)) ||µ_X − µ_Y||².
Three explicative notes:
µ_X := (1/|X|) Σ_{x_i∈X} x_i  ⟹  Σ_{x_i∈X} x_i = |X| µ_X = n_X µ_X.   (2)
µ_{X∪Y} := (1/|X∪Y|) Σ_{x_i∈X∪Y} x_i = (since X∩Y = ∅) = (1/(|X|+|Y|)) (Σ_{x_i∈X} x_i + Σ_{y_i∈Y} y_i) =(2)= (|X| µ_X + |Y| µ_Y)/(|X|+|Y|) = (n_X µ_X + n_Y µ_Y)/(n_X + n_Y).   (3)
((2 n_X µ_X + 2 n_Y µ_Y)/(n_X + n_Y)) · Σ_{x_i∈X∪Y} x_i =(2),(3)= 2 ||n_X µ_X + n_Y µ_Y||² / (n_X + n_Y).   (4)
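The simplification can be checked numerically; the sketch below (1-D data for brevity, randomly generated, with helper names chosen for this example only) compares definition (1) with the closed form n_X n_Y/(n_X + n_Y) ||µ_X − µ_Y||².

```python
import random

def sumsq(pts):
    """Sum of squared distances of 1-D points to their centroid."""
    mu = sum(pts) / len(pts)
    return sum((x - mu) ** 2 for x in pts)

def ward_delta(X, Y):
    """Merging cost Delta(X, Y), computed directly from definition (1)."""
    return sumsq(X + Y) - sumsq(X) - sumsq(Y)

random.seed(0)
X = [random.gauss(0, 1) for _ in range(7)]
Y = [random.gauss(5, 2) for _ in range(13)]
nX, nY = len(X), len(Y)
muX, muY = sum(X) / nX, sum(Y) / nY
closed_form = nX * nY / (nX + nY) * (muX - muY) ** 2
assert abs(ward_delta(X, Y) - closed_form) < 1e-9
print("definition and closed form agree:", round(closed_form, 4))
```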
b. Give an interpretation of Ward's metric. What do you think it is trying to achieve?
Hint: The simplified formula from above will be helpful to answer this part.
Solution: With agglomerative hierarchical clustering, the sum of squares starts out at zero (because every point is in its own cluster) and then grows as we merge clusters. Ward's metric aims to keep this growth in the sum of squares as small as possible. This is nice if we believe that the sum of squares (as a measure of cluster coherence) should be small. Notice that both n_X and n_Y (in fact, [half of] their harmonic mean) show up in Δ(X,Y) as computed at part a. The intuition is that, given two pairs of clusters whose centers are equally far apart, Ward's method will prefer to merge the smaller ones.
c. Assume that you are given two pairs of clusters, P1 and P2. The centers of the two clusters in P1 are farther apart than the centers of the two clusters in P2. Using Ward's metric, does agglomerative clustering always choose to merge the two clusters in P2 (those with less distance between their centers)? Why (not)? Justify your answer with a simple example.
Solution: No, not always. Which pair will be merged also depends on the sizes of the clusters. One simple counterexample is where the sizes of the clusters in P1 are 1 and 99, respectively, and similarly 50 and 50 for P2. Then the harmonic mean 2 n_X n_Y/(n_X + n_Y) of the cluster sizes is 2·0.99 = 1.98 for P1 and 2·25 = 50 for P2. If the squared distance between the centers of the clusters in P1 is less than 25/0.99 = 25.(25) times the squared distance between the centers of the clusters in P2, then Δ(P1) will still be smaller, and so the clusters in P1 are merged.
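The counterexample can be verified with two lines of arithmetic, taking, say, centers 4 times farther apart in P1 (4² = 16 < 25.25, so P1 still wins); the `merge_cost` helper is just the closed form from part a.

```python
def merge_cost(n_x, n_y, center_dist):
    """Ward's merging cost for two clusters with the given sizes and
    distance between centroids (closed form from part a)."""
    return n_x * n_y / (n_x + n_y) * center_dist ** 2

d2 = 1.0                          # distance between the centers in P2
d1 = 4.0                          # centers in P1 are 4 times farther apart
cost_P1 = merge_cost(1, 99, d1)   # 0.99 * 16 = 15.84
cost_P2 = merge_cost(50, 50, d2)  # 25   * 1  = 25
assert cost_P1 < cost_P2          # the farther-apart pair is merged first
print(cost_P1, cost_P2)
```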
d. In clustering it is usually not trivial to decide what the right number of clusters for the data is. Using Ward's metric for agglomerative clustering, can you come up with a simple heuristic to pick the number of clusters k?
Solution: Ward's metric can give us a hint for picking a reasonable k through the merging cost. If the cost of merging increases sharply, the algorithm is probably going too far and losing a lot of structure. So one possible heuristic is to keep reducing k until the cost jumps, and then use the k right before the jump. In other words, pick the k just before the merging cost takes off.
21. Exemplifying non-hierarchical clustering using the K-means algorithm. T.U. Dresden, 2006 summer, Steffen Hölldobler, Axel Grossmann, HW3
Use the K-means algorithm and the Euclidean distance to group the following 8 instances from R² into 3 clusters: A(2,10), B(2,5), C(8,4), D(5,8), E(7,5), F(6,4), G(1,2), H(4,9). Take the points A, D and G as the initial centroids.
a. Run the first iteration of the K-means algorithm. On a grid of values, mark the given instances, the positions of the centroids at the beginning of the first iteration, and the membership of each cluster at the end of this iteration. (Draw the perpendicular bisectors of the segments determined by the centroids, as separators of the clusters.)
b. How many iterations are needed for the K-means algorithm to converge? Draw the result of each iteration on a separate grid.
[Figure: the K-means clustering problem, the initial data.]
Solution:
Iteration 0: µ^0_1 = (2,10), µ^0_2 = (5,8), µ^0_3 = (1,2).
P_i      d(µ^0_1, P_i)  d(µ^0_2, P_i)  d(µ^0_3, P_i)
A(2,10)  0              3.61           8.06
B(2,5)   5              4.24           3.16
C(8,4)   8.49           5              7.28
D(5,8)   3.61           0              7.21
E(7,5)   7.07           3.61           6.71
F(6,4)   7.21           4.12           5.39
G(1,2)   8.06           7.21           0
H(4,9)   2.24           1.41           7.62
C^0_1 = {A}, C^0_2 = {C,D,E,F,H}, C^0_3 = {B,G}.
Iteration 1: µ^1_1 = µ^0_1 = (2,10); µ^1_2 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6,6); µ^1_3 = ((2+1)/2, (5+2)/2) = (1.5, 3.5). ...
C^1_1 = {A,H}, C^1_2 = {C,D,E,F}, C^1_3 = {B,G}.
Iteration 2: µ^2_1 = (3, 9.5); µ^2_2 = (26/4, 21/4) = (6.5, 5.25); µ^2_3 = µ^1_3 = (1.5, 3.5). ...
C^2_1 = {A,D,H}, C^2_2 = {C,E,F}, C^2_3 = {B,G}.
Iteration 3: µ^3_1 = ((2+5+4)/3, (10+8+9)/3) = (11/3, 9); µ^3_2 = ((8+7+6)/3, (4+5+4)/3) = (7, 13/3); µ^3_3 = µ^2_3 = (1.5, 3.5). ...
C^3_1 = {A,D,H} = C^2_1, C^3_2 = {C,E,F} = C^2_2, C^3_3 = {B,G} = C^2_3. Stop.
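The iterations above can be reproduced with a plain Lloyd loop; the `kmeans` helper below is a sketch written just for this example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations with the Euclidean distance; returns the
    final clusters (as sorted lists of labels) and the centroids."""
    labels = sorted(points)
    assign = None
    for _ in range(max_iter):
        # assign each point to the nearest centroid
        new = [min(range(len(centroids)),
                   key=lambda j: math.dist(points[p], centroids[j]))
               for p in labels]
        if new == assign:          # membership unchanged: converged
            break
        assign = new
        # recompute each centroid as the mean of its members
        for j in range(len(centroids)):
            members = [points[p] for p, a in zip(labels, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    clusters = [sorted(p for p, a in zip(labels, assign) if a == j)
                for j in range(len(centroids))]
    return clusters, centroids

pts = {'A': (2, 10), 'B': (2, 5), 'C': (8, 4), 'D': (5, 8),
       'E': (7, 5), 'F': (6, 4), 'G': (1, 2), 'H': (4, 9)}
clusters, cents = kmeans(pts, [pts['A'], pts['D'], pts['G']])
print(clusters)   # [['A', 'D', 'H'], ['C', 'E', 'F'], ['B', 'G']]
```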
28. Exemplifying one property of K-means: the clustering result depends on the initialisation. CMU, 2006 spring, Carlos Guestrin, HW5, pr. 1
a-b. Consider the data set in the following figures. The symbols indicate data points, while the crosses (×) indicate the current cluster centers. For each one of these figures, show the progress of the K-means algorithm by showing how the cluster centers move with each iteration until convergence. For each iteration, indicate which data points will be associated with each of the clusters, as well as the updated cluster centers. If, during the cluster-update step, a cluster center has no points associated with it, it will not move. Use/produce as many figures as you need until the convergence of the algorithm.
c. What does this imply about the behavior of the K-means algorithm?
[Figures: the K-means initial data, for the two initialisations.]
Solution: a. [Figures: the initial data and the first iterations.]
Solution: a. (cont'd) [Figures: further iterations.]
[Figures: iteration 4.]
Solution: b. [Figures: the initial data and the iterations until convergence.]
Solution: c. It is evident from this exercise that the result of the K-means algorithm depends on the initial placement of the centroids. With the initialisation from part a, 4 iterations were needed to reach convergence, whereas with part b the algorithm converged after just one iteration.
There is something else important to notice: with the first initialisation, the point in the lower-right corner, which is an outlier (an exception, a particular/aberrant case), is in the end associated with the cluster formed by the points in the upper-right part, whereas with the second initialisation it constitutes a separate/singleton cluster, indirectly forcing the groups of points in the upper-left and upper-right parts to form a single cluster together (which, very probably, was not desirable).
Outlier detection is an important chapter of machine learning. Ignoring outliers can lead to erroneous or even aberrant modelling results.
Some proofs
37. K-means as an optimisation algorithm: the monotonicity of the J_K criterion. [CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 2.1]
The K-means algorithm (S. P. Lloyd, 1957)
Input: x_1,...,x_n ∈ R^d, with n ≥ K.
Output: a certain K-partition of {x_1,...,x_n}.
Procedure:
[Initialisation / iteration 0:] t ← 0; fix arbitrarily µ^0_1,...,µ^0_K, the initial centroids of the clusters, and assign each instance x_i to the nearest centroid, thus forming the clusters C^0_1,...,C^0_K.
[Iterative body:] Execute iteration ++t:
Step 1: compute the new positions of the centroids:(a) µ^t_j = (1/|C^{t-1}_j|) Σ_{x_i ∈ C^{t-1}_j} x_i, for j = 1,...,K;
Step 2: reassign each x_i to [the cluster of] the nearest centroid, that is, establish the new membership of the clusters at iteration t: C^t_1,...,C^t_K;
[Termination:] until a certain condition is fulfilled (for example: until the positions of the centroids, or the membership of the clusters, no longer change from one iteration to the next).
(a) This formula corresponds to using the Euclidean distance. For other distance measures, other formulas may be needed.
a. Prove that, from one iteration to the next, the K-means algorithm increases the overall cohesion of the clusters. I.e., considering the function
J(C^t, µ^t) := Σ_{i=1}^{n} ||x_i − µ^t_{C^t(x_i)}||² = Σ_{i=1}^{n} (x_i − µ^t_{C^t(x_i)}) · (x_i − µ^t_{C^t(x_i)}),
where:
C^t = (C^t_1, C^t_2,...,C^t_K) is the collection of clusters (i.e., the K-partition) at time t,
µ^t = (µ^t_1, µ^t_2,...,µ^t_K) is the collection of cluster centroids (the K-configuration) at time t,
C^t(x_i) designates the cluster to which the element x_i is assigned at iteration t,
the operator · designates the scalar product of vectors in R^d,
show that J(C^t, µ^t) ≥ J(C^{t+1}, µ^{t+1}) for every t.
The idea of the proof
The above inequality results from two inequalities (which correspond to steps 1 and 2 of iteration t):
J(C^t, µ^t) ≥(1) J(C^t, µ^{t+1}) ≥(2) J(C^{t+1}, µ^{t+1}).
For the first inequality (the one corresponding to step 1), the parameter C^t can be considered fixed while µ is variable, whereas for the second inequality (the one corresponding to step 2), µ^{t+1} is considered fixed and C variable.
The first inequality can be obtained by summing a series of inequalities, namely one for each cluster C^t_j. The second inequality is proved immediately.
For an illustration of this idea on a particular example, see the following 3 slides. [Edinburgh, 2009 fall, C. Williams, V. Lavrenko, HW4, pr. 3]
[Figures: iteration 0 (the initialization); then, for each subsequent iteration: Step 1 (re-positioning the centroids) and Step 2 (re-computing the clusters), up to iteration 3.]
For this example of applying the K-means algorithm, we write the numeric expressions of the value of the criterion J_2(C^t, µ^t) for each iteration (t = 0, 1, 2, 3).
iter.  J_2(C^t, µ^t)
0.  0 + {(−9−(−10))² + (−5−(−10))² + 2[(5−(−10))² + (9−(−10))²]}
1.  (−9−(−20))² + {(−8−7/3)² + (−5−7/3)² + 2[(5−7/3)² + (9−7/3)²]}
2.  (−9−(−9))² + (−8−(−9))² + (−5−(−9))² + 2[(5−22/7)² + (9−22/7)²]
3.  (−9−(−7))² + (−5−(−7))² + 2[(5−7)² + (9−7)²]
Remark: At first sight, it is hard to prove the inequalities J_2(C^{t−1}, µ^{t−1}) ≥ J_2(C^t, µ^t), for t = 1, 2, 3, other than by actually computing the values of the expressions being compared. However, by introducing some intermediate terms, these inequalities can be proved in a very elegant way...
iter.  J_2(C^{t−1}, µ^t)  |  J_2(C^t, µ^t)
0.  |  {(−9−(−10))² + (−5−(−10))² + 2[(5−(−10))² + (9−(−10))²]}
1.  {(−9−7/3)² + (−8−7/3)² + (−5−7/3)² + 2[(5−7/3)² + (9−7/3)²]}  |  (−9−(−20))² + {(−8−7/3)² + (−5−7/3)² + 2[(5−7/3)² + (9−7/3)²]}
2.  (−9−(−9))² + {(−8−22/7)² + (−5−22/7)² + 2[(5−22/7)² + (9−22/7)²]}  |  (−9−(−9))² + (−8−(−9))² + (−5−(−9))² + 2[(5−22/7)² + (9−22/7)²]
3.  (−9−(−7))² + (−5−(−7))² + 2[(5−7)² + (9−7)²]  =  (−9−(−7))² + (−5−(−7))² + 2[(5−7)² + (9−7)²]
Explanations:
1. The horizontal inequalities (J_2(C^{t−1}, µ^t) ≥ J_2(C^t, µ^t), for t = 1, 2, 3) are easy to prove, based on the term-by-term correspondence. (They correspond to the possible decreases of the distances when the instances are reassigned to centroids.)
2. The remaining inequalities (J_2(C^t, µ^t) ≥ J_2(C^t, µ^{t+1}), for t = 1, 2, 3) are solved by a simple optimisation method. For example, for t = 1 it is immediate that the function Σ_{x_i ∈ C} (x_i − x)², for a fixed cluster C, attains its minimum at the centroid of C; hence J_2(C^1, µ^2) ≤ J_2(C^1, µ^1).
Step by step (t = 1)
iter.  J_2(C^{t−1}, µ^t)  |  J_2(C^t, µ^t)
0.  |  {(−9−(−10))² + (−5−(−10))² + 2[(5−(−10))² + (9−(−10))²]}
1.  {(−9−7/3)² + (−8−7/3)² + (−5−7/3)² + 2[(5−7/3)² + (9−7/3)²]}  |  (−9−(−20))² + {(−8−7/3)² + (−5−7/3)² + 2[(5−7/3)² + (9−7/3)²]}
Details to be filled in class.
Step by step (t = 2)
iter.  J_2(C^{t−1}, µ^t)  |  J_2(C^t, µ^t)
1.  |  (−9−(−20))² + {(−8−7/3)² + (−5−7/3)² + 2[(5−7/3)² + (9−7/3)²]}
2.  (−9−(−9))² + {(−8−22/7)² + (−5−22/7)² + 2[(5−22/7)² + (9−22/7)²]}  |  (−9−(−9))² + (−8−(−9))² + (−5−(−9))² + 2[(5−22/7)² + (9−22/7)²]
Details to be filled in class / at home.
Step by step (t = 3)
iter.  J_2(C^{t−1}, µ^t)  |  J_2(C^t, µ^t)
2.  (−9−(−9))² + {(−8−22/7)² + (−5−22/7)² + 2[(5−22/7)² + (9−22/7)²]}  |  (−9−(−9))² + (−8−(−9))² + (−5−(−9))² + 2[(5−22/7)² + (9−22/7)²]
3.  (−9−(−7))² + (−5−(−7))² + 2[(5−7)² + (9−7)²]  =  (−9−(−7))² + (−5−(−7))² + 2[(5−7)² + (9−7)²]
Details to be filled in class / at home.
Graphical illustration, by Gheorghe Balan, FII student, 2017 fall. For Edinburgh, 2009 fall, C. Williams, V. Lavrenko, HW4, pr. 3
Graphical exemplification (cont'd). © Andrew Ng, Stanford University. For Stanford, A. Ng, 2012 spring, HW9 (image segmentation and compression)
Proof, for the general case
Remark: For convenience, we will restrict ourselves to the case d = 1. Extending the proof to the case d > 1 poses no difficulties.
Proving inequality (1): J(C^t, µ^t) ≥ J(C^t, µ^{t+1})  (See step 1 of iteration t.)
Fix j ∈ {1,...,K}. If we denote C^t_j = {x_{i_1}, x_{i_2},..., x_{i_l}}, where l := |C^t_j|, then
J(C^t_j, µ^t_j) = Σ_{p=1}^{l} (x_{i_p} − µ^t_j)²,  hence  J(C^t, µ^t) = Σ_{j=1}^{K} J(C^t_j, µ^t_j).
If C^t_j is considered fixed and µ^t_j variable, then we can immediately minimise the function that "measures" the cohesion inside the cluster C^t_j:
f(µ) := J(C^t_j, µ) = l µ² − 2µ Σ_{p=1}^{l} x_{i_p} + Σ_{p=1}^{l} x_{i_p}²  ⟹  argmin_µ J(C^t_j, µ) = (1/l) Σ_{p=1}^{l} x_{i_p} =: µ^{t+1}_j.
Therefore, J(C^t_j, µ) ≥ J(C^t_j, µ^{t+1}_j) for every µ. In particular, for µ = µ^t_j we have: J(C^t_j, µ^t_j) ≥ J(C^t_j, µ^{t+1}_j).
This inequality holds for all clusters j = 1,...,K. Summing all these inequalities, we get: J(C^t, µ^t) ≥ J(C^t, µ^{t+1}).
Proving inequality (2): J(C^t, µ^{t+1}) ≥ J(C^{t+1}, µ^{t+1})  (See step 2 of iteration t.)
At this step, an arbitrary instance x_i, where i ∈ {1,...,n}, is reassigned from the cluster with centroid µ^{t+1}_j to another centroid µ^{t+1}_q only if
||x_i − µ^{t+1}_j||² ≥ ||x_i − µ^{t+1}_q||², i.e. (x_i − µ^{t+1}_j)² ≥ (x_i − µ^{t+1}_q)², for every j = 1,...,K.
In the context of iteration t, this implies
(x_i − µ^{t+1}_{C^t(x_i)})² ≥ (x_i − µ^{t+1}_{C^{t+1}(x_i)})².
Summing, member by member, the inequalities of this type obtained for i = 1,...,n, we get: J(C^t, µ^{t+1}) ≥ J(C^{t+1}, µ^{t+1}), which was to be proved.
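The two inequalities can be observed numerically by recording J after each half-step. The 1-D dataset and the initial centroids below are chosen for illustration only (the exact numbers of the slide example are only partly legible in this copy).

```python
def lloyd_J_trace(xs, mus, iters=10):
    """Run 1-D K-means and record J after every centroid update and
    after every reassignment; the trace must be non-increasing."""
    assign = [min(range(len(mus)), key=lambda j: (x - mus[j]) ** 2)
              for x in xs]
    J = lambda: sum((x - mus[a]) ** 2 for x, a in zip(xs, assign))
    trace = [J()]
    for _ in range(iters):
        for j in range(len(mus)):            # Step 1: move the centroids
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:                      # empty clusters don't move
                mus[j] = sum(members) / len(members)
        trace.append(J())
        assign = [min(range(len(mus)), key=lambda j: (x - mus[j]) ** 2)
                  for x in xs]               # Step 2: reassign the points
        trace.append(J())
    return trace

trace = lloyd_J_trace([-9, -8, -5, 5, 5, 9, 9], [-20.0, -10.0])
assert all(a >= b - 1e-12 for a, b in zip(trace, trace[1:]))
print([round(t, 3) for t in trace[:6]])
```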
Interesting / Useful Remark (in English) (by L. Ciortuz, following CMU, 2012 fall, E. Xing, A. Singh, HW3, pr. 1)
Following the result obtained above, the K-means clustering algorithm can be seen as an optimisation algorithm. As input, we are given points x_1,...,x_n ∈ R^d and an integer K > 1. The goal is to minimize the K-means objective function/criterion, which in the case of using the Euclidean distance measure is the within-cluster sum of squared distances
J(µ, L) = Σ_{i=1}^{n} ||x_i − µ_{l_i}||²,
where µ = (µ_1,...,µ_K) are the cluster centers (µ_j ∈ R^d), and L = (l_1,...,l_n) are the cluster assignments (l_i ∈ {1,...,K}).
(Finding the exact minimum of this function is computationally difficult.)
Interesting / Useful Remark (cont'd)
Lloyd's [K-means] algorithm takes some initial cluster centers µ, and proceeds as follows:
Step 1: Keeping µ fixed, find cluster assignments L to minimize J(µ, L). This step only involves finding the nearest cluster center(s) for each input point x_i. Ties can be broken using arbitrary (but consistent) rules.
Step 2: Keeping L fixed, find µ to minimize J(µ, L). This is a simple step that (in the case of using the Euclidean distance measure) only involves averaging the points within a cluster.
Stopping criterion: If [this was not the first iteration and] none of the values in L changed from the previous iteration, go to the next step; otherwise repeat from Step 1.
Termination: Return µ and L.
Note: The initial cluster centers µ given as input to the algorithm are often picked randomly from x_1,...,x_n. (In practice, we often repeat multiple runs of Lloyd's algorithm with different initializations, and pick the best resulting clustering in terms of the K-means objective.)
b. What can you say about the termination of the K-means algorithm? Does this algorithm terminate in a finite number of steps, or (since the number of K-partitions that can be formed with the n training instances x_1,...,x_n is finite, namely K^n) is it possible for it to revisit a previous K-configuration µ = (µ_1,...,µ_K) an infinite number of times?
Answer: If the algorithm revisits a K-partition, then it follows that for a certain t we have J(C^{t−1}, µ^t) = J(C^t, µ^{t+1}). It is possible for this to happen, namely when:
there are duplicate instances (i.e., x_i = x_j although i ≠ j),
the stopping criterion of the K-means algorithm is of the form "until the membership of the clusters no longer changes",
it is assumed that, when an instance x_i is situated at equal distance from two or more centroids, it can be randomly assigned to any one of them.
This is what happens in the example in the adjacent figure, if we consider that at an iteration t we have x_2 = 0 ∈ C^t_1 and x_3 = 0 ∈ C^t_2, at the next iteration we choose x_3 = 0 ∈ C^{t+1}_1 and x_2 = 0 ∈ C^{t+1}_2, and again the other way round at iteration t+2.
[Figure: instances on the real line, with x_2 = x_3 = 0 equidistant from the two centroids µ_1 and µ_2, and ticks at −1/2, 0, 1/2, 1.]
Remarks
If the criterion given as an example in the problem statement is kept, that is, we iterate until the centroids become stationary, the algorithm can stop without J(C, µ) having reached, at the last iteration, the minimum possible value. In the case of the example above, we will have 1 > 2/3 = 2(1/3)² + (2/3)².
If there are no duplicate instances situated at equal distances from two or more centroids at some iteration of the K-means algorithm (as x_2 and x_3 are in the example above), or if we impose the restriction that in such situations the identical instances be assigned to a single cluster, then it is evident that the K-means algorithm stops in a finite number of steps.
Conclusions
Starting from a given initialisation of the K centroids, the K-means algorithm explores only a subset of the total of K^n K-partitions, while guaranteeing the property
J(C^0, µ^1) ≥ J(C^1, µ^2) ≥ ... ≥ J(C^{t−1}, µ^t) ≥ J(C^t, µ^{t+1}),
according to part a of this problem.
Reaching the global minimum of the function J(C, µ), where C is a variable ranging over the set of all K-partitions that can be formed with the instances {x_1,...,x_n}, is not guaranteed for the K-means algorithm.
The value of the function J obtained when the K-means algorithm stops depends on the initial placement of the centroids µ, as well as on how exactly the clusters are formed when some instance is equidistant from two or more centroids, as shown in the example above.
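The dependence on the initialization stated in the conclusions can be demonstrated on a toy 1-D dataset (invented for this sketch): two initializations of the same Lloyd loop reach different final values of J.

```python
def lloyd_final_J(xs, mus, iters=50):
    """Run 1-D Lloyd iterations and return the final value of J."""
    for _ in range(iters):
        assign = [min(range(len(mus)), key=lambda j: (x - mus[j]) ** 2)
                  for x in xs]
        for j in range(len(mus)):
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                mus[j] = sum(members) / len(members)
    # J = sum of squared distances to the nearest final centroid
    return sum((x - mus[min(range(len(mus)),
                            key=lambda j: (x - mus[j]) ** 2)]) ** 2
               for x in xs)

xs = [1, 2, 3, 8, 9, 10, 25]                   # two groups plus an outlier
J_good = lloyd_final_J(xs, [2.0, 9.0, 25.0])   # outlier gets its own cluster
J_bad = lloyd_final_J(xs, [1.0, 2.0, 3.0])     # all centroids start on the left
print(J_good, J_bad)                           # 4.0 vs 194.5
assert J_good < J_bad
```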
57. K-means algorithm: the approximate maximization of the distance between clusters. [CMU, 2010 fall, Aarti Singh, HW3, pr. 5.2]
Note: In this problem we will work with a version of the K-means algorithm which is slightly modified w.r.t. the one given in the problem CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 2.1, where we proved the monotonicity of the criterion J.
Let X := {x_1, x_2,...,x_n} be our sample points, and let K denote the number of clusters to use. We represent the cluster assignments of the data points by an indicator matrix γ ∈ {0,1}^{n×K} such that γ_ij = 1 means that x_i belongs to cluster j. We require that each point belong to exactly one cluster, so Σ_{j=1}^{K} γ_ij = 1.
[We already know that] the K-means algorithm estimates γ by minimizing the following cohesion criterion (or, measure of distortion, or simply "sum of squares"):
J(γ, µ_1, µ_2,...,µ_K) := Σ_{i=1}^{n} Σ_{j=1}^{K} γ_ij ||x_i − µ_j||²,
where ||·|| denotes the vector 2-norm. K-means alternates between estimating γ and re-computing the µ_j's.
The K-means algorithm (...yet another version!)
Initialize µ_1, µ_2,...,µ_K, and let C := {1,...,K}. While the value of J is still decreasing, repeat the following:
1. Determine γ by
γ_ij ← 1, if ||x_i − µ_j||² ≤ ||x_i − µ_j'||² for every j' ∈ C; 0, otherwise.
Break ties arbitrarily.
2. Recompute the µ_j's using the updated γ: for each j ∈ C, if Σ_{i=1}^{n} γ_ij > 0, set µ_j ← (Σ_{i=1}^{n} γ_ij x_i) / (Σ_{i=1}^{n} γ_ij). Otherwise, don't change µ_j.
Let x̄ denote the sample mean. Consider the following three quantities:
Total variation: T(X) = (1/n) Σ_{i=1}^{n} ||x_i − x̄||²
Within-cluster variation: W_j(X) = (Σ_{i=1}^{n} γ_ij ||x_i − µ_j||²) / (Σ_{i=1}^{n} γ_ij), for j = 1,...,K
Between-cluster variation: B(X) = Σ_{j=1}^{K} ((Σ_{i=1}^{n} γ_ij)/n) ||µ_j − x̄||².
What is the relation between these three quantities? Based on this relation, show that K-means can be interpreted as minimizing a weighted average of the within-cluster variations while approximately(!) maximizing the between-cluster variation. Note that the relation may contain an extra term that does not appear above.
Solution
To simplify the notation, we define n_j = Σ_{i=1}^{n} γ_ij. We then have:
T(X) = (1/n) Σ_{i=1}^{n} ||x_i − x̄||² = (1/n) Σ_{j=1}^{K} Σ_{i=1}^{n} γ_ij ||x_i − x̄||²
= (1/n) Σ_{j=1}^{K} Σ_{i=1}^{n} γ_ij ||x_i − µ_j + µ_j − x̄||²   (5)
= (1/n) Σ_{j=1}^{K} Σ_{i=1}^{n} γ_ij ( ||x_i − µ_j||² + ||µ_j − x̄||² + 2 (x_i − µ_j) · (µ_j − x̄) )
= Σ_{j=1}^{K} (n_j/n) W_j(X) + B(X) + (2/n) Σ_{j=1}^{K} n_j (µ̃_j − µ_j) · (µ_j − x̄),   (6)
where µ̃_j := (Σ_{i=1}^{n} γ_ij x_i)/n_j. The first term, Σ_{j=1}^{K} (n_j/n) W_j(X), equals J/n.
Notes
1. In (5), the quantities µ_j can be taken arbitrarily; however, they will be thought of in the end (see relation (6) and the Conclusion on the last slide) as the centroids of the clusters 1,...,K, as computed at Step 2 in the K-means algorithm.
2. The equality (6) on the previous slide holds because
Σ_{i=1}^{n} γ_ij (x_i − µ_j) = Σ_{i=1}^{n} γ_ij x_i − n_j µ_j = n_j ( (Σ_{i=1}^{n} γ_ij x_i)/n_j − µ_j ) = n_j (µ̃_j − µ_j).
Conclusion
We already know (see CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 2.1) that K-means aims to minimize J, and consequently (1/n)J, which coincides with the first term in the expression we obtained for T(X), namely Σ_{j=1}^{K} (n_j/n) W_j(X).
Since the total variation T(X) is constant, minimizing the first term is equivalent to maximizing the sum of the other two terms, which is expected to be dominated by the between-cluster variation B(X), since a good µ_j should be close to µ̃_j, making the third term small in absolute value.
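The decomposition T(X) = Σ_j (n_j/n) W_j(X) + B(X) + cross-term can be verified numerically for arbitrary µ_j; the 1-D sketch below uses illustrative data and helper names.

```python
def decompose(xs, assign, mus):
    """Check T(X) = sum_j (n_j/n) W_j(X) + B(X) + cross-term for 1-D data.
    Returns (T, reassembled sum); they must coincide for any mus."""
    n = len(xs)
    xbar = sum(xs) / n
    T = sum((x - xbar) ** 2 for x in xs) / n
    total = 0.0
    for j in range(len(mus)):
        members = [x for x, a in zip(xs, assign) if a == j]
        nj = len(members)
        mtilde = sum(members) / nj              # actual centroid of cluster j
        Wj = sum((x - mus[j]) ** 2 for x in members) / nj
        Bj = nj / n * (mus[j] - xbar) ** 2
        cross = 2 * nj / n * (mtilde - mus[j]) * (mus[j] - xbar)
        total += nj / n * Wj + Bj + cross
    return T, total

xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
assign = [0, 0, 0, 1, 1, 1]
T, total = decompose(xs, assign, [2.5, 9.5])    # mus need not be the means
assert abs(T - total) < 1e-9
print(round(T, 4), round(total, 4))
```

When the µ_j are the actual cluster means, the cross-term vanishes and the relation reduces to the classic total = within + between decomposition.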
Graphical exemplification, by Gheorghe Balan, FII student, 2017 fall. For CMU, 2006 spring, C. Guestrin, HW5, pr. 1. For Edinburgh, 2009 fall, C. Williams, V. Lavrenko, HW4, pr. 3
Graphical exemplification (cont'd). © Andrew Ng, Stanford University. For Stanford, A. Ng, 2012 spring, HW9 (image segmentation and compression)
66. Exemplifying the application of a simple version of EM/GMM on data from R (σ_1 = σ_2 = 1, π_1 = π_2 = 1/2). CMU, 2012 spring, Ziv Bar-Joseph, final exam, pr. 3.1, enhanced by Liviu Ciortuz
Suppose a GMM has two components with known variance and an equal prior distribution:
(1/2) N(µ_1, 1) + (1/2) N(µ_2, 1).
The observed data are x_1 = 0.5 and x_2 = 2, and the current estimates of µ_1 and µ_2 are 1 and 2, respectively.
a. Execute the first iteration of the EM algorithm.
Hint: The normal densities for the standardized variable y (µ = 0, σ = 1) at 0, 0.5, 1, 1.5, 2 are 0.4, 0.35, 0.24, 0.13, 0.05, respectively.
b. Consider the log-likelihood function for the observable data,
l(µ_1, µ_2) := ln P(x_1, x_2 | µ_1, µ_2) =(indep.)= Σ_{i=1}^{2} ln P(x_i | µ_1, µ_2) = Σ_{i=1}^{2} ln Σ_{z_ij} P(x_i, z_ij | µ_1, µ_2),
where z_ij ∈ {0,1} and Σ_{j=1}^{2} z_ij = 1 for all i ∈ {1,2}.
Compute the values of the function l at the beginning and also at the end of the first iteration of the EM algorithm. What do you see?
Solution (a.)
The E-step:
E[Z_i1] = P(Z_i1 = 1 | x_i, µ) =(Bayes)= P(x_i | Z_i1 = 1, µ_1) P(Z_i1 = 1) / ( P(x_i | Z_i1 = 1, µ_1) P(Z_i1 = 1) + P(x_i | Z_i2 = 1, µ_2) P(Z_i2 = 1) )
= P(x_i | Z_i1 = 1, µ_1) / ( P(x_i | Z_i1 = 1, µ_1) + P(x_i | Z_i2 = 1, µ_2) ), for i ∈ {1,2},
since P(Z_i1 = 1) = P(Z_i2 = 1) = 1/2. Therefore,
P(Z_11 = 1 | x_1, µ) = N(0.5; 1, 1) / (N(0.5; 1, 1) + N(0.5; 2, 1)) = N(0.5; 0, 1) / (N(0.5; 0, 1) + N(1.5; 0, 1)) = 0.35/(0.35 + 0.13) = 35/48
P(Z_21 = 1 | x_2, µ) = N(2; 1, 1) / (N(2; 1, 1) + N(2; 2, 1)) = N(1; 0, 1) / (N(1; 0, 1) + N(0; 0, 1)) = 0.24/(0.24 + 0.4) = 3/8
Similarly,
P(Z_12 = 1 | x_1, µ) = P(Z_11 = 0 | x_1, µ) = 1 − P(Z_11 = 1 | x_1, µ) = 13/48
P(Z_22 = 1 | x_2, µ) = P(Z_21 = 0 | x_2, µ) = 1 − P(Z_21 = 1 | x_2, µ) = 5/8
The M-step:
µ_j^{(t+1)} = (Σ_{i=1}^{2} E[Z_ij] x_i) / (Σ_{i=1}^{2} E[Z_ij]) = (Σ_{i=1}^{2} P(Z_ij = 1 | x_i, µ^{(t)}) x_i) / (Σ_{i=1}^{2} P(Z_ij = 1 | x_i, µ^{(t)}))
Therefore,
µ_1^{(1)} = ((35/48)·0.5 + (3/8)·2) / (35/48 + 3/8) = (107/96) / (53/48) = 107/106
µ_2^{(1)} = ((13/48)·0.5 + (5/8)·2) / (13/48 + 5/8) = (133/96) / (43/48) = 133/86
Solution (b.)
l(µ_1, µ_2) := Σ_{i=1}^{2} ln P(x_i | µ_1, µ_2) = Σ_{i=1}^{2} ln Σ_{z_ij} P(x_i, z_ij | µ_1, µ_2)
= Σ_{i=1}^{2} ln Σ_{z_ij} P(x_i | z_ij, µ_1, µ_2) P(z_ij | µ_1, µ_2), with P(z_ij | µ_1, µ_2) = 1/2
= Σ_{i=1}^{2} ln ( (1/2) Σ_{z_ij} P(x_i | z_ij, µ_1, µ_2) ) = −2 ln 2 + Σ_{i=1}^{2} ln Σ_{j=1}^{2} P(x_i | z_ij = 1, µ_j)
The four component densities are:
z_11 = 1, z_12 = 0:  (1/√(2π)) exp(−(1/2)(x_1 − µ_1)²)
z_11 = 0, z_12 = 1:  (1/√(2π)) exp(−(1/2)(x_1 − µ_2)²)
z_21 = 1, z_22 = 0:  (1/√(2π)) exp(−(1/2)(x_2 − µ_1)²)
z_21 = 0, z_22 = 1:  (1/√(2π)) exp(−(1/2)(x_2 − µ_2)²)
Therefore,
l(µ_1, µ_2) = −2 ln 2 + ln( (1/√(2π)) ( exp(−(1/2)(x_1 − µ_1)²) + exp(−(1/2)(x_1 − µ_2)²) ) ) + ln( (1/√(2π)) ( exp(−(1/2)(x_2 − µ_1)²) + exp(−(1/2)(x_2 − µ_2)²) ) )
= −2 ln 2 − ln(2π) + ln( exp(−(1/2)(x_1 − µ_1)²) + exp(−(1/2)(x_1 − µ_2)²) ) + ln( exp(−(1/2)(x_2 − µ_1)²) + exp(−(1/2)(x_2 − µ_2)²) ),
and l(µ_1^{(0)}, µ_2^{(0)}) = l(1, 2) ≈ −2.56 ≤ l(µ_1^{(1)}, µ_2^{(1)}) = l(107/106, 133/86) ≈ −2.43,
meaning that the value of the log-likelihood function increases, which is in line with the theoretical result concerning the correctness of the EM algorithm.
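The first EM iteration can be recomputed with exact (un-rounded) normal densities; the results land close to the slide's 107/106 and 133/86, which were obtained with the hint's rounded densities, and the log-likelihood indeed increases. The helper names below are chosen for this sketch.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_step(xs, mus):
    """One EM iteration for a 2-component GMM with sigma = 1, pi = 1/2."""
    # E-step: responsibilities p[i][j] = P(Z_ij = 1 | x_i, mu)
    p = [[normal_pdf(x, m) for m in mus] for x in xs]
    p = [[v / sum(row) for v in row] for row in p]
    # M-step: responsibility-weighted means
    mus = [sum(p[i][j] * xs[i] for i in range(len(xs))) /
           sum(p[i][j] for i in range(len(xs))) for j in range(2)]
    return p, mus

def loglik(xs, mus):
    """Observable-data log-likelihood l(mu_1, mu_2)."""
    return sum(math.log(0.5 * normal_pdf(x, mus[0]) + 0.5 * normal_pdf(x, mus[1]))
               for x in xs)

xs, mus0 = [0.5, 2.0], [1.0, 2.0]
p, mus1 = em_step(xs, mus0)
print([round(m, 4) for m in mus1])   # close to 107/106 = 1.0094 and 133/86 = 1.5465
print(round(loglik(xs, mus0), 3), round(loglik(xs, mus1), 3))
assert loglik(xs, mus1) > loglik(xs, mus0)
```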
72. Derivation of the EM algorithm for a mixture of K uni-variate Gaussians: the general case (i.e., when all the parameters π, µ, σ² are free), following Dahua Lin, An Introduction to Expectation-Maximization (MIT, ML 6768 course, 2012 fall)
Note: We will first consider K = 2. The generalization to K > 2 will be shown afterwards.
Estimation (E) Step:
p_ij := P(Z_ij = 1 | X_i, µ, σ, π) = E[Z_ij | X_i, µ, σ, π]
= P(X_i = x_i | Z_ij = 1, µ, σ, π) P(Z_ij = 1 | µ, σ, π) / Σ_{j'=1}^{2} P(X_i = x_i | Z_ij' = 1, µ, σ, π) P(Z_ij' = 1 | µ, σ, π)
= (π_j/(√(2π) σ_j)) exp(−(x_i − µ_j)²/(2σ_j²)) / ( (π_1/(√(2π) σ_1)) exp(−(x_i − µ_1)²/(2σ_1²)) + (π_2/(√(2π) σ_2)) exp(−(x_i − µ_2)²/(2σ_2²)) )
Therefore, for t > 0 we will have:
p_ij^{(t)} = (π_j^{(t−1)}/σ_j^{(t−1)}) exp(−(x_i − µ_j^{(t−1)})²/(2(σ_j^{(t−1)})²)) / ( (π_1^{(t−1)}/σ_1^{(t−1)}) exp(−(x_i − µ_1^{(t−1)})²/(2(σ_1^{(t−1)})²)) + (π_2^{(t−1)}/σ_2^{(t−1)}) exp(−(x_i − µ_2^{(t−1)})²/(2(σ_2^{(t−1)})²)) )
The likelihood of a complete instance (x_i, z_i1, z_i2):
P(X_i = x_i, Z_i1 = z_i1, Z_i2 = z_i2 | µ, σ, π) = P(X_i = x_i | Z_i1 = z_i1, Z_i2 = z_i2, µ, σ, π) · P(Z_i1 = z_i1, Z_i2 = z_i2 | µ, σ, π)
= (1/(√(2π) σ_j)) exp(−(x_i − µ_j)²/(2σ_j²)) · π_j, where z_ij = 1 and z_ij' = 0 for j' ≠ j
= (1/√(2π)) σ_1^{−z_i1} σ_2^{−z_i2} exp( −(1/2) Σ_{j∈{1,2}} z_ij (x_i − µ_j)²/σ_j² ) π_1^{z_i1} π_2^{z_i2}
The log-likelihood of the same complete instance will be:
ln P(X_i = x_i, Z_i1 = z_i1, Z_i2 = z_i2 | µ, σ, π) = −(1/2) ln(2π) − Σ_{j=1}^{2} z_ij ln σ_j − (1/2) Σ_{j=1}^{2} z_ij (x_i − µ_j)²/σ_j² + Σ_{j=1}^{2} z_ij ln π_j
Given the dataset X = {x_1,...,x_n}, the log-likelihood function will be:
l(µ, σ, π) := ln P(X, Z_1, Z_2 | µ, σ, π) =(i.i.d.)= ln Π_{i=1}^{n} P(X_i = x_i, Z_i1, Z_i2 | µ, σ, π) = Σ_{i=1}^{n} ln P(X_i = x_i, Z_i1, Z_i2 | µ, σ, π)
= −(n/2) ln(2π) − Σ_{i=1}^{n} Σ_{j=1}^{2} Z_ij ln σ_j − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{2} Z_ij (x_i − µ_j)²/σ_j² + Σ_{i=1}^{n} Σ_{j=1}^{2} Z_ij ln π_j
77. The expectation of the log-likelihood function:
$$E[\ln P(X,Z_1,Z_2\mid\mu,\sigma,\pi)] = -\frac{n}{2}\ln(2\pi) - \sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}]\ln\sigma_j - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}]\frac{(x_i-\mu_j)^2}{\sigma_j^2} + \sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}]\ln\pi_j$$
Above, the probability distribution with respect to which the expectation was computed was left unspecified. Now we make it explicit:
$$Q(\mu,\sigma,\pi\mid\mu^{(t)},\sigma^{(t)},\pi^{(t)}) \overset{\text{not.}}{=} E_{Z\mid X,\mu^{(t)},\sigma^{(t)},\pi^{(t)}}[\ln P(X,Z_1,Z_2\mid\mu,\sigma,\pi)]$$
$$= -\frac{n}{2}\ln(2\pi) - \sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}\mid x_i;\mu^{(t)},\sigma^{(t)},\pi^{(t)}]\ln\sigma_j - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}\mid x_i;\mu^{(t)},\sigma^{(t)},\pi^{(t)}]\frac{(x_i-\mu_j)^2}{\sigma_j^2} + \sum_{i=1}^{n}\sum_{j=1}^{2} E[Z_{ij}\mid x_i;\mu^{(t)},\sigma^{(t)},\pi^{(t)}]\ln\pi_j$$
78. With the notation
$$p^{(t)}_{ij} \overset{\text{not.}}{=} E[Z_{ij}\mid x_i;\mu^{(t)},\sigma^{(t)},\pi^{(t)}],$$
$$Q(\mu,\sigma,\pi\mid\mu^{(t)},\sigma^{(t)},\pi^{(t)}) = -\frac{n}{2}\ln(2\pi) - \sum_{i=1}^{n}\sum_{j=1}^{2} p^{(t)}_{ij}\ln\sigma_j - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} p^{(t)}_{ij}\frac{(x_i-\mu_j)^2}{\sigma_j^2} + \sum_{i=1}^{n}\sum_{j=1}^{2} p^{(t)}_{ij}\ln\pi_j$$
Since K = 2 and π₁ + π₂ = 1, we get
$$Q(\mu,\sigma,\pi\mid\mu^{(t)},\sigma^{(t)},\pi^{(t)}) = -\frac{n}{2}\ln(2\pi) - \sum_{i=1}^{n}\sum_{j=1}^{2} p^{(t)}_{ij}\ln\sigma_j - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{2} p^{(t)}_{ij}\frac{(x_i-\mu_j)^2}{\sigma_j^2} + \sum_{i=1}^{n}\big(p^{(t)}_{i1}\ln\pi_1 + p^{(t)}_{i2}\ln(1-\pi_1)\big)$$
79. Maximization (M) step [for K = 2]:
$$\frac{\partial}{\partial\pi_1} Q(\mu,\sigma,\pi\mid\mu^{(t)},\sigma^{(t)},\pi^{(t)}) = 0 \;\Leftrightarrow\; \frac{1}{\pi_1}\sum_{i=1}^{n} p^{(t)}_{i1} - \frac{1}{1-\pi_1}\sum_{i=1}^{n} p^{(t)}_{i2} = 0 \;\Leftrightarrow\; (1-\pi_1)\sum_{i=1}^{n} p^{(t)}_{i1} = \pi_1\sum_{i=1}^{n} p^{(t)}_{i2}$$
$$\Leftrightarrow\; \sum_{i=1}^{n} p^{(t)}_{i1} = \pi_1\sum_{i=1}^{n}\underbrace{\big(p^{(t)}_{i1}+p^{(t)}_{i2}\big)}_{1} = n\pi_1$$
Taking into account that p^(t)_{i1} + p^(t)_{i2} = 1 for i = 1,...,n, and that π₁^(t+1) + π₂^(t+1) = 1, we get
$$\pi^{(t+1)}_1 = \frac{1}{n}\sum_{i=1}^{n} p^{(t)}_{i1}, \qquad \pi^{(t+1)}_2 = \frac{1}{n}\sum_{i=1}^{n} p^{(t)}_{i2}$$
80.
$$\frac{\partial}{\partial\mu_1} Q(\mu,\sigma,\pi\mid\mu^{(t)},\sigma^{(t)},\pi^{(t)}) = 0 \;\Leftrightarrow\; \frac{1}{\sigma_1^2}\sum_{i=1}^{n} p^{(t)}_{i1}(x_i-\mu_1) = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} p^{(t)}_{i1}(x_i-\mu_1) = 0 \;\Rightarrow\; \mu^{(t+1)}_1 = \frac{\sum_{i=1}^{n} p^{(t)}_{i1}\,x_i}{\sum_{i=1}^{n} p^{(t)}_{i1}}$$
Similarly,
$$\mu^{(t+1)}_2 = \frac{\sum_{i=1}^{n} p^{(t)}_{i2}\,x_i}{\sum_{i=1}^{n} p^{(t)}_{i2}}$$
81.
$$\frac{\partial}{\partial\sigma_1} Q(\mu,\sigma,\pi\mid\mu^{(t)},\sigma^{(t)},\pi^{(t)}) = 0 \;\Leftrightarrow\; -\frac{1}{\sigma_1}\sum_{i=1}^{n} p^{(t)}_{i1} + \frac{1}{\sigma_1^3}\sum_{i=1}^{n} p^{(t)}_{i1}(x_i-\mu_1)^2 = 0 \;\Rightarrow\; \big(\sigma^{(t+1)}_1\big)^2 = \frac{\sum_{i=1}^{n} p^{(t)}_{i1}\big(x_i-\mu^{(t+1)}_1\big)^2}{\sum_{i=1}^{n} p^{(t)}_{i1}}$$
Similarly,
$$\big(\sigma^{(t+1)}_2\big)^2 = \frac{\sum_{i=1}^{n} p^{(t)}_{i2}\big(x_i-\mu^{(t+1)}_2\big)^2}{\sum_{i=1}^{n} p^{(t)}_{i2}}$$
Note: One could relatively easily prove that these solutions (namely π^(t+1), µ^(t+1), σ^(t+1)) of the zero-gradient equations of the auxiliary function Q are indeed the values for which Q reaches its maximum.
82. Generalization to K > 2. In this case, the Bernoulli distribution is replaced by a categorical one. The only change needed in the above proof concerns updating the parameters of this distribution. Since π₁ + ... + π_K = 1, we must solve the following constrained optimization problem:
$$\max_{\pi,\mu,\sigma} Q(\pi,\mu,\sigma\mid\pi^{(t)},\mu^{(t)},\sigma^{(t)}) \quad \text{subject to } \sum_{j=1}^{K}\pi_j = 1 \text{ and } \pi_j \ge 0,\; j = 1,\dots,K.$$
Leaving aside the inequality constraints and using a Lagrange multiplier λ ∈ R, this problem becomes:
$$\max_{\pi,\mu,\sigma}\;\Big( Q(\pi,\mu,\sigma\mid\pi^{(t)},\mu^{(t)},\sigma^{(t)}) + \lambda\big(1 - \sum_{j=1}^{K}\pi_j\big)\Big).$$
83. For j = 1,...,K:
$$\frac{\partial}{\partial\pi_j}\Big(Q(\mu,\sigma,\pi\mid\mu^{(t)},\sigma^{(t)},\pi^{(t)}) + \lambda\big(1-\sum_{j'=1}^{K}\pi_{j'}\big)\Big) = 0 \;\Leftrightarrow\; \frac{1}{\pi_j}\sum_{i=1}^{n} p^{(t)}_{ij} = \lambda \;\Rightarrow\; \pi^{(t+1)}_j = \frac{1}{\lambda}\sum_{i=1}^{n} p^{(t)}_{ij}.$$
Because Σ_{j=1}^{K} π_j^(t+1) = 1, it follows that
$$\lambda = \sum_{j=1}^{K}\sum_{i=1}^{n} p^{(t)}_{ij} = \sum_{i=1}^{n}\underbrace{\sum_{j=1}^{K} p^{(t)}_{ij}}_{1} = n.$$
Therefore,
$$\pi^{(t+1)}_j = \frac{1}{n}\sum_{i=1}^{n} p^{(t)}_{ij}.$$
Note that indeed π_j^(t+1) ≥ 0, because the p^(t)_{ij} terms are probabilities (see the E step).
84. To summarize:
E step:
$$p^{(t)}_{ij} \overset{\text{not.}}{=} P(z_{ij}=1\mid x_i;\mu^{(t)},(\sigma^2)^{(t)},\pi^{(t)}) = \frac{N\big(x_i\mid\mu^{(t)}_j,(\sigma^2_j)^{(t)}\big)\,\pi^{(t)}_j}{\sum_{l=1}^{K} N\big(x_i\mid\mu^{(t)}_l,(\sigma^2_l)^{(t)}\big)\,\pi^{(t)}_l}, \quad \text{where } N(x\mid\mu_j,\sigma_j^2) \overset{\text{def.}}{=} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\Big(-\frac{(x-\mu_j)^2}{2\sigma_j^2}\Big).$$
M step:
$$\pi^{(t+1)}_j = \frac{1}{n}\sum_{i=1}^{n} p^{(t)}_{ij}, \qquad \mu^{(t+1)}_j = \frac{\sum_{i=1}^{n} p^{(t)}_{ij}\,x_i}{\sum_{i=1}^{n} p^{(t)}_{ij}}, \qquad \big(\sigma^{(t+1)}_j\big)^2 = \frac{\sum_{i=1}^{n} p^{(t)}_{ij}\big(x_i-\mu^{(t+1)}_j\big)^2}{\sum_{i=1}^{n} p^{(t)}_{ij}}$$
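The summarized E and M steps translate almost line by line into code. A NumPy sketch of one full EM iteration for K components (my transcription, not from the slides):

```python
import numpy as np

def em_step(x, pi, mu, var):
    """One EM iteration for a K-component univariate Gaussian mixture:
    the E step computes responsibilities p[i, j], the M step applies the
    closed-form updates summarized above."""
    x = np.asarray(x, dtype=float)[:, None]                 # shape (n, 1)
    dens = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    p = dens * pi
    p /= p.sum(axis=1, keepdims=True)                       # E step
    nj = p.sum(axis=0)                                      # effective cluster sizes
    pi_new = nj / len(x)                                    # pi^(t+1)
    mu_new = (p * x).sum(axis=0) / nj                       # mu^(t+1)
    var_new = (p * (x - mu_new) ** 2).sum(axis=0) / nj      # (sigma^(t+1))^2
    return pi_new, mu_new, var_new

# synthetic data from two well-separated components
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 7.0]), np.array([4.0, 4.0])
for _ in range(50):
    pi, mu, var = em_step(x, pi, mu, var)
# mu should end up near (0, 6), var near (1, 1)
```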
85. Example: modelling the waiting and eruption times for the Old Faithful geyser (Yellowstone Park, USA). Michael Eichler (University of Chicago, Statistics course (24600), Spring 2004)

R code:

```r
# Y: the vector of observed waiting times (loaded beforehand)
emstep <- function(Y, p) {
  EZ <- p[1] * dnorm(Y, p[2], sqrt(p[4])) /
        (p[1] * dnorm(Y, p[2], sqrt(p[4])) + (1 - p[1]) * dnorm(Y, p[3], sqrt(p[5])))
  p[1] <- mean(EZ)
  p[2] <- sum(EZ * Y) / sum(EZ)
  p[3] <- sum((1 - EZ) * Y) / sum(1 - EZ)
  p[4] <- sum(EZ * (Y - p[2])^2) / sum(EZ)
  p[5] <- sum((1 - EZ) * (Y - p[3])^2) / sum(1 - EZ)
  p
}

emiteration <- function(Y, p, n = 10) {
  for (i in 1:n) {
    p <- emstep(Y, p)
  }
  p
}

p <- c(0.5, 40, 90, 20, 20)
p <- emiteration(Y, p, 20)
p
```

Starting values: p^(0) = 0.4, µ₁^(0) = 40, σ₁^(0) = 4, µ₂^(0) = 90, σ₂^(0) = 4.
86. [Table of EM iterates k, p^(k), µ₁^(k), µ₂^(k), σ₁^(k), σ₂^(k); the numerical values did not survive transcription.]
87. Exemplifying some methodological issues regarding the application of the EM algorithmic schema (using a simple EM/GMM algorithm on data from R, with π₁ = π₂ = 1/2). CMU, 2007 spring, Eric Xing, final exam, pr. 1.8
88. A long time ago there was a village amidst hundreds of lakes. Two types of fish lived in the region, but only one type in each lake. The two types looked exactly the same, smelled exactly the same when cooked, and had the exact same delicious taste, except that one was poisonous and would kill any villager who ate it. The only other difference between the fish was their effect on the pH (acidity) of the lake they occupied. The pH of lakes occupied by the non-poisonous type was distributed according to a Gaussian with unknown mean µ_safe and variance σ²_safe, and the pH of lakes occupied by the poisonous type was distributed according to a different Gaussian with unknown mean µ_deadly and variance σ²_deadly. (Poisonous fish tended to cause slightly more acidic conditions.)
Naturally, the villagers turned to machine learning for help. However, there was much debate about the right way to apply EM to their problem. For each of the following procedures, indicate whether it is an accurate implementation of Expectation-Maximization and whether it will provide a reasonable estimate of the parameters µ and σ² for each class.
89.
a. Guess initial values of µ and σ² for each class. (1) For each lake, find the most likely class of fish for the lake. (2) Update the µ and σ² values using their maximum-likelihood estimates based on these predictions. Iterate (1) and (2) until convergence.
b. For each lake, guess an initial probability that it is safe. (1) Using these probabilities, find the maximum-likelihood estimates of µ and σ² for each class. (2) Use these estimates of µ and σ² to re-estimate the lake-safety probabilities. Iterate (1) and (2) until convergence.
c. Compute the mean and variance of the pH levels across all lakes. Use these values for the µ and σ² of each class of fish. (1) Use the µ and σ² values of each class to compute the belief that each lake contains poisonous fish. (2) Find the maximum-likelihood values for µ and σ². Iterate (1) and (2) until convergence.
90. Solution:
a. It will do OK if we give sensible enough initial values for µ and σ².
b. OK; this is the same as a. after the first M step. (See the general EM algorithmic schema on the next slide.)
c. This will stay stuck at the initial µ and σ² values. In the E step we get
$$P(\text{safe}\mid x) = \frac{P(x\mid\text{safe})\,P(\text{safe})}{P(x\mid\text{safe})\,P(\text{safe}) + P(x\mid\text{deadly})\,P(\text{deadly})} = \frac{1}{2},$$
since on one side we assume P(safe) = P(deadly) = 1/2, and on the other side P(x|safe) = P(x|deadly), because µ_safe = µ_deadly and σ²_safe = σ²_deadly. In the M step, µ and σ² will not change, since they are again computed from all lakes, weighted equally.
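A quick numerical check of case c, on synthetic pH data of my own making: with both components initialized to the global mean and variance, and equal priors, the responsibilities stay at 1/2 and the EM update returns exactly the initial parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(6.5, 0.3, 100), rng.normal(7.4, 0.3, 100)])  # pH values

mu = np.array([x.mean(), x.mean()])   # both classes start at the global mean
var = np.array([x.var(), x.var()])    # ... and at the global variance
pi = np.array([0.5, 0.5])

# E step: the two components are identical, so every responsibility is 1/2
dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
p = dens * pi
p /= p.sum(axis=1, keepdims=True)

# M step: the updated parameters coincide with the initial ones
mu_new = (p * x[:, None]).sum(axis=0) / p.sum(axis=0)
var_new = (p * (x[:, None] - mu_new) ** 2).sum(axis=0) / p.sum(axis=0)
```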
91. The [general] EM algorithmic schema:
start with an initial hypothesis h^(t), t = 0;
E step: compute the distribution P(Z | X; h^(t)) and the expectations E[Z | X, h^(t)];
M step: h^(t+1) = argmax_h E_{P(Z|X;h^(t))}[ln P(Y | h)], where Y = (X, Z) denotes the complete data;
increment t (++t) and repeat until P(X | h) converges.
92. The EM algorithm for modeling mixtures of multi-variate Gaussians with diagonal covariance matrices. Utah University, Piyush Rai, ML course (CS5350/6350), 2009 fall, lecture notes on Gaussian Mixture Models [adapted by Liviu Ciortuz]
93. Consider the mixture of Gaussian distributions
$$\mathrm{gmm}(x) = \sum_{j=1}^{K} \pi_j\, N(x;\mu_j,\sigma^2 I),$$
where x ∈ R^d; the a priori selection probabilities π_j ∈ R satisfy (as usual) the constraints π_j ≥ 0 for j = 1,...,K and Σ_{j=1}^{K} π_j = 1; the means of the Gaussians j = 1,...,K are the vectors µ_j ∈ R^d, while the covariance matrices of these Gaussians are identical and, moreover, have the particular form σ²I, with σ ∈ R, σ > 0, I being the identity matrix.
Consider the instances x₁,...,xₙ ∈ R^d generated according to the probability distribution above (gmm). In this exercise you will derive the update rules of the [E and M steps of the] EM algorithm that estimates the parameters π := (π₁,...,π_K), µ := (µ₁,...,µ_K) and σ.
94. a. Recall that the density function of the multivariate (d-dimensional) Gaussian distribution with mean µ ∈ R^d and covariance matrix Σ ∈ R^{d×d} is
$$\frac{1}{(\sqrt{2\pi})^d\sqrt{|\Sigma|}}\exp\Big(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big),$$
where x and µ are considered column vectors in R^d and ⊤ denotes transposition of vectors/matrices. Bring this expression to its simplest form for the case Σ = σ²I. We recommend using the fact that ‖x − µ‖² = (x − µ)⊤(x − µ) = (x − µ)·(x − µ), where · denotes the scalar product of vectors.
Remark: the first of the last two equalities involves a slight abuse of notation (or, rather, a convention): a real 1×1 matrix is identified with the real number that is its single element. The same abuse/convention also occurred in writing the exp(...) expression above.
95. Answer: Σ = σ²I ⇒ Σ⁻¹ = (1/σ²)I, and also |Σ| = (σ²)^d ⇒ √|Σ| = σ^d, since σ > 0. Therefore,
$$N(x;\mu,\Sigma=\sigma^2 I) = \frac{1}{(\sqrt{2\pi})^d\sqrt{|\Sigma|}}\exp\Big(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big) = \frac{1}{(\sqrt{2\pi})^d\,\sigma^d}\exp\Big(-\frac{1}{2}(x-\mu)^\top\frac{1}{\sigma^2}(x-\mu)\Big) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d}\exp\Big(-\frac{1}{2\sigma^2}\|x-\mu\|^2\Big)$$
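The simplification can be sanity-checked numerically: the general formula and the isotropic form above agree for Σ = σ²I (a sketch; the test point and parameters are arbitrary):

```python
import numpy as np

def gauss_general(x, mu, cov):
    """General multivariate Gaussian density (the formula from part a)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / (np.sqrt(2 * np.pi) ** d * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gauss_isotropic(x, mu, sigma):
    """Simplified form derived above for cov = sigma^2 * I."""
    d = len(mu)
    return np.exp(-np.sum((x - mu) ** 2) / (2 * sigma ** 2)) \
           / (np.sqrt(2 * np.pi) * sigma) ** d

x = np.array([1.0, -0.5, 2.0])
mu = np.array([0.5, 0.0, 1.0])
sigma = 0.8
# gauss_general(x, mu, sigma**2 * np.eye(3)) equals gauss_isotropic(x, mu, sigma)
```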
96. b. To each instance x_i we associate an indicator vector (more precisely, a vector of random variables) z_i ∈ {0,1}^K, with z_ij = 1 if and only if x_i was generated by the Gaussian N(x; µ_j, σ²I). For the E step of the EM algorithm, you will first prove that the expectation E[z_ij] := E[z_ij | x_i; π, µ, σ], where x_i = (x_{i,1},...,x_{i,d}) ∈ R^d, µ := (µ₁,...,µ_K) ∈ (R^d)^K and σ ∈ R₊, has the value P(z_ij = 1 | x_i; π, µ, σ); then you will work out the formula for computing this probability, using Bayes' theorem.
97. Answer: For the sake of simplicity, we will write E[z_ij] for E[z_ij | x_i; π, µ, σ]. So,
$$E[z_{ij}] = 0\cdot P(z_{ij}=0\mid x_i;\pi,\mu,\sigma) + 1\cdot P(z_{ij}=1\mid x_i;\pi,\mu,\sigma) = P(z_{ij}=1\mid x_i;\pi,\mu,\sigma)$$
By Bayes' formula,
$$P(z_{ij}=1\mid x_i;\pi,\mu,\sigma) = \frac{P(x_i\mid z_{ij}=1;\pi,\mu,\sigma)\,P(z_{ij}=1;\pi,\mu,\sigma)}{\sum_{j'=1}^{K} P(x_i\mid z_{ij'}=1;\pi,\mu,\sigma)\,P(z_{ij'}=1;\pi,\mu,\sigma)} \overset{a.}{=} \frac{\dfrac{1}{(\sqrt{2\pi}\,\sigma)^d}\exp\Big(-\dfrac{1}{2\sigma^2}\|x_i-\mu_j\|^2\Big)\,\pi_j}{\sum_{j'=1}^{K}\dfrac{1}{(\sqrt{2\pi}\,\sigma)^d}\exp\Big(-\dfrac{1}{2\sigma^2}\|x_i-\mu_{j'}\|^2\Big)\,\pi_{j'}}$$
$$= \frac{\pi_j\exp\Big(-\dfrac{1}{2\sigma^2}\|x_i-\mu_j\|^2\Big)}{\sum_{j'=1}^{K}\pi_{j'}\exp\Big(-\dfrac{1}{2\sigma^2}\|x_i-\mu_{j'}\|^2\Big)} \quad (7)$$
98. c. Show that the log-likelihood of the "complete" data with respect to the parameters π, µ and σ is
$$\ln p(x,z\mid\pi,\mu,\sigma) = \sum_{i=1}^{n}\sum_{j=1}^{K} z_{ij}\big(\ln\pi_j + \ln N(x_i;\mu_j,\sigma^2 I)\big),$$
where x := (x₁,...,xₙ) and z := (z₁,...,zₙ). Then derive the expression of the "auxiliary function" Q(π,µ,σ | π^(t),µ^(t),σ^(t)) := E[ln p(x,z | π,µ,σ)], with the mention that this expectation is computed with respect to the distribution(s) P(z_ij | x_i, π^(t), µ^(t), σ^(t)).
99. Answer: The log-likelihood of the complete data is
$$\ln p(x,z\mid\pi,\mu,\sigma) \overset{\text{def.}}{=} \ln p\big((x_1,z_1),\dots,(x_n,z_n)\mid\pi,\mu,\sigma\big) \overset{\text{i.i.d.}}{=} \ln\prod_{i=1}^{n} p(x_i,z_i\mid\pi,\mu,\sigma)$$
$$\overset{\text{mult. rule}}{=} \sum_{i=1}^{n}\Big[\ln p(x_i\mid z_i;\pi,\mu,\sigma) + \ln\underbrace{p(z_i\mid\pi,\mu,\sigma)}_{\pi_j}\Big] = \sum_{i=1}^{n}\sum_{j=1}^{K} z_{ij}\big[\ln N(x_i\mid\mu_j,\sigma^2 I) + \ln\pi_j\big],$$
since z_i = (z_{i,1},...,z_{i,K}) with z_{i,j} = 1 and z_{i,j'} = 0 for all j' ≠ j. Furthermore, using the result we got at part a, we can write
$$\ln p(x,z\mid\pi,\mu,\sigma) = \sum_{i=1}^{n}\sum_{j=1}^{K} z_{ij}\Big[-\frac{d}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|x_i-\mu_j\|^2 + \ln\pi_j\Big].$$
Finally, using the linearity of expectation,
$$Q(\pi,\mu,\sigma\mid\pi^{(t)},\mu^{(t)},\sigma^{(t)}) \overset{\text{def.}}{=} E[\ln p(x,z\mid\pi,\mu,\sigma)] = \sum_{i=1}^{n}\sum_{j=1}^{K} E[z_{ij}]\Big[-\frac{d}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|x_i-\mu_j\|^2 + \ln\pi_j\Big].$$
100. For the M step, in the setting given in the statement, we have to solve the following optimization problem:
$$(\pi^{(t+1)},\mu^{(t+1)},\sigma^{(t+1)}) = \operatorname*{argmax}_{\pi,\mu,\sigma}\; Q(\pi,\mu,\sigma\mid\pi^{(t)},\mu^{(t)},\sigma^{(t)}), \quad (8)$$
with the constraints π_j^(t+1) ≥ 0 for j = 1,...,K and Σ_{j=1}^{K} π_j^(t+1) = 1. This problem is solved by optimizing its objective function separately with respect to the variables π, µ and σ.
d. Apply the method of Lagrange multipliers to solve the constrained optimization problem (8) with respect to the variables π (only).
101. Answer: For now, we will ignore the constraints π_j ≥ 0. The Lagrangian associated with our optimization problem is
$$L(\pi,\mu,\sigma,\lambda) = Q(\pi,\mu,\sigma\mid\pi^{(t)},\mu^{(t)},\sigma^{(t)}) + \lambda\Big(1-\sum_{j=1}^{K}\pi_j\Big), \quad (9)$$
with λ ∈ R playing the role of Lagrange multiplier. (After solving this optimization problem it will be seen that indeed π_j ≥ 0 for all j = 1,...,K.) Taking the partial derivative of the Lagrangian with respect to π_j (for j ∈ {1,...,K}) and setting it to zero,
$$\frac{\partial L}{\partial\pi_j} = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} E[z_{ij}]\,\frac{1}{\pi_j} - \lambda = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} E[z_{ij}] = \lambda\,\pi_j,$$
we will obtain the solution
$$\pi^{(t+1)}_j = \frac{1}{\lambda}\sum_{i=1}^{n} E[z_{ij}]. \quad (10)$$
102. Moreover, since these solutions should satisfy the constraint Σ_{j=1}^{K} π_j^(t+1) = 1, it follows that
$$\sum_{j=1}^{K}\frac{1}{\lambda}\sum_{i=1}^{n} E[z_{ij}] = 1 \;\Leftrightarrow\; \frac{1}{\lambda}\sum_{i=1}^{n}\underbrace{\sum_{j=1}^{K} P(z_{ij}=1\mid x_i;\pi,\mu,\sigma)}_{1\;(\text{by part b})} = 1 \;\Leftrightarrow\; \frac{1}{\lambda}\sum_{i=1}^{n} 1 = 1 \;\Leftrightarrow\; \lambda = n.$$
By replacing this value of λ back into relation (10), we get
$$\pi^{(t+1)}_j = \frac{1}{n}\sum_{i=1}^{n} E[z_{ij}]. \quad (11)$$
It is obvious that π_j^(t+1) ≥ 0.
103. e. Optimize the function Q (i.e., solve the optimization problem (8)) with respect to the variables µ.
Answer: Recall the notation µ = (µ₁,...,µ_K) ∈ (R^d)^K, with µ_j ∈ R^d for j = 1,...,K.
$$\nabla_{\mu_j} Q(\pi,\mu,\sigma\mid\pi^{(t)},\mu^{(t)},\sigma^{(t)}) = \nabla_{\mu_j}\sum_{i=1}^{n}\sum_{j'=1}^{K} E[z_{ij'}]\Big[-\frac{d}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|x_i-\mu_{j'}\|^2 + \ln\pi_{j'}\Big]$$
$$= \sum_{i=1}^{n} E[z_{ij}]\Big[-\frac{1}{2\sigma^2}\nabla_{\mu_j}\|x_i-\mu_j\|^2\Big] = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} E[z_{ij}]\cdot 2(x_i-\mu_j)(-1) = \frac{1}{\sigma^2}\Big[\Big(\sum_{i=1}^{n} E[z_{ij}]\,x_i\Big) - \Big(\sum_{i=1}^{n} E[z_{ij}]\Big)\mu_j\Big]$$
After equating this expression to zero (in fact, to the null vector (0,...,0) ∈ R^d), we get the solution:
$$\mu^{(t+1)}_j = \frac{\sum_{i=1}^{n} E[z_{ij}]\,x_i}{\sum_{i=1}^{n} E[z_{ij}]}. \quad (12)$$
104. f. Optimize the function Q (i.e., solve the optimization problem (8)) with respect to the variable σ.
Answer: The partial derivative of Q(π,µ,σ | π^(t),µ^(t),σ^(t)) with respect to σ is:
$$\frac{\partial Q}{\partial\sigma} = \sum_{i=1}^{n}\sum_{j=1}^{K} E[z_{ij}]\,\frac{\partial}{\partial\sigma}\Big[-\frac{d}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|x_i-\mu_j\|^2\Big] = \sum_{i=1}^{n}\sum_{j=1}^{K} E[z_{ij}]\Big[-\frac{d}{2}\,\frac{2\sigma}{\sigma^2} + \frac{2}{2\sigma^3}\|x_i-\mu_j\|^2\Big]$$
$$= \sum_{i=1}^{n}\sum_{j=1}^{K} E[z_{ij}]\Big[-\frac{d}{\sigma} + \frac{1}{\sigma^3}\|x_i-\mu_j\|^2\Big] = \frac{1}{\sigma^3}\Big(-d\sigma^2\sum_{i=1}^{n}\underbrace{\sum_{j=1}^{K} E[z_{ij}]}_{1} + \sum_{i=1}^{n}\sum_{j=1}^{K} E[z_{ij}]\,\|x_i-\mu_j\|^2\Big) = \frac{1}{\sigma^3}\Big(-nd\sigma^2 + \sum_{i=1}^{n}\sum_{j=1}^{K} E[z_{ij}]\,\|x_i-\mu_j\|^2\Big)$$
105. By equating ∂Q/∂σ to 0 we get the solution:
$$\big(\sigma^{(t+1)}\big)^2 = \frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{K} E[z_{ij}]\,\|x_i-\mu^{(t+1)}_j\|^2 \ge 0. \quad (13)$$
It is not too difficult to see that this solution leads to the maximization of Q. (Note that σ > 0 and the expression −ndσ² + Σᵢ Σⱼ E[z_ij] ‖x_i − µ_j‖², as a function of σ², is linear and decreasing.)
106. g. Summarize the results obtained at the points above (b and d–f), writing in pseudocode the EM algorithm that solves the Gaussian mixture from the statement.
Answer: By gathering relations (7), (11), (12) and (13), we are now able to write the pseudocode of our algorithm:
Initialize the a priori probabilities π, the means µ, and the variance σ².
Iterate until a certain termination condition is met:
E step: compute the expectations (i.e., a posteriori probabilities) of the z variables:
$$p^{(t)}_{ij} \overset{\text{not.}}{=} E[z_{ij}\mid x_i;\pi^{(t)},\mu^{(t)},\sigma^{(t)}] = \frac{\pi^{(t)}_j\,N\big(x_i;\mu^{(t)}_j,(\sigma^{(t)})^2 I\big)}{\sum_{j'=1}^{K}\pi^{(t)}_{j'}\,N\big(x_i;\mu^{(t)}_{j'},(\sigma^{(t)})^2 I\big)}$$
M step: compute new values for π, µ and σ:
$$\pi^{(t+1)}_j = \frac{1}{n}\sum_{i=1}^{n} p^{(t)}_{ij}, \qquad \mu^{(t+1)}_j = \frac{\sum_{i=1}^{n} p^{(t)}_{ij}\,x_i}{\sum_{i=1}^{n} p^{(t)}_{ij}}, \qquad \big(\sigma^{(t+1)}\big)^2 = \frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{K} p^{(t)}_{ij}\,\|x_i-\mu^{(t+1)}_j\|^2$$
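The pseudocode above can be transcribed into NumPy almost directly (a sketch; the synthetic data and variable names are mine):

```python
import numpy as np

def em_step_isotropic(X, pi, mu, sigma2):
    """One EM iteration for a mixture of K isotropic d-variate Gaussians
    sharing the spherical covariance sigma2 * I."""
    n, d = X.shape
    # E step, eq. (7): the common (sqrt(2*pi)*sigma)^d factor cancels
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
    p = pi * np.exp(-sq / (2 * sigma2))
    p /= p.sum(axis=1, keepdims=True)
    # M step, eqs. (11)-(13)
    nj = p.sum(axis=0)                                         # effective cluster sizes
    pi_new = nj / n
    mu_new = (p.T @ X) / nj[:, None]
    sq_new = ((X[:, None, :] - mu_new[None, :, :]) ** 2).sum(axis=2)
    sigma2_new = (p * sq_new).sum() / (n * d)
    return pi_new, mu_new, sigma2_new

# two well-separated spherical clusters in R^2
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (150, 2)),
               rng.normal([5.0, 5.0], 0.5, (150, 2))])
pi, mu, sigma2 = np.array([0.5, 0.5]), np.array([[1.0, 0.0], [4.0, 5.0]]), 1.0
for _ in range(30):
    pi, mu, sigma2 = em_step_isotropic(X, pi, mu, sigma2)
# mu should approach (0,0) and (5,5); sigma2 should approach 0.25
```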
107. Important remark: a slight generalization. Suppose that our mixture model is made of Gaussians with unrestricted diagonal covariance matrices, i.e.,
$$X_i \mid Z_i = j \;\sim\; N\big((\mu_{j,1},\dots,\mu_{j,d})^\top,\ \mathrm{diag}(\sigma_{j,1}^2,\sigma_{j,2}^2,\dots,\sigma_{j,d}^2)\big), \qquad Z_i \sim \mathrm{Categorical}(\pi_1,\dots,\pi_K).$$
It can easily be proven (by going along the lines of the above proof) that the only update relation that changes in the formulation of the EM algorithm is (13). It becomes:
$$\big(\sigma^{(t+1)}_{j,k}\big)^2 = \frac{\sum_{i=1}^{n} E[z_{ij}]\,(x_{i,k}-\mu^{(t+1)}_{j,k})^2}{\sum_{i=1}^{n} E[z_{ij}]} \ge 0 \qquad \text{for } j=1,\dots,K \text{ and } k=1,\dots,d. \quad (14)$$
This is indeed what we would expect, given on one hand the fact that the Gaussian components are mutually independent, and on the other hand the updating formulas [that we have already obtained] for the EM algorithm for solving mixtures of uni-variate Gaussians [of unrestricted form].
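Relation (14) amounts to a one-line change in code. A sketch, assuming the responsibilities p (shape (n, K)) and the updated means (shape (K, d)) have already been computed as before; the mock data here are only to make the snippet self-contained:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, d = 6, 2, 3
X = rng.normal(size=(n, d))
p = rng.random((n, K))
p /= p.sum(axis=1, keepdims=True)                     # mock responsibilities
mu_new = (p.T @ X) / p.sum(axis=0)[:, None]           # eq. (12), per component

# eq. (14): one variance per component j and per dimension k, shape (K, d)
var_new = np.einsum('ij,ijk->jk', p, (X[:, None, :] - mu_new[None, :, :]) ** 2) \
          / p.sum(axis=0)[:, None]
```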
More informationEM (cont.) November 26 th, Carlos Guestrin 1
EM (cont.) Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 26 th, 2007 1 Silly Example Let events be grades in a class w 1 = Gets an A P(A) = ½ w 2 = Gets a B P(B) = µ
More informationClustering and Gaussian Mixture Models
Clustering and Gaussian Mixture Models Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 25, 2016 Probabilistic Machine Learning (CS772A) Clustering and Gaussian Mixture Models 1 Recap
More informationStatistical NLP Spring Digitizing Speech
Statistical NLP Spring 2008 Lecture 10: Acoustic Models Dan Klein UC Berkeley Digitizing Speech 1 Frame Extraction A frame (25 ms wide) extracted every 10 ms 25 ms 10ms... a 1 a 2 a 3 Figure from Simon
More informationDigitizing Speech. Statistical NLP Spring Frame Extraction. Gaussian Emissions. Vector Quantization. HMMs for Continuous Observations? ...
Statistical NLP Spring 2008 Digitizing Speech Lecture 10: Acoustic Models Dan Klein UC Berkeley Frame Extraction A frame (25 ms wide extracted every 10 ms 25 ms 10ms... a 1 a 2 a 3 Figure from Simon Arnfield
More informationModule Master Recherche Apprentissage et Fouille
Module Master Recherche Apprentissage et Fouille Michele Sebag Balazs Kegl Antoine Cornuéjols http://tao.lri.fr 19 novembre 2008 Unsupervised Learning Clustering Data Streaming Application: Clustering
More informationData Preprocessing. Cluster Similarity
1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M
More informationUtilizarea limbajului SQL pentru cereri OLAP. Mihaela Muntean 2015
Utilizarea limbajului SQL pentru cereri OLAP Mihaela Muntean 2015 Cuprins Implementarea operatiilor OLAP de baza in SQL -traditional: Rollup Slice Dice Pivotare SQL-2008 Optiunea ROLLUP Optiunea CUBE,
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationEcuatii si inecuatii de gradul al doilea si reductibile la gradul al doilea. Ecuatii de gradul al doilea
Ecuatii si inecuatii de gradul al doilea si reductibile la gradul al doilea Ecuatia de forma Ecuatii de gradul al doilea a + b + c = 0, (1) unde a, b, c R, a 0, - variabila, se numeste ecuatie de gradul
More informationSIMULAREA DECIZIEI FINANCIARE
SIMULAREA DECIZIEI FINANCIARE Conf. univ. dr. Nicolae BÂRSAN-PIPU T5.1 TEMA 5 DISTRIBUŢII DISCRETE T5. Cuprins T5.3 5.1 Variabile aleatoare discrete 5. Distribuţia de probabilitate a unei variabile aleatoare
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models
Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationStatistical NLP Spring The Noisy Channel Model
Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationInteligenta Artificiala
Inteligenta Artificiala Universitatea Politehnica Bucuresti Anul universitar 2010-2011 Adina Magda Florea http://turing.cs.pub.ro/ia_10 si curs.cs.pub.ro 1 Curs nr. 4 Cautare cu actiuni nedeterministe
More informationMachine Learning 4771
Machine Learning 4771 Instructor: Tony Jebara Topic 11 Maximum Likelihood as Bayesian Inference Maximum A Posteriori Bayesian Gaussian Estimation Why Maximum Likelihood? So far, assumed max (log) likelihood
More informationScribe to lecture Tuesday March
Scribe to lecture Tuesday March 16 2004 Scribe outlines: Message Confidence intervals Central limit theorem Em-algorithm Bayesian versus classical statistic Note: There is no scribe for the beginning of
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationComputer Vision Group Prof. Daniel Cremers. 14. Clustering
Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it
More informationCourse 495: Advanced Statistical Machine Learning/Pattern Recognition
Course 495: Advanced Statistical Machine Learning/Pattern Recognition Goal (Lecture): To present Probabilistic Principal Component Analysis (PPCA) using both Maximum Likelihood (ML) and Expectation Maximization
More informationBayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington
Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The
More informationMachine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.
Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationLecture 11: Unsupervised Machine Learning
CSE517A Machine Learning Spring 2018 Lecture 11: Unsupervised Machine Learning Instructor: Marion Neumann Scribe: Jingyu Xin Reading: fcml Ch6 (Intro), 6.2 (k-means), 6.3 (Mixture Models); [optional]:
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationGaussian Mixture Models
Gaussian Mixture Models David Rosenberg, Brett Bernstein New York University April 26, 2017 David Rosenberg, Brett Bernstein (New York University) DS-GA 1003 April 26, 2017 1 / 42 Intro Question Intro
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationAvem 6 tipuri de simboluri in logica predicatelor:
Semantica Avem 6 tipuri de simboluri in logica predicatelor: Predicate: p, q, r,, p1, q2 etc. Constante: a, b, c,, z, a1, b4,, ion, mihai, labus etc. Variabile: x, y, z, x1, y1, z4 etc. Conective:,,,,
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationMIXTURE MODELS AND EM
Last updated: November 6, 212 MIXTURE MODELS AND EM Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Simon Prince, University College London Sergios Theodoridis,
More information