
Clustering

Exemplifying the application of hierarchical agglomerative clustering (single-, complete- and average-linkage)
CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, HW4, pr. 2.a, extended by Liviu Ciortuz

The table below is a distance matrix for 6 objects.

[6 × 6 distance matrix over the objects A, B, C, D, E, F; the numeric entries were not recovered]

Show the final result of hierarchical clustering with single-, complete- and average-linkage by drawing the corresponding dendrograms.

Solution:

Single-linkage: the successive merges are A+B, C+D, AB+CD, ABCD+F and finally ABCDF+E, the updated distance matrices being computed over the clusterings {AB, C, D, E, F}, {AB, CD, E, F}, {ABCD, E, F} and {ABCDF, E}.
[dendrogram with leaves ordered A, B, C, D, F, E]

Complete-linkage: the successive merges are A+B, C+D, AB+F, CD+E and finally ABF+CDE, the updated distance matrices being computed over the clusterings {AB, C, D, E, F}, {AB, CD, E, F}, {ABF, CD, E} and {ABF, CDE}.
[dendrogram with leaves ordered A, B, F, C, D, E]

Average-linkage: the successive merges are A+B, C+D, AB+CD, ABCD+F and finally ABCDF+E, with the inter-cluster distances updated as follows:
d(AB,C) = (d(A,C) + d(B,C))/2,  d(AB,D) = (d(A,D) + d(B,D))/2,  d(AB,E) = (d(A,E) + d(B,E))/2,  d(AB,F) = (d(A,F) + d(B,F))/2,
d(CD,E) = (d(C,E) + d(D,E))/2,  d(CD,F) = (d(C,F) + d(D,F))/2,
d(AB,CD) =(*) (2 d(AB,C) + 2 d(AB,D)) / 4,
d(ABCD,E) =(*) (2 d(AB,E) + 2 d(CD,E)) / 4,
d(ABCD,F) =(*) (2 d(AB,F) + 2 d(CD,F)) / 4.

Average-linkage (cont'd):
d(ABCDF,E) =(*) (4 d(ABCD,E) + d(F,E)) / 5.
[dendrogram with leaves ordered A, B, C, D, F, E]

Note: For the proof of the (*) relation, see problem CMU, 2010 fall, Aarti Singh, HW3, pr. 4.2:
d(X ∪ Y, Z) = (|X| · d(X,Z) + |Y| · d(Y,Z)) / (|X| + |Y|).
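These merge sequences can also be checked mechanically. The R sketch below uses a hypothetical 6×6 distance matrix (the numeric entries of the original table are not reproduced above, so the values here are only illustrative) and R's built-in hclust, whose "average" method implements the same weighted-average rule (*):

# Hypothetical stand-in for the 6x6 distance matrix of the exercise.
d <- as.dist(matrix(c(
  0.00, 0.10, 0.40, 0.50, 0.80, 0.30,
  0.10, 0.00, 0.45, 0.55, 0.85, 0.35,
  0.40, 0.45, 0.00, 0.15, 0.60, 0.70,
  0.50, 0.55, 0.15, 0.00, 0.65, 0.75,
  0.80, 0.85, 0.60, 0.65, 0.00, 0.90,
  0.30, 0.35, 0.70, 0.75, 0.90, 0.00),
  nrow = 6, dimnames = list(LETTERS[1:6], LETTERS[1:6])))

# One dendrogram per linkage; the merge order can be read off each plot.
hc.single   <- hclust(d, method = "single")
hc.complete <- hclust(d, method = "complete")
hc.average  <- hclust(d, method = "average")   # UPGMA, i.e. the weighted rule (*)
par(mfrow = c(1, 3))
plot(hc.single); plot(hc.complete); plot(hc.average)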

Exemplifying the application of hierarchical divisive clustering and the relationship between single-linkage hierarchies and Minimum Spanning Trees (MSTs)
CMU, 2009 spring, Ziv Bar-Joseph, final exam, pr. 9.3

Hierarchical clustering may be bottom-up or top-down. In this problem we will see whether a top-down clustering algorithm can be exactly analogous to a bottom-up clustering algorithm. Consider the following top-down clustering algorithm:
1. Calculate the pairwise distance d(P_i, P_j) between every two objects P_i and P_j in the set of objects to be clustered, and build a complete graph on the set of objects, with edge weights being the corresponding distances.
2. Generate the Minimum Spanning Tree of the graph, i.e., choose the subset of edges E' with minimum sum of weights such that G' = (P, E') is a single connected tree.
3. Throw out the edge with the heaviest weight to generate two disconnected trees, corresponding to the top-level clusters.
4. Repeat the previous step recursively on the lower-level clusters to generate a top-down clustering of the set of n objects.

a. Apply this algorithm to the dataset given in the nearby table, using the Euclidean distance.
b. Does this top-down algorithm perform analogously to any bottom-up algorithm that you have encountered in class? Why?

Point   x    y
P1      1    2
P2      2    2
P3      3    6
P4      6    4
P5      6    6
P6     12   12

Solution:
a. Kruskal's algorithm adds the MST edges in the order:
1. (P1, P2), cost 1;  2. (P4, P5), cost 2;  3. (P3, P5), cost 3;  4. (P2, P3), cost √17;  5. (P5, P6), cost 6√2.
Prim's algorithm adds them in the order:
1. (P1, P2), cost 1;  2. (P2, P3), cost √17;  3. (P3, P5), cost 3;  4. (P5, P4), cost 2;  5. (P5, P6), cost 6√2.
[figure: the points P1, ..., P6 plotted in the xy-plane, with the MST edges drawn]

[dendrogram of the resulting top-down clustering, over the leaves P1(1,2), P2(2,2), P3(3,6), P4(6,4), P5(6,6), P6(12,12)]

Note: If there is only one MST for the given dataset, then both Kruskal's and Prim's algorithms will find it. Otherwise, the two algorithms can produce different results. One can see (both on this dataset and also in general) that Kruskal's algorithm is exactly analogous to the single-linkage bottom-up clustering algorithm. Therefore, there is indeed a bottom-up equivalent of the top-down clustering algorithm presented in this exercise.
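The equivalence can be seen numerically on this very dataset; a small R sketch, using only base R and the coordinates of P1, ..., P6 from the table above:

# Coordinates of P1..P6 from the exercise.
pts <- matrix(c(1,2, 2,2, 3,6, 6,4, 6,6, 12,12),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("P", 1:6), c("x", "y")))

d  <- dist(pts)                      # Euclidean distances
hc <- hclust(d, method = "single")   # bottom-up single linkage

# The merge heights coincide with the MST edge weights in the order
# Kruskal's algorithm picks them: 1, 2, 3, sqrt(17), 6*sqrt(2).
print(hc$height)
print(c(1, 2, 3, sqrt(17), 6 * sqrt(2)))
plot(hc)   # cutting this dendrogram top-down = removing the heaviest MST edges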

Hierarchical [bottom-up] clustering: Ward's metric
CMU, 2010 fall, Aarti Singh, HW3, pr. 4.1

In this problem you will analyze an alternative approach to quantifying the distance between two disjoint clusters, proposed by Joe H. Ward in 1963. We will call it Ward's metric. Ward's metric simply says that the distance between two disjoint clusters, X and Y, is how much the sum of squares will increase when we merge them. More formally,

Δ(X,Y) = Σ_{i ∈ X∪Y} ||x_i − μ_{X∪Y}||² − Σ_{i ∈ X} ||x_i − μ_X||² − Σ_{i ∈ Y} ||x_i − μ_Y||²,   (1)

where μ_C is the centroid of cluster C and x_i is a data point. By definition, here we will consider μ_X = (1/n_X) Σ_{i ∈ X} x_i, where n_X is the number of elements in X.

Δ(X,Y) can be thought of as the merging cost of combining clusters X and Y into one cluster. That is, when Ward's metric is used as a closeness measure, agglomerative clustering merges the two clusters with the lowest merging cost.

a. Can you reduce the formula in equation (1) for Δ(X,Y) to a simpler form? Give the simplified formula.
Hint: Your formula should be in terms of the cluster sizes (let's denote them n_X and n_Y) and the squared distance ||μ_X − μ_Y||² between the cluster centroids μ_X and μ_Y only.

Solution:

Δ(X,Y) = Σ_{i∈X∪Y} ||x_i − μ_{X∪Y}||² − Σ_{i∈X} ||x_i − μ_X||² − Σ_{i∈Y} ||x_i − μ_Y||².

Expand every square and use Σ_{i∈X} x_i = n_X μ_X, Σ_{i∈Y} x_i = n_Y μ_Y and μ_{X∪Y} = (n_X μ_X + n_Y μ_Y)/(n_X + n_Y) (relations (2), (3) and (4) from the explicative notes below). The Σ ||x_i||² terms cancel, and what remains is

Δ(X,Y) = n_X ||μ_X||² + n_Y ||μ_Y||² − ||n_X μ_X + n_Y μ_Y||² / (n_X + n_Y)
       = [ (n_X + n_Y)(n_X ||μ_X||² + n_Y ||μ_Y||²) − ||n_X μ_X + n_Y μ_Y||² ] / (n_X + n_Y)
       = (n_X n_Y / (n_X + n_Y)) (||μ_X||² − 2 μ_X·μ_Y + ||μ_Y||²)
       = (n_X n_Y / (n_X + n_Y)) ||μ_X − μ_Y||².

Three explicative notes:

(2)  μ_X := (1/|X|) Σ_{x_i ∈ X} x_i  ⟹  Σ_{x_i ∈ X} x_i = |X| μ_X = n_X μ_X.

(3)  μ_{X∪Y} := (1/|X∪Y|) Σ_{x_i ∈ X∪Y} x_i  =(X∩Y=∅)=  (1/(|X|+|Y|)) ( Σ_{x_i ∈ X} x_i + Σ_{y_i ∈ Y} y_i )  =(2)=  (|X| μ_X + |Y| μ_Y)/(|X|+|Y|) = (n_X μ_X + n_Y μ_Y)/(n_X + n_Y).

(4)  ( 2 (n_X μ_X + n_Y μ_Y)/(n_X + n_Y) ) · Σ_{x_i ∈ X∪Y} x_i  =(2),(3)=  2 ||n_X μ_X + n_Y μ_Y||² / (n_X + n_Y).
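As a quick sanity check of the simplified formula, one can compare it with definition (1) on some made-up data (the two small clusters below are hypothetical):

# Hypothetical 2-D points split into two clusters X and Y.
X <- matrix(c(0,0, 1,0, 0,1), ncol = 2, byrow = TRUE)
Y <- matrix(c(4,4, 5,5), ncol = 2, byrow = TRUE)

ssq <- function(M) sum(sweep(M, 2, colMeans(M))^2)  # sum of squares around the centroid

# Ward's merging cost, directly from definition (1) ...
delta.def <- ssq(rbind(X, Y)) - ssq(X) - ssq(Y)

# ... and from the simplified formula n_X n_Y / (n_X + n_Y) * ||mu_X - mu_Y||^2.
nX <- nrow(X); nY <- nrow(Y)
delta.simple <- nX * nY / (nX + nY) * sum((colMeans(X) - colMeans(Y))^2)

c(delta.def, delta.simple)   # the two numbers coincide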

b. Give an interpretation for Ward's metric. What do you think it is trying to achieve?
Hint: The simplified formula from above will be helpful to answer this part.

Solution: With agglomerative hierarchical clustering, the sum of squares starts out at zero (because every point is in its own cluster) and then grows as we merge clusters. Ward's metric aims to keep this growth in the sum of squares as small as possible. This is nice if we believe that the sum of squares (as a measure of cluster coherence) should be small. Notice that both n_X and n_Y (in fact, [half of] their harmonic mean) show up in Δ(X,Y) as computed at part a. The intuition is that, given two pairs of clusters whose centers are equally far apart, Ward's method will prefer to merge the smaller ones.

c. Assume that you are given two pairs of clusters, P_1 and P_2. The centers of the two clusters in P_1 are farther apart than the centers of the two clusters in P_2. Using Ward's metric, does agglomerative clustering always choose to merge the two clusters in P_2 (those with less distance between their centers)? Why (not)? Justify your answer with a simple example.

Solution: No, not always. Which pair gets merged also depends on the sizes of the clusters. A simple counterexample: the sizes of the clusters in P_1 are 1 and 99, respectively, and 50 and 50 for P_2. Then the harmonic mean 2 n_X n_Y / (n_X + n_Y) of the cluster sizes is 1.98 for P_1 and 50 for P_2. If the squared distance between the centers of the clusters in P_1 is less than 25/0.99 = 25.(25) times the squared distance between the centers of the clusters in P_2, then Δ(P_1) will still be smaller, and so the clusters in P_1 are merged.

d. In clustering it is usually not trivial to decide what is the right number of clusters the data falls into. Using Ward's metric for agglomerative clustering, can you come up with a simple heuristic to pick the number of clusters k?

Solution: Ward's algorithm can give us a hint for picking a reasonable k through the merging cost. If the cost of merging increases a lot, we are probably going too far and losing a lot of structure. So one possible heuristic is to keep reducing k until the cost jumps, and then use the k right before the jump. In other words, pick the k just before the merging cost takes off.
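A small R sketch of this heuristic on made-up data (three well-separated Gaussian blobs; hclust's "ward.D2" is used as the Ward-style linkage):

# Heuristic sketch on hypothetical data: three well-separated groups in R^2.
set.seed(0)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))
hc <- hclust(dist(x), method = "ward.D2")   # Ward-style linkage
# rev(hc$height)[i] is the merging cost that reduces i+1 clusters to i clusters;
# it jumps once we start merging the three real groups, so pick k = 3 here.
plot(rev(hc$height)[1:10], type = "b", xlab = "number of clusters after the merge")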

Exemplifying non-hierarchical clustering using the K-means algorithm
T.U. Dresden, 2006 summer, Steffen Hölldobler, Axel Grossmann, HW3

Use the K-means algorithm and the Euclidean distance to group the following 8 instances from R² into 3 clusters: A(2,10), B(2,5), C(8,4), D(5,8), E(7,5), F(6,4), G(1,2), H(4,9). Take the points A, D and G as the initial centroids.

a. Run the first iteration of the K-means algorithm. On a grid, mark the given instances, the positions of the centroids at the beginning of the first iteration, and the composition of each cluster at the end of this iteration. (Draw the perpendicular bisectors of the segments determined by the centroids, as separators of the clusters.)

b. How many iterations are needed for the K-means algorithm to converge? Draw the result of each iteration on a separate grid.

[figure: the K-means clustering algorithm, the initial data on a grid]

Solution:

Iteration 0: μ_1^0 = (2, 10), μ_2^0 = (5, 8), μ_3^0 = (1, 2).
[table: the distances d(μ_1^0, P_i), d(μ_2^0, P_i), d(μ_3^0, P_i) for the instances A(2,10), B(2,5), C(8,4), D(5,8), E(7,5), F(6,4), G(1,2), H(4,9)]
Resulting clusters: C_1^0 = {A}, C_2^0 = {C, D, E, F, H}, C_3^0 = {B, G}.

Iteration 1:
μ_1^1 = μ_1^0 = (2, 10)
μ_2^1 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)
μ_3^1 = ((2+1)/2, (5+2)/2) = (1.5, 3.5)
...
C_1^1 = {A, H}, C_2^1 = {C, D, E, F}, C_3^1 = {B, G}.

[figures: grids with the initial data and the first K-means iterations, to be used in class]

Iteration 2:
μ_1^2 = (3, 9.5)
μ_2^2 = (26/4, 21/4) = (6.5, 5.25)
μ_3^2 = μ_3^1 = (1.5, 3.5)
...
C_1^2 = {A, D, H}, C_2^2 = {C, E, F}, C_3^2 = {B, G}.

Iteration 3:
μ_1^3 = ((2+5+4)/3, (10+8+9)/3) = (11/3, 9)
μ_2^3 = ((8+7+6)/3, (4+5+4)/3) = (7, 13/3)
μ_3^3 = μ_3^2 = (1.5, 3.5)
...
C_1^3 = {A, D, H} = C_1^2, C_2^3 = {C, E, F} = C_2^2, C_3^3 = {B, G} = C_3^2  ⟹  Stop.
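The same run can be reproduced with R's built-in kmeans, whose "Lloyd" variant follows the assign-then-update scheme simulated by hand above:

# The 8 points of the exercise and the required initial centroids (A, D, G).
pts <- matrix(c(2,10, 2,5, 8,4, 5,8, 7,5, 6,4, 1,2, 4,9),
              ncol = 2, byrow = TRUE,
              dimnames = list(LETTERS[1:8], c("x", "y")))
init <- pts[c("A", "D", "G"), ]

km <- kmeans(pts, centers = init, algorithm = "Lloyd")
split(rownames(pts), km$cluster)   # should give {A,D,H}, {C,E,F}, {B,G}, as above
km$centers                         # should give (11/3, 9), (7, 13/3), (1.5, 3.5)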

[figures: the K-means iterations on a grid]

Exemplifying one property of K-means: the clustering result depends on the initialisation
CMU, 2006 spring, Carlos Guestrin, HW5, pr. 1

a-b. Consider the data sets in the following figures. The symbols indicate data points, while the crosses (×) indicate the current cluster centers. For each of these figures, show the progress of the K-means algorithm by showing how the cluster centers move with each iteration until convergence. For each iteration, indicate which data points will be associated with each of the clusters, as well as the updated cluster centers. If, during the cluster update step, a cluster center has no points associated with it, it will not move. Use/produce as many figures as you need until convergence of the algorithm.

c. What does this imply about the behavior of the K-means algorithm?

[figures: the data points and the initial cluster centers for the two initializations, a and b]

Solution: a.
[figures: the first K-means iterations (starting with iteration 0) for the initialization at part a]

Solution: a. (cont'd)
[figures: the remaining K-means iterations, up to iteration 4, for the initialization at part a]

Solution: b.
[figures: the K-means iterations for the initialization at part b]

Solution: c. It is evident from this exercise that the result of the K-means algorithm depends on the initial placement of the centroids. With the initialization from part a, 4 iterations were needed to reach convergence, whereas with the initialization from part b the algorithm converged after just one iteration.

There is something else important to note: with the first initialization, the point at the bottom right, which is an outlier (an exception, a particular case, an aberration), ends up being assigned to the cluster formed by the points in the upper right, whereas with the second initialization it forms a separate/singleton cluster, indirectly forcing the groups of points in the upper left and upper right to form a single cluster together (which, most likely, was not desirable).

Outlier detection is an important chapter of machine learning. Ignoring outliers can lead to erroneous or even absurd modelling results.

Some proofs

K-means as an optimisation algorithm: the monotonicity of the J_K criterion
[CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 2.1]

The K-means algorithm (S. P. Lloyd, 1957)

Input: x_1,...,x_n ∈ R^d, with n ≥ K.
Output: a certain K-partition of {x_1,...,x_n}.
Procedure:
[Initialization / iteration 0:] t ← 0; fix arbitrarily μ_1^0,...,μ_K^0, the initial centroids of the clusters, and assign each instance x_i to the nearest centroid, thus forming the clusters C_1^0,...,C_K^0.
[Iterative body:] execute iteration ++t:
  Step 1: compute the new positions of the centroids:(a)  μ_j^t = (1/|C_j^{t-1}|) Σ_{x_i ∈ C_j^{t-1}} x_i, for j = 1,...,K;
  Step 2: re-assign each x_i to [the cluster with] the nearest centroid, i.e., establish the new composition of the clusters at iteration t: C_1^t,...,C_K^t;
[Termination:] until a certain condition is fulfilled (for example: until the positions of the centroids, or the composition of the clusters, no longer change from one iteration to the next).

(a) This formula corresponds to the use of the Euclidean distance. For other distance measures, other formulas may be needed.
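For reference, a minimal R sketch of this procedure (not an official course implementation; it assumes the Euclidean distance, a matrix x of instances as rows, and a matrix mu0 of initial centroids as rows):

lloyd.kmeans <- function(x, mu0, max.iter = 100) {
  mu <- mu0                                   # K x d matrix of current centroids
  cl <- rep(0L, nrow(x))
  for (t in seq_len(max.iter)) {
    # assignment step: nearest centroid for every instance
    d2 <- as.matrix(dist(rbind(mu, x)))[-(1:nrow(mu)), 1:nrow(mu), drop = FALSE]
    new.cl <- max.col(-d2, ties.method = "first")
    if (all(new.cl == cl)) break              # cluster composition unchanged
    cl <- new.cl
    # update step: centroid of each (non-empty) cluster
    for (j in seq_len(nrow(mu)))
      if (any(cl == j)) mu[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  list(centers = mu, cluster = cl)
}
# usage sketch: x <- matrix(rnorm(40), ncol = 2); lloyd.kmeans(x, x[1:3, ])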

a. Prove that, from one iteration to the next, the K-means algorithm increases the overall cohesion of the clusters. I.e., considering the function

J(C^t, μ^t) := Σ_{i=1}^n ||x_i − μ^t_{C^t(x_i)}||² = Σ_{i=1}^n ( x_i − μ^t_{C^t(x_i)} ) · ( x_i − μ^t_{C^t(x_i)} ),

where
C^t = (C_1^t, C_2^t,...,C_K^t) is the collection of clusters (i.e., the K-partition) at time t,
μ^t = (μ_1^t, μ_2^t,...,μ_K^t) is the collection of cluster centroids (the K-configuration) at time t,
C^t(x_i) designates the cluster to which the element x_i is assigned at iteration t,
the operator · designates the scalar product of vectors in R^d,

show that J(C^t, μ^t) ≥ J(C^{t+1}, μ^{t+1}) for every t.

Idea of the proof

The inequality above follows from two inequalities (which correspond to steps 1 and 2 of iteration t):

J(C^t, μ^t)  ≥(1)  J(C^t, μ^{t+1})  ≥(2)  J(C^{t+1}, μ^{t+1}).

For the first inequality (the one corresponding to step 1) one can consider the parameter C^t fixed and μ variable, while for the second inequality (the one corresponding to step 2) μ^{t+1} is considered fixed and C variable.
The first inequality can be obtained by summing up a series of inequalities, namely one for each cluster C_j^t. The second inequality is proved immediately.

Illustration of this idea, on a particular example: see the next 3 slides [Edinburgh, 2009 fall, C. Williams, V. Lavrenko, HW4, pr. 3].

[figures: a one-dimensional example: iteration 0 / initialization, then, for each of iterations 1-3, Step 1 (re-positioning the centroids) and Step 2 (re-computing the clusters)]

For this example of applying the K-means algorithm, we write the numeric expressions of the criterion J_2(C^t, μ^t) for each iteration (t = 0, 1, 2, 3).

[table: the numeric expressions of J_2(C^t, μ^t) for t = 0, 1, 2, 3]

Remark: At first sight, it is hard to prove these inequalities (J_2(C^{t-1}, μ^{t-1}) ≥ J_2(C^t, μ^t), for t = 1, 2, 3) other than by actually computing the values of the expressions being compared. However, by introducing some intermediate terms, these inequalities can be proven in a very elegant way...

[table with two columns, J_2(C^{t-1}, μ^t) and J_2(C^t, μ^t), for t = 1, 2, 3]

Explanations:
1. The horizontal inequalities (J_2(C^{t-1}, μ^t) ≥ J_2(C^t, μ^t), for t = 1, 2, 3) are easy to prove, using the term-by-term correspondence. (They correspond to the possible shrinking of the distances when the instances are re-assigned to centroids.)
2. The remaining inequalities (J_2(C^t, μ^t) ≥ J_2(C^t, μ^{t+1}), for t = 1, 2, 3) are handled by a simple optimization argument. For example, for t = 1 it is immediate that the within-cluster sum of squared deviations, viewed as a function of the centroid position x, attains its minimum at the mean of the cluster (here x = 7/3), hence J_2(C^1, μ^2) ≤ J_2(C^1, μ^1).

Step by step (t = 1): [the corresponding rows of the table above] Details to be filled in in class.

Step by step (t = 2): [the corresponding rows of the table above] Details to be filled in in class / at home.

Step by step (t = 3): [the corresponding rows of the table above] Details to be filled in in class / at home.

Graphical illustration, by Gheorghe Balan, FII student, 2017 fall. For Edinburgh, 2009 fall, C. Williams, V. Lavrenko, HW4, pr. 3.

Graphical exemplification (cont'd). © Andrew Ng, Stanford University. For Stanford, A. Ng, 2012 spring, HW9 (image segmentation and compression).

Proof, for the general case

Remark: For convenience, we restrict ourselves to the case d = 1. Extending the proof to d > 1 poses no difficulties.

Proof of inequality (1): J(C^t, μ^t) ≥ J(C^t, μ^{t+1})   (see step 1 of iteration t).

Fix j ∈ {1,...,K}. If we denote C_j^t = {x_{i_1}, x_{i_2},...,x_{i_l}}, where l := |C_j^t|, then

J(C_j^t, μ_j^t) = Σ_{p=1}^l (x_{i_p} − μ_j^t)²,   and therefore   J(C^t, μ^t) = Σ_{j=1}^K J(C_j^t, μ_j^t).

If C_j^t is considered fixed and μ_j^t variable, then we can immediately minimize the function that measures the cohesion inside the cluster C_j^t:

f(μ) := J(C_j^t, μ) = l μ² − 2μ Σ_{p=1}^l x_{i_p} + Σ_{p=1}^l x_{i_p}²   ⟹   argmin_μ J(C_j^t, μ) = (1/l) Σ_{p=1}^l x_{i_p} =: μ_j^{t+1}.

Therefore J(C_j^t, μ) ≥ J(C_j^t, μ_j^{t+1}) for every μ. In particular, for μ = μ_j^t we get J(C_j^t, μ_j^t) ≥ J(C_j^t, μ_j^{t+1}).

This inequality holds for all clusters j = 1,...,K. Summing all these inequalities yields J(C^t, μ^t) ≥ J(C^t, μ^{t+1}).

Proof of inequality (2): J(C^t, μ^{t+1}) ≥ J(C^{t+1}, μ^{t+1})   (see step 2 of iteration t).

At this step, an arbitrary instance x_i, with i ∈ {1,...,n}, is re-assigned from the cluster with centroid μ_j^{t+1} to another centroid μ_q^{t+1} only if ||x_i − μ_q^{t+1}||² ≤ ||x_i − μ_j^{t+1}||², i.e., (x_i − μ_q^{t+1})² ≤ (x_i − μ_j^{t+1})² for every j = 1,...,K. In the context of iteration t, this implies

( x_i − μ^{t+1}_{C^{t+1}(x_i)} )² ≤ ( x_i − μ^{t+1}_{C^t(x_i)} )².

Summing these inequalities member by member for i = 1,...,n yields J(C^{t+1}, μ^{t+1}) ≤ J(C^t, μ^{t+1}), which is what had to be proven.

Interesting / Useful Remark (in English)
(by L. Ciortuz, following CMU, 2012 fall, E. Xing, A. Singh, HW3, pr. 1)

Following the result obtained above, the K-means clustering algorithm can be seen as an optimisation algorithm. As input, we are given points x_1,...,x_n ∈ R^d and an integer K > 1. The goal is to minimize the K-means objective function/criterion, which in the case of the Euclidean distance measure is the within-cluster sum of squared distances

J(μ, L) = Σ_{i=1}^n ||x_i − μ_{l_i}||²,

where μ = (μ_1,...,μ_K) are the cluster centers (μ_j ∈ R^d), and L = (l_1,...,l_n) are the cluster assignments (l_i ∈ {1,...,K}). (Finding the exact minimum of this function is computationally difficult.)

Interesting / Useful Remark (cont'd)

Lloyd's [K-means] algorithm takes some initial cluster centers μ and proceeds as follows:

Step 1: Keeping μ fixed, find cluster assignments L to minimize J(μ, L). This step only involves finding the nearest cluster center(s) for each input point x_i. Ties can be broken using arbitrary (but consistent) rules.
Step 2: Keeping L fixed, find μ to minimize J(μ, L). This is a simple step that, in the case of the Euclidean distance measure, only involves averaging the points within a cluster.
Stopping criterion: If [this was not the first iteration and] none of the values in L changed from the previous iteration, go to the next step; otherwise repeat from Step 1.
Termination: Return μ and L.

Note: The initial cluster centers μ given as input to the algorithm are often picked randomly from x_1,...,x_n. (In practice, we often repeat multiple runs of Lloyd's algorithm with different initializations, and pick the best resulting clustering in terms of the K-means objective.)

b. What can you say about the termination of the K-means algorithm? Does this algorithm terminate in a finite number of steps, or (since the number of K-partitions that can be formed with the n training instances x_1,...,x_n is finite, namely K^n) is it possible for it to revisit a previous K-configuration μ = (μ_1,...,μ_K) infinitely many times?

Answer: If the algorithm revisits a K-partition, then it follows that for a certain t we have J(C^{t-1}, μ^t) = J(C^t, μ^{t+1}). This can indeed happen, namely when:
- there are duplicate instances (i.e., x_i = x_j although i ≠ j),
- the stopping criterion of the K-means algorithm is of the form "until the composition of the clusters no longer changes", and
- it is assumed that, when an instance x_i lies at equal distance from two or more centroids, it may be assigned randomly to any of them.
This is what happens in the example in the nearby figure if we consider that at some iteration t we have x_2 = 0 ∈ C_1^t and x_3 = 0 ∈ C_2^t, at the next iteration we choose x_3 = 0 ∈ C_1^{t+1} and x_2 = 0 ∈ C_2^{t+1}, and again the other way around at iteration t+2.

[figure: instances x_1 = -1, x_2 = x_3 = 0, x_4 = 1 on the real line, with centroids μ_1 = -1/2 and μ_2 = 1/2]

Remarks:
If we keep the criterion given as an example in the problem statement, that is, we iterate until the centroids become stationary, the algorithm may stop without J(C, μ) having reached the lowest possible value at the last iteration. In the example above we will have J = 1 > 2/3 = 2·(1/3)² + (2/3)².
If there are no duplicate instances situated at equal distance from two or more centroids at some iteration of the K-means algorithm (as x_2 and x_3 are in the example above), or if we impose the restriction that in such situations identical instances be assigned to a single cluster, then the K-means algorithm obviously stops in a finite number of steps.

Conclusions:
Starting from a given initialization of the K centroids, the K-means algorithm explores only a subset of the total of K^n K-partitions, while guaranteeing the property
J(C^0, μ^1) ≥ J(C^1, μ^2) ≥ ... ≥ J(C^{t-1}, μ^t) ≥ J(C^t, μ^{t+1}),
according to part a of this problem.
Reaching the global minimum of the function J(C, μ), where C is a variable ranging over the set of all K-partitions that can be formed with the instances {x_1,...,x_n}, is not guaranteed for the K-means algorithm.
The value of the function J obtained when the K-means algorithm stops depends on the initial placement of the centroids μ, as well as on the precise way the clusters are formed when some instance lies at equal distance from two or more centroids, as shown in the example above.

K-means algorithm: the approximate maximization of the distance between clusters
[CMU, 2010 fall, Aarti Singh, HW3, pr. 5.2]

Note: In this problem we will work with a version of the K-means algorithm which is slightly modified w.r.t. the one given in the problem CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 2.1, where we proved the monotonicity of the criterion J.

Let X := {x_1, x_2,...,x_n} be our sample points, and K denote the number of clusters to use. We represent the cluster assignments of the data points by an indicator matrix γ ∈ {0,1}^{n×K} such that γ_ij = 1 means x_i belongs to cluster j. We require that each point belong to exactly one cluster, so Σ_{j=1}^K γ_ij = 1.

[We already know that] the K-means algorithm estimates γ by minimizing the following cohesion criterion (or, measure of distortion, or simply "sum of squares"):

J(γ, μ_1, μ_2,...,μ_K) := Σ_{i=1}^n Σ_{j=1}^K γ_ij ||x_i − μ_j||²,

where ||·|| denotes the vector 2-norm. K-means alternates between estimating γ and re-computing the μ_j's.

The K-means algorithm (...yet another version!)

Initialize μ_1, μ_2,...,μ_K, and let C := {1,...,K}. While the value of J is still decreasing, repeat the following:
1. Determine γ by
   γ_ij ← 1 if ||x_i − μ_j||² ≤ ||x_i − μ_j'||² for all j' ∈ C, and γ_ij ← 0 otherwise.
   Break ties arbitrarily.
2. Recompute the μ_j's using the updated γ: for each j ∈ C, if Σ_{i=1}^n γ_ij > 0, set
   μ_j ← Σ_{i=1}^n γ_ij x_i / Σ_{i=1}^n γ_ij.
   Otherwise, don't change μ_j.

Let x̄ denote the sample mean. Consider the following three quantities:

Total variation: T(X) = (1/n) Σ_{i=1}^n ||x_i − x̄||²
Within-cluster variation: W_j(X) = Σ_{i=1}^n γ_ij ||x_i − μ_j||² / Σ_{i=1}^n γ_ij, for j = 1,...,K
Between-cluster variation: B(X) = Σ_{j=1}^K ( Σ_{i=1}^n γ_ij / n ) ||μ_j − x̄||².

What is the relation between these three quantities? Based on this relation, show that K-means can be interpreted as minimizing a weighted average of within-cluster variations while approximately(!) maximizing the between-cluster variation. Note that the relation may contain an extra term that does not appear above.

Solution:
To simplify the notation, we define n_j = Σ_{i=1}^n γ_ij. We then have:

T(X) = (1/n) Σ_{i=1}^n ||x_i − x̄||² = (1/n) Σ_{i=1}^n Σ_{j=1}^K γ_ij ||x_i − x̄||²
     = (1/n) Σ_{j=1}^K Σ_{i=1}^n γ_ij ||x_i − μ_j + μ_j − x̄||²   (5)
     = (1/n) Σ_{j=1}^K Σ_{i=1}^n γ_ij ( ||x_i − μ_j||² + ||μ_j − x̄||² + 2 (x_i − μ_j)·(μ_j − x̄) )
     = Σ_{j=1}^K (n_j/n) W_j(X) + B(X) + (2/n) Σ_{j=1}^K n_j (μ_j − x̄)·(μ̄_j − μ_j),   (6)

where μ̄_j := Σ_{i=1}^n γ_ij x_i / n_j, and the first term equals J/n.

Notes:
1. In (5), the quantities μ_j can be taken arbitrarily; however, in the end (see relation (6) and the Conclusion on the last slide) they will be thought of as the centroids of the clusters 1,...,K, as computed at Step 2 of the K-means algorithm.
2. The equality (6) on the previous slide holds because
Σ_{i=1}^n γ_ij (x_i − μ_j) = Σ_{i=1}^n γ_ij x_i − n_j μ_j = n_j ( Σ_{i=1}^n γ_ij x_i / n_j − μ_j ) = n_j (μ̄_j − μ_j).

Conclusion:
We already know (see CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 2.1) that K-means aims to minimize J, and consequently (1/n)·J, which coincides with the first term in the expression we obtained for T(X), namely Σ_{j=1}^K (n_j/n) W_j(X). Since the total variation T(X) is constant, minimizing the first term is equivalent to maximizing the sum of the other two terms, which is expected to be dominated by the between-cluster variation B(X), since a good μ_j should be close to μ̄_j, making the third term small in absolute value.
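A numeric illustration of this decomposition (made-up data and an arbitrary hard assignment; the μ_j are taken to be the exact cluster centroids, so the third term vanishes):

# Check T(X) = sum_j (n_j/n) W_j(X) + B(X) on hypothetical data.
set.seed(1)
x  <- matrix(rnorm(40), ncol = 2)          # n = 20 points in R^2
cl <- rep(1:2, each = 10)                  # an arbitrary hard assignment
n  <- nrow(x); xbar <- colMeans(x)

mu <- rbind(colMeans(x[cl == 1, ]), colMeans(x[cl == 2, ]))
nj <- as.vector(table(cl))

T.X <- sum(sweep(x, 2, xbar)^2) / n
W.j <- sapply(1:2, function(j) sum(sweep(x[cl == j, ], 2, mu[j, ])^2) / nj[j])
B.X <- sum((nj / n) * rowSums(sweep(mu, 2, xbar)^2))

c(T.X, sum((nj / n) * W.j) + B.X)          # the two values coincide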

Graphical exemplification, by Gheorghe Balan, FII student, 2017 fall. For CMU, 2006 spring, C. Guestrin, HW5, pr. 1, and for Edinburgh, 2009 fall, C. Williams, V. Lavrenko, HW4, pr. 3.

Graphical exemplification (cont'd). © Andrew Ng, Stanford University. For Stanford, A. Ng, 2012 spring, HW9 (image segmentation and compression).

Exemplifying the application of a simple version of EM/GMM on data from R (σ_1 = σ_2 = 1, π_1 = π_2 = 1/2)
CMU, 2012 spring, Ziv Bar-Joseph, final exam, pr. 3.1, enhanced by Liviu Ciortuz

Suppose a GMM has two components with known variance and an equal prior distribution: (1/2) N(μ_1, 1) + (1/2) N(μ_2, 1). The observed data are x_1 = 0.5 and x_2 = 2, and the current estimates of μ_1 and μ_2 are 1 and 2, respectively.

a. Execute the first iteration of the EM algorithm.
Hint: The normal densities of the standardized variable y (μ = 0, σ = 1) at 0, 0.5, 1, 1.5, 2 are 0.4, 0.35, 0.24, 0.13, 0.05, respectively.

b. Consider the log-likelihood function for the observable data,

l(μ_1, μ_2) := ln P(x_1, x_2 | μ_1, μ_2) =(indep.) Σ_{i=1}^2 ln P(x_i | μ_1, μ_2) = Σ_{i=1}^2 ln Σ_{z_ij} P(x_i, z_ij | μ_1, μ_2),

where z_ij ∈ {0,1} and Σ_{j=1}^2 z_ij = 1 for all i ∈ {1,2}. Compute the values of the function l at the beginning and also at the end of the first iteration of the EM algorithm. What do you see?

Solution (a.)

The E-step:

E[Z_i1] = P(Z_i1 = 1 | x_i, μ) =(Bayes) P(x_i | Z_i1 = 1, μ_1) P(Z_i1 = 1) / ( P(x_i | Z_i1 = 1, μ_1) P(Z_i1 = 1) + P(x_i | Z_i2 = 1, μ_2) P(Z_i2 = 1) )
= P(x_i | Z_i1 = 1, μ_1) · (1/2) / ( P(x_i | Z_i1 = 1, μ_1) · (1/2) + P(x_i | Z_i2 = 1, μ_2) · (1/2) )
= P(x_i | Z_i1 = 1, μ_1) / ( P(x_i | Z_i1 = 1, μ_1) + P(x_i | Z_i2 = 1, μ_2) ), for i ∈ {1,2}.

Therefore,
P(Z_11 = 1 | x_1, μ) = N(0.5; 1, 1) / ( N(0.5; 1, 1) + N(0.5; 2, 1) ) = N(0.5; 0, 1) / ( N(0.5; 0, 1) + N(1.5; 0, 1) ) = 0.35 / (0.35 + 0.13) = 35/48
P(Z_21 = 1 | x_2, μ) = N(2; 1, 1) / ( N(2; 1, 1) + N(2; 2, 1) ) = N(1; 0, 1) / ( N(1; 0, 1) + N(0; 0, 1) ) = 0.24 / (0.24 + 0.4) = 3/8.

Similarly,
P(Z_12 = 1 | x_1, μ) = P(Z_11 = 0 | x_1, μ) = 1 − P(Z_11 = 1 | x_1, μ) = 13/48
P(Z_22 = 1 | x_2, μ) = P(Z_21 = 0 | x_2, μ) = 1 − P(Z_21 = 1 | x_2, μ) = 5/8.

The M-step:

μ_j^(t+1) = Σ_{i=1}^2 E[Z_ij] x_i / Σ_{i=1}^2 E[Z_ij] = Σ_{i=1}^2 P(Z_ij = 1 | x_i, μ^(t)) x_i / Σ_{i=1}^2 P(Z_ij = 1 | x_i, μ^(t)).

Therefore,
μ_1^(1) = (35/48 · 0.5 + 3/8 · 2) / (35/48 + 3/8) = 107/106 ≈ 1.01   and   μ_2^(1) = (13/48 · 0.5 + 5/8 · 2) / (13/48 + 5/8) = 133/86 ≈ 1.55.

Solution (b.)

l(μ_1, μ_2) := Σ_{i=1}^2 ln P(x_i | μ_1, μ_2) = Σ_{i=1}^2 ln Σ_{z_ij} P(x_i, z_ij | μ_1, μ_2)
= Σ_{i=1}^2 ln Σ_{z_ij} P(x_i | z_ij, μ_1, μ_2) · P(z_ij | μ_1, μ_2), the latter factor being 1/2
= Σ_{i=1}^2 ln ( (1/2) Σ_{j=1}^2 P(x_i | z_ij = 1, μ_j) ) = −2 ln 2 + Σ_{i=1}^2 ln Σ_{j=1}^2 P(x_i | z_ij = 1, μ_j),

where P(x_i | z_ij = 1, μ_j) = (1/√(2π)) exp(−(x_i − μ_j)²/2).

Therefore,

l(μ_1, μ_2) = −2 ln 2 − ln(2π) + ln( exp(−(x_1 − μ_1)²/2) + exp(−(x_1 − μ_2)²/2) ) + ln( exp(−(x_2 − μ_1)²/2) + exp(−(x_2 − μ_2)²/2) ),

and

l(μ_1^(0), μ_2^(0)) = l(1, 2) ≈ −2.56  ≤  l(μ_1^(1), μ_2^(1)) = l(107/106, 133/86) ≈ −2.43,

meaning that the value of the log-likelihood function increases, which is in line with the theoretical result concerning the correctness of the EM algorithm.
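The same iteration in a few lines of R (exact Gaussian densities are used instead of the rounded values from the hint, hence tiny numeric differences):

# One EM iteration for the 2-component GMM with sigma = 1 and pi = 1/2.
x  <- c(0.5, 2)
mu <- c(1, 2)

loglik <- function(mu) sum(log(0.5 * dnorm(x, mu[1]) + 0.5 * dnorm(x, mu[2])))

# E-step: responsibilities of component 1 for each data point
r1 <- dnorm(x, mu[1]) / (dnorm(x, mu[1]) + dnorm(x, mu[2]))

# M-step: weighted means (variances and priors are kept fixed)
mu.new <- c(sum(r1 * x) / sum(r1), sum((1 - r1) * x) / sum(1 - r1))

r1                              # roughly 35/48 and 3/8 (up to rounding of the densities)
mu.new                          # roughly 107/106 and 133/86
c(loglik(mu), loglik(mu.new))   # the log-likelihood increases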

Derivation of the EM algorithm for a mixture of K uni-variate Gaussians: the general case (i.e., when all the parameters π, μ, σ² are free)
following Dahua Lin, "An Introduction to Expectation-Maximization" (MIT, ML 6768 course, 2012 fall)

Note: We will first consider K = 2. The generalization to K > 2 will be shown afterwards.

Estimation (E) Step:

p_ij := P(Z_ij = 1 | X_i, μ, σ, π) = E[Z_ij | X_i, μ, σ, π]
= P(X_i = x_i | Z_ij = 1, μ, σ, π) · P(Z_ij = 1 | μ, σ, π) / Σ_{j'=1}^2 P(X_i = x_i | Z_ij' = 1, μ, σ, π) · P(Z_ij' = 1 | μ, σ, π)
= π_j (1/(√(2π) σ_j)) exp(−(x_i − μ_j)²/(2σ_j²)) / ( π_1 (1/(√(2π) σ_1)) exp(−(x_i − μ_1)²/(2σ_1²)) + π_2 (1/(√(2π) σ_2)) exp(−(x_i − μ_2)²/(2σ_2²)) ).

Therefore, for t > 0 we will have:

p_ij^(t) = (π_j^(t-1)/σ_j^(t-1)) exp(−(x_i − μ_j^(t-1))²/(2(σ_j^(t-1))²)) / ( (π_1^(t-1)/σ_1^(t-1)) exp(−(x_i − μ_1^(t-1))²/(2(σ_1^(t-1))²)) + (π_2^(t-1)/σ_2^(t-1)) exp(−(x_i − μ_2^(t-1))²/(2(σ_2^(t-1))²)) ).

The likelihood of a complete instance (x_i, z_i1, z_i2):

P(X_i = x_i, Z_i1 = z_i1, Z_i2 = z_i2 | μ, σ, π)
= P(X_i = x_i | Z_i1 = z_i1, Z_i2 = z_i2, μ, σ, π) · P(Z_i1 = z_i1, Z_i2 = z_i2 | μ, σ, π)
= (1/(√(2π) σ_j)) exp(−(x_i − μ_j)²/(2σ_j²)) · π_j, where z_ij = 1 and z_ij' = 0 for j' ≠ j
= (1/(√(2π) σ_1^{z_i1} σ_2^{z_i2})) exp( −(1/2) Σ_{j∈{1,2}} z_ij (x_i − μ_j)²/σ_j² ) π_1^{z_i1} π_2^{z_i2}.

The log-likelihood of the same complete instance will be:

ln P(X_i = x_i, Z_i1 = z_i1, Z_i2 = z_i2 | μ, σ, π) = −(1/2) ln(2π) − Σ_{j=1}^2 z_ij ln σ_j − (1/2) Σ_{j=1}^2 z_ij (x_i − μ_j)²/σ_j² + Σ_{j=1}^2 z_ij ln π_j.

Given the dataset X = {x_1,...,x_n}, the log-likelihood function will be:

l(μ, σ, π) := ln P(X, Z_1, Z_2 | μ, σ, π) =(i.i.d.) ln Π_{i=1}^n P(X_i = x_i, Z_i1, Z_i2 | μ, σ, π) = Σ_{i=1}^n ln P(X_i = x_i, Z_i1, Z_i2 | μ, σ, π)
= −(n/2) ln(2π) − Σ_{i=1}^n Σ_{j=1}^2 Z_ij ln σ_j − (1/2) Σ_{i=1}^n Σ_{j=1}^2 Z_ij (x_i − μ_j)²/σ_j² + Σ_{i=1}^n Σ_{j=1}^2 Z_ij ln π_j.

The expectation of the log-likelihood function:

E[ln P(X, Z_1, Z_2 | μ, σ, π)] = −(n/2) ln(2π) − Σ_{i=1}^n Σ_{j=1}^2 E[Z_ij] ln σ_j − (1/2) Σ_{i=1}^n Σ_{j=1}^2 E[Z_ij] (x_i − μ_j)²/σ_j² + Σ_{i=1}^n Σ_{j=1}^2 E[Z_ij] ln π_j.

Above, the probability distribution with respect to which the expectation was computed was left unspecified. Now we make it explicit:

Q(μ, σ, π | μ^(t), σ^(t), π^(t)) := E_{Z | X, μ^(t), σ^(t), π^(t)} [ln P(X, Z_1, Z_2 | μ, σ, π)]
= −(n/2) ln(2π) − Σ_{i=1}^n Σ_{j=1}^2 E[Z_ij | X_i, μ^(t), σ^(t), π^(t)] ln σ_j − (1/2) Σ_{i=1}^n Σ_{j=1}^2 E[Z_ij | X_i, μ^(t), σ^(t), π^(t)] (x_i − μ_j)²/σ_j² + Σ_{i=1}^n Σ_{j=1}^2 E[Z_ij | X_i, μ^(t), σ^(t), π^(t)] ln π_j.

With the notation p_ij^(t) := E[Z_ij | X, μ^(t-1), σ^(t-1), π^(t-1)] (cf. the E-step above),

Q(μ, σ, π | μ^(t), σ^(t), π^(t)) = −(n/2) ln(2π) − Σ_{i=1}^n Σ_{j=1}^2 p_ij^(t) ln σ_j − (1/2) Σ_{i=1}^n Σ_{j=1}^2 p_ij^(t) (x_i − μ_j)²/σ_j² + Σ_{i=1}^n Σ_{j=1}^2 p_ij^(t) ln π_j.

Since K = 2 and π_1 + π_2 = 1, we get

Q(μ, σ, π | μ^(t), σ^(t), π^(t)) = −(n/2) ln(2π) − Σ_{i=1}^n Σ_{j=1}^2 p_ij^(t) ln σ_j − (1/2) Σ_{i=1}^n Σ_{j=1}^2 p_ij^(t) (x_i − μ_j)²/σ_j² + Σ_{i=1}^n ( p_i1^(t) ln π_1 + p_i2^(t) ln(1 − π_1) ).

Maximization (M) Step: [for K = 2]

∂/∂π_1 Q(μ, σ, π | μ^(t), σ^(t), π^(t)) = 0
⟺ Σ_{i=1}^n ( p_i1^(t)/π_1 − p_i2^(t)/(1 − π_1) ) = 0
⟺ (1 − π_1) Σ_{i=1}^n p_i1^(t) = π_1 Σ_{i=1}^n p_i2^(t)
⟺ Σ_{i=1}^n p_i1^(t) = π_1 Σ_{i=1}^n ( p_i1^(t) + p_i2^(t) ) = n π_1,

taking into account that p_i1^(t) + p_i2^(t) = 1 for i = 1,...,n. Together with π_1^(t+1) + π_2^(t+1) = 1, this gives

π_1^(t+1) ← (1/n) Σ_{i=1}^n p_i1^(t)   and   π_2^(t+1) ← (1/n) Σ_{i=1}^n p_i2^(t).

∂/∂μ_1 Q(μ, σ, π | μ^(t), σ^(t), π^(t)) = 0  ⟺  (1/σ_1²) Σ_{i=1}^n p_i1^(t) (x_i − μ_1) = 0  ⟺  Σ_{i=1}^n p_i1^(t) (x_i − μ_1) = 0
⟹  μ_1^(t+1) ← Σ_{i=1}^n p_i1^(t) x_i / Σ_{i=1}^n p_i1^(t).

Similarly, μ_2^(t+1) ← Σ_{i=1}^n p_i2^(t) x_i / Σ_{i=1}^n p_i2^(t).

∂/∂σ_1 Q(μ, σ, π | μ^(t), σ^(t), π^(t)) = 0  ⟺  −(1/σ_1) Σ_{i=1}^n p_i1^(t) + (1/σ_1³) Σ_{i=1}^n p_i1^(t) (x_i − μ_1)² = 0
⟹  (σ_1^(t+1))² ← Σ_{i=1}^n p_i1^(t) (x_i − μ_1^(t+1))² / Σ_{i=1}^n p_i1^(t).

Similarly, (σ_2^(t+1))² ← Σ_{i=1}^n p_i2^(t) (x_i − μ_2^(t+1))² / Σ_{i=1}^n p_i2^(t).

Note: One could relatively easily prove that these solutions (namely π^(t+1), μ^(t+1), σ^(t+1)) of the equations obtained by zeroing the partial derivatives of the auxiliary function Q designate the values for which Q reaches its maximum.

Generalization to K > 2

In this case, the Bernoulli distribution is replaced by a categorical one. The only change needed in the above proof concerns updating the parameters of this distribution. Since π_1 + ... + π_K = 1, we must solve the following constrained optimization problem:

max_{π,μ,σ} Q(π, μ, σ | π^(t), μ^(t), σ^(t))   subject to   Σ_{j=1}^K π_j = 1 and π_j ≥ 0, j = 1,...,K.

Leaving aside the inequality constraints and using the Lagrange multiplier λ ∈ R, this problem becomes:

max_{π,μ,σ,λ} ( Q(π, μ, σ | π^(t), μ^(t), σ^(t)) + λ (1 − Σ_{j=1}^K π_j) ).

For j = 1,...,K:

∂/∂π_j ( Q(μ, σ, π | μ^(t), σ^(t), π^(t)) + λ (1 − Σ_{j'=1}^K π_{j'}) ) = 0  ⟺  Σ_{i=1}^n p_ij^(t) · (1/π_j) = λ  ⟹  π_j^(t+1) = (1/λ) Σ_{i=1}^n p_ij^(t).

Because Σ_{j=1}^K π_j^(t+1) = 1, it follows that

1 = Σ_{j=1}^K (1/λ) Σ_{i=1}^n p_ij^(t) = (1/λ) Σ_{i=1}^n Σ_{j=1}^K p_ij^(t) = (1/λ) Σ_{i=1}^n 1 = n/λ,   hence λ = n.

Therefore, π_j^(t+1) ← (1/n) Σ_{i=1}^n p_ij^(t). Note that indeed π_j^(t+1) ≥ 0, because the p_ij^(t) terms designate probabilities (see the E-step).

To summarize:

E Step:
p_ij^(t) := P(z_ij = 1 | x_i; μ^(t), (σ²)^(t), π^(t)) = N(x_i | μ_j^(t), (σ_j²)^(t)) · π_j^(t) / Σ_{l=1}^K N(x_i | μ_l^(t), (σ_l²)^(t)) · π_l^(t),
where N(x_i | μ_j, σ_j²) := (1/(√(2π) σ_j)) exp( −(x_i − μ_j)²/(2σ_j²) ).

M Step:
π_j^(t+1) ← (1/n) Σ_{i=1}^n p_ij^(t)
μ_j^(t+1) ← Σ_{i=1}^n p_ij^(t) x_i / Σ_{i=1}^n p_ij^(t)
(σ_j^(t+1))² ← Σ_{i=1}^n p_ij^(t) (x_i − μ_j^(t+1))² / Σ_{i=1}^n p_ij^(t)
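A compact R transcription of this summary for general K (the helper name em.update and its interface are hypothetical; initialization and the stopping test are left to the caller; theta is assumed to be a list with components pi, mu and sigma2, each of length K):

# One EM iteration for a mixture of K univariate Gaussians, as summarized above.
em.update <- function(x, theta) {
  K <- length(theta$pi)
  # E-step: n x K matrix of responsibilities p_ij
  dens <- sapply(1:K, function(j) theta$pi[j] * dnorm(x, theta$mu[j], sqrt(theta$sigma2[j])))
  p <- dens / rowSums(dens)
  # M-step
  nj <- colSums(p)
  mu <- colSums(p * x) / nj
  sigma2 <- colSums(p * (outer(x, mu, "-"))^2) / nj
  list(pi = nj / length(x), mu = mu, sigma2 = sigma2)
}
# usage sketch: theta <- list(pi = c(.5,.5), mu = c(-1,1), sigma2 = c(1,1))
#               for (t in 1:50) theta <- em.update(x, theta)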

Example: Modelling the waiting and eruption times for the Old Faithful geyser (Yellowstone Park, USA)
Michael Eichler (University of Chicago, Statistics course (24600), Spring 2004)

R code (Y is assumed to be the vector of waiting times, e.g. the waiting column of R's built-in faithful dataset):

Y <- faithful$waiting          # assumption: the observed data are the waiting times

emstep <- function(Y, p) {
  EZ <- p[1] * dnorm(Y, p[2], sqrt(p[4])) /
        (p[1] * dnorm(Y, p[2], sqrt(p[4])) + (1 - p[1]) * dnorm(Y, p[3], sqrt(p[5])))
  p[1] <- mean(EZ)
  p[2] <- sum(EZ * Y) / sum(EZ)
  p[3] <- sum((1 - EZ) * Y) / sum(1 - EZ)
  p[4] <- sum(EZ * (Y - p[2])^2) / sum(EZ)
  p[5] <- sum((1 - EZ) * (Y - p[3])^2) / sum(1 - EZ)
  p
}

emiteration <- function(Y, p, n = 10) {
  for (i in 1:n) { p <- emstep(Y, p) }
  p
}

p <- c(0.5, 40, 90, 20, 20)    # p = (pi_1, mu_1, mu_2, sigma_1^2, sigma_2^2)
p <- emiteration(Y, p, 20)
p
p <- emstep(Y, p)
p

Starting values: π^(0) = 0.4, μ_1^(0) = 40, σ_1^(0) = 4, μ_2^(0) = 90, σ_2^(0) = 4.

[table: the values of p^(k), μ_1^(k), μ_2^(k), σ_1^(k), σ_2^(k) across the EM iterations k]

Exemplifying some methodological issues regarding the application of the EM algorithmic schema (using a simple EM/GMM algorithm on data from R, with π_1 = π_2 = 1/2)
CMU, 2007 spring, Eric Xing, final exam, pr. 1.8

A long time ago there was a village amidst hundreds of lakes. Two types of fish lived in the region, but only one type in each lake. These types of fish both looked exactly the same, smelled exactly the same when cooked, and had the exact same delicious taste, except one was poisonous and would kill any villager who ate it. The only other difference between the fish was their effect on the pH (acidity) of the lake they occupy. The pH for lakes occupied by the non-poisonous type of fish was distributed according to a Gaussian with unknown mean (μ_safe) and variance (σ²_safe), and the pH for lakes occupied by the poisonous type was distributed according to a different Gaussian with unknown mean (μ_deadly) and variance (σ²_deadly). (Poisonous fish tended to cause slightly more acidic conditions.)

Naturally, the villagers turned to machine learning for help. However, there was much debate about the right way to apply EM to their problem. For each of the following procedures, indicate whether it is an accurate implementation of Expectation-Maximization and will provide a reasonable estimate for the parameters μ and σ² of each class.

a. Guess initial values of μ and σ² for each class. (1) For each lake, find the most likely class of fish for the lake. (2) Update the μ and σ² values using their maximum likelihood estimates based on these predictions. Iterate (1) and (2) until convergence.

b. For each lake, guess an initial probability that it is safe. (1) Using these probabilities, find the maximum likelihood estimates for the μ and σ² values of each class. (2) Use these estimates of μ and σ² to re-estimate the lake safety probabilities. Iterate (1) and (2) until convergence.

c. Compute the mean and variance of the pH levels across all lakes. Use these values for the μ and σ² values of each class of fish. (1) Use the μ and σ² values of each class to compute the belief that each lake contains poisonous fish. (2) Find the maximum likelihood values for μ and σ². Iterate (1) and (2) until convergence.

Solution:

a. It'll do OK if we give sensible enough initial values to μ and σ².

b. OK, this is the same as a after the first M-step. (See the general EM algorithmic schema on the next slide.)

c. This will be stuck at the initial μ and σ²: in the E-step we'll get

P(safe | x) = P(x | safe) P(safe) / ( P(x | safe) P(safe) + P(x | deadly) P(deadly) ) = 1/2,

since on one side we assume P(safe) = P(deadly) = 1/2 and on the other side P(x | safe) = P(x | deadly), because μ_safe = μ_deadly and σ²_safe = σ²_deadly. In the M-step, μ and σ² will not change, since we are again letting them be calculated from all lakes (weighted equally).

The [general] EM algorithmic schema

[diagram: starting from the current hypothesis h^(t), compute E[Z | X, h^(t)]; then set h^(t+1) = argmax_h E_{P(Z | X; h^(t))}[ln P(Y | h)], where Y denotes the complete data; increment t and repeat, monitoring P(X | h)]

The EM algorithm for modeling mixtures of multi-variate Gaussians with diagonal covariance matrices
Utah University, Piyush Rai, ML course (CS5350/6350), 2009 fall, lecture notes, Gaussian Mixture Models [adapted by Liviu Ciortuz]

Consider the mixture of Gaussian distributions

gmm(x) = Σ_{j=1}^K π_j N(x; μ_j, σ² I),

where x ∈ R^d; the a priori selection probabilities π_j ∈ R satisfy (as usual) the constraints π_j ≥ 0 for j = 1,...,K and Σ_{j=1}^K π_j = 1; the means of the Gaussians j = 1,...,K are the vectors μ_j ∈ R^d, while the covariance matrices of these Gaussians are identical and, moreover, have the particular form σ² I, with σ ∈ R, σ > 0, I being the identity matrix.

Consider the instances x_1,...,x_n ∈ R^d generated with the probability distribution above (gmm). In this exercise you will derive the update rules of [the E-step and the M-step of] the EM algorithm that estimates the parameters π := (π_1,...,π_K), μ := (μ_1,...,μ_K) and σ.

a. It is known that the density function of the multivariate (d-dimensional) Gaussian distribution with mean μ ∈ R^d and covariance matrix Σ ∈ R^{d×d} is

( 1/√((2π)^d |Σ|) ) exp( −(1/2) (x − μ)^T Σ^{-1} (x − μ) ),

where x and μ are taken to be column vectors in R^d, and the operator ^T designates vector/matrix transposition. Bring the above expression to its simplest form for the case Σ = σ² I. We recommend using the fact that ||x − μ||² = (x − μ)^T (x − μ) = (x − μ) · (x − μ), where the operator · designates the scalar product of vectors.

Remark: The first of the last two equalities involves a slight abuse (or, rather, a convention) of notation: a real matrix of size 1×1 is identified with the real number that is its only element. The same kind of abuse/convention also occurred in writing the expression exp(...) above.

Answer:

Σ = σ² I ⟹ Σ^{-1} = (1/σ²) I, and also |Σ| = (σ²)^d ⟹ √|Σ| = σ^d, since σ > 0. Therefore,

N(x; μ, Σ = σ² I) = ( 1/√((2π)^d |Σ|) ) exp( −(1/2) (x − μ)^T Σ^{-1} (x − μ) )
= ( 1/((√(2π))^d σ^d) ) exp( −(1/2) (x − μ)^T (1/σ²) (x − μ) )
= ( 1/(√(2π) σ)^d ) exp( −||x − μ||²/(2σ²) ).
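A one-line numeric confirmation of this simplification: with Σ = σ²I the d-variate density factorizes into a product of univariate normal densities (the numbers below are arbitrary):

x <- c(1, 2, 3); mu <- c(0, 1, 1); sigma <- 2; d <- length(x)
dens.simplified <- (2 * pi * sigma^2)^(-d / 2) * exp(-sum((x - mu)^2) / (2 * sigma^2))
dens.factorized <- prod(dnorm(x, mu, sigma))
c(dens.simplified, dens.factorized)   # equal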

b. We associate to each instance x_i an indicator vector (more precisely, a vector of random variables) z_i ∈ {0,1}^K, with z_ij = 1 if and only if x_i was generated by the Gaussian N(x; μ_j, σ² I). For the E-step of the EM algorithm, you will first prove that the expectation E[z_ij] := E[z_ij | x_i; π, μ, σ], where x_i = (x_{i,1},...,x_{i,d}) ∈ R^d, μ := (μ_1,...,μ_K) ∈ (R^d)^K and σ ∈ R_+, has the value P(z_ij = 1 | x_i; π, μ, σ), and then you will work out the formula for computing this probability, using Bayes' theorem.

Answer:
For the sake of simplicity, we will write E[z_ij] for E[z_ij | x_i; π, μ, σ]. So,

E[z_ij] := 0 · P(z_ij = 0 | x_i; π, μ, σ) + 1 · P(z_ij = 1 | x_i; π, μ, σ) = P(z_ij = 1 | x_i; π, μ, σ)
=(Bayes) P(x_i | z_ij = 1; π, μ, σ) P(z_ij = 1; π, μ, σ) / P(x_i; π, μ, σ)
= P(x_i | z_ij = 1; π, μ, σ) P(z_ij = 1; π, μ, σ) / Σ_{j'=1}^K P(x_i | z_ij' = 1; π, μ, σ) P(z_ij' = 1; π, μ, σ)
=(a) ( 1/(√(2π)σ)^d ) exp(−||x_i − μ_j||²/(2σ²)) π_j / Σ_{j'=1}^K ( 1/(√(2π)σ)^d ) exp(−||x_i − μ_j'||²/(2σ²)) π_j'
= π_j exp(−||x_i − μ_j||²/(2σ²)) / Σ_{j'=1}^K π_j' exp(−||x_i − μ_j'||²/(2σ²)).   (7)

c. Show that the log-likelihood of the "complete" data with respect to the parameters π, μ and σ is

ln p(x, z | π, μ, σ) = Σ_{i=1}^n Σ_{j=1}^K z_ij ( ln π_j + ln N(x_i; μ_j, σ² I) ),

where x := (x_1,...,x_n) and z := (z_1,...,z_n). Then derive the expression of the "auxiliary function"

Q(π, μ, σ | π^(t), μ^(t), σ^(t)) := E[ln p(x, z | π, μ, σ)],

with the mention that this expectation is computed with respect to the distribution(s) P(z_ij | x_i, π^(t), μ^(t), σ^(t)).

Answer:
The log-likelihood of the complete data is

ln p(x, z | π, μ, σ) := ln p((x_1,z_1),...,(x_n,z_n) | π, μ, σ) =(i.i.d.) ln Π_{i=1}^n p(x_i, z_i | π, μ, σ)
= Σ_{i=1}^n ln ( p(x_i | z_i; π, μ, σ) · p(z_i | π, μ, σ) ), where the latter factor is π_j for the j with z_ij = 1
= Σ_{i=1}^n Σ_{j=1}^K z_ij ( ln N(x_i | μ_j, σ) + ln π_j ),

since z_i = (z_{i,1},...,z_{i,j},...,z_{i,K}), with z_{i,j} = 1 and z_{i,j'} = 0 for all j' ≠ j.

Furthermore, using the result we got at part a, we can write

ln p(x, z | π, μ, σ) = Σ_{i=1}^n Σ_{j=1}^K z_ij ( −(d/2) ln(2πσ²) − ||x_i − μ_j||²/(2σ²) + ln π_j ).

Finally, using the linearity of expectation,

Q(π, μ, σ | π^(t), μ^(t), σ^(t)) := E[ln p(x, z | π, μ, σ)] = Σ_{i=1}^n Σ_{j=1}^K E[z_ij] ( −(d/2) ln(2πσ²) − ||x_i − μ_j||²/(2σ²) + ln π_j ).

For the M-step, in the setting specified in the problem statement, we will have to solve the following optimization problem:

(π^(t+1), μ^(t+1), σ^(t+1)) = argmax_{π,μ,σ} Q(π, μ, σ | π^(t), μ^(t), σ^(t)),   (8)

with the constraints π_j^(t+1) ≥ 0 for j = 1,...,K and Σ_{j=1}^K π_j^(t+1) = 1. This problem is solved by optimizing its objective function separately with respect to the variables π, μ and σ.

d. Apply the method of Lagrange multipliers to solve the constrained optimization problem (8) with respect to (only) the variables π.

Answer:

For now, we ignore the constraints π_j ≥ 0. Thus, the Lagrangean functional associated to our optimization problem is

(π^(t+1), μ^(t+1), σ^(t+1)) = argmax_{π,μ,σ,λ} ( Q(π, μ, σ | π^(t), μ^(t), σ^(t)) + λ (1 − Σ_{j=1}^K π_j) ),   (9)

with λ ∈ R playing the role of the Lagrange multiplier. (After solving this optimization problem it will be seen that indeed π_j ≥ 0 for all j = 1,...,K.)

Taking the partial derivative of the Lagrangean functional with respect to π_j (with j ∈ {1,...,K}) and equating it to zero,

∂/∂π_j ( Q(π, μ, σ | π^(t), μ^(t), σ^(t)) + λ (1 − Σ_{j'=1}^K π_{j'}) ) = 0
⟺ Σ_{i=1}^n E[z_ij] ∂/∂π_j ( −(d/2) ln(2πσ²) − ||x_i − μ_j||²/(2σ²) + ln π_j ) − λ = 0
⟺ Σ_{i=1}^n E[z_ij] (1/π_j) = λ,

we obtain the solution

π_j^(t+1) = (1/λ) Σ_{i=1}^n E[z_ij].   (10)

Moreover, since these solutions should satisfy the constraint Σ_{j=1}^K π_j^(t+1) = 1, it follows that

Σ_{j=1}^K (1/λ) Σ_{i=1}^n E[z_ij] = 1  ⟺  (1/λ) Σ_{i=1}^n Σ_{j=1}^K E[z_ij] = 1  =(b)⟹  (1/λ) Σ_{i=1}^n Σ_{j=1}^K P(z_ij = 1 | x_i, π, μ, σ) = 1  ⟺  (1/λ) Σ_{i=1}^n 1 = 1  ⟺  λ = n.

By replacing this value of λ back into relation (10), we get

π_j^(t+1) = (1/n) Σ_{i=1}^n E[z_ij].   (11)

It is obvious that π_j^(t+1) ≥ 0.

e. Optimize the function Q (i.e., solve the optimization problem (8)) with respect to the variables μ.

Answer: Remember the notation μ = (μ_1,...,μ_K) ∈ (R^d)^K, with μ_j ∈ R^d for j = 1,...,K.

∇_{μ_j} Q(π, μ, σ | π^(t), μ^(t), σ^(t)) = ∇_{μ_j} Σ_{i=1}^n Σ_{j'=1}^K E[z_ij'] ( −(d/2) ln(2πσ²) − ||x_i − μ_j'||²/(2σ²) + ln π_j' )
= Σ_{i=1}^n E[z_ij] ( −(1/(2σ²)) ∇_{μ_j} ||x_i − μ_j||² ) = −(1/(2σ²)) Σ_{i=1}^n E[z_ij] · 2 (x_i − μ_j)(−1)
= (1/σ²) Σ_{i=1}^n E[z_ij] (x_i − μ_j) = (1/σ²) ( Σ_{i=1}^n E[z_ij] x_i − ( Σ_{i=1}^n E[z_ij] ) μ_j ).

After equating this expression to zero (in fact, to the zero vector (0,...,0) ∈ R^d), we get the solution

μ_j^(t+1) = Σ_{i=1}^n E[z_ij] x_i / Σ_{i=1}^n E[z_ij].   (12)

f. Optimize the function Q (i.e., solve the optimization problem (8)) with respect to the variable σ.

Answer: The partial derivative of Q(π, μ, σ | π^(t), μ^(t), σ^(t)) with respect to σ is:

∂Q/∂σ = Σ_{i=1}^n Σ_{j=1}^K E[z_ij] ∂/∂σ ( −(d/2) ln(2πσ²) − ||x_i − μ_j||²/(2σ²) + ln π_j )
= Σ_{i=1}^n Σ_{j=1}^K E[z_ij] ( −d/σ + ||x_i − μ_j||²/σ³ )
= (1/σ³) ( −dσ² Σ_{i=1}^n Σ_{j=1}^K E[z_ij] + Σ_{i=1}^n Σ_{j=1}^K E[z_ij] ||x_i − μ_j||² )
= (1/σ³) ( −ndσ² + Σ_{i=1}^n Σ_{j=1}^K E[z_ij] ||x_i − μ_j||² ),

since Σ_{j=1}^K E[z_ij] = 1 for every i.

By equating ∂Q/∂σ to 0 we get the solution

(σ^(t+1))² = (1/(nd)) Σ_{i=1}^n Σ_{j=1}^K E[z_ij] ||x_i − μ_j^(t+1)||² ≥ 0.   (13)

It is not too difficult to see that this solution leads to the maximization of Q. (Note that σ > 0 and that the expression −ndσ² + Σ_{i=1}^n Σ_{j=1}^K E[z_ij] ||x_i − μ_j||², as a function of σ², is linear and decreasing.)

g. Summarize the results obtained at the points above (b and d-f), writing in pseudocode the EM algorithm for solving the Gaussian mixture in the problem statement.

Answer: By gathering the relations (7), (11), (12) and (13), we are now able to write the pseudocode of our algorithm:

Initialize the a priori probabilities π, the means μ, and the variance σ².
Iterate until a certain termination condition is met:

  Step E: compute the expectations (i.e., the a posteriori probabilities) of the z variables:
  p_ij^(t) := E[z_ij | x_i; π^(t), μ^(t), σ^(t)] = π_j^(t) N(x_i; μ_j^(t), (σ^(t))² I) / Σ_{j'=1}^K π_{j'}^(t) N(x_i; μ_{j'}^(t), (σ^(t))² I)

  Step M: compute the new values of π, μ and σ:
  π_j^(t+1) = (1/n) Σ_{i=1}^n p_ij^(t)
  μ_j^(t+1) = Σ_{i=1}^n p_ij^(t) x_i / Σ_{i=1}^n p_ij^(t)
  (σ^(t+1))² = (1/(dn)) Σ_{i=1}^n Σ_{j=1}^K p_ij^(t) ||x_i − μ_j^(t+1)||²
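A minimal R sketch of this pseudocode (the function name em.gmm.spherical and the random initialization of the means are arbitrary choices; note that the normalizing constant (2πσ²)^(d/2) cancels in the responsibilities, so only the exponentials are needed):

em.gmm.spherical <- function(x, K, iter = 100) {
  n <- nrow(x); d <- ncol(x)
  mu <- x[sample(n, K), , drop = FALSE]        # initialize means at K random points
  prior <- rep(1 / K, K)
  sigma2 <- mean(apply(x, 2, var))
  for (t in seq_len(iter)) {
    # E-step: responsibilities p_ij, relation (7)
    d2 <- sapply(1:K, function(j) rowSums(sweep(x, 2, mu[j, ])^2))
    w  <- sweep(exp(-d2 / (2 * sigma2)), 2, prior, "*")
    p  <- w / rowSums(w)
    # M-step: relations (11), (12) and (13)
    nj     <- colSums(p)
    prior  <- nj / n
    mu     <- sweep(t(p) %*% x, 1, nj, "/")
    d2     <- sapply(1:K, function(j) rowSums(sweep(x, 2, mu[j, ])^2))
    sigma2 <- sum(p * d2) / (n * d)
  }
  list(prior = prior, mu = mu, sigma2 = sigma2, p = p)
}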

Important Remark: a slight generalization

Suppose that our mixture model is made of Gaussians with unrestricted diagonal covariance matrices, i.e.,

X_i | Z_i = j  ~  N( (μ_{j,1},...,μ_{j,d}), diag((σ_{j,1})², (σ_{j,2})²,...,(σ_{j,d})²) ),   Z_i ~ Categorical(p_1,...,p_K).

It can easily be proven (by going along the lines of the above proof) that the only update relation that changes in the formulation of the EM algorithm is (13). It becomes

(σ_{j,k}^(t+1))² = Σ_{i=1}^n E[z_ij] (x_{i,k} − μ_{j,k}^(t+1))² / Σ_{i=1}^n E[z_ij] ≥ 0, for j = 1,...,K and k = 1,...,d.   (14)

This is indeed what we would expect, given on one hand the fact that the components of the Gaussians are mutually independent, and on the other hand the updating formulas [that we have already obtained] for the EM algorithm for solving mixtures of uni-variate Gaussians [of unrestricted form].

Teorema Reziduurilor şi Bucuria Integralelor Reale Prezentare de Alexandru Negrescu

Teorema Reziduurilor şi Bucuria Integralelor Reale Prezentare de Alexandru Negrescu Teorema Reiduurilor şi Bucuria Integralelor Reale Preentare de Alexandru Negrescu Integrale cu funcţii raţionale ce depind de sint şi cost u notaţia e it, avem: cost sint i ( + ( dt d i, iar integrarea

More information

Soluţii juniori., unde 1, 2

Soluţii juniori., unde 1, 2 Soluţii juniori Problema 1 Se consideră suma S x1x x3x4... x015 x016 Este posibil să avem S 016? Răspuns: Da., unde 1,,..., 016 3, 3 Termenii sumei sunt de forma 3 3 1, x x x. 3 5 6 sau Cristian Lazăr

More information

Sisteme cu logica fuzzy

Sisteme cu logica fuzzy Sisteme cu logica fuzzy 1/15 Sisteme cu logica fuzzy Mamdani Fie un sistem cu logică fuzzy Mamdani două intrări x şi y ieşire z x y SLF Structura z 2/15 Sisteme cu logica fuzzy Mamdani Baza de reguli R

More information

1.3. OPERAŢII CU NUMERE NEZECIMALE

1.3. OPERAŢII CU NUMERE NEZECIMALE 1.3. OPERAŢII CU NUMERE NEZECIMALE 1.3.1 OPERAŢII CU NUMERE BINARE A. ADUNAREA NUMERELOR BINARE Reguli de bază: 0 + 0 = 0 transport 0 0 + 1 = 1 transport 0 1 + 0 = 1 transport 0 1 + 1 = 0 transport 1 Pentru

More information

O V E R V I E W. This study suggests grouping of numbers that do not divide the number

O V E R V I E W. This study suggests grouping of numbers that do not divide the number MSCN(2010) : 11A99 Author : Barar Stelian Liviu Adress : Israel e-mail : stelibarar@yahoo.com O V E R V I E W This study suggests grouping of numbers that do not divide the number 3 and/or 5 in eight collumns.

More information

Procedeu de demonstrare a unor inegalităţi bazat pe inegalitatea lui Schur

Procedeu de demonstrare a unor inegalităţi bazat pe inegalitatea lui Schur Procedeu de demonstrare a unor inegalităţi bazat pe inegalitatea lui Schur Andi Gabriel BROJBEANU Abstract. A method for establishing certain inequalities is proposed and applied. It is based upon inequalities

More information

Divizibilitate în mulțimea numerelor naturale/întregi

Divizibilitate în mulțimea numerelor naturale/întregi Divizibilitate în mulțimea numerelor naturale/întregi Teorema îmărţirii cu rest în mulțimea numerelor naturale Fie a, b, b 0. Atunci există q, r astfel încât a=bq+r, cu 0 r < b. În lus, q şi r sunt unic

More information

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

More information

CSE446: Clustering and EM Spring 2017

CSE446: Clustering and EM Spring 2017 CSE446: Clustering and EM Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, Dan Klein, and Luke Zettlemoyer Clustering systems: Unsupervised learning Clustering Detect patterns in unlabeled

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

ON THE QUATERNARY QUADRATIC DIOPHANTINE EQUATIONS (II) NICOLAE BRATU 1 ADINA CRETAN 2

ON THE QUATERNARY QUADRATIC DIOPHANTINE EQUATIONS (II) NICOLAE BRATU 1 ADINA CRETAN 2 ON THE QUATERNARY QUADRATIC DIOPHANTINE EQUATIONS (II) NICOLAE BRATU 1 ADINA CRETAN ABSTRACT This paper has been updated and completed thanks to suggestions and critics coming from Dr. Mike Hirschhorn,

More information

UNITATEA DE ÎNVĂȚARE 3 Analiza algoritmilor

UNITATEA DE ÎNVĂȚARE 3 Analiza algoritmilor UNITATEA DE ÎNVĂȚARE 3 Analiza algoritmilor Obiective urmărite: La sfârşitul parcurgerii acestei UI, studenţii vor 1.1 cunoaște conceptul de eficienta a unui algoritm vor cunoaste si inţelege modalitatile

More information

Gradul de comutativitate al grupurilor finite 1

Gradul de comutativitate al grupurilor finite 1 Gradul de comutativitate al grupurilor finite Marius TĂRNĂUCEANU Abstract The commutativity degree of a group is one of the most important probabilistic aspects of finite group theory In this survey we

More information

Mixtures of Gaussians continued

Mixtures of Gaussians continued Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information
