Bayesian Networks Structure Learning (cont.)

Koller & Friedman chapters (handed out): Chapter 11 (short); Chapter 12: 12.1, 12.2, 12.3 (covered in the beginning of the semester), 12.4 (learning parameters for BNs); Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic structure learning). Learning BN tutorial (class website): ftp://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf. TAN paper (class website): http://www.cs.huji.ac.il/~nir/abstracts/frgg1.html. Bayesian Networks Structure Learning (cont.), Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 3rd, 2006

Learning Bayes nets: the cases depend on known vs. unknown structure and fully observable vs. missing data. From data x^(1) … x^(m) we learn the structure and the CPTs P(X_i | Pa_{X_i}) (the parameters).

Learning the CPTs: for each discrete variable X_i, use the data x^(1) … x^(M) to estimate P̂(X_i = x | Pa_{X_i} = u) = Count(x, u) / Count(u). WHY is this count ratio the right estimate? (It is the maximum likelihood estimate; see the information-theoretic interpretation below.)
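
To make the count-ratio estimate concrete, here is a minimal sketch in Python (not from the lecture; the function name and the toy Flu/Allergy/Sinus data are my own) that estimates one CPT by maximum likelihood:

```python
from collections import Counter

def mle_cpt(samples, child, parents):
    """Maximum-likelihood CPT: P_hat(child = x | parents = u) = Count(x, u) / Count(u)."""
    joint = Counter()    # Count(parent assignment, child value)
    parent = Counter()   # Count(parent assignment)
    for s in samples:
        u = tuple(s[p] for p in parents)
        joint[(u, s[child])] += 1
        parent[u] += 1
    return {(u, x): n / parent[u] for (u, x), n in joint.items()}

# Toy (hypothetical) data for estimating P(Sinus | Flu, Allergy)
data = [
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 0, "Allergy": 0, "Sinus": 0},
    {"Flu": 0, "Allergy": 1, "Sinus": 1},
]
print(mle_cpt(data, "Sinus", ["Flu", "Allergy"]))
```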

Information-theoretic interpretation of maximum likelihood. Given the structure, the log likelihood of the data decomposes over families (each node together with its parents); see the reconstructed derivation after part 3 below. (Example network: Flu, Allergy → Sinus; Sinus → Headache, Nose.)

Maximum likelihood (ML) for learning BN structure: enumerate possible structures, learn parameters for each using ML, and score each structure. (Example network: Flu, Allergy, Sinus, Headache, Nose; data <x_1^(1), …, x_n^(1)>, …, <x_1^(M), …, x_n^(M)>.)

Information-theoretic interpretation of maximum likelihood 2. Given structure, the log likelihood of the data: grouping identical terms, the sum over the M samples becomes M times a sum over the empirical distribution of each family.

Information-theoretic interpretation of maximum likelihood 3. Given structure, the log likelihood of the data equals M times the sum of empirical mutual informations between each variable and its parents, minus M times the sum of the individual empirical entropies (the entropy term does not depend on the structure).
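
Putting the three steps above together — this is a reconstruction of the standard derivation (using the empirical distribution P̂ over the M samples), not a verbatim copy of the slides:

```latex
\begin{align*}
\log \hat{P}(\mathcal{D} \mid \hat{\theta}_G, G)
  &= \sum_{m=1}^{M} \sum_{i=1}^{n} \log \hat{P}\big(x_i^{(m)} \mid \mathbf{pa}_{X_i}^{(m)}\big) \\
  &= M \sum_{i=1}^{n} \sum_{x_i,\, \mathbf{pa}_{X_i}} \hat{P}(x_i, \mathbf{pa}_{X_i}) \log \hat{P}(x_i \mid \mathbf{pa}_{X_i}) \\
  &= M \sum_{i=1}^{n} \hat{I}(X_i; \mathbf{Pa}_{X_i}) \;-\; M \sum_{i=1}^{n} \hat{H}(X_i).
\end{align*}
```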

Mutual information and independence tests. Independence testing is a statistically difficult task! Intuitive approach: mutual information, Î(X_i, X_j) = Σ_{x_i, x_j} P̂(x_i, x_j) log [ P̂(x_i, x_j) / (P̂(x_i) P̂(x_j)) ]. Mutual information and independence: X_i and X_j are independent if and only if I(X_i, X_j) = 0. Conditional mutual information I(X_i, X_j | X_k) is defined analogously using conditional distributions.
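
A minimal sketch (my own code, not from the slides) of estimating the empirical mutual information of two discrete variables from paired samples:

```python
import math
from collections import Counter

def empirical_mi(xs, ys):
    """Estimate I(X; Y) in nats from paired samples of two discrete variables."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Independent variables give MI = 0; identical variables give H(X).
print(empirical_mi([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
print(empirical_mi([0, 0, 1, 1], [0, 0, 1, 1]))  # log 2 ≈ 0.693
```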

Decomposable score: the log data likelihood (and the structure scores built on it) decomposes into a sum of per-family terms, one for each node together with its parents.

Scoring a tree 1: equivalent trees. In a tree every node has a single parent, and mutual information is symmetric, so all trees with the same undirected skeleton (any choice of root and edge directions) receive the same score.

Scoring a tree 2: similar trees. The score of a tree is M Σ_{edges (i,j)} Î(X_i, X_j) − M Σ_i Ĥ(X_i); the entropy term is the same for every structure, so the best tree is the one that maximizes the sum of edge mutual informations.

Chow-Liu tree learning algorithm 1. For each pair of variables X_i, X_j: compute the empirical distribution P̂(x_i, x_j) = Count(x_i, x_j)/M, and compute the mutual information Î(X_i, X_j) = Σ_{x_i, x_j} P̂(x_i, x_j) log [ P̂(x_i, x_j) / (P̂(x_i) P̂(x_j)) ]. Define a graph with nodes X_1, …, X_n, where edge (i, j) gets weight Î(X_i, X_j).

Chow-Liu tree learning algorithm 2. Optimal tree BN: compute the maximum weight spanning tree. Directions in the BN: pick any node as root; breadth-first search defines the directions.
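
A compact sketch of the whole procedure (my own code, not the lecture's): pairwise empirical mutual information as edge weights, a Prim-style maximum-weight spanning tree, and edges directed away from an arbitrary root, which plays the role of the breadth-first orientation above.

```python
import math
from collections import Counter
import numpy as np

def mutual_info(a, b):
    """Empirical mutual information (in nats) of two discrete sample vectors."""
    n = len(a)
    pab, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def chow_liu(data):
    """data: (M, n) array of discrete values. Returns directed tree edges (parent, child)."""
    n = data.shape[1]
    weight = {(i, j): mutual_info(data[:, i], data[:, j])
              for i in range(n) for j in range(i + 1, n)}
    in_tree, edges = {0}, []          # root the tree at variable 0
    while len(in_tree) < n:           # Prim-style maximum-weight spanning tree
        i, j = max(((i, j) for (i, j) in weight if (i in in_tree) ^ (j in in_tree)),
                   key=lambda e: weight[e])
        parent, child = (i, j) if i in in_tree else (j, i)
        edges.append((parent, child)) # edges point away from the root
        in_tree.add(child)
    return edges

rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = np.where(rng.random(500) < 0.9, x0, 1 - x0)   # noisy copy of x0
x2 = rng.integers(0, 2, 500)                       # independent of the others
print(chow_liu(np.column_stack([x0, x1, x2])))     # e.g. [(0, 1), (0, 2)] or similar
```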

Can we extend Chow-Liu? 1: Tree-augmented naïve Bayes (TAN) [Friedman et al. '97]. The naïve Bayes model overcounts evidence because correlation between features is not considered. Same as Chow-Liu, but score edges with the conditional mutual information given the class variable, Î(X_i, X_j | C).

Can we extend Chow-Liu? 2: (Approximately) learning models with tree-width up to k [Narasimhan & Bilmes '04]. But the cost is O(n^{k+1}).

Scoring general graphical models: the model selection problem. What's the best structure? (Example network: Flu, Allergy, Sinus, Headache, Nose; data <x_1^(1), …, x_n^(1)>, …, <x_1^(M), …, x_n^(M)>.) The more edges, the fewer independence assumptions and the higher the likelihood of the data, but it will overfit.

Maximum likelihood overfits! Information never hurts: Î(X_i; Pa ∪ {Z}) ≥ Î(X_i; Pa), so adding a parent never decreases — and in practice always increases — the likelihood score!!!

Bayesian score avoids overfitting. Given a structure, keep a distribution over parameters and score with the marginal likelihood, log P(D | G) = log ∫ P(D | θ_G, G) P(θ_G | G) dθ_G. Difficult integral: use the Bayesian information criterion (BIC) approximation (equivalent as M → ∞): log P(D | G) ≈ log P(D | θ̂_G, G) − (log M / 2)·Dim(G). Note: the complexity penalty regularizes, and the BIC score is equivalent to an MDL score. Finding the best BN under BIC is still NP-hard.
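
A rough sketch of the BIC score in its decomposable form (my own code and simplifications, assuming complete discrete data; the parameter count uses only the parent configurations that appear in the data):

```python
import math
from collections import Counter
import numpy as np

def family_bic(data, child, parents):
    """BIC contribution of one family: log likelihood minus (log M / 2) * #free params."""
    M = data.shape[0]
    pa_counts = Counter(map(tuple, data[:, parents])) if parents else Counter({(): M})
    joint = Counter((tuple(row[parents]), row[child]) for row in data)
    loglik = sum(c * math.log(c / pa_counts[pa]) for (pa, _), c in joint.items())
    n_params = len(pa_counts) * (len(set(data[:, child])) - 1)
    return loglik - 0.5 * math.log(M) * n_params

def bic_score(data, structure):
    """structure: dict child -> list of parent column indices. Decomposable sum over families."""
    return sum(family_bic(data, c, ps) for c, ps in structure.items())

rng = np.random.default_rng(1)
x0 = rng.integers(0, 2, 200)
x1 = np.where(rng.random(200) < 0.9, x0, 1 - x0)   # depends on x0
data = np.column_stack([x0, x1])
print(bic_score(data, {0: [], 1: [0]}))   # structure with the true edge
print(bic_score(data, {0: [], 1: []}))    # empty graph (independence)
```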

How many graphs are there? Super-exponentially many: over n variables there are at least 2^{n(n−1)/2} possible DAGs.

Structure learning for general graphs. In a tree, a node has only one parent. Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2. Most structure learning approaches use heuristics that exploit score decomposition. (Quickly) describe two heuristics that exploit decomposition in different ways.

Learn BN structure using local search: starting from the Chow-Liu tree, local search with possible moves: add edge, delete edge, invert (reverse) edge. Score candidate structures using BIC.

What you need to know about learning BNs Learning BNs Maximum likelihood or MAP learns parameters Decomposable score Best tree (Chow-Liu) Best TAN Other BNs, usually local search with BIC score

Unsupervised learning or Clustering: K-means, Gaussian mixture models. Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 3rd, 2006

Some Data 4

K-means 1. Ask user how many clusters they'd like. (e.g. k=5)

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints)

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns 5. …and jumps there 6. Repeat until terminated!
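
The six steps above map directly onto a few lines of NumPy; this is a generic sketch of Lloyd's algorithm (my own code, not the lecture's):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: X is (N, d); returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random guess
    for _ in range(n_iters):
        # step 3: each datapoint finds the closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # steps 4-5: each center jumps to the centroid of the points it owns
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # step 6: stop when stable
            break
        centers = new_centers
    return centers, assign

X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5]])
centers, assign = kmeans(X, k=2)
print(np.round(centers, 2))
```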

Unsupervised Learning. You walk into a bar. A stranger approaches and tells you: "I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(w_i)'s." So far, looks straightforward. "I need a maximum likelihood estimate of the µ_i's." No problem. "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)" Uh oh!!

Gaussian Bayes Classifier Reminder: P(y = i | x) = p(x | y = i) P(y = i) / p(x), where p(x | y = i) = 1 / ( (2π)^{m/2} |Σ_i|^{1/2} ) · exp( −½ (x − µ_i)^T Σ_i^{−1} (x − µ_i) ). How do we deal with that?

Predicting wealth from age 3

Predicting wealth from age 33

Learning modelyear, mpg ---> maker: fit a joint Gaussian with a general covariance matrix Σ (all pairwise covariances σ_{ij} and variances σ_i²).

General: O(m²) parameters. Σ is the full m×m covariance matrix: Σ = ( σ_1², σ_{12}, …, σ_{1m} ; σ_{12}, σ_2², …, σ_{2m} ; ⋮ ; σ_{1m}, σ_{2m}, …, σ_m² ).

Aligned: O(m) parameters. Σ is diagonal (an axis-aligned Gaussian): Σ = diag(σ_1², σ_2², σ_3², …, σ_m²).

Spherical: O(1) cov parameters. Σ = σ²I, the same variance in every direction: Σ = diag(σ², σ², …, σ²).

Next back to Density Estimation What if we want to do density estimation with multimodal or clumpy data? 4

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. (Figure: component means µ_1, µ_2, µ_3.)

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe:

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(y_i).

The GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(y_i). 2. Datapoint ~ N(µ_i, σ²I).

The General GMM assumption: There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(y_i). 2. Datapoint ~ N(µ_i, Σ_i). (Figure: component means µ_1, µ_2, µ_3.)
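
The two-step recipe above is easy to state as code; a small sketch (my own, with made-up parameters) that samples from a general GMM:

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Generate n points: pick component i with prob weights[i], then draw from N(means[i], covs[i])."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(weights), size=n, p=weights)          # step 1: pick a component
    X = np.array([rng.multivariate_normal(means[c], covs[c])     # step 2: Gaussian draw
                  for c in comps])
    return X, comps

weights = [0.5, 0.3, 0.2]                                        # hypothetical P(y_i)'s
means = [np.zeros(2), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.array([[1.0, 0.8], [0.8, 1.0]])]
X, y = sample_gmm(500, weights, means, covs)
print(X.shape, np.bincount(y))
```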

Unsupervised Learning: not as hard as it looks. Sometimes easy, sometimes impossible, and sometimes in between. (In case you're wondering what these diagrams are, they show 2-d unlabeled data (x vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.)

Computing likelihoods in the supervised learning case. We have y_1, x_1, y_2, x_2, …, y_N, x_N. Learn P(y_1), P(y_2), …, P(y_k). Learn σ², µ_1, …, µ_k. By MLE: maximize P(y_1, x_1, y_2, x_2, …, y_N, x_N | µ_1, …, µ_k, σ²).

Computing likelihoods in the unsupervised case. We have x_1, x_2, …, x_N. We know P(y_1), P(y_2), …, P(y_k). We know σ². P(x | y_i, µ_1, …, µ_k) = the probability that an observation from class y_i would have value x given class means µ_1 … µ_k. Can we write an expression for that?

Likelihoods in the unsupervised case. We have x_1, x_2, …, x_n. We have P(y_1), …, P(y_k). We have σ². We can define, for any x, P(x | y_i, µ_1, µ_2, …, µ_k). Can we define P(x | µ_1, µ_2, …, µ_k)? Can we define P(x_1, x_2, …, x_n | µ_1, µ_2, …, µ_k)? [YES, IF WE ASSUME THE x_i'S WERE DRAWN INDEPENDENTLY]
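
Answering the slide's question in code: a minimal sketch (my own names; spherical Gaussians with known σ and known priors, as in the lecture's running example) of the mixture density P(x | µ_1 … µ_k) and the independent-draw log likelihood of a dataset:

```python
import numpy as np

def log_mixture_density(x, means, priors, sigma):
    """log P(x | mu_1..mu_k) = log sum_i P(y_i) N(x; mu_i, sigma^2 I)."""
    x, means = np.atleast_1d(x), np.atleast_2d(means)
    d = means.shape[1]
    sq = ((x - means) ** 2).sum(axis=1)
    log_comp = (np.log(priors) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
                - sq / (2 * sigma ** 2))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())   # log-sum-exp for numerical stability

def log_likelihood(X, means, priors, sigma):
    """log P(x_1..x_n | mu's), assuming the x_i's were drawn independently."""
    return sum(log_mixture_density(x, means, priors, sigma) for x in X)

means = np.array([[-2.0], [2.0]])            # hypothetical 1-d example
priors = np.array([1/3, 2/3])
X = np.array([[0.608], [-1.590], [0.235], [3.949], [-0.712]])
print(log_likelihood(X, means, priors, sigma=1.0))
```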

Unsupervised Learning: Mediumly Good News. We now have a procedure such that if you give me a guess at µ_1, µ_2, …, µ_k, I can tell you the probability of the unlabeled data given those µ's. Suppose the x's are 1-dimensional (from Duda and Hart). There are two classes, w_1 and w_2; P(y_1) = 1/3, P(y_2) = 2/3, σ = 1. There are 25 unlabeled datapoints: x_1 = 0.608, x_2 = -1.590, x_3 = 0.235, x_4 = 3.949, …, x_25 = -0.712.

Duda & Hart's Example. We can graph the prob. dist. function of data given our µ_1 and µ_2 estimates. We can also graph the true function from which the data was randomly generated. They are close. Good. The 2nd solution tries to put the 2/3 hump where the 1/3 hump should go, and vice versa. In this example, unsupervised is almost as good as supervised. If the x_1 … x_25 are given the class which was used to learn them, then the results are (µ_1 = -2.176, µ_2 = 1.684). Unsupervised got (µ_1 = -2.13, µ_2 = 1.668).

Duda & Hart's Example. Graph of log P(x_1, x_2, …, x_25 | µ_1, µ_2) against µ_1 and µ_2. Max likelihood = (µ_1 = -2.13, µ_2 = 1.668). There is a local optimum, very close in value to the global one, at (µ_1 = 2.085, µ_2 = -1.257)*. (* corresponds to switching y_1 with y_2.)

Finding the max likelihood µ_1, µ_2, …, µ_k. We can compute P(data | µ_1, µ_2, …, µ_k). How do we find the µ_i's which give max likelihood? The normal max likelihood trick: set ∂/∂µ_i log Prob(…) = 0 and solve for the µ_i's. Here you get non-linear, non-analytically-solvable equations. Use gradient descent: slow but doable. Or use a much faster, cuter, and recently very popular method…

Expectation Maximization

The E.M. Algorithm: DETOUR. We'll get back to unsupervised learning soon. But now we'll look at an even simpler case with hidden information. The EM algorithm can do trivial things, such as the contents of the next few slides; it is an excellent way of doing our unsupervised learning problem, as we'll see; and it has many, many other uses, including inference of Hidden Markov Models (future lecture).

Silly Example. Let events be grades in a class: w_1 = Gets an A, P(A) = ½; w_2 = Gets a B, P(B) = µ; w_3 = Gets a C, P(C) = 2µ; w_4 = Gets a D, P(D) = ½ − 3µ. (Note 0 ≤ µ ≤ 1/6.) Assume we want to estimate µ from data. In a given class there were a A's, b B's, c C's, d D's. What's the maximum likelihood estimate of µ given a, b, c, d?

Trivial Statistics. P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ. P(a, b, c, d | µ) = K (½)^a (µ)^b (2µ)^c (½ − 3µ)^d. log P(a, b, c, d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½ − 3µ). For max like µ, set ∂LogP/∂µ = 0: ∂LogP/∂µ = b/µ + 2c/(2µ) − 3d/(½ − 3µ) = 0. Gives max like µ = (b + c) / (6(b + c + d)). So if the class got 14 A's, 6 B's, 9 C's and 10 D's, the max like µ = 1/10. Boring, but true!
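
A tiny check of the closed-form estimate above (my own snippet, not from the slides):

```python
import math

def mle_mu(b, c, d):
    # Closed form from setting d(log P)/d(mu) = 0.
    return (b + c) / (6 * (b + c + d))

def log_p(mu, a, b, c, d):
    # log P(a,b,c,d | mu), dropping the constant log K.
    return (a * math.log(0.5) + b * math.log(mu)
            + c * math.log(2 * mu) + d * math.log(0.5 - 3 * mu))

a, b, c, d = 14, 6, 9, 10
mu_hat = mle_mu(b, c, d)
print(mu_hat)                                            # 0.1
# The closed form beats nearby values of mu:
print(all(log_p(mu_hat, a, b, c, d) >= log_p(m, a, b, c, d)
          for m in [0.05, 0.08, 0.12, 0.15]))            # True
```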

Same Problem with Hidden Information. Someone tells us that: Number of High grades (A's + B's) = h, Number of C's = c, Number of D's = d. REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ. What is the max. like estimate of µ now?

Same Problem with Hidden Information. Someone tells us that: Number of High grades (A's + B's) = h, Number of C's = c, Number of D's = d. What is the max. like estimate of µ now? We can answer this question circularly. EXPECTATION: if we knew the value of µ we could compute the expected values of a and b, a = h · (½)/(½ + µ) and b = h · µ/(½ + µ), since the ratio a : b should be the same as the ratio ½ : µ. MAXIMIZATION: if we knew the expected values of a and b we could compute the maximum likelihood value of µ, µ = (b + c)/(6(b + c + d)). REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ.

E.M. for our Trivial Problem. We begin with a guess for µ, then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b. Define µ(t) = the estimate of µ on the t'th iteration, b(t) = the estimate of b on the t'th iteration; µ(0) = initial guess. E-step: b(t) = µ(t) h / (½ + µ(t)) = E[b | µ(t)]. M-step: µ(t+1) = (b(t) + c) / (6(b(t) + c + d)) = max like est of µ given b(t). Continue iterating until converged. Good news: converging to a local optimum is assured. Bad news: I said local optimum. REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ − 3µ.

E.M. Convergence. The convergence proof is based on the fact that Prob(data | µ) must increase or remain the same between each iteration [NOT OBVIOUS], but it can never exceed 1 [OBVIOUS], so it must therefore converge [OBVIOUS]. In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0. Convergence is generally linear: error decreases by a constant factor each time step.
t   µ(t)     b(t)
1   0.0833   2.857
2   0.0937   3.158
3   0.0947   3.185
4   0.0948   3.187
5   0.0948   3.187
6   0.0948   3.187
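
The E-step/M-step recursion above is two lines of code; this sketch (my own) reproduces the convergence table for h = 20, c = 10, d = 10, µ(0) = 0:

```python
def em_grades(h, c, d, mu0=0.0, n_iters=6):
    mu = mu0
    b = h * mu / (0.5 + mu)                     # b(0) = E[b | mu(0)]
    for t in range(1, n_iters + 1):
        mu = (b + c) / (6 * (b + c + d))        # M-step: mu(t) from b(t-1)
        b = h * mu / (0.5 + mu)                 # E-step: b(t) = E[b | mu(t)]
        print(f"t={t}  mu={mu:.4f}  b={b:.3f}")

em_grades(h=20, c=10, d=10)
```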

Back to Unsupervised Learning of GMMs. Remember: we have unlabeled data x_1, x_2, …, x_R; we know there are k classes; we know P(y_1), P(y_2), P(y_3), …, P(y_k); we don't know µ_1, µ_2, …, µ_k. We can write P(data | µ_1 … µ_k) = p(x_1 … x_R | µ_1 … µ_k) = ∏_{i=1}^R p(x_i | µ_1 … µ_k) = ∏_{i=1}^R Σ_{j=1}^k p(x_i | w_j, µ_1 … µ_k) P(y_j) = ∏_{i=1}^R Σ_{j=1}^k K exp( −(1/(2σ²)) (x_i − µ_j)² ) P(y_j).

E.M. for GMMs. For max likelihood we know ∂/∂µ_j log Prob(data | µ_1 … µ_k) = 0. Some wild'n'crazy algebra turns this into: µ_j = [ Σ_{i=1}^R P(y_j | x_i, µ_1 … µ_k) x_i ] / [ Σ_{i=1}^R P(y_j | x_i, µ_1 … µ_k) ]. This is n nonlinear equations in the µ_j's. If, for each x_i, we knew the prob that x_i was in class y_j, i.e. P(y_j | x_i, µ_1 … µ_k), then we could easily compute µ_j. If we knew each µ_j then we could easily compute P(y_j | x_i, µ_1 … µ_k) for each y_j and x_i. I feel an EM experience coming on!! See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf

E.M. for GMMs. Iterate. On the t'th iteration let our estimates be λ_t = { µ_1(t), µ_2(t), …, µ_c(t) }. E-step: compute the expected classes of all datapoints for each class, P(y_i | x_k, λ_t) = p(x_k | y_i, µ_i(t), σ²I) p_i(t) / Σ_j p(x_k | y_j, µ_j(t), σ²I) p_j(t) — just evaluate a Gaussian at x_k. M-step: compute the max like µ given our data's class membership distributions, µ_i(t+1) = Σ_k P(y_i | x_k, λ_t) x_k / Σ_k P(y_i | x_k, λ_t).

E.M. Convergence Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works. As with all EM procedures, convergence to a local optimum guaranteed. This algorithm is REALLY USED. And in high dimensional state spaces, too. E.G. Vector Quantization for Speech Data 66

E.M. for General GMMs. Iterate. On the t'th iteration let our estimates be λ_t = { µ_1(t), …, µ_c(t), Σ_1(t), …, Σ_c(t), p_1(t), …, p_c(t) }, where p_i(t) is shorthand for the estimate of P(y_i) on the t'th iteration. E-step: compute the expected classes of all datapoints for each class, P(y_i | x_k, λ_t) = p(x_k | y_i, µ_i(t), Σ_i(t)) p_i(t) / Σ_j p(x_k | y_j, µ_j(t), Σ_j(t)) p_j(t) — just evaluate a Gaussian at x_k. M-step: compute the max like parameters given our data's class membership distributions: µ_i(t+1) = Σ_k P(y_i | x_k, λ_t) x_k / Σ_k P(y_i | x_k, λ_t); Σ_i(t+1) = Σ_k P(y_i | x_k, λ_t) [x_k − µ_i(t+1)][x_k − µ_i(t+1)]^T / Σ_k P(y_i | x_k, λ_t); p_i(t+1) = Σ_k P(y_i | x_k, λ_t) / R, where R = #records.
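
A compact sketch of these E/M updates (my own code, not the lecture's; a production implementation would work in log space and guard more carefully against collapsing components):

```python
import numpy as np

def gauss_pdf(X, mean, cov):
    """Evaluate N(x; mean, cov) at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv, det = np.linalg.inv(cov), np.linalg.det(cov)
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * det)

def em_gmm(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    R, d = X.shape
    mu = X[rng.choice(R, k, replace=False)]                  # initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    p = np.full(k, 1.0 / k)                                  # p_i(t): estimate of P(y_i)
    for _ in range(n_iters):
        # E-step: responsibilities P(y_i | x_k, lambda_t)
        dens = np.column_stack([p[i] * gauss_pdf(X, mu[i], Sigma[i]) for i in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update means, covariances, and mixing weights
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        p = Nk / R
    return mu, Sigma, p

rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 150),
               rng.multivariate_normal([4, 4], [[1, 0.6], [0.6, 1]], 150)])
mu, Sigma, p = em_gmm(X, k=2)
print(np.round(mu, 2), np.round(p, 2))
```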

Gaussian Mixture Example: Start Advance apologies: in Black and White this example will be incomprehensible 68

After first iteration 69

After 2nd iteration

After 3rd iteration 71

After 4th iteration 7

After 5th iteration 73

After 6th iteration 74

After 20th iteration

Some Bio Assay data 76

GMM clustering of the assay data 77

Resulting Density Estimator 78

Three classes of assay (each learned with its own mixture model)

Resulting Bayes Classifier 8

Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness. Yellow means anomalous; cyan means ambiguous.

Final Comments. Remember, E.M. can get stuck in local minima, and empirically it DOES. Our unsupervised learning example assumed the P(y_i)'s were known, and the variances fixed and known. It is easy to relax this. It's also possible to do Bayesian unsupervised learning instead of max. likelihood.

What you should know How to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data. Be happy with this kind of probabilistic analysis. Understand the two examples of E.M. given in these notes. 83

Acknowledgements. K-means & Gaussian mixture models presentation derived from an excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/ K-means applet: http://www.elet.polimi.it/upload/matteucc/clustering/tutorial_html/appletkm.html Gaussian mixture models applet: http://www.neurosci.aist.go.jp/%7eakaho/mixtureem.html