Mixtures of Gaussians and the EM Algorithm
CSE 6363 Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington
Gaussians

A popular way to estimate probability density functions is to model them as Gaussians. Review: a 1D normal distribution is defined as:

N(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))

To define a Gaussian, we need to specify just two parameters:
- μ, which is the mean (average) of the distribution.
- σ, which is the standard deviation of the distribution.
Note: σ² is called the variance of the distribution.
Estimating a Gaussian

In one dimension, a Gaussian is defined like this:

N(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))

Given a set of real numbers x_1, ..., x_n, we can easily find the best-fitting Gaussian for that data. The mean μ is simply the average of those numbers:

μ = (1/n) Σ_{i=1}^{n} x_i

The standard deviation σ is computed as:

σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − μ)² )
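As a minimal sketch of the two formulas above (the function name `fit_gaussian` is ours, not from the slides; the 1/n variance formula is used):

```python
import math

def fit_gaussian(xs):
    """Fit a 1D Gaussian: the mean is the average of the data,
    and sigma uses the 1/n formula from the slides."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

mu, sigma = fit_gaussian([2.0, 4.0, 6.0, 8.0])
# mu = 5.0, sigma = sqrt(5)
```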
Estimating a Gaussian

Fitting a Gaussian to data does not guarantee that the resulting Gaussian will be an accurate distribution for the data. The data may have a distribution that is very different from a Gaussian.
Example of Fitting a Gaussian

The blue curve is a density function F such that:
- F(x) = 0.25 for 1 ≤ x ≤ 3.
- F(x) = 0.5 for 7 ≤ x ≤ 8.
The red curve is the Gaussian fit G to data generated using F.
Naïve Bayes with 1D Gaussians

Suppose the patterns come from a d-dimensional space. Examples: pixels to be classified as skin or non-skin, or the statlog dataset. Notation: x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}).

For each dimension j, we can use a Gaussian to model the distribution p_j(x_{i,j} | C_k) of the data in that dimension, given their class. For example, for the statlog dataset we would get 216 Gaussians: 36 dimensions × 6 classes.

Then, we can use the naïve Bayes approach (i.e., assume independence of all dimensions given the class) to define P(x | C_k) as:

P(x | C_k) = Π_{j=1}^{d} p_j(x_j | C_k)
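A sketch of that product of per-dimension Gaussians (function names and the `params` layout of (μ_j, σ_j) pairs are our own illustration, not from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1D normal density N(x) with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def naive_bayes_density(x, params):
    """P(x | C_k) under the naive Bayes assumption: the product of
    per-dimension Gaussian densities. params is a list of (mu_j, sigma_j),
    one pair per dimension, for one class."""
    p = 1.0
    for x_j, (mu_j, sigma_j) in zip(x, params):
        p *= gaussian_pdf(x_j, mu_j, sigma_j)
    return p
```

For a 2D point at the mean of two standard normals, the product is (1/√(2π))² = 1/(2π).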
Mixtures of Gaussians

This figure shows our previous example, where we fitted a Gaussian to some data and the fit was poor. Overall, Gaussians have attractive properties: they require learning only two numbers (μ and σ), and thus require little training data to estimate those numbers. However, for some data, Gaussians are just not good fits.
Mixtures of Gaussians

Mixtures of Gaussians are oftentimes a better solution. They are defined in the next slide. They still require relatively few parameters to estimate, and thus can be learned from relatively small amounts of data. At the same time, they can fit actual data distributions quite well.
Mixtures of Gaussians

Suppose we have k Gaussian distributions N_i. Each N_i has its own mean μ_i and standard deviation σ_i. Using these k Gaussians, we can define a Gaussian mixture M as follows:

M(x) = Σ_{i=1}^{k} w_i N_i(x)

Each w_i is a weight, specifying the relative importance of Gaussian N_i in the mixture:
- Weights w_i are real numbers between 0 and 1.
- Weights w_i must sum up to 1, so that the integral of M is 1.
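The definition M(x) = Σ w_i N_i(x) translates directly into code; a minimal sketch (function names are ours):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1D normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """M(x) = sum_i w_i * N_i(x). The weights must sum to 1
    so that M integrates to 1."""
    return sum(w * gaussian_pdf(x, mu, s)
               for w, mu, s in zip(weights, mus, sigmas))
```

A mixture of two identical standard normals with weights 0.5 each is just a standard normal, so its value at 0 is 1/√(2π).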
Mixtures of Gaussians - Examples

In each of the following examples, the blue and green curves show two Gaussians N_1 and N_2, and the red curve shows a mixture of those Gaussians.
- w_1 = 0.9, w_2 = 0.1: the mixture looks a lot like N_1, but is influenced a little by N_2 as well.
- w_1 = 0.7, w_2 = 0.3: the mixture looks less like N_1 compared to the previous example, and is influenced more by N_2.
- w_1 = 0.5, w_2 = 0.5: at each point x, the value of the mixture is the average of N_1(x) and N_2(x).
- w_1 = 0.3, w_2 = 0.7: the mixture now resembles N_2 more than N_1.
- w_1 = 0.1, w_2 = 0.9: the mixture now is almost identical to N_2.
Learning a Mixture of Gaussians

Suppose we are given training data x_1, x_2, ..., x_n, and suppose all x_j belong to the same class c. How can we fit a mixture of Gaussians to this data? This will be the topic of the next few slides. We will learn a very popular machine learning algorithm, called the EM algorithm. EM stands for Expectation-Maximization.

Step 0 of the EM algorithm: pick k manually. That is, decide how many Gaussians the mixture should have. Any approach for choosing k automatically is beyond the scope of this class.
Learning a Mixture of Gaussians

Suppose we are given training data x_1, x_2, ..., x_n, and suppose all x_j belong to the same class c. We want to model P(x | c) as a mixture of Gaussians. Given k, how many parameters do we need to estimate in order to fully define the mixture? Remember, a mixture M of k Gaussians is defined as:

M(x) = Σ_{i=1}^{k} w_i N_i(x) = Σ_{i=1}^{k} w_i (1 / (σ_i√(2π))) e^(−(x−μ_i)² / (2σ_i²))

For each N_i, we need to estimate three numbers: w_i, μ_i, σ_i. So, in total, we need to estimate 3k numbers.
Learning a Mixture of Gaussians

Suppose that we knew, for each x_j, that it belongs to one and only one of the k Gaussians. Then, learning the mixture would be a piece of cake. For each Gaussian N_i:
- Estimate μ_i and σ_i based on the examples that belong to it.
- Set w_i equal to the fraction of examples that belong to N_i.
Learning a Mixture of Gaussians

However, we have no idea which Gaussian each x_j belongs to. If we knew μ_i and σ_i for each N_i, we could probabilistically assign each x_j to a component. "Probabilistically" means that we would not make a hard assignment; instead, we would partially assign x_j to different components, with each assignment weighted proportionally to the density value N_i(x_j).
Example of Partial Assignments

Using our previous example of a mixture, suppose x_j = 6.5. How do we assign 6.5 to the two Gaussians?
- N_1(6.5) = 0.0913.
- N_2(6.5) = 0.3521.
So:
- 6.5 belongs to N_1 by 0.0913 / (0.0913 + 0.3521) = 20.6%.
- 6.5 belongs to N_2 by 0.3521 / (0.0913 + 0.3521) = 79.4%.
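The slide's arithmetic, reproduced directly (the density values are taken from the slide; the ratio here uses densities alone, as the slide does, which implicitly assumes equal mixture weights):

```python
# Density values from the slide's example at x_j = 6.5.
n1, n2 = 0.0913, 0.3521

# Partial assignment of 6.5 to each Gaussian, proportional to density.
p1 = n1 / (n1 + n2)   # share assigned to N_1, about 0.206
p2 = n2 / (n1 + n2)   # share assigned to N_2, about 0.794
```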
The Chicken-and-Egg Problem

To recap, fitting a mixture of Gaussians to data involves estimating, for each N_i, the values w_i, μ_i, σ_i. If we could assign each x_j to one of the Gaussians, we could easily compute w_i, μ_i, σ_i. Even if we probabilistically assign x_j to multiple Gaussians, we can still easily compute w_i, μ_i, σ_i by adapting our previous formulas. We will see the adapted formulas in a few slides.

Conversely, if we knew μ_i, σ_i, and w_i, we could assign (at least probabilistically) each x_j to a Gaussian.

So, this is a chicken-and-egg problem. If we knew one piece, we could compute the other. But we know neither. So, what do we do?
On Chicken-and-Egg Problems

Such chicken-and-egg problems occur frequently in AI. Surprisingly (at least to people new to AI), we can easily solve such chicken-and-egg problems. Overall, chicken-and-egg problems in AI look like this:
- We need to know A to estimate B.
- We need to know B to compute A.
There is a fairly standard recipe for solving these problems:
- Start by giving A values chosen randomly (or perhaps non-randomly, but still in an uninformed way, since we do not know the correct values).
- Repeat this loop:
  - Given our current values for A, estimate B.
  - Given our current values of B, estimate A.
  - If the new values of A and B are very close to the old values, break.
The EM Algorithm - Overview

We use this approach to fit mixtures of Gaussians to data. The algorithm that fits mixtures of Gaussians to data is called the EM algorithm (Expectation-Maximization algorithm). Remember, we choose k (the number of Gaussians in the mixture) manually, so we don't have to estimate that.

To initialize the EM algorithm, we initialize each μ_i, σ_i, and w_i. Values w_i are set to 1/k. We can initialize μ_i and σ_i in different ways:
- Giving random values to each μ_i.
- Uniformly spacing the values given to each μ_i.
- Giving random values to each σ_i.
- Setting each σ_i to 1 initially.
Then, we iteratively perform two steps: the E-step and the M-step.
The E-Step

Given our current estimates for μ_i, σ_i, and w_i, we compute, for each i and j, the probability p_ij = P(N_i | x_j): the probability that x_j was generated by Gaussian N_i. How? Using Bayes rule:

p_ij = P(N_i | x_j) = P(x_j | N_i) P(N_i) / P(x_j) = N_i(x_j) w_i / P(x_j)

where:

N_i(x_j) = (1 / (σ_i√(2π))) e^(−(x_j−μ_i)² / (2σ_i²))

P(x_j) = Σ_{i'=1}^{k} w_{i'} N_{i'}(x_j)
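A minimal sketch of the E-step as just described (the names `e_step` and the list layout of the parameters are ours):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1D normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def e_step(xs, weights, mus, sigmas):
    """Return p[i][j] = P(N_i | x_j) = w_i N_i(x_j) / P(x_j),
    where P(x_j) = sum over i' of w_i' N_i'(x_j)."""
    k, n = len(weights), len(xs)
    p = [[0.0] * n for _ in range(k)]
    for j, x in enumerate(xs):
        dens = [weights[i] * gaussian_pdf(x, mus[i], sigmas[i]) for i in range(k)]
        total = sum(dens)                  # this is P(x_j)
        for i in range(k):
            p[i][j] = dens[i] / total
    return p
```

By construction, the probabilities for each x_j sum to 1 across the k Gaussians.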
The M-Step: Updating μ_i and σ_i

Given our current estimates of p_ij, for each i, j, we compute μ_i and σ_i for each N_i as follows:

μ_i = ( Σ_{j=1}^{n} p_ij x_j ) / ( Σ_{j=1}^{n} p_ij )

σ_i = sqrt( ( Σ_{j=1}^{n} p_ij (x_j − μ_i)² ) / ( Σ_{j=1}^{n} p_ij ) )

To understand these formulas, it helps to compare them to the standard formulas for fitting a Gaussian to data:

μ = (1/n) Σ_{j=1}^{n} x_j,    σ = sqrt( (1/n) Σ_{j=1}^{n} (x_j − μ)² )

Why do we take weighted averages at the M-step? Because each x_j is probabilistically assigned to multiple Gaussians. We use p_ij = P(N_i | x_j) as the weight of the assignment of x_j to N_i.
The M-Step: Updating w_i

At the M-step, in addition to updating μ_i and σ_i, we also need to update w_i, which is the weight of the i-th Gaussian in the mixture:

w_i = ( Σ_{j=1}^{n} p_ij ) / ( Σ_{i=1}^{k} Σ_{j=1}^{n} p_ij )

We sum up the weights of all objects for the i-th Gaussian, and divide that sum by the sum of weights of all objects for all Gaussians. The division ensures that Σ_{i=1}^{k} w_i = 1.
The EM Steps: Summary

E-step: given current estimates for each μ_i, σ_i, and w_i, update p_ij:

p_ij = N_i(x_j) w_i / P(x_j)

M-step: given our current estimates for each p_ij, update μ_i, σ_i, and w_i:

μ_i = ( Σ_{j=1}^{n} p_ij x_j ) / ( Σ_{j=1}^{n} p_ij )

σ_i = sqrt( ( Σ_{j=1}^{n} p_ij (x_j − μ_i)² ) / ( Σ_{j=1}^{n} p_ij ) )

w_i = ( Σ_{j=1}^{n} p_ij ) / ( Σ_{i=1}^{k} Σ_{j=1}^{n} p_ij )
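The M-step formulas above can be sketched as follows (the name `m_step` and the `p[i][j]` list-of-lists layout are our own conventions):

```python
import math

def m_step(xs, p):
    """Given responsibilities p[i][j] = P(N_i | x_j), return updated
    (weights, mus, sigmas) using the weighted-average formulas above."""
    k, n = len(p), len(xs)
    # Sum of all p_ij; equals n, since each column of p sums to 1.
    total = sum(sum(row) for row in p)
    weights, mus, sigmas = [], [], []
    for i in range(k):
        s = sum(p[i])
        mu = sum(p[i][j] * xs[j] for j in range(n)) / s
        sigma = math.sqrt(sum(p[i][j] * (xs[j] - mu) ** 2 for j in range(n)) / s)
        weights.append(s / total)
        mus.append(mu)
        sigmas.append(sigma)
    return weights, mus, sigmas
```

With hard (0/1) responsibilities, these updates reduce to the standard per-cluster mean, standard deviation, and fraction of examples.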
The EM Algorithm - Termination

The log likelihood of the training data is defined as:

L(x_1, ..., x_n) = log₂ Π_{j=1}^{n} M(x_j)

As a reminder, M is the Gaussian mixture, defined as:

M(x) = Σ_{i=1}^{k} w_i N_i(x) = Σ_{i=1}^{k} w_i (1 / (σ_i√(2π))) e^(−(x−μ_i)² / (2σ_i²))

One can prove that, after each iteration of the E-step and the M-step, this log likelihood increases or stays the same. We check how much the log likelihood changes at each iteration; when the change is below some threshold, we stop.
The EM Algorithm: Summary

Initialization:
- Initialize each μ_i, σ_i, w_i using your favorite approach (e.g., set each μ_i to a random value, set each σ_i to 1, and set each w_i to 1/k).
- last_log_likelihood = −infinity.
Main loop:
- E-step: given our current estimates for each μ_i, σ_i, and w_i, update each p_ij.
- M-step: given our current estimates for each p_ij, update each μ_i, σ_i, and w_i.
- log_likelihood = L(x_1, ..., x_n).
- If (log_likelihood − last_log_likelihood) < threshold, break.
- last_log_likelihood = log_likelihood.
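Putting it all together, here is a sketch of the full loop under some assumptions of ours: means are initialized by uniform spacing over the data range and sigmas at 1 (two of the options from the initialization slide), the natural log replaces log₂ (it differs only by a constant factor, so termination behaves the same), and a small floor on σ_i guards against numerical collapse:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1D normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_fit(xs, k, threshold=1e-6, max_iters=500):
    """Fit a k-component 1D Gaussian mixture with EM; stop when the
    log-likelihood gain drops below threshold."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    mus = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]   # uniformly spaced
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    last_ll = -math.inf
    for _ in range(max_iters):
        # E-step: responsibilities p[i][j] = P(N_i | x_j); also accumulate
        # the log likelihood sum_j log M(x_j).
        p = [[0.0] * n for _ in range(k)]
        ll = 0.0
        for j, x in enumerate(xs):
            dens = [weights[i] * gaussian_pdf(x, mus[i], sigmas[i]) for i in range(k)]
            total = sum(dens)
            ll += math.log(total)
            for i in range(k):
                p[i][j] = dens[i] / total
        # M-step: weighted updates of mu_i, sigma_i, w_i.
        for i in range(k):
            s = sum(p[i])
            mus[i] = sum(p[i][j] * xs[j] for j in range(n)) / s
            sigmas[i] = max(math.sqrt(sum(p[i][j] * (xs[j] - mus[i]) ** 2
                                          for j in range(n)) / s), 1e-6)
            weights[i] = s / n
        if ll - last_ll < threshold:
            break
        last_ll = ll
    return weights, mus, sigmas
```

On clearly bimodal data, the two means should land near the two modes, and the weights still sum to 1.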
The EM Algorithm: Limitations

When we fit a single Gaussian to data, we always get the same result. We can also prove that this result is the best possible: there is no other Gaussian giving a higher log likelihood to the data than the one we compute as described in these slides.

When we fit a mixture of Gaussians to the same data, we (sadly) do not always get the same result. The EM algorithm is a greedy algorithm; the result depends on the initialization values. We may have bad luck with the initial values and end up with a bad fit, and there is no good way to know whether our result is good or bad, or whether better results are possible.
Mixtures of Gaussians - Recap

Mixtures of Gaussians are widely used. Why? Because with the right parameters, they can fit various types of data very well. Actually, they can fit almost anything, as long as k is large enough (so that the mixture contains sufficiently many Gaussians). The EM algorithm is widely used to fit mixtures of Gaussians to data.
Multidimensional Gaussians

Instead of assuming that each dimension is independent, we can instead model the distribution using a multidimensional Gaussian:

N(v) = (1 / sqrt((2π)^d |Σ|)) exp( −½ (v − μ)ᵀ Σ⁻¹ (v − μ) )

To specify this Gaussian, we need to estimate the mean μ and the covariance matrix Σ.
Multidimensional Gaussians - Mean

Let x_1, x_2, ..., x_n be d-dimensional vectors, x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}), where each x_{i,j} is a real number. Then, the mean μ = (μ_1, ..., μ_d) is computed as:

μ = (1/n) Σ_{i=1}^{n} x_i

Therefore, μ_j = (1/n) Σ_{i=1}^{n} x_{i,j}.
Multidimensional Gaussians - Covariance Matrix

Let x_1, x_2, ..., x_n be d-dimensional vectors, x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}), where each x_{i,j} is a real number. Let Σ be the covariance matrix. Its size is d×d. Let σ_{r,c} be the value of Σ at row r, column c:

σ_{r,c} = (1/n) Σ_{j=1}^{n} (x_{j,r} − μ_r)(x_{j,c} − μ_c)
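The mean and covariance formulas above can be sketched directly (the function name is ours; the 1/n normalization follows the slides):

```python
def fit_multivariate_gaussian(xs):
    """Estimate the mean vector and covariance matrix with the 1/n formulas.
    xs is a list of equal-length lists (d-dimensional points)."""
    n, d = len(xs), len(xs[0])
    mu = [sum(x[j] for x in xs) / n for j in range(d)]
    cov = [[sum((x[r] - mu[r]) * (x[c] - mu[c]) for x in xs) / n
            for c in range(d)] for r in range(d)]
    return mu, cov
```

Note that the resulting matrix is symmetric, cov[r][c] == cov[c][r], as the next slide points out.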
Multidimensional Gaussians - Training

Let N be a d-dimensional Gaussian with mean μ and covariance matrix Σ. How many parameters do we need to specify N?
- The mean μ is defined by d numbers.
- The covariance matrix Σ requires d² numbers σ_{r,c}. Strictly speaking, Σ is symmetric (σ_{r,c} = σ_{c,r}), so we need roughly d²/2 parameters.
The number of parameters is quadratic in d. The number of training data we need for reliable estimation is also quadratic in d.
The Curse of Dimensionality

We will discuss this "curse" in several places in this course. In summary: dealing with high-dimensional data is a pain, and presents challenges that may be surprising to someone used to dealing with one, two, or three dimensions.

A first example is in estimating Gaussian parameters. In one dimension, it is very simple: we estimate two parameters, μ and σ, and estimation can be pretty reliable with a few tens of examples. In d dimensions, we estimate O(d²) parameters, and the number of training data we need is quadratic in the dimension.
The Curse of Dimensionality

For example, suppose we want to train a system to recognize the faces of Michael Jordan and Kobe Bryant. Assume each image is 100×100 pixels, and each pixel has three numbers: r, g, b. Thus, each image is represented by 30,000 numbers. Suppose we model each class as a multidimensional Gaussian. Then, we need to estimate the parameters of a 30,000-dimensional Gaussian. We need roughly 450 million numbers for the covariance matrix, and we would need more than ten billion training images to have a reliable estimate. It is not realistic to expect such a large training set for learning to recognize a single person.
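A quick sanity check of the arithmetic above (a sketch, not part of the slides):

```python
# Each 100x100 RGB image is a point in a d-dimensional space.
d = 100 * 100 * 3            # 30,000 numbers per image

# The symmetric covariance matrix needs roughly d^2 / 2 distinct entries.
cov_entries = d * d // 2     # 450 million, matching the slide
```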
The Curse of Dimensionality

The curse of dimensionality makes it (usually) impossible to estimate probability densities precisely in high-dimensional spaces: the number of training data that is needed is exponential in the number of dimensions. The curse of dimensionality also makes histogram-based probability estimation infeasible in high dimensions, since estimating a histogram still requires a number of training examples that is exponential in the dimension. Estimating a Gaussian requires a number of training examples that is "only" quadratic in the dimension. However, Gaussians may not be accurate fits for the actual distribution; mixtures of Gaussians can often provide significantly better fits.