Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield, Texas A&M University


1 Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University

2 Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters, P(x | θ). Likelihood: estimate unknown parameters based on known outcomes, L(θ | x) = p(x | θ). Coin-flip example: θ is the probability of heads (the parameter); x = HHHTTH is the outcome from 6 flips.

3 Likelihood for Coin-flip Example Probability of outcome given parameter: p(x = HHHTTH | θ = 0.5) = 0.5^6 = 0.015625. Likelihood of parameter given outcome: L(θ = 0.5 | x = HHHTTH) = p(x | θ) = 0.015625. For general θ: L(θ | HHHTTH) = θ^4 (1 − θ)^2. The likelihood is maximal when θ = 2/3. The likelihood function is not a probability density.

4 Coin Flip MLE details L(θ | HHHTTH) = θ^4 (1 − θ)^2. Differentiate log L w.r.t. θ: 4/θ − 2/(1 − θ) = 0, so θ = 2/3. This is the maximizer since log L is concave. Intuitive result: the MLE of the heads probability θ is the fraction of H in the sample.
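
As a quick numerical sanity check of the coin-flip MLE, the sketch below (plain Python; the grid resolution is an arbitrary choice, not part of the slides) evaluates the likelihood θ^4 (1 − θ)^2 on a grid and compares its maximizer with 2/3.

```python
# Numerical check: L(theta | HHHTTH) = theta^4 * (1 - theta)^2
# is maximized at theta = 2/3 (4 heads out of 6 flips).

def coin_likelihood(theta, heads=4, tails=2):
    return theta**heads * (1 - theta)**tails

# Evaluate on a fine grid of candidate parameters in (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
theta_hat = max(grid, key=coin_likelihood)

print(theta_hat)             # ~0.6667, i.e. 2/3
print(coin_likelihood(0.5))  # 0.015625, matching the slide
```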

5 Likelihood for Cont. Distributions Six samples {−3, −2, −1, 1, 2, 3} believed to be drawn from some Gaussian N(0, σ^2). Likelihood of σ: L(σ | {−3, −2, −1, 1, 2, 3}) = p(x = −3 | σ) p(x = −2 | σ) ⋯ p(x = 3 | σ). Maximum likelihood: σ = sqrt[((−3)^2 + (−2)^2 + (−1)^2 + 1^2 + 2^2 + 3^2)/6] ≈ 2.16.

6 Likelihood for Cont. Distributions Six samples {−3, −2, −1, 1, 2, 3} believed to be drawn from some Gaussian N(0, σ^2). Likelihood of σ: L(σ | {−3, −2, −1, 1, 2, 3}) = p(x = −3 | σ) p(x = −2 | σ) ⋯ p(x = 3 | σ). Maximum likelihood: σ = sqrt[((−3)^2 + (−2)^2 + (−1)^2 + 1^2 + 2^2 + 3^2)/6] ≈ 2.16. Intuitive: the MLE σ^2 is the sample variance.
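
A small check of the numbers on this slide (plain Python; the samples and the N(0, σ^2) model are exactly those stated above):

```python
import math

# MLE of sigma for samples assumed drawn from N(0, sigma^2):
# sigma^2 = (1/n) * sum of squared samples (the mean is fixed at 0 by the model).
samples = [-3, -2, -1, 1, 2, 3]
sigma2_hat = sum(x**2 for x in samples) / len(samples)
sigma_hat = math.sqrt(sigma2_hat)

print(sigma2_hat)  # 4.666...
print(sigma_hat)   # ~2.16, matching the slide
```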

7 Maximum Likelihood Estimate Parameterized family of distributions of some r.v. X: P[X | θ] for θ in some parameter set. Likelihood: L(θ, X) = P[X | θ]. MLE = argmax_θ L(θ, X). Generally it can't be computed explicitly. For a single point, f(x_j) = Σ_{i=1}^k f(x_j | μ_i, Σ_i) P(C_i), and P[X | θ] = Π_j f(x_j). Log-likelihood: log P(X | θ) = Σ_{j=1}^n log f(x_j) = Σ_{j=1}^n log Σ_{i=1}^k f(x_j | μ_i, Σ_i) P(C_i). Find the maximum by differentiation? Difficult, due to the sum inside the logarithms.
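
To make the "sum inside the logarithm" concrete, here is a minimal sketch of the observed-data log-likelihood for a 1-D mixture (NumPy for convenience; the two-component data and parameter values are made up for illustration and are not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density f(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_log_likelihood(x, mus, sigmas, weights):
    """log P(X | theta) = sum_j log sum_i f(x_j | mu_i, sigma_i) P(C_i)."""
    x = np.asarray(x, dtype=float)
    # density of each point under each component, weighted by P(C_i): shape (k, n)
    per_component = np.array([w * gaussian_pdf(x, m, s)
                              for m, s, w in zip(mus, sigmas, weights)])
    return float(np.sum(np.log(per_component.sum(axis=0))))

# Hypothetical two-component example (parameters chosen for illustration only).
data = [-2.1, -1.8, 0.2, 1.9, 2.3]
print(mixture_log_likelihood(data, mus=[-2.0, 2.0], sigmas=[1.0, 1.0],
                             weights=[0.5, 0.5]))
```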

8 Latent data Observations X = {x_1, x_2, ..., x_n}. Suppose there is latent data Y, unobserved, that explains the observations. E.g. when clustering with a mixture of k Gaussians, the latent variables are Y = {y_1, y_2, ..., y_n}, where each y_i in {1, ..., k} tells which mixture component x_i actually followed. Parameters θ = (P(C_i), μ_i, Σ_i) describe the joint distribution P_θ(X, Y) of X, Y. Y part: P(C_i) = probability of being in component i. Distribution of X given Y: μ_i, Σ_i parametrize the Gaussian distribution of x in component i. Problem: we don't know the y_i. Call (X, Y) the complete data, in contrast with the observed data X. Complete data likelihood: L(θ | X, Y) = P_θ(X, Y).

9 Expectation-Maximization Let E_θ denote expectation under parameter θ. Expectation step: compute the expected value of the complete data log likelihood, conditioned on the observed data X: Q(θ′, θ) = E_θ[log L(θ′ | X, Y) | X]. Maximization step: given θ, find the parameters f(θ) = argmax_{θ′} Q(θ′, θ).

10 EM Procedure: Initialize θ(0). Iterate the E and M steps to get a sequence of iterates θ(n+1) = f(θ(n)). Iterate until some stopping criterion is met, e.g., small successive differences: |θ(n+1) − θ(n)| < ε. Declare victory and hope the result is the MLE of the original problem, argmax_θ L(θ, X).
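
The procedure can be written as a short generic loop. This is only a sketch: e_step and m_step are hypothetical placeholders for whatever computations the model at hand requires, and the tolerance eps and iteration cap are arbitrary.

```python
import numpy as np

def em(theta0, e_step, m_step, eps=1e-6, max_iter=1000):
    """Generic EM loop: theta(n+1) = f(theta(n)) until successive iterates
    differ by less than eps (or max_iter is reached)."""
    theta = np.asarray(theta0, dtype=float)
    for n in range(max_iter):
        expectations = e_step(theta)      # E-step: expectations needed for Q(., theta)
        theta_new = m_step(expectations)  # M-step: theta* = argmax Q(., theta)
        if np.max(np.abs(theta_new - theta)) < eps:
            return theta_new, n + 1
        theta = theta_new
    return theta, max_iter
```

With a concrete E-step and M-step, such as the Gaussian-mixture updates derived later in the deck, this loop is the whole algorithm.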

11 Three questions What is the unobserved data Y? How does it relate to any given problem, and how do I know its distribution? How is this related to the MLE problem? The function Q(θ′, θ) = E_θ[log L(θ′ | X, Y) | X] and the new parameters f(θ) = argmax_{θ′} Q(θ′, θ) find the parameters θ′ that maximize the expected log likelihood, which seems to have something to do with MLE. How do we find the maximizer to compute the iterates f(θ)?

12 Why Expectation Maximization? P_θ′(X, Y) = P_θ′(X) P_θ′(Y | X), so log P_θ′(X) = log P_θ′(X, Y) − log P_θ′(Y | X), i.e. log L(θ′ | X) = log L(θ′ | X, Y) − log P_θ′(Y | X). Take the expectation E_θ conditional on X: log L(θ′ | X) = Q(θ′, θ) + H(θ′, θ), where H(θ′, θ) = −E_θ[log P_θ′(Y | X) | X].

13 Why Expectation Maximization? Recap: log L(θ′ | X) = Q(θ′, θ) + H(θ′, θ). Suppose θ* = argmax_{θ′} Q(θ′, θ). Then log L(θ* | X) − log L(θ | X) = Q(θ*, θ) − Q(θ, θ) + (H(θ*, θ) − H(θ, θ)) ≥ H(θ*, θ) − H(θ, θ), since Q(θ*, θ) ≥ Q(θ, θ) by the choice of θ*. By Gibbs' inequality (see later), H(θ*, θ) − H(θ, θ) ≥ 0. Conclude that log L(θ* | X) ≥ log L(θ | X): the observed data likelihood is at least as high for θ* as for θ.

14 Expectation-Maximization: Monotonicity Map θ → f(θ) = θ*, giving the sequence θ(n+1) = f(θ(n)). Then L(θ(n) | X) is a non-decreasing function of n. Best case: L(θ(n) | X) increases monotonically to a limit which is the maximum likelihood, and θ(n) converges to the MLE argmax_θ L(θ | X). Caveat: L(θ(n) | X) could converge to a local maximum of the likelihood rather than the global maximum. [Figure: likelihood L(θ(n) | X) plotted against θ, showing a local maximum and the global maximum.]

15 Gibbs' Inequality Theorem: if p and q are two probability distributions with q_i = 0 ⟹ p_i = 0, then D(p, q) = Σ_i p_i log(p_i / q_i) ≥ 0, with equality iff p = q. Application: H(θ′, θ) = −E_θ[log P_θ′(Y | X) | X] = −Σ_y P_θ(Y = y | X) log P_θ′(Y = y | X). Hence H(θ*, θ) − H(θ, θ) = Σ_y P_θ(Y = y | X) log P_θ(Y = y | X) − Σ_y P_θ(Y = y | X) log P_θ*(Y = y | X) = Σ_y P_θ(Y = y | X) log[P_θ(Y = y | X) / P_θ*(Y = y | X)] = D(P_θ(Y | X), P_θ*(Y | X)) ≥ 0 by Gibbs.

16 Proof of Gibbs' Inequality Want to show D(p, q) = Σ_i p_i log(p_i / q_i) ≥ 0. g(x) = log(x) is a concave function; its derivative 1/x is decreasing, so log x ≤ log 1 + g′(1)(x − 1) = x − 1, with equality only if x = 1. Then −log x = log(1/x) ≤ 1/x − 1, so log x ≥ 1 − 1/x, and in particular log(p/q) ≥ 1 − q/p. Therefore Σ_i p_i log(p_i / q_i) ≥ Σ_i p_i (1 − q_i/p_i) = Σ_i p_i − Σ_i q_i = 1 − 1 = 0, with equality only if p_i = q_i for all i.
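
A quick numerical illustration of the inequality (plain Python; the two distributions are arbitrary examples):

```python
import math

def kl(p, q):
    """D(p, q) = sum_i p_i * log(p_i / q_i); requires q_i = 0 => p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(kl(p, q))  # positive, since p != q
print(kl(p, p))  # 0.0, equality iff p = q
```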

17 When does EM converge to the MLE? There are abstract sufficient conditions, but they can be difficult to check. Theorem: if L(θ, X) has a unique maximum and ∂Q(θ′, θ)/∂θ′ is continuous in θ′ and θ, then the EM sequence converges to the MLE.

18 Gaussian Mixture models Observed data log-likelihood: log P(X | θ) = Σ_{j=1}^n log f(x_j) = Σ_{j=1}^n log Σ_{i=1}^k f(x_j | μ_i, Σ_i) P(C_i). Complete data: each point x_j comes with a vector c_j = (c_j1, ..., c_jk) indicating which component x_j lies in: c_ji = 1 if x_j is in component i and zero otherwise. Y = {c_j} is the latent or unobserved data. E[c_ji] = P[C_i | x_j] = w_ij.

19 Full Data Likelihood Full data likelihood: suppose we knew the c_ji as data, i.e. which component i each point x_j belongs to. For each point x_j, f(x_j, c_j) = Π_{i=1}^k (f(x_j | μ_i, Σ_i) P(C_i))^{c_ji}. Since c_ji = 1 for exactly one i, the factors for all other i are 1. Then P(X, Y | θ) = Π_{j=1}^n f(x_j, c_j) = Π_{j=1}^n Π_{i=1}^k (f(x_j | μ_i, Σ_i) P(C_i))^{c_ji}, so log P(X, Y | θ) = Σ_{j=1}^n Σ_{i=1}^k c_ji log{f(x_j | μ_i, Σ_i) P(C_i)}.

20 Computation of Q Much nicer than the observed data likelihood: no sum inside the log. Q(θ′, θ) = E_θ[log L(θ′ | X, Y) | X] = E_θ[Σ_{j=1}^n Σ_{i=1}^k c_ji {log f(x_j | μ_i, Σ_i) + log P(C_i)} | X], where μ_i, Σ_i, P(C_i) are the components of θ′. We condition on X, so the x_j are just constants, and E_θ[c_ji] = P[C_i | x_j] = w_ij, computed under the current θ. Hence Q(θ′, θ) = Σ_{j=1}^n Σ_{i=1}^k w_ij {log f(x_j | μ_i, Σ_i) + log P(C_i)}.
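
Computing the weights w_ij = P[C_i | x_j] is the E-step: by Bayes' rule, w_ij = P(C_i) f(x_j | μ_i, σ_i) / Σ_l P(C_l) f(x_j | μ_l, σ_l) under the current parameters. A minimal 1-D sketch (NumPy; function and variable names are illustrative, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, mus, sigmas, weights):
    """Responsibilities w[i, j] = P[C_i | x_j] under the current parameters:
    w_ij = P(C_i) f(x_j | mu_i, sigma_i) / sum_l P(C_l) f(x_j | mu_l, sigma_l)."""
    x = np.asarray(x, dtype=float)
    joint = np.array([w * gaussian_pdf(x, m, s)
                      for m, s, w in zip(mus, sigmas, weights)])  # shape (k, n)
    return joint / joint.sum(axis=0, keepdims=True)
```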

21 Computation of maximizer (1-dim) Recap: Q(θ′, θ) = Σ_{i=1}^k Σ_{j=1}^n w_ij {log f(x_j | μ_i, Σ_i) + log P(C_i)}. In one dimension, log f(x_j | μ_i, σ_i) = −(x_j − μ_i)^2/(2σ_i^2) − log σ_i + const. Differentiate w.r.t. μ_i: Σ_{j=1}^n w_ij (x_j − μ_i) = 0, which occurs when μ_i = μ_i* = Σ_{j=1}^n w_ij x_j / Σ_{j=1}^n w_ij. Differentiate w.r.t. σ_i: Σ_{j=1}^n w_ij {(x_j − μ_i)^2/σ_i^3 − 1/σ_i} = 0, which occurs when σ_i^2 = (σ_i*)^2 = Σ_{j=1}^n w_ij (x_j − μ_i*)^2 / Σ_{j=1}^n w_ij.
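
The two closed-form maximizers are just responsibility-weighted means and variances. A minimal sketch (1-D, NumPy; w is assumed to be the (k, n) matrix of weights w_ij from the E-step sketch above):

```python
import numpy as np

def m_step_mean_var(x, w):
    """mu_i* = sum_j w_ij x_j / sum_j w_ij
       (sigma_i*)^2 = sum_j w_ij (x_j - mu_i*)^2 / sum_j w_ij"""
    x = np.asarray(x, dtype=float)
    totals = w.sum(axis=1)                   # sum_j w_ij, one value per component i
    mus = (w @ x) / totals
    sigmas2 = (w * (x[None, :] - mus[:, None]) ** 2).sum(axis=1) / totals
    return mus, np.sqrt(sigmas2)
```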

22 Computation of maximizer (1-dim) Recap: Q(θ′, θ) = Σ_{i=1}^k Σ_{j=1}^n w_ij {log f(x_j | μ_i, Σ_i) + log P(C_i)}. Differentiate w.r.t. P(C_i) subject to the constraint Σ_{i=1}^k P(C_i) = 1: Σ_{j=1}^n w_ij / P(C_i) = constant independent of i, which occurs when P(C_i) = P*(C_i) = Σ_{j=1}^n w_ij / n. Maximizer: we have recovered the stated iteration of the parameters, μ_i → μ_i*, Σ_i → Σ_i*, P(C_i) → P*(C_i).
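
Putting the E-step and the three M-step updates together gives the full EM iteration for a 1-D Gaussian mixture. The sketch below is self-contained but illustrative only: the data, initial parameters, and iteration count are made up, and the observed-data log-likelihood is tracked so the monotonicity of slide 14 can be checked empirically.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, mus, sigmas, pis, n_iter=50):
    """EM for a k-component 1-D Gaussian mixture; mus, sigmas, pis are the
    initial mu_i, sigma_i, P(C_i). Returns fitted parameters and the
    per-iteration observed-data log-likelihood."""
    x = np.asarray(x, dtype=float)
    mus, sigmas, pis = (np.asarray(a, dtype=float) for a in (mus, sigmas, pis))
    n, log_liks = len(x), []
    for _ in range(n_iter):
        # E-step: w[i, j] = P[C_i | x_j] under the current parameters
        joint = np.array([p * gaussian_pdf(x, m, s)
                          for m, s, p in zip(mus, sigmas, pis)])      # (k, n)
        log_liks.append(float(np.sum(np.log(joint.sum(axis=0)))))
        w = joint / joint.sum(axis=0, keepdims=True)
        # M-step: mu_i*, sigma_i*, P*(C_i) as derived on slides 21-22
        totals = w.sum(axis=1)
        mus = (w @ x) / totals
        sigmas = np.sqrt((w * (x[None, :] - mus[:, None]) ** 2).sum(axis=1) / totals)
        pis = totals / n
    return mus, sigmas, pis, log_liks

# Illustrative run on made-up data clustered near -2 and +2.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 1.0, 100)])
mus, sigmas, pis, log_liks = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(mus, sigmas, pis)
print(all(b >= a - 1e-9 for a, b in zip(log_liks, log_liks[1:])))  # monotone: True
```

Consistent with slide 14, the tracked log-likelihood is non-decreasing across iterations, though the run may still settle at a local maximum depending on the initialization.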
