Lecture 8: Unsupervised Learning & EM Algorithm
Sam Roweis
October 28, 2003

Partially Unobserved Variables

- Certain variables q in our models may be unobserved, either at training time, at test time, or both.
- If they are only occasionally unobserved they are missing data: e.g. undefined inputs, missing class labels, erroneous target values.
- In this case, we define a new cost function in which we integrate out the missing values at training or test time:

    l(θ; D) = ∑_complete log p(x_c, y_c | θ) + ∑_missing log p(x_m | θ)
            = ∑_complete log p(x_c, y_c | θ) + ∑_missing log ∑_y p(x_m, y | θ)

- Variables which are always unobserved are called latent variables.

Missing Outputs

- You can think of unsupervised learning as supervised learning in which all the outputs are missing:
  Clustering == classification with missing labels.
  Dimensionality reduction == regression with missing targets.
- Density estimation is actually very general and encompasses the two problems above and a whole lot more.
- Let's focus on the idea of unobserved variables...

Latent Variables

- What to do when a variable z is always unobserved? It depends on where it appears in our model.
- If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out.
  e.g. given y, x, fit the model p(z, y | x) = p(z | y) p(y | x, w) p(w).
- But if z is conditioned on, we need to model it:
  e.g. given y, x, fit the model p(y | x) = ∑_z p(y | x, z) p(z).
- Latent variables may appear naturally, from the structure of the problem.
- But we may also want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g. mixtures).
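To make the integrate-out-the-missing-values idea concrete, here is a small sketch (my own illustration, not from the lecture) in Python/numpy: a two-class model with 1-D Gaussian class-conditionals, where a case with an observed label contributes log p(x, y | θ) and a case with a missing label contributes log ∑_y p(x, y | θ). All parameter values and helper names (log_gauss, loglik_complete, loglik_missing) are made up for illustration.

```python
import numpy as np

# Illustrative (made-up) parameters for a two-class model with
# 1-D Gaussian class-conditionals: p(x, y) = p(y) N(x | mu_y, var_y).
priors = np.array([0.3, 0.7])          # p(y)
mus    = np.array([-1.0, 2.0])         # class means
vars_  = np.array([1.0, 0.5])          # class variances

def log_gauss(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

def loglik_complete(x, y):
    # complete case: log p(x, y | theta)
    return np.log(priors[y]) + log_gauss(x, mus[y], vars_[y])

def loglik_missing(x):
    # missing label: log sum_y p(x, y | theta)
    log_joint = np.log(priors) + log_gauss(x, mus, vars_)
    return np.logaddexp.reduce(log_joint)

# total cost = sum over complete cases + sum over incomplete cases
l = loglik_complete(1.8, 1) + loglik_missing(0.2)
print(l)
```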
Why is Learning Harder?

- In fully observed settings, the probability model is a product, so the log likelihood is a sum in which the terms decouple:

    l(θ; D) = ∑_n log p(y_n, x_n | θ)
            = ∑_n log p(x_n | θ_x) + ∑_n log p(y_n | x_n, θ_y)

- With latent variables, the probability already contains a sum, so the log likelihood has all parameters coupled together:

    l(θ; D) = ∑_n log ∑_z p(x_n, z | θ) = ∑_n log ∑_z p(z | θ_z) p(x_n | z, θ_x)

Mixture Models

- The most basic latent variable model, with a single discrete latent node.
- Allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space.
- Divide-and-conquer idea: use simple parts to build complex models (e.g. multimodal densities, or piecewise-linear regressions).

[Figure: a one-dimensional multimodal mixture density p(x) plotted against x.]

Learning with Latent Variables

- The likelihood l(θ) = log ∑_z p(z | θ_z) p(x | z, θ_x) couples the parameters.
- We can treat this as a black-box probability function and just try to optimize the likelihood as a function of θ. We did this many times before by taking gradients.
- However, sometimes taking advantage of the latent variable structure can make parameter estimation easier.
- Good news: today we will see the EM algorithm, which allows us to treat learning with latent variables using fully observed tools.
- Basic trick: guess the values you don't know.
- Basic math: use convexity to lower bound the likelihood.

Mixture Densities

- Exactly like a classification model, but the class is unobserved and so we sum it out. What we get is a perfectly valid density:

    p(x | θ) = ∑_{k=1}^K p(z = k | θ_z) p(x | z = k, θ_k) = ∑_k α_k p_k(x | θ_k)

  where the mixing proportions sum to one: ∑_k α_k = 1.
- We can use Bayes' rule to compute the posterior probability of the mixture component given some data:

    r_k(x) = p(z = k | x, θ) = α_k p_k(x | θ_k) / ∑_j α_j p_j(x | θ_j)

  These quantities are called responsibilities. You've seen them many times before; now you know their names!
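A minimal sketch of the mixture density and responsibility computations above, assuming a 1-D Gaussian mixture with made-up parameters (the helper names gauss_pdf, density, responsibilities are mine, not from the slides):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Illustrative 3-component mixture (parameters made up).
alphas = np.array([0.5, 0.3, 0.2])       # mixing proportions, sum to 1
mus    = np.array([-2.0, 0.0, 3.0])
vars_  = np.array([1.0, 0.5, 2.0])

def density(x):
    # p(x | theta) = sum_k alpha_k p_k(x | theta_k)
    return np.sum(alphas * gauss_pdf(x, mus, vars_))

def responsibilities(x):
    # r_k(x) = alpha_k p_k(x) / sum_j alpha_j p_j(x)
    weighted = alphas * gauss_pdf(x, mus, vars_)
    return weighted / weighted.sum()

print(density(0.5), responsibilities(0.5))  # responsibilities sum to 1
```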
Learning with Mixtures

- We can learn mixture densities using gradient descent on the likelihood as usual. The gradients are quite interesting:

    l(θ) = log p(x | θ) = log ∑_k α_k p_k(x | θ_k)

    ∂l/∂θ_k = (1 / p(x | θ)) α_k ∂p_k(x | θ_k)/∂θ_k
            = (1 / p(x | θ)) α_k p_k(x | θ_k) ∂log p_k(x | θ_k)/∂θ_k
            = (α_k p_k(x | θ_k) / p(x | θ)) ∂l_k/∂θ_k = r_k ∂l_k/∂θ_k

- In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients.

Mixture of Linear Regression Experts

- Each expert generates data according to a linear function of the input plus additive Gaussian noise:

    p(y | x, θ) = ∑_k α_k(x) N(y | β_k⊤x, σ_k²)

- The gating function can be a softmax classification machine:

    α_k(x) = p(z = k | x) = exp(η_k⊤x) / ∑_j exp(η_j⊤x)

- Remember: we are not modeling the density of the inputs x.
- Learning? Gradient descent is one option; you have seen the gradients for this example before. In a minute you will see another one...

Conditional Mixtures: MOEs Revisited

- Mixtures of Experts are also called conditional mixtures.
- Exactly like a class-conditional classification model, except the class is unobserved and so we sum it out:

    p(y | x, θ) = ∑_{k=1}^K p(z = k | x, θ_z) p(y | z = k, x, θ_k) = ∑_k α_k(x | θ_z) p_k(y | x, θ_k)

  where ∑_k α_k(x) = 1 for all x.
- Harder: we must learn α_k(x) (unless it is chosen independent of x). The α_k(x) are exactly what we called the gating function.
- We can still use Bayes' rule to compute the posterior probability of the mixture component given some data:

    p(z = k | x, y, θ) = α_k(x) p_k(y | x, θ_k) / ∑_j α_j(x) p_j(y | x, θ_j)

Clustering Example: Gaussian Mixture Models

- Consider a mixture of K Gaussian components:

    p(x | θ) = ∑_k α_k N(x | µ_k, Σ_k)

    p(z = k | x, θ) = α_k N(x | µ_k, Σ_k) / ∑_j α_j N(x | µ_j, Σ_j)

    l(θ; D) = ∑_n log ∑_k α_k N(x_n | µ_k, Σ_k)

[Figure: a one-dimensional mixture-of-Gaussians density p(x) plotted against x.]

- Density model: p(x | θ) is a familiarity signal.
- Clustering: p(z | x, θ) is the assignment rule, l(θ) is the cost.
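The following sketch evaluates the mixture-of-linear-regression-experts density and its posterior responsibilities, assuming a softmax gate as above. It is illustrative only: the expert weights, noise variances, and helper names (gate, cond_density, posterior) are invented.

```python
import numpy as np

# Illustrative mixture-of-experts: 2-D input x, K = 2 linear-regression experts.
etas  = np.array([[1.0, -0.5], [-1.0, 0.5]])   # gate weights eta_k
betas = np.array([[2.0, 0.0], [0.0, -1.0]])    # expert weights beta_k
sig2  = np.array([0.5, 1.0])                   # expert noise variances

def gate(x):
    # alpha_k(x) = softmax(eta_k . x); shift by the max for numerical safety
    logits = etas @ x
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

def cond_density(y, x):
    # p(y | x, theta) = sum_k alpha_k(x) N(y | beta_k . x, sigma_k^2)
    means = betas @ x
    norm = np.exp(-0.5 * (y - means) ** 2 / sig2) / np.sqrt(2 * np.pi * sig2)
    return np.sum(gate(x) * norm)

def posterior(y, x):
    # Bayes rule: responsibility of expert k for the pair (x, y)
    means = betas @ x
    norm = np.exp(-0.5 * (y - means) ** 2 / sig2) / np.sqrt(2 * np.pi * sig2)
    w = gate(x) * norm
    return w / w.sum()

x = np.array([0.3, 1.2])
print(cond_density(1.0, x), posterior(1.0, x))
```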
Mixture of Gaussians Learning

- We can learn mixtures of Gaussians using gradient descent. For example, the gradients with respect to the means:

    l(θ) = log p(x | θ) = log ∑_k α_k p_k(x | θ_k)

    ∂l/∂θ_k = r_k ∂log p_k(x | θ_k)/∂θ_k

    ∂l/∂µ_k = r_k Σ_k^{-1}(x − µ_k)

- Gradients of the covariance matrices are harder: they require derivatives of log determinants and quadratic forms.
- We must also ensure that the mixing proportions α_k are positive and sum to unity, and that the covariance matrices are positive definite.

Be careful: logsum

- Often you can easily compute b_k = log p(x, z = k | θ), but these values will be very negative, say −10⁶ or smaller.
- Now, to compute l = log p(x | θ) you need to compute log ∑_k e^{b_k}.
- Careful! Do not compute this by doing log(sum(exp(b))): you will get underflow and an incorrect answer. Instead do this:
  Add a constant B to all the values b_k so that the largest value comes close to the maximum exponent allowed by machine precision: B = MAXEXPONENT − log(K) − max_k(b_k).
  Then compute log(sum(exp(b + B))) − B.
- Example: if log p(x, z = 1) = −420 and log p(x, z = 2) = −420, what is log p(x) = log[p(x, z = 1) + p(x, z = 2)]?
  Answer: log[2e^{−420}] = −420 + log 2.

Parameter Constraints

- If we want to use general optimizers (e.g. conjugate gradient) to learn latent variable models, we often have to make sure the parameters respect certain constraints (e.g. ∑_k α_k = 1, Σ_k positive definite).
- A good trick is to reparameterize these quantities in terms of unconstrained values. For the mixing proportions, use the softmax:

    α_k = exp(q_k) / ∑_j exp(q_j)

- For covariance matrices, use the Cholesky decomposition:

    Σ^{-1} = A⊤A,    |Σ|^{-1/2} = ∏_i A_ii

  where A is upper triangular with a positive diagonal:

    A_ii = exp(r_i) > 0,    A_ij = a_ij (j > i),    A_ij = 0 (j < i)

Recap: Learning with Latent Variables

- With latent variables, the probability contains a sum, so the log likelihood has all parameters coupled together:

    l(θ; D) = log ∑_z p(x, z | θ) = log ∑_z p(z | θ_z) p(x | z, θ_x)

  (We can also consider continuous z and replace the sum with an integral.)
- If the latent variables were observed, the parameters would decouple again and learning would be easy:

    l(θ; D) = log p(x, z | θ) = log p(z | θ_z) + log p(x | z, θ_x)

- One idea: ignore this fact, compute ∂l/∂θ, and do learning with a smart optimizer like conjugate gradient.
- Another idea: what if we use our current parameters to guess the values of the latent variables, and then do fully observed learning? This back-and-forth trick might make optimization easier.
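A hedged sketch of the logsum trick in Python/numpy. logsum_slide follows the slide's recipe, using the log of the largest representable double (about 709.78) in place of MAXEXPONENT; logsum_standard is the equivalent, more commonly seen max-shift form. The function names are mine, not from the lecture.

```python
import numpy as np

def logsum_slide(b):
    # The slide's recipe: shift every b_k up by B so the largest value sits
    # near the top of the machine's range, exponentiate, sum, then undo the shift.
    B = np.log(np.finfo(float).max) - np.log(len(b)) - np.max(b)
    return np.log(np.sum(np.exp(b + B))) - B

def logsum_standard(b):
    # Equivalent, more common form: subtract the max, exponentiate, add it back.
    m = np.max(b)
    return m + np.log(np.sum(np.exp(b - m)))

b = np.array([-420.0, -420.0])          # b_k = log p(x, z = k | theta)
print(logsum_slide(b))                  # -420 + log 2  (about -419.307)
print(logsum_standard(b))               # same answer

b_tiny = np.array([-1000.0, -1000.0])
print(np.log(np.sum(np.exp(b_tiny))))   # naive version: exp underflows to 0, log gives -inf
print(logsum_slide(b_tiny))             # -1000 + log 2, as it should be
```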
Expectation-Maximization (EM) Algorithm

- An iterative algorithm with two linked steps:
  E-step: fill in values of ẑ_t using p(z | x, θ_t).
  M-step: update the parameters using θ_{t+1} ← argmax_θ l(θ; x, ẑ_t).
- The E-step involves inference, which we need to do at runtime anyway. The M-step is no harder than in the fully observed case.
- We will prove that this procedure monotonically improves l (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood (as any optimizer should).
- Note: EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data. EM is not a cost function such as maximum likelihood. EM is not a model such as mixture-of-Gaussians.

Complete & Incomplete Log Likelihoods

- Observed variables x, latent variables z, parameters θ:

    l_c(θ; x, z) = log p(x, z | θ)

  is the complete log likelihood.
- Usually optimizing l_c(θ) given both z and x is straightforward (e.g. class-conditional Gaussian fitting, linear regression).
- With z unobserved, we need the log of a marginal probability:

    l(θ; x) = log p(x | θ) = log ∑_z p(x, z | θ)

  which is the incomplete log likelihood.

Expected Complete Log Likelihood

- For any distribution q(z), define the expected complete log likelihood:

    l_q(θ; x) = ⟨l_c(θ; x, z)⟩_q = ∑_z q(z | x) log p(x, z | θ)

- Amazing fact: l(θ) ≥ l_q(θ) + H(q), because of the concavity of log:

    l(θ; x) = log p(x | θ)
            = log ∑_z p(x, z | θ)
            = log ∑_z q(z | x) [p(x, z | θ) / q(z | x)]
            ≥ ∑_z q(z | x) log [p(x, z | θ) / q(z | x)]

  where the inequality is Jensen's inequality. (It is only true for distributions: ∑_z q(z) = 1, q(z) > 0.)

Lower Bounds and Free Energy

- For fixed data x, define a functional called the free energy:

    F(q, θ) ≡ ∑_z q(z | x) log [p(x, z | θ) / q(z | x)] ≤ l(θ)

- The EM algorithm is coordinate ascent on F:
  E-step: q_{t+1} = argmax_q F(q, θ_t)
  M-step: θ_{t+1} = argmax_θ F(q_{t+1}, θ)
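A quick numerical sanity check of the free-energy lower bound F(q, θ) ≤ l(θ) on a made-up two-component Gaussian mixture (a sketch only; the parameter values and helper names are invented):

```python
import numpy as np

# Made-up two-component 1-D Gaussian mixture.
alphas = np.array([0.4, 0.6])
mus    = np.array([-1.0, 2.0])
vars_  = np.array([1.0, 0.5])

def joint(x):
    # p(x, z = k | theta) for k = 0, 1
    return alphas * np.exp(-0.5 * (x - mus) ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)

def free_energy(q, x):
    # F(q, theta) = sum_z q(z) log [ p(x, z | theta) / q(z) ]
    return np.sum(q * (np.log(joint(x)) - np.log(q)))

x = 0.7
loglik = np.log(joint(x).sum())          # l(theta; x) = log p(x | theta)

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0])        # an arbitrary distribution over z
    assert free_energy(q, x) <= loglik + 1e-12   # Jensen: the bound always holds
print("l(theta):", loglik, "  F at q=[0.5, 0.5]:", free_energy(np.array([0.5, 0.5]), x))
```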
M-step: maximization of expected l_c

- Note that the free energy breaks into two terms:

    F(q, θ) = ∑_z q(z | x) log [p(x, z | θ) / q(z | x)]
            = ∑_z q(z | x) log p(x, z | θ) − ∑_z q(z | x) log q(z | x)
            = l_q(θ; x) + H(q)

  (This is where its name comes from.)
- The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy.
- Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:

    θ_{t+1} = argmax_θ l_q(θ; x) = argmax_θ ∑_z q(z | x) log p(x, z | θ)

EM Constructs Sequential Convex Lower Bounds

- Consider the likelihood function and the function F(q_{t+1}, ·).

[Figure: the likelihood l(θ) and the lower bound F(θ, q_{t+1}) plotted against θ; the bound touches the likelihood at θ_t.]

E-step: inferring latent posterior

- Claim: the optimal setting of q in the E-step is

    q_{t+1} = p(z | x, θ_t)

- This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification).
- Proof (easy): this setting saturates the bound l(θ; x) ≥ F(q, θ):

    F(p(z | x, θ_t), θ_t) = ∑_z p(z | x, θ_t) log [p(x, z | θ_t) / p(z | x, θ_t)]
                          = ∑_z p(z | x, θ_t) log p(x | θ_t)
                          = log p(x | θ_t) ∑_z p(z | x, θ_t)
                          = l(θ_t; x) · 1

- We can also show this result using variational calculus, or using the fact that l(θ) − F(q, θ) = KL[q ‖ p(z | x, θ)].

Partially Hidden Data

- Of course, we can also learn when there are missing (hidden) variables on some cases and not on others.
- In this case the cost function was:

    l(θ; D) = ∑_complete log p(x_c, y_c | θ) + ∑_missing log ∑_y p(x_m, y | θ)

- Now you can think of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only.
- The M-step then optimizes the log likelihood on the complete data plus the expected log likelihood on the incomplete data, using the E-step distribution.
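Continuing the toy mixture example, this sketch checks the two facts above numerically: F(q, θ) decomposes into energy plus entropy, and choosing q(z) = p(z | x, θ) saturates the bound so that F equals the incomplete log likelihood. The mixture parameters are again made up, repeated here so the snippet runs on its own.

```python
import numpy as np

# Same made-up two-component mixture as before (repeated so this runs alone).
alphas = np.array([0.4, 0.6])
mus    = np.array([-1.0, 2.0])
vars_  = np.array([1.0, 0.5])

def joint(x):
    # p(x, z = k | theta) for k = 0, 1
    return alphas * np.exp(-0.5 * (x - mus) ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)

x = 0.7
p_xz = joint(x)
loglik = np.log(p_xz.sum())              # incomplete log likelihood l(theta; x)

q = np.array([0.9, 0.1])                 # an arbitrary q(z)
energy  = np.sum(q * np.log(p_xz))       # expected complete log likelihood <l_c>_q
entropy = -np.sum(q * np.log(q))         # H(q)
print(energy + entropy, "<=", loglik)    # F(q, theta) = energy + entropy <= l(theta)

post = p_xz / p_xz.sum()                 # E-step choice: q(z) = p(z | x, theta)
F_post = np.sum(post * np.log(p_xz / post))
print(F_post, "==", loglik)              # the bound is saturated at the posterior
```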
Recap: EM Algorithm

- A way of maximizing the likelihood function for latent variable models.
- Finds ML parameters when the original (hard) problem can be broken up into two (easy) pieces:
  1. Estimate some missing or unobserved data from the observed data and current parameters.
  2. Using this completed data, find the maximum likelihood parameter estimates.
- Alternate between filling in the latent variables using our best guess (the posterior) and updating the parameters based on this guess:
  E-step: q_{t+1} = p(z | x, θ_t)
  M-step: θ_{t+1} = argmax_θ ∑_z q(z | x) log p(x, z | θ)
- In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.

EM for MOG

[Figure: snapshots of a mixture-of-Gaussians fit over successive EM iterations, panels (a)-(i), at L = 1, 4, 6, 8, 10, 12.]

Example: Mixtures of Gaussians

- Recall: a mixture of K Gaussians:

    p(x | θ) = ∑_k α_k N(x | µ_k, Σ_k)
    l(θ; D) = ∑_n log ∑_k α_k N(x_n | µ_k, Σ_k)

- Learning with the EM algorithm:

    E-step:  p_{kn}^t = N(x_n | µ_k^t, Σ_k^t)
             q_{kn}^{t+1} = p(z = k | x_n, θ^t) = α_k^t p_{kn}^t / ∑_j α_j^t p_{jn}^t

    M-step:  µ_k^{t+1} = ∑_n q_{kn}^{t+1} x_n / ∑_n q_{kn}^{t+1}
             Σ_k^{t+1} = ∑_n q_{kn}^{t+1} (x_n − µ_k^{t+1})(x_n − µ_k^{t+1})⊤ / ∑_n q_{kn}^{t+1}
             α_k^{t+1} = (1/M) ∑_n q_{kn}^{t+1}

Derivation of M-step

- The expected complete log likelihood l_q(θ; D) is:

    ∑_n ∑_k q_{kn} [ log α_k − ½ (x_n − µ_k)⊤ Σ_k^{-1} (x_n − µ_k) − ½ log |2πΣ_k| ]

- For fixed q we can optimize the parameters:

    ∂l_q/∂µ_k = ∑_n q_{kn} Σ_k^{-1} (x_n − µ_k)
    ∂l_q/∂Σ_k^{-1} = ½ ∑_n q_{kn} [ Σ_k − (x_n − µ_k)(x_n − µ_k)⊤ ]
    ∂l_q/∂α_k = (1/α_k) ∑_n q_{kn} − λ    (λ = M)

- Facts used: ∂ log |A^{-1}| / ∂A^{-1} = A (for symmetric A) and ∂(x⊤Ax)/∂A = xx⊤.
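A compact sketch of the full EM loop for a mixture of Gaussians, following the E-step and M-step equations above. This is an illustrative implementation, not the lecture's own code: the function names (gauss_pdf, em_mog), the initialization scheme, the small ridge added to the covariances for numerical stability, and the synthetic data are all my choices.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    # N(x | mu, Sigma) evaluated for every row of X (D-dimensional)
    D = X.shape[1]
    diff = X - mu
    solved = np.linalg.solve(Sigma, diff.T).T
    expo = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt(((2 * np.pi) ** D) * np.linalg.det(Sigma))
    return np.exp(expo) / norm

def em_mog(X, K, iters=50, seed=0):
    M, D = X.shape
    rng = np.random.default_rng(seed)
    alphas = np.full(K, 1.0 / K)
    mus = X[rng.choice(M, K, replace=False)]        # initialize means at data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: responsibilities q_{kn} = p(z = k | x_n, theta)
        p = np.stack([alphas[k] * gauss_pdf(X, mus[k], Sigmas[k]) for k in range(K)])
        q = p / p.sum(axis=0)                        # shape (K, M)
        # M-step: responsibility-weighted ML estimates
        Nk = q.sum(axis=1)                           # effective counts per component
        mus = (q @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (q[k, None, :] * diff.T) @ diff / Nk[k] + 1e-6 * np.eye(D)
        alphas = Nk / M
    return alphas, mus, Sigmas

# Synthetic 2-D data from two clusters (for illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, 0], 0.5, size=(100, 2)),
               rng.normal([2, 1], 1.0, size=(150, 2))])
print(em_mog(X, K=2)[1])   # learned means should land near (-2, 0) and (2, 1)
```

The small 1e-6 ridge on the covariances is not part of the ML updates; it is only there to keep the matrices positive definite when a component claims very few points.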
Compare: K-Means

- The EM algorithm for mixtures of Gaussians is just like a soft version of the K-means algorithm with fixed priors and covariances.
- Instead of the hard assignment of K-means, the E-step makes a soft assignment based on the softmax of the squared distances from each point to each cluster.
- Each centre is then moved to the weighted mean of the data, with weights given by the soft assignments. In K-means, the weights are 0 or 1.

    E-step:  d_{kn}^t = ½ (x_n − µ_k^t)⊤ Σ^{-1} (x_n − µ_k^t)
             q_{kn}^{t+1} = exp(−d_{kn}^t) / ∑_j exp(−d_{jn}^t) = p(c_n^t = k | x_n, µ^t)

    M-step:  µ_k^{t+1} = ∑_n q_{kn}^{t+1} x_n / ∑_n q_{kn}^{t+1}

Variants

- Sparse EM: do not recompute the exact posterior probability of every data point under all models, because it is almost zero for most of them. Instead keep an "active list" which you update every once in a while.
- Generalized (Incomplete) EM: it might be hard to find the ML parameters in the M-step, even given the completed data. We can still make progress by doing an M-step that merely improves the likelihood a bit (e.g. a gradient step).

A Report Card for EM

- Some good things about EM:
  no learning rate parameter
  very fast for low dimensions
  each iteration guaranteed to improve the likelihood
  adapts unused units rapidly
- Some bad things about EM:
  can get stuck at local optima
  both steps require considering all explanations of the data, which can be an exponential amount of work in the number of latent variables
- EM is typically used with mixture models, for example mixtures of Gaussians or mixtures of experts. The missing data are the labels indicating which sub-model generated each datapoint.
- Very common: also used to train HMMs, Boltzmann machines, ...
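A small sketch contrasting the soft E-step above with K-means' hard assignment, assuming a shared spherical covariance Σ = var·I; the helper names and the toy data are invented for illustration.

```python
import numpy as np

def soft_assign(X, mus, var=1.0):
    # d_{kn} = 0.5 ||x_n - mu_k||^2 / var ;  q_{kn} = softmax over k of -d_{kn}
    d = 0.5 * ((X[None, :, :] - mus[:, None, :]) ** 2).sum(-1) / var
    d -= d.min(axis=0)                     # stabilize the softmax per data point
    q = np.exp(-d)
    return q / q.sum(axis=0)

def hard_assign(X, mus):
    # K-means: weight 1 for the closest centre, 0 for all others
    d = ((X[None, :, :] - mus[:, None, :]) ** 2).sum(-1)
    q = np.zeros_like(d)
    q[d.argmin(axis=0), np.arange(X.shape[0])] = 1.0
    return q

def update_means(X, q):
    # both algorithms move each centre to the (weighted) mean of its data
    return (q @ X) / q.sum(axis=1)[:, None]

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.9, 3.2]])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
print(update_means(X, soft_assign(X, mus)))   # soft EM-style update
print(update_means(X, hard_assign(X, mus)))   # hard K-means update
```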