Expectation Maximization

Expectation Maximization
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
1 / 31

Outline
- Mathematical preliminaries: Jensen's inequality and Gibbs' inequality; entropy and mutual information
- Expectation maximization
- More on EM: generalized EM and incremental EM; EM for the exponential family
2 / 31

Convex Sets and Functions

Definition (Convex Set). Let $C$ be a subset of $\mathbb{R}^m$. $C$ is called a convex set if
$$\alpha x + (1-\alpha)y \in C, \quad \forall x, y \in C, \ \forall \alpha \in [0, 1].$$

Definition (Convex Function). Let $C$ be a convex subset of $\mathbb{R}^m$. A function $f: C \to \mathbb{R}$ is called a convex function if
$$f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y), \quad \forall x, y \in C, \ \forall \alpha \in [0, 1].$$

3 / 31

Jensen's Inequality

Theorem (Jensen's Inequality). If $f(x)$ is a convex function and $x$ is a random vector, then
$$E\{f(x)\} \ge f(E\{x\}).$$

Note: Jensen's inequality also holds for a concave function, with the direction of the inequality reversed.

4 / 31
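As a quick numerical sanity check (not part of the original slides), the following sketch draws samples of a random variable and compares $E\{f(x)\}$ with $f(E\{x\})$ for the convex function $f(x) = x^2$, assuming NumPy:

```python
import numpy as np

# Numerical check of Jensen's inequality for the convex function f(x) = x^2.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # an arbitrary random variable

f = lambda v: v ** 2
lhs = f(x).mean()    # Monte Carlo estimate of E{f(x)}
rhs = f(x.mean())    # f applied to the sample mean, i.e. f(E{x})
print(lhs, rhs, lhs >= rhs)   # E{f(x)} >= f(E{x}) for convex f
```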

Proof of Jensen's Inequality

We need to show that $\sum_{i=1}^N p_i f(x_i) \ge f\left(\sum_{i=1}^N p_i x_i\right)$. The proof works recursively, starting from the right-hand side and repeatedly applying the definition of convexity:
$$f\left(\sum_{i=1}^N p_i x_i\right) = f\left(p_1 x_1 + \sum_{i=2}^N p_i x_i\right) \le p_1 f(x_1) + \left[\sum_{i=2}^N p_i\right] f\left(\frac{\sum_{i=2}^N p_i x_i}{\sum_{i=2}^N p_i}\right) \qquad \left(\text{choose } \alpha = \frac{p_1}{\sum_{i=1}^N p_i}\right)$$
$$\le p_1 f(x_1) + \left[\sum_{i=2}^N p_i\right]\left\{\alpha f(x_2) + (1-\alpha) f\left(\frac{\sum_{i=3}^N p_i x_i}{\sum_{i=3}^N p_i}\right)\right\} \qquad \left(\text{choose } \alpha = \frac{p_2}{\sum_{i=2}^N p_i}\right)$$
$$= p_1 f(x_1) + p_2 f(x_2) + \left[\sum_{i=3}^N p_i\right] f\left(\frac{\sum_{i=3}^N p_i x_i}{\sum_{i=3}^N p_i}\right),$$
and so forth.

5 / 31

Information Theory

Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression? The entropy H.
- What is the ultimate transmission rate of communication? The channel capacity C.

In the early 1940s it was thought that increasing the transmission rate of information over a communication channel increased the probability of error. Shannon surprised the communication theory community by proving that this is not true as long as the communication rate is below the channel capacity.

Although information theory was developed for communications, it also helps explain ecological theories of sensory processing, and it plays a key role in elucidating the goal of unsupervised learning.

6 / 31

Information and Entropy

Information can be thought of as surprise, uncertainty, or unexpectedness. Mathematically it is defined by
$$I = -\log P(i),$$
where $P(i)$ is the probability that the event labelled $i$ occurs. A rare event carries a large amount of information, while a frequent event carries little.

Entropy is the average information, i.e.,
$$H = -\sum_{i=1}^{N} P(i) \log P(i).$$

7 / 31

Example: Horse Race

Suppose we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are
$$\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{16}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}\right).$$
Suppose that we wish to send a message to another person indicating which horse won the race. How many bits are required to describe this for each of the horses? 3 bits for any of the horses?

8 / 31

No! The win probabilities are not uniform. It makes sense to use shorter descriptions for the more probable horses and longer descriptions for the less probable ones, so that we achieve a lower average description length. For example, we can use the following strings to represent the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. The average description length in this case is 2 bits, as opposed to 3 bits for the uniform code.

We calculate the entropy:
$$H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{16}\log\tfrac{1}{16} - 4\cdot\tfrac{1}{64}\log\tfrac{1}{64} = 2 \text{ bits}.$$

The entropy of a random variable is a lower bound on the average number of bits required to represent the random variable, and also on the average number of questions needed to identify the variable in a game of twenty questions.

9 / 31
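A short numerical check of this example, assuming NumPy (the code lengths below are those of the prefix code listed above):

```python
import numpy as np

# Win probabilities for the eight horses and the lengths of the prefix code
# 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])

entropy = -(p * np.log2(p)).sum()       # H(X) in bits
avg_length = (p * code_lengths).sum()   # expected description length
print(entropy, avg_length)              # both equal 2.0 bits
```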

Entropy and Relative Entropy

Entropy is the average information (a measure of uncertainty), defined by
$$H(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = -E_p\{\log p(x)\}.$$

Relative entropy (Kullback-Leibler divergence) is a measure of the distance between two distributions, defined by
$$KL[p \,\|\, q] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p\left\{\log \frac{p(x)}{q(x)}\right\}.$$

10 / 31
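To make the definitions concrete, here is a small illustrative sketch, assuming NumPy and strictly positive probabilities (zeros are not handled):

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = -sum p log p, in nats."""
    p = np.asarray(p, dtype=float)
    return -(p * np.log(p)).sum()

def kl_divergence(p, q):
    """Relative entropy KL[p || q] between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return (p * np.log(p / q)).sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(entropy(p), kl_divergence(p, q), kl_divergence(q, p))   # note: KL is asymmetric
```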

Mutual Information

Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions,
$$I(x, y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log\left[\frac{p(x, y)}{p(x)p(y)}\right] = KL[p(x, y) \,\|\, p(x)p(y)] = E_{p(x,y)}\left\{\log\left[\frac{p(x, y)}{p(x)p(y)}\right]\right\}.$$

Mutual information can be interpreted as the reduction in the uncertainty of $x$ due to the knowledge of $y$, i.e.,
$$I(x, y) = H(x) - H(x|y),$$
where $H(x|y) = -E_{p(x,y)}\{\log p(x|y)\}$ is the conditional entropy.

11 / 31
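A small illustrative check, assuming NumPy, that the mutual information computed from a joint table equals $H(x) - H(x|y)$:

```python
import numpy as np

# Mutual information from a small joint distribution p(x, y); rows index x.
pxy = np.array([[0.30, 0.10],
                [0.10, 0.50]])
px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)

mi = (pxy * np.log(pxy / (px * py))).sum()       # KL[p(x,y) || p(x)p(y)]

h_x = -(px * np.log(px)).sum()                   # H(x)
h_x_given_y = -(pxy * np.log(pxy / py)).sum()    # H(x|y) = -E{log p(x|y)}
print(mi, h_x - h_x_given_y)                     # the two values agree
```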

Gibbs' Inequality

Theorem. $KL[p \,\|\, q] \ge 0$, with equality iff $p = q$.

Proof: Consider the Kullback-Leibler divergence for discrete distributions:
$$KL[p \,\|\, q] = \sum_i p_i \log\frac{p_i}{q_i} = -\sum_i p_i \log\frac{q_i}{p_i} \ge -\log\left[\sum_i p_i \frac{q_i}{p_i}\right] = -\log\left[\sum_i q_i\right] = 0,$$
where the inequality follows from Jensen's inequality applied to the convex function $-\log$.

12 / 31

More on Gibbs' Inequality

In order to find the distribution $p$ which minimizes $KL[p \,\|\, q]$, we consider the Lagrangian
$$E = KL[p \,\|\, q] + \lambda\left(1 - \sum_i p_i\right) = \sum_i p_i \log\frac{p_i}{q_i} + \lambda\left(1 - \sum_i p_i\right).$$

Compute the partial derivative $\frac{\partial E}{\partial p_k}$ and set it to zero:
$$\frac{\partial E}{\partial p_k} = \log p_k - \log q_k + 1 - \lambda = 0,$$
which leads to $p_k = q_k e^{\lambda - 1}$. It follows from $\sum_i p_i = 1$ that $\sum_i q_i e^{\lambda - 1} = 1$, which leads to $\lambda = 1$. Therefore $p_i = q_i$. The Hessian,
$$\frac{\partial^2 E}{\partial p_i^2} = \frac{1}{p_i}, \qquad \frac{\partial^2 E}{\partial p_i \partial p_j} = 0 \ (i \ne j),$$
is positive definite, which shows that $p_i = q_i$ is a genuine minimum.

13 / 31
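As an illustrative numerical counterpart (not from the slides): for a fixed two-outcome $q$, sweeping $p$ over a grid on the probability simplex shows that $KL[p \,\|\, q]$ attains its minimum, zero, at $p = q$:

```python
import numpy as np

# For a fixed two-outcome q, sweep p over the simplex and locate the minimizer of KL[p || q].
q = np.array([0.3, 0.7])
p1 = np.linspace(0.01, 0.99, 99)
P = np.stack([p1, 1 - p1], axis=1)          # candidate distributions p
kl = (P * np.log(P / q)).sum(axis=1)
print(P[kl.argmin()], kl.min())             # minimizer is (approximately) p = q, with KL = 0
```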

The EM Algorithm

Maximum likelihood parameter estimates: one definition of the best "knob settings," often impossible to find directly.

The EM algorithm finds ML parameter estimates when the original (hard) problem can be broken up into two (easy) pieces, i.e., it finds the ML estimates of the parameters for data in which some variables are unobserved. It is an iterative method consisting of an expectation step (E-step) and a maximization step (M-step):
1. Estimate the missing or unobserved data from the observed data and the current parameters (E-step).
2. Using this completed data, find the maximum likelihood parameter estimates (M-step).

For EM to work, two things have to be easy:
1. Guessing (estimating) the missing data from the data we have and our current guess of the parameters.
2. Solving for the ML parameters directly, given the complete data.

14 / 31

Auxiliary Function

Definition. $Q(\theta, \theta')$ is an auxiliary function for $L(\theta)$ if it satisfies the conditions
$$Q(\theta, \theta') \le L(\theta), \qquad Q(\theta, \theta) = L(\theta).$$

Theorem. If $Q$ is an auxiliary function, then $L$ is nondecreasing under the update
$$\theta^{(k+1)} = \arg\max_{\theta} Q\left(\theta, \theta^{(k)}\right).$$

15 / 31

A Graphical Illustration for an Auxiliary Function

[Figure: the log-likelihood L(θ) and the auxiliary function Q(θ, θ^{(k)}), which touches L(θ) at θ^{(k)}; maximizing Q yields the next estimate θ^{(k+1)}.]

16 / 31

A Lower Bound on the Likelihood

Consider a set of observed (visible) variables $x$, a set of unobserved (hidden) variables $s$, and model parameters $\theta$. Then the log-likelihood is given by
$$L(\theta) = \log p(x|\theta) = \log \int p(x, s|\theta)\, ds.$$

Use Jensen's inequality, for any distribution $q(s)$ over the hidden variables, to obtain
$$L(\theta) = \log \int q(s)\, \frac{p(x, s|\theta)}{q(s)}\, ds \ \ge\ \int q(s) \log\left[\frac{p(x, s|\theta)}{q(s)}\right] ds \ =\ F(q, \theta),$$
a lower bound on the log-likelihood.

17 / 31

The Lower Bound F(q, θ)

Consider the lower bound $F(q, \theta)$,
$$F(q, \theta) = \int q(s) \log\left[\frac{p(x, s|\theta)}{q(s)}\right] ds = \int q(s) \log p(x, s|\theta)\, ds + H(q),$$
where $H(q) = -\int q(s) \log q(s)\, ds$ is the entropy of $q$, and $F(q, \theta)$ corresponds to the variational free energy.

In the EM algorithm, we alternately optimize $F(q, \theta)$ w.r.t. $q$ and $\theta$. It can be shown that this will never decrease $L$.

18 / 31

EM as an Alternating Optimization

E-step: Optimize $F(q, \theta)$ w.r.t. the distribution over the hidden variables, given the parameters, i.e.,
$$q^{(k+1)} = \arg\max_{q} F\left(q, \theta^{(k)}\right).$$

M-step: Maximize $F(q, \theta)$ w.r.t. the parameters, given the hidden distribution, i.e.,
$$\theta^{(k+1)} = \arg\max_{\theta} F\left(q^{(k+1)}, \theta\right) = \arg\max_{\theta} \int q^{(k+1)}(s) \log p(x, s|\theta)\, ds,$$
where $\log p(x, s|\theta)$ is the complete-data log-likelihood.

19 / 31
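A minimal EM sketch for a one-dimensional Gaussian mixture, assuming NumPy; the function name em_gmm_1d and the initialization are illustrative choices, not part of the slides. The E-step sets q(s) to the posterior responsibilities p(s|x, θ^{(k)}), and the M-step re-estimates the mixing weights, means, and variances from them:

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iters=50, seed=0):
    """A minimal EM sketch for a 1-D Gaussian mixture (illustrative, not robust)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize theta = (weights, means, variances).
    w = np.full(n_components, 1.0 / n_components)
    mu = rng.choice(x, size=n_components, replace=False)
    var = np.full(n_components, x.var())
    log_liks = []
    for _ in range(n_iters):
        # E-step: q(s) = p(s | x, theta), the responsibilities.
        log_p = (-0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var + np.log(w))   # shape (n, K)
        log_norm = np.logaddexp.reduce(log_p, axis=1)                # log p(x_i | theta)
        r = np.exp(log_p - log_norm[:, None])                        # responsibilities
        log_liks.append(log_norm.sum())
        # M-step: maximize the expected complete-data log-likelihood.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var, log_liks

# Example usage: a mixture of two Gaussians.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
w, mu, var, log_liks = em_gmm_1d(data)
print(w, mu, var)
```

The log-sum-exp reduction (np.logaddexp.reduce) is used only for numerical stability; it does not change the algorithm.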

A Graphical Illustration of EM Behavior

[Figure: the log-likelihood L(θ) = log p(x|θ) with alternating E-steps and M-steps moving the estimate from θ^{(k)} to θ^{(k+1)}.]

20 / 31

The EM Algorithm Never Decreases the Log-Likelihood

The difference between the log-likelihood and the lower bound is given by
$$L(\theta) - F(q, \theta) = \log p(x|\theta) - \int q(s) \log\left[\frac{p(x, s|\theta)}{q(s)}\right] ds = \log p(x|\theta) - \int q(s) \log\left[\frac{p(s|x, \theta)\, p(x|\theta)}{q(s)}\right] ds = -\int q(s) \log\left[\frac{p(s|x, \theta)}{q(s)}\right] ds = KL\left[q(s) \,\|\, p(s|x, \theta)\right].$$

This difference is zero only if $q(s) = p(s|x, \theta)$ (this is the E-step). Hence
$$L\left(\theta^{(k)}\right) \overset{\text{E-step}}{=} F\left(q^{(k+1)}, \theta^{(k)}\right) \overset{\text{M-step}}{\le} F\left(q^{(k+1)}, \theta^{(k+1)}\right) \overset{\text{Jensen}}{\le} L\left(\theta^{(k+1)}\right).$$

21 / 31
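Using the em_gmm_1d sketch above (an illustrative example, not the slides' own code), this monotonicity can be checked empirically from the recorded per-iteration log-likelihoods:

```python
import numpy as np

# Continuing from the em_gmm_1d sketch above: the per-iteration log-likelihoods
# recorded in log_liks should be non-decreasing (up to floating-point tolerance).
print(np.all(np.diff(log_liks) >= -1e-9))   # True
```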

Generalized EM: Partial M-Steps

The M-step of the algorithm may be only partially implemented, with the new estimate for the parameters improving the likelihood given the distribution found in the E-step, but not necessarily maximizing it. Such a partial M-step always results in the true likelihood improving as well.

22 / 31

Incremental EM: Partial E-Steps

The unobserved variables are commonly independent, and influence the likelihood of the parameters only through simple sufficient statistics. If these statistics can be updated incrementally when the distribution for one of the variables is re-calculated, it makes sense to immediately re-estimate the parameters before performing the E-step for the next unobserved variable, as this utilizes the new information immediately, speeding convergence. The proof holds even for the case of partial updates.

Sparse or online versions of the EM algorithm would compute the posterior for a subset of the data points or as the data arrives, respectively. You can also update the posterior over a subset of the hidden variables, while holding others fixed.

23 / 31

Incremental EM

Assume that the unobserved variables are independent; then we restrict the search for a maximum of $F$ to distributions of the form $q(s) = \prod_t q_t(s_t)$. We can write $F$ in the form $F(q, \theta) = \sum_t F_t(q_t, \theta)$, where
$$F_t(q_t, \theta) = \langle \log p(s_t, x_t|\theta) \rangle_{q_t} + H(q_t).$$

Algorithm Outline: Incremental EM
E-step: Choose a data point $x_t$. Set $q_l^{(k)} = q_l^{(k-1)}$ for $l \ne t$, and set $q_t^{(k)} = \arg\max_{q_t} F_t\left(q_t, \theta^{(k-1)}\right)$.
M-step: $\theta^{(k)} = \arg\max_{\theta} F\left(q^{(k)}, \theta\right)$.

24 / 31
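A hedged sketch of this scheme for the same one-dimensional Gaussian mixture as before, assuming NumPy; the name incremental_em_gmm_1d and the initialization are illustrative. Each visit to a data point swaps that point's contribution to the running sufficient statistics (sums of r_t, r_t x_t, r_t x_t^2) and immediately re-estimates the parameters:

```python
import numpy as np

def incremental_em_gmm_1d(x, n_components=2, n_passes=5, seed=0):
    """Illustrative incremental-EM sketch for a 1-D Gaussian mixture: after the
    partial E-step for one point, the running sufficient statistics and the
    parameters are updated immediately."""
    rng = np.random.default_rng(seed)
    n, K = len(x), n_components
    w = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())

    # Initialize q_t(s_t) with one full E-step under the initial parameters.
    log_p = (-0.5 * np.log(2 * np.pi * var)
             - 0.5 * (x[:, None] - mu) ** 2 / var + np.log(w))
    r = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1)[:, None])

    # Running sufficient statistics per component.
    S0, S1, S2 = r.sum(0), (r * x[:, None]).sum(0), (r * x[:, None] ** 2).sum(0)

    for _ in range(n_passes):
        for t in range(n):
            # Partial E-step: recompute q_t for the single point x_t.
            lp = (-0.5 * np.log(2 * np.pi * var)
                  - 0.5 * (x[t] - mu) ** 2 / var + np.log(w))
            r_new = np.exp(lp - np.logaddexp.reduce(lp))
            # Swap this point's contribution in the sufficient statistics.
            S0 += r_new - r[t]
            S1 += (r_new - r[t]) * x[t]
            S2 += (r_new - r[t]) * x[t] ** 2
            r[t] = r_new
            # Immediate M-step from the current sufficient statistics.
            w, mu = S0 / n, S1 / S0
            var = S2 / S0 - mu ** 2
    return w, mu, var
```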

Sufficient Statistic

Definition. Let $x$ be a random variable whose distribution depends on a parameter $\theta$. A real-valued function $t(x)$ of $x$ is said to be sufficient for $\theta$ if the conditional distribution of $x$, given $t(x) = t$, is independent of $\theta$. That is, $t$ is sufficient for $\theta$ if
$$p(x|t, \theta) = p(x|t).$$

25 / 31

Exponential Family

Definition. A family of distributions is said to be a $k$-parameter exponential family if the probability density function for $x$ has the form
$$p(x|\theta) = a(\theta)\, b(x) \exp\left\{\eta^\top(\theta)\, t(x)\right\}, \qquad a^{-1}(\theta) = \int b(x) \exp\left\{\eta^\top(\theta)\, t(x)\right\} dx,$$
where $\eta(\theta) = [\eta_1(\theta), \ldots, \eta_k(\theta)]^\top$ (the natural parameters) contains $k$ functions of the parameter $\theta$, and the sufficient statistic $t(x) = [t_1(x), \ldots, t_k(x)]^\top$ contains $k$ functions of the data $x$.

An alternative, equivalent form is given by
$$p(x|\theta) = \exp\left\{\eta^\top(\theta)\, t(x) + g(\theta) + h(x)\right\}, \qquad g(\theta) = -\log \int \exp\left\{\eta^\top(\theta)\, t(x) + h(x)\right\} dx.$$

26 / 31

Exponential Family: Canonical Form

If $\eta(\theta) = \theta$, then the exponential family is said to be in canonical form. By defining a transformed parameter $\eta = \eta(\theta)$, it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $t(x)$ is multiplied by that constant's reciprocal.

Let $x \in \mathbb{R}^m$ be a random vector whose distribution belongs to the $k$-parameter exponential family; then the $k$-dimensional statistic $t(x)$ is sufficient for the parameter $\theta$.

27 / 31

Example of Exponential Family: Gaussian

Let us consider a univariate Gaussian distribution parameterized by its mean and variance:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{\mu^2}{2\sigma^2}\right\} \exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x\right\}.$$

This is a two-parameter exponential family with
$$a(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{\mu^2}{2\sigma^2}\right\}, \quad b(x) = 1, \quad \eta = \left[-\frac{1}{2\sigma^2},\ \frac{\mu}{\sigma^2}\right]^\top, \quad t(x) = \left[x^2,\ x\right]^\top.$$

28 / 31
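A quick numerical check, assuming NumPy, that the exponential-family factorization above reproduces the standard Gaussian density:

```python
import numpy as np

# Check that the exponential-family form reproduces the Gaussian density.
mu, sigma = 1.5, 0.8
x = np.linspace(-3, 5, 7)

# Standard parameterization.
p_std = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Exponential-family form: p(x) = a(theta) * b(x) * exp(eta . t(x)).
a = np.exp(-mu ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
eta = np.array([-1 / (2 * sigma ** 2), mu / sigma ** 2])
t = np.stack([x ** 2, x])                 # sufficient statistics t(x) = [x^2, x]
p_exp = a * 1.0 * np.exp(eta @ t)         # b(x) = 1

print(np.allclose(p_std, p_exp))          # True
```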

Example of Exponential Family: Bernoulli

Consider the Bernoulli distribution parameterized by $\mu \in [0, 1]$:
$$p(x) = \mu^x (1-\mu)^{1-x}.$$
Matching it with the canonical form $p(x) = \exp\{\theta x + g(\theta) + h(x)\}$ yields
$$\theta = \log\frac{\mu}{1-\mu}, \qquad g(\theta) = -\log\left(1 + e^{\theta}\right), \qquad h(x) = 0.$$
Parameter values of the distribution are mapped to natural parameters via a link function (here, the logit).

29 / 31
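And a corresponding check for the Bernoulli canonical form, assuming NumPy:

```python
import numpy as np

# Check the Bernoulli canonical form: p(x) = exp(theta * x + g(theta)), with h(x) = 0.
mu = 0.3
theta = np.log(mu / (1 - mu))      # natural parameter (logit link)
g = -np.log(1 + np.exp(theta))     # g(theta) = log(1 - mu)

for x in (0, 1):
    p_std = mu ** x * (1 - mu) ** (1 - x)
    p_can = np.exp(theta * x + g)
    print(x, p_std, np.isclose(p_std, p_can))   # both forms agree
```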

EM for Exponential Family

Given complete data $z = (x, s)$, we write the expected complete-data log-likelihood:
$$L_c = \left\langle \log p(x, s|\theta) \right\rangle_{x, \theta^{(k)}} = \left\langle \eta^\top(\theta)\, t(z) + g(\theta) + h(z) \right\rangle_{x, \theta^{(k)}} = \eta^\top(\theta) \left\langle t(z) \right\rangle_{x, \theta^{(k)}} + g(\theta) + \left\langle h(z) \right\rangle_{x, \theta^{(k)}}.$$

EM algorithm for exponential families:
E-step: Requires only the expected sufficient statistics $\bar{t}^{(k+1)} = \left\langle t(z) \right\rangle_{x, \theta^{(k)}}$.
M-step: $\theta^{(k+1)} = \arg\max_{\theta} \left[\eta^\top(\theta)\, \bar{t}^{(k+1)} + g(\theta)\right]$.

30 / 31

References

- Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," Journal of the Royal Statistical Society B, vol. 39, pp. 1-38.
- Neal, R. M. and Hinton, G. E. (1999). "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models (M. Jordan, ed.), pp. 355-368.

31 / 31