Expectation Maximization (EM) Algorithm

Motivating Example: We have two coins, Coin 1 and Coin 2. Each has its own probability of seeing H on any one flip. Let
$$p_1 = P(\text{H on Coin 1}), \qquad p_2 = P(\text{H on Coin 2}).$$
Select a coin at random and flip that one coin $m$ times. Repeat this process $n$ times. We now have data
$$
\begin{array}{cccc}
X_{11} & X_{12} & \cdots & X_{1m} \\
X_{21} & X_{22} & \cdots & X_{2m} \\
\vdots &        &        & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nm}
\end{array}
\qquad
\begin{array}{c}
Y_1 \\ Y_2 \\ \vdots \\ Y_n
\end{array}
$$
Here, the $X_{ij}$ are Bernoulli random variables taking values in $\{0, 1\}$, where
$$X_{ij} = \begin{cases} 1, & \text{if the $j$th flip of the coin chosen on the $i$th trial is H} \\ 0, & \text{if the $j$th flip of the coin chosen on the $i$th trial is T} \end{cases}$$
and the $Y_i$ live in $\{1, 2\}$ and indicate which coin was used on the $i$th trial. Note that all the $X$'s are independent and, in particular,
$$X_{i1}, X_{i2}, \ldots, X_{im} \mid Y_i = j \;\overset{iid}{\sim}\; \text{Bernoulli}(p_j).$$
We can write out the joint pdf of all $nm + n$ random variables and formally come up with MLEs for $p_1$ and $p_2$. Call these MLEs $\hat{p}_1$ and $\hat{p}_2$. They will turn out as expected:
$$\hat{p}_1 = \frac{\text{total \# of times Coin 1 came up H}}{\text{total \# of times Coin 1 was flipped}}, \qquad \hat{p}_2 = \frac{\text{total \# of times Coin 2 came up H}}{\text{total \# of times Coin 2 was flipped}}.$$
Now suppose that the $Y_i$ are not observed but we still want MLEs for $p_1$ and $p_2$. The data set now consists of only the $X$'s and is incomplete. The goal of the EM Algorithm is to find MLEs for $p_1$ and $p_2$ in this case.
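In the complete-data case, the MLEs above are just the observed heads fractions, which is easy to check by simulation. Below is a minimal sketch of the setup, assuming each trial picks a coin with probability 1/2 as described above; the helper names (simulate_trials, complete_data_mle), parameter values, and seed are illustrative, not from the notes.

```python
import numpy as np

# A minimal sketch of the motivating example: n trials, each trial picks a coin
# uniformly at random (an assumption consistent with "select a coin at random")
# and flips it m times. Helper names and parameter values are illustrative.

def simulate_trials(n, m, p1, p2, rng):
    """Return X (n x m matrix of 0/1 flips) and Y (coin labels in {1, 2})."""
    Y = rng.integers(1, 3, size=n)                    # coin chosen on each trial
    probs = np.where(Y == 1, p1, p2)                  # P(H) for each trial's coin
    X = (rng.random((n, m)) < probs[:, None]).astype(int)
    return X, Y

def complete_data_mle(X, Y):
    """MLEs when the coin labels Y are observed (the complete-data case)."""
    p1_hat = X[Y == 1].mean()   # heads among Coin 1 flips / total Coin 1 flips
    p2_hat = X[Y == 2].mean()
    return p1_hat, p2_hat

rng = np.random.default_rng(0)
X, Y = simulate_trials(n=200, m=10, p1=0.8, p2=0.3, rng=rng)
print(complete_data_mle(X, Y))    # should land close to (0.8, 0.3) at this sample size
```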

Notation for the EM Algorithm

Let $X$ be observed data, generated by some distribution depending on some parameters. Here, $X$ represents something high-dimensional. (In the coin example it is an $n \times m$ matrix.) These data may or may not be iid. (In the coin example it is a matrix with iid observations in each row.) $X$ will be called an incomplete data set.

Let $Y$ be some hidden or unobserved data depending on some parameters. Here, $Y$ can have some general dimension. (In the coin example, $Y$ is a vector.)

Let $Z = (X, Y)$ represent the complete data set. We say that it is a completion of the data given by $X$. Assume that the distribution of $Z$ (likely a big fat joint distribution) depends on some (likely high-dimensional) parameter $\theta$ and that we can write the pdf for $Z$ as
$$f(z; \theta) = f(x, y; \theta) = f(y \mid x; \theta)\, f(x; \theta).$$
It will be convenient to think of the parameter $\theta$ as given and to write this instead as
$$f(z \mid \theta) = f(x, y \mid \theta) = f(y \mid x, \theta)\, f(x \mid \theta).$$
(Note: Here, the $f$'s are different pdfs identified by their arguments. For example, $f(x) = f_X(x)$ and $f(y) = f_Y(y)$. We will use subscripts only if it becomes necessary.)

We usually use $L(\theta)$ to denote a likelihood function, and it always depends on some random variables which are not shown by this notation. Because there are many groups of random variables here, we will be more explicit and write $L(\theta \mid Z)$ or $L(\theta \mid X)$ to denote the complete and incomplete likelihood functions, respectively. The complete likelihood function is
$$L(\theta \mid Z) = L(\theta \mid X, Y) = f(X, Y \mid \theta).$$
The incomplete likelihood function is
$$L(\theta \mid X) = f(X \mid \theta).$$

The Algorithm

The EM Algorithm is an iterative numerical method for finding an MLE of $\theta$. The rough idea is to start with an initial guess for $\theta$ and, using the observed data $X$ and that guess, postulate a value for $Y$ to complete the data set, at which point we can find an MLE for $\theta$ in the usual way. The actual idea, though, is slightly more sophisticated. We will use an initial guess for $\theta$ and postulate an entire distribution for $Y$, ultimately averaging out the unknown $Y$. Specifically, we will look at the expected complete likelihood (or log-likelihood, when that is more convenient)
$$E\big[L(\theta \mid X, Y)\big],$$
where the expectation is taken over the conditional distribution of the random vector $Y$ given $X$ and our guess for $\theta$. We proceed as follows; a concrete sketch for the two-coin example is given below.

1. Let $k = 0$. Give an initial estimate for $\theta$. Call it $\theta^{(k)}$.
2. Given observed data $X$ and assuming that $\theta^{(k)}$ is correct for the parameter $\theta$, find the conditional density $f(y \mid x, \theta^{(k)})$ for the completion variables.
3. Calculate the conditional expected log-likelihood, or "Q-function":
$$Q(\theta \mid \theta^{(k)}) = E\big[\ln f(X, Y \mid \theta) \,\big|\, X, \theta^{(k)}\big].$$
Here, the expectation is with respect to the conditional distribution of $Y$ given $X$ and $\theta^{(k)}$, and thus can be written as
$$Q(\theta \mid \theta^{(k)}) = \int \ln\big(f(X, y \mid \theta)\big)\, f(y \mid X, \theta^{(k)})\, dy.$$
(The integral is high-dimensional and is taken over the space where $Y$ lives.)
4. Find the $\theta$ that maximizes $Q(\theta \mid \theta^{(k)})$. Call this $\theta^{(k+1)}$. Let $k = k + 1$ and return to Step 2.

The EM Algorithm is iterated until the estimate for $\theta$ stops changing. Usually, a tolerance $\varepsilon$ is set and the algorithm is iterated until $\|\theta^{(k+1)} - \theta^{(k)}\| < \varepsilon$. We will show that this stopping rule makes sense, in the sense that once that distance is less than $\varepsilon$ it will remain less than $\varepsilon$.
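To make Steps 1-4 concrete, here is a minimal sketch of the EM iteration for the two-coin example, assuming the standard E-step for that model (the posterior probability that each trial used Coin 1, given its flips and the current guesses); the function name em_two_coins, the starting values, and the tolerance are illustrative choices, not from the notes.

```python
import numpy as np

def em_two_coins(X, p1=0.6, p2=0.4, tol=1e-8, max_iter=500):
    """EM for the two-coin model: X is an n x m 0/1 matrix, the labels Y are hidden.

    A sketch of Steps 1-4 above, assuming each coin is chosen with probability 1/2.
    (For numerical robustness one would normally work on the log scale.)
    """
    n, m = X.shape
    heads = X.sum(axis=1)                      # h_i = number of H's on trial i
    for _ in range(max_iter):
        # Step 2 / E-step: conditional probability that trial i used Coin 1,
        # given X and the current guess (p1, p2).
        like1 = p1 ** heads * (1 - p1) ** (m - heads)
        like2 = p2 ** heads * (1 - p2) ** (m - heads)
        w = like1 / (like1 + like2)            # f(y_i = 1 | x_i, theta^(k))
        # Steps 3-4 / M-step: maximizing Q gives weighted heads fractions.
        p1_new = np.sum(w * heads) / (np.sum(w) * m)
        p2_new = np.sum((1 - w) * heads) / (np.sum(1 - w) * m)
        if max(abs(p1_new - p1), abs(p2_new - p2)) < tol:   # stopping rule
            return p1_new, p2_new
        p1, p2 = p1_new, p2_new
    return p1, p2
```

With X simulated as in the earlier sketch, em_two_coins(X) typically converges to estimates close to the true $(p_1, p_2)$, possibly with the two coin labels swapped.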

[Figure 1: Visualization of Jensen's Inequality for a convex function.]

Jensen's Inequality

The EM algorithm is derived from Jensen's inequality, so we review it here. Let $X$ be a random variable with mean $\mu = E[X]$, and let $g$ be a convex function. Then
$$g(E[X]) \le E[g(X)].$$
To prove Jensen's inequality, visualize the convex function $g$ and a tangent line $\ell$ at the point $(\mu, g(\mu))$, as depicted in Figure 1. Note that, by convexity of $g$, the line is always below $g$:
$$\ell(x) \le g(x) \quad \forall x.$$
Writing the tangent line as $\ell(x) = m(x - \mu) + g(\mu)$ for some slope $m$, we have
$$m(x - \mu) + g(\mu) \le g(x) \quad \forall x,$$
and therefore we can plug in the random variable $X$ to get
$$m(X - \mu) + g(\mu) \le g(X).$$
Taking the expected value of both sides (the first term vanishes since $E[X - \mu] = 0$) leaves us with
$$g(\mu) \le E[g(X)],$$
which is the desired result $g(E[X]) \le E[g(X)]$.
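As a quick numerical sanity check (not part of the notes), we can compare $g(E[X])$ and $E[g(X)]$ by Monte Carlo for a convex $g$; the distribution, function, and sample size below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of Jensen's inequality, g(E[X]) <= E[g(X)], for a convex g.
# The distribution, function, and sample size are arbitrary illustrative choices.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)   # samples of X, with mean mu = 2

g = lambda t: t ** 2                           # a convex function
print(g(x.mean()))    # g(E[X]): approximately 4
print(g(x).mean())    # E[g(X)]: approximately 8, so the inequality holds strictly here
```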

Note that, if $g$ is concave, then the negative of $g$ is convex. Applying Jensen's inequality to $-g$ and then multiplying through by $-1$ gives
$$g(E[X]) \ge E[g(X)].$$
So, we now know, for example, that
$$\ln(E[X]) \ge E[\ln(X)].$$

Derivation of the EM Algorithm

Imagine we have some data $X$ with joint pdf $f(x \mid \theta)$. Let $\ell(\theta) = \ln f(X \mid \theta)$ denote the log-likelihood. Suppose we are trying to guess at $\theta$ and improve our guesses through some sort of iteration. Let $\theta^{(n)}$ be our $n$th iteration guess. We would like to find a new value for $\theta$ that satisfies $\ell(\theta) \ge \ell(\theta^{(n)})$. We will introduce some hidden variables $Y$, either because we are actually working with a model that has hidden (unobserved) variables or artificially because they make the maximization more tractable. So, we may write
$$
\begin{aligned}
\ell(\theta) - \ell(\theta^{(n)}) &= \ln f(X \mid \theta) - \ln f(X \mid \theta^{(n)}) \\
&= \ln\left( \int f(X \mid y, \theta)\, f(y \mid \theta)\, dy \right) - \ln f(X \mid \theta^{(n)}) \\
&= \ln\left( \int \frac{f(X \mid y, \theta)\, f(y \mid \theta)}{f(y \mid X, \theta^{(n)})}\, f(y \mid X, \theta^{(n)})\, dy \right) - \ln f(X \mid \theta^{(n)}).
\end{aligned}
$$
Note that the thing in parentheses is an expectation with respect to the distribution of $Y \mid X, \theta^{(n)}$. So, applying Jensen's inequality (for the concave function $\ln$), we have
$$
\begin{aligned}
\ell(\theta) - \ell(\theta^{(n)}) &\ge \int \ln\left( \frac{f(X \mid y, \theta)\, f(y \mid \theta)}{f(y \mid X, \theta^{(n)})} \right) f(y \mid X, \theta^{(n)})\, dy - \ln f(X \mid \theta^{(n)}) \\
&= \int \ln\left( \frac{f(X \mid y, \theta)\, f(y \mid \theta)}{f(y \mid X, \theta^{(n)})\, f(X \mid \theta^{(n)})} \right) f(y \mid X, \theta^{(n)})\, dy.
\end{aligned}
$$
Call that entire integral $d(\theta \mid \theta^{(n)})$. We then have
$$\ell(\theta) \ge \ell(\theta^{(n)}) + d(\theta \mid \theta^{(n)}).$$

[Figure 2: A Horrible Temporary Image]

Note that
$$
\begin{aligned}
d(\theta^{(n)} \mid \theta^{(n)}) &= \int \ln\left( \frac{f(X \mid y, \theta^{(n)})\, f(y \mid \theta^{(n)})}{f(y \mid X, \theta^{(n)})\, f(X \mid \theta^{(n)})} \right) f(y \mid X, \theta^{(n)})\, dy \\
&= \int \ln\left( \frac{f(X, y \mid \theta^{(n)})}{f(X, y \mid \theta^{(n)})} \right) f(y \mid X, \theta^{(n)})\, dy \\
&= \int \ln(1)\, f(y \mid X, \theta^{(n)})\, dy = 0.
\end{aligned}
$$
So, we have that $\ell(\theta^{(n)}) + d(\theta \mid \theta^{(n)})$ is bounded above by $\ell(\theta)$ and that it is equal to this upper bound when $\theta = \theta^{(n)}$. This is visualized in Figure 2. If we maximize $\ell(\theta^{(n)}) + d(\theta \mid \theta^{(n)})$ with respect to $\theta$ (equivalently, maximize $d(\theta \mid \theta^{(n)})$ with respect to $\theta$), we may improve towards maximizing $\ell(\theta)$. We know, at least, that we will not get worse. Maximizing
$$d(\theta \mid \theta^{(n)}) = \int \ln\left( \frac{f(X \mid y, \theta)\, f(y \mid \theta)}{f(y \mid X, \theta^{(n)})\, f(X \mid \theta^{(n)})} \right) f(y \mid X, \theta^{(n)})\, dy$$
with respect to $\theta$ is equivalent to maximizing
$$\int \ln\big( f(X \mid y, \theta)\, f(y \mid \theta) \big)\, f(y \mid X, \theta^{(n)})\, dy$$
with respect to $\theta$, since the denominator does not involve $\theta$. Note that this is
$$\int \ln f(X, y \mid \theta)\, f(y \mid X, \theta^{(n)})\, dy,$$
which is
$$E_Y\big[\ln f(X, Y \mid \theta) \,\big|\, X, \theta^{(n)}\big].$$

So, if we could compute this expectation, maximize it with respect to $\theta$, call the result $\theta^{(n+1)}$, and iterate, we can improve towards finding the $\theta$ that maximizes the likelihood (or at least not get worse). In other words, we can improve towards finding the MLE of $\theta$. These expectation and maximization steps are precisely the EM algorithm!

The EM Algorithm for Mixture Densities

Assume that we have a random sample $X_1, X_2, \ldots, X_n$ from the mixture density
$$f(x \mid \theta) = \sum_{j=1}^{N} p_j\, f_j(x \mid \theta_j).$$
Here, $x$ has the same dimension as one of the $X_i$, and $\theta$ is the parameter vector
$$\theta = (p_1, p_2, \ldots, p_N, \theta_1, \theta_2, \ldots, \theta_N).$$
Note that $\sum_{j=1}^{N} p_j = 1$ and each $\theta_j$ may itself be high-dimensional. An $X_i$ sampled from $f(x \mid \theta)$ may be interpreted as being sampled from one of the densities $f_1(x \mid \theta_1), f_2(x \mid \theta_2), \ldots, f_N(x \mid \theta_N)$ with respective probabilities $p_1, p_2, \ldots, p_N$.

The likelihood is
$$L(\theta \mid X) = \prod_{i=1}^{n} f(X_i \mid \theta) = \prod_{i=1}^{n} \sum_{j=1}^{N} p_j\, f_j(X_i \mid \theta_j).$$
The log-likelihood is
$$\ell(\theta \mid X) = \ln L(\theta \mid X) = \sum_{i=1}^{n} \ln \sum_{j=1}^{N} p_j\, f_j(X_i \mid \theta_j).$$
The log of the sum makes this difficult to work with. In order to make this problem more tractable, we will consider $X$ as incomplete data and augment the data with additional variables $Y_1, Y_2, \ldots, Y_n$ which indicate the component that the corresponding $X$'s come from. Specifically, for $k = 1, 2, \ldots, N$, if $Y_i = k$ then $X_i \sim f_k(x \mid \theta_k)$, and $p_k = P(Y_i = k)$.
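To see the "log of the sum" issue concretely, here is a small sketch that evaluates $\ell(\theta \mid X)$ for a two-component mixture. Univariate Gaussian components, and the helper name mixture_loglik, are assumptions made here for illustration; the notes keep the component densities $f_j$ general.

```python
import numpy as np
from scipy.stats import norm

# Sketch: the incomplete-data log-likelihood l(theta | X) of a mixture.
# Univariate Gaussian components are assumed here for concreteness; the notes
# treat general component densities f_j(x | theta_j).

def mixture_loglik(x, weights, means, sds):
    """l(theta | X) = sum_i ln( sum_j p_j f_j(x_i | theta_j) )."""
    x = np.asarray(x)[:, None]                                       # shape (n, 1)
    comp = np.asarray(weights) * norm.pdf(x, loc=means, scale=sds)   # (n, N): p_j f_j(x_i | theta_j)
    return np.sum(np.log(comp.sum(axis=1)))                          # log of the inner sum, summed over i

rng = np.random.default_rng(2)
labels = rng.choice(2, size=500, p=[0.3, 0.7])                       # hidden Y_i (component indicators)
x = rng.normal(loc=np.array([-2.0, 3.0])[labels], scale=1.0)
print(mixture_loglik(x, weights=[0.3, 0.7], means=[-2.0, 3.0], sds=[1.0, 1.0]))
```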

Now $L(\theta \mid X)$ is thought of as an incomplete likelihood, and the complete likelihood is
$$L(\theta \mid X, Y) = f(X, Y \mid \theta) = \prod_{i=1}^{n} f(X_i, Y_i \mid \theta) = \prod_{i=1}^{n} f(X_i \mid Y_i, \theta)\, f(Y_i \mid \theta).$$
Note that
$$f(y_i \mid \theta) = P(Y_i = y_i \mid \theta) = p_{y_i}$$
and that
$$f(x_i \mid y_i, \theta) = f_{y_i}(x_i \mid \theta_{y_i}).$$
So,
$$\ell(\theta \mid x, y) = \ln L(\theta \mid x, y) = \sum_{i=1}^{n} \ln\big[p_{y_i}\, f_{y_i}(x_i \mid \theta_{y_i})\big].$$
To apply the EM algorithm, we will start with some initial guesses ($k = 0$) for the parameters:
$$\theta^{(k)} = \big(p_1^{(k)}, p_2^{(k)}, \ldots, p_N^{(k)}, \theta_1^{(k)}, \theta_2^{(k)}, \ldots, \theta_N^{(k)}\big).$$
Then, we compute
$$f(y_i \mid x_i, \theta^{(k)}) = \frac{f(x_i \mid y_i, \theta^{(k)})\, f(y_i \mid \theta^{(k)})}{f(x_i \mid \theta^{(k)})} = \frac{p_{y_i}^{(k)}\, f_{y_i}(x_i \mid \theta_{y_i}^{(k)})}{\sum_{j=1}^{N} p_j^{(k)}\, f_j(x_i \mid \theta_j^{(k)})}$$
for $y_i = 1, 2, \ldots, N$. The EM algorithm Q-function is
$$Q(\theta \mid \theta^{(k)}) = E\big[\underbrace{\ln f(X, Y \mid \theta)}_{\ln L(\theta \mid X, Y)} \,\big|\, X, \theta^{(k)}\big] = \sum_{y} \ln L(\theta \mid X, y)\, f(y \mid X, \theta^{(k)}).$$
Here,
$$f(y \mid X, \theta^{(k)}) = \prod_{i=1}^{n} f(y_i \mid x_i, \theta^{(k)}),$$
but we will ultimately expand our expression for $Q(\theta \mid \theta^{(k)})$ into terms involving the $f(y_i \mid x_i, \theta^{(k)})$ and will not need to write this joint density down explicitly.

[TO BE CONTINUED...]
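As a companion to the conditional density $f(y_i \mid x_i, \theta^{(k)})$ computed above, a sketch of those "responsibilities" for the same hypothetical Gaussian-component setup might look as follows; the helper name responsibilities and the commented-out weight update are illustrative assumptions, and the M-step for the component parameters is omitted since the notes break off here.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the E-step quantities f(y_i | x_i, theta^(k)) ("responsibilities") for
# a mixture with Gaussian components -- an assumption; the notes keep f_j generic.

def responsibilities(x, weights, means, sds):
    """Return an (n, N) matrix whose (i, j) entry is f(y_i = j | x_i, theta^(k))."""
    x = np.asarray(x)[:, None]
    num = np.asarray(weights) * norm.pdf(x, loc=means, scale=sds)  # p_j^(k) f_j(x_i | theta_j^(k))
    return num / num.sum(axis=1, keepdims=True)                    # divide by f(x_i | theta^(k))

# Illustrative use (with x from the previous sketch): the updated mixture weights
# p_j^(k+1) turn out to be average responsibilities once Q is maximized.
# r = responsibilities(x, weights=[0.5, 0.5], means=[-1.0, 1.0], sds=[1.0, 1.0])
# new_weights = r.mean(axis=0)
```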