CS Lecture 18. Expectation Maximization

Size: px

Start display at page:

Download "CS Lecture 18. Expectation Maximization"

Ferdinand Day
5 years ago
Views:

1 CS 6347 Lecture 18 Expectation Maximization

Unobserved Variables Latent or hidden variables in the model are never observed We may or may not be interested in their values, but their existence is crucial to the model

2 Unobserved Variables Latent or hidden variables in the model are never observed We may or may not be interested in their values, but their existence is crucial to the model Some observations in a particular sample may be missing Missing information on surveys or medical records (quite common) We may need to model how the variables are missing 2

3 Hidden Marov Models... Y 1 Y 2 Y T 1 Y T... X 1 X 2 X T 1 X T p x 1,, x T, y 1,, y T = p y 1 p x 1 y 1 X s are observed variables, Y s are latent t p y t y t 1 p(x t y t ) Example: X variables correspond sizes of tree growth rings for one year, the Y variables correspond to average temperature

4 Missing Data Data can be missing from the model in many different ways Missing completely at random: the probability that a data item is missing is independent of the observed data and the other missing data Missing at random: the probability that a data item is missing can depend on the observed data Missing not at random: the probability that a data item is missing can depend on the observed data and the other missing data 4

5 Handling Missing Data Discard all incomplete observations Can introduce bias Imputation: actual values are substituted for missing values so that all of the data is fully observed E.g., find the most probable assignments for the missing data and substitute them in (not possible if the model is unnown) Use the sample mean/mode Explicitly model the missing data For example, could expand the state space The most sensible solution, but may be non-trivial if we don t now how/why the data is missing 5

6 Modelling Missing Data Add additional binary variable m i to the model for each possible observed variable x i that indicates whether or not that variable is observed p x obs, x mis, m = p m x obs, x mis p(x obs, x mis ) 6

7 Modelling Missing Data Add additional binary variable m i to the model for each possible observed variable x i that indicates whether or not that variable is observed p x obs, x mis, m = p m x obs, x mis p(x obs, x mis ) Explicit model of the missing data (missing not at random) 7

8 Modelling Missing Data Add additional binary variable m i to the model for each possible observed variable x i that indicates whether or not that variable is observed p x obs, x mis, m = p m x obs p(x obs, x mis ) Missing at random 8

9 Modelling Missing Data Add additional binary variable m i to the model for each possible observed variable x i that indicates whether or not that variable is observed p x obs, x mis, m = p(m)p(x obs, x mis ) Missing completely at random 9

10 Modelling Missing Data Add additional binary variable m i to the model for each possible observed variable x i that indicates whether or not that variable is observed p x obs, x mis, m = p(m)p(x obs, x mis ) Missing completely at random How can you model latent variables in this framewor? 10

11 Learning with Missing Data In order to design learning algorithms for models with missing data, we will mae two assumptions The data is missing at random The model parameters corresponding to the missing data (δ) are separate from the model parameters of the observed data (θ) That is p x obs, m θ, δ = p m x obs, δ p(x obs θ) 11

12 Learning with Missing Data p x obs, m θ, δ = p m x obs, δ p x obs θ Under the previous assumptions, the log-lielihood of samples x 1, m 1,, (x, m ) is equal to l θ, δ = =1 log p(m x obs, δ) + =1 log p(x obs, x mis θ) x mis 12

13 Learning with Missing Data p x obs, m θ, δ = p m x obs, δ p x obs θ Under the previous assumptions, the log-lielihood of samples x 1, m 1,, (x, m ) is equal to l θ, δ = =1 log p(m x obs, δ) + =1 log p(x obs, x mis θ) x mis Separable in θ and δ, so if we don t care about δ, then we only have to maximize the second term over θ 13

14 Learning with Missing Data l θ = =1 log p(x obs, x mis θ) x mis This is NOT a concave function of θ In the worst case, could have a different local maximum for each possible value of the missing data No longer have a closed form solution, even in the case of Bayesian networs 14

15 Expectation Maximization The expectation-maximization algorithm (EM) is method to find a local maximum or a saddle point of the log-lielihood with missing data Basic idea: l θ = =1 log p(x obs, x mis θ) x mis = =1 log x mis q x mis p x obs, x mis θ q x mis =1 15 x mis q x mis log p x obs, x mis θ q x mis

16 Expectation Maximization F q, θ =1 x mis q x mis log p x obs, x mis θ q x mis Maximizing F is equivalent to the maximizing the log-lielihood Could maximize it using coordinate ascent q t+1 = arg max q 1,,q F(q, θ t ) θ t+1 = argmax θ F(qt+1, θ) 16

17 Expectation Maximization x mis q x mis log p x obs, x mis θ q x mis This is just d q p x obs, θ Maximized when q x mis = p(x mis x obs, θ) Can reformulate the EM algorithm as θ t+1 = argmax θ =1 p(x mis x obs x mis, θ t ) log p x obs, x mis θ 17

18 An Example: Bayesian Networs Recall that MLE for Bayesian networs without latent variables yielded θ xi x parents i = N x i,x parents i x i N xi,x parents i Let s suppose that we are given observations from a Bayesian networ in which one of the variables is hidden What do the iterations of the EM algorithm loo lie? (on board) 18

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic