Expectation-Maximization


Transcription:

Expectation-Maximization Léon Bottou NEC Labs America COS 424 3/9/2010

Agenda
Goals: classification, clustering, regression, other.
Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic; linear vs. nonlinear; deep vs. shallow.
Capacity control: explicit (architecture, feature selection; regularization, priors); implicit (approximate optimization; Bayesian averaging, ensembles).
Operational considerations: loss functions; budget constraints; online vs. offline.
Computational considerations: exact algorithms for small datasets; stochastic algorithms for big datasets; parallel algorithms.
Léon Bottou 2/30 COS 424 3/9/2010

Summary Expectation Maximization Convenient algorithm for certain Maximum Likelihood problems. Viable alternative to Newton or Conjugate Gradient algorithms. More fashionable than Newton or Conjugate Gradients. Lots of extensions: variational methods, stochastic EM. Outline of the lecture 1. Gaussian mixtures. 2. More mixtures. 3. Data with missing values. Léon Bottou 3/30 COS 424 3/9/2010

Simple Gaussian mixture Clustering via density estimation. Pick a parametric model P_θ(X). Maximize the likelihood. Parametric model There are K components. To generate an observation: a.) pick a component k with probabilities λ_1 ... λ_K, where Σ_k λ_k = 1; b.) generate x from component k with distribution N(µ_k, σ). Simple GMM: standard deviation σ known and constant. What happens when σ is a trainable parameter? Different σ_k for each mixture component? Covariance matrices Σ_k instead of scalar standard deviations? Léon Bottou 4/30 COS 424 3/9/2010
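
As a concrete illustration of steps a.) and b.), here is a small NumPy sketch of the sampling process (the function name and the example parameters are made up for illustration):

import numpy as np

def sample_simple_gmm(n, lam, mu, sigma, seed=0):
    """Draw n points from a spherical GMM: pick component k ~ λ, then x ~ N(µ_k, σ²I)."""
    rng = np.random.default_rng(seed)
    k = rng.choice(len(lam), size=n, p=lam)                    # step a.): component indices
    x = mu[k] + sigma * rng.standard_normal((n, mu.shape[1]))  # step b.): Gaussian around µ_k
    return x, k

# example: three components in 2D with a known, shared σ
lam = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X, z = sample_simple_gmm(1000, lam, mu, sigma=1.0)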

When Maximum Likelihood fails Consider a mixture of two Gaussians with trainable standard deviations. The likelihood becomes infinite when one of them specializes on a single observation. MLE works for all discrete probabilistic models and for some continuous probabilistic models. This simple Gaussian mixture model is not one of them. People just ignore the problem and get away with it. Léon Bottou 5/30 COS 424 3/9/2010
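
To see the failure concretely, here is a minimal sketch for a one-dimensional two-component mixture, taking µ_1 equal to the first observation x_1 (the choice of component and observation is only for illustration):

\[
p_\theta(x_1) \;\ge\; \lambda\,\mathcal{N}(x_1;\,x_1,\sigma_1)
            \;=\; \frac{\lambda}{\sqrt{2\pi}\,\sigma_1}
            \;\longrightarrow\; \infty
\quad\text{as } \sigma_1 \to 0,
\]

while every other factor p_θ(x_i), i ≠ 1, stays bounded below by (1 − λ) N(x_i; µ_2, σ_2) > 0, so the product of the factors, i.e. the likelihood, is unbounded.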

Why does ignoring the problem work? Explanation 1 The GMM likelihood has many local maxima. Unlike discrete distributions, densities are not bounded. A ceiling on the densities theoretically fixes the problem. Equivalently: enforcing a minimal standard deviation that prevents Gaussians from specializing on a single observation... The singularity lies in a narrow corner of the parameter space. Optimization algorithms cannot find it! Léon Bottou 6/30 COS 424 3/9/2010

Why does ignoring the problem work? Explanation 2 There are no rules in the Wild West. We should not condition probabilities with respect to events of probability zero. With continuous probabilistic models, observations always have probability zero! Léon Bottou 7/30 COS 424 3/9/2010

Expectation Maximization for GMM We only observe the x_1, x_2, .... Some models would be very easy to optimize if we knew which mixture components y_1, y_2, ... generated them. Decomposition For a given X, guess a distribution Q(Y|X). Regardless of our guess, log L(θ) = L(Q, θ) + D(Q, θ) with
$L(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \log \frac{P_\theta(x_i \mid y)\, P_\theta(y)}{Q(y \mid x_i)}$   (easy to maximize)
$D(Q, \theta) = \sum_{i=1}^{n} \sum_{y=1}^{K} Q(y \mid x_i) \log \frac{Q(y \mid x_i)}{P_\theta(y \mid x_i)}$   (KL divergence $D(Q_{Y|X} \,\|\, P_{Y|X})$)
Léon Bottou 8/30 COS 424 3/9/2010
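
The decomposition follows from Bayes' rule and from Σ_y Q(y|x_i) = 1; a sketch of the derivation for one observation (same notation as above):

\[
\log P_\theta(x_i)
  = \sum_{y=1}^{K} Q(y \mid x_i)\,\log P_\theta(x_i)
  = \sum_{y=1}^{K} Q(y \mid x_i)\,\log \frac{P_\theta(x_i \mid y)\,P_\theta(y)}{P_\theta(y \mid x_i)}
  = \sum_{y=1}^{K} Q(y \mid x_i)\,\log \frac{P_\theta(x_i \mid y)\,P_\theta(y)}{Q(y \mid x_i)}
  + \sum_{y=1}^{K} Q(y \mid x_i)\,\log \frac{Q(y \mid x_i)}{P_\theta(y \mid x_i)}.
\]

Summing over i gives log L(θ) = L(Q, θ) + D(Q, θ). Since D is a KL divergence, D ≥ 0, so L is a lower bound on log L(θ): the E-step closes the gap by setting Q(y|x_i) = P_θ(y|x_i), and the M-step then maximizes L(Q, θ) over θ.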

Expectation-Maximization
E-Step: $q_{ik} \propto \lambda_k\, |\Sigma_k|^{-1/2}\, e^{-\frac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k)}$   (remark: the $q_{ik}$ must be normalized to sum to one over $k$!)
M-Step: $\mu_k \leftarrow \frac{\sum_i q_{ik}\, x_i}{\sum_i q_{ik}}$   $\Sigma_k \leftarrow \frac{\sum_i q_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_i q_{ik}}$   $\lambda_k \leftarrow \frac{\sum_i q_{ik}}{\sum_{i,y} q_{iy}}$
Léon Bottou 9/30 COS 424 3/9/2010
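
A minimal NumPy sketch of these two steps with full covariance matrices; X is an n×d data array, K the number of components, and the initialization and the small regularization terms are illustrative choices, not from the slides:

import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log density of N(mu, Sigma) at each row of X."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, diff.T).T                 # whitened residuals
    maha = np.sum(sol ** 2, axis=1)                    # (x_i - µ)' Σ^{-1} (x_i - µ)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)]       # initialize means on random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    lam = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: q_ik proportional to λ_k N(x_i; µ_k, Σ_k), normalized over k
        log_q = np.stack([np.log(lam[k]) + log_gaussian(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        shifted = log_q - log_q.max(axis=1, keepdims=True)   # guard against underflow
        q = np.exp(shifted)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: weighted means, covariances, and mixing proportions
        Nk = q.sum(axis=0)                             # effective number of points per component
        mu = (q.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (q[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        lam = Nk / n
    return lam, mu, Sigma

Each full EM iteration can only increase the log likelihood, which is why monitoring it (next slide) is a useful sanity check.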

Implementation remarks Numerical issues The q_{ik} are often very small and underflow the machine precision. Instead compute log q_{ik} and work with $\hat{q}_{ik} = q_{ik}\, e^{-\max_k(\log q_{ik})}$. Local maxima The likelihood is highly non-convex. EM can get stuck in a mediocre local maximum. This happens in practice. Initialization matters. On the other hand, the global maximum is not attractive either. Computing the log likelihood Computing the log likelihood is useful to monitor the progress of EM. The best moment is after the E-step and before the M-step: since D = 0, it is sufficient to compute L(Q, θ). Léon Bottou 10/30 COS 424 3/9/2010
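
A small sketch of both remarks in code, assuming log_q[i, k] holds the unnormalized values log λ_k + log N(x_i; µ_k, Σ_k) from the E-step (the function name is made up):

import numpy as np
from scipy.special import logsumexp

def responsibilities_and_loglik(log_q):
    """log_q[i, k] = log λ_k + log N(x_i; µ_k, Σ_k), unnormalized."""
    shifted = log_q - log_q.max(axis=1, keepdims=True)   # this is log q̂_ik from the slide
    q = np.exp(shifted)
    q /= q.sum(axis=1, keepdims=True)                    # normalized responsibilities
    # right after the E-step D = 0, so the bound L(Q, θ) equals the log likelihood:
    loglik = logsumexp(log_q, axis=1).sum()
    return q, loglik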

EM for GMM Start. (Illustration from Andrew Moore's tutorial on GMM.) Léon Bottou 11/30 COS 424 3/9/2010

EM for GMM After iteration #1. Léon Bottou 12/30 COS 424 3/9/2010

EM for GMM After iteration #2. Léon Bottou 13/30 COS 424 3/9/2010

EM for GMM After iteration #3. Léon Bottou 14/30 COS 424 3/9/2010

EM for GMM After iteration #4. Léon Bottou 15/30 COS 424 3/9/2010

EM for GMM After iteration #5. Léon Bottou 16/30 COS 424 3/9/2010

EM for GMM After iteration #6. Léon Bottou 17/30 COS 424 3/9/2010

EM for GMM After iteration #20. Léon Bottou 18/30 COS 424 3/9/2010

GMM for anomaly detection 1. Model P{X} with a GMM. 2. Declare an anomaly when the density falls below a threshold. Léon Bottou 19/30 COS 424 3/9/2010
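
A short sketch of this recipe using scikit-learn's GaussianMixture; the number of components and the quantile-based threshold choice are illustrative, not from the slides:

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_anomaly_detector(X_train, n_components=5, quantile=0.01, seed=0):
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed).fit(X_train)
    # use a low quantile of the training log densities as the threshold
    threshold = np.quantile(gmm.score_samples(X_train), quantile)
    return gmm, threshold

def is_anomaly(gmm, threshold, X):
    # declare an anomaly when the estimated log density falls below the threshold
    return gmm.score_samples(X) < threshold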

GMM for classification 1. Model P{X | Y = y} for every class with a GMM. 2. Calculate the Bayes optimal decision boundary. 3. Possibility to detect outliers and ambiguous patterns. Léon Bottou 20/30 COS 424 3/9/2010
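
A sketch of this class-conditional recipe, fitting one GaussianMixture per class and deciding with class frequencies as priors (the hyperparameters are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_conditional_gmms(X, y, n_components=3, seed=0):
    classes = np.unique(y)
    models, log_priors = {}, {}
    for c in classes:
        Xc = X[y == c]
        models[c] = GaussianMixture(n_components=n_components, random_state=seed).fit(Xc)
        log_priors[c] = np.log(len(Xc) / len(X))       # P{Y = c} estimated by class frequency
    return classes, models, log_priors

def predict(classes, models, log_priors, X):
    # Bayes decision: argmax_c  log P{X | Y = c} + log P{Y = c}
    scores = np.stack([models[c].score_samples(X) + log_priors[c] for c in classes], axis=1)
    return classes[np.argmax(scores, axis=1)]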

GMM for regression 1. Model P{X, Y} with a GMM. 2. Compute f(x) = E[Y | X = x]. Léon Bottou 21/30 COS 424 3/9/2010
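
One standard way to carry out step 2: under a joint GMM over (x, y), each component's conditional distribution is Gaussian, so the regression function is a responsibility-weighted sum of per-component linear predictors (a sketch, with each µ_k and Σ_k partitioned into x and y blocks):

\[
f(x) = \mathbb{E}[Y \mid X = x]
     = \sum_{k=1}^{K} w_k(x)\,\Big(\mu_k^{y} + \Sigma_k^{yx}\,(\Sigma_k^{xx})^{-1}(x - \mu_k^{x})\Big),
\qquad
w_k(x) = \frac{\lambda_k\,\mathcal{N}(x;\,\mu_k^{x},\,\Sigma_k^{xx})}
              {\sum_{j}\lambda_j\,\mathcal{N}(x;\,\mu_j^{x},\,\Sigma_j^{xx})}.
\]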

The price of probabilistic models Estimating densities is nearly impossible! A GMM with many components is a very flexible model. Nearly as demanding as a general model. Can you trust the GMM distributions? Maybe in very low dimension... Maybe when the data is abundant... Can you trust decisions based on the GMM distribution? They are often more reliable than the GMM distributions themselves. Use cross-validation to check! Alternatives? Directly learn the decision function! Use cross-validation to check! Léon Bottou 22/30 COS 424 3/9/2010

More mixture models We can make mixtures of anything. Bernoulli mixture Example: represent a text document by D binary variables indicating the presence or absence of word t = 1, ..., D. Base model: model each word independently with a Bernoulli. Mixture model: see next slide. Non-homogeneous mixtures It is sometimes useful to mix different kinds of distributions. Example: model how long a patient survives after a treatment. One component with thin tails for the common case. One component with thick tails for patients cured by the treatment. Léon Bottou 23/30 COS 424 3/9/2010

Bernoulli mixture Consider D binary variables x = (x_1, ..., x_D). Each x_i independently follows a Bernoulli distribution B(µ_i):
$P_\mu(x) = \prod_{i=1}^{D} \mu_i^{x_i}\,(1 - \mu_i)^{1 - x_i}$, with mean µ and covariance diag[µ_i(1 − µ_i)].
Now let's consider a mixture of such distributions. The parameters are θ = (λ_1, µ_1, ..., λ_K, µ_K) with Σ_k λ_k = 1:
$P_\theta(x) = \sum_{k=1}^{K} \lambda_k\, P_{\mu_k}(x)$, with mean Σ_k λ_k µ_k and a covariance that is no longer diagonal!
Since the covariance matrix is no longer diagonal, the mixture models dependencies between the x_i. Léon Bottou 24/30 COS 424 3/9/2010

EM for Bernoulli mixture We are given a dataset x = x_1, ..., x_n. The log likelihood is $\log L(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_k\, P_{\mu_k}(x_i)$. Let's derive an EM algorithm. Variable Y = y_1, ..., y_n says which component generated each observation. Maximizing the likelihood would be easy if we were observing the Y. So let's just guess Y with distribution Q(Y = y | X = x_i) ≡ q_{iy}. Decomposition: log L(θ) = L(Q, θ) + D(Q, θ), with the usual definitions (slide 8). Léon Bottou 25/30 COS 424 3/9/2010

EM for a Bernoulli mixture
E-Step: $q_{ik} \propto \lambda_k\, P_{\mu_k}(x_i)$   (remark: the $q_{ik}$ must be normalized to sum to one over $k$!)
M-Step: $\mu_k \leftarrow \frac{\sum_i q_{ik}\, x_i}{\sum_i q_{ik}}$   $\lambda_k \leftarrow \frac{\sum_i q_{ik}}{\sum_{i,y} q_{iy}}$
Léon Bottou 26/30 COS 424 3/9/2010
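
A minimal NumPy sketch of these updates; X is an n×D binary {0,1} array, and the initialization and the smoothing constants are illustrative choices:

import numpy as np

def em_bernoulli_mixture(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, D = X.shape
    mu = rng.uniform(0.25, 0.75, size=(K, D))        # component means, kept away from 0 and 1
    lam = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: q_ik ∝ λ_k Π_t µ_kt^{x_it} (1 − µ_kt)^{1 − x_it}, computed in log space
        log_q = (np.log(lam)[None, :]
                 + X @ np.log(mu).T
                 + (1 - X) @ np.log(1 - mu).T)
        log_q -= log_q.max(axis=1, keepdims=True)    # underflow guard, as in the GMM case
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)            # remark: normalization!
        # M-step: weighted means and mixing proportions
        Nk = q.sum(axis=0)
        mu = (q.T @ X + 1e-3) / (Nk[:, None] + 2e-3)  # light smoothing keeps µ away from 0 and 1
        lam = Nk / n
    return lam, mu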

Data with missing values Fitting my probabilistic model would be so easy without missing values.
mpg   cyl  disp   hp     weight  accel  year  name
15.0  8    350.0  165.0  3693    11.5   70    buick skylark 320
18.0  8    318.0  150.0  3436    11.0   70    plymouth satellite
15.0  8    429.0  198.0  4341    10.0   70    ford galaxie 500
14.0  8    454.0  n/a    4354    9.0    70    chevrolet impala
15.0  8    390.0  190.0  3850    8.5    70    amc ambassador dpl
n/a   8    340.0  n/a    n/a     8.0    70    plymouth cuda 340
18.0  4    121.0  112.0  2933    14.5   72    volvo 145e
22.0  4    121.0  76.00  2511    18.0   n/a   volkswagen 411
21.0  4    120.0  87.00  2979    19.5   72    peugeot 504
26.0  n/a  96.0   69.00  2189    18.0   72    renault 12
22.0  4    122.0  86.00  n/a     16.0   72    ford pinto
28.0  4    97.0   92.00  2288    17.0   72    datsun 510
n/a   8    440.0  215.0  4735    n/a    73    chrysler new yorker
Léon Bottou 27/30 COS 424 3/9/2010

EM for missing values Fitting my probabilistic model would be so easy without missing values. This magic sentence suggests EM. Let X = x_1, x_2, ..., x_n be the observed values on each row. Let Y = y_1, y_2, ..., y_n be the missing values on each row. Decomposition Guess a distribution Q_λ(Y|X). Regardless of our guess, log L(θ) = L(λ, θ) + D(λ, θ) with
$L(\lambda, \theta) = \sum_{i=1}^{n} \sum_{y} Q_\lambda(y \mid x_i) \log \frac{P_\theta(x_i, y)}{Q_\lambda(y \mid x_i)}$   (easy to maximize)
$D(\lambda, \theta) = \sum_{i=1}^{n} \sum_{y} Q_\lambda(y \mid x_i) \log \frac{Q_\lambda(y \mid x_i)}{P_\theta(y \mid x_i)}$   (KL divergence $D(Q_{Y|X} \,\|\, P_{Y|X})$)
Léon Bottou 28/30 COS 424 3/9/2010

EM for missing values E-Step: Depends on the parametric expression of Q_λ(Y|X). M-Step: Depends on the parametric expression of P_θ(X, Y). This works when the missing-value patterns are sufficiently random! Léon Bottou 29/30 COS 424 3/9/2010
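
As one concrete instance of this recipe, here is a sketch of EM for a single multivariate Gaussian P_θ(X, Y) = N(µ, Σ), where Y collects the missing entries of each row: the E-step quantities are then the exact Gaussian conditionals of the missing block given the observed block. X is an n×d array with NaN marking missing values; the function name and the regularization constants are illustrative, not from the slides:

import numpy as np

def em_gaussian_missing(X, n_iter=50):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X, axis=0), X)     # crude initial fill with column means
    mu = Xf.mean(axis=0)
    Sigma = np.cov(Xf.T) + 1e-6 * np.eye(d)
    for _ in range(n_iter):
        filled = Xf.copy()
        C = np.zeros((d, d))                          # accumulated conditional covariances
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            # E-step: conditional mean and covariance of the missing block given the observed block
            Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
            B = Sigma[np.ix_(m, o)] @ Soo_inv
            filled[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(o, m)]
        # M-step: update µ and Σ from the completed data plus the conditional covariance terms
        mu = filled.mean(axis=0)
        diff = filled - mu
        Sigma = (diff.T @ diff + C) / n + 1e-6 * np.eye(d)
        Xf = filled
    return mu, Sigma

The conditional covariance term C is what distinguishes this from simply re-imputing conditional means: without it, the covariance estimate would be biased toward zero in the missing blocks.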

Conclusion Expectation Maximization EM is a very useful algorithm for probabilistic models. EM is an alternative to sophisticated optimization. EM is simpler to implement. Probabilistic Models More versatile than direct approaches. More demanding than direct approaches (assumptions, data, etc.) Léon Bottou 30/30 COS 424 3/9/2010