Introduction to Machine Learning


Brown University CSCI 1950-F, Spring 2012
Prof. Erik Sudderth

Lecture 20: Expectation Maximization Algorithm. EM for Mixture Models.
Many figures courtesy of Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective.

Gaussian Mixture Models
[Figure: mixture of 3 Gaussian distributions in 2D; contour plot of the joint density, marginalizing cluster assignments.]

Gaussian Mixture Models
• Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
• Hidden cluster labels: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
• Hidden mixture means: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
• Hidden mixture covariances: $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$
• Hidden mixture probabilities: $\pi_k$, $\sum_{k=1}^K \pi_k = 1$
• Gaussian mixture marginal likelihood:
$$p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^K \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}), \qquad p(x_i \mid z_i, \pi, \mu, \Sigma) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
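As an illustration only (not from the lecture), a minimal numpy/scipy sketch of this marginal likelihood; the parameter values below are made up:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for K = 3 clusters in d = 2 dimensions.
pi = np.array([0.5, 0.3, 0.2])                              # mixture weights, sum to 1
mu = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 1.0]])        # cluster means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])   # cluster covariances

def gmm_marginal_likelihood(x, pi, mu, Sigma):
    """p(x | pi, mu, Sigma): sum over the hidden cluster label z."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
               for k in range(len(pi)))

print(gmm_marginal_likelihood(np.array([1.0, 0.5]), pi, mu, Sigma))
```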

Clustering with Gaussian Mixtures
[Figure: (a) complete data, labeled by true cluster assignments; (b) incomplete data, points to be clustered.]
With complete data, learning is Gaussian discriminant analysis.
C. Bishop, Pattern Recognition & Machine Learning

Expectation Maximization (EM)
[Figure: graphical models for supervised training, supervised testing, and unsupervised learning.]
$\pi, \theta$: parameters (define cluster location and shape)
$z_1, \ldots, z_N$: hidden data (group observations into clusters)
• Initialization: Randomly select starting parameters
• E-Step: Given parameters, find posterior of hidden data (equivalent to test inference of the full posterior distribution)
• M-Step: Given posterior distributions, find likely parameters (distinct from supervised ML/MAP, but often still tractable)
• Iteration: Alternate E-step & M-step until convergence

EM Algorithm
[Figure: snapshots of EM fitting a Gaussian mixture to 2D data, shown after L = 1 and after L = 5 iterations.]
C. Bishop, Pattern Recognition & Machine Learning

Convexity

Convexity & Jensen's Inequality
For a convex function $f$: $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$.
[Figure: a convex function $f(x)$ lies below the chord joining $a$ and $b$.]

Concavity & Jensen's Inequality
Because $\ln$ is concave: $\ln(\mathbb{E}[X]) \ge \mathbb{E}[\ln(X)]$.
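A quick numerical sanity check of this inequality (illustrative only, with an arbitrary positive distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # arbitrary positive samples

lhs = np.log(x.mean())          # ln(E[X])
rhs = np.log(x).mean()          # E[ln(X)]
print(lhs, rhs, lhs >= rhs)     # prints True: ln(E[X]) >= E[ln(X)]
```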

EM as Lower Bound Maximization
$$\ln p(x \mid \theta) = \ln \Big( \sum_z p(x, z \mid \theta) \Big) = \ln \Big( \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \Big)$$
Applying Jensen's inequality to the concave $\ln$:
$$\ln p(x \mid \theta) \ge \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) \;\triangleq\; \mathcal{L}(q, \theta)$$
• Initialization: Randomly select starting parameters $\theta^{(0)}$
• E-Step: Given parameters, find posterior of hidden data: $q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)})$
• M-Step: Given posterior distributions, find likely parameters: $\theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta)$
• Iteration: Alternate E-step & M-step until convergence

Lower Bounds on Marginal Likelihood
[Figure: decomposition of the log marginal likelihood, $\ln p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$.]
C. Bishop, Pattern Recognition & Machine Learning
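A small illustrative check of this decomposition for a toy discrete latent variable (the joint probabilities below are made up): the bound and the KL term always sum to the log marginal likelihood, and the bound is tight exactly when $q$ is the posterior.

```python
import numpy as np

p_xz = np.array([0.10, 0.25, 0.05])   # toy joint p(x, z = k | theta) for one fixed x, K = 3
p_x = p_xz.sum()                      # marginal likelihood p(x | theta)
posterior = p_xz / p_x                # exact posterior p(z | x, theta)

def lower_bound(q):                   # L(q, theta) = E_q[ln p(x, z | theta)] + H(q)
    return np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

def kl(q, p):                         # KL(q || p)
    return np.sum(q * np.log(q / p))

q = np.array([0.5, 0.3, 0.2])         # an arbitrary distribution over z
print(np.log(p_x), lower_bound(q) + kl(q, posterior))   # identical values
print(np.log(p_x), lower_bound(posterior))              # bound is tight at the posterior
```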

EM: A Sequence of Lower Bounds
[Figure: $\ln p(x \mid \theta)$ and successive lower bounds $\mathcal{L}(q, \theta)$, whose maximization moves the parameters from $\theta^{\text{old}}$ to $\theta^{\text{new}}$.]
C. Bishop, Pattern Recognition & Machine Learning

Expectation Maximization Algorithm
E Step: Optimize the distribution on hidden variables given the parameters; this sets $\mathrm{KL}(q \,\|\, p) = 0$, so $\mathcal{L}(q, \theta^{\text{old}}) = \ln p(x \mid \theta^{\text{old}})$.
M Step: Optimize the parameters given the distribution on hidden variables; then $\mathcal{L}(q, \theta^{\text{new}}) \le \ln p(x \mid \theta^{\text{new}})$.
C. Bishop, Pattern Recognition & Machine Learning

EM: Expectation Step
$$\ln p(x \mid \theta) \ge \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) = \mathcal{L}(q, \theta), \qquad q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)})$$
• General solution, for any probabilistic model: $q^{(t)}(z) = p(z \mid x, \theta^{(t-1)})$, the posterior distribution given the current parameters
• Applying to probabilistic mixture models: $p(z_i \mid \pi) = \mathrm{Cat}(z_i \mid \pi)$, $p(x_i \mid z_i, \theta) = p(x_i \mid \theta_{z_i})$, so
$$r_{ik} = p(z_i = k \mid x_i, \pi, \theta) = \frac{\pi_k \, p(x_i \mid \theta_k)}{\sum_{l=1}^K \pi_l \, p(x_i \mid \theta_l)}$$
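A hedged sketch of this E-step for a Gaussian mixture; the function name and the tiny example parameters are illustrative, not from the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """Responsibilities r[i, k] = p(z_i = k | x_i, pi, theta) for a Gaussian mixture."""
    N, K = X.shape[0], len(pi)
    r = np.empty((N, K))
    for k in range(K):
        r[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    r /= r.sum(axis=1, keepdims=True)   # normalize each row over the K clusters
    return r

# Tiny illustrative call with made-up parameters (K = 2, d = 2).
X = np.array([[0.1, 0.2], [2.1, 1.9]])
pi = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0], [2.0, 2.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
print(e_step(X, pi, mu, Sigma))
```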

E-Step for Gaussian Mixtures
[Figure: (b) incomplete data, points to be clustered; (c) posterior probabilities of assignment to each cluster.]
$$r_{ik} = p(z_i = k \mid x_i, \pi, \theta) = \frac{\pi_k \, p(x_i \mid \theta_k)}{\sum_{l=1}^K \pi_l \, p(x_i \mid \theta_l)}$$

Reminder: Exponential Families
• $\phi(x) \in \mathbb{R}^d$: fixed vector of sufficient statistics (features), specifying the family of distributions
• $\theta \in \Theta$: unknown vector of natural parameters, determining the particular distribution in this family
• $Z(\theta) > 0$: normalization constant or partition function, ensuring this is a valid probability distribution
• $h(x) > 0$: reference measure independent of parameters (for many models, we simply have $h(x) = 1$)
• Examples: Multinomial, Dirichlet, Gaussian, Poisson, ...
• ML estimation via moment matching: $\mathbb{E}_{\hat\theta}[\phi(x)] = \frac{1}{N} \sum_{i=1}^N \phi(x_i)$
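As an illustration (toy data, not from the slides), moment matching for a Gaussian, whose sufficient statistics are $\phi(x) = (x, x^2)$: matching the expected statistics to their sample averages recovers the usual mean and (biased) variance estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=50_000)   # toy 1-D data

# Average sufficient statistics (1/N) sum_i phi(x_i) with phi(x) = (x, x^2).
moments = np.stack([x, x ** 2]).mean(axis=1)
mu_hat = moments[0]                               # matched first moment
var_hat = moments[1] - moments[0] ** 2            # E[x^2] - E[x]^2
print(mu_hat, var_hat)                            # roughly 3.0 and 4.0
```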

Reminder: Constrained Optimization
$$\hat\theta = \arg\max_\theta \sum_{k=1}^K a_k \log \theta_k \quad \text{subject to} \quad \theta_k \ge 0, \;\; \sum_{k=1}^K \theta_k = 1, \qquad a_k \ge 0$$
• Solution: $\hat\theta_k = \dfrac{a_k}{a_0}$, where $a_0 = \sum_{k=1}^K a_k$
• Proof for K = 2: Change of variables to an unconstrained problem
• Proof for general K: Lagrange multipliers (see textbook)
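A small numeric sanity check (illustrative only, with arbitrary weights) that $\hat\theta_k = a_k / a_0$ beats random points on the probability simplex:

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.array([3.0, 1.0, 6.0])          # arbitrary nonnegative weights a_k
theta_hat = a / a.sum()                # claimed maximizer a_k / a_0

def objective(theta):
    return np.sum(a * np.log(theta))

best = objective(theta_hat)
for _ in range(10_000):                # random feasible points on the simplex
    theta = rng.dirichlet(np.ones(len(a)))
    assert objective(theta) <= best + 1e-9
print(theta_hat, best)
```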

EM: Maximization Step
$$\ln p(x \mid \theta) \ge \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) = \mathcal{L}(q, \theta)$$
$$\theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta) = \arg\max_\theta \sum_z q(z) \ln p(x, z \mid \theta)$$
• Unlike the E-step, there is no simplified general solution
• Applying to mixtures of exponential families: $p(z_i \mid \pi) = \mathrm{Cat}(z_i \mid \pi)$, $p(x_i \mid z_i, \theta) = \exp\!\big(\theta_{z_i}^T \phi(x_i) - A(\theta_{z_i})\big)$
• Weighted moment matching:
$$\hat\pi_k = \frac{N_k}{N}, \qquad \mathbb{E}_{\hat\theta_k}[\phi(x)] = \frac{1}{N_k} \sum_{i=1}^N r_{ik} \, \phi(x_i), \qquad N_k = \sum_{i=1}^N r_{ik}$$
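A compact, hedged sketch putting the E-step and this weighted-moment-matching M-step together for a Gaussian mixture; the function names, random initialization, and regularization constants are mine, not the lecture's:

```python
import numpy as np
from scipy.stats import multivariate_normal

def m_step(X, r):
    """Weighted moment matching: update (pi, mu, Sigma) from responsibilities r (N x K)."""
    N, d = X.shape
    Nk = r.sum(axis=0)                                   # effective cluster sizes N_k
    pi = Nk / N
    mu = (r.T @ X) / Nk[:, None]
    Sigma = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mu[k]
        Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma

def em_gmm(X, K, n_iters=50, seed=0):
    """Alternate E- and M-steps from a random initialization."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]         # random data points as initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)        # E-step: unnormalized responsibilities
        r /= r.sum(axis=1, keepdims=True)
        pi, mu, Sigma = m_step(X, r)                     # M-step: weighted moment matching
    return pi, mu, Sigma
```

A numerically careful implementation would compute the responsibilities in log space and monitor the log-likelihood, which EM never decreases; this sketch keeps the algebra as close to the slide's updates as possible.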

Binary Features: Mixtures of Bernoullis
[Figure: 10 clusters identified via the EM algorithm from binarized MNIST digits.]
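A hedged sketch (not the lecture's code) of EM for a mixture of Bernoullis on binary feature vectors such as binarized digit images; initialization and clipping constants are illustrative choices:

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iters=50, seed=0):
    """X: N x D binary matrix. Returns mixture weights pi and K x D Bernoulli means theta."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.uniform(0.25, 0.75, size=(K, D))     # random initial Bernoulli parameters
    for _ in range(n_iters):
        # E-step: responsibilities, computed in log space for numerical stability.
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T   # N x K
        log_r = np.log(pi) + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted moment matching for the Bernoulli sufficient statistics.
        Nk = r.sum(axis=0)
        pi = Nk / N
        theta = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta
```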

Density Shape Matters
[Figure: bankruptcy data (bankrupt vs. solvent); a mixture of Gaussians makes 21 errors, while a mixture of Student t densities makes 4 errors (red = error).]

Real Clusters may not be Round Ng, Jordan, & Weiss, NIPS02

Dimensionality Reduction

             Supervised Learning                 Unsupervised Learning
Discrete     classification or categorization    clustering
Continuous   regression                          dimensionality reduction

• Goal: Infer label/response y given only features x
• Classical: Find latent variables y good for compression of x
• Probabilistic learning: Estimate parameters of the joint distribution p(x, y) which maximize the marginal probability p(x)