Expectation maximization

Expectation maximization. Subhransu Maji. CMPSCI 689: Machine Learning. 14 April 2015

Motivation. Suppose you are building a naive Bayes spam classifier. After you are done, your boss tells you that there is no money to label the data. You have a probabilistic model that assumes labelled data, but you don't have any labels. Can you still do something? Amazingly, you can: treat the labels as hidden variables and try to learn them simultaneously along with the parameters of the model. Expectation Maximization (EM) is a broad family of algorithms for solving hidden variable problems. In today's lecture we will derive EM algorithms for clustering and naive Bayes classification and learn why EM works.

Gaussian mixture model for clustering. Suppose the data comes from a Gaussian Mixture Model (GMM): you have K clusters, and the data from cluster k is drawn from a Gaussian with mean μ_k and variance σ²_k. We will assume that the data comes with labels (we will soon remove this assumption). Generative story of the data: for each example n = 1, 2, ..., N, choose a label y_n ~ Mult(θ_1, θ_2, ..., θ_K), then choose the example x_n ~ N(μ_{y_n}, σ²_{y_n}). Likelihood of the data:

p(D) = ∏_{n=1}^N p(y_n) p(x_n | y_n) = ∏_{n=1}^N θ_{y_n} N(x_n; μ_{y_n}, σ²_{y_n}) = ∏_{n=1}^N θ_{y_n} (2π σ²_{y_n})^{-D/2} exp( -‖x_n - μ_{y_n}‖² / (2 σ²_{y_n}) )
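
To make the generative story concrete, here is a minimal Python (numpy) sketch that samples from this model, assuming spherical Gaussians with a single scalar variance σ²_k per cluster as on the slide; the function name sample_gmm and the example parameters are my own, not from the slides.

import numpy as np

def sample_gmm(N, theta, mu, sigma2, rng=None):
    """Sample N points: y_n ~ Mult(theta), then x_n ~ N(mu[y_n], sigma2[y_n] * I)."""
    rng = np.random.default_rng(rng)
    K, D = mu.shape
    y = rng.choice(K, size=N, p=theta)                                   # choose a label for each example
    x = mu[y] + np.sqrt(sigma2[y])[:, None] * rng.standard_normal((N, D))  # add spherical Gaussian noise
    return x, y

# Example: 3 clusters in 2 dimensions
theta = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
sigma2 = np.array([1.0, 2.0, 0.5])
x, y = sample_gmm(500, theta, mu, sigma2, rng=0)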

GMM: known labels. Likelihood of the data:

p(D) = ∏_{n=1}^N θ_{y_n} (2π σ²_{y_n})^{-D/2} exp( -‖x_n - μ_{y_n}‖² / (2 σ²_{y_n}) )

If you knew the labels y_n, then the maximum-likelihood estimates of the parameters are easy:

θ_k = (1/N) Σ_n [y_n = k]            // fraction of examples with label k
μ_k = Σ_n [y_n = k] x_n / Σ_n [y_n = k]            // mean of all the examples with label k
σ²_k = Σ_n [y_n = k] ‖x_n - μ_k‖² / Σ_n [y_n = k]            // variance of all the examples with label k
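
A minimal sketch of these estimates, continuing the numpy example above; note that I divide the squared distances by D so that σ²_k is the per-dimension variance of the D-dimensional spherical Gaussian (the slide's formula is written without this factor, i.e. for D = 1).

import numpy as np

def mle_known_labels(x, y, K):
    """ML estimates of (theta, mu, sigma2) for the spherical GMM when labels are observed."""
    N, D = x.shape
    theta = np.array([np.mean(y == k) for k in range(K)])            # fraction of examples with label k
    mu = np.array([x[y == k].mean(axis=0) for k in range(K)])        # mean of the examples with label k
    sigma2 = np.array([((x[y == k] - mu[k]) ** 2).sum(axis=1).mean() / D
                       for k in range(K)])                           # per-dimension variance of cluster k
    return theta, mu, sigma2

theta_hat, mu_hat, sigma2_hat = mle_known_labels(x, y, K=3)          # x, y from the sampling sketch above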

GMM: unknown labels. Now suppose you didn't have the labels y_n. Analogous to k-means, one solution is to iterate: start by guessing the parameters and then repeat two steps: estimate the labels given the parameters, then estimate the parameters given the labels. In k-means we assigned each point to a single cluster, also called a hard assignment (point 10 goes to cluster 2). In expectation maximization (EM) we will use a soft assignment (point 10 goes half to cluster 2 and half to cluster 5). Let's define a random variable z_n = [z_{n,1}, z_{n,2}, ..., z_{n,K}] to denote the assignment vector for the n-th point. Hard assignment: only one of the z_{n,k} is 1, the rest are 0. Soft assignment: the z_{n,k} are positive and sum to 1.

GMM: parameter estimation. Formally, z_{n,k} is the probability that the n-th point goes to cluster k:

z_{n,k} = p(y_n = k | x_n) = p(y_n = k, x_n) / p(x_n) ∝ p(y_n = k) p(x_n | y_n = k) = θ_k N(x_n; μ_k, σ²_k)

Given a set of parameters (θ_k, μ_k, σ²_k), z_{n,k} is easy to compute. Given z_{n,k}, we can update the parameters (θ_k, μ_k, σ²_k) as:

θ_k = (1/N) Σ_n z_{n,k}            // fraction of examples with label k
μ_k = Σ_n z_{n,k} x_n / Σ_n z_{n,k}            // mean of all the fractional examples with label k
σ²_k = Σ_n z_{n,k} ‖x_n - μ_k‖² / Σ_n z_{n,k}            // variance of all the fractional examples with label k
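
Putting the two steps together, here is a sketch of the full EM loop for the spherical GMM; the function name em_gmm, the random-data-point initialization, and the max-subtraction for numerical stability are my own choices, and as above the variance update includes a 1/D factor so σ²_k is a per-dimension variance.

import numpy as np

def em_gmm(x, K, iters=50, rng=None):
    """EM for a mixture of K spherical Gaussians: alternate soft assignments and parameter updates."""
    rng = np.random.default_rng(rng)
    N, D = x.shape
    theta = np.full(K, 1.0 / K)
    mu = x[rng.choice(N, size=K, replace=False)]          # initialize means at random data points
    sigma2 = np.full(K, x.var())
    for _ in range(iters):
        # E step: z[n, k] ∝ theta_k N(x_n; mu_k, sigma2_k I), computed in log space
        sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        logz = np.log(theta) - 0.5 * D * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
        logz -= logz.max(axis=1, keepdims=True)           # subtract the row max for numerical stability
        z = np.exp(logz)
        z /= z.sum(axis=1, keepdims=True)
        # M step: the known-label estimates, with fractional counts z[n, k]
        Nk = z.sum(axis=0)
        theta = Nk / N
        mu = (z.T @ x) / Nk[:, None]
        sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        sigma2 = (z * sq).sum(axis=0) / (Nk * D)          # per-dimension variance (1/D factor)
    return theta, mu, sigma2, z

theta_hat, mu_hat, sigma2_hat, z = em_gmm(x, K=3, rng=1)  # x from the sampling sketch above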

GMM: example. We have replaced the indicator variable [y_n = k] with p(y_n = k | x_n), which is the expectation of [y_n = k]. This is our guess of the labels. Just like k-means, EM is susceptible to local minima. Clustering example (k-means vs. GMM): http://nbviewer.ipython.org/github/nicta/mlss/tree/master/clustering/

The EM framework. We have data with observations x_n and hidden variables y_n, and would like to estimate parameters θ. The likelihood of the data and hidden variables is p(D) = ∏_n p(x_n, y_n | θ). Only the x_n are known, so we compute the data likelihood by marginalizing out the y_n:

p(D | θ) = ∏_n Σ_{y_n} p(x_n, y_n | θ)

Parameter estimation by maximizing the log-likelihood:

θ_ML = arg max_θ Σ_n log Σ_{y_n} p(x_n, y_n | θ)

This is hard to maximize since the sum is inside the log.
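
To see the "sum inside the log" concretely, here is a small helper (my own, not from the slides) that evaluates this marginal log-likelihood for the spherical GMM above, using log-sum-exp for numerical stability; the EM iterations should never decrease this quantity.

import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(x, theta, mu, sigma2):
    """Marginal log-likelihood: sum_n log sum_k theta_k N(x_n; mu_k, sigma2_k I)."""
    N, D = x.shape
    sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)                # (N, K) squared distances
    log_joint = np.log(theta) - 0.5 * D * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
    return logsumexp(log_joint, axis=1).sum()                           # marginalize y_n, then sum over n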

Jensen's inequality. Given a concave function f and a set of weights λ_i ≥ 0 with Σ_i λ_i = 1, Jensen's inequality states that f(Σ_i λ_i x_i) ≥ Σ_i λ_i f(x_i). This is a direct consequence of concavity: f(ax + by) ≥ a f(x) + b f(y) when a ≥ 0, b ≥ 0, a + b = 1. [Figure: a concave f with values f(x), f(y); the chord value a f(x) + b f(y) lies below f(ax + by).]
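
A quick numerical sanity check of the inequality with f = log (which is concave); purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
lam = rng.dirichlet(np.ones(5))       # weights: non-negative and sum to 1
xs = rng.uniform(0.5, 5.0, size=5)    # points in the domain of log
lhs = np.log(np.dot(lam, xs))         # f(Σ_i λ_i x_i)
rhs = np.dot(lam, np.log(xs))         # Σ_i λ_i f(x_i)
assert lhs >= rhs                     # Jensen's inequality for concave f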

The EM framework. Construct a lower bound on the log-likelihood using Jensen's inequality:

L(θ) = Σ_n log Σ_{y_n} p(x_n, y_n | θ)
     = Σ_n log Σ_{y_n} q(y_n) p(x_n, y_n | θ) / q(y_n)
     ≥ Σ_n Σ_{y_n} q(y_n) log [ p(x_n, y_n | θ) / q(y_n) ]            // Jensen's inequality
     = Σ_n Σ_{y_n} [ q(y_n) log p(x_n, y_n | θ) - q(y_n) log q(y_n) ]
     = L̂(θ)

Maximize the lower bound (the second term is independent of θ):

θ = arg max_θ Σ_n Σ_{y_n} q(y_n) log p(x_n, y_n | θ)

Lower bound illustrated. Maximizing the lower bound increases the value of the original function if the lower bound touches the function at the current value. [Figure: the log-likelihood L(θ) with lower bounds L̂(θ_t) and L̂(θ_{t+1}) touching it at θ_t and θ_{t+1}.]

An optimal lower bound. Any choice of the probability distribution q(y_n) is valid as long as the lower bound touches the function at the current estimate of θ. We can then pick the optimal q(y_n) by maximizing the lower bound:

arg max_q Σ_{y_n} [ q(y_n) log p(x_n, y_n | θ) - q(y_n) log q(y_n) ]

This gives us q(y_n) = p(y_n | x_n, θ_t), so that L(θ_t) = L̂(θ_t). (Proof: use Lagrange multipliers with the sum-to-one constraint.) This is the distribution of the hidden variables conditioned on the data and the current estimate of the parameters, which is exactly what we computed in the GMM example.

The EM algorithm. We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ of the distribution p(x | θ). EM algorithm: initialize the parameters θ randomly, then iterate between the following two steps:

E step: compute the probability distribution over the hidden variables, q(y_n) = p(y_n | x_n, θ)
M step: maximize the lower bound, θ = arg max_θ Σ_n Σ_{y_n} q(y_n) log p(x_n, y_n | θ)

EM is a great candidate when the M step can be done easily but p(x | θ) cannot be easily optimized over θ. For example, for GMMs it was easy to compute the means and variances given the memberships. A generic skeleton of this loop is sketched below.
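
As referenced above, a generic skeleton of the loop (my own phrasing); e_step and m_step are placeholders for model-specific code, which for the GMM are the responsibility computation and the weighted parameter updates sketched earlier.

def run_em(x, theta, e_step, m_step, iters=100):
    """Generic EM loop: alternate the E step and the M step."""
    for _ in range(iters):
        q = e_step(x, theta)    # E step: q(y_n) = p(y_n | x_n, theta)
        theta = m_step(x, q)    # M step: theta = arg max_theta sum_n sum_{y_n} q(y_n) log p(x_n, y_n | theta)
    return theta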

Naive Bayes: revisited. Consider the binary prediction problem. Let the data be distributed according to a probability distribution p_θ(y, x) = p_θ(y, x_1, x_2, ..., x_D). We can simplify this using the chain rule of probability:

p_θ(y, x) = p_θ(y) p_θ(x_1 | y) p_θ(x_2 | x_1, y) ... p_θ(x_D | x_1, x_2, ..., x_{D-1}, y) = p_θ(y) ∏_{d=1}^D p_θ(x_d | x_1, x_2, ..., x_{d-1}, y)

Naive Bayes assumption: p_θ(x_d | x_{d'}, y) = p_θ(x_d | y) for all d' ≠ d. E.g., the words "free" and "money" are independent given spam.

Naive Bayes: a simple case. Case: binary labels and binary features. Probability of the data:

p_θ(y) = Bernoulli(θ_0)
p_θ(x_d | y = +1) = Bernoulli(θ⁺_d)
p_θ(x_d | y = -1) = Bernoulli(θ⁻_d)

p_θ(y, x) = p_θ(y) ∏_{d=1}^D p_θ(x_d | y)            // 1 + 2D parameters
          = θ_0^{[y=+1]} (1-θ_0)^{[y=-1]}
            × ∏_{d=1}^D (θ⁺_d)^{[x_d=1, y=+1]} (1-θ⁺_d)^{[x_d=0, y=+1]}            // label +1
            × ∏_{d=1}^D (θ⁻_d)^{[x_d=1, y=-1]} (1-θ⁻_d)^{[x_d=0, y=-1]}            // label -1
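
A small sketch of this parameterization for a single binary example (variable names are mine); this log joint probability is the quantity the E step will normalize later.

import numpy as np

def nb_log_joint(x, y, theta0, theta_pos, theta_neg):
    """log p(y, x) for a label y in {+1, -1} and a binary feature vector x in {0, 1}^D."""
    if y == +1:
        prior, td = theta0, theta_pos
    else:
        prior, td = 1.0 - theta0, theta_neg
    return np.log(prior) + np.sum(x * np.log(td) + (1 - x) * np.log(1 - td))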

Naive Bayes: parameter estimation. Given labelled data we can estimate the parameters by maximizing the data likelihood. The maximum likelihood estimates are:

θ̂_0 = Σ_n [y_n = +1] / N            // fraction of the data with label +1
θ̂⁺_d = Σ_n [x_{d,n} = 1, y_n = +1] / Σ_n [y_n = +1]            // fraction of the instances with x_d = 1 among the +1 examples
θ̂⁻_d = Σ_n [x_{d,n} = 1, y_n = -1] / Σ_n [y_n = -1]            // fraction of the instances with x_d = 1 among the -1 examples
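
A minimal sketch of these counts, assuming X is an N×D binary numpy array and y is an N-vector with entries ±1 (variable names are mine).

def nb_mle(X, y):
    """ML estimates for the binary naive Bayes model from labelled data."""
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()                                  # fraction of the data with label +1
    theta_pos = X[pos].mean(axis=0)                      # per-feature fraction of 1s among the +1 examples
    theta_neg = X[neg].mean(axis=0)                      # per-feature fraction of 1s among the -1 examples
    return theta0, theta_pos, theta_neg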

Naive Bayes: EM. Now suppose you don't have the labels y_n. Initialize the parameters θ randomly.

E step: compute the distribution over the hidden variables,
q(y_n = +1) = p(y_n = +1 | x_n, θ) ∝ θ_0 ∏_{d=1}^D (θ⁺_d)^{[x_{d,n}=1]} (1-θ⁺_d)^{[x_{d,n}=0]}

M step: estimate θ given the guesses,
θ_0 = Σ_n q(y_n = +1) / N            // fraction of the data with label +1
θ⁺_d = Σ_n [x_{d,n} = 1] q(y_n = +1) / Σ_n q(y_n = +1)            // fraction of the instances with x_d = 1 among the fractional +1 examples
θ⁻_d = Σ_n [x_{d,n} = 1] q(y_n = -1) / Σ_n q(y_n = -1)            // fraction of the instances with x_d = 1 among the fractional -1 examples
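
A minimal sketch of one EM iteration for this model (my own code; a practical version would add smoothing for zero counts and iterate until the likelihood stops improving).

import numpy as np

def nb_em_step(X, theta0, theta_pos, theta_neg):
    """One E step + M step of EM for binary naive Bayes with unobserved labels."""
    # E step: q(y_n = +1) ∝ p(y_n = +1) p(x_n | y_n = +1), and similarly for y_n = -1
    log_pos = np.log(theta0) + X @ np.log(theta_pos) + (1 - X) @ np.log(1 - theta_pos)
    log_neg = np.log(1 - theta0) + X @ np.log(theta_neg) + (1 - X) @ np.log(1 - theta_neg)
    q_pos = 1.0 / (1.0 + np.exp(log_neg - log_pos))      # normalize the two unnormalized posteriors
    q_neg = 1.0 - q_pos
    # M step: the same counts as the supervised MLE, but with fractional (soft) labels
    theta0 = q_pos.mean()
    theta_pos = (X * q_pos[:, None]).sum(axis=0) / q_pos.sum()
    theta_neg = (X * q_neg[:, None]).sum(axis=0) / q_neg.sum()
    return theta0, theta_pos, theta_neg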

Summary. Expectation maximization is a general technique to estimate the parameters of probabilistic models when some observations are hidden. EM iterates between estimating the hidden variables and optimizing the parameters given the hidden variables. EM can be seen as maximizing a lower bound on the data log-likelihood; we used Jensen's inequality to switch the log-sum to a sum-log. EM can be used for learning: mixtures of distributions for clustering, e.g. GMMs; parameters of hidden Markov models (next lecture); topic models in NLP; probabilistic PCA.

Slides credit. Some of the slides are based on the CIML book by Hal Daume III. The figure for the EM lower bound is based on https://cxwangyi.wordpress.com/2008/11/. The clustering k-means vs. GMM example is from http://nbviewer.ipython.org/github/nicta/mlss/tree/master/clustering/