Expectation maximization

Subhransu Maji
CMPSCI 689: Machine Learning
14 April 2015

Motivation
Suppose you are building a naive Bayes spam classifier. After you are done, your boss tells you that there is no money to label the data. You have a probabilistic model that assumes labelled data, but you don't have any labels. Can you still do something? Amazingly, you can: treat the labels as hidden variables and try to learn them simultaneously along with the parameters of the model. Expectation Maximization (EM) is a broad family of algorithms for solving hidden variable problems. In today's lecture we will derive EM algorithms for clustering and for naive Bayes classification, and learn why EM works.

Gaussian mixture model for clustering
Suppose the data comes from a Gaussian Mixture Model (GMM): there are K clusters, and the data from cluster k is drawn from a Gaussian with mean μ_k and variance σ_k². We will assume that the data comes with labels (we will soon remove this assumption). Generative story of the data: for each example n = 1, 2, ..., N,
  choose a label y_n ~ Mult(θ_1, θ_2, ..., θ_K)
  choose an example x_n ~ N(μ_{y_n}, σ_{y_n}²)
Likelihood of the data:
  p(D) = Π_n p(y_n) p(x_n | y_n) = Π_n θ_{y_n} N(x_n; μ_{y_n}, σ_{y_n}²)
       = Π_n θ_{y_n} (2π σ_{y_n}²)^(−D/2) exp( −||x_n − μ_{y_n}||² / (2 σ_{y_n}²) )

GMM: known labels
With the likelihood above, if you knew the labels y_n then the maximum-likelihood estimates of the parameters are easy:
  θ_k = (1/N) Σ_n [y_n = k]                               (fraction of examples with label k)
  μ_k = Σ_n [y_n = k] x_n / Σ_n [y_n = k]                 (mean of all the examples with label k)
  σ_k² = Σ_n [y_n = k] ||x_n − μ_k||² / Σ_n [y_n = k]     (variance of all the examples with label k)
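To make the known-labels case concrete, here is a minimal NumPy sketch (not from the original slides) of the estimates above, assuming one shared spherical variance per cluster and integer labels 0, ..., K−1; the function name gmm_mle_known_labels and its interface are invented for illustration.

```python
import numpy as np

def gmm_mle_known_labels(X, y, K):
    """ML estimates for a spherical GMM when the labels are observed.

    X : (N, D) data matrix, y : (N,) integer labels in {0, ..., K-1}.
    Returns mixing weights theta (K,), means mu (K, D), variances sigma2 (K,).
    """
    N, D = X.shape
    theta = np.zeros(K)
    mu = np.zeros((K, D))
    sigma2 = np.zeros(K)
    for k in range(K):
        mask = (y == k)                 # indicator [y_n = k]
        Nk = mask.sum()
        theta[k] = Nk / N               # fraction of examples with label k
        mu[k] = X[mask].mean(axis=0)    # mean of the examples in cluster k
        # variance as written on the slide; the exact MLE of the isotropic
        # Gaussian above would additionally divide by the dimension D
        sigma2[k] = ((X[mask] - mu[k]) ** 2).sum() / Nk
    return theta, mu, sigma2
```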

GMM: unknown labels
Now suppose you didn't have the labels y_n. Analogous to k-means, one solution is to iterate: start by guessing the parameters, then repeat two steps — estimate the labels given the parameters, and estimate the parameters given the labels. In k-means we assigned each point to a single cluster, also called a hard assignment ("point 10 goes to cluster 2"). In expectation maximization (EM) we will use a soft assignment ("point 10 goes half to cluster 2 and half to cluster 5"). Let's define a random variable z_n = [z_n1, z_n2, ..., z_nK] to denote the assignment vector for the nth point.
  Hard assignment: only one of the z_nk is 1, the rest are 0.
  Soft assignment: the z_nk are positive and sum to 1.
Formally, z_nk is the probability that the nth point goes to cluster k.

GMM: parameter estimation
  z_nk = p(y_n = k | x_n) = p(y_n = k, x_n) / p(x_n) ∝ p(y_n = k) p(x_n | y_n = k) = θ_k N(x_n; μ_k, σ_k²)
Given a set of parameters (θ_k, μ_k, σ_k²), z_nk is easy to compute. Given the z_nk, we can update the parameters as:
  θ_k = (1/N) Σ_n z_nk                           (fraction of examples with label k)
  μ_k = Σ_n z_nk x_n / Σ_n z_nk                  (mean of the fractional examples with label k)
  σ_k² = Σ_n z_nk ||x_n − μ_k||² / Σ_n z_nk      (variance of the fractional examples with label k)
We have replaced the indicator variable [y_n = k] with p(y_n = k), which is the expectation of [y_n = k]. This is our guess of the labels. Just like k-means, EM is susceptible to local minima.

GMM: example
Clustering example, k-means vs. GMM: http://nbviewer.ipython.org/github/nicta/mlss/tree/master/clustering/
(A code sketch of these updates follows below.)

The EM framework
We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ. The likelihood of the data and the hidden variables is Π_n p(x_n, y_n | θ). Only the x_n are known, so we compute the data likelihood by marginalizing out the y_n:
  p(D) = Π_n p(x_n | θ) = Π_n Σ_{y_n} p(x_n, y_n | θ)
Parameter estimation by maximizing the log-likelihood:
  θ_ML = argmax_θ Σ_n log Σ_{y_n} p(x_n, y_n | θ)
This is hard to maximize since the sum is inside the log.
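The code sketch referenced above: a minimal, hypothetical EM loop for a spherical GMM following the soft-assignment updates from the GMM parameter-estimation slide (my reconstruction, not code from the lecture). It normalizes the responsibilities in log space for stability and omits convergence checks.

```python
import numpy as np

def gmm_em(X, K, n_iters=100, seed=0):
    """Minimal EM for a spherical GMM; X is (N, D). Sketch only."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # init means at random points
    sigma2 = np.full(K, X.var())

    for _ in range(n_iters):
        # E step: z[n, k] = p(y_n = k | x_n) ∝ theta_k * N(x_n; mu_k, sigma2_k)
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # (N, K)
        log_p = np.log(theta) - 0.5 * D * np.log(2 * np.pi * sigma2) - sq_dist / (2 * sigma2)
        z = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)

        # M step: weighted versions of the known-label estimates
        Nk = z.sum(axis=0)                    # effective count per cluster
        theta = Nk / N
        mu = (z.T @ X) / Nk[:, None]
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (z * sq_dist).sum(axis=0) / Nk   # variance update as on the slide
    return theta, mu, sigma2, z
```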

Jensen's inequality
Given a concave function f and a set of weights λ_i ≥ 0 with Σ_i λ_i = 1, Jensen's inequality states that f(Σ_i λ_i x_i) ≥ Σ_i λ_i f(x_i). This is a direct consequence of concavity: f(ax + by) ≥ a f(x) + b f(y) when a ≥ 0, b ≥ 0, a + b = 1.

The EM framework
Construct a lower bound on the log-likelihood using Jensen's inequality:
  L(θ) = Σ_n log Σ_{y_n} p(x_n, y_n | θ)
       = Σ_n log Σ_{y_n} q(y_n) [ p(x_n, y_n | θ) / q(y_n) ]
       ≥ Σ_n Σ_{y_n} q(y_n) log [ p(x_n, y_n | θ) / q(y_n) ]          (Jensen's inequality, with weights λ = q(y_n))
       = Σ_n Σ_{y_n} [ q(y_n) log p(x_n, y_n | θ) − q(y_n) log q(y_n) ]
       ≜ L̂(θ)
Maximize the lower bound; the second term is independent of θ:
  argmax_θ Σ_n Σ_{y_n} q(y_n) log p(x_n, y_n | θ)

Lower bound illustrated
[Figure: L(θ) and the lower bounds L̂ at θ_t and θ_{t+1}.] Maximizing the lower bound increases the value of the original function provided the lower bound touches the function at the current value: if L̂(θ_t) = L(θ_t) and θ_{t+1} = argmax_θ L̂(θ), then L(θ_{t+1}) ≥ L̂(θ_{t+1}) ≥ L̂(θ_t) = L(θ_t).

An optimal lower bound
Any choice of the probability distribution q(y) is valid as long as the lower bound touches the function at the current estimate of θ. We can then pick the optimal q(y) by maximizing the lower bound:
  argmax_q Σ_n Σ_{y_n} [ q(y_n) log p(x_n, y_n | θ) − q(y_n) log q(y_n) ]
This gives q(y_n) ∝ p(y_n | x_n, θ_t), so that L(θ_t) = L̂(θ_t). (Proof: use Lagrange multipliers with a sum-to-one constraint.) This is the distribution of the hidden variables conditioned on the data and the current estimate of the parameters, which is exactly what we computed in the GMM example.
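The slide only points at the proof; below is a brief reconstruction of the Lagrange-multiplier argument it alludes to, written for a single example x with the subscript n dropped.

```latex
% Maximize the lower bound over q subject to \sum_y q(y) = 1.
\begin{align*}
\Lambda(q, \lambda) &= \sum_{y} \big[\, q(y)\log p(x, y \mid \theta) - q(y)\log q(y) \,\big]
                       + \lambda \Big( \textstyle\sum_{y} q(y) - 1 \Big) \\
\frac{\partial \Lambda}{\partial q(y)} &= \log p(x, y \mid \theta) - \log q(y) - 1 + \lambda = 0
  \;\Longrightarrow\; q(y) \propto p(x, y \mid \theta) \\
\text{normalizing:}\quad q(y) &= \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}
  = p(y \mid x, \theta).
\end{align*}
% With this choice the bound is tight at the current parameters:
% \sum_y q(y) \log \tfrac{p(x, y \mid \theta)}{q(y)}
%   = \sum_y p(y \mid x, \theta) \log p(x \mid \theta) = \log p(x \mid \theta).
```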

The EM algorithm
We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ of the distribution p(x | θ).
  Initialize the parameters θ randomly.
  Iterate between the following two steps:
    E step: compute the probability distribution over the hidden variables, q(y_n) ∝ p(y_n | x_n, θ)
    M step: maximize the lower bound, θ ← argmax_θ Σ_n Σ_{y_n} q(y_n) log p(x_n, y_n | θ)
The EM algorithm is a great candidate when the M step can be done easily but p(x | θ) cannot be easily optimized over θ. For example, for GMMs it was easy to compute the means and variances given the memberships.

Naive Bayes: revisited
Consider the binary prediction problem. Let the data be distributed according to a probability distribution p_θ(y, x) = p_θ(y, x_1, x_2, ..., x_D). We can simplify this using the chain rule of probability:
  p_θ(y, x) = p_θ(y) p_θ(x_1 | y) p_θ(x_2 | x_1, y) ... p_θ(x_D | x_1, x_2, ..., x_{D−1}, y)
Naive Bayes assumption:
  p_θ(x_d | x_{d'}, y) = p_θ(x_d | y) for all d' ≠ d
E.g., the words "free" and "money" are independent given spam.

Naive Bayes: a simple case
Case: binary labels and binary features. Probability of the data (1 + 2D parameters):
  p_θ(y) = Bernoulli(θ_0)
  p_θ(x_d | y = +1) = Bernoulli(θ_d^+)
  p_θ(x_d | y = −1) = Bernoulli(θ_d^−)
  p_θ(y, x) = p_θ(y) Π_d p_θ(x_d | y)
            = θ_0^[y=+1] (1 − θ_0)^[y=−1]
              × Π_d (θ_d^+)^[x_d=1, y=+1] (1 − θ_d^+)^[x_d=0, y=+1]      // label +1
              × Π_d (θ_d^−)^[x_d=1, y=−1] (1 − θ_d^−)^[x_d=0, y=−1]      // label −1

Naive Bayes: parameter estimation
Given data we can estimate the parameters by maximizing the data likelihood. The maximum-likelihood estimates are:
  θ̂_0 = (1/N) Σ_n [y_n = +1]                                   (fraction of the data with label +1)
  θ̂_d^+ = Σ_n [x_{n,d} = 1, y_n = +1] / Σ_n [y_n = +1]          (fraction of the instances with x_d = 1 among the +1 examples)
  θ̂_d^− = Σ_n [x_{n,d} = 1, y_n = −1] / Σ_n [y_n = −1]          (fraction of the instances with x_d = 1 among the −1 examples)
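As a concrete counterpart to these estimates, here is a short NumPy sketch (illustrative only, not the course's code) that computes θ̂_0, θ̂_d^+, θ̂_d^− from a binary feature matrix; the names bernoulli_nb_mle, theta_plus, theta_minus are my own.

```python
import numpy as np

def bernoulli_nb_mle(X, y):
    """ML estimates for the binary naive Bayes model above.

    X : (N, D) binary feature matrix, y : (N,) labels in {+1, -1}.
    Returns theta0 = p(y = +1), theta_plus[d] = p(x_d = 1 | y = +1),
    theta_minus[d] = p(x_d = 1 | y = -1).
    No smoothing; in practice add pseudo-counts to avoid zero probabilities.
    """
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()                   # fraction of the data with label +1
    theta_plus = X[pos].mean(axis=0)      # fraction of x_d = 1 among the +1 examples
    theta_minus = X[neg].mean(axis=0)     # fraction of x_d = 1 among the -1 examples
    return theta0, theta_plus, theta_minus
```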

Naive Bayes: EM
Now suppose you don't have the labels y_n. Initialize the parameters θ randomly, then iterate:
  E step: compute the distribution over the hidden variables q(y_n),
    q(y_n = +1) = p(y_n = +1 | x_n, θ) ∝ θ_0 Π_d (θ_d^+)^[x_{n,d}=1] (1 − θ_d^+)^[x_{n,d}=0]
    (and similarly for y_n = −1, normalizing so the two probabilities sum to 1)
  M step: estimate θ given the guesses,
    θ_0 = (1/N) Σ_n q(y_n = +1)                                   (fraction of the data with label +1)
    θ_d^+ = Σ_n [x_{n,d} = 1] q(y_n = +1) / Σ_n q(y_n = +1)       (fraction of the instances with x_d = 1 among the +1 examples)
    θ_d^− = Σ_n [x_{n,d} = 1] q(y_n = −1) / Σ_n q(y_n = −1)       (fraction of the instances with x_d = 1 among the −1 examples)
(A code sketch of these updates appears at the end of this transcript.)

Summary
Expectation maximization is a general technique to estimate the parameters of probabilistic models when some observations are hidden. EM iterates between estimating the hidden variables and optimizing the parameters given the hidden variables. EM can be seen as maximization of a lower bound on the data log-likelihood; we used Jensen's inequality to switch the log-sum to a sum-log. EM can be used for learning:
  mixtures of distributions for clustering, e.g. GMMs
  parameters of hidden Markov models (next lecture)
  topic models in NLP
  probabilistic PCA

Slides credit
Some of the slides are based on the CIML book by Hal Daumé III. The figure for the EM lower bound is based on https://cxwangyi.wordpress.com/2008/11/. The clustering (k-means vs. GMM) example is from http://nbviewer.ipython.org/github/nicta/mlss/tree/master/clustering/
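Referenced above from the naive Bayes EM slide: a minimal NumPy sketch of those E and M steps, assuming binary features and a hidden binary label. Function and variable names (bernoulli_nb_em, theta_plus, theta_minus) are invented for illustration; a practical version would add Laplace smoothing and a convergence test.

```python
import numpy as np

def bernoulli_nb_em(X, n_iters=50, seed=0):
    """EM for binary naive Bayes with hidden labels; X is (N, D) binary. Sketch only."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta0 = 0.5
    theta_plus = rng.uniform(0.25, 0.75, size=D)
    theta_minus = rng.uniform(0.25, 0.75, size=D)

    for _ in range(n_iters):
        # E step: q(y_n = +1) ∝ theta0 * prod_d theta_plus[d]^x_nd * (1 - theta_plus[d])^(1 - x_nd)
        log_pos = np.log(theta0) + X @ np.log(theta_plus) + (1 - X) @ np.log(1 - theta_plus)
        log_neg = np.log(1 - theta0) + X @ np.log(theta_minus) + (1 - X) @ np.log(1 - theta_minus)
        q = 1.0 / (1.0 + np.exp(log_neg - log_pos))      # posterior probability of y_n = +1

        # M step: fractional versions of the labelled-data estimates
        theta0 = q.mean()
        theta_plus = (q[:, None] * X).sum(axis=0) / q.sum()
        theta_minus = ((1 - q)[:, None] * X).sum(axis=0) / (1 - q).sum()
    return theta0, theta_plus, theta_minus, q
```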