Expectation Maximization

Expectation Maximization
Machine Learning CSE546
Carlos Guestrin, University of Washington, November 13

E.M.: The General Case
- E.M. is widely used beyond mixtures of Gaussians; the recipe is the same.
- Expectation step: fill in the missing data, given the current values of the parameters, $\theta^{(t)}$.
  - If variable z is missing (could be many variables), compute, for each data point $x_j$ and each value i of z: $P(z = i \mid x_j, \theta^{(t)})$
- Maximization step: find the maximum likelihood parameters for the (weighted) "completed" data.
  - For each data point $x_j$, create k weighted data points.
  - Set $\theta^{(t+1)}$ to the maximum likelihood parameter estimate for this weighted data.
- Repeat.
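The recipe above can be sketched for one concrete model. The block below is a minimal illustration, assuming a 1-D mixture of Gaussians; the function names (`e_step`, `m_step`, `em`), the toy data, and the variance floor are all assumptions made for this sketch, not part of the lecture.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, weights, means, variances):
    """Return Q[j][i] = P(z=i | x_j, theta) for every point j and component i."""
    Q = []
    for x in xs:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        Q.append([p / total for p in joint])
    return Q

def m_step(xs, Q, min_var=1e-6):
    """Weighted maximum likelihood estimates from the soft-completed data."""
    k, n = len(Q[0]), len(xs)
    weights, means, variances = [], [], []
    for i in range(k):
        n_i = sum(q[i] for q in Q)  # expected number of points in component i
        mean = sum(q[i] * x for q, x in zip(Q, xs)) / n_i
        var = sum(q[i] * (x - mean) ** 2 for q, x in zip(Q, xs)) / n_i
        weights.append(n_i / n)
        means.append(mean)
        variances.append(max(var, min_var))  # floor is an added safeguard (assumption)
    return weights, means, variances

def log_likelihood(xs, weights, means, variances):
    """Marginal log likelihood: sum_j log sum_i w_i N(x_j; mean_i, var_i)."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def em(xs, weights, means, variances, iterations=20):
    """Alternate E-step and M-step, recording the log likelihood after each round."""
    history = []
    for _ in range(iterations):
        Q = e_step(xs, weights, means, variances)
        weights, means, variances = m_step(xs, Q)
        history.append(log_likelihood(xs, weights, means, variances))
    return weights, means, variances, history

# Toy data: two well-separated groups (made up for illustration).
xs = [-2.1, -1.9, -2.0, -1.8, 2.0, 2.2, 1.9, 2.1]
weights, means, variances, history = em(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(means)
```

On this toy data the recorded log likelihood should never decrease from one iteration to the next, which is the behavior the convergence argument later in the lecture explains.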

The general learning problem with missing data
- Marginal likelihood (x is observed, z is missing): $\ell(\theta : \mathcal{D}) = \sum_j \log \sum_z P(x_j, z \mid \theta)$

E-step
- x is observed, z is missing.
- Compute the probability of the missing data given the current choice of $\theta$: $Q(z \mid x_j)$ for each $x_j$
  - e.g., the probability computed during the classification step
  - corresponds to the "classification step" in K-means

Jensen's inequality
- Theorem: $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$

Applying Jensen's inequality
- Use: $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$
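A sketch of how the inequality is typically applied to the marginal log likelihood. It assumes the notation $\ell(\theta : \mathcal{D})$ for the marginal log likelihood and an arbitrary distribution $Q(z \mid x_j)$ over the missing variable; both symbols are notational assumptions for this sketch. The idea is to multiply and divide by $Q$ inside the inner sum, then use the theorem with $P(z) := Q(z \mid x_j)$ and $f(z) := P(x_j, z \mid \theta) / Q(z \mid x_j)$:

```latex
\begin{align*}
\ell(\theta : \mathcal{D})
  &= \sum_j \log \sum_z P(x_j, z \mid \theta) \\
  &= \sum_j \log \sum_z Q(z \mid x_j)\, \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)} \\
  &\ge \sum_j \sum_z Q(z \mid x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}
\end{align*}
```

The right-hand side is a lower bound on the marginal log likelihood for any choice of $Q$.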

The M-step maximizes lower bound on weighted data
- Lower bound from Jensen's:
- Corresponds to weighted dataset:
  - <$x_1$, z=1> with weight $Q^{(t+1)}(z=1 \mid x_1)$
  - <$x_1$, z=2> with weight $Q^{(t+1)}(z=2 \mid x_1)$
  - <$x_1$, z=3> with weight $Q^{(t+1)}(z=3 \mid x_1)$
  - <$x_2$, z=1> with weight $Q^{(t+1)}(z=1 \mid x_2)$
  - <$x_2$, z=2> with weight $Q^{(t+1)}(z=2 \mid x_2)$
  - <$x_2$, z=3> with weight $Q^{(t+1)}(z=3 \mid x_2)$

The M-step
- Maximization step:
- Use expected counts instead of counts:
  - If learning requires Count(x,z), use $E_{Q^{(t+1)}}[\mathrm{Count}(x,z)]$
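The weighted dataset and the expected counts can be illustrated with a small sketch. The helper name `expected_counts`, the toy observations, and the table `Q` are hypothetical; the only idea taken from the slides is that each pair <x_j, z=i> contributes its weight Q(z=i | x_j) instead of a whole count.

```python
from collections import defaultdict

def expected_counts(data, Q):
    """Return E_Q[Count(x, z)]: each <x_j, z=i> pair adds its weight Q[j][i-1]."""
    counts = defaultdict(float)
    for x_j, q_j in zip(data, Q):
        for i, weight in enumerate(q_j, start=1):
            counts[(x_j, i)] += weight
    return dict(counts)

# Toy example (made up): three observations, z takes values 1..3.
data = ["a", "b", "a"]
Q = [
    [0.5, 0.25, 0.25],   # Q(z | x_1)
    [0.0, 1.0, 0.0],     # Q(z | x_2)
    [0.25, 0.25, 0.5],   # Q(z | x_3)
]
counts = expected_counts(data, Q)
print(counts[("a", 1)])  # 0.75
```

Because every row of `Q` sums to one, the expected counts add up to the number of data points, just as hard counts would.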

Convergence of EM
- Define potential function $F(\theta, Q)$:
- EM corresponds to coordinate ascent on F
  - Thus, maximizes lower bound on marginal log likelihood
  - We saw that the M-step corresponds to fixing Q and maximizing over $\theta$
  - E-step: fix $\theta$ and maximize over Q

M-step is easy
- Using potential function
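One standard way to write the potential function, offered as a sketch consistent with the lower bound obtained from Jensen's inequality. It assumes $H(\cdot)$ denotes entropy, $H(Q) = -\sum_z Q(z) \log Q(z)$, which is a notational assumption here:

```latex
\begin{align*}
F(\theta, Q)
  &= \sum_j \sum_z Q(z \mid x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)} \\
  &= \sum_j \sum_z Q(z \mid x_j) \log P(x_j, z \mid \theta)
     \;+\; \sum_j H\bigl(Q(\cdot \mid x_j)\bigr)
\end{align*}
```

With $Q$ fixed, the entropy term does not depend on $\theta$, so maximizing $F$ over $\theta$ reduces to maximizing the first term, a weighted complete-data log likelihood. That is why the M-step is easy.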

E-step also doesn't decrease potential function (1)
- Fixing $\theta$ to $\theta^{(t)}$:

KL-divergence
- Measures dissimilarity between distributions (it is not symmetric, so it is not a true distance metric)
- KL = zero if and only if Q = P
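For discrete distributions the KL-divergence is $KL(Q \,\|\, P) = \sum_z Q(z) \log \frac{Q(z)}{P(z)}$. A minimal sketch of that definition follows; the function name `kl_divergence` and the example distributions are assumptions for illustration.

```python
import math

def kl_divergence(q, p):
    """KL(Q || P) = sum_z Q(z) * log(Q(z) / P(z)), with 0 * log 0 taken as 0."""
    total = 0.0
    for qz, pz in zip(q, p):
        if qz > 0.0:
            if pz == 0.0:
                return math.inf  # Q puts mass where P has none
            total += qz * math.log(qz / pz)
    return total

q = [0.5, 0.5]
p = [0.9, 0.1]
print(kl_divergence(q, q))                          # 0.0, since Q = P
print(kl_divergence(q, p) > 0.0)                    # True
print(kl_divergence(q, p) == kl_divergence(p, q))   # False: KL is not symmetric
```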

E-step also doesn't decrease potential function (2)
- Fixing $\theta$ to $\theta^{(t)}$:

E-step also doesn't decrease potential function (3)
- Fixing $\theta$ to $\theta^{(t)}$
- Maximizing $F(\theta^{(t)}, Q)$ over Q: set Q to the posterior probability:
- Note that
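The argument on these slides can be sketched as follows, assuming the potential function has the usual form $F(\theta, Q) = \sum_j \sum_z Q(z \mid x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}$ and writing $\ell(\theta : \mathcal{D})$ for the marginal log likelihood (both are assumptions about notation). Factor the joint as $P(x_j, z \mid \theta) = P(z \mid x_j, \theta)\, P(x_j \mid \theta)$:

```latex
\begin{align*}
F(\theta^{(t)}, Q)
  &= \sum_j \sum_z Q(z \mid x_j)
     \log \frac{P(z \mid x_j, \theta^{(t)})\, P(x_j \mid \theta^{(t)})}{Q(z \mid x_j)} \\
  &= \sum_j \log P(x_j \mid \theta^{(t)})
     \;-\; \sum_j KL\bigl(Q(\cdot \mid x_j) \,\big\|\, P(\cdot \mid x_j, \theta^{(t)})\bigr) \\
  &= \ell(\theta^{(t)} : \mathcal{D})
     \;-\; \sum_j KL\bigl(Q(\cdot \mid x_j) \,\big\|\, P(\cdot \mid x_j, \theta^{(t)})\bigr)
\end{align*}
```

Since each KL term is nonnegative and equals zero only when the two distributions match, $F(\theta^{(t)}, Q)$ is maximized by $Q^{(t+1)}(z \mid x_j) = P(z \mid x_j, \theta^{(t)})$, and at that choice $F(\theta^{(t)}, Q^{(t+1)}) = \ell(\theta^{(t)} : \mathcal{D})$.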

EM is coordinate ascent
- M-step: fix Q, maximize F over $\theta$ (a lower bound on the marginal log likelihood):
- E-step: fix $\theta$, maximize F over Q:
  - "Realigns" F with the likelihood:

What you should know
- K-means for clustering: the algorithm converges because it's coordinate ascent
- EM for mixture of Gaussians: how to "learn" maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data
- Be happy with this kind of probabilistic analysis
- Remember, E.M. can get stuck in local optima, and empirically it DOES
- EM is coordinate ascent
- General case for EM

Dimensionality Reduction: PCA
Machine Learning CSE546
Carlos Guestrin, University of Washington, November 13

Dimensionality reduction
- Input data may have thousands or millions of dimensions!
  - e.g., text data
- Dimensionality reduction: represent data with fewer dimensions
  - easier learning: fewer parameters
  - visualization: hard to visualize more than 3D or 4D
  - discover "intrinsic dimensionality" of data: high dimensional data that is truly lower dimensional

Lower dimensional projections
- Rather than picking a subset of the features, we can construct new features that are combinations of existing features
- Let's see this in the unsupervised setting: just X, but no Y

Linear projection and reconstruction
- Project into 1 dimension: $(x_1, x_2) \to z_1$
- Reconstruction: only know $z_1$, what was $(x_1, x_2)$?
[Figure: 2-D data with axes $x_1$ and $x_2$, projected onto a 1-D coordinate $z_1$]

Principal component analysis: basic idea
- Project n-dimensional data into k-dimensional space while preserving information:
  - e.g., project space of words into 3 dimensions
  - e.g., project 3-d into 2-d
- Choose projection with minimum reconstruction error

Linear projections, a review
- Project a point into a (lower dimensional) space:
  - point: $x = (x_1, \dots, x_d)$
  - select a basis: set of basis vectors $(u_1, \dots, u_k)$
    - we consider an orthonormal basis: $u_i \cdot u_i = 1$, and $u_i \cdot u_j = 0$ for $i \ne j$
  - select a center $\bar{x}$, which defines the offset of the space
  - best coordinates in the lower dimensional space are defined by dot-products: $(z_1, \dots, z_k)$, $z_i = (x - \bar{x}) \cdot u_i$
    - minimum squared error
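The projection and reconstruction steps reviewed above can be sketched with NumPy. The function names `project` and `reconstruct`, the example point, the center, and the single basis vector are assumptions chosen for illustration.

```python
import numpy as np

def project(x, center, basis):
    """Coordinates z_i = (x - center) . u_i, one per row u_i of `basis`."""
    return basis @ (x - center)

def reconstruct(z, center, basis):
    """Approximate x from its coordinates: center + sum_i z_i * u_i."""
    return center + basis.T @ z

center = np.array([1.0, 1.0])
basis = np.array([[1.0, 1.0]]) / np.sqrt(2.0)  # one unit-length direction (k = 1)
x = np.array([3.0, 2.0])

z = project(x, center, basis)
x_hat = reconstruct(z, center, basis)
print(z, x_hat)
```

The residual `x - x_hat` is orthogonal to the basis vector, which is what makes the dot-product coordinates the minimum squared error choice for a fixed orthonormal basis.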

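PCA finds projection that minimizes reconstruction error
- Given N data points: $x^i = (x_1^i, \dots, x_d^i)$, i = 1…N
- Will represent each point as a projection: $\hat{x}^i = \bar{x} + \sum_{j=1}^{k} z_j^i u_j$, where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x^i$ and $z_j^i = (x^i - \bar{x}) \cdot u_j$
- PCA: given k << d, find $(u_1, \dots, u_k)$ minimizing the reconstruction error: $\mathrm{error}_k = \sum_{i=1}^{N} \| x^i - \hat{x}^i \|^2$
[Figure: data points plotted against axes $x_1$ and $x_2$]

Understanding the reconstruction error
- Note that $x^i$ can be represented exactly by a d-dimensional projection: $x^i = \bar{x} + \sum_{j=1}^{d} z_j^i u_j$
- Given k << d, find $(u_1, \dots, u_k)$ minimizing the reconstruction error: $\mathrm{error}_k = \sum_{i=1}^{N} \| x^i - \hat{x}^i \|^2$
- Rewriting error: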

Reconstruction error and covariance matrix

Minimizing reconstruction error and eigenvectors
- Minimizing reconstruction error is equivalent to picking an orthonormal basis $(u_1, \dots, u_d)$ minimizing:
- Eigenvector: $\Sigma u = \lambda u$
- Minimizing reconstruction error is equivalent to picking $(u_{k+1}, \dots, u_d)$ to be the eigenvectors with the smallest eigenvalues
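A sketch of the derivation these two slides outline, assuming the reconstruction $\hat{x}^i = \bar{x} + \sum_{j=1}^{k} z_j^i u_j$ with $z_j^i = (x^i - \bar{x}) \cdot u_j$, an orthonormal basis $(u_1, \dots, u_d)$, and the covariance matrix $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^T$. The symbol $\mathrm{error}_k$ is a name assumed here for the reconstruction error:

```latex
\begin{align*}
\mathrm{error}_k
  &= \sum_{i=1}^{N} \bigl\| x^i - \hat{x}^i \bigr\|^2
   = \sum_{i=1}^{N} \Bigl\| \sum_{j=k+1}^{d} z_j^i u_j \Bigr\|^2
   = \sum_{i=1}^{N} \sum_{j=k+1}^{d} \bigl( z_j^i \bigr)^2 \\
  &= \sum_{j=k+1}^{d} u_j^T
     \Bigl[ \sum_{i=1}^{N} (x^i - \bar{x})(x^i - \bar{x})^T \Bigr] u_j
   = N \sum_{j=k+1}^{d} u_j^T \Sigma\, u_j
\end{align*}
```

If each $u_j$ is an eigenvector of $\Sigma$ with eigenvalue $\lambda_j$, then $u_j^T \Sigma\, u_j = \lambda_j$ and $\mathrm{error}_k = N \sum_{j=k+1}^{d} \lambda_j$, so the discarded directions should be the ones with the smallest eigenvalues.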

Basic PCA algorithm
- Start from m by n data matrix X (one row per data point, so m = N)
- Recenter: subtract the mean from each row of X: $X_c \leftarrow X - \bar{X}$
- Compute covariance matrix: $\Sigma \leftarrow \frac{1}{N} X_c^T X_c$
- Find eigenvectors and eigenvalues of $\Sigma$
- Principal components: k eigenvectors with highest eigenvalues

PCA example
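The steps of the basic algorithm can be sketched with NumPy. The function name `pca_eig` and the synthetic data are assumptions for illustration; `numpy.linalg.eigh` is used because the covariance matrix is symmetric, and it returns eigenvalues in ascending order, so they are re-sorted here.

```python
import numpy as np

def pca_eig(X, k):
    """Return (mean, components, eigenvalues) for the top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                            # recenter: subtract the mean row
    cov = Xc.T @ Xc / X.shape[0]             # covariance matrix, n by n
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    return mean, eigvecs[:, order].T, eigvals[order]

# Synthetic data (made up): points close to the line x2 = 2 * x1.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t]) + 0.01 * rng.normal(size=(200, 2))

mean, components, eigvals = pca_eig(X, k=1)
print(components[0])  # close to +/- (1, 2) / sqrt(5)
```

The sign of an eigenvector is arbitrary, so the first component may point in either direction along the line.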

PCA example: reconstruction
- only used first principal component

Eigenfaces [Turk, Pentland 91]
- Input images:
- Principal components:

Eigenfaces reconstruction
- Each image corresponds to adding 8 principal components:

Scaling up
- Covariance matrix can be really big!
  - $\Sigma$ is d by d
  - even with a modest number of features, finding eigenvectors is very slow
- Use singular value decomposition (SVD)
  - finds top k eigenvectors
  - great implementations available, e.g., python, R, Matlab svd

SVD
- Write $X = W S V^T$
  - X: data matrix, one row per datapoint
  - W: weight matrix, one row per datapoint (coordinate of $x^i$ in eigenspace)
  - S: singular value matrix, diagonal; in our setting (centered data) each entry $\sigma_j$ is tied to an eigenvalue $\lambda_j$ of the covariance matrix by $\lambda_j = \sigma_j^2 / N$
  - $V^T$: singular vector matrix; in our setting each row is an eigenvector $v_j$

PCA using SVD algorithm
- Start from m by n data matrix X
- Recenter: subtract the mean from each row of X: $X_c \leftarrow X - \bar{X}$
- Call SVD algorithm on $X_c$: ask for k singular vectors
- Principal components: k singular vectors with highest singular values (rows of $V^T$)
  - Coefficients become:
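A minimal NumPy sketch of PCA through the SVD. The function name `pca_svd` and the synthetic data are assumptions for illustration. `numpy.linalg.svd` computes the full decomposition and returns singular values sorted from high to low, so the top k are simply the first k; a truncated solver that only computes k singular vectors would be the natural choice for large data.

```python
import numpy as np

def pca_svd(X, k):
    """Return (mean, components, coefficients, eigenvalues) using the SVD of centered X."""
    mean = X.mean(axis=0)
    Xc = X - mean                                       # recenter
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = W @ diag(s) @ Vt
    components = Vt[:k]                                  # rows of V^T with largest singular values
    coefficients = Xc @ components.T                     # same as W[:, :k] * s[:k]
    eigenvalues = s[:k] ** 2 / X.shape[0]                # eigenvalues of (1/N) Xc^T Xc
    return mean, components, coefficients, eigenvalues

# Synthetic data (made up): 50 points in 4 dimensions with unequal spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])

mean, components, coefficients, eigenvalues = pca_svd(X, k=2)

# Cross-check against the eigenvalues of the covariance matrix.
cov = (X - mean).T @ (X - mean) / X.shape[0]
top = np.sort(np.linalg.eigvalsh(cov))[::-1][:2]
print(np.allclose(eigenvalues, top))  # True
```

This route never forms the d by d covariance matrix, which is the point of the "Scaling up" slide.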

What you need to know
- Dimensionality reduction: why and when it's important
- Simple feature selection
- Principal component analysis
  - minimizing reconstruction error
  - relationship to covariance matrix and eigenvectors
  - using SVD
