Expectation Maximization


Machine Learning CSE546, Carlos Guestrin, University of Washington, November 13, 2014

E.M.: The General Case
E.M. is widely used beyond mixtures of Gaussians, and the recipe is the same.
Expectation step: fill in the missing data given the current values of the parameters, θ^(t). If variable z is missing (there could be many missing variables), compute, for each data point x_j and each value i of z: P(z=i | x_j, θ^(t)).
Maximization step: find the maximum likelihood parameters for the (weighted) "completed" data: for each data point x_j, create k weighted data points, and set θ^(t+1) to the maximum likelihood parameter estimate for this weighted data.
Repeat.
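
As a concrete instance of this recipe (not from the slides), here is a minimal numpy sketch of EM for a two-component 1-D Gaussian mixture; the component count, initialization, and fixed iteration budget are illustrative choices:

import numpy as np

def em_gmm_1d(x, n_iter=100):
    # Initialize theta^(0): mixing weights, means, variances.
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: for each point x_j and each component i, compute P(z=i | x_j, theta^(t)).
        lik = np.stack([pi[i] / np.sqrt(2 * np.pi * var[i])
                        * np.exp(-(x - mu[i]) ** 2 / (2 * var[i])) for i in range(2)], axis=1)
        q = lik / lik.sum(axis=1, keepdims=True)      # responsibilities, shape (N, 2)
        # M-step: maximum likelihood parameters for the weighted "completed" data.
        nk = q.sum(axis=0)
        pi = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Example usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))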

The General Learning Problem with Missing Data
Marginal likelihood, where x is observed and z is missing:
$\ell(\theta) = \sum_{j=1}^{N} \log \sum_z P(x_j, z \mid \theta)$

E-step
x is observed, z is missing. Compute the probability of the missing data given the current choice of θ:
$Q^{(t+1)}(z \mid x_j) = P(z \mid x_j, \theta^{(t)})$ for each x_j
e.g., this is the probability computed during the classification step, and it corresponds to the classification (assignment) step in K-means.

Jensen's Inequality
Theorem: $\log \sum_z P(z) f(z) \;\ge\; \sum_z P(z) \log f(z)$

Applying Jensen's Inequality
Use $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$ to lower-bound the marginal log likelihood.
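
Written out (the derivation itself is not in the transcript; this is the standard application), introduce any distribution Q(z | x_j) over the missing variable and apply the inequality inside each log:

$$\ell(\theta) \;=\; \sum_j \log \sum_z Q(z \mid x_j)\, \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)} \;\ge\; \sum_j \sum_z Q(z \mid x_j) \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}$$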

The M-step Maximizes a Lower Bound on Weighted Data
The lower bound from Jensen's inequality corresponds to a weighted dataset:
<x_1, z=1> with weight Q^(t+1)(z=1 | x_1)
<x_1, z=2> with weight Q^(t+1)(z=2 | x_1)
<x_1, z=3> with weight Q^(t+1)(z=3 | x_1)
<x_2, z=1> with weight Q^(t+1)(z=1 | x_2)
<x_2, z=2> with weight Q^(t+1)(z=2 | x_2)
<x_2, z=3> with weight Q^(t+1)(z=3 | x_2)

The M-step
Maximization step: use expected counts instead of counts. If learning requires Count(x, z), use E_{Q^(t+1)}[Count(x, z)].
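
In symbols (my reconstruction of the objective the slide refers to), the M-step maximizes the weighted complete-data log likelihood, which is the θ-dependent part of the Jensen bound:

$$\theta^{(t+1)} \;=\; \arg\max_{\theta} \sum_j \sum_i Q^{(t+1)}(z=i \mid x_j)\, \log P(x_j, z=i \mid \theta)$$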

Convergence of EM
Define a potential function F(θ, Q). EM corresponds to coordinate ascent on F, and thus maximizes a lower bound on the marginal log likelihood. We saw that the M-step corresponds to fixing Q and maximizing over θ; the E-step fixes θ and maximizes over Q.

The M-step Is Easy
Using the potential function, the M-step is exactly the weighted maximum likelihood problem above.
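
The transcript omits the formula; the standard definition of the potential function (matching the Jensen bound above) is:

$$F(\theta, Q) \;=\; \sum_j \sum_z Q(z \mid x_j)\, \log \frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}$$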

The E-step Also Doesn't Decrease the Potential Function (Part 1)
Fix θ to θ^(t) and consider F(θ^(t), Q) as a function of Q alone.

KL-Divergence
Measures a distance between distributions; KL = 0 if and only if Q = P.
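
The definition is not spelled out in the transcript; the standard one is:

$$KL(Q \,\|\, P) \;=\; \sum_z Q(z) \log \frac{Q(z)}{P(z)} \;\ge\; 0,$$

with equality if and only if Q = P.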

The E-step Also Doesn't Decrease the Potential Function (Part 2)
Fix θ to θ^(t).

The E-step Also Doesn't Decrease the Potential Function (Part 3)
Fixing θ to θ^(t), maximizing F(θ^(t), Q) over Q sets Q to the posterior probability:
$Q^{(t+1)}(z \mid x_j) = P(z \mid x_j, \theta^{(t)})$
Note that with this choice the bound is tight: $F(\theta^{(t)}, Q^{(t+1)}) = \ell(\theta^{(t)})$.
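
The algebra behind this (reconstructed here, since the slide bodies did not survive extraction) is the decomposition of F into the marginal log likelihood minus a KL term:

$$F(\theta, Q) \;=\; \sum_j \log P(x_j \mid \theta) \;-\; \sum_j KL\!\left(Q(z \mid x_j)\,\|\,P(z \mid x_j, \theta)\right)$$

For fixed θ = θ^(t), F is maximized over Q by driving the KL term to zero, i.e., by choosing the posterior.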

EM Is Coordinate Ascent
M-step: fix Q, maximize F over θ (a lower bound on the marginal log likelihood ℓ(θ)).
E-step: fix θ, maximize F over Q. This realigns F with the likelihood: F(θ^(t), Q^(t+1)) = ℓ(θ^(t)).

What You Should Know
K-means for clustering: the algorithm converges because it is coordinate ascent.
EM for mixtures of Gaussians: how to learn maximum likelihood parameters (a local maximum of the likelihood) in the case of unlabeled data.
Be comfortable with this kind of probabilistic analysis.
Remember, E.M. can get stuck in local optima, and empirically it DOES.
EM is coordinate ascent.
The general case of EM.

Dimensionality Reduction: PCA
Machine Learning CSE546, Carlos Guestrin, University of Washington, November 13, 2014

Dimensionality Reduction
Input data may have thousands or millions of dimensions (e.g., text data).
Dimensionality reduction: represent the data with fewer dimensions.
Easier learning: fewer parameters.
Visualization: hard to visualize more than 3D or 4D.
Discover the intrinsic dimensionality of the data: high-dimensional data that is truly lower dimensional.

Lower Dimensional Projections
Rather than picking a subset of the features, we can construct new features that are combinations of the existing features. Let's see this in the unsupervised setting: just X, but no Y.

Linear Projection and Reconstruction
Project the point (x_1, x_2) into one dimension, z_1. Reconstruction: knowing only z_1, what was (x_1, x_2)?
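
A minimal numpy sketch of this projection-and-reconstruction step (my own example; the direction u and the synthetic data are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated 2-D data
x_bar = X.mean(axis=0)
u = np.array([1.0, 0.5])
u = u / np.linalg.norm(u)            # unit-length projection direction

z1 = (X - x_bar) @ u                 # 1-D coordinate z_1 for every point
X_hat = x_bar + np.outer(z1, u)      # reconstruction from z_1 alone
print("mean squared reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))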

Principal Component Analysis: Basic Idea
Project d-dimensional data into a k-dimensional space while preserving information: e.g., project a space of 10000 words into 3 dimensions; e.g., project 3-D into 2-D. Choose the projection with minimum reconstruction error.

Linear Projections, a Review
Project a point into a (lower dimensional) space:
point: x = (x_1, ..., x_d)
select a basis: a set of basis vectors (u_1, ..., u_k); we consider an orthonormal basis: u_i · u_i = 1, and u_i · u_j = 0 for i ≠ j
select a center x̄, which defines the offset of the space
the best coordinates in the lower dimensional space (in the minimum squared error sense) are given by dot products: (z_1, ..., z_k), with z_i = (x − x̄) · u_i
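
Putting these pieces together (my summary, in the slide's notation), the point reconstructed from its k coordinates is:

$$\hat{x} \;=\; \bar{x} + \sum_{i=1}^{k} z_i\, u_i, \qquad z_i = (x - \bar{x}) \cdot u_i$$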

PCA Finds the Projection That Minimizes Reconstruction Error
Given N data points x_i = (x_{1i}, ..., x_{di}), i = 1...N, represent each point as a projection:

$$\hat{x}_i = \bar{x} + \sum_{j=1}^{k} z_{ij}\, u_j, \quad \text{where } z_{ij} = (x_i - \bar{x}) \cdot u_j \text{ and } \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

PCA: given k << d, find (u_1, ..., u_k) minimizing the reconstruction error

$$\text{error}_k = \sum_{i=1}^{N} \left\| x_i - \hat{x}_i \right\|^2$$

Understanding the Reconstruction Error
Note that x_i can be represented exactly by a d-dimensional projection:

$$x_i = \bar{x} + \sum_{j=1}^{d} z_{ij}\, u_j$$

Given k << d, find (u_1, ..., u_k) minimizing the reconstruction error, rewriting the error in terms of the discarded directions (see below).
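
The rewriting the slide refers to (a standard step, reconstructed here) uses the exact d-dimensional expansion and the orthonormality of the basis to express the error through the discarded directions:

$$\text{error}_k \;=\; \sum_{i=1}^{N} \left\| x_i - \hat{x}_i \right\|^2 \;=\; \sum_{i=1}^{N} \sum_{j=k+1}^{d} \left( u_j^\top (x_i - \bar{x}) \right)^2$$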

Reconstruction Error and the Covariance Matrix
With the empirical covariance matrix

$$\Sigma \;=\; \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top,$$

the error becomes $\text{error}_k = N \sum_{j=k+1}^{d} u_j^\top \Sigma\, u_j$.

Minimizing Reconstruction Error and Eigenvectors
Minimizing the reconstruction error is equivalent to picking an orthonormal basis (u_1, ..., u_d) minimizing $\sum_{j=k+1}^{d} u_j^\top \Sigma\, u_j$.
Eigenvector: $\Sigma u = \lambda u$.
Minimizing the reconstruction error is equivalent to picking (u_{k+1}, ..., u_d) to be the eigenvectors with the smallest eigenvalues.
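
Equivalently (a standard consequence, not spelled out in the transcript): if λ_1 ≥ ... ≥ λ_d are the eigenvalues of Σ, then

$$\text{error}_k \;=\; N \sum_{j=k+1}^{d} \lambda_j,$$

so the optimal (u_1, ..., u_k) are the k eigenvectors with the largest eigenvalues.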

Basic PCA Algorithm
Start from the N-by-d data matrix X (one row per data point).
Recenter: subtract the mean from each row of X: X_c ← X − X̄.
Compute the covariance matrix: Σ ← (1/N) X_c^T X_c.
Find the eigenvectors and eigenvalues of Σ.
Principal components: the k eigenvectors with the highest eigenvalues.

PCA Example
(Figure in the slides.)
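
A minimal numpy sketch of this algorithm (my own code, following the steps above; np.linalg.eigh is used because Σ is symmetric):

import numpy as np

def pca_eig(X, k):
    """Basic PCA: recenter, form the covariance matrix, keep the top-k eigenvectors."""
    X_c = X - X.mean(axis=0)                  # recenter: subtract the mean from each row
    Sigma = X_c.T @ X_c / X.shape[0]          # covariance matrix, d x d
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    U = eigvecs[:, order]                     # principal components, d x k
    Z = X_c @ U                               # coordinates of each point in the eigenspace
    return U, Z

# Example usage on random data with correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))
U, Z = pca_eig(X, k=2)
print(U.shape, Z.shape)   # (10, 2) (200, 2)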

PCA Example: Reconstruction
Reconstruction using only the first principal component (figure in the slides).

Eigenfaces [Turk, Pentland '91]
Input images and the resulting principal components ("eigenfaces") are shown in the slides.

Eigenfaces Reconstruction
Each image corresponds to adding 8 principal components to the reconstruction (figure in the slides).

Scaling Up
The covariance matrix can be really big: Σ is d by d. Even with only 10000 features, finding the eigenvectors is very slow. Use the singular value decomposition (SVD): it finds the top k eigenvectors, and great implementations are available, e.g., in Python, R, and Matlab (svd).

SVD
Write X = W S V^T (in PCA, applied to the recentered data X_c):
X: the data matrix, one row per data point.
W: the weight matrix, one row per data point, giving the coordinates of x_i in the eigenspace (up to scaling by the singular values).
S: the singular value matrix, diagonal in our setting; each singular value σ_j is related to an eigenvalue of Σ by λ_j = σ_j^2 / N.
V^T: the singular vector matrix; in our setting each row is an eigenvector v_j of Σ.

PCA Using SVD: Algorithm
Start from the N-by-d data matrix X (one row per data point).
Recenter: subtract the mean from each row of X: X_c ← X − X̄.
Call an SVD routine on X_c, asking for k singular vectors.
Principal components: the k singular vectors with the highest singular values (rows of V^T).
The coefficients become Z = W S, i.e., z_{ij} = σ_j w_{ij}.
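
A minimal numpy sketch of PCA via SVD (my own code; np.linalg.svd computes the full decomposition, with the truncation to k components done afterwards):

import numpy as np

def pca_svd(X, k):
    """PCA via the SVD of the recentered data matrix."""
    X_c = X - X.mean(axis=0)                    # recenter
    W, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    U = Vt[:k].T                                # principal components: top-k rows of V^T
    Z = W[:, :k] * s[:k]                        # coefficients Z = W S, restricted to top k
    eigvals = s[:k] ** 2 / X.shape[0]           # eigenvalues of the covariance matrix
    return U, Z, eigvals

# Example usage; agrees (up to sign) with the eigen-decomposition version above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))
U, Z, eigvals = pca_svd(X, k=2)
print(U.shape, Z.shape, eigvals)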

What You Need to Know
Dimensionality reduction: why and when it's important.
Simple feature selection.
Principal component analysis: minimizing reconstruction error, the relationship to the covariance matrix and its eigenvectors, and computing it via SVD.