ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning. Topics: (Finish) Expectation Maximization; Principal Component Analysis (PCA). Readings: Barber 15.1-15.4. Dhruv Batra, Virginia Tech

Administrativia. Poster Presentation: May 8, 1:30-4:00pm, 310 Kelly Hall (ICTAS Building). Print a poster (or a bunch of slides). Format: portrait; make one dimension = 36in (board size = 30x40in). Less text, more pictures. Best Project Prize! (C) Dhruv Batra 2

Administrativia. Final Exam. When: May 11, 7:45-9:45am. Where: in class. Format: pen-and-paper; open-book, open-notes, closed-internet; no sharing. What to expect: a mix of multiple-choice/true-false questions, "prove this statement", "what would happen for this dataset?". Material: everything, with a focus on the recent stuff. Exponentially decaying weights? Optimal policy? (C) Dhruv Batra 3

Recap of Last Time (C) Dhruv Batra 4

K-means
1. Ask the user how many clusters they'd like (e.g., k=5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!
(C) Dhruv Batra Slide Credit: Carlos Guestrin 5

K-means as Coordinate Descent. Optimize the objective function:

$$\min_{\mu_1,\dots,\mu_k}\; \min_{a_1,\dots,a_N} F(\mu, a) = \min_{\mu_1,\dots,\mu_k}\; \min_{a_1,\dots,a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij}\, \lVert x_i - \mu_j \rVert^2$$

Fix µ, optimize a (or C): the Assignment Step. Fix a (or C), optimize µ: the Recenter Step. (C) Dhruv Batra Slide Credit: Carlos Guestrin 6
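As a rough illustration, here is a minimal NumPy sketch of these two alternating coordinate-descent steps; the function name and parameters (kmeans, X, k, max_iters) are illustrative, not from the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly guess k cluster centers from among the data points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: fix mu, optimize a (each point picks its closest center).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x k squared distances
        a = dists.argmin(axis=1)
        # Recenter step: fix a, optimize mu (each center moves to the centroid of its points).
        new_mu = np.array([X[a == j].mean(axis=0) if np.any(a == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):   # terminated: assignments/centers stopped changing
            break
        mu = new_mu
    return mu, a
```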

GMM. [Figure omitted; only its axis ticks survived transcription.] (C) Dhruv Batra Figure Credit: Kevin Murphy 7

K-means vs GMM. K-Means: http://home.deib.polimi.it/matteucc/clustering/tutorial_html/AppletKM.html GMM: http://www.socr.ucla.edu/applets.dir/mixtureem.html (C) Dhruv Batra 8

EM: Expectation Maximization [Dempster et al. 77]. Often looks like "soft" K-means. Extremely general, extremely useful algorithm; essentially THE go-to algorithm for unsupervised learning. Plan: EM for learning GMM parameters; EM for general unsupervised learning problems. (C) Dhruv Batra 9

EM for Learning GMMs: Simple Update Rules. E-Step: estimate $P(z_i = j \mid x_i)$. M-Step: maximize the full likelihood weighted by the posterior. (C) Dhruv Batra 10
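A minimal sketch of one round of these update rules, assuming NumPy/SciPy and a GMM with spherical covariances; all names (em_step, pi, mu, sigma2, resp) are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, sigma2):
    """One EM iteration for a spherical-covariance GMM.
    X: (N, D) data; pi: (k,) mixing weights; mu: (k, D) means; sigma2: (k,) variances."""
    N, D = X.shape
    k = len(pi)
    # E-step: estimate responsibilities P(z_i = j | x_i).
    resp = np.zeros((N, k))
    for j in range(k):
        resp[:, j] = pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma2[j] * np.eye(D))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximize the complete-data likelihood weighted by the posterior.
    Nj = resp.sum(axis=0)                       # effective count per component
    pi = Nj / N
    mu = (resp.T @ X) / Nj[:, None]
    sigma2 = np.array([(resp[:, j] * ((X - mu[j]) ** 2).sum(axis=1)).sum() / (D * Nj[j])
                       for j in range(k)])
    return pi, mu, sigma2
```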

Gaussian Mixture Example: Start (C) Dhruv Batra Slide Credit: Carlos Guestrin 11

After 1st iteration (C) Dhruv Batra Slide Credit: Carlos Guestrin 12

After 2nd iteration (C) Dhruv Batra Slide Credit: Carlos Guestrin 13

After 3rd iteration (C) Dhruv Batra Slide Credit: Carlos Guestrin 14

After 4th iteration (C) Dhruv Batra Slide Credit: Carlos Guestrin 15

After 5th iteration (C) Dhruv Batra Slide Credit: Carlos Guestrin 16

After 6th iteration (C) Dhruv Batra Slide Credit: Carlos Guestrin 17

After 20th iteration (C) Dhruv Batra Slide Credit: Carlos Guestrin 18

General Mixture Models

P(x|z)        P(z)          Name
Gaussian      Categorical   GMM
Multinomial   Categorical   Mixture of Multinomials
Categorical   Dirichlet     Latent Dirichlet Allocation

(C) Dhruv Batra 19

Mixture of Bernoullis. [Figure omitted; only its numeric labels (0.12 0.14 0.12 0.06 0.13 0.07 0.05 0.15 0.07 0.09) survived transcription.] (C) Dhruv Batra 20

The general learning problem with missing data. Marginal likelihood: x is observed, z is missing.

$$\ell\ell(\theta : D) = \log \prod_{i=1}^{N} P(x_i \mid \theta) = \sum_{i=1}^{N} \log P(x_i \mid \theta) = \sum_{i=1}^{N} \log \sum_{z} P(x_i, z \mid \theta)$$

(C) Dhruv Batra 21

Applying Jensen's inequality. Use: $\log \sum_z P(z) f(z) \ge \sum_z P(z) \log f(z)$. Applied to the marginal log likelihood, for any distributions $Q_i(z)$:

$$\ell\ell(\theta : D) = \sum_{i=1}^{N} \log \sum_{z} Q_i(z)\, \frac{P(x_i, z \mid \theta)}{Q_i(z)} \;\ge\; \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(x_i, z \mid \theta)}{Q_i(z)}$$

(C) Dhruv Batra 22

Convergence of EM. Define the potential function F(θ, Q):

$$\ell\ell(\theta : D) \;\ge\; F(\theta, Q_i) = \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(x_i, z \mid \theta)}{Q_i(z)}$$

EM corresponds to coordinate ascent on F, and thus maximizes a lower bound on the marginal log likelihood. (C) Dhruv Batra 23

EM is coordinate ascent. E-step: fix θ^(t), maximize F over Q:

$$F(\theta^{(t)}, Q_i) = \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(x_i, z \mid \theta^{(t)})}{Q_i(z)} = \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(z \mid x_i, \theta^{(t)})\, P(x_i \mid \theta^{(t)})}{Q_i(z)}$$

$$= \sum_{i=1}^{N} \sum_{z} Q_i(z) \log P(x_i \mid \theta^{(t)}) + \sum_{i=1}^{N} \sum_{z} Q_i(z) \log \frac{P(z \mid x_i, \theta^{(t)})}{Q_i(z)} = \ell\ell(\theta^{(t)} : D) - \sum_{i=1}^{N} \mathrm{KL}\!\left(Q_i(z) \,\Vert\, P(z \mid x_i, \theta^{(t)})\right)$$

This is maximized by setting $Q_i^{(t)}(z) = P(z \mid x_i, \theta^{(t)})$, which drives every KL term to zero. Realigns F with the likelihood: $F(\theta^{(t)}, Q^{(t)}) = \ell\ell(\theta^{(t)} : D)$. (C) Dhruv Batra 24
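As a quick sanity check of the claim that the E-step realigns F with the likelihood, the following hypothetical 1-D, two-component example verifies numerically that setting Q_i(z) = P(z | x_i, θ) makes F(θ, Q) equal the marginal log-likelihood; the toy data and parameter values are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

x = np.array([-1.0, 0.3, 2.5])                                  # toy observations
pi, mu, sd = np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([1.0, 1.0])

joint = pi * norm.pdf(x[:, None], loc=mu, scale=sd)             # P(x_i, z) for each i, z
ll = np.log(joint.sum(axis=1)).sum()                            # marginal log-likelihood
Q = joint / joint.sum(axis=1, keepdims=True)                    # posterior P(z | x_i)
F = (Q * np.log(joint / Q)).sum()                               # lower bound F(theta, Q)
print(np.isclose(F, ll))                                        # True: bound is tight
```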

EM is coordinate ascent. M-step: fix Q^(t), maximize F over θ:

$$F(\theta, Q_i^{(t)}) = \sum_{i=1}^{N} \sum_{z} Q_i^{(t)}(z) \log \frac{P(x_i, z \mid \theta)}{Q_i^{(t)}(z)} = \sum_{i=1}^{N} \sum_{z} Q_i^{(t)}(z) \log P(x_i, z \mid \theta) + \underbrace{\sum_{i=1}^{N} H\big(Q_i^{(t)}\big)}_{\text{constant in } \theta}$$

This corresponds to maximum likelihood estimation on a weighted dataset:
<x_1, z=1> with weight Q^(t)(z=1 | x_1)
<x_1, z=2> with weight Q^(t)(z=2 | x_1)
<x_1, z=3> with weight Q^(t)(z=3 | x_1)
<x_2, z=1> with weight Q^(t)(z=1 | x_2)
<x_2, z=2> with weight Q^(t)(z=2 | x_2)
<x_2, z=3> with weight Q^(t)(z=3 | x_2)
(C) Dhruv Batra 25
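To make the weighted-dataset view concrete, here is a minimal sketch of the resulting M-step for the mixture of Bernoullis from a few slides back: each pair <x_i, z=j> carries weight Q_i(z=j), and each component is fit by ordinary weighted maximum likelihood. Function and variable names are illustrative, not from the slides.

```python
import numpy as np

def m_step_bernoulli(X, Q):
    """X: (N, D) binary data; Q: (N, k) posteriors Q_i(z=j)."""
    N, k = Q.shape
    pi = Q.sum(axis=0) / N                         # mixing weights
    # Weighted MLE of each component's Bernoulli means:
    # theta[j, d] = sum_i Q_i(j) * x[i, d] / sum_i Q_i(j)
    theta = (Q.T @ X) / Q.sum(axis=0)[:, None]
    return pi, theta
```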

EM Intuition. [Figure: successive lower bounds Q(θ, θ^t) and Q(θ, θ^(t+1)) touching the log-likelihood l(θ) at θ^t and θ^(t+1); maximizing each bound moves θ^t → θ^(t+1) → θ^(t+2).] (C) Dhruv Batra 26

What you should know. K-means for clustering: the algorithm converges because it is coordinate descent on its objective. EM for mixture of Gaussians: how to learn maximum likelihood parameters (a local maximum of the likelihood) in the case of unlabeled data. EM is coordinate ascent. Remember, EM can get stuck in local optima, and empirically it DOES. The general case for EM. (C) Dhruv Batra Slide Credit: Carlos Guestrin 27

Tasks

Input   Task                        Output
x       Classification              y (discrete)
x       Regression                  y (continuous)
x       Clustering                  c (discrete ID)
x       Dimensionality Reduction    z (continuous)

(C) Dhruv Batra 28

New Topic: PCA

Synonyms: Principal Component Analysis, Karhunen-Loève transform, Eigen-Faces, Eigen-<Insert-your-problem-domain>. PCA is a dimensionality reduction algorithm. Other dimensionality reduction algorithms: Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), Locally Linear Embedding (LLE). (C) Dhruv Batra 30

Dimensionality reduction. Input data may have thousands or millions of dimensions! E.g., images can have 5M pixels. Dimensionality reduction: represent data with fewer dimensions. Easier learning: fewer parameters. Visualization: hard to visualize more than 3D or 4D. Discover the intrinsic dimensionality of the data: high-dimensional data that is truly lower dimensional.

Dimensionality Reduction Demo http://lcn.epfl.ch/tutorial/english/pca/html/ http://setosa.io/ev/principal-component-analysis/ (C) Dhruv Batra 32

PCA / KL-Transform. De-correlation view: make features uncorrelated; no projection yet. Max-variance view: project data to lower dimensions; maximize variance in the lower dimensions. Synthesis / min-error view: project data to lower dimensions; minimize reconstruction error. All views lead to the same solution. (C) Dhruv Batra 33
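A minimal NumPy sketch of the max-variance view: center the data, eigendecompose the covariance, and project onto the top-d eigenvectors; the same directions also minimize reconstruction error (the synthesis view). Names (pca, X, d) are illustrative, not from the slides.

```python
import numpy as np

def pca(X, d):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)          # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]                     # top-d principal directions (max variance)
    Z = X_centered @ W                              # low-dimensional projection
    X_reconstructed = Z @ W.T + X.mean(axis=0)      # min-error reconstruction from Z
    return Z, W, X_reconstructed
```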