Unsupervised learning (part 1) Lecture 19

Unsupervised learning (part 1) Lecture 19. David Sontag, New York University. Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore.

Bayesian networks enable use of domain knowledge. Will my car start this morning? $p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{Pa(i)})$. Heckerman et al., Decision-Theoretic Troubleshooting, 1995.

Bayesian networks enable use of domain knowledge. What is the differential diagnosis? $p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{Pa(i)})$. Beinlich et al., The ALARM Monitoring System, 1989.
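To make the factorization concrete, here is a minimal sketch (not part of the original slides) with a hypothetical three-variable "will my car start" style network; all probability tables are made up.

```python
import numpy as np

# Hypothetical 3-node network over binary variables: Battery -> Starts <- Fuel
# p(b, f, s) = p(b) * p(f) * p(s | b, f)
p_battery = np.array([0.1, 0.9])            # p(Battery = 0), p(Battery = 1)
p_fuel    = np.array([0.05, 0.95])          # p(Fuel = 0), p(Fuel = 1)
p_starts  = np.zeros((2, 2, 2))             # p(Starts = s | Battery = b, Fuel = f)
p_starts[:, :, 1] = [[0.0, 0.0],            # car never starts without a battery
                     [0.01, 0.98]]          # needs both battery and fuel
p_starts[:, :, 0] = 1.0 - p_starts[:, :, 1]

def joint(b, f, s):
    """Joint probability as the product of each node's CPD given its parents."""
    return p_battery[b] * p_fuel[f] * p_starts[b, f, s]

# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(joint(b, f, s) for b in (0, 1) for f in (0, 1) for s in (0, 1))
print(joint(1, 1, 1), total)  # ~0.84, 1.0
```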

Bayesian networks are generative models. We can sample from the joint distribution, top-down. Suppose Y can be spam or not spam, and X_i is a binary indicator of whether word i is present in the e-mail. Let's try generating a few emails! [Diagram: label Y with features X_1, X_2, X_3, ..., X_n] It often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions.
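Following the top-down generative story, here is a minimal sketch that samples a label Y and then the word indicators X_i for the spam example; the vocabulary and all parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: P(Y = spam) and P(X_i = 1 | Y) for a tiny vocabulary.
vocab = ["viagra", "meeting", "free", "deadline"]
p_spam = 0.4
p_word_given_y = {  # P(word i present | label)
    "spam":     np.array([0.6, 0.05, 0.7, 0.02]),
    "not spam": np.array([0.01, 0.5, 0.1, 0.4]),
}

def generate_email():
    """Sample top-down: first the label Y, then each feature X_i given Y."""
    y = "spam" if rng.random() < p_spam else "not spam"
    x = rng.random(len(vocab)) < p_word_given_y[y]
    return y, [w for w, present in zip(vocab, x) if present]

for _ in range(3):
    print(generate_email())
```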

Inference in Bayesian networks. Computing marginal probabilities in tree-structured Bayesian networks is easy. The algorithm, called belief propagation, generalizes what we showed for hidden Markov models to arbitrary trees. [Diagrams: example models with variables X_1, ..., X_6 and Y_1, ..., Y_6, and the label Y with features X_1, ..., X_n] Wait, this isn't a tree! What can we do?

Inference in Bayesian networks. In some cases (such as this) we can transform the model into what is called a junction tree, and then run belief propagation.

Approximate inference. There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these. Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals. Variational inference algorithms (deterministic) find a simpler distribution which is close to the original, then compute marginals using the simpler distribution.

Maximum likelihood estimation in Bayesian networks. Suppose that we know the Bayesian network structure G. Let $\theta_{x_i \mid x_{pa(i)}}$ be the parameter giving the value of the CPD $p(x_i \mid x_{pa(i)})$. Maximum likelihood estimation corresponds to solving

$\max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)$

subject to the non-negativity and normalization constraints. This is equal to

$\max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta) = \max_\theta \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{N} \log p(x_i^m \mid x_{pa(i)}^m; \theta) = \sum_{i=1}^{N} \max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x_i^m \mid x_{pa(i)}^m; \theta)$

The optimization problem decomposes into an independent optimization problem for each CPD! It has a simple closed-form solution.
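Because the problem decomposes into one estimation problem per CPD, each conditional table has a closed-form solution given by normalized counts. Below is a minimal sketch of that count-and-normalize step for a single CPD with one binary parent; the data and variable roles are hypothetical.

```python
import numpy as np

# Hypothetical fully observed data: columns are (parent value, child value), both binary.
data = np.array([[0, 0], [0, 1], [1, 1], [1, 1], [1, 0], [0, 0]])

# MLE of p(child | parent) is just normalized co-occurrence counts.
counts = np.zeros((2, 2))
for parent, child in data:
    counts[parent, child] += 1
cpd = counts / counts.sum(axis=1, keepdims=True)
print(cpd)  # row p is [p(child = 0 | parent = p), p(child = 1 | parent = p)]
```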

Returning to clustering. Clusters may overlap. Some clusters may be wider than others. Can we model this explicitly? With what probability is a point from a cluster?

Probabilistic Clustering. Try a probabilistic model! It allows overlaps, clusters of different size, etc. We can tell a generative story for the data: P(Y) P(X|Y). Challenge: we need to estimate model parameters without labeled Y's.

Y    X_1    X_2
??   0.1    2.1
??   0.5   -1.1
??   0.0    3.0
??  -0.1   -2.0
??   0.2    1.5

Gaussian Mixture Models. P(Y): there are k components. P(X|Y): each component generates data from a multivariate Gaussian with mean μ_i and covariance matrix Σ_i. Each data point is assumed to have been sampled from a generative process: 1. Choose component i with probability P(y = i) [multinomial]. 2. Generate a datapoint ~ N(μ_i, Σ_i), where

$P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right)$

[Figure: three Gaussian components with means μ_1, μ_2, μ_3] By fitting this model (unsupervised learning), we can learn new insights about the data.
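The two-step generative process can be coded directly: draw a component from the multinomial P(Y), then draw the point from that component's Gaussian. A minimal sketch with hypothetical weights, means, and covariances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with k = 3 components.
weights = np.array([0.5, 0.3, 0.2])                        # P(y = i)
means   = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])
covs    = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])

def sample_gmm(n):
    """1) choose component i ~ Multinomial(weights); 2) draw x ~ N(mu_i, Sigma_i)."""
    ys = rng.choice(len(weights), size=n, p=weights)
    xs = np.array([rng.multivariate_normal(means[i], covs[i]) for i in ys])
    return xs, ys

xs, ys = sample_gmm(5)
print(np.c_[ys, xs])  # each row: sampled component index, then the 2-D point
```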

Multivariate Gaussians. $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right)$. Here Σ_i = identity matrix.

Multivariate Gaussians. $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right)$. Here Σ = diagonal matrix: the X_i are independent, à la Gaussian naive Bayes.

Multivariate Gaussians. $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right)$. Here Σ = arbitrary (semidefinite) matrix: it specifies a rotation (change of basis), and its eigenvalues specify the relative elongation.

Multivariate Gaussians. Eigenvalue λ of Σ. The covariance matrix Σ captures the degree to which the x_i vary together: $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right)$.
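To make the density concrete, here is a minimal sketch (with hypothetical numbers) that evaluates the multivariate Gaussian formula above for the three covariance structures just discussed: identity, diagonal, and full.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """(2 pi)^(-m/2) |Sigma|^(-1/2) exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    m = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

x, mu = np.array([1.0, 0.5]), np.zeros(2)
print(gaussian_density(x, mu, np.eye(2)))               # identity covariance
print(gaussian_density(x, mu, np.diag([2.0, 0.5])))     # diagonal: independent features
print(gaussian_density(x, mu, np.array([[2.0, 0.8],
                                        [0.8, 1.0]])))  # full covariance: rotation / elongation
```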

Modelling eruption of geysers: the Old Faithful data set. [Scatter plot: time to eruption vs. duration of last eruption]

Modelling eruption of geysers: the Old Faithful data set. [Plots: fit with a single Gaussian vs. a mixture of two Gaussians]

Marginal distribution for mixtures of Gaussians: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the k-th component and $\pi_k$ its mixing coefficient (here K = 3).

Marginal distribution for mixtures of Gaussians.

Learning mixtures of Gaussians. Original data (hypothesized); observed data (y missing); inferred y's (learned model). Shown is the posterior probability that a point was generated from the i-th Gaussian: Pr(Y = i | x).

ML estimation in the supervised setting. Univariate Gaussian; mixture of multivariate Gaussians. The ML estimate for each of the multivariate Gaussians is given by

$\mu_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} x_j, \qquad \Sigma_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} (x_j - \mu_k^{ML})(x_j - \mu_k^{ML})^T$

where each sum is just over the x's generated from the k-th Gaussian.
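In the supervised case these estimates are just per-class sample means and (ML, not unbiased) covariances. A minimal sketch, reusing the x values from the earlier table with hypothetical labels y:

```python
import numpy as np

# Hypothetical labeled 2-D data: x with class labels y in {0, 1}.
x = np.array([[0.1, 2.1], [0.5, -1.1], [0.0, 3.0], [-0.1, -2.0], [0.2, 1.5]])
y = np.array([0, 1, 0, 1, 0])

for k in np.unique(y):
    xk = x[y == k]                                     # only points generated from the k-th Gaussian
    mu_k = xk.mean(axis=0)                             # ML mean
    sigma_k = (xk - mu_k).T @ (xk - mu_k) / len(xk)    # ML covariance (divide by n_k, not n_k - 1)
    print(k, mu_k, sigma_k, sep="\n")
```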

What about with unobserved data? Maximize the marginal likelihood:

$\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$

Almost always a hard problem! Usually there is no closed-form solution. Even when log P(x, y) is convex, log P(x) generally isn't. Many local optima.

Expectation Maximization: 1977, Dempster, Laird, & Rubin.

The EM Algorithm. A clever method for maximizing marginal likelihood:

$\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$

Based on coordinate descent. Easy to implement (e.g., no line search, learning rates, etc.). Alternate between two steps: compute an expectation; compute a maximization. Not magic: still optimizing a non-convex function with lots of local optima. The computations are just easier (often, significantly so).

EM: Two Easy Steps. Objective: $\arg\max_\theta \log \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j; \theta) = \arg\max_\theta \sum_j \log \sum_{k=1}^{K} P(Y_j = k, x_j; \theta)$. Data: {x_j | j = 1..n}.

E-step: Compute expectations to fill in missing y values according to the current parameters θ. For all examples j and values k for Y_j, compute P(Y_j = k | x_j; θ).

M-step: Re-estimate the parameters with weighted MLE estimates. Set $\theta^{new} = \arg\max_\theta \sum_j \sum_k P(Y_j = k \mid x_j; \theta^{old}) \log P(Y_j = k, x_j; \theta)$.

Particularly useful when the E and M steps have closed-form solutions.
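For a mixture model, the E-step quantity P(Y_j = k | x_j; θ) follows from Bayes' rule: compute each component's joint P(Y_j = k, x_j; θ) and normalize over k. A minimal 1-D sketch with hypothetical parameters:

```python
import numpy as np

def normal_pdf(x, mu, var):
    """1-D Gaussian density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical current parameters theta: mixing weights, means, variances (K = 2); 1-D data.
weights   = np.array([0.6, 0.4])
means     = np.array([0.0, 5.0])
variances = np.array([1.0, 2.0])
x = np.array([0.2, 4.0, 2.5])

# E-step: P(Y_j = k | x_j; theta) = P(Y_j = k, x_j; theta) / sum_k' P(Y_j = k', x_j; theta).
joint = np.stack([w * normal_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances)], axis=1)  # shape (n, K)
responsibilities = joint / joint.sum(axis=1, keepdims=True)
print(responsibilities)  # each row sums to 1
```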

Gaussian Mixture Example: Start

After the 1st iteration

After the 2nd iteration

After the 3rd iteration

After the 4th iteration

After the 5th iteration

After the 6th iteration

After the 20th iteration

EM for GMMs: only learning means (1D). Iterate: on the t-th iteration let our estimates be λ_t = { μ_1^(t), μ_2^(t), ..., μ_K^(t) }.

E-step: Compute the expected classes of all datapoints:

$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left(-\frac{1}{2\sigma^2}(x_j - \mu_k)^2\right) P(Y_j = k)$

M-step: Compute the most likely new μ's given the class expectations:

$\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}$
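A minimal sketch of this means-only EM loop in 1-D, assuming a known shared σ and (as a simplification not stated on the slide) equal class priors P(Y_j = k); the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two clusters; sigma is assumed known and shared.
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
sigma, mu = 1.0, np.array([-1.0, 1.0])        # initial guesses for the K = 2 means

for _ in range(50):
    # E-step: P(Y_j = k | x_j) proportional to exp(-(x_j - mu_k)^2 / (2 sigma^2)),
    # assuming equal mixing weights P(Y_j = k).
    logits = -(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2)
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: each mean is the responsibility-weighted average of the data.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # should approach the true cluster means (about -2 and 3)
```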

What if we do hard assignments? Iterate: on the t-th iteration let our estimates be λ_t = { μ_1^(t), μ_2^(t), ..., μ_K^(t) }.

E-step: Compute the expected classes of all datapoints:

$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left(-\frac{1}{2\sigma^2}(x_j - \mu_k)^2\right) P(Y_j = k)$

M-step: Compute the most likely new μ's given the class expectations, with the soft posterior replaced by a hard assignment:

$\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)} \quad\longrightarrow\quad \mu_k = \frac{\sum_{j=1}^{m} \delta(Y_j = k, x_j)\, x_j}{\sum_{j=1}^{m} \delta(Y_j = k, x_j)}$

This is equivalent to the k-means clustering algorithm! Here δ represents a hard assignment to the most likely or nearest cluster.
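Replacing the soft posterior with the hard assignment δ turns the same loop into k-means. A minimal sketch under the same synthetic-data assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
mu = np.array([-1.0, 1.0])

for _ in range(50):
    # Hard E-step: delta assigns each point entirely to its nearest mean.
    assign = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)
    # M-step: each mean becomes the plain average of the points assigned to it
    # (assumes no cluster goes empty, which holds for this data and initialization).
    mu = np.array([x[assign == k].mean() for k in range(len(mu))])

print(mu)  # k-means solution: roughly -2 and 3
```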

E.M. for General GMMs. Iterate: on the t-th iteration let our estimates be λ_t = { μ_1^(t), ..., μ_K^(t), Σ_1^(t), ..., Σ_K^(t), p_1^(t), ..., p_K^(t) }, where p_k^(t) is shorthand for the estimate of P(y = k) on the t-th iteration.

E-step: Compute the expected classes of all datapoints for each class:

$P(Y_j = k \mid x_j; \lambda_t) \propto p_k^{(t)}\, p(x_j; \mu_k^{(t)}, \Sigma_k^{(t)})$

(evaluate the probability of a multivariate Gaussian at x_j).

M-step: Compute the weighted MLE for the parameters given the expected classes above (m = number of training examples):

$\mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, x_j}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}$

$\Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, [x_j - \mu_k^{(t+1)}][x_j - \mu_k^{(t+1)}]^T}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}$

$p_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)}{m}$
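Putting the E-step and the three M-step updates together, here is a compact sketch of EM for a general 2-D GMM on synthetic data; the initialization choices (random data points as initial means, identity covariances) are mine, not from the slide.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

rng = np.random.default_rng(1)
# Synthetic 2-D data from two well-separated clusters.
x = np.vstack([rng.normal([0, 0], 1.0, (150, 2)), rng.normal([5, 5], 1.0, (150, 2))])
m, K = len(x), 2
p = np.full(K, 1.0 / K)                           # mixing weights p_k
mu = x[rng.choice(m, K, replace=False)]           # initialize means at random data points
sigma = np.array([np.eye(2) for _ in range(K)])

for _ in range(50):
    # E-step: responsibilities P(Y_j = k | x_j; lambda_t) proportional to p_k * N(x_j; mu_k, Sigma_k).
    resp = np.array([p[k] * gaussian(x, mu[k], sigma[k]) for k in range(K)]).T
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted MLE updates for mu_k, Sigma_k, and p_k.
    nk = resp.sum(axis=0)
    mu = (resp.T @ x) / nk[:, None]
    for k in range(K):
        diff = x - mu[k]
        sigma[k] = (resp[:, k, None] * diff).T @ diff / nk[k]
    p = nk / m

print(p, mu, sigma, sep="\n")
```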

The general learning problem with missing data. Marginal likelihood, where X is observed and Z (e.g., the class labels Y) is missing:

$\ell(\theta : \mathcal{D}) = \sum_{j=1}^{m} \log P(x_j; \theta) = \sum_{j=1}^{m} \log \sum_{z} P(x_j, z; \theta)$

Objective: find $\arg\max_\theta \ell(\theta : \mathcal{D})$. We assume the hidden variables are missing completely at random (otherwise, we should explicitly model why the values are missing).

Properties of EM. One can prove that EM converges to a local maximum, and that each iteration improves the log-likelihood. How? The argument is the same as for k-means, with the likelihood objective in place of the k-means objective: the M-step can never decrease the likelihood.

EM pictorially. [Figure: the likelihood objective L(θ) and the lower bound l(θ | θ_n) at iteration n; the two touch at θ_n, where L(θ_n) = l(θ_n | θ_n), and maximizing the bound gives θ_{n+1} with l(θ_{n+1} | θ_n) ≤ L(θ_{n+1}). Figure from a tutorial by Sean Borman.]

What you should know. Mixture of Gaussians. EM for mixture of Gaussians: how to learn maximum likelihood parameters in the case of unlabeled data. Relation to K-means: a two-step algorithm, just like K-means; hard vs. soft clustering; probabilistic model. Remember, EM can get stuck in local optima, and empirically it DOES.