STATS 306B: Unsupervised Learning                                        Spring 2014
Lecture 2: April 2
Lecturer: Lester Mackey                    Scribes: Junyang Qian, Minzhe Wang

2.1 Recap

In the last lecture, we formulated our working definition of unsupervised learning: discovering the latent structure underlying our data, without prior observations of that structure. Our exploration of the field began with the task of clustering, in which we aim to group or segment data points based on some notion of similarity. In particular, we studied the k-means approach to clustering and motivated it entirely in terms of minimizing an objective $\sum_i \|x_i - m_{z_i}\|_2^2$ that characterizes the similarity of data points to their cluster means. This is an example of a model-free approach to clustering, as it makes no explicit attempt to model the process that generated our observations. In this lecture, we will examine a popular alternative to k-means clustering, Gaussian mixture modeling with Expectation-Maximization, which rests upon an explicit model of the data-generating process.

2.2 Gaussian Mixture Modeling

The Gaussian mixture model (GMM) is a probabilistic model for clustered data with real-valued components. Although the aims and assumptions of Gaussian mixture modeling appear to be quite different from those of k-means, we will soon see that they share some key similarities.

2.2.1 Model formulation

We begin by positing a statistical model for our data. That is, we view our data $X_1, X_2, \dots, X_n$ as random variables drawn i.i.d. from an unknown distribution with density $p(x)$. A GMM demands a specific form for this density,

$$p(x) = \sum_{j=1}^k \pi_j \, \phi(x; \mu_j, \Sigma_j).$$

This is a mixture of $k$ component multivariate Gaussian distributions, where

$$\phi(x; \mu_j, \Sigma_j) = \frac{1}{|2\pi \Sigma_j|^{1/2}} \exp\!\Big( -\tfrac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \Big)$$

is a multivariate Gaussian density with unknown parameters $(\mu_j, \Sigma_j)$, and $\pi_j$ is the unknown probability of selecting component $j$, satisfying $\sum_{j=1}^k \pi_j = 1$.
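To make the mixture density concrete, here is a minimal numpy/scipy sketch. The component count, the parameter values, and the helper names gmm_density and gmm_sample are illustrative assumptions rather than anything prescribed by the lecture; the sketch simply evaluates $p(x) = \sum_j \pi_j \phi(x; \mu_j, \Sigma_j)$ and draws samples from the mixture.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-component GMM in R^2 (all parameter values are assumptions)
pis = np.array([0.4, 0.6])                                 # mixing proportions, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]         # component means
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]   # component covariances

def gmm_density(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_j pi_j * phi(x; mu_j, Sigma_j)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

def gmm_sample(n, pis, mus, Sigmas, seed=0):
    """Draw n points by first choosing component j with probability pi_j,
    then drawing from the corresponding Gaussian."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)
    X = np.stack([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])
    return X, z

print(gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
X, z = gmm_sample(500, pis, mus, Sigmas)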

A GMM has an equivalent representation as a generative model for our data:

$$z_i \overset{\text{iid}}{\sim} \text{Mult}(\pi, 1), \qquad x_i \mid z_i \overset{\text{ind}}{\sim} \mathcal{N}(\mu_{z_i}, \Sigma_{z_i}).$$

Here, $z_i$ represents the latent component indicator, or latent class/cluster, for datapoint $x_i$.

Remark. Throughout the course, we will be considering a number of probabilistic modeling approaches to unsupervised learning. Typically, we will not view these models as literal descriptions of reality but rather as convenient modeling frameworks that give rise to procedures for unsupervised learning.

2.2.2 Clustering with GMMs

Under the GMM, our clustering task amounts to inferring the latent component $z_i$ responsible for each $x_i$. For the moment, we will ignore the fact that we do not know the parameters of the GMM and imagine how we would carry out the clustering task given $(\pi, \mu_{1:k}, \Sigma_{1:k})$. Since a GMM with known parameters defines a joint distribution over $(x_i, z_i)$, it is natural to consider the conditional distribution of each $z_i$ given $x_i$:

$$p(z_i = j \mid x_i) = \frac{p(z_i = j)\, p(x_i \mid z_i = j)}{p(x_i)} = \frac{\pi_j \, \phi(x_i; \mu_j, \Sigma_j)}{\sum_{l=1}^k \pi_l \, \phi(x_i; \mu_l, \Sigma_l)}.$$

These conditionals reflect our updated beliefs concerning $z_i$ after $x_i$ is observed: before we observe $x_i$, we have the prior belief that it belongs to cluster $j$ with probability $\pi_j$; after observing $x_i$, we can update this belief in accordance with the likelihood of $x_i$ under each Gaussian component.

Remark. The task of computing conditionals or marginals from a known joint distribution is sometimes called probabilistic inference.

The conditional distribution provides us with what is called a soft clustering, since it assigns some probability to $x_i$ belonging to each cluster. To obtain a hard clustering (an assignment of $x_i$ to a single cluster), one typically selects a mode of the conditional distribution, $\operatorname{argmax}_j p(z_i = j \mid x_i)$.

Remark. Those familiar with discriminant analysis will notice a relationship with the classification rules of quadratic and linear discriminant analysis. This is to be expected, as discriminant analysis also models datapoints as being drawn from class-conditional multivariate Gaussian distributions. However, LDA operates in a supervised setting, which greatly simplifies the task of parameter estimation. Parameter estimation is a good deal more difficult in our unsupervised GMM setting.
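The soft and hard clustering rules above are straightforward to compute when the parameters are known. The following sketch does so for a small batch of points; the parameter values and the helper name responsibilities are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative known GMM parameters (assumptions for this sketch)
pis = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def responsibilities(X, pis, mus, Sigmas):
    """Soft clustering: tau[i, j] = p(z_i = j | x_i) under known GMM parameters."""
    # Unnormalized posteriors pi_j * phi(x_i; mu_j, Sigma_j), then normalize over j
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([[0.2, -0.1], [2.8, 3.1], [1.5, 1.5]])
tau = responsibilities(X, pis, mus, Sigmas)   # soft clustering
hard = tau.argmax(axis=1)                     # hard clustering via the posterior mode
print(tau)
print(hard)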

2.2.3 Estimating GMM parameters with Expectation-Maximization

In the prior section, we carried out clustering assuming that the GMM parameters $(\pi, \mu_{1:k}, \Sigma_{1:k})$ were known. In this section, we will attempt to estimate these parameters, a task sometimes termed statistical inference to distinguish it from probabilistic inference. We will begin by trying to derive maximum likelihood estimates (MLEs) of the GMM parameters. Consider the log likelihood of our data,

$$\sum_{i=1}^n \log p(x_i) = \sum_{i=1}^n \log\Big( \sum_{j=1}^k \pi_j \, \phi(x_i; \mu_j, \Sigma_j) \Big). \qquad (2.1)$$

The maximum likelihood principle would choose estimates of $(\pi, \mu_{1:k}, \Sigma_{1:k})$ that maximize this expression. When $k = 1$ (i.e., when there is no clustering structure), the log likelihood simplifies to

$$\sum_{i=1}^n \log \phi(x_i; \mu_1, \Sigma_1) = \sum_{i=1}^n \Big[ -\tfrac{1}{2}(x_i - \mu_1)^T \Sigma_1^{-1} (x_i - \mu_1) - \log |2\pi\Sigma_1|^{1/2} \Big],$$

which has the closed-form maximizer

$$(\mu_1^*, \Sigma_1^*) = \Big( \tfrac{1}{n} \sum_{i=1}^n x_i, \; \tfrac{1}{n} \sum_{i=1}^n (x_i - \mu_1^*)(x_i - \mu_1^*)^T \Big).$$

Unfortunately, when $k > 1$ (the case of interest), the log likelihood no longer simplifies and no longer yields closed-form solutions. To find (approximate) MLEs, one often turns to (I) off-the-shelf numerical optimizers or (II) the Expectation-Maximization (EM) algorithm, which leverages the latent-variable problem structure to form parameter estimates. We will develop and examine the EM approach in the remainder of this lecture.

The fundamental difficulty with the log likelihood in (2.1) is the sum inside of each log (representing an expectation over an unknown cluster assignment $z_i$), which couples together the parameters of all of the component Gaussian distributions of the mixture. If we knew the $z_i$'s, then this problem would be solved, since we could instead maximize the complete log likelihood,

$$\begin{aligned}
\sum_{i=1}^n \log p(x_i, z_i) &= \sum_{i=1}^n \big( \log p(x_i \mid z_i) + \log p(z_i) \big) \\
&= \sum_{i=1}^n \big( \log \phi(x_i; \mu_{z_i}, \Sigma_{z_i}) + \log \pi_{z_i} \big) \\
&= \sum_{i=1}^n \sum_{j=1}^k \big( \mathbb{I}[z_i = j] \log \phi(x_i; \mu_j, \Sigma_j) + \mathbb{I}[z_i = j] \log \pi_j \big).
\end{aligned}$$

Note that the sum over $z_i$ values is now outside of the log and that the parameters of different Gaussian components now appear only in separate summands.
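To see the contrast numerically, the sketch below evaluates both the observed-data log likelihood (2.1), computed stably with logsumexp, and the complete log likelihood for a hypothetical set of known assignments. The synthetic data, parameter values, and function names are assumptions made for illustration.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Illustrative synthetic data and parameters (assumptions for this sketch)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
pis = np.array([0.5, 0.5])
mus = [np.zeros(2), np.full(2, 3.0)]
Sigmas = [np.eye(2), np.eye(2)]

def log_likelihood(X, pis, mus, Sigmas):
    """Observed-data log likelihood (2.1): sum_i log sum_j pi_j * phi(x_i; mu_j, Sigma_j)."""
    log_weighted = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    return logsumexp(log_weighted, axis=1).sum()

def complete_log_likelihood(X, z, pis, mus, Sigmas):
    """Complete log likelihood: sum_i [ log phi(x_i; mu_{z_i}, Sigma_{z_i}) + log pi_{z_i} ]."""
    return sum(multivariate_normal.logpdf(x, mean=mus[j], cov=Sigmas[j]) + np.log(pis[j])
               for x, j in zip(X, z))

z = np.repeat([0, 1], 50)  # hypothetical known cluster assignments
print(log_likelihood(X, pis, mus, Sigmas))
print(complete_log_likelihood(X, z, pis, mus, Sigmas))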

Moreover, the complete log likelihood, viewed as a function of $(\pi, \mu_{1:k}, \Sigma_{1:k})$, has closed-form maximizers:

$$\pi_j^* = \frac{1}{n} \sum_{i=1}^n \mathbb{I}[z_i = j], \qquad \mu_j^* = \frac{\sum_{i=1}^n \mathbb{I}[z_i = j]\, x_i}{\sum_{i=1}^n \mathbb{I}[z_i = j]}, \qquad \Sigma_j^* = \frac{\sum_{i=1}^n \mathbb{I}[z_i = j]\, (x_i - \mu_j^*)(x_i - \mu_j^*)^T}{\sum_{i=1}^n \mathbb{I}[z_i = j]}.$$

That is, $\pi_j^*$ equals the proportion of sample points assigned to cluster $j$, and $\mu_j^*$ and $\Sigma_j^*$ are the sample mean and covariance of the points within cluster $j$. Hence, if we knew the cluster assignments ahead of time, we could easily estimate the GMM parameters, and, as mentioned previously, if we knew the parameters, we could easily estimate cluster assignments using probabilistic inference. Unfortunately (as is often the case in unsupervised learning), we know neither, so we will simply guess the values of the GMM parameters and iteratively refine our guess by alternating between probabilistic inference (updating our cluster assignment inferences) and statistical inference (updating our parameter estimates). This is the Expectation-Maximization algorithm for parameter estimation in a nutshell. More precisely, the algorithm consists of the following steps.

Expectation-Maximization for GMMs:

0. Initialize $\pi, (\mu_{1:k}, \Sigma_{1:k})$ arbitrarily.

1. Alternate until convergence:

   (E-step) Expectation step: compute the soft class memberships given the current parameters,
   $$\tau_{ij} = P\big(z_i = j \mid x_i, \pi, (\mu_l, \Sigma_l)_{l=1}^k\big).$$

   (M-step) Maximization step: update the parameters by plugging in $\tau_{ij}$ (our guess) for the unknown $\mathbb{I}[z_i = j]$, which gives
   $$\pi_j = \frac{1}{n} \sum_{i=1}^n \tau_{ij}, \qquad \mu_j = \frac{\sum_{i=1}^n \tau_{ij}\, x_i}{\sum_{i=1}^n \tau_{ij}}, \qquad \Sigma_j = \frac{\sum_{i=1}^n \tau_{ij}\, (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^n \tau_{ij}}.$$

This is similar to the case in which all $z_i$'s are known, but now each $x_i$ is partially assigned to each cluster $j$ through the conditional probability that $z_i = j$. Note the similarity to Lloyd's algorithm for k-means discussed last time: if we ignore the updates to $\pi$ and $\Sigma_{1:k}$, we see the same basic structure, assigning data points to classes and then updating the means according to the assigned classes.
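A compact, self-contained implementation of these E- and M-steps is sketched below. The synthetic data, the random initialization, the convergence check on the log likelihood, and the name em_gmm are assumptions made for illustration rather than prescriptions from the lecture.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model, following the E- and M-steps above."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Arbitrary initialization: uniform proportions, random data points as means,
    # identity covariances (an assumption of this sketch)
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)].copy()
    Sigmas = np.stack([np.eye(d) for _ in range(k)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: tau[i, j] = p(z_i = j | x_i, current parameters)
        weighted = np.column_stack([
            pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
            for j in range(k)
        ])
        tau = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: tau-weighted proportions, means, and covariances
        Nj = tau.sum(axis=0)                  # effective cluster sizes
        pis = Nj / n
        mus = (tau.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mus[j]
            Sigmas[j] = (tau[:, j, None] * diff).T @ diff / Nj[j]
        # Monitor the observed-data log likelihood (evaluated at the pre-update parameters)
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, tau

# Illustrative synthetic data drawn from two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
pis, mus, Sigmas, tau = em_gmm(X, k=2)
print(pis, mus, sep="\n")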

There are also important distinctions between the two algorithms:

- Lloyd's algorithm makes hard assignments on each iteration, with each point assigned to one class, while the EM algorithm uses soft, probabilistic assignments, with conditional probabilities computed for each class given the current parameter estimates.

- In the GMM setting, we model the mixture proportions and the covariance structure in the data. This is especially useful if we have correlated features, features with varying variances, or clusters of varying sizes. Meanwhile, k-means effectively assumes identity covariance structure (spherical clusters) and equal cluster sizes.

Interestingly, we can actually recover Lloyd's algorithm from a variant of the GMM EM algorithm above. If we assume the cluster proportions satisfy $\pi_j = \frac{1}{k}$ and that $\Sigma_j = \sigma^2 I$ for some known $\sigma^2$, then the EM algorithm (with $\pi$ and $\Sigma_{1:k}$ fixed and known) updates only the means $\mu_j$ and becomes Lloyd's algorithm as $\sigma^2 \to 0$. The design of fast unsupervised learning algorithms like k-means from probabilistic models using small-variance asymptotics is currently an active research area.

At this point, one should have a great many questions concerning this mysterious EM algorithm. For example: Does it converge? If so, are rates of convergence known? Are its solutions optimal? If not, can we bound their suboptimality? What is its relationship to the likelihood and our goal of maximizing it? To provide satisfying answers to these questions, we will, in the next lecture, turn to a more general view of EM for estimation in any latent variable model. In the remainder of this lecture, we will content ourselves with a teaser indicating a fundamental relationship between EM and maximum likelihood.

2.2.4 Optimality conditions for maximum likelihood

To understand how EM might relate to MLE, let us consider the first-order optimality conditions for maximum likelihood. We begin by computing the partial derivatives of the log likelihood with respect to $\mu_j$:

$$\begin{aligned}
&\frac{\partial}{\partial \mu_j} \sum_{i=1}^n \log\Big( \sum_{l=1}^k \pi_l \, \phi(x_i; \mu_l, \Sigma_l) \Big) \\
&\qquad = \sum_{i=1}^n \frac{\pi_j \, \phi(x_i; \mu_j, \Sigma_j)}{\sum_{l=1}^k \pi_l \, \phi(x_i; \mu_l, \Sigma_l)} \, \frac{\partial}{\partial \mu_j} \log \phi(x_i; \mu_j, \Sigma_j) \\
&\qquad = \sum_{i=1}^n \tau_{ij} \, \Sigma_j^{-1} (x_i - \mu_j),
\end{aligned}$$

where the second line follows from the chain rule, and, in the third line, we have recognized the expressions for our previously computed conditional probabilities $\tau_{ij}$.
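As a quick numerical sanity check on this derivative formula (not part of the lecture), the sketch below compares $\sum_i \tau_{ij} \Sigma_j^{-1}(x_i - \mu_j)$ against a central finite-difference estimate of the same partial derivative; the data, parameter values, and function names are assumptions, and the two results should agree up to discretization error.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative data and parameters (assumptions for this check)
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
pis = np.array([0.3, 0.7])
mus = [np.array([-1.0, 0.0]), np.array([1.0, 0.5])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

def log_likelihood(mus):
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    return np.log(weighted.sum(axis=1)).sum()

def analytic_grad(j):
    """sum_i tau_ij * Sigma_j^{-1} (x_i - mu_j), as derived above."""
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    tau = weighted / weighted.sum(axis=1, keepdims=True)
    return ((tau[:, j, None] * (X - mus[j])) @ np.linalg.inv(Sigmas[j])).sum(axis=0)

def numeric_grad(j, eps=1e-6):
    """Central finite differences of the log likelihood with respect to mu_j."""
    grad = np.zeros(2)
    for d in range(2):
        mus_hi = [m.copy() for m in mus]; mus_hi[j][d] += eps
        mus_lo = [m.copy() for m in mus]; mus_lo[j][d] -= eps
        grad[d] = (log_likelihood(mus_hi) - log_likelihood(mus_lo)) / (2 * eps)
    return grad

print(analytic_grad(0), numeric_grad(0))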

The first-order condition for optimality requires that these derivatives equal zero. Hence, any maximum likelihood solution $(\pi^*, \mu_{1:k}^*, \Sigma_{1:k}^*)$ satisfies

$$\sum_{i=1}^n \tau_{ij}^* \, \Sigma_j^{*-1} (x_i - \mu_j^*) = 0 \quad \Longleftrightarrow \quad \mu_j^* = \frac{\sum_{i=1}^n \tau_{ij}^* \, x_i}{\sum_{i=1}^n \tau_{ij}^*}.$$

This derived requirement has the same form as the M-step update from EM! However, the $\tau_{ij}^*$ on the right-hand side of this expression depends implicitly on the optimal parameters $(\pi^*, \mu_{1:k}^*, \Sigma_{1:k}^*)$. Hence, this is not a closed-form solution for $\mu_j^*$ but rather a fixed-point equation. (Similar fixed-point equations would result from considering the first-order optimality conditions for $\pi$ and $\Sigma$.) This derivation makes clear that EM performs fixed-point iteration on the optimality equations for likelihood maximization; that is, EM iteratively plugs estimates of $\tau_{ij}$, computed from the current parameter estimates, into the right-hand side of the fixed-point equations and uses the results as its updated parameter estimates. We will explore additional properties of the EM algorithm next time.
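As a teaser-level illustration of this fixed-point property (again, an assumption-laden sketch rather than anything from the lecture), the code below runs the EM updates of Section 2.2.3 on synthetic data and then verifies that the converged means satisfy $\mu_j \approx \sum_i \tau_{ij} x_i / \sum_i \tau_{ij}$ with $\tau_{ij}$ recomputed at the converged parameters.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative synthetic data: two well-separated clusters (assumptions for this sketch)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
n, d, k = X.shape[0], X.shape[1], 2

# Simple (assumed) initialization: uniform proportions, two data points as means
pis = np.full(k, 1.0 / k)
mus = X[rng.choice(n, size=k, replace=False)].copy()
Sigmas = np.stack([np.eye(d) for _ in range(k)])

def resp(pis, mus, Sigmas):
    """tau[i, j] = p(z_i = j | x_i) under the given parameters."""
    w = np.column_stack([pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
                         for j in range(k)])
    return w / w.sum(axis=1, keepdims=True)

# Run the EM updates from Section 2.2.3 for a fixed number of iterations
for _ in range(200):
    tau = resp(pis, mus, Sigmas)
    Nj = tau.sum(axis=0)
    pis = Nj / n
    mus = (tau.T @ X) / Nj[:, None]
    for j in range(k):
        diff = X - mus[j]
        Sigmas[j] = (tau[:, j, None] * diff).T @ diff / Nj[j]

# Fixed-point check: with tau recomputed at the converged parameters,
# each mu_j should (approximately) equal the tau-weighted average of the x_i
tau = resp(pis, mus, Sigmas)
mus_fp = (tau.T @ X) / tau.sum(axis=0)[:, None]
print(np.max(np.abs(mus - mus_fp)))  # close to zero at an EM fixed point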