Clustering. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms. March 8, 2017.



Outline
1 Administration
2 Review of last lecture
3 Clustering

Schedule and Final
Upcoming schedule:
Today: last day of class
Next Monday (3/13): no class; I will hold office hours in my office (BH 4531F)
Next Wednesday (3/15): Final Exam, HW6 due
Final Exam: cumulative, but with more emphasis on new material. 8 short questions (recall that the midterm had 6 short and 3 long questions). Should be much shorter than the midterm, but of equal difficulty. Focus is on major concepts (e.g., MLE, primal and dual formulations of SVM, gradient descent, etc.).


Neural Networks
Basic idea: learn nonlinear basis functions and classifiers. Hidden layers are nonlinear mappings from the input features to a new representation; output layers use the new representation for classification or regression.
Learning parameters: backpropagation is an efficient algorithm for (stochastic) gradient descent; we can write down explicit updates via the chain rule of calculus.

Single node (with bias unit x_0 = 1): given input x = [x_0, x_1, x_2, x_3]^T and weights θ = [θ_0, θ_1, θ_2, θ_3]^T, the node outputs
h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})
where g(z) = 1 / (1 + e^{-z}) is the sigmoid (logistic) activation function. (Based on slide by Andrew Ng.)

Neural network: [diagram] input x with bias units, hidden activations a^(2), output h_Θ(x); Layer 1 (input layer), Layer 2 (hidden layer), Layer 3 (output layer). (Slide by Andrew Ng.)

Vectorization: with hidden-layer weights Θ^(1) and output-layer weights Θ^(2),
a_1^(2) = g(z_1^(2)) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3)
a_2^(2) = g(z_2^(2)) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + Θ_23^(1) x_3)
a_3^(2) = g(z_3^(2)) = g(Θ_30^(1) x_0 + Θ_31^(1) x_1 + Θ_32^(1) x_2 + Θ_33^(1) x_3)
h_Θ(x) = g(Θ_10^(2) a_0^(2) + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2) + Θ_13^(2) a_3^(2)) = g(z_1^(3))
Feed-forward steps:
z^(2) = Θ^(1) x
a^(2) = g(z^(2)), then add a_0^(2) = 1
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))
(Based on slide by Andrew Ng.)
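To make the feed-forward steps concrete, here is a minimal NumPy sketch of the same computation for one hidden layer. The function and variable names (sigmoid, forward, Theta1, Theta2) and the random example weights are illustrative choices, not part of the original slides.

import numpy as np

def sigmoid(z):
    # Logistic activation g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    # Feed-forward pass for a network with one hidden layer.
    # x:      input vector of shape (d,)
    # Theta1: hidden-layer weights, shape (h, d + 1); column 0 multiplies the bias
    # Theta2: output-layer weights, shape (m, h + 1)
    a1 = np.concatenate(([1.0], x))            # add bias unit x_0 = 1
    z2 = Theta1 @ a1                           # z^(2) = Theta^(1) x
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) = g(z^(2)), add a^(2)_0 = 1
    z3 = Theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                         # h_Theta(x) = a^(3) = g(z^(3))

# Example with the slide's dimensions: 3 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
Theta1 = rng.normal(scale=0.1, size=(3, 4))
Theta2 = rng.normal(scale=0.1, size=(1, 4))
print(forward(np.array([1.0, 2.0, 3.0]), Theta1, Theta2))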

Learning in NN: Backpropagation. Similar to the perceptron learning algorithm, we cycle through our examples: if the output of the network is correct, no changes are made; if there is an error, the weights are adjusted to reduce the error. We are just performing (stochastic) gradient descent! (Based on slide by T. Finin, M. desJardins, L. Getoor, R. Par.)

Optimizing the Neural Network. The objective is
J(Θ) = -(1/n) Σ_{i=1}^n Σ_{k=1}^K [ y_ik log(h_Θ(x_i))_k + (1 - y_ik) log(1 - (h_Θ(x_i))_k) ] + (λ/2n) Σ_{l=1}^{L-1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_ji^(l))^2
Unlike before, J(Θ) is not convex, so gradient descent on a neural net yields a local optimum. Solve via forward propagation and backpropagation, using
∂J(Θ)/∂Θ_ij^(l) = a_j^(l) δ_i^(l+1)   (ignoring λ, i.e., taking λ = 0)
(Based on slide by Andrew Ng.)

Forward Propagation. Given one labeled training instance (x, y):
a^(1) = x
z^(2) = Θ^(1) a^(1),  a^(2) = g(z^(2))   [add a_0^(2)]
z^(3) = Θ^(2) a^(2),  a^(3) = g(z^(3))   [add a_0^(3)]
z^(4) = Θ^(3) a^(3),  a^(4) = h_Θ(x) = g(z^(4))
(Based on slide by Andrew Ng.)

Backpropagation: Gradient Computation. Let δ_j^(l) = "error" of node j in layer l (number of layers L = 4). Backpropagation, with .* denoting the element-wise product:
δ^(4) = a^(4) - y
δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
(There is no δ^(1).) For the sigmoid, g'(z^(3)) = a^(3) .* (1 - a^(3)) and g'(z^(2)) = a^(2) .* (1 - a^(2)).
(Based on slide by Andrew Ng.)

Training a Neural Network via Gradient Descent with Backprop
Given: training set {(x_1, y_1), ..., (x_n, y_n)}
Initialize all Θ^(l) randomly (NOT to 0!)
Loop   // each iteration is called an epoch
    Set Δ_ij^(l) = 0 for all l, i, j   (used to accumulate the gradient)
    For each training instance (x_i, y_i):
        Set a^(1) = x_i
        Compute {a^(2), ..., a^(L)} via forward propagation
        Compute δ^(L) = a^(L) - y_i
        Compute the errors {δ^(L-1), ..., δ^(2)} via backpropagation
        Accumulate the gradients: Δ_ij^(l) = Δ_ij^(l) + a_j^(l) δ_i^(l+1)
    Compute the average regularized gradient:
        D_ij^(l) = (1/n) Δ_ij^(l) + λ Θ_ij^(l)   if j ≠ 0
        D_ij^(l) = (1/n) Δ_ij^(l)                otherwise
    Update the weights via a gradient step on D_ij^(l)
Until the weights converge or the max number of epochs is reached
(Based on slide by Andrew Ng.)
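As a concrete illustration of this training loop, here is a minimal NumPy sketch for a network with one hidden layer of sigmoid units. The function name train, the learning rate lr, the fixed number of epochs as a stopping rule, and the XOR toy data are illustrative assumptions rather than details from the slides; the per-example forward pass, δ computation, gradient accumulation, and regularized update follow the steps above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, hidden=3, lam=0.0, lr=0.5, epochs=500, seed=0):
    # Batch gradient descent with backprop, one hidden layer of sigmoid units.
    # X: (n, d) inputs, Y: (n, m) targets in [0, 1].
    n, d = X.shape
    m = Y.shape[1]
    rng = np.random.default_rng(seed)
    Theta1 = rng.normal(scale=0.1, size=(hidden, d + 1))   # NOT initialized to 0
    Theta2 = rng.normal(scale=0.1, size=(m, hidden + 1))

    for _ in range(epochs):                      # each iteration is an epoch
        D1 = np.zeros_like(Theta1)               # gradient accumulators (Delta)
        D2 = np.zeros_like(Theta2)
        for x, y in zip(X, Y):
            a1 = np.concatenate(([1.0], x))      # forward propagation
            z2 = Theta1 @ a1
            a2 = np.concatenate(([1.0], sigmoid(z2)))
            a3 = sigmoid(Theta2 @ a2)

            d3 = a3 - y                          # delta^(3) = a^(3) - y
            d2 = (Theta2[:, 1:].T @ d3) * a2[1:] * (1 - a2[1:])  # delta^(2)

            D2 += np.outer(d3, a2)               # accumulate a_j^(l) * delta_i^(l+1)
            D1 += np.outer(d2, a1)

        # Average gradient, with L2 regularization on the non-bias weights
        G1, G2 = D1 / n, D2 / n
        G1[:, 1:] += lam * Theta1[:, 1:]
        G2[:, 1:] += lam * Theta2[:, 1:]
        Theta1 -= lr * G1                        # gradient step
        Theta2 -= lr * G2
    return Theta1, Theta2

# Tiny usage example: fit XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
T1, T2 = train(X, Y, hidden=4, lr=1.0, epochs=2000)

In practice one would also monitor J(Θ) across epochs to decide convergence rather than fixing the number of epochs in advance.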

Summary of the course so far
Supervised learning has been our focus. Setup: given a training dataset {x_n, y_n}_{n=1}^N, we learn a function h(x) to predict x's true value y (i.e., regression or classification).
Linear vs. nonlinear features:
1 Linear: h(x) depends on w^T x
2 Nonlinear: h(x) depends on w^T φ(x), where φ is either explicit or defined implicitly through a kernel function k(x_m, x_n) = φ(x_m)^T φ(x_n)
Loss functions:
1 Squared loss: least squares for regression (minimizing the residual sum of errors)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines
Principles of estimation:
1 Point estimate: maximum likelihood, regularized likelihood

(cont'd)
Optimization:
1 Methods: gradient descent, Newton's method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations
Learning theory:
1 Difference between training error and generalization error
2 Overfitting, bias-variance tradeoff
3 Regularization: various regularized models

Supervised versus Unsupervised Learning
Supervised: learning from labeled observations. The labels teach the algorithm to learn a mapping from observations to labels. Examples: classification, regression.
Unsupervised: learning from unlabeled observations. The learning algorithm must find latent structure from the features alone. This can be a goal in itself (discovering hidden patterns, exploratory analysis) or a means to an end (preprocessing for a supervised task). Example: clustering (today).

Outline
1 Administration
2 Review of last lecture
3 Clustering: K-means; Gaussian mixture models

Clustering Setup
Given D = {x_n}_{n=1}^N and K, we want to output:
{µ_k}_{k=1}^K: the prototypes of the clusters
A(x_n) ∈ {1, 2, ..., K}: the cluster membership
Toy example: cluster the data into two clusters. [Figure: a toy 2-D dataset, shown unclustered in panel (a) and clustered in panel (i).]
Applications: identify communities within social networks; find topics in news stories; group similar sequences into gene families.

K-means example. [Figure: panels (a)-(i) show successive iterations of K-means on the toy dataset, alternating between assigning points to the nearest prototype and re-estimating the prototypes.]

K-means clustering
Intuition: data points assigned to cluster k should be near prototype µ_k.
Distortion measure (the clustering objective, or cost function):
J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n - µ_k||_2^2
where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k.

Algorithm
Minimize the distortion by alternating optimization between {r_nk} and {µ_k}:
Step 0: Initialize {µ_k} to some values.
Step 1: Fix {µ_k} and minimize over {r_nk}, which gives the assignment
r_nk = 1 if k = arg min_j ||x_n - µ_j||_2^2, and r_nk = 0 otherwise.
Step 2: Fix {r_nk} and minimize over {µ_k}, which gives the update
µ_k = (Σ_n r_nk x_n) / (Σ_n r_nk)
Step 3: Return to Step 1 unless the stopping criterion is met.
A minimal sketch of this procedure appears below.
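Here is a minimal NumPy sketch of the alternating procedure (Lloyd's algorithm). The function name kmeans, the random initialization from the data, the empty-cluster handling, and the stopping rule (assignments stop changing, or max_iter iterations) are illustrative assumptions rather than details specified on the slides.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    # Minimize J = sum_n sum_k r_nk ||x_n - mu_k||_2^2 by alternating steps.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()   # Step 0: initialize {mu_k}
    prev = None
    for _ in range(max_iter):
        # Step 1: fix {mu_k}; set r_nk by assigning each point to its nearest prototype
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        assign = dists.argmin(axis=1)
        if prev is not None and np.array_equal(assign, prev):
            break                                               # assignments stable: stop
        prev = assign
        # Step 2: fix r_nk; set mu_k to the mean of the points assigned to cluster k
        for k in range(K):
            members = X[assign == k]
            if len(members):                                    # keep old mu_k if cluster is empty
                mu[k] = members.mean(axis=0)
    return mu, assign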

Remarks
Prototype µ_k is the mean of the points assigned to cluster k, hence the name K-means.
The procedure reduces J in both Step 1 and Step 2, and thus improves the objective on each iteration.
There is no guarantee that we find the global optimum; the quality of the local optimum depends on the initial values chosen at Step 0. k-means++ is a principled approximation algorithm for choosing this initialization (a sketch of its seeding rule follows).
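The slide only names k-means++; for concreteness, here is a sketch of its standard D^2 seeding rule: the first center is chosen uniformly at random, and each later center is sampled with probability proportional to its squared distance to the nearest center chosen so far. The function name kmeanspp_init and the NumPy details are illustrative.

import numpy as np

def kmeanspp_init(X, K, seed=0):
    # k-means++ seeding (D^2 sampling).
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                      # first center: uniform at random
    for _ in range(1, K):
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)  # squared distance to nearest center
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])       # sample next center proportionally to d2
    return np.array(centers)

One could then replace the Step 0 initialization in the K-means sketch above with these centers.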

Probabilistic interpretation of clustering?
How can we model p(x) to reflect our intuition that points stay close to their cluster centers?
[Figure (b): the points seem to form 3 clusters.]
We cannot model p(x) with a simple, known distribution. E.g., the data is not a single Gaussian, because we have 3 distinct concentrated regions.

Gaussian mixture models: intuition
[Figure (a).]
Model each region with a distinct distribution. We can use Gaussians, which gives Gaussian mixture models (GMMs).
But we don't know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components! We must learn them from the unlabeled data D = {x_n}_{n=1}^N.

Gaussian mixture models: formal definition
A GMM has the following density function for x:
p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k)
K: the number of Gaussians; they are called the mixture components.
µ_k and Σ_k: the mean and covariance matrix of the k-th component.
ω_k: the mixture weights (or priors); they represent how much each component contributes to the final distribution. They satisfy two properties: ω_k > 0 for all k, and Σ_k ω_k = 1. These properties ensure that p(x) is in fact a probability density function.
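As a small illustration (not from the slides), the density can be evaluated directly from this definition; the helper name gmm_density and the two-component example parameters below are made up for the sketch, which uses SciPy's multivariate normal pdf for N(x | µ_k, Σ_k).

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, omegas, mus, Sigmas):
    # Evaluate p(x) = sum_k omega_k N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=S)
               for w, mu, S in zip(omegas, mus, Sigmas))

# Example: a two-component mixture in 2-D; the weights are positive and sum to 1
omegas = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), omegas, mus, Sigmas))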

GMM as the marginal distribution of a joint distribution
Consider the joint distribution
p(x, z) = p(z) p(x | z)
where z is a discrete random variable taking values between 1 and K. Denote ω_k = p(z = k), and assume the conditional distributions are Gaussian:
p(x | z = k) = N(x | µ_k, Σ_k)
Then the marginal distribution of x is
p(x) = Σ_{k=1}^K ω_k N(x | µ_k, Σ_k)
namely, the Gaussian mixture model.
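This generative view also says how to sample from a GMM: draw z from the categorical distribution with weights ω, then draw x from the corresponding Gaussian; discarding z leaves samples from the marginal p(x). A minimal sketch, where the function name sample_gmm is illustrative and components are indexed from 0 for convenience:

import numpy as np

def sample_gmm(n, omegas, mus, Sigmas, seed=0):
    # Draw n samples from the joint p(x, z) = p(z) p(x | z).
    rng = np.random.default_rng(seed)
    K = len(omegas)
    z = rng.choice(K, size=n, p=omegas)                                    # z ~ Categorical(omega)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x | z = k ~ N(mu_k, Sigma_k)
    return x, z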

GMMs: example
[Figure (a): data points colored red, blue, and green.]
The conditional distributions of x given z (where z represents the color) are
p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)
[Figure (b): the resulting mixture density.]
The marginal distribution is thus
p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)

Parameter estimation for Gaussian mixture models
The parameters in GMMs are θ = {ω_k, µ_k, Σ_k}_{k=1}^K.
Let's first consider the simple (unrealistic) case where we have the labels z. Define D' = {x_n, z_n}_{n=1}^N; D' is the complete data, and D the incomplete data. How can we learn our parameters?
Given D', the maximum likelihood estimate of θ is given by
θ = arg max_θ log p(D') = arg max_θ Σ_n log p(x_n, z_n)

Parameter estimation for GMMs: complete data
The complete-data likelihood is decomposable:
Σ_n log p(x_n, z_n) = Σ_n log p(z_n) p(x_n | z_n) = Σ_k Σ_{n: z_n = k} log p(z_n) p(x_n | z_n)
where we have grouped the data by the values of z_n.
Let γ_nk ∈ {0, 1} be a binary variable that indicates whether z_n = k. Then
Σ_n log p(x_n, z_n) = Σ_n Σ_k γ_nk log p(z = k) p(x_n | z = k)
                    = Σ_k Σ_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Parameter estimation for GMMs: complete data
From the previous discussion, we have
Σ_n log p(x_n, z_n) = Σ_k Σ_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]
Regrouping, we have
Σ_n log p(x_n, z_n) = Σ_k Σ_n γ_nk log ω_k + Σ_k { Σ_n γ_nk log N(x_n | µ_k, Σ_k) }
The term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as an exercise) that the MLE is:
ω_k = (Σ_n γ_nk) / (Σ_k Σ_n γ_nk),   µ_k = (1 / Σ_n γ_nk) Σ_n γ_nk x_n
Σ_k = (1 / Σ_n γ_nk) Σ_n γ_nk (x_n - µ_k)(x_n - µ_k)^T
What's the intuition?

Intuition
Since γ_nk is binary, the previous solution is nothing but:
ω_k: the fraction of the total data points whose z_n is k (note that Σ_k Σ_n γ_nk = N)
µ_k: the mean of all data points whose z_n is k
Σ_k: the covariance of all data points whose z_n is k
This intuition will help us develop an algorithm for estimating θ when we do not know z_n (incomplete data). A small sketch of the complete-data MLE follows.
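To make the complete-data estimates concrete, here is a minimal NumPy sketch; the function name mle_complete_data and the convention that labels take values 0, ..., K-1 are illustrative assumptions.

import numpy as np

def mle_complete_data(X, z, K):
    # MLE of {omega_k, mu_k, Sigma_k} when the labels z_n are observed.
    # X: (N, d) data; z: (N,) integer labels in {0, ..., K-1}, each value appearing at least once.
    N, d = X.shape
    omegas, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                      # all points whose z_n is k
        Nk = len(Xk)                        # sum_n gamma_nk
        omegas.append(Nk / N)               # fraction of points in component k
        mu = Xk.mean(axis=0)                # mean of points in component k
        mus.append(mu)
        diff = Xk - mu
        Sigmas.append(diff.T @ diff / Nk)   # covariance of points in component k
    return np.array(omegas), np.array(mus), np.array(Sigmas)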

Parameter estimation for GMMs: incomplete data
When z_n is not given, we can guess it via the posterior probability
p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n)
                 = p(x_n | z_n = k) p(z_n = k) / Σ_{k'=1}^K p(x_n | z_n = k') p(z_n = k')
To compute the posterior probability, we need to know the parameters θ! Let's pretend we know the values of the parameters, so that we can compute the posterior probability. How is that going to help us?
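A minimal sketch of this posterior computation, assuming SciPy for the Gaussian densities; the function name responsibilities and the vectorized layout are illustrative choices.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, omegas, mus, Sigmas):
    # Posterior gamma_nk = p(z_n = k | x_n) for each point, given parameters theta.
    N, K = len(X), len(omegas)
    gamma = np.empty((N, K))
    for k in range(K):
        # numerator: p(x_n | z_n = k) p(z_n = k)
        gamma[:, k] = omegas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # divide by p(x_n) = sum over k'
    return gamma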

Estimation with soft γ_nk
We define γ_nk = p(z_n = k | x_n).
Recall that γ_nk was previously binary. Now it is a soft assignment of x_n to the k-th component: each x_n is assigned to a component fractionally, according to p(z_n = k | x_n).
We now get the same expressions for the MLE as before!
ω_k = (Σ_n γ_nk) / (Σ_k Σ_n γ_nk),   µ_k = (1 / Σ_n γ_nk) Σ_n γ_nk x_n
Σ_k = (1 / Σ_n γ_nk) Σ_n γ_nk (x_n - µ_k)(x_n - µ_k)^T
But remember, we're cheating by using θ to compute γ_nk!

Iterative procedure
Alternate between estimating γ_nk and computing the parameters:
Step 0: initialize θ with some values (random or otherwise)
Step 1: compute γ_nk using the current θ
Step 2: update θ using the just-computed γ_nk
Step 3: go back to Step 1
This is an example of the EM algorithm, a powerful procedure for model estimation with hidden/latent variables. (A sketch of the full loop follows below.)
Connection with K-means? GMMs provide a probabilistic interpretation of K-means: K-means is a "hard" GMM, or equivalently, a GMM is a "soft" K-means. The posterior γ_nk provides a probabilistic assignment of x_n to cluster k.
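Putting the E-step and M-step together, here is a minimal sketch of the full iterative procedure for a GMM; the function name em_gmm, the initialization choices (means drawn from the data, identity covariances, uniform weights), and the fixed number of iterations are illustrative assumptions, not details from the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    # EM for a GMM: alternate E-step (soft gamma_nk) and M-step (update theta).
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: initialize theta
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.eye(d) for _ in range(K)])
    omegas = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # Step 1 (E-step): gamma_nk = p(z_n = k | x_n) under the current theta
        gamma = np.column_stack([
            omegas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # Step 2 (M-step): same formulas as the complete-data MLE, with soft gamma
        Nk = gamma.sum(axis=0)                      # sum_n gamma_nk
        omegas = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return omegas, mus, Sigmas, gamma

A practical implementation would also monitor the log-likelihood to decide convergence and add a small ridge to each Σ_k to keep it well conditioned; replacing the soft γ_nk with a hard arg-max assignment recovers a K-means-like update, which is the "hard GMM" connection mentioned above.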