CPSC 540: Machine Learning


CPSC 540: Machine Learning
Expectation Maximization
Mark Schmidt, University of British Columbia, Winter 2018

Last Time: Learning with MAR Values

We discussed learning with missing at random (MAR) values in data:

X = [ 1.33  0.45  0.05  1.08  ?
      1.49  2.36  1.29  0.80  ?
      0.35  1.38  2.89  0.10  ?
      0.10  1.29  0.64  0.46  ?
      0.79  0.25  0.47  0.18  ?
      2.93  1.56  1.11  0.81  ?
      1.15  0.22  0.11  0.25  ? ]

Imputation approach: guess the most likely value of each "?", fit the model with these values (and repeat).
The k-means clustering algorithm is a special case: a mixture of Gaussians with Σ_c = I and "?" being the cluster assignment (? ∈ {1, 2, ..., k}).
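
As an illustration of the imputation idea (a sketch, not from the slides), the k-means special case alternates between guessing the most likely cluster for each example and refitting the cluster means; the function and variable names below are illustrative:

```python
import numpy as np

def kmeans_hard_em(X, k, n_iter=20, seed=0):
    """Imputation ("hard EM") view of k-means: alternate between filling in
    the most likely cluster for each row and refitting the cluster means."""
    rng = np.random.default_rng(seed)
    # Initialize means with k randomly chosen examples.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # "Impute" step: with Sigma_c = I, the most likely cluster is the nearest mean.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Re-fit step: MLE of each mean given the imputed clusters.
        for c in range(k):
            if np.any(z == c):
                means[c] = X[z == c].mean(axis=0)
    return z, means

# Toy usage on synthetic data (illustrative only).
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
z, means = kmeans_hard_em(X, k=2)
```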

Parameters, Hyper-Parameters, and Nuisance Parameters

Are the "?" values parameters or hyper-parameters?
Parameters: variables in our model that we optimize based on the training set.
Hyper-parameters: variables that control model complexity, typically set using a validation set. They often become degenerate if we set them based on the training data. We sometimes also include optimization parameters here, like the step size.
Nuisance parameters: not part of the model and not really controlling complexity.
An alternative to optimizing ("imputation") is to integrate over these values: consider all possible imputations, and weight them by their probability.

Expectation Maximization Notation

Expectation maximization (EM) is an optimization algorithm for MAR values:
It applies to problems that are easy to solve with complete data (i.e., if you knew the "?" values).
It allows probabilistic or "soft" assignments to MAR (or other nuisance) variables.
EM is among the most cited papers in statistics.
The imputation approach is sometimes called "hard" EM.

EM notation: we use O for observed variables and H for hidden ("?") variables.
Semi-supervised learning: we observe O = {X, y, X̄} but don't observe H = {ȳ}.
Mixture models: we observe data O = {X} but don't observe the clusters H = {z^i}_{i=1}^n.
We use Θ for the parameters we want to optimize.

Complete Data and Marginal Likelihoods

Assume observing H makes the "complete" likelihood p(O, H | Θ) nice: it has a closed-form MLE, gives a convex NLL, or something like that.
From the marginalization rule, the likelihood of O in terms of the complete likelihood is

    p(O | Θ) = Σ_H p(O, H | Θ) = Σ_{H_1} Σ_{H_2} ⋯ Σ_{H_m} p(O, H | Θ),

where we sum (or integrate) over all possible assignments to the hidden variables H = {H_1, H_2, ..., H_m}. For mixture models, this sums over all possible clusterings.
The negative log-likelihood thus has the form

    −log p(O | Θ) = −log ( Σ_H p(O, H | Θ) ),

which has a sum inside the log. This does not preserve convexity: minimizing it is usually NP-hard.
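
As a concrete illustration (not from the slides), here is a small sketch of evaluating this NLL for a Gaussian mixture with assumed parameters pi, mus, Sigmas; the sum over hidden assignments sits inside the log and is computed stably with logsumexp:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_nll(X, pi, mus, Sigmas):
    """-log p(O | Theta) for a Gaussian mixture: the sum over the
    hidden cluster assignments sits inside the log."""
    n, k = len(X), len(pi)
    log_joint = np.empty((n, k))  # log p(x^i, z^i = c | Theta)
    for c in range(k):
        log_joint[:, c] = np.log(pi[c]) + multivariate_normal.logpdf(X, mus[c], Sigmas[c])
    return -logsumexp(log_joint, axis=1).sum()

# Toy usage with hypothetical parameters.
X = np.random.randn(100, 2)
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.ones(2) * 3]
Sigmas = [np.eye(2), np.eye(2)]
print(mixture_nll(X, pi, mus, Sigmas))
```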

Expectation Maximization Bound

To compute Θ^{t+1}, the approximation used by EM and hard-EM is

    log ( Σ_H p(O, H | Θ) ) ≈ Σ_H α_H^t log p(O, H | Θ),

where α_H^t is a probability for the assignment H to the hidden variables. Note that α_H^t changes on each iteration t.
In hard-EM we set α_H^t = 1 for the most likely H given Θ^t (all other α_H^t = 0).
In soft-EM we set α_H^t = p(H | O, Θ^t), weighting each H by its probability given Θ^t.
We'll show that the EM approximation minimizes an upper bound,

    −log p(O | Θ) ≤ −Σ_H p(H | O, Θ^t) log p(O, H | Θ) + const.,

where the sum on the right is the function Q(Θ | Θ^t) defined on the next slide.
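
The slide defers the proof of this bound; a standard short derivation writes the marginal likelihood as an expectation under p(H | O, Θ^t) and applies Jensen's inequality (−log is convex):

```latex
\begin{align*}
-\log p(O \mid \Theta)
  &= -\log \sum_{H} p(H \mid O, \Theta^t)\,
       \frac{p(O, H \mid \Theta)}{p(H \mid O, \Theta^t)} \\
  &\le -\sum_{H} p(H \mid O, \Theta^t)
       \log \frac{p(O, H \mid \Theta)}{p(H \mid O, \Theta^t)}
       && \text{(Jensen: $-\log$ is convex)} \\
  &= -\underbrace{\sum_{H} p(H \mid O, \Theta^t) \log p(O, H \mid \Theta)}_{Q(\Theta \mid \Theta^t)}
     + \underbrace{\sum_{H} p(H \mid O, \Theta^t) \log p(H \mid O, \Theta^t)}_{\text{constant in } \Theta}.
\end{align*}
```

The last term does not depend on Θ, so maximizing Q(Θ | Θ^t) minimizes this upper bound.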

Expectation Maximization as Bound Optimization

Expectation maximization is a bound-optimization method: at each iteration t we optimize a bound on the function.
In gradient descent, our bound came from Lipschitz continuity of the gradient.
In EM, our bound comes from an expectation over the hidden variables (non-quadratic).

Expectation Maximization (EM)

So EM starts with Θ^0 and sets Θ^{t+1} to maximize Q(Θ | Θ^t). This is typically written as two steps:

1. E-step: define the expectation of the complete log-likelihood given the last parameters Θ^t,

       Q(Θ | Θ^t) = Σ_H p(H | O, Θ^t) log p(O, H | Θ) = E_{H | O, Θ^t}[ log p(O, H | Θ) ],

   where the p(H | O, Θ^t) are the fixed weights α_H^t and log p(O, H | Θ) is the "nice" term; this is a weighted version of the nice log p(O, H) values.

2. M-step: maximize this expectation to generate the new parameters Θ^{t+1},

       Θ^{t+1} = argmax_Θ Q(Θ | Θ^t).

Expectation Maximization for Mixture Models

In the case of a mixture model with extra cluster variables z^i, EM uses

    Q(Θ | Θ^t) = E_{z | X, Θ^t}[ log p(X, z | Θ) ]
               = Σ_{z^1=1}^k Σ_{z^2=1}^k ⋯ Σ_{z^n=1}^k p(z | X, Θ^t) log p(X, z | Θ)
               = Σ_{z^1=1}^k Σ_{z^2=1}^k ⋯ Σ_{z^n=1}^k ( Π_{i=1}^n p(z^i | x^i, Θ^t) ) ( Σ_{i=1}^n log p(x^i, z^i | Θ) )
               = (see the EM notes; a tedious use of the distributive law and independences)
               = Σ_{i=1}^n Σ_{z^i=1}^k p(z^i | x^i, Θ^t) log p(x^i, z^i | Θ),

where p(z | X, Θ^t) plays the role of the fixed weights α_z and log p(X, z | Θ) is the "nice" term.
The sum over k^n clusterings turns into a sum over nk single-example assignments; see the identity below.
The same simplification happens for semi-supervised learning; we'll discuss why later.
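
The deferred "distributive law" step can be summarized in one line: for each fixed i, summing the product of responsibilities over all the other examples' assignments gives 1, so only the sum over z^i survives:

```latex
\begin{align*}
\sum_{z^1=1}^{k} \cdots \sum_{z^n=1}^{k}
  \Bigg( \prod_{j=1}^{n} p(z^j \mid x^j, \Theta^t) \Bigg) \log p(x^i, z^i \mid \Theta)
  &= \sum_{z^i=1}^{k} p(z^i \mid x^i, \Theta^t) \log p(x^i, z^i \mid \Theta)
     \underbrace{\prod_{j \ne i} \sum_{z^j=1}^{k} p(z^j \mid x^j, \Theta^t)}_{=\,1}.
\end{align*}
```

Summing this over i = 1, ..., n gives the final single-sum expression for Q(Θ | Θ^t).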

Expectation Maximization for Mixture Models

In the case of a mixture model with extra cluster variables z^i, EM uses

    Q(Θ | Θ^t) = Σ_{i=1}^n Σ_{z^i=1}^k p(z^i | x^i, Θ^t) log p(x^i, z^i | Θ),

where p(z^i | x^i, Θ^t) is the responsibility r_c^i. This is just a weighted version of the usual likelihood: we just need to do MLE in a weighted Gaussian, weighted Bernoulli, etc.
We typically write the update in terms of the responsibilities,

    r_c^i ≡ p(z^i = c | x^i, Θ^t) = p(x^i | z^i = c, Θ^t) p(z^i = c | Θ^t) / p(x^i | Θ^t)   (Bayes rule),

the probability that cluster c generated x^i. By the marginalization rule,

    p(x^i | Θ^t) = Σ_{c=1}^k p(x^i | z^i = c, Θ^t) p(z^i = c | Θ^t).

We get k-means if we set r_c^i = 1 for the most likely cluster and 0 otherwise.
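
A minimal numpy/scipy sketch of this E-step for a Gaussian mixture, computing r_c^i = p(z^i = c | x^i, Θ^t) with Bayes' rule in log-space for numerical stability; the parameter names pi, mus, Sigmas are illustrative assumptions:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """E-step: R[i, c] = p(z^i = c | x^i, Theta^t) via Bayes rule."""
    n, k = len(X), len(pi)
    log_joint = np.empty((n, k))  # log p(x^i | z^i = c) + log p(z^i = c)
    for c in range(k):
        log_joint[:, c] = np.log(pi[c]) + multivariate_normal.logpdf(X, mus[c], Sigmas[c])
    # Normalize by log p(x^i | Theta^t) = logsumexp over clusters (marginalization rule).
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

# Toy usage with hypothetical parameters for k = 2 clusters.
X = np.random.randn(100, 2)
R = responsibilities(X, np.array([0.5, 0.5]),
                     [np.zeros(2), np.ones(2) * 3], [np.eye(2), np.eye(2)])
```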

Expectation Maximization for Mixture of Gaussians

For a mixture of Gaussians, the E-step computes all the r_c^i and the M-step minimizes the weighted NLL:

    π_c^{t+1} = (1/n) Σ_{i=1}^n r_c^i   (proportion of examples soft-assigned to cluster c),
    μ_c^{t+1} = ( Σ_{i=1}^n r_c^i x^i ) / ( Σ_{i=1}^n r_c^i )   (mean of examples soft-assigned to cluster c),
    Σ_c^{t+1} = ( Σ_{i=1}^n r_c^i (x^i − μ_c^{t+1})(x^i − μ_c^{t+1})^T ) / ( Σ_{i=1}^n r_c^i )   (covariance of examples soft-assigned to c).

Now you would compute the new responsibilities and repeat. Notice that there is no step size.
EM for fitting a mixture of Gaussians in action: https://www.youtube.com/watch?v=b36fzchfygu
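
A minimal numpy sketch of these M-step updates, treating the responsibility matrix R (with R[i, c] = r_c^i, e.g. from an E-step like the one sketched above) as given; the function and variable names are illustrative:

```python
import numpy as np

def m_step(X, R):
    """M-step: weighted MLE given responsibilities R (n x k),
    matching the pi, mu, Sigma updates above."""
    n, d = X.shape
    k = R.shape[1]
    Nc = R.sum(axis=0)                  # soft counts per cluster
    pi = Nc / n                         # proportion soft-assigned to each cluster
    mus = (R.T @ X) / Nc[:, None]       # weighted means
    Sigmas = np.empty((k, d, d))
    for c in range(k):
        diff = X - mus[c]               # (n, d)
        Sigmas[c] = (R[:, c, None] * diff).T @ diff / Nc[c]  # weighted covariance
    return pi, mus, Sigmas

# Toy usage with hypothetical responsibilities for 2 clusters.
X = np.random.randn(100, 2)
R = np.random.dirichlet([1.0, 1.0], size=100)   # each row sums to 1
pi, mus, Sigmas = m_step(X, R)
```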

Discussion of EM for Mixtures of Gaussians

EM and mixture models are used in a ton of applications; they are one of the default unsupervised learning methods.
EM usually doesn't reach the global optimum. Classic solution: restart the algorithm from different initializations.
The MLE for some clusters may not exist (e.g., a cluster responsible for only one point). Use MAP estimates or remove these clusters.
How do you choose the number of mixtures k? Use cross-validation or other model selection criteria.
Can you make it robust? Use mixtures of Laplace or Student's t distributions.
Are there alternatives to EM? You could use gradient descent on the NLL. Spectral and other recent methods have some global guarantees.

Summary

Expectation maximization: optimization with MAR variables, for when knowing the MAR variables would make the problem easy.
Instead of imputation, it works with soft assignments to the nuisance variables.
It maximizes the log-likelihood, weighted over all imputations of the hidden variables.
Next time: the sad truth about rain in Vancouver.

Generative Mixture Models and Mixture of Experts

The classic generative model for supervised learning uses

    p(y^i | x^i) ∝ p(x^i | y^i) p(y^i),

and typically p(x^i | y^i) is assumed Gaussian (LDA) or independent (naive Bayes). But we could allow more flexibility by using a mixture model,

    p(x^i | y^i) = Σ_{c=1}^k p(z^i = c | y^i) p(x^i | z^i = c, y^i).

Another variation is a mixture of discriminative models (like logistic regression),

    p(y^i | x^i) = Σ_{c=1}^k p(z^i = c | x^i) p(y^i | z^i = c, x^i).

This is called a mixture of experts model: each regression model becomes an "expert" for certain values of x^i.
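
A small numpy sketch of a mixture-of-experts prediction with binary labels, assuming (hypothetically) a softmax gating network with weights V and logistic regression experts with weights W; these names and the specific gating choice are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def mixture_of_experts_prob(X, V, W):
    """p(y = 1 | x) = sum_c p(z = c | x) p(y = 1 | z = c, x), with a
    softmax gating network (one column of V per expert) and
    logistic regression experts (one column of W per expert)."""
    gate = softmax(X @ V)        # (n, k): p(z^i = c | x^i)
    experts = sigmoid(X @ W)     # (n, k): p(y^i = 1 | z^i = c, x^i)
    return (gate * experts).sum(axis=1)

# Toy usage with k = 2 experts in d = 3 dimensions (random weights).
n, d, k = 5, 3, 2
X = np.random.randn(n, d)
V = np.random.randn(d, k)   # gating weights (hypothetical)
W = np.random.randn(d, k)   # expert weights (hypothetical)
print(mixture_of_experts_prob(X, V, W))
```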