Series 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning)

Exercises
Introduction to Machine Learning, SS 2018
Series 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning)
LAS Group, Institute for Machine Learning
Dept. of Computer Science, ETH Zürich
Prof. Dr. Andreas Krause
Web: http://las.inf.ethz.ch/teaching/lis-s16/
Email questions to: Anastasia Makarova, anmakaro@inf.ethz.ch

Problem 1 (EM for Naïve Bayes):

Assume that you want to train a naïve Bayes model on data with missing class labels. Specifically, there are $k$ binary variables $X_1, \ldots, X_k$ corresponding to the features, and a variable $Y$ taking on values in $\{1, 2, \ldots, m\}$ denoting the class. Let us denote the set of model parameters as $P(X_i = 1 \mid Y = y) = \theta_{i|y}$ and $P(Y = y) = \theta_y$. You are given $n$ data points $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ where $x_i \in \{0, 1\}^k$ and each $y_i$ either takes a value in $\{1, 2, \ldots, m\}$ or is marked as missing.

1. Write down the log-likelihood $\ell(\theta)$ of the data as a function of the parameters $\theta$.

2. Recall that the E-step of the EM algorithm computes the posterior over the unknown variables when we fix the parameters $\theta$. Compute these probabilities $\gamma_j(x_i) = P(Y = j \mid x_i; \theta)$ for all $j$ and all $i$ such that $y_i$ is missing.

3. Once we have the quantities $\gamma_j(\cdot)$, we can compute the M-step update, which is computed as the maximizer $\theta^*$ of $\sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \log P(x_i, Y = j; \theta)$. Show how to compute $\theta^*$. Note that there are constraints on $\theta$ to make sure that the distributions are valid (non-negative and sum up to 1).

Solution 1:

The log-likelihood is equal to
$$\ell(\theta) = \log P(D) = \sum_{i : y_i \text{ missing}} \log P(x_i; \theta) + \sum_{i : y_i \text{ observed}} \log P(x_i, y_i; \theta)$$
$$= \sum_{i : y_i \text{ missing}} \log \sum_{j=1}^m P(x_i \mid Y = j; \theta)\, P(Y = j; \theta) + \sum_{i : y_i \text{ observed}} \log P(x_i \mid Y = y_i; \theta)\, P(Y = y_i; \theta)$$
$$= \sum_{i : y_i \text{ missing}} \log \sum_{j=1}^m \theta_j \prod_{l=1}^k \theta_{l|j}^{x_{i,l}} (1 - \theta_{l|j})^{1 - x_{i,l}} + \sum_{i : y_i \text{ observed}} \log \left[ \theta_{y_i} \prod_{l=1}^k \theta_{l|y_i}^{x_{i,l}} (1 - \theta_{l|y_i})^{1 - x_{i,l}} \right].$$
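
The exercise does not ask for code, but as a quick numerical companion to this formula, here is a minimal Python/NumPy sketch that evaluates the log-likelihood. The function name, the array layout, and the convention of encoding a missing label as -1 are illustrative assumptions, not part of the exercise sheet.

import numpy as np

MISSING = -1  # illustrative encoding of a missing class label

def nb_log_likelihood(X, y, theta_y, theta_xy):
    """Log-likelihood of a naive Bayes model with partially missing labels.

    X        : (n, k) binary feature matrix
    y        : (n,) labels in {0, ..., m-1}, or MISSING
    theta_y  : (m,) class prior, theta_y[j] = P(Y = j)
    theta_xy : (m, k) Bernoulli parameters, theta_xy[j, l] = P(X_l = 1 | Y = j)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # log P(x_i | Y = j) for every point and every class: shape (n, m)
    log_px_given_y = X @ np.log(theta_xy).T + (1 - X) @ np.log(1 - theta_xy).T
    log_joint = log_px_given_y + np.log(theta_y)  # log P(x_i, Y = j)

    ll = 0.0
    for i in range(X.shape[0]):
        if y[i] == MISSING:
            # label missing: marginalize over the unknown class
            ll += np.logaddexp.reduce(log_joint[i])
        else:
            ll += log_joint[i, y[i]]
    return ll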

To compute the requested posterior probabilities, note that by Bayes' rule
$$\gamma_j(x_i) = P(y_i = j \mid x_i; \theta) = \frac{P(x_i \mid y_i = j; \theta)\, P(y_i = j; \theta)}{P(x_i; \theta)} = \frac{1}{Z}\, P(x_i \mid y_i = j; \theta)\, P(y_i = j; \theta) = \frac{1}{Z}\, \theta_j \prod_{l=1}^k \theta_{l|j}^{x_{i,l}} (1 - \theta_{l|j})^{1 - x_{i,l}}.$$
We then have to compute the normalizer $Z$ so that $\sum_{j=1}^m \gamma_j(x_i) = 1$. Note that for those data points $x_i$ for which we are given the labels $y_i$, we set the $\gamma_j(x_i)$ to be a deterministic distribution, i.e. $\gamma_j(x_i) = [j = y_i]$.

To compute the M-step update we have to optimize the following quantity
$$\sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \log P(x_i, Y = j; \theta) = \sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \left[ \log \theta_j + \sum_{l=1}^k \log \theta_{l|j}^{x_{i,l}} (1 - \theta_{l|j})^{1 - x_{i,l}} \right]$$
with respect to the parameters $\theta$. We form the Lagrangian by adding a multiplier $\lambda$ to make sure that $\sum_{j=1}^m \theta_j = 1$:
$$\mathcal{L}(\theta, \lambda) = \sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \left[ \log \theta_j + \sum_{l=1}^k \log \theta_{l|j}^{x_{i,l}} (1 - \theta_{l|j})^{1 - x_{i,l}} \right] + \lambda \Big( \sum_{j=1}^m \theta_j - 1 \Big).$$
By setting the derivatives to zero we obtain:
$$\frac{\partial \mathcal{L}(\theta, \lambda)}{\partial \theta_{l|j}} = \sum_{i :\, x_{i,l} = 1} \frac{\gamma_j(x_i)}{\theta_{l|j}} + \sum_{i :\, x_{i,l} = 0} \frac{\gamma_j(x_i)}{\theta_{l|j} - 1} = 0 \;\Longrightarrow\; \theta_{l|j} = \frac{\sum_{i=1}^n [x_{i,l} = 1]\, \gamma_j(x_i)}{\sum_{i=1}^n \gamma_j(x_i)},$$
$$\frac{\partial \mathcal{L}(\theta, \lambda)}{\partial \theta_j} = \sum_{i=1}^n \frac{\gamma_j(x_i)}{\theta_j} + \lambda = 0 \;\Longrightarrow\; \theta_j = -\frac{1}{\lambda} \sum_{i=1}^n \gamma_j(x_i).$$
From the constraint $\sum_{j=1}^m \theta_j = 1$ we find the correct multiplier to be $\lambda = -\sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) = -n$, and hence $\theta_j = \frac{1}{n} \sum_{i=1}^n \gamma_j(x_i)$.
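
For completeness, the E-step and M-step derived above can be combined into a short EM loop. The following Python/NumPy sketch is one possible implementation under assumed conventions (labels in {0, ..., m-1} with -1 marking a missing label, a fixed random initialization, and a small smoothing constant eps to avoid log(0), which is not part of the derivation).

import numpy as np

def em_naive_bayes(X, y, m, n_iter=50, missing=-1, eps=1e-9):
    """EM for naive Bayes with partially missing labels (illustrative sketch).

    X : (n, k) binary features;  y : (n,) labels in {0, ..., m-1} or `missing`.
    Returns (theta_y, theta_xy) with theta_y[j] = P(Y = j) and
    theta_xy[j, l] = P(X_l = 1 | Y = j).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, k = X.shape
    rng = np.random.default_rng(0)
    theta_y = np.full(m, 1.0 / m)
    theta_xy = rng.uniform(0.25, 0.75, size=(m, k))

    for _ in range(n_iter):
        # E-step: gamma[i, j] = P(Y = j | x_i; theta), computed via Bayes' rule
        log_joint = (X @ np.log(theta_xy).T
                     + (1 - X) @ np.log(1 - theta_xy).T
                     + np.log(theta_y))
        gamma = np.exp(log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))
        observed = y != missing
        gamma[observed] = np.eye(m)[y[observed]]  # deterministic for labelled points

        # M-step: closed-form updates from the Lagrangian derivation
        Nj = gamma.sum(axis=0)                    # effective class counts
        theta_y = Nj / n                          # theta_j = (1/n) sum_i gamma_j(x_i)
        theta_xy = (gamma.T @ X + eps) / (Nj[:, None] + 2 * eps)  # smoothed ratio

    return theta_y, theta_xy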

Problem 2 (EM for a 1D Laplacian Mixture Model):

In this problem you will derive the EM algorithm for a one-dimensional Laplacian mixture model. You are given $n$ observations $x_1, \ldots, x_n \in \mathbb{R}$ and we want to fit a mixture of $m$ Laplacians, which has the following density
$$f(x) = \sum_{j=1}^m \pi_j f_L(x; \mu_j, \beta_j), \quad \text{where} \quad f_L(x; \mu_j, \beta_j) = \frac{1}{2\beta_j} e^{-\frac{|x - \mu_j|}{\beta_j}},$$
and the mixture weights $\pi_j$ are a convex combination, i.e. $\pi_j \geq 0$ and $\sum_{j=1}^m \pi_j = 1$. For simplicity, assume that the scale parameters $\beta_j > 0$ are known beforehand and thus fixed.

1. Introduce latent variables so that we can apply the EM procedure.

2. Analogously to the previous question, write down the steps of the EM procedure for this model. If some updates cannot be written analytically, give an approach on how to compute them. (Hint: Recall a property of functions that makes them easy to optimize.)

Solution 2:

For each data point $x_i$, we introduce a latent variable $Y_i \in \{1, 2, \ldots, m\}$ denoting the component that point belongs to. For the E-step, we compute the posterior over the classes similarly to the previous problem, i.e.
$$\gamma_j(x_i) = P(y_i = j \mid x_i) \propto P(x_i \mid y_i = j)\, P(y_i = j) = \pi_j f_L(x_i; \mu_j, \beta_j).$$
Again, we have to normalize, so that the final posterior is equal to
$$\gamma_j(x_i) = \frac{\pi_j f_L(x_i; \mu_j, \beta_j)}{\sum_{l=1}^m \pi_l f_L(x_i; \mu_l, \beta_l)}.$$
In the M-step, we optimize
$$\sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \log P(x_i, y_i = j) = \sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \log \pi_j f_L(x_i; \mu_j, \beta_j) \tag{1}$$
$$= \sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \Big( \log \pi_j - \frac{1}{\beta_j} |x_i - \mu_j| \Big) + \text{const}.$$
We add a Lagrange multiplier $\lambda$ to make sure that $\sum_{j=1}^m \pi_j = 1$ and obtain the Lagrangian
$$\mathcal{L}(\pi, \mu, \lambda) = \sum_{i=1}^n \sum_{j=1}^m \gamma_j(x_i) \Big( \log \pi_j - \frac{1}{\beta_j} |x_i - \mu_j| \Big) + \lambda \Big( \sum_{j=1}^m \pi_j - 1 \Big).$$
Exactly as in the previous problem, by setting the gradient with respect to $\pi_j$ to zero, we obtain
$$\frac{\partial \mathcal{L}(\pi, \mu, \lambda)}{\partial \pi_j} = \sum_{i=1}^n \frac{\gamma_j(x_i)}{\pi_j} + \lambda = 0 \;\Longrightarrow\; \pi_j = -\frac{1}{\lambda} \sum_{i=1}^n \gamma_j(x_i).$$
The multiplier is again equal to $\lambda = -n$ and we arrive at the same equation as in the last example, $\pi_j = \frac{1}{n} \sum_{i=1}^n \gamma_j(x_i)$.

If we want to maximize (1) with respect to the variables $\mu_j$, we have to solve $m$ separate optimization problems, one for each $\mu_j$. These $m$ problems have the following form:
$$\underset{\mu_j}{\text{maximize}} \;\; -\sum_{i=1}^n \gamma_j(x_i)\, |x_i - \mu_j|.$$
These are one-dimensional convex optimization problems (the negative of the objective is easily seen to be convex). While one can try solving this via an iterative process like subgradient descent, a direct approach is also possible if we observe that the objective is piecewise linear and its breakpoints are $x_1, x_2, \ldots, x_n$. Hence, the optimum must be attained at one of these $n$ points and we can simply set $\mu_j$ to the data point $x_i$ with the largest objective value.
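
As an illustration only, here is a Python/NumPy sketch of the resulting EM loop for the 1D Laplacian mixture. The random initialization and the brute-force evaluation of the piecewise-linear objective at all n data points (quadratic work per update) are illustrative choices, not prescribed by the exercise.

import numpy as np

def em_laplace_mixture(x, beta, n_iter=100):
    """EM for a 1D Laplacian mixture with fixed, known scales beta (shape (m,)).

    Implements the updates derived above: soft assignments gamma, the
    closed-form update for pi, and mu_j chosen among the data points
    (the breakpoints of the piecewise-linear objective).
    """
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    n, m = x.size, beta.size
    rng = np.random.default_rng(0)
    pi = np.full(m, 1.0 / m)
    mu = rng.choice(x, size=m, replace=False)  # initialize means at random data points

    for _ in range(n_iter):
        # E-step: gamma[i, j] proportional to pi_j * f_L(x_i; mu_j, beta_j)
        log_dens = -np.abs(x[:, None] - mu) / beta - np.log(2 * beta)
        log_post = np.log(pi) + log_dens
        gamma = np.exp(log_post - np.logaddexp.reduce(log_post, axis=1, keepdims=True))

        # M-step for pi: pi_j = (1/n) sum_i gamma_j(x_i)
        pi = gamma.sum(axis=0) / n

        # M-step for mu_j: -sum_i gamma_j(x_i) |x_i - mu| is piecewise linear in mu,
        # so it suffices to evaluate it at the data points and take the best one.
        dist = np.abs(x[:, None] - x[None, :])   # dist[c, i] = |x_c - x_i|
        obj = -dist @ gamma                      # obj[c, j] = -sum_i gamma_j(x_i) |x_c - x_i|
        mu = x[np.argmax(obj, axis=0)]

    return pi, mu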

Problem 3 (A different perspective on EM; this is an advanced question):

In this question you will show that EM can be seen as iteratively maximizing a lower bound on the log-likelihood. We will treat any general model $P(X, Z; \theta)$ with observed variables $X$ and latent variables $Z$. For the sake of simplicity, we will assume that $Z$ is discrete and takes on values in $\{1, 2, \ldots, m\}$. If we observe $X = x$, the goal is to maximize the log-likelihood
$$\ell(\theta) = \log P(x; \theta) = \log \sum_{z=1}^m P(x, z; \theta)$$
with respect to the parameter vector $\theta$. In what follows we will denote by $Q(Z)$ any distribution over the latent variables.

1. Show that if $Q(z) > 0$ whenever $P(x, z; \theta) > 0$, then it holds that
$$\ell(\theta) \geq \mathbb{E}_Q[\log P(x, Z; \theta)] - \sum_{z=1}^m Q(z) \log Q(z).$$
(Hint: Consider using Jensen's inequality.) Hence, we have a bound on the log-likelihood parametrized by a distribution $Q(Z)$ over the latent variables.

2. Show that for a fixed $\theta$, the lower bound is maximized for $Q^*(Z) = P(Z \mid X = x; \theta)$. Moreover, show that the bound is exact (holds with equality) for this specific distribution $Q^*(Z)$. (Hint: Do not forget to add Lagrange multipliers to make sure that $Q$ is a valid distribution.)

3. Show that optimizing with respect to $Q$ and $\theta$ in an alternating manner corresponds to the EM procedure. Discuss what this implies for the convergence properties of EM.

Solution 3:

For the first part, note that
$$\ell(\theta) = \log P(x; \theta) = \log \sum_{z=1}^m P(x, z; \theta) = \log \sum_{z=1}^m Q(z) \frac{P(x, z; \theta)}{Q(z)} = \log \mathbb{E}_{Z \sim Q}\!\left[\frac{P(x, Z; \theta)}{Q(Z)}\right]$$
$$\geq \mathbb{E}_{Z \sim Q}\!\left[\log \frac{P(x, Z; \theta)}{Q(Z)}\right] = \mathbb{E}_{Z \sim Q}[\log P(x, Z; \theta)] - \sum_{z=1}^m Q(z) \log Q(z),$$
where for the inequality we have used Jensen's inequality. Now, assume that we want to maximize the above with respect to $Q$, and let us add a multiplier $\lambda$ to make sure that $Q$ sums up to 1. Then, we have the following Lagrangian
$$\mathcal{L}(Q, \lambda) = \sum_{z=1}^m Q(z) \log P(x, z; \theta) - \sum_{z=1}^m Q(z) \log Q(z) + \lambda \Big( \sum_{z=1}^m Q(z) - 1 \Big).$$

By setting the derivative of the Lagrangian with respect to $Q(z)$ to zero, we have
$$\frac{\partial \mathcal{L}(Q, \lambda)}{\partial Q(z)} = \log P(x, z; \theta) - 1 - \log Q(z) + \lambda = 0 \;\Longrightarrow\; Q(z) = P(x, z; \theta)\, e^{\lambda - 1}.$$
Hence, after normalizing we have that
$$Q^*(z) = \frac{P(x, z; \theta)}{\sum_{z'=1}^m P(x, z'; \theta)} = P(z \mid x; \theta),$$
and this is exactly the posterior $P(Z \mid x; \theta)$, which we had to show. It is also easy to see that the bound is tight, as
$$\mathbb{E}_{Z \sim Q^*}\!\left[\log \frac{P(x, Z; \theta)}{Q^*(Z)}\right] = \sum_{z=1}^m P(z \mid x; \theta) \log \frac{P(z \mid x; \theta)\, P(x; \theta)}{P(z \mid x; \theta)} = \sum_{z=1}^m P(z \mid x; \theta) \log P(x; \theta) = \log P(x; \theta).$$
Then we can see the EM algorithm as optimizing the lower bound with respect to $Q(\cdot)$ and $\theta$ in an alternating manner. Specifically, if we optimize with respect to $Q$, we have shown that the optimal $Q$ is the posterior, which is exactly the E-step. Optimizing with respect to $\theta$ for fixed $Q$ is equivalent to the M-step, since the entropy term $-\sum_z Q(z) \log Q(z)$ does not depend on $\theta$. Because each E-step makes the bound tight at the current $\theta$ and each M-step can only increase the bound, the log-likelihood is non-decreasing over the iterations; as long as it is bounded above, the sequence of log-likelihood values therefore converges.
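
As a quick numerical sanity check of the two facts used here (the Jensen lower bound and its tightness at the posterior), the following Python/NumPy snippet compares the bound for an arbitrary Q and for Q*(z) = P(z | x) on a small made-up table of joint probabilities; the numbers are arbitrary and serve only to illustrate the identities above.

import numpy as np

rng = np.random.default_rng(1)

# Toy joint values P(x, z) for one fixed observation x and z in {1, ..., m}.
m = 4
p_xz = rng.random(m)             # arbitrary positive values standing in for P(x, z)
log_px = np.log(p_xz.sum())      # log P(x) = log sum_z P(x, z)

def lower_bound(q):
    """E_Q[log P(x, Z)] - E_Q[log Q(Z)] for a distribution q over z."""
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q_random = rng.dirichlet(np.ones(m))   # some arbitrary distribution Q(Z)
q_post = p_xz / p_xz.sum()             # the posterior Q*(z) = P(z | x)

assert lower_bound(q_random) <= log_px + 1e-12   # Jensen: the bound never exceeds log P(x)
assert np.isclose(lower_bound(q_post), log_px)   # and it is tight at the posterior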