Exercises: Introduction to Machine Learning, SS 2018
Series 7, May 22, 2018 (EM Convergence)
Institute for Machine Learning, Dept. of Computer Science, ETH Zürich
Prof. Dr. Andreas Krause
Web: https://las.inf.ethz.ch/teaching/introml-s18
Email questions to: Mohammad Reza Karimi, mkarimi@ethz.ch

It is not mandatory to submit solutions; sample solutions will be published in two weeks. If you choose to submit your solution, please send an e-mail from your ethz.ch address with subject "Exercise7" containing a PDF (LaTeX or scan) to mkarimi@ethz.ch by Tuesday, May 29, 2018.

The problems marked with an asterisk (*) are intended for deeper understanding. One should look at these problems as an opportunity to gain more insight into the theory. Note that these problems are not necessarily harder than the other ones.

Problem 1 (EM for Censored Linear Regression):

Suppose you are trying to learn a model that can predict how long a program will take to run for different settings. In some situations, when the program is taking too long, you abort it and just note down the time at which you aborted. These values are lower bounds for the actual running time of the program. We call this type of data right-censored. Concretely, for an aborted run all you know is that the true running time is at least the censoring time $c_i$. Written in another way, the observation is $y_i = \min\{z_i, c_i\}$, where $z_i$ is the true running time. Our goal is to derive an EM algorithm for fitting a linear regression model to right-censored data.

(a) Let $z_i = \mu_i + \sigma \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, 1)$. Suppose that we do not observe $z_i$, but we observe the fact that it is higher than some threshold; namely, we observe the event $E = \mathbb{I}(z_i \ge c_i)$. Show that
$$\mathbb{E}[z_i \mid z_i \ge c_i] = \mu_i + \sigma R\!\left(\frac{c_i - \mu_i}{\sigma}\right)$$
and
$$\mathbb{E}[z_i^2 \mid z_i \ge c_i] = \mu_i^2 + \sigma^2 + \sigma (c_i + \mu_i)\, R\!\left(\frac{c_i - \mu_i}{\sigma}\right),$$
where we have defined
$$R(x) := \frac{\phi(x)}{1 - \Phi(x)}.$$
Here, $\phi(x)$ is the pdf of the standard Gaussian, and $\Phi(x)$ is its cdf.

(b) Derive the EM algorithm for fitting a linear regression model to right-censored data. Describe the E-step and the M-step completely.
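
Before proving the identities in part (a), it can help to sanity-check them numerically. The following is a small, purely illustrative Python sketch (not part of the exercise); the values of mu, sigma, c and the sample size are arbitrary assumptions, and it relies only on the standard normal pdf and cdf from scipy.

# Monte Carlo sanity check of the truncated-Gaussian moments in Problem 1(a).
# mu, sigma, c and the sample size are arbitrary illustrative choices.
import numpy as np
from scipy.stats import norm

mu, sigma, c = 1.0, 2.0, 2.5

def R(x):
    """Inverse Mills ratio R(x) = phi(x) / (1 - Phi(x))."""
    return norm.pdf(x) / (1.0 - norm.cdf(x))

rng = np.random.default_rng(0)
z = mu + sigma * rng.standard_normal(2_000_000)   # z_i = mu_i + sigma * eps_i
z = z[z >= c]                                     # keep only the censored region z_i >= c_i

x = (c - mu) / sigma
print(z.mean(), mu + sigma * R(x))                # empirical vs. claimed E[z | z >= c]
print((z ** 2).mean(),
      mu ** 2 + sigma ** 2 + sigma * (c + mu) * R(x))   # empirical vs. claimed E[z^2 | z >= c]

Both pairs of numbers should agree to a few decimal places, which is a useful sanity check before deriving the E-step, where exactly these conditional moments appear.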

Problem 2 (Soft k-means, Revisited):

(a) Consider the following optimization problem:
$$\max_{c \in \mathbb{R}^k} \; \sum_i v_i \log(c_i) \quad \text{s.t.} \quad c_i > 0, \; \sum_i c_i = 1,$$
where $v \in \mathbb{R}^k_+$ is a vector of non-negative weights. Check that the M-step of soft k-means includes solving such an optimization problem.

(b) Let $\bar{c} = \frac{1}{\sum_i v_i}\, v$. Verify that $\bar{c}$ is a probability vector.

(c) Show that the optimization problem is equivalent to the following problem:
$$\min_{c \in \mathbb{R}^k} \; D_{\mathrm{KL}}(\bar{c} \,\|\, c) \quad \text{s.t.} \quad c_i > 0, \; \sum_i c_i = 1.$$

(d) Using the properties of the KL divergence, prove that $\bar{c}$ is indeed the solution to the optimization problem.
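
As a quick illustration of parts (b)-(d) (not required for the exercise), one can check numerically that the normalized weight vector scores at least as high as randomly drawn probability vectors under the objective. The weights, the dimension and the random candidates in the sketch below are arbitrary assumptions.

# Numerical illustration for Problem 2: c_bar = v / sum(v) should attain the
# maximum of sum_i v_i * log(c_i) over the probability simplex.
import numpy as np

rng = np.random.default_rng(0)
v = rng.random(5)                       # non-negative weights v in R^k_+
c_bar = v / v.sum()                     # claimed maximizer (a probability vector)

def objective(c):
    return np.sum(v * np.log(c))

candidates = rng.dirichlet(np.ones(5), size=10_000)   # random points in the simplex
best_random = max(objective(c) for c in candidates)
print(objective(c_bar) >= best_random)                # expected: True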

Problem 3 (Yet another perspective on EM):

The EM algorithm is a general technique for finding maximum likelihood solutions for probabilistic models with latent variables. Take a probabilistic model in which we denote all of the observed variables by $X$ and all of the hidden variables by $Z$ (here we assume $Z$ is discrete, for the sake of simplicity). Let us assume that the joint distribution is $p(X, Z \mid \theta)$, where $\theta$ is the set of all parameters describing this distribution (e.g. for a Gaussian distribution, $\theta = (\mu, \Sigma)$). The goal is to maximize the likelihood function
$$p(X \mid \theta) = \sum_Z p(X, Z \mid \theta).$$

(a) For an arbitrary distribution $q(Z)$ over the latent variables, show that the following decomposition holds:
$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + D_{\mathrm{KL}}(q \,\|\, p_{\mathrm{post}}), \qquad (1)$$
where $p_{\mathrm{post}} = p(Z \mid X, \theta)$ is the posterior distribution. Also find the formula for $\mathcal{L}(q, \theta)$.

(b) Verify that $\mathcal{L}(q, \theta) \le \ln p(X \mid \theta)$, and that equality holds if and only if $q(Z) = p(Z \mid X, \theta)$.

(c) Suppose that the current value of the parameters is $\theta_{\mathrm{curr}}$. Verify that in the E-step, the lower bound $\mathcal{L}(q, \theta_{\mathrm{curr}})$ is maximized with respect to the distribution $q(Z)$, while keeping $\theta_{\mathrm{curr}}$ fixed. Since the left-hand side of (1) does not depend on $q(Z)$, maximizing $\mathcal{L}(q, \theta_{\mathrm{curr}})$ amounts to minimizing the KL divergence between $q$ and $p_{\mathrm{post}}$, which happens at $q = p_{\mathrm{post}}$.

(d) Verify that in the M-step, the lower bound $\mathcal{L}(q, \theta)$ is maximized with respect to $\theta$ while keeping $q(Z)$ fixed, resulting in a new value of the parameters, $\theta_{\mathrm{new}}$. This step increases the left-hand side of (1) (unless it is already at a local maximum).

(e) Substitute $q(Z) = p(Z \mid X, \theta_{\mathrm{curr}})$ in (1), and observe that
$$\mathcal{L}(q, \theta) = \mathbb{E}_q[\text{complete-data log likelihood}] + H(q),$$
where the complete-data log likelihood is $\ln p(X, Z \mid \theta)$. In other words, in the M-step we are maximizing the expectation of the complete-data log likelihood, since the entropy term is independent of $\theta$. Compare this result with EM for Gaussian mixture models.

(f) Show that the lower bound $\mathcal{L}(q, \theta)$, where $q(Z) = q^*(Z) = p(Z \mid X, \theta_{\mathrm{curr}})$, has the same gradient w.r.t. $\theta$ as the log likelihood function $\ln p(X \mid \theta)$ at the point $\theta = \theta_{\mathrm{curr}}$. This shows that the lower bound is tangent to the log likelihood function at the end of the E-step.

(g) Have you found an argument to prove the convergence of the EM algorithm?
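
A small numerical check of the decomposition (1) can make the statement concrete before proving it. The sketch below uses a toy model with a single observed configuration and a three-state latent variable; the joint probabilities and the distribution q are arbitrary example numbers, not something given in the exercise.

# Toy check of ln p(X|theta) = L(q, theta) + KL(q || p(Z|X, theta)) from (1).
# The joint table and q are arbitrary illustrative numbers.
import numpy as np

p_joint = np.array([0.10, 0.25, 0.05])   # p(X = x, Z = z | theta) for z = 1, 2, 3
p_X = p_joint.sum()                      # p(X = x | theta) = sum_Z p(X, Z | theta)
p_post = p_joint / p_X                   # posterior p(Z | X = x, theta)

q = np.array([0.2, 0.5, 0.3])            # an arbitrary distribution over Z

L = np.sum(q * np.log(p_joint / q))      # L(q, theta) = E_q[ln p(X, Z | theta)] + H(q)
kl = np.sum(q * np.log(q / p_post))      # D_KL(q || p_post)

print(np.log(p_X), L + kl)               # the two values should coincide

Replacing q by p_post in this snippet drives the KL term to zero, which is exactly the E-step described in part (c).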

Problem 4 (On Statistical Distances*):

In some situations in statistics and machine learning, the object that we optimize over is itself a probability distribution (e.g. the distribution $q$ in the E-step of the EM algorithm). This motivates us to understand a bit more about the space of probability distributions. For simplicity, let $\mathcal{P}$ be the set of all probability distributions over the set $\{1, \dots, n\}$, i.e.
$$\mathcal{P} = \Big\{(p_1, \dots, p_n) \;\Big|\; \sum_i p_i = 1, \; p_i \ge 0\Big\}.$$
Usually $\mathcal{P}$ is called the probability simplex, or simply the $n$-simplex. One can equip $\mathcal{P}$ with a metric, inducing a geometry on $\mathcal{P}$. Recall that a metric is a function $d : \mathcal{P} \times \mathcal{P} \to \mathbb{R}_+$ satisfying the following criteria:

(Non-negativity) $d(p, q) \ge 0$ for all $p, q \in \mathcal{P}$, and equality holds iff $p = q$,
(Symmetry) $d(p, q) = d(q, p)$,
(Triangle inequality) $d(p, q) + d(q, r) \ge d(p, r)$.

A metric on the probability simplex is also called a statistical distance. Here we mention a few distances and some of their properties:

(a) Total Variation Distance. For $p, q \in \mathcal{P}$, we define their TV distance as
$$D_{\mathrm{TV}}(p, q) := \frac{1}{2} \|p - q\|_1 = \frac{1}{2} \sum_{i=1}^n |p_i - q_i|.$$
Prove that the TV distance is indeed a metric, and that it equals the largest possible difference between the probabilities that the two probability distributions $p$ and $q$ can assign to the same event, i.e.
$$D_{\mathrm{TV}}(p, q) = \max_{E \subseteq \{1, \dots, n\}} |p(E) - q(E)|.$$

(b) Kullback-Leibler Divergence. For $p, q \in \mathcal{P}$, we define their KL divergence as
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i=1}^n p_i \log \frac{p_i}{q_i}.$$

(b.1) Prove that the KL divergence satisfies the first property of a metric: it is non-negative, and it is zero if and only if the distributions are equal.

(b.2) Give an example showing that $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$.

(b.3) Give a counterexample to the triangle inequality for the KL divergence.

(b.4) Prove Pinsker's inequality:
$$D_{\mathrm{TV}}(p, q) \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(p \,\|\, q)}.$$
(A small numerical illustration of (b.2) and (b.4) is sketched at the end of this problem.)

(b.5) Although the KL divergence fails to be a metric on $\mathcal{P}$, it satisfies some convergence properties. As an example, prove the following theorem: Let $p^{(1)}, p^{(2)}, \dots$ be a sequence of probability distributions in $\mathcal{P}$ such that $\lim_{n \to \infty} D_{\mathrm{KL}}(p^{(n)} \,\|\, q) = 0$, i.e. the sequence converges to $q$ with respect to the KL divergence. Prove that this sequence also converges to $q$ in the Euclidean sense, i.e. $\lim_{n \to \infty} \|p^{(n)} - q\|_2 = 0$.

(b.6) Let $X$ and $Y$ be two random variables with marginal distributions $p_X$ and $p_Y$ and joint distribution $p_{X,Y}$. If $X$ and $Y$ were independent, then we would have $p_{X,Y} = p_X p_Y$. Otherwise, if one tries to measure how far $X$ and $Y$ are from being independent, one idea is to consider $D_{\mathrm{KL}}(p_{X,Y} \,\|\, p_X p_Y)$. This value is called the mutual information between $X$ and $Y$, denoted by $I(X, Y)$. Prove that
$$I(X, Y) = H(X) - H(X \mid Y),$$
where $H(X)$ is the entropy of $X$ and $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$. From a Bayesian point of view, the mutual information quantifies how much information knowledge of $Y$ reveals about $X$.

(The entropy of a random variable $X$ is defined as $H(X) := \mathbb{E}_X[-\log p_X(X)] = -\sum_x p_X(x) \log p_X(x)$, and is a measure of the uncertainty of $X$. For example, if $X$ has the uniform distribution, it has the highest entropy. If the base of the logarithm is 2, entropy is measured in bits, suggesting the idea that one needs roughly $H(X)$ bits to encode the outcome of $X$ with zeros and ones. Convince yourself that this definition makes sense.)
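
To build intuition for parts (b.2), (b.4) and (b.6) before writing the proofs, the following purely illustrative Python sketch checks the asymmetry of the KL divergence, Pinsker's inequality, and the identity $I(X, Y) = H(X) - H(X \mid Y)$ on arbitrary example distributions (the random draws and the 2x2 joint table are assumptions, not part of the exercise).

# Numerical illustration for Problem 4: KL asymmetry, Pinsker's inequality,
# and I(X, Y) = H(X) - H(X|Y). All distributions below are arbitrary examples.
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    print(np.isclose(kl(p, q), kl(q, p)),           # (b.2): typically False
          tv(p, q) <= np.sqrt(0.5 * kl(p, q)))      # (b.4): always True

# (b.6): mutual information of an arbitrary 2x2 joint distribution
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])                     # p_{X,Y}(x, y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)       # marginals of X and Y

mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))   # D_KL(p_XY || p_X p_Y)
H_x = -np.sum(p_x * np.log(p_x))                        # H(X)
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))        # H(X|Y) = -sum p(x,y) log p(x|y)
print(mi, H_x - H_x_given_y)                            # the two values should agree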