CS229 Lecture notes. Andrew Ng


Part IX
The EM algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. We begin our discussion with a very useful result called Jensen's inequality.

1  Jensen's inequality

Let f be a function whose domain is the set of real numbers. Recall that f is a convex function if $f''(x) \geq 0$ (for all $x \in \mathbb{R}$). In the case of f taking vector-valued inputs, this is generalized to the condition that its Hessian H is positive semi-definite ($H \geq 0$). If $f''(x) > 0$ for all x, then we say f is strictly convex (in the vector-valued case, the corresponding statement is that H must be positive definite, written $H > 0$). Jensen's inequality can then be stated as follows:

Theorem. Let f be a convex function, and let X be a random variable. Then:

    $E[f(X)] \geq f(E[X])$.

Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

Recall our convention of occasionally dropping the parentheses when writing expectations, so in the theorem above, $f(EX) = f(E[X])$.

For an interpretation of the theorem, consider the figure below.
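The theorem can also be checked numerically. Here is a minimal sketch with a two-point random variable; the values of a and b and the choice $f(x) = x^2$ are made up for illustration:

```python
import numpy as np

# X takes the value a or b, each with probability 0.5 (illustrative values).
a, b = 1.0, 5.0
vals = np.array([a, b])
probs = np.array([0.5, 0.5])

f = lambda x: x ** 2                  # strictly convex: f''(x) = 2 > 0

E_X = probs @ vals                    # E[X] = 3.0
E_fX = probs @ f(vals)                # E[f(X)] = 0.5*1 + 0.5*25 = 13.0
assert E_fX >= f(E_X)                 # Jensen: E[f(X)] >= f(E[X])

# For a concave function the inequality reverses.
g = np.log                            # concave on x > 0
assert probs @ g(vals) <= g(E_X)
```

Since f here is strictly convex and X is not constant, the inequality is in fact strict (13 > 9).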

[Figure: a convex function f (solid line), with a, E[X], and b marked on the x-axis, and f(a), f(E[X]), E[f(X)], and f(b) marked on the y-axis.]

Here, f is a convex function shown by the solid line. Also, X is a random variable that has a 0.5 chance of taking the value a, and a 0.5 chance of taking the value b (indicated on the x-axis). Thus, the expected value of X is given by the midpoint between a and b.

We also see the values f(a), f(b) and f(E[X]) indicated on the y-axis. Moreover, the value E[f(X)] is now the midpoint on the y-axis between f(a) and f(b). From our example, we see that because f is convex, it must be the case that $E[f(X)] \geq f(EX)$.

Incidentally, quite a lot of people have trouble remembering which way the inequality goes, and remembering a picture like this is a good way to quickly figure out the answer.

Remark. Recall that f is [strictly] concave if and only if $-f$ is [strictly] convex (i.e., $f''(x) \leq 0$ or $H \leq 0$). Jensen's inequality also holds for concave functions f, but with the direction of all the inequalities reversed ($E[f(X)] \leq f(EX)$, etc.).

2  The EM algorithm

Suppose we have an estimation problem in which we have a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ consisting of m independent examples. We wish to fit the parameters of a model p(x, z) to the data, where the likelihood is given by

    $\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_{z} p(x^{(i)}, z; \theta)$.

But, explicitly finding the maximum likelihood estimates of the parameters θ may be hard. Here, the $z^{(i)}$'s are the latent random variables; and it is often the case that if the $z^{(i)}$'s were observed, then maximum likelihood estimation would be easy.

In such a setting, the EM algorithm gives an efficient method for maximum likelihood estimation. Maximizing ℓ(θ) explicitly might be difficult, and our strategy will be to instead repeatedly construct a lower-bound on ℓ (E-step), and then optimize that lower-bound (M-step).

For each i, let $Q_i$ be some distribution over the z's ($\sum_z Q_i(z) = 1$, $Q_i(z) \geq 0$). Consider the following:¹

    $\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$    (1)
    $= \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$    (2)
    $\geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$    (3)

The last step of this derivation used Jensen's inequality. Specifically, $f(x) = \log x$ is a concave function, since $f''(x) = -1/x^2 < 0$ over its domain $x \in \mathbb{R}^+$. Also, the term

    $\sum_{z^{(i)}} Q_i(z^{(i)}) \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$

in the summation is just an expectation of the quantity $[p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})]$ with respect to $z^{(i)}$ drawn according to the distribution given by $Q_i$. By Jensen's inequality, we have

    $f\left(E_{z^{(i)} \sim Q_i}\left[\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]\right) \geq E_{z^{(i)} \sim Q_i}\left[f\left(\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right)\right]$,

where the "$z^{(i)} \sim Q_i$" subscripts above indicate that the expectations are with respect to $z^{(i)}$ drawn from $Q_i$. This allowed us to go from Equation (2) to Equation (3).

Now, for any set of distributions $Q_i$, the formula (3) gives a lower-bound on ℓ(θ). There are many possible choices for the $Q_i$'s. Which should we choose? Well, if we have some current guess θ of the parameters, it seems

¹If $z^{(i)}$ were continuous, then $Q_i$ would be a density, and the summations over $z^{(i)}$ in our discussion are replaced with integrals over $z^{(i)}$.

natural to try to make the lower-bound tight at that value of θ. I.e., we'll make the inequality above hold with equality at our particular value of θ. (We'll see later how this enables us to prove that ℓ(θ) increases monotonically with successive iterations of EM.)

To make the bound tight for a particular value of θ, we need the step involving Jensen's inequality in our derivation above to hold with equality. For this to be true, we know it is sufficient that the expectation be taken over a constant-valued random variable. I.e., we require that

    $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$

for some constant c that does not depend on $z^{(i)}$. This is easily accomplished by choosing

    $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$.

Actually, since we know $\sum_z Q_i(z) = 1$ (because it is a distribution), this further tells us that

    $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$.

Thus, we simply set the $Q_i$'s to be the posterior distribution of the $z^{(i)}$'s given $x^{(i)}$ and the setting of the parameters θ.

Now, for this choice of the $Q_i$'s, Equation (3) gives a lower-bound on the log-likelihood ℓ that we're trying to maximize. This is the E-step. In the M-step of the algorithm, we then maximize our formula in Equation (3) with respect to the parameters to obtain a new setting of the θ's. Repeatedly carrying out these two steps gives us the EM algorithm, which is as follows:

Repeat until convergence {

    (E-step) For each i, set

        $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$.

    (M-step) Set

        $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$.

}

How do we know if this algorithm will converge? Well, suppose $\theta^{(t)}$ and $\theta^{(t+1)}$ are the parameters from two successive iterations of EM. We will now prove that $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$, which shows EM always monotonically improves the log-likelihood. The key to showing this result lies in our choice of the $Q_i$'s. Specifically, on the iteration of EM in which the parameters had started out as $\theta^{(t)}$, we would have chosen $Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$. We saw earlier that this choice ensures that Jensen's inequality, as applied to get Equation (3), holds with equality, and hence

    $\ell(\theta^{(t)}) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}$.

The parameters $\theta^{(t+1)}$ are then obtained by maximizing the right hand side of the equation above. Thus,

    $\ell(\theta^{(t+1)}) \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}$    (4)
    $\geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}$    (5)
    $= \ell(\theta^{(t)})$    (6)

This first inequality comes from the fact that

    $\ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$

holds for any values of $Q_i$ and θ, and in particular holds for $Q_i = Q_i^{(t)}$, $\theta = \theta^{(t+1)}$. To get Equation (5), we used the fact that $\theta^{(t+1)}$ is chosen explicitly to be

    $\arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i^{(t)}(z^{(i)})}$,

and thus this formula evaluated at $\theta^{(t+1)}$ must be equal to or larger than the same formula evaluated at $\theta^{(t)}$. Finally, the step used to get (6) was shown earlier, and follows from $Q_i^{(t)}$ having been chosen to make Jensen's inequality hold with equality at $\theta^{(t)}$.
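The two facts used in this proof — that (3) lower-bounds ℓ(θ) for any choice of the $Q_i$'s, and that the bound is tight when $Q_i$ is the posterior — can be checked numerically on a toy model. A minimal sketch, assuming a made-up joint distribution over a binary latent z and a single fixed observation x:

```python
import numpy as np

# Made-up joint p(x, z; theta) for one fixed observation x and binary z.
p_z = np.array([0.3, 0.7])            # p(z; theta)
p_x_given_z = np.array([0.5, 0.1])    # p(x | z; theta)
p_xz = p_z * p_x_given_z              # joint p(x, z; theta)
log_px = np.log(p_xz.sum())           # log p(x; theta)

def lower_bound(Q):
    # sum_z Q(z) log( p(x, z; theta) / Q(z) ), i.e. formula (3) for one example
    return float(np.sum(Q * (np.log(p_xz) - np.log(Q))))

rng = np.random.default_rng(0)
for _ in range(5):
    Q = rng.dirichlet(np.ones(2))     # an arbitrary distribution over z
    assert lower_bound(Q) <= log_px + 1e-12

posterior = p_xz / p_xz.sum()         # Q(z) = p(z | x; theta)
assert abs(lower_bound(posterior) - log_px) < 1e-12   # tight at the posterior
```

Any other choice of Q gives a strictly smaller value, which is exactly why the E-step picks the posterior.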

Hence, EM causes the likelihood to converge monotonically. In our description of the EM algorithm, we said we'd run it until convergence. Given the result that we just showed, one reasonable convergence test would be to check if the increase in ℓ(θ) between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving ℓ(θ) too slowly.

Remark. If we define

    $J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$,

then we know $\ell(\theta) \geq J(Q, \theta)$ from our previous derivation. EM can also be viewed as a coordinate ascent on J, in which the E-step maximizes it with respect to Q (check this yourself), and the M-step maximizes it with respect to θ.

3  Mixture of Gaussians revisited

Armed with our general definition of the EM algorithm, let's go back to our old example of fitting the parameters φ, μ and Σ in a mixture of Gaussians. For the sake of brevity, we carry out the derivations for the M-step updates only for φ and μ, and leave the updates for Σ as an exercise for the reader.

The E-step is easy. Following our algorithm derivation above, we simply calculate

    $w_j^{(i)} = Q_i(z^{(i)} = j) = P(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$.

Here, $Q_i(z^{(i)} = j)$ denotes the probability of $z^{(i)}$ taking the value j under the distribution $Q_i$.

Next, in the M-step, we need to maximize, with respect to our parameters φ, μ, Σ, the quantity

    $\sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})}$
    $= \sum_{i=1}^m \sum_{j=1}^k Q_i(z^{(i)} = j) \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)}$
    $= \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \phi_j}{w_j^{(i)}}$
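The E-step computation above can be sketched in code for the one-dimensional case (so n = 1 and Σ reduces to a scalar variance per component). The data points, parameter values, and the helper name `e_step` are all made up for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # Density of N(mu, sigma2) evaluated at x (1-D case).
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def e_step(x, phi, mu, sigma2):
    # w[i, j] = P(z_i = j | x_i) ∝ p(x_i | z_i = j) p(z_i = j), by Bayes' rule.
    w = gaussian_pdf(x[:, None], mu[None, :], sigma2[None, :]) * phi[None, :]
    return w / w.sum(axis=1, keepdims=True)

x = np.array([-2.0, -1.9, 2.0, 2.1])        # illustrative data
phi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
sigma2 = np.array([1.0, 1.0])
w = e_step(x, phi, mu, sigma2)
assert np.allclose(w.sum(axis=1), 1.0)      # each row is a distribution over j
assert w[0, 0] > 0.99 and w[2, 1] > 0.99    # points near a mean get assigned to it
```

These $w_j^{(i)}$ values are exactly the weights appearing in the M-step quantity displayed above.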

Let's maximize this with respect to $\mu_l$. If we take the derivative with respect to $\mu_l$, we find

    $\nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2}|\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1}(x^{(i)} - \mu_j)\right) \phi_j}{w_j^{(i)}}$
    $= -\nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1}(x^{(i)} - \mu_j)$
    $= \frac{1}{2} \sum_{i=1}^m w_l^{(i)} \nabla_{\mu_l} \left[ 2\mu_l^T \Sigma_l^{-1} x^{(i)} - \mu_l^T \Sigma_l^{-1} \mu_l \right]$
    $= \sum_{i=1}^m w_l^{(i)} \left( \Sigma_l^{-1} x^{(i)} - \Sigma_l^{-1} \mu_l \right)$

Setting this to zero and solving for $\mu_l$ therefore yields the update rule

    $\mu_l := \frac{\sum_{i=1}^m w_l^{(i)} x^{(i)}}{\sum_{i=1}^m w_l^{(i)}}$,

which was what we had in the previous set of notes.

Let's do one more example, and derive the M-step update for the parameters $\phi_j$. Grouping together only the terms that depend on $\phi_j$, we find that we need to maximize

    $\sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \phi_j$.

However, there is an additional constraint that the $\phi_j$'s sum to 1, since they represent the probabilities $\phi_j = p(z^{(i)} = j; \phi)$. To deal with the constraint that $\sum_{j=1}^k \phi_j = 1$, we construct the Lagrangian

    $\mathcal{L}(\phi) = \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \phi_j + \beta\left(\sum_{j=1}^k \phi_j - 1\right)$,

where β is the Lagrange multiplier.² Taking derivatives, we find

    $\frac{\partial}{\partial \phi_l} \mathcal{L}(\phi) = \sum_{i=1}^m \frac{w_l^{(i)}}{\phi_l} + \beta$

²We don't need to worry about the constraint that $\phi_j \geq 0$, because, as we'll shortly see, the solution we'll find from this derivation will automatically satisfy it anyway.

Setting this to zero and solving, we get

    $\phi_l = \frac{\sum_{i=1}^m w_l^{(i)}}{-\beta}$

I.e., $\phi_l \propto \sum_{i=1}^m w_l^{(i)}$. Using the constraint that $\sum_j \phi_j = 1$, we easily find that $-\beta = \sum_{j=1}^k \sum_{i=1}^m w_j^{(i)} = \sum_{i=1}^m 1 = m$. (This used the fact that $w_j^{(i)} = Q_i(z^{(i)} = j)$, and since probabilities sum to 1, $\sum_j w_j^{(i)} = 1$.) We therefore have our M-step update for the parameters $\phi_l$:

    $\phi_l := \frac{1}{m} \sum_{i=1}^m w_l^{(i)}$.

The derivation for the M-step update to Σ is also entirely straightforward.
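Putting the E-step together with the M-step updates for φ and μ (plus a variance update of the kind the notes leave as an exercise) gives a compact 1-D mixture-of-Gaussians EM loop. This is a sketch, not the notes' own code: the data, initialization, and helper names are made up, and the final assertion checks the monotonic improvement of ℓ(θ) proved in Section 2:

```python
import numpy as np

def densities(x, mu, sigma2):
    # N(mu_j, sigma2_j) density at each x_i; result has shape (m, k).
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def log_likelihood(x, phi, mu, sigma2):
    return float(np.log((densities(x, mu, sigma2) * phi).sum(axis=1)).sum())

def em_step(x, phi, mu, sigma2):
    w = densities(x, mu, sigma2) * phi
    w /= w.sum(axis=1, keepdims=True)          # E-step: w[i, j] = Q_i(z = j)
    n_j = w.sum(axis=0)
    phi = n_j / len(x)                          # phi_j := (1/m) sum_i w[i, j]
    mu = (w * x[:, None]).sum(axis=0) / n_j     # mu_j := weighted mean of the data
    sigma2 = (w * (x[:, None] - mu) ** 2).sum(axis=0) / n_j   # exercise's update
    return phi, mu, sigma2

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
phi, mu, sigma2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

lls = []
for _ in range(30):
    phi, mu, sigma2 = em_step(x, phi, mu, sigma2)
    lls.append(log_likelihood(x, phi, mu, sigma2))

# EM monotonically improves the log-likelihood (Section 2's result).
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
assert mu[0] < -2.0 and mu[1] > 2.0            # means move toward the two clusters
```

A reasonable convergence test, per the notes, would replace the fixed 30 iterations with a check that successive values in `lls` differ by less than a tolerance.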