IEOR E4570: Machine Learning for OR&FE, Spring 2015, © 2015 by Martin Haugh
The EM Algorithm


The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data are missing. More generally, the EM algorithm can also be applied when there is latent, i.e. unobserved, data which was never intended to be observed in the first place. In that case we simply treat the latent data as missing and proceed to apply the EM algorithm. The EM algorithm has many applications throughout statistics. It is often used, for example, in machine learning and data mining applications, and in Bayesian statistics where it is often used to obtain the mode of the posterior marginal distributions of parameters.

1 The Classical EM Algorithm

We begin by assuming that the complete data-set consists of Z = (X, Y) but that only X is observed. The complete-data log-likelihood is then denoted by l(θ; X, Y), where θ is the unknown parameter vector for which we wish to find the MLE.

E-Step: The E-step of the EM algorithm computes the expected value of l(θ; X, Y) given the observed data, X, and the current parameter estimate, θ_old say. In particular, we define

    Q(θ; θ_old) := E[ l(θ; X, Y) | X, θ_old ] = ∫ l(θ; X, y) p(y | X, θ_old) dy        (1)

where p(· | X, θ_old) is the conditional density of Y given the observed data, X, and assuming θ = θ_old.

M-Step: The M-step consists of maximizing over θ the expectation computed in (1). That is, we set

    θ_new := arg max_θ Q(θ; θ_old).

We then set θ_old = θ_new. The two steps are repeated as necessary until the sequence of θ_new's converges. Indeed, under very general circumstances convergence to a local maximum can be guaranteed, and we explain why this is the case below. If it is suspected that the log-likelihood function has multiple local maxima, then the EM algorithm should be run many times, using a different starting value of θ_old on each occasion. The ML estimate of θ is then taken to be the best of the set of local maxima obtained from the various runs of the EM algorithm.
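In code, the classical EM iteration has the following generic shape. This is only a schematic sketch of my own, not part of the original notes: the callables e_step and m_step are hypothetical placeholders for the model-specific calculations just described, and concrete instances for the examples below are sketched later.

    import numpy as np

    def em(x, theta_init, e_step, m_step, n_iter=100, tol=1e-8):
        """Generic EM loop (sketch). e_step(x, theta_old) returns whatever expected
        sufficient statistics are needed to form Q(.; theta_old); m_step(stats)
        returns argmax_theta Q(theta; theta_old). theta is a scalar or numpy array."""
        theta = theta_init
        for _ in range(n_iter):
            stats = e_step(x, theta)       # E-step: expectations under p(y | x, theta_old)
            theta_new = m_step(stats)      # M-step: maximize Q(theta; theta_old)
            if np.max(np.abs(np.asarray(theta_new) - np.asarray(theta))) < tol:
                break
            theta = theta_new
        return theta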

Why Does the EM Algorithm Work?

We use p(·) to denote a generic conditional PDF. Now observe that

    l(θ; X) = ln p(X | θ)
            = ln ∫ p(X, y | θ) dy
            = ln ∫ [ p(X, y | θ) / p(y | X, θ_old) ] p(y | X, θ_old) dy
            = ln E[ p(X, Y | θ) / p(Y | X, θ_old) | X, θ_old ]
            ≥ E[ ln( p(X, Y | θ) / p(Y | X, θ_old) ) | X, θ_old ]                        (2)
            = E[ ln p(X, Y | θ) | X, θ_old ] − E[ ln p(Y | X, θ_old) | X, θ_old ]
            = Q(θ; θ_old) − E[ ln p(Y | X, θ_old) | X, θ_old ]                           (3)

where (2) follows from Jensen's inequality since the ln function is concave. It is also clear (because the term inside the expectation becomes a constant) that the inequality in (2) becomes an equality if we take θ = θ_old. Letting g(θ | θ_old) denote the right-hand side of (3), we therefore have l(θ; X) ≥ g(θ | θ_old) for all θ, with equality when θ = θ_old. Therefore any value of θ that increases g(θ | θ_old) beyond g(θ_old | θ_old) must also increase l(θ; X) beyond l(θ_old; X). The M-step finds such a θ by maximizing Q(θ; θ_old) over θ, which is equivalent (why?) to maximizing g(θ | θ_old) over θ. It is also worth mentioning that in many applications the function Q(θ; θ_old) will be a concave function of θ and therefore easy to maximize.

2 Examples

Example 1 (Missing Data in a Multinomial Model)
Suppose x := (x_1, x_2, x_3, x_4) is a sample from a Mult(n, π_θ) distribution where

    π_θ = ( 1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ).

The likelihood, L(θ; x), is then given by

    L(θ; x) = [ n! / (x_1! x_2! x_3! x_4!) ] (1/2 + θ/4)^{x_1} ((1 − θ)/4)^{x_2} ((1 − θ)/4)^{x_3} (θ/4)^{x_4}

so that the log-likelihood l(θ; x) is

    l(θ; x) = C + x_1 ln(1/2 + θ/4) + (x_2 + x_3) ln(1 − θ) + x_4 ln(θ)

where C is a constant that does not depend on θ. We could try to maximize l(θ; x) over θ directly using standard non-linear optimization algorithms. However, in this example we will perform the optimization instead using the EM algorithm. To do this we assume that the complete data is given by y := (y_1, y_2, y_3, y_4, y_5) and that y has a Mult(n, π*_θ) distribution where

    π*_θ = ( 1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ).

However, instead of observing y we only observe (y_1 + y_2, y_3, y_4, y_5), i.e. we only observe x. We therefore take X = (y_1 + y_2, y_3, y_4, y_5) and take Y = y_2. The log-likelihood of the complete data is then given by

    l(θ; X, Y) = C + y_2 ln(θ) + (y_3 + y_4) ln(1 − θ) + y_5 ln(θ)

where again C is a constant containing all terms that do not depend on θ. It is also clear that the conditional distribution of Y satisfies

    y_2 | X, θ  ~  Bin( y_1 + y_2, (θ/4) / (1/2 + θ/4) ).

We can now implement the E-step and M-step.

E-Step: Recalling that Q(θ; θ_old) := E[ l(θ; X, Y) | X, θ_old ], we have

    Q(θ; θ_old) = C + E[ y_2 | X, θ_old ] ln(θ) + (y_3 + y_4) ln(1 − θ) + y_5 ln(θ)
                = C + (y_1 + y_2) p_old ln(θ) + (y_3 + y_4) ln(1 − θ) + y_5 ln(θ)

where

    p_old := (θ_old / 4) / (1/2 + θ_old / 4).                                            (4)

M-Step: We now maximize Q(θ; θ_old) to find θ_new. Taking the derivative we obtain

    dQ/dθ = (y_1 + y_2) p_old / θ − (y_3 + y_4) / (1 − θ) + y_5 / θ

which is zero when we take θ = θ_new where

    θ_new := ( y_5 + p_old (y_1 + y_2) ) / ( y_3 + y_4 + y_5 + p_old (y_1 + y_2) ).      (5)

Equations (4) and (5) now define the EM iteration, which begins with some (judiciously) chosen value of θ_old.
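As a concrete illustration, the iteration defined by (4) and (5) takes only a few lines of code. The sketch below is my own; the example counts (125, 18, 20, 34) are the classic genetic-linkage data often used with exactly this model and are not taken from these notes.

    def em_multinomial(x, theta0=0.5, n_iter=100, tol=1e-12):
        """EM iteration (4)-(5) for Example 1. x = (x1, x2, x3, x4) are the observed
        multinomial counts; note that x1 = y1 + y2 in the notation of the example."""
        x1, x2, x3, x4 = x
        theta = theta0
        for _ in range(n_iter):
            # E-step: expected fraction of the first cell that falls in the theta/4 sub-cell.
            p_old = (theta / 4.0) / (0.5 + theta / 4.0)                     # equation (4)
            # M-step: closed-form maximizer of Q(theta; theta_old).
            theta_new = (x4 + p_old * x1) / (x2 + x3 + x4 + p_old * x1)     # equation (5)
            if abs(theta_new - theta) < tol:
                break
            theta = theta_new
        return theta

    print(em_multinomial((125, 18, 20, 34)))   # converges to roughly 0.627 for these counts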

Example 2 (A Simple Normal-Mixture Model)
An extremely common application of the EM algorithm is to compute the MLE of normal mixture models. This is often used, for example, in clustering algorithms. Suppose that X = (X_1, ..., X_n) are IID random variables, each with PDF

    f_x(x) = Σ_{j=1}^m p_j exp( −(x − µ_j)² / (2σ_j²) ) / sqrt(2π σ_j²)

where p_j ≥ 0 for all j and Σ_j p_j = 1. The parameters in this model are the p_j's, the µ_j's and the σ_j's. Instead of trying to find the maximum likelihood estimates of these parameters directly via numerical optimization, we can use the EM algorithm. We do this by assuming the presence of an additional random variable, Y say, where P(Y = j) = p_j for j = 1, ..., m. The realized value of Y then determines which of the m normal distributions generates the corresponding value of X. There are n such random variables, Y := (Y_1, ..., Y_n). Note that

    f_{x|y}(x_i | y_i = j, θ) = exp( −(x_i − µ_j)² / (2σ_j²) ) / sqrt(2π σ_j²)           (6)

where θ := (p_1, ..., p_m, µ_1, ..., µ_m, σ_1, ..., σ_m) is the unknown parameter vector and the complete-data likelihood is given by

    L(θ; X, Y) = Π_{i=1}^n p_{y_i} exp( −(x_i − µ_{y_i})² / (2σ_{y_i}²) ) / sqrt(2π σ_{y_i}²).

The EM algorithm starts with an initial guess, θ_old, and then iterates the E-step and M-step as described below until convergence.

E-Step: We need to compute Q(θ; θ_old) := E[ l(θ; X, Y) | X, θ_old ]. Indeed it is straightforward to show that

    Q(θ; θ_old) = Σ_{i=1}^n Σ_{j=1}^m P(Y_i = j | x_i, θ_old) ln( f_{x|y}(x_i | y_i = j, θ) P(Y_i = j | θ) ).   (7)

Note that f_{x|y}(x_i | y_i = j, θ) is given by (6) and that P(Y_i = j | θ_old) = p_{j,old}. Finally, since

    P(Y_i = j | x_i, θ_old) = P(Y_i = j, X_i = x_i | θ_old) / P(X_i = x_i | θ_old)
                            = f_{x|y}(x_i | y_i = j, θ_old) P(Y_i = j | θ_old) / Σ_{k=1}^m f_{x|y}(x_i | y_i = k, θ_old) P(Y_i = k | θ_old)

it is clear that we can compute (7) analytically.

M-Step: We can now maximize Q(θ; θ_old) by setting the vector of partial derivatives, ∂Q/∂θ, equal to 0 and solving for θ_new. After some algebra we obtain

    µ_{j,new} = Σ_i x_i P(Y_i = j | x_i, θ_old) / Σ_i P(Y_i = j | x_i, θ_old)                      (8)

    σ²_{j,new} = Σ_i (x_i − µ_{j,new})² P(Y_i = j | x_i, θ_old) / Σ_i P(Y_i = j | x_i, θ_old)       (9)

    p_{j,new} = (1/n) Σ_i P(Y_i = j | x_i, θ_old).                                                 (10)

Given an initial estimate, θ_old, the EM algorithm cycles through (8) to (10) repeatedly, setting θ_old = θ_new after each cycle, until the estimates converge.
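The updates (8)-(10) translate directly into code. The following is a minimal sketch of my own for the one-dimensional mixture above; the array gamma holds the responsibilities P(Y_i = j | x_i, θ_old), and the synthetic data set is invented purely for illustration.

    import numpy as np

    def em_gmm_1d(x, m, n_iter=200, seed=0):
        """EM for a one-dimensional Gaussian mixture with m components,
        implementing the updates (8), (9) and (10)."""
        x = np.asarray(x, dtype=float)
        rng = np.random.default_rng(seed)
        n = len(x)
        p = np.full(m, 1.0 / m)                        # mixing weights p_j
        mu = rng.choice(x, size=m, replace=False)      # initialize means at data points
        sigma2 = np.full(m, np.var(x))                 # common initial variance
        for _ in range(n_iter):
            # E-step: responsibilities gamma[i, j] = P(Y_i = j | x_i, theta_old).
            dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
            gamma = p * dens
            gamma /= gamma.sum(axis=1, keepdims=True)
            nj = gamma.sum(axis=0)
            # M-step: equations (8), (9), (10).
            mu = (gamma * x[:, None]).sum(axis=0) / nj
            sigma2 = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj
            p = nj / n
        return p, mu, sigma2

    # Synthetic data from two well-separated components (invented for illustration).
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
    print(em_gmm_1d(data, m=2))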

3 A More General Version of the EM Algorithm

The EM algorithm is often stated more generally using the language of information theory.¹ In this section we will describe this more general formulation and relate it back to the EM algorithm as described in Section 1. As before, the goal is to maximize the likelihood function, L(θ; X), which is given by²

    L(θ; X) = p(X | θ) = ∫ p(X, y | θ) dy.                                               (11)

The implicit assumption underlying the EM algorithm is that it is difficult to optimize p(X | θ) with respect to θ but that it is much easier to optimize p(X, Y | θ). We first introduce an arbitrary distribution, q(Y), over the latent variables, Y, and note that we can decompose the log-likelihood, l(θ; X), into two terms (the first of which is sometimes called the energy term) according to

    l(θ; X) := ln p(X | θ) = L(q, θ) + KL(q ‖ p_{Y|X})                                   (12)

where L(q, θ) and KL(q ‖ p_{Y|X}) are the energy term and the Kullback-Leibler (KL) divergence,³ respectively, given by

    L(q, θ) = Σ_Y q(Y) ln( p(X, Y | θ) / q(Y) )                                          (13)
    KL(q ‖ p_{Y|X}) = − Σ_Y q(Y) ln( p(Y | X, θ) / q(Y) ).

It is well known (see the Appendix) that the KL divergence satisfies KL(q ‖ p_{Y|X}) ≥ 0 and equals 0 if and only if q(Y) = p_{Y|X}. It therefore follows that L(q, θ) ≤ l(θ; X) for all distributions q(·).

¹ The material in this section is drawn from Pattern Recognition and Machine Learning (2006) by Chris Bishop.
² If Y is discrete then we replace the integral in (11) with a summation.
³ The KL divergence is also often called the relative entropy.
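Before moving on, it may help to verify the decomposition (12) numerically. The sketch below, my own, does so for a tiny made-up discrete model with a single observed x; the joint table and the choice of q(·) are arbitrary.

    import numpy as np

    # Hypothetical joint probabilities p(X = x_obs, Y = y | theta) for y = 0, 1, 2.
    p_xy = np.array([0.10, 0.25, 0.15])
    p_x = p_xy.sum()                      # p(x_obs | theta)
    p_y_given_x = p_xy / p_x              # p(y | x_obs, theta)

    q = np.array([0.2, 0.5, 0.3])         # an arbitrary distribution q(Y)

    log_lik = np.log(p_x)                                  # l(theta; x)
    energy = np.sum(q * np.log(p_xy / q))                  # L(q, theta), equation (13)
    kl = np.sum(q * np.log(q / p_y_given_x))               # KL(q || p_{Y|X})

    print(np.isclose(log_lik, energy + kl))                # True: the decomposition (12)
    print(energy <= log_lik)                               # True: L(q, theta) is a lower bound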

We can now use the decomposition (12) to define the EM algorithm. We begin with an initial parameter estimate, θ_old.

E-Step: The E-step maximizes the lower bound, L(q, θ_old), with respect to q(·) while keeping θ_old fixed. In principle this is a variational problem since we are optimizing a functional, but the solution is easily found. First note that l(θ_old; X) does not depend on q(·). It then follows from (12) (with θ = θ_old) that maximizing L(q, θ_old) is equivalent to minimizing KL(q ‖ p_{Y|X}). Since this latter term is always non-negative, we see that L(q, θ_old) is optimized when KL(q ‖ p_{Y|X}) = 0 which, by our earlier observation, is the case when we take q(Y) = p(Y | X, θ_old). At this point the lower bound, L(q, θ_old), equals the current value of the log-likelihood, l(θ_old; X).

M-Step: In the M-step we keep q(Y) fixed and maximize L(q, θ) over θ to obtain θ_new. This will therefore cause the lower bound to increase (if it is not already at a maximum), which in turn means that the log-likelihood must also increase. Moreover, at this new value θ_new it will no longer be the case that KL(q ‖ p_{Y|X}) = 0, and so by (12) the increase in the log-likelihood will be greater than the increase in the lower bound.

Comparing the General EM Algorithm with the Classical EM Algorithm

It is instructive to compare the E-step and M-step of the general EM algorithm with the corresponding steps of Section 1. To do this, first substitute q(Y) = p(Y | X, θ_old) into (13) to obtain

    L(q, θ) = Q(θ; θ_old) + constant                                                     (14)

where Q(θ; θ_old) is the expected complete-data log-likelihood as defined in (1), with the expectation taken assuming θ = θ_old. The M-step of the general EM algorithm is therefore identical to the M-step of Section 1, since the constant term in (14) does not depend on θ. The E-step in the general EM algorithm takes q(Y) = p(Y | X, θ_old) which, at first glance, appears to be different from the E-step in Section 1. But there is no practical difference: the E-step in Section 1 simply uses p(Y | X, θ_old) to compute Q(θ; θ_old) and, while not explicitly stated, the general E-step must also do this since it is required for the M-step.

3.1 The Case of Independent and Identically Distributed Observations

When the data set consists of N IID observations, X = {x_n}, with corresponding latent or unobserved variables, Y = {y_n}, we can simplify the calculation of p(Y | X, θ_old). In particular we obtain

    p(Y | X, θ_old) = p(X, Y | θ_old) / Σ_Y p(X, Y | θ_old)
                    = Π_{n=1}^N p(x_n, y_n | θ_old) / Σ_Y Π_{n=1}^N p(x_n, y_n | θ_old)
                    = Π_{n=1}^N p(x_n, y_n | θ_old) / Π_{n=1}^N p(x_n | θ_old)
                    = Π_{n=1}^N p(y_n | x_n, θ_old)

so that the posterior distribution also factorizes, which makes its calculation much easier. Note that the E-step of Example 2 is consistent with this observation.

3.2 Bayesian Applications

The EM algorithm can also be used to compute the mode of the posterior distribution, p(θ | X), in a Bayesian setting where we are given a prior, p(θ), on the unknown parameter (vector), θ. To see this, first note that we can write p(θ | X) = p(X | θ) p(θ) / p(X), which upon taking logs yields

    ln p(θ | X) = ln p(X | θ) + ln p(θ) − ln p(X).                                       (15)

If we now use (12) to substitute for ln p(X | θ) on the right-hand side of (15) we obtain

    ln p(θ | X) = L(q, θ) + KL(q ‖ p_{Y|X}) + ln p(θ) − ln p(X).                         (16)

We can now find the posterior mode of ln p(θ | X) using a version of the EM algorithm. The E-step is exactly the same as before since the final two terms on the right-hand side of (16) do not depend on q(·). The M-step, where we keep q(·) fixed and optimize over θ, must be modified, however, to include the ln p(θ) term.
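To make the modified M-step concrete, the sketch below revisits Example 1 with a hypothetical Beta(a, b) prior on θ; this prior is my own choice for illustration, not part of the notes. Adding ln p(θ) = (a − 1) ln θ + (b − 1) ln(1 − θ) + const to Q(θ; θ_old) simply adds pseudo-counts to the numerator and denominator of (5).

    def map_em_multinomial(x, a=2.0, b=2.0, theta0=0.5, n_iter=100, tol=1e-12):
        """Posterior-mode EM for Example 1 under a hypothetical Beta(a, b) prior on theta
        (a, b >= 1 assumed). The M-step now maximizes Q(theta; theta_old) + ln p(theta)."""
        x1, x2, x3, x4 = x
        theta = theta0
        for _ in range(n_iter):
            p_old = (theta / 4.0) / (0.5 + theta / 4.0)     # E-step, exactly as in (4)
            num = x1 * p_old + x4 + a - 1.0                 # prior adds (a - 1) pseudo-counts
            den = num + x2 + x3 + b - 1.0                   # ... and (b - 1) to the other cells
            theta_new = num / den
            if abs(theta_new - theta) < tol:
                break
            theta = theta_new
        return theta

    # With a = b = 1 (a flat prior) this reduces to the ML iteration of Example 1.
    print(map_em_multinomial((125, 18, 20, 34), a=1.0, b=1.0))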

There are also related methods that can be used to estimate the variance-covariance matrix, Σ, of θ. In this case it is quite common to approximate the posterior distribution of θ with a Gaussian distribution centered at the mode and with variance-covariance matrix Σ. This is called a Laplacian approximation and it is a simple but commonly used framework for deterministic inference. It only works well, however, when the posterior is unimodal with contours that are approximately elliptical. It is also worth pointing out, however, that it is often straightforward to compute the mode of the posterior and determine a suitable Σ for the Gaussian approximation, so that the Laplacian approximation need not rely on the EM algorithm.

We also note in passing that the decomposition in (12) also forms the basis of another commonly used method of deterministic inference called variational Bayes. The goal with variational Bayes is to select q(·) from some parametric family of distributions, Q, to approximate p(Y | X). The dependence on θ is omitted since we are now in a Bayesian setting and θ can be subsumed into the latent or hidden variables, Y. In choosing q(·) we seek to maximize the lower bound, L(q), or equivalently, by (12), to minimize KL(q ‖ p_{Y|X}). A common choice of Q is the set of distributions under which the latent variables are independent.

Appendix: Kullback-Leibler Divergence

Let P and Q be two probability distributions such that if P(x) > 0 then Q(x) > 0. The Kullback-Leibler (KL) divergence or relative entropy of Q from P is defined to be

    KL(P ‖ Q) = Σ_x P(x) ln( P(x) / Q(x) )                                               (17)

with the understanding⁴ that 0 ln 0 = 0. The KL divergence is a fundamental concept in information theory and machine learning. One can imagine P representing some true but unknown distribution that we approximate with Q, and that KL(P ‖ Q) measures the distance between P and Q. This interpretation is valid because we will see below that KL(P ‖ Q) ≥ 0 with equality if and only if P = Q. Note, however, that the KL divergence is not a true measure of distance since it is asymmetric, in that KL(P ‖ Q) ≠ KL(Q ‖ P), and does not satisfy the triangle inequality.

In order to see that KL(P ‖ Q) ≥ 0 we first recall that a function f(·) is convex on R if

    f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)    for all α ∈ [0, 1].

We also recall Jensen's inequality:

Jensen's Inequality: Let f(·) be a convex function on R and suppose E[X] < ∞ and E[f(X)] < ∞. Then f(E[X]) ≤ E[f(X)].

Noting that −ln(x) is a convex function, we have

    KL(P ‖ Q) = − Σ_x P(x) ln( Q(x) / P(x) )
              ≥ − ln( Σ_x P(x) Q(x) / P(x) )        by Jensen's inequality
              = − ln( Σ_x Q(x) )
              ≥ 0.

Moreover it is clear from (17) that KL(P ‖ Q) = 0 if P = Q. In fact, because −ln(x) is strictly convex, it is easy to see that KL(P ‖ Q) = 0 only if P = Q.

⁴ This is consistent with the fact that lim_{x→0} x ln x = 0.
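The properties just established are easy to check numerically; the sketch and the two distributions below are arbitrary examples of my own.

    import numpy as np

    def kl(p, q):
        """KL(P || Q) = sum_x P(x) ln(P(x)/Q(x)), with the convention 0 ln 0 = 0; see (17)."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0                          # terms with P(x) = 0 contribute nothing
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = np.array([0.8, 0.1, 0.1])
    q = np.array([1/3, 1/3, 1/3])

    print(kl(p, q) >= 0 and kl(q, p) >= 0)    # True: non-negativity
    print(np.isclose(kl(p, q), kl(q, p)))     # False: the divergence is asymmetric
    print(np.isclose(kl(p, p), 0.0))          # True: KL(P || P) = 0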

Example 3 (An Optimization Trick that's Worth Remembering)
Suppose c is a non-negative vector in R^n, i.e. c ∈ R^n_+, and we wish to maximize Σ_i c_i ln(q_i) over probability mass functions q = (q_1, ..., q_n). Let p = (p_1, ..., p_n) where p_i := c_i / Σ_j c_j, so that p is a (known) pmf. We then have:

    max_q Σ_{i=1}^n c_i ln(q_i) = ( Σ_{i=1}^n c_i ) max_q { Σ_{i=1}^n p_i ln(q_i) }
                                = ( Σ_{i=1}^n c_i ) max_q { Σ_{i=1}^n p_i ln(p_i) − Σ_{i=1}^n p_i ln( p_i / q_i ) }
                                = ( Σ_{i=1}^n c_i ) ( Σ_{i=1}^n p_i ln(p_i) − min_q KL(p ‖ q) )

from which it follows (why?) that the optimal q satisfies q = p.
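The conclusion of Example 3 is easy to check numerically: for an arbitrary non-negative vector c, the objective evaluated at q = p dominates its value at randomly sampled alternatives. The sketch below is my own.

    import numpy as np

    rng = np.random.default_rng(0)
    c = rng.uniform(0.1, 5.0, size=6)            # an arbitrary non-negative vector c
    p = c / c.sum()                              # the claimed maximizer q = p

    def objective(q):
        return np.sum(c * np.log(q))

    # The objective at q = p dominates its value at randomly sampled pmfs.
    alternatives = [objective(rng.dirichlet(np.ones(6))) for _ in range(10000)]
    print(objective(p) >= max(alternatives))     # True: q = p attains the maximum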
