Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems


Slide 1: LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems
Dr. Martin Lauer
University of Freiburg, Machine Learning Lab
Karlsruhe Institute of Technology, Institute of Measurement and Control Systems

Slide 2: References for this chapter
- Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 9, Springer, 2006
- Joseph L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, 1997
- Zoubin Ghahramani, Michael I. Jordan, Learning from Incomplete Data, Technical Report #1509, MIT Artificial Intelligence Laboratory, /AIM-1509.pdf?sequence=2
- Arthur P. Dempster, Nan M. Laird, Donald B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, in: Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977
- Xiao-Li Meng, Donald B. Rubin, Maximum Likelihood Estimation via the ECM Algorithm: A General Framework, in: Biometrika, vol. 80, no. 2, 1993

Slide 3: Motivation
Up to now:
1. calculate/approximate p(parameters | data)
2. find a meaningful reference value for p(parameters | data), e.g. argmax_parameters p(parameters | data)
-> requires more calculation than is actually necessary
This chapter: find argmax_parameters p(parameters | data) directly (MAP), or argmax_parameters p(data | parameters) directly (ML).
Remark: ML and MAP require basically the same approaches. The only difference is whether we consider priors (which are just additional factors in graphical models). Therefore, we consider both approaches together.

Slide 4: Direct MAP calculation
Posterior distribution in a graphical model:
p(u_1, ..., u_n | o_1, ..., o_m) = p(u_1, ..., u_n, o_1, ..., o_m) / p(o_1, ..., o_m)
p(u_1, ..., u_n, o_1, ..., o_m) = \prod_i f_i(Neighbors(i))
MAP means: solve
argmax_{u_1, ..., u_n} \prod_i f_i(Neighbors(i)) = argmax_{u_1, ..., u_n} \sum_i \log f_i(Neighbors(i))
(since \prod_i f_i(Neighbors(i)) = e^{\sum_i \log f_i(Neighbors(i))})

Slide 5: Direct MAP calculation
Ways to find the MAP:
- The system of equations \partial/\partial u_j \sum_i \log f_i(Neighbors(i)) = 0 can be resolved analytically -> analytical solution for the MAP
- Each single equation \partial/\partial u_j \sum_i \log f_i(Neighbors(i)) = 0 can be solved analytically -> use an iterative approach (next slide)

Slide 6: Direct MAP calculation
Iterative approach:
1. repeat
2.   set u_1 <- argmax_{u_1} \sum_i \log f_i(Neighbors(i))
3.   set u_2 <- argmax_{u_2} \sum_i \log f_i(Neighbors(i))
4.   ...
5.   set u_n <- argmax_{u_n} \sum_i \log f_i(Neighbors(i))
6. until convergence
7. return (u_1, ..., u_n)
If the derivatives \partial/\partial u_j \sum_i \log f_i(Neighbors(i)) can be calculated easily -> numerical solution: use a generic gradient descent algorithm.
The second approach (coordinate-wise maximization) often converges faster than generic gradient descent.
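
The coordinate-wise scheme above can be sketched generically. The following Python snippet is a minimal illustration and not from the lecture: it assumes a user-supplied log_posterior(u) that returns \sum_i \log f_i(Neighbors(i)) for a full variable vector u, and maximizes one coordinate at a time numerically.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def coordinate_map(log_posterior, u_init, n_iter=100, tol=1e-8):
        # coordinate-wise maximization of sum_i log f_i (slide 6)
        u = np.array(u_init, dtype=float)
        for _ in range(n_iter):
            u_old = u.copy()
            for j in range(len(u)):
                def neg(x, j=j):
                    v = u.copy()
                    v[j] = x
                    return -log_posterior(v)
                u[j] = minimize_scalar(neg).x    # set u_j <- argmax over u_j
            if np.max(np.abs(u - u_old)) < tol:
                break
        return u

    # toy check: for a Gaussian log-density the MAP is the mean
    mu = np.array([1.0, -2.0])
    P = np.array([[2.0, 0.3], [0.3, 1.0]])       # precision matrix
    print(coordinate_map(lambda u: -0.5 * (u - mu) @ P @ (u - mu), [0.0, 0.0]))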

Slide 7: Example: bearing-only tracking revisited
- observing a moving object from a fixed position
- the object moves with constant velocity
- for every point in time, the observer senses the angle of observation, but only sometimes the distance to the object
Distributions:
x_0 ~ N(a, R)
v ~ N(b, S)
y_i | x_0, v ~ N(x_0 + t_i v, \sigma^2 I)
r_i = \|y_i\| (distance), w_i = y_i / \|y_i\| (angle of observation)
[Figure: geometry of the task with observer, unknown object movement x_0 + t_i v, observed angle w_i and unknown distance r_i; graphical model with plate over i = 1, ..., n]

Slide 8: Example: bearing-only tracking revisited
Conditional distributions:
x_0 | v, (y_i), (t_i) ~ N( (\frac{n}{\sigma^2} I + R^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i (y_i - t_i v) + R^{-1} a), (\frac{n}{\sigma^2} I + R^{-1})^{-1} )
v | x_0, (y_i), (t_i) ~ N( (\frac{1}{\sigma^2} \sum_i t_i^2 I + S^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i t_i (y_i - x_0) + S^{-1} b), (\frac{1}{\sigma^2} \sum_i t_i^2 I + S^{-1})^{-1} )
r_i | x_0, v, t_i, w_i ~ N( w_i^T (x_0 + t_i v), \sigma^2 )
Updates derived from the conditionals:
x_0 <- (\frac{n}{\sigma^2} I + R^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i (y_i - t_i v) + R^{-1} a)
v <- (\frac{1}{\sigma^2} \sum_i t_i^2 I + S^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i t_i (y_i - x_0) + S^{-1} b)
r_i <- w_i^T (x_0 + t_i v)
Matlab demo (using non-informative priors)
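
As a concrete illustration of these updates, here is a small numpy sketch for the 2-D case with isotropic noise. The function name and the convention of marking missing distances r_i with NaN are my own assumptions; the lecture's Matlab demo is not reproduced here.

    import numpy as np

    def bearing_only_map(w, t, r_obs, a, R, b, S, sigma2, n_iter=200):
        """Iterative MAP updates from slide 8 (2-D case).
        w: (n, 2) unit bearing vectors w_i, t: (n,) observation times,
        r_obs: (n,) observed distances with np.nan where unobserved,
        a, R / b, S: prior mean and covariance of x_0 / v."""
        n = len(t)
        x0, v = a.copy(), b.copy()
        r = np.where(np.isnan(r_obs), 1.0, r_obs)         # arbitrary init for missing r_i
        Rinv, Sinv = np.linalg.inv(R), np.linalg.inv(S)
        for _ in range(n_iter):
            y = r[:, None] * w                            # y_i = r_i * w_i
            A = (n / sigma2) * np.eye(2) + Rinv
            x0 = np.linalg.solve(A, (y - t[:, None] * v).sum(0) / sigma2 + Rinv @ a)
            B = (np.sum(t**2) / sigma2) * np.eye(2) + Sinv
            v = np.linalg.solve(B, (t[:, None] * (y - x0)).sum(0) / sigma2 + Sinv @ b)
            r_pred = w @ x0 + t * (w @ v)                 # w_i^T (x_0 + t_i v)
            r = np.where(np.isnan(r_obs), r_pred, r_obs)  # only missing r_i are updated
        return x0, v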

Slide 9: Example: Gaussian mixtures revisited
\mu_j ~ N(m_0, r_0)
s_j ~ \Gamma^{-1}(a_0, b_0)
w ~ D(\beta)
Z_i | w ~ C(w)
X_i | Z_i, \mu_{Z_i}, s_{Z_i} ~ N(\mu_{Z_i}, s_{Z_i})
[Figure: graphical model with hyperparameters m_0, r_0, a_0, b_0, \beta, parameters \mu_j, s_j (plate over j = 1, ..., k) and w, and variables Z_i, X_i (plate over i = 1, ..., n)]

Slide 10: Example: Gaussian mixtures revisited
Conditional distributions: see slide 07/36
Derived MAP updates:
w <- ( \frac{\beta_1 + n_1 - 1}{n - k + \sum_{j=1}^k \beta_j}, ..., \frac{\beta_k + n_k - 1}{n - k + \sum_{j=1}^k \beta_j} )   with n_j = |{i | z_i = j}|
\mu_j <- \frac{s_j m_0 + r_0 \sum_{i: z_i = j} x_i}{s_j + n_j r_0}
s_j <- \frac{b_0 + \frac{1}{2} \sum_{i: z_i = j} (x_i - \mu_j)^2}{1 + a_0 + \frac{n_j}{2}}
z_i <- argmax_j \frac{w_j}{\sqrt{2 \pi s_j}} e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}
Matlab demo (using priors close to non-informativity)
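
A compact sketch of these coordinate-wise MAP updates for a one-dimensional mixture with hard assignments z_i. The random initialization and the assumption \beta_j >= 1 (so that the weight update stays non-negative) are illustrative choices, not from the slide; beta is assumed to be a numpy array of length k.

    import numpy as np

    def gmm_map_hard(x, k, m0, r0, a0, b0, beta, n_iter=100, seed=0):
        """Coordinate-wise MAP updates of slide 10 for a 1-D Gaussian mixture."""
        rng = np.random.default_rng(seed)
        n = len(x)
        z = rng.integers(k, size=n)                     # random initial assignments
        mu = rng.choice(x, size=k, replace=False)
        s = np.full(k, np.var(x))
        for _ in range(n_iter):
            nj = np.array([(z == j).sum() for j in range(k)])
            w = (beta + nj - 1.0) / (n - k + beta.sum())          # assumes beta_j >= 1
            for j in range(k):
                xj = x[z == j]
                mu[j] = (s[j] * m0 + r0 * xj.sum()) / (s[j] + nj[j] * r0)
                s[j] = (b0 + 0.5 * ((xj - mu[j]) ** 2).sum()) / (1.0 + a0 + nj[j] / 2.0)
            # reassign each point to the component with the highest weighted density
            dens = w / np.sqrt(2 * np.pi * s) * np.exp(-0.5 * (x[:, None] - mu) ** 2 / s)
            z = np.argmax(dens, axis=1)
        return w, mu, s, z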

Slide 11: Example: Gaussian mixtures revisited
Observations:
- convergence is very fast
- the result depends very much on the initialization
- we treat the z_i like parameters of the model, although the mixture model is completely specified by w, \mu_1, ..., \mu_k, s_1, ..., s_k; the z_i are not parameters of the mixture but latent variables, which are only used to simplify our calculations -> why should we maximize the posterior w.r.t. the z_i?
[Figure: graphical model of the mixture repeated from slide 9]

Slide 12: Latent variables
Latent variables are
- not part of the stochastic model
- not interesting for the final estimate
- useful to simplify calculations
- often interpreted as missing observations
Examples:
- the class assignment variables z_i in the mixture modeling can be interpreted as missing class labels for a multi-class distribution
- the missing distances r_i in the bearing-only tracking task can be interpreted as missing parts of the data
- occluded parts of an object in an image can be seen as missing pixels
- data from a statistical evaluation which have been lost

Slide 13: Incomplete data problems
Let us assume that all data x are split into an observed part y and a missing part z, i.e. x = (y, z). We can distinguish three cases:
- completely missing at random (CMAR): whether an entry of x belongs to y or z is stochastically independent of both y and z:
  P(x_i belongs to z) = P(x_i belongs to z | y) = P(x_i belongs to z | y, z)
- missing at random (MAR): whether an entry of x belongs to y or z is stochastically independent of z but might depend on y:
  P(x_i belongs to z) \neq P(x_i belongs to z | y) = P(x_i belongs to z | y, z)
- censored data: whether an entry of x belongs to y or z is stochastically dependent on z:
  P(x_i belongs to z | y) \neq P(x_i belongs to z | y, z)
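
A tiny simulation can make the three mechanisms concrete. The generating model below (z depends on the observed y) and all variable names are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(size=100000)            # always-observed part of each record
    z = y + rng.normal(size=100000)        # part that may go missing

    miss_cmar = rng.random(100000) < 0.3                    # ignores y and z
    miss_mar = rng.random(100000) < 1 / (1 + np.exp(-y))    # depends only on observed y
    miss_cens = z > 0.5                                     # depends on z itself (censored)

    for name, miss in [("CMAR", miss_cmar), ("MAR", miss_mar), ("censored", miss_cens)]:
        print(name, "mean of the z values that remain observed:", round(z[~miss].mean(), 2))
    # CMAR leaves the observed z representative; under MAR the naive mean is biased but
    # the bias can be corrected by modelling p(missing | y); under censoring it cannot be
    # removed without extra assumptions (cf. slide 15).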

Slide 14: Incomplete data problems
Discuss the following examples of incomplete data:
- the z_i in mixture models
- a sensor that measures values only down to a certain minimal value
- an interrupted connection between a sensor and a host computer, so that some measurements are not transmitted
- a stereo camera system that measures light intensity and distance but is unable to calculate the distance for overexposed areas
- a sensor that fails often if temperatures are low,
  - if the sensor measures the activities of the sun
  - if the sensor measures the persons on a beach
- non-responses in public opinion polls

Slide 15: Incomplete data problems
Consequences for the stochastic analysis:
- CMAR: no problem at all, incomplete data do not disturb our results
- MAR: can be treated if we model the stochastic dependency between the observed data and the missing data
- censored data: no general treatment possible; results will be disturbed, and no reconstruction of the missing data is possible
We focus on the CMAR and MAR cases here.

Slide 16: Inference for incomplete data problems
- variational Bayes, Monte Carlo: model the full posterior over the parameters of the model and the latent (missing) data. Afterwards, ignore the latent variables and return the result for the parameters of your model.
- direct MAP/ML: do not maximize the posterior/likelihood over the parameters and the latent variables. Instead, consider all possible values that can be taken by the latent variables and maximize the posterior/likelihood only w.r.t. the parameters of your stochastic model:
  - expectation-maximization algorithm (EM)
  - expectation-conditional-maximization algorithm (ECM)

Slide 17: EM algorithm
Let us denote
- the parameters of the stochastic model (the posterior distribution): \theta = (\theta_1, ..., \theta_k)
- the latent variables: \lambda = (\lambda_1, ..., \lambda_m)
- the observed data: o = (o_1, ..., o_n)
- the log-posterior: L(\theta, \lambda, o) = \sum_i \log f_i(Neighbors(i))

Slide 18: EM algorithm
We aim at maximizing the expected log-posterior over all values of the latent variables:
argmax_\theta \int_{R^m} L(\theta, \lambda, o) p(\lambda | \theta, o) d\lambda
An iterative approach to solve it:
1. start with some parameter vector \theta
2. repeat
3.   Q(\theta') <- \int_{R^m} L(\theta', \lambda, o) p(\lambda | \theta, o) d\lambda
4.   \theta <- argmax_{\theta'} Q(\theta')
5. until convergence
This algorithm is known as the expectation-maximization algorithm (Dempster, Laird, Rubin, 1977).
- step 3: expectation step (E-step)
- step 4: maximization step (M-step)
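
The E/M loop can be written down generically. In the sketch below, make_Q is a placeholder supplied by the user: given the current \theta it performs the E-step and returns a callable Q, while the M-step is done numerically. The toy usage at the end (a two-component 1-D mixture with unit variances and equal, fixed weights) is only meant to show the interface; all names are assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def em(theta0, make_Q, n_iter=100, tol=1e-8):
        """Generic EM loop: E-step builds Q from the current theta, M-step maximizes it."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iter):
            Q = make_Q(theta)                          # E-step
            res = minimize(lambda t: -Q(t), theta)     # M-step (numerical here)
            if np.max(np.abs(res.x - theta)) < tol:
                theta = res.x
                break
            theta = res.x
        return theta

    # toy usage: 1-D mixture of two unit-variance Gaussians, unknown means
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

    def make_Q(theta):
        # E-step: responsibilities under the current means (equal, fixed weights)
        d = np.exp(-0.5 * (x[:, None] - theta) ** 2)
        resp = d / d.sum(1, keepdims=True)
        # Q(theta') = sum_i sum_j resp_ij * log N(x_i | theta'_j, 1) + const
        return lambda t: np.sum(resp * (-0.5 * (x[:, None] - t) ** 2))

    print(em([0.0, 1.0], make_Q))   # -> approximately [-2, 3] (up to label switching)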

Slide 19: EM algorithm
Remarks:
- during the E-step, intermediate variables are calculated which allow us to represent Q without relying on the previous values of \theta
- closed-form expressions for Q and explicit maximization often require lengthy algebraic calculations
- for some applications, calculating the E-step means calculating the expectation values of the latent variables. But this does not apply in general.
Famous application areas:
- mixture distributions
- learning hidden Markov models from example sequences (Baum-Welch algorithm)

Slide 20: Example: bearing-only tracking revisited
Conditional distribution:
r_i | x_0, v, w_i ~ N( w_i^T (x_0 + t_i v), \sigma^2 )
The posterior distribution:
\frac{1}{2\pi \sqrt{|R|}} e^{-\frac{1}{2} (x_0 - a)^T R^{-1} (x_0 - a)}   (prior of x_0)
\cdot \frac{1}{2\pi \sqrt{|S|}} e^{-\frac{1}{2} (v - b)^T S^{-1} (v - b)}   (prior of v)
\cdot \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2} \frac{\|x_0 + t_i v - r_i w_i\|^2}{\sigma^2}}   (data term)
[Figure: geometry of the bearing-only tracking task repeated from slide 7]

Slide 21: Example: bearing-only tracking revisited
... (after lengthy, error-prone calculations) ...
Q(x_0, v) = const - \frac{1}{2} (x_0 - a)^T R^{-1} (x_0 - a) - \frac{1}{2} (v - b)^T S^{-1} (v - b) - \frac{1}{2} \sum_{i=1}^n \frac{\|x_0 + t_i v - \rho_i w_i\|^2}{\sigma^2}
with \rho_i = r_i if r_i is observed, and \rho_i = (x_0 + t_i v)^T w_i if r_i is unobserved.
Determining the maxima w.r.t. x_0, v:
\begin{pmatrix} R^{-1} + \frac{n}{\sigma^2} I & \frac{\sum_i t_i}{\sigma^2} I \\ \frac{\sum_i t_i}{\sigma^2} I & S^{-1} + \frac{\sum_i t_i^2}{\sigma^2} I \end{pmatrix} \begin{pmatrix} x_0 \\ v \end{pmatrix} = \begin{pmatrix} R^{-1} a + \frac{1}{\sigma^2} \sum_i \rho_i w_i \\ S^{-1} b + \frac{1}{\sigma^2} \sum_i t_i \rho_i w_i \end{pmatrix}
Matlab demo (using non-informative priors)
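
A numpy sketch of the resulting EM iteration for the 2-D case: the E-step fills in \rho_i with the current prediction, the M-step solves the block linear system above. Marking missing r_i with NaN and the function name are my own conventions; the lecture only refers to a Matlab demo.

    import numpy as np

    def bearing_only_em(w, t, r_obs, a, R, b, S, sigma2, n_iter=100):
        """EM for the bearing-only model of slides 20/21 (2-D, NaN = unobserved r_i)."""
        n = len(t)
        Rinv, Sinv = np.linalg.inv(R), np.linalg.inv(S)
        I2 = np.eye(2)
        x0, v = a.copy(), b.copy()
        for _ in range(n_iter):
            # E-step: expected distance for missing r_i given the current x0, v
            rho = np.where(np.isnan(r_obs), w @ x0 + t * (w @ v), r_obs)
            # M-step: solve the 4x4 block linear system for (x0, v)
            A = np.block([
                [Rinv + (n / sigma2) * I2,        (t.sum() / sigma2) * I2],
                [(t.sum() / sigma2) * I2,         Sinv + ((t**2).sum() / sigma2) * I2],
            ])
            rhs = np.concatenate([
                Rinv @ a + (rho[:, None] * w).sum(0) / sigma2,
                Sinv @ b + (t[:, None] * rho[:, None] * w).sum(0) / sigma2,
            ])
            sol = np.linalg.solve(A, rhs)
            x0, v = sol[:2], sol[2:]
        return x0, v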

Slide 22: ECM algorithm
We still aim at maximizing the expected log-posterior over all values of the latent variables:
argmax_\theta \int_{R^m} L(\theta, \lambda, o) p(\lambda | \theta, o) d\lambda
Sometimes the M-step of the EM algorithm cannot be calculated, i.e. argmax_{\theta_1, ..., \theta_k} Q(\theta') cannot be resolved analytically. But it might happen that argmax_{\theta_i} Q(\theta') can be resolved for each \theta_i or for groups of parameters.
-> expectation-conditional-maximization algorithm (Meng & Rubin, 1993)

Slide 23: ECM algorithm
Define a set of constraints g_i(\theta', \theta) on the parameter set, e.g. g_i: \theta'_j = \theta_j for all j \neq i.
Replace the single M-step of the EM algorithm by a sequence of CM-steps, one for each constraint:
1. start with some parameter vector \theta
2. repeat
3.   Q(\theta') <- \int_{R^m} L(\theta', \lambda, o) p(\lambda | \theta, o) d\lambda   (E-step)
4.   \theta <- argmax_{\theta'} Q(\theta') subject to g_1(\theta', \theta)   (CM-step)
5.   ...
6.   \theta <- argmax_{\theta'} Q(\theta') subject to g_\nu(\theta', \theta)   (CM-step)
7. until convergence

Slide 24: Example: Gaussian mixtures revisited
\mu_j ~ N(m_0, r_0)
s_j ~ \Gamma^{-1}(a_0, b_0)
w ~ D(\beta)
Z_i | w ~ C(w)
X_i | Z_i, \mu_{Z_i}, s_{Z_i} ~ N(\mu_{Z_i}, s_{Z_i})
[Figure: graphical model of the mixture repeated from slide 9]
Conditional distribution (cf. slide 07/35):
z_i | w, x_i, \mu_1, ..., \mu_k, s_1, ..., s_k ~ C(h_{i,1}, ..., h_{i,k})   with h_{i,j} \propto \frac{w_j}{\sqrt{2 \pi s_j}} e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}

Slide 25: Example: Gaussian mixtures revisited
Q(w, \mu_1, ..., \mu_k, s_1, ..., s_k) =
\sum_{z_1=1}^k ... \sum_{z_n=1}^k h_{1,z_1} \cdots h_{n,z_n} \cdot (
    \sum_{j=1}^k \log( \frac{1}{\sqrt{2\pi r_0}} e^{-\frac{(\mu_j - m_0)^2}{2 r_0}} )   (prior of \mu_j)
  + \sum_{j=1}^k \log( \frac{b_0^{a_0}}{\Gamma(a_0)} (s_j)^{-a_0 - 1} e^{-\frac{b_0}{s_j}} )   (prior of s_j)
  + \log( \frac{\Gamma(\beta_1 + ... + \beta_k)}{\Gamma(\beta_1) \cdots \Gamma(\beta_k)} \prod_{j=1}^k (w_j)^{\beta_j - 1} )   (prior of w)
  + \sum_i ( \log( \frac{1}{\sqrt{2\pi s_{z_i}}} e^{-\frac{(x_i - \mu_{z_i})^2}{2 s_{z_i}}} )   (data term of x_i)
           + \log(w_{z_i}) ) )   (data term of z_i)
We can maximize Q easily (blackboard/homework).
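
Maximizing Q with soft responsibilities yields closed-form updates analogous to the hard-assignment case on slide 10. The lecture leaves that derivation as blackboard/homework, so the update formulas in this sketch are my own reconstruction and should be checked against it; everything is 1-D and the initialization is an illustrative choice.

    import numpy as np

    def gmm_map_em(x, k, m0, r0, a0, b0, beta, n_iter=200, seed=0):
        """EM with soft responsibilities for the 1-D mixture with the priors above."""
        rng = np.random.default_rng(seed)
        n = len(x)
        mu = rng.choice(x, size=k, replace=False)
        s = np.full(k, np.var(x))
        w = np.full(k, 1.0 / k)
        for _ in range(n_iter):
            # E-step: responsibilities h_ij proportional to w_j * N(x_i | mu_j, s_j)
            h = w / np.sqrt(2 * np.pi * s) * np.exp(-0.5 * (x[:, None] - mu) ** 2 / s)
            h /= h.sum(1, keepdims=True)
            nj = h.sum(0)
            # M-step: maximize Q, taking the priors into account (reconstructed updates)
            w = (nj + beta - 1.0) / (n + beta.sum() - k)
            mu = (s * m0 + r0 * (h * x[:, None]).sum(0)) / (s + nj * r0)
            s = (b0 + 0.5 * (h * (x[:, None] - mu) ** 2).sum(0)) / (a0 + 1.0 + nj / 2.0)
        return w, mu, s

    # toy usage with priors close to non-informativity
    x = np.concatenate([np.random.default_rng(1).normal(-3, 1, 200),
                        np.random.default_rng(2).normal(2, 0.5, 300)])
    print(gmm_map_em(x, k=2, m0=0.0, r0=100.0, a0=0.001, b0=0.001,
                     beta=np.array([1.0, 1.0])))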

Slide 26: Example: Gaussian mixtures revisited
Matlab demo (using non-informative priors)
Some observations on EM/ECM for Gaussian mixtures:
- very popular
- very sensitive to the initialization of parameters
- overfits the data if the mixture is too large (for ML/MAP with non-informative priors)

Slide 27: Laplace approximation
MAP calculates a best estimate. Can we derive an approximation for the posterior distribution?
Idea: determine a Gaussian that is locally most similar to the posterior.
Taylor approximation of the log-posterior around the MAP estimate \theta_{MAP}:
\log p(\theta) \approx \log p(\theta_{MAP}) + grad^T (\theta - \theta_{MAP}) + \frac{1}{2} (\theta - \theta_{MAP})^T H (\theta - \theta_{MAP})
             = \log p(\theta_{MAP}) + \frac{1}{2} (\theta - \theta_{MAP})^T H (\theta - \theta_{MAP})
with H the Hessian of \log p at \theta_{MAP} (the gradient vanishes at the maximum).
Log of a Gaussian around \theta_{MAP}:
-\log( (2\pi)^{d/2} \sqrt{|\Sigma|} ) - \frac{1}{2} (\theta - \theta_{MAP})^T \Sigma^{-1} (\theta - \theta_{MAP})
We obtain the same shape of the Gaussian if we choose \Sigma^{-1} = -H. This is known as the Laplace approximation.
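
A small numerical sketch of this construction: it estimates the Hessian H of \log p at \theta_{MAP} by central finite differences and returns the Gaussian with covariance (-H)^{-1}, i.e. \Sigma^{-1} = -H. The helper name and the step size are illustrative choices; in practice one would use an analytic Hessian where available.

    import numpy as np

    def laplace_approximation(log_p, theta_map, eps=1e-4):
        """Gaussian approximation with mean theta_map and covariance (-H)^{-1}."""
        d = len(theta_map)
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
                # central finite-difference estimate of the mixed second derivative
                H[i, j] = (log_p(theta_map + e_i + e_j) - log_p(theta_map + e_i - e_j)
                           - log_p(theta_map - e_i + e_j) + log_p(theta_map - e_i - e_j)) / (4 * eps**2)
        return theta_map, np.linalg.inv(-H)

    # sanity check: for a Gaussian log-density the approximation is exact
    Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
    log_p = lambda th: -0.5 * th @ np.linalg.inv(Sigma) @ th
    mean, cov = laplace_approximation(log_p, np.zeros(2))
    print(cov)   # -> approximately Sigma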

Slide 28: Summary
- direct maximization of the likelihood/posterior
- latent variables
- incomplete data problems
- EM/ECM algorithm
- Laplace approximation
