Lecture 3: Latent Variable Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July 25, 2006. Machine Learning Summer School, Taiwan
1 Lecture 3: Latent Variable Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July 25, 2006. Machine Learning Summer School, Taiwan
2 Latent Variable Models What to do when a variable z is always unobserved? It depends on where it appears in our model. If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out. e.g. given y, x fit the model p(z, y|x) = p(z|y) p(y|x, w) p(w). (In other words, if it is a leaf node.) But if z is conditioned on, we need to model it: e.g. given y, x fit the model p(y|x) = Σ_z p(y|x, z) p(z)
3 Where Do Latent Variables Come From? Latent variables may appear naturally, from the structure of the problem, because something wasn't measured, because of faulty sensors, occlusion, privacy, etc. But also, we may want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g. mixtures). [Figure: plate diagrams of (a) a model with latent Z_n generating X_n and (b) a model with X_n alone.]
4 Clustering vs. Classification; Latent Factor Models vs. Regression You can think of clustering as the problem of classification with missing class labels. You can think of factor models (such as factor analysis, PCA, ICA, etc.) as linear or nonlinear regression with missing inputs.
5 Why is Learning Harder? In fully observed iid settings, the probability model is a product, thus the log likelihood is a sum where terms decouple. (At least for directed models.) ℓ(θ; D) = log p(x, z|θ) = log p(z|θ_z) + log p(x|z, θ_x) With latent variables, the probability already contains a sum, so the log likelihood has all parameters coupled together via the log Σ(·): ℓ(θ; D) = log Σ_z p(x, z|θ) = log Σ_z p(z|θ_z) p(x|z, θ_x) (Just as with the partition function in undirected models.)
6 Learning with Latent Variables The likelihood ℓ(θ; D) = log Σ_z p(z|θ_z) p(x|z, θ_x) couples the parameters. We can treat this as a black box probability function and just try to optimize the likelihood as a function of θ (e.g. gradient descent). However, sometimes taking advantage of the latent variable structure can make parameter estimation easier. Good news: soon we will see the EM algorithm, which allows us to treat learning with latent variables using fully observed tools. Basic trick: guess the values you don't know. Basic math: use convexity to lower bound the likelihood. [Figure: graphical models (a) with and (b) without the latent node Z coupling X1, X2, X3.]
7 Mixture Models (1 discrete latent variable) The most basic latent variable model has a single discrete node z. It allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space. Divide and conquer idea: use simple parts to build complex models (e.g. multimodal densities, or piecewise-linear regressions). [Figure: graphical models for (a) an unconditional mixture and (b) a conditional mixture, with nodes X_n, Y_n, Z_n.]
8 Mixture Densities Exactly like a classification model, but the class is unobserved and so we sum it out. What we get is a perfectly valid density: p(x|θ) = Σ_{k=1}^K p(z=k|θ_z) p(x|z=k, θ_k) = Σ_k α_k p_k(x|θ_k) where the mixing proportions add to one: Σ_k α_k = 1. We can use Bayes' rule to compute the posterior probability of the mixture component given some data: p(z=k|x, θ) = α_k p_k(x|θ_k) / Σ_j α_j p_j(x|θ_j) These quantities are sometimes called the responsibilities.
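As a concrete illustration of the mixture density and the Bayes-rule responsibilities above, here is a minimal NumPy/SciPy sketch (my own toy example, not from the lecture; the component parameters are arbitrary):

import numpy as np
from scipy.stats import norm

# hypothetical 1-D mixture with K = 2 Gaussian components
alpha = np.array([0.3, 0.7])      # mixing proportions, sum to one
mu    = np.array([-2.0, 1.0])     # component means
sigma = np.array([0.5, 1.5])      # component standard deviations

x = 0.2                                  # a single observation
p_k = norm.pdf(x, loc=mu, scale=sigma)   # p_k(x | theta_k) for each component
p_x = np.sum(alpha * p_k)                # mixture density p(x | theta)
r   = alpha * p_k / p_x                  # responsibilities p(z = k | x, theta)
print(p_x, r, r.sum())                   # the responsibilities sum to one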
9 Clustering Example: Gaussian Mixture Models Consider a mixture of K Gaussian components: p(x|θ) = Σ_k α_k N(x|µ_k, Σ_k) p(z=k|x, θ) = α_k N(x|µ_k, Σ_k) / Σ_j α_j N(x|µ_j, Σ_j) ℓ(θ; D) = Σ_n log Σ_k α_k N(x_n|µ_k, Σ_k) Density model: p(x|θ) is a familiarity signal. Clustering: p(z|x, θ) is the assignment rule, ℓ(θ) is the cost.
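To make the clustering cost ℓ(θ; D) above concrete, the following sketch (mine, not from the slides) evaluates the Gaussian-mixture log likelihood of a dataset, using log-sum-exp for numerical stability:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, alpha, mus, Sigmas):
    # l(theta; D) = sum_n log sum_k alpha_k N(x_n | mu_k, Sigma_k)
    N, K = X.shape[0], len(alpha)
    log_terms = np.zeros((N, K))
    for k in range(K):
        log_terms[:, k] = np.log(alpha[k]) + \
            multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
    return logsumexp(log_terms, axis=1).sum()

# tiny hypothetical example: two components in 2-D
X = np.random.randn(100, 2)
alpha  = np.array([0.5, 0.5])
mus    = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2 * np.eye(2)]
print(gmm_log_likelihood(X, alpha, mus, Sigmas))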
10 Example: Mixture of Linear Regression Experts Each expert generates data according to a linear function of the input plus additive Gaussian noise: p(y|x, θ) = Σ_k α_k(x) N(y|β_k^T x, σ_k²) The gating function can be a softmax classification machine: α_k(x) = p(z=k|x) = e^{η_k^T x} / Σ_j e^{η_j^T x} Remember: we are not modeling the density of the inputs x. [Figure: graphical model with X_n, Y_n, Z_n and an example fit with two linear experts.]
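A small sketch of the conditional mixture above (again my own toy example; the expert and gating parameters are made up): a softmax gate picks among linear experts and the predictive density is the gate-weighted sum of Gaussians centred on each expert's prediction.

import numpy as np
from scipy.stats import norm

def softmax_gate(x, etas):
    # alpha_k(x) = exp(eta_k^T x) / sum_j exp(eta_j^T x)
    scores = etas @ x
    scores = scores - scores.max()        # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def expert_mixture_density(y, x, etas, betas, sigmas):
    # p(y | x, theta) = sum_k alpha_k(x) N(y | beta_k^T x, sigma_k^2)
    alpha = softmax_gate(x, etas)
    means = betas @ x
    return np.sum(alpha * norm.pdf(y, loc=means, scale=sigmas))

# two experts on 2-D inputs (the second input component acts as a bias feature)
etas   = np.array([[ 1.0, 0.0], [-1.0, 0.0]])   # gating parameters eta_k
betas  = np.array([[ 2.0, 1.0], [-2.0, 1.0]])   # expert regression weights beta_k
sigmas = np.array([0.5, 0.5])                   # expert noise levels
print(expert_mixture_density(y=1.8, x=np.array([0.7, 1.0]),
                             etas=etas, betas=betas, sigmas=sigmas))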
11 Gradient Learning with Mixtures We can learn mixture densities using gradient descent on the likelihood as usual. The gradients are quite interesting: ℓ(θ) = log p(x|θ) = log Σ_k α_k p_k(x|θ_k) ∂ℓ/∂θ = (1/p(x|θ)) Σ_k α_k ∂p_k(x|θ_k)/∂θ = Σ_k α_k (1/p(x|θ)) p_k(x|θ_k) ∂ log p_k(x|θ_k)/∂θ = Σ_k [α_k p_k(x|θ_k)/p(x|θ)] ∂ℓ_k/∂θ_k = Σ_k r_k ∂ℓ_k/∂θ_k In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients.
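The responsibility-weighted form of the gradient can be checked numerically. The sketch below (my own; the variable names are mine) computes ∂ℓ/∂µ_k = r_k (x − µ_k)/σ_k² for a 1-D Gaussian mixture and compares it with a finite-difference estimate:

import numpy as np
from scipy.stats import norm

def loglik(x, alpha, mu, sigma):
    return np.log(np.sum(alpha * norm.pdf(x, mu, sigma)))

def grad_mu(x, alpha, mu, sigma):
    # dl/dmu_k = r_k * dlog p_k/dmu_k, with r_k the responsibility of component k
    p_k = norm.pdf(x, mu, sigma)
    r = alpha * p_k / np.sum(alpha * p_k)
    return r * (x - mu) / sigma**2

alpha = np.array([0.4, 0.6]); mu = np.array([-1.0, 2.0]); sigma = np.array([1.0, 0.5])
x = 0.3
g = grad_mu(x, alpha, mu, sigma)

eps = 1e-6                                # finite-difference check on mu_0
mu_p = mu.copy(); mu_p[0] += eps
print(g[0], (loglik(x, alpha, mu_p, sigma) - loglik(x, alpha, mu, sigma)) / eps)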
12 Recap: Learning with Latent Variables With latent variables, the probability contains a sum, so the log likelihood has all parameters coupled together: ℓ(θ; D) = log Σ_z p(x, z|θ) = log Σ_z p(z|θ_z) p(x|z, θ_x) (we can also consider continuous z and replace Σ with ∫) If the latent variables were observed, parameters would decouple again and learning would be easy: ℓ(θ; D) = log p(x, z|θ) = log p(z|θ_z) + log p(x|z, θ_x) One idea: ignore this fact, compute ∂ℓ/∂θ, and do learning with a smart optimizer like conjugate gradient. Another idea: what if we use our current parameters to guess the values of the latent variables, and then do fully-observed learning? This back-and-forth trick might make optimization easier.
13 Expectation-Maximization (EM) Algorithm An iterative algorithm with two linked steps: E-step: fill in values of ẑ^t using p(z|x, θ^t). M-step: update parameters using θ^{t+1} ← argmax_θ ℓ(θ; x, ẑ^t). The E-step involves inference, which we need to do at runtime anyway. The M-step is no harder than in the fully observed case. We will prove that this procedure monotonically improves ℓ (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood (as any optimizer should). Note: EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data. EM is not a cost function such as maximum likelihood. EM is not a model such as a mixture of Gaussians.
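The two linked steps can be written as a very small loop. The sketch below (my own toy model, not from the lecture) runs EM on a mixture of two 1-D Gaussians with known unit variances and equal mixing weights, learning only the means, and checks that the log likelihood never decreases:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(2, 1, 300)])

def e_step(X, mu):
    # q(z = k | x_n): fill in the latent variables using the current parameters
    p = 0.5 * norm.pdf(X[:, None], mu[None, :], 1.0)
    return p / p.sum(axis=1, keepdims=True)

def m_step(X, r):
    # fully-observed-style update of the means given the soft assignments
    return (r * X[:, None]).sum(axis=0) / r.sum(axis=0)

def log_lik(X, mu):
    return np.log((0.5 * norm.pdf(X[:, None], mu[None, :], 1.0)).sum(axis=1)).sum()

mu, prev = np.array([-1.0, 1.0]), -np.inf
for t in range(50):
    r = e_step(X, mu)            # E-step: infer the latent posterior
    mu = m_step(X, r)            # M-step: maximize the expected complete log likelihood
    ll = log_lik(X, mu)
    assert ll >= prev - 1e-6     # monotonic improvement
    prev = ll
print(mu, ll)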
14 Expected Complete Log Likelihood For any distribution q(z), define the expected complete log likelihood: ℓ_q(θ; x) = ⟨ℓ_c(θ; x, z)⟩_q ≡ Σ_z q(z|x) log p(x, z|θ) Amazing fact: ℓ(θ) ≥ ℓ_q(θ) + H(q) because of the concavity of log: ℓ(θ; x) = log p(x|θ) = log Σ_z p(x, z|θ) = log Σ_z q(z|x) [p(x, z|θ) / q(z|x)] ≥ Σ_z q(z|x) log [p(x, z|θ) / q(z|x)] where the inequality is called Jensen's inequality. (It is only valid for distributions: Σ_z q(z) = 1; q(z) ≥ 0.)
15 Lower Bounds and Free Energy For fixed data x, define a functional called the free energy: F(q, θ) ≡ Σ_z q(z|x) log [p(x, z|θ) / q(z|x)] ≤ ℓ(θ) The EM algorithm is coordinate ascent on F: E-step: q^{t+1} = argmax_q F(q, θ^t) M-step: θ^{t+1} = argmax_θ F(q^{t+1}, θ) EM constructs sequential convex lower bounds on the likelihood. [Figure: the likelihood curve and the lower bound F(θ, q^{t+1}) touching it at θ^t.]
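A tiny numeric check (mine) of the bound F(q, θ) ≤ ℓ(θ) for a single observation under a two-component 1-D Gaussian mixture, showing that the bound is loose for an arbitrary q and tight when q is the posterior:

import numpy as np
from scipy.stats import norm

alpha = np.array([0.3, 0.7]); mu = np.array([-1.0, 2.0]); sigma = np.array([1.0, 0.5])
x = 0.4

joint = alpha * norm.pdf(x, mu, sigma)     # p(x, z = k | theta)
l = np.log(joint.sum())                    # true log likelihood l(theta; x)

def free_energy(q):
    # F(q, theta) = sum_z q(z) log [ p(x, z | theta) / q(z) ]
    return np.sum(q * (np.log(joint) - np.log(q)))

q_arbitrary = np.array([0.5, 0.5])
q_posterior = joint / joint.sum()          # p(z | x, theta)

print(free_energy(q_arbitrary), "<=", l)   # a strict gap in general
print(free_energy(q_posterior), "==", l)   # the bound is tight at the posterior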
16 M-step: maximization of the expected ℓ_c Note that the free energy breaks into two terms: F(q, θ) = Σ_z q(z|x) log p(x, z|θ) − Σ_z q(z|x) log q(z|x) = ℓ_q(θ; x) + H(q) (this is where its name comes from). The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy. Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term: θ^{t+1} = argmax_θ ℓ_q(θ; x) = argmax_θ Σ_z q(z|x) log p(x, z|θ) Usually optimizing ℓ_c(θ) given both z and x is straightforward (e.g. class-conditional Gaussian fitting, linear regression).
17 E-step: inferring the latent posterior Claim: the optimum setting of q in the E-step is: q^{t+1} = p(z|x, θ^t) This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification). Proof (easy): this setting saturates the bound ℓ(θ; x) ≥ F(q, θ): F(p(z|x, θ^t), θ^t) = Σ_z p(z|x, θ^t) log [p(x, z|θ^t) / p(z|x, θ^t)] = Σ_z p(z|x, θ^t) log p(x|θ^t) = log p(x|θ^t) Σ_z p(z|x, θ^t) = ℓ(θ^t; x) · 1 We can also show this result using variational calculus or the fact that ℓ(θ) − F(q, θ) = KL[q ‖ p(z|x, θ)]
18 Example: Mixtures of Gaussians Recall: a mixture of K Gaussians: p(x|θ) = Σ_k α_k N(x|µ_k, Σ_k) ℓ(θ; D) = Σ_n log Σ_k α_k N(x_n|µ_k, Σ_k) Learning with the EM algorithm: E-step: p^t_{kn} = N(x_n|µ^t_k, Σ^t_k) q^{t+1}_{kn} = p(z=k|x_n, θ^t) = α^t_k p^t_{kn} / Σ_j α^t_j p^t_{jn} M-step: µ^{t+1}_k = Σ_n q^{t+1}_{kn} x_n / Σ_n q^{t+1}_{kn} Σ^{t+1}_k = Σ_n q^{t+1}_{kn} (x_n − µ^{t+1}_k)(x_n − µ^{t+1}_k)^T / Σ_n q^{t+1}_{kn} α^{t+1}_k = (1/M) Σ_n q^{t+1}_{kn}
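Putting the E-step and M-step formulas above together gives a complete EM routine for a mixture of Gaussians. The sketch below is my own implementation of those updates (the initialization and the small covariance regularizer are arbitrary choices):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mus = X[rng.choice(N, K, replace=False)].copy()          # initialize means at data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    alpha = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities q_{kn} = p(z = k | x_n, theta)
        p = np.stack([alpha[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                      for k in range(K)], axis=1)
        q = p / p.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of alpha, mu, Sigma
        Nk = q.sum(axis=0)
        alpha = Nk / N
        mus = (q.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mus[k]
            Sigmas[k] = (q[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
    return alpha, mus, Sigmas

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 0.5, (150, 2))])
print(em_gmm(X, K=2)[1])    # recovered component means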
19 EM for MOG [Figure: EM fits to a two-dimensional dataset, panels (a)-(i) showing the mixture after increasing amounts of training, labelled L = 1, 4, 6, 8, 10, 12.]
20 Compare: K-means The EM algorithm for mixtures of Gaussians is just like a soft version of the K-means algorithm. In the K-means E-step we do hard assignment: c^{t+1}_n = argmin_k (x_n − µ^t_k)^T Σ^{-1}_k (x_n − µ^t_k) In the K-means M-step we update the means as the weighted sum of the data, but now the weights are 0 or 1: µ^{t+1}_k = Σ_n [c^{t+1}_n = k] x_n / Σ_n [c^{t+1}_n = k] [Figure: panels (a)-(f) showing successive K-means iterations.]
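For comparison, a minimal K-means loop in the same style (my own sketch, using plain Euclidean distance, i.e. Σ_k = I in the hard E-step above):

import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(iters):
        # hard E-step: assign each point to its nearest centre
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # hard M-step: each centre becomes the mean of its assigned points
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return mu, c

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
print(kmeans(X, K=2)[0])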
21 Continuous Latent Variables In many models there are some underlying causes of the data. Mixture models use a discrete class variable: clustering. Sometimes it is more appropriate to think in terms of continuous factors which control the data we observe. Geometrically, this is equivalent to thinking of a data manifold or subspace. To generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable. [Figure: a plane with mean µ and axes λ_1, λ_2 embedded in the (y_1, y_2, y_3) space.]
22 Factor Analysis When we assume that the subspace is linear and that the underlying latent variable x has a Gaussian distribution, we get a model known as factor analysis: data y (p-dim); latent variable x (k-dim). p(x) = N(x|0, I) p(y|x, θ) = N(y|µ + Λx, Ψ) where µ is the mean vector, Λ is the p-by-k factor loading matrix, and Ψ is the sensor noise covariance (usually diagonal). Important: since the product of Gaussians is still Gaussian, the joint distribution p(x, y), the other marginal p(y), and the conditional p(x|y) are also Gaussian.
23 FA = Constrained Covariance Gaussian Marginal density for factor analysis (y is p-dim, x is k-dim): p(y|θ) = N(y|µ, ΛΛ^T + Ψ) (integrate out x) So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix: Cov[y] = ΛΛ^T + Ψ In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.) Learning: how should we fit the ML parameters? It is easy to find µ: just take the mean of the data. From now on assume we have done this and re-centred y. What about the other parameters?
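A quick numeric illustration (mine) of the constrained covariance: sample from the FA generative model and compare the sample covariance of y with ΛΛ^T + Ψ. The dimensions and parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
p, k, N = 5, 2, 200000
Lam = rng.normal(size=(p, k))                       # factor loadings
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))        # diagonal sensor noise
mu  = np.zeros(p)

x = rng.normal(size=(N, k))                         # p(x) = N(0, I)
y = mu + x @ Lam.T + rng.normal(size=(N, p)) @ np.sqrt(Psi)   # p(y | x) = N(mu + Lam x, Psi)

print(np.round(Lam @ Lam.T + Psi, 2))               # model covariance
print(np.round(np.cov(y.T), 2))                     # sample covariance (should be close)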
24 EM for Factor Analysis We will do maximum likelihood learning using the EM algorithm. E-step: q^{t+1}_n = p(x_n|y_n, θ^t) M-step: θ^{t+1} = argmax_θ Σ_n ∫ q^{t+1}(x_n|y_n) log p(y_n, x_n|θ) dx_n For the E-step we need the conditional distribution (inference). For the M-step we need the expected log of the complete data. E-step: q^{t+1}_n = p(x_n|y_n, θ^t) = N(x_n|m_n, V_n) M-step: Λ^{t+1} = argmax_Λ Σ_n ⟨ℓ_c(x_n, y_n)⟩_{q^{t+1}_n} Ψ^{t+1} = argmax_Ψ Σ_n ⟨ℓ_c(x_n, y_n)⟩_{q^{t+1}_n} Inferring the posterior mean is just a linear operation, and the posterior covariance does not depend on the observed data: m_n = β(y_n − µ) V = (I + Λ^T Ψ^{-1} Λ)^{-1} where β can be computed beforehand given the model parameters.
25 EM Algorithm for Factor Analysis First, set µ equal to the sample mean (1/N) Σ_n y_n, and subtract this mean from all the data. Now run the following iterations: E-step: q^{t+1}_n = p(x_n|y_n, θ^t) = N(x_n|m_n, V_n) V_n = (I + Λ^T Ψ^{-1} Λ)^{-1} m_n = V_n Λ^T Ψ^{-1} (y_n − µ) M-step: Λ^{t+1} = (Σ_n y_n m_n^T)(Σ_n V_n + m_n m_n^T)^{-1} Ψ^{t+1} = (1/N) diag[Σ_n y_n y_n^T − Λ^{t+1} Σ_n m_n y_n^T]
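The iterations above translate into a short routine. The following is my own sketch of EM for factor analysis under the standard updates (the SVD-based initialization and the floor on Ψ are my choices, not from the slide):

import numpy as np

def em_factor_analysis(Y, k, iters=200):
    N, p = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu                                              # re-centre the data
    Lam = np.linalg.svd(Yc, full_matrices=False)[2][:k].T    # crude initialization
    Psi = np.var(Yc, axis=0)                                 # diagonal noise, stored as a vector
    for _ in range(iters):
        # E-step: q(x_n) = N(m_n, V); V is shared because it does not depend on y_n
        PsiInv = 1.0 / Psi
        V = np.linalg.inv(np.eye(k) + (Lam * PsiInv[:, None]).T @ Lam)
        M = Yc @ (PsiInv[:, None] * Lam) @ V                 # rows are the posterior means m_n
        # M-step
        Exx = N * V + M.T @ M                                # sum_n <x_n x_n^T>
        Lam = (Yc.T @ M) @ np.linalg.inv(Exx)
        Psi = np.mean(Yc ** 2, axis=0) - np.einsum('pk,nk,np->p', Lam, M, Yc) / N
        Psi = np.maximum(Psi, 1e-6)
    return mu, Lam, Psi

rng = np.random.default_rng(0)
Y = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(500, 6))
print(np.round(em_factor_analysis(Y, k=2)[2], 3))            # estimated sensor noise variances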
26 Principal Component Analysis In factor analysis, we can write the marginal density explicitly: p(y|θ) = ∫ p(x) p(y|x, θ) dx = N(y|µ, ΛΛ^T + Ψ) The noise Ψ must be restricted for the model to be interesting. (Why?) In factor analysis the restriction is that Ψ is diagonal (axis-aligned). What if we further restrict Ψ = σ²I (i.e. spherical)? We get the Probabilistic Principal Component Analysis (PPCA) model: p(x) = N(x|0, I) p(y|x, θ) = N(y|µ + Λx, σ²I) where µ is the mean vector, the columns of Λ are the principal components (usually orthogonal), and σ² is the global sensor noise.
27 PCA: Zero Noise Limit The traditional PCA model is actually a limit as σ² → 0. The model we saw is actually called probabilistic PCA. However, the ML parameters Λ are the same. The only difference is the global sensor noise σ². In the zero noise limit inference is easier: orthogonal projection. lim_{σ²→0} Λ^T (ΛΛ^T + σ²I)^{-1} = (Λ^T Λ)^{-1} Λ^T [Figure: data in (y_1, y_2, y_3) projected orthogonally onto the principal subspace through µ.]
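A short numeric check (mine) of the zero-noise limit stated above: the PPCA posterior projection Λ^T(ΛΛ^T + σ²I)^{-1} approaches the orthogonal projection (Λ^TΛ)^{-1}Λ^T as σ² → 0.

import numpy as np

rng = np.random.default_rng(0)
Lam = rng.normal(size=(5, 2))                 # arbitrary full-column-rank loadings
proj = np.linalg.inv(Lam.T @ Lam) @ Lam.T     # orthogonal projection onto the subspace

for s2 in [1.0, 1e-2, 1e-6]:
    beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + s2 * np.eye(5))
    print(s2, np.max(np.abs(beta - proj)))    # the gap shrinks as sigma^2 -> 0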
28 Direct Fitting For FA the parameters are coupled in a way that makes it impossible to solve for the ML parameters directly. We must use EM or other nonlinear optimization techniques. But for (P)PCA, the ML parameters can be solved for directly: the k-th column of Λ is the k-th largest eigenvalue of the sample covariance S times the associated eigenvector, and the global sensor noise σ² is the average of all the eigenvalues smaller than the k-th one. This technique is also good for initializing FA. Actually, PCA is the limit as the ratio of the noise variance on the output to the prior variance on the latent variables goes to zero. We can either achieve this with zero noise or with infinite variance priors.
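A sketch of the direct fit (mine): take the top-k eigenvectors of the sample covariance, set σ² from the discarded eigenvalues, and scale the retained directions. This follows the standard Tipping-Bishop closed-form ML solution for PPCA (columns scaled by sqrt(λ_k − σ²)) rather than repeating the slide's wording verbatim; it can also serve as an FA initializer.

import numpy as np

def ppca_direct(Y, k):
    N, p = Y.shape
    mu = Y.mean(axis=0)
    S = np.cov(Y.T, bias=True)                      # sample covariance
    evals, evecs = np.linalg.eigh(S)                # eigh returns ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]      # sort descending
    sigma2 = evals[k:].mean()                       # average of the discarded eigenvalues
    Lam = evecs[:, :k] * np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))
    return mu, Lam, sigma2

rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 8)) + 0.2 * rng.normal(size=(1000, 8))
mu, Lam, sigma2 = ppca_direct(Y, k=3)
print(sigma2)                                       # close to the true noise variance 0.2**2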
29 Gaussians are Footballs in High-D Recall the intuition that Gaussians are hyperellipsoids. Mean == centre of the football. Eigenvectors of the covariance matrix == axes of the football. Eigenvalues == lengths of the axes. In FA our noise football is an axis-aligned cigar. In PPCA our noise football is a sphere of radius σ². [Figure: the noise ellipsoid Ψ for FA vs. the sphere σ²I for PCA.]
30 Scale Invariance in Factor Analysis In FA the scale of the data is unimportant: we can multiply y_i by α_i without changing anything: µ_i → α_i µ_i Λ_ij → α_i Λ_ij (for all j) Ψ_i → α_i² Ψ_i However, the rotation of the data is important. FA looks for directions of large correlation in the data, so it is not fooled by large-variance noise. [Figure: FA vs. PCA fits under rescaled axes.]
31 Rotational Invariance in PCA In PCA the rotation of the data is unimportant: we can multiply the data y by any rotation Q without changing anything: µ → Qµ Λ → QΛ Ψ unchanged However, the scale of the data is important. PCA looks for directions of large variance, so it will chase big noise directions. [Figure: FA vs. PCA fits under rotated data.]
32 Recap: EM Algorithm A way of maximizing the likelihood function for latent variable models. It finds ML parameters when the original (hard) problem can be broken up into two (easy) pieces: 1. Estimate some missing or unobserved data from the observed data and current parameters. 2. Using this complete data, find the maximum likelihood parameter estimates. Alternate between filling in the latent variables using our best guess (posterior) and updating the parameters based on this guess: E-step: q^{t+1} = p(z|x, θ^t) M-step: θ^{t+1} = argmax_θ Σ_z q(z|x) log p(x, z|θ) In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound equal to the likelihood.