But if z is conditioned on, we need to model it:
|
|
- Rosamond George
- 5 years ago
- Views:
Transcription
1 Partially Unobserved Variables Lecture 8: Unsupervised Learning & EM Algorithm Sam Roweis October 28, 2003 Certain variables q in our models may be unobserved, either at training time or at test time or both. If the are occasionally unobserved they are missing data. e.g. undefinied inputs, missing class labels, erroneous target values In this case, we define a new cost function in which we integrate out the missing values at training or test time: l(θ; D) = log p(x c, y c θ) + log p(x m θ) complete missing = log p(x c, y c θ) + log complete missing y p(x m, y θ) Variables which are always unobserved are called latent variables. Missing Outputs You can thin of unsupervised learning as supervised learning in which all the outputs are missing: Clustering == classification with missing labels. Dimensionality reduction == regression with missing targets. Density estimation is actually very general and encompasses the two problems above and a whole lot more. Let s focus on the idea of unobserved variables... Latent Variables What to do when a variable is always unobserved? Depends on where it appears in our model. If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out. e.g. given y, x fit the model p(, y x) = p( y)p(y x, w)p(w). But if is conditioned on, we need to model it: e.g. given y, x fit the model p(y x) = p(y x, )p() Latent variables may appear naturally, from the structure of the problem. But also, we may want to intentionally introduce latent variables to model complex dependencies between variables without looing at the dependencies between them directly. This can actually simplify the model (e.g. mixtures).
2 Why is Learning Harder? In fully observed settings, the probability model is a product thus the log lielihood is a sum where terms decouple. l(θ; D) = n = n log p(y n, x n θ) log p(x n θ x ) + n log p(y n x n, θ y ) Mixture Models Most basic latent variable model with a single discrete node. Allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space. Divide and conquer idea: use simple parts to build complex models. (e.g. multimodal densities, or piecewise-linear regressions). With latent variables, the probability already contains a sum, so the log lielihood has all parameters coupled together: l(θ; D) = n log p(x n, θ) = n log p( θ )p(x n, θ x ) x Learning with Latent Variables Lielihood l(θ) = log p( θ )p(x, θ x ) couples parameters: We can treat this as a blac box probability function and just try to optimie the lielihood as a function of θ. We did this many times before by taing gradients. However, sometimes taing advantage of the latent variable structure can mae parameter estimation easier. Good news: today we will see the EM algorithm which allows us to treat learning with latent variables using fully observed tools. Basic tric: guess the values you don t now. Basic math: use convexity to lower bound the lielihood. Mixture Densities Exactly lie a classification model but the class is unobserved and so we sum it out. What we get is a perfectly valid density: K p(x θ) = p( = θ )p(x =, θ ) =1 = α p (x θ ) where the mixing proportions add to one: α = 1. We can use Bayes rule to compute the posterior probability of the mixture component given some data: r (x) = p( = x, θ) = α p (x θ ) j α jp j (x θ j ) these quantities are called responsibilities. You ve seen them many times before; now you now their names!
3 Learning with Mixtures We can learn mixture densities using gradient descent on the lielihood as usual. The gradients are quite interesting: l(θ) = log p(x θ) = log α p (x θ ) l θ = 1 p α (x θ ) p(x θ) θ = 1 α p(x θ) p (x θ ) log p (x θ ) θ = p α (x θ ) l = p(x θ) θ α r l θ In other words, the gradient is the responsibility weighted sum of the individual log lielihood gradients. Mixture of Linear Regression Experts Each expert generates data according to a linear function of the input plus additive Gaussian noise: p(y x, θ) = α (x)n (y β x, σ2 ) The gate function can be a softmax classification machine: eη x α (x) = p( = x) = j eη j x Remember: we are not modeling the density of the inputs x. Learning? Gradient descent is one option. You have seen the gradients for this example before. In a minute you will see another one... Conditional Mixtures: MOEs Revisited Mixtures of Experts are also called conditional mixtures. Exactly lie a class-conditional classification model, except the class is unobserved and so we sum it out: K p(y x, θ) = p( = x, θ )p(y =, x, θ ) where α (x) = 1 =1 = α (x θ )p (y x, θ ) x. Harder: must learn α (x) (unless chose independent of x). The α (x) are exactly what we called the gating function. We can still use Bayes rule to compute the posterior probability of the mixture component given some data: p( = x, y, θ) = α (x)p (y x, θ ) j α j(x)p j (y x, θ j ) Clustering Example: Gaussian Mixture Models Consider a mixture of K Gaussian components: x p(x θ) = α N (x µ, Σ ) p( = x, θ) = α N (x µ, Σ ) j α jn (x µ, Σ ) l(θ; D) = n log α N (x n µ, Σ ) Density model: p(x θ) is a familiarity signal. Clustering: p( x, θ) is the assignment rule, l(θ) is the cost.
4 Mixure of Gaussians Learning We can learn mixtures of Gaussians using gradient descent. For example, the gradients of the means: l(θ) = log p(x θ) = log l θ = l µ = α r l θ = α r Σ 1 (x µ ) α p (x θ ) α r log p (x θ ) θ Gradients of covariance matrices are harder: require derivatives of log determinants and quadratic forms. Must ensure that mixing proportions α are positive and sum to unity and that covariance matrices are positive definite. Be careful: logsum Often you can easily compute b = log p(x =, θ ), but it will be very negative, say or smaller. Now, to compute l = log p(x θ) you need to compute log eb. Careful! Do not compute this by doing log(sum(exp(b))). You will get underflow and an incorrect answer. Instead do this: Add a constant exponent B to all the values b such that the largest value comes close to the maxiumum exponent allowed by machine precision: B = MAXEXPONENT-log(K)-max(b). Compute log(sum(exp(b+b)))-b. Example: if log p(x = 1) = 420 and log p(x = 2) = 420, what is log p(x) = log [p(x = 1) + p(x = 2)]? Answer: log[2e 420 ] = log 2. Parameter Constraints If we want to use general optimiations (e.g. conjugate gradient) to learn latent variable models, we often have to mae sure parameters respect certain constraints. (e.g. α = 1, Σ pos.definite). A good tric is to reparameterie these quantities in terms of unconstrained values. For mixing proportions, use the softmax: α = exp(q ) j exp(q j) For covariance matrices, use the Cholesy decomposition: Σ 1 = A A Σ 1/2 = A ii i where A is upper diagonal with positive diagonal: A ii = exp(r i ) > 0 A ij = a ij (j > i) A ij = 0 (j < i) Recap: Learning with Latent Variables With latent variables, the probability contains a sum, so the log lielihood has all parameters coupled together: l(θ; D) = log p(x, θ) = log p( θ )p(x, θ x ) (we can also consider continuous and replace with ) If the latent variables were observed, parameters would decouple again and learning would be easy: l(θ; D) = log p(x, θ) = log p( θ ) + log p(x, θ x ) One idea: ignore this fact, compute l/ θ, and do learning with a smart optimier lie conjugate gradient. Another idea: what if we use our current parameters to guess the values of the latent variables, and then do fully-observed learning? This bac-and-forth tric might mae optimiation easier.
5 Expectation-Maximiation (EM) Algorithm Iterative algorithm with two lined steps: E-step: fill in values of ẑ t using p( x, θ t ). M-step: update parameters using θ t+1 argmax l(θ; x, ẑ t ). E-step involves inference, which we need to do at runtime anyway. M-step is no harder than in fully observed case. We will prove that this procedure monotonically improves l (or leaves it unchanged). Thus it always converges to a local optimum of the lielihood (as any optimier should). Note: EM is an optimiation strategy for objective functions that can be interpreted as lielihoods in the presence of missing data. EM is not a cost function such as maximum-lielihood. EM is not a model such as mixture-of-gaussians. Expected Complete Log Lielihood For any distribution q() define expected complete log lielihood: l q (θ; x) = l c (θ; x, ) q q( x) log p(x, θ) Amaing fact: l(θ) l q (θ) + H(q) because of concavity of log: l(θ; x) = log p(x θ) = log = log p(x, θ) p(x, θ) q( x) q( x) q( x) log p(x, θ) q( x) Where the inequality is called Jensen s inequality. (It is only true for distributions: q() = 1; q() > 0.) Complete & Incomplete Log Lielihoods Observed variables x, latent variables, parameters θ: is the complete log lielihood. l c (θ; x, ) = log p(x, θ) Usually optimiing l c (θ) given both and x is straightforward. (e.g. class conditional Gaussian fitting, linear regression) With unobserved, we need the log of a marginal probability: l(θ; x) = log p(x θ) = log p(x, θ) Lower Bounds and Free Energy For fixed data x, define a functional called the free energy: F (q, θ) p(x, θ) q( x) log l(θ) q( x) The EM algorithm is coordinate-ascent on F : E-step: q t+1 = argmax q F (q, θ t ) M-step: θ t+1 = argmax θ F (q t+1, θ t ) which is the incomplete log lielihood.
6 M-step: maximiation of expected l c Note that the free energy breas into two terms: F (q, θ) = = q( x) log p(x, θ) q( x) q( x) log p(x, θ) q( x) log q( x) EM Constructs Sequential Convex Lower Bounds Consider the lielihood function and the function F (q t+1, ). lielihood F( θ,q t+1 ) = l q (θ; x) + H(q) (this is where its name comes from) The first term is the expected complete log lielihood (energy) and the second term, which does not depend on θ, is the entropy. Thus, in the M-step, maximiing with respect to θ for fixed q we only need to consider the first term: θ t+1 = argmax θ l q (θ; x) = argmax θ q( x) log p(x, θ) θ t θ E-step: inferring latent posterior Claim: the optimim setting of q in the E-step is: q t+1 = p( x, θ t ) This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification). Proof (easy): this setting saturates the bound l(θ; x) F (q, θ) F (p( x, θ t ), θ t ) = p( x, θ t ) log p(x, θt ) p( x, θ t ) = p( x, θ t ) log p(x θ t ) Partially Hidden Data Of course, we can learn when there are missing (hidden) variables on some cases and not on others. In this case the cost function was: l(θ; D) = log p(x c, y c θ) + log complete missing y log p(x m, y θ) Now you can thin of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only. The M-step optimies the log lielihood on the complete data plus the expected lielihood on the incomplete data using the E-step. = log p(x θ t ) p( x, θt ) = l(θ; x) 1 Can also show this result using variational calculus or the fact that l(θ) F (q, θ) = KL[q p( x, θ)]
7 Recap: EM Algorithm EM for MOG A way of maximiing lielihood function for latent variable models. Finds ML parameters when the original (hard) problem can be broen up into two (easy) pieces: L = 1 L = 4 1. Estimate some missing or unobserved data from observed data and current parameters. 2. Using this complete data, find the maximum lielihood parameter estimates. L = 6 (a) (c) (d) L = 8 L = 10 L = 12 (e) Alternate between filling in the latent variables using our best guess (posterior) and updating the paramters based on this guess: E-step: q t+1 = p( x, θ t ) M-step: θ t+1 = argmax θ q( x) log p(x, θ) In the M-step we optimie a lower bound on the lielihood. In the E-step we close the gap, maing bound=lielihood. (f) (g) (h) (i) Example: Mixtures of Gaussians Recall: a mixture of K Gaussians: p(x θ) = α N (x µ, Σ ) l(θ; D) = n log α N (x n µ, Σ ) Learning with EM algorithm: E step : p t n = N (xn µ t, Σt ) qn t+1 = p( = xn, θ t ) = αt pt n j αt j pt n n M step : µ t+1 = qt+1 n xn n qt+1 n n qt+1 n (xn µ t+1 )(x n µ t+1 Σ t+1 = α t+1 = 1 M n q t+1 n n qt+1 n ) Derivation of M-step Expected complete log lielihood l q (θ; D): [ q n log α 1 2 (xn µ t+1 ) Σ 1 (xn µ t+1 ) 1 ] 2 log 2πΣ n For fixed q we can optimie the parameters: l q = Σ 1 µ q n (x n µ ) n l q Σ 1 = 1 [ ] q 2 n Σ (xn µ t+1 )(x n µ t+1 ) n l q = 1 q α α n λ (λ = M) n Fact: log A 1 A 1 = A and x Ax A = xx
8 Compare: K-Means The EM algorithm for mixtures of Gaussians is just lie a soft version of the K-means algorithm with fixed priors and covariance. Instead of hard assignment in the E-step, we do soft assignment based on the softmax of the squared distance from each point to each cluster. Each centre is then moved to the weighted mean of the data, with weights given by soft assignments. In K-means, the weights are 0 or 1. E step : d t n = 1 2 (xn µ t ) Σ 1 (x n µ t ) Variants Sparse EM: Do not recompute exactly the posterior probability on each data point under all models, because it is almost ero. Instead eep an active list which you update every once in a while. Generalied (Incomplete) EM: It might be hard to find the ML parameters in the M-step, even given the completed data. We can still mae progress by doing an M-step that improves the lielihood a bit (e.g. gradient step). q t+1 n = exp( dt n ) j exp( dt jn ) = p(ct n = x n, µ t ) n M step : µ t+1 = qt+1 n xn n qt+1 n Some good things about EM: A Report Card for EM no learning rate parameter very fast for low dimensions each iteration guaranteed to improve lielihood adapts unused units rapidly Some bad things about EM: can get stuc in local minima both steps require considering all explanations of the data which is an exponential amount of wor in the dimension of θ EM is typically used with mixture models, for example mixtures of Gaussians or mixtures of experts. The missing data are the labels showing which sub-model generated each datapoint. Very common: also used to train HMMs, Boltmann machines,...
Variables which are always unobserved are called latent variables or sometimes hidden variables. e.g. given y,x fit the model p(y x) = z p(y x,z)p(z)
CSC2515 Machine Learning Sam Roweis Lecture 8: Unsupervised Learning & EM Algorithm October 31, 2006 Partially Unobserved Variables 2 Certain variables q in our models may be unobserved, either at training
More informationLecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan
Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationClustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationExpectation Maximization Algorithm
Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More information10-701/15-781, Machine Learning: Homework 4
10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one
More informationTechnical Details about the Expectation Maximization (EM) Algorithm
Technical Details about the Expectation Maximization (EM Algorithm Dawen Liang Columbia University dliang@ee.columbia.edu February 25, 2015 1 Introduction Maximum Lielihood Estimation (MLE is widely used
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationSeries 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning)
Exercises Introduction to Machine Learning SS 2018 Series 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning) LAS Group, Institute for Machine Learning Dept of Computer Science, ETH Zürich Prof
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationLatent Variable Models
Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:
More informationCSE446: Clustering and EM Spring 2017
CSE446: Clustering and EM Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, Dan Klein, and Luke Zettlemoyer Clustering systems: Unsupervised learning Clustering Detect patterns in unlabeled
More informationExpectation Maximization
Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger
More informationClustering with k-means and Gaussian mixture distributions
Clustering with k-means and Gaussian mixture distributions Machine Learning and Object Recognition 2017-2018 Jakob Verbeek Clustering Finding a group structure in the data Data in one cluster similar to
More informationMACHINE LEARNING AND PATTERN RECOGNITION Fall 2006, Lecture 8: Latent Variables, EM Yann LeCun
Y. LeCun: Machine Learning and Pattern Recognition p. 1/? MACHINE LEARNING AND PATTERN RECOGNITION Fall 2006, Lecture 8: Latent Variables, EM Yann LeCun The Courant Institute, New York University http://yann.lecun.com
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationK-Means and Gaussian Mixture Models
K-Means and Gaussian Mixture Models David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 42 K-Means Clustering K-Means Clustering David
More informationGaussian Mixture Models, Expectation Maximization
Gaussian Mixture Models, Expectation Maximization Instructor: Jessica Wu Harvey Mudd College The instructor gratefully acknowledges Andrew Ng (Stanford), Andrew Moore (CMU), Eric Eaton (UPenn), David Kauchak
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationBayesian Approach 2. CSC412 Probabilistic Learning & Reasoning
CSC412 Probabilistic Learning & Reasoning Lecture 12: Bayesian Parameter Estimation February 27, 2006 Sam Roweis Bayesian Approach 2 The Bayesian programme (after Rev. Thomas Bayes) treats all unnown quantities
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 9: Expectation Maximiation (EM) Algorithm, Learning in Undirected Graphical Models Some figures courtesy
More informationCS Lecture 18. Expectation Maximization
CS 6347 Lecture 18 Expectation Maximization Unobserved Variables Latent or hidden variables in the model are never observed We may or may not be interested in their values, but their existence is crucial
More informationWeek 3: The EM algorithm
Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationComputing the MLE and the EM Algorithm
ECE 830 Fall 0 Statistical Signal Processing instructor: R. Nowak Computing the MLE and the EM Algorithm If X p(x θ), θ Θ, then the MLE is the solution to the equations logp(x θ) θ 0. Sometimes these equations
More informationClustering with k-means and Gaussian mixture distributions
Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2014-2015 Jakob Verbeek, ovember 21, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15
More informationMixtures of Gaussians continued
Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationClustering, K-Means, EM Tutorial
Clustering, K-Means, EM Tutorial Kamyar Ghasemipour Parts taken from Shikhar Sharma, Wenjie Luo, and Boris Ivanovic s tutorial slides, as well as lecture notes Organization: Clustering Motivation K-Means
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are
More informationLogistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017
1 Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm1 that works by trying to learn a function
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s
More informationWeighted Finite-State Transducers in Computational Biology
Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial
More informationEM Algorithm LECTURE OUTLINE
EM Algorithm Lukáš Cerman, Václav Hlaváč Czech Technical University, Faculty of Electrical Engineering Department of Cybernetics, Center for Machine Perception 121 35 Praha 2, Karlovo nám. 13, Czech Republic
More informationVariational Autoencoder
Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational
More informationExpectation Maximization and Mixtures of Gaussians
Statistical Machine Learning Notes 10 Expectation Maximiation and Mixtures of Gaussians Instructor: Justin Domke Contents 1 Introduction 1 2 Preliminary: Jensen s Inequality 2 3 Expectation Maximiation
More informationThe Multivariate Gaussian Distribution [DRAFT]
The Multivariate Gaussian Distribution DRAFT David S. Rosenberg Abstract This is a collection of a few key and standard results about multivariate Gaussian distributions. I have not included many proofs,
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationEstimating Gaussian Mixture Densities with EM A Tutorial
Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables
More informationLecture 14. Clustering, K-means, and EM
Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.
More informationMixture Models and Expectation-Maximization
Mixture Models and Expectation-Maximiation David M. Blei March 9, 2012 EM for mixtures of multinomials The graphical model for a mixture of multinomials π d x dn N D θ k K How should we fit the parameters?
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationCS534 Machine Learning - Spring Final Exam
CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationPattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM
Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures
More informationExpectation Maximization
Expectation Maximization Aaron C. Courville Université de Montréal Note: Material for the slides is taken directly from a presentation prepared by Christopher M. Bishop Learning in DAGs Two things could
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Kernel Density Estimation, Factor Analysis Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 2: 2 late days to hand it in tonight. Assignment 3: Due Feburary
More informationReview and Motivation
Review and Motivation We can model and visualize multimodal datasets by using multiple unimodal (Gaussian-like) clusters. K-means gives us a way of partitioning points into N clusters. Once we know which
More informationOutline lecture 6 2(35)
Outline lecture 35 Lecture Expectation aximization E and clustering Thomas Schön Division of Automatic Control Linöping University Linöping Sweden. Email: schon@isy.liu.se Phone: 13-1373 Office: House
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationSTATS 306B: Unsupervised Learning Spring Lecture 2 April 2
STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised
More informationLecture 6: April 19, 2002
EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2002 Lecturer: Jeff Bilmes Lecture 6: April 19, 2002 University of Washington Dept. of Electrical Engineering Scribe: Huaning Niu,Özgür Çetin
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationClustering with k-means and Gaussian mixture distributions
Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2012-2013 Jakob Verbeek, ovember 23, 2012 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.12.13
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of
More informationMachine Learning Lecture Notes
Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More informationSupport Vector Machines and Bayes Regression
Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample
More informationExpectation maximization
Expectation maximization Subhransu Maji CMSCI 689: Machine Learning 14 April 2015 Motivation Suppose you are building a naive Bayes spam classifier. After your are done your boss tells you that there is
More informationWeek 5: Logistic Regression & Neural Networks
Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More information