Expectation maximization tutorial
Octavian Ganea, November 18
Today
Expectation-maximization algorithm
Topic modelling
ML & MAP
Observed data: $X = \{x_1, x_2, \dots, x_N\}$
Probabilistic model of the data: $p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$
Estimate parameters:
Maximum likelihood: $\hat{\theta}_{ML} = \arg\max_\theta p(X \mid \theta)$
Maximum a-posteriori: $\hat{\theta}_{MAP} = \arg\max_\theta p(\theta \mid X) = \arg\max_\theta [\log p(\theta) + \log p(X \mid \theta)]$
Maximizing the log-likelihood
Observed data: $X = \{x_1, x_2, \dots, x_N\}$
Log-likelihood: $l(\theta) = \log p(X \mid \theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)$
Latent variables: $\log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta)$
Hard to maximize $l(\theta)$ directly (no closed-form solution in most of the interesting cases).
One solution: use a gradient method (e.g. gradient ascent, Newton).
However, sometimes the gradient is hard to compute or hard to implement, or we do not want a black-box optimization routine with no guarantees.
Expectation-maximization algorithm
Used in models with latent variables.
Iterative algorithm that guarantees convergence to a stationary point of $l(\theta)$ (i.e. a point with zero gradient, either a local optimum or a saddle point).
No global optimum guarantees: EM reaches either a local maximum or a saddle point.
Convergence might be slow.
Idea: builds a sequence $l(\theta^{(0)}) \le l(\theta^{(1)}) \le \dots \le l(\theta^{(t)}) \le \dots$
At each step, using Jensen's inequality, finds a lower bound $g$ s.t. $l(\theta^{(t)}) \le g(\theta^{(t+1)}, q) \le l(\theta^{(t+1)})$
Expectation-maximization algorithm
For any probability distribution $q(Z)$ (s.t. $\sum_Z q(Z) = 1$), Jensen's inequality gives a lower bound $F(q, \theta)$ on the true log-likelihood:
$l(\theta) = \log \sum_Z p(X, Z \mid \theta) = \log \sum_Z q(Z) \frac{p(X, Z \mid \theta)}{q(Z)} \ge \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} := F(q, \theta)$
Reason: $\log(\cdot)$ is concave.
Equality case: $q(Z) = p(Z \mid X, \theta)$.
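
The bound can be checked numerically on a toy example. The sketch below is illustrative only: the three joint values `p(X, Z=z | theta)` are made-up numbers, and the function names are not from the slides. It compares $F(q, \theta)$ for a uniform $q$ and for the posterior $q(Z) = p(Z \mid X, \theta)$ against $l(\theta)$.

```python
import math

def log_likelihood(joint):
    """l(theta) = log sum_Z p(X, Z | theta) for one fixed observation X."""
    return math.log(sum(joint))

def lower_bound(q, joint):
    """F(q, theta) = sum_Z q(Z) * log(p(X, Z | theta) / q(Z))."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

# Hypothetical values of p(X, Z=z | theta) for three latent states (assumption).
joint = [0.10, 0.05, 0.25]
ell = log_likelihood(joint)

uniform_q = [1 / 3, 1 / 3, 1 / 3]
posterior_q = [pz / sum(joint) for pz in joint]  # p(Z | X, theta)

bound_uniform = lower_bound(uniform_q, joint)      # strictly below l(theta)
bound_posterior = lower_bound(posterior_q, joint)  # tight: equals l(theta)
print(ell, bound_uniform, bound_posterior)
```

With the posterior as $q$, every ratio $p(X, Z \mid \theta) / q(Z)$ equals $p(X \mid \theta)$, which is why the bound becomes tight.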
Expectation-maximization algorithm
Update rule: $\theta^{(t+1)} = \arg\max_\theta g_t(\theta)$
where $g_t(\theta) := F(p(Z \mid X, \theta^{(t)}), \theta) = \sum_Z p(Z \mid X, \theta^{(t)}) \log \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta^{(t)})}$
From above, $g_t(\theta) \le l(\theta)\ \forall \theta$; in particular $g_t(\theta^{(t+1)}) \le l(\theta^{(t+1)})$.
Equality in Jensen: $g_t(\theta^{(t)}) = l(\theta^{(t)})$
So: $l(\theta^{(t)}) = g_t(\theta^{(t)}) \le g_t(\theta^{(t+1)}) \le l(\theta^{(t+1)})$
Expectation-maximization algorithm
EM algorithm:
E-step: $q^{(t+1)} = \arg\max_q F(q, \theta^{(t)})$ (i.e. $q^{(t+1)} = p(Z \mid X, \theta^{(t)})$)
M-step: $\theta^{(t+1)} = \arg\max_\theta F(q^{(t+1)}, \theta)$
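EM algorithm - convergence
We proved so far that: $l(\theta^{(0)}) \le l(\theta^{(1)}) \le \dots \le l(\theta^{(t)}) \le \dots$
But why does it converge to a stationary point? (Who guarantees no early stopping?)
Proof: Let $\theta^*$ be the limit of the sequence defined by the EM algorithm. Then $\theta^* = \arg\max_\theta g^*(\theta)$, where $g^*(\theta) = F(p(Z \mid X, \theta^*), \theta)$. This implies $\nabla_\theta g^*(\theta^*) = 0$.
Let $h^*(\theta) := l(\theta) - g^*(\theta) = -\sum_Z p(Z \mid X, \theta^*) \log \frac{p(Z \mid X, \theta)}{p(Z \mid X, \theta^*)}$
Then $h^*(\theta) \ge 0\ \forall \theta$ (since $g^*$ is a lower bound of $l$) and $h^*(\theta^*) = 0$ (Jensen equality case).
So $\theta^* = \arg\min_\theta h^*(\theta) \Rightarrow \nabla_\theta h^*(\theta^*) = 0$
So $\nabla_\theta l(\theta^*) = \nabla_\theta h^*(\theta^*) + \nabla_\theta g^*(\theta^*) = 0$, q.e.d.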
EM Applications
Tired of too much math? :) Let's look at some cool applications of EM.
Application 1: Coin Flipping
There are two coins A and B, with $\theta_A$ and $\theta_B$ being their probabilities of landing heads when tossed.
Do 5 rounds. In each round, select one coin uniformly at random, toss it 10 times, and record the results.
The observed data consists of 50 coin tosses. However, we don't know which coin was selected in a particular round.
Estimate $\theta_A$ and $\theta_B$.
Application 1: Coin Flipping
Let's start simple: one coin A with $P(Y = H) = \theta_A$
10 tosses: $\#H = x \in \{0, \dots, 10\}$, $\#T = 10 - x$
How to estimate $\theta_A$? Maximize what we see!
Mathematically, maximize the data (log-)likelihood: $\theta_A^* = \arg\max_{\theta_A} l(\theta_A)$, where $l(\theta_A) := \log P(X = x \mid \theta_A)$
$P(X = x \mid \theta_A) = \theta_A^{x} (1 - \theta_A)^{10 - x}$ (note: fixed order of tosses)
$l(\theta_A) = x \log(\theta_A) + (10 - x) \log(1 - \theta_A)$
Set the derivative to 0: $\frac{\partial l}{\partial \theta_A}(\theta_A^*) = 0 \Rightarrow \theta_A^* = \frac{x}{10}$
The best ML distribution is the empirical distribution.
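
As a quick sanity check of the closed form $\theta_A^* = x / 10$, the sketch below maximizes the single-coin log-likelihood over a grid. The head count `heads = 7` and the grid resolution are arbitrary choices made for illustration, not values from the slides.

```python
import math

def log_lik(theta, heads, tosses=10):
    """l(theta) = x*log(theta) + (tosses - x)*log(1 - theta), fixed toss order."""
    return heads * math.log(theta) + (tosses - heads) * math.log(1 - theta)

heads = 7  # hypothetical number of heads observed in 10 tosses (assumption)

# Candidate values of theta strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda theta: log_lik(theta, heads))

print(theta_hat, heads / 10)  # the grid maximizer coincides with x / 10
```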
Application 1: Coin Flipping
Back to our original problem.
Parameters: $\theta = \{\theta_A, \theta_B\}$
Latent r.v. $Z_r$: the coin selected in round $r \in \{1, \dots, 5\}$, with $p(Z_r = A) = p(Z_r = B) = 0.5$
In each round $r$, the number of heads is $x_r$. Associated r.v. $X_r$.
$p(X_r = x_r \mid Z_r = A; \theta) = \theta_A^{x_r} (1 - \theta_A)^{10 - x_r}$
Bayes rule: $p(Z_r = A \mid x_r; \theta) = \frac{\theta_A^{x_r} (1 - \theta_A)^{10 - x_r}}{\theta_A^{x_r} (1 - \theta_A)^{10 - x_r} + \theta_B^{x_r} (1 - \theta_B)^{10 - x_r}}$
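
The Bayes-rule posterior is short enough to write out directly. In the sketch below the function name `posterior_a` and the example parameter values are assumptions chosen for illustration; the uniform prior 0.5 cancels in the ratio and is therefore omitted.

```python
def posterior_a(heads, theta_a, theta_b, tosses=10):
    """p(Z_r = A | x_r; theta) under a uniform prior over the two coins."""
    like_a = theta_a ** heads * (1 - theta_a) ** (tosses - heads)
    like_b = theta_b ** heads * (1 - theta_b) ** (tosses - heads)
    return like_a / (like_a + like_b)

# Hypothetical round with 9 heads and current guesses theta_A=0.6, theta_B=0.5.
p_a = posterior_a(9, 0.6, 0.5)
print(p_a)  # coin A is the more likely source of a heads-heavy round
```

When both coins have the same bias, the posterior falls back to the prior of 0.5, which is a handy check for an implementation.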
Application 1: Coin Flipping
Data likelihood (per round): $p(x_r; \theta) = p(x_r \mid Z_r = A; \theta)\, p(Z_r = A) + p(x_r \mid Z_r = B; \theta)\, p(Z_r = B) = 0.5 \left( \theta_A^{x_r} (1 - \theta_A)^{10 - x_r} + \theta_B^{x_r} (1 - \theta_B)^{10 - x_r} \right)$
Data log-likelihood (all rounds): $l(\theta) = \log p(X; \theta) = \sum_{r=1}^{5} \log p(x_r; \theta)$
Cannot maximize the log-likelihood directly (i.e. by setting the gradient to zero).
Instead, maximize the EM lower bound on $l(\theta)$ (formalized last time).
Application 1: Coin Flipping
EM lower bound per round (Jensen's inequality): $\log p(x_r; \theta) \ge \sum_{c = A, B} q_r(Z_r = c) \log \frac{p(x_r, Z_r = c; \theta)}{q_r(Z_r = c)} := F_r(q_r, \theta)$
Expectation step: $q_r(Z_r = c) = p(Z_r = c \mid x_r; \theta^{(t)}),\ \forall r \in \{1, \dots, 5\}$
Maximization step: $\theta^{(t+1)} = \arg\max_\theta g_t(\theta)$, where $g_t(\theta) = \sum_{r=1}^{5} F_r(p(Z_r = \cdot \mid x_r, \theta^{(t)}), \theta)$
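Application 1: Coin Flipping
Maximization step: $\theta^{(t+1)} = \arg\max_\theta \sum_{r=1}^{5} \sum_{c = A, B} p(Z_r = c \mid x_r, \theta^{(t)}) \log p(x_r, Z_r = c; \theta)$
Gradient: $\frac{\partial g_t(\theta)}{\partial \theta_A} = \sum_{r=1}^{5} p(Z_r = A \mid x_r, \theta^{(t)}) \frac{\partial}{\partial \theta_A} \log p(x_r, Z_r = A; \theta) = \sum_{r=1}^{5} p(Z_r = A \mid x_r, \theta^{(t)}) \left( \frac{x_r}{\theta_A} - \frac{10 - x_r}{1 - \theta_A} \right)$
Gradient set to 0 gives: $\theta_A^{(t+1)} = \frac{\alpha_A^{(t)}}{\alpha_A^{(t)} + \beta_A^{(t)}}$
where $\alpha_A^{(t)} = \sum_{r=1}^{5} p(Z_r = A \mid x_r, \theta^{(t)})\, x_r$ and $\beta_A^{(t)} = \sum_{r=1}^{5} p(Z_r = A \mid x_r, \theta^{(t)})\, (10 - x_r)$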
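Application 1: Coin Flipping
Final algorithm:
Iteration: $t \leftarrow 0$
Initialize parameters randomly: $\theta_A^{(0)}, \theta_B^{(0)} \in (0, 1)$
Do until convergence:
$\theta_A^{(t+1)} = \frac{\alpha_A^{(t)}}{\alpha_A^{(t)} + \beta_A^{(t)}}$
$\theta_B^{(t+1)} = \frac{\alpha_B^{(t)}}{\alpha_B^{(t)} + \beta_B^{(t)}}$
$t \leftarrow t + 1$
Here $\alpha_B^{(t)}$ and $\beta_B^{(t)}$ are defined like $\alpha_A^{(t)}$ and $\beta_A^{(t)}$, with $p(Z_r = B \mid x_r, \theta^{(t)})$ in place of $p(Z_r = A \mid x_r, \theta^{(t)})$.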
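
The whole procedure fits in a few lines of Python. This is a minimal sketch rather than a reference implementation: the head counts in `heads` are made up for illustration (the slides give no data), the initial values are fixed instead of random so that the run is reproducible, and the stopping rule (log-likelihood gain below `tol`) is one possible choice of convergence test.

```python
import math

TOSSES = 10  # tosses per round, as in the problem statement

def likelihood(x, theta):
    """p(x_r | Z_r = c; theta) for a fixed order of tosses."""
    return theta ** x * (1 - theta) ** (TOSSES - x)

def log_likelihood(heads, theta_a, theta_b):
    """l(theta) = sum_r log(0.5 * (p(x_r | A) + p(x_r | B)))."""
    return sum(math.log(0.5 * (likelihood(x, theta_a) + likelihood(x, theta_b)))
               for x in heads)

def em_step(heads, theta_a, theta_b):
    # E-step: posterior probability that each round used coin A.
    post_a = [likelihood(x, theta_a) /
              (likelihood(x, theta_a) + likelihood(x, theta_b)) for x in heads]
    # M-step: expected head/tail counts attributed to each coin.
    alpha_a = sum(p * x for p, x in zip(post_a, heads))
    beta_a = sum(p * (TOSSES - x) for p, x in zip(post_a, heads))
    alpha_b = sum((1 - p) * x for p, x in zip(post_a, heads))
    beta_b = sum((1 - p) * (TOSSES - x) for p, x in zip(post_a, heads))
    return alpha_a / (alpha_a + beta_a), alpha_b / (alpha_b + beta_b)

def run_em(heads, theta_a, theta_b, tol=1e-9, max_iters=1000):
    history = [log_likelihood(heads, theta_a, theta_b)]
    for _ in range(max_iters):
        theta_a, theta_b = em_step(heads, theta_a, theta_b)
        history.append(log_likelihood(heads, theta_a, theta_b))
        if history[-1] - history[-2] < tol:  # assumed convergence test
            break
    return theta_a, theta_b, history

heads = [5, 9, 8, 4, 7]  # hypothetical heads per round (assumption)
theta_a, theta_b, history = run_em(heads, theta_a=0.6, theta_b=0.5)
print(theta_a, theta_b, len(history) - 1)
```

The `history` list makes the monotonicity guarantee visible: each entry is at least as large as the previous one. Different initial values can lead to different stationary points, including the label-swapped solution.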
Application 2: Topic Modelling
Document representations:
Used for classification, query retrieval, document similarity, etc.
A document can be seen as a multi-set of words: $d = \{(w_i, \mathrm{tf}(w_i; d))\}_{i = 1, \dots, V} \in \mathbb{R}^V$
Issues: high dimensionality, sparsity, potentially many infrequent words (with noisy estimated parameters).
Alternative (compressed topic representation): topic distributions $d = \{(t, p(t \mid d))\}_{t = 1, \dots, K} \in \mathbb{R}^K$
$K$ = number of topics, $K \ll V$
How to choose the number of topics $K$?
Hyper-parameter: pick the value that gives the best performance on a validation set for the task at hand.
Minimize perplexity of seen words.
Application 2: Topic Modelling
Model parameters (to be learned): $\pi_t := p(t \mid d)$, $a_{nt} := p(w_n \mid t)$
Log-likelihood (one document): $l(\pi) = \sum_{n=1}^{N} \log p(w_n \mid d) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} \pi_t a_{nt}$ (here $T$ is the number of topics, called $K$ above)
Iterative algorithm: keep $a_{nt}$ fixed, learn $\pi_t$; and reverse.
We do here just the update of $\pi_t$. The update of $a_{nt}$ is similar.
Log-likelihood with Lagrange multipliers: $L(\pi, \lambda) = \sum_{n=1}^{N} \log \sum_{t=1}^{T} \pi_t a_{nt} - \lambda \left( \sum_{t=1}^{T} \pi_t - 1 \right)$
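Application 2: Topic Modelling
Iterative update algorithm.
Latent variables $Z$ are now the topics $t$.
EM lower bound using Jensen: $L(\pi, \lambda) \ge F(q, \pi, \lambda) = \sum_{n=1}^{N} \sum_{t=1}^{T} q_{nt} \left[ \log \frac{\pi_t}{q_{nt}} + \log a_{nt} \right] - \lambda \left( \sum_{t=1}^{T} \pi_t - 1 \right)$, where $\sum_t q_{nt} = 1,\ \forall n$
E-step, iteration $k$: $q_{nt}^{(k)} = \frac{\pi_t^{(k)} a_{nt}}{\sum_{t'} \pi_{t'}^{(k)} a_{nt'}}$
M-step, iteration $k$: $\pi_t^{(k+1)} = \frac{\pi_t^{(k)}}{N} \sum_{n=1}^{N} \frac{a_{nt}}{\sum_{t'} \pi_{t'}^{(k)} a_{nt'}}$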
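
The E-step and M-step for $\pi_t$ combine into a single update, since the M-step is just the average of the responsibilities $q_{nt}$ over the $N$ words. The sketch below assumes a tiny made-up matrix `a` (4 words, 2 topics) held fixed, as on the slide; the function names and all numbers are illustrative assumptions.

```python
import math

def update_pi(pi, a):
    """One EM update of pi_t = p(t | d), with a[n][t] = p(w_n | t) held fixed."""
    new_pi = [0.0] * len(pi)
    for row in a:
        denom = sum(p * a_nt for p, a_nt in zip(pi, row))
        for t, (p, a_nt) in enumerate(zip(pi, row)):
            new_pi[t] += p * a_nt / denom  # responsibility q_nt (E-step)
    return [total / len(a) for total in new_pi]  # average over words (M-step)

def log_likelihood(pi, a):
    """l(pi) = sum_n log sum_t pi_t * a_nt."""
    return sum(math.log(sum(p * a_nt for p, a_nt in zip(pi, row))) for row in a)

# Hypothetical word-given-topic probabilities for one 4-word document (assumption).
a = [[0.50, 0.10],
     [0.40, 0.10],
     [0.05, 0.60],
     [0.05, 0.20]]

pi = [0.5, 0.5]
trace = [log_likelihood(pi, a)]
for _ in range(50):
    pi = update_pi(pi, a)
    trace.append(log_likelihood(pi, a))
print(pi, trace[-1])
```

Because each row of responsibilities sums to 1, the updated `pi` sums to 1 automatically, which is what the Lagrange multiplier enforces in the derivation.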
Questions?
More informationExpectation Maximization and Mixtures of Gaussians
Statistical Machine Learning Notes 10 Expectation Maximiation and Mixtures of Gaussians Instructor: Justin Domke Contents 1 Introduction 1 2 Preliminary: Jensen s Inequality 2 3 Expectation Maximiation
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationNote for plsa and LDA-Version 1.1
Note for plsa and LDA-Version 1.1 Wayne Xin Zhao March 2, 2011 1 Disclaimer In this part of PLSA, I refer to [4, 5, 1]. In LDA part, I refer to [3, 2]. Due to the limit of my English ability, in some place,
More informationLearning MN Parameters with Approximation. Sargur Srihari
Learning MN Parameters with Approximation Sargur srihari@cedar.buffalo.edu 1 Topics Iterative exact learning of MN parameters Difficulty with exact methods Approximate methods Approximate Inference Belief
More informationEM & Variational Bayes
EM & Variational Bayes Hanxiao Liu September 9, 2014 1 / 19 Outline 1. EM Algorithm 1.1 Introduction 1.2 Example: Mixture of vmfs 2. Variational Bayes 2.1 Introduction 2.2 Example: Bayesian Mixture of
More informationClustering with k-means and Gaussian mixture distributions
Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2014-2015 Jakob Verbeek, ovember 21, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target
More informationCSC411 Fall 2018 Homework 5
Homework 5 Deadline: Wednesday, Nov. 4, at :59pm. Submission: You need to submit two files:. Your solutions to Questions and 2 as a PDF file, hw5_writeup.pdf, through MarkUs. (If you submit answers to
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationNaïve Bayes Classifiers and Logistic Regression. Doug Downey Northwestern EECS 349 Winter 2014
Naïve Bayes Classifiers and Logistic Regression Doug Downey Northwestern EECS 349 Winter 2014 Naïve Bayes Classifiers Combines all ideas we ve covered Conditional Independence Bayes Rule Statistical Estimation
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationExpectation Maximization
Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger
More informationStatistical Computing (36-350)
Statistical Computing (36-350) Lecture 19: Optimization III: Constrained and Stochastic Optimization Cosma Shalizi 30 October 2013 Agenda Constraints and Penalties Constraints and penalties Stochastic
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationan introduction to bayesian inference
with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena
More informationCSC 411: Lecture 04: Logistic Regression
CSC 411: Lecture 04: Logistic Regression Raquel Urtasun & Rich Zemel University of Toronto Sep 23, 2015 Urtasun & Zemel (UofT) CSC 411: 04-Prob Classif Sep 23, 2015 1 / 16 Today Key Concepts: Logistic
More informationLanguage Modelling: Smoothing and Model Complexity. COMP-599 Sept 14, 2016
Language Modelling: Smoothing and Model Complexity COMP-599 Sept 14, 2016 Announcements A1 has been released Due on Wednesday, September 28th Start code for Question 4: Includes some of the package import
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational
More informationProbabilistic Graphical Models for Image Analysis - Lecture 4
Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.
More informationClustering and Gaussian Mixture Models
Clustering and Gaussian Mixture Models Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 25, 2016 Probabilistic Machine Learning (CS772A) Clustering and Gaussian Mixture Models 1 Recap
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationSTAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01
STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist
More information