The Generalized CEM Algorithm

Tony Jebara
MIT Media Lab
20 Ames St., Cambridge, MA 02139

Alex Pentland
MIT Media Lab
20 Ames St., Cambridge, MA 02139

Abstract

We propose a general approach for estimating the parameters of latent variable probability models to maximize conditional likelihood and discriminant criteria. Unlike joint likelihood, these objectives are better suited for classification and regression. The approach utilizes and extends the previously introduced CEM framework (Conditional Expectation Maximization), which reformulates EM to handle the conditional likelihood case. We generalize the CEM algorithm to estimate any mixture of exponential family densities. This includes structured graphical models over exponential families, such as HMMs. The algorithm efficiently takes advantage of the factorization of the underlying graph. In addition, the new CEM bound is tighter and more rigorous than the original one. The final result is a CEM algorithm that mirrors the EM algorithm: both estimate a variational lower bound on their respective incomplete objective functions, and both generate the same standard M-steps over complete likelihood for direct maximization. The equivalence of M-steps facilitates migration of current ML approaches to conditional criteria for improved classification and regression results.

1 Introduction

Recently, the machine learning community and its application domains have seen the proliferation of conditional or discriminative criteria for classification and regression. For instance, support vector machines [7] have generated competitive classifier systems and are being combined with probabilistic models [3]. In the speech community, discriminatively trained HMMs minimize classification error for superior phoneme labeling [6]. Mixtures of experts are used as probabilistic regressors after maximizing their conditional likelihood [5]. Even traditional neural networks employ a least-squares objective function on the output, emphasizing prediction performance [1]. All these criteria allocate modeling resources with the given task in mind, yielding improved performance. In contrast, under ML and MAP (the canonical criteria of probabilistic models), each density is trained separately to describe observations rather than to optimally solve the classification or regression task. Performance is therefore compromised.

The ML/MAP criteria suffer most when the model is inaccurate (this model mismatch occurs often in real-world situations). For visualization, observe Figure 1, where data must be classified using Gaussians (two per class). The optimal ML solution unfortunately does no better than chance on this data. In comparison, the maximum conditional likelihood solution correctly labels most examples.

Figure 1: Likelihood of class and data vs. likelihood of class given data. (a) Maximum likelihood classifier; (b) maximum conditional likelihood classifier.

Nevertheless, ML and MAP remain attractive criteria since they can be easily and convergently optimized for a large class of latent variable probability densities and graphical models. These otherwise intractable models are lower bounded and decoupled via the EM algorithm [2], which generates complete data and simple M-steps. Since the EM algorithm is limited to ML and MAP, proponents of other criteria must resort to gradient algorithms [6] or second order methods [5] to estimate latent variable probability models. These generic optimization algorithms need extra bookkeeping, selection of step size, exhibit non-monotonic convergence, and so on. It is thus desirable to find a lower bound like EM's which facilitates optimization of conditional or discriminant criteria and generates simple M-steps. Such a lower bound was proposed in [4] as the CEM (Conditional Expectation Maximization) algorithm and was used to perform regression with Gaussian mixtures (here we emphasize classification instead). However, CEM's full generality extends beyond that case to a large class of probability models. It can also be further refined to make it tighter.

In the following, we describe the peculiar structure of conditional and discriminative problems and show how EM cannot fully lower bound them. We then show how lower bounding arises naturally from the log-concavity of certain probability models. The CEM lower bound in question effectively generates weighted and displaced complete data from incomplete data to produce M-steps that are structurally identical to the EM case. We then present the estimation of the CEM bound's parameters. This is done by approximating the log-likelihood function with a sparse envelope on its epigraph. Lower bounding is depicted for mixtures of Gaussians and multinomials. For latent variable graphical models, we discuss HMMs and explain the efficient computation of their CEM lower bound. This document is a brief theoretical discussion; practical and implementation details can be verified online. The reader is encouraged to visit the accompanying web page for experiments, results and visualization material.
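
To make Figure 1's contrast concrete, the following toy sketch (not from the paper; the data, the grid and all parameter values are illustrative) scores a one-Gaussian-per-class model under the likelihood of class and data and under the likelihood of class given data, on a small 1-D set containing an outlier. A coarse grid search over the two class means shows that the two criteria prefer different parameter settings for the same model family.

```python
import numpy as np

# Toy 1-D data: a single unit-variance Gaussian per class (equal priors) is a
# mismatched model for class 0 because of the stray point at x = 8.
x0 = np.array([0.0, 0.5, 8.0])        # class 0
x1 = np.array([1.5, 2.0, 2.5])        # class 1

def log_gauss(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def joint_ll(mu0, mu1):
    # likelihood of class and data: sum_i log p(c_i, x_i | Theta), with p(c) = 1/2
    return log_gauss(x0, mu0).sum() + log_gauss(x1, mu1).sum() + 6 * np.log(0.5)

def cond_ll(mu0, mu1):
    # likelihood of class given data: sum_i log p(c_i | x_i, Theta)
    def log_post(x, mu_own, mu_other):
        return log_gauss(x, mu_own) - np.logaddexp(log_gauss(x, mu_own),
                                                   log_gauss(x, mu_other))
    return log_post(x0, mu0, mu1).sum() + log_post(x1, mu1, mu0).sum()

# Coarse grid search over the two class means: the two criteria pick different
# parameters for the very same model family.
grid = np.linspace(-1.0, 9.0, 41)
settings = [(a, b) for a in grid for b in grid]
best_joint = max(settings, key=lambda s: joint_ll(*s))
best_cond = max(settings, key=lambda s: cond_ll(*s))
print("joint-likelihood choice      :", best_joint, " cond. ll:", round(cond_ll(*best_joint), 2))
print("conditional-likelihood choice:", best_cond, " cond. ll:", round(cond_ll(*best_cond), 2))
```

The joint criterion spends its modeling resources explaining the stray observation, while the conditional criterion only cares about how well the class labels are predicted, which is the kind of mismatch Figure 1 illustrates.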

2 CEM: Discriminative Lower Bounds for Simple M-Steps

For latent variable models, a distribution is described as the sum of simple models: $p(c,x|\Theta) = \sum_m p(m,c,x|\Theta)$. Here $m$ is the missing data, $x$ is the input and $c$ is the class label. The standard ML objective function is $l = \sum_i \log p(c_i,x_i|\Theta)$. EM simplifies each $i$'th component of the log-likelihood with a variational bound using Jensen's inequality (with equality achieved at $\Theta = \tilde\Theta$):

$$\log \sum_m p(m,x|\Theta) \;\ge\; \sum_m \frac{p(m,x|\tilde\Theta)}{\sum_n p(n,x|\tilde\Theta)} \log\frac{p(m,x|\Theta)}{p(m,x|\tilde\Theta)} \;+\; \log \sum_m p(m,x|\tilde\Theta) \quad (1)$$

An alternative criterion is conditional maximum likelihood (CML), where the log-likelihood of the class given the observations is optimized: $l_c = \sum_i \log p(c_i|x_i,\Theta)$. Equation 2 shows a component of the conditional log-likelihood in further detail. Due to the presence of the negated term $-\log(\sum_m \sum_c \cdot)$, the so-called "negative log-sum", EM acts as an upper bound on that term. Therefore, EM can only lower bound half the terms in the conditional likelihood, leaving the rest intractable and preventing a direct M-step.

$$\log p(c|x,\Theta) = \log\frac{p(c,x|\Theta)}{p(x|\Theta)} = \log \sum_m p(m,c,x|\Theta) \;-\; \log \sum_m \sum_c p(m,c,x|\Theta) \quad (2)$$

In the case of discriminative learning [3] [7], likelihood ratios between two classes are compared via a discriminant function $L(x|\Theta)$ (Equation 3). The sign of the function denotes the class, $\hat c = \mathrm{sign}(L(x))$. For each class, one could have a latent variable probability model ($\Theta_1$ and $\Theta_2$). However, if a lower bound is needed on such an expression, EM will succeed on the $\log(\cdot)$ term, but applying EM on the remaining $-\log(\cdot)$ will produce an undesirable upper bound. Therefore, to lower bound such negative log-sums we propose the generalized CEM algorithm.

$$L(x|\Theta) = \log\frac{p(x|\Theta_1)}{p(x|\Theta_2)} = \log \sum_m p(m,x|\Theta_1) \;-\; \log \sum_m p(m,x|\Theta_2) \quad (3)$$

Definition 1. The generalized CEM algorithm is the complementary lower bound on the negative log-sum that EM would otherwise upper bound. Its form is:

$$-\log \sum_m p(m,x|\Theta) \;\ge\; \sum_m w_m \log p(m,y_m|\Theta) + k$$

where $w_m$ is a positive scalar weight, $y_m$ is a displaced data point in the space of $x$ and $k$ is a scalar constant. Equality is achieved at some $\tilde\Theta$. Note its structural similarity to the EM lower bound, hence their similar simple M-steps.
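
As a quick numerical check of Equation 1 (a toy two-component Gaussian mixture with arbitrary means; none of the values come from the paper), the sketch below evaluates the log-sum and its EM/Jensen bound: the two coincide at the contact point $\tilde\Theta$ and the bound stays below elsewhere. Negating both sides turns this same bound into the unwanted upper bound on the negative log-sum terms of Equations 2 and 3, which is exactly what Definition 1 is designed to complement.

```python
import numpy as np

# Numerical check of the Jensen/EM bound of Equation 1 on the log-sum of a toy
# two-component, unit-variance 1-D Gaussian mixture (all values illustrative).
x = 1.3
priors = np.array([0.5, 0.5])
means_tilde = np.array([-1.0, 2.0])                  # current setting Theta~

def log_joint(means):
    # log p(m, x | Theta) for each component m
    return np.log(priors) - 0.5 * (x - means) ** 2 - 0.5 * np.log(2 * np.pi)

def log_sum(means):
    return np.logaddexp.reduce(log_joint(means))

h = np.exp(log_joint(means_tilde) - log_sum(means_tilde))   # responsibilities at Theta~

def em_bound(means):
    # Right-hand side of Equation 1: equals log_sum at Theta~, lower-bounds it elsewhere.
    return (h * (log_joint(means) - log_joint(means_tilde))).sum() + log_sum(means_tilde)

for means in [means_tilde, np.array([0.0, 1.0]), np.array([-2.0, 3.0])]:
    print(means, "log-sum:", round(log_sum(means), 4), " EM bound:", round(em_bound(means), 4))
```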

3 Exploiting Log-Concavity in the Exponential Family

It seems unusual that the log-sum can be both upper bounded and lower bounded by a positive weighted sum of logs. In fact, the lower bound (EM) always exists, while the upper bound (CEM) is only feasible for certain models, basically log-concave ones such as the exponential family. For the EM case, the probabilities in the log-summation, $\log(\sum_m p(m,\cdot))$, could be any arbitrary positive function. Jensen's inequality merely exploits the presence of the concave logarithm of an expectation. Furthermore, the lower bound's parameters only compute "responsibilities" or expectations of $p(m,\cdot)$ at $\tilde\Theta$. CEM, however, cannot use Jensen's inequality, and hence requires log-concave probabilities for $p(m,\cdot)$. Furthermore, the choice of its parameters depends on the form of $p(m,\cdot)$, not just responsibility ratios. We thus assume exponential family members for $p(m,\cdot)$ since they are log-concave. The family is defined as $p(x|\theta) = a(x)\exp(\theta^T x - K(\theta))$, where $a(x)$ is positive and $K(\theta)$ is convex. The exponential family subsumes a variety of models including Gaussians, multinomials, Poisson, and their conjugates. Assume that we are lower bounding (with CEM) the negative log of a latent variable model by a sum of log-exponential-family models (Equation 4). One typical choice for the latent variable model is an exponential family mixture, i.e. $p(x|\Theta) = \sum_m p(m)\,a(x)\exp(\theta_m^T x - K(\theta_m))$.

$$-\log p(x|\Theta) \;\ge\; \sum_m \left[ w_m \theta_m^T y_m - w_m K(\theta_m) \right] + k \quad (4)$$

We use the following intuitive requirements to solve for the bound's parameters $(k, y_m, w_m)$. Contact: a variational bound must equal the negative log-sum at the current operating point ($\tilde\Theta$). Tangentiality: the gradients of the bound equal those of the negative log-sum at the current point. Tightness: the bound is as tight as possible without stepping over the negative log-sum.

The first two requirements need the value and the gradient of the negative log-sum at the current model settings ($\tilde\Theta$, i.e. the set of $\tilde\theta_m$). Since EM generates a variational bound, its value and gradient at $\tilde\Theta$ are identical to those computed from the negative log-sum. Usually, using EM is computationally more efficient. Thus we compute the following quantities with EM: the value $V = \log p(x|\tilde\Theta)$, the responsibilities $h_m = p(m)p(x|\tilde\theta_m)$ (or $\hat h_m = h_m/\sum_n h_n$) and the gradients $G_m = \partial(-\log p(x|\Theta))/\partial\theta_m$ at $\tilde\Theta$.

$$k = -\log p(x|\tilde\Theta) - \sum_m w_m\left(\tilde\theta_m^T y_m - K(\tilde\theta_m)\right) \quad (5)$$

$$y_m = \frac{1}{w_m} G_m + \frac{\partial K(\theta_m)}{\partial\theta_m}\Big|_{\tilde\theta_m}, \qquad G_m = -\hat h_m\left(x - \frac{\partial K(\theta_m)}{\partial\theta_m}\Big|_{\tilde\theta_m}\right) \quad (6)$$

The above equations recover the parameters for the CEM bound and show the case of $y_m$ for a mixture of the exponential family. Note that if we set $w_m = -\hat h_m$, then $y_m \to x$ and CEM reduces to the exact same upper bound EM generates. CEM's added flexibility arises from allowing the data point to move to another location ($y_m$) as well as to undergo weighting. EM generates weighted complete data, while CEM generates displaced and weighted complete data. Now, we wish to find the smallest $w_m$ permissible for a true lower bound, i.e. the tightest approximation while staying below the negative log-sum. Of course, one need not pick the smallest possible $w_m$; any $w_m$ greater than the minimal one will also generate a true bound. In addition, we post-process the $w_m$, which must be positive and guarantee reasonable M-step calculations. Substituting $k$ and $y_m$ into the bound gives an expression for $w_m$:

$$\sum_m w_m\left[ K(\theta_m) - K(\tilde\theta_m) - (\theta_m - \tilde\theta_m)^T \frac{\partial K(\theta_m)}{\partial\theta_m}\Big|_{\tilde\theta_m} \right] \;\ge\; \log\frac{p(x|\Theta)}{p(x|\tilde\Theta)} + \sum_m (\theta_m - \tilde\theta_m)^T G_m \quad (7)$$

By convexity of $K(\theta_m)$, we know that the terms multiplying each $w_m$ remain positive. Therefore, the $w_m$ are always constrained from below, regardless of the choice of $\Theta$ (i.e. the inequality never flips over).

Note that the mixing components (class frequencies) are fixed while updating the exponential family parameters. The mixing components themselves also form a log-concave structure suitable for CEM if we alternatively fix the exponential family models. However, [4] already implemented a (slightly different) bound and update rule for mixing proportions that is valid for general mixtures (i.e. not just Gaussians).
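
The sketch below (toy values; it assumes unit-variance 1-D Gaussians so that $K(\theta) = \theta^2/2$ and $\partial K/\partial\theta = \theta$) computes $\hat h_m$, $G_m$, $y_m$ and $k$ as in Equations 5 and 6 for arbitrary positive weights $w_m$, and verifies the contact and tangentiality requirements numerically. The tightness requirement is handled by Equation 7 and the envelope construction described next; with too-small $w_m$ the expression is tangent but not a global lower bound.

```python
import numpy as np

# Contact/tangentiality check for the CEM bound parameters of Equations 5-6 on a
# two-component mixture of unit-variance 1-D Gaussians in exponential-family form:
# p(x|theta) = a(x) exp(theta*x - K(theta)), a(x) = exp(-x^2/2)/sqrt(2*pi),
# K(theta) = theta^2/2, so dK/dtheta = theta and the natural parameter is the mean.
# The weights w below are arbitrary positive values; a true lower bound would
# additionally need them to satisfy the tightness constraint (Equation 7).
x = 0.7
priors = np.array([0.4, 0.6])
theta_t = np.array([-1.0, 1.5])                      # current setting theta~

def neg_log_mix(theta):
    comp = np.log(priors) - 0.5*x**2 - 0.5*np.log(2*np.pi) + theta*x - 0.5*theta**2
    return -np.logaddexp.reduce(comp)

h = priors * np.exp(-0.5*(x - theta_t)**2)           # h_m proportional to p(m) p(x|theta~_m)
h_hat = h / h.sum()
G = -h_hat * (x - theta_t)                           # gradient of -log p(x|Theta) at theta~
w = np.array([2.0, 3.0])                             # arbitrary positive weights
y = G / w + theta_t                                  # Equation 6  (dK/dtheta at theta~ is theta~)
k = neg_log_mix(theta_t) - np.sum(w * (theta_t*y - 0.5*theta_t**2))   # Equation 5

def cem_bound(theta):                                # right-hand side of Equation 4
    return np.sum(w * (theta*y - 0.5*theta**2)) + k

eps = 1e-5
num_grad = np.array([(neg_log_mix(theta_t + eps*np.eye(2)[i])
                      - neg_log_mix(theta_t - eps*np.eye(2)[i])) / (2*eps)
                     for i in range(2)])
print("contact:      ", round(neg_log_mix(theta_t), 6), "vs", round(cem_bound(theta_t), 6))
print("tangentiality:", np.round(num_grad, 5), "vs", np.round(w * (y - theta_t), 5))
```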

To find the minimum $w_m$ (tightest bound), we could exhaustively vary the value of $\Theta$ over the whole space and verify all the conditions on $w_m$. This is intractable, so we propose an efficient approximation.

4 The EM Epigraph and Envelope Approximation

Recall that EM generated an undesirable upper bound on the negative log-sum (i.e. consider negating Equation 1). Yet these upper bounds play a critical role in computing the CEM lower bound. Recall from concave duality that a function can be defined as the minimum of all its convex upper bounds. EM generates upper bounds that lie in the negative log-sum's epigraph, and their minimum forms its envelope. EM achieves equality (Equation 8) if we minimize over all bounds in the continuous space of $\{h_j\}$ variations.

$$-\log \sum_j p(j)p(x|\theta_j) \;=\; \min_{\{h_j\}} \left[ -\sum_j \frac{h_j}{\sum_n h_n}\left( \log p(j)p(x|\theta_j) - \log\frac{h_j}{\sum_n h_n} \right) \right] \quad (8)$$

Of course, we need not exhaustively consider every upper bound; rather, one may choose a select few sample bounds whose envelope captures most of the interesting behavior of the function. For the log-sum, we get a very nice approximation when we use the winner-takes-all case where one model dominates all others and one responsibility ratio $h_j$ is set to 1.0 while all others are 0.0. Considering $j = 1 \ldots M$ such models (envelope components) and minimizing over them approximates the log-mixture. Here, the envelope components are really $M$ negative log-exponential-family models. Thus we approximate $-\log \sum_j p(j)p(x|\theta_j) \approx \min\{-\log p(j)p(x|\theta_j)\ \forall j\}$. Figure 2 shows a very accurate sparse envelope approximation for a two-component Gaussian mixture (varying 1D means) and a mixture of binomials.

Figure 2: Computing envelopes and CEM bounds for Gaussian, binomial and HMM models: (a) negative log-sum; (b) sparse envelope; (c) CEM lower bound.

We now have a parsimonious approximation of the negative log-sum which we can substitute back into the CEM bound.
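
A minimal numerical version of the winner-takes-all envelope (toy 1-D values, not the paper's experiment): sweeping one mean of a two-component unit-variance Gaussian mixture, the sparse envelope stays above the negative log-sum and never exceeds it by more than $\log M$.

```python
import numpy as np

# Winner-takes-all ("sparse envelope") approximation of the negative log-sum for
# a two-component, unit-variance 1-D Gaussian mixture, sweeping the first mean
# while the second is held fixed (a toy version of Figure 2).  The envelope
# min_j { -log p(j)p(x|theta_j) } sits above -log sum_j p(j)p(x|theta_j), but
# never by more than log M (here log 2 ~ 0.69).
x = 0.0
priors = np.array([0.5, 0.5])
mu2 = 2.0                                     # fixed second mean
mu1_sweep = np.linspace(-6.0, 6.0, 121)       # varying first mean

def neg_log_component(mu, prior):
    return -(np.log(prior) - 0.5*(x - mu)**2 - 0.5*np.log(2*np.pi))

neg_log_sum = np.array([
    -np.logaddexp(-neg_log_component(m, priors[0]), -neg_log_component(mu2, priors[1]))
    for m in mu1_sweep])
envelope = np.minimum(neg_log_component(mu1_sweep, priors[0]),
                      neg_log_component(mu2, priors[1]))

gap = envelope - neg_log_sum
print("envelope minus negative log-sum: min %.3f, max %.3f (log M = %.3f)"
      % (gap.min(), gap.max(), np.log(2)))
```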

If CEM's concave bound is less than each envelope component, it is underneath the negative log-sum. Furthermore, the expression for $w_m$ which seemed intractable simplifies, since $\log p(x|\Theta)$ decouples into separate log-exponential-family terms. Each of these envelope components ($j = 1 \ldots M$) varies with a single $\theta_j$ at a time. Plugging an envelope member in place of the log-sum greatly simplifies Equation 7 for $w_m$:

$$\sum_m w_m\left[ K(\theta_m) - K(\tilde\theta_m) - (\theta_m - \tilde\theta_m)^T \frac{\partial K(\theta_m)}{\partial\theta_m}\Big|_{\tilde\theta_m} \right] \;\ge\; \log\frac{p(j)p(\tilde x|\theta_j)}{p(x|\tilde\Theta)} + \sum_m (\theta_m - \tilde\theta_m)^T G_m \quad (9)$$

Here, the latent variable model is now replaced by $p(j)p(\tilde x|\theta_j)$, a mere exponential family member. Now, due to the decoupling of the $m$ models, it is possible to naturally split the above joint constraint over all $w_m$ into $M$ stricter individual inequalities as in Equation 10. We also heuristically split the constant value $\log p(x|\tilde\Theta)$ into $M$ different $c_m$ variables such that $\sum_m c_m = V$. The result is a decoupled set of $M$ constraints $j = 1 \ldots M$ (from the $M$ sparse envelope components) for each of the $M$ CEM parameters $w_m$. We use $\delta(j{=}m)$ to indicate that the envelope/epigraph component only varies with one $\theta_j$ (unlike the latent variable probability).

$$w_m \;\ge\; \frac{\delta(j{=}m)\,\log p(j)p(\tilde x|\theta_j) - c_m + (\theta_m - \tilde\theta_m)^T G_m}{K(\theta_m) - K(\tilde\theta_m) - (\theta_m - \tilde\theta_m)^T \frac{\partial K(\theta_m)}{\partial\theta_m}\big|_{\tilde\theta_m}} \quad (10)$$

We solve for the smallest $w_m$ (call it $w_m^*$) possible under Equation 10 for each of the $j = 1 \ldots M$ envelope components. These are then consolidated by picking the largest $w_m^*$ (call this the final $w_m$) that was achieved. Assume that we knew a priori the value of $w_m^*$. This $w_m^*$ defines a component of the CEM lower bound which supports a component of the envelope. The closest point in $\theta_m$ space between the bound and the envelope component can then be computed by taking gradients, and we note that the following constraint holds at that $\theta_m$:

$$\frac{\partial K(\theta_m)}{\partial\theta_m} \;=\; \frac{1}{w_m + \delta(j{=}m)}\left( \delta(j{=}m)\,\tilde x + w_m \frac{\partial K(\theta_m)}{\partial\theta_m}\Big|_{\tilde\theta_m} + G_m \right) \quad (11)$$

The above constraint maps the optimization of each $w_m$ from a search over the whole $\theta_m$ space to a single degree of freedom. Using the convexity of $K(\theta_m)$, we have a one-to-one map between the gradient of $K(\theta_m)$ and $\theta_m$. Equation 11 effectively reduces the optimization over the space of $\theta_m$ to a single degree of freedom which, when varied, determines $w_m$. For the Gaussian case with identity covariance, computing the maximum $w_m$ from Equation 10 can then be done analytically (it is quadratic due to the simplicity of $K(\theta) = \frac{1}{2}\theta^T\theta$). For more difficult models, we perform a simple bisection or secant search in one dimension for each $w_m$, which typically converges to the true solution in an average of 5 iterations (using some efficiency heuristics). Another possibility is storing lookup tables for direct computation of the $w_m$ as in [4]. We solve Equation 11 in this manner a total of $M \times M$ times to fully describe the CEM bound on the negative log-sum.

Once the CEM bound is computed, it is straightforward to combine it with EM bounds of the same form, sum them over multiple components of the log-likelihood, and maximize. By iterating bound and maximization steps, monotonic convergence of the mixture models is verified. Deterministic annealing of CEM can also be used to avoid local minima. We first use annealed EM to compute the value $V$ and gradients $G_m$ of the function and then form an annealed version of the EM sparse envelope. The resulting parameters in CEM $(k, y_m, w_m)$ generate a less local bound.
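
A minimal sketch of the one-dimensional search mentioned above. The feasibility predicate, which for a real model should encode "Equation 10 holds along the Equation 11 curve", is abstracted into a user-supplied callable; the lambda shown at the bottom is only a placeholder, not the actual constraint.

```python
# A generic sketch of the 1-D bisection search described above (assumed
# interface; the feasibility test standing in for the Equation 10 / Equation 11
# check is left as a user-supplied predicate).
def smallest_feasible_w(feasible, w_lo=0.0, w_hi=1.0, tol=1e-6, max_doubling=60):
    """Return (approximately) the smallest w >= w_lo with feasible(w) True,
    assuming feasibility is monotone in w (larger w never breaks it)."""
    # Grow the bracket until the upper end is feasible.
    for _ in range(max_doubling):
        if feasible(w_hi):
            break
        w_lo, w_hi = w_hi, 2.0 * w_hi
    else:
        raise ValueError("no feasible w found")
    # Bisect down to tolerance.
    while w_hi - w_lo > tol:
        mid = 0.5 * (w_lo + w_hi)
        if feasible(mid):
            w_hi = mid
        else:
            w_lo = mid
    return w_hi

# Example with a toy placeholder predicate (not the real Equation 10 check):
print(smallest_feasible_w(lambda w: w * w >= 3.0))   # approximately sqrt(3)
```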

5 HMMs - Bounding Structured Models

We now consider CEM for HMMs, a structured model with latent variables and a probability factorization given by an underlying Markov chain. The log-sum structure in an HMM for an observation sequence $X$ is $\log p(X|\Theta) = \log \sum_{(s_1,\ldots,s_T)} p(s_1,\ldots,s_T,X|\Theta)$, for $N$ states and chains of length $T$. The sum in the log contains $N^T$ mixture components. However, the probability density factorizes, allowing efficient computation of EM bounds. Thus, after Baum-Welch computations, the HMM is bounded via EM and can be updated with $N$ M-step equations for emission distributions and $N$ M-step equations for multinomials (i.e. the transition matrix). Therefore, EM generates exponential family lower bounds of the form $\log p(X|\Theta) \ge \sum_{n,t} w_{n,t}\log p(x_t|\theta_n) + v_{n,t}\log p(s_t|\alpha_n)$, where $\theta_n$ are the emission parameters and $\alpha_n$ the transition multinomials. CEM also generates this form except that it bounds $-\log p(X|\Theta)$. As usual, to compute $k$ and the $y_m$ we use the value $V = \log p(X|\tilde\Theta)$ and its gradients $G_m$ from the EM bound. Furthermore, we approximate the HMM's envelope using EM to obtain an envelope of upper bounds. CEM's $w_m$ parameters are then checked against these sparse components. Of course, we certainly do not want to check all $N^T$ terms in the mixture. Only $N \times T$ terms are needed due to factorization. Each of the $N$ models (multinomials and emission densities) is checked against these $N \times T$ terms via Equation 10. The terms are again in a winner-takes-all form over the $T$ elements of the chain, which are imputed as values of $\tilde x$. Figure 2 shows the envelope approximation of the HMM's likelihood with 1D means varying in a 2-state HMM with fixed transition probabilities. Note the faithful representation of the function via the EM envelope. Thus, the HMM can be accurately represented as a minimum of log-exponential-family models, and CEM lower bounds its negative logarithm.
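
To see the factorization claim in numbers, this toy sketch (a hypothetical 2-state HMM with unit-variance Gaussian emissions; every value is illustrative) computes $\log p(X|\Theta)$ both as the explicit log-sum over all $N^T$ state paths and via the forward recursion, which touches only $N \times T$ terms per step; the two agree to numerical precision.

```python
import numpy as np
from itertools import product

# The HMM log-sum over all N^T state sequences versus the forward recursion that
# exploits the chain factorization (toy 2-state, Gaussian-emission HMM).
pi = np.array([0.6, 0.4])                     # initial state distribution
A = np.array([[0.7, 0.3], [0.2, 0.8]])        # transition matrix
mu = np.array([-1.0, 1.5])                    # unit-variance Gaussian emission means
X = np.array([0.3, -0.7, 1.2, 0.9])           # observation sequence, T = 4

def log_emit(x):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)   # log p(x_t | s_t = n)

# Brute force: log-sum over every state path (N^T terms).
T, N = len(X), len(pi)
paths = []
for s in product(range(N), repeat=T):
    lp = np.log(pi[s[0]]) + log_emit(X[0])[s[0]]
    for t in range(1, T):
        lp += np.log(A[s[t-1], s[t]]) + log_emit(X[t])[s[t]]
    paths.append(lp)
brute = np.logaddexp.reduce(paths)

# Forward recursion: the same quantity with N x N work per time step.
log_alpha = np.log(pi) + log_emit(X[0])
for t in range(1, T):
    log_alpha = np.logaddexp.reduce(log_alpha[:, None] + np.log(A), axis=0) + log_emit(X[t])
forward = np.logaddexp.reduce(log_alpha)
print(round(brute, 8), round(forward, 8))     # identical log-likelihoods
```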

6 Discussion

We have derived the CEM algorithm to lower bound the negative log-summation. Combining it with EM produces tight lower bounds on conditional and discriminant criteria with latent variables. By iteratively maximizing the bound, monotonic convergence and deterministic annealing are feasible. This generalized algorithm holds for log-concave mixtures such as the exponential family and certain graphical models. Effectively, the CEM algorithm efficiently finds a lower bound on discriminative criteria and maps them to the usual tractable M-step structure found in EM techniques. The re-use of maximization machinery permits easy migration of current probabilistic models from ML to conditional and discriminative criteria.

7 Appendix: Conditional Bayesian Estimation

This appendix motivates maximum conditional likelihood as an approximation of conditional Bayesian integration. It also shows that the conditional integral differs from the conditioned joint integral. We first compute the joint Bayesian integral from $(\mathcal X, \mathcal Y)$ data and then condition it to obtain $p(y|x)_j$:

$$p(y|x)_j = \frac{\int p(x,y|\Theta)\,p(\Theta|\mathcal X,\mathcal Y)\,d\Theta}{\int p(x|\Theta)\,p(\Theta|\mathcal X,\mathcal Y)\,d\Theta} \quad (12)$$

The corresponding dependency graphs (Figure 3(b) and (c)) show how joint estimation contrasts with conditional estimation, which assumes that $x$ is always given as a parent of $y$. The conditional Bayesian integral takes advantage of the graph's factorization and estimates a different $p(y|x)_c$:

$$p(y|x)_c = \int p(y|x,\Theta_c)\,p(\Theta_c|\mathcal X,\mathcal Y)\,d\Theta_c = \int p(y|x,\Theta_c)\,\frac{p(\mathcal Y|\Theta_c,\mathcal X)\,p(\Theta_c)}{p(\mathcal Y|\mathcal X)}\,d\Theta_c \quad (13)$$

Figure 3: Inconsistency of conditioned joint and conditional Bayesian estimates. (a) Data; (b) conditioned joint; (c) direct conditional; (d) inconsistency.

Conditional maximum likelihood approximates this integral by picking the single $p(y|x,\Theta_c)$ that maximizes $p(\mathcal Y|\Theta_c,\mathcal X)$. Fundamentally, though, the two Bayesian integrals $p(y|x)_j$ and $p(y|x)_c$ are different. We exhaustively perform both Bayesian integrals for a Gaussian mixture model on a small set of data points. Figure 3 shows the data and the resulting conditional densities. There is a clear inconsistency between joint and conditional estimation techniques; this is not only a problem with ML but also arises at the Bayesian integration level (Figure 3(d)).

References

[1] Bishop, C. (1996). Neural networks for pattern recognition. Oxford Press.
[2] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39.
[3] Jaakkola, T. and Haussler, D. (1998). Exploiting generative models in discriminative classifiers. NIPS 11.
[4] Jebara, T. and Pentland, A. (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. NIPS 11.
[5] Jordan, M.I. and Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6.
[6] Rathinavelu, C. and Deng, L. (1996). The trended HMM with discriminative training for phonetic classification. ICSLP 96.
[7] Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag.
