The Generalized CEM Algorithm

Tony Jebara
MIT Media Lab, 20 Ames St., Cambridge, MA 02139

Alex Pentland
MIT Media Lab, 20 Ames St., Cambridge, MA 02139

Abstract

We propose a general approach for estimating the parameters of latent variable probability models to maximize conditional likelihood and discriminant criteria. Unlike joint likelihood, these objectives are better suited for classification and regression. The approach utilizes and extends the previously introduced CEM framework (Conditional Expectation Maximization), which reformulates EM to handle the conditional likelihood case. We generalize the CEM algorithm to estimate any mixture of exponential family densities. This includes structured graphical models over exponential families, such as HMMs. The algorithm efficiently takes advantage of the factorization of the underlying graph. In addition, the new CEM bound is tighter and more rigorous than the original one. The final result is a CEM algorithm that mirrors the EM algorithm: both estimate a variational lower bound on their respective incomplete objective functions, and both generate the same standard M-steps over complete likelihood for direct maximization. The equivalence of M-steps facilitates migration of current ML approaches to conditional criteria for improved classification and regression results.

1 Introduction

Recently, the machine learning community and its application domains have seen the proliferation of conditional or discriminative criteria for classification and regression. For instance, support vector machines [7] have generated competitive classifier systems and are being combined with probabilistic models [3]. In the speech community, discriminatively trained HMMs minimize classification error for superior phoneme labeling [6]. Mixtures of experts are used as probabilistic regressors after maximizing their conditional likelihood [5].
Even traditional neural networks employ a least-squares objective function on the output, emphasizing prediction performance [1]. All these criteria allocate modeling resources with the given task in mind, yielding improved performance. In contrast, under ML and MAP (the canonical criteria of probabilistic models), each density is trained separately to describe observations rather than to optimally solve for classification or regression. Therefore performance is compromised.

The ML/MAP criteria suffer most when the model is inaccurate (this model mismatch occurs often in real-world situations). For visualization, observe Figure 1, where data must be classified using Gaussians (2 per class). The optimal ML solution unfortunately does no better than chance on this data. In comparison, the maximum conditional likelihood solution correctly labels most examples.

[Figure 1: Likelihood of class and data vs. likelihood of class given data. (a) Max. likelihood classifier. (b) Max. conditional likelihood classifier.]

Nevertheless, ML and MAP remain attractive criteria since they can be easily and convergently optimized for a large class of latent variable probability densities and graphical models. These otherwise intractable models are lower bounded and decoupled via the EM [2] algorithm, which generates complete data and simple M-steps. Since the EM algorithm is limited to ML and MAP, proponents of other criteria must resort to gradient algorithms [6] or second order methods [5] to estimate latent variable probability models. These generic optimization algorithms need extra bookkeeping and selection of step size, exhibit non-monotonic convergence, and so on. It is thus desirable to find a lower bound like EM which facilitates optimization of conditional or discriminant criteria and generates simple M-steps.

Such a lower bound was proposed in [4] as the CEM (Conditional Expectation Maximization) algorithm and used to perform regression¹ with Gaussian mixtures. However, CEM's full generality extends beyond that case to a large class of probability models. It can also be further refined to make it tighter. In the following, we describe the peculiar structure of conditional and discriminative problems and show how EM cannot fully lower bound them. We then show how lower bounding arises naturally from the log-concavity of certain probability models.
The CEM lower bound in question effectively generates weighted and displaced complete data from incomplete data to produce M-steps that are structurally identical to the EM case. We then present the estimation of the CEM bound's parameters. This is done by approximating the log-likelihood function with a sparse envelope on its epigraph. Lower bounding is depicted for mixtures of Gaussians and multinomials. For latent variable graphical models, we discuss HMMs and explain the efficient computation of their CEM lower bound. This document is a brief theoretical discussion, but practical and implementation details can be verified online. The reader is encouraged to visit the accompanying web page for experiments, results and visualization material.

¹ Here we emphasize classification instead.

2 CEM: Discriminative Lower Bounds for Simple M-Steps

For latent variable models, a distribution is described as the sum of simple models: $p(c, x|\Theta) = \sum_m p(m, c, x|\Theta)$. Here $m$ is the missing data, $x$ is the input and $c$ is the class label. The standard ML objective function is $l = \sum_i \log p(c_i, x_i|\Theta)$. EM simplifies each $i$'th component of the log-likelihood with a variational bound using Jensen's inequality (with equality achieved at $\Theta = \tilde\Theta$):

$$\log \sum_m p(m, x|\Theta) \ge \sum_m \frac{p(m, x|\tilde\Theta)}{\sum_n p(n, x|\tilde\Theta)} \log\left(\frac{p(m, x|\Theta)}{p(m, x|\tilde\Theta)}\right) + \log \sum_m p(m, x|\tilde\Theta) \tag{1}$$

An alternative criterion is conditional maximum likelihood (CML), where the log-likelihood of the class given the observations is optimized: $l^c = \sum_i \log p(c_i|x_i, \Theta)$. Equation 2 shows a component of the conditional log-likelihood in further detail. Due to the presence of the negative sign in $-\log(\sum_m \sum_c)$, the so-called "negative log-sum", EM acts as an upper bound on that term. Therefore, EM can only lower bound half the terms in the conditional likelihood, leaving the rest intractable and preventing a direct M-step.

$$\log p(c|x, \Theta) = \log \frac{p(c, x|\Theta)}{p(x|\Theta)} = \log \sum_m p(m, c, x|\Theta) - \log \sum_m \sum_c p(m, c, x|\Theta) \tag{2}$$

In the case of discriminative learning [3] [7], likelihood ratios between two classes are compared via a discriminant function $L(x|\Theta)$ (Equation 3). The sign of the function denotes the class, $\hat c = \mathrm{sign}(L(x))$. For each class, one could have a latent variable probability model ($\Theta_1$ and $\Theta_2$). However, if a lower bound is needed on such an expression, EM will succeed on the $\log(\cdot)$ term, but applying EM to the remaining $-\log(\cdot)$ will produce an undesirable upper bound. Therefore, to lower bound such negative log-sums we propose the generalized CEM algorithm.

$$L(x|\Theta) = \log \frac{p(x|\Theta_1)}{p(x|\Theta_2)} = \log \sum_m p(m, x|\Theta_1) - \log \sum_m p(m, x|\Theta_2) \tag{3}$$

Definition 1. The generalized CEM algorithm is the complementary lower bound on the negative log-sum that EM would otherwise upper bound. Its form is:

$$-\log \sum_m p(m, x|\Theta) \ge \sum_m w_m \log p(m, y_m|\Theta) + k$$

where $w_m$ is a positive scalar weight, $y_m$ is a displaced data point in the space of $x$ and $k$ is a scalar constant. Equality is achieved at some $\tilde\Theta$. Note its structural similarity to the EM lower bound, hence their similar simple M-steps.

3 Exploiting Log-Concavity in the Exponential Family

It seems unusual that the log-sum can be both upper bounded and lower bounded by a positive weighted sum of logs. In fact, the lower bound (EM) always exists while the upper bound (CEM) is only feasible for certain models, basically log-concave ones such as the exponential family. For the EM case, the probabilities in the log-summation, $\log(\sum_m p(m, \Theta))$, could be any arbitrary positive function. Jensen's inequality merely exploits the presence of the concave logarithm of an expectation. Furthermore, the lower bound's parameters only compute "responsibilities" or expectations of $p(m, \Theta)$ at $\tilde\Theta$. CEM, however, cannot use Jensen, and hence requires log-concave probabilities for $p(m, \Theta)$. Furthermore, the choice of its parameters depends on the form of $p(m, \Theta)$, not just responsibility ratios. We thus assume exponential family members for $p(m, \Theta)$ since they are log-concave. The family is defined as $p(x|\Theta) = a(x) \exp(\Theta^T x - K(\Theta))$, where $a(x)$ is positive and $K(\Theta)$ is convex. The exponential family subsumes a variety of models including Gaussians, multinomials, Poisson, and their conjugates.

Assume that we are lower bounding (with CEM) the negative log of a latent variable model by a sum of log-exponential family models (Equation 4). One typical choice for the latent variable model is an exponential family mixture, i.e. $p(x|\Theta) = \sum_m p(m)\, a(x) \exp(\Theta_m^T x - K(\Theta_m))$.

$$-\log p(x|\Theta) \ge \sum_m \left[ w_m \Theta_m^T y_m - w_m K(\Theta_m) \right] + k \tag{4}$$

We use the following intuitive requirements to solve for the bound's parameters $(k, y_m, w_m)$.

Contact: a variational bound must equal the negative log-sum at the current operating point ($\tilde\Theta$).
Tangentiality: the gradients of the bound equal those of the negative log-sum at the current point.
Tightness: the bound is as tight as possible without stepping over the negative log-sum.

The first two requirements need the value and the gradient of the negative log-sum at the current model settings ($\tilde\Theta = \bigcup_m \tilde\Theta_m$). Since EM generates a variational bound, its value and gradient at $\tilde\Theta$ are identical to those computed from the negative log-sum. Usually, using EM is computationally more efficient. Thus we compute the following quantities with EM: the value $V = \log p(x|\tilde\Theta)$, the responsibilities $h_m = p(m)\, p(x|\tilde\Theta_m)$ (or $\hat h_m = h_m / \sum_n h_n$) and the gradients $G_m$ of $-\log p(x|\Theta)$ with respect to $\Theta_m$ at $\tilde\Theta_m$.

$$k = -\log p(x|\tilde\Theta) - \sum_m w_m \left( \tilde\Theta_m^T y_m - K(\tilde\Theta_m) \right) \tag{5}$$

$$y_m = \frac{G_m}{w_m} + \left.\frac{\partial K(\Theta_m)}{\partial \Theta_m}\right|_{\tilde\Theta_m} = \frac{1}{w_m}\, \hat h_m \left( \left.\frac{\partial K(\Theta_m)}{\partial \Theta_m}\right|_{\tilde\Theta_m} - x \right) + \left.\frac{\partial K(\Theta_m)}{\partial \Theta_m}\right|_{\tilde\Theta_m} \tag{6}$$

The above equations recover the parameters for the CEM bound and show the case of $y_m$ for a mixture of the exponential family. Note that if we set $w_m = -\hat h_m$, then $y_m \to x$ and CEM reduces to the exact same upper bound EM generates. CEM's added flexibility arises from allowing the data point to move to another location ($y_m$) as well as undergo weighting.
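
The contact and tangentiality requirements can be checked numerically. The following is a minimal sketch, assuming a 1-D Gaussian mixture with unit variance, for which $K(\theta) = \theta^2/2$ and $\partial K / \partial\theta = \theta$. The function names, the data point, the mixing proportions and the weights `w` are all illustrative assumptions; in particular `w` is an arbitrary positive choice, not the tight weights the paper goes on to derive, so the resulting curve is only guaranteed to touch the negative log-sum at the operating point, not to stay below it.

```python
import numpy as np

def neg_log_sum(theta, x, prior):
    """-log sum_m p(m) N(x; theta_m, 1) for a 1-D unit-variance mixture."""
    log_a = -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)          # log a(x)
    comps = np.log(prior) + log_a + theta * x - 0.5 * theta**2
    return -np.log(np.exp(comps).sum())

def cem_parameters(theta0, x, prior, w):
    """Solve contact (Eq. 5) and tangentiality (Eq. 6) for k and y_m."""
    h = prior * np.exp(theta0 * x - 0.5 * theta0**2)          # a(x) cancels in the ratio
    h_hat = h / h.sum()                                       # responsibilities
    G = h_hat * (theta0 - x)                                  # gradient of -log p(x|theta)
    y = G / w + theta0                                        # Eq. 6 with dK/dtheta = theta
    k = neg_log_sum(theta0, x, prior) - np.sum(w * (theta0 * y - 0.5 * theta0**2))
    return h_hat, G, y, k

def cem_bound(theta, y, w, k):
    """Right-hand side of Eq. 4."""
    return np.sum(w * (theta * y - 0.5 * theta**2)) + k

# Hypothetical operating point and weights (illustrative values only).
theta0 = np.array([-1.0, 2.0])
prior = np.array([0.4, 0.6])
x = 0.5
w = np.array([1.5, 2.0])

h_hat, G, y, k = cem_parameters(theta0, x, prior, w)

# Contact: the bound equals the negative log-sum at theta0.
contact_gap = cem_bound(theta0, y, w, k) - neg_log_sum(theta0, x, prior)

# Tangentiality: central-difference gradient of the negative log-sum vs. the bound's gradient.
eps = 1e-6
num_grad = np.array([
    (neg_log_sum(theta0 + eps * e, x, prior) - neg_log_sum(theta0 - eps * e, x, prior)) / (2 * eps)
    for e in np.eye(2)
])
bound_grad = w * (y - theta0)

# Setting w_m = -h_hat_m moves every displaced point back onto x (the EM upper bound).
_, _, y_em, _ = cem_parameters(theta0, x, prior, -h_hat)
```

With these quantities in hand, `contact_gap` should vanish up to rounding, `bound_grad` should match both `G` and `num_grad`, and `y_em` should equal `x` in every component.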
EM generates weighted complete data while CEM generates displaced and weighted complete data.

Now, we wish to find the smallest $w_m$ permissible for a true lower bound, i.e. the tightest approximation while staying below the negative log-sum. Of course, one need not pick the smallest possible $w_m$; any $w_m$ greater than the minimal one will also generate a true bound. In addition, we post-process the $w_m$, which must be positive, to guarantee reasonable M-step calculations. Substituting $k$ and $y_m$ into the bound gives an expression for $w_m$:

$$\sum_m w_m \left[ K(\Theta_m) - K(\tilde\Theta_m) - (\Theta_m - \tilde\Theta_m)^T \left.\frac{\partial K(\Theta_m)}{\partial \Theta_m}\right|_{\tilde\Theta_m} \right] \ge \log \frac{p(x|\Theta)}{p(x|\tilde\Theta)} + \sum_m (\Theta_m - \tilde\Theta_m)^T G_m \tag{7}$$

By convexity of $K(\Theta_m)$, we know that the terms multiplying each $w_m$ remain positive. Therefore, the $w_m$ are always constrained from below, regardless of the choice of $\Theta$ (i.e. the inequality never flips over). To find the minimum $w_m$ (tightest bound), we could exhaustively vary the value of $\Theta$ over the whole space and verify all the conditions on $w_m$. This is intractable, so we propose an efficient approximation.

Note that the mixing components (class frequencies) are fixed while updating the exponential family parameters. The mixing components themselves also form a log-concave structure suitable for CEM if we alternatively fix the exponential family models. However, [4] already implemented a (slightly different) bound and update rule for mixing proportions that was valid for general mixtures (i.e. not just Gaussians).

[Figure 2: Computing envelopes and CEM bounds. Rows: Gaussian, binomial, HMM. (a) Negative log-sum. (b) Sparse envelope. (c) CEM lower bound.]

4 The EM Epigraph and Envelope Approximation

Recall that EM generated an undesirable upper bound on the negative log-sum (i.e. consider negating Equation 1). Yet these upper bounds play a critical role in computing the CEM lower bound. Recall from concave duality that a function can be defined as the minimum of all its convex upper bounds. EM generates upper bounds that lie in the negative log-sum's epigraph and their minimum forms its envelope. EM achieves equality (Equation 8) if we minimize over all bounds in the continuous space of $\{h_j\}$ variations.

$$-\log \sum_j p(j)\, p(x|\Theta_j) = \min_{\{h_j\}} \; -\sum_j \frac{h_j}{\sum_n h_n} \left[ \log p(j)\, p(x|\Theta_j) - \log \frac{h_j}{\sum_n h_n} \right] \tag{8}$$

Of course, we need not exhaustively consider every upper bound; rather, one may choose a select few sample bounds whose envelope captures most of the interesting behavior of the function. For the log-sum, we get a very good approximation when we use the winner-takes-all case, where one model dominates all others and one responsibility ratio $h_j$ is set to 1.0 while all others are 0.0. Considering $j = 1..M$ such models (envelope components) and minimizing over them approximates the log-mixture. Here, the envelope's components are really $M$ negative log-exponential-family models. Thus we approximate $-\log \sum_j p(j)\, p(x|\Theta_j) \approx \min_j \{ -\log p(j)\, p(x|\Theta_j) \}$. Figure 2 shows a very accurate sparse envelope approximation for a 2-component Gaussian mixture (varying 1D means) and a mixture of binomials.
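
The quality of the winner-takes-all envelope can be probed directly. The sketch below assumes a 2-component 1-D Gaussian mixture with unit variance and hypothetical mixing proportions, and sweeps both means over a grid (all names and values are illustrative, not taken from the paper). Because $\sum_j a_j \le M \max_j a_j$ for positive $a_j$, the envelope lies above the exact negative log-sum by at most $\log M$, and the gap shrinks as one component dominates.

```python
import numpy as np

def log_components(theta, x, prior):
    """log p(j) N(x; theta_j, 1) for each mixture component j."""
    return np.log(prior) + theta * x - 0.5 * theta**2 - 0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def neg_log_sum(theta, x, prior):
    """Exact -log sum_j p(j) p(x|theta_j), computed stably."""
    lc = log_components(theta, x, prior)
    top = lc.max()
    return -(top + np.log(np.exp(lc - top).sum()))

def sparse_envelope(theta, x, prior):
    """Winner-takes-all envelope: min_j of -log p(j) p(x|theta_j)."""
    return np.min(-log_components(theta, x, prior))

# Hypothetical setting: one data point, two components, means swept over a grid.
x = 0.5
prior = np.array([0.4, 0.6])
grid = np.linspace(-4.0, 4.0, 41)

gaps = np.array([
    sparse_envelope(np.array([t1, t2]), x, prior) - neg_log_sum(np.array([t1, t2]), x, prior)
    for t1 in grid for t2 in grid
])

print("largest gap:", gaps.max(), " log M:", np.log(2.0))
print("smallest gap:", gaps.min())
```

The gap never goes negative and never exceeds $\log 2$ here, which is one way to see why so few envelope components already track the function closely.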

We now have a parsimonious approximation of the negative log-sum which we can substitute back into the CEM bound. If CEM's concave bound is less than each envelope component, it is underneath the negative log-sum. Furthermore, the expression for $w_m$ which seemed intractable simplifies, since $\log p(x|\Theta)$ decouples into separate log-exponential-family terms. Each of these envelope components ($j = 1..M$) varies with a single $\Theta_j$ at a time. Plugging an envelope member in place of the log-sum greatly simplifies Equation 7 for $w_m$:

$$\sum_m w_m \left[ K(\Theta_m) - K(\tilde\Theta_m) - (\Theta_m - \tilde\Theta_m)^T \left.\frac{\partial K(\Theta_m)}{\partial \Theta_m}\right|_{\tilde\Theta_m} \right] \ge \log \frac{p(j)\, p(\tilde x|\Theta_j)}{p(x|\tilde\Theta)} + \sum_m (\Theta_m - \tilde\Theta_m)^T G_m \tag{9}$$

Here, the latent variable model is now replaced by $p(j)\, p(\tilde x|\Theta_j)$, a mere exponential family member. Now, due to the decoupling of the $m$ models, it is possible to naturally split the above joint constraint over all $w_m$ into $M$ stricter individual inequalities as in Equation 10. We also heuristically split the constant value $\log p(x|\tilde\Theta)$ into $M$ different $c_m$ variables such that $\sum_m c_m = V$. The result is a decoupled set of $M$ constraints $j = 1..M$ (from $M$ sparse envelope components) for each of the $M$ CEM parameters $w_m$. We use $\delta(j = m)$ to indicate that the envelope / epigraph component only varies with one $\Theta_j$ (unlike the latent variable probability).

$$w_m \ge \frac{\delta(j = m) \log p(j)\, p(\tilde x|\Theta_j) - c_m + (\Theta_m - \tilde\Theta_m)^T G_m}{K(\Theta_m) - K(\tilde\Theta_m) - (\Theta_m - \tilde\Theta_m)^T \left.\frac{\partial K(\Theta_m)}{\partial \Theta_m}\right|_{\tilde\Theta_m}} \tag{10}$$

We solve for the smallest $w_m$ (call it $w_m^*$) possible under Equation 10 for each of the $j = 1..M$ envelope components. These are then consolidated by picking the largest $w_m^*$ (call this the final $w_m$) that was achieved. Assume that we knew a priori the value of $w_m^*$. This $w_m^*$ defines a component of the CEM lower bound which supports a component of the envelope. The closest point in $\Theta_m$ space between the bound and the envelope component can then be computed by taking gradients, and we note that the following constraint holds at that point:

$$\frac{\partial K(\Theta_m)}{\partial \Theta_m} = \frac{1}{w_m^* + \delta(j = m)} \left( \delta(j = m)\, \tilde x + w_m^* \left.\frac{\partial K(\Theta_m)}{\partial \Theta_m}\right|_{\tilde\Theta_m} + G_m \right) \tag{11}$$

The above constraint maps the optimization of each $w_m$ from a search over the whole $\Theta_m$ space to a single degree of freedom. Using the convexity of $K(\Theta_m)$, we have a 1 to 1 map between the gradient of $K(\Theta_m)$ and $\Theta_m$.
Equation 11 effectively changes the optimization over the space of $\Theta_m$ to have a single degree of freedom which, when varied, determines $w_m$. For the Gaussian case with identity covariance, computing the maximum $w_m$ from Equation 10 can then be done analytically (it is quadratic due to the simplicity of $K(\Theta) = \frac{1}{2}\Theta^T\Theta$). For more difficult models, we perform a simple bisection or secant search in 1 dimension for each $w_m$, which typically converges to the true solution with an average of 5 iterations (using some efficiency heuristics). Another possibility is storing lookup tables for direct computation of the $w_m$ as in [4]. We solve Equation 11 in this manner a total of $M \times M$ times to fully describe the CEM bound on the negative log-sum.

Once the CEM bound is computed, it is straightforward to combine it with EM bounds of the same form, sum them over multiple components of the log-likelihood and maximize. By iterating bound and maximization steps, monotonic convergence of the mixture models is verified. Deterministic annealing of CEM can also be used to avoid local minima. We first use annealed EM to compute the value $V$ and gradients $G_m$ of the function and then form an annealed version of the EM sparse envelope. The resulting parameters in CEM $(k, y_m, w_m)$ generate a less local bound.
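
To make the shared M-step concrete, the sketch below combines EM terms (weights $\hat h_i$ on the observed points $x_i$) with CEM terms (weights $w_i$ on displaced points $y_i$) for the mean of one unit-variance 1-D Gaussian component. Both kinds of term have the form $\text{weight} \cdot (\theta\, \text{point} - K(\theta))$ with $K(\theta) = \theta^2/2$, so setting the derivative of their sum to zero gives a weighted mean. This closed form is derived here from the shape of the two bounds rather than quoted from the paper, and every number and function name below is a made-up illustration.

```python
import numpy as np

def m_step_mean(h, x, w, y):
    """Maximizer of sum_i h_i (theta x_i - theta^2/2) + sum_i w_i (theta y_i - theta^2/2).

    Assumes the total weight h.sum() + w.sum() is positive, so the objective is concave.
    """
    return (np.dot(h, x) + np.dot(w, y)) / (h.sum() + w.sum())

def combined_bound(theta, h, x, w, y):
    """Theta-dependent part of the summed EM and CEM bounds for one component."""
    em_part = np.sum(h * (theta * x - 0.5 * theta**2))
    cem_part = np.sum(w * (theta * y - 0.5 * theta**2))
    return em_part + cem_part

# Hypothetical inputs: EM responsibilities on observed data, CEM weights on displaced data.
h = np.array([0.9, 0.2, 0.7])
x = np.array([1.0, 3.0, 1.5])
w = np.array([0.5, 1.2])
y = np.array([0.2, -0.4])

theta_star = m_step_mean(h, x, w, y)
best = combined_bound(theta_star, h, x, w, y)
nearby = [combined_bound(theta_star + d, h, x, w, y) for d in (-0.1, 0.1)]
print(theta_star, best, nearby)
```

The update has the same weighted-average structure as the familiar EM mean update; the CEM terms simply contribute extra weighted points at displaced locations.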

5 HMMs - Bounding Structured Models

We now consider CEM for HMMs, a structured model with latent variables and a probability factorization given by an underlying Markov chain. The log-sum structure in an HMM for an observation sequence $X$ is $\log p(X|\Theta) = \log \sum_{(s_1, \ldots, s_T)} p(s_1, \ldots, s_T, X|\Theta)$, for $N$ states and chains of length $T$. The sum in the log contains $N^T$ mixture components. However, the probability density factorizes, allowing efficient computation of EM bounds. Thus, after Baum-Welch computations, the HMM is bounded via EM and can be updated with $N$ M-step equations for emission distributions and $N$ M-step equations for multinomials (i.e. the transition matrix). Therefore, EM generates exponential family lower bounds on $\log p(X|\Theta)$: a sum over states $n$ and time steps $t$ of log-emission terms weighted by $w_{n,t}$ and log-multinomial terms weighted by $v_{n,t}$. CEM also generates this form except it bounds $-\log p(X|\Theta)$.

As usual, to compute $k$ and $y_m$ we use the value $V$ of $\log p(X|\tilde\Theta)$ and its gradients $G_m$ from the EM bound. Furthermore, we approximate the HMM's envelope using EM to obtain an envelope of upper bounds. CEM's $w_m$ parameters are then checked against these sparse components. Of course, we certainly do not want to check all $N^T$ terms in the mixture. Only $NT$ terms are needed due to factorization. Each of the $2N$ models (multinomials and emission densities) is checked against these $NT$ terms via Equation 10. The terms are again in a winner-takes-all form over the $T$ elements of the chain, which are imputed as values of $\tilde x$. Figure 2 shows the envelope approximation of the HMM's likelihood with 1D means varying in a 2-state HMM with fixed transition probabilities. Note the faithful representation of the function via the EM envelope. Thus, the HMM can be accurately represented as a minimum of log-exponential family models and CEM lower bounds its negative logarithm.

6 Discussion

We have derived the CEM algorithm to lower bound the negative log summation.
Combining it with EM produces tight lower bounds on conditional and discriminant criteria with latent variables. By iteratively maximizing the bound, monotonic convergence and deterministic annealing are feasible. This generalized algorithm holds for log-concave mixtures such as the exponential family and certain graphical models. Effectively, the CEM algorithm efficiently finds a lower bound on discriminative criteria and maps them to the usual tractable M-step structure found in EM techniques. The re-use of maximization machinery permits easy migration of current probabilistic models from ML to conditional and discriminative criteria.

7 Appendix: Conditional Bayesian Estimation

This appendix motivates maximum conditional likelihood as an approximation of conditional Bayesian integration. It also shows that the conditional integral differs from the conditioned joint integral. We first compute the joint Bayesian integral from $(\mathcal{X}, \mathcal{Y})$ data and then condition it to obtain $p(y|x)^j$:

$$p(y|x)^j = \frac{\int p(x, y|\Theta)\, p(\Theta|\mathcal{X}, \mathcal{Y})\, d\Theta}{\int p(x|\Theta)\, p(\Theta|\mathcal{X}, \mathcal{Y})\, d\Theta} \tag{12}$$

The corresponding dependency graphs (Figure 3(b) and (c)) show how joint estimation contrasts with conditional estimation, which assumes that $x$ is always given as a parent of $y$. The conditional Bayesian integral takes advantage of the graph's factorization and estimates a different $p(y|x)^c$.

[Figure 3: Inconsistency of Conditioned Joint and Conditional Bayesian Estimates. (a) Data. (b) Conditioned joint. (c) Direct conditional. (d) Inconsistency.]

$$p(y|x)^c = \int p(y|x, \Theta^c)\, \left[ p(\Theta^c|\mathcal{X}, \mathcal{Y}) \right] d\Theta^c = \int p(y|x, \Theta^c)\, \frac{p(\mathcal{Y}|\Theta^c, \mathcal{X})\, p(\Theta^c)}{p(\mathcal{Y}|\mathcal{X})}\, d\Theta^c \tag{13}$$

Conditional maximum likelihood approximates this integral by picking the single $p(y|x, \Theta^c)$ that maximizes $p(\mathcal{Y}|\Theta^c, \mathcal{X})$. Fundamentally, though, the two Bayesian integrals $p(y|x)^j$ and $p(y|x)^c$ are different. We exhaustively perform both Bayesian integrals for a Gaussian mixture model on data points. Figure 3 shows the data and the resulting conditional densities. There is a clear inconsistency between joint and conditional estimation techniques that is not only a problem with ML but also at the Bayesian integration level (Figure 3(d)).

References

[1] Bishop, C. (1996). Neural networks for pattern recognition. Oxford Press.
[2] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39.
[3] Jaakkola, T. and Haussler, D. (1998). Exploiting generative models in discriminative classifiers. NIPS 11.
[4] Jebara, T. and Pentland, A. (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. NIPS 11.
[5] Jordan, M.I. and Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6.
[6] Rathinavelu, C. and Deng, L. (1996). The trended HMM with discriminative training for phonetic classification. ICSLP 96.
[7] Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag.
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationU-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models
U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models Jaemo Sung 1, Sung-Yang Bang 1, Seungjin Choi 1, and Zoubin Ghahramani 2 1 Department of Computer Science, POSTECH,
More informationHidden Markov Models
CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each
More informationBut if z is conditioned on, we need to model it:
Partially Unobserved Variables Lecture 8: Unsupervised Learning & EM Algorithm Sam Roweis October 28, 2003 Certain variables q in our models may be unobserved, either at training time or at test time or
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu November 3, 2015 Methods to Learn Matrix Data Text Data Set Data Sequence Data Time Series Graph
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationVariational Inference (11/04/13)
STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further
More informationTwo Useful Bounds for Variational Inference
Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationSeries 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning)
Exercises Introduction to Machine Learning SS 2018 Series 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning) LAS Group, Institute for Machine Learning Dept of Computer Science, ETH Zürich Prof
More informationSamy Bengioy, Yoshua Bengioz. y INRS-Telecommunications, 16, Place du Commerce, Ile-des-Soeurs, Qc, H3E 1H6, CANADA
An EM Algorithm for Asynchronous Input/Output Hidden Markov Models Samy Bengioy, Yoshua Bengioz y INRS-Telecommunications, 6, Place du Commerce, Ile-des-Soeurs, Qc, H3E H6, CANADA z Dept. IRO, Universite
More informationSpeech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models.
Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.com This Lecture Expectation-Maximization (EM)
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationA brief introduction to Conditional Random Fields
A brief introduction to Conditional Random Fields Mark Johnson Macquarie University April, 2005, updated October 2010 1 Talk outline Graphical models Maximum likelihood and maximum conditional likelihood
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationThe Variational Gaussian Approximation Revisited
The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationThe Expectation Maximization Algorithm
The Expectation Maximization Algorithm Frank Dellaert College of Computing, Georgia Institute of Technology Technical Report number GIT-GVU-- February Abstract This note represents my attempt at explaining
More informationParameter learning in CRF s
Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationAnother Walkthrough of Variational Bayes. Bevan Jones Machine Learning Reading Group Macquarie University
Another Walkthrough of Variational Bayes Bevan Jones Machine Learning Reading Group Macquarie University 2 Variational Bayes? Bayes Bayes Theorem But the integral is intractable! Sampling Gibbs, Metropolis
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationCS Lecture 18. Expectation Maximization
CS 6347 Lecture 18 Expectation Maximization Unobserved Variables Latent or hidden variables in the model are never observed We may or may not be interested in their values, but their existence is crucial
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationPILCO: A Model-Based and Data-Efficient Approach to Policy Search
PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol
More informationMIXTURE OF EXPERTS ARCHITECTURES FOR NEURAL NETWORKS AS A SPECIAL CASE OF CONDITIONAL EXPECTATION FORMULA
MIXTURE OF EXPERTS ARCHITECTURES FOR NEURAL NETWORKS AS A SPECIAL CASE OF CONDITIONAL EXPECTATION FORMULA Jiří Grim Department of Pattern Recognition Institute of Information Theory and Automation Academy
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network
More informationMachine Learning 4771
Machine Learning 4771 Instructor: ony Jebara Kalman Filtering Linear Dynamical Systems and Kalman Filtering Structure from Motion Linear Dynamical Systems Audio: x=pitch y=acoustic waveform Vision: x=object
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More information1 What is a hidden Markov model?
1 What is a hidden Markov model? Consider a Markov chain {X k }, where k is a non-negative integer. Suppose {X k } embedded in signals corrupted by some noise. Indeed, {X k } is hidden due to noise and
More informationInference and estimation in probabilistic time series models
1 Inference and estimation in probabilistic time series models David Barber, A Taylan Cemgil and Silvia Chiappa 11 Time series The term time series refers to data that can be represented as a sequence
More informationMixture Models and EM
Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering
More informationChapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang
Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More information