Deterministic annealing variant of variational Bayes method



International Workshop on Statistical-Mechanical Informatics 2007 (IW-SMI 2007)
Journal of Physics: Conference Series 95 (2008) 012015, doi:10.1088/1742-6596/95/1/012015

Deterministic Annealing Variant of Variational Bayes Method

Kentaro Katahira 1,2, Kazuho Watanabe 1 and Masato Okada 1,2
1 Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
2 RIKEN Brain Science Institute, Saitama, Japan
E-mail: katahira@mns.k.u-tokyo.ac.jp

Abstract. The Variational Bayes (VB) method is widely used as an approximation of the Bayesian method. Because the VB method is a gradient algorithm, it is often trapped by poor local optimal solutions. We introduce deterministic annealing into the VB method to overcome this local optimum problem. A temperature parameter is added to the free energy to control the annealing process deterministically. Applying the method to a mixture of Gaussian models and to hidden Markov models, we show that it can obtain the global optimum of the free energy and discover the optimal model structure.

1. Introduction
The Variational Bayes (VB) method has been widely used as an approximation of the Bayesian method for statistical models that have hidden variables [1, 2]. The VB method approximates the true Bayesian posterior distributions with a factored distribution, using an iterative algorithm similar to the expectation-maximization (EM) algorithm. In many applications it has shown better generalization than maximum likelihood estimation [3]. Because VB is a gradient algorithm, it suffers from a serious local optimum problem in practice; that is, the algorithm is often trapped by poor local solutions near the initial parameter value. To overcome this problem, Ueda and Ghahramani [4] applied a split-and-merge procedure to VB. However, their approach is limited to mixture models; it cannot be applied to other models, e.g., hidden Markov models (HMMs).

In this paper, we introduce a new deterministic annealing (DA) method to VB to overcome the local optimum problem. We call the method Deterministic Annealing VB (DA-VB). The method is general and can easily be applied to a wide class of statistical models. Such a deterministic annealing approach has been used for the EM algorithm in the maximum likelihood context [5] and has been applied successfully to practical models [6]. Ueda and Nakano [5] reformulated maximization of the log-likelihood function as minimization of a free energy, defined as an effective cost function that depends on a temperature parameter. The objective function of VB is originally of the free-energy form. Therefore, we can naturally use a statistical-mechanics analogy to add a temperature parameter to it.

We applied the DA-VB method to a simple mixture of Gaussian models and to hidden Markov models (HMMs). In both models, DA-VB yields better estimates than conventional VB. Unlike maximum likelihood estimation, VB can automatically eliminate redundant components in the model, so it has the potential to discover the appropriate

model structure [1, 2]. However, the local optimum problem undermines this advantage. We demonstrate the ability of DA-VB to eliminate redundant components and find the optimal model structure.

The paper is organized as follows. In section 2, we introduce the DA-VB method. In section 3, we apply DA-VB to a simple mixture of Gaussian models as an illustrative example. In section 4, we apply DA-VB to hidden Markov models and show the experimental results. Conclusions follow in section 6.

2. The DA-VB method
We consider learning models p(x, y | θ), where x is an observable variable, y is a hidden variable, and θ is a set of parameters. Let X^n = {x_1, ..., x_n} be a set of observable data, Y^n = {y_1, ..., y_n} be the corresponding set of unobservable data, and φ(θ) be the prior distribution on the parameters. By introducing a distribution q(Y^n, θ), the marginal log-likelihood log p(X^n) can be lower-bounded by appealing to Jensen's inequality:

  \log p(X^n) = \log \sum_{Y^n} \int d\theta \, p(X^n, Y^n, \theta)   (1)
              = \log \sum_{Y^n} \int d\theta \, q(Y^n, \theta) \frac{p(X^n, Y^n, \theta)}{q(Y^n, \theta)}   (2)
              \geq \langle \log p(X^n, Y^n, \theta) \rangle_{q(Y^n, \theta)} + H_{q(Y^n, \theta)}   (3)
              \equiv F(q),   (4)

where H_p(x) is the entropy of the distribution p(x) and ⟨·⟩_{p(x)} denotes the expectation over p(x). We define the energy of a global configuration (Y^n, θ) to be the average of −log p(X^n, Y^n, θ) over the distribution q(Y^n, θ). Then, according to the basic equation of statistical physics, F = E − TS (F is the free energy, T the temperature and S the entropy), F(q) corresponds to the negative free energy at temperature T = 1. Using this statistical-mechanics analogy, we introduce the inverse temperature parameter β = 1/T into this free energy for annealing:

  F_\beta(q) \equiv \langle \log p(X^n, Y^n, \theta) \rangle_{q(Y^n, \theta)} + \frac{1}{\beta} H_{q(Y^n, \theta)}.   (5)

Using the factorized form for q(Y^n, θ),

  q(Y^n, \theta) = Q(Y^n) r(\theta),   (6)

the free energy with inverse temperature parameter β becomes

  F_\beta(Q, r) = \langle \log p(X^n, Y^n, \theta) \rangle_{Q(Y^n) r(\theta)} + \frac{1}{\beta} H_{Q(Y^n)} + \frac{1}{\beta} H_{r(\theta)}.   (7)

The VB method iteratively maximizes F(q) with respect to Q(Y^n) and r(θ) until it converges to a local maximum. The maximization of F(q) with respect to Q(Y^n) is called the VB-E step, and the maximization of F(q) with respect to r(θ) is called the VB-M step. The DA-VB method uses the free energy with the temperature parameter, F_β(Q, r), as its objective function instead of F(q). By adding an annealing loop to the original VB procedure, the following DA-VB method can be derived.

[DA-VB method]
1. Set β ← β_init (0 < β_init < 1) and t ← 0.
2. Initialize the hyperparameters of r^(0)(θ).
3. Perform the following VB-EM steps until convergence:

   VB-E step:
   Q^{(t+1)}(Y^n) = \frac{1}{C_Q} \exp\left( \beta \langle \log p(X^n, Y^n | \theta) \rangle_{r^{(t)}(\theta)} \right)   (8)

   VB-M step:
   r^{(t+1)}(\theta) = \frac{1}{C_r} \varphi(\theta)^{\beta} \exp\left( \beta \langle \log p(X^n, Y^n | \theta) \rangle_{Q^{(t)}(Y^n)} \right)   (9)

   Set t ← t + 1.
4. Increase the inverse temperature: β_new ← β_current × β_const.
5. If β_new < 1, repeat from step 3; otherwise stop.

Here, C_Q and C_r are normalization constants, and Q^(t) and r^(t) denote the current estimates of the posteriors after the t-th iteration. When β = 1, these VB-EM steps are the same as those of the original VB method.

The VB-E step and VB-M step in the above algorithm are derived as follows. By introducing Lagrange multipliers λ and λ', the functional derivative of F_β with respect to Q(Y^n) is

  \langle \log p(X^n, Y^n, \theta) \rangle_{r(\theta)} - \frac{1}{\beta}\left( \log Q(Y^n) + 1 \right) + \lambda,   (10)

for fixed r(θ), and the functional derivative of F_β with respect to r(θ) is

  \langle \log p(X^n, Y^n, \theta) \rangle_{Q(Y^n)} - \frac{1}{\beta}\left( \log r(\theta) + 1 \right) + \lambda',   (11)

for fixed Q(Y^n). Equating these derivatives to zero gives equations (8) and (9).

Algorithmically, DA-VB is similar to the deterministic annealing variant of the EM algorithm (DAEM) [5]. The key difference is that in the DAEM algorithm the temperature parameter β modifies only the weight of the entropy of the hidden-variable posterior in the objective function, whereas in the DA-VB method β changes the weight of the entropy of the parameter posterior r(θ) as well as that of the hidden-variable posterior Q(Y^n). Beal described an annealing procedure for the VB of conjugate-exponential models [2], although its effects have not been examined yet. His annealing procedure corresponds to a special case of our DA-VB method that updates only the second β in the VB-M step, while the first β in the VB-M step and the β in the VB-E step are held fixed. As we will show later, Beal's annealing method is ineffective at training hidden Markov models in our experiment.
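To make the control flow of steps 1-5 concrete, the following Python sketch wraps a generic VB-EM inner loop in the annealing schedule. The callables vb_e_step, vb_m_step and free_energy are hypothetical placeholders for model-specific implementations of equations (8), (9) and F_β; they are not defined in the paper, and the multiplicative schedule simply mirrors the β_new ← β_current × β_const rule above.

```python
import numpy as np

def da_vb(vb_e_step, vb_m_step, free_energy, r_init,
          beta_init=0.6, beta_factor=1.1, tol=1e-6, max_inner=500):
    """A sketch of the DA-VB outer loop; the three callables supply the
    model-specific VB-E step (eq. 8), VB-M step (eq. 9) and F_beta."""
    beta = beta_init
    r = r_init                               # current parameter posterior r(theta)
    q = None
    while True:
        f_old = -np.inf
        for _ in range(max_inner):           # step 3: VB-EM until convergence at fixed beta
            q = vb_e_step(r, beta)           # update Q(Y^n)
            r = vb_m_step(q, beta)           # update r(theta)
            f_new = free_energy(q, r, beta)
            if abs(f_new - f_old) < tol:
                break
            f_old = f_new
        if beta >= 1.0:                      # beta = 1 recovers ordinary VB; stop (step 5)
            return q, r
        beta = min(1.0, beta * beta_factor)  # step 4: raise the inverse temperature
```

Setting beta_init = 1.0 makes the outer loop run exactly once, which is the conventional VB special case mentioned in the text.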

3. Illustrative example
To demonstrate the effects of β on the free energy F_β, we consider a simple one-dimensional, two-component mixture of Gaussian models [5]:

  p(x | \theta) = \sum_{k=1}^{2} \frac{\alpha_k}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}(x - \mu_k)^2 \right\}.   (12)

The complete likelihood of the model is given by

  p(X^n, Y^n | \theta) = \prod_{i=1}^{n} \prod_{k=1}^{2} \left[ \frac{\alpha_k}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}(x_i - \mu_k)^2 \right\} \right]^{y_i^k},   (13)

where y_i^k is not observed and indicates the component from which the datum x_i was generated: y_i^k = 1 if x_i was generated from the k-th component, and y_i^k = 0 otherwise. The prior over the means is a normal distribution with mean 0 and variance σ²:

  \varphi(\mu) = \prod_{k=1}^{2} N(\mu_k | 0, \sigma^2).   (14)

We fix the mixing coefficients to α_1 = 0.3 and α_2 = 0.7. Then the only parameter posteriors to be estimated are r(μ_1) and r(μ_2). Because r(μ_1) and r(μ_2) are parameterized by their own means (μ̄_1, μ̄_2), we can visualize the free-energy surface F_β on the (μ̄_1, μ̄_2)-plane. The VB-M step consists of updating the posterior according to

  r(\mu_k) = N\left( \mu_k \,\Big|\, \bar{\mu}_k, \frac{\sigma^2}{\beta(1 + \sigma^2 \bar{N}_k)} \right),   (15)

where

  \bar{\mu}_k = \frac{\sigma^2 \bar{N}_k}{1 + \sigma^2 \bar{N}_k}\, \bar{x}_k, \qquad
  \bar{N}_k = \sum_{i=1}^{n} \bar{y}_i^k, \qquad
  \bar{x}_k = \frac{1}{\bar{N}_k} \sum_{i=1}^{n} \bar{y}_i^k x_i,

  \bar{y}_i^k = \frac{\exp\{\beta \gamma_i^k\}}{\sum_{j=1}^{2} \exp\{\beta \gamma_i^j\}}, \qquad
  \gamma_i^k = \log \alpha_k - \log \sqrt{2\pi} - \frac{1}{2}(x_i - \bar{\mu}_k)^2 - \frac{\sigma^2}{2\beta(1 + \sigma^2 \bar{N}_k)}.

Five hundred samples were generated from this model with (μ_1, μ_2) = (4, −4). Therefore, the global optimal posterior means are (μ̄_1, μ̄_2) = (4, −4). Figure 1 shows the learning process of the conventional VB method, that is, the special case of DA-VB with β_init = 1.0. In the trial with the initial value (μ̄_1, μ̄_2) = (0, 6), close to the local maximum, the algorithm converged to the local maximum, whereas it converged to the global maximum from the initial value (μ̄_1, μ̄_2) = (6, 0).

Figure 2 shows the learning process of the DA-VB method. We set the hyperparameter of the prior to σ² = 1. Note that the σ² of the prior is independent of the variance of the Gaussians of the model (which is 1 in our case). The choice of the prior σ² affects the estimation; e.g., for small σ², the probability that μ_k lies far from zero is small. However, as can be seen from the equation for μ̄_k, if the data size n is large enough and σ² is not too close to zero, the effect of the prior σ² is negligible. The starting point for each β is the convergence point of the previous annealing step. We set β_init = 0.6 and β_new = β_current × 1.1. When β is small (β = 0.6), F_β has only one maximum. Hence, for arbitrary initial parameter values, the VB-EM steps converge to that maximum (figure 2(a)). As β increases, two maxima appear, corresponding to the global optimal solution (μ̄_1, μ̄_2) = (4, −4) and a local optimal solution (μ̄_1, μ̄_2) = (−4, 4) (figure 2(b)). The form of the free-energy surface F_β changes gradually as β increases, and the maximum near the global optimal solution is always higher than the one near the local optimal solution. Therefore, the DA-VB method can reach the global optimal solution without being trapped by the local optimal solution.
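A minimal numerical sketch of these updates, under the settings described above (α = (0.3, 0.7), prior σ² = 1, β_init = 0.6 with a ×1.1 schedule), might look as follows. The data-generation step and the fixed number of inner iterations are assumptions made only to keep the example self-contained; variable names such as ybar and Nbar simply mirror ȳ_i^k and N̄_k.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.3, 0.7])          # fixed mixing coefficients
sigma2 = 1.0                          # prior variance sigma^2
n = 500

# Synthetic data drawn from the true model with means (4, -4) (assumed here for self-containment).
true_mu = np.array([4.0, -4.0])
z = rng.choice(2, size=n, p=alpha)
x = rng.normal(true_mu[z], 1.0)

mubar = np.array([0.0, 6.0])          # initial posterior means, near the local optimum
Nbar = n * alpha                      # initial effective counts N_k
beta = 0.6
while True:
    for _ in range(200):              # VB-EM steps at fixed beta
        # VB-E step: responsibilities ybar_i^k proportional to exp(beta * gamma_i^k)
        gamma = (np.log(alpha) - 0.5 * np.log(2 * np.pi)
                 - 0.5 * (x[:, None] - mubar) ** 2
                 - sigma2 / (2 * beta * (1 + sigma2 * Nbar)))
        logits = beta * gamma
        ybar = np.exp(logits - logits.max(axis=1, keepdims=True))
        ybar /= ybar.sum(axis=1, keepdims=True)
        # VB-M step: r(mu_k) = N(mubar_k, sigma^2 / (beta * (1 + sigma^2 * Nbar_k)))
        Nbar = ybar.sum(axis=0)
        xbar = (ybar * x[:, None]).sum(axis=0) / Nbar
        mubar = sigma2 * Nbar / (1 + sigma2 * Nbar) * xbar
    if beta >= 1.0:
        break
    beta = min(1.0, beta * 1.1)       # anneal toward beta = 1

print("posterior means:", mubar)      # expected to end up near the global optimum (4, -4)
```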

[Figure 1: free-energy surface on the (μ̄_1, μ̄_2)-plane, with the local maximum, the global maximum, and the two initial points marked.]

Figure 1. Learning process of the conventional VB method with two different initial values, (μ̄_1, μ̄_2) = (0, 6) and (6, 0). The algorithm starting from the initial point near the global maximum converges to the global maximum, while the one starting from the initial point near a local maximum converges to the local maximum.

4. DA-VB method for HMM
Here we derive the DA-VB method for hidden Markov models by extending the conventional VB method for HMMs (VB HMM) [2, 7]. Suppose a discrete sequence of L-valued symbols X^T = {x_1, ..., x_T}, where T is the length of the time series, was observed. We assume that x_t was produced by a K-valued discrete hidden state y_t and that the sequence of hidden states Y^T = {y_1, ..., y_T} was generated by a first-order Markov process. We represent the observed data by L-dimensional binary vectors with components x_{t,m}, where x_{t,m} = 1 if the output symbol at time t is m and 0 otherwise. The hidden states y_t are likewise represented by K-dimensional binary vectors with components y_{t,k}, where y_{t,k} = 1 if the hidden state at time t is k and 0 otherwise. The HMM has the following parameters:

  initial hidden-state prior: π = {π_i}, π_i = p(y_1 = i)  (K × 1),
  state transition matrix: A = {a_{ij}}, a_{ij} = p(y_t = j | y_{t−1} = i)  (K × K),
  symbol emission matrix: C = {c_{im}}, c_{im} = p(x_t = m | y_t = i)  (K × L),

where all parameters are non-negative and obey the normalization constraints \sum_{i=1}^{K} \pi_i = 1, \sum_{j=1}^{K} a_{ij} = 1 and \sum_{m=1}^{L} c_{im} = 1.

The VB-E step in DA-VB computes the expectations required for the VB-M step,

  \langle y_{t,k} \rangle_{Q(Y^T)} = \frac{p(y_{t,k} = 1 | X^T)^{\beta}}{\sum_{k=1}^{K} p(y_{t,k} = 1 | X^T)^{\beta}},   (16)

  \langle y_{t-1,i}\, y_{t,j} \rangle_{Q(Y^T)} = \frac{p(y_{t-1,i} = 1, y_{t,j} = 1 | X^T)^{\beta}}{\sum_{i=1}^{K} \sum_{j=1}^{K} p(y_{t-1,i} = 1, y_{t,j} = 1 | X^T)^{\beta}}.   (17)

These quantities are calculated using the well-known forward-backward algorithm.
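Since equations (16) and (17) only reweight and renormalize the marginals produced by the forward-backward pass, the tempering step itself is short. The sketch below assumes those ordinary posterior marginals are supplied by an existing forward-backward implementation, which is not shown; the function names are my own.

```python
import numpy as np

def temper_state_marginals(post, beta):
    """Eq. (16): post[t, k] = p(y_{t,k} = 1 | X^T), shape (T, K).
    Raise to the power beta and renormalize over the K states at each t."""
    w = post ** beta
    return w / w.sum(axis=1, keepdims=True)

def temper_pair_marginals(pair_post, beta):
    """Eq. (17): pair_post[t, i, j] = p(y_{t,i} = 1, y_{t+1,j} = 1 | X^T), shape (T-1, K, K).
    Raise to the power beta and renormalize over all (i, j) pairs at each t."""
    w = pair_post ** beta
    return w / w.sum(axis=(1, 2), keepdims=True)
```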

[Figure 2: six panels showing the free-energy surface F_β on the (μ̄_1, μ̄_2)-plane at increasing values of β (the final panel at β = 1.0), each marking the starting point and convergence point of the VB-EM steps; the last panel also marks the final solution.]

Figure 2. Learning process of the DA-VB method with the initial value (μ̄_1, μ̄_2) = (0, 6). The method successfully finds the global optimal solution. Panels (a)-(f) show the free-energy surfaces corresponding to particular values of β.

The parameter priors over π, the rows of A and the rows of C are Dirichlet distributions:

  \varphi(\pi) = \mathrm{Dir}(\{\pi_1, \ldots, \pi_K\} \,|\, u^{(\pi)}),   (18)

  \varphi(A) = \prod_{j=1}^{K} \mathrm{Dir}(\{a_{j1}, \ldots, a_{jK}\} \,|\, u^{(A)}),   (19)

  \varphi(C) = \prod_{j=1}^{K} \mathrm{Dir}(\{c_{j1}, \ldots, c_{jL}\} \,|\, u^{(C)}),   (20)

where

  \mathrm{Dir}(\{a_1, \ldots, a_K\} \,|\, u) = \frac{\Gamma\left(\sum_{j=1}^{K} u_j\right)}{\prod_{j=1}^{K} \Gamma(u_j)} \prod_{j=1}^{K} a_j^{u_j - 1}.   (21)

The VB-M step in DA-VB is then given by

  r(\pi) = \mathrm{Dir}(\{\pi_1, \ldots, \pi_K\} \,|\, \{w_1^{\pi}, \ldots, w_K^{\pi}\}),   (22)

  r(A) = \prod_{i=1}^{K} \mathrm{Dir}(\{a_{i1}, \ldots, a_{iK}\} \,|\, \{w_{i1}^{A}, \ldots, w_{iK}^{A}\}),   (23)

  r(C) = \prod_{i=1}^{K} \mathrm{Dir}(\{c_{i1}, \ldots, c_{iL}\} \,|\, \{w_{i1}^{C}, \ldots, w_{iL}^{C}\}),   (24)

where

  w_j^{\pi} = \beta\left( u_j^{(\pi)} + \langle y_{1,j} \rangle_{Q(Y^T)} - 1 \right) + 1,   (25)

  w_{ij}^{A} = \beta\left( u_j^{(A)} + \sum_{t=2}^{T} \langle y_{t-1,i}\, y_{t,j} \rangle_{Q(Y^T)} - 1 \right) + 1,   (26)

  w_{ij}^{C} = \beta\left( u_j^{(C)} + \sum_{t=1}^{T} x_{t,j} \langle y_{t,i} \rangle_{Q(Y^T)} - 1 \right) + 1.   (27)

5. Experiments
To demonstrate the ability of DA-VB HMM to find global optimal solutions, we performed an experiment on synthetic data. We generated the sample data set as follows. Using standard regular-expression notation, the three types of sequences, (abc), (acb) and (a*b*), are concatenated, switching to another sequence type with probability 0.1 and repeating with probability 0.8. The true model structure for these sequences is shown in figure 3(d). The training data consisted of 5 sequences of length 5 symbols. They are synthetic data modified from those used in [2]. We chose an HMM with K = 15 hidden states to allow for some redundancy. VB HMMs can learn by pruning redundant hidden states [2]; in other words, the posterior means of the transition probabilities into the redundant hidden states automatically converge to 0. For the DA-VB method, we set β_init = 0.6 and β_new ← β_current × 1.2. The prior hyperparameters were set as u_j^{(π)} = u_j^{(A)} = 1/K and u_j^{(C)} = 1/L for all j.

Figure 3(a)-(c) shows histograms of the free energies after convergence across random initializations. The conventional VB method reaches a maximum free energy of 113 in less than 1% of the trials. On the other hand, the DA-VB method reaches a maximum free energy of about 118, which was never reached by the conventional VB method, in about 62% of the trials. Figures 3(e) and (f) show typical model structures obtained by DA-VB and by conventional VB, respectively. The model structure obtained most frequently by DA-VB corresponds to the true model structure, whereas the model structure obtained most frequently by conventional VB is too complex. DA-VB eliminated the redundant hidden states appropriately and used seven hidden states, which is the optimal number, while conventional VB used more than seven hidden states in most trials.

Beal's annealing method [2] corresponds to replacing equations (25)-(27) with w_j^{\pi} = u_j^{(\pi)} + \beta \langle y_{1,j} \rangle_{Q(Y^T)}, w_{ij}^{A} = u_j^{(A)} + \beta \sum_{t=2}^{T} \langle y_{t-1,i}\, y_{t,j} \rangle_{Q(Y^T)} and w_{ij}^{C} = u_j^{(C)} + \beta \sum_{t=1}^{T} x_{t,j} \langle y_{t,i} \rangle_{Q(Y^T)}. As can be seen in figure 3(c), his method was ineffective in our experiment.
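For concreteness, the contrast between the DA-VB hyperparameter updates (25)-(27) and Beal's variant can be written out as below. The function names and array shapes are my own conventions, not the paper's, and the expectations are assumed to come from the tempered VB-E step of section 4.

```python
import numpy as np

def davb_w_pi(u_pi, y1, beta):
    """Eq. (25): w_j^pi = beta * (u_j^pi + <y_{1,j}> - 1) + 1; u_pi and y1 have shape (K,)."""
    return beta * (u_pi + y1 - 1.0) + 1.0

def davb_w_A(u_A, pair_expect, beta):
    """Eq. (26): pair_expect has shape (T-1, K, K); the result w^A has shape (K, K)."""
    counts = pair_expect.sum(axis=0)              # sum_t <y_{t-1,i} y_{t,j}>
    return beta * (u_A[None, :] + counts - 1.0) + 1.0

def davb_w_C(u_C, x_onehot, state_expect, beta):
    """Eq. (27): x_onehot has shape (T, L), state_expect (T, K); w^C has shape (K, L)."""
    counts = state_expect.T @ x_onehot            # sum_t x_{t,m} <y_{t,i}>
    return beta * (u_C[None, :] + counts - 1.0) + 1.0

def beal_w_pi(u_pi, y1, beta):
    """Beal's annealing variant: beta multiplies only the expected sufficient statistics."""
    return u_pi + beta * y1
```

At beta = 1 both variants reduce to the standard VB-M update u + expected counts, which is consistent with the remark that β = 1 recovers conventional VB.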

[Figure 3: panels (a)-(c) are histograms (frequency versus F) of converged free energies for conventional VB, DA-VB and Beal's annealing method; panels (d)-(f) are state-transition diagrams, with emission probabilities annotated, for the true HMM and for the HMMs learned by DA-VB and conventional VB.]

Figure 3. (a)-(c): Histograms of converged free energies across random initializations for (a) the VB method, (b) the DA-VB method with β_init = 0.6, and (c) VB with Beal's annealing method. (d) The true HMM structure of the synthetic data. The HMM structures obtained by (e) DA-VB (F_β = 118) and (f) conventional VB (F = 122). The transition and emission probabilities are obtained from the posterior means ⟨a_{ij}⟩_{r(A)} and ⟨c_{im}⟩_{r(C)}. Unused hidden states and the initial hidden-state prior ⟨π_i⟩_{r(π)} are omitted for the sake of viewability.

6. Conclusion
We developed a novel deterministic annealing method for VB by introducing an inverse temperature parameter β based on a statistical-mechanics analogy. By applying the method to a simple mixture of Gaussian models and to hidden Markov models, we showed that the proposed method improves the ability of the VB method to find optimal solutions. Our method can easily be applied to other statistical models with only a minor modification of the conventional VB method, i.e., the addition of the parameter β. We believe that our method will be a powerful tool for finding the statistical structures underlying real-world data.

Acknowledgments
This work was supported in part by Grants-in-Aid for Scientific Research on Priority Areas, a Grant-in-Aid for Scientific Research (C), and the Director's fund of BSI, RIKEN.

References
[1] Attias H 1999 Proc. 15th Conf. on Uncertainty in Artificial Intelligence p 21
[2] Beal M J 2003 Ph.D. thesis, University College London
[3] Watanabe S, Minami Y, Nakamura A and Ueda N 2002 Advances in Neural Information Processing Systems vol 15 (Cambridge, MA: MIT Press) p 1261
[4] Ueda N and Ghahramani Z 2002 Neural Netw. 15 1223
[5] Ueda N and Nakano R 1998 Neural Netw. 11 271
[6] Ghahramani Z and Hinton G E 2000 Neural Comput. 12 831
[7] MacKay D 1997 Unpublished manuscript
