Graphical models: parameter learning


Zoubin Ghahramani
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, England

INTRODUCTION

Graphical models combine graph theory and probability theory to provide a general framework for representing models in which a number of variables interact. Graphical models trace their origins to many different fields and have been applied in a wide variety of settings: for example, to develop probabilistic expert systems, to understand neural network models, to infer trait inheritance in genealogies, to model images, to correct errors in digital communication, or to solve complex decision problems. Remarkably, the same formalisms and algorithms can be applied to this wide range of problems.

Each node in the graph represents a random variable (or, more generally, a set of random variables). The pattern of edges in the graph represents the qualitative dependencies between the variables; the absence of an edge between two nodes means that any statistical dependency between these two variables is mediated via some other variable or set of variables. The quantitative dependencies between variables which are connected by edges are specified via parameterized conditional distributions, or more generally non-negative potential functions. The pattern of edges and the potential functions together specify a joint probability distribution over all the variables in the graph. We refer to the pattern of edges as the structure of the graph, and to the parameters of the potential functions simply as the parameters of the graph. In this chapter, we assume that the structure of the graph is given, and that our goal is to learn the parameters of the graph from data. Solutions to the problem of learning the graph structure from data are given in GRAPHICAL MODELS: STRUCTURE LEARNING. We briefly review some of the notation from PROBABILISTIC INFERENCE IN GRAPHICAL MODELS which we will need to cover parameter learning in graphical models; we assume that the reader is familiar with the contents of that chapter.

More in-depth treatments of graphical models can be found in Pearl (1988), Heckerman (1996), Jordan (1998), and Cowell et al. (1999).

There are two main varieties of graphical model. Directed graphical models, also known as BAYESIAN NETWORKS, represent the joint distribution of $d$ random variables $X = (X_1, \ldots, X_d)$ by a directed acyclic graph in which each node $i$, representing variable $X_i$, receives directed edges from its set of parent nodes $\pi_i$. The semantics of a directed graphical model are that the joint distribution of $X$ can be factored into the product of conditional distributions of each variable given its parents. That is, for each setting $x$ of the variables $X$:

$$ p(x \mid \theta) = \prod_{i=1}^{d} p(x_i \mid x_{\pi_i}, \theta_i). \tag{1} $$

This factorization formalizes the graphical intuition that $X_i$ depends on its parents: given its parents, $X_i$ is statistically independent of all other variables which are not descendants of $X_i$. The set of parameters governing the conditional distribution which relates $X_{\pi_i}$ to $X_i$ is denoted by $\theta_i$, while the set of all parameters in the graphical model is denoted $\theta = (\theta_1, \ldots, \theta_d)$. Note that (1) is identical to equation 1 in PROBABILISTIC INFERENCE IN GRAPHICAL MODELS except that we have made explicit the dependence of the conditional distributions on the model parameters.

Undirected graphical models represent the joint distribution of a set of variables via a graph with undirected edges. Defining $\mathcal{C}$ to be the set of maximal cliques (i.e. maximal fully-connected subgraphs) of this graph, an undirected graphical model corresponds to the statement that the joint distribution of $X$ can be factored into the product of functions over the variables in each clique:

$$ p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{C \in \mathcal{C}} \psi_C(x_C \mid \theta_C) \tag{2} $$

where $\psi_C(x_C \mid \theta_C)$ is a potential function assigning a non-negative real number to each configuration $x_C$ of $X_C$, and is parameterized by $\theta_C$. An undirected graphical model corresponds to the graphical intuition that dependencies are transmitted via the edges in the graph: each variable is statistically independent of all other variables given the set of variables it is connected to (i.e. its neighbors). Note again that we have reproduced equation 2 from PROBABILISTIC INFERENCE IN GRAPHICAL MODELS while making explicit the parameters of the potential functions.

The chapter is organized as follows. We start by concentrating on directed graphical models. In the next section, we discuss the problem of learning maximum likelihood (ML) parameters when all the variables are observed. The following section generalizes this to the case where some of the variables are hidden or missing, and introduces the EM algorithm. We then turn to learning the parameters of undirected graphical models using both EM and IPF. Finally, we discuss the Bayesian approach, in which a posterior distribution over parameters is inferred from the data.
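As a concrete illustration of the directed factorization (1), the following minimal sketch evaluates the joint probability of a complete setting by multiplying each node's conditional probability given its parents. The three-node network A -> C <- B and all of the table entries are hypothetical, chosen only for illustration.

```python
# Minimal sketch of the factorization in equation (1) for a hypothetical
# three-node binary network A -> C <- B with made-up probability tables.

# p(A), p(B), and p(C | A, B): the conditional distribution of each node
# given its parents (A and B have no parents).
p_A = {0: 0.7, 1: 0.3}
p_B = {0: 0.4, 1: 0.6}
p_C_given_AB = {           # keyed by (a, b): distribution over C
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.3, 1: 0.7},
    (1, 1): {0: 0.1, 1: 0.9},
}

def joint(a, b, c):
    """p(a, b, c) = p(a) * p(b) * p(c | a, b), as in equation (1)."""
    return p_A[a] * p_B[b] * p_C_given_AB[(a, b)][c]

# The joint sums to one over all configurations, as it must.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 0, 1), total)   # 0.3 * 0.4 * 0.7 = 0.084, and 1.0
```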

ML LEARNING FROM COMPLETE DATA

Assume we are given a data set $d$ of $N$ independent and identically distributed observations of the settings of all the variables in our directed graphical model, $d = \{x^{(1)}, \ldots, x^{(N)}\}$, where $x^{(n)} = (x_1^{(n)}, \ldots, x_d^{(n)})$. The likelihood is a function of the parameters which is proportional to the probability of the observed data:

$$ p(d \mid \theta) = \prod_{n=1}^{N} p(x^{(n)} \mid \theta). \tag{3} $$

We assume that the parameters are unknown and we wish to estimate them from data. We focus on the problem of estimating a single setting of the parameters which maximizes the likelihood (3). (In contrast, the Bayesian approach to learning described in the last section starts with a prior distribution over the parameters, $p(\theta)$, which is meant to capture any prior knowledge we may have about $\theta$, and infers the posterior distribution over parameters given the data using Bayes rule.) Equivalently, we can maximize the log likelihood:

$$ L(\theta) = \log p(d \mid \theta) = \sum_{n=1}^{N} \log p(x^{(n)} \mid \theta) \tag{4} $$
$$ \qquad\quad = \sum_{n=1}^{N} \sum_{i=1}^{d} \log p\big(x_i^{(n)} \mid x_{\pi_i}^{(n)}, \theta_i\big) \tag{5} $$

where the last equality makes use of the factorization (1) of the joint distribution in the directed graphical model. If we assume that the parameters $\theta_i$ governing the conditional probability distribution of $X_i$ given its parents are distinct from and functionally independent of the parameters governing the conditional probability distributions of the other nodes in the graphical model, then the log likelihood decouples into a sum of local terms involving each node and its parents:

$$ L(\theta) = \sum_{i=1}^{d} L_i(\theta_i) \tag{6} $$

where $L_i(\theta_i) = \sum_n \log p(x_i^{(n)} \mid x_{\pi_i}^{(n)}, \theta_i)$. Each $L_i$ can be maximized independently as a function of $\theta_i$. For example, if the $X$ variables are discrete and $\theta_i$ is the conditional probability table for $X_i$ given its parents, then the ML estimate of $\theta_i$ is simply a normalized table containing counts of each setting of $X_i$ given each setting of its parents in the data set.

Maximum a posteriori (MAP) parameter estimation incorporates prior knowledge about the parameters in the form of a distribution $p(\theta)$. The goal of MAP estimation is to find the parameter setting that maximizes the posterior over parameters, $p(\theta \mid d)$, which is proportional to the prior times the likelihood. If the prior factorizes over the parameters governing each conditional probability distribution, i.e. $p(\theta) = \prod_i p(\theta_i)$, then MAP estimates can be found by maximizing

$$ L'(\theta) = \sum_{i=1}^{d} \left[ L_i(\theta_i) + \log p(\theta_i) \right]. \tag{7} $$
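The decoupled ML estimates of (6), and the regularizing effect of the log prior in (7), can be illustrated with a small sketch. The data, the single binary parent U, the binary child X, and the pseudocount value below are all hypothetical; the pseudocount smoothing stands in for the log-prior term (it corresponds to a MAP-style estimate under a symmetric Dirichlet prior, which is one common choice, not the only one).

```python
from collections import Counter

# Hypothetical complete data: (u, x) pairs for a binary parent U and child X.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]

def ml_cpt(pairs):
    """ML estimate of P(X = x | U = u): normalized counts, as in the text."""
    joint = Counter(pairs)                    # counts of (u, x)
    parent = Counter(u for u, _ in pairs)     # counts of u
    return {(u, x): joint[(u, x)] / parent[u] for (u, x) in joint}

def smoothed_cpt(pairs, pseudo=1.0):
    """Smoothed estimate: add `pseudo` pseudocounts to every (u, x) cell.
    This plays the role of the log-prior regularizer in (7); under a
    symmetric Dirichlet prior it corresponds to a MAP-style estimate."""
    joint = Counter(pairs)
    parent = Counter(u for u, _ in pairs)
    values = sorted({x for _, x in pairs})
    return {(u, x): (joint[(u, x)] + pseudo) / (parent[u] + pseudo * len(values))
            for u in parent for x in values}

print(ml_cpt(data))        # e.g. P(X=0 | U=0) = 3/4
print(smoothed_cpt(data))  # counts smoothed toward uniform
```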

The log prior can be seen as a regularizer, which can help reduce overfitting in situations where there is insufficient data for the parameters to be well-determined (see GENERALIZATION AND REGULARIZATION IN NONLINEAR LEARNING SYSTEMS). While ML estimation is invariant to reparameterization, since the location of the maximum of the likelihood function does not change if one applies a one-to-one transformation $f : \theta \to \phi$, MAP estimation is not. Indeed, for any $\tilde{\theta}$ one can always find a one-to-one mapping $f$ such that the MAP estimate of $\phi = f(\theta)$ is $f(\tilde{\theta})$, as long as $p(\tilde{\theta} \mid d) > 0$ in a small neighborhood around $\tilde{\theta}$. Thus, care should be taken in the choice of parameterization.

ML LEARNING WITH HIDDEN VARIABLES

The EM algorithm

Often, the observed data will not include the values of some of the variables in the graphical model. We refer to these variables as missing or hidden variables. With hidden variables, the log likelihood cannot be decomposed as in (6). Rather, we find:

$$ L(\theta) = \log p(x \mid \theta) = \log \sum_{y} p(x, y \mid \theta) \tag{8} $$

where $x$ denotes the setting of the observed variables, $y$ the setting of the hidden variables, and $\sum_y$ is the sum (or integral) over $Y$ required to obtain the marginal probability of the observed data. (For notational convenience, we have dropped the superscript $(n)$ in (8) by evaluating the log likelihood for a single observation.) Maximizing (8) directly is often difficult because the log of the sum can potentially couple all of the parameters of the model. We can simplify the problem of maximizing $L$ with respect to $\theta$ by making use of the following insight. Any distribution $q(y)$ over the hidden variables defines a lower bound on $L$:

$$ L(\theta) = \log \sum_{y} p(x, y \mid \theta) = \log \sum_{y} q(y) \frac{p(x, y \mid \theta)}{q(y)} \tag{9} $$
$$ \geq \sum_{y} q(y) \log \frac{p(x, y \mid \theta)}{q(y)} \tag{10} $$
$$ = \sum_{y} q(y) \log p(x, y \mid \theta) - \sum_{y} q(y) \log q(y) \tag{11} $$
$$ \equiv F(q, \theta) \tag{12} $$

where the inequality is known as Jensen's inequality and follows from the fact that the log function is concave. If we define the energy of a global configuration $(x, y)$ to be $-\log p(x, y \mid \theta)$, then some readers may notice that the lower bound $F(q, \theta) \leq L(\theta)$ is the negative of a quantity known in statistical physics as the free energy: the expected energy under $q$ minus the entropy of $q$ (Neal and Hinton, 1998). The Expectation-Maximization (EM) algorithm (Baum et al., 1970; Dempster et al., 1977) alternates between maximizing $F$ with respect to $q$ and $\theta$, respectively, holding the other fixed.

Starting from some initial parameters $\theta^{[0]}$, the $(k+1)$st iteration of the algorithm consists of the following two steps:

$$ \text{E step:} \quad q^{[k+1]} \leftarrow \arg\max_{q} F(q, \theta^{[k]}) \tag{13} $$
$$ \text{M step:} \quad \theta^{[k+1]} \leftarrow \arg\max_{\theta} F(q^{[k+1]}, \theta) \tag{14} $$

It is easy to show that the maximum in the E step is obtained by setting $q^{[k+1]}(y) = p(y \mid x, \theta^{[k]})$, at which point the bound becomes an equality: $F(q^{[k+1]}, \theta^{[k]}) = L(\theta^{[k]})$. This involves inferring the distribution over the hidden variables given the observed variables and the current settings of the parameters, $p(y \mid x, \theta^{[k]})$. Algorithms which solve this inference problem are presented in PROBABILISTIC INFERENCE IN GRAPHICAL MODELS. These algorithms make use of the structure of the graphical model to compute the quantities of interest efficiently by passing local messages from each node to its neighbors. Exact inference makes the bound tight, but is in general computationally intractable for multiply-connected graphical structures: even these efficient message-passing procedures can take exponential time to compute the exact solution in such cases. For such graphs, deterministic and Monte Carlo methods provide tools for approximating the E step of EM.

One deterministic approximation which can be used in intractable models is to increase, but not fully maximize, the functional with respect to $q$ in the E step. In particular, if $q$ is chosen to be in a tractable family of distributions $\mathcal{Q}$ (i.e. a family of distributions for which the required expectations can be computed in polynomial time), then maximizing $F$ over this tractable family,

$$ \text{E step:} \quad q^{[k+1]} \leftarrow \arg\max_{q \in \mathcal{Q}} F(q, \theta^{[k]}), \tag{15} $$

is called a variational approximation to the EM algorithm (Jordan et al., 1998). This maximizes a lower bound on the likelihood rather than the likelihood itself.

The maximum in the M step is obtained by maximizing the first term in (11), since the entropy of $q$ does not depend on $\theta$:

$$ \text{M step:} \quad \theta^{[k+1]} \leftarrow \arg\max_{\theta} \sum_{y} p(y \mid x, \theta^{[k]}) \log p(y, x \mid \theta). \tag{16} $$

This is the expression most often associated with the EM algorithm (Dempster et al., 1977), but it obscures the elegant interpretation of EM as coordinate ascent in $F$. Since $F = L$ at the beginning of each M step (following an exact E step), and since the E step does not change $\theta$, we are guaranteed not to decrease the likelihood after each combined EM step.
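The bound and the E-step claim can be checked numerically. The sketch below uses a hypothetical model with one binary hidden variable Y and one binary observed variable X, with made-up parameter values: it evaluates the lower bound F(q, theta) of (12) on a grid of distributions q(y) and confirms that F never exceeds L(theta) and is maximized, with equality, at the posterior q(y) = p(y | x, theta).

```python
import math

# Hypothetical model: hidden Y in {0,1}, observed X in {0,1}.
# p(Y=1) = 0.4;  p(X=1 | Y=0) = 0.2;  p(X=1 | Y=1) = 0.9.  All values made up.
p_y1 = 0.4
p_x1_given_y = {0: 0.2, 1: 0.9}

x = 1                                   # the single observation

def p_xy(y):                            # joint p(x, y | theta)
    py = p_y1 if y == 1 else 1.0 - p_y1
    px = p_x1_given_y[y] if x == 1 else 1.0 - p_x1_given_y[y]
    return py * px

L = math.log(p_xy(0) + p_xy(1))         # log likelihood, equation (8)
posterior_y1 = p_xy(1) / (p_xy(0) + p_xy(1))

def F(q1):                              # lower bound of equation (12), with q(Y=1) = q1
    val = 0.0
    for y, q in ((0, 1.0 - q1), (1, q1)):
        if q > 0:
            val += q * (math.log(p_xy(y)) - math.log(q))
    return val

grid = [i / 100 for i in range(1, 100)]
best_q = max(grid, key=F)
print(L, F(best_q), best_q, posterior_y1)
# F(q) <= L for every q on the grid, and the maximizing q is the posterior.
assert all(F(q) <= L + 1e-12 for q in grid)
```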

It is worthwhile to point out that it is usually not necessary to explicitly evaluate the entire posterior distribution $p(y \mid x, \theta^{[k]})$. Since $\log p(x, y \mid \theta)$ contains both hidden and observed variables in the network, it can be factored as before into the sum of log probabilities of each node given its parents (5). Consequently, the quantities required for the M step are the expected values, under the posterior distribution $p(y \mid x, \theta^{[k]})$, of the same quantities (namely the sufficient statistics) required for ML estimation in the complete data case.

Consider a directed graphical model with discrete variables, some hidden and some observed. Each node is parameterized by a conditional probability table which relates its values to the values of its parents. For example, if node $i$ has two parents, $j$ and $k$, and each variable can take on $L$ values, then $\theta_i$ is an $L \times L \times L$ table, with entries $\theta_{i,rst} = P(X_i = r \mid X_j = s, X_k = t)$. In the complete data setting, where $X_i$, $X_j$, and $X_k$ are observed, the maximum likelihood estimate is

$$ \hat{\theta}_{i,rst} = \frac{\#(X_i = r, X_j = s, X_k = t)}{\#(X_j = s, X_k = t)} \tag{17} $$

where $\#(\cdot)$ denotes the count (frequency) with which the bracketed expression occurs in the data. However, if all three variables were hidden, then one could use the EM algorithm for learning the directed graphical model (Lauritzen, 1995; Russell et al., 1995). The analogous M step for $\theta_{i,rst}$ would be

$$ \hat{\theta}_{i,rst} = \frac{\sum_n P(Y_i = r, Y_j = s, Y_k = t \mid X = x^{(n)})}{\sum_n P(Y_j = s, Y_k = t \mid X = x^{(n)})}. \tag{18} $$

The sufficient statistics of the data required to estimate the parameters are the counts of the settings of each node and its parents; no other information in the data is relevant for ML parameter estimation. The expectation step of EM computes the expected values of these sufficient statistics.

The EM algorithm provides an intuitive way of dealing with hidden or missing data. The E step fills in the hidden variables with the distribution given by the current model. The M step then treats these filled-in values as if they had been observed and re-estimates the model parameters. It is pleasantly surprising that these steps result in a convergent procedure for finding the most likely parameters. Although EM is intuitive and often easy to implement, it is sometimes not the most efficient algorithm for finding ML parameters.

The EM procedure for learning directed graphical models with hidden variables can be applied to a wide variety of well-known models. Of particular note is the special case known as the Baum-Welch algorithm for training HIDDEN MARKOV MODELS (Baum et al., 1970; Rabiner, 1989). In the E step it uses a local message-passing algorithm, called the forward-backward algorithm, to compute the required expected sufficient statistics. In the M step it uses parameter re-estimation equations based on expected counts, analogous to (18). The Baum-Welch algorithm predates the original EM algorithm, and in fact the convergence proof for EM is based on the one given in Baum et al. (1970). The EM algorithm can also be used to fit a variety of other models that have been studied in the machine learning, neural networks, statistics, and engineering literatures. These include linear dynamical systems, factor analysis, mixtures of Gaussians, and mixtures of experts (see Roweis and Ghahramani, 1999, for a review). It is straightforward to modify EM so that it maximizes the parameter posterior probability rather than the likelihood.
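To make the expected-counts M step (18) concrete, here is a minimal sketch for a hypothetical two-node network with a hidden binary parent Y and an observed binary child X (all parameter values and data are made up). The E step computes P(Y | X = x^(n)) for each data point; the M step re-estimates P(Y) and P(X | Y) from the resulting expected counts, exactly as normalized observed counts would be used in (17) if Y were observed.

```python
# EM for a hypothetical two-node network Y -> X with Y hidden, X observed.
# Parameters: pi = P(Y=1), theta[y] = P(X=1 | Y=y).  All values made up.
data = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]          # observed values of X

pi, theta = 0.5, {0: 0.3, 1: 0.6}               # initial guess

for iteration in range(50):
    # E step: responsibility r_n = P(Y=1 | X = x_n) for each observation.
    resp = []
    for x in data:
        p1 = pi * (theta[1] if x == 1 else 1 - theta[1])
        p0 = (1 - pi) * (theta[0] if x == 1 else 1 - theta[0])
        resp.append(p1 / (p1 + p0))

    # M step: expected counts replace the observed counts of equation (17),
    # giving the analogue of equation (18).
    n1 = sum(resp)                              # expected #(Y=1)
    n0 = len(data) - n1                         # expected #(Y=0)
    pi = n1 / len(data)
    theta[1] = sum(r * x for r, x in zip(resp, data)) / n1
    theta[0] = sum((1 - r) * x for r, x in zip(resp, data)) / n0

print(pi, theta)   # each combined EM step does not decrease the likelihood
```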

PARAMETER LEARNING IN UNDIRECTED GRAPHICAL MODELS

When compared to learning the parameters of directed graphical models, undirected graphical models present an additional challenge: the partition function. Even if each clique has distinct and functionally independent parameters, the partition function from (2),

$$ Z(\theta) = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C \mid \theta_C), \tag{19} $$

couples all the parameters together. We examine the effect this coupling has in the context of an undirected graphical model which has had a great deal of impact in the neural networks field: the Boltzmann machine (Ackley et al., 1985). Boltzmann machines (BMs) are undirected graphical models over a set of $d$ binary variables $S_i \in \{0, 1\}$ (see also SIMULATED ANNEALING AND BOLTZMANN MACHINES). The probability distribution over the variables in a BM is given by

$$ P(s \mid W) = \frac{1}{Z(W)} \exp\left\{ \frac{1}{2} \sum_{i=1}^{d} \sum_{j \in \mathrm{ne}(i)} W_{ij} s_i s_j \right\} = \frac{1}{Z(W)} \prod_{(ij)} \exp\{ W_{ij} s_i s_j \}. \tag{20} $$

The first equality uses standard notation for BMs, where $W$ is the symmetric matrix of weights (i.e. model parameters) and $\mathrm{ne}(i)$ is the set of neighbors of node $i$ in the BM. The second equality writes the distribution as a product of clique potentials, where $(ij)$ denotes the clique consisting of the pair of connected nodes $i$ and $j$. (Actually, in BMs the maximal cliques in the graph may be very large even though the interactions are all pairwise; because of this pairwise constraint on interactions we abuse terminology and consider the cliques to be the pairs, no matter what the graph connectivity is.)

Assuming that $S_i$ and $S_j$ are observed, and taking derivatives of the log probability of the $n$th data point with respect to $W_{ij}$,

$$ \frac{\partial \log P(s^{(n)} \mid W)}{\partial W_{ij}} = s_i^{(n)} s_j^{(n)} - \sum_{s} s_i s_j \, P(s \mid W) = \langle s_i s_j \rangle^{+} - \langle s_i s_j \rangle^{-}, \tag{21} $$

we find that it is the difference between the correlation of $S_i$ and $S_j$ in the data and the correlation of $S_i$ and $S_j$ in the model (the notation $\langle \cdot \rangle$ denotes expectation). The standard ML gradient learning rule for Boltzmann machines therefore tries to make the correlations in the model match the correlations in the data. The same learning rule applies if there are hidden variables. The second term arises from the partition function. Note that even for fully observed data, while the first term can be computed directly from the observed data, the second term depends potentially on all the parameters, underlining the fact that the partition function couples the parameters in undirected models.
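For a small, fully observed Boltzmann machine the two terms in (21) can be computed exactly by brute force. The sketch below uses three binary units with made-up weights and a made-up data set: it computes the data correlation from the data set and the model correlation by summing over all 2^3 states, which is exactly the partition-function term that becomes intractable for large networks.

```python
import itertools, math

d = 3
# Hypothetical symmetric weights (zero diagonal) and a small binary data set.
W = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.8],
     [-0.3, 0.8, 0.0]]
data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 1)]

def unnorm(s):
    """exp{ (1/2) sum_ij W_ij s_i s_j }, as in equation (20), before normalizing."""
    return math.exp(0.5 * sum(W[i][j] * s[i] * s[j]
                              for i in range(d) for j in range(d)))

states = list(itertools.product((0, 1), repeat=d))
Z = sum(unnorm(s) for s in states)               # the partition function

def model_corr(i, j):
    return sum(s[i] * s[j] * unnorm(s) for s in states) / Z

def data_corr(i, j):
    return sum(s[i] * s[j] for s in data) / len(data)

# Gradient of the average log likelihood w.r.t. W_ij, as in equation (21):
grad = {(i, j): data_corr(i, j) - model_corr(i, j)
        for i in range(d) for j in range(i + 1, d)}
print(grad)
```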

Even for fully observed data, computing the second term is nontrivial; for fully connected BMs it is intractable and needs to be approximated.

The IPF algorithm

Consider the following problem: given an undirected graphical model and an initial set of clique potentials, we wish to find the clique potentials closest to the initial potentials that satisfy a certain set of consistent marginals. Closeness of probability distributions in this context is measured using the Kullback-Leibler divergence (Cover and Thomas, 1991). The simplest example is a clique of two discrete variables, $X_i$ and $X_j$, where the clique potential is proportional to the contingency table for the joint probability of these variables, $P(X_i, X_j)$, and the marginal constraints are $P(x_i) = \hat{P}(x_i)$ and $P(x_j) = \hat{P}(x_j)$. These constraints could, for example, have come from observing data with these marginals. A very simple and intuitive iterative algorithm for trying to satisfy these constraints is to start from the initial table and satisfy each constraint in turn. For example, if we want to satisfy the marginal on $X_i$:

$$ P_k(x_i, x_j) = P_{k-1}(x_i, x_j) \, \frac{\hat{P}(x_i)}{P_{k-1}(x_i)}. \tag{22} $$

This has to be iterated over $X_i$ and $X_j$, since satisfying one marginal can change the other. This simple algorithm is known as Iterative Proportional Fitting (IPF), and it can be generalized in several ways (Darroch and Ratcliff, 1972). More generally, we wish to find a distribution of the form

$$ p(x) = \frac{1}{Z} \prod_{c} \psi_c(x_c) \tag{23} $$

which minimizes the KL divergence to some prior $p_0(x)$ and satisfies a set of constraints (the data) of the form

$$ \sum_{x} a_r(x) \, p(x) = h_r, \tag{24} $$

where $r$ indexes the constraints. If the prior is set to the uniform distribution, and the constraints are the measured marginal distributions over the variables in each of the cliques of the graph, then the problem solved by IPF is equivalent to finding the maximum likelihood clique potentials given a complete data set of observations.

IPF can be used to train an ML Boltzmann machine if $\hat{P}(S_i, S_j)$ is given by the data set for all pairs of variables connected in the BM. The procedure is to start from the uniform distribution (i.e. all weights set to 0) and then apply IPF steps to each clique potential until all marginals match those in the data. One can generalize this by starting from a non-uniform distribution, which would give a BM with minimum divergence from the starting distribution.
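The single-clique case of (22) can be run directly on a contingency table. The sketch below uses a hypothetical 2x2 initial table and hypothetical target marginals: it alternately rescales rows and columns until both marginals are matched, which is one sweep of IPF repeated to convergence.

```python
# IPF on a single clique of two binary variables, as in equation (22).
# Start from a hypothetical initial joint table and target marginals.
P = [[0.25, 0.25],
     [0.25, 0.25]]                  # P[x_i][x_j], initially uniform
target_i = [0.7, 0.3]               # desired marginal over X_i (rows)
target_j = [0.4, 0.6]               # desired marginal over X_j (columns)

for sweep in range(100):
    # Match the marginal on X_i: rescale each row.
    for a in range(2):
        row = sum(P[a])
        for b in range(2):
            P[a][b] *= target_i[a] / row
    # Match the marginal on X_j: rescale each column.  This may disturb the
    # row marginals, which is why the two updates are iterated.
    for b in range(2):
        col = P[0][b] + P[1][b]
        for a in range(2):
            P[a][b] *= target_j[b] / col

print(P)                             # rows sum to target_i, columns to target_j
print([sum(row) for row in P], [P[0][b] + P[1][b] for b in range(2)])
```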

But what if some of the variables in the BM are hidden? Byrne (1992) presents an elegant solution to this problem using ideas from Alternating Minimization (AM) and information geometry (Csiszár and Tusnády, 1984). One step of the alternating minimization computes the distribution over the hidden variables of the BM given the observed variables. This is the E step of EM, and it can also be interpreted within information geometry as finding the probability distribution which satisfies the marginal constraints and is closest to the space of probability distributions defined by Boltzmann machines. The other step of the minimization starts from $W = 0$ (the maximum entropy distribution for a BM) and uses IPF to find the BM weights which satisfy all the marginals found in the E step, $\hat{P}_{ij} = \langle s_i s_j \rangle^{+}$. This is the M step of the algorithm; a single M step thus involves an entire IPF optimization. The update rule for each step of the IPF optimization for a particular weight is

$$ W_{ij} \leftarrow W_{ij} + \log\left[ \frac{\hat{P}_{ij} \, (1 - \langle s_i s_j \rangle)}{\langle s_i s_j \rangle \, (1 - \hat{P}_{ij})} \right]. \tag{25} $$

This algorithm is conceptually interesting as it presents an alternative method for fitting BMs, with ties to IPF and AM procedures. However, it is impractical for large multiply-connected Boltzmann machines, since computing the exact unclamped correlations $\langle s_i s_j \rangle$ can take exponential time.

Although we have presented this EM-IPF algorithm for the case of Boltzmann machines, it is widely applicable to learning many undirected graphical models. In particular, in the E step, marginal distributions over the set of variables in each clique in the graph are computed, conditioned on the settings of the observed variables. A propagation algorithm such as the junction tree algorithm can be used for this step (see PROBABILISTIC INFERENCE IN GRAPHICAL MODELS). In the M step the IPF procedure is run so as to satisfy all the marginals computed in the E step. There also exist junction-tree-style propagation algorithms which exploit the structure of the graphical model to solve the IPF problem efficiently (Jiroušek and Přeučil, 1995; Teh and Welling, 2002).

BAYESIAN LEARNING OF PARAMETERS

A Bayesian approach to learning starts with some a priori knowledge about the model structure (the set of arcs in the Bayesian network) and the model parameters. This initial knowledge is represented in the form of a prior probability distribution over model structures and parameters, and is updated using the data to obtain a posterior probability distribution over models and parameters. In this article we assume that the model structure is given, and we focus on computing the posterior probability distribution over parameters (see GRAPHICAL MODELS: STRUCTURE LEARNING for solutions to the problem of inferring model structure, and also BAYESIAN METHODS FOR NEURAL NETWORKS). For a given model structure, $m$, we can compute the posterior distribution over the parameters:

$$ p(\theta \mid m, d) = \frac{p(d \mid \theta, m) \, p(\theta \mid m)}{p(d \mid m)}. \tag{26} $$
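For the simplest discrete case, a single binary variable with a Beta prior (the one-dimensional version of the Dirichlet case discussed below), the posterior (26) and the Bayesian prediction of the next observation can be written down in closed form. The sketch below uses hypothetical counts and prior hyperparameters, and compares the Bayesian predictive probability with the ML plug-in prediction.

```python
# Bayesian learning for a single binary variable X with parameter
# theta = P(X = 1) and a Beta(alpha1, alpha0) prior (hypothetical values).
alpha1, alpha0 = 1.0, 1.0                 # prior pseudo-counts
data = [1, 1, 1, 0, 1]                    # hypothetical observations
n1, n0 = sum(data), len(data) - sum(data)

# The posterior (26) is again a Beta distribution: Beta(alpha1 + n1, alpha0 + n0).
post1, post0 = alpha1 + n1, alpha0 + n0

# Predictive probability of the next observation (the Bayesian average over
# parameters discussed in the text) is the posterior mean of theta.
predictive = post1 / (post1 + post0)
ml_plug_in = n1 / (n1 + n0)               # prediction from the ML estimate

print(predictive, ml_plug_in)             # 5/7 ~= 0.714 vs 4/5 = 0.8
```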

If the data set is $d = \{x^{(1)}, \ldots, x^{(N)}\}$ and we wish to predict the next observation, $x^{(N+1)}$, based on our data and model, then the Bayesian prediction

$$ p(x^{(N+1)} \mid d, m) = \int p(x^{(N+1)} \mid \theta, m, d) \, p(\theta \mid m, d) \, d\theta \tag{27} $$

averages over the uncertainty in the model parameters. This is known as the predictive distribution for the model. In the limit of a large data set, and as long as the prior over the parameters assigns non-zero probability to the region around the maximum likelihood parameter values, the posterior $p(\theta \mid m, d)$ will be sharply peaked around the maxima of the likelihood, and therefore the predictions of a single maximum likelihood (ML) model will be similar to those obtained by Bayesian integration over the parameters. Often, however, models are fit with relatively small amounts of data, so the asymptotic results are not applicable and the predictions of the ML estimate will differ significantly from those of Bayesian averaging. In such situations it is important to compute or approximate the average over parameters in (27). For certain discrete models with Dirichlet-distributed priors over the parameters, and for certain linear-Gaussian models, it is possible to compute these integrals exactly. Otherwise, approximations can be used. A large number of approximations to the integral over the parameter distribution have been used in graphical models, including Laplace's approximation, variational approximations (Attias, 1999; Ghahramani and Beal, 2000), the expectation propagation algorithm (Minka, 2001), and a variety of MCMC methods (see Neal, 1993, for a review).

DISCUSSION

There are several key insights regarding parameter learning in graphical models. When there are no hidden variables in a directed graphical model, the graph structure determines the statistics of the data needed to learn the parameters: the joint distribution of each variable and its parents in the graph. Parameter estimation can then often proceed independently for each node. The presence of hidden variables introduces dependencies between the parameters. However, the EM algorithm transforms the problem so that in each M step the parameters of the graphical model are again uncoupled. The E step of EM fills in the hidden variables with the distribution predicted by the model, thereby turning the hidden-data problem into a complete-data problem.

The intuitive appeal of EM has led to its widespread use in models of unsupervised learning, where the goal is to learn a generative model of sensory data. The E step corresponds to perception or recognition: inferring the (hidden) state of the world from the sensory data; the M step corresponds to learning: modifying the model that relates the actual world to the sensory data. It has been suggested that top-down and bottom-up connections in cortex play the roles of generative and recognition models (see HELMHOLTZ MACHINES AND WAKE-SLEEP LEARNING). From a graphical model perspective, the bottom-up recognition model in Helmholtz machines can be thought of as a graph which approximates the distribution of the hidden variables given the observed variables, much in the same way as the variational approximation approximates that same distribution.

Undirected graphical models pose additional challenges. The partition function is usually a function of the parameters, and can introduce dependencies between the parameters even in the complete-data case. The IPF algorithm can be used to fit undirected graphical models from complete data, and an IPF-EM algorithm can be used when there is hidden data as well.

Although maximum likelihood learning is adequate when there is enough data, in general it is necessary to approximate the average over the parameters. Averaging avoids overfitting; it is hard to see how overfitting can occur when nothing is fit to the data. Non-Bayesian methods for avoiding overfitting often also involve averaging, for example via bootstrap resampling of the data. Even with parameter averaging, predictions can suffer in quality if the assumed structure of the model (the conditional independence relationships) is incorrect. To overcome this, it is necessary to generalize the approach in this chapter to learn the structure of the model as well as the parameters from data. This topic is covered in GRAPHICAL MODELS: STRUCTURE LEARNING.

REFERENCES

Ackley, D., Hinton, G., and Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9.

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence.

Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41.

Byrne, W. (1992). Alternating minimization and Boltzmann machine learning. IEEE Transactions on Neural Networks, 3(4).

Cover, T. and Thomas, J. (1991). Elements of Information Theory. Wiley, New York.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer-Verlag, New York, NY.

Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. In Dudewicz, E. J., Plachky, D., and Sen, P. K., editors, Statistics and Decisions, Supplementary Issue Number 1. Oldenbourg Verlag, Munich.

Darroch, J. N. and Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.

Ghahramani, Z. and Beal, M. J. (2000). Propagation algorithms for variational Bayesian learning. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems, volume 13, Cambridge, MA. MIT Press.

Heckerman, D. (1996). A tutorial on learning with Bayesian networks. Technical Report MSR-TR, Microsoft Research.

Jiroušek, R. and Přeučil, S. (1995). On the effective implementation of the iterative proportional fitting procedure. Computational Statistics and Data Analysis, 19.

Jordan, M. I., editor (1998). Learning in Graphical Models. Kluwer Academic Publishers.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1998). An introduction to variational methods in graphical models. In Jordan, M. I., editor, Learning in Graphical Models. Kluwer Academic Publishers.

Lauritzen, S. L. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19.

Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001), San Francisco, CA. Morgan Kaufmann Publishers.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.

Neal, R. M. and Hinton, G. E. (1998). A new view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in Graphical Models. Kluwer Academic Press.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2).

Roweis, S. T. and Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2).

Russell, S. J., Binder, J., Koller, D., and Kanazawa, K. (1995). Local learning in probabilistic models with hidden variables. In Proc. Intl. Joint Conf. on Artificial Intelligence.

Teh, Y. W. and Welling, M. (2002). The unified propagation and scaling algorithm. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14.


More information

Stochastic Variational Inference

Stochastic Variational Inference Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call

More information

Bayesian networks: approximate inference

Bayesian networks: approximate inference Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

An Empirical-Bayes Score for Discrete Bayesian Networks

An Empirical-Bayes Score for Discrete Bayesian Networks An Empirical-Bayes Score for Discrete Bayesian Networks scutari@stats.ox.ac.uk Department of Statistics September 8, 2016 Bayesian Network Structure Learning Learning a BN B = (G, Θ) from a data set D

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Variational Learning : From exponential families to multilinear systems

Variational Learning : From exponential families to multilinear systems Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

A Brief Introduction to Graphical Models. Presenter: Yijuan Lu November 12,2004

A Brief Introduction to Graphical Models. Presenter: Yijuan Lu November 12,2004 A Brief Introduction to Graphical Models Presenter: Yijuan Lu November 12,2004 References Introduction to Graphical Models, Kevin Murphy, Technical Report, May 2001 Learning in Graphical Models, Michael

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Applying Bayesian networks in the game of Minesweeper

Applying Bayesian networks in the game of Minesweeper Applying Bayesian networks in the game of Minesweeper Marta Vomlelová Faculty of Mathematics and Physics Charles University in Prague http://kti.mff.cuni.cz/~marta/ Jiří Vomlel Institute of Information

More information

Technical Details about the Expectation Maximization (EM) Algorithm

Technical Details about the Expectation Maximization (EM) Algorithm Technical Details about the Expectation Maximization (EM Algorithm Dawen Liang Columbia University dliang@ee.columbia.edu February 25, 2015 1 Introduction Maximum Lielihood Estimation (MLE is widely used

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information