Learning Conditional Probabilities from Incomplete Data: An Experimental Comparison

Marco Ramoni, Knowledge Media Institute
Paola Sebastiani, Statistics Department

Abstract

This paper compares three methods, the em algorithm, Gibbs sampling, and Bound and Collapse (bc), to estimate conditional probabilities from incomplete databases in a controlled experiment. Results show a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using bc.

Reprinted from: Proceedings of Uncertainty 99: Seventh International Workshop on Artificial Intelligence and Statistics, Morgan Kaufmann, San Mateo, CA.

Address: Marco Ramoni, Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom MK7 6AA. Phone: +44 (1908), fax: +44 (1908), e-mail: m.ramoni@open.ac.uk, url:

1. INTRODUCTION

The estimation of conditional probabilities from a database of cases plays a central role in a variety of machine learning domains and approaches, from classification to clustering, neural networks, and the learning of Bayesian Belief Networks (bbns). This paper focuses on bbns, as they have been proved general enough to encompass a variety of models, including classification models (Friedman & Goldszmidt, 1996), clustering methods (Heckerman, 1996), and feed-forward neural networks (Buntine, 1994), so that the results presented here can easily be applied to other domains.

A Bayesian Belief Network (bbn) is defined by a set of variables $X = \{X_1, \ldots, X_I\}$ and a directed acyclic graph defining a model $M$ of conditional dependencies among the elements of $X$. A conditional dependency links a child variable $X_i$ to a set of parent variables $\Pi_i$, and it is defined by the conditional distributions of $X_i$ given the configurations of the parent variables. We consider discrete variables only, denote by $c_i$ the number of states of $X_i$, and abbreviate $X_i = x_{ik}$ as $x_{ik}$. The structure $M$ yields a factorization of the joint probability of a set of values $x_k = \{x_{1k}, \ldots, x_{Ik}\}$ of the variables in $X$ as $p(X = x_k) = \prod_{i=1}^{I} p(x_{ik} \mid \pi_{ij})$, where $\pi_{ij}$ denotes the one of the $q_i$ states of $\Pi_i$ appearing in $x_k$.

Suppose we are given a database of $n$ cases $D = \{x_1, \ldots, x_n\}$ and a graphical model $M$ specifying the dependencies among the variables $X$. Our task is to estimate from $D$ the conditional probabilities $\theta_{ijk} = p(x_{ik} \mid \pi_{ij}, \theta)$ defining the dependencies in the graph, where $\theta = (\theta_{ijk})$. When the database is complete, closed-form solutions allow efficient estimation of the conditional probabilities in time proportional to the size of the database. The Maximum Likelihood estimates (mle) of the $\theta_{ijk}$ are the values $\hat\theta_{ijk}$ that maximize the likelihood function $l(\theta) = \prod_{ijk} \theta_{ijk}^{\,n(x_{ik} \mid \pi_{ij})}$, where $n(x_{ik} \mid \pi_{ij})$ is the frequency of $(x_{ik}, \pi_{ij})$ in $D$. Thus $\hat\theta_{ijk} = n(x_{ik} \mid \pi_{ij}) / n(\pi_{ij})$, the relative frequency of the relevant cases, where $n(\pi_{ij}) = \sum_k n(x_{ik} \mid \pi_{ij})$ is the frequency of $\pi_{ij}$. The Bayesian approach generalizes the mle by introducing a flattening constant $\alpha_{ijk} > 0$ for each frequency, so that the estimate is computed as
$$\hat\theta_{ijk} = \frac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) - c_i} \qquad (1)$$
where $\alpha_{ij} = \sum_k \alpha_{ijk}$. The quantity $(\alpha_{ijk} - 1)/(\alpha_{ij} - c_i)$ is interpreted as the prior probability believed by the learning system before seeing any data, and the learning process is regarded as the updating of this prior probability by means of Bayes' theorem. When the prior distribution of $(\theta_{ij1}, \ldots, \theta_{ijc_i})$ is Dirichlet with hyperparameters $(\alpha_{ij1}, \ldots, \alpha_{ijc_i})$, then (1) is the posterior mode, the Maximum a Posteriori (map) estimate, of $\theta_{ijk}$.
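
To make (1) concrete, here is a minimal sketch, not taken from the paper, that computes the map estimate of one conditional probability table from a complete database of counts; the function name, variable names and toy counts are hypothetical choices of this illustration.

```python
import numpy as np

def map_estimate(counts, alpha):
    """MAP estimate (1) of a conditional probability table.

    counts[j, k] = n(x_ik | pi_ij): frequency of child state k under parent
    configuration j in a *complete* database.
    alpha[j, k]  = Dirichlet hyperparameter alpha_ijk (> 0).
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    c_i = counts.shape[1]                       # number of child states
    num = alpha + counts - 1.0
    den = alpha.sum(axis=1, keepdims=True) + counts.sum(axis=1, keepdims=True) - c_i
    return num / den

# Example: a binary child with two parent configurations.
counts = [[30, 10],
          [ 5, 15]]
alpha = [[2, 2],
         [2, 2]]
print(map_estimate(counts, alpha))   # each row sums to 1
```

With $\alpha_{ijk} = 1$ the same function returns the relative frequencies, i.e. the mle.
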
Unfortunately, the simplicity and efficiency of these closed-form solutions are lost when the database is incomplete, that is, when some entries are reported as unknown. In this case, the exact estimate is a mixture of the estimates given by (1) for each database generated by combining all possible values of the missing entries, and the computational cost of this operation grows exponentially in the number of missing data. Simple solutions to this problem are either to ignore the cases including missing entries or to ascribe the missing entries to an ad hoc dummy state. Both solutions introduce potentially dangerous bias in the estimate, as they ignore the fact that the missing entries may be relevant to the estimation of $\theta$.
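
The exponential blow-up can be seen with a small brute-force sketch, again not from the paper, with toy data chosen here: for a single discrete variable with $m$ missing entries, every completion of the database yields its own estimate (1), and the number of completions grows as $c^m$. The minimum and maximum over these estimates already hint at the interval that bc, described below, computes without enumeration.

```python
from itertools import product

def map_est(counts, alpha):
    """Equation (1) for a single distribution (no parents)."""
    n, a, c = sum(counts), sum(alpha), len(counts)
    return [(alpha[k] + counts[k] - 1) / (a + n - c) for k in range(c)]

# Toy: a binary variable observed 8 times, with 3 further entries missing.
observed = [1, 1, 2, 1, 2, 1, 1, 2]
n_missing = 3
states, alpha = [1, 2], [2, 2]

estimates = []
for completion in product(states, repeat=n_missing):   # 2**3 completed databases
    data = observed + list(completion)
    counts = [data.count(s) for s in states]
    estimates.append(map_est(counts, alpha)[0])         # estimate of p(X = 1)

print(len(estimates))                  # grows as len(states) ** n_missing
print(min(estimates), max(estimates))  # every completed-database estimate lies here
```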

Current methods rely on more sophisticated techniques, such as the em algorithm (Dempster, Laird, & Rubin, 1977) and Gibbs Sampling (gs) (Geman & Geman, 1984). Both em and gs provide reliable estimates of the parameters, and they are currently regarded as the most viable solutions to the problem of missing data. However, both of these iterative methods can be trapped in local minima, and detecting their convergence can be difficult. Furthermore, they rely on the assumption that data are missing at random (mar): within each parent configuration, the available data are a representative sample of the complete database, and the distribution of the missing data can therefore be inferred from the available entries (Little & Rubin, 1987). When this assumption fails, and the missing data mechanism is not ignorable (ni), the accuracy of these methods can dramatically decrease. Finally, the computational cost of these methods heavily depends on the absolute number of missing data, and this can prevent their scalability to large databases.

This situation motivated the recent development of a deterministic method, called Bound and Collapse (bc) (Ramoni & Sebastiani, 1998), to estimate conditional probability distributions from possibly incomplete databases. bc was proved to be extremely scalable, as the complexity of the algorithm does not depend on the number of missing data. This paper presents an experimental comparison of em, gs and bc on a real-world database. The aim of this comparison is to evaluate whether the significant computational advantage offered by bc reduces, in practice, the accuracy of the estimation.

2. METHODS

This paper compares the accuracy and the efficiency of em, gs, and bc. This section outlines the character of these three methods.

The em algorithm is an iterative method to compute mle and map estimates. It alternates an expectation step, in which unknown quantities depending on the missing entries are replaced by their expectations in the likelihood, with a maximization step, in which the likelihood completed in the expectation step is maximized with respect to the unknown parameters; the resulting estimates are used to replace the unknown quantities in the next expectation step, until the difference between successive estimates is smaller than a fixed threshold. The convergence rate of this process can be "painfully slow" (Thiesson, 1995), and several modifications have been proposed to increase its speed under certain circumstances.

gs is one of the most popular Markov Chain Monte Carlo methods for Bayesian inference. The algorithm treats each missing entry as a parameter to be estimated and iterates a stochastic process that provides a sample from the posterior distribution of $\theta$. This sample is used to compute empirical estimates of the posterior mean, the posterior mode, or any other function of the parameters. In practical applications, the algorithm is first iterated a number of times to reach stability (the burn-in), and then a final sample from the joint posterior distribution of the parameters is taken.
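As an illustration of the two steps just described, the following sketch, not taken from the paper, runs em on a single conditional distribution when only child entries are missing and the parent configuration is always observed; a full implementation would work on the whole network and handle arbitrary missing entries, and the names below are hypothetical.

```python
import numpy as np

def em_cpt(obs_counts, miss_counts, alpha, tol=1e-4, max_iter=500):
    """EM for one conditional probability table with missing child values only.

    obs_counts[j, k] = n(x_ik | pi_ij): fully observed cases.
    miss_counts[j]   = n(? | pi_ij): cases with the child entry missing.
    alpha[j, k]      = Dirichlet hyperparameters.
    Alternates an E-step (distribute missing cases by the current theta)
    and an M-step (MAP estimate (1) on the expected counts) until the
    change in theta falls below tol.
    """
    obs = np.asarray(obs_counts, float)
    miss = np.asarray(miss_counts, float)[:, None]
    alpha = np.asarray(alpha, float)
    c_i = obs.shape[1]
    theta = np.full_like(obs, 1.0 / c_i)            # uniform starting point
    for _ in range(max_iter):
        # E-step: expected counts = observed + missing cases spread by theta.
        completed = obs + miss * theta
        # M-step: MAP estimate (1) on the expected counts.
        new = (alpha + completed - 1.0) / (
            alpha.sum(axis=1, keepdims=True)
            + completed.sum(axis=1, keepdims=True) - c_i)
        if np.abs(new - theta).max() < tol:
            return new
        theta = new
    return theta

print(em_cpt([[30, 10], [5, 15]], [5, 4], [[2, 2], [2, 2]]))
```
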
Both em and gs provide reliable estimates of the parameters, and they are currently regarded as the most viable solutions to the problem of missing data. However, both of these iterative methods can be trapped in local minima, and detecting their convergence can be difficult. Furthermore, they rely on the assumption that the missing data mechanism is ignorable: within each observed parent configuration, the available data are a representative sample of the complete database, and the distribution of the missing data can therefore be inferred from the available entries (Little & Rubin, 1987). When this assumption fails, and the missing data mechanism is not ignorable (ni), the accuracy of these methods can dramatically decrease. Finally, the computational cost of these methods heavily depends on the absolute number of missing data, and this can prevent their scalability to large databases.

These limitations motivated the development of bc, a deterministic method to estimate conditional probabilities from an incomplete database. The method bounds the set of possible estimates consistent with the available information by computing the minimum and the maximum estimate that would be obtained from all possible completions of the database. This process returns probability intervals containing all possible estimates consistent with the available information. These bounds are then collapsed into a unique value via a convex combination of the extreme points, with weights depending on the assumed pattern of missing data. The intuition behind bc is that an incomplete database is still able to constrain the possible estimates within a set, and that the assumed pattern of missing data, encoded as probabilities of missing data, can be used to select a point estimate within the set of possible ones.

Let $X_i$ be a variable in $X$ with parent variables $\Pi_i$. Denote by $n(?\mid\pi_{ij})$ the frequency of cases in which only the entry on the child variable is missing, by $n(x_{ik}\mid ?)$ the frequency of cases in which only the parent configuration is unknown but can be completed as $\pi_{ij}$, and by $n(?\mid ?)$ the frequency of cases in which the entries on both $X_i$ and $\Pi_i$ are unknown and can be completed as $(x_{ik}, \pi_{ij})$.

Define the virtual frequencies
$$\overline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij}) + n(x_{ik}\mid ?) + n(?\mid ?),$$
$$\underline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij}) + \sum_{h \neq k}^{c_i} n(x_{ih}\mid ?) + n(?\mid ?).$$
Thus $\overline{n}(x_{ik}\mid\pi_{ij})$ is the maximum achievable frequency of $(x_{ik}, \pi_{ij})$ in the incomplete database, and it is used to compute the maximum of $p(x_{ik}\mid\pi_{ij})$. The virtual frequency $\underline{n}(x_{ik}\mid\pi_{ij})$ is used to compute the minimum estimate of $p(x_{ik}\mid\pi_{ij})$. It can easily be shown that the estimate of $p(x_{ik}\mid\pi_{ij})$ is bounded above by
$$p^{\bullet}(x_{ik}\mid\pi_{ij}, D) = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) + \overline{n}(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \overline{n}(x_{ik}\mid\pi_{ij}) - c_i}$$
and below by
$$p^{\circ}(x_{ik}\mid\pi_{ij}, D) = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \underline{n}(x_{ik}\mid\pi_{ij}) - c_i}.$$

The minimum estimate of $\theta_{ijk}$ is achieved when the missing data mechanism is such that, with probability 1, all cases $(X_i = ?, \pi_{ij})$ and $(X_i = ?, \Pi_i = ?)$ are completed as $(x_{ih}, \pi_{ij})$ for $h \neq k$, all cases $(X_i = x_{ih}, \Pi_i = ?)$ with $h \neq k$ are completed as $(x_{ih}, \pi_{ij})$, and the cases $(X_i = x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij'})$ for some $j' \neq j$. Thus $\underline{n}(x_{ik}\mid\pi_{ij})$ is the total virtual frequency of cases with $\Pi_i = \pi_{ij}$ and $X_i = x_{ih}$, $h \neq k$, while the frequency of $(x_{ik}, \pi_{ij})$ is not augmented. The maximum estimate of $\theta_{ijk}$ is achieved when the missing data mechanism is such that, with probability 1, all cases $(X_i = ?, \pi_{ij})$, $(X_i = ?, \Pi_i = ?)$ and $(X_i = x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij})$, while, for $h \neq k$, the cases $(X_i = x_{ih}, \Pi_i = ?)$ are completed as $(x_{ih}, \pi_{ij'})$ for some $j' \neq j$. Thus $\overline{n}(x_{ik}\mid\pi_{ij})$ is the virtual frequency of $(x_{ik}, \pi_{ij})$, and hence of $\pi_{ij}$, while the frequencies of the other states with the same parent configuration are not augmented.

This probability interval contains all and only the possible estimates consistent with $D$; it is therefore sound, and it is the tightest estimable interval. The width of each interval is a function of the virtual frequencies, so that it accounts for the amount of information available in $D$ about the parameter to be estimated, and it represents an explicit measure of the quality of the probabilistic information conveyed by the database about that parameter.
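
The bound step can be sketched as follows for a single parent configuration; this is not code from the paper, and the function and argument names are hypothetical. Given the observed counts and the three kinds of incomplete cases, it returns the interval $[p^{\circ}, p^{\bullet}]$ for every state of the child variable.

```python
import numpy as np

def bound_step(n_obs, n_child_missing, n_parent_missing, n_both_missing, alpha):
    """Bound step of bc for one parent configuration pi_ij.

    n_obs[k]            = n(x_ik | pi_ij): fully observed cases.
    n_child_missing     = n(? | pi_ij): child missing, parent config observed.
    n_parent_missing[k] = n(x_ik | ?): child observed, parent config unknown
                          but completable as pi_ij.
    n_both_missing      = n(? | ?): both unknown, completable as (x_ik, pi_ij).
    Returns (p_low, p_high), the bounds for each child state.
    """
    n_obs = np.asarray(n_obs, float)
    n_par = np.asarray(n_parent_missing, float)
    alpha = np.asarray(alpha, float)
    c_i = len(n_obs)
    n_ij, a_ij = n_obs.sum(), alpha.sum()

    # Virtual frequencies: maximal and minimal completions of (x_ik, pi_ij).
    n_max = n_child_missing + n_par + n_both_missing                    # overline n
    n_min = n_child_missing + (n_par.sum() - n_par) + n_both_missing    # underline n

    p_high = (alpha + n_obs + n_max - 1.0) / (a_ij + n_ij + n_max - c_i)
    p_low = (alpha + n_obs - 1.0) / (a_ij + n_ij + n_min - c_i)
    return p_low, p_high

low, high = bound_step([30, 10], 5, [3, 2], 1, [2, 2])
print(low, high)   # every estimate consistent with D lies in [low, high]
```
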
The second step of bc collapses the intervals estimated in the bound step into point estimates, using a convex combination of the extreme estimates with weights depending on the assumed pattern of missing data. We assume that information about the missing data is encoded as a probability distribution describing, for each variable in the database, the probability of each completion: $p(x_{ik}\mid\pi_{ij}, X_i = ?) = \lambda_{ijk}$, with $k = 1, \ldots, c_i$ and $\sum_k \lambda_{ijk} = 1$. Note that this is only part of the information required about the distribution of the missing data. We have seen that incomplete cases can be of the forms $(x_{ik}, \Pi_i = ?)$, $(X_i = ?, \pi_{ij})$ or $(X_i = ?, \Pi_i = ?)$; a full description of the pattern of missing data would therefore require the probabilities $p(\pi_{ij}\mid\Pi_i = ?)$ and $p(x_{ik}\mid\pi_{ij}, \Pi_i = ?)$ as well as $\lambda_{ijk}$. We will show that the probabilities $\lambda_{ijk}$ alone can be used to obtain accurate estimates of $\theta_{ijk}$, provided we exclude, among the possible patterns of missing data, the extreme mechanisms that yield the lower bounds $p^{\circ}(x_{ik}\mid\pi_{ij}, D)$. In order to limit the amount of information required about the process underlying the missing data, we assume that if all incomplete cases $(x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij})$, then, for all $h \neq k$, the cases $(x_{ih}, \Pi_i = ?)$ cannot be completed as $(x_{ih}, \pi_{ij})$. This assumption allows us to derive new local lower bounds from the maximum probabilities, as follows.

Each maximum probability $p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$ is obtained when all incomplete cases that could be completed as $(x_{ik}, \pi_{ij})$ are attributed to $(x_{ik}, \pi_{ij})$, while the observed frequencies of the other states of $X_i$ given $\pi_{ij}$ are not augmented. Thus, when $p(x_{ik}\mid\pi_{ij}) = p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$, we have $p(x_{il}\mid\pi_{ij}) = p^{k}(x_{il}\mid\pi_{ij}, D)$ for $l \neq k$, where
$$p^{k}(x_{il}\mid\pi_{ij}, D) = \frac{\alpha_{ijl} + n(x_{il}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \overline{n}(x_{ik}\mid\pi_{ij}) - c_i}.$$
The maximum probabilities therefore induce $c_i$ probability distributions $\{p^{\bullet}(x_{ik}\mid\pi_{ij}, D), p^{k}(x_{ih}\mid\pi_{ij}, D), h \neq k\}$, for $k = 1, \ldots, c_i$. Denote by $\{p^{h}(x_{ik}\mid\pi_{ij}, D)\}$, $h \neq k$, the set of $c_i - 1$ local minima of $\hat\theta_{ijk}$, and let $p_{l}(x_{ik}\mid\pi_{ij}, D) = \min_h p^{h}(x_{ik}\mid\pi_{ij}, D)$. The distribution of the missing entries, in terms of $\lambda_{ijk}$, can now be used to identify a point estimate $\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk})$ within the interval $[p_{l}(x_{ik}\mid\pi_{ij}, D), p^{\bullet}(x_{ik}\mid\pi_{ij}, D)]$ as
$$\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk}) = \sum_{l \neq k} \lambda_{ijl}\, p^{l}(x_{ik}\mid\pi_{ij}, D) + \lambda_{ijk}\, p^{\bullet}(x_{ik}\mid\pi_{ij}, D). \qquad (2)$$
These estimates define a probability distribution, since $\sum_{k=1}^{c_i} \hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk}) = \sum_{k=1}^{c_i} \lambda_{ijk} = 1$.
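
The collapse step can then be sketched as follows, again with hypothetical code and names rather than the paper's implementation: it builds the $c_i$ extreme distributions induced by the maximum probabilities and returns the convex combination (2) for a given vector of completion probabilities.

```python
import numpy as np

def collapse_step(n_obs, n_child_missing, n_parent_missing, n_both_missing,
                  alpha, lam):
    """Collapse step of bc for one parent configuration.

    lam[k] = lambda_ijk = p(x_ik | pi_ij, X_i = ?), with sum(lam) = 1.
    Returns the point estimates (2), one per child state.
    """
    n_obs = np.asarray(n_obs, float)
    n_par = np.asarray(n_parent_missing, float)
    alpha = np.asarray(alpha, float)
    lam = np.asarray(lam, float)
    c_i = len(n_obs)
    n_ij, a_ij = n_obs.sum(), alpha.sum()
    n_max = n_child_missing + n_par + n_both_missing    # overline n(x_ik | pi_ij)

    # p_local[l, k]: estimate of p(x_ik | pi_ij) when state l takes its maximum.
    p_local = np.empty((c_i, c_i))
    for l in range(c_i):
        den = a_ij + n_ij + n_max[l] - c_i
        p_local[l] = (alpha + n_obs - 1.0) / den                      # p^l(x_ik)
        p_local[l, l] = (alpha[l] + n_obs[l] + n_max[l] - 1.0) / den  # p_bullet(x_il)

    # Equation (2): convex combination of the extreme estimates with weights lam.
    return lam @ p_local

est = collapse_step([30, 10], 5, [3, 2], 1, [2, 2], lam=[0.5, 0.5])
print(est, est.sum())   # a probability distribution (sums to 1)
```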

The intuition behind (2) is that the upper bound for $p(x_{ik}\mid\pi_{ij})$ is obtained when all incomplete cases are completed as $(x_{ik}, \pi_{ij})$. Thus, if $\lambda_{ijk} = p(x_{ik}\mid\pi_{ij}, X_i = ?) = 1$ for a particular $k$, then (2) returns the upper bound of the probability interval as the estimate of $p(x_{ik}\mid\pi_{ij})$, and $p^{k}(x_{ih}\mid\pi_{ij}, D)$ as the estimates of $p(x_{ih}\mid\pi_{ij})$, $h \neq k$. This corresponds to the assumption that data are systematically missing on $x_{ik}$, i.e. these entries are missing with probability 1. On the other hand, when no information on the mechanism generating the missing data is available, and therefore all patterns of missing data are equally likely, then $\lambda_{ijk} = 1/c_i$. As the number of missing entries decreases, $p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$ and $p^{h}(x_{ik}\mid\pi_{ij}, D)$ approach $(\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1)/(\alpha_{ij} + n(\pi_{ij}) - c_i)$, so that, when the database is complete, (2) returns the exact estimate $\hat\theta_{ijk}$. As the number of missing entries increases, $p^{h}(x_{ik}\mid\pi_{ij}, D) \to 0$ for all $h$ and $p^{\bullet}(x_{ik}\mid\pi_{ij}, D) \to 1$, so that the estimate (2) approaches the probability $\lambda_{ijk}$, and, coherently, nothing is learned from a database in which all entries on $(X_i, \pi_{ij})$ are missing. Finally, if $\overline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij})$, then (2) simplifies to
$$\frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) + n(?\mid\pi_{ij})\,\lambda_{ijk} - 1}{\alpha_{ij} + n(\pi_{ij}) + n(?\mid\pi_{ij}) - c_i} \qquad (3)$$
which is the expected map estimate, given the assumed pattern of missing data.

When data are mar, the incomplete samples within parent configurations are representative of the complete but unknown ones, so that the probability $\lambda_{ijk}$ can be estimated from the available data as
$$\hat\lambda_{ijk} = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) - c_i}. \qquad (4)$$
Then $\hat\lambda_{ijk}$ can be used to compute (2). Thus, when the unreported data are mar, bc estimates are corrections of the estimates computed from the observed data. In particular, if $\overline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij})$, then (3) reduces to (1), and to the standard mle when $\alpha_{ijk} = 1$ (Little & Rubin, 1987). The estimates computed by (2) with $\lambda_{ijk}$ replaced by $\hat\lambda_{ijk}$ are the expected estimates and, as the database grows, they approach the exact ones, as the gs and em estimates do. The advantage of bc is that, as a deterministic method, it does not pose any problem of convergence rate and detection, and it reduces the cost of estimating the conditional distribution of $X_i$ given $\pi_{ij}$ to the cost of one exact Bayesian updating and one convex combination for each state of $X_i$.

Since we consider only the extreme estimates induced by the maximum probabilities, the information needed in the collapse step is limited to the probabilities $\lambda_{ijk}$, and there is no need to specify probabilities of completions for cases in which the parent configuration is unknown. This limited amount of information about the process underlying the missing data implies that $\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk}) \in [p_{l}(x_{ik}\mid\pi_{ij}, D), p^{\bullet}(x_{ik}\mid\pi_{ij}, D)]$. Thus $\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk})$ cannot be equal to $p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ unless $p_{l}(x_{ik}\mid\pi_{ij}, D) = p^{\circ}(x_{ik}\mid\pi_{ij}, D)$. However,
$$p_{l}(x_{ik}\mid\pi_{ij}, D) = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \max_{h \neq k} \overline{n}(x_{ih}\mid\pi_{ij}) - c_i},$$
which shows that the difference between $p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ and $p_{l}(x_{ik}\mid\pi_{ij}, D)$ depends only on those cases in which the state of the child variable is known and the parent configuration is not. Note that $p_{l}(x_{ik}\mid\pi_{ij}, D) = p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ if there is no missing data mechanism consistent with $\underline{n}(x_{ik}\mid\pi_{ij})$, which is the assumption made above. The difference $p_{l}(x_{ik}\mid\pi_{ij}, D) - p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ is positive; it is easy to show that it is maximized when $n(x_{ik}\mid ?) = m$ for all $k$, so that it approaches 0 at the same rate as $m$, and it becomes negligible as the database increases. Furthermore, if we happen to know that, for some missing data mechanism, $\hat\theta_{ijk} = p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$ for some $k$, then, for all $h \neq k$ and all $j$, $p_{l}(x_{ih}\mid\pi_{ij}, D) < p(x_{ih}\mid\pi_{ij}) < p^{\bullet}(x_{ih}\mid\pi_{ij}, D)$, so that all the other estimates are bounded below by $p_{l}(x_{ih}\mid\pi_{ij}, D)$. Thus, any error would be limited to at most one conditional probability for each parent-child configuration.
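
When the data are mar and only child entries are missing, the claim that (3), with $\lambda_{ijk}$ estimated by (4), reduces to (1) can be checked numerically. The short sketch below, not from the paper, does this for one parent configuration with toy counts chosen here.

```python
# Numerical check: with lambda estimated by (4), the bc estimate (3) coincides
# with the complete-data estimate (1) when only child entries are missing,
# so that overline n(x_ik | pi_ij) = n(? | pi_ij).
import numpy as np

alpha = np.array([2.0, 2.0])          # Dirichlet hyperparameters alpha_ijk
n_obs = np.array([30.0, 10.0])        # n(x_ik | pi_ij), observed cases
n_miss = 5.0                          # n(? | pi_ij), cases with the child missing
c_i, a_ij, n_ij = len(n_obs), alpha.sum(), n_obs.sum()

map_est = (alpha + n_obs - 1.0) / (a_ij + n_ij - c_i)    # equation (1)
lam_hat = (alpha + n_obs - 1.0) / (a_ij + n_ij - c_i)    # equation (4)
bc_est = (alpha + n_obs + n_miss * lam_hat - 1.0) / (a_ij + n_ij + n_miss - c_i)  # (3)

print(map_est, bc_est)                # identical up to rounding
assert np.allclose(map_est, bc_est)
```
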
3. EXPERIMENTAL EVALUATION

The aim of these experiments is to evaluate the estimation accuracy of bc, the em algorithm and gs as the available information in the database decreases. We used a database reporting the values of six risk factors for coronary disease in 1841 employees of a Czechoslovakian car factory during a follow-up study (Whittaker, 1990, page 261). All variables are binary, and they represent Anamnesis (X1), Strenuous Mental Work (X2), Ratio of Beta and Alpha Lipoproteins (X3), Strenuous Physical Work (X4), Smoking (X5) and Systolic Blood Pressure (X6). The database is complete, and we extracted the most probable directed graphical model, depicted in Figure 1, using the K2 algorithm (Cooper & Herskovits, 1992). The 15 conditional dependencies specified by this bbn were then quantified from the complete database. Uniform prior distributions were assumed for the parameters.

Figure 1: The bbn used for the evaluation.

We then used the bbn in Figure 1 to run two different learning tests in order to evaluate the accuracy of bc relative to gs and em, using the implementation of accelerated em in games (Thiesson, 1995), the implementation of gs in bugs (Thomas, Spiegelhalter, & Gilks, 1992), and the implementation of bc in BKD. The convergence threshold for em was $10^{-4}$. Results of gs are based on a first burn-in of 5,000 iterations, sufficient to reach stability, and a successive sample of 5,000 cases.

The aim of the first test was to compare the estimation accuracy of the three methods when the missing data mechanism is ignorable.

For this purpose, four incomplete databases were created by incrementally deleting data from the complete database. A vector of 15 numbers between 0 and 1 was randomly generated, and its elements were taken as the probabilities of deleting the occurrences of each variable $X_i$, independently of its value, given the parent configuration, in 10%, 20%, 30% and 40% of the database. This process generated four databases with 9% (1004), 18% (2035), 28% (3041) and 37% (4092) missing entries, in which the observed entries within parent configurations are representative of the complete samples.

The rationale of the second test was to compare the robustness of these methods under a non-ignorable missing data mechanism. Four samples were generated from the complete database by incrementally deleting, respectively, 25%, 50%, 75% and 100% of the entries $(X5 = 2, X6 = 1)$ with probability 0.9 and of the entries $(X5 = 2, X6 = 2)$ with probability 0.1. This process generated four databases with 3% (278), 5% (532), 7% (790) and 9% (1030) missing entries.
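
A simplified sketch of how an ignorable deletion process of this kind can be reproduced follows; this is not the paper's procedure (which deletes per conditional dependency and in increasing fractions of the database), and the names and deletion rates are hypothetical. Entries are deleted independently of their values, so the observed entries remain representative.

```python
import numpy as np

rng = np.random.default_rng(0)

def delete_ignorable(data, del_prob):
    """Return a copy of `data` (cases x variables) in which entry (case, i)
    is replaced by None (missing) with probability del_prob[i],
    independently of its value."""
    out = np.asarray(data, dtype=object).copy()
    mask = rng.random(out.shape) < np.asarray(del_prob)
    out[mask] = None
    return out

# Toy usage: 1841 cases of six binary variables, 10% deletion per variable.
complete = rng.integers(1, 3, size=(1841, 6))
incomplete = delete_ignorable(complete, del_prob=[0.1] * 6)
print(sum(v is None for v in incomplete.ravel()) / incomplete.size)  # fraction missing
```
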
The estimation accuracy is measured by comparing the exact joint probability distribution of $(X1, \ldots, X6)$ with those learned from the incomplete data using gs, em and bc. Table 1 reports summary statistics of the results obtained on the two extreme databases generated in each test. The first five rows of Table 1 report the minimum, first quartile, median, third quartile and maximum of the absolute differences between the 64 exact joint probabilities and those obtained from the bbns learned with gs, em and bc; the row labelled X-Ent. reports the cross entropy between the joint probability distributions.

The predictive accuracy was evaluated by comparing the predictive probabilities of X6 obtained by the three methods with those computed by the bbn extracted from the complete database. Given the conditional independence assumptions embedded in the bbn, 43 possible evidence sets can be considered: 10 are given by observing one of the variables X1, ..., X5 alone, 32 by observing one of the pairs (X1, X3), (X1, X4), (X1, X5), (X2, X3), (X2, X4), (X2, X5), (X3, X4), (X3, X5), and one is the empty evidence set, for which the marginal probability of X6 is returned. Bounds on the predictive probabilities were also computed by propagating the bc upper bounds for each conditional dependency. Summary statistics of the absolute errors are given in the second half of Table 1.

Table 1: Summary statistics of estimation and prediction errors (minimum, first quartile, median, third quartile, maximum, and, for the estimation errors, cross entropy) for gs, em and bc on the ignorable databases with 9% and 37% missing entries and on the non-ignorable databases with 3% and 9% missing entries.
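
The accuracy measures above can be computed as in this sketch, which is not the authors' code and takes "cross entropy" to mean $-\sum_x p(x)\log q(x)$, a definition the paper does not spell out: it summarizes the absolute differences between the exact and the learned joint distributions and reports their cross entropy.

```python
import numpy as np

def error_summary(p_exact, p_learned):
    """Summary statistics in the style of Table 1: quantiles of the absolute
    differences between two joint distributions, plus the cross entropy
    -sum_x p_exact(x) * log p_learned(x)."""
    p_exact = np.asarray(p_exact, float)
    p_learned = np.asarray(p_learned, float)
    abs_err = np.abs(p_exact - p_learned)
    q = np.quantile(abs_err, [0.0, 0.25, 0.5, 0.75, 1.0])
    x_ent = -np.sum(p_exact * np.log(p_learned))
    return dict(zip(["min", "1st_qu", "median", "3rd_qu", "max"], q)), x_ent

# Toy usage with two distributions over the 64 joint states of six binary variables.
rng = np.random.default_rng(1)
p_true = rng.dirichlet(np.ones(64))
p_est = rng.dirichlet(np.ones(64))
print(error_summary(p_true, p_est))
```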

When 9% of the data are missing and the missing data mechanism is ignorable, the distributions of the errors incurred by the three methods differ only at the fourth decimal place, and a slightly higher maximum value for bc is balanced by a lower median. The difference becomes a little more evident when 37% of the data are missing. Nonetheless, the median difference remains small and, most of all, these differences do not affect the predictive power of the bbn extracted by bc, which, if anything, scores a slightly better median performance, by a margin larger than any difference among the estimates. This overall equivalence among the three methods is confirmed when the missing data are ni: the error distributions are almost identical, and the impact of a wrong assumption about the pattern of missing data affects all methods equally. However, bc provides bounds on the predicted values which reflect their reliability and can be taken into account during the reasoning process. It must also be remarked that bc does not rely per se on the ignorability assumption, and that the available information about the missing data could have been exploited by bc to achieve a better performance.

The main difference among the three methods highlighted by the experiments is the execution time and, most of all, the shape of its growth curve, which shows that the execution time of bc is independent of the number of missing data (Ramoni & Sebastiani, 1998). Table 2 shows the execution time for all eight databases generated during the experiments. The first column reports the absolute number of missing entries deleted from the complete database. Results are in hours:minutes:seconds format.

Table 2: Execution time for each method on the eight incomplete databases (hours:minutes:seconds).

4. CONCLUSIONS

The first character of bc is its ability to represent explicitly and separately the information available in the database and the assumed pattern of missing data, so that bc does not need to rely on the ignorability assumption, although ignorability can easily be represented, as can any other pattern. Furthermore, bc reduces the cost of estimating each conditional distribution of $X_i$ to the cost of one exact Bayesian updating and one convex combination for each state of $X_i$ in each parent configuration. This deterministic process does not pose any problem of convergence rate and detection, and its computational complexity is independent of the number of missing data. The experimental comparison with em and gs showed a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using bc.

References

Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225.

Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian networks. In Proceedings of the National Conference on Artificial Intelligence, pp. 1277-1284, San Mateo, CA. Morgan Kaufmann.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Heckerman, D. (1996). Bayesian networks for knowledge discovery. In Advances in Knowledge Discovery and Data Mining, pp. 153-180. MIT Press, Cambridge, MA.

Little, R., & Rubin, D. (1987). Statistical Analysis with Missing Data. Wiley, New York.

Ramoni, M., & Sebastiani, P. (1998). Parameter estimation in Bayesian networks from incomplete databases. Intelligent Data Analysis, 2(1).

Thiesson, B. (1995). Accelerated quantification of Bayesian networks with incomplete data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pp. 306-311. AAAI Press.

Thomas, A., Spiegelhalter, D., & Gilks, W. (1992). BUGS: A program to perform Bayesian inference using Gibbs sampling. In Bayesian Statistics 4, pp. 837-842. Clarendon Press, Oxford.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, New York.
