Learning Conditional Probabilities from Incomplete Data: An Experimental Comparison

Marco Ramoni, Knowledge Media Institute
Paola Sebastiani, Statistics Department
Abstract

This paper compares three methods, the EM algorithm, Gibbs sampling, and Bound and Collapse (BC), to estimate conditional probabilities from incomplete databases in a controlled experiment. Results show a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using BC.

Reprinted from: Proceedings of Uncertainty 99: Seventh International Workshop on Artificial Intelligence and Statistics, Morgan Kaufmann, San Mateo, CA.

Address: Marco Ramoni, Knowledge Media Institute, Milton Keynes, United Kingdom MK7 6AA. phone: +44 (1908) , fax: +44 (1908) , m.ramoni@open.ac.uk, url:
1. INTRODUCTION

The estimation of conditional probabilities from a database of cases plays a central role in a variety of machine learning domains and approaches, from classification to clustering, neural networks, and the learning of Bayesian Belief Networks (BBNs). This paper focuses on BBNs, as they have proved general enough to encompass a variety of models, including classification models (Friedman & Goldszmidt, 1996), clustering methods (Heckerman, 1996), and feed-forward neural networks (Buntine, 1994), so that the results presented here can easily be applied to other domains. A Bayesian Belief Network (BBN) is defined by a set of variables $X = \{X_1, \ldots, X_I\}$ and a directed acyclic graph defining a model $M$ of conditional dependencies among the elements of $X$. A conditional dependency links a child variable $X_i$ to a set of parent variables $\Pi_i$, and it is defined by the conditional distributions of $X_i$ given the configurations of the parent variables. We consider discrete variables only, denote by $c_i$ the number of states of $X_i$, and abbreviate $X_i = x_{ik}$ as $x_{ik}$. The structure $M$ yields a factorization of the joint probability of a set of values $x_k = \{x_{1k}, \ldots, x_{Ik}\}$ of the variables in $X$ as $p(X = x_k) = \prod_{i=1}^{I} p(x_{ik} \mid \pi_{ij})$, where $\pi_{ij}$ denotes one of the $q_i$ states of $\Pi_i$ appearing in $x_k$. Suppose we are given a database of $n$ cases $D = \{x_1, \ldots, x_n\}$ and a graphical model $M$ specifying the dependencies among the variables in $X$.
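To make the factorization concrete, the product above can be sketched for a toy two-variable network (an illustration, not from the paper; the network, its probabilities and the helper names are assumptions):

```python
# Toy sketch of the BBN factorization p(X = x) = prod_i p(x_ik | pi_ij).
# The two-node network X1 -> X2 and all numbers are illustrative assumptions.

# cpts[var] maps a parent configuration (tuple of parent states) to the
# conditional distribution over the states of var.
cpts = {
    "X1": {(): {0: 0.6, 1: 0.4}},
    "X2": {(0,): {0: 0.9, 1: 0.1},
           (1,): {0: 0.3, 1: 0.7}},
}
parents = {"X1": (), "X2": ("X1",)}

def joint(assignment):
    """Joint probability of a full assignment, by the chain-rule factorization."""
    p = 1.0
    for var, cpt in cpts.items():
        config = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[config][assignment[var]]
    return p

print(joint({"X1": 0, "X2": 1}))  # p(X1=0) * p(X2=1 | X1=0) = 0.6 * 0.1
```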
Our task is to estimate, from $D$, the conditional probabilities $\theta_{ijk} = p(x_{ik} \mid \pi_{ij}, \theta)$ defining the dependencies in the graph, where $\theta = (\theta_{ijk})$. When the database is complete, closed form solutions allow efficient estimation of the conditional probabilities in time proportional to the size of the database. The Maximum Likelihood estimates (MLE) of the $\theta_{ijk}$ are the values $\hat\theta_{ijk}$ that maximize the likelihood function $l(\theta) = \prod_{ijk} \theta_{ijk}^{n(x_{ik} \mid \pi_{ij})}$, where $n(x_{ik} \mid \pi_{ij})$ is the frequency of $(x_{ik}, \pi_{ij})$ in $D$. Thus $\hat\theta_{ijk} = n(x_{ik} \mid \pi_{ij}) / n(\pi_{ij})$, the relative frequency of the relevant cases, where $n(\pi_{ij}) = \sum_k n(x_{ik} \mid \pi_{ij})$ is the frequency of $\pi_{ij}$. The Bayesian approach generalizes the MLE by introducing a flattening constant $\alpha_{ijk} > 0$ for each frequency, so that the estimate is computed as

$\hat\theta_{ijk} = \dfrac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) - c_i}$   (1)

where $\alpha_{ij} = \sum_k \alpha_{ijk}$. The quantity $(\alpha_{ijk} - 1)/(\alpha_{ij} - c_i)$ is interpreted as the prior probability believed by the learning system before seeing any data, and the learning process is regarded as the updating of this prior probability by means of Bayes' theorem. When the prior distribution of $(\theta_{ij1}, \ldots, \theta_{ijc_i})$ is Dirichlet with hyperparameters $(\alpha_{ij1}, \ldots, \alpha_{ijc_i})$, then (1) is the posterior mode, the Maximum a Posteriori (MAP) estimate, of $\theta_{ijk}$. Unfortunately, the simplicity and efficiency of these closed form solutions are lost when the database is incomplete, that is, when some entries are reported as unknown. In this case, the exact estimate is a mixture of the estimates given by (1) for each database generated by the combination of all possible values for each missing entry, and the computational cost of this operation grows exponentially in the number of missing entries. Simple solutions to this problem are either to ignore the cases including missing entries or to ascribe the missing entries to an ad hoc dummy state. Both solutions introduce potentially dangerous bias into the
estimate, as they ignore the fact that the missing entries may be relevant to the estimation of $\theta$. Current methods rely on more sophisticated techniques, such as the EM algorithm (Dempster, Laird, & Rubin, 1977) and Gibbs Sampling (GS) (Geman & Geman, 1984). Both EM and GS provide reliable estimates of the parameters, and they are currently regarded as the most viable solutions to the problem of missing data. However, both of these iterative methods can be trapped in local minima, and convergence detection can be difficult. Furthermore, they rely on the assumption that data are missing at random (MAR): within each parent configuration, the available data are a representative sample of the complete database, and the distribution of missing data can therefore be inferred from the available entries (Little & Rubin, 1987). When this assumption fails, and the missing data mechanism is not ignorable (NI), the accuracy of these methods can dramatically decrease. Finally, the computational cost of these methods depends heavily on the absolute number of missing entries, and this can prevent their scalability to large databases. This situation motivated the recent development of a deterministic method, called Bound and Collapse (BC) (Ramoni & Sebastiani, 1998), to estimate conditional probability distributions from possibly incomplete databases. BC was proved to be extremely scalable, as the complexity of the algorithm does not depend on the number of missing entries. This paper presents an experimental comparison of EM, GS and BC on a real-world database. The aim of this comparison is to evaluate whether the significant computational advantage offered by BC reduces, in practice, the accuracy of the estimation.

2. METHODS

This paper compares the accuracy and the efficiency of EM, GS, and BC. This section outlines the character of these three methods. The EM algorithm is an iterative method to compute MLE and MAP estimates.
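In the complete-data case these targets have the closed form (1). As a sketch (illustrative code, not the authors' implementation), for a single parent configuration:

```python
# Sketch of the closed-form MAP estimate (1) for one parent configuration.
# counts[k] = n(x_ik | pi_ij); alpha[k] = Dirichlet hyperparameter alpha_ijk.

def map_estimate(counts, alpha):
    """(alpha_ijk + n(x_ik | pi_ij) - 1) / (alpha_ij + n(pi_ij) - c_i)."""
    c = len(counts)
    n_ij = sum(counts)
    a_ij = sum(alpha)
    return [(a + n - 1.0) / (a_ij + n_ij - c) for a, n in zip(alpha, counts)]

# With a uniform prior (alpha_ijk = 1) this reduces to the relative-frequency MLE:
print(map_estimate([30, 10], [1.0, 1.0]))  # [0.75, 0.25]
```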
The EM algorithm alternates an expectation step, in which unknown quantities depending on the missing entries are replaced by their expectations in the likelihood, with a maximization step, in which the likelihood completed in the expectation step is maximized with respect to the unknown parameters; the resulting estimates are used to replace the unknown quantities in the next expectation step, until the difference between successive estimates is smaller than a fixed threshold. The convergence rate of this process can be "painfully slow" (Thiesson, 1995), and several modifications have been proposed to increase its speed under certain circumstances. GS is one of the most popular Markov Chain Monte Carlo methods for Bayesian inference. The algorithm treats each missing entry as a parameter to be estimated and iterates a stochastic process that provides a sample from the posterior distribution of $\theta$. This sample is used to compute empirical estimates of the posterior mean, the posterior mode, or any other function of the parameters. In practical applications, the algorithm iterates a number of times to reach stability (the burn-in), and then a final sample from the joint posterior distribution of the parameters is taken. Both EM and GS provide reliable estimates of the parameters, and they are currently regarded as the most viable solutions to the problem of missing data. However, both of these iterative methods can be trapped in local minima, and convergence detection can be difficult. Furthermore, they rely on the assumption that the missing data mechanism is ignorable: within each observed parent configuration, the available data are a representative sample of the complete database, and the distribution of missing data can therefore be inferred from the available entries (Little & Rubin, 1987). When this assumption fails, and the missing data mechanism is not ignorable (NI), the accuracy of these methods can dramatically decrease.
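The alternation can be illustrated on the simplest case, a single conditional distribution whose child entries are missing at random within one parent configuration (a toy sketch under that assumption, not the accelerated implementation used later in the experiments):

```python
# Toy EM sketch: ML estimation of p(X_i | pi_ij) when m child entries are
# missing at random within this parent configuration (illustrative code).
# observed[k] = n(x_ik | pi_ij) over the complete cases.

def em_multinomial(observed, m, tol=1e-6, max_iter=1000):
    c = len(observed)
    theta = [1.0 / c] * c                      # uniform starting point
    for _ in range(max_iter):
        # E-step: each missing entry contributes theta_k to the expected count.
        expected = [n + m * t for n, t in zip(observed, theta)]
        # M-step: maximize the completed likelihood (relative frequencies).
        total = sum(expected)
        new = [e / total for e in expected]
        if max(abs(a - b) for a, b in zip(new, theta)) < tol:
            return new
        theta = new
    return theta

print(em_multinomial([30, 10], m=20))  # converges toward the observed-data MLE [0.75, 0.25]
```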
Finally, the computational cost of these methods depends heavily on the absolute number of missing entries, and this can prevent their scalability to large databases. These limitations motivated the development of BC, a deterministic method to estimate conditional probabilities from an incomplete database. The method bounds the set of possible estimates consistent with the available information by computing the minimum and the maximum estimate that would be obtained from all possible completions of the database. This process returns probability intervals containing all possible estimates consistent with the available information. These bounds are then collapsed into a unique value via a convex combination of the extreme points, with weights depending on the assumed pattern of missing data. The intuition behind BC is that an incomplete database is still able to constrain the possible estimates within a set, and that the assumed pattern of missing data, encoded as a probability of missing data, can be used to select a point estimate within the set of possible ones.

Let $X_i$ be a variable in $X$ with parent variables $\Pi_i$. Denote by $n(? \mid \pi_{ij})$ the frequency of cases in which only the entry on the child variable is missing, by $n(x_{ik} \mid ?)$ the frequency of cases in which only the parent configuration is unknown but can be completed as $\pi_{ij}$, and by $n(? \mid ?)$ the frequency of cases in which the entries on both $X_i$ and $\Pi_i$ are unknown and can be completed as $(x_{ik}, \pi_{ij})$. Define the virtual frequencies as:

$n^*(x_{ik} \mid \pi_{ij}) = n(? \mid \pi_{ij}) + n(x_{ik} \mid ?) + n(? \mid ?)$

$n_*(x_{ik} \mid \pi_{ij}) = n(? \mid \pi_{ij}) + \sum_{h=1, h \ne k}^{c_i} n(x_{ih} \mid ?) + n(? \mid ?).$

Thus, $n^*(x_{ik} \mid \pi_{ij})$ is the maximum achievable frequency of $(x_{ik}, \pi_{ij})$ in the incomplete database, and it is used to compute the maximum of $p(x_{ik} \mid \pi_{ij})$. The virtual frequency $n_*(x_{ik} \mid \pi_{ij})$ is used to compute the minimum estimate of $p(x_{ik} \mid \pi_{ij})$. It can easily be shown that the estimate of $p(x_{ik} \mid \pi_{ij})$ is bounded above by

$p^*(x_{ik} \mid \pi_{ij}, D) = \dfrac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) + n^*(x_{ik} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + n^*(x_{ik} \mid \pi_{ij}) - c_i}$

and below by

$p_*(x_{ik} \mid \pi_{ij}, D) = \dfrac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + n_*(x_{ik} \mid \pi_{ij}) - c_i}.$

The minimum estimate of $\theta_{ijk}$ is achieved when the missing data mechanism is such that, with probability 1, all cases $(X_i = ?, \pi_{ij})$ and $(X_i = ?, \Pi_i = ?)$ are completed as $(x_{ih}, \pi_{ij})$ for $h \ne k$; all cases $(X_i = x_{ih}, \Pi_i = ?)$ are completed as $(x_{ih}, \pi_{ij})$ if $h \ne k$ and, if $h = k$, as $(x_{ik}, \pi_{ij'})$ for some $j' \ne j$. Thus, $n_*(x_{ik} \mid \pi_{ij})$ is the total virtual frequency of cases completed with $\Pi_i = \pi_{ij}$ and $X_i = x_{ih}$, $h \ne k$, and the frequency of $(x_{ik}, \pi_{ij})$ is not augmented. The maximum estimate of $\theta_{ijk}$ is achieved when the missing data mechanism is such that, with probability 1, all cases $(X_i = ?, \pi_{ij})$, $(X_i = ?, \Pi_i = ?)$ and $(X_i = x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij})$ and, for $h \ne k$, $(X_i = x_{ih}, \Pi_i = ?)$ are completed as $(x_{ih}, \pi_{ij'})$ for some $j' \ne j$. Thus, $n^*(x_{ik} \mid \pi_{ij})$ augments the frequency of $(x_{ik}, \pi_{ij})$ as well as that of $\pi_{ij}$, while the frequency of the other cases with the same parent configuration is not augmented. This probability interval contains all and only the possible estimates consistent with $D$; therefore it is sound, and it is the tightest estimable interval.
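A sketch of the bound step (illustrative code; the argument names are assumptions, and the counts $n(? \mid \pi_{ij})$, $n(x_{ik} \mid ?)$ and $n(? \mid ?)$ are assumed precomputed from the database):

```python
# Sketch of the bound step for one parameter theta_ijk (illustrative code).
# n[k]    = n(x_ik | pi_ij)  observed child-given-parent frequencies
# nq_p    = n(? | pi_ij)     child missing, parent observed as pi_ij
# nk_q[k] = n(x_ik | ?)      child observed, parent completable as pi_ij
# nq_q    = n(? | ?)         both missing, completable as (x_ik, pi_ij)

def bound(k, n, alpha, nq_p, nk_q, nq_q):
    """Return (p_min, p_max), the interval of estimates consistent with D."""
    c = len(n)
    n_ij, a_ij = sum(n), sum(alpha)
    n_star = nq_p + nk_q[k] + nq_q                           # n*(x_ik | pi_ij)
    n_low = nq_p + sum(nk_q[h] for h in range(c) if h != k) + nq_q
    p_max = (alpha[k] + n[k] + n_star - 1) / (a_ij + n_ij + n_star - c)
    p_min = (alpha[k] + n[k] - 1) / (a_ij + n_ij + n_low - c)
    return p_min, p_max

print(bound(0, [30, 10], [1.0, 1.0], nq_p=5, nk_q=[2, 3], nq_q=1))
```

With a complete database all the extra counts are zero and the interval degenerates to the point estimate (1).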
The width of each interval is a function of the virtual frequencies, so that it accounts for the amount of information available in $D$ about the parameter to be estimated, and it represents an explicit measure of the quality of the probabilistic information conveyed by the database about that parameter. The second step of BC collapses the intervals estimated in the bound step into point estimates, using a convex combination of the extreme estimates with weights depending on the assumed pattern of missing data. We assume that information about missing data is encoded as a probability distribution describing, for each variable in the database, the probability of a completion as $p(x_{ik} \mid \pi_{ij}, X_i = ?) = \lambda_{ijk}$, for $k = 1, \ldots, c_i$, with $\sum_k \lambda_{ijk} = 1$. Note that this is only a part of the information required about the distribution of missing data. We have seen that incomplete cases can be either $(x_{ik}, \Pi_i = ?)$, or $(X_i = ?, \pi_{ij})$, or $(X_i = ?, \Pi_i = ?)$. Thus, a full description of the pattern of missing data requires the probabilities $p(\pi_{ij} \mid \Pi_i = ?)$ and $p(x_{ik} \mid \pi_{ij}, \Pi_i = ?)$, as well as $\lambda_{ijk}$. We will show that the probabilities $\lambda_{ijk}$ can be used to obtain accurate estimates of $\theta_{ijk}$ if we exclude, amongst the possible patterns of missing data, those extreme mechanisms that yield the lower bounds $p_*(x_{ik} \mid \pi_{ij}, D)$. In order to limit the amount of information required about the process underlying the missing data, we will assume that if all incomplete cases $(x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij})$, then for all $h \ne k$, $(x_{ih}, \Pi_i = ?)$ cannot be completed as $(x_{ih}, \pi_{ij})$. This assumption allows us to derive new local lower bounds from the maximum probabilities as follows. Each maximum probability $p^*(x_{ik} \mid \pi_{ij}, D)$ is obtained when all incomplete cases that could be completed as $(x_{ik}, \pi_{ij})$ are attributed to $(x_{ik}, \pi_{ij})$, and the observed frequencies of the other states of $X_i$ given $\pi_{ij}$ are not augmented.
Thus, when $p(x_{ik} \mid \pi_{ij}) = p^*(x_{ik} \mid \pi_{ij}, D)$, we have $p(x_{il} \mid \pi_{ij}) = p^k(x_{il} \mid \pi_{ij}, D)$ for $l \ne k$, where

$p^k(x_{il} \mid \pi_{ij}, D) = \dfrac{\alpha_{ijl} + n(x_{il} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + n^*(x_{ik} \mid \pi_{ij}) - c_i}.$

The maximum probabilities induce $c_i$ probability distributions $\{p^*(x_{ik} \mid \pi_{ij}, D), p^k(x_{ih} \mid \pi_{ij}, D)\}$, $h \ne k$, for $k = 1, \ldots, c_i$. Denote by $\{p^h(x_{ik} \mid \pi_{ij}, D)\}$, $h \ne k$, $h = 1, \ldots, c_i$, the set of $c_i - 1$ local minima of $\hat\theta_{ijk}$, and by $p_l(x_{ik} \mid \pi_{ij}, D) = \min_h p^h(x_{ik} \mid \pi_{ij}, D)$ their minimum. The distribution of missing entries, in terms of $\lambda_{ijk}$, can now be used to identify a point estimate $\hat p(x_{ik} \mid \pi_{ij}, D, \lambda_{ijk})$ within the interval $[p_l(x_{ik} \mid \pi_{ij}, D), p^*(x_{ik} \mid \pi_{ij}, D)]$ as

$\hat p(x_{ik} \mid \pi_{ij}, D, \lambda_{ijk}) = \sum_{h \ne k} \lambda_{ijh}\, p^h(x_{ik} \mid \pi_{ij}, D) + \lambda_{ijk}\, p^*(x_{ik} \mid \pi_{ij}, D).$   (2)

These estimates define a probability distribution, since $\sum_{k=1}^{c_i} \hat p(x_{ik} \mid \pi_{ij}, D, \lambda_{ijk}) = \sum_{k=1}^{c_i} \lambda_{ijk} = 1$. The intuition behind (2) is that the upper bound of $p(x_{ik}, \pi_{ij})$ is obtained when all incomplete cases are completed as $(x_{ik}, \pi_{ij})$. Thus, if $p(x_{ik} \mid \pi_{ij}, X_i = ?) = 1$ for a particular $k$, then (2) returns the upper bound of the interval probability as the estimate of $p(x_{ik} \mid \pi_{ij})$, and $p^k(x_{ih} \mid \pi_{ij}, D)$ as the estimates of $p(x_{ih} \mid \pi_{ij})$, $h \ne k$. This case corresponds to the assumption that data are systematically missing about $x_{ik}$, i.e. its entries are missing with probability 1. On the other hand, when no information on the mechanism generating missing data is available, and therefore all patterns of missing data are equally likely, then $\lambda_{ijk} = 1/c_i$. As the number of missing entries decreases, $p^*(x_{ik} \mid \pi_{ij}, D)$ and $p^h(x_{ik} \mid \pi_{ij}, D)$ approach $(\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) - 1)/(\alpha_{ij} + n(\pi_{ij}) - c_i)$, so that, when the database is complete,
(2) returns the exact estimate $\hat\theta_{ijk}$. As the number of missing entries increases, $p^h(x_{ik} \mid \pi_{ij}, D) \to 0$ for all $h$, and $p^*(x_{ik} \mid \pi_{ij}, D) \to 1$, so that the estimate (2) approaches the prior probability $\lambda_{ijk}$, and coherently nothing is learned from a database in which all entries on $(X_i, \pi_{ij})$ are missing. Finally, if $n^*(x_{ik} \mid \pi_{ij}) = n_{\pi_{ij}}$ for all $k$, then (2) simplifies to

$\dfrac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) + n_{\pi_{ij}} \lambda_{ijk} - 1}{\alpha_{ij} + n(\pi_{ij}) + n_{\pi_{ij}} - c_i}$   (3)

which is the expected MAP, given the assumed pattern of missing data. When data are MAR, the incomplete samples within parent configurations are representative samples of the complete but unknown ones, so that the probability $\lambda_{ijk}$ can be estimated from the available data as

$\hat\lambda_{ijk} = \dfrac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) - c_i}.$   (4)

Then $\hat\lambda_{ijk}$ can be used to compute (2). Thus, when unreported data are MAR, BC estimates are corrections of the estimates computed from the observed data. In particular, with $\lambda_{ijk} = \hat\lambda_{ijk}$, (3) reduces to (1), and to the standard MLE when $\alpha_{ijk} = 1$ (Little & Rubin, 1987). The estimates computed by (2) with $\lambda_{ijk}$ replaced by $\hat\lambda_{ijk}$ are the expected estimates and, as the database grows, they will approach the exact ones, as GS and EM estimates do. The advantage of BC is that, as a deterministic method, it does not pose any problem of convergence rate and detection, and it reduces the cost of estimating a conditional distribution of $X_i \mid \pi_{ij}$ to the cost of one exact Bayesian updating and one convex combination for each state of $X_i$. Since we consider only the extreme estimates induced by the maximum probabilities, the information needed in the collapse step is limited to the probabilities $\lambda_{ijk}$, and there is no need to specify probabilities of completions for cases in which the parent configuration is unknown. This limited amount of information about the process underlying missing data implies that $\hat p(x_{ik} \mid \pi_{ij}, D, \lambda_{ijk}) \in [p_l(x_{ik} \mid \pi_{ij}, D), p^*(x_{ik} \mid \pi_{ij}, D)]$.
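Continuing the same sketch, the collapse step (2) combines the local minima $p^h$ and the maximum $p^*$ with the assumed completion probabilities $\lambda_{ijk}$ (illustrative code under the notation above, not the BKD implementation):

```python
# Sketch of the collapse step (2) for one parameter theta_ijk (illustrative).
# n_star[k] = n*(x_ik | pi_ij); lam[k] = lambda_ijk = p(x_ik | pi_ij, X_i = ?).

def collapse(k, n, alpha, n_star, lam):
    c = len(n)
    n_ij, a_ij = sum(n), sum(alpha)

    def p_max(kk):       # upper bound p*(x_ikk | pi_ij, D)
        return (alpha[kk] + n[kk] + n_star[kk] - 1) / (a_ij + n_ij + n_star[kk] - c)

    def p_local(kk, h):  # p^h(x_ikk | pi_ij, D): all completions go to x_ih
        return (alpha[kk] + n[kk] - 1) / (a_ij + n_ij + n_star[h] - c)

    return sum(lam[h] * p_local(k, h) for h in range(c) if h != k) \
        + lam[k] * p_max(k)

# With no missing entries every term collapses to the exact MAP estimate (1):
print(collapse(0, [30, 10], [1.0, 1.0], n_star=[0, 0], lam=[0.5, 0.5]))  # 0.75
```

When $\lambda_{ijk} = 1$ the combination returns the upper bound $p^*$, matching the systematically-missing case described above.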
Thus, $\hat p(x_{ik} \mid \pi_{ij}, D, \lambda_{ijk})$ cannot be equal to $p_*(x_{ik} \mid \pi_{ij}, D)$ unless $p_l(x_{ik} \mid \pi_{ij}, D) = p_*(x_{ik} \mid \pi_{ij}, D)$. However,

$p_l(x_{ik} \mid \pi_{ij}, D) = \dfrac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \max_{h \ne k} n^*(x_{ih} \mid \pi_{ij}) - c_i},$

which shows that the difference between $p_*(x_{ik} \mid \pi_{ij}, D)$ and $p_l(x_{ik} \mid \pi_{ij}, D)$ depends only on those cases in which the state of the child variable is known and the parent configuration is not. Note that $p_l(x_{ik} \mid \pi_{ij}, D) = p_*(x_{ik} \mid \pi_{ij}, D)$ if there is no missing data mechanism consistent with $n_*(x_{ik} \mid \pi_{ij})$, which is the assumption made above. The difference $p_l(x_{ik} \mid \pi_{ij}, D) - p_*(x_{ik} \mid \pi_{ij}, D)$ is positive, and it is easy to show that it is maximized when $n(x_{ik} \mid ?) = m$ for all $k$, so that it approaches 0 at the same rate as $m$, and it becomes negligible as the database grows. Furthermore, if we happen to know that, for some missing data mechanism, $\hat\theta_{ijk} = p_*(x_{ik} \mid \pi_{ij}, D)$ for some $k$, then, for all $h \ne k$ and all $j$, $p_l(x_{ih} \mid \pi_{ij}, D) < p(x_{ih} \mid \pi_{ij}) < p^*(x_{ih} \mid \pi_{ij}, D)$, so that all the other estimates are bounded below by $p_l(x_{ih} \mid \pi_{ij}, D)$. Thus, any error would be limited to at most one conditional probability for each parent-child configuration.

Figure 1: The BBN used for the evaluation.

3. EXPERIMENTAL EVALUATION

The aim of these experiments is to evaluate the estimation accuracy of BC, the EM algorithm and GS as the available information in the database decreases. We used a database reporting the values of six risk factors for coronary disease in 1841 employees of a Czechoslovakian car factory during a follow-up study (Whittaker, 1990, page 261). All variables are binary, and they represent Anamnesis (X1), Strenuous Mental Work (X2), Ratio of Beta and Alpha Lipoproteins (X3), Strenuous Physical Work (X4), Smoking (X5) and Systolic Blood Pressure (X6). The database is complete, and we extracted the most probable directed graphical model, depicted in Figure 1, using the K2 algorithm (Cooper & Herskovits, 1992).
The 15 conditional dependencies specified by this BBN were then quantified from the complete database. Uniform prior distributions were assumed for the parameters. We then used the BBN in Figure 1 to run two different learning tests in order to evaluate the accuracy of BC relative to GS and EM, using the implementation of accelerated EM in GAMES (Thiesson, 1995), the implementation of GS in BUGS (Thomas, Spiegelhalter, & Gilks, 1992), and the implementation of BC in BKD. The convergence threshold for EM was $10^{-4}$. Results of GS are based on a first burn-in of 5,000 iterations, sufficient to reach stability, and a successive sample of 5,000 cases. The aim of the first test was to compare the estimation
accuracy of the three methods when the missing data mechanism is ignorable. For this purpose, four incomplete databases were created by incrementally deleting data from the complete database. A vector of 15 numbers between 0 and 1 was randomly generated, and its elements were taken as the probability of deleting the occurrences of each variable $X_i$, independently of its value, given the parent configuration, in 10%, 20%, 30% and 40% of the database. This process generated four databases with 9% (1004), 18% (2035), 28% (3041) and 37% (4092) missing entries, and the observed entries within parent configurations are representative of the complete samples. The rationale of the second test was to compare the robustness of these methods under a non-ignorable missing data mechanism. Four samples were generated from the complete database by incrementally deleting 25%, 50%, 75% and 100% of the entries $(X_5 = 2, X_6 = 1)$ with probability 0.9, and of the entries $(X_5 = 2, X_6 = 2)$ with probability 0.1. This process generated four databases with 3% (278), 5% (532), 7% (790) and 9% (1030) missing entries. The estimation accuracy is measured by comparing the exact joint probability distribution of $(X_1, \ldots, X_6)$ to those learned from the incomplete data using GS, EM and BC.

Table 1: Summary statistics of estimation and prediction errors. (Rows: Min, 1st Qu., Median, 3rd Qu., Max, and the cross entropy X-Ent.; columns: GS, EM and BC under the ignorable mechanism with 9% and 37% missing entries and under the non-ignorable mechanism with 3% and 9% missing entries.)

Table 1 reports some summary statistics of the results obtained on the two extreme databases generated in the two tests. The first five rows of Table 1 report summary statistics, in terms of minimum, first quartile, median, third quartile and maximum, of the absolute difference between the 64 exact joint probabilities and those obtained from the BBNs learned with GS, EM and BC. X-Ent. labels the cross entropy between the joint probability distributions.
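The cross entropy referred to here (written in its Kullback-Leibler form, an assumption about the exact convention used) can be computed over the 64 joint configurations as:

```python
# Sketch of the cross entropy between the exact joint p and a learned joint q:
# sum_x p(x) * log(p(x) / q(x)), which is 0 when the two distributions coincide.
import math

def cross_entropy(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(cross_entropy([0.5, 0.5], [0.5, 0.5]))  # 0.0
```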
The predictive accuracy was evaluated by comparing the predictive probabilities of $X_6$ obtained by the three methods to those calculated by the BBN extracted from the complete database. Given the conditional independence assumptions embedded in the BBN, there are 43 possible evidences that can be considered: 10 are given by observing one of the variables $X_1, \ldots, X_5$ alone, 32 by observing one of the pairs $(X_1, X_3)$, $(X_1, X_4)$, $(X_1, X_5)$, $(X_2, X_3)$, $(X_2, X_4)$, $(X_2, X_5)$, $(X_3, X_4)$, $(X_3, X_5)$, and one in which the evidence is the empty set, so that the marginal probability of $X_6$ is returned. Bounds on the predictive probabilities were also computed by propagating the BC upper bounds for each conditional dependency. Summary statistics of the absolute errors are given in the second half of Table 1. When 9% of the data is missing and the missing data mechanism is ignorable, the distributions of the errors incurred by the three methods differ at the fourth decimal point, and a slightly higher maximum value for BC is balanced by a lower median point. The difference becomes a little more evident when 37% of the data are missing. Nonetheless, the median difference remains small and, most of all, these differences do not affect the predictive power of the BBN extracted by BC, which, if anything, scores a slightly better median performance, larger than any difference among the estimates. This overall equivalence among the three methods is confirmed when missing data are NI, where the error distributions are almost identical, and the impact of a wrong assumption about the pattern of missing data equally affects all methods. However, BC provides bounds on the predicted values which reflect their reliability and can be taken into account during the reasoning process. It must also be remarked that BC does not rely per se on the ignorability assumption, and that the available information about the missing data could
have been exploited by BC to achieve a better performance. The main difference among the three methods highlighted by the experiments is the execution time and, most of all, the shape of its growth curve, which shows that the execution time of BC is independent of the number of missing entries (Ramoni & Sebastiani, 1998). Table 2 shows the execution time for all eight databases generated during the experiments. The first column reports the absolute number of missing entries from the original entries of the complete database. Results are in hours:minutes:seconds format.

Table 2: Execution time for each method (columns: number of missing entries, GS, EM, BC).

4. CONCLUSIONS

The first character of BC is its ability to explicitly and separately represent the information available in the database and the assumed pattern of missing data, so that BC does not need to rely on the ignorability assumption, although this assumption can be represented as easily as any other. Furthermore, BC reduces the cost of estimating each conditional distribution of $X_i$ to the cost of one exact Bayesian updating and one convex combination for each state of $X_i$ in each parent configuration. This deterministic process does not pose any problem of convergence rate and detection, and its computational complexity is independent of the number of missing entries. The experimental comparison with EM and GS showed a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using BC.

References

Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225.

Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian networks. In Proceedings of the National Conference on Artificial Intelligence, pp. 1277-1284. Morgan Kaufmann, San Mateo, CA.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Heckerman, D. (1996). Bayesian networks for knowledge discovery. In Advances in Knowledge Discovery and Data Mining, pp. 153-180. MIT Press, Cambridge, MA.

Little, R., & Rubin, D. (1987). Statistical Analysis with Missing Data. Wiley, New York.

Ramoni, M., & Sebastiani, P. (1998). Parameter estimation in Bayesian networks from incomplete databases. Intelligent Data Analysis Journal, 2(1).

Thiesson, B. (1995). Accelerated quantification of Bayesian networks with incomplete data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pp. 306-311. AAAI Press.

Thomas, A., Spiegelhalter, D., & Gilks, W. (1992). BUGS: A program to perform Bayesian inference using Gibbs sampling. In Bayesian Statistics 4, pp. 837-842. Clarendon Press, Oxford.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, New York.
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationUsing Expectation-Maximization for Reinforcement Learning
NOTE Communicated by Andrew Barto and Michael Jordan Using Expectation-Maximization for Reinforcement Learning Peter Dayan Department of Brain and Cognitive Sciences, Center for Biological and Computational
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationThe Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
The Bias-Variance dilemma of the Monte Carlo method Zlochin Mark 1 and Yoram Baram 1 Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel fzmark,baramg@cs.technion.ac.il Abstract.
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More information6.867 Machine learning, lecture 23 (Jaakkola)
Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul
More informationoutput dimension input dimension Gaussian evidence Gaussian Gaussian evidence evidence from t +1 inputs and outputs at time t x t+2 x t-1 x t+1
To appear in M. S. Kearns, S. A. Solla, D. A. Cohn, (eds.) Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 999. Learning Nonlinear Dynamical Systems using an EM Algorithm Zoubin
More informationBayesian network modeling. 1
Bayesian network modeling http://springuniversity.bc3research.org/ 1 Probabilistic vs. deterministic modeling approaches Probabilistic Explanatory power (e.g., r 2 ) Explanation why Based on inductive
More informationChapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang
Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check
More informationAn Empirical-Bayes Score for Discrete Bayesian Networks
JMLR: Workshop and Conference Proceedings vol 52, 438-448, 2016 PGM 2016 An Empirical-Bayes Score for Discrete Bayesian Networks Marco Scutari Department of Statistics University of Oxford Oxford, United
More informationImproving Search-Based Inference in Bayesian Networks
From: Proceedings of the Eleventh International FLAIRS Conference. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Improving Search-Based Inference in Bayesian Networks E. Castillo, J. M. Guti~rrez
More informationIntelligent Systems:
Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition
More informationBayesian Methods for Multivariate Categorical Data. Jon Forster (University of Southampton, UK)
Bayesian Methods for Multivariate Categorical Data Jon Forster (University of Southampton, UK) Table 1: Alcohol intake, hypertension and obesity (Knuiman and Speed, 1988) Alcohol intake (drinks/day) Obesity
More informationMCMC: Markov Chain Monte Carlo
I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov
More informationApplying Bayesian networks in the game of Minesweeper
Applying Bayesian networks in the game of Minesweeper Marta Vomlelová Faculty of Mathematics and Physics Charles University in Prague http://kti.mff.cuni.cz/~marta/ Jiří Vomlel Institute of Information
More informationan introduction to bayesian inference
with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena
More informationBayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan
Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks
More informationVariational Probabilistic Inference. and the QMR-DT Network. Abstract
Journal of Articial Intelligence Research 10 (1999) 291-322 Submitted 10/98; published 5/99 Variational Probabilistic Inference and the QMR-DT Network Tommi S. Jaakkola Articial Intelligence Laboratory,
More informationExact model averaging with naive Bayesian classifiers
Exact model averaging with naive Bayesian classifiers Denver Dash ddash@sispittedu Decision Systems Laboratory, Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15213 USA Gregory F
More informationBayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses
Bayesian Learning Two Roles for Bayesian Methods Probabilistic approach to inference. Quantities of interest are governed by prob. dist. and optimal decisions can be made by reasoning about these prob.
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationMH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous
More informationStructure Learning: the good, the bad, the ugly
Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform
More informationPDF hosted at the Radboud Repository of the Radboud University Nijmegen
PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is a preprint version which may differ from the publisher's version. For additional information about this
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationConvergence Rate of Expectation-Maximization
Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationThe Generalized CEM Algorithm. MIT Media Lab. Abstract. variable probability models to maximize conditional likelihood
The Generalized CEM Algorithm Tony Jebara MIT Media Lab Ames St. Cambridge, MA 139 jebara@media.mit.edu Alex Pentland MIT Media Lab Ames St. Cambridge, MA 139 sandy@media.mit.edu Abstract We propose a
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationIn: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999
In: Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods and Applications (CIMA), International Computer Science Conventions Rochester New York, 999 Feature Selection Based
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationGaussian process for nonstationary time series prediction
Computational Statistics & Data Analysis 47 (2004) 705 712 www.elsevier.com/locate/csda Gaussian process for nonstationary time series prediction Soane Brahim-Belhouari, Amine Bermak EEE Department, Hong
More informationMobile Robot Localization
Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations
More informationStudy of Four Types of Learning Bayesian Networks Cases
Appl. Math. Inf. Sci. 8, No. 1, 379-386 (2014) 379 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/10.12785/amis/080147 Study of Four Types of Learning Bayesian Networks
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationBayesian Networks BY: MOHAMAD ALSABBAGH
Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationDoes the Wake-sleep Algorithm Produce Good Density Estimators?
Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationPart 1: Expectation Propagation
Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud
More informationEE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS
EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005 Instructor: Professor Jeff A. Bilmes Uncertainty & Bayesian Networks
More informationLearning Bayesian Networks from Incomplete Data: An Efficient Method for Generating Approximate Predictive Distributions
Learning Bayesian Networks from Incomplete Data: An Efficient Method for Generating Approximate Predictive Distributions Carsten Riggelsen Department of Information & Computing Sciences Universiteit Utrecht
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationIntroduction to Graphical Models. Srikumar Ramalingam School of Computing University of Utah
Introduction to Graphical Models Srikumar Ramalingam School of Computing University of Utah Reference Christopher M. Bishop, Pattern Recognition and Machine Learning, Jonathan S. Yedidia, William T. Freeman,
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationAn Empirical Investigation of the K2 Metric
An Empirical Investigation of the K2 Metric Christian Borgelt and Rudolf Kruse Department of Knowledge Processing and Language Engineering Otto-von-Guericke-University of Magdeburg Universitätsplatz 2,
More informationBeyond Uniform Priors in Bayesian Network Structure Learning
Beyond Uniform Priors in Bayesian Network Structure Learning (for Discrete Bayesian Networks) scutari@stats.ox.ac.uk Department of Statistics April 5, 2017 Bayesian Network Structure Learning Learning
More informationMarkov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)
Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationChapter 11. Stochastic Methods Rooted in Statistical Mechanics
Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Neural Networks and Learning Machines (Haykin) Lecture Notes on Self-learning Neural Algorithms Byoung-Tak Zhang School of Computer Science
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationCSE 473: Artificial Intelligence Autumn Topics
CSE 473: Artificial Intelligence Autumn 2014 Bayesian Networks Learning II Dan Weld Slides adapted from Jack Breese, Dan Klein, Daphne Koller, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 473 Topics
More informationRelevance Vector Machines for Earthquake Response Spectra
2012 2011 American American Transactions Transactions on on Engineering Engineering & Applied Applied Sciences Sciences. American Transactions on Engineering & Applied Sciences http://tuengr.com/ateas
More informationProbabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April
Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference
More informationMachine Learning for Data Science (CS4786) Lecture 24
Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each
More informationOutline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012
CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More information12 : Variational Inference I
10-708: Probabilistic Graphical Models, Spring 2015 12 : Variational Inference I Lecturer: Eric P. Xing Scribes: Fattaneh Jabbari, Eric Lei, Evan Shapiro 1 Introduction Probabilistic inference is one of
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationLearning in Bayesian Networks
Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks
More informationDeep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści
Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?
More informationIntroduction to Artificial Intelligence. Unit # 11
Introduction to Artificial Intelligence Unit # 11 1 Course Outline Overview of Artificial Intelligence State Space Representation Search Techniques Machine Learning Logic Probabilistic Reasoning/Bayesian
More informationLearning With Bayesian Networks. Markus Kalisch ETH Zürich
Learning With Bayesian Networks Markus Kalisch ETH Zürich Inference in BNs - Review P(Burglary JohnCalls=TRUE, MaryCalls=TRUE) Exact Inference: P(b j,m) = c Sum e Sum a P(b)P(e)P(a b,e)p(j a)p(m a) Deal
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As
More informationA Tutorial on Learning with Bayesian Networks
A utorial on Learning with Bayesian Networks David Heckerman Presented by: Krishna V Chengavalli April 21 2003 Outline Introduction Different Approaches Bayesian Networks Learning Probabilities and Structure
More informationTo make a grammar probabilistic, we need to assign a probability to each context-free rewrite
Notes on the Inside-Outside Algorithm To make a grammar probabilistic, we need to assign a probability to each context-free rewrite rule. But how should these probabilities be chosen? It is natural to
More informationThe Origin of Deep Learning. Lili Mou Jan, 2015
The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets
More informationProbabilistic Graphical Networks: Definitions and Basic Results
This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical
More information