Bayesian Networks. Characteristics of Learning BN Models. Bayesian Learning. An Example

Bayesian Networks: Characteristics of Learning BN Models (all hail Judea Pearl; some hail Greg Cooper)

Benefits:
- Handle incomplete data
- Can model causal chains of relationships
- Combine domain knowledge and data
- Can avoid overfitting

Two main uses:
- Find the (best) hypothesis that accounts for a body of data
- Find a probability distribution over hypotheses that permits us to predict/interpret future data

An Example: Bayesian Learning

Surprise Candy Corp. makes two flavors of candy: cherry and lime. Both flavors come in the same opaque wrapper. Candy is sold in large bags, which have one of the following distributions of flavors but are visually indistinguishable:
- h1: 100% cherry
- h2: 75% cherry, 25% lime
- h3: 50% cherry, 50% lime
- h4: 25% cherry, 75% lime
- h5: 100% lime
The relative prevalence of these types of bags is (0.1, 0.2, 0.4, 0.2, 0.1).

As we eat our way through a bag of candy, we predict the flavor of the next piece; the prediction is actually a probability distribution.
- Calculate the probability of each hypothesis given the data: P(hi | d) = alpha P(d | hi) P(hi)
- To predict the probability distribution over an unknown quantity X: P(X | d) = Sum_i P(X | hi) P(hi | d)
- If the observations d are independent given the hypothesis, then P(d | hi) = Prod_j P(dj | hi)
E.g., suppose the first 10 candies we taste are all lime (a sketch of the resulting updates follows below).
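A minimal Python sketch of these updates (not from the slides; the tables P_LIME and PRIOR encode the stated flavor distributions and bag prevalences, everything else is illustrative):

```python
# Minimal sketch: Bayesian updating over the five candy-bag hypotheses
# after observing a run of lime candies.

P_LIME = {"h1": 0.00, "h2": 0.25, "h3": 0.50, "h4": 0.75, "h5": 1.00}
PRIOR  = {"h1": 0.1,  "h2": 0.2,  "h3": 0.4,  "h4": 0.2,  "h5": 0.1}

def posterior(num_limes):
    """P(hi | d) after num_limes lime candies in a row (all-lime data)."""
    unnorm = {h: PRIOR[h] * (P_LIME[h] ** num_limes) for h in PRIOR}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def p_next_lime(num_limes):
    """Predictive P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d)."""
    post = posterior(num_limes)
    return sum(P_LIME[h] * post[h] for h in post)

if __name__ == "__main__":
    for k in range(11):
        print(k, round(posterior(k)["h5"], 3), round(p_next_lime(k), 3))
```

After zero observations the predictive probability of lime is 0.5; it climbs toward 1 as the all-lime evidence accumulates.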

Learning Hypotheses and Predicting from Them

(Figure: (a) posterior probabilities of the hi after k lime candies; (b) probability that the next candy is lime.)

MAP prediction: predict from just the most probable hypothesis. After 3 limes, h5 is most probable, hence we predict lime, even though, by (b), lime is only about 80% probable.

Observations:
- The Bayesian approach asks for prior probabilities on hypotheses! A natural way to encode a bias against complex hypotheses is to make their prior probability very low.
- Choosing hMAP to maximize P(d | h) P(h) is equivalent to minimizing -log2 P(d | h) - log2 P(h). From our earlier discussion of entropy as a measure of information, these two terms are the number of bits needed to describe the data given the hypothesis and the number of bits needed to specify the hypothesis. Thus MAP learning chooses the hypothesis that maximizes compression of the data: the Minimum Description Length principle (a small numeric sketch appears below, after the Bayes-net overview).
- Assuming uniform priors on hypotheses makes MAP yield hML, the maximum-likelihood hypothesis, which maximizes P(d | h).

What Probabilistic Models Should We Use?

Full joint distribution:
- Completely expressive
- Hugely data-hungry
- Exponential computational complexity

Naive Bayes (full conditional independence):
- Relatively concise
- Needs data ~ (#hypotheses) x (#features) x (#feature-values)
- Fast: ~ (#features)
- Cannot express dependencies among features or among hypotheses
- Cannot consider the possibility of multiple hypotheses co-occurring

Bayesian Networks (aka Belief Networks):
- Graphical representation of dependencies among a set of random variables
- Nodes: variables
- Directed links to a node from its parents: direct probabilistic dependencies
- Each Xi has a conditional probability distribution P(Xi | Parents(Xi)) showing the effects of the parents on the node
- The graph is directed and acyclic (a DAG); hence, no cycles
- This is a language that can express models anywhere between Naive Bayes and the full joint distribution, and do so more concisely

Two basic inference tasks:
- Given some new evidence E, how does this affect the probability of some other node(s)? P(X | E): belief propagation/updating
- Given some evidence, what are the most likely values of the other variables? MAP explanation
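A small numeric sketch of the MDL equivalence, continuing the candy example above (the names P_LIME and PRIOR are the same hypothetical tables as in the earlier sketch):

```python
import math

# Sketch: for the candy hypotheses, maximizing P(d | h) P(h) is the same as
# minimizing the description length -log2 P(d | h) - log2 P(h).
# Data: 10 lime candies in a row.

P_LIME = {"h1": 0.00, "h2": 0.25, "h3": 0.50, "h4": 0.75, "h5": 1.00}
PRIOR  = {"h1": 0.1,  "h2": 0.2,  "h3": 0.4,  "h4": 0.2,  "h5": 0.1}
k = 10   # number of lime candies observed

def description_length(h):
    likelihood = P_LIME[h] ** k
    if likelihood == 0.0:
        return math.inf                      # h1 cannot describe 10 limes at all
    return -math.log2(likelihood) - math.log2(PRIOR[h])

h_map = min(PRIOR, key=description_length)   # shortest total description
print(h_map)                                 # h5
print({h: round(description_length(h), 1) for h in PRIOR})
```

The hypothesis with the shortest total description (h5) is exactly the MAP hypothesis.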

The Burglary/Earthquake Alarm Network (due to J. Pearl)

  P(B) = 0.001        P(E) = 0.002

  B  E | P(A | B,E)      A | P(J | A)      A | P(M | A)
  t  t |   0.95          t |   0.9         t |   0.7
  t  f |   0.94          f |   0.05        f |   0.01
  f  t |   0.29
  f  f |   0.001

If everything depended on everything, the model would require just as many parameters as the full joint distribution; here only 10 numbers are needed.

Computing the Joint Distribution from a Bayes Network

As usual, we abuse notation, writing lowercase values for the variables:
  P(x1, ..., xn) = Prod_i P(xi | Parents(Xi))
E.g., what's the probability that the alarm has sounded, there was neither an earthquake nor a burglary, and both John and Mary called? (A numeric sketch appears below.)

Requirements for Constructing a BN

Recall that the definition of conditional probability is
  P(A | B) = P(A, B) / P(B)
and thus we get the chain rule
  P(A, B) = P(A | B) P(B).
Generalizing to n variables and repeatedly applying this idea,
  P(x1, ..., xn) = Prod_i P(xi | x1, ..., x(i-1)).
This works just in case we can define a partial order over the variables so that
  P(xi | x1, ..., x(i-1)) = P(xi | Parents(Xi)).
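A minimal sketch of that calculation, plugging in the CPT entries from the tables above (the dictionary layout is just one convenient encoding, not the slides' notation):

```python
# Sketch: P(j, m, a, ~b, ~e) = P(~b) P(~e) P(a | ~b,~e) P(j | a) P(m | a),
# using the CPT entries of the alarm network above.

P_B = 0.001                      # P(Burglary)
P_E = 0.002                      # P(Earthquake)
P_A = {(True, True): 0.95,       # P(Alarm | B, E)
       (True, False): 0.94,
       (False, True): 0.29,
       (False, False): 0.001}
P_J = {True: 0.9, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.7, False: 0.01}   # P(MaryCalls | Alarm)

p = (1 - P_B) * (1 - P_E) * P_A[(False, False)] * P_J[True] * P_M[True]
print(p)   # ~0.000628: alarm sounded, no burglary/earthquake, both called
```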

Topological Interpretations

(Figure: a node X with parents Ui, children Yi, and the children's other parents Zi.)

- A node X is conditionally independent of its non-descendants Zi given its parents Ui.
- A node X is conditionally independent of all other nodes in the network given its Markov blanket: its parents Ui, its children Yi, and its children's other parents Zi.

BNs Can Be Compact

For a network of 40 binary variables, the full joint distribution has 2^40 entries (> 1,000,000,000,000). If |Parents(Xi)| <= 5, however, then each of the 40 conditional probability tables has at most 2^5 = 32 entries, so the total number of parameters is at most 1,280 (a small counting sketch appears below, after these construction notes). The largest medical BN I know of (Pathfinder) had 109 variables; its full joint distribution would have 2^109 (about 6 x 10^32) entries.

How to Build Bayes Nets

David Heckerman, Pathfinder/Intellipath, around 1990.

How Not to Build BNs

With the wrong ordering of nodes, the network becomes more complicated and requires more (and more difficult) conditional probability assessments. (Figure: two alternative node orderings for the alarm network, M, J, A, B, E and M, J, E, B, A.)
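A trivial sketch of the parameter-count comparison for the 40-variable case (numbers as stated above):

```python
# Sketch: size of the full joint distribution vs. an upper bound on the number
# of Bayes-net parameters when every node has at most 5 parents.

n_vars = 40
max_parents = 5

full_joint_entries = 2 ** n_vars            # 1,099,511,627,776 (> 10^12)
bn_params_upper = n_vars * 2 ** max_parents  # 40 tables of at most 2^5 = 32 rows

print(full_joint_entries)
print(bn_params_upper)                       # 1280
```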

Simplifying Conditional Probability Tables

Do we know any structure in the way that Parents(x) cause x? E.g., the noisy-OR model: if each destroyer can sink the ship with probability P(s | di), what is the probability that the ship will sink if it is attacked by both? For |Parents(x)| = n, such a model requires O(n) parameters, not O(k^n). In general, we can model p(y | x1, x2, ...) by any simple formula. (A noisy-OR sketch appears below, after the inference overview.)

Inference

Recall the two basic inference problems: belief propagation and MAP explanation.
- Trivially, we can enumerate all matching rows of the joint probability distribution.
- For poly-trees (no undirected loops, i.e., only one path between any pair of nodes, as in our example), there are efficient linear-time algorithms, similar to constraint propagation.
- For arbitrary BNs, all inference is NP-hard! We therefore need either exact methods for special structures or approximation.

Exact Solution of BNs (poly-tree example)

Poly-trees are easy. Notes:
- Sum over all "don't care" variables.
- Factor common terms out of the summation; the calculation becomes a sum of products of sums of products...
- Singly-connected structures allow propagation of observations via single paths: "down" is just use of conditional probability, "up" is just Bayes rule.
- Formulated as message-propagation rules; linear time in the network diameter.
- Fails on general networks!
(Figure: example poly-tree with nodes Bottles, Escape, Drunk, TVshow.)
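A minimal noisy-OR sketch, assuming the usual independent-cause form; the per-destroyer probabilities are made up for illustration:

```python
# Sketch of a noisy-OR CPT: each active parent i independently fails to cause
# the effect with probability 1 - p_i, so
#   P(effect | parents) = 1 - prod over active parents of (1 - p_i).

def noisy_or(p_cause, active):
    """p_cause: P(effect | only parent i active), per parent.
    active: booleans saying which parents are on."""
    p_none = 1.0
    for p_i, on in zip(p_cause, active):
        if on:
            p_none *= (1.0 - p_i)   # this cause fails to sink the ship
    return 1.0 - p_none

# Two destroyers that each sink the ship with probability 0.6 and 0.7:
print(noisy_or([0.6, 0.7], [True, True]))    # 1 - 0.4*0.3 = 0.88
print(noisy_or([0.6, 0.7], [True, False]))   # 0.6
```

Only one number per parent is needed, instead of a full table over all parent combinations.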

Exact Solution of BNs (non-poly-trees)

(Figure: a small five-node network over A, B, C, D, E.)

What is the probability of a specific state, say A=t, B=f, C=t, D=t, E=f? What is the probability that E=t given B=t? Consider the term P(e, b): by factoring common terms out of the summation, it takes 12 instead of 32 multiplications, even in this small example. Alas, optimal factoring is NP-hard.

Other Exact Methods

- Join-tree: merge variables into (small!) sets of variables to make the graph into a poly-tree. Most commonly used; aka clustering, junction-tree, or potential methods.
- Cutset-conditioning: instantiate a (small!) set of variables, then solve each residual problem, and add the solutions weighted by the probabilities of the instantiated variables having those values.
All these methods are essentially equivalent, with some time-space tradeoffs.

Approximate Inference in BNs

- Direct Sampling: samples from the joint distribution
- Rejection Sampling: computes P(X | e), uses ancestor evidence nodes in sampling
- Likelihood Weighting: like Rejection Sampling, but weights by the probability of descendant evidence nodes
- Markov chain Monte Carlo: Gibbs and other similar sampling methods

function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a Bayes net specifying the joint distribution P(X1, ..., Xn)
  x := an event with n elements
  for i = 1 to n do
    xi := a random sample from P(Xi | Parents(Xi))
  return x

From a large number of samples, we can estimate all joint probabilities. The probability of an event is the fraction of all complete events generated by Prior-Sample that match the partially specified event; hence we can compute all conditionals, etc. (A Python rendering of Prior-Sample follows below.)
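A Python rendering of Prior-Sample, assuming a hypothetical network encoding as a topologically ordered list of (name, parents, CPT) triples; the CPT numbers are the alarm-network values from earlier:

```python
import random

# Each CPT maps a tuple of parent values to P(variable = True | parents).
ALARM_NET = [
    ("B", [], {(): 0.001}),
    ("E", [], {(): 0.002}),
    ("A", ["B", "E"], {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    ("J", ["A"], {(True,): 0.9, (False,): 0.05}),
    ("M", ["A"], {(True,): 0.7, (False,): 0.01}),
]

def prior_sample(bn):
    """Return one complete event sampled from the joint distribution of bn."""
    event = {}
    for name, parents, cpt in bn:            # parents come earlier in the list
        p_true = cpt[tuple(event[p] for p in parents)]
        event[name] = random.random() < p_true
    return event

# Estimate P(J=t, M=t) as the fraction of matching samples:
N = 100_000
hits = sum(1 for _ in range(N)
           if (s := prior_sample(ALARM_NET))["J"] and s["M"])
print(hits / N)
```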

Rejection Sampling

function Rejection-Sample(X, e, bn, N) returns an estimate of P(X | e)
  inputs: bn, a Bayes net
          X, the query variable
          e, evidence specified as an event
          N, the number of samples to be generated
  local: K, a vector of counts over values of X, initially 0
  for j = 1 to N do
    y := Prior-Sample(bn)
    if y is consistent with e then K[v] := K[v] + 1, where v is the value of X in y
  return Normalize(K)

Uses Prior-Sample to estimate the proportion of times each value of X appears in samples that are consistent with e. But most samples may be irrelevant to a specific query, so this is quite inefficient.

Likelihood Weighting

In trying to compute P(X | e), where e is the evidence (variables with known, observed values):
- Sample only the variables other than those in e.
- Weight each sample by how well it predicts e:
    S_WS(z, e) w(z, e) = [ Prod_{i=1..l} P(zi | Parents(Zi)) ] [ Prod_{i=1..m} P(ei | Parents(Ei)) ] = P(z, e)

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X | e)
  inputs: bn, a Bayes net
          X, the query variable
          e, evidence specified as an event
          N, the number of samples to be generated
  local: W, a vector of weighted counts over values of X, initially 0
  for j = 1 to N do
    y, w := Weighted-Sample(bn, e)
    if y is consistent with e then W[v] := W[v] + w, where v is the value of X in y
  return Normalize(W)

function Weighted-Sample(bn, e) returns an event and a weight
  x := an event with n elements; w := 1
  for i = 1 to n do
    if Xi has a value xi in e
      then w := w * P(Xi = xi | Parents(Xi))
      else xi := a random sample from P(Xi | Parents(Xi))
  return x, w

Markov chain Monte Carlo (MCMC)

function MCMC(X, e, bn, N) returns an estimate of P(X | e)
  local: K, a vector of counts over values of X, initially 0
         Z, the non-evidence variables in bn (includes X)
         x, the current state of the network, initially a copy of e
  initialize x with random values for the variables in Z
  for j = 1 to N do
    for each Zi in Z do
      sample the value of Zi in x from P(Zi | mb(Zi)), given the values of mb(Zi) in x
      K[v] := K[v] + 1, where v is the value of X in x
  return Normalize(K)

Wander incrementally from the last state sampled, instead of regenerating a completely new sample. For every unobserved variable, choose a new value according to its probability given the values of the variables in its Markov blanket (remember, it is independent of all other variables given the blanket). After each change, tally the sample for its value of X; this value will only change sometimes. Problem: narrow passages. (A Python sketch of likelihood weighting follows below.)
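A Python sketch of likelihood weighting in the same hypothetical (name, parents, CPT) encoding used in the Prior-Sample sketch above:

```python
import random

def weighted_sample(bn, evidence):
    """Sample non-evidence variables; weight by P(evidence vars | parents)."""
    event, weight = {}, 1.0
    for name, parents, cpt in bn:
        p_true = cpt[tuple(event[p] for p in parents)]
        if name in evidence:
            event[name] = evidence[name]
            weight *= p_true if evidence[name] else (1.0 - p_true)
        else:
            event[name] = random.random() < p_true
    return event, weight

def likelihood_weighting(query, evidence, bn, n_samples=100_000):
    """Estimate P(query = True | evidence) as a weighted fraction of samples."""
    w_true = w_total = 0.0
    for _ in range(n_samples):
        event, w = weighted_sample(bn, evidence)
        w_total += w
        if event[query]:
            w_true += w
    return w_true / w_total

# e.g., P(Burglary | JohnCalls = true, MaryCalls = true), using the ALARM_NET
# structure from the Prior-Sample sketch:
# print(likelihood_weighting("B", {"J": True, "M": True}, ALARM_NET))
```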

Most Probable Explanation

So far, we have been solving for P(X | e), which yields a distribution over all possible values of the X's. What if we want the best explanation of a set of evidence, i.e., the highest-probability set of values for the X's given e? Just maximize over the "don't care" variables rather than summing. Note that this is not necessarily the same as choosing the individually most probable value of each X.

Rules and Probabilities

Many have wanted to put a probability on assertions and on rules, and compute with likelihoods, e.g., Mycin's certainty-factor framework:
  A (p=.3) & B (p=.7) ==p=.8==> C (p=?)
Problems:
- How to combine the uncertainties of the preconditions and of the rule
- How to combine evidence from multiple rules
Theorem: there is NO such algebra that works when rules are considered independently. We need a BN for a consistent model of probabilistic inference.

How to Build BN Models

- Human expertise, e.g., as in building the Acute Renal Failure program or Pathfinder
- Human expertise to determine structure, data to determine parameters:
  - Point parameter estimation, smoothing (a small smoothing sketch appears below)
  - Useful distributions: commonly Beta (binomial) and Dirichlet (multinomial), or any of the exponential family, e.g., normal, Poisson, Gamma, etc.
- Automated methods to discover structure and parameters: the best model is the one that predicts the highest probability for the data actually observed

Learning Structure

In general, we are trying to determine not only the parameters for a known structure but in fact which structure is best (or the probability of each structure, so that we can average over structures to make a prediction). See Heckerman's excellent tutorial: http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=msr-tr-95-06
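A tiny sketch of smoothed point estimation with a Beta prior (the conjugate prior mentioned above); the specific counts are made up:

```python
# Sketch: posterior-mean estimate of a binary CPT entry under a Beta(a, b) prior.
# With a = b = 1 this is Laplace ("add-one") smoothing.

def beta_point_estimate(successes, trials, a=1.0, b=1.0):
    """Posterior mean of P(X = true) under a Beta(a, b) prior."""
    return (successes + a) / (trials + a + b)

# Maximum likelihood would give 0/3 = 0 for an event never seen in 3 trials;
# the smoothed estimate avoids assigning it probability zero:
print(beta_point_estimate(0, 3))    # 0.2
print(beta_point_estimate(7, 10))   # ~0.667
```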

Structure Learning

Recall that a Bayes network is fully specified by a DAG G that gives the (in)dependencies among the variables, together with the collection of parameters theta that define the conditional probability tables for each of the variables. We define the Bayesian score of a structure from its posterior,
  P(G | D) ~ P(D | G) P(G),   where   P(D | G) = Integral P(D | G, theta) P(theta | G) d theta.
Its ingredients:
- the usual marginal likelihood calculation,
- the parameter priors P(theta | G),
- a penalty for the complexity of the graph, via P(G).
This defines a search problem over all possible graphs and parameters. The search space explodes with the number of variables n (counts as given on the slide):

  n   count            n   count            n    count
  1   2.000000e+00     6   4.947802e+13     11   1.061171e+44
  2   3.200000e+01     7   2.837268e+18     12   1.068209e+52
  3   3.072000e+03     8   7.437727e+23     13   4.659610e+60
  4   1.572864e+06     9   8.773900e+29     14   8.755632e+69
  5   4.026532e+09     10  4.600050e+36     15   7.050966e+79

Searching for Models

How many possible DAGs are there for n variables? An upper bound is the number of all possible directed graphs on n variables, but not all of those are DAGs. To get a closer estimate, imagine that we order the variables so that the parents of each variable come before it in the ordering. Then there are n! possible orderings, and the j-th variable can have any subset of the previous variables as parents. If we can choose a particular ordering, say based on prior knowledge, then we need consider merely 2^(n(n-1)/2) models. If we further restrict |Parents(X)| to no more than k, each variable has only O(n^k) candidate parent sets; this is actually practical.

Search actions: add, delete, or reverse an arc. Hill-climb on P(D | G) or on P(G | D), with all the usual tricks in search: simulated annealing, random restarts, ... (A sketch of such a search follows below.)

Caution about Hidden Variables

Suppose you are given a dataset containing data on patients' smoking, diet, exercise, chest pain, fatigue, and shortness of breath. You would probably learn a model like the one below left. If you can hypothesize a hidden variable that is not in the data set, e.g., heart disease, the learned network might be much simpler, such as the one below right. But there are potentially infinitely many such hidden variables.

(Figures: the learned network over the observed variables vs. a simpler network with a hidden heart-disease node; re-learning the ALARM network from 10,000 samples, going from a starting network of complete independence, via data sampled from the original network, to the learned network.)
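A sketch of greedy structure search with add/delete/reverse-arc moves, hill-climbing on a BIC-style score; the graph and data formats, the scoring details, and the toy data are my assumptions for illustration, not the slides' definitions:

```python
import itertools, math, random

def parents(graph, v):
    """graph maps each variable to its set of children."""
    return sorted(u for u in graph if v in graph[u])

def is_dag(graph):
    seen, stack = set(), set()
    def visit(v):
        if v in stack: return False
        if v in seen: return True
        seen.add(v); stack.add(v)
        ok = all(visit(w) for w in graph[v])
        stack.discard(v)
        return ok
    return all(visit(v) for v in graph)

def bic_score(graph, data):
    """log P(D | G, ML params) minus (log N)/2 per free parameter."""
    n, score = len(data), 0.0
    for v in graph:
        ps, counts = parents(graph, v), {}
        for row in data:
            c = counts.setdefault(tuple(row[p] for p in ps), [0, 0])
            c[row[v]] += 1
        for c0, c1 in counts.values():
            for c in (c0, c1):
                if c:
                    score += c * math.log(c / (c0 + c1))
        score -= 0.5 * math.log(n) * (2 ** len(ps))   # one parameter per parent config
    return score

def neighbors(graph):
    """All DAGs reachable by adding, deleting, or reversing one arc."""
    for u, v in itertools.permutations(graph, 2):
        g = {x: set(graph[x]) for x in graph}
        if v in g[u]:
            g[u].discard(v)                            # delete u -> v
            yield g
            g2 = {x: set(g[x]) for x in g}
            g2[v].add(u)                               # reverse to v -> u
            if is_dag(g2): yield g2
        else:
            g[u].add(v)                                # add u -> v
            if is_dag(g): yield g

def hill_climb(variables, data):
    current = {v: set() for v in variables}            # start: complete independence
    best, improved = bic_score(current, data), True
    while improved:
        improved = False
        for g in neighbors(current):
            s = bic_score(g, data)
            if s > best:
                current, best, improved = g, s, True
                break                                  # first-improvement step
    return current, best

# Toy data: X influences Y strongly, Z is independent.
random.seed(0)
data = [dict(X=(x := random.random() < 0.5),
             Y=(random.random() < (0.9 if x else 0.1)),
             Z=(random.random() < 0.3)) for _ in range(2000)]
print(hill_climb(["X", "Y", "Z"], data))
```

On this toy data the search should keep Z isolated and add an arc between X and Y (in either direction, since both orientations score equally here).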