Directed Graphical Models (aka. Bayesian Networks)

School of Computer Science 10-601B Introduction to Machine Learning Directed Graphical Models (aka. Bayesian Networks) Readings: Bishop 8.1 and 8.2.2 Mitchell 6.11 Murphy 10 Matt Gormley Lecture 21 November 9, 2016 1

Reminders: Homework 6 due Mon., Nov. 21; Final Exam in-class Wed., Dec. 7. 2

Outline Motivation: Structured Prediction. Background: Conditional Independence, Chain Rule of Probability. Directed Graphical Models: Bayesian Network definition, Qualitative Specification, Quantitative Specification, Familiar Models as Bayes Nets, Example: The Monty Hall Problem. Conditional Independence in Bayes Nets: Three case studies, D-separation, Markov blanket. 3

MOTIVATION 4

Structured Prediction Most of the models we've seen so far were for classification: Given observations: x = (x_1, x_2, ..., x_K) Predict a (binary) label: y Many real-world problems require structured prediction: Given observations: x = (x_1, x_2, ..., x_K) Predict a structure: y = (y_1, y_2, ..., y_J) Some classification problems benefit from latent structure 5

Structured Prediction Examples Examples of structured prediction: Part-of-speech (POS) tagging Handwriting recognition Speech recognition Word alignment Congressional voting Examples of latent structure: Object recognition 6

Data: Dataset for Supervised Part-of-Speech (POS) Tagging D = {x (n), y (n)} N n=1 Sample 1: x (1) = "time flies like an arrow", y (1) = n v p d n. Sample 2: x (2) = "time flies like an arrow", y (2) = n n v d n. Sample 3: x (3) = "flies fly with their wings", y (3) = n v p n n. Sample 4: x (4) = "with time you will see", y (4) = p n n v v. 7

Data: Dataset for Supervised Handwriting Recognition D = {x (n), y (n)} N n=1 Sample 1: y (1) = "u n e x p e c t e d", x (1) = the handwritten word image. Sample 2: y (2) = "v o l c a n i c", x (2). Sample 3: y (3) = "e m b r a c e s", x (3). [Fig. 5. Handwriting recognition: Example words from the dataset used. Figures from (Chatzis & Demiris, 2013)] 8

Data: Dataset for Supervised Phoneme (Speech) Recognition D = {x (n), y (n)} N n=1 Sample 1: y (1) = "h# dh ih s w uh z iy z iy", x (1) = the speech signal. Sample 2: y (2) = "f ao r ah s s h#", x (2). [Figures from (Jansen & Niyogi, 2013)] 9

Application: Word Alignment / Phrase Extraction Variables (boolean): For each (Chinese phrase, English phrase) pair, are they linked? Interactions: Word fertilities Few jumps (discontinuities) Syntactic reorderings ITG constraint on alignment Phrases are disjoint (?) (Burkett & Klein, 2012) 10

Application: Congressional Voting Variables: Text of all speeches of a representative Local contexts of references between two representatives Interactions: Words used by representative and their vote Pairs of representatives and their local context (Stoyanov & Eisner, 2012) 11

Structured Prediction Examples Examples of structured prediction: Part-of-speech (POS) tagging Handwriting recognition Speech recognition Word alignment Congressional voting Examples of latent structure: Object recognition 12

Case Study: Object Recognition Data consists of images x and labels y. [Figure: four example image/label pairs — x(1): pigeon, x(2): rhinoceros, x(3): leopard, x(4): llama.] 13

Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: a leopard image with its label.] 14

Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: the leopard image with latent patch variables Z_i attached to image patches X_i, connected to the label Y.] 15

Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: the same model with potentials ψ linking the Z_i, X_i, and Y variables.] 16

Structured Prediction Preview of challenges to come Consider the task of finding the most probable assignment to the output. Classification: ŷ = argmax_y p(y | x), where y ∈ {+1, -1}. Structured Prediction: ŷ = argmax_y p(y | x), where y ∈ Y and |Y| is very large. 17

Machine Learning The data inspires the structures we want to predict. Inference finds {best structure, marginals, partition function} for a new observation. (Inference is usually called as a subroutine in learning.) Our model defines a score for each structure; it also tells us what to optimize. Learning tunes the parameters of the model. [Figure: ML as the intersection of Domain Knowledge, Mathematical Modeling, Optimization, and Combinatorial Optimization.] 18

Machine Learning [Figure: the pipeline — Data, Model, Objective, Inference, Learning — illustrated with the sentences "Alice saw Bob on a hill with a telescope" and "time flies like an arrow" and variables X_1, ..., X_5.] (Inference is usually called as a subroutine in learning.) 19

BACKGROUND 20

Background: Chain Rule of Probability For random variables A and B: P(A, B) = P(A | B) P(B) For random variables X_1, X_2, X_3, X_4: P(X_1, X_2, X_3, X_4) = P(X_1 | X_2, X_3, X_4) P(X_2 | X_3, X_4) P(X_3 | X_4) P(X_4) 21
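A quick numerical check can make the identity concrete. The sketch below (Python, with a made-up joint table that is an assumption, not from the slides) verifies that P(A, B) = P(A | B) P(B) holds for every assignment.

```python
# Minimal sketch (toy numbers): verify the chain rule P(A,B) = P(A|B) P(B)
# on a small joint distribution over two binary variables.
import itertools

# Hypothetical joint table P(A, B); any non-negative table summing to 1 works.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_B(b):
    return sum(p for (a, b2), p in joint.items() if b2 == b)

def conditional_A_given_B(a, b):
    return joint[(a, b)] / marginal_B(b)

for a, b in itertools.product([0, 1], repeat=2):
    lhs = joint[(a, b)]
    rhs = conditional_A_given_B(a, b) * marginal_B(b)
    assert abs(lhs - rhs) < 1e-12
print("chain rule holds for every (a, b)")
```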

Background: Conditional Independence Random variables A and B are conditionally independent given C if: P(A, B | C) = P(A | C) P(B | C) (1) or equivalently: P(A | B, C) = P(A | C) (2) We write this as: A ⊥ B | C (3) Later we will also write: I<A, {C}, B> 22
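The definition can also be checked mechanically. In the sketch below (Python, with assumed toy CPTs that are not part of the lecture), a joint over binary A, B, C is built so that A and B are conditionally independent given C, and P(A, B | C) = P(A | C) P(B | C) is verified by brute-force summation.

```python
# Minimal sketch (assumed CPTs): construct a joint where A _|_ B | C holds by
# construction, then verify the defining identity on every assignment.
import itertools

P_C = {0: 0.6, 1: 0.4}
P_A_given_C = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
P_B_given_C = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}

joint = {(a, b, c): P_C[c] * P_A_given_C[c][a] * P_B_given_C[c][b]
         for a, b, c in itertools.product([0, 1], repeat=3)}

def prob(pred):
    """Sum the joint over all assignments (a, b, c) satisfying pred."""
    return sum(v for (a, b, c), v in joint.items() if pred(a, b, c))

for a0, b0, c0 in itertools.product([0, 1], repeat=3):
    p_c = prob(lambda a, b, c: c == c0)
    lhs = joint[(a0, b0, c0)] / p_c                              # P(A, B | C)
    rhs = (prob(lambda a, b, c: a == a0 and c == c0) / p_c) * \
          (prob(lambda a, b, c: b == b0 and c == c0) / p_c)      # P(A|C) P(B|C)
    assert abs(lhs - rhs) < 1e-12
print("A _|_ B | C verified on this joint")
```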

Bayesian Networks DIRECTED GRAPHICAL MODELS 23

Whiteboard Writing Joint Distributions Strawman: Giant Table Alternate #1: Rewrite using chain rule Alternate #2: Assume full independence Alternate #3: Drop variables from RHS of conditionals 24

Bayesian Network Definition: P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | parents(X_i)) [Figure: an example DAG over X_1, ..., X_5.] 25

Bayesian Network Definition: p(X_1, X_2, X_3, X_4, X_5) = p(X_5 | X_3) p(X_4 | X_2, X_3) p(X_3) p(X_2 | X_1) p(X_1) [Figure: the DAG with edges X_1 → X_2, X_2 → X_4, X_3 → X_4, X_3 → X_5.] 26

Bayesian Network Definition: p(X_1, X_2, X_3, X_4, X_5) = p(X_5 | X_3) p(X_4 | X_2, X_3) p(X_3) p(X_2 | X_1) p(X_1) A Bayesian Network is a directed graphical model. It consists of a graph G and the conditional probabilities P. These two parts fully specify the distribution: Qualitative Specification: G Quantitative Specification: P 27
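As a concrete illustration of how G and P together determine the joint, here is a small Python sketch of the five-node example above. The graph structure matches the factorization on the slide; the CPT numbers are made up for illustration.

```python
# Sketch (assumed CPT values): the factored joint for the example DAG
# X1 -> X2, X2 -> X4, X3 -> X4, X3 -> X5, with all variables binary.
import itertools

p_x1 = {0: 0.4, 1: 0.6}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3 = {0: 0.5, 1: 0.5}
p_x4_given_x2x3 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
                   (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.1, 1: 0.9}}
p_x5_given_x3 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

def joint(x1, x2, x3, x4, x5):
    """p(X1..X5) = p(X5|X3) p(X4|X2,X3) p(X3) p(X2|X1) p(X1)."""
    return (p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3[x3]
            * p_x4_given_x2x3[(x2, x3)][x4] * p_x5_given_x3[x3][x5])

total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=5))
print(f"joint sums to {total:.6f}")   # any valid CPTs give exactly 1.0
```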

Qualitative Specification Where does the qualitative specification come from? Prior knowledge of causal relationships Prior knowledge of modular relationships Assessment from experts Learning from data We simply link a certain architecture (e.g. a layered graph) Eric Xing @ CMU, 2006-2011 28

If time Whiteboard Example: 2016 Presidential Election 29

Towards quantitative specification of probability distribution Separation properties in the graph imply independence properties about the associated variables. For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents. The Equivalence Theorem: For a graph G, let D_1 denote the family of all distributions that satisfy I(G), and let D_2 denote the family of all distributions that factor according to G. Then D_1 = D_2. Eric Xing @ CMU, 2006-2011 30

Quantitative Specification [Figure: nodes A, B, C.] p(A, B, C) = ? Eric Xing @ CMU, 2006-2011 31

Conditional probability tables (CPTs) P(a, b, c, d) = P(a) P(b) P(c|a,b) P(d|c) [DAG: A → C, B → C, C → D.]
P(A): a0 0.75, a1 0.25. P(B): b0 0.33, b1 0.67.
P(C|A,B): given (a0, b0): c0 0.45, c1 0.55; (a0, b1): c0 1.0, c1 0.0; (a1, b0): c0 0.9, c1 0.1; (a1, b1): c0 0.7, c1 0.3.
P(D|C): given c0: d0 0.3, d1 0.7; given c1: d0 0.5, d1 0.5.
Eric Xing @ CMU, 2006-2011 32
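These CPTs are enough to evaluate the whole joint. The sketch below encodes the tables from the slide in Python and computes P(a, b, c, d) = P(a) P(b) P(c|a,b) P(d|c); only the code itself is new, the numbers come from the slide.

```python
# Sketch using the slide's CPT numbers (A, B root nodes; C depends on A, B;
# D depends on C): evaluate the factored joint and check it normalizes.
import itertools

P_A = {0: 0.75, 1: 0.25}
P_B = {0: 0.33, 1: 0.67}
P_C_given_AB = {  # keyed by (a, b); values give P(C=0), P(C=1)
    (0, 0): {0: 0.45, 1: 0.55},
    (0, 1): {0: 1.0,  1: 0.0},
    (1, 0): {0: 0.9,  1: 0.1},
    (1, 1): {0: 0.7,  1: 0.3},
}
P_D_given_C = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c, d):
    return P_A[a] * P_B[b] * P_C_given_AB[(a, b)][c] * P_D_given_C[c][d]

print(joint(0, 0, 0, 0))                      # 0.75 * 0.33 * 0.45 * 0.3
total = sum(joint(*v) for v in itertools.product([0, 1], repeat=4))
print(f"sums to {total:.4f}")                 # = 1, since every CPT row sums to 1
```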

Conditional probability density functions (CPDs) P(a, b, c, d) = P(a) P(b) P(c|a,b) P(d|c) A ~ N(µ_a, Σ_a), B ~ N(µ_b, Σ_b), C ~ N(A + B, Σ_c), D ~ N(µ_a + C, Σ_a) [Figure: the DAG A → C, B → C, C → D, with P(D|C) drawn as a conditional density over D and C.] Eric Xing @ CMU, 2006-2011 33
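With Gaussian CPDs, the same factorization supports ancestral sampling: draw each node given its parents in topological order. The sketch below assumes concrete means and standard deviations, and uses D | C ~ N(C, σ_d²) rather than the slide's exact parameterization, purely for illustration.

```python
# Sketch (assumed parameters): ancestral sampling from a linear-Gaussian
# network A -> C <- B, C -> D, where each CPD is a Gaussian whose mean is a
# linear function of the node's parents.
import random

def sample_once(mu_a=0.0, mu_b=1.0, sigma_a=1.0, sigma_b=0.5,
                sigma_c=0.2, sigma_d=0.3):
    a = random.gauss(mu_a, sigma_a)       # A ~ N(mu_a, sigma_a^2)
    b = random.gauss(mu_b, sigma_b)       # B ~ N(mu_b, sigma_b^2)
    c = random.gauss(a + b, sigma_c)      # C | A, B ~ N(A + B, sigma_c^2)
    d = random.gauss(c, sigma_d)          # D | C ~ N(C, sigma_d^2)
    return a, b, c, d

samples = [sample_once() for _ in range(10000)]
mean_d = sum(s[3] for s in samples) / len(samples)
print(f"empirical E[D] ≈ {mean_d:.2f}  (should be near mu_a + mu_b = 1.0)")
```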

Conditional Independencies [Figure: label Y with feature children X_1, X_2, ..., X_{n-1}, X_n.] What is this model 1. when Y is observed? 2. when Y is unobserved? Eric Xing @ CMU, 2006-2011 34

Conditionally Independent Observations [Figure: model parameters θ with children X_1, X_2, ..., X_{n-1}, X_n; Data = {y_1, ..., y_n}.] Eric Xing @ CMU, 2006-2011 35

Plate Notation [Figure: model parameters θ with child X_i inside a plate i=1:n; Data = {x_1, ..., x_n}.] Plate = rectangle in graphical model; variables within a plate are replicated in a conditionally independent manner. Eric Xing @ CMU, 2006-2011 36

Example: Gaussian Model [Figure: parameters µ, σ with child x_i inside a plate i=1:n.] Generative model: p(x_1, ..., x_n | µ, σ) = ∏_i p(x_i | µ, σ) = p(data | parameters) = p(D | θ), where θ = {µ, σ}. Likelihood = p(data | parameters) = p(D | θ) = L(θ). Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters. Often easier to work with log L(θ). Eric Xing @ CMU, 2006-2011 37
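Since the plate model factors the data as an iid product, the log-likelihood is just a sum of per-example Gaussian log-densities. A minimal sketch, with a hypothetical dataset and two candidate parameter settings:

```python
# Sketch: log L(theta) = sum_i log p(x_i | mu, sigma) for the iid Gaussian
# plate model above, evaluated on a small made-up dataset.
import math

def log_likelihood(data, mu, sigma):
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

data = [2.1, 1.9, 2.4, 2.0, 1.7]                  # hypothetical observations
print(log_likelihood(data, mu=2.0, sigma=0.5))    # higher (less negative) ...
print(log_likelihood(data, mu=0.0, sigma=0.5))    # ... than for a poor mu
```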

Bayesian models [Figure: plate model with parameter θ and data x_i, i=1:n.] Eric Xing @ CMU, 2006-2011 38

More examples Density estimation: parametric and nonparametric methods. Regression: linear, conditional mixture, nonparametric. Classification: generative and discriminative approach. [Figures: a small graphical model for each task.] Eric Xing @ CMU, 2006-2011 39

EXAMPLE: THE MONTY HALL PROBLEM Slide from William Cohen Extra slides from last semester 40

The (highly practical) Monty Hall problem You're in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing a goat! You now can either stick with your guess, always change doors, or flip a coin and pick a new door randomly according to the coin. Slide from William Cohen Extra slides from last semester

The (highly practical) Monty Hall problem Extra slides from last semester You're in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing a goat! You now can either stick with your guess or change doors. Stick, or swap? [Network: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess.] P(A): 1 0.33, 2 0.33, 3 0.33. P(B): 1 0.33, 2 0.33, 3 0.33. P(D): Stick 0.5, Swap 0.5. P(C|A,B), e.g.: (A=1, B=1, C=2) 0.5; (1, 1, 3) 0.5; (1, 2, 3) 1.0; (1, 3, 2) 1.0; in general P(C=c | A=a, B=b) = 1.0 if (a ≠ b) and (c ∉ {a, b}); 0.5 if (a = b) and (c ∉ {a, b}); 0 otherwise. Slide from William Cohen

The (highly practical) Monty Hall problem Extra slides from last semester P(E=e | A, C, D) = 1.0 if (e = a) and (d = stick); 1.0 if (e ∉ {a, c}) and (d = swap); 0 otherwise. If you stick: you win if your first guess was right. If you swap: you win if your first guess was wrong. [Network and CPTs for A, B, C, D as on the previous slide.] Slide from William Cohen

The (highly practical) Monty Hall problem We could construct the joint and compute P(E=B | D=swap) again by the chain rule: P(A,B,C,D,E) = P(E | A,C,D) * P(D) * P(C | A,B) * P(B) * P(A) [Network and CPTs as before.] Slide from William Cohen Extra slides from last semester

The (highly practical) Monty Hall problem We could construct the joint and compute P(E=B | D=swap) again by the chain rule: P(A,B,C,D,E) = P(E | A,B,C,D) * P(D | A,B,C) * P(C | A,B) * P(B | A) * P(A) [Network and CPTs as before.] Slide from William Cohen Extra slides from last semester

The (highly practical) Monty Hall problem The joint table has? 3*3*3*2*3 = 162 rows The conditional probability tables (CPTs) shown have? 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows Big questions: why are the CPTs smaller? how much smaller are the CPTs than the joint? can we compute the answers to queries like P(E=B | d) without building the joint probability tables, just using the CPTs? [Network and CPTs as before.] Slide from William Cohen Extra slides from last semester
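This particular query is small enough to answer by brute-force enumeration over the CPTs (no 162-row joint table is ever stored, only summed on the fly). The sketch below encodes the CPTs from the slides in Python and computes P(E = B | D) for both actions, recovering the familiar 1/3 vs. 2/3 answer; the variable names follow the slides, but the code itself is only an illustration.

```python
# Sketch: enumerate the Monty Hall network. A = first guess, B = door with the
# money, C = revealed goat, D = stick/swap, E = second guess.
import itertools

doors = [1, 2, 3]

def p_c_given_ab(c, a, b):
    if c in (a, b):
        return 0.0
    return 1.0 if a != b else 0.5          # Monty avoids the guess and the money

def p_e_given_acd(e, a, c, d):
    if d == "stick":
        return 1.0 if e == a else 0.0
    return 1.0 if e not in (a, c) else 0.0  # swap to the remaining closed door

def joint(a, b, c, d, e):
    return (1/3) * (1/3) * p_c_given_ab(c, a, b) * 0.5 * p_e_given_acd(e, a, c, d)

for action in ("stick", "swap"):
    num = den = 0.0
    for a, b, c, e in itertools.product(doors, repeat=4):
        p = joint(a, b, c, action, e)
        den += p
        if e == b:                          # second guess lands on the money
            num += p
    print(f"P(win | D={action}) = {num / den:.3f}")
# prints 0.333 for stick and 0.667 for swap
```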

The (highly practical) Monty Hall problem Why is the CPTs representation smaller? Follow the money! (B) P(E=e | A=a, B=b, C=c, D=d) = P(E=e | A=a, C=c, D=d), where P(E=e | A, C, D) = 1.0 if (e = a) and (d = stick); 1.0 if (e ∉ {a, c}) and (d = swap); 0 otherwise. E is conditionally independent of B given A, C, D: E ⊥ B | A, C, D, i.e. I<E, {A, C, D}, B>. [Network and CPTs as before.] Slide from William Cohen Extra slides from last semester

The (highly practical) Monty Hall problem What are the conditional independencies? I<A, {B}, C>? I<A, {C}, B>? I<E, {A,C}, B>? I<D, {E}, B>? [Network: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess.] Slide from William Cohen Extra slides from last semester

GRAPHICAL MODELS: DETERMINING CONDITIONAL INDEPENDENCIES Slide from William Cohen

What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true: Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. This follows from P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | parents(X_i)) = ∏_{i=1}^n P(X_i | X_1, ..., X_{i-1}) But what else does it imply? Slide from William Cohen

What Independencies does a Bayes Net Model? Three cases of interest: Cascade, Common Parent, V-Structure [Figures: cascade X → Y → Z; common parent X ← Y → Z; v-structure X → Y ← Z.] 51

What Independencies does a Bayes Net Model? Three cases of interest: Cascade, Common Parent, V-Structure [Figures: the same three graphs with Y shown as observed.] Knowing Y decouples X and Z (cascade and common parent); knowing Y couples X and Z (v-structure). 52
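The v-structure case is the counterintuitive one, and it is easy to see numerically. In the sketch below (Python, with assumed toy CPTs), X and Z are marginally independent in X → Y ← Z, but conditioning on Y makes them dependent ("explaining away"):

```python
# Sketch (assumed CPTs): v-structure X -> Y <- Z with X, Z uniform and
# P(Y=1 | x, z) high whenever either cause is present.
import itertools

p_y1_given_xz = {(x, z): (0.9 if (x or z) else 0.1)
                 for x, z in itertools.product([0, 1], repeat=2)}

def p_y_given_xz(y, x, z):
    p1 = p_y1_given_xz[(x, z)]
    return p1 if y == 1 else 1.0 - p1

joint = {(x, y, z): 0.5 * 0.5 * p_y_given_xz(y, x, z)
         for x, y, z in itertools.product([0, 1], repeat=3)}

def prob(pred):
    return sum(v for (x, y, z), v in joint.items() if pred(x, y, z))

# Marginally independent: P(X=1, Z=1) equals P(X=1) * P(Z=1)
print(prob(lambda x, y, z: x == 1 and z == 1),
      prob(lambda x, y, z: x == 1) * prob(lambda x, y, z: z == 1))

# Conditioning on Y couples them: P(X=1 | Y=1) != P(X=1 | Y=1, Z=1)
p_x1_y1 = prob(lambda x, y, z: x == 1 and y == 1) / prob(lambda x, y, z: y == 1)
p_x1_y1z1 = (prob(lambda x, y, z: x == 1 and y == 1 and z == 1)
             / prob(lambda x, y, z: y == 1 and z == 1))
print(f"P(X=1|Y=1) = {p_x1_y1:.3f}   P(X=1|Y=1,Z=1) = {p_x1_y1z1:.3f}")
```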

Whiteboard Common Parent: proof of conditional independence X ⊥ Z | Y (The other two cases can be shown just as easily.) 53

The Burglar Alarm example Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh! [Network: Burglar → Alarm ← Earthquake, Alarm → Phone Call.] Quiz: True or False? Burglar ⊥ Earthquake | PhoneCall Slide from William Cohen

D-Separation (Definition #1) Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation. Definition: variables X and Z are d-separated (conditionally independent) given a set of evidence variables E iff every undirected path from X to Z is blocked, where a path is blocked iff one or more of the following conditions is true:... i.e. X and Z are dependent iff there exists an unblocked path. Slide from William Cohen

D-Separation (Definition #1) A path is blocked when... There exists a variable Y on the path such that it is in the evidence set E and the arcs putting Y in the path are tail-to-tail. Or, there exists a variable Y on the path such that it is in the evidence set E and the arcs putting Y in the path are tail-to-head. Or,... [Figures: the tail-to-tail and tail-to-head configurations at Y.] Unknown common causes of X and Z impose dependency; unknown causal chains connecting X and Z impose dependency. Slide from William Cohen

D-Separation (Definition #1) A path is blocked when... Or, there exists a variable V on the path such that it is NOT in the evidence set E, neither are any of its descendants, and the arcs putting V on the path are head-to-head. [Figure: the head-to-head configuration at V.] Known common symptoms of X and Z impose dependencies; X may explain away Z. Slide from William Cohen

D-Separation (Definition #2) D-separation criterion for Bayesian networks (D for Directed edges): Definition: variables X and Y are D-separated (conditionally independent) given Z if they are separated in the moralized ancestral graph. [Example figure omitted.] Eric Xing @ CMU, 2006-2011 58
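Definition #2 translates almost directly into code: restrict to the ancestral graph of X ∪ Y ∪ Z, moralize it (connect co-parents, drop directions), delete the evidence nodes, and test whether X and Z are still connected. The sketch below is one possible implementation, tried on the burglar-alarm network from the earlier slide; the function and variable names are my own.

```python
# Sketch: D-separation via the moralized ancestral graph (Definition #2).
from itertools import combinations

def ancestors(dag, nodes):
    """dag maps each node to the set of its parents. Returns nodes plus ancestors."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        for p in dag.get(frontier.pop(), set()):
            if p not in result:
                result.add(p)
                frontier.append(p)
    return result

def d_separated(dag, xs, zs, evidence):
    keep = ancestors(dag, set(xs) | set(zs) | set(evidence))
    # moralize: undirected parent-child edges plus "married" co-parent edges
    adj = {n: set() for n in keep}
    for child in keep:
        parents = dag.get(child, set()) & keep
        for p in parents:
            adj[child].add(p); adj[p].add(child)
        for p, q in combinations(parents, 2):
            adj[p].add(q); adj[q].add(p)
    # delete evidence nodes, then check reachability from xs to zs
    reachable = set(xs) - set(evidence)
    frontier = list(reachable)
    while frontier:
        for nbr in adj[frontier.pop()] - set(evidence):
            if nbr not in reachable:
                reachable.add(nbr)
                frontier.append(nbr)
    return not (reachable & set(zs))

# Burglar-alarm example: Burglar -> Alarm <- Earthquake, Alarm -> PhoneCall
dag = {"Alarm": {"Burglar", "Earthquake"}, "PhoneCall": {"Alarm"}}
print(d_separated(dag, {"Burglar"}, {"Earthquake"}, set()))           # True
print(d_separated(dag, {"Burglar"}, {"Earthquake"}, {"PhoneCall"}))   # False
```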

D-Separation Theorem [Verma & Pearl, 1998]: If a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>. d-separation can be computed in linear time using a depth-first-search-like algorithm. Be careful: d-separation finds what must be conditionally independent. "Might": variables may actually be independent when they're not d-separated, depending on the actual probabilities involved. Slide from William Cohen

Bayes-ball and D-Separation X is d-separated (directed-separated) from Z given Y if we can't send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm illustrated below (plus some boundary conditions). Defn: I(G) = all independence properties that correspond to d-separation: I(G) = { X ⊥ Z | Y : dsep_G(X; Z | Y) } D-separation is sound and complete. Eric Xing @ CMU, 2006-2011 60

Markov Blanket A node is conditionally independent of every other node in the network outside its Markov blanket (its parents, children, and co-parents). [Figure: node X with its ancestor, parent, child, co-parent, and descendent nodes labeled.] Eric Xing @ CMU, 2006-2011 61
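Reading a Markov blanket off the DAG is mechanical: collect the node's parents, its children, and its children's other parents (co-parents). A small sketch, applied to the five-node example used earlier; the representation and names are assumptions for illustration.

```python
# Sketch: the Markov blanket of a node = parents | children | co-parents.
def markov_blanket(dag, node):
    """dag maps each child node to the set of its parents."""
    parents = set(dag.get(node, set()))
    children = {c for c, ps in dag.items() if node in ps}
    coparents = {p for c in children for p in dag[c]} - {node}
    return parents | children | coparents

# Running example: X1 -> X2, X2 -> X4, X3 -> X4, X3 -> X5
dag = {"X2": {"X1"}, "X4": {"X2", "X3"}, "X5": {"X3"}}
print(markov_blanket(dag, "X3"))   # {'X2', 'X4', 'X5'}: children and co-parent
```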

Summary: Bayesian Networks Structure: DAG Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket. Local conditional distributions (CPDs) and the DAG completely determine the joint distribution. Gives causality relationships, and facilitates a generative process. [Figure: a small network over X, Y_1, Y_2.] Eric Xing @ CMU, 2006-2011 62