Bayesian Networks ETICS Pierre-Henri Wuillemin (LIP6 UPMC)


1 Bayesian Networks ETICS 2017 Pierre-Henri Wuillemin (LIP6 UPMC)

2 Outline BNs in a nutshell · Foundations of BNs · Learning BNs · Inference in BNs · Graphical models for copulas · pyAgrum 2 / 94

3 Nutshell 3 / 94

4 Inference in discrete probabilistic models Inference Let A, B be (discrete) random variables. Inference consists in computing the distribution of A given observations on B: P(A | B = ε_b) = P(A | ε_b) = P(ε_b | A) P(A) / P(ε_b) ∝ P(ε_b | A) P(A) = P(A, ε_b). Probabilistic complex systems: P(X_1, ..., X_n). Inference in a complex system: P(X_i | X_j = ε_j) ∝ P(X_i, ε_j). Nutshell 4 / 94
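
To make the definition concrete, here is a minimal sketch (not from the slides) that computes P(A | B = b) from an explicit joint table with numpy; the variable names and the toy numbers are illustrative assumptions.

```python
import numpy as np

# Toy joint distribution P(A, B) over binary A and B (rows: A, columns: B).
# The numbers are arbitrary; they only need to sum to 1.
P_AB = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

b = 1                                        # observation: B = 1
joint_slice = P_AB[:, b]                     # P(A, B = b), proportional to the posterior
posterior = joint_slice / joint_slice.sum()  # P(A | B = b) = P(A, B = b) / P(B = b)

print("P(A | B=1) =", posterior)             # [0.2, 0.8]
```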

5 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is a joint distribution over a set of (discrete) random variables. A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Nutshell 5 / 94

6 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is a joint distribution over a set of (discrete) random variables. A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Nutshell 6 / 94

7 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is a joint distribution over a set of (discrete) random variables. A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Factorization of the joint distribution in a BN: P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | parents(X_i)). Example: P(A, S, T, L, O, B, X, D) = P(A) P(S) P(T | A) P(L | S) P(O | T, L) P(B | S) P(X | O) P(D | O, B) (2^8 = 256 parameters for the full joint vs 32 parameters for the BN). Nutshell 7 / 94
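
Since pyAgrum appears in the outline, here is a hedged sketch of how such a factorized model could be written with it (the node names follow the Asia-like example above, the CPT values are invented, and the exact API may differ slightly between pyAgrum versions).

```python
import pyAgrum as gum

bn = gum.BayesNet("asia-like")
# binary variables A, S, T, L, O, B, X, D
for name in ["A", "S", "T", "L", "O", "B", "X", "D"]:
    bn.add(gum.LabelizedVariable(name, name, 2))

# arcs matching P(A)P(S)P(T|A)P(L|S)P(O|T,L)P(B|S)P(X|O)P(D|O,B)
for tail, head in [("A", "T"), ("S", "L"), ("S", "B"),
                   ("T", "O"), ("L", "O"), ("O", "X"),
                   ("O", "D"), ("B", "D")]:
    bn.addArc(tail, head)

bn.cpt("A").fillWith([0.99, 0.01])   # illustrative numbers only
bn.cpt("S").fillWith([0.5, 0.5])
# ... the remaining CPTs would be filled the same way

print(bn)   # summary: 8 nodes, 8 arcs, number of parameters
```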

8 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Inference: P(dyspnoea)? Nutshell 8 / 94

9 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Inference: P(dyspnoea | smoking)? Nutshell 9 / 94
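
Continuing the hedged pyAgrum sketch, such queries could be asked through one of the library's exact inference engines (LazyPropagation); the network below is built with random CPTs via fastBN, so names and values are purely illustrative.

```python
import pyAgrum as gum

# compact construction of an asia-like structure with random CPTs (hypothetical example)
bn = gum.fastBN("A->T;T->O;L->O;S->L;S->B;O->X;O->D;B->D")

ie = gum.LazyPropagation(bn)
ie.makeInference()
print(ie.posterior("D"))        # prior marginal P(D) (dyspnoea-like node)

ie.setEvidence({"S": 1})        # observe the smoking-like node
ie.makeInference()
print(ie.posterior("D"))        # P(D | S = 1)
```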

10 BN and probabilistic inference Diagnostic: P(A | F). [Figure: an example BN over nodes A to G with its CPTs, illustrating diagnostic, reliability, classification and prediction queries.] Other tasks: process simulation (modelling), forecasting (dynamics, etc.), behavioral analysis (bots, intelligent tutoring systems), Most Probable Case: argmax_X P(X | D), sensitivity analysis, informational analysis (mutual information), etc., decision processes, troubleshooting: argmax P(·) · C(·). Nutshell 10 / 94

11 Application 1 : diagnostic (NASA) NASA: using a BN to monitor the boosters of the space shuttle. [Figure: BN over helium/nitrogen/oxidizer/fuel pressures, leaks, valve and engine failures, and combustion pressure measurements.] Nutshell 11 / 94

12 Application 2 : medical diagnosis The Great Ormond Street Hospital for Sick Children: diagnosis of the causes of cyanosis or heart disease in a child just after birth. [Figure: BN over birth asphyxia, disease, age at presentation, LVH, duct flow, cardiac mixing, lung parenchyma, lung flow, hypoxia distribution, O2/CO2 levels, chest X-ray, grunting, and the corresponding reports.] Nutshell 12 / 94

13 Application 3 : risk analysis Risk modelling using BNs: a modular approach. Nutshell 13 / 94

14 Application 4 : Bayesian classification X (dimension d, features) and Y (dimension 1, often binary but not necessarily). Using a database Π_a = (x^(k), y^(k))_{k ∈ {1,...,N}} (supervised learning), one can estimate the joint distribution P(X, Y). Classification For a vector x of values of X, the goal is to predict the class (value of Y): ŷ. 1 Maximum likelihood (ML): ŷ = argmax_y P(x | y). 2 Maximum a posteriori (MAP): ŷ = argmax_y P(y | x) = argmax_y P(y) P(x | y). These distributions may be hard to estimate: P(X | Y) may involve more parameters than there are records in Π_a!! Nutshell 14 / 94

15 Bayesian classification (2) : naive Bayes How to compute P(X | Y)? Naive Bayes classifier: if we assume ∀k ≠ l, X_k ⊥ X_l | Y, then P(x, y) = P(y) ∏_{k=1}^d P(x_k | y). Very strong assumption! In most cases it is only an approximation; however, this approximation often gives good results. Parameter estimation: trivial (if no missing values). ML: ∏_{k=1}^d P(x_k | y)... MAP: P(y | x_1, ..., x_d): inference in the BN! Nutshell 15 / 94
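
As an illustration of the MAP rule under the naive assumption, here is a small self-contained numpy sketch (the counts and the add-one smoothing are invented for the example).

```python
import numpy as np

# Training data: 6 records, 2 binary features and a binary class y.
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

classes = np.unique(y)
prior = np.array([(y == c).mean() for c in classes])                    # P(y)
# P(x_k = 1 | y = c) with Laplace (add-one) smoothing
likelihood1 = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                        for c in classes])

def predict(x):
    # log P(y) + sum_k log P(x_k | y), then argmax over classes (MAP)
    logpost = np.log(prior) + (np.log(likelihood1) * x
                               + np.log(1 - likelihood1) * (1 - x)).sum(axis=1)
    return classes[np.argmax(logpost)]

print(predict(np.array([1, 1])))   # most probable class for x = (1, 1)
```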

16 Bayesian classification (3) : more complex models TAN: Tree-Augmented Naive models. Every variable X_i can have Y and one other parent among X (only one!). Complete Bayesian network: in a BN including Y and (X_i), infer P(Y | X_1, ..., X_n). Note: there is no need to observe all the X_i; the Markov blanket MB(·) suffices: P(Y | X) = P(Y | MB(Y)). Nutshell 16 / 94

17 Application 5 : dynamic Bayesian networks dBN (dynamic BN) A dynamic BN is a BN where variables are indexed by the time t and by i: X^t = {X^t_1, ..., X^t_N}, and verifies: Markov order 1: P(X^t | X^0, ..., X^{t-1}) = P(X^t | X^{t-1}); homogeneity: P(X^t | X^{t-1}) = ... = P(X^1 | X^0). [Figure: a dBN over X_1, ..., X_5 unrolled from t = 0 to t, and the corresponding 2-TBN.] 2-TBN A dBN is characterized by: its initial conditions P(X^0) and the relations between t−1 and t (time slice). A 2-TBN (2-timeslice BN) allows one to specify a dBN of size T by starting with X^0 and copying T times the pattern (t−1 → t). Nutshell 17 / 94

18 dBN and Markov chain Markov chain A state variable X_n (at time n). Parameters: initial condition P(X_0); transition model P(X_n | X_{n−1}). [Figure: the transition matrix P(X_n | X_{n−1}) and the equivalent dynamic Bayesian network; dBN: X_0 → X_1 → X_2 → X_3 → ...; 2-TBN: X_{n−1} → X_n.] Nutshell 18 / 94
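
A dBN reduced to a Markov chain can be sketched directly with numpy: an initial distribution plus a transition matrix, unrolled over a few steps (all numbers below are invented).

```python
import numpy as np

p0 = np.array([0.9, 0.1])            # initial condition P(X_0)
A = np.array([[0.8, 0.2],            # transition model P(X_n | X_{n-1}),
              [0.3, 0.7]])           # rows indexed by X_{n-1}, columns by X_n

# Exact marginal without evidence: P(X_n) = P(X_{n-1}) @ A, applied n times.
p = p0.copy()
for n in range(1, 6):
    p = p @ A
    print(f"P(X_{n}) =", p)

# Sampling a trajectory from the same 2-TBN pattern.
rng = np.random.default_rng(0)
x = rng.choice(2, p=p0)
traj = [x]
for _ in range(5):
    x = rng.choice(2, p=A[x])
    traj.append(x)
print("sampled trajectory:", traj)
```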

19 dBN and Hidden Markov Model HMM A state variable X_n (at time n), an observation variable Y_n. Parameters: initial condition P(X_0); transition model P(X_n | X_{n−1}); observation model P(Y_n | X_n). [Figure: the equivalent dynamic Bayesian network; dBN: X_0 → X_1 → X_2 → X_3 → ... with each Y_t a child of X_t; 2-TBN: X_{n−1} → X_n, X_n → Y_n.] Nutshell 19 / 94
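
The HMM version adds an observation model; below is a minimal forward-filtering sketch in numpy (toy parameters, hypothetical observation sequence).

```python
import numpy as np

p0 = np.array([0.5, 0.5])               # P(X_0)
A = np.array([[0.9, 0.1],               # P(X_n | X_{n-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],               # P(Y_n | X_n): rows = state, cols = observation
              [0.1, 0.9]])

obs = [0, 1, 1, 0]                      # observed Y_1..Y_4 (made up)

# Forward algorithm: alpha proportional to P(X_n | y_1..y_n)
alpha = p0
for y_n in obs:
    alpha = (alpha @ A) * B[:, y_n]     # predict with A, correct with the likelihood of y_n
    alpha = alpha / alpha.sum()         # normalize to get the filtered distribution
    print(alpha)
```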

20 Foundations Nutshell 20 / 94

21 Joint probability, factorization, conditional independence A joint probability is expensive in memory: p(X, Y, Z) costs K^3. A joint probability can be factorized (chain rule): p(X, Y, Z) = p(X) p(Y | X) p(Z | X, Y), i.e. K + K^2 + K^3. Conditional independence is the key. With X ⊥ Y and Z ⊥ X, Y: p(X, Y, Z) = p(X) p(Y) p(Z), i.e. 3K. Goal: how to describe the list of all conditional independences in a joint probability: {U, V, W ⊆ X with U ⊥ V | W}. Independence models 21 / 94

22 Independence model More generally, this ternary relation between subsets is called separability. Definition (Independence model and separability) Let X be a finite set and let I ⊆ P(X) × P(X) × P(X), i.e. a list of triplets of subsets of X. I is called an independence model. ∀U, V, W ⊆ X, U and V are separated by W (⟨U ⊥ V | W⟩_I) if and only if (U, V, W) ∈ I, i.e. I is the list of all separations found in X. Relation between I and p: I_p = {(U, V, W) ∈ P(X) × P(X) × P(X), U ⊥ V | W} is an independence model: U ⊥ V | W ⟺ ⟨U ⊥ V | W⟩_{I_p}. Reciprocally, if I is an independence model for X, a set of random variables, can we find a distribution P that verifies this list of conditional independences? Independence models 22 / 94

23 Semi-graphoid and graphoid Definition (semi-graphoid) An independence model I is a semi-graphoid if and only if ∀A, B, S, P ⊆ X: 1 Trivial independence: ⟨A ⊥ ∅ | S⟩_I. 2 Symmetry: ⟨A ⊥ B | S⟩_I ⟹ ⟨B ⊥ A | S⟩_I. 3 Decomposition: ⟨A ⊥ (B ∪ P) | S⟩_I ⟹ ⟨A ⊥ B | S⟩_I. 4 Weak union: ⟨A ⊥ (B ∪ P) | S⟩_I ⟹ ⟨A ⊥ B | (S ∪ P)⟩_I. 5 Contraction: ⟨A ⊥ B | (S ∪ P)⟩_I and ⟨A ⊥ P | S⟩_I ⟹ ⟨A ⊥ (B ∪ P) | S⟩_I. Definition (graphoid) An independence model I is a graphoid if and only if ∀A, B, S, P ⊆ X: I is a semi-graphoid and 6 Intersection: ⟨A ⊥ B | (S ∪ P)⟩_I and ⟨A ⊥ P | (S ∪ B)⟩_I ⟹ ⟨A ⊥ (B ∪ P) | S⟩_I. These axioms create a strong structure inside the independence model. Independence models 23 / 94

24 Semi-graphoid and graphoid - representation of the axioms [Figure: graphical representation of axioms 3 (decomposition), 4 (weak union), 5 (contraction) and 6 (intersection) as set diagrams over A, B, S, P.] Independence models 24 / 94

25 Semi-graphoid and graphoid Theorem (probability and graphoid) I_p is a semi-graphoid. If p is positive then I_p is a graphoid. Is there any other kind of ternary relation that is a graphoid? Theorem (Undirected graph and graphoid) Let G = (X, E) be an undirected graph. For U, V, W ⊆ X, ⟨U ⊥ V | W⟩_G if and only if every path from a node in U to a node in V includes a node in W. {⟨U ⊥ V | W⟩_G, U, V, W ⊆ X} is a graphoid. [Figure: an undirected graph over A, B, C, D, E, F, G.] Independence models 25 / 94

26 Graphical model [Diagram: a distribution P(X) and a graph G = (X, E) both induce an independence model I; the question is how the two are related.] Definition (Graphical model) A graphical model is a joint probability distribution P(X) which uses a graph G = (X, E) to represent its conditional independences through separation in the graph. Definition (I-map, D-map, P-map, graph-isomorphism) Let G = (X, E) be a graph and p(X) a distribution. G is a Dependency-map (D-map) for p: (X ⊥ Y | Z)_p ⟹ ⟨X ⊥ Y | Z⟩_G. G is an Independency-map (I-map) for p: ⟨X ⊥ Y | Z⟩_G ⟹ (X ⊥ Y | Z)_p. G is a Perfect-map (P-map) for p: (X ⊥ Y | Z)_p ⟺ ⟨X ⊥ Y | Z⟩_G. p(X) is graph-isomorph if and only if there exists G = (X, E) that is a P-map for p(X). The empty graph (X, ∅) is a D-map for every p(X); the complete graph is an I-map for every distribution p(X). Independence models 26 / 94

27 Undirected graphical model : example 1 Example 1 Let p(D_1, D_2, S) be the joint probability distribution with D_1 and D_2 two (independent) dice and S = D_1 + D_2. (In)dependences in example 1: D_1 ⊥ D_2, but D_1 ⊥̸ S, D_2 ⊥̸ S and D_1 ⊥̸ D_2 | S. [Figure: undirected graph D_1 – S – D_2.] Independence models 27 / 94

28 Undirected graphical model : example 2 Example 2 In a database of users, a strong correlation has been found between L, the ability to read, and P, the size of the shoes. This correlation is explained by the fact that a third variable A (the age) can be < 5. (In)dependences in example 2: L ⊥̸ P, but L ⊥ P | A. [Figure: undirected graph L – A – P.] Independence models 28 / 94

29 Directed graphical model Example 1: D_1 ⊥ D_2 but D_1 ⊥̸ D_2 | S. Example 2: L ⊥̸ P but L ⊥ P | A. Giving more information in the graph by adding orientations. Directed example 1: D_1 → S ← D_2 (collider or V-structure). Directed example 2: P ← A → L. We need a separation criterion for directed graphs: d-separation. Independence models 29 / 94

30 Directed graphical model and d-separation Let C = (x_i)_{i ∈ I} be an undirected path in G = (X, E). x_i is a C-collider if C contains x_{i−1} → x_i ← x_{i+1}. Definition (Active path) Let Z ⊆ X. C is an active path w.r.t. Z if ∀x_i ∈ C: if x_i is a C-collider, then x_i or one of its descendants belongs to Z; else x_i does not belong to Z. If a path is not active w.r.t. Z, it is blocked by Z. Definition (d-separation) Let G = (X, E) be a directed graph and U, V, W ⊆ X. U is d-separated from V by W in G (⟨U ⊥ V | W⟩_G) if and only if every undirected path from U to V is blocked by W. Independence models 30 / 94
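
d-separation can also be checked programmatically; networkx ships a test for it (called d_separated in recent releases, renamed is_d_separator in the newest ones, so the exact name depends on your version). A sketch on the collider example from the previous slide:

```python
import networkx as nx

# D1 -> S <- D2 : the classic collider (two dice and their sum).
G = nx.DiGraph([("D1", "S"), ("D2", "S")])

print(nx.d_separated(G, {"D1"}, {"D2"}, set()))   # True : D1 and D2 are marginally independent
print(nx.d_separated(G, {"D1"}, {"D2"}, {"S"}))   # False: conditioning on the collider activates the path
```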

31 Sampling some d-separations [Figure: example DAG over A, B, C, D, E, F, G used to test d-separation statements.] Independence models 31 / 94

32 Bayesian network and Markov properties Definition (Global Markov Property) G satisfies the GMP for p ⟺ ∀A, B, S ⊆ X, ⟨A ⊥ B | S⟩_G ⟹ A ⊥ B | S, i.e. G is an I-map for p. Definition (Local Markov Property) G satisfies the LMP for p ⟺ ∀X_i ∈ X, {X_i} ⊥ nd(X_i) | Π_{X_i}, where nd(X_i) represents the non-descendant nodes of X_i. Independence models 32 / 94

33 Bayesian network and Markov properties (2) Theorem (when p(X) is positive) GMP ⟺ LMP. Definition (Bayesian network, graphical factorization) Let p(X) be a directed graphical model with the graph G = (X, E). p(X) is a Bayesian network if and only if G is an I-map for p. Theorem Let p(X) be a Bayesian network with the graph G = (X, E); then p(X) = ∏_{X_i ∈ X} p(X_i | Π_{X_i}). Independence models 33 / 94

34 Markov Equivalence class Markov equivalence Two Bayesian networks are equivalent if their graphs represent the same independence model. Definition (Markov equivalence class) The Markov equivalence class for a Bayes net represented by G is the set of all Bayesian networks that are Markov equivalent to G. [Figure: the DAGs over X, Y, Z grouped by Markov equivalence class.] Independence models 34 / 94

35 Causality Independence models 35 / 94

36 Causality and probability Conditional probabilities do not allow us to deal with causality; they even create paradoxes. Simpson's paradox Impact analysis: does a certain drug help to cure the patients? With the values c_1 (cured), d_1 (with drug), d_0 (no drug), M/W (man, woman): P(c_1 | d_1) > P(c_1 | d_0) = 0.5: the drug helps... P(c_1 | d_1, M) = 0.7 < P(c_1 | d_0, M): ... except if the patient is a man... P(c_1 | d_1, W) = 0.2 < P(c_1 | d_0, W): ... or a woman. The conditional probability P(c_1 | d_1) is observational and is not relevant: one wants to give the drug and not merely to observe it, i.e. an intervention on d: P(c_1 | do(d_1)). Conditioning by intervention Let I be the state of the light switch, cause of L: "is there light in the room?". With observational conditioning, P(L | I) and P(I | L) make no distinction between cause and effect; with intervention, P(L | do(I)) = P(L | I) but P(I | do(L)) = P(I). Intervention 36 / 94

37 Causal Bayesian network [Figure: three DAGs over Smoke, Genotype and Cancer.] Depending on the causal structure: P(C | do(S)) = P(C | S), or P(C | do(S)) = P(C), or P(C | do(S)) is unknown: we do not know how to differentiate the causal impact of the two causes from observation alone. With an intermediate variable Smoke → Tar → Cancer, P(C | do(S)) becomes computable: do-calculus. Intervention 37 / 94

38 Markov equivalence class, essential graph Definition (essential graph) A Markov equivalence class can be represented by a partially directed acyclic graph (PDAG): the essential graph. In an essential graph, an arc is directed if every Bayesian network in the equivalence class has the same arc. [Figure: a BN over A–H and its essential graph.] The essential graph can be built from a BN by removing the orientation of the arcs that can be reversed without creating or removing a collider. If the BN is causal, the essential graph indicates the causal arcs that can be learned from a dataset. An undirected arc is an indication of a (possible) latent cause. Intervention 38 / 94

39 Learning Intervention 39 / 94

40 What can we learn? Learning in Bayesian networks The goal of a learning algorithm is to estimate, from a dataset and from priors: the structure of the Bayesian network (is X a parent of Y?); the parameters of the Bayesian network (P(X = 0 | Y = 1)?). The dataset can be complete or incomplete (missing values). The prior knowledge can be, for instance: (part of) the structure of the BN, the probability distribution of certain variables, etc. Therefore there are 4 main classes of learning algorithms for BNs: learning of {parameters | structure} with {complete | incomplete} data. Learning 40 / 94

41 Learning parameters with complete data D: a dataset of M rows (d^A_m, d^B_m, d^C_m, d^E_m) over the variables of the BN (E, A, C, B). Let Θ be the set of all parameters of the model and L(Θ : D) the likelihood: L(Θ : D) = P(D | Θ) = ∏_{m=1}^M P(d_m | Θ) (i.i.d.) = ∏_{m=1}^M P(E = d^E_m, B = d^B_m, A = d^A_m, C = d^C_m | Θ). Parameters learning complete data 41 / 94

42 Learning parameters with complete data (2) Let us rename E, B, A, C as (X_i)_{1 ≤ i ≤ n} with n = 4: L(Θ : D) = ∏_{m=1}^M P(X_1 = d^1_m, X_2 = d^2_m, ..., X_n = d^n_m | Θ) = ∏_{m=1}^M ∏_{i=1}^n P(X_i | Pa_i, Θ) = ∏_{i=1}^n ∏_{m=1}^M P(X_i | Pa_i, Θ_i), hence L(Θ : D) = ∏_{i=1}^n L_i(Θ_i : D). The estimation of the parameters can be decomposed into the estimation of the conditional probability table of each node. No need for a single global dataset: heterogeneous learning. Parameters learning complete data 42 / 94

43 Maximization of the likelihood for a single variable Let X be a binary variable with θ = P(X = 1): Θ = {θ, 1 − θ}, D = (1, 0, 0, 1, 1), L(Θ : D) = ∏_m P(X = d_m | Θ). Here L(Θ : D) = θ (1 − θ)(1 − θ) θ θ, and θ̂ = argmax_θ (θ^3 (1 − θ)^2). [Figure: plot of θ^3 (1 − θ)^2.] Generalization For X a random variable which can take the values (1, ..., r), with Θ_X = (θ_1, ..., θ_r) where θ_i = P(X = i), and N_i = #_D(X = i) the number of occurrences of i in the dataset D: L(Θ_X : D) = ∏_{i=1}^r θ_i^{N_i} and Θ̂_X = argmax_{Θ_X} L(Θ_X : D). Parameters learning complete data 43 / 94
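
The single-variable maximization can be verified numerically; here is a short sketch that evaluates θ^3 (1 − θ)^2 on a grid and compares the argmax with the closed-form N_1/N.

```python
import numpy as np

D = np.array([1, 0, 0, 1, 1])
thetas = np.linspace(0.001, 0.999, 999)
lik = thetas ** D.sum() * (1 - thetas) ** (len(D) - D.sum())   # L(theta : D) = theta^3 (1 - theta)^2

print("argmax on the grid :", thetas[np.argmax(lik)])          # ~ 0.6
print("closed form N_1/N  :", D.sum() / len(D))                # 3/5 = 0.6
```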

44 Maximizing the likelihood in a Bayesian network θ_ijk = P(X_i = k | Pa_i = j), N_ijk = #_D(X_i = k, Pa_i = j), k ∈ {1, ..., r_i}, j ∈ {1, ..., q_i}. L(Θ : D) = ∏_{i=1}^n L_i(Θ_i : D) = ∏_{i=1}^n ∏_{m=1}^M P(X_i = k_m | Pa_i = j_m, Θ_i) = ∏_{i=1}^n ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{N_ijk}. LL(Θ : D) = ∑_{i=1}^n ∑_{m=1}^M log P(X_i | Pa_i, Θ_i) = ∑_{i=1}^n ∑_{j=1}^{q_i} ∑_{k=1}^{r_i} N_ijk log θ_ijk. Since ∑_k θ_ijk = 1, θ_{ij r_i} = 1 − ∑_{k=1}^{r_i−1} θ_ijk, hence LL(Θ : D) = ∑_{i=1}^n ∑_{j=1}^{q_i} ( ∑_{k=1}^{r_i−1} N_ijk log θ_ijk + N_{ij r_i} log(1 − ∑_{k=1}^{r_i−1} θ_ijk) ). We are looking for Θ̂ that maximizes L(Θ : D), hence LL(Θ : D), i.e. such that ∀i, j, k: ∂LL/∂θ_ijk (Θ̂) = N_ijk/θ̂_ijk − N_{ij r_i}/θ̂_{ij r_i} = 0. Finally N_{ij1}/θ̂_{ij1} = ... = N_{ij(r_i−1)}/θ̂_{ij(r_i−1)} = N_{ij r_i}/θ̂_{ij r_i} (and ∑_k θ̂_ijk = 1), so ∀k ∈ {1, ..., r_i}: θ̂_ijk = N_ijk / N_ij, with N_ij = ∑_{k=1}^{r_i} N_ijk. Parameters learning complete data 44 / 94

45 Bayesian prediction The a posteriori distribution of Θ is P(Θ | D): P(Θ | D) ∝ P(D | Θ) P(Θ) = L(Θ : D) P(Θ). Goal: take into account a prior on Θ in order to integrate expert knowledge or to stabilize the estimation if the dataset is too small. The Dirichlet distribution is the conjugate prior of the categorical and multinomial distributions. Dirichlet distribution: f(p_1, ..., p_K ; α_1, ..., α_K) ∝ ∏_{i=1}^K p_i^{α_i − 1} (where ∑_i p_i = 1). f can be interpreted as P(P(X = i) = p_i | #_{X=i} = α_i − 1). If the prior P(Θ) is a Dirichlet distribution: P(Θ) ∝ ∏_{i=1}^n ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{α_ijk − 1}. [Figure (Wikipedia): Dirichlet densities, clockwise from top left: α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).] Parameters learning complete data 45 / 94

46 Bayesian prediction (2) From P(Θ | D) ∝ L(Θ : D) P(Θ), with P(Θ) ∝ ∏_{i,j,k} θ_ijk^{α_ijk − 1} and L(Θ : D) = ∏_{i,j,k} θ_ijk^{N_ijk}, we get P(Θ | D) ∝ ∏_{i=1}^n ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{N_ijk + α_ijk − 1}. MAP (maximum a posteriori): Θ̂_MAP = argmax_Θ P(Θ | D), θ̂^MAP_ijk = (N_ijk + α_ijk − 1) / ∑_k (N_ijk + α_ijk − 1). EAP (expectation a posteriori): Θ̂_EAP = ∫_Θ Θ P(Θ | D) dΘ, θ̂^EAP_ijk = (N_ijk + α_ijk) / ∑_k (N_ijk + α_ijk). Parameters learning complete data 46 / 94

47 Parameters learning with complete data With N_ijk the count of occurrences in the dataset where variable X_i has the value k and its parents have the values (tuple) j, and α_ijk the parameters of a Dirichlet prior. Parameter estimation: two main solutions. MLE (Maximum Likelihood Estimation): θ̂_ijk = θ̂_{x_i=k | pa_i=j} = N_ijk / N_ij. Bayesian estimation (with Dirichlet prior): θ̂^MAP_ijk = (α_ijk + N_ijk − 1) / (α_ij + N_ij − r_i), θ̂^EAP_ijk = (α_ijk + N_ijk) / (α_ij + N_ij). The prior is important when N_ijk ≈ 0 (no occurrence in the dataset). All these estimations are equivalent when N_ijk → ∞. The α_ijk are often hard to find; Laplace smoothing: α_ijk = constant (= 1). Parameters learning complete data 47 / 94
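
A small sketch of both estimators on invented counts N_ijk for one node with r_i = 3 values and a fixed parent configuration j:

```python
import numpy as np

N_jk = np.array([7.0, 3.0, 0.0])       # N_ijk for a fixed (i, j): value k = 3 never observed
alpha = np.ones_like(N_jk)             # Laplace smoothing: alpha_ijk = 1

theta_mle = N_jk / N_jk.sum()                            # MLE: 0 for the unseen value
theta_eap = (N_jk + alpha) / (N_jk.sum() + alpha.sum())  # EAP with Dirichlet prior

print("MLE :", theta_mle)    # [0.7, 0.3, 0.0]
print("EAP :", theta_eap)    # [8/13, 4/13, 1/13] : no zero probability
```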

48 Learning parameters with missing values D: a dataset of M rows (d^A_m, d^B_m, d^C_m, d^E_m) over the BN, with some missing values ('?'). D = D_o ∪ D_h: respectively the observed and the unobserved data. Typology of incomplete datasets (with M_il the indicator that d^i_l ∈ D_h): MCAR: P(M | D) = P(M) (Missing Completely At Random). MAR: P(M | D) = P(M | D_o) (Missing At Random). NMAR: P(M | D) (Not Missing At Random). Parameters learning with incomplete data 48 / 94

49 EM for BNs Expectation-Maximization for BNs Repeat until convergence: Step E: estimate N^(t+1)_ijk from P(X_i | Pa_i, θ^t_ijk), i.e. an inference in the BN with parameters θ^t_ijk. Step M: θ^(t+1)_ijk = N^(t+1)_ijk / N^(t+1)_ij. Example: a two-node BN P → S with dataset (P, S) = (o, ?), (n, ?), (o, n), (n, n), (o, o). Parameters: P(P) = [θ_P, 1 − θ_P], P(S | P = o) = [θ_{S|P=o}, 1 − θ_{S|P=o}], P(S | P = n) = [θ_{S|P=n}, 1 − θ_{S|P=n}]. With MLE: θ_P = 3/5. Parameters learning with incomplete data 49 / 94

50 EM in a BN : example Initialisation We have to choose an initial value for each parameter: θ^(0)_{S|P=o} = 0.3, θ^(0)_{S|P=n} = 0.4. Step E: the missing S values contribute their expected counts P(S = o | P, θ^(0)), giving N(S=o, P=o) = 1 + 0.3 = 1.3 and N(S=o, P=n) = 0 + 0.4 = 0.4. Step M: θ^(1)_{S|P=o} = 1.3/3 ≈ 0.43 and θ^(1)_{S|P=n} = 0.4/2 = 0.2. Next iteration: θ^(2)_{S|P=o} = (1 + 0.43)/3 ≈ 0.48 and θ^(2)_{S|P=n} = 0.2/2 = 0.1, etc. (θ^(t)_{S|P=o} → 0.5 and θ^(t)_{S|P=n} → 0). Parameters learning with incomplete data 50 / 94
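
This tiny example can be reproduced in a few lines; the sketch below implements the E and M steps for θ_{S|P} as described above (each missing S row contributes its expected count under the current parameters) and recovers the limits 0.5 and 0. The encoding of the dataset is an assumption on my part.

```python
import numpy as np

# dataset over (P, S): 'o'/'n' values, None = missing S
data = [("o", None), ("n", None), ("o", "n"), ("n", "n"), ("o", "o")]

theta = {"o": 0.3, "n": 0.4}            # theta_{S=o | P}, initial values from the slide

for t in range(50):
    counts_o = {"o": 0.0, "n": 0.0}     # expected counts of S = o, per value of P
    totals = {"o": 0.0, "n": 0.0}
    for p, s in data:
        totals[p] += 1
        if s is None:                   # E step: expected contribution P(S=o | P=p, theta)
            counts_o[p] += theta[p]
        elif s == "o":
            counts_o[p] += 1
    theta = {p: counts_o[p] / totals[p] for p in theta}   # M step

print(theta)   # close to {'o': 0.5, 'n': 0.0}
```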

51 EM in BNs [Figure: a BN over E, A, C, B and a dataset row with a missing value '?' for A.] What is the step E? Replace the '?' by P(A | B = 0, C = 1, E = 0): an inference in the BN with the parameters Θ^t. Learning parameters with missing values: EM converges to a local optimum; it is sensitive to the initial parameters; each step E can be expensive (an inference in a complex BN). Parameters learning with incomplete data 51 / 94

52 Structural learning with complete data Goal: learning the arcs of the graph from data. Theoretically: χ² tests plus enumeration of all the possible models: OK. In practice: many different problems, but above all: Set of Bayesian networks (Robinson, 1977) The number of different possible structures for n random variables is super-exponential: NS(n) = 1 for n ≤ 1, and NS(n) = ∑_{i=1}^n (−1)^{i+1} C(n, i) 2^{i(n−i)} NS(n−i) for n > 1. Robinson (1977), Counting unlabelled acyclic digraphs, in Lecture Notes in Mathematics: Combinatorial Mathematics V. A brute-force algorithm is not feasible: the space of Bayesian networks is too large, NS(10) ≈ 4.2 × 10^18! Structure learning with complete data 52 / 94
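
Robinson's recursion is easy to evaluate; a short sketch that reproduces the super-exponential growth and the order of magnitude quoted for NS(10):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def NS(n: int) -> int:
    """Number of DAGs on n labelled nodes (Robinson, 1977)."""
    if n <= 1:
        return 1
    return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * NS(n - i)
               for i in range(1, n + 1))

for n in (1, 2, 3, 4, 5, 10):
    print(n, NS(n))      # 1, 3, 25, 543, 29281, ..., NS(10) ~ 4.2e18
```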

53 Structural learning - introduction General picture of structural learning Identification of symmetrical relations (independence) + orientation algorithms: IC/PC, IC*/FCI; important for causal models. Local search: in the (very large) space of structures, greedy algorithms maximizing a score (entropy, AIC, BIC, MDL, BD, BDe, BDeu, ...). Structure learning with complete data 53 / 94

54 Identification of symmetrical relations Statistically, the relations that can be tested between variables are symmetrical: correlation or independence. However, once these symmetrical relations have been found, other tests (conditional independence) can lead to the discovery of colliders, which force some orientations. Main principle for IC, IC*, PC, FCI: 1 Build an undirected graph based on the symmetrical relations statistically found (χ², correlation, mutual information, etc.): add edges from the empty graph, or remove edges from the complete graph. 2 Identify colliders and add the implied orientations. 3 Finalize the orientations without creating any other collider (in order to stay in the same Markov equivalence class). Major drawback: a very large number of statistical tests is needed, and each test is not robust w.r.t. the size of the dataset. Structure learning with complete data 54 / 94

55 Example : PC Let P be a BN (example from Philippe Leray). We generate a dataset of 5000 lines compatible with P. Using a χ² test, every possible marginal independence (X ⊥ Y) is tested; then every remaining possible conditional independence (X ⊥ Y | Z) is tested; then with 2 conditioning variables, etc. We first find a list of marginal independences (A ⊥ S, L ⊥ T, O ⊥ B, X ⊥ A, B ⊥ A, O ⊥ A, D ⊥ A, T ⊥ S), then a list of independences conditioned on one variable (T ⊥ A | O, O ⊥ S | L, X ⊥ S | L, B ⊥ T | S, X ⊥ T | O, D ⊥ T | O, B ⊥ L | S, X ⊥ L | O, D ⊥ L | O, D ⊥ X | O). Structure learning with complete data 55 / 94

56 Example : PC We continue with χ² tests conditioned on 2 variables; we find: D ⊥ S | (L, B), X ⊥ O | (T, L), D ⊥ O | (T, L). We then look for colliders, propagate the orientation constraints and orient the last edges within the Markov equivalence class (the slide reports the orientations found around T, L and O). Conclusion: with 5000 lines, PC has lost a lot of information (noisy χ² tests). Structure learning with complete data 56 / 94

57 Structural learning with local search Local search A local search is composed of: a space of all possible solutions (search space); a neighborhood defined by elementary transformations of a solution (the neighbors of a solution are the solutions that can be obtained by applying an elementary transformation); a score (heuristic) that evaluates the quality of a solution. From an initial solution, the local search then produces a sequence of solutions such that every solution in the sequence has a better score than the previous ones (greedy search). Local search in Bayesian networks: the space of Bayesian networks (huge); the score (see next slide); the initial solution (the empty Bayesian network, for instance); elementary operations: add/remove/reverse an arc. Structure learning with complete data 57 / 94
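
In pyAgrum this greedy search is exposed through the BNLearner class; here is a hedged sketch (the file name is a placeholder I generate on the fly, and method names may differ slightly across pyAgrum versions).

```python
import pyAgrum as gum

# generate a synthetic complete dataset from a random BN (names are arbitrary)
true_bn = gum.fastBN("A->B;B->C;A->D;D->E;C->F")
gum.generateCSV(true_bn, "sample.csv", 5000)

learner = gum.BNLearner("sample.csv")
learner.useGreedyHillClimbing()     # local search: add / remove / reverse an arc
learner.useScoreBIC()               # score driving the search
learned = learner.learnBN()
print(learned.dag())
```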

58 Scores Properties of a score Let D be a dataset, T the graph of the current solution and Θ its parameters. A score must satisfy: 1 Likelihood: the solution must explain the data (max L(T, Θ : D)). 2 Occam's razor: the score must prefer simple graphs T rather than complex ones (min Dim(T)). 3 Local consistency: adding a useful arc should increase the score. 4 Score equivalence: two Markov-equivalent Bayesian networks should have the same score. Definition (Dim(T)) The dimension of a Bayesian network is its number of free parameters: Dim(T) = ∑_i ((r_i − 1) q_i), where r_i is the size of the variable X_i and q_i is the number of configurations of the parents of X_i. Maximizing the likelihood alone is not a good score: it leads to the complete graph (overfitting). Structure learning with complete data 58 / 94

59 Some scores (1) : AIC/BIC Key idea: maximizing the likelihood but minimizing the dimension of the Bayesian network. Score AIC (Akaike, 70), Akaike Information Criterion: Score_AIC(T, D) = log2 L(Θ^ML, T : D) − Dim(T). Score BIC (Schwartz, 78), Bayesian Information Criterion: Score_BIC(T, D) = log2 L(Θ^ML, T : D) − ½ Dim(T) log2 N. Structure learning with complete data 59 / 94

60 Minimum Description Length (Rissanen, 78) The MDL score considers the compactness of the representation as a good indicator of the quality of the solution. Score MDL (Lam and Bacchus, 93), Minimum Description Length: Score_MDL(T, D) = log2 L(Θ^ML, T : D) − |arcs_T| log2 N − c · Dim(T), where arcs_T is the set of arcs in T and c is the number of bits needed to represent a parameter. Structure learning with complete data 60 / 94

61 Bayesian Dirichlet score Equivalent With a Bayesian score, we want to maximize the joint probability of T and D: P(T, D) = ∫_Θ P(D | Θ, T) P(Θ | T) P(T) dΘ = P(T) ∫_Θ L(Θ, T : D) P(Θ | T) dΘ. With some independence hypotheses and a Dirichlet prior: Score_BDe(T, D) = P(T) ∏_{i=1}^n ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(N_ij + α_ij)] ∏_{k=1}^{r_i} [Γ(N_ijk + α_ijk) / Γ(α_ijk)]. Structure learning with complete data 61 / 94

62 Local search : example With the same dataset. Major drawbacks: the algorithm may be trapped in zones where the whole neighborhood has the same score; the algorithm may stop in a local optimum. Solutions (meta-heuristics): random restarts (not only from the empty Bayesian network), TABU search (force the algorithm to explore new solutions), simulated annealing (accept from time to time structures with a decreasing score). Structure learning with complete data 62 / 94

63 Learning trees In order to reduce the search space, this algorithm finds at most one parent for each random variable. This may be an over-simplification of the model, but learning a tree brings: a mathematically beautiful solution (it finds the global optimum); a small number of parameters (which minimizes the risk of overfitting). Basic idea: decomposition of the log-likelihood: LL(T) = ∑_i LL_i(i, pa(i)) = ∑_{X→Y} LL(Y | X) + K, with LL(Y | X) = LL_Y(Y, X) − LL_Y(Y, ∅). Learning an optimal tree: ∀X, Y, compute LL(Y | X); find the tree (forest) that maximizes LL(T): Maximum Spanning Tree algorithm, O(n² log n). Structure learning with complete data 63 / 94
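
The optimal-tree idea can be sketched with pairwise mutual information and a maximum spanning tree (Chow–Liu style); the data below are invented and networkx's maximum_spanning_tree does the spanning-tree part.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
# toy dataset: 3 binary variables where X2 copies X0 most of the time
X0 = rng.integers(0, 2, 1000)
X1 = rng.integers(0, 2, 1000)
X2 = np.where(rng.random(1000) < 0.9, X0, 1 - X0)
data = np.stack([X0, X1, X2], axis=1)

def mutual_information(a, b):
    joint = np.histogram2d(a, b, bins=2)[0] / len(a)
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum())

G = nx.Graph()
n = data.shape[1]
for i in range(n):
    for j in range(i + 1, n):
        G.add_edge(i, j, weight=mutual_information(data[:, i], data[:, j]))

tree = nx.maximum_spanning_tree(G)
print(sorted(tree.edges()))      # expected to contain the strong edge (0, 2)
```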

64 Inference Structure learning with complete data 64 / 94

65 Inference in a Bayesian network Elementary operations on a joint probability: Marginalization: ∑_y P(x, y | z) = p(x | z). Total sum: ∑_y P(y | z) = 1. Decomposition: P(x, y | z) = P(x | y, z) P(y | z). Chain rule: P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | X_1, ..., X_{i−1}). Independence: X ⊥ Y | Z ⟹ P(x | y, z) = P(x | z). Bayes rule: P(x | y, z) ∝ P(y | x, z) P(x | z). In a Bayesian network, local Markov property: P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | parents(X_i)). Inference 65 / 94

66 Hard and soft evidence Let P(X_1, ..., X_n) be a Bayesian network and let ε be an event: P(X_1, ..., X_n | ε) ∝ P(ε | X_1, ..., X_n) P(X_1, ..., X_n). Evidence in a Bayesian network Assume that there exist (ε_i), I ⊆ {1, ..., n}, such that ε = ⋂_{i ∈ I} ε_i and ∀i ∈ I, ε_i ⊥ X_1, ..., X_{i−1}, X_{i+1}, ..., X_n | X_i. Then P(X_1, ..., X_n, ε) = ∏_{i=1}^n P(x_i | π_i) ∏_{i ∈ I} P(ε_i | X_i). If P(ε_i | X_i) contains a 1 and many 0s, ε_i is called a hard evidence: ε_i is the event "X_i takes a certain value". Otherwise, ε_i is called a soft evidence: ε_i acts like a virtual child of X_i. Evidence will not appear in the following since it fits in this general framework. Inference 66 / 94

67 Inference in a Bayesian network (1) : P(D)? Graph: A → B, B → C, B → D. P(d) = ∑_a ∑_b ∑_c P(a, b, c, d) = ∑_a ∑_b ∑_c P(a) P(b | a) P(c | b) P(d | b) = ∑_a P(a) ∑_b P(b | a) P(d | b) (∑_c P(c | b)), where ∑_c P(c | b) = 1 and each remaining factor is read from the CPT of the corresponding node: P(d) = ∑_b P(d | b) ∑_a P(a) P(b | a). Inference 67 / 94
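
The same elimination can be written with numpy's einsum: each factor is a CPT array and the summed-out variables a, b, c are the contracted indices (the CPT values below are invented).

```python
import numpy as np

# A -> B, B -> C, B -> D, all binary
P_A = np.array([0.6, 0.4])                     # P(a)
P_B_A = np.array([[0.7, 0.3], [0.2, 0.8]])     # P(b | a), indexed [a, b]
P_C_B = np.array([[0.5, 0.5], [0.9, 0.1]])     # P(c | b), indexed [b, c]
P_D_B = np.array([[0.1, 0.9], [0.6, 0.4]])     # P(d | b), indexed [b, d]

# P(d) = sum_{a,b,c} P(a) P(b|a) P(c|b) P(d|b) ; summing c out of P(c|b) gives 1
P_D = np.einsum("a,ab,bc,bd->d", P_A, P_B_A, P_C_B, P_D_B)
print(P_D, P_D.sum())      # a proper distribution over D
```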

68 Inference in a Bayesian network (2) : P(D | a)? P(d | a) = ∑_b ∑_c P(a, b, c, d | a) = ∑_b ∑_c P(a | a) P(b | a, a) P(c | b, a) P(d | b, a) = ∑_b P(a | a) P(b | a) P(d | b) (∑_c P(c | b)), using C ⊥ A | B, D ⊥ A | B and ∑_c P(c | b) = 1: P(d | a) = ∑_b P(d | b) P(b | a). Inference 68 / 94

69 Inference in a Bayesian network (3) : P(C | d)? P(c | d) = ∑_a ∑_b P(a | d) P(b | a, d) P(c | b, d) = ∑_b P(c | b) P(b | d), with Bayes rule for P(b | d): P(b | d) ∝ P(d | b) p(b) = P(d | b) ∑_a p(b | a) P(a). Inference 69 / 94

70 Inference in a Bayesian network (4) : P(A | d)? P(a | d) ∝ P(d | a) P(a), with P(d | a) = ∑_b P(d | a, b) P(b | a) = ∑_b P(d | b) P(b | a). Inference 70 / 94

71 Inference in a Bayesian network (5) : P(B | c, a)? P(b | c, a) ∝ P(c | b, a) P(b | a) = P(c | b) p(b | a) P(a | a). Inference 71 / 94

72 Inference in polytrees [Figure: a node X with parents U_1, ..., U_n and children Y_1, ..., Y_m; messages π_X(U_i) and λ_X(U_i) flow on the parent side, π_{Y_j}(X) and λ_{Y_j}(X) on the child side; π(x) = P(x | e⁺) and λ(x) = P(e⁻ | x).] If a node has n neighbors (parents or children), it must know n − 1 messages in order to send its message to the last neighbor. Once all messages have been sent (2|E|), every node knows all the information received by the graph. If a node has n neighbors, it must know its n messages in order to compute its own posterior. Inference 72 / 94

73 A first algorithm for inference in polytrees Propagation: repeat, for all nodes N with n neighbors: if N has received n − 1 messages then N can send the message to the last neighbor; if N has received n messages then N can send all its messages; until all messages have been sent. Every node can then compute its posterior. [Figure: an example polytree annotated with the order in which messages are sent.] Complexity: O(|E|). Inference 73 / 94

74 Inference in polytrees (2) [Figure: the same polytree with a centralized message ordering.] Centralized version: Selection of a root. Absorption (collect): each node sends its message towards the root (as soon as it can). Integration: the root has received all the messages from its neighbors and can send all its messages back to its neighbors. Diffusion (distribute): when a node receives its message from the root, it sends all the remaining messages. Inference 74 / 94

75 Problem with message-passing algorithms in DAGs P(E) = ∑_{A,B,C,D} P(A) P(B | A) P(C | A) P(D | B) P(E | C, D). [Figure: the DAG A → B, A → C, B → D, D → E, C → E.] Message-passing algorithm: 1 π_B(A) = P(A); 2 π_C(A) = P(A); 3 π_D(B) = ∑_A P(B | A) π_B(A); 4 π_E(D) = ∑_B P(D | B) π_D(B); 5 π_E(C) = ∑_A P(C | A) π_C(A); 6 P(E) = ∑_{D,C} P(E | C, D) π_E(C) π_E(D). But then P(E) ∝ ∑_{A,B,C,D,A'} P(E | C, D) P(C | A) P(A) P(D | B) P(B | A') P(A'): the information on A is counted twice, because the graph is not a polytree. Inference 75 / 94

76 Inference in DAGs Message-passing algorithms need a polytree. From DAG to polytree: Conditioning: cutting arcs in the graph. Clustering: merging nodes in the graph. [Figure: illustration of conditioning and clustering.] Inference 76 / 94

77 Conditioning Conditioning a Bayesian network Let G be a Bayesian network over the set of random variables V, S ⊆ V and s an instantiation of S. The conditioned Bayesian network G_[S=s] is the Bayesian network obtained by: removing the arcs going out of every node in S; if a node Y has parent(s) in S, then p(Y | Π_Y) is changed into p(Y | s, Π_Y). [Figure: a graph G over A, ..., G and the conditioned graph G_[A=0,G=1], where P(E | A) becomes P(E | A = 0), P(F | E, G) becomes P(F | E, G = 1) and P(D | C, G) becomes P(D | C, G = 1).] An inference in G with evidence S = s is equivalent to an inference in G_[S=s] with no evidence. S is called a cutset. Inference 77 / 94

78 Inference by conditioning ∀S ⊆ V, ∀x ∈ V, P(x) = ∑_s P(x | s) · P(s). Inference by conditioning With G a Bayesian network: 1 find a cutset S ⊆ V such that G_[S] is a polytree; 2 ∀s instantiation of S, compute P_s(x) = P(x | s) in G_[s] and p_s = P(s) (a by-product of this inference); 3 P(x) = ∑_s (p_s · P_s(x)). If #_i is the size of S_i, the inference computes ∏_i #_i inferences in G_[S]: O(k^|S| · |E_[S]|). Finding the minimal cutset is NP-complete. Inference 78 / 94

79 Clustering : junction tree algorithm P(s, l, d, f, v) = P(s) P(l | s) P(d | s) P(f | l) P(v | d, l) = f(s, l) g(s, d) h(l, f) k(d, l, v) = j(s, l, d) h(l, f) k(d, l, v). [Figure: the BN over S, L, D, F, V, the junction graph over the clusters SL, SD, SLD, LF, LDV, and a junction tree with cliques SLD, LF, LDV and separators L and LD.] Inference 79 / 94

80 How to build a junction tree Idea: create an undirected graph from the Bayesian network. The cliques of this undirected graph will be the nodes of the junction tree. The CPTs P(X | Parents_X) of the Bayesian network indicate the necessary clusters. Moralization The moral graph of a BN is the undirected skeleton of the Bayesian network to which edges between the parents of a same node are added. [Figure: the BN over S, T, L, D, F, V and its moral graph.] Inference 80 / 94

81 How to build a junction tree (2) To ensure the existence of the junction tree, the undirected graph must be triangulated: every cycle of length > 3 must have a chord. Triangulation To triangulate a graph, use variable elimination: iteratively remove all the nodes of the graph; when a node is removed, add an edge between every pair of its neighbors. Start by eliminating nodes that will not create new edges (no neighbor, only one neighbor, neighbors already connected). [Figure: triangulation of the moral graph with the elimination order F, V, T, S, L, D.] Finding the best elimination order is NP-complete. Inference 81 / 94

82 How to build a junction tree (3) The nodes of the junction tree are the cliques of the moralized and triangulated graph; what about its edges? Junction tree Choose an order on the cliques (C_i). For each C_i, find the clique C_j (j < i) that maximizes |C_i ∩ C_j|; add an edge between C_i and C_j and create the separator S_ij = C_i ∩ C_j. [Figure: the triangulated graph and the resulting junction tree with cliques STD, SLD, LF, LDV and separators SD, LD, L.] The junction tree is not unique for a given Bayesian network. Inference 82 / 94

83 Potentials [Figure: the BN over S, T, L, D, F, V and its junction tree.] Decomposition of P(V) following the cliques: P(v) = P(s) P(l | s) P(t | s) P(d | t) P(f | l) P(v | l, d) = φ_1(d, l, s) φ_2(s, t, d) φ_3(f, l) φ_4(v, l, d). Another decomposition: P(v) = P(s, t, d) P(s, l, d) P(l, f) P(l, d, v) / (P(s, d) P(l) P(l, d)). Factorization of P in a junction tree: P(v) = ∏_i Φ_{C_i}(c_i) / ∏_{i<j} Φ_{S_ij}(s_ij). Inference 83 / 94

84 Propagation on potentials in the junction tree Goal: transform the potentials of all cliques (C) and separators (S) into the joint probabilities of their variables. Message passing in the junction tree Initialization: ∀C_i ∈ C, Ψ⁰_{C_i} = ∏_{X ∈ C_i, X ∉ C_j, j < i} P(X | Π_X); ∀S ∈ S, Ψ⁰_S = 1 (constant function). Root selection: choose a clique as the root for the propagation (and the evidence). Collect: send all messages from the cliques towards the root (a clique can send a message once it has received all the others). Distribution: send all messages from the root back to all cliques. Message from a clique C_i to a clique C_j (updating Ψ_{C_j} from Ψ_{C_i}): Ψ^{t+1}_{S_ij}(s_ij) = ∑_{C_i \ S_ij} Ψ^{t+1}_{C_i}(c_i) and Ψ^{t+1}_{C_j} = Ψ^t_{C_j} · Ψ^{t+1}_{S_ij} / Ψ^t_{S_ij}. The complexity is still linear in the number of arcs, but the cliques and the Ψ_C may be very large. This algorithm is the most used for exact inference in Bayesian networks. Inference 84 / 94

85 Copula Bayesian Networks Inference 85 / 94

86 Copulas Let X = {X_1, ..., X_n} be a set of continuous random variables, F(x) = P(X ≤ x) its joint CDF, and p(x) its probability density: p(x) = ∂ⁿF/∂x_1...∂x_n (x_1, ..., x_n). Copula C(u_1, ..., u_n) = CDF of variables U_1, ..., U_n uniformly distributed on [0, 1]. Sklar, 1959: ∀F, ∃C copula s.t. F(x_1, ..., x_n) = C(F(x_1), ..., F(x_n)), i.e. C(u) = F(F_1^{-1}(u_1), ..., F_n^{-1}(u_n)); uniqueness under continuity and p > 0. For the density: p(x) = c(F_1(x_1), ..., F_n(x_n)) ∏_{i=1}^n p_i(x_i). C captures the structural complexity of the relations between the variables and does not take into account their marginal behavior. A copula C(X) is rarely used/computed/learned if n > 10. Copula Bayesian network 86 / 94
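
Sklar's decomposition can be illustrated numerically with a Gaussian copula: sample a correlated bivariate normal, push each margin through its CDF to get uniform U_i, then give them any margins you like; the dependence structure (the copula) survives the change of margins. Everything below is a toy sketch with scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])

z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)
u = stats.norm.cdf(z)                    # u_i = F_i(z_i): samples from the Gaussian copula C

# apply arbitrary margins: exponential for X1, standard normal for X2
x1 = stats.expon.ppf(u[:, 0])
x2 = stats.norm.ppf(u[:, 1])

# the rank dependence induced by the copula is unchanged by the new margins
print("rank correlation:", round(stats.spearmanr(x1, x2)[0], 3))
```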

87 Copula Bayesian Networks Let G be a Bayesian network on the probability density: p(X_1, ..., X_n) = ∏_{i=1}^n p(X_i | π_i). Let c be a copula for p; we define, for each node X_i, R_c(X_i, π_i) = [∂^{k_i} c(F(X_i), F(P_1), ..., F(P_{k_i})) / ∂F(P_1)...∂F(P_{k_i})] / [∂^{k_i} c(1, F(P_1), ..., F(P_{k_i})) / ∂F(P_1)...∂F(P_{k_i})], where k_i = |π_i| (if k_i = 0, R_c(X_i, π_i) = 1). Then p(X_1, ..., X_n) = ∏_{i=1}^n R_{c_i}(X_i, π_i) p(X_i). Reciprocally, if {c_i}_{i=1}^n are copulas (> 0) for each (X_i, π_i), then ∏_{i=1}^n R_{c_i}(X_i, π_i) defines a copula for X: c(F(X_1), ..., F(X_n)) = ∏_{i=1}^n R_{c_i}(X_i, π_i). Copula Bayesian network 87 / 94

88 Copula Bayesian Network - learning and inference Definition (Copula Bayesian Network) A CBN is described by a triplet (G, Θ_C, Θ_p) where G is a DAG over X, Θ_C is a set of copulas c_i(X_i, π_i), and Θ_p is the set of marginals p(X_i). The CBN then has a joint probability density p(X) factorized by: p(X_1, ..., X_n) = ∏_{i=1}^n R_{c_i}(X_i, π_i) p(X_i). The same locality properties hold for CBNs as for BNs: large reduction of the number of parameters; parameter learning can be done locally on (X_i, π_i) — not necessarily the same dataset for each R_c, not necessarily the same type of copula for each (X_i, π_i); structural learning can use the same kind of algorithms as for classical BNs; inference is simplified: simulation, Rosenblatt's transformation, etc. Copula Bayesian network 88 / 94

89 Copula Junction Tree [Figure: a BN over A, ..., I and its junction tree with cliques ABC, BCI, BDI, DE, EF, EG, EH and separators BC, BI, D, E.] p(X) = ∏_{C ∈ clique(G)} p(C) / ∏_{S ∈ separator(G)} p(S). Let c_S(S) be a copula for every S in the junction tree (clique or separator, with some restrictions for consistency). Then c(X) = ∏_{C ∈ clique(G)} c_C(C) / ∏_{S ∈ separator(G)} c_S(S) is a copula for p(X). Copula Bayesian network 89 / 94

90 Non-parametric independence test : Y ⊥ Z | X? Bouezmarni et al., 2009: based on the Hellinger distance between the joint copula and the product of the marginal copulas; every copula is approximated by a non-parametric Bernstein copula on a dataset of size N. [Formulas shown as images in the original slide.] Copula Bayesian network 90 / 94

91 Y ⊥ Z | X? [Bouezmarni et al., 2009] show that under H0: [Y ⊥ Z | X], the test statistic T is asymptotically N(0, 1), with k = d (arbitrary). [Expression of the statistic shown as an image in the original slide.] Copula Bayesian network 91 / 94

92 Structural Learning with CI test : PC Algorithm Copula Bayesian network 92 / 94

93 Early results Copula Bayesian network 93 / 94

94 continuous-pc Structural identification: increasing blocks, 10 variables, N = 10000, 27 minutes. Conclusion: CBNs propose to explore the structure inside joint copulas; CBNs may increase the number of dimensions of a joint copula; CBNs may ease the different inference algorithms; CBNs can be automatically learned from data using a non-parametric CI test. Copula Bayesian network 94 / 94


More information

Bayes Networks 6.872/HST.950

Bayes Networks 6.872/HST.950 Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

Lecture 2: Simple Classifiers

Lecture 2: Simple Classifiers CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412

More information

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4 ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Implementing Machine Reasoning using Bayesian Network in Big Data Analytics

Implementing Machine Reasoning using Bayesian Network in Big Data Analytics Implementing Machine Reasoning using Bayesian Network in Big Data Analytics Steve Cheng, Ph.D. Guest Speaker for EECS 6893 Big Data Analytics Columbia University October 26, 2017 Outline Introduction Probability

More information

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses Bayesian Learning Two Roles for Bayesian Methods Probabilistic approach to inference. Quantities of interest are governed by prob. dist. and optimal decisions can be made by reasoning about these prob.

More information

Bayesian Networks for Causal Analysis

Bayesian Networks for Causal Analysis Paper 2776-2018 Bayesian Networks for Causal Analysis Fei Wang and John Amrhein, McDougall Scientific Ltd. ABSTRACT Bayesian Networks (BN) are a type of graphical model that represent relationships between

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

{ p if x = 1 1 p if x = 0

{ p if x = 1 1 p if x = 0 Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

More information

BN Semantics 3 Now it s personal! Parameter Learning 1

BN Semantics 3 Now it s personal! Parameter Learning 1 Readings: K&F: 3.4, 14.1, 14.2 BN Semantics 3 Now it s personal! Parameter Learning 1 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 22 nd, 2006 1 Building BNs from independence

More information

Exact Inference Algorithms Bucket-elimination

Exact Inference Algorithms Bucket-elimination Exact Inference Algorithms Bucket-elimination COMPSCI 276, Spring 2011 Class 5: Rina Dechter (Reading: class notes chapter 4, Darwiche chapter 6) 1 Belief Updating Smoking lung Cancer Bronchitis X-ray

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Bayesian Approaches Data Mining Selected Technique

Bayesian Approaches Data Mining Selected Technique Bayesian Approaches Data Mining Selected Technique Henry Xiao xiao@cs.queensu.ca School of Computing Queen s University Henry Xiao CISC 873 Data Mining p. 1/17 Probabilistic Bases Review the fundamentals

More information

Conditional Independence

Conditional Independence Conditional Independence Sargur Srihari srihari@cedar.buffalo.edu 1 Conditional Independence Topics 1. What is Conditional Independence? Factorization of probability distribution into marginals 2. Why

More information

Stephen Scott.

Stephen Scott. 1 / 28 ian ian Optimal (Adapted from Ethem Alpaydin and Tom Mitchell) Naïve Nets sscott@cse.unl.edu 2 / 28 ian Optimal Naïve Nets Might have reasons (domain information) to favor some hypotheses/predictions

More information

Lecture 6: Graphical Models

Lecture 6: Graphical Models Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

LEARNING WITH BAYESIAN NETWORKS

LEARNING WITH BAYESIAN NETWORKS LEARNING WITH BAYESIAN NETWORKS Author: David Heckerman Presented by: Dilan Kiley Adapted from slides by: Yan Zhang - 2006, Jeremy Gould 2013, Chip Galusha -2014 Jeremy Gould 2013Chip Galus May 6th, 2016

More information

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to

More information

Inference in Bayesian Networks

Inference in Bayesian Networks Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)

More information

Uncertainty and knowledge. Uncertainty and knowledge. Reasoning with uncertainty. Notes

Uncertainty and knowledge. Uncertainty and knowledge. Reasoning with uncertainty. Notes Approximate reasoning Uncertainty and knowledge Introduction All knowledge representation formalism and problem solving mechanisms that we have seen until now are based on the following assumptions: All

More information

Recall from last time: Conditional probabilities. Lecture 2: Belief (Bayesian) networks. Bayes ball. Example (continued) Example: Inference problem

Recall from last time: Conditional probabilities. Lecture 2: Belief (Bayesian) networks. Bayes ball. Example (continued) Example: Inference problem Recall from last time: Conditional probabilities Our probabilistic models will compute and manipulate conditional probabilities. Given two random variables X, Y, we denote by Lecture 2: Belief (Bayesian)

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 4771 Instructor: Tony Jebara Topic 16 Undirected Graphs Undirected Separation Inferring Marginals & Conditionals Moralization Junction Trees Triangulation Undirected Graphs Separation

More information

COS402- Artificial Intelligence Fall Lecture 10: Bayesian Networks & Exact Inference

COS402- Artificial Intelligence Fall Lecture 10: Bayesian Networks & Exact Inference COS402- Artificial Intelligence Fall 2015 Lecture 10: Bayesian Networks & Exact Inference Outline Logical inference and probabilistic inference Independence and conditional independence Bayes Nets Semantics

More information

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1 Homework 7: HMMs Reminders

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

An Empirical-Bayes Score for Discrete Bayesian Networks

An Empirical-Bayes Score for Discrete Bayesian Networks An Empirical-Bayes Score for Discrete Bayesian Networks scutari@stats.ox.ac.uk Department of Statistics September 8, 2016 Bayesian Network Structure Learning Learning a BN B = (G, Θ) from a data set D

More information

Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence

Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence General overview Introduction Directed acyclic graphs (DAGs) and conditional independence DAGs and causal effects

More information

Lecture 17: May 29, 2002

Lecture 17: May 29, 2002 EE596 Pat. Recog. II: Introduction to Graphical Models University of Washington Spring 2000 Dept. of Electrical Engineering Lecture 17: May 29, 2002 Lecturer: Jeff ilmes Scribe: Kurt Partridge, Salvador

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Bayesian Networks Representation and Reasoning

Bayesian Networks Representation and Reasoning ayesian Networks Representation and Reasoning Marco F. Ramoni hildren s Hospital Informatics Program Harvard Medical School (2003) Harvard-MIT ivision of Health Sciences and Technology HST.951J: Medical

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information