Bayesian Networks ETICS Pierre-Henri Wuillemin (LIP6 UPMC)


1 Bayesian Networks ETICS 2017 Pierre-Henri Wuillemin (LIP6 UPMC)

2 Outline BNs in a nutshell · Foundations of BNs · Learning BNs · Inference in BNs · Graphical models for copulas · pyAgrum 2 / 94

3 Nutshell 3 / 94

4 Inference in discrete probabilistic models Inference Let A, B be (discrete) random variables. Inference consists in computing the distribution of A given observations on B: P(A | B = ε_b) = P(A | ε_b) = P(ε_b | A) P(A) / P(ε_b) ∝ P(ε_b | A) P(A) = P(A, ε_b). Probabilistic complex systems: P(X_1, ..., X_n). Inference in a complex system: P(X_i | X_j = ε_j) ∝ P(X_i, ε_j). Nutshell 4 / 94
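
To make the definition concrete, here is a minimal sketch (not from the slides) that computes P(A | B = b) from an explicit joint table with numpy; the variable names and the toy numbers are illustrative assumptions.

```python
import numpy as np

# Toy joint distribution P(A, B) over binary A and B (rows: A, columns: B).
# The numbers are arbitrary; they only need to sum to 1.
P_AB = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

b = 1                                        # observation: B = 1
joint_slice = P_AB[:, b]                     # P(A, B = b), proportional to the posterior
posterior = joint_slice / joint_slice.sum()  # P(A | B = b) = P(A, B = b) / P(B = b)

print("P(A | B=1) =", posterior)             # [0.2, 0.8]
```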

5 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is a joint distribution over a set of (discrete) random variables. A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Nutshell 5 / 94

6 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is a joint distribution over a set of (discrete) random variables. A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Nutshell 6 / 94

7 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is a joint distribution over a set of (discrete) random variables. A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Factorization of the joint distribution in a BN: P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | parents(X_i)). Example: P(A, S, T, L, O, B, X, D) = P(A) P(S) P(T | A) P(L | S) P(O | T, L) P(B | S) P(X | O) P(D | O, B) (2^8 = 256 parameters for the full joint vs 32 parameters for the BN). Nutshell 7 / 94
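
Since pyAgrum appears in the outline, here is a hedged sketch of how such a factorized model could be written with it (the node names follow the Asia-like example above, the CPT values are invented, and the exact API may differ slightly between pyAgrum versions).

```python
import pyAgrum as gum

bn = gum.BayesNet("asia-like")
# binary variables A, S, T, L, O, B, X, D
for name in ["A", "S", "T", "L", "O", "B", "X", "D"]:
    bn.add(gum.LabelizedVariable(name, name, 2))

# arcs matching P(A)P(S)P(T|A)P(L|S)P(O|T,L)P(B|S)P(X|O)P(D|O,B)
for tail, head in [("A", "T"), ("S", "L"), ("S", "B"),
                   ("T", "O"), ("L", "O"), ("O", "X"),
                   ("O", "D"), ("B", "D")]:
    bn.addArc(tail, head)

bn.cpt("A").fillWith([0.99, 0.01])   # illustrative numbers only
bn.cpt("S").fillWith([0.5, 0.5])
# ... the remaining CPTs would be filled the same way

print(bn)   # summary: 8 nodes, 8 arcs, number of parameters
```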

8 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Inference: P(dyspnoea)? Nutshell 8 / 94

9 Bayesian networks : definition Definition (Bayesian Network (BN)) A Bayesian network is represented by a directed acyclic graph (DAG) and by a conditional probability table (CPT) for each node: P(X_i | parents_i). Inference: P(dyspnoea | smoking)? Nutshell 9 / 94
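
Continuing the hedged pyAgrum sketch, such queries could be asked through one of the library's exact inference engines (LazyPropagation); the network below is built with random CPTs via fastBN, so names and values are purely illustrative.

```python
import pyAgrum as gum

# compact construction of an asia-like structure with random CPTs (hypothetical example)
bn = gum.fastBN("A->T;T->O;L->O;S->L;S->B;O->X;O->D;B->D")

ie = gum.LazyPropagation(bn)
ie.makeInference()
print(ie.posterior("D"))        # prior marginal P(D) (dyspnoea-like node)

ie.setEvidence({"S": 1})        # observe the smoking-like node
ie.makeInference()
print(ie.posterior("D"))        # P(D | S = 1)
```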

10 BN and probabilistic inference Diagnostic: P(A | F). [Figure: an example BN over nodes A to G with its CPTs, illustrating diagnostic, reliability, classification and prediction queries.] Other tasks: process simulation (modelling), forecasting (dynamics, etc.), behavioral analysis (bots, intelligent tutoring systems), Most Probable Case: argmax_X P(X | D), sensitivity analysis, informational analysis (mutual information), etc., decision processes, troubleshooting: argmax P(·) · C(·). Nutshell 10 / 94

11 Application 1 : diagnostic (NASA) NASA: using a BN to monitor the boosters of the space shuttle. [Figure: BN over helium/nitrogen/oxidizer/fuel pressures, leaks, valve and engine failures, and combustion pressure measurements.] Nutshell 11 / 94

12 Application 2 : medical diagnosis The Great Ormond Street Hospital for Sick Children: diagnosis of the causes of cyanosis or heart disease in a child just after birth. [Figure: BN over birth asphyxia, disease, age at presentation, LVH, duct flow, cardiac mixing, lung parenchyma, lung flow, hypoxia distribution, O2/CO2 levels, chest X-ray, grunting, and the corresponding reports.] Nutshell 12 / 94

13 Application 3 : risk analysis Risk modelling using BNs: a modular approach. Nutshell 13 / 94

14 Application 4 : Bayesian classification X (dimension d, features) and Y (dimension 1, often binary but not necessarily). Using a database Π_a = (x^(k), y^(k))_{k ∈ {1,...,N}} (supervised learning), one can estimate the joint distribution P(X, Y). Classification For a vector x of values of X, the goal is to predict the class (value of Y): ŷ. 1 Maximum likelihood (ML): ŷ = argmax_y P(x | y). 2 Maximum a posteriori (MAP): ŷ = argmax_y P(y | x) = argmax_y P(y) P(x | y). These distributions may be hard to estimate: P(X | Y) may involve more parameters than there are records in Π_a!! Nutshell 14 / 94

15 Bayesian classification (2) : naive Bayes How to compute P(X | Y)? Naive Bayes classifier: if we assume ∀k ≠ l, X_k ⊥ X_l | Y, then P(x, y) = P(y) ∏_{k=1}^d P(x_k | y). Very strong assumption! In most cases it is only an approximation; however, this approximation often gives good results. Parameter estimation: trivial (if no missing values). ML: ∏_{k=1}^d P(x_k | y)... MAP: P(y | x_1, ..., x_d): inference in the BN! Nutshell 15 / 94
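
As an illustration of the MAP rule under the naive assumption, here is a small self-contained numpy sketch (the counts and the add-one smoothing are invented for the example).

```python
import numpy as np

# Training data: 6 records, 2 binary features and a binary class y.
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

classes = np.unique(y)
prior = np.array([(y == c).mean() for c in classes])                    # P(y)
# P(x_k = 1 | y = c) with Laplace (add-one) smoothing
likelihood1 = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                        for c in classes])

def predict(x):
    # log P(y) + sum_k log P(x_k | y), then argmax over classes (MAP)
    logpost = np.log(prior) + (np.log(likelihood1) * x
                               + np.log(1 - likelihood1) * (1 - x)).sum(axis=1)
    return classes[np.argmax(logpost)]

print(predict(np.array([1, 1])))   # most probable class for x = (1, 1)
```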

16 Bayesian classification (3) : more complex models TAN: Tree-Augmented Naive models. Every variable X_i can have Y and one other parent among X (only one!). Complete Bayesian network: in a BN including Y and (X_i), infer P(Y | X_1, ..., X_n). Note: there is no need to observe all the X_i; the Markov blanket MB(·) suffices: P(Y | X) = P(Y | MB(Y)). Nutshell 16 / 94

17 Application 5 : dynamic Bayesian networks dBN (dynamic BN) A dynamic BN is a BN where variables are indexed by the time t and by i: X^t = {X^t_1, ..., X^t_N}, and verifies: Markov order 1: P(X^t | X^0, ..., X^{t-1}) = P(X^t | X^{t-1}); homogeneity: P(X^t | X^{t-1}) = ... = P(X^1 | X^0). [Figure: a dBN over X_1, ..., X_5 unrolled from t = 0 to t, and the corresponding 2-TBN.] 2-TBN A dBN is characterized by: its initial conditions P(X^0) and the relations between t−1 and t (time slice). A 2-TBN (2-timeslice BN) allows one to specify a dBN of size T by starting with X^0 and copying T times the pattern (t−1 → t). Nutshell 17 / 94

18 dBN and Markov chain Markov chain A state variable X_n (at time n). Parameters: initial condition P(X_0); transition model P(X_n | X_{n−1}). [Figure: the transition matrix P(X_n | X_{n−1}) and the equivalent dynamic Bayesian network; dBN: X_0 → X_1 → X_2 → X_3 → ...; 2-TBN: X_{n−1} → X_n.] Nutshell 18 / 94
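
A dBN reduced to a Markov chain can be sketched directly with numpy: an initial distribution plus a transition matrix, unrolled over a few steps (all numbers below are invented).

```python
import numpy as np

p0 = np.array([0.9, 0.1])            # initial condition P(X_0)
A = np.array([[0.8, 0.2],            # transition model P(X_n | X_{n-1}),
              [0.3, 0.7]])           # rows indexed by X_{n-1}, columns by X_n

# Exact marginal without evidence: P(X_n) = P(X_{n-1}) @ A, applied n times.
p = p0.copy()
for n in range(1, 6):
    p = p @ A
    print(f"P(X_{n}) =", p)

# Sampling a trajectory from the same 2-TBN pattern.
rng = np.random.default_rng(0)
x = rng.choice(2, p=p0)
traj = [x]
for _ in range(5):
    x = rng.choice(2, p=A[x])
    traj.append(x)
print("sampled trajectory:", traj)
```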

19 dBN and Hidden Markov Model HMM A state variable X_n (at time n), an observation variable Y_n. Parameters: initial condition P(X_0); transition model P(X_n | X_{n−1}); observation model P(Y_n | X_n). [Figure: the equivalent dynamic Bayesian network; dBN: X_0 → X_1 → X_2 → X_3 → ... with each Y_t a child of X_t; 2-TBN: X_{n−1} → X_n, X_n → Y_n.] Nutshell 19 / 94
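
The HMM version adds an observation model; below is a minimal forward-filtering sketch in numpy (toy parameters, hypothetical observation sequence).

```python
import numpy as np

p0 = np.array([0.5, 0.5])               # P(X_0)
A = np.array([[0.9, 0.1],               # P(X_n | X_{n-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],               # P(Y_n | X_n): rows = state, cols = observation
              [0.1, 0.9]])

obs = [0, 1, 1, 0]                      # observed Y_1..Y_4 (made up)

# Forward algorithm: alpha proportional to P(X_n | y_1..y_n)
alpha = p0
for y_n in obs:
    alpha = (alpha @ A) * B[:, y_n]     # predict with A, correct with the likelihood of y_n
    alpha = alpha / alpha.sum()         # normalize to get the filtered distribution
    print(alpha)
```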

20 Foundations Nutshell 20 / 94

21 Joint probability, factorization, conditional independence A joint probability is expensive in memory: p(X, Y, Z) costs K^3. A joint probability can be factorized (chain rule): p(X, Y, Z) = p(X) p(Y | X) p(Z | X, Y), i.e. K + K^2 + K^3. Conditional independence is the key. With X ⊥ Y and Z ⊥ X, Y: p(X, Y, Z) = p(X) p(Y) p(Z), i.e. 3K. Goal: how to describe the list of all conditional independences in a joint probability: {U, V, W ⊆ X with U ⊥ V | W}. Independence models 21 / 94

22 Independence model More generally, this ternary relation between subsets is called separability. Definition (Independence model and separability) Let X be a finite set and let I ⊆ P(X) × P(X) × P(X), i.e. a list of triplets of subsets of X. I is called an independence model. ∀U, V, W ⊆ X, U and V are separated by W (⟨U ⊥ V | W⟩_I) if and only if (U, V, W) ∈ I, i.e. I is the list of all separations found in X. Relation between I and p: I_p = {(U, V, W) ∈ P(X) × P(X) × P(X), U ⊥ V | W} is an independence model: U ⊥ V | W ⟺ ⟨U ⊥ V | W⟩_{I_p}. Reciprocally, if I is an independence model for X, a set of random variables, can we find a distribution P that verifies this list of conditional independences? Independence models 22 / 94

23 Semi-graphoid and graphoid Definition (semi-graphoid) An independence model I is a semi-graphoid if and only if ∀A, B, S, P ⊆ X: 1 Trivial independence: ⟨A ⊥ ∅ | S⟩_I. 2 Symmetry: ⟨A ⊥ B | S⟩_I ⟹ ⟨B ⊥ A | S⟩_I. 3 Decomposition: ⟨A ⊥ (B ∪ P) | S⟩_I ⟹ ⟨A ⊥ B | S⟩_I. 4 Weak union: ⟨A ⊥ (B ∪ P) | S⟩_I ⟹ ⟨A ⊥ B | (S ∪ P)⟩_I. 5 Contraction: ⟨A ⊥ B | (S ∪ P)⟩_I and ⟨A ⊥ P | S⟩_I ⟹ ⟨A ⊥ (B ∪ P) | S⟩_I. Definition (graphoid) An independence model I is a graphoid if and only if ∀A, B, S, P ⊆ X: I is a semi-graphoid and 6 Intersection: ⟨A ⊥ B | (S ∪ P)⟩_I and ⟨A ⊥ P | (S ∪ B)⟩_I ⟹ ⟨A ⊥ (B ∪ P) | S⟩_I. These axioms create a strong structure inside the independence model. Independence models 23 / 94

24 Semi-graphoid and graphoid - representation of the axioms [Figure: graphical representation of axioms 3 (decomposition), 4 (weak union), 5 (contraction) and 6 (intersection) as set diagrams over A, B, S, P.] Independence models 24 / 94

25 Semi-graphoid and graphoid Theorem (probability and graphoid) I_p is a semi-graphoid. If p is positive then I_p is a graphoid. Is there any other kind of ternary relation that is a graphoid? Theorem (Undirected graph and graphoid) Let G = (X, E) be an undirected graph. For U, V, W ⊆ X, ⟨U ⊥ V | W⟩_G if and only if every path from a node in U to a node in V includes a node in W. {⟨U ⊥ V | W⟩_G, U, V, W ⊆ X} is a graphoid. [Figure: an undirected graph over A, B, C, D, E, F, G.] Independence models 25 / 94

26 Graphical model [Diagram: a distribution P(X) and a graph G = (X, E) both induce an independence model I; the question is how the two are related.] Definition (Graphical model) A graphical model is a joint probability distribution P(X) which uses a graph G = (X, E) to represent its conditional independences through separation in the graph. Definition (I-map, D-map, P-map, graph-isomorphism) Let G = (X, E) be a graph and p(X) a distribution. G is a Dependency-map (D-map) for p: (X ⊥ Y | Z)_p ⟹ ⟨X ⊥ Y | Z⟩_G. G is an Independency-map (I-map) for p: ⟨X ⊥ Y | Z⟩_G ⟹ (X ⊥ Y | Z)_p. G is a Perfect-map (P-map) for p: (X ⊥ Y | Z)_p ⟺ ⟨X ⊥ Y | Z⟩_G. p(X) is graph-isomorph if and only if there exists G = (X, E) that is a P-map for p(X). The empty graph (X, ∅) is a D-map for every p(X); the complete graph is an I-map for every distribution p(X). Independence models 26 / 94

27 Undirected graphical model : example 1 Example 1 Let p(D_1, D_2, S) be the joint probability distribution with D_1 and D_2 two (independent) dice and S = D_1 + D_2. (In)dependences in example 1: D_1 ⊥ D_2, but D_1 ⊥̸ S, D_2 ⊥̸ S and D_1 ⊥̸ D_2 | S. [Figure: undirected graph D_1 – S – D_2.] Independence models 27 / 94

28 Undirected graphical model : example 2 Example 2 In a database of users, a strong correlation has been found between L, the ability to read, and P, the size of the shoes. This correlation is explained by the fact that a third variable A (the age) can be < 5. (In)dependences in example 2: L ⊥̸ P, but L ⊥ P | A. [Figure: undirected graph L – A – P.] Independence models 28 / 94

29 Directed graphical model Example 1: D_1 ⊥ D_2 but D_1 ⊥̸ D_2 | S. Example 2: L ⊥̸ P but L ⊥ P | A. Giving more information in the graph by adding orientations. Directed example 1: D_1 → S ← D_2 (collider or V-structure). Directed example 2: P ← A → L. We need a separation criterion for directed graphs: d-separation. Independence models 29 / 94

30 Directed graphical model and d-separation Let C = (x_i)_{i ∈ I} be an undirected path in G = (X, E). x_i is a C-collider if C contains x_{i−1} → x_i ← x_{i+1}. Definition (Active path) Let Z ⊆ X. C is an active path w.r.t. Z if ∀x_i ∈ C: if x_i is a C-collider, then x_i or one of its descendants belongs to Z; else x_i does not belong to Z. If a path is not active w.r.t. Z, it is blocked by Z. Definition (d-separation) Let G = (X, E) be a directed graph and U, V, W ⊆ X. U is d-separated from V by W in G (⟨U ⊥ V | W⟩_G) if and only if every undirected path from U to V is blocked by W. Independence models 30 / 94
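
d-separation can also be checked programmatically; networkx ships a test for it (called d_separated in recent releases, renamed is_d_separator in the newest ones, so the exact name depends on your version). A sketch on the collider example from the previous slide:

```python
import networkx as nx

# D1 -> S <- D2 : the classic collider (two dice and their sum).
G = nx.DiGraph([("D1", "S"), ("D2", "S")])

print(nx.d_separated(G, {"D1"}, {"D2"}, set()))   # True : D1 and D2 are marginally independent
print(nx.d_separated(G, {"D1"}, {"D2"}, {"S"}))   # False: conditioning on the collider activates the path
```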

31 Sampling some d-separations [Figure: example DAG over A, B, C, D, E, F, G used to test d-separation statements.] Independence models 31 / 94

32 Bayesian network and Markov properties Definition (Global Markov Property) G satisfies the GMP for p ⟺ ∀A, B, S ⊆ X, ⟨A ⊥ B | S⟩_G ⟹ A ⊥ B | S, i.e. G is an I-map for p. Definition (Local Markov Property) G satisfies the LMP for p ⟺ ∀X_i ∈ X, {X_i} ⊥ nd(X_i) | Π_{X_i}, where nd(X_i) represents the non-descendant nodes of X_i. Independence models 32 / 94

33 Bayesian network and Markov properties (2) Theorem (when p(X) is positive) GMP ⟺ LMP. Definition (Bayesian network, graphical factorization) Let p(X) be a directed graphical model with the graph G = (X, E). p(X) is a Bayesian network if and only if G is an I-map for p. Theorem Let p(X) be a Bayesian network with the graph G = (X, E); then p(X) = ∏_{X_i ∈ X} p(X_i | Π_{X_i}). Independence models 33 / 94

34 Markov Equivalence class Markov equivalence Two Bayesian networks are equivalent if their graphs represent the same independence model. Definition (Markov equivalence class) The Markov equivalence class for a Bayes net represented by G is the set of all Bayesian networks that are Markov equivalent to G. [Figure: the DAGs over X, Y, Z grouped by Markov equivalence class.] Independence models 34 / 94

35 Causality Independence models 35 / 94

36 Causality and probability Conditional probabilities do not allow us to deal with causality; they even create paradoxes. Simpson's paradox Impact analysis: does a certain drug help to cure the patients? With the values c_1 (cured), d_1 (with drug), d_0 (no drug), M/W (man, woman): P(c_1 | d_1) > P(c_1 | d_0) = 0.5: the drug helps... P(c_1 | d_1, M) = 0.7 < P(c_1 | d_0, M): ... except if the patient is a man... P(c_1 | d_1, W) = 0.2 < P(c_1 | d_0, W): ... or a woman. The conditional probability P(c_1 | d_1) is observational and is not relevant: one wants to give the drug and not merely to observe it, i.e. an intervention on d: P(c_1 | do(d_1)). Conditioning by intervention Let I be the state of the light switch, cause of L: "is there light in the room?". With observational conditioning, P(L | I) and P(I | L) make no distinction between cause and effect; with intervention, P(L | do(I)) = P(L | I) but P(I | do(L)) = P(I). Intervention 36 / 94

37 Causal Bayesian network [Figure: three DAGs over Smoke, Genotype and Cancer.] Depending on the causal structure: P(C | do(S)) = P(C | S), or P(C | do(S)) = P(C), or P(C | do(S)) is unknown: we do not know how to differentiate the causal impact of the two causes from observation alone. With an intermediate variable Smoke → Tar → Cancer, P(C | do(S)) becomes computable: do-calculus. Intervention 37 / 94

38 Markov equivalence class, essential graph Definition (essential graph) A Markov equivalence class can be represented by a partially directed acyclic graph (PDAG): the essential graph. In an essential graph, an arc is directed if every Bayesian network in the equivalence class has the same arc. [Figure: a BN over A–H and its essential graph.] The essential graph can be built from a BN by removing the orientation of the arcs that can be reversed without creating or removing a collider. If the BN is causal, the essential graph indicates the causal arcs that can be learned from a dataset. An undirected arc is an indication of a (possible) latent cause. Intervention 38 / 94

39 Learning Intervention 39 / 94

40 What can we learn? Learning in Bayesian networks The goal of a learning algorithm is to estimate, from a dataset and from priors: the structure of the Bayesian network (is X a parent of Y?); the parameters of the Bayesian network (P(X = 0 | Y = 1)?). The dataset can be complete or incomplete (missing values). The prior knowledge can be, for instance: (part of) the structure of the BN, the probability distribution of certain variables, etc. Therefore there are 4 main classes of learning algorithms for BNs: learning of {parameters | structure} with {complete | incomplete} data. Learning 40 / 94

41 Learning parameters with complete data D: a dataset of M rows (d^A_m, d^B_m, d^C_m, d^E_m) over the variables of the BN (E, A, C, B). Let Θ be the set of all parameters of the model and L(Θ : D) the likelihood: L(Θ : D) = P(D | Θ) = ∏_{m=1}^M P(d_m | Θ) (i.i.d.) = ∏_{m=1}^M P(E = d^E_m, B = d^B_m, A = d^A_m, C = d^C_m | Θ). Parameters learning complete data 41 / 94

42 Learning parameters with complete data (2) Let us rename E, B, A, C as (X_i)_{1 ≤ i ≤ n} with n = 4: L(Θ : D) = ∏_{m=1}^M P(X_1 = d^1_m, X_2 = d^2_m, ..., X_n = d^n_m | Θ) = ∏_{m=1}^M ∏_{i=1}^n P(X_i | Pa_i, Θ) = ∏_{i=1}^n ∏_{m=1}^M P(X_i | Pa_i, Θ_i), hence L(Θ : D) = ∏_{i=1}^n L_i(Θ_i : D). The estimation of the parameters can be decomposed into the estimation of the conditional probability table of each node. No need for a single global dataset: heterogeneous learning. Parameters learning complete data 42 / 94

43 Maximization of the likelihood for a single variable Let X be a binary variable with θ = P(X = 1): Θ = {θ, 1 − θ}, D = (1, 0, 0, 1, 1), L(Θ : D) = ∏_m P(X = d_m | Θ). Here L(Θ : D) = θ (1 − θ)(1 − θ) θ θ, and θ̂ = argmax_θ (θ^3 (1 − θ)^2). [Figure: plot of θ^3 (1 − θ)^2.] Generalization For X a random variable which can take the values (1, ..., r), with Θ_X = (θ_1, ..., θ_r) where θ_i = P(X = i), and N_i = #_D(X = i) the number of occurrences of i in the dataset D: L(Θ_X : D) = ∏_{i=1}^r θ_i^{N_i} and Θ̂_X = argmax_{Θ_X} L(Θ_X : D). Parameters learning complete data 43 / 94
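
The single-variable maximization can be verified numerically; here is a short sketch that evaluates θ^3 (1 − θ)^2 on a grid and compares the argmax with the closed-form N_1/N.

```python
import numpy as np

D = np.array([1, 0, 0, 1, 1])
thetas = np.linspace(0.001, 0.999, 999)
lik = thetas ** D.sum() * (1 - thetas) ** (len(D) - D.sum())   # L(theta : D) = theta^3 (1 - theta)^2

print("argmax on the grid :", thetas[np.argmax(lik)])          # ~ 0.6
print("closed form N_1/N  :", D.sum() / len(D))                # 3/5 = 0.6
```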

44 Maximizing the likelihood in a Bayesian network θ_ijk = P(X_i = k | Pa_i = j), N_ijk = #_D(X_i = k, Pa_i = j), k ∈ {1, ..., r_i}, j ∈ {1, ..., q_i}. L(Θ : D) = ∏_{i=1}^n L_i(Θ_i : D) = ∏_{i=1}^n ∏_{m=1}^M P(X_i = k_m | Pa_i = j_m, Θ_i) = ∏_{i=1}^n ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{N_ijk}. LL(Θ : D) = ∑_{i=1}^n ∑_{m=1}^M log P(X_i | Pa_i, Θ_i) = ∑_{i=1}^n ∑_{j=1}^{q_i} ∑_{k=1}^{r_i} N_ijk log θ_ijk. Since ∑_k θ_ijk = 1, θ_{ij r_i} = 1 − ∑_{k=1}^{r_i−1} θ_ijk, hence LL(Θ : D) = ∑_{i=1}^n ∑_{j=1}^{q_i} ( ∑_{k=1}^{r_i−1} N_ijk log θ_ijk + N_{ij r_i} log(1 − ∑_{k=1}^{r_i−1} θ_ijk) ). We are looking for Θ̂ that maximizes L(Θ : D), hence LL(Θ : D), i.e. such that ∀i, j, k: ∂LL/∂θ_ijk (Θ̂) = N_ijk/θ̂_ijk − N_{ij r_i}/θ̂_{ij r_i} = 0. Finally N_{ij1}/θ̂_{ij1} = ... = N_{ij(r_i−1)}/θ̂_{ij(r_i−1)} = N_{ij r_i}/θ̂_{ij r_i} (and ∑_k θ̂_ijk = 1), so ∀k ∈ {1, ..., r_i}: θ̂_ijk = N_ijk / N_ij, with N_ij = ∑_{k=1}^{r_i} N_ijk. Parameters learning complete data 44 / 94

45 Bayesian prediction The a posteriori distribution of Θ is P(Θ | D): P(Θ | D) ∝ P(D | Θ) P(Θ) = L(Θ : D) P(Θ). Goal: take into account a prior on Θ in order to integrate expert knowledge or to stabilize the estimation if the dataset is too small. The Dirichlet distribution is the conjugate prior of the categorical and multinomial distributions. Dirichlet distribution: f(p_1, ..., p_K ; α_1, ..., α_K) ∝ ∏_{i=1}^K p_i^{α_i − 1} (where ∑_i p_i = 1). f can be interpreted as P(P(X = i) = p_i | #_{X=i} = α_i − 1). If the prior P(Θ) is a Dirichlet distribution: P(Θ) ∝ ∏_{i=1}^n ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{α_ijk − 1}. [Figure (Wikipedia): Dirichlet densities, clockwise from top left: α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).] Parameters learning complete data 45 / 94

46 Bayesian prediction (2) From P(Θ | D) ∝ L(Θ : D) P(Θ), with P(Θ) ∝ ∏_{i,j,k} θ_ijk^{α_ijk − 1} and L(Θ : D) = ∏_{i,j,k} θ_ijk^{N_ijk}, we get P(Θ | D) ∝ ∏_{i=1}^n ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{N_ijk + α_ijk − 1}. MAP (maximum a posteriori): Θ̂_MAP = argmax_Θ P(Θ | D), θ̂^MAP_ijk = (N_ijk + α_ijk − 1) / ∑_k (N_ijk + α_ijk − 1). EAP (expectation a posteriori): Θ̂_EAP = ∫_Θ Θ P(Θ | D) dΘ, θ̂^EAP_ijk = (N_ijk + α_ijk) / ∑_k (N_ijk + α_ijk). Parameters learning complete data 46 / 94

47 Parameters learning with complete data With N_ijk the count of occurrences in the dataset where variable X_i has the value k and its parents have the values (tuple) j, and α_ijk the parameters of a Dirichlet prior. Parameter estimation: two main solutions. MLE (Maximum Likelihood Estimation): θ̂_ijk = θ̂_{x_i=k | pa_i=j} = N_ijk / N_ij. Bayesian estimation (with Dirichlet prior): θ̂^MAP_ijk = (α_ijk + N_ijk − 1) / (α_ij + N_ij − r_i), θ̂^EAP_ijk = (α_ijk + N_ijk) / (α_ij + N_ij). The prior is important when N_ijk ≈ 0 (no occurrence in the dataset). All these estimations are equivalent when N_ijk → ∞. The α_ijk are often hard to find; Laplace smoothing: α_ijk = constant (= 1). Parameters learning complete data 47 / 94
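
A small sketch of both estimators on invented counts N_ijk for one node with r_i = 3 values and a fixed parent configuration j:

```python
import numpy as np

N_jk = np.array([7.0, 3.0, 0.0])       # N_ijk for a fixed (i, j): value k = 3 never observed
alpha = np.ones_like(N_jk)             # Laplace smoothing: alpha_ijk = 1

theta_mle = N_jk / N_jk.sum()                            # MLE: 0 for the unseen value
theta_eap = (N_jk + alpha) / (N_jk.sum() + alpha.sum())  # EAP with Dirichlet prior

print("MLE :", theta_mle)    # [0.7, 0.3, 0.0]
print("EAP :", theta_eap)    # [8/13, 4/13, 1/13] : no zero probability
```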

48 Learning parameters with missing values D: a dataset of M rows (d^A_m, d^B_m, d^C_m, d^E_m) over the BN, with some missing values ('?'). D = D_o ∪ D_h: respectively the observed and the unobserved data. Typology of incomplete datasets (with M_il the indicator that d^i_l ∈ D_h): MCAR: P(M | D) = P(M) (Missing Completely At Random). MAR: P(M | D) = P(M | D_o) (Missing At Random). NMAR: P(M | D) (Not Missing At Random). Parameters learning with incomplete data 48 / 94

49 EM for BNs Expectation-Maximization for BNs Repeat until convergence: Step E: estimate N^(t+1)_ijk from P(X_i | Pa_i, θ^t_ijk), i.e. an inference in the BN with parameters θ^t_ijk. Step M: θ^(t+1)_ijk = N^(t+1)_ijk / N^(t+1)_ij. Example: a two-node BN P → S with dataset (P, S) = (o, ?), (n, ?), (o, n), (n, n), (o, o). Parameters: P(P) = [θ_P, 1 − θ_P], P(S | P = o) = [θ_{S|P=o}, 1 − θ_{S|P=o}], P(S | P = n) = [θ_{S|P=n}, 1 − θ_{S|P=n}]. With MLE: θ_P = 3/5. Parameters learning with incomplete data 49 / 94

50 EM in a BN : example Initialisation We have to choose an initial value for each parameter: θ^(0)_{S|P=o} = 0.3, θ^(0)_{S|P=n} = 0.4. Step E: the missing S values contribute their expected counts P(S = o | P, θ^(0)), giving N(S=o, P=o) = 1 + 0.3 = 1.3 and N(S=o, P=n) = 0 + 0.4 = 0.4. Step M: θ^(1)_{S|P=o} = 1.3/3 ≈ 0.43 and θ^(1)_{S|P=n} = 0.4/2 = 0.2. Next iteration: θ^(2)_{S|P=o} = (1 + 0.43)/3 ≈ 0.48 and θ^(2)_{S|P=n} = 0.2/2 = 0.1, etc. (θ^(t)_{S|P=o} → 0.5 and θ^(t)_{S|P=n} → 0). Parameters learning with incomplete data 50 / 94
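
This tiny example can be reproduced in a few lines; the sketch below implements the E and M steps for θ_{S|P} as described above (each missing S row contributes its expected count under the current parameters) and recovers the limits 0.5 and 0. The encoding of the dataset is an assumption on my part.

```python
import numpy as np

# dataset over (P, S): 'o'/'n' values, None = missing S
data = [("o", None), ("n", None), ("o", "n"), ("n", "n"), ("o", "o")]

theta = {"o": 0.3, "n": 0.4}            # theta_{S=o | P}, initial values from the slide

for t in range(50):
    counts_o = {"o": 0.0, "n": 0.0}     # expected counts of S = o, per value of P
    totals = {"o": 0.0, "n": 0.0}
    for p, s in data:
        totals[p] += 1
        if s is None:                   # E step: expected contribution P(S=o | P=p, theta)
            counts_o[p] += theta[p]
        elif s == "o":
            counts_o[p] += 1
    theta = {p: counts_o[p] / totals[p] for p in theta}   # M step

print(theta)   # close to {'o': 0.5, 'n': 0.0}
```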

51 EM in BNs [Figure: a BN over E, A, C, B and a dataset row with a missing value '?' for A.] What is the step E? Replace the '?' by P(A | B = 0, C = 1, E = 0): an inference in the BN with the parameters Θ^t. Learning parameters with missing values: EM converges to a local optimum; it is sensitive to the initial parameters; each step E can be expensive (an inference in a complex BN). Parameters learning with incomplete data 51 / 94

52 Structural learning with complete data Goal: learning the arcs of the graph from data. Theoretically: χ² tests plus enumeration of all the possible models: OK. In practice: many different problems, but above all: Set of Bayesian networks (Robinson, 1977) The number of different possible structures for n random variables is super-exponential: NS(n) = 1 for n ≤ 1, and NS(n) = ∑_{i=1}^n (−1)^{i+1} C(n, i) 2^{i(n−i)} NS(n−i) for n > 1. Robinson (1977), Counting unlabelled acyclic digraphs, in Lecture Notes in Mathematics: Combinatorial Mathematics V. A brute-force algorithm is not feasible: the space of Bayesian networks is too large, NS(10) ≈ 4.2 × 10^18! Structure learning with complete data 52 / 94
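
Robinson's recursion is easy to evaluate; a short sketch that reproduces the super-exponential growth and the order of magnitude quoted for NS(10):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def NS(n: int) -> int:
    """Number of DAGs on n labelled nodes (Robinson, 1977)."""
    if n <= 1:
        return 1
    return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * NS(n - i)
               for i in range(1, n + 1))

for n in (1, 2, 3, 4, 5, 10):
    print(n, NS(n))      # 1, 3, 25, 543, 29281, ..., NS(10) ~ 4.2e18
```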

53 Structural learning - introduction General picture of structural learning Identification of symmetrical relations (independence) + orientation algorithms: IC/PC, IC*/FCI; important for causal models. Local search: in the (very large) space of structures, greedy algorithms maximizing a score (entropy, AIC, BIC, MDL, BD, BDe, BDeu, ...). Structure learning with complete data 53 / 94

54 Identification of symmetrical relations Statistically, the relations that can be tested between variables are symmetrical: correlation or independence. However, once these symmetrical relations have been found, other tests (conditional independence) can lead to the discovery of colliders, which force some orientations. Main principle for IC, IC*, PC, FCI: 1 Build an undirected graph based on the symmetrical relations statistically found (χ², correlation, mutual information, etc.): add edges from the empty graph, or remove edges from the complete graph. 2 Identify colliders and add the implied orientations. 3 Finalize the orientations without creating any other collider (in order to stay in the same Markov equivalence class). Major drawback: a very large number of statistical tests is needed, and each test is not robust w.r.t. the size of the dataset. Structure learning with complete data 54 / 94

55 Example : PC Let P be a BN (example from Philippe Leray). We generate a dataset of 5000 lines compatible with P. Using a χ² test, every possible marginal independence (X ⊥ Y) is tested; then every remaining possible conditional independence (X ⊥ Y | Z) is tested; then with 2 conditioning variables, etc. We first find a list of marginal independences (A ⊥ S, L ⊥ T, O ⊥ B, X ⊥ A, B ⊥ A, O ⊥ A, D ⊥ A, T ⊥ S), then a list of independences conditioned on one variable (T ⊥ A | O, O ⊥ S | L, X ⊥ S | L, B ⊥ T | S, X ⊥ T | O, D ⊥ T | O, B ⊥ L | S, X ⊥ L | O, D ⊥ L | O, D ⊥ X | O). Structure learning with complete data 55 / 94

56 Example : PC We continue with χ² tests conditioned on 2 variables; we find: D ⊥ S | (L, B), X ⊥ O | (T, L), D ⊥ O | (T, L). We then look for colliders, propagate the orientation constraints and orient the last edges within the Markov equivalence class (the slide reports the orientations found around T, L and O). Conclusion: with 5000 lines, PC has lost a lot of information (noisy χ² tests). Structure learning with complete data 56 / 94

57 Structural learning with local search Local search A local search is composed of: a space of all possible solutions (search space); a neighborhood defined by elementary transformations of a solution (the neighbors of a solution are the solutions that can be obtained by applying an elementary transformation); a score (heuristic) that evaluates the quality of a solution. From an initial solution, the local search then produces a sequence of solutions such that every solution in the sequence has a better score than the previous ones (greedy search). Local search in Bayesian networks: the space of Bayesian networks (huge); the score (see next slide); the initial solution (the empty Bayesian network, for instance); elementary operations: add/remove/reverse an arc. Structure learning with complete data 57 / 94
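
In pyAgrum this greedy search is exposed through the BNLearner class; here is a hedged sketch (the file name is a placeholder I generate on the fly, and method names may differ slightly across pyAgrum versions).

```python
import pyAgrum as gum

# generate a synthetic complete dataset from a random BN (names are arbitrary)
true_bn = gum.fastBN("A->B;B->C;A->D;D->E;C->F")
gum.generateCSV(true_bn, "sample.csv", 5000)

learner = gum.BNLearner("sample.csv")
learner.useGreedyHillClimbing()     # local search: add / remove / reverse an arc
learner.useScoreBIC()               # score driving the search
learned = learner.learnBN()
print(learned.dag())
```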

58 Scores Properties of a score Let D be a dataset, T the graph of the current solution and Θ its parameters. A score must satisfy: 1 Likelihood: the solution must explain the data (max L(T, Θ : D)). 2 Occam's razor: the score must prefer simple graphs T rather than complex ones (min Dim(T)). 3 Local consistency: adding a useful arc should increase the score. 4 Score equivalence: two Markov-equivalent Bayesian networks should have the same score. Definition (Dim(T)) The dimension of a Bayesian network is its number of free parameters: Dim(T) = ∑_i ((r_i − 1) q_i), where r_i is the size of the variable X_i and q_i is the number of configurations of the parents of X_i. Maximizing the likelihood alone is not a good score: it leads to the complete graph (overfitting). Structure learning with complete data 58 / 94

59 Some scores (1) : AIC/BIC Key idea: maximizing the likelihood but minimizing the dimension of the Bayesian network. Score AIC (Akaike, 70), Akaike Information Criterion: Score_AIC(T, D) = log2 L(Θ^ML, T : D) − Dim(T). Score BIC (Schwartz, 78), Bayesian Information Criterion: Score_BIC(T, D) = log2 L(Θ^ML, T : D) − ½ Dim(T) log2 N. Structure learning with complete data 59 / 94

60 Minimum Description Length (Rissanen, 78) The MDL score considers the compactness of the representation as a good indicator of the quality of the solution. Score MDL (Lam and Bacchus, 93), Minimum Description Length: Score_MDL(T, D) = log2 L(Θ^ML, T : D) − |arcs_T| log2 N − c · Dim(T), where arcs_T is the set of arcs in T and c is the number of bits needed to represent a parameter. Structure learning with complete data 60 / 94

61 Bayesian Dirichlet score Equivalent With a Bayesian score, we want to maximize the joint probability of T and D: P(T, D) = ∫_Θ P(D | Θ, T) P(Θ | T) P(T) dΘ = P(T) ∫_Θ L(Θ, T : D) P(Θ | T) dΘ. With some independence hypotheses and a Dirichlet prior: Score_BDe(T, D) = P(T) ∏_{i=1}^n ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(N_ij + α_ij)] ∏_{k=1}^{r_i} [Γ(N_ijk + α_ijk) / Γ(α_ijk)]. Structure learning with complete data 61 / 94

62 Local search : example With the same dataset. Major drawbacks: the algorithm may be trapped in zones where the whole neighborhood has the same score; the algorithm may stop in a local optimum. Solutions (meta-heuristics): random restarts (not only from the empty Bayesian network), TABU search (force the algorithm to explore new solutions), simulated annealing (accept from time to time structures with a decreasing score). Structure learning with complete data 62 / 94

63 Learning trees In order to reduce the search space, this algorithm finds at most one parent for each random variable. This may be an over-simplification of the model, but learning a tree brings: a mathematically beautiful solution (it finds the global optimum); a small number of parameters (which minimizes the risk of overfitting). Basic idea: decomposition of the log-likelihood: LL(T) = ∑_i LL_i(i, pa(i)) = ∑_{X→Y} LL(Y | X) + K, with LL(Y | X) = LL_Y(Y, X) − LL_Y(Y, ∅). Learning an optimal tree: ∀X, Y, compute LL(Y | X); find the tree (forest) that maximizes LL(T): Maximum Spanning Tree algorithm, O(n² log n). Structure learning with complete data 63 / 94
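
The optimal-tree idea can be sketched with pairwise mutual information and a maximum spanning tree (Chow–Liu style); the data below are invented and networkx's maximum_spanning_tree does the spanning-tree part.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
# toy dataset: 3 binary variables where X2 copies X0 most of the time
X0 = rng.integers(0, 2, 1000)
X1 = rng.integers(0, 2, 1000)
X2 = np.where(rng.random(1000) < 0.9, X0, 1 - X0)
data = np.stack([X0, X1, X2], axis=1)

def mutual_information(a, b):
    joint = np.histogram2d(a, b, bins=2)[0] / len(a)
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum())

G = nx.Graph()
n = data.shape[1]
for i in range(n):
    for j in range(i + 1, n):
        G.add_edge(i, j, weight=mutual_information(data[:, i], data[:, j]))

tree = nx.maximum_spanning_tree(G)
print(sorted(tree.edges()))      # expected to contain the strong edge (0, 2)
```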

64 Inference Structure learning with complete data 64 / 94

65 Inference in a Bayesian network Elementary operations on a joint probability: Marginalization: ∑_y P(x, y | z) = p(x | z). Total sum: ∑_y P(y | z) = 1. Decomposition: P(x, y | z) = P(x | y, z) P(y | z). Chain rule: P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | X_1, ..., X_{i−1}). Independence: X ⊥ Y | Z ⟹ P(x | y, z) = P(x | z). Bayes rule: P(x | y, z) ∝ P(y | x, z) P(x | z). In a Bayesian network, local Markov property: P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | parents(X_i)). Inference 65 / 94

66 Hard and soft evidence Let P(X_1, ..., X_n) be a Bayesian network and let ε be an event: P(X_1, ..., X_n | ε) ∝ P(ε | X_1, ..., X_n) P(X_1, ..., X_n). Evidence in a Bayesian network Assume that there exist (ε_i), I ⊆ {1, ..., n}, such that ε = ⋂_{i ∈ I} ε_i and ∀i ∈ I, ε_i ⊥ X_1, ..., X_{i−1}, X_{i+1}, ..., X_n | X_i. Then P(X_1, ..., X_n, ε) = ∏_{i=1}^n P(x_i | π_i) ∏_{i ∈ I} P(ε_i | X_i). If P(ε_i | X_i) contains a 1 and many 0s, ε_i is called a hard evidence: ε_i is the event "X_i takes a certain value". Otherwise, ε_i is called a soft evidence: ε_i acts like a virtual child of X_i. Evidence will not appear in the following since it fits in this general framework. Inference 66 / 94

67 Inference in a Bayesian network (1) : P(D)? Graph: A → B, B → C, B → D. P(d) = ∑_a ∑_b ∑_c P(a, b, c, d) = ∑_a ∑_b ∑_c P(a) P(b | a) P(c | b) P(d | b) = ∑_a P(a) ∑_b P(b | a) P(d | b) (∑_c P(c | b)), where ∑_c P(c | b) = 1 and each remaining factor is read from the CPT of the corresponding node: P(d) = ∑_b P(d | b) ∑_a P(a) P(b | a). Inference 67 / 94
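
The same elimination can be written with numpy's einsum: each factor is a CPT array and the summed-out variables a, b, c are the contracted indices (the CPT values below are invented).

```python
import numpy as np

# A -> B, B -> C, B -> D, all binary
P_A = np.array([0.6, 0.4])                     # P(a)
P_B_A = np.array([[0.7, 0.3], [0.2, 0.8]])     # P(b | a), indexed [a, b]
P_C_B = np.array([[0.5, 0.5], [0.9, 0.1]])     # P(c | b), indexed [b, c]
P_D_B = np.array([[0.1, 0.9], [0.6, 0.4]])     # P(d | b), indexed [b, d]

# P(d) = sum_{a,b,c} P(a) P(b|a) P(c|b) P(d|b) ; summing c out of P(c|b) gives 1
P_D = np.einsum("a,ab,bc,bd->d", P_A, P_B_A, P_C_B, P_D_B)
print(P_D, P_D.sum())      # a proper distribution over D
```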

68 Inference in a Bayesian network (2) : P(D | a)? P(d | a) = ∑_b ∑_c P(a, b, c, d | a) = ∑_b ∑_c P(a | a) P(b | a, a) P(c | b, a) P(d | b, a) = ∑_b P(a | a) P(b | a) P(d | b) (∑_c P(c | b)), using C ⊥ A | B, D ⊥ A | B and ∑_c P(c | b) = 1: P(d | a) = ∑_b P(d | b) P(b | a). Inference 68 / 94

69 Inference in a Bayesian network (3) : P(C | d)? P(c | d) = ∑_a ∑_b P(a | d) P(b | a, d) P(c | b, d) = ∑_b P(c | b) P(b | d), with Bayes rule for P(b | d): P(b | d) ∝ P(d | b) p(b) = P(d | b) ∑_a p(b | a) P(a). Inference 69 / 94

70 Inference in a Bayesian network (4) : P(A | d)? P(a | d) ∝ P(d | a) P(a), with P(d | a) = ∑_b P(d | a, b) P(b | a) = ∑_b P(d | b) P(b | a). Inference 70 / 94

71 Inference in a Bayesian network (5) : P(B | c, a)? P(b | c, a) ∝ P(c | b, a) P(b | a) = P(c | b) p(b | a) P(a | a). Inference 71 / 94

72 Inference in polytrees [Figure: a node X with parents U_1, ..., U_n and children Y_1, ..., Y_m; messages π_X(U_i) and λ_X(U_i) flow on the parent side, π_{Y_j}(X) and λ_{Y_j}(X) on the child side; π(x) = P(x | e⁺) and λ(x) = P(e⁻ | x).] If a node has n neighbors (parents or children), it must know n − 1 messages in order to send its message to the last neighbor. Once all messages have been sent (2|E|), every node knows all the information received by the graph. If a node has n neighbors, it must know its n messages in order to compute its own posterior. Inference 72 / 94

73 A first algorithm for inference in polytrees Propagation: repeat, for all nodes N with n neighbors: if N has received n − 1 messages then N can send the message to the last neighbor; if N has received n messages then N can send all its messages; until all messages have been sent. Every node can then compute its posterior. [Figure: an example polytree annotated with the order in which messages are sent.] Complexity: O(|E|). Inference 73 / 94

74 Inference in polytrees (2) [Figure: the same polytree with a centralized message ordering.] Centralized version: Selection of a root. Absorption (collect): each node sends its message towards the root (as soon as it can). Integration: the root has received all the messages from its neighbors and can send all its messages back to its neighbors. Diffusion (distribute): when a node receives its message from the root, it sends all the remaining messages. Inference 74 / 94

75 Problem with message-passing algorithms in DAGs P(E) = ∑_{A,B,C,D} P(A) P(B | A) P(C | A) P(D | B) P(E | C, D). [Figure: the DAG A → B, A → C, B → D, D → E, C → E.] Message-passing algorithm: 1 π_B(A) = P(A); 2 π_C(A) = P(A); 3 π_D(B) = ∑_A P(B | A) π_B(A); 4 π_E(D) = ∑_B P(D | B) π_D(B); 5 π_E(C) = ∑_A P(C | A) π_C(A); 6 P(E) = ∑_{D,C} P(E | C, D) π_E(C) π_E(D). But then P(E) ∝ ∑_{A,B,C,D,A'} P(E | C, D) P(C | A) P(A) P(D | B) P(B | A') P(A'): the information on A is counted twice, because the graph is not a polytree. Inference 75 / 94

76 Inference in DAGs Message-passing algorithms need a polytree. From DAG to polytree: Conditioning: cutting arcs in the graph. Clustering: merging nodes in the graph. [Figure: illustration of conditioning and clustering.] Inference 76 / 94

77 Conditioning Conditioning a Bayesian network Let G be a Bayesian network over the set of random variables V, S ⊆ V and s an instantiation of S. The conditioned Bayesian network G_[S=s] is the Bayesian network obtained by: removing the arcs going out of every node in S; if a node Y has parent(s) in S, then p(Y | Π_Y) is changed into p(Y | s, Π_Y). [Figure: a graph G over A, ..., G and the conditioned graph G_[A=0,G=1], where P(E | A) becomes P(E | A = 0), P(F | E, G) becomes P(F | E, G = 1) and P(D | C, G) becomes P(D | C, G = 1).] An inference in G with evidence S = s is equivalent to an inference in G_[S=s] with no evidence. S is called a cutset. Inference 77 / 94

78 Inference by conditioning ∀S ⊆ V, ∀x ∈ V, P(x) = ∑_s P(x | s) · P(s). Inference by conditioning With G a Bayesian network: 1 find a cutset S ⊆ V such that G_[S] is a polytree; 2 ∀s instantiation of S, compute P_s(x) = P(x | s) in G_[s] and p_s = P(s) (a by-product of this inference); 3 P(x) = ∑_s (p_s · P_s(x)). If #_i is the size of S_i, the inference computes ∏_i #_i inferences in G_[S]: O(k^|S| · |E_[S]|). Finding the minimal cutset is NP-complete. Inference 78 / 94

79 Clustering : junction tree algorithm P(s, l, d, f, v) = P(s) P(l | s) P(d | s) P(f | l) P(v | d, l) = f(s, l) g(s, d) h(l, f) k(d, l, v) = j(s, l, d) h(l, f) k(d, l, v). [Figure: the BN over S, L, D, F, V, the junction graph over the clusters SL, SD, SLD, LF, LDV, and a junction tree with cliques SLD, LF, LDV and separators L and LD.] Inference 79 / 94

80 How to build a junction tree Idea: create an undirected graph from the Bayesian network. The cliques of this undirected graph will be the nodes of the junction tree. The CPTs P(X | Parents_X) of the Bayesian network indicate the necessary clusters. Moralization The moral graph of a BN is the undirected skeleton of the Bayesian network to which edges between the parents of a same node are added. [Figure: the BN over S, T, L, D, F, V and its moral graph.] Inference 80 / 94

81 How to build a junction tree (2) To ensure the existence of the junction tree, the undirected graph must be triangulated: every cycle of length > 3 must have a chord. Triangulation To triangulate a graph, use variable elimination: iteratively remove all the nodes of the graph; when a node is removed, add an edge between every pair of its neighbors. Start by eliminating nodes that will not create new edges (no neighbor, only one neighbor, neighbors already connected). [Figure: triangulation of the moral graph with the elimination order F, V, T, S, L, D.] Finding the best elimination order is NP-complete. Inference 81 / 94

82 How to build a junction tree (3) The nodes of the junction tree are the cliques of the moralized and triangulated graph; what about its edges? Junction tree Choose an order on the cliques (C_i). For each C_i, find the clique C_j (j < i) that maximizes |C_i ∩ C_j|; add an edge between C_i and C_j and create the separator S_ij = C_i ∩ C_j. [Figure: the triangulated graph and the resulting junction tree with cliques STD, SLD, LF, LDV and separators SD, LD, L.] The junction tree is not unique for a given Bayesian network. Inference 82 / 94

83 Potentials [Figure: the BN over S, T, L, D, F, V and its junction tree.] Decomposition of P(V) following the cliques: P(v) = P(s) P(l | s) P(t | s) P(d | t) P(f | l) P(v | l, d) = φ_1(d, l, s) φ_2(s, t, d) φ_3(f, l) φ_4(v, l, d). Another decomposition: P(v) = P(s, t, d) P(s, l, d) P(l, f) P(l, d, v) / (P(s, d) P(l) P(l, d)). Factorization of P in a junction tree: P(v) = ∏_i Φ_{C_i}(c_i) / ∏_{i<j} Φ_{S_ij}(s_ij). Inference 83 / 94

84 Propagation on potentials in the junction tree Goal: transform the potentials of all cliques (C) and separators (S) into the joint probabilities of their variables. Message passing in the junction tree Initialization: ∀C_i ∈ C, Ψ⁰_{C_i} = ∏_{X ∈ C_i, X ∉ C_j, j < i} P(X | Π_X); ∀S ∈ S, Ψ⁰_S = 1 (constant function). Root selection: choose a clique as the root for the propagation (and the evidence). Collect: send all messages from the cliques towards the root (a clique can send a message once it has received all the others). Distribution: send all messages from the root back to all cliques. Message from a clique C_i to a clique C_j (updating Ψ_{C_j} from Ψ_{C_i}): Ψ^{t+1}_{S_ij}(s_ij) = ∑_{C_i \ S_ij} Ψ^{t+1}_{C_i}(c_i) and Ψ^{t+1}_{C_j} = Ψ^t_{C_j} · Ψ^{t+1}_{S_ij} / Ψ^t_{S_ij}. The complexity is still linear in the number of arcs, but the cliques and the Ψ_C may be very large. This algorithm is the most used for exact inference in Bayesian networks. Inference 84 / 94

85 Copula Bayesian Networks Inference 85 / 94

86 Copulas Let X = {X_1, ..., X_n} be a set of continuous random variables, F(x) = P(X ≤ x) its joint CDF, and p(x) its probability density: p(x) = ∂ⁿF/∂x_1...∂x_n (x_1, ..., x_n). Copula C(u_1, ..., u_n) = CDF of variables U_1, ..., U_n uniformly distributed on [0, 1]. Sklar, 1959: ∀F, ∃C copula s.t. F(x_1, ..., x_n) = C(F(x_1), ..., F(x_n)), i.e. C(u) = F(F_1^{-1}(u_1), ..., F_n^{-1}(u_n)); uniqueness under continuity and p > 0. For the density: p(x) = c(F_1(x_1), ..., F_n(x_n)) ∏_{i=1}^n p_i(x_i). C captures the structural complexity of the relations between the variables and does not take into account their marginal behavior. A copula C(X) is rarely used/computed/learned if n > 10. Copula Bayesian network 86 / 94
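
Sklar's decomposition can be illustrated numerically with a Gaussian copula: sample a correlated bivariate normal, push each margin through its CDF to get uniform U_i, then give them any margins you like; the dependence structure (the copula) survives the change of margins. Everything below is a toy sketch with scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])

z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)
u = stats.norm.cdf(z)                    # u_i = F_i(z_i): samples from the Gaussian copula C

# apply arbitrary margins: exponential for X1, standard normal for X2
x1 = stats.expon.ppf(u[:, 0])
x2 = stats.norm.ppf(u[:, 1])

# the rank dependence induced by the copula is unchanged by the new margins
print("rank correlation:", round(stats.spearmanr(x1, x2)[0], 3))
```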

87 Copula Bayesian Networks Let G be a Bayesian network on the probability density: p(X_1, ..., X_n) = ∏_{i=1}^n p(X_i | π_i). Let c be a copula for p; we define, for each node X_i, R_c(X_i, π_i) = [∂^{k_i} c(F(X_i), F(P_1), ..., F(P_{k_i})) / ∂F(P_1)...∂F(P_{k_i})] / [∂^{k_i} c(1, F(P_1), ..., F(P_{k_i})) / ∂F(P_1)...∂F(P_{k_i})], where k_i = |π_i| (if k_i = 0, R_c(X_i, π_i) = 1). Then p(X_1, ..., X_n) = ∏_{i=1}^n R_{c_i}(X_i, π_i) p(X_i). Reciprocally, if {c_i}_{i=1}^n are copulas (> 0) for each (X_i, π_i), then ∏_{i=1}^n R_{c_i}(X_i, π_i) defines a copula for X: c(F(X_1), ..., F(X_n)) = ∏_{i=1}^n R_{c_i}(X_i, π_i). Copula Bayesian network 87 / 94

88 Copula Bayesian Network - learning and inference Definition (Copula Bayesian Network) A CBN is described by a triplet (G, Θ_C, Θ_p) where G is a DAG over X, Θ_C is a set of copulas c_i(X_i, π_i), and Θ_p is the set of marginals p(X_i). The CBN then has a joint probability density p(X) factorized by: p(X_1, ..., X_n) = ∏_{i=1}^n R_{c_i}(X_i, π_i) p(X_i). The same locality properties hold for CBNs as for BNs: large reduction of the number of parameters; parameter learning can be done locally on (X_i, π_i) — not necessarily the same dataset for each R_c, not necessarily the same type of copula for each (X_i, π_i); structural learning can use the same kind of algorithms as for classical BNs; inference is simplified: simulation, Rosenblatt's transformation, etc. Copula Bayesian network 88 / 94

89 Copula Junction Tree [Figure: a BN over A, ..., I and its junction tree with cliques ABC, BCI, BDI, DE, EF, EG, EH and separators BC, BI, D, E.] p(X) = ∏_{C ∈ clique(G)} p(C) / ∏_{S ∈ separator(G)} p(S). Let c_S(S) be a copula for every S in the junction tree (clique or separator, with some restrictions for consistency). Then c(X) = ∏_{C ∈ clique(G)} c_C(C) / ∏_{S ∈ separator(G)} c_S(S) is a copula for p(X). Copula Bayesian network 89 / 94

90 Non-parametric independence test : Y ⊥ Z | X? Bouezmarni et al., 2009: based on the Hellinger distance between the joint copula and the product of the marginal copulas; every copula is approximated by a non-parametric Bernstein copula on a dataset of size N. [Formulas shown as images in the original slide.] Copula Bayesian network 90 / 94

91 Y ⊥ Z | X? [Bouezmarni et al., 2009] show that under H0: [Y ⊥ Z | X], the test statistic T is asymptotically N(0, 1), with k = d (arbitrary). [Expression of the statistic shown as an image in the original slide.] Copula Bayesian network 91 / 94

92 Structural Learning with CI test : PC Algorithm Copula Bayesian network 92 / 94

93 Early results Copula Bayesian network 93 / 94

94 continuous-pc Structural identification: increasing blocks, 10 variables, N = 10000, 27 minutes. Conclusion: CBNs propose to explore the structure inside joint copulas; CBNs may increase the number of dimensions of a joint copula; CBNs may ease the different inference algorithms; CBNs can be automatically learned from data using a non-parametric CI test. Copula Bayesian network 94 / 94


More information

Bayes Networks 6.872/HST.950

Bayes Networks 6.872/HST.950 Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

Lecture 2: Simple Classifiers

Lecture 2: Simple Classifiers CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412

More information

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4 ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Implementing Machine Reasoning using Bayesian Network in Big Data Analytics

Implementing Machine Reasoning using Bayesian Network in Big Data Analytics Implementing Machine Reasoning using Bayesian Network in Big Data Analytics Steve Cheng, Ph.D. Guest Speaker for EECS 6893 Big Data Analytics Columbia University October 26, 2017 Outline Introduction Probability

More information

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses

Bayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses Bayesian Learning Two Roles for Bayesian Methods Probabilistic approach to inference. Quantities of interest are governed by prob. dist. and optimal decisions can be made by reasoning about these prob.

More information

Bayesian Networks for Causal Analysis

Bayesian Networks for Causal Analysis Paper 2776-2018 Bayesian Networks for Causal Analysis Fei Wang and John Amrhein, McDougall Scientific Ltd. ABSTRACT Bayesian Networks (BN) are a type of graphical model that represent relationships between

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

{ p if x = 1 1 p if x = 0

{ p if x = 1 1 p if x = 0 Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

More information

BN Semantics 3 Now it s personal! Parameter Learning 1

BN Semantics 3 Now it s personal! Parameter Learning 1 Readings: K&F: 3.4, 14.1, 14.2 BN Semantics 3 Now it s personal! Parameter Learning 1 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 22 nd, 2006 1 Building BNs from independence

More information

Exact Inference Algorithms Bucket-elimination

Exact Inference Algorithms Bucket-elimination Exact Inference Algorithms Bucket-elimination COMPSCI 276, Spring 2011 Class 5: Rina Dechter (Reading: class notes chapter 4, Darwiche chapter 6) 1 Belief Updating Smoking lung Cancer Bronchitis X-ray

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Bayesian Approaches Data Mining Selected Technique

Bayesian Approaches Data Mining Selected Technique Bayesian Approaches Data Mining Selected Technique Henry Xiao xiao@cs.queensu.ca School of Computing Queen s University Henry Xiao CISC 873 Data Mining p. 1/17 Probabilistic Bases Review the fundamentals

More information

Conditional Independence

Conditional Independence Conditional Independence Sargur Srihari srihari@cedar.buffalo.edu 1 Conditional Independence Topics 1. What is Conditional Independence? Factorization of probability distribution into marginals 2. Why

More information

Stephen Scott.

Stephen Scott. 1 / 28 ian ian Optimal (Adapted from Ethem Alpaydin and Tom Mitchell) Naïve Nets sscott@cse.unl.edu 2 / 28 ian Optimal Naïve Nets Might have reasons (domain information) to favor some hypotheses/predictions

More information

Lecture 6: Graphical Models

Lecture 6: Graphical Models Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

LEARNING WITH BAYESIAN NETWORKS

LEARNING WITH BAYESIAN NETWORKS LEARNING WITH BAYESIAN NETWORKS Author: David Heckerman Presented by: Dilan Kiley Adapted from slides by: Yan Zhang - 2006, Jeremy Gould 2013, Chip Galusha -2014 Jeremy Gould 2013Chip Galus May 6th, 2016

More information

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to

More information

Inference in Bayesian Networks

Inference in Bayesian Networks Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)

More information

Uncertainty and knowledge. Uncertainty and knowledge. Reasoning with uncertainty. Notes

Uncertainty and knowledge. Uncertainty and knowledge. Reasoning with uncertainty. Notes Approximate reasoning Uncertainty and knowledge Introduction All knowledge representation formalism and problem solving mechanisms that we have seen until now are based on the following assumptions: All

More information

Recall from last time: Conditional probabilities. Lecture 2: Belief (Bayesian) networks. Bayes ball. Example (continued) Example: Inference problem

Recall from last time: Conditional probabilities. Lecture 2: Belief (Bayesian) networks. Bayes ball. Example (continued) Example: Inference problem Recall from last time: Conditional probabilities Our probabilistic models will compute and manipulate conditional probabilities. Given two random variables X, Y, we denote by Lecture 2: Belief (Bayesian)

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 4771 Instructor: Tony Jebara Topic 16 Undirected Graphs Undirected Separation Inferring Marginals & Conditionals Moralization Junction Trees Triangulation Undirected Graphs Separation

More information

COS402- Artificial Intelligence Fall Lecture 10: Bayesian Networks & Exact Inference

COS402- Artificial Intelligence Fall Lecture 10: Bayesian Networks & Exact Inference COS402- Artificial Intelligence Fall 2015 Lecture 10: Bayesian Networks & Exact Inference Outline Logical inference and probabilistic inference Independence and conditional independence Bayes Nets Semantics

More information

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1 Homework 7: HMMs Reminders

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

An Empirical-Bayes Score for Discrete Bayesian Networks

An Empirical-Bayes Score for Discrete Bayesian Networks An Empirical-Bayes Score for Discrete Bayesian Networks scutari@stats.ox.ac.uk Department of Statistics September 8, 2016 Bayesian Network Structure Learning Learning a BN B = (G, Θ) from a data set D

More information

Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence

Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence General overview Introduction Directed acyclic graphs (DAGs) and conditional independence DAGs and causal effects

More information

Lecture 17: May 29, 2002

Lecture 17: May 29, 2002 EE596 Pat. Recog. II: Introduction to Graphical Models University of Washington Spring 2000 Dept. of Electrical Engineering Lecture 17: May 29, 2002 Lecturer: Jeff ilmes Scribe: Kurt Partridge, Salvador

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Bayesian Networks Representation and Reasoning

Bayesian Networks Representation and Reasoning ayesian Networks Representation and Reasoning Marco F. Ramoni hildren s Hospital Informatics Program Harvard Medical School (2003) Harvard-MIT ivision of Health Sciences and Technology HST.951J: Medical

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information