12735: Urban Systems Modeling
Lec. 09: Bayesian Networks
instructor: Matteo Pozzi
[Figure: example Bayesian network with nodes $x_1, \dots, x_9$]
outline
- example of applications
- how to shape a problem as a BN
- complexity of the inference problem
- inference via variable elimination
- inference via junction tree
- MCMC approximate inference
intro on Bayesian Networks
Random variables are nodes; links define conditional dependence/independence. Example: magnitude $M$ → seismic intensity $I$ → damage $D$.
Discrete variables, $N$ possible values for each var.
Joint probability: $P(M, I, D)$. Chain rule (product rule): $P(M, I, D) = P(M)\, P(I \mid M)\, P(D \mid I)$, with $P(M)$ a 1-d table and $P(I \mid M)$, $P(D \mid I)$ 2-d tables.
Each variable is defined by a table with a number of dimensions equal to the number of its parents plus one.
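A minimal sketch of how these tables can be stored as arrays, one dimension per parent plus one for the variable itself; the numerical values are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical CPTs for the chain M -> I -> D, each variable with 3 states.
P_M = np.array([0.7, 0.2, 0.1])             # P(M): 1-d table (no parents)
P_I_given_M = np.array([[0.8, 0.15, 0.05],  # P(I | M): rows indexed by M
                        [0.3, 0.50, 0.20],
                        [0.1, 0.30, 0.60]])
P_D_given_I = np.array([[0.9, 0.08, 0.02],  # P(D | I): rows indexed by I
                        [0.5, 0.30, 0.20],
                        [0.1, 0.40, 0.50]])

# Each row of a conditional table must sum to one.
assert np.allclose(P_I_given_M.sum(axis=1), 1.0)
assert np.allclose(P_D_given_I.sum(axis=1), 1.0)

# Joint probability via the chain rule: P(M, I, D) = P(M) P(I|M) P(D|I).
joint = P_M[:, None, None] * P_I_given_M[:, :, None] * P_D_given_I[None, :, :]
assert np.isclose(joint.sum(), 1.0)
```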
example of Bayesian network
Set of random variables, defined by conditional independence: scenario $x_1$, material $x_2$, load $x_3$, stiffness $x_4$, strength $x_5$, stress $x_6$, demand $x_7$, damage $x_8$, loss $x_9$.
Roots (nodes with no parents) are defined by their marginal probability: $P(x_i)$.
Children are defined by their conditional probability given their parents: $P(x_i \mid \mathrm{pa}(x_i))$.
Joint probability: $P(x_1, \dots, x_9) = \prod_{i=1}^{9} P(x_i \mid \mathrm{pa}(x_i))$.
Tasks: prediction, conditional prediction.
applications
- integrated risk analysis
- predicting global warming
- predicting effects of natural hazards
- road construction time
- time models: degrading systems, e.g. due to fatigue (HMM)
- time models: vibration of structures (Kalman filter)
example of 2 vars. BN
Magnitude $M$ and seismic intensity $I$: discrete variables, $N$ possible values for each variable.
Joint probability $P(M, I)$: an $N \times N$ table with $N^2 - 1$ degrees of freedom (dofs). With link $M \to I$: fully connected, or complete graph.
Chain rule (product rule): $P(M, I) = P(M)\, P(I \mid M)$. $P(M)$: 1-d table, $N - 1$ dofs; $P(I \mid M)$: 2-d table, $N(N - 1)$ dofs; total: $(N - 1) + N(N - 1) = N^2 - 1$ dofs.
If $M$ and $I$ are independent (no link): $P(M, I) = P(M)\, P(I)$: $2(N - 1)$ dofs.
This reduced graph is less powerful than the complete one: it can represent only joint probabilities satisfying $P(M, I) = P(M)\, P(I)$. However, inference is much easier for this graph.
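As a worked example, with $N = 3$: the complete graph needs $3^2 - 1 = 8$ dofs, while the independent graph needs only $2(3 - 1) = 4$; the saving grows rapidly with $N$.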
Independence [from lec. 2]
If $X$ and $Y$ are independent, the joint prob. is no richer than the set of marginal probs., since $P(X, Y) = P(X)\, P(Y)$:

P(X,Y)    Y1     Y2     Y3     P(X)
X1        2%     5%     3%     10%
X2       12%    30%    18%     60%
X3        6%    15%     9%     30%
P(Y)     20%    50%    30%    100%

If $X$ and $Y$ are dependent, the joint prob. is richer than the set of marginal probs.:

P(X,Y)    Y1     Y2     Y3     P(X)
X1        2%     5%     3%     10%
X2        3%    30%    27%     60%
X3       15%    15%     0%     30%
P(Y)     20%    50%    30%    100%

[Figure: 3-d bar plots of $P(X,Y)$ for the independent and dependent cases]
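A one-line check of the first table, with rows indexing $X$ and columns indexing $Y$: independence means the joint table is just the outer product of the two marginals.

```python
import numpy as np

P_X = np.array([0.10, 0.60, 0.30])
P_Y = np.array([0.20, 0.50, 0.30])

# Independent case: the 3x3 joint carries no information beyond the marginals.
P_XY = np.outer(P_X, P_Y)
print(P_XY)  # matches the first table: [[0.02 0.05 0.03], [0.12 0.30 0.18], ...]
```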
example of 3 vars. BN
Magnitude $M$, seismic intensity $I$, damage $D$: discrete variables, $N$ possible values for each var.
Complete graph. Joint probability $P(M, I, D)$: table with $N^3 - 1$ dofs. Chain rule (product rule): $P(M, I, D) = P(M)\, P(I \mid M)\, P(D \mid M, I)$.
If $P(D \mid M, I) = P(D \mid I)$: conditional independence. After observing the intensity, any additional information on the magnitude is irrelevant for inferring the damage.
Chain graph $M \to I \to D$: $(N - 1) + 2N(N - 1) = (2N + 1)(N - 1)$ dofs.
chain graph for n vars
Complete graph: $P(x_1, \dots, x_n)$: table with $N^n - 1$ dofs.
Chain graph: if $P(x_k \mid x_1, \dots, x_{k-1}) = P(x_k \mid x_{k-1})$, then $P(x_1, \dots, x_n) = P(x_1) \prod_{k=2}^{n} P(x_k \mid x_{k-1})$: $(N - 1) + (n - 1)N(N - 1)$ dofs.
The chain graph is less powerful, but much easier to handle.
[Figure: number of dofs vs. $n$, for $N = 10$, log scale: the complete graph grows exponentially with $n$, the chain graph only linearly]
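Worked numbers matching the figure, for $N = 10$ and $n = 9$: the complete graph needs $10^9 - 1$ dofs, the chain graph only $9 + 8 \cdot 10 \cdot 9 = 729$.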
prediction by variable elimination
Chain $M \to I \to D$ (magnitude, seismic intensity, damage).
Brute force: build the joint probability $P(M, I, D) = P(M)\, P(I \mid M)\, P(D \mid I)$ (a 3-d table) and derive any marginal by marginalization, e.g. $P(I) = \sum_M \sum_D P(M, I, D)$: you can derive everything from the joint prob.
By variable elimination: $P(I) = \sum_M P(M)\, P(I \mid M)$: we can derive $P(I)$ without handling any 3-d table, only handling 1-d and 2-d tables (a vector-matrix product).
prediction by variable elimination [cont.]
Similarly for the damage: $P(D) = \sum_I \sum_M P(M)\, P(I \mid M)\, P(D \mid I) = \sum_I P(I)\, P(D \mid I)$.
Eliminating $M$ first gives the 1-d table $P(I)$; eliminating $I$ next gives $P(D)$: again only vector-matrix products, never a 3-d table.
inference by variable elimination
Chain $M \to I \to D$, with damage observed: $D = d$.
Brute force: build the joint probability $P(M, I, D)$ (3-d table) and condition by marginalization and normalization: $P(I \mid d) = P(I, d) / P(d)$.
By variable elimination: $P(I, d) = \sum_M P(M)\, P(I \mid M)\, P(d \mid I)$; normalization: $P(I \mid d) = P(I, d) \,/\, \sum_{I} P(I, d)$.
inference by variable elimination [cont.]
Posterior of the magnitude: $P(M, d) = P(M) \sum_I P(I \mid M)\, P(d \mid I)$; normalization: $P(M \mid d) = P(M, d) \,/\, \sum_{M} P(M, d)$.
Again, only 1-d and 2-d tables are handled; see the sketch below.
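A minimal sketch of these eliminations with numpy, reusing the hypothetical CPTs `P_M`, `P_I_given_M`, `P_D_given_I` defined earlier: prediction is a sequence of vector-matrix products, and inference adds a normalization.

```python
# Prediction: eliminate M, then I (vector-matrix products).
P_I = P_M @ P_I_given_M             # P(I) = sum_M P(M) P(I|M)
P_D = P_I @ P_D_given_I             # P(D) = sum_I P(I) P(D|I)

# Inference: condition on an observed damage state d, then normalize.
d = 2                               # observed damage level (index)
lik_I = P_D_given_I[:, d]           # P(d | I): one column of the CPT
P_I_post = P_I * lik_I / (P_I @ lik_I)   # P(I | d)
P_M_post = P_M * (P_I_given_M @ lik_I)   # proportional to P(M, d)
P_M_post /= P_M_post.sum()               # P(M | d)
```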
best order of elimination
Sub-graph: load $x_3$, stiffness $x_4$, strength $x_5$ → stress $x_6$ → damage $x_8$, with CPTs $P(x_3)$, $P(x_4)$, $P(x_5)$, $P(x_6 \mid x_3, x_4, x_5)$, $P(x_8 \mid x_6)$.
The efficiency of the algorithm depends on the order in which variables are eliminated. By selecting an inappropriate order, you may increase the dimension of the Conditional Probability Tables (CPTs). E.g., for predicting the damage $x_8$, it is not efficient to eliminate the stress $x_6$ first: $\sum_{x_6} P(x_6 \mid x_3, x_4, x_5)\, P(x_8 \mid x_6)$ relates the damage directly to {load, stiffness, strength}, creating a new 4-d table over $\{x_3, x_4, x_5, x_8\}$.
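A sketch of the order effect on this sub-graph, with random placeholder CPTs and a hypothetical $N = 10$ states per variable: both orders give the same $P(x_8)$, but the bad order materializes an $N^4$-entry intermediate table.

```python
import numpy as np

N = 10  # hypothetical number of states per variable
P_x6 = np.random.dirichlet(np.ones(N), size=(N, N, N))  # P(x6 | x3, x4, x5)
P_x8 = np.random.dirichlet(np.ones(N), size=N)          # P(x8 | x6)
P_root = np.ones(N) / N                                 # P(x3)=P(x4)=P(x5), uniform

# Good order: eliminate the roots first; intermediate tables stay small.
P_6 = np.einsum('a,b,c,abcs->s', P_root, P_root, P_root, P_x6)  # 1-d table P(x6)
P_8_good = P_6 @ P_x8                                           # 1-d table P(x8)

# Bad order: eliminate x6 first; this creates a NEW 4-d table over {x3,x4,x5,x8}.
P_8_given_roots = np.einsum('abcs,sd->abcd', P_x6, P_x8)        # N^4 entries
P_8_bad = np.einsum('a,b,c,abcd->d', P_root, P_root, P_root, P_8_given_roots)

assert np.allclose(P_8_good, P_8_bad)
```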
branching graph
Seismic intensity $I$, damage on building 1 $D_1$, damage on building 2 $D_2$: $D_1 \leftarrow I \rightarrow D_2$.
Joint probability: $P(I, D_1, D_2) = P(I)\, P(D_1 \mid I)\, P(D_2 \mid I)$: a 3-d table, but for the modeling task no 3-d table is used (only 1-d and 2-d CPTs).
Prediction: $P(D_1) = \sum_I P(I)\, P(D_1 \mid I)$.
After observing $I = i$ (and possibly $D_2 = d_2$): $P(D_1 \mid i, d_2) = P(D_1 \mid i)$: a 1-d table; $D_2$ is irrelevant after $I$ is observed.
After observing $D_2 = d_2$ only: $P(D_1 \mid d_2) = \sum_I P(I \mid d_2)\, P(D_1 \mid I)$.
$D_1$ and $D_2$ are NOT independent while $I$ is not fixed; they are conditionally independent given $I$.
V graph
Load 1 $L_1$, load 2 $L_2$, damage $D$: $L_1 \rightarrow D \leftarrow L_2$.
Joint probability: $P(L_1, L_2, D) = P(L_1)\, P(L_2)\, P(D \mid L_1, L_2)$.
Prediction: $P(D) = \sum_{L_1} \sum_{L_2} P(L_1)\, P(L_2)\, P(D \mid L_1, L_2)$.
After observing $L_1 = l_1$: $P(D \mid l_1) = \sum_{L_2} P(L_2)\, P(D \mid l_1, L_2)$. The observation is irrelevant for the other root, as $L_1$ and $L_2$ are independent: $P(L_2 \mid l_1) = P(L_2)$.
V graph [cont.]
After observing $D = d$: $P(L_1 \mid d) \propto P(L_1) \sum_{L_2} P(L_2)\, P(d \mid L_1, L_2)$: the graph is used for building the likelihood of $L_1$.
After observing $D = d$ and $L_2 = l_2$: $P(L_1 \mid d, l_2) \propto P(L_1)\, P(d \mid L_1, l_2)$: the observed $l_2$ enters the likelihood.
Conditionally on $D$ (having observed the damage), variables $L_1$ and $L_2$ are NOT independent: this is an example of INDUCED DEPENDENCE (induced correlation); see the numerical sketch below.
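A minimal numerical sketch of induced dependence ("explaining away"), with two binary loads and hypothetical CPT values: observing a high damage raises the posterior of a high $L_1$, but additionally observing a high $L_2$ lowers it again, because $L_2$ already explains the damage.

```python
import numpy as np

# Hypothetical V graph: two independent binary loads, damage depends on both.
P_L1 = np.array([0.9, 0.1])     # P(L1): low / high
P_L2 = np.array([0.9, 0.1])     # P(L2)
P_D = np.array([[[0.99, 0.01],  # P(D | L1, L2), axes: L1, L2, D
                 [0.50, 0.50]],
                [[0.50, 0.50],
                 [0.05, 0.95]]])

joint = P_L1[:, None, None] * P_L2[None, :, None] * P_D

# Observe damage D = 1: the two loads become dependent (induced dependence).
post = joint[:, :, 1] / joint[:, :, 1].sum()     # P(L1, L2 | d)
P_L1_given_d = post.sum(axis=1)                  # P(L1 | d)
P_L1_given_d_l2 = post[:, 1] / post[:, 1].sum()  # P(L1 | d, L2 = high)

print(P_L1_given_d)      # high L1 has become more plausible (about 51%)
print(P_L1_given_d_l2)   # ...but drops (about 17%) once high L2 explains the damage
```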
inference via variable elimination and junction tree
Chain $M \to I \to D$. Target: $P(D)$. Method: eliminate $M$ to get $P(I)$, eliminate $I$ to get $P(D)$.
The variables to be eliminated depend on the specific query. If we are interested in more than one query, we may repeat some operations across queries. The Junction Tree is an algorithm to get the response to all possible queries without repeating operations: clique $\{M, I\}$, separator $\{I\}$, clique $\{I, D\}$, with the cliques holding the tables $P(M)\, P(I \mid M)$ and $P(D \mid I)$.
HMM revised
Hidden states $S_0, S_1, \dots, S_k, S_{k+1}, \dots, S_n$; observations $y_1, \dots, y_k, y_{k+1}, \dots, y_n$.
Task: compute $P(S_n \mid y_1, \dots, y_n)$. Method: eliminate $S_0$, process $y_1$; eliminate $S_1$, process $y_2$; ...; eliminate $S_{n-1}$, process $y_n$.
The prediction-correction algorithm is an application of a best elimination order.
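A minimal sketch of this recursion (the forward algorithm); the function and parameter names are illustrative, not from the lecture:

```python
import numpy as np

def forward(p0, T, E, obs):
    """Prediction-correction (forward) recursion for an HMM.

    p0:  initial distribution P(S_0)
    T:   transition matrix, T[i, j] = P(S_{k+1} = j | S_k = i)
    E:   emission matrix,   E[i, y] = P(y | S = i)
    obs: observed sequence y_1, ..., y_n (as indices)
    Returns the filtered distribution P(S_n | y_1, ..., y_n).
    """
    belief = np.asarray(p0, dtype=float)
    for y in obs:
        belief = belief @ T          # prediction: eliminate S_k
        belief = belief * E[:, y]    # correction: process y_{k+1}
        belief /= belief.sum()       # normalization
    return belief
```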
conditions for exact inference
Discrete variables: exact inference is possible, except for the curse of dimensionality.
Continuous variables: integrals instead of sums. Generally, integrals cannot be solved in closed form, but they can be solved for Gaussian Linear Models (GLMs).
Condition for a GLM: if vector $\mathbf{z}_i$ lists all parents of $x_i$, then $x_i = \mathbf{a}_i^\top \mathbf{z}_i + b_i + \varepsilon_i$, with $\varepsilon_i$ Gaussian noise.
GLMs are used for dynamic systems (Kalman filters). A GLM can be seen as a special case of a Gaussian process, with special independence relations (while Gaussian processes are complete graphs). Other problems can also be mapped into a GLM: for example, log-normal models can be mapped into GLMs by taking the log. Hybrid graphs have also been proposed, mixing discrete and continuous variables, by imposing some rules.
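A small sketch of the GLM node condition on a hypothetical 3-node V graph $x_1 \to x_3 \leftarrow x_2$, with made-up coefficients; since every node is linear in its parents with Gaussian noise, the joint is exactly multivariate Gaussian and inference stays closed-form:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Roots: Gaussian marginals. Child: x3 = a^T [x1, x2] + b + eps.
x1 = rng.normal(0.0, 1.0, n)
x2 = rng.normal(1.0, 0.5, n)
x3 = 0.8 * x1 - 0.3 * x2 + 0.1 + rng.normal(0.0, 0.2, n)

# The joint (x1, x2, x3) is multivariate Gaussian: mean vector and
# covariance matrix fully characterize it, so all conditionals are Gaussian too.
print(np.cov(np.vstack([x1, x2, x3])))
```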
approximate inference
MC: sequential sampling. We start by sampling roots from their marginals, then each other variable conditional on its (sampled) parents. After observing any variable, we can reject samples that are not compatible with the observations, or use importance sampling.
MCMC: Gibbs sampling. We sample variables one at a time, each conditional on the other variables in its Markov blanket (kept fixed). It is an application of the Metropolis algorithm with a special proposal distribution.
(Russell and Norvig 2010; Barber 2012)
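A minimal Gibbs-sampling sketch for the chain $M \to I \to D$ with $D = d$ observed, reusing the hypothetical CPT arrays from the earlier sketch; each step resamples one variable conditional on its Markov blanket ($M$ given $I$; $I$ given $M$ and $d$):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_posterior_M(P_M, P_I_given_M, P_D_given_I, d, n_steps=20_000):
    """Approximate P(M | d) by Gibbs sampling on the chain M -> I -> D."""
    m, i = 0, 0                     # arbitrary initial states
    counts = np.zeros(len(P_M))
    for _ in range(n_steps):
        p_m = P_M * P_I_given_M[:, i]                 # P(M | i), up to a constant
        m = rng.choice(len(p_m), p=p_m / p_m.sum())
        p_i = P_I_given_M[m, :] * P_D_given_I[:, d]   # P(I | m, d), up to a constant
        i = rng.choice(len(p_i), p=p_i / p_i.sum())
        counts[m] += 1
    return counts / n_steps         # empirical frequencies approximate P(M | d)
```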
summary
Inference and prediction in Bayesian Networks can be done in three steps:
i) compute the joint probability: $P(x_1, \dots, x_n) = \prod_i P(x_i \mid \mathrm{pa}(x_i))$;
ii) compute the conditional distribution given the observations $\mathbf{y}$, by normalization: $P(\mathbf{x} \mid \mathbf{y}) = P(\mathbf{x}, \mathbf{y}) / P(\mathbf{y})$;
iii) marginalize on the variables of interest: $P(x_q \mid \mathbf{y}) = \sum_{\mathbf{x} \setminus x_q} P(\mathbf{x} \mid \mathbf{y})$.
All exact and approximate methods are ways to overcome the computational difficulties of this brute-force approach.
HMM with dummy algorithm
Task: compute $P(S_n \mid y_{1:n})$:
i) compute the joint probability $P(S_{0:n}, y_{1:n})$: a huge table/function: it is not an effective path;
ii) compute the conditional distribution: $P(S_{0:n} \mid y_{1:n}) = P(S_{0:n}, y_{1:n}) / P(y_{1:n})$;
iii) marginalize on the variables of interest: $P(S_n \mid y_{1:n}) = \sum_{S_{0:n-1}} P(S_{0:n} \mid y_{1:n})$.
references
Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press. Downloadable from http://web4.cs.ucl.ac.uk/staff/d.barber/pmwiki/pmwiki.php?n=brml.homepage
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Russell, S. and P. Norvig. (2010). Artificial Intelligence: A Modern Approach. Pearson Education.