Bayesian networks. 鮑興國 Ph.D. National Taiwan University of Science and Technology


1 Bayesian networks 鮑興國 Ph.D. National Taiwan University of Science and Technology

2 Outline Introduction to Bayesian networks Bayesian networks: definition, d-separation, equivalence of networks Examples: causal graphs, explaining away, Markov chains, Naïve Bayes, etc. Undirected models Probabilistic inference: node elimination, junction tree Building the networks

3 Probabilistic Graphical Models Combination of graph theory and probability theory. Informally, the graph structure specifies which parts of the system are directly dependent, and local functions at each node specify how the parts interact. More formally, the graph encodes conditional independence assumptions, and the local functions at each node are factors in the joint probability distribution. E.g.: Bayesian networks = PGMs based on directed acyclic graphs. E.g.: Markov networks (Markov random fields) = PGMs with undirected graphs. Others: discrete models, continuous models, Gaussian graphical models, hybrid Bayesian networks, CG models, etc.

4 Applications of PGMs Machine learning, statistics, speech recognition, natural language processing, computer vision, error-control codes, bioinformatics, medical diagnosis, financial predictions, etc.

5 Why Probability? Why probabilistic graphical models? Why can a probabilistic approach help us in machine learning? Guess 1: the world is nondeterministic; or Guess 2: the world is deterministic, but we lack the relevant facts/attributes to decide the answer. A probabilistic model can make inferences about missing inputs; classification is one example (the missing value is y!). Either way, the probabilistic approach can make decisions that minimize expected loss.

6 Probabilistic Approach Classification and regression: conditional density estimation P(Y | X), then apply Bayes' rule. Unsupervised learning: density estimation P(X). Imagining Y as a latent variable, classification is no more than missing-value filling. [Figure: a data matrix with rows 1..N, feature columns x_1..x_d and x_{d+1}..x_{d+k}, and a label column y to be filled in.] Step 1: building the model; step 2: inference.

7 Why Graphical Models? We discuss this issue later on. Basically, graphical models help us deal with complicated problems in a modularized and visualized way. That is, we break problems into subproblems recursively until we can solve them.

8 Welcome to Hotel California! Nodes: Cloudy (C), Sprinkler (S), Rain (R), Wet grass (W), Roof (F). P(C, S, R, W, F) = P(C) P(S|C) P(R|C, S) P(W|C, S, R) P(F|C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S, R) P(F|R). Conditional Probability Table (CPT) for W: P(W|R, S) = 0.95, P(W|R, ~S) = 0.90, P(W|~R, S) = 0.90, P(W|~R, ~S) = 0.10. The simplifications used above: P(R|C, S) = P(R|C), P(W|C, S, R) = P(W|S, R), P(F|C, S, R, W) = P(F|R).

9 Joint Probability Goal 1: represent a joint distribution P(X = x) = P(X_1 = x_1, ..., X_n = x_n) compactly, even when there are many variables. Goal 2: efficiently calculate marginals and conditionals of such compactly represented joint distributions. For n discrete variables of arity k, the naïve table representation is HUGE: it requires k^n entries. We need to make some assumptions about the distribution. One simple assumption: independence = complete factorization, P(X) = Π_i P(X_i). But the independence assumption is too restrictive, so we make conditional independence assumptions instead.

10 Conditional Independence Notation: X_A ⊥ X_B | X_C. Definition: two sets of variables X_A and X_B are conditionally independent given a third, X_C, if P(X_A, X_B | X_C) = P(X_A | X_C) P(X_B | X_C) for all X_C, which is equivalent to saying P(X_A | X_B, X_C) = P(X_A | X_C) for all X_C. Only a subset of all distributions respect any given (nontrivial) conditional independence statement. The subset of distributions that respect all the CI assumptions we make is the family of distributions consistent with our assumptions. Bayesian networks (probabilistic graphical models) are a powerful, elegant and simple way to specify such a family.

11 Between Simple and Complex Models Given n r.v.'s X_1, X_2, ..., X_n with state space {1, 2, 3}, consider the degree of freedom (or complexity) of the model needed to describe the joint probabilities. Most general: df = 3^n − 1 (not efficient in time and space). Bayesian networks lie in between. Independent: df = 2n (P(X_i = 1), P(X_i = 2) for each i). i.i.d.: df = 2 (P(X = 1), P(X = 2)).

12 Bayesian (Belief) Networks Bayesian networks (graphical models) are an intermediate approach: the i.i.d. assumption is too restrictive, while the most general case is ineffective. BNs (GMs) represent large joint distributions compactly using a set of local relationships specified by a graph. The joint probability can then be factorized into such local relationships. A BN describes conditional independence among subsets of variables; conditional probabilities yield joint probabilities via Bayes' theorem. Bayesian networks (directed models) vs. Markov random fields (MRFs, undirected models): belief networks use conditional probabilities, while Markov random fields use potential functions.

13 Part I Bayesian Networks: Introduction to Directed Models and Undirected Models

14 Outline Introduction to Bayesian networks Bayesian networks: definition, d-separation, equivalence of networks Examples: causal graphs, explaining away, Markov chains, Naïve Bayes, etc. Undirected models Probabilistic inference: node elimination, junction tree Building the networks

15 Bayesian Networks A.K.A. belief network, directed graphical model. Definition: a set of nodes representing random variables and a set of directed edges between variables; each variable has a finite set of mutually exclusive states; to each variable X with parents Y_1, Y_2, ..., Y_n there is attached the conditional probability table P(X | Y_1, Y_2, ..., Y_n).

16 Bayesian Networks (cont.) Informally, edges represent causation. The variables together with the directed edges form a Directed Acyclic Graph (DAG): no cycles allowed. X → Y: P(X, Y) = P(X) P(Y|X); X ← Y: P(X, Y) = P(Y) P(X|Y). [Figure: example DAGs over nodes X_1, X_2, X_3, U, V, W, Z; a graph with a directed cycle is not allowed.]

17 Factorization Consider directed acyclic graphs over n variables. Each node has a (possibly empty) set of parents X_{π_i}. Each node maintains a function f_i(X_i; X_{π_i}) such that f_i > 0 and Σ_{x_i} f_i(X_i = x_i; X_{π_i}) = 1 for every value of X_{π_i}. Define the joint probability to be P(X_1, X_2, ..., X_n) = Π_i f_i(X_i; X_{π_i}). Even with no further restriction on the f_i, it is always true that f_i(X_i; X_{π_i}) = P(X_i | X_{π_i}), so we will just write P(X_1, X_2, ..., X_n) = Π_i P(X_i | X_{π_i}). Factorization of the joint in terms of local conditional probabilities: exponential in the fan-in of each node instead of in the total number of variables n.

18 So by the Alternative Definition For the six-node network (X_1 → X_2, X_1 → X_3, X_2 → X_4, X_3 → X_5, X_2 → X_6, X_5 → X_6):
P(X_1, X_2, X_3, X_4, X_5, X_6) = Π_i f_i(X_i; X_{π_i}) = P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2) P(X_5|X_3) P(X_6|X_2, X_5).
P(X_2, X_5, X_6) = Σ_{X_1, X_3, X_4} f_1(X_1) f_2(X_2; X_1) f_3(X_3; X_1) f_4(X_4; X_2) f_5(X_5; X_3) f_6(X_6; X_2, X_5) = f_6(X_6; X_2, X_5) Σ_{X_1, X_3, X_4} f_1(X_1) f_2(X_2; X_1) f_3(X_3; X_1) f_4(X_4; X_2) f_5(X_5; X_3).
P(X_2, X_5) = Σ_{X_6} Σ_{X_1, X_3, X_4} f_1(X_1) f_2(X_2; X_1) f_3(X_3; X_1) f_4(X_4; X_2) f_5(X_5; X_3) f_6(X_6; X_2, X_5) = Σ_{X_1, X_3, X_4} f_1(X_1) f_2(X_2; X_1) f_3(X_3; X_1) f_4(X_4; X_2) f_5(X_5; X_3).
Therefore P(X_6 | X_2, X_5) = P(X_2, X_5, X_6) / P(X_2, X_5) = f_6(X_6; X_2, X_5).
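A minimal Python sketch of this check, assuming binary variables and random CPTs (the graph structure is the six-node example above; everything else here is illustrative):

import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_parents, k=2):
    # Random conditional table P(child | parents): shape (k,)*n_parents + (k,), normalized over the last axis
    t = rng.random((k,) * (n_parents + 1))
    return t / t.sum(axis=-1, keepdims=True)

# CPTs for the six-node example: X1; X2|X1; X3|X1; X4|X2; X5|X3; X6|X2,X5 (states 0/1)
f1, f2, f3 = random_cpt(0), random_cpt(1), random_cpt(1)
f4, f5, f6 = random_cpt(1), random_cpt(1), random_cpt(2)

def joint(x1, x2, x3, x4, x5, x6):
    return (f1[x1] * f2[x1, x2] * f3[x1, x3] *
            f4[x2, x4] * f5[x3, x5] * f6[x2, x5, x6])

states = list(itertools.product([0, 1], repeat=6))
print("sum of joint:", sum(joint(*s) for s in states))   # ~1.0

# Verify P(X6 | X2, X5) recovered from the joint equals f6, e.g. at x2=1, x5=0, x6=1
num = sum(joint(*s) for s in states if (s[1], s[4], s[5]) == (1, 0, 1))
den = sum(joint(*s) for s in states if (s[1], s[4]) == (1, 0))
print(num / den, "==", f6[1, 0, 1])

The printed ratio agrees with f_6(X_6; X_2, X_5) exactly (up to floating-point error), which is the content of the derivation above.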

19 Conditional Independence in DAGs If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution: {X_i ⊥ X_{ν_i} | X_{π_i}} for all i, where X_{ν_i} are the nodes coming before X_i that are not its parents. In other words, the DAG is telling us that each variable is conditionally independent of its nondescendants given its parents; this is called the local Markov property. Such an ordering is called a topological ordering.

20 From the Definition (again!) Prove: X_4 ⊥ X_1 | X_2.
P(X_[1..4]) = P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2) Σ_{X_5} P(X_5|X_3) Σ_{X_6} P(X_6|X_2, X_5) = P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2).
P(X_[1..3]) = P(X_1) P(X_2|X_1) P(X_3|X_1) Σ_{X_4} P(X_4|X_2) Σ_{X_5} P(X_5|X_3) Σ_{X_6} P(X_6|X_2, X_5) = P(X_1) P(X_2|X_1) P(X_3|X_1).
So P(X_4 | X_[1..3]) = P(X_4|X_2), and P(X_4 | X_1, X_2) = P(X_4|X_2) (why?).

21 Saving in Time & Space Conditional independence in the six-node network, e.g.: X_4 ⊥ X_1 | X_2; X_5 ⊥ X_1 | X_3. The idea can be extended to sets, e.g.: X_6 ⊥ X_3 | {X_2, X_5}; {X_5, X_6} ⊥ X_1 | {X_2, X_3}. By the chain rule and these conditional independencies, P(X_[1..6]) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) P(X_4|X_[1..3]) P(X_5|X_[1..4]) P(X_6|X_[1..5]) = P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2) P(X_5|X_3) P(X_6|X_2, X_5).

22 Saving in Time & Space (cont.) P(X_1, X_2, X_3, X_4, X_5, X_6) = P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2) P(X_5|X_3) P(X_6|X_2, X_5). Missing edges imply conditional independence. Each entry/cell of a table represents one conditional probability combination. The biggest table is the bottleneck of the computation.

23 Example: Causal Graphs Rain → Wet grass, with P(R) = 0.4, P(W|R) = 0.9, P(W|~R) = 0.2. P(R|W) = P(W|R) P(R) / P(W) = 0.9 × 0.4 / (0.9 × 0.4 + 0.2 × 0.6) = 0.75. In general, fewer values need to be specified for the whole distribution than in the unconstrained case: O(n 2^k) instead of O(2^n), where k is the maximum fan-in of a node.

24 Example: Complete Graph P(X_1, X_2, X_3, X_4, X_5) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) P(X_4|X_1, X_2, X_3) P(X_5|X_1, X_2, X_3, X_4). A network represents a relationship between r.v.'s; however, the representation of this relationship is not unique; e.g., different orderings of the X_i can produce different networks.

25 Example: Markov Chain X_1 → X_2 → X_3 → X_4 → X_5. P(X) = P(X_1, X_2, ..., X_n) = P(X_n | X_{n−1}, ..., X_1) P(X_{n−1} | X_{n−2}, ..., X_1) ··· P(X_2 | X_1) P(X_1) = P(X_n | X_{n−1}) P(X_{n−1} | X_{n−2}) ··· P(X_2 | X_1) P(X_1) = P(X_1) Π_{i=2}^{n} P(X_i | X_{i−1}).

26 Naïve Bayes What is the connection between a BN and classification? Suppose one of the variables is the target variable. Can we compute the probability of the target variable given the other variables? In Naïve Bayes, the concept C_j is the parent of X_1, X_2, ..., X_n. The edge directions are not the same as the direction of classification! P(X_1, X_2, ..., X_n, C_j) = P(C_j) P(X_1|C_j) P(X_2|C_j) ··· P(X_n|C_j).

27 Explaining Away Sprinkler (S) and Rain (R) are both parents of Wet grass (W), with P(S) = 0.2, P(R) = 0.4, and P(W|R, S) = 0.95, P(W|R, ~S) = 0.90, P(W|~R, S) = 0.90, P(W|~R, ~S) = 0.10. P(S|W) = P(W|S) P(S) / P(W) = 0.92 × 0.2 / 0.52 ≈ 0.35. P(S|R, W) = P(W|R, S) P(S|R) / P(W|R) = P(W|R, S) P(S) / P(W|R) ≈ 0.21 < P(S|W). Knowing it rained, the probability that the sprinkler is on (as the cause of the wet grass) decreases, given that the grass is wet. Berkson's paradox!
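A minimal sketch in Python that reproduces these numbers by brute-force enumeration of the joint (using the CPT values on this slide, with S and R marginally independent):

from itertools import product

P_S = {1: 0.2, 0: 0.8}
P_R = {1: 0.4, 0: 0.6}
P_W = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.10}  # P(W=1 | r, s)

def joint(s, r, w):
    pw = P_W[(r, s)]
    return P_S[s] * P_R[r] * (pw if w == 1 else 1.0 - pw)

def query(target, evidence):
    # P(target = 1 | evidence) by summing the full joint table
    num = den = 0.0
    for s, r, w in product([0, 1], repeat=3):
        x = {"S": s, "R": r, "W": w}
        if any(x[k] != v for k, v in evidence.items()):
            continue
        p = joint(s, r, w)
        den += p
        if x[target] == 1:
            num += p
    return num / den

print(query("S", {"W": 1}))           # ~0.35
print(query("S", {"W": 1, "R": 1}))   # ~0.21  (explaining away)
print(query("R", {"W": 1}))           # ~0.70

Conditioning on R = 1 lowers the posterior on S from about 0.35 to about 0.21, which is exactly the explaining-away effect described above.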

28 Directed Separation: Definition Definition (d-separation): a path p is said to be d-separated (or blocked) if and only if p contains a chain X → Y → Z or a fork X ← Y → Z and Y is instantiated (shaded), or p contains an inverted fork (collider) X → Y ← Z and neither Y nor any descendant of Y has received evidence.

29 Directed Separation: Definition (cont.) A set A is said to be d-separated from a set B if every path from a node in A to a node in B is d-separated. (A and B are connected if they are not separated!) A set A is said to be d-separated from a set B given a set C if every path from a node in A to a node in B is d-separated, given that all nodes in C are instantiated. (A and B are connected given C if they are not separated given C!)

30 D-separation and Conditional Independence G = set of C.I. rules = family of distributions. Probabilistic implications of d-separation: if sets A and B are d-separated given C in a DAG G, then A is independent of B conditional on C in every distribution compatible with G. Conversely, if A and B are not d-separated given C in a DAG G, then A and B are dependent conditional on C in at least one distribution compatible with G.

31 D-separation: Example I [Figure: two example DAGs, one over X, Y, Z_1, Z_2, Z_3 and one over W, R, S, T, X, Y, Z.] X and Y are d-separated given Z_1 and d-connected given Z_2. X and R are d-separated by {Y, Z_2}. X and T are d-separated by {Y, Z_2}. W and T are d-separated by {R}. W and X are not d-separated by Y. W and X are d-separated by the empty set.

32 D-separation: Example II Is X d-separated from Y? [Figure: a path from X to Y with evidence candidates e_1, e_2, e_3, e_4, e_5.] d-separated: given no evidence, or e_1, or e_3, or only e_4. d-connected: given only e_2 or only e_5 (provided the rest of the path is connected).

33 Bayes Bouncing Ball Rules A is d-separated from B given C if we cannot send a ball from any node in A to any node in B according to the rules below, where shaded nodes are in C. Case 1: chain X → Y → Z (the ball passes through an unshaded Y, bounces back from a shaded Y). Case 2: fork X ← Y → Z (the ball passes through an unshaded Y, bounces back from a shaded Y). Case 3: collider X → Y ← Z (the ball bounces back at an unshaded Y, passes through a shaded Y).

34 Bayes Ball Rules (cont.) Boundary conditions: Case 1: X → Y; Case 2: X ← Y (not really necessary). [Figure: the ball at boundary nodes, with X as an ancestor or a descendant of Y.]

35 Undirected Graphical Models A.K.A. Markov Random Fields, Markov networks. Also graphs with one node per random variable and edges that connect pairs of nodes, but now the edges are undirected. Semantics: every node set is conditionally independent from its non-neighbours given its neighbours, i.e. X_A ⊥ X_C | X_B if every path between X_A and X_C goes through X_B. Can model symmetric interactions that directed models cannot!

36 Simple Graph Separation In undirected models, simple graph separation (as opposed to d-separation) tells us about conditional independencies: X_A ⊥ X_C | X_B if every path between X_A and X_C is blocked by some node in X_B. Markov Ball algorithm: remove X_B and see if there is any path from X_A to X_C.
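A small sketch of the Markov ball idea in Python: delete the conditioning nodes and search for any remaining path (the edge list and node labels are made up for illustration):

from collections import deque

def separated(edges, A, C, B):
    # Check X_A ⊥ X_C | X_B by graph separation: remove the nodes in B,
    # then look for any remaining path from A to C (breadth-first search).
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set(A) - set(B)
    frontier = deque(seen)
    while frontier:
        u = frontier.popleft()
        if u in C:
            return False              # found a path that avoids B
        for v in adj.get(u, ()):
            if v not in B and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Chain 1 - 2 - 3 - 4: nodes 1 and 4 are separated by {3} but not by the empty set
edges = [(1, 2), (2, 3), (3, 4)]
print(separated(edges, A={1}, C={4}, B={3}))    # True
print(separated(edges, A={1}, C={4}, B=set()))  # False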

37 Conditional Parameterization? In directed models, we started with p(x) = Π_i p(x_i | x_{π_i}) and we derived the d-separation semantics from that. Undirected models: we have the semantics, we need a parameterization. What about this conditional parameterization? p(x) = Π_i p(x_i | x_{neighbors(i)}). Good: product of local functions. Good: each one has a simple conditional interpretation. Bad: the local functions cannot be arbitrary, but must agree properly in order to define a valid distribution.

38 Marginal Parameterization? OK, in directed models we had p(x) = Π_i p(x_i | x_{π_i}); what about this marginal parameterization? p(x) = Π_i p(x_i, x_{neighbors(i)}). Good: product of local functions. Good: each one has a simple marginal interpretation. Bad: only very few pathological marginals on overlapping nodes can be multiplied to give a valid joint.

39 Interpretation of Clique Potentials The chain X — Y — Z implies X ⊥ Z | Y, i.e. P(X, Y, Z) = P(Y) P(X|Y) P(Z|Y). We can write this as P(X, Y, Z) = P(X, Y) P(Z|Y) = ψ_xy(X, Y) ψ_yz(Y, Z), or as P(X, Y, Z) = P(X|Y) P(Z, Y) = ψ_xy(X, Y) ψ_yz(Y, Z): we cannot have all potentials be marginals, and we cannot have all potentials be conditionals. The positive clique potentials can only be thought of as general compatibility, goodness or happiness functions over their variables, but not as probability distributions.

40 When Causality is Not Playing a Role! ψ: potential functions; X_P, X_Q, X_R: maximal cliques (X_P = {A, B, C, D}, X_Q = {D, E}, X_R = {E, F, G}). P(A, B, C, D, E, F, G) = P(A, B, C, D) P(E, F, G | A, B, C, D) = P(A, B, C, D) P(E, F, G | D) = P(A, B, C, D) P(E|D) P(F, G | D, E) = P(A, B, C, D) P(E|D) P(F, G | E) = ψ(A, B, C, D) ψ(D, E) ψ(E, F, G) = ψ(X_P) ψ(X_Q) ψ(X_R).

41 An Example of an Undirected Model Whatever factorization we pick, we know that only connected nodes can be arguments of a single local function. A clique is a fully connected subset of nodes. Thus, consider using a product of positive clique potentials: P(x) = (1/Z) Π_{cliques c} ψ_c(x_c), with partition function Z = Σ_x Π_{cliques c} ψ_c(x_c). The product of functions that don't need to agree with each other still factors in the way the graph semantics demand. Without loss of generality we can restrict ourselves to maximal cliques. (Why?) Example with potential functions: P(x_[1:5]) = (1/Z) ψ(x_1, x_2, x_3) ψ(x_3, x_4) ψ(x_3, x_5).

42 Partition Function (Normalizer) Z above is called the partition function. Computing the normalizer and its derivatives can often be the hardest part of inference and learning in undirected models. Often the factored structure of the distribution makes it possible to efficiently do the sums/integrals required to compute Z. We don't always have to compute Z, e.g. for conditional probabilities.
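A minimal Python sketch for the model on slide 41, assuming binary variables and arbitrary positive potentials, computing Z by brute force and showing that a conditional does not need Z:

import itertools
import numpy as np

rng = np.random.default_rng(1)

# P(x1..x5) = (1/Z) * psi123(x1,x2,x3) * psi34(x3,x4) * psi35(x3,x5)
psi123 = rng.random((2, 2, 2)) + 0.1
psi34  = rng.random((2, 2)) + 0.1
psi35  = rng.random((2, 2)) + 0.1

def unnormalized(x1, x2, x3, x4, x5):
    return psi123[x1, x2, x3] * psi34[x3, x4] * psi35[x3, x5]

# Brute-force partition function: sum over all 2^5 joint configurations
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=5))
print("Z =", Z)

# Conditionals don't need Z: P(x4 = 1 | x3 = 1) as a ratio of unnormalized sums
num = sum(unnormalized(x1, x2, 1, 1, x5)
          for x1, x2, x5 in itertools.product([0, 1], repeat=3))
den = sum(unnormalized(x1, x2, 1, x4, x5)
          for x1, x2, x4, x5 in itertools.product([0, 1], repeat=4))
print("P(x4=1 | x3=1) =", num / den)

The brute-force sum is exponential in the number of variables, which is exactly why the factored structure (and elimination, later) matters for computing Z.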

43 Directed vs. Undirected Models Directed models use conditional probabilities for each local substructure (called Bayesian networks); they may describe some distributions which cannot be described by undirected models; mainly used in A.I., diagnosis, decision making, etc. Undirected models use potential functions for each local substructure (called Markov random fields or Markov networks); they may describe some distributions which cannot be described by directed models; used for problems with little causal structure to guide the graph construction: image restoration, certain optimization problems, models of physical systems.

44 A Directed Model of 3 r.v.'s A general dist.: P(X, Y, Z) = P(X) P(Y|X) P(Z|X, Y). Chain (X → Y → Z): P(X, Y, Z) = P(X) P(Y|X) P(Z|Y). Explaining away (X → Z ← Y): P(X, Y, Z) = P(X) P(Y) P(Z|X, Y). Fork (X ← Y → Z): P(X, Y, Z) = P(Y) P(X|Y) P(Z|Y) (the same group as the chain). One link (Y → Z): P(X, Y, Z) = P(X) P(Y) P(Z|Y). All independent: P(X, Y, Z) = P(X) P(Y) P(Z).

45 An Undirected Model of 3 r.v.'s A general dist.: P(X, Y, Z) = P(X) P(Y|X) P(Z|X, Y) = ψ(X, Y, Z). One link (3 cases): P(X, Y, Z) = P(X) P(Y) P(Z|Y) = ψ(X) ψ(Y, Z). Two links (3 cases): P(X, Y, Z) = P(Y, Z|X) P(X) = P(X) P(Y|X) P(Z|X) = ψ(X, Y) ψ(X, Z). All independent: P(X, Y, Z) = P(X) P(Y) P(Z) = ψ(X) ψ(Y) ψ(Z).

46 Between Directed and Undirected Models Directed can't do it! The four-cycle X — W — Y — Z — X with X ⊥ Y | {W, Z} and W ⊥ Z | {X, Y}: a DAG must be acyclic, so it will have at least one V-structure and the Bayes ball goes through. Undirected can't do it! The V-structure X → Y ← Z with X ⊥ Z marginally, but X and Z dependent given Y: no undirected graph over X, Y, Z encodes exactly these independencies.

47 Probabilistic Graphical Models Graphical Models ∩ Probabilistic Models. Directed models: Bayesian belief nets (Alarm network), state-space models, HMMs, Naïve Bayes classifier, PCA / ICA. Undirected models: Markov nets (Markov Random Fields), Boltzmann machines, Ising models, max-ent models, log-linear models.

48 Part II Probabilistic Inference

49 Outline Introduction to Bayesian networks Bayesian networks: definition, d-separation, equivalence of networks Examples: causal graphs, explaining away, Markov chains, Naïve Bayes, etc. Undirected models Probabilistic inference: node elimination, junction tree Building the networks

50 Probabilistic Inference Task: infer the value of some X_i given the values of some other X_j in the network. Facts: with all other attributes known, it is easy; the general case is NP-hard (Cooper, 1990, by showing 3-SAT is no harder than probabilistic inference!); approximate inference is still NP-hard (Dagum & Luby, 1993); Monte Carlo methods (Pradhan & Dagum, 1996).

51 Probabilistic Inference II Partition the random variables in a domain X into three disjoint subsets X_E, X_F, X_R. The general probabilistic inference problem is to compute the posterior P(X_F | X_E) over the query nodes X_F. This involves conditioning on the evidence nodes X_E and integrating (summing) out the marginal nodes X_R. If the joint distribution is represented as a huge table, this is trivial: just select the appropriate indices in the columns corresponding to X_E based on their values, sum over the columns corresponding to X_R, and renormalize the resulting table over X_F.

52 Probabilistic Inference III If the joint is a known continuous function, this can sometimes be done analytically; e.g. Gaussian: eliminate the rows/columns corresponding to X_R and apply the conditioning formulas for P(X_F | X_E). But what if the joint distribution over X is represented by a directed or undirected graphical model?

53 Recall Bayes' Rule For simple models, we can derive the inference formulae by hand using Bayes' rule: p(y|x) = p(x|y) p(y) / p(x) = p(x|y) p(y) / Σ_{y'} p(x|y') p(y'). This is called "reversing the arrow". In general, the calculation we want to do is P(X_F | X_E) = P(X_E, X_F) / P(X_E) = Σ_{X_R} P(X_E, X_F, X_R) / Σ_{X_F, X_R} P(X_E, X_F, X_R). Q: Can we do these sums efficiently? Q: Can we avoid repeating unnecessary work each time we do inference? A: Yes, if we exploit the factorization of the joint distribution.

54 Take Advantage of the Distributive Law Compute P(X_[1..5])?
P(X_[1..5]) = Σ_{X_6} P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2) P(X_5|X_3) P(X_6|X_2, X_5)
= P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2) P(X_5|X_3) Σ_{X_6} P(X_6|X_2, X_5)
= P(X_1) P(X_2|X_1) P(X_3|X_1) P(X_4|X_2) P(X_5|X_3).
Left form: 2 × 5 multiplications and 1 addition; right form: 5 multiplications and 1 addition!
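A small numerical check in Python, assuming binary variables and random CPTs for the six-node network (illustrative only): the sum over X_6, once pushed inside, evaluates to 1 and drops out.

import itertools
import numpy as np

rng = np.random.default_rng(2)

def cpt(n_parents, k=2):
    t = rng.random((k,) * (n_parents + 1))
    return t / t.sum(axis=-1, keepdims=True)

p1, p2, p3, p4, p5, p6 = cpt(0), cpt(1), cpt(1), cpt(1), cpt(1), cpt(2)

x1, x2, x3, x4, x5 = 1, 0, 1, 1, 0    # an arbitrary assignment of X1..X5
prefix = p1[x1] * p2[x1, x2] * p3[x1, x3] * p4[x2, x4] * p5[x3, x5]

# Naive: sum the full product over the two values of X6
naive = sum(prefix * p6[x2, x5, x6] for x6 in (0, 1))

# Distributive law: pull the sum inside; it evaluates to 1, so it drops out
pushed = prefix * sum(p6[x2, x5, x6] for x6 in (0, 1))

print(naive, pushed, prefix)   # all three agree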

55 The Generalized Distributive Law ab + ac = a(b + c). Left: 2 ×, 1 +; right: 1 ×, 1 +.

56 一場遊戲一場夢 (A Game, A Dream)?!

57 The Most Probable Path Given: a multilayer network from A to B with n layers, with transition probabilities a_kl shown on the edges. Problem: find the most probable path from A to B. One of the solutions is given by the Viterbi algorithm, using dynamic programming.

58 The Viterbi Algorithm All paths have the same start state A, so v_0(0) = 1. By keeping pointers backwards, the actual path can be found by backtracking. The full algorithm:
Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k > 0.
Recursion (i = 1..n): v_l(i) = max_k ( v_k(i−1) a_kl ); ptr_i(l) = argmax_k ( v_k(i−1) a_kl ).
Termination: max P(Y, π*) = max_k ( v_k(n) a_k0 ); π*_n = argmax_k ( v_k(n) a_k0 ). (Note: an end state 0 is assumed.)
Traceback (i = n..1): π*_{i−1} = ptr_i(π*_i).
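A sketch of this recursion in Python, assuming the same K×K transition matrix at every layer and state 0 acting as both start and end state (the matrix values below are made up):

import numpy as np

def viterbi(a, n_steps):
    # a[k, l] = transition probability from state k to state l
    K = a.shape[0]
    v = np.zeros((n_steps + 1, K))
    ptr = np.zeros((n_steps + 1, K), dtype=int)
    v[0, 0] = 1.0                          # all paths start in state 0
    for i in range(1, n_steps + 1):
        scores = v[i - 1][:, None] * a     # scores[k, l] = v_k(i-1) * a_kl
        v[i] = scores.max(axis=0)
        ptr[i] = scores.argmax(axis=0)
    # terminate by transitioning back to the end state 0
    best_prob = (v[n_steps] * a[:, 0]).max()
    path = [int((v[n_steps] * a[:, 0]).argmax())]
    for i in range(n_steps, 1, -1):        # traceback via the stored pointers
        path.append(int(ptr[i, path[-1]]))
    return best_prob, path[::-1]

a = np.array([[0.0, 0.6, 0.4],
              [0.3, 0.5, 0.2],
              [0.5, 0.1, 0.4]])
print(viterbi(a, n_steps=4))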

59 Commutative Semi-ring A set K, together with two binary operations called "+" and "·", which satisfy the following three axioms: 1. The operation "+" is associative and commutative, and there is an additive identity element called "0" s.t. k + 0 = k for all k. 2. The operation "·" is also associative and commutative, and there is a multiplicative identity element called "1" s.t. k · 1 = k for all k. 3. The distributive law holds, i.e. (a · b) + (a · c) = a · (b + c) for all a, b, c from K. A semi-ring is a commutative ring without the additive inverse.

60 Example: max-product K = R_+ = [0, +∞), "+" : max, "·" : the usual multiplication. 1. Checking the operation "+": max(k, 0) = k for all k; i.e., identity: 0. 2. Checking the operation "·": k · 1 = k for all k; i.e., identity: 1. 3. The distributive law holds, i.e. max{a · b, a · c} = a · max(b, c) for all a, b, c from K. Other examples: min-product, min-sum, max-sum, etc.

61 Viterbi vs. GDL Viterbi is just a form of the GDL, obtained by choosing an appropriate semi-ring. In fact, many dynamic programming processes can be interpreted via the GDL. Other examples: the Baum-Welch algorithm, the FFT (fast Fourier transform) on any finite Abelian group, the Gallager-Tanner-Wiberg decoding algorithm, the BCJR algorithm, Pearl's belief propagation, the Shafer-Shenoy probability propagation algorithm, the turbo decoding algorithm, etc.
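A minimal sketch showing the same chain-elimination code specialized to two semi-rings: (sum, ×) gives marginalization, (max, ×) gives the best attainable value (the pairwise factors are random and purely illustrative):

import numpy as np

def eliminate_chain(factors, add):
    # Eliminate the variables of a chain from the left, using a pluggable
    # 'add' operation: np.sum for sum-product, np.max for max-product.
    # factors[i][a, b] is the pairwise factor on (x_{i+1}, x_{i+2}).
    message = np.ones(factors[0].shape[0])
    for f in factors:
        # combine with '*' and reduce the left variable with the chosen 'add'
        message = add(message[:, None] * f, axis=0)
    return message

rng = np.random.default_rng(3)
factors = [rng.random((2, 2)) for _ in range(4)]

print(eliminate_chain(factors, np.sum))  # sum-product: unnormalized marginal of the last variable
print(eliminate_chain(factors, np.max))  # max-product: best value for each state of the last variable

Swapping the "add" operation is all it takes to move between marginal inference and most-probable-configuration inference, which is the GDL point of this slide.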

62 Example Compute P(x_1 | x̄_6) for the six-node network: eliminate X_6, X_5, X_4, X_3, X_2 in turn, each summation producing an intermediate factor Φ over the remaining neighbours of the eliminated node; normalizing the final factor over x_1 gives the answer.

63 Single Node Posteriors For a single-node posterior (i.e. X_F is a single node), there is a simple, efficient algorithm based on eliminating nodes. Notation: x̄_i is the value of evidence node i. The algorithm, called ELIMINATION, requires a node ordering to be given, which tells it which order to do the summations in. In this ordering, the query node must appear last. Graphically, we'll remove a node from the graph once we sum it out.

64 Evidence Potentials Elimination also uses a bookkeeping trick, called evidential functions: g^E(x_i) = g(x_i) δ(x_i, x̄_i), where δ(x_i, x̄_i) is 1 if x_i = x̄_i and 0 otherwise. This trick allows us to treat conditioning in the same way as we treat marginalization. So everything boils down to doing sums: P(X_F | X̄_E) = P(X_F, X̄_E) / P(X̄_E), with P(X_F, X̄_E) = Σ_{x_R} Σ_{x_E} P(X_F, x_E, x_R) δ(x_E, x̄_E) and P(X̄_E) = Σ_{x_F} P(x_F, X̄_E). We just pick an ordering and go for it...

65 Elimination Algorithm ELIMINATE(G): place all P(X_i | X_{π_i}) and δ(X_i, x̄_i) on the active list; choose an ordering I such that F appears last; for each X_i in I: find all potentials on the active list that reference X_i and remove them from the active list; define a new potential as the sum with respect to X_i of the product of these potentials; place the new potential on the active list; end; return the product of the remaining potentials.
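A compact sketch of ELIMINATE in Python for binary variables, with factors stored as explicit tables; the sprinkler example and numbers from slide 27 are reused, with the evidence W = 1 folded in as a delta potential:

import itertools
from functools import reduce

# A factor is (vars, table): vars is a tuple of names, table maps a tuple of
# binary values (one per var) to a nonnegative number.

def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vs = tuple(dict.fromkeys(fv + gv))           # union of variables, order-preserving
    table = {}
    for vals in itertools.product([0, 1], repeat=len(vs)):
        a = dict(zip(vs, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return vs, table

def sum_out(f, var):
    fv, ft = f
    vs = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        key = tuple(val for v, val in zip(fv, vals) if v != var)
        table[key] = table.get(key, 0.0) + p
    return vs, table

def eliminate(factors, ordering):
    # Repeatedly multiply all factors mentioning the next variable in the
    # ordering, sum it out, and finally normalize the remaining product.
    active = list(factors)
    for var in ordering:
        used = [f for f in active if var in f[0]]
        active = [f for f in active if var not in f[0]]
        if used:
            active.append(sum_out(reduce(multiply, used), var))
    result = reduce(multiply, active)
    z = sum(result[1].values())
    return result[0], {k: v / z for k, v in result[1].items()}

P_S = (("S",), {(1,): 0.2, (0,): 0.8})
P_R = (("R",), {(1,): 0.4, (0,): 0.6})
P_W = (("R", "S", "W"), {(r, s, w): (p if w else 1 - p)
                         for (r, s), p in {(1, 1): 0.95, (1, 0): 0.9,
                                           (0, 1): 0.9, (0, 0): 0.1}.items()
                         for w in (0, 1)})
evid = (("W",), {(1,): 1.0, (0,): 0.0})          # delta(W, 1)
print(eliminate([P_S, P_R, P_W, evid], ordering=["W", "R"]))   # P(S | W=1) ~ {1: 0.35, 0: 0.65}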

66 Algorithm Details At each step we are trying to remove the current variable in the elimination ordering from the distribution. For marginal nodes this sums them out; for evidence nodes this conditions on their observed values using the evidential functions. Each step performs a sum over a product of potential functions. The algorithm terminates when we reach the query node, which always appears last in the ordering. We renormalize what we have left to get the final result P(X_F | X_E). For undirected models, everything is the same except that the initialization phase uses the clique potentials instead of the parent-conditionals.

67 Marginalization without Evidence Marginalization of joint distributions represented by graphical models is a special case of probabilistic inference. To compute the marginal P(X_i) of a single node, we set it to be the query node and set the evidence set to be empty. In directed models, we can ignore all nodes downstream from the query node, and marginalize only the part of the graph before it. If the node has no parents, we can read off its marginal directly. In undirected models, we need to do the full computation: compute P(X_i) / Z using elimination and then normalize in the last step of elimination to get Z. We can reuse Z later if we want to save work.

68 Efficiency Trick in Directed Elimination In directed models, we often know that a certain sum must evaluate to unity, since it is a conditional probability. For example, consider the term Φ_4 in our six-node example: Φ_4(x_2) = Σ_{x_4} p(x_4 | x_2) = 1. We can't use this trick in undirected models, because there are no guarantees about what clique potentials sum to.

69 Node Elimination The algorithm we presented is really a way of eliminating nodes from a graph one by one. For undirected graphs: for each node i in ordering I: connect all the neighbors of i; remove i from the graph; end. The removal operation requires summing out i or conditioning on observed evidence for i. Summing out i leaves a function involving all its previous neighbors, and thus they become connected by this step. The original graph, augmented by all the added edges, is now a triangulated graph. Reminder: triangulated means that every cycle of length > 3 contains a chord, i.e. an edge not on the cycle but between two nodes in the cycle.

70 Example: Node/Variable Elimination Push the sums in as far as possible. After performing the innermost sum, we create a new term Φ which does not depend on the variable summed out. Continue the summations:
P(x_1, x̄_6) = (1/Z) Σ_{x_[2..5]} ψ(x_1, x_2) ψ(x_1, x_3) ψ(x_2, x_4) ψ(x_3, x_5) ψ(x_2, x̄_6) ψ(x_5, x̄_6)
= (1/Z) Σ_{x_2} Σ_{x_4} Σ_{x_3} Σ_{x_5} ψ(x_1, x_2) ψ(x_2, x_4) ψ(x_1, x_3) ψ(x_3, x_5) ψ(x_2, x̄_6) ψ(x_5, x̄_6)
= (1/Z) Σ_{x_2} Σ_{x_4} Σ_{x_3} Σ_{x_5} ψ(x_1, x_2) ψ(x_2, x_4) ψ(x_1, x_3) ψ(x_3, x_5) Φ(x_2, x_5)
= (1/Z) Σ_{x_2} Σ_{x_4} Σ_{x_3} ψ(x_1, x_2) ψ(x_2, x_4) ψ(x_1, x_3) Φ(x_2, x_3)
= (1/Z) Σ_{x_2} ψ(x_1, x_2) Φ(x_2) Φ(x_1, x_2) = (1/Z) Φ(x_1).

71 Added Edges = Triangulation Eliminating a node connects its remaining neighbours, so elimination adds edges; the augmented graph is triangulated. It is easy to check whether a graph is triangulated (in linear time). It is easy to triangulate a non-triangulated graph. But it is very hard to do so in a way that induces small clique sizes.

72 Moralization For directed graphs, the parents may not be explicitly connected, but they are involved in the same potential function P(x_i | x_{π_i}). Thus, to think of ELIMINATION as a node-removal algorithm, we first must connect all the parents of every node and drop the directions on the links. This step is known as moralization, and it is essential: since conditioning couples parents in directed models (explaining away), we need a mechanism for respecting this when we do inference.

73 Markov Blanket Compare elimination order 1 with elimination order 2. For each eliminated node, e.g. X_i = X_5, we need to take care of (1) the child (= X_6), (2) the parents (= X_3), and (3) the parents of the child (= X_2). These three together are called the Markov blanket of X_5.

74 Moral Graph The moral graph of the six-node network has potentials ψ(x_1, x_2), ψ(x_1, x_3), ψ(x_2, x_4), ψ(x_3, x_5), ψ(x_2, x_5, x_6), up to the normalizer Z. The graph after moralization looks more general! What do we lose? E.g. X_2 ⊥ X_5 | {X_1, X_3} holds in the DAG but not in the moral graph. The moral graph is more general and loses some independencies (most specific → most general).

75 Junction Tree Moralization → Triangulation → Construct Junction Tree → Propagate Probabilities.

76 Tree-Structured Graphical Models For now, we will focus on tree-structured graphical models. Trees are an important class; they incorporate all chains (e.g. HMMs) as well. Exact inference on trees is the basis for the junction tree algorithm, which solves the general exact inference problem for directed acyclic graphs, and for many approximate algorithms which can work on intractable or cyclic graphs. Directed and undirected trees make exactly the same conditional independence assumptions, so we cover them together.

77 Elimination on Trees Recall the basic structure of ELIMINATE: 1. Convert the directed graph to an undirected one by moralization. 2. Choose an elimination ordering with the query node last. 3. Place all potentials on the active list. 4. Eliminate nodes by removing all relevant potentials, taking their product, summing out the node and placing the resulting factor back onto the potential list. What happens when the original graph is a tree? 1. No moralization is necessary. 2. There is a natural elimination ordering with the query node as root: any depth-first search order. 3. All subtrees with no evidence nodes can be ignored, since they will leave a potential of unity once they are eliminated.

78 Elimination on Trees Now consider eliminating node j, which is followed by i in the order. Which nodes appear in the potential created after summing over j? Nothing in the subtree below j (already eliminated); nothing from other subtrees, since the graph is a tree; only i, from ψ_ij which relates i and j. Call the factor that is created m_ji(x_i), and think of it as a message that j passes to i when j is eliminated. This message is created by summing over x_j the product of all earlier messages m_kj(x_j) sent to j, as well as ψ^E(x_j) if j is an evidence node.

79 Eliminate = Message Passing On a tree, ELIMINATE can be thought of as passing messages up to the query node at the root from the other nodes at the leaves or interior. Since we ignore subtrees with no evidence, observed evidence nodes are always at the leaves. The message m_ji(x_i) is created when we sum over x_j: m_ji(x_i) = Σ_{x_j} ψ^E(x_j) ψ(x_i, x_j) Π_{k ∈ c(j)} m_kj(x_j). At the final node f, we obtain the answer: p(x_f | x̄_E) ∝ ψ^E(x_f) Π_{k ∈ c(f)} m_kf(x_f). If j is an evidence node, ψ^E(x_j) = δ(x_j, x̄_j), else ψ^E(x_j) = 1. If j is a leaf node in the ordering, c(j) is empty; otherwise c(j) are the children of j in the ordering.

80 Messages are Reused in Multiple Eliminations Consider querying x_1, x_2, x_3 and x_4 in the graph below. The messages needed for x_1, x_2, x_4 individually are shown in (a)-(c). Also shown in (d) is the set of messages needed to compute all possible marginals over single query nodes. Key insight: even though the naive approach (rerun ELIMINATION) needs to compute O(N²) messages to find the marginals for all N query nodes, there are only 2(N−1) possible messages (one per direction of each edge). We can compute all possible messages in only double the amount of work it takes to do one query. Then we take the product of the relevant messages to get the marginals.

81 Computing All Possible Messages How can we compute all possible messages efficiently? Idea: respect the following Message-Passing Protocol: a node can send a message to a neighbor only when it has received messages from all its other neighbors. The protocol is realizable: designate one node arbitrarily as the root; collect messages inward to the root, then distribute them back out to the leaves. Once we have the messages, we can compute marginals using p(x_i | x̄_E) ∝ ψ^E(x_i) Π_{k ∈ c(i)} m_ki(x_i). Remember that the directed tree on which we pass messages might not be the same directed tree we started with. We can also consider synchronous or asynchronous message passing: nodes that respect the protocol but don't use the Collect-Distribute schedule above. (One must prove this terminates.)
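A small sketch of sum-product message passing on a toy tree of binary nodes; the edge list, potentials and evidence below are made up for illustration, and the collect/distribute schedule is implemented implicitly through memoized recursion:

import numpy as np

edges = [(0, 1), (1, 2), (1, 3)]                 # a star-shaped tree
rng = np.random.default_rng(4)
psi = {e: rng.random((2, 2)) + 0.1 for e in edges}
psi_E = {i: np.ones(2) for i in range(4)}
psi_E[3] = np.array([0.0, 1.0])                  # evidence: x3 = 1

def pairwise(i, j):
    # Potential on (x_i, x_j) regardless of which direction the edge is stored
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

neighbors = {i: [] for i in range(4)}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

messages = {}
def message(j, i):
    # m_{j->i}(x_i) = sum_{x_j} psi_E(x_j) psi(x_i, x_j) prod_{k != i} m_{k->j}(x_j)
    if (j, i) not in messages:
        incoming = np.prod([message(k, j) for k in neighbors[j] if k != i] + [psi_E[j]], axis=0)
        messages[(j, i)] = pairwise(i, j) @ incoming
    return messages[(j, i)]

def marginal(i):
    belief = psi_E[i] * np.prod([message(k, i) for k in neighbors[i]], axis=0)
    return belief / belief.sum()

for i in range(4):
    print(i, marginal(i))   # posterior marginals p(x_i | x3 = 1)

Because messages are cached, querying all four nodes reuses the same 2(N−1) = 6 messages instead of rerunning elimination from scratch for each query.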

82 Triangulation Triangulation: connect nodes in the moral graph such that no chordless cycle of 4 or more nodes remains in the graph. So, add links; there are many possible choices. HINT: try to keep the largest clique size small; this makes the junction tree algorithm more efficient. Sub-optimal triangulations of the moral graph can be found in polynomial time; triangulation that minimizes the largest clique size is NP-hard. But it is OK to use a suboptimal triangulation (slower JTA).

83 Part III Learning Belief Networks

84 Outline Introduction to Bayesian networks Bayesian networks: definition, d-separation, equivalence of networks Examples: causal graphs, explaining away, Markov chains, Naïve Bayes, etc. Undirected models Probabilistic inference: node elimination, junction tree Building the networks

85 Building Bayesian Network Models Tasks of building models: catching the model structure; determining the conditional probabilities. What decides the models? Theoretical considerations (e.g.: mixed data, overfitting, etc.); a set of data; subjective views from experts (can be partially provided!).

86 Some Issues in Learning Structure & Parameters Nodes as variables. How many edges? We can always start from a complete graph, but it may not be a good idea! A missing edge can be treated as an edge of probability zero. Undirected, directed or hybrid, and how to decide the direction in the directed case? Direction may not reflect causality. Complexity vs. training error.

87 Maximum Likelihood For IID data: P(D | θ) = Π_m P(x^(m) | θ); ℓ(θ; D) = Σ_m log P(x^(m) | θ). Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw: θ_ML = argmax_θ ℓ(θ; D). Very commonly used in statistics. Often leads to intuitive, appealing, or natural estimators. For a start, the IID assumption makes the log likelihood into a sum, so its derivative can easily be taken term by term.

88 MLE for Directed GMs For a directed GM, the likelihood function has a nice form: log P(D | θ) = log Π_m Π_i P(x_i^(m) | x_{π_i}^(m), θ_i) = Σ_m Σ_i log P(x_i^(m) | x_{π_i}^(m), θ_i). The parameters decouple, so we can maximize the likelihood independently for each node's function by setting θ_i. We only need the values of x_i and its parents in order to estimate θ_i. Furthermore, if (x_i^(m), x_{π_i}^(m)) have sufficient statistics, we only need those. In general, for fully observed data, if we know how to estimate parameters at a single node we can do it for the whole network.

89 Three Key Regularization Ideas To avoid overfitting, we can put priors on the parameters of the class and class-conditional feature distributions. We can also tie some parameters together so that fewer of them are estimated using more data. Finally, we can make factorization or independence assumptions about the distributions. In particular, for the class-conditional distributions we can assume the features are fully dependent, partly dependent, or independent (!).

90 Discrete (Multinomial) Naive Bayes Discrete features x_i, assumed independent given the class label y: P(X_i = j | y = k) = η_ijk, so P(x | y = k, η) = Π_i Π_j η_ijk^[x_i = j]. Classification rule: P(y = k | x, η) = π_k Π_i Π_j η_ijk^[x_i = j] / Σ_q π_q Π_i Π_j η_ijq^[x_i = j] = e^{β_k^T φ(x)} / Σ_q e^{β_q^T φ(x)}, where β_k = [log η_11k; log η_12k; ...; log η_ijk; ...; log π_k] and φ(x) = [x_1 = 1; x_1 = 2; ...; x_i = j; ...; 1].

91 Fitting Discrete Naive Bayes ML parameters are class-conditional frequency counts: η̂_ijk = Σ_m [x_i^m = j][y^m = k] / Σ_m [y^m = k]. How do we know? Write down the likelihood ℓ(θ; D) = Σ_m log P(y^m | π) + Σ_{m,i} log P(x_i^m | y^m, η) and optimize it by setting its derivative to zero (careful! enforce normalization with Lagrange multipliers): ℓ(η; D) = Σ_m Σ_{ijk} [x_i^m = j][y^m = k] log η_ijk + Σ_{ik} λ_ik (1 − Σ_j η_ijk); ∂ℓ/∂η_ijk = Σ_m [x_i^m = j][y^m = k] / η_ijk − λ_ik; setting ∂ℓ/∂η_ijk = 0 gives λ_ik = Σ_m [y^m = k], hence η̂_ijk as above.
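A minimal Python sketch of these frequency-count estimates (the tiny dataset and the smoothing constant alpha are made up; alpha = 0 gives exactly the ML counts above):

import numpy as np

def fit_discrete_nb(X, y, n_vals, n_classes, alpha=0.0):
    # Class priors pi_k and tables eta[i, j, k] = P(X_i = j | y = k) by counting
    n, d = X.shape
    pi = np.bincount(y, minlength=n_classes) / n
    eta = np.full((d, n_vals, n_classes), alpha)
    for m in range(n):
        for i in range(d):
            eta[i, X[m, i], y[m]] += 1.0
    eta /= eta.sum(axis=1, keepdims=True)          # normalize over j for each (i, k)
    return pi, eta

def predict_log_proba(x, pi, eta):
    # log P(y = k | x) = log pi_k + sum_i log eta[i, x_i, k], then normalize
    scores = np.log(pi) + sum(np.log(eta[i, xi, :]) for i, xi in enumerate(x))
    return scores - np.logaddexp.reduce(scores)

X = np.array([[0, 1], [0, 2], [1, 1], [2, 0], [2, 2], [1, 0]])   # 2 ternary features
y = np.array([0, 0, 0, 1, 1, 1])
pi, eta = fit_discrete_nb(X, y, n_vals=3, n_classes=2, alpha=1.0)
print(np.exp(predict_log_proba(np.array([0, 1]), pi, eta)))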

92 Learning Markov Models The ML parameter estimates for a simple Markov model are easy: P(y_1, y_2, ..., y_T) = P(y_1, ..., y_k) Π_{t=k+1}^{T} P(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}), so log P({y}) = log P(y_1, ..., y_k) + Σ_{t=k+1}^{T} log P(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Each window of k + 1 outputs is a training case for the model P(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Example: for discrete outputs (symbols) and a 2nd-order Markov model we can use the multinomial model P(y_t = m | y_{t−1} = a, y_{t−2} = b) = α_mab. The maximum likelihood values for α are α̂_mab = #{t s.t. y_t = m, y_{t−1} = a, y_{t−2} = b} / #{t s.t. y_{t−1} = a, y_{t−2} = b}.
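A short Python sketch of these counts for a 2nd-order model (the training string is made up for illustration):

from collections import Counter

def fit_second_order(seq, alphabet):
    # ML estimate of alpha[m, a, b] = P(y_t = m | y_{t-1} = a, y_{t-2} = b)
    # by counting windows of length 3 in a single training sequence
    triples = Counter((seq[t], seq[t - 1], seq[t - 2]) for t in range(2, len(seq)))
    pairs = Counter((seq[t - 1], seq[t - 2]) for t in range(2, len(seq)))
    return {(m, a, b): triples[(m, a, b)] / pairs[(a, b)]
            for (a, b) in pairs for m in alphabet}

seq = "ababbabababbbaabab"
alpha = fit_second_order(seq, alphabet="ab")
print(alpha[("a", "b", "a")])   # P(y_t = 'a' | y_{t-1} = 'b', y_{t-2} = 'a')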

93 Summary Introduction to Bayesian networks Bayesian networks: definition, d-separation, equivalence of networks Examples: causal graphs, explaining away, Markov chains, Naïve Bayes, etc. Undirected models Probabilistic inference: node elimination, junction tree Building the networks

94 What Else Bayesian networks with continuous r.v.'s; PCA, ICA and other dimension reduction methods will be discussed in coming lectures! Models with unobservable r.v.'s: learning by Expectation-Maximization will also be delivered!


Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Bayes Networks 6.872/HST.950

Bayes Networks 6.872/HST.950 Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional

More information

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS CASEY BRUCK 1. Abstract The goal of this aer is to rovide a concise way for undergraduate mathematics students to learn about how rime numbers behave

More information

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales Lecture 6 Classification of states We have shown that all states of an irreducible countable state Markov chain must of the same tye. This gives rise to the following classification. Definition. [Classification

More information

Computational Complexity of Inference

Computational Complexity of Inference Computational Complexity of Inference Sargur srihari@cedar.buffalo.edu 1 Topics 1. What is Inference? 2. Complexity Classes 3. Exact Inference 1. Variable Elimination Sum-Product Algorithm 2. Factor Graphs

More information

MAP Estimation Algorithms in Computer Vision - Part II

MAP Estimation Algorithms in Computer Vision - Part II MAP Estimation Algorithms in Comuter Vision - Part II M. Pawan Kumar, University of Oford Pushmeet Kohli, Microsoft Research Eamle: Image Segmentation E() = c i i + c ij i (1- j ) i i,j E: {0,1} n R 0

More information

Lecture 7: Linear Classification Methods

Lecture 7: Linear Classification Methods Homeork Homeork Lecture 7: Linear lassification Methods Final rojects? Grous oics Proosal eek 5 Lecture is oster session, Jacobs Hall Lobby, snacks Final reort 5 June. What is linear classification? lassification

More information

HENSEL S LEMMA KEITH CONRAD

HENSEL S LEMMA KEITH CONRAD HENSEL S LEMMA KEITH CONRAD 1. Introduction In the -adic integers, congruences are aroximations: for a and b in Z, a b mod n is the same as a b 1/ n. Turning information modulo one ower of into similar

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Lecture 2: Simple Classifiers

Lecture 2: Simple Classifiers CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

PROBABILISTIC REASONING SYSTEMS

PROBABILISTIC REASONING SYSTEMS PROBABILISTIC REASONING SYSTEMS In which we explain how to build reasoning systems that use network models to reason with uncertainty according to the laws of probability theory. Outline Knowledge in uncertain

More information

Bayes Nets III: Inference

Bayes Nets III: Inference 1 Hal Daumé III (me@hal3.name) Bayes Nets III: Inference Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 10 Apr 2012 Many slides courtesy

More information

Intelligent Systems:

Intelligent Systems: Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition

More information

SQUARES IN Z/NZ. q = ( 1) (p 1)(q 1)

SQUARES IN Z/NZ. q = ( 1) (p 1)(q 1) SQUARES I Z/Z We study squares in the ring Z/Z from a theoretical and comutational oint of view. We resent two related crytograhic schemes. 1. SQUARES I Z/Z Consider for eamle the rime = 13. Write the

More information

3 More applications of derivatives

3 More applications of derivatives 3 More alications of derivatives 3.1 Eact & ineact differentials in thermodynamics So far we have been discussing total or eact differentials ( ( u u du = d + dy, (1 but we could imagine a more general

More information

4. Score normalization technical details We now discuss the technical details of the score normalization method.

4. Score normalization technical details We now discuss the technical details of the score normalization method. SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules

More information

Bayesian Networks for Modeling and Managing Risks of Natural Hazards

Bayesian Networks for Modeling and Managing Risks of Natural Hazards [National Telford Institute and Scottish Informatics and Comuter Science Alliance, Glasgow University, Set 8, 200 ] Bayesian Networks for Modeling and Managing Risks of Natural Hazards Daniel Straub Engineering

More information