More belief propaga+on (sum- product)

Size: px

Start display at page:

Download "More belief propaga+on (sum- product)"

Rosamond Boyd
5 years ago
Views:

1 Notes for Sec+on 5

2 Today More mo+va+on for graphical models A review of belief propaga+on Special- case: forward- backward algorithm From variable elimina+on to junc+on tree (mainly just intui+on)

3 More belief propaga+on (sum- product) Consider an arbitrary probability distribu+on over N variables each of domain size D. You can think of this as a table where each column corresponds to a variable, and for each row we specify a probability. How many rows are there? Suppose we condi+on on a variable x=v. What does this mean? It means picking out all rows with x=v and then renormalizing. Renormalizing is just dividing by the marginal probability of x=v. Suppose we want to find the marginal probability of some variable x. What does this mean? For each value of x, it means finding all the rows with that value and adding their probabili+es together. How many rows do we need to add?

4 (slides: Rina Dechter) 4

5 Naively, then, an arbitrary probability distribu+on requires O(N^D) opera+ons. This might not seem so bad: but real world problems are large.

6 Monitoring Intensive- Care Pa+ents MINVOLSET PULMEMBOLUS INTUBATION KINKEDTUBE VENTMACH DISCONNECT PAP SHUNT VENTLUNG VENITUBE MINOVL FIO2 VENTALV PRESS ANAPHYLAXIS PVSAT ARTCO2 TPR SAO2 INSUFFANESTH EXPCO2 HYPOVOLEMIA LVFAILURE CATECHOL LVEDVOLUME STROEVOLUME HISTORY ERRBLOWOUTPUT HR ERRCAUTER CVP PCWP CO HREKG HRSAT HRBP BP (slides: Rina Dechter) 37 variables << parameters 37

7 Another example: linkage analysis???? 1 2 A a B b pedigree A A B b 3 4 A? B????? 5 6 A a B b 6 individuals Haplotype: {2, 3} Genotype: {6} Unknown (slides: Rina Dechter)

8 Pedigree: 6 people, 3 markers L 11m L 11f L 12m L 12f X 11 X 12 S 13m L 13m L 13f S 15m L 14m L 14f X 13 X 14 S 15m L 15m L 15f L 16m L 16f S 15m S 15m S 16m X 15 X 16 L 21m L 21f L 22m L 22f X 21 X 22 S 23m L 23m L 23f S 25m L 24m L 24f X 23 X 24 S 25m L 25m L 25f L 26m L 26f S 25m S 25m S 26m X 25 X 26 L 31m L 31f L 32m L 32f X 31 X 32 S 33m L 33m L 33f S 35m L 34m L 34f X 33 X 34 S 35m L 35m L 35f L 36m L 36f S 35m S 35m S 36m (slides: Rina Dechter) X 35 X 36

9 But we can do beher with message passing on trees. Using BP: choose root, sum over leaf variables and send message to parents, take the product, repeat, un+l you reach the root. Every summa+on is takes D^2 opera+ons. At each parent, we need to mul+ply the messages from the children. There are at most N children, so at most N mul+plica+ons and one for each value in the parent variable. And if we go leaves to root and back this happens twice at each node, so O(N*D^2).

10 Forward backward algorithm Decomposing the probability of an observation sequence!"!#!$!% P (o 1,...o T )= X &" &# &$ &% s 1,...s T P (o 1,...o T,s 1,...s T ) = X s 1,...s T P (s 1 )! TY P (s t s t 1 ) t=2! TY P (o t s t ) t=1 = X P (o T s T ) X P (s T s T 1 )P (s 1 ) s T s 1,...s T 1 TY 1 t=2 P (s t s t 1 ) (using the model)! TY 1 t=1 P (o t s t )! hhp:// ML/Lectures/ml- lecture20.pdf

11 The forward algorithm Given an HMM model and an observation sequence o 1,...o T,define: t(s) =P (o 1,...o t,s t = s) We can put these variables together in a vector t of size S. In particular, 1(s) =P (o 1,S 1 = s) =P (o 1 S 1 = s)p (S 1 = s) =q so1 b 0 (s) For t =2,...T, t(s) =p sot Ps p s s t 1(s ) The solution is then P (o 1,...o T )= X s T (s) hhp:// ML/Lectures/ml- lecture20.pdf

12 Computing state probabilities in general If we know the model parameters and an observation sequence, how do we compute P (S t = s o 1,o 2,...o T )? P (S t = s o 1,...o T )= P (o 1,...o T,S t = s) P (o 1,...o T ) = P (o t+1,...o T o 1,...o t,s t = s)p (o 1,...o t,s t = s) P (o 1,...o T ) = P (o t+1,...o T S t = s)p (o 1,...o t,s t = s) P (o 1,...o T ) The denominator is a normalization constant and second factor in the numerator can be computed using the forward algorithm (it is t(s)) We now compute the first factor hhp:// ML/Lectures/ml- lecture20.pdf

13 The forward-backward algorithm Given the observation sequence, o 1,...o T we can compute the probability of any state at any time as follows: 1. Compute all the t(s), using the forward algorithm 2. Compute all the t (s), using the backward algorithm 3. For any s S and t {1,...T}: P (S t = s o 1,...o T )= P (o 1,...o t,s t = s)p (o t+1,...o T S t = s) P (o 1,...o T ) The complexity of the algorithm is O( S T ). = t (s) t (s) P s T (s ) hhp:// ML/Lectures/ml- lecture20.pdf

14 From VE to JT In most cases we don t have trees BP doesn t work because of feedback So we use variable elimina+on which is exp(treewidth). Tree width is roughly the size of the largest factor we generate during elimina+on.

15 The Junc+on Tree Algorithm c a b d First, let s think about belief propaga+on in a slightly different way: Every node in a undirected graphical model that is a tree is a SEPARATOR for the graph. If you remove it, it separates the graph into two unconnected components. So we can do all the marginaliza+ons in each component separately, and then send the resul+ng poten+als over the separator to the separator node.

16 The Junc+on Tree Algorithm c a b d This is convenient, because that means that in a tree we can break up the message calcula+ons into nested sub- problems and we end up calcula+ng the marginals by message passing. But what do we do if we have cycles in our graph?

17 The Junc+on Tree Algorithm c a b d e f We can do the same thing! Choose a separator set of variables. Make that into a super- node which just the cartesian product of those variables and connect it to whatever the individual nodes were connected to before. a b c, d d d If you think about it, this has to represent the same distribu+on. It s just that if we want the marginal over c, first we get the marginal over (c,d) and then marginalize over d.

18 The Junc+on Tree Algorithm If you have more than one cycle, you just keep gathering together separators un+l you have a tree. h h c g g a b e f a b c,d e f d And then you just do BP on the resul+ng graph. h a b c,d e, g f

19 The Junc+on Tree Algorithm Of course, there are oten many separators and which separators you choose can have a big effect on how big the super- nodes you make are. And remember the func+ons that h live on the edges are now func+ons in poten+ally many variables. So the running +me is now exponen+al in D^{max number of variables connected by an edge}. a b c,d e, g f e,g,h a,b b,c,d c,d,e,g e,g,h

20 h Just to make things confusing people in the field oten talk about the dual graph which puts all the variables on an edge in a node and puts the nodes on the edges. Don t worry about it. a b c,d e, g f dual- graph (a.k.a junc+on tree) e,g,h a,b b,c,d c,d,e,g e,g,h

Learning' Probabilis2c' Graphical' Models' BN'Structure' Structure' Learning' Daphne Koller

Learning' Probabilis2c' Graphical' Models' BN'Structure' Structure' Learning' Daphne Koller Probabilis2c' Graphical' Models' Learning' BN'Structure' Structure' Learning' Why Structure Learning To learn model for new queries, when domain expertise is not perfect For structure discovery, when inferring