5. Sum-product algorithm

Outline:
- Elimination algorithm
- Sum-product algorithm on a line
- Sum-product algorithm on a tree

Inference tasks on graphical models

Consider an undirected graphical model (a.k.a. Markov random field)
$$\mu(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c),$$
where $C$ is the set of all maximal cliques in $G$. We want to:
- calculate marginals: $\mu(x_A) = \sum_{x_{V \setminus A}} \mu(x)$
- calculate conditional distributions: $\mu(x_A \mid x_B) = \mu(x_A, x_B) / \mu(x_B)$
- calculate maximum a posteriori estimates: $\arg\max_{\hat{x}} \mu(\hat{x})$
- calculate the partition function $Z$
- sample from this distribution

Elimination algorithm for calculating marginals

The elimination algorithm is exact but can require $O(|X|^{|V|})$ operations. Consider
$$\mu(x) \propto \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3)\,\psi_{25}(x_2,x_5)\,\psi_{345}(x_3,x_4,x_5)$$
and suppose we want to compute $\mu(x_1)$. Brute-force marginalization,
$$\mu(x_1) \propto \sum_{x_2,x_3,x_4,x_5 \in X} \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3)\,\psi_{25}(x_2,x_5)\,\psi_{345}(x_3,x_4,x_5),$$
requires $O(|X|^5)$ operations, where big-O denotes that the count is upper bounded by $C|X|^5$ for some constant $C$.
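As a concrete reference point, here is a minimal brute-force computation of $\mu(x_1)$ for this example in Python; the binary alphabet and the random potential tables are hypothetical stand-ins, not values from the lecture.

```python
# Brute-force computation of mu(x_1): sum the product of all potentials over
# x_2,...,x_5, then normalize. Cost grows as |X|^(|V|-1).
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = [0, 1]  # hypothetical binary alphabet standing in for the generic X

# potential tables, indexed by the states of the variables in their scope
psi12 = rng.random((2, 2))
psi13 = rng.random((2, 2))
psi25 = rng.random((2, 2))
psi345 = rng.random((2, 2, 2))

mu1 = np.zeros(len(X))
for x1 in X:
    for x2, x3, x4, x5 in itertools.product(X, repeat=4):
        mu1[x1] += psi12[x1, x2] * psi13[x1, x3] * psi25[x2, x5] * psi345[x3, x4, x5]
mu1 /= mu1.sum()  # normalize to get the marginal
print(mu1)
```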

Consider the elimination ordering $(5,4,3,2)$:
$$\begin{aligned}
\mu(x_1) &\propto \sum_{x_2,x_3,x_4,x_5 \in X} \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3)\,\psi_{25}(x_2,x_5)\,\psi_{345}(x_3,x_4,x_5) \\
&= \sum_{x_2,x_3,x_4 \in X} \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3) \underbrace{\sum_{x_5 \in X} \psi_{25}(x_2,x_5)\,\psi_{345}(x_3,x_4,x_5)}_{m_5(x_2,x_3,x_4)} \\
&= \sum_{x_2,x_3,x_4 \in X} \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3)\, m_5(x_2,x_3,x_4) \\
&= \sum_{x_2,x_3 \in X} \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3) \underbrace{\sum_{x_4 \in X} m_5(x_2,x_3,x_4)}_{m_4(x_2,x_3)} \\
&= \sum_{x_2,x_3 \in X} \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3)\, m_4(x_2,x_3) \\
&= \sum_{x_2 \in X} \psi_{12}(x_1,x_2) \underbrace{\sum_{x_3 \in X} \psi_{13}(x_1,x_3)\, m_4(x_2,x_3)}_{m_3(x_1,x_2)}
\end{aligned}$$

$$\mu(x_1) \propto \sum_{x_2 \in X} \psi_{12}(x_1,x_2)\, m_3(x_1,x_2) = m_2(x_1)$$

Normalize $m_2(\cdot)$ to get $\mu(x_1)$:
$$\mu(x_1) = \frac{m_2(x_1)}{\sum_{\hat{x}_1} m_2(\hat{x}_1)}$$

The computational complexity depends on the elimination ordering. How do we know which ordering is better?

Computational complexity of the elimination algorithm

$$\begin{aligned}
\mu(x_1) &\propto \sum_{x_2,x_3,x_4,x_5 \in X} \psi_{12}(x_1,x_2)\,\psi_{13}(x_1,x_3)\,\psi_{25}(x_2,x_5)\,\psi_{345}(x_3,x_4,x_5) \\
&= \sum_{x_2,x_3,x_4 \in X} \psi_{12}\,\psi_{13} \underbrace{\sum_{x_5 \in X} \psi_{25}\,\psi_{345}}_{m_5(S_5),\; S_5=\{x_2,x_3,x_4\},\; \Psi_5=\{\psi_{25},\psi_{345}\}} \\
&= \sum_{x_2,x_3 \in X} \psi_{12}\,\psi_{13} \underbrace{\sum_{x_4 \in X} m_5(x_2,x_3,x_4)}_{m_4(S_4),\; S_4=\{x_2,x_3\},\; \Psi_4=\{m_5\}} \\
&= \sum_{x_2 \in X} \psi_{12} \underbrace{\sum_{x_3 \in X} \psi_{13}\, m_4(x_2,x_3)}_{m_3(S_3),\; S_3=\{x_1,x_2\},\; \Psi_3=\{\psi_{13},m_4\}} \\
&= \underbrace{\sum_{x_2 \in X} \psi_{12}\, m_3(x_1,x_2)}_{m_2(S_2),\; S_2=\{x_1\},\; \Psi_2=\{\psi_{12},m_3\}} = m_2(x_1)
\end{aligned}$$

Total complexity: $\sum_i O(|\Psi_i|\,|X|^{1+|S_i|}) = O(|V| \max_i |\Psi_i|\, |X|^{1+\max_i |S_i|})$.

Induced graph

The elimination algorithm can be viewed as a transformation of graphs: the induced graph $G(G,I)$ for a graph $G$ and an elimination ordering $I$ is the union of all the transformed graphs, or equivalently, start from $G$ and for each $i \in I$ connect all pairs in $S_i$.

Theorem: every maximal clique in $G(G,I)$ corresponds to the domain $S_i \cup \{i\}$ of some message, and the size of the largest clique in $G(G,I)$ is $1 + \max_i |S_i|$.

A different ordering $I$ gives different cliques, and therefore a different complexity.

Theorem: finding the optimal elimination ordering is NP-hard. Any suggestions? Heuristics give $I = (4,5,3,2,1)$.

For Bayesian networks the same algorithm works with conditional probabilities instead of compatibility functions:
- the complexity analysis can be done on the moralized undirected graph
- intermediate messages do not correspond to a conditional distribution

Elimination algorithm

input: $\{\psi_c\}_{c \in C}$, alphabet $X$, subset $A \subseteq V$, elimination ordering $I$
output: marginal $\mu(x_A)$

1. initialize the active set $\Psi$ to be the set of input compatibility functions
2. for each node $i$ in $I$ that is not in $A$:
   - let $S_i$ be the set of nodes, not including $i$, that share a compatibility function with $i$
   - let $\Psi_i$ be the set of compatibility functions in $\Psi$ involving $x_i$
   - compute $m_i(x_{S_i}) = \sum_{x_i} \prod_{\psi \in \Psi_i} \psi(x_i, x_{S_i})$
   - remove the elements of $\Psi_i$ from $\Psi$ and add $m_i$ to $\Psi$
3. normalize: $\mu(x_A) = \prod_{\psi \in \Psi} \psi(x_A) \big/ \sum_{x_A} \prod_{\psi \in \Psi} \psi(x_A)$
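A runnable sketch of this procedure on the same five-variable example, again with hypothetical binary potentials; the factor representation (a scope tuple plus a numpy table) is an implementation choice, not part of the algorithm's specification.

```python
# Elimination algorithm: repeatedly sum out a variable, replacing the factors
# that mention it with a single new message factor m_i over S_i.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = [0, 1]  # alphabet

# factors as (scope, table); table indexed by the states of the scope variables
factors = [
    ((1, 2), rng.random((2, 2))),
    ((1, 3), rng.random((2, 2))),
    ((2, 5), rng.random((2, 2))),
    ((3, 4, 5), rng.random((2, 2, 2))),
]

def eliminate(factors, order):
    """Sum out the variables in `order`, returning the remaining factor list."""
    factors = list(factors)
    for i in order:
        involved = [f for f in factors if i in f[0]]          # Psi_i
        factors = [f for f in factors if i not in f[0]]
        # S_i: all other variables appearing with i
        S = sorted({v for scope, _ in involved for v in scope if v != i})
        m = np.zeros([len(X)] * len(S))
        for assignment in itertools.product(X, repeat=len(S)):
            ctx = dict(zip(S, assignment))
            total = 0.0
            for xi in X:                                       # sum over x_i
                ctx[i] = xi
                p = 1.0
                for scope, table in involved:
                    p *= table[tuple(ctx[v] for v in scope)]
                total += p
            m[assignment] = total
        factors.append((tuple(S), m))                          # add m_i to Psi
    return factors

remaining = eliminate(factors, order=[5, 4, 3, 2])
(scope, m2), = remaining          # only x_1 is left
mu_x1 = m2 / m2.sum()             # normalize m_2 to get the marginal
print(scope, mu_x1)
```

With the same potential tables, eliminating in the order $(5,4,3,2)$ should reproduce the brute-force marginal computed above.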

Sum-product algorithm on a line

[figure: a line graph $x_1, \dots, x_7$ with messages $\nu_{3\to 4}(x_3)$ and $\nu_{5\to 4}(x_5)$ flowing into node 4]

$$\mu(x) = \frac{1}{Z} \prod_{i=1}^{n-1} \psi_{i,i+1}(x_i, x_{i+1}), \qquad \mu(x_i) = \sum_{x_{[n]\setminus\{i\}}} \mu(x)$$

Grouping the potentials on each side of node $i$,
$$\mu(x_i) \propto \sum_{x_{[n]\setminus\{i\}}} \underbrace{\psi(x_1,x_2)\cdots\psi(x_{i-2},x_{i-1})}_{\mu_{i-1\to i}}\,\psi(x_{i-1},x_i)\,\psi(x_i,x_{i+1})\,\underbrace{\psi(x_{i+1},x_{i+2})\cdots\psi(x_{n-1},x_n)}_{\mu_{i+1\to i}}$$
$$\propto \sum_{x_{i-1},x_{i+1}} \nu_{i-1\to i}(x_{i-1})\,\psi(x_{i-1},x_i)\,\psi(x_i,x_{i+1})\,\nu_{i+1\to i}(x_{i+1}),$$
where
$$\nu_{i-1\to i}(x_{i-1}) \propto \sum_{x_1,\dots,x_{i-2}} \mu_{i-1\to i}(x_1,\dots,x_{i-1}), \qquad \nu_{i+1\to i}(x_{i+1}) \propto \sum_{x_{i+2},\dots,x_n} \mu_{i+1\to i}(x_{i+1},\dots,x_n).$$

Defining
$$\nu_i(x_i) \equiv \sum_{x_{i-1},x_{i+1}} \nu_{i-1\to i}(x_{i-1})\,\psi(x_{i-1},x_i)\,\psi(x_i,x_{i+1})\,\nu_{i+1\to i}(x_{i+1}),$$
we get $\mu(x_i) = \nu_i(x_i) / \sum_{x_i} \nu_i(x_i)$.

Definitions:
$$\mu_{i-1\to i}(x_1,\dots,x_{i-1}) \equiv \frac{1}{Z_{i-1\to i}} \prod_{k\in[i-2]} \psi(x_k,x_{k+1}), \qquad \nu_{i-1\to i}(x_{i-1}) \equiv \sum_{x_1,\dots,x_{i-2}} \mu_{i-1\to i}(x_1,\dots,x_{i-1})$$
$$\mu_{i+1\to i}(x_{i+1},\dots,x_n) \equiv \frac{1}{Z_{i+1\to i}} \prod_{k\in\{i+2,\dots,n\}} \psi(x_{k-1},x_k), \qquad \nu_{i+1\to i}(x_{i+1}) \equiv \sum_{x_{i+2},\dots,x_n} \mu_{i+1\to i}(x_{i+1},\dots,x_n)$$

How can we compute the messages $\nu$? Since $\mu_{i\to i+1}(x_1,\dots,x_i) \propto \mu_{i-1\to i}(x_1,\dots,x_{i-1})\,\psi(x_{i-1},x_i)$,
$$\nu_{i\to i+1}(x_i) = \sum_{x_1,\dots,x_{i-1}} \mu_{i\to i+1}(x_1,\dots,x_i) \propto \sum_{x_1,\dots,x_{i-1}} \mu_{i-1\to i}(x_1,\dots,x_{i-1})\,\psi(x_{i-1},x_i) = \sum_{x_{i-1}} \nu_{i-1\to i}(x_{i-1})\,\psi(x_{i-1},x_i).$$

How many operations are required? $O(n|X|^2)$ operations to compute one marginal $\mu(x_i)$. What if we want all the marginals? Compute all the messages forward and backward in $O(n|X|^2)$, then compute all the marginals in $O(n|X|^2)$ operations.
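Here is a minimal sketch of the forward and backward message recursions on a chain, using numpy and hypothetical random pairwise potentials; normalizing each message to sum to one is a standard numerical-stability choice, not part of the derivation above.

```python
# Sum-product on a line: forward and backward messages, then all n marginals.
import numpy as np

rng = np.random.default_rng(1)
n, k = 7, 3                         # 7 nodes, alphabet size |X| = 3
psi = rng.random((n - 1, k, k))     # psi[i] couples nodes i and i+1 (0-based)

fwd = np.ones((n, k)) / k           # fwd[i](x_i) = nu_{i -> i+1}(x_i)
for i in range(1, n):
    m = fwd[i - 1] @ psi[i - 1]     # sum_{x_{i-1}} nu(x_{i-1}) psi(x_{i-1}, x_i)
    fwd[i] = m / m.sum()

bwd = np.ones((n, k)) / k           # bwd[i](x_i) = nu_{i -> i-1}(x_i)
for i in range(n - 2, -1, -1):
    m = psi[i] @ bwd[i + 1]         # sum_{x_{i+1}} psi(x_i, x_{i+1}) nu(x_{i+1})
    bwd[i] = m / m.sum()

marginals = fwd * bwd               # combine the two incoming messages per node
marginals /= marginals.sum(axis=1, keepdims=True)
print(marginals)
```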

Computing the partition function from the messages:
$$Z_{i\to(i+1)} = \sum_{x_1,\dots,x_i} \prod_{k\in[i-1]} \psi(x_k,x_{k+1}) = Z_{(i-1)\to i} \sum_{x_{i-1},x_i} \nu_{(i-1)\to i}(x_{i-1})\,\psi(x_{i-1},x_i),$$
with $Z_1 = 1$ and $Z_n = Z$. How many operations do we need? $O(n|X|^2)$ operations.
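Continuing the chain sketch above (same hypothetical potentials), the partition function can be accumulated alongside the forward pass: each normalization constant is folded into a running scalar that plays the role of $Z_{i\to(i+1)}$.

```python
# Accumulate Z along the forward pass with normalized messages.
import numpy as np

rng = np.random.default_rng(1)
n, k = 7, 3
psi = rng.random((n - 1, k, k))

nu = np.ones(k) / k
logZ = np.log(k)                  # undo the initial 1/k normalization
for i in range(n - 1):
    m = nu @ psi[i]
    c = m.sum()                   # normalizer = the factor Z gains at this step
    nu = m / c
    logZ += np.log(c)
print(np.exp(logZ))

# sanity check: unnormalized recursion (fine for short chains)
nu_un = np.ones(k)
for i in range(n - 1):
    nu_un = nu_un @ psi[i]
print(nu_un.sum())                # equals Z, matching the value above
```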

Sum-product algorithm for hidden Markov models

Hidden Markov model: a sequence of random variables $\{(X_1,Y_1); (X_2,Y_2); \dots; (X_n,Y_n)\}$, $n \ge 1$:
- the hidden states $\{X_i\}$ form a Markov chain: $P\{x\} = P\{x_1\} \prod_{i=1}^{n-1} P\{x_{i+1} \mid x_i\}$
- $\{Y_i\}$ are noisy observations: $P\{y \mid x\} = \prod_{i=1}^{n} P\{y_i \mid x_i\}$

[figure: an undirected chain $x_1,\dots,x_7$ with attached observations $y_1,\dots,y_7$, and its equivalent directed form]

Time-homogeneous hidden Markov models: for a sequence of r.v.'s $\{(X_1,Y_1); \dots; (X_n,Y_n)\}$, $n \ge 1$:
- $\{X_i\}$ Markov chain: $P\{x\} = q_0(x_1) \prod_{i=1}^{n-1} q(x_i, x_{i+1})$
- $\{Y_i\}$ noisy observations: $P\{y \mid x\} = \prod_{i=1}^{n} r(x_i, y_i)$

$$\mu(x,y) = \frac{1}{Z} \prod_{i=1}^{n-1} \psi_i(x_i,x_{i+1}) \prod_{i=1}^{n} \tilde{\psi}_i(x_i,y_i), \qquad \psi_i(x_i,x_{i+1}) = q(x_i,x_{i+1}), \quad \tilde{\psi}_i(x_i,y_i) = r(x_i,y_i).$$

We want to compute the marginals of the following graphical model on a line (taking $q_0$ uniform):

[figure: chain $x_1,\dots,x_7$ with observations $y_1,\dots,y_7$]

$$\mu_y(x) = P\{x \mid y\} \overset{\text{Bayes}}{=} \frac{1}{Z(y)} \prod_{i=1}^{n-1} q(x_i,x_{i+1}) \prod_{i=1}^{n} r(x_i,y_i) = \frac{1}{Z(y)} \prod_{i=1}^{n-1} \psi_i(x_i,x_{i+1}),$$
where
$$\psi_i(x_i,x_{i+1}) = q(x_i,x_{i+1})\, r(x_i,y_i) \quad (\text{for } i < n-1), \qquad \psi_{n-1}(x_{n-1},x_n) = q(x_{n-1},x_n)\, r(x_{n-1},y_{n-1})\, r(x_n,y_n).$$

Applying the sum-product algorithm computes the marginals in $O(n|X|^2)$ time:
$$\nu_{i\to(i+1)}(x_i) \propto \sum_{x_{i-1}\in X} q(x_{i-1},x_i)\, r(x_{i-1},y_{i-1})\, \nu_{(i-1)\to i}(x_{i-1}),$$
$$\nu_{(i+1)\to i}(x_i) \propto \sum_{x_{i+1}\in X} q(x_i,x_{i+1})\, r(x_i,y_i)\, \nu_{(i+2)\to(i+1)}(x_{i+1}).$$

- known as the forward-backward algorithm; a special case of the sum-product algorithm
- the BCJR algorithm for convolutional codes (1974)
- it does not find the maximum likelihood estimate (cf. the Viterbi algorithm)
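A compact sketch of forward-backward for a time-homogeneous HMM with uniform $q_0$; the transition matrix q, emission matrix r, and observation sequence y are hypothetical examples. Note the local evidence $r(x_i, y_i)$ is attached in the standard textbook placement (backward messages carry $r(x_{i+1},y_{i+1})$, and the local term is multiplied in at the end), which differs from the slide's potential convention but yields the same posterior.

```python
# Forward-backward: compute P(x_i | y) for all i, with per-step normalization.
import numpy as np

rng = np.random.default_rng(2)
kx, ky, n = 3, 4, 7                         # |X|, |Y|, chain length

q = rng.random((kx, kx)); q /= q.sum(axis=1, keepdims=True)  # transitions
r = rng.random((kx, ky)); r /= r.sum(axis=1, keepdims=True)  # emissions
y = rng.integers(ky, size=n)                # a hypothetical observation sequence

fwd = np.zeros((n, kx))
fwd[0] = np.ones(kx) / kx                   # uniform q_0
for i in range(1, n):
    m = (fwd[i - 1] * r[:, y[i - 1]]) @ q   # sum over x_{i-1}
    fwd[i] = m / m.sum()

bwd = np.zeros((n, kx))
bwd[-1] = np.ones(kx)
for i in range(n - 2, -1, -1):
    m = q @ (bwd[i + 1] * r[:, y[i + 1]])   # sum over x_{i+1}
    bwd[i] = m / m.sum()

post = fwd * bwd * r[:, y].T                # include the local evidence r(x_i, y_i)
post /= post.sum(axis=1, keepdims=True)
print(post)                                 # post[i] = P(x_i | y)
```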

Example: neuron firing patterns

Hypothesis: assemblies of neurons activate in a coordinated way in correspondence with specific cognitive functions; performing the function corresponds to a sequence of these activity states.

Approach: model the firing process as the observed variables and the activity states as the hidden variables.

[C. Kemere, G. Santhanam, B. M. Yu, A. Afshar, S. I. Ryu, T. H. Meng and K. V. Shenoy, J. Neurophysiol. 100 (2008)]

[I. Gat, N. Tishby, and M. Abeles, Network: Computation in Neural Systems, 8 (1997)]

[C. Kemere, G. Santhanam, B. M. Yu, A. Afshar, S. I. Ryu, T. H. Meng and K. V. Shenoy, J. Neurophysiol. 100 (2008)]

Sum-product algorithm on trees

A motivating example: the influenza virus, and the complete sequence of a gene of the 1918 influenza virus [A. H. Reid, T. G. Fanning, J. V. Hultin, and J. K. Taubenberger, Proc. Natl. Acad. Sci. 96 (1999)].

Challenges in phylogeny:
- phylogeny reconstruction: given DNA sequences at vertices (only at the leaves), infer the underlying tree $T = (V,E)$
- phylogeny evaluation: given a tree $T = (V,E)$, evaluate the probability of the observed DNA sequences at vertices (only at the leaves)

Bayesian network model for phylogeny evaluation: $T = (V,D)$ a directed graph, DNA sequences $x = (x_i)_{i\in V} \in X^V$,
$$\mu_T(x) = q_o(x_o) \prod_{(i,j)\in D} q_{i,j}(x_i,x_j),$$
where $q_{i,j}(x_i,x_j)$ is the probability that the descendant is $x_j$ if the ancestor is $x_i$.

Simplified model: $X = \{+1,-1\}$, $q_o(x_o) = \frac{1}{2}$, and
$$q(x_i,x_j) = \begin{cases} 1-q & \text{if } x_j = x_i \\ q & \text{if } x_j \neq x_i \end{cases}$$

MRF representation: $q(x_i,x_j) \propto e^{\theta x_i x_j}$, so the probability of a given pattern of mutations $x$ is
$$\mu_T(x) = \frac{1}{Z_\theta(T)} \prod_{(i,j)\in E} e^{\theta x_i x_j}.$$

Problem: for a given $T$, compute the marginal $\mu_T(x_i)$.

Define graphical models on sub-trees: let $T_{j\to i} = (V_{j\to i}, E_{j\to i})$ be the subtree rooted at $j$ and excluding $i$, with
$$\mu_{j\to i}(x_{V_{j\to i}}) \equiv \frac{1}{Z(T_{j\to i})} \prod_{(u,v)\in E_{j\to i}} e^{\theta x_u x_v}, \qquad \nu_{j\to i}(x_j) \equiv \sum_{x_{V_{j\to i}}\setminus\{j\}} \mu_{j\to i}(x_{V_{j\to i}}).$$

[figure: node $i$ with neighbors $k_1, k_2, k_3, k_4$ and sub-trees $T_{k_l\to i} = (V_l, E_l)$]

The messages from the neighbors $k_1, k_2, k_3, k_4$ are sufficient to compute the marginal $\mu_T(x_i)$:
$$\begin{aligned}
\mu_T(x_i) &\propto \sum_{x_{V\setminus\{i\}}} \prod_{(u,v)\in E} e^{\theta x_u x_v} = \sum_{x_{V_1},\dots,x_{V_4}} \prod_{l=1}^{4} e^{\theta x_i x_{k_l}} \prod_{(u,v)\in E_l} e^{\theta x_u x_v} \\
&= \prod_{l=1}^{4} \Bigg\{ \sum_{x_{k_l}, x_{V_l\setminus\{k_l\}}} e^{\theta x_i x_{k_l}} \prod_{(u,v)\in E_l} e^{\theta x_u x_v} \Bigg\} \propto \prod_{l=1}^{4} \Bigg\{ \sum_{x_{k_l}} e^{\theta x_i x_{k_l}} \underbrace{\sum_{x_{V_l\setminus\{k_l\}}} \mu_{k_l\to i}(x_{V_l})}_{\nu_{k_l\to i}(x_{k_l})} \Bigg\}
\end{aligned}$$

Recursion on sub-trees to compute the messages $\nu_{j\to i}$: for a node $i$ with children $k$ and $l$ when the tree is rooted away from $j$,
$$\mu_{i\to j}(x_{V_{i\to j}}) = \frac{1}{Z(T_{i\to j})} \prod_{(u,v)\in E_{i\to j}} e^{\theta x_u x_v} = \frac{1}{Z(T_{i\to j})}\, e^{\theta x_i x_k}\, e^{\theta x_i x_l} \prod_{(u,v)\in E_{k\to i}} e^{\theta x_u x_v} \prod_{(u,v)\in E_{l\to i}} e^{\theta x_u x_v} \propto e^{\theta x_i x_k}\, e^{\theta x_i x_l}\, \mu_{k\to i}(x_{V_{k\to i}})\, \mu_{l\to i}(x_{V_{l\to i}})$$

[figure: node $i$ receiving messages $\nu_{k\to i}$ and $\nu_{l\to i}$ and sending $\nu_{i\to j}$]

$$\begin{aligned}
\nu_{i\to j}(x_i) &= \sum_{x_{V_{i\to j}}\setminus i} \mu_{i\to j}(x_{V_{i\to j}}) \propto \sum_{x_{V_{i\to j}}\setminus i} e^{\theta x_i x_k}\, e^{\theta x_i x_l}\, \mu_{k\to i}(x_{V_{k\to i}})\, \mu_{l\to i}(x_{V_{l\to i}}) \\
&= \Bigg\{ \sum_{x_{V_{k\to i}}} e^{\theta x_i x_k}\, \mu_{k\to i}(x_{V_{k\to i}}) \Bigg\} \Bigg\{ \sum_{x_{V_{l\to i}}} e^{\theta x_i x_l}\, \mu_{l\to i}(x_{V_{l\to i}}) \Bigg\} = \Bigg\{ \sum_{x_k} e^{\theta x_i x_k}\, \nu_{k\to i}(x_k) \Bigg\} \Bigg\{ \sum_{x_l} e^{\theta x_i x_l}\, \nu_{l\to i}(x_l) \Bigg\}
\end{aligned}$$

with the uniform initialization $\nu_{i\to j}(x_i) = \frac{1}{|X|}$ for all leaves $i$.

Sum-product algorithm (for our example):
$$\nu_{i\to j}(x_i) \propto \prod_{k\in\partial i\setminus j} \Bigg\{ \sum_{x_k} e^{\theta x_i x_k}\, \nu_{k\to i}(x_k) \Bigg\}, \qquad \nu_i(x_i) \equiv \prod_{k\in\partial i} \Bigg\{ \sum_{x_k} e^{\theta x_i x_k}\, \nu_{k\to i}(x_k) \Bigg\}, \qquad \mu_T(x_i) = \frac{\nu_i(x_i)}{\sum_{x_i} \nu_i(x_i)}$$

What if we want all the marginals?
- choose an arbitrary root
- compute all the messages towards the root ($|E|$ messages)
- then compute all the messages outwards from the root ($|E|$ messages)
- then compute all the marginals ($n$ marginals)

How many operations are required? A naive implementation requires $O(|X|^2 \sum_i d_i^2)$: if $i$ has degree $d_i$, then computing $\nu_{i\to j}$ requires $d_i |X|^2$ operations, and $d_i$ messages start at each node $i$, each requiring $d_i |X|^2$ operations, so the total computation for the $2|E|$ messages is $\sum_i d_i (d_i |X|^2)$. However, we can compute all marginals in $O(n|X|^2)$ operations.

Let $D = \{(i,j),(j,i) : (i,j)\in E\}$ be the directed version of $E$ (so $|D| = 2|E|$).

(Sequential) sum-product algorithm:
1. initialize $\nu_{i\to j}(x_i) = 1/|X|$ for all leaves $i$
2. recursively over $(i,j)\in D$ (from the leaves), compute
$$\nu_{i\to j}(x_i) = \prod_{k\in\partial i\setminus j} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu_{k\to i}(x_k) \Bigg\}$$
3. for each $i\in V$ compute the marginal
$$\nu_i(x_i) = \prod_{k\in\partial i} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu_{k\to i}(x_k) \Bigg\}, \qquad \mu_T(x_i) = \frac{\nu_i(x_i)}{\sum_{x_i} \nu_i(x_i)}$$
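A minimal sketch of this sequential schedule on a small tree; the adjacency list, the random pairwise potentials, and the recursive (memoized) evaluation order are all hypothetical implementation choices.

```python
# Sequential sum-product on a tree: nu_{i->j} is computed only after all
# nu_{k->i}, k in N(i)\{j}, are available; memoized recursion handles the order.
import numpy as np

rng = np.random.default_rng(3)
k = 2
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]     # a small hypothetical tree
nodes = sorted({u for e in edges for u in e})
nbrs = {u: [] for u in nodes}
psi = {}
for (u, v) in edges:
    nbrs[u].append(v); nbrs[v].append(u)
    t = rng.random((k, k))
    psi[(u, v)], psi[(v, u)] = t, t.T        # psi[(u, v)][x_u, x_v]

memo = {}
def message(i, j):
    """nu_{i->j}(x_i), computed recursively from the leaves (normalized)."""
    if (i, j) not in memo:
        m = np.ones(k)
        for w in nbrs[i]:
            if w != j:
                m *= psi[(i, w)] @ message(w, i)   # sum over x_w
        memo[(i, j)] = m / m.sum()
    return memo[(i, j)]

for i in nodes:
    nu = np.ones(k)
    for w in nbrs[i]:
        nu *= psi[(i, w)] @ message(w, i)
    print(i, nu / nu.sum())                   # marginal mu_T(x_i)
```

At a leaf the product over $\partial i \setminus j$ is empty, so the normalized message is uniform, matching the $1/|X|$ initialization above.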

(Parallel) sum-product algorithm:
1. initialize $\nu^{(0)}_{i\to j}(x_i) = 1/|X|$ for all $(i,j)\in D$
2. for $t\in\{0,1,\dots,t_{\max}\}$, for all $(i,j)\in D$ compute
$$\nu^{(t+1)}_{i\to j}(x_i) = \prod_{k\in\partial i\setminus j} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu^{(t)}_{k\to i}(x_k) \Bigg\}$$
3. for each $i\in V$ compute the marginal
$$\nu_i(x_i) = \prod_{k\in\partial i} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu^{(t_{\max}+1)}_{k\to i}(x_k) \Bigg\}, \qquad \mu_T(x_i) = \frac{\nu_i(x_i)}{\sum_{x_i} \nu_i(x_i)}$$

- also called belief propagation
- when $t_{\max}$ is larger than the diameter of the tree (the length of the longest path), this converges to the correct marginals
- more operations than the sequential version ($O(n|X|^2\,\mathrm{diam}(T))$); a naive implementation requires $O(|X|^2\,\mathrm{diam}(T)\sum_i d_i^2)$
- naturally extends to general graphs

Sum-product algorithm on general graphs

(Loopy) belief propagation:
1. initialize $\nu^{(0)}_{i\to j}(x_i) = 1/|X|$ for all $(i,j)\in D$
2. for $t\in\{0,1,\dots,t_{\max}\}$, for all $(i,j)\in D$ compute
$$\nu^{(t+1)}_{i\to j}(x_i) = \prod_{k\in\partial i\setminus j} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu^{(t)}_{k\to i}(x_k) \Bigg\}$$
3. for each $i\in V$ compute the marginal estimate
$$\nu_i(x_i) = \prod_{k\in\partial i} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu^{(t_{\max}+1)}_{k\to i}(x_k) \Bigg\}, \qquad \mu(x_i) \approx \frac{\nu_i(x_i)}{\sum_{x_i} \nu_i(x_i)}$$

- computes approximate marginals in $O(n|X|^2 t_{\max})$ operations
- in general it does not converge, and even when it does, the output is not exact
- folklore about loopy BP: it works better when $G$ has few short loops; it works better when $\psi_{ij}(x_i,x_j) = \psi_{ij,1}(x_i)\,\psi_{ij,2}(x_j) + \mathrm{small}(x_i,x_j)$; it corresponds to a nonconvex variational principle
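A sketch of loopy BP on a small graph with a cycle, run for a fixed number of parallel rounds; the graph, the potentials, and the iteration count t_max are hypothetical, and messages are normalized each round.

```python
# Loopy belief propagation: parallel message updates on a graph with a cycle.
# On trees this reduces to the exact algorithm; here it is only approximate.
import numpy as np

rng = np.random.default_rng(4)
k, t_max = 2, 50
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]     # a triangle plus a pendant node
nodes = sorted({u for e in edges for u in e})
nbrs = {u: [] for u in nodes}
psi = {}
for (u, v) in edges:
    nbrs[u].append(v); nbrs[v].append(u)
    t = rng.random((k, k))
    psi[(u, v)], psi[(v, u)] = t, t.T

# one message per directed edge, initialized uniform
msg = {(i, j): np.ones(k) / k for i in nodes for j in nbrs[i]}
for _ in range(t_max):
    new = {}
    for (i, j) in msg:
        m = np.ones(k)
        for w in nbrs[i]:
            if w != j:
                m *= psi[(i, w)] @ msg[(w, i)]  # sum over x_w
        new[(i, j)] = m / m.sum()
    msg = new                                   # fully parallel update

for i in nodes:
    b = np.ones(k)
    for w in nbrs[i]:
        b *= psi[(i, w)] @ msg[(w, i)]
    print(i, b / b.sum())                       # approximate marginal (belief)
```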

Exercise: partition function on trees

Using the recursion for (unnormalized) messages,
$$\nu_{i\to j}(x_i) = \prod_{k\in\partial i\setminus j} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu_{k\to i}(x_k) \Bigg\}, \qquad \nu_i(x_i) = \prod_{k\in\partial i} \Bigg\{ \sum_{x_k} \psi_{ik}(x_i,x_k)\, \nu_{k\to i}(x_k) \Bigg\},$$
it follows that $Z(T) = \sum_{x_i} \nu_i(x_i)$.

Alternatively, if we had a black box that computes marginals for any tree, we could use it to compute partition functions efficiently:
$$Z(T_{i\to j}) = \sum_{x_i\in X} \prod_{k\in\partial i\setminus j} \Bigg\{ \sum_{x_k\in X} \psi_{ik}(x_i,x_k)\, Z(T_{k\to i})\, \mu_{k\to i}(x_k) \Bigg\}, \qquad Z(T) = \sum_{x_i\in X} \prod_{k\in\partial i} \Bigg\{ \sum_{x_k\in X} \psi_{ik}(x_i,x_k)\, Z(T_{k\to i})\, \mu_{k\to i}(x_k) \Bigg\}$$

This recursive algorithm naturally extends to general graphs.

Suppose you observe $x = (+1,+1,+1,+1,+1,+1,+1,+1,+1)$ and you know this comes from one of two trees [shown in the figure], each with
$$\mu(x) = \frac{1}{Z(T)} \prod_{(i,j)\in E} \psi(x_i,x_j)$$
(e.g. coloring). Which one has the highest likelihood?

Exercise: sampling on the tree

If we have a black box for computing marginals on any tree, we can use it to sample from any distribution on a tree.

Sampling(Tree $T=(V,E)$, $\psi = \{\psi_{ij}\}_{(ij)\in E}$):
1: choose a root $o\in V$;
2: sample $X_o \sim \mu_o(\cdot)$;
3: recursively over $i\in V$ (from the root to the leaves):
4: compute $\mu_{i\mid\pi(i)}(x_i \mid x_{\pi(i)})$;
5: sample $X_i \sim \mu_{i\mid\pi(i)}(\cdot \mid x_{\pi(i)})$;

where $\pi(i)$ is the parent of node $i$ in the rooted tree $T_o$. We use the black box to compute the conditional distribution:
$$\mu_T(x_{V_{i\to j}} \mid x_j) \propto \psi_{ij}(x_i,x_j)\, \mu_{i\to j}(x_{V_{i\to j}}), \qquad \mu_T(x_i \mid x_j) \propto \psi_{ij}(x_i,x_j)\, \mu_{i\to j}(x_i)$$
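A sketch of this ancestral sampling scheme, reusing the message recursion from the tree example above as the "black box"; the tree, the potentials, and the root choice are hypothetical.

```python
# Ancestral sampling on a tree MRF: root it, then sample each node from its
# conditional given the already-sampled parent, using sum-product messages.
import numpy as np

rng = np.random.default_rng(5)
k = 2
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
nodes = sorted({u for e in edges for u in e})
nbrs = {u: [] for u in nodes}
psi = {}
for (u, v) in edges:
    nbrs[u].append(v); nbrs[v].append(u)
    t = rng.random((k, k))
    psi[(u, v)], psi[(v, u)] = t, t.T

memo = {}
def message(i, j):
    """nu_{i->j}(x_i); with j=None it multiplies all neighbors, giving mu(x_i)."""
    if (i, j) not in memo:
        m = np.ones(k)
        for w in nbrs[i]:
            if w != j:
                m *= psi[(i, w)] @ message(w, i)
        memo[(i, j)] = m / m.sum()
    return memo[(i, j)]

root = 0
sample, stack = {}, [(root, None)]
while stack:
    i, parent = stack.pop()
    if parent is None:
        b = message(i, None)                    # root marginal mu_o(x_i)
    else:
        # mu(x_i | x_parent) is proportional to psi(x_parent, x_i) nu_{i->parent}(x_i)
        b = psi[(parent, i)][sample[parent]] * message(i, parent)
    b = b / b.sum()
    sample[i] = int(rng.choice(k, p=b))
    stack.extend((w, i) for w in nbrs[i] if w != parent)
print(sample)
```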

Tree decomposition

When we don't have a tree, we can create an equivalent tree graph by enlarging the alphabet: $X \to X^k$.

$\mathrm{Treewidth}(G)$: the minimum such $k$.

- it is NP-hard to determine the treewidth of a graph
- problem: in general $\mathrm{Treewidth}(G) = \Theta(n)$

Tree decomposition of $G = (V,E)$: a tree $T = (V_T, E_T)$ and a mapping $\mathcal{V}: V_T \to \mathrm{SUBSETS}(V)$ such that:
- for each $i\in V$ there exists at least one $u\in V_T$ with $i\in\mathcal{V}(u)$
- for each $(i,j)\in E$ there exists at least one $u\in V_T$ with $i,j\in\mathcal{V}(u)$
- if $i\in\mathcal{V}(u_1)$ and $i\in\mathcal{V}(u_2)$, then $i\in\mathcal{V}(w)$ for every $w$ on the path between $u_1$ and $u_2$ in $T$

Belief propagation for factor graphs

[figure: a factor graph with variable nodes $x_1,\dots,x_6$ and factor nodes $a,b,c,d$; e.g. the factor $\psi_a(x_1,x_5,x_6)$]

$$\mu(x) = \frac{1}{Z} \prod_{a\in F} \psi_a(x_a)$$

With variable nodes $i,j,\dots$ and factor nodes $a,b,\dots$:
- messages from variables to factors: $\displaystyle \nu_{i\to a}(x_i) = \prod_{b\in\partial i\setminus\{a\}} \tilde{\nu}_{b\to i}(x_i)$
- messages from factors to variables: $\displaystyle \tilde{\nu}_{a\to i}(x_i) = \sum_{x_{\partial a\setminus\{i\}}} \psi_a(x_{\partial a}) \prod_{j\in\partial a\setminus\{i\}} \nu_{j\to a}(x_j)$
- marginals: $\displaystyle \nu_i(x_i) = \prod_{b\in\partial i} \tilde{\nu}_{b\to i}(x_i)$

This includes belief propagation (= the sum-product algorithm) on general Markov random fields, and it is exact on factor trees.
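A sketch of these two message types on a tiny factor graph; the factor scopes, the tables, and the fixed number of parallel rounds are hypothetical choices (enough rounds for this tree-structured example to converge).

```python
# Belief propagation on a factor graph: variable->factor and factor->variable
# messages, then per-variable beliefs. Exact here because the graph is a tree.
import itertools
import numpy as np

rng = np.random.default_rng(6)
k = 2
factors = {"a": (0, 1), "b": (1, 2, 3)}       # hypothetical factor scopes
tables = {a: rng.random((k,) * len(s)) for a, s in factors.items()}
variables = sorted({i for s in factors.values() for i in s})

v2f = {(i, a): np.ones(k) / k for a, s in factors.items() for i in s}
f2v = {(a, i): np.ones(k) / k for a, s in factors.items() for i in s}

for _ in range(10):  # parallel rounds; plenty for this small tree
    for a, scope in factors.items():
        for i in scope:
            # factor -> variable: sum out all of the factor's other variables
            m = np.zeros(k)
            others = [j for j in scope if j != i]
            for xs in itertools.product(range(k), repeat=len(others)):
                idx = dict(zip(others, xs))
                for xi in range(k):
                    idx[i] = xi
                    w = tables[a][tuple(idx[j] for j in scope)]
                    for j in others:
                        w *= v2f[(j, a)][idx[j]]
                    m[xi] += w
            f2v[(a, i)] = m / m.sum()
    for a, scope in factors.items():
        for i in scope:
            # variable -> factor: product of messages from the other factors
            m = np.ones(k)
            for b, s in factors.items():
                if b != a and i in s:
                    m *= f2v[(b, i)]
            v2f[(i, a)] = m / m.sum()

for i in variables:
    belief = np.ones(k)
    for a, s in factors.items():
        if i in s:
            belief *= f2v[(a, i)]
    print(i, belief / belief.sum())           # marginal mu(x_i)
```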
