Information Projection Algorithms and Belief Propagation

1  Information Projection Algorithms and Belief Propagation
   Phil Regalia
   Department of Electrical Engineering and Computer Science
   Catholic University of America, Washington, DC
   with J. M. Walsh, Drexel University
   P. Dewilde Workshop, Wassenaar, June 2008

2  Outline
   1. Introduction
   2. Bregman Distance
   3. Iterative Projection Algorithms
   4. Belief Propagation
   5. Conclusion

3-5  Belief Propagation / Iterative Convex Projection Algorithms
   Belief propagation (Pearl, 1986) has met with success in:
   - error correction decoding (low-density parity-check codes, turbo codes, ...);
   - network diagnostics and link monitoring;
   - sensor self-localization;
   - distributed estimation in sensor networks;
   - lossy source quantization;
   - multi-user communications;
   - et cetera.
   Iterative convex projection algorithms have convergence proofs, unlike BP, for which proofs assume either:
   - a tree or forest dependency graph; or
   - arbitrarily large factor-graph girth.

6-8  Basic Query
   Can iterative convex projection algorithms lend insight into belief propagation?
   A priori yes, and the role of information projections and information geometry in BP is nothing new:
   - Moher and Gulliver (Trans. IT-98) rephrased iterative decoding as information projections in the sense of Csiszár;
   - Grant (ISIT-99) examined turbo decoding via information projections;
   - Richardson (Trans. IT-03) viewed the nonlinear dynamics of iterative decoding via (tacitly) information geometry;
   - Ikeda, Tanaka and Amari (Trans. IT-04) rephrased all of the above in terms of information geometry.
   Key obstacle: extrinsic information extraction goes against projections onto invariant sets.

9  Bregman Distance
   The graph of a convex function f(·) is lower bounded by any tangent hyperplane. This induces the gradient inequality
      f(r) \ge f(q) + \langle \nabla f(q),\, r - q \rangle,
   whose discrepancy in turn induces the Bregman distance
      D_f(r, q) = f(r) - f(q) - \langle \nabla f(q),\, r - q \rangle \ge 0.
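A minimal numerical sketch of this definition (an illustration added here, not from the slides; Python with NumPy assumed). It evaluates D_f for a user-supplied convex f and its gradient; the energy-function example anticipates slide 15.

```python
# Sketch: a generic Bregman distance from a convex f and its gradient,
# following D_f(r, q) = f(r) - f(q) - <grad f(q), r - q>.
import numpy as np

def bregman(f, grad_f, r, q):
    """Bregman distance D_f(r, q); nonnegative whenever f is convex."""
    return f(r) - f(q) - np.dot(grad_f(q), r - q)

# Example: the energy function f(q) = 0.5 ||q||^2 (see slide 15) gives
# half the squared Euclidean distance.
f = lambda q: 0.5 * np.dot(q, q)
grad_f = lambda q: q
r, q = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(f, grad_f, r, q))  # 1.0, i.e. 0.5 * ||r - q||^2
```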

10  Common examples
   Probability mass functions defined on M outcomes: q_i = \Pr(x = x_i), i = 0, 1, 2, \ldots, M-1. Introduce the convex domain
      D = \{ q = [q_1, \ldots, q_{M-1}]^T : q_1 \ge 0, \ldots, q_{M-1} \ge 0,\ q_1 + \cdots + q_{M-1} \le 1 \}.
   Setting q_0 = 1 - \sum_{i=1}^{M-1} q_i, the negative Shannon entropy
      f(q) = \Big( 1 - \sum_{i=1}^{M-1} q_i \Big) \log \Big( 1 - \sum_{i=1}^{M-1} q_i \Big) + \sum_{i=1}^{M-1} q_i \log q_i
   is convex over D.

11-12  The induced Bregman distance becomes
      D_f(r, q) = \sum_{i=0}^{M-1} r_i \log \frac{r_i}{q_i}
   and is recognized as the Kullback-Leibler divergence. Closely related is the Fenchel conjugate function
      f^*(\theta) = \sup_q \big( \langle q, \theta \rangle - f(q) \big) = \log \Big( \sum_{i=0}^{M-1} \exp(\theta_i) \Big), \quad with \theta_i = \log \frac{q_i}{q_0}.
   This is convex over a domain D^* \subset \mathbb{R}^M of vectors having first component zero ("log probability ratios").
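As a quick check of the claim, a sketch (added illustration, not from the slides) verifying numerically that the Bregman distance of the negative Shannon entropy coincides with the Kullback-Leibler divergence; for simplicity it uses the full-coordinate form f(q) = \sum_i q_i \log q_i on the simplex rather than the reduced (M-1)-coordinate parametrization above.

```python
# Sketch: Bregman distance of negative entropy vs. KL divergence.
import numpy as np

def neg_entropy(q):                 # f(q) = sum_i q_i log q_i
    return np.sum(q * np.log(q))

def grad_neg_entropy(q):            # [grad f(q)]_i = 1 + log q_i
    return 1.0 + np.log(q)

def bregman(f, grad_f, r, q):
    return f(r) - f(q) - np.dot(grad_f(q), r - q)

def kl(r, q):                       # D(r || q)
    return np.sum(r * np.log(r / q))

r = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(np.isclose(bregman(neg_entropy, grad_neg_entropy, r, q), kl(r, q)))  # True
```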

13  The induced Bregman distance is now
      D_{f^*}(\rho, \theta) = \log \frac{\sum_{i=0}^{M-1} \exp(\rho_i)}{\sum_{i=0}^{M-1} \exp(\theta_i)} - \sum_{i=0}^{M-1} \frac{\exp(\theta_i)}{\sum_{j=0}^{M-1} \exp(\theta_j)} (\rho_i - \theta_i).
   By associating
      q_i = \frac{\exp(\theta_i)}{\sum_{j=0}^{M-1} \exp(\theta_j)}, \qquad r_i = \frac{\exp(\rho_i)}{\sum_{j=0}^{M-1} \exp(\rho_j)},
   this assumes the more familiar form
      D_{f^*}(\rho, \theta) = \sum_{i=0}^{M-1} q_i \log \frac{q_i}{r_i}.

14  Observe that
      [\nabla f(q)]_i = \frac{\partial f(q)}{\partial q_i} = \log \frac{q_i}{q_0} = \theta_i, \qquad [\nabla f^*(\theta)]_i = \frac{\partial f^*(\theta)}{\partial \theta_i} = \frac{\exp(\theta_i)}{\sum_j \exp(\theta_j)} = q_i,
   giving inverse maps, as f(q) and f^*(\theta) are Legendre transforms of each other. In general, for convex f and its conjugate f^* in the Legendre class, their induced Bregman distances satisfy
      D_f(r, q) = D_{f^*}\big( \nabla f(q), \nabla f(r) \big).
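A short numerical check of the inverse-map property (added illustration), in the slide's coordinates where \theta_i = \log(q_i / q_0), so the first component of \theta is pinned to zero:

```python
# Sketch: grad f and grad f* invert each other (negative entropy / log-sum-exp pair).
import numpy as np

q = np.array([0.2, 0.5, 0.3])
theta = np.log(q / q[0])                        # grad f(q); first component is 0
q_back = np.exp(theta) / np.exp(theta).sum()    # grad f*(theta), a softmax
print(np.allclose(q, q_back))                   # True
```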

15  Consider finally the energy function
      f(q) = \frac{1}{2} \sum_{i=0}^{M-1} q_i^2, \qquad q \in \mathbb{R}^M;
   the Bregman distance becomes
      D_f(r, q) = \frac{1}{2} \sum_{i=0}^{M-1} (r_i - q_i)^2.
   As \nabla f(q) = q and f^* = f, this gives a symmetric Bregman distance: D_f(r, q) = D_f(q, r).

16  Bregman Projections
   If f(q) is a convex function over a domain D, and C is a convex subset of D, the Bregman projection of q onto C is
      \pi_C(q) = \arg\min_{r \in C} D_f(r, q),
   and is characterized by the inequality
      D_f(r, q) \ge D_f\big(r, \pi_C(q)\big) + D_f\big(\pi_C(q), q\big) \quad for all r \in C,
   or, equivalently,
      \langle \nabla f(q) - \nabla f\big(\pi_C(q)\big),\, r - \pi_C(q) \rangle \le 0 \quad for all r \in C.
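For intuition, a sketch of this characterization in the Euclidean case (added illustration; with f the energy function, \nabla f is the identity, and the projection onto a halfspace has a closed form):

```python
# Sketch: Euclidean Bregman projection onto the halfspace C = {r : <a, r> <= b},
# checking <grad f(q) - grad f(pi_C(q)), r - pi_C(q)> <= 0 with grad f = identity.
import numpy as np

def project_halfspace(q, a, b):
    """Closest point to q in {r : <a, r> <= b} under the energy-function distance."""
    slack = np.dot(a, q) - b
    return q if slack <= 0 else q - slack * a / np.dot(a, a)

a, b = np.array([1.0, 1.0]), 1.0
q = np.array([2.0, 2.0])
p = project_halfspace(q, a, b)              # pi_C(q) = [0.5, 0.5]
r = np.array([0.0, 1.0])                    # an arbitrary point of C
print(np.dot(q - p, r - p) <= 1e-12)        # True: the inequality holds
```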

17  Dykstra's Cyclic Projection Algorithm
   Seek the minimization
      \pi_C(q) = \arg\min_{r \in C} D_f(r, q),
   where C is the intersection of convex sets: C = \bigcap_{n=1}^{N} C_n.
   First stab: a sometimes-convergent algorithm,
      r_n = \pi_{C_n}(r_{n-1}) = \pi_{C_n}\big( \nabla f^*( \nabla f(r_{n-1}) ) \big),
   using C_{n+N} = C_n and initialization r_0 = q.

18  Improved algorithm, r_n \to \pi_C(q):
      r_n = \pi_{C_n}\big( \nabla f^*( \nabla f(r_{n-1}) + s_{n-N} ) \big)
      s_n = \nabla f(r_{n-1}) + s_{n-N} - \nabla f(r_n)
   using C_{n+N} = C_n and initialization r_0 = q, s_{-(N-1)} = \cdots = s_{-1} = s_0 = 0.
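A compact sketch of Dykstra's algorithm in the Euclidean case (added illustration), where \nabla f and \nabla f^* are both the identity, so the updates above read r_n = \pi_{C_n}(r_{n-1} + s_{n-N}) and s_n = r_{n-1} + s_{n-N} - r_n:

```python
# Sketch: Dykstra's cyclic projection algorithm, Euclidean case.
import numpy as np

def dykstra(q, projectors, sweeps=200):
    """Approximate the projection of q onto the intersection of the given sets."""
    N = len(projectors)
    r = q.copy()
    s = [np.zeros_like(q) for _ in range(N)]    # one correction term per set
    for _ in range(sweeps):
        for n, P in enumerate(projectors):
            y = r + s[n]                        # grad f(r_{n-1}) + s_{n-N}
            r_next = P(y)
            s[n] = y - r_next                   # s_n = r_{n-1} + s_{n-N} - r_n
            r = r_next
    return r

# Example: intersect the unit ball with the halfspace {x : x[0] >= 0.8}.
ball = lambda x: x if np.linalg.norm(x) <= 1 else x / np.linalg.norm(x)
slab = lambda x: np.array([max(x[0], 0.8), x[1]])
print(dykstra(np.array([2.0, 2.0]), [ball, slab]))   # -> [0.8, 0.6]
```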

19  Minimum Distance Algorithm
   Find the closest members:
      D_f(r^*, q^*) = \inf_{r \in C_1,\, q \in C_2} D_f(r, q), \quad where C_1 \cap C_2 = \emptyset.
   The cyclic projection algorithm becomes
      r_n = \pi_{C_1}\big( \nabla f^*( \nabla f(q_{n-1}) + v_{n-1} ) \big)
      v_n = \nabla f(q_{n-1}) + v_{n-1} - \nabla f(r_n)
      q_n = \pi_{C_2}\big( \nabla f^*( \nabla f(r_n) + w_{n-1} ) \big)
      w_n = \nabla f(r_n) + w_{n-1} - \nabla f(q_n)
   Convergent in the Euclidean case D_f(r, q) = \frac{1}{2} \sum_k (r_k - q_k)^2.
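A sketch of this recursion in the (convergent) Euclidean case for two disjoint sets (added illustration; two parallel lines, so the closest pair sits at distance 1):

```python
# Sketch: minimum-distance recursion, Euclidean case, C1 = {y = 0}, C2 = {y = 1}.
import numpy as np

pi_C1 = lambda x: np.array([x[0], 0.0])
pi_C2 = lambda x: np.array([x[0], 1.0])

q = np.array([5.0, 5.0])
v = np.zeros(2)
w = np.zeros(2)
for _ in range(20):
    r = pi_C1(q + v); v = q + v - r
    q = pi_C2(r + w); w = r + w - q
print(r, q)   # a closest pair, e.g. [5, 0] and [5, 1]
```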

20  Belief Propagation
   Iterative (sometimes "fast") algorithm to calculate marginal probability functions. Given a binary vector x = [x_1, \ldots, x_M] \in \{0,1\}^M and a probability function G(x), seek the marginals
      f_k(x_k) = \sum_{x_1=0}^{1} \cdots \sum_{x_{k-1}=0}^{1} \sum_{x_{k+1}=0}^{1} \cdots \sum_{x_M=0}^{1} G(x).
   Direct calculation is hard, since there are 2^M evaluations of x. BP is applicable when G(x) splits into simpler factors:
      G(x) = \prod_{k=1}^{K} g_k(x).
   Usually, each factor g_k depends only on a subset of the variables in x.
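The brute-force sum below makes the 2^M cost concrete (added illustration with hypothetical toy factors, not the slides' example):

```python
# Sketch: brute-force marginals of a factored G(x) over binary x (2^M terms).
import itertools
import numpy as np

M = 4
g1 = lambda x: 1.0 if (x[0] ^ x[1]) == 0 else 0.1         # soft parity on x1, x2
g2 = lambda x: 1.0 if (x[1] ^ x[2] ^ x[3]) == 0 else 0.1  # soft parity on x2..x4
G = lambda x: g1(x) * g2(x)                                # G(x) = g1(x) g2(x)

def marginal(k):
    f = np.zeros(2)
    for x in itertools.product([0, 1], repeat=M):          # all 2^M configurations
        f[x[k]] += G(x)
    return f / f.sum()

print([marginal(k) for k in range(M)])
```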

21  Message passing algorithm
      m_{x_i \to g_k}(x_i) = (variable x_i to factor g_k) = \beta_k \prod_{l \ne k} m_{g_l \to x_i}(x_i)
      m_{g_k \to x_i}(x_i) = (factor g_k to variable x_i) = \alpha_i \sum_{x_n : n \ne i} g_k(x) \prod_{l \ne i} m_{x_l \to g_k}(x_l)
   (Figure: factor graph with variable nodes x_1, \ldots, x_5 and factor nodes g_1(x), \ldots, g_6(x), annotated with the messages m_{x_1 \to g_1}(x_1) and m_{g_6 \to x_5}(x_5).)
   The outgoing message on an edge depends only on the incoming messages from the other edges at that node ("extrinsic information"). If convergent, the beliefs
      b_i(x_i) = \beta_i \prod_l m_{g_l \to x_i}(x_i)
   are thresholded: b_i(0) \gtrless b_i(1). Ideally, the beliefs would be the marginals.
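A self-contained sketch of these two update rules (added illustration: binary variables, a small hypothetical factor graph, flooding schedule, with the normalizations \alpha, \beta realized by rescaling each message to sum to one):

```python
# Sketch: sum-product message passing on a tiny factor graph.
import itertools
import numpy as np

scope = {0: (0, 1), 1: (1, 2)}                   # factor k acts on variables scope[k]
tables = {k: {x: 1.0 if sum(x) % 2 == 0 else 0.1
              for x in itertools.product([0, 1], repeat=len(v))}
          for k, v in scope.items()}             # soft parity factors
nbrs = {i: [k for k, v in scope.items() if i in v] for i in range(3)}

m_vf = {(i, k): np.ones(2) / 2 for i in nbrs for k in nbrs[i]}    # variable -> factor
m_fv = {(k, i): np.ones(2) / 2 for k in scope for i in scope[k]}  # factor -> variable

for _ in range(10):
    for (i, k) in m_vf:                          # product of other incoming messages
        msg = np.ones(2)
        for l in nbrs[i]:
            if l != k:
                msg *= m_fv[(l, i)]
        m_vf[(i, k)] = msg / msg.sum()
    for (k, i) in m_fv:                          # marginalize factor times other messages
        msg = np.zeros(2)
        pos_i = scope[k].index(i)
        for x in itertools.product([0, 1], repeat=len(scope[k])):
            w = tables[k][x]
            for pos, j in enumerate(scope[k]):
                if j != i:
                    w *= m_vf[(j, k)][x[pos]]
            msg[x[pos_i]] += w
        m_fv[(k, i)] = msg / msg.sum()

beliefs = {i: np.prod([m_fv[(k, i)] for k in nbrs[i]], axis=0) for i in nbrs}
print({i: b / b.sum() for i, b in beliefs.items()})   # exact marginals here (tree)
```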

22  For the projection interpretation, let x^1, \ldots, x^K be K copies of x, and consider the extended likelihood function
      G(x^1, \ldots, x^K) = \prod_{k=1}^{K} g_k(x^k).
   The original function is obtained under the constraint x^1 = \cdots = x^K = x. Two convex sets:
   Product distributions over the MK binary variables (2^{MK} outcomes):
      P = \Big\{ \theta : \nabla f^*(\theta) = \prod_{k=1}^{K} \prod_{m=1}^{M} q_{k,m}(x^k_m) \Big\}
   Constraint (or "sparse") distributions:
      Q = \{ r : r(x^1, \ldots, x^K) = 0 \ if \ x^i \ne x^j \ for \ any \ i \ne j \}

23-24  Information geometric view
   At factor nodes: given a pmf q, if r \in P is the product distribution built from the marginals of q, then
      D_f(q, s) = D_f(q, r) + \underbrace{D_f(r, s)}_{\ge 0} \quad for all s \in P,
   so that r is the Bregman projection of q onto P.
   At variable nodes: given a pmf q, if Q denotes the set of sparse distributions, then
      b_i = \begin{cases} \beta\, q_i, & i \in index set for Q; \\ 0, & otherwise; \end{cases}
   satisfies
      D_f(s, q) = D_f(s, b) + \underbrace{D_f(b, q)}_{\ge 0} \quad for all s \in Q,
   so that b (containing the beliefs) is the Bregman projection of q onto Q.
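A numerical check of the factor-node identity (added illustration: f the negative entropy, so D_f is the KL divergence; q a joint pmf on two bits, r the product of its marginals, s an arbitrary product pmf):

```python
# Sketch: Pythagorean identity D_f(q, s) = D_f(q, r) + D_f(r, s) at a factor node.
import numpy as np

kl = lambda a, b: np.sum(a * np.log(a / b))

q = np.array([[0.3, 0.1],
              [0.2, 0.4]])                       # joint pmf of (x1, x2)
r = np.outer(q.sum(axis=1), q.sum(axis=0))       # product of the marginals of q
s = np.outer([0.6, 0.4], [0.25, 0.75])           # any product distribution in P
print(np.isclose(kl(q, s), kl(q, r) + kl(r, s))) # True
```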

25  Calculation of marginal pmfs
   The ideal solution is
      p = \pi_Q\big( \nabla f^*(\chi_{-1}) \big)   (constrain to sparse distributions)
      \chi_0 = \pi_P\big( \nabla f(p) \big)   (calculate marginal distributions)
   with initialization
      \chi_{-1} = \nabla f\Big( \prod_k g_k(x^k) \Big).

26  Belief propagation is the cyclic projection algorithm
      \xi_n = \pi_P(\chi_{n-1} + \sigma_{n-1})
      \sigma_n = \chi_{n-1} + \sigma_{n-1} - \xi_n
      p_n = \pi_Q\big( \nabla f^*(\xi_n + \tau_{n-1}) \big)
      \chi_n = \pi_P\big( \nabla f(p_n) \big)
      \tau_n = \xi_n + \tau_{n-1} - \chi_n
   with initialization
      \chi_{-1} = \nabla f\Big( \prod_k g_k(x^k) \Big), \qquad \sigma_{-1} = \tau_{-1} = 0.
   (Figure: the iterates shuttle between the product-distribution set P and the sparse set Q through the maps \pi_P, \pi_Q, \nabla f, and \nabla f^*.)

27-28  Special Case
   Replace the negative Shannon entropy by the energy function to get "Euclidean belief propagation":
      \xi_n = \pi_P(\chi_{n-1} + \sigma_{n-1})
      \sigma_n = \chi_{n-1} + \sigma_{n-1} - \xi_n
      p_n = \pi_Q(\xi_n + \tau_{n-1})
      \chi_n = \pi_P(p_n)
      \tau_n = \xi_n + \tau_{n-1} - \chi_n
   Given arbitrary convex sets P and Q, one can show that \{\chi_n\} converges for any initial condition \chi_{-1}. Conventional belief propagation, by contrast, is convergent when either of the following applies:
   - the factor graph is a tree/forest (or nearly so: large girth);
   - the initialization \chi_{-1} is a product distribution (or nearly so: \pi_P(\chi_{-1}) \approx \chi_{-1}).
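A sketch of this Euclidean recursion with two hypothetical convex sets standing in for the product and sparse distribution sets (added illustration; P a line through the origin, Q a box):

```python
# Sketch: "Euclidean belief propagation" recursion with toy convex sets.
import numpy as np

pi_P = lambda x: np.full(2, x.mean())            # project onto the line {x0 = x1}
pi_Q = lambda x: np.minimum(x, 1.0)              # project onto the box {x <= 1}

chi = np.array([3.0, -1.0])                      # chi_{-1}: arbitrary initialization
sigma = np.zeros(2)
tau = np.zeros(2)
for n in range(50):
    xi = pi_P(chi + sigma)
    sigma = chi + sigma - xi
    p = pi_Q(xi + tau)
    chi = pi_P(p)
    tau = xi + tau - chi
print(chi)   # the sequence {chi_n} settles regardless of the initialization
```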

29  Further perspectives
   - Rephrasing belief propagation in terms of cyclic convex projection algorithms suggests that convergence studies of the latter might extend to the former.
   - Euclidean belief propagation is always convergent, whereas conventional (information projection-based) belief propagation can diverge in some settings.
   - Continuity of projectors can be invoked to extend the cases where belief propagation is proved to converge to good solutions.
   - Can some modification to belief propagation give a more faithful transcription of Dykstra's algorithm?
