Message passing and approximate message passing

Message passing and approximate message passing Arian Maleki Columbia University 1 / 47

What is the problem?
- Given a pdf $\mu(x_1, x_2, \dots, x_n)$ we are interested in:
  - computing $\arg\max_{x_1,\dots,x_n} \mu(x_1, x_2, \dots, x_n)$
  - calculating the marginal pdf $\mu_i(x_i)$
  - calculating the joint distribution $\mu_S(x_S)$ for $S \subset \{1, 2, \dots, n\}$
- These problems are important because:
  - many inference problems take these forms
  - many applications in image processing, communication, machine learning, signal processing
- This talk: marginalizing a distribution; most of the tools carry over to the other problems
2 / 47

This talk: marginalization
- Problem: given a pdf $\mu(x_1, x_2, \dots, x_n)$, find $\mu_i(x_i)$
- Simplification: all $x_j$'s are binary random variables
- Is it difficult? Yes: no polynomial-time algorithm; the complexity of variable elimination is $2^{n-1}$
- Solution: no good solution for a generic $\mu$; consider structured $\mu$
3 / 47

Structured µ
- Problem: given a pdf $\mu(x_1, \dots, x_n)$, find $\mu_i(x_i)$
- Simplification: all $x_j$'s are binary random variables
- Example 1: $x_1, x_2, \dots, x_n$ are independent
  - easy marginalization, but not interesting
- Example 2: independent subsets, $\mu(x_1,\dots,x_n) = \mu_1(x_1,\dots,x_k)\,\mu_2(x_{k+1},\dots,x_n)$
  - more interesting and less easy
  - complexity still exponential in the size of the subsets
  - does not cover many interesting cases
4 / 47

Structured µ: cont'd
- Problem: given a pdf $\mu(x_1,\dots,x_n)$, find $\mu_i(x_i)$
- Simplification: all $x_j$'s are binary random variables
- Example 3: a more interesting structure: $\mu(x_1,\dots,x_n) = \mu_1(x_{S_1})\,\mu_2(x_{S_2})\cdots\mu_\ell(x_{S_\ell})$
  - unlike independence, the sets $S_i$ and $S_j$ may overlap
  - covers many interesting practical problems
  - suppose $|S_i| \ll n$ for all $i$: is marginalization easier than for a generic $\mu$? Not clear!
- We introduce the factor graph to address this question
5 / 47

Factor graph
- Slight generalization: $\mu(x_1, x_2, \dots, x_n) = \frac{1}{Z}\prod_{a \in F}\psi_a(x_{S_a})$
  - $\psi_a : \{0,1\}^{|S_a|} \to \mathbb{R}_+$, not necessarily a pdf; the $\psi_a(x_{S_a})$ are called factors
  - $Z$: normalizing constant, called the "partition function"
- Factor graph:
  - variable nodes: $n$ nodes corresponding to $x_1, x_2, \dots, x_n$
  - factor nodes: $|F|$ nodes corresponding to $\psi_a(x_{S_a}), \psi_b(x_{S_b}), \dots$
  - edge between factor node $a$ and variable node $i$ iff $x_i \in S_a$
- (figure: bipartite graph with variable nodes $1,\dots,n$ and factor nodes $a, b, \dots$)
6 / 47

Factor graph cont'd
- $\mu(x_1, x_2, \dots, x_n) = \frac{1}{Z}\prod_{a\in F}\psi_a(x_{S_a})$
- Factor graph: variable nodes for $x_1,\dots,x_n$; factor nodes for $\psi_a(x_{S_a}), \psi_b(x_{S_b}), \dots$; edge between factor node $a$ and variable node $i$ iff $x_i \in S_a$
- Some notation:
  - $\partial a$: neighbors of factor node $a$, i.e. $\partial a = \{i : x_i \in S_a\}$
  - $\partial i$: neighbors of variable node $i$, i.e. $\partial i = \{b \in F : i \in S_b\}$
7 / 47

Factor graph example
- $\mu(x_1, x_2, x_3) = \frac{1}{Z}\,\psi_a(x_1, x_2)\,\psi_b(x_2, x_3)$
- (figure: variable nodes 1, 2, 3; factor nodes $a$, $b$)
- Neighborhoods: $\partial a = \{1,2\}$, $\partial b = \{2,3\}$, $\partial 1 = \{a\}$, $\partial 2 = \{a, b\}$ (a small code sketch of this graph follows below)
8 / 47
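
As an aside (not part of the slides), the neighborhood structure of this example can be encoded directly in code; the container and helper names below are illustrative assumptions, chosen only to mirror the $\partial a$ / $\partial i$ notation.

```python
# Slide-8 factor graph mu(x1,x2,x3) ∝ psi_a(x1,x2) psi_b(x2,x3),
# stored simply as the scope S_a of each factor.
factor_scopes = {"a": {1, 2}, "b": {2, 3}}

def factor_neighbors(a):
    """∂a: variable nodes appearing in the scope of factor a."""
    return factor_scopes[a]

def variable_neighbors(i):
    """∂i: factor nodes whose scope contains variable i."""
    return {a for a, scope in factor_scopes.items() if i in scope}

print(factor_neighbors("a"))   # {1, 2}
print(variable_neighbors(2))   # {'a', 'b'}
```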

Why factor graph?
- (figure: bipartite factor graph with variable nodes $1,\dots,n$ and factor nodes $a, b, \dots$)
- Factor graphs simplify understanding the structure:
  - independence: the graph has disconnected pieces
  - tree graphs: easy marginalization
  - loops (especially short loops): generally make the problem more complicated
9 / 47

Marginalization on tree-structured graphs
- Tree: a graph with no loops
- Marginalization by variable elimination is efficient on such graphs (distributions):
  - linear in $n$
  - exponential in the size of the factors
- Proof: on the board
10 / 47

Summary of variable elimination on trees
- Notation: $\mu(x_1, x_2, \dots, x_n) = \prod_{a\in F}\psi_a(x_{\partial a})$
  - $\hat\mu_{j\to a}$: belief from variable node $j$ to factor node $a$
  - $\mu_{a\to j}$: belief from factor node $a$ to variable node $j$
- We also have
  $\mu_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{l\in\partial a\setminus j}\hat\mu_{l\to a}(x_l)$
  $\hat\mu_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a}\mu_{b\to j}(x_j)$
11 / 47

Belief propagation algorithm on trees
- Based on variable elimination on trees, consider the iterative algorithm:
  $\nu^{t+1}_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{l\in\partial a\setminus j}\hat\nu^t_{l\to a}(x_l)$
  $\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a}\nu^t_{b\to j}(x_j)$
- $\nu^t_{a\to j}$ and $\hat\nu^t_{j\to a}$ are the beliefs of the nodes at time $t$
- Each node propagates its belief to the other nodes
- Hope: converge to a good fixed point, i.e. $\nu^t_{b\to j}(x_j) \to \mu_{b\to j}(x_j)$ and $\hat\nu^t_{j\to b}(x_j) \to \hat\mu_{j\to b}(x_j)$ as $t\to\infty$
12 / 47

Belief propagation algorithm on trees
- Belief propagation:
  $\nu^{t+1}_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{l\in\partial a\setminus j}\hat\nu^t_{l\to a}(x_l)$
  $\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a}\nu^t_{b\to j}(x_j)$
- Why BP and not variable elimination?
  - on trees: no good reason
  - on loopy graphs: BP can be easily applied
13 / 47

Belief propagation algorithm on trees
- Belief propagation:
  $\nu^{t+1}_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{l\in\partial a\setminus j}\hat\nu^t_{l\to a}(x_l)$
  $\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a}\nu^t_{b\to j}(x_j)$
- Lemma: on a tree graph, BP converges to the marginal distributions in a finite number of iterations, i.e.
  $\nu^t_{b\to j}(x_j) \to \mu_{b\to j}(x_j)$ and $\hat\nu^t_{j\to b}(x_j) \to \hat\mu_{j\to b}(x_j)$ as $t\to\infty$
14 / 47

Summary of belief propagation on trees
- Equivalent to variable elimination
- Exact on trees (see the sketch below for a small numerical check)
- Converges in a finite number of iterations: 2 times the maximum depth of the tree
15 / 47
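
To make the updates concrete, here is a minimal sum-product sketch (not from the talk) on the tree factor graph of slide 8, with arbitrary example factor values; it checks the BP beliefs against brute-force marginals, illustrating exactness on trees. All names and numbers are assumptions for illustration.

```python
import itertools
import numpy as np

# Tree factor graph: mu(x1,x2,x3) ∝ psi_a(x1,x2) psi_b(x2,x3); variables 0,1,2 are binary.
factors = {
    "a": ((0, 1), np.array([[1.0, 2.0], [2.0, 1.0]])),   # psi_a over (x1, x2), example values
    "b": ((1, 2), np.array([[1.0, 0.5], [0.2, 2.0]])),   # psi_b over (x2, x3), example values
}
n_vars = 3
var_nb = {i: [a for a, (s, _) in factors.items() if i in s] for i in range(n_vars)}

# nu[(a, i)]: factor-to-variable message; nu_hat[(i, a)]: variable-to-factor message.
nu = {(a, i): np.ones(2) / 2 for a, (s, _) in factors.items() for i in s}
nu_hat = {(i, a): np.ones(2) / 2 for a, (s, _) in factors.items() for i in s}

for t in range(6):                        # more than 2x the tree depth, so BP has converged
    for (i, a) in list(nu_hat):           # variable -> factor: product of the other factors' messages
        m = np.ones(2)
        for b in var_nb[i]:
            if b != a:
                m = m * nu[(b, i)]
        nu_hat[(i, a)] = m / m.sum()
    for (a, j) in list(nu):               # factor -> variable: marginalize the factor
        scope, table = factors[a]
        m = np.zeros(2)
        for x in itertools.product([0, 1], repeat=len(scope)):
            w = float(table[x])
            for pos, l in enumerate(scope):
                if l != j:
                    w *= nu_hat[(l, a)][x[pos]]
            m[x[scope.index(j)]] += w
        nu[(a, j)] = m / m.sum()

def bp_marginal(i):                       # belief: product of incoming messages, normalized
    m = np.ones(2)
    for a in var_nb[i]:
        m = m * nu[(a, i)]
    return m / m.sum()

def exact_marginal(i):                    # brute force over all 2^n configurations
    p = np.zeros(2)
    for x in itertools.product([0, 1], repeat=n_vars):
        w = 1.0
        for scope, table in factors.values():
            w *= float(table[tuple(x[j] for j in scope)])
        p[x[i]] += w
    return p / p.sum()

for i in range(n_vars):
    print(i, bp_marginal(i), exact_marginal(i))   # the two columns agree on a tree
```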

Loopy belief propagation
- Apply belief propagation to a loopy graph:
  $\nu^{t+1}_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{l\in\partial a\setminus j}\hat\nu^t_{l\to a}(x_l)$
  $\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a}\nu^t_{b\to j}(x_j)$
- Wait until the algorithm converges
- Announce $\prod_{a\in\partial j}\nu_{a\to j}(x_j)$ (suitably normalized) as $\mu_j(x_j)$
16 / 47

Challenges for loopy BP
- Loopy BP does not necessarily converge
  - example: $\mu(x_1, x_2, x_3) \propto \mathbb{I}(x_1\neq x_2)\,\mathbb{I}(x_2\neq x_3)\,\mathbb{I}(x_3\neq x_1)$ on a 3-cycle
  - $t=1$: start with $x_1 = 0$; $t=2$: $x_2 = 1$; $t=3$: $x_3 = 0$; $t=4$: $x_1 = 1$; the beliefs keep oscillating (see the small demo below)
- Even if it converges, the limit is not necessarily the marginal distribution
- BP can have more than one fixed point
17 / 47
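
To see the oscillation concretely, here is a tiny loopy-BP run on that 3-cycle (a sketch, not from the slides). The symmetry-breaking initialization "every variable believes it is 0" is an assumption chosen to reproduce the trace above; with it, the belief at $x_1$ flips between 0 and 1 forever.

```python
import numpy as np

# Loopy BP on mu(x1,x2,x3) ∝ I(x1!=x2) I(x2!=x3) I(x3!=x1).
# A "not equal" factor simply flips the message it receives, and on a cycle each
# variable forwards the (single) message coming from its other incident factor.
edges = [(0, 1), (1, 2), (2, 0)]                                   # the three pairwise factors
msg = {(i, e): np.array([1.0, 0.0]) for e in edges for i in e}     # variable -> factor, "x_i = 0"

def other(e, i):
    return e[1] if e[0] == i else e[0]

for t in range(1, 7):
    # factor -> variable: sum_x I(x != x_i) * msg[(other endpoint, e)](x) = flipped message
    fac = {(e, i): msg[(other(e, i), e)][::-1] for e in edges for i in e}
    # variable -> factor: product of the messages from the other incident factors
    msg = {(i, e): np.prod([fac[(f, i)] for f in edges if i in f and f != e], axis=0)
           for e in edges for i in e}
    belief = np.prod([fac[(e, 0)] for e in edges if 0 in e], axis=0)
    print(t, belief / belief.sum())       # the belief at x_1 alternates; BP never converges
```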

Advantages of loopy BP
- In many applications it works really well:
  - coding theory
  - machine vision
  - compressed sensing
- In some cases, loopy BP can be analyzed theoretically
18 / 47

Some theoretical results on loopy BP
- The BP equations have at least one fixed point
  - the proof uses Brouwer's fixed-point theorem
  - BP does not necessarily converge to any of the fixed points
- BP is exact on trees
- BP is accurate for locally tree-like graphs (Gallager 1963; Luby et al. 2001; Richardson and Urbanke 2001)
- Several methods exist for verifying the correctness of BP: computation trees, Dobrushin condition, ...
19 / 47

Locally tree-like graphs
- BP is accurate for locally tree-like graphs
- Heuristic: sparse graphs (without many edges) can be locally tree-like
- How to check? Look at $\mathrm{girth}(G)$, the length of the shortest cycle: the graph is locally tree-like when the girth grows like $\log(n)$ (a sketch for computing the girth follows below)
- A random $(d_v, d_c)$ graph is locally tree-like with high probability
20 / 47
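
One simple way to check the girth numerically (a sketch, not from the talk): run a BFS from every vertex and record the shortest cycle witnessed by a non-tree edge; the minimum over all starting vertices is the girth of an undirected graph. The adjacency-list format below is an assumption for illustration.

```python
from collections import deque

def girth(adj):
    """Shortest cycle length in an undirected graph {vertex: set of neighbors};
    returns None if the graph is acyclic. A non-tree edge (u, w) found during the
    BFS from s witnesses a closed walk of length dist[u] + dist[w] + 1."""
    best = None
    for s in adj:
        dist, parent, q = {s: 0}, {s: None}, deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w], parent[w] = dist[u] + 1, u
                    q.append(w)
                elif w != parent[u]:
                    cycle = dist[u] + dist[w] + 1
                    if best is None or cycle < best:
                        best = cycle
    return best

# A 3-cycle with a pendant vertex attached: girth 3.
print(girth({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}))
```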

Dobrushin condition
- A sufficient condition for convergence of BP
- Influence of node $j$ on node $i$:
  $C_{ij} = \sup\big\{\,\|\mu(x_i\mid x_{V\setminus i,j}) - \mu(x_i\mid x'_{V\setminus i,j})\|_{TV} \;:\; x_l = x'_l\ \forall\, l\neq j \,\big\}$
- Intuition: if the influence of the nodes on each other is very small, then oscillations should not occur
- Theorem: if $\gamma = \sup_i\big(\sum_j C_{ij}\big) < 1$, then the BP marginals converge to the true marginals.
21 / 47

Summary of BP
- Belief propagation:
  $\nu^{t+1}_{a\to j}(x_j) = \sum_{x_{\partial a\setminus j}}\psi_a(x_{\partial a})\prod_{l\in\partial a\setminus j}\hat\nu^t_{l\to a}(x_l)$
  $\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a}\nu^t_{b\to j}(x_j)$
- Exact on trees
- Accurate on locally tree-like graphs
- No generic proving technique works on all problems
- Successful in many applications, even when we cannot prove convergence
22 / 47

More challenges
- What if $\mu$ is a pdf on $\mathbb{R}^n$? The extension seems straightforward:
  $\nu^{t+1}_{a\to j}(x_j) = \int \psi_a(x_{\partial a})\prod_{l\in\partial a\setminus j}\hat\nu^t_{l\to a}(x_l)\,dx_{\partial a\setminus j}$
  $\hat\nu^{t+1}_{j\to a}(x_j) = \prod_{b\in\partial j\setminus a}\nu^t_{b\to j}(x_j)$
- But the calculations become complicated: one has to discretize the pdf to a certain accuracy and perform many numerical multiple integrations
- Exception: Gaussian graphical models, $\mu(x) = \frac{1}{Z}\exp(-x^T J x + c^T x)$: all messages remain Gaussian
  - NOT the focus of this talk
23 / 47

Approximate message passing 24 / 47

Model
- $y = Ax_o + w$
  - $x_o$: $k$-sparse vector in $\mathbb{R}^N$
  - $A$: $n\times N$ design matrix
  - $y$: measurement vector in $\mathbb{R}^n$
  - $w$: measurement noise in $\mathbb{R}^n$
- (figure: $y = Ax + w$ with an $n\times N$ matrix $A$ and a $k$-sparse signal $x$)
25 / 47

LASSO
- Many useful heuristic algorithms:
  - noiseless: $\ell_1$-minimization: minimize $\|x\|_1$ subject to $y = Ax$, where $\|x\|_1 = \sum_i |x_i|$
  - noisy: LASSO: minimize $\frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$
- Both are convex optimizations
- Chen, Donoho, Saunders (96); Tibshirani (96)
26 / 47

Algorithmic challenges of sparse recovery
- Use convex optimization tools to solve the LASSO
- Computational complexity:
  - interior point methods: generic tools; appropriate for N < 5000
  - homotopy methods: use the structure of the LASSO; appropriate for N < 50000
  - first-order methods: low computational complexity per iteration, but require many iterations
28 / 47

Message passing for the LASSO
- Define: $\mu(dx) = \frac{1}{Z}\, e^{-\beta\lambda\|x\|_1 - \frac{\beta}{2}\|y - Ax\|_2^2}\,dx$
  - $Z$: normalization constant; $\beta > 0$
- As $\beta\to\infty$: $\mu$ concentrates around the solution of the LASSO, $\hat x_\lambda$
- Marginalize $\mu$ to obtain $\hat x_\lambda$
31 / 47

Message passing algorithm
- $\mu(dx) = \frac{1}{Z}\, e^{-\beta\lambda\sum_i |x_i| - \frac{\beta}{2}\sum_a (y_a - (Ax)_a)^2}\,dx$
- Belief propagation update rules:
  $\nu^{t+1}_{i\to a}(s_i) \propto e^{-\beta\lambda|s_i|}\prod_{b\in\partial i\setminus a}\hat\nu^t_{b\to i}(s_i)$
  $\hat\nu^t_{a\to i}(s_i) \propto \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2}\prod_{j\neq i}d\nu^t_{j\to a}(s_j)$
- Pearl (82)
34 / 47

Issues of the message passing algorithm
- Challenges:
  $\nu^{t+1}_{i\to a}(s_i) \propto e^{-\beta\lambda|s_i|}\prod_{b\in\partial i\setminus a}\hat\nu^t_{b\to i}(s_i)$
  $\hat\nu^t_{a\to i}(s_i) \propto \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2}\prod_{j\neq i}d\nu^t_{j\to a}(s_j)$
  - the messages are distributions over the real line
  - calculation of the messages is difficult: numerical integration, and the number of variables in the integral is $N-1$
  - the graph is not "tree-like": no good analysis tools
35 / 47

Challenges for the message passing algorithm
- Challenges: the messages are distributions over the real line; calculation of the messages is difficult; the graph is not "tree-like"
  $\nu^{t+1}_{i\to a}(s_i) \propto e^{-\beta\lambda|s_i|}\prod_{b\in\partial i\setminus a}\hat\nu^t_{b\to i}(s_i)$
  $\hat\nu^t_{a\to i}(s_i) \propto \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2}\prod_{j\neq i}d\nu^t_{j\to a}(s_j)$
- Solution: blessing of dimensionality:
  - as $N\to\infty$: $\hat\nu^t_{a\to i}(s_i)$ converges to a Gaussian
  - as $N\to\infty$: $\nu^t_{i\to a}(s_i)$ converges to $f_\beta(s; x, b) \equiv \frac{1}{z_\beta(x,b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s - x)^2}$
36 / 47

Simplifying the messages for large N
- Simplification of $\hat\nu^t_{a\to i}(s_i)$:
  $\hat\nu^t_{a\to i}(s_i) \propto \int e^{-\frac{\beta}{2}(y_a - (As)_a)^2}\prod_{j\neq i}d\nu^t_{j\to a}(s_j) = \mathbb{E}\exp\Big(-\tfrac{\beta}{2}\big(y_a - A_{ai}s_i - \sum_{j\neq i}A_{aj}s_j\big)^2\Big) = \mathbb{E}\exp\Big(-\tfrac{\beta}{2}(A_{ai}s_i - Z)^2\Big)$,
  where $Z \equiv y_a - \sum_{j\neq i}A_{aj}s_j$
- $Z \stackrel{d}{\approx} W$, with $W$ Gaussian:
  $\big|\mathbb{E}\,h_{s_i}(Z) - \mathbb{E}\,h_{s_i}(W)\big| \leq \frac{C_t}{N^{1/2}(\hat\tau^t_{a\to i})^{3/2}}$
38 / 47

Message passing
- As $N\to\infty$, $\hat\nu^t_{a\to i}$ converges to a Gaussian distribution.
- Theorem: let $x^t_{j\to a}$ and $\tau^t_{j\to a}/\beta$ denote the mean and variance of the distribution $\nu^t_{j\to a}$. Then there exists a constant $C_t$ such that
  $\sup_{s_i}\big|\hat\nu^t_{a\to i}(s_i) - \hat\phi^t_{a\to i}(s_i)\big| \leq \frac{C_t}{\sqrt{N(\hat\tau^t_{a\to i})^3}}$,
  where
  $\hat\phi^t_{a\to i}(ds_i) \equiv \sqrt{\frac{\beta A_{ai}^2}{2\pi\hat\tau^t_{a\to i}}}\; e^{-\beta(A_{ai}s_i - z^t_{a\to i})^2/2\hat\tau^t_{a\to i}}\,ds_i$,
  $z^t_{a\to i} \equiv y_a - \sum_{j\neq i}A_{aj}x^t_{j\to a}$, and $\hat\tau^t_{a\to i} \equiv \sum_{j\neq i}A_{aj}^2\tau^t_{j\to a}$.
- Donoho, M., Montanari (11)
39 / 47

Shape of $\nu^t_{i\to a}$: intuition
- Summary: $\hat\nu^t_{b\to i}(s_i)$ is (approximately) Gaussian, and $\nu^{t+1}_{i\to a}(s_i) \propto e^{-\beta\lambda|s_i|}\prod_{b\in\partial i\setminus a}\hat\nu^t_{b\to i}(s_i)$
- Shape of $\nu^{t+1}_{i\to a}(s_i)$:
  $\nu^{t+1}_{i\to a}(s_i) \approx \frac{1}{z_\beta(x,b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s - x)^2}$
40 / 47

Message passing (cont'd)
- Theorem: suppose that $\hat\nu^t_{a\to i} = \hat\phi^t_{a\to i}$, with mean $z^t_{a\to i}$ and variance $\hat\tau^t_{a\to i} = \hat\tau^t$. Then at the next iteration we have
  $\big|\nu^{t+1}_{i\to a}(s_i) - \phi^{t+1}_{i\to a}(s_i)\big| < C/n$,
  where
  $\phi^{t+1}_{i\to a}(s_i) \equiv \lambda f_\beta\big(\lambda s_i;\ \lambda\sum_{b\neq a}A_{bi}z^t_{b\to i},\ \lambda^2(1 + \hat\tau^t)\big)$
  and
  $f_\beta(s; x, b) \equiv \frac{1}{z_\beta(x,b)}\, e^{-\beta|s| - \frac{\beta}{2b}(s - x)^2}$.
- Donoho, M., Montanari (11)
41 / 47

Next asymptotic: β → ∞
- Reminder: $x^t_{j\to a}$ is the mean of $\nu^t_{j\to a}$, $z^t_{a\to i}$ is the mean of $\hat\nu^t_{a\to i}$, and $z^t_{a\to i} = y_a - \sum_{j\neq i}A_{aj}x^t_{j\to a}$
- Can we write $x^t_{i\to a}$ in terms of $z^t_{a\to i}$?
  $x^t_{i\to a} = \frac{1}{z_\beta}\int s_i\, e^{-\beta|s_i| - \beta(s_i - \sum_{b\neq a}A_{bi}z^t_{b\to i})^2}\,ds_i$
- Can be done, but the Laplace method simplifies it (as $\beta\to\infty$):
  $x^t_{i\to a} = \eta\big(\sum_{b\neq a}A_{bi}z^t_{b\to i}\big)$
42 / 47

Summary
- $x^t_{i\to a}$: mean of $\nu^t_{i\to a}$; $z^t_{a\to i}$: mean of $\hat\nu^t_{a\to i}$
- Message passing updates (a code sketch follows below):
  $z^t_{a\to i} = y_a - \sum_{j\neq i}A_{aj}x^t_{j\to a}$
  $x^{t+1}_{i\to a} = \eta\big(\sum_{b\neq a}A_{bi}z^t_{b\to i}\big)$
43 / 47
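
A minimal sketch of these per-edge updates (not from the slides): it keeps all $2nN$ messages $z_{a\to i}$ and $x_{i\to a}$ explicitly, and uses a fixed soft-threshold level `theta` as a simplifying assumption (the slide leaves the threshold of $\eta$ implicit). The function names and the final estimate $\hat x_i = \eta(\sum_b A_{bi} z_{b\to i})$ are illustrative choices.

```python
import numpy as np

def soft_threshold(u, theta):
    """eta(u; theta) = sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def mp_lasso(y, A, theta=0.1, n_iter=30):
    """Per-edge message passing for the LASSO (sketch): 2 n N messages,
    Z[a, i] = z_{a->i} and X[i, a] = x_{i->a}."""
    n, N = A.shape
    X = np.zeros((N, n))
    Z = np.zeros((n, N))
    for _ in range(n_iter):
        # z_{a->i} = y_a - sum_{j != i} A_{aj} x_{j->a}
        full = np.einsum('aj,ja->a', A, X)          # sum over all j, then add the excluded term back
        Z = y[:, None] - full[:, None] + A * X.T
        # x_{i->a} = eta( sum_{b != a} A_{bi} z_{b->i} )
        tot = np.einsum('bi,bi->i', A, Z)
        X = soft_threshold(tot[:, None] - (A * Z).T, theta)
    return soft_threshold(np.einsum('bi,bi->i', A, Z), theta)   # final per-variable estimate
```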

Comparison of MP and IST
- MP: $z^t_{a\to i} = y_a - \sum_{j\neq i}A_{aj}x^t_{j\to a}$, $\quad x^{t+1}_{i\to a} = \eta\big(\sum_{b\neq a}A_{bi}z^t_{b\to i}\big)$
- IST: $z^t_a = y_a - \sum_{j}A_{aj}x^t_{j}$, $\quad x^{t+1}_{i} = \eta\big(\sum_{b}A_{bi}z^t_{b}\big)$
- (figure: the dense bipartite graph between the measurements $y_1,\dots,y_n$ and the variables $x_1,\dots,x_N$, drawn for both algorithms)
44 / 47

Approximate message passing
- We can further simplify the messages:
  $x^{t+1} = \eta(x^t + A^T z^t;\ \lambda_t)$
  $z^t = y - Ax^t + \frac{\|x^t\|_0}{n}\, z^{t-1}$
- (a code sketch of this iteration follows below)
45 / 47
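
A minimal sketch of this iteration (not the talk's exact implementation): soft thresholding with a fixed level `theta` stands in for the $\lambda_t$ sequence, and setting `onsager=False` drops the correction term, which recovers plain IST. The synthetic data at the bottom are an illustrative assumption.

```python
import numpy as np

def soft_threshold(u, theta):
    """eta(u; theta) = sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def amp(y, A, theta=0.1, n_iter=30, onsager=True):
    """x^{t+1} = eta(x^t + A^T z^t; theta),  z^t = y - A x^t + (||x^t||_0 / n) z^{t-1}."""
    n, N = A.shape
    x, z = np.zeros(N), np.zeros(n)
    for _ in range(n_iter):
        corr = (np.count_nonzero(x) / n) * z if onsager else 0.0   # Onsager correction term
        z = y - A @ x + corr
        x = soft_threshold(x + A.T @ z, theta)
    return x

# Tiny synthetic example (assumed setup, for illustration only).
rng = np.random.default_rng(0)
n, N, k = 100, 250, 10
A = rng.standard_normal((n, N)) / np.sqrt(n)
x0 = np.zeros(N)
x0[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
y = A @ x0 + 0.01 * rng.standard_normal(n)
print("AMP MSE:", np.mean((amp(y, A, theta=0.05) - x0) ** 2))
print("IST MSE:", np.mean((amp(y, A, theta=0.05, onsager=False) - x0) ** 2))
```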

IST versus AMP
- IST: $x^{t+1} = \eta(x^t + A^T z^t;\ \lambda_t)$, $\quad z^t = y - Ax^t$
- AMP: $x^{t+1} = \eta(x^t + A^T z^t;\ \lambda_t)$, $\quad z^t = y - Ax^t + \frac{1}{n}\|x^t\|_0\, z^{t-1}$
- (figure: two plots of mean square error versus iteration, comparing the state evolution prediction with empirical performance)
46 / 47

Theoretical guarantees
- We can predict the performance of AMP at every iteration in the asymptotic regime
- Idea: $x^t + A^T z^t$ can be modeled as signal plus Gaussian noise
- We keep track of the noise variance (a rough sketch of this recursion follows below)
- I will discuss this part in the "Risk Seminar"
47 / 47
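
As a rough sketch of the state-evolution idea (one common convention, stated as an assumption rather than exactly as in the talk): $x^t + A^T z^t$ behaves like $x_o$ plus i.i.d. Gaussian noise of variance $\tau_t^2$, and $\tau_t^2$ follows a scalar recursion that can be evaluated by Monte Carlo. The fixed threshold `theta`, the initialization, and the assumed prior on $x_o$ are all illustrative choices.

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def state_evolution(x0_samples, delta, sigma2, theta, n_iter=30, mc=100_000, seed=0):
    """tau_{t+1}^2 = sigma^2 + (1/delta) * E[(eta(X0 + tau_t Z; theta) - X0)^2],
    with Z ~ N(0,1) and delta = n/N; the expectation is estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    x0 = rng.choice(np.asarray(x0_samples), size=mc)
    tau2 = sigma2 + np.mean(x0 ** 2) / delta          # simple initialization
    history = []
    for _ in range(n_iter):
        z = rng.standard_normal(mc)
        mse = np.mean((soft_threshold(x0 + np.sqrt(tau2) * z, theta) - x0) ** 2)
        tau2 = sigma2 + mse / delta
        history.append(tau2)
    return history

# Example prior on x_o: zero with probability 0.9, standard normal otherwise.
rng = np.random.default_rng(1)
samples = rng.standard_normal(10_000) * (rng.random(10_000) < 0.1)
print(state_evolution(samples, delta=0.4, sigma2=1e-4, theta=0.05)[-1])
```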