Inferring Sparsity: Compressed Sensing Using Generalized Restricted Boltzmann Machines. Eric W. Tramel. itwist 2016 Aalborg, DK 24 August 2016

Size: px

Start display at page:

Download "Inferring Sparsity: Compressed Sensing Using Generalized Restricted Boltzmann Machines. Eric W. Tramel. itwist 2016 Aalborg, DK 24 August 2016"

Phyllis Allen
6 years ago
Views:

1 Inferring Sparsity: Compressed Sensing Using Generalized Restricted Boltzmann Machines Eric W. Tramel itwist 2016 Aalborg, DK 24 August 2016 Andre MANOEL, Francesco CALTAGIRONE, Marylou GABRIE, Florent KRZAKALA

2 Inverse Problems General Linear Problem: y = g (F x) (M N) (N 1) (M 1) F x g y Projection Matrix Signal? Channel Measurements Compressed Sensing, Regression, Deconvolution/Debluring, Localization, Super-Resolution, Medical Image Reconstruction (CT/MRI), In-Painting, Denoising, Inference, etc.

3 Example Compressed Sensing y = F x + w w µ N (0, ) How do we obtain x from y knowing. OLS is under-determined, in general we can t! g is AWGN, x is K-Sparse, F is iid random, and M << N? With Sparsity, we can! Null(F ) x y F x + s x y F x + s (EC & TT, 2005) (EC, JR, & TT, 2006) (EC & MW, 2008)

4 Sparsity & Recovery For M>=K, we can recover with OLS, up to noise, if we are given the support locations by an oracle. = = However: Without an oracle, finding S brute-force is a combinatorial problem! arg min S2S y F S x S 2 2

5 Optimization Approaches y = F x + w w µ N (0, ) Greedy Approach ˆx = arg min x x 0 s.t. y F x 2 2 apple ˆx =argmin x y F x 2 2 s.t. x 0 apple K Greedily searching for support, solving OLS support. Convex Approach ˆx =argmin x 1 s.t. y F x 2 2 apple x ˆx = arg min y F x x 1 x Relax L0 penalty to convex L1 penalty ( pointiest convex Lp)

6 Phase Diagram for CS StOMP L1 NP-Hard? Un-Recoverable

7 Bayesian Approaches y = F x + w w µ N (0, ) Maximum a posteriori (MAP) ˆx = arg max x P (x y,f) Find signal to maximize probability. Can use unnormalized posterior minimize negative log prob. For some settings maps to convex optimization. Minimum Mean Square Error (MMSE) ˆx = E[x] = Z dx x P (x y,f) Average over posterior distribution.

8 Defining the Posterior Bayes' Rule P (x y,f)= 1 Z P (y x,f)p 0(x) Likelihood defined by stochastic description of g. Posterior Factorized Prior, AWGN Channel P (x y,f)= 1 Z Y µ 1 p 2 exp 8 < : 1 2 y µ X i! 9 2 = Y F µi x i ; i P 0 (x i ) For exact posterior, we must calculate an intractable Z! Inference: We can approximate it with Belief Propagation.

9 Graphical Model of Posterior x 1 P 0 (x i ) x 2 x 3 y 1,F 1 y 2,F 2 y M,F M P (y F x) x N Factors Prior Variables Coefficients Factors Measurements Loopy Belief Propagation The presence of many loops makes exact inference impossible, but approximate inference may be tractable and accurate [Weiss 2000].

10 Inference via BP Messages Messages Variable Factor Variable Factor P 0 x 1 PDFs m i!µ (x i ) P 0 x 1 PDFs m µ!i (x i ) y 1,F 1 y 1,F 1 x 2 x 2 y 2,F 2 y 2,F 2 x 3 x 3 y M,F M y M,F M x N x N Factors Prior Variables Coefficients Factors Measurements Factors Prior Variables Coefficients Factors Measurements Goal: Produce P (x i y,f) / P 0 (x i ) Y µ m µ!i (x i )

11 r-bp to AMP via TAP A Corrected Mean-Field (Donoho, Maleki, Montanari 2009) If F is dense and if its entries are uncorrelated, then message means and variances are nearly independent of any single edge message in the limit N. x 1 x 2 Onsager Correction y 1,F 1 x 3 y µ,f µ P 0 (x i ) x i y 2,F 2 y M,F M x N Big Savings: Compute Burden O( N 2 )! O((1 + )N)

12 r-bp/amp with GB Prior StOMP L1 BP/AMP NP-Hard? Un-Recoverable

13 r-bp/tap Convergence Small iteration count far from transition provides efficient estimation. (Krzakala et al., 2012) However, the approach encounters critical slowing at or near the transition, as shown via state evolution analysis.

14 Structured Signal Priors But what if we have more information about our signal? Correlations? Exact Inference Approximate Inference x 1 x 2 x 2 x 4 x 1 x 2 x 1 x 2 x 1 x 3 x 4 x 3 x 4 x 3 x 3 Pairwise Interactions High Order Int. For completely general visible models, see (Rangan et al, Hybrid-GAMP, 2012).

15 Regression & Latent Variables x 1 y 1,F 1 x 1 y 1,F 1 y 2,F 2 y 2,F 2 x 2 x 2 W 11 x 3 x 3 W 21 W 31 h 1 h 1 W 12 W 22 h 2 h 2 W 32 Binary Restricted Boltzmann Machine (RBM) P 0 (x, h) = 1 Z ep il x iw il h l + P i b ix i + P l c lh l Latent Model: Model data via a nonlinear composition of features. Unsupervised: Extract features from unlabelled training data. Tuning: Scale memorization/generalization via number of latent variables. Training: Sampling (Contrastive Divergence [Hinton 2002]) Mean-field (NMF [Welling, Hinton 2002], EMF [Gabrié, Tramel, Krzakala 2015])

AMP with RBM Support Prior E.W.T, Drémeau, Krzakala, AMP with Boltzmann machine priors, JSTAT 2016.

For structured sparse signals, we can train a binary RBM to model the signal support, subsequently

16 AMP with RBM Support Prior E.W.T, Drémeau, Krzakala, AMP with Boltzmann machine priors, JSTAT Gabrié, E.W.T, Krzakala, Training RBMs with the TAP FE, NIPS For structured sparse signals, we can train a binary RBM to model the signal support, subsequently use it in AMP. Demonstrates: The RBM and AMP interact only via local biases inference on each essentially agnostic to the other.

17 AMP with General RBM Prior A General RBM P 0 (x, h ) = 1 Z ext Wh Y i P 0 (x i x ) Y l P 0 (h l h ) Using this generalized Boltzmann prior, we can model signals directly, without invoking sparsity.

18 AMP with General RBM Prior A General RBM P 0 (x, h ) = 1 Z ext Wh Y i P 0 (x i x ) Y l P 0 (h l h ) Exact knowledge of P(x) is intractable and sampling impractical for AMP. TAP Approximation of GRBM ln Z F(a, c ;,W), X X ln Z i (Bi,A i ; i ) Bi a i i i + X W ij a i a j + 1 X W 2 ijc 2 i c j (i,j) (i,j) X i A i ((a i ) 2 + c i ) Inference: Moments at each visible variable can be approximated by fixed-point iteration.

19 AMP with General RBM Prior What does this algorithm look like?

20 AMP with General RBM Prior 1. Update information from observation factors given current state.

21 AMP with General RBM Prior 2. Calculate local fields from AMP to use as a bias during GRBM inference.

22 AMP with General RBM Prior 3. Run GRBM inference in order to obtain marginal estimate of moments at each signal coefficient.

23 AMP with General RBM Prior 4. Apply light damping in order to avoid thrashing between GRBM modes and repeat. Extra Cost: Proportional to the number of interior steps you take, though can be set < 10.

24 Experimental Framework Offline Training 60k real-valued MNIST training samples 784 (28x28) Trunc. Gauss-Bernoulli Visible 500 Binary Hidden 100 sample mini-batches 150 Epochs (90k parameter updates) Reconstruction 1k real-valued MNIST test samples (h.o.) Noise Variance: 10-8 IID Random Projection Matrix F Methods non-.i.i.d. Use empirical support probability with GB prior AMP. BRBM Binary RBM to model support location. GRBM Generalized RBM to model entire signal. =0.125 =0.100 =0.075 =0.050 =0.025 non-i.i.d. BRBM GRBM (Tramel et al., 2016) Measure Reconstruction quality as a function of number of measurements.

AMP with General RBM Prior 0.35 > 10 0.35 > 10 0.35 > 10 MSE = M/N 0.

10 0.15 0.20 0.25 0.30 0.35 < 50 0.05 0.10 0.15 0.20 0.25 0.30 0.35 < 50 0.05 0.10 0.15 0.20 0.25 0.30 0.35 < 50 0.35 1.

05 0.10 0.15 0.20 0.25 0.30 0.35 = K/N 0.95 0.9 0.85 < 0.8 0.30 0.25 0.20 0.15 0.10 0.05 0.10 0.15 0.20 0.25 0.30 0.35 = K/N 0.95 0.9 0.85 < 0.8 0.30 0.25 0.20 0.15 0.10 0.05 0.10 0.15 0.20 0.25 0.30 0.35 = K/N 0.95 0.9 0.85 < 0.8 non-i.

25 AMP with General RBM Prior 0.35 > > > 10 MSE = M/N < < < Correlation = M/N = K/N < = K/N < = K/N < 0.8 non-i.i.d. AMP Support RBM-AMP (TDK, 2016) General RBM-AMP (TMCGK, 2016)

26 Learning GRBMs Open Work 1. What can be gained with deep architectures? (a) Can we push the boundaries even further with DBMs? 2. Is there an upper limit? (a) What can we learn from density evolution in random case? 3. Further experiments on non-sparse data. (a) Can an RBM be trained to a sufficient level to allow for CS-like reconstruction of non-sparse structured signals? 4. Time Indexing. (a) Parallel time indexing for BP on pairwise graphical models not set in stone (b) Need to re-derive time indices via pairwise r-bp. 5. Comparison with r-bp learning. 6. The Influence of Hidden Unit Distribution (a) E.g. Gauss-Bernoulli hidden units to allow for modeling of visible covariance structure (ala SS-RBM).

SPHINX @ENS Statistical PHysics of INformation extraction

27 Statistical PHysics of INformation extraction «ou» Statistical PHysics of INverse complex sysems Questions? Merci! Collaborators Andre MANOEL Francesco CALTAGIRONE Marylou GABRIE Florent KRZAKALA LPS-ENS INRIA Paris LPS-ENS LPS-ENS

28 Supplements

29 Learning GRBMs How do we train the necessary GRBM models? Iterating the same equations allows us to evaluate the TAP f.e. approximation of the GRBM. => Stochastic Gradient Ascent on dataset likelihood. Biggest Issue Inescapable negative variances! Signed definition required to make messages Gaussian, but now it bites us. What to do? Current sol n: Truncated Distributions, allow neg. vars!

30 Learning GRBMs For a Gaussian distributed unit The `ERF - ERF` in the denominators make for terrible numerical issues for sufficiently likely arguments. - We solved via high-order Taylor approximation. Not themes elegant solution

31 Relaxed BP (r-bp) Problems Analytically intractable messages. Messages are continuous objects (PDFs), not fit for a computable algorithm. Assumption All values of F scale as O(1/N). Remedy (Rangan, 2010), (Krzakala et al., 2012) Given the above, we can perform a small-weight expansion on the messages, allowing for all messages to be written as Normal Distributions via the CLT, and are parameterized by a i!µ, Z dx i x i m i!µ (x i ), v i!µ, (a i!µ ) 2 + Z dx i x 2 i m i!µ (x i )

32 AMP with RBM Support Prior (Tramel, Drémeau, Krzakala, 2016) MNIST Experiments Goal: CS reconstruction of 300 test set digits given training set of 60,000 samples.

33 AMP with RBM Support Prior a b c d a b c d (Tramel, Drémeau, Krzakala, 2016) decreasing measurements decreasing measurements (a) i.i.d. GB-AMP, (b) non-i.i.d. GB-AMP, (c) naive mean-field RBM-AMP (d) TAP mean-field RBM-AMP

On convergence of Approximate Message Passing

On convergence of Approximate Message Passing Francesco Caltagirone (1), Florent Krzakala (2) and Lenka Zdeborova (1) (1) Institut de Physique Théorique, CEA Saclay (2) LPS, Ecole Normale Supérieure, Paris