Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing
George Papandreou and Alan Yuille
Department of Statistics, University of California, Los Angeles
ICCV Workshop on Information Theory in Computer Vision
November 13, 2011, Barcelona, Spain
Inverse Image Problems
[Example images: denoising, deblurring, inpainting]
The Sparse Linear Model
A hidden vector x ∈ R^N and noisy measurements y ∈ R^M.
Sparse linear model:
P(x; θ) ∝ ∏_{k=1}^K t(g_k^T x),   P(y | x; θ) = N(y; Hx, σ² I)
[Factor-graph figure: sparsity factors g_1, ..., g_K and measurement factors h_1, ..., h_M attached to the variables x_1, ..., x_N]
- Sparsity directions: s = Gx, with G = [g_1^T; ...; g_K^T]
- Measurement directions: H = [h_1^T; ...; h_M^T]
- Sparse potential: t(s), e.g., Laplacian t(s) = e^{-τ|s|}
- Model parameters: θ = (G, H, σ²)
Deterministic or Probabilistic Modeling?
Deterministic modeling: standard compressive sensing
- Find the minimum-energy configuration
- Same as finding the posterior MAP
Probabilistic modeling: Bayesian compressive sensing
- Try to capture the full posterior distribution
- Suitable for learning parameters by maximum likelihood (ML)
- Harder than just a point estimate
Deterministic Modeling
MAP estimate as an optimization problem
The estimate is x̂_MAP = argmin_x φ_MAP(x), where
φ_MAP(x) = σ^{-2} ‖y − Hx‖²_2 − ∑_{k=1}^K log t(s_k),   s_k = g_k^T x.
Properties
- Modern optimization techniques allow us to find x̂_MAP efficiently for large-scale problems.
- How much do we trust the solution? What about error bars?
- Is the MAP estimate best in terms of PSNR performance?
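To make the MAP primitive concrete, here is a minimal sketch (not the authors' code) that minimizes φ_MAP for a smoothed Laplacian potential with an off-the-shelf optimizer; H, G, y, σ, τ and the smoothing constant ε are illustrative placeholders.

```python
# Minimal sketch of the MAP primitive: minimize
#   phi_MAP(x) = sigma^-2 ||y - H x||^2 - sum_k log t(g_k^T x)
# for a Laplacian potential t(s) = exp(-tau |s|), smoothed so that a
# gradient-based optimizer applies. Small dense H, G for illustration.
import numpy as np
from scipy.optimize import minimize

def map_estimate(H, G, y, sigma=0.05, tau=10.0, eps=1e-6):
    def phi(x):
        r = y - H @ x
        s = G @ x
        # smooth |s| ~ sqrt(s^2 + eps) so the objective is differentiable
        return (r @ r) / sigma**2 + tau * np.sum(np.sqrt(s**2 + eps))

    def grad(x):
        r = y - H @ x
        s = G @ x
        return (-2.0 / sigma**2) * (H.T @ r) + tau * (G.T @ (s / np.sqrt(s**2 + eps)))

    x0 = np.zeros(H.shape[1])
    return minimize(phi, x0, jac=grad, method="L-BFGS-B").x
```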
Probabilistic Modeling
Work with the full posterior distribution
P(x | y) ∝ N(y; Hx, σ² I) ∏_{k=1}^K t(g_k^T x)
[Figure: posterior vs. prior/measure; from Seeger & Wipf, 10]
Probabilistic Modeling: Markov Chain Monte-Carlo vs. Variational Bayes
Markov Chain Monte-Carlo
- Draw samples from the posterior
- Typically model the prior with Gaussian mixtures and perform block Gibbs sampling
- Very general, but can be slow and convergence is difficult to monitor
- [Schmidt, Rao & Roth, 10], [Papandreou & Yuille, 10], ...
Variational Bayes
- Approximate the posterior distribution with a tractable parametric form
- Systematic error but often guaranteed convergence
- [Attias, 99], [Girolami, 01], [Lewicki & Sejnowski, 00], [Palmer et al., 05], [Levin et al., 11], [Seeger & Nickisch, 11], ...
Variational Bounding
Approximate the posterior distribution with a Gaussian
Q(x | y) ∝ N(y; Hx, σ² I) e^{−(1/2) s^T Γ^{-1} s} = N(x; x̂_Q, A^{-1}),
with x̂_Q = A^{-1} b,   A = σ^{-2} H^T H + G^T Γ^{-1} G,   Γ = diag(γ),   and b = σ^{-2} H^T y.
Suitable for super-Gaussian priors
t(s_k) = sup_{γ_k > 0} e^{−s_k²/(2γ_k) − h_k(γ_k)/2}
Optimization problem: find the variational parameters γ that give the tightest fit.
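For intuition, a small-scale sketch (dense linear algebra, purely illustrative) of the Gaussian posterior Q(x | y) for a fixed γ; at realistic sizes A is never formed or inverted explicitly.

```python
# Gaussian variational posterior for fixed gamma:
#   Q(x|y) = N(x; x_hat, A^{-1}),
#   A = sigma^-2 H^T H + G^T diag(gamma)^{-1} G,  b = sigma^-2 H^T y.
import numpy as np

def gaussian_posterior(H, G, y, gamma, sigma):
    A = H.T @ H / sigma**2 + G.T @ (G / gamma[:, None])  # G^T Gamma^{-1} G
    b = H.T @ y / sigma**2
    x_hat = np.linalg.solve(A, b)   # variational mean
    Sigma = np.linalg.inv(A)        # only feasible for small N
    return x_hat, Sigma
```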
Variational Bounding: Double-Loop Algorithm
Outer Loop: Variance Computation
Compute z = diag(G A^{-1} G^T), i.e., the vector of variances z_k = Var_Q(s_k | y) along the sparsity directions s_k = g_k^T x.
Inner Loop: Smoothed Estimation
Obtain the variational mean x̂_Q = argmin_x φ_Q(x; z), where
φ_Q(x; z) = σ^{-2} ‖y − Hx‖²_2 − ∑_{k=1}^K log t((s_k² + z_k)^{1/2})
Update the variational parameters
γ_k^{-1} = −2 d log t(√v)/dv |_{v = ŝ_k² + z_k}
Convex if standard MAP is convex. See [Seeger & Nickisch, 11].
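A toy sketch of the double loop for the Laplacian potential t(s) = e^{−τ|s|}, for which the update above reduces to γ_k = √(ŝ_k² + z_k)/τ. The exact dense variance computation stands in for the Monte-Carlo/Lanczos estimators discussed next; names, sizes, and iteration counts are assumptions.

```python
# Double-loop variational bounding for a Laplacian potential, small-scale sketch.
import numpy as np
from scipy.optimize import minimize

def double_loop_vb(H, G, y, sigma=0.05, tau=10.0, n_outer=5):
    K, N = G.shape
    gamma = np.ones(K)
    x_hat = np.zeros(N)
    for _ in range(n_outer):
        # Outer loop: marginal variances z_k = Var_Q(s_k | y) = g_k^T A^{-1} g_k
        A = H.T @ H / sigma**2 + G.T @ (G / gamma[:, None])
        z = np.einsum('kn,nm,km->k', G, np.linalg.inv(A), G)

        # Inner loop: smoothed MAP estimation with penalty tau * sqrt(s^2 + z)
        def phi(x):
            r = y - H @ x
            s = G @ x
            return (r @ r) / sigma**2 + tau * np.sum(np.sqrt(s**2 + z))
        x_hat = minimize(phi, x_hat, method="L-BFGS-B").x

        # Variational-parameter update for the Laplacian: gamma_k = sqrt(s_k^2 + z_k) / tau
        s_hat = G @ x_hat
        gamma = np.sqrt(s_hat**2 + z) / tau
    return x_hat, gamma
```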
Variance Computation
Goal: estimate elements of Σ = A^{-1}, where A = σ^{-2} H^T H + G^T Γ^{-1} G
- Direct inversion is hopeless (N ≈ 10^6).
- Accurate and fast techniques exist for problems of special structure [Malioutov et al., 08].
- Lanczos iteration (only MVMs required) [Schneider & Willsky, 01], [Seeger & Nickisch, 11].
- This work: Monte-Carlo variance estimation.
Gaussian Sampling by Local Perturbations
[Factor-graph figure: factors g_1, ..., g_K and h_1, ..., h_M attached to variables x_1, ..., x_N, shown before and after noise injection]
Gaussian MRF sampling by local noise injection:
1. Local perturbations: ỹ ~ N(0, σ² I) and β ~ N(0, Γ^{-1})
2. Gaussian mode: solve A x̃ = σ^{-2} H^T ỹ + G^T β
Then x̃ ~ N(0, A^{-1}), where A = σ^{-2} H^T H + G^T Γ^{-1} G.
[Papandreou & Yuille, 10]
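A sketch of this perturb-and-solve sampler, assuming dense H and G only for self-containedness; at scale only matrix-vector products with H, G and their transposes are needed, which the CG solver exploits.

```python
# Draw x ~ N(0, A^{-1}) by perturbing the local factors and solving one linear system.
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def perturb_and_solve(H, G, gamma, sigma, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    M, N = H.shape
    y_tilde = rng.normal(0.0, sigma, size=M)        # y~ ~ N(0, sigma^2 I)
    beta = rng.normal(0.0, 1.0 / np.sqrt(gamma))    # beta ~ N(0, Gamma^{-1})
    rhs = H.T @ y_tilde / sigma**2 + G.T @ beta
    # A enters only through matrix-vector products:
    A_mv = lambda v: H.T @ (H @ v) / sigma**2 + G.T @ ((G @ v) / gamma)
    x_tilde, _ = cg(LinearOperator((N, N), matvec=A_mv), rhs)
    return x_tilde                                  # x~ ~ N(0, A^{-1})
```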
Monte-Carlo Variance Estimation
Let x̃_i ~ N(0, A^{-1}), with i = 1, ..., N_s.
General-purpose Monte-Carlo variance estimator:
Σ̂ = (1/N_s) ∑_{i=1}^{N_s} x̃_i x̃_i^T,   ẑ_k = (1/N_s) ∑_{i=1}^{N_s} s̃²_{k,i},
where s̃_{k,i} = g_k^T x̃_i.
Properties
- Marginal distribution of the estimates: ẑ_k / z_k ~ (1/N_s) χ²(N_s).
- Unbiased: E{ẑ_k} = z_k.
- Relative error is r = std(ẑ_k)/z_k = √(2/N_s).
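Given such samples, the variance estimator is a few lines; `sample_fn` refers to a sampler like the perturb-and-solve sketch above, and N_s trades accuracy for cost via the √(2/N_s) relative error.

```python
# Monte-Carlo estimate of z_k = Var_Q(s_k | y) from perturbation samples.
import numpy as np

def mc_variances(sample_fn, G, n_samples=30):
    """sample_fn() draws one x ~ N(0, A^{-1}), e.g. via perturb_and_solve."""
    samples = np.stack([sample_fn() for _ in range(n_samples)])  # shape (N_s, N)
    S = samples @ G.T                 # s_{k,i} = g_k^T x_i, shape (N_s, K)
    return np.mean(S**2, axis=0)      # unbiased estimate of z_k
```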
Monte-Carlo vs. Lanczos Variance Estimates
[Scatter plot of estimated variances ẑ_k against the exact variances z_k for the Monte-Carlo (SAMPLE) and Lanczos estimators.]
Application: Image Deconvolution
Measurement equation: y = k ∗ x = Hx.
- Non-blind deconvolution (known blur kernel k).
- Blind deconvolution (unknown blur kernel k).
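For deconvolution, H need never be stored: its action is convolution with k, which (under a periodic-boundary assumption made purely for brevity) can be applied with FFTs, e.g.:

```python
# Apply H x = k * x and H^T v with FFTs instead of forming H.
import numpy as np

def make_blur_operator(k_pad):
    """k_pad: blur kernel zero-padded and centered to the image shape."""
    K_f = np.fft.fft2(k_pad)
    H  = lambda x: np.real(np.fft.ifft2(K_f * np.fft.fft2(x)))           # H x = k * x
    Ht = lambda v: np.real(np.fft.ifft2(np.conj(K_f) * np.fft.fft2(v)))  # H^T v
    return H, Ht
```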
Blind Image Deconvolution
Blur kernel recovery by maximum likelihood
ML objective: k̂ = argmax_k P(y; k) = argmax_k ∫ P(y, x; k) dx.
Variational ML: k̂ = argmax_k Q(y; k)
Contrast with argmax_k (max_x P(x, y; k)).
[Fergus et al., 06], [Levin et al., 09].
Variational EM for Maximum Likelihood
Find k by maximizing Q(y; k) [Girolami, 01], [Levin et al., 11].
E-Step: Given the current kernel estimate k_t, do variational Bayesian inference, i.e., fit Q(x | y; k_t).
M-Step: Maximize w.r.t. k the expected complete log-likelihood E_{Q(x|y;k_t)}{log Q(x, y; k)}. Equivalently, minimize w.r.t. k
E_{Q(x|y;k_t)}{ (1/2) ‖y − Hx‖² } = (1/2) tr((H^T H)(A^{-1} + x̂ x̂^T)) − y^T H x̂ + const
= (1/2) k^T R_xx k − r_xy^T k + const
Expected moments R_xx estimated by Gaussian sampling.
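A 1-D illustrative sketch of this M-step (names are assumptions; real deblurring code works in 2-D and usually constrains k to be nonnegative and sum to one): build R_xx and r_xy from posterior samples, then minimize the quadratic in k.

```python
# M-step sketch: expected moments from samples, then an unconstrained quadratic solve.
import numpy as np
from scipy.linalg import convolution_matrix

def m_step_kernel(samples, y, kernel_len):
    R_xx = np.zeros((kernel_len, kernel_len))
    r_xy = np.zeros(kernel_len)
    for x_i in samples:   # posterior samples of x from the E-step
        # X @ k reproduces the 'same'-mode convolution of x_i with k
        X = convolution_matrix(x_i, kernel_len, mode='same')
        R_xx += X.T @ X
        r_xy += X.T @ y
    R_xx /= len(samples)
    r_xy /= len(samples)
    return np.linalg.solve(R_xx, r_xy)   # minimizer of 1/2 k^T R_xx k - r_xy^T k
```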
Summary of Computational Primitives
Smoothed estimation
Obtain the variational mean x̂_Q = argmin_x φ_Q(x; z), where
φ_Q(x; z) = σ^{-2} ‖y − Hx‖²_2 − ∑_{k=1}^K log t((s_k² + z_k)^{1/2})
Inner loop of variational inference.
Sparse linear system
A x = b, where A = σ^{-2} H^T H + G^T Γ^{-1} G.
Estimate variances in the outer loop of variational inference and moments R_xx in blind image deconvolution.
Solve with preconditioned conjugate gradients.
Efficient Circulant Preconditioning
Approximate A = σ^{-2} H^T H + G^T Γ^{-1} G with P = σ^{-2} H^T H + γ̄^{-1} G^T G, where γ̄^{-1} = (1/K) ∑_{k=1}^K γ_k^{-1} [Lefkimmiatis et al., 12].
Properties
- Thanks to the stationarity of P, DFT techniques apply.
- Optimality: P = argmin_{X ∈ C} ‖X − A‖.
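A sketch of applying P^{-1} with FFTs, assuming H is a blur and the rows of G are shifted copies of a few filters (e.g. derivatives), all with periodic boundaries so that P is diagonal in the Fourier basis; names and padding conventions are illustrative.

```python
# Circulant preconditioner: apply v -> P^{-1} v with a pair of FFTs.
import numpy as np

def make_circulant_preconditioner(k_pad, filters_pad, gamma_bar, sigma):
    """k_pad, filters_pad: blur kernel and sparsity filters, zero-padded to the image shape."""
    K_f = np.fft.fft2(k_pad)
    G_f2 = sum(np.abs(np.fft.fft2(g)) ** 2 for g in filters_pad)
    # Fourier spectrum of P = sigma^-2 H^T H + gamma_bar^-1 G^T G
    P_f = np.abs(K_f) ** 2 / sigma**2 + G_f2 / gamma_bar
    return lambda v: np.real(np.fft.ifft2(np.fft.fft2(v) / P_f))
```

Wrapped in a scipy.sparse.linalg.LinearOperator (with the appropriate flattening and reshaping), this map can be passed as the preconditioner argument M of the conjugate-gradient solver.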
Effect of Preconditioner
[Semi-log convergence plot comparing CG and PCG: residual vs. iteration count over 0-120 iterations.]
Non-Blind Image Deblurring Example
[Figure panels: ground truth; blurred input (PSNR = 22.57 dB); our result (PSNR = 31.93 dB); VB standard-deviation map.]
Blind Image Deblurring Example
[Figure panels: ground truth; blurred input (PSNR = 22.57 dB); our result (PSNR = 27.54 dB); kernel.]
Summary
Main Points
- Variational Bayesian inference using standard optimization primitives.
- Scalable to large-scale problems.
- Open question: Monte-Carlo or Variational?
Our software is integrated in the glm-ie open source toolbox.
THANK YOU!