Bayesian Methods for Sparse Signal Recovery
Bhaskar D. Rao, University of California, San Diego
(Thanks to David Wipf, Zhilin Zhang and Ritwik Giri)

Motivation: Sparse signal recovery is an interesting area with many potential applications. Methods developed for solving the sparse signal recovery problem can be valuable tools for signal processing practitioners. Many interesting developments in the recent past make the subject timely. The Bayesian framework offers some interesting options.

Outline
1. Sparse Signal Recovery (SSR) problem and some extensions
2. Applications
3. Bayesian methods: MAP estimation and Empirical Bayes
4. SSR extensions: block sparsity
5. Summary

Problem Description: Sparse Signal Recovery (SSR). The model is y = Φx + v, where:
1. y is an N × 1 measurement vector.
2. Φ is an N × M dictionary matrix with M >> N.
3. x is the M × 1 desired vector, which is sparse with k nonzero entries.
4. v is the measurement noise.
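To make the setup concrete, here is a minimal NumPy sketch (not part of the original slides) that generates a synthetic instance of this model; the dimensions N, M, k and the noise level sigma are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: N measurements, M dictionary atoms (M >> N), k nonzeros.
N, M, k = 50, 250, 10
sigma = 0.01  # assumed measurement-noise standard deviation

# Dictionary Phi with i.i.d. Gaussian entries and unit-norm columns.
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)

# Sparse coefficient vector x0 with k nonzero entries at random locations.
x0 = np.zeros(M)
support = rng.choice(M, size=k, replace=False)
x0[support] = rng.standard_normal(k)

# Noisy measurements y = Phi x0 + v.
y = Phi @ x0 + sigma * rng.standard_normal(N)
```

The later algorithm sketches in this transcript reuse these variables (Phi, y, x0, k).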

Problem Statement: SSR
Noise-free case: given a target signal y and dictionary Φ, find the weights x that solve
$\min_x \sum_i I(x_i \neq 0)$ subject to $y = \Phi x$,
where $I(\cdot)$ is the indicator function.
Noisy case: given a target signal y and dictionary Φ, find the weights x that solve
$\min_x \sum_i I(x_i \neq 0)$ subject to $\|y - \Phi x\|_2 < \beta$.

Useful Extensions 1. Block Sparsity 2. Multiple Measurement Vectors (MMV) 3. Block MMV 4. MMV with time varying sparsity

Block Sparsity

Multiple Measurement Vectors (MMV): L measurement vectors sharing a common sparsity profile, i.e. k nonzero rows.

Applications 1. Signal Representation (Mallat, Coifman, Donoho,..) 2. EEG/MEG (Leahy, Gorodnitsky, Ioannides,..) 3. Robust Linear Regression and Outlier Detection 4. Speech Coding (Ozawa, Ono, Kroon,..) 5. Compressed Sensing (Donoho, Candes, Tao,..) 6. Magnetic Resonance Imaging (Lustig,..) 7. Sparse Channel Equalization (Fevrier, Proakis,...) and many more...

MEG/EEG Source Localization: source space (x), sensor space (y). The forward-model dictionary can be computed using Maxwell's equations [Sarvas, 1987]. In many situations the active brain regions may be relatively sparse, so a sparse inverse problem must be solved [Baillet et al., 2001].

Compressive Sampling (CS) Transform Coding

Compressive Sampling (CS) Computation: 1. Solve for x such that Φx = y. 2. Reconstruction: b = Ψx. Issues: 1. Need to recover the sparse signal x under the constraint Φx = y. 2. Need to design the sampling matrix A.

Robust Linear Regression. X, y: data; c: regression coefficients; n: model noise, with y = Xc + n. Decompose the noise as n = w + ε, where w is a sparse component (outliers) and ε is a Gaussian component (regular error). Transform into an overcomplete representation:
$y = Xc + \Phi w + \epsilon$, with $\Phi = I$, or equivalently $y = [X, \Phi]\begin{bmatrix} c \\ w \end{bmatrix} + \epsilon$.
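As a hedged illustration of this reformulation (with Φ = I), the sketch below builds the augmented dictionary [X, I]; estimating the stacked vector [c; w] from it is itself a sparse recovery problem, so any of the SSR algorithms discussed later can be applied. All data and dimensions here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 100, 5                          # n observations, p regression coefficients
X = rng.standard_normal((n, p))
c_true = rng.standard_normal(p)

# Regular Gaussian errors plus a few large, sparse outliers.
eps = 0.05 * rng.standard_normal(n)
w_true = np.zeros(n)
w_true[rng.choice(n, size=5, replace=False)] = 10.0 * rng.standard_normal(5)

y_reg = X @ c_true + w_true + eps

# Overcomplete representation y = [X, I] [c; w] + eps: the outlier vector w is
# sparse, so the augmented problem is a sparse signal recovery problem.
A = np.hstack([X, np.eye(n)])
```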

Potential Algorithmic Approaches. Finding the optimal solution is NP-hard, so low-complexity algorithms with reasonable performance are needed.
1. Greedy search techniques: Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP).
2. Minimizing diversity measures: the indicator function is not continuous, so define surrogate cost functions that are more tractable and whose minimization leads to sparse solutions, e.g. l1 minimization.
3. Bayesian methods: make appropriate statistical assumptions on the solution and apply estimation techniques to identify the desired sparse solution.
A minimal sketch of the greedy approach is given below.
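The sketch below is a simplified Orthogonal Matching Pursuit implementation in NumPy (our own illustration, not the slides' reference code); it assumes unit-norm dictionary columns and a known sparsity level k.

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: greedily select k atoms of Phi to explain y."""
    N, M = Phi.shape
    residual = y.copy()
    support = []
    x = np.zeros(M)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(Phi.T @ residual)))
        if j not in support:
            support.append(j)
        # Re-fit all selected coefficients by least squares on the support.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - Phi @ x
    return x

# Example usage with the synthetic (Phi, y, k) generated earlier:
# x_omp = omp(Phi, y, k)
```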

Bayesian Methods 1. MAP Estimation Framework (Type I) 2. Hierarchical Bayesian Framework (Type II)

MAP Estimation. Problem statement:
$\hat{x} = \arg\max_x P(x|y) = \arg\max_x P(y|x)\, P(x)$
Advantages:
1. Many options to promote sparsity, i.e. choose some sparse prior over x.
2. Growing options for solving the underlying optimization problem.
3. Can be related to LASSO and other l1-minimization techniques by a suitable choice of P(x).

MAP Estimation. Assumption: Gaussian noise.
$\hat{x} = \arg\max_x P(y|x)P(x) = \arg\min_x \left[-\log P(y|x) - \log P(x)\right] = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_{i=1}^{M} g(|x_i|)$
Theorem: if g is a non-decreasing, strictly concave function on $\mathbb{R}^+$, the local minima of the above optimization problem are extreme points, i.e. have at most N nonzero entries.

Special cases of MAP estimation.
Gaussian prior: a Gaussian assumption on P(x) leads to the l2-norm regularized problem
$\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \|x\|_2^2$
Laplacian prior: a Laplacian assumption on P(x) leads to the standard l1-norm regularized problem, i.e. LASSO:
$\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \|x\|_1$
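A brief sketch contrasting the two special cases (our own illustration, with arbitrary λ and step settings): ridge regression has a closed form and keeps all coefficients nonzero, while an ISTA solver for the LASSO drives many coefficients exactly to zero.

```python
import numpy as np

def ridge(Phi, y, lam):
    """Closed-form l2-regularized solution (Gaussian prior on x): dense estimate."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def lasso_ista(Phi, y, lam, n_iter=500):
    """ISTA for the l1-regularized problem (Laplacian prior on x): sparse estimate."""
    L = np.linalg.norm(Phi, 2) ** 2              # squared spectral norm of Phi
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = x - (Phi.T @ (Phi @ x - y)) / L      # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)  # soft threshold
    return x

# x_l2 = ridge(Phi, y, lam=0.1)        # all M entries typically nonzero
# x_l1 = lasso_ista(Phi, y, lam=0.1)   # most entries exactly zero
```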

Examples of Sparse Distributions. Sparse distributions can be viewed within a general framework of super-Gaussian distributions: $P(x) \propto e^{-|x|^p}$, $p \le 1$.

Examples of Sparsity Penalties. Practical selections:
$g(x_i) = \log(x_i^2 + \epsilon)$ [Chartrand and Yin, 2008]
$g(x_i) = \log(|x_i| + \epsilon)$ [Candes et al., 2008]
$g(x_i) = |x_i|^p$ [Rao et al., 2003]
Different choices favor different levels of sparsity.

Which sparse prior to choose?
$\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_{l=1}^{M} |x_l|^p$
Two issues:
1. If the prior is too sparse, i.e. p → 0, we may get stuck in a local minimum, resulting in a convergence error.
2. If the prior is not sparse enough, i.e. p → 1, the global minimum can be found, but it may not be the sparsest solution, resulting in a structural error.

Reweighted l2/l1 optimization. The underlying optimization problem is
$\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_{i=1}^{M} g(|x_i|)$
1. Useful algorithms exist to minimize the original cost function with a strictly concave penalty function g on $\mathbb{R}^+$.
2. The essence of these algorithms is to bound the concave penalty function and follow the steps of a Majorize-Minimization (MM) algorithm.

Reweighted l2 optimization. Assume $g(x_i) = h(x_i^2)$ with h concave. Updates:
$x^{(k+1)} \leftarrow \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_i w_i^{(k)} x_i^2 = W^{(k)} \Phi^T (\lambda I + \Phi W^{(k)} \Phi^T)^{-1} y$
$w_i^{(k+1)} \leftarrow \left.\frac{\partial g(x_i)}{\partial x_i^2}\right|_{x_i = x_i^{(k+1)}}, \qquad W^{(k+1)} \leftarrow \mathrm{diag}[w^{(k+1)}]^{-1}$

Reweighted l2 optimization: Examples
FOCUSS algorithm [Rao et al., 2003]
1. Penalty: $g(x_i) = |x_i|^p$, $0 \le p \le 2$
2. Weight update: $w_i^{(k+1)} \leftarrow |x_i^{(k+1)}|^{p-2}$
3. Properties: well-characterized convergence rates; very susceptible to local minima when p is small.
Chartrand and Yin (2008) algorithm
1. Penalty: $g(x_i) = \log(x_i^2 + \epsilon)$, $\epsilon \ge 0$
2. Weight update: $w_i^{(k+1)} \leftarrow [(x_i^{(k+1)})^2 + \epsilon]^{-1}$
3. Properties: slowly reducing ε to zero smooths out local minima initially, allowing better solutions to be found.
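The following NumPy sketch implements the reweighted-l2 iteration with the Chartrand and Yin weight update; the values of λ, the ε schedule, and the iteration count are illustrative assumptions rather than settings from the slides.

```python
import numpy as np

def reweighted_l2(Phi, y, lam=1e-3, eps=1.0, n_iter=30):
    """Reweighted-l2 iteration (Chartrand-Yin weights) for sparse recovery."""
    N, M = Phi.shape
    x = np.zeros(M)
    W = np.eye(M)                      # W^(k) = diag(w^(k))^{-1}, initialized to identity
    for _ in range(n_iter):
        # x^{(k+1)} = W Phi^T (lam I + Phi W Phi^T)^{-1} y
        x = W @ Phi.T @ np.linalg.solve(lam * np.eye(N) + Phi @ W @ Phi.T, y)
        # Chartrand-Yin weights w_i = 1 / (x_i^2 + eps), hence W = diag(x_i^2 + eps).
        W = np.diag(x ** 2 + eps)
        eps = max(eps * 0.5, 1e-8)     # slowly reduce eps toward zero
    return x

# x_hat = reweighted_l2(Phi, y)
```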

Empirical Comparison. For each test case:
1. Generate a random dictionary Φ with 50 rows and 250 columns.
2. Generate a sparse coefficient vector x0.
3. Compute the signal y = Φx0 (noiseless case).
4. Compare Chartrand and Yin's reweighted l2 method with the l1-norm solution with regard to estimating x0.
5. Average over 1000 independent trials.

Empirical Comparison: Unit nonzeros

Empirical Comparison: Gaussian nonzeros

Reweighted l1 optimization. Assume $g(x_i) = h(|x_i|)$ with h concave. Updates:
$x^{(k+1)} \leftarrow \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_i w_i^{(k)} |x_i|$
$w_i^{(k+1)} \leftarrow \left.\frac{\partial g(x_i)}{\partial |x_i|}\right|_{x_i = x_i^{(k+1)}}$

Reweighted l1 optimization: Candes et al., 2008
1. Penalty: $g(x_i) = \log(|x_i| + \epsilon)$, $\epsilon > 0$
2. Weight update: $w_i^{(k+1)} \leftarrow [|x_i^{(k+1)}| + \epsilon]^{-1}$
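A hedged sketch of this reweighted-l1 scheme: the outer loop applies the Candes et al. weight update, and the inner weighted-LASSO problem is solved here with a plain ISTA loop (one of several possible inner solvers); λ, ε, and the iteration counts are illustrative.

```python
import numpy as np

def weighted_lasso_ista(Phi, y, lam, w, n_iter=300):
    """Solve min_x ||y - Phi x||_2^2 + lam * sum_i w_i |x_i| by ISTA."""
    L = np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = x - (Phi.T @ (Phi @ x - y)) / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam * w / (2 * L), 0.0)
    return x

def reweighted_l1(Phi, y, lam=0.1, eps=0.1, n_outer=10):
    """Reweighted-l1 minimization with weights w_i = 1 / (|x_i| + eps)."""
    w = np.ones(Phi.shape[1])
    x = np.zeros(Phi.shape[1])
    for _ in range(n_outer):
        x = weighted_lasso_ista(Phi, y, lam, w)
        w = 1.0 / (np.abs(x) + eps)
    return x

# x_hat = reweighted_l1(Phi, y)
```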

Empirical Comparison. For each test case:
1. Generate a random dictionary Φ with 50 rows and 100 columns.
2. Generate a sparse coefficient vector x0 with 30 truncated-Gaussian, strictly positive nonzero coefficients.
3. Compute the signal y = Φx0 (noiseless case).
4. Compare Candes et al.'s reweighted l1 method with the l1-norm solution, both constrained to be nonnegative, with regard to estimating x0.
5. Average over 1000 independent trials.

Empirical Comparison

Limitation of MAP-based methods: to retain the same maximally sparse global solution as the l0 norm under general conditions, any possible MAP algorithm will possess $O\left(\binom{M}{N}\right)$ local minima.

Bayesian Inference: Sparse Bayesian Learning (SBL). MAP estimation is just penalized regression, so the Bayesian interpretation has not contributed much so far. The previous methods were interested only in the mode of the posterior, whereas SBL uses posterior information beyond the mode, i.e. the posterior distribution. Problem: for these sparse priors it is not possible to compute the normalized posterior P(x|y), hence some approximations are needed.

Hierarchical Bayes

Construction of sparse priors.
Separability: $P(x) = \prod_i P(x_i)$
Gaussian scale mixture (GSM): $P(x_i) = \int P(x_i \mid \gamma_i)\, P(\gamma_i)\, d\gamma_i = \int \mathcal{N}(x_i; 0, \gamma_i)\, P(\gamma_i)\, d\gamma_i$
Most sparse priors over x (including those with concave g) can be represented in this GSM form; different scale-mixing densities P(γ_i) lead to different sparse priors [Palmer et al., 2006].
Instead of solving a MAP problem in x, in the Bayesian framework one estimates the hyperparameters γ, leading to an estimate of the posterior distribution of x (Sparse Bayesian Learning).
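As a concrete illustration of the GSM construction (a standard textbook example, not one taken from the slides), an inverse-Gamma mixing density on γ_i yields a Student-t marginal on x_i; sampling γ_i and then x_i ~ N(0, γ_i) produces heavy-tailed, sparsity-promoting draws.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian scale mixture: draw gamma_i from a mixing density, then x_i ~ N(0, gamma_i).
# Inverse-Gamma mixing (shape a, scale b) gives a Student-t marginal for x_i.
a, b, n_samples = 1.5, 1.0, 10_000
gamma = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=n_samples)  # inverse-Gamma draws
x = rng.normal(loc=0.0, scale=np.sqrt(gamma))                    # x_i | gamma_i ~ N(0, gamma_i)

# Compared with a Gaussian of equal variance, these samples are heavy-tailed:
# most entries are small while a few are large, i.e. the prior promotes sparsity.
print(np.mean(np.abs(x) < 0.1), np.abs(x).max())
```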

Examples of Gaussian Scale Mixtures
Generalized Gaussian: $p(x; \rho) = \frac{1}{2\Gamma(1 + 1/\rho)}\, e^{-|x|^\rho}$. Scale-mixing density: positive alpha-stable density of order ρ/2.
Generalized Cauchy: $p(x; \alpha, \nu) = \frac{\alpha\, \Gamma(\nu + 1/\alpha)}{2\, \Gamma(1/\alpha)\, \Gamma(\nu)} \cdot \frac{1}{(1 + |x|^\alpha)^{\nu + 1/\alpha}}$. Scale-mixing density: Gamma distribution.

Examples of Gaussian Scale Mixtures
Generalized Logistic: $p(x; \alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2} \cdot \frac{e^{-\alpha x}}{(1 + e^{-x})^{2\alpha}}$. Scale-mixing density: related to the Kolmogorov-Smirnov distance statistic.

Sparse Bayesian Learning. Model: y = Φx + v.
Solving for the optimal γ:
$\hat{\gamma} = \arg\max_\gamma P(\gamma|y) = \arg\max_\gamma \int P(y|x)\,P(x|\gamma)\,P(\gamma)\,dx = \arg\min_\gamma \log|\Sigma_y| + y^T \Sigma_y^{-1} y - 2\sum_i \log P(\gamma_i)$
where $\Sigma_y = \sigma^2 I + \Phi \Gamma \Phi^T$ and $\Gamma = \mathrm{diag}(\gamma)$.
Empirical Bayes: choose P(γ_i) to be a non-informative prior.

Sparse Bayesian Learning. Computing the posterior: because of this convenient choice the posterior can be computed in closed form,
$P(x|y; \hat{\gamma}) = \mathcal{N}(\mu_x, \Sigma_x)$
where
$\mu_x = E[x|y; \hat{\gamma}] = \hat{\Gamma}\Phi^T(\sigma^2 I + \Phi\hat{\Gamma}\Phi^T)^{-1} y$
$\Sigma_x = \mathrm{Cov}[x|y; \hat{\gamma}] = \hat{\Gamma} - \hat{\Gamma}\Phi^T(\sigma^2 I + \Phi\hat{\Gamma}\Phi^T)^{-1}\Phi\hat{\Gamma}$
Updating γ: using the EM algorithm with a non-informative prior over γ, the update rule becomes
$\gamma_i \leftarrow \mu_x(i)^2 + \Sigma_x(i, i)$
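A minimal NumPy sketch of the SBL/EM loop just described (posterior moments followed by the γ update); the noise variance σ² and the iteration count are illustrative assumptions.

```python
import numpy as np

def sbl_em(Phi, y, sigma2=1e-4, n_iter=100):
    """Sparse Bayesian Learning via EM: estimate gamma and return the posterior mean."""
    N, M = Phi.shape
    gamma = np.ones(M)
    for _ in range(n_iter):
        Gamma = np.diag(gamma)
        Sigma_y = sigma2 * np.eye(N) + Phi @ Gamma @ Phi.T
        # Posterior moments given the current gamma.
        mu_x = Gamma @ Phi.T @ np.linalg.solve(Sigma_y, y)
        Sigma_x = Gamma - Gamma @ Phi.T @ np.linalg.solve(Sigma_y, Phi @ Gamma)
        # EM update: gamma_i <- mu_x(i)^2 + Sigma_x(i, i).
        gamma = mu_x ** 2 + np.diag(Sigma_x)
    return mu_x, gamma

# mu_x, gamma = sbl_em(Phi, y)   # entries of gamma driven to (near) zero prune coefficients
```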

SBL properties: Local minima are sparse, i.e. have at most N nonzero γ_i. The Bayesian inference cost is generally much smoother than the associated MAP estimation cost, with fewer local minima. At high signal-to-noise ratio, the global minimum is the sparsest solution: no structural problems.

Connection to the MAP formulation. Using the relationship
$y^T \Sigma_y^{-1} y = \min_x \frac{1}{\lambda}\|y - \Phi x\|_2^2 + x^T \Gamma^{-1} x$
the x-space cost function becomes
$L_x^{II}(x) = \|y - \Phi x\|_2^2 + \lambda\, g_{II}(x)$
where $g_{II}(x) = \min_\gamma \sum_i \frac{x_i^2}{\gamma_i} + \log|\Sigma_y| + \sum_i f(\gamma_i)$, with $f(\gamma_i) = -2\log P(\gamma_i)$.
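The variational identity quoted above can be checked directly; the short derivation below is our own sketch of the standard argument, under the stated definition $\Sigma_y = \lambda I + \Phi\Gamma\Phi^T$.

```latex
% Verify: y^T Sigma_y^{-1} y = min_x (1/lambda) ||y - Phi x||_2^2 + x^T Gamma^{-1} x.
\begin{align*}
  &\text{Stationarity: } \tfrac{1}{\lambda}\Phi^{T}(y - \Phi x^{*}) = \Gamma^{-1}x^{*}
   \;\Longrightarrow\; x^{*} = \Gamma\Phi^{T}\Sigma_y^{-1}y. \\
  &\text{Hence } x^{*T}\Gamma^{-1}x^{*} = \tfrac{1}{\lambda}\,x^{*T}\Phi^{T}(y - \Phi x^{*}),
   \text{ and the objective at } x^{*} \text{ equals } \tfrac{1}{\lambda}\,y^{T}(y - \Phi x^{*}). \\
  &\text{Finally } y - \Phi x^{*} = (\Sigma_y - \Phi\Gamma\Phi^{T})\,\Sigma_y^{-1}y = \lambda\,\Sigma_y^{-1}y,
   \text{ so the minimum value is } y^{T}\Sigma_y^{-1}y.
\end{align*}
```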

Empirical Comparison: Simultaneous Sparse Approximation. Generate the data matrix via Y = ΦX_0 (noiseless), where:
1. X_0 is 100-by-5 with random nonzero rows.
2. Φ is 50-by-100 with i.i.d. Gaussian entries.

Empirical Comparison: 1000 trials

Useful Extensions 1. Block Sparsity 2. Multiple Measurement Vectors (MMV) 3. Block MMV 4. MMV with time varying sparsity

Block Sparsity Intra-Vector Correlation is often present and is hard to model.

Block-Sparse Bayesian Learning (BSBL) Framework. Model: y = Φx + v, with x partitioned into g blocks:
$x = [\underbrace{x_1, \ldots, x_{d_1}}_{x_1^T}, \ldots, \underbrace{x_{d_{g-1}+1}, \ldots, x_{d_g}}_{x_g^T}]^T$
Parameterized prior: $P(x_i; \gamma_i, B_i) \sim \mathcal{N}(0, \gamma_i B_i)$, $i = 1, \ldots, g$, where γ_i controls block sparsity and B_i captures intra-block correlation. Overall, $P(x; (\gamma_i, B_i)_i) \sim \mathcal{N}(0, \Sigma_0)$.

BSBL framework.
Noise model: $P(v; \lambda) \sim \mathcal{N}(0, \lambda I)$
Posterior: $P(x|y; \lambda, (\gamma_i, B_i)_{i=1}^{g}) = \mathcal{N}(\mu_x, \Sigma_x)$, where
$\mu_x = \Sigma_0 \Phi^T (\lambda I + \Phi \Sigma_0 \Phi^T)^{-1} y$
$\Sigma_x = \Sigma_0 - \Sigma_0 \Phi^T (\lambda I + \Phi \Sigma_0 \Phi^T)^{-1} \Phi \Sigma_0$
The posterior mean μ_x can be taken as the point estimate of x.
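The posterior moments above are straightforward to compute once Σ_0 is assembled from the block parameters; the sketch below is a hedged illustration with arbitrary block sizes, γ_i, B_i, and λ (it assumes a dictionary Phi and measurements y are already available).

```python
import numpy as np
from scipy.linalg import block_diag

def bsbl_posterior(Phi, y, gammas, Bs, lam):
    """Posterior mean and covariance of x given block parameters (gamma_i, B_i) and lambda."""
    # Prior covariance Sigma_0 = blockdiag(gamma_1 B_1, ..., gamma_g B_g).
    Sigma0 = block_diag(*[g * B for g, B in zip(gammas, Bs)])
    N = Phi.shape[0]
    S = lam * np.eye(N) + Phi @ Sigma0 @ Phi.T
    mu_x = Sigma0 @ Phi.T @ np.linalg.solve(S, y)                     # point estimate of x
    Sigma_x = Sigma0 - Sigma0 @ Phi.T @ np.linalg.solve(S, Phi @ Sigma0)
    return mu_x, Sigma_x

# Example with g blocks of size d (hypothetical values matching M = 250 columns):
# g, d = 50, 5
# Bs = [np.eye(d) for _ in range(g)]          # no intra-block correlation assumed
# gammas = np.ones(g)
# mu_x, Sigma_x = bsbl_posterior(Phi, y, gammas, Bs, lam=1e-3)
```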

BSBL framework. All parameters can be estimated by maximizing the Type II likelihood, i.e. by minimizing the cost
$L(\Theta) = -2\log \int P(y|x; \lambda)\, P(x; (\gamma_i, B_i)_{i=1}^{g})\, dx = \log\left|\lambda I + \Phi\Sigma_0\Phi^T\right| + y^T\left(\lambda I + \Phi\Sigma_0\Phi^T\right)^{-1} y$
Different optimization strategies lead to different BSBL algorithms.

BSBL Framework
BSBL-EM: minimize the cost function using Expectation-Maximization.
BSBL-BO: minimize the cost function using a bound-optimization (Majorize-Minimization) technique.
BSBL-l1: minimize the cost function using a sequence of reweighted l1 problems.

Summary. Bayesian methods offer interesting algorithmic options: MAP estimation and Sparse Bayesian Learning. They are versatile and can be more easily employed in problems with structure. The algorithms can often be justified by studying the resulting objective functions.