Bayesian Methods for Sparse Signal Recovery


Bayesian Methods for Sparse Signal Recovery. Bhaskar D. Rao, University of California, San Diego. Thanks to David Wipf, Jason Palmer, Zhilin Zhang and Ritwik Giri.

Motivation


Motivation Sparse signal recovery is an interesting area with many potential applications. Methods developed for solving the sparse signal recovery problem can be a valuable tool for signal processing practitioners. There have been many interesting developments in the recent past that make the subject timely. The Bayesian framework offers some interesting options.

Outline


Outline Sparse Signal Recovery (SSR) problem and some extensions; Applications; Bayesian methods: MAP estimation and Empirical Bayes; Summary.

Problem Description: Sparse Signal Recovery (SSR) Model: $y = \Phi x + v$, where $y$ is an $N \times 1$ measurement vector, $\Phi$ is an $N \times M$ dictionary matrix with $M \gg N$, $x$ is the $M \times 1$ desired vector, which is sparse with $k$ nonzero entries, and $v$ is the measurement noise.

Problem Statement: SSR


Problem Statement: SSR Noise-free case: given a target signal $y$ and dictionary $\Phi$, find the weights $x$ that solve $\min_x \sum_i I(x_i \neq 0)$ subject to $y = \Phi x$, where $I(\cdot)$ is the indicator function. Noisy case: given a target signal $y$ and dictionary $\Phi$, find the weights $x$ that solve $\min_x \sum_i I(x_i \neq 0)$ subject to $\|y - \Phi x\|_2 < \beta$. A brute-force illustration of this combinatorial problem follows.
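
As a concrete (and deliberately naive) illustration of why this combinatorial problem is hard, the sketch below enumerates candidate supports for the noise-free case. The dictionary, signal, and the max_k cap are arbitrary assumptions for illustration, not part of the slides.

```python
import itertools
import numpy as np

def l0_brute_force(Phi, y, max_k=3, tol=1e-8):
    """Return the sparsest x (up to max_k nonzeros) satisfying Phi @ x = y."""
    N, M = Phi.shape
    for k in range(1, max_k + 1):                        # try supports of growing size
        for support in itertools.combinations(range(M), k):
            sub = Phi[:, support]
            coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
            if np.linalg.norm(sub @ coef - y) < tol:     # exact fit found
                x = np.zeros(M)
                x[list(support)] = coef
                return x
    return None

rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 12))
x_true = np.zeros(12)
x_true[[2, 7]] = [1.5, -2.0]
y = Phi @ x_true
print(l0_brute_force(Phi, y))                            # recovers x_true
```

The number of supports to test grows combinatorially in M, which is why the practical algorithms discussed next are needed.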

Useful Extensions


Useful Extensions Block sparsity; Multiple Measurement Vectors (MMV); Block MMV; MMV with time-varying sparsity.

Block Sparsity Variations include equal blocks, unequal blocks, and block boundaries known or unknown.

Multiple Measurement Vectors (MMV) Multiple measurements: L measurements. Common sparsity profile: k nonzero rows.

Applications

Applications Signal Representation (Mallat, Coifman, Donoho, ...), EEG/MEG (Leahy, Gorodnitsky, Ioannides, ...), Robust Linear Regression and Outlier Detection (Jin, Giannakis, ...), Speech Coding (Ozawa, Ono, Kroon, ...), Compressed Sensing (Donoho, Candes, Tao, ...), Magnetic Resonance Imaging (Lustig, ...), Sparse Channel Equalization (Fevrier, Proakis, ...), Face Recognition (Wright, Yang, ...), Cognitive Radio (Eldar, ...), and many more.

MEG/EEG Source Localization: source space (x), sensor space (y). The forward model dictionary can be computed using Maxwell's equations [Sarvas, 1987]. In many situations the active brain regions may be relatively sparse, and so a sparse inverse problem must be solved. [Baillet et al., 2001]

Sparse Channel Estimation Potential Application: Underwater Acoustics

Speech Modeling and Deconvolution Speech model. Speech-specific assumptions: the voiced excitation is block sparse and the filter is an all-pole filter 1/A(z).

Compressive Sampling (CS)

Compressive Sampling (CS) Transform coding: $b = \Psi x$, where $\Psi$ is the transform and $b$ is the original data/image.


Compressive Sampling (CS) Computation: solve for $x$ such that $\Phi x = y$, where $\Phi = A\Psi$ is the effective dictionary. Reconstruction: $b = \Psi x$. Issues: need to recover the sparse signal $x$ under the constraint $\Phi x = y$, and need to design the sampling matrix $A$.

Robust Linear Regression $X$, $y$: data; $c$: regression coefficients; $n$: model noise, with $y = Xc + n$. The model noise splits into $w$, a sparse component (outliers), and $\epsilon$, a Gaussian component (regular error). Transform into an overcomplete representation: $y = Xc + \Phi w + \epsilon$ with $\Phi = I$, or $y = [X \;\; \Phi]\,[c;\, w] + \epsilon$.

Potential Algorithmic Approaches Finding the optimal solution is NP-hard, so we need low-complexity algorithms with reasonable performance. Greedy search techniques: Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP); a sketch of OMP follows. Minimizing diversity measures: the indicator function is not continuous, so define surrogate cost functions that are more tractable and whose minimization leads to sparse solutions, e.g. $\ell_1$ minimization. Bayesian methods: make appropriate statistical assumptions on the solution and apply estimation techniques to identify the desired sparse solution.
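
A minimal sketch of Orthogonal Matching Pursuit, one of the greedy techniques named above. The fixed number of selection steps k and the least-squares refit are standard choices; the helper name omp and all parameters are assumptions, not taken from the slides.

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: greedily build a k-term support, refitting by least squares."""
    N, M = Phi.shape
    residual = y.copy()
    support = []
    for _ in range(k):
        j = int(np.argmax(np.abs(Phi.T @ residual)))     # column most correlated with residual
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef            # orthogonalize against chosen columns
    x = np.zeros(M)
    x[support] = coef
    return x
```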

Bayesian Methods 1. MAP Estimation Framework (Type I) 2. Hierarchical Bayesian Framework (Type II)

MAP Estimation Framework (Type I) Problem statement: $\hat{x} = \arg\max_x P(x \mid y) = \arg\max_x P(y \mid x) P(x)$. Choosing a Laplacian prior $P(x) = \frac{a}{2} e^{-a|x|}$ and a Gaussian likelihood $P(y \mid x)$ leads to the familiar LASSO framework.

Hierarchical Bayesian Framework (Type II)


Hierarchical Bayesian Framework (Type II) Problem statement: $\hat{\gamma} = \arg\max_\gamma P(\gamma \mid y) = \arg\max_\gamma \int P(y \mid x) P(x \mid \gamma) P(\gamma)\, dx$. Using this estimate we can compute the posterior of interest, $P(x \mid y; \hat{\gamma})$. Example: Bayesian LASSO. The Laplacian prior $P(x)$ can be represented as a Gaussian scale mixture: $P(x) = \int P(x \mid \gamma) P(\gamma)\, d\gamma = \int \frac{1}{\sqrt{2\pi\gamma}} \exp\!\left(-\frac{x^2}{2\gamma}\right) \frac{a^2}{2} \exp\!\left(-\frac{a^2\gamma}{2}\right) d\gamma = \frac{a}{2}\exp(-a|x|)$. A numerical check of this identity follows.
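
A quick numerical sanity check of this Gaussian-scale-mixture identity, assuming an arbitrary value of a; the quadrature-based helper gsm_density is illustrative only.

```python
import numpy as np
from scipy.integrate import quad

def gsm_density(x, a):
    """Integrate N(x; 0, gamma) against the exponential mixing density over gamma > 0."""
    integrand = lambda g: (np.exp(-x**2 / (2 * g)) / np.sqrt(2 * np.pi * g)
                           * (a**2 / 2) * np.exp(-a**2 * g / 2))
    value, _ = quad(integrand, 0, np.inf)
    return value

a = 1.5
for x in (0.3, 1.0, 2.5):
    # The mixture integral and the Laplacian (a/2) exp(-a|x|) should agree.
    print(gsm_density(x, a), a / 2 * np.exp(-a * abs(x)))
```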

MAP Estimation Problem statement: $\hat{x} = \arg\max_x P(x \mid y) = \arg\max_x P(y \mid x) P(x)$. Advantages: many options to promote sparsity, i.e. choose some sparse prior over $x$; growing options for solving the underlying optimization problem; can be related to LASSO and other $\ell_1$ minimization techniques by using a suitable $P(x)$.


MAP Estimation Assumption: Gaussian noise. $\hat{x} = \arg\max_x P(y \mid x) P(x) = \arg\min_x -\log P(y \mid x) - \log P(x) = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_{i=1}^{M} g(|x_i|)$. Theorem: if $g$ is a non-decreasing, strictly concave function on $\mathbb{R}^+$, the local minima of the above optimization problem are extreme points, i.e. have at most $N$ nonzero entries.


Special cases of MAP estimation Gaussian prior: a Gaussian assumption on $P(x)$ leads to the $\ell_2$-norm regularized problem $\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \|x\|_2^2$. Laplacian prior: a Laplacian assumption on $P(x)$ leads to the standard $\ell_1$-norm regularized problem, i.e. LASSO: $\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \|x\|_1$. A sketch contrasting the two estimators follows.
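
A sketch contrasting the two special cases, assuming arbitrary regularization weights and a simple iterative soft-thresholding (ISTA) solver for the LASSO; it minimizes the usual (1/2)||y - Phi x||^2 + lam ||x||_1 scaling, which only rescales lambda relative to the slide's objective. The problem sizes and thresholds are assumptions.

```python
import numpy as np

def ridge(Phi, y, lam):
    """Gaussian prior: closed-form l2-regularized (ridge) solution."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def lasso_ista(Phi, y, lam, n_iter=500):
    """Laplacian prior: LASSO solved by iterative soft-thresholding (ISTA)."""
    x = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2             # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = x - step * Phi.T @ (Phi @ x - y)              # gradient step on the data fit
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 200)) / np.sqrt(50)
x0 = np.zeros(200)
x0[rng.choice(200, 8, replace=False)] = rng.standard_normal(8)
y = Phi @ x0 + 0.01 * rng.standard_normal(50)
print("ridge nonzeros:", np.sum(np.abs(ridge(Phi, y, 0.1)) > 1e-3))       # dense estimate
print("lasso nonzeros:", np.sum(np.abs(lasso_ista(Phi, y, 0.02)) > 1e-3)) # sparse estimate
```

With a strongly overcomplete dictionary the ridge estimate stays dense while the $\ell_1$ estimate is sparse, mirroring the two priors above.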

Examples of Sparse Distributions Sparse distributions can be viewed using the general framework of supergaussian distributions: $P(x; \sigma, p) = \frac{p}{2\sigma\,\Gamma(1/p)} \exp\!\left(-\left|\frac{x}{\sigma}\right|^p\right)$, $p \le 1$. If a unit-variance distribution is desired, $\sigma$ becomes a function of $p$.

Examples of Sparsity Penalties Practical selections: $g(x_i) = \log(x_i^2 + \epsilon)$ [Chartrand and Yin, 2008]; $g(x_i) = \log(|x_i| + \epsilon)$ [Candes et al., 2008]; $g(x_i) = |x_i|^p$ [Rao et al., 1999]. Different choices favor different levels of sparsity.

Which sparse prior to choose? $\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_{l=1}^{M} |x_l|^p$. Two issues: if the prior is too sparse, i.e. $p \to 0$, we may get stuck at a local minimum, which results in a convergence error. If the prior is not sparse enough, i.e. $p \to 1$, the global minimum can be found, but it may not be the sparsest solution, which results in a structural error.


MAP Estimation The underlying optimization problem is $\hat{x} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_{i=1}^{M} g(|x_i|)$. Useful algorithms exist to minimize this cost function with a strictly concave penalty function $g$ on $\mathbb{R}^+$ (reweighted $\ell_2$/$\ell_1$ algorithms). The essence of these algorithms is to bound the concave penalty function from above and follow the steps of a Majorize-Minimization (MM) algorithm.


Reweighted $\ell_1$ optimization Assume $g(x_i) = h(|x_i|)$ with $h$ concave; the concave penalty is bounded from above and minimized iteratively. Updates: $x^{(k+1)} \leftarrow \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_i w_i^{(k)} |x_i|$, $\quad w_i^{(k+1)} \leftarrow \left.\frac{\partial g(x_i)}{\partial |x_i|}\right|_{x_i = x_i^{(k+1)}}$.

Reweighted $\ell_1$ optimization: Candes et al., 2008. Penalty: $g(x_i) = \log(|x_i| + \epsilon)$, $\epsilon \ge 0$. Weight update: $w_i^{(k+1)} \leftarrow \left[\,|x_i^{(k+1)}| + \epsilon\,\right]^{-1}$. A sketch of this scheme follows.
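
A sketch of this reweighted l1 scheme with the Candes-style weights. The inner weighted-l1 problem is solved here by per-coordinate soft thresholding (ISTA); the inner solver, parameter values, and iteration counts are assumptions, not prescribed by the slides.

```python
import numpy as np

def weighted_l1(Phi, y, lam, w, n_iter=300):
    """Inner problem: min (1/2)||y - Phi x||^2 + lam * sum_i w_i |x_i|, solved by ISTA."""
    x = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2
    for _ in range(n_iter):
        z = x - step * Phi.T @ (Phi @ x - y)
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)  # per-coordinate threshold
    return x

def reweighted_l1(Phi, y, lam=0.01, eps=0.1, n_outer=5):
    """Outer loop: alternate the weighted-l1 solve with the weight update w_i = 1/(|x_i| + eps)."""
    w = np.ones(Phi.shape[1])                     # first pass is plain l1
    for _ in range(n_outer):
        x = weighted_l1(Phi, y, lam, w)
        w = 1.0 / (np.abs(x) + eps)               # Candes-style weight update
    return x
```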


Reweighted $\ell_2$ optimization Assume $g(x_i) = h(x_i^2)$ with $h$ concave. Upper-bound $h(\cdot)$ as before; the bound is quadratic in the variables, leading to a weighted 2-norm optimization problem. Updates: $x^{(k+1)} \leftarrow \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_i w_i^{(k)} x_i^2 = W^{(k)} \Phi^T (\lambda I + \Phi W^{(k)} \Phi^T)^{-1} y$, $\quad w_i^{(k+1)} \leftarrow \left.\frac{\partial g(x_i)}{\partial x_i^2}\right|_{x_i = x_i^{(k+1)}}$, $\quad W^{(k+1)} \leftarrow \mathrm{diag}[w^{(k+1)}]^{-1}$.


Reweighted $\ell_2$ optimization: Examples. FOCUSS algorithm [Rao et al., 2003]. Penalty: $g(x_i) = |x_i|^p$, $0 \le p \le 2$. Weight update: $w_i^{(k+1)} \leftarrow |x_i^{(k+1)}|^{p-2}$. Properties: well-characterized convergence rates; very susceptible to local minima when $p$ is small. Chartrand and Yin (2008) algorithm. Penalty: $g(x_i) = \log(x_i^2 + \epsilon)$, $\epsilon \ge 0$. Weight update: $w_i^{(k+1)} \leftarrow [(x_i^{(k+1)})^2 + \epsilon]^{-1}$. Properties: slowly reducing $\epsilon$ to zero smooths out local minima initially, allowing better solutions to be found. A sketch of the FOCUSS iteration follows.
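
A sketch of the reweighted l2 iteration with the FOCUSS weight update, using the closed-form weighted least-squares solution from the previous slide. The initialization, lambda, p, iteration count, and the small numerical floor are assumptions.

```python
import numpy as np

def focuss(Phi, y, p=0.8, lam=1e-3, n_iter=30, floor=1e-8):
    """Reweighted l2 with FOCUSS weights w_i = |x_i|^(p-2), i.e. W_ii = |x_i|^(2-p)."""
    N, M = Phi.shape
    x = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), y)   # regularized min-norm start
    for _ in range(n_iter):
        W = np.diag(np.abs(x) ** (2 - p) + floor)                    # diag[w]^{-1}, with a tiny floor
        x = W @ Phi.T @ np.linalg.solve(lam * np.eye(N) + Phi @ W @ Phi.T, y)
    return x
```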

Empirical Comparison

Empirical Comparison Figure: probability of successful recovery vs. number of nonzero coefficients.

Limitation of MAP-based methods To retain the same maximally sparse global solution as the $\ell_0$ norm under general conditions, any possible MAP algorithm will possess on the order of $\binom{M}{N}$ local minima.

Hierarchical Bayes: Sparse Bayesian Learning(SBL)


Hierarchical Bayes: Sparse Bayesian Learning (SBL) MAP estimation is just penalized regression, so the Bayesian interpretation has not contributed much so far. MAP methods are interested only in the mode of the posterior, but SBL uses posterior information beyond the mode, i.e. the posterior distribution. Problem: for most sparse priors it is not possible to compute the normalized posterior $P(x \mid y)$, hence approximations are needed.

Hierarchical Bayesian Framework (Type II)

Hierarchical Bayesian Framework (Type II) In order for this framework to be useful, we need tractable representations: Gaussian scale mixtures.


Construction of Sparse Priors Separability: $P(x) = \prod_i P(x_i)$. Gaussian scale mixture: $P(x_i) = \int P(x_i \mid \gamma_i) P(\gamma_i)\, d\gamma_i = \int N(x_i; 0, \gamma_i) P(\gamma_i)\, d\gamma_i$. Most sparse priors over $x$ (including those with concave $g$) can be represented in this GSM form, and different scale mixing densities $P(\gamma_i)$ lead to different sparse priors [Palmer et al., 2006]. Instead of solving a MAP problem in $x$, in the Bayesian framework one estimates the hyperparameters $\gamma$, leading to an estimate of the posterior distribution for $x$, i.e. $P(x \mid y; \hat{\gamma})$ (Sparse Bayesian Learning).


Examples of Gaussian Scale Mixtures Laplacian density: $P(x; a) = \frac{a}{2}\exp(-a|x|)$; scale mixing density: $P(\gamma) = \frac{a^2}{2}\exp\!\left(-\frac{a^2\gamma}{2}\right)$, $\gamma \ge 0$. Student-t distribution: $P(x; a, b) = \frac{b^a\,\Gamma(a + 1/2)}{(2\pi)^{1/2}\,\Gamma(a)} \cdot \frac{1}{(b + x^2/2)^{a+1/2}}$; scale mixing density: Gamma distribution. Generalized Gaussian: $P(x; p) = \frac{1}{2\Gamma(1 + 1/p)} e^{-|x|^p}$; scale mixing density: positive alpha-stable density of order $p/2$.


Sparse Bayesian Learning (Tipping) Model: $y = \Phi x + v$. Solving for the optimal $\gamma$: $\hat{\gamma} = \arg\max_\gamma P(\gamma \mid y) = \arg\max_\gamma P(y \mid \gamma) P(\gamma) = \arg\min_\gamma \log|\Sigma_y| + y^T \Sigma_y^{-1} y - 2\sum_i \log P(\gamma_i)$, where $\Sigma_y = \sigma^2 I + \Phi \Gamma \Phi^T$ and $\Gamma = \mathrm{diag}(\gamma)$. Empirical Bayes: choose $P(\gamma_i)$ to be a non-informative prior.


Sparse Bayesian Learning Computing the posterior: because of our convenient choice the posterior can be easily computed, i.e. $P(x \mid y; \hat{\gamma}) = N(\mu_x, \Sigma_x)$ where $\mu_x = E[x \mid y; \hat{\gamma}] = \hat{\Gamma} \Phi^T (\sigma^2 I + \Phi \hat{\Gamma} \Phi^T)^{-1} y$ and $\Sigma_x = \mathrm{Cov}[x \mid y; \hat{\gamma}] = \hat{\Gamma} - \hat{\Gamma} \Phi^T (\sigma^2 I + \Phi \hat{\Gamma} \Phi^T)^{-1} \Phi \hat{\Gamma}$. Updating $\gamma$: using the EM algorithm with a non-informative prior over $\gamma$, the update rule becomes $\gamma_i \leftarrow \mu_x(i)^2 + \Sigma_x(i, i)$. A sketch of this iteration follows.
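
A sketch of this EM iteration, assuming a known noise variance sigma2 and a fixed iteration count; the function and variable names are ours, not from the slides.

```python
import numpy as np

def sbl_em(Phi, y, sigma2=1e-4, n_iter=100):
    """EM-style SBL: alternate posterior moments of x with the gamma update above."""
    N, M = Phi.shape
    gamma = np.ones(M)
    for _ in range(n_iter):
        Gamma = np.diag(gamma)
        Sigma_y = sigma2 * np.eye(N) + Phi @ Gamma @ Phi.T
        K = Gamma @ Phi.T @ np.linalg.inv(Sigma_y)
        mu_x = K @ y                              # posterior mean
        Sigma_x = Gamma - K @ Phi @ Gamma         # posterior covariance
        gamma = mu_x ** 2 + np.diag(Sigma_x)      # EM update with non-informative prior
    return mu_x, gamma
```

In practice many of the gamma_i shrink toward zero, which prunes the corresponding dictionary columns and yields a sparse posterior mean.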

SBL properties


SBL properties Local minima are sparse, i.e. have at most $N$ nonzero $\gamma_i$. The Bayesian inference cost is generally much smoother than the associated MAP estimation cost: fewer local minima. At high signal-to-noise ratio, the global minimum is the sparsest solution: no structural problems.

Empirical Comparison For each test case: (1) generate a random dictionary $\Phi$ with 50 rows and 250 columns from the normal distribution and normalize each column to unit 2-norm; (2) select the support for the true sparse coefficient vector $x_0$ randomly; (3) generate the nonzero components of $x_0$ from the normal distribution; (4) compute the signal $y = \Phi x_0$ (noiseless case); (5) compare SBL with the previous methods with regard to estimating $x_0$; (6) average over 1000 independent trials. A sketch of this data generation follows.
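
A sketch of this data-generation procedure; the success-criterion threshold and helper names are assumptions.

```python
import numpy as np

def gen_trial(N=50, M=250, k=10, rng=None):
    """One test case: random unit-column dictionary, random k-sparse x0, noiseless y."""
    if rng is None:
        rng = np.random.default_rng()
    Phi = rng.standard_normal((N, M))
    Phi /= np.linalg.norm(Phi, axis=0)            # unit l2-norm columns
    x0 = np.zeros(M)
    support = rng.choice(M, k, replace=False)
    x0[support] = rng.standard_normal(k)          # nonzero entries from N(0, 1)
    y = Phi @ x0                                  # noiseless measurements
    return Phi, y, x0

def success(x_hat, x0, tol=1e-3):
    """Declare a trial successful if the estimate matches x0 to within tol."""
    return np.max(np.abs(x_hat - x0)) < tol
```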

Empirical Comparison: 1000 trials Figure: probability of successful recovery vs. number of nonzero coefficients.

Empirical Comparison: Multiple Measurement Vectors (MMV) Generate the data matrix via $Y = \Phi X_0$ (noiseless), where: (1) $X_0$ is 100-by-5 with random nonzero rows; (2) $\Phi$ is 50-by-100 with Gaussian i.i.d. entries.

Empirical Comparison: 1000 trials

Summary


Summary Bayesian methods offer interesting and useful options for the sparse signal recovery problem: MAP estimation (reweighted $\ell_2$/$\ell_1$ algorithms) and Sparse Bayesian Learning. They are versatile and can be more easily employed in problems with structure. The algorithms can often be justified by studying the resulting objective functions.