Hybrid Approximate Message Passing for Generalized Group Sparsity


Alyson K. Fletcher (a) and Sundeep Rangan (b)
(a) UC Santa Cruz, Santa Cruz, CA, USA; (b) NYU-Poly, Brooklyn, NY, USA

Further author information: (Send correspondence to S. Rangan.) A. K. Fletcher: afletcher@ucsc.edu. S. Rangan: srangan@poly.edu.

ABSTRACT

We consider the problem of estimating a group sparse vector x ∈ R^n under a generalized linear measurement model. Group sparsity of x means that the activity of different components of the vector occurs in groups, a feature common in estimation problems in image processing, simultaneous sparse approximation and feature selection with grouped variables. Unfortunately, many current group sparse estimation methods require that the groups are non-overlapping. This work considers problems with what we call generalized group sparsity, where the activities of the different components of x are modeled as functions of a small number of boolean latent variables. We show that this model can incorporate a large class of overlapping group sparse problems, including problems in sparse multivariable polynomial regression and gene expression analysis. To estimate vectors with such group sparse structures, the paper proposes to use a recently developed hybrid generalized approximate message passing (HyGAMP) method. Approximate message passing (AMP) refers to a class of algorithms based on Gaussian and quadratic approximations of loopy belief propagation for estimation of random vectors under linear measurements. The HyGAMP method extends the AMP framework to incorporate priors on x described by graphical models, of which generalized group sparsity is a special case. We show that the HyGAMP algorithm is computationally efficient, general, and offers superior performance in certain synthetic data test cases.

Keywords: Compressed sensing, group sparsity, message passing, graphical models, approximate message passing

1 INTRODUCTION

Sparsity-based estimation methods have become popular in a number of areas including inverse problems in image processing, statistical feature selection, dimensionality reduction and, most recently, compressed sensing [1-3]. One of the basic problems in sparse signal processing is to estimate a sparse vector x ∈ R^n from noisy linear measurements of the form

    y = z + w,   z = A x,                                              (1)

where A ∈ R^{m×n} is a known transform matrix and w is additive noise. More generally, one may also be interested in so-called generalized linear models [4, 5], where the output mapping from z to y is described by a general (possibly nonlinear) probabilistic transfer function P(y_i | z_i), so that

    y_i ~ P(y_i | z_i),   z = A x.                                     (2)

In either case, x being sparse means that the vector has few non-zero components, which reduces the effective degrees of freedom. The goal of sparse estimation is to exploit this property for improved estimation of x from the observations y.

Bayesian formulations of sparse estimation [6-8] typically model sparsity by imposing a prior on the vector x such that the marginal distributions P(x_j) are sparse, meaning that the components have a high probability of being zero or close to zero. In the simplest Bayesian models, the components x_j would be modeled as independent. However, in many practical problems, the sparsity patterns of the components have dependencies that impose further constraints on the signal, and these constraints can potentially be exploited in signal recovery.
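To make the measurement models (1) and (2) concrete, the following minimal Python sketch draws a sparse vector and passes it through a linear-Gaussian channel and a logistic channel. All dimensions, sparsity levels and noise values here are illustrative choices, not parameters taken from the paper.

```python
import numpy as np

# Illustrative sizes and noise level (not taken from the paper).
rng = np.random.default_rng(0)
n, m = 100, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)                       # known transform matrix
x = np.where(rng.random(n) < 0.1, rng.standard_normal(n), 0.0)     # a sparse vector
z = A @ x

# Linear-Gaussian channel as in (1): y = z + w.
y_awgn = z + 0.1 * rng.standard_normal(m)

# Generalized linear channel as in (2): a logistic (binary) output, y_i ~ P(y_i | z_i).
p = 1.0 / (1.0 + np.exp(-z))
y_logistic = (rng.random(m) < p).astype(int)
```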

Recent years have thus seen considerable interest in finding suitable so-called structured sparse models that can capture a rich set of dependencies between components while enabling tractable estimation algorithms that can leverage that structure [9, 10].

One particularly simple model for structured sparsity is group sparsity (also sometimes called block sparsity) [11, 12]. In its simplest form, one is given K groups G_1, ..., G_K that form a disjoint partition of the indices {1, ..., n}, so that each component x_j belongs to exactly one of the K groups. In the group sparse model, all the components x_j within the same group G_k are zero (inactive) or non-zero (active) together. This group structure can be found in a number of applications including simultaneous sparse approximation [13, 14], model selection with grouped variables [11], canonical correlation analysis [15], image annotation [16] and image reconstruction [17].

Recovery of the vector x under a group sparse model is often performed with variants of traditional (i.e., non-group) sparse estimation methods. For example, group variants [11, 12] of the LASSO method [18] estimate x by solving a regularized least-squares problem with a mixed l_1-l_2 regularizer that promotes the group sparsity. Similar to the standard LASSO method, the resulting optimization problem is typically convex and can be solved via a number of fast methods, including [19-21]. Group variants [22, 23] of widely used matching pursuit methods [24] have also been successful. However, a key limitation of these approaches is that the groups typically must be disjoint, or non-overlapping. The treatment of overlapping groups generally requires some approximations and restrictions [14, 25-27]. We review some of these methods in more detail below.

In recent years, an alternate, and potentially more general, approach for handling structured sparse problems has been offered by so-called turbo and hybrid extensions of approximate message passing (AMP). AMP methods refer to a class of algorithms based on Gaussian and quadratic approximations of loopy belief propagation designed for the estimation of random vectors x from linear measurements [28-38]. AMP methods have attracted considerable recent attention in the context of compressed sensing due to their computational simplicity, generality and analytic tractability. Also, although traditional AMP methods generally require the vector x to have independent components, turbo and hybrid extensions [39-43] have been proposed that can incorporate priors on x described by arbitrary graphical models. These turbo and hybrid methods operate by combining AMP updates across the transform A with standard loopy belief propagation updates in the factor graph associated with the prior on the vector x. The methodology applies to a tremendous range of problems, and one particular version of these methods, called Hybrid Generalized Approximate Message Passing (HyGAMP) and described in [43], has been proposed for certain classes of group sparse problems.

The contribution of this paper is to extend and evaluate the HyGAMP methodology on a larger class of group sparsity problems that we call generalized group sparsity. In the proposed generalized group sparse model, the sparsity pattern on the n components of the vector x is modeled as a deterministic function of K independent boolean latent variables ξ_k, k = 1, ..., K, where K is typically smaller than n. The mapping between the latent variables ξ_k and the activities of the components x_j captures the correlations between components and can model a range of problems with overlapping groups.
We show that the proposed HyGAMP algorithm for generalized group sparsity offers a number of attractive features:

Generality: The model applies to structured sparse models with an arbitrary mapping between the latent variables and the activities of the components of x. For computational purposes, the only limitation is that the activity of each component should depend only on a small number of latent variables. We show that the model can incorporate a number of overlapping group sparse problems, including unions and intersections of groups, and applications including multivariable polynomial regression, sparse boolean regression and gene expression analysis (a short sketch of this generative model follows this list).

Support for generalized linear models: The HyGAMP framework is an extension of the Generalized Approximate Message Passing (GAMP) method in [37, 38, 44], which allows for generalized linear models (2). As a result, our methodology can support both group sparse classification (where the outputs y_i are discrete) as well as regression problems.

Computational simplicity: As described in [45], the GAMP method on which the HyGAMP algorithm is based is essentially a first-order algorithm with a per-iteration cost similar to the fastest known compressed sensing methods, such as inexact ADMM and iterative thresholding. In particular, each iteration of the approximate message passing updates requires only multiplications by A and A^T; no matrix inverses or vector-valued estimation updates are required. As we will see, the additional updates associated with the HyGAMP algorithm for group sparsity are typically small.

Performance: In Section 5, we test the algorithm on random instances of an overlapping group sparse problem. We show that the method outperforms a number of state-of-the-art techniques, including group variants of LASSO [11, 12].
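As a concrete illustration of the generative model just described, the sketch below draws the group activity indicators ξ_k, maps them through an activity function to the variable activity indicators u_j, and then draws x from a spike-and-slab conditional prior. The sizes, the choice of a logical-or activity function, and the N(0,1) slab are illustrative assumptions, not choices mandated by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, alpha, d = 12, 4, 0.3, 2          # illustrative sizes and activity level

# Each component j depends on a small subset gamma(j) of the K latent variables.
gamma = [rng.choice(K, size=d, replace=False) for _ in range(n)]

def f_or(xi_sub):
    # One possible activity function f_j: the union (logical or) of the selected groups.
    return int(np.any(xi_sub))

xi = (rng.random(K) < alpha).astype(int)                 # group activity indicators, eq. (5)
u = np.array([f_or(xi[gamma[j]]) for j in range(n)])     # variable activity indicators, eq. (4)

# Spike-and-slab conditional prior, eq. (3): x_j = 0 if u_j = 0, else x_j drawn from V_j
# (taken to be N(0,1) here for concreteness).
x = np.where(u == 1, rng.standard_normal(n), 0.0)
```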

Figure 1. Factor graph representation of the generalized group sparsity model. The problem is to estimate the components x_j of a random vector x observed through a known linear transform z = Ax followed by a componentwise, probabilistic measurement channel generating the observed variables y_i. The group sparse structure on the variables x_j is modeled through a set of boolean latent variables ξ_k, k = 1, ..., K. (Only the caption is reproduced here; the graph itself connects the factors f_j(ξ_γ(j)), P(x_j | u_j) and P(y_i | z_i) through the variables ξ_k, u_j, x_j, z_i and y_i.)

2 GENERALIZED GROUP SPARSITY

We consider the problem of estimating a vector x ∈ R^n from measurements y ∈ R^m under the factor graph model shown in Fig. 1; a general treatment of graphical models can be found in [46]. In the factor graph in Fig. 1, the variables x_j, j = 1, ..., n, are the components of the unknown random vector x ∈ R^n. To model the sparse structure of x, we assume that, corresponding to each component x_j, there is a boolean latent variable u_j ∈ {0, 1}, where u_j = 0 when x_j = 0 and u_j = 1 when x_j is potentially non-zero. When u_j = 1 we will say the component x_j is active, and we call the variables u_j the variable activity indicators. We assume that, given the vector u of variable activity indicators, the components of x are independent with the conditional distributions P(x_j | u_j) given by

    x_j = 0 if u_j = 0,   x_j ~ V_j if u_j = 1,                        (3)

where V_j is a random variable having the distribution of the component x_j in the event that the component is active. More general two-component mixture distributions can also easily be incorporated into this model.

As a simple Bayesian formulation of the standard (i.e., non-group) compressed sensing problem, one could assume that the variable activity indicators u_j are independent, with each variable having some small probability of being active (i.e., P(u_j = 1) is small). However, our goal in this work is to consider a more general class of group structured sparse problems. To this end, we assume that the variable activity indicators u_j are themselves functions of a second set of boolean latent variables ξ_k ∈ {0, 1}, k = 1, ..., K, where K is typically less than n. We assume that each variable activity indicator u_j is a deterministic function of the variables ξ_k of the form

    u_j = f_j(ξ_γ(j)),                                                 (4)

where γ(j) ⊆ {1, ..., K} is a subset of indices and ξ_γ(j) denotes the sub-vector of ξ with components ξ_k, k ∈ γ(j). We let G_k ⊆ {1, ..., n} be the group of indices j such that k ∈ γ(j). Thus G_k is the set of component indices j such that u_j is a function of ξ_k. The groups G_k may be overlapping. The group G_k will be called active or inactive depending on whether ξ_k = 1 or ξ_k = 0.

We will call the variables ξ_k the group activity indicators and model them as independent with

    P(ξ_k = 1) = 1 - P(ξ_k = 0) = α_k                                  (5)

for some activity level α_k ∈ (0, 1). We will see below that this latent variable model can incorporate a wide range of interesting overlapping group sparse structures.

Similar to the GAMP framework in [37, 38, 44], we assume a generalized linear measurement model (2), where the observed vector y is generated by first passing x through a linear transform z = Ax, followed by a separable, componentwise measurement channel with probability distribution functions P(y_i | z_i). The model is general and includes standard additive white Gaussian noise (AWGN) models such as (1), where w has independent Gaussian components, but can also incorporate nonlinearities or non-Gaussian randomness. In particular, the model can be used for classification problems where the observations are boolean (y_i ∈ {0, 1}) and the output transfer function P(y_i = 1 | z_i) is typically some sigmoidal function of z_i, such as a logistic or probit model [47]. In the context of this work, these output channels enable Bayesian approaches to the group sparse classification problems considered in [23, 48, 49]. Earlier work of ours has also used nonlinear outputs for Poisson spiking processes in neural recordings [50] and for quantized outputs [51].

Given the above description, the joint probability distribution function of the variables can be written as

    P(x, z, u, ξ, y) = 1_{z = Ax} \prod_{i=1}^m P(y_i | z_i) \prod_{j=1}^n P(x_j | u_j) \prod_{k=1}^K P(ξ_k) \prod_{j=1}^n 1_{u_j = f_j(ξ_γ(j))},    (6)

where the indicators 1_{z = Ax} and 1_{u_j = f_j(ξ_γ(j))} are used to constrain the random vectors to satisfy the constraints z = Ax in (2) and u_j = f_j(ξ_γ(j)) in (4).

Our goal in this work is to estimate the posterior marginals of the components of the transform inputs and outputs. That is, given a vector of observations y, we are interested in estimating the posterior distributions

    P(x_j | y),   P(z_i | y),                                          (7)

and possibly the posterior distributions on the activity indicators ξ_k and u_j as well. From these marginal distributions, one can compute a variety of quantities of interest, including the minimum mean squared error (MMSE) estimates E(x_j | y) and E(z_i | y), or optimal estimates with respect to any other loss function. In addition, given any estimate, the marginal distributions can be used to quantify the uncertainty through the distribution of the error.

Unfortunately, exact computation of the posterior marginals in (7) is in general intractable, since it involves marginalization of the joint distribution (6) across the group activity indicators ξ. Since there are 2^K possible values for ξ, the cost of computing the marginal distributions generally grows exponentially in the number of groups. The Hybrid-GAMP algorithm presented in Section 4 will provide a way to approximately compute these marginal distributions.
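The exponential cost of exact marginalization is easy to see even for the simplest quantity in the model. The sketch below computes the prior activity probability P(u_j = 1) for a single component by brute-force enumeration of all 2^K group configurations; the subset gamma_j and the or-type activity function are hypothetical examples chosen only to make the computation concrete.

```python
import itertools
import numpy as np

K, alpha = 4, 0.2                          # small, illustrative values
gamma_j = [0, 2]                           # a hypothetical subset gamma(j) for one component
f_j = lambda xi_sub: int(any(xi_sub))      # e.g. a union (logical or) activity function

# Exact prior activity probability P(u_j = 1) by enumerating all xi in {0,1}^K.
p_active = 0.0
for xi in itertools.product([0, 1], repeat=K):
    p_xi = np.prod([alpha if b else 1.0 - alpha for b in xi])   # independent Bernoulli prior, eq. (5)
    p_active += p_xi * f_j([xi[k] for k in gamma_j])
print(p_active)   # equals 1 - (1 - alpha)**len(gamma_j) for the or-function

# The loop visits 2**K configurations; posterior marginalization of (6) scales the same way,
# which is why exact inference quickly becomes intractable as K grows.
```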
3 GROUP SPARSE EXAMPLES

Before describing the HyGAMP algorithm, it is useful to illustrate the above generalized group sparse model with some motivating examples.

Group Sparsity with Disjoint Groups

We first consider the standard group sparse model with non-overlapping groups. This is the model used in most of the group or block sparse literature; see, for example, [11, 12]. In this problem, we are given groups G_1, ..., G_K that form a disjoint partition of the set {1, ..., n}, so that each component index j ∈ {1, ..., n} belongs to exactly one group γ(j). The variable x_j is non-zero (i.e., active) only when its group is active. Thus, all the components x_j with indices j belonging to the same group G_k are active or inactive together.

To model this scenario in the above formalism, we let u_j, j = 1, ..., n, represent the activity indicators for the variables x_j and let ξ_k, k = 1, ..., K, be the activity indicators for the groups G_k. Since each variable is active only when its group is active, we assume

    u_j = 1 when ξ_γ(j) = 1,   u_j = 0 when ξ_γ(j) = 0,

which clearly fits the form of (4). If we additionally assume that the group activity variables ξ_k are independent with probabilities of the form (5), and that the components of x are conditionally independent given u with the mixture distribution (3), we see that the non-overlapping group sparse structure on x can be modeled as a special case of the generalized group sparsity model in Section 2.

Unions and Intersections of Groups

The above example considers non-overlapping groups. A simple example of overlapping groups that can be easily handled in the generalized sparsity framework is an arbitrary union or intersection of groups. As before, suppose that there are K groups G_1, ..., G_K, with the activity of each component x_j depending on a subset γ(j) ⊆ {1, ..., K}. Now, suppose we take the activity functions u_j = f_j(·) to be the logical or function

    f_j(ξ_γ(j)) = f_or(ξ_γ(j)) = 1 if ξ_k = 1 for any k ∈ γ(j), and 0 otherwise,     (8)

or the logical and function

    f_j(ξ_γ(j)) = f_and(ξ_γ(j)) = 1 if ξ_k = 1 for all k ∈ γ(j), and 0 otherwise.     (9)

The or function corresponds to a union of groups, so that the component x_j is active if any of the groups it belongs to is active, while the and function corresponds to an intersection of groups, where x_j is active only if all the groups it belongs to are active.

One application of the union of groups model is gene pathway analysis [25, 52]. In this application, A ∈ R^{m×n} is a data matrix of m samples of expression levels on n genes. Each sample i is labeled with some target variable y_i, such as whether a cancer tested in that sample is metastatic or non-metastatic. The goal is to find a linear classification or regression model between the expression data and the target variables. Since only a few genes are likely to play a role, we expect that the regression coefficients are sparse. Moreover, it is known that genes typically operate together in functional groups, with each gene potentially belonging to multiple groups. Thus, it is desirable to explain the data with a minimal number of the functional groups. A union of groups model can be used to enforce precisely this form of sparsity.
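The sketch below makes the union and intersection activity functions (8) and (9) concrete for a small set of hypothetical, overlapping groups; the group memberships and the realization of ξ are invented for illustration only.

```python
import numpy as np

# Hypothetical (overlapping) group memberships G_k, given as index lists.
groups = {0: [0, 1, 2, 3], 1: [2, 3, 4, 5], 2: [5, 6, 7]}
n, K = 8, len(groups)
gamma = [[k for k, G in groups.items() if j in G] for j in range(n)]   # gamma(j), as in (4)

f_or  = lambda xi, g: int(any(xi[k] for k in g))    # union of groups, eq. (8)
f_and = lambda xi, g: int(all(xi[k] for k in g))    # intersection of groups, eq. (9)

xi = np.array([1, 0, 1])                            # one realization of the group indicators
u_union        = [f_or(xi, gamma[j]) for j in range(n)]
u_intersection = [f_and(xi, gamma[j]) for j in range(n)]

# With disjoint groups every gamma(j) is a single index, and both choices
# reduce to u_j = xi_{gamma(j)}, recovering the standard group sparse model.
```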

Sparse Multivariable Polynomial Regression

An example where intersecting groups arise is multivariable polynomial regression. Suppose we are given data {(y_i, v_i)}, i = 1, ..., m, where, for each data sample i, y_i is an (observed) target variable and v_i = (v_{i1}, ..., v_{iK}) is a vector of K covariates. Suppose that we wish to fit a multivariable polynomial model of the form

    y_i ~ P(y_i | z_i),   z_i = \sum_{j ∈ J(d)} w_j v_i^j,             (10)

where d = 0, 1, ... is the polynomial degree, J(d) is the set of generalized multivariable indices

    J(d) = { j : j_k ∈ {0, 1, ...}, k = 1, ..., K, \sum_{k=1}^K j_k ≤ d },

and v^j denotes the multivariable monomial term v^j := v_1^{j_1} ··· v_K^{j_K}. For example, when K = 2 and d = 2, the model (10) reduces to the quadratic polynomial

    z_i = w_{00} + w_{10} v_1 + w_{01} v_2 + w_{20} v_1^2 + w_{02} v_2^2 + w_{11} v_1 v_2.

In (10), P(y_i | z_i) is the conditional distribution of the target variable y_i given z_i. The observation model is general, so that we can incorporate both regression and classification problems. The problem is to estimate the regression coefficients w_j, j ∈ J(d). If we let w be the vector of regression coefficients w_j, then we can write z = Aw for a suitable data matrix A built out of the covariates v_i. The challenge is that the number of regression coefficients grows exponentially in the polynomial degree d: specifically, if n is the dimension of w, then n = O(K^d).

To reduce the dimensionality of the regression, it may be reasonable in some applications to assume that only a small number of the covariates v_{ik} have influence on the targets z_i. To model this assumption in the group sparse formalism described above, let ξ_k ∈ {0, 1} be a boolean variable indicating whether the k-th covariate v_{ik} has influence on the targets y_i, with ξ_k = 1 when the covariate is active and ξ_k = 0 when it is inactive. Similarly, for each generalized index j ∈ J(d), let u_j ∈ {0, 1} be the boolean variable with u_j = 0 when the coefficient w_j = 0 and u_j = 1 when the coefficient is possibly non-zero. Now, in the expansion (10), we assume that the coefficient w_j = 0 whenever any of the variables appearing with a non-zero exponent is inactive. We can write this as

    u_j = 1 if ξ_k = 1 for all k ∈ γ(j), and u_j = 0 if ξ_k = 0 for any k ∈ γ(j),   where γ(j) := { k : j_k > 0 }.

This mapping is precisely the logical and function in (9). If, as before, we assume that the group activity indicators ξ_k are independent and that the components of the vector x are conditionally independent given u, the problem follows the generalized group sparse model in Section 2. Note that the groups G_k in this example are the sets of generalized indices j such that j_k > 0. Thus, G_k is the set of monomial terms v^j = v_1^{j_1} ··· v_K^{j_K} with a non-zero dependence on the covariate v_k. These groups are, in general, overlapping, and thus this example provides a useful test case where the groups necessarily overlap.

Sparse Linear Boolean Regression

A closely related application is what we call sparse linear boolean regression. As above, suppose we are given data {(y_i, v_i)}, i = 1, ..., m, where for each data sample i, y_i is a target variable and v_i = (v_{i1}, ..., v_{iK}) is a vector of covariates. However, in this case, suppose that the covariates are boolean variables v_{ik} ∈ {0, 1}, and that we wish to fit a linear model of the form

    z_i = \sum_{j=1}^n w_j φ_j(v_{i,γ(j)}),                            (11)

where each function φ_j(v) is a boolean-valued function depending on a small subset of components γ(j) ⊆ {1, ..., K}. The weights w_j and the output z_i may be real-valued; thus, we are interested in fitting a real-valued function with discrete inputs. As one example, suppose that we take the functions φ_j(·), j = 1, ..., n, to be the set of all d-literal clauses over the boolean vector v. For example, when d = 3 and K = 5, the functions φ(v) would include three-literal conjunctions of (possibly negated) covariates such as v_1 ∧ v_4 ∧ v_5, v_2 ∧ v_3 ∧ v_4 and v_1 ∧ v_2 ∧ v_5, where ∧ denotes logical and and ¬ denotes logical negation. As in the previous example, the number of terms grows as n = O(K^d).
However, if we impose the sparsity constraint that only a small number of the boolean covariates v_{ik} can influence z_i, the number of non-zero coefficients w_j will be reduced, making the regression more tractable.
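As a concrete illustration of how the coefficient dimension and the overlapping groups arise in the polynomial regression example, the sketch below enumerates the multi-index set J(d), derives γ(j) from the non-zero exponents, and builds one row of the data matrix A from the monomials. The specific values of K, d and v are illustrative.

```python
import itertools
import numpy as np

K, d = 3, 2   # illustrative: three covariates, quadratic model

# Generalized multi-indices J(d): all j with nonnegative integer entries and j_1 + ... + j_K <= d.
J = [j for j in itertools.product(range(d + 1), repeat=K) if sum(j) <= d]

# gamma(j) = {k : j_k > 0}: the covariates the monomial v^j depends on.
# The group G_k is then the set of monomials with j_k > 0; these groups overlap.
gamma = [[k for k in range(K) if j[k] > 0] for j in J]

def design_row(v):
    """One row of the data matrix A: the monomials v^j = v_1^{j_1} ... v_K^{j_K}, j in J(d)."""
    v = np.asarray(v, dtype=float)
    return np.array([np.prod(v ** np.array(j)) for j in J])

v = [0.5, -1.0, 2.0]
print(len(J))          # n = O(K^d) regression coefficients; 10 monomials for K = 3, d = 2
print(design_row(v))
```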

4 HY-GAMP FOR GENERALIZED GROUP SPARSITY

Having presented some examples, we can now describe the HyGAMP algorithm [53] and its application to generalized group sparsity. Given the separable structure of the joint distribution (6), graphical-model methods [46] provide a natural approach to estimating marginal distributions such as (7). As we mentioned above, exact computation of the marginal distributions is generally intractable. Traditional graphical model techniques such as loopy belief propagation (loopy BP) attempt to reduce such inherently high-dimensional, vector-valued estimation problems on a factorizable distribution such as (6) to a sequence of lower-dimensional problems associated only with the variables in each factor. However, for standard loopy BP to be successful, each factor must depend only on a small number of the variables. Unfortunately, for the distribution (6), this property holds only when the constraint matrix A is sparse. While such sparsity occurs in problems such as low-density parity check codes [54], the transforms A arising in imaging and regression problems are often dense.

Approximate message passing (AMP) refers to a class of Gaussian and quadratic approximations of loopy BP that can be applied to dense A. AMP approximations of loopy BP originated in CDMA multiuser detection problems [28-30] and have received considerable recent attention in the context of compressed sensing [31-36, 38]. The Gaussian approximations used in AMP are also closely related to expectation propagation techniques [55, 56]. These algorithms have been particularly attractive since they are general, computationally extremely simple and, for certain large random problem instances, admit precise analyses with testable conditions for optimality, even when the problems are non-convex. In addition, the methods can be easily integrated with EM techniques when the distribution parameters are unknown [57-61].

The standard AMP algorithm considers estimation of a vector x with independent components. To model dependencies, the works [39-43] proposed various turbo and hybrid extensions of the AMP algorithms, analogous to the turbo procedures used in conjunction with LDPC codes. Specifically, correlations between components of the vector x are modeled through a prior P(x) described by a graphical model; the graphical model may contain other latent variables in the distribution. The turbo and hybrid AMP methods then use standard loopy belief propagation updates in the graphical model associated with the prior of x, while using approximate message passing across the transform z = Ax. One particular version of these algorithms is the so-called Hybrid Generalized Approximate Message Passing (HyGAMP) method of [43]. HyGAMP enables extensions of the AMP methods applicable to priors on x described by arbitrary graphical models, of which generalized group sparsity is a special case.

Algorithm 1 below shows the steps of the HyGAMP algorithm in the case of generalized group sparsity; it is very similar to a slightly more restricted group sparse method presented in [43]. A detailed derivation can be performed along the lines of the methods in [43]; here, we provide only a brief qualitative description. As in the HyGAMP group sparse algorithm in [43], Algorithm 1 is run in a sequence of iterations. Each iteration t of the main repeat-until loop has two parts. The first half-iteration is the GAMP update part, which generates quantities x̂_j(t), ẑ_i(t), τ_j^x(t) and τ_i^z(t) representing estimates of the posterior means and variances of the unknown variables x_j and z_i. This update is based on the standard GAMP algorithm in [37] and uses, as an input, the parameter ρ_j(t) representing the current estimate of the posterior probability P(u_j = 1 | y).
The second half-iteration is the sparsity update part, which updates the estimates ρ_j(t) of the posterior probabilities P(u_j = 1 | y) of the variable activity indicators.

The original GAMP paper [43] describes the equations for the GAMP update part of Algorithm 1 in more detail and derives the updates based on certain Gaussian and quadratic approximations of sum-product loopy BP. As in the standard GAMP method [38], the GAMP update part of Algorithm 1 is based on solving certain scalar AWGN estimation problems on the variables x_j and z_i.

Algorithm 1: Hybrid-GAMP for Generalized Group Sparsity
1: {Initialization}
2: t ← 0
3: initialize τ_j^r(t-1)
4: α_{j,k}(t) ← α_k
5: repeat
6:   {Basic GAMP update}
7:   ρ_j(t) ← P(f_j(ξ_γ(j)) = 1) with the ξ_k independent and P(ξ_k = 1) = α_{j,k}(t) for all k ∈ γ(j)
8:   x̂_j(t) ← E(x_j | r̂_j(t-1), τ_j^r(t-1), ρ_j = ρ_j(t))
9:   τ_j^x(t) ← var(x_j | r̂_j(t-1), τ_j^r(t-1), ρ_j = ρ_j(t))
10:  τ_i^p(t) ← Σ_j |A_ij|^2 τ_j^x(t)
11:  p̂_i(t) ← Σ_j A_ij x̂_j(t) - τ_i^p(t) ŝ_i(t-1)
12:  ẑ_i(t) ← E(z_i | p̂_i(t), τ_i^p(t))
13:  τ_i^z(t) ← var(z_i | p̂_i(t), τ_i^p(t))
14:  ŝ_i(t) ← (ẑ_i(t) - p̂_i(t)) / τ_i^p(t)
15:  τ_i^s(t) ← (1 - τ_i^z(t)/τ_i^p(t)) / τ_i^p(t)
16:  τ_j^r(t) ← 1 / ( Σ_i |A_ij|^2 τ_i^s(t) )
17:  r̂_j(t) ← x̂_j(t) + τ_j^r(t) Σ_i A_ij ŝ_i(t)
18:  {Sparsity update}
19:  ρ_{j→k}(t, 0) ← P(f_j(ξ_γ(j)) = 1 | ξ_k = 0) with the ξ_l independent and P(ξ_l = 1) = α_{j,l}(t) for all l ∈ γ(j), l ≠ k
20:  ρ_{j→k}(t, 1) ← P(f_j(ξ_γ(j)) = 1 | ξ_k = 1) with the ξ_l independent and P(ξ_l = 1) = α_{j,l}(t) for all l ∈ γ(j), l ≠ k
21:  LLR_{j→k}(t) ← log p(r̂_j(t); τ_j^r(t), ρ_{j→k}(t, 1)) - log p(r̂_j(t); τ_j^r(t), ρ_{j→k}(t, 0))
22:  LLR_{k→j}(t) ← log(α_k / (1 - α_k)) + Σ_{i ∈ G_k, i ≠ j} LLR_{i→k}(t)
23:  α_{j,k}(t+1) ← 1 / (1 + exp(-LLR_{k→j}(t)))
24:  t ← t + 1
25: until Terminate

Specifically, lines 8 and 9 compute the mean and variance estimates x̂_j(t) and τ_j^x(t). We use the notation E(x_j | r̂_j, τ_j^r, ρ_j) and var(x_j | r̂_j, τ_j^r, ρ_j) to denote the expectation and variance with respect to the distribution P(x_j | r̂_j, τ_j^r, ρ_j), defined as the posterior distribution of the scalar variable x_j with activity probability ρ_j, observed through an AWGN measurement r̂_j of the form

    r̂_j = x_j + w_j,   w_j ~ N(0, τ_j^r),   x_j = V_j with probability ρ_j, x_j = 0 with probability 1 - ρ_j.     (12)

This density also provides the estimate of the posterior marginal distribution, in that we take P(x_j | y) ≈ P(x_j | r̂_j = r̂_j(t), τ_j^r(t), ρ_j(t)). Similarly, lines 12 and 13 compute the output mean and variance estimates ẑ_i(t) and τ_i^z(t). We use E(z_i | p̂_i, τ_i^p) and var(z_i | p̂_i, τ_i^p) to denote the mean and variance with respect to a distribution P(z_i | y_i, p̂_i, τ_i^p), defined as the posterior distribution of a Gaussian variable z_i observed through the measurement y_i as

    y_i ~ P(y_i | z_i),   z_i ~ N(p̂_i, τ_i^p).                         (13)

This distribution also provides the estimate of the posterior marginal for z_i, in that we take P(z_i | y) ≈ P(z_i | y_i, p̂_i(t), τ_i^p(t)).
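To illustrate the scalar AWGN problems solved in lines 8 and 9, the sketch below evaluates the posterior mean and variance for the model (12) when the slab V_j is assumed Gaussian; the Gaussian slab and its variance are assumptions made here for concreteness, and other slab distributions would lead to different (possibly numerically evaluated) expressions.

```python
import numpy as np

def gauss_pdf(r, var):
    return np.exp(-r**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def bg_denoiser(r, tau_r, rho, sigma2_x=1.0):
    """Posterior mean and variance of x under the Bernoulli-Gaussian version of (12):
       x = 0 w.p. 1-rho, x ~ N(0, sigma2_x) w.p. rho, observed as r = x + N(0, tau_r).
       A sketch of the scalar problems in lines 8-9 of Algorithm 1 (Gaussian slab assumed)."""
    # Posterior probability that the component is active.
    num = rho * gauss_pdf(r, sigma2_x + tau_r)
    den = num + (1 - rho) * gauss_pdf(r, tau_r)
    pi = num / den
    # Conditional (active) posterior of x is Gaussian with these moments.
    m1 = r * sigma2_x / (sigma2_x + tau_r)
    v1 = sigma2_x * tau_r / (sigma2_x + tau_r)
    x_hat = pi * m1
    tau_x = pi * (v1 + m1**2) - x_hat**2
    return x_hat, tau_x

print(bg_denoiser(r=np.array([0.1, 2.0]), tau_r=0.05, rho=0.2))
```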

The second half of the iteration, labeled the sparsity update, updates the parameters ρ_j(t), which are the estimates of the posterior probabilities P(u_j = 1 | y) of the variable activity indicators. As described in [43], this part of the algorithm is the standard loopy belief propagation update applied to the portion of the factor graph to the left of the variables u_j in Fig. 1. Specifically, the quantities ρ_j(t), ρ_{j→k}(t, 0) and ρ_{j→k}(t, 1) can be interpreted, respectively, as estimates of the posterior probabilities

    ρ_j = Pr(u_j = 1 | y),   ρ_{j→k}(0) = Pr(u_j = 1 | y, ξ_k = 0),   ρ_{j→k}(1) = Pr(u_j = 1 | y, ξ_k = 1).

Similarly, the quantities α_{j,k}(t) are estimates of the posterior probabilities P(ξ_k = 1 | y) of the group activity indicators, and LLR_{j→k}(t) and LLR_{k→j}(t) represent estimates of the corresponding log-likelihood ratios

    LLR_k = log P(ξ_k = 1 | y) - log P(ξ_k = 0 | y).

Initially, in line 4, the algorithm sets α_{j,k}(t) to the prior probabilities α_k for all j. In each iteration, the posterior probabilities are then updated with the standard loopy BP procedure. In line 21, p(r̂_j; τ_j^r, ρ_j) is the probability density function of the scalar random variable r̂_j in (12), assuming a prior activity probability P(u_j = 1) = 1 - P(u_j = 0) = ρ_j.

It should be pointed out that the HyGAMP methodology of [53] also provides a systematic methodology for approximately computing the maximum a posteriori (MAP) estimate

    (x̂, ẑ) := arg max_{x, z} P(x, z | y).                              (14)

However, for space considerations, in this work we only consider the estimation of the marginal distributions.

Computational Complexity

One of the attractive features of the HyGAMP generalized group sparsity algorithm is its computational simplicity. The GAMP update part of each iteration involves evaluating n scalar AWGN problems associated with the variables x_j (lines 8 and 9); m scalar AWGN problems associated with the variables z_i (lines 12 and 13); and multiplications by A and by the matrix of squared entries |A_ij|^2, along with their transposes. In many cases, the scalar AWGN problems have closed-form solutions, even when P(x_j | u_j) or P(y_i | z_i) is non-Gaussian. When closed-form solutions are not available, the expectations and variances can be computed via one- or two-dimensional numerical integration. Thus the per-iteration cost of the scalar AWGN problems is O(m + n). Hence, the dominant per-iteration cost of the GAMP update is the multiplication by the matrices A and |A_ij|^2, which is O(mn) in the worst case. This per-iteration cost is similar to most first-order methods for compressed sensing, including iterative thresholding and ADMM.

The cost of the sparsity update part of the iteration depends on the complexity of the functions f_j(ξ_γ(j)), but it is often small. For example, suppose f_j(ξ_γ(j)) represents a logical and operation (9). Then lines 7, 19 and 20 reduce to

    ρ_j(t) = \prod_{k ∈ γ(j)} α_{j,k}(t),   ρ_{j→k}(t, 1) = ρ_j(t) / α_{j,k}(t),   ρ_{j→k}(t, 0) = 0.

Thus, if each subset γ(j) has cardinality d, all the terms ρ_{j→k}(t) can be computed in O(nd) operations. A similar expression is possible when f_j(ξ_γ(j)) represents a logical or operation (8). Of course, for general functions f_j(·), there may not be a simple expression for computing the ρ terms. However, even in the worst case, the updates of all the ρ_{j→k}(t) terms would require O(n 2^d) operations. Thus, as long as d is small (i.e., each of the variable indicators u_j depends on only a small number d of the K groups), the computation will be tractable.
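The sketch below renders the sparsity-update half-iteration for the logical-and case described above. The data structures (γ(j) as lists, per-edge messages α_{j,k} as dictionaries) and the Gaussian slab inside the marginal likelihood are choices made here for illustration; the paper itself does not prescribe an implementation.

```python
import numpy as np

def gauss_pdf(r, var):
    return np.exp(-r**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def marg_lik(r, tau_r, rho, sigma2_x=1.0):
    """p(r; tau_r, rho) in line 21: marginal density of r in (12) with activity probability rho,
       assuming a Gaussian slab V ~ N(0, sigma2_x)."""
    return rho * gauss_pdf(r, sigma2_x + tau_r) + (1 - rho) * gauss_pdf(r, tau_r)

def sparsity_update_and(gamma, G, alpha_msg, r_hat, tau_r, alpha_prior):
    """One sparsity-update half-iteration for a logical-and activity function (9),
       using the closed forms of lines 7, 19 and 20 given in the text (sketch)."""
    n, K = len(gamma), len(G)
    llr_in = {}                                           # LLR_{j->k}(t), line 21
    for j in range(n):
        rho_j = np.prod([alpha_msg[j][k] for k in gamma[j]])
        for k in gamma[j]:
            rho_1 = rho_j / alpha_msg[j][k]               # line 20 (and-case)
            rho_0 = 0.0                                   # line 19 (and-case)
            llr_in[(j, k)] = (np.log(marg_lik(r_hat[j], tau_r[j], rho_1))
                              - np.log(marg_lik(r_hat[j], tau_r[j], rho_0)))
    new_alpha = {j: dict(alpha_msg[j]) for j in range(n)}
    for k in range(K):
        for j in G[k]:
            llr_out = (np.log(alpha_prior[k] / (1 - alpha_prior[k]))
                       + sum(llr_in[(i, k)] for i in G[k] if i != j))   # line 22
            new_alpha[j][k] = 1.0 / (1.0 + np.exp(-llr_out))            # line 23
    return new_alpha
```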

5 NUMERICAL EXAMPLE

To evaluate the HyGAMP methodology, we measured the algorithm's performance on a large number of random instances of a sparsity recovery problem with overlapping groups. In each random instance of the problem, the vector x ∈ R^n was generated with i.i.d. Gauss-Bernoulli components

    x_j = 0 if u_j = 0,   x_j ~ N(0, 1) if u_j = 1.                    (15)

Figure 2. Comparison of the average performance (normalized MSE in dB versus the number of measurements m) of the LMMSE, LASSO, GLASSO, GLASSO-LAT, GAMP and HyGAMP estimators on random instances of an overlapping group sparsity problem with K = 20 groups for a vector of dimension n = 100. The groups are active with probability α = 0.1 and the activity of each of the n = 100 components depends on a random d = 2 of the K = 20 groups.

The vector dimension was set to n = 100 and we assumed K = 20 groups with i.i.d. group activity indicators ξ_k, k = 1, ..., K. For the variable activity indicators, the sets γ(j) were generated by randomly selecting d = 2 of the K groups for each component u_j. The activity function used a logical or operation, so that u_j = 1 if and only if ξ_k = 1 for some k ∈ γ(j). Recall that the estimator knows the sets γ(j) and all other statistics; only the particular realization of ξ and x is unknown.

We set each of the groups to be active with probability α = 0.1, so that the components x_j were active with probability 1 - (1 - α)^d ≈ 0.19. Thus, an estimator that does not exploit the group structure must identify, on average, 19 out of the 100 components of x. However, since the groups are active only with probability 0.1, using the group structure requires, on average, the identification of only 2 of the 20 groups. Hence, the example provides a test case where the correlated group structure can significantly reduce the effective degrees of freedom. For the measurement matrix, we used a zero-mean i.i.d. Gaussian matrix A ∈ R^{m×n}, varying the number of measurements m from 10 to 200. The measurement vector y was generated by an AWGN measurement channel (1) with SNR = 30 dB.

Fig. 2 compares the performance of the proposed HyGAMP methodology with several other common algorithms. For each method and number of measurements m, we generated 500 random Monte Carlo instances of the problem and measured the average normalized mean squared error

    Normalized MSE = 10 log_10 ( E ||x̂ - x||_2^2 / E ||x||_2^2 ),

where x̂ is the estimated vector and the expectations in the numerator and denominator are taken over the 500 Monte Carlo trials. The details of the algorithms shown in Fig. 2 are as follows.

LMMSE is the simple linear minimum mean squared error estimator. This method does not exploit any sparsity, and its performance, as expected, is poor.

The LASSO method [18] is a standard algorithm for sparse recovery and finds an estimate by solving the optimization

    x̂ = arg min_x (1/2) ||y - Ax||_2^2 + γ \sum_{j=1}^n |x_j|,          (16)

for some regularization parameter γ > 0. The regularization trades off the sparsity against the prediction error. To provide the most favorable case for the LASSO method, we used an oracle method for selecting γ: for each number of measurements m, various γ values were tested and we selected the value of γ that resulted in the lowest MSE. The LASSO method is well known to exploit the sparse structure of the signal well, but not the group sparse structure. It thus shows a marked improvement in performance over the simple LMMSE method, but still does not perform as well as the HyGAMP algorithm.
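The sketch below generates one random instance of the test problem described above and evaluates the normalized MSE metric. It follows the stated setup (n = 100, K = 20, d = 2 random groups per component, logical-or activity, α = 0.1, i.i.d. Gaussian A, 30 dB AWGN), but the column scaling of A and the empirical SNR convention used to set the noise variance are our assumptions; the paper only states a zero-mean i.i.d. Gaussian matrix. The paper also averages the numerator and denominator of the NMSE over the Monte Carlo trials before taking the ratio.

```python
import numpy as np

def gen_instance(m, n=100, K=20, d=2, alpha=0.1, snr_db=30, rng=None):
    """One random instance of the overlapping group-sparse test problem of Section 5
       (data generation only; a sketch under the stated assumptions)."""
    rng = rng or np.random.default_rng()
    gamma = [rng.choice(K, size=d, replace=False) for _ in range(n)]   # d random groups per component
    xi = (rng.random(K) < alpha).astype(int)                           # group activity indicators
    u = np.array([int(xi[g].any()) for g in gamma])                    # logical-or activity
    x = np.where(u == 1, rng.standard_normal(n), 0.0)                  # Gauss-Bernoulli components (15)
    A = rng.standard_normal((m, n)) / np.sqrt(m)                       # i.i.d. Gaussian A (scaling is our choice)
    z = A @ x
    noise_var = max(np.mean(z**2), 1e-12) * 10 ** (-snr_db / 10)       # one common SNR convention
    y = z + np.sqrt(noise_var) * rng.standard_normal(m)
    return A, y, x, gamma

def nmse_db(x_hat, x):
    """Per-trial normalized MSE in dB; the paper averages over 500 trials."""
    return 10 * np.log10(np.sum((x_hat - x) ** 2) / np.sum(x ** 2))
```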

To incorporate group structure, one often uses a group LASSO method [11, 12], where the estimate is given by the solution of the mixed l_1-l_2 optimization

    x̂ = arg min_x (1/2) ||y - Ax||_2^2 + γ \sum_{k=1}^K ||x_{G_k}||_2,   (17)

where x_{G_k} is the subvector of x with components in G_k. While this method works well for disjoint groups or intersecting groups, it is well known to be problematic for unions of overlapping groups [25]: the reason is that if any x_j is to be non-zero, all the groups k with j ∈ G_k must be made active. This behavior is undesirable when the activity of a variable requires only one of the groups in γ(j) to be active. The performance of the group LASSO estimate (17) is plotted in Fig. 2 in the curve labeled GLASSO. We see that, due to this overlap problem, group LASSO does not even outperform the standard (non-group) LASSO.

To improve the performance of group LASSO with union overlaps, [25] proposed to transform the problem so that the vector x is replaced with a new vector of latent variables. In the transformed domain, the dimension of the new vector is larger, but there are no overlapping groups. One can then apply the group LASSO estimator (17) to the transformed problem. The performance of this group LASSO method with latent variables is shown in Fig. 2 in the curve labeled GLASSO-LAT. We see that, as predicted in [25], the method offers a significant improvement over group LASSO, particularly when the data is undersampled (i.e., m < n). But, for most values of the measurement number, HyGAMP offers a significant improvement over the group LASSO with latent variables. The improved performance of the HyGAMP method is likely due to the inherent dimension expansion required by latent variable group LASSO that is not required by HyGAMP. In addition, since all the LASSO methods are based on an l_1 penalty, they introduce a small bias through the implicit soft thresholding: they are unable to match the oracle estimator even when the correct sparsity pattern is detected (see [62]). This bias may explain some of the gap with HyGAMP at large m. We did not test any de-biasing methods.

Finally, the curve labeled GAMP in Fig. 2 is the standard GAMP algorithm from [37] with i.i.d. priors. This algorithm does not exploit the group sparsity, and thus also performs worse than the HyGAMP method. It should be noted that we did not test the group OMP algorithms described in [22, 23], since there is no obvious way to incorporate overlapping groups into those methods.

In conclusion, we see that the HyGAMP method outperforms all the tested methods over a large range of measurement numbers. In some cases, the performance improvement is significant. Moreover, the algorithm closest to HyGAMP in performance, the latent variable group LASSO method of [25], is specifically constructed for overlapping groups where the variable activities are a logical or of the group activities; the HyGAMP method here is more general. Also, the GAMP and HyGAMP curves plotted in Fig. 2 were based on running the algorithm for only 20 iterations, which we observed to be sufficient to obtain convergence to within less than 0.1 dB. Thus, the HyGAMP method is also computationally extremely fast for this test case.
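As a concrete reference for the LASSO baseline (16) and the oracle selection of γ described above, the sketch below uses a plain ISTA solver and a simple oracle tuning loop. The paper does not state which solver was used; ISTA is chosen here only because it is short and self-contained.

```python
import numpy as np

def ista_lasso(A, y, gam, n_iter=500):
    """Plain ISTA for the LASSO problem (16); a simple reference baseline, not the
       solver used in the paper."""
    L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x + A.T @ (y - A @ x) / L                  # gradient step on the quadratic term
        x = np.sign(g) * np.maximum(np.abs(g) - gam / L, 0.0)   # soft threshold
    return x

def oracle_gamma_lasso(A, y, x_true, gammas):
    """Oracle tuning as described in the text: try several gamma values and keep the one
       with the lowest MSE against the true vector (only possible in simulation)."""
    best = min((np.sum((ista_lasso(A, y, g) - x_true) ** 2), g) for g in gammas)
    return best[1]
```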
6 CONCLUSIONS

Turbo and hybrid extensions of approximate message passing methods provide a promising, systematic and general framework for a large class of structured sparsity problems. The techniques capture the modularity of graphical models along with the computational simplicity of approximate message passing. In this work, we have shown that the Hybrid Generalized Approximate Message Passing (HyGAMP) method of [53], in particular, can incorporate very general forms of group sparsity in a computationally efficient manner. On synthetic test cases, our simulations illustrate that the HyGAMP methodology can outperform state-of-the-art methods while being more general. Nevertheless, much remains to be understood about these methods. Most importantly, our current results are based entirely on simulations, since there are currently no results that quantitatively describe the behavior of AMP-like methods used in conjunction with turbo updates. However, recent work [59-61] has provided methods for analyzing the behavior of AMP combined with EM updates, and an interesting avenue of future work is to see whether these techniques extend to turbo and hybrid AMP methods.

In addition, even without the turbo and hybrid extensions, much of the analysis of AMP applies only to certain large random i.i.d. matrices, for which the algorithms exhibit extremely good performance; see, for example, the state evolution analyses in [28, 30-38]. However, many of the matrices arising in practical imaging and regression problems are not well modeled as realizations of such i.i.d. matrices. The behavior of AMP is not well understood in these cases, and it is known, in fact, that the algorithm may perform poorly or even diverge. A central research challenge for both the AMP methods and their turbo and hybrid extensions is to understand what modifications are necessary so that the benefits of these methods can be realized in a broader class of practical problems.

REFERENCES

[1] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Inform. Theory 52, Feb.
[2] D. L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52, Apr.
[3] E. J. Candès and T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies?, IEEE Trans. Inform. Theory 52, Dec.
[4] J. A. Nelder and R. W. M. Wedderburn, Generalized linear models, J. Royal Stat. Soc. Series A 135.
[5] P. McCullagh and J. A. Nelder, Generalized Linear Models, Chapman & Hall, 2nd ed.
[6] D. Wipf and B. Rao, Sparse Bayesian learning for basis selection, IEEE Trans. Signal Process. 52, Aug.
[7] S. Ji, Y. Xue, and L. Carin, Bayesian compressive sensing, IEEE Trans. Signal Process. 56, June.
[8] V. Cevher, Learning with compressible priors, in Proc. NIPS, (Vancouver, BC), Dec.
[9] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, Model-based compressed sensing, IEEE Trans. Inform. Theory 56, Apr.
[10] M. Duarte and Y. Eldar, Structured compressed sensing: From theory to applications, IEEE Trans. Signal Process. 59(9).
[11] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, J. Royal Statist. Soc. 68, pp. 49-67.
[12] P. Zhao, G. Rocha, and B. Yu, The composite absolute penalties family for grouped and hierarchical variable selection, Ann. Stat. 37(6).
[13] D. P. Wipf and B. Rao, An empirical Bayesian strategy for solving the simultaneous sparse approximation problem, IEEE Trans. Signal Process. 55, July.
[14] F. R. Bach, Consistency of the group lasso and multiple kernel learning, J. Machine Learn. Res. 9.
[15] S. Virtanen, A. Klami, and S. Kaski, Bayesian CCA via group sparsity, in Proc. ICML, June.
[16] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D. N. Metaxas, Automatic image annotation using group sparsity, in Proc. Conf. on Computer Vision and Pattern Recognition (CVPR).
[17] A. Majumdar and R. K. Ward, Compressive color imaging with group-sparsity on analysis prior, in Proc. Conf. on Image Processing.
[18] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Royal Stat. Soc., Ser. B 58(1).
[19] M. Figueiredo, S. J. Wright, and R. D. Nowak, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE J. Sel. Topics Signal Process. 1, Dec.
[20] S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinvesky, An interior-point method for large-scale l_1-regularized least squares, IEEE J. Sel. Topics Signal Process. 1, Dec.
[21] S. J. Wright, R. D. Nowak, and M. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process. 57, July.
[22] A. C. Lozano, G. Świrszcz, and N. Abe, Group orthogonal matching pursuit for variable selection and prediction, in Proc. Neural Information Process. Syst., (Vancouver, Canada), Dec. 2008.

[23] A. C. Lozano, G. Świrszcz, and N. Abe, Group orthogonal matching pursuit for logistic regression, J. Machine Learning Res. 15.
[24] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Rev. 43(1).
[25] L. Jacob, G. Obozinski, and J.-P. Vert, Group lasso with overlap and graph lasso, in Proc. International Conf. Machine Learning (ICML).
[26] N. S. Rao, R. D. Nowak, S. J. Wright, and N. G. Kingsbury, Convex approaches to model wavelet sparsity patterns, arXiv preprint [cs.CV], Apr.
[27] G. Peyré and J. Fadili, Group sparsity with overlapping partition functions, in Proc. EUSIPCO 2011.
[28] J. Boutros and G. Caire, Iterative multiuser joint decoding: Unified framework and asymptotic analysis, IEEE Trans. Inform. Theory 48, July.
[29] T. Tanaka and M. Okada, Approximate belief propagation, density evolution, and neurodynamics for CDMA multiuser detection, IEEE Trans. Inform. Theory 51, Feb.
[30] D. Guo and C.-C. Wang, Asymptotic mean-square optimality of belief propagation for sparse linear systems, in Proc. IEEE Inform. Theory Workshop, (Chengdu, China), Oct.
[31] D. L. Donoho, A. Maleki, and A. Montanari, Message-passing algorithms for compressed sensing, Proc. Nat. Acad. Sci. 106, Nov.
[32] D. L. Donoho, A. Maleki, and A. Montanari, Message passing algorithms for compressed sensing I: motivation and construction, in Proc. Info. Theory Workshop, Jan.
[33] D. L. Donoho, A. Maleki, and A. Montanari, Message passing algorithms for compressed sensing II: analysis and validation, in Proc. Info. Theory Workshop, Jan.
[34] M. Bayati and A. Montanari, The dynamics of message passing on dense graphs, with applications to compressed sensing, IEEE Trans. Inform. Theory 57, Feb.
[35] S. Rangan, Estimation with random linear mixing, belief propagation and compressed sensing, in Proc. Conf. on Inform. Sci. & Sys., pp. 1-6, (Princeton, NJ), Mar.
[36] A. Montanari, Graphical model concepts in compressed sensing, in Compressed Sensing: Theory and Applications, Y. C. Eldar and G. Kutyniok, eds., Cambridge Univ. Press, June.
[37] S. Rangan, Generalized approximate message passing for estimation with random linear mixing, arXiv preprint [cs.IT], Oct.
[38] S. Rangan, Generalized approximate message passing for estimation with random linear mixing, in Proc. IEEE Int. Symp. Inform. Theory, (Saint Petersburg, Russia), July-Aug.
[39] P. Schniter, Turbo reconstruction of structured sparse signals, in Proc. Conf. on Inform. Sci. & Sys., (Princeton, NJ), Mar.
[40] J. Ziniel, L. C. Potter, and P. Schniter, Tracking and smoothing of time-varying sparse signals via approximate belief propagation, in Conf. Rec. 44th Asilomar Conf. Signals, Syst. & Comput., (Pacific Grove, CA), Nov.
[41] S. Som, L. C. Potter, and P. Schniter, Compressive imaging using approximate message passing and a Markov-tree prior, in Conf. Rec. 44th Asilomar Conf. Signals, Syst. & Comput., (Pacific Grove, CA), Nov.
[42] P. Schniter, A message-passing receiver for BICM-OFDM over unknown clustered-sparse channels, in Proc. IEEE Workshop Signal Process. Adv. Wireless Commun., (San Francisco, CA), June.
[43] S. Rangan, A. K. Fletcher, V. K. Goyal, and P. Schniter, Hybrid approximate message passing with applications to structured sparsity, arXiv preprint [cs.IT], Nov.
[44] A. Javanmard and A. Montanari, State evolution for general approximate message passing algorithms, with applications to spatial coupling, arXiv preprint [math.PR], Nov.
[45] S. Rangan, P. Schniter, E. Riegler, A. Fletcher, and V. Cevher, Fixed points of generalized approximate message passing with arbitrary matrices, arXiv preprint, Jan.
[46] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference, Found. Trends Mach. Learn. 1, 2008.

[47] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, New York, NY.
[48] Y. Kim, J. Kim, and Y. Kim, Blockwise sparse regression, Statistica Sinica 16.
[49] L. Meier, S. van de Geer, and P. Bühlmann, The group lasso for logistic regression, J. Royal Statistical Society: Series B 70(1), pp. 53-71.
[50] A. K. Fletcher, S. Rangan, L. Varshney, and A. Bhargava, Neural reconstruction with approximate message passing (NeuRAMP), in Proc. Neural Information Process. Syst., (Granada, Spain), Dec.
[51] U. S. Kamilov, V. K. Goyal, and S. Rangan, Message-passing de-quantization with applications to compressed sensing, IEEE Trans. Signal Process. 60, Dec.
[52] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences 102(43).
[53] S. Rangan, A. K. Fletcher, V. K. Goyal, and P. Schniter, Hybrid generalized approximate message passing with applications to structured sparsity, in Proc. IEEE Int. Symp. Inform. Theory, (Cambridge, MA), July.
[54] T. J. Richardson and R. L. Urbanke, Modern Coding Theory, Cambridge Univ. Press, Cambridge, UK.
[55] T. P. Minka, A family of algorithms for approximate Bayesian inference, PhD thesis, Massachusetts Institute of Technology, Cambridge, MA.
[56] M. Seeger, Bayesian inference and optimal design for the sparse linear model, J. Machine Learning Research 9, Sept.
[57] J. P. Vila and P. Schniter, Expectation-maximization Bernoulli-Gaussian approximate message passing, in Conf. Rec. 45th Asilomar Conf. Signals, Syst. & Comput., (Pacific Grove, CA), Nov.
[58] J. P. Vila and P. Schniter, Expectation-maximization Gaussian-mixture approximate message passing, in Proc. Conf. on Inform. Sci. & Sys., (Princeton, NJ), Mar.
[59] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, Statistical physics-based reconstruction in compressed sensing, arXiv preprint, Sept.
[60] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, Probabilistic reconstruction in compressed sensing: Algorithms, phase diagrams, and threshold achieving matrices, arXiv preprint, June.
[61] U. S. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser, Approximate message passing with consistent parameter estimation and applications to sparse learning, in Proc. NIPS, (Lake Tahoe, NV), Dec.
[62] R. Tibshirani, Regression shrinkage and selection via the lasso: a retrospective, J. Royal Statistical Society: Series B (Statistical Methodology) 73(3).
[63] S. Rangan, A. K. Fletcher, and V. K. Goyal, Extension of replica analysis to MAP estimation with applications to compressed sensing, in Proc. IEEE Int. Symp. Inform. Theory, (Austin, TX), June 2010.


More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1

Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 ( OWL ) Regularization Mário A. T. Figueiredo Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Statistical Image Recovery: A Message-Passing Perspective. Phil Schniter

Statistical Image Recovery: A Message-Passing Perspective. Phil Schniter Statistical Image Recovery: A Message-Passing Perspective Phil Schniter Collaborators: Sundeep Rangan (NYU) and Alyson Fletcher (UC Santa Cruz) Supported in part by NSF grants CCF-1018368 and NSF grant

More information

Fast Hard Thresholding with Nesterov s Gradient Method

Fast Hard Thresholding with Nesterov s Gradient Method Fast Hard Thresholding with Nesterov s Gradient Method Volkan Cevher Idiap Research Institute Ecole Polytechnique Federale de ausanne volkan.cevher@epfl.ch Sina Jafarpour Department of Computer Science

More information

Approximate Message Passing

Approximate Message Passing Approximate Message Passing Mohammad Emtiyaz Khan CS, UBC February 8, 2012 Abstract In this note, I summarize Sections 5.1 and 5.2 of Arian Maleki s PhD thesis. 1 Notation We denote scalars by small letters

More information

Message passing and approximate message passing

Message passing and approximate message passing Message passing and approximate message passing Arian Maleki Columbia University 1 / 47 What is the problem? Given pdf µ(x 1, x 2,..., x n ) we are interested in arg maxx1,x 2,...,x n µ(x 1, x 2,..., x

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Expectation Propagation in Dynamical Systems

Expectation Propagation in Dynamical Systems Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

An iterative hard thresholding estimator for low rank matrix recovery

An iterative hard thresholding estimator for low rank matrix recovery An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Tractable Upper Bounds on the Restricted Isometry Constant

Tractable Upper Bounds on the Restricted Isometry Constant Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

CO-OPERATION among multiple cognitive radio (CR)

CO-OPERATION among multiple cognitive radio (CR) 586 IEEE SIGNAL PROCESSING LETTERS, VOL 21, NO 5, MAY 2014 Sparse Bayesian Hierarchical Prior Modeling Based Cooperative Spectrum Sensing in Wideb Cognitive Radio Networks Feng Li Zongben Xu Abstract This

More information

DNNs for Sparse Coding and Dictionary Learning

DNNs for Sparse Coding and Dictionary Learning DNNs for Sparse Coding and Dictionary Learning Subhadip Mukherjee, Debabrata Mahapatra, and Chandra Sekhar Seelamantula Department of Electrical Engineering, Indian Institute of Science, Bangalore 5612,

More information

A new method on deterministic construction of the measurement matrix in compressed sensing

A new method on deterministic construction of the measurement matrix in compressed sensing A new method on deterministic construction of the measurement matrix in compressed sensing Qun Mo 1 arxiv:1503.01250v1 [cs.it] 4 Mar 2015 Abstract Construction on the measurement matrix A is a central

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

Compressed Sensing and Neural Networks

Compressed Sensing and Neural Networks and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications

More information

Compressed Sensing and Linear Codes over Real Numbers

Compressed Sensing and Linear Codes over Real Numbers Compressed Sensing and Linear Codes over Real Numbers Henry D. Pfister (joint with Fan Zhang) Texas A&M University College Station Information Theory and Applications Workshop UC San Diego January 31st,

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Acommon problem in signal processing is to estimate an

Acommon problem in signal processing is to estimate an 5758 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER 2009 Necessary and Sufficient Conditions for Sparsity Pattern Recovery Alyson K. Fletcher, Member, IEEE, Sundeep Rangan, and Vivek

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,

More information

Single-letter Characterization of Signal Estimation from Linear Measurements

Single-letter Characterization of Signal Estimation from Linear Measurements Single-letter Characterization of Signal Estimation from Linear Measurements Dongning Guo Dror Baron Shlomo Shamai The work has been supported by the European Commission in the framework of the FP7 Network

More information

Minimax MMSE Estimator for Sparse System

Minimax MMSE Estimator for Sparse System Proceedings of the World Congress on Engineering and Computer Science 22 Vol I WCE 22, October 24-26, 22, San Francisco, USA Minimax MMSE Estimator for Sparse System Hongqing Liu, Mandar Chitre Abstract

More information

of Orthogonal Matching Pursuit

of Orthogonal Matching Pursuit A Sharp Restricted Isometry Constant Bound of Orthogonal Matching Pursuit Qun Mo arxiv:50.0708v [cs.it] 8 Jan 205 Abstract We shall show that if the restricted isometry constant (RIC) δ s+ (A) of the measurement

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information

Optimality of Large MIMO Detection via Approximate Message Passing

Optimality of Large MIMO Detection via Approximate Message Passing ptimality of Large MIM Detection via Approximate Message Passing Charles Jeon, Ramina Ghods, Arian Maleki, and Christoph Studer arxiv:5.695v [cs.it] ct 5 Abstract ptimal data detection in multiple-input

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Thresholds for the Recovery of Sparse Solutions via L1 Minimization

Thresholds for the Recovery of Sparse Solutions via L1 Minimization Thresholds for the Recovery of Sparse Solutions via L Minimization David L. Donoho Department of Statistics Stanford University 39 Serra Mall, Sequoia Hall Stanford, CA 9435-465 Email: donoho@stanford.edu

More information

Convergence Rates of Kernel Quadrature Rules

Convergence Rates of Kernel Quadrature Rules Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction

More information

The Variational Gaussian Approximation Revisited

The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego

Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego 1 Outline Course Outline Motivation for Course Sparse Signal Recovery Problem Applications Computational

More information

Belief Propagation, Information Projections, and Dykstra s Algorithm

Belief Propagation, Information Projections, and Dykstra s Algorithm Belief Propagation, Information Projections, and Dykstra s Algorithm John MacLaren Walsh, PhD Department of Electrical and Computer Engineering Drexel University Philadelphia, PA jwalsh@ece.drexel.edu

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

MMSE Denoising of 2-D Signals Using Consistent Cycle Spinning Algorithm

MMSE Denoising of 2-D Signals Using Consistent Cycle Spinning Algorithm Denoising of 2-D Signals Using Consistent Cycle Spinning Algorithm Bodduluri Asha, B. Leela kumari Abstract: It is well known that in a real world signals do not exist without noise, which may be negligible

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Scale Mixture Modeling of Priors for Sparse Signal Recovery

Scale Mixture Modeling of Priors for Sparse Signal Recovery Scale Mixture Modeling of Priors for Sparse Signal Recovery Bhaskar D Rao 1 University of California, San Diego 1 Thanks to David Wipf, Jason Palmer, Zhilin Zhang and Ritwik Giri Outline Outline Sparse

More information

Elaine T. Hale, Wotao Yin, Yin Zhang

Elaine T. Hale, Wotao Yin, Yin Zhang , Wotao Yin, Yin Zhang Department of Computational and Applied Mathematics Rice University McMaster University, ICCOPT II-MOPTA 2007 August 13, 2007 1 with Noise 2 3 4 1 with Noise 2 3 4 1 with Noise 2

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Communication by Regression: Achieving Shannon Capacity

Communication by Regression: Achieving Shannon Capacity Communication by Regression: Practical Achievement of Shannon Capacity Department of Statistics Yale University Workshop Infusing Statistics and Engineering Harvard University, June 5-6, 2011 Practical

More information

LEARNING DATA TRIAGE: LINEAR DECODING WORKS FOR COMPRESSIVE MRI. Yen-Huan Li and Volkan Cevher

LEARNING DATA TRIAGE: LINEAR DECODING WORKS FOR COMPRESSIVE MRI. Yen-Huan Li and Volkan Cevher LARNING DATA TRIAG: LINAR DCODING WORKS FOR COMPRSSIV MRI Yen-Huan Li and Volkan Cevher Laboratory for Information Inference Systems École Polytechnique Fédérale de Lausanne ABSTRACT The standard approach

More information

Restricted Strong Convexity Implies Weak Submodularity

Restricted Strong Convexity Implies Weak Submodularity Restricted Strong Convexity Implies Weak Submodularity Ethan R. Elenberg Rajiv Khanna Alexandros G. Dimakis Department of Electrical and Computer Engineering The University of Texas at Austin {elenberg,rajivak}@utexas.edu

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Performance Trade-Offs in Multi-Processor Approximate Message Passing

Performance Trade-Offs in Multi-Processor Approximate Message Passing Performance Trade-Offs in Multi-Processor Approximate Message Passing Junan Zhu, Ahmad Beirami, and Dror Baron Department of Electrical and Computer Engineering, North Carolina State University, Email:

More information

Sparse, stable gene regulatory network recovery via convex optimization

Sparse, stable gene regulatory network recovery via convex optimization Sparse, stable gene regulatory network recovery via convex optimization Arwen Meister June, 11 Gene regulatory networks Gene expression regulation allows cells to control protein levels in order to live

More information

Learning discrete graphical models via generalized inverse covariance matrices

Learning discrete graphical models via generalized inverse covariance matrices Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

An Homotopy Algorithm for the Lasso with Online Observations

An Homotopy Algorithm for the Lasso with Online Observations An Homotopy Algorithm for the Lasso with Online Observations Pierre J. Garrigues Department of EECS Redwood Center for Theoretical Neuroscience University of California Berkeley, CA 94720 garrigue@eecs.berkeley.edu

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

arxiv:cs/ v2 [cs.it] 1 Oct 2006

arxiv:cs/ v2 [cs.it] 1 Oct 2006 A General Computation Rule for Lossy Summaries/Messages with Examples from Equalization Junli Hu, Hans-Andrea Loeliger, Justin Dauwels, and Frank Kschischang arxiv:cs/060707v [cs.it] 1 Oct 006 Abstract

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information