Bayesian Nonparametric Dictionary Learning for Compressed Sensing MRI


1 1 Bayesian Nonparametric Dictionary Learning for Compressed Sensing MRI Yue Huang, John Paisley, Qin Lin, Xinghao Ding, Xueyang Fu and Xiao-ping Zhang arxiv: v2 [cs.cv] 9 Oct 2013 Abstract We develop a Bayesian nonparametric model for reconstructing magnetic resonance images (MRI) from highly undersampled k-space data. Our model uses the beta process as a nonparametric prior for dictionary learning, in which an image patch is a sparse combination of dictionary elements. The size of the dictionary and the patch-specific sparsity pattern is inferred from the data, in addition to all dictionary learning variables. Dictionary learning is performed as part of the image reconstruction process, and so is tailored to the MRI being considered. In addition, we investigate a total variation penalty term in combination with the dictionary learning model. We derive a stochastic optimization algorithm based on Markov Chain Monte Carlo (MCMC) sampling for the Bayesian model, and use the alternating direction method of multipliers (ADMM) for efficiently performing total variation minimization. We present empirical results on several MRI, which show that the proposed regularization framework can improve reconstruction accuracy over other methods. Index Terms Bayesian nonparametrics, dictionary learning, compressed sensing, magnetic resonance imaging I. INTRODUCTION Magnetic resonance imaging (MRI) is a widely used technique for visualizing the structure and functioning of the body. A limitation of MRI is its slow scan speed during data acquisition. Therefore, methods for accelerating the MRI process have received much research attention. Recent advances in signal reconstruction from measurements sampled below the Nyquist rate, called compressed sensing (CS) [1][2], have had a major impact on MRI [3]. CS-MRI allows for significant undersampling in the Fourier measurement domain of MR images (called k-space), while still outputting a high-quality image reconstruction. While image reconstruction using this undersampled data is a case of an ill-posed inverse problem, compressed sensing theory has shown that it is possible to reconstruct a signal from significantly fewer measurements than mandated by traditional Nyquist sampling if the signal is sparse in a particular transform domain. Motivated by the need to find a sparse domain for signal representation, a large body of literature now exists on reconstructing MRI from significantly undersampled k-space Yue Huang, Qin Lin, Xinghao Ding and Xueyang Fu are with the Department of Communications Engineering at Xiamen University in Xiamen, Fujian, China. John Paisley is with the Department of Electrical Engineering at Columbia University in New York, NY, USA. Xiao-ping Zhang is with the Department of Electrical and Computer Engineering at Ryerson University in Toronto, Canada. This work supported by the National Natural Science Foundation of China (Nos , , , ), the Fundamental Research Funds for the Central Universities (Nos , ) and the Natural Science Foundation of Fujian Province of China (No. 2012J05160). Equal contributions. Corresponding author: dxh@xmu.edu.cn data. 
Existing improvements in CS-MRI mostly focus on (i) seeking sparse domains for the image, such as contourlets [5][6]; (ii) using approximations of the l 0 norm for better reconstruction performance with fewer measurements, for example l 1, FOCUSS, l p quasi-norms with 0 < p < 1, or using smooth functions to approximate the l 0 norm [7] [10]; and (iii) accelerating image reconstruction through more efficient optimization techniques []. In this paper we present a modeling framework that is similarly motivated. CS-MRI reconstruction algorithms tend to fall into two categories: Those which enforce sparsity directly within some image transform domain [3] [16], and those which enforce sparsity in some underlying latent representation of the image, such as a dictionary learning representation [17] [20]. Most CS-MRI reconstruction algorithms belong to the first category. For example Sparse MRI [3], the leading study in CS-MRI, performs MR image reconstruction by enforcing sparsity in both the wavelet domain and the total variation (TV) of the reconstructed image. Algorithms with image-level sparsity constraints such as Sparse MRI typically employ an off-theshelf basis, which can usually capture only one feature of the image. For example, wavelets recover point-like features, while contourlets recover curve-like features. Since MR images contain a variety of underlying features, such as edges and textures, using a basis not adapted to the image can be considered a drawback of the algorithms in this group. Finding a sparse basis that is suited to the image at hand can benefit MR image reconstruction, since CS theory shows that the required number of measurements is linked to the sparsity of the signal in the selected transform domain. Using a standard basis not adapted to the image under consideration will likely not provide a representation that can compete in sparsity with an adapted basis. To this end, dictionary learning, which falls in the second group of algorithms, learns a sparse basis on image subregions called patches that is adapted to the image class of interest. Recent studies in the image processing literature have shown that dictionary learning is an effective means for finding a sparse representation of an image on the patch-level [22] [24], [29]. These algorithms learn a patchlevel basis (i.e., dictionary) by exploiting structural similarities between patches extracted from images within a class of interest (for example BM3D [22], MOD [23] and K-SVD [24]). Among these approaches, adaptive dictionary learning where the dictionary is learned directly on the image being considered based on patch-level sparsity constraints usually outperforms analytical dictionary approaches in denoising, super-resolution reconstruction, interpolation, inpainting, classification and other applications, since the adaptively learned

dictionary suits the signal of interest [23]-[26]. Dictionary learning has been applied to CS-MRI as a sparse basis for reconstruction (e.g., LOST [18] and DLMRI [19]). With these methods, parameters such as the dictionary size and patch sparsity are preset, and the algorithms considered are non-Bayesian.

In this paper, we consider a new dictionary learning algorithm for CS-MRI that is based on Bayesian nonparametric statistics. Specifically, we consider the beta process as a nonparametric prior for a dictionary learning model that provides the sparse representation necessary for CS-MRI reconstruction. The beta process is a method for generating measures on infinite parameter spaces that can be employed in latent factor models [30][31]; in this case the latent factors are the dictionary elements and the measure is a value in (0, 1] that gives the corresponding activation probability. While the dictionary is theoretically infinite in size, through posterior inference the beta process learns a representation that is both sparse in dictionary size and in the dictionary usage for any given patch. The proposed Bayesian nonparametric model gives an alternative approach to dictionary learning for CS-MRI reconstruction to those previously considered. We derive a Markov Chain Monte Carlo (MCMC) sampling algorithm for stochastic optimization of the dictionary learning variables in the objective function. In addition, we consider including a sparse total variation (TV) penalty, for which we perform efficient optimization using the alternating direction method of multipliers (ADMM).

We organize the paper as follows. In Section II we review CS-MRI inversion methods and the beta process for dictionary learning. In Section III we describe the proposed regularization framework and optimization algorithm. We then show the advantages of the proposed Bayesian nonparametric regularization framework on several CS-MRI problems in Section IV.

II. BACKGROUND AND RELATED WORK

We use the following notation. Let x ∈ R^N be a √N × √N MR image in vectorized form. Let F_u ∈ C^{u×N}, u < N, be the undersampled Fourier encoding matrix and y = F_u x represent the sub-sampled set of k-space measurements. The goal is to estimate x from the small fraction of k-space measurements y. For dictionary learning, let R_i be the ith patch extraction matrix. That is, R_i is a P × N matrix of all zeros except for a one in each row that extracts a vectorized √P × √P patch from the image, R_i x ∈ R^P for i = 1, ..., N. We work with overlapping image patches with a shift of one pixel and allow a patch to wrap around the image at the boundaries for mathematical convenience [19][26].

A. Two approaches to CS-MRI inversion

We focus on CS-MRI inversion via optimizing an unconstrained function of the form

arg min_x h(x) + (λ/2)||F_u x − y||_2^2,    (1)

where ||F_u x − y||_2^2 is a data fidelity term, λ > 0 is a parameter and h(x) is a regularization function that controls properties of the image we want to reconstruct. As discussed in the introduction, the function h can take several forms, but tends to fall into one of two categories according to whether image-level or patch-level information is considered. We next review these two approaches.
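To make the measurement model concrete, the following NumPy sketch (our own illustration, not code from the paper; the boolean sampling mask and the unitary FFT convention are assumptions) forms the sub-sampled k-space data y = F_u x and evaluates the data fidelity term in (1).

```python
import numpy as np

def undersample_kspace(x, mask):
    """Return y = F_u x: the unitary 2D FFT of image x, kept only at sampled locations.

    x    : 2D image array (real or complex)
    mask : boolean array of the same shape, True where k-space is measured
    """
    k = np.fft.fft2(x, norm="ortho")
    return k[mask]

def data_fidelity(x, y, mask, lam):
    """Evaluate the data fidelity term (lam/2) * ||F_u x - y||_2^2 from (1)."""
    resid = undersample_kspace(x, mask) - y
    return 0.5 * lam * np.sum(np.abs(resid) ** 2)
```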
1) Image-level sparse regularization: CS-MRI with an image-level, or global, regularization function h_g(x) is one in which sparsity is enforced within a transform domain defined on the entire image. For example, in Sparse MRI [3] the regularization function is

h_g(x) = ||Wx||_1 + μ TV(x),    (2)

where W is the wavelet basis and TV(x) is the total variation (spatial finite differences) of the image. Regularizing with this function requires that the image be sparse in the wavelet domain, as measured by the ℓ_1 norm of the wavelet coefficients ||Wx||_1, which acts as a surrogate for ℓ_0 [1][2]. The total variation term enforces homogeneity within the image by encouraging neighboring pixels to have similar values while allowing for sudden high frequency jumps at edges. The parameter μ > 0 controls the trade-off between the two terms.

Various other definitions of h_g(x) have also been proposed for MRI reconstruction, which we briefly summarize. Examples are over-complete contourlets [5], a combination of wavelets, contourlets and TV [6], and regularization of wavelet coefficient correlations based on Gaussian scale mixtures [4]. Other methods replace the ℓ_1 norm with approximations of the ℓ_0 norm, for example FOCUSS [9][10], ℓ_p norms [8], and homotopic ℓ_0 minimization [7]. Numerical algorithms for optimizing (1) with an image-level h_g(x) include nonlinear conjugate gradient descent with backtracking line search [3], an operator-splitting algorithm (TVCMRI) [11] and a variable splitting method (RecPF) [21]. Both TVCMRI and RecPF can replace iterative linear solvers with Fourier domain computations, with substantial time savings. Other methods in the literature include a combination of variable and operator splitting techniques [13], a fast composite splitting algorithm (FCSA) [], a contourlet transform with iterative soft thresholding [5], a combination of a Gaussian scale mixture model with iterative hard thresholding [4], a variation on Bregman operator splitting (BOS) [15] and alternating proximal minimization applied to the TV-based SENSE problem [16]. The above algorithms generally employ variable and operator splitting techniques with the FFT and alternating minimization to simplify the objective function. In this work, we follow a similar approach for total variation minimization.

2) Patch-level sparse regularization: An alternative to the image-level sparsity constraint h_g(x) is a patch-level, or local, regularization function h_l(x), which enforces sparsity in a transform domain defined on patches (square sub-regions of the image) extracted from the full image. An example of such a regularization function is

h_l(x) = Σ_i (γ/2)||R_i x − Dα_i||_2^2 + f(α_i, D),    (3)

where the dictionary matrix is D ∈ R^{P×K} and α_i is a K-dimensional vector.

An important difference between h_l(x) and h_g(x) is the additional function f(α_i, D). While image-level sparsity constraints fall within a predefined transform domain, such as the wavelet basis, the sparse transform domain can be unknown for patch-level regularization and learned from data. The function f enforces sparsity by learning a D for which α_i is sparse.¹ For example, [19] uses K-SVD to learn D off-line, and then approximately optimizes the objective function

arg min_{α_1:N} Σ_i ||R_i x − Dα_i||_2^2  subject to ||α_i||_0 ≤ T, ∀i,    (4)

using orthogonal matching pursuits (OMP) [25]. (Note that this objective can be written using f(α_i, D) = κ_i ||α_i||_0 for some κ_i > 0.) In this case, the extra parameters α_i are included in the objective function (1), and so the problem is no longer convex. Using this definition of h_l(x) in (1), a local optimal solution can be found by an alternating minimization procedure: first solve the least squares solution for x using the current values of α_i and D, and then update α_i and D, or only α_i if D is learned off-line.

¹We have suppressed this dependence on α and D in h_l(x).

The dictionary learning step can be thought of as a denoising procedure. That is, the combination of each Dα_i in effect produces a denoised proposal reconstruction for x, after which the reconstruction takes into account the squared error from this smooth proposal and from the sub-sampled k-space, with weight determined by the regularization parameters.

Aside from sparse dictionary learning, other patch-level algorithms have been reported, for example regularization of patches in a spatial region with a robust distance metric [17], and patch clustering followed by de-aliasing and artifact removal for reconstruction using 3DFFT (LOST) [18] or directional wavelets [20]. These methods each take into account similarities between image patches in determining the dictionary. Next, we review our method for dictionary learning by using a Bayesian nonparametric prior called the beta process.

B. Dictionary learning with beta process factor analysis

Typical dictionary learning approaches require a predefined dictionary size and, for each patch, the setting of either a sparsity level T or an error threshold ε to determine how many dictionary elements are used. In both cases, if the settings do not agree with ground truth, the performance can significantly degrade. Instead, we consider a Bayesian nonparametric method called beta process factor analysis (BPFA) [27], which has been shown to successfully infer both of these values, as well as have competitive performance with algorithms in several application areas [27]-[29]; see []-[39] for related algorithms. The beta process is driven by an underlying Poisson process, and so its properties as a stochastic process for Bayesian modeling are well understood [30]. Originally used for survival analysis in the statistics literature, its use for latent factor modeling has been significantly increasing within the machine learning field [27]-[29], [31], [37]-[39].

Being a Bayesian method, the prior definition of our proposed model gives a way (in principle) of generating images. Writing the generative method for BPFA gives an informative picture of what the algorithm is doing and what assumptions are being made.² To construct an image with the proposed model, we use the generative structure given in Algorithm 1.

²The model has an equivalent representation as an optimization procedure over an analytical objective function, but the result is less informative.

Algorithm 1. Generating an image with BPFA
1) Construct a dictionary D = [d_1, ..., d_K]: d_k ~ N(0, P^{−1} I_P), k = 1, ..., K.
2) Draw a probability π_k ∈ [0, 1] for each d_k: π_k ~ Beta(cγ/K, c(1 − γ/K)), k = 1, ..., K.
3) Draw precision values for the noise and each weight: γ_ε ~ Gamma(g_0, h_0), γ_{s,k} ~ Gamma(e_0, f_0).
4) For the ith patch in x:
   a) Draw the vector s_i ~ N(0, diag(γ_{s,k}^{−1})).
   b) Draw the binary vector z_i with z_ik ~ Bernoulli(π_k).
   c) Define α_i = s_i ∘ z_i by an element-wise product.
   d) Construct the patch R_i x = Dα_i + ε_i with noise ε_i ~ N(0, γ_ε^{−1} I_P).
5) Construct the image x as the average of all R_i x that overlap on a given pixel.

With this approach, the model constructs a dictionary matrix D ∈ R^{P×K} of i.i.d. random variables, and assigns probability π_k to vector d_k. The parameters for these probabilities are set such that most of the π_k are expected to be small, with a few large. In Algorithm 1 we use an approximation to the beta process; for a fixed c > 0 and γ > 0, convergence is guaranteed as K → ∞ [30][28]. Under this parameterization, each patch R_i x extracted from the image x is modeled as a sparse weighted combination of the dictionary elements, as determined by the element-wise product of z_i ∈ {0, 1}^K with the Gaussian vector s_i. What makes the model nonparametric is that for many values of k, the values of z_ik will equal zero for all i; the model learns the number of these unused dictionary elements and their index values from the data. The independent Bernoulli random variables ensure values of zero for the kth element of each z_i when π_k is very small, and thereby eliminate d_k from the model. Therefore, the value of K should be set to a large number that is more than the expected size of the dictionary. It can be shown that under the assumptions of this prior, in the limit K → ∞, the number of dictionary elements used by a patch is Poisson(γ) distributed and the total number of dictionary elements used by the data grows like cγ ln(c + N), where N is the number of patches [31].
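As a concrete companion to Algorithm 1, the following NumPy sketch (ours, not the authors' code) draws the BPFA variables from the prior and assembles the noisy patches of step 4. The hyperparameter defaults are illustrative placeholders rather than the settings used in the experiments, the Gamma rate parameters are converted to NumPy's scale convention, and the averaging of overlapping patches in step 5 is omitted.

```python
import numpy as np

def bpfa_generate_patches(N, P, K, c=1.0, gamma=5.0,
                          e0=1.0, f0=1.0, g0=1.0, h0=1.0, seed=None):
    """Draw the BPFA variables of Algorithm 1 from the prior and return noisy patches.

    N: number of patches, P: patch dimension (e.g. 36 for 6x6 patches), K: truncation level.
    """
    rng = np.random.default_rng(seed)
    # Step 1: dictionary columns d_k ~ N(0, P^{-1} I_P)
    D = rng.normal(0.0, np.sqrt(1.0 / P), size=(P, K))
    # Step 2: element probabilities pi_k ~ Beta(c*gamma/K, c*(1 - gamma/K)); requires K > gamma
    pi = rng.beta(c * gamma / K, c * (1.0 - gamma / K), size=K)
    # Step 3: precisions; the paper's Gamma(shape, rate) becomes NumPy's scale = 1/rate
    gam_eps = rng.gamma(g0, 1.0 / h0)
    gam_s = rng.gamma(e0, 1.0 / f0, size=K)
    # Step 4: per-patch weights s_i, supports z_i, and patches R_i x = D alpha_i + eps_i
    S = rng.normal(0.0, 1.0 / np.sqrt(gam_s), size=(N, K))
    Z = (rng.random((N, K)) < pi).astype(float)
    alpha = S * Z                                   # element-wise product s_i o z_i
    patches = alpha @ D.T + rng.normal(0.0, 1.0 / np.sqrt(gam_eps), size=(N, P))
    # Step 5 (averaging overlapping patches into an image) is omitted in this sketch.
    return patches, D, alpha, pi
```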

1) Relationship to K-SVD: Another widely used dictionary learning method is K-SVD [24]. Though they are models for the same problem, BPFA and K-SVD have some significant differences that we briefly discuss. K-SVD learns the sparsity pattern of the coding vector α_i using the OMP algorithm [25] for each i. Holding the sparsity fixed, it then updates each dictionary element and dimension of α jointly by a rank-one approximation to the residual. BPFA on the other hand updates the sparsity pattern by generating from a beta posterior distribution and generates the weights and the dictionary from Gaussian posterior distributions using Bayes' rule. Because of this probabilistic structure, we derive a sampling algorithm for these variables that takes advantage of marginalization, and naturally learns the auxiliary variables γ_ε and γ_{s,k}.

2) Example denoising problem: We briefly illustrate BPFA on a denoising problem using 6 × 6 patches extracted from an image and setting K = 108. In Figures 1(a) and 1(b) we show the noisy and denoised images. In Figures 1(c) and 1(d) we show some statistics from dictionary learning. For example, Figure 1(c) shows the sorted values of π_k, where we see that fewer than 100 elements are used by the data. Figure 1(d) shows the empirical distribution of the number of elements per patch, where we see the ability of the model to adapt the sparsity to the patch. In Table I we show PSNR results for three noise variance levels. For K-SVD, we consider the case when the error parameter matches the ground truth, and when it mismatches it by a magnitude of five. As expected, when K-SVD does not have an appropriate setting of this value the performance suffers. BPFA on the other hand can adaptively infer the noise variance, which leads to an improvement in denoising.

Fig. 1. An example of denoising by BPFA: (a) noisy image; (b) denoising by BPFA; (c) the final probabilities of the dictionary elements; (d) a distribution on the number of dictionary elements per patch.

TABLE I
Peak signal-to-noise ratio (PSNR) for an image denoised by BPFA and K-SVD. Performance is comparable when the noise parameter of K-SVD is correct (match); BPFA outperforms K-SVD when this setting is wrong (mismatch).
Columns: σ² | K-SVD PSNR (match) | K-SVD PSNR (mismatch) | BPFA PSNR (results) | BPFA learned noise.

III. CS-MRI WITH BPFA AND TV PENALTY

We next present our regularization scheme for reconstructing MR images from highly undersampled k-space data. In reference to the discussion in Section II, we consider a sparsity constraint of the form

arg min_{x,φ} λ_g h_g(x) + h_l(x) + (λ/2)||F_u x − y||_2^2,    (5)

with h_g(x) := TV(x) and h_l(x) := Σ_i (γ_ε/2)||R_i x − Dα_i||_2^2 + f(φ_i).

For the local regularization function h_l(x) we use BPFA as given in Algorithm 1 in Section II-B. The parameters to be optimized for this penalty are contained in the set φ_i = {D, s_i, z_i, γ_ε, γ_s, π}, and are defined in Algorithm 1. The regularization term γ_ε is a model variable that corresponds to an inverse variance parameter of the multivariate Gaussian likelihood. This likelihood is equivalently viewed as the squared error penalty term in (5). This term acts as the sparse basis for the image and also aids in producing a denoised reconstruction, as discussed in Section II-B. (We indicate how to construct the analytical form of f in the appendix.)

For the global regularization function h_g(x) we use the total variation of the image. This term encourages homogeneity within contiguous regions of the image, while still allowing for sharp jumps in pixel value at edges due to the underlying ℓ_1 penalty. The regularization parameters λ_g, γ_ε and λ control the trade-off between the terms in this optimization, which is adaptively learned since γ_ε changes with each iteration.
For the total variation penalty TV(x) we use the isotropic TV model. Let ψ_i be the 2 × N difference operator for pixel i. Each row of ψ_i contains a 1 centered on pixel i, a −1 on the pixel directly above pixel i (for the first row of ψ_i) or to its right (for the second row of ψ_i), and zeros elsewhere. Let Ψ = [ψ_1^T, ..., ψ_N^T]^T be the resulting 2N × N difference matrix for the entire image. The TV coefficients are β = Ψx ∈ R^{2N}, and the isotropic TV penalty is TV(x) = Σ_i ||ψ_i x||_2 = Σ_i sqrt(β_{2i−1}² + β_{2i}²), where i ranges over the pixels in the MR image. For optimization we use the alternating direction method of multipliers (ADMM) [][33]. ADMM works by performing dual ascent on the augmented Lagrangian objective function introduced for the total variation coefficients. For completeness, we give a brief review of ADMM in the appendix.

A. Algorithm

We present an algorithm for finding a local optimal solution to the non-convex objective function given in (5). We can write this objective as

L(x, φ) = λ_g Σ_i ||ψ_i x||_2 + Σ_i [(γ_ε/2)||R_i x − Dα_i||_2^2 + f(φ_i)] + (λ/2)||F_u x − y||_2^2.    (6)
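The difference operator and isotropic penalty defined above can be computed directly; the sketch below is ours, assumes the same wrap-around boundary convention used for the patches, and stores β as an N × 2 array whose ith row holds the two entries of ψ_i x.

```python
import numpy as np

def tv_coefficients(x):
    """Return beta = Psi x as an (N, 2) array of per-pixel [vertical, horizontal] differences.

    Wrap-around (periodic) boundaries are assumed, matching the patch extraction.
    """
    dv = x - np.roll(x, 1, axis=0)    # pixel i minus the pixel directly above it
    dh = x - np.roll(x, -1, axis=1)   # pixel i minus the pixel to its right
    return np.stack([dv.ravel(), dh.ravel()], axis=1)

def isotropic_tv(x):
    """TV(x) = sum_i ||psi_i x||_2 = sum_i sqrt(beta_{2i-1}^2 + beta_{2i}^2)."""
    beta = tv_coefficients(x)
    return np.sum(np.sqrt(np.sum(beta ** 2, axis=1)))
```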

We seek to minimize this function with respect to x and the dictionary learning variables φ_i = {D, s_i, z_i, γ_ε, γ_s, π}. Our first step is to put the objective into a more suitable form. We begin by defining the TV coefficients for the ith pixel as β_i := [β_{2i−1}, β_{2i}]^T = ψ_i x. We introduce the vector of Lagrange multipliers η_i, and then split β_i from ψ_i x by relaxing the equality via an augmented Lagrangian. This results in the objective function

L(x, β, η, φ) = Σ_i [λ_g||β_i||_2 + η_i^T(ψ_i x − β_i) + (ρ/2)||ψ_i x − β_i||_2^2] + Σ_i [(γ_ε/2)||R_i x − Dα_i||_2^2 + f(φ_i)] + (λ/2)||F_u x − y||_2^2.    (7)

From the ADMM theory, this objective will have (local) optimal values β_i' and x' with β_i' = ψ_i x', and so the equality constraints will be satisfied.³ Optimizing this function can be split into three separate sub-problems: one for TV, one for BPFA and one for updating the reconstruction x. Following the discussion of ADMM in the appendix, we define u_i = (1/ρ)η_i and complete the square in the first line of (7). We then cycle through the following three sub-problems,

(P1)  β_i' = arg min_β λ_g||β||_2 + (ρ/2)||ψ_i x − β + u_i||_2^2,  ∀i,
(P2)  φ' = arg min_φ Σ_i (γ_ε/2)||R_i x − Dα_i||_2^2 + f(φ_i),
(P3)  x' = arg min_x Σ_i (ρ/2)||ψ_i x − β_i' + u_i||_2^2 + Σ_i (γ_ε/2)||R_i x − D'α_i'||_2^2 + (λ/2)||F_u x − y||_2^2,
      u_i ← u_i + ψ_i x' − β_i',  i = 1, ..., N.

³We note that for a fixed D and α_{1:N}, the solution is also globally optimal.

For each sub-problem, we use the most recent values of all other parameters. Solutions for P1 and P3 are globally optimal and in closed form, while the update for u_i follows from ADMM. Since P2 is non-convex, we cannot perform the desired minimization, and so an approximation is required. Furthermore, this problem requires iterating through the several dictionary learning variables of BPFA, and so a local optimal solution cannot be given either. Our approach is to use stochastic optimization for problem P2 by Gibbs sampling each variable in BPFA conditioned on current values of all other variables. We next present the updates for each sub-problem, and give an outline in Algorithm 2.

Algorithm 2. Outline of the algorithm
Input: y, the undersampled k-space data.
Output: x, the reconstructed MR image.
Step 1. Initialize x = F_u^H y (zero filling) and u = 0. Initialize the BPFA variables using x.
Step 2. Solve the P1 sub-problem by optimizing β via shrinkage.
Step 3. Update the P2 sub-problem by Gibbs sampling the BPFA variables.
Step 4. Solve the P3 sub-problem in the Fourier domain, followed by an inverse transform.
Step 5. Update the Lagrange multiplier vector u. If not converged, return to Step 2.

1) Algorithm for P1 (total variation): We can solve for β_i exactly for each pixel i = 1, ..., N by using a generalized shrinkage operation [],

β_i = max{ ||ψ_i x + u_i||_2 − λ_g/ρ, 0 } · (ψ_i x + u_i)/||ψ_i x + u_i||_2.    (8)

We recall that β_i corresponds to the 2-dimensional TV coefficients for pixel i, with differences in one direction vertically and horizontally. These coefficients have been split from ψ_i x using ADMM, but gradually converge to one another and become equal in the limit. We recall that after updating x, we update the Lagrange multiplier u_i = u_i + ψ_i x − β_i.

2) Algorithm for P2 (BPFA): We update the parameters of BPFA using Gibbs sampling. We are therefore stochastically optimizing (7), but only for this sub-problem.
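(As an aside before the P2 sampling steps: the P1 update in (8) reduces to a row-wise vector shrinkage of the stacked TV coefficients. The minimal sketch below is ours rather than the authors' code, and it uses the N × 2 layout of the TV coefficients introduced in the earlier TV sketch.)

```python
import numpy as np

def shrink_tv(g, lam_g, rho):
    """Generalized shrinkage of Eq. (8), applied row-wise.

    g : (N, 2) array whose ith row is psi_i x + u_i.
    Returns the (N, 2) array of updated TV coefficients beta_i.
    """
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    scale = np.maximum(norms - lam_g / rho, 0.0) / np.maximum(norms, np.finfo(float).tiny)
    return scale * g
```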
With reference to Algorithm 1, the P2 sub-problem entails sampling new values for the dictionary D, the binary vectors z_i and weights s_i, with which we construct α_i = s_i ∘ z_i through the element-wise product, the precisions γ_ε and γ_sk, and the beta probabilities π_{1:K}, which give the probability that z_ik = 1. In principle, there is no limit to the number of samples that can be made, with the final sample giving the updates used in the other sub-problems. We found that a single sample is sufficient in practice and leads to a faster algorithm. The samples we make are given below.

a) Sample the dictionary D: We define the P × N matrix X = [R_1 x, ..., R_N x], which is the matrix of all vectorized patches extracted from the image x. We also define the K × N matrix α = [α_1, ..., α_N] containing the dictionary weight coefficients for the corresponding columns in X, such that Dα is an approximation of X prior to additive Gaussian noise. The update for the dictionary D is

D = Xα^T(αα^T + (P/γ_ε)I_K)^{−1} + E,  where E_{p,:} ~ N(0, (γ_ε αα^T + P I_K)^{−1}) independently for p = 1, ..., P.    (9)

We note that the first term in Equation (9) is the ℓ_2-regularized least squares solution for D. Correlated Gaussian noise is then added to generate a sample from the conditional posterior of D. Since both the number of pixels and γ_ε will tend to be very large, the variance of the noise is small and the mean term dominates the update for D.

b) Sample the sparse coding α_i: Sampling α_i entails sampling s_ik and z_ik for each k. We sample these values using block sampling. We recall that to block sample two variables from their joint conditional posterior distribution, (s, z) ~ p(s, z | −), one can first sample z from the marginal distribution, z ~ p(z | −), and then sample s | z ~ p(s | z, −) from the conditional distribution. The other sampling direction is possible as well, but for our problem sampling z followed by s | z is more efficient in finding a mode of the objective function. We define r_{i,−k} to be the residual error in approximating the ith patch with the current values from BPFA minus the kth dictionary element, r_{i,−k} = R_i x − Σ_{j≠k}(s_ij z_ij)d_j. We then sample z_ik from its conditional posterior Bernoulli distribution

z_ik ~ p_ik δ_1 + (1 − p_ik)δ_0, where following a simplification,

p_ik ∝ π_k (1 + (γ_ε/γ_sk) d_k^T d_k)^{−1/2} exp{ (γ_ε/2)(d_k^T r_{i,−k})² / (γ_sk/γ_ε + d_k^T d_k) },    (10)

1 − p_ik ∝ 1 − π_k.    (11)

We observe that the probability that z_ik = 1 takes into account how well dictionary element d_k correlates with the residual r_{i,−k}. After sampling z_ik we sample the corresponding weight s_ik from its conditional posterior Gaussian distribution,

s_ik | z_ik ~ N( z_ik d_k^T r_{i,−k} / (γ_sk/γ_ε + d_k^T d_k),  (γ_sk + γ_ε z_ik d_k^T d_k)^{−1} ).    (12)

When z_ik = 1, the mean of s_ik is the regularized least squares solution and the variance will be small if γ_ε is large. When z_ik = 0, s_ik is sampled from the prior.⁴

⁴We note that the value of s_ik does not factor into the model in this case, since s_ik z_ik = 0 and s_ik is integrated out the next time z_ik is sampled.

c) Sample γ_ε and γ_sk: We next sample from the conditional gamma posterior distributions of the noise precision and weight precisions,

γ_ε ~ Gamma( g_0 + PN/2,  h_0 + (1/2)Σ_i ||R_i x − Dα_i||_2^2 ),    (13)

γ_sk ~ Gamma( e_0 + (1/2)Σ_i z_ik,  f_0 + (1/2)Σ_i z_ik s_ik² ).    (14)

The expected value of each variable is the first term of the distribution divided by the second, which is close to the inverse of the average empirical error for γ_ε.

d) Sample π_k: The conditional posterior of π_k is a beta distribution sampled as follows,

π_k ~ Beta( a_0 + Σ_i z_ik,  b_0 + Σ_i (1 − z_ik) ).    (15)

The parameters of the beta distribution include counts of how many times dictionary element d_k was used by a patch.

Fig. 2. Example masks used to undersample k-space: (left) Cartesian mask, (right) radial mask.

3) Algorithm for P3 (MRI reconstruction): The final sub-problem is to reconstruct the image x. Our approach takes advantage of the Fourier domain, similar to other methods, e.g. [33]. The corresponding objective function is

x = arg min_x Σ_i (ρ/2)||ψ_i x − β_i + u_i||_2^2 + Σ_i (γ_ε/2)||R_i x − Dα_i||_2^2 + (λ/2)||F_u x − y||_2^2.

Since this is a least squares problem, x has a closed form solution that satisfies

(ρΨ^TΨ + γ_ε Σ_i R_i^T R_i + λF_u^H F_u) x = ρΨ^T(β − u) + γ_ε P x_BPFA + λF_u^H y.    (16)

We recall that Ψ is the matrix of stacked ψ_i. The vector β is also obtained by stacking each β_i, and similarly u is the vector formed by stacking u_i. The vector x_BPFA is the proposed reconstructed image from BPFA using the current D and α_{1:N}, which results from the equality P x_BPFA = Σ_i R_i^T Dα_i. We observe that inverting the N × N matrix on the left is computationally prohibitive, since N is the number of pixels in the image. Fortunately, given the form of the matrix in Equation (16), we can simplify the problem by working in the Fourier domain, which allows for element-wise updates in k-space, followed by an inverse Fourier transform. We represent x as x = F^Hθ, where θ is the Fourier transform of x and H denotes the conjugate transpose. We then take the Fourier transform of each side of Equation (16) to give

F(ρΨ^TΨ + γ_ε Σ_i R_i^T R_i + λF_u^H F_u)F^H θ = ρFΨ^T(β − u) + γ_ε F P x_BPFA + λF F_u^H y.    (17)

The left-hand matrix simplifies to a diagonal matrix,

F(ρΨ^TΨ + γ_ε Σ_i R_i^T R_i + λF_u^H F_u)F^H = ρΛ + γ_ε P I_N + λ I_N^u.    (18)

Term by term, this results as follows. The product of the finite difference operator matrix Ψ with itself yields a circulant matrix, which has the rows of the Fourier matrix F as its eigenvectors and eigenvalues Λ = FΨ^TΨF^H. The matrix R_i^T R_i is a matrix of all zeros, except for ones on the diagonal entries that correspond to the indices of x associated with the ith patch.
Since each pixel appears in P patches, the sum over i gives P I_N, and the Fourier product cancels. The final diagonal matrix I_N^u also contains all zeros, except for ones along the diagonal corresponding to the indices in k-space that are measured, which results from F F_u^H F_u F^H. Since the left matrix is diagonal, we can perform element-wise updating of the Fourier coefficients θ,

θ_i = [ρ F_iΨ^T(β − u) + γ_ε P F_i x_BPFA + λ F_i F_u^H y] / [ρΛ_ii + γ_ε P + λ(I_N^u)_ii].    (19)

We observe that the rightmost term in the numerator and denominator equals zero if i is not a measured k-space location. We invert θ via the inverse Fourier transform F^H to obtain the reconstructed MR image x.

B. Discussion on λ

We note that a feature of dictionary learning approaches is that λ can be allowed to go to infinity, and so parameter selection isn't necessary here. This is because a denoised reconstruction of the image is obtained through the dictionary learning reconstruction. In reference to Equation (19), we observe that in this case we are fixing the measured k-space values and using the k-space projection of BPFA and TV to fill in the missing values.
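For completeness, a minimal NumPy sketch of the element-wise update (19) is given below. It is our own illustration rather than the authors' implementation: it assumes unitary FFTs, a precomputed array of the eigenvalues Λ_ii, the measured k-space values placed on the full grid, and a boolean sampling mask, and all variable names are ours.

```python
import numpy as np

def p3_update(x_bpfa, psiT_beta_minus_u, y_full, mask, Lam, rho, gam_eps, lam, P):
    """Solve the P3 sub-problem via the element-wise k-space update of Eq. (19).

    x_bpfa            : BPFA proposal image, (1/P) * sum_i R_i^T D alpha_i, as a 2D array
    psiT_beta_minus_u : the image-domain quantity Psi^T (beta - u), reshaped to 2D
    y_full            : measured k-space values placed on the full grid (zeros elsewhere)
    mask              : boolean k-space sampling mask (True where measured)
    Lam               : eigenvalues Lambda_ii of Psi^T Psi in the Fourier basis, same shape as x
    """
    # Numerator of (19): Fourier transforms of the three right-hand-side terms of (16)
    num = (rho * np.fft.fft2(psiT_beta_minus_u, norm="ortho")
           + gam_eps * P * np.fft.fft2(x_bpfa, norm="ortho")
           + lam * y_full)
    # Denominator of (19): diagonal entries; the lambda term is zero off the sampling mask
    den = rho * Lam + gam_eps * P + lam * mask
    theta = num / den
    x = np.fft.ifft2(theta, norm="ortho")
    return np.real(x)   # real part taken here; keep theta complex if the image is complex
```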

Fig. 3. Ground truth images considered in the experiments: (a) Circle of Willis, (b) Lumbar, (c) Shoulder, (d) Brain.

Fig. 4. GE data with noise (σ = ) and 30% Cartesian sampling: (a) zero filling. The BPFA dictionary learning model (b) reconstructs the original noisy image and (c) denoises the reconstruction in unison, using two versions of the image in the reconstruction, while (d) total variation minimization reconstructs and denoises one image. Also shown are the dictionary learning variables sorted by π_k: (e) the dictionary (magnitude), (f) the distribution on the dictionary, π_k, and (g) the normalized histogram of the number of dictionary elements used per patch.

IV. EXPERIMENTS AND DISCUSSION

We present experimental results on synthetic data and the MRI shown in Figure 3. We consider a variety of sampling rates and masks, and compare with four other algorithms: SparseMRI [3], PBDW [19], TV [33] and DLMRI [18]. We use the publicly available code for these algorithms and tried several parameter settings, selecting the best ones for comparison. We also compare with BPFA without using total variation, which is a special case of our algorithm with λ_g = 0.

A. Set-up

We consider two sampling trajectories in k-space corresponding to the two practical approaches to CS-MRI: Cartesian sampling with random phase encodes and radial sampling. We also considered random sampling and found comparable results, with reconstruction improved for each algorithm, as expected from CS theory. Since this is not a practical sampling method we omit these results. In the first scheme, measurement trajectories are sampled from a variable density Cartesian grid, and in the second we measure along radial lines uniformly spaced in angle. We show examples of these trajectories in Figure 2. We considered several subsampling rates for each trajectory, measuring 10%, 20%, 25%, 30%, and 35% of k-space. As a performance measure we use the peak signal-to-noise ratio (PSNR) to the ground truth image, in addition to showing qualitative performance comparisons. For all images, we extract 6 × 6 patches, where each pixel defines the upper left corner of a patch, and wrap around the image at the boundaries.
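As a reference for the quantitative comparisons that follow, PSNR can be computed as in this sketch (ours; the peak-value convention is an assumption, not a normalization stated in the paper).

```python
import numpy as np

def psnr(x_rec, x_ref, peak=None):
    """Peak signal-to-noise ratio of a reconstruction against the ground truth image.

    The peak defaults to the maximum magnitude of the reference image; this
    normalization is our assumption, not a convention taken from the paper.
    """
    peak = np.abs(x_ref).max() if peak is None else peak
    mse = np.mean(np.abs(x_rec - x_ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```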

Fig. 5. Circle of Willis MRI: (a)-(f) example reconstructions for Cartesian sampling of 25% of k-space with BPFA+TV, BPFA, TV, DLMRI, PBDW and SparseMRI, and (g) zero-filling. (h) PSNR results for Cartesian sampling. (i) PSNR results for radial sampling.

Fig. 6. Lumbar MRI: (a)-(f) example reconstructions for radial sampling of 20% of k-space with BPFA+TV, BPFA, TV, DLMRI, PBDW and SparseMRI, and (g) zero-filling. (h) PSNR results for Cartesian sampling. (i) PSNR results for radial sampling.

Fig. 7. Shoulder MRI: (a)-(f) example reconstructions for Cartesian sampling of 20% of k-space with BPFA+TV, BPFA, TV, DLMRI, PBDW and SparseMRI, and (g) zero-filling. (h) PSNR results for Cartesian sampling. (i) PSNR results for radial sampling.

Fig. 8. Brain MRI: (a)-(f) example reconstructions for radial sampling of 10% of k-space with BPFA+TV, BPFA, TV, DLMRI, PBDW and SparseMRI, and (g) zero-filling. (h) PSNR results for Cartesian sampling. (i) PSNR results for radial sampling.

For the synthetic data we learn complex-valued dictionaries, while for the MRI data we restrict the model to real-valued dictionaries. We initialize x by zero-filling in k-space. We use a dictionary with K = 108 initial dictionary elements, recalling that the final number of dictionary elements will be smaller due to the sparse BPFA prior. If 108 is found to be too small, K can be increased, with the result being a slower inference algorithm. (In principle K can be infinitely large.) We ran 1000 iterations of the algorithm and show the results of the last iteration. For the regularization parameters of our model, we set the data fidelity regularization λ to a very large value, therefore treating λ as effectively being infinity and allowing BPFA to fill in the missing k-space and denoise, as discussed in Section III-B. We also set λ_g = 10 and ρ = . For BPFA we set c = 1, e_0 = f_0 = 1, γ = 5, g_0 = 0.5N²/10, and h_0 = g_0 v/8, where v is the empirical variance of the initialization.

B. Experiments on simulated data

In Figure 4 we show results on the GE phantom with additive noise having standard deviation σ = . In this experiment we use BPFA without TV to reconstruct the original image using 30% Cartesian sampling. We show the reconstruction using zero-filling in Figure 4(a). Since λ = 10100, we see in Figure 4(b) that BPFA essentially helps reconstruct the underlying noisy image for x. However, using the denoising property of the BPFA model shown in Figure 1, we obtain the denoised reconstruction of Figure 4(c) by focusing on x_BPFA from Equation (16). This is in contrast with the best result we could obtain with TV in Figure 4(d), which places the sparse penalty directly on the reconstructed image. For TV the value of λ relative to the regularization parameter becomes significant. We set λ = 1 and swept through values in (0, 5) for the TV regularization parameter. Similar to Figure 1, we show some statistics from the BPFA model in Figures 4(e)-(g). Roughly 80 dictionary elements were used, and an average of 2.28 elements were used by a patch, given that at least one was used (which discounts the black regions).

Fig. 9. Absolute errors for 20% radial sampling of the shoulder MRI: (a) BPFA+TV, (b) PBDW, (c) DLMRI, (d) Sparse MRI.

C. Experiments on MRI

We next evaluate the performance of our algorithm using the MRI shown in Figure 3. As mentioned, we compare our algorithm with Sparse MRI [3], which is a combination of wavelets and total variation; TV [33] using the isotropic model; DLMRI [18], which is a dictionary learning model based on K-SVD; and PBDW [19], which is a patch-based method that uses directional wavelets and therefore places greater restrictions on the dictionary. In all algorithms, we considered several parameter settings and picked the best results for comparison. In addition, we consider our algorithm with and without the total variation penalty, denoted BPFA+TV and BPFA, respectively.

1) Reconstruction results: We present quantitative and qualitative results for the reconstruction algorithms in Figures 5-8. In these figures, we show the peak signal-to-noise ratio (PSNR) for Cartesian and radial sampling as a function of the percentage sampled in k-space. We see that the proposed Bayesian nonparametric dictionary learning method gives an improved reconstruction. We also see additional slight improvement when a TV penalty is added, though this is not always the case. Given the denoising property of dictionary learning, this is perhaps not surprising.
We also observe that radial sampling performed better than Cartesian sampling in all experiments. We again note that we performed similar experiments using random sampling and observed similar relative results with an overall improvement compared with radial sampling, but we omit these results for space, and because random sampling is not practical for MRI. In each figure we also show example reconstructions for the algorithms considered, including zero-filling in k-space. In some MRI, such as the Circle of Willis in Figure 5, the improvement is less in the structural information and more in image quality. In other MRI, the proposed method is able to capture structure in the image that is missed by the other algorithms. In Figure 6(a) we indicate one of these regions for the shoulder MRI. In Figure 9 we show the residual errors (in absolute value) for several algorithms on the shoulder MRI. (Note that these images correspond to a different sampling pattern than in Figure 7.) In this example we see that the errors for BPFA are more noise-like than for the other algorithms. The proposed method has several advantages, which we believe leads to the improvement in performance. A significant advantage is the adaptive learning of the dictionary size and per-patch sparsity level using a nonparametric stochastic process that is naturally suited for this problem. In addition to this, several other parameters such as the noise variance and the variances of the score weights are adjusted through a natural MCMC sampling approach. Also, the regularization introduced by the prior helps prevent over-fitting, which is important since in the first several iterations BPFA is modeling an MRI reconstruction that is significantly distorted.

11 11 Sum of probabilities Dictionary element index (a) Dictionary for 10% sampling (b) Dictionary for 20% sampling (c) Dictionary for 30% sampling 10% sampling 20% sampling 30% sampling (d) BPFA weights (cumulative) empirical probability % sampling 20% sampling 30% sampling # dictionary elements used by patch (e) Dictionary elements per patch Fig. 10. Radial sampling for the Circle of Willis. (a)-(c) The learned dictionary for various sampling rates. (d) The cumulative function of the sorted π k from BPFA for each sampling rate. This gives information on sparsity and average usage of the dictionary. (e) The distribution on the number of elements used per patch for each sampling rate. Another advantage of our model is the Markov Chain Monte Carlo inference algorithm. In highly non-convex Bayesian models (or similar models with a Bayesian interpretation), it is generally observed by the statistics community that MCMC sampling outperforms deterministic methods. Given that BPFA is a Bayesian model, such inference/optimization techniques are readily derived, as we showed in Section III-A. A drawback of MCMC is that more iterations are required than deterministic methods (we used 1000 iterations requiring approximately 1.5 hours, whereas the other algorithms required under 100). However, we note that inference for the BPFA model is easily parallelizable, which can mitigate this problem. 2) Dictionary learning: We next investigate the model learned by BPFA. In Figure 10 we show dictionary learning results learned by BPFA+TV for radial sampling of the Circle of Willis. In the top portion, we show the dictionaries learned for 10%, 20% and 30% sampling. We see that they are consistent, but the number of elements increases as the sampling percentage increases, since more complex information is contained in the k-space measurements of the image. This is also shown in Figure 10(d). In this plot we show the cumulative sum of the ordered π k from BPFA. We can read off the average number of elements used per patch by looking at the right-most value. We see that more elements are used per patch as the fraction of observed k-space increases. We also see that for 10%, 20% and 30% sampling, roughly 60, 80 and 95, respectively, of the 108 total dictionary elements were significantly used, as indicated by the leveling off of these functions. This highlights the adaptive property of the nonparametric beta process prior. In Figure 10(e) we show the empirical distribution on the number of dictionary elements used per patch for each sampling rate. We see that there are two modes, one for the empty background and one for the foreground, and the second mode tends to increase as the sampling rate increases. The adaptability of this value to each patch is another characteristic of the beta process model. We note that these results are typical of what we observed in the other experiments. 3) Discussion: We initialized the image using zero-filling and initialized the first dictionary elements using the singular vectors of the patches from this image. We then randomly sampled the remaining dictionary elements from the prior. We initialized z ik = 0 for all i and k, and π k =. As mentioned, for the Gamma(g 0, h 0 ) prior on the inverse noise variance of the patch, we set g 0 = 0.5N 2 /10 and h 0 = g 0 v/8, where v is the empirical noise variance of the zero-filled image. This gives a prior expected noise of v/8. 
Here, 10 indicates that the prior is 1/10 the strength of the likelihood, and 8 indicates that the prior expects a SNR of 8 to 1 with respect to the zero-filled image. The purpose of this is that the MRI we consider have very little noise and so using a non-informative prior (where g 0 = h 0 = ) would cause dictionary learning to fit the early reconstructions tightly by correctly learning that there is very little noise. While we still observed good results, the convergence was very slow. Strengthening the prior enforces a more smooth reconstruction in the early stages of inference. We note that for more significant levels of noise, such as our examples in Sections II-B2 and IV-B, this issue did not arise and noninformative priors could be used. We note that the added computation time for the TV penalty is very small compared with dictionary learning; the total amount of time required for one iteration was between 5 and 6 seconds for the BPFA+TV model. This is significantly faster than DLMRI, since our sampling approach is much less computationally intensive than the OMP algorithm, which requires matrix inversions, but slower than the other algorithms we compare with. V. CONCLUSION We have presented an algorithm for CS-MRI reconstruction that uses Bayesian nonparametric dictionary learning. Our Bayesian approach uses a model called beta process factor analysis (BPFA) for in situ dictionary learning. Through this hierarchical generative structure, we can learn the dictionary size, sparsity pattern and additional regularization parameters. We also considered a total variation penalty term for additional constraints on image smoothness. We presented an optimization algorithm using the alternating direction method

12 12 of multipliers (ADMM) and MCMC Gibbs sampling for all BPFA variables. Experimental results on several MR images showed that our proposed regularization framework compares favorably with other algorithms for various sampling trajectories and rates. VI. APPENDIX A. Constructing the Bayesian part of the objective function We give some additional details of the Bayesian structure of our dictionary learning approach. The unknown variables of the model are D = {d 1,..., d K }, π = {π 1,..., π K }, {s i } i=1:n, {z i } i=1:n, γ ε, γ s = {γ s,1,..., γ s,k }. The data from the perspective of BPFA is the set of patches extracted from the current reconstruction, {R i x} i=1:n. The joint likelihood of these variables and data is p({r i x}, D, π, {s i }, {z i }, γ ε, γ s ) = [ N p(r i x D, z i, s i, γ ε )p(s i γ s ) ] k p(z ik π k ) i=1 [ K ] p(π k )p(d k ) p(γ ε ) k p(γ s,k). (20) k=1 The first bracketed group constitutes the patch-specific part of the likelihood. The second group contains the dictionary elements and their probabilities and the remaining distributions are for inverse variances. The specific distributions used are given in Algorithm 1. By writing out these distributions explicitly, the functional form of the joint likelihood can be obtained. The dictionary learning part of the objective function, which corresponds to sub-problem P2, is γ ε 2 R ix Dα i f(ϕ i ) = i ln p({r i x}, D, π, {s i }, {z i }, γ ε, γ s ). Optimizing this non-convex function is equivalent to finding a mode of the joint likelihood. Rather than use a deterministic gradient-based method, we use the MCMC Gibbs sampling to stochastically find a mode. The functional form is unnecessary for deriving the Gibbs sampling algorithm. We note that many of the updates are essentially noisy versions of regularized least squares solutions. B. Alternating Direction Method of Multipliers To review the general form of ADMM [35] we are interested in, we start with the convex optimization problem min x Ax b h(x), (21) where h is a non-smooth convex function, such as an l 1 penalty. ADMM decouples the smooth squared error term from this penalty by introducing a second vector v such that min x Ax b h(v) subject to v = x. (22) This is followed by a relaxation of the equality v = x via an augmented Lagrangian term L(x, v, η) = Ax b h(v) + η T (x v) + ρ 2 x v 2 2. (23) A minimax saddle point is found with the minimization taking place over both x and v and dual ascent for η. Another way to write the objective in (23) is to define u = (1/ρ)η and combine the last two terms. The result is an objective that can be optimized by cycling through the following updates for x, v and u, x = arg min x Ax b ρ 2 x v + u 2 2, (24) v = arg min v h(v) + ρ 2 x v + u 2 2, (25) u = u + x v. (26) This algorithm simplifies the optimization since the objective for x is quadratic and thus has a simple analytic solution, while the update for v is a proximity operator of h with penalty ρ, the difference being that v is not pre-multiplied by a matrix as x is in (21). Such optimization problems tend to be much easier to solve; for example when h is the TV penalty the solution for v is analytical. REFERENCES [1] E. Candés, J. Romberg, and T. Tao, Robust Uncertainty Principles: Exact Signal Reconstruction From Highly Incomplete Frequency Information, IEEE Trans. on Information Theory, vol. 52, no. 2, pp , [2] D. Donoho, Compressed sensing, IEEE Trans. on Information Theory, vol. 52, no. 4, pp , [3] M. Lustig, D. Donoho, and J. M. 
Pauly, Sparse MRI: The Application of Compressed Sensing for Rapid MR Imaging, Magnetic Resonance in Medicine, vol. 58, no. 6, pp , [4] Y. Kim, M. S. Nadar, and A. Bilgin, Wavelet-Based Compressed Sensing Using Gaussian Scale Mixtures, IEEE Trans. on Image Processing, vol. 21, no. 6, pp , [5] X. Qu, W. Zhang, D. Guo, C. Cai, S. Cai, and Z. Chen, Iterative Thresholding Compressed Sensing MRI Based on Contourlet Transform, Inverse Problems Sci. Eng., Jun [6] X. Qu, X. Cao, D. Guo, C. Hu, and Z. Chen, Combined Sparsifying Transforms for Compressed Sensing MRI, Electronics Letters, vol. 46, no. 2, pp , [7] J. Trzasko and A. Manduca, Highly Undersampled Magnetic Resonance Image Reconstruction via Homotopic L0-Minimization, IEEE Trans. on Medical Imaging, vol. 28, no. 1, pp , [8] R. Chartrand, Fast Algorithms for Nonconvex Compressive Sensing: MRI Reconstruction from Very Few Data, in Proc. IEEE Int. Symp. on Biomedical Imaging, pp , [9] J. C. Ye, S. Tak, Y. Han, and H. W. Park, Projection Reconstruction MR Imaging Using FOCUSS, Magnetic Resonance in Medicine, vol. 57, no. 4, pp , [10] H. Jung, K. Sung, K. S. Nayak, E. Y. Kim, and J. C. Ye, k-t FOCUSS: A General Compressed Sensing Framework for High Resolution Dynamic MRI, Magnetic Resonance in Medicine, vol. 61, pp , [11] J. Yang, Y. Zhang, and W. Yin, A Fast Alternating Direction Method for TVL1-L2 Signal Reconstruction from Partial Fourier Data, IEEE J. Sel. Topics in Signal Processing, vol. 4, no. 2, pp , [12] Y. Chen and X. Ye, A Novel Method and Fast Algorithm for MR Image Reconstruction with Significantly Under- sampled Data, Inverse Problems and Imaging, vol. 4, no. 2, pp , [13] J. Huang, S. Zhang, and D. Metaxas, Efficient MR Image Reconstruction for Compressed MR Imaging, Medical Image Analysis, vol. 15, no. 5, pp , [14] S. Ji, Y. Xue and L. Carin, Bayesian compressive sensing, IEEE Trans. on Signal Processing, vol. 56, no. 6, pp , [15] X. Ye, Y. Chen, and F. Huang, Computational Acceleration for MR Image Reconstruction in Partially Parallel Imaging, IEEE Trans. on Medical Imaging, vol. 30, no. 5, pp , [16] X. Ye, Y. Chen, W. Lin, and F. Huang, Fast MR Image Reconstruction for Partially Parallel Imaging with Arbitrary k-space Trajectories, IEEE Trans. on Medical Imaging, vol. 30, no. 3, pp , 2011.


More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

Nonparametric Bayesian Dictionary Learning for Machine Listening

Nonparametric Bayesian Dictionary Learning for Machine Listening Nonparametric Bayesian Dictionary Learning for Machine Listening Dawen Liang Electrical Engineering dl2771@columbia.edu 1 Introduction Machine listening, i.e., giving machines the ability to extract useful

More information

Sparse & Redundant Signal Representation, and its Role in Image Processing

Sparse & Redundant Signal Representation, and its Role in Image Processing Sparse & Redundant Signal Representation, and its Role in Michael Elad The CS Department The Technion Israel Institute of technology Haifa 3000, Israel Wave 006 Wavelet and Applications Ecole Polytechnique

More information

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net

More information

Machine Learning for Signal Processing Sparse and Overcomplete Representations

Machine Learning for Signal Processing Sparse and Overcomplete Representations Machine Learning for Signal Processing Sparse and Overcomplete Representations Abelino Jimenez (slides from Bhiksha Raj and Sourish Chaudhuri) Oct 1, 217 1 So far Weights Data Basis Data Independent ICA

More information

arxiv: v1 [cs.it] 26 Oct 2018

arxiv: v1 [cs.it] 26 Oct 2018 Outlier Detection using Generative Models with Theoretical Performance Guarantees arxiv:1810.11335v1 [cs.it] 6 Oct 018 Jirong Yi Anh Duc Le Tianming Wang Xiaodong Wu Weiyu Xu October 9, 018 Abstract This

More information

Recovery of Sparse Signals from Noisy Measurements Using an l p -Regularized Least-Squares Algorithm

Recovery of Sparse Signals from Noisy Measurements Using an l p -Regularized Least-Squares Algorithm Recovery of Sparse Signals from Noisy Measurements Using an l p -Regularized Least-Squares Algorithm J. K. Pant, W.-S. Lu, and A. Antoniou University of Victoria August 25, 2011 Compressive Sensing 1 University

More information

Adaptive Corrected Procedure for TVL1 Image Deblurring under Impulsive Noise

Adaptive Corrected Procedure for TVL1 Image Deblurring under Impulsive Noise Adaptive Corrected Procedure for TVL1 Image Deblurring under Impulsive Noise Minru Bai(x T) College of Mathematics and Econometrics Hunan University Joint work with Xiongjun Zhang, Qianqian Shao June 30,

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Machine Learning for Signal Processing Sparse and Overcomplete Representations. Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013

Machine Learning for Signal Processing Sparse and Overcomplete Representations. Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013 Machine Learning for Signal Processing Sparse and Overcomplete Representations Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013 1 Key Topics in this Lecture Basics Component-based representations

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Compressive Imaging by Generalized Total Variation Minimization

Compressive Imaging by Generalized Total Variation Minimization 1 / 23 Compressive Imaging by Generalized Total Variation Minimization Jie Yan and Wu-Sheng Lu Department of Electrical and Computer Engineering University of Victoria, Victoria, BC, Canada APCCAS 2014,

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

SOS Boosting of Image Denoising Algorithms

SOS Boosting of Image Denoising Algorithms SOS Boosting of Image Denoising Algorithms Yaniv Romano and Michael Elad The Technion Israel Institute of technology Haifa 32000, Israel The research leading to these results has received funding from

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles Or: the equation Ax = b, revisited University of California, Los Angeles Mahler Lecture Series Acquiring signals Many types of real-world signals (e.g. sound, images, video) can be viewed as an n-dimensional

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Solving DC Programs that Promote Group 1-Sparsity

Solving DC Programs that Promote Group 1-Sparsity Solving DC Programs that Promote Group 1-Sparsity Ernie Esser Contains joint work with Xiaoqun Zhang, Yifei Lou and Jack Xin SIAM Conference on Imaging Science Hong Kong Baptist University May 14 2014

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Pre-weighted Matching Pursuit Algorithms for Sparse Recovery

Pre-weighted Matching Pursuit Algorithms for Sparse Recovery Journal of Information & Computational Science 11:9 (214) 2933 2939 June 1, 214 Available at http://www.joics.com Pre-weighted Matching Pursuit Algorithms for Sparse Recovery Jingfei He, Guiling Sun, Jie

More information

Introduction How it works Theory behind Compressed Sensing. Compressed Sensing. Huichao Xue. CS3750 Fall 2011

Introduction How it works Theory behind Compressed Sensing. Compressed Sensing. Huichao Xue. CS3750 Fall 2011 Compressed Sensing Huichao Xue CS3750 Fall 2011 Table of Contents Introduction From News Reports Abstract Definition How it works A review of L 1 norm The Algorithm Backgrounds for underdetermined linear

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

Sparse Solutions of an Undetermined Linear System

Sparse Solutions of an Undetermined Linear System 1 Sparse Solutions of an Undetermined Linear System Maddullah Almerdasy New York University Tandon School of Engineering arxiv:1702.07096v1 [math.oc] 23 Feb 2017 Abstract This work proposes a research

More information

Bayesian non parametric approaches: an introduction

Bayesian non parametric approaches: an introduction Introduction Latent class models Latent feature models Conclusion & Perspectives Bayesian non parametric approaches: an introduction Pierre CHAINAIS Bordeaux - nov. 2012 Trajectory 1 Bayesian non parametric

More information

SPARSE SIGNAL RESTORATION. 1. Introduction

SPARSE SIGNAL RESTORATION. 1. Introduction SPARSE SIGNAL RESTORATION IVAN W. SELESNICK 1. Introduction These notes describe an approach for the restoration of degraded signals using sparsity. This approach, which has become quite popular, is useful

More information

Rigorous Dynamics and Consistent Estimation in Arbitrarily Conditioned Linear Systems

Rigorous Dynamics and Consistent Estimation in Arbitrarily Conditioned Linear Systems 1 Rigorous Dynamics and Consistent Estimation in Arbitrarily Conditioned Linear Systems Alyson K. Fletcher, Mojtaba Sahraee-Ardakan, Philip Schniter, and Sundeep Rangan Abstract arxiv:1706.06054v1 cs.it

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Compressive Sensing (CS)

Compressive Sensing (CS) Compressive Sensing (CS) Luminita Vese & Ming Yan lvese@math.ucla.edu yanm@math.ucla.edu Department of Mathematics University of California, Los Angeles The UCLA Advanced Neuroimaging Summer Program (2014)

More information

Compressive Sensing and Beyond

Compressive Sensing and Beyond Compressive Sensing and Beyond Sohail Bahmani Gerorgia Tech. Signal Processing Compressed Sensing Signal Models Classics: bandlimited The Sampling Theorem Any signal with bandwidth B can be recovered

More information

A Riemannian Framework for Denoising Diffusion Tensor Images

A Riemannian Framework for Denoising Diffusion Tensor Images A Riemannian Framework for Denoising Diffusion Tensor Images Manasi Datar No Institute Given Abstract. Diffusion Tensor Imaging (DTI) is a relatively new imaging modality that has been extensively used

More information

Greedy Dictionary Selection for Sparse Representation

Greedy Dictionary Selection for Sparse Representation Greedy Dictionary Selection for Sparse Representation Volkan Cevher Rice University volkan@rice.edu Andreas Krause Caltech krausea@caltech.edu Abstract We discuss how to construct a dictionary by selecting

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Scale Mixture Modeling of Priors for Sparse Signal Recovery

Scale Mixture Modeling of Priors for Sparse Signal Recovery Scale Mixture Modeling of Priors for Sparse Signal Recovery Bhaskar D Rao 1 University of California, San Diego 1 Thanks to David Wipf, Jason Palmer, Zhilin Zhang and Ritwik Giri Outline Outline Sparse

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Gauge optimization and duality

Gauge optimization and duality 1 / 54 Gauge optimization and duality Junfeng Yang Department of Mathematics Nanjing University Joint with Shiqian Ma, CUHK September, 2015 2 / 54 Outline Introduction Duality Lagrange duality Fenchel

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Compressed Sensing and Neural Networks

Compressed Sensing and Neural Networks and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications

More information

Applied Machine Learning for Biomedical Engineering. Enrico Grisan

Applied Machine Learning for Biomedical Engineering. Enrico Grisan Applied Machine Learning for Biomedical Engineering Enrico Grisan enrico.grisan@dei.unipd.it Data representation To find a representation that approximates elements of a signal class with a linear combination

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

Joint Bayesian Compressed Sensing with Prior Estimate

Joint Bayesian Compressed Sensing with Prior Estimate Joint Bayesian Compressed Sensing with Prior Estimate Berkin Bilgic 1, Tobias Kober 2,3, Gunnar Krueger 2,3, Elfar Adalsteinsson 1,4 1 Electrical Engineering and Computer Science, MIT 2 Laboratory for

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Using ADMM and Soft Shrinkage for 2D signal reconstruction

Using ADMM and Soft Shrinkage for 2D signal reconstruction Using ADMM and Soft Shrinkage for 2D signal reconstruction Yijia Zhang Advisor: Anne Gelb, Weihong Guo August 16, 2017 Abstract ADMM, the alternating direction method of multipliers is a useful algorithm

More information

Variational Bayesian Inference Techniques

Variational Bayesian Inference Techniques Advanced Signal Processing 2, SE Variational Bayesian Inference Techniques Johann Steiner 1 Outline Introduction Sparse Signal Reconstruction Sparsity Priors Benefits of Sparse Bayesian Inference Variational

More information

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,

More information

Compressed Sensing via Partial l 1 Minimization

Compressed Sensing via Partial l 1 Minimization WORCESTER POLYTECHNIC INSTITUTE Compressed Sensing via Partial l 1 Minimization by Lu Zhong A thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Design of Projection Matrix for Compressive Sensing by Nonsmooth Optimization

Design of Projection Matrix for Compressive Sensing by Nonsmooth Optimization Design of Proection Matrix for Compressive Sensing by Nonsmooth Optimization W.-S. Lu T. Hinamoto Dept. of Electrical & Computer Engineering Graduate School of Engineering University of Victoria Hiroshima

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Relaxed linearized algorithms for faster X-ray CT image reconstruction

Relaxed linearized algorithms for faster X-ray CT image reconstruction Relaxed linearized algorithms for faster X-ray CT image reconstruction Hung Nien and Jeffrey A. Fessler University of Michigan, Ann Arbor The 13th Fully 3D Meeting June 2, 2015 1/20 Statistical image reconstruction

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process 10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

arxiv: v1 [cs.na] 29 Nov 2017

arxiv: v1 [cs.na] 29 Nov 2017 A fast nonconvex Compressed Sensing algorithm for highly low-sampled MR images reconstruction D. Lazzaro 1, E. Loli Piccolomini 1 and F. Zama 1 1 Department of Mathematics, University of Bologna, arxiv:1711.11075v1

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION A Thesis by MELTEM APAYDIN Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the

More information