Sparse & Redundant Representations by Michael Elad * The Computer Science Department The Technion Israel Institute of technology Haifa 3000, Israel 6-30 August 007 San Diego Convention Center San Diego, CA USA Wavelet XII (670) * Joint work with: Boaz Joseph Michael Matalon Shtok Zibulevsky
Today s Talk is About f the Minimization of the Function ( ) = D x + λρ( ) by Today Why This we Topic? will discuss: Why Key this for minimization turning theory task to is applications important? in Sparse-land, Which Shrinkage applications is intimately could benefit coupled from with this wavelet, minimization? How The can applications it be minimized we target effectively? fundamental signal processing ones. What This iterated-shrinkage is a hot topic in harmonic methods are analysis. out there? and /4
Agenda. Motivating the Minimization of f() Describing various applications that need this minimization. Some Motivating Facts General purpose optimization tools, and the unitary case ˆ = ArgMin 3. We describe five versions of those in detail D x + λρ( ) 4. Some Results Image deblurring results Why? 5. Conclusions 3/4
Lets Start with Image Denoising y : Given measurements x : Unknown to be recovered Noise? { } v ~ N 0, σ I Remove Additive Many of the existing image denoising algorithms are related to the minimization of an energy function of the form f ( x) = x y Pr( x) Likelihood: Relation to measurements + y = x + v, Prior or regularization We will use a Sparse & Redundant Representation prior. 4/4
Our MAP Energy Function We assume that x is created by M: where is a sparse & redundant representation and D is a known dictionary. This leads to: ˆ = ArgMin Dx y + λρ M D ( ) xˆ = Dˆ x D=x= = This MAP denoising algorithm is known as basis Pursuit Denoising [Chen, Donoho, Saunders 995]. ˆ = ArgMin D x + λρ( ) The term ρ() measures the sparsity of the solution : This is Our Problem!!! L 0 -norm ( 0 ) leads to non-smooth & non-convex problem. p p The L p norm ( ) with 0<p is often found to be equivalent. Many other ADDITIVE sparsity measures are possible. 5/4
General (linear) Inverse Problems Assume that x is known to emerge from M, as before. Suppose we observe y = Hx + v, a blurred and noisy version of x. How could we recover x? A MAP estimator leads to: ˆ = argmin HD y + λρ( ) ˆ = M D ArgMin x R N Hx xˆ D x = Dˆ Noise + λρ ( ) y R This is Our Problem!!! M? xˆ 6/4
Inverse Problems of Interest De-Noising De-Blurring In-Painting De-Mosaicing ˆ = ArgMin D x + λρ( ) This is Our Problem!!! Tomography Image Scale-Up & super-resolution And more 7/4
Signal Separation D M v M D β x x ˆ = ArgMin z Given a mixture z=x +x +v two sources, M and M, and white Gaussian D xnoise + λρ v, we ( ) desire to separate it to its ingredients. This is Our Problem!!! [ ] v Written differently: z = D D + β Thus, solving this problem using MAP leads to the Morphological Component Analysis (MCA) [Starck, Elad, Donoho, 005]: ˆ β ˆ = argmin [ D, D ] z β + λρ β 8/4
Compressed-Sensing [Candes et.al. 006], [Donoho, 006] In compressed-sensing we compress the signal x by exploiting its origin. This is done by p<<n random projections. The core idea: y Px = ˆPD = ArgMin (P size: p n) D holds x ( ) all + λρ the information about the original signal x, even though p<<n. Reconstruction? Use MAP again and solve D ˆ = argmin This is Our Problem!!! PD y + λρ ( ) xˆ = Dˆ M Px y? xˆ x Noise 9/4
Brief Summary # The minimization of the function f ( ) = D x + λρ( ) is a worthy task, serving many & various applications. So, How This Should be Done? 0/4
Agenda. Motivating the Minimization of f() Describing various applications that need this minimization. Some Motivating facts General purpose optimization tools, and the unitary case 3. We describe five versions of those in detail 4. Some Results Image deblurring results 5. Conclusions Apply Wavelet Transform LUT Apply Inv. Wavelet Transform /4
Is there a Problem? f ( ) = D x + λρ( ) The first thought: With all the existing knowledge in optimization, we could find a solution. Methods to consider: (Normalized) Steepest Descent compute the gradient and follow it s path. Conjugate Gradient use the gradient and the previous update direction, combined by a preset formula. Pre-Conditioned SD weight the gradient by the Hessian s diagonal inverse. Truncated Newton Use the gradient and Hessian to define a linear system, and solve it approximately by a set of CG steps. Interior-Point Algorithms Separate to positive and negative entries, and use both the primal and the dual problems + barrier for forcing positivity. /4
General-Purpose Software? So, simply download one of many general-purpose packages: 0 L-Magic (interior-point solver), 5 Sparselab (interior-point solver), MOSEK (various tools), 5 0 5 0 Matlab Optimization 30 Toolbox (various tools), 5 0 5 0 5 30 A Problem: General purpose software-packages (algorithms) are typically performing poorly on our task. Possible reasons: The fact that the solution is expected to be sparse (or nearly so) in our problem is not exploited in such algorithms. H f = D D x The Hessian of f() tends to be highly H ill-conditioned near the (sparse) solution. f = D D + λρ' ' ( ) ( ) + λρ' ( ) ( ) ( ) So, are we stuck? Is this problem really that complicated? 00 50 00 50 Relatively small entries for the non-zero entries in the solution 3/4
Consider the Unitary Case (DD H =I) H D Define x = β We got a separable set of m identical D optimization problems f f f f ( ) = D x + λρ( ) H ( ) = D DD x + λρ( ) ( ) ( ) = D β + λρ( ) ( ) = β + λρ( ) = m j= = ( ) β + λρ( ) j j j Use H DD = I L is unitarily invariant 4/4
The D Task We need to solve the following D problem: opt = ArgMin = opt S ρ, λ ( β) ( ) β + λρ( ) LUT opt Such a Look-Up-Table (LUT) opt =S ρ,λ (β) can be built for ANY sparsity measure function ρ(), including non-convex ones and non-smooth ones (e.g., L 0 norm), giving in all cases the GLOBAL minimizer of g(). β 5/4
The Unitary Case: A Summary Minimizing f ( ) = D x + λρ( ) = β + λρ( ) is done by: H D x = β x Multiply by D H β LUT Sρ,λ ( β) ˆ Multiply by D xˆ DONE! The obtained solution is the GLOBAL minimizer of f(), even if f() is non-convex. 6/4
Brief Summary # f The minimization of ( ) = D x + λρ( ) Leads to two very Contradicting Observations:. The problem is quite hard classic optimization find it hard.. The problem is trivial for the case of unitary D. How Can We Enjoy This Simplicity in the General Case? 7/4
Agenda. Motivating the Minimization of f() Describing various applications that need this minimization. Some Motivating Facts General purpose optimization tools, and the unitary case 3. We describe five versions of those in detail 4. Some Results Image deblurring results 5. Conclusions 8/4
? We will present THE PRINCIPLES of several leading methods: Bound-Optimization and EM [Figueiredo & Nowak, `03], Surrogate-Separable-Function (SSF) [Daubechies, Defrise, & De-Mol, `04], Parallel-Coordinate-Descent (PCD) algorithm [Elad `05], [Matalon, et.al. `06], IRLS-based algorithm [Adeyemi & Davies, `06], and Stepwise-Ortho-Matching Pursuit (StOMP) [Donoho et.al. `07]. Common to all is a set of operations in every iteration that includes: (i) Multiplication by D, (ii) Multiplication by D H, and (iii) A Scalar shrinkage on the solution S ρ,λ (). Some of these algorithms pose a direct generalization of the unitary case, their st iteration is the solver we have seen. 9/4
. The Proximal-Point Method Aim: minimize f() Suppose it is found to be too hard. Define a surrogate-function g(, 0 )=f()+dist(- 0 ), using a general (uni-modal, non-negative) distance function. Then, the following algorithm necessarily converges to a local minima of f() [Rokafellar, `76]: 0 Minimize Minimize k k+ Minimize g(, 0 ) g(, ) g(, k ) Comments: (i) Is the minimization of g(, 0 ) easier? It better be! (ii) Looks like it will slow-down convergence. Really? 0/4
The Proposed Surrogate-Functions Our original function is: f ( ) = D x + λρ( ) c The distance to use: dist (, 0 ) = 0 D D Proposed by [Daubechies, Defrise, & De-Mol `04]. Require c > r( D H D ). 0 The beauty in this choice: the term D vanishes c H g(, ) ( ) H 0 = λρ + β where β = D 0 0 * 0 Minimization mof λ g(, β ρ( j j = c 0 ) is done in a closed + form by shrinkage, done on the vector j j= c β k, and this j generates c the solution k+ of the next iteration. ( x D0 ) + c0 It is a separable sum of m D problems. Thus, we have a closed form solution by THE SAME SHRINKAGE!! /4
The Resulting SSF Algorithm While the Unitary case solution is given by x Multiply by D H β LUT S ρ,λ ( β) ˆ Multiply by D xˆ ˆ = x S ( H ) D x ; xˆ = ˆ ρ, λ D the general case, by SSF requires : + - + Multiply by D H /c + β k S λ ρ, c k ( β) = S D c H LUT k + ( x D ) + + λ ρ, k k c Multiply by D xˆ /4
. Bound-Optimization Technique Aim: minimize f() Suppose it is found to be too hard. Define a function Q(, 0 ) that satisfies the following conditions: Q( 0, 0 )=f( 0 ), Q(, 0 ) f() for all, and Q(, ) ( ) 0 f Q(, 0 )= f() at 0. Then, the following algorithm necessarily converges to a local minima of f() [Hunter & Lange, (Review)`04]: 0 Well, regarding this method Minimize Minimize The above is closely related to the EM algorithm [Neal & Hinton, `98]. Q(, 0 ) Q(, ) k k+ Minimize Q(, k ) Figueiredo & Nowak s method (`03): use the BO idea to minimize f(). 0 They use the VERY SAME Surrogate functions we saw before. 3/4
3. Start With Coordinate Descent f ( ) = D x + λρ( ) We aim to minimize. First, consider the Coordinate Descent (CD) algorithm. This is a D minimization problem: ( ) f j = opt j = ArgMin D It has a closed for solution, using a simple SHRINKAGE d as before, applied on j d the j scalar <e j,d j >. d - j e j opt j e j = S ( ) - + λρ j + λρ ρ, λ ( ) d j j x ( ) d e H j j ( ) + λρ j LUT opt in 4/4
Parallel Coordinate Descent (PCD) {v Current solution for minimization of f() m j } j= : Descent directions obtained by the previous CD algorithm We will take the sum of these m descent directions for the update step. Line search is mandatory. k v m v v v j m-dimensional space m v j j= This leads to 5/4
The PCD Algorithm [Elad, `05] [Matalon, Elad, & Zibulevsky, `06] [ ( ( ) ) ] H S QD x + k+ = k + µ ρ, Q λ D ( H ) D D Where Q = diag and µ represents a line search (LS). k k k x + - + Multiply by QD H + β k Sρ, Qλ ( β) LUT k + Line Search Multiply by D xˆ Note: Q can be computed quite easily off-line. Its storage is just like storing the vector k. 6/4
Algorithms Speed-Up k 3 k k+ = k +? v Surprising as it may sound, these very 0 effective acceleration k methods k can be T v = S λ D x D c Use line-search vas is: k + with = v v: ρ, Option 3 implemented Sequential with Subspace no additional OPtimization cost c (SESOP): (i.e., multiplications k + = k +µ ( v k ) by D or D T ) µ M = + k+ k k M+ k M k k k k v k L M µ 0 v For example (SSF): ( ) { + } [Zibulevsky & Narkis, 05] [Elad, Matalon, & Zibulevsky, `07] k k 7/4
Brief Summary #3 For an effective minimization of the function f ( ) = D x + λρ( ) we saw several iterated-shrinkage algorithms, built using. Proximal Point Method. Bound Optimization 3. Parallel Coordinate Descent 4. Iterative Reweighed LS 5. Fixed Point Iteration 6. Greedy Algorithms How Are They Performing? 8/4
Agenda. Motivating the Minimization of f() Describing various applications that need this minimization. Some Motivating Facts General purpose optimization tools, and the unitary case 3. We describe five versions of those in detail 4. Some Results Image deblurring results 5. Conclusions 9/4
A Deblurring Experiment x Hx 5 5 kernel ( n + m + ) 7 n,m 7 v y White Gaussian Noise σ = y ˆ = argmin 0.07 0.06 0.05 0.04 0.03 HD y + λρ ( ) xˆ 0.0 0.0 xˆ = Dˆ 30/4
Penalty Function: More Details The blur operator f A given blurred and noisy image of size 56 56 λ = 0.075 ( ) = HD y + λρ( ) Note: This experiment is similar (but not eqiuvalent) to one of tests done in [Figueiredo & Nowak `05], that leads to D un-decimated Haar wavelet state-of-the-art results. transform, 3 resolution layers, 7: redundancy ρ Approximation of L with slight (S=0.0) smoothing ( ) = S log + S 3/4
So, The Results: The Function Value f()-f min 0 9 SSF 0 8 SSF-LS SSF-SESOP-5 0 7 0 6 0 5 0 4 0 3 0 0 5 0 5 0 5 30 35 40 45 50 Iterations/Computations 3/4
So, The Results: The Function Value f()-f min 0 9 SSF 0 8 SSF-LS SSF-SESOP-5 0 7 PCD-LS PCD-SESOP-5 0 6 Comment: Both SSF and PCD (and their accelerated versions) are provably converging to the minima of f(). 0 5 0 4 0 3 0 0 5 0 5 0 5 30 35 40 45 50 Iterations/Computations 33/4
So, The Results: The Function Value 0 9 f()-f min 0 8 StOMP PDCO 0 7 0 6 0 5 0 4 0 3 0 0 50 00 50 00 50 Iterations/Computations 34/4
So, The Results: ISNR 0 ISNR [db] 5 0-5 6.4dB SSF SSF-LS SSF-SESOP-5-0 -5 ISNR = 0 log 0-0 0 5 0 5 0 5 30 y D k x 0 x 0 Iterations/Computations 35 40 45 50 35/4
So, The Results: ISNR 0 ISNR [db] 5 0-5 -0 7.03dB SSF SSF-LS SSF-SESOP-5 PCD-LS PCD-SESOP-5-5 -0 0 5 0 5 0 5 30 Iterations/Computations 35 40 45 50 36/4
So, The Results: ISNR 0 5 0-5 -0-5 ISNR [db] -0 0 50 00 50 00 50 Iterations/Computations StOMP PDCO Comments: StOMP is inferior in speed and final quality (ISNR=5.9dB) due to to over-estimated support. PDCO is very slow due to the numerous inner Least-Squares iterations done by CG. It is not competitive with the Iterated-Shrinkage methods. 37/4
Visual Results PCD-SESOP-5 Results: original (left), Measured (middle), and Restored (right): Iteration: 9 03 45 67 8 ISNR=0.069583 ISNR=.4694 ISNR=-6.778 ISNR=4.84 ISNR=4.976 ISNR=5.5875 ISNR=6.88 ISNR=6.6479 ISNR=6.6789 ISNR=7.03 ISNR=6.946 db db db 38/4
Agenda. Motivating the Minimization of f() Describing various applications that need this minimization. Some Motivating Facts General purpose optimization tools, and the unitary case 3. We describe five versions of those in detail 4. Some Results Image deblurring results 5. Conclusions 39/4
Conclusions The Bottom Line If your work leads you to the need to minimize the problem: Then: f ( ) = D x + λρ( ) We recommend you use an Iterated-Shrinkage algorithm. SSF and PCD are Preferred: both are provably converging to the (local) minima of f(), and their performance is very good, getting a reasonable result in few iterations. Use SESOP Acceleration it is very effective, and with hardly any cost. There is Room for more work on various aspects of these algorithms see the accompanying paper. 40/4
Thank You for Your Time & Attention This field of research is very hot More information, including these slides and the accompanying paper, can be found on my web-page http://www.cs.technion.ac.il/~elad THE END!! 4/4
4/4
3. The IRLS-Based Algorithm Use the following principles [Edeyemi & Davies `06]: () Iterative Reweighed Least-Squares (IRLS) & () Fixed-Point Iteration f ( ) = D x + λρ( ) Solve = min D H somehow: D ρ ( ) = H W( ) H ( x D+ λ ) + λ W( ( ) = 0 k+ k ( ) ρ 0 ρ This is the IRLS algorithm, used in FOCUSS [Gorodinsky & Rao `97]. ( m) ρ ( ) j j 0 m 43/4
The IRLS-Based Algorithm Use the following principles [Edeyemi & Davies `06]: () Fixed-Point Iteration k D Task: solve the system Φ H Solvesomehow: λidea: assign indices D ( ) H + = W k + I D x c c The actual algorithm: x H ( x D) + λw( ) + c c = 0 ( The DFixed-Point-Iteration ) + λw( ) c method: + c = 0 H D x k k k+ k k+ ( x D) + λw( ) = 0 k ( x ) x = 0 Φ D( x ) + x k + = 0 k k ( ) ( ) = Φ ( x ) k+ k Notes: () For convergence, we should require c>r(d H D)/. () This algorithm cannot guarantee local-minimum. Diagonal Entries λ ρ c j ( j ) + j 44/4
4. Stagewise-OMP [Donoho, Drori, Starck, & Tsaig, `07] x xˆ + - + rk S β Multiply T s k + by D H Multiply by D β k ( ) LUT ~ k Minimize D-x ^ on S StOMP is originally designed to solve S ( D H( x D ) ) and especially so for random dictionary k D (Compressed-Sensing). Nevertheless, it is used elsewhere (restoration) [Fadili & Starck, `06]. If S grows by one item at each iteration, this becomes OMP. LS uses K 0 CG steps, each equivalent to iterated-shrinkage step. min D x + λ 0 45/4
So, The Results: The Function Value f()-f min 0 8 0 7 0 9 SSF SSF-LS SSF-SESOP-5 IRLS IRLS-LS 0 6 0 5 0 4 0 3 0 0 5 0 5 0 5 30 35 40 45 50 Iterations 46/4
So, The Results: ISNR SKIP 0 ISNR [db] 5 0-5 -0 SSF SSF-LS SSF-SESOP-5 IRLS IRLS-LS -5-0 0 5 0 5 0 5 30 Iteration 35 40 45 50 47/4