Sparse & Redundant Representations by Iterated-Shrinkage Algorithms


1 Sparse & Redundant Representations by Iterated-Shrinkage Algorithms. Michael Elad, The Computer Science Department, The Technion, Israel Institute of Technology, Haifa 32000, Israel. Wavelets XII, 26-30 August 2007, San Diego Convention Center, San Diego, CA, USA. Joint work with: Boaz Matalon, Joseph Shtok, and Michael Zibulevsky.
2 Today's Talk is About the Minimization of the Function f(α) = ½‖Dα − x‖₂² + λρ(α)
Today we will discuss: Why is this minimization task important? Which applications could benefit from this minimization? How can it be minimized effectively? What iterated-shrinkage methods are out there?
Why this topic? This minimization is key for turning theory to applications in Sparseland. Shrinkage is intimately coupled with wavelets. The applications we target are fundamental signal-processing ones. This is a hot topic in harmonic analysis.
3 Agenda
1. Motivating the Minimization of f(α): describing various applications that need this minimization.
2. Some Motivating Facts: general-purpose optimization tools, and the unitary case.
3. Iterated-Shrinkage Algorithms: we describe five versions of those in detail.
4. Some Results: image deblurring results.
5. Conclusions
α̂ = ArgMin_α ½‖Dα − x‖₂² + λρ(α). Why? How?
4 Let's Start with Image Denoising
Given measurements y = x + v, where v ~ N(0, σ²I): y are the given measurements, x is the unknown to be recovered, and the goal is to remove the additive noise. Many of the existing image-denoising algorithms are related to the minimization of an energy function of the form f(x) = ½‖x − y‖₂² + Pr(x): the first term is the likelihood (relation to the measurements), and Pr(x) is the prior or regularization. We will use a Sparse & Redundant Representation prior.
5 Our MAP Energy Function
We assume that x is created by the model M: x = Dα, where α is a sparse & redundant representation and D is a known dictionary. This leads to: α̂ = ArgMin_α ½‖Dα − y‖₂² + λρ(α), and x̂ = Dα̂. This MAP denoising algorithm is known as Basis Pursuit Denoising [Chen, Donoho & Saunders 1995]. This is Our Problem!!!
The term ρ(α) measures the sparsity of the solution α: the L₀ norm (‖α‖₀) leads to a non-smooth & non-convex problem; the Lₚ norm (‖α‖ₚᵖ) with 0 < p ≤ 1 is often found to be equivalent; many other ADDITIVE sparsity measures are possible.
6 General (Linear) Inverse Problems
Assume that x ∈ ℝᴺ is known to emerge from M, as before. Suppose we observe y = Hx + v ∈ ℝᴹ, a blurred and noisy version of x. How could we recover x? A MAP estimator leads to: α̂ = ArgMin_α ½‖HDα − y‖₂² + λρ(α), and x̂ = Dα̂. This is Our Problem!!!
7 Inverse Problems of Interest
De-Noising, De-Blurring, In-Painting, De-Mosaicing, Tomography, Image Scale-Up & Super-Resolution, and more. All lead to α̂ = ArgMin_α ½‖Dα − x‖₂² + λρ(α): This is Our Problem!!!
8 Signal Separation
Given a mixture z = x₁ + x₂ + v of two sources, where x₁ emerges from model M₁ (dictionary D₁), x₂ emerges from M₂ (dictionary D₂), and v is white Gaussian noise, we desire to separate it to its ingredients. Written differently: z = [D₁ D₂]·[α; β] + v. Thus, solving this problem using MAP leads to Morphological Component Analysis (MCA) [Starck, Elad & Donoho 2005]: {α̂, β̂} = ArgMin_{α,β} ½‖[D₁ D₂]·[α; β] − z‖₂² + λρ(α) + λρ(β). This is Our Problem!!!
9 Compressed-Sensing [Candes et al. 2006], [Donoho 2006]
In compressed-sensing we compress the signal x by exploiting its origin x = Dα. This is done by p << n random projections: y = Px + v, with P of size p×n. The core idea: y = PDα + v holds all the information about the original signal x, even though p << n. Reconstruction? Use MAP again and solve α̂ = ArgMin_α ½‖PDα − y‖₂² + λρ(α), with x̂ = Dα̂. This is Our Problem!!!
10 Brief Summary #1
The minimization of the function f(α) = ½‖Dα − x‖₂² + λρ(α) is a worthy task, serving many & various applications. So, how should this be done?
11 Agenda
1. Motivating the Minimization of f(α): describing various applications that need this minimization.
2. Some Motivating Facts: general-purpose optimization tools, and the unitary case.
3. Iterated-Shrinkage Algorithms: we describe five versions of those in detail.
4. Some Results: image deblurring results.
5. Conclusions
[Diagram: Apply Wavelet Transform → LUT → Apply Inverse Wavelet Transform]
12 Is There a Problem? f(α) = ½‖Dα − x‖₂² + λρ(α)
The first thought: with all the existing knowledge in optimization, we could find a solution. Methods to consider:
(Normalized) Steepest Descent: compute the gradient and follow its path.
Conjugate Gradient: use the gradient and the previous update direction, combined by a preset formula.
Pre-Conditioned SD: weight the gradient by the inverse of the Hessian's diagonal.
Truncated Newton: use the gradient and Hessian to define a linear system, and solve it approximately by a set of CG steps.
Interior-Point Algorithms: separate α to positive and negative entries, and use both the primal and the dual problems + a barrier for forcing positivity.
13 General-Purpose Software?
So, simply download one of many general-purpose packages: L1-Magic (interior-point solver), SparseLab (interior-point solver), MOSEK (various tools), Matlab Optimization Toolbox (various tools), ...
A problem: general-purpose software packages (algorithms) typically perform poorly on our task. Possible reasons:
The fact that the solution is expected to be sparse (or nearly so) in our problem is not exploited in such algorithms.
The Hessian of f(α), H_f = DᴴD + λρ''(α), tends to be highly ill-conditioned near the (sparse) solution: ρ''(α) has relatively small entries for the non-zero entries in the solution, and very large ones for the (near-)zero entries.
So, are we stuck? Is this problem really that complicated?
14 Consider the Unitary Case (DDᴴ = I)
Define β = Dᴴx. Since the L₂ norm is unitarily invariant:
f(α) = ½‖Dα − x‖₂² + λρ(α)
= ½‖Dα − DDᴴx‖₂² + λρ(α)
= ½‖D(α − β)‖₂² + λρ(α)
= ½‖α − β‖₂² + λρ(α)
= Σⱼ₌₁..ₘ [ ½(αⱼ − βⱼ)² + λρ(αⱼ) ]
We got a separable set of m identical 1-D optimization problems.
15 The 1-D Task
We need to solve the following 1-D problem: α_opt = ArgMin_α g(α) = ½(α − β)² + λρ(α). Such a Look-Up-Table (LUT), α_opt = S_{ρ,λ}(β), can be built for ANY sparsity-measure function ρ(), including non-convex ones and non-smooth ones (e.g., the L₀ norm), giving in all cases the GLOBAL minimizer of g(α).
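For the common choice ρ(α) = |α| (the L₁ case), this LUT has a well-known closed form, the soft threshold. A minimal sketch in Python/NumPy (the function name is ours):

```python
import numpy as np

def soft_threshold(beta, lam):
    # Closed-form minimizer of g(a) = 0.5*(a - beta)**2 + lam*|a|:
    # shrink beta towards zero by lam, clipping to zero if it crosses.
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```

For a non-convex or non-smooth ρ() the minimizer is no longer this simple expression, but, as the slide notes, it can still be tabulated numerically as a LUT.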
16 The Unitary Case: A Summary
Minimizing f(α) = ½‖Dα − x‖₂² + λρ(α) = ½‖α − β‖₂² + λρ(α) is done by: x → multiply by Dᴴ → β → LUT: α̂ = S_{ρ,λ}(β) → multiply by D → x̂. DONE! The obtained solution is the GLOBAL minimizer of f(α), even if f(α) is non-convex.
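For a unitary D and ρ(α) = |α|, the whole pipeline is three lines; a sketch (names are ours):

```python
import numpy as np

def unitary_denoise(x, D, lam):
    # Global minimizer of 0.5*||D a - x||^2 + lam*||a||_1 when D D^H = I.
    beta = D.conj().T @ x                                        # analysis: beta = D^H x
    a_hat = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)  # scalar shrinkage (LUT)
    return D @ a_hat                                             # synthesis: x_hat = D a_hat
```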
17 Brief Summary #2
The minimization of f(α) = ½‖Dα − x‖₂² + λρ(α) leads to two very contradicting observations:
1. The problem is quite hard: classic optimization methods find it hard.
2. The problem is trivial for the case of a unitary D.
How can we enjoy this simplicity in the general case?
18 Agenda
1. Motivating the Minimization of f(α): describing various applications that need this minimization.
2. Some Motivating Facts: general-purpose optimization tools, and the unitary case.
3. Iterated-Shrinkage Algorithms: we describe five versions of those in detail.
4. Some Results: image deblurring results.
5. Conclusions
19 We will present THE PRINCIPLES of several leading methods:
Bound-Optimization and EM [Figueiredo & Nowak '03],
the Surrogate-Separable-Function (SSF) method [Daubechies, Defrise & De Mol '04],
the Parallel-Coordinate-Descent (PCD) algorithm [Elad '05], [Matalon et al. '06],
the IRLS-based algorithm [Adeyemi & Davies '06], and
Stagewise Orthogonal Matching Pursuit (StOMP) [Donoho et al. '07].
Common to all is a set of operations in every iteration that includes: (i) multiplication by D, (ii) multiplication by Dᴴ, and (iii) a scalar shrinkage on the solution, S_{ρ,λ}(·).
Some of these algorithms pose a direct generalization of the unitary case: their 1st iteration is the solver we have seen.
20 1. The Proximal-Point Method
Aim: minimize f(α). Suppose it is found to be too hard. Define a surrogate function g(α, α₀) = f(α) + dist(α, α₀), using a general (unimodal, non-negative) distance function. Then, the following algorithm necessarily converges to a local minimum of f(α) [Rockafellar '76]:
α₀ → minimize g(α, α₀) → α₁ → minimize g(α, α₁) → α₂ → ... → αₖ → minimize g(α, αₖ) → αₖ₊₁
Comments: (i) Is the minimization of g(α, α₀) easier? It better be! (ii) Looks like it will slow down convergence. Really?
21 The Proposed Surrogate Functions
Our original function is: f(α) = ½‖Dα − x‖₂² + λρ(α).
The distance to use: dist(α, α₀) = (c/2)‖α − α₀‖₂² − ½‖Dα − Dα₀‖₂², proposed by [Daubechies, Defrise & De Mol '04]. Require c > ρ(DᴴD), the spectral radius, so that this distance is non-negative.
The beauty in this choice: the term ½‖Dα‖₂² vanishes, and g(α, α₀) = Const + λρ(α) + (c/2)‖α − β₀‖₂², where β₀ = (1/c)Dᴴ(x − Dα₀) + α₀.
This is a separable sum of m 1-D problems: g(α, α₀) = Const + Σⱼ [ λρ(αⱼ) + (c/2)(αⱼ − β₀,ⱼ)² ]. Thus, we have a closed-form solution by THE SAME SHRINKAGE!! Minimization of g(α, αₖ) is done in closed form by shrinkage on the vector βₖ, and this generates the solution αₖ₊₁ of the next iteration.
22 The Resulting SSF Algorithm
While the unitary-case solution is given by α̂ = S_{ρ,λ}(Dᴴx), x̂ = Dα̂ (multiply by Dᴴ → LUT → multiply by D), the general case, by SSF, requires iterating:
αₖ₊₁ = S_{ρ,λ/c}( (1/c)·Dᴴ(x − Dαₖ) + αₖ )
i.e., multiply by Dᴴ/c, add αₖ, apply the LUT S_{ρ,λ/c}(β), and repeat; at the end, x̂ = Dα̂.
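In the L₁ case the SSF iteration above can be sketched compactly as follows (variable names are ours; c is chosen as a safe bound on the spectral radius of DᴴD):

```python
import numpy as np

def ssf(x, D, lam, n_iter=100):
    # Iterate a_{k+1} = S_{lam/c}( D^H (x - D a_k)/c + a_k ) for
    # f(a) = 0.5*||D a - x||^2 + lam*||a||_1.
    c = 1.01 * np.linalg.norm(D, 2) ** 2        # c > spectral radius of D^H D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        beta = a + D.conj().T @ (x - D @ a) / c  # back-projected residual
        a = np.sign(beta) * np.maximum(np.abs(beta) - lam / c, 0.0)  # shrink
    return a
```

For a unitary D one may take c = 1, and the first iteration (from a = 0) reduces to the unitary-case solver of slide 16.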
23 2. Bound-Optimization Technique
Aim: minimize f(α). Suppose it is found to be too hard. Define a function Q(α, α₀) that satisfies the following conditions: Q(α₀, α₀) = f(α₀); Q(α, α₀) ≥ f(α) for all α; and the gradients match at α₀. Then, the following algorithm necessarily converges to a local minimum of f(α) [Hunter & Lange (review) '04]:
α₀ → minimize Q(α, α₀) → α₁ → minimize Q(α, α₁) → ... → αₖ → minimize Q(α, αₖ) → αₖ₊₁
Well, regarding this method: the above is closely related to the EM algorithm [Neal & Hinton '98]. Figueiredo & Nowak's method ('03): use the BO idea to minimize f(α). They use the VERY SAME surrogate functions we saw before.
24 3. Start With Coordinate Descent
We aim to minimize f(α) = ½‖Dα − x‖₂² + λρ(α). First, consider the Coordinate Descent (CD) algorithm, updating one entry αⱼ at a time while keeping the rest fixed. This is a 1-D minimization problem: αⱼ_opt = ArgMin_{αⱼ} ½‖dⱼαⱼ − eⱼ‖₂² + λρ(αⱼ), where dⱼ is the j-th atom of D and eⱼ is the residual without that atom's contribution. It has a closed-form solution, using a simple SHRINKAGE (a LUT) as before, applied on the scalar ⟨eⱼ, dⱼ⟩ normalized by ‖dⱼ‖₂².
25 Parallel Coordinate Descent (PCD)
Let αₖ be the current solution for the minimization of f(α), and let {vⱼ}ⱼ₌₁..ₘ be the descent directions obtained by the previous CD algorithm, one per coordinate, in the m-dimensional space. We will take the sum of these m descent directions for the update step. A line search is then mandatory. This leads to ...
26 The PCD Algorithm [Elad '05], [Matalon, Elad & Zibulevsky '06]
αₖ₊₁ = αₖ + µ·[ S_{ρ,Qλ}( Q·Dᴴ(x − Dαₖ) + αₖ ) − αₖ ], where Q = diag(DᴴD)⁻¹ and µ represents a line search (LS).
Pipeline: multiply by QDᴴ, add αₖ, apply the LUT S_{ρ,Qλ}(β), line search, and finally multiply by D for x̂.
Note: Q can be computed quite easily off-line. Its storage is just like storing the vector αₖ.
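One PCD update can be sketched as follows (L₁ shrinkage for concreteness; in practice µ comes from the 1-D line search, and here it is left as a parameter):

```python
import numpy as np

def pcd_step(a, x, D, lam, mu=1.0):
    # a_{k+1} = a_k + mu * ( S_{rho, Q*lam}( Q D^H (x - D a_k) + a_k ) - a_k ),
    # with Q = diag(D^H D)^{-1} (one entry per atom, computable off-line).
    q = 1.0 / np.einsum('ij,ij->j', D.conj(), D).real    # 1 / ||d_j||^2
    beta = a + q * (D.conj().T @ (x - D @ a))
    shrunk = np.sign(beta) * np.maximum(np.abs(beta) - lam * q, 0.0)
    return a + mu * (shrunk - a)
```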
27 Algorithms' Speed-Up
Denote by vₖ the result of one SSF shrinkage step: vₖ = S_{ρ,λ/c}( (1/c)Dᴴ(x − Dαₖ) + αₖ ). Options for the update:
Use vₖ as is: αₖ₊₁ = vₖ.
Use a line search: αₖ₊₁ = αₖ + µ(vₖ − αₖ).
Sequential Subspace Optimization (SESOP): optimize over a small subspace spanned by the current direction vₖ − αₖ and the M previous steps, αₖ₊₁ = αₖ + µ₀(vₖ − αₖ) + Σᵢ₌₁..ₘ µᵢ(αₖ₋ᵢ₊₁ − αₖ₋ᵢ). [Zibulevsky & Narkis '05], [Elad, Matalon & Zibulevsky '07]
Surprising as it may sound, these very effective acceleration methods can be implemented with no additional cost (i.e., no extra multiplications by D or Dᴴ).
28 Brief Summary #3
For an effective minimization of the function f(α) = ½‖Dα − x‖₂² + λρ(α) we saw several iterated-shrinkage algorithms, built using: 1. the Proximal-Point Method, 2. Bound Optimization, 3. Parallel Coordinate Descent, 4. Iterative Reweighted LS, 5. Fixed-Point Iteration, and 6. Greedy Algorithms. How are they performing?
29 Agenda
1. Motivating the Minimization of f(α): describing various applications that need this minimization.
2. Some Motivating Facts: general-purpose optimization tools, and the unitary case.
3. Iterated-Shrinkage Algorithms: we describe five versions of those in detail.
4. Some Results: image deblurring results.
5. Conclusions
30 A Deblurring Experiment
The original image x is blurred by H, a 15×15 kernel with values 1/(n² + m² + 1) for −7 ≤ n, m ≤ 7, and contaminated by additive white Gaussian noise v, giving the measured image y. Restoration: α̂ = ArgMin_α ½‖HDα − y‖₂² + λρ(α), and x̂ = Dα̂.
31 Penalty Function: More Details
f(α) = ½‖HDα − y‖₂² + λρ(α), where: H is the given blur operator; y is the given blurred and noisy image; D is the 2-D undecimated Haar wavelet transform with 3 resolution layers (7:1 redundancy); and ρ() is an approximation of L₁ with slight (S = 0.01) smoothing: ρ(α) = |α| − S·log(1 + |α|/S).
Note: this experiment is similar (but not equivalent) to one of the tests done in [Figueiredo & Nowak '05], which leads to state-of-the-art results.
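The smoothed sparsity measure, as we read the (garbled) formula above, can be written directly; a sketch (the name and the argument s, playing the role of S, are ours):

```python
import numpy as np

def rho_smooth(a, s=0.01):
    # rho(a) = |a| - s*log(1 + |a|/s): behaves like |a| away from zero,
    # but is differentiable at the origin; tends to |a| as s -> 0.
    t = np.abs(a)
    return t - s * np.log1p(t / s)
```

Its derivative, sign(a)·|a|/(s + |a|), is bounded and smooth at zero, which is what makes the corresponding shrinkage rule S_{ρ,λ}() well-behaved.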
32 So, The Results: The Function Value
[Plot: f(α) − f_min vs. iterations/computations, for SSF, SSF-LS, and SSF-SESOP.]
33 So, The Results: The Function Value
[Plot: f(α) − f_min vs. iterations/computations, for SSF, SSF-LS, SSF-SESOP, PCD-LS, and PCD-SESOP.]
Comment: both SSF and PCD (and their accelerated versions) are provably converging to the minima of f(α).
34 So, The Results: The Function Value
[Plot: f(α) − f_min vs. iterations/computations, for StOMP and PDCO.]
35 So, The Results: ISNR
[Plot: ISNR [dB] vs. iterations/computations, for SSF, SSF-LS, and SSF-SESOP.]
ISNR = 10·log₁₀( ‖y − x₀‖² / ‖Dαₖ − x₀‖² )
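The ISNR measure above is straightforward to compute; a sketch (array names are ours):

```python
import numpy as np

def isnr_db(y, x_hat, x0):
    # ISNR = 10*log10( ||y - x0||^2 / ||x_hat - x0||^2 ): positive when the
    # restored image x_hat is closer to the original x0 than the input y is.
    return 10.0 * np.log10(np.sum(np.abs(y - x0) ** 2) /
                           np.sum(np.abs(x_hat - x0) ** 2))
```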
36 So, The Results: ISNR
[Plot: ISNR [dB] vs. iterations/computations, for SSF, SSF-LS, SSF-SESOP-5, PCD-LS, and PCD-SESOP.]
37 So, The Results: ISNR
[Plot: ISNR [dB] vs. iterations/computations, for StOMP and PDCO.]
Comments: StOMP is inferior in speed and final quality (ISNR = 5.9 dB) due to an over-estimated support. PDCO is very slow due to the numerous inner Least-Squares iterations done by CG. It is not competitive with the iterated-shrinkage methods.
38 Visual Results
PCD-SESOP-5 results: original (left), measured (middle), and restored (right).
[Image sequence with the per-iteration ISNR values, rising to about 7.03 dB.]
39 Agenda
1. Motivating the Minimization of f(α): describing various applications that need this minimization.
2. Some Motivating Facts: general-purpose optimization tools, and the unitary case.
3. Iterated-Shrinkage Algorithms: we describe five versions of those in detail.
4. Some Results: image deblurring results.
5. Conclusions
40 Conclusions: The Bottom Line
If your work leads you to the need to minimize the problem f(α) = ½‖Dα − x‖₂² + λρ(α), then:
We recommend you use an iterated-shrinkage algorithm.
SSF and PCD are preferred: both are provably converging to the (local) minima of f(α), and their performance is very good, getting a reasonable result in few iterations.
Use the SESOP acceleration: it is very effective, and with hardly any cost.
There is room for more work on various aspects of these algorithms: see the accompanying paper.
41 Thank You for Your Time & Attention
This field of research is very hot. More information, including these slides and the accompanying paper, can be found on my webpage. THE END!!
43 3. The IRLS-Based Algorithm
Use the following principles [Adeyemi & Davies '06]: (1) Iterative Reweighted Least-Squares (IRLS) & (2) Fixed-Point Iteration.
(1) For f(α) = ½‖Dα − x‖₂² + λρ(α) with ρ(α) = Σⱼ ρ(αⱼ), the gradient-zero condition is Dᴴ(Dα − x) + λρ'(α) = 0. Writing ρ'(α) = 2W(α)α, with the diagonal matrix W(α) = diag{ρ'(αⱼ)/(2αⱼ)}, this becomes Dᴴ(Dα − x) + 2λW(α)α = 0. Solve somehow: fixing W at the current solution and solving the resulting linear system for the next one is the IRLS algorithm, used in FOCUSS [Gorodnitsky & Rao '97].
44 The IRLS-Based Algorithm
(2) Fixed-Point Iteration. Task: solve the system Dᴴ(Dα − x) + 2λW(α)α = 0 for α without inverting the full matrix. Idea: add cα to both sides and assign iteration indices: cαₖ₊₁ + 2λW(αₖ)αₖ₊₁ = Dᴴ(x − Dαₖ) + cαₖ. Since cI + 2λW(αₖ) is diagonal, the actual algorithm is the entry-wise update αₖ₊₁ = (cI + 2λW(αₖ))⁻¹ ( Dᴴ(x − Dαₖ) + cαₖ ).
Notes: (1) For convergence, we should require c > ρ(DᴴD)/2. (2) This algorithm cannot guarantee a local minimum.
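A sketch of the resulting iteration for the L₁ case, where W(α) = diag(1/(2|αⱼ|)); the eps guard on the zero entries and the nonzero initialization are our additions, and, as the slide warns, convergence to a local minimum is not guaranteed:

```python
import numpy as np

def irls_fixed_point(x, D, lam, n_iter=200, eps=1e-8):
    # a_{k+1} = (c I + 2*lam*W(a_k))^{-1} ( D^H (x - D a_k) + c a_k ),
    # W(a) = diag( 1/(2|a_j|) ) for rho(a) = |a| (eps guards the zeros).
    c = 1.01 * np.linalg.norm(D, 2) ** 2 / 2.0   # c > rho(D^H D)/2
    a = np.ones(D.shape[1])                      # nonzero init: a = 0 is a fixed point
    for _ in range(n_iter):
        w = 1.0 / (2.0 * np.abs(a) + eps)        # reweighting from the current a
        a = (D.conj().T @ (x - D @ a) + c * a) / (c + 2.0 * lam * w)
    return a
```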
45 4. Stagewise-OMP (StOMP) [Donoho, Drori, Starck & Tsaig '07]
Each iteration: compute the residual rₖ = x − Dαₖ; multiply by Dᴴ; apply a threshold (a LUT) to Dᴴrₖ to detect additional atoms, merging them into the support S; then minimize ‖Dα − x‖₂² on S (a Least-Squares step) to obtain αₖ₊₁.
StOMP is originally designed to solve min_α ½‖Dα − x‖₂² + λ‖α‖₀, and especially so for a random dictionary D (compressed sensing). Nevertheless, it is used elsewhere (restoration) [Fadili & Starck '06]. If S grows by one item at each iteration, this becomes OMP. The LS uses K₀ CG steps, each equivalent to an iterated-shrinkage step.
46 So, The Results: The Function Value
[Plot: f(α) − f_min vs. iterations, for SSF, SSF-LS, SSF-SESOP-5, IRLS, and IRLS-LS.]
47 So, The Results: ISNR (SKIP)
[Plot: ISNR [dB] vs. iteration, for SSF, SSF-LS, SSF-SESOP-5, IRLS, and IRLS-LS.]