Sparse & Redundant Signal Representation, and Its Role in Image Processing. Michael Elad, The CS Department, The Technion, Israel Institute of Technology, Haifa 32000, Israel. Wave 2006: Wavelets and Applications, Ecole Polytechnique Federale de Lausanne (EPFL), July 10th-14th, 2006.
Today's Talk is About Sparsity and Redundancy. We will try to show today that sparsity and redundancy can be used to design new or renewed, powerful signal/image processing tools (e.g., transforms, priors, models, compression, ...). The obtained machinery works very well; we will show these ideas deployed in image processing applications.
Agenda: 1. A Visit to Sparseland: motivating sparsity & overcompleteness. 2. Problem 1: Transforms & regularizations: how and why should this work? 3. Problem 2: What about D? The quest for the origin of signals. 4. Problem 3: Applications: image filling-in (inpainting), denoising, separation, compression, ... Welcome to Sparseland.
Generating Signals in Sparseland. A fixed dictionary D of size N-by-K (K > N) and a sparse random vector α generate a signal x = Dα; this is the model M. Every column in the dictionary D is a prototype signal (an atom). The vector α is generated randomly, with few non-zeros in random locations and with random values.
Sparseland Signals Are Special. Multiply α by D to obtain x = Dα. Simple: every generated signal is built as a linear combination of few atoms from our dictionary D. Rich: a general model; the obtained signals are a special type of mixture-of-Gaussians (or Laplacians).
Transforms in Sparseland? Assume that x in R^N is known to emerge from M. We desire simplicity, independence, and expressiveness. How about: given x, find the α in R^K that generated it in M?
So, In Order to Transform... we need to solve an under-determined linear system of equations x = Dα (D and x known). Among all (infinitely many) possible solutions we want the sparsest!! We will measure sparsity using the L0 "norm": ||α||_0, the number of non-zeros in α.
Measure of Sparsity? Consider ||α||_p^p = sum_{j=1..K} |α_j|^p. (Plot of the scalar function |α|^p over [-1, +1]: convex for p >= 1, non-convex for p < 1, approaching an indicator of non-zeros as p -> 0.) As p -> 0 we get a count of the non-zeros in the vector: ||α||_p^p -> ||α||_0.
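This p -> 0 limit is easy to check numerically; a minimal sketch in Python (NumPy assumed; the function names are illustrative, not from the talk):

```python
import numpy as np

def lp_measure(alpha, p):
    """||alpha||_p^p = sum_j |alpha_j|^p, for p > 0."""
    return float(np.sum(np.abs(alpha) ** p))

def l0_count(alpha, tol=1e-12):
    """The p -> 0 limit: a count of the non-zero entries."""
    return int(np.sum(np.abs(alpha) > tol))

alpha = np.array([0.0, 3.0, 0.0, -1.0, 0.0])
# For any non-zero a, |a|^p -> 1 as p -> 0, so ||alpha||_p^p
# approaches the number of non-zeros (here 2).
```

For instance, lp_measure(alpha, 0.01) is already very close to 2, while lp_measure(alpha, 2) is the squared L2 norm, 10.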
Signal's Transform in Sparseland. A sparse random vector α, multiplied by D, gives x; solve min ||α||_0 s.t. x = Dα to obtain α-hat. Four major questions: Is α-hat = α? Under which conditions? Are there practical ways to get α-hat? How effective are those ways? How would we get D?
Inverse Problems in Sparseland? Assume that x is known to emerge from M. Suppose we observe y = Hx + v, a blurred and noisy version of x, with ||v||_2 <= ε. How will we recover x? How about: find the α that generated the x, again?
Inverse Problems in Sparseland? A sparse random vector α, multiplied by D, blurred by H, with noise v added, gives y = Hx + v. Solve min ||α||_0 s.t. ||y - HDα||_2 <= ε to obtain α-hat. Four major questions (again!): Is α-hat = α? How far can it go? Are there practical ways to get α-hat? How effective are those ways? How would we get D?
Back Home: Any Lessons? Several recent trends worth looking at: From JPEG to JPEG2000: from the (L2-norm) KLT to wavelets and non-linear approximation (Sparsity). From Wiener to robust restoration: from the L2-norm (Fourier) to L1, e.g., TV, Beltrami, wavelet shrinkage (Sparsity). From unitary to richer representations: frames, shift-invariance, steerable wavelets, contourlets, curvelets (Overcompleteness). Approximation theory: non-linear approximation (Sparsity & Overcompleteness). ICA and related models (Independence and Sparsity). Sparseland is HERE.
To Summarize So Far. The Sparseland model M for signals is interesting since it is rich, and yet every signal has a simple description. Who cares? We do! This model is relevant to us just as well. Sparsity? Looks complicated: we need to solve (or approximate the solution of) a linear system with more unknowns than equations, while finding the sparsest possible solution. So, what are the implications? (a) Practical solvers? (b) How will we get D? (c) Applications?
Agenda: 1. A Visit to Sparseland: motivating sparsity & overcompleteness. 2. Problem 1: Transforms & regularizations: how and why should this work? 3. Problem 2: What about D? The quest for the origin of signals. 4. Problem 3: Applications: image filling-in, denoising, separation, compression, ...
Let's Start with the Transform. Our dream for now: find the sparsest solution of x = Dα (D and x known). Put formally: P0: min ||α||_0 s.t. x = Dα.
Question 1: Uniqueness? Suppose we can solve min ||α||_0 s.t. x = Dα exactly. Why should we necessarily get α-hat = α? It might happen that eventually ||α-hat||_0 < ||α||_0.
Matrix "Spark". Definition [Donoho & E. ('02)]: given a matrix D, σ = Spark{D} is the smallest number of columns that are linearly dependent. Example: the 4-by-5 matrix with rows [1 0 0 0 1], [0 1 0 0 1], [0 0 1 0 0], [0 0 0 1 0] has Rank = 4 and Spark = 3 (the first, second, and fifth columns are dependent). * In tensor decomposition, Kruskal defined something similar already in 1989.
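Because the Spark is defined combinatorially, it can only be verified by brute force, and only for tiny matrices; a sketch (the function name and tolerance are illustrative), checked against the example above:

```python
import numpy as np
from itertools import combinations

def spark(D, tol=1e-10):
    """Smallest number of linearly dependent columns, by exhaustive
    search over column subsets (combinatorial: tiny matrices only)."""
    n, k = D.shape
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            sub = D[:, list(cols)]
            if np.linalg.matrix_rank(sub, tol=tol) < size:
                return size
    return k + 1  # all columns independent: spark exceeds k

# The 4x5 example from the slide: rank 4, spark 3
# (the fifth column is the sum of the first two).
D = np.array([[1, 0, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0]], float)
```

For a full-spark square matrix such as the identity, the convention here returns k + 1, since no subset of columns is dependent.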
Uniqueness Rule. Suppose the problem P0: min ||α||_0 s.t. x = Dα has been solved somehow. Uniqueness [Donoho & E. ('02)]: if we found a representation that satisfies ||α||_0 < σ/2, then it is necessarily unique (the sparsest). This result implies that if M generates signals using sparse enough α, the solution of the above problem will find it exactly.
Question 2: A Practical P0 Solver? Given x = Dα, solve min ||α||_0 s.t. x = Dα. Are there reasonable ways to find α-hat?
Matching Pursuit (MP) [Mallat & Zhang (1993)]. MP is a greedy algorithm that finds one atom at a time. Step 1: find the one atom that best matches the signal. Next steps: given the previously found atoms, find the next one that best fits the residual. The Orthogonal MP (OMP) is an improved version that re-evaluates the coefficients after each round.
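A minimal OMP sketch (NumPy; illustrative, not optimized) that follows the two steps above, with the least-squares refit that distinguishes OMP from plain MP:

```python
import numpy as np

def omp(D, x, L):
    """Orthogonal Matching Pursuit: greedily pick the atom most
    correlated with the residual, then re-evaluate all chosen
    coefficients by least squares (the OMP refinement over MP)."""
    residual = x.copy()
    support = []
    alpha = np.zeros(D.shape[1])
    for _ in range(L):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best-matching atom
        if j not in support:
            support.append(j)
        coefs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coefs
    alpha[support] = coefs
    return alpha
```

On an orthonormal dictionary, OMP provably recovers an L-sparse representation exactly in L steps, which makes a convenient sanity check.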
Basis Pursuit (BP) [Chen, Donoho, & Saunders (1995)]. Instead of solving min ||α||_0 s.t. x = Dα, solve min ||α||_1 s.t. x = Dα. The newly defined problem is convex (linear programming). Very efficient solvers can be deployed: interior point methods [Chen, Donoho, & Saunders ('95)], sequential shrinkage for unions of ortho-bases [Bruce et al. ('98)], iterated shrinkage [Figueiredo & Nowak ('03), Daubechies, Defrise, & De Mol ('04), E. ('05), E., Matalon, & Zibulevsky ('06)].
Question 3: Approximation Quality? Given x = Dα and the solution α-hat of min ||α||_0 s.t. x = Dα, how effective are MP/BP in finding α-hat?
Evaluating the "Spark". Assume normalized columns and compute the Gram matrix D^T D. The mutual coherence M is the largest off-diagonal entry of D^T D in absolute value. The mutual coherence is a property of the dictionary (just like the Spark). In fact, the following relation can be shown: σ >= 1 + 1/M.
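Unlike the Spark, the mutual coherence costs only one matrix product; a sketch (illustrative names) of M and of the resulting lower bound on σ:

```python
import numpy as np

def mutual_coherence(D):
    """Largest off-diagonal entry of the Gram matrix D^T D,
    in absolute value, after normalizing the columns."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

def spark_lower_bound(D):
    """The relation from the slide: sigma >= 1 + 1/M."""
    return 1.0 + 1.0 / mutual_coherence(D)
```

For the 2x3 dictionary [e1, e2, (e1+e2)/sqrt(2)], M = 1/sqrt(2), so the bound gives sigma >= 1 + sqrt(2) ~ 2.41, consistent with the true Spark of 3.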
BP and MP Equivalence [Donoho & E. ('02), Gribonval & Nielsen ('03), Tropp ('03), Temlyakov ('03)]: given a signal x with a representation x = Dα, and assuming that ||α||_0 < 0.5(1 + 1/M), BP and MP are guaranteed to find the sparsest solution. MP and BP are different in general (hard to say which is better). The above result corresponds to the worst case; average-performance results are available too, showing much better bounds [Donoho ('04), Candes et al. ('04), Tanner et al. ('05), Tropp et al. ('06)].
What About Inverse Problems? A sparse random vector α, multiplied by D, blurred by H, with noise v added, gives y = Hx + v; solve min ||α||_0 s.t. ||y - HDα||_2 <= ε to obtain α-hat. We had similar questions regarding uniqueness, practical solvers, and their efficiency. It turns out that similar answers apply here as well, due to several recent works [Donoho, E., & Temlyakov ('04), Tropp ('04), Fuchs ('04), Gribonval et al. ('05)].
To Summarize So Far. The Sparseland model for signals is relevant to us; we can design transforms and priors based on it. Find the sparsest solution? Use pursuit algorithms. Why do they work so well? A sequence of works during the past 3-4 years gives theoretical justification for these tools' behavior. What next? (a) How shall we find D? (b) Will this work for applications? Which?
Agenda: 1. A Visit to Sparseland: motivating sparsity & overcompleteness. 2. Problem 1: Transforms & regularizations: how and why should this work? 3. Problem 2: What about D? The quest for the origin of signals. 4. Problem 3: Applications: image filling-in, denoising, separation, compression, ...
Problem Setting. Each example is generated as x_j = Dα_j with a sparse α_j (||α_j||_0 <= L). Given these P examples X = {x_j}, j = 1..P, and a fixed-size [N-by-K] dictionary D: 1. Is D unique? 2. How would we find D?
Uniqueness? [Aharon, E., & Bruckstein ('05)]: if {x_j}, j = 1..P, is rich enough* and if L < Spark{D}/2, then D is unique. Comments: * "Rich enough" means the signals from M can be clustered into groups that share the same support; at least L+1 examples per group are needed. This result is proved constructively, but the number of examples needed to pull this off is huge; we will show a far better method next. A parallel result that takes noise into account is yet to be constructed.
Practical Approach: Objective. Minimize over D and A the sum over j = 1..P of ||Dα_j - x_j||^2, s.t. for all j, ||α_j||_0 <= L. Each example x_j is a linear combination of atoms from D; each example has a sparse representation with no more than L atoms (N, K, L are assumed known; D has normalized columns). Related work: Field & Olshausen ('96), Engan et al. ('99), Lewicki & Sejnowski ('00), Cotter et al. ('03), Gribonval et al. ('04).
K-Means for Clustering. Clustering is an extreme case of sparse coding. Initialize D, then iterate on the examples X: Sparse Coding (a nearest-neighbor assignment of each example to one atom) and Dictionary Update (column-by-column, by mean computation over the relevant examples).
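Viewed this way, one K-means iteration is sparse coding with exactly one atom and a unit coefficient, followed by a mean update; a toy sketch (illustrative names; X holds examples as columns, as in the slides):

```python
import numpy as np

def kmeans_step(X, D):
    """One K-means iteration. X: N x P examples (columns),
    D: N x K centroids (columns). 'Sparse coding' is a
    nearest-centroid assignment (one atom, coefficient 1);
    the dictionary update is a per-column mean."""
    # squared distances, K x P, via ||d||^2 - 2 d.x + ||x||^2
    d2 = (D ** 2).sum(0)[:, None] - 2 * D.T @ X + (X ** 2).sum(0)[None, :]
    labels = d2.argmin(axis=0)
    D_new = D.copy()
    for k in range(D.shape[1]):
        members = X[:, labels == k]
        if members.size:
            D_new[:, k] = members.mean(axis=1)
    return D_new, labels
```

On two well-separated 2-D clusters, a single step already snaps each centroid to its cluster mean.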
The K-SVD Algorithm: General [Aharon, E., & Bruckstein ('04)]. Initialize D, then iterate on the examples X: Sparse Coding (use MP or BP per example) and Dictionary Update (column-by-column, by SVD computation).
K-SVD: Sparse Coding Stage. With D fixed, minimize over A the sum over j = 1..P of ||Dα_j - x_j||^2, s.t. for all j, ||α_j||_0 <= L. This decouples: for the j-th example we solve min ||Dα - x_j||^2 s.t. ||α||_0 <= L. A pursuit problem!!!
K-SVD: Dictionary Update Stage. Define G_k: the set of examples in X = {x_j}, j = 1..P, that use the column d_k. The content of d_k influences only the examples in G_k. Let us fix all of A apart from the coefficients of the k-th atom, and seek both d_k and those coefficients to better fit the residual!
K-SVD: Dictionary Update Stage. We should solve min over d_k and the coefficient row α_k^T of ||E_k - d_k α_k^T||_F^2, where E_k is the residual over the examples in G_k, computed without the contribution of d_k. Both d_k and α_k^T are obtained by a rank-1 SVD of this residual matrix.
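A sketch of this rank-1 update for a single column (NumPy; the function name and the conventions, X with examples as columns and A holding the sparse codes, are illustrative):

```python
import numpy as np

def ksvd_update_column(X, D, A, k):
    """One K-SVD dictionary-update step for column k: rank-1 fit
    (via SVD) of the residual, restricted to the examples that
    actually use atom k, updating d_k and its coefficients jointly."""
    omega = np.nonzero(A[k, :])[0]          # indices of examples using d_k
    if omega.size == 0:
        return D, A
    # residual without d_k's contribution, over those examples only
    E = X[:, omega] - D @ A[:, omega] + np.outer(D[:, k], A[k, omega])
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                        # best rank-1: d_k = u_1
    A[k, omega] = S[0] * Vt[0, :]            # coefficients = s_1 v_1^T
    return D, A
```

If X = DA holds exactly, the restricted residual is already rank-1, so the update preserves the factorization (up to a harmless joint sign flip of d_k and its coefficients) and leaves d_k unit-norm.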
K-SVD: A Synthetic Experiment. Create a random 20-by-30 dictionary with normalized columns. Generate 2000 signal examples, each with 3 atoms, and add noise. Train a dictionary using K-SVD and MOD, and compare to the true D. (Plot: relative number of atoms found vs. iteration, 0 to 50; the K-SVD curve dominates the MOD curve.)
To Summarize So Far. The Sparseland model for signals is relevant to us; in order to use it effectively we need to know D. How can D be found? Use the K-SVD algorithm: we have established a uniqueness result and shown how to practically train D. Will it work well? What next? (a) Deploy to applications. (b) Generalize in various ways: multiscale, non-negative factorization, speed-up, ... (c) Performance guarantees?
Agenda: 1. A Visit to Sparseland: motivating sparsity & overcompleteness. 2. Problem 1: Transforms & regularizations: how and why should this work? 3. Problem 2: What about D? The quest for the origin of signals. 4. Problem 3: Applications: image filling-in, denoising, separation, compression, ...
Application 1: Image Inpainting. Assume the signal x has been created by x = Dα_0 with a very sparse α_0. Missing values in x imply missing rows in this linear system. By removing these rows we get a reduced system x~ = D~α_0. Now solve min ||α||_0 s.t. x~ = D~α. If α_0 was sparse enough, it will be the solution of the above problem! Thus, computing Dα-hat recovers x perfectly.
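A toy sketch of this row-deletion argument, with the sparsest solution of the reduced system found by exhaustive support search (only viable for tiny dictionaries; all names illustrative):

```python
import numpy as np
from itertools import combinations

def inpaint(D, x_obs, mask, L):
    """Recover x = D alpha0 from observed entries only: delete the
    missing rows of D and x, find the sparsest exact representation
    of the reduced system by searching supports of size <= L, then
    synthesize the full signal with the complete dictionary."""
    Dr, xr = D[mask], x_obs[mask]
    for size in range(1, L + 1):
        for cols in combinations(range(D.shape[1]), size):
            sub = Dr[:, list(cols)]
            coef = np.linalg.lstsq(sub, xr, rcond=None)[0]
            if np.linalg.norm(sub @ coef - xr) < 1e-9:   # exact fit found
                alpha = np.zeros(D.shape[1])
                alpha[list(cols)] = coef
                return D @ alpha                          # full-signal synthesis
    return None
```

With a generic random dictionary, a 2-sparse signal, and two deleted samples, the reduced system still pins down α_0 uniquely and the missing samples are filled in exactly.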
Application 1: The Practice. Given a noisy image y, we can clean it using the maximum a-posteriori probability (MAP) estimator, by solving α-hat = ArgMin_α ||α||_p^p + λ||y - Dα||^2 and setting x-hat = Dα-hat. What if some of the pixels in that image are missing (filled with zeros)? Define a mask operator as the diagonal matrix W, and solve instead α-hat = ArgMin_α ||α||_p^p + λ||y - WDα||^2, again with x-hat = Dα-hat. When handling general images, there is a need to concatenate two dictionaries to get an effective treatment of both texture and cartoon contents; this leads to separation [E., Starck, & Donoho ('05)].
Inpainting Results. Source and outcome, using a predetermined dictionary: curvelet (cartoon) + overlapped DCT (texture).
Inpainting Results (another example). Source and outcome, using the same predetermined dictionary: curvelet (cartoon) + overlapped DCT (texture).
Inpainting Results with 20%, 50%, and 80% of the pixels missing.
Application 2: Image Denoising. Given a noisy image y, we have already mentioned the ability to clean it by solving α-hat = ArgMin_α ||α||_p^p + λ||y - Dα||^2, with x-hat = Dα-hat. However, the K-SVD cannot train a dictionary for large-support images. How do we go from local treatment of patches to a global prior? The solution: force shift-invariant sparsity on every patch of size N-by-N (e.g., N = 8) in the image, including overlaps.
Application 2: Image Denoising. Our MAP penalty becomes x-hat = ArgMin over x and {α_ij} of (1/2)||x - y||^2 + μ sum_ij ||R_ij x - Dα_ij||^2, s.t. ||α_ij||_0 <= L. Here R_ij extracts the patch at location ij, and the constraint that every such patch has a sparse representation is our prior.
Application 2: Image Denoising. Minimize over x, {α_ij}, and D: (1/2)||x - y||^2 + μ sum_ij ||R_ij x - Dα_ij||^2, s.t. ||α_ij||_0 <= L, by alternating: (1) x and D known: compute α_ij per patch by min_α ||R_ij x - Dα||^2 s.t. ||α||_0 <= L, using matching pursuit. (2) x and {α_ij} known: compute D to minimize sum_ij ||R_ij x - Dα_ij||^2 using SVD, updating one column at a time (K-SVD). (3) D and {α_ij} known: compute x = (I + μ sum_ij R_ij^T R_ij)^{-1} (y + μ sum_ij R_ij^T Dα_ij), which is a simple averaging of shifted patches.
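For a 1-D signal with fully overlapping (stride-1) patches, this closed-form x-update is easy to verify: sum R^T R is diagonal, counting how many patches cover each sample, so the update reduces to a per-sample weighted average of y with the cleaned patches (illustrative sketch):

```python
import numpy as np

def global_update(y, patches_hat, patch_size, mu):
    """x = (I + mu * sum R^T R)^{-1} (y + mu * sum R^T D a_ij).
    Since sum R^T R is diagonal (it counts patch coverage per sample),
    this is a weighted average of y and the overlapping cleaned patches.
    Patch i is assumed to cover samples i .. i + patch_size - 1."""
    numer = y.astype(float).copy()
    denom = np.ones(y.size)
    for i, p in enumerate(patches_hat):
        numer[i:i + patch_size] += mu * p
        denom[i:i + patch_size] += mu
    return numer / denom
```

Two sanity checks: if every cleaned patch agrees with y, the update returns y unchanged; if the patches all predict zero, interior samples (covered by more patches) shrink more than the boundary ones.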
Application 2: The Algorithm. Sparse-code every patch of 8-by-8 pixels, update the dictionary based on the above codes, and compute the output image x. Complexity of this algorithm: O(N x L x Iterations) per pixel. For N = 64, L = 1, and 10 iterations, we need 640 operations per pixel.
Denoising Results. Source: noisy image with σ = 20. Result: 30.89dB, shown alongside the initial dictionary (overcomplete DCT, 64x256) and the dictionary obtained after 10 iterations. The results of this algorithm compete favorably with the state-of-the-art (e.g., GSM + steerable wavelets [Portilla, Strela, Wainwright, & Simoncelli ('03)]), giving ~0.5-1dB better results.
Application 3: Compression. The problem: compressing photo-ID images. General-purpose methods (JPEG, JPEG2000) do not take the specific image family into account; by adapting to the image content (PCA/K-SVD), better results can be obtained. For these techniques to operate well, training dictionaries locally (per patch) on a training set of images is required. In PCA, only the (quantized) coefficients are stored, whereas K-SVD requires storage of the atom indices as well. Geometric alignment of the images is very helpful and should be done.
Application 3: The Algorithm. On the training set (500 images): detect the main features and warp the images to a common reference (20 parameters); divide each image into disjoint 15-by-15 patches, and for each patch compute a mean and a dictionary; per patch, find the operating parameters (number of atoms L, quantization Q). On the test image: warp, remove the mean from each patch, sparse-code using L atoms, and apply Q.
Compression Results, for 820 bytes per file. Per-image errors for the four compared methods: image 1: 11.99, 10.49, 8.81, 5.56; image 2: 10.83, 8.9, 7.89, 4.8; image 3: 10.93, 8.71, 8.61, 5.58.
Compression Results, for 550 bytes per file. Per-image errors for the four compared methods: image 1: 15.81, 13.89, 10.66, 6.60; image 2: 14.67, 12.41, 9.44, 5.49; image 3: 15.30, 12.57, 10.7, 6.36.
Compression Results, for 400 bytes per file ("?" marks a method that could not reach this rate). Per-image errors for the four compared methods: image 1: ?, 18.6, 12.30, 7.61; image 2: ?, 16.1, 11.38, 6.31; image 3: ?, 16.81, 12.54, 7.0.
Today We Have Discussed: 1. A Visit to Sparseland: motivating sparsity & overcompleteness. 2. Problem 1: Transforms & regularizations: how and why should this work? 3. Problem 2: What about D? The quest for the origin of signals. 4. Problem 3: Applications: image filling-in, denoising, separation, compression, ...
Summary. Sparsity and overcompleteness are important ideas that can be used to design better tools in signal/image processing. There are difficulties in using them, and we are working on resolving those difficulties: performance of pursuit algorithms, speed-up of those methods, training the dictionary, demonstrating applications, ... The dream? Future transforms and regularizations will be data-driven, non-linear, overcomplete, and sparsity-promoting.