A Quest for a Universal Model for Signals: From Sparsity to ConvNets
Yaniv Romano, The Department of Electrical Engineering, Technion - Israel Institute of Technology
Joint work with Vardan Papyan, Jeremias Sulam, and Prof. Michael Elad
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP/2007-2013), ERC grant agreement ERC-SPARSE 320649
Local Sparsity: Signal = Dictionary (learned) × Sparse vector
Roadmap: Independent patch-processing → Local Sparsity → Convolutional Sparse Coding → Multi-Layer Convolutional Sparse Coding → Convolutional Neural Networks
The forward pass is a sparse-coding algorithm, serving our model: an extension of the classical sparse theory to a multi-layer setting
Our Story Begins with...
Image Denoising: Noisy Image Y = Original Image X + White Gaussian Noise E
Many (thousands of) image denoising algorithms can be cast as the minimization of an energy function of the form
$\min_X \frac{1}{2}\|X - Y\|_2^2 + G(X)$
where the first term is the relation to the measurements (noise removal) and $G(X)$ is the prior, or regularization
Leading image denoising methods are built upon powerful patch-based local models
Popular local models: GMM, sparse representation, example-based, low-rank, Field-of-Experts, and neural networks
Working locally allows us to learn the model, e.g., via dictionary learning
Why is Denoising so Popular? It does come up in many applications, and it is the simplest inverse problem, revealing the limitations of the model
Other inverse problems can be recast as iterative denoising [Zoran & Weiss '11] [Venkatakrishnan, Bouman & Wohlberg '13] [Brifman, Romano & Elad '16]
In a recent work we show that a denoiser $f(X)$ can form a regularizer: Regularization by Denoising (RED) [Romano, Elad & Milanfar '16]:
$\min_X \frac{1}{2}\|HX - Y\|_2^2 + \frac{\rho}{2} X^T\left(X - f(X)\right)$
where the first term is the relation to the measurements and the second is the prior. Under simple conditions on $f(X)$, the gradient of the regularizer $G(X) = \frac{\rho}{2}X^T(X - f(X))$ is simply $\rho\,(X - f(X))$
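The RED gradient identity is easy to verify numerically. Below is a minimal sketch, assuming a toy linear symmetric smoothing matrix as the denoiser f (an assumption chosen so the identity holds exactly; this is not the denoiser used in the paper):

```python
import numpy as np

# RED regularizer G(X) = 0.5 * X^T (X - f(X)) with a toy linear denoiser
# f(X) = W X, where W is a symmetric smoothing matrix (illustrative only).
rng = np.random.default_rng(0)
N = 32

# Symmetric smoothing operator standing in for a "denoiser"
A = np.eye(N) + 0.5 * (np.eye(N, k=1) + np.eye(N, k=-1))
W = A / A.sum(axis=1, keepdims=True)
W = 0.5 * (W + W.T)  # symmetrize so the gradient identity is exact

f = lambda x: W @ x                  # the denoiser
G = lambda x: 0.5 * x @ (x - f(x))   # RED regularizer

x = rng.standard_normal(N)

# Analytic gradient claimed by RED: grad G(x) = x - f(x)
grad_red = x - f(x)

# Numerical gradient check by central differences
eps = 1e-6
grad_num = np.array([(G(x + eps * e) - G(x - eps * e)) / (2 * eps)
                     for e in np.eye(N)])
print(np.max(np.abs(grad_red - grad_num)))  # tiny
```

For a quadratic G the central difference is exact up to floating point, so the two gradients agree to machine precision.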
The Sparse-Land Model: Assumes that every patch is a linear combination of a few columns, called atoms, taken from a matrix called a dictionary $\Omega \in \mathbb{R}^{n \times m}$ ($m > n$)
The operator $R_i$ extracts the $i$-th $n$-dimensional patch from $X \in \mathbb{R}^N$
Sparse coding: $(P_0):\ \min_{\gamma_i} \|\gamma_i\|_0\ \text{s.t.}\ R_i X = \Omega\gamma_i$
Patch Denoising: Given a noisy patch $R_i Y$, solve
$(P_0^\epsilon):\ \hat{\gamma}_i = \operatorname{argmin}_{\gamma_i} \|\gamma_i\|_0\ \text{s.t.}\ \|R_i Y - \Omega\gamma_i\|_2 \leq \epsilon$
Clean patch: $\Omega\hat{\gamma}_i$
$(P_0)$ and $(P_0^\epsilon)$ are hard to solve:
Greedy methods, such as Orthogonal Matching Pursuit (OMP) or Thresholding
Convex relaxations, such as Basis Pursuit (BP): $(P_1^\epsilon):\ \min_{\gamma_i} \|\gamma_i\|_1 + \xi\|R_i Y - \Omega\gamma_i\|_2^2$
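A minimal OMP sketch may make the greedy pursuit concrete. The orthonormal toy dictionary and sparsity level below are illustrative assumptions (with an orthonormal dictionary, recovery is exact):

```python
import numpy as np

# Minimal Orthogonal Matching Pursuit: greedily pick the atom most
# correlated with the residual, then refit by least squares over the
# chosen support.
def omp(D, y, k):
    support, residual = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    gamma = np.zeros(D.shape[1])
    gamma[support] = coef
    return gamma

rng = np.random.default_rng(1)
n = 16
D, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthonormal toy dictionary

gamma_true = np.zeros(n)
gamma_true[[2, 7, 11]] = [1.5, -2.0, 0.8]
y = D @ gamma_true                                # a clean 3-sparse "patch"

gamma_hat = omp(D, y, k=3)
print(np.allclose(gamma_hat, gamma_true))
```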
Recall K-SVD Denoising [Elad & Aharon '06]: start from the noisy image and an initial dictionary, then alternate between denoising each patch using OMP and updating the dictionary using K-SVD, and finally reconstruct the image
Despite its simplicity, this is a very well-performing algorithm. We refer to this framework as "patch averaging"
A modification of this method leads to state-of-the-art results [Mairal, Bach, Ponce, Sapiro & Zisserman '09]
What is Missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of:
The Local-Global Gap: efficient independent local patch processing vs. the global need to model the entire image
Whitening the residual image [Romano & Elad '13]: when using OMP, the residual $R_i Y - \Omega\hat{\gamma}_i$ is orthogonal to the denoised patch, but this orthogonality is lost due to patch-averaging
SOS-Boosting [Romano & Elad '15]: Strengthen the previous result, Operate the denoiser, Subtract
The SOS-Boosting update: $\hat{X}^{k+1} = f(Y + \rho\hat{X}^k) - \rho\hat{X}^k$
Relation to Game Theory: encourages overlapping patches to reach a consensus
Relation to Graph Theory: minimizes a Laplacian regularization functional:
$\hat{X} = \operatorname{argmin}_X \|X - f(Y)\|_2^2 + \rho X^T(X - f(X))$
Why "boosting"? Since it is guaranteed to be able to improve any denoiser
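The SOS update is easy to simulate. Below is a sketch with a hypothetical moving-average smoother standing in for the denoiser f (an assumption for illustration), and ρ chosen small enough that the resulting linear iteration converges to its fixed point:

```python
import numpy as np

# SOS-Boosting around a toy linear denoiser f(z) = W z (moving-average
# smoother; an illustrative assumption, not the denoiser of the paper).
rng = np.random.default_rng(2)
N, rho, iters = 64, 0.5, 200

A = np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
W = A / A.sum(axis=1, keepdims=True)   # row-stochastic smoothing
f = lambda z: W @ z

x_clean = np.cumsum(rng.standard_normal(N)) * 0.1
y = x_clean + 0.5 * rng.standard_normal(N)

# SOS iterations: Strengthen (Y + rho*X), Operate (f), Subtract (rho*X)
x = f(y)
for _ in range(iters):
    x = f(y + rho * x) - rho * x

# At the fixed point, x satisfies x = f(y + rho*x) - rho*x
residual = np.linalg.norm(x - (f(y + rho * x) - rho * x))
print(residual)  # essentially zero
```

For this linear f the iteration matrix is ρ(W − I), whose spectral radius is below 1 for ρ = 0.5, hence the convergence.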
Exploiting self-similarities [Ram & Elad '13] [Romano, Protter & Elad '14]
A multi-scale treatment [Ophir, Lustig & Elad '11] [Sulam, Ophir & Elad '14] [Papyan & Elad '15]
Leveraging the context of the patch [Romano & Elad '16] [Romano, Isidoro & Milanfar '16]: Con-Patch (self-similarity feature) and RAISR (upscaling), with edge features: direction, strength, consistency
Enforcing the local model on the final patches (EPLL): Sparse EPLL [Sulam & Elad '15]
Image Synthesis [Ren, Romano & Elad '17]
What is Missing? A theoretical backbone! Why can we use a local prior to solve a global problem?
Now, Our Story Takes a Surprising Turn
Convolutional Sparse Coding
"Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding" - Vardan Papyan, Jeremias Sulam, and Michael Elad
"Convolutional Dictionary Learning via Local Processing" - Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad
[LeCun, Bottou, Bengio & Haffner '98] [Lewicki & Sejnowski '99] [Hashimoto & Kurata '00] [Mørup, Schmidt & Hansen '08] [Zeiler, Krishnan, Taylor & Fergus '10] [Jarrett, Kavukcuoglu, Gregor & LeCun '11] [Heide, Heidrich & Wetzstein '15] [Gu, Zuo, Xie, Meng, Feng & Zhang '15]
Intuitively: X is a superposition of shifted copies of a few local filters (e.g., the first filter and the second filter), each appearing in several locations
Convolutional Sparse Coding (CSC): the global signal is a sum of convolutions of small filters with sparse feature maps
Convolutional Sparse Representation: $X = D\Gamma$, where $D$ is a banded convolutional (global) dictionary built from a local dictionary $D_L \in \mathbb{R}^{n \times m}$
Every patch obeys $R_i X = \Omega\gamma_i$, where $\Omega \in \mathbb{R}^{n \times (2n-1)m}$ is the stripe-dictionary and $\gamma_i \in \mathbb{R}^{(2n-1)m}$ is the stripe vector
Adjacent representations overlap, as they skip by $m$ items as we sweep through the patches of $X$
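To see the equivalence between the global model X = DΓ and a sum of convolutions, here is a toy construction; the circular shifts, sizes, and random filters are illustrative assumptions:

```python
import numpy as np

# Build the global convolutional dictionary D from m local filters
# (circular shifts, for simplicity) and check that X = D @ Gamma equals
# the sum of circular convolutions of each filter with its feature map.
rng = np.random.default_rng(3)
N, n, m = 32, 5, 3                     # signal length, filter size, #filters
filters = rng.standard_normal((m, n))

# Columns are grouped by location: column (i*m + j) is filter j placed
# (circularly) at offset i.
cols = []
for i in range(N):
    for j in range(m):
        col = np.zeros(N)
        col[:n] = filters[j]
        cols.append(np.roll(col, i))
D = np.stack(cols, axis=1)             # N x (N*m)

Gamma = np.zeros(N * m)
Gamma[rng.choice(N * m, size=6, replace=False)] = rng.standard_normal(6)

X = D @ Gamma

# The same signal as a sum of circular convolutions
Z = Gamma.reshape(N, m).T              # m feature maps of length N
X_conv = np.zeros(N)
for j in range(m):
    for shift in range(N):
        X_conv += Z[j, shift] * np.roll(np.pad(filters[j], (0, N - n)), shift)

print(np.allclose(X, X_conv))
```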
CSC Relation to Our Story:
A clear global model: every patch has a sparse representation w.r.t. the same local dictionary $\Omega$, just as we have assumed, with no notion of disagreement on the patch overlaps
Related to the current common practice of patch averaging ($R_i^T$ puts the patch $\Omega\gamma_i$ back in the $i$-th location of the global vector):
$X = D\Gamma = \frac{1}{n}\sum_i R_i^T \Omega\gamma_i$
What about the pursuit? "Patch averaging": independent sparse coding for each patch; CSC: should seek all the representations together. Is there a bridge between the two? We'll come back to this later
What about the theory? Until recently, little was known regarding the theoretical aspects of CSC
Classical Sparse Theory (Noiseless): $(P_0):\ \min_\Gamma \|\Gamma\|_0\ \text{s.t.}\ X = D\Gamma$
Definition (Mutual Coherence): $\mu(D) = \max_{i \neq j} |d_i^T d_j|$ [Donoho & Elad '03]
Theorem: For a signal $X = D\Gamma$, if $\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$ then this solution is necessarily the sparsest [Donoho & Elad '03]
Theorem: OMP and BP are guaranteed to recover the true sparse code assuming that $\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$ [Tropp '04], [Donoho & Elad '03]
CSC: The Need for a Theoretical Study: Assuming that $m = 2$ and $n = 64$, the Welch bound ['74] gives $\mu(D) \geq 0.063$
As a result, uniqueness and success of pursuits are guaranteed only as long as
$\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right) \leq \frac{1}{2}\left(1 + \frac{1}{0.063}\right) \approx 8$
Fewer than 8 non-zeros GLOBALLY are allowed! This is a very pessimistic result
Repeating the above for the noisy case leads to even worse performance predictions
Bottom line: classic Sparse-Land theory cannot provide good explanations for the CSC model
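The mutual coherence and the resulting bound are easy to compute. This sketch uses a random dictionary as a stand-in for a convolutional one (an assumption for illustration) and also checks the Welch lower bound:

```python
import numpy as np

# Mutual coherence of a dictionary with unit-norm columns, and the
# classical sparsity bound (1/2)(1 + 1/mu) it implies.
rng = np.random.default_rng(4)
n, M = 64, 128

D = rng.standard_normal((n, M))
D /= np.linalg.norm(D, axis=0)          # normalize atoms

gram = np.abs(D.T @ D)
np.fill_diagonal(gram, 0.0)
mu = gram.max()                          # mutual coherence

k_max = 0.5 * (1 + 1 / mu)               # classical uniqueness/pursuit bound

# Welch lower bound on mu for any n x M unit-norm dictionary (M > n)
welch = np.sqrt((M - n) / (n * (M - 1)))

print(mu, k_max, welch)
```

For random dictionaries of this size, k_max comes out tiny compared to the signal dimension, which is exactly the pessimism the slide points to.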
Moving to Local Sparsity: Stripes
$\|\Gamma\|_{0,\infty}^s = \max_i \|\gamma_i\|_0$
$(P_{0,\infty}):\ \min_\Gamma \|\Gamma\|_{0,\infty}^s\ \text{s.t.}\ X = D\Gamma$
$\|\Gamma\|_{0,\infty}^s$ being low means all stripes $\gamma_i$ are sparse, i.e., every patch has a sparse representation over $\Omega$
Theorem [Papyan, Sulam & Elad '16]: If $\Gamma$ is locally sparse, the solution of $(P_{0,\infty})$ is necessarily unique, and the global OMP and BP are guaranteed to recover it
This result poses a local constraint for a global guarantee, and as such, the guarantees scale linearly with the dimension of the signal
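A short sketch of the stripe norm itself, with circular indexing as a simplifying assumption (the stripe of patch i collects the (2n-1)m coefficients that can contribute to that patch):

```python
import numpy as np

# The l_{0,infty} "stripe" norm of a global sparse vector Gamma, whose
# coefficients are grouped by location (m filters per location).
def stripe_norm(Gamma, N, n, m):
    Z = Gamma.reshape(N, m)
    worst = 0
    for i in range(N):
        # locations i-(n-1) .. i+(n-1) contribute to the i-th patch
        locs = np.arange(i - (n - 1), i + n) % N
        worst = max(worst, int(np.count_nonzero(Z[locs])))
    return worst

rng = np.random.default_rng(5)
N, n, m = 32, 5, 3
Gamma = np.zeros(N * m)
Gamma[rng.choice(N * m, size=8, replace=False)] = 1.0

s = stripe_norm(Gamma, N, n, m)
print(s)  # at most the global l_0, typically much smaller
```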
From Ideal to Noisy Signals: In practice $Y = D\Gamma + E$, where $E$ is due to noise or model deviations, $\|E\|_2 \leq \epsilon$
$(P_{0,\infty}^\epsilon):\ \min_\Gamma \|\Gamma\|_{0,\infty}^s\ \text{s.t.}\ \|Y - D\Gamma\|_2 \leq \epsilon$
How close is $\hat{\Gamma}$ to $\Gamma$? Theorem [Papyan, Sulam & Elad '16]: If $\Gamma$ is locally sparse and the noise is bounded, then the solution of $(P_{0,\infty}^\epsilon)$ is stable, the solution obtained via global OMP/BP is stable, and the true and estimated representations are close
Global Pursuit via Local Processing
$(P_1^\epsilon):\ \hat{\Gamma}_{BP} = \operatorname{argmin}_\Gamma \frac{1}{2}\|Y - D\Gamma\|_2^2 + \xi\|\Gamma\|_1$
$X = D\Gamma = \sum_i R_i^T D_L \alpha_i = \sum_i R_i^T s_i$
where the $s_i = D_L\alpha_i \in \mathbb{R}^n$ are slices, not patches, and $\alpha_i \in \mathbb{R}^m$
Global Pursuit via Local Processing (2): Using variable splitting, $s_i = D_L\alpha_i$:
$\min_{\{s_i\},\{\alpha_i\}} \frac{1}{2}\left\|Y - \sum_i R_i^T s_i\right\|_2^2 + \xi\sum_i\|\alpha_i\|_1\ \text{s.t.}\ s_i = D_L\alpha_i$
These two problems are equivalent, and convex w.r.t. their variables. The new formulation targets the local slices and their sparse representations, and can be solved via ADMM by replacing the constraint with a penalty
Slice-Based Pursuit (one ADMM iteration):
Local Sparse Coding: $\alpha_i = \operatorname{argmin}_{\alpha_i} \frac{\rho}{2}\|s_i - D_L\alpha_i + u_i\|_2^2 + \xi\|\alpha_i\|_1$
Slice Reconstruction: $p_i = \frac{1}{\rho}R_i Y + D_L\alpha_i - u_i$
Slice Aggregation: $\hat{X} = \sum_j R_j^T p_j$
Local Laplacian: $s_i = p_i - \frac{1}{\rho + n} R_i \hat{X}$
Dual Variable Update: $u_i = u_i + s_i - D_L\alpha_i$
Comment: one iteration of this procedure amounts to the very same patch-averaging algorithm we started with
Two Comments About this Scheme
We work with slices and not patches: observe how slices extracted from natural images are far simpler than, and contained by, their corresponding patches
The proposed scheme can be used for dictionary ($D_L$) learning: a slice-based DL algorithm using standard patch-based tools, leading to a faster and simpler method compared to existing ones [Heide, Heidrich & Wetzstein '15] [Wohlberg '16]
Going Deeper: Convolutional Neural Networks Analyzed via Convolutional Sparse Coding
Joint work with Vardan Papyan and Michael Elad
CSC and CNN: There is an analogy between CSC and CNN: convolutional structure, data-driven models, ReLU as a sparsifying operator, and more. We propose a principled way to analyze CNN
The underlying idea: modeling data sources (Sparse-Land, sparse representation theory) enables a theoretical analysis of algorithms' performance
But first, a short review of CNN (Convolutional Neural Networks; our analysis holds true for fully connected networks as well)
CNN: a cascade of convolutional layers $W_1, W_2, \ldots$ followed by the nonlinearity $\text{ReLU}(z) = \max(0, z)$
[LeCun, Bottou, Bengio & Haffner '98] [Krizhevsky, Sutskever & Hinton '12] [Simonyan & Zisserman '14] [He, Zhang, Ren & Sun '15]
Mathematically:
$f\left(Y, \{W_i, b_i\}\right) = \text{ReLU}\left(b_2 + W_2^T\,\text{ReLU}\left(b_1 + W_1^T Y\right)\right)$
with $Y \in \mathbb{R}^N$, $W_1^T \in \mathbb{R}^{Nm_1 \times N}$, $b_1 \in \mathbb{R}^{Nm_1}$, $W_2^T \in \mathbb{R}^{Nm_2 \times Nm_1}$, $b_2 \in \mathbb{R}^{Nm_2}$, and output $Z_2 \in \mathbb{R}^{Nm_2}$
Training Stage of CNN: Consider the task of classification. Given a set of signals $\{Y_j\}_j$ and their corresponding labels $\{h(Y_j)\}_j$, the CNN learns an end-to-end mapping:
$\min_{\{W_i, b_i\}, U} \sum_j \ell\left(h(Y_j),\, U,\, f(Y_j, \{W_i, b_i\})\right)$
where $h(Y_j)$ is the true label, $U$ is a classifier, and $f(\cdot)$ is the output of the last layer
Back to CSC: Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals: $X \in \mathbb{R}^N$ with $X = D_1\Gamma_1$, $D_1 \in \mathbb{R}^{N \times Nm_1}$, $\Gamma_1 \in \mathbb{R}^{Nm_1}$
We propose to impose the same structure on the representations themselves: $\Gamma_1 = D_2\Gamma_2$, with $D_2 \in \mathbb{R}^{Nm_1 \times Nm_2}$ and $\Gamma_2 \in \mathbb{R}^{Nm_2}$
This is the Multi-Layer CSC (ML-CSC) model
Intuition: From Atoms to Molecules: Columns in $D_1$ are convolutional atoms, and columns in $D_2$ combine atoms of $D_1$, creating more complex structures ("molecules")
The dictionary $D_1 D_2$ is a superposition of the atoms of $D_1$, and the size of the effective atoms increases throughout the layers (receptive field)
One could chain the multiplication of all the dictionaries into one effective dictionary, $D_{eff} = D_1 D_2 D_3 \cdots D_K$, and then $X = D_{eff}\Gamma_K$ as in Sparse-Land
However, a key property of this model is the sparsity of each intermediate representation (the feature maps)
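The chaining of dictionaries can be checked directly. Random toy dictionaries, without convolutional structure, are an assumption for brevity; D2 is made sparse so the intermediate representation stays sparse:

```python
import numpy as np

# ML-CSC factorization sketch: X = D1 @ Gamma1, with Gamma1 itself sparse
# over D2, so X also equals (D1 @ D2) @ Gamma2.
rng = np.random.default_rng(6)
N, m1, m2 = 20, 30, 40

D1 = rng.standard_normal((N, m1))
D2 = np.zeros((m1, m2))
mask = rng.random((m1, m2)) < 0.1       # sparse "molecule" dictionary
D2[mask] = rng.standard_normal(mask.sum())

Gamma2 = np.zeros(m2)
Gamma2[rng.choice(m2, size=3, replace=False)] = rng.standard_normal(3)

Gamma1 = D2 @ Gamma2                    # intermediate (sparse) feature map
X = D1 @ Gamma1

D_eff = D1 @ D2                         # effective dictionary
print(np.allclose(X, D_eff @ Gamma2))
print(np.count_nonzero(Gamma1), "nonzeros in Gamma1")
```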
A Small Taste: Model Training (MNIST). MNIST dictionary:
$D_1$: 32 filters of size 7×7, with stride of 2 (dense)
$D_2$: 128 filters of size 5×5×32, with stride of 1 (99.09% sparse)
$D_3$: 1024 filters of size 7×7×128 (99.89% sparse)
Effective atoms: $D_1$ (7×7), $D_1 D_2$ (15×15), $D_1 D_2 D_3$ (28×28)
A Small Taste: Pursuit. For an input digit $Y$: $\Gamma_1$ is 94.51% sparse (213 nnz), $\Gamma_2$ is 99.52% sparse (30 nnz), $\Gamma_3$ is 99.51% sparse (5 nnz)
For another input digit $Y$: $\Gamma_1$ is 92.20% sparse (302 nnz), $\Gamma_2$ is 99.25% sparse (47 nnz), $\Gamma_3$ is 99.41% sparse (6 nnz)
A Small Taste: Model Training (CIFAR). CIFAR dictionary:
$D_1$: 64 filters of size 5×5×3, stride of 2 (dense)
$D_2$: 256 filters of size 5×5×64, stride of 2 (82.99% sparse)
$D_3$: 1024 filters of size 5×5×256 (90.66% sparse)
Effective atoms: $D_1$ (5×5×3), $D_1 D_2$ (13×13), $D_1 D_2 D_3$ (32×32)
Deep Coding Problem (DCP)
Noiseless pursuit: Find $\{\Gamma_j\}_{j=1}^K$ s.t.
$X = D_1\Gamma_1,\ \|\Gamma_1\|_{0,\infty}^s \leq \lambda_1$; $\Gamma_1 = D_2\Gamma_2,\ \|\Gamma_2\|_{0,\infty}^s \leq \lambda_2$; $\ldots$; $\Gamma_{K-1} = D_K\Gamma_K,\ \|\Gamma_K\|_{0,\infty}^s \leq \lambda_K$
Noisy pursuit: Find $\{\Gamma_j\}_{j=1}^K$ s.t.
$\|Y - D_1\Gamma_1\|_2 \leq \mathcal{E}_0,\ \|\Gamma_1\|_{0,\infty}^s \leq \lambda_1$; $\|\Gamma_1 - D_2\Gamma_2\|_2 \leq \mathcal{E}_1,\ \|\Gamma_2\|_{0,\infty}^s \leq \lambda_2$; $\ldots$; $\|\Gamma_{K-1} - D_K\Gamma_K\|_2 \leq \mathcal{E}_{K-1},\ \|\Gamma_K\|_{0,\infty}^s \leq \lambda_K$
Deep Learning Problem (DLP)
Supervised dictionary learning (task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce '12]):
$\min_{\{D_i\}_{i=1}^K, U} \sum_j \ell\left(h(Y_j),\, U,\, \text{DCP}^\star(Y_j, \{D_i\})\right)$
where $h(Y_j)$ is the true label, $U$ is a classifier, and $\text{DCP}^\star$ is the deepest representation obtained from the DCP
Unsupervised dictionary learning: Find $\{D_i\}_{i=1}^K$ s.t. for $j = 1, \ldots, J$:
$\|Y_j - D_1\Gamma_1^j\|_2 \leq \mathcal{E}_0$, $\|\Gamma_1^j - D_2\Gamma_2^j\|_2 \leq \mathcal{E}_1$, $\ldots$, $\|\Gamma_{K-1}^j - D_K\Gamma_K^j\|_2 \leq \mathcal{E}_{K-1}$, with $\|\Gamma_i^j\|_{0,\infty}^s \leq \lambda_i$
ML-CSC: The Simplest Pursuit. The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal $Y = D\Gamma + E$ ($\Gamma$ sparse) by:
$\hat{\Gamma} = \mathcal{P}_\beta(D^T Y)$
Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model: ReLU = soft nonnegative thresholding
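The operator identity behind that last line is one line of code: the soft nonnegative thresholding operator equals a ReLU with a shifted bias.

```python
import numpy as np

# Soft nonnegative thresholding S+_beta(z) = max(z - beta, 0) is exactly
# ReLU(z - beta); beta plays the role of the (negated) bias of a CNN layer.
def soft_nonneg_threshold(z, beta):
    return np.maximum(z - beta, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(7)
z = rng.standard_normal(1000)
beta = 0.3

out_thr = soft_nonneg_threshold(z, beta)
out_relu = relu(z - beta)               # ReLU with bias b = -beta
print(np.array_equal(out_thr, out_relu))
```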
Consider this for Solving the DCP: Layered Thresholding (LT): estimate $\Gamma_1$ via the THR algorithm, then estimate $\Gamma_2$ from it:
$\hat{\Gamma}_2 = \mathcal{P}_{\beta_2}\left(D_2^T\,\mathcal{P}_{\beta_1}\left(D_1^T Y\right)\right)$
Forward pass of CNN: $f(Y) = \text{ReLU}\left(b_2 + W_2^T\,\text{ReLU}\left(b_1 + W_1^T Y\right)\right)$
The correspondence: forward pass $f$ ↔ layered soft nonnegative THR (DCP); ReLU ↔ soft nonnegative THR; bias $b$ ↔ thresholds $\beta$; weights $W$ ↔ dictionary $D$
The layered (soft nonnegative) thresholding and the forward pass algorithm are the very same thing!
Consider this for Solving the DLP:
DLP (supervised): $\min_{\{D_i\}_{i=1}^K, U} \sum_j \ell\left(h(Y_j),\, U,\, \text{DCP}^\star(Y_j, \{D_i\})\right)$, where $\text{DCP}^\star$ is estimated via the layered THR algorithm; the thresholds for the DCP should also be learned
CNN training: $\min_{\{W_i, b_i\}, U} \sum_j \ell\left(h(Y_j),\, U,\, f(Y_j, \{W_i, b_i\})\right)$
The problem solved by the training stage of CNN and the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm
Recall that for the ML-CSC there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN
Theoretical Questions: Given a signal obeying the ML-CSC model, $X = D_1\Gamma_1$, $\Gamma_1 = D_2\Gamma_2$, $\ldots$, $\Gamma_{K-1} = D_K\Gamma_K$, with each $\Gamma_i$ being $\ell_{0,\infty}$-sparse, and a noisy measurement $Y$ of $X$: what can be said about the representations $\{\Gamma_i\}_{i=1}^K$ recovered by solving the $DCP_\lambda^{\mathcal{E}}$ via layered thresholding (the forward pass), or via other algorithms?
Theoretical Path: Possible Questions. Having established the importance of the ML-CSC model and its associated pursuit, the DCP problem, we now turn to its analysis. The main questions we aim to address:
I. Uniqueness of the solution (set of representations) to the $DCP_\lambda$?
II. Stability of the solution to the $DCP_\lambda^{\mathcal{E}}$ problem?
III. Stability of the solution obtained via the hard and soft layered THR algorithms (forward pass)?
IV. Limitations of this (very simple) algorithm, and alternative pursuits?
V. Algorithms for training the dictionaries $\{D_i\}_{i=1}^K$ vs. CNN?
VI. New insights on how to operate on signals via CNN?
Uniqueness of $DCP_\lambda$: Find a set of representations satisfying $X = D_1\Gamma_1,\ \|\Gamma_1\|_{0,\infty}^s \leq \lambda_1$; $\ldots$; $\Gamma_{K-1} = D_K\Gamma_K,\ \|\Gamma_K\|_{0,\infty}^s \leq \lambda_K$. Is this set unique?
Theorem: If a set of solutions $\{\Gamma_i\}_{i=1}^K$ is found for $(DCP_\lambda)$ such that $\|\Gamma_i\|_{0,\infty}^s = \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then these are necessarily the unique solution to this problem
The feature maps CNN aims to recover are unique. (Mutual coherence: $\mu(D) = \max_{i \neq j}|d_i^T d_j|$ [Donoho & Elad '03])
Stability of $DCP_\lambda^{\mathcal{E}}$: Find a set of representations satisfying $\|Y - D_1\Gamma_1\|_2 \leq \mathcal{E}_0,\ \|\Gamma_1\|_{0,\infty}^s \leq \lambda_1$; $\ldots$; $\|\Gamma_{K-1} - D_K\Gamma_K\|_2 \leq \mathcal{E}_{K-1},\ \|\Gamma_K\|_{0,\infty}^s \leq \lambda_K$. Is this set stable?
Suppose that we manage to solve the $DCP_\lambda^{\mathcal{E}}$ and find a feasible set of representations $\{\hat{\Gamma}_i\}_{i=1}^K$ satisfying all the conditions. The question we pose is: how close is $\hat{\Gamma}_i$ to $\Gamma_i$?
Theorem: If the true representations $\{\Gamma_i\}_{i=1}^K$ satisfy $\|\Gamma_i\|_{0,\infty}^s \leq \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then the set of solutions $\{\hat{\Gamma}_i\}$ obtained by solving this problem (somehow) with $\mathcal{E}_0 = \|E\|_2$ and $\mathcal{E}_i = 0$ ($i \geq 1$) must obey
$\|\hat{\Gamma}_i - \Gamma_i\|_2^2 \leq 4\|E\|_2^2 \prod_{j=1}^{i} \frac{1}{1 - (2\lambda_j - 1)\mu(D_j)}$
The problem CNN aims to solve is stable under certain conditions. Observe this annoying effect of error magnification as we dive into the model
Local Noise Assumption: Our analysis relied on the local sparsity of the underlying solution $\Gamma$, which was enforced through the $\ell_{0,\infty}$ norm
In what follows, we present stability guarantees that will also depend on the local energy of the noise vector $E$, enforced via the $\ell_{2,\infty}$ norm, defined as $\|E\|_{2,\infty}^p = \max_i \|R_i E\|_2$
Stability of Layered-THR
Theorem: If $\|\Gamma_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\cdot\frac{|\Gamma_i^{min}|}{|\Gamma_i^{max}|}\right) - \frac{1}{\mu(D_i)}\cdot\frac{\varepsilon_L^{i-1}}{|\Gamma_i^{max}|}$, then the layered hard THR (with the proper thresholds) will find the correct supports, and $\|\hat{\Gamma}_i^{LT} - \Gamma_i\|_{2,\infty}^p \leq \varepsilon_L^i$, where we have defined $\varepsilon_L^0 = \|E\|_{2,\infty}^p$ and
$\varepsilon_L^i = \sqrt{\|\Gamma_i\|_{0,\infty}^p}\left(\varepsilon_L^{i-1} + \mu(D_i)\left(\|\Gamma_i\|_{0,\infty}^s - 1\right)|\Gamma_i^{max}|\right)$
The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded. (A Least-Squares update of the non-zeros?)
Limitations of the Forward Pass: The stability analysis reveals several inherent limitations of the forward pass (a.k.a. layered THR) algorithm:
Even in the noiseless case, the forward pass is incapable of recovering the perfect solution of the DCP problem
Its success depends on the ratio $|\Gamma_i^{min}|/|\Gamma_i^{max}|$, a direct consequence of relying on a simple thresholding operator
The distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth
Next, we propose a new algorithm that attempts to solve some of these problems
Special Case: Sparse Dictionaries. Throughout the theoretical study we assumed that the representations in the different layers are $\ell_{0,\infty}$-sparse. Do we know of a simple example of a set of dictionaries $\{D_i\}_{i=1}^K$ and their corresponding signals $X$ that obey this property? Assuming the dictionaries are sparse:
$\|\Gamma_j\|_{0,\infty}^s \leq \|\Gamma_K\|_{0,\infty}^s \prod_{i=j+1}^{K} \|D_i\|_0$
where $\|D_i\|_0$ is the maximal number of non-zeros in an atom of $D_i$. In the context of CNN, the above happens if a sparsity-promoting regularization, such as the $\ell_1$, is employed on the filters
Better Pursuit? So far we proposed the layered THR:
$\hat{\Gamma}_K = \mathcal{P}_{\beta_K}\left(D_K^T \cdots \mathcal{P}_{\beta_2}\left(D_2^T\,\mathcal{P}_{\beta_1}\left(D_1^T X\right)\right)\right)$
The motivation is clear: getting close to what CNN use. However, this is the simplest and weakest pursuit known in the field of sparsity. Can we offer something better?
Layered Basis Pursuit (Noiseless): We can propose a Layered Basis Pursuit algorithm:
$\hat{\Gamma}_1^{LBP} = \operatorname{argmin}_{\Gamma_1} \|\Gamma_1\|_1\ \text{s.t.}\ X = D_1\Gamma_1$
$\hat{\Gamma}_2^{LBP} = \operatorname{argmin}_{\Gamma_2} \|\Gamma_2\|_1\ \text{s.t.}\ \hat{\Gamma}_1^{LBP} = D_2\Gamma_2$
(related to deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus '10])
Guarantee for the Success of Layered BP: As opposed to prior work on CNN, we can do far more than just propose an algorithm; we can analyze its conditions for success:
Theorem: If a set of representations $\{\Gamma_i\}_{i=1}^K$ of the Multi-Layered CSC model satisfies $\|\Gamma_i\|_{0,\infty}^s \leq \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then the Layered BP is guaranteed to find them
Consequences: the Layered BP can retrieve the underlying representations in the noiseless case, a task in which the forward pass fails, and its success does not depend on the ratio $|\Gamma_i^{min}|/|\Gamma_i^{max}|$
Layered Basis Pursuit (Noisy): Similarly to the noiseless case, the noisy DCP can also be solved via the Layered Basis Pursuit algorithm, but in Lagrangian form:
$\hat{\Gamma}_1^{LBP} = \operatorname{argmin}_{\Gamma_1} \frac{1}{2}\|Y - D_1\Gamma_1\|_2^2 + \xi_1\|\Gamma_1\|_1$
$\hat{\Gamma}_2^{LBP} = \operatorname{argmin}_{\Gamma_2} \frac{1}{2}\|\hat{\Gamma}_1^{LBP} - D_2\Gamma_2\|_2^2 + \xi_2\|\Gamma_2\|_1$
Stability of Layered BP
Theorem: Assuming that $\|\Gamma_i\|_{0,\infty}^s \leq \frac{1}{3}\left(1 + \frac{1}{\mu(D_i)}\right)$, then for correctly chosen $\{\xi_i\}_{i=1}^K$ we are guaranteed that:
1. The support of $\hat{\Gamma}_i^{LBP}$ is contained in that of $\Gamma_i$
2. The error is bounded: $\|\hat{\Gamma}_i^{LBP} - \Gamma_i\|_{2,\infty}^p \leq \varepsilon_L^i$, where $\varepsilon_L^i = 7.5^i \|E\|_{2,\infty}^p \prod_{j=1}^{i} \sqrt{\|\Gamma_j\|_{0,\infty}^p}$
3. Every entry in $\Gamma_i$ greater than $\varepsilon_L^i / \sqrt{\|\Gamma_i\|_{0,\infty}^p}$ will be found
Layered Iterative Thresholding
Layered BP: $\hat{\Gamma}_j^{LBP} = \operatorname{argmin}_{\Gamma_j} \frac{1}{2}\|\hat{\Gamma}_{j-1}^{LBP} - D_j\Gamma_j\|_2^2 + \xi_j\|\Gamma_j\|_1$
Layered Iterative Soft-Thresholding:
$\Gamma_j^t = \mathcal{S}_{\xi_j/c_j}\left(\Gamma_j^{t-1} + \frac{1}{c_j}D_j^T\left(\hat{\Gamma}_{j-1} - D_j\Gamma_j^{t-1}\right)\right)$, with $c_j > 0.5\,\lambda_{max}(D_j^T D_j)$
Note that our suggestion implies that groups of layers share the same dictionaries; this can be seen as a recurrent neural network [Gregor & LeCun '10]
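A sketch of the layered pursuit, with plain ISTA per layer. The random dictionaries, sizes, and ξ values are toy assumptions, and c is set to λ_max(DᵀD) so each layer is a guaranteed descent method (a slightly more conservative choice than the slide's footnote):

```python
import numpy as np

# Layered Iterative Soft-Thresholding: each layer runs ISTA on the lasso
# whose "signal" is the previous layer's estimate.
def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_layer(y, D, xi, iters=200):
    c = np.linalg.eigvalsh(D.T @ D).max()   # step 1/c with c = lambda_max
    gamma = np.zeros(D.shape[1])
    for _ in range(iters):
        gamma = soft(gamma + (D.T @ (y - D @ gamma)) / c, xi / c)
    return gamma

def lasso_obj(y, D, gamma, xi):
    return 0.5 * np.sum((y - D @ gamma) ** 2) + xi * np.sum(np.abs(gamma))

rng = np.random.default_rng(8)
dims = [40, 60, 80]                    # toy layer widths
Ds = [rng.standard_normal((dims[i], dims[i + 1])) / np.sqrt(dims[i])
      for i in range(2)]

y = rng.standard_normal(dims[0])
xi = 0.1

est = y
objs = []
for D in Ds:                           # layered pursuit: one lasso per layer
    gamma = ista_layer(est, D, xi)
    objs.append((lasso_obj(est, D, np.zeros(D.shape[1]), xi),
                 lasso_obj(est, D, gamma, xi)))
    est = gamma

# With c = lambda_max, ISTA is monotone, so each layer's objective should
# not exceed its value at the zero initialization
print(all(after <= before + 1e-9 for before, after in objs))
```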
This Talk
Independent patch-processing → Local Sparsity: we described the limitations of patch-based processing and ways to overcome some of these
→ Convolutional Sparse Coding: we presented a theoretical study of the CSC and a practical algorithm that works locally
We mentioned several interesting connections between CSC and CNN, and this led us to...
→ Multi-Layer Convolutional Sparse Coding: ...propose and analyze a multi-layer extension of CSC, shown to be tightly connected to CNN
The ML-CSC was shown to enable a theoretical study of CNN, along with new insights: an extension of the classical sparse theory to a multi-layer setting, and a novel interpretation and theoretical understanding of CNN
The underlying idea: modeling the data source in order to be able to theoretically analyze algorithms' performance
Questions?