A Quest for a Universal Model for Signals: From Sparsity to ConvNets


A Quest for a Universal Model for Signals: From Sparsity to ConvNets. Yaniv Romano, The Electrical Engineering Department, Technion - Israel Institute of Technology. Joint work with Vardan Papyan, Jeremias Sulam, and Prof. Michael Elad. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013), ERC grant agreement ERC-SPARSE-320649. 1

Local Sparsity: Signal = Dictionary (learned) × Sparse vector

Independent patch-processing Local Sparsity =

Independent patch-processing Local Sparsity =

Independent patch-processing Local Sparsity Convolutional Sparse Coding =

Independent patch-processing Local Sparsity Convolutional Sparse Coding Multi-Layer Convolutional Sparse Coding

Independent patch-processing Local Sparsity Convolutional Neural Networks Convolutional Sparse Coding Multi-Layer Convolutional Sparse Coding

Independent patch-processing Convolutional Sparse Coding Local Sparsity! Convolutional Neural Networks Multi-Layer Convolutional Sparse Coding The forward pass is a sparse-coding algorithm, serving our model Forward pass Extension of the classical sparse theory to a multi-layer setting

Our Story Begins with 9

Image Denoising. Original Image X + White Gaussian Noise E = Noisy Image Y. Many (thousands of) image denoising algorithms can be cast as the minimization of an energy function of the form $\min_X \frac{1}{2}\|X - Y\|_2^2 + G(X)$, where the first term is the relation to the measurements and G(X) is the prior (regularization). 10

Leading image denoising methods are built upon powerful patch-based local models. Popular local models: GMM, sparse representation, example-based, low-rank, Field-of-Experts, and neural networks. Working locally allows us to learn the model, e.g. via dictionary learning. 11

Why is Denoising so Popular? It comes up in many applications, and other inverse problems can be recast as iterative denoising [Zoran & Weiss 11] [Venkatakrishnan, Bouman & Wohlberg 13] [Brifman, Romano & Elad 16]: $\min_X \frac{1}{2}\|HX - Y\|_2^2 + G(X)$ (relation to measurements + prior). In a recent work we show that a denoiser f(X) can itself form a regularizer, Regularization by Denoising (RED) [Romano, Elad & Milanfar 16]: $\min_X \frac{1}{2}\|HX - Y\|_2^2 + \frac{\rho}{2} X^T\big(X - f(X)\big)$; under simple conditions on f(X), the gradient of the regularizer $G(X) = \frac{1}{2}X^T(X - f(X))$ is simply $X - f(X)$. Denoising is also the simplest inverse problem, revealing the limitations of the model. 12
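
Below is a minimal sketch (not the authors' code) of minimizing the RED objective by plain gradient descent, relying on the RED conditions under which the regularizer's gradient equals ρ(x − f(x)); the moving-average denoiser, the operator H, and all parameter values are hypothetical stand-ins.

```python
# A minimal sketch of Regularization by Denoising (RED) via gradient descent,
# assuming H is given as a matrix and f is any black-box denoiser (here a
# simple 1-D moving average used only for illustration).
import numpy as np

def box_denoiser(x, width=3):
    """Hypothetical plug-in denoiser f(x): a 1-D moving average."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def red_gradient_descent(y, H, f=box_denoiser, rho=1.0, mu=0.1, n_iters=200):
    """Minimize 0.5*||Hx - y||^2 + (rho/2) x^T (x - f(x)).

    Under the RED conditions, the gradient of the regularizer is
    rho * (x - f(x)), so each step is a plain gradient update.
    """
    x = H.T @ y  # simple initialization
    for _ in range(n_iters):
        grad = H.T @ (H @ x - y) + rho * (x - f(x))
        x = x - mu * grad
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 128
    H = np.eye(n)                      # denoising: H = I
    x_true = np.sin(np.linspace(0, 4 * np.pi, n))
    y = x_true + 0.3 * rng.standard_normal(n)
    x_hat = red_gradient_descent(y, H)
    print("input MSE :", np.mean((y - x_true) ** 2))
    print("output MSE:", np.mean((x_hat - x_true) ** 2))
```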

The Sparse-Land Model. Assumes that every patch is a linear combination of a few columns, called atoms, taken from an n × m matrix Ω (m > n) called a dictionary. The operator $R_i$ extracts the i-th n-dimensional patch from $X \in \mathbb{R}^N$. Sparse coding: $(P_0): \min_{\gamma_i} \|\gamma_i\|_0 \ \text{s.t.}\ R_i X = \Omega\gamma_i$. 13

Patch Denoising. Given a noisy patch $R_i Y$, solve $(P_0^\epsilon): \hat\gamma_i = \arg\min_{\gamma_i} \|\gamma_i\|_0 \ \text{s.t.}\ \|R_i Y - \Omega\gamma_i\|_2 \le \epsilon$; the clean patch is $\Omega\hat\gamma_i$. $(P_0)$ and $(P_0^\epsilon)$ are hard to solve; approximations include greedy methods such as Orthogonal Matching Pursuit (OMP) or Thresholding, and convex relaxations such as Basis Pursuit (BP): $(P_1^\epsilon): \min_{\gamma_i} \|\gamma_i\|_1 + \xi\|R_i Y - \Omega\gamma_i\|_2^2$. 14
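
A minimal sketch of OMP for a single patch follows, assuming a unit-norm dictionary Omega and the error-based stopping rule of $(P_0^\epsilon)$; the dimensions and noise level are made up for illustration.

```python
# A minimal sketch of Orthogonal Matching Pursuit (OMP) for sparse-coding one
# patch over a dictionary Omega with unit-norm columns; stop once the residual
# drops below eps, as in (P_0^eps).
import numpy as np

def omp(y, Omega, eps):
    """Greedy sparse coding of patch y over dictionary Omega."""
    n, m = Omega.shape
    residual = y.copy()
    support = []
    gamma = np.zeros(m)
    coeffs = np.zeros(0)
    while np.linalg.norm(residual) > eps and len(support) < n:
        # pick the atom most correlated with the current residual
        idx = int(np.argmax(np.abs(Omega.T @ residual)))
        if idx in support:
            break
        support.append(idx)
        # least-squares fit over the chosen support (the "orthogonal" step)
        coeffs, *_ = np.linalg.lstsq(Omega[:, support], y, rcond=None)
        residual = y - Omega[:, support] @ coeffs
    if support:
        gamma[support] = coeffs
    return gamma

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, m, k = 64, 128, 4
    Omega = rng.standard_normal((n, m))
    Omega /= np.linalg.norm(Omega, axis=0)          # unit-norm atoms
    gamma_true = np.zeros(m)
    gamma_true[rng.choice(m, k, replace=False)] = rng.standard_normal(k)
    y = Omega @ gamma_true + 0.01 * rng.standard_normal(n)
    gamma_hat = omp(y, Omega, eps=0.1)
    print("recovered support:", np.nonzero(gamma_hat)[0])
```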

Recall K-SVD Denoising [Elad & Aharon, 06]: starting from a noisy image and an initial dictionary, denoise each patch using OMP, update the dictionary using K-SVD, and reconstruct the image. Despite its simplicity, this is a very well-performing algorithm. We refer to this framework as patch averaging. A modification of this method leads to state-of-the-art results [Mairal, Bach, Ponce, Sapiro, Zisserman '09]. 15

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of The Local-Global Gap: Efficient Independent Local Patch Processing VS. The Global Need to Model The Entire Image 16

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of: Whitening the residual image [Romano & Elad 13]: the residual $R_i Y - \Omega\gamma_i$. 17

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of: Whitening the residual image [Romano & Elad 13]: the residual $R_i Y - \Omega\gamma_i$ is orthogonal to the denoised patch (when using the OMP). 18

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of: Whitening the residual image [Romano & Elad 13]: the residual $R_i Y - \Omega\gamma_i$ is orthogonal to the denoised patch (when using the OMP), but the orthogonality is lost due to patch-averaging. 19

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of: Whitening the residual image [Romano & Elad 13]: the residual $R_i Y - \Omega\gamma_i$ is orthogonal to the denoised patch (when using the OMP). 20

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Denoise 21

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Denoise Strengthen Previous Result 22

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Denoise Strengthen Operate Previous Result 23

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Denoise Strengthen Operate Subtract Previous Result 24

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of: Whitening the residual image [Romano & Elad 13]; SOS-Boosting [Romano & Elad 15]: $X^{k+1} = f(Y + \rho X^k) - \rho X^k$. Relation to game theory: encourage overlapping patches to reach a consensus. Relation to graph theory: minimizing a Laplacian regularization functional, $\hat X = \arg\min_X \|X - f(Y)\|_2^2 + \rho X^T\big(X - f(X)\big)$. Why boosting? Since it is guaranteed to be able to improve any denoiser. 25
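
A minimal sketch of the SOS (Strengthen-Operate-Subtract) recursion, wrapping an arbitrary black-box denoiser; the moving-average denoiser and all parameter values are hypothetical placeholders, not the settings used in the cited work.

```python
# A minimal sketch of SOS boosting: iterate X_{k+1} = f(Y + rho*X_k) - rho*X_k
# around any black-box denoiser f (here a moving average used for illustration).
import numpy as np

def moving_average(x, width=5):
    """Hypothetical plug-in denoiser f(x)."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def sos_boosting(y, f=moving_average, rho=1.0, n_iters=10):
    """Strengthen the signal, operate the denoiser, subtract the previous estimate."""
    x = f(y)                          # initial denoising
    for _ in range(n_iters):
        x = f(y + rho * x) - rho * x  # strengthen, operate, subtract
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    clean = np.sign(np.sin(np.linspace(0, 6 * np.pi, 256)))
    noisy = clean + 0.4 * rng.standard_normal(256)
    boosted = sos_boosting(noisy)
    print("plain denoiser MSE:", np.mean((moving_average(noisy) - clean) ** 2))
    print("SOS-boosted MSE   :", np.mean((boosted - clean) ** 2))
```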

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] 26

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] 27

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Con-Patch Self-Similarity feature 28

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Con-Patch RAISR (Upscaling) Edge Features: Direction, Strength, Consistency 29

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Con-Patch RAISR (Upscaling) Edge Features: Direction, Strength, Consistency 30

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Enforcing the local model on the final patches (EPLL) Sparse EPLL [Sulam & Elad 15] Noisy K-SVD EPLL 31

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Enforcing the local model on the final patches (EPLL) Sparse EPLL [Sulam & Elad 15] Image Synthesis [Ren, Romano & Elad 17] 32

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Enforcing the local model on the final patches (EPLL) Sparse EPLL [Sulam & Elad 15] Image Synthesis [Ren, Romano & Elad 17] 33

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking Here is what WE thought of Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Enforcing the local model on the final patches (EPLL) Sparse EPLL [Sulam & Elad 15] Image Synthesis [Ren, Romano & Elad 17] 34

What is? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of: Whitening the residual image [Romano & Elad 13] SOS-Boosting [Romano & Elad 15] Exploiting self-similarities [Ram & Elad 13] [Romano, Protter & Elad 14] A multi-scale treatment [Ophir, Lustig & Elad 11] [Sulam, Ophir & Elad 14] [Papyan & Elad 15] Leveraging the context of the patch [Romano & Elad 16] [Romano, Isidoro & Milanfar 16] Enforcing the local model on the final patches (EPLL) Sparse EPLL [Sulam & Elad 15] Image Synthesis [Ren, Romano & Elad 17] Missing a theoretical backbone! Why can we use a local prior to solve a global problem? 35

Now, Our Story Takes a Surprising Turn 36

Convolutional Sparse Coding Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding Vardan Papyan, Jeremias Sulam and Michael Elad Convolutional Dictionary Learning via Local Processing Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad [LeCun, Bottou, Bengio and Haffner 98] [Lewicki & Sejnowski 99] [Hashimoto & Kurata, 00] [Mørup, Schmidt & Hansen 08] [Zeiler, Krishnan, Taylor & Fergus 10] [Jarrett, Kavukcuoglu, Gregor, LeCun 11] [Heide, Heidrich & Wetzstein 15] [Gu, Zuo, Xie, Meng, Feng & Zhang 15] 37

Intuitively: X is a sum of a few shifted copies of the first filter, the second filter, and so on. 38

Convolutional Sparse Coding (CSC): the signal X written as a sum of such filter contributions. 39

Convolutional Sparse Representation. Globally, $X = D\Gamma$, where D is a banded convolutional dictionary built from a local dictionary $D_L$ of size n × m. Locally, every patch satisfies $R_i X = \Omega\gamma_i$, where Ω is the stripe-dictionary of size $n \times (2n-1)m$ and $\gamma_i$ is the corresponding stripe vector of length $(2n-1)m$. Adjacent representations overlap, as they skip by m items as we sweep through the patches of X. 40
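
As an illustration only, here is a small sketch that builds the global convolutional dictionary D from a local filter bank D_L by stacking all shifted, zero-padded copies of the filters; circular shifts and all sizes are assumptions made for simplicity.

```python
# A minimal sketch of the global CSC dictionary D: every atom is a zero-padded
# copy of one of the m local filters in D_L (n x m), shifted to each of the N
# positions (circular shifts here, for simplicity).
import numpy as np

def build_global_dictionary(D_L, N):
    """Return D of size N x (N*m): all circular shifts of the m local filters."""
    n, m = D_L.shape
    atoms = []
    for shift in range(N):
        for j in range(m):
            atom = np.zeros(N)
            atom[:n] = D_L[:, j]
            atoms.append(np.roll(atom, shift))
    return np.stack(atoms, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    n, m, N = 8, 2, 32
    D_L = rng.standard_normal((n, m))
    D = build_global_dictionary(D_L, N)
    Gamma = np.zeros(N * m)
    Gamma[rng.choice(N * m, 5, replace=False)] = 1.0  # a sparse global code
    X = D @ Gamma                                     # the global CSC signal
    print("D shape:", D.shape, " X shape:", X.shape)
```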

CSC Relation to Our Story. A clear global model: every patch has a sparse representation w.r.t. the same local dictionary Ω, just as we have assumed, with no notion of disagreement on the patch overlaps. Related to the current common practice of patch averaging ($R_i^T$ puts the patch $\Omega\gamma_i$ back in the i-th location of the global vector): $X = D\Gamma = \frac{1}{n}\sum_i R_i^T \Omega\gamma_i$. What about the pursuit? Patch averaging performs independent sparse coding for each patch, whereas CSC should seek all the representations together. Is there a bridge between the two? We'll come back to this later. What about the theory? Until recently little was known regarding the theoretical aspects of CSC. 41

Classical Sparse Theory (Noiseless). $(P_0): \min_\Gamma \|\Gamma\|_0 \ \text{s.t.}\ X = D\Gamma$. Definition (Mutual Coherence): $\mu(D) = \max_{i \neq j} |d_i^T d_j|$ [Donoho & Elad 03]. Theorem: For a signal $X = D\Gamma$, if $\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$, then this solution is necessarily the sparsest [Donoho & Elad 03]. Theorem: OMP and BP are guaranteed to recover the true sparse code assuming that $\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$ [Tropp 04], [Donoho & Elad 03]. 42
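
A minimal sketch of computing μ(D) and the resulting sparsity bound; the random dictionary here is just an example, not one from the talk.

```python
# A minimal sketch of the mutual coherence mu(D) of a dictionary and the bound
# (1/2)(1 + 1/mu) that guarantees uniqueness and the success of OMP/BP.
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j| for a column-normalized dictionary."""
    D = D / np.linalg.norm(D, axis=0)
    gram = np.abs(D.T @ D)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    D = rng.standard_normal((64, 128))
    mu = mutual_coherence(D)
    bound = 0.5 * (1.0 + 1.0 / mu)
    print(f"mu(D) = {mu:.3f}  ->  uniqueness guaranteed for ||Gamma||_0 < {bound:.2f}")
```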

CSC: The Need for a Theoretical Study. Assuming that m = 2 and n = 64, the Welch bound [Welch, 74] gives $\mu(D) \ge 0.063$. As a result, uniqueness and success of pursuits are guaranteed only as long as $\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right) \le \frac{1}{2}\left(1 + \frac{1}{0.063}\right) \approx 8$. Fewer than 8 non-zeros GLOBALLY are allowed!!! This is a very pessimistic result! Repeating the above for the noisy case leads to even worse performance predictions. Bottom line: classical Sparse-Land theory cannot provide good explanations for the CSC model. 43

Moving to Local Sparsity: Stripes. Define the stripe norm $\|\Gamma\|_{0,\infty}^s = \max_i \|\gamma_i\|_0$ and the problem $(P_{0,\infty}): \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \text{s.t.}\ X = D\Gamma$. If $\|\Gamma\|_{0,\infty}^s$ is low, all stripes $\gamma_i$ are sparse, i.e., every patch has a sparse representation over Ω. If Γ is locally sparse [Papyan, Sulam & Elad 16], the solution of $(P_{0,\infty})$ is necessarily unique, and the global OMP and BP are guaranteed to recover it. This result poses a local constraint for a global guarantee, and as such, the number of allowed non-zeros scales linearly with the dimension of the signal. 44
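
For concreteness, a small sketch of evaluating the $\ell_{0,\infty}$ stripe norm of a global sparse vector; the stripe indexing assumes the circular ordering used in the dictionary sketch above, and all sizes are illustrative.

```python
# A minimal sketch of the l_{0,inf} stripe norm: the maximal number of
# non-zeros in any stripe of (2n-1) consecutive patch codes taken from the
# global sparse vector Gamma (circular indexing assumed).
import numpy as np

def l0_inf_stripe_norm(Gamma, N, m, n):
    """max_i ||gamma_i||_0 over all stripes touching the i-th patch."""
    codes = Gamma.reshape(N, m)           # m coefficients per spatial location
    worst = 0
    for i in range(N):
        idx = np.arange(i - n + 1, i + n) % N   # the (2n-1) locations touching patch i
        stripe = codes[idx].ravel()
        worst = max(worst, int(np.count_nonzero(stripe)))
    return worst

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    N, m, n = 64, 2, 8
    Gamma = np.zeros(N * m)
    Gamma[rng.choice(N * m, 10, replace=False)] = 1.0
    print("||Gamma||_0       =", int(np.count_nonzero(Gamma)))
    print("||Gamma||_{0,inf} =", l0_inf_stripe_norm(Gamma, N, m, n))
```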

From Ideal to Noisy Signals. In practice, $Y = D\Gamma + E$, where E is due to noise or model deviations and $\|Y - D\Gamma\|_2 = \|E\|_2 \le \epsilon$. The noisy pursuit is $(P_{0,\infty}^\epsilon): \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \text{s.t.}\ \|Y - D\Gamma\|_2 \le \epsilon$. How close is the estimate $\hat\Gamma$ to the true Γ? If Γ is locally sparse and the noise is bounded [Papyan, Sulam & Elad 16]: the solution of $(P_{0,\infty}^\epsilon)$ is stable, the solution obtained via global OMP/BP is stable, and the true and estimated representations are close. 45

Global Pursuit via Local Processing. $(P_1^\epsilon): \hat\Gamma_{BP} = \arg\min_\Gamma \frac{1}{2}\|Y - D\Gamma\|_2^2 + \xi\|\Gamma\|_1$. Writing $X = D\Gamma = \sum_i R_i^T D_L\alpha_i = \sum_i R_i^T s_i$, where $D_L$ is the $n \times m$ local dictionary, $\alpha_i$ are local sparse codes, and $s_i = D_L\alpha_i$ are slices, not patches. 46

Global Pursuit via Local Processing (2). $(P_1^\epsilon): \hat\Gamma_{BP} = \arg\min_\Gamma \frac{1}{2}\|Y - D\Gamma\|_2^2 + \xi\|\Gamma\|_1$. Using variable splitting ($s_i = D_L\alpha_i$): $\min_{\{s_i\},\{\alpha_i\}} \frac{1}{2}\big\|Y - \sum_i R_i^T s_i\big\|_2^2 + \xi\sum_i \|\alpha_i\|_1 \ \text{s.t.}\ s_i = D_L\alpha_i$. These two problems are equivalent, and convex w.r.t. their variables. The new formulation targets the local slices and their sparse representations, and can be solved via ADMM, replacing the constraint with a penalty. 47

Slice-Based Pursuit. Local sparse coding: $\alpha_i = \arg\min_{\alpha_i} \frac{\rho}{2}\|s_i - D_L\alpha_i + u_i\|_2^2 + \xi\|\alpha_i\|_1$. Slice reconstruction: $p_i = \frac{1}{\rho}R_i Y + D_L\alpha_i - u_i$. Slice aggregation: $\hat X = \sum_j R_j^T p_j$. Local Laplacian: $s_i = p_i - \frac{1}{\rho + n}R_i\hat X$. Dual variable update: $u_i = u_i + s_i - D_L\alpha_i$. Comment: one iteration of this procedure amounts to the very same patch-averaging algorithm we started with. 48
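
A minimal sketch of these updates on a 1-D signal, assuming circular patch extraction and approximating the local LASSO step with a few ISTA iterations; everything here (sizes, parameters, the inner solver) is an illustrative choice rather than the authors' implementation.

```python
# A minimal sketch of the slice-based ADMM pursuit on a 1-D signal with a local
# dictionary D_L (n x m); the local LASSO step is approximated via ISTA.
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def local_lasso(target, D_L, xi, rho, n_ista=20):
    """argmin_a (rho/2)||target - D_L a||^2 + xi*||a||_1, via ISTA."""
    L = rho * np.linalg.norm(D_L, 2) ** 2
    a = np.zeros(D_L.shape[1])
    for _ in range(n_ista):
        a = soft(a - (rho / L) * D_L.T @ (D_L @ a - target), xi / L)
    return a

def slice_based_pursuit(Y, D_L, xi=0.1, rho=1.0, n_iters=10):
    n, m = D_L.shape
    N = Y.shape[0]
    R = lambda X, i: X[np.arange(i, i + n) % N]             # patch extraction
    S = np.zeros((N, n))                                    # slices s_i
    U = np.zeros((N, n))                                    # dual variables u_i
    A = np.zeros((N, m))                                    # local codes alpha_i
    for _ in range(n_iters):
        for i in range(N):
            A[i] = local_lasso(S[i] + U[i], D_L, xi, rho)   # local sparse coding
        P = np.array([R(Y, i) / rho + D_L @ A[i] - U[i] for i in range(N)])
        X = np.zeros(N)                                     # slice aggregation
        for i in range(N):
            X[np.arange(i, i + n) % N] += P[i]
        for i in range(N):
            S[i] = P[i] - R(X, i) / (rho + n)               # local Laplacian
            U[i] = U[i] + S[i] - D_L @ A[i]                 # dual update
    return X, A

if __name__ == "__main__":
    rng = np.random.default_rng(10)
    D_L = rng.standard_normal((8, 16))
    D_L /= np.linalg.norm(D_L, axis=0)
    Y = rng.standard_normal(64)
    X_hat, A = slice_based_pursuit(Y, D_L)
    print("reconstruction shape:", X_hat.shape,
          " avg nnz per slice:", float((A != 0).sum(1).mean()))
```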

Two Comments About this Scheme. (1) We work with slices and not patches: comparing patches extracted from natural images with their corresponding slices, observe how the slices are far simpler, and contained within their corresponding patches. (2) The proposed scheme can be used for dictionary ($D_L$) learning: a slice-based DL algorithm using standard patch-based tools, leading to a faster and simpler method compared to existing methods [Wohlberg, 2016]. 49

Two Comments About this Scheme. (1) We work with slices and not patches: comparing patches extracted from natural images with their corresponding slices, observe how the slices are far simpler, and contained within their corresponding patches. (2) The proposed scheme can be used for dictionary ($D_L$) learning: a slice-based DL algorithm using standard patch-based tools, leading to a faster and simpler method compared to existing methods (Heide et al., [Wohlberg, 2016]). 50

Going Deeper Convolutional Neural Networks Analyzed via Convolutional Sparse Coding Joint work with Vardan Papyan and Michael Elad 51

CSC and CNN. There is an analogy between CSC and CNN: convolutional structure, data-driven models, ReLU is a sparsifying operator, and more. We propose a principled way to analyze CNN*. The underlying idea: modeling data sources (SparseLand, i.e., sparse representation theory) enables a theoretical analysis of algorithms' performance. But first, a short review of CNN (Convolutional Neural Networks). *Our analysis holds true for fully connected networks as well. 52

CNN. A two-layer convolutional network: the input Y of length N is filtered by kernels $W_1$ (of size $n_0$) producing $m_1$ feature channels, which are in turn filtered by kernels $W_2$ (of size $n_1$) producing $m_2$ channels, with $\text{ReLU}(z) = \max(0, z)$ after each layer. [LeCun, Bottou, Bengio and Haffner 98] [Krizhevsky, Sutskever & Hinton 12] [Simonyan & Zisserman 14] [He, Zhang, Ren & Sun 15] 53

Mathematically... $f(Y, \{W_i\}, \{b_i\}) = \text{ReLU}\big(b_2 + W_2^T\,\text{ReLU}(b_1 + W_1^T Y)\big)$, where $Y \in \mathbb{R}^N$, $W_1^T \in \mathbb{R}^{Nm_1 \times N}$, $b_1 \in \mathbb{R}^{Nm_1}$, $W_2^T \in \mathbb{R}^{Nm_2 \times Nm_1}$, $b_2 \in \mathbb{R}^{Nm_2}$, and the output is $Z_2 \in \mathbb{R}^{Nm_2}$. 54
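
A minimal numpy sketch of this two-layer forward pass; the weights are dense random matrices for clarity (in a real CNN they would be convolutional, i.e., banded, operators), and all sizes are illustrative.

```python
# A minimal sketch of the two-layer forward pass written exactly as the formula
# above: f(Y) = ReLU(b2 + W2^T ReLU(b1 + W1^T Y)).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_pass(Y, W1, b1, W2, b2):
    Z1 = relu(b1 + W1.T @ Y)
    Z2 = relu(b2 + W2.T @ Z1)
    return Z2

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    N, m1, m2 = 16, 4, 8
    Y = rng.standard_normal(N)
    W1 = rng.standard_normal((N, N * m1))
    b1 = rng.standard_normal(N * m1)
    W2 = rng.standard_normal((N * m1, N * m2))
    b2 = rng.standard_normal(N * m2)
    print("output shape:", forward_pass(Y, W1, b1, W2, b2).shape)
```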

Training Stage of CNN. Consider the task of classification. Given a set of signals $\{Y_j\}_j$ and their corresponding labels $h(Y_j)$, the CNN learns an end-to-end mapping: $\min_{\{W_i\},\{b_i\},U} \sum_j \ell\big(h(Y_j), U, f(Y_j, \{W_i\}, \{b_i\})\big)$, where $h(Y_j)$ is the true label, U is the classifier, and f is the output of the last layer. 55

Back to CSC. Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals: $X \in \mathbb{R}^N$, $D_1 \in \mathbb{R}^{N \times Nm_1}$, $\Gamma_1 \in \mathbb{R}^{Nm_1}$. We propose to impose the same structure on the representations themselves: $\Gamma_1 \in \mathbb{R}^{Nm_1}$, $D_2 \in \mathbb{R}^{Nm_1 \times Nm_2}$, $\Gamma_2 \in \mathbb{R}^{Nm_2}$. This is the Multi-Layer CSC (ML-CSC) model. 56

Intuition: From Atoms to Molecules. $X \in \mathbb{R}^N$, $D_1 \in \mathbb{R}^{N \times Nm_1}$, $\Gamma_1 \in \mathbb{R}^{Nm_1}$, $D_2 \in \mathbb{R}^{Nm_1 \times Nm_2}$, $\Gamma_2 \in \mathbb{R}^{Nm_2}$. Columns in $D_1$ are convolutional atoms; columns in $D_2$ combine the atoms in $D_1$, creating more complex structures. Each atom of the effective dictionary $D_1 D_2$ is a superposition of the atoms of $D_1$. The size of the effective atoms increases throughout the layers (receptive field). 57

Intuition: From Atoms to Molecules. $X \in \mathbb{R}^N$, $D_1 \in \mathbb{R}^{N \times Nm_1}$, $D_2 \in \mathbb{R}^{Nm_1 \times Nm_2}$, $\Gamma_2 \in \mathbb{R}^{Nm_2}$. One could chain the multiplication of all the dictionaries into one effective dictionary $D_{\text{eff}} = D_1 D_2 D_3 \cdots D_K$ and then write $X = D_{\text{eff}}\Gamma_K$, as in SparseLand. However, a key property of this model is the sparsity of each intermediate representation (the feature maps). 58

A Small Taste: Model Training (MNIST). MNIST dictionary: $D_1$: 32 filters of size 7×7, with stride of 2 (dense); $D_2$: 128 filters of size 5×5×32, with stride of 1 (99.09% sparse); $D_3$: 1024 filters of size 7×7×128 (99.89% sparse). Effective atoms: $D_1$ (7×7), $D_1 D_2$ (15×15), $D_1 D_2 D_3$ (28×28). 59

A Small Taste: Pursuit. Given Y (= Γ_0), the recovered representations are: $\Gamma_1$: 94.51% sparse (213 nnz); $\Gamma_2$: 99.52% sparse (30 nnz); $\Gamma_3$: 99.51% sparse (5 nnz). 60

A Small Taste: Pursuit. Given Y (= Γ_0), the recovered representations are: $\Gamma_1$: 92.20% sparse (302 nnz); $\Gamma_2$: 99.25% sparse (47 nnz); $\Gamma_3$: 99.41% sparse (6 nnz). 61

A Small Taste: Model Training (CIFAR). CIFAR dictionary: $D_1$: 64 filters of size 5×5×3, stride of 2 (dense); $D_2$: 256 filters of size 5×5×64, stride of 2 (82.99% sparse); $D_3$: 1024 filters of size 5×5×256 (90.66% sparse). Effective atoms: $D_1$ (5×5×3), $D_1 D_2$ (13×13), $D_1 D_2 D_3$ (32×32). 62

Deep Coding Problem (DCP). Noiseless pursuit ($DCP_\lambda$): find $\{\Gamma_j\}_{j=1}^K$ s.t. $X = D_1\Gamma_1$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\Gamma_1 = D_2\Gamma_2$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\Gamma_{K-1} = D_K\Gamma_K$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. Noisy pursuit ($DCP_\lambda^E$): find $\{\Gamma_j\}_{j=1}^K$ s.t. $\|Y - D_1\Gamma_1\|_2 \le E_0$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\|\Gamma_1 - D_2\Gamma_2\|_2 \le E_1$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le E_{K-1}$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. 63

Deep Learning Problem (DLP). Supervised dictionary learning (task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce 12]): $\min_{\{D_i\}_{i=1}^K, U} \sum_{j=1}^J \ell\big(h(Y_j), U, \text{DCP}^\star(Y_j, \{D_i\})\big)$, where $h(Y_j)$ is the true label, U is the classifier, and $\text{DCP}^\star$ is the deepest representation $\Gamma_K^j$ obtained from the DCP. Unsupervised dictionary learning: find $\{D_i\}_{i=1}^K$ s.t. for all j, $\|Y_j - D_1\Gamma_1^j\|_2 \le E_0$, $\|\Gamma_1^j - D_2\Gamma_2^j\|_2 \le E_1$, ..., $\|\Gamma_{K-1}^j - D_K\Gamma_K^j\|_2 \le E_{K-1}$, with $\|\Gamma_i^j\|_{0,\infty}^s \le \lambda_i$. 64

ML-CSC: The Simplest Pursuit. The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal $Y = D\Gamma + E$ (Γ sparse) by $\hat\Gamma = \mathcal{P}_\beta(D^T Y)$. Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model; ReLU = soft nonnegative thresholding. 65
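
A tiny sketch verifying that the soft nonnegative thresholding operator coincides with ReLU plus a bias; the dictionary and threshold below are arbitrary examples.

```python
# A minimal sketch showing that P_beta(v) = max(v - beta, 0), the soft
# nonnegative thresholding operator, is exactly ReLU with a bias of -beta, so a
# single THR pursuit step Gamma = P_beta(D^T y) is one ReLU layer.
import numpy as np

def soft_nonneg_threshold(v, beta):
    return np.maximum(v - beta, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    D = rng.standard_normal((32, 64))
    D /= np.linalg.norm(D, axis=0)
    y = rng.standard_normal(32)
    beta = 0.5
    gamma_thr = soft_nonneg_threshold(D.T @ y, beta)     # THR pursuit
    gamma_relu = relu(D.T @ y - beta)                    # ReLU with bias b = -beta
    print("max difference:", np.max(np.abs(gamma_thr - gamma_relu)))  # 0.0
```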

Consider this for Solving the DCP. $DCP_\lambda^E$: find $\{\Gamma_j\}_{j=1}^K$ s.t. $\|Y - D_1\Gamma_1\|_2 \le E_0$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\|\Gamma_1 - D_2\Gamma_2\|_2 \le E_1$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le E_{K-1}$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. Layered thresholding (LT): estimate $\Gamma_1$ via the THR algorithm, then estimate $\Gamma_2$ via the THR algorithm: $\hat\Gamma_2 = \mathcal{P}_{\beta_2}\big(D_2^T\,\mathcal{P}_{\beta_1}(D_1^T Y)\big)$. Forward pass of CNN: $f(X) = \text{ReLU}\big(b_2 + W_2^T\,\text{ReLU}(b_1 + W_1^T Y)\big)$. Correspondence: ReLU ↔ soft nonnegative THR; bias b ↔ thresholds β; weights W ↔ dictionary D; forward pass f ↔ layered soft nonnegative THR (DCP). 66

Consider this for Solving the DCP. With the same $DCP_\lambda^E$ constraints as above, layered thresholding (LT) estimates $\Gamma_1$ and then $\Gamma_2$ via the THR algorithm, $\hat\Gamma_2 = \mathcal{P}_{\beta_2}\big(D_2^T\,\mathcal{P}_{\beta_1}(D_1^T Y)\big)$, while the forward pass of CNN computes $f(X) = \text{ReLU}\big(b_2 + W_2^T\,\text{ReLU}(b_1 + W_1^T Y)\big)$. The layered (soft nonnegative) thresholding and the forward pass algorithm are the very same thing!!! 67
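
A minimal sketch making this equivalence concrete: the same loop written once as a layered soft nonnegative thresholding pursuit and once as a ReLU forward pass with $W_i = D_i$ and $b_i = -\beta_i$; the dictionaries and thresholds are arbitrary examples.

```python
# A minimal sketch of the layered soft nonnegative thresholding pursuit versus
# the CNN forward pass; with W_i = D_i and b_i = -beta_i the outputs coincide.
import numpy as np

def layered_thresholding(Y, dictionaries, betas):
    gamma = Y
    for D, beta in zip(dictionaries, betas):
        gamma = np.maximum(D.T @ gamma - beta, 0.0)   # soft nonnegative THR
    return gamma

def forward_pass(Y, weights, biases):
    z = Y
    for W, b in zip(weights, biases):
        z = np.maximum(W.T @ z + b, 0.0)              # ReLU(b + W^T z)
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    N, m1, m2 = 20, 3, 5
    D1 = rng.standard_normal((N, N * m1))
    D2 = rng.standard_normal((N * m1, N * m2))
    Y = rng.standard_normal(N)
    betas = [0.2, 0.1]
    out_pursuit = layered_thresholding(Y, [D1, D2], betas)
    out_cnn = forward_pass(Y, [D1, D2], [-betas[0], -betas[1]])
    print("identical outputs:", np.allclose(out_pursuit, out_cnn))
```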

Consider this for Solving the DLP. DLP (supervised): $\min_{\{D_i\}_{i=1}^K, U} \sum_j \ell\big(h(Y_j), U, \text{DCP}^\star(Y_j, \{D_i\})\big)$, where the deepest representation is estimated via the layered THR algorithm, and the thresholds of the DCP should also be learned. CNN training: $\min_{\{W_i\},\{b_i\},U} \sum_j \ell\big(h(Y_j), U, f(Y_j, \{W_i\}, \{b_i\})\big)$. CNN language ↔ SparseLand language: forward pass f ↔ layered soft nonnegative THR (DCP). 68

Consider this for Solving the DLP. DLP (supervised)*: $\min_{\{D_i\}_{i=1}^K, U} \sum_j \ell\big(h(Y_j), U, \text{DCP}^\star(Y_j, \{D_i\})\big)$, with the representation estimated via the layered THR algorithm and the thresholds learned as well. CNN training: $\min_{\{W_i\},\{b_i\},U} \sum_j \ell\big(h(Y_j), U, f(Y_j, \{W_i\}, \{b_i\})\big)$. The problem solved by the training stage of CNN and the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm. *Recall that for the ML-CSC, there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN. 69

Theoretical Questions. Given an ML-CSC signal X satisfying $X = D_1\Gamma_1$, $\Gamma_1 = D_2\Gamma_2$, ..., $\Gamma_{K-1} = D_K\Gamma_K$, with every $\Gamma_i$ being $\ell_{0,\infty}$-sparse, and a noisy measurement Y of X: what can we say about the representations $\{\hat\Gamma_i\}_{i=1}^K$ recovered by solving the $DCP_\lambda^E$ or by running layered thresholding (the forward pass)? Other questions? 70

Theoretical Path: Possible Questions. Having established the importance of the ML-CSC model and its associated pursuit, the DCP problem, we now turn to its analysis. The main questions we aim to address: I. Uniqueness of the solution (set of representations) to the $DCP_\lambda$? II. Stability of the solution to the $DCP_\lambda^E$ problem? III. Stability of the solution obtained via the hard and soft layered THR algorithms (forward pass)? IV. Limitations of this (very simple) algorithm and alternative pursuits? V. Algorithms for training the dictionaries $\{D_i\}_{i=1}^K$ vs. CNN? VI. New insights on how to operate on signals via CNN? 71

Uniqueness of $DCP_\lambda$. $DCP_\lambda$: find a set of representations satisfying $X = D_1\Gamma_1$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\Gamma_1 = D_2\Gamma_2$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\Gamma_{K-1} = D_K\Gamma_K$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. Is this set unique? Theorem: If a set of solutions $\{\Gamma_i\}_{i=1}^K$ is found for $DCP_\lambda$ such that $\|\Gamma_i\|_{0,\infty}^s = \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then these are necessarily the unique solution to this problem. The feature maps CNN aims to recover are unique. (Mutual coherence: $\mu(D) = \max_{i\neq j} |d_i^T d_j|$ [Donoho & Elad 03].) 72

Stability of $DCP_\lambda^E$. $DCP_\lambda^E$: find a set of representations satisfying $\|Y - D_1\Gamma_1\|_2 \le E_0$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\|\Gamma_1 - D_2\Gamma_2\|_2 \le E_1$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le E_{K-1}$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. Is this set stable? Suppose that we manage to solve the $DCP_\lambda^E$ and find a feasible set of representations $\{\hat\Gamma_i\}_{i=1}^K$ satisfying all the conditions. The question we pose is: how close is $\hat\Gamma_i$ to $\Gamma_i$? 73

Stability of $DCP_\lambda^E$. Theorem: If the true representations $\{\Gamma_i\}_{i=1}^K$ satisfy $\|\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then the set of solutions $\{\hat\Gamma_i\}_{i=1}^K$ obtained by solving this problem (somehow) with $E_0 = \|E\|_2$ and $E_i = 0$ ($i \ge 1$) must obey $\|\hat\Gamma_i - \Gamma_i\|_2^2 \le 4\|E\|_2^2 \prod_{j=1}^{i} \frac{1}{1 - (2\lambda_j - 1)\mu(D_j)}$. The problem CNN aims to solve is stable under certain conditions. Observe this annoying effect of error magnification as we dive into the model. 74

Local Noise Assumption. Our analysis relied on the local sparsity of the underlying solution Γ, which was enforced through the $\ell_{0,\infty}$ norm. In what follows, we present stability guarantees that will also depend on the local energy in the noise vector E. This will be enforced via the $\ell_{2,\infty}$ norm, defined as $\|E\|_{2,\infty}^p = \max_i \|R_i E\|_2$. 75

Stability of Layered-THR. Theorem: If $\|\Gamma_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\cdot\frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(D_i)}\cdot\frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|}$, then the layered hard THR (with the proper thresholds) will find the correct supports* and $\|\hat\Gamma_i^{LT} - \Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i$, where we have defined $\varepsilon_L^0 = \|E\|_{2,\infty}^p$ and $\varepsilon_L^i = \sqrt{\|\Gamma_i\|_{0,\infty}^p}\,\big(\varepsilon_L^{i-1} + \mu(D_i)\,(\|\Gamma_i\|_{0,\infty}^s - 1)\,|\Gamma_i^{\max}|\big)$. The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded. *Least-Squares update of the non-zeros? 76

Limitations of the Forward Pass. The stability analysis reveals several inherent limitations of the forward pass (a.k.a. layered THR) algorithm: even in the noiseless case, the forward pass is incapable of recovering the perfect solution of the DCP problem; its success depends on the ratio $|\Gamma_i^{\min}|/|\Gamma_i^{\max}|$, a direct consequence of relying on a simple thresholding operator; and the distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth. Next we propose a new algorithm that attempts to solve some of these problems. 77

Special Case: Sparse Dictionaries. Throughout the theoretical study we assumed that the representations in the different layers are $\ell_{0,\infty}$-sparse. Do we know of a simple example of a set of dictionaries $\{D_i\}_{i=1}^K$ and their corresponding signals X that will obey this property? Assuming the dictionaries are sparse: $\|\Gamma_j\|_{0,\infty}^s \le \|\Gamma_K\|_{0,\infty}^s \prod_{i=j+1}^{K} \|D_i\|_0$, where $\|D_i\|_0$ is the maximal number of non-zeros in an atom of $D_i$. In the context of CNN, the above happens if a sparsity-promoting regularization, such as the $\ell_1$, is employed on the filters. 78

Better Pursuit? $DCP_\lambda$: find a set of representations satisfying $X = D_1\Gamma_1$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\Gamma_1 = D_2\Gamma_2$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\Gamma_{K-1} = D_K\Gamma_K$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. So far we proposed the layered THR: $\hat\Gamma_K = \mathcal{P}_{\beta_K}\big(D_K^T \cdots \mathcal{P}_{\beta_2}(D_2^T\,\mathcal{P}_{\beta_1}(D_1^T X))\big)$. The motivation is clear: getting close to what CNN use. However, this is the simplest and weakest pursuit known in the field of sparsity. Can we offer something better? 79

Layered Basis Pursuit (Noiseless). $DCP_\lambda$: find a set of representations satisfying $X = D_1\Gamma_1$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\Gamma_1 = D_2\Gamma_2$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\Gamma_{K-1} = D_K\Gamma_K$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. We can propose a Layered Basis Pursuit algorithm: $\hat\Gamma_1^{LBP} = \arg\min_{\Gamma_1} \|\Gamma_1\|_1 \ \text{s.t.}\ X = D_1\Gamma_1$, then $\hat\Gamma_2^{LBP} = \arg\min_{\Gamma_2} \|\Gamma_2\|_1 \ \text{s.t.}\ \hat\Gamma_1^{LBP} = D_2\Gamma_2$, and so on; this is reminiscent of deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus 10]. 80

Guarantee for Success of Layered BP. As opposed to prior work in CNN, we can do far more than just propose an algorithm: we can analyze its conditions for success. Theorem: If a set of representations $\{\Gamma_i\}_{i=1}^K$ of the multi-layered CSC model satisfies $\|\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then the Layered BP is guaranteed to find them. Consequences: the Layered BP can retrieve the underlying representations in the noiseless case, a task at which the forward pass fails, and the Layered BP's success does not depend on the ratio $|\Gamma_i^{\min}|/|\Gamma_i^{\max}|$. 81

Layered Basis Pursuit (Noisy). $DCP_\lambda^E$: find a set of representations satisfying $\|Y - D_1\Gamma_1\|_2 \le E_0$, $\|\Gamma_1\|_{0,\infty}^s \le \lambda_1$; $\|\Gamma_1 - D_2\Gamma_2\|_2 \le E_1$, $\|\Gamma_2\|_{0,\infty}^s \le \lambda_2$; ...; $\|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le E_{K-1}$, $\|\Gamma_K\|_{0,\infty}^s \le \lambda_K$. Similarly to the noiseless case, this can also be solved via the Layered Basis Pursuit algorithm, but in a Lagrangian form: $\hat\Gamma_1^{LBP} = \arg\min_{\Gamma_1} \frac{1}{2}\|Y - D_1\Gamma_1\|_2^2 + \xi_1\|\Gamma_1\|_1$, then $\hat\Gamma_2^{LBP} = \arg\min_{\Gamma_2} \frac{1}{2}\|\hat\Gamma_1^{LBP} - D_2\Gamma_2\|_2^2 + \xi_2\|\Gamma_2\|_1$. 82

Stability of Layered BP. Theorem: Assuming that $\|\Gamma_i\|_{0,\infty}^s \le \frac{1}{3}\left(1 + \frac{1}{\mu(D_i)}\right)$, then for correctly chosen Lagrangian weights $\{\xi_i\}_{i=1}^K$ we are guaranteed that: 1. the support of $\hat\Gamma_i^{LBP}$ is contained in that of $\Gamma_i$; 2. the error is bounded, $\|\hat\Gamma_i^{LBP} - \Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i$, where $\varepsilon_L^i = 7.5^i\,\|E\|_{2,\infty}^p \prod_{j=1}^{i}\sqrt{\|\Gamma_j\|_{0,\infty}^p}$; 3. every entry in $\Gamma_i$ greater than $\varepsilon_L^i / \sqrt{\|\Gamma_i\|_{0,\infty}^p}$ will be found. 83

Layered Iterative Thresholding. Layered BP: $\hat\Gamma_j^{LBP} = \arg\min_{\Gamma_j} \frac{1}{2}\|\hat\Gamma_{j-1}^{LBP} - D_j\Gamma_j\|_2^2 + \xi_j\|\Gamma_j\|_1$. Layered iterative soft-thresholding: $\Gamma_j^t = \mathcal{S}_{\xi_j/c_j}\left(\Gamma_j^{t-1} + \frac{1}{c_j} D_j^T\big(\hat\Gamma_{j-1} - D_j\Gamma_j^{t-1}\big)\right)$, with $c_j > 0.5\,\lambda_{\max}(D_j^T D_j)$. Note that our suggestion implies that groups of layers share the same dictionaries, and it can be seen as a recurrent neural network [Gregor & LeCun 10]. 84
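
A minimal sketch of the layered iterative soft-thresholding scheme, running plain ISTA layer by layer; the step-size rule, dictionaries, and regularization weights are illustrative placeholders rather than tuned values from the talk.

```python
# A minimal sketch of layered ISTA: each layer j runs iterations
# Gamma_j^t = S_{xi/c}( Gamma_j^{t-1} + (1/c) D_j^T (Gamma_{j-1} - D_j Gamma_j^{t-1}) ).
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def layered_ista(Y, dictionaries, xis, n_iters=100):
    """Run ISTA layer by layer; returns the list of estimated representations."""
    gammas = []
    prev = Y
    for D, xi in zip(dictionaries, xis):
        c = 1.01 * np.linalg.norm(D, 2) ** 2          # c_j > lambda_max(D^T D)
        gamma = np.zeros(D.shape[1])
        for _ in range(n_iters):
            gamma = soft(gamma + D.T @ (prev - D @ gamma) / c, xi / c)
        gammas.append(gamma)
        prev = gamma                                   # feed the next layer
    return gammas

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    D1 = rng.standard_normal((30, 60))
    D2 = rng.standard_normal((60, 120))
    Y = rng.standard_normal(30)
    Gammas = layered_ista(Y, [D1, D2], xis=[0.1, 0.05])
    for k, g in enumerate(Gammas, 1):
        print(f"Gamma_{k}: {np.count_nonzero(g)} non-zeros out of {g.size}")
```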

This Talk Independent patch-processing Local Sparsity We described the limitations of patch-based processing and ways to overcome some of these 85

This Talk Independent patch-processing Local Sparsity Convolutional Sparse Coding We presented a theoretical study of the CSC and a practical algorithm that works locally 86

This Talk Independent patch-processing Local Sparsity Convolutional Neural Networks Convolutional Sparse Coding We mentioned several interesting connections between CSC and CNN and this led us to 87

This Talk Independent patch-processing Local Sparsity Convolutional Neural Networks Convolutional Sparse Coding Multi-Layer Convolutional Sparse Coding propose and analyze a multi-layer extension of CSC, shown to be tightly connected to CNN 88

This Talk The ML-CSC was shown to enable a theoretical study of CNN, along with new insights Convolutional Neural Networks Independent patch-processing Convolutional Sparse Coding Local Sparsity Multi-Layer Convolutional Sparse Coding Extension of the classical sparse theory to a multi-layer setting A novel interpretation and theoretical understanding of CNN 89

This Talk Independent patch-processing Local Sparsity Convolutional Neural Networks Convolutional Sparse Coding Multi-Layer Convolutional Sparse Coding Extension of the classical sparse theory to a multi-layer setting The underlying idea: Modeling the data source in order to be able to theoretically analyze algorithms' performance A novel interpretation and theoretical understanding of CNN 90

Questions? 91