TRADEOFFS BETWEEN SPEED AND ACCURACY IN INVERSE PROBLEMS RAJA GIRYES TEL AVIV UNIVERSITY
1 TRADEOFFS BETWEEN SPEED AND ACCURACY IN INVERSE PROBLEMS RAJA GIRYES TEL AVIV UNIVERSITY Deep Learning TCE Conference June 14, 2018
2 DEEP LEARNING IMPACT Today we get 3.5% error with a 152-layer network on the ImageNet dataset: 1,400,000 images, 1,000 categories, with held-out sets for testing and validation.
3 CUTTING EDGE PERFORMANCE IN MANY OTHER APPLICATIONS Disease diagnosis [Greenspan, Riklin, Kiriaty]. Language translation [Sutskever et al., 2014]. Video classification [Karpathy et al., 2014]. Handwriting recognition [Poznanski & Wolf, 2016]. Sentiment classification [Socher et al., 2013]. Image denoising [Remez et al., 2017]. Depth reconstruction [Haim et al., 2017]. Super-resolution [Kim et al., 2016], [Bruna et al., 2016]. And many other applications.
4 CLASS AWARE DENOISING [Image comparison: class-aware denoising vs. agnostic denoising.] [Remez, Litani, Giryes and Bronstein, 2017]
5 DEEP ISP [Schwartz, Giryes and Bronstein, 2018]
6 NON-UNIFORMLY QUANTIZED DEEP LEARNING Performance vs. complexity of different quantized neural networks. [Baskin et al., 2018]
7 DEPTH ESTIMATION BY PHASE CODED CUES [Haim, Elmalem, Bronstein, Marom, Giryes, 2018]
8 COMPRESSED COLOR LIGHT FIELD [Nabati, Mendelovic, Giryes, 2018]
9 EXOPLANETS DETECTION [Zucker and Giryes, 2018]
10 PARTIAL SHAPE ALIGNMENT Alignment is performed by a free-form deformation generated by a network. [Hanocka et al., 2018]
11 WARNING: NO DEEP LEARNING HERE! [Levy, Vahid and Giryes, 2018]
12 WHY DO THINGS WORK BETTER TODAY? More data: larger datasets, more access (internet). Better hardware (GPU). Better learning regularization (dropout). Deep learning's impact and success are not unique to image classification. But it is still unclear why deep neural networks are so remarkably successful and how they achieve it.
13 SAMPLE OF RELATED EXISTING THEORY Universal approximation for any measurable Borel function [Hornik et al., 1989, Cybenko 1989]. Depth of a network provides an exponential gain in expressive power relative to the number of parameters [Montúfar et al. 2014, Cohen et al. 2016, Eldan & Shamir, 2016] and invariance to more complex deformations [Bruna & Mallat, 2013]. Number of training samples scales as the number of parameters [Shalev-Shwartz & Ben-David 2014] or the norm of the weights in the DNN [Neyshabur et al. 2015]. Pooling's relation to shift invariance and phase retrieval [Bruna et al. 2013, 2014]. Deeper networks have more local minima that are close to the global one and fewer saddle points [Saxe et al. 2014], [Dauphin et al. 2014], [Choromanska et al. 2015], [Haeffele & Vidal, 2015], [Soudry & Hoffer, 2017]. Relation to dictionary learning [Papyan et al. 2016]. Information bottleneck [Shwartz-Ziv & Tishby, 2017], [Tishby & Zaslavsky 2017]. Invariant representation for certain tasks [Soatto & Chiuso, 2016]. Bayesian deep learning [Kendall and Gal, 2017], [Patel, Nguyen & Baraniuk, 2016].
14 SAMPLE OF OUR THEORY DNNs with random Gaussian weights are good for classifying the average points in the data, and an important goal of training is to classify the boundary points between the different classes in the data [Giryes et al. 2016]. Generalization error of neural networks [Sokolic, Giryes, Sapiro & Rodrigues, 2017]. Relationship between invariance and generalization in deep learning [Sokolic, Giryes, Sapiro & Rodrigues, 2017]. Solving optimization problems with neural networks [Giryes, Eldar, Bronstein & Sapiro, 2018]. Robustness to adversarial examples [Jakubovitz & Giryes, 2018]. Robustness to label noise [Drory, Avidan & Giryes, 2018]. Observation of k-NN behavior in neural networks to explain the coexistence of memorization and generalization in neural networks [Cohen, Sapiro & Giryes, 2018].
15 SAMPLE OF OUR THEORY (slide repeated) Our focus today: solving optimization problems with neural networks [Giryes, Eldar, Bronstein & Sapiro, 2018].
16 INVERSE PROBLEMS We are given X = AZ + E, where X is the given set of measurements, A is a linear operator, Z is the unknown signal, and E is noise. Standard technique for recovery: min_Z ‖X − AZ‖² s.t. Z ∈ Υ, where Z resides in a low-dimensional set Υ. Unconstrained form: min_Z ‖X − AZ‖² + λf(Z), where λ is a regularization parameter and f is a penalty function.
17 ℓ1 MINIMIZATION CASE Unconstrained form: min_Z ‖X − AZ‖² + λ‖Z‖₁. Can be solved by a proximal gradient method, e.g., the iterative shrinkage and thresholding algorithm (ISTA): Z_{t+1} = ψ_{λμ}(Z_t + μAᵀ(X − AZ_t)), where ψ_{λμ} is the soft thresholding operation with threshold λμ and μ is the step size.
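The ISTA update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the experiments in the talk; problem sizes and parameters below are arbitrary choices.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft thresholding: the shrinkage step psi_tau of ISTA."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, A, lam, mu, n_iter=200):
    """Proximal gradient for min_Z ||X - A Z||^2 + lam * ||Z||_1,
    using the slide's update Z <- psi_{lam*mu}(Z + mu A^T (X - A Z))."""
    Z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Gradient step on the quadratic term, then shrinkage.
        Z = soft_threshold(Z + mu * A.T @ (X - A @ Z), lam * mu)
    return Z
```

With μ no larger than 1/‖A‖² the iterates converge; a larger λ yields a sparser estimate.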
18 ISTA CONVERGENCE [Plot: reconstruction mean squared error (MSE) E‖Z − Ẑ_t‖² as a function of the number of iterations t.]
19 LISTA Rewriting ISTA: Z_{t+1} = ψ_{λμ}((I − μAᵀA)Z_t + μAᵀX). Learned ISTA (LISTA): Z_{t+1} = ψ_λ(WZ_t + SX), with learned thresholds λ and learned operators W and S.
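A LISTA forward pass is just this recursion unrolled for a fixed number of layers. A hedged sketch follows: training of W, S, and the thresholds is omitted, and the names here are illustrative. Initializing the parameters from ISTA, as the rewriting above suggests, makes the untrained network reproduce plain ISTA exactly.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lista_forward(X, W, S, theta, n_layers):
    """Unrolled LISTA: each layer computes Z <- psi_theta(W Z + S X)."""
    Z = np.zeros(W.shape[0])
    for _ in range(n_layers):
        Z = soft_threshold(W @ Z + S @ X, theta)
    return Z

def ista_init(A, lam, mu):
    """ISTA-equivalent initialization: W = I - mu A^T A, S = mu A^T,
    threshold lam * mu. Training would then adapt these parameters."""
    d = A.shape[1]
    return np.eye(d) - mu * A.T @ A, mu * A.T, lam * mu
```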
20 LISTA CONVERGENCE Replacing I − μAᵀA and μAᵀ in ISTA with the learned W and S improves convergence [Gregor & LeCun, 2010]. [Plot: E‖Z − Ẑ_t‖² vs. t, up to t = 100.] Extensions to other models: [Sprechmann, Bronstein & Sapiro, 2015], [Remez, Litani & Bronstein, 2015], [Tompson, Schlachter, Sprechmann & Perlin, 2016].
21 LISTA AS A NEURAL NETWORK [Block diagram: the input X ∈ R^d, with X = AZ + E and Z ∈ Υ, passes through the learned linear operator S; a feedback branch applies the learned operator W; the sum feeds the soft thresholding operation ψ with threshold λ, producing Ẑ, an estimate of Z.] [Gregor & LeCun, 2010]
22 ISTA [Block diagram: the iterative soft thresholding algorithm (ISTA). The input X ∈ R^d, with X = AZ + E, passes through μAᵀ; a feedback branch applies I − μAᵀA; the sum feeds the soft thresholding operation ψ with threshold λμ. The step size μ obeys μ ≤ 1/‖A‖². Output: the minimizer of min_Z ‖X − AZ‖² + λ‖Z‖₁.] [Daubechies, Defrise & De Mol, 2004], [Beck & Teboulle, 2009]
23 PROJECTED GRADIENT DESCENT (PGD) [Block diagram: the input X ∈ R^d, with X = AZ + E, passes through μAᵀ; a feedback branch applies I − μAᵀA, where μ is the step size; the sum feeds ψ, which projects onto the set Υ = {Z : f(Z) ≤ R}.] Output: Ẑ, an estimate of Z. Aim at solving min_Ẑ ‖X − AẐ‖² s.t. f(Ẑ) ≤ R.
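For a concrete instance of the projection ψ, take Υ to be the set of k-sparse vectors, so that ψ keeps the k largest-magnitude entries. This is the classical iterative-hard-thresholding special case of PGD; the slides' ψ is the projection onto a general set Υ, and the sizes below are arbitrary.

```python
import numpy as np

def project_ksparse(v, k):
    """Projection onto {Z : ||Z||_0 <= k}: keep the k largest magnitudes."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def pgd(X, A, k, mu, n_iter=500):
    """Projected gradient descent for min ||X - A Z||^2 s.t. ||Z||_0 <= k."""
    Z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Gradient step followed by projection onto the feasible set.
        Z = project_ksparse(Z + mu * A.T @ (X - A @ Z), k)
    return Z
```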
24 THEORY FOR PGD Theorem 8: Let Z ∈ R^d, f: R^d → R a proper function, f(Z) ≤ R, C_f(Z) the tangent cone of f at the point Z, A ∈ R^{m×d} a random Gaussian matrix, and X = AZ + E. Then the estimate of PGD at iteration t, Ẑ_t, obeys ‖Ẑ_t − Z‖ ≤ κ_f ρ^t ‖Z‖, where ρ = sup_{U,V ∈ C_f(Z) ∩ B^d} Uᵀ(I − μAᵀA)V, and κ_f = 1 if f is convex and κ_f = 2 otherwise. [Oymak, Recht & Soltanolkotabi, 2016]
25 PGD CONVERGENCE RATE ρ = sup_{U,V ∈ C_f(Z) ∩ B^d} Uᵀ(I − μAᵀA)V is the convergence rate of PGD. Let ω be the Gaussian mean width of C_f(Z) ∩ B^d. If μ = 1/m, then ρ = O(ω/√m). If μ = 1/(√m + √d)², then ρ = 1 − O((√m − ω)/√(m + d)). For the k-sparse model, ω² = O(k log d). For a GMM with k Gaussians, ω² = O(k). How may we make ω smaller, to get a better convergence rate?
26 INACCURATE PROJECTION PGD iterations project onto Υ = {Z : f(Z) ≤ R}. Smaller Υ ⇒ smaller ω ⇒ faster convergence, as ρ = 1 − O((√m − ω)/√(m + d)) or O(ω/√m). Let us assume that our signal belongs to a smaller set Υ' = {Z : f'(Z) ≤ R'} with ω' ≪ ω. Ideally, we would like to project onto Υ' instead of Υ; this would lead to faster convergence. What if such a projection is not feasible? [Illustration: the smaller set Υ' contained inside Υ.]
27 INACCURATE PROJECTION We will approximate the projection onto Υ' by a linear projection P followed by a projection onto Υ. Assumption: ‖P_Υ(PZ) − Z‖ ≤ ϵ, i.e., applying P to the target vector Z and then projecting onto Υ stays ϵ-close to Z.
28 INACCURATE PGD (IPGD) [Block diagram: the input X ∈ R^d, with X = AZ + E, passes through μPAᵀ; a feedback branch applies P(I − μAᵀA), where μ is the step size; the sum feeds ψ, which projects onto the set Υ = {Z : f(Z) ≤ R}.] Output: Ẑ, an estimate of Z. Aim at solving min_Ẑ ‖X − AẐ‖² s.t. f(Ẑ) ≤ R.
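IPGD only changes where the linear map P enters: the gradient step is filtered through P before the feasible projection. A sketch, again using a k-sparse projection as the concrete ψ (illustrative names; with P = I this reduces to plain PGD):

```python
import numpy as np

def project_ksparse(v, k):
    """Keep the k largest-magnitude entries; zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ipgd(X, A, P, k, mu, n_iter=500):
    """Inaccurate PGD: Z <- psi(P (Z + mu A^T (X - A Z)))."""
    Z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        Z = project_ksparse(P @ (Z + mu * A.T @ (X - A @ Z)), k)
    return Z
```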
29 THEORY FOR IPGD Theorem 9: Let Z ∈ R^d, f: R^d → R a proper convex* function, f(Z) ≤ R, C_f(Z) the tangent cone of f at the point Z, A ∈ R^{m×d} a random Gaussian matrix, and X = AZ + E. Then the estimate of IPGD at iteration t, Ẑ_t, obeys ‖Ẑ_t − Z‖ ≤ (ρ_P^t + ((1 − ρ_P^t)/(1 − ρ_P)) ε) ‖Z‖, where ρ_P = sup_{U,V ∈ C_f(Z) ∩ B^d} UᵀP(I − μAᵀA)PV and ε = (2 + ρ_P)ϵ. [Giryes, Eldar, Bronstein & Sapiro, 2018] *We have a version of this theorem also when f is non-proper or non-convex.
30 CONVERGENCE RATE COMPARISON PGD convergence: ρ^t. IPGD convergence: ρ_P^t + ((1 − ρ_P^t)/(1 − ρ_P))(2 + ρ_P)ϵ ≈(a) ρ_P^t + ε ≈(b) ρ_P^t ≤(c) ρ^t. (a) ε is negligible compared to ρ_P. (b) For small values of t (early iterations). (c) Faster convergence, as ρ_P ≤ ρ (because ω_P ≤ ω).
31 MODEL BASED COMPRESSED SENSING Υ' is the set of sparse vectors whose sparsity patterns obey a tree structure. Projecting onto Υ' improves the convergence rate compared to projecting onto the set of unstructured sparse vectors Υ [Baraniuk et al., 2010]. However, the projection onto Υ' is more demanding than the projection onto Υ. Note that the probability of selecting atoms from lower tree levels is smaller than from upper ones. P will be a projection onto certain tree levels, zeroing the values at the lower levels.
32 MODEL BASED COMPRESSED SENSING [Plot: reconstruction error ‖Z − Ẑ_t‖ vs. iteration t.] The picked non-zero entries have a zero-mean random Gaussian distribution, with variance 1 at the first two levels and smaller variances at the third and remaining levels.
33 SPECTRAL COMPRESSED SENSING Υ is the set of vectors with a sparse representation in a 2-times redundant DCT dictionary such that: the active atoms are selected uniformly at random with a minimum distance of 20 between neighboring atoms; the value of each representation coefficient is ~N(0,1) i.i.d.; the neighboring coefficients at distance 2 of each active atom are set to be zero-mean Gaussian variables of smaller variance. We set P to be a pooling-like operation that keeps in each window of size 3 only the largest value.
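The pooling-like P described above can be sketched as follows. This is a minimal illustration assuming non-overlapping windows and "largest value" meaning largest magnitude; the exact windowing used in the experiments may differ.

```python
import numpy as np

def window_keep_max(v, w):
    """In each non-overlapping window of length w, keep only the entry
    of largest magnitude and zero out the rest."""
    out = np.zeros_like(v)
    for start in range(0, len(v), w):
        block = v[start:start + w]
        j = start + int(np.argmax(np.abs(block)))  # winner within the window
        out[j] = v[j]
    return out
```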
34 SPECTRAL COMPRESSED SENSING [Plot: reconstruction error ‖Z − Ẑ_t‖ vs. iteration t.]
35 SPECTRAL COMPRESSED SENSING Υ is the set of vectors with a sparse representation in a 4-times redundant DCT dictionary such that: the active atoms are selected uniformly at random with a minimum distance of 5 between neighboring atoms; the value of each representation coefficient is ~N(0,1) i.i.d.; the neighboring coefficients at distances 1 and 2 of each active atom are set to be zero-mean Gaussian variables of smaller variance. We set P to be a pooling-like operation that keeps in each window of size 5 only the largest value.
36 SPECTRAL COMPRESSED SENSING [Plot: reconstruction error ‖Z − Ẑ_t‖ vs. iteration t.]
37 LEARNING THE PROJECTION If we have no explicit information about Υ', it might be desirable to learn the projection. Instead of learning P, it is possible to replace P(I − μAᵀA) and μPAᵀ with two learned matrices W and S, respectively. This leads to a scheme very similar to LISTA and provides a theoretical foundation for LISTA's success.
38 LEARNED IPGD [Block diagram: the input X ∈ R^d, with X = AZ + E, passes through the learned linear operator S; a feedback branch applies the learned operator W; the sum feeds ψ, which projects onto the set Υ = {Z : f(Z) ≤ R}.] Output: Ẑ, an estimate of Z. Aim at solving min_Ẑ ‖X − AẐ‖² s.t. f(Ẑ) ≤ R.
39 SUPER RESOLUTION A popular super-resolution technique uses a pair of low-res and high-res dictionaries [Zeyde et al. 2012]. The original work uses OMP with sparsity 3 to decode the representation of patches in the low-res image. Then the representation is used to reconstruct the patches of the high-res image. We replace OMP with learned IPGD (LIPGD) with 3 levels but a higher target sparsity. This leads to better reconstruction results (with up to 0.5 dB improvement).
40 LISTA [Block diagram: the input X ∈ R^d, with X = AZ + E, passes through the learned linear operator S; a feedback branch applies the learned operator W; the sum feeds ψ, producing Ẑ, an estimate of Z.] Here ψ is a proximal mapping: ψ(U) = argmin_{Z ∈ R^d} ‖U − Z‖² + λf(Z). Aim at solving min_Ẑ ‖X − AẐ‖² + λf(Ẑ).
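For f(Z) = ‖Z‖₁, the proximal mapping in this slide has a closed form: with the data term written as ‖U − Z‖² (no ½ factor, as above), the minimizer is soft thresholding with threshold λ/2. A quick numeric sanity check of that closed form (an illustrative sketch, not code from the talk):

```python
import numpy as np

def prox_l1(u, lam):
    """argmin_z (u - z)^2 + lam * |z|, per coordinate:
    soft thresholding at lam / 2 (note the missing 1/2 in the data term)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam / 2.0, 0.0)

def prox_l1_bruteforce(u, lam):
    """Grid-search the same scalar objective as an independent check."""
    grid = np.linspace(-10.0, 10.0, 400001)  # grid step 5e-5
    return grid[np.argmin((u - grid) ** 2 + lam * np.abs(grid))]
```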
41 LISTA MIXTURE MODEL Approximating the projection onto Υ with one linear projection may not be accurate enough, which requires more LISTA layers/iterations. Instead, one may use several LISTA networks, where each approximates a different part of Υ. Training multiple LISTA networks accelerates the convergence further. [Illustration: the set Υ covered by several local approximations.]
42 LISTA MIXTURE MODEL X A Z t Z t 1 X AZ 2 2 Z 1 42
43 RELATED WORKS In [Bruna et al. 2017] it is shown that learning may give a gain due to better preconditioning of A. In [Xin et al. 2016] a relation to the restricted isometry property (RIP) is drawn. In [Borgerding & Schniter, 2016] a connection is drawn to approximate message passing (AMP). All these works consider only the sparsity case.
44 ACKNOWLEDGEMENTS Guillermo Sapiro (Duke University), Yonina C. Eldar (Technion), Alex M. Bronstein (Technion).
45 QUESTIONS? WEB.ENG.TAU.AC.IL/~RAJA
46 GAUSSIAN MEAN WIDTH Gaussian mean width: ω(Υ) = E[sup_{V,W ∈ Υ} ⟨V − W, g⟩], g ~ N(0, I). [Illustration: the width of the set Υ in the direction of g.]
47 MEASURE FOR LOW DIMENSIONALITY Gaussian mean width: ω(Υ) = E[sup_{V,W ∈ Υ} ⟨V − W, g⟩], g ~ N(0, I). ω²(Υ) is a measure of the dimensionality of the data. Examples: if Υ ⊆ B^d is a Gaussian mixture model with k Gaussians, then ω²(Υ) = O(k); if Υ ⊆ B^d is data with k-sparse representations, then ω²(Υ) = O(k log d).
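The k-sparse example can be checked by Monte Carlo: for the symmetric set of unit-norm k-sparse vectors, sup_V ⟨V, g⟩ equals the ℓ₂ norm of the k largest-magnitude entries of g, so the mean width is twice its expectation. A sketch (the constants below are loose assumptions, not from the talk):

```python
import numpy as np

def mean_width_ksparse(d, k, n_trials=300, seed=0):
    """Monte Carlo estimate of the Gaussian mean width of
    {V : ||V||_0 <= k, ||V||_2 <= 1}."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        g = rng.standard_normal(d)
        topk = np.sort(np.abs(g))[-k:]
        # sup_{V,W} <V - W, g> = 2 sup_V <V, g> = 2 ||top-k entries of g||_2
        total += 2.0 * np.linalg.norm(topk)
    return total / n_trials
```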
More informationPhil Schniter. Supported in part by NSF grants IIP , CCF , and CCF
AMP-inspired Deep Networks, with Comms Applications Phil Schniter Collaborators: Sundeep Rangan (NYU), Alyson Fletcher (UCLA), Mark Borgerding (OSU) Supported in part by NSF grants IIP-1539960, CCF-1527162,
More informationA Convex Approach for Designing Good Linear Embeddings. Chinmay Hegde
A Convex Approach for Designing Good Linear Embeddings Chinmay Hegde Redundancy in Images Image Acquisition Sensor Information content
More informationMultilevel Preconditioning and Adaptive Sparse Solution of Inverse Problems
Multilevel and Adaptive Sparse of Inverse Problems Fachbereich Mathematik und Informatik Philipps Universität Marburg Workshop Sparsity and Computation, Bonn, 7. 11.6.2010 (joint work with M. Fornasier
More informationBM3D-prGAMP: Compressive Phase Retrieval Based on BM3D Denoising
BM3D-prGAMP: Compressive Phase Retrieval Based on BM3D Denoising Chris Metzler, Richard Baraniuk Rice University Arian Maleki Columbia University Phase Retrieval Applications: Crystallography Microscopy
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationA picture of the energy landscape of! deep neural networks
A picture of the energy landscape of! deep neural networks Pratik Chaudhari December 15, 2017 UCLA VISION LAB 1 Dy (x; w) = (w p (w p 1 (... (w 1 x))...)) w = argmin w (x,y ) 2 D kx 1 {y =i } log Dy i
More informationRecovering overcomplete sparse representations from structured sensing
Recovering overcomplete sparse representations from structured sensing Deanna Needell Claremont McKenna College Feb. 2015 Support: Alfred P. Sloan Foundation and NSF CAREER #1348721. Joint work with Felix
More informationNon-convex optimization. Issam Laradji
Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective
More informationSparsifying Transform Learning for Compressed Sensing MRI
Sparsifying Transform Learning for Compressed Sensing MRI Saiprasad Ravishankar and Yoram Bresler Department of Electrical and Computer Engineering and Coordinated Science Laborarory University of Illinois
More informationNeural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17
3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural
More informationHow to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India
How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization
More informationHoldout and Cross-Validation Methods Overfitting Avoidance
Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest
More informationClassification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box
ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses
More informationIntroduction to Sparsity. Xudong Cao, Jake Dreamtree & Jerry 04/05/2012
Introduction to Sparsity Xudong Cao, Jake Dreamtree & Jerry 04/05/2012 Outline Understanding Sparsity Total variation Compressed sensing(definition) Exact recovery with sparse prior(l 0 ) l 1 relaxation
More informationOn the Sub-Optimality of Proximal Gradient Descent for l 0 Sparse Approximation
On the Sub-Optimality of Proximal Gradient Descent for l 0 Sparse Approximation Yingzhen Yang YYANG58@ILLINOIS.EDU Jianchao Yang JIANCHAO.YANG@SNAPCHAT.COM Snapchat, Venice, CA 90291 Wei Han WEIHAN3@ILLINOIS.EDU
More informationSparse and Low Rank Recovery via Null Space Properties
Sparse and Low Rank Recovery via Null Space Properties Holger Rauhut Lehrstuhl C für Mathematik (Analysis), RWTH Aachen Convexity, probability and discrete structures, a geometric viewpoint Marne-la-Vallée,
More informationLearning Sparsifying Transforms
1 Learning Sparsifying Transforms Saiprasad Ravishankar, Student Member, IEEE, and Yoram Bresler, Fellow, IEEE Abstract The sparsity of signals and images in a certain transform domain or dictionary has
More informationarxiv: v1 [cs.lg] 13 Dec 2017
Mathematics of Deep Learning René Vidal Joan Bruna Raja Giryes Stefano Soatto arxiv:1712.04741v1 [cs.lg] 13 Dec 2017 Abstract Recently there has been a dramatic increase in the performance of recognition
More informationOPTIMIZATION METHODS IN DEEP LEARNING
Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate
More informationMirror Descent for Metric Learning. Gautam Kunapuli Jude W. Shavlik
Mirror Descent for Metric Learning Gautam Kunapuli Jude W. Shavlik And what do we have here? We have a metric learning algorithm that uses composite mirror descent (COMID): Unifying framework for metric
More informationMMSE Denoising of 2-D Signals Using Consistent Cycle Spinning Algorithm
Denoising of 2-D Signals Using Consistent Cycle Spinning Algorithm Bodduluri Asha, B. Leela kumari Abstract: It is well known that in a real world signals do not exist without noise, which may be negligible
More informationMaximum Margin Matrix Factorization
Maximum Margin Matrix Factorization Nati Srebro Toyota Technological Institute Chicago Joint work with Noga Alon Tel-Aviv Yonatan Amit Hebrew U Alex d Aspremont Princeton Michael Fink Hebrew U Tommi Jaakkola
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More information