Harmonic Analysis of Deep Convolutional Neural Networks


1 Harmonic Analysis of Deep Convolutional Neural Networks Helmut Bölcskei Department of Information Technology and Electrical Engineering October 2017 joint work with Thomas Wiatowski and Philipp Grohs

2 ImageNet

3 ImageNet ski rock plant coffee

4 ImageNet ski rock plant coffee CNNs win the ImageNet 2015 challenge [He et al., 2015 ]

5 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ]

6 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] Carlos.

7 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] Carlos Kleiber.

8 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] Carlos Kleiber conducting the.

9 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015] Carlos Kleiber conducting the Vienna Philharmonic's.

10 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015] Carlos Kleiber conducting the Vienna Philharmonic's New Year's Concert.

11 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015] Carlos Kleiber conducting the Vienna Philharmonic's New Year's Concert 1989.

12 Feature extraction and classification: input $f$ → non-linear feature extraction → feature vector $\Phi(f)$ → linear classifier → output: $\langle w, \Phi(f)\rangle > 0$ or $\langle w, \Phi(f)\rangle < 0$. [Slide shows portraits of Shannon and von Neumann.]

13 Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier: $\langle w, f\rangle > 0$ vs. $\langle w, f\rangle < 0$. [Figure: two classes separated radially in the plane.]

14 Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier: $\langle w, f\rangle > 0$ vs. $\langle w, f\rangle < 0$. Not possible!

15 Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier. With the feature map $\Phi(f) = \begin{bmatrix} |f| \\ 1 \end{bmatrix}$, separation according to $\langle w, \Phi(f)\rangle > 0$ vs. $\langle w, \Phi(f)\rangle < 0$ becomes possible with a suitable $w$.

16 Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier. $\Phi(f) = \begin{bmatrix} |f| \\ 1 \end{bmatrix}$ is invariant to the angular component of the data.

17 Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier. $\Phi(f) = \begin{bmatrix} |f| \\ 1 \end{bmatrix}$ is invariant to the angular component of the data. Linear separability in feature space!
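A minimal numerical illustration of this lift (my own sketch, not from the slides): two classes that differ only in radius are not linearly separable in the plane, but become linearly separable after the feature map $\Phi(f) = [|f|, 1]^\top$. The data, the radius threshold, and the choice $w = [1, -1]^\top$ are assumptions made here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes separated only by their radius (angular position is random).
n = 200
inner = rng.uniform(0.0, 0.8, n)          # radii of class -1
outer = rng.uniform(1.2, 2.0, n)          # radii of class +1
theta = rng.uniform(0.0, 2 * np.pi, 2 * n)
radii = np.concatenate([inner, outer])
X = np.stack([radii * np.cos(theta), radii * np.sin(theta)], axis=1)
y = np.concatenate([-np.ones(n), np.ones(n)])

# Feature map Phi(f) = [|f|, 1]: keep only the radial component.
Phi = np.stack([np.linalg.norm(X, axis=1), np.ones(2 * n)], axis=1)

# My own choice of weights: w = [1, -1] thresholds the radius at 1.
w = np.array([1.0, -1.0])
pred = np.sign(Phi @ w)
print("accuracy of the linear classifier in feature space:", (pred == y).mean())
```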

18 Translation invariance Handwritten digits from the MNIST database [LeCun & Cortes, 1998] Feature vector should be invariant to spatial location ⇒ translation invariance

19 Deformation insensitivity Feature vector should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters

20 Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015]) [Figure: network tree of feature maps obtained by iterating convolutions and modulus-type non-linearities, e.g. $\big||f * g_{\lambda_1}^{(k)}| * g_{\lambda_2}^{(l)}\big|$, with the input $f$ at the root.]

21 Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015]) [Figure: the same network tree, now with output-generating filters $\chi_1, \chi_2, \chi_3$ attached to the feature maps in each layer.]

22 Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015]) [Figure: the network tree with output-generating filters $\chi_n$; the outputs of all layers are collected into the feature vector $\Phi(f)$.] General scattering networks guarantee [Wiatowski & HB, 2015]
- (vertical) translation invariance
- small deformation sensitivity
essentially irrespective of filters, non-linearities, and poolings!

23 Building blocks Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n}$, followed by a non-linearity and a pooling operation. Filters: Semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying
$$A_n \|f\|_2^2 \;\le\; \|f * \chi_n\|_2^2 + \sum_{\lambda_n \in \Lambda_n} \|f * g_{\lambda_n}\|_2^2 \;\le\; B_n \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

24 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Filters: Semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying the frame condition above. e.g.: Structured filters

25 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Filters: Semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying the frame condition above. e.g.: Unstructured filters

26 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Filters: Semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ satisfying the frame condition above. e.g.: Learned filters

27 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Non-linearities: Point-wise and Lipschitz-continuous,
$$\|M_n(f) - M_n(h)\|_2 \;\le\; L_n \|f - h\|_2, \quad \forall f, h \in L^2(\mathbb{R}^d).$$

28 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Non-linearities: Point-wise and Lipschitz-continuous,
$$\|M_n(f) - M_n(h)\|_2 \;\le\; L_n \|f - h\|_2, \quad \forall f, h \in L^2(\mathbb{R}^d).$$
Satisfied by virtually all non-linearities used in the deep learning literature! ReLU: $L_n = 1$; modulus: $L_n = 1$; logistic sigmoid: $L_n = \tfrac{1}{4}$; ...

29 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Pooling: In continuous time according to $f \mapsto S_n^{d/2}\, P_n(f)(S_n \cdot)$, where $S_n \ge 1$ is the pooling factor and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ is $R_n$-Lipschitz-continuous.

30 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Pooling: In continuous time according to $f \mapsto S_n^{d/2}\, P_n(f)(S_n \cdot)$, as above. Emulates most poolings used in the deep learning literature! e.g.: Pooling by sub-sampling, $P_n(f) = f$ with $R_n = 1$.

31 Building blocks Basic operations in the n-th network layer: convolution, non-linearity, pooling. Pooling: In continuous time according to $f \mapsto S_n^{d/2}\, P_n(f)(S_n \cdot)$, as above. Emulates most poolings used in the deep learning literature! e.g.: Pooling by averaging, $P_n(f) = f * \phi_n$ with $R_n = \|\phi_n\|_1$.
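To make the three building blocks concrete, here is a minimal discrete 1-D sketch of one layer (my own construction: the modulated-Gaussian band-pass filters are arbitrary illustrative choices, not the semi-discrete frames of the talk): convolve with a small filter bank, apply a pointwise non-linearity, then pool by averaging and sub-sampling with factor S.

```python
import numpy as np

def layer(f, filters, S=2, nonlin=np.abs):
    """One network layer: filter bank -> pointwise non-linearity -> pooling.

    The pooling step emulates f -> S^(1/2) * P(f)(S.) with P chosen as
    averaging over a window of length S, followed by sub-sampling by S.
    """
    outputs = []
    for g in filters:
        u = np.convolve(f, g, mode="same")   # convolution with filter g_lambda
        u = nonlin(u)                        # Lipschitz pointwise non-linearity
        u = u.reshape(-1, S).mean(axis=1)    # averaging + sub-sampling by S
        outputs.append(np.sqrt(S) * u)       # normalization factor S^(1/2)
    return outputs

# Toy band-pass filters (modulated Gaussians); purely illustrative choices.
t = np.arange(-8, 9)
filters = [np.cos(0.5 * k * t) * np.exp(-t**2 / 8.0) for k in (1, 2, 3)]

f = np.sin(2 * np.pi * 0.05 * np.arange(256)) \
    + 0.1 * np.random.default_rng(1).standard_normal(256)
feature_maps = layer(f, filters, S=2)
print([fm.shape for fm in feature_maps])
```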

32 Vertical translation invariance Theorem (Wiatowski and HB, 2015): Assume that the filters, non-linearities, and poolings satisfy $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$ for all $n \in \mathbb{N}$, and let the pooling factors satisfy $S_n \ge 1$, $n \in \mathbb{N}$. Then
$$\|\Phi^n(T_t f) - \Phi^n(f)\| = \mathcal{O}\!\left(\frac{\|t\|}{S_1 \cdots S_n}\right),$$
for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, $n \in \mathbb{N}$.

33 Vertical translation invariance Theorem (Wiatowski and HB, 2015), as above. Features become more invariant with increasing network depth!

34 Vertical translation invariance Theorem (Wiatowski and HB, 2015), as above. Full translation invariance: If $\lim_{n \to \infty} S_1 S_2 \cdots S_n = \infty$, then $\lim_{n \to \infty} \|\Phi^n(T_t f) - \Phi^n(f)\| = 0$.

35 Vertical translation invariance Theorem (Wiatowski and HB, 2015), as above. The condition $B_n \le \min\{1, L_n^{-2} R_n^{-2}\}$, $n \in \mathbb{N}$, is easily satisfied by normalizing the filters $\{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$.

36 Vertical translation invariance Theorem (Wiatowski and HB, 2015), as above. Applies to general filters, non-linearities, and poolings.

37 Philosophy behind invariance results Mallat's horizontal translation invariance [Mallat, 2012]:
$$\lim_{J \to \infty} \|\Phi_W(T_t f) - \Phi_W(f)\| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ t \in \mathbb{R}^d.$$
Vertical translation invariance:
$$\lim_{n \to \infty} \|\Phi^n(T_t f) - \Phi^n(f)\| = 0, \quad \forall f \in L^2(\mathbb{R}^d),\ t \in \mathbb{R}^d.$$

38 Philosophy behind invariance results Mallat's horizontal translation invariance [Mallat, 2012], as above:
- features become invariant in every network layer, but needs $J \to \infty$
Vertical translation invariance, as above:
- features become more invariant with increasing network depth

39 Philosophy behind invariance results Mallat's horizontal translation invariance [Mallat, 2012], as above:
- features become invariant in every network layer, but needs $J \to \infty$
- applies to wavelet transform and modulus non-linearity without pooling
Vertical translation invariance, as above:
- features become more invariant with increasing network depth
- applies to general filters, general non-linearities, and general poolings

40 Non-linear deformations Non-linear deformation $(F_\tau f)(x) = f(x - \tau(x))$, where $\tau : \mathbb{R}^d \to \mathbb{R}^d$. For small $\tau$: [Figure: mildly deformed signal.]

41 Non-linear deformations Non-linear deformation $(F_\tau f)(x) = f(x - \tau(x))$, where $\tau : \mathbb{R}^d \to \mathbb{R}^d$. For large $\tau$: [Figure: strongly deformed signal.]

42 Deformation sensitivity for signal classes Consider $(F_\tau f)(x) = f(x - \tau(x)) = f(x - e^{-x^2})$. [Figure: $f_1(x)$ and $(F_\tau f_1)(x)$; $f_2(x)$ and $(F_\tau f_2)(x)$.] For a given $\tau$, the amount of deformation induced can depend drastically on $f \in L^2(\mathbb{R}^d)$.
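A quick numerical check of this point (my own sketch; the smooth and oscillatory test signals below are arbitrary stand-ins for $f_1$, $f_2$): apply the same deformation $\tau(x) = e^{-x^2}$ to both and compare the relative change in L² norm.

```python
import numpy as np

x = np.linspace(-5, 5, 4000)
dx = x[1] - x[0]
tau = np.exp(-x**2)                     # the deformation tau(x) = exp(-x^2)

def deform(func):
    return func(x - tau)                # (F_tau f)(x) = f(x - tau(x))

l2 = lambda v: np.sqrt(np.sum(np.abs(v)**2) * dx)

f1 = lambda z: np.exp(-z**2 / 4.0)                      # smooth, slowly varying
f2 = lambda z: np.cos(40 * z) * np.exp(-z**2 / 4.0)     # highly oscillatory

for name, f in (("smooth f1", f1), ("oscillatory f2", f2)):
    err = l2(deform(f) - f(x)) / l2(f(x))
    print(f"{name}: relative deformation error {err:.3f}")
```

The same $\tau$ barely moves the smooth signal in L² norm while changing the oscillatory one substantially, which is exactly the sensitivity gap the slide illustrates.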

43 Philosophy behind deformation stability/sensitivity bounds Mallat's deformation stability bound [Mallat, 2012]:
$$\|\Phi_W(F_\tau f) - \Phi_W(f)\| \le C\big(2^{-J}\|\tau\|_\infty + J\|D\tau\|_\infty + \|D^2\tau\|_\infty\big)\|f\|_W, \quad \forall f \in H_W \subseteq L^2(\mathbb{R}^d).$$
- The signal class $H_W$ and the corresponding norm $\|\cdot\|_W$ depend on the mother wavelet (and hence the network).
Our deformation sensitivity bound:
$$\|\Phi(F_\tau f) - \Phi(f)\| \le C_{\mathcal{C}}\,\|\tau\|_\infty^{\alpha}, \quad \forall f \in \mathcal{C} \subseteq L^2(\mathbb{R}^d).$$
- The signal class $\mathcal{C}$ (band-limited functions, cartoon functions, or Lipschitz functions) is independent of the network.

44 Philosophy behind deformation stability/sensitivity bounds Mallat's deformation stability bound [Mallat, 2012], as above:
- Signal class description complexity implicit via the norm $\|\cdot\|_W$.
Our deformation sensitivity bound, as above:
- Signal class description complexity explicit via $C_{\mathcal{C}}$:
  - $L$-band-limited functions: $C_{\mathcal{C}} = \mathcal{O}(L)$
  - cartoon functions of size $K$: $C_{\mathcal{C}} = \mathcal{O}(K^{3/2})$
  - $M$-Lipschitz functions: $C_{\mathcal{C}} = \mathcal{O}(M)$

45 Philosophy behind deformation stability/sensitivity bounds Mallat's deformation stability bound and our deformation sensitivity bound, as above:
- The decay rate $\alpha > 0$ of the deformation error is signal-class-specific (band-limited functions: $\alpha = 1$, cartoon functions: $\alpha = \tfrac{1}{2}$, Lipschitz functions: $\alpha = 1$).

46 Philosophy behind deformation stability/sensitivity bounds Mallat's deformation stability bound [Mallat, 2012], as above:
- The bound depends explicitly on higher-order derivatives of $\tau$.
Our deformation sensitivity bound, as above:
- The bound depends implicitly on the derivative of $\tau$ via the condition $\|D\tau\|_\infty \le \frac{1}{2d}$.

47 Philosophy behind deformation stability/sensitivity bounds Mallat's deformation stability bound [Mallat, 2012], as above:
- The bound is coupled to horizontal translation invariance $\lim_{J \to \infty} \|\Phi_W(T_t f) - \Phi_W(f)\| = 0$, $\forall f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$.
Our deformation sensitivity bound, as above:
- The bound is decoupled from vertical translation invariance $\lim_{n \to \infty} \|\Phi^n(T_t f) - \Phi^n(f)\| = 0$, $\forall f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$.

48 CNNs in a nutshell CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes!

49 CNNs in a nutshell CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes! e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer:
- # of FLOPS for a single forward pass: 11.3 billion

50 CNNs in a nutshell CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes! e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer:
- # of FLOPS for a single forward pass: 11.3 billion
Such depths (and breadths) pose formidable computational challenges in training and operating the network!

51 Topology reduction Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

52 Topology reduction Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers Guarantee trivial null-space for feature extractor Φ

53 Topology reduction Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers Guarantee trivial null-space for feature extractor Φ Specify the number of layers needed to have most of the input signal energy be contained in the feature vector

54 Topology reduction Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers Guarantee trivial null-space for feature extractor Φ Specify the number of layers needed to have most of the input signal energy be contained in the feature vector For a fixed (possibly small) depth, design CNNs that capture most of the input signal energy

55 Building blocks Basic operations in the n-th network layer: convolution with the filters $g_{\lambda_n}$, modulus non-linearity, and sub-sampling by $S$. Filters: Semi-discrete frame $\Psi_n := \{\chi_n\} \cup \{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$. Non-linearity: Modulus. Pooling: Sub-sampling with pooling factor $S \ge 1$.

56 Demodulation effect of modulus non-linearity Components of the feature vector are given by $|f * g_{\lambda_n}| * \chi_{n+1}$. [Figure: spectra $\widehat{g}_{\lambda_n}(\omega)$, $\widehat{\chi}_{n+1}(\omega)$, and $\widehat{f}(\omega)$ on a common frequency axis.]

57 Demodulation effect of modulus non-linearity Components of the feature vector are given by $|f * g_{\lambda_n}| * \chi_{n+1}$. [Figure: in addition, the spectrum of $f * g_{\lambda_n}$, concentrated on the support of $\widehat{g}_{\lambda_n}$.]

58 Demodulation effect of modulus non-linearity Components of the feature vector are given by $|f * g_{\lambda_n}| * \chi_{n+1}$. Modulus squared: $|f * g_{\lambda_n}(x)|^2$. [Figure: corresponding spectra.]

59 Demodulation effect of modulus non-linearity Components of the feature vector are given by $|f * g_{\lambda_n}| * \chi_{n+1}$. [Figure: the spectrum of $|f * g_{\lambda_n}|$ is concentrated at low frequencies, where it is extracted into $\Phi(f)$ via $\chi_{n+1}$.]

60 Do all non-linearities demodulate? High-pass filtered signal: $\mathcal{F}(f * g_\lambda)$ supported in a band of width $2R$ away from the origin. [Figure: high-pass spectrum.]

61 Do all non-linearities demodulate? High-pass filtered signal: $\mathcal{F}(f * g_\lambda)$ as above. Modulus: Yes! $\mathcal{F}(|f * g_\lambda|)$ is concentrated in $[-2R, 2R]$ ... but has (small) tails!

62 Do all non-linearities demodulate? High-pass filtered signal: $\mathcal{F}(f * g_\lambda)$ as above. Modulus squared: Yes, and sharply so! $\mathcal{F}(|f * g_\lambda|^2)$ is supported in $[-2R, 2R]$ ... but the squared modulus is not Lipschitz-continuous!

63 Do all non-linearities demodulate? High-pass filtered signal: $\mathcal{F}(f * g_\lambda)$ as above. Rectified linear unit: No! $\mathcal{F}(\mathrm{ReLU}(f * g_\lambda))$ retains the high-frequency components.
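The spectral pictures on these slides can be reproduced numerically. Below is a small FFT sketch of my own (a real-valued broadband test signal and a brick-wall band-pass filter; the slides' analyticity assumption is dropped for simplicity), comparing how much energy $f * g_\lambda$, $|f * g_\lambda|$, $|f * g_\lambda|^2$, and $\mathrm{ReLU}(f * g_\lambda)$ place at low frequencies. Only the qualitative ordering is meant to be illustrative.

```python
import numpy as np

N = 4096
rng = np.random.default_rng(0)
f = rng.standard_normal(N)                      # broadband test signal

# Ideal band-pass filtering in the frequency domain: keep |freq| in (0.20, 0.25).
freqs = np.fft.fftfreq(N)
band = (np.abs(freqs) > 0.20) & (np.abs(freqs) < 0.25)
u = np.fft.ifft(np.fft.fft(f) * band).real      # u plays the role of f * g_lambda

def lowfreq_energy_fraction(v, cutoff=0.05):
    """Fraction of spectral energy of v below the given frequency cutoff."""
    V = np.fft.fft(v)
    low = np.abs(freqs) < cutoff
    return np.sum(np.abs(V[low])**2) / np.sum(np.abs(V)**2)

for name, v in (("f*g", u), ("|f*g|", np.abs(u)),
                ("|f*g|^2", np.abs(u)**2), ("ReLU(f*g)", np.maximum(u, 0))):
    print(f"{name:10s} energy fraction below 0.05: {lowfreq_energy_fraction(v):.3f}")
```

The modulus (and, more sharply, the squared modulus) move a large share of the energy toward the baseband, whereas ReLU keeps a substantial part in the original high-frequency band, matching the "No!" on this slide.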

64 First goal: Quantify feature map energy decay [Figure: the scattering network tree with output-generating filters $\chi_1, \chi_2, \chi_3$; $W_1(f)$ denotes the total energy of the propagated signals $|f * g_{\lambda_1}|$ in the first layer, $W_2(f)$ the total energy of the propagated signals in the second layer.]

65 Assumptions (on the filters) i) Analyticity: For every filter $g_{\lambda_n}$ there exists a (not necessarily canonical) orthant $H_{\lambda_n} \subseteq \mathbb{R}^d$ such that $\mathrm{supp}(\widehat{g}_{\lambda_n}) \subseteq H_{\lambda_n}$. ii) High-pass: There exists $\delta > 0$ such that $\sum_{\lambda_n \in \Lambda_n} |\widehat{g}_{\lambda_n}(\omega)|^2 = 0$ for a.e. $\omega \in B_\delta(0)$.

66 Assumptions (on the filters) i) Analyticity and ii) High-pass, as above. Comprises various constructions of WH filters, wavelets, ridgelets, ($\alpha$)-curvelets, shearlets. [Figure: frequency-plane tiling of analytic band-limited curvelets.]

67 Input signal classes Sobolev functions of order $s \ge 0$:
$$H^s(\mathbb{R}^d) = \Big\{ f \in L^2(\mathbb{R}^d) \;\Big|\; \int_{\mathbb{R}^d} (1 + |\omega|^2)^s\, |\widehat{f}(\omega)|^2\, d\omega < \infty \Big\}$$

68 Input signal classes Sobolev functions of order $s \ge 0$, as above. $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes.

69 Input signal classes $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes:
- square-integrable functions $L^2(\mathbb{R}^d) = H^0(\mathbb{R}^d)$

70 Input signal classes $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes:
- square-integrable functions $L^2(\mathbb{R}^d) = H^0(\mathbb{R}^d)$
- $L$-band-limited functions $L^2_L(\mathbb{R}^d) \subseteq H^s(\mathbb{R}^d)$, $L > 0$, $s \ge 0$

71 Input signal classes $H^s(\mathbb{R}^d)$ contains a wide range of practically relevant signal classes:
- square-integrable functions $L^2(\mathbb{R}^d) = H^0(\mathbb{R}^d)$
- $L$-band-limited functions $L^2_L(\mathbb{R}^d) \subseteq H^s(\mathbb{R}^d)$, $L > 0$, $s \ge 0$
- cartoon functions [Donoho, 2001] $\mathcal{C}^{\mathrm{CART}} \subseteq H^s(\mathbb{R}^d)$, $s \in [0, \tfrac{1}{2})$
Handwritten digits from the MNIST database [LeCun & Cortes, 1998].
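For discrete data, the Sobolev-type quantity in this definition can be estimated directly from the FFT. The sketch below is my own crude Riemann-sum approximation on a periodic grid (toy image, arbitrary bump-plus-disc construction), purely to illustrate how the integrand $(1+|\omega|^2)^s |\widehat{f}(\omega)|^2$ weights high frequencies more heavily as $s$ grows.

```python
import numpy as np

def sobolev_energy(img, s):
    """Discrete stand-in for  int (1+|w|^2)^s |f_hat(w)|^2 dw  on a periodic grid."""
    F = np.fft.fft2(img) / img.size                  # normalized 2-D DFT
    wy = 2 * np.pi * np.fft.fftfreq(img.shape[0])    # angular frequency grid
    wx = 2 * np.pi * np.fft.fftfreq(img.shape[1])
    W2 = wy[:, None] ** 2 + wx[None, :] ** 2         # |w|^2 on the grid
    return np.sum((1.0 + W2) ** s * np.abs(F) ** 2)

# Toy "cartoon-like" image: a smooth bump plus a sharp disc (jump discontinuity).
y, x = np.mgrid[0:128, 0:128] / 128.0
img = np.exp(-((x - 0.3)**2 + (y - 0.3)**2) / 0.02) \
      + ((x - 0.7)**2 + (y - 0.7)**2 < 0.04)

for s in (0.0, 0.25, 0.5, 1.0):
    print(f"s = {s:4.2f}: Sobolev-type energy {sobolev_energy(img, s):.3e}")
```

On a finite grid every such sum is finite; the rapid growth with $s$ for the discontinuous part is what reflects, heuristically, that cartoon functions only belong to $H^s$ for $s < \tfrac{1}{2}$.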

72 Exponential energy decay Theorem: Let the filters be wavelets with mother wavelet satisfying $\mathrm{supp}(\widehat{\psi}) \subseteq [r^{-1}, r]$, $r > 1$, or Weyl-Heisenberg (WH) filters with prototype function satisfying $\mathrm{supp}(\widehat{g}) \subseteq [-R, R]$, $R > 0$. Then, for every $f \in H^s(\mathbb{R}^d)$, there exists $\beta > 0$ such that
$$W_n(f) = \mathcal{O}\Big(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\Big),$$
where $a = \frac{r^2+1}{r^2-1}$ in the wavelet case, and $a$ is the corresponding explicit factor determined by $R$ in the WH case.

73 Exponential energy decay Theorem, as above. The decay factor $a$ is explicit and can be tuned via $r$, $R$.

74 Exponential energy decay Exponential energy decay:
$$W_n(f) = \mathcal{O}\Big(a^{-\frac{n(2s+\beta)}{2s+\beta+1}}\Big)$$

75 Exponential energy decay Exponential energy decay, as above.
- $\beta > 0$ determines the decay of $\widehat{f}(\omega)$ (as $|\omega| \to \infty$) according to $|\widehat{f}(\omega)| \le \mu\,(1 + |\omega|^2)^{-(s + \frac{\beta}{4})}$ for $|\omega| \ge L$, for some $\mu > 0$, and $L$ acts as an effective bandwidth.

76 Exponential energy decay Exponential energy decay, as above.
- $\beta > 0$ determines the decay of $\widehat{f}(\omega)$, as above.
- smoother input signals (i.e., larger $s$) lead to faster energy decay

77 Exponential energy decay Exponential energy decay, as above.
- $\beta > 0$ determines the decay of $\widehat{f}(\omega)$, as above.
- smoother input signals (i.e., larger $s$) lead to faster energy decay
- pooling through sub-sampling $f \mapsto S^{1/2} f(S\,\cdot)$ leads to decay factor $aS$

78 Exponential energy decay Exponential energy decay, as above.
- $\beta > 0$ determines the decay of $\widehat{f}(\omega)$, as above.
- smoother input signals (i.e., larger $s$) lead to faster energy decay
- pooling through sub-sampling $f \mapsto S^{1/2} f(S\,\cdot)$ leads to decay factor $aS$
What about general filters? Polynomial energy decay!
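To see $W_n(f)$ concretely, here is a small self-contained 1-D sketch, entirely my own construction: brick-wall band-pass filters forming a Parseval frame, modulus non-linearity, no pooling, and none of the theorem's wavelet/WH or analyticity assumptions, so the stated decay rate is not claimed. It propagates a test signal through a few layers and prints the per-layer energy $W_n(f)$ together with the cumulative output energy $\sum_{k \le n} \|\Phi^k(f)\|^2$.

```python
import numpy as np

N = 2048
freqs = np.fft.fftfreq(N)

# Parseval "filter bank": brick-wall low-pass chi plus dyadic band-pass filters;
# their squared magnitudes partition the frequency axis and hence sum to 1.
edges = [0.02, 0.04, 0.08, 0.16, 0.32, 0.5000001]
chi_hat = (np.abs(freqs) <= edges[0]).astype(float)
band_hats = [((np.abs(freqs) > lo) & (np.abs(freqs) <= hi)).astype(float)
             for lo, hi in zip(edges[:-1], edges[1:])]

def filt(u, h_hat):                             # circular convolution via FFT
    return np.fft.ifft(np.fft.fft(u) * h_hat)

rng = np.random.default_rng(0)
f = rng.standard_normal(N)
energy = lambda u: np.sum(np.abs(u) ** 2) / N   # consistently normalized energy

signals = [f]                                    # propagated signals in layer 0
total_output = 0.0
for n in range(4):
    W_n = sum(energy(u) for u in signals)                     # layer-n energy
    Phi_n = sum(energy(filt(u, chi_hat)) for u in signals)    # layer-n output energy
    total_output += Phi_n
    print(f"layer {n}: W_n(f) = {W_n:.4f}, cumulative output energy = {total_output:.4f}")
    # next layer: modulus of every band-pass channel of every propagated signal
    signals = [np.abs(filt(u, g_hat)) for u in signals for g_hat in band_hats]
```

Because the filters form a Parseval frame, the printed cumulative output energy plus $W_n(f)$ equals the input energy at every layer, which is exactly the telescoping identity appearing on the energy-conservation slide below.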

79 ... our second goal ... trivial null-space for $\Phi$ Why trivial null-space? [Figure: feature space with a separating hyperplane of normal $w$; classes $\langle w, \Phi(f)\rangle > 0$ and $\langle w, \Phi(f)\rangle < 0$.]

80 ... our second goal ... trivial null-space for $\Phi$ Why trivial null-space? [Figure: as above, additionally marking $\Phi(f^*)$.] Non-trivial null-space: there exists $f^* \ne 0$ such that $\Phi(f^*) = 0$, hence $\langle w, \Phi(f^*)\rangle = 0$ for all $w$! These $f^*$ become unclassifiable!

81 ... our second goal ... Trivial null-space for the feature extractor: $\{ f \in L^2(\mathbb{R}^d) \mid \Phi(f) = 0 \} = \{ 0 \}$. The feature extractor $\Phi(\cdot) = \bigcup_{n=0}^{\infty} \Phi^n(\cdot)$ shall satisfy
$$A \|f\|_2^2 \le \|\Phi(f)\|^2 \le B \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d),$$
for some $A, B > 0$.

82 Energy conservation Theorem: For the frame upper bounds $\{B_n\}_{n \in \mathbb{N}}$ and frame lower bounds $\{A_n\}_{n \in \mathbb{N}}$, define $B := \prod_{n=1}^{\infty} \max\{1, B_n\}$ and $A := \prod_{n=1}^{\infty} \min\{1, A_n\}$. If $0 < A \le B < \infty$, then
$$A \|f\|_2^2 \le \|\Phi(f)\|^2 \le B \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

83 Energy conservation Theorem, as above.
- For Parseval frames (i.e., $A_n = B_n = 1$, $n \in \mathbb{N}$), this yields $\|\Phi(f)\|^2 = \|f\|_2^2$.

84 Energy conservation Theorem, as above.
- For Parseval frames (i.e., $A_n = B_n = 1$, $n \in \mathbb{N}$), this yields $\|\Phi(f)\|^2 = \|f\|_2^2$.
- Connection to energy decay:
$$\|f\|_2^2 = \sum_{k=0}^{n-1} \|\Phi^k(f)\|^2 + \underbrace{W_n(f)}_{\to\,0}$$

85 ... and our third goal... For a given CNN, specify the number of layers needed to capture most of the input signal energy

86 ... and our third goal ... For a given CNN, specify the number of layers needed to capture most of the input signal energy. How many layers $n$ are needed to have at least $((1 - \varepsilon) \cdot 100)\%$ of the input signal energy be contained in the feature vector, i.e.,
$$(1 - \varepsilon)\|f\|_2^2 \le \sum_{k=0}^{n} \|\Phi^k(f)\|^2 \le \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d)?$$

87 Number of layers needed Theorem: Let the frame bounds satisfy $A_n = B_n = 1$, $n \in \mathbb{N}$. Let the input signal $f$ be $L$-band-limited, and let $\varepsilon \in (0, 1)$. If
$$n \ge \log_a\!\Big(\frac{L}{1 - \sqrt{1 - \varepsilon}}\Big),$$
then
$$(1 - \varepsilon)\|f\|_2^2 \le \sum_{k=0}^{n} \|\Phi^k(f)\|^2 \le \|f\|_2^2.$$

88 Number of layers needed Theorem, as above. This also guarantees a trivial null-space for $\bigcup_{k=0}^{n} \Phi^k(f)$.

89 Number of layers needed Theorem, as above.
- The lower bound depends on
  - the description complexity of the input signals (i.e., the bandwidth $L$)
  - the decay factor (wavelets: $a = \frac{r^2+1}{r^2-1}$; WH filters: $a$ determined by $R$)

90 Number of layers needed Theorem, as above.
- The lower bound depends on
  - the description complexity of the input signals (i.e., the bandwidth $L$)
  - the decay factor (wavelets: $a = \frac{r^2+1}{r^2-1}$; WH filters: $a$ determined by $R$)
- Similar estimates hold for Sobolev input signals and for general filters (polynomial decay!).
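The depth estimate in this theorem is a one-line computation. The helper below is my own sketch that simply evaluates the bound $n \ge \log_a\big(L/(1 - \sqrt{1-\varepsilon})\big)$ as stated above; the example values of $L$, $r$, and the energy fractions are illustrative choices, not the slide's table.

```python
import math

def min_depth(L, eps, a):
    """Smallest integer n with n >= log_a( L / (1 - sqrt(1 - eps)) )."""
    kappa = L / (1.0 - math.sqrt(1.0 - eps))
    return max(0, math.ceil(math.log(kappa, a)))

# Example: bandwidth L = 1, wavelets with r = 2  =>  a = (r^2 + 1) / (r^2 - 1) = 5/3.
a_wavelet = (2**2 + 1) / (2**2 - 1)
for one_minus_eps in (0.25, 0.5, 0.75, 0.95):
    n = min_depth(L=1.0, eps=1.0 - one_minus_eps, a=a_wavelet)
    print(f"capture {one_minus_eps:.0%} of the input energy: depth n >= {n}")
```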

91 Number of layers needed Numerical example for bandwidth $L = 1$: [Table: required depth $n$ versus the captured energy fraction $(1 - \varepsilon)$, for wavelets ($r = 2$), WH filters ($R = 1$), and general filters.]

92 Number of layers needed Numerical example for bandwidth $L = 1$, as above.

93 Number of layers needed Numerical example for bandwidth $L = 1$, as above. Recall: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer:
- # of FLOPS for a single forward pass: 11.3 billion

94 ... our fourth and last goal... For a fixed (possibly small) depth N, design scattering networks that capture most of the input signal energy

95 ... our fourth and last goal ... For a fixed (possibly small) depth $N$, design scattering networks that capture most of the input signal energy. Recall: Let the filters be wavelets with mother wavelet satisfying $\mathrm{supp}(\widehat{\psi}) \subseteq [r^{-1}, r]$, $r > 1$, or Weyl-Heisenberg filters with prototype function satisfying $\mathrm{supp}(\widehat{g}) \subseteq [-R, R]$, $R > 0$.

96 ... our fourth and last goal ... For a fixed (possibly small) depth $N$, design scattering networks that capture most of the input signal energy. For fixed depth $N$, we want to choose $r$ in the wavelet case and $R$ in the WH case so that
$$(1 - \varepsilon)\|f\|_2^2 \le \sum_{k=0}^{N} \|\Phi^k(f)\|^2 \le \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d).$$

97 Depth-constrained networks Theorem: Let the frame bounds satisfy $A_n = B_n = 1$, $n \in \mathbb{N}$. Let the input signal $f$ be $L$-band-limited, and fix $\varepsilon \in (0, 1)$ and $N \in \mathbb{N}$. Set $\kappa := \frac{L}{1 - \sqrt{1 - \varepsilon}}$. If the decay factor satisfies $a \ge \kappa^{1/N}$, which in the wavelet case amounts to choosing $1 < r \le \big(\frac{\kappa^{1/N} + 1}{\kappa^{1/N} - 1}\big)^{1/2}$ and in the WH case to choosing $R > 0$ sufficiently small, then
$$(1 - \varepsilon)\|f\|_2^2 \le \sum_{k=0}^{N} \|\Phi^k(f)\|^2 \le \|f\|_2^2.$$
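Reading the condition the other way around, a fixed depth $N$ forces the decay factor to satisfy $a \ge \kappa^{1/N}$. The sketch below is my own rearrangement of that inequality, using the wavelet relation $a = (r^2+1)/(r^2-1)$ from the earlier energy-decay theorem, to compute the largest admissible mother-wavelet bandwidth $r$ for a few depths; the values of $L$ and $\varepsilon$ are illustrative assumptions.

```python
import math

def max_wavelet_bandwidth(L, eps, N):
    """Largest r with (r^2+1)/(r^2-1) >= kappa^(1/N), kappa = L / (1 - sqrt(1-eps))."""
    kappa = L / (1.0 - math.sqrt(1.0 - eps))
    a_req = kappa ** (1.0 / N)            # decay factor required for depth N
    if a_req <= 1.0:
        return float("inf")                # any r > 1 already suffices
    return math.sqrt((a_req + 1.0) / (a_req - 1.0))

for N in (1, 2, 3, 5, 10):
    r = max_wavelet_bandwidth(L=1.0, eps=0.05, N=N)
    print(f"depth N = {N:2d}: choose mother wavelet bandwidth r <= {r:.3f}")
```

The printed bound on $r$ grows with $N$: the shallower the network, the narrower the mother wavelet has to be, which is precisely the tradeoff discussed on the next slides.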

98 Depth-width tradeoff [Figure: spectral supports of the wavelet filters: $\widehat{\psi}$ together with $\widehat{g}_1, \widehat{g}_2, \ldots$ covering the bands up to $r, r^2, r^3, \ldots, L$ on the frequency axis $\omega$.]

99 Depth-width tradeoff [Figure: spectral supports of the wavelet filters, as above.] Smaller depth $N$ ⇒ smaller bandwidth $r$ of the mother wavelet.

100 Depth-width tradeoff [Figure: spectral supports of the wavelet filters, as above.] Smaller depth $N$ ⇒ smaller bandwidth $r$ of the mother wavelet ⇒ larger number of wavelets ($\mathcal{O}(\log_r(L))$) to cover the spectral support $[-L, L]$ of the input signal.

101 Depth-width tradeoff [Figure: spectral supports of the wavelet filters, as above.] Smaller depth $N$ ⇒ smaller bandwidth $r$ of the mother wavelet ⇒ larger number of wavelets ($\mathcal{O}(\log_r(L))$) to cover the spectral support $[-L, L]$ of the input signal ⇒ larger number of filters in the first layer.

102 Depth-width tradeoff [Figure: spectral supports of the wavelet filters, as above.] Smaller depth $N$ ⇒ smaller bandwidth $r$ of the mother wavelet ⇒ larger number of wavelets ($\mathcal{O}(\log_r(L))$) to cover the spectral support $[-L, L]$ of the input signal ⇒ larger number of filters in the first layer ⇒ depth-width tradeoff.
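The width side of the tradeoff is simple arithmetic; the two-liner below (my own sketch, with an arbitrarily assumed input bandwidth $L$) counts the roughly $\lceil \log_r L \rceil$ wavelet scales needed to cover $[-L, L]$ as the bandwidth $r$ of the mother wavelet shrinks.

```python
import math

L = 256.0                      # assumed input bandwidth, purely illustrative
for r in (4.0, 2.0, 1.5, 1.1):
    n_scales = math.ceil(math.log(L, r))
    print(f"r = {r:3.1f}: about {n_scales} wavelet scales to cover [-{L:.0f}, {L:.0f}]")
```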

103 Yours truly

104 Experiment: Handwritten digit classification
- Dataset: MNIST database of handwritten digits [LeCun & Cortes, 1998]; 60,000 training and 10,000 test images
- $\Phi$-network: $D = 3$ layers; same filters, non-linearities, and pooling operators in all layers
- Classifier: SVM with radial basis function kernel [Vapnik, 1995]
- Dimensionality reduction: Supervised orthogonal least squares scheme [Chen et al., 1991]
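A hedged sketch of this experimental pipeline (my own reimplementation outline using scikit-learn, not the authors' code): the function `scattering_features` is a placeholder standing in for the 3-layer $\Phi$-network, the orthogonal least squares step is replaced by a generic supervised feature selector, and random arrays stand in for MNIST so the sketch runs end to end.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

def scattering_features(images):
    """Placeholder for the D = 3 layer Phi-network described on this slide.

    A real implementation would convolve with the chosen wavelet filters,
    apply the non-linearity, pool, and iterate; here we simply flatten the
    images so the pipeline below runs end to end.
    """
    return images.reshape(len(images), -1)

# Stand-in data with the MNIST image shape (28 x 28) and 10 classes.
rng = np.random.default_rng(0)
X = rng.random((200, 28, 28))
y = rng.integers(0, 10, 200)

features = scattering_features(X)

# Supervised dimensionality reduction (generic stand-in for the orthogonal
# least squares scheme of Chen et al., 1991) followed by an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(),
                    SelectKBest(f_classif, k=64),
                    SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(features, y)
print("training accuracy on stand-in data:", clf.score(features, y))
```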

105 Experiment: Handwritten digit classification Classification error in percent: [Table: rows n.p. (no pooling), sub-sampling, max pooling, and average pooling; columns abs, ReLU, tanh, LogSig non-linearities, for Haar and bi-orthogonal wavelet filters.]

106 Experiment: Handwritten digit classification Classification error in percent: [Table as above.]
- modulus and ReLU perform better than tanh and LogSig

107 Experiment: Handwritten digit classification Classification error in percent: [Table as above.]
- modulus and ReLU perform better than tanh and LogSig
- results with pooling ($S = 2$) are competitive with those without pooling, at significantly lower computational cost

108 Experiment: Handwritten digit classification Classification error in percent: [Table as above.]
- modulus and ReLU perform better than tanh and LogSig
- results with pooling ($S = 2$) are competitive with those without pooling, at significantly lower computational cost
- state-of-the-art: 0.43 [Bruna and Mallat, 2013]
  - similar feature extraction network with directional, non-separable wavelets and no pooling
  - significantly higher computational complexity

109 Energy decay: Related work [Waldspurger, 2017]: Exponential energy decay $W_n(f) = \mathcal{O}(a^{-n})$, for some unspecified $a > 1$.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets

110 Energy decay: Related work [Waldspurger, 2017]: Exponential energy decay $W_n(f) = \mathcal{O}(a^{-n})$, for some unspecified $a > 1$.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
- vanishing moments condition on the mother wavelet

111 Energy decay: Related work [Waldspurger, 2017]: Exponential energy decay $W_n(f) = \mathcal{O}(a^{-n})$, for some unspecified $a > 1$.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
- vanishing moments condition on the mother wavelet
- applies to 1-D real-valued band-limited input signals $f \in L^2(\mathbb{R})$

112 Energy decay: Related work [Czaja and Li, 2016]: Exponential energy decay $W_n(f) = \mathcal{O}(a^{-n})$, for some unspecified $a > 1$.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g. wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters

113 Energy decay: Related work [Czaja and Li, 2016]: Exponential energy decay $W_n(f) = \mathcal{O}(a^{-n})$, for some unspecified $a > 1$.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g. wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
- analyticity and vanishing moments conditions on the filters

114 Energy decay: Related work [Czaja and Li, 2016]: Exponential energy decay $W_n(f) = \mathcal{O}(a^{-n})$, for some unspecified $a > 1$.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g. wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
- analyticity and vanishing moments conditions on the filters
- applies to d-dimensional complex-valued input signals $f \in L^2(\mathbb{R}^d)$
