FEATURE extraction based on deep convolutional neural. Energy Propagation in Deep Convolutional Neural Networks

Size: px
Start display at page:

Download "FEATURE extraction based on deep convolutional neural. Energy Propagation in Deep Convolutional Neural Networks"

Transcription

1 Energy Propagation in Deep Convolutional Neural Networks Thomas Wiatowski, Philipp Grohs, an Helmut Bölcskei, Fellow, IEEE Abstract Many practical machine learning tasks employ very eep convolutional neural networks. Such large epths pose formiable computational challenges in training an operating the network. It is therefore important to unerstan how fast the energy containe in the propagate signals a.k.a. feature maps) ecays across layers. In aition, it is esirable that the feature extractor generate by the network be informative in the sense of the only signal mapping to the all-zeros feature vector being the zero input signal. This trivial null-set property can be accomplishe by asking for energy conservation in the sense of the energy in the feature vector being proportional to that of the corresponing input signal. This paper establishes conitions for energy conservation an thus for a trivial null-set) for a wie class of eep convolutional neural network-base feature extractors an characterizes corresponing feature map energy ecay rates. Specifically, we consier general scattering networks employing the moulus non-linearity an we fin that uner mil analyticity an high-pass conitions on the filters which encompass, inter alia, various constructions of Weyl-Heisenberg filters, wavelets, rigelets, α)-curvelets, an shearlets) the feature map energy ecays at least polynomially fast. For broa families of wavelets an Weyl-Heisenberg filters, the guarantee ecay rate is shown to be exponential. Moreover, we provie hany estimates of the number of layers neee to have at least ε) 00)% of the input signal energy be containe in the feature vector. Inex Terms Machine learning, eep convolutional neural networks, scattering networks, energy ecay an conservation, frame theory. I. INTODUCTION FEATUE extraction base on eep convolutional neural networks DCNNs) has been applie with significant success in a wie range of practical machine learning tasks [] [6]. Many of these applications, such as, e.g., the classification of images in the ImageNet ata set, employ very eep networks with potentially hunres of layers [7]. Such network epths entail formiable computational challenges in the training phase ue to the large number of parameters to be learne, an in operating the network ue to the large number of convolutions that nee to be carrie out. It is therefore paramount to unerstan how fast the energy containe in the signals generate in the iniviual network T. Wiatowski an H. Bölcskei are with the Department of Information Technology an Electrical Engineering, ETH Zurich, Switzerlan. {withomas, boelcskei}@nari.ee.ethz.ch P. Grohs is with the Faculty of Mathematics, University of Vienna, Austria. philipp.grohs@univie.ac.at The material in this paper was presente in part at the 07 IEEE International Symposium on Information Theory ISIT), Aachen, Germany. Copyright c) 07 IEEE. Personal use of this material is permitte. However, permission to use this material for any other purposes must be obtaine from the IEEE by sening a request to pubs-permissions@ieee.org. layers, a.k.a. feature maps, ecays across layers. In aition, it is important that the feature vector obtaine by aggregating filtere versions of the feature maps be informative in the sense of the only signal mapping to the all-zeros feature vector being the zero input signal. 
This trivial null-set property for the feature extractor can be obtaine by asking for the energy in the feature vector being proportional to that of the corresponing input signal, a property we shall refer to as energy conservation. Scattering networks as introuce in [8] an extene in [9] constitute an important class of feature extractors base on noes that implement convolutional transforms with prespecifie or learne filters in each network layer e.g., wavelets [8], [0], uniform covering filters [], or general filters [9]), followe by a non-linearity e.g., the moulus [8], [0], [], or a general Lipschitz non-linearity [9]), an a pooling operation e.g., sub-sampling or average-pooling [9]). Scattering network-base feature extractors were shown to yiel classification performance competitive with the state-of-the-art on various ata sets [] [7]. Moreover, a mathematical theory exists, which allows to establish formally that such feature extractors are uner certain technical conitions horizontally [8] or vertically [9] translation-invariant an eformationstable in the sense of [8], or exhibit limite sensitivity to eformations in the sense of [9] on input signal classes such as ban-limite functions [9], [8], cartoon functions [9], an Lipschitz functions [9]. It was shown recently that the energy in the feature maps generate by scattering networks employing, in every network layer, the same set of certain) Parseval wavelets [0, Section 5] or uniform covering [] filters both satisfying analyticity an vanishing moments conitions), the moulus non-linearity, an no pooling, ecays at least exponentially fast an strict energy conservation which, in turn, implies a trivial null-set) for the infinite-epth feature vector hols. Specifically, the feature map energy ecay was shown to be at least of orer Oa N ), for some unspecifie a >, where N enotes the network epth. We note that -imensional uniform covering filters as introuce in [] are functions whose Fourier transforms support sets can be covere by a union of finitely many balls. This covering conition is satisfie by, e.g., Weyl-Heisenberg filters [] with a banlimite prototype function, but fails to hol for multi-scale filters such as wavelets [], [3], α)-curvelets [4] [6], shearlets [7], [8], or rigelets [9] [3], see [, emark. b)]. Contributions. The first main contribution of this paper is a characterization of the feature map energy ecay rate in

2 DCNNs employing the moulus non-linearity, no pooling, an general filters that constitute a frame [], [3] [34], but not necessarily a Parseval frame, an are allowe to be ifferent in ifferent network layers. We fin that, uner mil analyticity an high-pass conitions on the filters, the energy ecay rate is at least polynomial in the network epth, i.e., the ecay is at least of orer ON α ), an we explicitly specify the ecay exponent α > 0. This result encompasses, inter alia, various constructions of Weyl-Heisenberg filters, wavelets, rigelets, α)-curvelets, shearlets, an learne filters of course as long as the learning algorithm imposes the analyticity an high-pass conitions we require). For broa families of wavelets an Weyl-Heisenberg filters, the guarantee energy ecay rate is shown to be exponential in the network epth, i.e., the ecay is at least of orer Oa N ) with the ecay factor given as a = 5 3 in the wavelet case an a = 3 in the Weyl-Heisenberg case. We hasten to a that our results constitute guarantee ecay rates an o not preclue the energy from ecaying faster in practice. Our secon main contribution shows that the energy ecay results above are compatible with a trivial null-set for finitean infinite-epth networks. Specifically, this is accomplishe by establishing energy proportionality between the feature vector an the unerlying input signal with the proportionality constant lower- an upper-boune by the frame bouns of the filters employe in the ifferent layers. We show that this energy conservation result is a consequence of a emoulation effect inuce by the moulus non-linearity in combination with the analyticity an high-pass properties of the filters. Specifically, in every network layer, the moulus non-linearity moves the spectral content of each iniviual feature map to base-ban i.e., to low frequencies), where it is subsequently extracte i.e., fe into the feature vector) by a low-pass output-generating filter. Finally, for input signals that belong to the class of Sobolev functions, our energy ecay an conservation results are shown to yiel hany estimates of the number of layers neee to have at least ε) 00)% of the input signal energy be containe in the feature vector. For example, in the case of exponential energy ecay with a = 5 3 an for ban-limite input signals, only 8 layers are neee to absorb 95% of the input signal s energy. We emphasize that throughout energy ecay results pertain to the feature maps, whereas energy conservation statements apply to the feature vector, obtaine by aggregating filtere versions of the feature maps. Notation. The complex conjugate of z C is enote by z. We write ez) for the real, an Imz) for the imaginary part of z C. The Eucliean inner prouct of x, y C is x, y := i= x iy i, with associate norm x := x, x. For x, x) + := max{0, x} an x := + x ) /. We enote the open ball of raius r > 0 centere at x by B r x). The first canonical orthant is H := {x A wie range of practically relevant signal classes are Sobolev functions, for example, ban-limite functions an as establishe in the present paper cartoon functions [35]. We note that cartoon functions are wiely use in the mathematical signal processing literature [5], [9], [6], [36], [37] as a moel for natural images such as, e.g., images of hanwritten igits [38]. x k 0, k =,..., }, an we efine the rotate orthant H A := {Ax x H} for A O), where O) stans for the orthogonal group of imension N. The Minkowski sum of sets A, B is A + B) := {a + b a A, b B}, an A B := A\B) B\A) enotes their symmetric ifference. 
A multi-inex α = α,..., α ) N 0 is an orere -tuple of non-negative integers α i N 0. For functions W : N an G : N, we say that W N) = OGN)) if there exist C > 0 an N 0 N such that W N) CGN), for all N N 0. The support suppf) of a function f : C is the closure of the set {x fx) 0} in the topology inuce by the Eucliean norm. For a Lebesgue-measurable function f : C, we write fx)x for its integral w.r.t. Lebesgue measure. The inicator function of a set B is efine as B x) =, for x B, an B x) = 0, for x \B. For a measurable set B, we let vol B) := B x)x = x, an B we write B for its bounary. L p ), with p [, ), stans for the space of Lebesgue-measurable functions f : C satisfying f p := fx) p x) /p <. L ) enotes the space of Lebesgue-measurable functions f : C such that f := inf{α > 0 fx) α for a.e. x } <. For a countable set Q, L )) Q stans for the space of sets S := {f q } q Q, with f q L ) for all q Q, satisfying S := q Q f q ) / <. We enote the Fourier transform of f L ) by f) := fx)e πi x, x an exten it in the usual way to L ) [40, Theorem 7.9]. I : L p ) L p ) stans for the ientity operator on L p ). The convolution of f L ) an g L ) is f g)y) := fx)gy x)x. We write T t f)x) := fx t), t, for the translation operator, an M f)x) := e πi x, fx),, for the moulation operator. We set f, g := fx)gx)x, for f, g L ). H s ), with s > 0, stans for the Sobolev space of functions f L ) satisfying f H s := f) + ) s ) / <, see [4, Section 6..]. Here, the inex s reflects the egree of smoothness of f H s ), i.e., larger s entails smoother f. For a multi-inex α N 0, D α enotes the ifferential operator D α := / x ) α... / x ) α, with orer α := i= α i. The space of functions f : C whose erivatives D α f of orer at most k N 0 are continuous is esignate by C k, C). Moreover, we enote the graient of a function f : C as f. II. DCNN-BASED FEATUE EXTACTOS Throughout the paper, we use the terminology of [9], consier unless explicitly state otherwise) input signals f L ), an employ the moule-sequence Ω := Ψ n,, I) ) n N, ) i.e., each network layer is associate with i) a collection of filters Ψ n := {χ n } {g λn } λn Λ n L ) L ), where χ n, referre to as output-generating filter, an the g λn, inexe Throughout the paper a.e. is w.r.t. Lebesgue measure.

3 3 f g λ j) g λ l) g m) λ 3 f g λ p) g λ r) g s) λ 3 f g λ j) g l) λ f g λ p) g r) λ f g λ j) g l) λ χ 3 f g λ p) g r) λ χ 3 f g j) λ f g p) λ f g j) λ χ f f g p) λ χ f χ Fig. : Network architecture unerlying the feature extractor 5). The inex λ k) n correspons to the k-th filter g k) λ of the collection Ψ n associate with n the n-th network layer. The function χ n+ is the output-generating filter of the n-th network layer. The root of the network correspons to n = 0. by a countable set Λ n, satisfy the frame conition [], [3], [34] A n f f χ n + f g λn B n f, ) λ n Λ n for all f L ), for some A n, B n > 0, ii) the moulus non-linearity : L ) L ), f x) := fx), an iii) no pooling, which, in the terminology of [9], correspons to pooling through the ientity operator with pooling factor equal to one. Associate with the moule Ψ n,, I), the operator U n [λ n ] efine in [9, Eq. ] particularizes to We exten 3) to paths on inex sets U n [λ n ]f = f gλn. 3) q = λ, λ,..., λ n ) Λ Λ Λ n =: Λ n, n N, accoring to U[q]f = U[λ, λ,..., λ n )]f := U n [λ n ] U [λ ]U [λ ]f, 4) where, for the empty path e :=, we set Λ 0 := {e} an U[e]f := f. The signals U[q]f, q Λ n, associate with the n-th network layer, are often referre to as feature maps in the eep learning literature. The feature vector Φ Ω f) is obtaine by aggregating filtere versions of the feature maps. More formally, Φ Ω f) is efine as [9, Definition 3] Φ Ω f) := Φ n Ωf), 5) where Φ n Ωf) := {U[q]f) χ n+ } q Λ n are the features generate in the n-th network layer, see Figure. Here, n = 0 correspons to the root of the network. The function χ n+ is the output-generating filter of the n-th network layer. The feature extractor Φ Ω : L ) L ) ) Λn was shown in [9, Theorem ] to be vertically translationinvariant, provie although that pooling is employe, with pooling factors S n, n N, see [9, Eq. 6] for the efinition of the general pooling operator) such that N lim N n= S n =. Moreover, Φ Ω exhibits limite sensitivity to certain non-linear eformations on input) signal classes such as ban-limite functions [9, Theorem ], cartoon functions [9, Theorem ], an Lipschitz functions [9, Corollary ]. III. ENEGY DECAY AND TIVIAL NULL-SET The first central goal of this paper is to unerstan how fast the energy containe in the feature maps ecays across layers. Specifically, we shall stuy the ecay of W N f) := U[q]f, f L ), 6) q Λ N as a function of network epth N. Moreover, it is esirable that the infinite-epth feature vector Φ Ω f) be informative in the sense of the only signal mapping to the all-zeros feature vector being the zero input signal, i.e., Φ Ω has a trivial null-set N Φ Ω ) := {f L ) Φ Ω f) = 0}! = {0}. 7) Figure illustrates the practical ramifications of a non-trivial null-set in a binary classification task. N Φ Ω ) = {0} can be guarantee by asking for energy conservation in the sense of A Ω f Φ Ω f) B Ω f, f L ), 8) for some constants A Ω, B Ω > 0 possibly epening on the moule-sequence Ω) an with the feature space norm Φ Ω f) := Φn Ω f) ) /, where Φ n Ω f) :=

4 4 w for an unspecifie a >, was establishe in [, Proposition 3.3]. Moreover, [0, Section 5] an [, Theorem 3.6 a)] state for the respective moule-sequences that 8) hols with A Ω = B Ω = an hence Φ Ω f) = f. ) Φ Ω f ) Fig. : Impact of a non-trivial null-set N Φ Ω ) in a binary classification task. The feature vector Φ Ω f) is fe into a linear classifier [0], which etermines set membership base on the sign of the inner prouct w, Φ Ω f). The learne) weight vector w is perpenicular to the separating hyperplane ashe line). If the null-set of the feature extractor Φ Ω is non-trivial, there exist input signals f 0 that are mappe to the origin in feature space, i.e., Φ Ω f ) = 0 gray circle), an therefore lie inepenently of the weight vector w on the separating hyperplane. These input signals f 0 are therefore unclassifiable. /. q Λ U[q]f) χ n+ ) Inee, 7) follows from 8) n as the upper boun in 8) yiels {0} N Φ Ω ), an the lower boun implies {0} N Φ Ω ). We emphasize that, as Φ Ω is a non-linear operator owing to the moulus non-linearities), characterizing its null-set is non-trivial in general. The upper boun in 8) was establishe in [9, Appenix E]. While the existence of this upper boun is implie by the filters Ψ n, n N, satisfying the frame property ) [9, Appenix E], perhaps surprisingly, this is not enough to guarantee A Ω > 0 see Appenix A for an example). We refer the reaer to Section V for results on the null-set of the finite-epth feature extractor N Φn Ω. Previous work on the ecay rate of W N f) in [0, Section 5] shows that for wavelet-base networks i.e., in every network layer the filters Ψ = {χ} {g λ } λ Λ in ) are taken to be specific) -D wavelets that constitute a Parseval frame, with χ a low-pass filter) there exist ε > 0 an a > both constants unspecifie) such that W N f) f) ) r g εa N, 9) for real-value -D signals f L ) an N, where ) r g := e. To see that this result inicates energy ecay, Figure 3 illustrates the influence of network epth N on the upper boun in 9). Specifically, we can see that increasing the network epth results in cutting out increasing amounts of energy of f an thereby making the upper boun in 9) ecay as a function of N. Moreover, it is interesting to note that the upper boun on W N f) = q Λ N U[q]f is inepenent of the wavelets generating the feature maps U[q]f, q Λ N. For scattering networks that employ, in every network layer, uniform covering filters Ψ = {χ} {g λ } λ Λ L ) L ) forming a Parseval frame where χ, again, is a lowpass filter), exponential energy ecay accoring to W N f) = Oa N ), f L ), 0) The first main goal of the present paper is to establish i) for - imensional complex-value input signals that W N f) ecays polynomially accoring to W N f) BΩ N f) ) r l N α, ) for f L ) an N, where α =, for =, an α = log / /)), for, BΩ N = N k= max{, B k}, an r l :, r l ) = ) l +, with l > / +, for networks base on general filters {χ n } {g λn } λn Λ n that satisfy mil analyticity an high-pass conitions an are allowe to be ifferent in ifferent network layers with the proviso that χ n, n N, is of low-pass nature in a sense to be mae precise), an ii) for -D complex-value input signals that 6) ecays exponentially accoring to W N f) f) ) r l a N, 3) for f L ) an N, for networks that are base, in every network layer, on a broa family of wavelets, with the ecay factor given explicitly as a = 5 3, or on a broa family of Weyl-Heisenberg filters [9, Appenix B], with ecay factor a = 3. 
Thanks to the right-han sie HS) of ) an 3) not epening on the specific filters {χ n } {g λn } λn Λ n, we will be able to establish uner smoothness assumptions on the input signal f universal energy ecay results. Specifically, particularizing the HS expressions in ) an 3) to Sobolev-class input signals f H s ), s > 0, where { } H s ) = f L ) f H s <, with f H s := + ) s f) ) /, we show that ) yiels polynomial energy ecay accoring to an 3) exponential energy ecay W N f) = O N γα), 4) W N f) = O a γn ), 5) where γ := min{, s} in both cases. Sobolev spaces H s ) contain a wie range of practically relevant signal classes such as, e.g., the space L L ) := {f L ) supp f ) B L 0)}, L 0, of L-ban-limite functions accoring to L L ) H s ), for L 0 an s > 0. This follows from + ) s f) = + ) s f) B L 0) + L ) s f <, for f L L ), L 0, an s > 0, where we use Parseval s formula an the fact that + ) s,

5 5 f) h N ) a N a N a N a N f) h N ) a N a N a N a N f) h N+) a N a N a N a N f) h N+) a N a N a N a N Fig. 3: Illustration of the impact of network epth N on the upper boun on W N f) in 9), for ε = an a >. The function h N ) := )) r g εa, N where r g) = e, is of increasing high-pass nature as N increases, which makes the upper boun in 9) ecay in N., is monotonically increasing in, for s > 0, the space CCAT K of cartoon functions of size K, introuce in [35], an wiely use in the mathematical signal processing literature [5], [9], [6], [36], [37] as a moel for natural images such as, e.g., images of hanwritten igits [38] see Figure 4). For a formal efinition of CCAT K, we refer the reaer to Appenix B, where we also show that CCAT K Hs ), for K > 0 an s 0, /). Moreover, Sobolev functions are containe in the space of k-times continuously ifferentiable functions C k, C) accoring to H s ) C k, C), for s > k+ [4, Section 4]. Our secon central goal is to prove energy conservation accoring to 8) which, as explaine above, implies N Φ Ω ) = {0}) for the network configurations corresponing to the energy ecay results ) an 3). Finally, we provie hany estimates of the number of layers neee to have at least ε) 00)% of the input signal energy be containe in the feature vector. IV. MAIN ESULTS Throughout the paper, we make the following assumptions on the filters {g λn } λn Λ n. Assumption. The {g λn } λn Λ n, n N, are analytic in the following sense: For every layer inex n N, for every λ n Λ n, there exists an orthant H Aλn, with A λn O), such that suppĝ λn ) H Aλn. 6) Moreover, there exists > 0 so that λ n Λ n ĝ λn ) = 0, a.e. B 0). 7) In the -D case, i.e., for =, Assumption simply amounts to every filter g λn satisfying either suppĝ λn ), ] or suppĝ λn ) [, ), which constitutes an analyticity an high-pass conition. For imensions, Assumption requires that every filter g λn be of high-pass nature an have a Fourier transform supporte in a not necessarily canonical) orthant. Since the frame conition ) is equivalent to the Littlewoo-Paley conition [43] A n χ n ) + ĝ λn ) B n, a.e., 8) λ n Λ n 7) implies low-pass characteristics for χ n to fill the spectral gap B 0) left by the filters {g λn } λn Λ n.

6 6 Fig. 4: An image of a hanwritten igit is moele by a -D cartoon function. The conitions 6) an 7) we impose on the Ψ n, n N, are not overly restrictive as they encompass, inter alia, various constructions of Weyl-Heisenberg filters e.g., a -D B-spline as prototype function [45, Section ]), wavelets e.g., analytic Meyer wavelets [, Section 3.3.5] in -D, an Cauchy wavelets [46] in -D), an specific constructions of rigelets [3, Section.], curvelets [5, Section 4.], α-curvelets [6, Section 3], an shearlets e.g., cone-aapte shearlets [37, Section 4.3]). We refer the reaer to [9, Appenices B an C] for a brief review of some of these filter structures. We are now reay to state our main result on energy ecay an energy conservation. Theorem. Let Ω be the moule-sequence ) with filters {g λn } λn Λ n satisfying the conitions in Assumption, an let > 0 be the raius of the spectral gap B 0) left by the filters {g λn } λn Λ n accoring to 7). Furthermore, let s 0, A N Ω := N k= min{, A k}, BΩ N := N k= max{, B k}, an {, =, α := log 9) / /)),. i) We have W N f) BΩ N f) ) r l N α, 0) for f L ) an N, where r l :, r l ) := ) l +, with l > / +. ii) For every Sobolev function f H s ), s > 0, we have where γ := min{, s}. iii) If, in aition to Assumption, W N f) = O B N Ω N γα), ) 0 < A Ω := lim N AN Ω B Ω := lim N BN Ω <, ) then we have energy conservation accoring to for all f L ). A Ω f Φ Ω f) B Ω f, 3) Proof. For the proofs of i) an ii), we refer to Appenices C an D, respectively. The proof of statement iii) is base on two key ingreients. First, we establish in Proposition in Appenix E that the feature extractor Φ Ω satisfies the energy ecomposition ientity A N Ω f N Φ n Ωf) + W N f) B N Ω f, 4) for all f L ) an all N. Secon, we show in Proposition in Appenix F that the integral on the HS of 0) goes to zero as N which, thanks to lim N BN Ω = B Ω <, implies that W N f) 0 as N. We note that while the ecomposition 4) hols for general filters Ψ n satisfying the frame property ), it is the upper boun 0) that makes use of the analyticity an high-pass conitions in Assumption. The final energy conservation result 3) is obtaine by letting N in 4). The strength of the results in Theorem erives itself from the fact that the only conition we nee to impose on the filters Ψ n is Assumption, which, as alreay mentione, is met by a wie array of filters. Moreover, conition ) is easily satisfie by normalizing the filters Ψ n, n N, appropriately see, e.g., [9, Proposition 3]). We note that this normalization, when applie to filters that satisfy Assumption, yiels filters that still meet Assumption. The ientity ) establishes, upon normalization [9, Proposition 3] of the Ψ n to get B n, n N, that the energy ecay rate, i.e., the ecay rate of W N f), is at least polynomial in N. We hasten to a that 0) oes not preclue the energy from ecaying faster in practice. Unerlying the energy conservation result 3) is the following emoulation effect inuce by the moulus nonlinearity in combination with the analyticity an high-pass properties of the filters {g λn } λn Λ n. In every network layer, the spectral content of each iniviual feature map is move to base-ban i.e., to low frequencies), where it is extracte by the low-pass output-generating atom χ n+, see Figure 5. The components not collecte by χ n+ see Figure 5, bottom row) are capture by the analytic high-pass filters {g λn+ } λn+ Λ n+ in the next layer an, thanks to the moulus non-linearity, again move to low frequencies an extracte by χ n+. 
Iterating this process ensures that the null-set of the feature vector be it for the infinite-epth network or, as establishe in Section V, for finite network epths) is trivial. It is interesting to observe that the sigmoi, the rectifie linear unit, an the hyperbolic tangent non-linearities all wiely use in the eep learning literature exhibit very ifferent behavior in this regar, namely, they o not emoulate in the way the moulus non-linearity oes [44, Figure 6]. It is therefore unclear whether the proof machinery for energy conservation evelope in this paper extens to these nonlinearities or, for that matter, whether one gets energy ecay an conservation at all. The feature map energy ecay result ) relates to the feature vector energy conservation result 3) via the energy ecomposition ientity 4). Specifically, particularizing 4) for Parseval frames, i.e., A n = B n =, for all n N, we get N Φ n Ωf) + W N f) = f. 5) This shows that the input signal energy containe in the network layers n N is precisely given by W N f). Thanks to W N f) 0 as N establishe in Proposition in Appenix F) this resiual energy will eventually be collecte

7 7 ĝ λn ) χ n+) f) f) ĝ λn ) χ n+) f g λn ) Fig. 5: Illustration of the emoulation effect of the moulus non-linearity. The {g λn } λn Λ n are taken as perfect ban-pass filters e.g., ban-limite analytic Weyl-Heisenberg filters) an hence trivially satisfy the conitions in Assumption. The moulus operation in combination with the analyticity an the highpass nature of the filters {g λn } λn Λ n ensures that in every network layer the spectral content of each iniviual feature map is move to base-ban i.e., to low frequencies), where it is extracte by the low-pass) output-generating filter χ n+. in the infinite-epth feature vector Φ Ω f) so that no input signal energy is lost in the network. In Section V, we shall answer the question of how many layers are neee to absorb ε) 00)% of the input signal energy. The next result shows that, uner aitional structural assumptions on the filters {g λn } λn Λ, the guarantee energy ecay rate can be improve from polynomial to exponential. Specifically, we can get exponential energy ecay for broa families of wavelets an Weyl-Heisenberg filters. For conceptual reasons, we consier the -D case an, for simplicity of exposition, we employ filters that constitute Parseval frames an are ientical across network layers. Theorem. Let r l :, r l ) := ) l +, with l >. i) Wavelets: Let the mother an father wavelets ψ, φ L ) L ) satisfy supp ψ) [/, ] an φ) + ψ j ) =, a.e. 0. 6) j= Moreover, let g j x) := j ψ j x), for x, j, an g j x) := j ψ j x), for x, j, an set χx) := φx), for x. Let Ω be the moule-sequence ) with filters Ψ = {χ} {g j } j Z\{0} in every network layer. Then, W N f) f) ) r l 5/3) N, 7) for f L ) an N. Moreover, for every Sobolev function f H s ), s > 0, we have W N f) = O 5/3) γn), 8) where γ := min{, s}. ii) Weyl-Heisenberg filters: For, let the functions g, φ L ) L ) satisfy suppĝ) [, ], ĝ ) = ĝ), for, an φ) + ĝ k + )) =, 9) k= a.e. 0. Moreover, let g k x) := e πik+)x gx), for x, k, an g k x) := e πi k +)x gx), for x, k, an set χx) := φx), for x. Let Ω be the moule-sequence ) with filters Ψ = {χ} {g k } k Z\{0} in every network layer. Then, W N f) f) ) r l 3/) N, 30) for f L ) an N. Moreover, for every Sobolev function f H s ), s > 0, we have where γ := min{, s}. Proof. See Appenix G. W N f) = O 3/) γn ), 3) The conitions we impose on the mother an father wavelet ψ, φ in i) are satisfie, e.g., by analytic Meyer wavelets [, Section 3.3.5], an those on the prototype function g an low-pass filter φ in ii) by B-splines [45, Section ]. Moreover, as shown in [44, Theorem 3.], the exponential energy ecay results in 8) an 3) can be generalize to Oa N ) with arbitrary ecay factor a > realize through

8 8 suitable choice of the mother wavelet or the Weyl-Heisenberg prototype function. We note that in the presence of pooling by sub-sampling as efine in [9, Eq. 9]), say with pooling factors S n := S [, a), for all n N, where a = 5 3 in the wavelet case an a = 3 in the Weyl-Heisenberg case) the effective ecay factor in 8) an 3) becomes 5 3S an 3 S, respectively. Exponential energy ecay is hence compatible with vertical translation invariance accoring to [9, Theorem ], albeit at the cost of a slower exponential) ecay rate. The proof of this statement is structurally very similar to that of Theorem an will therefore not be given here. Finally, we note that the energy ecay an conservation results in Theorems an are compatible with the feature extractor Φ Ω being eformationinsensitive accoring to [9, Theorem ], simply by noting that [9, Theorem ] applies to general semi-iscrete frames an general Lipschitz-continuous non-linearities. We next put the results in Theorems an into perspective with respect to the literature. elation to [0, Section 5]: The basic philosophy of our proof technique for 0), 3), 7), an 30) is inspire by the proof in [0, Section 5], which establishes 9) an ) for scattering networks base on certain wavelet filters an with -D real-value input signals f L ). Specifically, in [0, Section 5], in every network layer, the filters Ψ W = {χ} {g j } j Z where g j ) := j ψ j ), j Z, for some mother wavelet ψ L ) L )) are -D functions satisfying the frame property ) with A n = B n =, n N, a mil analyticity conition 3 [0, Eq. 5.5] in the sense of ĝ j ), j Z, being larger for positive frequencies than for the corresponing negative ones, an a vanishing moments conition [0, Eq. 5.6] which controls the behavior of ψ) aroun the origin accoring to ψ) C +ε, for, for some C, ε > 0. Similarly to the proof of ) as given in [0, Section 5], we base our proof of 3) on the energy ecomposition ientity 4) an on an upper boun on W N f) see 9) for the corresponing upper boun establishe in [0, Section 5]) shown to go to zero as N. The exponential energy ecay results ), 8), an 3) for Sobolev functions f H s ) are entirely new. The major ifferences between [0, Section 5] an our results are i) that 9) reporte in [0, Section 5]) epens on an unspecifie a >, whereas our results in 0), ), 7), 8), 30), an 3) make the ecay factor a an the ecay exponent α explicit, ii) the technical elements employe to arrive at the upper bouns on W N f); specifically, while the proof in [0, Section 5] makes explicit use of the algebraic structure of the filters, namely, the multiscale structure of wavelets, our proof of 0) is oblivious to the algebraic structure of the filters, which is why it applies to general possibly unstructure) filters that, in aition, can be ifferent in ifferent network layers, iii) the assumptions impose on the filters, namely the analyticity an vanishing moments conitions in [0, Eq ], in contrast to our Assumption, an iv) the class of input signals f the results 3 At the time of completion of the present paper, I. Walspurger kinly sent us a preprint [47] which shows that the analyticity conition [0, Eq. 5.5] on the mother wavelet is not neee for 9) to hol. apply to, namely -D real-value signals in [0, Section 5], an -imensional complex-value signals in our Theorem. elation to []: For scattering networks that are base on so-calle uniform covering filters [], 0) an ) are establishe in [] for -imensional complex-value signals f L ). 
Specifically, in [], in every network layer, the -imensional filters {χ} {g λ } λ Λ are taken to satisfy i) the frame property ) with A = B = an hence A n = B n =, n N, see [, Definition. c)], ii) a vanishing moments conition [, Definition. a)] accoring to ĝ λ 0) = 0, for λ Λ, an iii) a uniform covering conition [, Definition. b)] which says that the filters Fourier transform support sets can be covere by a union of finitely many balls. The major ifferences between [] an our results are as follows: i) the results in [] apply exclusively to filters satisfying the uniform covering conition such as, e.g., Weyl-Heisenberg filters with a ban-limite prototype function [, Proposition.3], but o not apply to multi-scale filters such as wavelets, α)-curvelets, shearlets, an rigelets see [, emark. b)]), ii) 0) as establishe in [] leaves the ecay factor a > unspecifie, whereas our results in 8) an 3) make the ecay factor a explicit namely, a = 5/3 in the wavelet case an a = 3/ in the Weyl-Heisenberg case), iii) the exponential energy ecay result in 0) as establishe in [] applies to all f L ) an thus, in particular, to Sobolev input signals owing to H s ) L ), for all s > 0), whereas our ecay results in ), 8), an 3) pertain to Sobolev input signals f H s ), s > 0, only, iv) the technical elements employe to arrive at the upper bouns on W N f), specifically, while the proof in [] makes explicit use of the uniform covering property of the filters, our proof of 0) is completely oblivious to the algebraic) structure of the filters, v) the assumptions impose on the filters, i.e., the vanishing moments an uniform covering conition in [, Definition. a)-b)], in contrast to our Assumption, which is less restrictive, an thereby makes our results in Theorem apply to general possibly unstructure) filters that, in aition, can be ifferent in ifferent network layers. V. NUMBE OF LAYES NEEDED DCNNs use in practice employ potentially hunres of layers [7]. Such network epths entail formiable computational challenges both in training an in operating the network. It is therefore important to unerstan how many layers are neee to have most of the input signal energy be containe in the feature vector. This will be one by consiering Parseval frames in all layers, i.e., frames with frame bouns A n = B n =, n N. The energy conservation result 3) then implies that the infinite-epth feature vector Φ Ω f) = Φ n Ωf) contains the entire input signal energy accoring to Φ Ω f) = Φn Ω f) = f. Now, the ecomposition 5) reveals that thanks to lim W Nf) 0, N increasing the network epth N implies that the feature vector f) progressively contains a larger fraction of the N Φn Ω

9 9 ε) wavelets Weyl-Heisenberg filters general filters Table I: Number N of layers neee to ensure that ε) 00)% of the input signal energy are containe in the features generate in the first N network layers. input signal energy. We formalize the question on the number of layers neee by asking for bouns of the form N ε) Φn Ω f) f, 3) i.e., by etermining the network epth N guaranteeing that at least ε) 00)% of the input signal energy are capture by the corresponing epth-n feature vector N Φn Ω f). Moreover, 3) ensures that the epth-n feature extractor N Φn Ω exhibits a trivial null-set. The following results establish hany estimates of the number N of layers neee to guarantee 3). For peagogical reasons, we start with the case of ban-limite input signals an then procee to a more general statement pertaining to Sobolev functions. Corollary. i) Let Ω be the moule-sequence ) with filters {g λn } λn Λ n satisfying the conitions in Assumption, an let the corresponing frame bouns be A n = B n =, n N. Let > 0 be the raius of the spectral gap B 0) left by the filters {g λn } λn Λ n accoring to 7). Furthermore, let l > / +, ε 0, ), α as efine in 9), an f L ) L-ban-limite. If ) /α L N, 33) ε) l ) then 3) hols. ii) Assume that the conitions in Theorem i) an ii) hol. For the wavelet case, let a = 5 3 an = where correspons to the raius ) of the spectral gap left by the wavelets {g j } j Z\{0}. For the Weyl-Heisenberg case, let a = 3 an = here, correspons to the raius of the spectral ) gap left by the Weyl-Heisenberg filters {g k } k Z\{0}. Moreover, let l >, ε 0, ), an f L ) L-ban-limite. If ) L N log a, 34) ε) l ) then 3) hols in both cases. Proof. See Appenix H. Corollary nicely shows how the escription complexity of the signal class uner consieration, namely the banwith L an the imension through the ecay exponent α efine in 9) etermine the number N of layers neee. Specifically, 33) an 34) show that larger banwiths L an larger imension rener the input signal f more complex, which requires eeper networks to capture most of the energy of f. The epenence of the lower bouns in 33) an 34) on the network properties, through the moule-sequence Ω, is through the ecay factor a > an the raius of the spectral gap left by the filters {g λn } λn Λ n. The following numerical example provies quantitative insights on the influence of the parameter ε on 33) an 34). Specifically, we set L =, =, = which implies α =, see 9)), l =.000, an show in Table I the number N of layers neee accoring to 33) an 34) for ifferent values of ε. The results show that 95% of the input signal energy are containe in the first 8 layers in the wavelet case an in the first 0 layers in the Weyl-Heisenberg case. We can therefore conclue that in practice a relatively small number of layers is neee to have most of the input signal energy be containe in the feature vector. In contrast, for general filters, where we can guarantee polynomial energy ecay only, N = 39 layers are neee to absorb 95% of the input signal energy. We hasten to a, however, that 0) simply guarantees polynomial energy ecay an oes not preclue the energy from ecaying faster in practice. We procee with the estimates for Sobolev-class input signals. Corollary. i) Let Ω be the moule-sequence ) with filters {g λn } λn Λ n satisfying the conitions in Assumption, an let the corresponing frame bouns be A n = B n =, n N. Let > 0 be the raius of the spectral gap B 0) left by the filters {g λn } λn Λ n accoring to 7). 
Furthermore, let l > / +, ε 0, ), α as efine in 9), an f H s )\{0}, for s > 0. If N l f /γ H s ε /γ f /γ ) /α, 35) where γ := min{, s}, then 3) hols. ii) Assume that the conitions in Theorem i) an ii) hol. For the wavelet case, let a = 5 3 an = where correspons to the raius ) of the spectral gap left by the wavelets {g j } j Z\{0}. For the Weyl-Heisenberg case, let a = 3 an = here, correspons to the raius of the spectral ) gap left by the Weyl-Heisenberg filters {g k } k Z\{0}. Furthermore, let l >, ε 0, ), an f H s )\{0}, for s > 0. If ) l f /γ H N log s a, 36) ε /γ f /γ where γ := min{, s}, then 3) hols. Proof. See Appenix I. As alreay mentione in Section III, Sobolev spaces H s ) contain a wie range of practically relevant signal classes. The results in Corollary therefore provie for a wie variety of input signals a picture of how many layers are neee to have most of the input signal energy be containe in the feature vector.

10 0 The with of the networks consiere throughout the paper is, in principle, infinite as the sets Λ n nee to be countably infinite in orer to guarantee that the frame property ) is satisfie. For input signals that exhibit mil spectral ecay, the number of operationally significant noes will, however, be finite in practice. For a treatment of this aspect as well as results on epth-with traeoffs, the intereste reaer is referre to [44]. APPENDIX A A FEATUE EXTACTO WITH A NON-TIVIAL NULL-SET We show, by way of example, that employing filters Ψ n which satisfy the frame property ) alone oes not guarantee a trivial null-set for the feature extractor Φ Ω. Specifically, we construct a feature extractor Φ Ω base on filters satisfying ) an a corresponing function f 0 with f N Φ Ω ). Our example employs, in every network layer, filters Ψ = {χ} {g k } k Z that satisfy the Littlewoo-Paley conition 8) with A = B =, an where g 0 is such that ĝ 0 ) =, for B 0), an arbitrary else of course, as long as the Littlewoo-Paley conition 8) with A = B = is satisfie). We emphasize that no further restrictions are impose on the filters {χ} {g k } k Z, specifically χ nee not be of low-pass nature an the filters {g k } k Z may be structure such as wavelets [9, Appenix B]) or unstructure such as ranom filters [48], [49]), as long as they satisfy the Littlewoo-Paley conition 8) with A = B =. Now, consier the input signal f L ) accoring to f) := ) l +,, with l > / +. Then f g 0 = f, owing to supp f ) = B 0) an ĝ 0 ) =, for B 0). Moreover, f is a positive efinite raial basis function [50, Theorem 6.0] an hence by [50, Theorem 6.8] fx) 0, x, which, in turn, implies f = f. This yiels U[q N 0 ]f = f g0 g 0 g0 = f, for q0 N := 0, 0,..., 0) Z N an N N. Owing to the energy ecomposition ientity 4), together with A N Ω = BN Ω =, N N, which, in turn, is by A n = B n =, n N, we have f = N = N Φ n Ωf) + W N f) Φ n Ωf) + U[q0 N ]f + U[q]f }{{}, = f q Z N \{q0 N } for N N. This implies N Φ n Ωf) + q Z N \{q N 0 } U[q]f = 0. 37) As both terms in 37) are positive, we can conclue that N Φn Ω f) = 0, N N, an thus Φ Ω f) = Φ n Ωf) = 0. Since Φ Ω f) = 0 implies Φ Ω f) = 0, we have constructe a non-zero f, namely fx) = ) l +e πi x,, that maps to the all-zeros feature vector, i.e., f N Φ Ω ). The point of this example is the following. Owing to the nature of ĝ 0 ) namely, ĝ 0 ) =, for B 0)) an the Littlewoo-Paley conition χ) + k Z ĝ k ) =, a.e., it follows that neither the output-generating filter χ nor any of the other filters g k, k Z\{0}, can have spectral support in B 0). Consequently, the only non-zero contribution to the feature vector can come from U[q N 0 ]f = f, which, however, thanks to supp f ) = B 0), is spectrally isjoint from the output-generating filter χ. Therefore, Φ Ω f) will be ientically equal to 0. Assumption isallows this situation as it forces the filters g k, k Z, to be of highpass nature which, in turn, implies that χ must have lowpass characteristics. The punch-line of our general results on energy conservation, be it for finite N or for N, is that Assumption in combination with the frame property an the moulus non-linearity prohibit a non-trivial null-set in general. APPENDIX B SOBOLEV SMOOTHNESS OF CATOON FUNCTIONS Cartoon functions, introuce in [35], satisfy mil ecay properties an are piecewise continuously ifferentiable apart from curve iscontinuities along smooth hypersurfaces. 
This function class has been wiely aopte in the mathematical signal processing literature [5], [9], [6], [36], [37] as a stanar moel for natural images such as, e.g., images of hanwritten igits [38] see Figure 4). We will work with the following relative to the efinition in [35] slightly moifie version of cartoon functions. Definition. The function f : C is referre to as a cartoon function if it can be written as f = f + D f, where D is a compact omain whose bounary D is a compact topologically embee C -hypersurface of without bounary 4, an f i H / ) C, C), i =,, satisfy the ecay conition f i x) C x, i =,, for some C > 0 not epening on f,f ). Furthermore, we enote by C K CAT := {f + D f f i H / ) C, C), i =,, f i x) K x, vol D) K, f K} the class of cartoon functions of size K > 0. 4 We refer the reaer to [5, Chapter 0] for a review on ifferentiable manifols.

11 Even though cartoon functions are in general iscontinuous, they still amit Sobolev smoothness. The following result formalizes this statement. Lemma. Let K > 0. Then, CCAT K Hs ), for all s 0, /). Proof. Let f + D f ) CCAT K. We first establish D H s ), for all s 0, /). To this en, we efine the Sobolev-Sloboeckij semi-norm [5, Section..] f H s := fx) fy) x y s+ x y ) /s, an note that, thanks to [5, Section..], D H s ) if D H s <. We have D s H = D x) D y) s x y s+ x y = t s+ D x) D x t) x t, where we employe the change of variables t = x y. Next, we note that, for fixe t, the function h t x) := D x) D x t) satisfies h t x) =, for x S t, where S t := {x x D an x t / D} {x x / D an x t D} =D D + t), 38) an h t x) = 0, for x \S t. It follows from 38) that vol S t ) vol D), t. 39) Moreover, owing to S t D + B t 0) ), where D + B t 0)) is a tube of raius t aroun the bounary D of D see Figure 6), an Lemma, state below, there exists a constant C D > 0 such that vol S t ) vol D + B t 0)) C D t, 40) for all t with t. Next, fix such that 0 < <. Then, D s H = s t s+ D x) D x t) x t = t s+ h t x)x t vol S t ) = t s+ x t = t S t t s+ vol D) C D t + t 4) t s+ t s+ \B 0) = vol D) vol B 0)) B 0) r s+) r } {{ } =:I + C D vol B 0)) r s r, 4) 0 } {{ } =:I where in 4) we employe 39) an 40), an in the last D t D D + t) Fig. 6: Illustration in imension =. The set D + t) grey) is obtaine by translating the set D white) by t. The symmetric ifference D D+t) is containe in D + B t 0)), the tube of raius t aroun the bounary D of D. step we introuce polar coorinates. The integral I is finite for all s > 0, while I is finite for all s < /. Moreover, vol D) = x is finite owing to D being compact an D thus boune). We can therefore conclue that 4) is finite for s 0, /), an hence D H s ), for s 0, /). To see that f + D f ) H s ), for s 0, /), we first note that f + D f H s f H s + D f H s, 43) which is thanks to the sub-aitivity of the semi-norm H s. Now, the first term on the HS of 43) is finite owing to f H / ) H s ), for all s 0, /). For the secon term on the HS, we start by noting that an D f s H = s D f )x) D f )y) ) x y s+ x y D f )x) D f )y) = D x) D y))f x) + f x) f y)) D y) 44) D x) D y)) f x) 45) + f x) f y)) D y), 46) where 45) an 46) are thanks to a + b a + b, for a, b C. Substituting 45) an 46) into 44) an noting that f x) f K, x, which is by assumption, an D y), y, implies D f s H s K D s H + f s s Hs <, 47) where in the last step we use D H s ), establishe above, an f H / ) H s ), both for all s 0, /). This completes the proof. It remains to establish the secon inequality in 40). Lemma. Let M be a compact topologically embee C - hypersurface of without bounary an let T M, r) := { x inf x y r}, r > 0, y M be the tube of raius r aroun M. Then, there exists a constant

12 C M > 0 that oes not epen on r) such that for all r it hols that vol T M, r)) C M r. 48) Proof. The proof is base on Weyl s tube formula [54]. Let κ := max κ i, i {,..., } where κ i is the i-th principal curvature of the hypersurface M see [53, Section 3.] for a formal efinition). It follows from [53, Theorem 8.4 i)] that vol T M, r)) = i=0 r i+ k i M) i j=0 + j), for all r κ, where k i M) = M H ix)x, i {0,..., }, with H i enoting the so-calle i)-th curvature of M, see [53, Section 4.] for a formal efinition. Now, thanks to M being a C -hypersurface, we have that H i, i {0,..., }, is boune see [53, Section 4.]), which together with M compact an thus boune) implies k i M) <, for all i {0,..., }. Moreover, by efinition, k i M), i {0,..., }, is inepenent of the tube raius r. Therefore, setting ) k i M) C M := + max i i j=0 + j) establishes 48) for 0 < r min{, κ }. It remains to prove 48) for min{, κ } < r. Let := inf{ > 0 M B 0)} an D := vol B +0)). Since it follows that vol T M, r)) D, 0 < r, vol T M, r)) < D max{, κ} r, for all min{, κ } < r, which establishes 48) for min{, κ } < r an thereby conclues the proof. APPENDIX C POOF OF STATEMENT I) IN THEOEM We start by establishing 0) with α = log / /)), for. Then, we sharpen our result in the -D case by proving that 0) hols for = with α =. This leas to a significant improvement, in the -D case, of the ecay exponent from log / /)) = to. The iea for the proof of 0) for α = log / /)),, is to establish that 5 U[q]f q Λ n Λ n+ Λ n+n Cn n+n f) ) r l N α, 49) 5 We prove the more general result 49) for technical reasons, concretely in orer to be able to argue by inuction over path lengths with flexible starting inex n. for N N, where n+n Cn n+n := max{, B k }. k=n Setting n = in 49) an noting that C N = BΩ N yiels the esire result 0). We procee by inuction over the path length lq) := N, for q = λ n, λ n+,..., λ n+n ) Λ n Λ n+ Λ n+n. Starting with the base case N =, we have U[q]f = f g λn q Λ n λ n Λ n = ĝ λn ) f) 50) λ n Λ n B n f) 5) \B 0) max{, B n } f) r l }{{} =C n n ), 5) for N N, where 50) is by Parseval s formula, 5) is thanks to 7) an 8), an 5) is ue to supp r l ) B 0) an 0 r l ), for. The inuctive step is establishe as follows. Let N > an suppose that 49) hols for all paths q of length lq) = N, i.e., U[q]f q Λ n Λ n+ Λ n+n Cn n+n f) ) r l N ) α, 53) for n N. We start by noting that every path q Λ n Λ n+... Λ n+n of length l q) = N, with arbitrary starting inex n, can be ecompose into a path q Λ n+... Λ n+n of length lq) = N an an inex λ n Λ n accoring to q = λ n, q). Thanks to 4) we have which yiels U[ q] = U[λ n, q)] = U[q]U n [λ n ], U[q]f q Λ n Λ n+ Λ n+n = U[q] λ n Λ n q Λ n+ Λ n+n Un [λ n ]f ), 54) for n N. We procee by examining the inner sum on the HS of 54). Invoking the inuction hypothesis 53) with n replace by n + ) an employing Parseval s formula, we get Un [λ n ]f ) q Λ n+ Λ n+n U[q] Cn+ n+n Un [λ n ]f) ) r l N ) α = Cn+ n+n Un [λ n ]f U n [λ n ]f) r l,n,α, ) = Cn+ n+n f gλn f g λn r l,n,α, ), 55) for n N, where ) r l,n,α, is the inverse Fourier ) transform of r l N ) α. Next, we note that rl N ) α is a positive efinite raial basis function [50, Theorem 6.0] an hence

13 3 by [50, Theorem 6.8] r l,n,α, x) 0, for x. Furthermore, it follows from Lemma 3, state below, that for {ν λn } λn Λ n, we have f g λn r l,n,α, f g λn M νλn r l,n,α, ). 56) Here, we note that choosing the moulation factors {ν λn } λn Λ n appropriately see 60) below) will be key in establishing the inuctive step. Lemma 3. [8, Lemma.7]: Let f, g L ) with gx) 0, for x. Then, f g f M g), for. Inserting 55) an 56) into the inner sum on the HS of 54) yiels U[q]f q Λ n Λ n+ Λ n+n Cn+ n+n f g λn λ n Λ n f g λn M νλn r l,n,α, ) = Cn+ n+n f) h n,n,α, ), N N, 57) where we applie Parseval s formula together with M f = T f, for f L ), an, an set h n,n,α, ) := ĝ λn ) νλn ) r l N ) α. 58) λ n Λ n The key step is now to establish by juiciously choosing {ν λn } λn Λ n the upper boun ) h n,n,α, ) max{, B n } r l N α, 59) for, which upon noting that Cn n+n = max{, B n } Cn+ n+n yiels 49) an thereby completes the proof. We start by efining H Aλn, for λ n Λ n, to be the orthant supporting ĝ λn, i.e., suppĝ λn ) H Aλn, where A λn O), for λ n Λ n see Assumption ). Furthermore, for λ n Λ n, we choose the moulation factors accoring to ) ν λn := A λn ν, 60) where the components of ν are given by ν k := + / ), for k {,..., }. Invoking 6) an 7), we get h n,n,α, ) = ĝ λn ) νλn ) r l N ) α λ n Λ n = ĝ λn ) νλn ) Sλn, ) r l N ) α, 6) λ n Λ n for, where S λn, := H Aλn \B 0). For the first canonical orthant H = {x x k 0, k =,..., }, we show in Lemma 4 below that ν ), r l rl 6) N ) α N α for H\B 0) an N. This will allow us to euce νλn ), r l rl 63) N ) α N α for S λn,, λ n Λ n, an N, where S λn, = H Aλn \B 0), simply by noting that νλn ) r l = N ) α A λn ) ν) l N ) α + ν ) l ν = N ) α = r l 64) + N ) α ) ) r l = N α l N α 65) + = A λn ) l, N α = r l 66) N α + for = A λn H Aλn \B 0), where H\B 0). Here, 64) an 66) are thanks to = A λn, which is by A λn O), an the inequality in 65) is ue to 6). Insertion of 63) into 6) then yiels h n,n,α, ) ĝ λn ) ) Sλn, ) r l N α λ n Λ n = ĝ λn ) ) r l N α 67) λ n Λ n ) max{, B n } r l N α, 68) for, where in 67) we employe Assumption, an 68) is thanks to 8). This establishes 59) an completes the proof of 0) for α = log / /)),. It remains to show 6), which is accomplishe through the following lemma. ) Lemma 4. Let α := log / /), rl :, r l ) := ) l +, with l > / +, an efine ν to have components ν k = + / ), for k {,..., }. Then, ν ), rl rl 69) N ) α N α for H\B 0) an N. Proof. The key iea of the proof is to employ a monotonicity argument. Specifically, thanks to r l monotonically ecreasing in, i.e., r l ) r l ), for, with, 69) can be establishe simply by showing that κ N ) := N N α ν 0, 70) for H\B 0) an N. We first note that for H\B 0) with > N α, 69) is trivially satisfie as the HS of 69) equals zero owing to N α > together with supp r l ) B 0)). It hence suffices to prove 70) for H with N α. To this en, fix τ [, N α ], an efine the spherical segment Ξ τ := { H = τ}.

Harmonic Analysis of Deep Convolutional Neural Networks

Harmonic Analysis of Deep Convolutional Neural Networks Harmonic Analysis of Deep Convolutional Neural Networks Helmut Bőlcskei Department of Information Technology and Electrical Engineering October 2017 joint work with Thomas Wiatowski and Philipp Grohs ImageNet

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

REAL ANALYSIS I HOMEWORK 5
