Lipschitz Properties of General Convolutional Neural Networks
1 Lipschitz Properties of General Convolutional Neural Networks. Radu Balan, Department of Mathematics, AMSC, CSCAMM and NWC, University of Maryland, College Park, MD. Joint work with Dongmian Zou (UMD) and Maneesh Singh (Verisk). March 22, 2017, Image Analysis Seminar, University of Houston, Houston, TX.
2 This material is based upon work supported by the National Science Foundation under Grant No. DMS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The author has been partially supported by ARO under grant W911NF.
3 Table of Contents: 1. Problem Formulation; 2. Deep Convolutional Neural Networks; 3. Lipschitz Analysis; 4. Numerical Results.
4 Problem Formulation Nonlinear Maps. Consider a nonlinear function between two metric spaces, $F : (X, d_X) \to (Y, d_Y)$.
5 Problem Formulation Lipschitz analysis of nonlinear systems. $F : (X, d_X) \to (Y, d_Y)$ is called Lipschitz with constant $C$ if for any $f, \tilde f \in X$, $d_Y(F(f), F(\tilde f)) \le C\, d_X(f, \tilde f)$. The optimal (i.e. smallest) Lipschitz constant is denoted $\mathrm{Lip}(F)$; the square $C^2$ is called the Lipschitz bound (similar to the Bessel bound). $F$ is called bi-Lipschitz with constants $C_1, C_2 > 0$ if for any $f, \tilde f \in X$, $C_1\, d_X(f, \tilde f) \le d_Y(F(f), F(\tilde f)) \le C_2\, d_X(f, \tilde f)$; the squares $C_1^2, C_2^2$ are called Lipschitz bounds (similar to frame bounds).
6 Problem Formulation Motivating Examples. Consider the typical neural network as a feature extractor component in a classification system: $g = F(f) = F_M(\dots F_1(f; W_1, \varphi_1); \dots; W_M, \varphi_M)$, where $F_m(f; W_m, \varphi_m) = \varphi_m(W_m f)$, $W_m$ is a linear operator (matrix), and $\varphi_m$ is a Lip(1) scalar nonlinearity (e.g. the Rectified Linear Unit).
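To make the composition concrete, here is a minimal sketch of such a feature extractor with ReLU nonlinearities; the matrices and input are hypothetical placeholders, not data from the talk:

```python
import numpy as np

def relu(x):
    # ReLU is 1-Lipschitz: |relu(a) - relu(b)| <= |a - b|
    return np.maximum(x, 0.0)

def feature_extractor(f, weights):
    """g = F_M(... F_1(f; W_1, phi_1) ...; W_M, phi_M) with phi_m = ReLU."""
    y = f
    for W in weights:
        y = relu(W @ y)  # F_m(y; W_m, phi_m) = phi_m(W_m y)
    return y

# Hypothetical two-layer example
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 16)), rng.standard_normal((4, 8))]
g = feature_extractor(rng.standard_normal(16), weights)
```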
7 Problem Formulation Motivating Example: AlexNet. Example from [SZSBEGF13] ("Intriguing properties..."). ImageNet dataset [DDSLLF09] (10,184 categories; 8.9 million images); AlexNet architecture [KSH12]. Figure: from Krizhevsky et al. 2012 [KSH12]: AlexNet has 5 convolutional layers + 3 dense layers; input size 224x224x3 pixels; output size 1000.
8 Problem Formulation Motivating Example: AlexNet. The authors of [SZSBEGF13] ("Intriguing properties...") found small, almost imperceptible variations of the input that produced completely different classification decisions. Figure: from Szegedy et al. 2013 [SZSBEGF13]: for 6 different classes, the original image, the difference, and the adversarial example; every adversarial example is classified as "ostrich".
9 Problem Formulation Motivating Example: AlexNet. Szegedy et al. 2013 [SZSBEGF13] computed the Lipschitz constant of each layer (the largest singular value of its linear part), tabulated as Layer | Size | Sing. Val. for Conv1-Conv5 and the three fully connected layers (the first of size 43264); the per-layer values are lost in this transcription. Multiplying them gives the overall Lipschitz bound: Lip = 5,612,006.
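The chain-rule estimate behind this table multiplies the largest singular value of each layer; since ReLU is 1-Lipschitz, $\mathrm{Lip}(F) \le \prod_m \|W_m\|_2$. A sketch of that computation on hypothetical weight matrices:

```python
import numpy as np

def chain_rule_lipschitz_bound(weights):
    """Upper bound Lip(F) by the product of per-layer spectral norms,
    valid when every scalar nonlinearity is 1-Lipschitz."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, 2)  # largest singular value of W
    return bound
```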
10 Problem Formulation Motivating Example: Scattering Network. Example of a scattering network; definition and properties in [Mallat12]; this example from [BSZ17]. Input: $f$; outputs: $y = (y_{l,k})$.
11-14 Problem Formulation Motivating Example: Scattering Network. Remarks: outputs are taken from each layer; the topology is tree-like; backpropagation/chain rule gives a Lipschitz bound of 40; Mallat's result predicts Lip = 1.
15-16 Problem Formulation Problem 1. Given a deep network, estimate the Lipschitz constant, or bound: $$\mathrm{Lip} = \sup_{f \neq \tilde f \in L^2} \frac{\|y - \tilde y\|_2}{\|f - \tilde f\|_2}, \qquad \mathrm{Bound} = \sup_{f \neq \tilde f \in L^2} \frac{\|y - \tilde y\|_2^2}{\|f - \tilde f\|_2^2}.$$ Methods (approaches): 1. Standard method: backpropagation, or chain rule. 2. New method: storage-function-based approach (dissipative systems). 3. Numerical method: simulations.
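For the third (numerical) approach, one can lower-bound Lip by sampling input pairs and taking the largest output/input distance ratio; a minimal sketch, where `F` is any network callable (a hypothetical placeholder):

```python
import numpy as np

def empirical_lipschitz(F, dim, n_pairs=1000, seed=0):
    """Monte-Carlo lower bound on Lip(F) = sup ||F(f)-F(f~)|| / ||f-f~||."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        f, ft = rng.standard_normal(dim), rng.standard_normal(dim)
        den = np.linalg.norm(f - ft)
        if den > 0:
            best = max(best, np.linalg.norm(F(f) - F(ft)) / den)
    return best  # a lower bound; the true Lip(F) may be larger
```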
17 Problem Formulation Problem 2. Given a deep network, estimate the stability of the output to specific variations of the input: 1. invariance to deformations: $\tilde f(x) = f(x - \tau(x))$, for some smooth $\tau$; 2. covariance to such deformations $\tilde f(x) = f(x - \tau(x))$, for smooth $\tau$ and bandlimited signals $f$; 3. tail bounds when $f$ has a known statistical distribution (e.g. normal with known spectral power).
18 ConvNet Topology. A deep convolutional network is composed of multiple layers.
19 ConvNet One Layer Each layer is composed of two or three sublayers: convolution, downsampling, detection/pooling/merge.
20 ConvNet: Sublayers Linear Filters: Convolution and Pooling-to-Output Sublayer. $f^{(2)} = g * f^{(1)}$, $f^{(2)}(x) = \int g(x - \xi) f^{(1)}(\xi)\, d\xi$, where $g \in \mathcal{B} = \{g \in \mathcal{S}' : \hat g \in L^\infty(\mathbb{R}^d)\}$. $(\mathcal{B}, \|\cdot\|_{\mathcal{B}})$ is a Banach algebra with norm $\|g\|_{\mathcal{B}} = \|\hat g\|_\infty$. Notation: $g$ for regular convolution filters, and $\phi$ for pooling-to-output filters.
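On a sampled grid, the Banach-algebra norm $\|g\|_{\mathcal{B}} = \|\hat g\|_\infty$ can be approximated with the FFT; a sketch assuming a discretized filter `g` (an assumption, since the talk works with continuous filters):

```python
import numpy as np

def banach_algebra_norm(g):
    """Approximate ||g||_B = sup_w |g_hat(w)| for a sampled filter g."""
    return np.abs(np.fft.fft(g)).max()
```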
21 ConvNet: Sublayers Downsampling Sublayer. $f^{(2)}(x) = f^{(1)}(Dx)$. For $f^{(1)} \in L^2(\mathbb{R}^d)$ and $D = D_0 I$, $f^{(2)} \in L^2(\mathbb{R}^d)$ and, by the change of variables $u = Dx$, $$\|f^{(2)}\|_2^2 = \int_{\mathbb{R}^d} |f^{(2)}(x)|^2\, dx = \frac{1}{\det(D)} \int_{\mathbb{R}^d} |f^{(1)}(u)|^2\, du = \frac{1}{D_0^d}\, \|f^{(1)}\|_2^2.$$
22 ConvNet: Sublayers Detection and Pooling Sublayer. We consider three types of detection/pooling/merge sublayers: Type I ($\tau_1$), componentwise addition: $z = \sum_{j=1}^k \sigma_j(y_j)$; Type II ($\tau_2$), $p$-norm aggregation: $z = \left(\sum_{j=1}^k |\sigma_j(y_j)|^p\right)^{1/p}$; Type III ($\tau_3$), componentwise multiplication: $z = \prod_{j=1}^k \sigma_j(y_j)$. Assumptions: (1) the $\sigma_j$ are scalar Lipschitz functions with $\mathrm{Lip}(\sigma_j) \le 1$; (2) if $\sigma_j$ is connected to a multiplication block then $\|\sigma_j\|_\infty \le 1$. A code sketch of the three merge types follows below.
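A minimal sketch of the three merge types on a list `y` of channel arrays, with `sigmas` a list of 1-Lipschitz scalar functions (both hypothetical names, not the talk's code):

```python
import numpy as np

def merge_type_I(y, sigmas):
    # Componentwise addition: z = sum_j sigma_j(y_j)
    return sum(s(yj) for s, yj in zip(sigmas, y))

def merge_type_II(y, sigmas, p=2):
    # p-norm aggregation: z = (sum_j |sigma_j(y_j)|^p)^(1/p)
    return sum(np.abs(s(yj)) ** p for s, yj in zip(sigmas, y)) ** (1.0 / p)

def merge_type_III(y, sigmas):
    # Componentwise multiplication: z = prod_j sigma_j(y_j);
    # the analysis requires |sigma_j| <= 1 for multiplicative blocks
    z = np.ones_like(y[0])
    for s, yj in zip(sigmas, y):
        z = z * s(yj)
    return z
```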
23-24 ConvNet: Sublayers MaxPooling and AveragePooling. Both MaxPooling and AveragePooling can be implemented using the sublayers above; the slides' block diagrams are lost in this transcription, but a standard construction is sketched below.
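A minimal sketch of such constructions for 1-D signals, an illustration rather than the talk's exact diagram: the identity max(a, b) = b + relu(a - b) uses only linear combinations and a 1-Lipschitz nonlinearity, and average pooling is convolution with a constant filter followed by downsampling:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max2(a, b):
    # max(a, b) = b + relu(a - b): a linear map plus a 1-Lipschitz
    # nonlinearity, so it fits the sublayer framework above
    return b + relu(a - b)

def average_pool(f, width):
    # convolution with a constant filter of total mass 1, then downsampling
    kernel = np.ones(width) / width
    smoothed = np.convolve(f, kernel, mode="valid")
    return smoothed[::width]
```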
25 ConvNet: Layer m. Components of the $m$-th layer.
26-27 ConvNet: Layer m Topology coding of the $m$-th layer. $n_m$ denotes the number of input nodes in the $m$-th layer: $I_m = \{N_{m,1}, N_{m,2}, \dots, N_{m,n_m}\}$. Filters: 1. pooling filter $\phi_{m,n}$ for node $n$ in layer $m$; 2. convolution filter $g_{m,n;k}$ from input node $n$ to output node $k$ in layer $m$; for node $n$, $\mathcal{G}_{m,n} = \{g_{m,n;1}, \dots, g_{m,n;k_{m,n}}\}$. The set of all convolution filters in layer $m$: $\mathcal{G}_m = \cup_{n=1}^{n_m} \mathcal{G}_{m,n}$. $O_m = \{N'_{m,1}, N'_{m,2}, \dots, N'_{m,n'_m}\}$ is the set of output nodes of the $m$-th layer. Note that $n'_m = n_{m+1}$ and there is a one-to-one correspondence between $O_m$ and $I_{m+1}$. The output nodes automatically partition $\mathcal{G}_m$ into $n'_m$ disjoint subsets, $\mathcal{G}_m = \cup_{n'=1}^{n'_m} \mathcal{G}'_{m,n'}$, where $\mathcal{G}'_{m,n'}$ is the set of filters merged into $N'_{m,n'}$.
28 ConvNet: Layer m Topology coding of the $m$-th layer. For each filter $g_{m,n;k}$, we define an associated multiplier $l_{m,n;k}$ in the following way: suppose $g_{m,n;k} \in \mathcal{G}'_{m,n'} \subset \mathcal{G}_m$, and let $K = |\mathcal{G}'_{m,n'}|$ denote the cardinality of $\mathcal{G}'_{m,n'}$. Then $$l_{m,n;k} = \begin{cases} K, & \text{if } g_{m,n;k} \text{ merges into a } \tau_1 \text{ or } \tau_3 \text{ block}, \\ K^{\max\{0,\, 2/p - 1\}}, & \text{if } g_{m,n;k} \text{ merges into a } \tau_2 \text{ block}. \end{cases} \tag{2.1}$$
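As a sketch, the multiplier rule (2.1) in code, with `K` the cardinality of the merge set and `merge_type` in {1, 2, 3} (hypothetical argument names):

```python
def multiplier(K, merge_type, p=2):
    """l_{m,n;k} from Eq. (2.1): K for Type I/III merges,
    K^max{0, 2/p - 1} for Type II (p-norm) merges."""
    if merge_type in (1, 3):
        return float(K)
    return float(K) ** max(0.0, 2.0 / p - 1.0)
```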
29 Layer Analysis Bessel Bounds. In each layer $m$ and for each input node $n$ we define three types of Bessel bounds. 1st type Bessel bound: $$B^{(1)}_{m,n} = \left\| |\hat\phi_{m,n}|^2 + \sum_{k=1}^{k_{m,n}} l_{m,n;k}\, D_{m,n;k}^d\, |\hat g_{m,n;k}|^2 \right\|_\infty; \tag{3.2}$$ 2nd type Bessel bound: $$B^{(2)}_{m,n} = \left\| \sum_{k=1}^{k_{m,n}} l_{m,n;k}\, D_{m,n;k}^d\, |\hat g_{m,n;k}|^2 \right\|_\infty; \tag{3.3}$$ 3rd type (or generating) bound: $$B^{(3)}_{m,n} = \|\hat\phi_{m,n}\|_\infty^2. \tag{3.4}$$
30 Layer Analysis Bessel Bounds. Next we define the layer-$m$ Bessel bounds: 1st type Bessel bound $B_m^{(1)} = \max_{1 \le n \le n_m} B_{m,n}^{(1)}$ (3.5); 2nd type Bessel bound $B_m^{(2)} = \max_{1 \le n \le n_m} B_{m,n}^{(2)}$ (3.6); 3rd type (generating) Bessel bound $B_m^{(3)} = \max_{1 \le n \le n_m} B_{m,n}^{(3)}$ (3.7). Remark: these bounds characterize semi-discrete Bessel systems.
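Numerically, on a frequency grid, the node bounds (3.2)-(3.4) and layer bounds (3.5)-(3.7) can be approximated as below; `phi_hat`, `g_hats`, `mults`, and `Ds` (the factors $D_{m,n;k}^d$) are hypothetical sampled inputs, a sketch rather than the talk's implementation:

```python
import numpy as np

def node_bessel_bounds(phi_hat, g_hats, mults, Ds):
    """B1, B2, B3 for one input node from sampled Fourier transforms.
    g_hats: list of arrays; mults: l_{m,n;k}; Ds: D_{m,n;k}^d factors."""
    s = sum(l * D * np.abs(gh) ** 2 for l, D, gh in zip(mults, Ds, g_hats))
    B2 = s.max()                              # Eq. (3.3)
    B1 = (np.abs(phi_hat) ** 2 + s).max()     # Eq. (3.2)
    B3 = (np.abs(phi_hat) ** 2).max()         # Eq. (3.4)
    return B1, B2, B3

def layer_bessel_bounds(node_bounds):
    # (3.5)-(3.7): maximize each bound over the layer's input nodes
    arr = np.array(node_bounds)   # shape (n_m, 3)
    return arr.max(axis=0)        # B_m^(1), B_m^(2), B_m^(3)
```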
31 Lipschitz Analysis First Result. Theorem [BSZ17]. Consider a convolutional neural network with $M$ layers as described before, where all scalar nonlinear functions are Lipschitz with $\mathrm{Lip}(\varphi_{m,n,n'}) \le 1$; additionally, those $\varphi_{m,n,n'}$ that aggregate into a multiplicative block satisfy $\|\varphi_{m,n,n'}\|_\infty \le 1$. Let the $m$-th layer 1st type Bessel bound be $$B_m^{(1)} = \max_{1 \le n \le n_m} \left\| |\hat\phi_{m,n}|^2 + \sum_{k=1}^{k_{m,n}} l_{m,n;k}\, D_{m,n;k}^d\, |\hat g_{m,n;k}|^2 \right\|_\infty.$$ Then the Lipschitz bound of the entire CNN is upper bounded by $\prod_{m=1}^M \max(1, B_m^{(1)})$. Specifically, for any $f, \tilde f \in L^2(\mathbb{R}^d)$: $$\|F(f) - F(\tilde f)\|_2^2 \le \left( \prod_{m=1}^M \max(1, B_m^{(1)}) \right) \|f - \tilde f\|_2^2.$$
32 Lipschitz Analysis Second Result. Theorem. Consider a convolutional neural network with $M$ layers as described before, where all scalar nonlinearities satisfy the same conditions as in the previous result. For layer $m$, let $B_m^{(1)}$, $B_m^{(2)}$, and $B_m^{(3)}$ denote the three Bessel bounds defined earlier. Denote by $\Gamma$ the optimal value of the following linear program: $$\Gamma = \max_{y_1, \dots, y_M,\, z_1, \dots, z_M \ge 0} \sum_{m=1}^M z_m \quad \text{s.t.} \quad y_0 = 1, \quad y_m + z_m \le B_m^{(1)} y_{m-1}, \quad y_m \le B_m^{(2)} y_{m-1}, \quad z_m \le B_m^{(3)} y_{m-1}, \quad 1 \le m \le M. \tag{3.8}$$
33 Lipschitz Analysis Second Result, continued. Theorem. Then the Lipschitz bound satisfies $\mathrm{Lip}(F)^2 \le \Gamma$. Specifically, for any $f, \tilde f \in L^2(\mathbb{R}^d)$: $\|F(f) - F(\tilde f)\|_2^2 \le \Gamma\, \|f - \tilde f\|_2^2$.
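A sketch of the linear program (3.8) solved with scipy.optimize.linprog (which minimizes, so the objective is negated); `B1`, `B2`, `B3` are hypothetical per-layer bound lists:

```python
import numpy as np
from scipy.optimize import linprog

def lipschitz_lp_bound(B1, B2, B3):
    """Solve (3.8) with variables x = (y_1..y_M, z_1..z_M), y_0 = 1.
    Returns Gamma; then Lip(F) <= sqrt(Gamma)."""
    M = len(B1)
    c = np.concatenate([np.zeros(M), -np.ones(M)])  # maximize sum z_m
    A, b = [], []
    for m in range(M):
        ym = np.zeros(2 * M); ym[m] = 1.0
        zm = np.zeros(2 * M); zm[M + m] = 1.0
        ym1 = np.zeros(2 * M)
        if m > 0:
            ym1[m - 1] = 1.0
        rhs = 1.0 if m == 0 else 0.0  # y_0 = 1 moves to the right-hand side
        A.append(ym + zm - B1[m] * ym1); b.append(B1[m] * rhs)
        A.append(ym - B2[m] * ym1);      b.append(B2[m] * rhs)
        A.append(zm - B3[m] * ym1);      b.append(B3[m] * rhs)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, None)] * (2 * M))
    return -res.fun  # Gamma

# e.g. for the scattering example, where all four layers' bounds equal 1,
# lipschitz_lp_bound([1]*4, [1]*4, [1]*4) returns 1.0, i.e. Lip <= 1.
```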
34-35 Example 1: Scattering Network. The Lipschitz constant: backpropagation/chain rule gives a Lipschitz bound of 40 (hence Lip ≤ √40 ≈ 6.3). Using our main theorem, Lip ≤ 1, matching Mallat's result: Lip = 1. The filters have been chosen as in a dyadic wavelet decomposition, so $B_m^{(1)} = B_m^{(2)} = B_m^{(3)} = 1$ for $1 \le m \le 4$.
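The dyadic-wavelet claim $B_m^{(1)} = 1$ amounts to a Littlewood-Paley identity $|\hat\phi(\omega)|^2 + \sum_j |\hat\psi_j(\omega)|^2 = 1$. A quick numeric check with ideal (Shannon-type) band-pass filters, an illustration rather than the talk's exact filter bank:

```python
import numpy as np

w = np.linspace(-8, 8, 4001)
phi_hat = (np.abs(w) <= 1).astype(float)              # ideal low-pass
psi_hats = [((2**j < np.abs(w)) & (np.abs(w) <= 2**(j + 1))).astype(float)
            for j in range(3)]                        # dyadic band-passes

total = phi_hat**2 + sum(p**2 for p in psi_hats)
assert np.allclose(total, 1.0)  # partition of unity on |w| <= 8
```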
36 Example 2: A General Convolutive Neural Network
37 Example 2: A General Convolutive Neural Network. Set $p = 2$ and: $$F(\omega) = \exp\!\left(\frac{4\omega^2 + 4\omega + 1}{4\omega^2 + 4\omega}\right)\chi_{(-1,-1/2)}(\omega) + \chi_{(-1/2,1/2)}(\omega) + \exp\!\left(\frac{4\omega^2 - 4\omega + 1}{4\omega^2 - 4\omega}\right)\chi_{(1/2,1)}(\omega),$$ $$\hat\phi_1(\omega) = F(\omega), \qquad \hat g_{1,j}(\omega) = F(\omega + 2j - 1/2) + F(\omega - 2j + 1/2), \quad j = 1, 2, 3, 4,$$ $$\hat\phi_2(\omega) = \exp\!\left(\frac{4\omega^2 + 12\omega + 9}{4\omega^2 + 12\omega + 8}\right)\chi_{(-2,-3/2)}(\omega) + \chi_{(-3/2,3/2)}(\omega) + \exp\!\left(\frac{4\omega^2 - 12\omega + 9}{4\omega^2 - 12\omega + 8}\right)\chi_{(3/2,2)}(\omega),$$ $$\hat g_{2,j}(\omega) = F(\omega + 2j) + F(\omega - 2j), \quad j = 1, 2, 3, \qquad \hat g_{2,4}(\omega) = F(\omega + 2) + F(\omega - 2), \qquad \hat g_{2,5}(\omega) = F(\omega + 5) + F(\omega - 5),$$ $$\hat\phi_3(\omega) = \exp\!\left(\frac{4\omega^2 + 20\omega + 25}{4\omega^2 + 20\omega + 24}\right)\chi_{(-3,-5/2)}(\omega) + \chi_{(-5/2,5/2)}(\omega) + \exp\!\left(\frac{4\omega^2 - 20\omega + 25}{4\omega^2 - 20\omega + 24}\right)\chi_{(5/2,3)}(\omega).$$
38 Example 2: A General Convolutive Neural Network. Bessel bounds: $B_m^{(1)} = 2e^{-1/3} \approx 1.43$, $B_m^{(2)} = B_m^{(3)} = 1$. The Lipschitz bound: using backpropagation/chain rule, $\mathrm{Lip}^2 \le 5$; Theorem 1 and Theorem 2 (linear program) give progressively tighter bounds (the numerical values are lost in this transcription).
39-42 References
[BSZ17] Radu Balan, Maneesh Singh, Dongmian Zou, Lipschitz properties for deep convolutional networks, arXiv preprint [cs.LG], 18 Jan.
[BZ15] Radu Balan, Dongmian Zou, On Lipschitz analysis and Lipschitz synthesis for the phase retrieval problem, arXiv v1 [math.FA], 6 June 2015; Linear Algebra and its Applications 496 (2016).
[BGC16] Yoshua Bengio, Ian Goodfellow, Aaron Courville, Deep Learning, MIT Press, 2016.
[BM13] Joan Bruna, Stéphane Mallat, Invariant scattering convolution networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013), no. 8.
[BCLPS15] Joan Bruna, Soumith Chintala, Yann LeCun, Serkan Piantino, Arthur Szlam, Mark Tygert, A theoretical argument for complex-valued convolutional networks, CoRR (2015).
[DDSLLF09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, ImageNet: A large-scale hierarchical image database, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[HS97] Sepp Hochreiter, Jürgen Schmidhuber, Long short-term memory, Neural Computation 9 (1997), no. 8.
[KSH12] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25, 2012.
[LBH15] Yann LeCun, Yoshua Bengio, Geoffrey Hinton, Deep learning, Nature 521 (2015), no. 7553.
[LSS14] Roi Livni, Shai Shalev-Shwartz, Ohad Shamir, On the computational efficiency of training neural networks, Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger, eds.), Curran Associates, Inc., 2014.
[Mallat12] Stéphane Mallat, Group invariant scattering, Communications on Pure and Applied Mathematics 65 (2012), no. 10.
[SVS15] Tara N. Sainath, Oriol Vinyals, Andrew W. Senior, Haşim Sak, Convolutional, long short-term memory, fully connected deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, April 19-24, 2015.
[SLJSRAEVR15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going deeper with convolutions, CVPR 2015.
[SZSBEGF13] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, Rob Fergus, Intriguing properties of neural networks, CoRR (2013).
[WB15a] Thomas Wiatowski, Helmut Bölcskei, Deep convolutional neural networks based on semi-discrete frames, Proc. IEEE International Symposium on Information Theory (ISIT), June 2015.
[WB15b] Thomas Wiatowski, Helmut Bölcskei, A mathematical theory of deep convolutional neural networks for feature extraction, IEEE Transactions on Information Theory (2015).