Continuous NN. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University.



Course Overview

Lecture 1: Linear Models
1. Introduction and Applications
2. Linear Models: Least-Squares and Logistic Regression

Lecture 2: Neural Networks
1. Introduction to Nonlinear Models
2. Single Layer Neural Networks
3. Training Algorithms for Single Layer Neural Networks
4. Neural Networks and Residual Neural Networks (ResNets)

Lecture 3: Neural Networks as Differential Equations
1. ResNets as ODEs
2. Residual CNN and their relation to PDEs

Stability

Motivation: Nonlinear Models

In general, it is impossible to find a linear separator between classes.

[Figure: input features (left) and transformed features (right)]

Goal/Trick: Embed the points in a higher dimension and/or move the points to make them linearly separable.

Stability of Deep Residual Networks

Why are ResNets more stable? A small change:

$$Y_1 = Y_0 + h\,\sigma(K_0 Y_0 + b_0), \quad \ldots, \quad Y_N = Y_{N-1} + h\,\sigma(K_{N-1} Y_{N-1} + b_{N-1}).$$

This is nothing but a forward Euler discretization of the ordinary differential equation (ODE)

$$\partial_t Y(t) = \sigma(K(t) Y(t) + b(t)), \qquad Y(0) = Y_0.$$

We can understand the behavior by learning the dynamics of nonlinear ODEs [6, 5].
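The layer recursion above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: a single example stored as a list, `tanh` as the activation, a bias vector per layer, and hypothetical values for `K`, `b`, `h`, and the number of layers `N`. The function names are not from the course codes.

```python
import math

def matvec(K, y):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(k_ij * y_j for k_ij, y_j in zip(row, y)) for row in K]

def resnet_forward(y0, Ks, bs, h):
    """Forward Euler / ResNet propagation: Y_{j+1} = Y_j + h * tanh(K_j Y_j + b_j)."""
    y = list(y0)
    for K, b in zip(Ks, bs):
        z = matvec(K, y)
        y = [yi + h * math.tanh(zi + bi) for yi, zi, bi in zip(y, z, b)]
    return y

# Hypothetical example: two features, N = 4 layers that share the same weights.
K = [[0.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0]
N = 4
y_N = resnet_forward([1.0, 0.0], [K] * N, [b] * N, h=0.1)
```

Halving `h` while doubling `N` approximates the same final time of the ODE, which is the sense in which depth corresponds to the time horizon of the continuous model.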

Crash Course: ODE

Crash Course on ODEs

Given the ODE

$$\partial_t y(t) = f(t, y(t))$$

Assumptions:
1. $f$ is differentiable with Jacobian $J(t, y) = \partial f / \partial y$
2. $J$ changes sufficiently slowly in time

Then (see also [2, 3, 1]):
- If $\mathrm{Re}(\mathrm{eig}(J)) > 0$: unstable
- If $\mathrm{Re}(\mathrm{eig}(J)) < 0$: stable (converges to a stationary point)
- If $\mathrm{Re}(\mathrm{eig}(J)) = 0$: stable, energy bounded

Stability of Residual Network

Assume forward propagation of a single example $y_0$:

$$\partial_t y(t) = \sigma(K(t) y(t) + b(t)), \qquad y(0) = y_0.$$

The Jacobian is

$$J(t) = \mathrm{diag}\big(\sigma'(K(t) y(t) + b(t))\big)\, K(t).$$

Here, $\sigma'(x) \geq 0$ for tanh, ReLU, ... Hence, the problem is stable when
1. $J$ changes slowly in time
2. $\mathrm{Re}(\mathrm{eig}(K(t))) \leq 0$ for every $t$

Remember that we learn $K$, so we must ensure stability by regularization/constraints!

Residual Network as a Path Planning Problem

Forward propagation in a residual network (continuous):

$$\partial_t Y(t) = \sigma(K(t) Y(t) + b(t)), \qquad Y(0) = Y_0.$$

The goal is to plan a path (via $K$ and $b$) such that the initial data can be linearly separated.

Question: What is a layer, what is depth?

Stability: Continuous vs. Discrete

Assume $K$ is chosen so that the (continuous) forward propagation is stable:

$$\partial_t Y(t) = \sigma(K(t) Y(t) + b(t)), \qquad Y(0) = Y_0.$$

And assume we use the forward Euler method to discretize:

$$Y_{j+1} = Y_j + h\,\sigma(K_j Y_j + b_j).$$

Is the network stable? Not always...

Stability: A Simple Example

Look at the simplest possible forward propagation

$$\partial_t Y(t) = \lambda Y(t).$$

And assume we use forward Euler to discretize:

$$Y_{j+1} = Y_j + h \lambda Y_j = (1 + h\lambda) Y_j.$$

Then the method is stable only if $|1 + h\lambda| \leq 1$.

Not every network is stable! The time step size depends on $\lambda$.
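A quick numerical check of the condition $|1 + h\lambda| \leq 1$, written as a small Python sketch. The values of `lam`, `h`, and the step count are hypothetical and chosen only to put one run inside and one run outside the stability region.

```python
def euler_test_equation(lam, h, n_steps, y0=1.0):
    """Forward Euler for y' = lam * y: each step multiplies y by (1 + h * lam)."""
    y = y0
    for _ in range(n_steps):
        y = (1.0 + h * lam) * y
    return y

lam = -10.0  # the continuous problem decays, so it is stable
stable = euler_test_equation(lam, h=0.15, n_steps=50)    # amplification factor about -0.5
unstable = euler_test_equation(lam, h=0.30, n_steps=50)  # amplification factor about -2
```

Both runs discretize the same stable ODE; only the step size differs, and the larger step makes the iterates grow in magnitude.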

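Why you should care about stability

$$\min_\theta \tfrac{1}{2} \| Y_N(\theta) - C \|_F^2, \qquad Y_{j+1}(\theta) = Y_j(\theta) + \tfrac{10}{N} \tanh\big(K Y_j(\theta)\big),$$

where $C = Y_{200}(1, 1)$, $Y_0 \sim \mathcal{N}(0, 1)$, and

$$K(\theta) = \begin{pmatrix} -\theta_1 - \theta_2 & \theta_1 & \theta_2 \\ \theta_2 & -\theta_1 - \theta_2 & \theta_1 \\ \theta_1 & \theta_2 & -\theta_1 - \theta_2 \end{pmatrix}.$$

[Figure: objective function for N = 5 and for N = 100]

Next: Compare different inputs (generalization).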

Why you should care about stability - 2

[Figure: objective for $Y_0^{\rm train}$, objective for $Y_0^{\rm test}$, and their absolute difference; one row for the stable case (N = 100) and one for the unstable case (N = 5)]

Stability: A Non-Trivial Example

Consider the antisymmetric kernel model, in which the kernel $K(t)$ is replaced by $K(t) - K(t)^\top$. Here, $\mathrm{Re}(\mathrm{eig}(J(t))) = 0$ for all $\theta$.

Assume we use forward Euler to discretize:

$$Y_{j+1} = Y_j + h\,\sigma\big((K_j - K_j^\top) Y_j + b_j\big).$$

Tricky question: How to pick $h$ to ensure stability?

Answer: Impossible, since the eigenvalues of the Jacobian are imaginary. We need a method other than forward Euler.
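To see why no step size helps, the sketch below applies forward Euler to the simplest antisymmetric system. It is a simplified illustration, not the network itself: the activation is taken to be the identity and the kernel is the hypothetical 2x2 matrix [[0, omega], [-omega, 0]], whose eigenvalues are purely imaginary. Each step multiplies the norm of the state by sqrt(1 + (h*omega)^2), which exceeds 1 for every h > 0.

```python
import math

def euler_rotation_norm(omega, h, n_steps):
    """Forward Euler for y' = [[0, omega], [-omega, 0]] y, starting at (1, 0).

    Returns the Euclidean norm of the final state. The exact solution is a
    rotation, so its norm stays equal to 1 for all times.
    """
    y = [1.0, 0.0]
    for _ in range(n_steps):
        y = [y[0] + h * omega * y[1], y[1] - h * omega * y[0]]
    return math.hypot(y[0], y[1])

# Hypothetical step sizes: the norm grows for each of them, only more slowly for small h.
norms = {h: euler_rotation_norm(omega=1.0, h=h, n_steps=1000) for h in (0.5, 0.1, 0.01)}
```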

Experiment: Peaks

Compare the three approaches for training a residual neural network:
- EResNet PeaksSGD.m - stochastic gradient descent
- EResNet PeaksNewtonCG.m - Newton CG with block-diagonal Hessian approximation
- EResNet PeaksVarPro.m - fully coupled solver. Eliminate θ and use steepest descent/Newton CG for the reduced problem.

Parametric Models

Motivation

Recall the single layer

$$Z = \sigma(K Y + b),$$

where $Y \in \mathbb{R}^{n_f \times n}$, $K \in \mathbb{R}^{m \times n_f}$, $b \in \mathbb{R}$, and $\sigma$ is an element-wise activation. We saw that $m \gg n_f$ is needed to fit the training data.

Conservative example: Consider MNIST ($n_f = 28^2$) and use $m = n_f$. This gives 614,656 unknowns for a single layer.

Famous quote: With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

Possible remedies:
- Regularization: penalize $K$
- Parametric model: $K(\theta)$ where $\theta \in \mathbb{R}^p$ with $p \ll m\, n_f$.

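Some Simple Parametric Models

Diagonal scaling:

$$K(\theta) = \mathrm{diag}(\theta) \in \mathbb{R}^{n_f \times n_f}$$

Advantage: preserves size and structure of data.

Antisymmetric kernel:

$$K(\theta) = \begin{pmatrix} 0 & \theta_1 & \theta_2 \\ -\theta_1 & 0 & \theta_3 \\ -\theta_2 & -\theta_3 & 0 \end{pmatrix}$$

Advantage?: $\mathrm{real}(\lambda_i(K(\theta))) = 0$.

M-matrix:

$$K(\theta) = \begin{pmatrix} \theta_1 + \theta_2 & -\theta_1 & -\theta_2 \\ -\theta_3 & \theta_3 + \theta_4 & -\theta_4 \\ -\theta_5 & -\theta_6 & \theta_5 + \theta_6 \end{pmatrix}, \qquad \theta \geq 0$$

Advantage: like a differential operator.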

Differentiating Parametric Models

We need derivatives of the model to optimize $\theta$ in

$$E(W \sigma(K(\theta) Y + b), C)$$

(we can re-use previous derivatives and use the chain rule).

Note that all previous models are linear in the following sense:

$$K(\theta) = \mathrm{mat}(Q\,\theta).$$

Therefore, matrix-vector products with the Jacobian are simply

$$J_\theta(K(\theta))\, v = \mathrm{mat}(Q\, v) \quad \text{and} \quad J_\theta(K(\theta))^\top w = Q^\top w,$$

where $v \in \mathbb{R}^p$ and $w \in \mathbb{R}^m$.

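Example: Derivative of M-matrix

$$K(\theta) = \begin{pmatrix} \theta_1 + \theta_2 & -\theta_1 & -\theta_2 \\ -\theta_3 & \theta_3 + \theta_4 & -\theta_4 \\ -\theta_5 & -\theta_6 & \theta_5 + \theta_6 \end{pmatrix}, \qquad \theta \geq 0$$

Verify that this can be written as $K(\theta) = \mathrm{mat}(Q\,\theta)$, where $Q \in \mathbb{R}^{9 \times 6}$.

Note: it is not efficient to construct $Q$ when $p$ is large, but it is helpful when computing derivatives.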
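One way to carry out this verification numerically is sketched below in plain Python. It relies on linearity: column k of Q is vec(K(e_k)) for the k-th unit vector e_k. The off-diagonal minus signs in `m_matrix`, the column-major `vec`/`mat` convention, and all function names are assumptions made for this illustration.

```python
def m_matrix(theta):
    """3x3 M-matrix model (assumed signs: non-positive off-diagonals, zero row sums)."""
    t1, t2, t3, t4, t5, t6 = theta
    return [[t1 + t2, -t1, -t2],
            [-t3, t3 + t4, -t4],
            [-t5, -t6, t5 + t6]]

def vec(K):
    """Stack the columns of K into one vector (column-major order)."""
    return [K[i][j] for j in range(len(K[0])) for i in range(len(K))]

def mat(v, n_rows, n_cols):
    """Inverse of vec: reshape a column-major vector into an n_rows x n_cols matrix."""
    return [[v[j * n_rows + i] for j in range(n_cols)] for i in range(n_rows)]

def build_Q(model, p):
    """Column k of Q is vec(model(e_k)); valid because the model is linear in theta."""
    cols = [vec(model([1.0 if i == k else 0.0 for i in range(p)])) for k in range(p)]
    return [list(row) for row in zip(*cols)]  # convert list of columns to list of rows

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, w):
    """Compute A^T w without forming the transpose."""
    return [sum(A[i][k] * w[i] for i in range(len(A))) for k in range(len(A[0]))]

Q = build_Q(m_matrix, 6)                 # 9 x 6
theta = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical parameters
K = mat(matvec(Q, theta), 3, 3)          # agrees with m_matrix(theta)
```

With Q available, the Jacobian products from the previous slide are `mat(matvec(Q, v), 3, 3)` and `rmatvec(Q, w)`.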

Convolutional Neural Networks [7]

- $y \in \mathbb{R}^{28 \times 28}$: input features
- $z$: output features
- $\theta \in \mathbb{R}^{5 \times 5}$: convolution kernel

CNNs are useful for speech, images, videos, ... They offer an efficient parameterization and efficient codes (GPUs, ...).

Now: CNNs as parametric models and PDEs, simple code; see E13Conv2D.m

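Convolutions in 1D

Let $y, z, \theta : \mathbb{R} \to \mathbb{R}$ be continuous functions. Then

$$z(x) = (\theta * y)(x) = \int_{-\infty}^{\infty} \theta(x - t)\, y(t)\, dt.$$

Assume $\theta(x) \neq 0$ only in the interval $[-a, a]$ (compact support).

A few properties:
- $\theta * y = \mathcal{F}^{-1}\big((\mathcal{F}\theta)(\mathcal{F}y)\big)$, where $\mathcal{F}$ is the Fourier transform
- $\theta * y = y * \theta$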

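Discrete Convolutions in 1D

Let $\theta \in \mathbb{R}^{2k+1}$ be a stencil and $y \in \mathbb{R}^{n_f}$ a grid function:

$$z_i = (\theta * y)_i = \sum_{j=-k}^{k} \theta_j\, y_{i-j}.$$

Example: Discretize with $\theta \in \mathbb{R}^3$ (non-zeros only) and input $x$, output $z \in \mathbb{R}^4$ on a regular grid:

$$\begin{aligned}
z_1 &= \theta_3 w_1 + \theta_2 x_1 + \theta_1 x_2 \\
z_2 &= \theta_3 x_1 + \theta_2 x_2 + \theta_1 x_3 \\
z_3 &= \theta_3 x_2 + \theta_2 x_3 + \theta_1 x_4 \\
z_4 &= \theta_3 x_3 + \theta_2 x_4 + \theta_1 w_2
\end{aligned}$$

where $w_1, w_2$ are used to implement different boundary conditions (right choice? depends...).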
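The four equations translate directly into a padded loop. The sketch below is a minimal Python version for a three-point stencil; the function name `conv1d_stencil` and the `bc` argument are hypothetical, and the two boundary choices shown (zero padding and periodic wrap-around) are only two of the possible options for the ghost values w_1 and w_2.

```python
def conv1d_stencil(theta, x, bc="zero"):
    """Three-point convolution z_i = theta_3*x_{i-1} + theta_2*x_i + theta_1*x_{i+1}.

    The ghost values w_1 (left) and w_2 (right) implement the boundary condition.
    """
    t1, t2, t3 = theta
    if bc == "zero":
        w1, w2 = 0.0, 0.0
    elif bc == "periodic":
        w1, w2 = x[-1], x[0]
    else:
        raise ValueError("unknown boundary condition: " + bc)
    padded = [w1] + list(x) + [w2]
    return [t3 * padded[i] + t2 * padded[i + 1] + t1 * padded[i + 2]
            for i in range(len(x))]

theta = [1.0, 2.0, 3.0]    # hypothetical stencil (theta_1, theta_2, theta_3)
x = [1.0, 2.0, 3.0, 4.0]
z_zero = conv1d_stencil(theta, x, bc="zero")
z_periodic = conv1d_stencil(theta, x, bc="periodic")
```

Only the first and last outputs differ between the two boundary conditions, since only those rows touch the ghost values.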

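Structured Matrices - 1

$$\begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix} =
\begin{pmatrix}
\theta_3 & \theta_2 & \theta_1 & & & \\
 & \theta_3 & \theta_2 & \theta_1 & & \\
 & & \theta_3 & \theta_2 & \theta_1 & \\
 & & & \theta_3 & \theta_2 & \theta_1
\end{pmatrix}
\begin{pmatrix} w_1 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ w_2 \end{pmatrix}$$

Different boundary conditions lead to different structures.

Zero boundary conditions: $w_1 = w_2 = 0$

$$\begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix} =
\begin{pmatrix}
\theta_2 & \theta_1 & & \\
\theta_3 & \theta_2 & \theta_1 & \\
 & \theta_3 & \theta_2 & \theta_1 \\
 & & \theta_3 & \theta_2
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}$$

This is a Toeplitz matrix (constant along diagonals).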

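Structured Matrices - 2

Periodic boundary conditions: $w_1 = x_4$ and $w_2 = x_1$

$$\begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix} =
\begin{pmatrix}
\theta_2 & \theta_1 & & \theta_3 \\
\theta_3 & \theta_2 & \theta_1 & \\
 & \theta_3 & \theta_2 & \theta_1 \\
\theta_1 & & \theta_3 & \theta_2
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}$$

This is a circulant matrix (each row/column is a periodic shift of the previous row/column).

An attractive property of a circulant matrix is that we can efficiently compute its eigendecomposition

$$K(\theta) = F^* \mathrm{diag}(\lambda)\, F,$$

where $F$ is the discrete Fourier transform and the eigenvalues, $\lambda \in \mathbb{C}^4$, can be computed using the first column,

$$\lambda = F(K(\theta) u_1), \qquad u_1 = (1, 0, 0, 0)^\top.$$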

Coding: 1D Convolution using FFTs

Let $\theta \in \mathbb{R}^3$ be some stencil and $n_f = m$.

1. Build a sparse matrix K for computing the convolution with periodic boundary conditions. Hint: spdiags
2. Compute the eigenvalues of K using eig(full(K)) and using fft and the first column of K. Compare!
3. Verify that norm(K*y - real(ifft(lam.*fft(y)))) is small.
4. Repeat the previous item for the transpose.
5. Write code that computes the eigenvalues for arbitrary stencil size without building K. Hint: circshift
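The exercise is posed for MATLAB; the sketch below covers items 1 to 3 in plain Python instead, as an illustration only. It uses a naive O(n^2) transform in place of `fft`, a dense list-of-lists matrix in place of `spdiags`, and hypothetical values for the stencil and the problem size.

```python
import cmath

def dft(v):
    """Unnormalized discrete Fourier transform (same sign convention as fft)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(v):
    """Inverse of dft, including the 1/n factor (same convention as ifft)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]

def first_column(theta, n):
    """First column K(theta) u_1 of the periodic three-point convolution matrix."""
    t1, t2, t3 = theta
    col = [0.0] * n
    col[0], col[1], col[n - 1] = t2, t3, t1
    return col

def circulant(col):
    """Dense circulant matrix with the given first column."""
    n = len(col)
    return [[col[(i - j) % n] for j in range(n)] for i in range(n)]

theta = [1.0, -2.0, 0.5]   # hypothetical stencil (theta_1, theta_2, theta_3)
n = 8                      # hypothetical problem size
col = first_column(theta, n)
K = circulant(col)
lam = dft(col)             # eigenvalues of K from its first column

# Compare the dense product K*y with real(ifft(lam .* fft(y))).
y = [float(i * i % 5) for i in range(n)]
Ky_dense = [sum(K[i][j] * y[j] for j in range(n)) for i in range(n)]
Ky_fft = [z.real for z in idft([l * f for l, f in zip(lam, dft(y))])]
err = max(abs(a - b) for a, b in zip(Ky_dense, Ky_fft))
```

In practice the naive transforms would be replaced by a library FFT; they are written out here only to keep the example self-contained.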

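Derivatives of 1D Convolution - 1

Recall that we need a way to compute

$$J_\theta(K(\theta) Y)\, v \quad \text{and} \quad J_\theta(K(\theta) Y)^\top w, \qquad (J_\theta \in \mathbb{R}^{m \times p})$$

(note that we put $Y$ inside the bracket to avoid tensors).

Assume a single example, $y$. Since we have periodic boundary conditions,

$$K(\theta)\, y = \mathrm{real}\big(F^* (\lambda(\theta) \odot F y)\big) = \mathrm{real}\big(F^* \mathrm{diag}(F y)\, \lambda(\theta)\big), \qquad \lambda(\theta) = F(K(\theta) u_1).$$

We need to differentiate the eigenvalues w.r.t. $\theta$. Note the linearity

$$K(\theta) u_1 = Q\,\theta, \qquad Q = \;?$$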

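Derivatives of 1D Convolution - 2

Assume we have

$$K(\theta)\, y = \mathrm{real}\big(F^* \mathrm{diag}(F y)\, F Q\, \theta\big).$$

Then mat-vecs with the Jacobian are easy to compute:

$$J_\theta(K(\theta) y)\, v = \mathrm{real}\big(F^* (\mathrm{diag}(F y)\, F Q\, v)\big)$$

and (note that $F^\top = F$ and $(F^*)^\top = F^*$)

$$J_\theta(K(\theta) y)^\top w = \mathrm{real}\big(Q^\top F\, \mathrm{diag}(F y)\, F^* w\big).$$

Code this and check the Jacobian and its transpose using conv1d.m!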
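A possible Python sketch of both mat-vecs, together with an adjoint test, is given below. It is not the course's conv1d.m. It assumes a three-point stencil with periodic boundary conditions, takes Q to be the selection matrix that places (theta_2, theta_3, 0, ..., 0, theta_1) in the first column, and uses an unnormalized transform for F with its inverse in the role of F^*. All names and data are hypothetical.

```python
import cmath

def dft(v):
    """Unnormalized discrete Fourier transform (plays the role of F)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(v):
    """Inverse transform including the 1/n factor (plays the role of F^*)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]

def Q_times(v, n):
    """Q v: first column of the periodic convolution matrix for the stencil v."""
    col = [0.0] * n
    col[0], col[1], col[n - 1] = v[1], v[2], v[0]
    return col

def Qt_times(w):
    """Q^T w: pick the three entries of w that Q writes to."""
    return [w[len(w) - 1], w[0], w[1]]

def jac_matvec(y, v):
    """J v = real(F^* (diag(F y) F Q v))."""
    Fy, FQv = dft(y), dft(Q_times(v, len(y)))
    return [z.real for z in idft([a * b for a, b in zip(Fy, FQv)])]

def jac_t_matvec(y, w):
    """J^T w = real(Q^T F diag(F y) F^* w)."""
    Fy, Fiw = dft(y), idft(w)
    return Qt_times([z.real for z in dft([a * b for a, b in zip(Fy, Fiw)])])

# Adjoint test with hypothetical data: w^T (J v) should equal (J^T w)^T v.
y = [1.0, -2.0, 0.5, 3.0, 0.0, 4.0]
v = [0.3, -1.0, 2.0]
w = [1.0, 2.0, -1.0, 0.5, 0.0, -3.0]
gap = abs(sum(a * b for a, b in zip(w, jac_matvec(y, v)))
          - sum(a * b for a, b in zip(jac_t_matvec(y, w), v)))
```

Because K(θ)y is linear in θ, `jac_matvec(y, v)` should also agree with applying the stencil `v` to `y` directly, which gives a second independent check.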


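Extension: Width of CNNs

[Figure: RGB image, input channels, output channels]

The width of a CNN can be controlled by the number of input and output channels of each layer.

Let $y = (y_R, y_G, y_B)$. Then we might compute

$$\begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix} =
\begin{pmatrix}
K_{11}(\theta_{11}) & K_{12}(\theta_{12}) & K_{13}(\theta_{13}) \\
K_{21}(\theta_{21}) & K_{22}(\theta_{22}) & K_{23}(\theta_{23}) \\
K_{31}(\theta_{31}) & K_{32}(\theta_{32}) & K_{33}(\theta_{33}) \\
K_{41}(\theta_{41}) & K_{42}(\theta_{42}) & K_{43}(\theta_{43})
\end{pmatrix}
\begin{pmatrix} y_R \\ y_G \\ y_B \end{pmatrix},$$

where $K_{ij}$ is a 2D convolution operator with stencil $\theta_{ij}$.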

Outlook: Possible Extensions

For now, we have only introduced the very basic convolution layer. CNNs used in practice also use the following components:
- pooling: reduce image resolution (e.g., average over patches)
- stride: for example, a stride of two reduces image resolution by computing z only at every other pixel

Build your own parametric model (ideas for projects):
- M-matrix for convolution
- cheaper convolution models: separable kernels, doubly symmetric kernels
- wavelets, ...
- other sparsity patterns

PDE-based Neural Networks

Deep Convolutional Neural Networks

Deep network with convolution operators parameterized by stencils. Common:

$$Y_{j+1} = P_j Y_j + h\, K_{j,2}\, \sigma(K_{j,1} Y_j + b_j).$$

Motivation for the second convolution?
- approximation properties of a double layer
- if $\sigma(x) = \max(x, 0)$, entries in $Y$ can only grow

Learning tasks:
- image classification
- semantic segmentation

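Convolutions and PDEs

Let $y$ be a 1D grid function ($n$ cells of width $h_x = 1/n$):

$$K(\theta)\, y = [\theta_1\ \theta_2\ \theta_3] * y = \Big( \frac{\beta_1}{4} [1\ \ 2\ \ 1] + \frac{\beta_2}{2 h_x} [-1\ \ 0\ \ 1] + \frac{\beta_3}{h_x^2} [-1\ \ 2\ \ {-1}] \Big) * y.$$

The relation between $\beta$ and $\theta$ is given by

$$\underbrace{\begin{pmatrix}
1/4 & -1/(2 h_x) & -1/h_x^2 \\
1/2 & 0 & 2/h_x^2 \\
1/4 & 1/(2 h_x) & -1/h_x^2
\end{pmatrix}}_{= A(h_x)}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} =
\begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{pmatrix}.$$

In the limit $h_x \to 0$ this gives

$$K(\theta(t)) = \beta_1(t) + \beta_2(t)\, \partial_x + \beta_3(t)\, \partial_x^2.$$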

Algebraic Prolongation of Convolution Kernels

$$K_H = R\, K_h\, P,$$

where
- $K_h$: fine mesh operator (given)
- $R$: restriction (e.g., averaging)
- $P$: prolongation (e.g., interpolation)

Remarks:
- Galerkin: $R = \gamma P^\top$
- Coarse to fine: unique if kernel size is constant
- only a small linear solve is required

[Figure: prolongation, coarse convolution, restriction, fine convolution]

Scaling Convolution Operators: Algebraic

Let $n_c = 3$, $n_f = 6$ (see E15CoarseToFineConv1D.m).

Prolongation and restriction:

$$P = \begin{pmatrix}
1 & & \\
3/4 & 1/4 & \\
1/4 & 3/4 & \\
 & 3/4 & 1/4 \\
 & 1/4 & 3/4 \\
 & & 1
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1/2 & 1/2 & & & & \\
 & & 1/2 & 1/2 & & \\
 & & & & 1/2 & 1/2
\end{pmatrix}$$

1. Build $K_H = R\, K_h\, P$ for the jth unit vector as stencil
2. Extract a column associated with an interior node
3. Reshape and get the patch around the node
4. Vectorize and use as one column
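The first step, forming K_H = R K_h P for a given fine-mesh stencil, can be sketched in plain Python as follows. This is an illustration and not E15CoarseToFineConv1D.m: the entries of `P` and `R` are written as reconstructed above (linear interpolation and pairwise averaging on a cell-centered grid), the fine operator uses zero boundary conditions, and the function names are hypothetical.

```python
def matmul(A, B):
    """Dense matrix-matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

# Prolongation (6 x 3): linear interpolation from 3 coarse cells to 6 fine cells.
P = [[1.0, 0.0, 0.0],
     [0.75, 0.25, 0.0],
     [0.25, 0.75, 0.0],
     [0.0, 0.75, 0.25],
     [0.0, 0.25, 0.75],
     [0.0, 0.0, 1.0]]

# Restriction (3 x 6): average each pair of neighboring fine cells.
R = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]]

def fine_conv_matrix(theta, n):
    """Toeplitz matrix of a three-point stencil with zero boundary conditions."""
    t1, t2, t3 = theta
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        K[i][i] = t2
        if i + 1 < n:
            K[i][i + 1] = t1
        if i > 0:
            K[i][i - 1] = t3
    return K

def coarsen(K_h):
    """Galerkin-type coarse operator K_H = R K_h P."""
    return matmul(R, matmul(K_h, P))

# Step 1 of the list: use the unit vectors as fine-mesh stencils.
coarse_ops = [coarsen(fine_conv_matrix(e, 6))
              for e in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
```

Each entry of `coarse_ops` is a 3x3 matrix; the remaining steps would read off the interior row or column of each one to assemble the map from fine to coarse stencils.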

Stability of Convolutional ResNets

A residual neural network is a discretization of

$$\partial_t y(t) = K_2(t)\, \sigma(K_1(t) y(t) + b(t)), \qquad y(0) = y_0.$$

To analyze stability we need to look at the eigenvalues of

$$J(t) = K_2(t)\, \mathrm{diag}\big(\sigma'(K_1(t) y(t) + b(t))\big)\, K_1(t).$$

It is difficult to analyze the eigenvalues for general $K_2$ and $K_1$.

Easy option: Set $K_2 = -K_1^\top$ (and drop the subscript index). Doing so gives

$$J_{\rm sym}(t) = -K(t)^\top \mathrm{diag}\big(\sigma'(K(t) y(t) + b(t))\big)\, K(t),$$

which is symmetric negative semidefinite. This gives stability if $\partial_t K$ is small.

See E15ParabolicCNN.m

Parabolic CNN

In the original residual network, choose $K_1 = K$ and $K_2 = -K^\top$. This gives the parabolic CNN

$$\partial_t Y(t) = -K(t)^\top \sigma(K(t) Y(t) + b(t)), \qquad Y(0) = Y_0.$$

Theorem: If $\sigma$ is monotonically non-decreasing, then the forward propagation is stable, i.e., there is an $M > 0$ such that for all $T$

$$\| Y(T) - Y_\epsilon(T) \|_F \leq M\, \| Y(0) - Y_\epsilon(0) \|_F,$$

where $Y$ and $Y_\epsilon$ are solutions for different initial values.

Use a forward Euler discretization with $h$ small enough:

$$Y_{j+1} = Y_j - h\, K_j^\top \sigma(K_j Y_j + b_j), \qquad j = 0, 1, \ldots, N - 1.$$

This is similar to anisotropic diffusion (popular in image processing) [4].
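The stability statement can be observed numerically with a small sketch. The Python code below is an illustration under assumptions: a single example, `tanh` as the monotone activation, one fixed hypothetical kernel `K` shared by all layers, and a step size `h` chosen small relative to the size of `K`. It tracks the distance between two trajectories that start from nearby initial values.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, w):
    """Compute A^T w without forming the transpose."""
    return [sum(A[i][k] * w[i] for i in range(len(A))) for k in range(len(A[0]))]

def parabolic_step(y, K, b, h):
    """One forward Euler step: y - h * K^T tanh(K y + b)."""
    s = [math.tanh(z + bi) for z, bi in zip(matvec(K, y), b)]
    return [yi - h * gi for yi, gi in zip(y, rmatvec(K, s))]

def distance_history(y0, y0_eps, K, b, h, n_steps):
    """Distances ||Y_j - Y_eps,j|| for two trajectories with different initial values."""
    y, y_eps = list(y0), list(y0_eps)
    dists = [math.dist(y, y_eps)]
    for _ in range(n_steps):
        y = parabolic_step(y, K, b, h)
        y_eps = parabolic_step(y_eps, K, b, h)
        dists.append(math.dist(y, y_eps))
    return dists

# Hypothetical kernel (a forward difference) and bias.
K = [[1.0, -1.0, 0.0],
     [0.0, 1.0, -1.0]]
b = [0.1, -0.2]
dists = distance_history([1.0, 0.0, -1.0], [1.1, 0.2, -0.9], K, b, h=0.1, n_steps=100)
```

For this choice of `h` the distances in `dists` never increase, which matches the theorem with M = 1; a much larger step size can break this behavior.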

Proof: Stability of Parabolic Networks

For ease of notation, assume no bias. We show that

$$\partial_t \| Y(t) - Y_\epsilon(t) \|_F^2 \leq 0.$$

Integrating this over $[0, T]$ yields the stability result. Why? Note that for all $t \in [0, T]$, taking the derivative gives

$$\tfrac{1}{2}\, \partial_t \| Y - Y_\epsilon \|_F^2 = \big( -K(t)^\top \sigma(K(t) Y) + K(t)^\top \sigma(K(t) Y_\epsilon),\ Y - Y_\epsilon \big) = -\big( \sigma(K(t) Y) - \sigma(K(t) Y_\epsilon),\ K(t)(Y - Y_\epsilon) \big) \leq 0,$$

where $(\cdot, \cdot)$ is the inner product and the inequality follows from the monotonicity of the activation function.

Relation to Total Variation

Note that the parabolic network can be seen as a gradient flow for

$$\phi(Y) = s(K(t) Y(t) + b(t)), \qquad \text{where } \sigma(x) = s'(x).$$

What is $s$?
- $\sigma(x) = \max(x, 0)$: $s(x) = \tfrac{1}{2} \max(x, 0)^2$
- $\sigma(x) = \tanh(x)$: $s(x) = \log(\exp(x) + \exp(-x))$

For $K(t) = \nabla_h$ (a discrete gradient), this shows a relation to total variation [8].

Thanks to Stanley Osher for providing this example.

References

[1] U. Ascher. Numerical Methods for Evolutionary Differential Equations. SIAM, Philadelphia.
[2] U. Ascher and L. Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. SIAM, Philadelphia, PA.
[3] U. M. Ascher and C. Greif. A First Course on Numerical Methods. SIAM, Philadelphia.
[4] Y. Chen and T. Pock. Trainable Nonlinear Reaction Diffusion: A Flexible Framework for Fast and Effective Image Restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), June.
[5] W. E. A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics, 5(1):1-11, Mar.
[6] E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34:014004.
[7] Y. LeCun, B. E. Boser, and J. S. Denker. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems.
[8] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4), 1992.
