Beyond finite layer neural network

1 Beyond Finite Layer Neural Networks: Bridging Numerical Dynamic Systems and Deep Neural Networks
arXiv:
Yiping Lu, Peking University, School of Mathematical Sciences
Joint work with Bin Dong, Quanzheng Li, Aoxiao Zhong

2 Depth Revolution

3 Motivation: Deep Residual Learning (CVPR 2016)
Continuous dynamics: $x_t = f(x)$
Residual block: $x_{n+1} = x_n + f(x_n)$, which is the forward Euler scheme for the equation above.
Weinan E. A Proposal on Machine Learning via Dynamical Systems.
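
The correspondence between a stack of residual blocks and forward Euler can be sketched in a few lines. This is only an illustration: the branch `f`, the explicit step size `dt` (absorbed into $f$ on the slide), and the test problem $\dot{x} = -x$ are assumptions, not part of the talk.

```python
import numpy as np

def f(x):
    # Hypothetical residual branch; stands in for a learned block.
    return -x

def resnet_forward(x0, n_layers, dt=1.0):
    """Apply the residual update x_{n+1} = x_n + dt * f(x_n) n_layers times."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_layers):
        x = x + dt * f(x)
    return x

# For dx/dt = -x the exact solution at time t = 1 is x0 * exp(-1).
x0 = np.array([1.0, 2.0])
approx = resnet_forward(x0, n_layers=1000, dt=1.0 / 1000)
exact = x0 * np.exp(-1.0)
```

With more layers and a proportionally smaller step, `approx` moves closer to `exact`, which is the sense in which a deeper residual network approaches the continuous system.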

4 Previous Works
Learning a diffusion process for denoising.
Chen Y, Yu W, Pock T. On learning optimized reaction diffusion processes for effective image restoration. CVPR 2015.

5 Depth Revolution: Going to Infinitely Many Layers
A differential equation as an infinite-layer neural network

6 "Revisiting previous efforts in deep learning, we found that diversity, another aspect in network design that is relatively less explored, also plays a significant role."
PolyStructure: $x_{n+1} = x_n + F(x_n) + F(F(x_n))$
Backward Euler scheme: $x_{n+1} = x_n + F(x_{n+1})$, i.e. $x_{n+1} = (I - F)^{-1} x_n$
PolyNet (figure panel (b)): approximate the operator $(I - F)^{-1}$ by $I + F + F^2 + \cdots$
Zhang X, Li Z, Loy C C, et al. PolyNet: A Pursuit of Structural Diversity in Very Deep Networks.
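
A small numerical check can make the series approximation concrete. The sketch below assumes a linear operator `F` (a random matrix rescaled to spectral norm 0.3 so the series converges); in PolyNet the operator is a nonlinear block, so this illustrates the idea and is not the architecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical linear residual operator, rescaled so that ||F||_2 = 0.3 < 1.
A = rng.standard_normal((d, d))
F = 0.3 * A / np.linalg.norm(A, 2)
x_n = rng.standard_normal(d)

I = np.eye(d)
# Exact backward Euler step: solve (I - F) x_{n+1} = x_n.
backward_euler = np.linalg.solve(I - F, x_n)

# Truncated series: I + F (residual block) and I + F + F^2 (PolyStructure).
order1 = x_n + F @ x_n
order2 = x_n + F @ x_n + F @ (F @ x_n)

err_order1 = np.linalg.norm(backward_euler - order1)
err_order2 = np.linalg.norm(backward_euler - order2)
```

Each extra term of the series shrinks the error by at least a factor of $\|F\|$, so `err_order2` is smaller than `err_order1` here.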

7 Runge-Kutta scheme (2nd order): $x_{n+1} = k_1 x_n + k_2 (k_3 x_n + f_1(x_n)) + f_2(k_3 x_n + f_1(x_n))$
Larsson G, Maire M, Shakhnarovich G. FractalNet: Ultra-Deep Neural Networks without Residuals.
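
One way to see why this update has Runge-Kutta form: a standard second-order scheme, Heun's method, is recovered for a particular choice of coefficients. The sketch below assumes $k_1 = k_2 = 1/2$, $k_3 = 1$, $f_1 = \Delta t\, f$ and $f_2 = \frac{\Delta t}{2} f$; these values and the vector field `f` are illustrative choices, not values from the talk.

```python
import numpy as np

def fractal_step(x, f1, f2, k1, k2, k3):
    """Update from the slide: k1*x + k2*(k3*x + f1(x)) + f2(k3*x + f1(x))."""
    y = k3 * x + f1(x)
    return k1 * x + k2 * y + f2(y)

def heun_step(x, f, dt):
    """Heun's method, a 2nd-order Runge-Kutta scheme for dx/dt = f(x)."""
    y = x + dt * f(x)                      # Euler predictor
    return x + 0.5 * dt * (f(x) + f(y))    # trapezoidal corrector

f = np.sin          # hypothetical vector field
dt = 0.1
x = np.array([0.3, -1.2])

a = fractal_step(x, f1=lambda z: dt * f(z), f2=lambda z: 0.5 * dt * f(z),
                 k1=0.5, k2=0.5, k3=1.0)
b = heun_step(x, f, dt)
```

With these coefficients `a` and `b` agree to floating-point precision.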

8 PDE: Infinite Layer Neural Network
Dynamic system: the continuous limit of a neural network. Neural network: a numerical approximation of the dynamic system.
WRN, ResNeXt, Inception-ResNet, PolyNet, SENet, etc.: new schemes to approximate the right-hand-side term.
Why not change the way $u_t$ is discretized?

9 Multi-step Residual Network
$x_t = f(x)$, discretized as $x_{n+1} = x_n + f(x_n)$

10 Multi-step Residual Network
Linear multi-step scheme for $x_t = f(x)$: $x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + f(x_n)$ (Linear Multi-step Residual Network)
Compare with ResNet: $x_{n+1} = x_n + f(x_n)$
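
A minimal sketch of the linear multi-step update follows. The function name `lm_resnet_forward`, the list-of-callables representation of the blocks, and the choice to start with one plain residual step (since there is no $x_{-1}$) are assumptions made for illustration.

```python
import numpy as np

def lm_resnet_forward(x0, blocks, ks):
    """Apply x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + f_n(x_n).

    blocks: residual branches f_0, f_1, ... (hypothetical callables)
    ks:     one scalar k_n per block after the first
    """
    x_prev = np.asarray(x0, dtype=float)
    # Assumption: the first step is a plain residual step.
    x = x_prev + blocks[0](x_prev)
    for f_n, k_n in zip(blocks[1:], ks):
        x_next = (1.0 - k_n) * x + k_n * x_prev + f_n(x)
        x_prev, x = x, x_next
    return x

branch = lambda x: 0.1 * x   # hypothetical residual branch
resnet_like = lm_resnet_forward([1.0], [branch] * 3, ks=[0.0, 0.0])
two_step = lm_resnet_forward([1.0], [branch] * 2, ks=[0.5])
```

Setting every $k_n$ to zero recovers the ordinary ResNet update, so `resnet_like` equals $1.1^3$; the only extra trainable quantity per block is the scalar $k_n$.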

11 Figure: (a) ResNet, (b) Linear Multi-step ResNet, with branches labeled "Scale k" and "Scale 1-k"

12 Figure: (a) ResNet, (b) Linear Multi-step ResNet, with branches labeled "Scale k" and "Scale 1-k". Only one more parameter.

13 Multi-step Residual Network
Figure: (a) ResNet, (b) LM-ResNet

14 Multi-step Residual Network

15 Multi-step Residual Network: explanation of the performance boost via the modified equation
ResNet: $x_{n+1} = x_n + \Delta t f(x_n)$, with modified equation $\dot{u}_n + \frac{\Delta t}{2} \ddot{u}_n = f(u)$
LM-ResNet: $x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + \Delta t f(x_n)$, with modified equation $(1 + k_n) \dot{u}_n + (1 - k_n) \frac{\Delta t}{2} \ddot{u}_n = f(u)$
[1] Dong B, Jiang Q, Shen Z. Image restoration: wavelet frame shrinkage, nonlinear evolution PDEs, and beyond. Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal.
[2] Su W, Boyd S, Candes E J. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. Advances in Neural Information Processing Systems.
[3] Wibisono A, Wilson A, Jordan M I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 2016.
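
The modified equation for LM-ResNet can be recovered with a short Taylor expansion. The derivation below is a sketch under the assumption that $x_n = u(t_n)$ for a smooth function $u$ and that $k_n$ is treated as a constant $k$ over one step; all derivatives are evaluated at $t_n$.

```latex
\begin{align*}
u(t_n + \Delta t) &= u + \Delta t\,\dot{u} + \tfrac{\Delta t^2}{2}\,\ddot{u} + O(\Delta t^3), \\
u(t_n - \Delta t) &= u - \Delta t\,\dot{u} + \tfrac{\Delta t^2}{2}\,\ddot{u} + O(\Delta t^3).
\end{align*}
Substituting into $x_{n+1} = (1 - k)\,x_n + k\,x_{n-1} + \Delta t\, f(x_n)$:
\begin{align*}
u + \Delta t\,\dot{u} + \tfrac{\Delta t^2}{2}\,\ddot{u}
  &= (1 - k)\,u + k\left(u - \Delta t\,\dot{u} + \tfrac{\Delta t^2}{2}\,\ddot{u}\right)
     + \Delta t\, f(u) + O(\Delta t^3), \\
(1 + k)\,\Delta t\,\dot{u} + (1 - k)\,\tfrac{\Delta t^2}{2}\,\ddot{u}
  &= \Delta t\, f(u) + O(\Delta t^3), \\
(1 + k)\,\dot{u} + (1 - k)\,\tfrac{\Delta t}{2}\,\ddot{u}
  &= f(u) + \text{higher-order terms}.
\end{align*}
```

Setting $k = 0$ gives the ResNet case, $\dot{u} + \frac{\Delta t}{2}\ddot{u} = f(u)$. The coefficient of the second-order term is what $k_n$ changes, which is why the slides that follow read $k_n$ as a learned momentum.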

16 Plot the Multi-step Residual Network
$x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + \Delta t f(x_n)$
Learn a momentum: $(1 + k_n) \dot{u}_n + (1 - k_n) \frac{\Delta t}{2} \ddot{u}_n + o(\Delta t^3) = f(u)$

17 Plot the Multi-step Residual Network
$x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + \Delta t f(x_n)$
Learn a momentum: $(1 + k_n) \dot{u}_n + (1 - k_n) \frac{\Delta t}{2} \ddot{u}_n + o(\Delta t^3) = f(u)$

18 Bridging Stochastic Dynamics
Can noise help avoid overfitting?
Dynamic system

19 Previous Works: Shake-Shake regularization
$x_{n+1} = x_n + \eta f_1(x_n) + (1 - \eta) f_2(x_n), \quad \eta \sim U(0, 1)$
$= x_n + f_2(x_n) + \frac{1}{2}(f_1(x_n) - f_2(x_n)) + (\eta - \frac{1}{2})(f_1(x_n) - f_2(x_n))$
"Apply data augmentation techniques to internal representations."
Gastaldi X. Shake-Shake regularization. ICLR Workshop Track, 2017.
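
The Shake-Shake decomposition into a mean part and a zero-mean noise part is an algebraic identity, which can be checked numerically. The two branches `f1` and `f2` below are hypothetical stand-ins for the learned branches.

```python
import numpy as np

rng = np.random.default_rng(0)
x_n = rng.standard_normal(5)

f1 = np.tanh                  # hypothetical branch 1
f2 = lambda x: 0.5 * x        # hypothetical branch 2

eta = rng.uniform(0.0, 1.0)   # eta ~ U(0, 1), so E[eta] = 1/2

# Original Shake-Shake update.
lhs = x_n + eta * f1(x_n) + (1.0 - eta) * f2(x_n)

# Mean part (deterministic) plus noise part (zero mean in eta).
mean_part = x_n + f2(x_n) + 0.5 * (f1(x_n) - f2(x_n))
noise_part = (eta - 0.5) * (f1(x_n) - f2(x_n))
```

Since $\mathbb{E}[\eta - \frac{1}{2}] = 0$, `noise_part` averages to zero over draws of `eta`, which is what allows the update to be read as a drift term plus noise.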

20 Previous Works: Deep Networks with Stochastic Depth
$x_{n+1} = x_n + \eta_n f(x_n) = x_n + \mathbb{E}[\eta_n] f(x_n) + (\eta_n - \mathbb{E}[\eta_n]) f(x_n)$
"To reduce the effective length of a neural network during training, we randomly skip layers entirely."
Huang G, Sun Y, Liu Z, et al. Deep Networks with Stochastic Depth. ECCV 2016.
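
The same split can be illustrated for stochastic depth by sampling the gate $\eta_n$. The sketch assumes $\eta_n$ is Bernoulli with survival probability `p`; the values of `p`, `x_n` and the branch `f` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                       # hypothetical survival probability of the block
x_n = 1.5
f = lambda x: -0.3 * x        # hypothetical residual branch

eta = rng.binomial(1, p, size=200_000)    # eta_n ~ Bernoulli(p), E[eta_n] = p

drift = p * f(x_n)                        # E[eta_n] f(x_n)
noise = (eta - p) * f(x_n)                # (eta_n - E[eta_n]) f(x_n)
x_next = x_n + drift + noise              # same as x_n + eta * f(x_n)

noise_mean = noise.mean()
noise_var = noise.var()
```

The sampled noise term has mean close to zero and variance close to $p(1-p)\,f(x_n)^2$, so randomly dropping a block acts like a deterministic step scaled by `p` plus a zero-mean perturbation.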

21 Bridging Stochastic Control
Can noise help avoid overfitting?
$dX_t = f(X_t, a_t)\,dt + g(X_t, t)\,dB_t, \quad X(0) = X_0$
The numerical scheme only needs to converge weakly!
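
To make "weak convergence" concrete: only expectations of the numerical solution need to match those of the SDE, not individual sample paths. Below is a minimal Euler-Maruyama sketch; the function name, the drift and diffusion choices (an Ornstein-Uhlenbeck process with the control $a_t$ folded into the drift), and the step count are all assumptions for illustration.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, dt, n_steps, rng):
    """Simulate dX = drift(X, t) dt + diffusion(X, t) dB with Euler-Maruyama."""
    x = np.array(x0, dtype=float)
    for n in range(n_steps):
        t = n * dt
        # Brownian increment over one step: N(0, dt).
        dB = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + drift(x, t) * dt + diffusion(x, t) * dB
    return x

rng = np.random.default_rng(0)
paths = euler_maruyama(
    x0=np.ones(20_000),                 # 20,000 independent sample paths
    drift=lambda x, t: -x,              # hypothetical drift f
    diffusion=lambda x, t: 0.5,         # hypothetical constant diffusion g
    dt=0.01, n_steps=100, rng=rng,
)
mean_end = paths.mean()
```

For this process the exact mean at $t = 1$ is $e^{-1}$, and `mean_end` lands close to it even though each individual path is noisy.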

22 Previous Works: Deep Networks with Stochastic Depth
$x_{n+1} = x_n + \eta_n f(x_n) = x_n + \mathbb{E}[\eta_n] f(x_n) + (\eta_n - \mathbb{E}[\eta_n]) f(x_n)$
We need $1 - 2p_n = O(\sqrt{\Delta t})$.
"To reduce the effective length of a neural network during training, we randomly skip layers entirely."
Huang G, Sun Y, Liu Z, et al. Deep Networks with Stochastic Depth. ECCV 2016.

23 $(1 + k_n) \dot{u}_n + (1 - k_n) \frac{\Delta t}{2} \ddot{u}_n + o(\Delta t^3) = f(u) + g(u)\,dW_t$
Stochastic strategy as before.
Figure: (a) ResNet, (b) Linear Multi-step ResNet, with branches labeled "Scale k" and "Scale 1-k"

24 Multi-step Residual Network

25 Finite Layer Neural Network
Neural network / dynamic system
Stochastic learning / stochastic dynamic system
New discretization: LM-ResNet
Modified equation
Original: LM-ResNet56 beats ResNet110
Stochastic depth: LM-ResNet110 beats ResNet1202

26 Thanks for your attention. Questions?
Lu Y, Zhong A, Li Q, et al. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. arXiv:
