Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Lei Huang (黄雷), State Key Laboratory of Software Development Environment, Beihang University. Mail: huanglei@nlsde.buaa.edu.cn. August 17th, 2017

Outline Introduction to Deep Neural Networks (DNNs) Training DNNs: Optimization Batch Normalization Other Normalization Techniques Centered Weight Normalization

Machine learning Dataset: D = {X, Y}, input X, output Y. Learning: Y = F(X) or P(Y|X). Fitting and generalization. Types (a view of models): non-parametric model, Y = F(X; x_1, x_2, ..., x_n); parametric model, Y = F(X; θ).

Neural network A neural network is a composition of simple layers: Y = F(X) = f_T(f_{T-1}(... f_1(X))), where each layer computes f_i(x) = g(Wx + b) with a nonlinear activation g such as sigmoid or ReLU.
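
As a concrete illustration of this layer-wise composition, here is a minimal NumPy sketch (the layer widths, the ReLU activation and the random weights are illustrative assumptions, not the lecture's code):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def layer(x, W, b, g=relu):
        # One layer f_i(x) = g(W x + b)
        return g(W @ x + b)

    # Y = f_T(f_{T-1}(... f_1(X))) with T = 3 illustrative layers
    rng = np.random.default_rng(0)
    sizes = [4, 8, 8, 2]                      # assumed layer widths
    params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
              for n, m in zip(sizes[:-1], sizes[1:])]

    x = rng.standard_normal(4)
    for W, b in params:
        x = layer(x, W, b)                    # repeated composition
    print(x)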

Deep neural network Why deep? Powerful representation capacity

Key properties of deep learning End-to-end learning: no distinction between the feature extractor and the classifier. Deep architectures: a hierarchy of simpler non-linear modules.

Applications and techniques of DNNs Successful applications in a range of domains: speech, computer vision, natural language processing. Main techniques in using deep neural networks: design the architecture (module selection and connection, loss function); train the model by optimization (parameter initialization, search direction in parameter space, learning rate schedule, regularization techniques).

Outline Introduction to Deep Neural Networks (DNNs) Training DNNs: Optimization Batch Normalization Other Normalization Techniques Centered Weight Normalization

Training of Neural Networks Multi-layer perceptron (example): inputs x_1, x_2, x_3 with bias units x_0 = 1 and h_0^(1) = 1, one hidden layer, one output (the slide's figure shows an example target (1, 0, 0)^T).
1. Forward, calculate ŷ: a^(1) = W^(1) x, h^(1) = σ(a^(1)), a^(2) = W^(2) h^(1), ŷ = σ(a^(2)); MSE loss L = (y - ŷ)^2.
2. Backward, calculate dL/dx: dL/dŷ = 2(ŷ - y); dL/da^(2) = dL/dŷ · σ(a^(2))(1 - σ(a^(2))); dL/dh^(1) = W^(2)T dL/da^(2); dL/da^(1) = dL/dh^(1) ⊙ σ(a^(1))(1 - σ(a^(1))); dL/dx = W^(1)T dL/da^(1).
3. Calculate the gradients dL/dW: dL/dW^(2) = dL/da^(2) h^(1)T and dL/dW^(1) = dL/da^(1) x^T.
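
A small NumPy sketch of exactly this forward/backward computation (the shapes and random values are assumed for illustration; this is not the lecture's code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_backward(x, y, W1, W2):
        # Forward: a1 = W1 x, h1 = sigmoid(a1), a2 = W2 h1, yhat = sigmoid(a2)
        a1 = W1 @ x
        h1 = sigmoid(a1)
        a2 = W2 @ h1
        yhat = sigmoid(a2)
        L = float(np.sum((y - yhat) ** 2))      # MSE loss

        # Backward: chain rule, mirroring the slide's derivatives
        dyhat = 2.0 * (yhat - y)
        da2 = dyhat * yhat * (1.0 - yhat)       # sigmoid'(a2) = yhat (1 - yhat)
        dW2 = np.outer(da2, h1)
        dh1 = W2.T @ da2
        da1 = dh1 * h1 * (1.0 - h1)
        dW1 = np.outer(da1, x)
        dx = W1.T @ da1
        return L, dW1, dW2, dx

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(3), np.array([1.0])
    W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((1, 4))
    print(forward_backward(x, y, W1, W2)[0])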

Optimization in Deep Models Goal: find parameters θ that minimize the training loss. Update iteratively: θ_{t+1} = θ_t - η_t d_t, where d_t is a search direction. Challenges: non-convexity and local optima; saddle points; severe correlation between dimensions and a highly non-isotropic (ill-shaped) parameter space.

First-order optimization First-order stochastic gradient descent (SGD): the search direction is the negative gradient, averaged over the sampled mini-batch of examples. Disadvantages: over-aggressive steps on ridges, too-small steps on plateaus, slow convergence and non-robust performance. [Figure: zig-zag iteration path of SGD.]
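
A minimal mini-batch SGD loop, sketched on a toy least-squares problem (the toy data, the learning rate of 0.1 and the batch size of 32 are assumptions for illustration):

    import numpy as np

    def sgd(grad_fn, theta, data, lr=0.1, batch_size=32, steps=100, seed=0):
        # theta <- theta - lr * mean of per-example gradients over a sampled mini-batch
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            batch = data[rng.choice(len(data), size=batch_size, replace=False)]
            theta = theta - lr * grad_fn(theta, batch).mean(axis=0)
        return theta

    # Toy example: least squares, per-example gradient of (x.theta - y)^2
    rng = np.random.default_rng(1)
    X = rng.standard_normal((512, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    data = np.hstack([X, (X @ w_true)[:, None]])

    def grad_fn(theta, batch):
        x, y = batch[:, :-1], batch[:, -1]
        return 2.0 * (x @ theta - y)[:, None] * x   # per-example gradients

    print(sgd(grad_fn, np.zeros(3), data))           # approaches w_true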

Advanced Optimization Estimate the curvature or the scale: quadratic optimization with Newton or quasi-Newton methods (inverse of the Hessian); natural gradient (inverse of the Fisher information matrix, FIM); estimate only the scale (AdaGrad, RMSProp, Adam). [Figure: iteration paths of SGD (red) and NGD (green).] Normalize the input/activation Intuition: the landscape of the cost with respect to the parameters is controlled by the input/activation, L = ℓ(f(x, θ), y). Method: stabilize the distribution of the input/activation, either by normalizing the input explicitly or implicitly (by constraining the weights).
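
As one example from the "estimate the scale" family, an RMSProp-style update divides each gradient coordinate by a running estimate of its magnitude; a sketch (the decay rate, learning rate and badly scaled toy objective are assumed for illustration):

    import numpy as np

    def rmsprop_step(theta, grad, state, lr=1e-3, rho=0.9, eps=1e-8):
        # state accumulates a running average of squared gradients per coordinate
        state = rho * state + (1.0 - rho) * grad ** 2
        theta = theta - lr * grad / (np.sqrt(state) + eps)
        return theta, state

    theta = np.array([1.0, -3.0])
    state = np.zeros_like(theta)
    for _ in range(200):
        grad = 2.0 * theta * np.array([100.0, 1.0])   # badly scaled quadratic loss
        theta, state = rmsprop_step(theta, grad, state, lr=0.05)
    print(theta)   # both coordinates are driven toward zero at comparable rates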

Some intuitions of normalization for optimization How does normalizing the activations affect the optimization? Consider y = w_1 x_1 + w_2 x_2 + b with loss L = (y - t)^2 for a target t. With inputs 0 < x_1 < 2 and 0 < x_2 < 0.5, the loss surface L(w_1, w_2) is elongated along one weight direction. After rescaling to 0 < x_1' = x_1/2 < 1 and 0 < x_2' = 2 x_2 < 1, the contours of L(w_1, w_2) become much rounder, so gradient descent makes even progress in every direction. [Figure: contours of L(w_1, w_2) before and after rescaling the inputs.]
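
A quick numerical check of this intuition (the input ranges follow the slide; the uniform sampling and sample size are my assumptions): the Hessian of the average squared loss with respect to (w_1, w_2) is 2 E[x x^T], and rescaling both inputs to a common range shrinks its condition number considerably.

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0.0, 2.0, 10_000)     # 0 < x1 < 2
    x2 = rng.uniform(0.0, 0.5, 10_000)     # 0 < x2 < 0.5

    def cond_of_loss_hessian(a, b):
        X = np.stack([a, b], axis=1)
        H = 2.0 * (X.T @ X) / len(X)       # Hessian of mean (w.x - t)^2 w.r.t. w
        return np.linalg.cond(H)

    print(cond_of_loss_hessian(x1, x2))          # elongated: large condition number
    print(cond_of_loss_hessian(x1 / 2, 2 * x2))  # rescaled to (0, 1): much smaller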

Outline Introduction to Deep Neural Networks (DNNs) Training DNNs: Optimization Batch Normalization Other Normalization Techniques Centered Weight Normalization

Batch Normalization -- motivation Addressing internal covariate shift: in a stacked network, h_1 = w x and h_2 = w h_1, the distribution of each layer's input shifts whenever the earlier layers' weights change. Whitening the input benefits optimization (LeCun et al., 1998, Efficient BackProp): for y = wx with an MSE loss, centering, stretching (scaling to unit variance) and decorrelating the input make the loss surface better conditioned. [Figure: effect of centering, stretching and decorrelating on the loss contours.]

Batch Normalization -- method Only standardize the input (full decorrelation is expensive): centering and stretching, x̂ = (x - E[x]) / std(x). How to compute these statistics in practice?
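
In code, this per-dimension standardization is a one-liner; a sketch (the ε added for numerical stability is the usual convention, not shown on the slide):

    import numpy as np

    def standardize(x, eps=1e-5):
        # x: (batch, features); subtract the mean and divide by the std per feature
        mean = x.mean(axis=0)
        std = x.std(axis=0)
        return (x - mean) / (std + eps)

    x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(128, 4))
    x_hat = standardize(x)
    print(x_hat.mean(axis=0).round(3), x_hat.std(axis=0).round(3))  # ~0 and ~1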

Batch Normalization -- training Forward pass: for each mini-batch, compute the batch mean and variance per dimension, standardize, then apply a learnable scale γ and shift β.
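
A sketch of this training-time forward pass, following the formulas of the cited BN paper (Ioffe and Szegedy, 2015); the variable names are mine:

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, features); mini-batch statistics per feature
        mu = x.mean(axis=0)                       # batch mean
        var = x.var(axis=0)                       # batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)     # standardize
        y = gamma * x_hat + beta                  # learnable scale and shift
        cache = (x, x_hat, mu, var, gamma, eps)   # saved for the backward pass
        return y, cache

    x = np.random.default_rng(0).normal(size=(64, 8))
    y, _ = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
    print(y.mean(axis=0).round(3), y.var(axis=0).round(3))   # ~0 and ~1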

Batch Normalization -- training Backward pass: back-propagate through the scale and shift and through the mini-batch statistics, since the mean and variance themselves depend on every example in the batch.
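
The matching backward pass, following the derivatives given in the BN paper (a sketch, paired with the forward function above):

    import numpy as np

    def batchnorm_backward(dy, cache):
        # dy: gradient of the loss w.r.t. the BN output y
        x, x_hat, mu, var, gamma, eps = cache
        m = x.shape[0]
        dgamma = (dy * x_hat).sum(axis=0)
        dbeta = dy.sum(axis=0)
        dx_hat = dy * gamma
        dvar = (dx_hat * (x - mu)).sum(axis=0) * -0.5 * (var + eps) ** -1.5
        dmu = -dx_hat.sum(axis=0) / np.sqrt(var + eps) \
              + dvar * (-2.0 * (x - mu)).mean(axis=0)
        dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
        return dx, dgamma, dbeta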

Batch Normalization -- inference Inference (in the paper): replace the batch statistics with population statistics E[x] and Var[x] estimated after training. Inference (in practice): keep running averages during training, E[x] ← α μ_B + (1 - α) E[x] and Var[x] ← α σ²_B + (1 - α) Var[x].
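
A sketch of the practical inference path with running averages (the momentum α = 0.1 is an assumed value):

    import numpy as np

    def batchnorm_update_running(running_mean, running_var, mu, var, alpha=0.1):
        # Running averages maintained during training from the batch statistics
        running_mean = alpha * mu + (1.0 - alpha) * running_mean
        running_var = alpha * var + (1.0 - alpha) * running_var
        return running_mean, running_var

    def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
        # At test time, use the fixed running statistics instead of batch statistics
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
        return gamma * x_hat + beta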

Batch Normalization -- how to use Convolution layers: wrapped as a module and applied per channel, sharing the statistics across spatial locations. Before or after the nonlinearity? For shallow models (fewer than about 11 layers), after the nonlinearity; for deep models, before it. Advantages of placing BN before the nonlinearity: for ReLU, about half of the units are activated; for sigmoid, it avoids the saturated region. Advantage of placing it after: it matches the whitening intuition, since the next layer then receives a normalized input.
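
For a convolutional feature map of shape (N, C, H, W), the statistics are computed per channel over the batch and spatial axes; a sketch (the shapes chosen for γ and β are my convention):

    import numpy as np

    def batchnorm_conv_forward(x, gamma, beta, eps=1e-5):
        # x: (N, C, H, W); one mean/variance per channel, shared over N, H, W
        mu = x.mean(axis=(0, 2, 3), keepdims=True)
        var = x.var(axis=(0, 2, 3), keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + eps)
        # gamma, beta: shape (1, C, 1, 1), one scale and shift per channel
        return gamma * x_hat + beta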

Batch Normalization -- how to use Example: the residual block (CVPR 2016) and the pre-activation residual block (ECCV 2016).

Batch Normalization -- characteristics For accelerating training: invariance to the weight scale (not sensitive to the weight initialization, and tolerant of large learning rates since the effective step size adapts); better conditioning (LeCun et al., 1998). For generalization: the mini-batch statistics are stochastic, so BN acts as a regularizer, similar to Dropout.
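
The weight-scale invariance can be written out explicitly (a short derivation, not shown on the slide): for any a > 0, the mean and standard deviation of aWx scale by the same factor a, so

    \[
      \mathrm{BN}(aWx)
        = \frac{aWx - \mathrm{E}[aWx]}{\sqrt{\mathrm{Var}[aWx]}}
        = \frac{a\,(Wx - \mathrm{E}[Wx])}{a\,\sqrt{\mathrm{Var}[Wx]}}
        = \mathrm{BN}(Wx), \qquad a > 0 .
    \]

Consequently the gradient with respect to the scaled weights aW shrinks by a factor of 1/a, so overly large weights receive proportionally smaller updates.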

Batch Normalization Routinely used in deep feed-forward neural networks, especially CNNs. Weaknesses: cannot be used for online learning; unstable for small mini-batch sizes; should be used in RNNs with caution.

Batch Normalization for RNNs Extra problems need to be considered: where should the BN module be placed, and how should sequence data be handled? 2016, ICASSP, Batch Normalized Recurrent Neural Networks: placement of the BN module; for the sequence-data problem, frame-wise versus sequence-wise normalization.

Batch Normalization for RNNs 2017, ICLR, Recurrent Batch Normalization: placement of the BN module; for the sequence-data problem, frame-wise normalization with separate statistics per time step (up to T_max). Whether BN helps in RNNs depends on the task.

Outline Introduction to Deep Neural Networks (DNNs) Training DNNs: Optimization Batch Normalization Other Normalization Techniques Centered Weight Normalization

Norm propagation (2016, ICML) Targets BN's drawbacks: it cannot be used for online learning and is unstable for small mini-batch sizes. Idea: a data-independent parametric estimate of the mean and variance. Normalize the input to zero mean and unit variance, assume W is (approximately) orthogonal, and derive analytically how the nonlinearity (e.g., ReLU) transforms the mean and variance.

Layer Normalization (2016, arXiv) Targets BN's drawbacks: it cannot be used for online learning, is unstable for small mini-batch sizes, and is hard to apply in RNNs. Idea: normalize each example over its feature dimensions rather than over the mini-batch. [Figure: BN normalizes across the batch, LN across the features.]
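
A sketch contrasting the normalization axis with BN: the statistics are computed per example over the features, so the output does not depend on the rest of the mini-batch (γ and β are again learnable):

    import numpy as np

    def layernorm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, features); statistics per example, over the feature dimension
        mu = x.mean(axis=1, keepdims=True)
        var = x.var(axis=1, keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.default_rng(0).normal(size=(4, 8))
    y = layernorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
    print(y.mean(axis=1).round(3))   # each example is normalized to ~zero mean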

Natural Neural Networks (2015, NIPS) What about decorrelating the activations? Canonical model (MLP): h_i = f_i(W_i h_{i-1} + b_i). Natural neural network: model parameters Ω = {V_1, d_1, ..., V_L, d_L} and whitening coefficients Φ = {U_0, c_0, ..., U_{L-1}, c_{L-1}}.

Weight Normalization (2016, NIPS) Targets BN's drawbacks: it cannot be used for online learning, is unstable for small mini-batch sizes, and is hard to apply in RNNs. Idea: express the weight through new parameters, w = g · v / ||v||, decoupling the direction and the length of the weight vector.
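
A sketch of the re-parameterization (in practice an autodiff framework would take gradients with respect to g and v; only the forward mapping is shown here):

    import numpy as np

    def weight_norm(v, g):
        # w = g * v / ||v||: g controls the length, v only the direction
        return g * v / np.linalg.norm(v)

    v = np.array([3.0, 4.0])
    print(weight_norm(v, g=2.0))   # length 2 in the direction of v -> [1.2, 1.6]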

References
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015
Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks, ICML 2016
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, NIPS 2016
Layer Normalization, arXiv:1607.06450, 2016
Recurrent Batch Normalization, ICLR 2017
Batch Normalized Recurrent Neural Networks, ICASSP 2016
Natural Neural Networks, NIPS 2015
Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes, ICLR 2017
Batch Renormalization, arXiv:1702.03275, 2017
Mean-Normalized Stochastic Gradient for Large-Scale Deep Learning, ICASSP 2014
Deep Learning Made Easier by Linear Transformations in Perceptrons, AISTATS 2012

Outline Introduction to Deep Neural Networks (DNNs) Training DNNs: Optimization Batch Normalization Other Normalization Techniques Centered Weight Normalization

Centered Weight Normalization in Accelerating Training of Deep Neural Networks Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, Dacheng Tao International Conference on Computer Vision (ICCV) 2017

Motivation A stable distribution in the hidden layers. Initialization methods: random init (1998, Yann LeCun), zero mean and stable variance; Xavier init (2010, Glorot); He init (2015, Kaiming He), W ~ N(0, 2/n) with n = n_out · H · W. The goal here is to keep these desired characteristics during training, not only at initialization.
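
A sketch of He initialization for a convolution weight tensor, with the fan computed as on the slide (n = n_out · H · W); using the fan-in instead is an equally common convention:

    import numpy as np

    def he_init(shape, seed=0):
        # shape: (out_channels, in_channels, H, W); n = out_channels * H * W
        rng = np.random.default_rng(seed)
        n = shape[0] * shape[2] * shape[3]
        return rng.normal(0.0, np.sqrt(2.0 / n), size=shape)

    W = he_init((64, 3, 3, 3))
    print(W.std())   # close to sqrt(2 / (64*3*3)) ~ 0.059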

Method Formulation: a constrained optimization problem in which each weight vector is required to have zero mean and unit norm. Solution: re-parameterization.

Method Using a proxy parameter v: the effective weight is the centered and normalized version of v. Gradient information: gradients are back-propagated through this mapping to v. Adjustable scale: a learnable scalar can rescale the normalized weight.

Method Wrapped as a module for practitioners: Forward pass: center the proxy parameter v, normalize it to unit length, and use the result (optionally rescaled) as the layer's weight.
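
A sketch of this forward mapping as I understand the paper's re-parameterization (the function name, the optional scale g and the ε guard are mine):

    import numpy as np

    def centered_weight_norm(v, g=1.0, eps=1e-12):
        # Center the proxy parameter, then normalize it to unit length;
        # the optional scalar g restores an adjustable scale.
        v_centered = v - v.mean()
        return g * v_centered / (np.linalg.norm(v_centered) + eps)

    w = centered_weight_norm(np.array([1.0, 2.0, 3.0, 4.0]))
    print(w.mean().round(6), np.linalg.norm(w).round(6))   # 0 and 1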

Method Wrapped as a module for practitioners: Backward pass: the gradient with respect to the proxy parameter v is obtained by back-propagating through the centering and the normalization.

Discussion Beneficial characteristics for training: stabilizes the distributions; gives better conditioning of the Hessian; acts as a regularizer that improves performance.

Experiments Datasets: Yale-B, SVHN, CIFAR-10, CIFAR-100, ImageNet. Reproducible experiments and code: https://github.com/huangleibuaa/centeredwn

Experiments Ablation study: Yale-B, MLP {128, 64, 48, 48}.

Experiments MLP on SVHN, layers {128, 128, 128, 128, 128}: comparisons under SGD, SGD+BN, and Adam optimization.

Experiments CIFAR-10 & CIFAR-100, error rates (%):
BN-Inception              CIFAR-10       CIFAR-100
Plain                     6.14 ±0.04     25.52 ±0.15
WN                        6.18 ±0.34     25.49 ±0.35
WCBN                      6.01 ±0.16     24.45 ±0.54
Residual Network (56 layers)   CIFAR-10  CIFAR-100
Plain                     7.34 ±0.52     29.38 ±0.14
WN                        7.58 ±0.40     29.85 ±0.66
WCBN                      6.85 ±0.25     29.23 ±0.14

Experiments ImageNet, BN-Inception.

Conclusion and future work Conclusion: CWN shows advantages in accelerating training and achieves better generalization; the CWN module can be used in place of a standard linear module. Future work: apply the CWN module to RNNs and to reinforcement learning settings.

Thanks!