<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)
Outline: Summary of Previous Stanford Lecture (Universal Approximation Theorem, Inference vs. Training, Gradient Descent); Back-Propagation of MLP (Derivatives of Activation, Back-propagation in MLP); Back-Propagation of CNN (Intuition of CNN, Pooling Layer & Stride, Back-propagation in CNN); Hardware Issues & Other Learning Methods
Previous Stanford Lectures
Previous Stanford Lectures
Previous Stanford Lectures: Do we understand how back-propagation works? (Today's purpose)
Outline: Summary of Previous Stanford Lecture (Universal Approximation Theorem, Inference vs. Training, Gradient Descent); Back-Propagation of MLP (Derivatives of Activation, Back-propagation in MLP); Back-Propagation of CNN (Intuition of CNN, Pooling Layer & Stride, Back-propagation in CNN); Other Learning Methods
History of Neural Networks: Universal Approximation Theorem. A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on R^n. George Cybenko (1989): with a sigmoidal nonlinearity, decision regions can be well approximated, and the network is learnable. A multilayer feedforward architecture has the potential of a universal approximator (in L^p space) and is learnable.
Forward and Backward Paths. Forward path: inference y = f_W(x) on a 28×28 input image.
Forward and Backward Paths. Forward path: inference f_W(x) on a 28×28 input. What we expect: 1: 0.0, 2: 1.0, 3: 0.0. What we get: 1: 0.1, 2: 0.6, 3: 0.8. Error!!
Forward and Backward Paths. Forward path: inference f_W(x) gives 1: 0.1, 2: 0.6, 3: 0.8 instead of the expected 1: 0.0, 2: 1.0, 3: 0.0. Backward path: error propagation with errors 1: 0.1, 2: -0.4, 3: 0.8.
Forward and Backward Paths. After the update, the forward path f_W(x) gives 1: 0.08, 2: 0.75, 3: 0.4, and the backward path propagates the errors 1: 0.08, 2: -0.25, 3: 0.4.
Perceptron: Basic Structure for DNN. Linear summation (Wx + b) followed by a nonlinear activation function, to implement nonlinear classification, e.g., sigmoid, ReLU.
Forward Inference. Input node: x = (x1, x2, 1). Linear output: (y1', y2') = [[W11, W21, b1], [W12, W22, b2]] · (x1, x2, 1). Non-linear output: y1 = h(y1'), y2 = h(y2'); overall, y = f_W(x).
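A minimal NumPy sketch of this forward inference for a 2-input / 2-output layer (the function name, weights, and sigmoid choice are illustrative, not from the slides):

```python
import numpy as np

def forward(x, W, b, h):
    """Forward inference for one layer: linear output y' = W x + b,
    then non-linear output y = h(y')."""
    y_lin = W @ x + b          # (y1', y2')  linear output
    return h(y_lin)            # y1 = h(y1'), y2 = h(y2')

# Toy values standing in for the 2x2 example in the figure.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W = np.array([[0.5, -0.2],     # [[W11, W21], [W12, W22]]
              [0.3,  0.8]])
b = np.array([0.1, -0.1])      # (b1, b2)
x = np.array([1.0, 2.0])       # (x1, x2)
print(forward(x, W, b, sigmoid))
```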
Principle of Back-propagation: Gradient Descent. Weight update by gradient descent: w_i^{t+1} = w_i^t − η · ∂E/∂w_i (t: iteration). Move along the steepest gradient −∂E/∂w_i until the error E reaches its minimum.
Principle of Back-propagation Gradient Descent
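A small sketch of the update rule w^{t+1} = w^t − η·∂E/∂w, applied to a toy quadratic error (the error function and learning rate are illustrative):

```python
import numpy as np

def gradient_descent_step(w, grad_E, lr=0.1):
    """One gradient-descent step: move against the gradient of the error E."""
    return w - lr * grad_E(w)

# Toy error E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3); the minimum is at w = 3.
grad_E = lambda w: 2.0 * (w - 3.0)
w = np.array([0.0])
for t in range(100):
    w = gradient_descent_step(w, grad_E)
print(w)   # converges toward w = 3
```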
Examples of Activation Function. 1) Sigmoid: s_c(x) = 1 / (1 + e^{−cx}), with derivative s_c'(x) = c·e^{−cx} / (1 + e^{−cx})^2 = c·s_c(x)·(1 − s_c(x)). As c grows, it approaches a step function. The gradient can be computed directly from the (non-gradient) function value. Problem: gradient vanishing.
Examples of Activation Function 2) Absolute value rectification Intuitive meaning : folding 3) ReLU h x = max 0, b + w x Simple gradient h x = 0 h x = 1 x < 0 x > 0 Suggested by Hinton to solve gradient vanishing
Details of Back Propagation: Output Layer. Hidden layer output h_i feeds the output layer through weight w_ij: o_j = Σ_i w_ij h_i + b, z_j = f(o_j) (f: activation function), and the loss E is evaluated on z_j. Weight update: w_ij^{t+1} = w_ij^t − η · ∂E/∂w_ij, with ∂E/∂w_ij = (∂E/∂z_j)(∂z_j/∂o_j)(∂o_j/∂w_ij) = (z_j − t_j) · f'(o_j) · h_i = δ_j h_i, i.e., (derivative of the error function) × (derivative of the activation function) × (input value). Sigmoid case: f'(o_j) = z_j(1 − z_j), since s'(x) = s(x)(1 − s(x)) for c = 1.
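A sketch of this output-layer update for a squared error and a sigmoid activation, vectorized over the layer (variable names and the learning rate are illustrative):

```python
import numpy as np

def output_layer_step(h, W, b, t, lr=0.1):
    """Output layer: o = W h + b (pre-activation), z = sigmoid(o).
    delta_j = (z_j - t_j) * f'(o_j);  dE/dW_ij = delta_j * h_i."""
    o = W @ h + b
    z = 1.0 / (1.0 + np.exp(-o))
    delta = (z - t) * z * (1.0 - z)   # (error derivative) * (activation derivative)
    dW = np.outer(delta, h)           # ... * (input value h_i)
    db = delta
    return W - lr * dW, b - lr * db, delta
```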
Details of Back Propagation: Hidden Layer. Hidden layer 1 node j: o_j^{(1)} = Σ_i w_ij^{(1)} h_i + b, z_j = f(o_j^{(1)}) (f: activation function), feeding hidden layer 2 nodes 1..m through weights w_jk^{(2)} with deltas δ_k^{(2)}. Weight update: w_ij^{t+1} = w_ij^t − η · ∂E/∂w_ij, with ∂E/∂w_ij = (Σ_{k=1}^{m} w_jk^{(2)} δ_k^{(2)}) · f'(o_j^{(1)}) · h_i = δ_j^{(1)} h_i, i.e., (error propagated from the upper layer) × (derivative of the activation function) × (input value).
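The hidden-layer case differs only in how δ_j is obtained: instead of (z_j − t_j), the error is the weighted sum of the upper layer's deltas. A sketch under the same assumptions (sigmoid activation, illustrative names):

```python
import numpy as np

def hidden_layer_delta(o, W_next, delta_next):
    """delta_j = (sum_k w_jk^(2) * delta_k^(2)) * f'(o_j):
    error propagated from the upper layer, times the activation derivative."""
    z = 1.0 / (1.0 + np.exp(-o))          # layer output z_j = f(o_j)
    return (W_next.T @ delta_next) * z * (1.0 - z)

def hidden_layer_grad(h, delta):
    """dE/dW_ij = delta_j * h_i, the same form as in the output layer."""
    return np.outer(delta, h)
```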
Backward Error Propagation through FC4, FC5, FC6 (inputs/outputs I_4,O_4, I_5,O_5, I_6,O_6; weights W_5, W_6; target t). Forward inference: f(I_4) = O_4, O_4·W_5 = I_5, f(I_5) = O_5, O_5·W_6 = I_6, f(I_6) = O_6, E = ||O_6 − t||^2. Backpropagation: δ_6 = D_6 e, with e = ∂E/∂O_6 and D_6 = diag(f'(I_6)); ∂E/∂W_6 = δ_6 O_5^T; δ_5 = D_5 W_6^T δ_6; ∂E/∂W_5 = δ_5 O_4^T.
Various Gradient Descent. Batch gradient descent: use all m examples in each iteration (expensive for a large dataset). Stochastic gradient descent: use 1 example in each iteration (large fluctuation of the loss). Mini-batch gradient descent: use a small batch of b examples in each iteration.
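A sketch of the three variants: the only difference is how many examples feed one weight update (the gradient function and data below are placeholders):

```python
import numpy as np

def train(w, X, Y, grad_fn, lr=0.01, batch_size=None, epochs=10):
    """batch_size=None -> batch GD (all m examples per update),
       batch_size=1    -> stochastic GD,
       batch_size=b    -> mini-batch GD."""
    m = len(X)
    b = m if batch_size is None else batch_size
    for _ in range(epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, b):
            sel = idx[start:start + b]
            # grad_fn returns the gradient averaged over the selected examples.
            w = w - lr * grad_fn(w, X[sel], Y[sel])
    return w
```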
Outline: Summary of Previous Stanford Lecture (Universal Approximation Theorem, Inference vs. Training, Gradient Descent); Back-Propagation of MLP (Derivatives of Activation, Back-propagation in MLP); Back-Propagation of CNN (Intuition of CNN, Pooling Layer & Stride, Back-propagation in CNN); Hardware Issues & Other Learning Methods
Limit of Multilayer Perceptron. MLP has the potential of a universal approximator.^1) Fully connected layer, ex) 1000×1000 image + 1M hidden units → 10^12 parameters. Too many parameters!! (unrealistic memory, slow learning). Locally connected layer: a simple solution to reduce parameters. 1) Approximation Capabilities of Multilayer Feedforward Networks, Kurt Hornik, 1990
Motivation of Convolution Layer: a locally connected layer with shared weights. The number of parameters can be much smaller with shared weights. (Fully Connected Layer → Locally Connected Layer → Convolution Layer)
Convolution Layer w/ 2D Convolution. Pipeline: Input Image → Convolution Layer → Non-linearity → Pooling Layer → Feature Maps. Example 2D convolutions of the input x[m, n] with kernels h_1[i, j], h_2[i, j], h_3[i, j]: Original, Sharpen, Blur, Edge.
Convolution Layer w/ Non-linearity. Pipeline: Input Image → Convolution Layer → Non-linearity → Pooling Layer → Feature Maps. Because of the multiple non-linearities, simple features are combined into complex features.
Relation btw Conv. Layer & 2D Conv. (figure: the same 3×3 kernel applied to the input as a 2D convolution and as a convolution layer)
Substitute Conv. by Cross-correlation. Convolution slides the flipped kernel over the input; cross-correlation slides the kernel without flipping. The convolution layer (followed by non-linearity and pooling) can therefore be implemented with cross-correlation. (figure: the 3×3 kernel shown flipped for convolution and unflipped for cross-correlation)
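A sketch of the convolution-layer forward pass implemented as cross-correlation, as described above (single channel, stride 1, "valid" output size; true convolution shown for comparison):

```python
import numpy as np

def cross_correlate2d(x, k):
    """Slide the (unflipped) kernel k over x; output is (H-kh+1, W-kw+1)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def convolve2d(x, k):
    """True 2D convolution = cross-correlation with the kernel flipped in both axes."""
    return cross_correlate2d(x, k[::-1, ::-1])
```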
Error Update in Pooling Layer. Role of pooling: reduce the image size (dimension) and extract the important features. Max pooling vs. average pooling: max pooling is better; an activated node receives error from the upper layer only when its value is selected (in max pooling). Average pooling: g(x) = (1/m) Σ_{k=1}^{m} x_k, ∂g/∂x_i = 1/m. Max pooling: g(x) = max(x), ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise.
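A sketch of 2×2 max pooling and its backward pass: only the selected (max) position in each window receives the error from the upper layer (non-overlapping windows, single channel, even height/width assumed):

```python
import numpy as np

def maxpool2x2_forward(x):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    mask = np.zeros_like(x)                    # remembers which input was the max
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            win = x[i:i + 2, j:j + 2]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            out[i // 2, j // 2] = win[r, c]
            mask[i + r, j + c] = 1.0
    return out, mask

def maxpool2x2_backward(d_out, mask):
    """Route each upstream error only to the position that was selected."""
    return mask * np.repeat(np.repeat(d_out, 2, axis=0), 2, axis=1)
```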
Deep Learning Cycle (1): Forward Inference. Forward path: Input Image → (kernel weights) → Feature Maps → (kernel weights) → Feature Maps → (FC parameters) → FC Out → (FC parameters) → FC Out → Inference Result.
Deep Learning Cycle (2): Back-propagation. Compute the error (loss) with the labeled data (ground truth) and propagate it backward through the error maps and sparse error maps. The kernel weights and FC parameters are duplicated for the backward path, so the GPU generally uses about 2× the memory during back-propagation.
Deep Learning Cycle (3): Weight Update. The stored feature map data is used for weight updating; each kernel weight and FC parameter is updated with the propagated error.
Error Update in Convolution Layer: flip the kernel. The error map δ from the upper layer (after the non-linearity and pooling) is convolved with the flipped kernel to obtain the error for the previous layer. (figure: a 2×2 error map δ_11, δ_12, δ_21, δ_22 full-convolved with the flipped 3×3 kernel, giving a 4×4 table of weighted sums such as δ_11·w, δ_11·w + δ_12·w, ..., δ_11·w + δ_12·w + δ_21·w + δ_22·w, ..., δ_22·w)
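A sketch of this error back-propagation for a cross-correlation forward pass: pad the error map and correlate it with the flipped kernel, i.e., a "full" convolution (single channel, stride 1 assumed):

```python
import numpy as np

def conv_backward_input(delta, k):
    """Error map for the previous layer: 'full' convolution of delta with the
    flipped kernel.  For a 2x2 delta and a 3x3 kernel this yields the 4x4
    table of weighted delta sums illustrated on the slide."""
    kh, kw = k.shape
    kf = k[::-1, ::-1]                                       # flip the kernel
    p = np.pad(delta, ((kh - 1, kh - 1), (kw - 1, kw - 1)))  # zero-pad delta
    H, W = p.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(p[i:i + kh, j:j + kw] * kf)
    return out
```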
Weight Update in Convolution Layer. The gradient for each kernel weight is the correlation of the input feature map (o_1 ... o_9) with the error map (δ_11, δ_12, δ_21, δ_22); the resulting sums are o_1·δ_11 + o_2·δ_12 + o_4·δ_21 + o_5·δ_22, o_2·δ_11 + o_3·δ_12 + o_5·δ_21 + o_6·δ_22, o_4·δ_11 + o_5·δ_12 + o_7·δ_21 + o_8·δ_22, and o_5·δ_11 + o_6·δ_12 + o_8·δ_21 + o_9·δ_22 (equivalently, a convolution with the flipped error map).
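A sketch of this weight gradient: cross-correlate the layer input with the error map, which reproduces the four weighted sums listed above for a 3×3 input and a 2×2 error map (single channel, stride 1 assumed):

```python
import numpy as np

def conv_backward_weights(o, delta):
    """dE/dw[a, b] = sum_{i, j} o[i + a, j + b] * delta[i, j]:
    cross-correlate the layer input o with the error map delta."""
    dh, dw = delta.shape
    kh, kw = o.shape[0] - dh + 1, o.shape[1] - dw + 1
    grad = np.zeros((kh, kw))
    for a in range(kh):
        for b in range(kw):
            grad[a, b] = np.sum(o[a:a + dh, b:b + dw] * delta)
    return grad
```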
Outline: Summary of Previous Stanford Lecture (Universal Approximation Theorem, Inference vs. Training, Gradient Descent); Back-Propagation of MLP (Derivatives of Activation, Back-propagation in MLP); Back-Propagation of CNN (Intuition of CNN, Pooling Layer & Stride, Back-propagation in CNN); Hardware Issues & Other Learning Methods
Feature Map Extraction by Convolution Layer. We need all the feature maps for learning, which is too large. Low-level features → mid-level features → high-level features → trainable classifier: because of the multiple non-linearities, simple features are combined into complex features.
Large Memory for Feature Map. Feature maps use a large portion of memory. Generally, memory usage is proportional to network depth (due to the feature map size). [Rhu et al., vDNN, MICRO 2016]
Memory Usage of Various Networks [Rhu et al., vDNN, MICRO 2016]
Learning with Lower Bit Precision. All data (weights, feature maps, gradients) are kept at lower bit precision. [Suyog Gupta et al., 2015]
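A hedged sketch of the underlying idea, assuming limited-precision fixed-point storage with stochastic rounding (the word length and fraction bits below are illustrative, not taken from the cited work):

```python
import numpy as np

def to_fixed_point(x, frac_bits=8, word_bits=16, stochastic=True):
    """Quantize x to fixed point with `frac_bits` fractional bits.
    Stochastic rounding rounds up with probability equal to the remainder,
    which keeps the quantization unbiased on average."""
    scale = 2.0 ** frac_bits
    y = x * scale
    if stochastic:
        y = np.floor(y + np.random.uniform(size=np.shape(x)))
    else:
        y = np.round(y)
    limit = 2.0 ** (word_bits - 1) - 1          # saturate to the word length
    return np.clip(y, -limit, limit) / scale
```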
Learning with Lower Bit Precision (2). Gradients and nonlinear activations are kept at lower bit precision (8-bit, dynamic tree data type). AlexNet on the ImageNet dataset. [Tim Dettmers, ICLR 2016]
Learning with Quantization (BNN). Quantization of weights & activations to binary values. [Itay Hubara et al., 2016]
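A minimal sketch of the BNN idea: binarize with sign() in the forward pass while keeping and updating real-valued weights, passing the gradient "straight through" the binarization (a simplification of the cited approach; names are illustrative):

```python
import numpy as np

def binarize(w):
    """Forward pass: quantize weights/activations to {-1, +1}."""
    return np.where(w >= 0, 1.0, -1.0)

def bnn_update(w_real, grad_wrt_binary, lr=0.01):
    """Straight-through estimator: apply the gradient computed w.r.t. the
    binarized weights directly to the stored real-valued weights."""
    w_real = w_real - lr * grad_wrt_binary
    return np.clip(w_real, -1.0, 1.0)   # keep the real-valued weights in [-1, 1]
```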
Learning with Incremental Quantization. Incremental quantization of weights & activations. [Aojun Zhou et al., ICLR 2017]
Transport of Huge Weight Information. We need the weight information of each layer l_i to compute the back-propagation error!!
New Attempt to Simplify Back-Propagation (Feedback Alignment). Standard back-propagation: δ^{(1)} = D^{(1)} W^{(2)T} δ^{(2)} → we need to transport all the weight data. Feedback alignment: δ^{(1)} = D^{(1)} B δ^{(2)}, where B is a random but fixed matrix; as long as the alignment condition between B and W^{(2)} is satisfied, we can simply compute gradient descent as before.
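A sketch of feedback alignment for one hidden layer: the backward pass uses the fixed random matrix B instead of W2.T, and everything else is ordinary gradient descent (two-layer network, sigmoid hidden layer, squared error; all sizes and names are illustrative):

```python
import numpy as np

def fa_step(x, t, W1, W2, B, lr=0.1):
    """One feedback-alignment update: delta_hidden = (B @ delta_out) * f'(a1)
    instead of (W2.T @ delta_out) * f'(a1)."""
    a1 = W1 @ x
    h = 1.0 / (1.0 + np.exp(-a1))                 # hidden activation
    y = W2 @ h                                    # linear output
    delta_out = y - t                             # dE/dy for E = 0.5 * ||y - t||^2
    delta_hid = (B @ delta_out) * h * (1.0 - h)   # random feedback instead of W2.T
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2

# B is random but fixed for the whole training run.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 2
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))
W2 = rng.normal(scale=0.1, size=(n_out, n_hid))
B  = rng.normal(scale=0.1, size=(n_hid, n_out))
```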
Feedback Alignment for MNIST dataset
Feedback Alignment for CIFAR-10 dataset