<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)


Outline
- Summary of Previous Stanford Lecture
- Universal Approximation Theorem
- Inference vs. Training
- Gradient Descent
- Back-Propagation of MLP: derivatives of activations; back-propagation in MLP
- Back-Propagation of CNN: intuition of CNN; pooling layer & stride; back-propagation in CNN
- Hardware Issues & Other Learning Methods

Previous Stanford Lectures

Previous Stanford Lectures: Do we understand how back-propagation works? (Today's purpose)

Outline
- Summary of Previous Stanford Lecture
- Universal Approximation Theorem
- Inference vs. Training
- Gradient Descent
- Back-Propagation of MLP: derivatives of activations; back-propagation in MLP
- Back-Propagation of CNN: intuition of CNN; pooling layer & stride; back-propagation in CNN
- Other Learning Methods

History of Neural Networks (Universal Approximation Theorem)
A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on R^n.
- George Cybenko (1989): with a sigmoidal nonlinearity added, decision regions can be approximated arbitrarily well → learnable.
- A multilayer feedforward architecture therefore has the potential of a universal approximator (in L^p space) → learnable.

Forward and Backward Paths
Forward path (inference): a 28x28 input image is mapped to class scores by f_W(x).

Forward and Backward Paths
Forward path output: [1: 0.1, 2: 0.6, 3: 0.8], but what we expect (the label) is [1: 0.0, 2: 1.0, 3: 0.0] → Error!

Forward and Backward Paths
Backward path (error propagation): the error [1: 0.1, 2: -0.4, 3: 0.8] is propagated back through f_W(x).

Forward and Backward Paths
After updating the weights, the forward path gives [1: 0.08, 2: 0.75, 3: 0.4], and the backward path now propagates the smaller error [1: 0.08, 2: -0.25, 3: 0.4].

Perceptron: Basic Structure for DNN
A perceptron computes a linear summation Wx + b followed by a nonlinear activation function (e.g., sigmoid, ReLU). The nonlinear activation is what makes nonlinear classification possible.

Forward Inference
The input nodes x_1, x_2 (plus a constant 1 for the bias) are mapped by the linear stage g(x) (weights W, biases b) to the linear outputs y_1', y_2'; these pass through the nonlinearity h to give the non-linear outputs, so that y = f_W(x):

\begin{pmatrix} y_1' \\ y_2' \end{pmatrix} = \begin{pmatrix} W_{11} & W_{21} & b_1 \\ W_{12} & W_{22} & b_2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ 1 \end{pmatrix}, \qquad y_1 = h(y_1'), \quad y_2 = h(y_2')
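
A minimal NumPy sketch of this 2-input, 2-output forward inference, assuming a sigmoid for h; the weight and input values are made up purely for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Augmented weight matrix [[W11, W21, b1], [W12, W22, b2]] (illustrative values)
W_aug = np.array([[0.5, -0.3, 0.1],
                  [0.2,  0.8, -0.4]])

x = np.array([1.0, 2.0])      # inputs x1, x2
x_aug = np.append(x, 1.0)     # append constant 1 for the bias term

y_lin = W_aug @ x_aug         # linear outputs y1', y2'
y = sigmoid(y_lin)            # non-linear outputs y1, y2
print(y_lin, y)
```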

Principle of Back-propagation: Gradient Descent
Weight update by gradient descent: w_i^{t+1} = w_i^t - η ∂E/∂w_i  (t: iteration index).
Each weight moves along the steepest gradient of the error E, downhill toward a minimum.

Principle of Back-propagation: Gradient Descent
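
A minimal sketch of this update rule on a toy one-dimensional error function E(w) = (w - 3)^2; the learning rate and iteration count are arbitrary.

```python
import numpy as np

def gradient_descent_step(w, grad_E, lr=0.01):
    """One update: w^{t+1} = w^t - lr * dE/dw."""
    return w - lr * grad_E

# Toy example: minimize E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for t in range(100):
    grad = 2.0 * (w - 3.0)
    w = gradient_descent_step(w, grad, lr=0.1)
print(w)   # approaches the minimum at w = 3.0
```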

Examples of Activation Function
1) Sigmoid: s_c(x) = 1 / (1 + e^{-cx}), with derivative s_c'(x) = c e^{-cx} / (1 + e^{-cx})^2 = c · s_c(x)(1 - s_c(x)).
A large c brings the sigmoid close to a step function. The gradient can be computed directly from the (already available, non-gradient) function value. Problem: gradient vanishing.

2) Absolute value rectification: h(x) = |x|. Intuitive meaning: folding the input space.
3) ReLU: h(x) = max(0, b + w·x). Simple gradient: the derivative is 0 for x < 0 and 1 for x > 0. Suggested by Hinton to address gradient vanishing.
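
A small NumPy sketch of the sigmoid and ReLU activations and their derivatives as given above (c defaults to 1; the bias/weight inside the ReLU argument is left out for simplicity).

```python
import numpy as np

def sigmoid(x, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * x))

def d_sigmoid(x, c=1.0):
    s = sigmoid(x, c)
    return c * s * (1.0 - s)       # c * s(x) * (1 - s(x))

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)   # 0 for x < 0, 1 for x > 0

x = np.linspace(-3, 3, 7)
print(sigmoid(x), d_sigmoid(x))
print(relu(x), d_relu(x))
```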

Details of Back-Propagation: Output Layer
For an output unit j fed by hidden activations h_i:
  o_j = Σ_i w_ij h_i + b,   z_j = f(o_j)   (f: activation function),
and the loss E is evaluated on the outputs z_j. The weight update is w_ij^{t+1} = w_ij^t - η ∂E/∂w_ij, where
  ∂E/∂w_ij = (∂E/∂z_j)(∂z_j/∂o_j)(∂o_j/∂w_ij) = (z_j - t_j) f'(o_j) h_i = δ_j h_i,
i.e. (derivative of the error function) × (derivative of the activation function) × (input value).
Sigmoid case: f'(o_j) = z_j(1 - z_j), since s'(x) = s(x)(1 - s(x)).

Details of Back-Propagation: Hidden Layer
For a hidden unit j (layer 1) whose output feeds units k = 1..m of layer 2 through weights w_jk^(2), with layer-2 errors δ_k^(2):
  o_j^(1) = Σ_i w_ij^(1) h_i + b,   z_j = f(o_j^(1))   (f: activation function),
and the update is again w_ij^{t+1} = w_ij^t - η ∂E/∂w_ij, where
  ∂E/∂w_ij = (Σ_{k=1..m} δ_k^(2) w_jk^(2)) f'(o_j^(1)) h_i = δ_j^(1) h_i,
i.e. (error propagated back from the next layer) × (derivative of the activation function) × (input value).
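
Putting the output-layer and hidden-layer rules together, here is a minimal NumPy sketch of one forward/backward pass for a 2-3-2 network with sigmoid activations and squared error; the sizes, seed, and learning rate are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Shapes: x (n_in,), W1 (n_hid, n_in), W2 (n_out, n_hid); biases b1, b2.
def forward_backward(x, t, W1, b1, W2, b2, lr=0.1):
    # Forward pass
    o1 = W1 @ x + b1          # hidden pre-activation
    h  = sigmoid(o1)          # hidden activation
    o2 = W2 @ h + b2          # output pre-activation
    z  = sigmoid(o2)          # output activation
    E  = 0.5 * np.sum((z - t) ** 2)

    # Output layer: delta_j = (z_j - t_j) * f'(o_j), with f'(o) = z(1 - z) for sigmoid
    delta2 = (z - t) * z * (1.0 - z)
    # Hidden layer: delta_j = (sum_k delta_k * w_jk) * f'(o_j)
    delta1 = (W2.T @ delta2) * h * (1.0 - h)

    # Gradient = delta * input value; update by gradient descent
    W2 -= lr * np.outer(delta2, h)
    b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x)
    b1 -= lr * delta1
    return E

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
x, t = np.array([0.5, -1.0]), np.array([0.0, 1.0])
for _ in range(200):
    E = forward_backward(x, t, W1, b1, W2, b2)
print(E)   # the loss should decrease toward 0
```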

Backward Error Propagation (fully connected layers FC4, FC5, FC6)
Forward inference: O_4 = f(I_4), I_5 = W_5 O_4, O_5 = f(I_5), I_6 = W_6 O_5, O_6 = f(I_6), and the loss is E = ||O_6 - t||^2.
Back-propagation, writing e = O_6 - t for the output error and D_k = diag(f'(I_k)):
  δ_6 = ∂E/∂I_6 = D_6 e,           ∂E/∂W_6 = δ_6 O_5^T
  δ_5 = ∂E/∂I_5 = D_5 W_6^T δ_6,   ∂E/∂W_5 = δ_5 O_4^T
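
The same chain in matrix form, as a sketch using the slide's I/O/W/δ names; tanh is an arbitrary stand-in for the activation f, and all values are random placeholders.

```python
import numpy as np

def f(x):  return np.tanh(x)            # activation (illustrative choice)
def df(x): return 1.0 - np.tanh(x)**2

rng = np.random.default_rng(1)
W5, W6 = rng.normal(size=(4, 4)), rng.normal(size=(3, 4))
O4, t = rng.normal(size=4), np.array([0.0, 1.0, 0.0])

# Forward inference through FC5 and FC6
I5 = W5 @ O4; O5 = f(I5)
I6 = W6 @ O5; O6 = f(I6)
e  = O6 - t                             # output error

# Backward error propagation: delta_k = D_k (W_{k+1}^T delta_{k+1}), D_k = diag(f'(I_k))
delta6 = df(I6) * e                     # D6 e
delta5 = df(I5) * (W6.T @ delta6)       # D5 W6^T delta6

dW6 = np.outer(delta6, O5)              # dE/dW6 = delta6 O5^T
dW5 = np.outer(delta5, O4)              # dE/dW5 = delta5 O4^T
```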

Various Gradient Descent
- Batch gradient descent: use all m examples in each iteration (expensive for large datasets).
- Stochastic gradient descent: use 1 example in each iteration (large fluctuation).
- Mini-batch gradient descent: use a small batch of b examples in each iteration.
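
A sketch of the three variants using a single batch_size parameter (batch_size = 1 gives stochastic GD, batch_size = m gives batch GD), applied to a made-up linear least-squares problem; all names and values are illustrative.

```python
import numpy as np

def minibatch_sgd(X, T, w, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent: update w with batch_size examples per iteration."""
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[batch], T[batch])
    return w

# Gradient of the mean squared error of a linear model on one mini-batch
grad_fn = lambda w, Xb, Tb: 2.0 * Xb.T @ (Xb @ w - Tb) / len(Xb)
X = np.random.default_rng(1).normal(size=(200, 3))
T = X @ np.array([1.0, -2.0, 0.5])
w = minibatch_sgd(X, T, np.zeros(3), grad_fn, lr=0.1)
print(w)   # approaches [1, -2, 0.5]
```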

Outline
- Summary of Previous Stanford Lecture
- Universal Approximation Theorem
- Inference vs. Training
- Gradient Descent
- Back-Propagation of MLP: derivatives of activations; back-propagation in MLP
- Back-Propagation of CNN: intuition of CNN; pooling layer & stride; back-propagation in CNN
- Hardware Issues & Other Learning Methods

Limit of the Multilayer Perceptron
The MLP has the potential of a universal approximator.[1]
Fully connected layer example: a 1000x1000 image with 1M hidden units requires 10^12 parameters. Too many parameters: unrealistic memory requirements, slow learning.
Locally connected layer: a simple solution to reduce the number of parameters.
[1] Approximation Capabilities of Multilayer Feedforward Networks, Kurt Hornik, 1990.

Motivation of the Convolution Layer
A locally connected layer with shared weights: the number of parameters becomes much smaller through weight sharing.
Fully connected layer → locally connected layer → convolution layer.

Convolution Layer w/ 2D Convolution
Pipeline: Input Image → Convolution Layer → Non-linearity → Pooling Layer → Feature Maps.
Convolving the input x[m, n] with different kernels h_1[i, j], h_2[i, j], h_3[i, j] produces different feature maps of the original image (e.g., sharpen, blur, edge).
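
A minimal NumPy sketch of such a valid 2-D convolution with a small, illustrative edge-style kernel; real frameworks use optimized implementations, and the kernel values here are not from the slide.

```python
import numpy as np

def conv2d(x, h):
    """Valid 2-D convolution of image x with kernel h (true convolution: kernel is flipped)."""
    kh, kw = h.shape
    H, W = x.shape
    h_flipped = h[::-1, ::-1]
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * h_flipped)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
edge = np.array([[-1., 0., 1.],                # illustrative edge-style kernel
                 [-2., 0., 2.],
                 [-1., 0., 1.]])
print(conv2d(x, edge))                         # 3x3 feature map
```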

Convolution Layer w/ Non-linearity
Input Image → Convolution Layer → Non-linearity → Pooling Layer → Feature Maps.
Because multiple non-linearities are stacked, simple features are combined into complex features.

Relation btw Conv. Layer & 2D Conv.
(Figure: the same 3x3 kernel shown as a sliding 2D convolution and as a convolution layer.)

Substitute Conv. by Cross-correlation
(Figure: the same 3x3 kernel applied as a convolution and as a cross-correlation.) In practice the convolution layer is implemented as cross-correlation, i.e., without flipping the kernel, followed by the non-linearity and pooling stages.

Error Update in Pooling Layer
Role of pooling: image-size (dimension) reduction while keeping the important features.
Max pooling vs. average pooling: max pooling is generally better. In max pooling, an activated node receives error from the upper layer only when its value was the one selected.
- Average pooling: g(x) = (1/m) Σ_{k=1..m} x_k,  ∂g/∂x_i = 1/m.
- Max pooling: g(x) = max(x),  ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise.
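
As a sketch of the max-pooling rule above (only the selected node receives the upper-layer error), here is a 2x2, stride-2 max-pooling forward pass with its backward error routing in NumPy; the window size and input values are illustrative.

```python
import numpy as np

def max_pool_2x2_forward(x):
    """2x2 max pooling with stride 2; also returns the argmax mask for the backward pass."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    mask = np.zeros_like(x)
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            window = x[i:i+2, j:j+2]
            out[i // 2, j // 2] = window.max()
            r, c = np.unravel_index(window.argmax(), window.shape)
            mask[i + r, j + c] = 1.0       # only the selected node will get the error
    return out, mask

def max_pool_2x2_backward(delta, mask):
    """Route each upper-layer error value back to the position that was selected."""
    grad = np.zeros_like(mask)
    H, W = delta.shape
    for i in range(H):
        for j in range(W):
            grad[2*i:2*i+2, 2*j:2*j+2] = delta[i, j] * mask[2*i:2*i+2, 2*j:2*j+2]
    return grad

x = np.arange(16, dtype=float).reshape(4, 4)
y, mask = max_pool_2x2_forward(x)
grad = max_pool_2x2_backward(np.ones_like(y), mask)   # errors flow only to the max positions
```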

Deep Learning Cycle (1): Forward Inference
Forward pass: Input Image → feature maps (computed with the kernel weights) → fully connected layers (FC parameters) → inference result.

Deep Learning Cycle (2): Back-propagation
The error (loss) is computed with the labeled data (ground truth). Error maps are then propagated backward through the FC parameters and kernel weights, which are duplicated for the backward pass; generally a GPU uses about 2x the memory during back-propagation.

Deep Learning Cycle (3): Weight Update
The feature-map data saved during the forward pass is used for the weight update: each kernel weight and FC parameter is updated with the error propagated to it.

Error Update in Convolution Layer
To propagate the error through a convolution layer, flip the kernel: the error map δ coming back from the upper layer (through the non-linearity and pooling stages) is convolved with the 180°-rotated kernel, so each input position accumulates the δ values it contributed to, each multiplied by the corresponding kernel weight (e.g., a center position receives contributions from δ_11, δ_12, δ_21, and δ_22).
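
A minimal NumPy sketch of this error propagation, assuming the forward pass was implemented as valid cross-correlation (as in the earlier slide): the upper-layer error map is zero-padded and convolved with the flipped kernel to give the error map of the layer input. Kernel and error values are illustrative.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid cross-correlation (no kernel flip), used as a helper."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def conv_backward_input(delta, kernel):
    """Propagate the error map delta to the layer input: full convolution with the flipped kernel."""
    kh, kw = kernel.shape
    padded = np.pad(delta, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return conv2d_valid(padded, kernel[::-1, ::-1])

kernel = np.arange(1.0, 10.0).reshape(3, 3)     # 3x3 kernel with weights 1..9, as in the figure
delta = np.array([[1.0, 2.0],                   # 2x2 error map from the upper layer
                  [3.0, 4.0]])
d_input = conv_backward_input(delta, kernel)    # 4x4 error map for the layer input
print(d_input)
```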

Weight Update in Convolution Layer
The gradient of each kernel weight is the valid cross-correlation of the layer's input feature map o with the error map δ, e.g. for a 3x3 input and a 2x2 error map:
  ∂E/∂w_11 = o_1 δ_11 + o_2 δ_12 + o_4 δ_21 + o_5 δ_22
  ∂E/∂w_12 = o_2 δ_11 + o_3 δ_12 + o_5 δ_21 + o_6 δ_22
  ∂E/∂w_21 = o_4 δ_11 + o_5 δ_12 + o_7 δ_21 + o_8 δ_22
  ∂E/∂w_22 = o_5 δ_11 + o_6 δ_12 + o_8 δ_21 + o_9 δ_22
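
A matching sketch of the kernel-weight gradient: each weight's gradient is the sum of (input value) × (error value) over the positions where that weight was used, i.e. a valid cross-correlation of the input feature map with the error map. The error values are illustrative.

```python
import numpy as np

def conv_kernel_gradient(o, delta):
    """dE/dw[a, b] = sum_{i, j} delta[i, j] * o[i + a, j + b]  (valid cross-correlation of o with delta)."""
    kh = o.shape[0] - delta.shape[0] + 1
    kw = o.shape[1] - delta.shape[1] + 1
    grad = np.zeros((kh, kw))
    for a in range(kh):
        for b in range(kw):
            grad[a, b] = np.sum(delta * o[a:a+delta.shape[0], b:b+delta.shape[1]])
    return grad

o = np.arange(1.0, 10.0).reshape(3, 3)        # input feature map o1..o9
delta = np.array([[0.1, -0.2],                # 2x2 error map (illustrative values)
                  [0.3,  0.4]])
dW = conv_kernel_gradient(o, delta)           # 2x2 kernel gradient
# dW[0, 0] == 0.1*o1 - 0.2*o2 + 0.3*o4 + 0.4*o5, matching the first sum above
print(dW)
```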

Outline
- Summary of Previous Stanford Lecture
- Universal Approximation Theorem
- Inference vs. Training
- Gradient Descent
- Back-Propagation of MLP: derivatives of activations; back-propagation in MLP
- Back-Propagation of CNN: intuition of CNN; pooling layer & stride; back-propagation in CNN
- Hardware Issues & Other Learning Methods

Feature Map Extraction by Convolution Layers
All feature maps are needed for learning, and together they are too large. The stack goes from low-level features to mid-level and high-level features and finally a trainable classifier; because of the multiple non-linearities, simple features are composed into complex features.

Large Memory for Feature Maps
Feature maps take up a large portion of memory; in general, memory usage grows with network depth (due to the accumulated feature-map size). [Rhu et al., vDNN, MICRO 2016]

Memory Usage of Various Networks [Rhu et al., vDNN, MICRO 2016]

Learning with Lower Bit Precision
All data (weights, feature maps, gradients) are kept at lower bit precision. [Suyog Gupta et al., 2015]

Learning with Lower Bit Precision (2)
Gradients and nonlinear activations are kept at lower bit precision (8-bit, dynamic tree data type); evaluated with AlexNet on the ImageNet dataset. [Tim Dettmers, ICLR 2016]

Learning with Quantization (BNN)
Quantization of weights & activations. [Itay Hubara et al., 2016]
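
To make the idea concrete, here is a small sketch of binarization (BNN-style sign quantization) and a generic uniform quantizer; this is an illustrative scheme, not the exact method of the cited papers.

```python
import numpy as np

def binarize(w):
    """Deterministic binarization: sign(w) in {-1, +1}."""
    return np.where(w >= 0, 1.0, -1.0)

def quantize_uniform(x, n_bits=8, x_max=1.0):
    """Simple uniform quantizer to n_bits over [-x_max, x_max] (illustrative only)."""
    levels = 2 ** n_bits - 1
    x_clipped = np.clip(x, -x_max, x_max)
    step = 2 * x_max / levels
    return np.round(x_clipped / step) * step

w = np.random.default_rng(0).normal(scale=0.5, size=5)
print(binarize(w), quantize_uniform(w, n_bits=4))
```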

Learning with Incremental Quantization
Incremental quantization of weights & activations. [Aojun Zhou et al., ICLR 2017]

Transport of Huge Weight Information
We need each layer's full weight information to compute the back-propagated error!

New Attempt to Simplify Back-Propagation (Feedback Alignment)
Standard back-propagation: δ_1 = D_1 W_2^T δ_2, which requires transporting all the weight data W_2 back to the earlier layer.
Feedback alignment: δ_1 = D_1 B δ_2, where B is a random but fixed matrix (an alignment condition between B and W_2 should be satisfied); the weights are then updated by simply computing gradient descent with this modulated error.
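
A minimal sketch of feedback alignment on a toy 4-8-3 network (the tanh activation, layer sizes, and learning rate are arbitrary choices here): the only change relative to standard back-propagation is that the fixed random matrix B replaces W_2^T when propagating the error to the hidden layer.

```python
import numpy as np

def df(x):   # derivative of tanh, used as the activation here (illustrative choice)
    return 1.0 - np.tanh(x)**2

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3
W1 = rng.normal(size=(n_hid, n_in))
W2 = rng.normal(size=(n_out, n_hid))
B  = rng.normal(size=(n_hid, n_out))     # random, fixed feedback matrix (never trained)

x, t = rng.normal(size=n_in), np.array([0.0, 1.0, 0.0])
lr = 0.05
for _ in range(100):
    # Forward pass
    i1 = W1 @ x;  o1 = np.tanh(i1)
    i2 = W2 @ o1; o2 = np.tanh(i2)
    e = o2 - t
    # Feedback alignment: replace W2^T by the fixed random matrix B
    delta2 = df(i2) * e                  # D2 e
    delta1 = df(i1) * (B @ delta2)       # D1 B delta2  (back-prop would use W2.T here)
    # Simple gradient-descent update
    W2 -= lr * np.outer(delta2, o1)
    W1 -= lr * np.outer(delta1, x)
print(0.5 * np.sum(e**2))                # loss typically decreases even without using W2^T
```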

Feedback Alignment for MNIST dataset

Feedback Alignment for CIFAR-10 dataset