Backpropagation in Neural Network


Victor BUSA
victorbusa@gmail.com

April 12, 2017

Well, again, when I started to follow the Stanford class as a self-taught person, one thing really bothered me about the computation of the gradient during backpropagation. It is actually the most important part of the first 5 lessons, and yet all the examples from the Stanford class are on 1-D functions. You can see those examples via this link: http://cs231n.github.io/optimization-2/. The advice they give is that in higher dimensions it "works quite the same". That is not false, but I think we still need to do the math to see it. So, in this paper I will compute the gradient in higher dimensions of the ReLU, the bias and the weight matrix in a fully connected network.

Forward pass

Before dealing with the backward pass and the computation of the gradient in higher dimensions, let's compute the forward pass first; then we will backpropagate the gradient. Using the notation of the assignment, we have:

$y_1 = X W_1 + b_1$
$h_1 = \max(0, y_1)$
$y_2 = h_1 W_2 + b_2$
$L_i = -\log\left(\dfrac{e^{y_{2,y_i}}}{\sum_j e^{y_{2,j}}}\right)$
$L = \dfrac{1}{N}\sum_i L_i$

where $N$ is the number of training samples. In Python we can compute the forward pass using the following code:

    y1 = X.dot(W1) + b1                 # (N, H) + (H,)
    h1 = np.maximum(0, y1)
    y2 = h1.dot(W2) + b2
    scores = y2
    # corresponds to e^{y_2} in the maths
    exp_scores = np.exp(scores)
    # corresponds to e^{y_2} / sum_j e^{y_2[j]} in the maths
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # corresponds to -log[(e^{y_2} / sum_j e^{y_2[j]})[y_i]] = L_i in the maths
    correct_logprobs = -np.log(probs[range(N), y])
    # corresponds to L
    loss = np.sum(correct_logprobs) / N

And using the notation of the assignment, we have:

$X \in \mathbb{R}^{N \times D}$, $W_1 \in \mathbb{R}^{D \times H}$, $b_1 \in \mathbb{R}^{H}$
$h_1 \in \mathbb{R}^{N \times H}$, $W_2 \in \mathbb{R}^{H \times C}$
$y_2 \in \mathbb{R}^{N \times C}$, $b_2 \in \mathbb{R}^{C}$

Backpropagation pass

Gradient of the softmax

We already saw (previous paper) that the gradient of the softmax is given by:

$\dfrac{\partial L_i}{\partial f_k} = p_k - 1(k = y_i)$, where $p_k = \dfrac{e^{f_k}}{\sum_j e^{f_j}}$

So the gradient of the loss with respect to $y_2$ is just the matrix probs (see the Python code of the forward pass) in which we subtract 1 only in the $y_i$-th column, and we need to do this for each row of the matrix probs (because each row corresponds to a sample). So in Python we can write:

    dy2 = probs
    dy2[range(N), y] -= 1
    dy2 /= N

Note: we divide by N because the total loss is averaged over the N samples (see the forward-pass code).
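To convince ourselves that this vectorized expression is correct, we can compare it with a finite-difference estimate of $\frac{\partial L}{\partial y_2}$. The snippet below is only a small sanity-check sketch added here (it is not part of the assignment code); the sizes N, C and all variable names are made up for illustration.

    import numpy as np

    np.random.seed(0)
    N, C = 4, 3                        # 4 samples, 3 classes (made-up sizes)
    y2 = np.random.randn(N, C)         # fake scores
    y = np.random.randint(0, C, N)     # fake labels

    def loss_from_scores(scores):
        # softmax cross-entropy loss, averaged over the N samples
        exp_scores = np.exp(scores)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        return -np.sum(np.log(probs[range(N), y])) / N

    # analytic gradient, exactly as derived above
    exp_scores = np.exp(y2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    dy2 = probs.copy()
    dy2[range(N), y] -= 1
    dy2 /= N

    # numerical gradient, centered differences
    eps = 1e-5
    num = np.zeros_like(y2)
    for i in range(N):
        for j in range(C):
            y2_p, y2_m = y2.copy(), y2.copy()
            y2_p[i, j] += eps
            y2_m[i, j] -= eps
            num[i, j] = (loss_from_scores(y2_p) - loss_from_scores(y2_m)) / (2 * eps)

    print(np.max(np.abs(num - dy2)))   # should be tiny, around 1e-10

The two gradients agree up to numerical precision, which is a good sign that the "subtract 1 at the correct class, then divide by N" recipe is right.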

Gradient of the fully connected network (weight matrix)

Now we want to compute $\frac{\partial L}{\partial W_2}$. To do so, we will use the chain rule. Note that to avoid complex notation, I rewrite $W_2$ as $W$, with $w_{ij}$ being the coefficients of $W$ ($b_2$ is replaced by $b$, $y_2$ by $y$ and $h_1$ by $h$). As $L$ is a scalar we can compute $\frac{\partial L}{\partial w_{ij}}$ directly. To do so, we will use the chain rule in higher dimensions. Let's first recall that with our simplified notation we have:

$y = hW + b$

where $h$ is an $(N, H)$ matrix, $W$ is an $(H, C)$ matrix and $b$ is a $(C, 1)$ column vector. The chain rule gives:

$\dfrac{\partial L}{\partial w_{ij}} = \sum_{p,q} \dfrac{\partial L}{\partial y_{pq}} \dfrac{\partial y_{pq}}{\partial w_{ij}}$  (1)

We already know all the $\frac{\partial L}{\partial y_{pq}}$ (these are the entries of the $\frac{\partial L}{\partial y_2}$ we computed in the previous paragraph), so we only need to focus on computing $\frac{\partial y_{pq}}{\partial w_{ij}}$:

$y_{pq} = \sum_{u=1}^{H} h_{pu} w_{uq} + b_q \quad \Rightarrow \quad \dfrac{\partial y_{pq}}{\partial w_{ij}} = 1\{q = j\}\, h_{pi}$  (2)

So finally, replacing (2) in (1), we have:

$\dfrac{\partial L}{\partial w_{ij}} = \sum_{p,q} \dfrac{\partial L}{\partial y_{pq}}\, 1\{q = j\}\, h_{pi} = \sum_{p} h_{pi}\, \dfrac{\partial L}{\partial y_{pj}} = \sum_{p} (h^T)_{ip} \left(\dfrac{\partial L}{\partial y}\right)_{pj}$  (3)

where we used the fact that $h_{pi}$ and $\frac{\partial L}{\partial y_{pj}}$ are scalars and scalar multiplication is commutative. Finally we see that:

$\left(\dfrac{\partial L}{\partial W}\right)_{ij} = \sum_{p} (h^T)_{ip} \left(\dfrac{\partial L}{\partial y}\right)_{pj}$

So we recognize a matrix product: $h^T$ times $\frac{\partial L}{\partial y}$. Using the assignment notation this is:

$\dfrac{\partial L}{\partial W_2} = h_1^T\, \dfrac{\partial L}{\partial y_2}$

In Python, denoting $\frac{\partial L}{\partial x}$ by dx, we can write:

    dW2 = h1.T.dot(dy2)
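As a sanity check on equation (3), we can again compare dW2 = h1.T.dot(dy2) with a finite-difference estimate, one coefficient $w_{ij}$ at a time. This is only a sketch with made-up sizes N, H, C; it is not part of the assignment code.

    import numpy as np

    np.random.seed(1)
    N, H, C = 5, 4, 3                  # made-up sizes
    h1 = np.random.randn(N, H)
    W2 = np.random.randn(H, C)
    b2 = np.random.randn(C)
    y = np.random.randint(0, C, N)

    def loss(W):
        # softmax cross-entropy loss of the last layer, as a function of W only
        scores = h1.dot(W) + b2
        exp_scores = np.exp(scores)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        return -np.sum(np.log(probs[range(N), y])) / N

    # analytic gradient: dy2 from the softmax section, then dW2 = h1^T dy2
    exp_scores = np.exp(h1.dot(W2) + b2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    dy2 = probs.copy()
    dy2[range(N), y] -= 1
    dy2 /= N
    dW2 = h1.T.dot(dy2)                # shape (H, C), same as W2

    # numerical gradient of a single coefficient w_ij
    i, j, eps = 2, 1, 1e-5
    Wp, Wm = W2.copy(), W2.copy()
    Wp[i, j] += eps
    Wm[i, j] -= eps
    print((loss(Wp) - loss(Wm)) / (2 * eps), dW2[i, j])   # the two numbers should match

Note also that the shapes work out: $h_1^T$ is $(H, N)$ and $\frac{\partial L}{\partial y_2}$ is $(N, C)$, so their product has the same $(H, C)$ shape as $W_2$, as it must.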

Gradient of the fully connected network (bias)

Let's define $y = Wx + b$ with $x$ being a column vector. With the notation of the assignment we have:

$\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1C} \\ w_{21} & w_{22} & \cdots & w_{2C} \\ \vdots & \vdots & & \vdots \\ w_{H1} & w_{H2} & \cdots & w_{HC} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_C \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_H \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_H \end{pmatrix}$  (4)

We already computed $\frac{\partial L}{\partial y_2}$ (gradient of the softmax), so according to the chain rule we want to compute:

$\dfrac{\partial L}{\partial b} = \dfrac{\partial L}{\partial y_2} \dfrac{\partial y_2}{\partial b}$

Note that here $y_2 = y$ to simplify the notation. Also, according to (4), we see that if $i \neq j$:

$\dfrac{\partial y_i}{\partial b_j} = \dfrac{\partial}{\partial b_j}\left(\sum_{k=1}^{C} w_{ik} x_k + b_i\right) = 0$

and if $i = j$:

$\dfrac{\partial y_i}{\partial b_i} = \dfrac{\partial}{\partial b_i}\left(\sum_{k=1}^{C} w_{ik} x_k + b_i\right) = 1$

Hence $\frac{\partial y_2}{\partial b}$ is the identity matrix (1 on the diagonal and 0 elsewhere). Denoting this matrix $I_H$, we have:

$\dfrac{\partial L}{\partial b} = \dfrac{\partial L}{\partial y_2}\, I_H = \dfrac{\partial L}{\partial y_2}$

Now, if we are dealing with $X$ being a matrix, we can simply notice that the gradient is the sum of all the local gradients coming from the $y_i$ (see Figure 1).

[Figure 1: Gradient of the bias. We see that b receives n incoming gradients, so we have to add all those incoming gradients to get the gradient with respect to the bias.]

So we have:

$\dfrac{\partial L}{\partial b} = \sum_{i=1}^{n} \dfrac{\partial L}{\partial y_i} \dfrac{\partial y_i}{\partial b} = \sum_{i=1}^{n} \dfrac{\partial L}{\partial y_i}$

In Python, we can compute the gradient with the following code:

    db2 = np.sum(dy2, axis=0)
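To see the "sum of incoming gradients" argument in code: since $b_2$ is broadcast onto every row of $h_1 W_2$, each of the N rows of dy2 contributes its own gradient to $b_2$. The tiny sketch below (made-up sizes, not assignment code) just checks that np.sum(dy2, axis=0) is the vectorized form of that accumulation.

    import numpy as np

    np.random.seed(2)
    N, C = 3, 4                        # made-up sizes
    dy2 = np.random.randn(N, C)        # stands for the upstream gradient

    db2 = np.sum(dy2, axis=0)          # vectorized version from the text

    # equivalent loop: accumulate the incoming gradient of every sample
    db2_loop = np.zeros(C)
    for i in range(N):
        db2_loop += dy2[i]

    print(np.allclose(db2, db2_loop))  # True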

Gradient of the ReLU

I won't go into too much detail as we understand how it works now: we use the chain rule and the local gradient. Here again I will focus on computing the gradient with respect to $x$, $x$ being a vector. In reality $X$ is a matrix, as we use mini-batches and a vectorized implementation to speed up the computation. For the local gradient we have:

$\dfrac{\partial}{\partial x}\big(\mathrm{ReLU}(x)\big) = \begin{pmatrix} \frac{\partial \max(0, x_1)}{\partial x_1} & \frac{\partial \max(0, x_1)}{\partial x_2} & \cdots & \frac{\partial \max(0, x_1)}{\partial x_H} \\ \frac{\partial \max(0, x_2)}{\partial x_1} & \frac{\partial \max(0, x_2)}{\partial x_2} & \cdots & \frac{\partial \max(0, x_2)}{\partial x_H} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \max(0, x_H)}{\partial x_1} & \frac{\partial \max(0, x_H)}{\partial x_2} & \cdots & \frac{\partial \max(0, x_H)}{\partial x_H} \end{pmatrix} = \begin{pmatrix} 1(x_1 > 0) & 0 & \cdots & 0 \\ 0 & 1(x_2 > 0) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1(x_H > 0) \end{pmatrix}$

and then, using the chain rule, we have what we want:

$\dfrac{\partial L}{\partial x} = \dfrac{\partial L}{\partial y}\, \dfrac{\partial y}{\partial x} = \dfrac{\partial L}{\partial y}\, \dfrac{\partial}{\partial x}\big(\mathrm{ReLU}(x)\big) = \dfrac{\partial L}{\partial y} \begin{pmatrix} 1(x_1 > 0) & 0 & \cdots & 0 \\ 0 & 1(x_2 > 0) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1(x_H > 0) \end{pmatrix}$

In Python (dy1 being $\frac{\partial L}{\partial x}$ and dh1 being $\frac{\partial L}{\partial y}$), we can write:

    dy1 = dh1 * (y1 > 0)

Note: I let the reader compute the gradient of the ReLU when $x$ is a matrix. It isn't difficult; we just need to use the chain rule in higher dimensions (as I did for the computation of the gradient with respect to the weight matrix). I preferred to use $x$ as a vector to be able to visualize the Jacobian of the ReLU.

Conclusion

In this paper we tried to understand quite precisely how to compute the gradient in higher dimensions. We hence gain a better understanding of what's happening behind the Python code and are ready to compute the gradient of other activation functions or other kinds of layers (not necessarily fully connected, for example).
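To recap, here is one possible end-to-end backward pass for the two-layer network of the forward-pass section, written as a self-contained sketch. The lines for dh1, dW1 and db1 are not derived in this paper; they follow from the same chain-rule arguments (in particular $\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial y_2} W_2^T$ is the mirror image of the $\frac{\partial L}{\partial W_2}$ derivation), so treat this as an illustration rather than the assignment's reference solution.

    import numpy as np

    def two_layer_backward(X, y, W1, b1, W2, b2):
        """Forward + backward pass of the two-layer net, as a sketch.

        Returns the loss and the gradient of every parameter.
        """
        N = X.shape[0]

        # forward pass (same as the code in the forward-pass section)
        y1 = X.dot(W1) + b1
        h1 = np.maximum(0, y1)
        y2 = h1.dot(W2) + b2
        exp_scores = np.exp(y2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        loss = -np.sum(np.log(probs[range(N), y])) / N

        # backward pass
        dy2 = probs.copy()
        dy2[range(N), y] -= 1
        dy2 /= N                        # gradient of the softmax loss
        dW2 = h1.T.dot(dy2)             # derived in the weight-matrix section
        db2 = np.sum(dy2, axis=0)       # derived in the bias section
        dh1 = dy2.dot(W2.T)             # chain rule, mirror of the dW2 derivation
        dy1 = dh1 * (y1 > 0)            # ReLU gradient from the last section
        dW1 = X.T.dot(dy1)
        db1 = np.sum(dy1, axis=0)

        return loss, {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}

Checking each returned gradient against a finite-difference estimate, as in the earlier snippets, is a cheap way to validate the whole implementation.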