Machine Learning: Chenhao Tan, University of Colorado Boulder. LECTURE 16. Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielsen.

Logistics. Prelim 1 grades and solutions are available. Homework 3 is due next week. Feedback from the second homework: include more examples and spend more time explaining the equations in class; make questions bold. CLEAR open house right after the class (free food)!

Overview. Forward propagation recap. Back propagation: chain rule, back propagation, full algorithm. Practical issues of back propagation: unstable gradients, weight initialization, alternative regularization, batch size.

Outline: Forward propagation recap.

Forward propagation algorithm. How do we make predictions based on a multi-layer neural network? Store the biases for layer l in b^l and the weight matrix in W^l. [Figure: a fully connected network with inputs x_1, x_2, ..., x_d, parameters (W^1, b^1) through (W^4, b^4), and output o_1.]

Forward propagation algorithm. Suppose your network has L layers. Make a prediction for an instance x:
1: Initialize a^0 = x
2: for l = 1 to L do
3:   z^l = W^l a^{l-1} + b^l
4:   a^l = g(z^l)   // g represents the nonlinear activation
5: end for
6: The prediction ŷ is simply a^L
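
A minimal numpy sketch of this loop (the function and variable names are illustrative, and tanh is only a placeholder for the activation g, which the slides leave unspecified):

import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """Run forward propagation and return all z^l and a^l.
    weights[l-1] is W^l, biases[l-1] is b^l; g is the activation."""
    a = x
    zs, activations = [], [a]
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^l = W^l a^{l-1} + b^l
        a = g(z)               # a^l = g(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations      # the prediction yhat is activations[-1]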

Neural networks in a nutshell. Training data S_train = {(x, y)}. Network architecture (model): ŷ = f_w(x), with parameters W^l, b^l for l = 1, ..., L. Loss function (objective function): L(y, ŷ). How do we learn the parameters? Stochastic gradient descent: W^l ← W^l - η ∂L(y, ŷ)/∂W^l.

Reminder of gradient descent.

Challenge: How do we compute derivatives of the loss function with respect to the weights and biases? Solution: back propagation.

Outline: Back propagation (chain rule, back propagation, full algorithm).

The Chain Rule. The chain rule allows us to take derivatives of nested functions. Univariate chain rule: d/dx f(g(x)) = f'(g(x)) g'(x) = (df/dg)(dg/dx). Example: d/dx [1/(1 + exp(-x))] = exp(-x)/(1 + exp(-x))^2, i.e., dσ(x)/dx = σ(x)(1 - σ(x)).
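
A quick numerical sanity check of this identity (a sketch; the test point and step size are arbitrary choices, not from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 1.3, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))                      # σ(x)(1 - σ(x))
print(numeric, analytic)  # the two values agree to several decimal places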

The Chain Rule. Multivariate chain rule. [Computation graph: u and v feed r(u, v) and s(u, v); r and s feed x(r, s), y(r, s), and z(r, s); these feed f(x, y, z).] The derivative of f with respect to x is ∂f/∂x; similarly we have ∂f/∂y and ∂f/∂z.

The Chain Rule. What is the derivative of f with respect to r? ∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r) + (∂f/∂z)(∂z/∂r). What is the derivative of f with respect to s? ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s).

The Chain Rule. Example: let f = xyz with x = r, y = rs, and z = s. Find ∂f/∂s. Applying the rule: ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s) = yz · 0 + xz · r + xy · 1 = rs^2 · 0 + rs · r + r^2 s · 1 = 2r^2 s. Check directly: f(r, s) = r · rs · s = r^2 s^2, so ∂f/∂s = 2r^2 s.
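
The same result can be checked numerically (a sketch; the test point and finite-difference step are illustrative choices):

import numpy as np

def f(r, s):
    x, y, z = r, r * s, s   # x = r, y = rs, z = s
    return x * y * z        # f = xyz = r^2 s^2

r, s, eps = 2.0, 3.0, 1e-6
numeric = (f(r, s + eps) - f(r, s - eps)) / (2 * eps)  # ∂f/∂s by central difference
analytic = 2 * r**2 * s                                # 2 r^2 s from the chain rule
print(numeric, analytic)   # both are 24.0 (up to rounding)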

The Chain Rule. What is the derivative of f with respect to u? ∂f/∂u = (∂f/∂r)(∂r/∂u) + (∂f/∂s)(∂s/∂u). Crux: if you know the derivative of the objective w.r.t. an intermediate value in the chain, you can eliminate everything in between. This is the cornerstone of the back propagation algorithm.

Back Propagation. [Figure: a fully connected network with inputs x_1, x_2, ..., x_d, parameters (W^1, b^1) through (W^4, b^4), and output o_1.]

Back Propagation. For the derivation, we'll consider a simplified network: a^0 → W^1 → z^1 → a^1 → W^2 → z^2 → a^2 → L(y, a^2). We want to use back propagation to compute the partial derivatives of L w.r.t. the weights and biases, ∂L/∂w^l_ij for l = 1, 2, where w^l_ij is the weight from node j in layer l-1 to node i in layer l. We need to choose an intermediate term that lives on the nodes and that we can easily compute derivatives with respect to. We could choose the a's, but we'll choose the z's because the math is easier. Define the derivative w.r.t. the z's by δ: δ^l_j = ∂L/∂z^l_j. Note that δ^l has the same size as z^l and a^l. Let's compute δ^L for the output layer L: δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j)(da^L_j/dz^L_j).

δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j)(da^L_j/dz^L_j). We know that a^L_j = g(z^L_j), so da^L_j/dz^L_j = g'(z^L_j) and therefore δ^L_j = (∂L/∂a^L_j) g'(z^L_j). Note: the first term is the j-th entry of the gradient of L.

δ^L_j = (∂L/∂a^L_j) g'(z^L_j). We can combine all of these into a vector operation: δ^L = (∂L/∂a^L) ⊙ g'(z^L), where g'(z^L) is the activation derivative applied elementwise to z^L and the symbol ⊙ indicates element-wise multiplication of vectors. Notice that computing δ^L requires knowing the activations. This means that before we can compute derivatives for SGD through back propagation, we first run forward propagation through the network.

Example: suppose we're in a regression setting and choose a sigmoid activation function: L = (1/2) Σ_j (y_j - a^L_j)^2 and a^L_j = σ(z^L_j). Then ∂L/∂a^L_j = (a^L_j - y_j) and da^L_j/dz^L_j = σ'(z^L_j) = σ(z^L_j)(1 - σ(z^L_j)). So δ^L = (a^L - y) ⊙ σ(z^L) ⊙ (1 - σ(z^L)).
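
A small sketch of the output-layer delta for this squared-error / sigmoid example in numpy (the concrete numbers are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

zL = np.array([0.2, -1.0, 0.5])   # output pre-activations z^L
y  = np.array([0.0,  1.0, 1.0])   # targets

aL = sigmoid(zL)                       # a^L = σ(z^L)
deltaL = (aL - y) * aL * (1.0 - aL)    # δ^L = (a^L - y) ⊙ σ'(z^L)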

OK, great! Now we can (easily-ish) compute the δ's for the output layer. But really we're after the partials w.r.t. the weights and biases. Question: what do you notice?

We want to find the derivative of L w.r.t. the weights and biases. Every weight connected to a node in layer L depends on a single δ^L_j.

So we have ∂L/∂w^L_jk = (∂L/∂z^L_j)(∂z^L_j/∂w^L_jk) = δ^L_j ∂z^L_j/∂w^L_jk. We need to compute ∂z^L_j/∂w^L_jk. Recall z^L = W^L a^{L-1} + b^L; its j-th entry is z^L_j = Σ_i w^L_ji a^{L-1}_i + b^L_j. Taking the derivative w.r.t. w^L_jk gives ∂z^L_j/∂w^L_jk = a^{L-1}_k, so ∂L/∂w^L_jk = a^{L-1}_k δ^L_j. This is an easy expression for the derivative w.r.t. every weight leading into layer L.

Let's make the notation a little more practical. W^2 = [[w^2_11, w^2_12, w^2_13], [w^2_21, w^2_22, w^2_23]]. Then ∂L/∂W^2 = [[∂L/∂w^2_11, ∂L/∂w^2_12, ∂L/∂w^2_13], [∂L/∂w^2_21, ∂L/∂w^2_22, ∂L/∂w^2_23]] = [[δ^2_1 a^1_1, δ^2_1 a^1_2, δ^2_1 a^1_3], [δ^2_2 a^1_1, δ^2_2 a^1_2, δ^2_2 a^1_3]]. Now we can write this as an outer product of δ^2 and a^1: ∂L/∂W^2 = δ^2 (a^1)^T. (Exercise for yourself: ∂L/∂b^2.)
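
A sketch of this outer-product form in numpy, matching the 2-unit output and 3-unit hidden layer of the example (the concrete values are made up):

import numpy as np

delta2 = np.array([0.1, -0.3])        # δ^2, one entry per node in layer 2
a1     = np.array([0.5, 0.2, 0.9])    # a^1, activations of layer 1

dL_dW2 = np.outer(delta2, a1)   # ∂L/∂W^2 = δ^2 (a^1)^T, shape (2, 3)
dL_db2 = delta2                 # ∂L/∂b^2 = δ^2 (the exercise on the slide)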

Intermediate summary. For a given training example x, perform forward propagation to get z^l and a^l on each layer. Then, to get the partial derivatives for W^L (here W^2): 1. Compute δ^L = ∂L/∂a^L ⊙ g'(z^L). 2. Compute ∂L/∂W^L = δ^L (a^{L-1})^T and ∂L/∂b^L = δ^L. OK, that wasn't so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer! Problem: how do we do the other layers? Since the formulas were so nice once we knew the adjacent δ^l, it sure would be nice if we could easily compute the δ^l's on earlier layers.

But the relationship between L and z^1 is really complicated because of multiple passes through the activation functions. It is OK! Back propagation comes to the rescue! Notice that δ^1 depends on δ^2.

Notice that δ^1 depends on δ^2. By the multivariate chain rule, δ^{l-1}_k = ∂L/∂z^{l-1}_k = Σ_j (∂L/∂z^l_j)(∂z^l_j/∂z^{l-1}_k) = Σ_j δ^l_j (∂z^l_j/∂z^{l-1}_k). For our small network, for example, δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2).

δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2). Recall that z^2 = W^2 a^1 + b^2, so z^2_i = w^2_i1 a^1_1 + w^2_i2 a^1_2 + w^2_i3 a^1_3 + b^2_i. Taking the derivative, ∂z^2_i/∂z^1_2 = w^2_i2 g'(z^1_2), and plugging in gives δ^1_2 = δ^2_1 w^2_12 g'(z^1_2) + δ^2_2 w^2_22 g'(z^1_2).

If we do this for each of the 3 δ^1_i's, something nice happens (exercise: work out δ^1_1 and δ^1_3 for yourself): δ^1_1 = δ^2_1 w^2_11 g'(z^1_1) + δ^2_2 w^2_21 g'(z^1_1), δ^1_2 = δ^2_1 w^2_12 g'(z^1_2) + δ^2_2 w^2_22 g'(z^1_2), δ^1_3 = δ^2_1 w^2_13 g'(z^1_3) + δ^2_2 w^2_23 g'(z^1_3). Notice that each row of the system gets multiplied by g'(z^1_i), so let's factor those out.

Factoring those out: δ^1_1 = (δ^2_1 w^2_11 + δ^2_2 w^2_21) g'(z^1_1), δ^1_2 = (δ^2_1 w^2_12 + δ^2_2 w^2_22) g'(z^1_2), δ^1_3 = (δ^2_1 w^2_13 + δ^2_2 w^2_23) g'(z^1_3). Remember that δ^2 = [δ^2_1, δ^2_2]^T and W^2 = [[w^2_11, w^2_12, w^2_13], [w^2_21, w^2_22, w^2_23]]. Do you see δ^2 and W^2 lurking anywhere in the above system? Does this help? (W^2)^T = [[w^2_11, w^2_21], [w^2_12, w^2_22], [w^2_13, w^2_23]].

In matrix form, the whole system becomes δ^1 = (W^2)^T δ^2 ⊙ g'(z^1).

OK, great! We can easily compute δ^1 from δ^2. Then we can compute derivatives of L w.r.t. the weights W^1 and biases b^1 exactly the way we did for W^2 and b^2: 1. Compute δ^1 = (W^2)^T δ^2 ⊙ g'(z^1). 2. Compute ∂L/∂W^1 = δ^1 (a^0)^T and ∂L/∂b^1 = δ^1. We've worked this out for a simple network with one hidden layer, but nothing we've done assumed anything about the number of layers, so we can apply the same procedure recursively with any number of layers.

Back Propagation (full algorithm):
δ^L = ∂L/∂a^L ⊙ g'(z^L)              # compute δ's on the output layer
for l = L, ..., 1:
  ∂L/∂W^l = δ^l (a^{l-1})^T           # compute weight derivatives
  ∂L/∂b^l = δ^l                       # compute bias derivatives
  δ^{l-1} = (W^l)^T δ^l ⊙ g'(z^{l-1}) # back prop δ's to the previous layer
(After this, we are ready to do an SGD update on the weights/biases.)
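
A compact numpy sketch of this algorithm for one possible instantiation, squared-error loss with sigmoid units as in the regression example above (the function and variable names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Return dL/dW^l and dL/db^l for L = 1/2 ||y - a^L||^2 with sigmoid units."""
    # forward pass: store all z^l and a^l
    a, zs, acts = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)

    # output layer: δ^L = ∂L/∂a^L ⊙ g'(z^L)
    delta = (acts[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))

    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):     # l = L, ..., 1 (0-indexed here)
        grads_W[l] = np.outer(delta, acts[l])     # ∂L/∂W^l = δ^l (a^{l-1})^T
        grads_b[l] = delta                        # ∂L/∂b^l = δ^l
        if l > 0:                                 # back prop δ to the previous layer
            delta = (weights[l].T @ delta) * sigmoid(zs[l-1]) * (1 - sigmoid(zs[l-1]))
    return grads_W, grads_b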

Reminder of gradient descent.

Training a Feed-Forward Neural Network. Given an initial guess for the weights and biases, loop over each training example in random order: 1. Forward propagate to get the activations on each layer. 2. Back propagate to get the derivatives. 3. Update the weights and biases via stochastic gradient descent. 4. Repeat.
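
A sketch of this loop built on the backprop helper above (the learning rate, epoch count, and data container are illustrative assumptions, not values from the slides):

import numpy as np

def train(data, weights, biases, eta=0.1, epochs=10):
    """data is a list of (x, y) pairs; updates weights/biases in place via SGD."""
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(len(data))       # random order over training examples
        for i in order:
            x, y = data[i]
            grads_W, grads_b = backprop(x, y, weights, biases)  # forward + backward
            for l in range(len(weights)):        # SGD step: W^l ← W^l - η ∂L/∂W^l
                weights[l] -= eta * grads_W[l]
                biases[l]  -= eta * grads_b[l]
    return weights, biases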

Outline: Practical issues of back propagation (unstable gradients, weight initialization, alternative regularization, batch size).

Back Propagation. In practice, many remaining questions may arise (recall the full back propagation algorithm above).

Unstable gradients. Consider a deep chain x → h^1 → h^2 → ... → h^L → o. Then ∂L/∂b^1 = g'(z^1) · w^2 · g'(z^2) · w^3 · ... · g'(z^L) · ∂L/∂a^L.

Unstable gradients. [Figure: plot of the sigmoid derivative σ'(x); it peaks at 0.25 at x = 0 and approaches 0 as |x| grows.]

Vanishing gradients. If we use Gaussian initialization for the weights, w_j ~ N(0, 1), then usually |w_j| < 1 and |w_j σ'(z_j)| < 1/4, so ∂L/∂b^1 decays to zero exponentially with depth.
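
A tiny numerical illustration of this decay (a sketch; the depth and pre-activations are made up, and it simply multiplies the per-layer factors w_j σ'(z_j) from the expression above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
L = 30                                   # depth of the chain
w = rng.standard_normal(L)               # w_j ~ N(0, 1)
z = rng.standard_normal(L)               # pre-activations (illustrative)

factors = w * sigmoid(z) * (1 - sigmoid(z))   # each |w_j σ'(z_j)| is usually < 1/4
print(np.abs(np.prod(factors)))               # vanishingly small for L = 30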

Vanishing gradients: ReLU. [Figure: plots of the ReLU function and its derivative; the derivative is 1 for positive inputs and 0 for negative inputs.]

Exploding gradients. If w_j = 100, then |w_j σ'(z_j)| can be some k > 1, and the product grows exponentially with depth.

Training a Feed-Forward Neural Network. In practice, many remaining questions may arise; more examples: 1. How do we initialize the weights and biases? 2. How do we regularize? 3. Can I batch this?

Non-convexity.

Weight initialization. Old idea: W = 0. What happens? There is no source of asymmetry: every neuron looks the same, and this leads to a slow start. Looking at the back propagation algorithm, with W^l = 0 the backpropagated δ^{l-1} = (W^l)^T δ^l ⊙ g'(z^{l-1}) is zero, so every layer before the last receives no gradient at first.

Weight initialization. First idea: small random numbers, W ~ N(0, 0.01).

Weight initialization. Var(z) = Var(Σ_i w_i x_i) = n Var(w_i) Var(x_i) (assuming independent, zero-mean w_i and x_i).

Weight initialization. Xavier initialization [Glorot and Bengio, 2010]: W ~ N(0, 2/(n_in + n_out)). He initialization [He et al., 2015]: W ~ N(0, 2/n_in). This is an active research area, and the next great idea may come from you!
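
A sketch of these two schemes in numpy (the function names, fan-in/fan-out arguments, and example layer size are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier/Glorot: W ~ N(0, 2 / (n_in + n_out)); note normal() takes the std."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    """He: W ~ N(0, 2 / n_in), commonly paired with ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W1 = he_init(784, 128)   # e.g., a 784 → 128 layer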

Dropout layer: "randomly set some neurons to zero in the forward pass" [Srivastava et al., 2014].
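
A minimal sketch of a dropout mask applied during training, using the common inverted-dropout rescaling (an assumption beyond the slide; the keep probability and example array are also illustrative):

import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p_keep=0.8, train=True):
    """Randomly zero out neurons in the forward pass (inverted dropout)."""
    if not train:
        return a                                  # no dropout at test time
    mask = rng.random(a.shape) < p_keep           # keep each unit with prob p_keep
    return a * mask / p_keep                      # rescale so expectations match

a1 = np.array([0.5, 1.2, 0.0, 2.0])
print(dropout(a1))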

Dropout layer. It forces the network to have a redundant representation. Another interpretation: dropout is training a large ensemble of models.

Batch size. We have so far learned gradient descent, which uses all the training data to compute gradients. Alternatively, we can use a single instance to compute gradients, as in stochastic gradient descent. In general, we can use a batch-size parameter and compute the gradients from a few instances: N (the entire training data), 1 (a single instance), or more common values such as 16, 32, 64, 128.
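
A sketch of iterating over mini-batches of a chosen batch size (the arrays, their shapes, and the batch size are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))   # 1000 training instances, 20 features
Y = rng.standard_normal((1000, 1))

batch_size = 32
order = rng.permutation(len(X))       # shuffle once per epoch
for start in range(0, len(X), batch_size):
    idx = order[start:start + batch_size]
    x_batch, y_batch = X[idx], Y[idx]
    # compute gradients on this mini-batch and take one SGD step here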

Wrap up. Back propagation allows us to compute the gradients of the parameters: δ^L = ∂L/∂a^L ⊙ g'(z^L), then for l = L, ..., 1 compute ∂L/∂W^l = δ^l (a^{l-1})^T, ∂L/∂b^l = δ^l, and δ^{l-1} = (W^l)^T δ^l ⊙ g'(z^{l-1}). Watch out for unstable gradients!

References. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy, 13-15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.