CSC321 Lecture 6: Backpropagation

CSC321 Lecture 6: Backpropagation. Roger Grosse.

Overview. We've seen that multilayer neural networks are powerful. But how can we actually learn them? Backpropagation is the central algorithm in this course. It's an algorithm for computing gradients. Really, it's an instance of reverse mode automatic differentiation, which is much more broadly applicable than just neural nets. This is just a clever and efficient use of the Chain Rule for derivatives. We'll see how to implement an automatic differentiation system next week.

Overview. Design choices so far:
- Task: regression, binary classification, multiway classification
- Model/Architecture: linear, log-linear, multilayer perceptron
- Loss function: squared error, 0-1 loss, cross-entropy, hinge loss
- Optimization algorithm: direct solution, gradient descent, perceptron
Compute gradients using backpropagation.

Recap: Gradient Descent. Recall: gradient descent moves opposite the gradient (the direction of steepest descent). Weight space for a multilayer neural net: one coordinate for each weight or bias of the network, in all the layers. Conceptually, this is not any different from what we've seen so far, just higher dimensional and harder to visualize! We want to compute the cost gradient $dE/d\mathbf{w}$, which is the vector of partial derivatives. This is the average of $dL/d\mathbf{w}$ over all the training examples, so in this lecture we focus on computing $dL/d\mathbf{w}$.
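
To make this concrete, here is a minimal sketch of a single gradient descent update, assuming a hypothetical helper grad_loss(w, x, t) that returns dL/dw for one training example (the helper and variable names are illustrative, not from the lecture):

```python
import numpy as np

def gradient_descent_step(w, grad_loss, data, alpha=0.1):
    # Cost gradient dE/dw is the average of the per-example dL/dw.
    per_example = [grad_loss(w, x, t) for x, t in data]
    dE_dw = np.mean(per_example, axis=0)
    # Move opposite the gradient (the direction of steepest descent).
    return w - alpha * dE_dw
```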

Univariate Chain Rule. We've already been using the univariate Chain Rule. Recall: if $f(x)$ and $x(t)$ are univariate functions, then
$$\frac{d}{dt} f(x(t)) = \frac{df}{dx} \frac{dx}{dt}.$$

Univariate Chain Rule. Recall: univariate logistic least squares model
$$z = wx + b, \qquad y = \sigma(z), \qquad L = \tfrac{1}{2}(y - t)^2.$$
Let's compute the loss derivatives.

Univariate Chain Rule. How you would have done it in calculus class:
$$L = \tfrac{1}{2}(\sigma(wx + b) - t)^2$$
$$\begin{aligned}
\frac{\partial L}{\partial w} &= \frac{\partial}{\partial w}\left[\tfrac{1}{2}(\sigma(wx + b) - t)^2\right] \\
&= \tfrac{1}{2}\,\frac{\partial}{\partial w}(\sigma(wx + b) - t)^2 \\
&= (\sigma(wx + b) - t)\,\frac{\partial}{\partial w}(\sigma(wx + b) - t) \\
&= (\sigma(wx + b) - t)\,\sigma'(wx + b)\,\frac{\partial}{\partial w}(wx + b) \\
&= (\sigma(wx + b) - t)\,\sigma'(wx + b)\,x
\end{aligned}$$
$$\begin{aligned}
\frac{\partial L}{\partial b} &= \frac{\partial}{\partial b}\left[\tfrac{1}{2}(\sigma(wx + b) - t)^2\right] \\
&= (\sigma(wx + b) - t)\,\sigma'(wx + b)\,\frac{\partial}{\partial b}(wx + b) \\
&= (\sigma(wx + b) - t)\,\sigma'(wx + b)
\end{aligned}$$
What are the disadvantages of this approach?

Univariate Chain Rule. A more structured way to do it.
Computing the loss:
$$z = wx + b, \qquad y = \sigma(z), \qquad L = \tfrac{1}{2}(y - t)^2.$$
Computing the derivatives:
$$\frac{dL}{dy} = y - t, \qquad \frac{dL}{dz} = \frac{dL}{dy}\,\sigma'(z), \qquad \frac{\partial L}{\partial w} = \frac{dL}{dz}\,x, \qquad \frac{\partial L}{\partial b} = \frac{dL}{dz}.$$
Remember, the goal isn't to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.
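
As a concrete illustration of such a program, here is a minimal sketch of the forward and derivative computations for the univariate model above (the function and variable names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_and_grads(w, b, x, t):
    # Forward pass: compute the loss step by step.
    z = w * x + b
    y = sigmoid(z)
    L = 0.5 * (y - t) ** 2

    # Backward pass: reuse the intermediate quantities instead of
    # re-deriving a closed-form expression for each derivative.
    dL_dy = y - t
    dL_dz = dL_dy * y * (1 - y)   # sigma'(z) = sigma(z) * (1 - sigma(z))
    dL_dw = dL_dz * x
    dL_db = dL_dz
    return L, dL_dw, dL_db
```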

Univariate Chain Rule. We can diagram out the computations using a computation graph. The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.

Univariate Chain Rule. A slightly more convenient notation: use $\bar{y}$ to denote the derivative $dL/dy$, sometimes called the error signal. This emphasizes that the error signals are just values our program is computing (rather than a mathematical operation). This is not a standard notation, but I couldn't find another one that I liked.
Computing the loss:
$$z = wx + b, \qquad y = \sigma(z), \qquad L = \tfrac{1}{2}(y - t)^2.$$
Computing the derivatives:
$$\bar{y} = y - t, \qquad \bar{z} = \bar{y}\,\sigma'(z), \qquad \bar{w} = \bar{z}\,x, \qquad \bar{b} = \bar{z}.$$

Multivariate Chain Rule. Problem: what if the computation graph has fan-out > 1? This requires the multivariate Chain Rule!
$L_2$-regularized regression:
$$z = wx + b, \qquad y = \sigma(z), \qquad L = \tfrac{1}{2}(y - t)^2, \qquad R = \tfrac{1}{2} w^2, \qquad L_{\text{reg}} = L + \lambda R.$$
Multiclass logistic regression:
$$z_\ell = \sum_j w_{\ell j} x_j + b_\ell, \qquad y_k = \frac{e^{z_k}}{\sum_\ell e^{z_\ell}}, \qquad L = -\sum_k t_k \log y_k.$$

Multivariate Chain Rule. Suppose we have a function $f(x, y)$ and functions $x(t)$ and $y(t)$. (All the variables here are scalar-valued.) Then
$$\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}.$$
Example:
$$f(x, y) = y + e^{xy}, \qquad x(t) = \cos t, \qquad y(t) = t^2.$$
Plug in to the Chain Rule:
$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} = (y e^{xy})(-\sin t) + (1 + x e^{xy})\,2t.$$
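
A quick way to sanity-check a derivation like this is to compare it against a finite-difference estimate; here is a minimal sketch (my own example code, not from the lecture):

```python
import numpy as np

def f_of_t(t):
    x, y = np.cos(t), t ** 2
    return y + np.exp(x * y)

def df_dt(t):
    # Multivariate chain rule: df/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt)
    x, y = np.cos(t), t ** 2
    return (y * np.exp(x * y)) * (-np.sin(t)) + (1 + x * np.exp(x * y)) * 2 * t

t, eps = 1.3, 1e-6
numeric = (f_of_t(t + eps) - f_of_t(t - eps)) / (2 * eps)
print(df_dt(t), numeric)   # the two values should agree to several decimal places
```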

Multivariable Chain Rule. In the context of backpropagation, in our notation:
$$\bar{t} = \bar{x}\,\frac{dx}{dt} + \bar{y}\,\frac{dy}{dt}.$$

Backpropagation. Full backpropagation algorithm: let $v_1, \ldots, v_N$ be a topological ordering of the computation graph (i.e. parents come before children). $v_N$ denotes the variable we're trying to compute derivatives of (e.g. the loss).
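
The transcription does not include the slide's pseudocode, so here is a minimal sketch of how a backward pass over such a topological ordering might look; the node representation and the partial(child, parent) helper are illustrative assumptions, not from the slides:

```python
from collections import defaultdict

def backprop(nodes, partial):
    """Reverse-mode autodiff sketch.

    nodes: computation graph nodes in topological order (parents before
           children); nodes[-1] is the quantity we differentiate (e.g. the
           loss); each node is assumed to carry a .parents list.
    partial(child, parent): local derivative d(child)/d(parent), evaluated
           at the values stored during the forward pass.
    Returns a dict of error signals bar[v] = d(loss)/d(v).
    """
    bar = defaultdict(float)
    bar[nodes[-1]] = 1.0                      # d(loss)/d(loss) = 1
    for v in reversed(nodes):                 # visit children before parents
        for p in v.parents:
            bar[p] += bar[v] * partial(v, p)  # multivariate chain rule
    return bar
```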

Backpropagation. Example: univariate logistic least squares regression.
Forward pass:
$$z = wx + b, \qquad y = \sigma(z), \qquad L = \tfrac{1}{2}(y - t)^2, \qquad R = \tfrac{1}{2} w^2, \qquad L_{\text{reg}} = L + \lambda R.$$
Backward pass:
$$\begin{aligned}
\bar{L}_{\text{reg}} &= 1 \\
\bar{R} &= \bar{L}_{\text{reg}}\,\frac{dL_{\text{reg}}}{dR} = \bar{L}_{\text{reg}}\,\lambda \\
\bar{L} &= \bar{L}_{\text{reg}}\,\frac{dL_{\text{reg}}}{dL} = \bar{L}_{\text{reg}} \\
\bar{y} &= \bar{L}\,\frac{dL}{dy} = \bar{L}\,(y - t) \\
\bar{z} &= \bar{y}\,\frac{dy}{dz} = \bar{y}\,\sigma'(z) \\
\bar{w} &= \bar{z}\,\frac{\partial z}{\partial w} + \bar{R}\,\frac{dR}{dw} = \bar{z}\,x + \bar{R}\,w \\
\bar{b} &= \bar{z}\,\frac{\partial z}{\partial b} = \bar{z}
\end{aligned}$$
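
A minimal sketch of this backward pass in code, to highlight how the fan-out of $w$ (it feeds both $z$ and $R$) shows up as a sum of two contributions (variable names are my own):

```python
import numpy as np

def backprop_regularized(w, b, x, t, lam):
    # Forward pass
    z = w * x + b
    y = 1.0 / (1.0 + np.exp(-z))
    L = 0.5 * (y - t) ** 2
    R = 0.5 * w ** 2
    L_reg = L + lam * R

    # Backward pass (error signals)
    L_reg_bar = 1.0
    R_bar = L_reg_bar * lam
    L_bar = L_reg_bar
    y_bar = L_bar * (y - t)
    z_bar = y_bar * y * (1 - y)        # sigma'(z)
    w_bar = z_bar * x + R_bar * w      # two contributions: via z and via R
    b_bar = z_bar
    return L_reg, w_bar, b_bar
```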

Backpropagation. Multilayer Perceptron (multiple outputs).
Forward pass:
$$z_i = \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i, \qquad h_i = \sigma(z_i), \qquad y_k = \sum_i w^{(2)}_{ki} h_i + b^{(2)}_k, \qquad L = \tfrac{1}{2}\sum_k (y_k - t_k)^2.$$
Backward pass:
$$\begin{aligned}
\bar{L} &= 1 \\
\bar{y}_k &= \bar{L}\,(y_k - t_k) \\
\bar{w}^{(2)}_{ki} &= \bar{y}_k\,h_i \\
\bar{b}^{(2)}_k &= \bar{y}_k \\
\bar{h}_i &= \sum_k \bar{y}_k\,w^{(2)}_{ki} \\
\bar{z}_i &= \bar{h}_i\,\sigma'(z_i) \\
\bar{w}^{(1)}_{ij} &= \bar{z}_i\,x_j \\
\bar{b}^{(1)}_i &= \bar{z}_i
\end{aligned}$$

Backpropagation. In vectorized form.
Forward pass:
$$\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}, \qquad \mathbf{h} = \sigma(\mathbf{z}), \qquad \mathbf{y} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}, \qquad L = \tfrac{1}{2}\|\mathbf{t} - \mathbf{y}\|^2.$$
Backward pass:
$$\begin{aligned}
\bar{L} &= 1 \\
\bar{\mathbf{y}} &= \bar{L}\,(\mathbf{y} - \mathbf{t}) \\
\bar{\mathbf{W}}^{(2)} &= \bar{\mathbf{y}}\,\mathbf{h}^\top \\
\bar{\mathbf{b}}^{(2)} &= \bar{\mathbf{y}} \\
\bar{\mathbf{h}} &= \mathbf{W}^{(2)\top} \bar{\mathbf{y}} \\
\bar{\mathbf{z}} &= \bar{\mathbf{h}} \circ \sigma'(\mathbf{z}) \\
\bar{\mathbf{W}}^{(1)} &= \bar{\mathbf{z}}\,\mathbf{x}^\top \\
\bar{\mathbf{b}}^{(1)} &= \bar{\mathbf{z}}
\end{aligned}$$
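
A minimal NumPy sketch of these vectorized passes for a single training example (the shapes, function name, and variable names are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward_backward(W1, b1, W2, b2, x, t):
    # Forward pass
    z = W1 @ x + b1                      # hidden pre-activations, shape (H,)
    h = sigmoid(z)                       # hidden activations, shape (H,)
    y = W2 @ h + b2                      # outputs, shape (K,)
    L = 0.5 * np.sum((t - y) ** 2)

    # Backward pass (error signals)
    y_bar = y - t                        # L_bar = 1
    W2_bar = np.outer(y_bar, h)          # y_bar h^T
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * h * (1 - h)          # elementwise product with sigma'(z)
    W1_bar = np.outer(z_bar, x)          # z_bar x^T
    b1_bar = z_bar
    return L, (W1_bar, b1_bar, W2_bar, b2_bar)
```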

Backpropagation. Backprop as message passing: each node receives a bunch of messages from its children, which it aggregates to get its error signal. It then passes messages to its parents. This provides modularity, since each node only has to know how to compute derivatives with respect to its arguments, and doesn't have to know anything about the rest of the graph.

Computational Cost. Computational cost of the forward pass: one add-multiply operation per weight:
$$z_i = \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i.$$
Computational cost of the backward pass: two add-multiply operations per weight:
$$\bar{w}^{(2)}_{ki} = \bar{y}_k\,h_i, \qquad \bar{h}_i = \sum_k \bar{y}_k\,w^{(2)}_{ki}.$$
Rule of thumb: the backward pass is about as expensive as two forward passes. For a multilayer perceptron, this means the cost is linear in the number of layers, quadratic in the number of units per layer.
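
As a rough worked example (hypothetical layer sizes, not from the lecture): a layer with 100 inputs and 100 outputs has about 10,000 weights, so its forward pass costs roughly 10,000 add-multiplies and its backward pass roughly 20,000, consistent with the two-forward-passes rule of thumb.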

Backpropagation. Backprop is used to train the overwhelming majority of neural nets today. Even optimization algorithms much fancier than gradient descent (e.g. second-order methods) use backprop to compute the gradients. Despite its practical success, backprop is believed to be neurally implausible: there is no evidence for biological signals analogous to error derivatives, and all the biologically plausible alternatives we know about learn much more slowly (on computers). So how on earth does the brain learn?

Backpropagation. By now, we've seen three different ways of looking at gradients:
- Geometric: visualization of the gradient in weight space
- Algebraic: mechanics of computing the derivatives
- Implementational: efficient implementation on the computer
When thinking about neural nets, it's important to be able to shift between these different perspectives!