Neural Networks: Backpropagation

Neural Networks: Backpropagation. Machine Learning, Fall 2017. Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others. 1

Neural Networks. What is a neural network? Predicting with a neural network. Training neural networks. Practical concerns. 2

This lecture: What is a neural network? Predicting with a neural network. Training neural networks: Backpropagation. Practical concerns. 3

Training a neural network. Given: a network architecture (layout of neurons, their connectivity and activations) and a dataset of labeled examples S = {(x_i, y_i)}. The goal: learn the weights of the neural network. Remember: for a fixed architecture, a neural network is a function parameterized by its weights. Prediction: ŷ = NN(x, w). 4

Recall: Learning as loss minimization. We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss L: min_w Σ_i L(NN(x_i, w), y_i), perhaps with a regularizer. So far, we saw that this strategy worked for: 1. logistic regression, 2. support vector machines, 3. perceptron, 4. LMS regression (each minimizes a different loss function). All of these are linear models. The same idea works for non-linear models too! 6
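Written out in full (a reconstruction for clarity; the λ‖w‖² term stands in for the optional regularizer the slide mentions), the learning problem is:

\min_{w} \; \sum_{i} L\big(\mathrm{NN}(x_i, w),\, y_i\big) \;+\; \lambda \lVert w \rVert^2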

Back to our running example. Given an input x, how is the output predicted? Predicted output: ŷ = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2, where z_1 = σ(w_{01} + w_{11} x_1 + w_{21} x_2) and z_2 = σ(w_{02} + w_{12} x_1 + w_{22} x_2). Suppose the true label for this example is a number y. We can write the square loss for this example as: L = ½ (y − ŷ)². 9
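As a concrete illustration (not from the slides: the weight values and the example below are made up), here is a minimal Python/numpy sketch of the forward pass and the squared loss for this 2-input, 2-hidden-unit network:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical parameter values, named after the weights in the equations above.
w = {
    "w01": 0.1, "w11": 0.4, "w21": -0.3,    # hidden unit z1
    "w02": -0.2, "w12": 0.2, "w22": 0.5,    # hidden unit z2
    "wo01": 0.3, "wo11": -0.1, "wo21": 0.6, # output layer
}

def forward(x, w):
    z1 = sigmoid(w["w01"] + w["w11"] * x[0] + w["w21"] * x[1])
    z2 = sigmoid(w["w02"] + w["w12"] * x[0] + w["w22"] * x[1])
    y_hat = w["wo01"] + w["wo11"] * z1 + w["wo21"] * z2
    return y_hat, z1, z2

x, y = np.array([1.0, 2.0]), 0.5        # one labeled example (made up)
y_hat, z1, z2 = forward(x, w)
loss = 0.5 * (y - y_hat) ** 2           # L = 1/2 (y - y_hat)^2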

Learning as loss minimization. We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss L: min_w Σ_i L(NN(x_i, w), y_i), perhaps with a regularizer. How do we solve the optimization problem? 10

Stochastic gradient descent. Goal: min_w Σ_i L(NN(x_i, w), y_i). Given a training set S = {(x_i, y_i)}, x_i ∈ ℝ^d:
1. Initialize parameters w.
2. For epoch = 1 ... T:
   1. Shuffle the training set.
   2. For each training example (x_i, y_i) ∈ S: treat this example as the entire dataset; compute the gradient of the loss ∇L(NN(x_i, w), y_i); update w ← w − γ_t ∇L(NN(x_i, w), y_i).
3. Return w.
γ_t: learning rate, many tweaks possible. The objective is not convex, so initialization can be important. Have we solved everything? 18
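A minimal sketch of this loop in Python, assuming a per-example gradients(x, y, w) function that returns ∂L/∂w for every weight (one such function is sketched after the backpropagation example below) and a constant learning rate for simplicity:

import random

def sgd(S, w, gradients, epochs=10, lr=0.1):
    # S is a list of (x, y) pairs; w is a dict of weights.
    for epoch in range(epochs):
        random.shuffle(S)                   # 2.1 shuffle the training set
        for x, y in S:                      # 2.2 treat each example as the entire dataset
            g = gradients(x, y, w)          # gradient of the per-example loss
            for name in w:                  # update: w <- w - lr * grad
                w[name] -= lr * g[name]
    return w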

The derivative of the loss function? If the neural network is a differentiable function, we can find the gradient (or maybe its sub-gradient). This is decided by the activation functions and the loss function. It was easy for SVMs and logistic regression: only one layer. But how do we find the (sub-)gradient of a more complex function L(NN(x_i, w), y_i)? E.g., a recent paper used a ~150 layer neural network for image classification! We need an efficient algorithm: Backpropagation. 21

Checkpoint: where are we? If we have a neural network (structure, activations and weights), we can make a prediction for an input. If we have the true label of the input, then we can define the loss for that example. If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD. Questions? 22

Reminder: Chain rule for derivatives. If z is a function of y, and y is a function of x, then z is a function of x as well. Question: how to find dz/dx? (Slide courtesy Richard Socher) 27

Reminder: Chain rule for derivatives. If z = (a function of y_1) + (a function of y_2), and the y_i's are functions of x, then z is a function of x as well. Question: how to find dz/dx? (Slide courtesy Richard Socher) 28

Reminder: Chain rule for derivatives. If z is a sum of functions of the y_i's, and the y_i's are functions of x, then z is a function of x as well. Question: how to find dz/dx? (Slide courtesy Richard Socher) 29
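In symbols, the situation on these slides is the multivariate chain rule; for example, with z = f(y_1) + g(y_2) and y_1, y_2 both functions of x:

\frac{dz}{dx} \;=\; \sum_i \frac{\partial z}{\partial y_i}\,\frac{dy_i}{dx}
\;=\; f'(y_1)\,\frac{dy_1}{dx} \;+\; g'(y_2)\,\frac{dy_2}{dx}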

Backpropagation. L = ½ (y − ŷ)², ŷ = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2, z_1 = σ(w_{01} + w_{11} x_1 + w_{21} x_2), z_2 = σ(w_{02} + w_{12} x_1 + w_{22} x_2). We want to compute ∂L/∂w for every weight: ∂L/∂w^o_{ij} for the output layer and ∂L/∂w_{ij} for the hidden layer. The idea: apply the chain rule to compute the gradient (and remember partial computations along the way to speed things up). 32

Backpropagation example: output layer. L = ½ (y − ŷ)², ŷ = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2.
By the chain rule, ∂L/∂w^o_{01} = (∂L/∂ŷ) (∂ŷ/∂w^o_{01}), where ∂L/∂ŷ = −(y − ŷ) and ∂ŷ/∂w^o_{01} = 1.
Similarly, ∂L/∂w^o_{11} = (∂L/∂ŷ) (∂ŷ/∂w^o_{11}), where ∂ŷ/∂w^o_{11} = z_1. We have already computed ∂L/∂ŷ for the previous case: cache it to speed up! 41
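Continuing the earlier numpy sketch (the forward function and weight names are the hypothetical ones introduced above), the output-layer derivatives translate directly into code:

def output_layer_gradients(x, y, w):
    y_hat, z1, z2 = forward(x, w)
    dL_dyhat = -(y - y_hat)            # dL/d(y_hat) for L = 1/2 (y - y_hat)^2
    return {
        "wo01": dL_dyhat * 1.0,        # d(y_hat)/dw^o_01 = 1
        "wo11": dL_dyhat * z1,         # d(y_hat)/dw^o_11 = z1
        "wo21": dL_dyhat * z2,         # d(y_hat)/dw^o_21 = z2
    }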

Backpropagation example: hidden layer derivatives. L = ½ (y − ŷ)², ŷ = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2, z_1 = σ(w_{01} + w_{11} x_1 + w_{21} x_2), z_2 = σ(w_{02} + w_{12} x_1 + w_{22} x_2). We want ∂L/∂w_{22}.
Applying the chain rule: ∂L/∂w_{22} = (∂L/∂ŷ) (∂ŷ/∂w_{22}).
For the second factor: ∂ŷ/∂w_{22} = ∂(w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2)/∂w_{22} = w^o_{11} (∂z_1/∂w_{22}) + w^o_{21} (∂z_2/∂w_{22}). The first term is 0, because z_1 is not a function of w_{22}, so ∂ŷ/∂w_{22} = w^o_{21} (∂z_2/∂w_{22}).
Now z_2 = σ(w_{02} + w_{12} x_1 + w_{22} x_2); call the argument s, so ∂z_2/∂w_{22} = (∂z_2/∂s) (∂s/∂w_{22}).
Each of these partial derivatives is easy: ∂L/∂ŷ = −(y − ŷ); ∂z_2/∂s = z_2 (1 − z_2), because z_2 = σ(s) is the logistic function we have already seen; and ∂s/∂w_{22} = x_2.
More important: we have already computed many of these partial derivatives, because we are proceeding from top to bottom (i.e. backwards). 57
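The hidden-layer chain above, together with the output-layer case, gives the full per-example gradient. A sketch, again continuing the hypothetical numpy example (it assumes the forward function defined earlier; note how dL/d(y_hat) and the cached activations z1, z2 are reused, exactly the caching the slides describe):

def gradients(x, y, w):
    y_hat, z1, z2 = forward(x, w)
    dL_dyhat = -(y - y_hat)
    grads = {"wo01": dL_dyhat, "wo11": dL_dyhat * z1, "wo21": dL_dyhat * z2}
    # Hidden unit z2: dL/dw = dL/d(y_hat) * w^o_21 * z2(1 - z2) * ds/dw
    d2 = dL_dyhat * w["wo21"] * z2 * (1.0 - z2)
    grads.update({"w02": d2, "w12": d2 * x[0], "w22": d2 * x[1]})
    # Hidden unit z1: same pattern with w^o_11 and z1
    d1 = dL_dyhat * w["wo11"] * z1 * (1.0 - z1)
    grads.update({"w01": d1, "w11": d1 * x[0], "w21": d1 * x[1]})
    return grads

A quick sanity check is to perturb one weight, say w["w22"], by ±1e-5, recompute the loss, and compare the finite-difference estimate with grads["w22"].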

The Backpropagation Algorithm. The same algorithm works for multiple layers: repeated application of the chain rule for partial derivatives. First perform a forward pass from the inputs to the output and compute the loss. From the loss, proceed backwards to compute partial derivatives using the chain rule. Cache partial derivatives as you compute them; they will be used for the lower layers. 58
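For deeper networks the same recipe becomes a loop over layers. The sketch below is an illustrative reconstruction (not code from the lecture), assuming fully connected layers with sigmoid hidden activations, a linear output, and squared loss:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, y, Ws, bs):
    # Ws, bs: lists of weight matrices / bias vectors, one per layer.
    # Forward pass: cache the activation of every layer.
    acts = [x]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        pre = W @ acts[-1] + b
        acts.append(pre if l == len(Ws) - 1 else sigmoid(pre))  # linear output layer
    # Backward pass: delta is dL/d(pre-activation) of the current layer.
    delta = acts[-1] - y                     # from L = 1/2 ||y - y_hat||^2
    gWs, gbs = [None] * len(Ws), [None] * len(bs)
    for l in reversed(range(len(Ws))):
        gWs[l] = np.outer(delta, acts[l])    # dL/dW_l, reusing the cached input to layer l
        gbs[l] = delta                       # dL/db_l
        if l > 0:                            # push the error one layer down (chain rule)
            delta = (Ws[l].T @ delta) * acts[l] * (1.0 - acts[l])
    return gWs, gbs

Calling it with, e.g., Ws = [np.random.randn(2, 2), np.random.randn(1, 2)] and bs = [np.zeros(2), np.zeros(1)] reproduces the running two-hidden-unit example.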

Mechanizing learning. Backpropagation gives you the gradient that will be used for gradient descent. SGD gives us a generic learning algorithm; backpropagation is a generic method for computing partial derivatives: a recursive algorithm that proceeds from the top of the network to the bottom. Modern neural network libraries implement automatic differentiation using backpropagation. This allows easy exploration of network architectures; you don't have to keep deriving the gradients by hand each time. 59
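As a concrete example of such a library (a sketch assuming PyTorch is installed; any framework with automatic differentiation works similarly), the running network and one SGD step can be written without deriving any gradient by hand:

import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 2),   # hidden layer: z = sigma(Wx + b)
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 1),   # linear output layer
)
x = torch.tensor([1.0, 2.0])
y = torch.tensor([0.5])

y_hat = net(x)
loss = 0.5 * (y - y_hat).pow(2).sum()    # squared loss, as in the slides
loss.backward()                          # backpropagation: fills p.grad for every parameter

with torch.no_grad():
    for p in net.parameters():
        p -= 0.1 * p.grad                # one SGD step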

Stochastic gradient descent. Goal: min_w Σ_i L(NN(x_i, w), y_i). Given a training set S = {(x_i, y_i)}, x_i ∈ ℝ^d:
1. Initialize parameters w.
2. For epoch = 1 ... T:
   1. Shuffle the training set.
   2. For each training example (x_i, y_i) ∈ S: treat this example as the entire dataset; compute the gradient of the loss ∇L(NN(x_i, w), y_i) using backpropagation; update w ← w − γ_t ∇L(NN(x_i, w), y_i).
3. Return w.
γ_t: learning rate, many tweaks possible. The objective is not convex, so initialization can be important. 60