Lecture 4 Backpropagation

Lecture 4 Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 5, 2017

Things we will look at today: more backpropagation; still more backpropagation; quiz at 4:05 PM.

To understand, let us just calculate!

One Neuron Again. [Figure: a single neuron with inputs $x_1, x_2, \dots, x_d$, weights $w_1, w_2, \dots, w_d$, and output $\hat{y}$.] Consider an example $x$; the output for $x$ is $\hat{y}$ and the correct answer is $y$. Loss: $L = (y - \hat{y})^2$, where $\hat{y} = x^T w = x_1 w_1 + x_2 w_2 + \dots + x_d w_d$.

One Neuron Again. We want to update $w_i$ (forget the closed-form solution for a bit!). Update rule: $w_i := w_i - \eta \frac{\partial L}{\partial w_i}$. Now $\frac{\partial L}{\partial w_i} = \frac{\partial (\hat{y} - y)^2}{\partial w_i} = 2(\hat{y} - y)\, \frac{\partial (x_1 w_1 + x_2 w_2 + \dots + x_d w_d)}{\partial w_i}$.

One Neuron Again. We have $\frac{\partial L}{\partial w_i} = 2(\hat{y} - y)\, x_i$. Update rule (absorbing the constant factor of 2 into $\eta$): $w_i := w_i - \eta (\hat{y} - y) x_i = w_i - \eta \delta x_i$, where $\delta = (\hat{y} - y)$. In vector form: $w := w - \eta \delta x$. Simple enough! Now let's graduate...
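
In code, this single-neuron update is just a few lines. A minimal NumPy sketch, assuming made-up toy data and a learning rate of 0.1 (names are illustrative, not from the slides):

```python
import numpy as np

def neuron_sgd_step(w, x, y, eta=0.1):
    """One gradient step for a linear neuron with squared loss."""
    y_hat = x @ w          # forward pass: y_hat = x^T w
    delta = y_hat - y      # local error delta; the factor of 2 is absorbed into eta
    return w - eta * delta * x

# toy usage with made-up data
rng = np.random.default_rng(0)
w = rng.normal(size=3)
x, y = np.array([1.0, 2.0, -1.0]), 0.5
for _ in range(20):
    w = neuron_sgd_step(w, x, y)
print(x @ w)  # should be close to y = 0.5 after repeated updates on this example
```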

Simple Feedforward Network. [Figure: a 3-2-1 network with inputs $x_1, x_2, x_3$, hidden units $z_1, z_2$, first-layer weights $w^{(1)}_{ij}$, second-layer weights $w^{(2)}_1, w^{(2)}_2$, and output $\hat{y}$.] $\hat{y} = w^{(2)}_1 z_1 + w^{(2)}_2 z_2$, where $z_1 = \tanh(a_1)$ with $a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3$, and likewise for $z_2$.

Simple Feedforward Network. $z_1 = \tanh(a_1)$ where $a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3$; $z_2 = \tanh(a_2)$ where $a_2 = w^{(1)}_{12} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{32} x_3$. Output: $\hat{y} = w^{(2)}_1 z_1 + w^{(2)}_2 z_2$; loss: $L = (\hat{y} - y)^2$. We want to assign credit for the loss $L$ to each weight.
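
For concreteness, here is a minimal NumPy sketch of the forward pass of this 3-2-1 network; the array layout (W1[i, j] stands for $w^{(1)}_{ij}$) and the toy numbers are illustrative choices, not from the slides:

```python
import numpy as np

def forward(x, W1, w2):
    """Forward pass of the 3-2-1 tanh network."""
    a = x @ W1        # pre-activations a_1, a_2
    z = np.tanh(a)    # hidden activations z_1, z_2
    y_hat = z @ w2    # output: w^(2)_1 z_1 + w^(2)_2 z_2
    return a, z, y_hat

x = np.array([0.5, -1.0, 2.0])     # toy input (x_1, x_2, x_3)
W1 = np.full((3, 2), 0.1)          # W1[i, j] = w^(1)_{ij}
w2 = np.array([0.3, -0.2])         # (w^(2)_1, w^(2)_2)
a, z, y_hat = forward(x, W1, w2)
```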

Top Layer. We want to find $\frac{\partial L}{\partial w^{(2)}_1}$ and $\frac{\partial L}{\partial w^{(2)}_2}$. Consider $w^{(2)}_1$ first: $\frac{\partial L}{\partial w^{(2)}_1} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(2)}_1} = 2(\hat{y} - y)\, \frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(2)}_1} = 2(\hat{y} - y)\, z_1$. Familiar from earlier! The update for $w^{(2)}_1$ would be $w^{(2)}_1 := w^{(2)}_1 - \eta \frac{\partial L}{\partial w^{(2)}_1} = w^{(2)}_1 - \eta \delta z_1$ with $\delta = (\hat{y} - y)$. Likewise, the update for $w^{(2)}_2$ would be $w^{(2)}_2 := w^{(2)}_2 - \eta \delta z_2$.

Next Layer. There are six weights to assign credit for the loss incurred. Consider $w^{(1)}_{11}$ for an illustration; the rest are similar. $\frac{\partial L}{\partial w^{(1)}_{11}} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\, \frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}}$. Now: $\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}} = w^{(2)}_1 \frac{\partial \tanh(w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3)}{\partial w^{(1)}_{11}} + 0$, which is $w^{(2)}_1 (1 - \tanh^2(a_1))\, x_1$ (recall $a_1 = ?$). So we have: $\frac{\partial L}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\, w^{(2)}_1 (1 - \tanh^2(a_1))\, x_1$.

Next Layer. $\frac{\partial L}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\, w^{(2)}_1 (1 - \tanh^2(a_1))\, x_1$. Weight update: $w^{(1)}_{11} := w^{(1)}_{11} - \eta \frac{\partial L}{\partial w^{(1)}_{11}}$. Likewise, if we were considering $w^{(1)}_{22}$, we'd have $\frac{\partial L}{\partial w^{(1)}_{22}} = 2(\hat{y} - y)\, w^{(2)}_2 (1 - \tanh^2(a_2))\, x_2$, with weight update $w^{(1)}_{22} := w^{(1)}_{22} - \eta \frac{\partial L}{\partial w^{(1)}_{22}}$.
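
These hand-derived gradients are easy to sanity-check against a central finite-difference approximation. A self-contained sketch, with arbitrary toy numbers and step size:

```python
import numpy as np

def forward(x, W1, w2):
    a = x @ W1                 # pre-activations a_1, a_2
    z = np.tanh(a)             # hidden activations z_1, z_2
    return a, z, z @ w2        # a, z, y_hat

def loss(x, W1, w2, y):
    return (forward(x, W1, w2)[2] - y) ** 2

x = np.array([0.5, -1.0, 2.0])
W1 = np.full((3, 2), 0.1)      # W1[i, j] = w^(1)_{ij}
w2 = np.array([0.3, -0.2])
y = 1.0

# analytic gradient dL/dw^(1)_{11} from the slide (keeping the factor of 2)
a, z, y_hat = forward(x, W1, w2)
analytic = 2 * (y_hat - y) * w2[0] * (1 - np.tanh(a[0]) ** 2) * x[0]

# central finite-difference approximation of the same derivative
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(x, Wp, w2, y) - loss(x, Wm, w2, y)) / (2 * eps)
print(analytic, numeric)       # the two values should agree closely
```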

Let's clean this up... Recall, for the top layer: $\frac{\partial L}{\partial w^{(2)}_i} = (\hat{y} - y)\, z_i = \delta z_i$ (ignoring the 2). One can think of this as $\frac{\partial L}{\partial w^{(2)}_i} = \underbrace{\delta}_{\text{local error}} \; \underbrace{z_i}_{\text{local input}}$. For the next layer we had $\frac{\partial L}{\partial w^{(1)}_{ij}} = (\hat{y} - y)\, w^{(2)}_j (1 - \tanh^2(a_j))\, x_i$. Let $\delta_j = (\hat{y} - y)\, w^{(2)}_j (1 - \tanh^2(a_j)) = \delta\, w^{(2)}_j (1 - \tanh^2(a_j))$ (notice that $\delta_j$ contains the $\delta$ term, which is the error!). Then: $\frac{\partial L}{\partial w^{(1)}_{ij}} = \underbrace{\delta_j}_{\text{local error}} \; \underbrace{x_i}_{\text{local input}}$. Neat!

Let's clean this up... Let's get a cleaner notation to summarize this. Let $w_{i \to j}$ be the weight for the connection FROM node $i$ TO node $j$. Then $\frac{\partial L}{\partial w_{i \to j}} = \delta_j z_i$, where $\delta_j$ is the local error (flowing backward from $j$) and $z_i$ is the local input coming from $i$.

Credit Assignment: A Graphical Revision. [Figure: the toy network redrawn with labeled nodes: node 0 is the output $\hat{y}$, nodes 1 and 2 are the hidden units $z_1, z_2$, and nodes 5, 4, 3 are the inputs $x_1, x_2, x_3$; edges carry weights $w_{i \to j}$.] Let's redraw our toy network with the new notation and label the nodes.

Credit Assignment: Top Layer. [Figure: the labeled network, highlighting edge $w_{1 \to 0}$ and the error $\delta$ at node 0.] Local error from node 0: $\delta = (\hat{y} - y)$; local input from node 1: $z_1$. So $\frac{\partial L}{\partial w_{1 \to 0}} = \delta z_1$, and the update is $w_{1 \to 0} := w_{1 \to 0} - \eta \delta z_1$.

Credit Assignment: Top Layer. [Figure: the labeled network, highlighting edge $w_{2 \to 0}$.] Local error from node 0: $\delta = (\hat{y} - y)$; local input from node 2: $z_2$. So $\frac{\partial L}{\partial w_{2 \to 0}} = \delta z_2$, and the update is $w_{2 \to 0} := w_{2 \to 0} - \eta \delta z_2$.

Credit Assignment: Next Layer. [Figure: the labeled network, highlighting edge $w_{3 \to 1}$, with $\delta$ at node 0 and $\delta_1$ at node 1.] Local error from node 1: $\delta_1 = \delta\, w_{1 \to 0}\, (1 - \tanh^2(a_1))$; local input from node 3: $x_3$. So $\frac{\partial L}{\partial w_{3 \to 1}} = \delta_1 x_3$, and the update is $w_{3 \to 1} := w_{3 \to 1} - \eta \delta_1 x_3$.

Credit Assignment: Next Layer. [Figure: the labeled network, highlighting edge $w_{4 \to 2}$, with $\delta$ at node 0 and $\delta_2$ at node 2.] Local error from node 2: $\delta_2 = \delta\, w_{2 \to 0}\, (1 - \tanh^2(a_2))$; local input from node 4: $x_2$. So $\frac{\partial L}{\partial w_{4 \to 2}} = \delta_2 x_2$, and the update is $w_{4 \to 2} := w_{4 \to 2} - \eta \delta_2 x_2$.

Let's Vectorize. Let $W^{(2)} = \begin{bmatrix} w_{1 \to 0} \\ w_{2 \to 0} \end{bmatrix}$ (ignore that $W^{(2)}$ is a vector and hence it would be more appropriate to write $w^{(2)}$). Let $W^{(1)} = \begin{bmatrix} w_{5 \to 1} & w_{5 \to 2} \\ w_{4 \to 1} & w_{4 \to 2} \\ w_{3 \to 1} & w_{3 \to 2} \end{bmatrix}$. Let $Z^{(1)} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$ and $Z^{(2)} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}$.

Feedforward Computation
1. Compute $A^{(1)} = Z^{(1)T} W^{(1)}$
2. Apply the element-wise non-linearity: $Z^{(2)} = \tanh(A^{(1)})$
3. Compute the output: $\hat{y} = Z^{(2)T} W^{(2)}$
4. Compute the loss on the example: $(\hat{y} - y)^2$

Flowing Backward
1. Top: compute $\delta$
2. Gradient w.r.t. $W^{(2)}$: $\delta Z^{(2)}$
3. Compute $\delta_1 = (W^{(2)T} \delta) \odot (1 - \tanh^2(A^{(1)}))$. Notes: (a) $\odot$ is the Hadamard (element-wise) product; (b) we write $W^{(2)T} \delta$ because $\delta$ can be a vector when there are multiple outputs
4. Gradient w.r.t. $W^{(1)}$: $\delta_1 Z^{(1)}$
5. Update $W^{(2)} := W^{(2)} - \eta \delta Z^{(2)}$
6. Update $W^{(1)} := W^{(1)} - \eta \delta_1 Z^{(1)}$
7. All the dimensionalities nicely check out!
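
Here is a minimal NumPy sketch of these vectorized passes for the single-example, single-output case. The shapes and names are my own, and the gradient w.r.t. $W^{(1)}$ is written as an outer product so that the dimensions check out:

```python
import numpy as np

def forward_backward(Z1, W1, W2, y, eta=0.1):
    """One forward/backward pass plus update for the 3-2-1 tanh network.

    Z1: input vector, shape (3,); W1: shape (3, 2); W2: shape (2,).
    """
    # forward
    A1 = Z1 @ W1                  # pre-activations, shape (2,)
    Z2 = np.tanh(A1)              # hidden activations, shape (2,)
    y_hat = Z2 @ W2               # scalar output
    L = (y_hat - y) ** 2

    # backward (the factor of 2 is absorbed into eta, as on the slides)
    delta = y_hat - y                        # local error at the output
    grad_W2 = delta * Z2                     # shape (2,)
    delta1 = (W2 * delta) * (1 - Z2 ** 2)    # local errors at the hidden units, shape (2,)
    grad_W1 = np.outer(Z1, delta1)           # shape (3, 2)

    # updates
    return W1 - eta * grad_W1, W2 - eta * grad_W2, L

# toy usage with made-up data
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=2)
Z1, y = np.array([0.5, -1.0, 2.0]), 1.0
for _ in range(100):
    W1, W2, L = forward_backward(Z1, W1, W2, y)
print(L)  # the loss on this single example should have shrunk substantially
```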

So Far. Backpropagation in the context of neural networks is all about assigning credit (or blame!) for the error incurred to the weights. We follow the path from the output (where we have an error signal) to the edge we want to consider. We find the $\delta$s from the top down to the edge concerned by using the chain rule. Once we have the partial derivative, we can write the update rule for that weight.

What did we miss? Exercise: what if there are multiple outputs? (Look at the slide from last class.) Another exercise: add bias neurons. What changes? As we go down the network, notice that we need the previous $\delta$s. If we recompute them each time, the cost can blow up! We need to book-keep derivatives as we go down the network and reuse them.

A General View of Backpropagation. There is some redundancy in the upcoming slides, but redundancy can be good!

An Aside. Backpropagation refers only to the method for computing the gradient; it is used with another algorithm, such as SGD, to learn using that gradient. Next: computing the gradient $\nabla_x f(x, y)$ for an arbitrary $f$, where $x$ is the set of variables whose derivatives are desired. Often we require the gradient of the cost $J(\theta)$ with respect to the parameters $\theta$, i.e. $\nabla_\theta J(\theta)$. Note: we restrict ourselves to the case where $f$ has a single output. First: move to a more precise computational-graph language!

Computational Graphs. Formalize computation as graphs. Nodes indicate variables (a scalar, vector, tensor, or another type of variable). Operations are simple functions of one or more variables. Our graph language comes with a set of allowable operations. Examples:

$z = xy$. [Figure: a computational graph in which input nodes $x$ and $y$ feed a multiplication operation that produces $z$.] The graph uses the multiplication operation for the computation.

Logistic Regression. [Figure: a computational graph with inputs $x$, $w$, $b$: $u_1 = \text{dot}(x, w)$, $u_2 = u_1 + b$, and $\hat{y} = \sigma(u_2)$.] Computes $\hat{y} = \sigma(x^T w + b)$.
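
As a sketch of how this graph decomposes into the primitive operations dot, + and $\sigma$ (the intermediate names u1, u2 mirror the figure; the numbers are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, -2.0, 0.5])   # made-up input
w = np.array([0.3, 0.1, -0.4])   # made-up weights
b = 0.2

u1 = x @ w            # "dot" node
u2 = u1 + b           # "+" node
y_hat = sigmoid(u2)   # "sigma" node: y_hat = sigma(x^T w + b)
```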

$H = \max\{0, XW + b\}$. [Figure: a computational graph with inputs $X$, $W$, $b$: $U_1 = \text{MM}(X, W)$, $U_2 = U_1 + b$, and $H = \text{Rc}(U_2)$.] MM is matrix multiplication and Rc is the ReLU activation.
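
The same decomposition with matrix-valued nodes, as a brief sketch (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # a made-up batch of 4 inputs
W = rng.normal(size=(3, 2))
b = np.zeros(2)

U1 = X @ W                    # MM node: matrix multiplication
U2 = U1 + b                   # + node: bias add, broadcast over the rows of U1
H = np.maximum(0.0, U2)       # Rc node: ReLU activation, H = max{0, XW + b}
```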

Back to backprop: Chain Rule. Backpropagation applies the chain rule in a manner that is highly efficient. Let $f, g : \mathbb{R} \to \mathbb{R}$. Suppose $y = g(x)$ and $z = f(y) = f(g(x))$. Chain rule: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$.

[Figure: a chain $x \to y \to z$, with $\frac{dy}{dx}$ labeling the edge $x \to y$ and $\frac{dz}{dy}$ labeling the edge $y \to z$.] Chain rule: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$.

[Figure: $x$ feeds $y_1$ and $y_2$, which both feed $z$.] Multiple paths: $\frac{dz}{dx} = \frac{dz}{dy_1} \frac{dy_1}{dx} + \frac{dz}{dy_2} \frac{dy_2}{dx}$.

[Figure: $x$ feeds $y_1, y_2, \dots, y_n$, which all feed $z$.] Multiple paths: $\frac{dz}{dx} = \sum_j \frac{dz}{dy_j} \frac{dy_j}{dx}$.

Chain Rule. Consider $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$. Let $g : \mathbb{R}^m \to \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$. Suppose $y = g(x)$ and $z = f(y)$; then $\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$. In vector notation: $\nabla_x z = \begin{bmatrix} \frac{\partial z}{\partial x_1} \\ \vdots \\ \frac{\partial z}{\partial x_m} \end{bmatrix} = \begin{bmatrix} \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_1} \\ \vdots \\ \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_m} \end{bmatrix} = \left( \frac{\partial y}{\partial x} \right)^{T} \nabla_y z$.

Chain Rule. $\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^{T} \nabla_y z$, where $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$. The gradient with respect to $x$ is thus the product of a Jacobian matrix $\frac{\partial y}{\partial x}$ (transposed) with a vector, namely the gradient $\nabla_y z$. Backpropagation consists of applying such Jacobian-gradient products to each operation in the computational graph. In general this need not apply only to vectors; it extends to tensors w.l.o.g.
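
A concrete numerical check of this Jacobian-gradient form, using a made-up $g(x) = \tanh(Ax)$ and $f(y) = \sum_j y_j^2$ (all names and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.normal(size=(n, m))
x = rng.normal(size=m)

y = np.tanh(A @ x)                 # y = g(x)
z = np.sum(y ** 2)                 # z = f(y)

grad_y = 2 * y                     # nabla_y z, shape (n,)
J = (1 - y ** 2)[:, None] * A      # Jacobian dy/dx of g, shape (n, m)
grad_x = J.T @ grad_y              # chain rule: (dy/dx)^T nabla_y z

# finite-difference check of one component of nabla_x z
eps = 1e-6
e0 = np.zeros(m); e0[0] = eps
z_plus = np.sum(np.tanh(A @ (x + e0)) ** 2)
z_minus = np.sum(np.tanh(A @ (x - e0)) ** 2)
print(grad_x[0], (z_plus - z_minus) / (2 * eps))   # should agree closely
```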

Chain Rule. We can of course also write this in terms of tensors. Let the gradient of $z$ with respect to a tensor $\mathbf{X}$ be $\nabla_{\mathbf{X}} z$. If $\mathbf{Y} = g(\mathbf{X})$ and $z = f(\mathbf{Y})$, then: $\nabla_{\mathbf{X}} z = \sum_j (\nabla_{\mathbf{X}} Y_j) \frac{\partial z}{\partial Y_j}$.

Recursive Application in a Computational Graph. Writing an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar is straightforward using the chain rule. For some node $x$, let the successors be $\{y_1, y_2, \dots, y_n\}$. Node: computation result; edge: computation dependency. Then $\frac{dz}{dx} = \sum_{i=1}^{n} \frac{dz}{dy_i} \frac{dy_i}{dx}$.

Flow Graph (for the previous slide). [Figure: node $x$ with successors $y_1, y_2, \dots, y_n$, all of which feed into $z$.]

Recursive Application in a Computational Graph. Fprop: visit the nodes in topologically sorted order and compute the value of each node given its ancestors. Bprop: set the output gradient to 1, then visit the nodes in reverse order, computing the gradient with respect to each node using the gradients with respect to its successors. With the successors of $x$ from the previous slide, $\{y_1, y_2, \dots, y_n\}$: $\frac{dz}{dx} = \sum_{i=1}^{n} \frac{dz}{dy_i} \frac{dy_i}{dx}$.

Automatic Differentiation. The computation of the gradient can be automatically inferred from the symbolic expression of the fprop. Every node type needs to know: how to compute its output, and how to compute its gradients with respect to its inputs given the gradient w.r.t. its outputs. Makes for rapid prototyping.
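
To make this concrete, here is a toy sketch of a scalar reverse-mode autodiff engine in this spirit (an illustration only, not the implementation of any particular library): each node stores how it was computed and how to push the gradient w.r.t. its output back to its inputs, and backprop visits nodes in reverse topological order.

```python
import math

class Node:
    """A scalar value in a computational graph, supporting reverse-mode autodiff."""
    def __init__(self, value, parents=(), backward_fns=()):
        self.value = value
        self.parents = parents            # nodes this node was computed from
        self.backward_fns = backward_fns  # grad w.r.t. output -> grad contribution to each parent
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (lambda g: g * other.value, lambda g: g * self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other),
                    (lambda g: g, lambda g: g))

def tanh(x):
    t = math.tanh(x.value)
    return Node(t, (x,), (lambda g: g * (1 - t * t),))

def backprop(output):
    """Topologically sort the graph, then accumulate gradients in reverse order."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, fn in zip(node.parents, node.backward_fns):
            parent.grad += fn(node.grad)

# toy usage: z = tanh(x * w) + x
x, w = Node(0.5), Node(-1.2)
z = tanh(x * w) + x
backprop(z)
print(x.grad, w.grad)  # dz/dx = (1 - tanh^2(xw)) * w + 1, dz/dw = (1 - tanh^2(xw)) * x
```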

Computational Graph for an MLP. Figure: Goodfellow et al. To train, we want to compute $\nabla_{W^{(1)}} J$ and $\nabla_{W^{(2)}} J$. Two paths lead backwards from $J$ to the weights: through the cross-entropy cost and through the regularization cost.

Computational Graph for an MLP. Figure: Goodfellow et al. The weight-decay cost is relatively simple: it will always contribute $2\lambda W^{(i)}$ to the gradient on $W^{(i)}$. Two paths lead backwards from $J$ to the weights: through the cross-entropy cost and through the regularization cost.

Symbol-to-Symbol. Figure: Goodfellow et al. In this approach, backpropagation never accesses any numerical values; instead, it just adds nodes to the graph that describe how to compute the derivatives. A graph evaluation engine then does the actual computation. This is the approach taken by Theano and TensorFlow.

Next time Regularization Methods for Deep Neural Networks