10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Backpropagation
Matt Gormley
Lecture 13, Mar 1, 2018
Reminders
Homework 5: Neural Networks
Out: Tue, Feb 28
Due: Fri, Mar 9 at 11:59pm
Q&A
BACKPROPAGATION
Background

A Recipe for Machine Learning

1. Given training data:
2. Choose each of these: Decision function; Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)
Approaches to Differentiation

Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?

Question 2: When can we make the gradient computation efficient?
Approaches to Differentiation

1. Finite Difference Method
   - Pro: Great for testing implementations of backpropagation
   - Con: Slow for high-dimensional inputs / outputs
   - Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   - Note: The method you learned in high school; used by Mathematica / Wolfram Alpha / Maple
   - Pro: Yields easily interpretable derivatives
   - Con: Leads to exponential computation time if not carefully implemented
   - Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
   - Note: Called Backpropagation when applied to Neural Nets
   - Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
   - Con: Slow for high-dimensional outputs (e.g. vector-valued functions)
   - Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
   - Note: Easy to implement. Uses dual numbers (a minimal sketch follows this list).
   - Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
   - Con: Slow for high-dimensional inputs (e.g. vector-valued x)
   - Required: Algorithm for computing f(x)
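Forward mode is easy to implement with dual numbers, as noted above. Below is a minimal Python sketch, not part of the original slides: a `Dual` class (an illustrative name) that carries a value together with one directional derivative through ordinary arithmetic.

```python
import math

class Dual:
    """Dual number a + b*eps with eps^2 = 0: carries a value and one
    directional derivative through ordinary arithmetic (forward mode)."""
    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative w.r.t. the chosen input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def sin(u): return Dual(math.sin(u.val), math.cos(u.val) * u.dot)
def cos(u): return Dual(math.cos(u.val), -math.sin(u.val) * u.dot)

# d/dx of cos(sin(x^2) + 3x^2) at x = 2, seeded with dot = 1
x = Dual(2.0, 1.0)
J = cos(sin(x * x) + 3 * (x * x))
print(J.val, J.dot)
```

One forward pass with a seeded input yields the derivative with respect to that one input, which is exactly the trade-off listed above: cheap per input, expensive when x is high-dimensional.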
Finite Difference Method

The centered difference approximates each partial derivative using only function evaluations:
\[ \frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \epsilon \cdot d_i) - J(\theta - \epsilon \cdot d_i)}{2\epsilon} \]
where d_i is a one-hot vector with a 1 in position i.

Notes:
- Suffers from issues of floating-point precision, in practice
- Typically only appropriate to use on small examples with an appropriately chosen epsilon
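The formula translates directly into a gradient checker. A minimal sketch (the helper name `finite_difference_grad` is illustrative, not from the slides):

```python
import numpy as np

def finite_difference_grad(J, theta, epsilon=1e-5):
    """Approximate the gradient of J at theta with centered differences.
    Requires only the ability to evaluate J; costs 2 calls per dimension,
    which is why this is a testing tool, not a training algorithm."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = epsilon
        grad[i] = (J(theta + d) - J(theta - d)) / (2 * epsilon)
    return grad

# Example: check against the known gradient of J(theta) = sum(theta^2)
theta = np.array([1.0, -2.0, 3.0])
approx = finite_difference_grad(lambda t: np.sum(t ** 2), theta)
print(np.allclose(approx, 2 * theta, atol=1e-6))  # True
```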
Symbolic Differentiation

Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? [Function shown on slide.]
Symbolic Differentiation

Differentiation Quiz #2: [Shown on slide.]
Chain Rule

Whiteboard:
- Chain Rule of Calculus
Chain Rule

Given: y = g(u) and u = h(x).

Chain Rule:
\[ \frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k \]

Backpropagation is just repeated application of the chain rule from Calculus 101.
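A concrete scalar instance of the rule, with two intermediate variables (J = 2), makes the sum over paths visible; this worked example is not on the original slides:

```latex
% y = g(u_1, u_2), with u_1 = h_1(x) and u_2 = h_2(x):
\frac{dy}{dx} = \frac{dy}{du_1}\frac{du_1}{dx} + \frac{dy}{du_2}\frac{du_2}{dx}
% For example, y = u_1 + u_2, u_1 = \sin(x), u_2 = x^2 gives
% \frac{dy}{dx} = \cos(x) + 2x.
```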
Error Back-Propagation

[A sequence of figure-only animation slides building up error back-propagation through a model that computes p(y | x^(i)) as a function of the parameters θ, with true label y^(i) and intermediate quantities z. Slides from (Stoyanov & Eisner, 2012).]
Backpropagation

Whiteboard:
- Example: Backpropagation for Chain Rule #1

Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? [Function shown on slide.]
Backpropagation

Automatic Differentiation, Reverse Mode (aka. Backpropagation)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order. For variable u_i with inputs v_1, ..., v_N:
   a. Compute u_i = g_i(v_1, ..., v_N)
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.
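The two passes above map almost line for line onto a tiny computation-graph library. The sketch below (names like `Node` and `backprop` are illustrative, not from the slides) implements both passes and runs them on the simple example that follows:

```python
import math

class Node:
    """One variable in the computation graph: its value, its accumulated
    partial dy/d(node), and the local derivatives to its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # list of (input_node, local_derivative)

def add(a, b): return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])
def mul(a, b): return Node(a.value * b.value, [(a, b.value), (b, a.value)])
def sin(a):    return Node(math.sin(a.value), [(a, math.cos(a.value))])
def cos(a):    return Node(math.cos(a.value), [(a, -math.sin(a.value))])

def backprop(y):
    """Recover a topological order by depth-first search from y, then
    sweep it in reverse, incrementing each input's partial derivative."""
    order, seen = [], set()
    def visit(u):
        if id(u) not in seen:
            seen.add(id(u))
            for v, _ in u.parents:
                visit(v)
            order.append(u)
    visit(y)
    y.grad = 1.0                      # dy/dy = 1
    for u in reversed(order):         # reverse topological order
        for v, local in u.parents:
            v.grad += u.grad * local  # increment dy/dv by (dy/du)(du/dv)

# J = cos(sin(x^2) + 3x^2) at x = 2 (the running example)
x = Node(2.0)
t = mul(x, x)
J = cos(add(sin(t), mul(Node(3.0), t)))
backprop(J)
print(J.value, x.grad)
```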
Backpropagation

Simple Example: The goal is to compute J = cos(sin(x²) + 3x²) on the forward pass and the derivative dJ/dx on the backward pass.

Forward:
\[ J = \cos(u), \quad u = u_1 + u_2, \quad u_1 = \sin(t), \quad u_2 = 3t, \quad t = x^2 \]

Backward:
\[ \frac{dJ}{du} = -\sin(u) \]
\[ \frac{dJ}{du_1} = \frac{dJ}{du}\frac{du}{du_1}, \quad \frac{du}{du_1} = 1 \]
\[ \frac{dJ}{du_2} = \frac{dJ}{du}\frac{du}{du_2}, \quad \frac{du}{du_2} = 1 \]
\[ \frac{dJ}{dt} = \frac{dJ}{du_1}\frac{du_1}{dt} + \frac{dJ}{du_2}\frac{du_2}{dt}, \quad \frac{du_1}{dt} = \cos(t), \quad \frac{du_2}{dt} = 3 \]
\[ \frac{dJ}{dx} = \frac{dJ}{dt}\frac{dt}{dx}, \quad \frac{dt}{dx} = 2x \]
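Written out by hand, the two passes are just straight-line code. A short sketch with a finite-difference sanity check (the function name `forward_backward` is illustrative):

```python
import math

def forward_backward(x):
    """Hand-coded forward and backward passes for J = cos(sin(x^2) + 3x^2),
    following the intermediate variables on the slide."""
    # Forward pass
    t  = x * x
    u1 = math.sin(t)
    u2 = 3 * t
    u  = u1 + u2
    J  = math.cos(u)
    # Backward pass, in reverse topological order
    dJ_du  = -math.sin(u)
    dJ_du1 = dJ_du * 1.0
    dJ_du2 = dJ_du * 1.0
    dJ_dt  = dJ_du1 * math.cos(t) + dJ_du2 * 3.0  # two paths into t
    dJ_dx  = dJ_dt * 2 * x
    return J, dJ_dx

J, g = forward_backward(2.0)
# Sanity check against a centered finite difference
eps = 1e-6
fd = (forward_backward(2.0 + eps)[0] - forward_backward(2.0 - eps)[0]) / (2 * eps)
print(J, g, fd)  # g and fd should agree to ~6 decimal places
```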
Backpropagation

Case 1: Logistic Regression

[Diagram: inputs feeding a single sigmoid output unit through parameters θ₁, θ₂, θ₃, ..., θ_M.]

Forward:
\[ J = y^* \log y + (1 - y^*) \log(1 - y) \]
\[ y = \sigma(a) = \frac{1}{1 + \exp(-a)} \]
\[ a = \sum_{j=0}^{D} \theta_j x_j \]

Backward:
\[ \frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1} \]
\[ \frac{dJ}{da} = \frac{dJ}{dy}\frac{dy}{da}, \quad \frac{dy}{da} = \frac{\exp(-a)}{(\exp(-a) + 1)^2} \]
\[ \frac{dJ}{d\theta_j} = \frac{dJ}{da}\frac{da}{d\theta_j}, \quad \frac{da}{d\theta_j} = x_j \]
\[ \frac{dJ}{dx_j} = \frac{dJ}{da}\frac{da}{dx_j}, \quad \frac{da}{dx_j} = \theta_j \]
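As code, a minimal sketch of this case (the function name is illustrative; note that J as written is the log-likelihood, so a training loop would ascend this gradient, or negate J to get a loss):

```python
import numpy as np

def logistic_forward_backward(theta, x, y_star):
    """Forward and backward passes for the logistic regression case,
    mirroring the slide's equations (x includes a leading 1 for the
    bias term, matching the sum starting at j = 0)."""
    # Forward
    a = theta @ x                 # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + np.exp(-a))  # y = sigma(a)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    # Backward
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = np.exp(-a) / (np.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = dJ_da * x         # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta         # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

x = np.array([1.0, 0.5, -1.2])    # x_0 = 1 is the bias feature
theta = np.array([0.1, -0.3, 0.8])
J, dtheta, dx = logistic_forward_backward(theta, x, y_star=1.0)
print(J, dtheta)
```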
Backpropagation

[Diagram: a one-hidden-layer network; input layer feeding a hidden layer feeding a single output.]

(F) Loss: \( J = \frac{1}{2}(y - y^*)^2 \)
(E) Output (sigmoid): \( y = \frac{1}{1 + \exp(-b)} \)
(D) Output (linear): \( b = \sum_{j=0}^{D} \beta_j z_j \)
(C) Hidden (sigmoid): \( z_j = \frac{1}{1 + \exp(-a_j)}, \ \forall j \)
(B) Hidden (linear): \( a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \ \forall j \)
(A) Input: Given \( x_i, \ \forall i \)
Backpropagation

Case 2: Neural Network

Forward:
\[ J = y^* \log y + (1 - y^*) \log(1 - y) \quad \text{(loss)} \]
\[ y = \frac{1}{1 + \exp(-b)} \quad \text{(sigmoid)} \]
\[ b = \sum_{j=0}^{D} \beta_j z_j \quad \text{(linear)} \]
\[ z_j = \frac{1}{1 + \exp(-a_j)} \quad \text{(sigmoid)} \]
\[ a_j = \sum_{i=0}^{M} \alpha_{ji} x_i \quad \text{(linear)} \]

Backward:
\[ \frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1} \]
\[ \frac{dJ}{db} = \frac{dJ}{dy}\frac{dy}{db}, \quad \frac{dy}{db} = \frac{\exp(-b)}{(\exp(-b) + 1)^2} \]
\[ \frac{dJ}{d\beta_j} = \frac{dJ}{db}\frac{db}{d\beta_j}, \quad \frac{db}{d\beta_j} = z_j \]
\[ \frac{dJ}{dz_j} = \frac{dJ}{db}\frac{db}{dz_j}, \quad \frac{db}{dz_j} = \beta_j \]
\[ \frac{dJ}{da_j} = \frac{dJ}{dz_j}\frac{dz_j}{da_j}, \quad \frac{dz_j}{da_j} = \frac{\exp(-a_j)}{(\exp(-a_j) + 1)^2} \]
\[ \frac{dJ}{d\alpha_{ji}} = \frac{dJ}{da_j}\frac{da_j}{d\alpha_{ji}}, \quad \frac{da_j}{d\alpha_{ji}} = x_i \]
\[ \frac{dJ}{dx_i} = \sum_{j=0}^{D} \frac{dJ}{da_j}\frac{da_j}{dx_i}, \quad \frac{da_j}{dx_i} = \alpha_{ji} \]
Derivative of a Sigmoid

For \( \sigma(a) = \frac{1}{1 + \exp(-a)} \), the derivative has a convenient closed form:
\[ \frac{d\sigma(a)}{da} = \frac{\exp(-a)}{(\exp(-a) + 1)^2} = \sigma(a)\,(1 - \sigma(a)) \]
This matches the dy/db and dz_j/da_j terms above and lets the backward pass reuse the sigmoid values already computed on the forward pass.
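Putting Case 2 together as code, a minimal sketch of the full forward and backward passes (the names, shapes, and the omission of bias terms are simplifying assumptions, not from the slides):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nn_forward_backward(alpha, beta, x, y_star):
    """Forward and backward passes for the one-hidden-layer network of
    Case 2. Shapes: x is (M,), alpha is (D, M), beta is (D,). Bias terms
    are omitted for brevity; the slides fold them in as index 0."""
    # Forward pass (A) -> (F)
    a = alpha @ x          # hidden linear:  a_j = sum_i alpha_ji x_i
    z = sigmoid(a)         # hidden sigmoid
    b = beta @ z           # output linear:  b = sum_j beta_j z_j
    y = sigmoid(b)         # output sigmoid
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    # Backward pass (F) -> (A), using d sigma/da = sigma(1 - sigma)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * y * (1 - y)
    dJ_dbeta = dJ_db * z                 # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta                 # db/dz_j = beta_j
    dJ_da = dJ_dz * z * (1 - z)
    dJ_dalpha = np.outer(dJ_da, x)       # da_j/dalpha_ji = x_i
    dJ_dx = alpha.T @ dJ_da              # dJ/dx_i = sum_j dJ/da_j * alpha_ji
    return J, dJ_dalpha, dJ_dbeta

rng = np.random.default_rng(0)
alpha = rng.normal(size=(4, 3))
beta = rng.normal(size=4)
x = np.array([1.0, 0.5, -1.2])
J, dalpha, dbeta = nn_forward_backward(alpha, beta, x, y_star=1.0)
print(J, dbeta)
```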
Backpropagation

Whiteboard:
- SGD for Neural Network
- Example: Backpropagation for Neural Network
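As a reference point for the whiteboard discussion, a minimal SGD loop sketch (the helper `sgd` and the toy least-squares objective are illustrative; with backprop, `grad_J` would be the per-example gradient the network computes):

```python
import numpy as np

def sgd(grad_J, theta0, examples, lr=0.1, epochs=10, rng=None):
    """Minimal SGD sketch: repeatedly pick one training example and take
    a small step opposite its gradient. grad_J(theta, example) returns
    the per-example gradient of the loss."""
    rng = rng or np.random.default_rng(0)
    theta = theta0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            theta -= lr * grad_J(theta, examples[i])
    return theta

# Toy usage: least squares, J_i = 0.5 * (theta . x_i - y_i)^2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, -1.0, 1.0])
grad = lambda th, i: (th @ X[i] - y[i]) * X[i]
theta = sgd(grad, np.zeros(2), list(range(len(X))), lr=0.1, epochs=200)
print(theta)  # approaches [2, -1]
```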
Backpropagation

Backpropagation (Auto. Diff. - Reverse Mode)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.
Background

A Recipe for Gradients

1. Given training data:
2. Choose each of these: Decision function; Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
Summary

1. Neural Networks
   - provide a way of learning features
   - are highly nonlinear prediction functions
   - (can be) a highly parallel network of logistic regression classifiers
   - discover useful hidden representations of the input
2. Backpropagation
   - provides an efficient way to compute gradients
   - is a special case of reverse-mode automatic differentiation
Backprop Objectives

You should be able to...
- Construct a computation graph for a function as specified by an algorithm
- Carry out backpropagation on an arbitrary computation graph
- Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
- Instantiate the backpropagation algorithm for a neural network
- Instantiate an optimization method (e.g. SGD) and a regularizer (e.g. L2) when the parameters of a model consist of several matrices corresponding to different layers of a neural network
- Apply the empirical risk minimization framework to learn a neural network
- Use the finite difference method to evaluate the gradient of a function
- Identify when the gradient of a function can be computed at all and when it can be computed efficiently