Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 13 Mar 1, 2018


1 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Backpropagation
Matt Gormley
Lecture 13
Mar 1, 2018

2 Reminders
Homework 5: Neural Networks
Out: Tue, Feb 28
Due: Fri, Mar 9 at 11:59pm

3 Q&A

4 BACKPROPAGATION

5 Background
A Recipe for Machine Learning
1. Given training data
2. Choose each of these: decision function, loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
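
A minimal sketch of step 4 of the recipe, taking small steps opposite the gradient of the loss on one example at a time. The linear decision function, the squared loss, the toy data, and the names `sgd` and `squared_loss_grad` are all assumptions chosen for illustration; they do not come from the lecture.

```python
import random

def sgd(grad_fn, theta, data, lr=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: step opposite the per-example gradient."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(epochs):
        examples = list(data)
        rng.shuffle(examples)  # visit the examples in a random order each epoch
        for x, y in examples:
            g = grad_fn(theta, x, y)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def squared_loss_grad(theta, x, y):
    """Gradient of 0.5 * (theta[0] + theta[1] * x - y)**2 with respect to theta."""
    err = theta[0] + theta[1] * x - y
    return [err, err * x]

# Hypothetical noise-free data from y = 1 + 2x, for illustration only.
data = [(k / 10.0, 1.0 + 2.0 * k / 10.0) for k in range(-10, 11)]
theta_hat = sgd(squared_loss_grad, [0.0, 0.0], data)
```

On this toy problem `theta_hat` should end up close to the generating parameters (1, 2). For a neural network, `grad_fn` is the piece that backpropagation supplies.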

6 Approaches to Differentiation
Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?
Question 2: When can we make the gradient computation efficient?

7 Approaches to Differentiation
1. Finite Difference Method
   Pro: Great for testing implementations of backpropagation
   Con: Slow for high-dimensional inputs / outputs
   Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   Note: The method you learned in high school
   Note: Used by Mathematica / Wolfram Alpha / Maple
   Pro: Yields easily interpretable derivatives
   Con: Leads to exponential computation time if not carefully implemented
   Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
   Note: Called backpropagation when applied to neural nets
   Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
   Con: Slow for high-dimensional outputs (e.g. vector-valued functions)
   Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
   Note: Easy to implement. Uses dual numbers.
   Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
   Con: Slow for high-dimensional inputs (e.g. vector-valued x)
   Required: Algorithm for computing f(x)
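
The note that forward mode "uses dual numbers" can be illustrated with a small sketch. The `Dual` class, the helper `dsin`, and the test function `f` below are hypothetical names introduced only for this example. Each call to `partial` seeds exactly one input, which is why forward mode needs one pass per input.

```python
import math

class Dual:
    """A dual number val + dot*eps with eps**2 = 0; `dot` carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to the derivative parts.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def dsin(d):
    """sin lifted to dual numbers: d/dx sin(x) = cos(x)."""
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def f(x1, x2):
    # Hypothetical example function: f = x1 * x2 + sin(x1)
    return x1 * x2 + dsin(x1)

def partial(func, args, j):
    """Partial derivative of func with respect to input j, in one forward pass."""
    duals = [Dual(a, 1.0 if k == j else 0.0) for k, a in enumerate(args)]
    return func(*duals).dot

df_dx1 = partial(f, [2.0, 3.0], 0)  # analytically x2 + cos(x1)
df_dx2 = partial(f, [2.0, 3.0], 1)  # analytically x1
```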

8 Finite Difference Method
Notes:
   Suffers from issues of floating-point precision in practice
   Typically only appropriate to use on small examples with an appropriately chosen epsilon
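
A minimal sketch of a finite-difference gradient, assuming the common centered-difference form (f(θ + ε e_i) − f(θ − ε e_i)) / 2ε. The function `finite_difference_grad`, the default `eps`, and the test function `quadratic` are illustrative assumptions. The loop calls f twice per dimension, which is the reason the method is slow for high-dimensional inputs.

```python
def finite_difference_grad(f, theta, eps=1e-5):
    """Approximate the gradient of f at theta, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad

def quadratic(theta):
    # Hypothetical test function with known gradient (2*t0 + 3*t1, 3*t0).
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

approx = finite_difference_grad(quadratic, [1.0, 2.0])
```

Comparing `approx` against an analytic or backpropagated gradient is the usual way to use this as a test of a backpropagation implementation.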

9 Symbolic Differentiation
Differentiation Quiz #1: Suppose x = 2 and z = 3. What are dy/dx and dy/dz for the function below?

10 Symbolic Differentiation
Differentiation Quiz #2:

11 Chain Rule
Whiteboard: Chain Rule of Calculus

12 Chain Rule
Given: $y = g(u)$ and $u = h(x)$.
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k$
Backpropagation is just repeated application of the chain rule from calculus.
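
As a small numeric illustration of the chain rule for y = g(u) and u = h(x), the sketch below sums over the intermediate index j exactly as in the formula. The particular functions `h` and `g` and their hand-written Jacobians are hypothetical choices made only for this example.

```python
def h(x):
    # u = h(x): u_0 = x_0 * x_1, u_1 = x_0 + x_1
    return [x[0] * x[1], x[0] + x[1]]

def g(u):
    # y = g(u): y_0 = u_0^2, y_1 = u_0 * u_1
    return [u[0] ** 2, u[0] * u[1]]

def jac_h(x):
    # Entry [j][k] is du_j / dx_k.
    return [[x[1], x[0]], [1.0, 1.0]]

def jac_g(u):
    # Entry [i][j] is dy_i / du_j.
    return [[2.0 * u[0], 0.0], [u[1], u[0]]]

def chain_rule(x):
    """dy_i/dx_k = sum over j of (dy_i/du_j) * (du_j/dx_k)."""
    u = h(x)
    dg, dh = jac_g(u), jac_h(x)
    return [[sum(dg[i][j] * dh[j][k] for j in range(len(u)))
             for k in range(len(x))]
            for i in range(len(dg))]

dy_dx = chain_rule([2.0, 3.0])
```

Here y_0 = (x_0 x_1)^2 and y_1 = x_0 x_1 (x_0 + x_1), so the result can be checked by differentiating those expressions directly.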

14 Error Back-Propagation
[Figure sequence spanning ten slides; the final slide is labeled with p(y | x^(i)), θ, y^(i), and z.]
Slide from (Stoyanov & Eisner, 2012)

24 Backpropagation
Whiteboard Example: Backpropagation for Chain Rule #1
Differentiation Quiz #1: Suppose x = 2 and z = 3. What are dy/dx and dy/dz for the function below?

25 Backpropagation
Automatic Differentiation - Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order. For variable u_i with inputs v_1, ..., v_N:
   a. Compute u_i = g_i(v_1, ..., v_N)
   b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing du_i/dv_j is easy)
Return partial derivatives dy/du_i for all variables.
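
The forward and backward computations above can be sketched as a tiny reverse-mode engine. This is one possible arrangement under stated assumptions, not the lecture's implementation: the `Var` class, the operations `add`, `mul`, and `sine`, and the example function y = x1 * x2 + sin(x1) are all hypothetical.

```python
import math

class Var:
    """A computation-graph node: stores its value, its inputs, and local partials."""
    def __init__(self, value, parents=()):
        self.value = value       # forward result, stored at the node
        self.parents = parents   # tuples of (input node, d(self)/d(input))
        self.grad = 0.0          # will hold dy/d(self)

def add(a, b):
    return Var(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Var(a.value * b.value, ((a, b.value), (b, a.value)))

def sine(a):
    return Var(math.sin(a.value), ((a, math.cos(a.value)),))

def topological_order(output):
    """Depth-first post-order: every node appears after all of its inputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(output)
    return order

def backward(output):
    order = topological_order(output)
    for node in order:
        node.grad = 0.0              # initialize all dy/du_j to 0
    output.grad = 1.0                # dy/dy = 1
    for node in reversed(order):     # reverse topological order
        for parent, local in node.parents:
            parent.grad += node.grad * local  # increment dy/dv_j

x1, x2 = Var(2.0), Var(3.0)
y = add(mul(x1, x2), sine(x1))  # forward pass happens as nodes are built
backward(y)
```

After `backward(y)`, `x1.grad` and `x2.grad` hold dy/dx1 and dy/dx2 from a single backward pass, in contrast to forward mode, which needs one pass per input.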

26 Backpropagation
Simple Example: The goal is to compute $J = \cos(\sin(x^2) + 3x^2)$ on the forward pass and the derivative $\frac{dJ}{dx}$ on the backward pass.
Forward
$J = \cos(u)$
$u = u_1 + u_2$
$u_1 = \sin(t)$
$u_2 = 3t$
$t = x^2$
Backward
$\frac{dJ}{du} = -\sin(u)$
$\frac{dJ}{du_1} = \frac{dJ}{du} \frac{du}{du_1}, \quad \frac{du}{du_1} = 1$
$\frac{dJ}{du_2} = \frac{dJ}{du} \frac{du}{du_2}, \quad \frac{du}{du_2} = 1$
$\frac{dJ}{dt} = \frac{dJ}{du_1} \frac{du_1}{dt} + \frac{dJ}{du_2} \frac{du_2}{dt}, \quad \frac{du_1}{dt} = \cos(t), \quad \frac{du_2}{dt} = 3$
$\frac{dJ}{dx} = \frac{dJ}{dt} \frac{dt}{dx}, \quad \frac{dt}{dx} = 2x$
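
The forward and backward passes of this simple example (J = cos(u), u = u_1 + u_2, u_1 = sin(t), u_2 = 3t, t = x^2) translate almost line for line into code. The sketch below is illustrative: the function names, the evaluation point x = 0.5, and the centered-difference check are assumptions added here.

```python
import math

def forward(x):
    """Forward pass: compute and keep every intermediate quantity."""
    t = x ** 2
    u1 = math.sin(t)
    u2 = 3.0 * t
    u = u1 + u2
    J = math.cos(u)
    return J, (t, u1, u2, u)

def backward(x, cache):
    """Backward pass: visit the intermediate quantities in reverse order."""
    t, u1, u2, u = cache
    dJ_du = -math.sin(u)
    dJ_du1 = dJ_du * 1.0                            # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                            # du/du2 = 1
    dJ_dt = dJ_du1 * math.cos(t) + dJ_du2 * 3.0     # t feeds both u1 and u2
    dJ_dx = dJ_dt * 2.0 * x                         # dt/dx = 2x
    return dJ_dx

x = 0.5
J, cache = forward(x)
grad = backward(x, cache)

# Finite-difference check of the backward pass.
eps = 1e-6
numeric = (forward(x + eps)[0] - forward(x - eps)[0]) / (2.0 * eps)
```

The two contributions to `dJ_dt` are summed because t has two children in the computation graph, which is the "increment" step of the general algorithm.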

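28 Backpropagation
Case 1: Logistic Regression
[Figure: inputs connected to a single output through weights θ_1, θ_2, θ_3, ..., θ_M]
Forward
$J = y^* \log y + (1 - y^*) \log(1 - y)$
$y = \frac{1}{1 + \exp(-a)}$
$a = \sum_{j=0}^{D} \theta_j x_j$
Backward
$\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$
$\frac{dJ}{da} = \frac{dJ}{dy} \frac{dy}{da}, \quad \frac{dy}{da} = \frac{\exp(-a)}{(\exp(-a) + 1)^2}$
$\frac{dJ}{d\theta_j} = \frac{dJ}{da} \frac{da}{d\theta_j}, \quad \frac{da}{d\theta_j} = x_j$
$\frac{dJ}{dx_j} = \frac{dJ}{da} \frac{da}{dx_j}, \quad \frac{da}{dx_j} = \theta_j$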
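
A sketch of the logistic regression case in code, following the forward quantities a, y, J and the backward chain dJ/dy, dJ/da, dJ/dθ_j, dJ/dx_j. The function names and the numeric values of `theta`, `x`, and `y_star` are hypothetical, and J is taken to be the log-likelihood y* log y + (1 − y*) log(1 − y).

```python
import math

def forward(theta, x, y_star):
    a = sum(tj * xj for tj, xj in zip(theta, x))   # linear score
    y = 1.0 / (1.0 + math.exp(-a))                 # sigmoid
    J = y_star * math.log(y) + (1.0 - y_star) * math.log(1.0 - y)
    return J, (a, y)

def backward(theta, x, y_star, cache):
    a, y = cache
    dJ_dy = y_star / y + (1.0 - y_star) / (y - 1.0)
    dy_da = math.exp(-a) / (math.exp(-a) + 1.0) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = [dJ_da * xj for xj in x]       # da/dtheta_j = x_j
    dJ_dx = [dJ_da * tj for tj in theta]       # da/dx_j = theta_j
    return dJ_dtheta, dJ_dx

theta = [0.1, -0.2, 0.3]
x = [1.0, 2.0, -1.0]
y_star = 1.0
J, cache = forward(theta, x, y_star)
dJ_dtheta, dJ_dx = backward(theta, x, y_star, cache)
```

Multiplying out dJ/dy and dy/da gives dJ/da = y* − y, which is a convenient sanity check on the backward pass.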

29 Backpropagation
[Figure: network with Input, Hidden Layer, and Output]
(F) Loss: $J = \frac{1}{2}(y - y^*)^2$
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \quad \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \quad \forall j$
(A) Input: Given $x_i, \quad \forall i$

31 Backpropagation
Case 2: Neural Network
Modules, from output to input: Loss, Sigmoid, Linear, Sigmoid, Linear
Forward
$J = y^* \log y + (1 - y^*) \log(1 - y)$
$y = \frac{1}{1 + \exp(-b)}$
$b = \sum_{j=0}^{D} \beta_j z_j$
$z_j = \frac{1}{1 + \exp(-a_j)}$
$a_j = \sum_{i=0}^{M} \alpha_{ji} x_i$
Backward
$\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$
$\frac{dJ}{db} = \frac{dJ}{dy} \frac{dy}{db}, \quad \frac{dy}{db} = \frac{\exp(-b)}{(\exp(-b) + 1)^2}$
$\frac{dJ}{d\beta_j} = \frac{dJ}{db} \frac{db}{d\beta_j}, \quad \frac{db}{d\beta_j} = z_j$
$\frac{dJ}{dz_j} = \frac{dJ}{db} \frac{db}{dz_j}, \quad \frac{db}{dz_j} = \beta_j$
$\frac{dJ}{da_j} = \frac{dJ}{dz_j} \frac{dz_j}{da_j}, \quad \frac{dz_j}{da_j} = \frac{\exp(-a_j)}{(\exp(-a_j) + 1)^2}$
$\frac{dJ}{d\alpha_{ji}} = \frac{dJ}{da_j} \frac{da_j}{d\alpha_{ji}}, \quad \frac{da_j}{d\alpha_{ji}} = x_i$
$\frac{dJ}{dx_i} = \sum_{j=0}^{D} \frac{dJ}{da_j} \frac{da_j}{dx_i}, \quad \frac{da_j}{dx_i} = \alpha_{ji}$
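
A sketch of the one-hidden-layer case in plain Python, with one function per pass. The names `alpha` (hidden-layer weights) and `beta` (output-layer weights), the numeric values, and the omission of explicit bias terms are assumptions made for this example. The backward pass mirrors the module order Loss, Sigmoid, Linear, Sigmoid, Linear.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(alpha, beta, x, y_star):
    a = [sum(alpha[j][i] * x[i] for i in range(len(x)))
         for j in range(len(alpha))]                  # hidden (linear)
    z = [sigmoid(aj) for aj in a]                     # hidden (sigmoid)
    b = sum(beta[j] * z[j] for j in range(len(z)))    # output (linear)
    y = sigmoid(b)                                    # output (sigmoid)
    J = y_star * math.log(y) + (1.0 - y_star) * math.log(1.0 - y)
    return J, (a, z, b, y)

def backward(alpha, beta, x, y_star, cache):
    a, z, b, y = cache
    dJ_dy = y_star / y + (1.0 - y_star) / (y - 1.0)
    dJ_db = dJ_dy * math.exp(-b) / (math.exp(-b) + 1.0) ** 2
    dJ_dbeta = [dJ_db * zj for zj in z]
    dJ_dz = [dJ_db * bj for bj in beta]
    dJ_da = [dJ_dz[j] * math.exp(-a[j]) / (math.exp(-a[j]) + 1.0) ** 2
             for j in range(len(a))]
    dJ_dalpha = [[dJ_da[j] * xi for xi in x] for j in range(len(a))]
    # Each input x_i feeds every hidden unit, so its gradient sums over j.
    dJ_dx = [sum(dJ_da[j] * alpha[j][i] for j in range(len(a)))
             for i in range(len(x))]
    return dJ_dalpha, dJ_dbeta, dJ_dx

alpha = [[0.1, -0.2], [0.3, 0.4]]
beta = [0.5, -0.6]
x = [1.0, 2.0]
y_star = 1.0
J, cache = forward(alpha, beta, x, y_star)
dJ_dalpha, dJ_dbeta, dJ_dx = backward(alpha, beta, x, y_star, cache)
```

Perturbing a single weight by a small epsilon and re-running `forward` gives a finite-difference estimate that should agree with the corresponding entry of `dJ_dalpha` or `dJ_dbeta`.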

33 Derivative of a Sigmoid
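
A short worked derivation, using the output-layer symbols y and b from the neural network case, of the sigmoid derivative that appears in the backward pass. It also shows the equivalent form y(1 − y), which is a standard identity rather than something stated in the surviving slide text.

```latex
y = \frac{1}{1 + \exp(-b)}
\quad\Longrightarrow\quad
\frac{dy}{db}
  = \frac{\exp(-b)}{(\exp(-b) + 1)^2}
  = \frac{1}{1 + \exp(-b)} \cdot \frac{\exp(-b)}{1 + \exp(-b)}
  = y \, (1 - y),
\qquad\text{since}\quad
1 - y = \frac{\exp(-b)}{1 + \exp(-b)}.
```

The same identity applies to the hidden units, giving dz_j/da_j = z_j (1 − z_j).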


36 Backpropagation
Whiteboard: SGD for Neural Network
Example: Backpropagation for Neural Network

37 Backpropagation
Backpropagation (Auto. Diff. - Reverse Mode)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing du_i/dv_j is easy)
Return partial derivatives dy/du_i for all variables.

38 Background
A Recipe for Machine Learning
1. Given training data
2. Choose each of these: decision function, loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
Gradients
Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

39 Summary
1. Neural Networks
   - provide a way of learning features
   - are highly nonlinear prediction functions
   - (can be) a highly parallel network of logistic regression classifiers
   - discover useful hidden representations of the input
2. Backpropagation
   - provides an efficient way to compute gradients
   - is a special case of reverse-mode automatic differentiation

40 Backprop Objectives
You should be able to:
   - Construct a computation graph for a function as specified by an algorithm
   - Carry out backpropagation on an arbitrary computation graph
   - Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
   - Instantiate the backpropagation algorithm for a neural network
   - Instantiate an optimization method (e.g. SGD) and a regularizer (e.g. L2) when the parameters of a model consist of several matrices corresponding to different layers of a neural network
   - Apply the empirical risk minimization framework to learn a neural network
   - Use the finite difference method to evaluate the gradient of a function
   - Identify when the gradient of a function can be computed at all and when it can be computed efficiently
