Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 13 Mar 1, 2018

10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 13 Mar 1, 2018 1

Reminders Homework 5: Neural Networks Out: Tue, Feb 28 Due: Fri, Mar 9 at 11:59pm 2

Q&A 3

BACKPROPAGATION 4

Background: A Recipe for Machine Learning. 1. Given training data. 2. Choose each of these: a decision function and a loss function. 3. Define the goal. 4. Train with SGD (take small steps opposite the gradient).
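The equations on this slide are not reproduced in the transcript; a standard way to write the goal (step 3) and the SGD update (step 4), using generic notation (decision function f_theta, loss l, step size gamma) that is assumed here rather than taken from the slide:

```latex
% Step 3 (goal): empirical risk minimization over parameters \theta
\theta^{*} = \arg\min_{\theta} \sum_{n=1}^{N} \ell\bigl(f_{\theta}(x^{(n)}),\, y^{(n)}\bigr)
% Step 4 (training): SGD takes a small step opposite the gradient for a sampled example n
\theta \leftarrow \theta - \gamma\, \nabla_{\theta}\, \ell\bigl(f_{\theta}(x^{(n)}),\, y^{(n)}\bigr)
```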

Approaches to Differentiation Question 1: When can we compute the gradients of the parameters of an arbitrary neural network? Question 2: When can we make the gradient computation efficient? 6

Approaches to Differentiation
1. Finite Difference Method
   - Pro: Great for testing implementations of backpropagation
   - Con: Slow for high-dimensional inputs / outputs
   - Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   - Note: The method you learned in high school; used by Mathematica / Wolfram Alpha / Maple
   - Pro: Yields easily interpretable derivatives
   - Con: Leads to exponential computation time if not carefully implemented
   - Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
   - Note: Called Backpropagation when applied to neural nets
   - Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to the computation of f(x)
   - Con: Slow for high-dimensional outputs (e.g. vector-valued functions)
   - Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
   - Note: Easy to implement; uses dual numbers
   - Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to the computation of f(x)
   - Con: Slow for high-dimensional inputs (e.g. vector-valued x)
   - Required: Algorithm for computing f(x)

Finite Difference Method. Notes: In practice, it suffers from floating-point precision issues, and it is typically only appropriate to use on small examples with an appropriately chosen epsilon.
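The slide's formula is not shown in the transcript; a minimal sketch of a centered-difference gradient check in Python, assuming a scalar-valued function of a vector input (all names here are illustrative, not from the slides):

```python
import numpy as np

def finite_difference_grad(f, x, epsilon=1e-5):
    """Approximate the gradient of a scalar function f at x with centered
    differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
    Requires 2*len(x) calls to f, hence slow for high-dimensional inputs."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = epsilon
        grad[i] = (f(x + e) - f(x - e)) / (2 * epsilon)
    return grad

# Example: check against the exact gradient of f(x) = sum(x**2), which is 2x.
x = np.array([1.0, -2.0, 3.0])
approx = finite_difference_grad(lambda v: np.sum(v ** 2), x)
print(np.allclose(approx, 2 * x, atol=1e-4))  # True
```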

Symbolic Differentiation Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? 9

Symbolic Differentiation Differentiation Quiz #2: 11

Chain Rule Whiteboard Chain Rule of Calculus 12

Chain Rule
Given: $y = g(u)$ and $u = h(x)$.
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k$

Chain Rule
Given: $y = g(u)$ and $u = h(x)$.
Chain Rule: $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \frac{du_j}{dx_k}, \quad \forall i, k$
Backpropagation is just repeated application of the chain rule from Calculus 101.

Error Back-Propagation [sequence of figure-only slides: the computation of p(y | x^(i)) is built up from the parameters θ through intermediate quantities z, with the label y^(i)]. Slides from (Stoyanov & Eisner, 2012)

Backpropagation Whiteboard Example: Backpropagation for Chain Rule #1 Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? 25

Backpropagation: Automatic Differentiation - Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the computation graph).
2. Visit each node in topological order. For variable u_i with inputs v_1, ..., v_N:
   a. Compute u_i = g_i(v_1, ..., v_N)
   b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing du_i/dv_j is easy)
Return partial derivatives dy/du_i for all variables.
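A minimal Python sketch of this procedure; the Node class, the primitive operations, and the example function are invented here for illustration and are not from the slides:

```python
import math

class Node:
    """One variable u_i in the computation graph."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # result stored on the forward pass
        self.parents = parents          # input nodes v_1, ..., v_N
        self.local_grads = local_grads  # du_i/dv_j evaluated at the forward values
        self.grad = 0.0                 # accumulates dy/du_i on the backward pass

# Primitive operations build the graph as they compute values (forward pass).
def add(a, b):  return Node(a.value + b.value, (a, b), (1.0, 1.0))
def mul(a, b):  return Node(a.value * b.value, (a, b), (b.value, a.value))
def sin(a):     return Node(math.sin(a.value), (a,), (math.cos(a.value),))
def cos(a):     return Node(math.cos(a.value), (a,), (-math.sin(a.value),))

def backward(y):
    """Reverse-mode sweep: visit nodes in reverse topological order and
    increment dy/dv_j by (dy/du_i) * (du_i/dv_j)."""
    order, seen = [], set()
    def topo(u):
        if id(u) not in seen:
            seen.add(id(u))
            for v in u.parents:
                topo(v)
            order.append(u)
    topo(y)
    y.grad = 1.0                        # dy/dy = 1
    for u in reversed(order):
        for v, dudv in zip(u.parents, u.local_grads):
            v.grad += u.grad * dudv

# Example: J = cos(sin(x^2) + 3*x^2) at x = 2, matching the simple example below.
x = Node(2.0)
t = mul(x, x)
J = cos(add(sin(t), mul(Node(3.0), t)))
backward(J)
print(J.value, x.grad)
```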

Backpropagation Simple Example: The goal is to compute $J = \cos(\sin(x^2) + 3x^2)$ on the forward pass and the derivative $\frac{dJ}{dx}$ on the backward pass.
Forward:
$J = \cos(u)$, $u = u_1 + u_2$, $u_1 = \sin(t)$, $u_2 = 3t$, $t = x^2$

Backpropagation Simple Example: The goal is to compute $J = \cos(\sin(x^2) + 3x^2)$ on the forward pass and the derivative $\frac{dJ}{dx}$ on the backward pass.
Forward:
$J = \cos(u)$, $u = u_1 + u_2$, $u_1 = \sin(t)$, $u_2 = 3t$, $t = x^2$
Backward:
$\frac{dJ}{du} = -\sin(u)$
$\frac{dJ}{du_1} = \frac{dJ}{du}\frac{du}{du_1}$, where $\frac{du}{du_1} = 1$
$\frac{dJ}{du_2} = \frac{dJ}{du}\frac{du}{du_2}$, where $\frac{du}{du_2} = 1$
$\frac{dJ}{dt} = \frac{dJ}{du_1}\frac{du_1}{dt} + \frac{dJ}{du_2}\frac{du_2}{dt}$, where $\frac{du_1}{dt} = \cos(t)$ and $\frac{du_2}{dt} = 3$
$\frac{dJ}{dx} = \frac{dJ}{dt}\frac{dt}{dx}$, where $\frac{dt}{dx} = 2x$
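A hand-coded Python version of exactly these forward and backward steps; the finite-difference comparison at the end is an added sanity check, not from the slides:

```python
import math

def forward_backward(x):
    # Forward pass, in topological order
    t = x ** 2
    u1 = math.sin(t)
    u2 = 3 * t
    u = u1 + u2
    J = math.cos(u)
    # Backward pass, in reverse topological order
    dJ_du = -math.sin(u)
    dJ_du1 = dJ_du * 1.0                           # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                           # du/du2 = 1
    dJ_dt = dJ_du1 * math.cos(t) + dJ_du2 * 3.0    # du1/dt = cos(t), du2/dt = 3
    dJ_dx = dJ_dt * 2 * x                          # dt/dx = 2x
    return J, dJ_dx

J, dJ_dx = forward_backward(2.0)
# Sanity check with a centered finite difference
eps = 1e-6
numeric = (forward_backward(2.0 + eps)[0] - forward_backward(2.0 - eps)[0]) / (2 * eps)
print(J, dJ_dx, numeric)  # dJ_dx and numeric should agree to ~6 decimal places
```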

Backpropagation Case 1: Logistic Regression [diagram: output unit connected to the inputs by parameters $\theta_1, \theta_2, \theta_3, \ldots, \theta_M$]
Forward:
$J = y^* \log y + (1 - y^*) \log(1 - y)$
$y = \frac{1}{1 + \exp(-a)}$
$a = \sum_{j=0}^{D} \theta_j x_j$
Backward:
$\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$
$\frac{dJ}{da} = \frac{dJ}{dy}\frac{dy}{da}$, where $\frac{dy}{da} = \frac{\exp(-a)}{(\exp(-a) + 1)^2}$
$\frac{dJ}{d\theta_j} = \frac{dJ}{da}\frac{da}{d\theta_j}$, where $\frac{da}{d\theta_j} = x_j$
$\frac{dJ}{dx_j} = \frac{dJ}{da}\frac{da}{dx_j}$, where $\frac{da}{dx_j} = \theta_j$
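A minimal NumPy sketch of this forward/backward pair; the function name, shapes, and example values are illustrative, not from the slides:

```python
import numpy as np

def logistic_regression_grad(x, y_star, theta):
    """Forward and backward pass for one example.
    x: inputs of shape (D+1,) with x[0] = 1 as the bias feature; theta: (D+1,)."""
    # Forward (J is the log-likelihood as written on the slide)
    a = theta @ x                        # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + np.exp(-a))         # sigmoid
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    # Backward (chain rule, following the slide)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_da = dJ_dy * np.exp(-a) / (np.exp(-a) + 1) ** 2   # dy/da
    dJ_dtheta = dJ_da * x                # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta                # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

J, dJ_dtheta, dJ_dx = logistic_regression_grad(
    x=np.array([1.0, 0.5, -1.2]), y_star=1.0, theta=np.array([0.1, -0.3, 0.8]))
```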

Backpropagation [diagram: Input layer, Hidden layer, Output]
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \; \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \; \forall j$
(A) Input: Given $x_i, \; \forall i$

Backpropagation [diagram: Input layer, Hidden layer, Output]
(F) Loss: $J = \frac{1}{2}(y - y^*)^2$
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \; \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \; \forall j$
(A) Input: Given $x_i, \; \forall i$

Backpropagation Case 2: Neural Network
Forward:
$J = y^* \log y + (1 - y^*) \log(1 - y)$
$y = \frac{1}{1 + \exp(-b)}$
$b = \sum_{j=0}^{D} \beta_j z_j$
$z_j = \frac{1}{1 + \exp(-a_j)}$
$a_j = \sum_{i=0}^{M} \alpha_{ji} x_i$
Backward:
$\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$
$\frac{dJ}{db} = \frac{dJ}{dy}\frac{dy}{db}$, where $\frac{dy}{db} = \frac{\exp(-b)}{(\exp(-b) + 1)^2}$
$\frac{dJ}{d\beta_j} = \frac{dJ}{db}\frac{db}{d\beta_j}$, where $\frac{db}{d\beta_j} = z_j$
$\frac{dJ}{dz_j} = \frac{dJ}{db}\frac{db}{dz_j}$, where $\frac{db}{dz_j} = \beta_j$
$\frac{dJ}{da_j} = \frac{dJ}{dz_j}\frac{dz_j}{da_j}$, where $\frac{dz_j}{da_j} = \frac{\exp(-a_j)}{(\exp(-a_j) + 1)^2}$
$\frac{dJ}{d\alpha_{ji}} = \frac{dJ}{da_j}\frac{da_j}{d\alpha_{ji}}$, where $\frac{da_j}{d\alpha_{ji}} = x_i$
$\frac{dJ}{dx_i} = \sum_{j=0}^{D} \frac{dJ}{da_j}\frac{da_j}{dx_i}$, where $\frac{da_j}{dx_i} = \alpha_{ji}$
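A minimal NumPy sketch of these passes for the one-hidden-layer network above; the array shapes, the bias handling, and the use of the identity sigmoid'(v) = sigmoid(v)(1 - sigmoid(v)) in place of exp(-v)/(exp(-v)+1)^2 are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nn_forward_backward(x, y_star, alpha, beta):
    """x: (M+1,), alpha: (D+1, M+1), beta: (D+1,); index 0 plays the role of a bias."""
    # Forward pass
    a = alpha @ x                       # a_j = sum_i alpha_ji x_i
    z = sigmoid(a)                      # z_j = sigmoid(a_j)
    b = beta @ z                        # b = sum_j beta_j z_j
    y = sigmoid(b)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    # Backward pass (reverse order, chain rule)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * y * (1 - y)         # dy/db, via sigmoid'(b) = y(1 - y)
    dJ_dbeta = dJ_db * z                # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta                # db/dz_j = beta_j
    dJ_da = dJ_dz * z * (1 - z)         # dz_j/da_j = z_j(1 - z_j)
    dJ_dalpha = np.outer(dJ_da, x)      # da_j/dalpha_ji = x_i
    dJ_dx = alpha.T @ dJ_da             # dJ/dx_i = sum_j (dJ/da_j) alpha_ji
    return J, dJ_dalpha, dJ_dbeta, dJ_dx

rng = np.random.default_rng(0)
J, dalpha, dbeta, dx = nn_forward_backward(
    x=np.array([1.0, 0.5, -1.2]), y_star=1.0,
    alpha=rng.normal(size=(3, 3)), beta=rng.normal(size=3))
```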

Derivative of a Sigmoid 34
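The derivation on this slide is not in the transcript; the standard identity it refers to, written out, which matches the $\frac{\exp(-a)}{(\exp(-a)+1)^2}$ form used in the backward passes above:

```latex
\sigma(a) = \frac{1}{1 + e^{-a}}
\qquad
\frac{d\sigma(a)}{da}
  = \frac{e^{-a}}{(e^{-a} + 1)^{2}}
  = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}}
  = \sigma(a)\,\bigl(1 - \sigma(a)\bigr)
```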

Backpropagation Case 2: Neural Network [diagram: the network decomposed into modules: Loss, Sigmoid, Linear, Sigmoid, Linear]

Backpropagation Whiteboard SGD for Neural Network Example: Backpropagation for Neural Network 37

Backpropagation (Auto. Diff. - Reverse Mode)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the computation graph).
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, ..., v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing du_i/dv_j is easy)
Return partial derivatives dy/du_i for all variables.

Background: A Recipe for Gradients
1. Given training data. 2. Choose each of these: a decision function and a loss function. 3. Define the goal. 4. Train with SGD (take small steps opposite the gradient).
Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

Summary
1. Neural Networks
   - provide a way of learning features
   - are highly nonlinear prediction functions
   - (can be) a highly parallel network of logistic regression classifiers
   - discover useful hidden representations of the input
2. Backpropagation
   - provides an efficient way to compute gradients
   - is a special case of reverse-mode automatic differentiation

Backprop Objectives. You should be able to:
- Construct a computation graph for a function as specified by an algorithm
- Carry out backpropagation on an arbitrary computation graph
- Construct a computation graph for a neural network, identifying all the given and intermediate quantities that are relevant
- Instantiate the backpropagation algorithm for a neural network
- Instantiate an optimization method (e.g. SGD) and a regularizer (e.g. L2) when the parameters of a model are comprised of several matrices corresponding to different layers of a neural network (see the sketch below)
- Apply the empirical risk minimization framework to learn a neural network
- Use the finite difference method to evaluate the gradient of a function
- Identify when the gradient of a function can be computed at all and when it can be computed efficiently
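As an illustration of the optimization/regularization objective above, a minimal sketch of one SGD step with L2 regularization over per-layer parameter matrices; the parameter names and the gradient dictionary are placeholders, not from the lecture:

```python
import numpy as np

def sgd_step_with_l2(params, grads, lr=0.1, lam=1e-4):
    """One SGD update for a model whose parameters are several matrices
    (e.g. alpha for the hidden layer, beta for the output layer).
    The gradient of the L2 penalty (lam/2)*||theta||^2 is lam*theta."""
    return {name: theta - lr * (grads[name] + lam * theta)
            for name, theta in params.items()}

# Hypothetical usage with the one-hidden-layer network's parameters:
params = {"alpha": np.zeros((3, 3)), "beta": np.zeros(3)}
grads = {"alpha": np.ones((3, 3)), "beta": np.ones(3)}   # would come from backprop
params = sgd_step_with_l2(params, grads)
```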