Machine Learning (CSE 446): Backpropagation

Machine Learning (CSE 446): Backpropagation Noah Smith © 2017 University of Washington nasmith@cs.washington.edu November 8, 2017 1 / 32

Neuron-Inspired Classifiers [Diagram: input $x_n$ is transformed by weights $W$ and bias $b$, passed through a tanh activation to produce hidden units, which are combined with weights $v$ to give the classifier output $f$ (prediction $\hat{y}$); the loss $L_n$ compares the prediction against the correct output $y_n$.] 2 / 32

Two-Layer Neural Network $$f(x) = \mathrm{sign}\left(\sum_{h=1}^{H} v_h \tanh(\mathbf{w}_h \cdot \mathbf{x} + b_h)\right) = \mathrm{sign}\left(\mathbf{v} \cdot \tanh(W\mathbf{x} + \mathbf{b})\right)$$ Two-layer networks allow decision boundaries that are nonlinear. It's fairly easy to show that XOR can be simulated (recall conjunction features from the practical issues lecture on 10/18). Theoretical result: any continuous function on a bounded region in $\mathbb{R}^d$ can be approximated arbitrarily well with a finite number of hidden units. The number of hidden units affects how complicated your decision boundary can be and how easily you will overfit. 3 / 32
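As an illustration, here is a minimal NumPy sketch of this two-layer forward pass. The shapes and names (H hidden units, weight matrix W, biases b, output weights v) follow the slide, but the code itself is a sketch of mine, not part of the lecture.

```python
import numpy as np

def two_layer_predict(x, W, b, v):
    """Two-layer network: f(x) = sign(v . tanh(W x + b)).

    x : (d,) input vector
    W : (H, d) first-layer weights
    b : (H,) first-layer biases
    v : (H,) output weights
    """
    hidden = np.tanh(W @ x + b)   # H hidden-unit activations
    return np.sign(v @ hidden)    # +1 / -1 prediction

# Example with H = 4 hidden units and d = 2 inputs (illustrative values).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 2))
b = rng.normal(scale=0.5, size=4)
v = rng.normal(scale=0.5, size=4)
print(two_layer_predict(np.array([1.0, -1.0]), W, b, v))
```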

Learning with a Two-Layer Network Parameters: $W \in \mathbb{R}^{H \times d}$, $\mathbf{b} \in \mathbb{R}^H$, and $\mathbf{v} \in \mathbb{R}^H$. If we choose a differentiable loss, then the whole function will be differentiable with respect to all parameters. Because of the squashing function, which is not convex, the overall learning problem is not convex. What does (stochastic) (sub)gradient descent do with non-convex functions? It finds a local minimum. To calculate gradients, we need to use the chain rule from calculus. Special name for (S)GD with chain rule invocations: backpropagation. 4 / 32
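To make the "SGD with chain-rule gradients" recipe concrete, here is a minimal sketch of the outer training loop. It assumes a grad_loss function (the per-example gradients computed by backpropagation, as developed on the following slides); the step size eta, the iteration count, and the initialization scale are illustrative choices, not values from the lecture.

```python
import numpy as np

def sgd_train(X, y, grad_loss, H=16, eta=0.1, num_iters=1000, seed=0):
    """Stochastic (sub)gradient descent for the two-layer network.

    grad_loss(x_n, y_n, W, b, v) is assumed to return (dW, db, dv),
    the gradients of the per-example loss with respect to the parameters.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Small random initialization (not all zeros; see the practical notes).
    W = rng.normal(scale=0.1, size=(H, d))
    b = rng.normal(scale=0.1, size=H)
    v = rng.normal(scale=0.1, size=H)
    for _ in range(num_iters):
        i = rng.integers(n)                       # pick one training example
        dW, db, dv = grad_loss(X[i], y[i], W, b, v)
        W -= eta * dW                             # gradient steps
        b -= eta * db
        v -= eta * dv
    return W, b, v
```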

Backpropagation For every node in the computation graph, we wish to calculate the first derivative of $L_n$ with respect to that node. For any node $a$, let $\bar{a} = \frac{\partial L_n}{\partial a}$. Base case: $\bar{L}_n = \frac{\partial L_n}{\partial L_n} = 1$. From here on, we overload notation and let $a$ and $b$ refer to nodes and to their values. After working forwards through the computation graph to obtain the loss $L_n$, we work backwards through the computation graph, using the chain rule to calculate $\bar{a}$ for every node $a$, making use of the work already done for nodes that depend on $a$: $$\bar{a} = \frac{\partial L_n}{\partial a} = \sum_{b:\, a \to b} \frac{\partial L_n}{\partial b} \cdot \frac{\partial b}{\partial a} = \sum_{b:\, a \to b} \bar{b} \cdot \frac{\partial b}{\partial a}, \qquad \frac{\partial b}{\partial a} = \begin{cases} 1 & \text{if } b = a + c \text{ for some } c \\ c & \text{if } b = a \cdot c \text{ for some } c \\ 1 - b^2 & \text{if } b = \tanh(a) \end{cases}$$ 5-8 / 32
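A tiny scalar example of these backward rules, applied by hand to a graph shaped like one hidden unit of the network (d = w·x, a = d + b, e = tanh(a), L = e·v). The numeric values are made up for illustration; this sketch is not from the lecture.

```python
import math

x, w, b, v = 1.5, 0.4, -0.2, 2.0

# Forward pass.
d = w * x
a = d + b
e = math.tanh(a)
L = e * v

# Backward pass: start with Lbar = 1, then apply the local rules node by node.
Lbar = 1.0
ebar = Lbar * v            # L = e * v   ->  dL/de = v
vbar = Lbar * e            # L = e * v   ->  dL/dv = e
abar = ebar * (1 - e * e)  # e = tanh(a) ->  de/da = 1 - e^2
dbar = abar * 1.0          # a = d + b   ->  da/dd = 1
bbar = abar * 1.0          # a = d + b   ->  da/db = 1
wbar = dbar * x            # d = w * x   ->  dd/dw = x
xbar = dbar * w            # d = w * x   ->  dd/dx = w
print(wbar, bbar, vbar)    # derivatives of L with respect to w, b, v
```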

Backpropagation with Vectors Pointwise ("Hadamard") product for vectors in $\mathbb{R}^n$: $\mathbf{a} \odot \mathbf{b} = (a[1]\,b[1],\; a[2]\,b[2],\; \ldots,\; a[n]\,b[n])^\top$. For a vector-valued node $\mathbf{a}$: $$\bar{\mathbf{a}} = \sum_{\mathbf{b}:\, \mathbf{a} \to \mathbf{b}} \sum_i \frac{\partial L_n}{\partial \mathbf{b}[i]} \cdot \frac{\partial \mathbf{b}[i]}{\partial \mathbf{a}} = \sum_{\mathbf{b}:\, \mathbf{a} \to \mathbf{b}} \begin{cases} \bar{\mathbf{b}} & \text{if } \mathbf{b} = \mathbf{a} + \mathbf{c} \text{ for some } \mathbf{c} \\ \bar{\mathbf{b}} \odot \mathbf{c} & \text{if } \mathbf{b} = \mathbf{a} \odot \mathbf{c} \text{ for some } \mathbf{c} \\ \bar{\mathbf{b}} \odot (1 - \mathbf{b} \odot \mathbf{b}) & \text{if } \mathbf{b} = \tanh(\mathbf{a}) \end{cases}$$ 9 / 32

Backpropagation, Illustrated [Diagram: computation graph with inputs $x_n$ and $y_n$, parameters $W$, $\mathbf{b}$, $\mathbf{v}$, and intermediate nodes $\mathbf{d} = W x_n$, $\mathbf{a} = \mathbf{b} + \mathbf{d}$, $\mathbf{e} = \tanh(\mathbf{a})$, $\mathbf{f} = \mathbf{v} \odot \mathbf{e}$, $g = \sum_h f[h]$, feeding the loss $L_n$. Intermediate nodes are de-anonymized, to make notation easier.] Working backwards through the graph: $\bar{L}_n = 1$. The form of $\bar{g}$ is loss-function specific (e.g., $-2(y_n - g)$ for squared loss $L_n = (y_n - g)^2$). Sum: $\bar{\mathbf{f}} = \bar{g}\,\mathbf{1}$. Product: $\bar{\mathbf{e}} = \bar{g}\,\mathbf{v}$ and $\bar{\mathbf{v}} = \bar{g}\,\mathbf{e}$. Hyperbolic tangent: $\bar{\mathbf{a}} = \bar{g}\,\mathbf{v} \odot (1 - \mathbf{e} \odot \mathbf{e})$. Sum: $\bar{\mathbf{b}} = \bar{\mathbf{a}}$ and $\bar{\mathbf{d}} = \bar{\mathbf{a}}$. Product: $\bar{W} = \bar{\mathbf{a}}\,x_n^\top$. 10-17 / 32
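Putting the illustrated steps together, here is a minimal NumPy sketch of the forward and backward pass for this particular graph, assuming squared loss. It mirrors the slide's steps but is my own sketch, not the lecture's code; the finite-difference check at the end is purely illustrative.

```python
import numpy as np

def forward_backward(x, y, W, b, v):
    """Forward and backward pass for the illustrated graph, assuming
    squared loss L = (y - g)^2. Returns the loss and the gradients
    with respect to W, b, and v."""
    # Forward pass.
    d = W @ x                 # d = W x_n
    a = b + d                 # a = b + d
    e = np.tanh(a)            # e = tanh(a)
    f = v * e                 # f = v (Hadamard) e
    g = np.sum(f)             # g = sum_h f[h]
    L = (y - g) ** 2          # squared loss

    # Backward pass, mirroring the slides.
    gbar = -2.0 * (y - g)          # loss-specific
    fbar = gbar * np.ones_like(f)  # sum
    ebar = fbar * v                # product
    vbar = fbar * e                # product
    abar = ebar * (1.0 - e * e)    # hyperbolic tangent
    bbar = abar                    # sum
    dbar = abar                    # sum
    Wbar = np.outer(dbar, x)       # product: d = W x_n
    return L, Wbar, bbar, vbar

# Quick check of one coordinate against a finite difference.
rng = np.random.default_rng(0)
H, dim = 3, 2
W = rng.normal(size=(H, dim)); b = rng.normal(size=H); v = rng.normal(size=H)
x, y = rng.normal(size=dim), 1.0
L, Wbar, bbar, vbar = forward_backward(x, y, W, b, v)
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
L2, *_ = forward_backward(x, y, W2, b, v)
print(Wbar[0, 0], (L2 - L) / eps)  # should approximately match
```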

Practical Notes
Don't initialize all parameters to zero; add some random noise (a small sketch follows below).
Random restarts: train K networks with different initializers, and you'll get K different classifiers of varying quality.
Hyperparameters? number of training iterations; learning rate for SGD; number of hidden units (H); number of layers (and number of hidden units in each layer); amount of randomness in initialization; regularization.
Interpretability?
18-28 / 32
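A minimal sketch of the first two notes (small random initialization and K random restarts). The initialization scale and the train/validate interface are placeholder assumptions of mine, not part of the lecture.

```python
import numpy as np

def init_params(H, d, scale=0.1, rng=None):
    """Small random (not all-zero) initialization for W, b, v."""
    rng = rng or np.random.default_rng()
    return (rng.normal(scale=scale, size=(H, d)),
            rng.normal(scale=scale, size=H),
            rng.normal(scale=scale, size=H))

def random_restarts(train, validate, K=5, H=16, d=10):
    """Train K networks from different random initializations and keep the
    one with the best validation error. `train` and `validate` are assumed
    callables standing in for the SGD loop and an error metric."""
    best, best_err = None, float("inf")
    for k in range(K):
        params = train(*init_params(H, d, rng=np.random.default_rng(k)))
        err = validate(params)
        if err < best_err:
            best, best_err = params, err
    return best
```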

Challenge of Deeper Networks Backpropagation aims to assign credit (or "blame") to each parameter. In a deep network, credit/blame is shared across all layers, so parameters at early layers tend to have very small gradients. One solution is to train a shallow network, then use it to initialize a deeper network, perhaps gradually increasing network depth (a rough sketch follows below). This is called layer-wise training. 29 / 32
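A rough sketch of one layer-wise growth step under assumed shapes: reuse a trained one-hidden-layer network's first layer in a deeper network and randomly initialize only the new parameters. The function name and scales are illustrative, not from the lecture.

```python
import numpy as np

def grow_network(W1, b1, v, H2=16, scale=0.1, rng=None):
    """Turn a trained one-hidden-layer network (W1, b1, v) into a
    two-hidden-layer network by reusing the trained first layer and
    randomly initializing the new second layer and output weights."""
    rng = rng or np.random.default_rng()
    H1 = W1.shape[0]
    W2 = rng.normal(scale=scale, size=(H2, H1))  # new hidden layer
    b2 = rng.normal(scale=scale, size=H2)
    v2 = rng.normal(scale=scale, size=H2)        # new output weights
    # The deeper network computes sign(v2 . tanh(W2 tanh(W1 x + b1) + b2));
    # all parameters are then fine-tuned together with backpropagation.
    return W1.copy(), b1.copy(), W2, b2, v2
```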

Radial Basis Function Networks [Diagram: each hidden unit $h$ computes $\mathrm{sqd}(x_n, w_h)$, scales it by $\gamma_h$, and applies exp; the activations are weighted by $v_h$ to give the classifier output $f$ (prediction $\hat{y}$), compared against the correct output $y_n$ by the loss $L_n$.] In the diagram, $\mathrm{sqd}(x, w) = \|x - w\|_2^2$. 30 / 32

Radial Basis Function Networks Generalizing to H hidden units: $$f(x) = \mathrm{sign}\left(\sum_{h=1}^{H} v[h]\,\exp\left(-\gamma_h\,\|x - w_h\|_2^2\right)\right)$$ Each hidden unit is like a little bump in data space. $w_h$ is the position of the bump, and $\gamma_h$ is inversely proportional to its width. 31 / 32
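A minimal NumPy sketch of this RBF classifier's forward pass, with the centers stacked as rows and one width parameter per hidden unit; the variable names and example values are mine, not from the lecture.

```python
import numpy as np

def rbf_predict(x, centers, gammas, v):
    """RBF network: f(x) = sign(sum_h v[h] * exp(-gamma_h * ||x - w_h||^2)).

    x       : (d,) input
    centers : (H, d) bump positions w_h
    gammas  : (H,) inverse-width parameters
    v       : (H,) output weights
    """
    sqd = np.sum((centers - x) ** 2, axis=1)  # squared distances ||x - w_h||^2
    hidden = np.exp(-gammas * sqd)            # one "bump" activation per unit
    return np.sign(v @ hidden)

# Example: two bumps in R^2, one positive and one negative.
centers = np.array([[1.0, 1.0], [-1.0, -1.0]])
gammas = np.array([2.0, 2.0])
v = np.array([+1.0, -1.0])
print(rbf_predict(np.array([0.9, 1.1]), centers, gammas, v))  # near the +1 bump
```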

A Gentle Reading on Backpropagation http://colah.github.io/posts/2015-08-backprop/ 32 / 32