Neural Networks: Backpropagation

Neural Networks: Backpropagation
Seung-Hoon Na
Department of Computer Science, Chonbuk National University
2018.10.25

Jacobian matrix

Two functions $f(x, y)$, $g(x, y)$ of two parameters $x, y$:
$$f(x, y) = 3x^2 y, \qquad g(x, y) = 5xy + y^3$$

Jacobian matrix (numerator layout):
$$J = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6xy & 3x^2 \\ 5y & 5x + 3y^2 \end{bmatrix}$$

Jacobian matrix (denominator layout):
$$J^{\mathsf{T}} = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix}$$
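To make the example concrete, here is a small NumPy sketch (my own; the helper names, the evaluation point p, and the finite-difference step eps are illustrative) comparing the analytic numerator-layout Jacobian above with a numerical approximation:

import numpy as np

def F(p):
    # Stacks f(x, y) = 3x^2 y and g(x, y) = 5xy + y^3 into one vector-valued function.
    x, y = p
    return np.array([3 * x**2 * y, 5 * x * y + y**3])

def analytic_jacobian(p):
    # Numerator-layout Jacobian from the slide.
    x, y = p
    return np.array([[6 * x * y, 3 * x**2],
                     [5 * y,     5 * x + 3 * y**2]])

def numerical_jacobian(F, p, eps=1e-6):
    # Central finite differences, one input coordinate at a time.
    J = np.zeros((2, 2))
    for j in range(2):
        d = np.zeros(2); d[j] = eps
        J[:, j] = (F(p + d) - F(p - d)) / (2 * eps)
    return J

p = np.array([1.5, -0.7])      # illustrative evaluation point, not from the slide
print(np.allclose(analytic_jacobian(p), numerical_jacobian(F, p), atol=1e-4))  # True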

Jacobian: Generalization

$\mathbf{y} = \mathbf{f}(\mathbf{x})$: a vector of $m$ scalar-valued functions that each take a vector $\mathbf{x}$:
$$y_1 = f_1(\mathbf{x}), \;\ldots,\; y_m = f_m(\mathbf{x})$$

Jacobian matrix: has $m$ rows, one for each of the $m$ equations:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{x}} f_1(\mathbf{x}) \\ \vdots \\ \frac{\partial}{\partial \mathbf{x}} f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial x_1} f_1(\mathbf{x}) & \cdots & \frac{\partial}{\partial x_n} f_1(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1} f_m(\mathbf{x}) & \cdots & \frac{\partial}{\partial x_n} f_m(\mathbf{x}) \end{bmatrix}$$

Jacobian: Generalization (figure slide; no text content in the transcription)

Vector chain rule

The Jacobian of a composition is the product of two other Jacobians:
$$\frac{\partial}{\partial \mathbf{x}} \mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}} \frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \cdots & \frac{\partial f_1}{\partial g_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial g_1} & \cdots & \frac{\partial f_m}{\partial g_k} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_k}{\partial x_1} & \cdots & \frac{\partial g_k}{\partial x_n} \end{bmatrix}$$

Vector chain rule: Example

$$\mathbf{y} = \mathbf{f}(x) = \begin{bmatrix} y_1(x) \\ y_2(x) \end{bmatrix} = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} = \begin{bmatrix} \ln(x^2) \\ \sin(3x) \end{bmatrix}$$

$\mathbf{y} = \mathbf{f}(\mathbf{g}(x))$: introduce two intermediate variables $g_1, g_2$:
$$\mathbf{g} = \begin{bmatrix} g_1(x) \\ g_2(x) \end{bmatrix} = \begin{bmatrix} x^2 \\ 3x \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} f_1(\mathbf{g}) \\ f_2(\mathbf{g}) \end{bmatrix} = \begin{bmatrix} \ln(g_1) \\ \sin(g_2) \end{bmatrix} \tag{1}$$
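A quick numerical check of this example with NumPy (my own sketch; the evaluation point x is illustrative): the product of the two Jacobians $\partial\mathbf{f}/\partial\mathbf{g}$ and $\partial\mathbf{g}/\partial x$ matches the derivative of $\mathbf{y} = [\ln(x^2), \sin(3x)]$ computed directly.

import numpy as np

def chain_rule_jacobian(x):
    # Intermediate variables from Eq. (1): g1 = x^2, g2 = 3x.
    g1, g2 = x**2, 3 * x
    df_dg = np.array([[1.0 / g1, 0.0],          # d ln(g1)/dg1, d ln(g1)/dg2
                      [0.0, np.cos(g2)]])        # d sin(g2)/dg1, d sin(g2)/dg2
    dg_dx = np.array([[2 * x],                   # d g1/dx
                      [3.0]])                    # d g2/dx
    return df_dg @ dg_dx                         # (2x1) Jacobian dy/dx

def direct_jacobian(x):
    # Differentiating y = [ln(x^2), sin(3x)] directly.
    return np.array([[2.0 / x], [3.0 * np.cos(3 * x)]])

x = 0.8                                          # illustrative evaluation point
print(np.allclose(chain_rule_jacobian(x), direct_jacobian(x)))  # True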

Jacobian: Generalization (figure slide; no text content in the transcription)

MLP with single hidden layer: Notation

For simplicity, the network has a single hidden layer only.
$o_k$: $k$-th output unit, $h_j$: $j$-th hidden unit, $x_i$: $i$-th input
$u_{kj}$: weight between the $j$-th hidden unit and the $k$-th output unit
$w_{ji}$: weight between the $i$-th input and the $j$-th hidden unit
Bias terms are also contained in the weights.

MLP with single hidden layer: Matrix notation

$$\mathbf{h} = \max(W\mathbf{x}, 0)$$
$$\mathbf{o} = \mathrm{softmax}(U\mathbf{h})$$
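A minimal NumPy sketch of this forward pass (the layer sizes, the random initialization, and the omission of explicit bias vectors are my own illustrative choices; the slide folds biases into the weights):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 3                     # illustrative sizes, not from the slides

W = rng.normal(scale=0.1, size=(n_hidden, n_in))    # input-to-hidden weights
U = rng.normal(scale=0.1, size=(n_out, n_hidden))   # hidden-to-output weights

def softmax(v):
    e = np.exp(v)
    return e / e.sum()

def forward(x):
    h = np.maximum(W @ x, 0)    # h = max(Wx, 0), ReLU hidden layer
    o = softmax(U @ h)          # o = softmax(Uh), class probabilities
    return h, o

x = rng.normal(size=n_in)       # illustrative input
h, o = forward(x)
print(o, o.sum())               # probabilities that sum to 1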

Typical Setting for Classification

$K$: the number of labels
Input layer: input values (raw features)
Output layer: scores of labels
Softmax layer:
$$\mathbf{o} = \mathrm{softmax}(\mathbf{v}), \qquad o_k = \frac{\exp(v_k)}{\sum_i \exp(v_i)} = \frac{\exp(v_k)}{Z}$$
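The formula above can be implemented directly; the sketch below additionally subtracts $\max(\mathbf{v})$ before exponentiating, a common numerical-stability trick that is not discussed on the slide but leaves $o_k = \exp(v_k)/Z$ unchanged:

import numpy as np

def softmax(v):
    # Subtracting max(v) leaves o_k = exp(v_k)/Z unchanged but avoids overflow in exp.
    e = np.exp(v - np.max(v))
    return e / e.sum()                      # divides by Z = sum_i exp(v_i - max(v))

v = np.array([1000.0, 1001.0, 999.0])       # illustrative scores; naive exp(v) would overflow
print(softmax(v))                           # well-defined probabilities summing to 1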

Learning as Optimization

Training data: $T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$
  $\mathbf{x}_i$: $i$-th input feature vector
  $y_i$ (or $\mathbf{y}_i$): $i$-th target label
Parameters: $\theta = \{W, U\}$
  Weight matrices: input-to-hidden and hidden-to-output
Objective function (loss function): take the negative log-likelihood (NLL) as the empirical risk
$$J(\theta) = \mathrm{Loss}(T, \theta) = -\sum_{(\mathbf{x}, y) \in T} \log P(y \mid \mathbf{x})$$
Training process: known as empirical risk minimization
$$\theta^* = \operatorname*{argmin}_{\theta} J(\theta)$$

Optimization by Gradient Method

Gradient descent:
$$\theta \leftarrow \theta - \eta \, \nabla_{\theta}\, \mathbb{E}_{(\mathbf{x}, y)}\!\left[-\log p(y \mid \mathbf{x})\right]$$

Batch algorithm
  Expectations over the training set are required.
  But computing the expectation exactly is very expensive, as it evaluates every example in the entire dataset.
Minibatch algorithm
  In practice, we estimate the expectation by randomly sampling a small number of examples from the dataset and averaging over only those examples.
  Using the exact gradient over a large number of examples does not significantly reduce the estimation error, while it slows convergence.

Stochastic Gradient Method

1. Randomly sample a minibatch of $m$ samples $\{(\mathbf{x}_i, y_i)\}$ from the training data.
2. Define the NLL for $\{(\mathbf{x}_i, y_i)\}$:
$$J(\theta) = -\frac{1}{m} \sum_{1 \le i \le m} \log P(y_i \mid \mathbf{x}_i)$$
3. Compute the derivatives $\frac{\partial J}{\partial W}$ for each $W \in \theta$.
4. Update each weight matrix $W \in \theta$:
$$W \leftarrow W - \eta \frac{\partial J}{\partial W}$$
Iterate the above procedure until the stopping criterion is satisfied.
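A minimal NumPy sketch of this loop. The model (plain multiclass logistic regression), the toy data, the learning rate eta = 0.1, the minibatch size m = 16, and the fixed step budget used as the stopping criterion are all illustrative assumptions, not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)
N, n_feat, K = 200, 4, 3                             # illustrative dataset sizes
X = rng.normal(size=(N, n_feat))                     # toy inputs
Y = rng.integers(0, K, size=N)                       # toy integer labels
W = np.zeros((K, n_feat))                            # parameters theta = {W} in this toy model

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def minibatch_grad(W, xb, yb):
    # Gradient of the average NLL over the minibatch: dJ/dW = (1/m) sum (o - y_onehot) x^T.
    grad = np.zeros_like(W)
    for x, y in zip(xb, yb):
        o = softmax(W @ x)
        o[y] -= 1.0                                  # o - y_onehot
        grad += np.outer(o, x)
    return grad / len(xb)

eta, m = 0.1, 16                                     # illustrative hyperparameters
for step in range(500):                              # fixed step budget as the stopping criterion
    idx = rng.choice(N, size=m, replace=False)       # 1. randomly sample a minibatch
    grad = minibatch_grad(W, X[idx], Y[idx])         # 2-3. NLL and its derivative w.r.t. W
    W -= eta * grad                                  # 4. update the weight matrix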

Logistic regression for binary classification

$(\mathbf{x}, y)$: a training example for binary classification, where $y \in \{0, 1\}$

Logistic regression function:
$$o(\mathbf{x}) = \sigma\!\left(\mathbf{w}^{\mathsf T}\mathbf{x} + b\right)$$
which is rewritten as:
$$z = \mathbf{w}^{\mathsf T}\mathbf{x} + b, \qquad o = \sigma(z)$$

$J$: the log-likelihood on $(\mathbf{x}, y)$
$$J = y \log(o) + (1 - y)\log(1 - o)$$
$$\frac{\partial J}{\partial o} = \frac{y}{o} - \frac{1 - y}{1 - o}$$

Logistic regression: Deriv J wrt w

$$\frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^{\mathsf T}$$
$$\frac{\partial o}{\partial z} = \sigma(z)(1 - \sigma(z)) = o(1 - o)$$
All together these lead to:
$$\frac{\partial J}{\partial \mathbf{w}} = \frac{\partial J}{\partial o}\frac{\partial o}{\partial z}\frac{\partial z}{\partial \mathbf{w}} = \left(\frac{y}{o} - \frac{1 - y}{1 - o}\right) o(1 - o)\, \mathbf{x}^{\mathsf T} = (y - o)\,\mathbf{x}^{\mathsf T}$$
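A small NumPy check (my own sketch; the weights, input, and label values are illustrative) that the closed form $(y - o)\,\mathbf{x}^{\mathsf T}$ agrees with a finite-difference approximation of the log-likelihood; in code the row vector is simply a 1-D array:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, b, x, y):
    # J = y log(o) + (1 - y) log(1 - o) with o = sigma(w^T x + b).
    o = sigmoid(w @ x + b)
    return y * np.log(o) + (1 - y) * np.log(1 - o)

rng = np.random.default_rng(0)
w, b = rng.normal(size=3), 0.1          # illustrative parameters
x, y = rng.normal(size=3), 1            # illustrative training example

o = sigmoid(w @ x + b)
analytic = (y - o) * x                  # dJ/dw = (y - o) x from the slide

eps = 1e-6
numeric = np.array([(log_likelihood(w + eps * np.eye(3)[i], b, x, y)
                     - log_likelihood(w - eps * np.eye(3)[i], b, x, y)) / (2 * eps)
                    for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True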

Logistic regression for multi-class classification

$K$: the number of labels
$(\mathbf{x}, k)$: a training example, where $k \in \{1, \ldots, K\}$

Logistic regression function:
$$\mathbf{o}(\mathbf{x}) = \mathrm{softmax}(W\mathbf{x} + \mathbf{b})$$
which is rewritten as:
$$\mathbf{z} = W\mathbf{x} + \mathbf{b}, \qquad \mathbf{o} = \mathrm{softmax}(\mathbf{z})$$
where $\mathrm{softmax}(\mathbf{z}) = \exp(\mathbf{z}) / \sum_i \exp(z_i) = \exp(\mathbf{z})/Z$.

$J$: the log-likelihood on $(\mathbf{x}, k)$
$$J = \mathbf{y}^{\mathsf T} \log(\mathbf{o})$$
where $\mathbf{y}$ is the one-hot encoding of the target label: $\mathbf{y} = [0 \cdots 1 \cdots 0]^{\mathsf T}$ with $y_i = I(i = k)$, $k$ being the target label.

Logistic regression: Deriv J wrt W

$$\frac{\partial J}{\partial \mathbf{o}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix}$$
$$\frac{\partial \mathbf{o}}{\partial \mathbf{z}} = \left[\frac{\partial o_j}{\partial z_i}\right]_{ij} = \left[\frac{\exp(z_j)\left(I(i = j)\sum_k \exp(z_k) - \exp(z_i)\right)}{\left(\sum_k \exp(z_k)\right)^2}\right]_{ij} = \left[\frac{\exp(z_j)\left(I(i = j)\, Z - \exp(z_i)\right)}{Z^2}\right]_{ij} = \left[\, o_j I(i = j) - o_i o_j \,\right]_{ij}$$
Thus, we have:
$$\frac{\partial J}{\partial \mathbf{z}} = \frac{\partial J}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{z}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix} \left[\, o_j I(i = j) - o_i o_j \,\right]_{ij} = \begin{bmatrix} I(1 = k) - o_1 & \cdots & 1 - o_k & \cdots & I(K = k) - o_K \end{bmatrix}$$
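The closed form $[\,o_j I(i = j) - o_i o_j\,]_{ij}$ can be checked numerically; a minimal NumPy sketch of that check (my own; the score vector z is illustrative) follows:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # [d o_j / d z_i]_{ij} = o_j I(i = j) - o_i o_j, as derived on the slide.
    o = softmax(z)
    return np.diag(o) - np.outer(o, o)

def numerical_jacobian(z, eps=1e-6):
    K = len(z)
    J = np.zeros((K, K))
    for i in range(K):
        d = np.zeros(K); d[i] = eps
        J[i, :] = (softmax(z + d) - softmax(z - d)) / (2 * eps)   # row i holds d o_j / d z_i
    return J

z = np.array([0.2, -1.0, 0.5, 1.3])      # illustrative scores
print(np.allclose(softmax_jacobian(z), numerical_jacobian(z), atol=1e-6))  # True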

Logistic regression: Deriv J wrt W (Cont.)

Given $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, where $W_i$ is the $i$-th row vector of $W$:
$$\frac{\partial z_i}{\partial W_i} = \mathbf{x}^{\mathsf T}$$
$$\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial z_i}\frac{\partial z_i}{\partial W_i} = \left(I(i = k) - o_i\right)\mathbf{x}^{\mathsf T}$$
Finally, this leads to:
$$\frac{\partial J}{\partial W} = \begin{bmatrix} \frac{\partial J}{\partial W_1} \\ \vdots \\ \frac{\partial J}{\partial W_K} \end{bmatrix} = \begin{bmatrix} (I(1 = k) - o_1)\,\mathbf{x}^{\mathsf T} \\ \vdots \\ (I(K = k) - o_K)\,\mathbf{x}^{\mathsf T} \end{bmatrix} = \begin{bmatrix} I(1 = k) - o_1 \\ \vdots \\ I(K = k) - o_K \end{bmatrix} \mathbf{x}^{\mathsf T} = \left(\frac{\partial J}{\partial \mathbf{z}}\right)^{\mathsf T} \mathbf{x}^{\mathsf T}$$
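A NumPy sketch (my own; the sizes and random values are illustrative) checking that $\frac{\partial J}{\partial W} = \left(\frac{\partial J}{\partial \mathbf{z}}\right)^{\mathsf T}\mathbf{x}^{\mathsf T}$, i.e. the outer product of $(\mathbf{y} - \mathbf{o})$ and $\mathbf{x}$, matches a finite-difference gradient of $J = \mathbf{y}^{\mathsf T}\log(\mathbf{o})$:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_likelihood(W, b, x, k):
    # J = y^T log(o) = log o_k for the one-hot target y.
    return np.log(softmax(W @ x + b)[k])

rng = np.random.default_rng(1)
K, n = 3, 4                                  # illustrative sizes
W, b = rng.normal(size=(K, n)), rng.normal(size=K)
x, k = rng.normal(size=n), 2                 # illustrative example with target label k

o = softmax(W @ x + b)
y = np.eye(K)[k]                             # one-hot target
analytic = np.outer(y - o, x)                # dJ/dW = (dJ/dz)^T x^T with dJ/dz_i = I(i=k) - o_i

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(K):
    for j in range(n):
        d = np.zeros_like(W); d[i, j] = eps
        numeric[i, j] = (log_likelihood(W + d, b, x, k)
                         - log_likelihood(W - d, b, x, k)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))   # True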

MLP with single hidden layer: Log-likelihood

$$\mathbf{h} = \max(W\mathbf{x}, 0), \qquad \mathbf{o} = \mathrm{softmax}(U\mathbf{h})$$
$J$: the log-likelihood on a single example $(\mathbf{x}, \mathbf{y})$
$$J = \mathbf{y}^{\mathsf T} \log(\mathbf{o})$$
$\mathbf{y}$: one-hot encoding of the target label, $\mathbf{y} = [0 \cdots 1 \cdots 0]^{\mathsf T}$ with $y_i = I(i = k)$, where $k$ is the target label.

Derivative of J wrt Output layer

$$\mathbf{v} = U\mathbf{h}, \qquad \mathbf{o} = \mathrm{softmax}(\mathbf{v}) = \frac{\exp(\mathbf{v})}{\sum_i \exp(v_i)} = \frac{\exp(\mathbf{v})}{Z}, \qquad J = \mathbf{y}^{\mathsf T}\log(\mathbf{o})$$
$$\frac{\partial J}{\partial \mathbf{o}} = \begin{bmatrix} 0 & \cdots & \frac{1}{o_k} & \cdots & 0 \end{bmatrix}$$
$$\frac{\partial \mathbf{o}}{\partial \mathbf{v}} = \left[\frac{\partial o_j}{\partial v_i}\right]_{ij} = \left[\frac{\exp(v_j)\left(I(i = j)\sum_k \exp(v_k) - \exp(v_i)\right)}{\left(\sum_k \exp(v_k)\right)^2}\right]_{ij} = \left[\frac{\exp(v_j)\left(I(i = j)\, Z - \exp(v_i)\right)}{Z^2}\right]_{ij}$$

Derivative of J wrt Output layer (Cont.) v o o v [ 0 1 [ 0 [ ] [ ] exp(vj ) (I(i j) Z exp(v i )) 0 o k (Z) 2 ij ] [ ] Z exp(v k ) 0 exp(vj ) (I(i j) Z exp(v i )) (Z) 2 exp(v 1) Z 1 exp(v k) Z exp(v K ) Z [ ] o 1 1 o k o K ] ij Let δ (o) be the error signal for output layer: δ (o) : v [ o 1 1 o k o K ] Here, note that δ (o) is a row vector. Seung-Hoon Na (Chonbuk National University) Neural Networks: Backpropagation 2018.10.25 21 / 26

Error propagation to hidden layer

Hidden-to-output: $\mathbf{v} = U\mathbf{h}$
$$\frac{\partial \mathbf{v}}{\partial \mathbf{h}} = U$$
Let $\delta^{(h)}$ be the error signal for the hidden layer:
$$\delta^{(h)} := \frac{\partial J}{\partial \mathbf{h}} = \frac{\partial J}{\partial \mathbf{v}} \frac{\partial \mathbf{v}}{\partial \mathbf{h}} = \delta^{(o)} U$$

Deriv of J wrt hidden-output weight matrix U

Hidden-to-output: $\mathbf{v} = U\mathbf{h}$, i.e. $v_i = U_i \mathbf{h} = \sum_j u_{ij} h_j$, where $U_i$ is the $i$-th row vector of $U$.
$$\frac{\partial v_i}{\partial U_i} = \mathbf{h}^{\mathsf T}$$
$$\frac{\partial J}{\partial U_i} = \frac{\partial J}{\partial v_i}\frac{\partial v_i}{\partial U_i} = \delta^{(o)}_i \mathbf{h}^{\mathsf T}$$
$$\frac{\partial J}{\partial U} = \begin{bmatrix} \frac{\partial J}{\partial U_1} \\ \vdots \\ \frac{\partial J}{\partial U_K} \end{bmatrix} = \begin{bmatrix} \delta^{(o)}_1 \mathbf{h}^{\mathsf T} \\ \vdots \\ \delta^{(o)}_K \mathbf{h}^{\mathsf T} \end{bmatrix} = \delta^{(o)\mathsf T} \mathbf{h}^{\mathsf T}$$

Error prop to input layer

Input-to-hidden layer: $\mathbf{z} = W\mathbf{x}$, $\mathbf{h} = \max(\mathbf{z}, 0)$
$$\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \begin{bmatrix} I(z_1 > 0) & & 0 \\ & \ddots & \\ 0 & & I(z_m > 0) \end{bmatrix} = \mathrm{diag}\!\left(I(z_i > 0)\right), \qquad \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = W$$
Let $\delta^{(z)}$ be the error signal for the pre-activated hidden layer:
$$\delta^{(z)} := \frac{\partial J}{\partial \mathbf{z}} = \frac{\partial J}{\partial \mathbf{h}}\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \delta^{(h)}\,\mathrm{diag}\!\left(I(z_i > 0)\right)$$
Let $\delta^{(x)}$ be the error signal for the input layer:
$$\delta^{(x)} := \frac{\partial J}{\partial \mathbf{x}} = \frac{\partial J}{\partial \mathbf{h}}\frac{\partial \mathbf{h}}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \delta^{(h)}\,\mathrm{diag}\!\left(I(z_i > 0)\right) W$$

Deriv of J wrt input-hidden weight matrix W

Input-to-hidden layer: $\mathbf{z} = W\mathbf{x}$, i.e. $z_i = W_i \mathbf{x} = \sum_j w_{ij} x_j$, where $W_i$ is the $i$-th row vector of $W$.
$$\frac{\partial z_i}{\partial W_i} = \mathbf{x}^{\mathsf T}$$
$$\frac{\partial J}{\partial W_i} = \frac{\partial J}{\partial z_i}\frac{\partial z_i}{\partial W_i} = \delta^{(z)}_i \mathbf{x}^{\mathsf T}$$
$$\frac{\partial J}{\partial W} = \begin{bmatrix} \frac{\partial J}{\partial W_1} \\ \vdots \\ \frac{\partial J}{\partial W_m} \end{bmatrix} = \begin{bmatrix} \delta^{(z)}_1 \mathbf{x}^{\mathsf T} \\ \vdots \\ \delta^{(z)}_m \mathbf{x}^{\mathsf T} \end{bmatrix} = \delta^{(z)\mathsf T} \mathbf{x}^{\mathsf T}$$
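Putting the derived pieces together, here is a minimal NumPy sketch of the full forward and backward pass for the single-hidden-layer MLP, using the row-vector error signals $\delta^{(o)}$, $\delta^{(h)}$, $\delta^{(z)}$ from the previous slides; the layer sizes, random values, and the finite-difference check are my own illustrative additions:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def forward(W, U, x):
    z = W @ x                      # pre-activation of the hidden layer
    h = np.maximum(z, 0)           # h = max(z, 0)
    o = softmax(U @ h)             # o = softmax(Uh)
    return z, h, o

def backward(W, U, x, k):
    # Returns dJ/dU and dJ/dW for the log-likelihood J = log o_k.
    z, h, o = forward(W, U, x)
    delta_o = np.eye(len(o))[k] - o          # delta^(o) = [-o_1,...,1-o_k,...,-o_K] (row vector)
    dJ_dU = np.outer(delta_o, h)             # dJ/dU = delta^(o)T h^T
    delta_h = delta_o @ U                    # delta^(h) = delta^(o) U
    delta_z = delta_h * (z > 0)              # delta^(z) = delta^(h) diag(I(z_i > 0))
    dJ_dW = np.outer(delta_z, x)             # dJ/dW = delta^(z)T x^T
    return dJ_dU, dJ_dW

rng = np.random.default_rng(3)
n_in, n_hid, K = 4, 5, 3                     # illustrative sizes, not from the slides
W = rng.normal(size=(n_hid, n_in))
U = rng.normal(size=(K, n_hid))
x, k = rng.normal(size=n_in), 2              # illustrative input and target label

dJ_dU, dJ_dW = backward(W, U, x, k)

# Finite-difference check of dJ/dW.
eps, numeric = 1e-6, np.zeros_like(W)
for i in range(n_hid):
    for j in range(n_in):
        d = np.zeros_like(W); d[i, j] = eps
        Jp = np.log(forward(W + d, U, x)[2][k])
        Jm = np.log(forward(W - d, U, x)[2][k])
        numeric[i, j] = (Jp - Jm) / (2 * eps)
print(np.allclose(dJ_dW, numeric, atol=1e-5))   # True

Since $J$ here is the log-likelihood to be maximized, a training step would add $\eta$ times these gradients; equivalently, negating $J$ recovers the NLL used in the SGD update of the earlier slide.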

Discussion

Here, error signals such as $\delta^{(o)}$ and $\delta^{(h)}$ are row vectors.
Alternatively, we can define $\delta^{(o)}$ and $\delta^{(h)}$ as column vectors and derive backpropagation again. In that case, only a slight modification of the error-propagation formulas is necessary (e.g., $\delta^{(h)} = U^{\mathsf T}\delta^{(o)}$ instead of $\delta^{(o)} U$).