How the backpropagation algorithm works. Srikumar Ramalingam, School of Computing, University of Utah

Reference: Most of the slides are taken from the second chapter of the online book by Michael Nielsen: neuralnetworksanddeeplearning.com

Introduction: First discovered in 1970. First influential paper in 1986: Rumelhart, Hinton and Williams, Learning representations by back-propagating errors, Nature, 1986.

Perceptron (Reminder)

Sigmoid neuron (Reminder): A sigmoid neuron can take real numbers ($x_1, x_2, x_3$) within 0 to 1 and returns a number within 0 to 1. The weights ($w_1, w_2, w_3$) and the bias term $b$ are real numbers. Sigmoid function: $\sigma(0) = 0.5$, $\sigma(-\infty) = 0$, $\sigma(+\infty) = 1$.
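To make the reminder concrete, here is a minimal NumPy sketch of what a single sigmoid neuron computes; the input, weight, and bias values are made-up examples, not taken from the slides:

import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example values for three inputs, three weights, and a bias.
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, -1.0, 2.0])
b = 0.3

# A sigmoid neuron outputs sigma(w . x + b), a number in (0, 1).
print(sigmoid(np.dot(w, x) + b))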

Matrix equations for neural networks: The indices $j$ and $k$ seem a little counter-intuitive! Notations are used in this manner to enable matrix multiplications.

Layer to layer relationship. Examples:
$a_j^l = \sigma(z_j^l)$, e.g. $a_1^3 = \sigma(z_1^3)$, $a_2^3 = \sigma(z_2^3)$
$z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$
$z_j^3 = \sum_{k=1}^{4} w_{jk}^3 a_k^2 + b_j^3$, $j \in \{1, 2\}$
$z_j^2 = \sum_{k=1}^{4} w_{jk}^2 a_k^1 + b_j^2$, $j \in \{1, \dots, 4\}$
$b_j^l$ is the bias term of the $j$th neuron in the $l$th layer, $a_j^l$ is the activation of the $j$th neuron in the $l$th layer, and $z_j^l$ is the weighted input to the $j$th neuron in the $l$th layer.
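A minimal NumPy sketch of this layer-to-layer relationship, assuming a hypothetical layer with 4 neurons in layer l-1 and 2 neurons in layer l; the weights, biases, and previous activations are random placeholder values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical layer l with 4 neurons in layer l-1 and 2 neurons in layer l.
W = rng.standard_normal((2, 4))   # W[j, k] = w^l_{jk}, from neuron k in layer l-1 to neuron j in layer l
b = rng.standard_normal(2)        # b[j] = b^l_j
a_prev = rng.random(4)            # a^{l-1}

z = W @ a_prev + b                # z^l_j = sum_k w^l_{jk} a^{l-1}_k + b^l_j
a = sigmoid(z)                    # a^l_j = sigma(z^l_j)
print(z, a)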

Cost function from the network: $C = \frac{1}{2n}\sum_x \lVert y(x) - a^L(x) \rVert^2$, where $y(x)$ is the ground truth for each input, $a^L(x)$ is the output activation vector for a specific training sample $x$, $n$ is the number of input samples, and $x$ is the input vector for each input sample.

Backpropagation and stochastic gradient descent: The goal of the backpropagation algorithm is to compute the gradients $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to each and every weight and bias parameter. Note that backpropagation is only used to compute the gradients; stochastic gradient descent is the training algorithm.

Assumptions on the cost function: 1. We assume that the cost function can be written as the average over the cost functions from individual training samples: $C = \frac{1}{n}\sum_x C_x$. The cost function for an individual training sample is given by $C_x = \frac{1}{2}\lVert y(x) - a^L(x) \rVert^2$. Why do we need this assumption? Backpropagation will only allow us to compute the gradients with respect to a single training sample, namely $\partial C_x / \partial w$ and $\partial C_x / \partial b$. We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging the gradients from the different training samples.
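A small sketch of assumption 1, with made-up output activations and labels for three samples, showing the total cost as the average of the per-sample quadratic costs:

import numpy as np

def quadratic_cost(a_L, y):
    """Per-sample cost C_x = 0.5 * ||y - a^L||^2."""
    return 0.5 * np.sum((y - a_L) ** 2)

# Hypothetical network outputs and ground-truth labels for n = 3 samples.
outputs = [np.array([0.8, 0.1]), np.array([0.3, 0.6]), np.array([0.9, 0.9])]
labels  = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]

# C = (1/n) * sum_x C_x; gradients computed per sample are averaged the same way.
C = np.mean([quadratic_cost(a, y) for a, y in zip(outputs, labels)])
print(C)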

Assumptions on the cost function (continued): 2. We assume that the cost function can be written as a function of the output from the neural network. We assume that the input $x$ and its associated correct labeling $y(x)$ are fixed and treated as constants.

Hadamard product: Let $s$ and $t$ be two vectors. The Hadamard product $s \odot t$ is the elementwise product $(s \odot t)_j = s_j t_j$. E.g., $(1, 2, 3)^T \odot (2, 2, 2)^T = (2, 4, 6)^T$. Such elementwise multiplication is also referred to as the Schur product.
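The same elementwise product in NumPy, using the example vectors above:

import numpy as np

s = np.array([1, 2, 3])
t = np.array([2, 2, 2])

# The Hadamard (Schur) product is plain elementwise multiplication.
print(s * t)   # [2 4 6]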

Backpropagation: Our goal is to compute the partial derivatives $\partial C / \partial w_{jk}^l$ and $\partial C / \partial b_j^l$. We compute some intermediate quantities while doing so: $\delta_j^l = \partial C / \partial z_j^l$.

Four equations of BP (backpropagation). Summary: the equations of backpropagation ($L$ is the total number of layers):
1) $\delta^L = \nabla_a C \odot \sigma'(z^L)$ (BP1)
2) $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ (BP2)
3) $\partial C / \partial b_j^l = \delta_j^l$ (BP3)
4) $\partial C / \partial w_{jk}^l = a_k^{l-1} \delta_j^l$ (BP4)
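For later implementation it is convenient to note (a restatement of BP3 and BP4, not an extra equation from the slides) that per layer they read, in vector/matrix form,
$$\frac{\partial C}{\partial b^l} = \delta^l, \qquad \frac{\partial C}{\partial w^l} = \delta^l \, (a^{l-1})^T.$$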

Chain rule in differentiation: In order to differentiate a function $z = f(g(x))$ w.r.t. $x$, we can do the following: let $y = g(x)$ and $z = f(y)$; then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$.
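A tiny numeric sanity check of the chain rule, using the made-up composition f(y) = y^2 and g(x) = sin(x):

import math

# z = f(g(x)) with f(y) = y**2 and g(x) = sin(x), so dz/dx = 2*sin(x)*cos(x).
x = 0.7
analytic = 2.0 * math.sin(x) * math.cos(x)

# Central finite-difference approximation of dz/dx for comparison.
eps = 1e-6
numeric = (math.sin(x + eps) ** 2 - math.sin(x - eps) ** 2) / (2 * eps)
print(analytic, numeric)   # the two numbers should agree to several decimals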

Chain rule in differentiation (computation graph): $\frac{\partial z}{\partial x} = \sum_{j:\, x \in \mathrm{Parent}(y_j),\ y_j \in \mathrm{Ancestor}(z)} \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x}$. (The slide shows a computation graph in which $x$ feeds $y_1, y_2, y_3$, which all feed $z$.)

Chain rule in differentiation (vector case): Let $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g$ map from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ map from $\mathbb{R}^n$ to $\mathbb{R}$. If $y = g(x)$ and $z = f(y)$, then $\frac{\partial z}{\partial x_i} = \sum_k \frac{\partial z}{\partial y_k}\frac{\partial y_k}{\partial x_i}$.

BP1. To show: $\delta^L = \nabla_a C \odot \sigma'(z^L)$ (variable association for applying the vector chain rule: $z^L \to a^L \to C$). Here $L$ is the last layer.
$\delta^L = \frac{\partial C}{\partial z^L}$, $\nabla_a C = \left(\frac{\partial C}{\partial a_1^L}, \frac{\partial C}{\partial a_2^L}, \dots, \frac{\partial C}{\partial a_n^L}\right)^T$, $\sigma'(z^L) = \left(\sigma'(z_1^L), \sigma'(z_2^L), \dots, \sigma'(z_n^L)\right)^T$.
Proof: $\delta_j^L = \frac{\partial C}{\partial z_j^L} = \sum_k \frac{\partial C}{\partial a_k^L}\frac{\partial a_k^L}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L}$, because the other terms $\frac{\partial a_k^L}{\partial z_j^L}$ with $k \ne j$ vanish ($a_k^L$ does not involve $z_j^L$). Hence $\delta_j^L = \frac{\partial C}{\partial a_j^L}\sigma'(z_j^L)$, and thus $\delta^L = \nabla_a C \odot \sigma'(z^L)$.

Example for BP1 (last layer $L$ with three output neurons):
$\delta_1^L = \frac{\partial C}{\partial z_1^L} = \sum_{k=1}^{3}\frac{\partial C}{\partial a_k^L}\frac{\partial a_k^L}{\partial z_1^L} = \frac{\partial C}{\partial a_1^L}\frac{\partial a_1^L}{\partial z_1^L}$ (the terms for all other values of $k$ are 0),
$\delta_2^L = \sum_{k=1}^{3}\frac{\partial C}{\partial a_k^L}\frac{\partial a_k^L}{\partial z_2^L} = \frac{\partial C}{\partial a_2^L}\frac{\partial a_2^L}{\partial z_2^L}$,
$\delta_3^L = \sum_{k=1}^{3}\frac{\partial C}{\partial a_k^L}\frac{\partial a_k^L}{\partial z_3^L} = \frac{\partial C}{\partial a_3^L}\frac{\partial a_3^L}{\partial z_3^L}$,
so collectively $\delta^L = \nabla_a C \odot \sigma'(z^L)$.

Derivative of the sigmoid activation function: $\sigma(z) = \frac{1}{1 + e^{-z}}$,
$\sigma'(z) = \frac{d\sigma(z)}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{(1 + e^{-z}) - 1}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right) = \sigma(z)\,(1 - \sigma(z))$.
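The same derivative in code, with a quick finite-difference check (the test point and epsilon are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), as derived above.
    s = sigmoid(z)
    return s * (1.0 - s)

z, eps = 0.5, 1e-6
finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_prime(z), finite_diff)   # the two values should agree closely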

Derivatives of the quadratic objective function:
$C = \frac{1}{2}\lVert y - a^L \rVert^2 = \frac{1}{2}\left[(y_1 - a_1^L)^2 + (y_2 - a_2^L)^2 + \dots + (y_n - a_n^L)^2\right]$,
$\frac{\partial C}{\partial a_j^L} = -(y_j - a_j^L)$, so $\nabla_a C = -\left((y_1 - a_1^L), (y_2 - a_2^L), \dots, (y_n - a_n^L)\right)^T = a^L - y$.
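The corresponding gradient of the quadratic cost with respect to the output activations, in code; the example vectors are made up:

import numpy as np

def cost_derivative(a_L, y):
    """grad_a C for C = 0.5 * ||y - a^L||^2, i.e. dC/da_j^L = a_j^L - y_j."""
    return a_L - y

a_L = np.array([0.8, 0.2, 0.6])
y = np.array([1.0, 0.0, 1.0])
print(cost_derivative(a_L, y))   # [-0.2  0.2 -0.4]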

BP2. To show: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ (variable association for applying the vector chain rule: $z^l \to z^{l+1} \to C$).
Proof: $\delta_j^l = \frac{\partial C}{\partial z_j^l} = \sum_k \frac{\partial C}{\partial z_k^{l+1}}\frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \frac{\partial z_k^{l+1}}{\partial z_j^l}\,\delta_k^{l+1}$.
$z_k^{l+1} = \sum_j w_{kj}^{l+1} a_j^l + b_k^{l+1} = \sum_j w_{kj}^{l+1}\sigma(z_j^l) + b_k^{l+1}$.
By differentiating we have $\frac{\partial z_k^{l+1}}{\partial z_j^l} = w_{kj}^{l+1}\,\sigma'(z_j^l)$, so $\delta_j^l = \sum_k w_{kj}^{l+1}\,\delta_k^{l+1}\,\sigma'(z_j^l)$.

Example for BP2 (variable association for applying the vector chain rule: $z^2 \to z^3 \to C$, with four neurons in layers 2 and 3):
$\delta_1^2 = \frac{\partial C}{\partial z_1^2} = \sum_{k=1}^{4}\frac{\partial C}{\partial z_k^3}\frac{\partial z_k^3}{\partial z_1^2} = \sum_{k=1}^{4}\delta_k^3\,\frac{\partial z_k^3}{\partial z_1^2}$.
Since $z_k^3 = \sum_{j=1}^{4} w_{kj}^3\,\sigma(z_j^2) + b_k^3$, we have $\frac{\partial z_k^3}{\partial z_1^2} = w_{k1}^3\,\sigma'(z_1^2)$.
Therefore $\delta_1^2 = \sum_{k=1}^{4}\delta_k^3\, w_{k1}^3\,\sigma'(z_1^2) = (\delta_1^3 w_{11}^3 + \delta_2^3 w_{21}^3 + \delta_3^3 w_{31}^3 + \delta_4^3 w_{41}^3)\,\sigma'(z_1^2)$, and similarly
$\delta_2^2 = \sum_{k=1}^{4}\delta_k^3\, w_{k2}^3\,\sigma'(z_2^2) = (\delta_1^3 w_{12}^3 + \delta_2^3 w_{22}^3 + \delta_3^3 w_{32}^3 + \delta_4^3 w_{42}^3)\,\sigma'(z_2^2)$,
which is exactly $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.

BP3. To show: $\frac{\partial C}{\partial b_j^l} = \delta_j^l$ (variable association for applying the vector chain rule: $b^l \to z^l \to C$).
Proof: $\frac{\partial C}{\partial b_j^l} = \sum_k \frac{\partial C}{\partial z_k^l}\frac{\partial z_k^l}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l\,\frac{\partial z_j^l}{\partial b_j^l}$; the other terms $\frac{\partial z_k^l}{\partial b_j^l}$ vanish when $j \ne k$.
$\frac{\partial z_j^l}{\partial b_j^l} = \frac{\partial}{\partial b_j^l}\left(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\right) = 1$, so $\frac{\partial C}{\partial b_j^l} = \delta_j^l$.

Example for BP3: with $z^3 = w^3 a^2 + b^3$ (variable association for applying the vector chain rule: $b^3 \to z^3 \to C$),
$\frac{\partial C}{\partial b_1^3} = \sum_k \frac{\partial C}{\partial z_k^3}\frac{\partial z_k^3}{\partial b_1^3} = \frac{\partial C}{\partial z_1^3}\frac{\partial z_1^3}{\partial b_1^3} = \delta_1^3\,\frac{\partial}{\partial b_1^3}\left(\sum_k w_{1k}^3 a_k^2 + b_1^3\right) = \delta_1^3$.

BP4. To show: $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\,\delta_j^l$ (variable association for applying the vector chain rule: $w^l \to z^l \to C$).
Proof: $\frac{\partial C}{\partial w_{jk}^l} = \sum_m \frac{\partial C}{\partial z_m^l}\frac{\partial z_m^l}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{jk}^l}$, and the other terms $\frac{\partial z_m^l}{\partial w_{jk}^l}$ vanish when $m \ne j$.
$\frac{\partial C}{\partial w_{jk}^l} = \delta_j^l\,\frac{\partial}{\partial w_{jk}^l}\left(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\right) = \delta_j^l\, a_k^{l-1}$.

Example for BP4 (variable association for applying the vector chain rule: $w^3 \to z^3 \to C$): to show $\frac{\partial C}{\partial w_{12}^3} = a_2^2\,\delta_1^3$.
$\frac{\partial C}{\partial w_{12}^3} = \sum_m \frac{\partial C}{\partial z_m^3}\frac{\partial z_m^3}{\partial w_{12}^3} = \frac{\partial C}{\partial z_1^3}\frac{\partial z_1^3}{\partial w_{12}^3} = \delta_1^3\,\frac{\partial}{\partial w_{12}^3}\left(\sum_k w_{1k}^3 a_k^2 + b_1^3\right) = \delta_1^3\, a_2^2$.

The backpropagation algorithm: compute the output error $\delta^L = \nabla_a C \odot \sigma'(z^L)$, then propagate it backwards through the layers with BP2 and read off the gradients with BP3 and BP4. The word backpropagation comes from the fact that we compute the error vectors $\delta_j^l$ in the backward direction.
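Putting the four equations together, a minimal sketch of one forward plus backward pass for a fully connected sigmoid network. It follows the structure of the backprop routine in Nielsen's book but processes a single training sample; the 4-4-2 layer sizes, the random initialization, and the sample (x, y) are illustrative only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return (nabla_b, nabla_w) for one training sample (x, y),
    using the quadratic cost and the four BP equations."""
    # Forward pass: cache every z^l and a^l (activations[0] is the input x).
    activation, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # Backward pass: BP1 at the output layer, then BP2 backwards; BP3/BP4 give the gradients.
    nabla_b = [np.zeros_like(b) for b in biases]
    nabla_w = [np.zeros_like(W) for W in weights]
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])   # BP1
    nabla_b[-1] = delta                                     # BP3
    nabla_w[-1] = np.outer(delta, activations[-2])          # BP4
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])   # BP2
        nabla_b[-l] = delta
        nabla_w[-l] = np.outer(delta, activations[-l - 1])
    return nabla_b, nabla_w

# Illustrative 4-4-2 network with random parameters and a random sample.
rng = np.random.default_rng(0)
sizes = [4, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
x, y = rng.random(4), np.array([1.0, 0.0])
nabla_b, nabla_w = backprop(weights, biases, x, y)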

Stochastic gradient descent with BP
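Only the slide title survives here, so as a hedged sketch of what stochastic gradient descent with backpropagation typically looks like (mirroring the structure of the update step in Nielsen's book; backprop_fn stands for a routine such as the backprop sketch above, and eta is the learning rate):

import numpy as np

def update_mini_batch(weights, biases, mini_batch, eta, backprop_fn):
    """One SGD step: average the gradients returned by backprop_fn over the
    mini-batch and move every weight and bias against its gradient."""
    nabla_b = [np.zeros_like(b) for b in biases]
    nabla_w = [np.zeros_like(W) for W in weights]
    for x, y in mini_batch:
        delta_b, delta_w = backprop_fn(weights, biases, x, y)
        nabla_b = [nb + db for nb, db in zip(nabla_b, delta_b)]
        nabla_w = [nw + dw for nw, dw in zip(nabla_w, delta_w)]
    m = len(mini_batch)
    new_weights = [W - (eta / m) * nw for W, nw in zip(weights, nabla_w)]
    new_biases = [b - (eta / m) * nb for b, nb in zip(biases, nabla_b)]
    return new_weights, new_biases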

Gradients using finite differences: $\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}$, where $\epsilon$ is a small positive number and $e_j$ is the unit vector in the $j$th direction. This is conceptually very easy to implement, but to compute this derivative w.r.t. one parameter we need one forward pass, so for millions of variables we would have to do millions of forward passes. Backpropagation gets all the gradients in just one forward and backward pass (forward and backward passes are roughly equivalent in computation). The derivatives using finite differences would be a million times slower!
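A sketch of the finite-difference estimate for a single weight of a tiny made-up one-layer network; every perturbed parameter costs one extra forward pass, which is exactly why this scales so badly:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, x, y):
    """Quadratic cost of a tiny one-layer sigmoid network on one sample."""
    a = sigmoid(W @ x + b)
    return 0.5 * np.sum((y - a) ** 2)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((2, 3)), rng.standard_normal(2)
x, y = rng.random(3), np.array([1.0, 0.0])

# dC/dW[0, 1] ~ (C(w + eps * e_j) - C(w)) / eps: one extra forward pass per parameter.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
print((cost(W_pert, b, x, y) - cost(W, b, x, y)) / eps)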

Backpropagation, the big picture: We are computing the rate of change of $C$ w.r.t. a weight $w$. To compute the total change in $C$ we need to consider all possible paths from the weight to the cost. Every edge between two neurons in the network is associated with a rate factor, which is just the partial derivative of one neuron's activation with respect to the other neuron's activation. The rate factor for a path is just the product of the rate factors of the edges along the path. The total change is the sum of the rate factors of all the paths from the weight to the cost.

Thank You

Source: http://math.arizona.edu/~calc/rules.pdf
