
Backpropagation

To keep things simple, let us just work with one pattern. In that case the objective function is defined to be

    E = \frac{1}{2} \| x^K - d \|^2     (1)

where K is an index denoting the last layer in the network and d is the desired output. The equation for updating the states is given by

    x^{k+1} = g(W^k x^k)     (2)

where W^k is a matrix used to store the weights between layer k and layer k+1,

    W^k = \begin{pmatrix} w^k_{11} & w^k_{12} & \cdots \\ w^k_{21} & \cdots & \\ \vdots & & w^k_{nn} \end{pmatrix} = \begin{pmatrix} w^k_1 \\ w^k_2 \\ \vdots \\ w^k_n \end{pmatrix}

and the special understanding we shall have is that the function g applied to a vector is just that function applied to its elements; that is,

    g(Wx) = \begin{pmatrix} g(w_1 \cdot x) \\ g(w_2 \cdot x) \\ \vdots \end{pmatrix}

This is just a version of the optimal control problem where Equation 1 is the optimand and Equation 2 is the dynamics. Here, instead of the control u, the matrices W^k, k = 0, \ldots, K-1, represent the control variables. One difference, of course, is the dimensionality: in networks the dimension of the state space may range into the thousands! But the important thing to realize is that the mathematical treatment is the same.
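The forward dynamics of Equation 2 are easy to sketch in code. The following is a minimal illustration, not part of the original notes, using made-up 2x2 weight matrices for a small 2-2-2 network and the logistic function for g:

```python
import math

def g(u):
    """Logistic squashing function, applied elementwise to a vector."""
    return [1.0 / (1.0 + math.exp(-ui)) for ui in u]

def matvec(W, x):
    """Compute W x, with W stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def forward(weights, x0):
    """Propagate x^{k+1} = g(W^k x^k) through every layer (Equation 2)."""
    states = [x0]
    for W in weights:
        states.append(g(matvec(W, states[-1])))
    return states

# made-up weights for a 2-2-2 network and a single input pattern
weights = [[[0.5, -0.3], [0.8, 0.2]],
           [[0.1, 0.4], [-0.6, 0.7]]]
states = forward(weights, [1.0, 0.0])
```

Each entry of `states` is one layer's state vector x^k; the last entry is x^K, the output compared against d in Equation 1.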

Thus this problem can be tackled using the Euler-Lagrange formulation. The Hamiltonian is defined to be

    H = \frac{1}{2} \| x^K - d \|^2 + \sum_{k=0}^{K-1} (\lambda^{k+1})^T \left[ -x^{k+1} + g(W^k x^k) \right]

where the first term measures the cost of errors in reproducing the pattern and the second term constrains the system to follow its dynamic equations. Differentiating with respect to x^k provides the adjoint system of equations,

    \frac{\partial H}{\partial x^k} = 0 = -\lambda^k + W^{kT} \Lambda^{k+1} g'(W^k x^k)   for k = 0, \ldots, K-1     (3)

where

    \Lambda^{k+1} = \begin{pmatrix} \lambda^{k+1}_1 & 0 & \cdots & 0 \\ 0 & \lambda^{k+1}_2 & & \vdots \\ \vdots & & \ddots & \\ 0 & \cdots & & \lambda^{k+1}_n \end{pmatrix}

and with the final condition given by

    \frac{\partial H}{\partial x^K} = 0 = x^K - d - \lambda^K   for k = K

Unpacking the Notation

    \frac{\partial H}{\partial x^k} = 0 = -\lambda^k + W^{kT} \Lambda^{k+1} g'(W^k x^k)   for k = 0, \ldots, K-1     (4)

To understand Equation 4, consider the case where the number of units in each layer is just two. Then

    (\lambda^{k+1})^T g(W^k x^k) = \lambda^{k+1}_1 g(w_{11} x^k_1 + w_{12} x^k_2) + \lambda^{k+1}_2 g(w_{21} x^k_1 + w_{22} x^k_2)

Taking the partial derivative with respect to x^k_1,

    \lambda^{k+1}_1 g'(w_{11} x^k_1 + w_{12} x^k_2) w_{11} + \lambda^{k+1}_2 g'(w_{21} x^k_1 + w_{22} x^k_2) w_{21}

Similarly, the partial derivative with respect to x^k_2 is

    \lambda^{k+1}_1 g'(w_{11} x^k_1 + w_{12} x^k_2) w_{12} + \lambda^{k+1}_2 g'(w_{21} x^k_1 + w_{22} x^k_2) w_{22}

Reassembling these terms into a vector results in

    W^T \begin{pmatrix} \lambda^{k+1}_1 g'(w^k_1 \cdot x^k) \\ \lambda^{k+1}_2 g'(w^k_2 \cdot x^k) \end{pmatrix} = W^{kT} \Lambda^{k+1} g'(W^k x^k)
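The reassembly above can be checked numerically. The sketch below, with made-up weights, states, and multipliers, compares the vector form W^{kT} \Lambda^{k+1} g'(W^k x^k) against central-difference partials of (\lambda^{k+1})^T g(W^k x^k) in the two-unit case:

```python
import math

def g(u):
    return 1.0 / (1.0 + math.exp(-u))

def gp(u):
    return g(u) * (1.0 - g(u))

# made-up two-unit example: weights and multipliers (illustrative values only)
w11, w12, w21, w22 = 0.5, -0.3, 0.8, 0.2
l1, l2 = 0.7, -0.2   # lambda^{k+1}

def f(x1, x2):
    # (lambda^{k+1})^T g(W^k x^k) for the two-unit case
    return l1 * g(w11 * x1 + w12 * x2) + l2 * g(w21 * x1 + w22 * x2)

x1, x2 = 0.9, 0.4

# reassembled vector form: W^{kT} Lambda^{k+1} g'(W^k x^k)
v1 = l1 * gp(w11 * x1 + w12 * x2)
v2 = l2 * gp(w21 * x1 + w22 * x2)
grad = [w11 * v1 + w21 * v2, w12 * v1 + w22 * v2]

# central-difference partials to compare against
h = 1e-6
fd = [(f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h),
      (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)]
```

The two ways of computing the partials agree to within the finite-difference error.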

Generating the Solution

To generate the solution, assume an initial W matrix. Then solve the dynamic equations going forward through the levels in the network. This solution provides a value for x^K, which allows the adjoint system to be solved backward from k = K to k = 0:

    \lambda^K = x^K - d   for k = K
    \lambda^k = W^{kT} \Lambda^{k+1} g'(W^k x^k)   for k = 0, \ldots, K-1

where g' denotes differentiation with respect to its argument. Now improve the estimate for W using gradient descent. To do so, calculate the adjustment for the weights as \partial H / \partial w^k_{ij}:

    \frac{\partial H}{\partial w^k_{ij}} = \lambda^{k+1}_i x^k_j g'(w^k_i \cdot x^k)     (5)

Equation 5 is very compact and so is worth unpacking in order to understand it. The multiplier \lambda^{k+1}_i can be understood by realizing that at level K it just keeps track of the output error. This is its role for the internal units also; it appropriately keeps track of how an internal weight change affects the output error. The derivative g' is also simple, especially when using

    g(u) = \frac{1}{1 + e^{-u}}

since it can be easily verified that

    g'(u) = g(u)(1 - g(u))

Thus to calculate g' for a given unit, just determine the total input w^k_i \cdot x^k that the unit receives, and apply that as the argument to g'().
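The identity g'(u) = g(u)(1 - g(u)) is easy to confirm numerically; this small check, with arbitrarily chosen test points, compares it against a central difference:

```python
import math

def g(u):
    return 1.0 / (1.0 + math.exp(-u))

def gp(u):
    # the claimed identity: g'(u) = g(u)(1 - g(u))
    return g(u) * (1.0 - g(u))

# central-difference estimates of g' at a few arbitrary points
h = 1e-6
points = [-2.0, 0.0, 1.5]
finite_diffs = [(g(u + h) - g(u - h)) / (2 * h) for u in points]
```

At u = 0 the identity gives g'(0) = 0.5 * 0.5 = 0.25, its maximum value.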

Putting all this together results in the following algorithm.

Backpropagation Algorithm

Until the error is sufficiently small, do the following for each pattern:

1. Apply the pattern to the network and calculate the state variables using Equation 2.
2. Solve the adjoint system, Equation 3.
3. Adjust the weights W^k using

    \Delta w^k_{ij} = -\eta \frac{\partial H}{\partial w^k_{ij}}

where \partial H / \partial w^k_{ij} is given by Equation 5.

Notice the change in the formulation in the backpropagation equations. The original optimal control problem used time as the independent variable. Translating to the network formulation, that role is played by the layer index, with the hidden state variables of each layer standing in for the time-indexed states. Time is translated into space.

In developing this algorithm only one pattern was used, but the extension to multiple patterns is straightforward. The strictly correct thing to do would be to accumulate the weight changes for each pattern separately, add them up, and then adjust the weight vector accordingly. The problem with this approach is that the calculations for each pattern are laborious. To make faster progress one can approximate the strict method by adjusting the weights after each pattern is used. This technique is known as stochastic gradient descent. The understanding is that the sum of the successive corrections should approximate the single correction that would be computed from all the patterns simultaneously.
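The whole algorithm fits in a short sketch. The code below is an illustration, not the original notes' implementation: it uses a made-up 2-2-2 network and target pattern, and repeats the three steps above on a single pattern while recording the error E after each pass:

```python
import math

def g(u):
    return 1.0 / (1.0 + math.exp(-u))

def gp(u):
    s = g(u)
    return s * (1.0 - s)

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def train_one_pattern(weights, x0, d, eta=0.5, steps=500):
    """Repeat forward pass, adjoint pass, and weight update on one pattern."""
    errors = []
    for _ in range(steps):
        # Step 1: forward pass, x^{k+1} = g(W^k x^k)  (Equation 2)
        states, pres = [x0], []
        for W in weights:
            pres.append(matvec(W, states[-1]))
            states.append([g(u) for u in pres[-1]])
        errors.append(0.5 * sum((xi - di) ** 2
                                for xi, di in zip(states[-1], d)))
        # Step 2: adjoint pass, lambda^K = x^K - d, then Equation 3 backward
        lam = [xi - di for xi, di in zip(states[-1], d)]
        for k in range(len(weights) - 1, -1, -1):
            v = [li * gp(ui) for li, ui in zip(lam, pres[k])]
            new_lam = [sum(weights[k][i][j] * v[i] for i in range(len(v)))
                       for j in range(len(states[k]))]
            # Step 3: gradient descent using Equation 5,
            # dH/dw^k_ij = lambda^{k+1}_i x^k_j g'(w^k_i . x^k)
            for i in range(len(v)):
                for j in range(len(states[k])):
                    weights[k][i][j] -= eta * v[i] * states[k][j]
            lam = new_lam
    return errors

weights = [[[0.5, -0.3], [0.8, 0.2]],
           [[0.1, 0.4], [-0.6, 0.7]]]
errors = train_one_pattern(weights, [1.0, 0.0], [0.9, 0.1])
```

Because only one pattern is used, each update is an exact gradient step; with several patterns, updating after each one in turn gives the stochastic gradient descent approximation described above.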
