Backpropagation: The Good, the Bad and the Ugly


The Norwegian University of Science and Technology (NTNU), Trondheim, Norway. keithd@idi.ntnu.no. October 3, 2017.

Supervised Learning. Constant feedback from an instructor, indicating not only right/wrong, but also the correct answer for each training case. Many cases (i.e., input-output pairs) to be learned. Weights are modified by a complex procedure (back-propagation) based on output error. Feed-forward networks with back-propagation learning are the standard implementation; 99% of neural network applications use this. Typical usage: problems with (a) lots of input-output training data, and (b) the goal of learning a mapping (function) from inputs to outputs. Not biologically plausible, although the cerebellum appears to exhibit some aspects. But the result of backprop, an ANN trained to perform some function, can be very useful to neuroscientists as a sufficiency proof.

Backpropagation Overview. Training/test cases: {(d1, r1), (d2, r2), (d3, r3), ...}. (Diagram: a case such as d3 is encoded, fed through the ANN, and decoded to an output r*; the error E = r3 - r* is formed, and dE/dw is passed back to the weights.) Feed-Forward Phase: inputs are sent through the ANN to compute outputs. Feedback Phase: the error is passed back from the output to the input layers and used to update the weights along the way.

Training -vs- Testing Cases. Training: the neural net sees the training cases N times, with learning. Testing: the test cases are presented once, without learning. Generalization: correctly handling test cases (that the ANN has not been trained on). Over-Training: the weights become so fine-tuned to the training cases that generalization suffers, causing failure on many test cases.

Widrow-Hoff (a.k.a. Delta) Rule. (Diagram: a node N receives inputs X through weights w, sums them to S, and outputs Y.) T = target output value; Delta ($\delta$) = error, $\delta = T - Y$; Eta ($\eta$) = learning rate. Update: $\Delta w = \eta\,\delta\,X$. Goal: change w so as to reduce $\delta$. Intuition: if $\delta > 0$, then we want to decrease it, so we must increase Y. Thus we must increase the sum of weighted inputs to N, and we do that by increasing (decreasing) w if X is positive (negative). Similarly for $\delta < 0$. Assumes the derivative of N's transfer function is everywhere non-negative.
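As an illustration, here is a minimal NumPy sketch of the delta rule for a single linear node; the function and variable names, and the toy data, are mine rather than the slides':

```python
import numpy as np

def delta_rule_step(w, x, target, eta=0.1):
    """One Widrow-Hoff update for a single linear node.
    w, x: weight and input vectors; target: desired output T; eta: learning rate."""
    y = np.dot(w, x)            # node output Y (identity transfer function)
    delta = target - y          # error: delta = T - Y
    return w + eta * delta * x  # delta_w = eta * delta * X

# toy usage: drive a 3-input node toward target 1.0
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
for _ in range(20):
    w = delta_rule_step(w, x, target=1.0)
print(np.dot(w, x))   # approaches 1.0
```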

Gradient Descent. Goal: minimize total error across all output nodes. Method: modify weights throughout the network (i.e., at all levels) to follow the route of steepest descent on the error surface, Error (E) as a function of the Weight Vector (W):

$$\Delta w_{ij} = -\eta \frac{\partial E_i}{\partial w_{ij}}$$

Computing $\partial E_i/\partial w_{ij}$. Node $i$ receives inputs $x_{1d}, \ldots, x_{nd}$ through weights $w_{i1}, \ldots, w_{in}$, forms the summed input $sum_{id}$, applies transfer function $f_T$ to produce output $o_{id}$, and is compared to target $t_{id}$, yielding error $E_{id}$. Sum of Squared Errors (SSE):

$$E_i = \frac{1}{2}\sum_{d \in D}(t_{id} - o_{id})^2$$

$$\frac{\partial E_i}{\partial w_{ij}} = \frac{1}{2}\sum_{d \in D} 2\,(t_{id} - o_{id})\,\frac{\partial (t_{id} - o_{id})}{\partial w_{ij}} = -\sum_{d \in D}(t_{id} - o_{id})\,\frac{\partial o_{id}}{\partial w_{ij}}$$

Computing $\partial o_{id}/\partial w_{ij}$. Since the output is $o_{id} = f_T(sum_{id})$, where

$$sum_{id} = \sum_{k=1}^{n} w_{ik}\,x_{kd},$$

we have

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D}(t_{id} - o_{id})\,\frac{\partial f_T(sum_{id})}{\partial w_{ij}}$$

Using the Chain Rule, $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g(x)}\cdot\frac{\partial g(x)}{\partial x}$:

$$\frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}}\cdot\frac{\partial sum_{id}}{\partial w_{ij}}$$

Computing $\partial sum_{id}/\partial w_{ij}$ - Easy!!

$$\frac{\partial sum_{id}}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left(\sum_{k=1}^{n} w_{ik}x_{kd}\right) = \frac{\partial (w_{i1}x_{1d})}{\partial w_{ij}} + \cdots + \frac{\partial (w_{ij}x_{jd})}{\partial w_{ij}} + \cdots + \frac{\partial (w_{in}x_{nd})}{\partial w_{ij}} = 0 + \cdots + x_{jd} + \cdots + 0 = x_{jd}$$

Computing $\partial f_T(sum_{id})/\partial sum_{id}$ - Harder for some $f_T$.

$f_T$ = Identity function: $f_T(sum_{id}) = sum_{id}$, thus $\frac{\partial f_T(sum_{id})}{\partial sum_{id}} = 1$ and

$$\frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}}\cdot\frac{\partial sum_{id}}{\partial w_{ij}} = 1\cdot x_{jd} = x_{jd}$$

$f_T$ = Sigmoid: $f_T(sum_{id}) = \frac{1}{1+e^{-sum_{id}}}$, thus $\frac{\partial f_T(sum_{id})}{\partial sum_{id}} = o_{id}(1-o_{id})$ and

$$\frac{\partial f_T(sum_{id})}{\partial w_{ij}} = o_{id}(1-o_{id})\,x_{jd}$$

The only non-trivial calculation: $\partial f_T(sum_{id})/\partial sum_{id}$ for the sigmoid.

$$\frac{\partial f_T(sum_{id})}{\partial sum_{id}} = \frac{\partial}{\partial sum_{id}}\left((1+e^{-sum_{id}})^{-1}\right) = \frac{-1}{(1+e^{-sum_{id}})^2}\cdot\frac{\partial (1+e^{-sum_{id}})}{\partial sum_{id}} = \frac{(-1)(-1)\,e^{-sum_{id}}}{(1+e^{-sum_{id}})^2} = \frac{e^{-sum_{id}}}{(1+e^{-sum_{id}})^2}$$

But notice that:

$$\frac{e^{-sum_{id}}}{(1+e^{-sum_{id}})^2} = f_T(sum_{id})\,\bigl(1 - f_T(sum_{id})\bigr) = o_{id}(1-o_{id})$$

Putting it all together:

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D}(t_{id} - o_{id})\,\frac{\partial f_T(sum_{id})}{\partial w_{ij}} = -\sum_{d \in D}(t_{id} - o_{id})\,\frac{\partial f_T(sum_{id})}{\partial sum_{id}}\cdot\frac{\partial sum_{id}}{\partial w_{ij}}$$

So for $f_T$ = Identity:

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D}(t_{id} - o_{id})\,x_{jd}$$

and for $f_T$ = Sigmoid:

$$\frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D}(t_{id} - o_{id})\,o_{id}(1-o_{id})\,x_{jd}$$
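A minimal NumPy sketch of the sigmoid-case result, computing the gradient for one node over a whole dataset D; the names and toy data are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sse_gradient(w_i, X, t_i):
    """Gradient of E_i = 1/2 * sum_d (t_id - o_id)^2 w.r.t. the weights w_i
    of a single sigmoid node i.  X: (cases x inputs), t_i: (cases,)."""
    o = sigmoid(X @ w_i)                          # o_id for all cases d
    # dE_i/dw_ij = -sum_d (t_id - o_id) * o_id*(1 - o_id) * x_jd
    return -((t_i - o) * o * (1.0 - o)) @ X

# toy usage
w = np.array([0.2, -0.1])
X = np.array([[1.0, 0.5], [0.3, -0.7]])
t = np.array([1.0, 0.0])
print(sse_gradient(w, X, t))
```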

Weight Updates ($f_T$ = Sigmoid). Batch: update weights after each training epoch:

$$\Delta w_{ij} = -\eta\frac{\partial E_i}{\partial w_{ij}} = \eta\sum_{d \in D}(t_{id} - o_{id})\,o_{id}(1-o_{id})\,x_{jd}$$

The weight changes are actually computed after each training case, but $w_{ij}$ is not updated until the epoch's end. Incremental: update weights after each training case:

$$\Delta w_{ij} = -\eta\frac{\partial E_{id}}{\partial w_{ij}} = \eta\,(t_{id} - o_{id})\,o_{id}(1-o_{id})\,x_{jd}$$

A lower learning rate ($\eta$) is recommended here than for the batch method. Results can depend upon case-presentation order, so randomly shuffle the cases after each epoch.
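A sketch of the two update regimes for the same single sigmoid node (illustrative code, not from the slides); note the sign flip relative to the gradient, since $\Delta w = -\eta\,\partial E/\partial w$:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def batch_epoch(w, X, t, eta=0.5):
    """Batch mode: accumulate the weight changes over all cases d in D,
    apply them once at the end of the epoch."""
    o = sigmoid(X @ w)
    dw = eta * ((t - o) * o * (1.0 - o)) @ X   # summed over all cases
    return w + dw

def incremental_epoch(w, X, t, rng, eta=0.1):
    """Incremental mode: update after every case, shuffling the case order."""
    for d in rng.permutation(len(X)):
        o = sigmoid(X[d] @ w)
        w = w + eta * (t[d] - o) * o * (1.0 - o) * X[d]
    return w
```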

Backpropagation in Multi-Layered Neural Networks. (Diagram: node $j$, with summed input $sum_{jd}$ and output $o_{jd}$, feeds downstream nodes $1,\ldots,n$ through weights $w_{1j},\ldots,w_{nj}$, each of which contributes to the error $E_d$.) For each node $j$ and each training case $d$, backpropagation computes an error term

$$\delta_{jd} = -\frac{\partial E_d}{\partial sum_{jd}}$$

by calculating the influence of $sum_{jd}$ along each connection from node $j$ to the next downstream layer.

Computing $\delta_{jd}$. Along the path through downstream node 1, the contribution to $\frac{\partial E_d}{\partial sum_{jd}}$ is:

$$\frac{\partial o_{jd}}{\partial sum_{jd}}\cdot\frac{\partial sum_{1d}}{\partial o_{jd}}\cdot\frac{\partial E_d}{\partial sum_{1d}}$$

So, summing along all paths:

$$\frac{\partial E_d}{\partial sum_{jd}} = \frac{\partial o_{jd}}{\partial sum_{jd}}\sum_{k=1}^{n}\frac{\partial sum_{kd}}{\partial o_{jd}}\cdot\frac{\partial E_d}{\partial sum_{kd}}$$

Computing $\delta_{jd}$ (continued). Just as before, most terms are 0 in the derivative of the sum, so:

$$\frac{\partial sum_{kd}}{\partial o_{jd}} = w_{kj}$$

Assuming $f_T$ is a sigmoid:

$$\frac{\partial o_{jd}}{\partial sum_{jd}} = \frac{\partial f_T(sum_{jd})}{\partial sum_{jd}} = o_{jd}(1-o_{jd})$$

Thus:

$$\delta_{jd} = -\frac{\partial E_d}{\partial sum_{jd}} = -\,o_{jd}(1-o_{jd})\sum_{k=1}^{n}w_{kj}\,(-\delta_{kd}) = o_{jd}(1-o_{jd})\sum_{k=1}^{n}w_{kj}\,\delta_{kd}$$

Computing $\delta_{jd}$ (continued). Note that $\delta_{jd}$ is defined recursively in terms of the $\delta$ values in the next downstream layer:

$$\delta_{jd} = o_{jd}(1-o_{jd})\sum_{k=1}^{n}w_{kj}\,\delta_{kd}$$

So all $\delta$ values in the network can be computed by moving backwards, one layer at a time.
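A one-line NumPy sketch of this recursion for a hidden layer of sigmoid nodes; the argument names are mine:

```python
import numpy as np

def hidden_deltas(o_hidden, W_down, delta_down):
    """Back-propagate error terms one layer.
    o_hidden:   activations o_jd of the hidden layer (shape: n_hidden)
    W_down:     weights w_kj from hidden node j to downstream node k
                (shape: n_down x n_hidden)
    delta_down: delta_kd of the downstream layer (shape: n_down)
    Returns delta_jd = o_jd*(1 - o_jd) * sum_k w_kj * delta_kd."""
    return o_hidden * (1.0 - o_hidden) * (W_down.T @ delta_down)
```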

Computing $\partial E_d/\partial w_{ij}$ from $\delta_{jd}$ - Easy!! The only effect of $w_{ij}$ upon the error is via its effect upon $sum_{id}$, which is:

$$\frac{\partial sum_{id}}{\partial w_{ij}} = o_{jd}$$

So:

$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial sum_{id}}{\partial w_{ij}}\cdot\frac{\partial E_d}{\partial sum_{id}} = \frac{\partial sum_{id}}{\partial w_{ij}}\,(-\delta_{id}) = -\,o_{jd}\,\delta_{id}$$

Computing $\Delta w_{ij}$. Given an error term $\delta_{id}$ (for node $i$ on training case $d$), the update of $w_{ij}$ for all nodes $j$ that feed into $i$ is:

$$\Delta w_{ij} = -\eta\frac{\partial E_d}{\partial w_{ij}} = -\eta\,(-\,o_{jd}\,\delta_{id}) = \eta\,\delta_{id}\,o_{jd}$$

So given $\delta_{id}$, you can easily calculate $\Delta w_{ij}$ for all incoming arcs.
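And the corresponding weight updates, as a sketch (one outer product covers every arc j into i; names illustrative):

```python
import numpy as np

def weight_updates(delta_down, o_up, eta=0.1):
    """delta_w[i, j] = eta * delta_id * o_jd for every arc from upstream
    node j into downstream node i."""
    return eta * np.outer(delta_down, o_up)
```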

Learning XOR. (Three plots of Sum-Squared-Error vs. Epoch, 0 to 1000 epochs, one per run.) Epoch = all 4 entries of the XOR truth table. Network: 2 (inputs) x 2 (hidden) x 1 (output). Random initialization of all weights in [-1, 1]. XOR is not linearly separable, so it takes a while! Each run is different due to the random weight initialization.

Learning to Classify Wines. Each case pairs a wine class with 13 measured properties; a sample of the data:

Class | Properties
1 | 14.23, 1.71, 2.43, 15.6, 127, 2.8, ...
1 | 13.2, 1.78, 2.14, 11.2, 100, 2.65, ...
2 | 13.11, 1.01, 1.7, 15, 78, 2.98, ...
3 | 13.17, 2.59, 2.37, 20, 120, 1.65, ...

Network: 13 wine-property inputs, a hidden layer of 5 nodes, and 1 output node for the wine class.

Wine Runs. (Four plots of Sum-Squared-Error vs. Epoch, 0 to 100 epochs: 13x5x1 with lrate = 0.3, 13x5x1 with lrate = 0.1, 13x10x1 with lrate = 0.3, and 13x25x1 with lrate = 0.3.)

WARNING: Notation Change. When working with matrices, it is often easier to let $w_{ij}$ denote the weight on the connection FROM node $i$ TO node $j$. Then the normal syntax for matrix row-column indices implies that: the values in row $i$, written $w_{i?}$, denote the weights on the outputs of node $i$; the values in column $j$, written $w_{?j}$, denote the weights on the inputs to node $j$. For example, in

$$\begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix}$$

row 2 holds the weights from node 2, and column 3 holds the weights to node 3.

Backpropagation using Tensors. A 3-2-1 network: input layer $X = [x_1, x_2, x_3]$, hidden layer $Y = [y_1, y_2]$, output layer $Z = [z_1]$. Weights $V$ connect $X$ to $Y$ ($v_{ij}$ = weight from $x_i$ to $y_j$), and weights $W$ connect $Y$ to $Z$:

$$V = \begin{pmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \\ v_{31} & v_{32} \end{pmatrix}, \qquad W = \begin{pmatrix} w_{11} \\ w_{21} \end{pmatrix}$$

Forward pass: $R = XV$, $Y = g(R)$, $S = YW$, $Z = f(S)$, with $f = g =$ sigmoid.
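A forward-pass sketch of this 3-2-1 network in NumPy; the random weight values and the sample input are made up for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(42)
X = np.array([0.5, -0.1, 0.8])    # input layer, 3 nodes
V = rng.uniform(-1, 1, (3, 2))    # v_ij = weight from x_i to y_j
W = rng.uniform(-1, 1, (2, 1))    # w_ij = weight from y_i to z_j

R = X @ V         # R = XV
Y = sigmoid(R)    # Y = g(R)
S = Y @ W         # S = YW
Z = sigmoid(S)    # Z = f(S)
print(Y, Z)
```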

Goal: Compute $\frac{\partial Z}{\partial W}$. Using the Chain Rule of Tensor Calculus:

$$\frac{\partial Z}{\partial W} = \frac{\partial f(S)}{\partial W} = \frac{\partial f(S)}{\partial S}\cdot\frac{\partial S}{\partial W} = \frac{\partial f(S)}{\partial S}\cdot\frac{\partial (Y\cdot W)}{\partial W}$$

$f(S)$ is a sigmoid, with derivative $f(S)(1-f(S))$, so this simplifies to:

$$z_1(1-z_1)\,\frac{\partial (Y\cdot W)}{\partial W}$$

Using the product rule:

$$z_1(1-z_1)\,\frac{\partial (Y\cdot W)}{\partial W} = z_1(1-z_1)\left[\frac{\partial Y}{\partial W}\cdot W + Y\cdot\frac{\partial W}{\partial W}\right]$$

$Y$ is independent of $W$, so the first term vanishes. But $\frac{\partial W}{\partial W}$ is not simply 1; it is the identity matrix:

$$\frac{\partial W}{\partial W} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

Computing $\frac{\partial Z}{\partial W}$ (continued).

$$Y\cdot\frac{\partial W}{\partial W} = (y_1, y_2)\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = (y_1, y_2)$$

Putting it all together:

$$\frac{\partial Z}{\partial W} = z_1(1-z_1)\,Y\cdot\frac{\partial W}{\partial W} = \begin{pmatrix} z_1(1-z_1)\,y_1 \\ z_1(1-z_1)\,y_2 \end{pmatrix} = \begin{pmatrix} \frac{\partial Z}{\partial w_{11}} \\ \frac{\partial Z}{\partial w_{21}} \end{pmatrix}$$

For each case $c_k$ in a minibatch, the forward pass will produce values for $X$, $Y$, and $Z$. Those values are used to compute actual numeric gradients by filling in these symbolic gradients:

$$\left(\frac{\partial Z}{\partial W}\right)_{c_k} = \begin{pmatrix} z_1(1-z_1)\,y_1 \\ z_1(1-z_1)\,y_2 \end{pmatrix}_{c_k}$$
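A quick numeric check of this result under assumed values for Y and W (the numbers are illustrative only):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# dZ/dW for the 2 -> 1 output layer: dZ/dw_j1 = z1*(1 - z1) * y_j
Y = np.array([0.6, 0.3])              # assumed hidden activations
W = np.array([[0.4], [-0.7]])         # assumed weights from Y to Z
z1 = sigmoid(Y @ W)[0]
dZ_dW = z1 * (1.0 - z1) * Y.reshape(2, 1)
print(dZ_dW)                          # column vector [dZ/dw11, dZ/dw21]
```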

Tensor Calculations across Several Layers. Goal: Compute $\frac{\partial Z}{\partial V}$. Expanding using the Chain Rule:

$$\frac{\partial Z}{\partial V} = \frac{\partial f(S)}{\partial V} = \frac{\partial f(S)}{\partial S}\cdot\frac{\partial S}{\partial V} = z_1(1-z_1)\,\frac{\partial (Y\cdot W)}{\partial V}$$

Using the Product Rule:

$$\frac{\partial (Y\cdot W)}{\partial V} = \frac{\partial Y}{\partial V}\cdot W + Y\cdot\frac{\partial W}{\partial V}$$

$W$ is independent of $V$, so the second term vanishes; and $Y = g(R)$, so:

$$\frac{\partial (Y\cdot W)}{\partial V} = \frac{\partial (g(R))}{\partial V}\cdot W$$

Using the Chain Rule, and $R = X\cdot V$:

$$\frac{\partial (g(R))}{\partial V}\cdot W = \left[\frac{\partial g(R)}{\partial R}\cdot\frac{\partial R}{\partial V}\right]\cdot W$$

$g(R)$ is also a sigmoid, so:

$$\frac{\partial g(R)}{\partial R} = g(R)\,(1-g(R)) = Y(1-Y) = [\,y_1(1-y_1),\; y_2(1-y_2)\,]$$

Computing $\frac{\partial Z}{\partial V}$ (continued). Since $R = X\cdot V$, using the Product Rule:

$$\frac{\partial R}{\partial V} = \frac{\partial (X\cdot V)}{\partial V} = \frac{\partial X}{\partial V}\cdot V + X\cdot\frac{\partial V}{\partial V}$$

$X$ is independent of $V$, so the first term vanishes. $\frac{\partial V}{\partial V}$ is 4-dimensional: for each of the six weights $v_{ij}$ it contains a 3 x 2 matrix $\frac{\partial V}{\partial v_{ij}}$ with a single 1 (in position $(i,j)$) and zeros elsewhere.

Computing $\frac{\partial Z}{\partial V}$ (continued). Take the dot product of $X = [x_1, x_2, x_3]$ with each internal matrix of $\frac{\partial V}{\partial V}$: for the weight $v_{ij}$, this leaves a vector with $x_i$ in position $j$ and 0 elsewhere. Collected over all six weights, $X\cdot\frac{\partial V}{\partial V}$ therefore contains, for each $v_{ij}$, exactly one non-zero entry $x_i$, and that entry affects only $r_j$ (and hence only $y_j$).

Computing $\frac{\partial Z}{\partial V}$ (continued). Putting the pieces back together:

$$\frac{\partial g(R)}{\partial R}\cdot\frac{\partial R}{\partial V} = [\,y_1(1-y_1),\; y_2(1-y_2)\,]\cdot\frac{\partial R}{\partial V}$$

which, entry by entry, gives $\frac{\partial y_j}{\partial v_{ij}} = x_i\,y_j(1-y_j)$, and $\frac{\partial y_k}{\partial v_{ij}} = 0$ for $k \ne j$.

Computing Z V (g(r V W = [ g(r R R V ] W = x 1 y 1 (1 y 1 x 2 y 1 (1 y 1 x 3 y 1 (1 y 1 = 0 0 0 0 0 0 x 1 y 2 (1 y 2 x 2 y 2 (1 y 2 x 3 y 2 (1 y 2 x 1 y 1 (1 y 1 w 11 x 1 y 2 (1 y 2 w 21 x 2 y 1 (1 y 1 w 11 x 2 y 2 (1 y 2 w 21 x 3 y 1 (1 y 1 w 11 x 3 y 2 (1 y 2 w 21 ( w11 w 21 = (Y W V

Computing $\frac{\partial Z}{\partial V}$ ... and finally:

$$\frac{\partial Z}{\partial V} = \frac{\partial f(S)}{\partial V} = z_1(1-z_1)\,\frac{\partial (Y\cdot W)}{\partial V} = \begin{pmatrix} z_1(1-z_1)\,x_1\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_1\,y_2(1-y_2)\,w_{21} \\ z_1(1-z_1)\,x_2\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_2\,y_2(1-y_2)\,w_{21} \\ z_1(1-z_1)\,x_3\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_3\,y_2(1-y_2)\,w_{21} \end{pmatrix} = \begin{pmatrix} \frac{\partial Z}{\partial v_{11}} & \frac{\partial Z}{\partial v_{12}} \\ \frac{\partial Z}{\partial v_{21}} & \frac{\partial Z}{\partial v_{22}} \\ \frac{\partial Z}{\partial v_{31}} & \frac{\partial Z}{\partial v_{32}} \end{pmatrix}$$
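The whole derivation can be checked numerically. This sketch fills in the symbolic matrix for one case and compares it against a finite-difference estimate; the weights and input are arbitrary test values:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, V, W):
    Y = sigmoid(X @ V)
    Z = sigmoid(Y @ W)
    return Y, Z

rng = np.random.default_rng(0)
X = np.array([0.5, -0.1, 0.8])
V = rng.uniform(-1, 1, (3, 2))
W = rng.uniform(-1, 1, (2, 1))

Y, Z = forward(X, V, W)
z1 = Z[0]
# closed-form gradient from the derivation:
# dZ/dv_ij = z1*(1 - z1) * x_i * y_j*(1 - y_j) * w_j1
dZ_dV = z1 * (1 - z1) * np.outer(X, Y * (1 - Y) * W[:, 0])

# finite-difference check
num = np.zeros_like(V)
eps = 1e-6
for i in range(3):
    for j in range(2):
        Vp = V.copy()
        Vp[i, j] += eps
        num[i, j] = (forward(X, Vp, W)[1][0] - z1) / eps
print(np.allclose(dZ_dV, num, atol=1e-5))   # True
```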

Gradients and the Minibatch. For every case $c_k$ in a minibatch, the values of $X$, $Y$ and $Z$ are computed during the forward pass. The symbolic gradients in $\frac{\partial Z}{\partial V}$ are filled in using these values of $X$, $Y$ and $Z$ (along with $W$), producing one numeric 3 x 2 matrix per case:

$$\left(\frac{\partial Z}{\partial V}\right)_{c_k} = \begin{pmatrix} \frac{\partial Z}{\partial v_{11}} & \frac{\partial Z}{\partial v_{12}} \\ \frac{\partial Z}{\partial v_{21}} & \frac{\partial Z}{\partial v_{22}} \\ \frac{\partial Z}{\partial v_{31}} & \frac{\partial Z}{\partial v_{32}} \end{pmatrix}_{c_k} = \begin{pmatrix} z_1(1-z_1)\,x_1\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_1\,y_2(1-y_2)\,w_{21} \\ z_1(1-z_1)\,x_2\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_2\,y_2(1-y_2)\,w_{21} \\ z_1(1-z_1)\,x_3\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_3\,y_2(1-y_2)\,w_{21} \end{pmatrix}_{c_k}$$

Gradients and the Minibatch (continued). In most Deep Learning situations, the gradients will be based on a loss function, $L$, not simply the output of the final layer. But that is just one more level of derivative calculations. At the completion of a minibatch, the numeric gradient matrices are added together to yield the complete gradient, which is then used to update the weights. For any weight matrix $U$ in the network and minibatch $M$, update weight $u_{i,j}$ as follows ($\eta$ = learning rate):

$$\left(\frac{\partial L}{\partial u_{i,j}}\right)_M = \sum_{c_k \in M}\left(\frac{\partial L}{\partial u_{i,j}}\right)_{c_k}, \qquad \Delta u_{i,j} = -\eta\left(\frac{\partial L}{\partial u_{i,j}}\right)_M$$

Tensorflow and Theano do all of this for you. Every time you write Deep Learning code, be grateful!!
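A sketch of the minibatch accumulation and update step; the function and argument names are mine, not TensorFlow's or Theano's:

```python
import numpy as np

def minibatch_update(U, per_case_grads, eta=0.1):
    """Sum the per-case gradient matrices (dL/dU)_ck over the minibatch M,
    then take one gradient-descent step on the weight matrix U."""
    grad_M = np.sum(per_case_grads, axis=0)   # (dL/dU)_M
    return U - eta * grad_M                   # delta_u_ij = -eta * (dL/du_ij)_M
```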

The General Algorithm: Jacobian Product Chain. (Diagram: weights $W$ feed layer $M$, which lies upstream of the output layer $N$; layer $N$ and the targets feed the loss $L$.) Goal: compute $\frac{\partial L}{\partial W}$:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial N}\cdot\frac{\partial N}{\partial (N-1)}\cdots\frac{\partial (M+1)}{\partial M}\cdot\frac{\partial M}{\partial W}$$

Distant Gradients = Sums of Path Products. (Diagram: $x_2$ in layer $X$ influences $z_3$ in layer $Z$ along three paths, through $y_1$, $y_2$ and $y_3$ in layer $Y$.)

$$\frac{\partial z_3}{\partial x_2} = \frac{\partial z_3}{\partial y_1}\frac{\partial y_1}{\partial x_2} + \frac{\partial z_3}{\partial y_2}\frac{\partial y_2}{\partial x_2} + \frac{\partial z_3}{\partial y_3}\frac{\partial y_3}{\partial x_2}$$

More generally:

$$\frac{\partial z_i}{\partial x_j} = \sum_{y_k \in Y}\frac{\partial z_i}{\partial y_k}\frac{\partial y_k}{\partial x_j}$$

This results from the multiplication of two Jacobian matrices.

The Jacobian Matrix.

$$J^Z_Y = \begin{pmatrix} \frac{\partial z_1}{\partial y_1} & \frac{\partial z_1}{\partial y_2} & \cdots & \frac{\partial z_1}{\partial y_n} \\ \frac{\partial z_2}{\partial y_1} & \frac{\partial z_2}{\partial y_2} & \cdots & \frac{\partial z_2}{\partial y_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_m}{\partial y_1} & \frac{\partial z_m}{\partial y_2} & \cdots & \frac{\partial z_m}{\partial y_n} \end{pmatrix}$$

Multiplying Jacobian Matrices. Row $i$ of $J^Z_Y$ times column $j$ of $J^Y_X$ gives entry $(i,j)$ of the product:

$$\left(J^Z_Y \cdot J^Y_X\right)_{ij} = \frac{\partial z_i}{\partial y_1}\frac{\partial y_1}{\partial x_j} + \frac{\partial z_i}{\partial y_2}\frac{\partial y_2}{\partial x_j} + \frac{\partial z_i}{\partial y_3}\frac{\partial y_3}{\partial x_j} = \frac{\partial z_i}{\partial x_j}, \qquad \text{so } J^Z_Y \cdot J^Y_X = J^Z_X$$

We can do this repeatedly to form $J^n_m$, the Jacobian relating the activation levels of upstream layer $m$ with those of downstream layer $n$:

R = Identity Matrix
For q = n down to m+1 do:
    R = R * J^q_{q-1}
J^n_m = R
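A sketch of this accumulation loop for a stack of fully connected sigmoid layers, assuming per-layer Jacobians of the form derived earlier (all names are illustrative):

```python
import numpy as np

def layer_jacobian(a_down, W):
    """Jacobian of a sigmoid layer w.r.t. its upstream layer's activations:
    J[k, j] = d a_down[k] / d a_up[j] = a_down[k]*(1 - a_down[k]) * W[j, k],
    where W[j, k] is the weight from upstream node j to downstream node k."""
    return (a_down * (1.0 - a_down))[:, None] * W.T

def jacobian_n_to_m(acts, weights):
    """Accumulate J^n_m = J^n_{n-1} * ... * J^{m+1}_m by repeated multiplication.
    acts[q] is the activation vector of layer q (q = m .. n);
    weights[q-1] is the weight matrix from layer q-1 to layer q."""
    n = len(acts) - 1
    R = np.eye(len(acts[n]))                  # R = identity matrix
    for q in range(n, 0, -1):                 # q = n down to m+1
        R = R @ layer_jacobian(acts[q], weights[q - 1])
    return R                                  # J^n_m
```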

Last of the Jacobians. Once we have computed $J^n_m$, we need one final Jacobian: $J^m_w$, where $w$ are the weights feeding into layer $m$. Then we have the standard Jacobian $J^n_w$ needed for updating all weights in matrix $w$:

$$J^n_w = J^n_m \cdot J^m_w$$

For the weights $V$ from layer $X$ to layer $Y$ (from the earlier example), each entry of $J^Y_V$ is an inner vector of derivatives over the nodes of $Y$ (in the original slide these inner vectors are transposed to fit on the page):

$$J^Y_V = \frac{\partial Y}{\partial V} = \begin{pmatrix} \left(\frac{\partial y_1}{\partial v_{11}}, \frac{\partial y_2}{\partial v_{11}}\right) & \left(\frac{\partial y_1}{\partial v_{12}}, \frac{\partial y_2}{\partial v_{12}}\right) \\ \left(\frac{\partial y_1}{\partial v_{21}}, \frac{\partial y_2}{\partial v_{21}}\right) & \left(\frac{\partial y_1}{\partial v_{22}}, \frac{\partial y_2}{\partial v_{22}}\right) \\ \left(\frac{\partial y_1}{\partial v_{31}}, \frac{\partial y_2}{\partial v_{31}}\right) & \left(\frac{\partial y_1}{\partial v_{32}}, \frac{\partial y_2}{\partial v_{32}}\right) \end{pmatrix}$$

Since (a) the activation function for $Y$ is the sigmoid, and (b) $\frac{\partial y_k}{\partial v_{ij}} = 0$ when $j \ne k$:

$$J^Y_V = \begin{pmatrix} \left(y_1(1-y_1)\,x_1,\; 0\right) & \left(0,\; y_2(1-y_2)\,x_1\right) \\ \left(y_1(1-y_1)\,x_2,\; 0\right) & \left(0,\; y_2(1-y_2)\,x_2\right) \\ \left(y_1(1-y_1)\,x_3,\; 0\right) & \left(0,\; y_2(1-y_2)\,x_3\right) \end{pmatrix}$$

Last of the Jacobian Multiplications. From the previous example (the network with layer sizes 3-2-1), once we have $J^Z_Y$ and $J^Y_V$, we can multiply (taking dot products of $J^Z_Y$ with the inner vectors of $J^Y_V$) to produce $J^Z_V$: the matrix of gradients that backprop uses to modify the weights in $V$.

$$J^Z_Y = \left(\frac{\partial z_1}{\partial y_1}, \frac{\partial z_1}{\partial y_2}\right)$$

Since (a) the activation function for $Z$ is the sigmoid, and (b) $sum(z_k) = \sum_i y_i\,w_{ik}$, where $sum(z_k)$ is the sum of weighted inputs to node $z_k$:

$$J^Z_Y = \left(z_1(1-z_1)\,w_{11},\; z_1(1-z_1)\,w_{21}\right)$$

$$J^Z_V = J^Z_Y \cdot J^Y_V = \begin{pmatrix} z_1(1-z_1)\,x_1\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_1\,y_2(1-y_2)\,w_{21} \\ z_1(1-z_1)\,x_2\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_2\,y_2(1-y_2)\,w_{21} \\ z_1(1-z_1)\,x_3\,y_1(1-y_1)\,w_{11} & z_1(1-z_1)\,x_3\,y_2(1-y_2)\,w_{21} \end{pmatrix}$$

First of the Jacobians. This iterative process is very general, and permits the calculation of $J^N_M$ ($M < N$) for any layers $M$ and $N$, or $J^N_W$ for any weights $W$ upstream of $N$. However, the standard situation in backpropagation is to compute $J^L_{W_i}$, where $L$ is the objective (loss) function and the $W_i$ are the weight matrices. This follows the same procedure as sketched above, but the series of dot products begins with $J^L_N$: the Jacobian of derivatives of the loss function w.r.t. the activations of the output layer. For example, assume $L$ = Mean Squared Error (MSE), $Z$ is an output layer of 3 sigmoid nodes, and $t_i$ are the target values for those 3 nodes on a particular case $c$:

$$L(c) = \frac{1}{3}\sum_{i=1}^{3}(z_i - t_i)^2$$

Taking partial derivatives of $L(c)$ w.r.t. the $z_i$ yields:

$$J^L_Z = \frac{\partial L}{\partial Z} = \left(\frac{2}{3}(z_1 - t_1),\; \frac{2}{3}(z_2 - t_2),\; \frac{2}{3}(z_3 - t_3)\right)$$
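The loss Jacobian for this example, as a small sketch generalized to n output nodes (function name is mine):

```python
import numpy as np

def mse_loss_jacobian(z, t):
    """J^L_Z for L(c) = (1/n) * sum_i (z_i - t_i)^2, i.e. dL/dz_i = (2/n)*(z_i - t_i)."""
    return (2.0 / len(z)) * (z - t)

# toy usage with 3 output nodes and their targets
print(mse_loss_jacobian(np.array([0.8, 0.2, 0.5]),
                        np.array([1.0, 0.0, 0.0])))
```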

First of the Jacobian Multiplications. Continuing the example: assume that layer $Y$ (of size 2) feeds into layer $Z$ via weights $W$. Then:

$$J^Z_Y = \begin{pmatrix} z_1(1-z_1)\,w_{11} & z_1(1-z_1)\,w_{21} \\ z_2(1-z_2)\,w_{12} & z_2(1-z_2)\,w_{22} \\ z_3(1-z_3)\,w_{13} & z_3(1-z_3)\,w_{23} \end{pmatrix}$$

Our first Jacobian multiplication yields $J^L_Y$, a 1 x 2 row vector (shown transposed to fit on the page):

$$J^L_Z \cdot J^Z_Y = J^L_Y = \begin{pmatrix} \frac{2}{3}(z_1 - t_1)\,z_1(1-z_1)\,w_{11} + \frac{2}{3}(z_2 - t_2)\,z_2(1-z_2)\,w_{12} + \frac{2}{3}(z_3 - t_3)\,z_3(1-z_3)\,w_{13} \\ \frac{2}{3}(z_1 - t_1)\,z_1(1-z_1)\,w_{21} + \frac{2}{3}(z_2 - t_2)\,z_2(1-z_2)\,w_{22} + \frac{2}{3}(z_3 - t_3)\,z_3(1-z_3)\,w_{23} \end{pmatrix}^T$$

Backpropagation with Tensors: The Big Picture. General Algorithm. Assume Layer $M$ is upstream of Layer $N$ (the output layer), so $M < N$. Assume $V$ is the tensor of weights feeding into Layer $M$, and $L$ is the loss function. Goal: compute $J^L_V = \frac{\partial L}{\partial V}$.

R = J^L_N   (the partial derivatives of the loss function w.r.t. the output layer)
For q = N down to M+1 do:
    R = R * J^q_{q-1}
J^L_V = R * J^M_V

Use $J^L_V$ to update the weights in $V$.
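Putting the pipeline together, here is a compact sketch of one training step driven by this Jacobian chain, for a stack of sigmoid layers with an MSE loss; a toy illustration under those assumptions, not the lecture's reference code:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(X, weights, t, eta=0.5):
    """One training case: forward pass, then walk the Jacobian chain from the
    MSE loss back through every layer, taking a gradient step on each weight
    matrix.  weights[q] connects layer q to layer q+1; all layers are sigmoid."""
    acts = [X]
    for W in weights:                          # forward pass, storing activations
        acts.append(sigmoid(acts[-1] @ W))
    z = acts[-1]
    R = (2.0 / len(z)) * (z - t)               # J^L_N (MSE loss Jacobian)
    new_weights = list(weights)
    for q in range(len(weights), 0, -1):       # q = N down to 1
        a_down, a_up = acts[q], acts[q - 1]
        local = R * a_down * (1.0 - a_down)    # dL/d(summed input of layer q)
        grad = np.outer(a_up, local)           # dL/dW for the weights into layer q
        R = local @ weights[q - 1].T           # push the Jacobian one layer upstream
        new_weights[q - 1] = weights[q - 1] - eta * grad
    return z, new_weights

# usage on the 3-2-1 example network
rng = np.random.default_rng(1)
Ws = [rng.uniform(-1, 1, (3, 2)), rng.uniform(-1, 1, (2, 1))]
X, t = np.array([0.5, -0.1, 0.8]), np.array([1.0])
for _ in range(200):
    z, Ws = backprop_step(X, Ws, t)
print(z)   # output moves toward the target
```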

Practical Tips.
1. Only add as many hidden layers and hidden nodes as necessary. Too many means more weights to learn and an increased chance of over-specialization.
2. Scale all input values to the same range, typically [0, 1] or [-1, 1].
3. Use target values of 0.1 (for zero) and 0.9 (for one) to avoid the saturation effects of sigmoids (see the sketch after this list).
4. Beware of tricky encodings of input (and decodings of output) values. Don't combine too much information into a single node's activation value (even though it's fun to try), since this can make proper weights difficult (or impossible) to learn.
5. For discrete (e.g. nominal) values, one (input or output) node per value is often most effective. E.g. car model and city of residence -vs- income and education for assessing car-insurance risk.
6. All initial weights should be relatively small: [-0.1, 0.1].
7. Bias nodes can be helpful for complicated data sets.
8. Check that all your layer sizes, activation functions, activation ranges, weight ranges, learning rates, etc. make sense in terms of each other and your goals for the ANN. One improper choice can ruin the results.
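A minimal sketch of tips 2 and 3 (function names are mine):

```python
import numpy as np

def scale_inputs(X):
    """Tip 2: min-max scale every input column to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def soften_targets(t):
    """Tip 3: map binary targets {0, 1} to {0.1, 0.9} to avoid sigmoid saturation."""
    return np.where(t > 0.5, 0.9, 0.1)
```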

Supervised Learning in the Cerebellum. (Diagram of the cerebellar circuit: mossy fibers carrying sensory and cortical inputs drive granular cells, with Golgi cell feedback; the granular cells' parallel fibers synapse onto Purkinje cells; climbing fibers from the inferior olive, which receives somatosensory (touch, pain, body position) and cortical inputs plus an efference copy, also reach the Purkinje cells; the Purkinje cells inhibit deep cerebellar neurons projecting to the spinal cord and cerebral cortex.) Granular cells detect contexts. Parallel fibers and Purkinje cells map contexts to actions. Climbing fibers from the inferior olive provide (supervisory) feedback signals for LTD on the parallel-fiber-to-Purkinje synapses.