Introduction: Biologically Motivated Crude Model, Backpropagation


McCulloch-Pitts Neurons In 1943 Warren S. McCulloch, a neuroscientist, and Walter Pitts, a logician, published "A logical calculus of the ideas immanent in nervous activity" [1]. It gave a highly simplified computational model of a neuron. 2

Rosenblatt In 1957 Frank Rosenblatt, a neurobiologist at Cornell, was researching vision in flies. The neural processing that occurs within the eye itself particularly intrigued Rosenblatt and formed the basis of his Perceptron neural network [2]. The Perceptron and other models showed great promise, with many initial successes. 3

Minsky and Papert In 1969 Marvin Minsky and Seymour Papert published a book [3] in which they discussed some of the limitations of the Perceptron model. They showed that the perceptron could not solve problems that are not linearly separable, such as the simple XOR problem. The effect was to limit much of the funding available for research into artificial neural networks, and as a result ANN research went into hibernation. 4

Werbos In 1974 Paul J. Werbos published his Harvard University Ph.D. thesis [4], which first described the process of training artificial neural networks through the backpropagation of errors. Its significance was not appreciated until the method was rediscovered in 1986 by Rumelhart, Hinton, and Williams. 5

Hopfield In 1982, John Hopfield's work [5] caused a resurgence in the field. Hopfield's approach was not simply to create models but to develop technologies that could be applied to real-life problems. Several books and conferences followed and provided a forum for people within the field to discuss the topic. 6

Rumelhart, Hinton, and Williams In 1986, Rumelhart, Hinton, and Williams rediscovered the backpropagation error-learning algorithm, which is now very popular. Stephen Grossberg, Teuvo Kohonen, and Henry Klopf also created new models during this period. 7

IEEE In 1987 the Institute of Electrical and Electronics Engineers (IEEE) held its first International Conference on Neural Networks, which drew more than a thousand attendees. Many other conferences on ANNs followed. 8

Support Vector Machines In the 1990s, artificial neural networks were overtaken in popularity within machine learning by support vector machines and other, much simpler methods such as linear classifiers. Renewed interest in neural nets was sparked in the 2000s by the advent of deep learning. 9

Natural Neuron: Basic Purpose [Figure: neuron anatomy, from http://www.biologyreference.com/ (photo by Sebastian Kaulitzki); labels: dendrites, axon, axon terminals, soma (cell body), nucleus, synapse.] The basic purpose of a neuron is to receive incoming information (in the form of chemical and electrical signals) and, based upon that information, to determine whether or not to send an electrical signal (action potential) to other neurons, muscles, or glands. Thus, a natural neuron is an electrochemical signal receiver and transmitter (transceiver). When a neuron sends an action potential, this electrical signal travels to the end terminals of the neuron, where it triggers the release of chemicals called neurotransmitters. The neurotransmitters cross a short gap between cells (the synapse) and are taken up by the adjoining cell (another neuron, muscle, or gland). 10

Neuron Structure: 4 Parts [Figure: neuron anatomy, from http://www.biologyreference.com/ (photo by Sebastian Kaulitzki); labels: dendrites, axon, axon terminals, soma (cell body), nucleus, synapse.] A typical human neuron has a cell body (soma), an array of input paths or "wires" that receive incoming signals (dendrites), a single output path or "wire" (axon) that carries electrical signals away from the neuron toward other cells, and many axon terminals (synapses). The dendrites are specialized to receive signals and transmit them toward the cell body. The single long axon carries action potentials away from the cell body. The synaptic terminals (thousands of them) form connections either with the dendrites of other neurons or with effector cells in muscles or glands. 11

Neuron Communications: 4 Steps [Figure: neuron anatomy, from http://www.biologyreference.com/ (photo by Sebastian Kaulitzki); labels: dendrites, axon, axon terminals, soma (cell body), nucleus, synapse.] 1. A neuron receives information from the external environment or from other neurons; a neuron in the human brain may receive input from up to 100,000 other neurons. 2. The neuron integrates the information from all of its inputs and determines whether or not to send an output signal, depending on the strength of the summed input. This integration takes place both in time (the duration of the input and the time between inputs) and in space (across the surface of the neuron). 3. The neuron propagates the signal along its axon (over distances of up to several meters, at rates of up to 100 m/s). 4. Finally, the neuron converts this electrical signal to a chemical one and transmits it to other neurons, muscles, or glands. 12

Neuron Communications: Synapse [Figure: synapse detail, from http://www.biologyreference.com/ (photo by Alila); labels: dendrites, axon, axon terminals, soma (cell body), nucleus, synapse.] Once an electrical signal has arrived at the end of an axon, the synaptic terminals release a chemical messenger called a neurotransmitter, which relays the signal across the synapse to the next neuron or to the effector cell. The magnitude, density of release, and chemical type of the neurotransmitter released are not well understood, but the receiver's response can be either excitatory or inhibitory, depending on the properties of the receptor. 13

Crude Computational Model [8] Transmitter neurons each send their electrical activation level (spikes) to the receiving neuron through and along their axons. Assumption: the precise timings of the spikes do not matter; only the frequency of firing communicates information. Furthermore, the frequency of firing is modeled by an activation output: the higher the output, the higher the firing frequency it is supposed to model. [Figure from [8]: (a) above, a cartoon drawing of a biological neuron; (b) below, its crude model, showing the axon from another neuron, the synapse, a dendrite, the cell body with activation function $f$, and the output axon.] 14

Crude Computational Model [8] The electrical activation levels $x_i$ enter the synapses located at the junctions between the axon terminals of the sending neurons and the dendrites of the receiving neuron. Each synapse acts to multiplicatively amplify or attenuate the activation level through a weight $w_i$: thus $w_i x_i$ represents the strength of the neurotransmitter input to the receiving neuron. The activation level $a$ of the receiving neuron depends on the sum of the neurotransmitter values, $\sum_i w_i x_i$, and an activation function $f$: if this sum is strong enough (greater than the neuron's threshold $\theta$), the neuron outputs an excitatory signal; otherwise it sends an inhibitory signal through the axon to other neurons or receptor cells. [Figure from [8]: (a) above, a cartoon drawing of a biological neuron; (b) below, its crude model, with synapse, dendrite, cell body, activation function $f$, and output axon.] 15
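As a concrete illustration of the weighted-sum-plus-activation model just described, here is a minimal Python sketch; the sigmoid activation, the weight values, and the threshold are illustrative assumptions, not values taken from the slides.

import math

def sigmoid(z):
    # smooth activation: models the firing rate, saturating between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, threshold):
    # each incoming activation x[i] is scaled by its synaptic weight w[i]
    z = sum(wi * xi for wi, xi in zip(w, x))
    # the output is high only when the weighted sum exceeds the threshold
    return sigmoid(z - threshold)

# illustrative values: three incoming activations, three synaptic weights
print(neuron_output([1.0, 0.5, 0.2], [0.8, -0.4, 0.3], threshold=1.0))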

This model of a biological neuron is very crude, simplified, and coarse. For example, there are many different types of neurons, each with different properties, suggesting the need for a different model for each type of neuron. The dendrites in biological neurons perform complex nonlinear computations; we are not modeling this at all. The synapses are not just a single weight; they are a complex nonlinear dynamical system. The exact timing of the output spikes is known to be important in many systems, suggesting that the rate-code approximation may not hold. Because of these and many other simplifications, please avoid drawing serious analogies between any neural network model and real brains. The neural network model has been modified and fit to solve computational problems, without trying to faithfully reproduce the real brain. See [6], or more recently [7], if you are interested in learning more about the physiology of actual neurons. 16

The term $b$ is called the bias because the summed inputs need to reach at least the level of the threshold (a bias) to excite the neuron. Since we do not know exactly the value of the threshold, we can model it as a weight to be learned. [Figure: the cell body with activation function $f$ and output axon, drawn twice: once with an explicit threshold $\theta$, and once with the threshold folded into a bias weight.] The threshold of the activation function now becomes 0. 17

The activation level of the neuron depends on the sum of the neurotransmitter values, $\sum_i w_i x_i$. If this sum is strong enough (greater than the neuron's threshold $\theta$), the neuron outputs an excitatory signal; otherwise it sends an inhibitory signal through the axon to other neurons or receptor cells. The term $b$ is called the bias because the summed inputs need to reach at least the level of the threshold (a bias) to excite the neuron. Since we do not know exactly what value the threshold should be, we can learn it as a weight. 18
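In symbols, folding the threshold into a bias can be written as follows (a reconstruction consistent with the model above; the exact symbols on the original slides were not preserved):

$$ a = f\Big(\sum_{i=1}^{n} w_i x_i - \theta\Big) = f\Big(\sum_{i=1}^{n} w_i x_i + b\Big), \qquad b := -\theta, $$

and, treating the bias as an ordinary weight $w_0 = b$ on a constant input $x_0 = +1$,

$$ a = f\Big(\sum_{i=0}^{n} w_i x_i\Big), $$

with the threshold of $f$ now at 0.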

Common Activation Functions Sigmoid (Logistic) Function: $f(z) = \dfrac{1}{1 + e^{-z}}$. Hyperbolic Tangent Function: $f(z) = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$. 19

Common Activation Functions Rectified Linear Unit (ReLU): $f(z) = \max(0, z)$. Convergence rates: ReLU was found to greatly accelerate (e.g., by a factor of 6 in Krizhevsky et al.) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form. 20
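The three activation functions named above, as a short Python sketch (scalar versions, purely for illustration; in practice they are applied element-wise to vectors):

import math

def sigmoid(z):
    # logistic function: squashes z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # hyperbolic tangent: squashes z into (-1, 1)
    return math.tanh(z)

def relu(z):
    # rectified linear unit: zero for negative z, identity for positive z
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))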


Comparison with Logistic Regression The model of a single neuron with a sigmoid activation function is exactly the same as the model of the logistic regression classifier. Single neuron activation output: $a = \sigma\big(\sum_{i=1}^{n} w_i x_i + b\big)$, with input from $n$ neurons. Logistic regression prediction output: $h_w(x) = \sigma\big(\sum_{i=1}^{n} w_i x_i + b\big)$, with input from an $n$-feature input vector. 23

Single Neuron Nonlinear Problem As we have seen with the linear logistic regression model, a single neuron with linear inputs is incapable of solving non-linearly separable problems. For example, it cannot solve the XOR (or XNOR) problem when given only the raw Boolean inputs:
x1 x2 | XOR | XNOR
 0  0 |  0  |  1
 0  1 |  1  |  0
 1  0 |  1  |  0
 1  1 |  0  |  1
The same limitation holds for more complex versions of the problem: noisy examples, an analog training set, and so on; it is still the same problem. 24

A two-layer network of sigmoid units can, however, realize XNOR. With weights $(-30, +20, +20)$, hidden unit $a^{(2)}_1$ computes approximately $x_1$ AND $x_2$; with weights $(+10, -20, -20)$, hidden unit $a^{(2)}_2$ computes approximately (NOT $x_1$) AND (NOT $x_2$); with weights $(-10, +20, +20)$, the output unit computes approximately $a^{(2)}_1$ OR $a^{(2)}_2$, which is $x_1$ XNOR $x_2$ (see the code sketch after this slide). [Network diagram: Layer 1, Layer 2, Layer 3.]
x1 x2 | a1 a2 | output
 0  0 |  0  1 |  1
 0  1 |  0  0 |  0
 1  0 |  0  0 |  0
 1  1 |  1  0 |  1
Units in Layer 2 can be interpreted as representing higher-order features (AND and NOR) of the data that allow the network to realize the desired function. 25
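A small Python sketch of this two-layer network, using the weight values from the slide and a sigmoid activation; with weights of magnitude 20 to 30 the sigmoid saturates, so the hidden and output units behave approximately like the Boolean gates AND, NOR, and OR (the gate interpretation follows the classic XNOR construction and is an assumption about the slide's intent):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xnor_net(x1, x2):
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)   # ~ x1 AND x2
    a2 = sigmoid(+10 - 20 * x1 - 20 * x2)   # ~ (NOT x1) AND (NOT x2)
    h  = sigmoid(-10 + 20 * a1 + 20 * a2)   # ~ a1 OR a2  ->  x1 XNOR x2
    return h

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor_net(x1, x2)))
# prints 1, 0, 0, 1 -- the XNOR truth table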

Generic Neural Network Model for Higher-Order Non-linear Problems [Network diagram: input layer, several hidden layers, output layer, with bias units labeled +1.] The leftmost layer of the network is called the input layer, and the rightmost layer the output layer. The input units are called inputs and are denoted $x^{(t)}_j$, where $j$ indexes the feature of the $t$-th training example. There can be any number of units in any layer. The circles labeled +1 are called bias units (or threshold units) and correspond to the intercept term. The middle layers of nodes are called the hidden layers, because either the input values, the desired output values, or both, of the nodes in those layers are not observed in the training set. 26

[Network diagram: Layer 1, Layer 2, Layer 3.] Our example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit. The connections between nodes are called weights; they are all labeled $w^{(l)}_{ij}$ for uniformity. 27

We label layer $l$ as $L_l$, so layer $L_1$ is the input layer and layer $L_{n_l}$ is the output layer, where $n_l$ denotes the number of layers in our network. In our example network, $n_l = 3$. Our neural network has parameters $(w, b)$: $w^{(l)}_{ij}$ denotes the parameter (or weight) going to unit $i$ in layer $l+1$ and coming from unit $j$ in layer $l$. (Note the order of the indices.) $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. [Network diagram: Layer 1, Layer 2, Layer 3.] $a^{(l)}_i$ is the activation (meaning output value) of unit $i$ in layer $l$. For layer $L_1$, we use $a^{(1)}_j = x^{(t)}_j$, where $x^{(t)}_j$ is the $j$-th feature of the $t$-th training example. $z^{(l)}_i$ is the total weighted sum of inputs to unit $i$ in layer $l$, including the bias term: $z^{(l)}_i = \sum_j w^{(l-1)}_{ij} a^{(l-1)}_j + b^{(l-1)}_i$, so that $a^{(l)}_i = f\big(z^{(l)}_i\big)$. 28

Forward Propagation of Activations [Network diagram: Layer 1, Layer 2, Layer 3.] Forward propagation corresponds to computing the output of each neuron in each layer, except for the input layer: $z^{(2)}_i = \sum_{j=1}^{3} w^{(1)}_{ij} a^{(1)}_j + b^{(1)}_i$ and $a^{(2)}_i = f\big(z^{(2)}_i\big)$ for the hidden layer, then $z^{(3)}_1 = \sum_{j=1}^{3} w^{(2)}_{1j} a^{(2)}_j + b^{(2)}_1$ and $a^{(3)}_1 = f\big(z^{(3)}_1\big)$ for the output layer. 29
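A minimal forward-propagation sketch for the 3-3-1 example network in Python; the weight matrices, biases, and input values below are made up purely to exercise the code:

import math

def f(z):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # x: 3 inputs; W1: 3x3 weights into layer 2; W2: 1x3 weights into layer 3
    z2 = [sum(W1[i][j] * x[j] for j in range(3)) + b1[i] for i in range(3)]
    a2 = [f(z) for z in z2]                       # hidden activations a^(2)
    z3 = sum(W2[0][j] * a2[j] for j in range(3)) + b2[0]
    a3 = f(z3)                                    # output activation a^(3)
    return a2, a3

# made-up parameters and input
W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.1, 0.1]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.5, 0.2]]
b2 = [0.05]
print(forward([1.0, 0.5, -1.0], W1, b1, W2, b2))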

Method of Learning the Parameters (Weights): Gradient Descent We will use gradient descent to learn the parameters $(w, b)$ from the training set $\{(x^{(t)}, y^{(t)})\}_{t=1}^{m}$. 30

Type of Weight Update: Online Mode As discussed earlier, there are two methods for updating the weights. Batch mode: update each weight after taking into consideration the effect of all training examples, i.e., once per training cycle. Online mode: update each weight after each randomly chosen training example; the weights are updated $m$ times per training cycle, where $m$ is the number of training examples in the training set. The online mode is also known as stochastic gradient descent, since different results will be achieved depending on the random order in which training examples are presented. In the development of the update equations, we will initially use the online mode to simplify the derivation (see the sketch after this slide). 31
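To make the batch/online distinction concrete, here is a toy Python sketch for a single linear parameter w with the squared-error cost; the data, learning rate, and number of epochs are assumptions for illustration only.

# toy model: prediction = w * x, error E = 0.5 * (w * x - y)^2, dE/dw = (w * x - y) * x
import random

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]   # made-up (x, y) pairs
alpha = 0.05                                   # assumed learning rate

# batch mode: one update per training cycle, using the gradient summed over all examples
w_batch = 0.0
for epoch in range(100):
    grad = sum((w_batch * x - y) * x for x, y in data)
    w_batch -= alpha * grad

# online (stochastic) mode: one update per randomly chosen training example
w_online = 0.0
for epoch in range(100):
    for x, y in random.sample(data, len(data)):
        w_online -= alpha * (w_online * x - y) * x

print(w_batch, w_online)   # both end up near 2.0 for this data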

For numepochs (training cycles) {
  For each training example chosen at random ($x^{(t)}, y^{(t)}$) {  // do for all training examples
    For each weight $w^{(l)}_{ij}$ in each layer {
      $w^{(l)}_{ij} := w^{(l)}_{ij} - \alpha \, \partial E^{(t)} / \partial w^{(l)}_{ij}$
    }
  }
}
Each weight is updated after taking into consideration only one randomly chosen training example. Note that the term $\partial E^{(t)} / \partial w^{(l)}_{ij}$ acts like an error term; scaled by the learning rate $\alpha$, it gives the amount by which $w^{(l)}_{ij}$ should be corrected. 32

Linear regression: $E = \frac{1}{2} \sum_{t=1}^{m} \big(h_w(x^{(t)}) - y^{(t)}\big)^2$. Logistic regression: $E = -\sum_{t=1}^{m} \big[ y^{(t)} \log h_w(x^{(t)}) + (1 - y^{(t)}) \log(1 - h_w(x^{(t)})) \big]$. Neural network: the same cost functions, with $h_w(x^{(t)}) = a^{(3)}_1$. The problem is that with many units connected in a network, the cost function will likely be non-convex, independent of which cost function is used, and therefore there will be multiple minima. We will initially use the Euclidean (squared-error) cost function (as in linear regression) to simplify the derivation of the weight update equations, but will substitute the logistic cost function later. 33

Output Layer Error Expression For a network that has $K$ (i.e., multiple) output units, the output error expression for a single training example is $E^{(t)} = \frac{1}{2} \sum_{k=1}^{K} \big(y^{(t)}_k - a^{(3)}_k\big)^2$. For our example network, which has a single output unit, the error expression for a single training example is $E^{(t)} = \frac{1}{2} \big(y^{(t)} - a^{(3)}_1\big)^2$. [Network diagram: Layer 1, Layer 2, Layer 3.] We want to calculate $\partial E^{(t)} / \partial w^{(2)}_{1j}$ in order to change the value of the weight $w^{(2)}_{1j}$ so as to minimize the error $E^{(t)}$: $w^{(2)}_{1j} := w^{(2)}_{1j} - \alpha \, \partial E^{(t)} / \partial w^{(2)}_{1j}$, according to the gradient descent algorithm. 34

Computing the Impact That Changes in a Layer-2 Weight Have on $E^{(t)}$ [Network diagram: Layer 1, Layer 2, Layer 3; the chain-rule derivation on this slide was not preserved in the transcription.] 35

Computing the Impact That Changes in a Layer-2 Weight Have on $E^{(t)}$ [Derivation not preserved in the transcription.] 36

Computing the Impact That Changes in a Layer-2 Weight Have on $E^{(t)}$ [Derivation not preserved in the transcription.] 37

Comparing Results and Compacting [Equations not preserved in the transcription.] 38

Computing the Impact That Changes in a Layer-2 Weight Have on $E^{(t)}$ [Derivation not preserved in the transcription.] 39

Alternative Method to Compute the Layer-2 Gradient [Derivation not preserved in the transcription.] 40

Computing the Impact That Changes in a Layer-2 Weight Have on $E^{(t)}$ [Derivation not preserved in the transcription.] 41

Compact Forms of the Layer-2 Gradients [Equations not preserved in the transcription; a standard reconstruction follows below.] 42
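The compact forms themselves did not survive the transcription; for the squared-error cost and the notation above, the standard result (offered here as a reconstruction, not a verbatim copy of the slide) is

$$ \frac{\partial E^{(t)}}{\partial w^{(2)}_{1j}} = \underbrace{\big(a^{(3)}_1 - y^{(t)}\big)\, f'\!\big(z^{(3)}_1\big)}_{\delta^{(3)}_1}\; a^{(2)}_j, \qquad \frac{\partial E^{(t)}}{\partial b^{(2)}_1} = \delta^{(3)}_1, $$

where, for the sigmoid activation, $f'(z) = f(z)\big(1 - f(z)\big)$.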

For each training example chosen at random ($x^{(t)}, y^{(t)}$) {
  For each weight $w^{(2)}_{1j}$ in layer 2 {
    $w^{(2)}_{1j} := w^{(2)}_{1j} - \alpha \, \delta^{(3)}_1 a^{(2)}_j$, where $\delta^{(3)}_1 = \big(a^{(3)}_1 - y^{(t)}\big) f'\big(z^{(3)}_1\big)$
  }
}
[Network diagram: Layer 1, Layer 2, Layer 3.] Each weight is updated after performing one forward propagation of activations using one randomly chosen training example. Note that the term $\delta^{(3)}_1$ may be interpreted as an error term; scaled by $a^{(2)}_j$ and the learning rate $\alpha$, it gives the amount by which $w^{(2)}_{1j}$ should be corrected. Implementation note: the forward pass will already have computed $a^{(2)}_j$ and $a^{(3)}_1$, and $y^{(t)}$ is known. 43

Hidden Layer Error Expression For a network that has $K$ (i.e., multiple) units in the output layer, the error expression for a single training example is $E^{(t)} = \frac{1}{2} \sum_{k=1}^{K} \big(y^{(t)}_k - a^{(3)}_k\big)^2$. For our example network, which has a single output unit, the error expression for a single training example is $E^{(t)} = \frac{1}{2} \big(y^{(t)} - a^{(3)}_1\big)^2$. [Network diagram: Layer 1, Layer 2, Layer 3.] We now want to calculate $\partial E^{(t)} / \partial w^{(1)}_{ij}$ in order to change the value of the hidden-layer weight $w^{(1)}_{ij}$ so as to minimize the error $E^{(t)}$: $w^{(1)}_{ij} := w^{(1)}_{ij} - \alpha \, \partial E^{(t)} / \partial w^{(1)}_{ij}$, according to the gradient descent algorithm. 44

Computing the Impact That Changes in Weight $w^{(1)}_{11}$ Have on $E^{(t)}$: let's backpropagate the error! [Network diagram: Layer 1, Layer 2, Layer 3; the chain-rule derivation on this slide was not preserved in the transcription.] Note that the subscript of $a^{(1)}$ is 1 for $w^{(1)}_{11}$, corresponding to the $j$-th subscript in $w^{(1)}_{1j}$. 45

In-Common Chain Rule Path Note: this expression is the same for all Layer-1 weights, since all chain-rule paths for weights in Layer 1 have that segment in common. [Network diagram: Layer 1, Layer 2, Layer 3, with the in-common chain-rule path highlighted; the expression itself was not preserved in the transcription.] 46

Computing the Impact That Changes in Weight $w^{(1)}_{12}$ Have on $E^{(t)}$ Note: the subscripts of $a^{(2)}$ and $z^{(2)}$ are 1 because $w^{(1)}_{12}$ is a weight connected to unit 1 in Layer 2. [Network diagram: Layer 1, Layer 2, Layer 3; derivation not preserved in the transcription.] Note that the subscript of $a^{(1)}$ is 2 for $w^{(1)}_{12}$, corresponding to the $j$-th subscript in $w^{(1)}_{1j}$. 47

Generalize: the Impact That Changes in Weight $w^{(1)}_{1j}$ Have on $E^{(t)}$ Note: the subscripts of $a^{(2)}$ and $z^{(2)}$ are 1 because $w^{(1)}_{1j}$ is a weight connected to unit 1 in Layer 2. [Network diagram: Layer 1, Layer 2, Layer 3; derivation not preserved in the transcription.] Note that the subscript of $a^{(1)}$ is $j$ for $w^{(1)}_{1j}$, corresponding to the subscript in $w^{(1)}_{1j}$. 48

Computing the Impact That Changes in Weight $w^{(1)}_{21}$ Have on $E^{(t)}$ Note: the subscripts of $a^{(2)}$ and $z^{(2)}$ are 2 because $w^{(1)}_{21}$ is a weight connected to unit 2 in Layer 2. [Network diagram: Layer 1, Layer 2, Layer 3; derivation not preserved in the transcription.] Note that the subscript of $a^{(1)}$ is 1 for $w^{(1)}_{21}$, corresponding to the $j$-th subscript in $w^{(1)}_{2j}$. 49

Computing the Impact That Changes in Weight $w^{(1)}_{22}$ Have on $E^{(t)}$ Note: the subscripts of $a^{(2)}$ and $z^{(2)}$ are 2 because $w^{(1)}_{22}$ is a weight connected to unit 2 in Layer 2. [Network diagram: Layer 1, Layer 2, Layer 3; derivation not preserved in the transcription.] Note that the subscript of $a^{(1)}$ is 2 for $w^{(1)}_{22}$, corresponding to the $j$-th subscript in $w^{(1)}_{2j}$. 50

Generalize: the Impact That Changes in Weight $w^{(1)}_{2j}$ Have on $E^{(t)}$ [Network diagram: Layer 1, Layer 2, Layer 3; derivation not preserved in the transcription.] 51

[Network diagram: Layer 1, Layer 2, Layer 3; slide content not preserved in the transcription.] 52

Generalize: the Impact That Changes in Weight $w^{(1)}_{ij}$ Have on $E^{(t)}$ [Network diagram: Layer 1, Layer 2, Layer 3; the general expression was not preserved in the transcription; a standard reconstruction follows below.] 53
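The general expression likewise did not survive the transcription; the standard chain-rule result for a Layer-1 weight $w^{(1)}_{ij}$, consistent with the subscript notes above, is (again a reconstruction, not a verbatim copy of the slide)

$$ \frac{\partial E^{(t)}}{\partial w^{(1)}_{ij}} = \delta^{(3)}_1\, w^{(2)}_{1i}\, f'\!\big(z^{(2)}_i\big)\, a^{(1)}_j = \delta^{(2)}_i\, a^{(1)}_j, \qquad \delta^{(2)}_i := f'\!\big(z^{(2)}_i\big)\, w^{(2)}_{1i}\, \delta^{(3)}_1, $$

and similarly $\partial E^{(t)} / \partial b^{(1)}_i = \delta^{(2)}_i$.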

For this Network Configuration
1. Forward propagation to determine activations: $a^{(2)}_i = f\big(z^{(2)}_i\big)$ for $i = 1, 2, 3$, then $a^{(3)}_1 = f\big(z^{(3)}_1\big)$. [Network diagram: Layer 1, Layer 2, Layer 3.]
2. Backpropagation with gradient descent to update the weights:
For each training example chosen at random ($x^{(t)}, y^{(t)}$) {
  For each weight $w^{(2)}_{1j}$ in layer 2 {
    $w^{(2)}_{1j} := w^{(2)}_{1j} - \alpha \, \delta^{(3)}_1 a^{(2)}_j$
  }
  For each weight $w^{(1)}_{ij}$ in layer 1 {
    $w^{(1)}_{ij} := w^{(1)}_{ij} - \alpha \, \delta^{(2)}_i a^{(1)}_j$
  }
} 54
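Putting the two steps together, here is a compact Python sketch of online gradient descent with backpropagation for the 3-3-1 network, using a sigmoid activation and the squared-error cost; the training data (an XOR-style truth table with a constant third input), the learning rate, and the number of epochs are assumptions for illustration.

import math
import random

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

def fprime(z):
    a = f(z)
    return a * (1.0 - a)            # sigmoid derivative f'(z) = f(z)(1 - f(z))

# 3 inputs -> 3 hidden -> 1 output; small random initial weights
random.seed(0)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
b2 = [0.0]
alpha = 0.5

# made-up training set: third input held at 1, target is x1 XOR x2
data = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 0)]

for epoch in range(20000):
    x, y = random.choice(data)                      # online mode
    # forward propagation
    z2 = [sum(W1[i][j] * x[j] for j in range(3)) + b1[i] for i in range(3)]
    a2 = [f(z) for z in z2]
    z3 = sum(W2[0][j] * a2[j] for j in range(3)) + b2[0]
    a3 = f(z3)
    # backpropagation of the error terms
    d3 = (a3 - y) * fprime(z3)                      # delta^(3)_1
    d2 = [fprime(z2[i]) * W2[0][i] * d3 for i in range(3)]   # delta^(2)_i
    # gradient-descent updates: layer 2, then layer 1
    for j in range(3):
        W2[0][j] -= alpha * d3 * a2[j]
    b2[0] -= alpha * d3
    for i in range(3):
        for j in range(3):
            W1[i][j] -= alpha * d2[i] * x[j]
        b1[i] -= alpha * d2[i]

# report the learned mapping
for x, y in data:
    z2 = [sum(W1[i][j] * x[j] for j in range(3)) + b1[i] for i in range(3)]
    a2 = [f(z) for z in z2]
    print(x[:2], y, round(f(sum(W2[0][j] * a2[j] for j in range(3)) + b2[0]), 2))

With these settings the network usually learns the XOR-style mapping, although convergence is not guaranteed for every random initialization.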

Using Alternative Cost Functions Note that the only partial derivative that depends on the cost function is $\partial E^{(t)} / \partial a^{(3)}_1$. Furthermore, that derivative affects only the output error term $\delta^{(3)}_1$. [Network diagram: Layer 1, Layer 2, Layer 3.] Therefore, to use an alternative cost function, only the first factor of $\delta^{(3)}_1$ needs to be updated. 55

Using the Logistic Regression Cost Function For instance, to use the logistic cost function (i.e., instead of the Euclidean cost), we need to substitute the derivative of the Euclidean cost with the derivative of the logistic cost. Recall logistic regression's cost function and its derivative, for a single training example (i.e., for online mode); see the reconstruction below. 56
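For completeness, the logistic (cross-entropy) cost for a single training example and its derivative, written in the notation above (a standard reconstruction; the slide's own equations were not preserved):

$$ E^{(t)} = -\Big[ y^{(t)} \log a^{(3)}_1 + \big(1 - y^{(t)}\big) \log\big(1 - a^{(3)}_1\big) \Big], \qquad \frac{\partial E^{(t)}}{\partial a^{(3)}_1} = \frac{a^{(3)}_1 - y^{(t)}}{a^{(3)}_1\big(1 - a^{(3)}_1\big)}. $$

Combined with the sigmoid derivative $f'\big(z^{(3)}_1\big) = a^{(3)}_1 \big(1 - a^{(3)}_1\big)$, the output error term simplifies to $\delta^{(3)}_1 = a^{(3)}_1 - y^{(t)}$.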

References
[1] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[2] F. Rosenblatt, "The Perceptron: a perceiving and recognizing automaton," Cornell Aeronautical Laboratory, Buffalo, NY, 1957.
[3] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: The MIT Press, 1969.
[4] P. J. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," Ph.D. thesis, Harvard University, Cambridge, MA, 1974.
[5] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences of the USA, vol. 79, pp. 2554-2558, 1982.
[6] M. London and M. Häusser, "Dendritic Computation," Annual Review of Neuroscience, vol. 28, pp. 503-532, 2005. [Online]. Available: https://physics.ucsd.edu/neurophysics/courses/physics_171/annurev.neuro.28.061604.135703.pdf. [Accessed 05 February 2016].
[7] N. Brunel, V. Hakim and M. J. Richardson, "Single neuron dynamics and computation," Current Opinion in Neurobiology, vol. 25, pp. 149-155, 2014.
[8] F.-F. Li, "CS231n: Convolutional Neural Networks for Visual Recognition," Stanford University, 2016. [Online]. Available: http://cs231n.stanford.edu/. [Accessed 05 February 2016]. 57