Artificial Neural Networks


Math 596 Project Summary, Fall 2016
Jarod Hart

1 Overview

Artificial Neural Networks (ANNs) are machine learning algorithms that imitate neural function. There is a vast theory of ANNs available these days. To limit the scope and length of this project summary, we will only consider three ANN algorithms, which are typically the first ones introduced when starting to work with ANNs: the Perceptron, Adaline, and multi-layer feed-forward networks. They are all supervised learning algorithms, meaning that they train their parameters based on a set of pre-classified (or labeled) training data. That is, each element of the training data has a known label associated to it. The goal is to learn how to classify novel input according to a rule learned from the training data.

We should acknowledge that calling the Perceptron and Adaline models neural networks may be overstating their complexity. Indeed, these models are formulated based on only a single neuron. Hence they could be viewed as trivial networks, but it may be more conducive to think of them as a primer for building mathematical models of neurons. Their construction introduces some foundational ideas of the theory, and it makes it much more manageable to work with more complicated networks, like multi-layer networks. There are ways to accomplish more complicated tasks by preprocessing training data and/or using several Perceptron/Adaline neurons, but even in these models the neurons still function largely independently (not in the coordinated way inherent to more complicated ANN models). We mention a couple of ways to extend these models in the Possible Extensions section.

There is an important development in the transition from the Perceptron to the Adaline model. Roughly speaking, the Perceptron learns by using a somewhat ad hoc training rule that can be motivated by a geometric argument. The training is a little cumbersome and relies on a linear structure in some ways, which makes it difficult to extend directly to more complicated and nonlinear models. Adaline introduces a shift in point of view, which is to train the neuron by minimizing an error function. This notion of minimization in place of a more geometric argument is much more easily extended to more complicated settings, which can be observed in the construction of the multi-layer networks.

Much of the information presented here is taken from Mitchell's book on machine learning, but several aspects are presented differently and at times more concretely. In particular, the details of the learning rules here are laid out in more detailed but less general terms. This description is also much less comprehensive, which allows a much shorter presentation of the material. It may be of use for those just starting to work with ANNs, but would probably be best used along with other resources.

2 Mathematical and Programming Content

To complete this project, a background in the following topics is recommended.

- Linear algebra: Some familiarity with matrix operations and dot products is necessary for this project. In addition, some understanding of separating hyperplanes is helpful, although a rigorous understanding of linear independence, linear combinations, bases, subspaces, diagonalization, etc. is not required.

- Convex optimization theory: A significant component of training the ANNs in this summary is based on gradient descent applied to a squared error function. A rough understanding of how gradient descent works, and how to use it to generate an iterative optimization scheme, is necessary to complete this project.

- Graph theory: A very rudimentary understanding of graph theory is helpful to understand ANN topologies. This background can be limited to a basic familiarity with weighted directed graphs.

- Programming: This project involves a fair amount of programming. A thorough understanding of working with vectors/matrices/arrays, decision statements, and loops is essential to implement these ANNs. For some of the applications, an understanding of computer graphics is also helpful.

3 Primary Resources

For much of the mathematical content listed above, typical textbooks in the pertinent areas are sufficient. Some additional resources on neural networks are the following (this is by no means a complete list).

- T. Mitchell, Machine Learning, McGraw-Hill, 1997.
- B. Kröse and P. van der Smagt, An Introduction to Neural Networks, Eighth Edition, University of Amsterdam, 1996. Available at archive.org.

4 Mathematical Description of the Project

Suppose we are given a training data set X_1,...,X_N ∈ ℝ^n, along with corresponding labels l_1,...,l_N ∈ {−1,1}. We will denote this training set T = {(X_1,l_1),...,(X_N,l_N)} ⊂ ℝ^n × {−1,1}. Our goal is to use this data to learn a function F : ℝ^n → ℝ such that F(X_i) ≈ l_i for all (X_i,l_i) ∈ T. (To avoid ambiguous notation, we reserve subscripted capital X_i's to represent x-values from the training set, and use x to represent an arbitrary element of ℝ^n.) We describe three ways of accomplishing this classification function F through ANNs: the Perceptron, Adaline, and multi-layer feed-forward networks. In each situation the function F depends on a set of weight parameters. The number and structure of the weight parameters used to define F depend on the network topology and learning structure; these will be described in more detail below. However, each of these networks works the same way from the perspective of a supervised learning algorithm. Each model has two stages: a learning stage, where the weight parameters are selected based on the training data, and a classification stage, where the function (using the weights learned in the learning stage) classifies novel inputs.

First we formulate the Perceptron ANN. (Technically, it may be misleading to call this an ANN, since we only describe the action of a single neuron. So this is not exactly a neural network, but rather a single node.) Define, for fixed θ ∈ ℝ and w ∈ ℝ^n, the function F : ℝ^n → {−1,1} by

F(x) = sgn(θ + w · x),

where sgn(x) = 1 if x ≥ 0 and sgn(x) = −1 if x < 0. Figure 1 shows a depiction of how to interpret this function F(x) as a neuron in the biological sense in the n = 2 situation. The parameters θ and w are the weight parameters for this model.

Figure 1: A depiction of how the Perceptron can be interpreted as a neuron; x_1 and x_2 are the input stimuli for the neuron, and the weight parameters θ and w determine when the neuron fires through the function sgn(θ + w · x).

We forego our description of the training of the Perceptron for the moment, and describe how it classifies novel input given θ and w. So suppose for a moment that we have already completed our training process, which means that the weight parameters θ and w are already determined. Then the decision function F classifies new inputs by using the hyperplane defined by θ + w · x = 0 to split ℝ^n in two (ignoring the issues that arise when a new input lies on this hyperplane). It may be easiest to think of this in n = 2 dimensions, where θ + w_1 x_1 + w_2 x_2 = 0 defines a line. If θ + w_1 x_1 + w_2 x_2 ≥ 0, then (x_1,x_2) lies on one side of the line (or on the line), and if θ + w_1 x_1 + w_2 x_2 < 0, then (x_1,x_2) lies on the other side. Hence ℝ^2 is split into two sets, and the Perceptron will fire or not according to the function F(x). Figure 2 below shows the decision rule learned by the Perceptron for a simulated training data set. This principle of the hyperplane θ + w · x = 0 for w ∈ ℝ^n splitting ℝ^n in two extends naturally to higher dimensions. This describes the classification stage for the Perceptron.

Figure 2: The left plot shows a simulated training data set where blue o's are labeled 1 and red o's are labeled −1. The right plot shows the decision rule learned by the Perceptron, where points above the line return 1 and points below the line return −1.

Now let's return to discuss how the Perceptron is trained. The purpose of the training stage is to figure out how to choose θ ∈ ℝ and w ∈ ℝ^n so as to correctly classify all the training data T = {(X_1,l_1),...,(X_N,l_N)}. To do this, fix a learning rate γ > 0, and repeatedly execute steps 1-3 below:

1. Choose (X_i,l_i) from the training set T = {(X_1,l_1),...,(X_N,l_N)}.
2. Compute y = θ + w · X_i.
3. If y · l_i < 0, update θ and w according to
   θ ← θ + γ (l_i − y)
   w ← w + γ (l_i − y) X_i.

Here the notation θ ← θ + γ(l − y) means overwrite the current value of θ with θ + γ(l − y). Note also that the update for w describes the update of the entire w vector; recall that X_i ∈ ℝ^n, so the right hand side is well defined as the sum of w ∈ ℝ^n and a scalar multiple of X_i ∈ ℝ^n.

There are several ways that this algorithm can be iterated. One can simply choose (X_i,l_i) randomly from the training data set T for some set number of iterations. Alternatively, one can cycle through all elements of the training data set in order several times. There are many other ways that will work.

One can interpret the Perceptron learning rule in the following way. We take an element (X_i,l_i) from T, and check if the Perceptron classifies it correctly (with the current weight parameters). If the point is classified correctly by the Perceptron, we leave the weights unchanged. If the point is misclassified, then we update the weights in such a way that the updated weights are more likely to classify it correctly.
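To make the training loop above concrete, here is one possible implementation sketch in Python. It is not part of the original project; the use of numpy, the random sampling strategy, the default learning rate, and the function names are all illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, labels, gamma=0.1, iterations=1000, seed=0):
    """Train a Perceptron on data X (an N x n array) with labels in {-1, +1}.

    Follows steps 1-3 above: pick a training pair, compute y = theta + w . X_i,
    and update (theta, w) only when y * l_i < 0 (the point is misclassified).
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    theta, w = 0.0, np.zeros(n)
    for _ in range(iterations):
        i = rng.integers(N)               # step 1: choose (X_i, l_i) at random
        y = theta + w @ X[i]              # step 2: y = theta + w . X_i
        if y * labels[i] < 0:             # step 3: update only if misclassified
            theta += gamma * (labels[i] - y)
            w += gamma * (labels[i] - y) * X[i]
    return theta, w

def perceptron_classify(theta, w, x):
    """Classification stage: F(x) = sgn(theta + w . x), with sgn(0) = 1."""
    return 1 if theta + w @ x >= 0 else -1
```

With the simulated data of Figure 2, one would call train_perceptron on an N × 2 array of points and a length-N array of ±1 labels, and then use perceptron_classify on novel inputs.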

For larger values of γ, the weight update changes θ and w more sporadically. Making γ small will cause less sporadic changes, but if γ is too small it may take a long time for the Perceptron to train. It is not hard to make a geometric argument for defining this update rule, but it can also be justified using an optimization argument. We will mention this argument at the end of the description of the Adaline ANN.

There are a few things that should be observed with the Perceptron ANN. First, the decision rule shown in the right plot of Figure 2 does not appear to be optimally placed. Indeed, it is very close to the red cluster of the training set, and hence one would expect this particular training of the Perceptron to be susceptible to misclassifying red points as blue ones. This type of non-optimal placement of the decision boundary arises from the sporadic update rule, and from the fact that the Perceptron stops updating once it has classified all points correctly. So it will not try to determine the decision hyperplane optimally, only so that it classifies all points correctly. Second, in order for the Perceptron to work well, the training data must be linearly separable. If it is not, the decision line will continue to change sporadically (depending on the size of γ), and it has no hope of fully classifying the training data. In Figure 3, we show a simulated training data set that is not linearly separable, and hence for which the Perceptron (at least applied directly as described here) will fail. The Adaline ANN will provide a solution that better addresses the first issue (see also support vector machines for a solution to this deficit of the Perceptron), but it will still be limited by the second issue in the same way the Perceptron is. The second issue can be solved using the last model we discuss, multi-layer ANNs.

Figure 3: The left plot shows a simulated training data set of an exclusive or type decision where blue o's are labeled 1 and red o's are labeled −1. The right plot shows the decision rule learned by the Perceptron, which has failed to generate an accurate decision rule.

Adaline is an acronym for Adaptive Linear Element. It is still limited to a linear classifier, but it provides an important shift in point of view for updating ANNs. For this model, we modify F(x) slightly. Define

σ(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}),

and F(x) = σ(θ + w · x), where θ ∈ ℝ and w ∈ ℝ^n are again our weight parameters. Suppose again that we have already trained our Adaline ANN, and hence have fixed weight parameters θ ∈ ℝ and w ∈ ℝ^n. Then Adaline can be used to classify new data x ∈ ℝ^n by evaluating F(x). To view this in the way that neurons are traditionally viewed, with a binary output, we can simply say the neuron fires if F(x) ≥ 0 and doesn't fire if F(x) < 0. From a machine learning perspective, it is typically more informative to retain the extra information contained in F(x), rather than just the binary output ±1. This describes the classification function of Adaline. Figure 4 shows a schematic for the Adaline ANN in n = 2 dimensions, which is a slight modification of the Perceptron pictured in Figure 1; the only difference is that the smooth function σ(x) is used in place of sgn(x).
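As a small illustration of this classification stage, the following sketch (an assumption of this summary, not part of the original project) evaluates the Adaline decision function with σ = tanh, returning either the real-valued output or its binary reading.

```python
import numpy as np

def sigma(x):
    """Adaline activation: sigma(x) = tanh(x)."""
    return np.tanh(x)

def adaline_output(theta, w, x):
    """Real-valued output F(x) = sigma(theta + w . x)."""
    return sigma(theta + np.dot(w, x))

def adaline_classify(theta, w, x):
    """Binary reading of the output: fire (+1) if F(x) >= 0, otherwise -1."""
    return 1 if adaline_output(theta, w, x) >= 0 else -1
```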

Figure 4: A depiction of how the Adaline can be interpreted as a neuron; x_1 and x_2 are the input stimuli for the neuron, and the parameters θ and w determine when the neuron fires through the sign of the function σ(θ + w · x).

It remains to describe the weight parameter training for Adaline. Our goal in this training is to choose the weight parameters θ and w so that F(X_i) ≈ l_i for all elements of our training set (X_i,l_i) ∈ {(X_1,l_1),...,(X_N,l_N)}. The important shift in point of view for this model is to define an error associated to our training data classification (as a function of the weights), and to choose our weights by minimizing that error. In particular, define the squared error functions

E_i(θ,w) = (1/2) (σ(θ + w · X_i) − l_i)^2 for i = 1,2,...,N, and E(θ,w) = Σ_{i=1}^N E_i(θ,w).

Note that the training data (X_i,l_i) are treated as fixed quantities here. Then for a given θ ∈ ℝ and w ∈ ℝ^n, E_i(θ,w) is half of the squared error of our classification σ(θ + w · X_i) of the training data point (X_i,l_i) ∈ T, and E(θ,w) is half of the cumulative squared error over the entire training set T. (The factor 1/2 is just for convenience, to simplify computations; the algorithm could just as easily be formulated without it.) Note that E takes into account all of the information provided to us by the training set, and we've expressed it as a function of the weight parameters θ and w. Hence we have posed our weight training process as an optimization problem: choose θ ∈ ℝ and w ∈ ℝ^n that minimize E(θ,w).

In order to solve this optimization problem, we use a gradient descent algorithm applied successively to the incremental marginal error functions E_i(θ,w). Roughly speaking, this algorithm is formulated in the following way. Fix a training element (X_i,l_i) ∈ T. Treating θ and w as variables, we compute ∂E_i/∂θ and ∂E_i/∂w_j for each j = 1,2,...,n. Then, given the current values of θ and w, we update them by moving in the direction of steepest descent of the error function, given by −∂E_i/∂θ and −∂E_i/∂w_j. That is, we replace θ and w_j according to the following rule:

θ ← θ − γ ∂E_i/∂θ
w_j ← w_j − γ ∂E_i/∂w_j for j = 1,2,...,n.

Once again we interpret the ← here as overwriting θ and w_j, where the right hand side is computed with the current values of θ and w. Now we compute the partial derivatives above to formulate the update rule. For the θ update rule, we calculate

∂E_i/∂θ = (σ(θ + w · X_i) − l_i) · ∂/∂θ [σ(θ + w · X_i)] = (σ(θ + w · X_i) − l_i) · σ′(θ + w · X_i),

and similarly

∂E_i/∂w_j = (σ(θ + w · X_i) − l_i) · σ′(θ + w · X_i) · (X_i)_j.

Here (X_i)_j denotes the j-th component of X_i ∈ ℝ^n. Also note that if we take σ(x) = tanh(x) as above, then σ′(x) = sech^2(x) = 4/(e^x + e^{−x})^2. If we define

δ = (l_i − σ(θ + w · X_i)) · σ′(θ + w · X_i),

then the update becomes simply θ ← θ + γ δ and w ← w + γ δ X_i. This update rule, simplified by computing δ in this way, is often referred to as the δ-rule. In fact, this δ-rule can be generalized to more complicated settings and will be convenient for our description of multi-layer ANNs. For a fixed learning rate γ > 0, we train the Adaline ANN by repeatedly updating θ and w according to the following steps:

1. Choose (X_i,l_i) from the training set T = {(X_1,l_1),...,(X_N,l_N)}.
2. Compute δ = (l_i − σ(θ + w · X_i)) · σ′(θ + w · X_i).
3. Update θ and w according to
   θ ← θ + γ δ
   w ← w + γ δ X_i.

Once again, there are different strategies for iterating these steps. For this implementation, it is typical to repeatedly cycle through the entire training set, i = 1,2,...,N, updating θ and w for each training element. Every full cycle through the training set is sometimes called an epoch. This provides a natural way to report the squared error: at the end of each epoch, you can compute E(θ,w) as defined above, and then measure the success of your training in terms of E(θ,w) versus the number of epochs.

Figure 5 below demonstrates the outcome of an Adaline classification on simulated training data. It should be noted that Adaline does a better job of placing the decision line (the right plot of Figure 5) than the Perceptron does for similar training data (the right plot of Figure 2). Adaline continues to update its weights even if all training data points are classified correctly; this is a consequence of the error minimization approach of the Adaline model. This idea can be extended a little to conclude that Adaline is better equipped to classify clusters with limited amounts of overlap (that just fall short of being linearly separable). It should also be noted that Adaline cannot effectively address exclusive or type data, that is, the situation in Figure 3. Since Adaline still relies on a linear classifier, it is not a good choice for classifying data that is not linearly separable in this way (at least in the initial formulation presented here; another option to solve this is described in the Possible Extensions section). It is worth noting that this formulation allows for some flexibility in the choice of the function σ. Modifying σ allows one to model 0-1 neurons rather than ±1 neurons, or even to model more general outputs for appropriately chosen functions σ.
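The δ-rule loop above translates directly into code. The following is a minimal sketch under the same assumptions as before (numpy, σ = tanh); the epoch-based cycling matches the description above, while the default learning rate and the error-reporting format are illustrative choices.

```python
import numpy as np

def sigma(x):
    return np.tanh(x)

def sigma_prime(x):
    # derivative of tanh: sech^2(x) = 1 - tanh^2(x)
    return 1.0 - np.tanh(x) ** 2

def train_adaline(X, labels, gamma=0.05, epochs=100):
    """Train Adaline with the delta-rule on data X (N x n) and labels in {-1, +1}.

    Cycles through the training set once per epoch and records the cumulative
    squared error E(theta, w) at the end of each epoch.
    """
    N, n = X.shape
    theta, w = 0.0, np.zeros(n)
    errors = []
    for _ in range(epochs):
        for i in range(N):                       # one epoch: i = 1, ..., N
            s = theta + w @ X[i]
            delta = (labels[i] - sigma(s)) * sigma_prime(s)
            theta += gamma * delta               # theta <- theta + gamma*delta
            w += gamma * delta * X[i]            # w <- w + gamma*delta*X_i
        E = 0.5 * sum((sigma(theta + w @ X[i]) - labels[i]) ** 2 for i in range(N))
        errors.append(E)
    return theta, w, errors
```

Plotting the returned errors against the epoch number gives the squared error versus epochs curve described above.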

Figure 5: The left plot shows a simulated training data set where blue o's are labeled 1 and red o's are labeled −1. The right plot shows the decision rule learned by Adaline, where points above the line return 1 and points below the line return −1.

Finally, we consider a multi-layer feed-forward network. This involves introducing a hidden layer of neurons to the model (we will only address models with a single hidden layer, but one can extend the ideas here to multiple hidden layers). Suppose we wish to construct a network with one hidden layer made up of n_h ∈ ℕ neurons. Let τ ∈ ℝ, θ, v ∈ ℝ^{n_h}, and W ∈ ℝ^{n_h × n}, and define

F(x) = σ(τ + v · a), where a = σ(θ + W x) ∈ ℝ^{n_h}.

Here a = a(x) is a function of x ∈ ℝ^n, and we interpret σ(y) = (σ(y_1),...,σ(y_{n_h})) componentwise for y ∈ ℝ^{n_h}. Now we consider τ ∈ ℝ, θ, v ∈ ℝ^{n_h}, and W ∈ ℝ^{n_h × n} all to be our weight parameters to be chosen in the learning stage of the algorithm. As before, the classification function of this multi-layer network (when τ, θ, v, and W are fixed after training) works simply by plugging an element x ∈ ℝ^n into F. When F(x) ≥ 0, we classify x as group 1 (i.e. the neuron fires), and when F(x) < 0, we classify x as group −1 (i.e. the neuron doesn't fire). In Figure 6, we depict an ANN with n = 2 inputs, a single hidden layer made up of n_h = 5 neurons, and a single output neuron.

Figure 6: A depiction of the multi-layer feed-forward ANN as a network of neurons; x_1 and x_2 are the input stimuli, and the parameters τ, v, θ, and W determine the output through the function σ(τ + v · σ(θ + W x)). Here W is a 5 × 2 matrix.

It remains to describe the training rule for this multi-layer network. We do this in a similar way to the Adaline ANN training, by defining an error function and using a gradient descent approach to define weight updates that work towards minimizing the error function. To account for the multiple layers of weights in this model, we use a notion of back-propagation of error. Roughly speaking, this means that we adjust the output layer weights v and use those computations to inform our adjustment of the weights W in the preceding layer. More precisely, the updates are formulated as follows. Define

E_i(τ,v,θ,W) = (1/2) (σ(τ + v · σ(θ + W X_i)) − l_i)^2 for i = 1,2,...,N, and E(τ,v,θ,W) = Σ_{i=1}^N E_i(τ,v,θ,W).

To implement the gradient descent algorithm, we will need to compute the partial derivatives of E_i with respect to τ, each component of v, each component of θ, and each component of W. We first handle the output layer weights, whose update rules come out very similar to those of the Adaline model:

∂E_i/∂τ = −δ
∂E_i/∂v_j = −δ · (σ(θ + W X_i))_j,

where δ = (l_i − σ(τ + v · σ(θ + W X_i))) · σ′(τ + v · σ(θ + W X_i)). Note again that we use the notation σ(θ + W X_i) = (σ(θ_1 + (W X_i)_1),...,σ(θ_{n_h} + (W X_i)_{n_h})), and so v · σ(θ + W X_i) is the dot product of two elements of ℝ^{n_h}. So the update rules for the output layer are τ ← τ + γ δ and v ← v + γ δ σ(θ + W X_i), where δ is defined as above. For the hidden layer weight parameter updates, we consider the following computation:

∂E_i/∂θ_j = −δ · ∂/∂θ_j [τ + v · σ(θ + W X_i)] = −δ · v_j · σ′(θ_j + (W X_i)_j),
∂E_i/∂w_{j,k} = −δ · ∂/∂w_{j,k} [τ + v · σ(θ + W X_i)] = −δ · v_j · σ′(θ_j + (W X_i)_j) · (X_i)_k,

where δ is as above. Now we can easily formulate the training rules for choosing τ, v, θ, and W. We iteratively update the multi-layer ANN weight parameters according to the following:

1. Choose (X_i,l_i) from the training set T = {(X_1,l_1),...,(X_N,l_N)}.
2. Compute
   y = W X_i
   a = (σ(θ_1 + y_1), σ(θ_2 + y_2),...,σ(θ_{n_h} + y_{n_h}))
   ã = (v_1 σ′(θ_1 + y_1), v_2 σ′(θ_2 + y_2),...,v_{n_h} σ′(θ_{n_h} + y_{n_h}))
   b = σ(τ + v · a)
   b̃ = σ′(τ + v · a)
   δ = (l_i − b) · b̃.
3. Update τ, v, θ, and W according to
   τ ← τ + γ δ
   v_j ← v_j + γ δ a_j for j = 1,2,...,n_h
   θ_j ← θ_j + γ δ ã_j for j = 1,2,...,n_h
   w_{j,k} ← w_{j,k} + γ δ ã_j (X_i)_k for j = 1,2,...,n_h and k = 1,2,...,n.

Step 2 above is sometimes referred to as the feed-forward part of the algorithm, and step 3 as the back-propagation portion. That is, in step 2 we take input data X_i and feed it forward through the ANN to arrive at its classification b = σ(τ + v · a) (and record several other quantities along the way). Then we measure the error in the δ = (l_i − b) · b̃ term, and propagate it back through the layers to update each weight parameter.

This multi-layer network, as described above, is capable of solving the exclusive or type problem that neither the Perceptron nor the Adaline ANN could solve (at least applied directly). We implemented an ANN to the specifications above in n = 2 dimensions, with a single hidden layer made up of n_h = 5 neurons, applied to the simulated exclusive or type training data shown in the left plot of Figure 7. The decision function and decision boundary obtained are shown in various formats in the right plot of Figure 7 and in Figure 8.
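As with the earlier models, the feed-forward/back-propagation loop in steps 1-3 can be written out compactly. The sketch below is one possible implementation under the same assumptions as before (numpy, σ = tanh); the small random weight initialization, the epoch-based cycling, and the default parameters are illustrative choices rather than part of the project specification.

```python
import numpy as np

def sigma(x):
    return np.tanh(x)

def sigma_prime(x):
    return 1.0 - np.tanh(x) ** 2          # tanh'(x) = sech^2(x)

def train_mlp(X, labels, n_hidden=5, gamma=0.05, epochs=2000, seed=0):
    """Single-hidden-layer feed-forward ANN trained by back-propagation.

    X is an N x n array and labels take values in {-1, +1}.
    Returns the trained weights (tau, v, theta, W).
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    tau = rng.normal(scale=0.1)
    v = rng.normal(scale=0.1, size=n_hidden)
    theta = rng.normal(scale=0.1, size=n_hidden)
    W = rng.normal(scale=0.1, size=(n_hidden, n))
    for _ in range(epochs):
        for i in range(N):
            # Step 2: feed forward, recording the intermediate quantities.
            y = W @ X[i]
            a = sigma(theta + y)
            a_tilde = v * sigma_prime(theta + y)
            b = sigma(tau + v @ a)
            b_tilde = sigma_prime(tau + v @ a)
            delta = (labels[i] - b) * b_tilde
            # Step 3: back-propagate the error to update every weight.
            tau += gamma * delta
            v += gamma * delta * a
            theta += gamma * delta * a_tilde
            W += gamma * delta * np.outer(a_tilde, X[i])
    return tau, v, theta, W

def mlp_output(tau, v, theta, W, x):
    """F(x) = sigma(tau + v . sigma(theta + W x)); classify by its sign."""
    return sigma(tau + v @ sigma(theta + W @ x))
```

With n = 2 inputs and n_hidden = 5, this mirrors the network of Figure 6; applied to exclusive or type data, it should produce decision regions of the kind shown in Figures 7 and 8.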

Figure 7: The left plot shows a simulated training data set where blue o's are labeled 1 and red o's are labeled −1. The right plot shows the decision rule learned by a multi-layer ANN with a single hidden layer made up of n_h = 5 neurons.

Figure 8: The left plot shows a simulated training data set, and regions colored according to their classification by a multi-layer ANN with one hidden layer of n_h = 5 neurons. The right plot shows a plot of the learned function F(x) with the simulated data plotted on top of it.

5 Possible Extensions

There are many, many directions in which the work here can be extended. We first mention some extensions that are possible using only the Perceptron and Adaline (linear classifier) models. It is possible to preprocess training data so that they can solve exclusive or type problems (as well as other non-linearly separable classification problems). This can be done by simply transforming the original training data, embedding it into a higher dimensional space. For example, suppose we have training data T = {(X_1,l_1),...,(X_N,l_N)}, where each X_i ∈ ℝ^2. We can use the transformation P_2(x_1,x_2) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2) to create an alternate training set T′ = {(P_2(X_1),l_1),...,(P_2(X_N),l_N)}, which now contains labeled data P_2(X_i) ∈ ℝ^5. Now if we apply either the Perceptron or Adaline to this higher dimensional training data T′, we end up with decision boundaries in ℝ^2 that are allowed to be conic sections rather than simply lines. Of course, we could define higher order transformations P_n that allow for polynomials of arbitrary degree (or other functions as well). This highlights a principle in mathematics that, in many situations, one can relax some structural limitations of a model (like requiring a linear classifier) by embedding the problem into higher dimensions (compare this, for example, with reduction of order techniques in ordinary differential equation theory).
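As an illustration of this preprocessing idea, the following sketch (an assumed helper, not part of the original project) applies the quadratic embedding P_2 so that the linear models described earlier can be reused on the transformed data.

```python
import numpy as np

def P2(x):
    """Quadratic embedding P_2(x1, x2) = (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

def embed_training_set(X):
    """Map an N x 2 training array into R^5, row by row."""
    return np.array([P2(x) for x in X])

# The embedded data can then be fed to the linear models sketched earlier,
# e.g. train_adaline(embed_training_set(X), labels).
```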

Another extension is to apply the multi-layer network to approximate a function f : ℝ^n → ℝ given several samples from the graph {(x, f(x))} ⊂ ℝ^n × ℝ. This application is really just a slight shift in point of view, allowing the label to be generally real-valued rather than ±1. Another minor extension is to extend the multi-layer networks described here to ones that allow for many classes, rather than just a binary classification ±1. This can be done by allowing for more than one output neuron in the network. One can also allow for more than one hidden layer (though this may not be very worthwhile, since it is known that single-hidden-layer networks are capable of classifying any pattern given enough neurons in the hidden layer). Beyond that, one could even develop algorithms like the one above for feed-forward ANNs with a topology described by any acyclic directed graph. Other directions include working with Bayesian neural networks (ANNs trained through a Bayesian learning weight parameter update formulation), convolutional neural networks (ANNs that augment multi-layer networks with preprocessing and feature extraction techniques), lattice algebra neural networks (ANNs that, roughly speaking, replace the summation in the dot product w · x with a maximum max(w_1 x_1,...,w_n x_n)), or recurrent neural networks (ANNs that allow for feedback loops in their network topologies). Each of these takes ANNs in a different direction than what was discussed in this project summary, but aspects of the foundational theory are the same.

6 Note From the Author

This is a student project from the Math and Biomedical Research course, taught by the current author Jarod Hart, offered at the University of Kansas in the Spring semester. Some modifications and additions were made to the original project for this summary. The course is supported by the Initiative for Maximizing Student Development (IMSD) through an NIH grant, NIH-NIGMS 5R25GM. The PIs of this IMSD grant are Professors Estela Gavosto (Mathematics Department) and James Orr (Biology Department). We are happy to share these project ideas, and welcome those who are interested to use them. We'd love to hear about your results and extensions related to these projects, and in some cases will provide some support for the projects. Please contact Jarod Hart at jvhart@ku.edu with any typos, errors, questions, or comments about this project summary.
