Neural Network Models in Statistical Learning


Stephen Talley
April 25, 2014

Abstract

Neural network models can solve problems more easily than traditional methods by emulating the human brain. We examine a basic neural network to model regression and to classify data. We conclude with an example of basic ZIP code character recognition.

1 Introduction

1.1 Definition

Neural network models were originally developed in two separate yet equally important fields: statistics and artificial intelligence [1]. However, despite the connotations that the term "neural network" carries, there is nothing highly technical or mysterious about such a model. Rather, a neural network is defined by the following:

Definition 1. A neural network is a nonlinear statistical model that emulates the human brain on a very basic level by adapting to or learning from a set of training patterns [1, 2].

Because a neural network requires a set of training patterns and targets to function properly, it may be characterized as a supervised system, as opposed to an unsupervised system, which infers trends in random, unlabeled data.

Definition 2. A supervised system is a system or algorithm that infers trends from objective (labeled) training data.

1.2 History

1.2.1 Hebb's Rule

The origins of today's neural network models can be traced back to one man's contribution. Dr. Donald Hebb, widely regarded as the father of neuropsychology, outlined an initial theory of biological neural networking in his seminal work The Organization of Behavior (1949) [3].

Theorem 1. Hebb's Rule: When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.

Put simply, "Cells that fire together, wire together." We can also express a simplified version of Hebb's Rule mathematically:

$$\Delta\alpha_{ij} = \eta x_i x_j. \quad (1)$$

In Equation (1), $\Delta\alpha_{ij}$ is the change in connection strength between two given nodes $i$ and $j$, and $\eta$ is a constant learning rate such that $0 \le \eta \le 1$. This neurological rule not only proposed an explanation for associative learning in humans, but also provided the basis for adaptive learning algorithms in computer science.

1.2.2 Early Developments

While the advent of Hebb's Rule is considered the beginning of computational neuroscience, in truth, the first neural network model had already been created six years before [4]. In 1943, Walter Pitts and Warren McCulloch created the first computational neural network using basic algorithms. Unfortunately, because of their model's simplicity, it was only capable of solving simple arithmetic and logic problems [4].

In 1958, Frank Rosenblatt developed the first successful neurocomputer: a single layer neural network, or perceptron model (called single layer since it had only one hidden layer between input and output), which could receive multiple inputs and create a single output from a linear combination of these inputs [2]. The single layer perceptron, shown in Figure 1, was more adaptable than other models at the time and could solve problems more quickly and reliably despite its simplicity [8].

Figure 1: Diagram of a single layer perceptron neural network.

Despite the progress of the perceptron model, it suffered from two limitations. First, the perceptron model could not solve the exclusive-or problem, a logical operation that outputs true only when its inputs differ in truth value [2]. Second, as problems increased in complexity, progressively more inputs were required for classifications, and the computer hardware of the time was simply too limited to handle these problems. Most further advancement in the field stagnated until technology could meet the perceptron model's computational demands [1].
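The exclusive-or limitation described above can be illustrated with a small sketch. The code below is not Rosenblatt's original system; it is a minimal threshold unit trained with the classic perceptron update rule, and the function names, learning rate, and epoch count are assumptions chosen for illustration. It learns the linearly separable AND function but cannot classify all four exclusive-or patterns, since no single line separates them.

```python
def step(w, b, x):
    """Threshold unit: output 1 when the linear combination is positive."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def train_perceptron(samples, epochs=50, eta=0.1):
    """Classic perceptron rule: nudge weights by eta * error * input."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - step(w, b, x)
            w[0] += eta * error * x[0]
            w[1] += eta * error * x[1]
            b += eta * error
    return w, b


def accuracy(samples, w, b):
    """Fraction of patterns classified correctly."""
    hits = sum(1 for x, target in samples if step(w, b, x) == target)
    return hits / len(samples)


# Truth tables for AND (linearly separable) and exclusive-or (not separable).
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND_DATA, *train_perceptron(AND_DATA))
xor_acc = accuracy(XOR_DATA, *train_perceptron(XOR_DATA))
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

However long the unit trains, at least one exclusive-or pattern stays misclassified, which is why a hidden layer with a nonlinear activation is needed for this problem.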


1.2.3 Recent Developments

Computer capabilities did not reach the level required for more complex neural network models until the early 1980s, and interest in the field revived soon after [1]. In particular, the discovery of the back-propagation algorithm in 1986 was crucial for further developments, since it helped to find minima of error functions in any neural network model [4]. Since then, researchers have found many new applications for neural network models, including mathematical finance, data mining, handwriting recognition and (obviously) modeling biological neural systems [1].

1.3 Basics of a Neural Network Model

All neural network models, regardless of application, share some common elements, though the number and complexity of these elements can vary depending on the model used [5]. For a basic model, as shown in Figure 1, each red node represents an individual input $x_i$ in the vector of $p$ inputs $X^T = [x_1, x_2, x_3, \dots, x_p]$. These inputs all form a layer unto themselves, simply called the input layer [2]. Each input is connected to the nodes in the second, hidden layer, and these connections all have values associated with them, called weights. Each weight is initially assigned a random value between 0 and 1, depending on the context of the problem. Then, by using the inputs and weights, the model determines the value of the hidden layer node $Z_m$ by forming the linear combination

$$\alpha_m^T X = \sum_{i=1}^{p} \alpha_{mi} x_i = \alpha_{m1} x_1 + \alpha_{m2} x_2 + \cdots + \alpha_{mp} x_p.$$

Once the value of the linear combination is found, it is inserted into a nonlinear activation function $\sigma$. Usually, this nonlinear function is the sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \quad (2)$$

The sigmoid function is frequently used, particularly for regression models, because it combines nearly linear, curvilinear, and nearly constant behavior depending on the input value [5]. As Figure 2 illustrates, the sigmoid function becomes nearly linear for domain values $-1 < x < 1$. For extreme values of $x$, $\sigma(x)$ becomes nearly constant.

Figure 2: Plot for the general sigmoid function [6].
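As a minimal sketch of the computation just described, the following code evaluates the sigmoid function of Equation (2) and the value of one hidden node from a linear combination of inputs. The input and weight values are made up for illustration, and `hidden_unit` is a hypothetical helper name, not part of any library.

```python
import math


def sigmoid(x):
    """Sigmoid activation from Equation (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_unit(weights, inputs):
    """Form the linear combination of the inputs, then apply the sigmoid."""
    linear = sum(a * x for a, x in zip(weights, inputs))
    return sigmoid(linear)


# Illustrative values only: p = 3 inputs and one hidden node.
inputs = [0.5, -1.0, 2.0]
weights = [0.2, 0.4, 0.1]
z = hidden_unit(weights, inputs)
print("hidden node value:", z)

# Near zero the curve is close to the line 0.5 + x / 4;
# far from zero it flattens toward 0 or 1.
for x in (-10.0, -0.5, 0.0, 0.5, 10.0):
    print(x, sigmoid(x), 0.5 + x / 4)
```

Comparing the last two printed columns shows the nearly linear behavior for small inputs and the nearly constant behavior for extreme inputs.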

1.4 Applications

Because a neural network model can generalize a linear model using a nonlinear function, and because it can learn from data, neural networks can be used for a variety of practical applications. In particular, neural networks are best used for four types of problems [7]:

1. function prediction or approximation,
2. complex data classification (with nonlinear classification boundaries),
3. using internal properties of data for clustering, and
4. time-series forecasting.

1.5 Advantages and Disadvantages of a Neural Network Model

The neural network model offers a few distinct advantages over other types of machine learning algorithms. Because a neural network is a supervised system (i.e., it requires a standard or basis for classification), it requires less formal training to determine a proper algorithm for a given data set [5]. Furthermore, neural networks can detect more complex relationships and interactions among variables thanks to their aforementioned property of deriving parameters from data [8]. One last advantage of the neural network is the ubiquity of training algorithms for working with data, most likely stemming from their variety of applications.

Unfortunately, neural networks also have several disadvantages. Though computer technology has advanced substantially since the neural network's introduction, more complex models still have heavy computational demands that sometimes cannot be met within a reasonable time. Another disadvantage involves the sheer quantity of connections and weights. Since nearly every node in one layer is connected to every node in the next, with a weight for each connection, overfitting the data can be an issue; however, this problem can be controlled either by early stopping or by a process called weight decay, which uses a penalty function to shrink all weights toward zero, thereby pulling the model toward a linear one [1].

2 Body

2.1 Advanced Neural Networks

Obviously, with more advanced computers come more advanced neural networks. Since the transformation functions of the hidden layers are fairly simple, a typical neural network model can, in truth, have up to 100 nodes encompassing multiple hidden layers [1]. In this case, the formula for determining the outputs becomes a multi-step transformation:

$$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad \text{where } X = [x_1, x_2, \dots, x_p],$$
$$T_k = \beta_{0k} + \beta_k^T Z,$$
$$f_k(X) = g_k(T), \quad \text{where } T = [T_1, T_2, \dots, T_k]. \quad (3)$$
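The multi-step transformation in Equation (3) can be sketched as a short forward pass. This is an illustration under stated assumptions rather than a reference implementation: the function name `forward`, the list-based weight layout (one row of alpha per hidden node, one row of beta per output), and the example weights are all hypothetical. When no output function g is supplied, the sketch returns the linear combinations T unchanged.

```python
import math


def sigmoid(x):
    """Sigmoid activation used for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, alpha0, alpha, beta0, beta, g=None):
    """Evaluate Equation (3) for a single input vector x.

    alpha0[m], alpha[m]: bias and weight row of hidden node m.
    beta0[k], beta[k]:   bias and weight row of output k.
    g: optional output function applied to the whole list T.
    """
    # Hidden layer: Z_m = sigma(alpha_0m + alpha_m . x)
    z = [sigmoid(a0 + sum(a * xi for a, xi in zip(row, x)))
         for a0, row in zip(alpha0, alpha)]
    # Output layer: T_k = beta_0k + beta_k . Z
    t = [b0 + sum(b * zm for b, zm in zip(row, z))
         for b0, row in zip(beta0, beta)]
    return t if g is None else g(t)


# Made-up weights: p = 2 inputs, m = 2 hidden nodes, k = 1 output.
x = [0.5, -1.0]
alpha0 = [0.1, -0.2]
alpha = [[0.4, 0.3], [-0.6, 0.9]]
beta0 = [0.05]
beta = [[1.5, -2.0]]
outputs = forward(x, alpha0, alpha, beta0, beta)
print(outputs)
```

Passing a function as `g` (for example, one that maps the list T to class probabilities) turns the same sketch into a classifier.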

Typically, the complexity of these neural network models depends on the following variables: $p$, the number of inputs; $m$, the total number of neurons; and $k$, the number of classes or outputs. Each step of this algorithm alternates between linear and nonlinear transformations of the data. Initially, the neural network forms linear combinations from the original inputs. Then, each linear combination is plugged into the activation function $\sigma$. Unlike the single-layer perceptron model, a multi-layer network makes an additional linear combination $T_k$ from the nonlinear hidden layer values $Z_m$ and subsequently inputs said linear combination into another, different nonlinear function $g_k(T)$.

Note that $g_k(T)$ in Equation (3) is an additional, often final activation function brought about by the inclusion of multiple hidden layers. In some of the earliest multi-layer neural network models (and in some current regression models), $g_k(T) = T_k$; thus, the output was simply a linear combination of the hidden layer values [5]. Classification models later replaced the identity function with the softmax function

$$g_k(T) = \frac{e^{T_k}}{\sum_{l=1}^{K} e^{T_l}}. \quad (4)$$

The softmax function (Equation (4)) was chosen due to its probabilistic properties: each output is between zero and one, and all outputs sum to one [7].

2.2 Overparameterization and Prevention

2.2.1 The Weight Problem

Because the scale of the neural network model depends on both the number of neurons and the number of inputs, the quantity of connections increases as these two variables increase. These weights are designated by two key parameters, $\alpha$ and $\beta$, the complete sets of which are given by the matrices below [1]:

$$\begin{bmatrix}
\alpha_{01} & \alpha_{11} & \alpha_{21} & \cdots & \alpha_{p1} \\
\alpha_{02} & \alpha_{12} & \alpha_{22} & \cdots & \alpha_{p2} \\
\alpha_{03} & \alpha_{13} & \alpha_{23} & \cdots & \alpha_{p3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\alpha_{0m} & \alpha_{1m} & \alpha_{2m} & \cdots & \alpha_{pm}
\end{bmatrix}
\qquad
\begin{bmatrix}
\beta_{01} & \beta_{11} & \beta_{21} & \cdots & \beta_{m1} \\
\beta_{02} & \beta_{12} & \beta_{22} & \cdots & \beta_{m2} \\
\beta_{03} & \beta_{13} & \beta_{23} & \cdots & \beta_{m3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\beta_{0k} & \beta_{1k} & \beta_{2k} & \cdots & \beta_{mk}
\end{bmatrix}$$

Even if errors are minimized, the neural network may overfit the data due to the sheer quantity of weights accounted for in the algorithm [1]. An overfitted model becomes excessively complex, and it will often exaggerate minor or random errors in the data. A simple and efficient way to prevent overfitting is to establish an early stopping rule [5]. An early stopping rule trains the model only for a short time, halting before the error function reaches its minimum, so that the fitted model behaves as if it had fewer effective weights than a fully trained network. This simplifies the model while limiting the potential effect of random error.
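Two quantities from the preceding sections lend themselves to a quick numerical check: the softmax outputs of Equation (4) and the number of weights implied by the two matrices above. The sketch below is illustrative only. The helper names are hypothetical, the subtraction of the largest value inside `softmax` is a common numerical-stability device that does not change the result, and the sizes passed to `count_weights` (256 inputs, 60 hidden neurons, 10 classes) are borrowed from the ZIP code example of Section 2.4.

```python
import math


def softmax(t):
    """Softmax of Equation (4); subtracting max(t) avoids overflow in exp."""
    shift = max(t)
    exps = [math.exp(v - shift) for v in t]
    total = sum(exps)
    return [e / total for e in exps]


def count_weights(p, m, k):
    """Entries in the alpha matrix (m rows, p + 1 columns)
    plus entries in the beta matrix (k rows, m + 1 columns)."""
    return (p + 1) * m + (m + 1) * k


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))

# Network sized like the ZIP code example: 256 inputs, 60 neurons, 10 classes.
n_weights = count_weights(256, 60, 10)
print("weights to estimate:", n_weights)
```

Even this modest single-hidden-layer network carries more than sixteen thousand weights, which makes the overfitting concern above concrete.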

2.2.2 Error Functions and Minima

Aside from the problem of having too many weights, a neural network may also have problems associated with the weights' values. Consequently, we must adjust the values of the initially random weights so that they fit the data well enough to make predictions [1]. For regression models, we use the sum of squared errors as our error function

$$R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2. \quad (5)$$

Note that $R(\theta)$ measures the total difference between the actual class or value and the predicted class or value across all classes $K$ and across all observations $N$. For a classification neural network, we can also use the cross-entropy

$$R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$$

to determine the minimum amount of information needed for categorizing a given observation [5].

2.2.3 Weight Decay

While the aforementioned early-stopping technique can be effective for controlling the effective number of weights, there exists a more explicit method for controlling the quality of weights rather than the quantity: a process known as weight decay [1]. By adding a penalty term to the error function, the error equation becomes $R(\theta) + \lambda J(\theta)$, where

$$J(\theta) = \sum_{km} \beta_{km}^2 + \sum_{ml} \alpha_{ml}^2$$

and $\lambda \ge 0$ represents a tuning parameter [7]. The larger the value of $\lambda$, the more strongly the weights shrink toward 0. As the weights shrink toward 0, the activation (sigmoid) function, and by extension the entire model, reduces to an approximately linear function. The value of $\lambda$ is generally estimated using cross-validation [7]. Weight decay is especially important as it helps to improve prediction on any type of neural network [1].

2.3 Back-propagation

Regardless of the equation used for $R(\theta)$, it is an error term; therefore, we want to keep the value of $R(\theta)$ small. In neural network design, the most popular method for minimizing $R(\theta)$ is back-propagation, which is gradient descent applied in this setting [7]. Quite simply, back-propagation is the process of working backwards from an estimated point using a function's rate of change. Once we have the rate of change and the estimated point, we estimate another, lower point on the function, and we repeat this until we reach a minimum. While the network is training with this algorithm, its weights are continually modified to reduce the mean squared error across all classes and observations [5]. The back-propagation method can be applied to functions of either a single variable or several variables.

Figure 3: Examples of handwritten characters from training data [1].

In this case we only have two parameters that we have any degree of control over, $\alpha$ and $\beta$. Using Equation (5) as our error function, and writing $R_i$ for the contribution of observation $i$ to $R(\theta)$, we obtain the beta and alpha derivatives

$$\frac{\partial R_i}{\partial \beta_{km}} = -2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, z_{mi},$$

$$\frac{\partial R_i}{\partial \alpha_{ml}} = -\sum_{k=1}^{K} 2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_m^T x_i)\, x_{il}.$$

Once the rates of change are determined for the error function, a gradient descent update for the $(r+1)$st iteration takes the form

$$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}},$$

$$\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{ml}},$$

with the derivatives evaluated at the current weights $\beta^{(r)}$ and $\alpha^{(r)}$. The term $\gamma$ in both equations denotes the step size for back-propagation, and it is a constant chosen such that $0 \le \gamma \le 1$. The actual value of the step size should be chosen carefully, as problems may arise if $\gamma$ is either too large or too small. If the step size is too large, the algorithm may overstep the local minimum and arrive at a larger, inaccurate result. If the step size is too small, the algorithm will take a long time to reach the local minimum, sacrificing efficiency in the process.

Due to its simplicity, back-propagation is considered the textbook approach to minimizing error; however, there are other methods that can converge to minima more quickly [1]. Newton's method can be used for the optimization, but because the second derivatives with respect to both parameters can be very complex, it is usually avoided. One more efficient method is a variation of traditional back-propagation called conjugate gradient back-propagation. Conjugate gradient back-propagation is similar to back-propagation, but rather than following the negative gradient for steepest descent, the algorithm uses a line search along conjugate directions to find alternate directions of descent [7]. While this method tends to be faster, it is also more computationally demanding because of the searching required at each step.
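The derivatives and update rule of this section can be sketched for a small network with a single output and an identity output function, so that the derivative of g equals 1 and the sum over k has one term. Everything specific here is an assumption made for illustration: the function names, the exclusive-or training data, the three hidden nodes, the random seed, the step size of 0.01, and the number of iterations. Bias weights are stored as the first entry of each weight list and are updated with the same rule, which the displayed derivatives leave implicit.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(x, alpha, beta):
    """Return hidden values z and output f for one observation.
    alpha[m] = [bias, weights...]; beta = [bias, weights...]."""
    z = [sigmoid(row[0] + sum(a * xi for a, xi in zip(row[1:], x)))
         for row in alpha]
    f = beta[0] + sum(b * zm for b, zm in zip(beta[1:], z))
    return z, f


def total_error(data, alpha, beta):
    """Sum-of-squares error of Equation (5) with one output."""
    return sum((y - predict(x, alpha, beta)[1]) ** 2 for x, y in data)


def gradients(data, alpha, beta):
    """Sum the per-observation derivatives of R_i over all observations."""
    grad_beta = [0.0] * len(beta)
    grad_alpha = [[0.0] * len(row) for row in alpha]
    for x, y in data:
        z, f = predict(x, alpha, beta)
        delta = -2.0 * (y - f)                  # identity output: g' = 1
        grad_beta[0] += delta
        for m, zm in enumerate(z):
            grad_beta[m + 1] += delta * zm
            # sigma'(u) = sigma(u) * (1 - sigma(u))
            s = delta * beta[m + 1] * zm * (1.0 - zm)
            grad_alpha[m][0] += s
            for l, xl in enumerate(x):
                grad_alpha[m][l + 1] += s * xl
    return grad_alpha, grad_beta


def gradient_step(data, alpha, beta, gamma):
    """One gradient descent update with step size gamma."""
    grad_alpha, grad_beta = gradients(data, alpha, beta)
    new_alpha = [[a - gamma * g for a, g in zip(row, grow)]
                 for row, grow in zip(alpha, grad_alpha)]
    new_beta = [b - gamma * g for b, g in zip(beta, grad_beta)]
    return new_alpha, new_beta


random.seed(0)
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
alpha = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
beta = [random.uniform(-0.5, 0.5) for _ in range(4)]

initial_error = total_error(data, alpha, beta)
for _ in range(3000):
    alpha, beta = gradient_step(data, alpha, beta, gamma=0.01)
final_error = total_error(data, alpha, beta)
print("error before:", initial_error, "after:", final_error)
```

Rerunning the loop with a much larger `gamma` is a simple way to observe the overstepping behavior described above, while a much smaller one shows the slow convergence.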

2.4 Example: ZIP Code Character Recognition

2.4.1 The Setup

One of the earliest and best-known problems in neural networks is handwritten character recognition. Because recognizing characters is essentially a classification task (into categories A-Z for letters and 0-9 for numbers), it is an ideal test of a neural network's capabilities. The particular data set for this example is the same one used in a similar neural network test in 1989 [9]; however, this example uses more advanced neural networks and error reduction techniques than the previous experiment. Every digit was scanned from U.S. Postal Service envelopes and then standardized into 16 × 16 pixel grayscale images such as those in Figure 3. The digits were standardized this way to limit certain characteristics (such as the slant or rotation of the number) which could lead to misclassification [1]. Since the digits were 16 × 16, each digit, denoted as an observation, had 256 inputs.

2.4.2 The Procedure

Since this example uses the same data as the 1989 neural network test, the total data set consisted of 9298 handwritten digits, each one an individual observation. This data was divided into two main subsets: a practice set of 7291 observations and a working set of 2007 observations [9]. The practice subset was further divided into randomly assigned training, validation, and test sets to prepare the neural network. Both the inputs and the actual targets (the ground truth of which input belongs to which class) were inserted into a MATLAB program which then constructed the neural network.

The MATLAB program relies on user input for only two parts of a single-layer neural network model: the number of neurons in the hidden layer and the allocation of the training data. Given the subdivisions of the training data mentioned before, we determined that an allocation of 90% training, 5% validation, and 5% testing for the 7291 observations in the practice data set yielded the best performance for a single-layer network.
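The 90% / 5% / 5% allocation of the practice set can be sketched as a random partition of observation indices. This is a hypothetical Python illustration, not the MATLAB program used in the experiment; the function name, the seed, and the rounding convention are assumptions, and MATLAB's own data division may round the subset sizes differently.

```python
import random


def split_indices(n, train_frac=0.90, val_frac=0.05, seed=0):
    """Randomly partition indices 0..n-1 into training, validation, and test sets.
    The test set receives whatever remains after the first two subsets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# 7291 observations in the practice set, as stated above.
train, val, test = split_indices(7291)
print(len(train), len(val), len(test))
```

Because every index lands in exactly one subset, the three sets are disjoint and together cover the whole practice set.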
While using the entirety of the practice set for training would likely have been ideal, restrictions in MATLAB's neural network scripts prevented us from doing so. The primary reason that this allocation was so effective was the high number of observations reserved strictly for training, i.e., preparing the neural network. Since the network itself was actually being prepared for the working data set, the testing and validation portions could be relatively small.

2.4.3 The Results

Once we determined the best allocation to use, the next part to consider was the size of the neural network or, more specifically, the number of neurons in the hidden layer. For the sake of simplicity, we investigated varying network sizes (multiples of 10 neurons) for accuracy. After running each network five times, an average accuracy rating was taken, and the initial findings are given in Table 1. Note that ultimately a 60-neuron network was the most accurate by a slight margin. While neural networks with more than 100 neurons would possibly be more accurate, these would take a considerable amount of time to compute for each iteration of the network.

Table 1: Data for initial neural network tests. Columns: Training %, Validation %, Test %, # of Neurons, Accuracy %.

Once we determined the most accurate allocation and network size, the network was ready for the data. Because MATLAB uses different initial conditions for each neural network test, the best way to gather information was to run the test several times; thus, we decided on running the network one hundred times. This reduced the actual test to setting up a for loop to evaluate the working data set a full one hundred times. The program would then plot a histogram of the classification error and display both the mean and the standard deviation of said error. As an illustration of the time required, this particular loop needed approximately two hours to complete all one hundred trials.

The resulting histogram, shown in Figure 4, showed some interesting results. For one, the error distribution was very right-skewed with only one true outlier of 27% error. Furthermore, both the average error and the standard deviation were far smaller than the findings in Table 1 would have indicated. For this histogram, the mean error = … (meaning that the network had 98.42% accuracy) and the standard deviation = … Though this accuracy may seem high compared to expectations, it is relatively low in comparison with other modern neural networks. For example, as of 2011, multi-layer networks have reported error rates as low as 0.7% [1].

3 Conclusion

Overall, neural network models are incredibly useful and versatile statistical learning tools. This paper only examines the basics of the models themselves, error detection, and possible applications that such models can have. Though other applications, such as regression modeling or time series analysis, along with more thorough multi-layer networks, may be examined at a later date, the material in this paper should be sufficient to give one a proper overview of this fascinating subject.

Figure 4: Histogram of classification error on working set.

References

[1] T. Hastie, R. Tibshirani, J. Friedman, Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer: New York,
[2] K. Gurney, An Introduction to Neural Networks, UCL Press: London, 1997
[3] D. Hebb, The Organization of Behavior, Wiley and Sons: New York, 1949
[4] I.A. Basheer, M. Hajmeer, Artificial Neural Networks: Fundamentals, Computing, Design, and Application, Journal of Microbiological Methods, 43 issue 1, 2000, pp
[5] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Third Edition, Burlington, Massachusetts: Morgan Kaufmann,
[6] Image found at svg
[7] S. Samarasinghe, Neural Networks for Applied Sciences and Engineering, Auerbach: Boca Raton, Florida,
[8] J.V. Tu, Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes, 49 issue 11, 1996, pp
[9] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Back-Propagation Applied to Handwritten ZIP Code Recognition, Neural Computation, 1 (1989) pp
