Self-organizing Neural Networks


Bachelor project: Self-organizing Neural Networks
Advisor: Klaus Meer
Jacob Aae Mikkelsen, 19 10 76
31 December 2006

Abstract

In the first part of this report, a brief introduction is given to neural networks in general and to the principles of learning with a neural network. The Kohonen Self-Organizing Map (SOM) and two algorithms for learning with this type of network are then presented. In the second part, recognition of handwritten characters is described as a specific task for a Kohonen SOM, and three different ways of encoding the input are implemented and tested.

Contents

1 Introduction
2 Introduction to Neural Networks
  2.1 The Neuron
  2.2 Network Structure
  2.3 Kohonen Network
  2.4 Learning with a neural network
3 Kohonen Self-organizing Map
  3.1 Structure
  3.2 Neighborhood function
  3.3 η - the learning parameter
  3.4 Border factor
4 Learning algorithms for SOM
  4.1 The Original incremental SOM algorithm
  4.2 The Batch SOM algorithm
5 Vector Quantization
  5.1 Definitions
  5.2 Vector Quantization algorithm
  5.3 Comparison between VQ and SOM algorithms
6 Kohonen SOM for pattern recognition
  6.1 Preprocessing
  6.2 Implementation
  6.3 Character Classification
  6.4 Results
7 Conclusion
Bibliography

A Test results
  A.1 Results from test 1
  A.2 Parameters for test 2
  A.3 Parameters for test 3
  A.4 Parameters for test 4
  A.5 Parameters for test 5
  A.6 Estimate of τ_1 for Batch Algorithm
  A.7 Parameters for test 6
  A.8 Parameters for test 7
  A.9 Output from test 7
B Characteristics tested for
C Source Code
D Material produced

Chapter 1: Introduction

The background knowledge for this project originates from the course DM74 - Neural Networks and Learning Theory at the University of Southern Denmark, spring semester 2006. The textbook in the course was Haykin [1]. The reader is not expected to have prior knowledge of neural networks, but a background in computer science or mathematics is recommended.

Typical applications of neural networks are function approximation, pattern recognition, statistical learning etc., which makes neural networks useful for a wide range of people such as computer scientists, statisticians, biologists, engineers and so on. Kohonen self-organizing maps are frequently used in speech recognition, but also in pattern classification, statistical analysis and other areas.

In the second chapter, the building block of neural networks, the neuron, is presented, along with an introduction to its computation. The chapter also describes what learning with a neural network is, and which problems can be handled. The third chapter treats the central parts of the Kohonen self-organizing map. This includes the structure of the neurons in the lattice and a description of the neighborhood function, the learning parameter etc. Two learning algorithms for the Kohonen self-organizing map are presented in chapter four: the original incremental SOM algorithm and the batch SOM algorithm. In chapter five, vector quantization is defined and the LBG (Linde-Buzo-Gray) algorithm is described. Relations between the SOM algorithms and vector quantization are treated here as well. Finally, in chapter six, the implementation of the Kohonen SOM for recognition of handwritten characters is treated, including three encodings for the characters as input and a thorough testing of the algorithms. The results from the tests are compared to the results of others using SOM. The source code produced can be found in the appendix, but can also be downloaded from the web, see appendix D.

Chapter 2: Introduction to Neural Networks

In this chapter we take a look at the neuron as a single unit and how it works, the structure of neural networks, and the different learning processes. The basic idea in neural networks is to have a network built from fundamental units, neurons, that interact. Each neuron is able to get input from other neurons, process it in a certain way, and produce an output signal. Although work had been done in neural network research earlier, the modern research was started by McCulloch and Pitts, who published A Logical Calculus of the Ideas Immanent in Nervous Activity in 1943 [2]. There are two approaches to neural networks. The first idea is to model the brain. The second is to create heuristics which are biologically inspired. In the early years the first was dominant, but now the second is prevailing.

2.1 The Neuron

The neuron works as follows: it gets input from other neurons or input synapses X_1, X_2, ..., X_n. Then it combines these signals in a certain way and computes the output.

Figure 2.1: A model of neuron k

2.1.1 Computation of neuron k

1. On input signals X_1, X_2, ..., X_n the sum

   v_k = Σ_{i=1}^{n} W_{ki} X_i + b_k

   is computed.
2. The activation function φ is applied to v_k:

   y_k = φ(v_k)

   This will be the next output of neuron k.

Activation functions

Different activation functions are used in the different types of networks. Often used functions are the threshold function (figure 2.2 (a), equation 2.1), used by McCulloch and Pitts, a stepwise linear function (figure 2.2 (b), equation 2.2), and the sigmoidal activation (figure 2.2 (c), equation 2.3). The stepwise linear function is continuous, but not differentiable, whereas the sigmoidal function is both continuous and differentiable. The threshold function can be approximated by the sigmoidal function by choosing a sufficiently large a.

(a) Threshold   (b) Stepwise linear   (c) Sigmoidal (a = 3)
Figure 2.2: Three different activation functions

   φ(v) = 0 if v < 0;  1 if v ≥ 0                                      (2.1)

   φ(v) = 0 if v < −1/2;  v + 1/2 if −1/2 ≤ v ≤ 1/2;  1 if v > 1/2     (2.2)

   φ(v) = 1 / (1 + exp(−a v))                                          (2.3)

When a neuron using the threshold function outputs a 1, it is said to fire. In the Kohonen self-organizing network, the activation function is the identity function φ(v) = v, so y_k = v_k.

2.2 Network Structure

Neurons are often connected in layers to form a neural network. The first layer is called the input layer, the last layer is called the output layer, and the ones in between are called hidden layers. The neurons are connected in different ways depending on the network type. If the underlying structure is an acyclic directed graph, with neurons as vertices, the network is called a feed-forward network, as opposed to a recurrent network, where the underlying structure is directed but not acyclic. We suppose the network is clocked in discrete time steps. The output of a neuron is sent to its neighbor neurons; the weighted sum and the activation are computed, and one step later the output of the neighbor neuron is available. In Kohonen networks there is just one layer of neurons, thus no hidden layers, and the input layer is the same as the output layer. All the neurons in the layer are, however, connected. In one time step in the Kohonen network, a complete update of the neurons is made.
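To make the computation of section 2.1.1 concrete, the following is a minimal Java sketch of a single neuron using the sigmoidal activation (2.3). It is only an illustration under assumed names; it is not taken from the project's source code in appendix C.

// Sketch of the computation of a single neuron k (section 2.1.1).
// All names are illustrative; the project's own classes may differ.
public class Neuron {
    private final double[] weights; // W_k1, ..., W_kn
    private final double bias;      // b_k
    private final double a;         // slope of the sigmoid, equation (2.3)

    public Neuron(double[] weights, double bias, double a) {
        this.weights = weights;
        this.bias = bias;
        this.a = a;
    }

    // v_k = sum over i of W_ki * X_i, plus b_k
    public double inducedField(double[] x) {
        double v = bias;
        for (int i = 0; i < weights.length; i++) {
            v += weights[i] * x[i];
        }
        return v;
    }

    // y_k = phi(v_k) with the sigmoidal activation (2.3).
    // In a Kohonen network phi is the identity, so y_k = v_k.
    public double output(double[] x) {
        double v = inducedField(x);
        return 1.0 / (1.0 + Math.exp(-a * v));
    }
}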

2.3 Kohonen Network

The Kohonen neural network works as follows. First the weights of the neurons in the network are initialized to random values. When a sample is presented, the neurons calculate their output, which is the distance between their weight and the sample presented. The neuron with the smallest (or largest, depending on the metric used) distance is found; this neuron is called the winning neuron. The neurons in the network are then all updated, moving their weights in the direction of the sample presented. The winning neuron is updated the most, and the greater the distance in the lattice between the winner and a neuron, the less that neuron is updated. In training, the samples are repeatedly presented to the network, and the amount by which the weights of the neurons are moved in a single time step decreases, just as the radius within which surrounding neurons are affected decreases. Toward the end of training, a training sample affects a single neuron, and only by a small amount. Because of the winning neuron's influence on neurons close to it, the network lattice obtains a topological ordering, where neurons located close to each other have similar weights. The different elements of the Kohonen network are described in detail in chapter 3.

2.4 Learning with a neural network

Learning with a neural network depends on the problem considered. Typically the network is supposed to deduce general characteristics from the samples presented. If the task is pattern recognition/clustering, samples from the same cluster should be recognized as being alike, and thus be recognized by the same neuron in the lattice. If the problem is approximation of a function, learning consists of updating the weights in the network, so that the output produced when the network is presented with an input is closer to the output corresponding to that input. This holds for all the inputs presented, but still without overfitting the function, where the outputs at the points presented are exactly correct, but the function fluctuates in between.

2.4.1 Supervised and unsupervised learning

In supervised learning, for each sample presented the result (or the correct class, if it is pattern classification) is known, and the network can use the result to correct the weights of the neurons in the right direction. This is typical for the perceptron network and the radial-basis function networks. The learning rules for this type of learning typically use error correction or minimization of the output error.

When using an unsupervised strategy, the network does not know the classification or result of the input sample until learning has been completed. After learning, the network is calibrated by a manually analyzed set of samples. One type of learning for the unsupervised network is correlation based learning, which tries to find correlations between neurons that always fire at the same time; it then makes sense to increase the weight between two such correlated neurons. Another type is competitive learning, which tries to find a winning neuron when the input comes from a particular cluster of points in the input space, and changes the weight of this neuron.

2.4.2 Hebb based learning

Hebbian based learning is used for the correlation learning scenario. Consider two neurons A and B, and suppose we observe that each time A fires, B fires in the next step. This can be incorporated in the following rule for how to change the weight W_AB between A and B:

   W_AB^new = W_AB^old + Δ_AB                                          (2.4)

with

   Δ_AB = η X_A Y_B                                                    (2.5)

where X_A and Y_B are the earlier outputs of the two neurons and η > 0 is a constant.

2.4.3 Competitive learning

The overall goal is classification of inputs. Several output neurons compete against each other; the one with the largest output value wins and fires. If for an input x ∈ R^n the output neuron X_k wins, then the weights coming into X_k are changed in the direction of x. This is done by updating the weights W_kj as follows:

   W_kj^new = W_kj^old + Δ_kj                                          (2.6)

where

   Δ_kj = η (x_j − W_kj)                                               (2.7)

and η is the learning parameter, with 0 < η < 1. The Kohonen self-organizing map is an example of competitive learning, where the neighbors of the winning neuron are also updated, and the learning parameter decreases over time.
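As an illustration of the competitive update (2.6)-(2.7), here is a minimal Java sketch that finds the winning neuron for a sample and moves its weight vector toward the sample. The names and the array layout are assumptions made for the example, not the project's code.

// Sketch of one competitive-learning step, equations (2.6)-(2.7).
// weights[k] is the weight vector of output neuron k; names are illustrative.
public final class CompetitiveStep {

    public static void update(double[][] weights, double[] x, double eta) {
        // The winner is the neuron whose weight vector is closest to x
        // (Euclidean distance).
        int winner = 0;
        double best = Double.MAX_VALUE;
        for (int k = 0; k < weights.length; k++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - weights[k][j];
                d += diff * diff;
            }
            if (d < best) { best = d; winner = k; }
        }
        // Only the winner learns: W_kj <- W_kj + eta * (x_j - W_kj).
        for (int j = 0; j < x.length; j++) {
            weights[winner][j] += eta * (x[j] - weights[winner][j]);
        }
    }
}

In the Kohonen SOM described in the next chapter, the same step is applied, but the neighbors of the winner are updated as well, scaled by the neighborhood function.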

Chapter 3: Kohonen Self-organizing Map

The Kohonen SOM works like the brain in pattern recognition tasks: when presented with an input, it excites neurons in a specific area. It does this by mapping a large input space to a smaller output space. To achieve this, the Kohonen SOM algorithms build a lattice of neurons, where neurons located close to each other have similar characteristics. There are several key concepts in this process: the structure of the lattice, how the geographical distance between neurons influences the training (neighborhood), how much each update time step should affect each neuron (learning parameter), and making sure all samples fall within the lattice (border factor). These matters are described in this chapter.

3.1 Structure

As described in section 2.2 there is only one layer of neurons in the Kohonen self-organizing map. The location of the neurons in relation to each other in the layer is important.

(a) Network structure using squares   (b) Network structure using regular hexagons
Figure 3.1: Two different lattice structures

The neurons can in principle be located in a square lattice, a hexagonal lattice or even an irregular pattern. When using a structure where the neurons form a pattern of regular hexagons, each neuron has six neighbors at the same geographical distance, whereas the square network structure only has four. This causes more neurons to belong to the same neighborhood, and thus to be updated with the same magnitude in the direction of the winner. Neighboring neurons in the net should end up being mutually similar, and the hexagonal lattice does not favor the horizontal or vertical direction as much as the square lattice (Makhoul et al. [3] and Kohonen [4]).

3.2 Neighborhood function

When using Kohonen self-organizing maps, the distance in the lattice between two neurons influences the learning process. The neighborhood function should decrease over time for all but the winning neuron. A typical choice of neighborhood function is

   h_{j,i(x)}(n) = exp( −d²_{j,i} / (2σ²(n)) )                          (3.1)

where d_{j,i} is the lattice distance between neuron j and the winning neuron i(x). For σ_0, the radius of the net (or a little bigger) is recommended, and σ is calculated by

   σ(n) = σ_0 exp( −n / τ_1 )                                           (3.2)

For τ_1, Haykin [1] recommends 1000/log(σ_0). In figure 3.2 the neighborhood function is displayed for two neurons at distance 1 and different τ_1 values.

Figure 3.2: How the neighborhood function decreases over time, with d = 1 and a) τ_1 = 1000, b) τ_1 = 5000, c) τ_1 = 10000 and d) τ_1 = 15000

This neighborhood function provides a percentage-like measure of how close the neurons are. The function decreases not just over time, but also with the geographical distance between the two neurons in the net. Another possible neighborhood function is a boolean function, simply decreasing the radius over time: at first all neurons are in the neighborhood, but as the radius decreases, the number of neurons in the neighborhood decreases, ending with only the neuron itself. Raising τ_1 causes the neighborhood function to decrease more slowly, and therefore the ordering takes longer.

3.3 η - the learning parameter

The learning parameter also decreases over time. One possible formula is

   η(n) = η_0 exp( −n / τ_2 )                                           (3.3)

Another possibility is

   η(n) = 0.9 (1 − n/1000)                                              (3.4)

A graph of η decreasing over time using formula 3.3 can be seen in figure 3.3 with different values for τ_2. It is easy to see that raising τ_2 makes the graph of η decrease more slowly. After the ordering phase, the learning parameter should remain at a low value like 0.01 (Haykin [1]) or 0.02 (Kohonen [4]).

Figure 3.3: η displayed over time, with a) τ_2 = 1000, b) τ_2 = 2000 and c) τ_2 = 5000

3.4 Border factor

To make sure all samples fall within the lattice, an enlarged update of the neurons on the border of the lattice is made once such a neuron is the winner. The ordering phase is the first part of the algorithm defined in section 4.1. The function B(i(x)) is defined as follows:

   B(i(x)) = 1   if the ordering phase is not over; otherwise:
             2b  if i(x) is located in a corner of the lattice,
             b   if i(x) is located on an edge,
             1   otherwise                                              (3.5)

where b is a constant. The border factor increases the update of the neurons on the border of the lattice, thus stretching it, so that more of the samples will be matched inside the net and not outside. If we do not use the border factor, we risk having too many samples matching on the outside of the lattice, and the network will not converge properly. If b is set to 1, the border factor is simply ignored.
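The time-dependent quantities of this chapter can be collected in a small helper. The following Java sketch shows how σ(n), the neighborhood value (3.1), the learning parameter (3.3) and the border factor (3.5) could be computed; the names, the rectangular corner/edge test and the method signatures are assumptions for illustration, not the project's implementation.

// Sketch of the time-dependent SOM parameters, equations (3.1)-(3.5).
public final class SomSchedules {

    // sigma(n) = sigma_0 * exp(-n / tau_1), equation (3.2)
    public static double sigma(int n, double sigma0, double tau1) {
        return sigma0 * Math.exp(-n / tau1);
    }

    // h_{j,i}(n) = exp(-d^2 / (2 sigma(n)^2)), equation (3.1),
    // where d is the lattice distance between neuron j and the winner i(x).
    public static double neighborhood(double d, int n, double sigma0, double tau1) {
        double s = sigma(n, sigma0, tau1);
        return Math.exp(-(d * d) / (2.0 * s * s));
    }

    // eta(n) = eta_0 * exp(-n / tau_2), equation (3.3)
    public static double learningRate(int n, double eta0, double tau2) {
        return eta0 * Math.exp(-n / tau2);
    }

    // Border factor B, equation (3.5): 1 during the ordering phase,
    // afterwards 2b in a corner, b on an edge, 1 in the interior.
    public static double borderFactor(int row, int col, int rows, int cols,
                                      boolean orderingPhase, double b) {
        if (orderingPhase) return 1.0;
        boolean onRowBorder = (row == 0 || row == rows - 1);
        boolean onColBorder = (col == 0 || col == cols - 1);
        if (onRowBorder && onColBorder) return 2.0 * b; // corner
        if (onRowBorder || onColBorder) return b;       // edge
        return 1.0;
    }
}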

Chapter 4: Learning algorithms for SOM

In this chapter we take a look at two fundamentally different algorithms for the Kohonen SOM: the original incremental algorithm and the batch learning algorithm. The main difference between the two algorithms is found after the initialization. The incremental method repeats the following steps: process one training sample, then update the weights. The batch learning method repeats the steps: process all the training samples, then update the weights. The incremental algorithm repeats its steps many more times than the batch algorithm.

4.1 The Original incremental SOM algorithm

The original SOM algorithm works by updating neurons in a neighborhood around the best matching neuron. The units closest to the winning neuron are allowed to learn the most. Learning, or updating, consists of a linear combination of the old weight and the new sample presented to the network. The algorithm is defined as follows:

1. Initialization: The weights of the neurons in the network are initialized.
2. Sampling: Samples are drawn from the input space in accordance with the probability distribution of the input.
3. Similarity matching: The winning neuron is determined as the one with the most similar weight vector compared to the sample presented.

4. Updating: The neurons in the net are all updated according to the updating rule, equation 4.1.
5. Continuation: Continue steps 2-4 until no noticeable changes in the weights are observed.
6. Calibration: The neurons are labeled from a set of manually analyzed samples.

The first approximately 1000 steps (Haykin [1]) or more of the algorithm constitute a self-organizing/ordering phase, and the rest is called the convergence phase. In the ordering phase, the general topology of the input vectors is ordered in the net. In the convergence phase the map is fine tuned to provide an accurate quantification of the input data.

4.1.1 Initialization

This can be done by using randomized values for the entries in the weights or by selecting different training samples as the initial weights for the neurons. The samples are vectors in R^n, and thus the weight of each neuron is a vector in R^n.

4.1.2 Sampling

The samples can be drawn from the input set by selecting samples at random or by presenting the samples in a cyclic way, see Kohonen [4].

4.1.3 Similarity Matching

Similarity is measured by the Euclidean distance between the weights of the neurons in the net and the selected input. It could alternatively be done by selecting the largest dot product. The winning neuron is the one updated the most, and in the last part of the algorithm (the last time steps) it is the only neuron updated, which justifies the label "winner takes all".

4.1.4 Updating the neurons

For each sample x ∈ R^n presented, all neurons in the net are updated according to the following rule:

   W_j(n+1) = W_j(n) + η(n) B(i(x)) h_{j,i(x)}(n) (x − W_j(n))          (4.1)

where x is the sample presented to the network, i(x) is the index of the winning neuron, j is the index of the neuron we are updating, η is the learning parameter depending on the time step n, W_j(n) is the weight of neuron j in time step n, and the term B(i(x)) is the border factor.

4.1.5 Continuation

The algorithm must run until there are no notable changes in the updates of the network. This is difficult to measure, but as a guideline for the number of iterations in the convergence phase, Kohonen [4] recommends at least 500 times the number of neurons in the network.

4.1.6 Calibration

Once the network has finished training, the neurons in the lattice must be assigned the class they correspond to. This can be accomplished with a manually analyzed set of samples, where each neuron is assigned the class of the sample it resembles the most.

4.2 The Batch SOM algorithm

The original algorithm works online, thus not needing all samples at the same time. This is not the case in the Batch SOM algorithm: here all training samples must be available when learning begins, since they are all considered in each step. The algorithm is described here:

1. Initialization of the weights: by using random weights or the first K training vectors, where K is the number of neurons.
2. Sort the samples: each training sample's best matching neuron is found, and the sample is added to this neuron's list of samples.
3. Update: all weights of the neurons are updated, as a weighted mean of the lists found in step 2.
4. Continuation: repeat steps 2-3 a number of times, depending on the neighborhood function.
5. Calibration: the neurons are labeled from a set of manually analyzed samples.

4.2.1 Initialization of the weights

The initialization of the weights can be done in the same way as in the original incremental algorithm.

4.2.2 Sort the samples

All the input samples are distributed into lists, one for each neuron, where each sample is added to the list corresponding to the neuron it matches best.

4.2.3 Updating the neurons

For the new weights, the mean of the weight vectors in the neighborhood is taken, where the neighborhood is defined as in the incremental algorithm, and the weight vector for each neuron is the mean of the samples in the list collected in step 2. The new weights in the network are calculated from the following update rule:

   W_j(n+1) = Σ_i n_i h_{ji} x̄_i / Σ_i n_i h_{ji}                       (4.2)

where x̄_i is the mean of the vectors in the list collected in step 2 for neuron i, n_i is the number of elements in that list, and both sums are taken over all neurons in the net. When the edges are updated, the mean of the neuron's own list is weighted higher than the others, and in the corners even higher. The neighborhood function is the same as in the incremental algorithm, only with different parameters, so that in the last couple of iterations the neighborhood consists of only the single neuron.

4.2.4 Continuation

This algorithm typically loops through steps two and three 15 to 25 times (Kohonen [4]). This should be sufficient for the final updates to be very moderate.

4.2.5 Calibration

As in the incremental algorithm, the neurons are assigned a label from the class they are most similar to, taken from a set of manually analyzed samples.
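To tie the pieces of chapter 3 and the incremental algorithm together, here is a minimal Java sketch of one iteration of update rule (4.1) with the Gaussian neighborhood (3.1). It uses a square grid for the lattice distance and leaves out the border factor (B = 1) for brevity; the names and the array layout are assumptions for illustration, not the project's implementation (the project uses a hexagonal lattice).

// Sketch of one step of the incremental SOM algorithm, rule (4.1).
// weights[j] is the weight vector of neuron j; the lattice has `cols`
// columns, and neuron j sits at row j / cols, column j % cols.
public final class IncrementalSomStep {

    public static void step(double[][] weights, int cols, double[] x, int n,
                            double eta0, double tau2, double sigma0, double tau1) {
        // 1. Similarity matching: find the winning neuron i(x)
        //    as the one with the smallest Euclidean distance to x.
        int winner = 0;
        double best = Double.MAX_VALUE;
        for (int j = 0; j < weights.length; j++) {
            double d = 0.0;
            for (int m = 0; m < x.length; m++) {
                double diff = x[m] - weights[j][m];
                d += diff * diff;
            }
            if (d < best) { best = d; winner = j; }
        }
        // 2. Time-dependent parameters, equations (3.2) and (3.3).
        double sigma = sigma0 * Math.exp(-n / tau1);
        double eta = eta0 * Math.exp(-n / tau2);
        int wr = winner / cols, wc = winner % cols;
        // 3. Update every neuron toward the sample, weighted by the
        //    Gaussian neighborhood (3.1); border factor omitted (B = 1).
        for (int j = 0; j < weights.length; j++) {
            double dr = j / cols - wr, dc = j % cols - wc;
            double dist2 = dr * dr + dc * dc; // squared lattice distance
            double h = Math.exp(-dist2 / (2.0 * sigma * sigma));
            for (int m = 0; m < x.length; m++) {
                weights[j][m] += eta * h * (x[m] - weights[j][m]);
            }
        }
    }
}

The batch variant of section 4.2 replaces step 3 above: it first collects, for each neuron, the mean of the samples in its list, and then applies the neighborhood-weighted mean of equation (4.2) once per pass over the whole training set.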

Chapter 5: Vector Quantization

Vector quantization is typically used in lossy image compression, where the main idea is to code values from a high dimensional input space into a key in a discrete output space of lower dimension, to reduce the bit rate of transfer or the space needed for archiving. Compression is extensively used in codecs for video and images, and is especially important when a signal must pass through a low transmission rate network and still maintain an acceptable quality [5]. One example of compression is the HTML safe color codes, where the entire color spectrum is encoded into 216 safe colors, which should always look the same no matter which browser is used. Note that 216 different colors can easily be encoded using 8 bits, compressing the encoding of the millions of colors monitors today can handle. Vector quantization can generate a better choice of colors corresponding to each of the (in this case 216) possibilities, depending on how extensively the different colors are used.

5.1 Definitions

As seen in figure 5.1, selecting 10 codebook vectors (the dots), the vectors lying in the same area bounded by the lines (or in general, hyperplanes, if the dimension of the vectors is greater than two) form the Voronoi set corresponding to that codebook vector. The lines bounding the Voronoi sets are altogether called the Voronoi tessellation.

Figure 5.1: 10 codebook vectors (the dots) and the corresponding Voronoi tessellation (the lines) in R^2

5.2 Vector Quantization algorithm

Lloyd's algorithm is the original algorithm for vector quantization. The LBG (Linde-Buzo-Gray) algorithm, which is very similar to the K-means algorithm, is a generalization of Lloyd's algorithm, see Gray [6]. It tries to find the best location of the codebook vectors according to the samples presented. It works as follows:

1. Determine the number of vectors in the codebook, N.
2. Select N random vectors and let them be the initial codebook vectors.
3. Clusterize the input vectors around the vectors in the codebook using a distance measure.
4. Compute the new set of codebook vectors by taking the mean of the clustered vectors for each codebook vector.
5. Repeat 3 and 4 until the change in codebook vectors is very small.

In compression, by selecting N codebook vectors, where N is smaller than the original number of colors in the picture or the video stream, the algorithm tries to find the N colors which give the smallest distortion in the picture or video stream.

5.3 Comparison between VQ and SOM algorithms

Especially the batch SOM algorithm has similarities with the LBG algorithm. The first step in the LBG algorithm, selecting the number of codebook vectors, corresponds to choosing the number of neurons in the lattice used in the Batch SOM algorithm. Both algorithms can use randomly selected input samples as their initial values.

The third step in both algorithms is essentially identical: in both algorithms it is a separation of the input samples into Voronoi sets. Compared to the Batch SOM algorithm, only step 4 in the LBG algorithm, the calculation of the new codebook vectors, is really different. In the SOM algorithm, the mean of the neighboring neurons is included with a weight decided by the distance between the neurons; this is what gives the Batch SOM algorithm a topological ordering. The last step is identical in the two algorithms, and when the neighborhood in the Batch SOM algorithm has decreased to only the single neuron, the two algorithms behave exactly the same. In [5] the Kohonen algorithm and vector quantization algorithms are compared, documenting that the Kohonen algorithms usually do at least as well as the vector quantization algorithms in practice, although this has only been shown empirically.
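As an illustration of steps 3 and 4 of the LBG algorithm in section 5.2, here is a minimal Java sketch of one iteration; the names are assumptions made for the example, empty clusters are simply left unchanged, and it is not the project's code.

// Sketch of one LBG iteration (section 5.2): cluster the samples around
// the codebook vectors, then move each codebook vector to the mean of
// its Voronoi set.
public final class LbgStep {

    public static void iterate(double[][] codebook, double[][] samples) {
        int n = codebook.length, dim = codebook[0].length;
        double[][] sums = new double[n][dim];
        int[] counts = new int[n];
        // Step 3: assign each sample to its nearest codebook vector
        // (squared Euclidean distance).
        for (double[] x : samples) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int k = 0; k < n; k++) {
                double d = 0.0;
                for (int m = 0; m < dim; m++) {
                    double diff = x[m] - codebook[k][m];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = k; }
            }
            counts[best]++;
            for (int m = 0; m < dim; m++) sums[best][m] += x[m];
        }
        // Step 4: replace each codebook vector by the mean of its cluster.
        for (int k = 0; k < n; k++) {
            if (counts[k] == 0) continue; // empty Voronoi set: keep the old vector
            for (int m = 0; m < dim; m++) {
                codebook[k][m] = sums[k][m] / counts[k];
            }
        }
    }
}

When the neighborhood in the Batch SOM update (4.2) has shrunk to a single neuron, that update reduces to exactly this cluster-mean step.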

Chapter 6: Kohonen SOM for pattern recognition

In this chapter we take a look at the Kohonen self-organizing map implemented to recognize handwritten characters. This recognition problem is a typical example of a clustering problem, where the clusters are the different characters available as input. After the characters are collected, there are two main stages in this process: preprocessing and classification. The purpose of the preprocessing stage is to produce an encoding for the characters that can be used as input to the network. Three different encodings have been implemented and are described. In the classification stage, the Kohonen self-organizing map performs the training, and the results are reported. Last in the chapter, the test results are evaluated and compared to results in other papers.

To collect data, 200 persons were first asked to fill out a form as seen in D.1, and one person filled out 20 forms. The forms were then scanned using a HP PSC 1402 scanner, and the individual letters were saved as single files, using a tool developed to do exactly this. To reduce the size of the images (for the one-to-one encoding) Faststone Foto Resizer was used. The source code of the preprocessing and the algorithm can be found in appendix C. The tests have been run from inside Eclipse, and the parameters have therefore been changed directly in the source code.

6.1 Preprocessing

Three different input encodings have been tried. Figure 6.1 and figure 6.2 show the letters A and O written by different writers. It is clearly seen that regardless of the encoding used, they will be different, making the task of recognition complex.

Figure 6.1: Example of different versions of the capital letter A
Figure 6.2: Example of different versions of the capital letter O

The three encodings are described in the next sections.

6.1.1 One-to-one of pixels and input

The first approach was to reduce the size of the characters to a dimension of 20 times 25 pixels, making the input vector have 500 entries, and then to map, row by row, the white pixels to a 0 and the black pixels to a 1. It turned out that the scaling of the input and the relatively large input dimension made training ineffective and complex, without yielding any impressive results. Figure A.1 shows a visual output of the weights of the neurons in the network after training. The large input dimension makes the amount of calculation in the algorithm too high to have the desired number of neurons in the network on a simple personal computer. This method was therefore discarded and will only be briefly mentioned in this report.

6.1.2 Division into lines

The second encoding tried uses the number of black lines seen vertically and the number of lines seen horizontally in the letter. Figure 6.3 shows an example of this encoding. In the project, most of the letters have a dimension (not reduced as in the first approach) of 60 times 80 pixels. This encoding takes 30 evenly spaced counts of the vertical lines seen in each horizontal row, and 20 evenly spaced counts of the horizontal lines seen in each vertical column. This makes the input dimension 50, a factor of 10 less than in the first approach. Furthermore, the input values in the vector are integers, and not boolean values (zero or one).

Figure 6.3: Simple 5 × 4 example of the encoding for the character R, making it [1,2,1,2,2,1,2,3,2]^T

The results for most of the letters are fairly good, but as seen in figure 6.4 the encoding for two different characters can be very similar if they are symmetric and the mirrored images of each other are alike.

Figure 6.4: Example of the same encoding for different characters

6.1.3 Characteristics

The final encoding tested in this project preprocesses the characters to see which characteristics they possess, and codes the input vector with zeros and ones accordingly. The testing for the characteristics is thus vital, but unfortunately sensitive to rotations etc. The complete list of characteristics and the methods used to examine them can be found in appendix B.

6.1.4 Possible encodings not used

The two latter encodings unfortunately do not use the same numeric values, otherwise combining them would have been a possibility. In the articles studied for this project, different encodings have been used besides the three described above. These include Fourier coefficients extracted from the handwritten shapes, as seen in [7], and an encoding using the Shadow Method described by Jameel and Koutsougeras [8].
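As an illustration of the division-into-lines encoding of section 6.1.2, the following Java sketch counts the number of separate black runs crossed by a given pixel row; applying the same routine to columns gives the horizontal counts. The boolean image layout and the sampling of evenly spaced rows are assumptions for the example, not the project's code.

// Sketch of the line-count encoding (section 6.1.2).
// image[y][x] == true means the pixel at row y, column x is black.
public final class LineCountEncoding {

    // Count how many separate runs of black pixels the given row crosses.
    public static int blackRunsInRow(boolean[][] image, int row) {
        int runs = 0;
        boolean inRun = false;
        for (int x = 0; x < image[row].length; x++) {
            if (image[row][x]) {
                if (!inRun) { runs++; inRun = true; }
            } else {
                inRun = false;
            }
        }
        return runs;
    }

    // Sample a number of evenly spaced rows (30 in the report); the 20
    // column counts are obtained analogously on the transposed image.
    public static int[] encodeRows(boolean[][] image, int samples) {
        int[] code = new int[samples];
        for (int i = 0; i < samples; i++) {
            int row = i * image.length / samples;
            code[i] = blackRunsInRow(image, row);
        }
        return code;
    }
}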

6.2 Implementation

When implementing the network, the hexagonal lattice was chosen. The charts shown in A.1 do not display this; all even lines should be shifted half a character to the right. When presenting the samples to the network, they are chosen at random from the entire set of training samples.

6.3 Character Classification

To limit the assignment, only the capital letters A, B, ..., Z have been taken into consideration in the pattern classification. The first data set consists of one sample from each of 200 test persons, making the total set of individual characters 5200. Of those, 3900 are used to train the network and 1300 to test it. The second data set consists of 20 samples from one person. This makes a total of 520 separate characters, of which 390 are used for training and 130 for testing. During training the network did not know the label of each character; the labels were only used when the letters were assigned to the neurons in the calibration phase.

There are a number of parameters that can be changed in the algorithms, so the testing will focus on the following:

1. The different encodings for the input
2. The number of neurons in the network
3. The number of iterations
4. The τ parameters
5. The start of the convergence phase

The incremental algorithm using the first data set is tested the most, since this is expected to be a more challenging task than a data set from only one person.

Test 1: Encodings

First the best encoding, in this implementation, is decided, using a small number of neurons and iterations, and the same τ parameters and start of the convergence phase. Table A.1 shows the parameters used in this test, and table 6.1 shows the results.

Encoding   Rate of success, training set   Rate of success, test set
1          0.39462                         0.39308
2          0.48795                         0.50462
3          0.57538                         0.57846
Table 6.1: Results for test 1

It seems that the high dimensional one-to-one mapping is less successful than the two others. The running time of the algorithm is also much longer for the first encoding (R^500) than for the two other encodings (R^50 or R^63), and therefore we disregard this encoding in the further tests.

Test 2: Number of neurons

Using the result of test 1, we now only examine the second and third encoding. In this test, the parameters are kept the same, only changing the dimension of the network, and thus the number of neurons. It is worth noting that since we initialize and draw samples at random, and only run each test one time, there is a bit of inaccuracy in the results. Obviously it would be better to run each test a number of times and take the average of the results. The results of test 2 can be seen in table 6.2, and the parameters used in table A.2.

Lattice dimension   Total neurons   Encoding 2, training   Encoding 2, test   Encoding 3, training   Encoding 3, test
10*10               100             0.4720                 0.5061             0.5562                 0.6254
20*10               200             0.4713                 0.4777             0.5661                 0.6146
30*15               450             0.4979                 0.5177             0.5936                 0.6069
30*20               600             0.5097                 0.5323             0.5936                 0.6215
40*20               800             0.5146                 0.5369             0.6028                 0.6338
40*25               1000            0.5292                 0.5438             0.6103                 0.6254
40*30               1200            0.5210                 0.5238             0.6215                 0.6408
40*40               1600            0.5336                 0.5331             0.6203                 0.6238
60*30               1800            0.5262                 0.5262             0.6100                 0.6154
50*40               2000            0.5297                 0.5200             0.6144                 0.6423
60*40               2400            0.5356                 0.5346             0.6133                 0.6338
50*50               2500            0.5321                 0.5208             0.6364                 0.6631
70*40               2800            0.5428                 0.5446             0.6228                 0.6454
Table 6.2: Results for test 2

Here, the shape of the lattice is not really considered. Kohonen [4] recommends not to use a square, but instead a rectangular lattice. These results are not conclusive as to whether a square or a rectangle is the best choice, and increasing the number of neurons in the lattice does not seem to have a large effect on the success rate.

Figure 6.5: Rate of success depending on the number of neurons in the lattice. Plus marks are the test set for encoding 2, dots are the test set for encoding 3

Test 3: Number of iterations

We now test the number of iterations, using encodings 2 and 3 and the dimension 40 times 30, which seemed to be a fair choice according to test 2.

Iterations   Encoding 2, training   Encoding 2, test   Encoding 3, training   Encoding 3, test
10000        0.4074                 0.4469             0.5426                 0.5854
20000        0.4644                 0.4900             0.5715                 0.5977
30000        0.4854                 0.5092             0.5941                 0.6185
40000        0.5059                 0.5131             0.6015                 0.6108
50000        0.5346                 0.5585             0.6138                 0.6508
75000        0.5487                 0.5446             0.6428                 0.6415
100000       0.5705                 0.5307             0.6503                 0.6269
125000       0.5923                 0.5392             0.6659                 0.6400
150000       0.6118                 0.5723             0.6762                 0.6515
175000       0.6221                 0.5523             0.6762                 0.6369
200000       0.6333                 0.5592             0.6756                 0.6377
500000       0.6338                 0.5469             0.6808                 0.6377
Table 6.3: Results for test 3

As is clearly seen in table 6.3 and in figure 6.6, the number of iterations only improves the success rate up to a certain point. There is nothing gained in using more than 60,000 iterations, since above this level the results are stagnant. The advice in Haykin [1] and Kohonen [4] about using 500 times the number of neurons in the network does not apply to lattices of this larger size.

Figure 6.6: Rate of success depending on the number of iterations in the algorithm. Plus marks are the test set for encoding 2, dots are the test set for encoding 3

Test 4: The τ parameters

We now turn our attention to the τ parameters. The τ_1 parameter controls how fast the neighborhood should decrease; a larger τ_1 makes the neighborhood shrink more slowly. The τ_2 parameter controls how rapidly the learning parameter η decreases; the smaller τ_2 gets, the faster η decreases. The results of different combinations of the τ parameters are in table 6.4. The recommended value of 1000 (Haykin [1]) for the τ_2 parameter seems too small for this problem; a larger value is preferred.

τ_1        τ_2     Training set   Test set
500/σ_0    500     0.6049         0.6300
500/σ_0    1000    0.6313         0.6738
500/σ_0    2000    0.6697         0.6438
500/σ_0    5000    0.7146         0.6592
1000/σ_0   500     0.5762         0.5962
1000/σ_0   1000    0.6046         0.6169
1000/σ_0   2000    0.6556         0.6608
1000/σ_0   5000    0.7103         0.6731
1000/σ_0   7500    0.7254         0.6738
1000/σ_0   10000   0.7295         0.6646
1000/σ_0   15000   0.7367         0.6669
3000/σ_0   500     0.5751         0.5985
3000/σ_0   1000    0.5682         0.5977
3000/σ_0   2000    0.6164         0.6123
3000/σ_0   5000    0.6787         0.6431
7000/σ_0   500     0.5872         0.6038
7000/σ_0   1000    0.5762         0.5923
7000/σ_0   2000    0.5715         0.6023
7000/σ_0   5000    0.6167         0.6100
Table 6.4: Results for test 4

Test 5: Start of the convergence phase

The start of the convergence phase indicates when the border factor should be applied, so we test these two parameters together. The results are displayed in table 6.5, and the other parameters used can be found in table A.4.

Convergence start   Border factor   Success rate, training set   Success rate, test set
1000                1               0.7318                       0.6761
1000                2               0.7320                       0.6692
1000                4               0.7500                       0.6715
1000                8               0.7179                       0.6623
5000                1               0.7415                       0.6769
5000                2               0.7305                       0.6438
5000                4               0.7400                       0.6800
5000                8               0.7197                       0.6623
10000               1               0.7349                       0.6838
10000               2               0.7262                       0.6577
10000               4               0.7300                       0.6938
10000               8               0.7354                       0.6685
15000               1               0.7454                       0.6685
15000               2               0.7397                       0.6723
15000               4               0.7433                       0.6785
15000               8               0.7379                       0.6569
Table 6.5: Results for test 5

The results indicate that the start of the convergence phase is not as important as the size of the border factor. A factor of 4 seems to be a good choice.

6.3.1 Test 6: The Batch algorithm

For the Batch Kohonen SOM algorithm, we rely on tests two and three; thus we use a lattice of dimension 40 times 30 and only consider encoding 3. Table A.6 presents a calculation of values of the neighborhood function when τ_1 is selected to be 5.1. A spreadsheet was used to select the value, knowing that the last couple of iterations should be narrowed down so that the neurons are only influenced by the samples in their Voronoi sets.

Iterations   τ_1   Success rate, training set   Success rate, test set
20           4.9   0.7046                       0.6646
20           5.1   0.7028                       0.6615
20           5.3   0.7038                       0.6600
25           4.9   0.7126                       0.6585
25           5.1   0.7038                       0.6808
25           5.3   0.7156                       0.6523
30           4.9   0.7146                       0.6792
30           5.1   0.7000                       0.6462
30           5.3   0.7156                       0.6615
Table 6.6: Results for test 6

The success rates here are quite similar to the ones obtained from the incremental algorithm.

6.3.2 Test 7: Samples from one person

Imagine the Kohonen SOM being used in a personal digital assistant (PDA). If the PDA could adapt to one person's handwriting, the owner would not have to learn how to write letters the device can detect correctly. Using a smaller lattice size, and the results gained from tests one to five, we train on the data set of 15 alphabets from one person and test on the last 5 alphabets. We run tests on encodings 2 and 3, with parameters as in A.8; the results can be seen in table 6.7.

Encoding 2
Dimension   σ_0   Success rate, training set   Success rate, test set
20*10       11    0.9513                       0.8692
25*10       13    0.9590                       0.9077
30*10       15    0.9692                       0.9077

Encoding 3
Dimension   σ_0   Success rate, training set   Success rate, test set
20*10       11    0.9667                       0.9615
25*10       13    0.9821                       0.93077
30*10       15    0.9794                       0.94615
Table 6.7: Results for test 7

The recognition rate in this test seems satisfactory. The output from the last of the tests can be found in A.9. The characters the network has confused are N recognized as M, O as D, Q as B, Z as I, and three times R as D. Based on this result, a test for a characteristic that would distinguish between R and D should have been added to the third encoding. A PDA would have one major advantage that cannot be used in this report: when examining characteristics, the starting location of the pen could be taken into consideration. This is not possible here, since the characters in this project have been written on paper and scanned.

6.4 Results

For the first encoding style, Idan and Chevallier [9], using a training set of 735 and a test set of 265 samples, present a success rate of 90.7% on the training samples and 75.5% on the test samples, but this is on digits, thus with only 10 different shapes to consider. The speed of this preprocessing is good, but the input dimension is much higher than for other methods, which in practical applications could become an issue. In [7], 7400 samples were collected from only 17 writers. Searching for only 18 different shapes, and using Fourier descriptors, a recognition rate of 88.43% is reached. They also had the advantage of knowing the writer's starting location and writing speed. Considering the relatively small number of writers and different shapes, the recognition rate found in this report, in the region of 70%, seems reasonable. In [8], which focuses on a new input encoding, the shadow method, 1000 characters from 13 writers yield a 78.6% result at best, but it is not specified whether the training samples are the same as the test samples. In comparison, with one writer giving at least a 93% success rate against 70% for the 200 writers, the results in this report, using a rather naive characterization test on the letters, are satisfactory.

Chapter 7: Conclusion

In this report, neural networks in general, and learning with them, are presented. The primary focus is on unsupervised learning with the Kohonen Self-Organizing Map. Two learning algorithms for the Kohonen SOM are described: the original incremental algorithm and the batch style algorithm. A brief introduction to vector quantization is given, and similarities between vector quantization algorithms and the self-organizing map algorithms are described.

In the second part of the project, the Kohonen SOM is used to recognize single handwritten characters. When using neural networks for such a pattern classification problem, the input encoding is as important as the network itself. Therefore, three different encodings have been implemented and used as input to the network. Both the incremental algorithm and the batch algorithm have been implemented, and both are tested with data sets collected and processed for this specific purpose. The recognition rate for the data set from 200 different writers is approximately 70%, which compared to other results using similar methods is satisfactory. Using samples from only one writer, the recognition rate is 96%, which implies that the method could be used on a PDA, which in time could adapt to the way the owner writes, instead of the owner adapting to the PDA.

Bibliography

[1] Simon Haykin. Neural Networks, a Comprehensive Foundation, 2nd Ed. Prentice-Hall, 1999.
[2] Warren S. McCulloch and Walter Pitts. A Logical Calculus of the Ideas Immanent in Nervous Activity. MIT Press, 1988.
[3] John Makhoul, Salim Roucos, and Herbert Gish. Vector quantization in speech coding. Proceedings of the IEEE, Vol. 73, No. 11, pages 1551-1588, 1985.
[4] Teuvo Kohonen. Self-Organizing Maps, 3rd Ed. Springer-Verlag, 2001.
[5] Nasser M. Nasrabadi and Yushu Feng. Vector quantization of images based upon the Kohonen self-organizing feature maps. IEEE International Conference on Neural Networks, pages 101-108, 1988.
[6] Robert M. Gray. Vector quantization. ASSP Magazine, IEEE, Volume 1, Issue 2, Part 1, pages 4-29, 1984.
[7] N. Mezghani, A. Mitiche, and M. Cheriet. On-line recognition of handwritten Arabic characters using a Kohonen neural network. Eighth International Workshop on Frontiers in Handwriting Recognition, Proceedings, pages 490-495, 2002.
[8] Akhtar Jameel and Cris Koutsougeras. Experiments with Kohonen's learning vector quantization in handwritten character recognition systems. Proceedings of the 37th Midwest Symposium on Circuits and Systems, 1994.
[9] Yizhak Idan and Raymond C. Chevallier. Handwritten digits recognition by a supervised Kohonen-like learning algorithm. 1991 IEEE International Joint Conference on Neural Networks, pages 2576-2581, 1991.
[10] Autar Kaw and Michael Keteltas. Lagrangian interpolation, 2003. http://numericalmethods.eng.usf.edu/mws/gen/05inp/mws_gen_inp_txt_lagrange.pdf

Appendix A: Test results

A.1 Results from test 1

Iterations          30000
Dimension           30 × 15
η_0                 0.5
σ_0                 16
τ_1                 1000/log(σ_0)
τ_2                 1000
Border factor b     2
Convergence phase   10000
Table A.1: Parameters for test 1

Results, encoding 1

Figure A.1: Visual display of the neurons in the network after training, using the first type of encoding (Test 1)

The labels given to the neurons in the lattice, using encoding 1:

LOCOOCGOOUUUVHWNKMHNWRMHAAAATT
CCCGCCULLUUQDHKNMNMMFMHQPEXAIQ
CLECECCLLDQVLHMNMNNNMMXHIYIXTI
ULSECGCLLUGLIMNNMHMNHFXXXZSSIU
SLQBVVCLLLLIIKINMMMNNYMHZZZSJA
JQSVVVGLIIIIVHINMHHHWHYXZZZLIU
SJJJVVIKIIIIIIFNNHHVWYYFFZZZCU
JZJESHVILIIIIIPNHHHVVVYFZZZCCU
JJZIYILILIIIJIPKRRRVFIFFFRZZCC
IIZTTYICLIIJPFIFKFGFFTFFBSCZCC
TLJZIXCIFIIFIYKFFPQCFFTIFBZJCZ
TJIFXXIIIIFIFRPRIIGFTPIBBSVLUJ
TIAXXXFIIAFFIIIIHIIGTTTBBBCCOL
YTXXRRIIIHIIFPIMIPCFTTBBBCCCUQ
TTYYXXBAIRAAPHPFBPBBBBDBDSLCJQ

Results, encoding 2

The labels given to the neurons in the lattice, using encoding 2:

FCCZZZEESSSGGGGBBBBRBQQQVWNTWW
CCCSSEEZZSSGGGBBBBOQDOOODWNNWN
FCCSZSZEEZSSBBBBBBDBOOOOOMWNNN
CCSSSZZZSSSSCBBOBBDDODOOONNNNN
CCCSCCISSSSSBODODDODOOOOUUNMNN
CCCCJJSSSSCPPDOOOOODOOOOUUMNMU
FCJCCCJSSZCPPPDDDDOOOGDOUUVVMN
FLJCCCCGZCCPPPDDDODOOGODUUVMMU
ZITJLCCCCZCPPPPODDDOAOOOUNMNVM
TIJLLCCCTIPPPPPCDDDAAAGUNVNVVV
CTTTLCCCZTTPPPPJKRRDAAVVUVMUVV
ITIICCCTCYYPPPKKKRKAAAJVVVVUVO
IITLTLLTTYYYPPPKKCKKRAVUUVUVVU
LILLICTTYYYYYPKXXKKKAUUVVVVVVV
IILLJIFCYTYYYYYXXXXXKHNHDUVVVV

Results, encoding 3

The labels given to the neurons in the lattice, using encoding 3:

ZESSGGGQQOSPDYNMYYYXXKHNNVVVUU
IISSGGBBDBBOYYMYXYVVHHHNMVVVUU
ZEESSGGBBBBBAKPYKHHMVMMHMVUUVV
ZESSJGGBBBBBBKVRVVHMMMNNWNUUVU
ESESSRRBBBBRBBBKKKMUMMNWWWUUUV
FEZZRRQBBBBBORRGUHUHMNWWWUVVUV
ZEPPPPPRQOBBBOCCGHHUUWWMWNVVAV
IFFIPRPPPODBBUCUKKKHWMMWMVAVVV
IIIFPPPPPDDDDDBGAKKXYMXMKRKXXL
IFFFFPPPPDDDOBDAKKKYYXYKRJAALL
ILIFFFPPPDDDOBRHGKKXXYYAAJJJLL
LLFFTPPPPPODDDQQGMMXYYHTJTJJLL
LCCCCYYQPPOOODSQQGSJTTTTTYJJLL
CCCCCRPRQBOOODOZGSSSZTTTTTTILI
LCCCCRCRROCOOOOGSJSJIZITTYTTII

A.2 Parameters for test 2

Iterations          50000
η_0                 0.5
σ_0                 max(x,y)/2 + 1
τ_1                 1000/log(σ_0)
τ_2                 1000
Border factor b     2
Convergence phase   10000
Table A.2: Parameters for test 2

A.3 Parameters for test 3

Dimension           40 * 30
η_0                 0.5
σ_0                 21
τ_1                 1000/log(σ_0)
τ_2                 1000
Border factor b     2
Convergence phase   10000
Table A.3: Parameters for test 3

A.4 Parameters for test 4

Iterations          50000
Encoding            3
Dimension           40 * 30
η_0                 0.5
σ_0                 21
Border factor b     2
Convergence phase   10000
Table A.4: Parameters for test 4

A.5 Parameters for test 5

Iterations          50000
Encoding            3
Dimension           40 * 30
η_0                 0.5
σ_0                 21
τ_1                 1000/log(σ_0)
τ_2                 15000
Table A.5: Parameters for test 5

A.6 Estimate of τ_1 for Batch Algorithm

Distance \ Iteration   0        1        10       15       20       23       25
0                      1.0000   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
1                      0.9989   0.9983   0.9444   0.6659   0.0556   0.0001   0.0000
5                      0.9943   0.9916   0.7511   0.1309   0.0000   0.0000   0.0000
10                     0.9887   0.9834   0.5642   0.0171   0.0000   0.0000   0.0000
15                     0.9831   0.9751   0.4238   0.0022   0.0000   0.0000   0.0000
20                     0.9776   0.9670   0.3183   0.0003   0.0000   0.0000   0.0000
Table A.6: Estimate of the neighborhood function for the Batch Algorithm, depending on the distance and the iteration, when τ_1 is selected to 5.1

A.7 Parameters for test 6

Iterations          25
Encoding            3
Dimension           40 * 30
σ_0                 21
τ_1                 5.1
Table A.7: Parameters for test 6

A.8 Parameters for test 7

Iterations          50000
η_0                 0.5
τ_1                 1000/log(σ_0)
τ_2                 15000
Border factor b     4
Convergence phase   1000
Table A.8: Parameters for test 7

A.9 Output from test 7

(In the output below, the Danish "dvs" means "i.e." and "x ud af y" means "x out of y".)

Succesrate: 0.9794871794871794
{Z=1.0 dvs: 15 ud af 15, Y=1.0 dvs: 15 ud af 15, X=1.0 dvs: 15 ud af 15, W=1.0 dvs: 15 ud af 15, V=1.0 dvs: 15 ud af 15, U=1.0 dvs: 15 ud af 15, T=1.0 dvs: 15 ud af 15, S=1.0 dvs: 15 ud af 15, R=1.0 dvs: 15 ud af 15, Q=0.9333333333333333 dvs: 14 ud af 15 B: 1 1.0 %, P=1.0 dvs: 15 ud af 15, O=1.0 dvs: 15 ud af 15, N=0.8666666666666667 dvs: 13 ud af 15 M: 2 1.0 %, M=1.0 dvs: 15 ud af 15, L=1.0 dvs: 15 ud af 15, K=0.9333333333333333 dvs: 14 ud af 15 R: 1 1.0 %, J=0.9333333333333333 dvs: 14 ud af 15 S: 1 1.0 %, I=1.0 dvs: 15 ud af 15, H=1.0 dvs: 15 ud af 15, G=0.9333333333333333 dvs: 14 ud af 15 S: 1 1.0 %, F=1.0 dvs: 15 ud af 15, E=0.8666666666666667 dvs: 13 ud af 15 F: 2 1.0 %, D=1.0 dvs: 15 ud af 15, C=1.0 dvs: 15 ud af 15, B=1.0 dvs: 15 ud af 15, A=1.0 dvs: 15 ud af 15 }

Succesrate: 0.9461538461538461

{Z=0.8 dvs: 4 ud af 5 I: 1 1.0 %, Y=1.0 dvs: 5 ud af 5, X=1.0 dvs: 5 ud af 5, W=1.0 dvs: 5 ud af 5, V=1.0 dvs: 5 ud af 5, U=1.0 dvs: 5 ud af 5, T=1.0 dvs: 5 ud af 5, S=1.0 dvs: 5 ud af 5, R=0.4 dvs: 2 ud af 5 D: 3 1.0 %, Q=0.8 dvs: 4 ud af 5 B: 1 1.0 %, P=1.0 dvs: 5 ud af 5, O=0.8 dvs: 4 ud af 5 D: 1 1.0 %, N=0.8 dvs: 4 ud af 5 M: 1 1.0 %, M=1.0 dvs: 5 ud af 5, L=1.0 dvs: 5 ud af 5, K=1.0 dvs: 5 ud af 5, J=1.0 dvs: 5 ud af 5, I=1.0 dvs: 5 ud af 5, H=1.0 dvs: 5 ud af 5, G=1.0 dvs: 5 ud af 5, F=1.0 dvs: 5 ud af 5, E=1.0 dvs: 5 ud af 5, D=1.0 dvs: 5 ud af 5, C=1.0 dvs: 5 ud af 5, B=1.0 dvs: 5 ud af 5, A=1.0 dvs: 5 ud af 5 }

NNHHHHHHAALLLEEEIBBBRPPFFFEIII
NNKKHHHAAALLFEESEBRRRPFFFEIZZZ
NNKKKHXXAAALLIGSBBQRRDRFFFEZZZ
MMKKKKXXAALLLGSGBRRQDDDCCLCZZZ
MMMKKXXXAAAALQGGBBBDDOURCCLITT
HNNXXXXXAAAQQQPBRRDDDDUCCCIIZZ
HNMMXXXXTTWWWQPPRRRDOODUCGSIIJ
UNMMYYYTTTWWWPPPRRQOOOBGSSSJJJ
UUVVVYYTTTWWWWPPPQQQOOOGGSSSJJ
UUVVVYYTTTWWWWPPPQQQOOGGGSSIJJ

Appendix B: Characteristics tested for

Here is a list of the different characteristics tested for in the third encoding style; the numbers refer to the location in the input vector. A small sketch of one such test follows the list.

0,1,2   Tests whether there is a horizontal bar at the top, bottom or middle.
3,4,5   Tests whether there is a vertical bar in the left, middle or right of the picture.
6,7,8,9   Checks whether there is a rounded line segment in each of the four quadrants. The test is done using Lagrangian interpolation [10] to model a desired curve, and calculating the percentage of black pixels on this arc.
10,11   Examines whether there are diagonal bars like / or \.
12,13,14,15   Tests whether the number of vertical bars seen 1/4 from the top of the image is equal to 0, 1, 2 or 3 (similar to encoding 2).
16,17,18,19   Tests whether the number of vertical bars seen 1/2 from the top of the image is equal to 0, 1, 2 or 3 (similar to encoding 2).
20,21,22,23   Tests whether the number of vertical bars seen 3/4 from the top of the image is equal to 0, 1, 2 or 3 (similar to encoding 2).
24,25,26   Tests whether the number of horizontal bars seen 1/3 into the image is equal to 1, 2 or 3 (similar to encoding 2).
27,28,29   Tests whether the number of horizontal bars seen 1/2 into the image is equal to 1, 2 or 3 (similar to encoding 2).
30,31,32   Tests whether the number of horizontal bars seen 2/3 into the image is equal to 1, 2 or 3 (similar to encoding 2).
33   Checks whether the picture is very narrow (below a threshold limit).
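As an example of how one of these characteristics might be examined, the following Java sketch checks for a horizontal bar in a band of rows by measuring the fraction of black pixels in each row of the band. The band boundaries and the threshold are assumptions for illustration; the project's actual tests are in appendix C.

// Illustrative sketch of one characteristic test (entries 0-2): does a
// horizontal band of the image contain a horizontal bar?
// image[y][x] == true means the pixel is black.
public final class CharacteristicTests {

    public static boolean hasHorizontalBar(boolean[][] image,
                                           int firstRow, int lastRow,
                                           double threshold) {
        for (int y = firstRow; y <= lastRow; y++) {
            int black = 0;
            for (int x = 0; x < image[y].length; x++) {
                if (image[y][x]) black++;
            }
            // A row that is mostly black is taken as evidence of a bar.
            if ((double) black / image[y].length >= threshold) {
                return true;
            }
        }
        return false;
    }
}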