The Minimum Number of Hidden Neurons Does Not Necessarily Provide the Best Generalization

Size: px

Start display at page:

Download "The Minimum Number of Hidden Neurons Does Not Necessarily Provide the Best Generalization"

Andra Reed
5 years ago
Views:

1 The Minimum Number of Hidden Neurons Does Not Necessarily Provide the Best Generalization Jason M. Kinser The Institute for Biosciences, Bioinformatics, and Biotechnology George Mason University, MSN 4E University Blvd Manassas, VA USA Voice: Fax: Abstract The quality of a feedforward neural network that allows it to associate data not used in training is called generalization. A common method of creating the desired network is for the user to select the network architecture (largely based on the selecting the number of hidden neurons) and allowing a training algorithm to evolve the synaptic weights between the neurons. A popular belief is that the network with the fewest number of hidden neurons that correctly learns a sufficient training set is a network with better generalization. This paper will contradict this belief. The optimization of generalization requires that the network not assume information that does not exist in the training data. Unfortunately, a network with the minimum number of hidden neurons may require assumptions of information that does not exist. The network then skews the surface that maps the input space to the output space in order to accommodate the minimum architecture which then sacrifices generalization. I. INTRODUCTION It has been previously proposed that the feedforward neural network solution with the fewest possible number of hidden neurons is that network that best generalizes to the problem encompassed by the training data set. [Reed,93] [Murata,94] It will be proposed here that this is not the case. A contradictory case with discussion will be presented that demonstrates that too few hidden neurons deteriorates generalization even though the training data is correctly mapped by the network. The cause of this deterioration is that the network, by having too few hidden neurons, is forced to have improper hidden neuron activation. This requires the network to infer information that does not exist in the training data and thus generalization is compromised. II. CONTRADICTORY EXAMPLE 1

2 The networks considered here are feedforward networks with simple neurons. The network is shown in Fig. 1 and the operation of the network is described by, z i = Γ u ij y j + v ik x k (1) j k where x k is the state of the k th input neuron, y j is the state of the j th hidden neuron, u ij are the weights from the input layer to the output neuron, and v ik describes the connections between the hidden neurons and the output neuron. Γ{u} is a hard threshold function which is 1 if u is greater than a constant γ and 0 otherwise. The hidden neurons are similarly computed by, y j = Γ l w jl x l. (2) To demonstrate the assertion of this paper a single example problem shown in Table 1 is considered. There are eleven training pairs each with four input neurons and a single output neuron. Following the logic in [Kinser,96] conflicts exist in the data set and it cannot be solved by a single layer system. At least one hidden neuron is required to fully map this data set, or in other words, this data set is higher order. TABLE 1. T Input Output A solution of the network correctly learns all of the training data. The two solutions that will be used for discussion are: Solution 1: One hidden neuron. u = ( 0.8) v = ( 0.68, 0.61, 0.64, 0.64) w = ( 0.3, 0.21, 0.27, 0.24) 2

3 Solution 2: Two hidden neurons u = 1, 1 ( ) ( ) v = 0.6, 0.6, 0.6, w = Both solutions completely associate the training data. However, solution 1 infers information that does not exist in the training data. To explore this, consider the symmetry in the training data. In this set the reverse of each input vector produces the same output. This data set is complete in that every training input vector has its reverse in the training set. The data is wholly symmetric. The concern with solution 1 is that it is not symmetric. In fact, it is not possible to create a single hidden neuron solution to this data set that is symmetric. For this explanation each training pair is denoted by T k, and T 1 implies that the output neuron is not receiving a steady state bias. Following the logic in [Kinser,96] the hidden neuron is not added to the system until it is necessary. This provides a starting point for the algorithm, and deviations from this do not alter the necessity or role of a hidden neuron, but just alter connection weights. T 1, T 2, T 3 and T 4 each require that each element in v is greater than γ. The starting position of not activating the hidden neuron until necessary requires that all elements of w be less than γ. In order to produce a symmetric network w 1 = w 4, w 2 = w 3, v 1 = v 4, and v 2 = v 3. T 10 without a hidden neuron requires that v 1 + v 2 < γ which can not simultaneously be true with v 1 > γ and v 2 > γ. Therefore, T 10 (and similarly T 11 ) requires the creation and activation of the hidden neuron. Thus, v 1 + v 2 + u < γ and w 1 + w 2 > γ (and similarly v 3 + v 4 + u < γ and w 3 + w 4 > γ ). Since w 1 + w 2 > γ then either w 1 or w 2 must be greater than γ/2, and similarly w 3 or w 4 must be greater than γ/2. The fact that there are four possibilities from which to chose (w 1, w 3 > γ; w 2, w 3 > γ; w 1, w 4 > γ; or w 2, w 4 > γ) will be discussed later. So, for the sake of argument, w 1 and w 3 are chosen as the weights that must be greater than γ/2. With this selection T 9 must activate the hidden neuron which produces v 1 + v 3 + u > γ. Due to symmetry (v 2 = v 3 ) this becomes v 1 + v 2 + u > γ, which cannot be simultaneously true with inequalities and the end of the previous paragraph. The selection of which of the w weights is greater than γ/2 does not alter the existence of the contradictory inequalities. This means that a symmetric single-hidden neuron solution does not exist. The data set is symmetric, but the single hidden neuron solution is necessarily not symmetric. For symmetric data a non-symmetric solution imposes information that does not exist in the training data. The mapping surface created by the network is skewed by this unwarranted infusion of improper information. A network with optimal 3

4 generalization will not have such skews in the mapping surface. In the example, generalization is compromised by the single neuron solution. In the two hidden neuron solution the weights and therefore the mapping surface are symmetric and the generalization is not compromised. The conclusion drawn from this is that Solution 2 is a better generalizing network than Solution 1. Thus, a network solution for this problem with the minimum number of hidden neurons can not construct the network that is optimal in the sense of generalization. III. Symmetry Discussions The question then becomes, if the data has symmetry in it then where did the symmetry go in the case of the single hidden neuron? In the example problem there are four single hidden neuron solutions. The particular solution selected used w 1 and w 3 as the weights chosen to be greater than γ/2. The other three solutions would have selected different weights to be greater than γ/2. Each of these solutions is unique from each other in that they have different training vectors activating the hidden neuron. In each of these cases, T 10 and T 11 activate the hidden neuron as well as two from the set {T 6, T 7, T 8, T 9 } depending on which solution was chosen. The symmetry does still exist in that there are four solutions symmetrically located in a solution space. However, each particular solution does not have symmetry within itself. Activation of the hidden neuron by T 6, T 7, T 8, or T 9 is improper since none of these are required to activate the hidden neuron. This is easily seen by removing T 10 and T 11 from the data set and finding a easy solution for the rest of the data with a network that has no hidden neuron (v i = 0.3 for all i). T 10 and T 11, however, are required to activate the hidden neuron as determined in the previous section, but their presence should not require the unnecessary activation of the hidden neuron by other training vectors. Incorrect activation again is a sign that damage is being done to the generalization ability of the network. The use of hidden neurons is required for a mapping problem that requires a higher-order mapping. In the case of T 10 and T 11 being removed from the training data a first-order mapping (no hidden neurons) is sufficient, and therefore all of these trainers are considered first-order or less. In the case of Solution 1, two first order trainers are forced to activate the hidden neuron even though they are not the higher-order trainers in the data set. Again, information not existent in the data set is being imposed by improper activation of the hidden neuron. In Solution 2 only T 10 and T 11 activate the hidden neurons and thus no damage is imposed on the generalization of the network. There is no indication that non-existent information is being imposed on the network. 4

5 III. SUMMARY The best generalizing neural network is not necessarily the network with the fewest number of hidden neurons. This was demonstrated by presenting a case in which the fewest number of hidden neurons was deleterious to the generalization ability of the system. This destruction of the generalization occurs when the mapping surface of the network is skewed (or is forced to assume information that does not exist in the training data). Skewing is product of the improper activation of the hidden neuron. IV. REFERENCES R. Reed, "Pruning Algorithms - A Survey", IEEE Neural Nets NN-4(5) (1993). N. Murata, S. Yoshizawa, S. Aman, "Network Information Criterion - Determining the Number of Hidden Units for an Artificial Neural Network Model", IEEE Trans on Neural Nets, NN-5(6), (1994). J. M. Kinser, The Determination of Hidden Neurons, Optical Memories and Neural Networks, 5(4), (1996). u w v u x y z 5

Artificial Neural Network

Artificial Neural Network Contents 2 What is ANN? Biological Neuron Structure of Neuron Types of Neuron Models of Neuron Analogy with human NN Perceptron OCR Multilayer Neural Network Back propagation