VC-dimension of a context-dependent perceptron

Piotr Ciskowski
Institute of Engineering Cybernetics, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
cis@vectra.ita.pwr.wroc.pl

Abstract. In this paper we present the model of a context-dependent neural net: a net which may change the way it works according to external conditions. Information about the environmental conditions is fed to the net through context inputs, which are used to calculate the net's weights and, as a consequence, modify the way the net reacts to the traditional inputs. We discuss the Vapnik-Chervonenkis dimension of such a neuron and show that the separating power of a context-dependent neuron and multilayer net grows with the number of adjustable parameters. We present the difference in the way traditional and context-dependent nets work and compare the input-space transformations both of them are able to perform. We also show that context-dependent nets learn faster than traditional ones with the same VC-dimension.

1 Introduction

The notion of context in computer science appeared some time ago, first in the area of formal languages. It has since been introduced to many areas of machine learning, classification, robotics and neural nets [5]. Medical applications are an intuitive example of decisions depending on external parameters. One of the first medical applications of context-sensitive neural networks was presented in [9], where a neural network is tuned to the parameters of a monitored patient. This paper presents a model of a context-dependent neural network: a network which may change the way it works according to the environmental conditions. In other words, such a network may react differently to the same input values, depending on external conditions, later called context variables.
The problem of defining and identifying primary, context-sensitive and irrelevant features among the input data is presented well in [1]. In this paper we assume that the division of the net's inputs into primary and context-sensitive ones (for simplicity called context inputs) has already been done.

Paper supported by Wrocław University of Technology grant no. 332291.
Different strategies of managing context-sensitive features are presented in [2]. The neural network model presented in this paper corresponds to strategy 3 (contextual classifier selection) or strategy 5 (contextual weighting).

2 Model of a Context-Dependent Neuron

Consider a neuron model of the form:

$$ y = \Phi\left( w_0 + \sum_{s=1}^{S} w_s x_s \right) = \Phi\left( \sum_{s=0}^{S} w_s x_s \right) = \Phi\left( W^T X \right), \quad (1) $$

where $y$ is the neuron's output, $w_s$ is its weight on the $x_s$ input and $w_0$ is the threshold, which is included in the weight vector, while the input vector includes the bias $x_0 = 1$. $\Phi$ is the neuron's activation function, for example the sigmoidal function $y(u) = \frac{1}{1 + e^{-\beta u}}$.

The dependence of the neuron's weights on the context vector is modeled by:

$$ w_s = A_s^T V = [a_{s1}, a_{s2}, \ldots, a_{sM}] \, [v_1, v_2, \ldots, v_M]^T, \quad (2) $$

where $V$ is the vector of $M$ linearly independent base functions spanning the weights' dependence on the context vector, and $A_s$ is the vector of coefficients approximating the $s$-th weight's dependence on the context. The number of adjustable parameters in each neuron is $M(S+1)$; for the traditional neuron the number of parameters equals the number of weights, $S+1$. This number is crucial for estimating the Vapnik-Chervonenkis dimension of the context-dependent perceptron.

3 The VC-dimension of a Context-Dependent Neuron

The Vapnik-Chervonenkis dimension is the main quantity used for measuring the capacity of a learning machine, its generalization abilities, and the number of learning examples needed to obtain the required accuracy of predictions. In the following we compare the results for the traditional and the context-dependent neuron. For more details on the VC-dimension of neural nets, see [6].

Theorem 1. [6] Consider a standard real-weight perceptron with $S \in \mathbb{N}$ real inputs and denote the set of functions it computes by $H_{stand}$. Then a set $S_{stand} = \{X_1, X_2, \ldots, X_n\} \subset \mathbb{R}^S$ is shattered by $H_{stand}$ iff $S_{stand}$ is affinely independent, that is, iff the set $\{(X_1^T, 1), (X_2^T, 1), \ldots, (X_n^T, 1)\}$ is linearly independent in $\mathbb{R}^{S+1}$. It follows that:

$$ \mathrm{VCdim}(H_{stand}) = S + 1. \quad (3) $$

For lack of space we omit the proofs of the following theorems, which may however be reconstructed by analogy to those presented in [6].
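The neuron model of Section 2, Eqs. (1) and (2), can be sketched in a few lines. The sketch below assumes, purely for illustration, polynomial base functions V(z) = [1, z, z²] and a tanh activation; neither choice is prescribed by the paper, and the function name `cd_neuron_output` is hypothetical:

```python
import numpy as np

def cd_neuron_output(A, V, x, phi=np.tanh):
    """Output of a context-dependent neuron, sketching Eqs. (1)-(2).

    A : (S+1, M) coefficient matrix; row s holds the vector A_s.
    V : (M,) base-function values evaluated at the current context.
    x : (S+1,) input vector with the bias x[0] = 1.
    """
    w = A @ V          # Eq. (2): each weight w_s = A_s^T V
    return phi(w @ x)  # Eq. (1): y = Phi(W^T X)

# Example: S = 2 inputs, M = 3 base functions V(z) = [1, z, z**2]
S, M = 2, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(S + 1, M))
x = np.array([1.0, 0.5, -1.2])            # bias plus two inputs
for z in (0.0, 1.0):                      # same x, different contexts
    V = np.array([1.0, z, z ** 2])
    print(z, cd_neuron_output(A, V, x))   # output changes with the context
```

Note that `A` holds M(S+1) entries, matching the parameter count given above, versus S+1 weights for a traditional perceptron.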
Theorem 2. Consider a context-dependent real-weight perceptron with $S \in \mathbb{N}$ real inputs, $P \in \mathbb{N}$ real context inputs and $M \in \mathbb{N}$ base functions. Denote the set of functions it computes by $H_{cont}$. Then a set $S_{cont} = \{(X_1, z_1), (X_2, z_2), \ldots, (X_n, z_n)\} \subset \mathbb{R}^{S+P}$ is shattered by $H_{cont}$ only if, in each subset of $S_{cont}$ containing points with the same value $z$ of the context, i.e. $S_z = \{(X_{z,1}^T, 1), (X_{z,2}^T, 1), \ldots, (X_{z,n_z}^T, 1)\}$, all points are linearly independent in $\mathbb{R}^{S+1}$. It follows that:

$$ \mathrm{VCdim}(H_{cont}) = M(S + 1). \quad (4) $$

It is known [6] that for standard feed-forward linear threshold networks with a total of $W$ weights the VC-dimension grows as $O(W^2)$.

Theorem 3. Suppose $N_{cont}$ is a context-dependent feed-forward linear threshold network consisting of context-dependent neurons given by (1), with a total of $W$ weights, where each weight is given by a combination of $M$ coefficients and base functions as in (2). Let $H_{cont}$ be the class of functions computed by this network. Then

$$ \mathrm{VCdim}(H_{cont}) = O(MW^2). \quad (5) $$

The difference in the way traditional and context-dependent nets work can be seen in the following example. Suppose we have a traditional neuron with $S+1$ inputs (including bias) and add another $P$ contextual variables as traditional inputs. We thereby expand the neuron's input space from $\mathbb{R}^{S+1}$ to $\mathbb{R}^{S+1+P}$ (the same expansion applies to its parameter space), and the transformation performed by the neuron, $\mathbb{R}^{S+1+P} \to \mathbb{R}$, is still a hyperplane separation, only in a higher-dimensional input space. When we instead add these $P$ inputs as context inputs and expand the base-function vector to $M$ functions ($M$ may be greater than $P$), the neuron's input space remains $\mathbb{R}^{S+1}$, while its parameter space grows to $\mathbb{R}^{M(S+1)}$, and the division $\mathbb{R}^{S+1+P} \to \mathbb{R}$ performed by the neuron is no longer a hyperplane but a hypersurface, the more complicated the larger $M$ is, while remaining a hyperplane for any fixed value of the context. This is the reason why the separating power of a context-dependent net is greater for sets of points in different contexts.
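This separating-power difference admits a minimal numeric sketch (coefficients hand-picked for illustration, not taken from the paper): four points forming an XOR pattern in the joint (input, context) space are not separable by any single hyperplane there, yet a single context-dependent neuron with base functions V(z) = [1, z] classifies them, because each context value selects its own hyperplane:

```python
import numpy as np

# XOR-like task: in context z = 0 the label follows sign(x),
# in context z = 1 it is reversed. Treated as points in (x, z)
# space, these four examples are not linearly separable.
points = [(-1.0, 0.0, -1), (1.0, 0.0, 1),
          (-1.0, 1.0, 1), (1.0, 1.0, -1)]

# Coefficient matrix for base functions V(z) = [1, z]:
A = np.array([[0.0, 0.0],    # threshold w_0(z) = 0
              [1.0, -2.0]])  # w_1(z) = 1 - 2z: +1 in context 0, -1 in context 1

for x, z, label in points:
    V = np.array([1.0, z])             # base functions at this context
    w = A @ V                          # Eq. (2): context-dependent weights
    y = np.sign(w @ np.array([1.0, x]))  # Eq. (1) with a threshold activation
    assert y == label                  # each context sees its own hyperplane
print("all four points classified correctly")
```

With z fed in as an ordinary input instead, a traditional perceptron would need a single hyperplane in R³ for all four points, which the XOR labeling rules out.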
4 Learning of Context-Dependent Nets

An interesting learning algorithm for context-dependent nets is presented in [7]. It uses the properties of the Kronecker product and allows the net to be trained on all examples from different contexts together. It is a gradient descent algorithm in which the gradient of the quality function

$$ Q(A) = E_{X,z,Y}\left[ \left( \Phi^{-1}(Y) - W^T X \right)^2 \right] = \quad (6) $$
$$ = E_{X,z,Y}\left[ \left( \Phi^{-1}(Y) - A^T (X \otimes V) \right)^2 \right] \quad (7) $$
is given by

$$ \mathrm{grad}_A Q = E_{X,z,Y}\left[ \left( \Phi^{-1}(Y) - A^T (X \otimes V) \right) (X \otimes V) \right]. \quad (8) $$

It should be emphasized that the neuron's output is calculated directly from the input vector $X$ and the vector of base functions $V$, without having to calculate the neuron's weights. The same Kronecker product $X \otimes V$ is then used for calculating the target function's gradient with respect to the coefficient vector $A$. If all the net's layers have the same base-function vectors, this calculation is also done once per epoch. These facts result in far fewer calculations in each learning epoch of the context-dependent net. This estimate may be slightly affected by the need to calculate the weights for the backpropagation algorithm, but in that case it is only necessary to calculate the weights of neurons in all layers except the first one, which usually contains the most neurons.

5 Conclusions

The model of a context-dependent perceptron has been presented in the paper, as well as learning algorithms. It has been shown that, similarly to traditional neurons, the Vapnik-Chervonenkis dimension of a context-dependent neuron and net grows with the number of adjustable parameters; but, as this number is greater than that of a traditional neuron, the separating power of such a neuron is much greater, and depends not on the context variables themselves but on the way the network designer uses them by choosing the base functions $v$. The growth of the Vapnik-Chervonenkis dimension is both a benefit and a problem: the number of examples needed for the learning algorithm to achieve the desired error is larger. The advantage of context-dependent nets over traditional ones is that, when comparing nets with the same number of parameters (the same VC-dimension), the context-dependent ones learn faster, and this difference becomes more significant as the nets' size grows.

References

1. Turney P.: The Identification of Context-Sensitive Features: A Formal Definition of Context for Concept Learning, Proc.
of the 13th International Conference on Machine Learning (ICML-96), Workshop on Learning in Context-Sensitive Domains, Bari, Italy, 1996
2. Turney P.: The Management of Context-Sensitive Features: A Review of Strategies, Proc. of ICML-96, Bari, Italy, 1996
3. Turney P.: Exploiting Context When Learning to Classify, Proc. of ECML-93, Springer-Verlag, 1993
4. Harries M., Sammut C., Horn K.: Extracting Hidden Context, Machine Learning 32, 1998
5. Yeung D.T., Bekey G.A.: Using a Context-Sensitive Learning Network for Robot Arm Control, Proc. IEEE Int. Conf. on Robotics and Automation, pp. 1441-1447, 1989
6. Anthony M., Bartlett P.L.: Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, 1999
7. Rafajłowicz E.: Context-Dependent Neural Nets: Problem Statement and Examples (Part 1), Learning (Part 2), Proc. of the 3rd Conference on Neural Networks and Their Applications, Zakopane, Poland, 1999
8. Ciskowski P., Rafajłowicz E.: Context-Dependent Neural Nets: Structures and Learning, to be published in IEEE Trans. on Neural Networks
9. Watrous R.L., Towell G.: A Patient-Adaptive Neural Network ECG Patient Monitoring Algorithm, Proc. Computers in Cardiology 1995, Vienna, Austria