Universität Würzburg
Institut für Theoretische Physik
Am Hubland, D-97074 Würzburg, Germany

Selection of Examples for a Linear Classifier

Georg Jung and Manfred Opper

Ref.: WUE-ITP-95-
ftp: ftp.physik.uni-wuerzburg.de
georgju@physik.uni-wuerzburg.de


Selection of Examples for a Linear Classifier

Georg Jung† and Manfred Opper‡

† Physikalisches Institut, Julius-Maximilians-Universität, Am Hubland, D-97074 Würzburg, Federal Republic of Germany
‡ The Baskin Center for Computer Engineering & Information Sciences, University of California, Santa Cruz, CA 95064, USA

Abstract. We investigate the problem of selecting an informative subsample out of a neural network's training data. Using the replica method of statistical mechanics, we calculate the performance of a heuristic selection algorithm for a linear neural network which avoids overfitting.

PACS numbers: 87.10.+e, 05.90.+m

Short title: Selection of Examples

December 1995

1. Introduction

Finding the optimal complexity of a neural network for learning an unknown task is one of the most interesting problems in the theory of neural computation. A popular strategy is to start with a complex network having more connections than are actually needed. Then, after training, by deleting couplings which are too small or seem to be of less importance, one hopes to end up with a network which has reasonable performance (see e.g. [23]). The strategy of clipping network weights has also been investigated in the framework of statistical mechanics (see e.g. [20, 24, 25]).

In this paper, we will look at a problem which is in some sense dual to the problem of pruning the network weights: we consider the problem of pruning the training examples. Besides its interest from a purely theoretical viewpoint, such a problem is motivated by the so-called phenomenon of overfitting in network learning. If the complexity of a network is too high, it may be able to fit all training examples perfectly, but the probability of predicting the outputs on new data may drop to the value of random guessing. As a result, the learning curve, which displays the generalization error as a function of the number of examples, may show a nonmonotonic behaviour. This means that an increase in the number of examples can in some regions lead to a decrease in generalization ability. Recently, Garces has found in numerical studies that overfitting can be avoided by deleting examples from the training set [18]. In this respect, it is interesting to find out from which selection of examples the performance of the network will benefit most.

In this article, we will study a heuristic strategy for the selection of examples in the case of a linear classifier applied to two toy problems. From an analytical viewpoint, it is simple enough to be free of the problems of replica symmetry breaking. This approach should not be confused with the well studied problem of learning with queries [14, 15, 16]. In the latter case, one selects inputs with respect to some criterion before their outputs are observed. In our case, all training examples, i.e. both inputs and outputs, are known to the learner.

The paper is organized as follows: In the second section, we briefly review some properties of the Adaline algorithm which will be basic to our treatment. The third section explains the heuristic strategy for example selection. In section four, we introduce two different rules to be learnt by the network. Section five contains a new approach to the statistical mechanics of the problem. Finally, in sections six and seven, the results of our calculations are presented and discussed.

2. Adaline Learning

As has been shown e.g. in [9, 3, 10, 21], the effect of overfitting can already be observed for the case of a simple linear classifier, the so-called Adaline model [13, 11, 12], which we will discuss in the following sections. For any N-dimensional input vector $\boldsymbol{\xi}$ this linear classifier is defined by the output

$$S = \frac{1}{\sqrt{N}}\,\mathbf{J}\cdot\boldsymbol{\xi}. \qquad (1)$$

For p linearly independent input vectors $\boldsymbol{\xi}^\mu$ and arbitrary real valued target outputs $S_B^\mu$, $\mu = 1,\ldots,p$, the system of equations

$$S_B^\mu = \frac{1}{\sqrt{N}}\,\mathbf{J}\cdot\boldsymbol{\xi}^\mu \qquad (2)$$

will always have solutions. Hence, for random inputs, where linear dependencies are unlikely, it will be possible to adjust the vector $\mathbf{J}$ of the N network weights $J_j$, $j = 1,\ldots,N$, in such a way that, if p < N, all training examples are perfectly learnt by the network. An explicit solution for such a weight vector is given by the pseudo-inverse solution (PSI) [11]:

$$\mathbf{J} = \frac{1}{\sqrt{N}}\sum_{\mu,\nu=1}^{p} S_B^\mu\, (C^{-1})_{\mu\nu}\,\boldsymbol{\xi}^\nu, \qquad (3)$$

where $C_{\mu\nu} = \frac{1}{N}\sum_j \xi_j^\mu \xi_j^\nu$ is the correlation matrix of the training patterns. Out of the linear space of solutions to equation (2), this is the one which has minimal squared norm $q = \mathbf{J}^2/N$. The restriction of our treatment to the case p < N is mainly for mathematical convenience; however, it is especially in this region that the effect of overfitting can be observed.

3. Weighting of Examples

Our basic strategy for the selection of examples is based on the fact that most iterative learning algorithms for a single layer net result in a coupling vector which has the form of a weighted Hebbian rule [4, 9]. To see this, consider a learning algorithm of the backpropagation type which is based on the minimization of a quadratic training energy

$$E = \sum_{\mu=1}^{p}\bigl(S_B^\mu - g_J(h_J^\mu)\bigr)^2 \qquad (4)$$

by gradient descent. Here we assume a smooth output $g_J(h_J)$ of the student net, which is defined through the internal field $h_J^\mu = \frac{1}{\sqrt{N}}\,\mathbf{J}\cdot\boldsymbol{\xi}^\mu$.

Hence, during a learning step, the network couplings are changed by an amount

$$\Delta J_j \propto -\frac{\partial E}{\partial J_j}. \qquad (5)$$

It is not hard to show that the algorithm (5) for $g_J(h) = h$, when started with a zero initial vector, converges to the Adaline rule (3). To rewrite (5) as a weighted Hebbian sum, we set

$$\Delta J_j = \frac{1}{\sqrt{N}}\sum_{\mu=1}^{p} x^\mu(t)\, S_B^\mu\, \xi_j^\mu, \qquad (6)$$

where

$$x^\mu(t) \propto \bigl(1 - S_B^\mu\, g_J(h_J^\mu)\bigr)\, g_J'(h_J^\mu)$$

represents the weighting (in the following called the embedding strength) of the $\mu$'th example in the t'th learning step.

Our selection of examples will be based on the following heuristic ansatz: Intuitively, we expect that examples which have small or even negative total embedding strengths $x^\mu = \sum_t x^\mu(t)$ after learning could be cast out of the training set. The latter ones would have an output that already has the correct sign without being learnt. This strategy may also be understood as an approximation to the AdaTron algorithm [5] which, by construction, allows for nonnegative embedding strengths only. Another possibility was discussed in a paper by Garces [18] for the case of AdaTron learning: only examples which are not too hard to learn, i.e. which have positive embedding strengths below a certain value, were left in the training set.

To find an expression for the total embedding strengths, it is not necessary to solve the dynamics (5). We can get the same information directly from the final coupling vector. For the Adaline case, with $\alpha = p/N < 1$, the form of the coupling vector (3) provides us with an explicit expression for the embedding strengths, which is given by

$$x^\mu = S_B^\mu \sum_{\nu} S_B^\nu\, (C^{-1})_{\mu\nu}, \qquad (7)$$

valid for p < N. To treat the statistical mechanics of the problem, however, a simpler implicit definition of the $x^\mu$'s using a suitable Lagrangean will be given in section 5. Although it is possible to obtain the coupling vector of the linear classifier for $\alpha > 1$, useful expressions for the corresponding embedding strengths are harder to find and to treat within the framework of statistical mechanics. Hence, we will restrict ourselves to the region $\alpha < 1$.
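The quantities above are easy to check numerically. The following minimal numpy sketch (ours, not part of the paper; the sizes, the random seed and the sign teacher are arbitrary choices) builds the pseudo-inverse solution (3) for one random training set, verifies that it reproduces all targets, and confirms that it equals the weighted Hebbian sum with the embedding strengths (7).

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 120                      # alpha = p/N < 1
xi = rng.standard_normal((p, N))     # training inputs xi^mu
B = rng.standard_normal(N)
B *= np.sqrt(N) / np.linalg.norm(B)  # teacher norm |B| = sqrt(N)

h_B = xi @ B / np.sqrt(N)            # teacher fields
S_B = np.sign(h_B)                   # binary teacher outputs g_B(h) = sign(h)

# correlation matrix C_{mu nu} = xi^mu . xi^nu / N and its inverse
C = xi @ xi.T / N
C_inv = np.linalg.inv(C)

# pseudo-inverse solution, eq (3)
J = (S_B @ C_inv @ xi) / np.sqrt(N)

# embedding strengths, eq (7):  x^mu = S_B^mu sum_nu S_B^nu (C^-1)_{mu nu}
x = S_B * (C_inv @ S_B)

# checks: all training examples are reproduced exactly,
# and J equals the weighted Hebbian sum over the examples
assert np.allclose(xi @ J / np.sqrt(N), S_B)
assert np.allclose(J, (x * S_B) @ xi / np.sqrt(N))
print("fraction of negative embedding strengths:", np.mean(x < 0))
```

The printed fraction of negative embedding strengths is exactly the fraction of examples that the selection heuristic of this section would discard.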

In order to keep the subsequent analysis simple, we will not assume that the Adaline algorithm has to be rerun a second time on the reduced training set. We will rather make the ansatz that the new weight vector is given a priori, via single shot learning, by a weighted Hebbian sum of the form

$$\tilde{\mathbf{J}} = \frac{1}{\sqrt{N}}\sum_{\mu} f(x^\mu)\, S_B^\mu\, \boldsymbol{\xi}^\mu. \qquad (8)$$

In the following, we will consider two choices for the weighting function f(x). One choice will be called the modified Adaline rule, $f(x) = x\,\theta(x)$, where examples with negative embedding strengths are abandoned and the positive ones keep their original weights. This may be considered as an approximation to the more complicated algorithm of relearning the remaining examples with the Adaline method. For comparison, we will also study the simpler choice $f(x) = \theta(x)$ (modified Hebbian rule), where the remaining examples have equal weights.
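A companion sketch (again ours, not the paper's) wires the two weighting functions into the single-shot rule (8); the helper names, the sizes and the illustrative threshold of the teacher are our own choices.

```python
import numpy as np

def embedding_strengths(xi, S_B):
    """x^mu = S_B^mu sum_nu S_B^nu (C^-1)_{mu nu}, valid for p < N (eq (7))."""
    C = xi @ xi.T / xi.shape[1]
    return S_B * np.linalg.solve(C, S_B)

def pruned_weights(xi, S_B, rule="modified_adaline"):
    """Single-shot weighted Hebbian vector of eq (8) for the two weighting functions."""
    x = embedding_strengths(xi, S_B)
    if rule == "modified_adaline":    # f(x) = x * theta(x): drop negative x, keep weights
        f = np.where(x > 0, x, 0.0)
    elif rule == "modified_hebb":     # f(x) = theta(x): remaining examples get equal weight
        f = (x > 0).astype(float)
    else:
        raise ValueError(rule)
    return (f * S_B) @ xi / np.sqrt(xi.shape[1])

# tiny usage example with a threshold teacher
rng = np.random.default_rng(1)
N, p = 200, 120
xi = rng.standard_normal((p, N))
B = rng.standard_normal(N); B *= np.sqrt(N) / np.linalg.norm(B)
S_B = np.sign(xi @ B / np.sqrt(N) - 0.4)   # illustrative threshold 0.4
x = embedding_strengths(xi, S_B)
print("examples kept:", int((x > 0).sum()), "of", p)
for rule in ("modified_adaline", "modified_hebb"):
    J_new = pruned_weights(xi, S_B, rule)
    print(rule, "-> |J~|^2/N =", float(J_new @ J_new) / N)
```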

4. Learning Tasks

As in most theoretical studies of neural networks, the task to be learnt will be defined by a teacher network. It provides the correct outputs $S_B$ for a given input $\boldsymbol{\xi}$. For the simplest case, the teacher is given by a single layer network with fixed vector of couplings $\mathbf{B}$ and output

$$S_B = g_B(h_B), \qquad (9)$$

with $h_B = \frac{1}{\sqrt{N}}\,\mathbf{B}\cdot\boldsymbol{\xi}$. The complexity of such teacher networks can be tuned by suitably varying the output function $g_B$. In all the following cases, an explicit mismatch between teacher and student is assumed by choosing $g_B$ as a nonlinear function. For simplicity, we specialize to the binary case $g_B = \pm 1$. Hence, even when the teacher problem is of the simple type $g_B(h) = \mathrm{sign}(h)$, the linear output function of the student used for the training process will cause overfitting. For a better comparison with the teacher net, we will measure the performance of the student network after training by its clipped output $\mathrm{sign}(h_J)$. Hence, we define the generalization error as the following average:

$$\varepsilon_g = \tfrac{1}{2}\,\bigl\langle\,\lvert g_B(h_B) - \mathrm{sign}(h_J)\rvert\,\bigr\rangle_{\boldsymbol{\xi}}. \qquad (10)$$

In the simplest case, where the inputs are drawn from a spherically symmetric density,

$$P(\boldsymbol{\xi}) = (2\pi)^{-N/2}\, e^{-\lvert\boldsymbol{\xi}\rvert^2/2},$$

it is possible to describe the generalization error in terms of the angle between the two vectors $\mathbf{B}$ and $\tilde{\mathbf{J}}$,

$$\cos\vartheta = \frac{\mathbf{B}\cdot\tilde{\mathbf{J}}}{\lvert\tilde{\mathbf{J}}\rvert\,\lvert\mathbf{B}\rvert} = \frac{\tilde{R}}{\sqrt{\tilde{q}}}.$$

Here $\tilde{R} = \mathbf{B}\cdot\tilde{\mathbf{J}}/N$ defines the overlap between teacher and student, and $\tilde{q} = \tilde{\mathbf{J}}^2/N$. For simplicity, we have chosen the norm of the teacher to be $\sqrt{N}$. We will investigate two different rules.

(i) We begin with a linearly separable task, defined by a teacher perceptron with a nonzero threshold $\theta$. The output function is

$$g_B(h_B) = \mathrm{sign}(h_B - \theta).$$

The shaded regions in Figure 1 display the fraction of input space where the answers of teacher and student differ. The generalization error can be easily calculated to be

$$\varepsilon_g = \frac{1}{\pi}\arccos\frac{\tilde{R}}{\sqrt{\tilde{q}}} + \int_0^{\theta} Dh\,\Bigl[1 - 2\Phi\Bigl(\frac{\tilde{R}h}{\sqrt{\tilde{q}-\tilde{R}^2}}\Bigr)\Bigr], \qquad (11)$$

where $Dh$ is the Gaussian measure, $Dh = \frac{dh}{\sqrt{2\pi}}\, e^{-h^2/2}$, and $\Phi(x) = \int_x^{\infty} Dh$.

(ii) The second rule is the so-called reversed-wedge problem [19],

$$g_B(h_B) = \mathrm{sign}\bigl(h_B\,(h_B^2 - \gamma^2)\bigr).$$

The shaded regions in Figure 2 display, in the same way as before, the fraction of input space where the answers of teacher and student differ. The generalization error can be calculated similarly:

$$\varepsilon_g = \frac{1}{\pi}\arccos\frac{\tilde{R}}{\sqrt{\tilde{q}}} + 2\int_0^{\gamma} Dh\,\Bigl[1 - 2\Phi\Bigl(\frac{\tilde{R}h}{\sqrt{\tilde{q}-\tilde{R}^2}}\Bigr)\Bigr]. \qquad (12)$$
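The closed forms (11) and (12) can be cross-checked against a direct Monte Carlo evaluation of the definition (10): for spherically symmetric inputs the pair (h_B, h_J) is zero-mean Gaussian with unit teacher variance, student variance q~ and covariance R~, which is all that the generalization error depends on. The sketch below (ours; the parameter values are illustrative only) estimates eps_g this way for both teacher rules.

```python
import numpy as np

def gen_error_mc(R_t, q_t, g_B, n=1_000_000, seed=0):
    """Monte Carlo estimate of eq (10): probability that the clipped student output
    sign(h_J) differs from the teacher output g_B(h_B).  (h_B, h_J) is zero-mean
    Gaussian with <h_B^2> = 1, <h_J^2> = q_t and <h_B h_J> = R_t."""
    rng = np.random.default_rng(seed)
    h_B = rng.standard_normal(n)
    h_J = R_t * h_B + np.sqrt(q_t - R_t**2) * rng.standard_normal(n)
    return np.mean(g_B(h_B) != np.sign(h_J))

theta, gamma = 0.4, 1.2        # illustrative threshold / wedge parameters (our choice)
teachers = {
    "threshold":      lambda h: np.sign(h - theta),
    "reversed wedge": lambda h: np.sign(h * (h**2 - gamma**2)),
}
for name, g_B in teachers.items():
    for R_t, q_t in [(0.3, 1.0), (0.6, 1.0), (0.9, 1.0)]:
        print(f"{name:14s}  R~={R_t:.1f}  q~={q_t:.1f}  eps_g ~ {gen_error_mc(R_t, q_t, g_B):.4f}")
```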

5. Statistical Mechanics: A Lagrangean Approach

Following the approach of Elizabeth Gardner [1, 2], the application of statistical mechanics to network learning (for a review see [6, 7, 8]) is often based on the fact that the network configurations $J_j$ obtained from a learning algorithm are minima of a suitable training energy E. In this case, by introducing a canonical ensemble of networks at inverse temperature $\beta$, the desired configuration appears as the one with maximal weight in the partition function $Z = \int d\mathbf{J}\, e^{-\beta E}$, in the limit $\beta \to \infty$. This procedure works fine [3], e.g. in order to determine the order parameters of the standard Adaline rule, which are explicit functions of the couplings. However, it does not give us any direct information on the embedding strengths, which are only implicitly related to the couplings. One possibility to obtain these quantities would be to introduce their explicit definitions within $\delta$ functions, see e.g. [17]. In our paper, we will use a new, technically more elegant approach which is based on a Lagrangean formulation of the optimization problem rather than on a Hamiltonian E.

We will make explicit use of the fact that the coupling vector is the solution of the following constrained optimization problem: minimize $\tfrac{1}{2}\mathbf{J}^2$ under the constraints

$$\frac{1}{\sqrt{N}}\,\mathbf{J}\cdot\boldsymbol{\xi}^\mu = S_B^\mu, \qquad \forall\mu.$$

By introducing Lagrange parameters $x^\mu$ together with the Lagrange function

$$L\bigl(\mathbf{J},\{S_B^\mu x^\mu\}\bigr) = \frac{1}{2}\,\mathbf{J}^2 - \sum_{\mu} S_B^\mu x^\mu\Bigl(\frac{1}{\sqrt{N}}\,\mathbf{J}\cdot\boldsymbol{\xi}^\mu - S_B^\mu\Bigr), \qquad (13)$$

we can solve the optimization problem by finding the point in the (N+p)-dimensional space of the $J_j$'s and $x^\mu$'s where L is stationary. Setting the partial derivative of L with respect to the $J_j$'s equal to zero, we see that

$$\mathbf{J} = \frac{1}{\sqrt{N}}\sum_{\mu} x^\mu\, S_B^\mu\, \boldsymbol{\xi}^\mu. \qquad (14)$$

Thus, the $x^\mu$'s actually coincide with the embedding strengths. However, the stationary point of L is a saddle point, not a minimum. In order to keep integrals finite, we will work with the following complex partition function

$$Z = \int \prod_j dJ_j \prod_{\mu} dy^\mu\, \exp\bigl[-\beta L\bigl(\mathbf{J},\{i y^\mu\}\bigr)\bigr]. \qquad (15)$$

Using the saddle-point method, by suitably deforming the contour of integration of the $y^\mu$'s, we find that in the limit $\beta \to \infty$ the dominant contribution to Z comes from the stationary point of the Lagrangean L. Using the complex distribution in Z, we are able to calculate any average of functions of the embedding strengths by identifying $y^\mu$ with $-i S_B^\mu x^\mu$ at the saddle point. In particular, we are interested in the distribution $W(\tilde{J}_l)$ of an arbitrary component of the new student vector $\tilde{J}_l(\{x^\mu\})$. This enables us to calculate the order parameters necessary to describe the generalization ability of a network with the new couplings $\tilde{J}_l$.
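Returning for a moment to the constrained problem (13): a quick numerical check (ours, not part of the paper; all names are our own) sets it up as a KKT linear system and confirms that the resulting Lagrange multipliers reproduce the embedding strengths (7) and the weighted Hebbian form (14).

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 150, 90
xi = rng.standard_normal((p, N))
B = rng.standard_normal(N); B *= np.sqrt(N) / np.linalg.norm(B)
S_B = np.sign(xi @ B / np.sqrt(N))

A = xi / np.sqrt(N)                  # constraint matrix: A @ J = S_B

# KKT system for: minimize (1/2)|J|^2  subject to  A @ J = S_B
#   [ I   -A^T ] [ J      ]   [ 0   ]
#   [ A    0   ] [ lambda ] = [ S_B ]
KKT = np.block([[np.eye(N), -A.T],
                [A, np.zeros((p, p))]])
sol = np.linalg.solve(KKT, np.concatenate([np.zeros(N), S_B]))
J, lam = sol[:N], sol[N:]

# the multipliers are S_B^mu x^mu, so the embedding strengths are S_B * lambda
x_from_kkt = S_B * lam
x_explicit = S_B * np.linalg.solve(xi @ xi.T / N, S_B)   # eq (7)
print(np.allclose(x_from_kkt, x_explicit))               # True
print(np.allclose(J, (lam @ xi) / np.sqrt(N)))           # eq (14): True
```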

The characteristic function $\tilde{\Phi}(k) = \langle e^{i k \tilde{J}_l(\{x^\mu\})}\rangle$ of this random variable is expressed as a further average over the complex distribution defined by the partition function (15). We will denote this average by $\langle\cdots\rangle$ and get

$$\tilde{\Phi}(k) = \lim_{\beta\to\infty}\bigl\langle\exp\bigl[i k\,\tilde{J}_l\bigl(\{i y^\mu S_B^\mu\}\bigr)\bigr]\bigr\rangle
= \lim_{\beta\to\infty} Z^{-1}\int \prod_j dJ_j \prod_\mu dy^\mu\,
\exp\Bigl[i\beta\sum_\mu y^\mu\Bigl(\tfrac{1}{\sqrt{N}}\,\mathbf{J}\cdot\boldsymbol{\xi}^\mu - S_B^\mu\Bigr) - \tfrac{\beta}{2}\,\mathbf{J}^2 + i k\,\tilde{J}_l\Bigr]. \qquad (16)$$

Introducing n replicas in order to represent the normalization, together with the local field of the teacher and its conjugate variable, one finds that in the limit $n \to 0$ only the l-th component of the student vector contributes. After averaging over the inputs and introducing appropriate order parameters, assuming replica symmetry, we obtain in the limit $n \to 0$

$$\tilde{\Phi}(k) = \exp\Bigl(-\frac{k^2}{2}\,A + i k\, b\, B_l\Bigr), \qquad (18)$$

with constants A and b which are combinations of the following order parameters, calculated within the replica framework:

$$Q_{ab} = \langle y_a y_b\rangle, \qquad F = \bigl\langle f^2\bigl(i y\, g_B(h)\bigr)\bigr\rangle, \qquad G_{ab} = \bigl\langle i y_a\, g_B(h)\, f\bigl(i y_b\, g_B(h)\bigr)\bigr\rangle,$$
$$S = \bigl\langle i y\, g_B(h)\bigr\rangle, \qquad T = \bigl\langle i v\, g_B(h)\, f\bigl(i y\, g_B(h)\bigr)\bigr\rangle, \qquad U = \langle y v\rangle. \qquad (19)$$

Finally, $\Delta = \beta(G_{aa} - G_{ab})$ and $\Gamma = 1 + \beta(Q_{aa} - Q_{ab})$. Explicit expressions for these quantities are given in Appendix B. By a Fourier transform of (18) we obtain the Gaussian distribution

$$W(\tilde{J}_l) = \frac{1}{\sqrt{2\pi A}}\,\exp\Bigl(-\frac{(\tilde{J}_l - B_l\, b)^2}{2A}\Bigr). \qquad (20)$$
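The Gaussian form (20) can be illustrated by a small simulation (ours; the sizes and sample counts are arbitrary): for a fixed teacher, the construction of J~ is repeated over many random training sets; the component-wise mean of J~ should then be proportional to B with component-independent fluctuations, and the fitted b and A determine the order parameters introduced below.

```python
import numpy as np

rng = np.random.default_rng(3)
N, alpha, n_sets = 100, 0.6, 400
p = int(alpha * N)
B = rng.standard_normal(N); B *= np.sqrt(N) / np.linalg.norm(B)

def pruned_vector(xi, S_B):
    x = S_B * np.linalg.solve(xi @ xi.T / xi.shape[1], S_B)   # embedding strengths (7)
    f = np.where(x > 0, x, 0.0)                               # modified Adaline weights
    return (f * S_B) @ xi / np.sqrt(xi.shape[1])              # eq (8)

J_tilde = np.empty((n_sets, N))
for t in range(n_sets):
    xi = rng.standard_normal((p, N))
    S_B = np.sign(xi @ B / np.sqrt(N))
    J_tilde[t] = pruned_vector(xi, S_B)

mean_J = J_tilde.mean(axis=0)          # should be close to b * B_l  (eq (20))
var_J = J_tilde.var(axis=0)            # should be close to the l-independent constant A
b_hat = (mean_J @ B) / N
A_hat = var_J.mean()
print("corr(mean_J, B)  =", np.corrcoef(mean_J, B)[0, 1])
print("b  (i.e. R~)     =", b_hat)
print("A, spread over l =", A_hat, var_J.std())
print("A + b^2          =", A_hat + b_hat**2, "vs q~ =", (J_tilde**2).sum(axis=1).mean() / N)
```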

Hence, using the self-averaging property of order parameters, the overlap of the new weight vector with the teacher and the corresponding norm are given by

$$\tilde{R} = \frac{1}{N}\sum_{l=1}^{N}\langle\tilde{J}_l\rangle\, B_l = b$$

and

$$\tilde{q} = \frac{1}{N}\sum_{l=1}^{N}\langle\tilde{J}_l^2\rangle = A + b^2.$$

Using these order parameters, the generalization error for the different tasks of section 4 can now be calculated.

6. Results

In this section we present the learning curves of our algorithms applied to the learning tasks of section 4. For both learning tasks, the distribution P(x) of embedding strengths (see Figure 3 and Appendix A) broadens as $\alpha$ increases and shows an increasing fraction of negative x. Hence, using our algorithm, a mostly increasing fraction of examples, which approaches 1/2 as $\alpha \to 1$, will be cast out of the training set. In Figure 4, we have displayed the relative number of remaining examples $\alpha_{\mathrm{eff}} = p_{\mathrm{eff}}/N$ for both learning tasks. Here

$$\alpha_{\mathrm{eff}} = \alpha \int_0^{\infty} P(x)\, dx.$$

Figure 5 shows the generalization error for the modified Adaline algorithm with weight function $f(x) = x\,\theta(x)$, learning a perceptron with threshold. The dashed curve was obtained for the standard Adaline algorithm, where for $\alpha \to 1$ only the trivial generalization $\varepsilon_g = 1/2$ is achieved. Obviously, by selecting examples, this overfitting phenomenon vanishes. The little symbols on the curves are results of numerical simulations of the algorithm. Since the modified algorithm achieves smaller errors than the minimum of the dashed curve, it performs better than an Adaline algorithm applied to a random selection of $\alpha_{\mathrm{eff}} N$ examples. It is interesting to note that both $\tilde{R}$ and $\tilde{q}$ diverge for $\alpha \to 1$, as in the case of normal Adaline learning [3]. The ratio $\tilde{R}/\sqrt{\tilde{q}}$, however, which enters the formulae for $\varepsilon_g$, remains finite.

For the same task, Figure 7 shows the result for the modified Hebb method $f(x) = \theta(x)$. This yields, except for $\alpha$ close to 1, a slight reduction of performance compared to the modified Adaline case. The dashed curves are the results for Hebbian learning of a random selection of the same number of examples $p_{\mathrm{eff}} = \alpha_{\mathrm{eff}} N$. However, both sets of curves are displayed as functions of the initial size of the training set. This comparison again proves that the selected examples contain more information about the learning task than a random subset of examples.
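The qualitative behaviour of these learning curves can be reproduced with a few lines of numpy. The sketch below (ours; it uses a single training set per value of alpha rather than the averaged simulations shown in the figures, and the threshold is an illustrative choice) compares the clipped standard Adaline solution with the modified Adaline rule for the threshold teacher. It should show the standard Adaline error climbing towards 1/2 near alpha = 1 while the pruned rule stays well below it, in line with Figure 5.

```python
import numpy as np

def gen_error(J, B, g_B, n_test=20000, rng=None):
    """Empirical eq (10): fraction of random test inputs where the clipped student
    sign(h_J) differs from the teacher output g_B(h_B)."""
    rng = rng or np.random.default_rng(0)
    xi = rng.standard_normal((n_test, len(B)))
    h_B, h_J = xi @ B / np.sqrt(len(B)), xi @ J / np.sqrt(len(B))
    return np.mean(g_B(h_B) != np.sign(h_J))

N, theta = 200, 0.4
g_B = lambda h: np.sign(h - theta)          # teacher perceptron with threshold
rng = np.random.default_rng(4)
B = rng.standard_normal(N); B *= np.sqrt(N) / np.linalg.norm(B)

for alpha in (0.2, 0.5, 0.8, 0.95):
    p = int(alpha * N)
    xi = rng.standard_normal((p, N))
    S_B = g_B(xi @ B / np.sqrt(N))
    x = S_B * np.linalg.solve(xi @ xi.T / N, S_B)             # embedding strengths (7)
    J_psi = (x * S_B) @ xi / np.sqrt(N)                       # standard Adaline / PSI (3)
    J_mod = (np.where(x > 0, x, 0) * S_B) @ xi / np.sqrt(N)   # modified Adaline (8)
    print(f"alpha={alpha:.2f}  eps_g(Adaline)={gen_error(J_psi, B, g_B, rng=rng):.3f}"
          f"  eps_g(modified)={gen_error(J_mod, B, g_B, rng=rng):.3f}")
```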

Figures 6 and 8 display the performance of the algorithms in the case of the reversed-wedge problem. The overall behaviour is roughly the same as in the case of the threshold perceptron.

7. Conclusion

Using statistical mechanics, we have analysed a simple strategy for pruning the training set of examples in the case of a linear classifier network. By rewriting the coupling vector as a Hebbian sum, the natural concept of the weight of an example is introduced. Deleting the examples with negative weights from the training set avoids the overfitting phenomenon. By using different weight functions f(x), our analysis might be extended to other types of example selection, e.g. erasing the examples which are too hard to learn [18]. In this case, we expect that the network's performance will benefit most from such a procedure when $\alpha$ is sufficiently larger than 1. However, such an analysis requires a different mathematical treatment from that of section 5. It might further be interesting to see whether similar strategies can be developed for more complicated networks.

Acknowledgement

We are grateful to Wolfgang Kinzel and Ronny Meir for stimulating discussions. G. Jung was supported by a DFG grant and M. Opper by a Heisenberg fellowship of the DFG.

Appendix A. Distribution of Embedding Strengths

In this appendix we briefly derive the calculation of the distribution of embedding strengths $x^\mu$. Defining the characteristic function

$$\phi(k) = \bigl\langle e^{i k x^\mu}\bigr\rangle = \lim_{\beta\to\infty}\bigl\langle e^{-k y^\mu S_B^\mu}\bigr\rangle, \qquad (A1)$$

and introducing replicas for the normalization as in section 5, the average over the inputs can be performed after the teacher field $h^\mu$ and its conjugate variable $v^\mu$ have been introduced. Defining the order parameters $R_a = \langle \mathbf{B}\cdot\mathbf{J}_a\rangle/N$ and $q_{ab} = \langle \mathbf{J}_a\cdot\mathbf{J}_b\rangle/N$, we obtain in replica symmetry a Gaussian integral over a single field h and a single auxiliary variable z, with the response parameter $\chi = \beta(q_{aa} - q_{ab})$. Assuming binary outputs, $\lvert g_B(h)\rvert = 1$, this finally yields

$$P(x) = \frac{\chi}{\sqrt{2\pi(q - R^2)}}\int Dh\, \exp\Bigl[-\frac{\bigl(1 - \chi x - R h\, g_B(h)\bigr)^2}{2(q - R^2)}\Bigr]. \qquad (A2)$$

The order parameters R and q can be obtained by the techniques introduced in [3, 10] and read

$$R = \alpha\int Dh\, h\, g_B(h), \qquad q = \frac{\alpha\int Dh\, g_B^2(h) - R^2}{1 - \alpha}$$

for $\alpha < 1$. Explicit results for continuous functions $g_B(h)$ are given in [10]. Finally, the parameter $\chi$ can be determined using eq. (3) together with (A2):

$$q = \frac{\mathbf{J}^2}{N} = \frac{1}{N}\sum_{\mu,\nu} g_B(h^\mu)\,(C^{-1})_{\mu\nu}\, g_B(h^\nu) = \frac{1}{N}\sum_{\mu} g_B(h^\mu)\, x^\mu \qquad (A3)$$
$$\phantom{q} = \alpha\int dx\, x\, P(x). \qquad (A4)$$

In the last equality we have specialised to binary outputs only. Calculating the integral with the help of (A2) and comparing with (A3) leads to $\chi = 1 - \alpha$.
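The distribution (A2) can be compared with a direct simulation. The sketch below (ours, and relying on the binary-output expressions for R, q and chi given above) draws one training set for the threshold teacher, computes the embedding strengths from eq (7), and evaluates (A2) by numerical integration over h; at finite N the agreement is only approximate.

```python
import numpy as np

N, alpha, theta = 400, 0.6, 0.4          # arbitrary illustrative choices
p = int(alpha * N)
rng = np.random.default_rng(5)
g_B = lambda h: np.sign(h - theta)

# embedding strengths of one large training set (eq (7))
B = rng.standard_normal(N); B *= np.sqrt(N) / np.linalg.norm(B)
xi = rng.standard_normal((p, N))
S_B = g_B(xi @ B / np.sqrt(N))
x_sim = S_B * np.linalg.solve(xi @ xi.T / N, S_B)

# order parameters of the plain Adaline solution for binary outputs (Appendix A)
R = alpha * np.sqrt(2 / np.pi) * np.exp(-theta**2 / 2)    # alpha * int Dh h g_B(h)
q = (alpha - R**2) / (1 - alpha)
chi = 1 - alpha

def P_theory(x):
    """Numerical h-integration of the distribution (A2) as reconstructed above."""
    h = np.linspace(-8.0, 8.0, 4001)
    Dh = np.exp(-h**2 / 2) / np.sqrt(2 * np.pi) * (h[1] - h[0])
    arg = 1 - chi * np.asarray(x)[:, None] - R * h * g_B(h)
    return chi / np.sqrt(2 * np.pi * (q - R**2)) * (Dh * np.exp(-arg**2 / (2 * (q - R**2)))).sum(axis=1)

edges = np.linspace(-4, 8, 13)
hist, _ = np.histogram(x_sim, bins=edges, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])
for c, h_emp, p_th in zip(centers, hist, P_theory(centers)):
    print(f"x = {c:5.2f}   simulation {h_emp:.3f}   theory {p_th:.3f}")
```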

Appendix B. Order Parameters

The explicit results for the order parameters (19) are obtained from the saddle point of section 5. There the embedding strength of an example with teacher field h can be written as

$$x(h,z) = -\frac{g_B(h)\,\bigl[R h - g_B(h) + z\sqrt{q - R^2}\bigr]}{\chi},$$

with a standard Gaussian variable z, so that for a general weighting function f

$$F = \int Dh \int Dz\; f^2\bigl(x(h,z)\bigr), \qquad
G = \int Dh \int Dz\; x(h,z)\, f\bigl(x(h,z)\bigr), \qquad
T = \int Dh \int Dz\; h\, g_B(h)\, f\bigl(x(h,z)\bigr),$$

while

$$Q = -\frac{1}{\chi^2}\int Dh \int Dz\; \bigl[R h - g_B(h) + z\sqrt{q - R^2}\bigr]^2, \qquad
U = -\int Dh\; h\, g_B(h).$$

The parameters Q and U depend only on the learning task and are independent of the choice of the weight function f(x).

(i) Perceptron with threshold:

$$Q = -\frac{1}{\chi^2}\Bigl[1 + q - 2R\sqrt{\tfrac{2}{\pi}}\, e^{-\theta^2/2}\Bigr], \qquad
U = -\sqrt{\tfrac{2}{\pi}}\, e^{-\theta^2/2}. \qquad (B5)$$

(ii) Reversed-wedge problem:

$$Q = -\frac{1}{\chi^2}\Bigl[1 + q - 2R\sqrt{\tfrac{2}{\pi}}\,\bigl(2 e^{-\gamma^2/2} - 1\bigr)\Bigr], \qquad
U = -\sqrt{\tfrac{2}{\pi}}\,\bigl[2 e^{-\gamma^2/2} - 1\bigr].$$

For modified Hebbian learning, $f(x) = \theta(x)$, the z-integrations can be carried out explicitly and F, G and T reduce to single Gaussian integrals over h containing the function $\Phi\bigl(g_B(h)\,[R h - g_B(h)]/\sqrt{q - R^2}\bigr)$; the corresponding expressions for modified Adaline learning, $f(x) = x\,\theta(x)$, follow in the same way.

References

[1] Gardner E 1988 J. Phys. A: Math. Gen. 21 257
[2] Gardner E and Derrida B 1988 J. Phys. A: Math. Gen. 21 271
[3] Opper M, Kinzel W, Kleinz J and Nehl R 1990 J. Phys. A: Math. Gen. 23 L581
[4] Hebb D O 1949 The Organization of Behaviour (New York: Wiley)
[5] Anlauf J and Biehl M 1989 Europhys. Lett. 10 687
[6] Seung H S, Sompolinsky H and Tishby N 1992 Phys. Rev. A 45 6056
[7] Watkin T L H, Rau A and Biehl M 1993 Rev. Mod. Phys. 65 499
[8] Opper M and Kinzel W Statistical Mechanics of Generalization, in: Physics of Neural Networks III, edited by van Hemmen J L, Domany E and Schulten K (Berlin: Springer, to be published)
[9] Vallet F 1989 Europhys. Lett. 8 747
[10] Bös S, Kinzel W and Opper M 1993 Phys. Rev. E 47 1384
[11] Kohonen T 1988 Self-Organization and Associative Memory (Berlin: Springer)

[12] Diederich S and Opper M 1987 Phys. Rev. Lett. 59 949
[13] Widrow B and Hoff M E 1960 WESCON Convention Report IV 96
[14] Kinzel W and Rujan P 1990 Europhys. Lett. 13 473
[15] Watkin T L H and Rau A 1992 J. Phys. A: Math. Gen. 25 113
[16] Seung H S, Opper M and Sompolinsky H 1992 contribution to the Vth Annual Workshop on Computational Learning Theory (COLT92) (Pittsburgh 1992) pp 287-294, published by the Association for Computing Machinery, New York
[17] Opper M 1988 Phys. Rev. A 38 3824
[18] Garces R 1993 Scheme to improve the generalization error, in: Proceedings of the 1993 Connectionist Models Summer School, edited by Mozer M, Smolensky P, Touretzky D, Elman J and Weigend A
[19] Watkin T L H and Rau A 1992 Phys. Rev. A 45 4102
[20] Garces R, Kuhlmann P and Eissfeller H 1992 J. Phys. A: Math. Gen. 25 L1335
[21] Kinzel W and Opper M 1991 Dynamics of Learning, in: Physics of Neural Networks, edited by van Hemmen J L, Domany E and Schulten K (Berlin: Springer) p 149
[22] Krogh A and Hertz J 1991 in: Advances in Neural Information Processing Systems III (San Mateo: Morgan Kaufmann)
[23] LeCun Y, Denker J S and Solla S 1990 Optimal Brain Damage, in: Advances in Neural Information Processing Systems II (Denver 1989), edited by Touretzky D, p 598 (San Mateo: Morgan Kaufmann)
[24] Bouten M, Engel A, Komoda A and Serneels R 1990 J. Phys. A: Math. Gen. 23 4643
[25] Kuhlmann P and Müller K R 1994 J. Phys. A: Math. Gen. 27 3759

Figure Captions

Figure 1: Geometry of input space for the perceptron with threshold.

Figure 2: Geometry of input space for the reversed-wedge problem.

Figure 3: Distribution of embedding strengths for the signum-teacher with nonzero threshold. a) Theoretical results for α = 0.4, 0.6, 0.8, 0.9 (upper to lower curves), b) theoretical results (dashed) and simulations (histogram) for α = 13/32, c) same as b) but for α = 28/32.

Figure 4: Relative size α_eff = p_eff/N of the pruned training set versus relative size α = p/N of the original training set for a) the perceptron with threshold (full line θ = 0, dashed line θ = 0.4) and b) the reversed-wedge problem (full line γ = 0, dashed line γ = 1.2).

Figure 5: Generalization error for the modified Adaline algorithm (full lines) compared with regular Adaline (dashed lines). Task: perceptron with threshold. The symbols with errorbars represent results achieved by a perceptron with N = 32 input units, averaged over many draws of different learning sets.

Figure 6: Same as in Figure 5, but for the reversed-wedge problem.

Figure 7: Generalization error for modified Hebb learning (full lines). Task: a perceptron with threshold. The dashed lines show regular Hebb learning using a random selection of p_eff = α_eff N examples. Simulations are done analogously to Figure 5.

Figure 8: Same as in Figure 7, but for the reversed-wedge problem. In most cases the modified Hebb rule achieves better results than the modified Adaline algorithm, except for the realizable problem γ = 0.



More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information
