
Journal of Mathematical Systems, Estimation, and Control, Vol. 6, No. 2, 1996, pp. 1–18. © 1996 Birkhäuser-Boston

Local Minima and Attractors at Infinity for Gradient Descent Learning Algorithms*

Kim L. Blackmore†, Robert C. Williamson†, Iven M.Y. Mareels†

Abstract

In the paper "Learning Nonlinearly Parametrized Decision Regions", an online scheme for learning a very general class of decision regions is given, together with conditions on both the parametrization and on the sequence of input examples under which good learning can be guaranteed to occur. In this paper, we discuss these conditions, in particular the requirement that there be no non-global local minima of the relevant error function, and the more specific problem of no attractor at infinity. Somewhat simpler sufficient conditions are given. A number of examples are discussed.

Key words: error function, local minimum, attractor at infinity, learning, decision region

AMS Subject Classifications: 68T05

1 Introduction

In [3] we presented a gradient descent based learning algorithm which was motivated by neural network learning algorithms. We gave a deterministic analysis of the convergence properties of the algorithm, using techniques of dynamical systems analysis. In the course of this analysis, we derived conditions under which the algorithm is guaranteed to learn effectively. The conditions for convergence of the algorithm include restrictions on the topological properties of an associated error surface: specifically, that it does not have non-global local minima, and does not produce an attractor at infinity.

* Received November 5, 1993; received in final form June 30. Summary appeared in Volume 6, Number 6.
† This work was supported by the Australian Research Council.

As the error surface is defined by an average over all of the training examples, it is difficult to test conditions on the error surface. The aim of this paper is to develop simpler sufficient conditions which are easier to test in applications. We have had some success in this aim, particularly with the question of attractors at infinity, although the question of existence of local minima remains intractable for some very simple parametrizations. This is not surprising when one considers the value to general optimization theory of identifying functions with no local minima. This question has long been investigated, with few helpful results [5]. The results given in this paper suggest considerable difficulty in analysing more complicated neural network parametrizations. However, some potentially tractable areas for further analysis are identified.

We present a number of examples of applying these results to understanding the behaviour of the algorithm for different classes of decision regions. We look at three different parametrizations for the class of half spaces containing the origin, and show that two of these satisfy the conditions for convergence, but the third does not. The third parametrization is shown to produce an attractor at infinity. This is reminiscent of behaviour observed in neural network learning, where estimate parameters may drift off to infinity. This exercise shows that the large scale behaviour of the learning algorithm is very much dependent on the choice of parametrization. The fact that such different behaviour is observed strongly suggests (but of course does not prove) that for more complex decision regions an even wider range of behaviours may be apparent with different parametrizations of the decision boundary. We also look at a problem motivated by a radar application, where the decision regions are stripes, and at decision regions which are an intersection of two half spaces. We show that there is no attractor at infinity for our parametrization of an (approximate) intersection of two half spaces.

In the following section we outline the algorithm presented in [3] and state the conditions for convergence that we will be discussing. In Section 3 we assume that the example points are uniformly distributed. Section 3.1 contains conditions under which both local minima and attractors at infinity are excluded, whereas Section 3.2 focuses on the exclusion of attractors at infinity when local minima may or may not be present. In Section 4 we look at the application to learning half spaces, in Section 5 we look at the problem of learning stripes, and in Section 6 we discuss intersections of half spaces. In Section 7 we raise the question of whether additional problems occur when the examples are nonuniformly distributed, and we show that the results of Section 3 must be modified. Section 8 concludes.

2 Problem Formulation and Previous Results

It is assumed that a sequence of data samples $((x_k, y_k))_{k \in \mathbb{Z}^+}$ is received, where $x_k \in X \subseteq \mathbb{R}^n$ and $y_k \in \{-1, 1\}$. $X$ is called the sample space, and $y_k$ indicates the membership of $x_k$ in some decision region $\Sigma \subseteq X$. If $x_k$ is contained in $\Sigma$ then $y_k = 1$; otherwise $y_k = -1$. We assume that $\Sigma$ belongs to a class $C$ of subsets of $X$ for which there exists some parameter space $A \subseteq \mathbb{R}^m$ and some epimorphism (onto mapping) $\Sigma : A \to C$. Moreover, it is assumed that there exists a parametrization $f$ for $C$, defined below.

Definition 2.1 Let $C$ be a class of closed subsets of $X$ with parameter space $A = \mathbb{R}^m$ and some epimorphism $\Sigma : A \to C$. A parametrization of $C$ is a function $f : A \times X \to \mathbb{R}$ such that for all $a \in A$

$$f(a, x) \;\begin{cases} > 0 & \text{if } x \in \text{interior of } \Sigma(a) \\ = 0 & \text{if } x \in \text{boundary of } \Sigma(a) \\ < 0 & \text{if } x \notin \Sigma(a) \end{cases} \qquad (2.1)$$

In addition, $f(a,x)$ is required to be twice differentiable with respect to $a$ on $A \times X$; $f$ and its first two derivatives with respect to $a$ are to be bounded in a compact domain; and $f$ must be Lipschitz continuous in $x$ in a compact domain.

The algorithm proposed in [3] is as follows:

Algorithm 2.2

Step 0: Choose the stepsize $\mu \in (0,1)$. Choose a boundary sensitivity parameter $\varepsilon \in (0,1)$. Choose an initial parameter estimate $a_0 \in A$.

Step 1: Commencing at $k = 0$, iterate the recursion

$$a_{k+1} = a_k - \mu\, \frac{\partial f}{\partial a}(a_k, x_k)\, \big(g(a_k, x_k) - y_k\big), \qquad (2.2)$$

where

$$g(a, x) := \frac{2}{\pi} \arctan \frac{f(a, x)}{\varepsilon}. \qquad (2.3)$$

In [3], Algorithm 2.2 is shown to be a perturbation of stepwise gradient descent on the cost function

$$J(a) = \lim_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K-1} \left[ f(a, x_k)\big(g(a, x_k) - g(a^*, x_k)\big) - \frac{\varepsilon}{\pi} \ln\!\left(1 + \frac{f(a, x_k)^2}{\varepsilon^2}\right) \right], \qquad (2.4)$$

where $a^* \in A$ is the true parameter vector.
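As an illustration (not part of the original analysis), Algorithm 2.2 can be realised in a few lines of NumPy. The function names and the default values of $\mu$ and $\varepsilon$ below are ours; the parametrization $f$ and its gradient $\partial f/\partial a$ must be supplied by the user.

```python
import numpy as np

def g(f_val, eps):
    # Squashing function (2.3): g(a, x) = (2 / pi) * arctan(f(a, x) / eps)
    return (2.0 / np.pi) * np.arctan(f_val / eps)

def learn(f, grad_f, a0, samples, mu=0.1, eps=0.01):
    """Sketch of Algorithm 2.2.

    f(a, x)      -- parametrization value (scalar), positive inside Sigma(a)
    grad_f(a, x) -- gradient of f with respect to a (vector of length m)
    samples      -- iterable of (x_k, y_k) pairs with y_k in {-1, +1}
    """
    a = np.asarray(a0, dtype=float)
    for x, y in samples:
        # Recursion (2.2): a_{k+1} = a_k - mu * (df/da)(a_k, x_k) * (g(a_k, x_k) - y_k)
        a = a - mu * grad_f(a, x) * (g(f(a, x), eps) - y)
    return a
```

For the linear half-space parametrization $f_1$ of Section 4, for example, one would pass `f=lambda a, x: a @ x + 1.0` and `grad_f=lambda a, x: x`.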

Based on this fact, we derived in [3] two sets of conditions, either of which guarantees that the estimate parameters $a_k$ asymptotically enter and remain in a neighbourhood of the parameters of the true decision region. This implies that after sufficiently many iterations, the misclassified region is very small.

One of the conditions in the first set (assumption B2 in [3]) is that the cost function $J$ has a unique critical point in $A$. If $a = a^*$ then each term in (2.4) is zero. Therefore the true parameter $a^*$ is always a critical point of $J$, and B2 says that $a^*$ is the only critical point of $J$. Assumption B2 cannot hold when there is more than one $a$ for which $f(a, \cdot) \equiv f(a^*, \cdot)$. This occurs in many important examples where there is symmetry in the parametrization. We say there is more than one true parameter value. In [3] we showed that B2 is not necessary for effective learning, and it is sufficient to replace it with two conditions in the second set: (C3) local minima of $J$ only occur at the points where $f(a, \cdot) \equiv f(a^*, \cdot)$; and (C4) for all $a_0 \in A$, the solution $a(t)$ of the initial value problem (IVP)

$$\dot{a}(t) = -\left(\frac{dJ}{da}\big(a(t)\big)\right)^{\!\top}, \qquad a(0) = a_0, \qquad (2.5)$$

does not cross the boundary of $A$, where $A$ is assumed to be compact. That is, $a(t) \in A$ for all $t \geq 0$.

For many interesting examples, $A = \mathbb{R}^m$ for some $m > 0$, in which case C4 says that, for all $a_0 \in A$, the solution of (2.5) is bounded for all $t \geq 0$. We say in that case that there is no attractor at infinity for the ordinary differential equation in (2.5). If C4 is violated, the estimate parameters generated by Algorithm 2.2 may drift off to infinity as the algorithm updates. This undesirable behaviour has been observed in neural network learning, and in practice it would be good to anticipate this divergence, and avoid it if possible.

For a given parametrization, it is not clear how to test conditions on the cost function $J$. This is due in part to the averaging involved in defining $J$, and partly to the application of the arctan squashing function to the parametrization before taking the difference between $f(a, \cdot)$ and $f(a^*, \cdot)$.

3 Results

In this section we ignore the influence that different input sequences can play in the learning. This is achieved by assuming that the input examples $(x_k)_{k \in \mathbb{Z}^+}$ cover the sample space $X$. By this we mean that for any integrable function $f : X \to \mathbb{R}$,

$$\lim_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K-1} f(x_k) = \frac{1}{\operatorname{vol} X} \int_X f(x)\, dx. \qquad (3.1)$$

Here $\operatorname{vol} X = \int_X dx$.
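Condition C4 concerns the gradient flow (2.5) rather than the discrete algorithm. When no closed form for $J$ is available, one can probe it numerically. The sketch below is our construction, not the paper's: given any numerical approximation of $J$ (for instance a sample average over points covering $X$, per (3.1)), it Euler-integrates the IVP and flags trajectories that escape a large ball, which is suggestive (though of course not proof) of an attractor at infinity.

```python
import numpy as np

def grad_num(J, a, h=1e-5):
    # Central-difference estimate of dJ/da for a scalar-valued J
    grad = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = h
        grad[i] = (J(a + e) - J(a - e)) / (2.0 * h)
    return grad

def stays_bounded(J, a0, dt=0.01, steps=100000, radius=1e3):
    # Euler integration of the IVP (2.5): da/dt = -dJ/da, a(0) = a0
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        a = a - dt * grad_num(J, a)
        if np.linalg.norm(a) > radius:
            return False  # escaped: consistent with an attractor at infinity
    return True
```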

This is essentially a deterministic way of saying that the input examples are uniformly distributed. In particular, this means that

$$J(a) = \frac{1}{\operatorname{vol} X} \int_X \left[ f(a,x)\big(g(a,x) - g(a^*,x)\big) - \frac{\varepsilon}{\pi} \ln\!\left(1 + \frac{f(a,x)^2}{\varepsilon^2}\right) \right] dx. \qquad (3.2)$$

3.1 Local minima

First we deal with condition B2, that the cost function $J$ has a unique critical point $a^*$. This is guaranteed if $J$ is strictly increasing along rays originating at $a^*$. That is, the directional derivative at any point $a$ in the direction $a - a^*$ should be positive.

Definition 3.1 The directional derivative of $J : A \to \mathbb{R}$ at point $a \in A$ and in direction $b \in A$ is

$$D_b J(a) := \left.\frac{dJ}{da}\right|_a \cdot b. \qquad (3.3)$$

Similarly, the directional derivative of $f : A \times X \to \mathbb{R}$ with respect to $a \in A$ at point $(a,x) \in A \times X$, in direction $b \in A$, is

$$D_b f(a,x) := \left.\frac{\partial f}{\partial a}\right|_{(a,x)} \cdot b. \qquad (3.4)$$

Note that $D_0 J(a) = D_0 f(a,x) = 0$ for any values of $a$ and $x$.

Theorem 3.2 and Corollary 3.3 use the directional derivative to give sufficient conditions that B2 is satisfied. For brevity, we use the notation

$$\mathcal{F}(a,x) := D_{a-a^*} f(a,x)\,\big(f(a,x) - f(a^*,x)\big), \qquad (3.5)$$

$$\mathcal{J}(a,x) := D_{a-a^*} f(a,x)\,\big(g(a,x) - g(a^*,x)\big), \qquad (3.6)$$

where $g$ is defined by (2.3). Note that $\mathcal{F}(a^*,x) = \mathcal{J}(a^*,x) = 0$. For any $a \in \mathbb{R}^m$ we partition $X$ into the two subsets $S_a := \{x \in X : \mathcal{F}(a,x) \geq 0\}$ and $T_a := X \setminus S_a$.

Theorem 3.2 Assume $X \subset \mathbb{R}^n$ is compact and $(x_k)$ covers $X$. Let $f : \mathbb{R}^m \times X \to \mathbb{R}$ be a parametrization of some class of decision regions, and assume $a^* \in \mathbb{R}^m$. If, for every $a \in \mathbb{R}^m$, there exists a set $R_a \subseteq S_a$ such that

$$\inf_{x \in R_a} \mathcal{F}(a,x)\, \operatorname{vol} R_a \;>\; \left(1 + \sup_{x \in R_a} \frac{f(a,x)^2}{\varepsilon^2}\right) \sup_{x \in T_a} |f(a,x)|\, \operatorname{vol} T_a, \qquad (3.7)$$

then $\left.\frac{dJ}{da}\right|_a = 0$ if and only if $a = a^*$, where $J$ is defined by (2.4).
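Inequality (3.7) can be checked numerically at any fixed $a$ once $X$ is discretized. The sketch below is ours (the paper gives no such algorithm): it takes $R_a = S_a$, one admissible choice, and replaces the volumes by point counts over a grid or sample cloud covering $X$, so it is a heuristic check only.

```python
import numpy as np

def check_3_7(f, grad_f, a, a_star, xs, eps):
    """Heuristic check of inequality (3.7) at a single parameter a.
    xs: array of points covering X, shape (N, n)."""
    a, a_star = np.asarray(a, float), np.asarray(a_star, float)
    # F(a, x) = D_{a-a*} f(a, x) * (f(a, x) - f(a*, x)), definition (3.5)
    F = np.array([(grad_f(a, x) @ (a - a_star)) * (f(a, x) - f(a_star, x)) for x in xs])
    fa = np.array([f(a, x) for x in xs])
    S = F >= 0                      # "good" set S_a
    T = ~S                          # "bad" set T_a
    if not S.any():
        return False
    if not T.any():                 # vol T_a = 0: Corollary 3.3 territory
        return F[S].min() > 0
    lhs = F[S].min() * S.sum()
    rhs = (1.0 + (fa[S] ** 2).max() / eps ** 2) * np.abs(fa[T]).max() * T.sum()
    return lhs > rhs
```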

The proof of this theorem is given in the appendix.

The regions $S_a$ and $T_a$ can be interpreted as "good" and "bad" regions respectively. If the current estimate parameter is $a_k$, and $x_k$ falls in $S_{a_k}$, then gradient descent on the instantaneous cost

$$f(a_k, x_k)\big(g(a_k, x_k) - g(a^*, x_k)\big) - \frac{\varepsilon}{\pi} \ln\!\left(1 + \frac{f(a_k, x_k)^2}{\varepsilon^2}\right)$$

will cause an update which brings the estimate closer to the true parameter vector. Thus if $x_k \in S_{a_k}$ (and $\mu$ is sufficiently small), then $\|a_{k+1} - a^*\| \leq \|a_k - a^*\|$, where $a_{k+1}$ is determined by (2.2). However, if $x_k$ is chosen in $T_{a_k}$, then the estimate parameters will move away from the true parameters. Averaging relies on the idea that the effect of the erroneous updates is negligible. This requires that for any $a \in A = \mathbb{R}^m$, the volume of $T_a$ is small compared to that of $S_a$, and updates made when $x_k$ falls in $T_a$ are not significantly larger than when $x_k$ falls in $S_a$. The following corollary deals with the special case where the volume of $T_a$ is zero.

Corollary 3.3 Assume $X \subset \mathbb{R}^n$ is compact and $(x_k)$ covers $X$. Let $f : \mathbb{R}^m \times X \to \mathbb{R}$ be a parametrization of some class of decision regions, and assume $a^* \in \mathbb{R}^m$. If, for every $a \in \mathbb{R}^m$, there exists a set $U_a \subseteq X$ such that the closure of $U_a$ is $X$ and $\mathcal{F}(a,x) > 0$ for all $x \in U_a$, then $\left.\frac{dJ}{da}\right|_a = 0$ if and only if $a = a^*$, where $J$ is defined by (2.4).

Proof: From the definition of $S_a$, it is clear that $U_a \subseteq S_a \subseteq X$; but the closure of $U_a$ is $X$, so $\bar{S}_a = X$. Therefore $\operatorname{vol} S_a = \operatorname{vol} X$ and $\operatorname{vol} T_a = 0$ for all $a \in \mathbb{R}^m$. Because $\mathcal{F}(a,x) > 0$ for all $x \in U_a$, there exists a set $R_a \subseteq U_a$ such that $\operatorname{vol} R_a > 0$ and $\inf_{R_a} \mathcal{F}(a,x) > 0$. Thus Theorem 3.2 applies. □

The above results rely on showing that $J$ is strictly increasing along rays in parameter space originating at $a^*$. If this is relaxed to $J$ being nondecreasing along the rays, we may have critical points of $J$ which are not at $a^*$, so B2 is violated. However, provided $J$ is not constant along the rays, the behaviour of the algorithm will not be significantly altered by this relaxation, because the critical points will not be local minima of $J$, and all solutions of the IVP (2.5) will be bounded for all $t \geq 0$. That is, assumptions C3 and C4 will be satisfied. As shown in [3], this is sufficient for convergence of the algorithm.

To this point, we have only considered situations where there is a unique true parameter vector. If this is not the case, B2 cannot hold, and there must be regions where the directional derivative of $J$ along rays originating at any of the true parameter vectors will be negative. In this case it is much more difficult to test for local minima.

We may have some success if the true parameter vectors are isolated and countable, and $\mathbb{R}^m$ can be partitioned into convex sets $\Lambda_i$, each containing exactly one true parameter vector $a^{*i}$. Then if the directional derivative along all rays originating at $a^{*i}$ is positive at all points in $\Lambda_i$, for all possible $i$, then there cannot be any local minima of $J$, and there is no attractor at infinity. Thus C3 and C4 are satisfied. In practice, it is generally difficult to determine the sets $\Lambda_i$, so this is probably not a useful result.

3.2 Attractors at infinity

Even when local minima of the cost function exist, it is useful to know that there is no attractor at infinity. Algorithm 2.2 uses a finite step size and gradient descent on the instantaneous cost, rather than the average cost $J$, so the estimate parameters will jump around rather than moving smoothly to the minima of $J$. Moreover, they will continue to jump around even when they are in a local minimum, or even at the global minimum of $J$. Large deviations theory suggests that estimate parameters will eventually escape from the local minima (under some additional assumptions) [1]. They may also escape from the global minimum, though this is harder because the size of the updates is smaller when the value of $J$ is closer to zero. If there is an attractor at infinity, large deviations theory suggests that estimate parameters will eventually appear in the basin of attraction of this attractor. They will then head off to infinity and may become so large that no amount of jumping around will cause them to return to the basin of attraction of the global minimum. So if there is an attractor at infinity then ultimately the estimate parameters will converge there, even though the value of $J$ at this attractor may be large.

If there is no attractor at infinity, all local minima of $J$ are constrained to lie within some compact set. Parameters will not wander off to infinity but will spend most of their time near a local or global minimum of $J$. It is much less likely that the estimate parameters will leave a global minimum than a local minimum (since the cost driving the algorithm will be less), so the estimate parameters are likely to spend most of their time near the global minimum.

A classic result dealing with attractors at infinity gives the following (see page 204 of [4]):

Lemma 3.4 Let $J : \mathbb{R}^m \to [0, \infty)$ have continuous second derivatives. If $J^{-1}([0,c])$ is compact for all $c \in [0,\infty)$, and $\left.\frac{dJ}{da}\right|_a \neq 0$ except at a finite number of points $a^{*1}, \ldots, a^{*r} \in \mathbb{R}^m$, then for all $a_0 \in A$, the solution of the IVP (2.5) is bounded for all $t \geq 0$.

So, under the conditions of Lemma 3.4, if $A = \mathbb{R}^m$ then assumption C4 is satisfied.

Using Lemma 3.4 involves determining the lower level sets $J^{-1}([0,c])$, which is not a simple problem. Essentially the idea here is that the cost function keeps increasing as the parameter $a$ goes to infinity in any direction. We are not worried about the shape of the cost surface for finite parameter values, but only as the parameter becomes large. Therefore we use the following, more easily tested result.

Lemma 3.5 Let $J : \mathbb{R}^m \to [0, \infty)$ have continuous second derivatives and $J(a^*) = 0$ for some $a^* \in \mathbb{R}^m$. If there exists a constant $C > 0$ such that

$$D_{a-a^*} J(a) > 0 \quad \text{for all } a \in \mathbb{R}^m, \ \|a\| \geq C,$$

then for all $a_0 \in A$, the solution of the IVP (2.5) is bounded for all $t \geq 0$.

Proof: Let $a(t) \in \mathbb{R}^m$. If $\|a(t)\| \geq C$ then $D_{a-a^*} J(a) > 0$, so the solution of the IVP (2.5) moves towards $a^*$, in the sense that $\|a(t+\varepsilon) - a^*\| \leq \|a(t) - a^*\|$ for all sufficiently small $\varepsilon > 0$. Thus all solutions of the IVP (2.5) enter and remain in the closed ball with centre $a^*$ and radius $C$. □

Lemmas 3.4 and 3.5 are true for any function $J$ which satisfies the stated assumptions. We can now use the ideas of the last section to determine assumptions which are specific to the cost function $J$ defined in (2.4).

Theorem 3.6 Assume $X \subset \mathbb{R}^n$ is compact and $(x_k)$ covers $X$. Let $f : \mathbb{R}^m \times X \to \mathbb{R}$ be a parametrization of some class of decision regions, and assume $a^* \in \mathbb{R}^m$. If there exists a constant $C > 0$ such that, for all $a \in \mathbb{R}^m$ with $\|a\| \geq C$, there exists a set $R_a \subseteq S_a$ such that

$$\inf_{x \in R_a} \mathcal{F}(a,x)\, \operatorname{vol} R_a \;>\; \left(1 + \sup_{x \in R_a} \frac{f(a,x)^2}{\varepsilon^2}\right) \sup_{x \in T_a} |f(a,x)|\, \operatorname{vol} T_a, \qquad (3.8)$$

then for all $a_0 \in A$, the solution of the IVP (2.5) is bounded for all $t \geq 0$, where $J$ is defined in (2.4).

Proof: The proof of this result follows similar lines to the proof of Theorem 3.2. □

In order to show that there is no attractor at infinity, we are only ever interested in what happens for large $\|a\|$. For some cases it may be sufficient to only calculate the limiting behaviour of $\mathcal{F}(ca,x)$ as $c \to \infty$, which is often much simpler than calculating $\mathcal{F}(a,x)$ for finite values of $a$. In particular, if $\lim_{c\to\infty} \mathcal{F}(ca,x)$ exists for all $a$ and $x$, is never negative, and for each $a$, $\lim_{c\to\infty} \mathcal{F}(ca,x)$ is positive for some values of $x$, then Theorem 3.6 will hold. Or if, for each $a$, $\mathcal{F}(ca,x) \to \infty$ for some values of $x$, and $\lim_{c\to\infty} \mathcal{F}(ca,x)$ exists for all other values of $x$, then Theorem 3.6 will hold.
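Before attempting a closed-form limit, the behaviour of $\mathcal{F}(ca,x)$ for large $c$ can be eyeballed numerically. The following small sketch is ours, not from the paper:

```python
import numpy as np

def F(f, grad_f, a, a_star, x):
    # F(a, x) = D_{a-a*} f(a, x) * (f(a, x) - f(a*, x)), definition (3.5)
    return (grad_f(a, x) @ (a - a_star)) * (f(a, x) - f(a_star, x))

def limiting_F(f, grad_f, a, a_star, x, cs=(1e1, 1e2, 1e3, 1e4)):
    """Evaluate F(ca, x) along the ray c*a for growing c.  Values settling
    toward a constant suggest the limit exists; monotone growth suggests
    divergence to +infinity, as in assumption 2 of Theorem 3.8 below."""
    a, a_star = np.asarray(a, float), np.asarray(a_star, float)
    return [F(f, grad_f, c * a, a_star, x) for c in cs]
```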

Thus we have the following results.

Theorem 3.7 Assume $X \subset \mathbb{R}^n$ is compact and $(x_k)$ covers $X$. Let $f : \mathbb{R}^m \times X \to \mathbb{R}$ be a parametrization of some class of decision regions, and assume $a^* \in \mathbb{R}^m$. If

1. $\lim_{c\to\infty} \mathcal{F}(ca,x) \geq 0$ for all $a \in A$, $a \neq 0$, $x \in X$; and

2. for all $a \in \mathbb{R}^m$, $a \neq 0$, there exists a set $V_a \subseteq X$ and a constant $r > 0$ such that $\operatorname{vol} V_a \neq 0$ and $\lim_{c\to\infty} \inf_{x \in V_a} \mathcal{F}(ca,x) \geq r$,

then for all $a_0 \in A$, the solution of the IVP (2.5) is bounded for all $t \geq 0$, where $J$ is defined in (2.4).

Proof: Choose $a \in \mathbb{R}^m$ such that $\|a\| = 1$. Define

$$\delta_a := \frac{r\, \operatorname{vol} V_a}{2\, \operatorname{vol} X \left(1 + \sup_{c>0} \sup_{x \in V_a} \dfrac{f(ca,x)^2}{\varepsilon^2}\right)}. \qquad (3.9)$$

The first assumption implies that $\sup_{x \in T_{ca}} |f(ca,x)| \to 0$ as $c \to \infty$, so there exists a constant $C_{\delta_a}$ such that

$$\sup_{x \in T_{ca}} |f(ca,x)| < \delta_a \qquad \forall\, c \geq C_{\delta_a}. \qquad (3.10)$$

Assumption 2 implies that there exists a constant $C_a$ such that

$$\inf_{x \in V_a} \mathcal{F}(ca,x) \geq \frac{r}{2} \qquad \forall\, c \geq C_a. \qquad (3.11)$$

Let $C = \sup_{\|a\|=1} \max\{C_{\delta_a}, C_a\}$. For any $a \in \mathbb{R}^m$ we can write $a = c_n a_n$, where $c_n = \|a\|$ and $a_n = a/\|a\|$. By equation (3.10), if $\|a\| \geq C$ then

$$\sup_{x \in T_{c_n a_n}} |f(c_n a_n, x)| \;<\; \frac{r\, \operatorname{vol} R_a}{2\, \operatorname{vol} X \left(1 + \sup_{c>0} \sup_{x \in R_a} \dfrac{f(ca_n,x)^2}{\varepsilon^2}\right)}, \qquad (3.12)$$

where $R_a = V_{a_n}$. By definition, $\operatorname{vol} T_a \leq \operatorname{vol} X$, so equation (3.11) implies that

$$\sup_{x \in T_a} |f(a,x)| \;<\; \frac{\inf_{x \in R_a} \mathcal{F}(a,x)\, \operatorname{vol} R_a}{\operatorname{vol} T_a \left(1 + \sup_{x \in R_a} \dfrac{f(a,x)^2}{\varepsilon^2}\right)}. \qquad (3.14)$$

Thus Theorem 3.6 holds. □
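The hypothesis of Lemma 3.5, on which Theorems 3.6 and 3.7 ultimately rest, can also be sampled directly: pick parameters of large norm and check that the radial derivative $D_{a-a^*} J(a)$ is positive. The sketch below is our illustration only; it samples finitely many directions, so it can disprove nothing rigorously, but a negative value flags a direction along which $J$ is still decreasing far from $a^*$, i.e. a possible attractor at infinity.

```python
import numpy as np

def radial_derivative(J, a, a_star, h=1e-6):
    # D_{a-a*} J(a) = (dJ/da)|_a . (a - a*), definition (3.3), by central difference
    d = a - a_star
    n = np.linalg.norm(d)
    u = d / n
    return n * (J(a + h * u) - J(a - h * u)) / (2.0 * h)

def probe_lemma_3_5(J, a_star, C, dim, trials=500, spread=10.0, seed=0):
    """Sample points with ||a|| >= C and test D_{a-a*} J(a) > 0."""
    rng = np.random.default_rng(seed)
    a_star = np.asarray(a_star, float)
    for _ in range(trials):
        u = rng.standard_normal(dim)
        a = (C + spread * rng.random()) * u / np.linalg.norm(u)
        if radial_derivative(J, a, a_star) <= 0:
            return False
    return True
```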

Theorem 3.8 Assume $X \subset \mathbb{R}^n$ is compact and $(x_k)$ covers $X$. Let $f : \mathbb{R}^m \times X \to \mathbb{R}$ be a parametrization of some class of decision regions, and assume $a^* \in \mathbb{R}^m$. If, for all $a \in \mathbb{R}^m$, $a \neq 0$,

1. there exists $r_a > 0$ such that $\lim_{c\to\infty} \mathcal{F}(ca,x) \geq -r_a$ for all $x \in X$; and

2. there exists a set $V_a \subseteq X$ such that $\operatorname{vol} V_a \neq 0$ and $\lim_{c\to\infty} \inf_{x \in V_a} \mathcal{F}(ca,x) = \infty$,

then for all $a_0 \in A$, the solution of the IVP (2.5) is bounded for all $t \geq 0$, where $J$ is defined in (2.4).

Proof: The proof follows along similar lines to that of Theorem 3.7, the main difference being that here we can choose $\inf_{x \in R_a} \mathcal{F}(a,x)$ as large as we wish, whereas in Theorem 3.7 we can choose $\sup_{x \in T_a} |f(a,x)|$ as small as we wish. □

Converse results, giving conditions under which solutions of (2.5) are unbounded ($J$ has an attractor at infinity), can also be given. Theorem 3.8 will be used in Sections 5 and 6 to show that for particular parametrizations of a stripe and the (approximate) intersection of two half spaces there is no attractor at infinity.

4 Application: Learning a Half Space

We consider decision regions which are half spaces in $\mathbb{R}^n$ containing the origin. Three different parametrizations will be discussed. The first is the natural choice of parametrization, and there appears to be no good reason why either of the other two would be chosen. However, for more complicated classes of decision regions, it can be difficult to find a suitable parametrization, and given two candidate parametrizations it is not immediately apparent which one is preferable. By looking at different parametrizations for a half space, we illustrate some of the issues that must be considered in choosing a parametrization.

The natural choice for a parametrization of the half space $a^\top x + 1 > 0$ is

$$f_1(a,x) = a^\top x + 1, \qquad (4.1)$$

where the parameter space $A = \mathbb{R}^n$. This is the parametrization used in single node artificial neural networks. In this case, the directional derivative satisfies

$$\mathcal{F}_1(a,x) = \big((a - a^*)^\top x\big)^2, \qquad (4.2)$$

which is never negative, so Corollary 3.3 holds.

It turns out that the cost function $J_1$ induced by $f_1$ is convex.

Another suitable parametrization is

$$f_2(a,x) = 1 - e^{-p(a^\top x + 1)}. \qquad (4.3)$$

Note that in this case the magnitude of $f_2(a,x)$ will be much larger for points $x$ outside the decision region than for those an equal distance inside the decision region. Nonetheless, the decision regions defined by $f_1(a,\cdot)$ and $f_2(a,\cdot)$ are identical. Taking the directional derivative, we see that

$$\mathcal{F}_2(a,x) = p\,(a - a^*)^\top x \left(e^{p(a-a^*)^\top x} - 1\right) e^{-2p(a^\top x + 1)}, \qquad (4.4)$$

which is again never negative, since $e^z > 1 \iff z > 0$. Again, Corollary 3.3 shows that the cost function induced by $f_2$ has a unique critical point.

[Figure 1: The cost function $J_3$ induced by $f_3$ when $a^* = -3$, $X = [-1,1]$, $\varepsilon = 0.001$ and example points $(x_k)$ cover $X$. The average was determined by simple numerical integration.]

Now consider the parametrization

$$f_3(a,x) = (a^\top x + 1)\, e^{-(a^\top x + 1)^2}. \qquad (4.5)$$

For a particular $a$, the decision region identified by $f_3$ is identical to that identified by $f_1$. However, in most simulations using this parametrization, estimate parameters drift off towards infinity. This suggests that there is an attractor at infinity generated by the parametrization (4.5).
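The three half-space parametrizations can be passed straight to the earlier sketches. The definitions below transcribe (4.1), (4.3) and (4.5); the value p = 2.0 is an arbitrary illustrative choice of ours.

```python
import numpy as np

p = 2.0  # rate constant in f_2; arbitrary illustrative value

# f_1(a, x) = a^T x + 1, eq. (4.1)
f1      = lambda a, x: a @ x + 1.0
grad_f1 = lambda a, x: x

# f_2(a, x) = 1 - exp(-p (a^T x + 1)), eq. (4.3)
f2      = lambda a, x: 1.0 - np.exp(-p * (a @ x + 1.0))
grad_f2 = lambda a, x: p * np.exp(-p * (a @ x + 1.0)) * x

# f_3(a, x) = (a^T x + 1) exp(-(a^T x + 1)^2), eq. (4.5)
f3      = lambda a, x: (a @ x + 1.0) * np.exp(-((a @ x + 1.0) ** 2))
grad_f3 = lambda a, x: (1.0 - 2.0 * (a @ x + 1.0) ** 2) * np.exp(-((a @ x + 1.0) ** 2)) * x
```

In our experience with the `learn()` sketch, running it with `f3` on uniformly sampled points tends to show $\|a_k\|$ growing, mirroring the drift reported in the simulations above.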

The directional derivative satisfies

$$\mathcal{F}_3(a,x) = (a - a^*)^\top x \left(1 - 2(a^\top x + 1)^2\right) e^{-(a^\top x + 1)^2} \left[ (a^\top x + 1)\, e^{-(a^\top x + 1)^2} - (a^{*\top} x + 1)\, e^{-(a^{*\top} x + 1)^2} \right]. \qquad (4.6)$$

So for any $a \in \mathbb{R}^n$, in the limit $c \to \infty$,

$$\mathcal{F}_3(ca,x) \to 2c^3 (a^{*\top} x + 1)(a^\top x)^3\, e^{-c^2 (a^\top x)^2 - (a^{*\top} x + 1)^2} \to 0 \qquad (4.7)$$

for all $x \in X$ such that $a^\top x \neq 0$. Thus in the limit $c \to \infty$, $S_{ca} = \{x \in X : (a^{*\top} x + 1)\, a^\top x \geq 0\}$ and $T_{ca} = \{x \in X : (a^{*\top} x + 1)\, a^\top x < 0\}$.

We could try to prove there is an attractor at infinity by finding opposite bounds to those in Theorem 3.6. That is, we would try to show that for any $a \in \mathbb{R}^m$ there exists a set $W_a \subseteq X$ and a constant $r > 0$ such that $\operatorname{vol} W_a \neq 0$ and $\lim_{c\to\infty} \mathcal{F}_3(ca,x) \leq -r$ for all $x \in W_a$. But (4.7) shows that no such $r$ can be found in this case. Thus we have not been able to prove that there is always an attractor at infinity. However, for any one dimensional or two dimensional example, the cost function $J_3$ induced by $f_3$ can be plotted. Figure 1 shows the plot of $J_3$ for a particular one dimensional case. It can be seen from the figure that $J_3$ has one non-global local minimum and furthermore is decreasing (albeit slowly) as $\|a\|$ goes to infinity.

5 Application: Learning a Stripe

Next consider the parametrization

$$f_4(a,x) = \eta - (a^\top x + 1)^2, \qquad (5.1)$$

where $a, x \in \mathbb{R}^n$ and $\eta > 0$ is some fixed constant. The decision regions identified by $f_4$ are "stripes" in $\mathbb{R}^n$, where $\Sigma(a)$ is of width $2\sqrt{\eta}/\|a\|$ and is normal to $a$. Such decision regions arise in a radar problem. At $a$, the directional derivative is

$$\mathcal{F}_4(a,x) = 4\big((a - a^*)^\top x\big)^2 (a^\top x + 1)\left(\tfrac{1}{2}(a + a^*)^\top x + 1\right). \qquad (5.2)$$

This is nonnegative if $(a^\top x + 1)\left(\tfrac{1}{2}(a + a^*)^\top x + 1\right) \geq 0$, and negative otherwise. Figure 2 depicts the regions $S_a$ and $T_a$ for a particular choice of estimate parameter $a$ and true parameter $a^*$, when $X \subset \mathbb{R}^2$ and $A = \mathbb{R}^2$.

In the limit $c \to \infty$,

$$\mathcal{F}_4(ca,x) \to 2c^4 (a^\top x)^4 \to \infty \qquad (5.3)$$

for all $x \in \{x \in X : a^\top x \neq 0\}$, and if $a^\top x = 0$ then

$$\mathcal{F}_4(ca,x) = -4\,(a^{*\top} x)^2 \left(\tfrac{1}{2}\, a^{*\top} x + 1\right). \qquad (5.4)$$

Compactness of $X$ implies that $s^* = \sup_{x \in X} a^{*\top} x$ exists, so assumption 1 of Theorem 3.8 is satisfied, with $r_a = 4 s^{*2} \left(\tfrac{1}{2} s^* + 1\right)$. Setting $V_a = \{x \in X : |a^\top x| \geq \varepsilon\}$ for some $\varepsilon > 0$, it is clear that assumption 2 of Theorem 3.8 is also satisfied, so there is no attractor at infinity for this parametrization.

[Figure 2: The "good" and "bad" regions $S_a$ and $T_a$ for a particular choice of $a$ and $a^*$. Here the sample space is $X = [-1,1]^2$, the true parameter vector is $a^* = (0,4)$, the estimate is $a = (2,0)$, and the width is determined by $\eta = 0.04$. The dark shaded region is the true decision region $\Sigma(a^*)$, and the light shaded region is the estimate decision region $\Sigma(a)$. $T_a$ is the diagonally hashed area, and $S_a$ is the rest of $X$; the figure also marks the lines $a^\top x + 1 = 0$ and $\tfrac{1}{2}(a + a^*)^\top x + 1 = 0$.]

As an aside, note that if the input sequence $(x_k)$ consists entirely of positive examples ($y_k = +1$), then $X = \Sigma(a^*)$. Then $S_a$ is much larger than $T_a$ for any choice of $a$, while the size of the integrand is of the same order in both $S_a$ and $T_a$. This is interesting because it explains why in simulations the estimate parameters converge much more smoothly if only positive examples are used in training. This property is a special feature of this particular class of decision regions, and certainly does not apply in general. Generally one would probably not get convergence at all with solely positive examples.
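A sketch of the stripe parametrization (5.1) and the closed form (5.2), with η = 0.04 as in Figure 2 (the helper names are ours):

```python
import numpy as np

eta = 0.04  # stripe width parameter, as in Figure 2

# f_4(a, x) = eta - (a^T x + 1)^2, eq. (5.1): positive inside the stripe Sigma(a)
f4      = lambda a, x: eta - (a @ x + 1.0) ** 2
grad_f4 = lambda a, x: -2.0 * (a @ x + 1.0) * x

def F4(a, a_star, x):
    # Closed form (5.2) for the directional-derivative product F_4(a, x)
    return (4.0 * ((a - a_star) @ x) ** 2
            * (a @ x + 1.0) * (0.5 * (a + a_star) @ x + 1.0))
```

Evaluating `F4(c * a, a_star, x)` for growing `c` at points with $a^\top x \neq 0$ illustrates the divergence in (5.3).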

6 Application: Learning an Intersection of Two Half Spaces

Now consider decision regions which are intersections of half spaces containing the origin. If $X \subset \mathbb{R}^n$, the natural choice of parameter space is $\mathbb{R}^{2n}$. As in [3], we use the parametrization

$$f_p(a,x) = 1 - e^{-p(n_1^\top x + 1)} - e^{-p(n_2^\top x + 1)}, \qquad (6.1)$$

where $a = \operatorname{vec}(n_1, n_2)$, $n_1, n_2 \in \mathbb{R}^n$. There are two true parameter vectors in this case, $\operatorname{vec}(n_1^*, n_2^*)$ and $\operatorname{vec}(n_2^*, n_1^*)$, so the assumptions of Theorem 3.2 do not hold. The directional derivative satisfies

$$\begin{aligned} \mathcal{F}_p(a,x) ={}& p\,(n_1 - n_1^*)^\top x \left(e^{p(n_1 - n_1^*)^\top x} - 1\right) e^{-2p(n_1^\top x + 1)} \\ &+ p\,(n_1 - n_1^*)^\top x \left(e^{p(n_2 - n_2^*)^\top x} - 1\right) e^{-2p\left(\left(\frac{n_1 + n_2}{2}\right)^\top x + 1\right)} \\ &+ p\,(n_2 - n_2^*)^\top x \left(e^{p(n_1 - n_1^*)^\top x} - 1\right) e^{-2p\left(\left(\frac{n_1 + n_2}{2}\right)^\top x + 1\right)} \\ &+ p\,(n_2 - n_2^*)^\top x \left(e^{p(n_2 - n_2^*)^\top x} - 1\right) e^{-2p(n_2^\top x + 1)}. \end{aligned} \qquad (6.2)$$

Due to the nonlinear, coupled nature of this equation, there appears to be no simple way of describing the regions $S_a$ and $T_a$. The first and last terms are always positive, but the middle terms can be negative. All terms will be positive if $\operatorname{sgn}(n_1 - n_1^*)^\top x = \operatorname{sgn}(n_2 - n_2^*)^\top x$. So if $(n_1 - n_1^*) = c\,(n_2 - n_2^*)$ for some positive constant $c$, then $\mathcal{F}_p(a,x) \geq 0$ for all $x \in X$. That is, $S_a = X$ if $a \in \left\{ \operatorname{vec}\!\left(n_1,\ \frac{n_1 - n_1^*}{c} + n_2^*\right) : c \in \mathbb{R}^+ \right\}$.

As in the previous example, we cannot show there are no local minima, but we can show that there is no attractor at infinity. Let $a = \operatorname{vec}(n_1, n_2) \in \mathbb{R}^{2n}$, where either $n_1$ may equal $0$ or $n_2$ may equal $0$, but not both. Define $\alpha := p\, n_1^\top x$ and $\beta := p\, n_2^\top x$. In the limit $c \to \infty$, $\mathcal{F}_p(ca,x)$ takes on the following values:

Case 1. $\alpha > 0$ and $\beta > 0$: $\quad \mathcal{F}_p(ca,x) \to 2\left(c\alpha\, e^{-c\alpha} + c\beta\, e^{-c\beta}\right) \to +0 \qquad (6.3)$

Case 2. $\alpha < 0$ and $\beta > 0$: $\quad \mathcal{F}_p(ca,x) \to -c\alpha\, e^{-2c\alpha} \to +\infty \qquad (6.4)$

Case 3. $\alpha < 0$ and $\beta < 0$: $\quad \mathcal{F}_p(ca,x) \to \left(-c\alpha\, e^{-c\alpha} - c\beta\, e^{-c\beta}\right)\left(e^{-c\alpha} + e^{-c\beta}\right) \to +\infty \qquad (6.5)$

Case 4. $\alpha = 0$ and $\beta > 0$: $\quad \mathcal{F}_p(ca,x) \to -p\, n_1^{*\top} x \left(e^{-p\, n_1^{*\top} x} - 1\right) e^{-2p} \geq 0 \qquad (6.6)$

Case 5. $\alpha = 0$ and $\beta < 0$: $\quad \mathcal{F}_p(ca,x) \to -c\beta\, e^{-2c\beta} \to +\infty \qquad (6.7)$

Case 6. $\alpha = 0$ and $\beta = 0$: $\quad \mathcal{F}_p(ca,x) = -p\,(n_1^* + n_2^*)^\top x \left(e^{-p\, n_1^{*\top} x} + e^{-p\, n_2^{*\top} x} - 2\right) e^{-2p} \qquad (6.8)$

Let $s_1 = \sup_{x \in X} n_1^{*\top} x$ and $s_2 = \sup_{x \in X} n_2^{*\top} x$. Setting $r_a = -p\,(s_1 + s_2)\left(e^{-p s_1} + e^{-p s_2} - 2\right) e^{-2p}$ and $V_a = \{x \in X : n_1^\top x \leq -\varepsilon \text{ or } n_2^\top x \leq -\varepsilon\}$ for some $\varepsilon > 0$, the assumptions of Theorem 3.8 are satisfied.
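The intersection parametrization (6.1) in code, with $a = \operatorname{vec}(n_1, n_2)$ stored as a single concatenated vector (p = 2.0 is again an arbitrary illustrative value of ours):

```python
import numpy as np

def fp(a, x, p=2.0):
    # f_p(a, x) = 1 - exp(-p(n1^T x + 1)) - exp(-p(n2^T x + 1)), eq. (6.1)
    n1, n2 = np.split(np.asarray(a, float), 2)
    return 1.0 - np.exp(-p * (n1 @ x + 1.0)) - np.exp(-p * (n2 @ x + 1.0))

def grad_fp(a, x, p=2.0):
    # Gradient with respect to a = vec(n1, n2): the two n-dimensional blocks stack
    n1, n2 = np.split(np.asarray(a, float), 2)
    return np.concatenate([p * np.exp(-p * (n1 @ x + 1.0)) * x,
                           p * np.exp(-p * (n2 @ x + 1.0)) * x])
```

Feeding `fp` and `grad_fp` to the `limiting_F` sketch of Section 3.2, at points with the various signs of $n_1^\top x$ and $n_2^\top x$, reproduces the six cases above numerically.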

7 Nonuniformly Distributed Examples

The results and examples given so far in this paper have all assumed that the examples are such that $(x_k)$ covers $X$. We have given conditions which guarantee that there is a unique critical point of the cost function, or that no attractors at infinity exist. But these conditions may not be sufficient if the examples do not cover $X$.

From our definitions of the sets $S_a$ and $T_a$ in Section 3, we see that it is often possible to construct malicious input sequences which will force the estimate parameters generated by Algorithm 2.2 to diverge to infinity. If the parametrization is such that $T_a \neq \emptyset$ for all $a \in A$, a suitable malicious sequence is achieved by choosing $x_k \in T_{a_k}$ for each $k \geq 0$. In a similar vein, Sontag and Sussmann [7] have given an example of a particular sequence of input examples which will cause the error surface for a simple neural network learning problem to have non-global local minima. These examples require special choices of the examples which vary with $k$, so this is quite a large departure from our original assumption that $(x_k)$ covers $X$.

A smaller relaxation of the covering assumption is that the examples are i.i.d. (independent and identically distributed) with some nonuniform distribution. In this case the results of Section 3 no longer apply. Assume that the $x_k$ are i.i.d. random variables with some known distribution $P(x) : X \to [0,1]$. The cost function $J$ becomes

$$J(a) = \frac{1}{\operatorname{vol} X} \int_X \left[ f(a,x)\big(g_\varepsilon(a,x) - g_\varepsilon(a^*,x)\big) - \frac{\varepsilon}{\pi} \ln\!\left(1 + \frac{f(a,x)^2}{\varepsilon^2}\right) \right] dP(x). \qquad (7.1)$$

Results similar to those in Section 3 can be derived, using $\inf P(x)$ and $\sup P(x)$ on the sets $S_a$ and $T_a$. Whether or not the nature of the cost surface can be changed significantly by the choice of $P(\cdot)$ is unclear. Do there exist parametrizations for which there is no attractor at infinity when examples are uniformly distributed, but there is an attractor at infinity if examples are chosen according to some other distribution? If this is the case, then satisfying the conditions for uniformly distributed points does not always guarantee good convergence if the points are nonuniformly distributed.
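The cost (3.2), and its nonuniform variant (7.1), can be estimated by a weighted sample average. In the small sketch below (ours, not the paper's), uniform weights correspond to examples that cover $X$, while nonuniform weights play the role of $dP(x)$; the resulting estimate can be fed to the `stays_bounded` and `probe_lemma_3_5` sketches above.

```python
import numpy as np

def J_estimate(f, a, a_star, xs, eps, weights=None):
    """Sample-average estimate of the cost (3.2) / (7.1).
    xs: points covering X; weights: optional, proportional to dP(x)."""
    a, a_star = np.asarray(a, float), np.asarray(a_star, float)
    fa  = np.array([f(a, x) for x in xs])
    fas = np.array([f(a_star, x) for x in xs])
    g = lambda v: (2.0 / np.pi) * np.arctan(v / eps)   # squashing function (2.3)
    vals = fa * (g(fa) - g(fas)) - (eps / np.pi) * np.log1p(fa ** 2 / eps ** 2)
    return np.average(vals, weights=weights)
```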

8 Conclusions

In this paper we have investigated conditions guaranteeing convergence of an algorithm for learning nonlinearly parametrized decision regions, discussed in [3]. The conditions given in [3] involve a cost function, where the average is taken over the input examples. The conditions are hard to test directly, so our aim here has been to develop simpler sufficient conditions which can be tested for a particular parametrization. The approach taken has been to relate the directional derivative of the cost function to the directional derivative of the parametrization, and integrate over the entire sample space.

A number of examples of the application of the new conditions to particular parametrizations have been given, and some interesting insights have been gained. We have shown that even half spaces can be parametrized in a way which will cause the learning algorithm to fail, though for the obvious linear parametrization, and for some nonlinear parametrizations of a half space, the algorithm will work. We have been able to explain why training with only positive examples gives smooth convergence for the parametrization of a stripe. The parametrization of an intersection of half spaces which was given in [3] has been investigated further, and it was shown that parameters will not drift off to infinity if points are uniformly distributed.

Appendix: Proof of Theorem 3.2

Let $a \in \mathbb{R}^m$. By the definition of a parametrization, both $f$ and $\partial f/\partial a$ are bounded on a compact domain, so $\inf_{R_a} \mathcal{F}(a,x)$ and $\sup_{T_a} |f(a,x)|$ are both finite. In this section we write, for instance, $\inf_{R_a} \mathcal{F}(a,x)$ rather than $\inf_{x \in R_a} \mathcal{F}(a,x)$ for the infimum over a subset of $X$, and similarly for suprema.

By (3.2), the directional derivative of $J$ in the direction $a - a^*$ is

$$D_{a-a^*} J(a) = \frac{1}{\operatorname{vol} X} \int_{S_a} \mathcal{J}(a,x)\, dx + \frac{1}{\operatorname{vol} X} \int_{T_a} \mathcal{J}(a,x)\, dx. \qquad (A.1)$$

Note that the smoothness property of $f$, and hence $g$, has been used to interchange the order of differentiation and integration.

It is clear from the definitions of $\mathcal{F}$ and $\mathcal{J}$ that for any $a \in \mathbb{R}^m$, $x \in X$,

$$\mathcal{J}(a,x) = \frac{g(a,x) - g(a^*,x)}{f(a,x) - f(a^*,x)}\, \mathcal{F}(a,x). \qquad (A.2)$$

Because arctan is a strictly monotonic function, $\operatorname{sgn}\big(g(a,x) - g(a^*,x)\big) = \operatorname{sgn}\big(f(a,x) - f(a^*,x)\big)$. Thus for any $a \in \mathbb{R}^m$, $x \in X$, $\operatorname{sgn} \mathcal{J}(a,x) = \operatorname{sgn} \mathcal{F}(a,x)$. In particular, for any $x \in S_a$, $\mathcal{J}(a,x)$ is nonnegative, and for any $x \in T_a$, $\mathcal{J}(a,x)$ is negative.

From the intermediate value theorem, we know that for any $a$ and $x$ there is an intermediate value $\tilde{f}(a,x)$, lying between $f(a^*,x)$ and $f(a,x)$, such that

$$\frac{g(a,x) - g(a^*,x)}{f(a,x) - f(a^*,x)} = \frac{2}{\pi}\, \frac{\varepsilon}{\varepsilon^2 + \tilde{f}(a,x)^2}. \qquad (A.3)$$

But

$$\frac{2}{\pi}\, \frac{\varepsilon}{\varepsilon^2 + \tilde{f}(a,x)^2} \;\leq\; \frac{2}{\pi \varepsilon} \qquad (A.4)$$

always, so $|\mathcal{J}(a,x)| \leq \frac{2}{\pi\varepsilon}\, |f(a,x)|$ for any values of $a$ and $x$. Similarly, for any $x \in U \subseteq X$,

$$|\mathcal{J}(a,x)| \;\geq\; \frac{2\varepsilon}{\pi\left(\varepsilon^2 + \sup_U f^2\right)}\, |f(a,x)|.$$

Now choose $a \in A$, $a \neq a^*$. From (A.1),

$$\begin{aligned} D_{a-a^*} J(a) &\geq \frac{1}{\operatorname{vol} X}\left[ \int_{R_a} \mathcal{J}(a,x)\, dx + \int_{T_a} \mathcal{J}(a,x)\, dx \right] \\ &\geq \frac{1}{\operatorname{vol} X}\left[ \inf_{R_a} \mathcal{J}(a,x)\, \operatorname{vol} R_a - \sup_{T_a} |\mathcal{J}(a,x)|\, \operatorname{vol} T_a \right] \\ &\geq \frac{1}{\operatorname{vol} X}\left[ \frac{2\varepsilon}{\pi\left(\varepsilon^2 + \sup_{R_a} f^2\right)} \inf_{R_a} \mathcal{F}(a,x)\, \operatorname{vol} R_a - \frac{2}{\pi\varepsilon} \sup_{T_a} |f(a,x)|\, \operatorname{vol} T_a \right] \\ &= \frac{2}{\pi\varepsilon \operatorname{vol} X}\left[ \frac{\inf_{R_a} \mathcal{F}(a,x)\, \operatorname{vol} R_a}{1 + \sup_{R_a} (f/\varepsilon)^2} - \sup_{T_a} |f(a,x)|\, \operatorname{vol} T_a \right] \\ &> 0 \end{aligned}$$

by assumption. Thus $\left.\frac{dJ}{da}\right|_a \neq 0$ if $a \neq a^*$, as required. □

References

[1] M.R. Frater, R.R. Bitmead, and C.R. Johnson, Jr. Escape from stable equilibria in blind adaptive equalization, Conf. on Decision and Control, Tucson, Arizona, December 1992, 1756–.

[2] G.C. Goodwin and K.S. Sin. Adaptive Filtering Prediction and Control. New York: Prentice-Hall.

[3] K.L. Blackmore, R.C. Williamson, and I.M.Y. Mareels. Summary: Learning nonlinearly parametrized decision regions, J. Math. Systems, Estimation, and Control 6(1) (1996). Full length paper available electronically.

[4] M.W. Hirsch and S. Smale. Differential Equations, Dynamical Systems, and Linear Algebra. New York: Academic Press.

[5] R. Horst and P.T. Thach. A topological property of limes-arcwise strictly quasiconvex functions, Jour. Math. Analysis and Applications 134 (1988), 426–430.

[6] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. Cambridge: MIT Press.

[7] E.D. Sontag and H.J. Sussmann. Back propagation can give rise to spurious local minima even for networks without hidden layers, Complex Systems 3 (1989), 91–106.

Department of Engineering, Australian National University, Canberra ACT 0200, Australia

Department of Engineering, Australian National University, Canberra ACT 0200, Australia

Department of Engineering, Australian National University, Canberra ACT 0200, Australia

Communicated by David Hill


More information

+ 2x sin x. f(b i ) f(a i ) < ɛ. i=1. i=1

+ 2x sin x. f(b i ) f(a i ) < ɛ. i=1. i=1 Appendix To understand weak derivatives and distributional derivatives in the simplest context of functions of a single variable, we describe without proof some results from real analysis (see [7] and

More information

THE RANGE OF A VECTOR-VALUED MEASURE

THE RANGE OF A VECTOR-VALUED MEASURE THE RANGE OF A VECTOR-VALUED MEASURE J. J. UHL, JR. Liapounoff, in 1940, proved that the range of a countably additive bounded measure with values in a finite dimensional vector space is compact and, in

More information

MATH 426, TOPOLOGY. p 1.

MATH 426, TOPOLOGY. p 1. MATH 426, TOPOLOGY THE p-norms In this document we assume an extended real line, where is an element greater than all real numbers; the interval notation [1, ] will be used to mean [1, ) { }. 1. THE p

More information

FORM 1 Physics 240 í Fall Exam 1 CONTI COULTER GRAFE SKALSEY ZORN. Please æll in the information requested on the scantron answer sheet

FORM 1 Physics 240 í Fall Exam 1 CONTI COULTER GRAFE SKALSEY ZORN. Please æll in the information requested on the scantron answer sheet FORM 1 Physics 240 í Fall 1999 Exam 1 Name: Discussion section instructor and time: ècircle ONE TIMEè CONTI COULTER GRAFE SKALSEY ZORN MW 10-11 èè23è MW 8-9 èè4è MW 2-3 èè7è MW 11-12 èè6è MW 9-10 èè12è

More information

Lecture 4. Chapter 4: Lyapunov Stability. Eugenio Schuster. Mechanical Engineering and Mechanics Lehigh University.

Lecture 4. Chapter 4: Lyapunov Stability. Eugenio Schuster. Mechanical Engineering and Mechanics Lehigh University. Lecture 4 Chapter 4: Lyapunov Stability Eugenio Schuster schuster@lehigh.edu Mechanical Engineering and Mechanics Lehigh University Lecture 4 p. 1/86 Autonomous Systems Consider the autonomous system ẋ

More information

LMI Methods in Optimal and Robust Control

LMI Methods in Optimal and Robust Control LMI Methods in Optimal and Robust Control Matthew M. Peet Arizona State University Lecture 15: Nonlinear Systems and Lyapunov Functions Overview Our next goal is to extend LMI s and optimization to nonlinear

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Hybrid Systems Course Lyapunov stability

Hybrid Systems Course Lyapunov stability Hybrid Systems Course Lyapunov stability OUTLINE Focus: stability of an equilibrium point continuous systems decribed by ordinary differential equations (brief review) hybrid automata OUTLINE Focus: stability

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Neural networks. Chapter 20. Chapter 20 1

Neural networks. Chapter 20. Chapter 20 1 Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms

More information

586 D.E. Norton A dynamical system consists of three ingredients: a setting in which the dynamical behavior takes place, such as the real line, a toru

586 D.E. Norton A dynamical system consists of three ingredients: a setting in which the dynamical behavior takes place, such as the real line, a toru Comment.Math.Univ.Carolinae 36,3 è1995è585í597 585 The fundamental theorem of dynamical systems Douglas E. Norton Abstract. We propose the title of The Fundamental Theorem of Dynamical Systems for a theorem

More information

Continuity of convex functions in normed spaces

Continuity of convex functions in normed spaces Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

Feedback control for a chemostat with two organisms

Feedback control for a chemostat with two organisms Feedback control for a chemostat with two organisms Patrick De Leenheer and Hal Smith Arizona State University Department of Mathematics and Statistics Tempe, AZ 85287 email: leenheer@math.la.asu.edu,

More information

1 Which sets have volume 0?

1 Which sets have volume 0? Math 540 Spring 0 Notes #0 More on integration Which sets have volume 0? The theorem at the end of the last section makes this an important question. (Measure theory would supersede it, however.) Theorem

More information

CS 294-2, Visual Grouping and Recognition èprof. Jitendra Malikè Sept. 8, 1999

CS 294-2, Visual Grouping and Recognition èprof. Jitendra Malikè Sept. 8, 1999 CS 294-2, Visual Grouping and Recognition èprof. Jitendra Malikè Sept. 8, 999 Lecture è6 èbayesian estimation and MRFè DRAFT Note by Xunlei Wu æ Bayesian Philosophy æ Markov Random Fields Introduction

More information

StrucOpt manuscript No. èwill be inserted by the editorè Detecting the stress æelds in an optimal structure II: Three-dimensional case A. Cherkaev and

StrucOpt manuscript No. èwill be inserted by the editorè Detecting the stress æelds in an optimal structure II: Three-dimensional case A. Cherkaev and StrucOpt manuscript No. èwill be inserted by the editorè Detecting the stress æelds in an optimal structure II: Three-dimensional case A. Cherkaev and _I. Kíucíuk Abstract This paper is the second part

More information

MAXIMA AND MINIMA CHAPTER 7.1 INTRODUCTION 7.2 CONCEPT OF LOCAL MAXIMA AND LOCAL MINIMA

MAXIMA AND MINIMA CHAPTER 7.1 INTRODUCTION 7.2 CONCEPT OF LOCAL MAXIMA AND LOCAL MINIMA CHAPTER 7 MAXIMA AND MINIMA 7.1 INTRODUCTION The notion of optimizing functions is one of the most important application of calculus used in almost every sphere of life including geometry, business, trade,

More information

November 9, Abstract. properties change in response to the application of a magnetic æeld. These æuids typically

November 9, Abstract. properties change in response to the application of a magnetic æeld. These æuids typically On the Eæective Magnetic Properties of Magnetorheological Fluids æ November 9, 998 Tammy M. Simon, F. Reitich, M.R. Jolly, K. Ito, H.T. Banks Abstract Magnetorheological èmrè æuids represent a class of

More information

CONTINUITY AND DIFFERENTIABILITY ASPECTS OF METRIC PRESERVING FUNCTIONS

CONTINUITY AND DIFFERENTIABILITY ASPECTS OF METRIC PRESERVING FUNCTIONS Real Analysis Exchange Vol. (),, pp. 849 868 Robert W. Vallin, 229 Vincent Science Hall, Slippery Rock University of PA, Slippery Rock, PA 16057. e-mail: robert.vallin@sru.edu CONTINUITY AND DIFFERENTIABILITY

More information

256 Facta Universitatis ser.: Elect. and Energ. vol. 11, No.2 è1998è primarily concerned with narrow range of frequencies near ærst resonance èwhere s

256 Facta Universitatis ser.: Elect. and Energ. vol. 11, No.2 è1998è primarily concerned with narrow range of frequencies near ærst resonance èwhere s FACTA UNIVERSITATIS èni ç Sè Series: Electronics and Energetics vol. 11, No.2 è1998è, 255-261 EFFICIENT CALCULATION OF RADAR CROSS SECTION FOR FINITE STRIP ARRAY ON DIELECTRIC SLAB Borislav Popovski and

More information

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3 Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,

More information

h(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote

h(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote Real Variables, Fall 4 Problem set 4 Solution suggestions Exercise. Let f be of bounded variation on [a, b]. Show that for each c (a, b), lim x c f(x) and lim x c f(x) exist. Prove that a monotone function

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

Converse Lyapunov Functions for Inclusions 2 Basic denitions Given a set A, A stands for the closure of A, A stands for the interior set of A, coa sta

Converse Lyapunov Functions for Inclusions 2 Basic denitions Given a set A, A stands for the closure of A, A stands for the interior set of A, coa sta A smooth Lyapunov function from a class-kl estimate involving two positive semidenite functions Andrew R. Teel y ECE Dept. University of California Santa Barbara, CA 93106 teel@ece.ucsb.edu Laurent Praly

More information

GLOBAL ANALYSIS OF PIECEWISE LINEAR SYSTEMS USING IMPACT MAPS AND QUADRATIC SURFACE LYAPUNOV FUNCTIONS

GLOBAL ANALYSIS OF PIECEWISE LINEAR SYSTEMS USING IMPACT MAPS AND QUADRATIC SURFACE LYAPUNOV FUNCTIONS GLOBAL ANALYSIS OF PIECEWISE LINEAR SYSTEMS USING IMPACT MAPS AND QUADRATIC SURFACE LYAPUNOV FUNCTIONS Jorge M. Gonçalves, Alexandre Megretski y, Munther A. Dahleh y California Institute of Technology

More information

Relationship Between Integration and Differentiation

Relationship Between Integration and Differentiation Relationship Between Integration and Differentiation Fundamental Theorem of Calculus Philippe B. Laval KSU Today Philippe B. Laval (KSU) FTC Today 1 / 16 Introduction In the previous sections we defined

More information

Contents Ordered Fields... 2 Ordered sets and fields... 2 Construction of the Reals 1: Dedekind Cuts... 2 Metric Spaces... 3

Contents Ordered Fields... 2 Ordered sets and fields... 2 Construction of the Reals 1: Dedekind Cuts... 2 Metric Spaces... 3 Analysis Math Notes Study Guide Real Analysis Contents Ordered Fields 2 Ordered sets and fields 2 Construction of the Reals 1: Dedekind Cuts 2 Metric Spaces 3 Metric Spaces 3 Definitions 4 Separability

More information

Lecture 3: Semidefinite Programming

Lecture 3: Semidefinite Programming Lecture 3: Semidefinite Programming Lecture Outline Part I: Semidefinite programming, examples, canonical form, and duality Part II: Strong Duality Failure Examples Part III: Conditions for strong duality

More information

Ordinary Differential Equations (ODEs)

Ordinary Differential Equations (ODEs) c01.tex 8/10/2010 22: 55 Page 1 PART A Ordinary Differential Equations (ODEs) Chap. 1 First-Order ODEs Sec. 1.1 Basic Concepts. Modeling To get a good start into this chapter and this section, quickly

More information

Measure and integration

Measure and integration Chapter 5 Measure and integration In calculus you have learned how to calculate the size of different kinds of sets: the length of a curve, the area of a region or a surface, the volume or mass of a solid.

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Optimization and Gradient Descent

Optimization and Gradient Descent Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

More information

Iterative Improvement plus Random Restart Stochastic Search by X. Hu, R. Shonkwiler, and M. C. Spruill School of Mathematics Georgia Institute of Tech

Iterative Improvement plus Random Restart Stochastic Search by X. Hu, R. Shonkwiler, and M. C. Spruill School of Mathematics Georgia Institute of Tech Iterative Improvement plus Random Restart Stochastic Search by X. Hu, R. Shonkwiler, and M. C. Spruill School of Mathematics Georgia Institute of Technology Atlanta, GA 0 email: shonkwiler@math.gatech.edu,

More information

8.3 Partial Fraction Decomposition

8.3 Partial Fraction Decomposition 8.3 partial fraction decomposition 575 8.3 Partial Fraction Decomposition Rational functions (polynomials divided by polynomials) and their integrals play important roles in mathematics and applications,

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information