Effects of Moving the Centers in an RBF Network

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 6, NOVEMBER 2002

Chitra Panchapakesan, Marimuthu Palaniswami, Senior Member, IEEE, Daniel Ralph, and Chris Manzie

Abstract: In radial basis function (RBF) networks, the placement of centers is said to have a significant effect on the performance of the network. Supervised learning of the center locations has, in some applications, been shown to be superior to locating the centers by unsupervised methods. But such networks can take the same training time as sigmoid networks: the increased time needed for supervised learning offsets the short training time of regular RBF networks. One way to overcome this may be to train the network with a set of centers selected by unsupervised methods and then to fine-tune the locations of the centers. This can be done by first evaluating whether moving the centers would decrease the error and then, depending on the required level of accuracy, changing the center locations. This paper provides new results on bounds for the gradient and Hessian of the error, considered first as a function of the independent set of parameters, namely the centers, widths, and weights, and then as a function of the centers and widths, where the linear weights are now functions of the basis-function parameters, for networks of fixed size. Moreover, bounds for the Hessian are also provided along a line beginning at the initial set of parameters. Using these bounds, it is possible to estimate how much one can reduce the error by changing the centers. Further, a step size can be specified to achieve a guaranteed amount of reduction in error.

Index Terms: Generalized methods, gradient methods, Hessian matrices, intelligent networks, learning systems, neural-network architecture, nonlinear estimation.

I. INTRODUCTION

RADIAL BASIS function (RBF) networks are being used for function approximation, pattern recognition, and time-series prediction problems. To mention a few features, such networks have the universal approximation property [5], arise naturally as regularized solutions of ill-posed problems [3], and are dealt with well in the theory of interpolation [4]. Their simple structure enables learning in stages and gives a reduction in the training time, and this has led to the application of such networks to many practical problems. The adjustable parameters of such networks are the receptive-field centers (the locations of the basis functions), the widths (the spread), the shape of the receptive field, and the linear output weights. The problem of determining the number of hidden nodes (or the number of basis functions) required for any given practical problem is continually being tackled in the literature. Fixing the network size before training, growing the architecture incrementally to achieve a needed level of accuracy, pruning to remove irrelevant units, or combining growing with pruning are some of the ways by which an optimal size for the network is determined.

Manuscript received August 31, 1999; revised June 27, 2000 and May 15. This work was supported by a special research grant from The University of Melbourne. C. Panchapakesan, M. Palaniswami, and C. Manzie are with the Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Vic. 3010, Australia. D. Ralph is with the Judge Institute of Management Studies, University of Cambridge, Cambridge CB2 1AG, U.K.
By locating one basis function at each training input, it is possible to interpolate or to get a regularized solution (improving generalization) [3]. But, in general, it is desirable to have small networks that can generalize better and are faster to train. This calls for an optimal positioning of the basis functions, i.e., of the locations of the centers. This paper looks into the problem of learning the centers in an RBF network. In Section II, a summary of how centers are selected is given before the description of the problem, which is covered in Section III. In Sections IV and V, some analytical results are given and their implications are discussed. Section VI presents simple numerical examples to illustrate the theory.

II. SELECTION OF CENTERS IN RBF NETWORKS

Centers in RBF networks are in general located in one of the following ways. A set of grid points in the input space is selected [2]; in this method, the number of basis functions required would be quite large for high-dimensional input spaces. Centers can be selected as a random subset of the training samples; without prior knowledge about the prototype vectors, the number of centers needed to represent the data would be large. By using the k-means clustering algorithm, learning vector quantization, or one of its variants, an optimal set of centers can be located [10], [11]; these methods are based on locating the dense regions of the training inputs, and the centers are the means (averages) of the vectors in such regions.

All the above methods are based on the distribution of the training inputs alone and do not take into consideration the output values, which do influence the positioning of the centers, especially when the variation of the output in a cluster is high. So centers are also selected based on both the input and output data, as follows. A set of training samples that explains the variation in the output in an optimal sense is selected using forward subset selection, regularization, and cross-validation: starting with an empty subset, one gradually selects those centers whose contribution toward reducing the error is appreciably large. An efficient procedure to achieve the same result is to use the orthogonal least squares method. To avoid overfitting, regularization and cross-validation are used [7], [8]. Another option is k-means clustering that involves both input and output values [9]. Finally, the center vectors can be learned using backpropagation [6], [16].
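To make the unsupervised option concrete, the following is a minimal sketch, in Python with NumPy, of selecting centers by a k-means-style clustering of the training inputs and choosing a single shared width from the resulting center spread. It is not taken from the paper; the function names, the width heuristic, and the library choice are assumptions made only for illustration.

    import numpy as np

    def kmeans_centers(X, m, iters=100, seed=0):
        # Pick m centers by Lloyd-style k-means on the training inputs X (shape (N, d)).
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=m, replace=False)].astype(float)
        for _ in range(iters):
            # Assign every input to its nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Move each center to the mean of the inputs assigned to it.
            for j in range(m):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return centers

    def shared_width(centers):
        # One common heuristic (an assumption, not the paper's choice), for m >= 2:
        # width proportional to the maximum separation between centers.
        m = len(centers)
        dmax = max(np.linalg.norm(a - b) for a in centers for b in centers)
        return dmax / np.sqrt(2.0 * m)

Centers obtained this way depend only on the input distribution, which is precisely the limitation noted above that motivates also using the output values or fine-tuning the centers by supervised learning.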

In the last of these approaches, following the generalized RBFs (GRBFs) suggested by Girosi and Poggio, supervised learning of the centers and the linear weights is considered in the NETtalk domain (see [6]). This approach is reported to have a generalization ability superior to that of sigmoidal networks as well as of RBF networks whose centers are determined using unsupervised methods. In the case of on-line learning, the structure of the network is also allowed to change: depending on the novelty of the input, new centers are added and fine-tuned using backpropagation or other methods [2], [12]. Growing cell structures are also used in deciding when and where to add new centers, based on the accumulated error [13]. When centers are selected by one of the above unsupervised methods, the linear output weights can be adapted using either the delta rule or calculated as the solution of an overdetermined system.

III. MATHEMATICAL DESCRIPTION OF THE PROBLEM

An RBF network is a feedforward network with a single layer of hidden units that are fully connected to the linear output units. The output units form a linear combination of the basis (or kernel) functions computed by the hidden-layer nodes. The activations of such hidden units 1) decrease monotonically with the distance from a central point or prototype (local) and 2) are identical for inputs that lie at a fixed radial distance from the center (radially symmetric). We are trying to approximate a function with an RBF network of the following structure: each basis function has a center and a width, a vector of linear output weights combines the basis-function responses, and a fixed number of basis functions is used. We concatenate the centers into one vector and the widths into another; the output of the network at an input vector is then the weighted sum of the basis-function responses at that input.

Let a set of training pairs be given, together with the vector of desired outputs. For each training pair, an arbitrary nonnegative weighting value is chosen in order to emphasize certain domains of the input space, and the error is the correspondingly weighted sum of squared differences between the network outputs and the desired outputs. Henceforth, assume that the matrix formed from the basis-function responses at the training inputs is invertible in the sense required below.

The next well-known result shows how to obtain the optimal weights for given centers and widths, provided the relevant matrix is invertible [15].

Lemma 3.2: If the associated normal-equations matrix is invertible, then the weighted least-squares problem for the output weights has a unique solution, given explicitly in terms of the design matrix, the weighting values, and the desired outputs.

The error function can be viewed either as a function of all the parameters (centers, widths, and weights) or as a function of the centers and widths alone, with the weights set to the optimal vector of the previous lemma; depending on which view is taken, the error denotes one or the other of these two functions. Given an initial set of parameters, the linear approximation of the error near that point is its first-order Taylor expansion, involving the gradient of the error there. The closed unit ball in the parameter space is also used below. The next two results are a straightforward application of nonlinear analysis; see [14].

Lemma 3.3: Let the setting be as in Lemma 3.1. Then, for some constants and within a suitable neighborhood, the gap between the error function and its linear approximation is bounded by a quadratic in the size of the parameter change.

The above bound on the gap between the error function and its linear approximation can be used to show how the error function decreases along the direction of steepest descent.

Lemma 3.4: In the situation of Lemma 3.3, for each admissible step size along the steepest-descent direction, the error decreases by a calculable amount. The point of this result is to provide a guaranteed and calculable way of decreasing the error along the steepest-descent direction. This result can be directly applied in computer code, as we demonstrate in Section VI.

Subsequently, the following expression for the error will be used.

Lemma 3.5: The error has the form of the weighted sum of squared residuals described above.

In all of our discussions we restrict ourselves to the 1- and ∞-norms of matrices, defined as follows: for any matrix A, the 1-norm is the maximum of the 1-norms of its columns, and the ∞-norm is the maximum of the 1-norms of its rows.
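To fix ideas, here is a minimal sketch, again in Python with NumPy and not taken from the paper, of the quantities just described for Gaussian basis functions: the matrix of basis-function responses at the training inputs, the weighted least-squares output weights in the spirit of Lemma 3.2, and the weighted sum-of-squares error. The particular Gaussian scaling, the variable names, and the helper structure are assumptions made for illustration.

    import numpy as np

    def design_matrix(X, centers, widths):
        # Phi[j, i] = exp(-||x_j - c_i||^2 / widths[i]^2): Gaussian responses.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / widths[None, :] ** 2)

    def optimal_weights(Phi, y, v):
        # Weighted least squares: minimize sum_j v_j * (y_j - Phi[j] @ w)^2,
        # assuming the weighted normal-equations matrix is invertible.
        V = np.diag(v)
        return np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ y)

    def weighted_error(X, y, v, centers, widths, w):
        # E = sum_j v_j * (y_j - network output at x_j)^2.
        r = y - design_matrix(X, centers, widths) @ w
        return float(v @ r ** 2)

Viewing the error as a function of the centers and widths alone corresponds to recomputing optimal_weights whenever the centers or widths change; that is the second point of view taken up in Section IV.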
Lemma 3.1 [14]: Let a continuous matrix-valued mapping of the parameters be given and let the matrix be invertible at the initial point. Then there exist positive constants such that, in an ε-neighborhood of that point, the matrix remains invertible and the norm of its inverse is bounded by one of these constants.

We need a little more notation to discuss the derivatives of the error (in either of its two views). Recall the centers, the widths, and the weights, and define the corresponding index sets for the three groups of parameters.
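Because the explicit derivative formulas did not survive in this transcription, the following sketch records the standard chain-rule partial derivatives of the weighted error with respect to the centers and widths for the Gaussian basis of the previous sketch, with the output weights held fixed. The notation and the helper layout are ours, not the paper's, and the derivative convention follows the exp(-||x - c||^2 / width^2) form assumed above.

    import numpy as np

    def error_gradients(X, y, v, centers, widths, w):
        # E = sum_j v_j r_j^2 with r_j = y_j - sum_i w_i g_i(x_j).
        diff = X[:, None, :] - centers[None, :, :]             # (N, m, d)
        d2 = (diff ** 2).sum(axis=2)                           # (N, m)
        G = np.exp(-d2 / widths[None, :] ** 2)                 # responses g_i(x_j)
        r = y - G @ w                                          # residuals (N,)
        # dE/d(theta_i) = sum_j (-2 v_j r_j) * w_i * dg_i(x_j)/d(theta_i).
        coef = (-2.0 * v * r)[:, None] * w[None, :] * G        # (N, m)
        # dg_i/dc_i = g_i * 2 (x_j - c_i) / widths[i]^2.
        grad_centers = (coef[:, :, None] * 2.0 * diff
                        / widths[None, :, None] ** 2).sum(axis=0)
        # dg_i/dwidth_i = g_i * 2 ||x_j - c_i||^2 / widths[i]^3.
        grad_widths = (coef * 2.0 * d2 / widths[None, :] ** 3).sum(axis=0)
        return grad_centers, grad_widths

These are the partial derivatives that the gradient vector of the next section collects, and whose magnitudes Theorems 4.1 and 4.2 bound over an ε-neighborhood.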

IV. SOME NEW RESULTS ON THE ERROR FUNCTION

We provide bounds on the gradient and the Hessian of the error function viewed as a function of the centers, widths, and weights. Throughout this section, the training vectors, the corresponding output vector, and the weighting vector are given, and the basis function is a given radially symmetric function. We are also given an initial set of centers, widths, and weights, and a radius; the open ball of that radius about the initial parameter vector is the neighborhood considered. In this neighborhood, let a bound on the basis function be given, and let respective bounds be given for its partial derivatives with respect to the centers and widths. Similarly, bounds are assumed on the second derivatives with respect to the centers and widths and on the mixed partials. We will also define further bounds for use in this section.

By a generic parameter we denote one of several quantities, depending on which index set it is selected from; the notation indicates that the index belongs to one of the sets corresponding to the centers, the widths, or the weights. Note that the cardinalities of these index sets correspond to the numbers of center, width, and weight parameters, respectively, and we make use of them in some of the results that follow. With this notation, the gradient vector and the Jacobian take the forms shown in (1), and the Hessian matrix of the error is given by expression (2).

Lemma 4.1: In an ε-neighborhood of the initial parameters, the Jacobian of the basis-function (design) matrix satisfies explicit bounds in the 1- and ∞-norms.
Proof: We use the notation given at the end of Section III. The second inequality can be proved in a similar way.

Let the bounds for the 1- and ∞-norms of the matrices involved be denoted as above; we will find explicit bounds in Lemma 4.2.

Theorem 4.1: In an ε-neighborhood of the initial parameters, the gradient of the error function satisfies explicit bounds in terms of the quantities defined above.

Theorem 4.2: In an ε-neighborhood of the initial parameters, the Hessian of the error function satisfies an explicit bound.
Proof: The Hessian of the error is given by expression (2); substituting the bounds above into it gives the result. See Appendix II for the calculations.
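Since the explicit bound expressions are not reproduced here, the following lines record, in our own notation, the standard way a Hessian bound of the type in Theorem 4.2 is turned into the guaranteed decrease promised by Lemma 3.4; the symbols E, p_0, M, and epsilon are introduced only for illustration and are not the paper's.

    If $\|\nabla^2 E(p)\| \le M$ for all $p \in B(p_0,\varepsilon)$, let
    $u = -\nabla E(p_0)/\|\nabla E(p_0)\|$ be the unit steepest-descent direction.
    Then for any step $0 \le t \le \varepsilon$,
    $$ E(p_0 + t\,u) \;\le\; E(p_0) \;-\; t\,\|\nabla E(p_0)\| \;+\; \tfrac{M}{2}\,t^{2}, $$
    so the choice $t^{*} = \min\bigl(\varepsilon,\ \|\nabla E(p_0)\|/M\bigr)$ guarantees
    $$ E(p_0) - E(p_0 + t^{*}u) \;\ge\; \tfrac{1}{2}\,t^{*}\,\|\nabla E(p_0)\|. $$

In the paper's setting, the constant playing the role of M comes either from Theorem 4.2 (a bound over a ball) or from the directional bound of Section V, which is exactly the distinction between update methods 1 and 2 in Section VI.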

Using these partial derivatives in the expression for the Hessian, the elements of the matrices involved are bounded; substituting these bounds, and the bounds on the corresponding Jacobians, into that expression gives the necessary bounds. The detailed calculation is given in Appendix I.

Let the matrix function given in Section III be invertible, and let the corresponding optimal basis weights be as in Lemma 3.2, with their Jacobian matrix evaluated at the initial parameters. Let bounds be given, respectively, for this matrix function and for its first and second partial derivatives in an ε-neighborhood of the initial parameters.

Lemma 4.2: In an ε-neighborhood of the initial parameters, explicit bounds hold for the design matrix, for the optimal-weight vector, for the Jacobian of the optimal-weight mapping, and for the related quantities. See Appendix III for the proof.

Lemma 4.3: In an ε-neighborhood of the initial parameters, further bounds hold on the inverse matrix and on the quantities built from it, provided the radius is chosen small enough for Lemma 3.1 to apply.
Proof: Choose the radius accordingly. From this we get the bounds using Lemma 3.1 and its proof; the remaining bounds follow from the expression for the optimal weights in Lemma 3.2 and its Jacobian.

Lemma 4.4: In an ε-neighborhood of the initial parameters, the bounds for the Jacobian of the optimal-weight mapping are given below. Plugging the values of the norms into the above expressions and using Lemma 4.1 gives a bound in a neighborhood of the initial parameters.

Theorems 4.3 and 4.4 are shown along similar lines to the preceding proofs, by substituting the appropriate bounds from the preceding results.

Theorem 4.3: In an ε-neighborhood of the initial parameters, the gradient of the error function, now viewed as a function of the centers and widths alone, satisfies an explicit bound. By using this bound, it will be possible to find out whether a change in the centers and widths is desirable. In that case, the result in Theorem 4.2 can be used to fix a step size to get a guaranteed amount of decrease in the error.

Theorem 4.4: In an ε-neighborhood of the initial parameters, the Hessian of the error function in the same reduced view is bounded above by an explicit expression.
Proof: The bounds follow from the dimensions of the matrices involved, whose nonzero elements are bounded by the quantities above. For the proofs of the norms of the auxiliary quantities, which are similar in their derivations, refer to Appendix IV. Using these bounds, together with the bounds from the lemmas, in the expression (2) for the second derivative of the error gives the necessary result.
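For the reduced view used in Theorems 4.3 and 4.4, where the weights are the optimal ones for the current centers and widths, a convenient fact is that the gradient of the reduced error coincides with the partial gradient of the full error evaluated at those optimal weights, because the derivative with respect to the weights vanishes at the least-squares solution. The following sketch, built on the hypothetical helpers introduced earlier (design_matrix, optimal_weights, error_gradients), is our own illustration of that observation rather than anything stated in the paper.

    import numpy as np

    def reduced_error_gradients(X, y, v, centers, widths):
        # Gradient of E(centers, widths) with the weights set to their optimal values.
        Phi = design_matrix(X, centers, widths)
        w_star = optimal_weights(Phi, y, v)
        # At w_star the derivative of the error w.r.t. the weights is zero, so the
        # chain-rule term through the weights drops out and the partial gradients
        # with the weights held fixed give the gradient of the reduced error.
        return error_gradients(X, y, v, centers, widths, w_star)

This is the gradient that would be driven toward zero when fine-tuning centers and widths while always re-solving for the linear weights.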

V. BOUNDS FOR THE HESSIAN OF THE ERROR ALONG A GIVEN DIRECTION IN THE CASE OF GAUSSIAN RBF

In Section IV, bounds for the Hessian in an ε-ball have been given, which can be used in finding the proper step size referred to in Lemma 3.4. But the approximation of the error in the descent direction suggests that a bound for the Hessian along the direction in which one moves is sufficient to get the result. We illustrate this when Gaussians with fixed widths are used for approximating a function. The bounds for the Hessian in a given direction can be given as follows. Let the notation given at the start of Section IV hold here. Let us fix a direction, possibly a unit vector, along which we want to move from the initial parameters. For instance, we can select it to be the normalized negative gradient of the error at the initial set of centers, with fixed widths and initial weights. We restrict our attention to the ray starting at the initial point along this direction, with the step length kept within the radius of Section IV, or equivalently within the corresponding interval. We abuse notation slightly by writing the error as a function of the scalar step length along this ray, and we will determine a constant that bounds its second derivative there.

The next result gives such a bound; it relies on certain constants whose construction is given explicitly in the proof, and the remark following the proof outlines how we compute these constants in practice.

Theorem 5.1: For steps along the chosen ray, the Hessian of the error satisfies a bound whose constituent constants are functions of the initial set of parameters, the direction in which they are moved, the bounds on the basis function, and the data. The basis functions are taken to be Gaussians.

Proof: The first and the second derivatives of the error along the ray are given as follows. First, for each pair of a training input and a center, an auxiliary function of the step length is defined from the Gaussian response and its derivative along the ray; the derivative of the error then has an expression in these auxiliary functions, and the second derivative of the error is expressed through further such functions of the step length.

Simplification of this leads to coefficient functions of the step length, given as follows. We calculate these expressions and substitute them into the equation in Theorem 5.1 to find the necessary bounds. Substituting the expressions into that equation leads to further coefficient functions, which we simplify to arrive at the constants appearing in the bound; these constants are functions of the quantities developed next. First we define, for each pair of centers, the corresponding length, and from these lengths the remaining constants are built.

Remark: To construct the values of the constants, we follow the method of the proof. First, calculate the initial quantities as at the start of the proof. The intermediate values are calculated using the pairwise lengths. The coefficients are then calculated according to the formulas provided. Finally, the values of the constants in the bound are obtained from the coefficients.
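The analytic constants above cannot be recovered from this transcription, so the following sketch estimates a bound for the second derivative of the error along the chosen ray numerically, by sampling second-order finite differences, and then applies a Lemma 3.4-style step choice. It is our own stand-in, not the paper's construction: a sampled maximum is only an estimate of the true bound, and the helpers weighted_error and error_gradients are the hypothetical functions from the earlier sketches.

    import numpy as np

    def directional_second_derivative_bound(phi, t_max, samples=50, h=1e-4):
        # phi(t) is the error along the ray p0 + t*d for 0 <= t <= t_max.
        # Estimate max |phi''(t)| by central second differences on an interior grid.
        ts = np.linspace(h, t_max - h, samples)
        second = [(phi(t + h) - 2.0 * phi(t) + phi(t - h)) / h ** 2 for t in ts]
        return max(abs(s) for s in second)

    def guaranteed_step(grad_norm, M, eps):
        # With |phi''| <= M on [0, eps], the step t = min(eps, grad_norm / M)
        # decreases the error by at least 0.5 * t * grad_norm (cf. Lemma 3.4).
        t = min(eps, grad_norm / M)
        return t, 0.5 * t * grad_norm

    # Example wiring with the earlier hypothetical helpers: move only the centers,
    # keeping widths and weights fixed, along the normalized negative gradient.
    # gc, _ = error_gradients(X, y, v, centers, widths, w)
    # g = np.linalg.norm(gc)
    # d = -gc / g
    # phi = lambda t: weighted_error(X, y, v, centers + t * d, widths, w)
    # M = directional_second_derivative_bound(phi, t_max=eps)
    # t, guaranteed_drop = guaranteed_step(g, M, eps)

This mirrors update method 2 of Section VI, except that there the constant bounding the second derivative is computed analytically from Theorem 5.1 rather than sampled.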

VI. EXAMPLES

We present two simple examples, approximating a fixed target function, to illustrate the given theoretical results. Three different methods are used to update the centers, weights, and width values in a function-approximation problem using RBF networks. In Lemmas 3.3 and 3.4, a value for the step size is suggested once a bound for the Hessian of the error function is known; Theorems 4.2 and 5.1 give two methods of calculating the desired bounds, and the first two update methods stem from these results. The third update method is to use an optimal line-search algorithm as a comparison. The details of each of the methods used are summarized in the following list.

1) Bounds for the Hessian of the error function are given for all points in a ball around the initial set of parameters, based on the calculations given in Theorem 4.2. The step size is then chosen as suggested by Lemma 3.4.

2) Bounds for the Hessian of the error function are obtained along the normalized negative-gradient descent direction starting from the initial set of centers, based on Theorem 5.1. As above, the step size is chosen as suggested by Lemma 3.4.

3) A line-search algorithm incorporating the Armijo-Goldstein and Wolfe conditions is used to determine an approximately optimal step size, with an initial step-size guess of 1 (see [17] for details).

Two case scenarios are also used in the experiment. In case 1 (one center, two data points) the initial values of the parameters and the training pairs are given in Tables I and II. In case 2 (four centers, nine data points) the corresponding details are given in Tables III and IV.

TABLE I: INITIAL CENTER DATA (case 1)
TABLE II: DATA POINTS (case 1)
TABLE III: INITIAL CENTER DATA (case 2)
TABLE IV: DATA POINTS (case 2)

The results of the experiments are reported in Table V. The column marked "Network error improvement" is the amount by which the error function decreases as a result of the change in parameters. A second column shows the average rate of decrease of the error function along the line segment traversed. Finally, a third column gives the guaranteed lower bound on the error improvement. When the updates are based on the calculation of bounds, the performance is similar for both methods in the simpler Scenario 1. However, the second method, where bounds for the Hessian on an interval are used rather than bounds in a ball, is much better in terms of achieving a reduction in error for the more complex Scenario 2. Although the rates of descent are better for methods 1 and 2 than for method 3, the latter line search produces a far larger descent in total. Therefore, the bounds for the Hessian provided here need to be improved to make their application more practical.

VII. CONCLUSION

In RBF networks, an optimal set of centers would be required to make the networks small and efficient. Based on the error as a function of the centers, we have given bounds on the gradient and Hessian of the error function. These bounds may be used in deciding the optimality of the present set of centers. If a change in centers is desired, bounds on the Hessian may be used to fix a step size, depending on the current centers, to get a guaranteed amount of decrease in the error. The new theoretical results give performance guarantees on supervised learning in RBF networks.

APPENDIX I

APPENDIX II

TABLE V: NETWORK ERROR IMPROVEMENT

APPENDIX III

APPENDIX IV

REFERENCES

[1] D. S. Broomhead and D. Lowe, "Multivariate functional interpolation and adaptive networks," Complex Syst., vol. 2.
[2] J. Platt, "A resource-allocating network for function interpolation," Neural Comput., vol. 3, no. 2.
[3] T. Poggio and F. Girosi, "Networks for approximation and learning," Proc. IEEE, vol. 78, Sept.
[4] M. J. D. Powell, "Radial Basis Functions for Multivariate Interpolation: A Review," in Algorithms for the Approximation of Functions and Data, J. C. Mason and M. G. Cox, Eds. Oxford, U.K.: Clarendon.
[5] J. Park and I. W. Sandberg, "Universal approximation using radial basis function networks," Neural Comput., vol. 3, no. 2.
[6] D. Wettschereck and T. Dietterich, "Improving the performance of radial basis function networks by learning center locations," in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992.
[7] M. J. L. Orr, "Regularization in the selection of radial basis function centers," Neural Comput., vol. 7.
[8] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Networks, vol. 2, Mar.
[9] Y. Zhang et al., A New Clustering and Training Method for Radial Basis Function Networks. New York: IEEE, 1996, vol. 1.
[10] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Comput., vol. 1.
[11] M. Vogt, "Combination of radial basis function neural networks with optimized learning vector quantization," Proc. IEEE, vol. 83, Dec.
[12] V. Kadirkamanathan and M. Niranjan, "A function estimation approach to sequential learning with neural networks," Neural Comput., vol. 5, no. 6.
[13] B. Fritzke, "Fast learning with incremental RBF networks," Neural Processing Lett., vol. 1, no. 1, pp. 2-5.
[14] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables. New York: Academic.
[15] A. Bjorck, Numerical Methods for Least Squares Problems. Philadelphia, PA: SIAM.
[16] I. Cha and S. A. Kassam, "RBFN restoration of nonlinearly degraded images," IEEE Trans. Image Processing, vol. 5, June.
[17] R. Fletcher, Practical Methods of Optimization, 2nd ed. New York: Wiley.

Chitra Panchapakesan was born in Tambaram, Chennai, India. She received the B.Sc. and M.Sc. degrees in mathematics from the University of Madras, Tamil Nadu, India, and was ranked first in both. She received the Master's degree in mathematics from Cornell University, Ithaca, NY, and the Ph.D. degree in fixed-point theory from the Indian Institute of Technology, Madras, India. She received a Postdoctoral Award from the University of Melbourne, Melbourne, Australia, where she resumed her research work in the Electrical and Electronic Engineering Department.

Marimuthu Palaniswami (S'84-M'85-SM'94) received the B.E. (Hons.) degree from the University of Madras, Madras, India, the M.Eng.Sc. degree from the University of Melbourne, Melbourne, Australia, and the Ph.D. degree from the University of Newcastle, Newcastle, Australia. He is an Associate Professor at the University of Melbourne, Australia. His research interests are in the fields of computational intelligence and data mining, nonlinear dynamics, computer vision, intelligent control, and biomedical engineering. He has published more than 180 conference and journal papers on these topics.
He was an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS and is on the editorial board of a few computing and electrical engineering journals. Dr. Palaniswami served as a Technical Program Co-Chair for the IEEE International Conference on Neural Networks in 1995 and has served on the programme committees of a number of international conferences. His invited presentations include several keynote lectures and invited tutorials in the areas of machine learning, biomedical engineering, and control. He has completed several industry-sponsored projects for National Australia Bank, Broken Hill Proprietary Limited, the Defence Science and Technology Organization, Integrated Control Systems Pty Ltd., and Signal Processing Associates Pty Ltd. He has also been supported by several Australian Research Council Grants, Industry Research and Development Grants, and Industry Research Contracts. He was also a recipient of a Foreign Specialist Award from the Ministry of Education, Japan.

Daniel Ralph received the B.Sc. (Hons.) degree from the University of Melbourne, Melbourne, Australia, and the M.S. and Ph.D. degrees from the University of Wisconsin, Madison. He was a Lecturer with the University of Melbourne for seven years and is now a Lecturer at Cambridge University, Cambridge, U.K. His research interests include analysis and algorithms in nonlinear programming and nondifferentiable systems, and quadratic programming methods, including their application to machine learning, discrete-time optimal control, and model predictive control. He has published numerous refereed papers and coauthored a research monograph on an area of bilevel optimization called mathematical programming with equilibrium constraints. Dr. Ralph is a Member of the Editorial Board of the SIAM Journal on Optimization and an Associate Editor of both Mathematics of Operations Research and The ANZIAM Journal. His conference activities, apart from invited lectures and session organization, include co-organizing the 2002 International Conference on Complementarity Problems and chairing streams at the 1998 International Conference on Nonlinear Programming and Variational Inequalities and the 1997 International Symposium on Mathematical Programming. He has been the recipient of a number of research grants from the Australian Research Council.

Chris Manzie was born in Melbourne, Australia. He received the B.Sc., B.Eng. (Hons.), and Ph.D. degrees from the Department of Electrical and Electronic Engineering, University of Melbourne, in 1996 and 2001, respectively. He is presently a Research Fellow with the University of Melbourne. His interests include the modeling and control of various problems relating to SI automotive engines.
