Complexity of Regularization RBF Networks

Mark A. Kon
Department of Mathematics and Statistics, Boston University, Boston, MA 02215
mkon@bu.edu

Leszek Plaskota
Institute of Applied Mathematics, University of Warsaw, 02-097 Warsaw, Poland
leszekp@hydra.mimuw.edu.pl

Abstract

We attempt to unify and compare a class of algorithms for learning input-output (i-o) functions f from examples. Our general approach involves parsing information about f into a priori and a posteriori information, with each represented by a probability measure on the space F of candidate functions. A consequence is that complexities of different approximation algorithms for the same problems will be possible to compare, and optimal algorithms will be possible to identify. We illustrate this by formulating an information complexity theory for regularization radial basis function (RBF) networks. We show that the ε-complexity of approximating f using regularization is equivalent to the ε-complexity of approximating f using any consistent Bayesian approach. In particular, a Gaussian prior distribution may be assumed for correct computation of all complexities.

1. Introduction

There are currently many areas of mathematics, statistics, and computer science which deal with learning theory, which effectively involves the extrapolation of functions from partial information or examples. The theory of learning in neural networks has the goal of extrapolating an input-output (i-o) function f from partial information consisting of examples of f. Given a set of data points, the many ways of extrapolating a function f(x) from these imply a need for a study of how such methodologies fit into a larger framework. In this paper we will discuss approaches to this problem, and attempt to place several approximation procedures into a wider context. Through this we hope to develop an approach to comparing approximation errors of different methodologies, and hence the complexities of finding ε-approximations of i-o functions. A sub-goal is the formulation of a normative index of approximation complexities, and more direct comparisons of different approximation methods.

One solution of the learning problem is provided by radial basis function (RBF) neural networks [3, 4], which extrapolate from examples by effectively assuming a prescribed smoothness, or possibly other a priori information. An information complexity theory for RBF networks estimates how many examples are needed to approximate f to a given tolerance ε. Such a theory has been studied in [1, 2], in the context of a worst-case formulation of error, with an assumption that f is known a priori to be contained in a given set F.

A more general formulation of the learning problem can be given as follows. Assume that there is a function f: X → Y to be learned through examples Nf = (f(x_1), ..., f(x_n)). If random errors are involved, so that f(x_i) is replaced by f(x_i) + ε_i, the relationship between X and Y becomes probabilistic, and a probability density on X × Y is a better description of the relationship of input and output. We will assume here for simplicity that the relationship between X and Y is deterministic, so that ε_i = 0.

Most approaches to estimating f have in common the fact that there is information provided a priori regarding f, to be used in learning. In addition, it is assumed that some additional data about f are given (generally in the form of examples Nf). These data are a posteriori information. The learning algorithm must make the best guess at f from the two types of information. More generally, we can formulate a priori information as some kind of prior knowledge about f(x). This prior knowledge can be thought of as a probability distribution μ_pr on the space F of all possible f. On the other hand, an a posteriori probability distribution μ_po on F is implied by the information Nf. The density μ_po(f) will be assumed to depend only on the finite collection of numbers f(x_1), ..., f(x_n). A distribution of this type, which depends on a finite number of variables, is known as a cylinder distribution (or cylinder measure). The job of an inference engine is to combine the probability distributions μ_pr and μ_po in the best way in order to estimate f. To estimate f it is also necessary to assume a penalty for errors. The most general assumption is that this has the form U(f, g), where f is the sought function and g is the estimate of f. A reasonable algorithm for estimating f will minimize the expected risk, which must be carefully defined, given that two equally valid probability measures are at work.
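
For instance, if the penalty is taken to be squared error, U(f, g) = ||f - g||^2 (an illustrative choice, not imposed by the discussion above), the risk-minimizing estimate is the mean of f under the measure conditioned on the data, which is exactly the average case estimator of example 2 in the next section:

```latex
% Bayes estimate under the (illustrative) squared-error penalty U(f,g) = ||f - g||^2:
\hat{g}(y) \;=\; \arg\min_{g}\; \mathbb{E}\bigl[\,\|f-g\|^{2}\,\big|\,Nf=y\,\bigr]
          \;=\; \mathbb{E}\bigl[\,f\,\big|\,Nf=y\,\bigr].
```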

2. The parsing of approximation methods: some examples

Learning theory includes many different and very effective approaches. An important goal is to classify these within a coherent framework, and to analyze and compare their complexities of approximation within this framework. We include some examples of successful learning algorithms and indicate their parsing of a priori and a posteriori information. Since the latter generally consists of examples Nf, it is not specified here unless necessary. In all of the examples below, it is possible to identify reasonable μ_pr and μ_po, though this is not done here for brevity.

1. Interpolatory approach (worst case approach) in continuous complexity theory: Here, generally, a priori information consists of the fact that f ∈ F, where F is a balanced convex set of functions in a normed linear space. Optimal algorithms approximate the center of the set F ∩ N^{-1}(y) through an algorithm φ(Nf) ≈ f.

2. Average case approach in continuous complexity: The a priori information is that the function f to be learned is drawn from a given probability distribution μ_pr(f) on a normed linear space F. The algorithm selects the estimate of f to be the average of μ_pr(f) conditioned on Nf = y. If μ_pr is a Gaussian measure, then an optimal learning algorithm is the spline algorithm [5], which gives an approximation to f of the form φ(Nf) = Σ_j c_j K(x, x_j), with coefficients c_j determined by the data f(x_1), ..., f(x_n). Here C = A^{-2} is the covariance operator of the Gaussian (a generalization of the covariance matrix), and K(x, x_j) is a radial basis function, i.e., the reproducing kernel for the space with the norm ||Af||.

3. Feedforward neural network algorithms: These learn from examples Nf = (f(x_1), ..., f(x_n)) of an unknown i-o function f(x), and form an approximation

  f(x) ≈ Σ_i c_i φ_i(x),                                          (1)

where the φ_i might be ridge functions, radial basis functions, or other functions computed by the hidden layer, with the coefficients c_i computed using backpropagation or other algorithms. An implicit a priori assumption is that f can be written in the form (1). This is often effectively a smoothness assumption, since a finite number of neurons with a smooth activation function will produce smooth approximations. Indeed, if the φ_i(x) are RBF's, we know optimal approximations (1) explicitly minimize a Sobolev norm.

4. Maximum likelihood approaches: These estimate f using an approximation which is consistent with the information Nf, and whose probability is the largest under the a priori measure μ_pr(f) restricted to {Nf = y}.

5. Regularization (radial basis function) approaches: Here the a priori information consists, for example, of the fact that f is smooth, i.e., that its Sobolev norm ||Af|| is small (here A = Δ + 1, with Δ the Laplacian). The algorithm for choosing f involves minimization of a weighted combination H(f) = ||Nf - y||^2 + λ||Af||^2, where Nf = y is the a posteriori information (data) from the f to be learned. The minimizing f, under some standard assumptions, is evaluated by a neural network which computes a linear combination of radial basis functions (RBF's) of the form f = Σ_i c_i K(x, x_i) (see the numerical sketch following these examples). This method can also be reformulated probabilistically in the context of the prior and posterior probability densities μ_pr and μ_po.

6. Vapnik-Chervonenkis (V-C) approach: Given a nested family {V_k} of candidate a priori spaces increasing in size (and complexity) with k, one can take the following approach. If k is small, then the candidate set N^{-1}(y) ∩ V_k is small, so that the selection of an approximation from N^{-1}(y) ∩ V_k is made from a reasonably sized set. The method is to choose k sufficiently small that the set V_k is too small to overfit the data.

7. Adaptive resonance theory (ART) algorithms: There is a dynamic neural network weight allocation in this procedure for classifying input vectors x. This is a feedforward network with a second competitive processing stage in which the hidden neuron y_j = f_j(x) with the highest activation suppresses all other neurons y_i, i ≠ j. Thus, an ART network computes the function

  f*(x) = (0, ..., 0, f_j(x), 0, ..., 0),                          (2)

where the right hand side represents the choice of the f_j(x) = y_j with the maximum value, indicating membership of the input in a class C_j. A priori information consists of the fact that the i-o function f* is approximable in the form (2). The choice of smooth f_j(x) is effectively an a priori assumption on smoothness of the separators of the classes C_j above. Note that we are ignoring here the dynamics of programming an ART network.

Other methodologies not mentioned above include:

- Computational learning theory
- Regression methods in statistics
- Maximum entropy methodologies
- Decision tree methodologies in artificial intelligence
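
As a concrete illustration of example 5 (and, in the small-λ limit, of the spline estimate in example 2), the following minimal sketch fits an estimate of the form f(x) = Σ_i c_i K(x, x_i) to synthetic data. The Gaussian kernel, the target function, the sample points, and the value of λ are illustrative choices only, not prescribed by the discussion above.

```python
import numpy as np

def gaussian_kernel(x, z, width=0.5):
    """K(x, z) = exp(-(x - z)^2 / (2 width^2)), standing in for the reproducing kernel."""
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * width ** 2))

def rbf_regularization_fit(x_train, y_train, lam=1e-2, width=0.5):
    """Coefficients c of the minimizer of ||Nf - y||^2 + lam * (kernel norm of f)^2.
    The minimizer has the RBF form f(x) = sum_i c_i K(x, x_i),
    with c = (K + lam I)^{-1} y (the usual kernel ridge solution)."""
    K = gaussian_kernel(x_train, x_train, width)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def rbf_predict(x_new, x_train, c, width=0.5):
    return gaussian_kernel(x_new, x_train, width) @ c

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 12))
y_train = np.sin(3 * x_train)                              # exact data N f = y
c = rbf_regularization_fit(x_train, y_train, lam=1e-6)     # small lam: near-interpolation
print(rbf_predict(np.linspace(-1, 1, 5), x_train, c))
```

With λ small the fit essentially interpolates the data (the exact-information limit used in Section 3); larger λ trades data fit against the smoothness penalty.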

After placing different algorithms into a single context, a second step should be to identify and compare ε-complexities of different algorithms. Given ε, we wish to compute the information complexity (the number of pieces of information in Nf) required to approximate f within ε. A further goal should be to form a normative index of such methods according to their (now comparable) optimality properties. This step might involve precisely defining optimality within such classes of approaches, and identifying optimal algorithms from among differing approaches. This by no means obviates the need for varied methodologies, but rather allows them to be compared to each other on the same problems.

In the next section we illustrate a methodology for comparing complexities. We formulate a relationship between RBF regularization network techniques for learning, and Bayesian learning methods used in the average case setting of continuous complexity theory. We believe these results are interesting because they formulate a consistent complexity theory for RBF regularization networks. In particular, it follows that the complexity depends only on the RBF network itself, through its regularization operator A. Our results are obtained through consideration of all possible Bayesian prior distributions consistent with the regularization operator A of the RBF network.

To define the average complexity (number of examples) required to approximate f within error ε, we will first define the average case error. For this an a priori probability distribution μ_pr on the set of possible f is required. If the approximation complexity were dependent on the assumed μ_pr, there would be several different complexities consistent with a single given RBF regularization network, and no complexity theory would be possible. Using probabilistic methods, we will show that all μ_pr consistent with the regularization operator A yield a single complexity order, yielding a unique complexity theory for the RBF network with regularization operator A. We also will show that all Bayesian models which are consistent with the regularization RBF network yield the same complexity of approximation using RBF algorithms. This complexity is the same one obtained by assuming a Gaussian prior distribution whose covariance operator is A^{-2}.

3. A formulation of regularization networks in Bayesian terms

Let H be a Hilbert space, and let f ∈ H be an unknown function we seek to identify. Assume we have information Nf = (f(x_1), ..., f(x_n)) = y (sometimes called standard information). We assume we have the a priori information that f has small norm with respect to some operator A, e.g., that ||Af|| is small (for example, A = Δ + 1 leads to the assumption of smoothness for f; see above). We then minimize a regularization functional to approximate f:

  f_λ = arg min_g { ||Ng - y||^2 + λ||Ag||^2 }.                     (3)

We assume the information

  Nf = y                                                            (4)

is exact, and that y is given. Then (4) becomes a constraint, and through regularization theory it can be shown that we should take the limit λ → 0 in the optimization problem (3), which yields

  f_reg = arg min { ||Ag|| : Ng = y }.                              (5)
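
A small finite-dimensional check of this limiting behavior is sketched below, with an arbitrary diagonal matrix standing in for A and coordinate evaluations standing in for N; none of these choices comes from the analysis here.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 3
A = np.diag(1.0 + np.arange(d))                    # toy invertible regularization operator
N = np.eye(d)[rng.choice(d, n, replace=False)]     # "standard information": n coordinates of g
y = rng.normal(size=n)

def regularized(lam):
    # g_lam = argmin ||N g - y||^2 + lam ||A g||^2  (normal equations)
    return np.linalg.solve(N.T @ N + lam * A.T @ A, N.T @ y)

# Constrained limit (5): argmin ||A g|| subject to N g = y,
# computed via the minimum-norm solution in the variable h = A g.
g_star = np.linalg.solve(A, np.linalg.pinv(N @ np.linalg.inv(A)) @ y)

for lam in (1e-1, 1e-3, 1e-6):
    print(lam, np.linalg.norm(regularized(lam) - g_star))
```

The printed distances shrink as λ decreases, illustrating the passage from (3) to (5).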

This use of a priori information implies a Bayesian viewpoint: there exists an a priori probability measure μ on H whose density at f ∈ H is a function of ||Af|| and decreases monotonically with ||Af||. A first guess at an a priori density would be

  μ(f) = w(||Af||),                                                 (6)

for some decreasing function w, with the density taken with respect to Lebesgue measure on H (at least if H is finite dimensional). A measure consistent with this definition does not always exist in infinite dimension. However, in all dimensions, if A is invertible and A^{-2} is trace class, there is a Gaussian measure on H consistent with (6). If H is finite dimensional with dimension d, this measure has the form

  μ_A(f) = (2π)^{-d/2} det(A) exp(-||Af||^2 / 2).

The covariance operator of this Gaussian measure is A^{-2}. In infinite dimensions such measures are discussed in, e.g., [5].
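
In finite dimension this can be checked directly. The sketch below, with an arbitrary symmetric positive definite matrix playing the role of A, draws a sample from the Gaussian measure with covariance A^{-2} and evaluates the density above, whose value depends on f only through ||Af||.

```python
import numpy as np

d = 5
rng = np.random.default_rng(2)
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)        # toy symmetric positive definite operator

# Sampling f ~ N(0, A^{-2}): if z ~ N(0, I) then f = A^{-1} z has covariance A^{-2}.
z = rng.normal(size=d)
f = np.linalg.solve(A, z)

def log_density(f):
    # log of mu_A(f) = (2*pi)^{-d/2} det(A) exp(-||A f||^2 / 2)
    return (-0.5 * d * np.log(2 * np.pi) + np.linalg.slogdet(A)[1]
            - 0.5 * np.linalg.norm(A @ f) ** 2)

print(np.linalg.norm(A @ f), log_density(f))
```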

In the general (non-Gaussian) case, we want a measure μ(f) which is consistent with the regularization operator A in the sense that it depends only on ||Af||. A general a priori measure satisfying this condition can be constructed as follows. Define the sets H_r = {f ∈ H : ||Af|| = r}. Measure theoretically, we have

  H = ∪_r H_r = H_1 × R_+,                                          (7)

with R_+ the nonnegative reals and the subscript on the right denoting r = 1. We want measures which are constant on each H_r. For any Borel measure μ supported on the domain of A, define μ_r = μ(· | H_r) to be the conditional measure of μ on H_r. Note that the Gaussian measure μ_A with covariance A^{-2} has the property (as in the finite dimensional case) that its conditional density is the unique uniform measure on each H_r. By (7) (analogously to the finite dimensional case), any measure μ(f) whose density depends only on ||Af|| has uniform conditional measures μ_r = μ_{A,r} on H_r, and some marginal measure ν on R_+. The measure ν (together with the uniform measures μ_r) uniquely defines the measure μ on H; this is the general form of an a priori measure on H consistent with the regularization assumptions. In addition, consistency with the regularity assumptions stipulates that ν be decreasing (since we want probability to decrease with increasing values of ||Af||). Henceforth we assume the measure ν has a finite mean.

We can now exactly compute the information complexity comp(ε) of approximating f to error ε. This is defined as the minimum number of examples (x_1, ..., x_n) which yield an average error of less than ε (see [5, 6]). Specifically, for given information

  y = Nf = (f(x_1), ..., f(x_n))                                    (8)

of cardinality n, we define the best estimate φ(N, y) of f by

  φ(N, y) = arg inf_{g ∈ H} E_{N,y}(||f - g||^2),

i.e., the function of the data Nf = y which minimizes the expected squared error. Above, E_{N,y} denotes the average over all f ∈ H consistent with the information y, with respect to the (conditional) measure μ. We emphasize that the results here assume that the information Nf has the standard form (8) (i.e., consists of pointwise evaluations of f only). We then define the average case error to be

  e(N, y) = E(||f - φ(N, y)||^2),

where the expectation is taken over all f with respect to μ. Finally, for given n we choose the information operator N_n of cardinality n which gives the smallest error for the worst y, and measure the error:

  e(n) = inf_{N_n} sup_y e(N_n, y).

This represents the minimum error e(n) of estimating the worst function f, using the best information operator N_n consisting of n examples f(x_1), ..., f(x_n). The ε-complexity comp(ε) is defined as the minimum number of examples required to obtain an error less than ε:

  comp(ε) = inf{n : e(n) ≤ ε}.

Now we relate comp(ε) to the theory of regularization networks. Assume we are given a regularization operator A and a regularized estimate f_{reg,n} of f obtained from (5) above, using the best information operator N = N_n. For a given a priori measure μ_pr = μ, we define the RBF regularization network ε-complexity by

  comp_reg(ε) = inf{n : E(||f - f_{reg,n}||^2) ≤ ε}.

We wish to compare this ε-complexity with the ε-complexity obtained from the average case (Bayesian) approach of continuous complexity theory, using any a priori probability density consistent with the regularization assumptions arising from the operator A above. Thus the a priori probability distribution μ will be assumed to have uniform density on the sets H_r = {f : ||Af|| = r}, with a decreasing marginal density ν (see above) which has a finite mean. With this as the general form of an a priori measure, we have the following theorem, whose proof is omitted for brevity:

Theorem: The ε-complexity of the regularization approximation problem with regularization operator A is equal to the average case (Bayesian) ε-complexity for any measure μ which is consistent with A. This in turn is equivalent to the average case ε-complexity for the Gaussian measure μ_A with covariance operator A^{-2}. Specifically, we have

  comp_reg(ε) = Θ(comp_μ(ε)) = Θ(comp_{μ_A}(ε)).                    (9)
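
The Gaussian case of this equivalence can be seen in a small finite-dimensional simulation: under the prior N(0, A^{-2}) with exact point data, the minimum-||Ag|| interpolant of (5) coincides with the conditional mean E(f | Nf = y), so the regularization error and the average case error agree trial by trial. The sketch below uses a diagonal A, a fixed dimension, and randomly chosen evaluation points rather than the optimal information operator N_n, so it illustrates the definitions rather than the exact complexity.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 20
A = np.diag(np.arange(1, d + 1, dtype=float))   # toy regularization operator
C = np.linalg.inv(A @ A)                        # prior covariance A^{-2}

def estimate(f, idx):
    """Exact observations y = f[idx].  Under the Gaussian prior N(0, A^{-2}) the
    conditional mean E(f | f[idx] = y) equals the minimum-||A g|| interpolant:
    g = C N^T (N C N^T)^{-1} y."""
    N = np.eye(d)[idx]
    y = f[idx]
    return C @ N.T @ np.linalg.solve(N @ C @ N.T, y)

def avg_error(n, trials=200):
    errs = []
    for _ in range(trials):
        f = np.linalg.solve(A, rng.normal(size=d))   # f ~ N(0, A^{-2})
        idx = rng.choice(d, n, replace=False)
        errs.append(np.linalg.norm(f - estimate(f, idx)) ** 2)
    return np.mean(errs)

for n in (2, 5, 10, 15):
    print(n, avg_error(n))   # errors decrease with n; the smallest n below a
                             # tolerance eps gives an empirical comp_reg(eps)
```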

We remark that the last complexity in (9) can be computed using known techniques in continuous complexity [5].

4. Conclusions

We conclude that information ε-complexities can be computed for regularization networks on the basis of any a priori assumption μ_pr consistent with the regularization operator A, and that the complexities which obtain are independent of this choice of μ_pr. This allows the definition of a regularization complexity comp_reg(ε) which depends only on the regularization operator A. Equivalently, what is necessary to define global information complexities is the regularization assumption that ||Af|| should be small; all complexities consistent with this assumption can be computed from the average case setting using a Gaussian prior. Finally, we conclude with:

Proposition: Under the same measure μ, maximum likelihood estimation also gives the same ε-complexities as in the theorem.

Acknowledgments: The first author's research was partially supported by the US National Science Foundation. The second author's research was partially supported by the Polish-American Fulbright Foundation and the National Committee for Scientific Research of Poland.

References:

[1] M. Kon and L. Plaskota, Information complexity of neural networks, Neural Networks 13, 2000, pp. 365-376.
[2] M. Kon and L. Plaskota, Complexity of neural network approximation with limited information: a worst-case approach, to appear, J. Complexity.
[3] C. A. Micchelli, Interpolation of scattered data: Distance matrices and conditionally positive definite functions, Constructive Approximation 2, 1986, pp. 11-22.
[4] C. A. Micchelli and M. Buhmann, On radial basis approximation on periodic grids, Math. Proc. Camb. Phil. Soc. 112, 1992, pp. 317-334.
[5] J. Traub, G. Wasilkowski, and H. Wozniakowski, Information-Based Complexity, Academic Press, Boston, 1988.
[6] J. Traub and H. Wozniakowski, A General Theory of Optimal Algorithms, Academic Press, New York, 1980.