Neural Network Learning as an Inverse Problem

VĚRA KŮRKOVÁ, Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, Prague 8, Czech Republic.

Abstract

The capability of generalization in learning of neural networks from examples can be modelled using regularization, which has been developed as a tool for improving the stability of solutions of inverse problems. Such problems are typically described by integral operators. It is shown that learning from examples can be reformulated as an inverse problem defined by an evaluation operator. This reformulation leads to an analytical description of an optimal input/output function of a network with kernel units, which can be employed to design a learning algorithm based on a numerical solution of a system of linear equations.

Keywords: learning from data, generalization, empirical error functional, inverse problem, evaluation operator, kernel methods

1 Introduction

Today's technology provides many measuring and recording devices, which supply us with a huge amount of data. Unfortunately, such machine-made data are not comprehensible for our brains. We resemble king Midas, whose foolish wish to transform everything he touched into gold was fulfilled, which led to his starvation as his body could not be fed by gold. So we turn again to machines (computers) to help us transform raw data into patterns that our brains can digest.

As Popper [24] emphasized, no patterns can be derived solely from empirical data. Some hypotheses about patterns have to be chosen and, among patterns satisfying these hypotheses, a pattern with a good fit to the data has to be searched for. The history of science presents many examples of this approach, e.g., Kepler's assumption that planets move on the most perfect curves fitting the data collected by Tycho Brahe. Kepler found that among perfect curves, ellipses are the ones best fitting the data (better than circles, which he tried first). Also Gauss and Legendre searched for curves fitting astronomical data collected from observations of comets. They developed the least square method. In 1805, Legendre wrote: "of all the principles that can be proposed, I think that there is none more general, more exact and more easy to apply than that consisting of minimizing the sum of the squares of errors." Many researchers of later generations shared Legendre's enthusiasm and used the least square method in statistical inference, pattern recognition, function approximation, curve or surface fitting, etc. In all these techniques, minimization of the average of the squares of errors is used to find coefficients of a linear combination of some fixed family of functions. Thus the best fitting function is searched for in a linear hypothesis space.

However, the assumption that empirical data can be approximated by a linear function is rather restrictive. In particular, it is not suitable for high-dimensional data. It is known from the theory of linear approximation that the dimension of a linear space needed for approximation within accuracy ε is $O\left(\left(\frac{1}{\varepsilon}\right)^d\right)$, where d denotes the number of variables [21]. So the complexity of linear models grows exponentially with the data dimension d.

In contrast to the traditional applications of the least square method, in neurocomputing this method is applied to nonlinear hypothesis sets. The hypothesis sets are formed by linear combinations of parameterized functions, the form of which is given by the type of computational units from which a network is built. Originally the units, called perceptrons, modelled some simplified properties of neurons, but later also other types of units (radial and kernel units) with suitable mathematical properties for function approximation became popular. The back-propagation algorithm, developed by Werbos in the 1970s and reinvented by Rumelhart, Hinton and Williams in the 1980s (see [29]), calculates how the gradient of the average square error depends on the coefficients of the linear combination as well as on the inner parameters of the functions forming the linear combination. The algorithm iteratively modifies all network parameters until a sufficiently well fitting input/output function of the network is found.

Neurocomputing also brought a new terminology to data analysis: searching for parameters of input/output functions is called learning, samples of data are called training sets, and the capability to satisfactorily process new data that have not been used for learning is called generalization. The capability of generalization depends on the choice of a hypothesis set of input/output functions, where one searches for a pattern (a functional relationship) fitting the empirical data. So a restriction of the hypothesis set to only physically meaningful functions can improve generalization.

An alternative approach to modelling of generalization was proposed by Poggio and Girosi [22]. They modified the least square method by Tikhonov's regularization, which adds to the least square error a term, called a stabilizer, penalizing undesired input/output functions. Girosi, Jones and Poggio considered stabilizers penalizing high frequencies in the Fourier representation of the hypothetical solutions [11]. Girosi [10] showed that stabilizers of this type belong to a wider class formed by the squares of norms on a special type of Hilbert spaces called reproducing kernel Hilbert spaces (RKHS). Such norms can play the role of measures of various types of oscillations and thus enable modelling a variety of prior knowledge (conceptual data), which has to be added to the empirical data to guarantee a generalization capability.

RKHSs were formally defined by Aronszajn [2], but their theory includes many classical results on positive definite functions, matrices and integral operators with kernels. In data analysis, kernels were first used by Parzen [19] and Wahba [28], who applied them to data smoothing by splines. Aizerman, Braverman and Rozonoer [1] used kernels (under the name potential functions) to solve classification tasks by transforming the geometry of input spaces by embedding them into higher dimensional inner product spaces [1].
Boser, Guyon and Vapnik [5] and Cortes and Vapnik [6] further developed this method of classification into the concept of the support vector machine (a one-hidden-layer network with kernel units in the hidden layer and one threshold output unit). A variety of kernel methods and algorithms are of current interest, and their potential uses are wide ranging; see, e.g., the recent applications-oriented monographs by Schölkopf and Smola [25] and by Cristianini and Shawe-Taylor [7], a theoretical article by Cucker and Smale [8], or a brief survey by Poggio and Smale [23].

So the use of kernels in machine learning has two reasons: (1) norms on RKHSs can play the role of measures of undesirable attributes of input/output functions, the penalization of which improves generalization, and (2) kernels define embeddings of input spaces into feature spaces with geometries that allow the data to be classified to be separated linearly. In this paper, we add a third reason to these two. We show that Aronszajn's definition [2] of RKHSs (as Hilbert spaces with all evaluation functionals continuous) determines precisely the class of Hilbert spaces where solutions of the least square problem of finding input/output functions fitting empirical data (minimization of empirical error functionals) can be obtained as Moore-Penrose pseudosolutions of linear inverse problems.

Tasks of finding unknown causes (such as shapes of functions, forces or distributions) from known consequences (measured data) have been studied in applied science (such as acoustics, geophysics and computerized tomography; see, e.g., [13]) under the name of inverse problems. To solve such a problem, one needs to know how the unknown causes determine the known consequences, which can often be described in terms of operators. In problems originating from physics, such operators are typically integral ones (such as those defining the Radon or Laplace transforms [3], [9]).

In this paper, we reformulate minimization of the least square error as an inverse problem defined by an evaluation operator. A crucial property for an application of tools from the theory of inverse problems is the continuity of the operator. As the evaluation operator is defined in terms of the evaluation functionals at the sample of input data, the class of Hilbert spaces where the theory can be applied corresponds exactly to the class of reproducing kernel Hilbert spaces. Thus the reformulation of a learning problem as an inverse problem justifies the use of RKHSs as suitable hypothesis spaces. We show that this reformulation leads to an analytical description of the optimal input/output function of a network with kernel units, which can be employed to design a learning algorithm based on a numerical solution of a system of linear equations.

The paper is organized as follows. In Section 2, learning from data is reformulated as an inverse problem. Section 3 describes properties of reproducing kernel Hilbert spaces. In Sections 4 and 5, the main results on functions minimizing an empirical error functional and its regularizations are presented.

2 Learning from empirical data as an inverse problem

A standard approach to learning from data, used, e.g., in neurocomputing in the back-propagation algorithm, is based on minimization of the least square error. For a sample of input/output pairs of data (training set) $z = \{(u_i, v_i) \in \Omega \times \mathbb{R},\ i = 1, \ldots, m\}$, where $\Omega \subseteq \mathbb{R}^d$, a functional $E_z$ called the empirical error is defined for a function $f: \Omega \to \mathbb{R}$ as

$$E_z(f) = \frac{1}{m} \sum_{i=1}^{m} \left( f(u_i) - v_i \right)^2.$$

We show that minimization of this functional can be reformulated as an inverse problem. For a linear operator $A: X \to Y$ between two Hilbert spaces X, Y (in the finite-dimensional case, a matrix A), an inverse problem determined by A is to find for $g \in Y$ some $f \in X$ such that

$$A(f) = g,$$

where g is called a datum and f a solution [3].
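As a concrete illustration of the empirical error functional $E_z$ defined above, here is a minimal Python sketch; the one-dimensional sample z = (u, v) and the trial function f are hypothetical, chosen only for illustration.

```python
import numpy as np

def empirical_error(f, u, v):
    """Empirical error E_z(f) = (1/m) * sum_i (f(u_i) - v_i)^2 for a sample z = (u, v)."""
    m = len(u)
    residuals = np.array([f(ui) for ui in u]) - v
    return np.dot(residuals, residuals) / m

# Hypothetical one-dimensional training set and a trial hypothesis f.
u = np.array([0.0, 0.5, 1.0, 1.5])
v = np.array([0.1, 0.4, 1.1, 1.4])
f = lambda x: x                      # the identity function as a candidate
print(empirical_error(f, u, v))      # 0.01 for this sample
```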

When for some $g \in Y$ no solution exists, one can at least search for a pseudosolution $f^o$, for which $A(f^o)$ is a best approximation to g among elements of the range of A, i.e.,

$$\| A(f^o) - g \|_Y = \min_{f \in X} \| A(f) - g \|_Y.$$

Using standard terminology from optimization theory, for a functional $\Phi: X \to \mathbb{R}$ and $M \subseteq X$ we denote by $(M, \Phi)$ the problem of minimization of $\Phi$ over M; the set M is called a hypothesis set. An element $f^o \in M$ such that

$$\Phi(f^o) = \min_{f \in M} \Phi(f)$$

is called a solution of the problem $(M, \Phi)$, and

$$\operatorname{argmin}(M, \Phi) = \{ f^o \in M : \Phi(f^o) = \min_{f \in M} \Phi(f) \}$$

denotes the set of all solutions of $(M, \Phi)$. So a pseudosolution is a solution of the problem $(X, \| A(\cdot) - g \|_Y)$. The set $\operatorname{argmin}(X, \| A(\cdot) - g \|_Y)$ is convex and, hence, if it is nonempty, there exists a unique pseudosolution of minimal norm, called the normal pseudosolution [12]. This normal pseudosolution, denoted by $f^+$, satisfies

$$\| f^+ \|_X = \min \{ \| f^o \|_X : f^o \in \operatorname{argmin}(X, \| A(\cdot) - g \|_Y) \}.$$

When for every $g \in Y$ the set $\operatorname{argmin}(X, \| A(\cdot) - g \|_Y)$ is non-empty (and thus there exists a normal pseudosolution $f^+$), a pseudoinverse operator $A^+: Y \to X$ can be defined by setting $A^+(g) = f^+$. In 1920, Moore [18] described properties of the pseudoinverse of a matrix, which were rediscovered in 1955 by Penrose [20]. In the 1970s, the theory of Moore-Penrose pseudoinversion was extended to the infinite-dimensional case: it was shown that properties similar to those of Moore-Penrose pseudoinverses of matrices are also possessed by continuous linear operators with closed ranges [12].

We can employ the theory of inverse problems in the investigation of minimization of an empirical error functional. To reformulate minimization of such a functional as an inverse problem, for an input data vector $u = (u_1, \ldots, u_m) \in \Omega^m$ and a Hilbert space X of functions on $\Omega$, define an evaluation operator $L_u: X \to \mathbb{R}^m$ by

$$L_u(f) = \left( \frac{f(u_1)}{\sqrt{m}}, \ldots, \frac{f(u_m)}{\sqrt{m}} \right).$$
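The following is a minimal finite-dimensional sketch of the normal pseudosolution; the rank-deficient matrix A and the data g are hypothetical. The Moore-Penrose pseudoinverse yields the least-squares solution of smallest norm.

```python
import numpy as np

# Hypothetical rank-deficient matrix A and data g: A(f) = g has no exact solution.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
g = np.array([1.0, 1.0])

f_plus = np.linalg.pinv(A) @ g                 # normal pseudosolution f+ = A^+(g)
residual = np.linalg.norm(A @ f_plus - g)      # distance of g from the range of A

# Any other minimizer differs from f+ by a null-space direction and has a larger norm.
f_other = f_plus + np.array([-2.0, 1.0, 0.0])  # (-2, 1, 0) lies in the null space of A
print(abs(np.linalg.norm(A @ f_other - g) - residual) < 1e-12)   # True: same fit
print(np.linalg.norm(f_plus) < np.linalg.norm(f_other))          # True: f+ has minimal norm
```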

It is easy to check that for every f on $\Omega$,

$$E_z(f) = \frac{1}{m} \sum_{i=1}^{m} \left( f(u_i) - v_i \right)^2 = \left\| L_u(f) - \frac{v}{\sqrt{m}} \right\|_2^2. \qquad (2.1)$$

So $E_z$ can be represented as

$$E_z = \left\| L_u(\cdot) - \frac{v}{\sqrt{m}} \right\|_2^2,$$

where $\| \cdot \|_2$ denotes the $l_2$-norm on $\mathbb{R}^m$. The representation (2.1) allows one to express the problem of minimization of the empirical error functional $E_z$ as the problem of finding a pseudosolution $L_u^+ \left( \frac{v}{\sqrt{m}} \right)$ of the inverse problem given by the operator $L_u$ for the data $\frac{v}{\sqrt{m}}$. As the range of the operator $L_u$ is finite dimensional, it is closed in $\mathbb{R}^m$. Thus, to employ the extension of Moore-Penrose pseudoinversion to the domain of infinite dimensional spaces, it remains to find proper hypothesis spaces on which the operators $L_u$ are continuous for all input data vectors u.

3 Reproducing kernel Hilbert spaces

Neither the space $C(\Omega)$ of all continuous functions on $\Omega \subseteq \mathbb{R}^d$ nor the Lebesgue space $L^2(\Omega)$ of square integrable functions is suitable as a hypothesis space: the first one is not a Hilbert space and the second one is not formed by pointwise defined functions. Moreover, $L_u$ is not continuous on any subspace of $L^2(\mathbb{R}^d)$ containing a sequence of scaled Gaussians concentrating at the origin (functions proportional to $n^{d/2} e^{-(n \|x\|)^2}$): all elements of such a sequence have $L^2$-norms equal to 1, but the evaluation functional at zero maps the sequence to an unbounded sequence of real numbers and thus cannot be continuous.

Fortunately, there exists a large class of Hilbert spaces on which all evaluation functionals are continuous and, moreover, norms on such spaces can play the role of measures of various types of oscillations of input/output mappings. Spaces from this class are called reproducing kernel Hilbert spaces (RKHSs). An RKHS is a Hilbert space formed by functions on a nonempty set $\Omega$ such that for every $x \in \Omega$, the evaluation functional $F_x$, defined for any f in the Hilbert space as $F_x(f) = f(x)$, is bounded [2].

RKHSs can be characterized in terms of kernels, which are symmetric positive semidefinite functions $K: \Omega \times \Omega \to \mathbb{R}$, i.e., functions satisfying for all m, all $(w_1, \ldots, w_m) \in \mathbb{R}^m$, and all $(x_1, \ldots, x_m) \in \Omega^m$,

$$\sum_{i,j=1}^{m} w_i w_j K(x_i, x_j) \geq 0.$$

A kernel is positive definite if $\sum_{i,j=1}^{m} w_i w_j K(x_i, x_j) = 0$ for any distinct $x_1, \ldots, x_m$ implies that $w_i = 0$ for all $i = 1, \ldots, m$. In such a case, $\{ K_x : x \in \Omega \}$ is a linearly independent set. The RKHS defined by a kernel K on $\Omega \times \Omega$ is denoted by $H_K(\Omega)$. It is formed by linear combinations of functions from $\{ K_x : x \in \Omega \}$ and limits of all Cauchy sequences with respect to the norm $\| \cdot \|_K$ induced by the inner product defined on the generators by $\langle K_x, K_y \rangle_K = K(x, y)$.
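As a small numerical check of this definition, the sketch below builds the Gram matrix of the Gaussian kernel on a handful of hypothetical input points and verifies that it is symmetric with nonnegative eigenvalues, i.e., positive semidefinite.

```python
import numpy as np

def gaussian_kernel(x, y, rho=1.0):
    """Gaussian kernel G_rho(x, y) = exp(-rho * ||x - y||^2)."""
    return np.exp(-rho * np.sum((x - y) ** 2))

def gram_matrix(kernel, u):
    """Gram matrix K[u] with entries K[u]_{i,j} = K(u_i, u_j)."""
    m = len(u)
    return np.array([[kernel(u[i], u[j]) for j in range(m)] for i in range(m)])

# Hypothetical input points in R^2.
u = np.random.default_rng(0).normal(size=(5, 2))
K = gram_matrix(gaussian_kernel, u)
print(np.allclose(K, K.T))                       # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # eigenvalues nonnegative (up to round-off)
```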

For a positive integer m and a vector $u = (u_1, \ldots, u_m)$, $K[u]$ denotes the $m \times m$ matrix defined as $K[u]_{i,j} = K(u_i, u_j)$, which is called the Gram matrix of the kernel K with respect to the vector u. A paradigmatic example of a kernel is the Gaussian kernel

$$G_\rho(x, y) = e^{-\rho \| x - y \|^2}$$

on $\mathbb{R}^d \times \mathbb{R}^d$. For this kernel, the space $H_{G_\rho}(\mathbb{R}^d)$ contains all functions computable by radial-basis function networks with a fixed width equal to $\rho$.

4 Optimal solution of the learning task

Using tools from the theory of inverse problems, we can describe the unique function in the reproducing kernel Hilbert space given by a kernel K that minimizes an empirical error functional and has the smallest K-norm. The next theorem states that this function is the image of the vector of normalized output data $\frac{v}{\sqrt{m}}$ under the pseudoinverse operator $L_u^+$ (for the proof see [14]).

Theorem 4.1
Let $K: \Omega \times \Omega \to \mathbb{R}$ be a kernel, m be a positive integer and z = (u, v), where $u = (u_1, \ldots, u_m) \in \Omega^m$, $u_1, \ldots, u_m$ are distinct and $v = (v_1, \ldots, v_m) \in \mathbb{R}^m$; then:
(i) $L_u^+ \left( \frac{v}{\sqrt{m}} \right) \in \operatorname{argmin}(H_K(\Omega), E_z)$, and for every $f^o \in \operatorname{argmin}(H_K(\Omega), E_z)$, $\left\| L_u^+ \left( \frac{v}{\sqrt{m}} \right) \right\|_K \leq \| f^o \|_K$;
(ii) $L_u^+ \left( \frac{v}{\sqrt{m}} \right) = \sum_{i=1}^{m} c_i K_{u_i}$, where $c = (c_1, \ldots, c_m) = K[u]^+ v$.

So for every kernel K and every sample of empirical data z, there exists a function $f^+$ minimizing the empirical error functional $E_z$ over the whole RKHS defined by K. This function is formed by a linear combination of the functions $K_{u_1}, \ldots, K_{u_m}$:

$$f^+ = L_u^+ \left( \frac{v}{\sqrt{m}} \right) = \sum_{i=1}^{m} c_i K_{u_i}.$$

$f^+$ can be interpreted as an input/output function of a neural network with one hidden layer of kernel units and a single linear output unit. The coefficients of the linear combination $c = (c_1, \ldots, c_m)$ (corresponding to the so-called output weights) can be computed by applying the pseudoinverse of the Gram matrix of the kernel K with respect to the input data vector u to the output data vector v:

$$c = K[u]^+ v.$$

The capability of generalization of such a network can be studied in terms of the stability of $f^+$ against noise. Recall that the condition number of a symmetric positive definite matrix is the ratio between its largest and smallest eigenvalues. So when this ratio is small for the matrix K[u], the solution $f^+$ is robust against noise. The condition number measures the worst magnification of data errors, while a more realistic description of ill-conditioning can be derived from inspection of all the eigenvalues of K[u]. For K positive definite, the row vectors $K_{u_1}, \ldots, K_{u_m}$ of the matrix K[u] are linearly independent.
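A minimal sketch of Theorem 4.1 (ii) with a Gaussian kernel: the output weights are computed as $c = K[u]^+ v$ and the resulting network function $f^+$ fits the sample; the one-dimensional inputs u and outputs v are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, rho=1.0):
    return np.exp(-rho * np.sum((x - y) ** 2))

# Hypothetical sample z = (u, v) with distinct inputs u_i in R.
u = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
v = np.array([0.0, 0.4, 0.9, 1.0, 0.8])
m = len(u)

K_u = np.array([[gaussian_kernel(u[i], u[j]) for j in range(m)] for i in range(m)])
c = np.linalg.pinv(K_u) @ v          # output weights c = K[u]^+ v  (Theorem 4.1 (ii))

def f_plus(x):
    """f^+(x) = sum_i c_i K(u_i, x): one hidden layer of kernel units, one linear output."""
    return sum(c[i] * gaussian_kernel(u[i], x) for i in range(m))

print([round(float(f_plus(ui)), 6) for ui in u])   # reproduces v up to numerical error
print(np.linalg.cond(K_u))                         # a large condition number signals instability
```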

But when the distances between the data $u_1, \ldots, u_m$ are small, the row vectors might be nearly parallel and the small eigenvalues of K[u] might cluster near zero. In such a case, small changes of v can cause large changes of $f^+$.

5 Generalization modelled as regularization

The function $f^+ = \sum_{i=1}^{m} c_i K_{u_i}$ with $c = K[u]^+ v$ provides the best fit to the sample of data z that can be obtained using functions from the reproducing kernel Hilbert space defined by the kernel K. By choosing this RKHS as a hypothesis space, one imposes a condition on the oscillations of potential solutions. The type of such a condition can be illustrated by convolution kernels $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfying $K(x, y) = k(x - y)$ for some $k: \mathbb{R}^d \to \mathbb{R}$ with positive Fourier transform $\tilde{k}$. For such kernels, K-norms can be expressed as high-frequency filters

$$\| f \|_K^2 = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \frac{| \tilde{f}(\omega) |^2}{\tilde{k}(\omega)} \, d\omega$$

[10]. So to keep this norm small, $\tilde{f}(\omega)$ should decrease rather quickly as $\|\omega\|$ increases.

The restriction on high frequency oscillations can be strengthened by Tikhonov's regularization with $\| \cdot \|_K^2$ as a stabilizer, i.e., by replacing minimization of $E_z$ with minimization of $E_z + \gamma \| \cdot \|_K^2$, where $\gamma > 0$ is called a regularization parameter. The next theorem describes properties of the regularized solutions $f^\gamma$, their relationship to the pseudosolution $f^+$, and the improvement of stability achievable using Tikhonov's regularization (for the proof see [14]).

Theorem 5.1
Let $K: \Omega \times \Omega \to \mathbb{R}$ be a kernel, m be a positive integer, z = (u, v), where $u = (u_1, \ldots, u_m) \in \Omega^m$, $u_1, \ldots, u_m$ are distinct, $v = (v_1, \ldots, v_m) \in \mathbb{R}^m$ and $\gamma > 0$; then:
(i) there exists a unique solution $f^\gamma$ of the problem $(H_K(\Omega), E_z + \gamma \| \cdot \|_K^2)$;
(ii) $f^\gamma = \sum_{i=1}^{m} c_i K_{u_i}$, where $c = (K[u] + \gamma m I)^{-1} v$;
(iii) when K is positive definite, then $\operatorname{cond}(K[u] + \gamma m I) = 1 + \frac{(\operatorname{cond}(K[u]) - 1)\, \lambda_{\min}}{\lambda_{\min} + \gamma m}$, where $\lambda_{\min}$ is the minimal eigenvalue of K[u].

Similarly as in the case of the function $f^+$ minimizing the empirical error, the function $f^\gamma$ minimizing the regularized empirical error is a linear combination of the functions $K_{u_1}, \ldots, K_{u_m}$ defined by the input data $u_1, \ldots, u_m$:

$$f^\gamma = \sum_{i=1}^{m} c_i^\gamma K_{u_i}.$$

But the coefficients of the linear combination are different: their vector $c^\gamma = (c_1^\gamma, \ldots, c_m^\gamma)$ is the image of the output data vector v under the inverse operator $(K[u] + \gamma m I)^{-1}$:

$$c^\gamma = (K[u] + \gamma m I)^{-1} v.$$

This formula has been derived by several authors [28], [8], [23] using Fréchet derivatives and is called the Representer Theorem. Here, we obtained it directly from properties of regularized solutions of inverse problems.
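A companion sketch of the regularized solution of Theorem 5.1 (ii), again with a Gaussian kernel; the sample z = (u, v) and the regularization parameter γ are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, rho=1.0):
    return np.exp(-rho * np.sum((x - y) ** 2))

# Hypothetical sample z = (u, v) and regularization parameter gamma.
u = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
v = np.array([0.0, 0.4, 0.9, 1.0, 0.8])
gamma = 0.1
m = len(u)

K_u = np.array([[gaussian_kernel(u[i], u[j]) for j in range(m)] for i in range(m)])
c_gamma = np.linalg.solve(K_u + gamma * m * np.eye(m), v)   # c = (K[u] + gamma*m*I)^{-1} v

def f_gamma(x):
    """Regularized solution f_gamma(x) = sum_i c_gamma_i K(u_i, x)."""
    return sum(c_gamma[i] * gaussian_kernel(u[i], x) for i in range(m))

# Regularization improves conditioning, in line with Theorem 5.1 (iii).
print(np.linalg.cond(K_u), np.linalg.cond(K_u + gamma * m * np.eye(m)))
```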

So the increase of smoothness of the regularized solution $f^\gamma$ is achieved by merely changing the coefficients of the linear combination: while in the non-regularized case the coefficients are obtained from the output data vector v using the Moore-Penrose pseudoinverse of the Gram matrix K[u], in the regularized one they are obtained using the inverse of the modified matrix $K[u] + \gamma m I$. So the regularization merely changes amplitudes, but it preserves the finite set of basis functions from which the solution is composed. These basis functions can be, for example, Gaussians with centers given by the input data.

Theorem 5.1 shows that continuous growth of the regularization parameter $\gamma$ leads from under-smoothing to over-smoothing and estimates how ill-conditioning can be improved by Tikhonov's regularization. As

$$\lim_{\gamma m \to \infty} \left( 1 + \frac{(\operatorname{cond}(K[u]) - 1)\, \lambda_{\min}}{\lambda_{\min} + \gamma m} \right) = 1,$$

the larger the product $\gamma m$, the better the improvement of stability. However, the size of the regularization parameter $\gamma$ is limited by the requirement of fitting $f^\gamma$ to the sample of empirical data z, so $\gamma$ cannot be too large. But m can be increased by choosing a larger sample of empirical data. However, for large m, the computational efficiency of iterative methods for solving the systems of linear equations $c = K[u]^+ v$ and $c = (K[u] + \gamma m I)^{-1} v$ might limit practical applications of learning algorithms based on the Representer Theorem (Theorem 5.1 (ii)); see [23] for references to such applications. In typical neural-network algorithms, networks with the number n of hidden units much smaller than the size m of the training set are used. In [16] and [17], estimates of the speed of convergence of infima of the empirical error over sets of functions computable by such networks to the minimum described in Theorem 5.1 were derived. For reasonable data sets, such convergence is rather fast.

Acknowledgements

This work was partially supported by the project 1ET of the program Information Society of the National Research Program of the Czech Republic.

References

[1] Aizerman M. A., Braverman E. M., Rozonoer L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 28.
[2] Aronszajn N. (1950). Theory of reproducing kernels. Transactions of AMS 68.
[3] Bertero M. (1989). Linear inverse and ill-posed problems. Advances in Electronics and Electron Physics 75.
[4] Björck A. (1996). Numerical Methods for Least Squares Problems. SIAM.
[5] Boser B., Guyon I., Vapnik V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (Ed. D. Haussler). ACM Press.
[6] Cortes C., Vapnik V. N. (1995). Support-vector networks. Machine Learning 20.
[7] Cristianini N., Shawe-Taylor J. (2000). An Introduction to Support Vector Machines. Cambridge: Cambridge University Press.
[8] Cucker F., Smale S. (2001). On the mathematical foundations of learning. Bulletin of the AMS 39.
[9] Engl H. W., Hanke M., Neubauer A. (2000). Regularization of Inverse Problems. Dordrecht: Kluwer.

[10] Girosi F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation 10 (AI Memo No 1606, MIT).
[11] Girosi F., Jones M., Poggio T. (1995). Regularization theory and neural network architectures. Neural Computation 7.
[12] Groetsch C. W. (1977). Generalized Inverses of Linear Operators. New York: Dekker.
[13] Hansen P. C. (1998). Rank-Deficient and Discrete Ill-Posed Problems. Philadelphia: SIAM.
[14] Kůrková V. (2004). Supervised learning as an inverse problem. Research Report ICS, Institute of Computer Science, Prague.
[15] Kůrková V. (2004). Learning from data as an inverse problem. In COMPSTAT Proceedings on Computational Statistics (J. Antoch, Ed.). Heidelberg: Physica-Verlag/Springer.
[16] Kůrková V., Sanguineti M. (2005). Error estimates for approximate optimization by the extended Ritz method. SIAM Journal on Optimization 15.
[17] Kůrková V., Sanguineti M. (2005). Learning with generalization capability by kernel methods with bounded complexity. Journal of Complexity (to appear).
[18] Moore E. H. (1920). Abstract. Bull. Amer. Math. Soc. 26.
[19] Parzen E. (1966). An approach to time series analysis. Annals of Math. Statistics 32.
[20] Penrose R. (1955). A generalized inverse for matrices. Proc. Cambridge Philos. Soc. 51.
[21] Pinkus A. (1985). n-Widths in Approximation Theory. Berlin: Springer-Verlag.
[22] Poggio T., Girosi F. (1990). Networks for approximation and learning. Proceedings of the IEEE 78.
[23] Poggio T., Smale S. (2003). The mathematics of learning: dealing with data. Notices of the AMS 50.
[24] Popper K. (1968). The Logic of Scientific Discovery. New York: Harper Torch Books.
[25] Schölkopf B., Smola A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Cambridge: MIT Press.
[26] Tikhonov A. N., Arsenin V. Y. (1977). Solutions of Ill-posed Problems. Washington, D.C.: W. H. Winston.
[27] Vapnik V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer-Verlag.
[28] Wahba G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
[29] Werbos P. J. (1995). Backpropagation: basics and new developments. In The Handbook of Brain Theory and Neural Networks (Ed. M. Arbib). Cambridge: MIT Press.

Received 5 January 2005.
