Algebraic Analysis for Non-identifiable Learning Machines
Sumio Watanabe
P&I Lab., Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Japan
swatanab@pi.titech.ac.jp
May 4, 2000

ABSTRACT

This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a non-identifiable learning machine, such as a multilayer neural network, whose true parameter set is an analytic set with singular points. Using a concept from algebraic analysis, we rigorously prove that the Bayesian stochastic complexity, or free energy, is asymptotically equal to λ_1 log n − (m_1 − 1) log log n + constant, where n is the number of training samples, and λ_1 and m_1 are a rational number and a natural number which are determined as birational invariants of the singularities in the parameter space. We also show an algorithm to calculate λ_1 and m_1 based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ_1 is equal to the number of parameters and m_1 = 1, whereas in non-regular models such as multilayer networks, 2λ_1 is not larger than the number of parameters and m_1 ≥ 1. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, non-identifiable learning machines are better models than regular ones when Bayesian ensemble learning is applied.
1 Introduction

Learning machines made of superpositions of homogeneous functions are often useful for constructing practical information systems. For example, layered neural networks, radial basis functions, and mixtures of normal distributions have been applied to many recognition and prediction systems. They are written in the form

  ψ(x, {a_h, b_h}) = Σ_{h=1}^{H} a_h g(b_h, x),

where g is a function, x is an input vector, and {a_h, b_h} is the set of parameters to be optimized. One important property of such machines is that, in general, they do not satisfy the regularity condition for the asymptotic normality of the maximum likelihood estimator (Hagiwara, Toda, & Usui, 1993; Fukumizu, 1996). For regular statistical models (Cramer, 1949), the set of true parameters consists of only one point and the Fisher information matrix is positive definite, even if the learning model is larger than necessary to attain the true distribution; this case is called over-realizable. On the other hand, if a layered learning machine is in the over-realizable case, the set of true parameters is not one point but an analytic set with singularities, so the estimated parameters do not converge to one point. In other words, the correspondence between the set of parameters and the set of functions is not a one-to-one mapping. In this paper, such learning models are called non-identifiable.

Research on non-identifiable learning machines is important for layered neural networks for three reasons. Firstly, it is necessary for selecting the optimal model, which balances the function approximation error against the statistical estimation error. Although the true distribution is not contained in the parametric learning machine in a practical application, no information criterion can be obtained without studying the over-realizable case (Hagiwara, Toda, & Usui, 1993; Fukumizu, 1999).
The distributions of estimated parameters in non-identifiable cases should also be analyzed for hypothesis testing (Dacunha-Castelle & Gassiat, 1997; Knowles & Siegmund, 1989). These studies mainly treat the asymptotic case, based on the assumption that the number of training samples is sufficiently large. However, there is a similarity between the error behavior of a non-identifiable machine with large-size
data and that of a complex machine with moderate-size data, since the Fisher information matrices are ill-defined in both cases. Therefore, we can expect the asymptotic study of the non-identifiable case to be useful for complex learning machines in practical applications.

Secondly, studies of non-identifiable machines will clarify the essential difference between regular statistical models and artificial neural networks. Their difference in the field of function approximation has already been studied (Barron, 1994; Mhaskar, 1996), whereas their difference in statistical estimation has not. This paper shows that the generalization errors of non-identifiable learning machines trained with the Bayesian method are not larger than those of identifiable and regular models, which can be proven under the condition that the function approximation error is negligible compared with the statistical estimation error.

And lastly, a theory for learning machines whose loss functions cannot be approximated by any quadratic form will be the foundation for devising and improving neural network training algorithms. Many training algorithms have been devised on the assumption that the set of optimal parameters consists of only one point and that the Fisher information matrix is positive definite. However, this assumption is not satisfied in general. For example, we often find that a layered learning machine works very well when the functions of its hidden units are almost linearly dependent. Training algorithms should be improved so that learning machines attain the best performance even when they are in such nearly redundant states. The maximum likelihood method is not an appropriate training algorithm for complex and layered learning machines in such states.
In this paper, in order to clarify the statistical properties of non-identifiable learning machines, we prove that the Bayesian stochastic complexity, or free energy, F(n) has the asymptotic form

  F(n) = λ_1 log n − (m_1 − 1) log log n + O(1),

where n is the number of training samples, and λ_1 and m_1 are a rational number and a natural number, respectively, which are determined by the singularities of the set of true parameters. The learning curve is thus determined by the algebraic geometrical structure of the parameter set. We also show
an algorithm to calculate λ_1 and m_1 by using blowing-ups of singularities, and that 2λ_1 is not larger than the number of parameters. Since the increase of the stochastic complexity, F(n + 1) − F(n), is equal to the generalization error defined by the average Kullback distance of the estimated probability density from the true one (Levin, Tishby, & Solla, 1990; Amari, Fujita, & Shinomoto, 1992; Amari & Murata, 1993), our result claims that layered neural networks are better learning machines than regular statistical models if Bayesian estimation (Akaike, 1980; Mackay, 1992) or ensemble learning is applied in training.

The free energy F(n), which is an important observable in Bayesian statistics, information theory, and mathematical physics, has many other names and applications. For example, it is called the Bayesian criterion in Bayesian model selection (Schwarz, 1974), stochastic complexity in universal coding (Rissanen, 1986; Yamanishi, 1998), Akaike's Bayesian criterion in the optimization of hyperparameters (Akaike, 1980), and evidence in neural network learning (Mackay, 1992). In almost all of these studies, F(n) was calculated by using the Gaussian approximation or the saddle-point approximation, based on the assumption that the loss function can be approximated by a quadratic form around the unique true parameter. In neural network learning, we cannot use such an approximation. This paper constructs a general formula which enables us to analyze both identifiable and non-identifiable models in the same way. To study a loss function whose set of zero points contains singularities, we employ the Sato-Bernstein polynomial, or so-called b-function, of algebraic analysis, which extracts algebraic information from the set of true parameters. We also construct an algorithm to calculate the constants λ_1 and m_1 by using the resolution of singularities in algebraic geometry.
Resolution of singularities transforms an integral of several variables into a direct product of integrals of one variable, and enables us to algorithmically calculate the learning efficiency of an arbitrary learning machine in Bayesian estimation.
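As a minimal illustration of the non-identifiability discussed above (our sketch, not part of the paper), consider the toy machine ψ(x) = a·tanh(bx): every parameter pair on the set {ab = 0} realizes the same zero function, so the map from parameters to functions is not one-to-one on the true parameter set.

```python
import math

def psi(a, b, xs):
    """Evaluate the toy machine psi(x) = a * tanh(b * x) on a list of inputs."""
    return [a * math.tanh(b * x) for x in xs]

# A grid of inputs on [-2, 2].
xs = [x / 10.0 for x in range(-20, 21)]

# Three different parameter pairs, all lying on the analytic set {ab = 0}.
outputs = [psi(0.0, 3.0, xs), psi(0.7, 0.0, xs), psi(0.0, 0.0, xs)]

# Distinct parameters, identical functions: the parameter-to-function map
# collapses the whole set {ab = 0} onto the single zero function.
all_equal = all(
    max(abs(u - v) for u, v in zip(outputs[0], o)) == 0.0 for o in outputs[1:]
)
print(all_equal)
```

The set {ab = 0} is exactly the kind of analytic set with a singular point (the origin, where the two lines cross) that the paper analyzes.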
2 Main Results

Let p(y|x, w) be a conditional probability density of an output y ∈ R^N for a given input x ∈ R^M and a given parameter w ∈ R^d, which represents the probabilistic inference of a learning machine. Let ϕ(w) be a probability density function on the parameter space R^d, whose support is denoted by W = supp ϕ ⊂ R^d. We assume that the training or empirical sample pairs {(x_i, y_i); i = 1, 2, ..., n} are independently taken from q(y|x)q(x), where q(x) and q(y|x) represent the true input probability and the true inference probability, respectively. In Bayesian estimation, the estimated inference p_n(y|x) is the average over the a posteriori ensemble,

  p_n(y|x) = ∫ p(y|x, w) ρ_n(w) dw,   ρ_n(w) = (1/Z_n) ϕ(w) Π_{i=1}^{n} p(y_i|x_i, w),

where Z_n is the constant which ensures ∫ ρ_n(w) dw = 1. The generalization error is defined as the average Kullback distance of the estimated probability density p_n(y|x) from the true one q(y|x),

  K(n) = E_n { ∫ log [ q(y|x) / p_n(y|x) ] q(x, y) dx dy },

where E_n{·} denotes the expectation value over all sets of training sample pairs. In this paper we mainly consider the statistical estimation error and assume that the model can attain the true inference; in other words, there exists a parameter w_0 ∈ W such that p(y|x, w_0) = q(y|x).

Let us define the average and empirical loss functions,

  f(w) = ∫ log [ q(y|x) / p(y|x, w) ] q(y|x) q(x) dx dy,
  f_n(w) = (1/n) Σ_{i=1}^{n} log [ q(y_i|x_i) / p(y_i|x_i, w) ].

Note that f(w) ≥ 0 is the Kullback information. By the assumption, the set of true parameters

  W_0 = {w ∈ W ; f(w) = 0}

is not an empty set. If f(w) is an analytic function, then W_0 is called an analytic set of f(w). If f(w) is a polynomial, then W_0 is called an algebraic variety. Remark that W_0 is not a manifold in general, since no coordinates can be introduced in the neighborhood
of singular points. For example, layered neural networks such as three-layered perceptrons and radial basis functions have many singular points.

Remark. Even if W_0 consists of one point, or if the true distribution is not contained in the parametric models (W_0 is an empty set), the finiteness of the number of training samples often makes W_0 appear non-identifiable in neural network models. For the purpose of studying such cases rigorously, we assume that W_0 is not empty.

From these definitions, it is proven in (Levin, Tishby, & Solla, 1990; Amari, 1993) that the average Kullback distance K(n) is equal to the increase of the Bayesian stochastic complexity, or free energy, F(n),

  K(n) = F(n + 1) − F(n),

where F(n) is defined by

  F(n) = − E_n { log ∫ exp(−n f_n(w)) ϕ(w) dw }.

In this paper, we show the rigorous asymptotic form of F(n) and clarify the algebraic geometrical structure of the stochastic complexity. Theorems 1 and 2 are the main results of this paper. Let C_0^∞ be the set of all compactly supported C^∞-class functions on R^d.

Theorem 1  Assume that f(w) is an analytic function and ϕ(w) is a probability density function on R^d. Then there exists a real constant C_1 such that for any natural number n

  F(n) ≤ λ_1 log n − (m_1 − 1) log log n + C_1,   (1)

where the rational number λ_1 (λ_1 > 0) and the natural number m_1 are defined so that (−λ_1) is the largest pole, and m_1 its multiplicity, of the meromorphic function that is analytically continued from

  J(λ) = ∫_{f(w)<ǫ} f(w)^λ ϕ̃(w) dw   (Re(λ) > 0),

where ǫ > 0 is a sufficiently small constant, and ϕ̃(w) is an arbitrary nonzero C_0^∞-class function that satisfies 0 ≤ ϕ̃(w) ≤ ϕ(w).
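As a one-dimensional sanity check (our illustration, not from the paper): take f(w) = w² on (0, 1) with ϕ̃ ≡ 1 and ǫ = 1. Then J(λ) = ∫₀¹ w^{2λ} dw = 1/(2λ + 1) for Re(λ) > 0, and the closed form 1/(2λ + 1) is the meromorphic continuation, with a single simple pole at λ = −1/2. Hence λ_1 = 1/2 and m_1 = 1, matching 2λ_1 = d = 1 for this regular one-parameter case. The sketch below compares a numerical evaluation of the integral with the continued closed form.

```python
import math

def J_numeric(lmbda, pts=200000):
    """Midpoint-rule approximation of J(lambda) = integral_0^1 (w^2)^lambda dw,
    valid only in the region of convergence Re(lambda) > -1/2."""
    dw = 1.0 / pts
    return sum(((i + 0.5) * dw) ** (2.0 * lmbda) for i in range(pts)) * dw

def J_continued(lmbda):
    """Meromorphic continuation 1/(2*lambda + 1), defined for all lambda != -1/2."""
    return 1.0 / (2.0 * lmbda + 1.0)

# In the region of convergence, the numeric integral and the continuation agree.
for lam in (0.3, 1.0, 2.5):
    print(lam, J_numeric(lam), J_continued(lam))

# The continuation is finite for negative lambda right up to the pole at -1/2,
# and blows up as lambda approaches -1/2 (the largest pole, so lambda_1 = 1/2).
print(J_continued(-0.49))
```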
Proof of Theorem 1 is given in Section 3. It is also proven in Lemma 2 of Section 3 that all poles of the function J(λ) lie on the negative part of the real axis. In Theorem 1, the largest pole means the largest one as a real value. For Theorem 2, we define a condition.

Condition (A)  Let ψ(x, w) be a vector-valued function of (x, w) ∈ R^M × R^d. We define two conditions on ψ(x, w).
(1) ψ(x, w) is an analytic function of w ∈ W = supp ϕ ⊂ R^d which can be extended as a holomorphic function on some complex open set W*, where W ⊂ W* ⊂ C^d and W* is independent of x ∈ supp q ⊂ R^M.
(2) ψ(x, w) is a measurable function of x ∈ R^M that satisfies the condition

  ∫ sup_{w∈W} ‖ψ(x, w)‖² q(x) dx < ∞,   (2)

where ‖·‖ is the norm of the vector ψ(x, w).

Theorem 2  Let σ > 0 be a constant value. Assume that ϕ(w) is a C_0^∞-class probability density function. Let us consider the model

  p(y|x, w) = (1/(2πσ²)^{N/2}) exp( −‖y − ψ(x, w)‖² / (2σ²) ),

where both ψ(x, w) and ‖ψ(x, w)‖² satisfy Condition (A). Then there exists a constant C_2 > 0 such that for any natural number n

  F(n) ≥ λ_1 log n − (m_1 − 1) log log n − C_2,

where the rational number λ_1 (λ_1 > 0) and the natural number m_1 are defined so that (−λ_1) is the largest pole, and m_1 its multiplicity, of the meromorphic function that is analytically continued from

  J(λ) = ∫_{f(w)<ǫ} f(w)^λ ϕ(w) dw   (Re(λ) > 0),

where ǫ > 0 is a sufficiently small constant.

This theorem determines the asymptotic behavior of the stochastic complexity. Proof of Theorem 2 is given in Section 4.
Remark. A real-valued analytic function ψ(x, w) of x ∈ R^M and w ∈ R^d is said to have the associated convergence radii (r_1, r_2, ..., r_d) at (x, ŵ) if the Taylor expansion

  ψ(x, w) = Σ_{j_1, j_2, ..., j_d} a_{j_1, j_2, ..., j_d}(x) (w_1 − ŵ_1)^{j_1} (w_2 − ŵ_2)^{j_2} ··· (w_d − ŵ_d)^{j_d}

converges absolutely in {w ∈ R^d ; |w_j − ŵ_j| < r_j (j = 1, 2, ..., d)} and diverges in {w ∈ R^d ; |w_j − ŵ_j| > r_j (j = 1, 2, ..., d)}. For an N-dimensional vector-valued analytic function, the convergence radii are defined as (min r_1, min r_2, ..., min r_d), where min denotes the minimum over the corresponding N values. The associated convergence radii at (x, ŵ) are denoted by (r_1(x, ŵ), ..., r_d(x, ŵ)). A real analytic function ψ(x, w) can be extended as a holomorphic function on some open set W* ⊂ C^d independent of x if and only if

  min_{1≤j≤d} inf_{x∈supp q} inf_{w∈W} r_j(x, w) > 0

holds.

In Theorems 1 and 2, we introduced the meromorphic function J(λ), whose largest pole and its multiplicity determine the learning efficiency of the model. It is an important fact that J(λ) is invariant under the transform

  ∫ f(w)^λ ϕ(w) dw  →  ∫ f(g(u))^λ ϕ(g(u)) |g′(u)| du,

where g : U → W is an arbitrary analytic function from some parameter space U to the given parameter space W, and g′(u) is its Jacobian. Therefore the constants λ_1 and m_1 are also invariant under the above transform. The analytic function g, which need not have an inverse function g^{−1}, is sometimes called a birational mapping, and algebraic geometry clarifies the geometrical structure of the parameter space which is invariant under birational mappings.

To consider the learning curve, or the generalization error, let us introduce the definition of an asymptotic expansion. Let {s_i(n); i = 1, 2, 3, ...} be a sequence of real-valued functions of a natural number n which satisfies

  lim_{n→∞} s_{i+1}(n) / s_i(n) = 0   (i = 1, 2, ...).
This condition is referred to as s_{i+1}(n) = o(s_i(n)) (i = 1, 2, 3, ...). If a real-valued function s(n) satisfies the condition

  lim_{n→∞} (1/s_k(n)) { s(n) − Σ_{i=1}^{k} a_i s_i(n) } = 0   (k = 1, 2, 3, ..., K),

where {a_i} are real values, then we say that s(n) has the asymptotic expansion

  s(n) ≅ Σ_{i=1}^{K} a_i s_i(n).   (3)

Note that the coefficients {a_i} are determined uniquely. This definition contains the case K = ∞. For a real-valued function s(x) of a real variable x, the asymptotic expansion for x → α (α some value) is defined in the same way.

By Theorems 1 and 2, the function c(n) defined by

  c(n) = F(n) − λ_1 log n + (m_1 − 1) log log n

is a bounded function. If c(n + 1) − c(n) = o(1/(n log n)), then it follows that

  K(n) = λ_1/n − (m_1 − 1)/(n log n) + o(1/(n log n)),

which gives the learning curve of a non-identifiable learning machine.

Corollary 1  Assume the same conditions as Theorem 2. If c(n + 1) − c(n) = o(1/(n log n)), then the learning curve is given by

  K(n) ≅ λ_1/n − (m_1 − 1)/(n log n).

As is well known, regular statistical models have λ_1 = d/2 and m_1 = 1, which can be shown as a special case of Theorem 2 (see Example 1). Non-identifiable models such as neural networks have different values, λ_1 ≤ d/2 and m_1 ≥ 1, in general.

Corollary 2  Assume the same conditions as Theorem 1. If ϕ̃(w) can be taken so that ϕ̃(w) > 0 for arbitrary w ∈ W_0, then λ_1 ≤ d/2, where d is the dimension of the parameter space.

Corollary 2 is proven in Section 5. Even for a non-identifiable learning machine, we can calculate λ_1 and m_1 based on the resolution technique of singularities in algebraic geometry. An algorithm to calculate λ_1 and m_1 for a given learning machine is also shown in Section 5.
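The contrast stated above can be seen numerically. The following sketch (our illustration, not from the paper) assumes a uniform prior on [−1, 1]^d and uses the integral −log ∫ exp(−n f(w)) ϕ(w) dw, which by Lemma 1 below bounds F(n) and carries the same leading asymptotics, as a stand-in for the free energy. It compares the regular loss f(w) = w² (λ_1 = 1/2, m_1 = 1) with the singular loss f(a, b) = a²b² (for which λ_1 = 1/2 with multiplicity m_1 = 2): the (m_1 − 1) log log n term makes the apparent log n slope of the singular model strictly smaller.

```python
import math

def F_regular(n, pts=200000):
    # -log of Z(n) = integral_{-1}^{1} exp(-n * w^2) dw, midpoint rule.
    dw = 2.0 / pts
    z = sum(math.exp(-n * (-1.0 + (i + 0.5) * dw) ** 2) for i in range(pts)) * dw
    return -math.log(z)

def F_singular(n, pts=200000):
    # -log of Z(n) = integral over [-1,1]^2 of exp(-n * a^2 * b^2) da db.
    # The inner integral over b is done analytically:
    #   integral_{-1}^{1} exp(-n a^2 b^2) db = sqrt(pi) * erf(a sqrt(n)) / (a sqrt(n)),
    # leaving a one-dimensional midpoint rule over a in (0, 1), doubled by symmetry.
    da = 1.0 / pts
    z = 0.0
    for i in range(pts):
        t = ((i + 0.5) * da) * math.sqrt(n)
        z += math.sqrt(math.pi) * math.erf(t) / t * da
    return -math.log(2.0 * z)

def slope(F, n1, n2):
    # Apparent coefficient of log n between two sample sizes.
    return (F(n2) - F(n1)) / (math.log(n2) - math.log(n1))

s_reg = slope(F_regular, 100, 10000)    # close to d/2 = 0.5
s_sing = slope(F_singular, 100, 10000)  # dragged below 0.5 by the -log log n term
print(s_reg, s_sing)
```

The smaller apparent slope of the singular model is exactly the Bayesian advantage of non-identifiable machines claimed in the introduction.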
3 Proof of Theorem 1

In this section and the following one, we give the proofs of Theorems 1 and 2. These proofs not only establish the theorems but also clarify the mathematical structure of Bayesian learning in artificial neural network models.

Lemma 1  Assume that f(w) and f_n(w) are continuous functions and that ϕ(w) is a probability density function. Then the inequality

  F(n) ≤ − log ∫ exp(−n f(w)) ϕ(w) dw

holds for an arbitrary natural number n.

[Proof of Lemma 1] From Jensen's inequality,

  log ∫ exp(a(w)) b(w) dw ≥ ∫ a(w) b(w) dw

holds for an arbitrary continuous function a(w) and an arbitrary compactly supported probability density b(w). First, assume that ϕ(w) is a compactly supported function. By applying the above inequality to the special case

  a(w) = −n { f_n(w) − f(w) },   b(w) = (1/Y) exp(−n f(w)) ϕ(w),

where Y = ∫ exp(−n f(w)) ϕ(w) dw, we obtain

  log (1/Y) ∫ exp(−n f_n(w)) ϕ(w) dw ≥ −(1/Y) ∫ n { f_n(w) − f(w) } exp(−n f(w)) ϕ(w) dw.

By using E_n { f_n(w) − f(w) } = 0 and Fubini's theorem, it follows that

  E_n { log ∫ exp(−n f_n(w)) ϕ(w) dw } ≥ log ∫ exp(−n f(w)) ϕ(w) dw

for an arbitrary compactly supported function ϕ(w). Since the set of all compactly supported functions is dense in the set of all probability density functions in the L¹ norm, Lemma 1 is obtained. (Q.E.D.)

For a given ǫ > 0, a set of parameters W_ǫ is defined by

  W_ǫ = {w ∈ W = supp ϕ ; f(w) < ǫ}.

We now introduce the Sato-Bernstein b-function.
Theorem 3 (Sato, Bernstein, Björk, Kashiwara)  Assume that there exists ǫ_0 > 0 such that f(w) is an analytic function in W_{ǫ_0}. Then there exists a triple (ǫ, P, b), where (1) ǫ < ǫ_0 is a positive constant, (2) P = P(λ, w, ∂_w) is a differential operator which is a polynomial in λ, and (3) b(λ) is a polynomial, such that

  P(λ, w, ∂_w) f(w)^{λ+1} = b(λ) f(w)^λ   (∀w ∈ W_ǫ, ∀λ ∈ C).

The zeros of the algebraic equation b(λ) = 0 are real, rational, and negative numbers.

[Explanation of Theorem 3] In Theorem 3, P = P(λ, w, ∂_w) is a finite-order differential operator in w whose coefficients are analytic functions of w. P can be understood as a mapping from C^∞ to C^∞. The adjoint operator P*, defined formally in the same way as for the usual partial differential operators, satisfies the relation

  ∫ φ(w) P(λ, w, ∂_w) ϕ(w) dw = ∫ { P*(λ, w, ∂_w) φ(w) } ϕ(w) dw

for any φ ∈ C^∞, ϕ ∈ C_0^∞. Theorem 3 was proven based on the algebraic property of the ring of partial differential operators; see the references (Bernstein, 1972; Sato & Shintani, 1974; Björk, 1979). The rationality of the zeros of b(λ) = 0 is shown based on the resolution of singularities (Atiyah, 1970; Kashiwara, 1976). The smallest-order polynomial b(λ) that satisfies the above relation is called the Sato-Bernstein polynomial or the b-function. Recently, an algorithm to calculate the b-function has been developed (Oaku, 1997). [Explanation of Theorem 3 End]

Remark. By taking ǫ > 0 small enough, we can assume

  f(w) > 0   (∀w ∈ W_ǫ \ W_0),   (4)

because f(w) is analytic. Hereafter, ǫ > 0 is taken small enough that both Theorem 3 and eq. (4) hold. We can also assume ǫ < 1 without loss of generality. For a given analytic function f(w), let us define a complex function J(λ) of λ ∈ C by

  J(λ) = ∫_{W_ǫ} f(w)^λ ϕ(w) dw.
Lemma 2  Assume that f(w) is an analytic function in W_ǫ and that ϕ(w) is a C_0^∞-class function. Then J(λ) can be analytically continued to a meromorphic function on the entire complex plane; in other words, J(λ) has only poles in |λ| < ∞. Moreover, J(λ) satisfies the following conditions.
(1) The poles of J(λ) are rational, real, and negative numbers.
(2) For an arbitrary a ∈ R, J(a ± i∞) ≡ lim_{b→∞} J(a ± ib) = 0.

[Proof of Lemma 2] J(λ) is an analytic function in the region Re(λ) > 0; this follows from Lebesgue's convergence theorem. Next, let us show J(a ± i∞) = 0 for fixed a > 0. For t ∈ R we define a function

  Ĵ_a(t) = exp(at) ∫_{W_ǫ} δ(t − log f(w)) ϕ(w) dw,

where Ĵ_a(t) is well defined, because δ(g(w)) is well defined if ‖∇g(w)‖ > 0 for any w that satisfies g(w) = 0 (Gel'fand & Shilov, 1964). From the definition, Ĵ_a(t) is an L¹ function, and

  J(a ± ib) = ∫_{−∞}^{log ǫ} Ĵ_a(t) exp(±ibt) dt,

which shows that we can apply the Riemann-Lebesgue lemma, with the result that J(a ± i∞) = 0 for fixed a > 0. Lastly, by using the formal adjoint operator P*,

  J(λ) = (1/b(λ)) ∫_{W_ǫ} P f(w)^{λ+1} ϕ(w) dw = (1/b(λ)) ∫_{W_ǫ} f(w)^{λ+1} P* ϕ(w) dw,

where we used the property of the b-function. Because P* ϕ ∈ C_0^∞, this relation analytically continues J(λ) from Re(λ) > 0 to Re(λ) > −1 except at the zeros of b(λ), and by iterating it, to the entire complex plane; by the same continuation, J(a ± i∞) = 0 even for a < 0. If b(λ) = 0, then such λ is at most a pole, which lies on the negative part of the real axis. (Q.E.D.)

Definition. The poles of the function J(λ) lie on the negative part of the real axis and are contained in the set {ν − m ; m = 0, 1, 2, ..., b(ν) = 0}. They are ordered from bigger to smaller and referred to as (−λ_1), (−λ_2), (−λ_3), ... (each λ_k > 0 is a rational number), and the multiplicity of (−λ_k) is denoted by m_k.
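To make the b-function concrete, take the one-dimensional loss f(w) = w² (our illustration, not an example from this paper). With P = (1/4) ∂²/∂w², one checks P f^{λ+1} = (λ + 1)(λ + 1/2) f^λ, so b(λ) = (λ + 1)(λ + 1/2); its zeros −1 and −1/2 are indeed real, rational, and negative, and the largest pole of the corresponding J(λ) is −1/2, giving λ_1 = 1/2. The sketch below verifies the operator identity numerically with a central second difference.

```python
import math

def second_derivative(g, w, h=1e-4):
    """Central finite-difference approximation of g''(w)."""
    return (g(w + h) - 2.0 * g(w) + g(w - h)) / (h * h)

lam = 0.7   # any lambda in the region of convergence works for this check
w0 = 1.3    # a point with f(w0) = w0^2 > 0

# Left-hand side: P f^{lambda+1}, with P = (1/4) d^2/dw^2 applied to (w^2)^(lambda+1).
lhs = 0.25 * second_derivative(lambda w: (w * w) ** (lam + 1.0), w0)

# Right-hand side: b(lambda) f^lambda, with b(lambda) = (lambda + 1)(lambda + 1/2).
rhs = (lam + 1.0) * (lam + 0.5) * (w0 * w0) ** lam

print(lhs, rhs)
```

The same identity, divided by b(λ), is exactly the relation used in the proof of Lemma 2 to push the analytic continuation of J(λ) one unit to the left at a time.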
We define a real-valued function I(t) by

  I(t) = ∫_{W_ǫ} δ(t − f(w)) ϕ(w) dw   (0 < t < ǫ).   (5)

Here, since f(w) > 0 for w ∈ W_ǫ \ W_0, δ(t − f(w)) is well defined (Gel'fand & Shilov, 1964). For t ≥ ǫ or t ≤ 0 we define I(t) = 0.

Lemma 3  Assume that f(w) is an analytic function in W_ǫ and that ϕ(w) is a C_0^∞-class function. Then I(t) has an asymptotic expansion for t → 0,

  I(t) ≅ Σ_{k=1}^{∞} Σ_{m=0}^{m_k−1} c_{k,m+1} t^{λ_k−1} (−log t)^m,   (6)

where (m! · c_{k,m+1}) is the coefficient of the (m + 1)-th order term in the Laurent expansion of J(λ) at the pole λ = −λ_k.

[Proof of Lemma 3] A special case of this lemma is shown in the theory of distributions (Gel'fand & Shilov, 1964). Let us define

  I_K(t) ≡ I(t) − Σ_{k=1}^{K} Σ_{m=0}^{m_k−1} c_{k,m+1} t^{λ_k−1} (−log t)^m.

For eq. (6), it is sufficient to show that, for an arbitrary fixed K,

  lim_{t→+0} I_K(t) t^{−λ} = 0   (∀λ < λ_{K+1} − 1).   (7)

From the definition of J(λ),

  J(λ) = ∫_0^ǫ I(t) t^λ dt.

A simple calculation shows

  ∫_0^1 t^{λ+λ_k−1} (−log t)^m dt = m! / (λ + λ_k)^{m+1}.

Therefore,

  ∫_0^1 I_K(t) t^λ dt = J_K(λ),

where J_K(λ) is defined by

  J_K(λ) = J(λ) − Σ_{k=1}^{K} Σ_{m=0}^{m_k−1} m! c_{k,m+1} / (λ + λ_k)^{m+1},

which is holomorphic in the region Re(λ) > −λ_{K+1}. By putting t = e^{−x} and using the inverse Laplace transform,

  I_K(e^{−x}) e^{−x} = (1/2πi) ∫_{τ−i∞}^{τ+i∞} J_K(u) e^{ux} du   (8)
holds for any real τ > 0. By Lemma 2, J(a ± i∞) = 0, and thus J_K(a ± i∞) = 0 for an arbitrary real a. Since J_K(λ) is holomorphic in the region Re(λ) > −λ_{K+1}, the integration path in eq. (8) can be moved from τ > 0 to any τ > −λ_{K+1}. Therefore,

  I_K(e^{−x}) e^{−x−τx} = (1/2π) ∫_{−∞}^{+∞} J_K(τ + iu) e^{iux} du   (9)

holds for any real x. Hence, by putting x = 0, J_K(τ + iu) ∈ L¹. The right-hand side of eq. (9) goes to zero as x → ∞ because of the Riemann-Lebesgue lemma. Here, by putting x = −log t, we obtain eq. (7). (Q.E.D.)

[Proof of Theorem 1] By combining the above results, we have

  F(n) ≤ − log ∫_W exp(−n f(w)) ϕ(w) dw
       ≤ − log ∫_{W_ǫ} exp(−n f(w)) ϕ̃(w) dw
       = − log ∫_0^ǫ e^{−nt} Ĩ(t) dt
       = − log { (1/n) ∫_0^{nǫ} e^{−t} Ĩ(t/n) dt }
       = − log { Σ_k Σ_{m=0}^{m_k−1} Σ_{j=0}^{m} c_{k,m+1} (_mC_j) ((log n)^j / n^{λ_k}) ∫_0^{nǫ} e^{−t} t^{λ_k−1} (−log t)^{m−j} dt }
       = λ_1 log n − (m_1 − 1) log log n + O(1),

where Ĩ(t) is defined by eq. (5) with ϕ̃(w) in place of ϕ(w). (Q.E.D.)

4 Proof of Theorem 2

Hereafter, for simplicity of description, we assume that the model is given by

  p(y|x, w) = (1/√(2π)) exp( −(1/2)(y − ψ(x, w))² ).

It is easy to extend the proof to the case of a general standard deviation (σ > 0) and a general output dimension (N > 1). For this model,

  f(w) = (1/2) ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx,
  f_n(w) = (1/2n) Σ_{i=1}^{n} (ψ(x_i, w) − ψ(x_i, w_0))² − (1/n) Σ_{i=1}^{n} η_i (ψ(x_i, w) − ψ(x_i, w_0)),

where {η_i ≡ y_i − ψ(x_i, w_0)} are independent samples from the standard normal distribution, and w_0 is an arbitrary element of W_0.
Lemma 4  Let {(x_i, η_i)}_{i=1}^{n} be a set of independent samples taken from q(x)q_0(η), where q(x) is a probability density and q_0(η) is the standard normal distribution. Assume that a function ξ(x, w) satisfies Condition (A) and that the Taylor expansion of ξ(x, w) at ŵ converges absolutely in the region T = {w ; |w_j − ŵ_j| < r_j}. For a given constant 0 < a < 1, we define the region T_a ≡ {w ; |w_j − ŵ_j| < a r_j}. Then the following hold.
(1) If ∫ ξ(x, w) q(x) dx = 0, there exists a constant c such that for arbitrary n,

  A_n ≡ E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} ξ(x_i, w) |² } < c < ∞.

(2) There exists a constant c such that for arbitrary n,

  B_n ≡ E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} η_i ξ(x_i, w) |² } < c < ∞.

[Proof of Lemma 4] We show (1); statement (2) can be proven in the same way. We write k = (k_1, k_2, ..., k_d) and

  ξ(x, w) = Σ_k a_k(x) (w − ŵ)^k = Σ_{k_1,...,k_d} a_{k_1 ··· k_d}(x) (w_1 − ŵ_1)^{k_1} ··· (w_d − ŵ_d)^{k_d}.

This power series converges absolutely in T, therefore ξ(x, w) can be extended as a holomorphic function in T. By Cauchy's integral formula for several complex variables,

  a_k(x) = (1/(2πi)^d) ∫_{C_1} ··· ∫_{C_d} ξ(x, w) / Π_j (w_j − ŵ_j)^{k_j+1} dw_1 ··· dw_d,   (10)

where C_j is a circle of radius r_j − δ. We define a function M(x) by M(x) = sup_w |ξ(x, w)|, the supremum being taken over the closed polydisc of radii r_j − δ. Then by eq. (2) in Condition (A),

  M ≡ { ∫ M(x)² q(x) dx }^{1/2} < ∞,

and there exists δ > 0 such that

  |a_k(x)| ≤ M(x) / Π_{j=1}^{d} (r_j − δ)^{k_j}.   (11)
By the assumption, ∫ a_k(x) q(x) dx = 0. Thus

  E_n { | (1/√n) Σ_{i=1}^{n} a_k(x_i) |² }^{1/2} = { ∫ |a_k(x)|² q(x) dx }^{1/2} ≤ M / Π_j (r_j − δ)^{k_j}.

By using Lemma 5 in the Appendix, we obtain

  A_n^{1/2} = E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} ξ(x_i, w) |² }^{1/2}
            = E_n { sup_{w∈T_a} | Σ_k (1/√n) Σ_{i=1}^{n} a_k(x_i) (w − ŵ)^k |² }^{1/2}
            ≤ Σ_k E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} a_k(x_i) (w − ŵ)^k |² }^{1/2}
            ≤ M Σ_k Π_{j=1}^{d} ( a r_j / (r_j − δ) )^{k_j} < ∞,

where δ is taken so that a r_j < r_j − δ (j = 1, 2, ..., d). (Q.E.D.)

The function ζ_n(w), which represents the fluctuation of learning, is defined as follows:

  ζ_n(w) = √n (f(w) − f_n(w)) / √f(w).

Note that ζ_n(w) is an analytic function in W \ W_0. At w_0 ∈ W_0, ζ_n(w) may be discontinuous, but the following theorem ensures that it is bounded on average.

Theorem 4  Assume the same conditions as Theorem 2. Then there exists a constant c such that for arbitrary n,

  E_n { sup_{w∈W\W_0} |ζ_n(w)|² } < c.

[Proof of Theorem 4] Outside a neighborhood of W_0, this theorem can be proven by Lemma 4. Hence we can assume that W = W_ǫ. Since W is compact, W is covered by a union of finitely many small open sets. Thus we can assume w is in a neighborhood U of w_0 ∈ W_0 whose closure Ū is contained within the associated convergence radii. We define

  a_n^{(1)}(w) = (1/n) Σ_{i=1}^{n} η_i (ψ(x_i, w) − ψ(x_i, w_0)),
  a_n^{(2)}(w) = (1/2) ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx − (1/2n) Σ_{i=1}^{n} (ψ(x_i, w) − ψ(x_i, w_0))²,
where {η_i} are samples independently taken from the standard normal distribution. By using these definitions,

  ζ_n(w) = √n { a_n^{(1)}(w) + a_n^{(2)}(w) } / √f(w).

We define a^{(j)}(n) (j = 1, 2) by

  a^{(j)}(n) = E_n { sup_{w∈U\W_0} n |a_n^{(j)}(w)|² / f(w) }.

It follows that

  E_n { sup_{w∈U\W_0} |ζ_n(w)|² } ≤ 2 a^{(1)}(n) + 2 a^{(2)}(n).

For the proof of Theorem 4, it is sufficient to show that {a^{(j)}(n); j = 1, 2} are bounded.

First, we show the finiteness of a^{(1)}(n). From Lemma 6 in the Appendix, there exists a finite set of functions {g_j, h_j}_{j=1}^{J}, where g_j(w) is a real-valued analytic function and h_j(x, w) is a function which satisfies Condition (A), such that

  ψ(x, w) − ψ(x, w_0) = Σ_{j=1}^{J} g_j(w) h_j(x, w),   (12)

where the matrix M_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) q(x) dx is positive definite. Let α > 0 be taken smaller than the minimum eigenvalue of the matrices {M_{jk}(w) ; w ∈ Ū}. By the definition,

  f(w) = (1/2) ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx
       = (1/2) Σ_{j,k=1}^{J} M_{jk}(w) g_j(w) g_k(w) ≥ (α/2) Σ_{j=1}^{J} g_j(w)²,

by taking ǫ > 0 small. Therefore, by using the Cauchy-Schwarz inequality,

  a^{(1)}(n) = E_n { sup_{w∈U\W_0} (1/f(w)) | Σ_{j=1}^{J} g_j(w) (1/√n) Σ_{i=1}^{n} η_i h_j(x_i, w) |² }
             ≤ (2/α) Σ_{j=1}^{J} E_n { sup_{w∈U\W_0} | (1/√n) Σ_{i=1}^{n} η_i h_j(x_i, w) |² },
which is bounded by some constant by the preceding Lemma 4.

Secondly, we show the finiteness of a^{(2)}(n). By the assumption that both ψ(x, w) and ψ(x, w)² satisfy Condition (A), the inequality

  ∫ {1 + M(x)²} (ψ(x, w) − ψ(x, w_0))² q(x) dx < ∞

holds, where M(x) = sup_{w∈W} |ψ(x, w) − ψ(x, w_0)|. By using Lemma 6, there exists a finite set of functions {g_j, h_j}_{j=1}^{J}, where g_j(w) is a real-valued analytic function and h_j(x, w) satisfies Condition (A) with {1 + M(x)²} q(x) in place of q(x), such that

  ψ(x, w) − ψ(x, w_0) = Σ_{j=1}^{J} g_j(w) h_j(x, w),

where the two matrices

  L_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) q(x) dx,
  N_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) {1 + M(x)²} q(x) dx

are positive definite. Hence

  ∫ (ψ(x, w) − ψ(x, w_0))⁴ q(x) dx / ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx
    ≤ Σ_{j,k} N_{jk}(w) g_j(w) g_k(w) / Σ_{j,k} L_{jk}(w) g_j(w) g_k(w) ≤ C,

where C > 0 is a constant. Thus

  f(w) ≥ (1/C) ∫ {ψ(x, w) − ψ(x, w_0)}⁴ q(x) dx,

which ensures that

  a^{(2)}(n) ≤ C · E_n { sup_{w∈U\W_0} n |a_n^{(2)}(w)|² / ∫ {ψ(x, w) − ψ(x, w_0)}⁴ q(x) dx }.

The finiteness of the last term can be shown in the same way as for a^{(1)}(n), where {ψ(x, w) − ψ(x, w_0)}² is decomposed by Lemma 6 instead of ψ(x, w) − ψ(x, w_0). (Q.E.D.)

[Proof of Theorem 2] Let us prove Theorem 2. We define

  α_n = sup_{w∈W\W_0} |ζ_n(w)|.
Then, by Theorem 4, E_n{α_n²} < ∞. The free energy, or Bayesian stochastic complexity, satisfies

  F(n) = − E_n { log ∫_W exp(−n f_n(w)) ϕ(w) dw }
       = − E_n { log ∫_W exp(−n f(w) + √(n f(w)) ζ_n(w)) ϕ(w) dw }
       ≥ − E_n { log ∫_W exp(−n f(w) + α_n √(n f(w))) ϕ(w) dw }
       ≥ − log ∫_W exp(−(n/2) f(w)) ϕ(w) dw − (1/2) E_n{α_n²},

where we used the inequality α_n √(n f(w)) ≤ (α_n² + n f(w))/2. Let us define Z_i(n) (i = 1, 2) by

  Z_i(n) = ∫_{W(i)} exp(−(n/2) f(w)) ϕ(w) dw,

where W(1) = W_ǫ and W(2) = W \ W_ǫ. Then, by the same method as in the proof of Theorem 1,

  Z_1(n) · n^{λ_1} / (log n)^{m_1−1} → c_{1,m_1}.

On the other hand,

  Z_2(n) ≤ ∫_{W\W_ǫ} exp(−nǫ/2) ϕ(w) dw ≤ exp(−nǫ/2).

Therefore,

  F(n) ≥ − log { c_{1,m_1} (log n)^{m_1−1} / n^{λ_1} + exp(−nǫ/2) } + const.
       = λ_1 log n − (m_1 − 1) log log n + O(1),

which completes the proof of Theorem 2. (Q.E.D.)

5 Algorithm to calculate the learning efficiency

Theorem 2 shows that the important values λ_1 and m_1 are determined by the meromorphic function J(λ). However, since J(λ) is defined by an integral of several variables, it is not so easy to determine its largest pole and multiplicity. If J(λ) is given by an integral of a single variable, it is easy: for example, if

  J(λ) = ∫_0^ǫ x^{2λ} x^r dx,

then the largest pole of J(λ) is −(r + 1)/2, so λ_1 = (r + 1)/2 and m_1 = 1. The resolution of singularities in algebraic geometry transforms an arbitrary
integral of several variables into an integral whose essential term is a direct product of integrals of single variables. In fact, Atiyah showed that the following Theorem 5 follows directly from Hironaka's theorem (Hironaka, 1964; Atiyah, 1970).

Theorem 5 (Hironaka, Atiyah)  Let f(w) be a real analytic function defined in a neighborhood of 0 ∈ R^d. Then there exist an open set W ∋ 0, a real analytic manifold U, and a proper analytic map g : U → W such that
(1) g : U \ U_0 → W \ W_0 is a biregular map, where W_0 = f^{−1}(0) and U_0 = g^{−1}(W_0),
(2) for each P ∈ U there are local analytic coordinates (u_1, ..., u_d) centered at P such that, locally near P, we have

  f(g(u_1, ..., u_d)) = a(u_1, ..., u_d) u_1^{k_1} u_2^{k_2} ··· u_d^{k_d},   (13)

where a(u) is an invertible analytic function and k_i ≥ 0.

[Explanation of Theorem 5] This theorem is a special version of the well-known Hironaka resolution of singularities in algebraic geometry. Since f(w) is not a polynomial but an analytic function, the singularities of f can be resolved only locally. Hironaka's proof is completely constructive, and the above map g can be constructed by finitely many recursive procedures, which are blowing-ups of non-singular manifolds contained in the singular sets. In the theorem, both U and W are locally compact Hausdorff spaces and g is continuous. In this case, g is a proper map if and only if g^{−1}(K) is compact for every compact set K. [Explanation of Theorem 5 End]

The following is an algorithm to calculate λ_1 and m_1.

An algorithm to calculate the learning efficiency.
(1) Cover the analytic set (the set of true parameters) W_0 = {w ∈ supp ϕ ; f(w) = 0} by a finite union of open neighborhoods W_α.
(2) For each neighborhood W_α, find a resolution map g_α and a manifold U_α by using blowing-ups. Since g_α is a proper map and the closure of W_α is compact, U_α is covered by a finite union of open sets {U_αβ} whose closures are homeomorphic to compact sets in some Euclidean spaces.
(3) For each neighborhood $W_{\alpha\beta} = g_\alpha(U_{\alpha\beta})$, the function $J_{\alpha\beta}(\lambda)$ is calculated by eq.(13):
$$J_{\alpha\beta}(\lambda) = \int_{W_{\alpha\beta}} f(w)^\lambda\,\varphi(w)\,dw = \int_{U_{\alpha\beta}} f(g_\alpha(u))^\lambda\,\varphi(g_\alpha(u))\,|g'_\alpha(u)|\,du = \int_{U_{\alpha\beta}} a(u)^\lambda\, u_1^{\lambda k_1} u_2^{\lambda k_2} \cdots u_d^{\lambda k_d}\,\varphi(g_\alpha(u))\,|g'_\alpha(u)|\,du,$$
where $|g'_\alpha(u)|$ is the Jacobian. Note that $\{u;\ g'_\alpha(u) = 0\} \subset g_\alpha^{-1}(W_0)$. The last integration can be carried out variable by variable, and the largest pole $\lambda_1^{(\alpha\beta)}$ of $J_{\alpha\beta}(\lambda)$ and its multiplicity $m_1^{(\alpha\beta)}$ are obtained, where the Taylor expansion of $|g'_\alpha(u)|$ is used.
(4) The largest pole $\lambda_1$ of $J(\lambda)$ is $\lambda_1 = \max_{\alpha\beta} \lambda_1^{(\alpha\beta)}$, and its multiplicity $m_1$ is $m_1^{(\alpha^*\beta^*)}$, where $(\alpha^*, \beta^*)$ is the argument that maximizes $\lambda_1^{(\alpha\beta)}$. If the pair $(\alpha, \beta)$ that maximizes $\lambda_1^{(\alpha\beta)}$ is not unique, then the pair that maximizes $m_1^{(\alpha\beta)}$ among them is chosen.

In order to calculate $\lambda_1$ and $m_1$, only the neighborhood $W_\alpha$ that gives the largest pole is important. The singularity contained in such a neighborhood $W_\alpha$ is called the deepest singularity in this paper. Note that, by Theorem 5, $1 \le m_1 \le d$, where $d$ is the number of parameters.

Example 1 (Regular models) For regular statistical models, by using appropriate coordinates $(w_1, \ldots, w_d)$ the average loss function $f(w)$ can locally be written as
$$f(w) = \sum_{i=1}^d w_i^2.$$
Then, around the origin, with $W_1 = \{w;\ \|w\| < 1\}$,
$$J(\lambda) = \int_{W_1} f(w)^\lambda\,dw,$$
where we replaced $\varphi(w)$ by $\varphi(w) = \mathrm{const.}$, based on the natural assumption that $\varphi(w) > 0$ on $W$. We define $W_{11} = \{w \in W;\ |w_i| < |w_1|\}$ and $U_{11} = \{(u_1, \ldots, u_d);\ |u_i| < 1\}$. By using the blowing-up, we find a map $g_1 : (u_1, \ldots, u_d) \to (w_1, \ldots, w_d)$,
$$w_1 = u_1, \qquad w_i = u_1 u_i \quad (2 \le i \le d).$$
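This blowing-up can be checked symbolically. The following sketch (in Python with sympy; the choice $d = 3$ is ours, for concreteness) substitutes $g_1$ into $f$ and confirms the normal crossing form of eq.(13) together with the Jacobian:

```python
import sympy as sp

# Regular model with d = 3 parameters: f(w) = w1^2 + w2^2 + w3^2.
u1, u2, u3 = sp.symbols('u1 u2 u3', real=True)

# Blowing-up map g1: w1 = u1, wi = u1*ui for i >= 2.
w = [u1, u1 * u2, u1 * u3]
f = sum(wi**2 for wi in w)

# Normal crossing form of eq.(13): f(g1(u)) = a(u) * u1^2,
# where a(u) = 1 + u2^2 + u3^2 is invertible near u = 0.
assert sp.expand(f - u1**2 * (1 + u2**2 + u3**2)) == 0

# The Jacobian of g1 is u1^(d-1) = u1^2.
jac = sp.Matrix(w).jacobian([u1, u2, u3]).det()
assert sp.simplify(jac - u1**2) == 0
```

Combining f(g1(u))^λ with the Jacobian gives the factor |u1|^(2λ + d − 1), whose integral over |u1| < 1 produces a single simple pole at λ = −d/2; the invertible factor (1 + u2^2 + u3^2)^λ is bounded and contributes no pole.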
Then the function $J_{11}(\lambda)$ is
$$J_{11}(\lambda) = \int_{W_{11}} f(w)^\lambda\,dw = \int u_1^{2\lambda}\Big(1 + \sum_{i=2}^d u_i^2\Big)^\lambda |u_1|^{d-1}\,du_1\,du_2 \cdots du_d.$$
This function has a pole at $\lambda = -d/2$ with multiplicity $m_1 = 1$. We define $W_{1i}$ and $J_{1i}(\lambda)$ in the same way as $W_{11}$ and $J_{11}(\lambda)$; then $W_1$ is the union of the $W_{1i}$ and a measure-zero set. Therefore, the free energy is
$$F(n) = \frac{d}{2}\log n + O(1),$$
and if $K(n)$ satisfies the same condition as Corollary 1, then $K(n) \cong d/(2n)$.

[Proof of Corollary 2] Let $w_0$ be an arbitrary fixed point in $W_0$. We can assume $w_0 = 0$ without loss of generality. Since $f(w)$ is analytic, there exists a constant $a > 0$ such that, for any $w$ in a neighborhood of $0$,
$$f(w) \le a\|w\|^2.$$
From Lemma 2, it is sufficient to analyze the case when $\lambda$ is real and negative. For such $\lambda$, $J(\lambda)$ has a real value if it is finite. Let $\varphi^*$ be a positive $C_0$-class function smaller than $\varphi(w)$. Then, for a sufficiently small constant $\delta > 0$,
$$J(\lambda) \ge \int_{\|w\| < \delta} f(w)^\lambda\,\varphi^*(w)\,dw \ge \int_{\|w\| < \delta} (a\|w\|^2)^\lambda\,\varphi^*(w)\,dw.$$
As is shown in Example 1, the largest pole of the last term is $-d/2$, which shows that the largest pole of $J(\lambda)$ is not smaller than $-d/2$. (Q.E.D.)

Example 2 If the model
$$p(y|x, a, b) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\big(y - a\tanh(bx)\big)^2\Big)$$
is trained using samples from $p(y|x, 0, 0)$, then
$$f(a, b) = a^2 \int \tanh(bx)^2\,q(x)\,dx.$$
In this case, $W_0 = \{ab = 0\}$ and the deepest singularity is the origin. In a neighborhood of the origin, the essential term is $f(a, b) = a^2 b^2$, and the
other terms are smaller than this term. It immediately follows that $\lambda_1 = 1/2$, $m_1 = 2$, so that
$$F(n) = \frac{1}{2}\log n - \log\log n + O(1).$$
Comparing $1/2$ with $d/2 = 2/2$, we see that the free energy is smaller than in the regular model case.

Example 3 Let us consider a neural network
$$p(y|x, a, b, c, d) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\big(y - \psi(x, a, b, c, d)\big)^2\Big), \qquad \psi(x, a, b, c, d) = a\tanh(bx) + c\tanh(dx).$$
Assume that the true regression function is $\psi(x, 0, 0, 0, 0)$. Then
$$W_0 = \{ab = 0 \ \text{and}\ cd = 0\} \cup \{a + c = 0 \ \text{and}\ b = d\} \cup \{a = c \ \text{and}\ b + d = 0\},$$
and the deepest singularity of $f(a, b, c, d)$ is $(0, 0, 0, 0)$. In the neighborhood of the origin, defined as $W_1$,
$$f(a, b, c, d) = (ab + cd)^2 + (ab^3 + cd^3)^2,$$
since the higher-order terms can be bounded by the above two terms (Watanabe, 1998b). Let us define
$$W_{11} = \{(a, b, c, d);\ |b| > |d|,\ |c| > |a|,\ |ad^3| < |ab + cd| < |cd^3|\},$$
$$U_{11} = \{(x, y, z, w);\ |x| < 1,\ |y| < 1,\ |w| < 1,\ w - 1 < z < w + 1\}.$$
By using blowing-ups, we find a map $g_1 : U_{11} \to W_{11}$ defined by
$$a = x, \quad b = y^3 w - yzw, \quad c = xzw, \quad d = y.$$
From the algorithmic point of view, $W_{11}$ and $U_{11}$ are determined systematically in the process of blowing-ups. By using this transform, we obtain
$$f(g_1(x, y, z, w)) = x^2 y^6 w^2\big[1 + \{w^2(y^2 - z)^3 + z\}^2\big], \qquad g'_1(x, y, z, w) = xy^3 w.$$
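The two identities above can be verified mechanically. A sketch with sympy (variable names follow the text):

```python
import sympy as sp

x, y, z, w = sp.symbols('x y z w', real=True)

# The map g1 of Example 3: (x, y, z, w) -> (a, b, c, d).
a, b, c, d = x, y**3 * w - y * z * w, x * z * w, y

# Essential part of the loss in W_1: f = (ab + cd)^2 + (ab^3 + cd^3)^2.
f = (a * b + c * d)**2 + (a * b**3 + c * d**3)**2

# f(g1(x, y, z, w)) = x^2 y^6 w^2 [1 + {w^2 (y^2 - z)^3 + z}^2]
target = x**2 * y**6 * w**2 * (1 + (w**2 * (y**2 - z)**3 + z)**2)
assert sp.expand(f - target) == 0

# Jacobian g1'(x, y, z, w) = x y^3 w (up to sign).
det = sp.Matrix([a, b, c, d]).jacobian([x, y, z, w]).det()
assert sp.simplify(det**2 - (x * y**3 * w)**2) == 0
```

In the integrand f(g1)^λ |x y^3 w|, the bracket [1 + {...}^2]^λ is bounded above and below on U_11, so only the monomial exponents of x, y, and w matter for the poles.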
We naturally assume that $\varphi(w) > 0$ on $\{w = (a, b, c, d);\ |a| < 1,\ |b| < 1,\ |c| < 1,\ |d| < 1\}$. In the calculation of the poles of $J(\lambda)$, we can simply put $\varphi(w) = 1$ on this set. Then
$$J_1(\lambda) = \int_{W_1} f(w)^\lambda\,dw = \int_{-1}^1 dx \int_{-1}^1 dy \int_{-1}^1 dw \int_{w-1}^{w+1} dz\, f(g_1(x, y, z, w))^\lambda\,|xy^3 w|,$$
resulting in $\lambda_1^{(1)} = 2/3$ and $m_1^{(1)} = 1$. From the other regions $W_{12}, W_{13}, \ldots$, we do not obtain a pole larger than $2/3$. Hence
$$F(n) = \frac{2}{3}\log n + O(1),$$
whereas the dimension divided by two is $d/2 = 4/2$.

Lastly, we introduce an elementary inequality for the asymptotic expansion of the free energy. Let us consider $K$ learning machines $(1 \le k \le K)$,
$$p_k(y|x, w_k) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big(-\frac{1}{2\sigma^2}\|y - \psi_k(x, w_k)\|^2\Big),$$
where $N$ is the number of output units. Assume that the asymptotic expansions of the free energies are respectively given by
$$F_k(n) = \lambda^{(k)}\log n - (m^{(k)} - 1)\log\log n,$$
which are determined under the condition that the true distributions are $\{p_k(y|x, w_{k0})\}$ and the a priori distributions are $\varphi_k(w_k)$. Let us study the case where a learning machine made of a sum of the $K$ machines,
$$p(y|x, \{w_k\}) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big(-\frac{1}{2\sigma^2}\Big\|y - \sum_{k=1}^K \psi_k(x, w_k)\Big\|^2\Big),$$
is trained by using samples from the distribution $p(y|x, \{w_{k0}\})$. The asymptotic expansion of the free energy with the a priori distribution $\prod_k \varphi_k(w_k)$ is written as
$$F(n) = \lambda\log n - (m - 1)\log\log n.$$
Then we have the following inequality.

Corollary 3 The constants $\lambda$ and $\{\lambda^{(k)}\}$ satisfy the inequality
$$\lambda \le \sum_{k=1}^K \lambda^{(k)}.$$
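The pole computations in Examples 1-3 follow one pattern: once the integrand near $u = 0$ is a product of powers $|u_i|^{k_i\lambda + h_i}$ times factors bounded away from $0$ and $\infty$ (such as $a(u)^\lambda$ and $\varphi$), each variable with $k_i > 0$ contributes candidate poles, the largest being $\lambda = -\min_i (h_i + 1)/k_i$, with multiplicity equal to the number of indices attaining the minimum. A minimal sketch of steps (3) and (4) of the algorithm in Section 5 (the exponent data below are transcribed from the examples in the text):

```python
from fractions import Fraction

def largest_pole(exponents):
    """Largest pole -lambda_1 and its multiplicity m_1 of
    int prod_i |u_i|**(k_i*lam + h_i) du over |u_i| < 1,
    given exponent pairs (k_i, h_i) with k_i, h_i >= 0."""
    candidates = [Fraction(h + 1, k) for (k, h) in exponents if k > 0]
    lam1 = min(candidates)          # the largest pole is at lam = -lam1
    return lam1, candidates.count(lam1)

# Example 1, d = 4: the blow-up gives |u1|^(2*lam + 3); lambda_1 = d/2 = 2.
assert largest_pole([(2, 3)]) == (2, 1)

# Example 2: f = a^2 b^2 gives |a|^(2*lam) |b|^(2*lam); lambda_1 = 1/2, m_1 = 2.
assert largest_pole([(2, 0), (2, 0)]) == (Fraction(1, 2), 2)

# Example 3, patch W_11: |x|^(2*lam+1) |y|^(6*lam+3) |w|^(2*lam+1), z free.
assert largest_pole([(2, 1), (6, 3), (0, 0), (2, 1)]) == (Fraction(2, 3), 1)
```

Within one patch, the multiplicity is the number of variables sharing the largest pole, since the integral factorizes into single-variable integrals $2/(k_i\lambda + h_i + 1)$; across patches, step (4) takes the maximum $\lambda_1^{(\alpha\beta)}$ and, on ties, the maximum multiplicity.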
This corollary has an important application. Let $\lambda(H_0, H)\log n$ be the essential term of the asymptotic form of the free energy of a three-layered perceptron with $H$ hidden units, trained using samples from the true distribution represented with $H_0$ hidden units. We assume that $H - H_0 = KL$, where $K$ and $L$ are some integers. By using Corollary 3,
$$\lambda(H_0, H) \le \lambda(H_0, H_0) + \lambda(0, H - H_0) \le \lambda(H_0, H_0) + \frac{H - H_0}{L}\,\lambda(0, L).$$
Note that $\lambda(H_0, H_0) = d/2$, where $d$ is the number of parameters in the model with $H_0$ hidden units (Watanabe, 1998b). Even if the resolution of singularities is too complex and too difficult to calculate $\lambda(H_0, H)$ directly, we can obtain such inequalities based on the free energies of simpler cases.

[Proof of Corollary 3] The Kullback distance of the machine $p(y|x, \{w_k\})$ satisfies
$$f(w) = \frac{1}{2\sigma^2}\int \Big\|\sum_{k=1}^K \{\psi_k(x, w_k) - \psi_k(x, w_{k0})\}\Big\|^2 q(x)\,dx \le \frac{K}{2\sigma^2}\sum_{k=1}^K \int \|\psi_k(x, w_k) - \psi_k(x, w_{k0})\|^2\,q(x)\,dx = K\sum_{k=1}^K f_k(w_k),$$
where $f_k(w_k)$ is the Kullback distance of $p_k(y|x, w_k)$ from $p_k(y|x, w_{k0})$. Hence
$$F(n) \le -\sum_{k=1}^K \log \int \exp\big(-Kn\,f_k(w_k)\big)\,\varphi_k(w_k)\,dw_k \le \sum_{k=1}^K \big\{\lambda^{(k)}\log n - (m^{(k)} - 1)\log\log n + \mathrm{const.}\big\},$$
which completes Corollary 3. (Q.E.D.)

Example 4 (Experiments) Let us consider the case when
$$p(y|x, a, b) = \frac{1}{\sqrt{2\pi}\,(0.1)}\exp\Big(-\frac{1}{2(0.1)^2}\big(y - a\sigma(bx)\big)^2\Big)$$
is trained using samples from $p(y|x, a_0, b_0)$, where $\sigma(x) = x + x^2$. In this case, we can calculate the maximum likelihood estimator, because the likelihood
function is a quadratic form in $ab$ and $ab^2$. Figures 1, 3, and 5 show the maximum likelihood estimators (MLEs) of $(a, b)$ estimated from samples drawn from the distributions with true parameters $(a_0, b_0) = (0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$, respectively. Here $q(x)$ is the standard normal distribution, $n = 20$, and the number of trials is 200. Note that MLEs falling outside $[-2, 2] \times [-2, 2]$ are not drawn in the figures. Figures 2, 4, and 6 show parameters drawn from the a posteriori distributions estimated from samples with true parameters $(a_0, b_0) = (0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$. Here we defined the a priori distribution of $(a, b)$ as the uniform distribution on $[-2, 2] \times [-2, 2]$. It should be emphasized that the distribution of the MLEs is very different from the Bayesian a posteriori distribution, and that neither distribution is subject to any Gaussian distribution even if the true parameter is only one point.

In this example, by putting $p = ab$, $q = ab^2$, the model can be understood as a regular statistical model, so that the generalization error of the maximum likelihood method is asymptotically equal to $2/(2n)$ in all three cases. However, for a general activation function $\sigma(x)$, there exists no transform which makes the model regular. The generalization error in such a case is far larger than in the regular model cases. The generalization errors of Bayesian estimation in these cases are asymptotically $1/(2n) - 1/(n\log n)$, $2/(2n)$, and $2/(2n)$, respectively. For the case $n = 20$ ($n$ is the number of training samples), the experimental average generalization errors by the maximum likelihood method were , , and where the true parameters were $(0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$, respectively, whereas those by the Bayesian method were , , and , respectively.
For the case $n = 50$, the former errors were , , and , respectively, and the latter errors were , , and , respectively. These results show that, if the model is almost redundant, Bayesian estimation is more appropriate than the maximum likelihood method.
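The experiment can be reproduced in outline. The following Python sketch is not the paper's experimental code: the grid resolution, the number of Monte Carlo test points, and the random seed are our own choices, and the Bayesian error is computed from the predictive-mean plug-in rather than the full predictive distribution. The MLE exploits the fact, noted above, that the likelihood is quadratic in $(p, q) = (ab, ab^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.1                                   # noise standard deviation
sig = lambda t: t + t**2                  # activation sigma(t) = t + t^2

def gen_errors(a0, b0, n=20, trials=200):
    """Average E_x[(psi_hat - psi_0)^2] / (2 s^2) over trials, for the
    MLE and for the Bayesian predictive mean under a uniform prior."""
    g = np.linspace(-2.0, 2.0, 81)               # prior grid on [-2, 2]^2
    A, B = [m.ravel() for m in np.meshgrid(g, g)]
    P, Q = A * B, A * B**2                       # psi(x, a, b) = P x + Q x^2
    xt = rng.standard_normal(2000)               # Monte Carlo points for E_x
    Phi_t = np.stack([xt, xt**2], 1)
    truth = Phi_t @ np.array([a0 * b0, a0 * b0**2])
    e_mle = e_bayes = 0.0
    for _ in range(trials):
        x = rng.standard_normal(n)
        y = a0 * sig(b0 * x) + s * rng.standard_normal(n)
        # MLE: least squares in (p, q), features (x, x^2)
        Phi = np.stack([x, x**2], 1)
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        e_mle += np.mean((Phi_t @ coef - truth) ** 2)
        # Bayes: posterior weights on the grid, then the predictive mean
        res = y[None, :] - (P[:, None] * x + Q[:, None] * x**2)
        logp = -np.sum(res**2, 1) / (2 * s**2)
        w = np.exp(logp - logp.max())
        w /= w.sum()
        e_bayes += np.mean((Phi_t @ np.array([P @ w, Q @ w]) - truth) ** 2)
    c = 2 * s**2 * trials
    return e_mle / c, e_bayes / c

print(gen_errors(0.0, 0.0))   # redundant true model (a0, b0) = (0, 0)
print(gen_errors(0.2, 0.2))   # identifiable case
```

With these settings, the Bayesian error in the redundant case $(0, 0)$ typically comes out below the MLE error, in line with the asymptotics $1/(2n) - 1/(n\log n)$ versus $2/(2n)$ quoted above.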
6 Discussion

6.1 Fundamental Aspects

In this subsection, we intuitively explain the fundamental structure of the theorems in this paper. Let us define the volume function of the set of parameters $\{w \in W;\ f(w) \le t\}$,
$$V(t) = \int_{f(w) \le t} \varphi(w)\,dw.$$
Then $V'(t) = I(t)$, where $I(t)$ is defined by eq.(5). This paper has shown that the asymptotic stochastic complexity $F(n)$ is given by the Laplace transform of the volume function,
$$F(n) = -\log \int \exp(-nt)\,dV(t).$$
If the model is identifiable and the Fisher information matrix is positive definite, then the asymptotic shape of the set $\{w \in W;\ f(w) \le t\}$ is the inside of an ellipsoid. However, if the model is non-identifiable and has singularities, then it is the union of the inner points of several complex hyperboloids. If $V(t)$ satisfies the relation
$$V(t) \cong t^{\lambda_1}(-\log t)^{m_1 - 1} \quad (t \to 0),$$
then
$$F(n) \cong \lambda_1\log n - (m_1 - 1)\log\log n \quad (n \to \infty), \qquad (14)$$
$$J(\lambda) = \int t^\lambda\,V'(t)\,dt \cong \frac{\mathrm{Const.}}{(\lambda + \lambda_1)^{m_1}}. \qquad (15)$$
These relations, eq.(14) and eq.(15), show that the learning efficiency of Bayesian estimation is determined by the volume of the almost-true parameters. Mathematically speaking, it is difficult to prove $V(t) \cong t^{\lambda_1}(-\log t)^{m_1 - 1}$ directly, because of the singularities in the true parameter set. The relation that is proven first is eq.(15), for which we need algebraic properties of the loss function, the Sato-Bernstein $b$-function, and Hironaka's resolution of singularities.

6.2 Theoretical Aspects

The case when the parameter is not identifiable might be understood as a special one which seldom occurs. However, superpositions of homogeneous functions essentially have the problem of redundancy.
Let us illustrate the mathematical structure of redundancy. A function $f(x)$ in $L^2(\mathbf{R})$ can be decomposed as
$$f(x) = \int\!\!\int g(a, b)\,\varphi\Big(\frac{x - b}{a}\Big)\,da\,db,$$
where $\varphi(x)$ is an analyzing wavelet and $g(a, b)$ is a coefficient function (Chui, 1992). Here the set of functions $\{\varphi((x - b)/a)\}$ is an over-complete basis in $L^2(\mathbf{R})$, so that $g(a, b)$ is not determined uniquely. This decomposition shows that the set of coefficients $\{g_h\}$ in the neural approximation
$$f(x) \cong \sum_{h=1}^H g_h\,\varphi\Big(\frac{x - b_h}{a_h}\Big)$$
becomes increasingly non-identifiable as $H$ tends to infinity. In order to analyze the case $n \to \infty$ and $H \to \infty$, we have to build a theory of redundant function approximation and redundant statistical estimation. It is shown in (Watanabe, 2000b) that singularities in the parameter space make the generalization error smaller even when the true distribution is not contained in the parametric model. On the other hand, the set of coefficients in an orthonormal basis decomposition is given by the orthogonal projection of the true function onto the corresponding function space; hence it is identifiable even if $H$ goes to infinity. This is one of the essential differences between artificial neural networks and orthonormal basis functions. We expect that artificial neural networks are better learning machines than the orthonormal ones if Bayesian estimation is applied.

It is well known that, if a statistical model contains unused or nuisance parameters, then such parameters should not be estimated. If such parameters are estimated and determined, then the statistical estimation error becomes large. They are estimated in the maximum likelihood method, but not in Bayesian estimation. This is one of the reasons why Bayesian estimation is useful for neural networks in almost redundant states.

6.3 Practical Aspects

In regular learning machines, the generalization error of the maximum likelihood method is asymptotically equal to $d/(2n)$ (Akaike, 1974), which coincides
with that of Bayesian estimation (Amari, 1993), where $d$ is the number of parameters and $n$ is the number of training samples. In this paper we have shown that, in non-identifiable learning machines, the generalization error of Bayesian estimation is not larger than $d/(2n)$. As the singularity becomes deeper, the generalization error becomes smaller. On the other hand, it is conjectured that the generalization error of a layered learning machine trained by the maximum likelihood method is far larger than $d/(2n)$. For example, it is proven that, if the distribution of the inputs consists of a finite sum of delta functions, then the generalization error is in proportion to $\log n / n$ (Hagiwara, Kuno, & Usui, 2000). Also, it is proven that, if the activation function of the hidden units is a linear function, then the generalization error is $C/(2n)$, where $C$ is larger than the number of parameters (Fukumizu, 1999).

In a practical application, the true distribution is in general not strictly contained in the finite parametric model (Shibata, 1981). In such a case, the generalization error is the sum of the function approximation error and the statistical estimation error, where the former is called the bias and the latter the variance. In this paper, we have shown that the increase of the variance of an artificial neural network is not larger than the increase of the number of parameters, if Bayesian estimation is applied. Therefore, we can use a larger neural network with a not-so-large variance and a smaller bias. However, if we apply the maximum likelihood method to an artificial neural network, then we should use a smaller model to keep the variance small, resulting in a larger bias. We conjecture that this difference is the main reason why almost redundant learning machines with ensemble learning are more useful in practical applications than selected small models with one-point estimation.
7 Conclusion

A mathematical foundation for non-identifiable learning machines such as neural networks has been established based on algebraic analysis and algebraic geometry. The free energy or the stochastic complexity is asymptotically equal to $\lambda_1\log n - (m_1 - 1)\log\log n + \mathrm{const.}$, where $\lambda_1$ is a rational number and $m_1$ is a natural number. Moreover, we obtained the following properties.
(1) The learning curve is determined by the singularities in the parameter
space.
(2) The learning efficiency can be calculated algorithmically by the resolution of singularities.
(3) Bayesian estimation is more appropriate than the maximum likelihood method for neural networks in redundant states.

Information geometry gives the foundation of statistical learning theory for a regular learning machine (Amari, 1985), which has a positive definite Fisher metric. We expect that algebraic geometry and algebraic analysis will play an important role in the study of complex and hierarchical learning machines, which have degenerate metrics. Analysis of both the maximum likelihood method and the maximum a posteriori method is an important problem for the future.

Acknowledgment

This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research.

Appendix: Elemental Lemmas and Proofs

In this appendix, we show two elemental lemmas and their proofs. These are technical properties, but they are sometimes useful in neural network theory.

Lemma 5 Let $\{X_k(a);\ k = 1, 2, \ldots\}$ be a set of real-valued random variables with parameter $a$, and let $A$ be the set of all parameters. Then
$$E\Big\{\sup_{a \in A}\Big|\sum_{k=1}^\infty X_k(a)\Big|^2\Big\}^{1/2} \le \sum_{k=1}^\infty E\Big\{\sup_{a \in A}|X_k(a)|^2\Big\}^{1/2},$$
where $E$ is the expectation over the random variables.

(Proof) By the definition $Y_k = \sup_{a \in A}|X_k(a)|$,
$$E\Big\{\sup_{a \in A}\Big|\sum_{k=1}^\infty X_k(a)\Big|^2\Big\} \le E\Big\{\sum_{k=1}^\infty \sum_{k'=1}^\infty \sup_{a \in A}|X_k(a)|\,|X_{k'}(a)|\Big\} \le \sum_{k=1}^\infty \sum_{k'=1}^\infty E\{Y_k Y_{k'}\}$$
$$\le \sum_{k=1}^\infty \sum_{k'=1}^\infty E\{Y_k^2\}^{1/2}\,E\{Y_{k'}^2\}^{1/2} = \Big\{\sum_{k=1}^\infty E\{Y_k^2\}^{1/2}\Big\}^2,$$
which completes the lemma. (Q.E.D.)

Lemma 6 Assume the same condition as Theorem 2. Let $U$ be a neighborhood of $w_0 \in W_0$. Then there exist real-valued analytic functions $g_j(w)$ and real-valued functions $h_j(x, w)$ which satisfy condition (A) in $U$ such that
(1) $g_j(w_0) = 0$;
(2) the functions $\{h_j(x, w)\}$ are linearly independent;
(3) for an arbitrary $w \in U$,
$$\psi(x, w) - \psi(x, w_0) = \sum_{j=1}^J g_j(w)\,h_j(x, w).$$

(Proof of Lemma 6) We can assume $w_0 = 0$ and $\psi(x, 0) = 0$ without loss of generality. The $d$-dimensional direct product of the set of natural numbers is referred to as $\mathbf{N}^d$, and its elements are written as $j = (j_1, j_2, \ldots, j_d) \in \mathbf{N}^d$. Let us consider the Taylor expansion
$$\psi(x, w) = \sum_{|j| \ge 1} a_j(x)\,w^j,$$
where $|j| = j_1 + j_2 + \cdots + j_d$ and $w^j = w_1^{j_1} w_2^{j_2} \cdots w_d^{j_d}$. We introduce the space of square integrable functions $L^2(q)$ with the inner product
$$(u, v) = \int_K u(x)\,v(x)\,q(x)\,dx.$$
By using Cauchy's integral formula as in eq.(10), it is immediately proven that $a_j(x)$ is contained in $L^2(q)$. We adopt an order on the set $\mathbf{N}^d$ such that $|j| \le |j'|$ implies $j \preceq j'$. Then, by applying Schmidt's orthogonalization to $\{a_j(x);\ j \in \mathbf{N}^d\}$, we obtain
$$\psi(x, w) = \sum_{j \in \mathbf{N}^d} e_j(x)\,g_j(w),$$
More information3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?
MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable
More informationTwo special equations: Bessel s and Legendre s equations. p Fourier-Bessel and Fourier-Legendre series. p
LECTURE 1 Table of Contents Two special equations: Bessel s and Legendre s equations. p. 259-268. Fourier-Bessel and Fourier-Legendre series. p. 453-460. Boundary value problems in other coordinate system.
More information08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms
(February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops
More informationNovember 18, 2013 ANALYTIC FUNCTIONAL CALCULUS
November 8, 203 ANALYTIC FUNCTIONAL CALCULUS RODICA D. COSTIN Contents. The spectral projection theorem. Functional calculus 2.. The spectral projection theorem for self-adjoint matrices 2.2. The spectral
More informationTHEOREMS, ETC., FOR MATH 515
THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every
More informationProblem Set 5. 2 n k. Then a nk (x) = 1+( 1)k
Problem Set 5 1. (Folland 2.43) For x [, 1), let 1 a n (x)2 n (a n (x) = or 1) be the base-2 expansion of x. (If x is a dyadic rational, choose the expansion such that a n (x) = for large n.) Then the
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationProblem Set 5: Solutions Math 201A: Fall 2016
Problem Set 5: s Math 21A: Fall 216 Problem 1. Define f : [1, ) [1, ) by f(x) = x + 1/x. Show that f(x) f(y) < x y for all x, y [1, ) with x y, but f has no fixed point. Why doesn t this example contradict
More informationReview of complex analysis in one variable
CHAPTER 130 Review of complex analysis in one variable This gives a brief review of some of the basic results in complex analysis. In particular, it outlines the background in single variable complex analysis
More informationChapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.
Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space
More informationError Empirical error. Generalization error. Time (number of iteration)
Submitted to Neural Networks. Dynamics of Batch Learning in Multilayer Networks { Overrealizability and Overtraining { Kenji Fukumizu The Institute of Physical and Chemical Research (RIKEN) E-mail: fuku@brain.riken.go.jp
More informationNotions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy
Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.
More informationFunctional Analysis Review
Functional Analysis Review Lorenzo Rosasco slides courtesy of Andre Wibisono 9.520: Statistical Learning Theory and Applications September 9, 2013 1 2 3 4 Vector Space A vector space is a set V with binary
More informationA LITTLE REAL ANALYSIS AND TOPOLOGY
A LITTLE REAL ANALYSIS AND TOPOLOGY 1. NOTATION Before we begin some notational definitions are useful. (1) Z = {, 3, 2, 1, 0, 1, 2, 3, }is the set of integers. (2) Q = { a b : aεz, bεz {0}} is the set
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationf-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models
IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,
More informationAnalysis Comprehensive Exam Questions Fall 2008
Analysis Comprehensive xam Questions Fall 28. (a) Let R be measurable with finite Lebesgue measure. Suppose that {f n } n N is a bounded sequence in L 2 () and there exists a function f such that f n (x)
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationCALCULUS ON MANIFOLDS
CALCULUS ON MANIFOLDS 1. Manifolds Morally, manifolds are topological spaces which locally look like open balls of the Euclidean space R n. One can construct them by piecing together such balls ( cells
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More informationFourier Series. 1. Review of Linear Algebra
Fourier Series In this section we give a short introduction to Fourier Analysis. If you are interested in Fourier analysis and would like to know more detail, I highly recommend the following book: Fourier
More informationAnalysis Finite and Infinite Sets The Real Numbers The Cantor Set
Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered
More informationChapter 8. P-adic numbers. 8.1 Absolute values
Chapter 8 P-adic numbers Literature: N. Koblitz, p-adic Numbers, p-adic Analysis, and Zeta-Functions, 2nd edition, Graduate Texts in Mathematics 58, Springer Verlag 1984, corrected 2nd printing 1996, Chap.
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationChapter 2 Metric Spaces
Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics
More informationChapter 8 Integral Operators
Chapter 8 Integral Operators In our development of metrics, norms, inner products, and operator theory in Chapters 1 7 we only tangentially considered topics that involved the use of Lebesgue measure,
More informationQualification Exam: Mathematical Methods
Qualification Exam: Mathematical Methods Name:, QEID#41534189: August, 218 Qualification Exam QEID#41534189 2 1 Mathematical Methods I Problem 1. ID:MM-1-2 Solve the differential equation dy + y = sin
More informationMATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5
MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5.. The Arzela-Ascoli Theorem.. The Riemann mapping theorem Let X be a metric space, and let F be a family of continuous complex-valued functions on X. We have
More informationPUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include
PUTNAM TRAINING POLYNOMIALS (Last updated: December 11, 2017) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include
More informationTEST CODE: PMB SYLLABUS
TEST CODE: PMB SYLLABUS Convergence and divergence of sequence and series; Cauchy sequence and completeness; Bolzano-Weierstrass theorem; continuity, uniform continuity, differentiability; directional
More informationSeparation of Variables in Linear PDE: One-Dimensional Problems
Separation of Variables in Linear PDE: One-Dimensional Problems Now we apply the theory of Hilbert spaces to linear differential equations with partial derivatives (PDE). We start with a particular example,
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationfor all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true
3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO
More informationIntroduction to Machine Learning
Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1
More informationCourse 214 Basic Properties of Holomorphic Functions Second Semester 2008
Course 214 Basic Properties of Holomorphic Functions Second Semester 2008 David R. Wilkins Copyright c David R. Wilkins 1989 2008 Contents 7 Basic Properties of Holomorphic Functions 72 7.1 Taylor s Theorem
More informationOn rational approximation of algebraic functions. Julius Borcea. Rikard Bøgvad & Boris Shapiro
On rational approximation of algebraic functions http://arxiv.org/abs/math.ca/0409353 Julius Borcea joint work with Rikard Bøgvad & Boris Shapiro 1. Padé approximation: short overview 2. A scheme of rational
More informationChapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem
Chapter 4 Inverse Function Theorem d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d dd d d d d This chapter
More informationNATIONAL BOARD FOR HIGHER MATHEMATICS. Research Scholarships Screening Test. Saturday, January 20, Time Allowed: 150 Minutes Maximum Marks: 40
NATIONAL BOARD FOR HIGHER MATHEMATICS Research Scholarships Screening Test Saturday, January 2, 218 Time Allowed: 15 Minutes Maximum Marks: 4 Please read, carefully, the instructions that follow. INSTRUCTIONS
More informationSome Background Material
Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important
More information