Algebraic Geometrical Methods for Hierarchical Learning Machines


Algebraic Geometrical Methods for Hierarchical Learning Machines

Sumio Watanabe

(a) Title : Algebraic geometrical methods for hierarchical learning machines.
(b) Author : Sumio Watanabe
(c) Affiliation : Precision and Intelligence Laboratory, Tokyo Institute of Technology
(d) Acknowledgment : This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research.
(e) Address : Dr. Sumio Watanabe, Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Japan. E-mail: swatanab@pi.titech.ac.jp  Phone :  Fax :
(f) Running title : Algebraic geometry of learning machines.

Algebraic Geometrical Methods for Hierarchical Learning Machines

Sumio Watanabe
Precision and Intelligence Laboratory
Tokyo Institute of Technology

January 18, 2001

Abstract

Hierarchical learning machines such as layered perceptrons, radial basis functions, and gaussian mixtures are non-identifiable learning machines, whose Fisher information matrices are not positive definite. Consequently, conventional statistical asymptotic theory cannot be applied to neural network learning: for example, the Bayesian a posteriori probability distribution does not converge to the gaussian distribution, and the generalization error is not proportional to the number of parameters. The purpose of this paper is to overcome this problem and to clarify the relation between the learning curve of a hierarchical learning machine and the algebraic geometrical structure of its parameter space. We establish an algorithm to calculate the Bayesian stochastic complexity based on the blowing-up technology of algebraic geometry, and prove that the Bayesian generalization error of a hierarchical learning machine is smaller than that of a regular statistical model, even if the true distribution is not contained in the parametric model.

Keywords: algebraic geometry, resolution of singularities, generalization error, stochastic complexity, asymptotic expansion, non-identifiable model.

Mathematical Symbols

x : M dimensional input
y : N dimensional output
w : d dimensional parameter
p(y|x, w) : conditional probability of a learning machine
q(y|x)q(x) : true simultaneous distribution of input and output
{(x_i, y_i)} : a set of training samples
H_n(w) : empirical Kullback distance
H(w) : Kullback distance
ϕ(w) : a priori probability distribution on the parameter space
ρ_n(w) : a posteriori probability distribution on the parameter space
p_n(y|x) : estimated conditional distribution by the Bayesian method
G(n) : generalization error
F(n) : stochastic complexity
F̄(n) : upper bound of the stochastic complexity
v(t) : a state density function
J(z) : the Mellin transform of v(t)
ǫ : a sufficiently small positive constant
W_ǫ : the set of parameters whose Kullback distance is smaller than ǫ
U : a manifold found by blowing-up
g(u) : a resolution map
s : a positive constant
K : the number of hidden units
f_K(x, w) : the function of a three-layer perceptron with K hidden units
a_k : parameters from hidden units to output units
b_k : parameters from input units to hidden units
c_k : bias of hidden units
σ(x) : activation function of hidden units, σ(x) = tanh(x)

1 Introduction

Learning in artificial neural networks can be understood as statistical estimation of an unknown probability distribution based on empirical samples (White, 1989; Watanabe & Fukumizu, 1995). Let p(y|x, w) be a conditional probability density function which represents the probabilistic inference of an artificial neural network, where x is an input and y is an output. The parameter w, which consists of many weights and biases, is optimized so that the inference p(y|x, w) approximates the true conditional probability density from which training samples are taken.

Let us reconsider a basic property of homogeneous and hierarchical learning machines. If the mapping from a parameter w to the conditional probability density p(y|x, w) is one-to-one, then the model is called identifiable; otherwise, it is called non-identifiable. In other words, a model is identifiable if and only if its parameter is uniquely determined from its behavior. The standard asymptotic theory in mathematical statistics requires that a given model be identifiable. For example, identifiability is a necessary condition to ensure that both the distribution of the maximum likelihood estimator and the Bayesian a posteriori probability density function converge to the normal distribution as the number of training samples tends to infinity (Cramer, 1949). When we approximate the likelihood function by a quadratic form of the parameter and select the optimal model using information criteria such as AIC, BIC, and MDL, we implicitly assume that the model is identifiable.

However, many kinds of artificial neural networks such as layered perceptrons, radial basis functions, Boltzmann machines, and gaussian mixtures are non-identifiable; hence either their statistical properties are not yet clarified or conventional statistical design methods cannot be applied. In fact, a failure of likelihood asymptotics for normal mixtures was shown from the viewpoint of hypothesis testing in statistics (Hartigan, 1985). In research on artificial neural networks, it was pointed out that AIC does not correspond to the generalization error of the maximum likelihood method (Hagiwara, 1993), since the Fisher information matrix is degenerate when the parameter represents a smaller model (Fukumizu, 1996). The asymptotic distribution of the maximum likelihood estimator of a non-identifiable model was analyzed based on the theorem that the empirical likelihood function converges to a gaussian process if it satisfies Donsker's condition (Dacunha-Castelle & Gassiat, 1997). It was proven that the generalization error of Bayesian estimation is far

smaller than the number of parameters divided by the number of training samples (Watanabe, 1997; Watanabe, 1998). When the parameter space is conic and symmetric, the generalization error of the maximum likelihood method is different from that of a regular statistical model (Amari & Ozeki, 2000). If the log likelihood function is analytic in the parameter and the set of parameters is compact, then the generalization error of the maximum likelihood method is bounded by a constant divided by the number of training samples (Watanabe, 2001b).

Let us illustrate the problem caused by non-identifiability of layered learning machines. If p(y|x, w) is a three-layer perceptron with K hidden units and w_0 is a parameter such that p(y|x, w_0) is equal to a machine with K_0 hidden units (K_0 < K), then the set of true parameters {w; p(y|x, w_0) = p(y|x, w)} consists of several sub-manifolds of the parameter space. Moreover, the Fisher information matrix,

I_ij(w) = ∫ (∂/∂w_i) log p(y|x, w) (∂/∂w_j) log p(y|x, w) p(y|x, w) q(x) dx dy,

where q(x) is the probability density function on the input space, is positive semi-definite but not positive definite, and its rank, rank I(w), depends on the parameter w. This fact indicates that artificial neural networks have many singular points in the parameter space (Figure 1). A typical example is given in Example 2 in section 3. For the same reason, almost all homogeneous and hierarchical learning machines, such as Boltzmann machines, gaussian mixtures, and competitive neural networks, have singularities in their parameter spaces, with the result that there has been no mathematical foundation for analyzing their learning.

In the previous papers (Watanabe, 1999b; Watanabe, 2000; Watanabe, 2001a), in order to overcome this problem, we proved the basic mathematical relation between the algebraic geometrical structure of singularities in the parameter space and the asymptotic behavior of the learning curve, and constructed a general formula to calculate the asymptotic form of the Bayesian generalization error using resolution of singularities, under the assumption that the true distribution is contained in the parametric model.

In this paper, we consider a three-layer perceptron in the case when the true probability density is not contained in the parametric model, and clarify how singularities in the parameter space affect learning in Bayesian estimation. By employing

an algebraic geometrical method, we show the following facts. (1) The learning curve is strongly affected by singularities, since the statistical estimation error depends on the estimated parameter. (2) The learning efficiency can be evaluated by using the blowing-up technology of algebraic geometry. (3) The generalization error is made smaller by singularities if Bayesian estimation is applied. These results clarify the reason why Bayesian estimation is useful in practical applications of neural networks, and demonstrate the possibility that algebraic geometry plays an important role in the learning theory of hierarchical learning machines, just as differential geometry did in that of regular statistical models (Amari, 1985).

This paper consists of 7 sections. In section 2, the general framework of Bayesian estimation is formulated. In section 3, we analyze a parametric case when the true probability density function is contained in the learning model, and derive the asymptotic expansion of the stochastic complexity using resolution of singularities. In section 4, we study a non-parametric case when the true probability density is not contained, and clarify the effect of singularities in the parameter space. In section 5, the problem of the asymptotic expansion of the generalization error is considered. Finally, sections 6 and 7 are devoted to discussion and conclusion.

2 Bayesian Framework

In this section, we formulate the standard framework of Bayesian estimation and the Bayesian stochastic complexity (Schwarz, 1974; Akaike, 1980; Levin, Tishby, & Solla, 1990; Mackay, 1992; Amari, Fujita, & Shinomoto, 1992; Amari & Murata, 1993).

Let p(y|x, w) be a probability density function of a learning machine, where the input x, the output y, and the parameter w are M, N, and d dimensional vectors, respectively. Let q(y|x)q(x) be the true probability density function on the input and output space, from which training samples {(x_i, y_i); i = 1, 2, ..., n} are independently taken. In this paper, we mainly consider the Bayesian framework, in which the estimated probability density ρ_n(w) on the parameter space is defined by

ρ_n(w) = (1/Z_n) exp(−n H_n(w)) ϕ(w),

H_n(w) = (1/n) Σ_{i=1}^{n} log [ q(y_i|x_i) / p(y_i|x_i, w) ],

where Z_n is the normalizing constant, ϕ(w) is an arbitrary fixed probability density function on the parameter space called the a priori distribution, and H_n(w) is the empirical Kullback distance. Note that the a posteriori distribution ρ_n(w) does not depend on {q(y_i|x_i); i = 1, 2, ..., n}, because q(y_i|x_i) is a constant function of w. Hence it can be written in the other form,

ρ_n(w) = (1/Z_n) ϕ(w) Π_{i=1}^{n} p(y_i|x_i, w).

The inference p_n(y|x) of the trained machine for a new input x is defined by the average conditional probability density function,

p_n(y|x) = ∫ p(y|x, w) ρ_n(w) dw.

The generalization error G(n) is defined by the Kullback distance of p_n(y|x) from q(y|x),

G(n) = E_n { ∫ q(y|x) log [ q(y|x) / p_n(y|x) ] q(x) dx dy },   (1)

where E_n{ } represents the expectation value over all sets of training samples. One of the most important purposes of learning theory is to clarify the behavior of the generalization error when the number of training samples is sufficiently large. It is well known (Levin, Tishby, & Solla, 1990; Amari, 1993; Amari & Murata, 1993) that the generalization error G(n) is equal to the increase of the stochastic complexity F(n),

G(n) = F(n + 1) − F(n)   (n = 1, 2, ...),   (2)

for an arbitrary natural number n, where F(n) is defined by

F(n) = −E_n { log ∫ exp(−n H_n(w)) ϕ(w) dw }.   (3)

The stochastic complexity F(n) and its generalized concepts, which are sometimes called the free energy, the Bayes factor, or the logarithm of the evidence, can be seen in statistics, information theory, learning theory, and mathematical physics (Schwarz, 1974; Akaike, 1980; Rissanen, 1986; Mackay, 1992; Opper & Haussler, 1995; Meir & Merhav, 1995; Haussler & Opper, 1997; Yamanishi, 1998).
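The quantities in eqs.(1)-(3) can be estimated numerically for a small machine by sampling the parameter from the a priori distribution. The following Python sketch is not part of the original paper; the network size, the standard normal prior, and the sample sizes are illustrative assumptions, and the prior-sampling estimate of F(n) is crude, meant only to make the definitions concrete. The true regression function is taken to be y = 0, as in section 3.2 below.

```python
# Crude prior-sampling sketch of eq.(3) (illustrative assumptions, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
M, K, s = 2, 3, 1.0                      # input dimension, hidden units, known noise level

def log_lik_ratio(x, y, a, b, c):
    """sum_i log[ p(y_i|x_i,w) / q(y_i|x_i) ] = -n H_n(w) for a single output unit."""
    f = (a[:, None] * np.tanh(b @ x.T + c[:, None])).sum(axis=0)   # f_K(x, w)
    return (y**2 - (y - f)**2).sum() / (2 * s**2)

def stochastic_complexity(n, n_data=50, n_prior=5000):
    """Estimate F(n) = -E_n log int exp(-n H_n(w)) phi(w) dw by sampling w from the prior."""
    vals = []
    for _ in range(n_data):
        x = rng.standard_normal((n, M))
        y = s * rng.standard_normal(n)                 # true regression function is y = 0
        lr = np.array([log_lik_ratio(x, y,
                                     rng.standard_normal(K),
                                     rng.standard_normal((K, M)),
                                     rng.standard_normal(K))
                       for _ in range(n_prior)])
        vals.append(-(np.logaddexp.reduce(lr) - np.log(n_prior)))   # -log of the mean of exp(lr)
    return float(np.mean(vals))

for n in (10, 50, 200):
    print(n, stochastic_complexity(n))   # grows roughly like lambda_1 log n (cf. Theorem 1)
```

The predictive density p_n(y|x) of eq.(1) can be approximated from the same prior sample by weighting each draw with exp(lr); the only point of the sketch is that F(n), G(n), and p_n(y|x) are all averages over the a posteriori distribution ρ_n(w).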

For example, both Bayesian model selection and hyperparameter optimization are often carried out by minimizing the stochastic complexity before averaging; these procedures are called BIC and ABIC, and are important in practical applications.

The stochastic complexity satisfies two basic inequalities. Firstly, we define H(w) and F̄(n) respectively by

H(w) = ∫ q(y|x) log [ q(y|x) / p(y|x, w) ] q(x) dx dy,

F̄(n) = −log ∫ exp(−n H(w)) ϕ(w) dw.

Note that H(w) is called the Kullback information. Then, by applying Jensen's inequality,

F(n) ≤ F̄(n)   (4)

holds for an arbitrary natural number n (Opper & Haussler, 1995; Watanabe, 2001a). Secondly, we use the notations F(ϕ, n) = F(n) and F̄(ϕ, n) = F̄(n), which explicitly show the a priori probability density ϕ(w). Then F(ϕ, n) and F̄(ϕ, n) can be understood as generalized stochastic complexities for the case when ϕ(w) is a non-negative function. If ψ(w) and ϕ(w) satisfy

ψ(w) ≥ ϕ(w) ≥ 0   (∀w),

then it immediately follows that

F(ψ, n) ≤ F(ϕ, n),   (5)

F̄(ψ, n) ≤ F̄(ϕ, n).   (6)

Therefore, restricting the integration region in the parameter space makes the stochastic complexity not smaller. For example, if we define, with sufficiently small ǫ > 0,

F_ǫ(n) = −log ∫_{H(w)<ǫ} exp(−n H(w)) ϕ(w) dw,   (7)

then

F(n) ≤ F̄(n) ≤ F_ǫ(n).   (8)

These two inequalities, eq.(4) and eq.(8), give upper bounds of the stochastic complexity. On the other hand, if the support of ϕ(w) is compact, then a lower bound is proven: there exists a constant α such that

F̄(n/2) − α ≤ F(n).   (9)

Moreover, if the learning machine contains the true distribution, then

F_ǫ(n/2) − α ≤ F(n)   (10)

holds (Watanabe, 1999b; Watanabe, 2001a).

In this paper, based on algebraic geometrical methods, we rigorously prove upper bounds of F(n) of the form

F(n) ≤ F̄(n) ≤ αn + β log n + o(log n),   (11)

where α, β ≥ 0 are constants and o(log n) is a function of n which satisfies o(log n)/log n → 0 (n → ∞). Mathematically speaking, although the generalization error G(n) is equal to F(n + 1) − F(n) for any natural number n, we cannot derive the asymptotic expansion of G(n) from this alone. However, in section 5, we show that, if G(n) has an asymptotic expansion, then it should satisfy the inequality

G(n) ≤ α + β/n + o(1/n)   (12)

for sufficiently large n, by eq.(11). The main results of this paper are the upper bounds of the stochastic complexity; however, we also discuss the behavior of the generalization error based on eq.(12).

3 A Parametric Case

In this section, we consider a parametric case when the true probability distribution q(y|x)q(x) is contained in the learning machine p(y|x, w)q(x), and show the relation between the algebraic geometrical structure of the machine and the asymptotic form of the stochastic complexity.

3.1 Algebraic Geometry of Neural Networks

In this subsection, we briefly summarize the essential results of the previous papers. For the mathematical proofs of this subsection, see (Watanabe, 1999b; Watanabe, 2001a). Strictly speaking, we need the assumptions that log p(y|x, w) is an analytic function of w, and that it can be analytically continued to a holomorphic function of w whose associated convergence radius is positive uniformly for arbitrary (x, y) that satisfies q(y|x)q(x) > 0 (Watanabe, 2000; Watanabe, 2001a). In this paper, we apply the result of the previous paper to the three-layer perceptron.

If a three-layer perceptron is redundant for approximating the true distribution, then the set of true parameters {w; H(w) = 0} is a union of several sub-manifolds of the parameter space. In general, the set of all zero points of an analytic function is called an analytic set. If the analytic function H(w) is a polynomial, then the set is called an algebraic variety. It is well known that an analytic set and an algebraic variety have complicated singularities in general.

We introduce a state density function v(t),

v(t) = ∫_{H(w)<ǫ} δ(t − H(w)) ϕ(w) dw,

where δ(t) is Dirac's delta function and ǫ > 0 is a sufficiently small constant. By definition, if t < 0 or t > ǫ, then v(t) = 0. By using v(t), F_ǫ(n) is rewritten as

F_ǫ(n) = −log ∫_{H(w)<ǫ} exp(−n H(w)) ϕ(w) dw
       = −log ∫_0^ǫ exp(−nt) v(t) dt
       = −log ∫_0^{nǫ} exp(−t) v(t/n) dt/n.   (13)

Hence, if v(t) has an asymptotic expansion for t → 0, then F_ǫ(n) has an asymptotic expansion for n → ∞. In order to examine v(t), we introduce a kind of zeta function J(z) (Sato & Shintani, 1974) of the Kullback information H(w) and the a priori probability density ϕ(w), which is a function of one complex variable z,

J(z) = ∫_{H(w)<ǫ} H(w)^z ϕ(w) dw   (14)
     = ∫_0^ǫ t^z v(t) dt.   (15)

Then J(z) is an analytic function of z in the region Re(z) > 0. It is well known in the theory of distributions and hyperfunctions that, if H(w) is an analytic function of w, then J(z) can be analytically continued to a meromorphic function on the entire complex plane and its poles are on the negative part of the real axis (Atiyah, 1970; Bernstein, 1972; Sato & Shintani, 1974; Björk, 1979). Moreover, the poles of J(z) are rational numbers (Kashiwara, 1976). Let z = −λ_1 (λ_1 > 0) be the largest pole of J(z) and m_1 (m_1 ≥ 1) its order. Note that eq.(15) shows that J(z) (z ∈ C) is the Mellin transform of v(t). Using the inverse Mellin transform, we can show that v(t) satisfies

v(t) ≅ c_0 t^{λ_1 − 1} (−log t)^{m_1 − 1}   (t → 0),

where c_0 > 0 is a positive constant. By eq.(13), F_ǫ(n) has an asymptotic expansion,

F_ǫ(n) = λ_1 log n − (m_1 − 1) log log n + O(1),

where O(1) is a bounded function of n. Hence, by eq.(8),

F(n) ≤ λ_1 log n − (m_1 − 1) log log n + O(1).

Moreover, if the support of ϕ(w) is a compact set, by eq.(9), we obtain an asymptotic expansion of F(n),

F(n) = λ_1 log n − (m_1 − 1) log log n + O(1).

We have the first theorem.

Theorem 1 (Watanabe, 1999b; Watanabe, 2001a) Assume that the support of ϕ(w) is a compact set. The stochastic complexity F(n) has an asymptotic expansion,

F(n) = λ_1 log n − (m_1 − 1) log log n + O(1),

where (−λ_1) and m_1 are respectively the largest pole and its order of the function that is analytically continued from

J(z) = ∫_{H(w)<ǫ} H(w)^z ϕ(w) dw,

where H(w) is the Kullback information and ϕ(w) is the a priori probability density function.

Remark that, if the support of ϕ(w) is not compact, then Theorem 1 gives an upper bound of F(n).

The important constants λ_1 and m_1 can be calculated by an algebraic geometrical method. We define the set of parameters W_ǫ by

W_ǫ = {w ∈ R^d ; H(w) < ǫ, ϕ(w) > 0}.

It is proven by Hironaka's resolution theorem (Hironaka, 1964; Atiyah, 1970) that there exist both a manifold U and a resolution map g : U → W_ǫ which satisfy

H(g(u)) = a(u) Π_{j=1}^{d} u_j^{k_j}

in an arbitrary neighborhood of an arbitrary u ∈ U that satisfies H(g(u)) = 0, where a(u) > 0 is a strictly positive function and {k_j} are non-negative even integers (Figure 2). Let

W_ǫ = ∪_α W_α

be a decomposition of W_ǫ into a finite union of suitable neighborhoods W_α, where

∫_{W_α ∩ W_α'} ϕ(w) dw = 0   (α ≠ α').

By applying the resolution theorem to the function J(z),

J(z) = ∫_{H(w)<ǫ} H(w)^z ϕ(w) dw
     = Σ_α ∫_{W_α} H(w)^z ϕ(w) dw
     = Σ_α ∫_{U_α} H(g(u))^z ϕ(g(u)) |g'(u)| du,

where U_α = g^{−1}(W_α). Since g is given by recursive blowing-ups, the Jacobian |g'(u)| is a direct product of the local variables u_1, u_2, ..., u_d,

|g'(u)| = c(u) Π_{j=1}^{d} |u_j|^{h_j},

where c(u) is a positive analytic function and {h_j} are non-negative integers. In a neighborhood U_α, a(u) and ϕ(g(u)) can be treated as constant functions in the calculation of the poles of J(z), because we can take each U_α small enough. Hence we can set a(u) = 1 and ϕ(g(u)) = 1 without loss of generality. Then,

J(z) = Σ_α Π_{j=1}^{d} ∫_{U_α} |u_j|^{k_j^(α) z + h_j^(α)} du_j,

where both k_j^(α) and h_j^(α) depend on the neighborhood U_α. We find that J(z) has poles {−(h_j^(α) + 1)/k_j^(α)}, which are rational numbers on the negative part of the real axis. Since a resolution map g(u) can be found by finite recursive procedures of blowing-ups, λ_1 and m_1 can be found algorithmically. It is also proven that λ_1 ≤ d/2 if {w; ϕ(w) > 0, H(w) = 0} ≠ ∅, and that m_1 ≤ d.

Theorem 2 (Watanabe, 1999b; Watanabe, 2001a) The largest pole (−λ_1) and its multiplicity m_1 of the function J(z) can be algorithmically calculated by Hironaka's

resolution theorem. Moreover, λ_1 is a rational number and m_1 is a natural number, and if {w; ϕ(w) > 0, H(w) = 0} ≠ ∅, then

0 < λ_1 ≤ d/2,   1 ≤ m_1 ≤ d,

where d is the dimension of the parameter.

Note that, if the learning machine is a regular statistical model, then always λ_1 = d/2 and m_1 = 1. Also note that, if Jeffreys' prior is employed in neural network learning, which is equal to zero at singularities, the assumption {w; ϕ(w) > 0, H(w) = 0} ≠ ∅ is not satisfied, and then both λ_1 = d/2 and m_1 = 1 hold even if the Fisher metric is degenerate (Watanabe, 2001c).

Example 1 (Regular model) Let us consider a regular statistical model

p(y|x, a, b) = (1/√(2π)) exp(−(1/2)(y − ax − b)^2),

with the set of parameters W = {(a, b); |a| ≤ 1, |b| ≤ 1}. Assume that the true distribution is

q(y|x)q(x) = (1/2π) exp(−(1/2)(x^2 + y^2)),   (16)

and the a priori distribution is the uniform distribution on W. Then,

H(a, b) = (1/2)(a^2 + b^2),

J(z) = ∫_W H(a, b)^z da db.

For a subset S ⊂ W, we define

J_S(z) = ∫_{S ∩ W} H(a, b)^z da db.

Then J(z) = J_{W_1}(z) + J_{W_2}(z), where

W_1 = {(a, b) ∈ W; |a| ≤ |b|},   W_2 = W \ W_1.

We introduce a mapping g : (u, v) → (a, b) by

a = uv,   b = u.   (17)
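The change of variables of eq.(17) can be checked mechanically. The following sympy sketch is not part of the original text; it only verifies that H(uv, u) = u^2(v^2 + 1)/2, that the Jacobian of the map is −u, and that the resulting u-integral produces the pole at z = −1 used in the next display.

```python
# Small sympy check (illustrative) of the blow-up a = uv, b = u of eq.(17).
import sympy as sp

u, v = sp.symbols('u v', positive=True)
z = sp.Symbol('z')
a, b = u * v, u
H = sp.factor(sp.Rational(1, 2) * (a**2 + b**2))      # -> u**2*(v**2 + 1)/2
jac = sp.Matrix([a, b]).jacobian([u, v]).det()        # -> -u, so |da db| = u du dv for u > 0
print(H, jac)

# In J_{W_1}(z) the u-dependence is u**(2*z) times the Jacobian factor u.
print(sp.integrate(u**(2*z + 1), u))  # generic branch u**(2*z + 2)/(2*z + 2): pole at z = -1
```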

Then

J_{W_1}(z) = ∫_{|a|≤|b|} H(a, b)^z da db
           = (1/2^z) ∫_{|u|≤1} ∫_{|v|≤1} u^{2z} (v^2 + 1)^z |u| du dv

has a pole at z = −1. We can show that J_{W_2}(z) has the same pole in just the same way as J_{W_1}(z). Hence λ_1 = 1 and m_1 = 1, resulting in F(n) = log n + O(1). This coincides with the well known result of the Bayesian asymptotic theory of regular statistical models. The mapping in eq.(17) is a typical example of a blowing-up.

Example 2 (Non-identifiable model) Let us consider a learning machine

p(y|x, a, b, c) = (1/√(2π)) exp(−(1/2)(y − aφ(bx) − cx)^2),

where φ(x) = x + x^2. Assume that the true distribution is the same as eq.(16), and that the a priori probability distribution is the uniform one on the set {w = (a, b, c); |a| ≤ 1, |b| ≤ 1, |c| ≤ 1}. Then the Kullback information is

H(a, b, c) = (1/2){(ab + c)^2 + 3a^2 b^4}.

Let us define two sets of parameters,

W_1 = {(a, b, c); |c| ≤ |a|, |ab^2| ≤ |ab + c| ≤ |ab|},
U_1 = {(p, q, r); |p| ≤ 1, |q| ≤ 1, |r| ≤ 1}.

By using blowing-ups recursively, we find a map g : U_1 → W_1 which is defined by

a = p,   b = qr,   c = (q − 1)pqr.

By using this transform, we obtain

H(g(p, q, r)) = (1/2) p^2 q^4 r^2 (1 + 3r^2),
|g'(p, q, r)| = |p q^2 r|.

Therefore,

J_{W_1}(z) = ∫_{W_1} H(w)^z dw
           = (1/2^z) ∫∫∫ {p^2 q^4 r^2 (1 + 3r^2)}^z |p q^2 r| dp dq dr.

The largest pole of J_{W_1}(z) is z = −3/4 and its order is one. It is also shown that J_{W \ W_1}(z) has its largest pole at z = −3/4 with order one. Hence λ_1 = 3/4 and m_1 = 1, resulting in

F(n) = (3/4) log n + O(1).
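The same mechanical check works for Example 2. The sympy sketch below is not in the original paper; it verifies the normal crossing form of H ∘ g and the Jacobian, and reads off the candidate poles −(h_j + 1)/k_j of J_{W_1}(z) from the exponents, the largest being z = −3/4.

```python
# Sympy check (illustrative) of Example 2's resolution map: a = p, b = qr, c = (q-1)pqr.
import sympy as sp

p, q, r = sp.symbols('p q r', positive=True)
a, b, c = p, q * r, (q - 1) * p * q * r
H = sp.factor(sp.Rational(1, 2) * ((a * b + c)**2 + 3 * a**2 * b**4))   # Kullback information
jac = sp.factor(sp.Matrix([a, b, c]).jacobian([p, q, r]).det())

print(H)      # p**2*q**4*r**2*(3*r**2 + 1)/2 : normal crossing part times the unit (3r^2 + 1)/2
print(jac)    # -p*q**2*r                     : |g'(p, q, r)| = p*q**2*r

# Each variable with exponent k_j in H(g(u)) and h_j in |g'(u)| contributes a pole -(h_j+1)/k_j.
for var, k in ((p, 2), (q, 4), (r, 2)):
    h = sp.degree(jac, var)
    print(var, sp.Rational(-(h + 1), k))   # p: -1, q: -3/4, r: -1, so lambda_1 = 3/4, m_1 = 1
```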

3.2 Application to Layered Perceptron

We apply the theory in the foregoing subsection to the three-layer perceptron. A three-layer perceptron with parameter w = {(a_k, b_k, c_k)} is defined by

p(y|x, w) = (1/(2πs^2)^{N/2}) exp(−(1/(2s^2)) ||y − f_K(x, w)||^2),   (18)

f_K(x, w) = Σ_{k=1}^{K} a_k σ(b_k · x + c_k),   (19)

where y, f_K(x, w), and a_k are N dimensional vectors, x and b_k are M dimensional vectors, c_k is a real number, and σ(x) = tanh(x). Here M, N, and K are the numbers of input units, output units, and hidden units. In this paper, we consider a machine which does not estimate the standard deviation s > 0 (s is a constant). We assume that the true distribution is

q(y|x)q(x) = (1/(2πs^2)^{N/2}) exp(−(1/(2s^2)) ||y||^2) q(x).   (20)

That is to say, the true regression function is y = 0. This is a special case, but the analysis of this case is important in the following section, where the true regression function is not contained in the model.

Theorem 3 Assume that the learning machine given by eq.(18) and eq.(19) is trained using samples independently taken from the distribution eq.(20). If the a priori distribution satisfies ϕ(w) > 0 in a neighborhood of the origin w = 0, then

λ_1 ≤ (K/2) min{M + 1, N}.

(Proof of Theorem 3) We use the notations

w = (a, b, c),
a = {a_k ∈ R^N ; k = 1, 2, ..., K},   a_k = {a_kp ∈ R ; p = 1, 2, ..., N},
b = {b_k ∈ R^M ; k = 1, 2, ..., K},   b_k = {b_kq ∈ R ; q = 1, 2, ..., M},
c = {c_k ∈ R ; k = 1, 2, ..., K}.

Then the Kullback information is

H(a, b, c) = (1/(2s^2)) ∫ ||f_K(x, a, b, c)||^2 q(x) dx
           = Σ_{p=1}^{N} Σ_{h,k=1}^{K} B_hk(b, c) a_hp a_kp,

where

B_hk(b, c) = (1/(2s^2)) ∫ σ(b_h · x + c_h) σ(b_k · x + c_k) q(x) dx.

Our purpose is to find the poles of the function

J(z) = ∫_{W_ǫ} H(a, b, c)^z ϕ(a, b, c) da db dc,

where

W_ǫ = {(a, b, c); H(a, b, c) < ǫ, ϕ(a, b, c) > 0}.

Let us apply the blowing-up technique to the Kullback information H(a, b, c). Firstly, we introduce a mapping

g : {u_hp ; 1 ≤ h ≤ K, 1 ≤ p ≤ N} → {a_hp ; 1 ≤ h ≤ K, 1 ≤ p ≤ N},

which is defined by

a_11 = u_11,   a_hp = u_11 u_hp   (h ≠ 1 or p ≠ 1).

Let u' be the variables of u except u_11, in other words, u = (u_11, u'). Then

H(u, b, c) = u_11^2 H_1(u', b, c),

where

H_1(u', b, c) = B_11(b, c) + 2 Σ_{h=2}^{K} B_h1(b, c) u_h1 + Σ_{h,k=2}^{K} B_hk(b, c) u_h1 u_k1 + Σ_{p=2}^{N} Σ_{h,k=1}^{K} B_hk(b, c) u_hp u_kp,

and the Jacobian g'(u) of the mapping g is

g'(u) = det( ∂a_ij / ∂u_kl ) = u_11^{NK−1}.

We define a set of parameters for δ > 0,

U(δ) = {(u_11, u', b, c); |u_11| ≤ δ, H_1(u', b, c) < 1}.

By the assumption, there exists δ > 0 such that g(U(δ)) ⊂ W_ǫ.

In order to obtain an upper bound of the stochastic complexity, we can restrict the integration region in the parameter space, by using eq.(5) and eq.(6):

J(z) = ∫_{g(U(δ))} H(a, b, c)^z ϕ(a, b, c) da db dc.

By the assumption, ϕ(w) > 0 in g(U(δ)). In the calculation of the poles of J(z), we can assume ϕ(w) = ϕ_0 (ϕ_0 is a constant) in g(U(δ)). Then

J(z) = ∫_{U(δ)} H(g(u), b, c)^z ϕ_0 |g'(u)| du db dc
     = ϕ_0 ∫_0^δ u_11^{2z} u_11^{NK−1} du_11 ∫ H_1(u', b, c)^z du' db dc
     = (ϕ_0 δ^{2z+NK} / (2z + NK)) ∫ H_1(u', b, c)^z du' db dc.

The pole of the function δ^{2z+NK}/(2z + NK) is z = −NK/2. Let −λ_1 and −λ_1' be respectively the largest poles of J(z) and of

J_1(z) = ∫ H_1(u', b, c)^z du' db dc.

Since H_1(u', b, c) ≥ 0, J_1(z) does not have a zero point in the interval (−λ_1', ∞). If z = −NK/2 is larger than −λ_1', then z = −NK/2 is a pole of J(z). Otherwise, J(z) has a pole larger than −NK/2. Hence λ_1 ≤ NK/2.

Secondly, we consider another blowing-up g,

g : {u_kp, v_k ; 1 ≤ k ≤ K, 1 ≤ p ≤ M} → {b_kp, c_k ; 1 ≤ k ≤ K, 1 ≤ p ≤ M},

which is defined by

b_11 = u_11,   b_kp = u_11 u_kp   (k ≠ 1 or p ≠ 1),   c_k = u_11 v_k.

Then, by just the same method as in the first half, there exists an analytic function H_2(a, u, v) such that

H(a, b, c) = u_11^2 H_2(a, u, v),

which implies

J(z) = ϕ_0 ∫_0^δ u_11^{2z} u_11^{(M+1)K−1} du_11 ∫ H_2(a, u, v)^z da du dv
     = (ϕ_0 δ^{2z+(M+1)K} / (2z + (M + 1)K)) ∫ H_2(a, u, v)^z da du dv.

Therefore λ_1 ≤ (M + 1)K/2. By combining the above two results, the largest pole −λ_1 of J(z) satisfies the inequality

λ_1 ≤ (K/2) min{N, M + 1},

which completes the proof of Theorem 3. (End of Proof)

By Theorem 1,

F(n) ≤ (K/2) min{N, M + 1} log n + o(log n).

Moreover, if G(n) has an asymptotic expansion (see section 5), we obtain an inequality for the generalization error,

G(n) ≤ (K/(2n)) min{N, M + 1} + o(1/n).

On the other hand, it is well known that the constant λ_1 of a regular statistical model is equal to d/2, where d is the number of parameters. When a three-layer perceptron with 100 input units, 10 hidden units, and 1 output unit is employed, λ_1 ≤ 5, whereas a regular statistical model with the same number of parameters, d = (M + N + 1)K = 1020, has λ_1 = d/2 = 510. It should be emphasized that the generalization error of the hierarchical learning machine is far smaller than that of regular statistical models, if we use Bayesian estimation.

When we adopt the normal distribution as the a priori probability density, we have shown the same result as Theorem 3 by a direct calculation (Watanabe, 1999a). However, Theorem 3 shows systematically that the same result holds for an arbitrary a priori distribution. Moreover, it is easy to generalize the above result to the case when the learning machine has M input units, K_1 first hidden units, K_2 second hidden units, ..., K_p p-th hidden units, and N output units. We assume that hidden units and output units have bias parameters. Then, by using the same blowing-ups, we can generalize the proof of Theorem 3 to obtain

λ_1 ≤ (1/2) min{(M + 1)K_1, (K_1 + 1)K_2, (K_2 + 1)K_3, ..., (K_{p−1} + 1)K_p, K_p N}.

Of course, this result holds only when the true regression function is the special case y = 0. However, in the following section, we show that this result is necessary to obtain a bound for a general regression function.
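These bounds are simple to evaluate. The helper below is only an illustration (it is not code from the paper); it computes the upper bound of λ_1 for the three-layer machine of eqs.(18)-(19) and for the multilayer generalization stated above, and reproduces the numerical comparison with the regular-model value d/2.

```python
# Illustrative helpers for the lambda_1 bounds above (not from the paper).
def lambda_bound(M, hidden, N):
    """(1/2) min{(M+1)K_1, (K_1+1)K_2, ..., (K_{p-1}+1)K_p, K_p N} for hidden = [K_1, ..., K_p]."""
    fan_in = [M + 1] + [K + 1 for K in hidden[:-1]]
    terms = [f * K for f, K in zip(fan_in, hidden)] + [hidden[-1] * N]
    return 0.5 * min(terms)

def d_three_layer(M, K, N):
    """Parameter count of the machine of eqs.(18)-(19): a_k, b_k, c_k give d = (M + N + 1)K."""
    return (M + N + 1) * K

M, K, N = 100, 10, 1
print(lambda_bound(M, [K], N))        # 5.0   : Theorem 3, K/2 * min{M+1, N}
print(d_three_layer(M, K, N) / 2)     # 510.0 : lambda_1 = d/2 of a regular model with the same d
print(lambda_bound(M, [K, K], N))     # 5.0   : two hidden layers of 10 units, last inequality above
```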

4 A Non-parametric Case

In the previous section, we studied the case when the true probability distribution is contained in the parametric model. In this section, we consider a non-parametric case when the true distribution is not contained in the parametric model, which is illustrated in Figure 3. Let w_0 be the parameter that minimizes H(w), which is the point C in Figure 3. Our main purpose is to clarify the effect of singular points such as A and B in Figure 3, which are not contained in the neighborhood of w_0.

Let us consider the case when a three-layer perceptron given by eq.(18) and eq.(19) is trained using samples independently taken from the true probability distribution

q(y|x)q(x) = (1/(2πs^2)^{N/2}) exp(−(1/(2s^2)) ||y − g(x)||^2) q(x),   (21)

where g(x) is the true regression function and q(x) is the true probability distribution on the input space. Let E(k) be the minimum function approximation error of a three-layer perceptron with k hidden units,

E(k) = min_w ∫ ||g(x) − f_k(x, w)||^2 q(x) dx.   (22)

Here we assume that, for each 1 ≤ k ≤ K, there exists a parameter w that attains the minimum value.

Theorem 4 Assume that the learning machine given by eq.(18) and eq.(19) is trained using samples independently taken from the distribution of eq.(21). If the a priori distribution satisfies ϕ(w) > 0 for an arbitrary w, then

F(n) ≤ min_{0≤k≤K} { n E(k)/(2s^2) + (1/2)(D_1 k + D_2 K) log n } + o(log n),

where

D_1 = (M + N + 1) − min[N, M + 1],
D_2 = min[N, M + 1].

(Proof of Theorem 4) By Jensen's inequality, eq.(4), we have

F(n) ≤ F̄(n) = −log ∫ exp(−n H(w)) ϕ(w) dw,

where H(w) is the Kullback distance,

H(w) = (1/(2s^2)) ∫ ||g(x) − f_K(x, w)||^2 q(x) dx.

Let k_1 and k_2 be natural numbers which satisfy both 0 ≤ k_1 ≤ K and k_1 + k_2 = K. We divide the parameter as w = (w_1, w_2), where

w_1 = {a_k, b_k, c_k ; 1 ≤ k ≤ k_1},   w_2 = {a_k, b_k, c_k ; k_1 + 1 ≤ k ≤ K}.

Also let γ_1 and γ_2 be real numbers which satisfy both γ_1 > 1 and γ_1 + γ_2 = γ_1 γ_2. Then, for arbitrary u, v ∈ R^N,

||u + v||^2 ≤ γ_1 ||u||^2 + γ_2 ||v||^2.

Therefore, for arbitrary (x, w),

||g(x) − f_K(x, w)||^2 ≤ γ_1 ||g(x) − f_{k_1}(x, w_1)||^2 + γ_2 ||f_{k_2}(x, w_2)||^2.

Hence we have the inequality

H(w) ≤ γ_1 H_1(w_1) + γ_2 H_2(w_2),

where we use the definitions

H_1(w_1) = (1/(2s^2)) ∫ ||g(x) − f_{k_1}(x, w_1)||^2 q(x) dx,
H_2(w_2) = (1/(2s^2)) ∫ ||f_{k_2}(x, w_2)||^2 q(x) dx.

Since F̄(n) is an increasing function of H(w),

F(n) ≤ F_1(n) + F_2(n),

where

F_j(n) = −log ∫ exp(−n γ_j H_j(w_j)) ϕ_j(w_j) dw_j

for j = 1, 2, and ϕ_j (j = 1, 2) are functions which satisfy ϕ(w) ≥ ϕ_1(w_1) ϕ_2(w_2). Here we can choose both ϕ_1(w_1) and ϕ_2(w_2) to be functions with compact support.

Firstly, we evaluate F_1(n). Let w_1* be the parameter that minimizes H_1(w_1). Then, by eq.(22) and Theorem 2,

F_1(n) ≤ n γ_1 H_1(w_1*) − log ∫ exp(−n γ_1 [H_1(w_1) − H_1(w_1*)]) ϕ_1(w_1) dw_1
       ≤ n γ_1 E(k_1)/(2s^2) + (d(k_1)/2) log n + o(log n),   (23)

where d(k_1) = (M + N + 1)k_1 is the number of parameters of the three-layer perceptron with k_1 hidden units. Secondly, applying Theorem 3 to F_2(n),

F_2(n) ≤ λ(k_2) log n + o(log n),   (24)

where λ(k_2) satisfies

λ(k_2) ≤ (k_2/2) min{N, M + 1}.   (25)

By combining eq.(23) with eq.(24), and by taking γ_1 sufficiently close to 1, we obtain

F(n) ≤ n E(k_1)/(2s^2) + { d(k_1)/2 + λ(k_2) } log n + o(log n),

for an arbitrary given 0 ≤ k_1 ≤ K. Since

d(k_1)/2 + λ(k_2) ≤ (1/2){D_1 k_1 + D_2 K},

we obtain Theorem 4. (End of Proof)

Based on Theorem 4, if G(n) has an asymptotic expansion (see section 5), then G(n) should satisfy the inequalities

G(n) ≤ E(k)/(2s^2) + D_1 k/(2n) + D_2 K/(2n) + o(1/n)   (k = 0, 1, 2, ..., K),   (26)

for n > n_0 with a sufficiently large n_0. Hence

G(n) ≤ min_{0≤k≤K} { E(k)/(2s^2) + D_1 k/(2n) } + D_2 K/(2n) + o(1/n),   (27)

for n > n_0 with a sufficiently large n_0. Figure 4 illustrates several learning curves corresponding to k (0 ≤ k ≤ K). The generalization error G(n) is smaller than every curve. It is well known (Barron, 1994; Murata, 1996) that, if g(x) belongs to a certain class of functions, then

E(k) ≤ C(g)/k

for sufficiently large k, where C(g) is a positive constant determined by the true regression function g(x). Then,

G(n) ≤ min_{0≤k≤K} { C(g)/(2s^2 k) + D_1 k/(2n) } + D_2 K/(2n) + o(1/n).   (28)
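The right-hand side of eq.(28) is easy to evaluate numerically. The sketch below uses made-up constants (the values of M, N, K, s, and C(g) are assumptions, not results from the paper); it only illustrates how the minimizing k, i.e., the effective model size, grows with n, which is the envelope behavior of Figure 4.

```python
# Illustrative evaluation of the bound of eq.(28) with assumed constants (not from the paper).
import numpy as np

M, N, K, s, C = 100, 1, 10, 1.0, 2.0
D1 = (M + N + 1) - min(N, M + 1)      # = 101
D2 = min(N, M + 1)                    # = 1

def envelope_bound(n):
    """min over k = 1..K of C(g)/(2 s^2 k) + D1 k/(2n), plus D2 K/(2n)."""
    ks = np.arange(1, K + 1)
    curves = C / (2 * s**2 * ks) + D1 * ks / (2 * n)
    k_star = int(ks[np.argmin(curves)])
    return k_star, float(curves.min() + D2 * K / (2 * n))

for n in (10**2, 10**3, 10**4, 10**5):
    k_star, g_bound = envelope_bound(n)
    print(n, k_star, round(g_bound, 5))   # the minimizing k grows with n; the bound decreases
```

The minimizing k plays the role of the middle size model discussed in section 6.1: for moderate n a small k gives the tightest curve, and larger models take over only as n increases.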

If both n and K are sufficiently large, and if α ≡ sK √(D_1/(C(g)n)) ≥ 1, then, by choosing k = √(C(g)n/(s^2 D_1)) in eq.(28),

G(n) ≤ √(C(g)/(s^2 n)) [ √D_1 + α D_2 / (2√D_1) ] + o(1/n).

The inequality (27) holds if n is sufficiently large. If n is sufficiently large but not extremely large, then G(n) is bounded by the generalization error of a middle size model. As n becomes larger, it is bounded by that of a larger model, and when n is extremely large, it is bounded by that of the largest model. A complex hierarchical learning machine contains many smaller models in its own parameter space as analytic sets with singularities, and, if Bayesian estimation is applied, it chooses the appropriate model adaptively for the number of training samples. Such a property is caused by the fact that the model is non-identifiable, and its quantitative effect can be evaluated by using algebraic geometry.

5 Asymptotic Property of the Generalization Error

In this section, let us consider the asymptotic expansion of the generalization error. By eq.(2), F(n) is equal to the accumulated generalization error,

F(n + 1) = Σ_{i=0}^{n} G(i)   (n = 1, 2, ...),   (29)

where G(0) is defined by F(1). Hence, if G(n) has an asymptotic expansion for n → ∞, then F(n) also has an asymptotic expansion. However, even if F(n) has an asymptotic expansion, G(n) may not have one. In the foregoing sections, we have proved that F(n) satisfies inequalities of the form

F(n) ≤ αn + β log n + o(log n),   (30)

where α, β ≥ 0 are constants determined by the singularities and the true distribution. In order to mathematically derive an inequality for G(n) from eq.(30), we need an assumption.

Assumption (A) Assume that the generalization error G(n) has an asymptotic expansion

G(n) = Σ_{q=1}^{Q} a_q s_q(n) + o(1/n),   (31)

where {a_q} are real constants and s_q(n) > 0 is a positive and non-increasing function of n which satisfies

s_1(n) = 1,   (32)
s_Q(n) = 1/n,   (33)
s_q(n) = o(s_{q−1}(n))   (q = 2, 3, ..., Q).   (34)

Based on this assumption, we have the following lemma.

Lemma 1 If G(n) satisfies assumption (A) and if eq.(30) holds, then G(n) satisfies the inequality

G(n) ≤ α + β/n + o(1/n).   (35)

(Proof) By assumption (A),

F(n) = a_1 n + o(n),

which shows a_1 ≤ α. If a_1 < α, then eq.(35) holds. If a_1 = α, then

Σ_{k=0}^{n} a_2 s_2(k) ≤ β log(n + 1) + o(log n).   (36)

Let t(k) = k s_2(k). By eq.(32), eq.(33), and eq.(34), either t(k) → ∞ or t(k) ≤ C (C > 0). If t(k) → ∞, then, for an arbitrary M > 0, there exists k_0 such that k ≥ k_0 implies t(k) ≥ M. Hence

Σ_{k=0}^{n} s_2(k) = Σ_{k=0}^{n} t(k)/k ≥ M Σ_{k=k_0}^{n} 1/k = M(log n − const.),

which contradicts eq.(36). Hence t(n) ≤ C and a_2 C ≤ β. (End of Proof of Lemma 1)

In this paper, we have proven inequalities of the form eq.(30) in Theorems 1, 2, 3, and 4 without assumption (A). We then obtain the corresponding inequalities of the form eq.(35) if we adopt assumption (A). In other words, if G(n) has an asymptotic expansion and if eq.(30) holds, then G(n) should satisfy eq.(35). It is conjectured that natural learning machines satisfy assumption (A). A sufficient condition for assumption (A) is that F(n) has an asymptotic expansion

F(n) = Σ_{r=1}^{R} a_r S_r(n) + o(1/n),

where S_1(n) = n, S_R(n) = 1/n, and S_{r+1}(n) = o(S_r(n)) (r = 1, 2, ..., R − 1). For example, if the learner is

p(y|x, a) = (1/√(2π)) exp(−(1/2)(y − ax)^2),

where the a priori distribution of a is the standard normal distribution, and if the true distribution is

q(y|x)q(x) = (1/2π) exp(−(1/2){(y − c)^2 + x^2}),

then it is shown by direct calculation that the stochastic complexity has an asymptotic expansion

F(n) = (1/2){ n c^2 + log n − c^2 − 1 + (c^2 + 1)/n } + o(1/n).   (37)

Hence G(n) has an asymptotic expansion

G(n) = c^2/2 + 1/(2n) + o(1/n).

It is expected that, in the general case, G(n) has the same kind of asymptotic expansion as in assumption (A); however, mathematically speaking, a necessary and sufficient condition for it has not yet been established. This is an important problem in statistics and learning theory for the future.
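Because Z_n is a one-dimensional gaussian integral in this example, the expansion can also be checked numerically. The following sketch is not from the paper; the value of c, the number of trials, and the comparison against only the leading terms (n c^2 + log n)/2 are illustrative choices.

```python
# Monte Carlo check (illustrative) of the example above: p(y|x,a) = N(y; ax, 1),
# prior a ~ N(0,1), true distribution y ~ N(c,1), x ~ N(0,1).
import numpy as np

rng = np.random.default_rng(0)

def F_estimate(n, c, trials=4000):
    """Average of -log Z_n over simulated data sets; Z_n = int exp(-n H_n(a)) phi(a) da is gaussian."""
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.standard_normal(n)
        y = c + rng.standard_normal(n)
        S, T = np.sum(x * x), np.sum(x * y)
        # closed form: -log Z_n = (1/2) log(S+1) - T^2/(2(S+1)) + c * sum(y) - n c^2 / 2
        vals[t] = 0.5 * np.log(S + 1.0) - T**2 / (2.0 * (S + 1.0)) + c * y.sum() - n * c**2 / 2.0
    return float(vals.mean())

c = 0.5
for n in (10, 100, 1000):
    print(n, round(F_estimate(n, c), 3), round(0.5 * (n * c**2 + np.log(n)), 3))
```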

6 Discussion

In this section, we discuss universal phenomena which can be observed in hierarchical learning machines.

6.1 Bias and variance at singularities

We consider a covering of the parameter space by neighborhoods,

W = ∪_{j=1}^{J} W(w_j),   (38)

where {W(w_j)} are sufficiently small neighborhoods of the parameters w_j which satisfy

∫_{W(w_i) ∩ W(w_j)} ϕ(w) dw = 0   (i ≠ j).

The number J in eq.(38) is finite when W = supp ϕ is compact. Then the upper bound of the stochastic complexity can be rewritten as

F̄(n) = −log Σ_{j=1}^{J} ∫_{W(w_j)} exp(−n H(w)) ϕ(w) dw
     ≅ −log Σ_{j=1}^{J} exp(−n B(w_j)) V(w_j),

where B(w_j) is the function approximation error of the parameter w_j,

B(w_j) = min_{w ∈ W(w_j)} H(w),

and V(w_j) is the statistical estimation error of the neighborhood of w_j,

V(w_j) = c_0 (log n)^{m(w_j) − 1} / n^{λ(w_j)},

where c_0 > 0 is a constant. Here −λ(w_j) and m(w_j) are respectively the largest pole and its multiplicity of the meromorphic function

J_{w_j}(z) = ∫_{W(w_j)} (H(w) − B(w_j))^z ϕ(w) dw.

Note that B(w_j) and V(w_j) are called the bias and the variance, respectively. In Bayesian estimation, the neighborhood of the parameter w_j that maximizes

Z(w_j) = exp(−n B(w_j)) V(w_j),

in other words, that minimizes n B(w_j) + λ(w_j) log n − (m(w_j) − 1) log log n, is selected with the largest probability. In regular statistical models, the variance does not depend on the parameter, in other words, λ(w_j) = d/2 and m(w_j) = 1 for an arbitrary parameter w_j; hence the parameter that minimizes the function approximation error is selected. On the other hand, in hierarchical learning machines,

the variance V(w_j) strongly depends on the parameter w_j, and the parameter that minimizes the sum of the bias and the variance is selected. If the number of training samples is large but not extremely large, a parameter among the singular points such as A in Figure 3, which represents a middle size model, is automatically selected, resulting in a smaller generalization error. As n increases, the larger but not largest model B is selected. At last, if n becomes extremely large, then the parameter C that minimizes the bias is selected. This is a universal phenomenon of hierarchical learning machines, which indicates the essential difference between regular statistical models and artificial neural networks.

6.2 Neural networks are an over-complete basis

Singularities of a hierarchical learning machine originate in the homogeneous structure of the learning model. The set of functions used in an artificial neural network, for example {σ(b · x + c)}, is a set of over-complete basis functions; in other words, the coefficients {a(b, c)} in a wavelet-type decomposition of a given function g(x),

g(x) = ∫ a(b, c) σ(b · x + c) db dc,

are not uniquely determined by g(x) (Chui, 1992; Murata, 1996). In practical applications, the true probability distribution is seldom contained in a parametric model; however, we adopt a model which almost approximates the true distribution compared with the fluctuation caused by random samples,

g(x) ≅ Σ_k a_k σ(b_k · x + c_k).

If we have an appropriate number of samples and choose an appropriate learning model, it is expected that the model is in an almost redundant state, where the output functions of the hidden units are almost linearly dependent. We expect that this paper will be a mathematical foundation for studying learning machines in such states.

7 Conclusion

We considered the case when the true distribution is not contained in the parametric models made of hierarchical learning machines, and showed that parameters among singular points are selected by the Bayesian distribution, resulting in a small generalization error. The quantitative effect of the singularities was clarified

based on the resolution of singularities in algebraic geometry. Even if the true distribution is not contained in the parametric models, singularities strongly affect and improve the learning curves. This is a universal phenomenon of hierarchical learning machines, which can be observed in almost all artificial neural networks.

References

Akaike, H. (1980). Likelihood and Bayes procedure. In J. M. Bernardo, Bayesian Statistics. Valencia, Spain: University Press.
Amari, S. (1985). Differential-geometrical methods in statistics. Lecture Notes in Statistics, Springer.
Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6.
Amari, S., Fujita, N., & Shinomoto, S. (1992). Four types of learning curves. Neural Computation, 4(4).
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss. Neural Computation, 5.
Amari, S., & Ozeki, T. (2000). Differential and algebraic geometry of multilayer perceptrons. To appear in IEICE Transactions.
Atiyah, M. F. (1970). Resolution of singularities and division of distributions. Communications of Pure and Applied Mathematics, 13.
Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1).
Bernstein, I. N. (1972). The analytic continuation of generalized functions with respect to a parameter. Functional Analysis and its Applications, 6.
Cramer, H. (1949). Mathematical methods of statistics. Princeton: University Press.
Chui, C. K. (1992). An introduction to wavelets. Academic Press.
Dacunha-Castelle, D., & Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. Probability and Statistics, 1.

Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9.
Gel'fand, I. M., & Shilov, G. E. (1964). Generalized functions. Academic Press.
Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proceedings of the International Joint Conference on Neural Networks, Nagoya, Japan, 3.
Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, 2.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Annals of Statistics, 25(6).
Hironaka, H. (1964). Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79.
Kashiwara, M. (1976). B-functions and holonomic systems. Inventiones Mathematicae, 38.
Levin, E., Tishby, N., & Solla, S. A. (1990). A statistical approach to learning and generalization in layered neural networks. Proceedings of IEEE, 78(10).
Mackay, D. J. (1992). Bayesian interpolation. Neural Computation, 4(2).
Meir, R., & Merhav, N. (1995). On the stochastic complexity of learning realizable and unrealizable rules. Machine Learning, 19(3).
Murata, N. (1996). An integral representation with ridge functions and approximation bounds of three-layered network. Neural Networks, 9(6).
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Physical Review Letters, 75(20).
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14.
Sato, M., & Shintani, T. (1974). On zeta functions associated with prehomogeneous vector spaces. Annals of Mathematics, 100.

Schwarz, G. (1974). Estimating the dimension of a model. Annals of Statistics, 6(2).
Watanabe, S. (1994). An optimization method of layered neural networks based on the modified information criterion. Advances in Neural Information Processing Systems, 6.
Watanabe, S. (1997). On the essential difference between neural networks and regular statistical models. Proceedings of the 2nd International Conference on Computational Intelligence and Neuroscience, 2.
Watanabe, S. (1998). On the generalization error by a layered statistical model with Bayesian estimation. IEICE Transactions, J81-A. English version: Watanabe, S. (2000). Electronics and Communications in Japan.
Watanabe, S. (1999a). Learning efficiency of redundant neural networks in Bayesian estimation. Submitted to IEEE Transactions on Neural Networks.
Watanabe, S. (1999b). Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, 1720.
Watanabe, S. (2000). Algebraic analysis for non-regular learning machines. Advances in Neural Information Processing Systems, 12.
Watanabe, S. (2001a). Algebraic analysis for non-identifiable learning machines. Neural Computation, to appear.
Watanabe, S. (2001b). Training and generalization errors of the hierarchical learning machines with algebraic singularities. IEICE Transactions, J84-A(1).
Watanabe, S. (2001c). Algebraic information geometry for learning machines with singularities. Advances in Neural Information Processing Systems, 13, to appear.
Watanabe, S., & Fukumizu, K. (1995). Probabilistic design of layered neural networks based on their unified framework. IEEE Transactions on Neural Networks, 6(3).
White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural Computation, 1.

Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Transactions on Information Theory, 44(4).


More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

A Bayesian Local Linear Wavelet Neural Network

A Bayesian Local Linear Wavelet Neural Network A Bayesian Local Linear Wavelet Neural Network Kunikazu Kobayashi, Masanao Obayashi, and Takashi Kuremoto Yamaguchi University, 2-16-1, Tokiwadai, Ube, Yamaguchi 755-8611, Japan {koba, m.obayas, wu}@yamaguchi-u.ac.jp

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models

The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health

More information

Bernstein s analytic continuation of complex powers

Bernstein s analytic continuation of complex powers (April 3, 20) Bernstein s analytic continuation of complex powers Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/. Analytic continuation of distributions 2. Statement of the theorems

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference

Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference Shunsuke Horii Waseda University s.horii@aoni.waseda.jp Abstract In this paper, we present a hierarchical model which

More information

Bregman divergence and density integration Noboru Murata and Yu Fujimoto

Bregman divergence and density integration Noboru Murata and Yu Fujimoto Journal of Math-for-industry, Vol.1(2009B-3), pp.97 104 Bregman divergence and density integration Noboru Murata and Yu Fujimoto Received on August 29, 2009 / Revised on October 4, 2009 Abstract. In this

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Phase Diagram Study on Variational Bayes Learning of Bernoulli Mixture

Phase Diagram Study on Variational Bayes Learning of Bernoulli Mixture 009 Technical Report on Information-Based Induction Sciences 009 (IBIS009) Phase Diagram Study on Variational Bayes Learning of Bernoulli Mixture Daisue Kaji Sumio Watanabe Abstract: Variational Bayes

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata ' / PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE Noboru Murata Waseda University Department of Electrical Electronics and Computer Engineering 3--

More information

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions Chapter 3 Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions 3.1 Scattered Data Interpolation with Polynomial Precision Sometimes the assumption on the

More information

Bootstrap prediction and Bayesian prediction under misspecified models

Bootstrap prediction and Bayesian prediction under misspecified models Bernoulli 11(4), 2005, 747 758 Bootstrap prediction and Bayesian prediction under misspecified models TADAYOSHI FUSHIKI Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569,

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

On the Stochastic Complexity of Learning Realizable and Unrealizable Rules

On the Stochastic Complexity of Learning Realizable and Unrealizable Rules Machine Learning, 19,241-261 (1995) 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. On the Stochastic Complexity of Learning Realizable and Unrealizable Rules RONNY MEIR NERI

More information

Information Theory and Hypothesis Testing

Information Theory and Hypothesis Testing Summer School on Game Theory and Telecommunications Campione, 7-12 September, 2014 Information Theory and Hypothesis Testing Mauro Barni University of Siena September 8 Review of some basic results linking

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

Complexity Bounds of Radial Basis Functions and Multi-Objective Learning

Complexity Bounds of Radial Basis Functions and Multi-Objective Learning Complexity Bounds of Radial Basis Functions and Multi-Objective Learning Illya Kokshenev and Antônio P. Braga Universidade Federal de Minas Gerais - Depto. Engenharia Eletrônica Av. Antônio Carlos, 6.67

More information

The Minimum Message Length Principle for Inductive Inference

The Minimum Message Length Principle for Inductive Inference The Principle for Inductive Inference Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health University of Melbourne University of Helsinki, August 25,

More information

Introduction Dual Representations Kernel Design RBF Linear Reg. GP Regression GP Classification Summary. Kernel Methods. Henrik I Christensen

Introduction Dual Representations Kernel Design RBF Linear Reg. GP Regression GP Classification Summary. Kernel Methods. Henrik I Christensen Kernel Methods Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Kernel Methods 1 / 37 Outline

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Smooth Bayesian Kernel Machines

Smooth Bayesian Kernel Machines Smooth Bayesian Kernel Machines Rutger W. ter Borg 1 and Léon J.M. Rothkrantz 2 1 Nuon NV, Applied Research & Technology Spaklerweg 20, 1096 BA Amsterdam, the Netherlands rutger@terborg.net 2 Delft University

More information

Intelligent Systems Statistical Machine Learning

Intelligent Systems Statistical Machine Learning Intelligent Systems Statistical Machine Learning Carsten Rother, Dmitrij Schlesinger WS2014/2015, Our tasks (recap) The model: two variables are usually present: - the first one is typically discrete k

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Chapter 2 Review of Classical Information Theory

Chapter 2 Review of Classical Information Theory Chapter 2 Review of Classical Information Theory Abstract This chapter presents a review of the classical information theory which plays a crucial role in this thesis. We introduce the various types of

More information

Information geometry of Bayesian statistics

Information geometry of Bayesian statistics Information geometry of Bayesian statistics Hiroshi Matsuzoe Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya 466-8555, Japan Abstract.

More information

From CDF to PDF A Density Estimation Method for High Dimensional Data

From CDF to PDF A Density Estimation Method for High Dimensional Data From CDF to PDF A Density Estimation Method for High Dimensional Data Shengdong Zhang Simon Fraser University sza75@sfu.ca arxiv:1804.05316v1 [stat.ml] 15 Apr 2018 April 17, 2018 1 Introduction Probability

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Lecture 4: Deep Learning Essentials Pierre Geurts, Gilles Louppe, Louis Wehenkel 1 / 52 Outline Goal: explain and motivate the basic constructs of neural networks. From linear

More information

Lower Bounds for Approximation by MLP Neural Networks

Lower Bounds for Approximation by MLP Neural Networks Lower Bounds for Approximation by MLP Neural Networks Vitaly Maiorov and Allan Pinkus Abstract. The degree of approximation by a single hidden layer MLP model with n units in the hidden layer is bounded

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway

More information

= w 2. w 1. B j. A j. C + j1j2

= w 2. w 1. B j. A j. C + j1j2 Local Minima and Plateaus in Multilayer Neural Networks Kenji Fukumizu and Shun-ichi Amari Brain Science Institute, RIKEN Hirosawa 2-, Wako, Saitama 35-098, Japan E-mail: ffuku, amarig@brain.riken.go.jp

More information