Algebraic Analysis for Non-identifiable Learning Machines


Algebraic Analysis for Non-identifiable Learning Machines

Sumio Watanabe
P&I Lab., Tokyo Institute of Technology,
4259 Nagatsuta, Midori-ku, Yokohama, Japan
swatanab@pi.titech.ac.jp

May 4, 2000

ABSTRACT

This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a non-identifiable learning machine, such as a multilayer neural network, whose true parameter set is an analytic set with singular points. Using a concept from algebraic analysis, we rigorously prove that the Bayesian stochastic complexity, or free energy, is asymptotically equal to $\lambda_1 \log n - (m_1 - 1)\log\log n + \mathrm{const.}$, where $n$ is the number of training samples and $\lambda_1$ and $m_1$ are a rational number and a natural number which are determined as birational invariants of the singularities in the parameter space. We also give an algorithm that calculates $\lambda_1$ and $m_1$ based on the resolution of singularities in algebraic geometry. In regular statistical models, $2\lambda_1$ is equal to the number of parameters and $m_1 = 1$, whereas in non-regular models such as multilayer networks, $2\lambda_1$ is not larger than the number of parameters and $m_1 \geq 1$. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, non-identifiable learning machines are better models than regular ones when Bayesian ensemble learning is applied.

1 Introduction

Learning machines made of superpositions of homogeneous functions are often useful for constructing practical information systems. For example, layered neural networks, radial basis functions, and mixtures of normal distributions have been applied to many recognition and prediction systems. They are written in the form
$$\psi(x, \{a_h, b_h\}) = \sum_{h=1}^{H} a_h\, g(b_h, x),$$
where $g$ is a function, $x$ is an input vector, and $\{a_h, b_h\}$ is a set of parameters to be optimized. One of their important properties is that, in general, they do not satisfy the regularity condition for asymptotic normality of the maximum likelihood estimator (Hagiwara, Toda, & Usui, 1993; Fukumizu, 1996).

For regular statistical models (Cramer, 1949), the set of true parameters consists of only one point and the Fisher information matrix is positive definite, even if the learning model is larger than necessary to attain the true distribution; this case is called over-realizable. On the other hand, if a layered learning machine is in the over-realizable case, the set of true parameters is not one point but an analytic set with singularities, so the estimated parameters do not converge to one point. In other words, the correspondence between the set of parameters and the set of functions is not a one-to-one mapping. In this paper, such learning models are called non-identifiable.

Research on non-identifiable learning machines is important for layered neural networks for three reasons. Firstly, it is necessary for selecting the optimal model, which balances the function approximation error against the statistical estimation error. Although the true distribution is not contained in the parametric learning machine in a practical application, no information criterion can be obtained without studying the over-realizable case (Hagiwara, Toda, & Usui, 1993; Fukumizu, 1999). The distributions of estimated parameters in non-identifiable cases should also be analyzed for hypothesis testing (Dacunha-Castelle & Gassiat, 1997; Knowles & Siegmund, 1989). In these studies, the asymptotic case is mainly considered, based on the assumption that the number of training samples is sufficiently large. However, there is a similarity between the error behavior of a non-identifiable machine with large-size data and that of a complex machine with moderate-size data, since the Fisher information matrices are ill-defined in both cases. Therefore, we can expect that the asymptotic study of the non-identifiable case is useful for complex learning machines in practical applications.

Secondly, studies of non-identifiable machines will clarify the essential difference between regular statistical models and artificial neural networks. Their difference in the field of function approximation has already been studied (Barron, 1994; Mhaskar, 1996), whereas their difference in the field of statistical estimation has not. This paper shows that the generalization errors of non-identifiable learning machines with the Bayesian method are not larger than those of identifiable and regular models, which can be proven under the condition that the function approximation error is negligible compared with the statistical estimation error.

And lastly, the theory for learning machines whose loss functions cannot be approximated by any quadratic form will be the foundation for devising and improving neural network training algorithms. A lot of training algorithms have been devised on the assumption that the set of optimal parameters consists of only one point and that the Fisher information matrix is positive definite. However, this assumption is not satisfied in general. For example, we often find that a layered learning machine works very well when the functions of the hidden units are almost linearly dependent. Training algorithms should be improved so that they make learning machines attain the best performance even when they are in nearly redundant states. The maximum likelihood method is not an appropriate training algorithm for complex and layered learning machines in such states.

In this paper, in order to clarify the statistical properties of non-identifiable learning machines, we prove that the Bayesian stochastic complexity or free energy $F(n)$ has the asymptotic form
$$F(n) = \lambda_1 \log n - (m_1 - 1)\log\log n + O(1),$$
where $n$ is the number of training samples and $\lambda_1$ and $m_1$ are a rational number and a natural number, respectively, which are determined by the singularities of the set of true parameters. The learning curve is thus determined by the algebraic geometrical structure of the parameter set. We also give an algorithm to calculate $\lambda_1$ and $m_1$ using blowing-ups of singularities, and show that $2\lambda_1$ is not larger than the number of parameters. Since the increase of the stochastic complexity $F(n+1) - F(n)$ is equal to the generalization error defined by the average Kullback distance of the estimated probability density from the true one (Levin, Tishby, & Solla, 1990; Amari, Fujita, & Shinomoto, 1992; Amari & Murata, 1993), our result claims that layered neural networks are better learning machines than regular statistical models if Bayesian estimation (Akaike, 1980; Mackay, 1992) or ensemble learning is applied in training.

The free energy $F(n)$, which is an important observable in Bayesian statistics, information theory, and mathematical physics, has many other names and applications. For example, it is called the Bayesian criterion in Bayesian model selection (Schwarz, 1974), stochastic complexity in universal coding (Rissanen, 1986; Yamanishi, 1998), Akaike's Bayesian criterion in the optimization of hyperparameters (Akaike, 1980), and the evidence in neural network learning (Mackay, 1992). In almost all of these studies, $F(n)$ was calculated using the Gaussian approximation or the saddle point approximation, based on the assumption that the loss function can be approximated by a quadratic form around the one true parameter. In neural network learning, we cannot use such an approximation. This paper constructs a general formula which enables us to analyze both identifiable and non-identifiable models in the same way. To study a loss function whose zero points contain singularities, we employ the Sato-Bernstein polynomial, or the so-called b-function, in algebraic analysis, which extracts algebraic information from the set of true parameters. We also construct an algorithm to calculate the constants $\lambda_1$ and $m_1$ using the resolution of singularities in algebraic geometry. Resolution of singularities transforms an integral of several variables into a direct product of integrals of one variable, and enables us to algorithmically calculate the learning efficiency of an arbitrary learning machine in Bayesian estimation.
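
To make the notion of non-identifiability concrete, the following small Python sketch (an illustration added here, not part of the original paper) shows that for a two-hidden-unit tanh network of the superposition form above, quite different parameter vectors can realize exactly the same input-output function, so the map from parameters to functions is not one-to-one.

```python
import numpy as np

def psi(x, a, b):
    """Superposition model psi(x) = sum_h a_h * tanh(b_h * x)."""
    return sum(a_h * np.tanh(b_h * x) for a_h, b_h in zip(a, b))

x = np.linspace(-3.0, 3.0, 101)

# A redundant unit: when a_2 = 0, the weight b_2 is completely free,
# so infinitely many parameter vectors give the same function.
print(np.allclose(psi(x, [0.5, 0.0], [1.0, 2.0]),
                  psi(x, [0.5, 0.0], [1.0, -7.3])))   # True

# Permuting the hidden units also leaves the function unchanged.
print(np.allclose(psi(x, [0.3, -0.2], [1.0, 1.5]),
                  psi(x, [-0.2, 0.3], [1.5, 1.0])))   # True
```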

2 Main Results

Let $p(y|x, w)$ be a conditional probability density of an output $y \in \mathbf{R}^N$ for a given input $x \in \mathbf{R}^M$ and a given parameter $w \in \mathbf{R}^d$, which represents the probabilistic inference of a learning machine. Let $\varphi(w)$ be a probability density function on the parameter space $\mathbf{R}^d$, whose support is denoted by $W = \mathrm{supp}\,\varphi \subset \mathbf{R}^d$. We assume that training or empirical sample pairs $\{(x_i, y_i);\ i = 1,2,\ldots,n\}$ are independently taken from $q(y|x)q(x)$, where $q(x)$ and $q(y|x)$ represent the true input probability and the true inference probability, respectively. In Bayesian estimation, the estimated inference $p_n(y|x)$ is the average over the a posteriori ensemble,
$$p_n(y|x) = \int p(y|x, w)\,\rho_n(w)\,dw, \qquad \rho_n(w) = \frac{1}{Z_n}\,\varphi(w)\prod_{i=1}^{n} p(y_i|x_i, w),$$
where $Z_n$ is the constant which ensures $\int \rho_n(w)\,dw = 1$. The generalization error is defined as the average Kullback distance of the estimated probability density $p_n(y|x)$ from the true one $q(y|x)$,
$$K(n) = E_n\Bigl\{ \int \log \frac{q(y|x)}{p_n(y|x)}\, q(x, y)\,dx\,dy \Bigr\},$$
where $E_n\{\cdot\}$ denotes the expectation value over all sets of training sample pairs. In this paper we mainly consider the statistical estimation error and assume that the model can attain the true inference; in other words, there exists a parameter $w_0 \in W$ such that $p(y|x, w_0) = q(y|x)$.

Let us define the average and empirical loss functions,
$$f(w) = \int \log \frac{q(y|x)}{p(y|x, w)}\, q(y|x)q(x)\,dx\,dy, \qquad
f_n(w) = \frac{1}{n}\sum_{i=1}^{n} \log \frac{q(y_i|x_i)}{p(y_i|x_i, w)}.$$
Note that $f(w) \geq 0$ is the Kullback information. By the assumption, the set of true parameters $W_0 = \{w \in W;\ f(w) = 0\}$ is not an empty set. If $f(w)$ is an analytic function, then $W_0$ is called an analytic set of $f(w)$. If $f(w)$ is a polynomial, then $W_0$ is called an algebraic variety. Remark that $W_0$ is not a manifold in general, since no coordinates can be introduced in the neighborhood of singular points. For example, layered neural networks such as three-layered perceptrons and radial basis functions have many singular points.

Remark. Even if $W_0$ consists of one point, or if the true distribution is not contained in the parametric models ($W_0$ is an empty set), the finiteness of the number of training samples often makes $W_0$ appear non-identifiable in neural network models. For the purpose of studying such a case rigorously, we assume that $W_0$ is not empty.

From these definitions, it is proven in (Levin, Tishby, & Solla, 1990; Amari, 1993) that the average Kullback distance $K(n)$ is equal to the increase of the Bayesian stochastic complexity or free energy $F(n)$,
$$K(n) = F(n+1) - F(n),$$
where $F(n)$ is defined by
$$F(n) = -E_n\Bigl\{\log \int \exp(-n f_n(w))\,\varphi(w)\,dw\Bigr\}.$$
In this paper, we show the rigorous asymptotic form of $F(n)$ and clarify the algebraic geometrical structure of the stochastic complexity. Theorems 1 and 2 are the main results of this paper. Let $C_0^\infty$ be the set of all compactly supported $C^\infty$-class functions on $\mathbf{R}^d$.

Theorem 1 Assume that $f(w)$ is an analytic function and $\varphi(w)$ is a probability density function on $\mathbf{R}^d$. Then there exists a real constant $C_1$ such that for any natural number $n$,
$$F(n) \leq \lambda_1 \log n - (m_1 - 1)\log\log n + C_1, \qquad (1)$$
where the rational number $\lambda_1$ ($\lambda_1 > 0$) and the natural number $m_1$ are determined so that $(-\lambda_1)$ is the largest pole, and $m_1$ its multiplicity, of the meromorphic function that is analytically continued from
$$J(\lambda) = \int_{f(w) < \epsilon} f(w)^{\lambda}\, \tilde{\varphi}(w)\,dw \qquad (\mathrm{Re}(\lambda) > 0),$$
where $\epsilon > 0$ is a sufficiently small constant and $\tilde{\varphi}(w)$ is an arbitrary nonzero $C_0^\infty$-class function that satisfies $0 \leq \tilde{\varphi}(w) \leq \varphi(w)$.
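
As a worked illustration of Theorem 1 (an example added here, not taken from the paper, and ignoring the smoothness requirements on the prior), consider the two-dimensional toy loss $f(w) = w_1^2 w_2^2$ with a uniform weight on $[0,1]^2$. Then
$$J(\lambda) = \int_0^1\!\!\int_0^1 (w_1^2 w_2^2)^{\lambda}\,dw_1\,dw_2 = \Bigl(\int_0^1 w^{2\lambda}\,dw\Bigr)^2 = \frac{1}{(2\lambda+1)^2},$$
which extends to a meromorphic function with a single pole at $\lambda = -1/2$ of multiplicity $2$ (integrating over the whole square rather than $\{f < \epsilon\}$ changes $J$ only by an entire function, so the pole is the same). Hence $\lambda_1 = 1/2$ and $m_1 = 2$, and the bound (1) reads $F(n) \leq \tfrac{1}{2}\log n - \log\log n + C_1$, whereas a regular model with the same number of parameters would have $\lambda_1 = d/2 = 1$ and $m_1 = 1$.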

Proof of Theorem 1 is shown in section 3. It is also proven in Lemma 2 of Section 3 that all poles of the function $J(\lambda)$ are on the negative part of the real axis. In Theorem 1, the largest pole means the largest one as a real value. For Theorem 2, we define a condition.

Condition (A) Let $\psi(x, w)$ be a vector-valued function of $(x, w) \in \mathbf{R}^M \times \mathbf{R}^d$. We define two conditions on $\psi(x, w)$.
(1) $\psi(x, w)$ is an analytic function of $w \in W = \mathrm{supp}\,\varphi \subset \mathbf{R}^d$ which can be extended as a holomorphic function on some complex open set $W^*$, where $W \subset W^* \subset \mathbf{C}^d$ and $W^*$ is independent of $x \in \mathrm{supp}\,q \subset \mathbf{R}^M$.
(2) $\psi(x, w)$ is a measurable function of $x \in \mathbf{R}^M$ that satisfies the condition
$$\int \sup_{w \in W} \|\psi(x, w)\|^2\, q(x)\,dx < \infty, \qquad (2)$$
where $\|\cdot\|$ is the norm of the vector $\psi(x, w)$.

Theorem 2 Let $\sigma > 0$ be a constant value. Assume that $\varphi(w)$ is a $C_0^\infty$-class probability density function. Let us consider the model
$$p(y|x, w) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\Bigl(-\frac{\|y - \psi(x, w)\|^2}{2\sigma^2}\Bigr),$$
where both $\psi(x, w)$ and $\|\psi(x, w)\|^2$ satisfy Condition (A). Then there exists a constant $C_2 > 0$ such that for any natural number $n$,
$$F(n) \geq \lambda_1 \log n - (m_1 - 1)\log\log n - C_2,$$
where the rational number $\lambda_1$ ($\lambda_1 > 0$) and the natural number $m_1$ are determined so that $(-\lambda_1)$ is the largest pole, and $m_1$ its multiplicity, of the meromorphic function that is analytically continued from
$$J(\lambda) = \int_{f(w) < \epsilon} f(w)^{\lambda}\, \varphi(w)\,dw \qquad (\mathrm{Re}(\lambda) > 0),$$
where $\epsilon > 0$ is a sufficiently small constant.

This theorem determines the asymptotic behavior of the stochastic complexity. The proof of Theorem 2 is shown in section 4.

Remark. A real-valued analytic function $\psi(x, w)$ of $x \in \mathbf{R}^M$ and $w \in \mathbf{R}^d$ is said to have the associated convergence radii $(r_1, r_2, \ldots, r_d)$ at $(x, \hat{w})$ if the Taylor expansion
$$\psi(x, w) = \sum_{j_1, j_2, \ldots, j_d} a_{j_1, j_2, \ldots, j_d}(x)\,(w_1 - \hat{w}_1)^{j_1}(w_2 - \hat{w}_2)^{j_2}\cdots(w_d - \hat{w}_d)^{j_d}$$
absolutely converges in $\{w \in \mathbf{R}^d;\ |w_j - \hat{w}_j| < r_j\ (j = 1,\ldots,d)\}$ and diverges in $\{w \in \mathbf{R}^d;\ |w_j - \hat{w}_j| > r_j\ (j = 1,\ldots,d)\}$. For an $N$-dimensional vector-valued analytic function, the convergence radii are defined as $(\min r_1, \min r_2, \ldots, \min r_d)$, where $\min$ denotes the minimum among the corresponding $N$ values. The associated convergence radii at $(x, \hat{w})$ are denoted by $(r_1(x, \hat{w}), \ldots, r_d(x, \hat{w}))$. A real analytic function $\psi(x, w)$ can be extended as a holomorphic function on some open set $W^* \subset \mathbf{C}^d$ independent of $x$ if and only if
$$\min_{1 \leq j \leq d}\ \inf_{x \in K}\ \inf_{w \in W} r_j(x, w) > 0$$
holds.

In Theorems 1 and 2, we introduced the meromorphic function $J(\lambda)$, whose largest pole and its multiplicity determine the learning efficiency of the model. It is an important fact that $J(\lambda)$ is invariant under the transform
$$\int f(w)^{\lambda}\,\varphi(w)\,dw \;\longrightarrow\; \int f(g(u))^{\lambda}\,\varphi(g(u))\,|g'(u)|\,du,$$
where $g: U \to W$ is an arbitrary analytic function from some parameter space $U$ to the given parameter space $W$ and $|g'(u)|$ is its Jacobian. Therefore the constants $\lambda_1$ and $m_1$ are also invariant under the above transform. The analytic function $g$, which need not have an inverse function $g^{-1}$, is sometimes called a birational mapping, and algebraic geometry clarifies the geometrical structure of the parameter space which is invariant under birational mappings.

To consider the learning curve or the generalization error, let us introduce the definition of the asymptotic expansion. Let $\{s_i(n),\ i = 1,2,3,\ldots\}$ be a sequence of real-valued functions of a natural number $n$ which satisfies
$$\lim_{n \to \infty} \frac{s_{i+1}(n)}{s_i(n)} = 0 \qquad (i = 1,2,\ldots).$$
This condition is referred to as $s_{i+1}(n) = o(s_i(n))$ $(i = 1,2,3,\ldots)$. If a real-valued function $s(n)$ satisfies the condition
$$\lim_{n \to \infty} \frac{1}{s_k(n)}\Bigl\{ s(n) - \sum_{i=1}^{k} a_i s_i(n) \Bigr\} = 0 \qquad (k = 1,2,3,\ldots,K),$$
where $\{a_i\}$ are real values, then we say that $s(n)$ has an asymptotic expansion
$$s(n) \cong \sum_{i=1}^{K} a_i s_i(n). \qquad (3)$$
Note that the coefficients $\{a_i\}$ are determined uniquely. This definition includes the case $K = \infty$. For a real-valued function $s(x)$ of a real variable $x$, the asymptotic expansion for $x \to \alpha$ ($\alpha$ is some value) is defined in the same way.

Based on Theorem 2, the function $c(n)$ defined by
$$c(n) = F(n) - \lambda_1 \log n + (m_1 - 1)\log\log n$$
is a bounded function, $|c(n)| < C_2$. If $c(n+1) - c(n) = o(1/(n\log n))$, then it follows that
$$K(n) = \frac{\lambda_1}{n} - \frac{m_1 - 1}{n\log n} + o\Bigl(\frac{1}{n\log n}\Bigr),$$
which gives the learning curve of a non-identifiable learning machine.

Corollary 1 Assume the same condition as Theorem 2. If $c(n+1) - c(n) = o(1/(n\log n))$, then the learning curve is given by
$$K(n) \cong \frac{\lambda_1}{n} - \frac{m_1 - 1}{n\log n}.$$

As is well known, regular statistical models have $\lambda_1 = d/2$ and $m_1 = 1$, which can be shown as a special case of Theorem 2 (see Example 1). Non-identifiable models such as neural networks have different values $\lambda_1 \leq d/2$ and $m_1 \geq 1$, in general.

Corollary 2 Assume the same condition as Theorem 1. If $\tilde{\varphi}(w)$ can be taken so that $\tilde{\varphi}(w) > 0$ for every $w \in W_0$, then $\lambda_1 \leq d/2$, where $d$ is the dimension of the parameter space.

Corollary 2 is proven in section 5. Even for a non-identifiable learning machine, we can calculate $\lambda_1$ and $m_1$ based on the resolution of singularities in algebraic geometry. An algorithm to calculate $\lambda_1$ and $m_1$ for a given learning machine is also shown in section 5.
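
As a rough numerical illustration of this asymptotic form (my own sketch, not from the paper), the following Python code evaluates the deterministic quantity $-\log\int \exp(-n f(w))\,\varphi(w)\,dw$ by Monte Carlo for the same toy loss $f(w) = w_1^2 w_2^2$ with a uniform prior on $[0,1]^2$, for which $\lambda_1 = 1/2$ and $m_1 = 2$; this quantity shares the $\lambda_1 \log n - (m_1 - 1)\log\log n$ behavior, so its difference from the prediction should stay bounded as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def free_energy(n, num_samples=2_000_000):
    """Monte Carlo estimate of -log E_w[exp(-n f(w))]
    for f(w) = w1^2 * w2^2 with w uniform on [0,1]^2."""
    w = rng.random((num_samples, 2))
    f = (w[:, 0] * w[:, 1]) ** 2
    return -np.log(np.mean(np.exp(-n * f)))

for n in [10, 100, 1000, 10000]:
    predicted = 0.5 * np.log(n) - np.log(np.log(n))  # lambda_1 = 1/2, m_1 = 2
    print(n, round(free_energy(n), 3), round(predicted, 3))
# The difference between the two columns should remain bounded as n grows,
# whereas a regular 2-parameter model would grow like log(n), i.e. twice as fast.
```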

3 Proof of Theorem 1

In this section and the following section, we show the proofs of Theorems 1 and 2. These proofs not only establish the theorems but also clarify the mathematical structure of Bayesian learning in artificial neural network models.

Lemma 1 Assume that $f(w)$ and $f_n(w)$ are continuous functions and that $\varphi(w)$ is a probability density function. Then the inequality
$$F(n) \leq -\log \int \exp(-n f(w))\,\varphi(w)\,dw$$
holds for an arbitrary natural number $n$.

[Proof of Lemma 1] From Jensen's inequality,
$$\log \int \exp(a(w))\,b(w)\,dw \geq \int a(w)\,b(w)\,dw$$
holds for an arbitrary continuous function $a(w)$ and an arbitrary compactly supported probability density $b(w)$. First, assume that $\varphi(w)$ is a compactly supported function. By applying the above inequality to the special case
$$a(w) = -n\{f_n(w) - f(w)\}, \qquad b(w) = \frac{1}{Y}\exp(-n f(w))\,\varphi(w),$$
where $Y = \int \exp(-n f(w))\,\varphi(w)\,dw$, we obtain
$$\log \frac{1}{Y}\int \exp(-n f_n(w))\,\varphi(w)\,dw \geq -\frac{1}{Y}\int n\{f_n(w) - f(w)\}\exp(-n f(w))\,\varphi(w)\,dw.$$
By using $E_n\{f_n(w) - f(w)\} = 0$ and Fubini's theorem, it follows that
$$E_n\Bigl\{\log \int \exp(-n f_n(w))\,\varphi(w)\,dw\Bigr\} \geq \log \int \exp(-n f(w))\,\varphi(w)\,dw$$
for an arbitrary compactly supported function $\varphi(w)$. Since the set of all compactly supported functions is dense in the set of all probability density functions in the $L^1$ norm, Lemma 1 is obtained. (Q.E.D.)

For a given $\epsilon > 0$, a set of parameters $W_\epsilon$ is defined by
$$W_\epsilon = \{w \in W = \mathrm{supp}\,\varphi;\ f(w) < \epsilon\}.$$
We introduce the Sato-Bernstein b-function.

Theorem 3 (Sato, Bernstein, Björk, Kashiwara) Assume that there exists $\epsilon_0 > 0$ such that $f(w)$ is an analytic function in $W_{\epsilon_0}$. Then there exists a triple $(\epsilon, P, b)$, where (1) $\epsilon < \epsilon_0$ is a positive constant, (2) $P = P(\lambda, w, \partial_w)$ is a differential operator which is a polynomial in $\lambda$, and (3) $b(\lambda)$ is a polynomial, such that
$$P(\lambda, w, \partial_w)\, f(w)^{\lambda+1} = b(\lambda)\, f(w)^{\lambda} \qquad (\forall w \in W_\epsilon,\ \forall \lambda \in \mathbf{C}).$$
The zeros of the algebraic equation $b(\lambda) = 0$ are real, rational, and negative numbers.

[Explanation of Theorem 3] In Theorem 3, $P = P(\lambda, w, \partial_w)$ is a finite-order differential operator in $w$ whose coefficients are analytic functions of $w$; $P$ can be understood as a mapping from $C^\infty$ to $C^\infty$. The formal adjoint operator $P^*$, defined in the same way as for the usual partial differential operators, satisfies the relation
$$\int \phi(w)\,P(\lambda, w, \partial_w)\,\varphi(w)\,dw = \int \{P^*(\lambda, w, \partial_w)\,\phi(w)\}\,\varphi(w)\,dw$$
for any $\phi \in C^\infty$ and $\varphi \in C_0^\infty$. Theorem 3 was proven based on the algebraic properties of the ring of partial differential operators; see the references (Bernstein, 1972; Sato & Shintani, 1974; Björk, 1979). The rationality of the zeros of $b(\lambda) = 0$ is shown based on the resolution of singularities (Atiyah, 1970; Kashiwara, 1976). The smallest-order polynomial $b(\lambda)$ that satisfies the above relation is called the Sato-Bernstein polynomial or the b-function. Recently, an algorithm to calculate the b-function has been developed (Oaku, 1997). [End of Explanation of Theorem 3]

Remark. By setting $\epsilon > 0$ small enough, we can assume
$$f(w) > 0 \qquad (\forall w \in W_\epsilon \setminus W_0), \qquad (4)$$
because $f(w)$ is analytic. Hereafter $\epsilon > 0$ is taken small enough that both Theorem 3 and eq. (4) hold. We can also assume $\epsilon < 1$ without loss of generality. For a given analytic function $f(w)$, let us define a complex function $J(\lambda)$ of $\lambda \in \mathbf{C}$ by
$$J(\lambda) = \int_{W_\epsilon} f(w)^{\lambda}\,\varphi(w)\,dw.$$
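
Before stating Lemma 2, here is a standard one-dimensional instance of Theorem 3 (a textbook example added for illustration, not part of the original text). For $f(w) = w^2$, take $P(\lambda, w, \partial_w) = \tfrac{1}{4}\partial_w^2$; then
$$\frac{1}{4}\,\partial_w^2\,(w^2)^{\lambda+1} = \frac{1}{4}(2\lambda+2)(2\lambda+1)\,(w^2)^{\lambda} = (\lambda+1)\Bigl(\lambda+\tfrac{1}{2}\Bigr)(w^2)^{\lambda},$$
so $b(\lambda) = (\lambda+1)(\lambda+\tfrac{1}{2})$, whose zeros $-1$ and $-\tfrac{1}{2}$ are indeed real, rational, and negative. The largest zero $-\tfrac{1}{2}$ coincides, for a weight $\varphi$ positive at the origin, with the largest pole of $J(\lambda) = \int_{|w|<\epsilon} (w^2)^{\lambda}\varphi(w)\,dw$, in agreement with $\lambda_1 = 1/2$ for a one-dimensional regular model.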

Lemma 2 Assume that $f(w)$ is an analytic function in $W_\epsilon$ and that $\varphi(w)$ is a $C_0^\infty$-class function. Then $J(\lambda)$ can be analytically continued to a meromorphic function on the entire complex plane; in other words, $J(\lambda)$ has only poles in $|\lambda| < \infty$. Moreover, $J(\lambda)$ satisfies the following conditions.
(1) The poles of $J(\lambda)$ are rational, real, and negative numbers.
(2) For an arbitrary $a \in \mathbf{R}$, $\lim_{b \to +\infty} J(b + a\sqrt{-1}) = 0$ and $\lim_{b \to \pm\infty} J(a + b\sqrt{-1}) = 0$.

[Proof of Lemma 2] $J(\lambda)$ is an analytic function in the region $\mathrm{Re}(\lambda) > 0$. First, $\lim_{b \to +\infty} J(b + a\sqrt{-1}) = 0$ is shown by Lebesgue's convergence theorem. Second, let us show $\lim_{b \to \pm\infty} J(a + b\sqrt{-1}) = 0$ for fixed $a > 0$. For $a > 0$ we define a function
$$\hat{J}_a(t) = \exp(at) \int_{W_\epsilon} \delta(t - \log f(w))\,\varphi(w)\,dw,$$
where $\hat{J}_a(t)$ is well defined, because $\delta(g(w))$ is well defined if $\nabla g(w) \neq 0$ for any $w$ that satisfies $g(w) = 0$ (Gel'fand & Shilov, 1964). From the definition, $\hat{J}_a(t)$ is an $L^1$ function, and
$$J(a \pm b\sqrt{-1}) = \int_{-\infty}^{\log\epsilon} \hat{J}_a(t)\,\exp(\pm b t\sqrt{-1})\,dt,$$
which shows that we can apply the Riemann-Lebesgue theorem, so that $J(a + b\sqrt{-1}) \to 0$ as $b \to \pm\infty$ for fixed $a > 0$. Finally, by using the formal adjoint operator $P^*$,
$$J(\lambda) = \frac{1}{b(\lambda)}\int_{W_\epsilon} \{P f(w)^{\lambda+1}\}\,\varphi(w)\,dw = \frac{1}{b(\lambda)}\int_{W_\epsilon} f(w)^{\lambda+1}\, P^*\varphi(w)\,dw,$$
where we used the property of the b-function. Because $P^*\varphi \in C_0^\infty$, this relation analytically continues $J(\lambda)$ from the region $\mathrm{Re}(\lambda) > 0$ to $\mathrm{Re}(\lambda) > -1$ wherever $b(\lambda) \neq 0$, and by repeating the procedure, to the entire complex plane. By analytic continuation, $J(a + b\sqrt{-1}) \to 0$ ($b \to \pm\infty$) holds even for $a < 0$. If $b(\lambda) = 0$, then such $\lambda$ is at most a pole, which lies on the negative part of the real axis. (Q.E.D.)

Definition. The poles of the function $J(\lambda)$ lie on the negative part of the real axis and are contained in the set $\{-m + \nu;\ m = 0, 1, 2, \ldots,\ b(\nu) = 0\}$. They are ordered from the largest to the smallest and referred to as $-\lambda_1, -\lambda_2, -\lambda_3, \ldots$ ($\lambda_k > 0$ is a rational number), and the multiplicity of $-\lambda_k$ is denoted by $m_k$.

We define a real-valued function $I(t)$ by
$$I(t) = \int_{W_\epsilon} \delta(t - f(w))\,\varphi(w)\,dw \qquad (0 < t < \epsilon). \qquad (5)$$
Here, since $f(w) > 0$ for $w \in W_\epsilon \setminus W_0$, $\delta(t - f(w))$ is well defined (Gel'fand & Shilov, 1964). For $t \geq \epsilon$ or $t < 0$ we define $I(t) = 0$.

Lemma 3 Assume that $f(w)$ is an analytic function in $W_\epsilon$ and that $\varphi(w)$ is a $C_0^\infty$-class function. Then $I(t)$ has an asymptotic expansion for $t \to 0$,
$$I(t) \cong \sum_{k=1}^{\infty}\sum_{m=0}^{m_k - 1} c_{k,m+1}\, t^{\lambda_k - 1}(-\log t)^{m}, \qquad (6)$$
where $m!\,c_{k,m+1}$ is the coefficient of $(\lambda + \lambda_k)^{-(m+1)}$ in the Laurent expansion of $J(\lambda)$ at the pole $\lambda = -\lambda_k$.

[Proof of Lemma 3] A special case of this lemma is shown in the theory of distributions (Gel'fand & Shilov, 1964). Let us define
$$I_K(t) \equiv I(t) - \sum_{k=1}^{K}\sum_{m=0}^{m_k - 1} c_{k,m+1}\, t^{\lambda_k - 1}(-\log t)^{m}.$$
For eq. (6), it is sufficient to show that, for an arbitrary fixed $K$,
$$\lim_{t \to 0} I_K(t)\, t^{\lambda} = 0 \qquad (\forall \lambda > -\lambda_{K+1} + 1). \qquad (7)$$
From the definition of $J(\lambda)$, $J(\lambda) = \int_0^1 I(t)\,t^{\lambda}\,dt$. A simple calculation shows
$$\int_0^1 t^{\lambda + \lambda_k - 1}(-\log t)^{m}\,dt = \frac{m!}{(\lambda + \lambda_k)^{m+1}}.$$
Therefore $\int_0^1 I_K(t)\,t^{\lambda}\,dt = J_K(\lambda)$, where $J_K(\lambda)$ is defined by
$$J_K(\lambda) = J(\lambda) - \sum_{k=1}^{K}\sum_{m=0}^{m_k - 1} \frac{m!\,c_{k,m+1}}{(\lambda + \lambda_k)^{m+1}},$$
which is holomorphic in the region $\mathrm{Re}(\lambda) > -\lambda_{K+1}$. By putting $t = e^{-x}$ and using the inverse Laplace transform,
$$I_K(e^{-x})\,e^{-x} = \frac{1}{2\pi i}\int_{\tau - i\infty}^{\tau + i\infty} J_K(u)\,e^{ux}\,du \qquad (8)$$
holds for any real $\tau > 0$. By Lemma 2, $J(a + b\sqrt{-1}) \to 0$ as $b \to \pm\infty$, thus the same holds for $J_K$ for an arbitrary real $a$. Since $J_K(\lambda)$ is holomorphic in the region $\mathrm{Re}(\lambda) > -\lambda_{K+1}$, the integration path in eq. (8) can be moved from $\tau > 0$ to any $\tau > -\lambda_{K+1}$. Therefore,
$$I_K(e^{-x})\,e^{-x}\,e^{-\tau x} = \frac{1}{2\pi}\int_{-\infty}^{+\infty} J_K(\tau + iu)\,e^{iux}\,du \qquad (9)$$
holds for any real $x$. Hence, by putting $x = 0$, $J_K(\tau + iu) \in L^1$ as a function of $u$. The right-hand side of eq. (9) goes to zero as $x \to \infty$ by the Riemann-Lebesgue theorem. Here, by putting $x = -\log t$, we obtain eq. (7). (Q.E.D.)

[Proof of Theorem 1] By combining the above results, we have
$$F(n) \leq -\log \int_W \exp(-n f(w))\,\varphi(w)\,dw \leq -\log \int_{W_\epsilon} \exp(-n f(w))\,\tilde{\varphi}(w)\,dw$$
$$= -\log \int_0^1 e^{-nt} I(t)\,dt = -\log\Bigl\{\frac{1}{n}\int_0^n e^{-t}\, I\bigl(\tfrac{t}{n}\bigr)\,dt\Bigr\}$$
$$= -\log\Bigl\{\sum_{k}\sum_{m=0}^{m_k-1}\sum_{j=0}^{m} c_{k,m+1}\binom{m}{j}\frac{(\log n)^{j}}{n^{\lambda_k}}\int_0^n e^{-t}\, t^{\lambda_k - 1}(-\log t)^{m-j}\,dt\Bigr\}$$
$$= \lambda_1 \log n - (m_1 - 1)\log\log n + O(1),$$
where $I(t)$ is defined by eq. (5) with $\tilde{\varphi}(w)$ instead of $\varphi(w)$. (Q.E.D.)

4 Proof of Theorem 2

Hereafter, for simplicity of description, we assume that the model is given by
$$p(y|x, w) = \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}(y - \psi(x, w))^2\Bigr).$$
It is easy to extend the proof to the case of a general standard deviation ($\sigma > 0$) and a general output dimension ($N > 1$). For this model,
$$f(w) = \frac{1}{2}\int (\psi(x, w) - \psi(x, w_0))^2\, q(x)\,dx,$$
$$f_n(w) = \frac{1}{2n}\sum_{i=1}^{n}(\psi(x_i, w) - \psi(x_i, w_0))^2 - \frac{1}{n}\sum_{i=1}^{n}\eta_i\,(\psi(x_i, w) - \psi(x_i, w_0)),$$
where $\{\eta_i \equiv y_i - \psi(x_i, w_0)\}$ are independent samples from the standard normal distribution, and $w_0$ is an arbitrary element in $W_0$.

Lemma 4 Let $\{(x_i, \eta_i)\}_{i=1}^{n}$ be a set of independent samples taken from $q(x)q_0(\eta)$, where $q(x)$ is a probability density and $q_0(\eta)$ is the standard normal distribution. Assume that a function $\xi(x, w)$ satisfies Condition (A) and that the Taylor expansion of $\xi(x, w)$ at $\hat{w}$ absolutely converges in a region $T = \{w;\ |w_j - \hat{w}_j| < r_j\}$. For a given constant $0 < a < 1$, we define the region $T_a \equiv \{w;\ |w_j - \hat{w}_j| < a r_j\}$. Then the following hold.
(1) If $\int \xi(x, w)\,q(x)\,dx = 0$, there exists a constant $c$ such that for an arbitrary $n$,
$$A_n \equiv E_n\Bigl\{\sup_{w \in T_a}\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi(x_i, w)\Bigr|^2\Bigr\} < c < \infty.$$
(2) There exists a constant $c$ such that for an arbitrary $n$,
$$B_n \equiv E_n\Bigl\{\sup_{w \in T_a}\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\eta_i\,\xi(x_i, w)\Bigr|^2\Bigr\} < c < \infty.$$

[Proof of Lemma 4] We show (1); statement (2) can be proven in the same way. We denote $k = (k_1, k_2, \ldots, k_d)$ and
$$\xi(x, w) = \sum_{k} a_k(x)\,(w - \hat{w})^{k} = \sum_{k_1, \ldots, k_d = 0}^{\infty} a_{k_1 \cdots k_d}(x)\,(w_1 - \hat{w}_1)^{k_1}\cdots(w_d - \hat{w}_d)^{k_d}.$$
This power series absolutely converges in $T$, therefore $\xi(x, w)$ can be extended as a holomorphic function in $T$. By using Cauchy's integral formula for functions of several complex variables,
$$a_k(x) = \frac{1}{(2\pi i)^d}\oint_{C_1}\cdots\oint_{C_d} \frac{\xi(x, w)}{\prod_j (w_j - \hat{w}_j)^{k_j + 1}}\,dw_1\cdots dw_d, \qquad (10)$$
where $C_j$ is a circle with radius $r_j - \delta$. We define a function $M(x)$ by $M(x) = \sup_{w \in T_a}|\xi(x, w)|$. Then, by eq. (2) in Condition (A),
$$M \equiv \Bigl\{\int M(x)^2\,q(x)\,dx\Bigr\}^{1/2} < \infty,$$
and there exists $\delta > 0$ such that
$$|a_k(x)| \leq \frac{M(x)}{\prod_{j=1}^{d}(r_j - \delta)^{k_j}}. \qquad (11)$$

By the assumption, $\int a_k(x)\,q(x)\,dx = 0$. Thus
$$E_n\Bigl\{\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n} a_k(x_i)\Bigr|^2\Bigr\}^{1/2} = \Bigl\{\int a_k(x)^2\,q(x)\,dx\Bigr\}^{1/2} \leq \frac{M}{\prod_j (r_j - \delta)^{k_j}}.$$
By using Lemma 5 in the Appendix, we obtain
$$A_n^{1/2} = E_n\Bigl\{\sup_{w \in T_a}\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi(x_i, w)\Bigr|^2\Bigr\}^{1/2}
= E_n\Bigl\{\sup_{w \in T_a}\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\sum_{k} a_k(x_i)(w - \hat{w})^{k}\Bigr|^2\Bigr\}^{1/2}$$
$$\leq \sum_{k} E_n\Bigl\{\sup_{w \in T_a}\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n} a_k(x_i)(w - \hat{w})^{k}\Bigr|^2\Bigr\}^{1/2}
\leq M\sum_{k}\prod_{j=1}^{d}\Bigl(\frac{a r_j}{r_j - \delta}\Bigr)^{k_j} < \infty,$$
where $\delta$ is taken so that $a r_j < r_j - \delta$ $(j = 1,2,\ldots,d)$. (Q.E.D.)

The function $\zeta_n(w)$, which represents the fluctuation of learning, is defined by
$$\zeta_n(w) = \frac{\sqrt{n}\,(f(w) - f_n(w))}{\sqrt{f(w)}}.$$
Note that $\zeta_n(w)$ is an analytic function in $W \setminus W_0$. At $w_0 \in W_0$, $\zeta_n(w)$ may be discontinuous, but the following theorem ensures that it is bounded on average.

Theorem 4 Assume the same condition as Theorem 2. Then there exists a constant $c$ such that for an arbitrary $n$,
$$E_n\Bigl\{\sup_{w \in W \setminus W_0}|\zeta_n(w)|^2\Bigr\} < c.$$

[Proof of Theorem 4] Outside of a neighborhood of $W_0$, this theorem can be proven by Lemma 4. Hence we can assume that $W = W_\epsilon$. Since $W$ is compact, $W$ is covered by a union of finitely many small open sets. Thus we can assume that $w$ is in a neighborhood $U$ of $w_0 \in W_0$ whose closure is contained within the associated convergence radii. We define
$$a_n^{(1)}(w) = \frac{1}{n}\sum_{i=1}^{n}\eta_i\,(\psi(x_i, w) - \psi(x_i, w_0)),$$
$$a_n^{(2)}(w) = \frac{1}{2}\int(\psi(x, w) - \psi(x, w_0))^2\,q(x)\,dx - \frac{1}{2n}\sum_{i=1}^{n}(\psi(x_i, w) - \psi(x_i, w_0))^2,$$
where $\{\eta_i\}$ are samples independently taken from the standard normal distribution. By using these definitions,
$$\zeta_n(w) = \frac{\sqrt{n}}{\sqrt{f(w)}}\,\{a_n^{(1)}(w) + a_n^{(2)}(w)\}.$$
We define $a^{(j)}(n)$ $(j = 1, 2)$ by
$$a^{(j)}(n) = E_n\Bigl\{\sup_{w \in U \setminus W_0}\frac{n\,|a_n^{(j)}(w)|^2}{f(w)}\Bigr\}.$$
It follows that
$$E_n\Bigl\{\sup_{w \in U \setminus W_0}|\zeta_n(w)|^2\Bigr\} \leq 2a^{(1)}(n) + 2a^{(2)}(n).$$
For the proof of Theorem 4, it is sufficient to show that $\{a^{(j)}(n);\ j = 1,2\}$ are bounded.

First, we show the finiteness of $a^{(1)}(n)$. From Lemma 6 in the Appendix, there exists a finite set of functions $\{g_j, h_j\}_{j=1}^{J}$, where $g_j(w)$ is a real-valued analytic function and $h_j(x, w)$ is a function which satisfies Condition (A), such that
$$\psi(x, w) - \psi(x, w_0) = \sum_{j=1}^{J} g_j(w)\,h_j(x, w), \qquad (12)$$
where the matrix $M_{jk}(w) \equiv \int h_j(x, w)h_k(x, w)\,q(x)\,dx$ is positive definite. Let $\alpha > 0$ be taken smaller than the minimum eigenvalue of the matrices $\{M_{jk}(w);\ w \in U\}$. By the definition,
$$f(w) = \frac{1}{2}\int(\psi(x, w) - \psi(x, w_0))^2\,q(x)\,dx = \frac{1}{2}\sum_{j,k=1}^{J} M_{jk}(w)\,g_j(w)g_k(w) \geq \frac{\alpha}{2}\sum_{j=1}^{J} g_j(w)^2,$$
by taking $\epsilon > 0$ small. Therefore, by using the Cauchy-Schwarz inequality,
$$a^{(1)}(n) = E_n\Bigl\{\sup_{w \in U \setminus W_0}\frac{1}{f(w)}\Bigl|\sum_{j=1}^{J} g_j(w)\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\eta_i\,h_j(x_i, w)\Bigr|^2\Bigr\}
\leq \frac{2}{\alpha}\sum_{j=1}^{J} E_n\Bigl\{\sup_{w \in U \setminus W_0}\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\eta_i\,h_j(x_i, w)\Bigr|^2\Bigr\},$$
which is bounded by some constant by the preceding Lemma 4.

Secondly, we show the finiteness of $a^{(2)}(n)$. By the assumption that both $\psi(x, w)$ and $\|\psi(x, w)\|^2$ satisfy Condition (A), the inequality
$$\int \{1 + M(x)^2\}\,(\psi(x, w) - \psi(x, w_0))^2\,q(x)\,dx < \infty$$
holds, where $M(x) = \sup_{w \in W}|\psi(x, w) - \psi(x, w_0)|$. By using Lemma 6, there exists a finite set of functions $\{g_j, h_j\}_{j=1}^{J}$, where $g_j(w)$ is a real-valued analytic function and $h_j(x, w)$ satisfies Condition (A) with $\{1 + M(x)^2\}q(x)$ instead of $q(x)$, such that
$$\psi(x, w) - \psi(x, w_0) = \sum_{j=1}^{J} g_j(w)\,h_j(x, w),$$
where the two matrices
$$L_{jk}(w) \equiv \int h_j(x, w)h_k(x, w)\,q(x)\,dx, \qquad N_{jk}(w) \equiv \int h_j(x, w)h_k(x, w)\{1 + M(x)^2\}\,q(x)\,dx$$
are positive definite. Hence
$$\frac{\int(\psi(x, w) - \psi(x, w_0))^4\,q(x)\,dx}{\int(\psi(x, w) - \psi(x, w_0))^2\,q(x)\,dx} \leq \frac{\sum_{j,k} N_{jk}(w)\,g_j(w)g_k(w)}{\sum_{j,k} L_{jk}(w)\,g_j(w)g_k(w)} \leq C,$$
where $C > 0$ is a constant. Thus
$$f(w) \geq \frac{1}{C}\int \{\psi(x, w) - \psi(x, w_0)\}^4\,q(x)\,dx,$$
which ensures that
$$a^{(2)}(n) \leq C\, E_n\Bigl\{\sup_{w \in U \setminus W_0}\frac{n\,|a_n^{(2)}(w)|^2}{\int\{\psi(x, w) - \psi(x, w_0)\}^4\,q(x)\,dx}\Bigr\}.$$
The finiteness of the last term can be shown in the same way as for $a^{(1)}(n)$, where $\{\psi(x, w) - \psi(x, w_0)\}^2$ is decomposed by Lemma 6 instead of $\psi(x, w) - \psi(x, w_0)$. (Q.E.D.)

[Proof of Theorem 2] Let us prove Theorem 2. We define
$$\alpha_n = \sup_{w \in W \setminus W_0}|\zeta_n(w)|.$$

Then, by Theorem 4, $E_n\{\alpha_n^2\} < \infty$. The free energy or Bayesian stochastic complexity satisfies
$$F(n) = -E_n\Bigl\{\log \int_W \exp(-n f_n(w))\,\varphi(w)\,dw\Bigr\}
= -E_n\Bigl\{\log \int_W \exp\bigl(-n f(w) + \sqrt{n f(w)}\,\zeta_n(w)\bigr)\,\varphi(w)\,dw\Bigr\}$$
$$\geq -E_n\Bigl\{\log \int_W \exp\bigl(-n f(w) + \alpha_n\sqrt{n f(w)}\bigr)\,\varphi(w)\,dw\Bigr\}
\geq -\log \int_W \exp\Bigl(-\frac{n}{2} f(w)\Bigr)\,\varphi(w)\,dw - \frac{1}{2}E_n\{\alpha_n^2\},$$
where we used the inequality $\alpha_n\sqrt{n f(w)} \leq (\alpha_n^2 + n f(w))/2$. Let us define $Z_i(n)$ $(i = 1, 2)$ by
$$Z_i(n) = \int_{W(i)} \exp\Bigl(-\frac{n}{2} f(w)\Bigr)\,\varphi(w)\,dw,$$
where $W(1) = W_\epsilon$ and $W(2) = W \setminus W_\epsilon$. Then, by the same method as in the proof of Theorem 1,
$$\lim_{n \to \infty} \frac{Z_1(n)\,n^{\lambda_1}}{(\log n)^{m_1 - 1}} = c_{1,m_1}.$$
On the other hand,
$$Z_2(n) \leq \int_{W \setminus W_\epsilon} \exp\Bigl(-\frac{n\epsilon}{2}\Bigr)\,\varphi(w)\,dw \leq \exp\Bigl(-\frac{n\epsilon}{2}\Bigr).$$
Therefore
$$F(n) \geq -\log\Bigl\{c_{1,m_1}\frac{(\log n)^{m_1 - 1}}{n^{\lambda_1}} + \exp\Bigl(-\frac{n\epsilon}{2}\Bigr)\Bigr\} + \mathrm{const.} = \lambda_1\log n - (m_1 - 1)\log\log n + O(1),$$
which completes Theorem 2. (Q.E.D.)

5 Algorithm to calculate the learning efficiency

Theorem 2 shows that the important values $\lambda_1$ and $m_1$ are determined by the meromorphic function $J(\lambda)$. However, since $J(\lambda)$ is defined by an integral over several variables, it is not so easy to determine its largest pole and multiplicity. If $J(\lambda)$ is given by an integral of a single variable, for example, if
$$J(\lambda) = \int_0^{\epsilon} x^{2\lambda}\,x^{r}\,dx,$$
then the only pole of $J(\lambda)$ is $\lambda = -(r+1)/2$, so $\lambda_1 = (r+1)/2$ and its multiplicity is $m_1 = 1$. The resolution of singularities in algebraic geometry transforms an arbitrary integral of several variables into an integral whose essential term is a direct product of integrals of single variables. In fact, Atiyah showed that the following Theorem 5 follows directly from Hironaka's theorem (Hironaka, 1964; Atiyah, 1970).

Theorem 5 (Hironaka, Atiyah) Let $f(w)$ be a real analytic function defined in a neighborhood of $0 \in \mathbf{R}^d$. Then there exist an open set $W \ni 0$, a real analytic manifold $U$, and a proper analytic map $g: U \to W$ such that
(1) $g: U \setminus U_0 \to W \setminus W_0$ is a biregular map, where $W_0 = f^{-1}(0)$ and $U_0 = g^{-1}(W_0)$,
(2) for each $P \in U$ there are local analytic coordinates $(u_1, \ldots, u_d)$ centered at $P$ such that, locally near $P$, we have
$$f(g(u_1, \ldots, u_d)) = a(u_1, \ldots, u_d)\,u_1^{k_1}u_2^{k_2}\cdots u_d^{k_d}, \qquad (13)$$
where $a(u)$ is an invertible analytic function and $k_i \geq 0$.

[Explanation of Theorem 5] This theorem is a special version of the well-known Hironaka resolution of singularities in algebraic geometry. Since $f(w)$ is not a polynomial but an analytic function, the singularities of $f$ can be resolved locally. Hironaka's proof is completely constructive, and the above map $g$ can be constructed by a finite recursive procedure of blowing-ups along non-singular manifolds contained in the singular sets. In the theorem, both $U$ and $W$ are locally compact Hausdorff spaces and $g$ is continuous. In this case, $g$ is a proper map if and only if $g^{-1}(K)$ is compact for an arbitrary compact set $K$. [End of Explanation of Theorem 5]

The following is an algorithm to calculate $\lambda_1$ and $m_1$.

An algorithm to calculate the learning efficiency.
(1) Cover the analytic set (the set of true parameters) $W_0 = \{w \in \mathrm{supp}\,\varphi;\ f(w) = 0\}$ by a finite union of open neighborhoods $W_\alpha$.
(2) For each neighborhood $W_\alpha$, find a resolution map $g_\alpha$ and a manifold $U_\alpha$ by using blowing-ups. Since $g_\alpha$ is a proper map and the closure of $W_\alpha$ is compact, $U_\alpha$ is covered by a finite union of open sets $\{U_{\alpha\beta}\}$ whose closures are homeomorphic to compact sets in some Euclidean spaces.

(3) For each neighborhood $W_{\alpha\beta} = g_\alpha(U_{\alpha\beta})$, the function $J_{\alpha\beta}(\lambda)$ is calculated using eq. (13):
$$J_{\alpha\beta}(\lambda) = \int_{W_{\alpha\beta}} f(w)^{\lambda}\,\varphi(w)\,dw
= \int_{U_{\alpha\beta}} f(g_\alpha(u))^{\lambda}\,\varphi(g_\alpha(u))\,|g_\alpha'(u)|\,du
= \int_{U_{\alpha\beta}} a(u)^{\lambda}\,u_1^{\lambda k_1}u_2^{\lambda k_2}\cdots u_d^{\lambda k_d}\,\varphi(g_\alpha(u))\,|g_\alpha'(u)|\,du,$$
where $|g_\alpha'(u)|$ is the Jacobian. Note that $\{u;\ g_\alpha'(u) = 0\} \subset g_\alpha^{-1}(W_0)$. The last integration can be carried out for each variable $u_i$, and the largest pole $(-\lambda_1^{(\alpha\beta)})$ of $J_{\alpha\beta}(\lambda)$ and its multiplicity $m_1^{(\alpha\beta)}$ are obtained, where the Taylor expansion of $|g_\alpha'(u)|$ is used.
(4) The largest pole $(-\lambda_1)$ of $J(\lambda)$ is given by $(-\lambda_1) = \max_{\alpha\beta}(-\lambda_1^{(\alpha\beta)})$, that is, $\lambda_1 = \min_{\alpha\beta}\lambda_1^{(\alpha\beta)}$, and its multiplicity $m_1$ is $m_1^{(\alpha^*\beta^*)}$, where $(\alpha^*, \beta^*)$ is the argument attaining this maximum. If the $(\alpha, \beta)$ attaining it is not unique, then among such $(\alpha, \beta)$ the one that maximizes $m_1^{(\alpha\beta)}$ is chosen.

In order to calculate $\lambda_1$ and $m_1$, only the neighborhood $W_\alpha$ that gives the largest pole is important. The singularity contained in such a neighborhood $W_\alpha$ is called the deepest singularity in this paper. Note that, by Theorem 5, $1 \leq m_1 \leq d$, where $d$ is the number of parameters.

Example 1 (Regular models) For regular statistical models, by using an appropriate coordinate $(w_1, \ldots, w_d)$, the average loss function $f(w)$ can be locally written as
$$f(w) = \sum_{i=1}^{d} w_i^2.$$
Then, around the origin, with $W_1 = \{w;\ \|w\| < 1\}$,
$$J(\lambda) = \int_{W_1} f(w)^{\lambda}\,dw,$$
where we replaced $\varphi(w)$ by $\varphi(w) = \mathrm{const.}$, based on the natural assumption that $\varphi(w) > 0$ on $W_1$. We define $W_{11} = \{w \in W_1;\ |w_i| < |w_1|\}$ and $U_{11} = \{(u_1, \ldots, u_d);\ |u_i| < 1\}$. By using the blowing-up, we find a map $g_1: (u_1, \ldots, u_d) \to (w_1, \ldots, w_d)$,
$$w_1 = u_1, \qquad w_i = u_1 u_i \quad (2 \leq i \leq d).$$

Then the function $J_{11}(\lambda)$ is
$$J_{11}(\lambda) = \int_{W_{11}} f(w)^{\lambda}\,dw = \int u_1^{2\lambda}\Bigl(1 + \sum_{i=2}^{d} u_i^2\Bigr)^{\lambda}\,|u_1|^{d-1}\,du_1\,du_2\cdots du_d.$$
This function has a pole at $\lambda = -d/2$ with multiplicity $m_1 = 1$. We define $W_{1i}$ and $J_{1i}(\lambda)$ in the same way as $W_{11}$ and $J_{11}(\lambda)$; then $W_1$ is the union of the $W_{1i}$ and a measure-zero set. Therefore, the free energy is
$$F(n) = \frac{d}{2}\log n + O(1),$$
and if $K(n)$ satisfies the same condition as Corollary 1, then $K(n) \cong d/(2n)$.

[Proof of Corollary 2] Let $w_0$ be an arbitrary fixed point in $W_0$. We can assume $w_0 = 0$ without loss of generality. Since $f(w)$ is analytic, there exists a constant $a > 0$ such that for any $w$ in a neighborhood of $0$,
$$f(w) \leq a\|w\|^2.$$
From Lemma 2, we may analyze the case when $\lambda$ is a real and negative value. For such $\lambda$, $J(\lambda)$ has a real value, if it is finite. Let $\tilde{\varphi}$ be a positive $C_0^\infty$-class function smaller than $\varphi(w)$. Then for a sufficiently small constant $\delta > 0$,
$$J(\lambda) \geq \int_{\|w\| < \delta} f(w)^{\lambda}\,\tilde{\varphi}(w)\,dw \geq \int_{\|w\| < \delta} (a\|w\|^2)^{\lambda}\,\tilde{\varphi}(w)\,dw.$$
As shown in Example 1, the largest pole of the last term is $-d/2$, which shows that the largest pole of $J(\lambda)$ is not smaller than $-d/2$; hence $\lambda_1 \leq d/2$. (Q.E.D.)

Example 2 If the model
$$p(y|x, a, b) = \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}(y - a\tanh(bx))^2\Bigr)$$
is trained using samples from $p(y|x, 0, 0)$, then
$$f(a, b) = \frac{1}{2}\int a^2\tanh(bx)^2\,q(x)\,dx.$$
In this case, $W_0 = \{(a, b);\ ab = 0\}$ and the deepest singularity is the origin. In a neighborhood of the origin, the essential term is $f(a, b) \approx a^2 b^2$, and the other terms are smaller than this term. It immediately follows that $\lambda_1 = 1/2$, $m_1 = 2$, so that
$$F(n) = \frac{1}{2}\log n - \log\log n + O(1).$$
Comparing $1/2$ with $d/2 = 2/2 = 1$, the free energy is smaller than in the regular model case.

Example 3 Let us consider a neural network
$$p(y|x, a, b, c, d) = \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}(y - \psi(x, a, b, c, d))^2\Bigr), \qquad \psi(x, a, b, c, d) = a\tanh(bx) + c\tanh(dx).$$
Assume that the true regression function is $\psi(x, 0, 0, 0, 0)$. Then
$$W_0 = \{ab = 0 \ \mathrm{and}\ cd = 0\} \cup \{a + c = 0 \ \mathrm{and}\ b = d\} \cup \{a = c \ \mathrm{and}\ b + d = 0\},$$
and the deepest singularity of $f(a, b, c, d)$ is the origin $(0, 0, 0, 0)$. In a neighborhood of the origin, defined as $W_1$, the essential term is
$$f(a, b, c, d) \approx (ab + cd)^2 + (ab^3 + cd^3)^2,$$
since the higher-order terms can be bounded by the above two terms (Watanabe, 1998b). Let us define
$$W_{11} = \{(a, b, c, d);\ |b| > |d|,\ |c| > |a|,\ |ad^3| < |ab + cd| < |cd^3|\},$$
$$U_{11} = \{(x, y, z, w);\ |x| < 1,\ |y| < 1,\ |w| < 1,\ w - 1 < z < w + 1\}.$$
By using blowing-ups, we find a map $g_1: U_{11} \to W_{11}$ defined by
$$a = x, \quad b = y^3 w - yzw, \quad c = xzw, \quad d = y.$$
From the algorithmic point of view, $W_{11}$ and $U_{11}$ are determined systematically in the process of blowing-ups. By using this transform, we obtain
$$f(g_1(x, y, z, w)) = x^2 y^6 w^2\bigl[1 + \{w^2(y^2 - z)^3 + z\}^2\bigr], \qquad |g_1'(x, y, z, w)| = |x y^3 w|.$$

We naturally assume that $\varphi(w) > 0$ on $\{w = (a, b, c, d);\ |a| < 1, |b| < 1, |c| < 1, |d| < 1\}$. In the calculation of the poles of $J(\lambda)$, we can simply put $\varphi(w) = 1$ on this set. Then
$$J_1(\lambda) = \int_{W_1} f(w)^{\lambda}\,dw = \int_{-1}^{1}dx\int_{-1}^{1}dy\int_{-1}^{1}dw\int_{w-1}^{w+1}dz\ f(g_1(x, y, z, w))^{\lambda}\,|x y^3 w|,$$
resulting in $\lambda_1^{(1)} = 2/3$ and $m_1^{(1)} = 1$. From the other regions $W_{12}, W_{13}, \ldots$, we do not obtain any pole larger than $-2/3$. Hence
$$F(n) = \frac{2}{3}\log n + O(1),$$
whereas the dimension divided by two is $d/2 = 4/2 = 2$.

Lastly, we introduce an elementary inequality for the asymptotic expansion of the free energy. Let us consider $K$ learning machines $(1 \leq k \leq K)$,
$$p_k(y|x, w_k) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Bigl(-\frac{1}{2\sigma^2}\|y - \psi_k(x, w_k)\|^2\Bigr),$$
where $N$ is the number of output units. Assume that the asymptotic expansions of their free energies are respectively given by
$$F_k(n) \cong \lambda^{(k)}\log n - (m^{(k)} - 1)\log\log n,$$
which are determined under the condition that the true distributions are $\{p_k(y|x, w_{k0})\}$ and the a priori distributions are $\varphi_k(w_k)$. Let us study the case where a learning machine made of a sum of the $K$ machines,
$$p(y|x, \{w_k\}) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Bigl(-\frac{1}{2\sigma^2}\Bigl\|y - \sum_{k=1}^{K}\psi_k(x, w_k)\Bigr\|^2\Bigr),$$
is trained using samples from the distribution $p(y|x, \{w_{k0}\})$. The asymptotic expansion of the free energy with the a priori distribution $\prod_k \varphi_k(w_k)$ is written as
$$F(n) \cong \lambda\log n - (m - 1)\log\log n.$$
Then we have the following inequality.

Corollary 3 The constants $\lambda$ and $\{\lambda^{(k)}\}$ satisfy the inequality
$$\lambda \leq \sum_{k=1}^{K}\lambda^{(k)}.$$

This corollary has an important application. Let $\lambda(H_0, H)\log n$ be the essential term of the asymptotic form of the free energy of a three-layered perceptron with $H$ hidden units which is trained using samples from a true distribution represented with $H_0$ hidden units. We assume that $H - H_0 = KL$, where $K$ and $L$ are some integers. By using Corollary 3,
$$\lambda(H_0, H) \leq \lambda(H_0, H_0) + \lambda(0, H - H_0) \leq \lambda(H_0, H_0) + \frac{H - H_0}{L}\,\lambda(0, L).$$
Note that $\lambda(H_0, H_0) = d/2$, where $d$ is the number of parameters in the model with $H_0$ hidden units (Watanabe, 1998b). Even if the resolution of singularities is too complex and too difficult for calculating $\lambda(H_0, H)$, we can obtain such inequalities based on the free energy of simpler cases.

[Proof of Corollary 3] The Kullback distance of the machine $p(y|x, \{w_k\})$ is given by
$$f(w) = \frac{1}{2\sigma^2}\int\Bigl\|\sum_{k=1}^{K}\{\psi_k(x, w_k) - \psi_k(x, w_{k0})\}\Bigr\|^2 q(x)\,dx
\leq \frac{K}{2\sigma^2}\sum_{k=1}^{K}\int\|\psi_k(x, w_k) - \psi_k(x, w_{k0})\|^2 q(x)\,dx = K\sum_{k=1}^{K} f_k(w_k),$$
where $f_k(w_k)$ is the Kullback distance of $p_k(y|x, w_k)$ from $p_k(y|x, w_{k0})$. Therefore
$$F(n) \leq -\sum_{k=1}^{K}\log\int\exp(-K n f_k(w_k))\,\varphi_k(w_k)\,dw_k
\leq \sum_{k=1}^{K}\{\lambda^{(k)}\log n - (m^{(k)} - 1)\log\log n + \mathrm{const.}\},$$
which completes Corollary 3. (Q.E.D.)

Example 4 (Experiments) Let us consider the case where
$$p(y|x, a, b) = \frac{1}{\sqrt{2\pi}\,(0.1)}\exp\Bigl(-\frac{1}{2(0.1)^2}(y - a\sigma(bx))^2\Bigr)$$
is trained using samples from $p(y|x, a_0, b_0)$, where $\sigma(x) = x + x^2$. In this case, we can calculate the maximum likelihood estimator, because the likelihood function is a quadratic form in $ab$ and $ab^2$. Figures 1, 3, and 5 show the maximum likelihood estimators (MLE) of $(a, b)$, estimated using samples from the distributions with true parameters $(a_0, b_0) = (0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$, respectively. Here $q(x)$ is the standard normal distribution, $n = 20$, and the number of trials is 200. Note that MLEs which fall outside of $[-2, 2] \times [-2, 2]$ are not drawn in the figures. Figures 2, 4, and 6 show parameters drawn from the a posteriori distributions estimated using samples from the distributions with true parameters $(a_0, b_0) = (0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$. Here we defined the a priori distribution of $(a, b)$ as the uniform distribution on $[-2, 2] \times [-2, 2]$. It should be emphasized that the distribution of the MLEs is very different from the Bayesian a posteriori distribution, and that neither distribution follows any Gaussian distribution even if the true parameter is only one point.

In this example, by putting $p = ab$, $q = ab^2$, the model can be understood as a regular statistical model, so the generalization error of the maximum likelihood method is asymptotically equal to $2/(2n)$ in all three cases. However, for a general activation function $\sigma(x)$, there exists no transform which makes the model regular, and the generalization error in such a case is far larger than in the regular model case. The generalization errors of Bayesian estimation in these cases are asymptotically $1/(2n) - 1/(n\log n)$, $2/(2n)$, and $2/(2n)$, respectively.

For the case $n = 20$ ($n$ is the number of training samples), the experimental average generalization errors of the maximum likelihood method were , , and , where the true parameters were $(0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$, respectively, whereas those of the Bayesian method were , , and , respectively. For the case $n = 50$, the former errors were , , and , respectively, and the latter errors were , , and . These results show that, if the model is almost redundant, Bayesian estimation is more appropriate than the maximum likelihood method.
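
A rough, self-contained Python sketch of this experiment (my own reconstruction under the stated setup, not the author's code; the use of prior particles as self-normalized importance samples for the posterior, and the Monte Carlo test set for the Kullback distance, are simplifications) for the true parameters $(a_0, b_0) = (0, 0)$ is:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 0.1                               # fixed noise standard deviation of the model

def reg(x, a, b):
    # regression function a*sigma(b*x) with sigma(x) = x + x^2,
    # which equals (a*b)*x + (a*b^2)*x^2
    return a * (b * x) + a * (b * x) ** 2

def experiment(n=20, trials=200, n_prior=5000, n_test=500):
    """Monte Carlo comparison of the generalization error (average Kullback
    distance from the true density) of the MLE plug-in predictive and of the
    Bayesian predictive, for true parameters (a0, b0) = (0, 0)."""
    # particles from the uniform prior on [-2,2]^2, reused as
    # self-normalized importance samples for the posterior
    a = rng.uniform(-2, 2, n_prior)
    b = rng.uniform(-2, 2, n_prior)
    mle_err, bayes_err = [], []
    for _ in range(trials):
        x = rng.standard_normal(n)
        y = S * rng.standard_normal(n)                 # true regression is 0
        # exact MLE: the model is linear in (p, q) = (a*b, a*b^2)
        X = np.stack([x, x ** 2], axis=1)
        p_hat, q_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        # posterior weights of the prior particles
        resid = y[None, :] - reg(x[None, :], a[:, None], b[:, None])
        loglik = -0.5 * np.sum(resid ** 2, axis=1) / S ** 2
        w = np.exp(loglik - loglik.max()); w /= w.sum()
        # fresh test sample for estimating the Kullback distances
        xt = rng.standard_normal(n_test)
        yt = S * rng.standard_normal(n_test)
        log_q = -0.5 * (yt / S) ** 2                   # true log-density + const
        log_p_mle = -0.5 * ((yt - (p_hat * xt + q_hat * xt ** 2)) / S) ** 2
        dens = np.exp(-0.5 * ((yt[None, :] - reg(xt[None, :], a[:, None], b[:, None])) / S) ** 2)
        log_p_bayes = np.log(w @ dens)                 # predictive mixture + const
        mle_err.append(np.mean(log_q - log_p_mle))
        bayes_err.append(np.mean(log_q - log_p_bayes))
    return float(np.mean(mle_err)), float(np.mean(bayes_err))

print(experiment())  # the Bayesian error should be noticeably smaller than the MLE error
```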

6 Discussion

6.1 Fundamental Aspects

In this subsection, we intuitively explain the fundamental structure of the theorems in this paper. Let us define the volume function of the set of parameters $\{w \in W;\ f(w) \leq t\}$,
$$V(t) = \int_{f(w) \leq t} \varphi(w)\,dw.$$
Then $V'(t) = I(t)$, where $I(t)$ is defined by eq. (5). This paper has shown that the asymptotic stochastic complexity $F(n)$ is given by the Laplace transform of the volume function,
$$F(n) \cong -\log \int \exp(-nt)\,dV(t).$$
If the model is identifiable and the Fisher information matrix is positive definite, then the asymptotic shape of the set $\{w \in W;\ f(w) \leq t\}$ is the inside of an ellipsoid. However, if the model is non-identifiable and has singularities, then it is the union of the inner points of several complex hyperboloids. If $V(t)$ satisfies the relation
$$V(t) \cong t^{\lambda_1}(-\log t)^{m_1 - 1} \qquad (t \to 0),$$
then
$$F(n) \cong \lambda_1\log n - (m_1 - 1)\log\log n \qquad (n \to \infty), \qquad (14)$$
$$J(\lambda) = \int t^{\lambda}\,dV(t) \cong \frac{\mathrm{Const.}}{(\lambda + \lambda_1)^{m_1}}. \qquad (15)$$
These relations, eq. (14) and eq. (15), show that the learning efficiency of Bayesian estimation is determined by the volume of the set of almost-true parameters. Mathematically speaking, it is difficult to prove $V(t) \cong t^{\lambda_1}(-\log t)^{m_1 - 1}$ directly, because of the singularities in the true parameter set. The relation proven first is eq. (15), for which we need algebraic properties of the loss function: the Sato-Bernstein b-function and Hironaka's resolution of singularities.

6.2 Theoretical Aspects

The case when the parameter is not identifiable might be understood as a special one which seldom occurs. However, superpositions of homogeneous functions essentially have the problem of redundancy.

Let us illustrate the mathematical structure of redundancy. A function $f(x)$ in $L^2(\mathbf{R})$ can be decomposed as
$$f(x) = \int g(a, b)\,\phi\Bigl(\frac{x - b}{a}\Bigr)\,da\,db,$$
where $\phi(x)$ is an analyzing wavelet and $g(a, b)$ is a coefficient function (Chui, 1992). Here the set of functions $\{\phi((x - b)/a)\}$ is an over-complete basis in $L^2(\mathbf{R})$, so $g(a, b)$ is not determined uniquely. This decomposition shows that the set of coefficients $\{g_h\}$ in the neural approximation
$$f(x) \approx \sum_{h=1}^{H} g_h\,\phi\Bigl(\frac{x - b_h}{a_h}\Bigr)$$
becomes more non-identifiable as $H$ approaches infinity. In order to analyze the case $n \to \infty$ and $H \to \infty$, we have to construct a theory for redundant function approximation and redundant statistical estimation. It is shown in (Watanabe, 2000b) that singularities in the parameter space make the generalization error smaller even when the true distribution is not contained in the parametric model. On the other hand, the set of coefficients in an orthonormal basis decomposition is given by the orthonormal projection of the true function onto the corresponding function space; hence it is identifiable even if $H$ goes to infinity. This is one of the essential differences between artificial neural networks and orthonormal basis functions. We expect that artificial neural networks are better learning machines than orthonormal ones if Bayesian estimation is applied.

It is well known that, if a statistical model contains unused or nuisance parameters, then such parameters should not be estimated. If such parameters are estimated and determined, then the statistical estimation error becomes large. They are estimated in the maximum likelihood method, but not in Bayesian estimation. This is one of the reasons why Bayesian estimation is useful for neural networks in almost redundant states.

6.3 Practical Aspects

In regular learning machines, the generalization error of the maximum likelihood method is asymptotically equal to $d/2n$ (Akaike, 1974), which coincides with that of Bayesian estimation (Amari, 1993), where $d$ is the number of parameters and $n$ is the number of training samples. In this paper we have shown that, in non-identifiable learning machines, the generalization error of Bayesian estimation is not larger than $d/2n$. When the singularity becomes deeper, the generalization error becomes smaller. On the other hand, it is conjectured that the generalization error of a layered learning machine trained by the maximum likelihood method is far larger than $d/2n$. For example, it is proven that, if the distribution of the inputs consists of a finite sum of delta functions, then the generalization error is proportional to $\log n/n$ (Hagiwara, Kuno, & Usui, 2000). It is also proven that, if the activation function of the hidden units is a linear function, then the generalization error is $C/2n$, where $C$ is larger than the number of parameters (Fukumizu, 1999).

In a practical application, the true distribution is in general not strictly contained in the finite parametric model (Shibata, 1981). In such a case, the generalization error is the sum of the function approximation error and the statistical estimation error, where the former is called the bias and the latter the variance. In this paper, we have shown that the increase of the variance of an artificial neural network is not larger than the increase of the number of parameters, if Bayesian estimation is applied. Therefore, we can use a larger neural network with a not-so-large variance and a smaller bias. However, if we apply the maximum likelihood method to an artificial neural network, then we should use a smaller model to keep the variance small, resulting in a larger bias. We conjecture that this difference is the main reason why almost redundant learning machines with ensemble learning are more useful than selected small models with one-point estimation in practical applications.

7 Conclusion

A mathematical foundation for non-identifiable learning machines such as neural networks has been established based on algebraic analysis and algebraic geometry. The free energy or stochastic complexity is asymptotically equal to $\lambda_1\log n - (m_1 - 1)\log\log n + \mathrm{const.}$, where $\lambda_1$ is a rational number and $m_1$ is a natural number. Moreover, we obtained the following properties.
(1) The learning curve is determined by the singularities in the parameter space.
(2) The learning efficiency can be algorithmically calculated by the resolution of singularities.
(3) Bayesian estimation is more appropriate than the maximum likelihood method for neural networks in redundant states.
Information geometry gives the foundation of statistical learning theory for a regular learning machine (Amari, 1985), which has a positive definite Fisher metric. We expect that algebraic geometry and algebraic analysis will play an important role in the study of complex and hierarchical learning machines, which have degenerate metrics. Analysis of both the maximum likelihood method and the maximum a posteriori method is an important problem for the future.

Acknowledgment

This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research.

Appendix: Elemental Lemmas and Proofs

In this appendix, we show two elemental lemmas and their proofs. These are technical properties, but they are sometimes useful in neural network theory.

Lemma 5 Let $\{X_k(a);\ k = 1, 2, \ldots\}$ be a set of real-valued random variables with a parameter $a$, and let $A$ be the set of all parameters. Then
$$E\Bigl\{\sup_{a \in A}\Bigl|\sum_{k=1}^{\infty} X_k(a)\Bigr|^2\Bigr\}^{1/2} \leq \sum_{k=1}^{\infty} E\Bigl\{\sup_{a \in A}|X_k(a)|^2\Bigr\}^{1/2},$$
where $E$ is the expectation value over the random variables.

(Proof) With the definition $Y_k = \sup_{a \in A}|X_k(a)|$,
$$E\Bigl\{\sup_{a \in A}\Bigl|\sum_{k} X_k(a)\Bigr|^2\Bigr\} \leq E\Bigl\{\sup_{a \in A}\sum_{k}\sum_{k'}|X_k(a)|\,|X_{k'}(a)|\Bigr\} \leq \sum_{k}\sum_{k'} E\{Y_k Y_{k'}\}
\leq \sum_{k}\sum_{k'} E\{Y_k^2\}^{1/2} E\{Y_{k'}^2\}^{1/2} = \Bigl\{\sum_{k} E\{Y_k^2\}^{1/2}\Bigr\}^2,$$
which completes the lemma. (Q.E.D.)

Lemma 6 Assume the same condition as Theorem 2. Let $U$ be a neighborhood of $w_0 \in W_0$. Then there exists a finite set of functions $\{g_j(w), h_j(x, w)\}_{j=1}^{J}$, where $g_j(w)$ is a real-valued analytic function and $h_j(x, w)$ is a real-valued function satisfying Condition (A) in $U$, such that
(1) $g_j(w_0) = 0$,
(2) the functions $\{h_j(x, w)\}$ are linearly independent,
(3) for an arbitrary $w \in U$,
$$\psi(x, w) - \psi(x, w_0) = \sum_{j=1}^{J} g_j(w)\,h_j(x, w).$$

(Proof of Lemma 6) We can assume $w_0 = 0$ and $\psi(x, 0) = 0$ without loss of generality. The $d$-dimensional direct product of the set of natural numbers is referred to as $\mathbf{N}^d$ and its element is written as $j = (j_1, j_2, \ldots, j_d) \in \mathbf{N}^d$. Let us consider the Taylor expansion
$$\psi(x, w) = \sum_{|j| \geq 1} a_j(x)\,w^{j},$$
where $|j| = j_1 + j_2 + \cdots + j_d$ and $w^{j} = w_1^{j_1} w_2^{j_2}\cdots w_d^{j_d}$. We introduce the space of square-integrable functions $L^2(q)$ with the inner product
$$(u, v) = \int_K u(x)\,v(x)\,q(x)\,dx.$$
By using Cauchy's integral formula as in eq. (10), it is immediately proven that $a_j(x)$ is contained in $L^2(q)$. We adopt an order on the set $\mathbf{N}^d$ under which $|j|$ is non-decreasing. Then, by applying Schmidt's orthogonalization to $\{a_j(x);\ j \in \mathbf{N}^d\}$, we obtain
$$\psi(x, w) = \sum_{j \in \mathbf{N}^d} e_j(x)\,g_j(w),$$


More information

Lebesgue Integration on R n

Lebesgue Integration on R n Lebesgue Integration on R n The treatment here is based loosely on that of Jones, Lebesgue Integration on Euclidean Space We give an overview from the perspective of a user of the theory Riemann integration

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Functional Analysis Review

Functional Analysis Review Outline 9.520: Statistical Learning Theory and Applications February 8, 2010 Outline 1 2 3 4 Vector Space Outline A vector space is a set V with binary operations +: V V V and : R V V such that for all

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Quasi-conformal maps and Beltrami equation

Quasi-conformal maps and Beltrami equation Chapter 7 Quasi-conformal maps and Beltrami equation 7. Linear distortion Assume that f(x + iy) =u(x + iy)+iv(x + iy) be a (real) linear map from C C that is orientation preserving. Let z = x + iy and

More information

Functional Analysis I

Functional Analysis I Functional Analysis I Course Notes by Stefan Richter Transcribed and Annotated by Gregory Zitelli Polar Decomposition Definition. An operator W B(H) is called a partial isometry if W x = X for all x (ker

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

Notes on Complex Analysis

Notes on Complex Analysis Michael Papadimitrakis Notes on Complex Analysis Department of Mathematics University of Crete Contents The complex plane.. The complex plane...................................2 Argument and polar representation.........................

More information

Statistical Learning Theory of Variational Bayes

Statistical Learning Theory of Variational Bayes Statistical Learning Theory of Variational Bayes Department of Computational Intelligence and Systems Science Interdisciplinary Graduate School of Science and Engineering Tokyo Institute of Technology

More information

Class notes: Approximation

Class notes: Approximation Class notes: Approximation Introduction Vector spaces, linear independence, subspace The goal of Numerical Analysis is to compute approximations We want to approximate eg numbers in R or C vectors in R

More information

List of Symbols, Notations and Data

List of Symbols, Notations and Data List of Symbols, Notations and Data, : Binomial distribution with trials and success probability ; 1,2, and 0, 1, : Uniform distribution on the interval,,, : Normal distribution with mean and variance,,,

More information

Two Lemmas in Local Analytic Geometry

Two Lemmas in Local Analytic Geometry Two Lemmas in Local Analytic Geometry Charles L Epstein and Gennadi M Henkin Department of Mathematics, University of Pennsylvania and University of Paris, VI This paper is dedicated to Leon Ehrenpreis

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Bernstein s analytic continuation of complex powers

Bernstein s analytic continuation of complex powers (April 3, 20) Bernstein s analytic continuation of complex powers Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/. Analytic continuation of distributions 2. Statement of the theorems

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

How New Information Criteria WAIC and WBIC Worked for MLP Model Selection

How New Information Criteria WAIC and WBIC Worked for MLP Model Selection How ew Information Criteria WAIC and WBIC Worked for MLP Model Selection Seiya Satoh and Ryohei akano ational Institute of Advanced Industrial Science and Tech, --7 Aomi, Koto-ku, Tokyo, 5-6, Japan Chubu

More information

RENORMALIZATION OF DYSON S VECTOR-VALUED HIERARCHICAL MODEL AT LOW TEMPERATURES

RENORMALIZATION OF DYSON S VECTOR-VALUED HIERARCHICAL MODEL AT LOW TEMPERATURES RENORMALIZATION OF DYSON S VECTOR-VALUED HIERARCHICAL MODEL AT LOW TEMPERATURES P. M. Bleher (1) and P. Major (2) (1) Keldysh Institute of Applied Mathematics of the Soviet Academy of Sciences Moscow (2)

More information

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS TSOGTGEREL GANTUMUR Abstract. After establishing discrete spectra for a large class of elliptic operators, we present some fundamental spectral properties

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS SPRING 006 PRELIMINARY EXAMINATION SOLUTIONS 1A. Let G be the subgroup of the free abelian group Z 4 consisting of all integer vectors (x, y, z, w) such that x + 3y + 5z + 7w = 0. (a) Determine a linearly

More information

Complex Analysis Qualifying Exam Solutions

Complex Analysis Qualifying Exam Solutions Complex Analysis Qualifying Exam Solutions May, 04 Part.. Let log z be the principal branch of the logarithm defined on G = {z C z (, 0]}. Show that if t > 0, then the equation log z = t has exactly one

More information

1 Math 241A-B Homework Problem List for F2015 and W2016

1 Math 241A-B Homework Problem List for F2015 and W2016 1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let

More information

Complex Analysis Slide 9: Power Series

Complex Analysis Slide 9: Power Series Complex Analysis Slide 9: Power Series MA201 Mathematics III Department of Mathematics IIT Guwahati August 2015 Complex Analysis Slide 9: Power Series 1 / 37 Learning Outcome of this Lecture We learn Sequence

More information

Part IB Complex Analysis

Part IB Complex Analysis Part IB Complex Analysis Theorems Based on lectures by I. Smith Notes taken by Dexter Chua Lent 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after

More information

NATIONAL BOARD FOR HIGHER MATHEMATICS. Research Scholarships Screening Test. Saturday, February 2, Time Allowed: Two Hours Maximum Marks: 40

NATIONAL BOARD FOR HIGHER MATHEMATICS. Research Scholarships Screening Test. Saturday, February 2, Time Allowed: Two Hours Maximum Marks: 40 NATIONAL BOARD FOR HIGHER MATHEMATICS Research Scholarships Screening Test Saturday, February 2, 2008 Time Allowed: Two Hours Maximum Marks: 40 Please read, carefully, the instructions on the following

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Multivariable Calculus

Multivariable Calculus 2 Multivariable Calculus 2.1 Limits and Continuity Problem 2.1.1 (Fa94) Let the function f : R n R n satisfy the following two conditions: (i) f (K ) is compact whenever K is a compact subset of R n. (ii)

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics and Engineering Lecture notes for PDEs Sergei V. Shabanov Department of Mathematics, University of Florida, Gainesville, FL 32611 USA CHAPTER 1 The integration theory

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

Two special equations: Bessel s and Legendre s equations. p Fourier-Bessel and Fourier-Legendre series. p

Two special equations: Bessel s and Legendre s equations. p Fourier-Bessel and Fourier-Legendre series. p LECTURE 1 Table of Contents Two special equations: Bessel s and Legendre s equations. p. 259-268. Fourier-Bessel and Fourier-Legendre series. p. 453-460. Boundary value problems in other coordinate system.

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

November 18, 2013 ANALYTIC FUNCTIONAL CALCULUS

November 18, 2013 ANALYTIC FUNCTIONAL CALCULUS November 8, 203 ANALYTIC FUNCTIONAL CALCULUS RODICA D. COSTIN Contents. The spectral projection theorem. Functional calculus 2.. The spectral projection theorem for self-adjoint matrices 2.2. The spectral

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

Problem Set 5. 2 n k. Then a nk (x) = 1+( 1)k

Problem Set 5. 2 n k. Then a nk (x) = 1+( 1)k Problem Set 5 1. (Folland 2.43) For x [, 1), let 1 a n (x)2 n (a n (x) = or 1) be the base-2 expansion of x. (If x is a dyadic rational, choose the expansion such that a n (x) = for large n.) Then the

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Problem Set 5: Solutions Math 201A: Fall 2016

Problem Set 5: Solutions Math 201A: Fall 2016 Problem Set 5: s Math 21A: Fall 216 Problem 1. Define f : [1, ) [1, ) by f(x) = x + 1/x. Show that f(x) f(y) < x y for all x, y [1, ) with x y, but f has no fixed point. Why doesn t this example contradict

More information

Review of complex analysis in one variable

Review of complex analysis in one variable CHAPTER 130 Review of complex analysis in one variable This gives a brief review of some of the basic results in complex analysis. In particular, it outlines the background in single variable complex analysis

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Error Empirical error. Generalization error. Time (number of iteration)

Error Empirical error. Generalization error. Time (number of iteration) Submitted to Neural Networks. Dynamics of Batch Learning in Multilayer Networks { Overrealizability and Overtraining { Kenji Fukumizu The Institute of Physical and Chemical Research (RIKEN) E-mail: fuku@brain.riken.go.jp

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

Functional Analysis Review

Functional Analysis Review Functional Analysis Review Lorenzo Rosasco slides courtesy of Andre Wibisono 9.520: Statistical Learning Theory and Applications September 9, 2013 1 2 3 4 Vector Space A vector space is a set V with binary

More information

A LITTLE REAL ANALYSIS AND TOPOLOGY

A LITTLE REAL ANALYSIS AND TOPOLOGY A LITTLE REAL ANALYSIS AND TOPOLOGY 1. NOTATION Before we begin some notational definitions are useful. (1) Z = {, 3, 2, 1, 0, 1, 2, 3, }is the set of integers. (2) Q = { a b : aεz, bεz {0}} is the set

More information

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

More information

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,

More information

Analysis Comprehensive Exam Questions Fall 2008

Analysis Comprehensive Exam Questions Fall 2008 Analysis Comprehensive xam Questions Fall 28. (a) Let R be measurable with finite Lebesgue measure. Suppose that {f n } n N is a bounded sequence in L 2 () and there exists a function f such that f n (x)

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

CALCULUS ON MANIFOLDS

CALCULUS ON MANIFOLDS CALCULUS ON MANIFOLDS 1. Manifolds Morally, manifolds are topological spaces which locally look like open balls of the Euclidean space R n. One can construct them by piecing together such balls ( cells

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Fourier Series. 1. Review of Linear Algebra

Fourier Series. 1. Review of Linear Algebra Fourier Series In this section we give a short introduction to Fourier Analysis. If you are interested in Fourier analysis and would like to know more detail, I highly recommend the following book: Fourier

More information

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered

More information

Chapter 8. P-adic numbers. 8.1 Absolute values

Chapter 8. P-adic numbers. 8.1 Absolute values Chapter 8 P-adic numbers Literature: N. Koblitz, p-adic Numbers, p-adic Analysis, and Zeta-Functions, 2nd edition, Graduate Texts in Mathematics 58, Springer Verlag 1984, corrected 2nd printing 1996, Chap.

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Chapter 2 Metric Spaces

Chapter 2 Metric Spaces Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics

More information

Chapter 8 Integral Operators

Chapter 8 Integral Operators Chapter 8 Integral Operators In our development of metrics, norms, inner products, and operator theory in Chapters 1 7 we only tangentially considered topics that involved the use of Lebesgue measure,

More information

Qualification Exam: Mathematical Methods

Qualification Exam: Mathematical Methods Qualification Exam: Mathematical Methods Name:, QEID#41534189: August, 218 Qualification Exam QEID#41534189 2 1 Mathematical Methods I Problem 1. ID:MM-1-2 Solve the differential equation dy + y = sin

More information

MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5

MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5 MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5.. The Arzela-Ascoli Theorem.. The Riemann mapping theorem Let X be a metric space, and let F be a family of continuous complex-valued functions on X. We have

More information

PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include

PUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include PUTNAM TRAINING POLYNOMIALS (Last updated: December 11, 2017) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include

More information

TEST CODE: PMB SYLLABUS

TEST CODE: PMB SYLLABUS TEST CODE: PMB SYLLABUS Convergence and divergence of sequence and series; Cauchy sequence and completeness; Bolzano-Weierstrass theorem; continuity, uniform continuity, differentiability; directional

More information

Separation of Variables in Linear PDE: One-Dimensional Problems

Separation of Variables in Linear PDE: One-Dimensional Problems Separation of Variables in Linear PDE: One-Dimensional Problems Now we apply the theory of Hilbert spaces to linear differential equations with partial derivatives (PDE). We start with a particular example,

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1

More information

Course 214 Basic Properties of Holomorphic Functions Second Semester 2008

Course 214 Basic Properties of Holomorphic Functions Second Semester 2008 Course 214 Basic Properties of Holomorphic Functions Second Semester 2008 David R. Wilkins Copyright c David R. Wilkins 1989 2008 Contents 7 Basic Properties of Holomorphic Functions 72 7.1 Taylor s Theorem

More information

On rational approximation of algebraic functions. Julius Borcea. Rikard Bøgvad & Boris Shapiro

On rational approximation of algebraic functions. Julius Borcea. Rikard Bøgvad & Boris Shapiro On rational approximation of algebraic functions http://arxiv.org/abs/math.ca/0409353 Julius Borcea joint work with Rikard Bøgvad & Boris Shapiro 1. Padé approximation: short overview 2. A scheme of rational

More information

Chapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem

Chapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem Chapter 4 Inverse Function Theorem d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d dd d d d d This chapter

More information

NATIONAL BOARD FOR HIGHER MATHEMATICS. Research Scholarships Screening Test. Saturday, January 20, Time Allowed: 150 Minutes Maximum Marks: 40

NATIONAL BOARD FOR HIGHER MATHEMATICS. Research Scholarships Screening Test. Saturday, January 20, Time Allowed: 150 Minutes Maximum Marks: 40 NATIONAL BOARD FOR HIGHER MATHEMATICS Research Scholarships Screening Test Saturday, January 2, 218 Time Allowed: 15 Minutes Maximum Marks: 4 Please read, carefully, the instructions that follow. INSTRUCTIONS

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information