Algebraic Analysis for Non-identifiable Learning Machines
Sumio Watanabe
P&I Lab., Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Japan
swatanab@pi.titech.ac.jp
May 4, 2000

ABSTRACT

This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a non-identifiable learning machine, such as a multilayer neural network, whose true parameter set is an analytic set with singular points. Using a concept from algebraic analysis, we rigorously prove that the Bayesian stochastic complexity, or free energy, is asymptotically equal to λ_1 log n − (m_1 − 1) log log n + constant, where n is the number of training samples, and λ_1 and m_1 are a rational number and a natural number which are determined as birational invariants of the singularities in the parameter space. We also show an algorithm to calculate λ_1 and m_1 based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ_1 is equal to the number of parameters and m_1 = 1, whereas in non-regular models such as multilayer networks, 2λ_1 is not larger than the number of parameters and m_1 ≥ 1. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, non-identifiable learning machines are better models than regular ones when Bayesian ensemble learning is applied.
1 Introduction

Learning machines made of superpositions of homogeneous functions are often useful for constructing practical information systems. For example, layered neural networks, radial basis functions, and mixtures of normal distributions have been applied to many recognition and prediction systems. They are written in the form

  ψ(x, {a_h, b_h}) = Σ_{h=1}^{H} a_h g(b_h, x),

where g is a function, x is an input vector, and {a_h, b_h} is the set of parameters to be optimized. One important property of such machines is that, in general, they do not satisfy the regularity condition for the asymptotic normality of the maximum likelihood estimator (Hagiwara, Toda, & Usui, 1993; Fukumizu, 1996). For regular statistical models (Cramer, 1949), the set of true parameters consists of only one point and the Fisher information matrix is positive definite, even if the learning model is larger than necessary to attain the true distribution; this case is called over-realizable. On the other hand, if a layered learning machine is in the over-realizable case, the set of true parameters is not one point but an analytic set with singularities, so the estimated parameters do not converge to one point. In other words, the correspondence between the set of parameters and the set of functions is not a one-to-one mapping. In this paper, such learning models are called non-identifiable.

Research on non-identifiable learning machines is important for layered neural networks for three reasons. Firstly, it is necessary for selecting the optimal model, which balances the function approximation error against the statistical estimation error. Although the true distribution is not contained in the parametric learning machine in a practical application, no information criterion can be obtained without studying the over-realizable case (Hagiwara, Toda, & Usui, 1993; Fukumizu, 1999).
The distributions of estimated parameters in non-identifiable cases should also be analyzed for hypothesis testing (Dacunha-Castelle & Gassiat, 1997; Knowles & Siegmund, 1989). These studies mainly treat the asymptotic case, based on the assumption that the number of training samples is sufficiently large. However, there is a similarity between the error behavior of a non-identifiable machine with large-size
data and that of a complex machine with moderate-size data, since the Fisher information matrices are ill-defined in both cases. Therefore, we can expect the asymptotic study of the non-identifiable case to be useful for complex learning machines in practical applications.

Secondly, studies of non-identifiable machines will clarify the essential difference between regular statistical models and artificial neural networks. Their difference in the field of function approximation has already been studied (Barron, 1994; Mhaskar, 1996), whereas their difference in statistical estimation has not. This paper shows that the generalization errors of non-identifiable learning machines trained with the Bayesian method are not larger than those of identifiable and regular models, which can be proven under the condition that the function approximation error is negligible compared with the statistical estimation error.

And lastly, a theory for learning machines whose loss functions cannot be approximated by any quadratic form will be the foundation for devising and improving neural network training algorithms. Many training algorithms have been devised on the assumption that the set of optimal parameters consists of only one point and that the Fisher information matrix is positive definite. However, this assumption is not satisfied in general. For example, we often find that a layered learning machine works very well when the functions of its hidden units are almost linearly dependent. Training algorithms should be improved so that learning machines attain the best performance even when they are in such nearly redundant states. The maximum likelihood method is not an appropriate training algorithm for complex and layered learning machines in such states.
In this paper, in order to clarify the statistical properties of non-identifiable learning machines, we prove that the Bayesian stochastic complexity, or free energy, F(n) has the asymptotic form

  F(n) = λ_1 log n − (m_1 − 1) log log n + O(1),

where n is the number of training samples, and λ_1 and m_1 are a rational number and a natural number, respectively, which are determined by the singularities of the set of true parameters. The learning curve is thus determined by the algebraic geometrical structure of the parameter set. We also show
an algorithm to calculate λ_1 and m_1 by using blowing-ups of singularities, and that 2λ_1 is not larger than the number of parameters. Since the increase of the stochastic complexity, F(n + 1) − F(n), is equal to the generalization error defined by the average Kullback distance of the estimated probability density from the true one (Levin, Tishby, & Solla, 1990; Amari, Fujita, & Shinomoto, 1992; Amari & Murata, 1993), our result claims that layered neural networks are better learning machines than regular statistical models if Bayesian estimation (Akaike, 1980; Mackay, 1992) or ensemble learning is applied in training.

The free energy F(n), which is an important observable in Bayesian statistics, information theory, and mathematical physics, has many other names and applications. For example, it is called the Bayesian criterion in Bayesian model selection (Schwarz, 1974), stochastic complexity in universal coding (Rissanen, 1986; Yamanishi, 1998), Akaike's Bayesian criterion in the optimization of hyperparameters (Akaike, 1980), and evidence in neural network learning (Mackay, 1992). In almost all of these studies, F(n) was calculated by using the Gaussian approximation or the saddle-point approximation, based on the assumption that the loss function can be approximated by a quadratic form around the unique true parameter. In neural network learning, we cannot use such an approximation. This paper constructs a general formula which enables us to analyze both identifiable and non-identifiable models in the same way. To study a loss function whose set of zero points contains singularities, we employ the Sato-Bernstein polynomial, or so-called b-function, of algebraic analysis, which extracts algebraic information from the set of true parameters. We also construct an algorithm to calculate the constants λ_1 and m_1 by using the resolution of singularities in algebraic geometry.
Resolution of singularities transforms an integral of several variables into a direct product of integrals of one variable, and enables us to algorithmically calculate the learning efficiency of an arbitrary learning machine in Bayesian estimation.
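As a minimal illustration of the non-identifiability discussed above (our sketch, not part of the paper), consider the toy machine ψ(x) = a·tanh(bx): every parameter pair on the set {ab = 0} realizes the same zero function, so the map from parameters to functions is not one-to-one on the true parameter set.

```python
import math

def psi(a, b, xs):
    """Evaluate the toy machine psi(x) = a * tanh(b * x) on a list of inputs."""
    return [a * math.tanh(b * x) for x in xs]

# A grid of inputs on [-2, 2].
xs = [x / 10.0 for x in range(-20, 21)]

# Three different parameter pairs, all lying on the analytic set {ab = 0}.
outputs = [psi(0.0, 3.0, xs), psi(0.7, 0.0, xs), psi(0.0, 0.0, xs)]

# Distinct parameters, identical functions: the parameter-to-function map
# collapses the whole set {ab = 0} onto the single zero function.
all_equal = all(
    max(abs(u - v) for u, v in zip(outputs[0], o)) == 0.0 for o in outputs[1:]
)
print(all_equal)
```

The set {ab = 0} is exactly the kind of analytic set with a singular point (the origin, where the two lines cross) that the paper analyzes.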
2 Main Results

Let p(y|x, w) be a conditional probability density of an output y ∈ R^N for a given input x ∈ R^M and a given parameter w ∈ R^d, which represents the probabilistic inference of a learning machine. Let ϕ(w) be a probability density function on the parameter space R^d, whose support is denoted by W = supp ϕ ⊂ R^d. We assume that the training or empirical sample pairs {(x_i, y_i); i = 1, 2, ..., n} are independently taken from q(y|x)q(x), where q(x) and q(y|x) represent the true input probability and the true inference probability, respectively. In Bayesian estimation, the estimated inference p_n(y|x) is the average over the a posteriori ensemble,

  p_n(y|x) = ∫ p(y|x, w) ρ_n(w) dw,   ρ_n(w) = (1/Z_n) ϕ(w) Π_{i=1}^{n} p(y_i|x_i, w),

where Z_n is the constant which ensures ∫ ρ_n(w) dw = 1. The generalization error is defined as the average Kullback distance of the estimated probability density p_n(y|x) from the true one q(y|x),

  K(n) = E_n { ∫ log [ q(y|x) / p_n(y|x) ] q(x, y) dx dy },

where E_n{·} denotes the expectation value over all sets of training sample pairs. In this paper we mainly consider the statistical estimation error and assume that the model can attain the true inference; in other words, there exists a parameter w_0 ∈ W such that p(y|x, w_0) = q(y|x).

Let us define the average and empirical loss functions,

  f(w) = ∫ log [ q(y|x) / p(y|x, w) ] q(y|x) q(x) dx dy,
  f_n(w) = (1/n) Σ_{i=1}^{n} log [ q(y_i|x_i) / p(y_i|x_i, w) ].

Note that f(w) ≥ 0 is the Kullback information. By the assumption, the set of true parameters

  W_0 = {w ∈ W ; f(w) = 0}

is not an empty set. If f(w) is an analytic function, then W_0 is called an analytic set of f(w). If f(w) is a polynomial, then W_0 is called an algebraic variety. Remark that W_0 is not a manifold in general, since no coordinates can be introduced in the neighborhood
of singular points. For example, layered neural networks such as three-layered perceptrons and radial basis functions have many singular points.

Remark. Even if W_0 consists of one point, or if the true distribution is not contained in the parametric models (W_0 is an empty set), the finiteness of the number of training samples often makes W_0 appear non-identifiable in neural network models. For the purpose of studying such cases rigorously, we assume that W_0 is not empty.

From these definitions, it is proven in (Levin, Tishby, & Solla, 1990; Amari, 1993) that the average Kullback distance K(n) is equal to the increase of the Bayesian stochastic complexity, or free energy, F(n),

  K(n) = F(n + 1) − F(n),

where F(n) is defined by

  F(n) = − E_n { log ∫ exp(−n f_n(w)) ϕ(w) dw }.

In this paper, we show the rigorous asymptotic form of F(n) and clarify the algebraic geometrical structure of the stochastic complexity. Theorems 1 and 2 are the main results of this paper. Let C_0^∞ be the set of all compactly supported C^∞-class functions on R^d.

Theorem 1  Assume that f(w) is an analytic function and ϕ(w) is a probability density function on R^d. Then there exists a real constant C_1 such that for any natural number n

  F(n) ≤ λ_1 log n − (m_1 − 1) log log n + C_1,   (1)

where the rational number λ_1 (λ_1 > 0) and the natural number m_1 are defined so that (−λ_1) is the largest pole, and m_1 its multiplicity, of the meromorphic function that is analytically continued from

  J(λ) = ∫_{f(w)<ǫ} f(w)^λ ϕ̃(w) dw   (Re(λ) > 0),

where ǫ > 0 is a sufficiently small constant, and ϕ̃(w) is an arbitrary nonzero C_0^∞-class function that satisfies 0 ≤ ϕ̃(w) ≤ ϕ(w).
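As a one-dimensional sanity check (our illustration, not from the paper): take f(w) = w² on (0, 1) with ϕ̃ ≡ 1 and ǫ = 1. Then J(λ) = ∫₀¹ w^{2λ} dw = 1/(2λ + 1) for Re(λ) > 0, and the closed form 1/(2λ + 1) is the meromorphic continuation, with a single simple pole at λ = −1/2. Hence λ_1 = 1/2 and m_1 = 1, matching 2λ_1 = d = 1 for this regular one-parameter case. The sketch below compares a numerical evaluation of the integral with the continued closed form.

```python
import math

def J_numeric(lmbda, pts=200000):
    """Midpoint-rule approximation of J(lambda) = integral_0^1 (w^2)^lambda dw,
    valid only in the region of convergence Re(lambda) > -1/2."""
    dw = 1.0 / pts
    return sum(((i + 0.5) * dw) ** (2.0 * lmbda) for i in range(pts)) * dw

def J_continued(lmbda):
    """Meromorphic continuation 1/(2*lambda + 1), defined for all lambda != -1/2."""
    return 1.0 / (2.0 * lmbda + 1.0)

# In the region of convergence, the numeric integral and the continuation agree.
for lam in (0.3, 1.0, 2.5):
    print(lam, J_numeric(lam), J_continued(lam))

# The continuation is finite for negative lambda right up to the pole at -1/2,
# and blows up as lambda approaches -1/2 (the largest pole, so lambda_1 = 1/2).
print(J_continued(-0.49))
```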
Proof of Theorem 1 is given in Section 3. It is also proven in Lemma 2 of Section 3 that all poles of the function J(λ) lie on the negative part of the real axis. In Theorem 1, the largest pole means the largest one as a real value. For Theorem 2, we define a condition.

Condition (A)  Let ψ(x, w) be a vector-valued function of (x, w) ∈ R^M × R^d. We define two conditions on ψ(x, w).
(1) ψ(x, w) is an analytic function of w ∈ W = supp ϕ ⊂ R^d which can be extended as a holomorphic function on some complex open set W*, where W ⊂ W* ⊂ C^d and W* is independent of x ∈ supp q ⊂ R^M.
(2) ψ(x, w) is a measurable function of x ∈ R^M that satisfies the condition

  ∫ sup_{w∈W} ‖ψ(x, w)‖² q(x) dx < ∞,   (2)

where ‖·‖ is the norm of the vector ψ(x, w).

Theorem 2  Let σ > 0 be a constant value. Assume that ϕ(w) is a C_0^∞-class probability density function. Let us consider the model

  p(y|x, w) = (1/(2πσ²)^{N/2}) exp( −‖y − ψ(x, w)‖² / (2σ²) ),

where both ψ(x, w) and ‖ψ(x, w)‖² satisfy Condition (A). Then there exists a constant C_2 > 0 such that for any natural number n

  F(n) ≥ λ_1 log n − (m_1 − 1) log log n − C_2,

where the rational number λ_1 (λ_1 > 0) and the natural number m_1 are defined so that (−λ_1) is the largest pole, and m_1 its multiplicity, of the meromorphic function that is analytically continued from

  J(λ) = ∫_{f(w)<ǫ} f(w)^λ ϕ(w) dw   (Re(λ) > 0),

where ǫ > 0 is a sufficiently small constant.

This theorem determines the asymptotic behavior of the stochastic complexity. Proof of Theorem 2 is given in Section 4.
Remark. A real-valued analytic function ψ(x, w) of x ∈ R^M and w ∈ R^d is said to have the associated convergence radii (r_1, r_2, ..., r_d) at (x, ŵ) if the Taylor expansion

  ψ(x, w) = Σ_{j_1, j_2, ..., j_d} a_{j_1, j_2, ..., j_d}(x) (w_1 − ŵ_1)^{j_1} (w_2 − ŵ_2)^{j_2} ··· (w_d − ŵ_d)^{j_d}

converges absolutely in {w ∈ R^d ; |w_j − ŵ_j| < r_j (j = 1, 2, ..., d)} and diverges in {w ∈ R^d ; |w_j − ŵ_j| > r_j (j = 1, 2, ..., d)}. For an N-dimensional vector-valued analytic function, the convergence radii are defined as (min r_1, min r_2, ..., min r_d), where min denotes the minimum over the corresponding N values. The associated convergence radii at (x, ŵ) are denoted by (r_1(x, ŵ), ..., r_d(x, ŵ)). A real analytic function ψ(x, w) can be extended as a holomorphic function on some open set W* ⊂ C^d independent of x if and only if

  min_{1≤j≤d} inf_{x∈supp q} inf_{w∈W} r_j(x, w) > 0

holds.

In Theorems 1 and 2, we introduced the meromorphic function J(λ), whose largest pole and its multiplicity determine the learning efficiency of the model. It is an important fact that J(λ) is invariant under the transform

  ∫ f(w)^λ ϕ(w) dw  →  ∫ f(g(u))^λ ϕ(g(u)) |g′(u)| du,

where g : U → W is an arbitrary analytic function from some parameter space U to the given parameter space W, and g′(u) is its Jacobian. Therefore the constants λ_1 and m_1 are also invariant under the above transform. The analytic function g, which need not have an inverse function g^{−1}, is sometimes called a birational mapping, and algebraic geometry clarifies the geometrical structure of the parameter space which is invariant under birational mappings.

To consider the learning curve, or the generalization error, let us introduce the definition of an asymptotic expansion. Let {s_i(n); i = 1, 2, 3, ...} be a sequence of real-valued functions of a natural number n which satisfies

  lim_{n→∞} s_{i+1}(n) / s_i(n) = 0   (i = 1, 2, ...).
This condition is referred to as s_{i+1}(n) = o(s_i(n)) (i = 1, 2, 3, ...). If a real-valued function s(n) satisfies the condition

  lim_{n→∞} (1/s_k(n)) { s(n) − Σ_{i=1}^{k} a_i s_i(n) } = 0   (k = 1, 2, 3, ..., K),

where {a_i} are real values, then we say that s(n) has the asymptotic expansion

  s(n) ≅ Σ_{i=1}^{K} a_i s_i(n).   (3)

Note that the coefficients {a_i} are determined uniquely. This definition contains the case K = ∞. For a real-valued function s(x) of a real variable x, the asymptotic expansion for x → α (α some value) is defined in the same way.

By Theorems 1 and 2, the function c(n) defined by

  c(n) = F(n) − λ_1 log n + (m_1 − 1) log log n

is a bounded function. If c(n + 1) − c(n) = o(1/(n log n)), then it follows that

  K(n) = λ_1/n − (m_1 − 1)/(n log n) + o(1/(n log n)),

which gives the learning curve of a non-identifiable learning machine.

Corollary 1  Assume the same conditions as Theorem 2. If c(n + 1) − c(n) = o(1/(n log n)), then the learning curve is given by

  K(n) ≅ λ_1/n − (m_1 − 1)/(n log n).

As is well known, regular statistical models have λ_1 = d/2 and m_1 = 1, which can be shown as a special case of Theorem 2 (see Example 1). Non-identifiable models such as neural networks have different values, λ_1 ≤ d/2 and m_1 ≥ 1, in general.

Corollary 2  Assume the same conditions as Theorem 1. If ϕ̃(w) can be taken so that ϕ̃(w) > 0 for arbitrary w ∈ W_0, then λ_1 ≤ d/2, where d is the dimension of the parameter space.

Corollary 2 is proven in Section 5. Even for a non-identifiable learning machine, we can calculate λ_1 and m_1 based on the resolution technique of singularities in algebraic geometry. An algorithm to calculate λ_1 and m_1 for a given learning machine is also shown in Section 5.
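The contrast stated above can be seen numerically. The following sketch (our illustration, not from the paper) assumes a uniform prior on [−1, 1]^d and uses the integral −log ∫ exp(−n f(w)) ϕ(w) dw, which by Lemma 1 below bounds F(n) and carries the same leading asymptotics, as a stand-in for the free energy. It compares the regular loss f(w) = w² (λ_1 = 1/2, m_1 = 1) with the singular loss f(a, b) = a²b² (for which λ_1 = 1/2 with multiplicity m_1 = 2): the (m_1 − 1) log log n term makes the apparent log n slope of the singular model strictly smaller.

```python
import math

def F_regular(n, pts=200000):
    # -log of Z(n) = integral_{-1}^{1} exp(-n * w^2) dw, midpoint rule.
    dw = 2.0 / pts
    z = sum(math.exp(-n * (-1.0 + (i + 0.5) * dw) ** 2) for i in range(pts)) * dw
    return -math.log(z)

def F_singular(n, pts=200000):
    # -log of Z(n) = integral over [-1,1]^2 of exp(-n * a^2 * b^2) da db.
    # The inner integral over b is done analytically:
    #   integral_{-1}^{1} exp(-n a^2 b^2) db = sqrt(pi) * erf(a sqrt(n)) / (a sqrt(n)),
    # leaving a one-dimensional midpoint rule over a in (0, 1), doubled by symmetry.
    da = 1.0 / pts
    z = 0.0
    for i in range(pts):
        t = ((i + 0.5) * da) * math.sqrt(n)
        z += math.sqrt(math.pi) * math.erf(t) / t * da
    return -math.log(2.0 * z)

def slope(F, n1, n2):
    # Apparent coefficient of log n between two sample sizes.
    return (F(n2) - F(n1)) / (math.log(n2) - math.log(n1))

s_reg = slope(F_regular, 100, 10000)    # close to d/2 = 0.5
s_sing = slope(F_singular, 100, 10000)  # dragged below 0.5 by the -log log n term
print(s_reg, s_sing)
```

The smaller apparent slope of the singular model is exactly the Bayesian advantage of non-identifiable machines claimed in the introduction.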
3 Proof of Theorem 1

In this section and the following one, we give the proofs of Theorems 1 and 2. These proofs not only establish the theorems but also clarify the mathematical structure of Bayesian learning in artificial neural network models.

Lemma 1  Assume that f(w) and f_n(w) are continuous functions and that ϕ(w) is a probability density function. Then the inequality

  F(n) ≤ − log ∫ exp(−n f(w)) ϕ(w) dw

holds for an arbitrary natural number n.

[Proof of Lemma 1] From Jensen's inequality,

  log ∫ exp(a(w)) b(w) dw ≥ ∫ a(w) b(w) dw

holds for an arbitrary continuous function a(w) and an arbitrary compactly supported probability density b(w). First, assume that ϕ(w) is a compactly supported function. By applying the above inequality to the special case

  a(w) = −n { f_n(w) − f(w) },   b(w) = (1/Y) exp(−n f(w)) ϕ(w),

where Y = ∫ exp(−n f(w)) ϕ(w) dw, we obtain

  log (1/Y) ∫ exp(−n f_n(w)) ϕ(w) dw ≥ −(1/Y) ∫ n { f_n(w) − f(w) } exp(−n f(w)) ϕ(w) dw.

By using E_n { f_n(w) − f(w) } = 0 and Fubini's theorem, it follows that

  E_n { log ∫ exp(−n f_n(w)) ϕ(w) dw } ≥ log ∫ exp(−n f(w)) ϕ(w) dw

for an arbitrary compactly supported function ϕ(w). Since the set of all compactly supported functions is dense in the set of all probability density functions in the L¹ norm, Lemma 1 is obtained. (Q.E.D.)

For a given ǫ > 0, a set of parameters W_ǫ is defined by

  W_ǫ = {w ∈ W = supp ϕ ; f(w) < ǫ}.

We now introduce the Sato-Bernstein b-function.
Theorem 3 (Sato, Bernstein, Björk, Kashiwara)  Assume that there exists ǫ_0 > 0 such that f(w) is an analytic function in W_{ǫ_0}. Then there exists a triple (ǫ, P, b), where (1) ǫ < ǫ_0 is a positive constant, (2) P = P(λ, w, ∂_w) is a differential operator which is a polynomial in λ, and (3) b(λ) is a polynomial, such that

  P(λ, w, ∂_w) f(w)^{λ+1} = b(λ) f(w)^λ   (∀w ∈ W_ǫ, ∀λ ∈ C).

The zeros of the algebraic equation b(λ) = 0 are real, rational, and negative numbers.

[Explanation of Theorem 3] In Theorem 3, P = P(λ, w, ∂_w) is a finite-order differential operator in w whose coefficients are analytic functions of w. P can be understood as a mapping from C^∞ to C^∞. The adjoint operator P*, defined formally in the same way as for the usual partial differential operators, satisfies the relation

  ∫ φ(w) P(λ, w, ∂_w) ϕ(w) dw = ∫ { P*(λ, w, ∂_w) φ(w) } ϕ(w) dw

for any φ ∈ C^∞, ϕ ∈ C_0^∞. Theorem 3 was proven based on the algebraic property of the ring of partial differential operators; see the references (Bernstein, 1972; Sato & Shintani, 1974; Björk, 1979). The rationality of the zeros of b(λ) = 0 is shown based on the resolution of singularities (Atiyah, 1970; Kashiwara, 1976). The smallest-order polynomial b(λ) that satisfies the above relation is called the Sato-Bernstein polynomial or the b-function. Recently, an algorithm to calculate the b-function has been developed (Oaku, 1997). [Explanation of Theorem 3 End]

Remark. By taking ǫ > 0 small enough, we can assume

  f(w) > 0   (∀w ∈ W_ǫ \ W_0),   (4)

because f(w) is analytic. Hereafter, ǫ > 0 is taken small enough that both Theorem 3 and eq. (4) hold. We can also assume ǫ < 1 without loss of generality. For a given analytic function f(w), let us define a complex function J(λ) of λ ∈ C by

  J(λ) = ∫_{W_ǫ} f(w)^λ ϕ(w) dw.
Lemma 2  Assume that f(w) is an analytic function in W_ǫ and that ϕ(w) is a C_0^∞-class function. Then J(λ) can be analytically continued to a meromorphic function on the entire complex plane; in other words, J(λ) has only poles in |λ| < ∞. Moreover, J(λ) satisfies the following conditions.
(1) The poles of J(λ) are rational, real, and negative numbers.
(2) For an arbitrary a ∈ R, J(a ± i∞) ≡ lim_{b→∞} J(a ± ib) = 0.

[Proof of Lemma 2] J(λ) is an analytic function in the region Re(λ) > 0; this follows from Lebesgue's convergence theorem. Next, let us show J(a ± i∞) = 0 for fixed a > 0. For t ∈ R we define a function

  Ĵ_a(t) = exp(at) ∫_{W_ǫ} δ(t − log f(w)) ϕ(w) dw,

where Ĵ_a(t) is well defined, because δ(g(w)) is well defined if ‖∇g(w)‖ > 0 for any w that satisfies g(w) = 0 (Gel'fand & Shilov, 1964). From the definition, Ĵ_a(t) is an L¹ function, and

  J(a ± ib) = ∫_{−∞}^{log ǫ} Ĵ_a(t) exp(±ibt) dt,

which shows that we can apply the Riemann-Lebesgue lemma, with the result that J(a ± i∞) = 0 for fixed a > 0. Lastly, by using the formal adjoint operator P*,

  J(λ) = (1/b(λ)) ∫_{W_ǫ} P f(w)^{λ+1} ϕ(w) dw = (1/b(λ)) ∫_{W_ǫ} f(w)^{λ+1} P* ϕ(w) dw,

where we used the property of the b-function. Because P* ϕ ∈ C_0^∞, this relation analytically continues J(λ) from Re(λ) > 0 to Re(λ) > −1 except at the zeros of b(λ), and by iterating it, to the entire complex plane; by the same continuation, J(a ± i∞) = 0 even for a < 0. If b(λ) = 0, then such λ is at most a pole, which lies on the negative part of the real axis. (Q.E.D.)

Definition. The poles of the function J(λ) lie on the negative part of the real axis and are contained in the set {ν − m ; m = 0, 1, 2, ..., b(ν) = 0}. They are ordered from bigger to smaller and referred to as (−λ_1), (−λ_2), (−λ_3), ... (each λ_k > 0 is a rational number), and the multiplicity of (−λ_k) is denoted by m_k.
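To make the b-function concrete, take the one-dimensional loss f(w) = w² (our illustration, not an example from this paper). With P = (1/4) ∂²/∂w², one checks P f^{λ+1} = (λ + 1)(λ + 1/2) f^λ, so b(λ) = (λ + 1)(λ + 1/2); its zeros −1 and −1/2 are indeed real, rational, and negative, and the largest pole of the corresponding J(λ) is −1/2, giving λ_1 = 1/2. The sketch below verifies the operator identity numerically with a central second difference.

```python
import math

def second_derivative(g, w, h=1e-4):
    """Central finite-difference approximation of g''(w)."""
    return (g(w + h) - 2.0 * g(w) + g(w - h)) / (h * h)

lam = 0.7   # any lambda in the region of convergence works for this check
w0 = 1.3    # a point with f(w0) = w0^2 > 0

# Left-hand side: P f^{lambda+1}, with P = (1/4) d^2/dw^2 applied to (w^2)^(lambda+1).
lhs = 0.25 * second_derivative(lambda w: (w * w) ** (lam + 1.0), w0)

# Right-hand side: b(lambda) f^lambda, with b(lambda) = (lambda + 1)(lambda + 1/2).
rhs = (lam + 1.0) * (lam + 0.5) * (w0 * w0) ** lam

print(lhs, rhs)
```

The same identity, divided by b(λ), is exactly the relation used in the proof of Lemma 2 to push the analytic continuation of J(λ) one unit to the left at a time.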
We define a real-valued function I(t) by

  I(t) = ∫_{W_ǫ} δ(t − f(w)) ϕ(w) dw   (0 < t < ǫ).   (5)

Here, since f(w) > 0 for w ∈ W_ǫ \ W_0, δ(t − f(w)) is well defined (Gel'fand & Shilov, 1964). For t ≥ ǫ or t ≤ 0 we define I(t) = 0.

Lemma 3  Assume that f(w) is an analytic function in W_ǫ and that ϕ(w) is a C_0^∞-class function. Then I(t) has an asymptotic expansion for t → 0,

  I(t) ≅ Σ_{k=1}^{∞} Σ_{m=0}^{m_k−1} c_{k,m+1} t^{λ_k−1} (−log t)^m,   (6)

where (m! · c_{k,m+1}) is the coefficient of the (m + 1)-th order term in the Laurent expansion of J(λ) at the pole λ = −λ_k.

[Proof of Lemma 3] A special case of this lemma is shown in the theory of distributions (Gel'fand & Shilov, 1964). Let us define

  I_K(t) ≡ I(t) − Σ_{k=1}^{K} Σ_{m=0}^{m_k−1} c_{k,m+1} t^{λ_k−1} (−log t)^m.

For eq. (6), it is sufficient to show that, for an arbitrary fixed K,

  lim_{t→+0} I_K(t) t^{−λ} = 0   (∀λ < λ_{K+1} − 1).   (7)

From the definition of J(λ),

  J(λ) = ∫_0^ǫ I(t) t^λ dt.

A simple calculation shows

  ∫_0^1 t^{λ+λ_k−1} (−log t)^m dt = m! / (λ + λ_k)^{m+1}.

Therefore,

  ∫_0^1 I_K(t) t^λ dt = J_K(λ),

where J_K(λ) is defined by

  J_K(λ) = J(λ) − Σ_{k=1}^{K} Σ_{m=0}^{m_k−1} m! c_{k,m+1} / (λ + λ_k)^{m+1},

which is holomorphic in the region Re(λ) > −λ_{K+1}. By putting t = e^{−x} and using the inverse Laplace transform,

  I_K(e^{−x}) e^{−x} = (1/2πi) ∫_{τ−i∞}^{τ+i∞} J_K(u) e^{ux} du   (8)
holds for any real τ > 0. By Lemma 2, J(a ± i∞) = 0, and thus J_K(a ± i∞) = 0 for an arbitrary real a. Since J_K(λ) is holomorphic in the region Re(λ) > −λ_{K+1}, the integration path in eq. (8) can be moved from τ > 0 to any τ > −λ_{K+1}. Therefore,

  I_K(e^{−x}) e^{−x−τx} = (1/2π) ∫_{−∞}^{+∞} J_K(τ + iu) e^{iux} du   (9)

holds for any real x. Hence, by putting x = 0, J_K(τ + iu) ∈ L¹. The right-hand side of eq. (9) goes to zero as x → ∞ because of the Riemann-Lebesgue lemma. Here, by putting x = −log t, we obtain eq. (7). (Q.E.D.)

[Proof of Theorem 1] By combining the above results, we have

  F(n) ≤ − log ∫_W exp(−n f(w)) ϕ(w) dw
       ≤ − log ∫_{W_ǫ} exp(−n f(w)) ϕ̃(w) dw
       = − log ∫_0^ǫ e^{−nt} Ĩ(t) dt
       = − log { (1/n) ∫_0^{nǫ} e^{−t} Ĩ(t/n) dt }
       = − log { Σ_k Σ_{m=0}^{m_k−1} Σ_{j=0}^{m} c_{k,m+1} (_mC_j) ((log n)^j / n^{λ_k}) ∫_0^{nǫ} e^{−t} t^{λ_k−1} (−log t)^{m−j} dt }
       = λ_1 log n − (m_1 − 1) log log n + O(1),

where Ĩ(t) is defined by eq. (5) with ϕ̃(w) in place of ϕ(w). (Q.E.D.)

4 Proof of Theorem 2

Hereafter, for simplicity of description, we assume that the model is given by

  p(y|x, w) = (1/√(2π)) exp( −(1/2)(y − ψ(x, w))² ).

It is easy to extend the proof to the case of a general standard deviation (σ > 0) and a general output dimension (N > 1). For this model,

  f(w) = (1/2) ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx,
  f_n(w) = (1/2n) Σ_{i=1}^{n} (ψ(x_i, w) − ψ(x_i, w_0))² − (1/n) Σ_{i=1}^{n} η_i (ψ(x_i, w) − ψ(x_i, w_0)),

where {η_i ≡ y_i − ψ(x_i, w_0)} are independent samples from the standard normal distribution, and w_0 is an arbitrary element of W_0.
Lemma 4  Let {(x_i, η_i)}_{i=1}^{n} be a set of independent samples taken from q(x)q_0(η), where q(x) is a probability density and q_0(η) is the standard normal distribution. Assume that a function ξ(x, w) satisfies Condition (A) and that the Taylor expansion of ξ(x, w) at ŵ converges absolutely in the region T = {w ; |w_j − ŵ_j| < r_j}. For a given constant 0 < a < 1, we define the region T_a ≡ {w ; |w_j − ŵ_j| < a r_j}. Then the following hold.
(1) If ∫ ξ(x, w) q(x) dx = 0, there exists a constant c such that for arbitrary n,

  A_n ≡ E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} ξ(x_i, w) |² } < c < ∞.

(2) There exists a constant c such that for arbitrary n,

  B_n ≡ E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} η_i ξ(x_i, w) |² } < c < ∞.

[Proof of Lemma 4] We show (1); statement (2) can be proven in the same way. We write k = (k_1, k_2, ..., k_d) and

  ξ(x, w) = Σ_k a_k(x) (w − ŵ)^k = Σ_{k_1,...,k_d} a_{k_1 ··· k_d}(x) (w_1 − ŵ_1)^{k_1} ··· (w_d − ŵ_d)^{k_d}.

This power series converges absolutely in T, therefore ξ(x, w) can be extended as a holomorphic function in T. By Cauchy's integral formula for several complex variables,

  a_k(x) = (1/(2πi)^d) ∫_{C_1} ··· ∫_{C_d} ξ(x, w) / Π_j (w_j − ŵ_j)^{k_j+1} dw_1 ··· dw_d,   (10)

where C_j is a circle of radius r_j − δ. We define a function M(x) by M(x) = sup_w |ξ(x, w)|, the supremum being taken over the closed polydisc of radii r_j − δ. Then by eq. (2) in Condition (A),

  M ≡ { ∫ M(x)² q(x) dx }^{1/2} < ∞,

and there exists δ > 0 such that

  |a_k(x)| ≤ M(x) / Π_{j=1}^{d} (r_j − δ)^{k_j}.   (11)
By the assumption, ∫ a_k(x) q(x) dx = 0. Thus

  E_n { | (1/√n) Σ_{i=1}^{n} a_k(x_i) |² }^{1/2} = { ∫ |a_k(x)|² q(x) dx }^{1/2} ≤ M / Π_j (r_j − δ)^{k_j}.

By using Lemma 5 in the Appendix, we obtain

  A_n^{1/2} = E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} ξ(x_i, w) |² }^{1/2}
            = E_n { sup_{w∈T_a} | Σ_k (1/√n) Σ_{i=1}^{n} a_k(x_i) (w − ŵ)^k |² }^{1/2}
            ≤ Σ_k E_n { sup_{w∈T_a} | (1/√n) Σ_{i=1}^{n} a_k(x_i) (w − ŵ)^k |² }^{1/2}
            ≤ M Σ_k Π_{j=1}^{d} ( a r_j / (r_j − δ) )^{k_j} < ∞,

where δ is taken so that a r_j < r_j − δ (j = 1, 2, ..., d). (Q.E.D.)

The function ζ_n(w), which represents the fluctuation of learning, is defined as follows:

  ζ_n(w) = √n (f(w) − f_n(w)) / √f(w).

Note that ζ_n(w) is an analytic function in W \ W_0. At w_0 ∈ W_0, ζ_n(w) may be discontinuous, but the following theorem ensures that it is bounded on average.

Theorem 4  Assume the same conditions as Theorem 2. Then there exists a constant c such that for arbitrary n,

  E_n { sup_{w∈W\W_0} |ζ_n(w)|² } < c.

[Proof of Theorem 4] Outside a neighborhood of W_0, this theorem can be proven by Lemma 4. Hence we can assume that W = W_ǫ. Since W is compact, W is covered by a union of finitely many small open sets. Thus we can assume w is in a neighborhood U of w_0 ∈ W_0 whose closure Ū is contained within the associated convergence radii. We define

  a_n^{(1)}(w) = (1/n) Σ_{i=1}^{n} η_i (ψ(x_i, w) − ψ(x_i, w_0)),
  a_n^{(2)}(w) = (1/2) ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx − (1/2n) Σ_{i=1}^{n} (ψ(x_i, w) − ψ(x_i, w_0))²,
where {η_i} are samples independently taken from the standard normal distribution. By using these definitions,

  ζ_n(w) = √n { a_n^{(1)}(w) + a_n^{(2)}(w) } / √f(w).

We define a^{(j)}(n) (j = 1, 2) by

  a^{(j)}(n) = E_n { sup_{w∈U\W_0} n |a_n^{(j)}(w)|² / f(w) }.

It follows that

  E_n { sup_{w∈U\W_0} |ζ_n(w)|² } ≤ 2 a^{(1)}(n) + 2 a^{(2)}(n).

For the proof of Theorem 4, it is sufficient to show that {a^{(j)}(n); j = 1, 2} are bounded.

First, we show the finiteness of a^{(1)}(n). From Lemma 6 in the Appendix, there exists a finite set of functions {g_j, h_j}_{j=1}^{J}, where g_j(w) is a real-valued analytic function and h_j(x, w) is a function which satisfies Condition (A), such that

  ψ(x, w) − ψ(x, w_0) = Σ_{j=1}^{J} g_j(w) h_j(x, w),   (12)

where the matrix M_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) q(x) dx is positive definite. Let α > 0 be taken smaller than the minimum eigenvalue of the matrices {M_{jk}(w) ; w ∈ Ū}. By the definition,

  f(w) = (1/2) ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx
       = (1/2) Σ_{j,k=1}^{J} M_{jk}(w) g_j(w) g_k(w) ≥ (α/2) Σ_{j=1}^{J} g_j(w)²,

by taking ǫ > 0 small. Therefore, by using the Cauchy-Schwarz inequality,

  a^{(1)}(n) = E_n { sup_{w∈U\W_0} (1/f(w)) | Σ_{j=1}^{J} g_j(w) (1/√n) Σ_{i=1}^{n} η_i h_j(x_i, w) |² }
             ≤ (2/α) Σ_{j=1}^{J} E_n { sup_{w∈U\W_0} | (1/√n) Σ_{i=1}^{n} η_i h_j(x_i, w) |² },
which is bounded by some constant by the preceding Lemma 4.

Secondly, we show the finiteness of a^{(2)}(n). By the assumption that both ψ(x, w) and ψ(x, w)² satisfy Condition (A), the inequality

  ∫ {1 + M(x)²} (ψ(x, w) − ψ(x, w_0))² q(x) dx < ∞

holds, where M(x) = sup_{w∈W} |ψ(x, w) − ψ(x, w_0)|. By using Lemma 6, there exists a finite set of functions {g_j, h_j}_{j=1}^{J}, where g_j(w) is a real-valued analytic function and h_j(x, w) satisfies Condition (A) with {1 + M(x)²} q(x) in place of q(x), such that

  ψ(x, w) − ψ(x, w_0) = Σ_{j=1}^{J} g_j(w) h_j(x, w),

where the two matrices

  L_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) q(x) dx,
  N_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) {1 + M(x)²} q(x) dx

are positive definite. Hence

  ∫ (ψ(x, w) − ψ(x, w_0))⁴ q(x) dx / ∫ (ψ(x, w) − ψ(x, w_0))² q(x) dx
    ≤ Σ_{j,k} N_{jk}(w) g_j(w) g_k(w) / Σ_{j,k} L_{jk}(w) g_j(w) g_k(w) ≤ C,

where C > 0 is a constant. Thus

  f(w) ≥ (1/C) ∫ {ψ(x, w) − ψ(x, w_0)}⁴ q(x) dx,

which ensures that

  a^{(2)}(n) ≤ C · E_n { sup_{w∈U\W_0} n |a_n^{(2)}(w)|² / ∫ {ψ(x, w) − ψ(x, w_0)}⁴ q(x) dx }.

The finiteness of the last term can be shown in the same way as for a^{(1)}(n), where {ψ(x, w) − ψ(x, w_0)}² is decomposed by Lemma 6 instead of ψ(x, w) − ψ(x, w_0). (Q.E.D.)

[Proof of Theorem 2] Let us prove Theorem 2. We define

  α_n = sup_{w∈W\W_0} |ζ_n(w)|.
Then, by Theorem 4, E_n{α_n²} < ∞. The free energy, or Bayesian stochastic complexity, satisfies

  F(n) = − E_n { log ∫_W exp(−n f_n(w)) ϕ(w) dw }
       = − E_n { log ∫_W exp(−n f(w) + √(n f(w)) ζ_n(w)) ϕ(w) dw }
       ≥ − E_n { log ∫_W exp(−n f(w) + α_n √(n f(w))) ϕ(w) dw }
       ≥ − log ∫_W exp(−(n/2) f(w)) ϕ(w) dw − (1/2) E_n{α_n²},

where we used the inequality α_n √(n f(w)) ≤ (α_n² + n f(w))/2. Let us define Z_i(n) (i = 1, 2) by

  Z_i(n) = ∫_{W(i)} exp(−(n/2) f(w)) ϕ(w) dw,

where W(1) = W_ǫ and W(2) = W \ W_ǫ. Then, by the same method as in the proof of Theorem 1,

  Z_1(n) · n^{λ_1} / (log n)^{m_1−1} → c_{1,m_1}.

On the other hand,

  Z_2(n) ≤ ∫_{W\W_ǫ} exp(−nǫ/2) ϕ(w) dw ≤ exp(−nǫ/2).

Therefore,

  F(n) ≥ − log { c_{1,m_1} (log n)^{m_1−1} / n^{λ_1} + exp(−nǫ/2) } + const.
       = λ_1 log n − (m_1 − 1) log log n + O(1),

which completes the proof of Theorem 2. (Q.E.D.)

5 Algorithm to calculate the learning efficiency

Theorem 2 shows that the important values λ_1 and m_1 are determined by the meromorphic function J(λ). However, since J(λ) is defined by an integral of several variables, it is not so easy to determine its largest pole and multiplicity. If J(λ) is given by an integral of a single variable, it is easy: for example, if

  J(λ) = ∫_0^ǫ x^{2λ} x^r dx,

then the largest pole of J(λ) is −(r + 1)/2, so λ_1 = (r + 1)/2 and m_1 = 1. The resolution of singularities in algebraic geometry transforms an arbitrary
integral of several variables into an integral whose essential term is a direct product of integrals of single variables. In fact, Atiyah showed that the following Theorem 5 follows directly from Hironaka's theorem (Hironaka, 1964; Atiyah, 1970).

Theorem 5 (Hironaka, Atiyah)  Let f(w) be a real analytic function defined in a neighborhood of 0 ∈ R^d. Then there exist an open set W ∋ 0, a real analytic manifold U, and a proper analytic map g : U → W such that
(1) g : U \ U_0 → W \ W_0 is a biregular map, where W_0 = f^{−1}(0) and U_0 = g^{−1}(W_0),
(2) for each P ∈ U there are local analytic coordinates (u_1, ..., u_d) centered at P such that, locally near P, we have

  f(g(u_1, ..., u_d)) = a(u_1, ..., u_d) u_1^{k_1} u_2^{k_2} ··· u_d^{k_d},   (13)

where a(u) is an invertible analytic function and k_i ≥ 0.

[Explanation of Theorem 5] This theorem is a special version of the well-known Hironaka resolution of singularities in algebraic geometry. Since f(w) is not a polynomial but an analytic function, the singularities of f can be resolved only locally. Hironaka's proof is completely constructive, and the above map g can be constructed by finitely many recursive procedures, which are blowing-ups of non-singular manifolds contained in the singular sets. In the theorem, both U and W are locally compact Hausdorff spaces and g is continuous. In this case, g is a proper map if and only if g^{−1}(K) is compact for every compact set K. [Explanation of Theorem 5 End]

The following is an algorithm to calculate λ_1 and m_1.

An algorithm to calculate the learning efficiency.
(1) Cover the analytic set (the set of true parameters) W_0 = {w ∈ supp ϕ ; f(w) = 0} by a finite union of open neighborhoods W_α.
(2) For each neighborhood W_α, find a resolution map g_α and a manifold U_α by using blowing-ups. Since g_α is a proper map and the closure of W_α is compact, U_α is covered by a finite union of open sets {U_αβ} whose closures are homeomorphic to compact sets in some Euclidean spaces.
(3) For each neighborhood $W_{\alpha\beta} = g_\alpha(U_{\alpha\beta})$, the function $J_{\alpha\beta}(\lambda)$ is calculated by eq.(13):
$$J_{\alpha\beta}(\lambda) = \int_{W_{\alpha\beta}} f(w)^\lambda\,\varphi(w)\,dw = \int_{U_{\alpha\beta}} f(g_\alpha(u))^\lambda\,\varphi(g_\alpha(u))\,|g'_\alpha(u)|\,du = \int_{U_{\alpha\beta}} a(u)^\lambda\, u_1^{\lambda k_1} u_2^{\lambda k_2} \cdots u_d^{\lambda k_d}\,\varphi(g_\alpha(u))\,|g'_\alpha(u)|\,du,$$
where $|g'_\alpha(u)|$ is the Jacobian. Note that $\{u;\ g'_\alpha(u) = 0\} \subset g_\alpha^{-1}(W_0)$. The last integration can be carried out variable by variable, and the largest pole $\lambda_1^{(\alpha\beta)}$ of $J_{\alpha\beta}(\lambda)$ and its multiplicity $m_1^{(\alpha\beta)}$ are obtained, where the Taylor expansion of $|g'_\alpha(u)|$ is used.
(4) The largest pole $\lambda_1$ of $J(\lambda)$ is $\lambda_1 = \max_{\alpha\beta} \lambda_1^{(\alpha\beta)}$, and its multiplicity $m_1$ is $m_1^{(\alpha^*\beta^*)}$, where $(\alpha^*, \beta^*)$ is the argument that maximizes $\lambda_1^{(\alpha\beta)}$. If the pair $(\alpha, \beta)$ that maximizes $\lambda_1^{(\alpha\beta)}$ is not unique, then the pair that maximizes $m_1^{(\alpha\beta)}$ among them is chosen.

In order to calculate $\lambda_1$ and $m_1$, only the neighborhood $W_\alpha$ that gives the largest pole is important. The singularity contained in such a neighborhood $W_\alpha$ is called the deepest singularity in this paper. Note that, by Theorem 5, $1 \le m_1 \le d$, where $d$ is the number of parameters.

Example 1 (Regular models) For regular statistical models, by using appropriate coordinates $(w_1, \ldots, w_d)$ the average loss function $f(w)$ can locally be written as
$$f(w) = \sum_{i=1}^d w_i^2.$$
Then, around the origin, with $W_1 = \{w;\ \|w\| < 1\}$,
$$J(\lambda) = \int_{W_1} f(w)^\lambda\,dw,$$
where we replaced $\varphi(w)$ by $\varphi(w) = \mathrm{const.}$, based on the natural assumption that $\varphi(w) > 0$ on $W$. We define $W_{11} = \{w \in W;\ |w_i| < |w_1|\}$ and $U_{11} = \{(u_1, \ldots, u_d);\ |u_i| < 1\}$. By using the blowing-up, we find a map $g_1 : (u_1, \ldots, u_d) \to (w_1, \ldots, w_d)$,
$$w_1 = u_1, \qquad w_i = u_1 u_i \quad (2 \le i \le d).$$
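This blowing-up can be checked symbolically. The following sketch (in Python with sympy; the choice $d = 3$ is ours, for concreteness) substitutes $g_1$ into $f$ and confirms the normal crossing form of eq.(13) together with the Jacobian:

```python
import sympy as sp

# Regular model with d = 3 parameters: f(w) = w1^2 + w2^2 + w3^2.
u1, u2, u3 = sp.symbols('u1 u2 u3', real=True)

# Blowing-up map g1: w1 = u1, wi = u1*ui for i >= 2.
w = [u1, u1 * u2, u1 * u3]
f = sum(wi**2 for wi in w)

# Normal crossing form of eq.(13): f(g1(u)) = a(u) * u1^2,
# where a(u) = 1 + u2^2 + u3^2 is invertible near u = 0.
assert sp.expand(f - u1**2 * (1 + u2**2 + u3**2)) == 0

# The Jacobian of g1 is u1^(d-1) = u1^2.
jac = sp.Matrix(w).jacobian([u1, u2, u3]).det()
assert sp.simplify(jac - u1**2) == 0
```

Combining f(g1(u))^λ with the Jacobian gives the factor |u1|^(2λ + d − 1), whose integral over |u1| < 1 produces a single simple pole at λ = −d/2; the invertible factor (1 + u2^2 + u3^2)^λ is bounded and contributes no pole.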
Then the function $J_{11}(\lambda)$ is
$$J_{11}(\lambda) = \int_{W_{11}} f(w)^\lambda\,dw = \int u_1^{2\lambda}\Big(1 + \sum_{i=2}^d u_i^2\Big)^\lambda |u_1|^{d-1}\,du_1\,du_2 \cdots du_d.$$
This function has a pole at $\lambda = -d/2$ with multiplicity $m_1 = 1$. We define $W_{1i}$ and $J_{1i}(\lambda)$ in the same way as $W_{11}$ and $J_{11}(\lambda)$; then $W_1$ is the union of the $W_{1i}$ and a measure-zero set. Therefore, the free energy is
$$F(n) = \frac{d}{2}\log n + O(1),$$
and if $K(n)$ satisfies the same condition as Corollary 1, then $K(n) \cong d/(2n)$.

[Proof of Corollary 2] Let $w_0$ be an arbitrary fixed point in $W_0$. We can assume $w_0 = 0$ without loss of generality. Since $f(w)$ is analytic, there exists a constant $a > 0$ such that, for any $w$ in a neighborhood of $0$,
$$f(w) \le a\|w\|^2.$$
From Lemma 2, it is sufficient to analyze the case when $\lambda$ is real and negative. For such $\lambda$, $J(\lambda)$ has a real value if it is finite. Let $\varphi^*$ be a positive $C_0$-class function smaller than $\varphi(w)$. Then, for a sufficiently small constant $\delta > 0$,
$$J(\lambda) \ge \int_{\|w\| < \delta} f(w)^\lambda\,\varphi^*(w)\,dw \ge \int_{\|w\| < \delta} (a\|w\|^2)^\lambda\,\varphi^*(w)\,dw.$$
As is shown in Example 1, the largest pole of the last term is $-d/2$, which shows that the largest pole of $J(\lambda)$ is not smaller than $-d/2$. (Q.E.D.)

Example 2 If the model
$$p(y|x, a, b) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\big(y - a\tanh(bx)\big)^2\Big)$$
is trained using samples from $p(y|x, 0, 0)$, then
$$f(a, b) = a^2 \int \tanh(bx)^2\,q(x)\,dx.$$
In this case, $W_0 = \{ab = 0\}$ and the deepest singularity is the origin. In a neighborhood of the origin, the essential term is $f(a, b) = a^2 b^2$, and the
other terms are smaller than this term. It immediately follows that $\lambda_1 = 1/2$, $m_1 = 2$, so that
$$F(n) = \frac{1}{2}\log n - \log\log n + O(1).$$
Comparing $1/2$ with $d/2 = 2/2$, we see that the free energy is smaller than in the regular model case.

Example 3 Let us consider a neural network
$$p(y|x, a, b, c, d) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\big(y - \psi(x, a, b, c, d)\big)^2\Big), \qquad \psi(x, a, b, c, d) = a\tanh(bx) + c\tanh(dx).$$
Assume that the true regression function is $\psi(x, 0, 0, 0, 0)$. Then
$$W_0 = \{ab = 0 \ \text{and}\ cd = 0\} \cup \{a + c = 0 \ \text{and}\ b = d\} \cup \{a = c \ \text{and}\ b + d = 0\},$$
and the deepest singularity of $f(a, b, c, d)$ is $(0, 0, 0, 0)$. In the neighborhood of the origin, defined as $W_1$,
$$f(a, b, c, d) = (ab + cd)^2 + (ab^3 + cd^3)^2,$$
since the higher-order terms can be bounded by the above two terms (Watanabe, 1998b). Let us define
$$W_{11} = \{(a, b, c, d);\ |b| > |d|,\ |c| > |a|,\ |ad^3| < |ab + cd| < |cd^3|\},$$
$$U_{11} = \{(x, y, z, w);\ |x| < 1,\ |y| < 1,\ |w| < 1,\ w - 1 < z < w + 1\}.$$
By using blowing-ups, we find a map $g_1 : U_{11} \to W_{11}$ defined by
$$a = x, \quad b = y^3 w - yzw, \quad c = xzw, \quad d = y.$$
From the algorithmic point of view, $W_{11}$ and $U_{11}$ are determined systematically in the process of blowing-ups. By using this transform, we obtain
$$f(g_1(x, y, z, w)) = x^2 y^6 w^2\big[1 + \{w^2(y^2 - z)^3 + z\}^2\big], \qquad g'_1(x, y, z, w) = xy^3 w.$$
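The two identities above can be verified mechanically. A sketch with sympy (variable names follow the text):

```python
import sympy as sp

x, y, z, w = sp.symbols('x y z w', real=True)

# The map g1 of Example 3: (x, y, z, w) -> (a, b, c, d).
a, b, c, d = x, y**3 * w - y * z * w, x * z * w, y

# Essential part of the loss in W_1: f = (ab + cd)^2 + (ab^3 + cd^3)^2.
f = (a * b + c * d)**2 + (a * b**3 + c * d**3)**2

# f(g1(x, y, z, w)) = x^2 y^6 w^2 [1 + {w^2 (y^2 - z)^3 + z}^2]
target = x**2 * y**6 * w**2 * (1 + (w**2 * (y**2 - z)**3 + z)**2)
assert sp.expand(f - target) == 0

# Jacobian g1'(x, y, z, w) = x y^3 w (up to sign).
det = sp.Matrix([a, b, c, d]).jacobian([x, y, z, w]).det()
assert sp.simplify(det**2 - (x * y**3 * w)**2) == 0
```

In the integrand f(g1)^λ |x y^3 w|, the bracket [1 + {...}^2]^λ is bounded above and below on U_11, so only the monomial exponents of x, y, and w matter for the poles.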
We naturally assume that $\varphi(w) > 0$ on $\{w = (a, b, c, d);\ |a| < 1,\ |b| < 1,\ |c| < 1,\ |d| < 1\}$. In the calculation of the poles of $J(\lambda)$, we can simply put $\varphi(w) = 1$ on this set. Then
$$J_1(\lambda) = \int_{W_1} f(w)^\lambda\,dw = \int_{-1}^1 dx \int_{-1}^1 dy \int_{-1}^1 dw \int_{w-1}^{w+1} dz\, f(g_1(x, y, z, w))^\lambda\,|xy^3 w|,$$
resulting in $\lambda_1^{(1)} = 2/3$ and $m_1^{(1)} = 1$. From the other regions $W_{12}, W_{13}, \ldots$, we do not obtain a pole larger than $2/3$. Hence
$$F(n) = \frac{2}{3}\log n + O(1),$$
whereas the dimension divided by two is $d/2 = 4/2$.

Lastly, we introduce an elementary inequality for the asymptotic expansion of the free energy. Let us consider $K$ learning machines $(1 \le k \le K)$,
$$p_k(y|x, w_k) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big(-\frac{1}{2\sigma^2}\|y - \psi_k(x, w_k)\|^2\Big),$$
where $N$ is the number of output units. Assume that the asymptotic expansions of the free energies are respectively given by
$$F_k(n) = \lambda^{(k)}\log n - (m^{(k)} - 1)\log\log n,$$
which are determined under the condition that the true distributions are $\{p_k(y|x, w_{k0})\}$ and the a priori distributions are $\varphi_k(w_k)$. Let us study the case where a learning machine made of a sum of the $K$ machines,
$$p(y|x, \{w_k\}) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big(-\frac{1}{2\sigma^2}\Big\|y - \sum_{k=1}^K \psi_k(x, w_k)\Big\|^2\Big),$$
is trained by using samples from the distribution $p(y|x, \{w_{k0}\})$. The asymptotic expansion of the free energy with the a priori distribution $\prod_k \varphi_k(w_k)$ is written as
$$F(n) = \lambda\log n - (m - 1)\log\log n.$$
Then we have the following inequality.

Corollary 3 The constants $\lambda$ and $\{\lambda^{(k)}\}$ satisfy the inequality
$$\lambda \le \sum_{k=1}^K \lambda^{(k)}.$$
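The pole computations in Examples 1-3 follow one pattern: once the integrand near $u = 0$ is a product of powers $|u_i|^{k_i\lambda + h_i}$ times factors bounded away from $0$ and $\infty$ (such as $a(u)^\lambda$ and $\varphi$), each variable with $k_i > 0$ contributes candidate poles, the largest being $\lambda = -\min_i (h_i + 1)/k_i$, with multiplicity equal to the number of indices attaining the minimum. A minimal sketch of steps (3) and (4) of the algorithm in Section 5 (the exponent data below are transcribed from the examples in the text):

```python
from fractions import Fraction

def largest_pole(exponents):
    """Largest pole -lambda_1 and its multiplicity m_1 of
    int prod_i |u_i|**(k_i*lam + h_i) du over |u_i| < 1,
    given exponent pairs (k_i, h_i) with k_i, h_i >= 0."""
    candidates = [Fraction(h + 1, k) for (k, h) in exponents if k > 0]
    lam1 = min(candidates)          # the largest pole is at lam = -lam1
    return lam1, candidates.count(lam1)

# Example 1, d = 4: the blow-up gives |u1|^(2*lam + 3); lambda_1 = d/2 = 2.
assert largest_pole([(2, 3)]) == (2, 1)

# Example 2: f = a^2 b^2 gives |a|^(2*lam) |b|^(2*lam); lambda_1 = 1/2, m_1 = 2.
assert largest_pole([(2, 0), (2, 0)]) == (Fraction(1, 2), 2)

# Example 3, patch W_11: |x|^(2*lam+1) |y|^(6*lam+3) |w|^(2*lam+1), z free.
assert largest_pole([(2, 1), (6, 3), (0, 0), (2, 1)]) == (Fraction(2, 3), 1)
```

Within one patch, the multiplicity is the number of variables sharing the largest pole, since the integral factorizes into single-variable integrals $2/(k_i\lambda + h_i + 1)$; across patches, step (4) takes the maximum $\lambda_1^{(\alpha\beta)}$ and, on ties, the maximum multiplicity.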
This corollary has an important application. Let $\lambda(H_0, H)\log n$ be the essential term of the asymptotic form of the free energy of a three-layered perceptron with $H$ hidden units, trained using samples from the true distribution represented with $H_0$ hidden units. We assume that $H - H_0 = KL$, where $K$ and $L$ are some integers. By using Corollary 3,
$$\lambda(H_0, H) \le \lambda(H_0, H_0) + \lambda(0, H - H_0) \le \lambda(H_0, H_0) + \frac{H - H_0}{L}\,\lambda(0, L).$$
Note that $\lambda(H_0, H_0) = d/2$, where $d$ is the number of parameters in the model with $H_0$ hidden units (Watanabe, 1998b). Even if the resolution of singularities is too complex and too difficult to calculate $\lambda(H_0, H)$ directly, we can obtain such inequalities based on the free energies of simpler cases.

[Proof of Corollary 3] The Kullback distance of the machine $p(y|x, \{w_k\})$ satisfies
$$f(w) = \frac{1}{2\sigma^2}\int \Big\|\sum_{k=1}^K \{\psi_k(x, w_k) - \psi_k(x, w_{k0})\}\Big\|^2 q(x)\,dx \le \frac{K}{2\sigma^2}\sum_{k=1}^K \int \|\psi_k(x, w_k) - \psi_k(x, w_{k0})\|^2\,q(x)\,dx = K\sum_{k=1}^K f_k(w_k),$$
where $f_k(w_k)$ is the Kullback distance of $p_k(y|x, w_k)$ from $p_k(y|x, w_{k0})$. Hence
$$F(n) \le -\sum_{k=1}^K \log \int \exp\big(-Kn\,f_k(w_k)\big)\,\varphi_k(w_k)\,dw_k \le \sum_{k=1}^K \big\{\lambda^{(k)}\log n - (m^{(k)} - 1)\log\log n + \mathrm{const.}\big\},$$
which completes Corollary 3. (Q.E.D.)

Example 4 (Experiments) Let us consider the case when
$$p(y|x, a, b) = \frac{1}{\sqrt{2\pi}\,(0.1)}\exp\Big(-\frac{1}{2(0.1)^2}\big(y - a\sigma(bx)\big)^2\Big)$$
is trained using samples from $p(y|x, a_0, b_0)$, where $\sigma(x) = x + x^2$. In this case, we can calculate the maximum likelihood estimator, because the likelihood
function is a quadratic form in $ab$ and $ab^2$. Figures 1, 3, and 5 show the maximum likelihood estimators (MLEs) of $(a, b)$ estimated from samples drawn from the distributions with true parameters $(a_0, b_0) = (0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$, respectively. Here $q(x)$ is the standard normal distribution, $n = 20$, and the number of trials is 200. Note that MLEs falling outside $[-2, 2] \times [-2, 2]$ are not drawn in the figures. Figures 2, 4, and 6 show parameters drawn from the a posteriori distributions estimated from samples with true parameters $(a_0, b_0) = (0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$. Here we defined the a priori distribution of $(a, b)$ as the uniform distribution on $[-2, 2] \times [-2, 2]$. It should be emphasized that the distribution of the MLEs is very different from the Bayesian a posteriori distribution, and that neither distribution is subject to any Gaussian distribution even if the true parameter is only one point.

In this example, by putting $p = ab$, $q = ab^2$, the model can be understood as a regular statistical model, so that the generalization error of the maximum likelihood method is asymptotically equal to $2/(2n)$ in all three cases. However, for a general activation function $\sigma(x)$, there exists no transform which makes the model regular. The generalization error in such a case is far larger than in the regular model cases. The generalization errors of Bayesian estimation in these cases are asymptotically $1/(2n) - 1/(n\log n)$, $2/(2n)$, and $2/(2n)$, respectively. For the case $n = 20$ ($n$ is the number of training samples), the experimental average generalization errors by the maximum likelihood method were , , and where the true parameters were $(0, 0)$, $(0.2, 0.2)$, and $(0.4, 0.4)$, respectively, whereas those by the Bayesian method were , , and , respectively.
For the case $n = 50$, the former errors were , , and , respectively, and the latter errors were , , and , respectively. These results show that, if the model is almost redundant, Bayesian estimation is more appropriate than the maximum likelihood method.
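The experiment can be reproduced in outline. The following Python sketch is not the paper's experimental code: the grid resolution, the number of Monte Carlo test points, and the random seed are our own choices, and the Bayesian error is computed from the predictive-mean plug-in rather than the full predictive distribution. The MLE exploits the fact, noted above, that the likelihood is quadratic in $(p, q) = (ab, ab^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.1                                   # noise standard deviation
sig = lambda t: t + t**2                  # activation sigma(t) = t + t^2

def gen_errors(a0, b0, n=20, trials=200):
    """Average E_x[(psi_hat - psi_0)^2] / (2 s^2) over trials, for the
    MLE and for the Bayesian predictive mean under a uniform prior."""
    g = np.linspace(-2.0, 2.0, 81)               # prior grid on [-2, 2]^2
    A, B = [m.ravel() for m in np.meshgrid(g, g)]
    P, Q = A * B, A * B**2                       # psi(x, a, b) = P x + Q x^2
    xt = rng.standard_normal(2000)               # Monte Carlo points for E_x
    Phi_t = np.stack([xt, xt**2], 1)
    truth = Phi_t @ np.array([a0 * b0, a0 * b0**2])
    e_mle = e_bayes = 0.0
    for _ in range(trials):
        x = rng.standard_normal(n)
        y = a0 * sig(b0 * x) + s * rng.standard_normal(n)
        # MLE: least squares in (p, q), features (x, x^2)
        Phi = np.stack([x, x**2], 1)
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        e_mle += np.mean((Phi_t @ coef - truth) ** 2)
        # Bayes: posterior weights on the grid, then the predictive mean
        res = y[None, :] - (P[:, None] * x + Q[:, None] * x**2)
        logp = -np.sum(res**2, 1) / (2 * s**2)
        w = np.exp(logp - logp.max())
        w /= w.sum()
        e_bayes += np.mean((Phi_t @ np.array([P @ w, Q @ w]) - truth) ** 2)
    c = 2 * s**2 * trials
    return e_mle / c, e_bayes / c

print(gen_errors(0.0, 0.0))   # redundant true model (a0, b0) = (0, 0)
print(gen_errors(0.2, 0.2))   # identifiable case
```

With these settings, the Bayesian error in the redundant case $(0, 0)$ typically comes out below the MLE error, in line with the asymptotics $1/(2n) - 1/(n\log n)$ versus $2/(2n)$ quoted above.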
6 Discussion

6.1 Fundamental Aspects

In this subsection, we intuitively explain the fundamental structure of the theorems in this paper. Let us define the volume function of the set of parameters $\{w \in W;\ f(w) \le t\}$,
$$V(t) = \int_{f(w) \le t} \varphi(w)\,dw.$$
Then $V'(t) = I(t)$, where $I(t)$ is defined by eq.(5). This paper has shown that the asymptotic stochastic complexity $F(n)$ is given by the Laplace transform of the volume function,
$$F(n) = -\log \int \exp(-nt)\,dV(t).$$
If the model is identifiable and the Fisher information matrix is positive definite, then the asymptotic shape of the set $\{w \in W;\ f(w) \le t\}$ is the inside of an ellipsoid. However, if the model is non-identifiable and has singularities, then it is the union of the inner points of several complex hyperboloids. If $V(t)$ satisfies the relation
$$V(t) \cong t^{\lambda_1}(-\log t)^{m_1 - 1} \quad (t \to 0),$$
then
$$F(n) \cong \lambda_1\log n - (m_1 - 1)\log\log n \quad (n \to \infty), \qquad (14)$$
$$J(\lambda) = \int t^\lambda\,V'(t)\,dt \cong \frac{\mathrm{Const.}}{(\lambda + \lambda_1)^{m_1}}. \qquad (15)$$
These relations, eq.(14) and eq.(15), show that the learning efficiency of Bayesian estimation is determined by the volume of the almost-true parameters. Mathematically speaking, it is difficult to prove $V(t) \cong t^{\lambda_1}(-\log t)^{m_1 - 1}$ directly, because of the singularities in the true parameter set. The relation that is proven first is eq.(15), for which we need algebraic properties of the loss function, the Sato-Bernstein $b$-function, and Hironaka's resolution of singularities.

6.2 Theoretical Aspects

The case when the parameter is not identifiable might be understood as a special one which seldom occurs. However, superpositions of homogeneous functions essentially have the problem of redundancy.
Let us illustrate the mathematical structure of redundancy. A function $f(x)$ in $L^2(\mathbf{R})$ can be decomposed as
$$f(x) = \int\!\!\int g(a, b)\,\varphi\Big(\frac{x - b}{a}\Big)\,da\,db,$$
where $\varphi(x)$ is an analyzing wavelet and $g(a, b)$ is a coefficient function (Chui, 1992). Here the set of functions $\{\varphi((x - b)/a)\}$ is an over-complete basis in $L^2(\mathbf{R})$, so that $g(a, b)$ is not determined uniquely. This decomposition shows that the set of coefficients $\{g_h\}$ in the neural approximation
$$f(x) \cong \sum_{h=1}^H g_h\,\varphi\Big(\frac{x - b_h}{a_h}\Big)$$
becomes increasingly non-identifiable as $H$ tends to infinity. In order to analyze the case $n \to \infty$ and $H \to \infty$, we have to build a theory of redundant function approximation and redundant statistical estimation. It is shown in (Watanabe, 2000b) that singularities in the parameter space make the generalization error smaller even when the true distribution is not contained in the parametric model. On the other hand, the set of coefficients in an orthonormal basis decomposition is given by the orthogonal projection of the true function onto the corresponding function space; hence it is identifiable even if $H$ goes to infinity. This is one of the essential differences between artificial neural networks and orthonormal basis functions. We expect that artificial neural networks are better learning machines than the orthonormal ones if Bayesian estimation is applied.

It is well known that, if a statistical model contains unused or nuisance parameters, then such parameters should not be estimated. If such parameters are estimated and determined, then the statistical estimation error becomes large. They are estimated in the maximum likelihood method, but not in Bayesian estimation. This is one of the reasons why Bayesian estimation is useful for neural networks in almost redundant states.

6.3 Practical Aspects

In regular learning machines, the generalization error of the maximum likelihood method is asymptotically equal to $d/(2n)$ (Akaike, 1974), which coincides
with that of Bayesian estimation (Amari, 1993), where $d$ is the number of parameters and $n$ is the number of training samples. In this paper we have shown that, in non-identifiable learning machines, the generalization error of Bayesian estimation is not larger than $d/(2n)$. As the singularity becomes deeper, the generalization error becomes smaller. On the other hand, it is conjectured that the generalization error of a layered learning machine trained by the maximum likelihood method is far larger than $d/(2n)$. For example, it is proven that, if the distribution of the inputs consists of a finite sum of delta functions, then the generalization error is in proportion to $\log n / n$ (Hagiwara, Kuno, & Usui, 2000). Also, it is proven that, if the activation function of the hidden units is a linear function, then the generalization error is $C/(2n)$, where $C$ is larger than the number of parameters (Fukumizu, 1999).

In a practical application, the true distribution is in general not strictly contained in the finite parametric model (Shibata, 1981). In such a case, the generalization error is the sum of the function approximation error and the statistical estimation error, where the former is called the bias and the latter the variance. In this paper, we have shown that the increase of the variance of an artificial neural network is not larger than the increase of the number of parameters, if Bayesian estimation is applied. Therefore, we can use a larger neural network with a not-so-large variance and a smaller bias. However, if we apply the maximum likelihood method to an artificial neural network, then we should use a smaller model to keep the variance small, resulting in a larger bias. We conjecture that this difference is the main reason why almost redundant learning machines with ensemble learning are more useful in practical applications than selected small models with one-point estimation.
7 Conclusion

A mathematical foundation for non-identifiable learning machines such as neural networks has been established based on algebraic analysis and algebraic geometry. The free energy or the stochastic complexity is asymptotically equal to $\lambda_1\log n - (m_1 - 1)\log\log n + \mathrm{const.}$, where $\lambda_1$ is a rational number and $m_1$ is a natural number. Moreover, we obtained the following properties.
(1) The learning curve is determined by the singularities in the parameter
space.
(2) The learning efficiency can be calculated algorithmically by the resolution of singularities.
(3) Bayesian estimation is more appropriate than the maximum likelihood method for neural networks in redundant states.

Information geometry gives the foundation of statistical learning theory for a regular learning machine (Amari, 1985), which has a positive definite Fisher metric. We expect that algebraic geometry and algebraic analysis will play an important role in the study of complex and hierarchical learning machines, which have degenerate metrics. Analysis of both the maximum likelihood method and the maximum a posteriori method is an important problem for the future.

Acknowledgment

This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research.

Appendix: Elemental Lemmas and Proofs

In this appendix, we show two elemental lemmas and their proofs. These are technical properties, but they are sometimes useful in neural network theory.

Lemma 5 Let $\{X_k(a);\ k = 1, 2, \ldots\}$ be a set of real-valued random variables with parameter $a$, and let $A$ be the set of all parameters. Then
$$E\Big\{\sup_{a \in A}\Big|\sum_{k=1}^\infty X_k(a)\Big|^2\Big\}^{1/2} \le \sum_{k=1}^\infty E\Big\{\sup_{a \in A}|X_k(a)|^2\Big\}^{1/2},$$
where $E$ is the expectation over the random variables.

(Proof) By the definition $Y_k = \sup_{a \in A}|X_k(a)|$,
$$E\Big\{\sup_{a \in A}\Big|\sum_{k=1}^\infty X_k(a)\Big|^2\Big\} \le E\Big\{\sum_{k=1}^\infty \sum_{k'=1}^\infty \sup_{a \in A}|X_k(a)|\,|X_{k'}(a)|\Big\} \le \sum_{k=1}^\infty \sum_{k'=1}^\infty E\{Y_k Y_{k'}\}$$
$$\le \sum_{k=1}^\infty \sum_{k'=1}^\infty E\{Y_k^2\}^{1/2}\,E\{Y_{k'}^2\}^{1/2} = \Big\{\sum_{k=1}^\infty E\{Y_k^2\}^{1/2}\Big\}^2,$$
which completes the lemma. (Q.E.D.)

Lemma 6 Assume the same condition as Theorem 2. Let $U$ be a neighborhood of $w_0 \in W_0$. Then there exist real-valued analytic functions $g_j(w)$ and real-valued functions $h_j(x, w)$ which satisfy condition (A) in $U$ such that
(1) $g_j(w_0) = 0$;
(2) the functions $\{h_j(x, w)\}$ are linearly independent;
(3) for an arbitrary $w \in U$,
$$\psi(x, w) - \psi(x, w_0) = \sum_{j=1}^J g_j(w)\,h_j(x, w).$$

(Proof of Lemma 6) We can assume $w_0 = 0$ and $\psi(x, 0) = 0$ without loss of generality. The $d$-dimensional direct product of the set of natural numbers is referred to as $\mathbf{N}^d$, and its elements are written as $j = (j_1, j_2, \ldots, j_d) \in \mathbf{N}^d$. Let us consider the Taylor expansion
$$\psi(x, w) = \sum_{|j| \ge 1} a_j(x)\,w^j,$$
where $|j| = j_1 + j_2 + \cdots + j_d$ and $w^j = w_1^{j_1} w_2^{j_2} \cdots w_d^{j_d}$. We introduce the space of square integrable functions $L^2(q)$ with the inner product
$$(u, v) = \int_K u(x)\,v(x)\,q(x)\,dx.$$
By using Cauchy's integral formula as in eq.(10), it is immediately proven that $a_j(x)$ is contained in $L^2(q)$. We adopt an order on the set $\mathbf{N}^d$ such that $|j| \le |j'|$ implies $j \preceq j'$. Then, by applying Schmidt's orthogonalization to $\{a_j(x);\ j \in \mathbf{N}^d\}$, we obtain
$$\psi(x, w) = \sum_{j \in \mathbf{N}^d} e_j(x)\,g_j(w),$$
More information3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?
MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable
More informationTwo special equations: Bessel s and Legendre s equations. p Fourier-Bessel and Fourier-Legendre series. p
LECTURE 1 Table of Contents Two special equations: Bessel s and Legendre s equations. p. 259-268. Fourier-Bessel and Fourier-Legendre series. p. 453-460. Boundary value problems in other coordinate system.
More information08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms
(February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops
More informationNovember 18, 2013 ANALYTIC FUNCTIONAL CALCULUS
November 8, 203 ANALYTIC FUNCTIONAL CALCULUS RODICA D. COSTIN Contents. The spectral projection theorem. Functional calculus 2.. The spectral projection theorem for self-adjoint matrices 2.2. The spectral
More informationTHEOREMS, ETC., FOR MATH 515
THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every
More informationProblem Set 5. 2 n k. Then a nk (x) = 1+( 1)k
Problem Set 5 1. (Folland 2.43) For x [, 1), let 1 a n (x)2 n (a n (x) = or 1) be the base-2 expansion of x. (If x is a dyadic rational, choose the expansion such that a n (x) = for large n.) Then the
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationProblem Set 5: Solutions Math 201A: Fall 2016
Problem Set 5: s Math 21A: Fall 216 Problem 1. Define f : [1, ) [1, ) by f(x) = x + 1/x. Show that f(x) f(y) < x y for all x, y [1, ) with x y, but f has no fixed point. Why doesn t this example contradict
More informationReview of complex analysis in one variable
CHAPTER 130 Review of complex analysis in one variable This gives a brief review of some of the basic results in complex analysis. In particular, it outlines the background in single variable complex analysis
More informationChapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.
Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space
More informationError Empirical error. Generalization error. Time (number of iteration)
Submitted to Neural Networks. Dynamics of Batch Learning in Multilayer Networks { Overrealizability and Overtraining { Kenji Fukumizu The Institute of Physical and Chemical Research (RIKEN) E-mail: fuku@brain.riken.go.jp
More informationNotions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy
Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.
More informationFunctional Analysis Review
Functional Analysis Review Lorenzo Rosasco slides courtesy of Andre Wibisono 9.520: Statistical Learning Theory and Applications September 9, 2013 1 2 3 4 Vector Space A vector space is a set V with binary
More informationA LITTLE REAL ANALYSIS AND TOPOLOGY
A LITTLE REAL ANALYSIS AND TOPOLOGY 1. NOTATION Before we begin some notational definitions are useful. (1) Z = {, 3, 2, 1, 0, 1, 2, 3, }is the set of integers. (2) Q = { a b : aεz, bεz {0}} is the set
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationf-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models
IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,
More informationAnalysis Comprehensive Exam Questions Fall 2008
Analysis Comprehensive xam Questions Fall 28. (a) Let R be measurable with finite Lebesgue measure. Suppose that {f n } n N is a bounded sequence in L 2 () and there exists a function f such that f n (x)
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationCALCULUS ON MANIFOLDS
CALCULUS ON MANIFOLDS 1. Manifolds Morally, manifolds are topological spaces which locally look like open balls of the Euclidean space R n. One can construct them by piecing together such balls ( cells
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More informationFourier Series. 1. Review of Linear Algebra
Fourier Series In this section we give a short introduction to Fourier Analysis. If you are interested in Fourier analysis and would like to know more detail, I highly recommend the following book: Fourier
More informationAnalysis Finite and Infinite Sets The Real Numbers The Cantor Set
Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered
More informationChapter 8. P-adic numbers. 8.1 Absolute values
Chapter 8 P-adic numbers Literature: N. Koblitz, p-adic Numbers, p-adic Analysis, and Zeta-Functions, 2nd edition, Graduate Texts in Mathematics 58, Springer Verlag 1984, corrected 2nd printing 1996, Chap.
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationChapter 2 Metric Spaces
Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics
More informationChapter 8 Integral Operators
Chapter 8 Integral Operators In our development of metrics, norms, inner products, and operator theory in Chapters 1 7 we only tangentially considered topics that involved the use of Lebesgue measure,
More informationQualification Exam: Mathematical Methods
Qualification Exam: Mathematical Methods Name:, QEID#41534189: August, 218 Qualification Exam QEID#41534189 2 1 Mathematical Methods I Problem 1. ID:MM-1-2 Solve the differential equation dy + y = sin
More informationMATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5
MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5.. The Arzela-Ascoli Theorem.. The Riemann mapping theorem Let X be a metric space, and let F be a family of continuous complex-valued functions on X. We have
More informationPUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include
PUTNAM TRAINING POLYNOMIALS (Last updated: December 11, 2017) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include
More informationTEST CODE: PMB SYLLABUS
TEST CODE: PMB SYLLABUS Convergence and divergence of sequence and series; Cauchy sequence and completeness; Bolzano-Weierstrass theorem; continuity, uniform continuity, differentiability; directional
More informationSeparation of Variables in Linear PDE: One-Dimensional Problems
Separation of Variables in Linear PDE: One-Dimensional Problems Now we apply the theory of Hilbert spaces to linear differential equations with partial derivatives (PDE). We start with a particular example,
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationfor all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true
3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO
More informationIntroduction to Machine Learning
Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1
More informationCourse 214 Basic Properties of Holomorphic Functions Second Semester 2008
Course 214 Basic Properties of Holomorphic Functions Second Semester 2008 David R. Wilkins Copyright c David R. Wilkins 1989 2008 Contents 7 Basic Properties of Holomorphic Functions 72 7.1 Taylor s Theorem
More informationOn rational approximation of algebraic functions. Julius Borcea. Rikard Bøgvad & Boris Shapiro
On rational approximation of algebraic functions http://arxiv.org/abs/math.ca/0409353 Julius Borcea joint work with Rikard Bøgvad & Boris Shapiro 1. Padé approximation: short overview 2. A scheme of rational
More informationChapter 4. Inverse Function Theorem. 4.1 The Inverse Function Theorem
Chapter 4 Inverse Function Theorem d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d dd d d d d This chapter
More informationNATIONAL BOARD FOR HIGHER MATHEMATICS. Research Scholarships Screening Test. Saturday, January 20, Time Allowed: 150 Minutes Maximum Marks: 40
NATIONAL BOARD FOR HIGHER MATHEMATICS Research Scholarships Screening Test Saturday, January 2, 218 Time Allowed: 15 Minutes Maximum Marks: 4 Please read, carefully, the instructions that follow. INSTRUCTIONS
More informationSome Background Material
Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important
More information