Algebraic Geometrical Methods for Hierarchical Learning Machines


Algebraic Geometrical Methods for Hierarchical Learning Machines

Sumio Watanabe

(a) Title : Algebraic geometrical methods for hierarchical learning machines.
(b) Author : Sumio Watanabe
(c) Affiliation : Precision and Intelligence Laboratory, Tokyo Institute of Technology
(d) Acknowledgment : This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research.
(e) Address : Dr. Sumio Watanabe, Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Japan. E-mail: swatanab@pi.titech.ac.jp  Phone :  Fax :
(f) Running title : Algebraic geometry of learning machines.

Algebraic Geometrical Methods for Hierarchical Learning Machines

Sumio Watanabe
Precision and Intelligence Laboratory
Tokyo Institute of Technology

January 18, 2001

Abstract

Hierarchical learning machines such as layered perceptrons, radial basis functions, and gaussian mixtures are non-identifiable learning machines, whose Fisher information matrices are not positive definite. Consequently, conventional statistical asymptotic theory cannot be applied to neural network learning: for example, the Bayesian a posteriori probability distribution does not converge to the gaussian distribution, and the generalization error is not proportional to the number of parameters. The purpose of this paper is to overcome this problem and to clarify the relation between the learning curve of a hierarchical learning machine and the algebraic geometrical structure of its parameter space. We establish an algorithm to calculate the Bayesian stochastic complexity based on the blowing-up technology of algebraic geometry, and prove that the Bayesian generalization error of a hierarchical learning machine is smaller than that of a regular statistical model, even if the true distribution is not contained in the parametric model.

Keywords: algebraic geometry, resolution of singularities, generalization error, stochastic complexity, asymptotic expansion, non-identifiable model.

Mathematical Symbols

x : M dimensional input
y : N dimensional output
w : d dimensional parameter
p(y|x, w) : conditional probability of a learning machine
q(y|x)q(x) : true simultaneous distribution of input and output
{(x_i, y_i)} : a set of training samples
H_n(w) : empirical Kullback distance
H(w) : Kullback distance
ϕ(w) : a priori probability distribution on the parameter space
ρ_n(w) : a posteriori probability distribution on the parameter space
p_n(y|x) : estimated conditional distribution by the Bayesian method
G(n) : generalization error
F(n) : stochastic complexity
F̄(n) : upper bound of the stochastic complexity
v(t) : a state density function
J(z) : the Mellin transform of v(t)
ǫ : a sufficiently small positive constant
W_ǫ : the set of parameters whose Kullback distance is smaller than ǫ
U : a manifold found by blowing-up
g(u) : a resolution map
s : a positive constant
K : the number of hidden units
f_K(x, w) : the function of a three-layer perceptron with K hidden units
a_k : parameters from hidden units to output units
b_k : parameters from input units to hidden units
c_k : bias of hidden units
σ(x) : activation function of hidden units, σ(x) = tanh(x)

1 Introduction

Learning in artificial neural networks can be understood as statistical estimation of an unknown probability distribution based on empirical samples (White, 1989; Watanabe & Fukumizu, 1995). Let p(y|x, w) be a conditional probability density function which represents the probabilistic inference of an artificial neural network, where x is an input and y is an output. The parameter w, which consists of many weights and biases, is optimized so that the inference p(y|x, w) approximates the true conditional probability density from which training samples are taken.

Let us reconsider a basic property of homogeneous and hierarchical learning machines. If the mapping from a parameter w to the conditional probability density p(y|x, w) is one-to-one, then the model is called identifiable; otherwise, it is called non-identifiable. In other words, a model is identifiable if and only if its parameter is uniquely determined from its behavior. The standard asymptotic theory in mathematical statistics requires that a given model be identifiable. For example, identifiability is a necessary condition to ensure that both the distribution of the maximum likelihood estimator and the Bayesian a posteriori probability density function converge to the normal distribution as the number of training samples tends to infinity (Cramer, 1949). When we approximate the likelihood function by a quadratic form of the parameter and select the optimal model using information criteria such as AIC, BIC, and MDL, we implicitly assume that the model is identifiable.

However, many kinds of artificial neural networks such as layered perceptrons, radial basis functions, Boltzmann machines, and gaussian mixtures are non-identifiable; hence either their statistical properties are not yet clarified or conventional statistical design methods cannot be applied. In fact, a failure of likelihood asymptotics for normal mixtures was shown from the viewpoint of hypothesis testing in statistics (Hartigan, 1985). In research on artificial neural networks, it was pointed out that AIC does not correspond to the generalization error of the maximum likelihood method (Hagiwara, 1993), since the Fisher information matrix is degenerate when the parameter represents a smaller model (Fukumizu, 1996). The asymptotic distribution of the maximum likelihood estimator of a non-identifiable model was analyzed based on the theorem that the empirical likelihood function converges to a gaussian process if it satisfies Donsker's condition (Dacunha-Castelle & Gassiat, 1997). It was proven that the generalization error of Bayesian estimation is far

smaller than the number of parameters divided by the number of training samples (Watanabe, 1997; Watanabe, 1998). When the parameter space is conic and symmetric, the generalization error of the maximum likelihood method is different from that of a regular statistical model (Amari & Ozeki, 2000). If the log likelihood function is analytic in the parameter and the set of parameters is compact, then the generalization error of the maximum likelihood method is bounded by a constant divided by the number of training samples (Watanabe, 2001b).

Let us illustrate the problem caused by non-identifiability of layered learning machines. If p(y|x, w) is a three-layer perceptron with K hidden units and w_0 is a parameter such that p(y|x, w_0) is equal to a machine with K_0 hidden units (K_0 < K), then the set of true parameters {w; p(y|x, w_0) = p(y|x, w)} consists of several sub-manifolds of the parameter space. Moreover, the Fisher information matrix,

I_ij(w) = ∫ (∂/∂w_i) log p(y|x, w) (∂/∂w_j) log p(y|x, w) p(y|x, w) q(x) dx dy,

where q(x) is the probability density function on the input space, is positive semi-definite but not positive definite, and its rank, rank I(w), depends on the parameter w. This fact indicates that artificial neural networks have many singular points in the parameter space (Figure 1). A typical example is given in Example 2 in section 3. For the same reason, almost all homogeneous and hierarchical learning machines, such as Boltzmann machines, gaussian mixtures, and competitive neural networks, have singularities in their parameter spaces, with the result that there has been no mathematical foundation for analyzing their learning.

In the previous papers (Watanabe, 1999b; Watanabe, 2000; Watanabe, 2001a), in order to overcome this problem, we proved the basic mathematical relation between the algebraic geometrical structure of singularities in the parameter space and the asymptotic behavior of the learning curve, and constructed a general formula to calculate the asymptotic form of the Bayesian generalization error using resolution of singularities, under the assumption that the true distribution is contained in the parametric model.

In this paper, we consider a three-layer perceptron in the case when the true probability density is not contained in the parametric model, and clarify how singularities in the parameter space affect learning in Bayesian estimation. By employing

an algebraic geometrical method, we show the following facts. (1) The learning curve is strongly affected by singularities, since the statistical estimation error depends on the estimated parameter. (2) The learning efficiency can be evaluated by using the blowing-up technology of algebraic geometry. (3) The generalization error is made smaller by singularities if Bayesian estimation is applied. These results clarify the reason why Bayesian estimation is useful in practical applications of neural networks, and demonstrate the possibility that algebraic geometry plays an important role in the learning theory of hierarchical learning machines, just as differential geometry did in that of regular statistical models (Amari, 1985).

This paper consists of 7 sections. In section 2, the general framework of Bayesian estimation is formulated. In section 3, we analyze a parametric case when the true probability density function is contained in the learning model, and derive the asymptotic expansion of the stochastic complexity using resolution of singularities. In section 4, we study a non-parametric case when the true probability density is not contained, and clarify the effect of singularities in the parameter space. In section 5, the problem of the asymptotic expansion of the generalization error is considered. Finally, sections 6 and 7 are devoted to discussion and conclusion.

2 Bayesian Framework

In this section, we formulate the standard framework of Bayesian estimation and the Bayesian stochastic complexity (Schwarz, 1974; Akaike, 1980; Levin, Tishby, & Solla, 1990; Mackay, 1992; Amari, Fujita, & Shinomoto, 1992; Amari & Murata, 1993).

Let p(y|x, w) be a probability density function of a learning machine, where the input x, the output y, and the parameter w are M, N, and d dimensional vectors, respectively. Let q(y|x)q(x) be the true probability density function on the input and output space, from which training samples {(x_i, y_i); i = 1, 2, ..., n} are independently taken. In this paper, we mainly consider the Bayesian framework, in which the estimated probability density ρ_n(w) on the parameter space is defined by

ρ_n(w) = (1/Z_n) exp(−n H_n(w)) ϕ(w),

H_n(w) = (1/n) Σ_{i=1}^{n} log [ q(y_i|x_i) / p(y_i|x_i, w) ],

where Z_n is the normalizing constant, ϕ(w) is an arbitrary fixed probability density function on the parameter space called the a priori distribution, and H_n(w) is the empirical Kullback distance. Note that the a posteriori distribution ρ_n(w) does not depend on {q(y_i|x_i); i = 1, 2, ..., n}, because q(y_i|x_i) is a constant function of w. Hence it can be written in the other form,

ρ_n(w) = (1/Z_n) ϕ(w) Π_{i=1}^{n} p(y_i|x_i, w).

The inference p_n(y|x) of the trained machine for a new input x is defined by the average conditional probability density function,

p_n(y|x) = ∫ p(y|x, w) ρ_n(w) dw.

The generalization error G(n) is defined by the Kullback distance of p_n(y|x) from q(y|x),

G(n) = E_n { ∫ q(y|x) log [ q(y|x) / p_n(y|x) ] q(x) dx dy },   (1)

where E_n{ } represents the expectation value over all sets of training samples. One of the most important purposes of learning theory is to clarify the behavior of the generalization error when the number of training samples is sufficiently large. It is well known (Levin, Tishby, & Solla, 1990; Amari, 1993; Amari & Murata, 1993) that the generalization error G(n) is equal to the increase of the stochastic complexity F(n),

G(n) = F(n + 1) − F(n)   (n = 1, 2, ...),   (2)

for an arbitrary natural number n, where F(n) is defined by

F(n) = −E_n { log ∫ exp(−n H_n(w)) ϕ(w) dw }.   (3)

The stochastic complexity F(n) and its generalized concepts, which are sometimes called the free energy, the Bayes factor, or the logarithm of the evidence, can be seen in statistics, information theory, learning theory, and mathematical physics (Schwarz, 1974; Akaike, 1980; Rissanen, 1986; Mackay, 1992; Opper & Haussler, 1995; Meir & Merhav, 1995; Haussler & Opper, 1997; Yamanishi, 1998).
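The quantities in eqs.(1)-(3) can be estimated numerically for a small machine by sampling the parameter from the a priori distribution. The following Python sketch is not part of the original paper; the network size, the standard normal prior, and the sample sizes are illustrative assumptions, and the prior-sampling estimate of F(n) is crude, meant only to make the definitions concrete. The true regression function is taken to be y = 0, as in section 3.2 below.

```python
# Crude prior-sampling sketch of eq.(3) (illustrative assumptions, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
M, K, s = 2, 3, 1.0                      # input dimension, hidden units, known noise level

def log_lik_ratio(x, y, a, b, c):
    """sum_i log[ p(y_i|x_i,w) / q(y_i|x_i) ] = -n H_n(w) for a single output unit."""
    f = (a[:, None] * np.tanh(b @ x.T + c[:, None])).sum(axis=0)   # f_K(x, w)
    return (y**2 - (y - f)**2).sum() / (2 * s**2)

def stochastic_complexity(n, n_data=50, n_prior=5000):
    """Estimate F(n) = -E_n log int exp(-n H_n(w)) phi(w) dw by sampling w from the prior."""
    vals = []
    for _ in range(n_data):
        x = rng.standard_normal((n, M))
        y = s * rng.standard_normal(n)                 # true regression function is y = 0
        lr = np.array([log_lik_ratio(x, y,
                                     rng.standard_normal(K),
                                     rng.standard_normal((K, M)),
                                     rng.standard_normal(K))
                       for _ in range(n_prior)])
        vals.append(-(np.logaddexp.reduce(lr) - np.log(n_prior)))   # -log of the mean of exp(lr)
    return float(np.mean(vals))

for n in (10, 50, 200):
    print(n, stochastic_complexity(n))   # grows roughly like lambda_1 log n (cf. Theorem 1)
```

The predictive density p_n(y|x) of eq.(1) can be approximated from the same prior sample by weighting each draw with exp(lr); the only point of the sketch is that F(n), G(n), and p_n(y|x) are all averages over the a posteriori distribution ρ_n(w).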

For example, both Bayesian model selection and hyperparameter optimization are often carried out by minimizing the stochastic complexity before averaging; these procedures are called BIC and ABIC, and are important in practical applications.

The stochastic complexity satisfies two basic inequalities. Firstly, we define H(w) and F̄(n) respectively by

H(w) = ∫ q(y|x) log [ q(y|x) / p(y|x, w) ] q(x) dx dy,

F̄(n) = −log ∫ exp(−n H(w)) ϕ(w) dw.

Note that H(w) is called the Kullback information. Then, by applying Jensen's inequality,

F(n) ≤ F̄(n)   (4)

holds for an arbitrary natural number n (Opper & Haussler, 1995; Watanabe, 2001a). Secondly, we use the notations F(ϕ, n) = F(n) and F̄(ϕ, n) = F̄(n), which explicitly show the a priori probability density ϕ(w). Then F(ϕ, n) and F̄(ϕ, n) can be understood as generalized stochastic complexities for the case when ϕ(w) is a non-negative function. If ψ(w) and ϕ(w) satisfy

ψ(w) ≥ ϕ(w) ≥ 0   (∀w),

then it immediately follows that

F(ψ, n) ≤ F(ϕ, n),   (5)

F̄(ψ, n) ≤ F̄(ϕ, n).   (6)

Therefore, restricting the integration region in the parameter space makes the stochastic complexity not smaller. For example, if we define, with sufficiently small ǫ > 0,

F_ǫ(n) = −log ∫_{H(w)<ǫ} exp(−n H(w)) ϕ(w) dw,   (7)

then

F(n) ≤ F̄(n) ≤ F_ǫ(n).   (8)

These two inequalities, eq.(4) and eq.(8), give upper bounds of the stochastic complexity. On the other hand, if the support of ϕ(w) is compact, then a lower bound is proven: there exists a constant α such that

F̄(n/2) − α ≤ F(n).   (9)

Moreover, if the learning machine contains the true distribution, then

F_ǫ(n/2) − α ≤ F(n)   (10)

holds (Watanabe, 1999b; Watanabe, 2001a).

In this paper, based on algebraic geometrical methods, we rigorously prove upper bounds of F(n) of the form

F(n) ≤ F̄(n) ≤ αn + β log n + o(log n),   (11)

where α, β ≥ 0 are constants and o(log n) is a function of n which satisfies o(log n)/log n → 0 (n → ∞). Mathematically speaking, although the generalization error G(n) is equal to F(n + 1) − F(n) for any natural number n, we cannot derive the asymptotic expansion of G(n) from this alone. However, in section 5, we show that, if G(n) has an asymptotic expansion, then it should satisfy the inequality

G(n) ≤ α + β/n + o(1/n)   (12)

for sufficiently large n, by eq.(11). The main results of this paper are the upper bounds of the stochastic complexity; however, we also discuss the behavior of the generalization error based on eq.(12).

3 A Parametric Case

In this section, we consider a parametric case when the true probability distribution q(y|x)q(x) is contained in the learning machine p(y|x, w)q(x), and show the relation between the algebraic geometrical structure of the machine and the asymptotic form of the stochastic complexity.

3.1 Algebraic Geometry of Neural Networks

In this subsection, we briefly summarize the essential results of the previous papers. For the mathematical proofs of this subsection, see (Watanabe, 1999b; Watanabe, 2001a). Strictly speaking, we need the assumptions that log p(y|x, w) is an analytic function of w, and that it can be analytically continued to a holomorphic function of w whose associated convergence radius is positive uniformly for arbitrary (x, y) that satisfies q(y|x)q(x) > 0 (Watanabe, 2000; Watanabe, 2001a). In this paper, we apply the result of the previous paper to the three-layer perceptron.

If a three-layer perceptron is redundant for approximating the true distribution, then the set of true parameters {w; H(w) = 0} is a union of several sub-manifolds of the parameter space. In general, the set of all zero points of an analytic function is called an analytic set. If the analytic function H(w) is a polynomial, then the set is called an algebraic variety. It is well known that an analytic set and an algebraic variety have complicated singularities in general.

We introduce a state density function v(t),

v(t) = ∫_{H(w)<ǫ} δ(t − H(w)) ϕ(w) dw,

where δ(t) is Dirac's delta function and ǫ > 0 is a sufficiently small constant. By definition, if t < 0 or t > ǫ, then v(t) = 0. By using v(t), F_ǫ(n) is rewritten as

F_ǫ(n) = −log ∫_{H(w)<ǫ} exp(−n H(w)) ϕ(w) dw
       = −log ∫_0^ǫ exp(−nt) v(t) dt
       = −log ∫_0^{nǫ} exp(−t) v(t/n) dt/n.   (13)

Hence, if v(t) has an asymptotic expansion for t → 0, then F_ǫ(n) has an asymptotic expansion for n → ∞. In order to examine v(t), we introduce a kind of zeta function J(z) (Sato & Shintani, 1974) of the Kullback information H(w) and the a priori probability density ϕ(w), which is a function of one complex variable z,

J(z) = ∫_{H(w)<ǫ} H(w)^z ϕ(w) dw   (14)
     = ∫_0^ǫ t^z v(t) dt.   (15)

Then J(z) is an analytic function of z in the region Re(z) > 0. It is well known in the theory of distributions and hyperfunctions that, if H(w) is an analytic function of w, then J(z) can be analytically continued to a meromorphic function on the entire complex plane and its poles are on the negative part of the real axis (Atiyah, 1970; Bernstein, 1972; Sato & Shintani, 1974; Björk, 1979). Moreover, the poles of J(z) are rational numbers (Kashiwara, 1976). Let z = −λ_1 (λ_1 > 0) be the largest pole of J(z) and m_1 (m_1 ≥ 1) its order. Note that eq.(15) shows that J(z) (z ∈ C) is the Mellin transform of v(t). Using the inverse Mellin transform, we can show that v(t) satisfies

v(t) ≅ c_0 t^{λ_1 − 1} (−log t)^{m_1 − 1}   (t → 0),

where c_0 > 0 is a positive constant. By eq.(13), F_ǫ(n) has an asymptotic expansion,

F_ǫ(n) = λ_1 log n − (m_1 − 1) log log n + O(1),

where O(1) is a bounded function of n. Hence, by eq.(8),

F(n) ≤ λ_1 log n − (m_1 − 1) log log n + O(1).

Moreover, if the support of ϕ(w) is a compact set, by eq.(9), we obtain an asymptotic expansion of F(n),

F(n) = λ_1 log n − (m_1 − 1) log log n + O(1).

We have the first theorem.

Theorem 1 (Watanabe, 1999b; Watanabe, 2001a) Assume that the support of ϕ(w) is a compact set. The stochastic complexity F(n) has an asymptotic expansion,

F(n) = λ_1 log n − (m_1 − 1) log log n + O(1),

where (−λ_1) and m_1 are respectively the largest pole and its order of the function that is analytically continued from

J(z) = ∫_{H(w)<ǫ} H(w)^z ϕ(w) dw,

where H(w) is the Kullback information and ϕ(w) is the a priori probability density function.

Remark that, if the support of ϕ(w) is not compact, then Theorem 1 gives an upper bound of F(n).

The important constants λ_1 and m_1 can be calculated by an algebraic geometrical method. We define the set of parameters W_ǫ by

W_ǫ = {w ∈ R^d ; H(w) < ǫ, ϕ(w) > 0}.

It is proven by Hironaka's resolution theorem (Hironaka, 1964; Atiyah, 1970) that there exist both a manifold U and a resolution map g : U → W_ǫ which satisfy

H(g(u)) = a(u) Π_{j=1}^{d} u_j^{k_j}

in an arbitrary neighborhood of an arbitrary u ∈ U that satisfies H(g(u)) = 0, where a(u) > 0 is a strictly positive function and {k_j} are non-negative even integers (Figure 2). Let

W_ǫ = ∪_α W_α

be a decomposition of W_ǫ into a finite union of suitable neighborhoods W_α, where

∫_{W_α ∩ W_α'} ϕ(w) dw = 0   (α ≠ α').

By applying the resolution theorem to the function J(z),

J(z) = ∫_{H(w)<ǫ} H(w)^z ϕ(w) dw
     = Σ_α ∫_{W_α} H(w)^z ϕ(w) dw
     = Σ_α ∫_{U_α} H(g(u))^z ϕ(g(u)) |g'(u)| du,

where U_α = g^{−1}(W_α). Since g is given by recursive blowing-ups, the Jacobian |g'(u)| is a direct product of the local variables u_1, u_2, ..., u_d,

|g'(u)| = c(u) Π_{j=1}^{d} |u_j|^{h_j},

where c(u) is a positive analytic function and {h_j} are non-negative integers. In a neighborhood U_α, a(u) and ϕ(g(u)) can be treated as constant functions in the calculation of the poles of J(z), because we can take each U_α small enough. Hence we can set a(u) = 1 and ϕ(g(u)) = 1 without loss of generality. Then,

J(z) = Σ_α Π_{j=1}^{d} ∫_{U_α} |u_j|^{k_j^(α) z + h_j^(α)} du_j,

where both k_j^(α) and h_j^(α) depend on the neighborhood U_α. We find that J(z) has poles {−(h_j^(α) + 1)/k_j^(α)}, which are rational numbers on the negative part of the real axis. Since a resolution map g(u) can be found by finite recursive procedures of blowing-ups, λ_1 and m_1 can be found algorithmically. It is also proven that λ_1 ≤ d/2 if {w; ϕ(w) > 0, H(w) = 0} ≠ ∅, and that m_1 ≤ d.

Theorem 2 (Watanabe, 1999b; Watanabe, 2001a) The largest pole (−λ_1) and its multiplicity m_1 of the function J(z) can be algorithmically calculated by Hironaka's

resolution theorem. Moreover, λ_1 is a rational number and m_1 is a natural number, and if {w; ϕ(w) > 0, H(w) = 0} ≠ ∅, then

0 < λ_1 ≤ d/2,   1 ≤ m_1 ≤ d,

where d is the dimension of the parameter.

Note that, if the learning machine is a regular statistical model, then always λ_1 = d/2 and m_1 = 1. Also note that, if Jeffreys' prior is employed in neural network learning, which is equal to zero at singularities, the assumption {w; ϕ(w) > 0, H(w) = 0} ≠ ∅ is not satisfied, and then both λ_1 = d/2 and m_1 = 1 hold even if the Fisher metric is degenerate (Watanabe, 2001c).

Example 1 (Regular model) Let us consider a regular statistical model

p(y|x, a, b) = (1/√(2π)) exp(−(1/2)(y − ax − b)^2),

with the set of parameters W = {(a, b); |a| ≤ 1, |b| ≤ 1}. Assume that the true distribution is

q(y|x)q(x) = (1/2π) exp(−(1/2)(x^2 + y^2)),   (16)

and the a priori distribution is the uniform distribution on W. Then,

H(a, b) = (1/2)(a^2 + b^2),

J(z) = ∫_W H(a, b)^z da db.

For a subset S ⊂ W, we define

J_S(z) = ∫_{S ∩ W} H(a, b)^z da db.

Then J(z) = J_{W_1}(z) + J_{W_2}(z), where

W_1 = {(a, b) ∈ W; |a| ≤ |b|},   W_2 = W \ W_1.

We introduce a mapping g : (u, v) → (a, b) by

a = uv,   b = u.   (17)
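The change of variables of eq.(17) can be checked mechanically. The following sympy sketch is not part of the original text; it only verifies that H(uv, u) = u^2(v^2 + 1)/2, that the Jacobian of the map is −u, and that the resulting u-integral produces the pole at z = −1 used in the next display.

```python
# Small sympy check (illustrative) of the blow-up a = uv, b = u of eq.(17).
import sympy as sp

u, v = sp.symbols('u v', positive=True)
z = sp.Symbol('z')
a, b = u * v, u
H = sp.factor(sp.Rational(1, 2) * (a**2 + b**2))      # -> u**2*(v**2 + 1)/2
jac = sp.Matrix([a, b]).jacobian([u, v]).det()        # -> -u, so |da db| = u du dv for u > 0
print(H, jac)

# In J_{W_1}(z) the u-dependence is u**(2*z) times the Jacobian factor u.
print(sp.integrate(u**(2*z + 1), u))  # generic branch u**(2*z + 2)/(2*z + 2): pole at z = -1
```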

Then

J_{W_1}(z) = ∫_{|a|≤|b|} H(a, b)^z da db
           = (1/2^z) ∫_{|u|≤1} ∫_{|v|≤1} u^{2z} (v^2 + 1)^z |u| du dv

has a pole at z = −1. We can show that J_{W_2}(z) has the same pole in just the same way as J_{W_1}(z). Hence λ_1 = 1 and m_1 = 1, resulting in F(n) = log n + O(1). This coincides with the well known result of the Bayesian asymptotic theory of regular statistical models. The mapping in eq.(17) is a typical example of a blowing-up.

Example 2 (Non-identifiable model) Let us consider a learning machine

p(y|x, a, b, c) = (1/√(2π)) exp(−(1/2)(y − aφ(bx) − cx)^2),

where φ(x) = x + x^2. Assume that the true distribution is the same as eq.(16), and that the a priori probability distribution is the uniform one on the set {w = (a, b, c); |a| ≤ 1, |b| ≤ 1, |c| ≤ 1}. Then the Kullback information is

H(a, b, c) = (1/2){(ab + c)^2 + 3a^2 b^4}.

Let us define two sets of parameters,

W_1 = {(a, b, c); |c| ≤ |a|, |ab^2| ≤ |ab + c| ≤ |ab|},
U_1 = {(p, q, r); |p| ≤ 1, |q| ≤ 1, |r| ≤ 1}.

By using blowing-ups recursively, we find a map g : U_1 → W_1 which is defined by

a = p,   b = qr,   c = (q − 1)pqr.

By using this transform, we obtain

H(g(p, q, r)) = (1/2) p^2 q^4 r^2 (1 + 3r^2),
|g'(p, q, r)| = |p q^2 r|.

Therefore,

J_{W_1}(z) = ∫_{W_1} H(w)^z dw
           = (1/2^z) ∫∫∫ {p^2 q^4 r^2 (1 + 3r^2)}^z |p q^2 r| dp dq dr.

The largest pole of J_{W_1}(z) is z = −3/4 and its order is one. It is also shown that J_{W \ W_1}(z) has its largest pole at z = −3/4 with order one. Hence λ_1 = 3/4 and m_1 = 1, resulting in

F(n) = (3/4) log n + O(1).
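The same mechanical check works for Example 2. The sympy sketch below is not in the original paper; it verifies the normal crossing form of H ∘ g and the Jacobian, and reads off the candidate poles −(h_j + 1)/k_j of J_{W_1}(z) from the exponents, the largest being z = −3/4.

```python
# Sympy check (illustrative) of Example 2's resolution map: a = p, b = qr, c = (q-1)pqr.
import sympy as sp

p, q, r = sp.symbols('p q r', positive=True)
a, b, c = p, q * r, (q - 1) * p * q * r
H = sp.factor(sp.Rational(1, 2) * ((a * b + c)**2 + 3 * a**2 * b**4))   # Kullback information
jac = sp.factor(sp.Matrix([a, b, c]).jacobian([p, q, r]).det())

print(H)      # p**2*q**4*r**2*(3*r**2 + 1)/2 : normal crossing part times the unit (3r^2 + 1)/2
print(jac)    # -p*q**2*r                     : |g'(p, q, r)| = p*q**2*r

# Each variable with exponent k_j in H(g(u)) and h_j in |g'(u)| contributes a pole -(h_j+1)/k_j.
for var, k in ((p, 2), (q, 4), (r, 2)):
    h = sp.degree(jac, var)
    print(var, sp.Rational(-(h + 1), k))   # p: -1, q: -3/4, r: -1, so lambda_1 = 3/4, m_1 = 1
```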

3.2 Application to Layered Perceptron

We apply the theory in the foregoing subsection to the three-layer perceptron. A three-layer perceptron with parameter w = {(a_k, b_k, c_k)} is defined by

p(y|x, w) = (1/(2πs^2)^{N/2}) exp(−(1/(2s^2)) ||y − f_K(x, w)||^2),   (18)

f_K(x, w) = Σ_{k=1}^{K} a_k σ(b_k · x + c_k),   (19)

where y, f_K(x, w), and a_k are N dimensional vectors, x and b_k are M dimensional vectors, c_k is a real number, and σ(x) = tanh(x). Here M, N, and K are the numbers of input units, output units, and hidden units. In this paper, we consider a machine which does not estimate the standard deviation s > 0 (s is a constant). We assume that the true distribution is

q(y|x)q(x) = (1/(2πs^2)^{N/2}) exp(−(1/(2s^2)) ||y||^2) q(x).   (20)

That is to say, the true regression function is y = 0. This is a special case, but the analysis of this case is important in the following section, where the true regression function is not contained in the model.

Theorem 3 Assume that the learning machine given by eq.(18) and eq.(19) is trained using samples independently taken from the distribution eq.(20). If the a priori distribution satisfies ϕ(w) > 0 in a neighborhood of the origin w = 0, then

λ_1 ≤ (K/2) min{M + 1, N}.

(Proof of Theorem 3) We use the notations

w = (a, b, c),
a = {a_k ∈ R^N ; k = 1, 2, ..., K},   a_k = {a_kp ∈ R ; p = 1, 2, ..., N},
b = {b_k ∈ R^M ; k = 1, 2, ..., K},   b_k = {b_kq ∈ R ; q = 1, 2, ..., M},
c = {c_k ∈ R ; k = 1, 2, ..., K}.

Then the Kullback information is

H(a, b, c) = (1/(2s^2)) ∫ ||f_K(x, a, b, c)||^2 q(x) dx
           = Σ_{p=1}^{N} Σ_{h,k=1}^{K} B_hk(b, c) a_hp a_kp,

where

B_hk(b, c) = (1/(2s^2)) ∫ σ(b_h · x + c_h) σ(b_k · x + c_k) q(x) dx.

Our purpose is to find the poles of the function

J(z) = ∫_{W_ǫ} H(a, b, c)^z ϕ(a, b, c) da db dc,

where

W_ǫ = {(a, b, c); H(a, b, c) < ǫ, ϕ(a, b, c) > 0}.

Let us apply the blowing-up technique to the Kullback information H(a, b, c). Firstly, we introduce a mapping

g : {u_hp ; 1 ≤ h ≤ K, 1 ≤ p ≤ N} → {a_hp ; 1 ≤ h ≤ K, 1 ≤ p ≤ N},

which is defined by

a_11 = u_11,   a_hp = u_11 u_hp   (h ≠ 1 or p ≠ 1).

Let u' be the variables of u except u_11, in other words, u = (u_11, u'). Then

H(u, b, c) = u_11^2 H_1(u', b, c),

where

H_1(u', b, c) = B_11(b, c) + 2 Σ_{h=2}^{K} B_h1(b, c) u_h1 + Σ_{h,k=2}^{K} B_hk(b, c) u_h1 u_k1 + Σ_{p=2}^{N} Σ_{h,k=1}^{K} B_hk(b, c) u_hp u_kp,

and the Jacobian g'(u) of the mapping g is

g'(u) = det( ∂a_ij / ∂u_kl ) = u_11^{NK−1}.

We define a set of parameters for δ > 0,

U(δ) = {(u_11, u', b, c); |u_11| ≤ δ, H_1(u', b, c) < 1}.

By the assumption, there exists δ > 0 such that g(U(δ)) ⊂ W_ǫ.

In order to obtain an upper bound of the stochastic complexity, we can restrict the integration region in the parameter space, by using eq.(5) and eq.(6):

J(z) = ∫_{g(U(δ))} H(a, b, c)^z ϕ(a, b, c) da db dc.

By the assumption, ϕ(w) > 0 in g(U(δ)). In the calculation of the poles of J(z), we can assume ϕ(w) = ϕ_0 (ϕ_0 is a constant) in g(U(δ)). Then

J(z) = ∫_{U(δ)} H(g(u), b, c)^z ϕ_0 |g'(u)| du db dc
     = ϕ_0 ∫_0^δ u_11^{2z} u_11^{NK−1} du_11 ∫ H_1(u', b, c)^z du' db dc
     = (ϕ_0 δ^{2z+NK} / (2z + NK)) ∫ H_1(u', b, c)^z du' db dc.

The pole of the function δ^{2z+NK}/(2z + NK) is z = −NK/2. Let −λ_1 and −λ_1' be respectively the largest poles of J(z) and of

J_1(z) = ∫ H_1(u', b, c)^z du' db dc.

Since H_1(u', b, c) ≥ 0, J_1(z) does not have a zero point in the interval (−λ_1', ∞). If z = −NK/2 is larger than −λ_1', then z = −NK/2 is a pole of J(z). Otherwise, J(z) has a pole larger than −NK/2. Hence λ_1 ≤ NK/2.

Secondly, we consider another blowing-up g,

g : {u_kp, v_k ; 1 ≤ k ≤ K, 1 ≤ p ≤ M} → {b_kp, c_k ; 1 ≤ k ≤ K, 1 ≤ p ≤ M},

which is defined by

b_11 = u_11,   b_kp = u_11 u_kp   (k ≠ 1 or p ≠ 1),   c_k = u_11 v_k.

Then, by just the same method as in the first half, there exists an analytic function H_2(a, u, v) such that

H(a, b, c) = u_11^2 H_2(a, u, v),

which implies

J(z) = ϕ_0 ∫_0^δ u_11^{2z} u_11^{(M+1)K−1} du_11 ∫ H_2(a, u, v)^z da du dv
     = (ϕ_0 δ^{2z+(M+1)K} / (2z + (M + 1)K)) ∫ H_2(a, u, v)^z da du dv.

Therefore λ_1 ≤ (M + 1)K/2. By combining the above two results, the largest pole −λ_1 of J(z) satisfies the inequality

λ_1 ≤ (K/2) min{N, M + 1},

which completes the proof of Theorem 3. (End of Proof)

By Theorem 1,

F(n) ≤ (K/2) min{N, M + 1} log n + o(log n).

Moreover, if G(n) has an asymptotic expansion (see section 5), we obtain an inequality for the generalization error,

G(n) ≤ (K/(2n)) min{N, M + 1} + o(1/n).

On the other hand, it is well known that the constant λ_1 of a regular statistical model is equal to d/2, where d is the number of parameters. When a three-layer perceptron with 100 input units, 10 hidden units, and 1 output unit is employed, λ_1 ≤ 5, whereas a regular statistical model with the same number of parameters, d = (M + N + 1)K = 1020, has λ_1 = d/2 = 510. It should be emphasized that the generalization error of the hierarchical learning machine is far smaller than that of regular statistical models, if we use Bayesian estimation.

When we adopt the normal distribution as the a priori probability density, we have shown the same result as Theorem 3 by a direct calculation (Watanabe, 1999a). However, Theorem 3 shows systematically that the same result holds for an arbitrary a priori distribution. Moreover, it is easy to generalize the above result to the case when the learning machine has M input units, K_1 first hidden units, K_2 second hidden units, ..., K_p p-th hidden units, and N output units. We assume that hidden units and output units have bias parameters. Then, by using the same blowing-ups, we can generalize the proof of Theorem 3 to obtain

λ_1 ≤ (1/2) min{(M + 1)K_1, (K_1 + 1)K_2, (K_2 + 1)K_3, ..., (K_{p−1} + 1)K_p, K_p N}.

Of course, this result holds only when the true regression function is the special case y = 0. However, in the following section, we show that this result is necessary to obtain a bound for a general regression function.
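These bounds are simple to evaluate. The helper below is only an illustration (it is not code from the paper); it computes the upper bound of λ_1 for the three-layer machine of eqs.(18)-(19) and for the multilayer generalization stated above, and reproduces the numerical comparison with the regular-model value d/2.

```python
# Illustrative helpers for the lambda_1 bounds above (not from the paper).
def lambda_bound(M, hidden, N):
    """(1/2) min{(M+1)K_1, (K_1+1)K_2, ..., (K_{p-1}+1)K_p, K_p N} for hidden = [K_1, ..., K_p]."""
    fan_in = [M + 1] + [K + 1 for K in hidden[:-1]]
    terms = [f * K for f, K in zip(fan_in, hidden)] + [hidden[-1] * N]
    return 0.5 * min(terms)

def d_three_layer(M, K, N):
    """Parameter count of the machine of eqs.(18)-(19): a_k, b_k, c_k give d = (M + N + 1)K."""
    return (M + N + 1) * K

M, K, N = 100, 10, 1
print(lambda_bound(M, [K], N))        # 5.0   : Theorem 3, K/2 * min{M+1, N}
print(d_three_layer(M, K, N) / 2)     # 510.0 : lambda_1 = d/2 of a regular model with the same d
print(lambda_bound(M, [K, K], N))     # 5.0   : two hidden layers of 10 units, last inequality above
```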

4 A Non-parametric Case

In the previous section, we studied the case when the true probability distribution is contained in the parametric model. In this section, we consider a non-parametric case when the true distribution is not contained in the parametric model, which is illustrated in Figure 3. Let w_0 be the parameter that minimizes H(w), which is the point C in Figure 3. Our main purpose is to clarify the effect of singular points such as A and B in Figure 3, which are not contained in the neighborhood of w_0.

Let us consider the case when a three-layer perceptron given by eq.(18) and eq.(19) is trained using samples independently taken from the true probability distribution

q(y|x)q(x) = (1/(2πs^2)^{N/2}) exp(−(1/(2s^2)) ||y − g(x)||^2) q(x),   (21)

where g(x) is the true regression function and q(x) is the true probability distribution on the input space. Let E(k) be the minimum function approximation error of a three-layer perceptron with k hidden units,

E(k) = min_w ∫ ||g(x) − f_k(x, w)||^2 q(x) dx.   (22)

Here we assume that, for each 1 ≤ k ≤ K, there exists a parameter w that attains the minimum value.

Theorem 4 Assume that the learning machine given by eq.(18) and eq.(19) is trained using samples independently taken from the distribution of eq.(21). If the a priori distribution satisfies ϕ(w) > 0 for an arbitrary w, then

F(n) ≤ min_{0≤k≤K} { n E(k)/(2s^2) + (1/2)(D_1 k + D_2 K) log n } + o(log n),

where

D_1 = (M + N + 1) − min[N, M + 1],
D_2 = min[N, M + 1].

(Proof of Theorem 4) By Jensen's inequality, eq.(4), we have

F(n) ≤ F̄(n) = −log ∫ exp(−n H(w)) ϕ(w) dw,

where H(w) is the Kullback distance,

H(w) = (1/(2s^2)) ∫ ||g(x) − f_K(x, w)||^2 q(x) dx.

Let k_1 and k_2 be natural numbers which satisfy both 0 ≤ k_1 ≤ K and k_1 + k_2 = K. We divide the parameter as w = (w_1, w_2), where

w_1 = {a_k, b_k, c_k ; 1 ≤ k ≤ k_1},   w_2 = {a_k, b_k, c_k ; k_1 + 1 ≤ k ≤ K}.

Also let γ_1 and γ_2 be real numbers which satisfy both γ_1 > 1 and γ_1 + γ_2 = γ_1 γ_2. Then, for arbitrary u, v ∈ R^N,

||u + v||^2 ≤ γ_1 ||u||^2 + γ_2 ||v||^2.

Therefore, for arbitrary (x, w),

||g(x) − f_K(x, w)||^2 ≤ γ_1 ||g(x) − f_{k_1}(x, w_1)||^2 + γ_2 ||f_{k_2}(x, w_2)||^2.

Hence we have the inequality

H(w) ≤ γ_1 H_1(w_1) + γ_2 H_2(w_2),

where we use the definitions

H_1(w_1) = (1/(2s^2)) ∫ ||g(x) − f_{k_1}(x, w_1)||^2 q(x) dx,
H_2(w_2) = (1/(2s^2)) ∫ ||f_{k_2}(x, w_2)||^2 q(x) dx.

Since F̄(n) is an increasing function of H(w),

F(n) ≤ F_1(n) + F_2(n),

where

F_j(n) = −log ∫ exp(−n γ_j H_j(w_j)) ϕ_j(w_j) dw_j

for j = 1, 2, and ϕ_j (j = 1, 2) are functions which satisfy ϕ(w) ≥ ϕ_1(w_1) ϕ_2(w_2). Here we can choose both ϕ_1(w_1) and ϕ_2(w_2) to be functions with compact support.

Firstly, we evaluate F_1(n). Let w_1* be the parameter that minimizes H_1(w_1). Then, by eq.(22) and Theorem 2,

F_1(n) ≤ n γ_1 H_1(w_1*) − log ∫ exp(−n γ_1 [H_1(w_1) − H_1(w_1*)]) ϕ_1(w_1) dw_1
       ≤ n γ_1 E(k_1)/(2s^2) + (d(k_1)/2) log n + o(log n),   (23)

where d(k_1) = (M + N + 1)k_1 is the number of parameters of the three-layer perceptron with k_1 hidden units. Secondly, applying Theorem 3 to F_2(n),

F_2(n) ≤ λ(k_2) log n + o(log n),   (24)

where λ(k_2) satisfies

λ(k_2) ≤ (k_2/2) min{N, M + 1}.   (25)

By combining eq.(23) with eq.(24), and by taking γ_1 sufficiently close to 1, we obtain

F(n) ≤ n E(k_1)/(2s^2) + { d(k_1)/2 + λ(k_2) } log n + o(log n),

for an arbitrary given 0 ≤ k_1 ≤ K. Since

d(k_1)/2 + λ(k_2) ≤ (1/2){D_1 k_1 + D_2 K},

we obtain Theorem 4. (End of Proof)

Based on Theorem 4, if G(n) has an asymptotic expansion (see section 5), then G(n) should satisfy the inequalities

G(n) ≤ E(k)/(2s^2) + D_1 k/(2n) + D_2 K/(2n) + o(1/n)   (k = 0, 1, 2, ..., K),   (26)

for n > n_0 with a sufficiently large n_0. Hence

G(n) ≤ min_{0≤k≤K} { E(k)/(2s^2) + D_1 k/(2n) } + D_2 K/(2n) + o(1/n),   (27)

for n > n_0 with a sufficiently large n_0. Figure 4 illustrates several learning curves corresponding to k (0 ≤ k ≤ K). The generalization error G(n) is smaller than every curve. It is well known (Barron, 1994; Murata, 1996) that, if g(x) belongs to a certain class of functions, then

E(k) ≤ C(g)/k

for sufficiently large k, where C(g) is a positive constant determined by the true regression function g(x). Then,

G(n) ≤ min_{0≤k≤K} { C(g)/(2s^2 k) + D_1 k/(2n) } + D_2 K/(2n) + o(1/n).   (28)
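The right-hand side of eq.(28) is easy to evaluate numerically. The sketch below uses made-up constants (the values of M, N, K, s, and C(g) are assumptions, not results from the paper); it only illustrates how the minimizing k, i.e., the effective model size, grows with n, which is the envelope behavior of Figure 4.

```python
# Illustrative evaluation of the bound of eq.(28) with assumed constants (not from the paper).
import numpy as np

M, N, K, s, C = 100, 1, 10, 1.0, 2.0
D1 = (M + N + 1) - min(N, M + 1)      # = 101
D2 = min(N, M + 1)                    # = 1

def envelope_bound(n):
    """min over k = 1..K of C(g)/(2 s^2 k) + D1 k/(2n), plus D2 K/(2n)."""
    ks = np.arange(1, K + 1)
    curves = C / (2 * s**2 * ks) + D1 * ks / (2 * n)
    k_star = int(ks[np.argmin(curves)])
    return k_star, float(curves.min() + D2 * K / (2 * n))

for n in (10**2, 10**3, 10**4, 10**5):
    k_star, g_bound = envelope_bound(n)
    print(n, k_star, round(g_bound, 5))   # the minimizing k grows with n; the bound decreases
```

The minimizing k plays the role of the middle size model discussed in section 6.1: for moderate n a small k gives the tightest curve, and larger models take over only as n increases.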

If both n and K are sufficiently large, and if α ≡ sK √(D_1/(C(g)n)) ≥ 1, then, by choosing k = √(C(g)n/(s^2 D_1)) in eq.(28),

G(n) ≤ √(C(g)/(s^2 n)) [ √D_1 + α D_2 / (2√D_1) ] + o(1/n).

The inequality (27) holds if n is sufficiently large. If n is sufficiently large but not extremely large, then G(n) is bounded by the generalization error of a middle size model. As n becomes larger, it is bounded by that of a larger model, and when n is extremely large, it is bounded by that of the largest model. A complex hierarchical learning machine contains many smaller models in its own parameter space as analytic sets with singularities, and, if Bayesian estimation is applied, it chooses the appropriate model adaptively for the number of training samples. Such a property is caused by the fact that the model is non-identifiable, and its quantitative effect can be evaluated by using algebraic geometry.

5 Asymptotic Property of the Generalization Error

In this section, let us consider the asymptotic expansion of the generalization error. By eq.(2), F(n) is equal to the accumulated generalization error,

F(n + 1) = Σ_{i=0}^{n} G(i)   (n = 1, 2, ...),   (29)

where G(0) is defined by F(1). Hence, if G(n) has an asymptotic expansion for n → ∞, then F(n) also has an asymptotic expansion. However, even if F(n) has an asymptotic expansion, G(n) may not have one. In the foregoing sections, we have proved that F(n) satisfies inequalities of the form

F(n) ≤ αn + β log n + o(log n),   (30)

where α, β ≥ 0 are constants determined by the singularities and the true distribution. In order to mathematically derive an inequality for G(n) from eq.(30), we need an assumption.

Assumption (A) Assume that the generalization error G(n) has an asymptotic expansion

G(n) = Σ_{q=1}^{Q} a_q s_q(n) + o(1/n),   (31)

where {a_q} are real constants and s_q(n) > 0 is a positive and non-increasing function of n which satisfies

s_1(n) = 1,   (32)
s_Q(n) = 1/n,   (33)
s_q(n) = o(s_{q−1}(n))   (q = 2, 3, ..., Q).   (34)

Based on this assumption, we have the following lemma.

Lemma 1 If G(n) satisfies assumption (A) and if eq.(30) holds, then G(n) satisfies the inequality

G(n) ≤ α + β/n + o(1/n).   (35)

(Proof) By assumption (A),

F(n) = a_1 n + o(n),

which shows a_1 ≤ α. If a_1 < α, then eq.(35) holds. If a_1 = α, then

Σ_{k=0}^{n} a_2 s_2(k) ≤ β log(n + 1) + o(log n).   (36)

Let t(k) = k s_2(k). By eq.(32), eq.(33), and eq.(34), either t(k) → ∞ or t(k) ≤ C (C > 0). If t(k) → ∞, then, for an arbitrary M > 0, there exists k_0 such that k ≥ k_0 implies t(k) ≥ M. Hence

Σ_{k=0}^{n} s_2(k) = Σ_{k=0}^{n} t(k)/k ≥ M Σ_{k=k_0}^{n} 1/k = M(log n − const.),

which contradicts eq.(36). Hence t(n) ≤ C and a_2 C ≤ β. (End of Proof of Lemma 1)

In this paper, we have proven inequalities of the form eq.(30) in Theorems 1, 2, 3, and 4 without assumption (A). We then obtain the corresponding inequalities of the form eq.(35) if we adopt assumption (A). In other words, if G(n) has an asymptotic expansion and if eq.(30) holds, then G(n) should satisfy eq.(35). It is conjectured that natural learning machines satisfy assumption (A). A sufficient condition for assumption (A) is that F(n) has an asymptotic expansion

F(n) = Σ_{r=1}^{R} a_r S_r(n) + o(1/n),

where S_1(n) = n, S_R(n) = 1/n, and S_{r+1}(n) = o(S_r(n)) (r = 1, 2, ..., R − 1). For example, if the learner is

p(y|x, a) = (1/√(2π)) exp(−(1/2)(y − ax)^2),

where the a priori distribution of a is the standard normal distribution, and if the true distribution is

q(y|x)q(x) = (1/2π) exp(−(1/2){(y − c)^2 + x^2}),

then it is shown by direct calculation that the stochastic complexity has an asymptotic expansion

F(n) = (1/2){ n c^2 + log n − c^2 − 1 + (c^2 + 1)/n } + o(1/n).   (37)

Hence G(n) has an asymptotic expansion

G(n) = c^2/2 + 1/(2n) + o(1/n).

It is expected that, in the general case, G(n) has the same kind of asymptotic expansion as in assumption (A); however, mathematically speaking, a necessary and sufficient condition for it has not yet been established. This is an important problem in statistics and learning theory for the future.
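Because Z_n is a one-dimensional gaussian integral in this example, the expansion can also be checked numerically. The following sketch is not from the paper; the value of c, the number of trials, and the comparison against only the leading terms (n c^2 + log n)/2 are illustrative choices.

```python
# Monte Carlo check (illustrative) of the example above: p(y|x,a) = N(y; ax, 1),
# prior a ~ N(0,1), true distribution y ~ N(c,1), x ~ N(0,1).
import numpy as np

rng = np.random.default_rng(0)

def F_estimate(n, c, trials=4000):
    """Average of -log Z_n over simulated data sets; Z_n = int exp(-n H_n(a)) phi(a) da is gaussian."""
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.standard_normal(n)
        y = c + rng.standard_normal(n)
        S, T = np.sum(x * x), np.sum(x * y)
        # closed form: -log Z_n = (1/2) log(S+1) - T^2/(2(S+1)) + c * sum(y) - n c^2 / 2
        vals[t] = 0.5 * np.log(S + 1.0) - T**2 / (2.0 * (S + 1.0)) + c * y.sum() - n * c**2 / 2.0
    return float(vals.mean())

c = 0.5
for n in (10, 100, 1000):
    print(n, round(F_estimate(n, c), 3), round(0.5 * (n * c**2 + np.log(n)), 3))
```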

6 Discussion

In this section, we discuss universal phenomena which can be observed in hierarchical learning machines.

6.1 Bias and variance at singularities

We consider a covering of the parameter space by neighborhoods,

W = ∪_{j=1}^{J} W(w_j),   (38)

where {W(w_j)} are sufficiently small neighborhoods of the parameters w_j which satisfy

∫_{W(w_i) ∩ W(w_j)} ϕ(w) dw = 0   (i ≠ j).

The number J in eq.(38) is finite when W = supp ϕ is compact. Then the upper bound of the stochastic complexity can be rewritten as

F̄(n) = −log Σ_{j=1}^{J} ∫_{W(w_j)} exp(−n H(w)) ϕ(w) dw
     ≅ −log Σ_{j=1}^{J} exp(−n B(w_j)) V(w_j),

where B(w_j) is the function approximation error of the parameter w_j,

B(w_j) = min_{w ∈ W(w_j)} H(w),

and V(w_j) is the statistical estimation error of the neighborhood of w_j,

V(w_j) = c_0 (log n)^{m(w_j) − 1} / n^{λ(w_j)},

where c_0 > 0 is a constant. Here −λ(w_j) and m(w_j) are respectively the largest pole and its multiplicity of the meromorphic function

J_{w_j}(z) = ∫_{W(w_j)} (H(w) − B(w_j))^z ϕ(w) dw.

Note that B(w_j) and V(w_j) are called the bias and the variance, respectively. In Bayesian estimation, the neighborhood of the parameter w_j that maximizes

Z(w_j) = exp(−n B(w_j)) V(w_j),

in other words, that minimizes n B(w_j) + λ(w_j) log n − (m(w_j) − 1) log log n, is selected with the largest probability. In regular statistical models, the variance does not depend on the parameter, in other words, λ(w_j) = d/2 and m(w_j) = 1 for an arbitrary parameter w_j; hence the parameter that minimizes the function approximation error is selected. On the other hand, in hierarchical learning machines,

the variance V(w_j) strongly depends on the parameter w_j, and the parameter that minimizes the sum of the bias and the variance is selected. If the number of training samples is large but not extremely large, a parameter among the singular points such as A in Figure 3, which represents a middle size model, is automatically selected, resulting in a smaller generalization error. As n increases, the larger but not largest model B is selected. At last, if n becomes extremely large, then the parameter C that minimizes the bias is selected. This is a universal phenomenon of hierarchical learning machines, which indicates the essential difference between regular statistical models and artificial neural networks.

6.2 Neural networks are an over-complete basis

Singularities of a hierarchical learning machine originate in the homogeneous structure of the learning model. The set of functions used in an artificial neural network, for example {σ(b · x + c)}, is a set of over-complete basis functions; in other words, the coefficients {a(b, c)} in a wavelet-type decomposition of a given function g(x),

g(x) = ∫ a(b, c) σ(b · x + c) db dc,

are not uniquely determined by g(x) (Chui, 1992; Murata, 1996). In practical applications, the true probability distribution is seldom contained in a parametric model; however, we adopt a model which almost approximates the true distribution compared with the fluctuation caused by random samples,

g(x) ≅ Σ_k a_k σ(b_k · x + c_k).

If we have an appropriate number of samples and choose an appropriate learning model, it is expected that the model is in an almost redundant state, where the output functions of the hidden units are almost linearly dependent. We expect that this paper will be a mathematical foundation for studying learning machines in such states.

7 Conclusion

We considered the case when the true distribution is not contained in the parametric models made of hierarchical learning machines, and showed that parameters among singular points are selected by the Bayesian distribution, resulting in a small generalization error. The quantitative effect of the singularities was clarified

based on the resolution of singularities in algebraic geometry. Even if the true distribution is not contained in the parametric models, singularities strongly affect and improve the learning curves. This is a universal phenomenon of hierarchical learning machines, which can be observed in almost all artificial neural networks.

References

Akaike, H. (1980). Likelihood and Bayes procedure. In J. M. Bernardo, Bayesian Statistics. Valencia, Spain: University Press.
Amari, S. (1985). Differential-geometrical methods in statistics. Lecture Notes in Statistics, Springer.
Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6.
Amari, S., Fujita, N., & Shinomoto, S. (1992). Four types of learning curves. Neural Computation, 4(4).
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss. Neural Computation, 5.
Amari, S., & Ozeki, T. (2000). Differential and algebraic geometry of multilayer perceptrons. To appear in IEICE Transactions.
Atiyah, M. F. (1970). Resolution of singularities and division of distributions. Communications of Pure and Applied Mathematics, 13.
Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1).
Bernstein, I. N. (1972). The analytic continuation of generalized functions with respect to a parameter. Functional Analysis and its Applications, 6.
Cramer, H. (1949). Mathematical methods of statistics. Princeton: University Press.
Chui, C. K. (1992). An introduction to wavelets. Academic Press.
Dacunha-Castelle, D., & Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. Probability and Statistics, 1.

Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9.
Gel'fand, I. M., & Shilov, G. E. (1964). Generalized functions. Academic Press.
Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proceedings of the International Joint Conference on Neural Networks, Nagoya, Japan, 3.
Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, 2.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Annals of Statistics, 25(6).
Hironaka, H. (1964). Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79.
Kashiwara, M. (1976). B-functions and holonomic systems. Inventiones Mathematicae, 38.
Levin, E., Tishby, N., & Solla, S. A. (1990). A statistical approach to learning and generalization in layered neural networks. Proceedings of IEEE, 78(10).
Mackay, D. J. (1992). Bayesian interpolation. Neural Computation, 4(2).
Meir, R., & Merhav, N. (1995). On the stochastic complexity of learning realizable and unrealizable rules. Machine Learning, 19(3).
Murata, N. (1996). An integral representation with ridge functions and approximation bounds of three-layered network. Neural Networks, 9(6).
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Physical Review Letters, 75(20).
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14.
Sato, M., & Shintani, T. (1974). On zeta functions associated with prehomogeneous vector spaces. Annals of Mathematics, 100.

Schwarz, G. (1974). Estimating the dimension of a model. Annals of Statistics, 6(2).
Watanabe, S. (1994). An optimization method of layered neural networks based on the modified information criterion. Advances in Neural Information Processing Systems, 6.
Watanabe, S. (1997). On the essential difference between neural networks and regular statistical models. Proceedings of the 2nd International Conference on Computational Intelligence and Neuroscience, 2.
Watanabe, S. (1998). On the generalization error by a layered statistical model with Bayesian estimation. IEICE Transactions, J81-A. English version: Watanabe, S. (2000). Electronics and Communications in Japan.
Watanabe, S. (1999a). Learning efficiency of redundant neural networks in Bayesian estimation. Submitted to IEEE Transactions on Neural Networks.
Watanabe, S. (1999b). Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, 1720.
Watanabe, S. (2000). Algebraic analysis for non-regular learning machines. Advances in Neural Information Processing Systems, 12.
Watanabe, S. (2001a). Algebraic analysis for non-identifiable learning machines. Neural Computation, to appear.
Watanabe, S. (2001b). Training and generalization errors of the hierarchical learning machines with algebraic singularities. IEICE Transactions, J84-A(1).
Watanabe, S. (2001c). Algebraic information geometry for learning machines with singularities. Advances in Neural Information Processing Systems, 13, to appear.
Watanabe, S., & Fukumizu, K. (1995). Probabilistic design of layered neural networks based on their unified framework. IEEE Transactions on Neural Networks, 6(3).
White, H. (1989). Learning in artificial neural networks: a statistical perspective. Neural Computation, 1.

Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Transactions on Information Theory, 44(4).


More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

A Bayesian Local Linear Wavelet Neural Network

A Bayesian Local Linear Wavelet Neural Network A Bayesian Local Linear Wavelet Neural Network Kunikazu Kobayashi, Masanao Obayashi, and Takashi Kuremoto Yamaguchi University, 2-16-1, Tokiwadai, Ube, Yamaguchi 755-8611, Japan {koba, m.obayas, wu}@yamaguchi-u.ac.jp

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models

The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health

More information

Bernstein s analytic continuation of complex powers

Bernstein s analytic continuation of complex powers (April 3, 20) Bernstein s analytic continuation of complex powers Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/. Analytic continuation of distributions 2. Statement of the theorems

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference

Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference Shunsuke Horii Waseda University s.horii@aoni.waseda.jp Abstract In this paper, we present a hierarchical model which

More information

Bregman divergence and density integration Noboru Murata and Yu Fujimoto

Bregman divergence and density integration Noboru Murata and Yu Fujimoto Journal of Math-for-industry, Vol.1(2009B-3), pp.97 104 Bregman divergence and density integration Noboru Murata and Yu Fujimoto Received on August 29, 2009 / Revised on October 4, 2009 Abstract. In this

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Phase Diagram Study on Variational Bayes Learning of Bernoulli Mixture

Phase Diagram Study on Variational Bayes Learning of Bernoulli Mixture 009 Technical Report on Information-Based Induction Sciences 009 (IBIS009) Phase Diagram Study on Variational Bayes Learning of Bernoulli Mixture Daisue Kaji Sumio Watanabe Abstract: Variational Bayes

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata ' / PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE Noboru Murata Waseda University Department of Electrical Electronics and Computer Engineering 3--

More information

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions Chapter 3 Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions 3.1 Scattered Data Interpolation with Polynomial Precision Sometimes the assumption on the

More information

Bootstrap prediction and Bayesian prediction under misspecified models

Bootstrap prediction and Bayesian prediction under misspecified models Bernoulli 11(4), 2005, 747 758 Bootstrap prediction and Bayesian prediction under misspecified models TADAYOSHI FUSHIKI Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569,

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

On the Stochastic Complexity of Learning Realizable and Unrealizable Rules

On the Stochastic Complexity of Learning Realizable and Unrealizable Rules Machine Learning, 19,241-261 (1995) 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. On the Stochastic Complexity of Learning Realizable and Unrealizable Rules RONNY MEIR NERI

More information

Information Theory and Hypothesis Testing

Information Theory and Hypothesis Testing Summer School on Game Theory and Telecommunications Campione, 7-12 September, 2014 Information Theory and Hypothesis Testing Mauro Barni University of Siena September 8 Review of some basic results linking

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

Complexity Bounds of Radial Basis Functions and Multi-Objective Learning

Complexity Bounds of Radial Basis Functions and Multi-Objective Learning Complexity Bounds of Radial Basis Functions and Multi-Objective Learning Illya Kokshenev and Antônio P. Braga Universidade Federal de Minas Gerais - Depto. Engenharia Eletrônica Av. Antônio Carlos, 6.67

More information

The Minimum Message Length Principle for Inductive Inference

The Minimum Message Length Principle for Inductive Inference The Principle for Inductive Inference Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health University of Melbourne University of Helsinki, August 25,

More information

Introduction Dual Representations Kernel Design RBF Linear Reg. GP Regression GP Classification Summary. Kernel Methods. Henrik I Christensen

Introduction Dual Representations Kernel Design RBF Linear Reg. GP Regression GP Classification Summary. Kernel Methods. Henrik I Christensen Kernel Methods Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Kernel Methods 1 / 37 Outline

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Smooth Bayesian Kernel Machines

Smooth Bayesian Kernel Machines Smooth Bayesian Kernel Machines Rutger W. ter Borg 1 and Léon J.M. Rothkrantz 2 1 Nuon NV, Applied Research & Technology Spaklerweg 20, 1096 BA Amsterdam, the Netherlands rutger@terborg.net 2 Delft University

More information

Intelligent Systems Statistical Machine Learning

Intelligent Systems Statistical Machine Learning Intelligent Systems Statistical Machine Learning Carsten Rother, Dmitrij Schlesinger WS2014/2015, Our tasks (recap) The model: two variables are usually present: - the first one is typically discrete k

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Chapter 2 Review of Classical Information Theory

Chapter 2 Review of Classical Information Theory Chapter 2 Review of Classical Information Theory Abstract This chapter presents a review of the classical information theory which plays a crucial role in this thesis. We introduce the various types of

More information

Information geometry of Bayesian statistics

Information geometry of Bayesian statistics Information geometry of Bayesian statistics Hiroshi Matsuzoe Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya 466-8555, Japan Abstract.

More information

From CDF to PDF A Density Estimation Method for High Dimensional Data

From CDF to PDF A Density Estimation Method for High Dimensional Data From CDF to PDF A Density Estimation Method for High Dimensional Data Shengdong Zhang Simon Fraser University sza75@sfu.ca arxiv:1804.05316v1 [stat.ml] 15 Apr 2018 April 17, 2018 1 Introduction Probability

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Lecture 4: Deep Learning Essentials Pierre Geurts, Gilles Louppe, Louis Wehenkel 1 / 52 Outline Goal: explain and motivate the basic constructs of neural networks. From linear

More information

Lower Bounds for Approximation by MLP Neural Networks

Lower Bounds for Approximation by MLP Neural Networks Lower Bounds for Approximation by MLP Neural Networks Vitaly Maiorov and Allan Pinkus Abstract. The degree of approximation by a single hidden layer MLP model with n units in the hidden layer is bounded

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway

More information

= w 2. w 1. B j. A j. C + j1j2

= w 2. w 1. B j. A j. C + j1j2 Local Minima and Plateaus in Multilayer Neural Networks Kenji Fukumizu and Shun-ichi Amari Brain Science Institute, RIKEN Hirosawa 2-, Wako, Saitama 35-098, Japan E-mail: ffuku, amarig@brain.riken.go.jp

More information