A Theory of Mean Field Approximation

T. Tanaka
Department of Electronics and Information Engineering, Tokyo Metropolitan University, 1-1 Minami-Osawa, Hachioji, Tokyo 192-0397 Japan

Abstract

I present a theory of mean field approximation based on information geometry. This theory includes in a consistent way the naive mean field approximation, as well as the TAP approach and the linear response theorem in statistical physics, giving clear information-theoretic interpretations to them.

1 INTRODUCTION

Many problems of neural networks, such as learning and pattern recognition, can be cast into the framework of a statistical estimation problem. How difficult it is to solve a particular problem depends on the statistical model one employs in solving it. For Boltzmann machines [1], for example, it is computationally very hard to evaluate expectations of state variables from the model parameters. Mean field approximation [2], which originated in statistical physics, has frequently been used in practical situations to circumvent this difficulty. In the context of statistical physics several advanced theories are known, such as the TAP approach [3], the linear response theorem [4], and so on. For neural networks, application of mean field approximation has mostly been confined to the so-called naive mean field approximation, but there are also attempts to utilize those advanced theories [5, 6, 7, 8].

In this paper I present an information-theoretic formulation of mean field approximation. It is based on information geometry [9], which has been successfully applied to several problems in neural networks [10]. This formulation includes the naive mean field approximation as well as the advanced theories in a consistent way. I give the formulation for Boltzmann machines, but its extension to wider classes of statistical models is possible, as described elsewhere [11].

2 BOLTZMANN MACHINES

A Boltzmann machine is a statistical model with binary random variables s_i ∈ {-1, +1}, i = 1, ..., N. The vector s = (s_1, ..., s_N) is called the state of the Boltzmann machine.

The state s is also a random variable, and its probability law is given by the Boltzmann-Gibbs distribution

    p(s) = exp( -E(s) - ψ ),                                   (1)

where E(s) is the energy, defined by

    E(s) = - Σ_i h_i s_i - Σ_(ij) w_ij s_i s_j,                (2)

with the parameters h_i and w_ij; ψ is determined by the normalization condition and is called the Helmholtz free energy of p. The notation (ij) means that the summation is taken over all distinct pairs. Let m_i ≡ ⟨s_i⟩ and c_ij ≡ ⟨s_i s_j⟩, where ⟨·⟩ denotes the expectation with respect to p. The following problem is essential for Boltzmann machines:

Problem 1: Evaluate the expectations m_i and c_ij from the parameters h_i and w_ij of the Boltzmann machine.
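Problem 1 can in principle be solved exactly by summing over all 2^N states, which is what makes it computationally hard for large N and motivates the approximations developed below. The following minimal sketch is only an illustration, not the paper's code: it assumes ±1 units and a symmetric coupling matrix with zero diagonal, and all names (boltzmann_expectations, h, w) are invented here. It computes m_i and c_ij by brute-force enumeration and serves as an exact baseline for the later sketches.

    # Exact solution of Problem 1 by enumeration (illustrative sketch, O(2^N) cost).
    import itertools
    import numpy as np

    def boltzmann_expectations(h, w):
        """Return (m, c) with m_i = <s_i>, c_ij = <s_i s_j> under eqs. (1)-(2)."""
        n = len(h)
        states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
        # Energy of every state; w is symmetric with zero diagonal, so the 0.5
        # factor reproduces the sum over distinct pairs in eq. (2).
        energy = -(states @ h) - 0.5 * np.einsum('si,ij,sj->s', states, w, states)
        logp = -energy
        logp -= logp.max()                       # numerical stability
        p = np.exp(logp)
        p /= p.sum()                             # normalization (fixes psi implicitly)
        m = p @ states                           # m_i = <s_i>
        c = np.einsum('s,si,sj->ij', p, states, states)   # c_ij = <s_i s_j>
        return m, c

    # Example: a 3-unit machine.
    h = np.array([0.3, -0.2, 0.1])
    w = np.array([[0.0, 0.5, -0.4],
                  [0.5, 0.0, 0.2],
                  [-0.4, 0.2, 0.0]])
    m_exact, c_exact = boltzmann_expectations(h, w)
    print(m_exact)
    print(c_exact)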

3 INFORMATION GEOMETRY

3.1 ORTHOGONAL DUAL FOLIATIONS

The whole set S of Boltzmann-Gibbs distributions (1) realizable by a Boltzmann machine is regarded as an exponential family. Let us use shorthand notations I, J, ... to represent distinct pairs of indices, such as I = (ij). The parameters (h_i, w_I) constitute a coordinate system of S, called the canonical parameters of S. The expectations (m_i, c_I) constitute another coordinate system of S, called the expectation parameters of S.

Let F_0 be the subset of S on which the w_I are all equal to zero. I call F_0 the factorizable submodel of S, since p_0 ∈ F_0 factorizes with respect to the s_i. On F_0 the problem is easy: since the w_I are all zero, each s_i is statistically independent of the others, and therefore m_i = tanh h_i and c_I = c_ij = m_i m_j hold.

Mean field approximation systematically reduces the problem onto the factorizable submodel F_0. For this reduction, I introduce dual foliations F and A of S. The foliation F = {F_w} is parametrized by w = (w_I); each leaf F_w is defined as

    F_w = { p(s; h, w) | h = (h_i) ∈ R^N }.                    (3)

The leaf F_0 is the same as the factorizable submodel. Each leaf F_w is again an exponential family, with h and m = (m_i) as its canonical and expectation parameters, respectively. A pair of dual potentials is defined on each leaf: one is the Helmholtz free energy ψ(h), and the other is its Legendre transform, the Gibbs free energy,

    φ(m) = max_h [ Σ_i h_i m_i - ψ(h) ],                       (4)

and the parameters of p ∈ F_w are given by

    m_i = ∂ψ/∂h_i,    h_i = ∂φ/∂m_i.                           (5)

Another foliation A = {A_m} is parametrized by m = (m_i); each leaf A_m is defined as

    A_m = { p(s; h, w) | ⟨s_i⟩_p = m_i }.                       (6)

Each leaf A_m is not an exponential family, but a pair of dual potentials is again defined on it, the former a function of the canonical part w and the latter its Legendre transform, a function of the expectation part c = (c_I); the coordinates of p ∈ A_m are obtained from these potentials by differentiation (eqs. (7)-(9)). In particular, the Gibbs free energy φ, viewed as a function of w with m held fixed, satisfies

    c_I = - ∂φ/∂w_I.                                           (9)

These two foliations form orthogonal dual foliations, since the leaves F_w and A_m are orthogonal at their intersecting point. On the basis of the orthogonal dual foliations I introduce still another coordinate system on S, called the mixed coordinate system. It uses the pair (m, w) of expectation and canonical parameters to specify a single element p ∈ S: the part m specifies the leaf A_m on which p resides, and the part w specifies the leaf F_w.

3.2 REFORMULATION OF PROBLEM

Assume that a target Boltzmann machine q is given by specifying its parameters (h_i, w_I). Problem 1 is restated as follows: evaluate the expectations (m_i, c_I) of q from those parameters. To evaluate m, mean field approximation translates the problem into the following one:

Problem 2: Let A be the leaf of the foliation A on which q resides. Find the point p_0 ∈ F_0 which is the closest to A.

At first sight this problem is trivial, since one immediately finds the solution: p_0 is the intersecting point of A and F_0, that is, the factorizable distribution having the same expectations m as q. However, solving this problem with respect to m, i.e., expressing m in terms of the parameters of q, is nontrivial, and it is the key to the understanding of mean field approximation, including the advanced theories. Let us measure the proximity of p to q by the Kullback divergence

    D(p ∥ q) = Σ_s p(s) log [ p(s) / q(s) ];                   (10)

then solving Problem 2 reduces to finding the minimizer of D(p ∥ q) over the leaf F_w on which q resides, for the given q; the expectation coordinates m of the minimizer then identify p_0. For p ∈ F_w, D(p ∥ q) is expressed in terms of the dual potentials of F_w as

    D(p ∥ q) = φ(m) - Σ_i h_i m_i + ψ.                         (11)

The minimization problem is thus equivalent to minimizing

    φ(m) - Σ_i h_i m_i                                          (12)

with respect to m, since ψ in eq. (11) does not depend on m. Solving the stationary condition of (12) with respect to m will give the correct expectations m, since the true minimizer is p = q. However, this scenario is in general intractable, since φ cannot be given explicitly as a function of m.
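As a numerical illustration of this reformulation, and a continuation of the enumeration sketch above (same illustrative setup and invented names, not the paper's code), one can scan points p on the leaf F_w containing a small target q and confirm that D(p ∥ q) of eq. (10) is minimized at p = q, so that the expectation coordinates of the minimizer are the true m, as eqs. (11)-(12) assert.

    # Check that D(p || q) over the leaf F_w is minimized at the true expectations.
    import itertools
    import numpy as np

    def boltzmann_dist(h, w):
        """All 2^N states and their Boltzmann-Gibbs probabilities, eq. (1)."""
        n = len(h)
        states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
        logp = states @ h + 0.5 * np.einsum('si,ij,sj->s', states, w, states)
        logp -= logp.max()
        p = np.exp(logp)
        return states, p / p.sum()

    h = np.array([0.3, -0.2, 0.1])               # target q has parameters (h, w)
    w = np.array([[0.0, 0.5, -0.4],
                  [0.5, 0.0, 0.2],
                  [-0.4, 0.2, 0.0]])
    states, q = boltzmann_dist(h, w)
    m_true = q @ states

    best_kl, best_h = np.inf, None
    for dh in itertools.product(np.linspace(-0.5, 0.5, 11), repeat=3):
        hp = h + np.array(dh)                    # a point p on the same leaf F_w
        _, p = boltzmann_dist(hp, w)
        kl = float(np.sum(p * (np.log(p) - np.log(q))))   # D(p || q), eq. (10)
        if kl < best_kl:
            best_kl, best_h = kl, hp

    _, p_best = boltzmann_dist(best_h, w)
    print("true m              :", m_true)
    print("m at argmin D(p||q) :", p_best @ states, " D =", best_kl)
    # The grid point dh = 0 (i.e. p = q) attains D ~ 0, so the minimizer's
    # expectation coordinates coincide with the true m.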

3.3 PLEFKA EXPANSION

The problem is easy if w = 0, since in this case φ is given explicitly as a function of m. Minimization of (12) with respect to m then gives the solution

    m_i = tanh h_i,                                             (13)

as expected. When w ≠ 0 the expression (13) is no longer exact, but to compensate for the error one may use, leaving convergence problems aside, the Taylor expansion of φ(m; w) with respect to w,

    φ(m; w) = φ^(0)(m) + Σ_I φ^(1)_I(m) w_I + (1/2) Σ_{I,J} φ^(2)_{IJ}(m) w_I w_J
              + (1/6) Σ_{I,J,K} φ^(3)_{IJK}(m) w_I w_J w_K + ··· .   (14)

This expansion has been called the Plefka expansion [12] in the literature of spin glasses. Note that in considering the expansion one should temporarily assume that m is fixed: one can rely on a solution m evaluated from the stationary condition only if the expansion does not change the value of m.

The coefficients in the expansion can be computed efficiently by fully utilizing the orthogonal dual structure of the foliations. First, we have the following theorem:

Theorem 1: The coefficients of the expansion (14) are given by the cumulant tensors of the corresponding orders, defined on A_m.

Because of the relation between the Gibbs free energy φ and the potential of the leaf A_m, one can consider derivatives of φ instead of those of the leaf potential. The first-order derivatives are immediately given by the property of the potential of the leaf A_m (eq. (9)), yielding

    ∂φ/∂w_I = - c_I(p),                                         (15)

where p denotes the distribution on A_m corresponding to w. The coefficients of the lowest orders, including the first-order one, are given by the following theorem:

Theorem 2: The first-, second-, and third-order coefficients of the expansion (14) are given explicitly, at w = 0, by the corresponding cumulant tensors on A_m.

The proofs will be found in [11]. It should be noted that, although these results happen to be the same as the ones that would be obtained by regarding A_m as an exponential family, they are not the same in general, since A_m is actually not an exponential family; for example, they differ for the fourth-order coefficients.

The explicit formulas for these coefficients for Boltzmann machines are given as follows. For the first order,

    φ^(1)_I = - m_i m_j    for I = (ij);                        (17)

for the second order,

    φ^(2)_{IJ} = - (1 - m_i²)(1 - m_j²)    for I = J = (ij),    (18)

and φ^(2)_{IJ} = 0 for I ≠ J. For the third order, the coefficients are nonvanishing only when I = J = K = (ij) and when I, J, K are the three distinct pairs (ij), (jk), (ki) formed from three distinct indices i, j, k; the explicit expressions are again functions of the m_i alone.

4 MEAN FIELD APPROXIMATION

4.1 MEAN FIELD EQUATION

Truncating the Plefka expansion (14) at the n-th order term gives the n-th order approximation Φ_n(m). The Weiss free energy, which is used in the naive mean field approximation, is given by Φ_1. The TAP approach picks up all relevant terms of the Plefka expansion [12]; for the SK model it gives the second-order approximation Φ_2.

The stationary condition for the truncated version of (12), ∂/∂m_i [ Φ_n(m) - Σ_j h_j m_j ] = 0, gives the so-called mean field equation, from which a solution of the approximate minimization problem is determined. For n = 1 it takes the following familiar form,

    m_i = tanh( h_i + Σ_j w_ij m_j ),                           (23)

and for n = 2 it includes the so-called Onsager reaction term,

    m_i = tanh( h_i + Σ_j w_ij m_j - m_i Σ_j w_ij² (1 - m_j²) ). (24)

Note that all of these are expressed as functions of m. Geometrically, the mean field equation approximately represents the leaf A_m in terms of the mixed coordinate system of S, since for the exact Gibbs free energy φ the stationary condition gives the exact leaf. Accordingly, the approximate relation derived from Φ_n, with m held fixed, is the n-th order approximate expression of the leaf A_m in the canonical coordinate system. The fit of this expression to the true leaf A_m around the point w = 0 becomes better as the order of approximation gets higher, as seen in Fig. 1. Such behavior is to be expected, since the Plefka expansion is essentially a Taylor expansion.
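The first- and second-order equations (23) and (24) can be solved numerically by damped fixed-point iteration. The sketch below is again an illustration under the assumptions used earlier (±1 units, symmetric w with zero diagonal, invented names), not the paper's algorithm; for weak couplings its output can be compared against the exact enumeration baseline.

    # Naive (n = 1) and TAP (n = 2) mean field equations by damped fixed-point iteration.
    import numpy as np

    def mean_field(h, w, order=2, damping=0.5, n_iter=500):
        """Solve m_i = tanh(h_i + sum_j w_ij m_j [- m_i sum_j w_ij^2 (1 - m_j^2)])."""
        m = np.tanh(h)                           # start from the factorizable solution (w = 0)
        for _ in range(n_iter):
            field = h + w @ m                    # naive mean field, eq. (23)
            if order >= 2:                       # Onsager reaction term of eq. (24)
                field = field - m * ((w ** 2) @ (1.0 - m ** 2))
            m = (1.0 - damping) * m + damping * np.tanh(field)
        return m

    h = np.array([0.3, -0.2, 0.1])
    w = np.array([[0.0, 0.5, -0.4],
                  [0.5, 0.0, 0.2],
                  [-0.4, 0.2, 0.0]])
    print("naive (n=1):", mean_field(h, w, order=1))
    print("TAP   (n=2):", mean_field(h, w, order=2))
    # For weak couplings both track the exact m from the enumeration sketch,
    # with the second-order solution typically closer, as the Plefka expansion suggests.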

Figure 1: Approximate expressions of the leaf A(m) by mean field approximations of several orders (0th through 4th) for a 2-unit Boltzmann machine (left), and their magnified view (right); the axes are the expectation parameters η_1 and η_2. (Plots not reproduced.)

Figure 2: Relation between the naive approximation and the present theory. (Diagram not reproduced; it shows the points q, p, and p_0 on the leaves F(w), A(m), and F_0, together with the divergences D(p_0 ∥ q), D(p ∥ q), and D(p_0 ∥ p).)

4.2 LINEAR RESPONSE

For estimating c_ij one can utilize the linear response theorem. In the information-geometrical framework it is represented as a trivial identity relation for the Fisher information on the leaf F_w. The Fisher information matrix, or the Riemannian metric tensor, on the leaf F_w and its inverse are given by

    g_ij = ∂m_i/∂h_j = ∂²ψ/∂h_i ∂h_j = ⟨s_i s_j⟩ - ⟨s_i⟩⟨s_j⟩    (25)

and

    (g^{-1})_ij = ∂h_i/∂m_j = ∂²φ/∂m_i ∂m_j,                     (26)

respectively. In the framework presented here, the linear response theorem states the trivial fact that these two matrices are inverses of each other. In mean field approximation, one substitutes the approximation Φ_n in place of φ in eq. (26) to get an approximate inverse of the metric. The derivatives in eq. (26) can be calculated analytically, and therefore the approximate inverse can be evaluated numerically by substituting into it a solution m of the mean field equation. Equating its inverse to the metric then gives an estimate of c_ij by using eq. (25). So far, Problem 1 has been solved within the framework of mean field approximation, with m_i and c_ij obtained from the mean field equation and the linear response theorem, respectively.
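A minimal sketch of the first-order version of this procedure (what the Discussion below calls the first-order algorithm with linear response): for the Weiss free energy Φ_1, the matrix in eq. (26) is ∂²Φ_1/∂m_i ∂m_j = δ_ij / (1 - m_i²) - w_ij, so inverting it at a mean field solution m and adding m_i m_j yields an estimate of c_ij through eq. (25). The setup and names are illustrative assumptions, not the paper's implementation.

    # Linear response estimate of c_ij from a first-order mean field solution.
    import numpy as np

    def linear_response_correlations(h, w, m):
        """Estimate c_ij = <s_i s_j> from a mean field solution m via eqs. (25)-(26)."""
        hess = np.diag(1.0 / (1.0 - m ** 2)) - w   # d^2 Phi_1 / dm_i dm_j, approximating eq. (26)
        g = np.linalg.inv(hess)                    # approximate metric g_ij = <s_i s_j> - m_i m_j
        c = g + np.outer(m, m)                     # eq. (25) rearranged for c_ij (i != j)
        np.fill_diagonal(c, 1.0)                   # s_i^2 = 1 for ±1 units
        return c

    h = np.array([0.3, -0.2, 0.1])
    w = np.array([[0.0, 0.5, -0.4],
                  [0.5, 0.0, 0.2],
                  [-0.4, 0.2, 0.0]])
    m = np.tanh(h)
    for _ in range(500):                           # naive mean field equation (23)
        m = 0.5 * m + 0.5 * np.tanh(h + w @ m)
    print(linear_response_correlations(h, w, m))
    # Together with m itself, this completes the approximate answer to Problem 1:
    # m from the mean field equation, c from the linear response theorem.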

5 DISCUSSION

Following the framework presented so far, one can in principle construct mean field approximation algorithms of any desired order. The first-order algorithm with linear response was first proposed and examined by Kappen and Rodríguez [7, 8]. Tanaka [13] has formulated second- and third-order algorithms and explored them by computer simulations. It is also possible to extend the present formulation so that it is applicable to higher-order Boltzmann machines. Tanaka [14] discusses an extension of the present formulation to third-order Boltzmann machines: it is possible to extend the linear response theorem to higher orders, which allows us to treat higher-order correlations within the framework of mean field approximation.

The common understanding about the naive mean field approximation is that it minimizes the Kullback divergence D(p_0 ∥ q) with respect to p_0 ∈ F_0 for a given q. It can be shown that this view is consistent with the theory presented in this paper. Assume p_0 ∈ F_0, and let p be the distribution corresponding to the intersecting point of the leaves A(m) and F(w), where A(m) is the leaf on which p_0 resides and F(w) is the leaf on which q resides. Because of the orthogonality of the two foliations, the following Pythagorean law [9] holds (Fig. 2):

    D(p_0 ∥ q) = D(p_0 ∥ p) + D(p ∥ q).                         (27)

Intuitively, D(p_0 ∥ p) measures the squared distance between F_0 and F(w) and is a second-order quantity in w. It should be ignored in the first-order approximation, so that D(p_0 ∥ q) ≈ D(p ∥ q) holds. Under this approximation, minimization of the former with respect to p_0 is equivalent to minimization of the latter with respect to p, which establishes the relation between the naive approximation and the present theory. It can also be checked directly that the first-order approximation of D(p ∥ q) exactly gives D(p_0 ∥ q), i.e., the Weiss free energy objective up to the constant ψ.

The present theory provides an alternative view on the validity of mean field approximation. As opposed to the common belief that mean field approximation is good when the network is sufficiently large, one can state from the present formulation that it is good whenever the higher-order contributions of the Plefka expansion vanish, regardless of whether the network is large or not. This provides a theoretical basis for the observation that mean field approximation often works well for small networks.

The author would like to thank the Telecommunications Advancement Foundation for financial support.

References

[1] Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9: 147-169.

[2] Peterson, C. & Anderson, J. R. (1987) A mean field theory learning algorithm for neural networks. Complex Systems 1: 995-1019.

[3] Thouless, D. J., Anderson, P. W., & Palmer, R. G. (1977) Solution of 'Solvable model of a spin glass'. Phil. Mag. 35 (3): 593-601.

[4] Parisi, G. (1988) Statistical Field Theory. Addison-Wesley.

[5] Galland, C. C. (1993) The limitations of deterministic Boltzmann machine learning. Network 4 (3): 355-379.

[6] Hofmann, T. & Buhmann, J. M. (1997) Pairwise data clustering by deterministic annealing. IEEE Trans. Patt. Anal. & Machine Intell. 19 (1): 1-14; Errata, ibid. 19 (2): 197 (1997).

[7] Kappen, H. J. & Rodríguez, F. B. (1998) Efficient learning in Boltzmann machines using linear response theory. Neural Computation 10 (5): 1137-1156.

[8] Kappen, H. J. & Rodríguez, F. B. (1998) Boltzmann machine learning using mean field theory and linear response correction. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, pp. 280-286. The MIT Press.

[9] Amari, S.-I. (1985) Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics 28, Springer-Verlag.

[10] Amari, S.-I., Kurata, K., & Nagaoka, H. (1992) Information geometry of Boltzmann machines. IEEE Trans. Neural Networks 3 (2): 260-271.

[11] Tanaka, T. Information geometry of mean field approximation. Preprint.

[12] Plefka, T. (1982) Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen. 15 (6): 1971-1978.

[13] Tanaka, T. (1998) Mean-field theory of Boltzmann machine learning. Phys. Rev. E 58 (2): 2302-2310.

[14] Tanaka, T. (1998) Estimation of third-order correlations within mean field approximation. In S. Usui & T. Omori (Eds.), Proc. Fifth International Conference on Neural Information Processing, vol. 1, pp. 554-557.