Topics in Information Geometry. Dodson, CTJ. MIMS EPrint: Manchester Institute for Mathematical Sciences School of Mathematics


Topics in Information Geometry. Dodson, CTJ. 2005. MIMS EPrint: 2005.49. Manchester Institute for Mathematical Sciences, School of Mathematics, The University of Manchester. Reports available from: http://eprints.maths.manchester.ac.uk/ and by contacting: The MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK. ISSN 1749-9097

Topics in Information Geometry. C.T.J. Dodson. © 2004 School of Mathematics, Manchester University, Manchester M60 1QD UK. Print date: July 8, 2005

Contents

Preface
1 Mathematical statistics and information theory
  1.1 Probability functions
    1.1.1 Example: Bernoulli distribution
    1.1.2 Example: Binomial distribution
    1.1.3 Example: The Poisson probability function
  1.2 Probability density functions
    1.2.1 Example: The exponential pdf
  1.3 Joint probability density functions
    1.3.1 Example: Bivariate Gaussian distributions
  1.4 Information theory
    1.4.1 Example: Maximum likelihood gamma pdf
2 Information geometry
  2.1 Fisher information metric
  2.2 Exponential family
  2.3 α-connections
  2.4 Affine immersions
  2.5 Example: Gamma manifold
    2.5.1 α-connection
    2.5.2 α-curvatures
    2.5.3 Gamma manifold geodesics
    2.5.4 Mutually dual foliations
    2.5.5 Affine immersions for gamma manifold
  2.6 Example: Log-gamma distributions
  2.7 Example: Gaussian distributions
    2.7.1 Natural coordinate system and potential function
    2.7.2 Fisher information metric
    2.7.3 Mutually dual foliations
    2.7.4 Affine immersions for Gaussian manifold
  2.8 Example: Bivariate Gaussian 5-manifold
    2.8.1 Fisher information metric
  2.9 Example: Freund distributions
    2.9.1 Fisher information metric
  2.10 Example: McKay bivariate gamma 3-manifold
3 Universal connections and curvature
  3.1 Systems of connections and universal objects
  3.2 Exponential family of probability density functions on R²
  3.3 Universal connection and curvature for exponential families
    3.3.1 Tangent bundle system: C_T TM → JTM
    3.3.2 Frame bundle system: C_F FM → JFM
  3.4 Universal connection and curvature on the gamma manifold
    3.4.1 Tangent bundle system
    3.4.2 Frame bundle system
4 Neighbourhoods of randomness, independence, and uniformity
  4.1 Gamma manifold G and neighbourhoods of randomness
  4.2 Log-gamma manifold L and neighbourhoods of uniformity
  4.3 Freund manifold F and neighbourhoods of independence
    4.3.1 Submanifold F₂ ⊂ F: α₁ = α₂, β₁ = β₂
  4.4 Neighbourhoods of independence for Gaussian processes [8]
5 Applications
Appendix: List of Mathematica Notebooks
Bibliography

Preface

The original draft of these lecture notes was prepared for a short course at the Centro de Investigación en Matemáticas (CIMAT), Guanajuato, Mexico in May 2004. Many of the calculations arise from joint publications with Khadiga Arwini and from her PhD thesis [6]. The course was given also in the Departamento de Xeometría e Topoloxía, Facultade de Matemáticas, Universidade de Santiago de Compostela, Spain in February 2005.


Chapter 1

Mathematical statistics and information theory

There are many easily found good books on probability theory and mathematical statistics (e.g. [3, 3, 33, 4, 44, 45, 69]), stochastic processes (e.g. [, 60]) and information theory (e.g. [6]); here we just outline some topics to help make the sequel more self-contained. For those who have access to the computer algebra package Mathematica [73], the approach to mathematical statistics and accompanying software in Rose and Smith [63] will be particularly helpful.

The word stochastic comes from the Greek stochastikos, meaning skillful in aiming, and stochazesthai, to aim at or guess at; stochos means target or aim. In our context, stochastic means involving chance variations around some event, rather like the variation in positions of strikes aimed at a target. In its turn, the later word statistics comes through eighteenth-century German from the Latin root status, meaning state; originally it meant the study of political facts and figures. The noun random was used in the sixteenth century to mean a haphazard course, from the Germanic randir, to run, and as an adjective to mean without a definite aim, rule or method, the opposite of purposive.

From the middle of the last century, the concept of a random variable has been used to describe a variable that is a function of the result of a well-defined statistical experiment in which each possible outcome has a definite probability of occurrence. The organization of probabilities of outcomes is achieved by means of a probability function for discrete random variables and by means of a probability density function for continuous random variables. The result of throwing two fair dice and summing what they show is a discrete random variable. Mainly, we are concerned here with continuous random variables (measurable functions defined on some Rⁿ) with differentiable probability density measure functions, but we do need also to mention the Poisson distribution for the discrete case. However, since the Poisson is a limiting approximation to the binomial distribution, which arises from the Bernoulli distribution (which everyone encountered in school!), we mention also those examples.

1.1 Probability functions

Here we are concerned with discrete random variables, so we take the domain set to be ℕ ∪ {0}. We may view a probability function as a subadditive measure function of unit weight on ℕ ∪ {0}:

(1.1)  p : ℕ ∪ {0} → [0, 1]  (nonnegativity)
(1.2)  Σ_{k=0}^∞ p(k) = 1  (unit weight)
(1.3)  p(A ∪ B) ≤ p(A) + p(B), ∀ A, B ⊂ ℕ ∪ {0}  (subadditivity)

with equality ⟺ A ∩ B = ∅. Formally, we have a discrete measure space of total measure 1, with σ-algebra the power set and measure function induced by p

  sub(ℕ) → [0, 1] : A ↦ Σ_{k∈A} p(k)

and, as we have anticipated above, we usually abbreviate Σ_{k∈A} p(k) = p(A). We have the following expected values of the random variable and its square:

(1.4)  E(k) = k̄ = Σ_{k=0}^∞ k p(k)
(1.5)  E(k²) = Σ_{k=0}^∞ k² p(k).

With slight but common abuse of notation, we call k̄ the mean, E(k²) − k̄² the variance, σ_k = +√(E(k²) − k̄²) the standard deviation and σ_k/k̄ the coefficient of variation, respectively, of the random variable k. The variance is the square of the standard deviation.

The moment generating function Ψ(t) = E(e^{tX}), t ∈ ℝ, of a distribution generates the r-th moment as the value of the r-th derivative of Ψ evaluated at t = 0. Hence, in particular, the mean and variance are given by:

(1.6)  E(X) = Ψ′(0)
(1.7)  Var(X) = Ψ″(0) − (Ψ′(0))²,

which can provide an easier method for their computation in some cases.

1.1.1 Example: Bernoulli distribution

It is said that a random variable X has a Bernoulli distribution with parameter p (0 ≤ p ≤ 1) if X can take only the values 0 and 1 and the probabilities are

(1.8)  Pr(X = 1) = p
(1.9)  Pr(X = 0) = 1 − p.

Then the probability function of X can be written as follows:

(1.10)  f(x | p) = p^x (1 − p)^{1−x} if x = 0, 1; and 0 otherwise.

If X has a Bernoulli distribution with parameter p, then we can find its expectation or mean value E(X) and variance Var(X) as follows:

(1.11)  E(X) = 1·p + 0·(1 − p) = p
(1.12)  Var(X) = E(X²) − (E(X))² = p(1 − p).
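As a quick numerical check of the moment-generating-function relations above, the following sketch (an illustrative aside, not part of the original notes) differentiates the Bernoulli MGF Ψ(t) = p·e^t + (1 − p), which follows directly from the Bernoulli probabilities, at t = 0:

```python
import math

def bernoulli_mgf(t, p):
    """Psi(t) = E(e^{tX}) = p e^t + (1 - p) for a Bernoulli(p) variable."""
    return p * math.exp(t) + (1.0 - p)

def d1(f, t=0.0, h=1e-5):
    """Central-difference first derivative."""
    return (f(t + h) - f(t - h)) / (2 * h)

def d2(f, t=0.0, h=1e-4):
    """Central-difference second derivative."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)

p = 0.3
f = lambda t: bernoulli_mgf(t, p)
mean = d1(f)               # Psi'(0) should equal p
var = d2(f) - d1(f) ** 2   # Psi''(0) - Psi'(0)^2 should equal p(1 - p)
```

Here `mean` recovers p = 0.3 and `var` recovers p(1 − p) = 0.21, matching the direct computation of the Bernoulli expectation and variance.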

The moment generating function of X is the expectation of e^{tX},

(1.13)  Ψ(t) = E(e^{tX}) = p e^t + q, where q = 1 − p,

which is finite for all real t.

1.1.2 Example: Binomial distribution

If n random variables X₁, X₂, ..., X_n are independently identically distributed, and each has a Bernoulli distribution with parameter p, then it is said that the variables X₁, X₂, ..., X_n form n Bernoulli trials with parameter p. If the random variables X₁, X₂, ..., X_n form n Bernoulli trials with parameter p and if X = X₁ + X₂ + ... + X_n, then X has a binomial distribution with parameters n and p.

The binomial distribution is of fundamental importance in probability and statistics because of the following result, for any experiment which can result in only either success or failure. The experiment is performed n times independently and the probability of success of any given performance is p. If X denotes the total number of successes in the n performances, then X has a binomial distribution with parameters n and p. The probability function of X is:

(1.14)  P(X = r) = P(Σ_{i=1}^n X_i = r) = (n choose r) p^r (1 − p)^{n−r}

where r = 0, 1, 2, ..., n. We write

(1.15)  f(r | p) = (n choose r) p^r (1 − p)^{n−r} if r = 0, 1, 2, ..., n; and 0 otherwise.

In this distribution n must be a positive integer and p must lie in the interval 0 ≤ p ≤ 1. If X is represented by the sum of n Bernoulli trials, then it is easy to get its expectation, variance and moment generating function by using the properties of sums of random variables:

(1.16)  E(X) = Σ_{i=1}^n E(X_i) = np
(1.17)  Var(X) = Σ_{i=1}^n Var(X_i) = np(1 − p)
(1.18)  Ψ(t) = E(e^{tX}) = Π_{i=1}^n E(e^{tX_i}) = (p e^t + q)^n.

1.1.3 Example: The Poisson probability function

The Poisson distribution is widely discussed in the statistical literature; one monograph devoted to it and its applications is Haight [38]. Take t, τ ∈ (0, ∞):

(1.19)  p : ℕ ∪ {0} → [0, 1] : k ↦ (t/τ)^k / k! · e^{−t/τ}
(1.20)  k̄ = t/τ
(1.21)  σ²_k = t/τ.
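The quality of the Poisson approximation to the binomial for large n and small p can be seen numerically; this small script (an illustrative sketch, not part of the original notes) compares the two probability functions directly:

```python
import math

def binom_pmf(r, n, p):
    """Binomial probability of r successes in n trials."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(k, lam):
    """Poisson probability with mean lam."""
    return lam**k / math.factorial(k) * math.exp(-lam)

n, p = 1000, 0.003        # large n, small p; mean np = 3
lam = n * p
# largest pointwise discrepancy over the bulk of the support
max_gap = max(abs(binom_pmf(r, n, p) - poisson_pmf(r, lam)) for r in range(30))
```

With n = 1000 and p = 0.003 the two probability functions agree pointwise to within about one percent of the modal probability, consistent with the limiting correspondences above.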

This probability function is used to model the number k of random events in a region of measure t when the mean number of events per unit region is 1/τ. Importantly, the Poisson distribution can give a good approximation to the binomial distribution when n is large and p is close to 0. This is easy to see by making the correspondences:

(1.22)  (1 − p)^{n−r} ≈ e^{−pn}
(1.23)  n!/(n − r)! ≈ n^r.

1.2 Probability density functions

We are concerned with the case of continuous random variables defined on some Ω ⊂ R^m. For our present purposes we may view a probability density function (pdf) on Ω ⊂ R^m as a subadditive measure function of unit weight, namely, a nonnegative map on Ω:

(1.24)  f : Ω → [0, ∞)  (nonnegativity)
(1.25)  ∫_Ω f = f(Ω) = 1  (unit weight)
(1.26)  f(A ∪ B) ≤ f(A) + f(B), ∀ A, B ⊂ Ω  (subadditivity)

with equality ⟺ A ∩ B = ∅. Formally, we have a measure space of total measure 1, with σ-algebra typically the Borel sets or the power set, and the measure function induced by f

  sub(Ω) → [0, 1] : A ↦ ∫_A f

and, as we have anticipated above, we usually abbreviate ∫_A f = f(A). Given an integrable (ie measurable in the σ-algebra) function u : Ω → R, the expectation or mean value of u is defined to be E(u) = ū = ∫_Ω u f. We say that f is the joint pdf for the random variables x₁, x₂, ..., x_m, being the coordinates of points in Ω, or that these random variables have the joint probability distribution f. If x is one of these random variables, and in particular for the important case of a single random variable x, we have the following:

(1.27)  E(x) = x̄ = ∫_Ω x f
(1.28)  E(x²) = ∫_Ω x² f.

Again with slight abuse of notation, we call x̄ the mean, E(x²) − x̄² the variance, σ_x = +√(E(x²) − x̄²) the standard deviation and σ_x/x̄ the coefficient of variation, respectively, of the random variable x. The variance is the square of the standard deviation. Usually, a probability density function depends on a set of parameters θ¹, θ², ..., θⁿ and we say that we have an n-dimensional family of pdfs.

1.2.1 Example: The exponential pdf

Take λ ∈ R⁺; this is called the parameter of the exponential pdf

(1.29)  f : [0, ∞) → [0, ∞) : x ↦ (1/λ) e^{−x/λ}
(1.30)  x̄ = λ
(1.31)  σ_x = λ.
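A numerical sanity check of the exponential pdf above (an illustrative sketch, not part of the original notes): integrating over a truncated range recovers unit weight, mean λ and standard deviation λ:

```python
import math

def exp_pdf(x, lam):
    """f(x) = (1/lam) e^{-x/lam} on [0, inf); lam is the mean."""
    return math.exp(-x / lam) / lam

def integrate(g, a, b, n=20000):
    """Simple midpoint rule."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

lam = 2.0
upper = 60.0   # truncation error is of order e^{-30}, negligible here
total = integrate(lambda x: exp_pdf(x, lam), 0.0, upper)
mean = integrate(lambda x: x * exp_pdf(x, lam), 0.0, upper)
second = integrate(lambda x: x * x * exp_pdf(x, lam), 0.0, upper)
sd = math.sqrt(second - mean**2)
```

For λ = 2 this gives total weight 1, mean 2 and standard deviation 2 to several decimal places, as the formulas assert.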

The parameter space of the exponential distribution is R⁺, so exponential distributions form a 1-parameter family. In the sequel we shall see that quite generally we may provide a Riemannian structure to the parameter space of a family of distributions. Sometimes we call a family of pdfs a parametric statistical model. Observe that, in the Poisson probability function (1.19), the value at zero is p(0) = e^{−t/τ}, and it is not difficult to show that the probability density function for the distance t between random events on [0, ∞) is an exponential pdf (1.29), given by f : [0, ∞) → [0, ∞) : t ↦ (1/τ) e^{−t/τ}, where τ is the mean interval between events.

1.3 Joint probability density functions

Let f be a pdf defined on R² (or some subset thereof). This is an important case since here we have two variables, X, Y, say, and we can extract certain features of how they interact. In particular, we define their respective mean values and their covariance σ_xy:

(1.32)  x̄ = ∫∫ x f(x, y) dx dy
(1.33)  ȳ = ∫∫ y f(x, y) dx dy
(1.34)  σ_xy = ∫∫ xy f(x, y) dx dy − x̄ ȳ.

The marginal pdf of X is f_X, obtained by integrating f over all y:

(1.35)  f_X(x) = ∫_{v=−∞}^∞ f_{X,Y}(x, v) dv

and similarly the marginal pdf of Y is

(1.36)  f_Y(y) = ∫_{u=−∞}^∞ f_{X,Y}(u, y) du.

The jointly distributed random variables X and Y are called independent if their marginal density functions satisfy

(1.37)  f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y ∈ R.

It is easily shown that if the variables are independent then their covariance (1.34) is zero, but the converse is not true. Feller [3] gives a simple counterexample: let X take the values −2, −1, +1, +2, each with probability 1/4, and let Y = X²; then the covariance is zero but there is evidently a (nonlinear) dependence.

The extent of dependence between two random variables can be measured in a normalised way by means of the correlation coefficient: the ratio of the covariance to the product of marginal standard deviations:

(1.38)  ρ_xy = σ_xy / (σ_x σ_y).

Note that −1 ≤ ρ_xy ≤ 1 whenever it exists, the limiting values corresponding to the case of linear dependence between the variables. Intuitively, ρ_xy < 0 if y tends to increase as x decreases, and ρ_xy > 0 if x and y tend to increase together.
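Feller's counterexample is easy to verify directly; this sketch (illustrative, not part of the original notes) computes the covariance for X uniform on {−2, −1, +1, +2} and Y = X²:

```python
vals = [-2, -1, 1, 2]                   # X takes each value with probability 1/4
E = lambda g: sum(0.25 * g(v) for v in vals)

mean_x = E(lambda v: v)                 # E(X) = 0 by symmetry
mean_y = E(lambda v: v * v)             # E(Y) = E(X^2) = 2.5
# cov(X, Y) = E(XY) - E(X)E(Y), and XY = X^3 here
cov_xy = E(lambda v: v ** 3) - mean_x * mean_y
```

The covariance comes out exactly zero even though Y is a deterministic (nonlinear) function of X, so zero covariance does not imply independence.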

1.3.1 Example: Bivariate Gaussian distributions

The probability density function of the two-dimensional Gaussian distribution has the form:

(1.39)  f(x, y) = 1/(2π√(σ₁σ₂ − σ₁₂²)) · exp( −[σ₂(x − µ₁)² − 2σ₁₂(x − µ₁)(y − µ₂) + σ₁(y − µ₂)²] / (2(σ₁σ₂ − σ₁₂²)) ),

where −∞ < x, y < ∞, −∞ < µ₁, µ₂ < ∞, 0 < σ₁, σ₂ < ∞. This contains the five parameters (µ₁, µ₂, σ₁, σ₂, σ₁₂) = (ξ₁, ξ₂, ξ₃, ξ₄, ξ₅) ∈ Θ. So we have a five-dimensional parameter space Θ.

1.4 Information theory

Jaynes [4] provided the foundation for information theoretic methods in, among other things, Bayes hypothesis testing as described by Tribus et al. [70, 7]; see also Shannon [64], Slepian [65] and Roman [6]. Given a set of observed values ⟨g_α(t)⟩ for functions g_α of the random variable x, we seek a least prejudiced set of probability values p_i satisfying

(1.40)  ⟨g_α(t)⟩ = Σ_{i=1}^n p_i g_α(x_i) for α = 1, 2, ..., N
(1.41)  1 = Σ_{i=1}^n p_i.

Jaynes showed that this occurs if we choose those p_i that maximize Shannon's entropy function

(1.42)  S = −Σ_{i=1}^n p_i log(p_i).

It turns out [7] that if we have no data on observed functions of x (so the set of equations (1.40) is empty), then the maximum entropy choice gives the exponential distribution. If we have estimates of the first two moments of the distribution of x, then we obtain the (truncated) Gaussian. If we have estimates of the mean and mean logarithm of x, then the maximum entropy choice is the gamma distribution. In the sequel we shall consider the particular case of the gamma distribution for several reasons:

- the exponential distributions form a subclass of gamma distributions, and exponential distributions represent intervals between events in a Poisson (ie. random) process
- the sum of n independent identical exponential random variables follows a gamma distribution
- lognormal distributions may be well-approximated by gamma distributions
- products of gamma distributions are well-approximated by gamma distributions
- stochastic porous media have been modeled using gamma distributions [6].

Other parametric statistical models based on different distributions may be treated in a similar way.

Let Θ be the parameter space of a parametric statistical model, that is, an n-dimensional smooth family of probability density functions defined on some fixed event space Ω of unit measure, ∫_Ω p_θ = 1 for all θ ∈ Θ.

For each sequence X = {X₁, X₂, ..., X_n} of independent identically distributed observed values, the likelihood function lik_X on Θ, which measures the likelihood of the sequence arising from different p_θ, is defined by

  lik_X : Θ → [0, ∞) : θ ↦ Π_{i=1}^n p_θ(X_i).

Statisticians use the likelihood function, or the log-likelihood, its logarithm l = log lik, in the evaluation of goodness of fit of statistical models. The so-called method of maximum likelihood is used to obtain optimal fitting of the parameters in a distribution to observed data.

1.4.1 Example: Maximum likelihood gamma pdf

The family of gamma distributions has event space Ω = R⁺ and probability density functions given by Θ ≡ {f(x; γ, µ) | γ, µ ∈ R⁺}, so here Θ = R⁺ × R⁺ and the random variable is x ∈ Ω = R⁺ with

(1.43)  f(x; γ, µ) = (µ/γ)^µ · x^{µ−1}/Γ(µ) · e^{−xµ/γ}.

Then x̄ = γ and Var(x) = γ²/µ, and we see that γ controls the mean of the distribution while µ controls its variance and hence the shape. Indeed, the property that the variance is proportional to the square of the mean actually characterizes gamma distributions, as shown recently by Hwang and Hu [39]. They proved, for n ≥ 3 independent positive random variables x₁, x₂, ..., x_n with a common continuous probability density function h, that having independence of the sample mean and sample coefficient of variation is equivalent to h being a gamma distribution.

The special case µ = 1 corresponds to the situation of the random or Poisson process with mean inter-event interval γ. In fact, for integer µ = 1, 2, ..., (1.43) models a process that is Poisson but with intermediate events removed to leave only every µth. Formally, the gamma distribution is the µ-fold convolution of the exponential distribution, called also the Pearson Type III distribution. Figure 1.1 shows a family of gamma distributions, all of unit mean, with µ = 0.5, 1, 2, 5.

Shannon's information theoretic entropy or uncertainty is given, up to a factor, by the negative of the expectation of the logarithm of the probability density function (1.43), that is

(1.44)  S_f(γ, µ) = −∫₀^∞ log(f(x; γ, µ)) f(x; γ, µ) dx = µ + (1 − µ) Γ′(µ)/Γ(µ) + log( γ Γ(µ)/µ ).

At unit mean, the maximum entropy (or maximum uncertainty) occurs at µ = 1, which is the random case, and then S_f(γ, 1) = 1 + log γ. So, a Poisson process of points on a line is such that the points are as disorderly as possible: among all homogeneous point processes with a given density, the Poisson process has maximum entropy. Figure 1.2 shows a plot of S_f(γ, µ) for the case of unit mean γ = 1.

We can see the role of the log-likelihood function in the case of a set X = {X₁, X₂, ..., X_n} of measurements, drawn from independent identically distributed random variables, to which we wish to fit the maximum likelihood gamma distribution. The procedure to optimize the choice of γ, µ is as follows. For independent events X_i with identical distribution f(x; γ, µ), their joint probability density is the product of the marginal densities, so a measure of the likelihood of finding such a set of events is

  lik_X(γ, µ) = Π_{i=1}^n f(X_i; γ, µ).

We seek a choice of γ, µ to maximize this product, and since the log function is monotonic increasing it is simpler to maximize the logarithm

  l_X(γ, µ) = log lik_X(γ, µ) = log[ Π_{i=1}^n f(X_i; γ, µ) ].

Figure 1.1: Probability density functions f(x; γ, µ) for gamma distributions of inter-event intervals t with unit mean γ = 1, and µ = 0.5 (clustered), µ = 1 (random), µ = 2 (smoothed), µ = 5 (smoothed). The case µ = 1 corresponds to an exponential distribution from an underlying Poisson process; µ represents some organization: clustering (µ < 1) or smoothing (µ > 1).

Figure 1.2: Information theoretic entropy S_f(γ, µ) for gamma distributions of inter-event intervals t with unit mean γ = 1. The maximum at µ = 1 corresponds to an exponential distribution from an underlying Poisson process. The regime µ < 1 corresponds to clustering of events and µ > 1 corresponds to smoothing out of events, relative to a Poisson process. Note that, at constant mean, the variance of x decays like 1/µ.

Substitution gives us

  l_X(γ, µ) = Σ_{i=1}^n [ µ(log µ − log γ) + (µ − 1) log X_i − (µ/γ) X_i − log Γ(µ) ]
            = nµ(log µ − log γ) + (µ − 1) Σ_{i=1}^n log X_i − (µ/γ) Σ_{i=1}^n X_i − n log Γ(µ).

Then, solving ∂l_X/∂γ = ∂l_X/∂µ = 0 in terms of properties of the X_i, we obtain the maximum likelihood estimates γ̂, µ̂ of γ, µ in terms of the mean and mean logarithm of the X_i:

  γ̂ = X̄ = (1/n) Σ_{i=1}^n X_i
  log µ̂ − Γ′(µ̂)/Γ(µ̂) = log X̄ − (1/n) Σ_{i=1}^n log X_i.
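The two estimating equations above can be solved numerically. The sketch below (illustrative, not part of the original notes; the digamma function Γ′/Γ is approximated by differencing math.lgamma) recovers γ̂ and µ̂ from simulated gamma data:

```python
import math
import random

def digamma(z, h=1e-5):
    """Numerical digamma Gamma'(z)/Gamma(z): central difference of log Gamma."""
    return (math.lgamma(z + h) - math.lgamma(z - h)) / (2 * h)

def gamma_mle(xs):
    """Maximum likelihood fit in the (gamma, mu) parameterization:
    gamma_hat is the sample mean; mu_hat solves
    log(mu) - digamma(mu) = log(sample mean) - (mean of log samples)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_log = sum(math.log(x) for x in xs) / n
    target = math.log(mean) - mean_log      # nonnegative by Jensen's inequality
    lo, hi = 1e-3, 1e3                      # bracket for bisection
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        # log(mu) - digamma(mu) is strictly decreasing in mu
        if math.log(mid) - digamma(mid) > target:
            lo = mid
        else:
            hi = mid
    return mean, 0.5 * (lo + hi)

# recover known parameters from simulated data: shape mu = 5, mean gamma = 2
random.seed(0)
xs = [random.gammavariate(5.0, 2.0 / 5.0) for _ in range(20000)]
gamma_hat, mu_hat = gamma_mle(xs)
```

With 20000 samples the estimates land close to the true values (γ, µ) = (2, 5); note that random.gammavariate takes shape and scale, so the scale here is γ/µ.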


Chapter 2

Information geometry

Amari [3] and Amari and Nagaoka [4] provide modern accounts of the differential geometry that arises from the Fisher information metric.

2.1 Fisher information metric

Let Θ be the parameter space of a parametric statistical model, that is, an n-dimensional smooth family of probability density functions defined on some fixed event space Ω of unit measure, ∫_Ω p_θ = 1 for all θ ∈ Θ. Denote by E the expectation operator (measure) for functions defined on Ω; in particular, the mean E(x) = x̄ and the variance E(x²) − x̄² = Var(x), which will be functions of θ. Then, at each point θ ∈ Θ, the covariance of partial derivatives of the log-likelihood function l = log p_θ is a matrix with entries the expectations

(2.1)  g_ij = E( (∂l/∂θ^i)(∂l/∂θ^j) ) = −E( ∂²l/∂θ^i∂θ^j )

(for coordinates (θ^i) about θ ∈ Θ). This gives rise to a positive definite matrix, so inducing a Riemannian metric g on Θ, using for coordinates the parameters (θ^i); this metric is called the expected information metric for the family of probability density functions; the original ideas are due to Fisher and Rao [34, 6]. Of course, the second equality in equation (2.1) depends on certain regularity conditions, but when it holds it can be particularly convenient to use.

2.2 Exponential family

An n-dimensional parametric statistical model Θ ≡ {p_θ | θ ∈ Θ} is said to be an exponential family, or of exponential type, when the density function can be expressed in terms of functions {C, F₁, ..., F_n} on Ω and a function ψ on Θ as:

(2.2)  p(x; θ) = exp{ C(x) + Σ_i θ^i F_i(x) − ψ(θ) };

then we say that (θ^i) are its natural or canonical parameters, and ψ is the potential function. From the normalization condition ∫ p(x; θ) dx = 1 we obtain:

(2.3)  ψ(θ) = log ∫ exp{ C(x) + Σ_i θ^i F_i(x) } dx.

This potential function is therefore a distinguished function of the coordinates alone, and in the sequel we make use of it for the presentation of the manifold as an immersion in R^{n+1}.

2.3 α-connections

Let Θ ≡ {p_ξ} be an n-dimensional model, and consider the function Γ^(α)_{ij,k} which maps each point ξ to the following value:

(2.4)  Γ^(α)_{ij,k} = E[ ( ∂²log f/∂ξ^i∂ξ^j + (1 − α)/2 · ∂log f/∂ξ^i · ∂log f/∂ξ^j ) ∂log f/∂ξ^k ],

where α is some arbitrary real number. So we have an affine connection ∇^(α) on Θ defined by

(2.5)  ⟨∇^(α)_{∂_i} ∂_j, ∂_k⟩ = Γ^(α)_{ij,k},

where g = ⟨ , ⟩ is the Fisher metric. We call this ∇^(α) the α-connection. The α-connection is clearly a symmetric connection. We also have

(2.6)  ∇^(α) = (1 + α)/2 · ∇^(1) + (1 − α)/2 · ∇^(−1).

Proposition 2.3.1 The 0-connection is the Riemannian connection with respect to the Fisher metric. In general, when α ≠ 0, ∇^(α) is not metric.

The notion of exponential family has a close relation to ∇^(1). From the definition of an exponential family given in equation (2.2), with ∂_i = ∂/∂θ^i, we obtain

(2.7)  ∂_i l(x; θ) = F_i(x) − ∂_i ψ(θ)

and

(2.8)  ∂_i ∂_j l(x; θ) = −∂_i ∂_j ψ(θ),

where l(x; θ) = log f(x; θ). Hence we have Γ^(1)_{ij,k} = −∂_i∂_j ψ · E_θ[∂_k l_θ], which is 0. In other words, we see that (θ^i) is a 1-affine coordinate system, and Θ is 1-flat. In particular, the 1-connection is said to be an exponential connection, and the (−1)-connection is said to be a mixture connection.

We say that the α-connection and the (−α)-connection are mutually dual with respect to the Fisher metric g, since the following formula holds:

  X g(Y, Z) = g(∇^(α)_X Y, Z) + g(Y, ∇^(−α)_X Z),

where X, Y and Z are arbitrary vector fields on M.

Now, Θ is an exponential family, so a mixture coordinate system is given by a potential function, that is,

(2.9)  η_i = ∂ψ/∂θ^i.

Since (θ^i) is a 1-affine coordinate system, (η_i) is a (−1)-affine coordinate system, and they are mutually dual with respect to the Fisher metric. Therefore the statistical manifold has dually orthogonal foliations (Section 3.7 in [4]). The coordinates in (η_i) admit a potential function given by:

(2.10)  λ = Σ_i θ^i η_i − ψ(θ).
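Before specializing to the gamma manifold, a small numerical check of the exponential-family machinery above (illustrative, not part of the original notes): write the exponential distribution as a 1-dimensional exponential family p(x; θ) = exp(θx − ψ(θ)) on x > 0, with natural parameter θ < 0 and potential ψ(θ) = −log(−θ). Then the Fisher information equals the second derivative of the potential, and also the variance of the score:

```python
import math
import random

# potential of the exponential family form of the exponential distribution;
# the rate is lambda = -theta
psi = lambda th: -math.log(-th)

def second_derivative(f, t, h=1e-4):
    """Central-difference second derivative."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)

theta = -0.5
g_from_psi = second_derivative(psi, theta)   # psi''(theta) = 1/theta^2 = 4

# The same quantity as the expected information: the score is
# d/dtheta log p = x - psi'(theta), so g = Var(x), estimated by Monte Carlo.
random.seed(1)
samples = [random.expovariate(-theta) for _ in range(200000)]
mean = sum(samples) / len(samples)
g_mc = sum((x - mean) ** 2 for x in samples) / len(samples)
```

Both routes give the same Fisher information 1/θ² = 4, up to discretization and Monte Carlo error, illustrating that for an exponential family the metric is the Hessian of the potential.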

.4. AFFINE IMMERSIONS 5.4 Affine immersions Let M be an m-dimensional manifold, f an immersion from M to R m+, and ξ a vector field along f. We can x R m+, identify T x R m+ R m+. The pair {f, ξ} is said to be an affine immersion from M to R m+ if, for each point p M, the following formula holds: We call ξ a transversal vector field. T fp) R m+ = f T p M) Span{ξ p }. We denote by D the standard flat affine connection of R m+. Identifying the covariant derivative along f with D, we have the following decompositions: D X f Y = f X Y ) + hx, Y )ξ, D X ξ = f ShX)) + µx)ξ. The induced objects, h, Sh and µ are the induced connection, the affine fundamental form, the affine shape operator and the transversal connection form, respectively. If the affine fundamental form h is positive definite everywhere on M, the immersion f is said to be strictly convex. And if µ = 0, the affine immersion {f, ξ} is said to be equiaffine. It is known that a strictly convex equiaffine immersion induces a statistical manifold. Conversely, the condition when a statistical manifold can be realized in an affine space has been studied. We say that an affine immersion {f, ξ} : Θ R m+ is a graph immersion if the hypersurface is a graph of ψ in R m+ : f : M R m+ : Set i = θ i, ψ ij = ψ θ i θ j. Then we have θ.. θ m D i f j = ψ ij ξ. θ. 0. θ m, ξ = 0 0, ψθ) This implies that the induced connection is flat and θ i ) is a -affine coordinate system. Proposition.4. Let M, h,, ) be a simply connected dually flat space with a global coordinate system and θ) an affine coordinate system of. Suppose that ψ is a θ-potential function. Then M, h, ) can be realized in R m+ by a graph immersion whose potential is ψ..5 Example: Gamma manifold The family of gamma distributions has event space Ω = R + and probability density functions given by S = {fx; γ, µ) γ, µ R + } so here M R + R + and the random variable is x Ω = R + with ) µ µ x µ.) fx; γ, µ) = γ Γµ) e xµ/γ Proposition.5. Let G be the gamma manifold. 
Set β = µ/γ. Then (β, µ) is a natural coordinate system for the 1-connection ∇^(1), and

    ψ(θ) = log Γ(µ) − µ log β

is the corresponding potential function.
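The proposition can be spot-checked symbolically. The following sketch (assuming sympy is available; it is not part of the source) verifies that the Hessian of ψ(θ) = log Γ(µ) − µ log β reproduces the Fisher metric of the gamma manifold in the natural coordinates (β, µ).

```python
# Sketch, not from the source: check that the Hessian of the gamma potential
# psi = log Gamma(mu) - mu log beta gives the gamma Fisher metric.
import sympy as sp

beta, mu = sp.symbols('beta mu', positive=True)
psi = sp.log(sp.gamma(mu)) - mu * sp.log(beta)

# Fisher metric as the Hessian of the potential in (beta, mu)
g = sp.Matrix(2, 2, lambda i, j: sp.diff(psi, [beta, mu][i], [beta, mu][j]))

assert sp.simplify(g[0, 0] - mu / beta**2) == 0        # g_11 = mu/beta^2
assert sp.simplify(g[0, 1] + 1 / beta) == 0            # g_12 = -1/beta
assert sp.simplify(g[1, 1] - sp.polygamma(1, mu)) == 0 # g_22 = trigamma(mu)
```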

CHAPTER 2. INFORMATION GEOMETRY

Proof: Using β = µ/γ, the logarithm of the gamma density can be written as

    log p(x; β, µ) = log( (β^µ x^{µ−1} / Γ(µ)) e^{−βx} )
                   = −log x + (µ log x − βx) − (log Γ(µ) − µ log β).

Hence the set of all gamma distributions is an exponential family. The coordinates (θ^1, θ^2) = (β, µ) form a natural coordinate system, and ψ(θ) = log Γ(µ) − µ log β is its potential function.

Corollary 2.5.2 Since ψ(θ) is a potential function, the Fisher metric is given by the Hessian of ψ, that is, with respect to natural coordinates,

    [g_ij] = [ ∂²ψ(θ)/∂θ^i ∂θ^j ] = [ µ/β²    −1/β
                                      −1/β    ψ′(µ) ],

where ψ(µ) = Γ′(µ)/Γ(µ) denotes the digamma function.

2.5.1 α-connection

For each α ∈ R, the α- (or ∇^(α)-) connection is the torsion-free affine connection with components

    Γ^(α)_{ij,k} = ((1−α)/2) ∂_i ∂_j ∂_k ψ(θ),

where ψ(θ) is the potential function and ∂_i = ∂/∂θ^i. Since the set of gamma distributions is an exponential family, the 1-connection ∇^(1) is flat; in this case (β, µ) is a 1-affine coordinate system, so the 1- and (−1)-connections on the gamma manifold are flat.

Proposition 2.5.3 The functions Γ^(α)_{ij,k} are given by

    Γ^(α)_{11,1} = −(1−α) µ/β³,
    Γ^(α)_{11,2} = Γ^(α)_{12,1} = Γ^(α)_{21,1} = (1−α)/(2β²),
    Γ^(α)_{22,2} = ((1−α)/2) ψ″(µ),

while the other independent components are zero. We have an affine connection ∇^(α) defined by

    ⟨∇^(α)_{∂_i} ∂_j, ∂_k⟩ = Γ^(α)_{ij,k},

so by solving the equations

    Γ^(α)_{ij,k} = Σ_{h=1}^{2} g_{kh} Γ^{h(α)}_{ij},    (k = 1, 2),

we obtain the components of ∇^(α):

Proposition 2.5.4 The components Γ^{k(α)}_{ij} of the ∇^(α)-connection are given by

    Γ^{1(α)}_{11} = −(1−α)(2µψ′(µ) − 1) / (2β(µψ′(µ) − 1)),
    Γ^{1(α)}_{12} = Γ^{1(α)}_{21} = (1−α) ψ′(µ) / (2(µψ′(µ) − 1)),
    Γ^{1(α)}_{22} = (1−α) β ψ″(µ) / (2(µψ′(µ) − 1)),
    Γ^{2(α)}_{11} = −(1−α) µ / (2β²(µψ′(µ) − 1)),
    Γ^{2(α)}_{12} = Γ^{2(α)}_{21} = (1−α) / (2β(µψ′(µ) − 1)),
    Γ^{2(α)}_{22} = (1−α) µ ψ″(µ) / (2(µψ′(µ) − 1)),

while the other independent components are zero.

2.5.2 α-curvatures

Proposition 2.5.5 Direct calculation gives the α-curvature tensor of G:

    R^(α)_{1212} = (1−α²)(ψ′(µ) + µψ″(µ)) / (4β²(µψ′(µ) − 1)),

while the other independent components are zero. By contraction we obtain the α-Ricci tensor

    [R^(α)_{ij}] = ( (1−α²)(ψ′(µ) + µψ″(µ)) / (4(µψ′(µ) − 1)²) ) [ µ/β²    −1/β
                                                                   −1/β    ψ′(µ) ].

In addition, the eigenvalues and the eigenvectors of the α-Ricci tensor are given by

    ( (1−α²)(ψ′(µ) + µψ″(µ)) / (8β²(µψ′(µ) − 1)²) ) ( µ + β²ψ′(µ) ± √(µ² + 4β² − 2β²µψ′(µ) + β⁴ψ′(µ)²) )

and

    ( (β²ψ′(µ) − µ ∓ √(µ² + 4β² − 2β²µψ′(µ) + β⁴ψ′(µ)²)) / (2β) ,  1 ).

α-scalar curvature:

    R^(α) = (1−α²)(ψ′(µ) + µψ″(µ)) / (2(µψ′(µ) − 1)²).

We note that R^(α) → −(1−α²)/2 as µ → 0.
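The scalar-curvature limit can be checked numerically. This is a sketch (assuming scipy is available; not part of the source): ψ′ and ψ″ are the trigamma and tetragamma functions, obtained from scipy.special.polygamma.

```python
# Sketch, not from the source: numerical check of the alpha-scalar curvature
# of the gamma manifold, R = (1 - a^2)(psi' + mu psi'')/(2 (mu psi' - 1)^2),
# which is independent of beta and tends to -(1 - a^2)/2 as mu -> 0.
import numpy as np
from scipy.special import polygamma

def scalar_curvature(alpha, mu):
    p1 = polygamma(1, mu)   # psi'(mu)
    p2 = polygamma(2, mu)   # psi''(mu)
    return (1 - alpha**2) * (p1 + mu * p2) / (2 * (mu * p1 - 1)**2)

assert abs(scalar_curvature(1.0, 2.0)) < 1e-15          # flat for alpha = 1
assert scalar_curvature(0.0, 1.0) < 0                   # negative 0-curvature
assert abs(scalar_curvature(0.0, 1e-6) + 0.5) < 1e-3    # limit as mu -> 0
```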

2.5.3 Gamma manifold geodesics

The Fisher information metric for the gamma manifold is given in (γ, µ) coordinates by the arc length function

    ds² = (µ/γ²) dγ² + ( (Γ′(µ)/Γ(µ))′ − 1/µ ) dµ².

The Levi-Civita connection is that given by setting α = 0 in the α-connections of the previous section; for this case Figure 2.1 shows some examples of geodesic sprays in the vicinities of the points (γ, µ) = (1, 0.5), (1, 1), (1, 2).

2.5.4 Mutually dual foliations

Now, G represents an exponential family of pdfs, so a mixture coordinate system is given by the dual potential function. Since (β, µ) is a 1-affine coordinate system, (η₁, η₂) given by

    η₁ = ∂ψ/∂β = −µ/β,    η₂ = ∂ψ/∂µ = φ(µ) − log β,

where φ(µ) = Γ′(µ)/Γ(µ), is a (−1)-affine coordinate system, and the two systems are mutually dual with respect to the Fisher metric. Therefore the gamma manifold has dually orthogonal foliations, with potential function

    λ = µφ(µ) − µ − log Γ(µ).

2.5.5 Affine immersions for the gamma manifold

The gamma manifold has an affine immersion in R³.

Proposition 2.5.6 Let G be the gamma manifold with the Fisher metric g and the exponential connection ∇^(1). Denote by (β, µ) the natural coordinate system. Then G can be realized in R³ by the graph of a potential function:

    f : G → R³ : (β, µ) ↦ (β, µ, log Γ(µ) − µ log β),    ξ = (0, 0, 1).

The submanifold of exponential distributions (µ = 1) is represented by the curve

    (0, ∞) → R³ : β ↦ (β, 1, −log β),

and a tubular neighbourhood of this curve will contain all immersions for small enough perturbations of exponential distributions.

2.6 Example: Log-gamma distributions

The family of probability density functions for the random variable N ∈ [0, 1] is given by

    g(N; µ, β) = (β/µ)^β N^{β/µ − 1} (log(1/N))^{β−1} / Γ(β),    for µ > 0 and β > 0.

Some of these pdfs with central mean are shown in Figure 2.2.

Figure 2.1: Geodesic sprays in the gamma manifold, radiating from the points with unit mean γ = 1 and µ = 0.5, 1, 2, increasing vertically. The case µ = 1 corresponds to an exponential distribution from an underlying Poisson process; µ < 1 corresponds to clustering, and µ increasing above 1 corresponds to the opposite process, dispersion leading to greater uniformity.
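Geodesic sprays like those in the figure can be generated numerically. The sketch below (assuming sympy and scipy are available; not part of the source) builds the Christoffel symbols of the (γ, µ) metric symbolically and integrates the geodesic equations; along a geodesic the metric speed g(ẋ, ẋ) must be conserved, which serves as a consistency check.

```python
# Sketch, not from the source: integrate one geodesic of the gamma-manifold
# metric ds^2 = (mu/gamma^2) dgamma^2 + (trigamma(mu) - 1/mu) dmu^2.
import numpy as np
import sympy as sp
from scipy.integrate import solve_ivp

gma, mu = sp.symbols('gma mu', positive=True)
x = [gma, mu]
g = sp.Matrix([[mu / gma**2, 0], [0, sp.polygamma(1, mu) - 1 / mu]])
ginv = g.inv()

# Christoffel symbols Gamma[k][i][j] of the Levi-Civita connection
Gamma = [[[sum(ginv[k, h] * (sp.diff(g[j, h], x[i]) + sp.diff(g[i, h], x[j])
                             - sp.diff(g[i, j], x[h])) for h in range(2)) / 2
           for j in range(2)] for i in range(2)] for k in range(2)]
Gamma_f = sp.lambdify((gma, mu), Gamma, modules='scipy')
g_f = sp.lambdify((gma, mu), g, modules='scipy')

def rhs(t, s):
    q, v = s[:2], s[2:]
    G = np.array(Gamma_f(*q), dtype=float)
    return np.concatenate([v, -np.einsum('kij,i,j->k', G, v, v)])

def speed(s):
    return float(s[2:] @ np.array(g_f(*s[:2]), dtype=float) @ s[2:])

# geodesic from (gamma, mu) = (1, 1) with initial velocity (0.3, 0.2)
sol = solve_ivp(rhs, (0.0, 1.0), [1.0, 1.0, 0.3, 0.2], rtol=1e-10, atol=1e-12)
assert sol.success
assert abs(speed(sol.y[:, -1]) - speed(sol.y[:, 0])) < 1e-6  # speed conserved
```

Radiating such geodesics over a fan of initial velocities from each base point reproduces a spray.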

Figure 2.2: The log-gamma family of densities with central mean ⟨N⟩ = 1/2, as a surface and as a contour plot.

Proposition 2.6.1 The log-gamma family with its information metric determines a Riemannian 2-manifold L with the following properties:

it contains the uniform distribution;
it contains approximations to truncated Gaussian distributions;
it is an isometric isomorph of the manifold G of gamma distributions.

Proof By integration, it is easily checked that the family consists of probability density functions for the random variable N ∈ [0, 1]. The limiting densities are given by

    lim_{β→1} g(N; µ, β) = g(N; µ, 1) = (1/µ) N^{(1−µ)/µ},
    lim_{µ→1} g(N; µ, 1) = g(N; 1, 1) = 1.

The mean ⟨N⟩, standard deviation σ_N, and coefficient of variation cv_N of N are given by

    ⟨N⟩ = (β/(β+µ))^β,
    σ_N = √( (β/(β+2µ))^β − (β/(β+µ))^{2β} ),
    cv_N = σ_N/⟨N⟩ = √( ( (β+µ)² / (β(β+2µ)) )^β − 1 ).

We can obtain the family of densities having central mean in [0, 1] by solving ⟨N⟩ = 1/2, which corresponds to the locus µ = β(2^{1/β} − 1); some of these are shown in Figure 2.2 and Figure 2.3. Evidently, the distributions with central mean and large β provide approximations to Gaussian distributions truncated on [0, 1].
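The moment formulas and the central-mean locus can be verified by direct quadrature; a sketch assuming scipy is available (not part of the source):

```python
# Sketch, not from the source: integrate the log-gamma density numerically
# and compare with <N> = (beta/(beta+mu))^beta; the central-mean locus
# <N> = 1/2 is mu = beta*(2**(1/beta) - 1).
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def g(N, mu, beta):
    # log-gamma density (beta/mu)^beta N^(beta/mu - 1) (log 1/N)^(beta-1)/Gamma(beta)
    return np.exp(beta * (np.log(beta) - np.log(mu))
                  + (beta / mu - 1) * np.log(N)
                  + (beta - 1) * np.log(np.log(1 / N)) - gammaln(beta))

mu, beta = 0.7, 2.5
mass, _ = quad(g, 0, 1, args=(mu, beta))
mean, _ = quad(lambda N: N * g(N, mu, beta), 0, 1)
assert abs(mass - 1) < 1e-6                            # g is a density
assert abs(mean - (beta / (beta + mu))**beta) < 1e-6   # <N> formula

mu_c = beta * (2**(1 / beta) - 1)                      # central-mean locus
mean_c, _ = quad(lambda N: N * g(N, mu_c, beta), 0, 1)
assert abs(mean_c - 0.5) < 1e-6
```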

Figure 2.3: Examples from the log-gamma family of probability densities with central mean ⟨N⟩ = 1/2. Left: β = 1, 1.2, 1.4, 1.6, 1.8. Right: β = 2, 4, 6, 8, 10.

For the log-gamma densities [, 8], the Fisher information metric [3] on the parameter space Θ = {(µ, β) ∈ (0, ∞) × [1, ∞)} is given by its arc length function

    ds²_L = Σ_{ij} g_ij dx^i dx^j = (β/µ²) dµ² + ( (Γ′(β)/Γ(β))′ − 1/β ) dβ².

In fact, the log-gamma family arises from the gamma family

    f(x; µ, β) = (x^{β−1}/Γ(β)) (β/µ)^β e^{−xβ/µ}

for the non-negative random variable x = log(1/N). The gamma family has the same information metric, so the identity map on the space of coordinates (µ, β) is an isometry of Riemannian manifolds.

2.7 Example: Gaussian distributions

The family of univariate normal or Gaussian distributions has event space Ω = R and probability density functions given by

    N = {N(µ, σ²)} = {p(x; µ, σ) | µ ∈ R, σ ∈ R⁺}

with mean µ and variance σ². So here N ≅ R × R⁺ is the upper half-plane, and the random variable is x ∈ Ω = R with

    p(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}.

The mean µ and standard deviation σ are frequently used as a local coordinate system, ξ = (ξ₁, ξ₂) = (µ, σ).

Shannon's information-theoretic entropy is given by

    S_N(µ, σ) = −∫_R log(p(t; µ, σ)) p(t; µ, σ) dt = (1/2)(1 + log(2π)) + log σ.

At unit variance the entropy is S_N = (1/2)(1 + log(2π)).

2.7.1 Natural coordinate system and potential function

Proposition 2.7.1 Let N be the normal manifold. Set θ¹ = µ/σ² and θ² = −1/(2σ²). Then (θ¹, θ²) is a natural coordinate system and

    ψ(θ) = −(θ¹)²/(4θ²) + (1/2) log(−π/θ²) = µ²/(2σ²) + log(√(2π) σ)

is the corresponding potential function.

Proof: Set θ¹ = µ/σ² and θ² = −1/(2σ²). Then the logarithm of the univariate normal density can be written as

    log p(x; θ¹, θ²) = log( (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)} )
                     = θ¹ x + θ² x² − ( −(θ¹)²/(4θ²) + (1/2) log(−π/θ²) ).

Hence the set of all univariate normal distributions is an exponential family. The coordinates (θ¹, θ²) form a natural coordinate system, and ψ(θ) = −(θ¹)²/(4θ²) + (1/2) log(−π/θ²) = µ²/(2σ²) + log(√(2π)σ) is its potential function.

2.7.2 Fisher information metric

Proposition 2.7.2 The Fisher metric with respect to the natural coordinates (θ¹, θ²) is given by

    [g_ij] = [ −1/(2θ²)       θ¹/(2(θ²)²)
               θ¹/(2(θ²)²)    (θ² − (θ¹)²)/(2(θ²)³) ]  =  [ σ²      2µσ²
                                                            2µσ²    2σ²(2µ² + σ²) ].

Proof: Since ψ is a potential function, the Fisher metric is given by the Hessian of ψ, that is,

    g_ij = ∂²ψ/∂θ^i ∂θ^j.

Then we obtain the Fisher metric by a straightforward calculation.

2.7.3 Mutually dual foliations

Since N represents an exponential family, a mixture coordinate system is given by the dual potential function. We have

    η₁ = ∂ψ/∂θ¹ = −θ¹/(2θ²) = µ,
    η₂ = ∂ψ/∂θ² = (θ¹)²/(4(θ²)²) − 1/(2θ²) = µ² + σ².
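The Hessian computation in Proposition 2.7.2 can be checked symbolically; a sketch assuming sympy is available (not part of the source):

```python
# Sketch, not from the source: the Hessian of the Gaussian potential in
# natural coordinates, expressed in (mu, sigma), gives the Fisher metric.
import sympy as sp

t1, t2 = sp.symbols('theta1 theta2')   # theta1 = mu/sigma^2, theta2 = -1/(2 sigma^2)
mu, sigma = sp.symbols('mu sigma', positive=True)
psi = -t1**2 / (4 * t2) + sp.log(sp.pi / (-t2)) / 2

g = sp.hessian(psi, (t1, t2))
g_ms = g.subs({t1: mu / sigma**2, t2: -1 / (2 * sigma**2)})

assert sp.simplify(g_ms[0, 0] - sigma**2) == 0
assert sp.simplify(g_ms[0, 1] - 2 * mu * sigma**2) == 0
assert sp.simplify(g_ms[1, 1] - 2 * sigma**2 * (2 * mu**2 + sigma**2)) == 0
```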

Since (θ¹, θ²) is a 1-affine coordinate system, (η₁, η₂) is a (−1)-affine coordinate system, and they are mutually dual with respect to the Fisher metric. Therefore the normal manifold has dually orthogonal foliations. The coordinates (η₁, η₂) admit the potential function

    λ = −(1/2)(1 + log(−π/θ²)) = −(1/2)(1 + log(2π) + 2 log σ).

2.7.4 Affine immersions for the Gaussian manifold

We show that the normal manifold can be realized in Euclidean R³ by an affine immersion.

Proposition 2.7.3 Let N be the normal manifold with the Fisher metric g and the exponential connection ∇^(1). Denote by (θ¹, θ²) the natural coordinate system. Then N can be realized in R³ by the graph of a potential function, namely, by the affine immersion {f, ξ}:

    f : N → R³ : (θ¹, θ²) ↦ (θ¹, θ², ψ(θ)),    ξ = (0, 0, 1),

where ψ is the potential function ψ = −(θ¹)²/(4θ²) + (1/2) log(−π/θ²).

The submanifold of univariate normal distributions with zero mean (i.e. θ¹ = 0) is represented by the curve

    (−∞, 0) → R³ : θ² ↦ (0, θ², (1/2) log(−π/θ²)).

In addition, the submanifold of univariate normal distributions with unit variance (i.e. θ² = −1/2) is represented by the curve

    R → R³ : θ¹ ↦ (θ¹, −1/2, (θ¹)²/2 + (1/2) log(2π)).

2.8 Example: Bivariate Gaussian 5-manifold

Here we collect the results from Yoshiharu Sato, Kazuaki Sugawa and Michiaki Kawaguchi [66], for comparison with bivariate gamma manifolds. The probability density function for the two-dimensional Gaussian distribution has the form

    f(x, y) = (1 / (2π √(σ₁₁σ₂₂ − σ₁₂²))) exp( −( σ₂₂(x−µ₁)² − 2σ₁₂(x−µ₁)(y−µ₂) + σ₁₁(y−µ₂)² ) / (2(σ₁₁σ₂₂ − σ₁₂²)) ),

where −∞ < x, y < ∞, −∞ < µ₁, µ₂ < ∞ and 0 < σ₁₁, σ₂₂ < ∞. This contains the five parameters (µ₁, µ₂, σ₁₁, σ₁₂, σ₂₂) ∈ R⁵. It is easy to show that the marginal distributions are Gaussian with parameters (µ₁, σ₁₁) and (µ₂, σ₂₂); the covariance is σ₁₂.

2.8.1 Fisher information metric

The information geometry of this family has been studied by Sato et al. [66]; the metric tensor takes the following form:

    G = [g_ij] = [ σ₂₂/Δ     −σ₁₂/Δ    0               0                     0
                   −σ₁₂/Δ    σ₁₁/Δ     0               0                     0
                   0         0         σ₂₂²/(2Δ²)      −σ₁₂σ₂₂/Δ²            σ₁₂²/(2Δ²)
                   0         0         −σ₁₂σ₂₂/Δ²      (σ₁₁σ₂₂ + σ₁₂²)/Δ²    −σ₁₁σ₁₂/Δ²
                   0         0         σ₁₂²/(2Δ²)      −σ₁₁σ₁₂/Δ²            σ₁₁²/(2Δ²) ],

where Δ is the determinant

    Δ = σ₁₁σ₂₂ − σ₁₂².

The inverse [g^ij] of the fundamental tensor [g_ij], defined by the relation g^ij g_jk = δ^i_k, is given by

    G⁻¹ = [g^ij] = [ σ₁₁    σ₁₂    0           0               0
                     σ₁₂    σ₂₂    0           0               0
                     0      0      2σ₁₁²       2σ₁₁σ₁₂         2σ₁₂²
                     0      0      2σ₁₁σ₁₂     σ₁₁σ₂₂ + σ₁₂²   2σ₁₂σ₂₂
                     0      0      2σ₁₂²       2σ₁₂σ₂₂         2σ₂₂² ].

2.9 Example: Freund distributions

Freund [35] introduced a bivariate exponential mixture distribution arising from the following reliability considerations. Suppose that an instrument has two components A and B with lifetimes X and Y respectively, having density functions (when both components are in operation)

    f_X(x) = α₁ e^{−α₁x},    f_Y(y) = α₂ e^{−α₂y}    (α₁, α₂ > 0; x, y > 0).

Then X and Y are dependent in that a failure of either component changes the parameter of the life distribution of the other component. Thus when A fails, the parameter for Y becomes β₂; when B fails, the parameter for X becomes β₁. There is no other dependence. Hence the joint density function of X and Y is

    f(x, y) = α₁β₂ e^{−β₂y − (α₁+α₂−β₂)x}    for 0 < x < y,
              α₂β₁ e^{−β₁x − (α₁+α₂−β₁)y}    for 0 < y < x,

where α_i, β_i > 0 (i = 1, 2). Provided that α₁ + α₂ ≠ β₁, the marginal density function of X is

    f_X(x) = ( α₂β₁/(α₁+α₂−β₁) ) e^{−β₁x} + ( (α₁−β₁)(α₁+α₂)/(α₁+α₂−β₁) ) e^{−(α₁+α₂)x},    x ≥ 0,

and provided that α₁ + α₂ ≠ β₂, the marginal density function of Y is

    f_Y(y) = ( α₁β₂/(α₁+α₂−β₂) ) e^{−β₂y} + ( (α₂−β₂)(α₁+α₂)/(α₁+α₂−β₂) ) e^{−(α₁+α₂)y},    y ≥ 0.

We can see that the marginal density functions are not exponential but rather mixtures of exponential distributions if α_i > β_i; otherwise, they are weighted averages. For this reason, this system

of distributions should be termed bivariate mixture exponential distributions rather than simply bivariate exponential distributions. The marginal density functions f_X(x) and f_Y(y) are exponential distributions only in the special case α_i = β_i (i = 1, 2).

Freund discussed the statistics of the special case α₁ + α₂ = β₁ = β₂, and obtained the joint density function

    f(x, y) = α₁(α₁+α₂) e^{−(α₁+α₂)y}    for 0 < x < y,
              α₂(α₁+α₂) e^{−(α₁+α₂)x}    for 0 < y < x,

with marginal density functions

    f_X(x) = (α₁ + α₂(α₁+α₂)x) e^{−(α₁+α₂)x},    x ≥ 0,
    f_Y(y) = (α₂ + α₁(α₁+α₂)y) e^{−(α₁+α₂)y},    y ≥ 0.

The covariance and correlation coefficient of X and Y are

    Cov(X, Y) = (β₁β₂ − α₁α₂) / (β₁β₂ (α₁+α₂)²),
    ρ(X, Y) = (β₁β₂ − α₁α₂) / √( (α₂² + 2α₁α₂ + β₁²)(α₁² + 2α₁α₂ + β₂²) ).

Note that −1/3 < ρ(X, Y) < 1. The correlation coefficient ρ(X, Y) → 1 when β₁, β₂ → ∞, and ρ(X, Y) → −1/3 when α₁ = α₂ and β₁, β₂ → 0. In many applications β_i > α_i (i = 1, 2) (i.e., lifetime tends to be shorter when the other component is out of action); in such cases the correlation is positive.

2.9.1 Fisher information metric

Proposition 2.9.1 Let F be the set of Freund bivariate mixture exponential distributions. With coordinates (x₁, x₂, x₃, x₄) = (α₁, β₁, α₂, β₂) it becomes a 4-manifold with Fisher information metric

    g_ij = −∫₀^∞ ∫₀^∞ ( ∂² log f(x, y)/∂x_i ∂x_j ) f(x, y) dx dy,

given by

    [g_ij] = [ 1/(α₁(α₁+α₂))    0                   0                 0
               0                α₂/(β₁²(α₁+α₂))     0                 0
               0                0                   1/(α₂(α₁+α₂))     0
               0                0                   0                 α₁/(β₂²(α₁+α₂)) ].

The inverse [g^ij] of [g_ij] is given by

    [g^ij] = [ α₁(α₁+α₂)    0                0             0
               0            β₁²(α₁+α₂)/α₂    0             0
               0            0                α₂(α₁+α₂)     0
               0            0                0             β₂²(α₁+α₂)/α₁ ].
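The reliability construction described above gives a direct way to sample from the Freund distribution, which in turn lets us check the covariance formula by Monte Carlo. This is a sketch assuming numpy is available (not part of the source): both components run at rates α₁, α₂ until the first failure, after which the survivor's residual lifetime is exponential with the switched rate.

```python
# Sketch, not from the source: sample the Freund model from its reliability
# construction and check Cov(X, Y) = (b1 b2 - a1 a2)/(b1 b2 (a1 + a2)^2).
import numpy as np

rng = np.random.default_rng(0)
a1, a2, b1, b2 = 1.0, 2.0, 3.0, 4.0
n = 200_000

xa = rng.exponential(1 / a1, n)    # candidate failure time of A
yb = rng.exponential(1 / a2, n)    # candidate failure time of B
first_A = xa < yb
t = np.where(first_A, xa, yb)      # time of first failure
# survivor's rate switches: to b2 if A failed first, to b1 if B failed first
X = np.where(first_A, t, t + rng.exponential(1 / b1, n))
Y = np.where(first_A, t + rng.exponential(1 / b2, n), t)

cov_exact = (b1 * b2 - a1 * a2) / (b1 * b2 * (a1 + a2) ** 2)
rho_exact = (b1 * b2 - a1 * a2) / np.sqrt(
    (a2**2 + 2 * a1 * a2 + b1**2) * (a1**2 + 2 * a1 * a2 + b2**2))

assert abs(np.cov(X, Y)[0, 1] - cov_exact) < 0.01
assert abs(np.corrcoef(X, Y)[0, 1] - rho_exact) < 0.02
```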

Figure 2.4: Part of the family of McKay bivariate gamma pdfs. Observe that the McKay pdf is zero outside the octant 0 < x < y < ∞. Here the correlation coefficient has been set to ρ_xy = 0.6, with α = 5.

2.10 Example: McKay bivariate gamma 3-manifold

The information geometry of the 3-manifold of McKay bivariate gamma distributions can provide a metrization of departures from randomness and departures from independence for bivariate processes. The curvature objects are derived, including those on three submanifolds. As in the case of bivariate normal manifolds, we have negative scalar curvature, but here it is not constant and we show how it depends on correlation. These results have potential applications, for example, in the characterization of stochastic materials.

Fisher information metric

The classical family of McKay bivariate gamma distributions M is defined on 0 < x < y < ∞ with parameters α₁, σ₁₂, α₂ > 0 and pdfs

    f(x, y; α₁, σ₁₂, α₂) = ( (α₁/σ₁₂)^{(α₁+α₂)/2} x^{α₁−1} (y−x)^{α₂−1} e^{−√(α₁/σ₁₂) y} ) / ( Γ(α₁) Γ(α₂) ).

Here σ₁₂ is the covariance of X and Y. One way to view this is that f(x, y) is the probability density for the two random variables X and Y = X + Z, where X and Z both have gamma distributions. The correlation coefficient and marginal functions of X and Y are given by

    ρ(X, Y) = √( α₁/(α₁+α₂) ) > 0,
    f_X(x) = (α₁/σ₁₂)^{α₁/2} x^{α₁−1} e^{−√(α₁/σ₁₂) x} / Γ(α₁),    x > 0,
    f_Y(y) = (α₁/σ₁₂)^{(α₁+α₂)/2} y^{α₁+α₂−1} e^{−√(α₁/σ₁₂) y} / Γ(α₁+α₂),    y > 0.

The marginal distributions of X and Y are gamma with shape parameters α₁ and α₁ + α₂, respectively; note that it is not possible to choose parameters such that both marginal functions are exponential.

Proposition 2.10.1 Let M be the family of McKay bivariate gamma distributions; then (α₁, σ₁₂, α₂) is a local coordinate system, and M becomes a 3-manifold with Fisher information metric

    [g_ij] = [ (α₂ − 3α₁)/(4α₁²) + ψ′(α₁)    (α₁ − α₂)/(4α₁σ₁₂)    −1/(2α₁)
               (α₁ − α₂)/(4α₁σ₁₂)            (α₁ + α₂)/(4σ₁₂²)     1/(2σ₁₂)
               −1/(2α₁)                      1/(2σ₁₂)              ψ′(α₂) ].

The inverse [g^ij] of [g_ij] is given by

    g^11 = ( (α₁+α₂) ψ′(α₂) − 1 ) / D,
    g^12 = g^21 = −σ₁₂ ( 1 + (α₁−α₂) ψ′(α₂) ) / (α₁ D),
    g^13 = g^31 = 1 / D,
    g^22 = σ₁₂² ( (4α₁²ψ′(α₁) − 3α₁ + α₂) ψ′(α₂) − 1 ) / (α₁² D),
    g^23 = g^32 = −σ₁₂ ( 2α₁ψ′(α₁) − 1 ) / (α₁ D),
    g^33 = ( (α₁+α₂) ψ′(α₁) − 1 ) / D,

where D = (α₁+α₂) ψ′(α₁) ψ′(α₂) − ψ′(α₁) − ψ′(α₂), and we have abbreviated ψ′(α_i) = (Γ′(α_i)/Γ(α_i))′.
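The metric and its stated inverse can be checked against each other symbolically; a sketch assuming sympy is available (not part of the source), with ψ′(α₁), ψ′(α₂) treated as opaque symbols p1, p2:

```python
# Sketch, not from the source: verify that the listed [g^ij] really inverts
# the McKay Fisher metric [g_ij], with p_i standing for psi'(alpha_i).
import sympy as sp

a1, a2, s = sp.symbols('alpha1 alpha2 sigma12', positive=True)
p1, p2 = sp.symbols('p1 p2')

g = sp.Matrix([
    [(a2 - 3*a1)/(4*a1**2) + p1, (a1 - a2)/(4*a1*s), -1/(2*a1)],
    [(a1 - a2)/(4*a1*s),         (a1 + a2)/(4*s**2),  1/(2*s)],
    [-1/(2*a1),                   1/(2*s),             p2],
])

D = (a1 + a2)*p1*p2 - p1 - p2
ginv = sp.Matrix([
    [((a1 + a2)*p2 - 1)/D, -s*(1 + (a1 - a2)*p2)/(a1*D), 1/D],
    [-s*(1 + (a1 - a2)*p2)/(a1*D),
     s**2*((4*a1**2*p1 - 3*a1 + a2)*p2 - 1)/(a1**2*D),
     -s*(2*a1*p1 - 1)/(a1*D)],
    [1/D, -s*(2*a1*p1 - 1)/(a1*D), ((a1 + a2)*p1 - 1)/D],
])

assert sp.simplify(g * ginv - sp.eye(3)) == sp.zeros(3, 3)
```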


Chapter 3

Universal connections and curvature

It is common to have to consider a number of linear connections on a given statistical manifold, so it is important to know the corresponding universal connection and curvature; then all linear connections and their curvatures are pullbacks. An important class of statistical manifolds is that arising from the so-called exponential families [3, 4], and one particular family is that of gamma distributions, which we showed recently [7] to have important uniqueness properties in stochastic processes. Here we describe the system of all linear connections on the manifold of exponential families, using the tangent bundle or the frame bundle to give the system space. Moreover, we provide formulae for the universal connections and curvatures and give an explicit example for the manifold of gamma distributions.

3.1 Systems of connections and universal objects

The concept of a system (or structure) of connections was introduced by Mangiarotti and Modugno [50, 57], who were concerned with finite-dimensional bundle representations of the space of all connections on a fibred manifold. On each system of connections there exists a unique universal connection, of which every connection in the family of connections is a pullback. A similar relation holds between the corresponding universal curvature and the curvatures of the connections of the system.

Definition 3.1.1 A system of connections on a fibred manifold p : E → M is a fibred manifold p_c : C → M together with a first-jet-valued fibred morphism ξ : C ×_M E → JE over M, such that each section Γ̄ : M → C determines a unique connection Γ = ξ ∘ (Γ̄ ∘ p, I_E) on E. Then C is the space of connections of the system.

In the sequel we are interested in the system of linear connections on a Riemannian manifold. The system of all linear connections is the subject of studies in e.g. [4, 5, 7, 4].

Theorem 3.1.2 ([50, 57]) Let (C, ξ) be a system of connections on a fibred manifold p : E → M.
Then there is a unique connection form

    Λ : C ×_M E → J(C ×_M E)

on the fibred manifold π : C ×_M E → C, which has the coordinate expression

    Λ = dx^λ ⊗ ∂_λ + dc^a ⊗ ∂_a + ξ^i_λ dx^λ ⊗ ∂_i.

This Λ is called the universal connection because it describes all the connections of the system. Explicitly, each Γ̄ ∈ Sec(C/M) gives an injection (Γ̄ ∘ p, I_E) of E into C ×_M E, which is a section of π, and the connection Γ coincides with the restriction of Λ to this section:

    Λ|_{(Γ̄∘p, I_E)E} = Γ.

A similar relation holds between its curvature Ω, called the universal curvature, and the curvatures of the connections of the system:

    Ω = (1/2)[Λ, Λ] = d_Λ Λ : C ×_M E → ∧²(T*C) ⊗_E V(E).

So the universal curvature Ω has the coordinate expression

    Ω = ( ξ^j_λ ∂_j ξ^i_µ dx^λ ∧ dx^µ + ∂_a ξ^i_µ dc^a ∧ dx^µ ) ⊗ ∂_i.

3.2 Exponential family of probability density functions on R

From the definition of an exponential family, and putting ∂_i = ∂/∂x^i, we may obtain

(3.1)    ∂_i l(y; x) = F_i(y) − ∂_i ψ(x)

and

(3.2)    ∂_i ∂_j l(y; x) = −∂_i ∂_j ψ(x),

where l(y; x) = log p(y; x). Hence the Fisher information metric [3, 4] on the n-dimensional space of parameters Θ ⊂ R^n has coordinates

(3.3)    g = [g_ij] = −E[∂_i ∂_j l(y; x)] = ∂_i ∂_j ψ(x) = ψ_ij(x).

The Levi-Civita connection with respect to g is given by

(3.4)    Γ^k_ij(x) = (1/2) Σ_{h=1}^n g^{kh} ( ∂_i g_jh + ∂_j g_ih − ∂_h g_ij )
                   = (1/2) Σ_{h=1}^n g^{kh} ∂_i ∂_j ∂_h ψ(x) = (1/2) Σ_{h=1}^n ψ^{kh}(x) ψ_ijh(x),

where (ψ^{hk}(x)) represents the dual metric to (ψ_{hk}(x)).

3.3 Universal connection and curvature for exponential families

In this section we provide explicit formulae for the universal connection and curvature for an arbitrary statistical n-dimensional exponential manifold. We do this in the two available ways: using the tangent bundle and using the frame bundle systems of all linear connections.
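The simplification in (3.4), that the general Christoffel formula collapses to (1/2) ψ^{kh} ψ_{ijh} because all three metric-derivative terms coincide for a Hessian metric, can be checked symbolically. A sketch assuming sympy is available (not part of the source), using the Gaussian potential as the test case:

```python
# Sketch, not from the source: for a Hessian metric g_ij = psi_ij, the general
# Christoffel formula equals (1/2) psi^{kh} psi_{ijh}; checked here for the
# Gaussian potential psi = -t1^2/(4 t2) + log(pi/(-t2))/2.
import sympy as sp

t = sp.symbols('theta1 theta2')
psi = -t[0]**2 / (4 * t[1]) + sp.log(sp.pi / (-t[1])) / 2

g = sp.hessian(psi, t)
ginv = g.inv()

for i in range(2):
    for j in range(2):
        for k in range(2):
            general = sum(ginv[k, h] * (sp.diff(g[j, h], t[i])
                          + sp.diff(g[i, h], t[j]) - sp.diff(g[i, j], t[h]))
                          for h in range(2)) / 2
            expfam = sum(ginv[k, h] * sp.diff(psi, t[i], t[j], t[h])
                         for h in range(2)) / 2
            assert sp.simplify(general - expfam) == 0
```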

3.3.1 Tangent bundle system: C_T ⊂ T*M ⊗ J(TM)

The system of all linear connections on a manifold M has a representation on the tangent bundle E = TM → M, with system space

    C_T = { α ⊗ jΓ ∈ T*M ⊗_M J(TM) | jΓ : TM → T(TM) projects onto I_TM }.

Here we view I_TM as a section of T*M ⊗ TM, which is a subbundle of T*M ⊗ T(TM), with local expression dx^λ ⊗ ∂_λ. The fibred morphism for this system is given by

(3.5)    ξ_T : C_T ×_M TM → J(TM) ⊂ T*M ⊗_TM T(TM),
(3.6)    (α ⊗ jΓ, ν) ↦ α(ν) jΓ.

In coordinates,

(3.7)    ξ_T = dx^λ ⊗ (∂_λ − γ^i_λ ∂_i) = dx^λ ⊗ (∂_λ − y^j Γ^i_{jλ} ∂_i)
             = dx^λ ⊗ ( ∂_λ − y^j ( (1/2) Σ_{h=1}^n ψ^{ih} ψ_{jλh} ) ∂_i ).

Each section of C_T → M, such as Γ̄ : M → C_T : (x^λ) ↦ (x^λ, γ^λ_{µϑ}), determines the unique linear connection Γ = ξ_T ∘ (Γ̄ ∘ π_T, I_TM) with Christoffel symbols Γ^λ_{µϑ}. On the fibred manifold π : C_T ×_M TM → C_T, the universal connection is given by

    Λ_T : C_T ×_M TM → J(C_T ×_M TM) ⊂ T*C_T ⊗ T(C_T ×_M TM),
    (x^λ, v^λ_{µν}, y^λ) ↦ [ (X^λ, V^λ_{µν}) ↦ (X^λ, V^λ_{µν}, −Y^µ V^λ_{µν} X^ν) ];

briefly,

(3.8)    Λ_T = dx^λ ⊗ ∂_λ + dv^a ⊗ ∂_a − y^µ v^i_{µν} dx^ν ⊗ ∂_i
             = dx^λ ⊗ ∂_λ + dv^a ⊗ ∂_a − y^µ ( (1/2) Σ_{h=1}^n ψ^{ih} ψ_{µνh} ) dx^ν ⊗ ∂_i.

Explicitly, each Γ̄ ∈ Sec(C_T/M) gives an injection (Γ̄ ∘ π_T, I_TM) of TM into C_T ×_M TM, which is a section of π, and Γ coincides with the restriction of Λ_T to this section:

    Λ_T|_{(Γ̄∘π_T, I_TM)TM} = Γ,

and the universal curvature of the connection Λ_T is given by

    Ω_T = d_{Λ_T} Λ_T : C_T ×_M TM → ∧²(T*C_T) ⊗_TM V(TM).

So here the universal curvature Ω_T has the coordinate expression

    Ω_T = ( y^k v^j_{kλ} ∂_j (y^m v^i_{mµ}) dx^λ ∧ dx^µ + ∂_a (y^m v^i_{mµ}) dv^a ∧ dx^µ ) ⊗ ∂_i
        = ( y^k ( (1/2) Σ_{h=1}^n ψ^{jh} ψ_{kλh} ) ∂_j ( y^m ( (1/2) Σ_{h=1}^n ψ^{ih} ψ_{mµh} ) ) dx^λ ∧ dx^µ
            + ∂_a ( y^m ( (1/2) Σ_{h=1}^n ψ^{ih} ψ_{mµh} ) ) dv^a ∧ dx^µ ) ⊗ ∂_i.

3.3.2 Frame bundle system: C_F = J(FM)/G

A linear connection is also a principal (i.e. group-invariant) connection on the principal bundle of frames FM, with

    E = FM → M = FM/G

consisting of linear frames (ordered bases for tangent spaces), with structure group the general linear group G = Gl(n). Here the system space is

    C_F = J(FM)/G ⊂ T*M ⊗ T(FM)/G,

consisting of G-invariant jets. The system morphism is

    ξ_F : C_F ×_M FM → J(FM) ⊂ T*M ⊗_FM T(FM),
    ([js_x], b) ↦ [T_x M → T_b(FM)].

In coordinates,

(3.9)    ξ_F = dx^λ ⊗ ( ∂_λ + X^µ (∂_µ s^λ_ν) ∂^ν_λ )
             = dx^λ ⊗ ( ∂_λ − X^µ Γ^λ_{µν} ∂^ν_λ )
             = dx^λ ⊗ ( ∂_λ − X^µ ( (1/2) Σ_{h=1}^n ψ^{λh} ψ_{µνh} ) ∂^ν_λ ),

where ∂^ν_λ = ∂/∂b^λ_ν is the natural basis on the vertical fibre of T_b(FM) induced by the coordinates (b^λ_ν) on FM.

Each section of C_F → M that is projectable onto I_TM, such as Γ̂ : M → C_F : (x^λ) ↦ (x^λ, [jΓ_x]) with Γ^λ_{µν} = −∂_µ s^λ_ν, determines the unique linear connection Γ = ξ_F ∘ (Γ̂ ∘ π_F, I_FM) with Christoffel symbols Γ^λ_{µν}. On the principal G-bundle π : C_F ×_M FM → C_F, the universal connection is given by

    Λ_F : C_F ×_M FM → J(C_F ×_M FM) ⊂ T*C_F ⊗ T(C_F ×_M FM),
    (x^λ, v^λ_{µν}, b^µ_ν) ↦ [ (X^λ, Y^λ_{µν}) ↦ (X^λ, Y^λ_{µν}, −b^µ_ν v^λ_{µθ} X^θ) ];

briefly,

(3.10)    Λ_F = dx^λ ⊗ ∂_λ + dv^a ⊗ ∂_a − b^µ_ν v^λ_{µθ} dx^θ ⊗ ∂^ν_λ
              = dx^λ ⊗ ∂_λ + dv^a ⊗ ∂_a − b^µ_ν ( (1/2) Σ_{h=1}^n ψ^{λh} ψ_{µθh} ) dx^θ ⊗ ∂^ν_λ.

Explicitly, each Γ̂ ∈ Sec(C_F/M) gives an injection (Γ̂ ∘ π_F, I_FM) of FM into C_F ×_M FM, which is a section of π, and Γ coincides with the restriction of Λ_F to this section:

    Λ_F|_{(Γ̂∘π_F, I_FM)FM} = Γ,

and the universal curvature of the connection Λ_F is given by

    Ω_F = d_{Λ_F} Λ_F : C_F ×_M FM → ∧²(T*C_F) ⊗_FM V(FM).

So here the universal curvature form Ω_F has the coordinate expression

    Ω_F = ( b^k_ν v^β_{kλ} ∂^ν_β (b^m_ω v^α_{mµ}) dx^λ ∧ dx^µ + ∂_a (b^m_ω v^α_{mµ}) dv^a ∧ dx^µ ) ⊗ ∂^ω_α
        = ( b^k_ν ( (1/2) Σ_{h=1}^n ψ^{βh} ψ_{kλh} ) ∂^ν_β ( b^m_ω ( (1/2) Σ_{h=1}^n ψ^{αh} ψ_{mµh} ) ) dx^λ ∧ dx^µ
            + ∂_a ( b^m_ω ( (1/2) Σ_{h=1}^n ψ^{αh} ψ_{mµh} ) ) dv^a ∧ dx^µ ) ⊗ ∂^ω_α.

3.4 Universal connection and curvature on the gamma manifold

Here we give explicit forms for the system space and its universal connection and curvature on the 2-dimensional statistical manifold of gamma distributions. Let G be the family of gamma probability density functions given by

(3.11)    { p(x; β, µ) = (β^µ x^{µ−1} / Γ(µ)) e^{−βx} | β, µ ∈ R⁺ },    x ∈ R⁺.

So the space of parameters is topologically R⁺ × R⁺. It is an exponential family, and it includes as a special case (µ = 1) the exponential distribution itself, which complements the Poisson process on a line. Since the set of all gamma distributions is an exponential family with natural coordinate system (β, µ) and potential function ψ = log Γ(µ) − µ log β [4], the Fisher metric (3.3) is given by

(3.12)    [g_ij] = [ µ/β²    −1/β
                     −1/β    φ′(µ) ],

where φ(µ) = Γ′(µ)/Γ(µ) is the logarithmic derivative of the gamma function. The Levi-Civita connection components (3.4) are given by

(3.13)    Γ¹₁₁ = −(2µφ′(µ) − 1) / (2β(µφ′(µ) − 1)),
          Γ¹₁₂ = Γ¹₂₁ = φ′(µ) / (2(µφ′(µ) − 1)),
          Γ¹₂₂ = β φ″(µ) / (2(µφ′(µ) − 1)),
          Γ²₁₁ = −µ / (2β²(µφ′(µ) − 1)),
          Γ²₁₂ = Γ²₂₁ = 1 / (2β(µφ′(µ) − 1)),
          Γ²₂₂ = µ φ″(µ) / (2(µφ′(µ) − 1)),

while the other independent components are zero.
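The Levi-Civita components (3.13) can be recomputed from the metric (3.12) as a sanity check; a sketch assuming sympy is available (not part of the source), with φ′, φ″ realized as polygamma functions:

```python
# Sketch, not from the source: recompute the gamma-manifold Christoffel
# symbols from the Fisher metric and compare term by term with (3.13).
import sympy as sp

beta, mu = sp.symbols('beta mu', positive=True)
x = [beta, mu]
phi1, phi2 = sp.polygamma(1, mu), sp.polygamma(2, mu)  # phi'(mu), phi''(mu)
g = sp.Matrix([[mu / beta**2, -1 / beta], [-1 / beta, phi1]])
ginv = g.inv()

def christoffel(k, i, j):
    return sum(ginv[k, h] * (sp.diff(g[j, h], x[i]) + sp.diff(g[i, h], x[j])
                             - sp.diff(g[i, j], x[h])) for h in range(2)) / 2

d = mu * phi1 - 1
expected = {
    (0, 0, 0): -(2 * mu * phi1 - 1) / (2 * beta * d),
    (0, 0, 1): phi1 / (2 * d),
    (0, 1, 1): beta * phi2 / (2 * d),
    (1, 0, 0): -mu / (2 * beta**2 * d),
    (1, 0, 1): 1 / (2 * beta * d),
    (1, 1, 1): mu * phi2 / (2 * d),
}
for (k, i, j), val in expected.items():
    assert sp.simplify(christoffel(k, i, j) - val) == 0
```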