Information geometry for bivariate distribution control

C.T.J. Dodson and Hong Wang
Mathematics and Control Systems Centre, University of Manchester Institute of Science and Technology

Optimal control of stochastic processes through sensor estimation of probability density functions has a geometric setting via information theory and the information metric. We consider gamma models with positive correlation, for which the information-theoretic 3-manifold geometry has recently been formulated. For comparison we also summarize the case of bivariate Gaussian processes with arbitrary correlation.
Poisson Models

Model for random events: the probability of exactly m events in time t is

P(m) = \frac{(t/\tau)^m}{m!}\, e^{-t/\tau}, \quad m = 0, 1, 2, \dots

The mean and variance of the number of events in such intervals are both t/\tau. From P(0) = e^{-t/\tau}, the pdf for intervals t between successive events is

f_{random}(t; \tau) = \frac{1}{\tau}\, e^{-t/\tau}, \quad t \in [0, \infty),

with variance \mathrm{Var}(t) = \tau^2. Departures from randomness involve either clustering of events or an evening out of event spacings.
Gamma pdfs

Parameters: \tau, \nu \in \mathbb{R}^+

f(t; \tau, \nu) = \left(\frac{\nu}{\tau}\right)^{\nu} \frac{t^{\nu-1}}{\Gamma(\nu)}\, e^{-t\nu/\tau}, \quad t \in \mathbb{R}^+

Mean \bar{t} = \tau and variance \mathrm{Var}(t) = \tau^2/\nu.
\nu = 1: Poisson (random) process, with mean interval \tau.
Integer \nu = 1, 2, \dots: Poisson process with events removed to leave only every \nu th.
\nu < 1: clustering of events.

[Figure: f(t; 1, \nu) against inter-event interval t \in [0, 2], for \nu = 0.5 (clustered), \nu = 1 (random), \nu = 2 and \nu = 5 (smoothed).]
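The gamma pdf above is easy to check numerically; a minimal sketch (names and quadrature choices our own) verifies unit mass, mean \tau and variance \tau^2/\nu:

```python
import math

def gamma_pdf(t, tau, nu):
    """Gamma pdf in the (tau, nu) parametrisation above: mean tau, variance tau^2/nu."""
    return (nu / tau) ** nu * t ** (nu - 1) / math.gamma(nu) * math.exp(-t * nu / tau)

def check_moments(tau, nu, upper=60.0, n=120000):
    """Riemann-sum quadrature of total mass, mean and variance on (0, upper]."""
    h = upper / n
    mass = mean = second = 0.0
    for k in range(1, n + 1):
        t = k * h
        w = gamma_pdf(t, tau, nu) * h
        mass += w
        mean += t * w
        second += t * t * w
    return mass, mean, second - mean * mean

mass, mean, var = check_moments(tau=1.0, nu=2.0)   # expect 1, tau, tau^2/nu = 0.5
```

Note that nu = 1 recovers the exponential inter-event density of the Poisson slide.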
Fisher Information Metric on a pdf family

Example of a univariate pdf family: \{f(t; (\theta^i)) : (\theta^i) \in S\}, with random variable t \in \mathbb{R}^+ and n parameters. Log-likelihood function: l = \log f.

For all points (\theta^i) \in S, a subset of \mathbb{R}^n, the covariance of partial derivatives

g_{ij} = \int_{\mathbb{R}^+} f(t; (\theta^i))\, \frac{\partial l}{\partial \theta^i} \frac{\partial l}{\partial \theta^j}\, dt = -\int_{\mathbb{R}^+} f(t; (\theta^i))\, \frac{\partial^2 l}{\partial \theta^i \partial \theta^j}\, dt

is a positive definite n \times n matrix and induces a Riemannian metric g on the n-dimensional parameter space S.

General case: for multivariate pdfs, integrate over all random variables.
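For the gamma family of the previous slide, the Fisher metric in coordinates (\tau, \nu) is known to be diagonal, \mathrm{diag}(\nu/\tau^2, \psi'(\nu) - 1/\nu). A numerical sketch (our own helper names; scores by central differences, expectation by quadrature) recovers it from the definition:

```python
import math

def logf(t, tau, nu):
    """Log-likelihood of the gamma pdf f(t; tau, nu)."""
    return nu * math.log(nu / tau) + (nu - 1) * math.log(t) \
        - math.lgamma(nu) - t * nu / tau

def fisher_metric(tau, nu, eps=1e-5, upper=50.0, n=60000):
    """g_ij = E[(d_i log f)(d_j log f)]: numerical scores, quadrature expectation."""
    h = upper / n
    g = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(1, n + 1):
        t = k * h
        f = math.exp(logf(t, tau, nu))
        d = (
            (logf(t, tau + eps, nu) - logf(t, tau - eps, nu)) / (2 * eps),
            (logf(t, tau, nu + eps) - logf(t, tau, nu - eps)) / (2 * eps),
        )
        for a in range(2):
            for b in range(2):
                g[a][b] += f * d[a] * d[b] * h
    return g

g = fisher_metric(tau=1.0, nu=2.0)   # closed form: diag(nu/tau^2, psi'(nu) - 1/nu)
```

At \tau = 1, \nu = 2 this should give g_{11} \approx 2, g_{12} \approx 0 and g_{22} \approx \psi'(2) - 1/2.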
McKay bivariate gamma pdf

The McKay bivariate gamma distribution, defined on 0 < x < y < \infty with parameters \alpha_1, \sigma_{12}, \alpha_2 > 0, is given by:

f(x, y; \alpha_1, \sigma_{12}, \alpha_2) = \frac{ \left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1+\alpha_2}{2}} x^{\alpha_1 - 1} (y - x)^{\alpha_2 - 1}\, e^{-\sqrt{\alpha_1/\sigma_{12}}\, y} }{ \Gamma(\alpha_1)\Gamma(\alpha_2) },   (1)

where \sigma_{12} is the covariance of X and Y. The correlation coefficient and the marginal gamma distributions of X and Y are given respectively, for positive x, y, by:

\rho(X, Y) = \sqrt{\frac{\alpha_1}{\alpha_1 + \alpha_2}},

f_X(x) = \frac{ \left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1}{2}} x^{\alpha_1 - 1}\, e^{-\sqrt{\alpha_1/\sigma_{12}}\, x} }{ \Gamma(\alpha_1) },

f_Y(y) = \frac{ \left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1+\alpha_2}{2}} y^{(\alpha_1+\alpha_2) - 1}\, e^{-\sqrt{\alpha_1/\sigma_{12}}\, y} }{ \Gamma(\alpha_1 + \alpha_2) }.
Applicability of McKay distributions

Explicitly, f_X(x) is a gamma density with mean \sqrt{\alpha_1 \sigma_{12}} and dispersion parameter \alpha_1. Similarly, f_Y(y) is a gamma density with mean (\alpha_1 + \alpha_2)\sqrt{\sigma_{12}/\alpha_1} and dispersion parameter (\alpha_1 + \alpha_2). So, for the McKay distribution to be applicable to a given joint distribution for (x, y), we need to have:

0 < x < y < \infty
Covariance > 0
(Dispersion parameter for y) > (Dispersion parameter for x)

And we expect, roughly, (Mean y)/(Mean x) = \frac{\alpha_1 + \alpha_2}{\alpha_1}.
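The structure behind (1) — X and Y − X independent gamma variables with common rate c = \sqrt{\alpha_1/\sigma_{12}} — gives a quick simulation check of the correlation formula. A sketch with stdlib tools (helper names our own):

```python
import math, random

def mckay_sample(a1, s12, a2, n, rng):
    """Draw (X, Y) with 0 < X < Y: X ~ Gamma(a1), Y - X ~ Gamma(a2),
    both with scale sqrt(s12/a1), i.e. rate c = sqrt(a1/s12)."""
    scale = math.sqrt(s12 / a1)
    pairs = []
    for _ in range(n):
        x = rng.gammavariate(a1, scale)
        y = x + rng.gammavariate(a2, scale)
        pairs.append((x, y))
    return pairs

def correlation(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
pairs = mckay_sample(2.0, 1.0, 3.0, 100000, rng)
rho = correlation(pairs)   # theory: sqrt(a1/(a1+a2)) = sqrt(0.4) ~ 0.632
```

The construction also makes Cov(X, Y) = Var(X) = \sigma_{12}, as the covariance interpretation of \sigma_{12} requires.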
McKay 3-manifold

Denote by M the set of McKay bivariate gamma distributions, that is

M = \left\{ f \;\middle|\; f(x, y; \alpha_1, \sigma_{12}, \alpha_2) = \frac{ \left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1+\alpha_2}{2}} x^{\alpha_1 - 1} (y - x)^{\alpha_2 - 1}\, e^{-\sqrt{\alpha_1/\sigma_{12}}\, y} }{ \Gamma(\alpha_1)\Gamma(\alpha_2) },\; y > x > 0,\; \alpha_1, \sigma_{12}, \alpha_2 > 0 \right\}   (2)

Then we have from Arwini and Dodson: global coordinates (\alpha_1, \sigma_{12}, \alpha_2) make M a 3-manifold with Fisher information metric [g_{ij}] given by:

[g_{ij}] = \begin{pmatrix} \frac{\alpha_2 - 3\alpha_1}{4\alpha_1^2} + \psi'(\alpha_1) & \frac{\alpha_1 - \alpha_2}{4\alpha_1 \sigma_{12}} & -\frac{1}{2\alpha_1} \\ \frac{\alpha_1 - \alpha_2}{4\alpha_1 \sigma_{12}} & \frac{\alpha_1 + \alpha_2}{4\sigma_{12}^2} & \frac{1}{2\sigma_{12}} \\ -\frac{1}{2\alpha_1} & \frac{1}{2\sigma_{12}} & \psi'(\alpha_2) \end{pmatrix}   (3)

where \psi'(u) = \frac{d^2}{du^2}\log\Gamma(u) is the trigamma function.
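Evaluating the metric (3) numerically needs the trigamma function \psi'; a stdlib-only sketch (our own recurrence-plus-asymptotic-series approximation of \psi', and our own sample point) checks that the resulting matrix is symmetric and positive definite via Sylvester's criterion:

```python
import math

def trigamma(x):
    """psi'(x): recurrence psi'(x) = psi'(x+1) + 1/x^2, then asymptotic series."""
    s = 0.0
    while x < 10.0:
        s += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return s + inv + 0.5 * inv2 + inv2 * inv * (1.0/6 - inv2 * (1.0/30 - inv2 / 42))

def mckay_metric(a1, s12, a2):
    """Fisher metric (3) in coordinates (alpha1, sigma12, alpha2)."""
    return [
        [(a2 - 3*a1) / (4*a1*a1) + trigamma(a1), (a1 - a2) / (4*a1*s12), -1 / (2*a1)],
        [(a1 - a2) / (4*a1*s12), (a1 + a2) / (4*s12*s12), 1 / (2*s12)],
        [-1 / (2*a1), 1 / (2*s12), trigamma(a2)],
    ]

def leading_minors(g):
    m1 = g[0][0]
    m2 = g[0][0]*g[1][1] - g[0][1]*g[1][0]
    m3 = (g[0][0]*(g[1][1]*g[2][2] - g[1][2]*g[2][1])
          - g[0][1]*(g[1][0]*g[2][2] - g[1][2]*g[2][0])
          + g[0][2]*(g[1][0]*g[2][1] - g[1][1]*g[2][0]))
    return m1, m2, m3

g = mckay_metric(2.0, 1.0, 3.0)
minors = leading_minors(g)   # all positive => positive definite (Sylvester)
```

The trigamma approximation can be sanity-checked against \psi'(1) = \pi^2/6.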
Information arclength through McKay distributions

For small changes d\alpha_1, d\alpha_2, d\sigma_{12}, the element of arclength ds is given by

ds^2 = \sum_{i,j=1}^{3} g_{ij}\, dx^i dx^j   (4)

with (x^1, x^2, x^3) = (\alpha_1, \sigma_{12}, \alpha_2). For larger separations between two bivariate gamma distributions, the arclength along a curve is obtained by integration of ds; geodesic curves can give minimizing trajectories.

c : [a, b] \to M : t \mapsto (c_1(t), c_2(t), c_3(t))   (5)

The tangent vector \dot{c}(t) = (\dot{c}_1(t), \dot{c}_2(t), \dot{c}_3(t)) has norm \|\dot{c}\| given via (3) by

\|\dot{c}(t)\|^2 = \sum_{i,j=1}^{3} g_{ij}\, \dot{c}_i(t)\, \dot{c}_j(t).   (6)

The information length of the curve is

L_c(a, b) = \int_a^b \|\dot{c}(t)\|\, dt \quad \text{for } a \le b.   (7)
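Equation (7) can be approximated for any parameter curve by quadrature. The sketch below (reusing the metric entries of (3) with a trigamma approximation; names and sample points our own) integrates \|\dot{c}\| along a straight coordinate segment, which upper-bounds the geodesic information distance between its endpoints:

```python
import math

def trigamma(x):
    """psi'(x) via recurrence then asymptotic series (approximation)."""
    s = 0.0
    while x < 10.0:
        s += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return s + inv + 0.5 * inv2 + inv2 * inv * (1.0/6 - inv2 * (1.0/30 - inv2 / 42))

def mckay_metric(a1, s12, a2):
    """Fisher metric (3) in coordinates (alpha1, sigma12, alpha2)."""
    return [
        [(a2 - 3*a1) / (4*a1*a1) + trigamma(a1), (a1 - a2) / (4*a1*s12), -1 / (2*a1)],
        [(a1 - a2) / (4*a1*s12), (a1 + a2) / (4*s12*s12), 1 / (2*s12)],
        [-1 / (2*a1), 1 / (2*s12), trigamma(a2)],
    ]

def info_length(p, q, n=2000):
    """L_c of (7) along c(t) = p + t(q - p), t in [0, 1], by the midpoint rule."""
    v = [qi - pi for pi, qi in zip(p, q)]      # constant tangent vector
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        c = [pi + t * vi for pi, vi in zip(p, v)]
        g = mckay_metric(*c)
        sq = sum(g[i][j] * v[i] * v[j] for i in range(3) for j in range(3))
        total += math.sqrt(sq) * h
    return total

L = info_length((2.0, 1.0, 3.0), (2.5, 1.2, 3.5))
```

Reversing the endpoints traverses the same curve, so the length is unchanged.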
Bivariate Gaussian

The probability density of the 2-dimensional normal distribution has the form:

f(x) = \frac{1}{2\pi |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)},   (8)

where

x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix},

-\infty < x_1, x_2 < \infty, \quad -\infty < \mu_1, \mu_2 < \infty, \quad 0 < \sigma_{11}, \sigma_{22} < \infty,

with |\Sigma| = \sigma_{11}\sigma_{22} - \sigma_{12}^2 > 0. This contains the five parameters \mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22}, so our global coordinate system consists of the 5-tuples \theta = (\theta^1, \theta^2, \theta^3, \theta^4, \theta^5) = (\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22}).
Bivariate Gaussian 5-manifold

\theta^1 = \mu_1, \theta^2 = \mu_2, \theta^3 = \sigma_{11}, \theta^4 = \sigma_{12}, \theta^5 = \sigma_{22}. The information metric tensor is known to be given by:

[g_{ij}] = \begin{pmatrix} \frac{\theta^5}{\Delta} & -\frac{\theta^4}{\Delta} & 0 & 0 & 0 \\ -\frac{\theta^4}{\Delta} & \frac{\theta^3}{\Delta} & 0 & 0 & 0 \\ 0 & 0 & \frac{(\theta^5)^2}{2\Delta^2} & -\frac{\theta^4\theta^5}{\Delta^2} & \frac{(\theta^4)^2}{2\Delta^2} \\ 0 & 0 & -\frac{\theta^4\theta^5}{\Delta^2} & \frac{\theta^3\theta^5 + (\theta^4)^2}{\Delta^2} & -\frac{\theta^3\theta^4}{\Delta^2} \\ 0 & 0 & \frac{(\theta^4)^2}{2\Delta^2} & -\frac{\theta^3\theta^4}{\Delta^2} & \frac{(\theta^3)^2}{2\Delta^2} \end{pmatrix}   (9)

where \Delta is the determinant \Delta = |\Sigma| = \theta^3\theta^5 - (\theta^4)^2.
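A direct transcription of (9) as code, with a random-vector spot check of symmetry and positive definiteness (a sketch; the entries depend only on the covariance coordinates, and the sample point is our own choice):

```python
import random

def gaussian_metric(s11, s12, s22):
    """Fisher metric (9) in coordinate order (mu1, mu2, s11, s12, s22)."""
    d = s11 * s22 - s12 * s12          # Delta = |Sigma|, must be positive
    d2 = d * d
    return [
        [ s22 / d, -s12 / d, 0, 0, 0],
        [-s12 / d,  s11 / d, 0, 0, 0],
        [0, 0,  s22 * s22 / (2 * d2), -s12 * s22 / d2,              s12 * s12 / (2 * d2)],
        [0, 0, -s12 * s22 / d2,       (s11 * s22 + s12 * s12) / d2, -s11 * s12 / d2],
        [0, 0,  s12 * s12 / (2 * d2), -s11 * s12 / d2,              s11 * s11 / (2 * d2)],
    ]

g = gaussian_metric(2.0, 0.5, 1.0)
sym = all(g[i][j] == g[j][i] for i in range(5) for j in range(5))

rng = random.Random(1)
quads = []                             # v^T g v should be > 0 for nonzero v
for _ in range(50):
    v = [rng.uniform(-1, 1) for _ in range(5)]
    quads.append(sum(g[i][j] * v[i] * v[j] for i in range(5) for j in range(5)))
```

The upper-left 2x2 block is exactly \Sigma^{-1}, the familiar Fisher information for the mean of a Gaussian with known covariance.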
B-spline approximations to McKay

Since every pdf has unit measure,

\lim_{x, y \to +\infty} f(x, y; \alpha_1, \sigma_{12}, \alpha_2) = 0.   (10)

Hence, for all \varepsilon > 0 there is a b(\varepsilon, \alpha_1, \sigma_{12}, \alpha_2) > 0 such that

x, y > b(\varepsilon, \alpha_1, \sigma_{12}, \alpha_2) \implies f(x, y; \alpha_1, \sigma_{12}, \alpha_2) \le \varepsilon.   (11)

So we can use B-spline functions to approximate the McKay pdf such that

\left| f(x, y; \alpha_1, \sigma_{12}, \alpha_2) - \sum_{i=1}^{n} w_i B_i(x, y) \right| \le \delta,   (12)

where the B_i(x, y) are pre-specified bivariate basis functions defined on \Omega = [0, b] \times [0, b], the w_i are weights to be trained adaptively, and \delta is a small number, generally larger than \varepsilon.
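A minimal sketch of (12) using tensor-product linear B-splines (hat functions) on a uniform grid over \Omega = [0, b] \times [0, b]: for this basis the interpolatory weights are simply the pdf values at the nodes, and \delta can be estimated on a finer grid (all names, parameter values and grid sizes are our own illustrative choices):

```python
import math

def mckay_pdf(x, y, a1, s12, a2):
    """McKay pdf (1); zero outside the wedge 0 < x < y."""
    if not 0.0 < x < y:
        return 0.0
    c = math.sqrt(a1 / s12)
    return (c ** (a1 + a2) * x ** (a1 - 1) * (y - x) ** (a2 - 1)
            * math.exp(-c * y) / (math.gamma(a1) * math.gamma(a2)))

b, m = 12.0, 61                        # domain [0, b]^2, m x m hat-function nodes
h = b / (m - 1)
A = (2.0, 1.0, 3.0)                    # (alpha1, sigma12, alpha2)
w = [[mckay_pdf(i * h, j * h, *A) for j in range(m)] for i in range(m)]

def spline(x, y):
    """sum_i w_i B_i(x, y) for the tensor-product hat basis = bilinear interpolation."""
    i = min(int(x / h), m - 2)
    j = min(int(y / h), m - 2)
    tx = x / h - i
    ty = y / h - j
    return ((1 - tx) * (1 - ty) * w[i][j] + tx * (1 - ty) * w[i + 1][j]
            + (1 - tx) * ty * w[i][j + 1] + tx * ty * w[i + 1][j + 1])

# sup-norm error delta of (12), estimated on a finer grid
delta = max(abs(spline(x, y) - mckay_pdf(x, y, *A))
            for x in (k * b / 240 for k in range(241))
            for y in (k * b / 240 for k in range(241)))
```

Higher-order B-splines would reduce \delta for the same number of weights; the hat basis is used here only to keep the sketch short.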
Square root approximation

This is numerically more robust than linear B-splines. It is used here to represent the coupled links between the three parameters and the McKay pdf f. Since the basis functions are pre-specified, different values of \{\alpha_1, \sigma_{12}, \alpha_2\} will generate different sets of weights. Therefore, our approximation should be represented as

\sqrt{f(x, y; \alpha_1, \sigma_{12}, \alpha_2)} = \sum_{i=1}^{n} w_i B_i(x, y) + e,   (13)

where |e| \le \delta.
Renormalization

Wang has used the following transformed representation

\sqrt{f(x, y; \alpha_1, \sigma_{12}, \alpha_2)} = C(x, y)V_k + h(V_k)B_n(x, y)   (14)

to guarantee that

\int_0^b \int_0^b f(x, y; \alpha_1, \sigma_{12}, \alpha_2)\, dx\, dy = 1,

where V_k = (w_1, w_2, \dots, w_{n-1})^T constitutes a vector of independent weights, C(x, y) = (B_1(x, y), \dots, B_{n-1}(x, y)), and h(\cdot) is a known nonlinear function of V_k. With this format, the relationship between V_k and \{\alpha_1, \sigma_{12}, \alpha_2\} can be formulated.
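In one dimension with hat basis functions, the normalization constraint \int (CV + hB_n)^2 = 1 is a quadratic in h, which is why h(\cdot) is nonlinear; a self-contained sketch (domain [0, 1], basis size, weights and quadrature all our own illustrative choices):

```python
import math

n = 6
nodes = [i / (n - 1) for i in range(n)]
step = nodes[1] - nodes[0]

def B(i, u):
    """Linear B-spline (hat function) centred at nodes[i]."""
    return max(0.0, 1.0 - abs(u - nodes[i]) / step)

def quad(fn, m=2000):
    """Midpoint quadrature on [0, 1]."""
    return sum(fn((k + 0.5) / m) for k in range(m)) / m

# Gram integrals, splitting the basis into C = (B_0 .. B_{n-2}) and the last B_{n-1}
S1 = [[quad(lambda u, a=a, c=c: B(a, u) * B(c, u)) for c in range(n - 1)]
      for a in range(n - 1)]
S2 = [quad(lambda u, a=a: B(a, u) * B(n - 1, u)) for a in range(n - 1)]
S3 = quad(lambda u: B(n - 1, u) ** 2)

def h_of(V):
    """Root of V'S1V + 2h V'S2 + h^2 S3 = 1, so that (CV + h B_n)^2 integrates to 1."""
    q1 = sum(S1[a][c] * V[a] * V[c] for a in range(n - 1) for c in range(n - 1))
    q2 = sum(S2[a] * V[a] for a in range(n - 1))
    return (-q2 + math.sqrt(q2 * q2 - S3 * (q1 - 1.0))) / S3

V = [0.8, 1.0, 1.2, 1.0, 0.9]          # independent weights V_k
hV = h_of(V)
pdf = lambda u: (sum(V[a] * B(a, u) for a in range(n - 1)) + hV * B(n - 1, u)) ** 2
mass = quad(pdf)                        # should be 1 by construction
```

The positive root is taken so that the square-root representation stays nonnegative near the last basis function.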
Tuning procedure

The initial bivariate gamma distribution has parameters \{\alpha_1(0), \sigma_{12}(0), \alpha_2(0)\} and the desired distribution has \{\alpha_1(f), \sigma_{12}(f), \alpha_2(f)\}. Let g(x, y) represent the probability density of the bivariate gamma distribution with parameter set \{\alpha_1(f), \sigma_{12}(f), \alpha_2(f)\}. An effective trajectory for V_k should be chosen to minimise the following performance function

J = -\frac{1}{2} \int_0^b \int_0^b g(x, y) \log\frac{K(x, y)}{g(x, y)}\, dx\, dy   (15)

with K(x, y) = \left( C(x, y)V_k + h(V_k)B_n(x, y) \right)^2.
Gradient Rule

The weight vector is iterated by

V_k = V_{k-1} - \eta \left. \frac{\partial J}{\partial V} \right|_{V = V_{k-1}}   (16)

where k = 1, 2, \dots represents the sample number and \eta > 0 is a pre-specified learning rate. Using the relationship between the weight vector and the three parameters, the adaptive tuning of the actual parameters in the pdf of (14) can be readily formulated. The information arclength provides distance estimates between the current and target distributions.
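A toy version of the gradient rule (16): finite-difference gradients of a divergence-type performance function between the renormalized square-root model and a fixed target density on [0, 1] (everything here, from the grid sizes to the target density, is our own illustrative choice, not the authors' controller):

```python
import math

n = 5
nodes = [i / (n - 1) for i in range(n)]
step = nodes[1] - nodes[0]

def B(i, u):
    return max(0.0, 1.0 - abs(u - nodes[i]) / step)

M = 400
GRID = [(k + 0.5) / M for k in range(M)]

def quad(vals):
    return sum(vals) / M

S1 = [[quad([B(a, u) * B(c, u) for u in GRID]) for c in range(n - 1)]
      for a in range(n - 1)]
S2 = [quad([B(a, u) * B(n - 1, u) for u in GRID]) for a in range(n - 1)]
S3 = quad([B(n - 1, u) ** 2 for u in GRID])

def model(V):
    """Renormalized square-root model, evaluated on GRID."""
    q1 = sum(S1[a][c] * V[a] * V[c] for a in range(n - 1) for c in range(n - 1))
    q2 = sum(S2[a] * V[a] for a in range(n - 1))
    hV = (-q2 + math.sqrt(max(q2 * q2 - S3 * (q1 - 1.0), 1e-12))) / S3
    return [(sum(V[a] * B(a, u) for a in range(n - 1)) + hV * B(n - 1, u)) ** 2
            for u in GRID]

target = [6 * u * (1 - u) for u in GRID]      # target density g on [0, 1]

def J(V):
    """Performance function: (1/2) KL(g || model)."""
    f = model(V)
    return 0.5 * quad([gi * math.log(gi / max(fi, 1e-12))
                       for gi, fi in zip(target, f)])

V = [1.0] * (n - 1)
J0 = J(V)
eta, eps = 0.05, 1e-6
for _ in range(200):
    base = J(V)
    grad = [(J(V[:a] + [V[a] + eps] + V[a + 1:]) - base) / eps for a in range(n - 1)]
    V = [V[a] - eta * grad[a] for a in range(n - 1)]   # V_k = V_{k-1} - eta dJ/dV
Jf = J(V)
```

The iteration keeps unit mass at every step because renormalization is built into the model rather than enforced after the update.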
Kullback-Leibler Divergence

For two probability density functions p and p' on an event space \Omega, the function

KL(p, p') = \int_\Omega p(x) \log\frac{p(x)}{p'(x)}\, dx   (17)

is called the Kullback-Leibler divergence or relative entropy. Instead of (15), we could consider (17) as the performance function; explicitly this would be:

W = \frac{1}{2} \int_0^b \int_0^b g(x, y) \log\frac{g(x, y)}{K(x, y)}\, dx\, dy   (18)

with K(x, y) = \left( C(x, y)V_k + h(V_k)B_n(x, y) \right)^2.
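A discretized version of (17) makes its basic properties concrete: non-negativity, vanishing only when the densities coincide, and asymmetry in its arguments (grid and example densities are our own choices):

```python
import math

def kl(p, q, du):
    """Discretized KL(p||q) = sum p log(p/q) du, assuming p, q > 0 on the grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * du

m = 2000
du = 1.0 / m
grid = [(k + 0.5) * du for k in range(m)]
p = [6 * u * (1 - u) for u in grid]     # Beta(2, 2) density on [0, 1]
q = [1.0] * m                           # uniform density on [0, 1]

d_pq = kl(p, q, du)                     # closed form: log 6 - 5/3 ~ 0.1251
d_qp = kl(q, p, du)                     # closed form: 2 - log 6 ~ 0.2082
d_pp = kl(p, p, du)                     # exactly zero
```

The asymmetry d_pq != d_qp is why KL is a divergence rather than a distance; the information arclength of (7) supplies a genuine metric distance.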