Information geometry for bivariate distribution control

Information geometry for bivariate distribution control
C.T.J. Dodson and Hong Wang
Mathematics and Control Systems Centre, University of Manchester Institute of Science and Technology

Optimal control of stochastic processes through sensor estimation of probability density functions has a geometric setting via information theory and the information metric. We consider gamma models with positive correlation, for which the information-theoretic 3-manifold geometry has recently been formulated. For comparison we also summarize the case of bivariate Gaussian processes with arbitrary correlation.

Poisson Models
Model for random events: the probability of exactly $m$ events in time $t$ is
\[ P_m = \frac{(t/\tau)^m}{m!}\, e^{-t/\tau}, \qquad m = 0, 1, 2, \ldots \]
The mean and the variance of the number of events in such intervals are both $t/\tau$. From $P_0 = e^{-t/\tau}$, the pdf for the intervals $t$ between successive events is
\[ f_{\mathrm{random}}(t;\tau) = \frac{1}{\tau}\, e^{-t/\tau} \quad \text{for } t \in [0,\infty), \]
with variance $\mathrm{Var}(t) = \tau^2$. Departures from randomness involve either clustering of events or an evening out of event spacings.
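
A minimal simulation sketch to check these formulas; the mean interval, sample size and seed are arbitrary choices.

```python
# Sketch: simulate a Poisson (random) event stream and check the interval statistics.
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
tau = 2.0                        # mean inter-event interval
n = 100_000

# Inter-event intervals of a Poisson process are exponential with mean tau.
intervals = rng.exponential(tau, size=n)
print(intervals.mean(), intervals.var())   # ~ tau and ~ tau**2

# The number of events in a window of length t is Poisson with mean t/tau.
t = 10.0
counts = rng.poisson(t / tau, size=n)
print(counts.mean(), counts.var())         # both ~ t/tau
```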

Gamma pdfs
Parameters $\tau, \nu \in \mathbb{R}^+$:
\[ f(t;\tau,\nu) = \left(\frac{\nu}{\tau}\right)^{\nu} \frac{t^{\nu-1}}{\Gamma(\nu)}\, e^{-t\nu/\tau}, \qquad t \in \mathbb{R}^+ . \]
Mean $\bar{t} = \tau$ and variance $\mathrm{Var}(t) = \tau^2/\nu$.
$\nu = 1$: Poisson, with mean interval $\tau$.
$\nu = 1, 2, \ldots \in \mathbb{Z}$: Poisson process with events removed to leave only every $\nu$th.
$\nu < 1$: clustering of events.
[Figure: gamma pdfs $f(t; 1, \nu)$ against inter-event interval $t$, for $\nu = 0.5$ (clustered), $\nu = 1$ (random), $\nu = 2$ and $\nu = 5$ (smoothed).]
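
A small numerical sketch of this family: it evaluates $f(t;\tau,\nu)$ for the values of $\nu$ shown in the figure and checks the mean and variance by a simple quadrature; the grid sizes are illustrative.

```python
# Sketch: evaluate the gamma inter-event pdf and check its moments numerically.
import numpy as np
from scipy.special import gammaln

def gamma_pdf(t, tau, nu):
    # f(t; tau, nu) = (nu/tau)**nu * t**(nu-1) * exp(-t*nu/tau) / Gamma(nu)
    return np.exp(nu * np.log(nu / tau) + (nu - 1) * np.log(t) - t * nu / tau - gammaln(nu))

# the four cases plotted in the figure
for nu in (0.5, 1.0, 2.0, 5.0):
    print(nu, gamma_pdf(np.array([0.25, 1.0, 2.0]), tau=1.0, nu=nu))

# moment check by quadrature: mean ~ tau, variance ~ tau**2 / nu
t = np.linspace(1e-6, 60.0, 600_000)
dt = t[1] - t[0]
f = gamma_pdf(t, tau=1.0, nu=2.0)
mean = np.sum(t * f) * dt
var = np.sum((t - mean) ** 2 * f) * dt
print(mean, var)   # ~ 1.0 and ~ 0.5
```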

Fisher Information Metric on a pdf family
Example of a univariate pdf family: $\{f(t;(\theta^i)) : (\theta^i) \in S\}$, with random variable $t \in \mathbb{R}^+$ and $n$ parameters. Log-likelihood function $l = \log f$. For all points $(\theta^i) \in S$, a subset of $\mathbb{R}^n$, the covariance of the partial derivatives
\[ g_{ij} = \int_{\mathbb{R}^+} f(t;(\theta^i))\, \frac{\partial l}{\partial \theta^i}\, \frac{\partial l}{\partial \theta^j}\, dt = -\int_{\mathbb{R}^+} f(t;(\theta^i))\, \frac{\partial^2 l}{\partial \theta^i \partial \theta^j}\, dt \]
is a positive definite $n \times n$ matrix and induces a Riemannian metric $g$ on the $n$-dimensional parameter space $S$.
General case: integrate over all random variables for multivariate pdfs.
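
To make the definition concrete, the sketch below recovers the Fisher metric of the gamma family $f(t;\tau,\nu)$ numerically from the covariance of the score, and compares it with the known closed form $\mathrm{diag}(\nu/\tau^2,\ \psi'(\nu) - 1/\nu)$; the quadrature grid is an arbitrary choice.

```python
# Sketch: Fisher metric of the gamma family from E[(dl/d theta_i)(dl/d theta_j)].
import numpy as np
from scipy.special import gammaln, digamma, polygamma

def fisher_metric(tau, nu, tmax=100.0, n=400_000):
    t = np.linspace(1e-8, tmax, n)
    dt = t[1] - t[0]
    f = np.exp(nu * np.log(nu / tau) + (nu - 1) * np.log(t) - t * nu / tau - gammaln(nu))
    # analytic scores of l = log f for this family
    s_tau = -nu / tau + t * nu / tau**2
    s_nu = np.log(nu / tau) + 1.0 + np.log(t) - t / tau - digamma(nu)
    s = np.vstack([s_tau, s_nu])
    return (s[:, None, :] * s[None, :, :] * f).sum(axis=-1) * dt   # E[s_i s_j]

tau, nu = 1.5, 2.0
print(fisher_metric(tau, nu))
print(np.diag([nu / tau**2, polygamma(1, nu) - 1.0 / nu]))        # closed form
```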

McKay bivariate gamma pdf
The McKay bivariate gamma distribution, defined on $0 < x < y < \infty$ with parameters $\alpha_1, \sigma_{12}, \alpha_2 > 0$, is given by
\[ f(x,y;\alpha_1,\sigma_{12},\alpha_2) = \frac{\left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1+\alpha_2}{2}} x^{\alpha_1-1}\,(y-x)^{\alpha_2-1}\, e^{-\sqrt{\frac{\alpha_1}{\sigma_{12}}}\, y}}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)}, \qquad (1) \]
where $\sigma_{12}$ is the covariance of $X$ and $Y$. The correlation coefficient and the marginal gamma distributions of $X$ and $Y$ are given, for positive $x, y$, by
\[ \rho(X,Y) = \sqrt{\frac{\alpha_1}{\alpha_1+\alpha_2}}, \]
\[ f_X(x) = \frac{\left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1}{2}} x^{\alpha_1-1}\, e^{-\sqrt{\frac{\alpha_1}{\sigma_{12}}}\, x}}{\Gamma(\alpha_1)}, \qquad
f_Y(y) = \frac{\left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1+\alpha_2}{2}} y^{(\alpha_1+\alpha_2)-1}\, e^{-\sqrt{\frac{\alpha_1}{\sigma_{12}}}\, y}}{\Gamma(\alpha_1+\alpha_2)}. \]
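
A sketch that evaluates (1) and checks the stated correlation and covariance by Monte Carlo, using the standard additive construction of the McKay model ($X \sim \Gamma(\alpha_1)$ and $Y = X + Z$ with $Z \sim \Gamma(\alpha_2)$, both with rate $\sqrt{\alpha_1/\sigma_{12}}$); parameter values and sample size are illustrative.

```python
# Sketch: the McKay density (1), plus a Monte Carlo check of rho and the covariance.
import numpy as np
from scipy.special import gammaln

def mckay_pdf(x, y, a1, s12, a2):
    """Density (1); assumes 0 < x < y."""
    c = np.sqrt(a1 / s12)
    return np.exp((a1 + a2) * np.log(c) + (a1 - 1) * np.log(x)
                  + (a2 - 1) * np.log(y - x) - c * y - gammaln(a1) - gammaln(a2))

a1, s12, a2 = 2.0, 1.5, 3.0
c = np.sqrt(a1 / s12)
rng = np.random.default_rng(1)
x = rng.gamma(a1, 1.0 / c, size=200_000)        # X ~ Gamma(shape a1, scale 1/c)
y = x + rng.gamma(a2, 1.0 / c, size=200_000)    # Y = X + Z,  Z ~ Gamma(shape a2, scale 1/c)

print(np.corrcoef(x, y)[0, 1], np.sqrt(a1 / (a1 + a2)))   # correlation vs closed form
print(np.cov(x, y)[0, 1], s12)                            # covariance ~ sigma_12
print(mckay_pdf(1.0, 2.0, a1, s12, a2))                   # single density evaluation
```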

Applicability of McKay distributions
Explicitly, $f_X(x)$ is a gamma density with mean $\sqrt{\alpha_1\sigma_{12}}$ and dispersion parameter $\alpha_1$. Similarly, $f_Y(y)$ is a gamma density with mean $(\alpha_1+\alpha_2)\sqrt{\sigma_{12}/\alpha_1}$ and dispersion parameter $(\alpha_1+\alpha_2)$. So, for the McKay distribution to be applicable to a given joint distribution for $(x, y)$, we need:
$0 < x < y < \infty$,
Covariance $> 0$,
(Dispersion parameter for $y$) $>$ (Dispersion parameter for $x$).
And we expect, roughly, $(\text{Mean } y)/(\text{Mean } x) = (\alpha_1+\alpha_2)/\alpha_1$.
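
Read as a recipe, these relations suggest a moment-matching assignment of $(\alpha_1, \sigma_{12}, \alpha_2)$ from sample statistics. The sketch below is an illustrative reading of the conditions above, not a fitting procedure from the paper.

```python
# Sketch: moment matching implied by the applicability conditions (illustrative).
import numpy as np

def mckay_from_moments(mean_x, mean_y, cov_xy):
    if not (0 < mean_x < mean_y and cov_xy > 0):
        raise ValueError("need 0 < mean_x < mean_y and positive covariance")
    sigma12 = cov_xy                              # sigma_12 is the covariance of X, Y
    alpha1 = mean_x**2 / sigma12                  # from mean_x = sqrt(alpha1 * sigma12)
    alpha2 = alpha1 * (mean_y / mean_x - 1.0)     # from mean_y / mean_x = (alpha1 + alpha2) / alpha1
    return alpha1, sigma12, alpha2                # dispersion for y exceeds that for x automatically

print(mckay_from_moments(mean_x=1.7, mean_y=2.9, cov_xy=1.5))
```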

McKay 3-manifold
Denote by $M$ the set of McKay bivariate gamma distributions, that is,
\[ M = \left\{ f : f(x,y;\alpha_1,\sigma_{12},\alpha_2) = \frac{\left(\frac{\alpha_1}{\sigma_{12}}\right)^{\frac{\alpha_1+\alpha_2}{2}} x^{\alpha_1-1}(y-x)^{\alpha_2-1} e^{-\sqrt{\frac{\alpha_1}{\sigma_{12}}}\, y}}{\Gamma(\alpha_1)\Gamma(\alpha_2)},\ \ y > x > 0,\ \ \alpha_1, \sigma_{12}, \alpha_2 > 0 \right\}. \qquad (2) \]
Then we have from Arwini and Dodson: the global coordinates $(\alpha_1, \sigma_{12}, \alpha_2)$ make $M$ a 3-manifold with Fisher information metric $[g_{ij}]$ given by
\[ [g_{ij}] = \begin{pmatrix}
\dfrac{\alpha_2 - 3\alpha_1}{4\alpha_1^2} + \psi'(\alpha_1) & \dfrac{\alpha_1-\alpha_2}{4\alpha_1\sigma_{12}} & -\dfrac{1}{2\alpha_1} \\[6pt]
\dfrac{\alpha_1-\alpha_2}{4\alpha_1\sigma_{12}} & \dfrac{\alpha_1+\alpha_2}{4\sigma_{12}^2} & \dfrac{1}{2\sigma_{12}} \\[6pt]
-\dfrac{1}{2\alpha_1} & \dfrac{1}{2\sigma_{12}} & \psi'(\alpha_2)
\end{pmatrix}. \qquad (3) \]
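
A sketch assembling (3) at a chosen point, with $\psi'$ computed as polygamma(1, ·); the eigenvalue check simply confirms that the metric is positive definite there.

```python
# Sketch: the McKay Fisher information metric (3) as a function of (alpha1, sigma12, alpha2).
import numpy as np
from scipy.special import polygamma

def mckay_metric(a1, s12, a2):
    p = lambda z: polygamma(1, z)                 # psi'(z), the trigamma function
    return np.array([
        [(a2 - 3 * a1) / (4 * a1**2) + p(a1), (a1 - a2) / (4 * a1 * s12), -1 / (2 * a1)],
        [(a1 - a2) / (4 * a1 * s12),          (a1 + a2) / (4 * s12**2),    1 / (2 * s12)],
        [-1 / (2 * a1),                        1 / (2 * s12),              p(a2)],
    ])

g = mckay_metric(2.0, 1.5, 3.0)
print(g)
print(np.linalg.eigvalsh(g))   # all positive: positive definite at this point
```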

Information arclength through McKay distributions
For small changes $d\alpha_1, d\sigma_{12}, d\alpha_2$ the element of arclength $ds$ is given by
\[ ds^2 = \sum_{i,j=1}^{3} g_{ij}\, dx^i\, dx^j \qquad (4) \]
with $(x^1, x^2, x^3) = (\alpha_1, \sigma_{12}, \alpha_2)$. For larger separations between two bivariate gamma distributions the arclength along a curve is obtained by integration of $ds$; geodesic curves can give minimizing trajectories. For a curve
\[ c : [a,b] \to M : t \mapsto (c_1(t), c_2(t), c_3(t)), \qquad (5) \]
the tangent vector $\dot{c}(t) = (\dot{c}_1(t), \dot{c}_2(t), \dot{c}_3(t))$ has norm $\|\dot{c}\|$ given via (3) by
\[ \|\dot{c}(t)\|^2 = \sum_{i,j=1}^{3} g_{ij}\, \dot{c}_i(t)\, \dot{c}_j(t). \qquad (6) \]
The information length of the curve is
\[ L_c(a,b) = \int_a^b \|\dot{c}(t)\|\, dt \quad \text{for } a \leq b. \qquad (7) \]
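
A sketch of (6)-(7) along a straight line in $(\alpha_1, \sigma_{12}, \alpha_2)$ coordinates. The straight line is only a convenient test curve; its information length gives an upper bound on the geodesic distance between its endpoints.

```python
# Sketch: information length (7) of a straight-line parameter path between two McKay models.
import numpy as np
from scipy.special import polygamma

def mckay_metric(a1, s12, a2):
    p = lambda z: polygamma(1, z)
    return np.array([
        [(a2 - 3 * a1) / (4 * a1**2) + p(a1), (a1 - a2) / (4 * a1 * s12), -1 / (2 * a1)],
        [(a1 - a2) / (4 * a1 * s12),          (a1 + a2) / (4 * s12**2),    1 / (2 * s12)],
        [-1 / (2 * a1),                        1 / (2 * s12),              p(a2)]])

def information_length(theta_a, theta_b, steps=2000):
    a = np.asarray(theta_a, float)
    cdot = np.asarray(theta_b, float) - a         # constant tangent of the straight line
    ts = np.linspace(0.0, 1.0, steps)
    speeds = [np.sqrt(cdot @ mckay_metric(*(a + t * cdot)) @ cdot) for t in ts]
    return float(np.mean(speeds))                 # ~ integral of the speed over [0, 1]

print(information_length((2.0, 1.5, 3.0), (2.5, 1.2, 3.5)))
```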

Bivariate Gaussian
The probability density of the 2-dimensional normal distribution has the form
\[ f(x) = \frac{1}{2\pi\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad (8) \]
where
\[ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}, \]
with $-\infty < x_1, x_2 < \infty$, $-\infty < \mu_1, \mu_2 < \infty$, $0 < \sigma_{11}, \sigma_{22} < \infty$ (and $\sigma_{11}\sigma_{22} - \sigma_{12}^2 > 0$). This contains the five parameters $\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22}$, so our global coordinate system consists of the 5-tuples $\theta = (\theta^1, \theta^2, \theta^3, \theta^4, \theta^5) = (\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22})$.

Bivariate Gaussian 5-manifold
With $\theta^1 = \mu_1$, $\theta^2 = \mu_2$, $\theta^3 = \sigma_{11}$, $\theta^4 = \sigma_{12}$, $\theta^5 = \sigma_{22}$, the information metric tensor is known to be given by
\[ [g_{ij}] = \begin{pmatrix}
\dfrac{\theta^5}{\Delta} & -\dfrac{\theta^4}{\Delta} & 0 & 0 & 0 \\[4pt]
-\dfrac{\theta^4}{\Delta} & \dfrac{\theta^3}{\Delta} & 0 & 0 & 0 \\[4pt]
0 & 0 & \dfrac{(\theta^5)^2}{2\Delta^2} & -\dfrac{\theta^4\theta^5}{\Delta^2} & \dfrac{(\theta^4)^2}{2\Delta^2} \\[4pt]
0 & 0 & -\dfrac{\theta^4\theta^5}{\Delta^2} & \dfrac{\theta^3\theta^5 + (\theta^4)^2}{\Delta^2} & -\dfrac{\theta^3\theta^4}{\Delta^2} \\[4pt]
0 & 0 & \dfrac{(\theta^4)^2}{2\Delta^2} & -\dfrac{\theta^3\theta^4}{\Delta^2} & \dfrac{(\theta^3)^2}{2\Delta^2}
\end{pmatrix}, \qquad (9) \]
where $\Delta$ is the determinant $\Delta = |\Sigma| = \theta^3\theta^5 - (\theta^4)^2$.
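
A sketch assembling (9); the test point is arbitrary, and the eigenvalue check confirms the metric is positive definite there.

```python
# Sketch: the bivariate Gaussian Fisher metric (9) in coordinates (mu1, mu2, s11, s12, s22).
import numpy as np

def gaussian5_metric(mu1, mu2, s11, s12, s22):
    D = s11 * s22 - s12**2                        # Delta = det(Sigma)
    g = np.zeros((5, 5))
    g[0, 0], g[0, 1] = s22 / D, -s12 / D          # mean block: Sigma^{-1}
    g[1, 0], g[1, 1] = -s12 / D, s11 / D
    g[2, 2], g[2, 3], g[2, 4] = s22**2 / (2 * D**2), -s12 * s22 / D**2, s12**2 / (2 * D**2)
    g[3, 2], g[3, 3], g[3, 4] = -s12 * s22 / D**2, (s11 * s22 + s12**2) / D**2, -s11 * s12 / D**2
    g[4, 2], g[4, 3], g[4, 4] = s12**2 / (2 * D**2), -s11 * s12 / D**2, s11**2 / (2 * D**2)
    return g

g = gaussian5_metric(0.0, 0.0, 2.0, 0.5, 1.0)
print(np.linalg.eigvalsh(g))   # positive eigenvalues: a valid Riemannian metric here
```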

B-spline approximations to McKay
Since every pdf has unit measure,
\[ \lim_{x,y \to +\infty} f(x,y;\alpha_1,\sigma_{12},\alpha_2) = 0. \qquad (10) \]
Hence, for all $\epsilon > 0$ there is a $b(\epsilon, \alpha_1, \sigma_{12}, \alpha_2) > 0$ such that
\[ x, y > b(\epsilon, \alpha_1, \sigma_{12}, \alpha_2) \ \Longrightarrow\ f(x,y;\alpha_1,\sigma_{12},\alpha_2) \leq \epsilon. \qquad (11) \]
So we can use B-spline functions to approximate the McKay pdf such that
\[ \left| f(x,y;\alpha_1,\sigma_{12},\alpha_2) - \sum_{i=1}^{n} w_i\, B_i(x,y) \right| \leq \delta, \qquad (12) \]
where the $B_i(x,y)$ are pre-specified bivariate basis functions defined on $\Omega = [0,b] \times [0,b]$, the $w_i$ are weights to be trained adaptively, and $\delta$ is a small number, generally larger than $\epsilon$.
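
A sketch of the approximation (12) with an illustrative choice of basis, a tensor product of linear B-spline hat functions, and weights fitted by least squares on a grid. The paper only assumes some pre-specified basis $B_i(x,y)$ with adaptively trained weights, so the basis, the bound $b$, the knot count and the fitting method here are all assumptions.

```python
# Sketch: approximate the McKay pdf on [0, b]^2 by sum_i w_i B_i(x, y), eq. (12).
import numpy as np
from scipy.special import gammaln

def mckay_pdf(x, y, a1, s12, a2):
    c = np.sqrt(a1 / s12)
    out = np.zeros_like(x, dtype=float)
    m = (y > x) & (x > 0)                      # density is zero outside 0 < x < y
    out[m] = np.exp((a1 + a2) * np.log(c) + (a1 - 1) * np.log(x[m])
                    + (a2 - 1) * np.log(y[m] - x[m]) - c * y[m]
                    - gammaln(a1) - gammaln(a2))
    return out

def hat(u, knots, i):
    """Linear B-spline (hat function) centred at knots[i]."""
    return np.interp(u, knots, np.eye(len(knots))[i])

b, nk = 12.0, 9                                # truncation bound and knots per axis (illustrative)
knots = np.linspace(0.0, b, nk)
xs, ys = np.meshgrid(np.linspace(0, b, 60), np.linspace(0, b, 60), indexing="ij")
f = mckay_pdf(xs, ys, 2.0, 1.5, 3.0).ravel()

# design matrix: one column per tensor-product basis function B_i(x, y)
cols = [hat(xs.ravel(), knots, i) * hat(ys.ravel(), knots, j)
        for i in range(nk) for j in range(nk)]
A = np.column_stack(cols)
w, *_ = np.linalg.lstsq(A, f, rcond=None)      # weights w_i
print(np.max(np.abs(A @ w - f)))               # sup error on the grid, playing the role of delta
```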

Square root approximation
This is numerically more robust than linear B-splines. It is used here to represent the coupled links between the three parameters and the McKay pdf $f$. Since the basis functions are pre-specified, different values of $\{\alpha_1, \sigma_{12}, \alpha_2\}$ will generate different sets of weights. Therefore, our approximation should be represented as
\[ f(x,y;\alpha_1,\sigma_{12},\alpha_2) = \sum_{i=1}^{n} w_i\, B_i(x,y) + e, \qquad (13) \]
where $|e| \leq \delta$.

Renormalization
Wang has used the following transformed representation
\[ f(x,y;\alpha_1,\sigma_{12},\alpha_2) = C(x,y)\, V_k + h(V_k)\, B_n(x,y) \qquad (14) \]
to guarantee that
\[ \int_0^b \int_0^b f(x,y;\alpha_1,\sigma_{12},\alpha_2)\, dx\, dy = 1, \]
where $V_k = (w_1, w_2, \ldots, w_{n-1})^T$ constitutes a vector of independent weights and $h(\cdot)$ is a known nonlinear function of $V_k$. With this format, the relationship between $V_k$ and $\{\alpha_1, \sigma_{12}, \alpha_2\}$ can be formulated.
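
The mechanism can be seen in the plain linear model, where fixing the last weight by the unit-measure constraint makes $h(V_k)$ affine; in the square-root parametrisation used by Wang, $h(\cdot)$ is genuinely nonlinear, but the bookkeeping is the same. The basis integrals in the demo below are made-up numbers.

```python
# Sketch: pin down the last weight so that the represented density has unit measure.
import numpy as np

def renormalised_weights(Vk, basis_integrals):
    """Return (w_1, ..., w_n) with sum_i w_i * integral(B_i) = 1."""
    Vk = np.asarray(Vk, float)
    c_ints, bn_int = basis_integrals[:-1], basis_integrals[-1]
    h = (1.0 - c_ints @ Vk) / bn_int          # h(V_k): weight of the last basis function B_n
    return np.append(Vk, h)

# three basis functions whose integrals over [0, b]^2 are 0.4, 0.35, 0.25 (illustrative)
ints = np.array([0.4, 0.35, 0.25])
w = renormalised_weights([0.9, 0.8], ints)
print(w, w @ ints)                            # the second number is 1.0 by construction
```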

Tuning procedure
The initial bivariate gamma distribution has parameters $\{\alpha_1(0), \sigma_{12}(0), \alpha_2(0)\}$ and the desired distribution has $\{\alpha_1(f), \sigma_{12}(f), \alpha_2(f)\}$. Let $g(x,y)$ denote the probability density of the bivariate gamma distribution with parameter set $\{\alpha_1(f), \sigma_{12}(f), \alpha_2(f)\}$. An effective trajectory for $V_k$ should be chosen to minimise the following performance function
\[ J = \frac{1}{2} \int_0^b \int_0^b g(x,y)\, \log\frac{K(x,y)}{g(x,y)}\, dx\, dy \qquad (15) \]
with $K(x,y) = C(x,y)\, V_k + h(V_k)\, B_n(x,y)$.
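
A sketch that evaluates (15) numerically on a grid over $[0,b]^2$. Here $g$ is the desired McKay density; for the demo, a McKay density with perturbed parameters stands in for the B-spline model $C(x,y)V_k + h(V_k)B_n(x,y)$, and $b$ and the grid size are illustrative.

```python
# Sketch: numerical evaluation of the performance function (15).
import numpy as np
from scipy.special import gammaln

def mckay_pdf(x, y, a1, s12, a2):
    c = np.sqrt(a1 / s12)
    out = np.zeros_like(x, dtype=float)
    m = (y > x) & (x > 0)
    out[m] = np.exp((a1 + a2) * np.log(c) + (a1 - 1) * np.log(x[m])
                    + (a2 - 1) * np.log(y[m] - x[m]) - c * y[m]
                    - gammaln(a1) - gammaln(a2))
    return out

def J(K, g, dx):
    m = (g > 0) & (K > 0)
    return 0.5 * np.sum(g[m] * np.log(K[m] / g[m])) * dx * dx

b, n = 40.0, 400
u = np.linspace(0.0, b, n)
X, Y = np.meshgrid(u, u, indexing="ij")
dx = u[1] - u[0]
g = mckay_pdf(X, Y, 2.0, 1.5, 3.0)     # desired density with parameters {alpha_1(f), sigma_12(f), alpha_2(f)}
K = mckay_pdf(X, Y, 2.3, 1.4, 3.2)     # current candidate density
print(J(K, g, dx))                     # <= 0, equal to zero exactly when K = g
```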

Gradient rule for iteration
\[ V_k = V_{k-1} - \eta\, \left.\frac{\partial J}{\partial V}\right|_{V = V_{k-1}}, \qquad (16) \]
where $k = 1, 2, \ldots$ represents the sample number and $\eta > 0$ is a pre-specified learning rate. Using the relationship between the weight vector and the three parameters, the adaptive tuning of the actual parameters in the pdf (14) can be readily formulated. The information arclength provides distance estimates between the current and target distributions.
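
A sketch of the update (16) with a finite-difference gradient. The quadratic $J$ below is only a stand-in for the performance function (15); $\eta$, the iteration count and the starting point are arbitrary.

```python
# Sketch: gradient iteration V_k = V_{k-1} - eta * dJ/dV, eq. (16).
import numpy as np

def grad(J, V, eps=1e-6):
    """Central-difference gradient of J at V."""
    V = np.asarray(V, float)
    g = np.zeros_like(V)
    for i in range(V.size):
        e = np.zeros_like(V)
        e[i] = eps
        g[i] = (J(V + e) - J(V - e)) / (2 * eps)
    return g

def tune(J, V0, eta=0.1, iters=200):
    V = np.asarray(V0, float)
    for _ in range(iters):
        V = V - eta * grad(J, V)          # the update (16)
    return V

# toy performance function with minimum at V = (1, 2), standing in for (15)
J = lambda V: (V[0] - 1.0) ** 2 + 0.5 * (V[1] - 2.0) ** 2
print(tune(J, [0.0, 0.0]))                # -> approximately [1, 2]
```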

Kullback-Leibler Divergence
For two probability density functions $p$ and $p'$ on an event space $\Omega$, the function
\[ KL(p, p') = \int_\Omega p(x)\, \log\frac{p(x)}{p'(x)}\, dx \qquad (17) \]
is called the Kullback-Leibler divergence or relative entropy. Instead of (15), we could consider (17) as the performance function; explicitly this would be
\[ W = \frac{1}{2} \int_0^b \int_0^b g(x,y)\, \log\frac{g(x,y)}{K(x,y)}\, dx\, dy \qquad (18) \]
with $K(x,y) = C(x,y)\, V_k + h(V_k)\, B_n(x,y)$.
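
A sketch evaluating (17) between two McKay densities by quadrature over the truncated domain $0 < x < y < b$, in the spirit of (18); $b$ and the grid size are illustrative choices.

```python
# Sketch: Kullback-Leibler divergence (17) between two McKay densities by quadrature.
import numpy as np
from scipy.special import gammaln

def log_mckay(x, y, a1, s12, a2):
    c = np.sqrt(a1 / s12)
    return ((a1 + a2) * np.log(c) + (a1 - 1) * np.log(x)
            + (a2 - 1) * np.log(y - x) - c * y - gammaln(a1) - gammaln(a2))

def kl_mckay(p_params, q_params, b=40.0, n=400):
    u = np.linspace(1e-6, b, n)
    X, Y = np.meshgrid(u, u, indexing="ij")
    m = Y > X                                   # the common support 0 < x < y
    dx = u[1] - u[0]
    lp = log_mckay(X[m], Y[m], *p_params)
    lq = log_mckay(X[m], Y[m], *q_params)
    return np.sum(np.exp(lp) * (lp - lq)) * dx * dx

print(kl_mckay((2.0, 1.5, 3.0), (2.5, 1.2, 3.5)))   # >= 0, zero iff the two densities coincide
```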