Information Geometry

1 2015 Workshop on High-Dimensional Statistical Analysis, Dec. 11 (Friday) to 15 (Tuesday), Humanities and Social Sciences Center, Academia Sinica, Taiwan. Information Geometry and Spontaneous Data Learning. Shinto Eguchi, Institute of Statistical Mathematics, Japan. This talk is based on joint work with Osamu Komori and Atsumi Ohara, University of Fukui.

2 Outline: short review of information geometry; Kolmogorov-Nagumo mean; φ-path in a function space; generalized mean and variance; divergence geometry; U-divergence geometry; minimum U-divergence and density estimation.

3 A short review of IG. Nonparametric space; space of statistics. Information geometry is discussed on the product space.

4 Bartlett's identity. Parametric model; Bartlett's first identity; Bartlett's second identity.
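For reference, the two identities in their standard textbook form, for a regular parametric model with density $f(x,\theta)$. The notation ($\theta$, $\partial_i = \partial/\partial\theta_i$, $\mathbb{E}_\theta$) is an assumption of this note and is not taken from the slide.

```latex
% First Bartlett identity: the score has mean zero.
\mathbb{E}_\theta\!\left[\partial_i \log f(X,\theta)\right] = 0
% Second Bartlett identity: expected Hessian plus score covariance vanishes.
\mathbb{E}_\theta\!\left[\partial_i \partial_j \log f(X,\theta)\right]
  + \mathbb{E}_\theta\!\left[\partial_i \log f(X,\theta)\,
      \partial_j \log f(X,\theta)\right] = 0
```

Both follow by differentiating $\int f(x,\theta)\,dx = 1$ once and twice under the integral sign.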

5 Metric and connections. Information metric; mixture connection; exponential connection. Rao (1945), Dawid (1975), Amari (1982).

6 Geodesic curves and surfaces in IG. m-geodesic curve; e-geodesic curve; m-geodesic surface; e-geodesic surface.

7 Kullback-Leibler (K-L) divergence. 1. KL divergence is the expected log-likelihood ratio. 2. Maximum likelihood is minimum KL divergence, Akaike (1974). 3. KL divergence induces the m-connection and the e-connection, Eguchi (1983).
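Item 2 can be checked numerically on a toy case. The sketch below assumes a Bernoulli model and made-up data (7 successes in 10 trials). All function names are hypothetical and chosen for illustration only. It compares the grid minimizer of the KL divergence from the empirical distribution with the grid maximizer of the log-likelihood.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bernoulli(theta):
    """Probability vector (P[X=0], P[X=1]) of a Bernoulli(theta) variable."""
    return [1.0 - theta, theta]

# Hypothetical data: k successes in n trials.
n, k = 10, 7
empirical = bernoulli(k / n)

def log_likelihood(theta):
    """Bernoulli log-likelihood of the hypothetical sample."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

# Search the same grid with both criteria.
grid = [i / 100 for i in range(1, 100)]
theta_kl = min(grid, key=lambda t: kl_divergence(empirical, bernoulli(t)))
theta_ml = max(grid, key=log_likelihood)
print(theta_kl, theta_ml)
```

The two criteria agree because the log-likelihood equals a constant minus n times the KL divergence from the empirical distribution to the model.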

8 Pythagoras Thm. Amari-Nagaoka (2001). Pf.
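The slide's formulas are lost, so the following is the standard statement for the KL divergence, written with the convention $D(p,q) = \int p \log(p/q)\,dx$. The symbols $f$, $g$, $h$ follow the triangle labels used later in the talk; treating them as the three vertices here is an assumption.

```latex
% Direct computation of the defect in the Pythagorean relation:
D(f,h) - D(f,g) - D(g,h)
  = \int \{f(x) - g(x)\}\,\{\log g(x) - \log h(x)\}\,dx .
% Hence, if the m-geodesic from f to g is orthogonal at g
% to the e-geodesic from g to h, the right-hand side vanishes and
D(f,h) = D(f,g) + D(g,h) .
```

The integrand pairs the m-direction $f - g$ with the e-direction $\log g - \log h$, which is why mixture and exponential geodesics appear together in the theorem.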

9 Exponential model. Mean parameter. Amari (1982). Degenerated Bartlett identity.

10 Exponential model. Mean equal space; minimum KL leaf.

11 Pythagoras foliation

12 log + exp (log & exp): Bartlett identities; KL-divergence; e-connection; m-connection; e-geodesic; m-geodesic; Pythagoras identity; exponential model; mean equal space; Pythagoras foliation.

13 Path geometry: { m-geodesic, e-geodesic, ... }

14 Kolmogorov-Nagumo mean. The K-N mean is defined for positive numbers.
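A minimal sketch of the K-N mean in its usual form, M(x_1, ..., x_n) = φ^{-1}(Σ w_i φ(x_i)) for a strictly monotone generator φ. The slide's own formula is lost, so this form, the function name `kn_mean`, and the optional weights are assumptions of this example.

```python
import math

def kn_mean(values, phi, phi_inv, weights=None):
    """Kolmogorov-Nagumo mean: phi_inv of the weighted average of phi(values).

    phi must be strictly monotone on the positive reals and phi_inv its
    inverse. Weights default to uniform and are expected to sum to one.
    """
    n = len(values)
    if weights is None:
        weights = [1.0 / n] * n
    return phi_inv(sum(w * phi(v) for w, v in zip(weights, values)))

xs = [1.0, 4.0, 16.0]

# phi = identity gives the arithmetic mean.
arithmetic = kn_mean(xs, lambda x: x, lambda y: y)
# phi = log gives the geometric mean.
geometric = kn_mean(xs, math.log, math.exp)
# phi(x) = 1/x gives the harmonic mean.
harmonic = kn_mean(xs, lambda x: 1.0 / x, lambda y: 1.0 / y)

print(arithmetic, geometric, harmonic)
```

Changing only the generator φ moves between the classical means, which is the mechanism the later slides apply to density functions instead of numbers.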

15 K-N mean in Y. Def. K-N mean. Cf. Naudts (2009).

16 φ-path. Def. φ-path connecting f and g. Thm. (Pf.) lim c 1 ( c )
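The definition on this slide did not survive extraction. The sketch below assumes the form f_t = φ^{-1}((1 - t) φ(f) + t φ(g) - κ_t), where κ_t is a normalizing constant, and works with discrete densities on a finite set. The function name `phi_path` and the bisection search for κ_t are illustrative choices, not the authors' construction. It also assumes φ^{-1} is increasing and defined on the whole real line, which holds for the two generators used here.

```python
import math

def phi_path(f, g, t, phi, phi_inv):
    """Point at parameter t on an assumed phi-path between densities f and g.

    Computes phi_inv((1 - t) * phi(f) + t * phi(g) - kappa) cell by cell,
    with kappa found by bisection so that the result sums to one.
    """
    a = [(1.0 - t) * phi(fi) + t * phi(gi) for fi, gi in zip(f, g)]

    def total(kappa):
        return sum(phi_inv(ai - kappa) for ai in a)

    # total() is decreasing in kappa; widen the bracket until it contains 1.
    lo, hi = -1.0, 1.0
    while total(lo) < 1.0:
        lo *= 2.0
    while total(hi) > 1.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    kappa = 0.5 * (lo + hi)
    return [phi_inv(ai - kappa) for ai in a]

f = [0.2, 0.3, 0.5]
g = [0.6, 0.3, 0.1]

# phi = log: the path is the normalized geometric mixture (e-geodesic).
e_point = phi_path(f, g, 0.5, math.log, math.exp)
# phi = identity: the path is the ordinary mixture (m-geodesic), kappa = 0.
m_point = phi_path(f, g, 0.5, lambda x: x, lambda y: y)
print(e_point, m_point)
```

With φ = log the result is proportional to f^{1-t} g^t, and with the identity it is (1 - t) f + t g, so the two classical geodesics appear as special cases of one formula.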

17 Examples of φ-path: Exm 0; Exm 1; Exm 2; Exm 3.

18 Identities of φ-density. Model. 1st identity. 2nd identity.

19 Generalized mean and variance. Def. Note.

20 Generalized mean and variance. Exm.

21 Bartlett identity. Model; Bartlett identity; Bartlett identities.

22 Tangent space of Y. Tangent space; Riemannian metric. Expectation gives the tangent space. Topological properties of T_f depend on φ. If φ = log, then T_f is too large to do statistics on Y. Cf. Pistone (1992).

23 Def. Parallel transport. A vector field is parallel along a curve. A curve is φ-geodesic. Cf. Amari (1982).

24 φ-geodesic. Thm. If is the φ-geodesic curve. Proof.

25 φ-divergence. φ-cross entropy; φ-entropy; φ-divergence. Note: φ-divergence is KL-divergence if φ = log.

26 Divergence geometry. Def. Let M be a statistical model, with the Riemannian metric on M and the pair of affine connections on M:

27 φ-divergence geometry. The metric; affine connection pair.

28 Pythagorean theorem. Thm. Pf. Figure: triangle with vertices f, g, h and two geodesic sides.

29 φ-Pythagorean foliation. φ-mean equal space.

30 φ-mean: φ-Bartlett identities; φ-divergence; φ-connection pair; φ-geodesic pair; φ-Pythagoras identity; φ-model; φ-Pythagoras foliation; φ-mean equal space. What is the statistical meaning of the φ-mean and φ-variance?

31 U-divergence. U-cross-entropy; U-entropy; U-divergence. Note. Exm.
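The slide's formulas are lost. The sketch below assumes the usual definition D_U(g, f) = ∫ {U(ξ(f)) - U(ξ(g)) - g (ξ(f) - ξ(g))} dx with ξ = (U')^{-1}, evaluated on discrete densities. The symbol ξ, the function names, and the power-type generator with index β are assumptions of this example.

```python
import math

def u_divergence(g, f, U, xi):
    """Assumed U-divergence D_U(g, f) on a finite set.

    U is a convex generator and xi is the inverse of its derivative.
    """
    return sum(
        U(xi(fi)) - U(xi(gi)) - gi * (xi(fi) - xi(gi))
        for gi, fi in zip(g, f)
    )

def power_generator(beta):
    """Power-type generator U(t) = (1 + beta*t)^((1+beta)/beta) / (1+beta).

    Its derivative is (1 + beta*t)^(1/beta), so xi(s) = (s^beta - 1) / beta.
    """
    U = lambda t: (1.0 + beta * t) ** ((1.0 + beta) / beta) / (1.0 + beta)
    xi = lambda s: (s ** beta - 1.0) / beta
    return U, xi

g = [0.2, 0.3, 0.5]
f = [0.6, 0.3, 0.1]

# U = exp, xi = log: recovers the KL divergence for normalized densities.
d_exp = u_divergence(g, f, math.exp, math.log)
kl = sum(gi * math.log(gi / fi) for gi, fi in zip(g, f))

# Power generator with a moderate and a small index.
U_half, xi_half = power_generator(0.5)
d_half = u_divergence(g, f, U_half, xi_half)
U_small, xi_small = power_generator(0.001)
d_small = u_divergence(g, f, U_small, xi_small)

print(d_exp, kl, d_half, d_small)
```

Convexity of U makes every summand nonnegative, and letting the power index tend to zero brings the power-type divergence back to the KL divergence.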

32 U-divergence geometry. The metric associated with U-divergence. Affine connections associated with U-divergence. Thm. (i) U-geodesic is mixture geodesic. (ii) U*-geodesic is φ-geodesic.

33 U-geometry = φ-geometry. φ-metric on a model M; U-metric on a model M; φ-connection; U*-connection.

34 Triangle with D_U. Thm. Pf. Figure: triangle with vertices f, g, h; the sides are a mixture geodesic and a φ-geodesic.

35 U-loss function and U-estimation. Let g(x) be a data density function with a statistical model. U-empirical loss function; U-estimator for the parameter.

36 U-estimator under φ-model. φ-model; U-empirical loss function; U-estimator for the parameter. The U-estimator under the φ-model is analogous to the MLE under the exponential model.

37 Potential function. Def. The potential function on the φ-model. Note. Definition of the mean parameter. Cf. mean parameter. Thm. The U-estimator for the mean parameter is given by the sample mean.

38 Pythagoras foliation. Thm. Pf.

39 Pythagoras foliation

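40 U-Boost learning for density estimation.
U-loss function: $L_U(f) = -\frac{1}{n}\sum_{i=1}^{n} \phi(f(x_i)) + \int_{\mathbb{R}^p} U(\phi(f(x)))\,dx$
Dictionary of density functions: $W = \{ g_\lambda(x) : g_\lambda(x) \ge 0,\ \int g_\lambda(x)\,dx = 1,\ \lambda \in \Lambda \}$
Learning space = φ-model: $W_U^* = \phi^{-1}(\mathrm{co}(\phi(W))) = \{ \phi^{-1}(\sum_\lambda \pi_\lambda \phi(g_\lambda(x))) \}$
Let $f(x, \pi) = \phi^{-1}(\sum_\lambda \pi_\lambda \phi(g_\lambda(x)))$. Then $f(x, (0, \ldots, 1, \ldots, 0)) = g_\lambda(x)$.
Goal: find $f^* = \operatorname{argmin}_{f \in W_U^*} L_U(f)$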

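41 U-Boost algorithm.
(A) Find $f_1 = \operatorname{argmin}_{g \in W} L_U(g)$.
(B) Update $f_k \to f_{k+1} = \phi^{-1}((1 - \pi_{k+1})\,\phi(f_k) + \pi_{k+1}\,\phi(g_{k+1}))$, such that $(\pi_{k+1}, g_{k+1}) = \operatorname{argmin}_{(\pi, g) \in (0,1) \times W} L_U(\phi^{-1}((1 - \pi)\,\phi(f_k) + \pi\,\phi(g)))$.
(C) Select K, and $\hat{f} = \phi^{-1}((1 - \pi_K)\,\phi(f_{K-1}) + \pi_K\,\phi(g_K))$.
Example 2. Power entropy: $f^*(x) = (\sum_k \pi_k\, g_k(x)^{\beta})^{1/\beta}$.
If $\beta = 0$, $f^*(x) = \exp(\sum_k \pi_k \log g_k(x))$, Friedman et al. (1984).
If $\beta = 1$, $f^*(x) = \sum_k \pi_k\, g_k(x)$, Klemela (2007).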

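42 Inner step in the convex hull.
$W = \{ t(x, \lambda) : \lambda \in \Lambda \}$, $W_U^* = \phi^{-1}(\mathrm{co}(\phi(W)))$
Goal: $f^* = \operatorname{argmin}_{f \in W_U^*} L_U(f)$
$\hat{f}(x) = \phi^{-1}(\hat{\pi}_1\,\phi(\hat{f}_1(x)) + \cdots + \hat{\pi}_k\,\phi(\hat{f}_k(x)))$
Figure: the convex hull $W_U^*$ spanned by dictionary elements g_1, ..., g_7 in W, with the target f*(x) and the estimate f^(x) marked inside.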

43 Non-asymptotic bound.
Theorem. Assume that a data distribution has a density g(x) and that
(A) $\sup U''(\cdot)\{\cdot - \cdot\}^2 \le b_U^2$, with the supremum taken over $\mathrm{co}(\phi(W)) \times W \times W$.
Then we have
$\mathbb{E}\, D_U(g, \hat{f}) \le FA(g, W_U^*) + EE(g, W) + IE(K, W)$,
where
$FA(g, W_U^*) = \inf_{f \in W_U^*} D_U(g, f)$ (functional approximation),
$EE(g, W) = 2\,\mathbb{E}\{ \sup_{f \in W} | \frac{1}{n}\sum_{i=1}^{n} \phi(f(x_i)) - \mathbb{E}_g\,\phi(f) | \}$ (estimation error),
$IE(K, W) = \frac{c^2 b_U^2}{K + c - 1}$ (iteration effect; c: step-length constant).
Remark. There is a trade-off between $FA(g, W_U^*)$ and $EE(g, W)$.

44 Figure: density estimates from two Boost variants, KDE, and RSDE (Girolami-He, 2004) on test densities C (skewed-unimodal), H (bimodal), and L (quadrimodal).

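45 Conclusion. K-N mean: $E^{(\phi)}$, $\mathrm{Cov}^{(\phi)}$. φ-path = φ-geodesic. φ-divergence: $(G^{(\phi)}, \nabla^{*(\phi)}, \nabla^{(\phi)})$ with its pair of geodesics. U-divergence: $(G^{(U)}, \nabla^{*(U)}, \nabla^{(U)})$ with the φ-geodesic and the m-geodesic. φ-path: $W_U^* = \phi^{-1}(\mathrm{co}(\phi(W)))$.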

46 Future problems. Tangent space; path space.

47 Future problems. φ-mean, φ-variance, φ-divergence: these are natural ideas from the IG viewpoint. We can build φ-efficiency in estimation theory, but what is the statistical meaning of the φ-mean and φ-variance? Can we define a φ-version of a random sample?

48 Thank you
