Information Geometry


2015 Workshop on High-Dimensional Statistical Analysis, Dec. 11 (Friday) to 15 (Tuesday), Humanities and Social Sciences Center, Academia Sinica, Taiwan. Information Geometry and Spontaneous Data Learning. Shinto Eguchi, Institute of Statistical Mathematics, Japan. This talk is based on joint work with Osamu Komori and Atsumi Ohara, University of Fukui.

Outline: a short review of information geometry; the Kolmogorov-Nagumo mean; the φ-path in a function space; generalized mean and variance; φ-divergence geometry; U-divergence geometry; minimum U-divergence and density estimation.

A short review of IG: information geometry is discussed on the product of the nonparametric space of density functions and the space of statistics.

Bartlett's identities. For a parametric model p(x, θ), Bartlett's first identity is E_θ[∂_θ log p(x, θ)] = 0, and Bartlett's second identity is E_θ[∂_θ² log p(x, θ)] + E_θ[(∂_θ log p(x, θ))(∂_θ log p(x, θ))^T] = 0.
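
A minimal numerical sketch (my own illustration, not from the slides) of both Bartlett identities for the mean parameter of a Gaussian model N(mu, sigma^2): the score has mean zero, and the expected Hessian of the log-likelihood plus the expected squared score vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score and second derivative of log N(x | mu, sigma^2) with respect to mu
score = (x - mu) / sigma**2          # d/dmu log p
hess = -np.ones_like(x) / sigma**2   # d^2/dmu^2 log p

print("1st identity, E[score]            :", score.mean())                     # ~ 0
print("2nd identity, E[hess] + E[score^2]:", hess.mean() + (score**2).mean())  # ~ 0
```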

Metric and connections: the information (Fisher) metric, the mixture (m-) connection, and the exponential (e-) connection. Rao (1945), Dawid (1975), Amari (1982).
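
To make the information metric concrete, the following sketch (illustrative, with assumed variable names) estimates the Fisher metric of N(mu, sigma^2) in the (mu, sigma) coordinates by Monte Carlo and compares it with the analytic values diag(1/sigma^2, 2/sigma^2).

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 1.5
x = rng.normal(mu, sigma, size=500_000)

# Scores with respect to (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = ((x - mu)**2 - sigma**2) / sigma**3

S = np.stack([s_mu, s_sigma])   # 2 x n matrix of score vectors
G = S @ S.T / x.size            # empirical information metric E[s s^T]

print(G)                        # ~ [[1/sigma^2, 0], [0, 2/sigma^2]]
print(1 / sigma**2, 2 / sigma**2)
```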

Geodesic curves and surfaces in IG: m-geodesic curves, e-geodesic curves, m-geodesic surfaces, e-geodesic surfaces.

Kullback-Leibler divergence. 1. The KL divergence is the expected log-likelihood ratio, KL(f, g) = E_f[log(f/g)]. 2. Maximum likelihood is minimum KL divergence (Akaike, 1974). 3. The KL divergence induces the m-connection and the e-connection (Eguchi, 1983).
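
A small check of point 2 (maximum likelihood is minimum KL divergence), assuming a categorical model: the KL divergence from the empirical distribution to the model equals the negative log-likelihood minus the empirical entropy, so both criteria share the same minimizer. The model q and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = np.array([0.5, 0.3, 0.2])
x = rng.choice(3, size=10_000, p=p_true)
emp = np.bincount(x, minlength=3) / x.size   # empirical distribution

def kl(p, q):
    return np.sum(p * np.log(p / q))

def neg_loglik(q):
    return -np.mean(np.log(q[x]))

q = np.array([0.4, 0.4, 0.2])
# KL(emp || q) = neg_loglik(q) - entropy(emp): minimizing one minimizes the other
print(kl(emp, q), neg_loglik(q) - (-np.sum(emp * np.log(emp))))
```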

Pythagoras theorem (Amari-Nagaoka, 2001): if the m-geodesic connecting p and r and the e-geodesic connecting r and q meet orthogonally at r, then KL(p, q) = KL(p, r) + KL(r, q). (Proof.)
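
A minimal numerical check of the Pythagorean identity on a four-point sample space, assuming the standard statement above: q is a base distribution, r is its exponential tilt matching the mean constraint E[X] = c, and p is any other member of the mean-equal (m-flat) family. All names and values are illustrative.

```python
import numpy as np

xs = np.arange(4)                       # sample space {0, 1, 2, 3}
q = np.array([0.4, 0.3, 0.2, 0.1])      # base distribution
c = 2.0                                 # mean defining the m-flat family L = {p : E_p[X] = c}

def kl(a, b):
    return np.sum(a * np.log(a / b))

def tilt(theta):
    w = q * np.exp(theta * xs)
    return w / w.sum()

# e-projection of q onto L: r = q * exp(theta * x) / Z with E_r[X] = c (bisection on theta)
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if (tilt(mid) @ xs) < c else (lo, mid)
r = tilt((lo + hi) / 2)

# any other p in L: perturb r along directions that preserve total mass and mean
p = r + 0.02 * np.array([1, -2, 1, 0]) + 0.01 * np.array([0, 1, -2, 1])

print(kl(p, q), kl(p, r) + kl(r, q))    # the two sides agree
```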

Exponential model: p(x, θ) = exp(θ^T t(x) − ψ(θ)), with mean parameter μ = E_θ[t(x)] = ∇ψ(θ); cf. Amari (1982). Degenerate Bartlett identity.

Exponential model: the mean-equal space and the minimum-KL leaf.

Pythagoras foliation.

log & exp: Bartlett identities, KL-divergence, e-connection and m-connection, e-geodesic and m-geodesic, Pythagoras identity, exponential model, mean-equal space, Pythagoras foliation.

Path geometry: { m-geodesic, e-geodesic, ... }.

Kolmogorov-Nagumo mean: the K-N mean of positive numbers x_1, ..., x_n with generator φ is φ^{-1}((1/n) Σ_i φ(x_i)).

K-N mean in Y. Def.: the K-N mean on the function space Y; cf. Naudts (2009).
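
A tiny sketch of the Kolmogorov-Nagumo (quasi-arithmetic) mean, assuming the usual form phi^{-1}((1/n) sum_i phi(x_i)); phi = identity gives the arithmetic mean, phi = log the geometric mean, and phi(t) = 1/t the harmonic mean. Function names are illustrative.

```python
import numpy as np

def kn_mean(x, phi, phi_inv):
    """Kolmogorov-Nagumo mean: phi_inv of the average of phi(x)."""
    return phi_inv(np.mean(phi(np.asarray(x, dtype=float))))

x = [1.0, 2.0, 4.0, 8.0]
print(kn_mean(x, lambda t: t, lambda t: t))        # arithmetic mean 3.75
print(kn_mean(x, np.log, np.exp))                  # geometric mean ~2.83
print(kn_mean(x, lambda t: 1 / t, lambda t: 1 / t))  # harmonic mean ~2.13
```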

φ-path. Def.: the φ-path connecting f and g. Thm on the normalizing constant c (with proof).

Examples of the φ-path: Examples 0 to 3.
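
A sketch of what a phi-path between two densities could look like numerically, assuming interpolation in the phi-representation, phi(f_t) = (1 - t) phi(f) + t phi(g), followed by renormalization on a grid; phi = log then gives the e-geodesic (geometric mixture) and phi = identity the m-geodesic (ordinary mixture). The grid and the normalization step are my own simplifications.

```python
import numpy as np

grid = np.linspace(-6, 6, 2001)
dx = grid[1] - grid[0]

def gauss(mu, sigma):
    d = np.exp(-(grid - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return d / (d.sum() * dx)                 # renormalize on the grid

def phi_path(f, g, t, phi, phi_inv):
    """Interpolate in the phi-representation, then renormalize to a density."""
    h = phi_inv((1 - t) * phi(f) + t * phi(g))
    return h / (h.sum() * dx)

f, g = gauss(-2.0, 1.0), gauss(2.0, 1.0)
mid_e = phi_path(f, g, 0.5, np.log, np.exp)             # e-geodesic midpoint: unimodal near 0
mid_m = phi_path(f, g, 0.5, lambda t: t, lambda t: t)   # m-geodesic midpoint: bimodal

print(grid[np.argmax(mid_e)], mid_m.max())
```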

Identities of the φ-density: the model, the first identity, and the second identity.

Generalized mean and variance: definition and note.

Generalized mean and variance: example.

Bartlett identity for the φ-model: Bartlett identities.

Tangent space of Y: the tangent space and its Riemannian metric; expectation gives the tangent space. The topological properties of T_f depend on φ: if φ = log, then T_f is too large to do statistics on Y. Cf. Pistone (1992).

Def. Parallel transport: a vector field is parallel along a curve; a curve is a φ-geodesic. Cf. Amari (1982).

φ-geodesic. Thm: characterization of the φ-geodesic curve (with proof).

φ-divergence: the φ-cross entropy, the φ-entropy, and the φ-divergence. Note: the φ-divergence is the KL-divergence if φ = log.

Divergence geometry. Def.: let M be a statistical model; a divergence on M induces a Riemannian metric on M and a pair of affine connections on M.

φ-divergence geometry: the induced metric and the pair of affine connections.

Pythagorean theorem. Thm: when the φ-geodesic through f, g and the dual geodesic through g, h meet orthogonally at g, the φ-divergences satisfy D(f, h) = D(f, g) + D(g, h). (Proof.)

φ-Pythagorean foliation and the φ-mean equal space.

φ-mean, φ-Bartlett identities, φ-divergence, φ-connection and dual φ-connection, φ-geodesic and dual φ-geodesic, φ-Pythagoras identity, φ-model, φ-Pythagoras foliation, φ-mean equal space. What is the statistical meaning of the φ-mean and the φ-variance?

U-divergence: for a convex function U with ξ = (U')^{-1}, the U-cross-entropy is C_U(f, g) = ∫ { U(ξ(g(x))) − f(x) ξ(g(x)) } dx, the U-entropy is H_U(f) = C_U(f, f), and the U-divergence is D_U(f, g) = C_U(f, g) − H_U(f). Note and example: U = exp gives the KL-divergence.
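
A numerical sketch of the U-divergence on a grid, following the definitions above, D_U(f, g) = ∫ { U(ξ(g)) − U(ξ(f)) − f (ξ(g) − ξ(f)) } dx: with U = exp (ξ = log) it reproduces the closed-form KL divergence between two Gaussians, and a power-type U gives a density-power (beta) divergence. The grid setup and the specific beta are illustrative.

```python
import numpy as np

grid = np.linspace(-8, 8, 4001)
dx = grid[1] - grid[0]

def gauss(mu, sigma):
    return np.exp(-(grid - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def u_divergence(f, g, U, xi):
    """D_U(f, g) = integral of U(xi(g)) - U(xi(f)) - f * (xi(g) - xi(f))."""
    return np.sum(U(xi(g)) - U(xi(f)) - f * (xi(g) - xi(f))) * dx

f, g = gauss(0.0, 1.0), gauss(1.0, 1.5)

# U = exp, xi = log: the U-divergence is the KL divergence
kl_closed_form = np.log(1.5) + (1 + 1.0**2) / (2 * 1.5**2) - 0.5
print(u_divergence(f, g, np.exp, np.log), kl_closed_form)

# A power-type U (beta = 0.5) gives a density-power (beta) divergence
beta = 0.5
U_beta = lambda t: (1 + beta * t) ** ((1 + beta) / beta) / (1 + beta)
xi_beta = lambda s: (s ** beta - 1) / beta
print(u_divergence(f, g, U_beta, xi_beta))
```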

U-divergence geometry: the metric associated with the U-divergence and the affine connections associated with the U-divergence. Thm: (i) the U-geodesic is the mixture geodesic; (ii) the U*-geodesic is the φ-geodesic.

U-geometry = φ-geometry: the φ-metric on a model M coincides with the U-metric on M, and the φ-connection corresponds to the U*-connection.

Triangle with D_U. Thm: Pythagorean relation for D_U when the mixture geodesic and the φ-geodesic through f, g, h meet orthogonally at g. (Proof.)

U-loss function and U-estimation. Let g(x) be the data density and {f_θ : θ ∈ Θ} a statistical model. The U-empirical loss function is L_U(θ) = −(1/n) Σ_{i=1}^n ξ(f_θ(x_i)) + ∫ U(ξ(f_θ(x))) dx, and the U-estimator for θ minimizes L_U(θ).
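
A sketch of U-estimation for a location model N(theta, 1), assuming the empirical U-loss above; with U = exp the integral term is constant, so the U-loss is the negative log-likelihood up to a constant and the U-estimator reduces to the MLE (the sample mean). The grid search over theta is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(1.3, 1.0, size=500)
grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]

def model(theta, x):
    return np.exp(-(x - theta)**2 / 2) / np.sqrt(2 * np.pi)   # N(theta, 1)

def u_loss(theta, U, xi):
    empirical = -np.mean(xi(model(theta, data)))
    penalty = np.sum(U(xi(model(theta, grid)))) * dx          # integral of U(xi(f_theta))
    return empirical + penalty

thetas = np.linspace(0.0, 3.0, 301)

# U = exp, xi = log: U-loss = negative log-likelihood + 1, so argmin ~ sample mean (MLE)
losses = [u_loss(t, np.exp, np.log) for t in thetas]
print(thetas[np.argmin(losses)], data.mean())
```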

U-estimator under the φ-model: the φ-model, its U-empirical loss function, and the U-estimator for θ. The U-estimator under the φ-model is analogous to the MLE under the exponential model.

Potential function. Def.: the potential function on the φ-model. Note: the mean parameter is defined through the potential function; cf. the mean parameter of an exponential model. Thm: the U-estimator for the mean parameter is given by the sample mean.

Pythagoras foliation. Thm (with proof).

Pythagoras foliation (figure).

U-Boost learning for density estimation. U-loss function:
L_U(f) = −(1/n) Σ_{i=1}^n ξ(f(x_i)) + ∫ U(ξ(f(x))) dx.
Dictionary of density functions: W = { g_λ(x) : g_λ(x) ≥ 0, ∫ g_λ(x) dx = 1, λ ∈ Λ }.
Learning space (a φ-model): W*_U = ξ^{-1}(co(ξ(W))) = { ξ^{-1}( Σ_k π_k ξ(g_{λ_k}(x)) ) }.
Let f(x, π) ∈ W*_U; then f(x, (0, ..., 1, ..., 0)) = g_{λ_k}(x).
Goal: find f* = argmin_{f ∈ W*_U} L_U(f).

U-Boost algorithm.
(A) Find f_1 = argmin_{g ∈ W} L_U(g).
(B) Update f_k → f_{k+1} = ξ^{-1}( (1 − π_{k+1}) ξ(f_k) + π_{k+1} ξ(g_{k+1}) ), where (π_{k+1}, g_{k+1}) = argmin_{(π, g) ∈ (0,1) × W} L_U( ξ^{-1}( (1 − π) ξ(f_k) + π ξ(g) ) ).
(C) Select K and set f̂ = ξ^{-1}( (1 − π_K) ξ(f_{K−1}) + π_K ξ(g_K) ).
Example 2 (power entropy): the estimate aggregates the dictionary densities g_k with weights π_k; at power index 0, f*(x) ∝ exp( Σ_k π_k log g_k(x) ), and at power index 1, f*(x) = Σ_k π_k g_k(x). Klemelä (2007), Friedman et al. (1984).

Inner step in the convex hull. W = { t(x, λ) : λ ∈ Λ }, W*_U = ξ^{-1}(co(ξ(W))). Goal: f* = argmin_{f ∈ W*_U} L_U(f); each step combines the current fit f̂_k with one dictionary element inside the convex hull in the ξ-representation. (Figure: dictionary densities g_1, ..., g_7, the learning space W*_U, and the fits f(x) and f*(x).)
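
A toy sketch of the U-Boost greedy update as read from the algorithm slide: start from the best single dictionary density, then repeatedly combine the current fit with one dictionary element in the ξ-representation, ξ(f_{k+1}) = (1 − π) ξ(f_k) + π ξ(g), choosing (π, g) to reduce the empirical U-loss. Here U = exp (so ξ = log and the combination is an unnormalized geometric mixture, with normalization handled by the integral term of the loss); the dictionary, grid, and coarse search over π are my own simplifications.

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(2, 0.7, 300)])
grid = np.linspace(-8, 8, 2001)
dx = grid[1] - grid[0]

def gauss(mu, sigma, x):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Dictionary of candidate densities (Gaussian bumps at fixed locations)
dictionary = [(mu, 1.0) for mu in np.linspace(-4, 4, 17)]

def u_loss(f_grid, f_data):
    # U = exp, xi = log: empirical U-loss = -(1/n) sum log f(x_i) + integral of f
    return -np.mean(np.log(f_data)) + np.sum(f_grid) * dx

def combine(f_grid, f_data, g, pi):
    # xi-combination: log f_new = (1 - pi) log f + pi log g, evaluated on grid and data
    new_grid = np.exp((1 - pi) * np.log(f_grid) + pi * np.log(gauss(*g, grid)))
    new_data = np.exp((1 - pi) * np.log(f_data) + pi * np.log(gauss(*g, data)))
    return new_grid, new_data

# (A) best single dictionary element, then (B) greedy convex-hull updates
f_grid, f_data = min(((gauss(*g, grid), gauss(*g, data)) for g in dictionary),
                     key=lambda fg: u_loss(*fg))
for k in range(10):
    candidates = [combine(f_grid, f_data, g, pi)
                  for g in dictionary for pi in (0.1, 0.3, 0.5)]
    f_grid, f_data = min(candidates, key=lambda fg: u_loss(*fg))
    print(k, u_loss(f_grid, f_data))
```

The coarse grid over π stands in for the one-dimensional line search of step (B); any standard scalar optimizer could replace it.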

Non-asymptotic bound. Theorem: assume the data distribution has a density g(x) and that condition (A) holds, a bound on sup U''(·) over co(ξ(W)). Then
E[ D_U(g, f̂_K) ] ≤ FA(g, W*_U) + EE(g, W) + IE(K, W),
where FA(g, W*_U) = inf_{f ∈ W*_U} D_U(g, f) is the functional approximation error, EE(g, W) = 2 E[ sup_{f ∈ W} | (1/n) Σ_{i=1}^n ξ(f(x_i)) − E_g[ξ(f(X))] | ] is the estimation error, and IE(K, W) ≤ c² b_U² / K is the iteration effect (c: step-length constant).
Remark: there is a trade-off between FA(g, W*_U) and EE(g, W).

Comparison of U-Boost with KDE and RSDE (Girolami-He, 2004) on test densities C (skewed-unimodal), H (bimodal), and L (quadrimodal). (Figure.)

Conclusion: the K-N mean yields the generalized mean E_φ(·) and covariance Cov_φ(·); the φ-path is the φ-geodesic; the φ-divergence induces the geometry (G^(φ), ∇^(φ), ∇^(φ*)) with its pair of geodesics; the U-divergence induces (G^(U), ∇^(U), ∇^(U*)) with the φ-geodesic and the m-geodesic; the φ-path learning space is W*_U = ξ^{-1}(co(ξ(W))).

Future problems: tangent space and path space.

Future problems: φ-mean, φ-variance, φ-divergence. These are natural ideas from the IG viewpoint, and we can build a notion of φ-efficiency in estimation theory, but what is the statistical meaning of the φ-mean and φ-variance? Can we define a φ-version of a random sample?

Thank you.