Parametrized Measure Models

Size: px

Start display at page:

Download "Parametrized Measure Models"

Valentine O’Brien’
6 years ago
Views:

Parametrized Measure Models Nihat Ay Jürgen Jost Hông Vân Lê Lorenz Schwachhöfer SFI WORKING PAPER: 2015-10-040 SFI Working Papers

We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared

Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration

NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the

1 Parametrized Measure Models Nihat Ay Jürgen Jost Hông Vân Lê Lorenz Schwachhöfer SFI WORKING PAPER: SFI Working Papers contain accounts of scienti5ic work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant. NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder. SANTA FE INSTITUTE

2 Parametrized measure models Nihat Ay, Jürgen Jost, Hông Vân Lê, Lorenz Schwachhöfer October 25, 2015 Abstract We develope a new and general notion of parametric measure models and statistical models on an arbitrary sample space. This is given by a diffferentiable map from the parameter manifold M into the set of finite measures or probability measures on, respectively, which is differentiable when regarded as a map into the Banach space of all signed measures on. Furthermore, we also give a rigorous definition of roots of measures and give a natural definition of the Fisher metric and the Amari-Chentsov tensor as the pullback of tensors defined on the space of roots of measures. We show that many features such as the preservation of this tensor under sufficient statistics and the monotonicity formula hold even in this very general set-up. MSC2010: 53C99, 62B05 Keywords: Fisher quadratic form, Amari-Chentsov tensor, sufficient statistic, monotonicity 1 Introduction Information geometry is concerned with the use of differential geometric methods in probability theory. An important object of investigation are families of probability measures or, more generally, of finite measures on a given sample space which depend differentiably on a finite number of parameters. Associated to such a family there are two symmetric tensors on the parameter space M. The first is a quadratic form (i.e., a Riemannian metric), called the Fisher metric g F, and the second is a 3-tensor, called the Amari-Chentsov tensor T. The Fisher metric was first suggested by Rao [19], followed by Jeffreys [12], Efron [11] and then systematically developed by Chentsov and Morozova [8], [9] and [16]; the Amari-Chentsov tensor and its significance was discovered by Amari [1], [2] and Chentsov [10]. These tensors are of interest from the differential geometric point of view as they do not depend on the particular choice of parametrization of the family, but they are also natural objects from the point of view of statistics, as they are unchanged under sufficient statistics and are in fact characterized by this property; this was shown in the case of finite sample spaces by Chentsov in [9] and more recently for general sample spaces in [5]. In fact, Chentsov not only showed the invariance of these tensors under sufficient statistics, but also under what he called congruent embeddings of probability measures. These are Markov kernels between finite sample spaces which are right inverses of a statistic. We use this property to give a precise definition of congruent embeddings between arbitrary sample spaces (cf. Definition 3.1). As it turns out, every Markov kernel induces a congruent embedding in this sense, but there are congruent embeddings which are not induced by Markov kernels, cf. Theorem

3 The main conceptual difficulty in the investigation of families of probability measures is the lack of a canonical manifold structure on the spaces M() and P() of finite measures and probability measures on. If is finite, then this issue is not a problem, as in this case a measure is given by finitely many parameters, allowing to identify M() with the positive orthant in R and P() with the intersection of this orthant with an affine hyperplane in R, so that both are (finite dimensional) manifolds in a canonical way. But this is no longer true if is infinite. Attempts have been made to provide P() and M() with a Banach manifold structure. For instance, Pistone and Sempi [18] equipped these spaces with a topology, the so-called e-topology. With this, P() and M() become Banach manifolds and have many remarkable features, see e.g. [7], [17]. On the other hand, the e-topology is very strong in the sense that many families of measures on fail to be continuous w.r.t. the e-topology, so it cannot be applied as widely as one would wish. Another approach was recently pursued by Bauer, Bruveris and Michor [6] under the assumption that is a manifold. In this case, the space of smooth densities also carries a natural topology, and they were able to show that the invariance under diffeomorphisms already suffices to characterize the Fisher metric of a family of such densities. In [5], the authors of the present article proposed to define parametrized measure models as a family given as p(ξ) = p(ω; ξ)µ. (1.1) for some reference measure µ and a positive function p on M which is differentiable in ξ M, an idea which closely follows the notion of Amari [1]. While this embraces many interesting families of measures, it is still restricted as it heavily depends on the choice of the reference measure µ which is a priori not naturally given. It is the aim of the present article to provide a yet more general definition of parametrized measure models which embraces all of the aforementioned definitions, but is more general and more natural than these at the same time. Namely, in this article we define parametrized measure models and statistical models, respectively, as families (p(ξ)) ξ M which are given by a map p from M to M() and P(), respectively, which is differentiable when regarded as a map between the (finite or infinite dimensional) manifold M and the Banach space S() of finite signed measures on, since evidently P() and M() are subsets of S(). That is, the geometric structure on M() and P() is given by the inclusions P() M() S(). If the model is given as in the definition from [5] by (1.1), then we say that it is given by a regular density function. We shall show that most of the statements shown in [5] for parametrized measure models or statistical models with a regular density function also hold without this assumption. As it turns out, neither the Fisher metric nor the Amari-Chentsov tensor may be regarded as pull-backs of a tensor on S() via the map p. In order to overcome this problem, let us for the moment assume that the family is given by a regular density function as in (1.1). In this case, the definitions of g F and T AC can be rewritten as g F (V, W ) = V log p(ω; ξ) W log p(ω; ξ) dp(ξ), = 4 V p(ω; ξ) W p(ω; ξ) dµ (1.2) = 4 ) d ( V p(ξ) W p(ξ) 2

4 and T AC (V, W, U) = V log p(ω; ξ) W log p(ω; ξ) U log p(ω; ξ) dp(ξ) = 27 3 V p(ω; ξ) 3 W p(ω; ξ) 3 U p(ω; ξ) dµ (1.3) ( = 27 d 3 V p(ξ) 3 W p(ξ) ) 3 U p(ξ). Of course, the last lines in (1.2) and (1.3) do not make sense a priori, as it is not clear what a root of a measure should be, but this is precisely the way we shall make use of this. Namely, for r (0, 1] we define the space S r () of r-th powers of a measure, which has M r () and P r () as subsets. Again, S r () is a Banach space and S 1 () = S(). Furthermore, there is a bounded bilinear multiplication map : S r () S s () S r+s (), for r, s, r + s (0, 1] as well as differentiable maps taking the (signed) k-th power, π k, ˆπ k : S r () S kr () for r, kr (0, 1]. That is, one can work with these objects in a very suggestive way. We call a parametrized measure model p : M M() k-integrable, if the map p 1/k : M M 1/k () S 1/k () is (formally) differentiable. This notion of k-integrability turns out to be equivalent to that given in [5] for models with a regular density function. On S 1/n () we define the canonical n-tensor as L n (ν 1,..., ν n ) := n n d(ν 1 ν n ). Then by (1.2) and (1.3) the Fisher metric and the Amari-Chentsov tensor are given as the pull-backs g F = (p 1/2 ) (L 2 ) and T AC = (p 1/3 ) (L 3 ), if the model is k-integrable for k 2 or k 3, respectively. That is, these tensors may be defined as pullbacks of objects which are naturally defined in terms of only. We also discuss the behaviour of the Fisher metric under statistics. For this, let κ : be measurable and κ : S() S( ) be the push-forward of (signed) measures. This is a bounded linear map which maps M() to M( ) and P() to P( ), respectively. In particular, given a parametrized measure model p : M M(), it induces a map p := κ p. We then show that this process preserves k-integrability, i.e., if p is k-integrable, then so is p (cf. Theorem 5.1). Moreover, we show that in this generality the monotonicity formula holds (Theorem 5.2): g F (V, V ) g F (V, V ), (1.4) where g F and g F denote the Fisher metrics of p and p, respectively, and where V T M. Moreover, if p(ξ) has a regular density function p, then equality holds for all V iff κ is a sufficient statistic for the model. In fact, we show (1.4) even in the case where κ is replaced by any Markov kernel K : P( ). While the monotonicity formula has been known in special cases, e.g. if one of, is finite, or if both are manifolds and κ is differentiable, our approach shows that (1.4) holds without any further 3

5 assumption on the model whatsoever. In particular, it is also valid for rather peculiar statistics as e.g. Example 3.2. This paper is structured as follows. In Section 2 we give the formal definition of the spaces of powers of measures, setting up the formalism needed later on. In Section 3 we provide a precise definition of congruent embeddings for arbitrary sample spaces and discuss their relations with Markov kernels and the existence of transverse measures. In the following Section 4 we establish the notion of k-integrability, which is applied in the final Section 5 to the discussion of sufficient statistics and the proof of the monotonicity formula. Acknowledgements. This work was mainly carried out at the Max Planck Institute for Mathematics in the Sciences in Leipzig, and we are grateful for the excellent working conditions provided at that institution. H.V. Lê is partially supported by Grant RVO: J. Jost acknowledges support from the ERC Advanced Grant FP The spaces of measures and their powers 2.1 The space of (signed) finite measures Let (, Σ) be a measurable space, that is an arbitrary set together with a sigma algebra Σ of subsets of. Regarding the sigma algebra Σ on as fixed, we let P() := {µ : µ a probability measure on } M() := {µ : µ a finite measure on } S() := {µ : µ a signed finite measure on } S 0 () := {µ S() : dµ = 0}. Clearly, P() M() S(), and S 0 (), S() are real vector spaces. In fact, both S 0 () and S() are Banach spaces whose norm is given by the total variation of a signed measure, defined as µ T V n := sup µ(a i ) i=1 where the supremum is taken over all finite partitions = A 1... A n with disjoint sets A i Σ. By the Jordan decomposition theorem, each measure µ S() can be decomposed uniquely as µ = µ + µ with µ ± M(), µ + µ. (2.1) Thus, if we define then (2.1) implies so that µ := µ + + µ M(), µ(a) µ (A) for all µ S() and A Σ, (2.2) µ T V = µ T V = µ (). 4

6 In particular, Moreover, fixing a measure µ 0 M(), we let P() = {µ M() : µ T V = 1}. P(, µ 0 ) := {µ P() : µ is dominated by µ 0 } M(, µ 0 ) := {µ M() : µ is dominated by µ 0 } (2.3) S(, µ 0 ) := {µ S() : µ is dominated by µ 0 } S 0 (, µ 0 ) := S(, µ 0 ) S 0 (), where we say that µ 0 dominates µ if every µ 0 -null set is also a µ -null set. We may canonically identify S(, µ 0 ) with L 1 (, µ 0 ) by the correspondence ı can : L 1 (, µ 0 ) S(, µ 0 ), φ φ µ 0. By the Radon-Nikodým theorem, this is an isomorphism whose inverse is given by the Radon- Nikodým derivative µ dµ dµ 0. Observe that ı can is an isomorphism of Banach spaces, since evidently φ L 1 (,µ 0 ) = φ dµ 0 = φ µ 0 T V. 2.2 Differential maps between Banach manifolds and tangent spaces In this section, we shall recall some basic notions of maps between Banach manifolds. For simplicity, we shall restrict ourselves to maps between open subsets of Banach spaces, even though this notion can be generalized to general Banach manifolds, see e.g. [13]. Let V and W be Banach spaces and U V an open subset. A map φ : U W is called differentiable at x U, if there is a bounded linear operator d x φ Lin(V, W ) such that φ(x + h) φ(x) d x φ(h) W lim = 0. (2.4) h 0 h V In this case, d x φ is called the (total) differential of φ at x. Moreover, φ is called continuously differentiable or shortly a C 1 -map, if it is differentiable at every x U, and the map dφ : U Lin(V, W ), x d x φ is continuous. Furthermore, a differentiable map c : ( ε, ε) W is called a curve in W. Definition 2.1. Let X V be an arbitrary subset and let x 0 X. Then v V is called a tangent vector of X at x 0, if there is a curve c : ( ε, ε) X V such that c(0) = x 0 and ċ(0) := d 0 c(1) = v. The set of all tangent vectors at x 0 is called the tangent space of X at x 0 and is denoted by T x0 X. We also let T X := x 0 X T x 0 X X V V V, equipped with the induced topology. For instance, if X = U V is an open set, then T x0 U = V for all x 0 and hence, T U = U V. Indeed, the curve c(t) = x 0 + tv U for small t satisfies the properties required in the definition. While reparametrization of c implies that T x0 X is invariant under scalar multiplication, this set fails to be a linear subspace in general; if X V is a submanifold, however, then T x0 X coincides with the standard notion of the tangent space of a (Banach) manifold, justifying our notation. 5

7 If U V is open and φ : U W is a C 1 -map whose image is contained in X W, then d x0 φ(v ) T φ(x0 )X, whence φ induces a continuous map dφ : T U = U V T X, (u, v) d u φ(v). Proposition 2.1. Let V = S() be the Banach space of signed measures on. Then the tangent spaces of M() and P() are T M() := µ M() S(, µ) M() S() and T P() := µ P() S 0(, µ) P() S(). Remark 2.1. This result is not obvious, because, in the case of an infinite sample space, P() and M() are not Banach submanifolds of S(). Our proof will handle a more general situation, as needed in Proposition 2.2 below. To prove Proposition 2.1, we first show the following simple Lemma 2.1. Let {ν n : n N} S() be a countable family of measures. Then there is a measure µ 0 M() dominating ν n for all n. Proof. We assume w.l.o.g. that ν n 0 for all n and define µ 0 := n=1 1 2 n ν n T V ν n. Since ν n T V = ν n (), it follows that this sum converges, so that µ 0 M() is well defined. Moreover, if µ 0 (A) = 0, then ν n (A) = 0 for all n, showing that µ 0 dominates all ν n as claimed. Proof of Proposition 2.1. Given ν T µ0 M() S(), let (µ t ) t ( ε,ε) be a curve in M() with µ 0 = ν. Pick two sequences t ± n 0, t + n > 0, t n < 0, and let ˆµ 0 M() be a measure which dominates the measures µ t ± n, µ 0 and ν. This measure exists by Lemma 2.1. Thus, there are functions φ 0, φ 0, φ± n L 1 (, ˆµ 0 ) such that µ t ± n = φ ± n ˆµ 0, µ 0 = φ 0ˆµ 0, ν = φ 0ˆµ 0. By hypothesis, φ ± n, φ 0 0. Adapting the definition of a C 1 -map in (2.4), the differential quotient (t ± n ) 1 (µ t ± n µ 0 ) = (t ± n ) 1 (φ ± n φ 0 )ˆµ 0 converges in S() to ν = φ 0ˆµ 0, whence φ ± n φ 0 t ± n n φ 0 in L 1 (, ˆµ 0 ). After passing to subsequences this implies pointwise convergence ˆµ 0 -a.e. If φ 0 (ω) = 0, then φ+ n (ω) φ 0 (ω) 0 while φ t + n (ω) φ 0 (ω) 0. Thus, if both converge to φ n t 0 (ω), then n φ 0 (ω) = 0. That is: on φ 1 0 (0) we have φ 0 = 0 ˆµ 0-a.e., showing that ν = φ 0ˆµ 0 is dominated by µ 0 = φ 0ˆµ 0 and hence ν S(, µ 0 ). Thus, T µ0 M() S(, µ 0 ). 6

8 Conversely, given ν = φµ 0 S(, µ 0 ), define µ t := (1 + tφ) + µ 0 M(, µ 0 ). Then µ t µ 0 tν T V t = ((1 + tφ) + 1 tφ)µ 0 T V t = (1 + tφ) 1 t ( t 1 φ ) 1 t 0 0, using the monotone convergence theorem in the last step. Thus, ν is the derivative of (µ t ) t ( ε,ε), showing that ν T µ0 M() as claimed. To show the statement for P(), let (µ t ) t ( ε,ε) be a curve in P() with µ 0 = ν. Then as µ t is a probability measure for all t, we conclude dν = 1 t d(µ t µ 0 tν) µ t µ 0 tν T V t t 0 0, so that ν S 0 (). Since P() M(), it follows that T µ0 P() T µ0 M() S 0 () = S 0 (, µ 0 ) for all µ 0 P(). Conversely, given ν = φµ 0 S 0 (, µ 0 ), define the curve λ t := µ t µ t 1 T V P() with µ t = (1 + tφ) + µ 0 as before, so that λ 0 = µ 0. We set c ± t := (1 + tφ) ± dµ 0 0, so that c + t = µ t T V. Moreover, c + t c t = (1 + tφ) dµ 0 = 1 = µ t T V = 1 + c t 1, as µ 0 P() and φµ 0 S 0 (, µ). Thus, λ t λ 0 tν T V = (1 + tφ) + 1 tφ µ t T V dµ 0 = c t (1 + tφ) + + (1 + tφ) µ t T V dµ 0 c t = (1 + tφ) + dµ 0 + (1 + tφ) dµ 0 = 2c t µ t. T V } {{ } =c + t = µt T V } {{ } =c t But as shown above, lim t 0 c t t = 0, whence ν = λ(0) and ν T µ0 P(). 2.3 Powers of densities Let us now give the formal definition of roots of measures. On the set M() we define the preordering µ 1 µ 2 if µ 2 dominates µ 1. Then (M(), ) is a directed set, meaning that for any pair µ 1, µ 2 M() there is a µ 0 M() dominating both of them (use e.g. µ 0 := µ 1 + µ 2, but observe that by Lemma 2.1, this even holds for countable families of measures). 7

9 For fixed r (0, 1] and measures µ 1 µ 2 on we define the linear embedding ( ) r ı µ 1 µ 2 : L 1/r (, µ 1 ) L 1/r dµ1 (, µ 2 ), φ φ. dµ 2 Observe that ı µ 1 µ 2 (φ) 1/r = = ı µ 1 µ 2 (φ) 1/r dµ 2 r = φ 1/r dµ 1 r = φ 1/r, φ 1/r dµ 1 dµ 2 dµ 2 r (2.5) so that ı µ 1 µ 2 is an isometry. Moreover, we have evidently ı µ 1 µ 2 ı µ 2 µ 3 = ı µ 1 µ 3 whenever µ 1 µ 2 µ 3. Then we define the space of r-th roots of measures on to be the directed limit over the directed set (M(), ) S r () := lim L 1/r (, µ). (2.6) Let us give a more concrete definition of S r (). On the disjoint union of the spaces L 1/r (, µ) for µ M() we define the equivalence relation L 1/r (, µ 1 ) φ ψ L 1/r (, µ 2 ) ı µ 1 µ 0 (φ) = ı µ 2 µ 0 (ψ) ( ) r ( dµ1 dµ2 φ = ψ dµ 0 dµ 0 for some µ 0 µ 1, µ 2. Then S r () is the set of all equivalence classes of this relation. If we denote the equivalence class of φ L 1/r (, µ) by φµ r, then the equivalence relation yields ( ) r µ r dµ1 1 = µ r 2 (2.7) dµ 2 whenever µ 1 µ 2, justifying this notation. In fact, from this description in the case r = 1 we see that S 1 () = S(). Observe that by (2.5) φ 1/r is constant on equivalence classes, whence there is a norm on S r (), also denoted by. 1/r, for which the inclusions ) r L 1/r (, µ) S r (), φ φµ r are isometries. For r = 1, we have. 1 =. T V. Thus, φµ r 1/r = φ 1/r = φ 1/r dµ r for 0 < r 1 (2.8) Note that the equivalence relation also preserves nonnegativity of functions, whence we may define the subsets M r () := {φµ r : µ M(), φ 0} P r () := {φµ r : µ P(), φ 0, φ 1/r = 1}. (2.9) 8

10 In analogy to (2.3) we define for a fixed measure µ 0 M() and r (0, 1] the spaces S r (, µ 0 ) := {φ µ r 0 : φ L 1/r (, µ 0 )} M r (, µ 0 ) := {φ µ r 0 : φ L 1/r (, µ 0 ), φ 0} P r (, µ 0 ) := {φ µ r 0 : φ L 1/r (, µ 0 ), φ 0, φ 1/r = 1} { } S0(, r µ 0 ) := φµ r 0 : φ L 1/r (, µ 0 ), φ dµ = 0. The elements of P r (, µ 0 ), M r (, µ 0 ), S r (, µ 0 ) are said to be dominated by µ r 0. From Lemma 2.1, we can now conclude the following statement: Any sequence in S r () is contained in S r (, µ 0 ) for some µ 0 M(). In particular, any Cauchy sequence in S r () is a Cauchy sequence in S r (, µ 0 ) = L 1/r (, µ 0 ) for some µ 0 and hence convergent. Thus, (S r (),. 1/r ) is a Banach space. In analogy to Proposition 2.1, we can also determine the tangent spaces of the subsets P r () M r () S r (). The proof of the statement is identical to that of Proposition 2.1 and whence is omitted. Proposition 2.2. For each µ M() (µ P(), respectively), the tangent spaces of P r () M r () S r () at µ r are given as and T µ rm r () T µ rp r () := S r (, µ) S r () := S r 0(, µ) S r (). The product of powers of measures can now be defined for all r, s (0, 1] with r + s 1 and for measures φµ r S r (, µ) and ψµ s S s (, µ): (φµ r ) (ψµ s ) := φψµ r+s. By definition φ L 1/r (, µ) and ψ L 1/s (, µ), whence Hölder s inequality implies that φψ 1/(r+s) φ 1/r ψ 1/s <, so that φψ L 1/(r+s) (, µ) and hence, φψµ r+s S r+s (, µ). Since by (2.7) this definition of the product is independent of the choice of representative µ, it follows that it induces a bilinear product satisfying the Hölder inequality : S r () S s () S r+s (), where r, s, r + s (0, 1], (2.10) ν r ν s 1/(r+s) ν r 1/r ν s 1/s, so that the product in (2.10) is a bounded bilinear map. Definition 2.2. (Canonical pairing) For r (0, 1] we define the pairing (.;.) : S r () S 1 r () R, (ν 1 ; ν 2 ) := d(ν 1 ν 2 ). (2.11) 9

11 It is straightforward to verify that this pairing is non-degenerate in the sense that (ν r ;.) = 0 ν r = 0. (2.12) Besides multiplication of roots of measures, we also wish to take their powers. Here, we have two possibilities to deal with signs. For 0 < k r 1 and ν r = φµ r S r () we define ν r k := φ k µ rk and ν k r := sign (φ) φ k µ rk. Since φ L 1/r (, µ), it follows that φ k L k/r (, µ), so that ν r k, ν k r S rk (). By (2.7) these powers are well defined, independent of the choice of the measure µ, and, moreover, ν r k 1/(rk) = ν k r 1/(rk) = ν r k 1/r. (2.13) Proposition 2.3. Let r (0, 1] and 0 < k 1/r, and consider the maps π k, π k : S r () S rk (), π k (ν) := ν k π k (ν) := ν k. Then π k, π k are continuous maps. Moreover, for 1 < k 1/r they are C 1 -maps between Banach spaces, and their derivatives are given as d νr π k (ρ r ) = k ν r k 1 ρ r and d νr π k (ρ r ) = k ν k 1 r ρ r. (2.14) Observe that for k = 1, π 1 (ν r ) = ν r fails to be C 1, whereas π 1 (ν r ) = ν r, so that π 1 is the identity and hence a C 1 -map. Proof. Let us first assume that 0 < k 1. We assert that in this case, there are constants C k, C k > 0 such that for all x, y R x + y k x k C k y k and sign (x + y) x + y k sign (x) x k C k y k. (2.15) Namely, by homogeneity it suffices to show this for y = 1, and since the functions x x + 1 k x k and x sign (x + 1) x + 1 k sign (x) x k are continuous and have finite limits for x ±, it follows that they are bounded, showing (2.15). Let ν 1, ν 2 S r (), and choose µ 0 M() such that ν 1, ν 2 S r (, µ 0 ), i.e., ν i = φ i µ r 0 with φ i L 1/r (, µ 0 ). Then π k (ν 1 + ν 2 ) π k (ν 1 ) 1/(rk) = φ 1 + φ 2 k φ 1 k 1/(rk) C k φ 2 k 1/rk by (2.15) = C k ν 2 k 1/r by (2.13), so that lim ν2 1/r 0 π k (ν 1 + ν 2 ) π k (ν 1 ) 1/(rk) = 0, showing the continuity of π k for 0 < k 1. The continuity of π k follows analogously. 10

12 Now let us assume that 1 < k 1/r. In this case, the functions x x k and x sign (x) x k with x R are C 1 -maps with respective derivatives x k sign (x) x k 1 and x k x k 1. Thus, if we pick ν i = φ i µ r 0 as above, then by the mean value theorem we have π k (ν 1 + ν 2 ) π k (ν 1 ) = ( φ 1 + φ 2 k φ 1 k )µ rk 0 = k sign (φ 1 + ηφ 2 ) φ 1 + ηφ 2 k 1 φ 2 µ rk 0 = k sign (φ 1 + ηφ 2 ) φ 1 + ηφ 2 k 1 µ r(k 1) 0 ν 2 for some function η : (0, 1). If we let ν η := ηφ 2 µ r 0, then ν η 1/r ν 2 1/r, and we get With the definition of d ν1 π k from (2.14) we have π k (ν 1 + ν 2 ) π k (ν 1 ) = k π k 1 (ν 1 + ν η ) ν 2. π k (ν 1 + ν 2 ) π k (ν 1 ) d ν1 π k (ν 2 ) 1/(rk) = k( π k 1 (ν 1 + ν η ) π k 1 (ν 1 )) ν 2 1/(rk) k π k 1 (ν 1 + ν η ) π k 1 (ν 1 ) 1/(r(k 1)) ν 2 1/r and hence, π k (ν 1 + ν 2 ) π k (ν 1 ) d ν1 π k (ν 2 ) 1 rk ν 2 1 r k π k 1 (ν 1 + ν η ) π k 1 (ν 1 ) 1. r(k 1) Thus, the differentiability of π k will follow if π k 1 (ν 1 + ν η ) π k 1 (ν 1 ) 1/(r(k 1)) ν 2 1/r 0 0, and because of ν η 1/r ν 2 1/r, this is the case if π k 1 is continuous. Analogously, one shows that π k is differentiable if π k 1 is continuous. Since we already know continuity of π k and π k for 0 < k 1, and since C 1 -maps are continuous, the claim now follows by induction on k. Thus, (2.14) implies that the differentials of π k and π k (which coincide on P r () and M r ()) yield continuous maps dπ k = d π k : T P r () T P rk () T M r () T M rk (), (µ, ρ) kµ rk r ρ. 11

13 3 Congruent embeddings 3.1 Statistics and congruent embeddings Given two measurable sets (, Σ) and (, Σ ), a measurable map κ : will be called a statistic. If µ is a (signed) measure on (, Σ), it then induces a (signed) measure κ µ on (, Σ ), via κ µ(a) := µ(κ 1 A), which is called the push-forward of µ by κ. Note that κ : S() S( ) (3.1) is a bounded linear map. Indeed, the norm on both spaces is given by the total variation, and we have for any partition = A 1... A n κ µ(a i) = µ(κ 1 A i) µ T V, whence κ µ T V µ T V. Moreover, κ is monotone, i.e., it maps nonnegative measures to nonnegative measures, and for these, the total variation is preserved: κ µ T V = µ T V for all µ M(). In particular, κ maps probability measures to probability measures, i.e., κ (P()) P( ). We also define the pull-back of a measurable function φ : R as κ (φ ) := φ κ. With this, for subsets A, B and A := κ 1 (A ), B := κ 1 (B ) we have d(κ (κ (χ A ) µ)) = κ (χ A ) dµ = dµ B B A B = dκ (µ) = χ A dκ (µ), A B B since κ (χ A ) = χ A. Thus, κ (κ (χ A )µ) = χ A κ (µ). By linearity, this equation holds when replacing χ A by a step function on, whence by the density of step functions in L 1 (, µ ) we obtain κ (κ (φ ) µ) = φ κ (µ) for all φ L 1 (, κ (µ)). (3.2) Recall that M() and S() denote the spaces of all (signed) measures on, whereas M(, µ) and S(, µ) denote the subspaces of the (signed) measures on which are dominated by µ. Definition 3.1. (Congruent embedding) Let κ : be a statistic and µ M( ). A κ-congruent embedding is a bounded linear map K : S(, µ ) S() such that 12

14 1. K is monotone, i.e., it maps nonnegative measures to nonnegative measures, or shortly: K (M(, µ )) M(). 2. κ (K (ν )) = ν for all ν S(, µ ). Furthermore, the image of a κ-congruent embedding K in S() is called a κ-congruent subspace of S(). Example 3.1. Let κ : be a statistic, let µ M() and µ := κ (µ) M( ). Then the map K µ : S(, µ ) S(, µ) S(), φ µ κ (φ )µ (3.3) for all φ L 1 (, µ ) is a κ-congruent embedding, since by (3.2). κ (K µ (φ µ )) = κ (κ (φ )µ) = φ κ (µ) = φ µ We shall now see that the above example exhausts all possibilities of congruent embeddings. Proposition 3.1. Let κ : be a statistic, let K : S(, µ ) S() for some µ M( ) be a κ-congruent embedding, and let µ := K (µ ) M(). Then K = K µ with the map K µ given in (3.3). Proof. We have to show that K (φ µ) = κ (φ )µ for all φ L 1 (, µ ). By continuity, it suffices to show this for step functions, as these are dense in L 1 (, µ ), whence by linearity, we have to show that for all A, A := κ 1 (A ) K (χ A µ ) = χ A µ. (3.4) Let A 1 := A and A 2 = \A, and let A i := κ 1 (A i ). We define the measures µ i := χ A i µ M( ), and µ i := K (µ i ) M(). Since µ 1 + µ 2 = µ, it follows that µ 1 + µ 2 = µ by the linearity of K. Taking indices mod 2, and using κ (µ i ) = κ (K (µ i )) = µ i by the κ-congruency of K, note that µ i (A i+1 ) = µ i (κ 1 (A i+1)) = κ (µ i )(A i+1) = µ i(a i+1) = 0. Thus, for any measurable B we have µ 1 (B) = µ 1 (B A 1 ) since µ 1 (B A 2 ) µ 1 (A 2 ) = 0 = µ 1 (B A 1 ) + µ 2 (B A 1 ) since µ 2 (B A 1 ) µ 2 (A 1 ) = 0 = µ(b A 1 ) since µ = µ 1 + µ 2 = (χ A µ)(b) since A 1 = A. That is, χ A µ = µ 1 = K (µ 1 ) = K (χ A µ ), so that (3.4) follows. As a consequence, any κ-congruent subspace of S() must be of the form for some µ M(). C κ,µ := {κ (φ )µ : φ L 1 (, κ (µ))} (3.5) 13

15 3.2 Markov kernels and Markov morphisms Definition 3.2. (Markov kernel and Markov morphism, cf. [8], [15]) A Markov kernel between two measurable spaces ( 1, Σ 1 ) and ( 2, Σ 2 ) is a map K : 1 P( 2 ), associating to each ω 1 1 a probability measure on 2 such that for each fixed measurable A 2 2 the map 1 [0, 1], ω 1 K(ω 1 ; A 2 ) := K(ω 1 )(A 2 ) is measurable. The Markov morphism induced by K is the linear map K : S( 1 ) S( 2 ), K(µ 1 ; A 2 ) = K (µ 1 )(A 2 ) := K(ω 1 ; A 2 ) dµ 1 (ω 1 ). (3.6) 1 Since K(ω 1 ) P( 2 ), it follows that K(ω 1 ; 2 ) = 1 and hence (3.6) implies that K (µ 1 )( 2 ) = µ 1 ( 1 ). Thus, K (µ 1 ) T V = µ 1 T V for all µ 1 M( 1 ). (3.7) In particular, a Markov morphism maps probability measures to probability measures. For a general measure µ 1 S( 1 ), (2.2) implies that K (µ 1 ; A 2 ) K ( µ 1 ; A 2 ) for all A 2 Σ 2 and hence, K (µ 1 ) T V K ( µ 1 ) T V = µ 1 T V for all µ 1 S( 1 ), so that K : S( 1 ) S( 2 ) is a bounded linear map. Observe that we can recover the Markov kernel K from K using the relation K(ω 1 ) = K (δ ω 1 ) for all ω 1 1, where δ ω 1 denotes the Dirac measure supported at ω 1 1. Remark 3.1. From (3.6) it is immediate that K preserves dominance of measures, i.e., if µ 1 dominates µ 1, then K (µ 1 ) dominates K (µ 1 ). Thus, for each µ 1 M( 1 ) there is a restriction where µ 2 := K (µ 1 ). K : S( 1, µ 1 ) S( 2, µ 2 ), Definition 3.3. (Composition of Markov kernels) Let ( i, Σ i ), i = 1, 2, 3 be measurable spaces, and let K i : i P( i+1 ) for i = 1, 2 be Markov kernels. The composition of K 1 and K 2 is the Markov kernel K 2 K 1 : 1 P( 3 ), ω (K 2 ) (K 1 (ω)). Since (K 2 ) (K 1 (ω)) T V = K 1 (ω) T V = 1 by (3.7), (K 2 ) (K 1 (ω)) is a probability measure, hence this composition yields indeed a Markov kernel. Moreover, it is straightforward to verify that this composition is associative, and for the induced Markov morphism we have (K 2 K 1 ) = (K 2 ) (K 1 ). (3.8) Markov kernels are generalizations of statistics. In fact, a statistic κ : induces a Markov kernel by K κ (ω) := δ κ(ω), so that K κ (ω; A ) := χ κ 1 (A )(ω). In this case, the Markov morphism induced by K κ is the map κ : S() S( ) from (3.1). We shall write the Markov kernel K κ also as κ if there is no danger of confusion. 14

16 Definition 3.4. (Congruent Markov kernels) A Markov kernel K : P() is called κ-congruent for a statistic κ : if or, equivalently, κ (K(ω )) = δ ω for all ω, (3.9) (K κ K) = Id S( ) : S( ) S( ). In this case, we also call the induced Markov morphism K : S( ) S() κ-congruent. In order to relate the notions of κ-congruent Markov morphism and κ-congruent embeddings from Definition 3.1, we need the notion of κ-transverse measures. Definition 3.5. (Transverse measures) Let κ : be a statistic. A measure µ M() is said to admit κ-transverse measures if there are measures µ ω on κ 1 (ω ) such that for all φ L 1 (, µ) ( ) φ dµ = φ dµ ω dµ (ω ), (3.10) κ 1 (ω ) where µ := κ (µ). In particular, the function ˆR, ω is measurable for all φ L 1 (, µ). κ 1 (ω ) φ dµ ω Observe that the choice of κ-transverse measures µ ω these measures for all ω in a µ -null set. is not unique, but rather, one can change Proposition 3.2. Let κ : be a statistic and µ M() a measure which admits κ-transverse measures {µ ω : ω }. Then µ ω is a probability measure for almost every ω and hence, we may assume w.l.o.g. that µ ω P(κ 1 (ω )) for all ω. Proof. Given ε > 0, define A ε := {ω : µ ω (κ 1 (ω )) 1 + ε}. Then for φ := χ κ 1 (A the ε) two sides of equation (3.10) read ( χ κ 1 (A dµ = µ(κ ε) 1 (A ε)) = µ (A ε) ) ( ) χ κ 1 (A ε) dµ ω dµ (ω ) = dµ ω dµ (ω ) κ 1 (ω ) A ε κ 1 (ω ) = µ ω (κ 1 (ω )) dµ (ω ) A ε (1 + ε)µ (A ε). Thus, (3.10) implies µ (A ε) (1 + ε)µ (A ε), 15

17 and hence, µ (A ε) = 0 for all ε > 0. Thus, µ ({ω : µ ω (κ 1 (ω )) > 1}) = µ ( n=1 A 1/n ) µ (A 1/n ) = 0, whence {ω : µ ω (κ 1 (ω )) > 1} is a µ -null set. Analogously, {ω : µ ω (κ 1 (ω )) < 1} is a µ -null set, that is, µ ω P(κ 1 (ω )) and hence µ ω T V = 1 for µ -a.e. µ. Thus, if we replace µ ω by µ ω := µ ω 1 T V µ ω, then µ ω P(κ 1 (ω )) for all ω, and since µ ω = µ ω for µ -a.e. ω, it follows that (3.10) holds when replacing µ ω by µ ω. We are now ready to relate the notions of κ-congruent embeddings and κ-congruent Markov kernels. Theorem 3.1. Let κ : be a statistic and µ M( ) be a measure. 1. If K : P() is a κ-congruent Markov kernel, then the restriction of K to S(, µ ) S( ) is a κ-congruent embedding and hence, for φ L 1 (, µ ) we have K (φ µ ) = κ (φ )K (µ ). 2. Conversely, if K : S(, µ ) S() is a κ-congruent embedding, then the following are equivalent. (a) K is the restriction of a κ-congruent Markov morphism to S(, µ ) S( ). (b) µ := K (µ ) S() admits κ-transverse measures. Theorem 3.1 implies that the two notions of congruency are equivalent for large classes of statistics κ, since the existence of transversal measures is guaranteed under rather mild hypotheses, e.g. if one of, is a finite set, or if, are differentiable manifolds equipped with a Borel measure µ and κ is a differentiable map. However, there are examples of statistics and measures which do not admit κ-transverse measures, cf. Example 3.2 below. Proof. The first statement follows directly from (K κ K) = (K κ ) K = κ K by (3.8) and Proposition 3.1. For the second, suppose that K : S(, µ ) S() is a κ-congruent embedding. Then K = K µ given in (3.3) for the measure µ := K (µ ) by Proposition 3.1. If we assume that K is the restriction of a κ-congruent Markov morphism induced by the κ- congruent Markov kernel K : P(), then we define the measures µ ω := K(ω ) κ 1 (ω) M(κ 1 (ω )). n=1 Note that for ω K(ω ; \κ 1 (ω )) = (3.9) = dk(ω ) = dκ (K(ω )) \κ 1 (ω ) \ω = 0. \ω dδ ω 16

18 That is, K(ω ) is supported on κ 1 (ω ) and hence, for an arbitrary set A we have K(ω ; A) = K(ω ; A κ 1 (ω )) = µ ω (A κ 1 (ω )) = χ A dµ ω. κ 1 (ω ) Substituting this into the definition of K we obtain for a subset A χ A dµ = µ(a) = K (µ ; A) (3.6) = K(ω ; A) dµ (ω ) ( ) = χ A dµ ω dµ (ω ), κ 1 (ω ) showing that (3.10) holds for φ = χ A. But then, by linearity (3.10) holds for any step function φ, and since these are dense in L 1 (, µ), it follows that (3.10) holds for all φ, so that the measures µ ω defined above yield indeed κ-transverse measures of µ. Conversely, suppose that µ := K (µ ) admits κ-transverse measures µ ω, and by Proposition 3.2 we may assume w.l.o.g. that µ ω P(κ 1 (ω )). Then we define the map K : P(), K(ω ; A) := µ ω (A κ 1 (ω )) = χ A dµ ω. κ 1 (ω ) Since for fixed A the map ω κ 1 (ω ) χ A dµ ω is measurable by the definition of transversal measures, K is indeed a Markov kernel. Moreover, for A κ (K(ω ))(A ) = K(ω ; κ 1 (A )) = µ ω (κ 1 (A ) κ 1 (ω )) = χ A (ω ), so that κ K(ω ) = δ ω for all ω, whence K is κ-congruent. Moreover, for any φ L 1 (, µ ) and A we have K µ (φ µ (3.3) )(A) = κ (φ )µ(a) = χ A κ (φ ) dµ ( ) (3.10) = χ A κ (φ ) dµ ω dµ (ω ) κ 1 (ω ) ( ) = χ A dµ ω φ (ω ) dµ (ω ) κ 1 (ω ) = K(ω ; A) d(φ µ )(ω ) (3.6) = K (φ µ )(A). Thus, K µ (φ µ ) = K (φ µ ) for all φ L 1 (, µ ) and hence, K µ (ν) = K (ν) for all ν S(, µ ). That is, the given congruent embedding K µ coincides with the Markov morphism K induced by K, and this completes the proof. Now we give an example of a statistic which does not admit κ-transverse measures. Example 3.2. Let := S 1 be the unit circle group in the complex plain with the 1-dimensional Borel algebra B. Let Γ := exp(2π 1Q) S 1 be the subgroup of rational rotations, and let := S 1 /Γ be the quotient space with the canonical projection κ :. Let B := {A : 17

19 κ 1 (A ) B}, so that κ : is measurable. For γ Γ, we let m γ : S 1 S 1 denote the multiplication by γ. Let λ be the 1-dimensional Lebesgue measure on and λ := κ (λ) be the induced measure on. Suppose that λ admits κ-transverse measures (λ ω ) ω. Then for each A B we have ( ) λ(a) = dλ ω dλ (ω ). (3.11) A κ 1 (ω ) Since λ is invariant under rotations, we have on the other hand for γ Γ ( ) λ(a) = λ(m 1 γ A) = dλ ω dλ (ω ) = ( (m 1 γ A) κ 1 (ω ) ) d((m γ ) λ ω ) A κ 1 (ω ) dλ (ω ). (3.12) Comparing (3.11) and (3.12) implies that ((m γ ) λ ω ) ω is another familiy of κ-transverse measures of λ which implies that (m γ ) λ ω = λ ω for λ -a.e. ω, and as Γ is countable, it follows that (m γ ) λ ω = λ ω for all γ Γ and λ -a.e. ω. Thus, for a.e. ω we have λ ω ({γ x}) = λ ω ({x}), and since Γ acts transitively on κ 1 (ω ), it follows that singleton subsets have equal measure, i.e., there is a constant c ω with λ ω (A ) = c ω A for all A κ 1 (ω ). As κ 1 (ω ) is countable and infinite, this implies that λ ω = 0 if c ω = 0, and λ ω (κ 1 (ω )) = if c ω > 0. Thus, λ ω is not a probability measure for a.e. ω, contradicting Proposition 3.2. This shows that λ does not admit κ-transverse measures. We conclude this section by the following result (cf. [5, Theorem 4.10]). Theorem 3.2. Any Markov kernel K = P( ) can be decomposed into a statistic and a congruent Markov kernel. That is, there is a Markov kernel K cong : P(ˆ) which is congruent w.r.t. some statistic κ 1 : ˆ, and a statistic κ 2 : ˆ such that K = K κ 2 K cong. Proof. Let ˆ := and let κ 1 : ˆ and κ 2 : ˆ be the canonical projections. We define the Markov kernel K cong : P(ˆ), K cong (ω 1 ; Â) := K(ω 1; κ 2 (Â ({ω 1} ))), and we assert that it is κ 1 -congruent. Namely, (κ 1 ) (K cong (ω 1 ))(A 1 ) = K cong (ω 1 ; κ 1 1 (A 1)) = K cong (ω 1 ; A 1 ) = K(ω 1 ; κ 2 ((A 1 ) ({ω 1 } ))) { K(ω1 ; = ) = 1 if ω 1 A 1 K(ω 1 ; ) = 0 if ω 1 / A 1 = χ A1 (ω 1 ), 18

20 whence (κ 1 ) (K cong (ω 1 )) = δ ω 1 as claimed. Observe that (κ 2 ) (K cong (ω 1 ))(A 2 ) = K cong (ω 1 ; κ 1 2 (A 2)) = K cong (ω 1 ; 1 A 2 ) = K(ω 1 ; κ 2 (( 1 A 2 ) ({ω 1 } ))) = K(ω 1 ; κ 2 ({ω 1 } A 2 )) = K(ω 1 ; A 2 ). Therefore, K = K κ 2 K cong, so the claim follows. 3.3 Powers of densities and congruent embeddings Given a statistic κ :, recall that by Proposition 3.1 any κ-congruent embedding is of the form K µ : S(, µ ) S(, µ) for some µ M() with µ := κ (µ), and the image of K µ, denoted by C κ,µ, is called a κ-congruent subspace. We now wish to generalize the notion of congruent embeddings and congruent subspaces to powers of measures. Namely, for r (0, 1] we define the congruent embedding of r-th powers to be the map K r µ : S r (, µ ) S r (), φµ r κ (φ)µ r, and denote its image by C r κ,µ := {κ (φ )µ r : φ L 1/r (, µ )}. (3.13) As before, we denote the spaces of the form C r κ,µ for µ M() as κ-congruent subspaces. Observe that K 1 µ = K µ and C 1 κ,µ = C κ,µ with the definitions from (3.3) and (3.5). Since the pull-back of functions preserves products and powers, we have immediately the following properties. K r µ(ν r ) 1/r = ν r 1/r K r 1+r 2 µ (ν r1 ν r 2 ) = K r 2 µ (ν r 1 ) K r 2 µ (ν r 2 ) (3.14) K rα µ (π α (ν r )) = π α (K r µ(ν r )) and K rα µ ( π α (ν r )) = π α (K r µ(ν r )) for r 1 + r 2 1 and 0 < α < 1/r. We now show the following decomposition result. Proposition 3.3. Let κ :, µ M() and µ := κ (µ) M( ) be as above. Then for the congruent subspaces C r κ,µ, we have the decomposition where S r (, µ) = C r κ,µ ( (C 1 r κ,µ ) S r (, µ) ), (3.15) (C 1 r κ,µ ) = {ν r S r () : (ν r ; ρ 1 r ) = 0 for all ρ 1 r C 1 r κ,µ } with the pairing (.;.) from (2.11). In particular, for r = 1/2 (3.15) is an orthogonal decomposition w.r.t. the Hilbert metric on S 1/2 (, µ). Moreover, (C 1 r κ,µ ) S r (, µ) = {ν r S r (, µ) : κ (ν r µ 1 r ) = 0}. (3.16) 19

21 Proof. We begin by showing (3.16). Let ν r = φµ r S r (, µ) with φ L 1/r (, µ). Then ν r (Cκ,µ 1 r ) if and only if for all ψ L 1/(1 r) (, µ ) we have 0 = d ( ν r (κ (ψ )µ 1 r)) = φκ (ψ ) dµ = dκ (φκ (ψ )µ) (3.2) = ψ dκ (φµ). But ψ dκ (φµ) = 0 for all ψ L 1/(1 r) (, µ ) holds if and only if κ (φµ) = 0, which shows (3.16) since φµ = ν r µ 1 r. The spaces on the right hand side of (3.15) are obviously contained in S r (, µ). To see that Cκ,µ r (Cκ,µ 1 r ) = 0, observe that for ν r = κ (φ )µ r Cκ,µ r we have κ (ν r µ 1 r ) = κ (κ (φ )µ) = φ µ by (3.2), and by (3.16) this is contained in (Cκ,µ 1 r ) only if φ = 0. Since Kµ r : S r (, µ ) Cκ,µ r is an isometry by (3.14) and S r (, µ ) is complete, it follows that Cκ,µ r is complete and hence closed in S r (, µ). Also (Cκ,µ 1 r ) S r () is a closed subspace being the orthogonal complement of the subspace Cκ,µ 1 r. Therefore, the right hand side of (3.15) is a closed subspace of the left, hence it suffices to show that the right hand side contains the dense subspace {ν r = φµ r : φ L (, µ)} of S r (, µ). Thus, for such a ν r = φµ r S r (, µ) define φ L 1 (, µ ) by Then for A we have (κ (φµ))(a ) = κ (ν r µ 1 r ) = κ (φµ) = φ µ. κ 1 (A ) φ dµ φ µ(κ 1 A ) = φ µ (A ), whence φ φ, so that φ is bounded as well. In particular, φ µ r S r (, µ ), and we decompose ν r = κ (φ )µ r +(ν r κ (φ )µ r ). }{{} Cκ,µ r Thus, it remains to show that ν r 1 := νr κ (φ )µ r (C 1 r κ,µ ) S r (, µ). Since evidently ν r 1 S r (, µ), we may use (3.16) to verify this. Namely, κ (ν r 1 µ 1 r ) = κ (ν r µ 1 r ) κ (κ (φ )µ r µ 1 r ) = φ µ φ κ (µ) = 0 by (3.2) and the definition of φ. This completes the proof. 4 Parametrized measure models and k-integrability In this section, we shall now present our notion of a parametrized measure model. Definition 4.1. (Parametrized measure model) Let be a measure space. 20

22 1. A parametrized measure model is a triple (M,, p) where M is a (finite or infinite dimensional) Banach manifold and p : M M() S() is a C 1 -map in the sense explained in Section The triple (M,, p) is called a statistical model if it consists only of probability measures, i.e., such that the image of p is contained in P(). 3. We call such a model dominated by µ 0 if the image of p is contained in S(, µ 0 ). In this case, we use the notation (M,, µ 0, p) for this model. Remark 4.1. Evidently, for the applications we have in mind, we are interested mainly in statistical models. However, we can take the point of view that P() is the projectivisation of P() = P(M()\0) via rescaling. Thus, given a parametrized measure model (M,, p), normalisation yields a statistical model (M,, p 0 ) defined by p 0 (ξ) := p(ξ) p(ξ) T V. which is again a C 1 -map. Indeed, the map µ µ T V on M() is a C 1 -map, being the restriction of the linear (and hence continuous) map µ dµ on S(). Observe that while S() is a Banach space, the subsets M() and P() do not carry a canonical manifold structure. If a parametrized measure model (M,, µ 0, p) is dominated by µ 0, then there is a density function p : M R such that p(ξ) = p(.; ξ)µ 0. (4.1) From the context, i.e., from the number of arguments, it will be clear which map p is meant, whence we will denote both maps by the same symbol. Evidently, we must have p(.; ξ) L 1 (, µ 0 ) for all ξ. In particular, for fixed ξ, p(.; ξ) is defined only up to changes on a µ 0 -null set. Definition 4.2. (Regular density function) Let (M,, µ 0, p) be a parametrized measure model dominated by µ 0. We say that this model has a regular density function if the density function p : M R satisfying (4.1) can be chosen such that for all V T ξ M the partial derivative V p(.; ξ) exists and lies in L 1 (, µ 0 ). Remark 4.2. The standard notion of a statistical model always assumes that it is dominated by some measure and has a regular density function (e.g. [3, 2, p. 25], [4, 2.1], [19], [5, Definition 2.4]). In fact, the definition of a parametrized measure model or statistical model in [5, Definition 2.4] is equivalent to a parametrized measure model or statistical model with a regular density function in the sense of Definition 4.2. Let us point out why the present notion is indeed more general. The formal definition of differentiability of p implies that for each C 1 -path ξ(t) M with ξ(0) = ξ, ξ(0) =: V Tξ M, the curve t p(.; ξ(t)) L 1 (, µ 0 ) is differentiable. This implies that there is a d ξ p(v ) L 1 (, µ 0 ) such that p(.; ξ(t)) p(.; ξ) d ξ p(v )(.) t 1 t 0 0. If this is a pointwise convergence, then d ξ p(v ) = V p(.; ξ) is the partial derivative and whence, V p(.; ξ) lies in L 1 (, µ 0 ), so that the density function is regular. 21

23 However, in general convergence in L 1 (, µ 0 ) does not imply pointwise convergence, whence there are parametrized measure models in the sense of Definition 4.1 without a regular density function, cf. Example 4.1.2below. Nevertheless, for simplicity we shall frequently use the notation V p( ; ξ) instead of d ξ p(v )(.), even if the density function is not regular. By this convention, for a parametrized measure model (M,, µ 0, p) we can describe its derivative in the direction of V T ξ M as d ξ p(v ) = V p(.; ξ) µ 0. Example The family of normal distributions on R p(µ, σ) := 1 (x µ)2 exp( 2πσ 2σ 2 ) dx is a statistical model with regular density function on the upper half plane H = {(µ, σ) : µ, σ R, σ > 0}. 2. To see that there are parametrized measure models without a regular density function, consider the family of measures on = (0, π) ( 1 + ξ (sin 2 (t 1/ξ)) 1/ξ2) dt for ξ 0 p(ξ) := dt for ξ = 0. This model is dominated by the Lebesgue measure dt, with density function p(t; ξ) = 1 + ξ (sin 2 (t 1/ξ)) 1/ξ2 for ξ 0, p(t; 0) = 1. Thus, the partial derivative ξ p does not exist at ξ = 0, whence the density function is not regular. On the other hand, p : R M(, dt) is differentiable in the above sense at ξ = 0 with d 0 p( ξ ) = 0, so that (M,, p) is a parametrized measure model in the sense of Definition 4.1. To see this, we calculate p(ξ) p(0) ξ = 1 = = (sin 2 (t 1/ξ)) 1/ξ2 dt π 0 π 0 1 (sin 2 (t 1/ξ)) 1/ξ2 dt (sin 2 t) 1/ξ2 dt ξ 0 0. which shows the claim. Here, we used the π-periodicity of the integrand for fixed ξ and dominated convergence in the last step. Since for a parametrized measure model (M,, p) the map p is C 1, it follows that its derivative yields a continuous map between the tangent spaces dp : T M T M() = µ M() S(, µ). That is, for each tangent vector V T ξ M, its differential d ξ p(v ) is contained in S(, p(ξ)) and hence dominated by p(ξ). 22

24 Definition 4.3. Let (M,, p) be a parametrized measure model. Then for each tangent vector V T ξ M of M, we define V log p(ξ) := d{d ξp(v )} dp(ξ) L 1 (, p(ξ)) (4.2) and call this the logarithmic derivative of p at ξ in direction V. If such a model is dominated by µ 0 and has a regular density function p for which (4.1) holds, then we can calculate the Radon-Nikodým derivative as d{d ξ p(v )} dp(ξ) = d{d ( ) ξp(v )} dp(ξ) 1 dµ 0 dµ 0 = V p(.; ξ)(p(.; ξ)) 1 = V log p(.; ξ), where we use the convention log 0 = 0. This justifies the notation in (4.2) even for models without a regular densitiy function. For a parametrized measure model (M,, p) and k > 1 we consider the map p 1/k := π 1/k p : M S 1/k (), ξ p(ξ) 1/k. Since π 1/k is continuous by Proposition 2.3, it follows that p 1/k is continuous as well. Let us pretend for the moment that p 1/k is a C 1 -map, so that d ξ p 1/k (V ) T p(ξ) 1/kM 1/k () = S 1/k (, p(ξ)). In this case, because of π k π 1/k = Id, we have p = π k p 1/k, whence by the chain rule and (2.14) we have for ξ M and V T ξ M Thus with (4.2) this implies d ξ p(v ) = k p(ξ) 1 1/k (d ξ p 1/k (V ) ). d ξ p 1/k (V ) = 1 k V log p(ξ) p 1/k (ξ) S 1/k (, p(ξ)) (4.3) and hence, in particular, V log p(ξ) L k (, p(ξ)), and depends continuously on V T M. This motivates the following definition. Definition 4.4. (k-integrable parametrized measure model) A parametrized measure model (M,, p) is called k-integrable for k 1 if for all ξ M and V T ξ M we have V log p(ξ) = d{d ξp(v )} L k (, p(ξ)), dp(ξ) and moreover, the map dp 1/k : T M T S 1/k () given in (4.3) is continuous. dp 1/k is called the formal derivative of p 1/k. Furthermore, we call the model -integrable if it is k-integrable for all k 1. 23

25 Since p(ξ) is a finite measure, we have L k (, p(ξ)) L l (, p(ξ)) for all 1 l k. Thus, k- integrability implies l-integrability for all such l. Remark By our previous discussion, a parametrized measure model (M,, p) for which p 1/k is a C 1 -map is always k-integrable, and the derivative coincides with the formal derivative. However, it is not clear if there are k-integrable parametrized measure models for which p 1/k is not a C 1 -map. 2. Observe that for parametrized measure models with a regular density function the notion of k-integrability coincides with that given in [5, Definition 2.4]. Definition 4.5. (Canonical n-tensor) For n N, the canonical n-tensor is the covariant n-tensor on S 1/n (), given by L n (ν 1,..., ν n ) = n n d(ν 1 ν n ), where ν i S 1/n (). (4.4) For n = 2, the pairing (.;.) : S 1/2 () S 1/2 () R from (2.11) satisfies (ν 1 ; ν 2 ) = 1 4 L2 (ν 1, ν 2 ). Since (ν; ν) = ν 2 2 by (2.8), it follows: ( S 1/2 (), 1 ) 4 L2 is a Hilbert space with norm. 2. The main purpose of defining the notion of k-integrability is that for a k-integrable model, there is a well defined pullback of the canonical n-tensor L n via the map p1/n for all n k. That is, we define for V 1,..., V n T ξ M τ n (M,,p) (V 1,..., V n ) := L n (d ξp 1/n (V 1 ),..., d ξ p 1/n (V n )) = V 1 log p(ξ) Vn log p(ξ) dp(ξ), where the second line follows immediately from (4.3) and (4.4). Example For n = 1, the canonical 1-form is given as τ(m,,p) 1 (V ) := V log p(ξ) dp(ξ) = V p(ξ). Thus, it vanishes if and only if p(ξ) is locally constant, e.g., if (M,, p) is a statistical model. 2. For n = 2, τ 2 (M,,p) coincides with the Fisher metric g F (V, W ) ξ := V log p(ξ) W log p(ξ) dp(ξ) (4.5) 24

26 3. For n = 3, τ(m,,p) 3 coincides with the Amari-Chentsov 3-symmetric tensor T AC (V, W, X) ξ := log p(ξ) W log p(ξ) X log p(ξ) dp(ξ). V Remark 4.4. While the Fisher metric and the Amari-Chentsov tensor give an interpretation of τ(m,,p) n for n = 2, 3, we do not know of any statistical significance of τ (M,,p) n for n 4. However, one might hope to get an interpretation of e.g. τ(m,,p) 4 as some kind of curvature tensor of a suitable connection. Moreover, in [14, p.212] the question is posed if there are other significant tensors on statistical manifolds, and the canonical n-tensors may be considered as natural candidates. 5 Parametrized measure models and sufficient statistics Given a parametrized measure model (statistical model, respectively) (M,, p) and a Markov kernel K : P( ) which induces the Markov morphism K : M() M( ) as in (3.6), we obtain another parametrized measure model (statistical model, respectively) (M,, p ) by defining p (ξ) := K (p(ξ)). It is the purpose of this section to investigate the relation between these two models in more detail. Theorem 5.1. Let (M,, p), K : P( ) and (M,, p ) be as above, and suppose that (M,, p) is k-integrable for some k 1. Then the following hold. 1. (M,, p ) is also k-integrable. 2. If K is congruent w.r.t. some statistic κ :, then V log p(ξ) k = V log p (ξ) k for all V T ξ M. 3. If K is induced by a statistic κ :, then ( V log p(ξ) κ ( V log p (ξ)) ) p(ξ) 1/k (C 1 1/k κ,p(ξ) ). Proof. Since K is the restriction of a bounded linear map, it is obvious that p : M M( ) is again differentiable, and in fact, d ξ p (V ) = K (d ξ p(v )). for all V T ξ M, ξ M. Let us now assume that K is κ-congruent w.r.t. a statistic κ :. Applying Proposition 3.1 (with 1 := and 2 := ) to µ := p(ξ) M() and µ := K (µ ) = p (ξ) M( ), it follows that d ξ p (V ) = K (d ξ p(v )) = K ( V log p(ξ)µ ) = κ ( V log p(ξ))µ = κ ( V log p(ξ))p (ξ), and hence, for any κ-congruent Markov morphism K, we have V log p (ξ) = d{d ξp (V )} dp (ξ) = κ ( V log p(ξ)), 25

Information Geometry and Sufficient Statistics

Information Geometry and Sufficient Statistics Nihat Ay Jürgen Jost Hong Van Le Lorenz Schwachhoefer SFI WORKING PAPER: 2012-11-020 SFI Working Papers contain accounts of scientific work of the author(s)