f-divergences - REPRESENTATION THEOREM AND METRIZABILITY

Ferdinand Österreicher
Institute of Mathematics, University of Salzburg, Austria

Abstract

In this talk we first state the so-called Representation Theorem, which provides the representation of general f-divergences in terms of the class of elementary divergences. Then we consider the risk set of a (simple versus simple) testing problem and its characterization by the above-mentioned class. These ingredients enable us to prove a sharpening of the Range of Values Theorem presented in Talk 1. In the sequel, necessary and sufficient conditions are given so that f-divergences allow for a topology, respectively a metric. In addition, sufficient conditions are given which permit a suitable power of an f-divergence to be a distance. Finally, we investigate the classes of f-divergences discussed in Talk 1 along these lines.

This talk was presented while participating in a workshop of the Research Group in Mathematical Inequalities and Applications at Victoria University, Melbourne, Australia, in November 2002.

Let $\Omega = \{x_1, x_2, \ldots\}$ be a set, $\mathcal{P}(\Omega)$ the set of all subsets of $\Omega$, $\mathbf{P}$ the set of all probability distributions $P = (p(x) : x \in \Omega)$ on $\Omega$ and $\mathbf{P}^2$ the set of all (simple) testing problems. Furthermore, let $\mathcal{F}_0$ be the set of convex functions $f : [0,\infty) \to (-\infty,\infty]$ continuous at $0$ and satisfying $f(1) = 0$, let $f^* \in \mathcal{F}_0$ be the $*$-conjugate (convex) function of $f$, given by $f^*(u) = u\,f(1/u)$, and let $\bar f = f + f^*$.

1 REPRESENTATION THEOREM

1.1 REPRESENTATION BY ELEMENTARY DIVERGENCES

At the beginning we investigate the f-divergences of the Class (IV) of Elementary Divergences, given by

$$f_t(u) = \max(u - t, 0), \quad t \geq 0,$$

and the divergences of the associated class of concave functions

$$g_t(u) = f_t(0) + u\,f_t^*(0) - f_t(u) = u - \max(u - t, 0) = \min(u, t), \quad t \geq 0.$$

Let $A_t = \{q > tp\}$ and $A_t^+ = \{q \geq tp\}$. Then the corresponding divergences are, as can easily be verified, the Measures of Similarity

$$(Q - tP)^+(\Omega) = \sum_{x \in \Omega} \max(q(x) - tp(x), 0) = Q(A_t) - tP(A_t)$$

and the Measures of Orthogonality

$$b(Q, tP) = \sum_{x \in \Omega} \min(q(x), tp(x)) = Q(A_t^c) + tP(A_t)$$

respectively. Obviously it holds

$$Q(A^c) + tP(A) \geq b(Q, tP) \quad \forall\, A \in \mathcal{P}(\Omega),$$

with equality iff $A_t \subseteq A \subseteq A_t^+$.

Remark 1:

a) Let $A$ be a test, i.e. a subset $A \subseteq \Omega$ such that we decide in favour of the hypothesis $Q$ if $x \in A$ is observed and in favour of $P$ if $x \in A^c$ is observed. Then $P(A)$ and $Q(A^c)$ are the probabilities of the type I error and the type II error respectively. Furthermore, let us associate with $t \geq 0$ the prior distribution $\left(\frac{t}{1+t}, \frac{1}{1+t}\right)$ on the set $\{P, Q\}$ of the given testing problem $(P, Q)$. Then $\frac{t}{1+t} P(A) + \frac{1}{1+t} Q(A^c)$ is the Bayes risk of the test $A$ with respect to this prior distribution. Consequently $\frac{1}{1+t} b(Q, tP)$ is the corresponding minimal Bayes risk.

b) As can easily be seen from $|u - 1| = \max(u - 1, 0) + \max(1 - u, 0)$, the special case of the term $(Q - tP)^+(\Omega)$ for $t = 1$ is half of the Total Variation Distance of $P$ and $Q$, i.e. $(Q - P)^+(\Omega) = V(Q, P)/2$.

c) The general term $(Q - tP)^+(\Omega)$ can be described more appropriately in the language of measure theory as the total mass of the positive part of the signed measure $Q - tP$. Note in this context that there is an interesting correspondence between the Neyman-Pearson lemma in statistics and the Radon-Nikodym theorem in measure theory. For details we refer to Österreicher & Thaler (1978).
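For a quick numerical illustration, here is a minimal sketch in Python (the distributions $P$, $Q$ and the values of $t$ are my own choices, not from the talk) computing the two measures and the minimal Bayes risk on a finite $\Omega$, together with the check from Remark 1 b):

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])   # hypothetical P
    q = np.array([0.2, 0.3, 0.5])   # hypothetical Q

    def similarity(q, p, t):
        """(Q - tP)^+(Omega): total mass of the positive part of Q - tP."""
        return np.maximum(q - t * p, 0.0).sum()

    def orthogonality(q, p, t):
        """b(Q, tP) = sum_x min(q(x), t p(x))."""
        return np.minimum(q, t * p).sum()

    for t in [0.5, 1.0, 2.0]:
        b = orthogonality(q, p, t)
        # minimal Bayes risk w.r.t. the prior (t/(1+t), 1/(1+t)) is b/(1+t)
        print(t, similarity(q, p, t), b, b / (1 + t))

    # Remark 1 b): at t = 1 the similarity measure is half the total variation
    assert abs(similarity(q, p, 1.0) - 0.5 * np.abs(q - p).sum()) < 1e-12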

Representation Theorem (Feldman & Österreicher (1981); cf. also Österreicher & Vajda (1993)): Let $f \in \mathcal{F}_0$ and let $D_+f$ denote the right-hand side derivative of the convex function $f$. Then

$$I_f(Q, P) = \int_0^\infty \left[\min(1, t) - b(Q, tP)\right] dD_+f(t) = \int_0^\infty \left[(Q - tP)^+(\Omega) - (P - tP)^+(\Omega)\right] dD_+f(t).$$

Remark 2:

a) The main ingredients of the proof are Fubini's theorem and the following obvious representation of convex functions $f \in \mathcal{F}_0$ satisfying $f(0), D_+f(0) \in \mathbb{R}$:

$$f(u) = f(0) + u\,D_+f(0) + \int_0^\infty \max(u - t, 0)\, dD_+f(t).$$

b) Provided $f(0) < \infty$, the Representation Theorem for the Measure of Orthogonality $I_g$, $g(u) = f(0) + u\,f^*(0) - f(u)$, has the form

$$I_g(Q, P)\ \left(= \bar f(0) - I_f(Q, P)\right) = \int_0^\infty b(Q, tP)\, dD_+f(t).$$
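To see the theorem at work numerically, here is a small check under an assumption added purely for illustration: take $f(u) = (u-1)^2$, the $\chi^2$ case, where $D_+f(t) = 2(t-1)$, so $dD_+f(t) = 2\,dt$ and the Stieltjes integral becomes an ordinary one. Since $b(Q, tP) = \min(1, t)$ for $t \geq \max_x q(x)/p(x)$, the integrand vanishes beyond that point:

    import numpy as np
    from scipy.integrate import quad

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.2, 0.3, 0.5])

    chi2 = np.sum((q - p) ** 2 / p)       # I_f(Q,P) computed directly

    def integrand(t):
        # [min(1,t) - b(Q,tP)] * dD_+f(t)/dt,  with dD_+f(t)/dt = 2
        return (min(1.0, t) - np.minimum(q, t * p).sum()) * 2.0

    upper = (q / p).max()                 # integrand vanishes beyond this point
    breakpts = [c for c in sorted(set((q / p).tolist() + [1.0])) if c < upper]
    val, _ = quad(integrand, 0.0, upper, points=breakpts)

    print(chi2, val)                      # the two values agree
    assert abs(chi2 - val) < 1e-6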

1.2 RISK SETS

Definition: Let $(P, Q) \in \mathbf{P}^2$ be a testing problem. Then the set

$$\mathcal{R}(P, Q) = \operatorname{co}\left\{(P(A), Q(A^c)) : A \in \mathcal{P}(\Omega),\ P(A) + Q(A^c) \leq 1\right\}$$

is called the risk set of the testing problem $(P, Q)$, whereby $\operatorname{co}$ stands for the convex hull.

The bulkiness of a risk set provides a measure for the deviation of $P$ and $Q$. In fact, the family of risk sets defines a uniform structure on the set $\mathbf{P}$. Cf. Linhart & Österreicher (1985).

Properties of Risk Sets:

(R1) $\mathcal{R}(P, Q)$ is a convex subset of the triangle $\Delta = \{(\alpha, \beta) \in [0,1]^2 : \alpha + \beta \leq 1\}$ containing the diagonal $D = \{(\alpha, \beta) \in [0,1]^2 : \alpha + \beta = 1\}$. More specifically, it holds $D \subseteq \mathcal{R}(P, Q) \subseteq \Delta$, with equality on the left iff $P = Q$ and on the right iff $P \perp Q$.

(R2) Let $t \geq 0$ and let $b(Q, tP)$ be the $(1+t)$-multiple of the minimal Bayes risk with respect to the prior distribution $\left(\frac{t}{1+t}, \frac{1}{1+t}\right)$. Then the risk set $\mathcal{R}(P, Q)$ of a testing problem is determined by its family of supporting lines from below, namely

$$\beta = b(Q, tP) - t\,\alpha, \quad t \geq 0.$$

(R2) and the Representation Theorem imply that an f-divergence $I_f(Q, P)$ depends on the testing problem $(P, Q)$ only via its risk set $\mathcal{R}(P, Q)$. (This fact and Remark 3 match the properties (a) and (b) of the Characterization Theorem presented in Talk 1.) In fact, the following holds.

Remark 3: Let $(P, Q)$ and $(\tilde P, \tilde Q)$ be two testing problems. Then obviously

$$\mathcal{R}(P, Q) \subseteq \mathcal{R}(\tilde P, \tilde Q) \iff b(Q, tP) \geq b(\tilde Q, t\tilde P) \quad \forall\, t \geq 0,$$

and hence, by virtue of the Representation Theorem,

$$\mathcal{R}(P, Q) \subseteq \mathcal{R}(\tilde P, \tilde Q) \implies I_f(Q, P) \leq I_f(\tilde Q, \tilde P).$$

If, in addition, $f$ is strictly convex and $I_f(\tilde Q, \tilde P) < \infty$, then equality holds for the f-divergences iff it holds for the risk sets.

1.3 REFINEMENT OF THE RANGE OF VALUES THEOREM

Let $0 < x < 1$, let $\alpha \in [0, 1-x]$, $P_\alpha = (\alpha, 1-\alpha)$ and consequently $P_{\alpha+x} = (\alpha+x, 1-(\alpha+x))$. Then the testing problem $(P_{\alpha+x}, P_\alpha)$ has the risk set

$$\mathcal{R}(P_\alpha, P_{\alpha+x}) = \operatorname{co}\{(0,1), (\alpha, 1-(x+\alpha)), (1,0)\},$$

the f-divergence

$$I_f(P_{\alpha+x}, P_\alpha) = \alpha\, f\!\left(1 + \frac{x}{\alpha}\right) + (1-\alpha)\, f\!\left(1 - \frac{x}{1-\alpha}\right)$$

and consequently the Total Variation Distance $V(P_{\alpha+x}, P_\alpha) = 2x$.

On the other hand, let $P^{(x)} = (0, 1-x, x)$ and $Q^{(x)} = (x, 1-x, 0)$. Then the testing problem $(P^{(x)}, Q^{(x)})$ has the risk set

$$\mathcal{R}(P^{(x)}, Q^{(x)}) = \operatorname{co}\{(0,1), (0, 1-x), (1-x, 0), (1,0)\},$$

the f-divergence

$$I_f(P^{(x)}, Q^{(x)}) = \bar f(0)\, x$$

and consequently the Total Variation Distance $V(P^{(x)}, Q^{(x)}) = 2x$, which equals $V(P_{\alpha+x}, P_\alpha)$. Therefore every testing problem $(P, Q) \in \mathbf{P}^2$ with $V(Q, P) = 2x$ satisfies, for a suitable $\alpha \in [0, 1-x]$,

$$\mathcal{R}(P_\alpha, P_{\alpha+x}) \subseteq \mathcal{R}(P, Q) \subseteq \mathcal{R}(P^{(x)}, Q^{(x)})$$

and hence

$$I_f(P_{\alpha+x}, P_\alpha) \leq I_f(Q, P) \leq \bar f(0)\, x.$$
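The risk set of a finite testing problem is easy to compute explicitly: by the Neyman-Pearson lemma, the extreme points of its lower boundary come from the nested sets $A_t = \{q > tp\}$, i.e. from taking the points of $\Omega$ in decreasing order of the likelihood ratio $q/p$. A minimal sketch (the example distributions are mine), which also checks the supporting-line property (R2):

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.2, 0.3, 0.5])

    # Order the points of Omega by decreasing likelihood ratio q/p; the
    # extreme points of the lower boundary are (P(A_k), Q(A_k^c)) for the
    # nested sets A_k consisting of the k points with the largest ratio.
    order = np.argsort(-q / p)
    vertices = [(0.0, 1.0)]
    for k in range(1, len(p) + 1):
        A = order[:k]
        vertices.append((p[A].sum(), 1.0 - q[A].sum()))
    print(vertices)   # piecewise linear boundary from (0,1) down to (1,0)

    # (R2): each line beta = b(Q,tP) - t*alpha supports the risk set from below.
    for t in [0.5, 1.0, 2.0]:
        b = np.minimum(q, t * p).sum()
        assert all(beta >= b - t * alpha - 1e-12 for alpha, beta in vertices)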

Now let

$$c_f(x) = \min\left\{I_f(P_{\alpha+x}, P_\alpha) : 0 \leq \alpha \leq 1-x\right\}.$$

Then every testing problem $(P, Q)$ with Total Variation Distance $V(Q, P) = 2x$ satisfies

$$c_f(x) \leq I_f(Q, P) \leq \bar f(0)\, x.$$

Since obviously $c_f(0) = f(1) = 0$ we have achieved the following

Refinement of the Range of Values Theorem (Feldman & Österreicher (1989)): Let the function $c_f : [0,1] \to \mathbb{R}$ be defined as above. Then

$$c_f(V(Q,P)/2) \leq I_f(Q, P) \leq \bar f(0)\, V(Q,P)/2.$$

The function $c_f$ is characteristic for the f-divergence and therefore important for more elaborate investigations.

Properties of $c_f$: Let $f \in \mathcal{F}_0$. Then the function $c_f : [0,1] \to \mathbb{R}$ has the following properties:

(a) $0 \leq f(1-x) \leq c_f(x)$ $\forall\, x \in [0,1]$, with strict second inequality for $x \in (0,1]$ if $f$ is strictly convex at $1$. Furthermore $c_f(0) = 0$ and $c_f(1) = \bar f(0)$,

(b) $c_f$ is convex and continuous on $[0,1]$,

(c) $c_f$ is increasing (strictly increasing iff $f$ is strictly convex at $1$) and

(d) $c_f = c_{f^*} \leq \frac{1}{2} c_{\bar f}$. If, in addition, $f \equiv f^*$, then

$$c_f(x) = \frac{1}{2}\, c_{\bar f}(x) = (1+x)\, f\!\left(\frac{1-x}{1+x}\right).$$
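Property (d) can be illustrated numerically for the self-conjugate Hellinger case $f(u) = (\sqrt u - 1)^2$; the following sketch (the grid over $\alpha$ is my own discretisation) compares the minimum defining $c_f$ with the closed form $(1+x)\, f\big(\frac{1-x}{1+x}\big)$:

    import numpy as np

    f = lambda u: (np.sqrt(u) - 1.0) ** 2       # Hellinger case: f = f*

    def I_f(x, a):
        """I_f(P_{a+x}, P_a) = a f(1 + x/a) + (1-a) f(1 - x/(1-a))."""
        return a * f(1 + x / a) + (1 - a) * f(1 - x / (1 - a))

    def c_f(x, n=200001):
        a = np.linspace(1e-9, 1 - x - 1e-9, n)  # grid over the interior
        return I_f(x, a).min()

    for x in [0.1, 0.5, 0.9]:
        closed = (1 + x) * f((1 - x) / (1 + x))
        print(x, c_f(x), closed)                # the two columns agree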

2 METRIC f-DIVERGENCES

2.1 GENERATION OF A TOPOLOGY

Let both $\Omega$ and $f \in \mathcal{F}_0$ be non-trivial, $\mathbf{P}$ be the set of all probability distributions on $\Omega$, $P \in \mathbf{P}$ and $\varepsilon > 0$. Let

$$U_f(P, \varepsilon) = \{Q \in \mathbf{P} : I_f(Q, P) < \varepsilon\}$$

be an $I_f$-ball around $P$ with radius $\varepsilon$,

$$\mathcal{V}_f(P) = \{U_f(P, \varepsilon) : \varepsilon > 0\}$$

the set of all such balls and

$$\mathcal{U}_f(P) = \{U \subseteq \mathbf{P} : \exists\, V \in \mathcal{V}_f(P) \text{ such that } V \subseteq U\}$$

the associated system of neighbourhoods of $P$.

Definition: $f$ is said to generate a topology $\mathcal{T}_f$ on $\mathbf{P}$ if $\{\mathcal{U}_f(P) : P \in \mathbf{P}\}$ generates a topology on $\mathbf{P}$.

In the following we state two criteria under which $f$ generates a topology $\mathcal{T}_f$ on $\mathbf{P}$.

Theorem (Csiszár (1967)): Let $\Omega$ be infinite and $f \in \mathcal{F}_0$. Then $f$ generates a topology on $\mathbf{P}$ iff (i) $f$ is strictly convex at $1$ and (iii) $f(0) < \infty$.

Theorem (Kafka (1995)): Let $\Omega$ be finite and $f \in \mathcal{F}_0$. Then $f$ generates a topology on $\mathbf{P}$ iff (i) $f$ is strictly convex at $1$.

Corollary of the Refinement of the Range of Values Theorem: If the properties (i) and (iii) are satisfied, then the topology on $\mathbf{P}$ generated by $f$ coincides with the topology induced by the Total Variation Distance.

2.2 BASIC PROPERTIES AND RESULTS

In the sequel we assume $\Omega$ to be infinite and $f \in \mathcal{F}_0$ to satisfy (without loss of generality) $f(u) \geq 0$ $\forall\, u \in [0,\infty)$. We consider the Measures of Similarity and concentrate especially on those (further) properties of the convex function $f$ which make metric divergences possible.

As we know already, $I_f(Q, P)$ fulfils the basic property (M1) of a metric divergence, namely

$$I_f(Q, P) \geq 0 \quad \forall\, P, Q \in \mathbf{P}, \text{ with equality iff } Q = P, \qquad (M1)$$

provided (i) $f$ is strictly convex at $1$. In addition, $I_f(Q, P)$ is symmetric, i.e. satisfies

$$I_f(Q, P) = I_f(P, Q) \quad \forall\, P, Q \in \mathbf{P}, \qquad (M2)$$

iff (ii) $f$ is $*$-self conjugate, i.e. satisfies $f \equiv f^*$.

As we know from Csiszár's Theorem, $f$ allows for a topology (namely the topology induced by the Total Variation Distance) iff, in addition to (i), (iii) $f(0) < \infty$ holds. Therefore the properties (i), (ii) and (iii) are crucial for $f \in \mathcal{F}_0$ to allow for the definition of a metric divergence.

Now we state two theorems given in Kafka, Österreicher & Vincze (1991). Theorem 1 offers a class (iii,$\alpha$), $\alpha \in (0,1]$, of conditions which are sufficient for guaranteeing the power $[I_f(Q, P)]^\alpha$ to be a distance on $\mathbf{P}$. Theorem 2 determines, in dependence of the behaviour of $f$ in the neighbourhood of $1$ and of $g(u) = f(0)(1+u) - f(u)$ in the neighbourhood of $0$, the maximal $\alpha$ providing a distance.
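Condition (ii) is easy to test from the definition $f^*(u) = u\,f(1/u)$. A minimal sketch (the two test functions are standard examples, chosen by me): the Hellinger function is $*$-self conjugate, while $f_1(u) = 1 - u + u\ln u$ from the Dichotomy Class below, whose divergence is the Kullback-Leibler divergence, is not:

    import numpy as np

    conj = lambda f: (lambda u: u * f(1.0 / u))   # *-conjugate: f*(u) = u f(1/u)

    hellinger = lambda u: (np.sqrt(u) - 1.0) ** 2
    kl = lambda u: 1.0 - u + u * np.log(u)

    u = np.linspace(0.1, 10.0, 50)
    print(np.allclose(conj(hellinger)(u), hellinger(u)))  # True:  (ii) holds
    print(np.allclose(conj(kl)(u), kl(u)))                # False: I_f asymmetric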

Theorem 1: Let $\alpha \in (0,1]$ and let $f \in \mathcal{F}_0$ fulfil, in addition to (ii), the condition

(iii,$\alpha$): the function $h(u) = \dfrac{(1 - u^\alpha)^{1/\alpha}}{f(u)}$, $u \in [0,1)$, is non-increasing.

Then $\rho_\alpha(Q, P) = [I_f(Q, P)]^\alpha$ satisfies the triangle inequality

$$\rho_\alpha(Q, P) \leq \rho_\alpha(Q, R) + \rho_\alpha(R, P) \quad \forall\, P, Q, R \in \mathbf{P}. \qquad (M3,\alpha)$$

Remark 4: The conditions (ii) and (iii,$\alpha$) imply both (i) and (iii).

Theorem 2: Let (i), (ii) and (iii) hold true and let $\alpha_0 \in (0,1]$ be the maximal $\alpha$ for which (iii,$\alpha$) is satisfied. Then the following statement concerning $\alpha_0$ holds. If for some $k_0, k_1, c_0, c_1 \in (0,\infty)$

$$f(0)(1+u) - f(u) \approx c_0\, u^{k_0} \text{ as } u \downarrow 0 \quad \text{and} \quad f(u) \approx c_1\, |u-1|^{k_1} \text{ as } u \to 1,$$

then $k_0 \leq 1 \leq k_1$ and $\alpha_0 = \min(k_0, 1/k_1) \leq 1$.
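Before turning to the examples, the role of the exponent $\alpha$ in (M3,$\alpha$) can be seen numerically. A minimal sketch (random triples generated by me): for the Hellinger divergence $I_f$ with $f(u) = (\sqrt u - 1)^2$, the power $\alpha = 1/2$ never violates the triangle inequality, while $\alpha = 1$ does:

    import numpy as np

    rng = np.random.default_rng(0)

    def hellinger_div(q, p):
        """I_f(Q,P) with f(u) = (sqrt(u)-1)^2, i.e. sum_x (sqrt q - sqrt p)^2."""
        return ((np.sqrt(q) - np.sqrt(p)) ** 2).sum()

    violations = {0.5: 0, 1.0: 0}
    for _ in range(10000):
        P, Q, R = rng.dirichlet(np.ones(3), size=3)
        for alpha in violations:
            rho = lambda a, b: hellinger_div(a, b) ** alpha
            if rho(Q, P) > rho(Q, R) + rho(R, P) + 1e-12:
                violations[alpha] += 1
    print(violations)   # alpha = 0.5: 0 violations; alpha = 1.0: violations occur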

2.3 EXAMPLES OF METRIC f-DIVERGENCES

In the sequel we investigate the classes of f-divergences discussed in Talk 1. First we check for which parameters the conditions (i), (ii), (iii) are satisfied. The results are summarized in the following table.

    Class    (i) given for    (ii) given for    (iii) given for
    (I)      all α            α = 1             α = 1
    (II)     all α            α = 1/2           0 < α < 1
    (II')    all s            all s             0 < s < 1
    (III)    all α            all α             all α
    (IV)     t = 1            (t = 1)           all t
    (V)      all k            all k             all k
    (VI)     all β            all β             all β

Now we proceed with those parameters of the different classes which are not ruled out and present the maximal powers of the f-divergences defining a distance.

(I) The class of $\chi^\alpha$-Divergences

$$\chi^\alpha(u) = |u - 1|^\alpha, \quad \alpha \geq 1.$$

From this class only the parameter $\alpha = 1$ provides a distance, namely the Total Variation Distance $V(Q, P)$.

(II) Dichotomy Class

$$f_\alpha(u) = \begin{cases} u - 1 - \ln u & \text{for } \alpha = 0 \\ \dfrac{\alpha u + 1 - \alpha - u^\alpha}{\alpha(1-\alpha)} & \text{for } \alpha \in \mathbb{R} \setminus \{0, 1\} \\ 1 - u + u \ln u & \text{for } \alpha = 1 \end{cases}$$

From this class only the parameter $\alpha = \frac{1}{2}$ $\left(f_{1/2}(u) = (\sqrt u - 1)^2\right)$ provides a distance, namely the Hellinger Distance

$$\left[I_{f_{1/2}}(Q, P)\right]^{1/2} = \left[\sum_{x \in \Omega} \left(\sqrt{q(x)} - \sqrt{p(x)}\right)^2\right]^{1/2}.$$

(II') Symmetrized Dichotomy Class

$$f^{(s)}(u) = \begin{cases} (u - 1) \ln u & \text{for } s = 1 \\ \dfrac{1 + u - (u^s + u^{1-s})}{s(1-s)} & \text{for } s \in (0,1) \cup (1,\infty) \end{cases}$$

As shown by Csiszár & Fischer (1962), the parameters $s \in (0,1)$ provide the distances $\left[I_{f^{(s)}}(Q, P)\right]^{\min(s, 1-s)}$.

(III) Matusita's Divergences

$$f_\alpha(u) = |1 - u^\alpha|^{1/\alpha}, \quad 0 < \alpha \leq 1,$$

are the prototypes of metric divergences, providing the distances $[I_{f_\alpha}(Q, P)]^\alpha$.

(IV) Elementary Divergences

$$f_t(u) = \max(u - t, 0), \quad t \geq 0.$$

This class provides only for $t = 1$ $\left(f_1(u) = \max(u - 1, 0)\right)$ a metric, namely $V(Q, P)/2$.

(V) Puri-Vincze Divergences

$$\Phi_k(u) = \frac{1}{2} \frac{|1 - u|^k}{(u + 1)^{k-1}}, \quad k \geq 1.$$

As shown by Kafka, Österreicher & Vincze (1991), this class provides the distances $[I_{\Phi_k}(Q, P)]^{1/k}$.

(VI) Divergences of Arimoto-type

$$f_\beta(u) = \begin{cases} \dfrac{1}{1 - 1/\beta} \left[(1 + u^\beta)^{1/\beta} - 2^{1/\beta - 1}(1 + u)\right] & \text{if } \beta \in (0,\infty) \setminus \{1\} \\ (1 + u) \ln 2 + u \ln u - (1 + u) \ln(1 + u) & \text{if } \beta = 1 \\ |1 - u|/2 & \text{if } \beta = \infty. \end{cases}$$

As shown by Österreicher & Vajda (1997), this class provides the distances $\left[I_{f_\beta}(Q, P)\right]^{\min(\beta, \frac{1}{2})}$ for $\beta \in (0,\infty)$ and $V(Q, P)/2$ for $\beta = \infty$.
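To close the loop, the following minimal sketch (distributions again my own) evaluates three of the distances just listed: the Total Variation Distance from class (I), the Hellinger Distance from classes (II)/(III), and the Puri-Vincze distance $[I_{\Phi_2}(Q, P)]^{1/2}$:

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.2, 0.3, 0.5])

    def I_f(f, q, p):
        """I_f(Q,P) = sum_x p(x) f(q(x)/p(x)), assuming p > 0 everywhere."""
        return (p * f(q / p)).sum()

    tv = I_f(lambda u: np.abs(u - 1.0), q, p)                        # class (I)
    hellinger = I_f(lambda u: (np.sqrt(u) - 1.0) ** 2, q, p) ** 0.5  # class (II)
    puri_vincze = I_f(lambda u: 0.5 * (1 - u) ** 2 / (u + 1.0), q, p) ** 0.5  # (V), k = 2

    print(tv, hellinger, puri_vincze)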

References

Csiszár, I. and Fischer, J. (1962): Informationsentfernungen im Raum der Wahrscheinlichkeitsverteilungen. Magyar Tud. Akad. Mat. Kutató Int. Közl., 7, 159-180.

Csiszár, I. (1967): On topological properties of f-divergences. Studia Sci. Math. Hungar., 2, 329-339.

Feldman, D. and Österreicher, F. (1981): Divergenzen von Wahrscheinlichkeitsverteilungen integralgeometrisch betrachtet. Acta Math. Acad. Sci. Hungar., 37/4, 329-337.

Feldman, D. and Österreicher, F. (1989): A note on f-divergences. Studia Sci. Math. Hungar., 24, 191-200.

Kafka, P., Österreicher, F. and Vincze, I. (1991): On powers of f-divergences defining a distance. Studia Sci. Math. Hungar., 26, 415-422.

Kafka, P. (1995): Erzeugen von Topologien und Projektionen in Wahrscheinlichkeitsräumen mittels f-Divergenzen. Diplomarbeit, Salzburg.

Liese, F. and Vajda, I. (1987): Convex Statistical Distances. Teubner-Texte zur Mathematik, Band 95, Leipzig.

Linhart, J. and Österreicher, F. (1985): Uniformity and distance - a vivid example from statistics. Int. J. Edu. Sci. Technol., 16/5, 645-649.

Matusita, K. (1964): Distances and decision rules. Ann. Inst. Statist. Math., 16, 305-320.

Österreicher, F. and Thaler, M. (1978): The fundamental Neyman-Pearson lemma and the Radon-Nikodym theorem from a common statistical point of view. Int. J. Edu. Sci. Technol., 9, 163-176.

Österreicher, F. (1983): Least favourable distributions. In: Kotz-Johnson: Encyclopedia of Statistical Sciences, Vol. 4, 588-592. John Wiley & Sons, New York.

Österreicher, F. (1985): Die Risikomenge eines statistischen Testproblems - ein anschauliches mathematisches Hilfsmittel. Österreichische Zeitschrift f. Statistik u. Informatik, 15. Jg., Heft 2-3, 216-228.

Österreicher, F. (1990): The risk set of a testing problem - a vivid statistical tool. In: Trans. of the 11th Prague Conference on Information Theory, Academia, Prague, Vol. A, 175-188.

Österreicher, F. (1991): Informationstheorie. Skriptum zur Vorlesung, Salzburg.

Österreicher, F. and Vajda, I. (1993): Statistical information and discrimination. IEEE Trans. Inform. Theory, 39/3, 1036-1039.