
BULGARIAN ACADEMY OF SCIENCES
CYBERNETICS AND INFORMATION TECHNOLOGIES, Volume 12, No 2, Sofia, 2012

Some New Information Inequalities Involving f-divergences

Amit Srivastava
Department of Mathematics, Jaypee Institute of Information Technology, NOIDA (Uttar Pradesh), INDIA
Email: raj_amit377@yahoo.co.in

Abstract: New information inequalities involving f-divergences have been established using convexity arguments and some well-known inequalities such as the Jensen inequality and the Arithmetic-Geometric Mean (AGM) inequality. Some particular cases have also been discussed.

Keywords: f-divergence, convexity, parameterization, arithmetic mean, geometric mean.

2000 Mathematics Subject Classification: 94A17, 26D15.

1. The concept of f-divergences

Let $F$ be the set of convex functions $f : [0, \infty) \to (-\infty, \infty]$ which are finite on $(0, \infty)$ and continuous at the point $u = 0$, that is, $f(0) = \lim_{u \downarrow 0} f(u)$, and let $F_0 = \{ f \in F : f(1) = 0 \}$. Further, if $f \in F$, then the function $f^*$ defined by

$$f^*(u) = u f\!\left(\frac{1}{u}\right) \quad \text{for } u \in (0, \infty), \qquad f^*(0) = \lim_{u \downarrow 0} u f\!\left(\frac{1}{u}\right),$$

is also in $F$ and is called the *-conjugate (convex) function of $f$.

Definition 1.1. Let

(1.1) $\Gamma_n = \left\{ P = (p_1, p_2, \dots, p_n) : p_i > 0,\ \sum_{i=1}^{n} p_i = 1 \right\}, \quad n \geq 2,$

denote the set of all finite discrete (n-ary) complete probability distributions. For a convex function $f \in F$, the f-divergence of the probability distributions $P$ and $Q$ is given by

(1.2) $C_f(P, Q) = \sum_{i=1}^{n} q_i f\!\left(\frac{p_i}{q_i}\right),$

where $P = (p_1, p_2, \dots, p_n)$ and $Q = (q_1, q_2, \dots, q_n)$ belong to $\Gamma_n$. In $\Gamma_n$ we have taken all $p_i > 0$. If we take all $p_i \geq 0$ for $i = 1, 2, \dots, n$, then we have to suppose that $0 \ln 0 = 0$ and $0 \ln(0/0) = 0$. It is common to take logarithms with base 2, but here we use only natural logarithms.

These divergences were introduced and studied independently by Csiszár [5, 6] and Ali and Silvey [1] and are sometimes known as Csiszár f-divergences or Ali-Silvey distances. The f-divergence given by (1.2) is a versatile functional form which, with a suitable choice of the function involved, leads to some well-known divergence measures. For example, $f(u) = u \ln u$ provides the Kullback-Leibler measure [13], $f(u) = |u - 1|$ results in the variational distance [12], $f(u) = (u - 1)^2$ yields the $\chi^2$ divergence [15], and many more. These measures have been applied in a variety of fields, such as economics and political science [18, 19], biology [16], the analysis of contingency tables [7], approximation of probability distributions [4, 11], signal processing [9, 10] and pattern recognition [2, 3, 8].

The f-divergence satisfies a large number of properties which are important from an information-theoretic point of view. Österreicher [14] has discussed the basic general properties of f-divergences, including their axiomatic properties and some important classes. The f-divergence defined by (1.2) is generally asymmetric in P and Q. Nevertheless, the convexity of $f(u)$ implies that of $f^*(u) = u f(1/u)$, and with this function we have $C_{f^*}(P, Q) = C_f(Q, P)$. Hence it follows, in particular, that the symmetrised f-divergence $C_f(P, Q) + C_f(Q, P)$ is again an f-divergence, with respect to the convex function $f + f^*$.

In the present work we establish new information inequalities involving f-divergences using convexity arguments and some well-known inequalities, such as the Jensen inequality and the Arithmetic-Geometric Mean (AGM) inequality. Further, we use these inequalities to establish relationships among some well-known divergence measures. Without essential loss of insight, we restrict ourselves to discrete probability distributions and note that the extension to the general case relies strongly on the Lebesgue-Radon-Nikodym theorem.
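The following short Python sketch illustrates Definition 1.1 numerically. It assumes the form of (1.2) reconstructed above; the helper names (f_divergence, conjugate) are illustrative choices of ours, not notation from the paper.

```python
import numpy as np

def f_divergence(p, q, f):
    """Csiszar f-divergence C_f(P,Q) = sum_i q_i f(p_i / q_i), cf. (1.2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

def conjugate(f):
    """*-conjugate convex function f*(u) = u f(1/u)."""
    return lambda u: u * f(1.0 / u)

# Convex generators for the three examples mentioned in the text.
kl_f   = lambda u: u * np.log(u)        # Kullback-Leibler measure
var_f  = lambda u: np.abs(u - 1.0)      # variational distance
chi2_f = lambda u: (u - 1.0) ** 2       # chi-square divergence

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

print(f_divergence(P, Q, kl_f))                    # KL(P, Q)
print(f_divergence(P, Q, var_f))                   # sum |p_i - q_i|
print(f_divergence(P, Q, chi2_f))                  # chi^2(P, Q)
# Conjugacy relation: C_{f*}(P, Q) == C_f(Q, P).
print(np.isclose(f_divergence(P, Q, conjugate(kl_f)),
                 f_divergence(Q, P, kl_f)))        # True
```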

2. Information inequalities

Result 2.1. If $f : (0, \infty) \to \mathbb{R}$ is convex, then the function

$$g_f(x, y) = y f\!\left(\frac{x + y}{2y}\right)$$

of two variables is convex on the domain $(x, y) \in (0, \infty) \times (0, \infty)$.

P r o o f. Consider $\lambda \in (0, 1)$ and two points $(x_1, y_1), (x_2, y_2)$ from the domain of the function. For $\bar{x} = \lambda x_1 + (1 - \lambda) x_2$ and $\bar{y} = \lambda y_1 + (1 - \lambda) y_2$ we get

$$\frac{\bar{x} + \bar{y}}{2\bar{y}} = \frac{\lambda y_1}{\bar{y}} \cdot \frac{x_1 + y_1}{2 y_1} + \frac{(1 - \lambda) y_2}{\bar{y}} \cdot \frac{x_2 + y_2}{2 y_2},$$

so that, by the convexity of $f$,

$$\bar{y}\, f\!\left(\frac{\bar{x} + \bar{y}}{2\bar{y}}\right) \leq \lambda y_1 f\!\left(\frac{x_1 + y_1}{2 y_1}\right) + (1 - \lambda) y_2 f\!\left(\frac{x_2 + y_2}{2 y_2}\right),$$

or, equivalently,

$$g_f(\lambda x_1 + (1 - \lambda) x_2,\ \lambda y_1 + (1 - \lambda) y_2) \leq \lambda g_f(x_1, y_1) + (1 - \lambda) g_f(x_2, y_2),$$

which completes the proof.

Result 2.2. If $f : (0, \infty) \to \mathbb{R}$ is convex, then the function

$$g_f^n(x, y) = y f\!\left(\frac{x + n y}{(1 + n) y}\right), \quad n \geq 0,$$

of two variables is convex on the domain $(x, y) \in (0, \infty) \times (0, \infty)$.

P r o o f. The proof follows on similar lines as in the previous result, except that the argument of $f$ is now taken as $\frac{x + n y}{(1 + n) y}$, $n \geq 0$.

We therefore have the following divergence functionals of f-divergence type:

(2.1) $C_f^n(P, Q) = \sum_i q_i f\!\left(\frac{p_i + n q_i}{(1 + n) q_i}\right),$

where $P = (p_1, p_2, \dots), Q = (q_1, q_2, \dots) \in \Gamma_n$. For $n = 0$, the functional (2.1) reduces to the Csiszár f-divergence given by (1.2). Replacing $n$ by $1/n$ in (2.1), we obtain

(2.2) $C_f^{1/n}(P, Q) = \sum_i q_i f\!\left(\frac{n p_i + q_i}{(1 + n) q_i}\right).$

Relationships with the Csiszár f-divergence now follow.
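As a numerical sanity check of the functional (2.1), whose exact form above is a reconstruction consistent with the proofs that follow, the sketch below verifies the reduction to (1.2) at n = 0 and spot-checks the convexity asserted in Result 2.2 along midpoints.

```python
import numpy as np

def csiszar(p, q, f):
    """C_f(P,Q) from (1.2)."""
    return float(np.sum(q * f(p / q)))

def c_n(p, q, f, n):
    """Reconstructed parametric functional (2.1): sum_i q_i f((p_i + n q_i)/((1+n) q_i))."""
    return float(np.sum(q * f((p + n * q) / ((1.0 + n) * q))))

f = lambda u: u * np.log(u)            # convex generator of the Kullback-Leibler measure
P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.2, 0.5, 0.3])

# For n = 0 the parametric functional coincides with the Csiszar f-divergence.
print(np.isclose(c_n(P, Q, f, 0.0), csiszar(P, Q, f)))   # True

# Midpoint-convexity spot check for g(x, y) = y f((x + n y)/((1 + n) y)), n = 2 (Result 2.2).
g = lambda x, y, n=2.0: y * f((x + n * y) / ((1.0 + n) * y))
rng = np.random.default_rng(0)
for _ in range(1000):
    x1, y1, x2, y2 = rng.uniform(0.1, 5.0, size=4)
    assert g(0.5 * (x1 + x2), 0.5 * (y1 + y2)) <= 0.5 * g(x1, y1) + 0.5 * g(x2, y2) + 1e-12
print("midpoint convexity checks passed")
```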

Result 2.3. Let $f : I \subset (0, \infty) \to \mathbb{R}$ be a differentiable convex function on the interval $I$, with $1 \in I^{\circ}$ ($I^{\circ}$ is the interior of $I$). Further we assume that $f(1) = 0$. Then for all $P, Q \in \Gamma_n$ we have

(2.3) $(1 + n)\, C_f^n(P, Q) \leq C_f(P, Q),$

(2.4) $\frac{1 + n}{n}\, C_f^{1/n}(P, Q) \leq C_f(P, Q),$

where $C_f(P, Q)$ and $C_f^n(P, Q)$ are the measures given by (1.2) and (2.1) respectively. The equality holds in the above inequalities if $p_i = q_i$ for each $i$.

P r o o f. Let $\lambda_1, \lambda_2, \dots, \lambda_m \geq 0$ with $\sum_{j=1}^{m} \lambda_j = 1$ and $x_1, x_2, \dots, x_m \in I$. Then it is well known that

(2.5) $f\!\left(\sum_{j=1}^{m} \lambda_j x_j\right) \leq \sum_{j=1}^{m} \lambda_j f(x_j).$

If $f$ is strictly convex, then the equality holds if and only if $x_1 = x_2 = \dots = x_m$. The above inequality is the famous Jensen inequality. If $f$ is a concave function, then the inequality sign is reversed. If we assume $\lambda_1 = \lambda$, $\lambda_2 = 1 - \lambda$, with all the other $\lambda_j$ zero, then we obtain

(2.6) $f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2).$

Choosing $x_1 = x$ and $x_2 = 1$, we obtain

(2.7) $f(\lambda x + 1 - \lambda) \leq \lambda f(x),$

since $f(1) = 0$. Substituting $x = p_i/q_i$ and $\lambda = 1/2$ in the above inequality, multiplying by $q_i$ and then summing over all $i$, we obtain

(2.8) $2\, C_f^1(P, Q) \leq C_f(P, Q).$

A choice of $\lambda = 1/4$ and $x_2 = 1$ will give $4\, C_f^3(P, Q) \leq C_f(P, Q)$, a choice of $\lambda = 1/8$ and $x_2 = 1$ will give $8\, C_f^7(P, Q) \leq C_f(P, Q)$, and finally a choice of $\lambda = 1/(1 + n)$ and $x_2 = 1$ will yield (2.3). Combining the above choices of $\lambda$ and $x$, we obtain the chain

$$\cdots \leq 16\, C_f^{15}(P, Q) \leq 8\, C_f^{7}(P, Q) \leq 4\, C_f^{3}(P, Q) \leq 2\, C_f^{1}(P, Q) \leq C_f(P, Q).$$

Also a choice of $\lambda = n/(1 + n)$ and $x_2 = 1$ will yield (2.4).

The inequalities given by (2.3) and (2.4) can be used to establish relationships among some well-known divergence measures. For example, if $f(u) = -\ln u$ and $n = 1$ in (2.3), we obtain

$$\sum_i q_i \ln\frac{2 q_i}{p_i + q_i} \leq \frac{1}{2} \sum_i q_i \ln\frac{q_i}{p_i},$$

and, since $P$ and $Q$ are arbitrary, interchanging their roles gives

$$F(P, Q) \leq \frac{1}{2}\, K(P, Q).$$

Here $F(P, Q) = \sum_i p_i \ln\dfrac{2 p_i}{p_i + q_i}$ and $K(P, Q) = \sum_i p_i \ln\dfrac{p_i}{q_i}$ denote the relative Jensen-Shannon divergence measure [17] and the Kullback-Leibler divergence measure [13] respectively.
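The inequalities (2.3), (2.4) and the bound F(P, Q) <= K(P, Q)/2 can be spot-checked on random distributions. The sketch below again assumes the reconstructed forms of (1.2) and (2.1) used above.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_dist(k):
    w = rng.uniform(0.05, 1.0, size=k)
    return w / w.sum()

def csiszar(p, q, f):                  # (1.2)
    return float(np.sum(q * f(p / q)))

def c_n(p, q, f, n):                   # reconstructed (2.1)
    return float(np.sum(q * f((p + n * q) / ((1.0 + n) * q))))

f = lambda u: -np.log(u)               # convex, f(1) = 0

for _ in range(500):
    P, Q = random_dist(4), random_dist(4)
    for n in (0.5, 1.0, 3.0):
        assert (1 + n) * c_n(P, Q, f, n) <= csiszar(P, Q, f) + 1e-12            # (2.3)
        assert (1 + n) / n * c_n(P, Q, f, 1.0 / n) <= csiszar(P, Q, f) + 1e-12  # (2.4)
    # Relative Jensen-Shannon vs. Kullback-Leibler: F(P,Q) <= K(P,Q)/2.
    F = np.sum(P * np.log(2 * P / (P + Q)))
    K = np.sum(P * np.log(P / Q))
    assert F <= 0.5 * K + 1e-12
print("inequalities (2.3), (2.4) and F <= K/2 verified on random samples")
```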

3. Parameterization of f-divergences

Let us consider the set of all those divergence measures for which the associated convex functions $f$ satisfy the functional equation

(3.1) $f(u) = u f\!\left(\frac{1}{u}\right), \quad u \in (0, \infty),$

that is, $f = f^*$, and for which $f(1) = 0$, i.e., $f \in F_0$. Now for any such solution $f$, set

(3.2) $\varphi(u) = \dfrac{2 f(u)}{u - 1}$ for $u > 1$, and $\varphi(u) = \dfrac{2 f(u)}{1 - u}$ for $u < 1$

(and define $\varphi(1)$ arbitrarily). One can easily check that $\max(u, 1) - \frac{1}{2}(1 + u) = \frac{1}{2}|u - 1|$, so that (3.2) can also be written as $\varphi(u) = f(u)\big/\big(\max(u, 1) - \tfrac{1}{2}(1 + u)\big)$ for $u \neq 1$. Therefore, if $u > 1$, then, using (3.1),

$$\varphi\!\left(\frac{1}{u}\right) = \frac{2 f(1/u)}{1 - 1/u} = \frac{2 u f(1/u)}{u - 1} = \frac{2 f(u)}{u - 1} = \varphi(u),$$

and, if $u < 1$, the same computation with $u$ and $1/u$ interchanged again gives $\varphi(1/u) = \varphi(u)$. Thus $\varphi(u) = \varphi(1/u)$ holds (obviously, also for $u = 1$) for the function $\varphi$ defined by (3.2). Therefore every solution of (3.1) satisfying $f(1) = 0$ can be written in the form

$$f(u) = \tfrac{1}{2}\,|u - 1|\,\varphi(u)$$

for a suitable $\varphi$ with $\varphi(u) = \varphi(1/u)$. We therefore consider the following symmetric divergence functional:

(3.3) $W_\varphi(P, Q) = \dfrac{1}{2} \sum_i |p_i - q_i|\, \varphi\!\left(\dfrac{p_i}{q_i}\right).$

It should be noted that (3.3) represents a parameterization of the set of all such divergence measures which satisfy (3.1). But here the function $\varphi$ can be both convex and concave. Table 1 shows various choices of $\varphi$ and the corresponding divergence functionals.

Table 1. New symmetric divergence measures: five choices of $\varphi(u)$ (a positive constant, power-type and logarithmic functions, several of them depending on a parameter $k = 1, 2, 3, \dots$) and the corresponding measures $W_\varphi(P, Q)$.
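The parameterization can be illustrated numerically. The sketch below assumes the representation f(u) = (1/2)|u - 1| phi(u) adopted in (3.2)-(3.3); the generator phi_tri is an illustrative choice satisfying phi(u) = phi(1/u), not necessarily a row of Table 1.

```python
import numpy as np

def w_phi(p, q, phi):
    """Reconstructed symmetric functional (3.3): 0.5 * sum_i |p_i - q_i| * phi(p_i/q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.sum(np.abs(p - q) * phi(p / q)))

def f_from_phi(phi):
    """Assumed representation f(u) = |u - 1| * phi(u) / 2, cf. (3.2)."""
    return lambda u: 0.5 * np.abs(u - 1.0) * phi(u)

def csiszar(p, q, f):                  # (1.2)
    return float(np.sum(q * f(p / q)))

phi_const = lambda u: 1.0                          # constant generator (row 1 type)
phi_tri   = lambda u: np.abs(u - 1.0) / (1.0 + u)  # illustrative choice, phi(u) = phi(1/u)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.3, 0.3, 0.4])

# phi(u) = phi(1/u) makes (3.3) symmetric in P and Q.
for phi in (phi_const, phi_tri):
    print(np.isclose(w_phi(P, Q, phi), w_phi(Q, P, phi)))      # True, True

# Consistency with the f-divergence form: C_f(P,Q) = W_phi(P,Q) for f(u) = |u-1| phi(u)/2.
print(np.isclose(csiszar(P, Q, f_from_phi(phi_tri)), w_phi(P, Q, phi_tri)))   # True

# phi = 1 gives half the variational distance; phi_tri gives half the triangular discrimination.
print(np.isclose(w_phi(P, Q, phi_const), 0.5 * np.abs(P - Q).sum()))          # True
print(np.isclose(w_phi(P, Q, phi_tri), 0.5 * np.sum((P - Q) ** 2 / (P + Q)))) # True
```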

Result 3.1. Consider the measures $W_1(P, Q), W_2(P, Q), \dots, W_5(P, Q)$ as defined in Table 1. Then the inequalities (3.4) and (3.5) relating these measures hold, with equality if $p_i = q_i$ for each $i$.

P r o o f. To start with, let us consider the arithmetic-geometric mean inequality

$$\sqrt{ab} \leq \frac{a + b}{2} \quad \text{for all } a, b > 0,$$

with equality if and only if $a = b$. In particular, for all $u > 0$ we have $\sqrt{u} \leq \frac{1}{2}(1 + u)$, with equality for $u = 1$, which is equivalent to

(3.6) $2\sqrt{u} \leq 1 + u, \quad u > 0.$

Substituting $u = p_i/q_i$ in (3.6), multiplying by the corresponding weights and then summing over all $i$, we obtain (3.4). Again, taking logarithms in (3.6), we have

$$\frac{1}{2}\ln u \leq \ln\frac{1 + u}{2}, \quad u > 0.$$

Substituting $u = p_i/q_i$ in this inequality, multiplying by the corresponding weights and then summing over all $i$, we obtain (3.5). It is interesting to note that, proceeding in the same way, one also obtains an upper bound in terms of $\frac{1}{2}\,\chi^2(P, Q)$, where $\chi^2(P, Q)$ is the well-known $\chi^2$ divergence [15].
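The proof mechanism, namely that a pointwise inequality between symmetric generators lifts, after substituting u = p_i/q_i and summing, to an inequality between the corresponding functionals (3.3), can be checked on random distributions. The generators below are illustrative choices satisfying phi(u) = phi(1/u), not the rows of Table 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def w_phi(p, q, phi):                     # reconstructed (3.3)
    return 0.5 * float(np.sum(np.abs(p - q) * phi(p / q)))

# Symmetric generators related pointwise by the AGM inequality 2*sqrt(u) <= 1 + u:
phi_geom  = lambda u: 2.0 * np.sqrt(u) / (1.0 + u)             # <= 1 for every u > 0
phi_arith = lambda u: np.ones_like(u)                          # constant generator
phi_log   = lambda u: np.log((1.0 + u) / (2.0 * np.sqrt(u)))   # >= 0, logarithmic form of AGM

def random_dist(k):
    w = rng.uniform(0.05, 1.0, size=k)
    return w / w.sum()

for _ in range(1000):
    P, Q = random_dist(5), random_dist(5)
    # phi_geom(u) <= phi_arith(u) pointwise, hence W_geom <= W_arith after summation.
    assert w_phi(P, Q, phi_geom) <= w_phi(P, Q, phi_arith) + 1e-12
    # The logarithmic generator is non-negative, so the induced measure is non-negative.
    assert w_phi(P, Q, phi_log) >= -1e-12
print("AGM-based comparisons verified on random samples")
```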

Result 3.2. Let $\varphi : (0, \infty) \to \mathbb{R}$ be a differentiable function as in (3.3). If $\varphi$ is convex, then $W_\varphi(P, Q)$ admits the lower bound (3.7), expressed through $\varphi(1)$ and $\varphi'(1)$; if $\varphi$ is concave, the reversed bound (3.8) holds.

P r o o f. First we assume that the function $\varphi : (0, \infty) \to \mathbb{R}$ is differentiable and convex; then we have the inequality

(3.9) $\varphi(y) \geq \varphi(x) + \varphi'(x)(y - x)$ for all $x, y \in (0, \infty)$.

Replacing $y$ by $p_i/q_i$ and $x$ by 1 in the above inequality, we obtain

$$\varphi\!\left(\frac{p_i}{q_i}\right) \geq \varphi(1) + \varphi'(1)\left(\frac{p_i}{q_i} - 1\right).$$

Multiplying both sides by $\frac{1}{2}|p_i - q_i|$ and summing over all $i$, we obtain (3.7). In addition, if $\varphi(1) \geq 0$ and $\varphi'(1) = 0$, then from (3.7) we have

(3.10) $W_\varphi(P, Q) \geq 0.$

Again, if we assume the function $\varphi : (0, \infty) \to \mathbb{R}$ to be differentiable and concave, then the inequality (3.9) is reversed, and the proof of (3.8) follows on similar lines as above.

Remark. The measure $W_\varphi(P, Q)$ admits a one-parameter extension $W_\alpha(P, Q)$, $\alpha \in (0, \infty)$, which includes the variation norm for $\alpha = 1$. Note that $W_\alpha(P, Q) \geq 0$, with equality for $P = Q$. By virtue of the arithmetic-geometric mean inequality we have

$$2\sqrt{xy} \leq x + y \quad \text{for all } x > 0,\ y > 0.$$

Substituting $x = p_i$, $y = q_i$ in this inequality, multiplying by suitable factors and then summing over all $i$, we obtain a comparison between $W_\alpha(P, Q)$ and $\Phi_\alpha(P, Q)$, $\alpha \in (0, \infty)$. Here $\Phi_\alpha(P, Q)$ is a class of symmetric divergences studied by Puri and Vincze, which includes the triangular discrimination

$$\Delta(P, Q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}$$

for $\alpha = 2$.
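The Puri-Vincze class named in the Remark can be computed directly. The sketch below assumes the common normalization Phi_alpha(P,Q) = sum_i |p_i - q_i|^alpha / (p_i + q_i)^(alpha - 1) (other normalizations differ by constant factors) and confirms the variational distance at alpha = 1 and the triangular discrimination at alpha = 2.

```python
import numpy as np

def puri_vincze(p, q, alpha):
    """Phi_alpha(P,Q) = sum_i |p_i - q_i|^alpha / (p_i + q_i)^(alpha - 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.abs(p - q) ** alpha / (p + q) ** (alpha - 1)))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.3, 0.3, 0.4])

# alpha = 1: the variational distance (variation norm).
print(np.isclose(puri_vincze(P, Q, 1.0), np.abs(P - Q).sum()))                 # True
# alpha = 2: the triangular discrimination Delta(P,Q) = sum (p-q)^2/(p+q).
print(np.isclose(puri_vincze(P, Q, 2.0), np.sum((P - Q) ** 2 / (P + Q))))      # True
# The family is symmetric in P and Q for every alpha.
for a in (0.5, 1.0, 2.0, 3.0):
    assert np.isclose(puri_vincze(P, Q, a), puri_vincze(Q, P, a))
print("symmetry verified")
```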

References

1. Ali, S. M., S. D. Silvey. A General Class of Coefficients of Divergence of One Distribution from Another. Jour. Roy. Statist. Soc., Series B, Vol. 28, 1966, 131-142.
2. Ben-Bassat, M. f-Entropies, Probability of Error and Feature Selection. Inform. Control, Vol. 39, 1978, 227-242.
3. Chen, C. H. Statistical Pattern Recognition. Hayden Book Co., Rochelle Park, New Jersey, 1973.
4. Chow, C. K., C. N. Liu. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Trans. Inform. Theory, Vol. 14, 1968, No 3, 462-467.
5. Csiszár, I. Information-Type Measures of Difference of Probability Distributions and Indirect Observations. Studia Sci. Math. Hungar., Vol. 2, 1967, 299-318.
6. Csiszár, I. Information Measures: A Critical Survey. In: Trans. Seventh Prague Conf. on Information Theory, Academia, Prague, 1974, 73-86.
7. Gokhale, D. V., S. Kullback. Information in Contingency Tables. New York, Marcel Dekker, 1978.
8. Jones, L., C. Byrne. General Entropy Criteria for Inverse Problems, with Applications to Data Compression, Pattern Classification, and Cluster Analysis. IEEE Trans. Inform. Theory, Vol. 36, 1990, 23-30.
9. Kadota, T. T., L. A. Shepp. On the Best Finite Set of Linear Observables for Discriminating Two Gaussian Signals. IEEE Trans. Inform. Theory, Vol. 13, 1967, 288-294.
10. Kailath, T. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Trans. Comm. Technology, Vol. COM-15, 1967, 52-60.
11. Kazakos, D., T. Cotsidas. A Decision Theory Approach to the Approximation of Discrete Probability Densities. IEEE Trans. Pattern Anal. Machine Intell., Vol. 1, 1980, 61-67.
12. Kolmogorov, A. N. A New Invariant for Transitive Dynamical Systems. Dokl. Akad. Nauk SSSR, Vol. 119, 1957, 861-869.
13. Kullback, S., R. A. Leibler. On Information and Sufficiency. Ann. Math. Stat., Vol. 22, 1951, 79-86.
14. Österreicher, F. Csiszár's f-Divergences: Basic Properties. RGMIA Res. Report Collection, 2002. http://rgmia.vu.edu.au/monographs/csiszar.htm
15. Pearson, K. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can Be Reasonably Supposed to Have Arisen from Random Sampling. Phil. Mag., 1900, 157-172.
16. Pielou, E. C. Ecological Diversity. New York, Wiley, 1975.
17. Sibson, R. Information Radius. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, Vol. 14, 1969, 149-160.
18. Theil, H. Economics and Information Theory. Amsterdam, North-Holland, 1967.
19. Theil, H. Statistical Decomposition Analysis. Amsterdam, North-Holland, 1972.