Literature on Bregman divergences

Lecture series at Univ. Hawaiʻi at Mānoa
Peter Harremoës
February 26, 2016

Information divergence was introduced by Kullback and Leibler [25], and later Kullback started using information theory in statistics [24], where information divergence plays a crucial role. Information divergence was already used by Wald [35], although he did not give the quantity a name. The basic properties of information divergence are now described in many textbooks. Optimization with information divergence was described in a systematic way by Topsøe [33]. The relation to the conditional limit theorem can be found in [9]. Alternating minimization was studied in [11]. Information projections and reversed projections are described in [10]. Information divergence can also be used to define a topology with some strange properties [15].

There have been many attempts to generalize the notion of information divergence to a wider class of divergences, with two different types of motivation. One motivation has been that quantities that share some properties with information divergence are used in physics, statistics, probability theory, or other parts of information theory. With this motivation one often has to compare related quantities by inequalities or similar results, and this line of work has led to a great number of good results. Another motivation has been generalization in the hope that some generalized version of information divergence will turn out to be useful. Many papers take this approach, but most of the divergences that have emerged in this way have never been used again. One important exception is the Rényi divergence introduced in [31]. All the basic properties of information divergence and Rényi divergence were recently described in [34].

The class of f-divergences was introduced independently by Csiszár and Morimoto [8, 29] and a little later again by Ali and Silvey [2]. The f-divergences generalize information divergence in such a way that convexity and the data processing inequality are still satisfied, and the class includes various quantities used in statistics, including the χ²-divergence. A major question in statistics is therefore which f-divergence to use for a specific problem. The standard reference is [26]. An important result is that if the probability measures are close together, then it does not make much difference which divergence is used [27]. If the notion of Bahadur efficiency is used, information divergence should normally be preferred [21]. In some cases the distribution of information divergence is closer to a χ²-distribution than that of other f-divergences [20, 16].
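For reference, the standard definition of an f-divergence is recalled here for convenience; it is a textbook formula rather than a quotation from any of the cited papers. For probability distributions P and Q and a convex function f with f(1) = 0,

    D_f(P \| Q) = \sum_x q(x)\, f\!\left( \frac{p(x)}{q(x)} \right).

The choice f(t) = t \log t recovers information divergence, D(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}, while f(t) = (t-1)^2 gives the χ²-divergence.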

There are many papers on inequalities between f-divergences, but it has been shown that an inequality that holds for a binary alphabet holds for any alphabet [22].

Bregman divergences were introduced in [6], but for a long time they did not receive the attention they deserve. Bregman divergences may be characterized as the divergences that satisfy the Bregman identity. In the context of information theory, Bregman divergences were partly reinvented by Rao and Nayak [30], where the name cross entropy was proposed; this term is still in use in some groups of scientists. Until 2005 there were only a few papers on Bregman divergences, but in the paper Clustering with Bregman Divergences [3] all the basic properties of Bregman divergences were described. That paper also clarifies the relation between Bregman divergences and exponential families. The sufficiency condition was first used to characterize divergences in [19], and in [23] it was proved that the sufficiency condition can be used to characterize information divergence. This idea was further developed in [17, 18]. The relation between Bregman divergences and metrics was described in [7] and [1].

Inspired by results on two-person zero-sum games, Wald developed in the 1940s the idea that in situations with uncertainty one should make decisions that maximize the minimal payoff. This decision criterion is very robust but is often too pessimistic for real-world decisions. In 1951 Savage introduced the minimax regret criterion in decision theory as an alternative criterion for decision making. Regret was introduced as an inference criterion in statistics in 1978 by Rissanen [32]. The idea gained momentum slowly, but it has now developed into a competitor to Bayesian statistics and the frequentist interpretation of statistics, although it is still not widely known [4, 14, 13]. The use of regret in economics did not gain momentum before 1982, when the idea was revived in a number of papers [12, 28, 5]. There is now also active research on the psychological aspects of using regret as a decision criterion. The relation between Bregman divergences, decision theory, and regret is described in [17, 18].
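As a reminder, and as standard textbook formulas rather than material taken from the cited papers: for a differentiable convex function F, the Bregman divergence between x and y is

    B_F(x, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle,

the gap between F and its tangent approximation at y. One common formulation of the Bregman identity mentioned above is the compensation identity

    E[B_F(X, y)] = E[B_F(X, \bar{x})] + B_F(\bar{x}, y), \quad \text{where } \bar{x} = E[X],

which singles out the mean as the optimal predictor under Bregman loss. Taking F(x) = \sum_i x_i \log x_i on the probability simplex recovers information divergence, and F(x) = \|x\|^2 gives squared Euclidean distance.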

References

[1] S. Acharyya, A. Banerjee, and D. Boley. Bregman divergences and triangle inequality, chapter 52, pages 476–484. 2013.

[2] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B, 28:131–142, 1966.

[3] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

[4] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, 44(6):2743–2760, Oct. 1998. Commemorative issue.

[5] D. E. Bell. Regret in decision making under uncertainty. Operations Research, 30(5):961–981, 1982.

[6] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. and Math. Phys., 7:200–217, 1967. Translated from Russian.

[7] P. Chen, Y. Chen, and M. Rao. Metrics defined by Bregman divergences. Commun. Math. Sci., 6(4):915–926, Dec. 2008.

[8] I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad., 8:95–108, 1963.

[9] I. Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probab., 12:768–793, 1984.

[10] I. Csiszár and F. Matúš. Information projections revisited. IEEE Trans. Inform. Theory, 49(6):1474–1490, June 2003.

[11] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplementary Issue 1:205–237, 1984.

[12] P. C. Fishburn. The Foundations of Expected Utility. Theory & Decision Library, 1982.

[13] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

[14] P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.

[15] P. Harremoës. The information topology. In Proceedings IEEE International Symposium on Information Theory, page 431, Lausanne, June 2002. IEEE.

[16] P. Harremoës. Mutual information of contingency tables and related inequalities. In Proceedings ISIT 2014, pages 2474–2478. IEEE, June 2014.

[17] P. Harremoës. Proper scoring and sufficiency. In J. Rissanen, P. Harremoës, S. Forchhammer, T. Roos, and P. Myllymäki, editors, Proceedings of the Eighth Workshop on Information Theoretic Methods in Science and Engineering, number Report B-2015-1 in Series of Publications B, pages 19–22, University of Helsinki, Department of Computer Science, 2015. An appendix with proofs only exists in the arXiv version of the paper.

[18] P. Harremoës. Sufficiency on the stock market. Submitted, Jan. 2016.

[19] P. Harremoës and N. Tishby. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings ISIT 2007, Nice, pages 566–571. IEEE Information Theory Society, June 2007.

[20] P. Harremoës and G. Tusnády. Information divergence is more χ²-distributed than the χ²-statistic. In International Symposium on Information Theory (ISIT 2012), pages 538–543, Cambridge, Massachusetts, USA, July 2012. IEEE.

[21] P. Harremoës and I. Vajda. On the Bahadur-efficient testing of uniformity by means of the entropy. IEEE Trans. Inform. Theory, 54(1):321–331, Jan. 2008.

[22] P. Harremoës and I. Vajda. On pairs of f-divergences and their joint range. IEEE Trans. Inform. Theory, 57(6):3220–3225, June 2011.

[23] Jiantao Jiao, Thomas Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Information measures: the curious case of the binary alphabet. IEEE Trans. Inform. Theory, 60(12):7616–7626, Dec. 2014.

[24] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.

[25] S. Kullback and R. Leibler. On information and sufficiency. Ann. Math. Statist., 22:79–86, 1951.

[26] F. Liese and I. Vajda. Convex Statistical Distances. Teubner, Leipzig, 1987.

[27] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Trans. Inform. Theory, 52(10):4394–4412, Oct. 2006.

[28] G. Loomes and R. Sugden. Regret theory: An alternative theory of rational choice under uncertainty. Economic Journal, 92(4):805–824, 1982.

[29] T. Morimoto. Markov processes and the H-theorem. J. Phys. Soc. Jap., 12:328–331, 1963.

[30] C. R. Rao and T. K. Nayak. Cross entropy, dissimilarity measures, and characterizations of quadratic entropy. IEEE Trans. Inform. Theory, 31(5):589–593, Sept. 1985.

[31] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547–561, 1961.

[32] J. Rissanen. Modelling by shortest data description. Automatica, 14:465–471, 1978.

[33] F. Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.

[34] T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inform. Theory, 60(7):3797–3820, July 2014.

[35] A. Wald. Sequential Analysis. Wiley, 1947.