Literature on Bregman divergences

Lecture series at the University of Hawai'i at Mānoa

Peter Harremoës

February 26, 2016

Information divergence was introduced by Kullback and Leibler [25]. Later, Kullback started using information theory in statistics [24], where information divergence plays a crucial role. Information divergence was already used by Wald [35], although he did not give the quantity a name. The basic properties of information divergence are now described in many textbooks. Optimization with information divergence was described in a systematic way by Topsøe [33]. The relation to the conditional limit theorem can be found in [9]. Alternating minimization was studied in [11]. Information projections and reversed projections are described in [10]. Information divergence can also be used to define a topology with some strange properties [15].

There have been many attempts to generalize the notion of information divergence to a wider class of divergences, with two different types of motivation. One motivation has been that quantities sharing some properties with information divergence are used in physics, statistics, probability theory, or other parts of information theory. With this motivation one often has to compare related quantities by inequalities or similar results, and it has led to a great number of good results. Another motivation has been generalization in the hope that some generalized version of information divergence will turn out to be useful. Many papers take this approach, but most of the divergences that have emerged in this way have never been used again. One important exception is Rényi divergence, introduced in [31]. All the basic properties of information divergence and Rényi divergence were recently described in [34]. The class of f-divergences was introduced independently by Csiszár and Morimoto [8, 29] and a little later again by Ali and Silvey [2].
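For reference, the standard definitions of the quantities just mentioned (these formulas are textbook material, not taken from any of the cited papers):

```latex
% Information divergence (Kullback-Leibler divergence) between
% discrete probability distributions P and Q:
\[
  D(P \| Q) \;=\; \sum_i p_i \ln \frac{p_i}{q_i} .
\]
% An f-divergence, for a convex function f with f(1) = 0:
\[
  D_f(P \| Q) \;=\; \sum_i q_i \, f\!\left(\frac{p_i}{q_i}\right) .
\]
% The choice f(t) = t ln t recovers D(P||Q), and f(t) = (t-1)^2
% gives the chi-squared divergence.
```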
The f-divergences generalize information divergence in such a way that convexity and the data processing inequality are still satisfied. The class includes various quantities used in statistics, including the χ²-divergence. A major question in statistics is therefore which f-divergence to use for a specific problem. The standard reference is [26]. An important result is that if the probability measures are close together, then it does not make much difference which divergence is used [27]. If the notion of Bahadur efficiency is used, information divergence should normally be preferred [21]. In some cases the distribution of information
divergence is closer to a χ²-distribution than that of other f-divergences [20, 16]. There are many papers on inequalities between f-divergences, but it has been shown that an inequality that holds for a binary alphabet holds for any alphabet [22].

Bregman divergences were introduced in [6], but for a long time they did not receive the attention they deserve. Bregman divergences may be characterized as the divergences that satisfy the Bregman identity. In the context of information theory, Bregman divergences were partly reinvented by Rao and Nayak [30], where the name cross entropy was proposed, and this term is still in use in some groups of scientists. Until 2005 there were only a few papers on Bregman divergences, but in the paper Clustering with Bregman Divergences [3] all the basic properties of Bregman divergences were described. The paper also clarifies the relation between Bregman divergences and exponential families. The sufficiency condition was first used to characterize divergences in [19], and in [23] it was proved that the sufficiency condition can be used to characterize information divergence. This idea was further developed in [17, 18]. The relation between Bregman divergences and metrics was described in [7] and [1].

Inspired by results from two-person zero-sum games in the 1940s, Wald developed the idea that in situations with uncertainty one should make decisions that maximize the minimal payoff. This decision criterion is very robust but is often too pessimistic for real world decisions. In 1951 Savage introduced the minimax regret criterion in decision theory as an alternative criterion for decision making. Regret was introduced as an inference criterion in statistics in 1978 by Rissanen [32]. The idea gained momentum slowly, but has now developed into a competitor to Bayesian statistics and the frequentist interpretation of statistics, although it is still not widely known [4, 14, 13].
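To make the connection between Bregman divergences and information divergence concrete: the Bregman divergence generated by a convex function F is B_F(x, y) = F(x) - F(y) - ⟨∇F(y), x - y⟩, and taking F to be negative Shannon entropy recovers information divergence on probability vectors. The following is a minimal numerical sketch of this standard fact; the function names are chosen for illustration only.

```python
import numpy as np

def kl_divergence(p, q):
    """Information divergence D(p||q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def bregman_divergence(F, grad_F, x, y):
    """Generic Bregman divergence B_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(F(x) - F(y) - np.dot(grad_F(y), x - y))

# Negative Shannon entropy as the generating convex function.
neg_entropy = lambda p: float(np.sum(p * np.log(p)))
neg_entropy_grad = lambda p: np.log(p) + 1.0

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# On probability vectors the two quantities coincide (up to rounding).
print(abs(kl_divergence(p, q)
          - bregman_divergence(neg_entropy, neg_entropy_grad, p, q)) < 1e-12)
# → True
```

The linear term ⟨∇F(y), x - y⟩ cancels the normalization, which is why the identity only holds when both vectors sum to one; on general positive measures the two quantities differ by Σx - Σy.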
The use of regret in economics did not gain momentum until 1982, when the idea was revived in a number of papers [12, 28, 5]. There is now also active research on psychological aspects of using regret as a decision criterion. The relation between Bregman divergences, decision theory and regret is described in [17, 18].

References

[1] S. Acharyya, A. Banerjee, and D. Boley. Bregman Divergences and Triangle Inequality, chapter 52, pages 476–484. 2013.

[2] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B, 28:131–142, 1966.

[3] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

[4] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, 44(6):2743–2760, Oct. 1998. Commemorative issue.
[5] D. E. Bell. Regret in decision making under uncertainty. Operations Research, 30(5):961–981, 1982.

[6] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. and Math. Phys., 7:200–217, 1967. Translated from Russian.

[7] P. Chen, Y. Chen, and M. Rao. Metrics defined by Bregman divergences. Commun. Math. Sci., 6(4):915–926, 2008.

[8] I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad., 8:95–108, 1963.

[9] I. Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probab., 12:768–793, 1984.

[10] I. Csiszár and F. Matúš. Information projections revisited. IEEE Trans. Inform. Theory, 49(6):1474–1490, June 2003.

[11] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplementary Issue 1:205–237, 1984.

[12] P. C. Fishburn. The Foundations of Expected Utility. Theory & Decision Library, 1982.

[13] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

[14] P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.

[15] P. Harremoës. The information topology. In Proceedings IEEE International Symposium on Information Theory, page 431, Lausanne, June 2002. IEEE.

[16] P. Harremoës. Mutual information of contingency tables and related inequalities. In Proceedings ISIT 2014, pages 2474–2478. IEEE, June 2014.

[17] P. Harremoës. Proper scoring and sufficiency. In J. Rissanen, P. Harremoës, S. Forchhammer, T. Roos, and P.
Myllymäki, editors, Proceedings of the Eighth Workshop on Information Theoretic Methods in Science and Engineering, number Report B-2015-1 in Series of Publications B, pages 19–22, University of Helsinki, Department of Computer Science, 2015. An appendix with proofs only exists in the arXiv version of the paper.

[18] P. Harremoës. Sufficiency on the stock market. Submitted, Jan. 2016.
[19] P. Harremoës and N. Tishby. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings ISIT 2007, Nice, pages 566–571. IEEE Information Theory Society, June 2007.

[20] P. Harremoës and G. Tusnády. Information divergence is more χ²-distributed than the χ²-statistic. In International Symposium on Information Theory (ISIT 2012), pages 538–543, Cambridge, Massachusetts, USA, July 2012. IEEE.

[21] P. Harremoës and I. Vajda. On the Bahadur-efficient testing of uniformity by means of the entropy. IEEE Trans. Inform. Theory, 54(1):321–331, Jan. 2008.

[22] P. Harremoës and I. Vajda. On pairs of f-divergences and their joint range. IEEE Trans. Inform. Theory, 57(6):3220–3225, June 2011.

[23] Jiantao Jiao, Thomas Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Information measures: the curious case of the binary alphabet. IEEE Trans. Inform. Theory, 60(12):7616–7626, Dec. 2014.

[24] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.

[25] S. Kullback and R. Leibler. On information and sufficiency. Ann. Math. Statist., 22:79–86, 1951.

[26] F. Liese and I. Vajda. Convex Statistical Distances. Teubner, Leipzig, 1987.

[27] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Trans. Inform. Theory, 52(10):4394–4412, Oct. 2006.

[28] G. Loomes and R. Sugden. Regret theory: An alternative theory of rational choice under uncertainty. Economic Journal, 92(4):805–824, 1982.

[29] T. Morimoto. Markov processes and the H-theorem. J. Phys. Soc. Jap., 12:328–331, 1963.

[30] C. R. Rao and T. K. Nayak. Cross entropy, dissimilarity measures, and characterizations of quadratic entropy. IEEE Trans. Inform. Theory, 31(5):589–593, Sept. 1985.

[31] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547–561, 1961.

[32] J. Rissanen. Modelling by shortest data description.
Automatica, 14:465–471, 1978.

[33] F. Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.
[34] T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inform. Theory, 60(7):3797–3820, July 2014.

[35] A. Wald. Sequential Analysis. Wiley, 1947.