Literature on Bregman divergences

Lecture series at Univ. Hawai'i at Mānoa
Peter Harremoës
February 26, 2016

Information divergence was introduced by Kullback and Leibler [25], and later Kullback started using information theory in statistics [24], where information divergence plays a crucial role. Information divergence was already used by Wald [35], although he did not give the quantity a name. The basic properties of information divergence are now described in many textbooks. Optimization with information divergence was described in a systematic way by Topsøe [33]. The relation to the conditional limit theorem can be found in [9]. Alternating minimization was studied in [11]. Information projections and reversed projections are described in [10]. Information divergence can also be used to define a topology with some strange properties [15].

There have been many attempts to generalize the notion of information divergence to a wider class of divergences, with two different types of motivation. One motivation has been that quantities that share some properties with information divergence are used in physics, statistics, probability theory, or other parts of information theory. With this motivation one often has to compare related quantities by inequalities or similar results, and it has led to a great number of good results. The other motivation has been generalization in the hope that some generalized version of information divergence will turn out to be useful. Many papers take this approach, but most of the divergences that have emerged in this way have never been used again. One important exception is the Rényi divergence, introduced in [31]. All the basic properties of information divergence and Rényi divergence were recently described in [34].

The class of f-divergences was introduced independently by Csiszár and Morimoto [8, 29] and a little later again by Ali and Silvey [2]. The f-divergences generalize information divergence in such a way that convexity and the data processing inequality are still satisfied. The class includes various quantities used in statistics, among them the χ²-divergence. In statistics a major question is therefore which f-divergence to use for a specific problem; the standard reference is [26]. An important result is that if the probability measures are close together, it does not make much difference which divergence is used [27]. If the notion of Bahadur efficiency is used, information divergence should normally be preferred [21]. In some cases the distribution of information divergence is closer to a χ²-distribution than that of other f-divergences [20, 16]. There are many papers on inequalities between f-divergences, but it has been shown that an inequality that holds for a binary alphabet holds for any alphabet [22].
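For reference, here is a minimal statement of the standard definitions assumed above but not given in the text. For probability distributions P and Q on a common alphabet and a convex function f with f(1) = 0,
\[
D(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)},
\qquad
D_f(P\|Q)=\sum_x Q(x)\,f\!\left(\frac{P(x)}{Q(x)}\right).
\]
The choice f(t) = t log t recovers information divergence, and f(t) = (t-1)² gives the χ²-divergence.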

Bregman divergences were introduced in [6], but for a long time they did not receive the attention they deserve. Bregman divergences may be characterized as the divergences that satisfy the Bregman identity. In the context of information theory, Bregman divergences were partly reinvented by Rao and Nayak [30], where the name cross entropy was proposed; this term is still in use in some groups of scientists. Until 2005 there were only a few papers on Bregman divergences, but in the paper Clustering with Bregman Divergences [3] all the basic properties of Bregman divergences were described. The paper also clarifies the relation between Bregman divergences and exponential families.

The sufficiency condition was first used to characterize divergences in [19], and in [23] it was proved that the sufficiency condition can be used to characterize information divergence. This idea was further developed in [17, 18]. The relation between Bregman divergences and metrics was described in [7] and [1].
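For reference, the Bregman divergence generated by a differentiable convex function F is
\[
D_F(x,y)=F(x)-F(y)-\langle\nabla F(y),\,x-y\rangle .
\]
A minimal statement of the identity alluded to above, in one standard form (the exact formulation used in [17, 18] may differ): for a random variable X with mean $\bar{x}=E[X]$,
\[
E\!\left[D_F(X,y)\right]=E\!\left[D_F(X,\bar{x})\right]+D_F(\bar{x},y),
\]
so the mean is the minimizer of the expected divergence; this decomposition underlies the clustering results of [3]. The connection to exponential families mentioned above is that the information divergence between two members of an exponential family with cumulant function $\psi$ is itself a Bregman divergence in the natural parameter:
\[
D(P_{\theta_1}\|P_{\theta_2})=D_\psi(\theta_2,\theta_1).
\]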

Inspired by results on 2-person 0-sum games in the 1940s, Wald developed the idea that in situations with uncertainty one should make decisions that maximize the minimal payoff. This decision criterion is very robust, but it is often too pessimistic for real-world decisions. In 1951 Savage introduced the minimax regret criterion as an alternative criterion for decision making. Regret was introduced as an inference criterion in statistics in 1978 by Rissanen [32]; this idea gained momentum slowly, but it has now developed into a competitor to Bayesian statistics and the frequentist interpretation of statistics, although it is still not widely known [4, 14, 13]. The use of regret in economics did not gain momentum before 1982, when the idea was revived in a number of papers [12, 28, 5]. There is now also active research on the psychological aspects of using regret as a decision criterion. The relation between Bregman divergences, decision theory and regret is described in [17, 18].

References

[1] S. Acharyya, A. Banerjee, and D. Boley. Bregman divergences and triangle inequality. In Proceedings of the 2013 SIAM International Conference on Data Mining, chapter 52.
[2] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B, 28:131–142, 1966.
[3] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[4] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, 44(6):2743–2760, Oct. 1998. Commemorative issue.
[5] D. E. Bell. Regret in decision making under uncertainty. Operations Research, 30(5):961–981, 1982.
[6] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. and Math. Phys., 7:200–217, 1967. Translated from Russian.
[7] P. Chen, Y. Chen, and M. Rao. Metrics defined by Bregman divergences. Commun. Math. Sci., 6(4):915–926, 2008.
[8] I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad., 8:95–108, 1963.
[9] I. Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probab., 12:768–793, 1984.
[10] I. Csiszár and F. Matúš. Information projections revisited. IEEE Trans. Inform. Theory, 49(6):1474–1490, June 2003.
[11] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplementary Issue 1:205–237, 1984.
[12] P. C. Fishburn. The Foundations of Expected Utility. Theory & Decision Library, 1982.
[13] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[14] P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.
[15] P. Harremoës. The information topology. In Proceedings IEEE International Symposium on Information Theory, page 431, Lausanne, June 2002. IEEE.
[16] P. Harremoës. Mutual information of contingency tables and related inequalities. In Proceedings ISIT 2014. IEEE, June 2014.
[17] P. Harremoës. Proper scoring and sufficiency. In J. Rissanen, P. Harremoës, S. Forchhammer, T. Roos, and P. Myllymäki, editors, Proceedings of the Eighth Workshop on Information Theoretic Methods in Science and Engineering, Series of Publications B, pages 19–22, University of Helsinki, Department of Computer Science, 2015. An appendix with proofs only exists in the arXiv version of the paper.
[18] P. Harremoës. Sufficiency on the stock market. Submitted, Jan. 2016.

[19] P. Harremoës and N. Tishby. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings ISIT 2007, Nice. IEEE Information Theory Society, June 2007.
[20] P. Harremoës and G. Tusnády. Information divergence is more χ²-distributed than the χ²-statistic. In International Symposium on Information Theory (ISIT 2012), Cambridge, Massachusetts, USA, July 2012. IEEE.
[21] P. Harremoës and I. Vajda. On the Bahadur-efficient testing of uniformity by means of the entropy. IEEE Trans. Inform. Theory, 54(1):321–331, Jan. 2008.
[22] P. Harremoës and I. Vajda. On pairs of f-divergences and their joint range. IEEE Trans. Inform. Theory, 57(6):3230–3235, June 2011.
[23] Jiantao Jiao, Thomas Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Information measures: the curious case of the binary alphabet. IEEE Trans. Inform. Theory, 60(12):7616–7626, Dec. 2014.
[24] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
[25] S. Kullback and R. Leibler. On information and sufficiency. Ann. Math. Statist., 22:79–86, 1951.
[26] F. Liese and I. Vajda. Convex Statistical Distances. Teubner, Leipzig, 1987.
[27] F. Liese and I. Vajda. On divergence and informations in statistics and information theory. IEEE Trans. Inform. Theory, 52(10):4394–4412, Oct. 2006.
[28] G. Loomes and R. Sugden. Regret theory: An alternative theory of rational choice under uncertainty. Economic Journal, 92(4):805–824, 1982.
[29] T. Morimoto. Markov processes and the H-theorem. J. Phys. Soc. Jap., 18:328–331, 1963.
[30] C. R. Rao and T. K. Nayak. Cross entropy, dissimilarity measures, and characterizations of quadratic entropy. IEEE Trans. Inform. Theory, 31(5):589–593, September 1985.
[31] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547–561, 1961.
[32] J. Rissanen. Modelling by shortest data description. Automatica, 14:465–471, 1978.
[33] F. Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.

[34] T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inform. Theory, 60(7):3797–3820, July 2014.
[35] A. Wald. Sequential Analysis. Wiley, 1947.
