A categorical characterization of relative entropy on standard Borel spaces

MFPS 2017

Nicolas Gagné (1,2)
School of Computer Science, McGill University, Montréal, Québec, Canada

Prakash Panangaden (1,3)
School of Computer Science, McGill University, Montréal, Québec, Canada

1 This research has been supported by NSERC.
2 Email: nicolas.gagne@mail.mcgill.ca
3 Email: prakash@cs.mcgill.ca

This paper is electronically published in Electronic Notes in Theoretical Computer Science. URL: www.elsevier.nl/locate/entc

Abstract

We give a categorical treatment, in the spirit of Baez and Fritz, of relative entropy for probability distributions defined on standard Borel spaces. We define a category called SbStat suitable for reasoning about statistical inference on standard Borel spaces. We define relative entropy as a functor into Lawvere's category $[0, \infty]$ and we show convexity, lower semicontinuity and uniqueness.

Keywords: Entropy, Giry monad, Bayesian learning, standard Borel spaces

1 Introduction

The inspiration for the present work comes from two recent developments. The first is the beginning of a categorical understanding of Bayesian inversion and learning [9,8,7]; the second is a categorical reconstruction of relative entropy [3,2,15]. The present paper provides a categorical treatment of entropy in the spirit of Baez and Fritz in the setting of standard Borel spaces, thus setting the stage to explore the role of entropy in learning.

Recently there have been some exciting developments that bring some categorical insights to probability theory and specifically to learning theory. These are reported in some recent papers by Clerc, Dahlqvist, Danos and Garnier [9,8,7]. The first of

these papers showed how to view the Dirichlet distribution as a natural transformation, thus opening the way to an understanding of higher-order probabilities, while the second gave a powerful framework for constructing several natural transformations. In [9] the hope was expressed that one could use these ideas to understand Bayesian inversion, a core concept in machine learning. In [7] this was realized in a remarkably novel way. These papers carry out their investigations in the setting of standard Borel spaces and are based on the Giry monad [11,13].

In [3,2] a beautiful treatment of relative entropy is given in categorical terms. The basic idea is to understand entropy in terms of the results of experiments and observations. How much does one learn about a probabilistic situation by doing experiments and observing the results? A category is set up where the morphisms capture the interplay between the original space and the space of observations. In order to interpret the relative entropy as a functor they use Lawvere's category, which consists of a single object and a morphism for every extended positive real number [14].

Our contribution is to develop the theory of Baez et al. in the setting of standard Borel spaces (see footnote 4); their work is carried out with finite sets. While the work of [2] gives a firm conceptual direction, it gives little guidance in the actual development of the mathematical theory. We had to redevelop the mathematical framework and find the right analogues for the concepts appropriate to the finite case.

2 Background

In this section we review some of the background. We assume that the reader is familiar with concepts from topology and measure theory as well as basic category theory. We have found the books by Ash [1], Billingsley [4] and Dudley [10] to be useful.

We will use letters like $X, Y, Z$ for measurable spaces and capital Greek letters like $\Sigma, \Lambda, \Omega$ for σ-algebras. We will use $p, q, \ldots$ for probability measures. Given $(X, \Sigma)$ and $(Y, \Lambda)$, a measurable function $f : X \to Y$ and a probability measure $p$ on $(X, \Sigma)$, we obtain a measure on $(Y, \Lambda)$ by $p \circ f^{-1}$; this is called the pushforward measure or the image measure.

2.1 The Giry monad

We denote the category of measurable spaces and measurable functions by Mes. We recall the Gìry [11] (see footnote 5) functor $\Gamma : \mathbf{Mes} \to \mathbf{Mes}$ which maps each measurable space $X$ to the space $\Gamma(X)$ of probability measures over $X$. Let $A \in \Sigma_X$; we define $ev_A : \Gamma(X) \to [0, 1]$ by $ev_A(p) = p(A)$. We endow $\Gamma(X)$ with the smallest σ-algebra making all the $ev_A$ measurable. A morphism $f : X \to Y$ in Mes is mapped to $\Gamma(f) : \Gamma(X) \to \Gamma(Y)$ by $\Gamma(f)(p) = p \circ f^{-1}$.

4 In an earlier draft we were sloppy about the difference between standard Borel spaces and Polish spaces. We are really working with standard Borel spaces, which are Polish spaces but with the topology forgotten and the σ-algebra retained.
5 Gìry's name does have the accent grave on the i; however, we omit it from now on to ease the typesetting.

With the following natural transformations,

this endofunctor is a monad: the Giry monad. The natural transformation $\eta : I \to \Gamma$ is given by $\eta_X(x) = \delta_x$, the Dirac measure concentrated at $x$. The monad multiplication $\mu : \Gamma^2 \to \Gamma$ is given by

$$\forall A \in \mathcal{B}_X, \quad \mu_X(p)(A) := \int_{\Gamma(X)} ev_A \, dp,$$

where $p$ is a probability measure in $\Gamma(\Gamma(X))$ and $ev_A : \Gamma(X) \to [0, 1]$ is the measurable function on $\Gamma(X)$ defined by $ev_A(p) = p(A)$.

Even if Mes is an interesting category in and of itself, the need for regular conditional probabilities forces us to restrict ourselves to a subcategory of standard Borel spaces.

2.2 Standard Borel spaces and disintegration

The Radon-Nikodym theorem is the main tool used to show the existence of conditional probability distributions, also called Markov kernels; see the discussion below. It is a very general theorem, but it does not give as strong regularity features as one might want. A stronger theorem is needed; this is the so-called disintegration theorem. It requires stronger hypotheses on the space on which the kernels are being defined. A category of spaces that satisfy these stronger hypotheses is the category of standard Borel spaces. In order to define standard Borel spaces, we must first define Polish spaces.

Definition 2.1 A Polish space is a separable, completely metrizable topological space.

Definition 2.2 A standard Borel space is a measurable space obtained by forgetting the topology of a Polish space but retaining its Borel algebra. The category of standard Borel spaces has measurable functions as morphisms; we denote it by StBor.

We can now state a version of the disintegration theorem. The following is also known as Rohlin's disintegration theorem [17].

Theorem 2.3 (Disintegration) Let $(X, p)$ and $(Y, q)$ be two standard Borel spaces equipped with probability measures, where $q$ is the pushforward measure $q := p \circ f^{-1}$ for a Borel measurable function $f : X \to Y$. Then there exists a $q$-almost everywhere uniquely determined family of probability measures $\{p_y\}_{y \in Y}$ on $X$ such that
(i) the function $y \mapsto p_y(A)$ is a Borel-measurable function for each Borel-measurable set $A \subset X$;
(ii) $p_y$ is a probability measure on $f^{-1}(y)$ for $q$-almost all $y \in Y$;
(iii) for every Borel-measurable function $h : X \to [0, \infty]$,

$$\int_X h \, dp = \int_Y \int_{f^{-1}(y)} h \, dp_y \, dq.$$

The objects obtained are often called regular conditional probability distributions. One can find a crisp categorical formulation of disintegration in [7, Theorem 1].

2.3 The Kleisli category of Γ on StBor

It is well known that the Giry monad on Mes restricted to StBor admits the same monad structure [11]. The Kleisli category of $\Gamma$ has as objects standard Borel spaces and as morphisms maps from $X$ to $\Gamma(Y)$:

$$h : X \to (\mathcal{B}_Y \to [0, 1])$$

which are measurable. Here $\mathcal{B}_Y$ stands for the Borel sets of $Y$, and $\Gamma(Y)$ has the σ-algebra described above. Now we can curry this to write it as $h : X \times \mathcal{B}_Y \to [0, 1]$ or $h(x, U)$, where $x$ is a point in $X$ and $U$ is a Borel set in $Y$. Written this way it is called a Markov kernel, and one can view it as a transition probability function or conditional probability distribution given $x$. Composition of morphisms $f : X \to Y$ and $g : Y \to Z$ in the Kleisli category is given by the formula

$$(g \,\tilde{\circ}\, f)(x, V \subset Z) = \int_Y g(y, V) \, df(x, \cdot).$$

For an arrow $s : Y \to \Gamma(X)$ in StBor, we write $s_y$ for $s(y)$ or, in kernel form, $s(y, \cdot)$. For arrows $t : Z \to \Gamma(Y)$ and $s : Y \to \Gamma(X)$ in StBor, we denote their Kleisli composition by $s \,\tilde{\circ}\, t := \mu_X \circ \Gamma(s) \circ t$.

For standard Borel spaces equipped with a probability measure $p$, we sometimes omit the measure in the notation, i.e. we sometimes write $X$ instead of $(X, p)$. We say a probability measure $p$ is absolutely continuous with respect to another measure $q$ on the same measurable space, denoted by $p \ll q$, if for all measurable sets $B$, $q(B) = 0$ implies that $p(B) = 0$. We note that absolute continuity is preserved by Kleisli composition; the proof is straightforward.

Proposition 2.4 Let $X$ be a standard Borel space with probability measures $q$ and $q'$ such that $q \ll q'$. Then, for an arbitrary standard Borel space $Y$ and morphism $s$ from $X$ to $\Gamma(Y)$, we have $s \,\tilde{\circ}\, q \ll s \,\tilde{\circ}\, q'$.

3 The categorical setting

In this section, following Baez and Fritz [2] (see also [3]), we describe the categories FinStat and FP which they use for their characterization of entropy on finite spaces. We then introduce the category SbStat, which will be the arena for the generalization to standard Borel spaces. Before doing so, we define the notion of coherence, which will play an important role in what follows.

Definition 3.1 Given standard Borel spaces $X$ and $Y$ with probability measures $p$ and $q$, respectively, a pair $(f, s)$, with $f : (X, p) \to (Y, q)$ and $s : Y \to \Gamma(X)$, is said to be coherent when $f$ is measure preserving, i.e., $q = p \circ f^{-1}$, and $s_y$ is a probability

measure on $f^{-1}(y)$ $q$-almost everywhere (see footnote 6). If, in addition, $p$ is absolutely continuous with respect to $s \,\tilde{\circ}\, q$, then we say that $(f, s)$ is absolutely coherent.

Definition 3.2 The category FinStat has
Objects: Pairs $(X, p)$ where $X$ is a finite set and $p$ a probability measure on $X$.
Morphisms: $\mathrm{Hom}(X, Y)$ are all coherent pairs $(f, s)$, $f : X \to Y$ and $s : Y \to \Gamma(X)$.
We compose arrows $(f, s) : (X, p) \to (Y, q)$ and $(g, t) : (Y, q) \to (Z, m)$ as follows: $(g, t) \circ (f, s) := (g \circ f, s \circ_{fin} t)$, where $\circ_{fin}$ is defined as

$$(s \circ_{fin} t)_z(x) = \sum_{y \in Y} t_z(y) \, s_y(x).$$

One can think of $f$ as a measurement process from $X$ to $Y$ and of $s$ as a hypothesis about $X$ given an observation in $Y$. We say that a hypothesis is optimal if $p = s \circ_{fin} q$, or equivalently, if $s$ is a disintegration of $p$ along $f$. We denote by FP the subcategory of FinStat consisting of the same objects, but with only those morphisms where the hypothesis is optimal. See [3,2] and [15] for a discussion of these ideas.

We now leave the finite world for a more general one: the category SbStat.

Definition 3.3 The category SbStat has
Objects: Pairs $(X, p)$ where $X$ is a standard Borel space and $p$ a probability measure on the Borel subsets of $X$.
Morphisms: $\mathrm{Hom}(X, Y)$ are all coherent pairs $(f, s)$, $f : X \to Y$ and $s : Y \to \Gamma(X)$.
We compose arrows $(f, s) : (X, p) \to (Y, q)$ and $(g, t) : (Y, q) \to (Z, m)$ as follows: $(g, t) \circ (f, s) := (g \circ f, s \,\tilde{\circ}\, t)$.

Following the graphical representation from [2], we represent composition as follows:

$$(X, p) \underset{f}{\overset{s}{\rightleftarrows}} (Y, q) \underset{g}{\overset{t}{\rightleftarrows}} (Z, m) \quad \xrightarrow{\ \text{Composition}\ } \quad (X, p) \underset{g \circ f}{\overset{s \tilde{\circ} t}{\rightleftarrows}} (Z, m).$$

The proof of the following proposition is done in the extended version.

Proposition 3.4 Given coherent pairs, the composition is coherent. If, in addition, they are absolutely coherent, the composition is absolutely coherent.

6 Note that $s$ being coherent is equivalent to $\eta_Y = \Gamma(f) \circ s$.

We end this section by defining one more category; this one is due to Lawvere [14]. It is just the set $[0, \infty]$ but endowed with categorical structure. This allows numerical values associated with morphisms to be regarded as functors.

Definition 3.5 The category $[0, \infty]$ has
Objects: One single object: $\bullet$.
Morphisms: For each element $r \in [0, \infty]$, one arrow $r : \bullet \to \bullet$.
Arrow composition is defined as addition in $[0, \infty]$.

This is a remarkable category with monoidal closed structure and many other interesting properties.

4 Relative entropy functor

We recapitulate the definition of the relative entropy functor on FinStat from Baez and Fritz [2] and then extend it to SbStat.

Definition 4.1 The relative entropy functor $RE_{fin}$ is defined from FinStat to $[0, \infty]$ as follows:
On Objects: It maps every object $(X, p)$ to $\bullet$.
On Morphisms: It maps a morphism $(f, s) : (X, p) \to (Y, q)$ to $S_{fin}(p, s \circ_{fin} q)$, where

$$S_{fin}(p, s \circ_{fin} q) := \sum_{x \in X} p(x) \ln\!\left(\frac{p(x)}{(s \circ_{fin} q)(x)}\right).$$

The convention from now on will be that $\infty \cdot c = c \cdot \infty = \infty$ for $0 < c \le \infty$ and $\infty \cdot 0 = 0 \cdot \infty = 0$.

We extend $RE_{fin}$ from FinStat to SbStat.

Definition 4.2 The relative entropy functor $RE$ is defined from SbStat to $[0, \infty]$ as follows:
On Objects: It maps every object $(X, p)$ to $\bullet$.
On Morphisms: Given a coherent morphism $(f, s) : (X, p) \to (Y, q)$, if $(f, s)$ is absolutely coherent, then $RE(f, s) := S(p, s \,\tilde{\circ}\, q)$, where

$$S(p, s \,\tilde{\circ}\, q) := \int_X \log\!\left(\frac{dp}{d(s \,\tilde{\circ}\, q)}\right) dp;$$

otherwise it is defined as $RE(f, s) := \infty$.

This quantity is also known as the Kullback-Leibler divergence.

We could have defined our category to have only absolutely coherent morphisms, but that would make the comparison with the finite case more awkward, as the finite case does not assume the morphisms to be absolutely coherent. The present definition leads to slightly awkward proofs where we have to consider absolutely coherent pairs and ordinary coherent pairs separately; most of these have been omitted in this abridged version but can be found in the extended version.
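
To make Definition 4.1 concrete, here is a minimal Python sketch of $RE_{fin}$ on a small finite example. It is an illustration under stated assumptions, not part of the original development: the helper names (`apply_kernel`, `relative_entropy`, `re_fin`), the dictionary encoding of measures and kernels, and the sample data are all hypothetical. The value $\infty$ is returned when $p$ is not absolutely continuous with respect to $s \circ_{fin} q$, in line with Definition 4.2.

```python
import math

def apply_kernel(s, q):
    """Return the measure (s o q)(x) = sum_y q(y) * s[y][x] on a finite set."""
    out = {}
    for y, qy in q.items():
        for x, w in s[y].items():
            out[x] = out.get(x, 0.0) + qy * w
    return out

def relative_entropy(p, r):
    """S(p, r) = sum_x p(x) ln(p(x) / r(x)); returns inf unless p << r."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # points of p-measure zero contribute nothing
        rx = r.get(x, 0.0)
        if rx == 0.0:
            return math.inf  # p is not absolutely continuous w.r.t. r
        total += px * math.log(px / rx)
    return total

def re_fin(p, q, s):
    """Relative entropy of a morphism (f, s): (X, p) -> (Y, q); f enters only through q."""
    return relative_entropy(p, apply_kernel(s, q))

# Hypothetical data: X = {a, b, c}, Y = {0, 1}, with f(a) = f(b) = 0 and f(c) = 1.
p = {"a": 0.2, "b": 0.3, "c": 0.5}
q = {0: 0.5, 1: 0.5}                               # pushforward of p along f
s = {0: {"a": 0.5, "b": 0.5}, 1: {"c": 1.0}}       # a non-optimal hypothesis
s_opt = {0: {"a": 0.4, "b": 0.6}, 1: {"c": 1.0}}   # the disintegration of p along f
s_bad = {0: {"a": 1.0}, 1: {"c": 1.0}}             # gives b mass zero, so p is not << s o q

print(re_fin(p, q, s))      # a small positive number
print(re_fin(p, q, s_opt))  # vanishes: the hypothesis is optimal
print(re_fin(p, q, s_bad))  # inf
```

In this sketch the optimal hypothesis `s_opt` gives relative entropy zero, which is the finite-set form of the requirement that the functor vanishes on FP.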

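Clearly, $RE$ restricts to $RE_{fin}$ on FinStat. If $(f, s)$ is absolutely coherent, then $p$ is absolutely continuous with respect to $s \,\tilde{\circ}\, q$ and the Radon-Nikodym derivative $\frac{dp}{d(s \tilde{\circ} q)}$ is defined. The relative entropy is always non-negative [12]; this is an easy consequence of Jensen's inequality. This shows that $RE$ is defined everywhere in SbStat.

We will use the following notation occasionally:

$$RE\Big( (X, p) \underset{f}{\overset{s}{\rightleftarrows}} (Y, q) \Big) := RE(f, s).$$

It remains to show that $RE$ is indeed a functor. That is, we want to show that

$$RE\Big( (X, p) \underset{g \circ f}{\overset{s \tilde{\circ} t}{\rightleftarrows}} (Z, m) \Big) = RE\Big( (X, p) \underset{f}{\overset{s}{\rightleftarrows}} (Y, q) \Big) + RE\Big( (Y, q) \underset{g}{\overset{t}{\rightleftarrows}} (Z, m) \Big).$$

In order to do so, we will need the following lemma.

Lemma 4.3 The relative entropy is preserved under pre-composition by optimal hypotheses. In other words, we always have

$$RE\Big( (Y, q) \underset{g}{\overset{t}{\rightleftarrows}} (Z, m) \Big) = RE\Big( (X, s \,\tilde{\circ}\, q) \underset{f}{\overset{s}{\rightleftarrows}} (Y, q) \underset{g}{\overset{t}{\rightleftarrows}} (Z, m) \Big).$$

Proof. Case I: $(g, t)$ is absolutely coherent.

Since $(g, t)$ is absolutely coherent, so is $(g \circ f, s \,\tilde{\circ}\, t)$ by Proposition 2.4. Hence, to show $RE(g, t) = RE(g \circ f, s \,\tilde{\circ}\, t)$ is to show

$$\int_Y \log\!\left(\frac{dq}{d(t \,\tilde{\circ}\, m)}\right) dq = \int_X \log\!\left(\frac{d(s \,\tilde{\circ}\, q)}{d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)}\right) d(s \,\tilde{\circ}\, q).$$

Because $f$ is measure preserving, it is sufficient to show that the following functions on $X$ agree:

$$\frac{dq}{d(t \,\tilde{\circ}\, m)} \circ f = \frac{d(s \,\tilde{\circ}\, q)}{d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)} \qquad (s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)\text{-almost everywhere.}$$

By the Radon-Nikodym theorem, it is sufficient to show that for any measurable set $E \subset X$, we have

$$(s \,\tilde{\circ}\, q)(E) = \int_E \left(\frac{dq}{d(t \,\tilde{\circ}\, m)} \circ f\right) d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m).$$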

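The following calculation establishes the above.

$$\begin{aligned}
&\int_E \left(\frac{dq}{d(t \,\tilde{\circ}\, m)} \circ f\right) d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m) && (1)\\
&= \int_{y \in Y} \int_{x \in f^{-1}(y)} 1_E(x) \left(\frac{dq}{d(t \,\tilde{\circ}\, m)} \circ f\right)(x) \, d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)_y \, d(t \,\tilde{\circ}\, m) && (2)\\
&= \int_{y \in Y} \frac{dq}{d(t \,\tilde{\circ}\, m)}(y) \int_{f^{-1}(y)} 1_E \, d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)_y \, d(t \,\tilde{\circ}\, m) && (3)\\
&= \int_{y \in Y} \frac{dq}{d(t \,\tilde{\circ}\, m)}(y) \int_{f^{-1}(y)} 1_E \, ds_y \, d(t \,\tilde{\circ}\, m) && (4)\\
&= \int_{y \in Y} \frac{dq}{d(t \,\tilde{\circ}\, m)}(y) \, s_y\big(E \cap f^{-1}(y)\big) \, d(t \,\tilde{\circ}\, m) && (5)\\
&= \int_{y \in Y} \frac{dq}{d(t \,\tilde{\circ}\, m)}(y) \, s_y(E) \, d(t \,\tilde{\circ}\, m) && (6)\\
&= \int_{y \in Y} s_y(E) \, dq && (7)\\
&= (s \,\tilde{\circ}\, q)(E). && (8)
\end{aligned}$$

We get (2) by applying the disintegration theorem to $f : (X, s \,\tilde{\circ}\, t \,\tilde{\circ}\, m) \to (Y, t \,\tilde{\circ}\, m)$. Equation (3) follows from the fact that $\frac{dq}{d(t \tilde{\circ} m)} \circ f$ is constant on $f^{-1}(y)$ for every $y \in Y$. To obtain (4) we apply Lemma ??. To show (6) we use the fact that $s_y$ is a probability measure on $f^{-1}(y)$. We get (7) by the definition of the Radon-Nikodym derivative, and we finally establish (8) by the definition of Kleisli composition.

Case II: $(g, t)$ is not absolutely coherent.

The proof is simple but slightly tedious. It can be found in the extended version.

Theorem 4.4 (Functoriality) Given arrows $(f, s) : (X, p) \to (Y, q)$ and $(g, t) : (Y, q) \to (Z, m)$, we have

$$RE\big((g, t) \circ (f, s)\big) = RE(f, s) + RE(g, t).$$

Proof. Note that by definition, $RE\big((g, t) \circ (f, s)\big) = RE(g \circ f, s \,\tilde{\circ}\, t)$.

Case I: $(f, s)$ and $(g, t)$ are absolutely coherent.

By Proposition 3.4, we have that $(g \circ f, s \,\tilde{\circ}\, t)$ is absolutely coherent.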

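$$\begin{aligned}
RE(g \circ f, s \,\tilde{\circ}\, t) &= \int_X \log\!\left(\frac{dp}{d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)}\right) dp && (9)\\
&= \int_X \log\!\left(\frac{dp}{d(s \,\tilde{\circ}\, q)} \cdot \frac{d(s \,\tilde{\circ}\, q)}{d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)}\right) dp && (10)\\
&= \int_X \log\!\left(\frac{dp}{d(s \,\tilde{\circ}\, q)}\right) dp + \int_X \log\!\left(\frac{d(s \,\tilde{\circ}\, q)}{d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)}\right) dp && (11)\\
&= RE(f, s) + \int_X \log\!\left(\frac{d(s \,\tilde{\circ}\, q)}{d(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)}\right) dp && (12)\\
&= RE(f, s) + RE(g, t). && (13)
\end{aligned}$$

We get (10) by the chain rule for Radon-Nikodym derivatives and (13) by applying Lemma 4.3.

Case II: $(g, t)$ is not absolutely coherent.

The proof is very similar to the second case of the previous lemma. It can be found in the extended version.

Case III: $(f, s)$ is not absolutely coherent.

Although we relegated the proof of Case III to the extended version, it is neither trivial nor boring (see footnote 7).

For both of the above cases, we deduce that

$$RE\big((g, t) \circ (f, s)\big) = \infty = RE(f, s) + RE(g, t).$$

We have thus shown that $RE$ is a well-defined functor from SbStat to $[0, \infty]$.

7 It is not analogous to the previous case, since the existence of a measurable set $A$ such that $(s \,\tilde{\circ}\, q)(A) = 0$ and $p(A) > 0$ is surprisingly not enough to conclude that $(s \,\tilde{\circ}\, t \,\tilde{\circ}\, m)(A) = 0$.

4.1 Convex linearity

We show below that the relative entropy functor satisfies a convex linearity property. In [2] convexity looks familiar; here, since we are performing "large" sums, we have to express it as an integral.

First we define a localized version of the relative entropy. Note that Lemma ?? in the appendix says that $s_y = (s \,\tilde{\circ}\, q)_y$ $q$-almost everywhere. Thus, in the following there is no notational clash between the kernel $s_y$ and $(s \,\tilde{\circ}\, q)_y$, the latter being the disintegration of $s \,\tilde{\circ}\, q$ along $f$.

Given an arrow $(f, s) : (X, p) \to (Y, q)$ in SbStat and a point $y \in Y$, we denote by $(f, s)_y$ the morphism $(f, s)$ restricted to the pair of standard Borel spaces $f^{-1}(y)$ and $\{y\}$. Explicitly,

$$(f, s)_y := \big(f|_{f^{-1}(y)}, s_y\big) : \big(f^{-1}(y), p_y\big) \to \big(\{y\}, \delta_y\big),$$

where $\delta_y$ is the one and only probability measure on $\{y\}$.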
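
As a numerical sanity check of Theorem 4.4, separate from the localization just introduced, the additivity $RE((g,t) \circ (f,s)) = RE(f,s) + RE(g,t)$ can be verified on finite sets, where Kleisli composition reduces to $\circ_{fin}$. The following Python sketch is only an illustration: the function names, the dictionary encoding and the sample measures are assumptions made for this example and do not come from the paper.

```python
import math

def apply_kernel(s, q):
    """(s o q)(x) = sum_y q(y) * s[y][x] for a finite kernel s and measure q."""
    out = {}
    for y, qy in q.items():
        for x, w in s[y].items():
            out[x] = out.get(x, 0.0) + qy * w
    return out

def compose_kernels(s, t):
    """(s o_fin t)_z(x) = sum_y t_z(y) * s_y(x)."""
    return {z: apply_kernel(s, tz) for z, tz in t.items()}

def relative_entropy(p, r):
    """S(p, r) = sum_x p(x) ln(p(x) / r(x)); returns inf unless p << r."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        rx = r.get(x, 0.0)
        if rx == 0.0:
            return math.inf
        total += px * math.log(px / rx)
    return total

# Hypothetical chain X = {a, b, c} --f--> Y = {0, 1} --g--> Z = {"*"},
# with f(a) = f(b) = 0, f(c) = 1 and g constant.
p = {"a": 0.2, "b": 0.3, "c": 0.5}
q = {0: 0.5, 1: 0.5}          # pushforward of p along f
m = {"*": 1.0}                # pushforward of q along g
s = {0: {"a": 0.5, "b": 0.5}, 1: {"c": 1.0}}   # hypothesis Y -> Gamma(X)
t = {"*": {0: 0.25, 1: 0.75}}                  # hypothesis Z -> Gamma(Y)

re_f = relative_entropy(p, apply_kernel(s, q))                       # RE(f, s)
re_g = relative_entropy(q, apply_kernel(t, m))                       # RE(g, t)
re_gf = relative_entropy(p, apply_kernel(compose_kernels(s, t), m))  # RE of the composite

print(re_gf, re_f + re_g)  # the two numbers agree up to rounding
```

Both hypotheses here are deliberately non-optimal, so neither summand vanishes and the check is not trivial.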

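Definition 4.5 A functor $F$ from SbStat to $[0, \infty]$ is convex linear if for every arrow $(f, s) : (X, p) \to (Y, q)$, we have

$$F(f, s) = \int_Y F\big((f, s)_y\big) \, dq.$$

We will sometimes refer to the relative entropy of $(f, s)_y$ as the local relative entropy of $(f, s)$ at $y$.

Theorem 4.6 (Convex Linearity) The functor $RE$ is convex linear, i.e., for every arrow $(f, s) : (X, p) \to (Y, q)$, we have

$$RE(f, s) = \int_Y RE\big((f, s)_y\big) \, dq.$$

Proof. Case I: $(f, s)$ is absolutely coherent.

Note that by Lemma ??, $p_y \ll (s \,\tilde{\circ}\, q)_y$ almost everywhere. So we have

$$\begin{aligned}
RE(f, s) &= \int_X \log\!\left(\frac{dp}{d(s \,\tilde{\circ}\, q)}\right) dp && (14)\\
&= \int_Y \int_{f^{-1}(y)} \log\!\left(\frac{dp}{d(s \,\tilde{\circ}\, q)}\right) dp_y \, dq && (15)\\
&= \int_Y \int_{f^{-1}(y)} \log\!\left(\frac{dp_y}{d(s \,\tilde{\circ}\, q)_y}\right) dp_y \, dq && (16)\\
&= \int_Y RE\big((f, s)_y\big) \, dq. && (17)
\end{aligned}$$

We get (15) by the disintegration theorem and (16) by applying Lemma ??.

Case II: $(f, s)$ is not absolutely coherent.

The proof is not hard and can be found in the extended version.

4.2 Lower semi-continuity

Recall that a sequence of probability measures $p_n$ converges strongly to $p$, denoted by $p_n \to p$, if for all measurable sets $E$, one has $\lim_{n \to \infty} p_n(E) = p(E)$.

Definition 4.7 A functor $F$ from SbStat to $[0, \infty]$ is lower semi-continuous if for every arrow $(f, s) : (X, p) \to (Y, q)$, whenever $p_n \to p$, then

$$F\Big( (X, p) \underset{f}{\overset{s}{\rightleftarrows}} (Y, q) \Big) \le \liminf_{n \to \infty} F\Big( (X, p_n) \underset{f}{\overset{s}{\rightleftarrows}} (Y, q_n) \Big).$$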
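
Returning to Theorem 4.6: on a finite set the integral over $Y$ becomes a weighted sum over fibres, so convex linearity reads $RE(f, s) = \sum_y q(y) \, RE((f, s)_y)$. The Python sketch below checks this identity on a small example. It is a hedged illustration only; the helper `disintegrate`, the dictionary encoding and the data are hypothetical choices made here, not taken from the paper.

```python
import math

def relative_entropy(p, r):
    """S(p, r) = sum_x p(x) ln(p(x) / r(x)); returns inf unless p << r."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        rx = r.get(x, 0.0)
        if rx == 0.0:
            return math.inf
        total += px * math.log(px / rx)
    return total

def disintegrate(p, f):
    """Return (p_y, q): the conditionals of p on the fibres of f and the pushforward q."""
    q = {}
    for x, px in p.items():
        q[f[x]] = q.get(f[x], 0.0) + px
    p_y = {
        y: {x: px / qy for x, px in p.items() if f[x] == y}
        for y, qy in q.items() if qy > 0
    }
    return p_y, q

# Hypothetical data: X = {a, b, c}, Y = {0, 1}.
f = {"a": 0, "b": 0, "c": 1}
p = {"a": 0.2, "b": 0.3, "c": 0.5}
s = {0: {"a": 0.5, "b": 0.5}, 1: {"c": 1.0}}   # each s[y] is carried by the fibre over y

p_y, q = disintegrate(p, f)

# Global relative entropy: S(p, s o q), using that s[y] lives on f^{-1}(y).
s_q = {x: q[f[x]] * s[f[x]].get(x, 0.0) for x in p}
global_re = relative_entropy(p, s_q)

# Local relative entropies S(p_y, s_y), integrated against q.
local_re = {y: relative_entropy(p_y[y], s[y]) for y in p_y}
integrated = sum(q[y] * local_re[y] for y in p_y)

print(global_re, integrated)  # equal up to rounding
```

In this example the fibre over $1$ is a single point, so its local relative entropy is zero and the whole contribution comes from the fibre over $0$.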

Note that a lower semi-continuous functor $F$ on SbStat restricts to a lower semicontinuous (as defined slightly differently in [2]) functor on FinStat.

Theorem 4.8 (Lower semi-continuity) The functor $RE$ is lower semicontinuous.

Proof. This is a direct consequence of Pinsker [16, Section 2.4].

5 Uniqueness

We now show that the relative entropy is, up to a multiplicative constant, the unique functor satisfying the conditions established so far. We first prove a crucial lemma.

Lemma 5.1 Let $X$ be a Borel space equipped with probability measures $p$ and $q$. If $p \ll q$, then we can find a sequence of simple functions $p_n^*$ on $X$ such that for the sequence of probability measures $p_n(E) := \int_E p_n^* \, dq$, we have that $p_n$ and $p$ agree on the elements of the partition on $X$ induced by $p_n^*$ and moreover, $p_n \to p$ strongly.

Proof. We write $I_{n,k}$ for the interval $[k 2^{-n}, (k+1) 2^{-n})$ and $I_{n,\infty}$ for the interval $[n, \infty)$. Denote by $K_n$ the index set $\{0, 1, \ldots, n 2^n - 1, \infty\}$ of $k$. We fix a version of the Radon-Nikodym derivative $\frac{dp}{dq}$ such that $\frac{dp}{dq} < \infty$ everywhere. We define a family of partitions and a family of simple functions as follows:

$$X_{n,k} := \Big\{ x \in X \;\Big|\; \frac{dp}{dq}(x) \in I_{n,k} \Big\}, \qquad p_n^*(x) := \frac{p(X_{n,k})}{q(X_{n,k})} \ \text{ for } x \in X_{n,k}.$$

Every function induces a partition on the domain; if moreover the function is simple, the induced partition is finite. We first note that $p_n$ and $p$ agree on the elements of the partition induced by $p_n^*$:

$$p_n(X_{n,k}) = \int_{X_{n,k}} p_n^* \, dq = \int_{X_{n,k}} \frac{p(X_{n,k})}{q(X_{n,k})} \, dq = \frac{p(X_{n,k})}{q(X_{n,k})} \, q(X_{n,k}) = p(X_{n,k}).$$

Next, we prove the strong convergence $p_n \to p$. We first show that $p_n^* \to \frac{dp}{dq}$ pointwise. Let $x \in X$. Pick $N$ large enough such that $\frac{dp}{dq}(x) < N$. For a fixed integer $n \ge N$, there is exactly one $k_n$ for which $x \in X_{n,k_n}$. On the one hand, we have

$$k_n 2^{-n} \le \frac{dp}{dq} \le (k_n + 1) 2^{-n} \quad \text{on } X_{n,k_n}.$$

But on the other hand, by integrating over $X_{n,k_n}$ and dividing everything by $q(X_{n,k_n})$, we also have

$$k_n 2^{-n} \le \frac{p(X_{n,k_n})}{q(X_{n,k_n})} \le (k_n + 1) 2^{-n} \quad \text{on } X_{n,k_n}.$$

We thus get pointwise convergence since we have

$$\Big| p_n^*(x) - \frac{dp}{dq}(x) \Big| = \Big| \frac{p(X_{n,k_n})}{q(X_{n,k_n})} - \frac{dp}{dq}(x) \Big| \le 2^{-n} \quad \text{for any } n \ge N.$$

From the above inequality and the choice of $N$, we note the following:

$$p_n^*(x) \le \frac{dp}{dq}(x) + 2^{-n} \le \frac{dp}{dq}(x) + 1, \quad \text{for } x \text{ with } \frac{dp}{dq}(x) < n,$$

$$p_n^*(x) = p(X_{n,\infty}) \le 1 \le \frac{dp}{dq}(x) + 1, \quad \text{for } x \text{ with } \frac{dp}{dq}(x) \ge n.$$

So for all $n$, we can bound $p_n^*(x)$ everywhere by the integrable function $g(x) := \frac{dp}{dq}(x) + 1$. Given a measurable set $E$, we can thus apply Lebesgue's dominated convergence theorem. We get

$$\lim_{n \to \infty} p_n(E) = \lim_{n \to \infty} \int_E p_n^* \, dq = \int_E \lim_{n \to \infty} p_n^* \, dq = \int_E \frac{dp}{dq} \, dq = p(E).$$

Before proving uniqueness, we recall the main theorem of Baez and Fritz [2] on FinStat.

Theorem 5.2 Suppose that a functor $F : \mathbf{FinStat} \to [0, \infty]$ is lower semicontinuous, convex linear and vanishes on FP. Then for some $0 \le c \le \infty$ we have $F(f, s) = c \, RE_{fin}(f, s)$ for all morphisms $(f, s)$ in FinStat.

We are now ready to extend this characterization to SbStat.

Theorem 5.3 Suppose that a functor $F : \mathbf{SbStat} \to [0, \infty]$ is lower semicontinuous, convex linear and vanishes on FP. Then for some $0 \le c \le \infty$ we have $F(f, s) = c \, RE(f, s)$ for all morphisms.

Proof. Since $F$ satisfies all the above properties on FinStat, we can apply Theorem 5.2 in order to establish that $F = c \, RE_{fin} = c \, RE$ for all morphisms in the subcategory FinStat. We show that $F$ extends uniquely to $c \, RE$ on all morphisms in SbStat.

By convex linearity of $F$, for an arbitrary morphism $(f, s)$ from $(X, p)$ to $(Y, q)$, we have

$$F(f, s) = \int_Y F\big((f, s)_y\big) \, dq,$$

so $F$ is totally described by its local relative entropies. It is thus sufficient to show $F = c \, RE$ on an arbitrary morphism $(f, s) : (X, p) \to (\{y\}, \delta_y)$. The case where $p$ is not absolutely continuous with respect to $s$ is straightforward, so let us assume $p \ll s$ (see footnote 8).

8 See the extended version for details.
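
The dyadic construction of Lemma 5.1, on which the rest of the proof relies, can be sketched on a finite space where $\frac{dp}{dq}(x) = p(x)/q(x)$. The Python code below is only an illustration under stated assumptions: the function names, the use of the string `"inf"` as the index of the cell $X_{n,\infty}$, and the sample measures are all hypothetical, and every point is assumed to have $q(x) > 0$.

```python
import math

def dyadic_cell(r, n):
    """Index k of the cell of a density value r: I_{n,k} = [k 2^-n, (k+1) 2^-n) if r < n, else 'inf'."""
    return "inf" if r >= n else math.floor(r * 2 ** n)

def simple_approximation(p, q, n):
    """Return (p_star, p_n, cells) for the n-th dyadic approximation of p relative to q.

    p_star is the simple function p*_n, p_n the measure with density p*_n w.r.t. q,
    and cells maps each cell index k to the list of points of X_{n,k}.
    """
    density = {x: p[x] / q[x] for x in q}  # a version of dp/dq, assuming q(x) > 0
    cells = {}
    for x, r in density.items():
        cells.setdefault(dyadic_cell(r, n), []).append(x)
    p_star = {}
    for xs in cells.values():
        ratio = sum(p[x] for x in xs) / sum(q[x] for x in xs)  # p(X_{n,k}) / q(X_{n,k})
        for x in xs:
            p_star[x] = ratio
    p_n = {x: p_star[x] * q[x] for x in q}
    return p_star, p_n, cells

# Hypothetical data on a four-point space.
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
p = {"a": 0.10, "b": 0.15, "c": 0.25, "d": 0.50}

for n in (1, 2, 3):
    p_star, p_n, cells = simple_approximation(p, q, n)
    print(n, sorted(cells.items(), key=lambda kv: str(kv[0])), p_n)
```

For $n = 1$ the points `c` and `d` share the cell $X_{1,\infty}$ and $p_1$ averages their mass; by $n = 3$ every point sits in its own cell and $p_3$ coincides with $p$, which is the finite-space shadow of the strong convergence $p_n \to p$.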

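We apply Lemma 5.1 with $p$ and $s$ to get the family of simple functions $p_n^*$ and the corresponding family of partitions $\{X_{n,k}\}$. We define $\pi_n$ as the function that maps $x \in X_{n,k}$ to the element $X_{n,k} \in \{X_{n,k}\}_{k \in K_n}$. Denote by $s^{\pi_n}$ the disintegration of $s$ along $\pi_n$ and by $s_n$ the corresponding marginal. Note that since $p_n$ and $p$ agree on every $X_{n,k}$, $p_n$ is indeed the push-forward of $p$ along $\pi_n$, so we can identify $p_n$ with the corresponding marginal of $p$. Presented as diagrams, we have

$$(X, p) \underset{\pi_n}{\overset{s^{\pi_n}}{\rightleftarrows}} (\{X_{n,k}\}, p_n) \underset{f_n}{\overset{s_n}{\rightleftarrows}} (\{y\}, \delta_y) \quad \xrightarrow{\ \text{Composition}\ } \quad (X, p) \underset{f}{\overset{s}{\rightleftarrows}} (\{y\}, \delta_y).$$

From the above diagram and the hypothesis that $F$ is a functor, we have the following inequality:

$$F(f_n, s_n) \le F(f, s), \quad \text{for all } n \in \mathbb{N}. \qquad (18)$$

Note that, on the one hand, the disintegration of $p_n$ along $\pi_n$ at the point $X_{n,k} \in \{X_{n,k}\}$ is given by $p_{n,\pi} := p_n / p_n(X_{n,k})$, but on the other hand, for any measurable set $E \subset X$, we also have

$$\int 1_E \, d(p_{n,\pi} \,\tilde{\circ}\, s_n) = \sum_{k \in K_n} \frac{p_n(E \cap X_{n,k})}{p_n(X_{n,k})} \, s_n(X_{n,k}) = \sum_{k \in K_n} \frac{s(E \cap X_{n,k})}{s(X_{n,k})} \, s_n(X_{n,k}) = \sum_{k \in K_n} s(E \cap X_{n,k}) = s(E).$$

This means that $p_{n,\pi}$ is the disintegration of $s$ along $\pi_n$. Presented as diagrams, where we use $f_{p_n}$ instead of $f$ to indicate that the arrow leaves from the object $(X, p_n)$ as opposed to $(X, p)$, we have

$$(X, p_n) \underset{\pi_n}{\overset{p_{n,\pi}}{\rightleftarrows}} (\{X_{n,k}\}, p_n) \underset{f_n}{\overset{s_n}{\rightleftarrows}} (\{y\}, \delta_y) \quad \xrightarrow{\ \text{Composition}\ } \quad (X, p_n) \underset{f_{p_n}}{\overset{s}{\rightleftarrows}} (\{y\}, \delta_y).$$

But since $F$ vanishes on FP, we have $F(\pi_n, p_{n,\pi}) = 0$. Combined with the fact that $F$ is a functor, we get

$$F(f_{p_n}, s) = F(\pi_n, p_{n,\pi}) + F(f_n, s_n) = F(f_n, s_n). \qquad (19)$$

By Lemma 5.1, we know that $p_n \to p$; in terms of our diagrams we have

$$(X, p_n) \underset{f_{p_n}}{\overset{s}{\rightleftarrows}} (\{y\}, \delta_y) \quad \xrightarrow{\ \text{Strong Convergence}\ } \quad (X, p) \underset{f}{\overset{s}{\rightleftarrows}} (\{y\}, \delta_y).$$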

Hence, combining (19) with the lower semicontinuity of $F$, we also have the inequality

$$F(f, s) \le \liminf_{n \to \infty} F(f_{p_n}, s) = \liminf_{n \to \infty} F(f_n, s_n). \qquad (20)$$

Since $(f_n, s_n)$ is in FinStat, we must have $F(f_n, s_n) = c \, RE(f_n, s_n)$. Thus, combining (18) and (20), we get that $F(f, s)$ must satisfy

$$\limsup_{n \to \infty} c \, RE(f_n, s_n) \le F(f, s) \le \liminf_{n \to \infty} c \, RE(f_n, s_n),$$

but so does $c \, RE(f, s)$. We also have

$$\limsup_{n \to \infty} c \, RE(f_n, s_n) \le c \, RE(f, s) \le \liminf_{n \to \infty} c \, RE(f_n, s_n).$$

Therefore $F(f, s) = c \, RE(f, s)$, as desired.

6 Conclusions and Further Directions

As promised, we have given a categorical characterization of relative entropy on standard Borel spaces. This greatly broadens the scope of the original work by Baez et al. [3,2]. However, the main motivation is to study the role of entropy arguments in machine learning. These appear in various ad hoc ways in machine learning, but with the appearance of the recent work by Danos and his co-workers [9,7,8] we feel that we have the prospect of a mathematically well-defined framework in which to understand Bayesian inversion and its interplay with entropy. The most recent paper in this series [7] adopts a point-free approach introduced in [5,6]. It would be interesting to extend our definitions to a point-free situation. There are also many interesting questions with regard to understanding the "algebra of entropy"; see the book by Yeung [18] for a taste of these ideas.

Acknowledgements

We have benefitted from discussions with Florence Clerc, Vincent Danos, Tobias Fritz, Renaud Raquépas and Ilias Garnier. We are grateful to the anonymous referees for very helpful comments. This research was supported by a grant from Google and from NSERC.

References

[1] Ash, R. B., Real Analysis and Probability, Academic Press, 1972.
[2] Baez, J. C. and T. Fritz, A Bayesian characterization of relative entropy, Theory and Applications of Categories 29 (2014), pp. 422-456.

[3] Baez, J. C., T. Fritz and T. Leinster, A characterization of entropy in terms of information loss, Entropy 13 (2011), pp. 1945-1957.
[4] Billingsley, P., Probability and Measure, Wiley-Interscience, 1995.
[5] Chaput, P., V. Danos, P. Panangaden and G. Plotkin, Approximating Markov processes by averaging, in: Proceedings of the 37th International Colloquium On Automata Languages And Programming (ICALP), Lecture Notes In Computer Science 5556, 2009, pp. 127-138.
[6] Chaput, P., V. Danos, P. Panangaden and G. Plotkin, Approximating Markov processes by averaging, J. ACM 61 (2014), pp. 5:1-5:45. URL http://doi.acm.org/10.1145/2537948
[7] Clerc, F., V. Danos, F. Dahlqvist and I. Garnier, Pointless learning, in: Proceedings of FoSSaCS 2017, 2017.
[8] Dahlqvist, F., V. Danos and I. Garnier, Giry and the machine, Electronic Notes in Theoretical Computer Science 325 (2016), pp. 85-110.
[9] Danos, V. and I. Garnier, Dirichlet is natural, Electronic Notes in Theoretical Computer Science 319 (2015), pp. 137-164.
[10] Dudley, R. M., Real Analysis and Probability, Wadsworth and Brookes/Cole, 1989.
[11] Giry, M., A categorical approach to probability theory, in: B. Banaschewski, editor, Categorical Aspects of Topology and Analysis, number 915 in Lecture Notes In Mathematics (1981), pp. 68-85.
[12] Kullback, S. and R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics 22 (1951), pp. 79-86.
[13] Lawvere, F. W., The category of probabilistic mappings (1964), unpublished typescript.
[14] Lawvere, F. W., Metric spaces, generalized logic and closed categories, Rend. Sem. Mat. Fis. Milano 43 (1973), pp. 135-166.
[15] Leinster, T., An operadic introduction to entropy, n-category café.
[16] Pinsker, M. S., Information and information stability of random variables and processes (1960).
[17] Rokhlin, V. A., On the fundamental ideas of measure theory, Matematicheskii Sbornik 67 (1949), pp. 107-150.
[18] Yeung, R. W., Information theory and network coding, Springer Science & Business Media, 2008.

A Lemmas

Lemma A.1 Given an arrow $(f, s) : (X, p) \to (Y, q)$ in SbStat.
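Let $\{(s \,\tilde{\circ}\, q)_y\}_{y \in Y}$ denote the collection of conditional probability measures of $s \,\tilde{\circ}\, q$ conditioned by $f$; then $(s \,\tilde{\circ}\, q)_y = s_y$ $q$-almost everywhere.

Lemma A.2 Given $(X, p) \xrightarrow{f} (Y, q) \xleftarrow{f} (X, p')$, where $f$ is a continuous function preserving the measure of both Borel probability measures $p$ and $p'$. If $p \ll p'$, then $\frac{dp_y}{dp'_y}(x) = \frac{dp}{dp'}(x)$ $p'$-almost everywhere.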