The Information Bottleneck Revisited or How to Choose a Good Distortion Measure
Peter Harremoës
Centrum voor Wiskunde en Informatica
P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
P.Harremoes@cwi.nl

Naftali Tishby
The Hebrew University
Jerusalem, Israel
tishby@cs.huji.ac.il

Abstract: It is well-known that the information bottleneck method and rate distortion theory are related. Here it is described how the information bottleneck can be considered as rate distortion theory for a family of probability measures where information divergence is used as distortion measure. It is shown that the information bottleneck method has some properties that are not shared with rate distortion theory based on any other divergence measure. In this sense the information bottleneck method is unique.

I. INTRODUCTION

For many problems in the theory of lossy compression it is difficult to specify a distortion measure. In some cases we want to compress a variable X, but what we are really interested in is not the value of X but the value of a variable Y correlated or coupled with X. We are only interested in retrieving the value of X to the extent that it gives information about Y. Thus, each value of X gives a distribution on Y. The information bottleneck method was introduced in [1] to solve this problem. It has always been known that the information bottleneck method is related to rate distortion theory. In this paper we shall explore the relation in detail. The information bottleneck method has found natural interpretations and a number of applications as described in [2], [3], [4], [5], [6], [7]. The results in these papers do not rule out the possibility that similar results could have been obtained by other means (other distortion measures). Our approach will be via rate distortion theory, and as we shall see we are led directly to the information bottleneck via a number of assumptions or simplifications that efficiently rule out all distortion and divergence measures other than information divergence.

Let A and B be alphabets and let X and Y be random variables with values in A and B. For simplicity we shall assume that A and B are finite, but most of the results in this paper hold for infinite sets as well.
The joint distribution of X and Y is given, and is specified by a distribution on A together with a Markov kernel Φ: A → M₊¹(B), where M₊¹(B) denotes the set of probability measures on B. As reconstruction alphabet we use Â = M₊¹(B). As distortion measure d: A × Â → R we use

    d(x, x̂) = D(Φ(x), x̂)                                    (1)

where D denotes some divergence measure on M₊¹(B). Our goal is to minimize both the rate I(X; X̂) and the distortion E[d(X, X̂)] over all joint distributions of (X, X̂) ∈ A × Â with prescribed marginal distribution of X. The trade-off between rate and distortion is given by the rate distortion curve. To find the point on the rate distortion curve with slope −β one should minimize I(X; X̂) + β E[d(X, X̂)].

The most important divergence measure is information divergence (or Kullback-Leibler information or relative entropy) defined by

    D(P ‖ Q) = Σ_{y∈B} p_y log(p_y / q_y).

We know from Sanov's Theorem [8] that the difficulty in distinguishing a distribution P from a distribution Q by a statistical test is given by the information divergence. This suggests using information divergence as distortion measure. We shall now formalize these ideas.

Information divergence belongs to a class of divergence measures called Csiszár f-divergences, defined by

    D_f(P, Q) = Σ_{y∈B} q_y f(p_y / q_y)

where f denotes a convex function satisfying f(1) = 0 [9], [10]. For f(x) = x log x we get information divergence. For f(x) = (x − 1)² we get χ²-divergence. For f(x) = (√x − 1)² we get Hellinger divergence. For f(x) = |x − 1| we get variational distance.

Information divergence also belongs to the class of Bregman divergences [11]. For a finite output alphabet a Bregman divergence on M₊¹(B) is defined by

    B_f(P, Q) = f(P) − (f(Q) + (P − Q) · ∇f(Q))

where f: M₊¹(B) → R is some convex function. Information divergence is obtained when

    f(P) = Σ_{y∈B} p_y log p_y.
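The two characterizations of information divergence just given, as the Csiszár f-divergence with f(x) = x log x and as the Bregman divergence generated by negative entropy, can be checked numerically. The sketch below is illustrative only (the helper names are ours, not the paper's) and assumes strictly positive distributions to avoid 0 log 0:

```python
import math

def kl(p, q):
    """Information divergence D(P||Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def f_divergence(p, q, f):
    """Csiszar f-divergence: sum_y q_y * f(p_y / q_y)."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def bregman_neg_entropy(p, q):
    """Bregman divergence generated by f(P) = sum_y p_y log p_y.
    B(P, Q) = f(P) - f(Q) - <P - Q, grad f(Q)>, with grad f(Q)_y = 1 + log q_y."""
    neg_ent = lambda r: sum(ri * math.log(ri) for ri in r)
    grad = [1 + math.log(qi) for qi in q]
    return neg_ent(p) - neg_ent(q) - sum((pi - qi) * gi
                                         for pi, qi, gi in zip(p, q, grad))

P = [0.2, 0.5, 0.3]
Q = [0.4, 0.4, 0.2]

d_kl = kl(P, Q)
d_f = f_divergence(P, Q, lambda x: x * math.log(x))
d_b = bregman_neg_entropy(P, Q)
assert abs(d_kl - d_f) < 1e-12 and abs(d_kl - d_b) < 1e-12
```

All three expressions agree because the linear correction term of the Bregman form cancels against the normalization Σ(p_y − q_y) = 0.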
Rate distortion theory with Bregman divergences is well studied in [12]. Sometimes so-called Burbea-Rao divergences are used. The Burbea-Rao divergence between P and Q is defined by

    BR_f(P, Q) = Σ_{y∈B} [ (f(p_y) + f(q_y))/2 − f((p_y + q_y)/2) ]

for some convex function f [13]. An important class of divergence measures are the separable divergences introduced in [14]. These are divergences defined by

    D(P, Q) = Σ_{y∈B} φ(p_y, q_y)

for some function φ: [0, 1]² → R. We note that Csiszár f-divergences and Burbea-Rao divergences are separable. A separable Bregman divergence is given by

    B_f(P, Q) = Σ_{y∈B} [ g(p_y) − (g(q_y) + (p_y − q_y) g′(q_y)) ]

for some convex function g: [0, 1] → R. A divergence measure is said to be reflexive if D(P, Q) ≥ 0 with equality for P = Q. For distributions P and Q close to each other all these divergence measures are approximately proportional [15] if the functions used to define them are sufficiently smooth.

II. ADDITIVITY

Rate distortion theory is most interesting when you have a rate distortion theorem. In order to get a rate distortion theorem one has to consider sequences instead of single events, and one has to extend the definition of distortion from a sequence of inputs xⁿ ∈ Aⁿ to a sequence of reconstruction points x̂ⁿ by

    d(xⁿ, x̂ⁿ) = (1/n) Σ_{i=1}^{n} d(x_i, x̂_i);

see [8]. The factor 1/n is just a matter of normalization, but for our problem it is essential that our divergence measure is additive, i.e.

    D(P₁ ⊗ P₂, Q₁ ⊗ Q₂) = D(P₁, Q₁) + D(P₂, Q₂).         (2)

As the divergence on the left hand side lives on a different space than the divergences on the right hand side, Equation (2) could be used to define the divergence on a product space, and in this sense Equation (2) cannot be used to characterize divergence measures suitable for rate distortion theory. If we require that the divergence measures on the left and the right hand side of Equation (2) belong to special classes of divergence measures, one can obtain non-trivial characterizations.

Theorem 1: A separable reflexive divergence measure that is additive is a linear combination of information divergence and reversed information divergence.

Proof: Let the divergence measure D be based on φ: ]0, 1[² → R. Take P₁ = (p_y)_{y∈B}, Q₁ = (q_y)_{y∈B}, P₂ = (s, 1−s) and Q₂ = (1/2, 1/2). Then additivity states that

    Σ_y φ(p_y, q_y) + φ(s, 1/2) + φ(1−s, 1/2) = Σ_y [ φ(s·p_y, q_y/2) + φ((1−s)·p_y, q_y/2) ].

By taking the second derivative with respect to s on both sides we get

    φ₁₁(s, 1/2) + φ₁₁(1−s, 1/2) = Σ_y [ φ₁₁(s·p_y, q_y/2) + φ₁₁((1−s)·p_y, q_y/2) ] p_y².

For s = 1/2 we get

    2 φ₁₁(1/2, 1/2) = 2 Σ_y φ₁₁(p_y/2, q_y/2) p_y².

This equation should hold for all probability vectors P₁ and Q₁, which implies that (x, y) ↦ φ₁₁(x, y) x² is linear in x and y. Thus there exist constants c₁ and c₂ such that φ₁₁(x, y) x² = c₁x + c₂y, having solutions of the form

    φ(x, y) = c₁ x log(x/y) + c₂ y log(y/x) + f(y) x + g(y)

for some functions f, g. The first two terms give information divergence and reversed information divergence, so we just have to check additivity of a divergence based on φ̃(x, y) = f(y) x + g(y). Taking the second derivative with respect to y leads to (f″(y) x + g″(y)) y² = c₄x + c₅y in the same way as above. This should hold for all x, leading to f″(y) = c₄/y² and g″(y) = c₅/y. The solutions are

    f(y) = −c₄ log y + c₆ y + c₇,
    g(y) = c₅ y log y + c₈ y + c₉

for some constants c₆, c₇, c₈ and c₉. The terms that are linear in x or y may be replaced by constants without changing the divergence, so we may assume that

    φ̃(x, y) = −c₄ x log y + c₅ y log y + c₇ xy + c₁₀.

One easily checks that the first two terms satisfy additivity and the last ones do not, except for c₇ = c₁₀ = 0. For x = y we have 0 = φ̃(x, x) = (c₅ − c₄) x log x, which implies that c₅ = c₄ and c₇ = c₁₀ = 0. Thus φ̃(x, y) = c₄ (y log y − x log y), and positivity is only obtained for c₄ = 0, so D reduces to a linear combination of information divergence and reversed information divergence.
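The additivity property at the heart of this section is easy to probe numerically. The following illustrative sketch (function names are ours) confirms Equation (2) for information divergence and shows that χ²-divergence, although separable and reflexive, violates it:

```python
import math
from itertools import product

def kl(p, q):
    """Information divergence D(P||Q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def chi2(p, q):
    """Csiszar f-divergence with f(x) = (x - 1)^2, i.e. chi^2-divergence."""
    return sum(qi * (pi / qi - 1) ** 2 for pi, qi in zip(p, q))

def tensor(p1, p2):
    """Product distribution P1 (x) P2 as a flat vector."""
    return [a * b for a, b in product(p1, p2)]

P1, Q1 = [0.3, 0.7], [0.6, 0.4]
P2, Q2 = [0.2, 0.8], [0.5, 0.5]

# Information divergence is additive over products:
lhs = kl(tensor(P1, P2), tensor(Q1, Q2))
rhs = kl(P1, Q1) + kl(P2, Q2)
assert abs(lhs - rhs) < 1e-12

# chi^2-divergence is not (it picks up a cross term):
lhs2 = chi2(tensor(P1, P2), tensor(Q1, Q2))
rhs2 = chi2(P1, Q1) + chi2(P2, Q2)
assert abs(lhs2 - rhs2) > 1e-3
```

The χ² failure is structural: chi2(P‖Q) + 1 multiplies over products instead of adding, so the defect equals the product of the two single-letter divergences.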
III. SUFFICIENCY

The number of parameters in our rate distortion problem can often be reduced by replacing the original variables with sufficient variables. This idea is actually used already in formulating the model. In principle any model can always be extended by including less relevant or even irrelevant variables. Normally one would leave out these less relevant variables at an early stage, but as we shall see, in the information bottleneck method one can also get rid of irrelevant variables within the model.

First we shall consider sufficiency on the input side. Assume that X = (X₁, X₂) and that X₂ is independent of Y given X₁. Equivalently, X₂ → X₁ → Y is assumed to be a Markov chain. As Y only depends on X via X₁ we would like to leave out X₂ from the analysis. Thus we should compare the bottleneck problem (X₁, X₂) → Y with the bottleneck problem X₁ → Y and show that they have the same rate distortion function. Obviously any joint distribution on (X̂, (X₁, X₂), Y) gives a joint distribution on (X̂, X₁, Y) with the same mean distortion and a smaller (or equal) rate. Conversely, let a joint distribution on (X̂, X₁, Y) be given where X̂ and Y are independent given X₁. The joint distribution of (X₁, X₂, Y) defines a Markov kernel from X₁ to X₂. Now consider the joint distribution on (X̂, X₁, X₂, Y) where X₂, X̂ and Y are independent given X₁. For this joint distribution the mean distortion is equal to the mean distortion of (X̂, X₁, Y), and the rate equals

    I(X̂; (X₁, X₂)) = I(X̂; X₁) + I(X̂; X₂ | X₁) = I(X̂; X₁).

We note that this sufficiency result on the input side holds for any distortion measure.

Next we shall consider sufficiency on the output side. Assume that Y = (Y₁, Y₂) and that X is independent of Y₂ given Y₁. Equivalently, X → Y₁ → Y₂ is assumed to be a Markov chain. As Y only depends on X via Y₁ we would like to leave out Y₂ from the analysis. Thus we should compare the bottleneck problem X → Y with the bottleneck problem X → Y₁ and show that they have the same rate distortion function. We have

    D(P_Y(· | X) ‖ P_Y(· | X̂)) = D(P_{Y₁}(· | X) ‖ P_{Y₁}(· | X̂)) + D(P_{Y₂}(· | Y₁, X) ‖ P_{Y₂}(· | Y₁, X̂)),

and the last term vanishes because of the Markov condition. Therefore

    E[D(P_Y(· | X) ‖ P_Y(· | X̂))] = E[D(P_{Y₁}(· | X) ‖ P_{Y₁}(· | X̂))].     (3)

Note that Equation (3) holds for any f-divergence. We shall show that it essentially only holds for f-divergences.
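Sufficiency on the output side can be illustrated numerically: for a Markov chain X → Y₁ → Y₂, dropping Y₂ changes nothing, which for information divergence amounts to I(X; Y₁, Y₂) = I(X; Y₁) since E[D(P_Y(·|X) ‖ P_Y)] = I(X; Y). A sketch with made-up kernels (all names and numbers are ours, purely for illustration):

```python
import math

def mutual_information(joint):
    """I(X;Y) for a joint pmf given as a dict {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Made-up example: P(X), channel X -> Y1, channel Y1 -> Y2 (Markov chain).
p_x = {0: 0.4, 1: 0.6}
p_y1_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y2_given_y1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}

# Joint distributions of (X, (Y1, Y2)) and of (X, Y1).
joint_full = {(x, (y1, y2)): p_x[x] * p_y1_given_x[x][y1] * p_y2_given_y1[y1][y2]
              for x in p_x for y1 in (0, 1) for y2 in (0, 1)}
joint_y1 = {(x, y1): p_x[x] * p_y1_given_x[x][y1]
            for x in p_x for y1 in (0, 1)}

# Y2 is irrelevant given Y1: the relevant information is unchanged.
assert abs(mutual_information(joint_full) - mutual_information(joint_y1)) < 1e-12
```

The equality holds exactly because P(y₂ | y₁, x) = P(y₂ | y₁), so the conditional term of the chain rule contributes zero.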
It is easy to find examples of pairs of general divergence measures that fulfill Equation (3) for some specific joint distribution on (X, Y), but it is natural to require that the distortion measure on M₊¹(B) should not depend on the joint distribution on (X, Y).

Theorem 2: A separable divergence measure that satisfies the sufficiency condition has the form

    D(P, Q) = Σ_{y∈B} p_y f(q_y / p_y).                   (4)

Remark 3: The function f in Equation (4) is not necessarily convex. If f is convex and f(1) = 0 the divergence is a Csiszár f-divergence.

Proof: The proof of this result was essentially given as part of the proof of Theorem 2 in [16]. Their theorem states that a separable divergence satisfying a data processing inequality must be a Csiszár f-divergence. Under some conditions this result can even be extended to Bregman divergences that are not separable.

IV. FINITENESS

Assume that the number of elements in A is n. Let P ∈ M₊¹(A) denote the marginal distribution of X. Consider distributions on A × M₊¹(B), i.e. joint distributions of (X, X̂). The joint distribution can be specified by the distribution Q of X̂ and the conditional distribution of X given X̂, given by a Markov kernel E: M₊¹(B) → M₊¹(A). The set of distributions on M₊¹(B) is an infinite simplex. Consider the convex set C of distributions Q such that Q and E determine a joint distribution of (X, X̂) that has P as the marginal distribution of X. The condition that the marginal distribution of X is the prescribed distribution P gives n linearly independent conditions. Therefore the extreme points of C are mixtures of at most n points. We have

    I(X; X̂) + β E[d(X, X̂)] = H(X) − H(X | X̂) + β E[E[d(X, X̂) | X̂]]
                            = H(X) + E[ β E[d(X, X̂) | X̂] − H(X | X̂) ].

Note that β E[d(X, X̂) | X̂ = x̂] − H(X | X̂ = x̂) is a function of x̂, i.e. a function M₊¹(B) → R, so the objective is linear in Q. Thus the minimum of I(X; X̂) + β E[d(X, X̂)] is attained at an extreme point of C, i.e. the minimum is attained for a distribution on Â with support on at most n points. In the information bottleneck literature one will normally find the result that the support has at most n + 1 points. Note that no particular properties of the distortion measure d have been used.

Let Â denote a set with n elements. From now on we shall identify a coupling between A and M₊¹(B) with a coupling between A and Â together with a map Â → M₊¹(B). We know that this is no restriction for calculating the rate distortion function.

V. THE BOTTLENECK EQUATIONS

Let a joint distribution on (X, X̂) ∈ A × Â be given. This gives a Markov kernel Â → M₊¹(A). It can be composed with Φ to give the map Ψ: Â → M₊¹(B) defined by

    Ψ(x̂) = Σ_{x∈A} P(x | x̂) Φ(x).

We shall use that information divergence satisfies the so-called compensation identity, first recognized in [17]. Let (s₁, s₂, …, s_k) denote a probability vector and let Q and P₁, P₂, …, P_k denote probability measures on the same set. Then

    Σ_j s_j D(P_j ‖ Q) = Σ_j s_j D(P_j ‖ P̄) + D(P̄ ‖ Q),   (5)

where P̄ = Σ_j s_j P_j. In particular

    Σ_j s_j D(P_j ‖ Q) ≥ Σ_j s_j D(P_j ‖ P̄).               (6)

For any reconstruction map Φ̂: Â → M₊¹(B) this leads to

    E[d(X, Φ̂(X̂)) | X̂ = x̂] = E[D(Φ(X) ‖ Φ̂(x̂)) | X̂ = x̂] ≥ E[D(Φ(X) ‖ Ψ(x̂)) | X̂ = x̂].

Therefore

    E[d(X, Φ̂(X̂))] ≥ E[D(Φ(X) ‖ Ψ(X̂))] = E[d(X, Ψ(X̂))],

so instead of minimizing over all possible maps Φ̂ and all possible joint distributions on A × Â we just have to minimize over joint distributions and put Φ̂ = Ψ. For a joint distribution of (X, X̂) we have that Ψ(x̂) is the distribution of Y given X̂ = x̂ if X̂ and Y are independent given X, i.e. X̂ → X → Y form a Markov chain. Now

    I(X; Y) = I(X̂; Y) + I(X; Y | X̂)
            = I(X̂; Y) + E[D(Φ(X) ‖ Ψ(X̂))]
            = I(X̂; Y) + E[d(X, Ψ(X̂))].

Thus E[d(X, Ψ(X̂))] = I(X; Y) − I(X̂; Y). We note that I(X; Y) is a constant and get the following theorem.

Theorem 4: When information divergence is used as distortion measure,

    inf [ I(X; X̂) + β E[d(X, X̂)] ] = β I(X; Y) + inf [ I(X; X̂) − β I(X̂; Y) ].

The solution satisfies the so-called bottleneck equations, i.e. the minimum is attained for reconstruction points satisfying Φ̂ = Ψ, and the joint distribution satisfies the Kuhn-Tucker conditions for rate distortion with these reconstruction points. The last term is essentially the one that shall be minimized in the information bottleneck, where one wants to minimize I(X; X̂) and at the same time maximize I(X̂; Y). We have seen that inequality (6) is essential in deriving Theorem 4.

Theorem 5: If d: M₊¹(A) × M₊¹(A) → R is a Csiszár f-divergence for some differentiable function f and

    Σ_j s_j d(P_j, Q) ≥ Σ_j s_j d(P_j, P̄)

for all mixtures P̄ = Σ_{j=1}^{k} s_j P_j and all Q, then d is proportional to information divergence.

Proof: Assume that d is the Csiszár f-divergence given by

    d(P, Q) = D_f(P ‖ Q) = Σ_{i∈B} Q(i) f(P(i)/Q(i)).

We shall write f as f(x) = x g(1/x), so that

    D_f(P ‖ Q) = Σ_{i∈B} P(i) g(Q(i)/P(i)).

We have to minimize

    Q ↦ Σ_j s_j Σ_{i∈B} P_j(i) g(Q(i)/P_j(i)).

The probability vector (Q(1), Q(2), …) satisfies Σ_i Q(i) = 1. We introduce a Lagrange multiplier λ, and

    Σ_j s_j g′(P̄(i)/P_j(i)) = λ

holds for all i because P̄ is a stationary point. Define ψ(x) = g′(1/x). Then

    Σ_j s_j ψ(P_j(i)/P̄(i)) = λ

holds for all i. In particular it should hold if there exists an i₀ such that P_j(i₀) does not depend on j. In that case

    Σ_j s_j ψ(P_j(i₀)/P̄(i₀)) = ψ(1),

so λ = ψ(1). Thus for any i ∈ B we have

    Σ_j s_j ψ(P_j(i)/P̄(i)) = ψ(1).

Since the ratios P_j(i)/P̄(i) average to 1 under (s_j) but are otherwise arbitrary, this implies that ψ(x) = ψ(1) + β(x − 1) for some constant β. Then

    g(x) = ∫ g′(x) dx = ∫ [ψ(1) + β(1/x − 1)] dx = (ψ(1) − β) x + β log x + c

for some constant c ∈ R. Then

    f(x) = x g(1/x) = (ψ(1) − β) − β x log x + c x.

The condition f(1) = 0 implies ψ(1) − β = −c. Hence

    f(x) = −β x log x + c(x − 1)

and

    D_f(P, Q) = −β Σ_i P(i) log(P(i)/Q(i)) + c Σ_i (P(i) − Q(i)) = −β D(P ‖ Q),

where convexity of f forces −β ≥ 0. Thus d is proportional to information divergence.

In [12] it was shown that a divergence measure satisfying (6) must be a Bregman divergence. This leads us to the following corollary.

Corollary 6: A divergence measure that is both an f-divergence and a Bregman divergence must be proportional to information divergence.

In [16, Thm. 4] the intersection of the set of Bregman divergences and f-divergences is characterized by an equation and a concavity condition, and information divergence is given as an example of an element in the intersection. Corollary 6 implies that there are essentially no other elements in the intersection.

VI. CONCLUSION

In this paper various distortion measures have been considered and their properties with respect to rate distortion theory have been studied. Some of the results are summarized in the following table.

    Property                        Class of divergences
    Additivity                      Inf. div. and reversed
    Sufficiency on input side       All
    Sufficiency on output side      Non-convex "f-div."
    Data processing inequality      Csiszár f-div.
    Finiteness                      All
    Bottleneck equations            Bregman div.

We see that if one wants to have all the properties fulfilled, the divergence is equal to or proportional to information divergence. A divergence that is proportional to information divergence is essentially information divergence where information divergence is measured by a different unit (for instance bits instead of nats). Thus it is normally desirable to use information divergence, and that leads to the information bottleneck method.

Acknowledgement 7: The authors want to thank Peter Grünwald and Tim van Erven for useful discussions and comments.

REFERENCES

[1] N. Tishby, F. Pereira, and W. Bialek, "The information bottleneck method," in Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 1999.
[2] N. Slonim and N. Tishby, "Document clustering using word clusters via the information bottleneck method," in SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, pp. 208-215, ACM Press, 2000.
[3] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, "Information bottleneck for Gaussian variables," in Advances in Neural Information Processing Systems 16 (S. Thrun, L. Saul, and B. Schölkopf, eds.), Cambridge, MA: MIT Press, 2004.
[4] C. R. Shalizi and J. P. Crutchfield, "Information bottlenecks, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction," tech. rep., Santa Fe Institute, 2001.
[5] N. Slonim and Y. Weiss, "Maximum likelihood and the information bottleneck," in Advances in Neural Information Processing Systems 15, Vancouver, Canada, 2002 (S. Becker, S. Thrun, and K. Obermayer, eds.), 2003.
[6] G. Elidan and N. Friedman, "The information bottleneck EM algorithm," in Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03), San Francisco, CA, Morgan Kaufmann, 2003.
[7] D. Gondek and T. Hofmann, "Conditional information bottleneck clustering," in Workshop on Clustering Large Data Sets, IEEE International Conference on Data Mining, Melbourne, USA, November 19-22, 2003.
[8] T. Cover and J. A. Thomas, Elements of Information Theory. Wiley, 1991.
[9] I. Csiszár and P. Shields, Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory, Now Publishers Inc., 2004.
[10] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Trans. Inform. Theory, vol. 52, Oct. 2006.
[11] L. M. Bregman, "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Comput. Math. and Math. Phys., vol. 7, pp. 200-217, 1967. Translated from Russian.
[12] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, 2005.
[13] J. Burbea and C. R. Rao, "On the convexity of some divergence measures based on entropy functions," IEEE Trans. Inform. Theory, vol. 28, pp. 489-495, 1982.
[14] P. D. Grünwald and A. P. Dawid, "Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory," Annals of Statistics, vol. 32, no. 4, 2004.
[15] M. C. Pardo and I. Vajda, "On asymptotic properties of information-theoretic divergences," IEEE Trans. Inform. Theory, vol. 49, no. 7, 2003.
[16] M. C. Pardo and I. Vajda, "About distances of discrete distributions satisfying the data processing theorem of information theory," IEEE Trans. Inform. Theory, vol. 43, 1997.
[17] F. Topsøe, "An information theoretical identity and a problem involving capacity," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 291-292, 1967.
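The compensation identity that Section V relies on is elementary to verify numerically. This illustrative sketch (names and distributions are ours) checks the identity and the resulting inequality (6), which says the mixture P̄ is the best common reference point:

```python
import math

def kl(p, q):
    """Information divergence D(P||Q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

s = [0.5, 0.3, 0.2]                                   # mixture weights
Ps = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
Q = [0.2, 0.3, 0.5]                                   # arbitrary reference measure

# Mixture P_bar = sum_j s_j P_j.
P_bar = [sum(sj * Pj[i] for sj, Pj in zip(s, Ps)) for i in range(3)]

# Compensation identity: sum s_j D(P_j||Q) = sum s_j D(P_j||P_bar) + D(P_bar||Q).
lhs = sum(sj * kl(Pj, Q) for sj, Pj in zip(s, Ps))
rhs = sum(sj * kl(Pj, P_bar) for sj, Pj in zip(s, Ps)) + kl(P_bar, Q)
assert abs(lhs - rhs) < 1e-12

# Consequence: replacing Q by the mixture never increases the mean divergence.
assert lhs >= sum(sj * kl(Pj, P_bar) for sj, Pj in zip(s, Ps))
```

The identity is exact, not asymptotic: the cross terms Σ_j s_j P_j(i) collapse to P̄(i), leaving precisely D(P̄ ‖ Q).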
Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION
More informationNumerical Simulation of One-Dimensional Wave Equation by Non-Polynomial Quintic Spline
IOSR Journal of Matematcs (IOSR-JM) e-issn: 78-578, p-issn: 319-765X. Volume 14, Issue 6 Ver. I (Nov - Dec 018), PP 6-30 www.osrournals.org Numercal Smulaton of One-Dmensonal Wave Equaton by Non-Polynomal
More informationMarginal Models for categorical data.
Margnal Models for categorcal data Applcaton to condtonal ndependence and graphcal models Wcher Bergsma 1 Marcel Croon 2 Jacques Hagenaars 2 Tamas Rudas 3 1 London School of Economcs and Poltcal Scence
More informationn α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0
MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector
More informationAdaptive Kernel Estimation of the Conditional Quantiles
Internatonal Journal of Statstcs and Probablty; Vol. 5, No. ; 206 ISSN 927-7032 E-ISSN 927-7040 Publsed by Canadan Center of Scence and Educaton Adaptve Kernel Estmaton of te Condtonal Quantles Rad B.
More informationA Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach
A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland
More informationLecture Notes on Linear Regression
Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationIntroduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:
CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and
More information338 A^VÇÚO 1n ò Lke n Mancn (211), we make te followng assumpton to control te beavour of small jumps. Assumpton 1.1 L s symmetrc α-stable, were α (,
A^VÇÚO 1n ò 1oÏ 215c8 Cnese Journal of Appled Probablty and Statstcs Vol.31 No.4 Aug. 215 Te Speed of Convergence of te Tresold Verson of Bpower Varaton for Semmartngales Xao Xaoyong Yn Hongwe (Department
More informationInformation Geometry of Gibbs Sampler
Informaton Geometry of Gbbs Sampler Kazuya Takabatake Neuroscence Research Insttute AIST Central 2, Umezono 1-1-1, Tsukuba JAPAN 305-8568 k.takabatake@ast.go.jp Abstract: - Ths paper shows some nformaton
More informationCENTROID (AĞIRLIK MERKEZİ )
CENTOD (ĞLK MEKEZİ ) centrod s a geometrcal concept arsng from parallel forces. Tus, onl parallel forces possess a centrod. Centrod s tougt of as te pont were te wole wegt of a pscal od or sstem of partcles
More informationShuai Dong. Isaac Newton. Gottfried Leibniz
Computatonal pyscs Sua Dong Isaac Newton Gottred Lebnz Numercal calculus poston dervatve ntegral v velocty dervatve ntegral a acceleraton Numercal calculus Numercal derentaton Numercal ntegraton Roots
More informationPHYS 705: Classical Mechanics. Calculus of Variations II
1 PHYS 705: Classcal Mechancs Calculus of Varatons II 2 Calculus of Varatons: Generalzaton (no constrant yet) Suppose now that F depends on several dependent varables : We need to fnd such that has a statonary
More informationCS : Algorithms and Uncertainty Lecture 14 Date: October 17, 2016
CS 294-128: Algorthms and Uncertanty Lecture 14 Date: October 17, 2016 Instructor: Nkhl Bansal Scrbe: Antares Chen 1 Introducton In ths lecture, we revew results regardng follow the regularzed leader (FTRL.
More informationxp(x µ) = 0 p(x = 0 µ) + 1 p(x = 1 µ) = µ
CSE 455/555 Sprng 2013 Homework 7: Parametrc Technques Jason J. Corso Computer Scence and Engneerng SUY at Buffalo jcorso@buffalo.edu Solutons by Yngbo Zhou Ths assgnment does not need to be submtted and
More informationEGR 544 Communication Theory
EGR 544 Communcaton Theory. Informaton Sources Z. Alyazcoglu Electrcal and Computer Engneerng Department Cal Poly Pomona Introducton Informaton Source x n Informaton sources Analog sources Dscrete sources
More information8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS
SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 493 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces you have studed thus far n the text are real vector spaces because the scalars
More informationThe Karush-Kuhn-Tucker. Nuno Vasconcelos ECE Department, UCSD
e Karus-Kun-ucker condtons and dualt Nuno Vasconcelos ECE Department, UCSD Optmzaton goal: nd mamum or mnmum o a uncton Denton: gven unctons, g, 1,...,k and, 1,...m dened on some doman Ω R n mn w, w Ω
More informationPerron Vectors of an Irreducible Nonnegative Interval Matrix
Perron Vectors of an Irreducble Nonnegatve Interval Matrx Jr Rohn August 4 2005 Abstract As s well known an rreducble nonnegatve matrx possesses a unquely determned Perron vector. As the man result of
More informationMultigrid Methods and Applications in CFD
Multgrd Metods and Applcatons n CFD Mcael Wurst 0 May 009 Contents Introducton Typcal desgn of CFD solvers 3 Basc metods and ter propertes for solvng lnear systems of equatons 4 Geometrc Multgrd 3 5 Algebrac
More informationProbabilistic & Unsupervised Learning
Probablstc & Unsupervsed Learnng Convex Algorthms n Approxmate Inference Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computatonal Neuroscence Unt Unversty College London Term 1, Autumn 2008 Convexty A convex
More informationLecture 3: Probability Distributions
Lecture 3: Probablty Dstrbutons Random Varables Let us begn by defnng a sample space as a set of outcomes from an experment. We denote ths by S. A random varable s a functon whch maps outcomes nto the
More informationLecture 3: Shannon s Theorem
CSE 533: Error-Correctng Codes (Autumn 006 Lecture 3: Shannon s Theorem October 9, 006 Lecturer: Venkatesan Guruswam Scrbe: Wdad Machmouch 1 Communcaton Model The communcaton model we are usng conssts
More informationCENTROID (AĞIRLIK MERKEZİ )
CENTOD (ĞLK MEKEZİ ) centrod s a geometrcal concept arsng from parallel forces. Tus, onl parallel forces possess a centrod. Centrod s tougt of as te pont were te wole wegt of a pscal od or sstem of partcles
More informationECE559VV Project Report
ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.
More informationCHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE
CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng
More informationCase A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.
THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty
More informationLOGIT ANALYSIS. A.K. VASISHT Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi
LOGIT ANALYSIS A.K. VASISHT Indan Agrcultural Statstcs Research Insttute, Lbrary Avenue, New Delh-0 02 amtvassht@asr.res.n. Introducton In dummy regresson varable models, t s assumed mplctly that the dependent
More informationHidden Markov Models
Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,
More informationSnce h( q^; q) = hq ~ and h( p^ ; p) = hp, one can wrte ~ h hq hp = hq ~hp ~ (7) the uncertanty relaton for an arbtrary state. The states that mnmze t
8.5: Many-body phenomena n condensed matter and atomc physcs Last moded: September, 003 Lecture. Squeezed States In ths lecture we shall contnue the dscusson of coherent states, focusng on ther propertes
More informationLecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.
prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove
More informationMath 324 Advanced Financial Mathematics Spring 2008 Final Exam Solutions May 2, 2008
Mat 324 Advanced Fnancal Matematcs Sprng 28 Fnal Exam Solutons May 2, 28 Ts s an open book take-ome exam. You may work wt textbooks and notes but do not consult any oter person. Sow all of your work and
More informationWhich Separator? Spring 1
Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal
More informationTHE ARIMOTO-BLAHUT ALGORITHM FOR COMPUTATION OF CHANNEL CAPACITY. William A. Pearlman. References: S. Arimoto - IEEE Trans. Inform. Thy., Jan.
THE ARIMOTO-BLAHUT ALGORITHM FOR COMPUTATION OF CHANNEL CAPACITY Wllam A. Pearlman 2002 References: S. Armoto - IEEE Trans. Inform. Thy., Jan. 1972 R. Blahut - IEEE Trans. Inform. Thy., July 1972 Recall
More informationLinear discriminants. Nuno Vasconcelos ECE Department, UCSD
Lnear dscrmnants Nuno Vasconcelos ECE Department UCSD Classfcaton a classfcaton problem as to tpes of varables e.g. X - vector of observatons features n te orld Y - state class of te orld X R 2 fever blood
More informationSolutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.
Solutons HW #2 Dual of general LP. Fnd the dual functon of the LP mnmze subject to c T x Gx h Ax = b. Gve the dual problem, and make the mplct equalty constrants explct. Soluton. 1. The Lagrangan s L(x,
More informationOn the Correlation between Boolean Functions of Sequences of Random Variables
On the Correlaton between Boolean Functons of Sequences of Random Varables Farhad Shran Chaharsoogh Electrcal Engneerng and Computer Scence Unversty of Mchgan Ann Arbor, Mchgan, 48105 Emal: fshran@umch.edu
More informationModelli Clamfim Equazione del Calore Lezione ottobre 2014
CLAMFIM Bologna Modell 1 @ Clamfm Equazone del Calore Lezone 17 15 ottobre 2014 professor Danele Rtell danele.rtell@unbo.t 1/24? Convoluton The convoluton of two functons g(t) and f(t) s the functon (g
More informationOn mutual information estimation for mixed-pair random variables
On mutual nformaton estmaton for mxed-par random varables November 3, 218 Aleksandr Beknazaryan, Xn Dang and Haln Sang 1 Department of Mathematcs, The Unversty of Msssspp, Unversty, MS 38677, USA. E-mal:
More informationLecture 4: September 12
36-755: Advanced Statstcal Theory Fall 016 Lecture 4: September 1 Lecturer: Alessandro Rnaldo Scrbe: Xao Hu Ta Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer: These notes have not been
More informationCanonical transformations
Canoncal transformatons November 23, 2014 Recall that we have defned a symplectc transformaton to be any lnear transformaton M A B leavng the symplectc form nvarant, Ω AB M A CM B DΩ CD Coordnate transformatons,
More informationAssortment Optimization under MNL
Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.
More informationNUMERICAL DIFFERENTIATION
NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the
More informationDuality and Auxiliary Functions for Bregman Distances
Dualty and Auxlary Functons for Bregman Dstances Stephen Della Petra, Vncent Della Petra, and John Lafferty October 8, 2001 CMU-CS-01-109 School of Computer Scence Carnege Mellon Unversty Pttsburgh, PA
More information