Information Geometry of Gibbs Sampler
Kazuya Takabatake
Neuroscience Research Institute
AIST Central 2, Umezono 1-1-1, Tsukuba, JAPAN

Abstract: - This paper shows some information geometrical properties of the Gibbs sampler, which is one of the Markov chain Monte Carlo (MCMC) methods. The Gibbs sampler belongs to the class of single-component-update MCMC, in which two or more components are never updated simultaneously. When a component is updated in a single-component-update MCMC, the chain's distribution moves along an m-flat manifold. In the case of the Gibbs sampler, the distribution moves to the point on the m-flat manifold which minimizes the KL divergence to the target distribution. From this viewpoint, the Gibbs sampler is interpreted as a greedy algorithm which minimizes the KL divergence in each update.

Key-Words: - Gibbs sampler, Markov chain Monte Carlo, information geometry, KL-divergence, greedy algorithms, convergence

1 Introduction

The most straightforward way to generate random numbers according to a given target distribution is to use the table of the target distribution, which consists of the probabilities of all values the random variable takes. However, the size of this table is proportional to the exponential of the dimension of the random variable, so this straightforward approach is not practical for generating high-dimensional random variables. Markov chain Monte Carlo (MCMC) [7] is a class of random number generators which use Markov chains whose distributions converge to the target distribution.

Let {X^(t)} (t = 0, 1, ...) be a Markov chain, V be the range of values X^(t) takes¹, p^(t) be the distribution of X^(t), and D_V be the set of distributions on V. Any distribution q ∈ D_V can be represented in a vector form²

    q = (q(x)) (x ∈ V).    (1)

Let W^(t) be the transition³ from X^(t-1) to X^(t) (from p^(t-1) to p^(t)). W^(t) can be represented in a matrix form

¹ For the sake of simplicity, all random variables in this paper take finite discrete values.
² In this paper, symbols for distributions or transitions also denote their vector or matrix forms.
³ Any transition is a linear operator D_V → D_V.
    W^(t) = (W^(t)_xy) (x, y ∈ V),    (2)

where

    W^(t)_xy := Pr(X^(t) = y | X^(t-1) = x).    (3)

Then

    p^(t) = p^(t-1) W^(t)    (4)

and therefore

    p^(t) = p^(0) W^(1) ⋯ W^(t)    (5)

holds. The mission of MCMC is to generate X^(t) for the given target variable X^(∞) and its distribution π. In designing an MCMC, the sequence {W^(t)} (t = 1, 2, ...) is designed so that p^(t) converges to π when t → ∞. Under some conditions, we can make such a {W^(t)} without knowing the complete table of the target distribution π.

In this paper, we first review the Metropolis-Hastings algorithm [7] and its special case, the Gibbs sampler [3, 7]. Then we show some information geometrical properties of single-component-update MCMC and of its special case, the Gibbs sampler.
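As a concrete illustration of eqs.(1)-(5), a distribution can be stored as a row vector and a transition as a row-stochastic matrix; iterating eq.(4) then realizes eq.(5). The following is a minimal sketch in plain Python; the two-state chain and its matrix W are made-up examples, not taken from the paper:

```python
# A distribution on V = {0, 1} as a row vector, a transition as a 2x2 matrix.
def apply_transition(p, W):
    """Row-vector / matrix product: p^(t) = p^(t-1) W^(t)  (eq. 4)."""
    n = len(p)
    return [sum(p[x] * W[x][y] for x in range(n)) for y in range(n)]

p0 = [0.9, 0.1]                      # initial distribution p^(0)
W = [[0.5, 0.5],                     # rows sum to 1: W[x][y] = Pr(next = y | now = x)
     [0.2, 0.8]]

# Iterating eq. (4) gives eq. (5): p^(t) = p^(0) W^(1) ... W^(t).
p = p0
for _ in range(50):
    p = apply_transition(p, W)
print(p)  # approaches the stationary distribution of W
```

With a time-homogeneous W as here, eq.(5) is simply the matrix power p^(0) W^t; the general MCMC case allows a different W^(t) at each step.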
2 Metropolis-Hastings Algorithm and Gibbs Sampler

2.1 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm [7] is the following algorithm.

step0 Prepare an arbitrary sequence of conditional distributions {q^(t)(X′ | X)} (t = 1, 2, ...), called proposal distributions, where X′ is a candidate variable for the next time step. Set an arbitrary value to x. Set t = 0.
step1 Generate a random number x′ according to q^(t)(x′ | x).
step2 Set x = x′ with probability

    α^(t)(x, x′) = min(1, [π(x′) q^(t)(x | x′)] / [π(x) q^(t)(x′ | x)]),    (6)

which is called the acceptance probability; otherwise keep x as it is.
step3 Set t = t + 1 and go to step1.

This algorithm simulates the Markov chain whose transition matrix is

    W^(t)_xy = q^(t)(y | x) α^(t)(x, y)                      (x ≠ y)
             = 1 − Σ_{y′ ≠ x} q^(t)(y′ | x) α^(t)(x, y′)     (x = y).    (7)

This transition matrix satisfies the following so-called detailed balance equation [7] for all x, y, t:

    π_x W^(t)_xy = π_y W^(t)_yx.    (8)

Summing eq.(8) over x, we get

    π W^(t) = π,    (9)

i.e. the transition by W^(t) does not move π. It is known that if the Markov chain is weakly ergodic [2, 5], then the distribution p^(t) converges to π when t → ∞. To perform this algorithm, we do not need to know the complete table of the target distribution π but only the probability ratio π(x′)/π(x) in eq.(6). This is the major merit of the Metropolis-Hastings algorithm.

2.2 Single-component Metropolis-Hastings

Single-component Metropolis-Hastings [7] is a special case of the Metropolis-Hastings algorithm. It is used in cases where X is multi-dimensional, i.e. X = (X_0, ..., X_{N-1}). In single-component Metropolis-Hastings, only one component is updated in each transition, so the candidate x′ differs from x in only one component. Assume it is X_i, and let X_ī be the joint variable of the other components. Then for all x′_ī ≠ x_ī,

    q^(t)(x′_i, x′_ī | x_i, x_ī) = 0.    (10)

There are several ways to select the component updated at time t [7]. Let i(t) be the index of the component updated at time t. In this paper, we adopt the sequential update:

    i(t) = t mod N.    (11)

2.3 Gibbs sampler

The Gibbs sampler [3, 7] is a special case of single-component Metropolis-Hastings.
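The four steps of the Metropolis-Hastings algorithm above can be sketched for a small discrete target. This is a hypothetical example, not code from the paper: the three-state target π and the uniform proposal are made up, and since the proposal is symmetric, q^(t) cancels in eq.(6):

```python
import random

random.seed(0)

# Hypothetical target on V = {0, 1, 2}; only ratios pi(x')/pi(x) enter eq. (6).
pi = [0.2, 0.3, 0.5]

def proposal(x):
    """Uniform proposal q(x'|x): symmetric, so q cancels in eq. (6)."""
    return random.randrange(3)

def mh_step(x):
    x_new = proposal(x)
    alpha = min(1.0, pi[x_new] / pi[x])      # acceptance probability, eq. (6)
    return x_new if random.random() < alpha else x

x, counts = 0, [0, 0, 0]
for t in range(200_000):
    x = mh_step(x)
    counts[x] += 1
freq = [c / 200_000 for c in counts]
print(freq)  # empirical frequencies; should lie near pi = [0.2, 0.3, 0.5]
```

Note that the sketch never normalizes π inside the loop: as the text says, only the ratio π(x′)/π(x) is needed, which is what makes the algorithm usable when the full table of π is unavailable.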
Its proposal distribution is⁴

    q^(t)(x′_i, x′_ī | x_i, x_ī) = π(x′_i | x_ī)    (x′_ī = x_ī)
                                 = 0                (x′_ī ≠ x_ī),    (12)

therefore the acceptance probability is

    α^(t)((x_i, x_ī), (x′_i, x_ī))
      = min(1, [π(x′_i, x_ī) π(x_i | x_ī)] / [π(x_i, x_ī) π(x′_i | x_ī)])
      = min(1, [π(x′_i, x_ī) π(x_i, x_ī) π(x_ī)] / [π(x_i, x_ī) π(x′_i, x_ī) π(x_ī)])
      = 1.    (13)

It shows that the candidate x′ is always accepted in step2 of section 2.1. Substituting eqs.(12) and (13) into eq.(7), we get

    W^(t)_(x_i, x_ī)(y_i, y_ī) = π(y_i | x_ī)    (x_ī = y_ī)
                                = 0              (x_ī ≠ y_ī),    (14)

⁴ In this paper, readers are expected to interpret symbols for distributions with x_i or x_ī as the appropriate conditional or marginal distributions. For example, π(x_i | x_ī) = Pr(X^(∞)_i = x_i | X^(∞)_ī = x_ī).
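Eq.(13) can be checked numerically: for any joint target, the Metropolis-Hastings ratio built from the proposal (12) is identically 1. Below, the 2 x 3 joint table π is a randomly generated stand-in, not data from the paper:

```python
import random

random.seed(1)

# Hypothetical joint target pi(x0, x1) on {0,1} x {0,1,2}.
raw = [[random.random() for _ in range(3)] for _ in range(2)]
Z = sum(sum(row) for row in raw)
pi = [[v / Z for v in row] for row in raw]

def cond_x0(x0, x1):
    """Full conditional pi(x0 | x1) = pi(x0, x1) / pi(x1)."""
    return pi[x0][x1] / (pi[0][x1] + pi[1][x1])

# Acceptance ratio inside eq. (13) when updating component 0 with proposal (12):
# pi(x0', x1) pi(x0 | x1) / (pi(x0, x1) pi(x0' | x1)).
max_dev = 0.0
for x0 in range(2):
    for x0_new in range(2):
        for x1 in range(3):
            ratio = (pi[x0_new][x1] * cond_x0(x0, x1)) / (pi[x0][x1] * cond_x0(x0_new, x1))
            max_dev = max(max_dev, abs(ratio - 1.0))
print(max_dev)  # 0 up to rounding: the candidate is always accepted
```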
and substituting this into eq.(4), we get

    p^(t)(y) = p^(t)(y_i, y_ī)
             = Σ_{x_i, x_ī} p^(t-1)(x_i, x_ī) W^(t)_(x_i, x_ī)(y_i, y_ī)
             = Σ_{x_i} p^(t-1)(x_i, y_ī) π(y_i | y_ī)
             = p^(t-1)(y_ī) π(y_i | y_ī)
             = p^(t)(y_ī) π(y_i | y_ī).    (15)

The information about the target distribution π required to perform the Gibbs sampler is the full conditional distribution [7] π(x_i | x_ī) in eq.(12).

3 Information Geometry of MCMC

As we described in the introduction, we can treat a distribution as a point in a vector space. {p^(t)} is a series of points which converges to π. In this section, we show some information geometrical properties of MCMC methods which update only one component in each transition, and some special properties the Gibbs sampler has.

3.1 Single-component-update MCMC

When the i-th component is updated, the other components keep their values; therefore the marginal distribution of X_ī is inherited:

    p^(t)(x_ī) = p^(t-1)(x_ī).    (16)

Let M(p, i) be the manifold of distributions defined by

    M(p, i) := {q ∈ D_V | ∀x_ī, q(x_ī) = p(x_ī)}.    (17)

This manifold is m-flat (see appendix). The important point is that updating the i-th component can move p only along the manifold M(p, i).

Fig. 1: This figure illustrates the movement of p^(t) in the distribution space D_V in the case of a single-component-update MCMC. Dots represent distributions and lines represent m-flat manifolds.

3.2 Gibbs sampler

For quick convergence of p^(t) → π, it is a natural idea to choose the point closest to π on M(p^(t-1), i). Some distance measure is required to determine the meaning of "closest"; we adopt the KL-divergence KL(p ‖ π) as the distance measure.

In any MCMC, if we choose W^(t) so that it satisfies eq.(9), we get

    KL(p^(t) ‖ π) = KL(p^(t-1) W^(t) ‖ π W^(t)) ≤ KL(p^(t-1) ‖ π)    (18)

from the data processing inequality (see appendix). It shows that KL(p^(t) ‖ π) decreases⁵ as time goes on. Now we consider the following minimization problem:

    min_{p^(t) ∈ M(p^(t-1), i)} KL(p^(t) ‖ π).    (19)

The minimizer p^(t) is the e-projection (see appendix) of π onto the m-flat manifold M(p^(t-1), i).
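The monotone decrease (18) can be observed directly for a single Gibbs update. The sketch below uses a randomly generated target and initial distribution on a 2 x 2 state space (hypothetical data, with the joint table flattened as x = 2*x0 + x1); the update implements eq.(15):

```python
import math
import random

random.seed(2)

def kl(p, q):
    """KL(p || q) over flattened probability tables."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def normalize(raw):
    Z = sum(raw)
    return [v / Z for v in raw]

# Hypothetical target pi and current distribution p on {0,1} x {0,1}.
pi = normalize([random.random() for _ in range(4)])
p  = normalize([random.random() for _ in range(4)])

def gibbs_update_x0(p, pi):
    """One Gibbs update of component x0, eq. (15): p'(x0, x1) = p(x1) pi(x0 | x1)."""
    p_x1  = [p[0] + p[2], p[1] + p[3]]          # marginal of x1 (kept fixed, eq. 16)
    pi_x1 = [pi[0] + pi[2], pi[1] + pi[3]]
    return [p_x1[x1] * pi[2 * x0 + x1] / pi_x1[x1]
            for x0 in range(2) for x1 in range(2)]

before = kl(p, pi)
p_new = gibbs_update_x0(p, pi)
after = kl(p_new, pi)
print(before, after)  # KL to the target never increases (eq. 18)
```

The x1-marginal of p_new equals that of p, so p_new indeed stays on the manifold M(p, 0) of eq.(17).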
Using the chain rule of KL-divergence (see appendix), we get

    KL(p^(t) ‖ π) = KL(X^(t) ‖ X^(∞))
                  = KL(X^(t)_ī ‖ X^(∞)_ī) + KL(X^(t)_i ‖ X^(∞)_i | X^(t)_ī)
                  = KL(X^(t-1)_ī ‖ X^(∞)_ī) + KL(X^(t)_i ‖ X^(∞)_i | X^(t-1)_ī).    (20)

The minimization of eq.(19) is equivalent to

    min_{p^(t) ∈ M(p^(t-1), i)} KL(X^(t)_i ‖ X^(∞)_i | X^(t-1)_ī)    (21)

⁵ Here "decreases" means "at least never increases".
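The claim that the Gibbs update solves the minimization (19)/(21) can be spot-checked numerically: among distributions on M(p, i) (same x1-marginal, arbitrary conditional for x0), the one carrying π's conditional attains the smallest KL to π. The 2 x 2 tables below are randomly generated stand-ins (flattened as x = 2*x0 + x1), not data from the paper:

```python
import math
import random

random.seed(3)

def kl(p, q):
    """KL(p || q) over flattened probability tables."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def normalize(raw):
    Z = sum(raw)
    return [v / Z for v in raw]

pi = normalize([random.random() for _ in range(4)])   # hypothetical target
p  = normalize([random.random() for _ in range(4)])   # hypothetical current p
p_x1  = [p[0] + p[2], p[1] + p[3]]                    # x1-marginal, fixed on M(p, 0)
pi_x1 = [pi[0] + pi[2], pi[1] + pi[3]]

# Candidate minimizer: keep p's x1-marginal, impose pi's conditional for x0.
proj = [p_x1[x1] * pi[2 * x0 + x1] / pi_x1[x1] for x0 in range(2) for x1 in range(2)]
best = kl(proj, pi)

# Other points of M(p, 0): same x1-marginal, random conditionals q(x0 = 0 | x1).
others = []
for _ in range(200):
    c0, c1 = random.random(), random.random()
    q = [p_x1[0] * c0, p_x1[1] * c1, p_x1[0] * (1 - c0), p_x1[1] * (1 - c1)]
    others.append(kl(q, pi))
print(best, min(others))  # no sampled point of M(p, 0) beats the projection
```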
because the transition by W^(t) does not move X^(t)_ī. From eq.(34), it is clear that the minimization is achieved when and only when

    ∀x_ī, KL(X^(t)_i ‖ X^(∞)_i | x_ī) = 0.    (22)

It is equivalent to

    ∀x_i, x_ī, p^(t)(x_i | x_ī) = π(x_i | x_ī).    (23)

Multiplying by p^(t)(x_ī), we get the same equation as eq.(15). It implies that the Gibbs sampler's transition matrix W^(t) moves p^(t-1) to the p^(t) which is the e-projection of π onto M(p^(t-1), i). The important point is that we need no information about p^(t-1) to design the transition matrix W^(t). In other words, by using the Gibbs sampler's transition matrix, we can move any distribution p towards the point closest to π (its e-projection) without knowing where p is.

Fig. 2: This figure illustrates the movement of p^(t) in the distribution space D_V in the case of the Gibbs sampler. Dots represent distributions, lines represent m-flat manifolds and dashed lines represent e-geodesics.

From the property described above, we can interpret the Gibbs sampler as the following algorithm.

step0 Set an arbitrary distribution as the initial p.
step1 Move p to the e-projection of π onto M(p, i).
step2 Go to step1.

This algorithm is a kind of greedy algorithm, because p is moved to the minimizer of the cost KL(p ‖ π) in each step1. In other words, p is not moved to the point that is optimal over two or more movements, but to the one optimal for each single movement. Imagine the simplified example shown in Fig. 3. There are a point p and a target point π on a two-dimensional plane. We can move p towards the east or the west in the first movement, and towards the north-east or the south-west in the second movement. In this example, the lines from west to east and the lines from south-west to north-east correspond to the manifolds M(p, i). If we move p to the point closest to π in the first movement, we cannot make p reach π in the second movement. It is clear that the path drawn in dashed lines is the optimal path to approach π.

Fig. 3: This figure illustrates the movement of p.
The path drawn in solid lines represents the movement of the greedy algorithm, and the path drawn in dashed lines is the optimal movement to approach π.

Another interesting property of the Gibbs sampler is

    KL(p^(t-1) ‖ π) = KL(p^(t-1) ‖ p^(t)) + KL(p^(t) ‖ π),    (24)

which is derived from the Pythagorean theorem (see appendix). It means that, in each transition, the divergence by which p^(t) moves is equal to the divergence by which p^(t) approaches π. Let TD(n) be the travelling divergence defined by

    TD(n) := Σ_{t=1}^{n} KL(p^(t-1) ‖ p^(t)).    (25)

Then we get

    TD(n) + KL(p^(n) ‖ π) = KL(p^(0) ‖ π).    (26)

It implies that TD(n) has the upper bound KL(p^(0) ‖ π), and that if p^(t) converges to π when t → ∞, TD(t) converges to KL(p^(0) ‖ π).
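Eqs.(24)-(26) can be verified numerically along a sequential-update Gibbs run. The target and the starting distribution below are randomly generated stand-ins on a 2 x 2 state space (flattened as x = 2*x0 + x1), not data from the paper:

```python
import math
import random

random.seed(4)

def kl(p, q):
    """KL(p || q) over flattened probability tables."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def normalize(raw):
    Z = sum(raw)
    return [v / Z for v in raw]

pi = normalize([random.random() for _ in range(4)])   # hypothetical target
p  = normalize([random.random() for _ in range(4)])   # hypothetical p^(0)

def gibbs_update(p, i):
    """E-projection of pi onto M(p, i): eq. (15) for component i."""
    if i == 0:   # keep the x1-marginal, impose pi(x0 | x1)
        pm, tm = [p[0] + p[2], p[1] + p[3]], [pi[0] + pi[2], pi[1] + pi[3]]
        return [pm[x1] * pi[2 * x0 + x1] / tm[x1] for x0 in range(2) for x1 in range(2)]
    pm, tm = [p[0] + p[1], p[2] + p[3]], [pi[0] + pi[1], pi[2] + pi[3]]
    return [pm[x0] * pi[2 * x0 + x1] / tm[x0] for x0 in range(2) for x1 in range(2)]

start, td, pyth_dev = kl(p, pi), 0.0, 0.0
for t in range(1, 41):
    p_next = gibbs_update(p, t % 2)       # sequential update, eq. (11)
    # Pythagorean relation, eq. (24): KL(p||pi) = KL(p||p_next) + KL(p_next||pi).
    pyth_dev = max(pyth_dev, abs(kl(p, pi) - (kl(p, p_next) + kl(p_next, pi))))
    td += kl(p, p_next)                   # travelling divergence, eq. (25)
    p = p_next
print(td + kl(p, pi) - start, pyth_dev)   # both 0 up to rounding (eqs. 24, 26)
```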
4 Conclusion

Markov chain Monte Carlo (MCMC) is a class of random number generators which use a Markov chain whose distribution p^(t) converges to the target distribution π when t → ∞. In single-component-update MCMC, updating the i-th component of the multi-dimensional variable X moves X's distribution p along the m-flat manifold M(p, i). The Gibbs sampler is one of the single-component-update MCMC methods. From the viewpoint of information geometry, the Gibbs sampler is interpreted as the following algorithm.

step0 Set an arbitrary distribution as the initial p.
step1 Move p to the e-projection of π onto M(p, i).
step2 Go to step1.

This algorithm is a kind of greedy algorithm, because p is moved to the minimizer of the cost KL(p ‖ π) in each step1. In the Gibbs sampler, p^(t) does not travel an infinite divergence in the distribution space: the divergence p^(t) travels is less than or equal to KL(p^(0) ‖ π). If p^(t) converges to π when t → ∞, the travelling divergence converges to KL(p^(0) ‖ π).

References:
[1] Csiszár, I., Körner, J., Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, 1981.
[2] Seneta, E., Non-negative Matrices and Markov Chains, Second Edition, Springer-Verlag, 1981.
[3] Geman, S., Geman, D., Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-6, No. 6, 1984.
[4] Amari, S., Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, Vol. 28, Springer-Verlag, 1985.
[5] van Laarhoven, P. J. M., Aarts, E. H. L., Simulated Annealing: Theory and Applications, Kluwer Academic Publishers, 1987.
[6] Amari, S., Information Geometry of the EM and em Algorithms for Neural Networks, Neural Networks, Vol. 8, No. 9, 1995.
[7] Gilks, W. R., Richardson, S., Spiegelhalter, D. J., Introducing Markov chain Monte Carlo, In: Gilks, W. R., Richardson, S., Spiegelhalter, D. J. (eds.), Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996, pp. 1-19.
[8] Gilks, W. R., Full conditional distributions, In: Gilks, W. R., Richardson, S., Spiegelhalter, D.
J. (eds.), Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996, pp. 1-19.

Appendix: KL-divergence

Let R(X) be the range of values X takes. For two random variables X, Y which have the same range, the KL-divergence from X to Y, or from their distributions p_X to p_Y, is defined by

    KL(X ‖ Y) = KL(p_X ‖ p_Y) := Σ_{x ∈ R(X)} p_X(x) log [p_X(x) / p_Y(x)].    (27)

KL-divergence is always non-negative, and

    KL(p_X ‖ p_Y) = 0 ⟺ p_X = p_Y.    (28)

Let q be the distribution of a stochastic source, p be the distribution of the data, and N be the number of samples in the data. The log-likelihood that the data comes from the stochastic source is

    L(p ‖ q) = Σ_x N p(x) log q(x),    (29)

and it takes its maximum value −N H(p) when and only when p = q, where H(p) is Shannon's entropy:

    H(p) = −Σ_x p(x) log p(x).    (30)

The KL-divergence KL(p ‖ q) is

    KL(p ‖ q) = −(1/N) L(p ‖ q) − H(p).    (31)

Therefore, the meaning of KL(p ‖ q) is a biased (negated, per-sample) log-likelihood that data whose distribution is p comes out of the distribution q. The bias is taken so that KL(p ‖ q) = 0 when p = q.
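The relation (31) between KL-divergence, log-likelihood and entropy is easy to check numerically. The two distributions and the sample count N below are made-up values:

```python
import math

def kl(p, q):
    """Eq. (27): KL(p || q) = sum_x p(x) log(p(x)/q(x))."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def entropy(p):
    """Eq. (30): Shannon entropy H(p)."""
    return -sum(a * math.log(a) for a in p if a > 0)

def log_likelihood(p, q, N):
    """Eq. (29): log-likelihood of N samples with empirical distribution p under q."""
    return sum(N * a * math.log(b) for a, b in zip(p, q) if a > 0)

# Made-up example distributions and sample count.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
N = 1000

# Eq. (31): KL(p || q) = -(1/N) L(p || q) - H(p).
lhs = kl(p, q)
rhs = -log_likelihood(p, q, N) / N - entropy(p)
print(lhs, rhs)  # the two sides agree
```

The same functions also exhibit the maximum-likelihood fact stated above: L(p ‖ q) is largest at q = p, where it equals −N H(p).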
For any distribution vectors p, q and any transition matrix W,

    KL(pW ‖ qW) ≤ KL(p ‖ q)    (32)

holds. This inequality is called the data processing inequality.

Let X, Z be random variables which have the same range, and Y, W be random variables which have the same range. The following equation holds for the joint variables XY and ZW:

    KL(XY ‖ ZW) = KL(X ‖ Z) + KL(Y ‖ W | X),    (33)

where

    KL(Y ‖ W | X) := Σ_{x ∈ R(X)} Pr(X = x) KL(Y ‖ W | x),    (34)

    KL(Y ‖ W | x) := Σ_{y ∈ R(Y)} Pr(Y = y | X = x) log [Pr(Y = y | X = x) / Pr(W = y | Z = x)].    (35)

Eq.(33) is called the chain rule of KL-divergence, and the left side of eq.(34) is called the conditional KL-divergence.

Let q be a distribution and M be a manifold of distributions. Then

    arg min_{p ∈ M} KL(p ‖ q)    (36)

is called the e-projection of q onto M [6]. If M has the following property:

    ∀p, q ∈ M, ∀λ ∈ [0, 1], λp + (1 − λ)q ∈ M,    (37)

we call M m-flat. It is known that if M is m-flat, the e-projection of q onto M is unique for any distribution q.

The following curve is called the e-geodesic from p_0 to p_1 [6]:

    {q | log q(x) = (1 − λ) log p_0(x) + λ log p_1(x) − log φ(λ), 0 ≤ λ ≤ 1},    (38)

where φ(λ) is the term for the normalizing condition Σ_x q(x) = 1:

    φ(λ) = Σ_x p_0(x)^{1−λ} p_1(x)^λ.    (39)

Let q be a distribution, M be a manifold of distributions, and p be the e-projection of q onto M. It is known that the e-geodesic from q to p and M are orthogonal at p, and for any distribution r ∈ M, the following equation holds:

    KL(r ‖ q) = KL(r ‖ p) + KL(p ‖ q).    (40)

This is called the Pythagorean theorem [1].

Fig. 4: This figure illustrates the Pythagorean theorem: KL(r ‖ q) = KL(r ‖ p) + KL(p ‖ q). M is an m-flat manifold of distributions and p is the e-projection of q onto M. The dashed line is the e-geodesic from q to p.
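The e-geodesic (38)-(39) is a normalized geometric interpolation between two distributions, which can be sketched directly. The endpoint distributions below are made-up examples:

```python
def e_geodesic_point(p0, p1, lam):
    """Eqs. (38)-(39): log q = (1 - lam) log p0 + lam log p1 - log phi(lam)."""
    phi = sum(a ** (1 - lam) * b ** lam for a, b in zip(p0, p1))   # eq. (39)
    return [a ** (1 - lam) * b ** lam / phi for a, b in zip(p0, p1)]

# Made-up endpoint distributions.
p0 = [0.7, 0.2, 0.1]
p1 = [0.1, 0.3, 0.6]

mid = e_geodesic_point(p0, p1, 0.5)
print(mid)  # normalized geometric interpolation of p0 and p1
```

In log-space, eq.(38) is literally a straight line between log p_0 and log p_1 (up to the normalization term), which is why such curves are the geodesics of the exponential ("e") connection.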
More informationComparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method
Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method
More informationLecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.
prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove
More informationTransfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system
Transfer Functons Convenent representaton of a lnear, dynamc model. A transfer functon (TF) relates one nput and one output: x t X s y t system Y s The followng termnology s used: x y nput output forcng
More informationLecture 20: November 7
0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:
More informationFinding Dense Subgraphs in G(n, 1/2)
Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng
More information( ) ( ) ( ) ( ) STOCHASTIC SIMULATION FOR BLOCKED DATA. Monte Carlo simulation Rejection sampling Importance sampling Markov chain Monte Carlo
SOCHASIC SIMULAIO FOR BLOCKED DAA Stochastc System Analyss and Bayesan Model Updatng Monte Carlo smulaton Rejecton samplng Importance samplng Markov chan Monte Carlo Monte Carlo smulaton Introducton: If
More informationAPPENDIX A Some Linear Algebra
APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,
More informationCSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing
CSC321 Tutoral 9: Revew of Boltzmann machnes and smulated annealng (Sldes based on Lecture 16-18 and selected readngs) Yue L Emal: yuel@cs.toronto.edu Wed 11-12 March 19 Fr 10-11 March 21 Outlne Boltzmann
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationEfficient, General Point Cloud Registration with Kernel Feature Maps
Effcent, General Pont Cloud Regstraton wth Kernel Feature Maps Hanchen Xong, Sandor Szedmak, Justus Pater Insttute of Computer Scence Unversty of Innsbruck 30 May 2013 Hanchen Xong (Un.Innsbruck) 3D Regstraton
More informationMore metrics on cartesian products
More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of
More informationAbsorbing Markov Chain Models to Determine Optimum Process Target Levels in Production Systems with Rework and Scrapping
Archve o SID Journal o Industral Engneerng 6(00) -6 Absorbng Markov Chan Models to Determne Optmum Process Target evels n Producton Systems wth Rework and Scrappng Mohammad Saber Fallah Nezhad a, Seyed
More informationHowever, since P is a symmetric idempotent matrix, of P are either 0 or 1 [Eigen-values
Fall 007 Soluton to Mdterm Examnaton STAT 7 Dr. Goel. [0 ponts] For the general lnear model = X + ε, wth uncorrelated errors havng mean zero and varance σ, suppose that the desgn matrx X s not necessarly
More informationAppendix B. The Finite Difference Scheme
140 APPENDIXES Appendx B. The Fnte Dfference Scheme In ths appendx we present numercal technques whch are used to approxmate solutons of system 3.1 3.3. A comprehensve treatment of theoretcal and mplementaton
More informationLecture 10 Support Vector Machines II
Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed
More informationj) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1
Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons
More informationECE559VV Project Report
ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate
More informationCase A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.
THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationFor now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.
Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson
More informationLogistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton
More informationECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2017 Instructor: Victor Aguirregabiria
ECOOMETRICS II ECO 40S Unversty of Toronto Department of Economcs Wnter 07 Instructor: Vctor Agurregabra SOLUTIO TO FIAL EXAM Tuesday, Aprl 8, 07 From :00pm-5:00pm 3 hours ISTRUCTIOS: - Ths s a closed-book
More informationChapter 7 Channel Capacity and Coding
Chapter 7 Channel Capacty and Codng Contents 7. Channel models and channel capacty 7.. Channel models Bnary symmetrc channel Dscrete memoryless channels Dscrete-nput, contnuous-output channel Waveform
More information3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X
Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number
More information6 Supplementary Materials
6 Supplementar Materals 61 Proof of Theorem 31 Proof Let m Xt z 1:T : l m Xt X,z 1:t Wethenhave mxt z1:t ˆm HX Xt z 1:T mxt z1:t m HX Xt z 1:T + mxt z 1:T HX We consder each of the two terms n equaton
More information10.34 Fall 2015 Metropolis Monte Carlo Algorithm
10.34 Fall 2015 Metropols Monte Carlo Algorthm The Metropols Monte Carlo method s very useful for calculatng manydmensonal ntegraton. For e.g. n statstcal mechancs n order to calculate the prospertes of
More informationThe Minimum Universal Cost Flow in an Infeasible Flow Network
Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran
More informationRandom Walks on Digraphs
Random Walks on Dgraphs J. J. P. Veerman October 23, 27 Introducton Let V = {, n} be a vertex set and S a non-negatve row-stochastc matrx (.e. rows sum to ). V and S defne a dgraph G = G(V, S) and a drected
More informationOnline Classification: Perceptron and Winnow
E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng
More informationA Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach
A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland
More informationPower law and dimension of the maximum value for belief distribution with the max Deng entropy
Power law and dmenson of the maxmum value for belef dstrbuton wth the max Deng entropy Bngy Kang a, a College of Informaton Engneerng, Northwest A&F Unversty, Yanglng, Shaanx, 712100, Chna. Abstract Deng
More informationCIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More informationLecture 17 : Stochastic Processes II
: Stochastc Processes II 1 Contnuous-tme stochastc process So far we have studed dscrete-tme stochastc processes. We studed the concept of Makov chans and martngales, tme seres analyss, and regresson analyss
More information