CS281B/Stat241B: Advanced Topics in Learning & Decision Making

Conjugacy and the Exponential Family

Lecturer: Michael I. Jordan
Scribes: Brian Milch

1 Conjugacy

In the previous lecture, we saw conjugate priors for the multivariate Gaussian distribution. In this lecture, we discuss conjugacy more generally. A family of probability distributions $\mathcal{P}$ is conjugate for a probability model $p$ if the posterior lies in $\mathcal{P}$ whenever the prior lies in $\mathcal{P}$. Note that the family of all probability distributions is conjugate for any model $p$.

What if we have a probability model $p(y \mid \theta)$ for some practical problem, but the standard conjugate family for $p$ does not seem to contain a realistic prior distribution? One thing we can do is use a mixture model as our prior:

$$p(\theta) = \sum_{j=1}^{k} \alpha_j \, p_j(\theta)$$

where $0 \le \alpha_j \le 1$, $\sum_{j} \alpha_j = 1$, and each $p_j(\theta)$ is in a family $\mathcal{P}_j$ that is conjugate for $p(y \mid \theta)$. If we multiply this $p(\theta)$ by $p(y \mid \theta)$, then each $p_j(\theta)$ is multiplied by $p(y \mid \theta)$, yielding (up to normalization) another distribution in $\mathcal{P}_j$. So the posterior $p(\theta \mid y)$ is a mixture of the same form, with updated mixture weights. The same result holds for an infinite mixture indexed by a continuous parameter $\tau$:

$$p(\theta) = \int g(\tau) \, p(\theta; \tau) \, d\tau$$

where $g(\tau)$ is an arbitrary density function.

We can also take a limit of increasingly flat conjugate priors, yielding an improper prior with density 1 everywhere. This is the ultimate uninformative prior, but it is improper in that it does not integrate to 1. If we multiply it by a $p(y \mid \theta)$ distribution, such as a Gaussian, the resulting posterior may integrate to 1 and thus be a true density. However, this is not guaranteed for all choices of $p(y \mid \theta)$.

If we use a mixture model, how can we set the $\alpha_j$'s? To be fully Bayesian, we should set them based on some background knowledge. However, a common technique is empirical Bayes: estimate the $\alpha_j$'s by maximum likelihood on the training data.

2 The Exponential Family

For a fully general treatment of conjugate priors, we turn to a very large family of distributions called the exponential family, which contains most of the standard parametric families of distributions.
The general form of an exponential family distribution is:

$$p(x \mid \theta) = h(x) \exp\left( \phi(\theta)^\top T(x) - A(\theta) \right) \qquad (1)$$

Here $\phi(\theta)$ is the canonical parameter, often denoted $\eta$; $T(x)$ is the sufficient statistic; and $A(\theta)$ is the cumulant generating function or log partition function. The $x$ here can range over any set, as long as $T(x)$ maps each $x$ to a vector of fixed, finite dimension. $p(x \mid \theta)$ is a density with respect to some underlying measure $\mu$. The function $h(x)$ can be thought of as an adjustment to this underlying measure, and is not usually important.

Since the expression in Eq. 1 is a density, it must integrate to 1:

$$\int h(x) \, e^{\eta^\top T(x) - A(\theta)} \, dx = 1$$

where $\eta = \phi(\theta)$. Therefore:

$$A(\theta) = \ln \int h(x) \, e^{\eta^\top T(x)} \, dx \qquad (2)$$

So $A(\theta)$ is fully determined by $\eta$, $T(x)$, and $h(x)$.

It can be shown that the set of valid $\eta$ vectors, $\{\eta : \int h(x) \, e^{\eta^\top T(x)} \, dx < \infty\}$, is convex. So convex combinations of valid parameter vectors are also valid parameter vectors. Also, optimizing over the set of valid $\eta$'s is not too difficult.

So far we have written the cumulant generating function as $A(\theta)$. We can also write it as $A(\eta)$, using a different function $A$. We will limit ourselves to cases where $\phi(\theta)$ is one-to-one.

The second convexity property of the exponential family is that $A(\eta)$ is always a convex function of $\eta$. This follows from a general convex analysis result about log-sum-exp expressions such as Eq. 2.

$A(\eta)$ is called the cumulant generating function because its derivatives with respect to $\eta$ are the cumulants of the sufficient statistic $T(X)$: the mean, the variance, and higher-order quantities related to the central moments. For the first cumulant, we take the first derivative:

$$\nabla_\eta A(\eta) = \frac{\int h(x) \, e^{\eta^\top T(x)} \, T(x) \, dx}{\int h(x) \, e^{\eta^\top T(x)} \, dx}$$

By Eq. 2, the denominator is $e^{A(\eta)}$, so:

$$\nabla_\eta A(\eta) = \int h(x) \, e^{\eta^\top T(x) - A(\eta)} \, T(x) \, dx = E[T(X)]$$

Similarly, the Hessian matrix $\nabla^2_\eta A(\eta)$ is the covariance matrix $\mathrm{Var}(T(X))$. Thus, to find the cumulants of an exponential family distribution with a given $A(\eta)$, we don't have to do messy integrals; we just have to take derivatives.

For more information on the exponential family, see the recent technical report by Wainwright and Jordan, "Graphical models, exponential families, and variational inference" (available from Prof. Jordan's web page). A good book on the subject is by Lawrence Brown, Fundamentals of Statistical Exponential Families, published in the IMS Lecture Notes series in 1986.
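The identities $\nabla_\eta A(\eta) = E[T(X)]$ and $\nabla^2_\eta A(\eta) = \mathrm{Var}(T(X))$ can be checked numerically on a simple case. The sketch below is only an illustration, not part of the lecture: it assumes the Poisson family written in canonical form ($h(x) = 1/x!$, $T(x) = x$, $\eta = \ln\theta$, $A(\eta) = e^\eta$), and all function names and the truncation point `x_max` are hypothetical choices made here.

```python
import math

def log_partition(eta):
    """A(eta) for the Poisson family with h(x) = 1/x!, T(x) = x, eta = ln(theta)."""
    return math.exp(eta)

def poisson_moments(eta, x_max=200):
    """Mean and variance of T(X) = X by direct summation over x = 0..x_max.

    x_max is an assumed truncation point; the tail beyond it is negligible
    for the moderate rates used in this example.
    """
    a = log_partition(eta)
    # p(x | eta) = exp(eta * x - A(eta)) / x!, evaluated in log space
    probs = [math.exp(eta * x - a - math.lgamma(x + 1)) for x in range(x_max + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

eta = math.log(3.5)                              # canonical parameter for rate 3.5
mean, var = poisson_moments(eta)                 # the "messy" sums
grad_A = derivative(log_partition, eta)          # first cumulant from A
hess_A = second_derivative(log_partition, eta)   # second cumulant from A

print(f"E[T(X)]   by summation: {mean:.6f}, from A': {grad_A:.6f}")
print(f"Var[T(X)] by summation: {var:.6f}, from A'': {hess_A:.6f}")
```

Both pairs of numbers should agree up to finite-difference error, which is the point of the derivative shortcut: once $A(\eta)$ is known in closed form, the moments follow without evaluating any sums or integrals.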
The exponential family is quite powerful: it includes standard distributions such as the Bernoulli, Gaussian, gamma, Poisson, Rayleigh, etc. However, exponential family distributions are parametric: the parameter vector $\eta$ has a fixed dimension.

2.1 Conjugacy and the exponential family

Consider the setup in Fig. 1, where we take $n$ independent samples $y_i$ from an exponential family distribution:

$$p(y_i \mid \theta) = h(y_i) \exp\left( \phi(\theta)^\top T(y_i) - A(\theta) \right) \qquad (3)$$

Let $y$ be the random vector formed by concatenating all the samples $y_i$. Then:

$$p(y \mid \theta) = \left( \prod_{i=1}^{n} h(y_i) \right) \exp\left( \phi(\theta)^\top \sum_{i=1}^{n} T(y_i) - n A(\theta) \right) \qquad (4)$$
Figure 1: Graphical model for an exponential family distribution and its conjugate prior. (Nodes $\mu$, $\nu$, $\theta$, and $Y_i$, with $Y_i$ inside a plate of size $n$.)

Thus, $y$ also has an exponential family distribution: it has the same canonical parameter $\phi(\theta)$; its sufficient statistic is $\sum_{i=1}^{n} T(y_i)$; and its cumulant generating function is $n A(\theta)$.

We can construct a conjugate family of prior distributions as follows, with two parameters $\mu$ and $\nu$:

$$p(\theta \mid \mu, \nu) \propto \exp\left( \phi(\theta)^\top \mu - \nu A(\theta) \right) \qquad (5)$$

Then the posterior distribution is:

$$p(\theta \mid y, \mu, \nu) \propto \exp\left( \phi(\theta)^\top \left( \mu + \sum_{i=1}^{n} T(y_i) \right) - (n + \nu) A(\theta) \right) \qquad (6)$$

We can also take a hierarchical Bayesian approach, putting priors on $\mu$ and $\nu$ as well.

It is worth noting that this is just the minimal conjugate family for $p(y \mid \theta)$, in the sense that it has a minimal number of parameters. Of course there are other conjugate families, such as the family of mixtures of these distributions, and the family of all distributions.

2.2 ML estimation of mean parameters

Putting aside the Bayesian approach for the moment, suppose we want a maximum likelihood (ML) estimate of the mean parameter $\mu = E[T(X)]$ for some exponential family distribution. This $\mu$ is a function of $\eta$: specifically, $\mu = \nabla_\eta A(\eta)$. So it suffices to maximize the likelihood with respect to $\eta$. Based on Eq. 4, the log likelihood is:

$$\ell(\eta; y) = \eta^\top \sum_{i=1}^{n} T(y_i) - n A(\eta) + c(y) \qquad (7)$$

where $c(y)$ is some function that does not depend on $\eta$. Differentiating with respect to $\eta$, we get:

$$\nabla_\eta \ell = \sum_{i=1}^{n} T(y_i) - n \nabla_\eta A(\eta) = \sum_{i=1}^{n} T(y_i) - n E[T(X)]$$

Setting this to zero, we get:

$$\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} T(y_i) \qquad (8)$$

In other words, the maximum likelihood estimate for the expectation of the sufficient statistic is just the empirical mean of the sufficient statistic. We already knew this fact for common distributions such as the Gaussian, Poisson, etc.; now we have a general proof.
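As a small check of Eq. 8, the sketch below evaluates the log likelihood of Eq. 7 (without the $c(y)$ term) for a Poisson model and confirms that it is largest at the $\eta$ whose mean parameter equals the empirical mean of $T(y_i)$. The data values, the perturbation sizes, and the function name `log_likelihood` are assumptions made for this illustration; the Poisson canonical form ($T(y) = y$, $A(\eta) = e^\eta$) is used as a convenient example.

```python
import math

def log_likelihood(eta, samples):
    """Poisson log likelihood in canonical form, omitting the c(y) term of Eq. 7.

    Uses T(y) = y and A(eta) = exp(eta).
    """
    n = len(samples)
    return eta * sum(samples) - n * math.exp(eta)

samples = [2, 0, 3, 1, 4, 2]             # hypothetical observations
mu_hat = sum(samples) / len(samples)     # Eq. 8: empirical mean of T(y_i)
eta_hat = math.log(mu_hat)               # invert mu = A'(eta) = exp(eta)

best = log_likelihood(eta_hat, samples)
# The log likelihood is concave in eta, so any perturbation should lower it.
nearby = [log_likelihood(eta_hat + d, samples) for d in (-0.1, -0.01, 0.01, 0.1)]

print("mu_hat =", mu_hat)
print("log likelihood at eta_hat:", best)
print("log likelihood nearby:    ", nearby)
```

The estimate is obtained without any numerical optimization: the only computation is averaging the sufficient statistic and, if the canonical parameter is wanted, inverting the map from $\eta$ to $\mu$.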
2.3 Exercises with the exponential family

Here are some exercises for the reader involving the exponential family. Consider the Poisson distribution:

$$p(x \mid \theta) = \frac{\theta^x e^{-\theta}}{x!}$$

and the binomial distribution (with a fixed $n$):

$$p(x \mid \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}$$

Express each of these distributions in exponential family form. For example, the canonical parameter $\eta$ for the binomial distribution is $\ln \frac{\theta}{1 - \theta}$. Compute $A(\theta)$, rewrite it as a function of $\eta$, and then differentiate it with respect to $\eta$ to obtain the mean $\mu$ for each distribution.

2.4 Parameterizations

We have seen several ways of parameterizing the exponential family, that is, indexing the set of exponential family distributions. There is the canonical parameter $\eta$, and also the mean parameter $\mu = E[T(X)]$. Recall that $A(\eta)$ is a convex function, so $\mu = \nabla_\eta A(\eta)$ is nondecreasing in $\eta$. If $A(\eta)$ is strictly convex, then $\nabla_\eta A(\eta)$ is increasing in $\eta$, and therefore the mapping between $\eta$ and $\mu$ is one-to-one. See Fig. 2 for an illustration of this relationship.

Figure 2: If $\eta$ is one-dimensional, then the mean parameter $\mu$ corresponding to $\eta$ is just the slope of a tangent line to the cumulant generating function $A$ evaluated at $\eta$. (Plot of $A(\eta)$ against $\eta$, with the tangent slope labeled $\mu$.)

We have also used another parameter, denoted $\theta$, to index the set of exponential family distributions. This parameter does not have a special name; it is just some parameter that is convenient for defining the distribution.

3 The Exponential Family and Graphical Models

3.1 The Ising model

The previous course in this sequence dealt with probabilistic graphical models. We can also think of graphical models as defining exponential family distributions. As an example, consider the Ising model, illustrated in Fig. 3. This model comes from statistical physics, where each node represents the spin (up or down) of a particle. We represent each particle's spin with a variable $X_i$ taking values in $\{0, 1\}$ (we can formulate an equivalent model with $\{-1, 1\}$). The parameters are $\theta_i$, representing the external field on particle $i$, and $\theta_{ij}$, representing the attraction between particles $i$ and $j$. If $i$ and $j$ are not adjacent in the graph, then $\theta_{ij} = 0$.
Figure 3: An Ising model with $n = 9$ nodes. (Nodes $X_1, X_2, \ldots, X_n$.)

The probability distribution is:

$$p(x \mid \theta) = \exp\left( \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i} \theta_i x_i - A(\theta) \right) = \frac{1}{Z(\theta)} \exp\left( \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i} \theta_i x_i \right)$$

where $Z(\theta)$ is the partition function, so that $A(\theta) = \ln Z(\theta)$. This is an exponential family distribution where the sufficient statistic $T(x)$ consists of all the values $x_i$ and $x_i x_j$ (for $i < j$) concatenated together. So if $\mu \triangleq E[T(X)]$, then $\mu_i = E[X_i]$ and $\mu_{ij} = E[X_i X_j]$. Thus, the $\mu$ vector contains the expectations and correlations of the particles' spins.

It can be shown that $A(\eta)$ is strictly convex, so there is a one-to-one mapping between $\eta$ and $\mu$. However, actually computing $\mu$ from $\eta$ is #P-hard. In fact, we can think of the whole problem of probabilistic inference as computing the mean parameter $\mu$ from the canonical parameter $\eta$ (or some other parameter $\theta$). Of course, this problem is #P-hard in general.

3.2 Graphical models in general

In an undirected graphical model, we specify a potential $\psi_C$ for each clique $C$ of the graph. The joint probability distribution is then given by:

$$p(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C) \qquad (9)$$

where $Z$ is some normalization constant. We can write this in exponential family form as:

$$p(x) = \exp\left( \sum_{C} \ln \psi_C(x_C) - \ln Z \right) \qquad (10)$$

If all the variables in the model have discrete values, then for each clique $C$, we can define an indicator vector with an entry for each configuration of the variables in $C$. The sufficient statistic $T(x)$ is formed by concatenating the indicator vectors for all the cliques. In general, if each $\psi_C$ is an exponential family distribution, we can form $T(x)$ by concatenating the sufficient statistics for these distributions.

In a directed graphical model, we specify a conditional distribution for each variable given its parents in the graph. The joint distribution is:

$$p(x) = \prod_{i} p(x_i \mid x_{\pi(i)}) \qquad (11)$$
The exponential family form of this distribution is simple:

$$p(x) = \exp\left( \sum_{i} \ln p(x_i \mid x_{\pi(i)}) \right) \qquad (12)$$

If each $p(x_i \mid x_{\pi(i)})$ is an exponential family distribution, then it is clear that the distribution in Eq. 12 is also in the exponential family.
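To make the Ising example of Section 3.1 concrete, the sketch below computes $A(\theta) = \ln Z(\theta)$ and the mean parameters $\mu_i = E[X_i]$, $\mu_{ij} = E[X_i X_j]$ by brute-force enumeration of all $2^n$ configurations, and then checks one coordinate of $\mu = \nabla A$ by finite differences. Everything here is an assumption for illustration: the three-node chain, the parameter values, the 0-based node indices, and the function names. The enumeration is only feasible for tiny $n$, which is consistent with the hardness remark in the text.

```python
import itertools
import math

def score(x, theta_node, theta_edge):
    """Exponent of the Ising density: sum_i theta_i x_i + sum_{i<j} theta_ij x_i x_j."""
    s = sum(t * xi for t, xi in zip(theta_node, x))
    s += sum(t * x[i] * x[j] for (i, j), t in theta_edge.items())
    return s

def log_partition(theta_node, theta_edge):
    """A(theta) = ln Z(theta), summing over all 2^n configurations in {0, 1}^n."""
    n = len(theta_node)
    return math.log(sum(math.exp(score(x, theta_node, theta_edge))
                        for x in itertools.product((0, 1), repeat=n)))

def mean_parameters(theta_node, theta_edge):
    """Return (mu_i, mu_ij): E[X_i] for each node and E[X_i X_j] for each edge."""
    n = len(theta_node)
    a = log_partition(theta_node, theta_edge)
    mu_node = [0.0] * n
    mu_edge = {edge: 0.0 for edge in theta_edge}
    for x in itertools.product((0, 1), repeat=n):
        p = math.exp(score(x, theta_node, theta_edge) - a)  # p(x | theta)
        for i in range(n):
            mu_node[i] += p * x[i]
        for (i, j) in theta_edge:
            mu_edge[(i, j)] += p * x[i] * x[j]
    return mu_node, mu_edge

# Hypothetical 3-node chain 0 - 1 - 2; absent edges have theta_ij = 0.
theta_node = [0.5, -0.3, 0.2]
theta_edge = {(0, 1): 1.0, (1, 2): -0.5}
mu_node, mu_edge = mean_parameters(theta_node, theta_edge)

# Finite-difference check that dA/dtheta_0 equals mu_0.
h = 1e-5
raised = [theta_node[0] + h] + theta_node[1:]
lowered = [theta_node[0] - h] + theta_node[1:]
grad_0 = (log_partition(raised, theta_edge) - log_partition(lowered, theta_edge)) / (2 * h)

print("mu_node:", mu_node)
print("mu_edge:", mu_edge)
print("dA/dtheta_0:", grad_0)
```

The loop over `itertools.product` visits $2^n$ configurations, so the running time doubles with every added node; practical inference algorithms exploit graph structure or approximate $\mu$ instead.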