Multi-Conditional Learning for Joint Probability Models with Latent Variables

Chris Pal, Xuerui Wang, Michael Kelm and Andrew McCallum
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA USA
{pal,xuerui,mccallum}@cs.umass.edu

Abstract

We introduce Multi-Conditional Learning, a framework for optimizing graphical models based not on joint likelihood, or on conditional likelihood, but on a product of several marginal conditional likelihoods, each relying on common sets of parameters from an underlying joint model and predicting different subsets of variables conditioned on other subsets. When applied to undirected models with latent variables, such as the Harmonium, this approach can result in powerful, structured latent variable representations that combine some of the advantages of conditional random fields with the unsupervised clustering ability of popular topic models, such as latent Dirichlet allocation and its successors. We present new algorithms for parameter estimation using expected-gradient based optimization and develop fast approximate inference algorithms inspired by the contrastive divergence approach. Our initial experimental results show improved cluster quality on synthetic data, promising results on a vowel recognition problem and significant improvement inferring hidden document categories from multiple attributes of documents.

1 Introduction

Recently, there has been substantial interest in Conditional Random Fields (CRFs) [6] for sequence processing. CRFs are random fields for a joint distribution globally conditioned on feature observations. The CRF construction can be contrasted with MRFs, which have been used in the past to define and model the joint distribution of both labels and features. CRFs are also optimized using a Maximum Conditional Likelihood objective, as opposed to the Maximum (Joint) Likelihood objective traditionally used for MRFs. In the Machine Learning community the Boltzmann machine [1] is another well-known example of an MRF, and recent attention has been given to a restricted type of Boltzmann machine [3] known as a Harmonium [8].
We are interested in more deeply exploring the relationships between these models, the distributions they define and the ways in which they can be optimized. In the approach we propose and develop here, one begins by specifying a joint probability model for all quantities one wishes to consider as random quantities; for our discussion here we can think of this as a joint model for labels, features, hidden variables and, if desired, parameters. However, to optimize the joint model we propose finding point estimates for parameters using an objective function consisting of the product of select
(marginal) conditional distributions. The goal of this objective is thus to obtain a Random Field that has been optimized to be good at modeling a number of conditional distributions that the modeler is particularly interested in capturing well within the context of a single joint model. Our experiments show that latent variable models obtained by optimization with respect to this type of objective can produce both qualitatively better clusters and quantitatively better structured latent topic spaces.

2 Multi-Conditional Learning in Joint Models

2.1 A Simple Illustrative Example in a Locally Normalized Model

Here we present an example of how one can optimize a joint probability model under a number of different objectives. Consider a Gaussian mixture model (GMM) for real-valued observed random variables x (e.g., observed 2D values) with an unobserved sub-class s associated with each observed class label c. We will use the notation x̃ and c̃ to denote observations or instantiations of continuous and discrete random variables. We can write a model for the joint distribution of these random variables as P(x, c, s) = P(x|s)P(s|c)P(c), where P(s|c) is a sparse matrix associating a number of sub-classes with each class. We shall use Θ to denote all the parameters of the model.

Now consider that it is possible to optimize the GMM in a number of different ways. First, consider the log marginal joint likelihood L_{x,c} of such a model, which can be expressed as:

L_{x,c}(Θ; {x̃_i}, {c̃_i}) = Σ_i log P(x̃_i, c̃_i | Θ) = Σ_i log Σ_s P(x̃_i, s, c̃_i | Θ) = L_{x,c}(Θ)   (1)

Second, in contrast to the log marginal joint likelihood, the log marginal conditional likelihood L_{c|x} can be expressed as:

L_{c|x}(Θ; {x̃_i}, {c̃_i}) = Σ_i log Σ_s P(c̃_i, s | x̃_i, Θ) = L_{x,c}(Θ) − L_x(Θ)   (2)

Third, consider the following multi-conditional objective function, L_{c|x,x|c}, which we express as:

L_{c|x,x|c}(Θ; {x̃_i}, {c̃_i}) = Σ_i [ log P(c̃_i | x̃_i, Θ) + log P(x̃_i | c̃_i, Θ) ] = 2L_{x,c}(Θ) − L_x(Θ) − L_c(Θ)   (3)

Consider now the following simple example data set, which is similar to the example presented in Jebara's work [4] to illustrate his Conditional Expectation Maximization (CEM) approach.
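To make the three objectives concrete, the following sketch evaluates (1)-(3) on a tiny, invented instance of the model (all means, mixing weights and data points here are hypothetical, not the paper's synthetic data set) and checks the identity L_{c|x,x|c} = (L_{x,c} − L_x) + (L_{x,c} − L_c):

```python
import numpy as np
from scipy.special import logsumexp

# Toy instance of P(x, c, s) = p(x|s) P(s|c) P(c):
# 2 classes, 4 sub-classes, 1-D Gaussians (numbers invented for illustration).
log_pc = np.log([0.5, 0.5])                      # P(c)
ps_c = np.array([[0.5, 0.5, 0.0, 0.0],           # P(s|c): sparse matrix tying
                 [0.0, 0.0, 0.5, 0.5]])          # sub-classes to classes
log_ps_c = np.log(ps_c + 1e-300)
mu, sigma = np.array([-2., -1., 1., 2.]), np.ones(4)

def log_gauss(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((x - m) / s) ** 2

x = np.array([-1.5, 1.7, 0.1])                   # observations x~_i
c = np.array([0, 1, 0])                          # observed labels c~_i

# Eq. (1): L_{x,c}, marginalizing the sub-class s
L_xc = np.sum(logsumexp(log_gauss(x[:, None], mu, sigma) + log_ps_c[c], axis=1)
              + log_pc[c])
# L_x: marginalize both c and s
L_x = np.sum(logsumexp(log_gauss(x[:, None, None], mu, sigma)
                       + log_ps_c[None] + log_pc[None, :, None], axis=(1, 2)))
L_c = np.sum(log_pc[c])

L_cond = L_xc - L_x            # Eq. (2): sum_i log P(c~_i | x~_i)
L_mcl = 2 * L_xc - L_x - L_c   # Eq. (3): multi-conditional objective
```

The decompositions in (2) and (3) mean that any routine that can evaluate marginal log-likelihoods of subsets of variables can evaluate the conditional and multi-conditional objectives as well.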
Similarly, we generate data from two classes, each with four sub-classes drawn from 2D isotropic Gaussians. The data are illustrated by red and blue markers in Figure 1. In contrast to [4], here we fit models with diagonal covariance matrices and we use the conditional expected gradient [7] optimization approach to update parameters. To illustrate the effects of the different optimization criteria we have fit models with two sub-classes for each class. We run each algorithm with random initializations using gradient-based optimization for the three objective functions and choose the best model under the joint, conditional and multi-conditional objectives, (1), (2) and (3), respectively. We illustrate the model parameters using ellipses of constant probability under the model.

From this illustrative example, we see that the joint likelihood based objective encodes no element explicitly enforcing a good model of the conditional distribution of the class label and can thus place probability mass in poor locations with respect to classification. The conditional objective focuses completely on the decision boundary and can produce parameters with very little interpretability. In contrast, our multi-conditional objective explicitly optimizes both for a good class conditional distribution and for a setting of parameters that is good for making classifications.
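As a concrete sketch of this kind of gradient-based optimization, the following fits the two class means of a deliberately simplified 1-D model (one Gaussian per class, no sub-classes) by ascent on the multi-conditional objective (3). A finite-difference gradient stands in for the conditional expected gradient of [7], and the data, learning rate and iteration count are illustrative choices, not the paper's experimental settings:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)
# Simplified 1-D instance: P(x, c) = N(x; mu_c, 1) * P(c), P(c) uniform.
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
c = np.repeat([0, 1], 100)

def mcl(mu):
    # L_{x,c}, L_x and L_c for the simplified model
    log_pxc = -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu[c]) ** 2 + np.log(0.5)
    log_px = logsumexp(-0.5 * np.log(2 * np.pi)
                       - 0.5 * (x[:, None] - mu) ** 2 + np.log(0.5), axis=1)
    L_xc, L_x, L_c = log_pxc.sum(), log_px.sum(), len(x) * np.log(0.5)
    return 2 * L_xc - L_x - L_c          # Eq. (3)

mu = np.zeros(2)
eps, lr = 1e-5, 0.002
for _ in range(300):                     # finite-difference gradient ascent
    g = np.array([(mcl(mu + eps * e) - mcl(mu - eps * e)) / (2 * eps)
                  for e in np.eye(2)])
    mu += lr * g
```

After a few hundred steps the two means separate toward the respective class centers; an expected-gradient implementation would replace the finite differences with exact posterior expectations.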
Figure 1: (Left) Joint likelihood optimization. (Middle) One of the many near-optimal solutions found by conditional likelihood optimization. (Right) An optimal solution found by our multi-conditional objective.

Quantitatively, we have found that a similar multi-conditional optimization and model selection procedure for the 11-class, isolated vowel recognition problem in [2] leads to a test set error rate of .6 compared to . using the ML and CL objectives. In contrast, the best published result is .9 using multivariate adaptive regression splines (MARS) [2].

2.2 Our Proposed Multi-Conditional Objective

More generally, our objective function can be expressed as follows. Consider a data set consisting of i = 1 ... N observation instances, with hidden discrete and continuous variables {z} and z̄ respectively. We define j = 1 ... M pairs of disjoint subsets of observed variables, where {x}_{j,i} represents the ith instance of the variables in subset j and {x̄}_{j,i} is the other half of the pair (which we will condition upon). Using these definitions, the optimal parameter settings under our multi-conditional criterion are given by

argmax_θ Π_j Π_i Σ_{{z}_i} ∫ P({x}_{j,i}, {z}_i, z̄_i | {x̄}_{j,i}; θ) dz̄_i,   (4)

where we derive these marginal conditional likelihoods from a single underlying joint model which itself may be normalized locally, globally or using some combination of the two.

2.3 A Structured, Globally Normalized, Latent Variable Model for Documents

A Harmonium model [8] is a Markov Random Field (MRF) consisting of observed variables and hidden variables. Like all MRFs, the model we present here will be defined in terms of a globally normalized product of (un-normalized) potential functions defined upon subsets of variables. A Harmonium can also be described as a type of restricted Boltzmann machine [3] which can be written as an exponential family model.
In particular, the exponential family Harmonium structured model we develop here can be written as

P(x, y | Θ) = exp{ Σ_i θ_i^T f_i(x_i) + Σ_j θ_j^T f_j(y_j) + Σ_{i,j} θ_{ij}^T f_{ij}(x_i, y_j) − A(Θ) },   (5)

where y is a vector of hidden variables, x is a vector of observations, θ_i and θ_j represent parameter vectors (or weights), θ_{ij} represents a parameter vector on a cross product of states, f denotes potential functions, Θ = {θ_i, θ_j, θ_{ij}} is the set of all parameters and A is the log-partition function or normalization constant. A Harmonium model factorizes the third term of (5) into θ_{ij}^T f_{ij}(x_i, y_j) = f_i(x_i)^T W_{ij}^T f_j(y_j), where W_{ij}^T is a parameter matrix with dimensions a × b, i.e., with rows equal to the number of states of f_i(x_i) and columns equal to the number of states of f_j(y_j). Figure 2 (left) illustrates a Harmonium model as a factor graph [5]. Importantly, a Harmonium describes the factorization of a
joint distribution for observed and hidden variables into a globally normalized product of local functions. In our experiments here we shall use the Harmonium's factorization structure to define an MRF, and we will then define sets of marginal conditional distributions of some observed variables given others that are of particular interest, so as to form our multi-conditional objective. Importantly, using a globally normalized joint distribution with this construction it is also possible to derive two consistent conditional models, one for hidden variables given observed variables and one for observed variables given hidden variables [9]. The conditional distributions defined by these models can also be used to implement sampling schemes for various probabilities in the underlying joint model. However, it is important to remember that the original model parameterization is not defined in terms of these conditional distributions.

In our specific experiments below we use a joint model with a form defined by (5) with W^T = [W_b^T W_d^T], such that the (exponential family) conditional distributions consistent with the joint model are given by

P(y_n | x̃) = N(y_n; μ̂, I),  μ̂ = μ + W^T x̃,   (6)
P(x_b | ỹ) = B(x_b; θ̂_b),  θ̂_b = θ_b + W_b ỹ,   (7)
P(x_d | ỹ) = D(x_d; θ̂_d),  θ̂_d = θ_d + W_d ỹ,   (8)

where N(), B() and D() represent Normal, Bernoulli and Discrete distributions respectively. The following equation can be used to represent the marginal

P(x | θ, Λ) = exp{ θ^T x + x^T Λ x − A(θ, Λ) },   (9)

where Λ = WW^T. In an exponential family model with exponential function F(x; θ), it is easy to verify that the gradient of the log joint likelihood can be expressed as:

∂L(θ; x)/∂θ = N [ E_{P̃(x)}[ ∂F(x; θ)/∂θ ] − E_{P(x;θ)}[ ∂F(x; θ)/∂θ ] ],   (10)

where E_{P̃(x)} denotes the expectation under the empirical distribution, E_{P(x;θ)} is an expectation under the model's marginal distribution and N is the number of data elements.
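Conditionals (6)-(8) make block Gibbs sampling straightforward, since all hidden variables are conditionally independent given the observations and vice versa. The sketch below runs one full Gibbs sweep for the binary-word part of the model only, under stated assumptions: zero biases, small random weights, and invented sizes; the Bernoulli natural parameter θ̂_b in (7) corresponds to a sigmoid mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 100, 10                     # illustrative sizes, not the paper's
W = 0.01 * rng.standard_normal((n_vis, n_hid))
theta_b = np.zeros(n_vis)                  # Bernoulli biases theta_b
mu = np.zeros(n_hid)                       # hidden means mu

def sample_y(x):
    # Eq. (6): P(y | x) = N(y; mu + W^T x, I)
    return mu + x @ W + rng.standard_normal(n_hid)

def sample_x(y):
    # Eq. (7): natural parameter theta_b + W_b y; Bernoulli mean = sigmoid
    p = 1.0 / (1.0 + np.exp(-(theta_b + W @ y)))
    return (rng.random(n_vis) < p).astype(float)

x0 = (rng.random(n_vis) < 0.1).astype(float)   # a fake binary word vector
y0 = sample_y(x0)
x1 = sample_x(y0)                              # one full Gibbs sweep
```

Chaining such sweeps yields the MCMC samples used in the gradient estimates below.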
We can thus compute the gradient of the log-likelihood under our construction using

∂L(W^T; X)/∂W^T = Σ_{i=1}^{N_d} ( W^T x̃_i x̃_i^T − (1/N_s) Σ_{s=1}^{N_s} W^T x_{i,(s)} x_{i,(s)}^T ),   (11)

where N_d is the number of vectors of observed data, the x_{i,(s)} are samples indexed by s, and N_s is the number of MCMC samples used per data vector, computed by Gibbs sampling with conditionals (6), (7) and (8). In our experiments here we have found it possible to use either one or a small number of MCMC steps initialized from the data vector (the contrastive divergence approach [3]), but a more standard MCMC approximation is also possible. Finally, for conditional likelihood and multi-conditional likelihood based learning, gradient values can be obtained from

∂L/∂θ = N [ E_{P̃(x,x̄)}[ ∂F(x, x̄; θ)/∂θ ] − E_{P̃(x̄)} E_{P(x|x̄;θ)}[ ∂F(x, x̄; θ)/∂θ ] ].   (12)

3 Experiments, Results and Analysis

We are interested in examining the quality of the latent representations obtained when optimizing multi-attribute Harmonium structured models under ML, CL and MCL objectives. We use a similar testing strategy to [9] but focus on comparing the different latent spaces obtained with the different optimization objectives. For our experiments, we use the reduced 20 newsgroups dataset prepared in MATLAB by Sam Roweis. In this data
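A minimal sketch of the contrastive divergence estimate of (11) follows, assuming zero bias terms, invented sizes and fake data, with the Gaussian hidden units replaced by their conditional means E[y|x] = W^T x in the data term:

```python
import numpy as np

rng = np.random.default_rng(1)
n_vis, n_hid, n_data = 20, 5, 50
W = 0.01 * rng.standard_normal((n_vis, n_hid))
X = (rng.random((n_data, n_vis)) < 0.2).astype(float)   # fake binary data

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_gradient(W, X, rng):
    """One-step contrastive divergence estimate of the Eq. (11) gradient."""
    # Data term: x~ x~^T W summed over the data (transposed layout of Eq. (11))
    pos = X.T @ (X @ W)
    # Model term: one Gibbs step started from each data vector,
    # sampling y | x and then x | y via Eqs. (6)-(7)
    Y = X @ W + rng.standard_normal((len(X), W.shape[1]))
    X1 = (rng.random(X.shape) < sigmoid(Y @ W.T)).astype(float)
    neg = X1.T @ (X1 @ W)
    return (pos - neg) / len(X)

g = cd1_gradient(W, X, rng)   # same shape as W; add to W scaled by a step size
```

With N_s = 1 sample per data vector this is exactly the one-step contrastive divergence variant described above; averaging over more Gibbs samples recovers the more standard MCMC approximation.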
Figure 2: (Left) A factor graph for a Harmonium model, with observed variables x and unobserved variables y. (Right) Precision-recall curves for the newsgroups data using ML, CL and MCL with latent variables. Random guessing is a horizontal line at .25.

set, 16,242 documents are represented by binary word-vocabulary occurrences and are labeled as one of four domains. To evaluate the quality of our latent space, we retrieve documents that have the same domain label as a test document based on their cosine coefficient in the latent space when observing only binary occurrences. We randomly split the data into a training set and a test set. We use a joint model with a corresponding full-rank multivariate Bernoulli conditional for binary word occurrences and a discrete conditional for domains. Figure 2 shows precision-recall results. ML-1 is our model with no domain label information. ML-2 is optimized with domain label information. CL is optimized to predict domains from words and MCL is optimized to predict both words from domains and domains from words.

From Figure 2 we see that the latent space captured by the model is more relevant for domain classification when the model is optimized under the CL and MCL objectives. Further, at low recall both the CL- and MCL-derived latent spaces produced similar precisions. However, as recall increases, the precision for comparisons made in the MCL-derived latent space is consistently better. In conclusion, these results lead us to believe that further investigation is warranted into the use of Multi-Conditional Learning methods for deriving both more meaningful and more useful hidden variable models.

References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985.
[2] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
[3] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.
[4] T. Jebara and A. Pentland. On reversing Jensen's inequality. In
NIPS, 2000.
[5] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 2001.
[6] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282-289, 2001.
[7] R. Salakhutdinov, S. Roweis, and Z. Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Proceedings of ICML, 2003.
[8] P. Smolensky. Information processing in dynamical systems: foundations of harmony theory, pages 194-281. McGraw-Hill, New York, 1986.
[9] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, pages 1481-1488, 2005.