Collocation Map for Overcoming Data Sparseness

Moonjoo Kim, Young S. Han, and Key-Sun Choi
Department of Computer Science
Korea Advanced Institute of Science and Technology
Taejon, 305-701, Korea
mj0712@eve.kaist.ac.kr, yshan@csking.kaist.ac.kr, kschoi@csking.kaist.ac.kr

Abstract

Statistical language models are useful because they provide probabilistic information for uncertain decision making. The most common statistic is the n-gram, which measures word cooccurrences in texts. The method, however, suffers from the data shortage problem. In this paper, we suggest that Bayesian networks be used to approximate, with graceful degradation, the statistics of cooccurrences that are insufficient or absent in the sample texts. The Collocation map is a sigmoid belief network that can be constructed from bigrams. We compared the conditional probabilities and mutual information computed from bigrams and from the Collocation map. The results show that for infrequent pairs the variance of the values from the Collocation map is smaller than that of the frequency measure by 48%. The predictive power of the Collocation map for arbitrary associations not observed in the sample texts is also demonstrated.

1 Introduction

In statistical language processing, n-grams are basic to many probabilistic models, including Hidden Markov models, which work on the limited dependency of linguistic events. In this regard, Bayesian models (the Bayesian network, belief network, and inference diagram, to name a few) are not very different from HMMs. Bayesian models capture the conditional independence among probabilistic variables and can compute the conditional distribution of the variables, which is known as probabilistic inferencing. The pure n-gram statistic, however, is somewhat crude in that it can do nothing about unobserved events, and its approximation of infrequent events can be unreliable. In this paper we show, by way of extensive experiments, that a Bayesian method that can likewise be composed from bigrams overcomes the data sparseness problem inherent in frequency counting methods. According to the empirical results, the Collocation map, a Bayesian model over lexical variables, induced graceful approximation of unobserved and infrequent events.

There are two known kinds of methods for dealing with the data sparseness problem: smoothing and class-based methods (Dagan 1992). Smoothing methods (Church and Gale 1991) readjust the distribution of word-occurrence frequencies obtained from sample texts and verify the distribution on held-out texts. As Dagan (1992) pointed out, however, the values from smoothing methods closely agree with the probability of a bigram consisting of two independent words. Class-based methods (Pereira et al. 1993) approximate the likelihood of unobserved words on the basis of similar words; Dagan et al. (1992) proposed a non-hierarchical class-based method. The two approaches report limited successes of a purely experimental nature because they are based on strong assumptions. In the case of smoothing methods, the frequency readjustment is somewhat arbitrary and will not be good for heavily dependent bigrams. As for class-based methods, the notion of similar words differs across methods, and the association of probabilistic dependency with the similarity (class) of words is too strong an assumption in general.

The Collocation map, first suggested in (Han 1993), is a sigmoid belief network with words as probabilistic variables. The sigmoid belief network was studied extensively by Neal (1992) and has an efficient inferencing algorithm. Unlike inference on other Bayesian models, inference on a sigmoid belief network is not NP-hard; inference methods based on reducing the network and on sampling are discussed in (Han 1995).

Bayesian models constructed from local dependencies provide a formal approximation among the variables, so using the Collocation map does not require strong assumptions or intuition to justify the associations among words that the map produces. Inferencing on the Collocation map yields probabilities among any combination of the words represented in the map, which is not found in other models.

One significant shortcoming of Bayesian models lies in the heavy cost of inferencing. Our implementation of the Collocation map includes 988 nodes and takes two to three minutes to compute an association between words. The purpose of the experiments is to find out how gracefully the Collocation map deals with unobserved cooccurrences in comparison with a naive bigram statistic. In the next section, the Collocation map is reviewed following the definition in (Han 1993). In section 3, mutual information and conditional probabilities computed using bigrams and the Collocation map are compared. Section 4 concludes the paper by summarizing the strong and weak points of the Collocation map and the other methods.

2 Collocation Map

In this section we give a brief introduction to the Collocation map, referring to (Han 1993) for more discussion of the definition and to (Han 1995) for inference methods. A Bayesian model consists of a network and probability tables defined on the nodes of the network. The nodes of the network represent probabilistic variables of a problem domain, and the network can compute the probabilistic dependency between any combination of the variables. The model is well documented as subjective probability theory (Pearl 1988). The Collocation map is an application of the sigmoid belief network (Neal 1992), which belongs to the belief networks, themselves a type of Bayesian model. Unlike general belief networks, the Collocation map has no deterministic variables; it consists only of probabilistic variables, which here correspond to words. A sigmoid belief network differs from other belief networks in that it keeps no probability table at each node, only weights on the edges between nodes. A node takes binary outcomes (1, -1), and the probability that a node takes an outcome, given the vector of outcomes of its preceding nodes, is a sigmoid function of those outcomes and the weights of the associated edges. In this regard the sigmoid belief network resembles an artificial neural network. In ordinary Bayesian models such probabilities are stored at the nodes, which makes inferencing very difficult because the probability tables can be very big. The sigmoid belief network does away with the NP-hard complexity by avoiding the tables, at the loss of the expressive generality of the probability distributions that tables can encode.

One who works with the Collocation map has to deal with two problems: how to construct the network, and how to compute probabilities on it. The network can be constructed directly from a set of bigrams obtained from a training sample. Because the Collocation map is a directed acyclic graph, cycles are avoided by creating an additional node for a word whenever its node would otherwise complete a cycle; no more than two nodes per word are needed in any case (Han 1993). Once the network is set up, each edge is assigned a weight, namely the frequency of the edge normalized at its node. Inferencing on the Collocation map is no different from that on a sigmoid belief network; the time complexity of inferencing by graph reduction is polynomial in the number of nodes N (Han 1995).

P( profit | investment ) = 0.644069
P( risk-taking | investment ) = 0.549834
P( stock | investment ) = 0.546001
P( high-income | investment ) = 0.564798
P( investment | high-income ) = 0.500000
P( high-income | risk-taking, profit ) = 0.720300
P( investment | portfolio, high-income, risk-taking ) = 0.495988
P( portfolio | blue-chip ) = 0.500000
P( portfolio, stock | portfolio, stock ) = 1.000000

Figure 1: Example Collocation map and example inferences. The graph reduction method (Han 1995) is used in computing the probabilities.
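To make this concrete: writing x_j ∈ {1, -1} for the outcomes of a node's predecessors and w_j for the weights of the corresponding edges, the node takes outcome 1 with probability σ(Σ_j w_j x_j), where σ(z) = 1 / (1 + e^-z). The following is a minimal Python sketch of the construction under that reading; the class, the apostrophe convention for duplicate nodes, and the toy bigram counts are illustrative assumptions, not the authors' C implementation.

import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class CollocationMapSketch:
    """Directed acyclic word network with bigram-derived edge weights."""

    def __init__(self):
        self.children = defaultdict(set)  # node -> child nodes
        self.count = {}                   # (parent, child) -> bigram count

    def _reaches(self, start, goal):
        # Depth-first search over the existing edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.children[node])
        return False

    def add_bigram(self, w1, w2, freq):
        # If the edge w1 -> w2 would close a cycle, direct it to a
        # duplicate node "w2'" instead; the paper notes that no more
        # than two nodes per word are ever needed.
        if self._reaches(w2, w1):
            w2 += "'"
        self.children[w1].add(w2)
        self.count[(w1, w2)] = self.count.get((w1, w2), 0) + freq

    def weight(self, parent, child):
        # Edge weight: bigram frequency normalized at the child node.
        total = sum(f for (p, c), f in self.count.items() if c == child)
        return self.count[(parent, child)] / total

    def p_true(self, node, outcomes):
        # P(node = 1 | parent outcomes), outcomes mapping parent -> +1/-1.
        z = sum(self.weight(p, node) * x for p, x in outcomes.items())
        return sigmoid(z)

cmap = CollocationMapSketch()
cmap.add_bigram("investment", "profit", 29)   # toy counts
cmap.add_bigram("risk-taking", "profit", 16)
print(cmap.p_true("profit", {"investment": 1, "risk-taking": -1}))

Normalizing the counts at each node keeps every weight in [0, 1], so a node's probability of taking outcome 1 rises smoothly with the number of its positively active collocates.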
It turned out that inferencing on networks containing more than a few hundred nodes was not practical using either the node reduction method or the sampling method alone, so we adopted a hybrid inferencing method that first reduces the network and then applies Gibbs sampling (Han 1995). Using the hybrid method, computation of conditional probabilities took less than a second for a network with 50 nodes, two seconds for a network with 100 nodes, about nine seconds for a network with 200 nodes, and about two minutes for a network with about 1000 nodes. Conditional and marginal probabilities can be approximated from the Gibbs samples. Some conditional probabilities computed from a small network are shown in figure 1. Though the network may not be big enough to model the domain of finance, the resulting values from this small network composed of 9 dependencies seem useful and intuitive.
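Since the paper does not spell out the hybrid procedure, the sketch below shows only the generic Gibbs sampling half for a network like the one above, not the authors' method: every unclamped node is repeatedly resampled from its full conditional, which is proportional to its own sigmoid probability times the likelihood of its children, and the relative frequency of target = 1 after burn-in estimates the conditional probability. The helper functions and the flat 0.5 prior at parentless nodes are assumptions of this sketch.

import random

def all_nodes(cmap):
    nodes = set()
    for parent, child in cmap.count:
        nodes.update((parent, child))
    return nodes

def parents_of(cmap, node):
    return [p for (p, c) in cmap.count if c == node]

def gibbs_estimate(cmap, target, evidence, iters=5000, burn_in=500):
    """Estimate P(target = 1 | evidence) by Gibbs sampling."""
    free = [n for n in all_nodes(cmap) if n not in evidence]
    state = dict(evidence)                 # clamped words -> +1/-1
    for n in free:
        state[n] = random.choice([1, -1])

    def local_prob(node, value):
        # P(node = value | current parent outcomes); parentless nodes
        # get a flat 0.5 prior (an assumption of this sketch).
        ps = parents_of(cmap, node)
        p1 = cmap.p_true(node, {p: state[p] for p in ps}) if ps else 0.5
        return p1 if value == 1 else 1.0 - p1

    hits = total = 0
    for sweep in range(iters):
        for n in free:
            score = {}
            for v in (1, -1):              # unnormalized full conditional
                state[n] = v
                s = local_prob(n, v)
                for c in cmap.children[n]:
                    s *= local_prob(c, state[c])
                score[v] = s
            state[n] = 1 if random.random() < score[1] / (score[1] + score[-1]) else -1
        if sweep >= burn_in:
            total += 1
            hits += state[target] == 1
    return hits / total

# e.g. an association queried through the network:
# gibbs_estimate(cmap, "profit", {"investment": 1})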

[Plot: mutual information (y) versus frequency of bigrams (x, 50-200); series: average MI, variance.]

Figure 2: Average MIs and variances. The 378,888 unique bigrams are classified according to frequency.

The computation in figure 1 was done using the graph reduction method. As the example inferences show, the association between any combination of variables can be measured.

3 Experiments

The goal of our experiments is first to find out how data sparseness is related to frequency-based statistics, and then to show that the Collocation map based method gives more reliable approximations. In particular, we observed from the experiments that the variances of the statistics may suggest the level of data sparseness: the less frequent data tended to have higher variances, though the values of the statistics themselves (mutual information, for instance) did not distinguish the level of occurrences. The predictive account of the Collocation map is demonstrated by observing the variances of its approximations on the infrequent events.

The tagged Wall Street Journal articles of the Penn Treebank corpus were used, which contain about 2.6 million word units; about 1.2 million of them were used in the experiments. Programs were coded in C and run on a Sun Sparc 10 workstation. From the first 1.2 million words, the bigrams consisting of four types of categories (NN, NNS, IN, JJ) were obtained, and the mutual information of each bigram (order-insensitive) was computed. The bigrams were classified into 200 sets according to their occurrence counts. Figure 2 summarizes the average MI value and the variance of each frequency range. Figure 3, which shows the occurrence distribution of the 378,888 unique bigrams, indicates that about 70% of them occur only once. One interesting and important observation is that the bigrams in the 1 to 3 frequency range, which make up about 90% of the population, have very high MI values. This result also agrees with Dunning's argument about overestimation of infrequent cooccurrences, in which many infrequent pairs tend to receive high estimates (Dunning 1993). The problem, according to Dunning (1993), is due to the assumption of normality in naive frequency-based statistics. The approximated values thus do not indicate the level of data quality, while figure 2 shows that the variances can suggest the level of data sufficiency. From this observation we propose the following definition of data sparseness.

A set of units belonging to a sample of ordered word units (texts) is α data-sparse if and only if the variance of the measurements on the set is greater than α.

The definition sets the concept of sparseness within the context of a focused set of linguistic units. For a set of units unobserved in a sample, the given sample text is certainly data-sparse; the definition then gives a way to judge sparseness with respect to observed units as well. How to measure data sparseness is a good issue for further study and may depend on the research context; here we suggest one simple method, perhaps for the first time in the literature.

Figure 4 compares the results from using the Collocation map and the simple frequency statistic. The variances are smaller, and the pairs in the frequency-1 class receive nonzero approximations. Because computation on the Collocation map is very expensive, we chose 2000 unique pairs at random. The network consists of 988 nodes, and computing one approximation (one inference) took about 3 minutes. A test set of 2000 pairs may not be sufficient, but it showed a consistent tendency of graceful degradation of the variances. The overestimation problem was not significant in the approximations by the Collocation map. The average value of the zero-frequency class, to which 50 unobserved pairs belong, was also on the line of smooth degradation; figure 4 shows only the variances. Table 1 summarizes the details of the performance gain from using the Collocation map.
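The paper does not state its MI formula; for a bigram it is presumably the standard pointwise mutual information, I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ], estimated by relative frequencies. Under that assumption, the bookkeeping behind figures 2 and 3, grouping unique bigrams by occurrence count and taking the mean and variance of MI within each class, can be sketched as follows (the corpus file and the omission of the POS filtering are stand-ins for the actual setup).

import math
from collections import Counter, defaultdict
from statistics import mean, pvariance

words = open("corpus.txt").read().split()            # stand-in corpus
unigrams = Counter(words)
bigrams = Counter(frozenset(p) for p in zip(words, words[1:])
                  if p[0] != p[1])                   # order-insensitive pairs
n_uni, n_bi = len(words), sum(bigrams.values())

def pmi(pair):
    # PMI = log2( P(x, y) / (P(x) P(y)) ) with relative-frequency estimates.
    x, y = tuple(pair)
    p_xy = bigrams[pair] / n_bi
    return math.log2(p_xy / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

# Group the unique bigrams by frequency class and summarize MI per class,
# as in figure 2; the class sizes give the distribution of figure 3.
by_freq = defaultdict(list)
for pair, freq in bigrams.items():
    by_freq[freq].append(pmi(pair))

for freq in sorted(by_freq):
    scores = by_freq[freq]
    print(freq, len(scores), mean(scores), pvariance(scores))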
4 Conclusion

Corpus-based natural language processing has been one of the central subjects gaining rapid attention from the research community. The major virtue of statistical approaches is in evaluating linguistic events and determining their relative importance in resolving ambiguities. The evaluation of the events (mostly cooccurrences), however, has in many cases been unreliable because of the lack of data. Data sparseness refers to this shortage of data for estimating probabilistic parameters: too many events go unobserved, and even when events are found, their occurrences are often not sufficient for the estimation to be reliable. In contrast with existing methods that rest on strong assumptions, the method using the Collocation map promises a logical approximation, since it is built on a thorough formal argument from Bayesian probability theory. The powerful feature of the framework is its ability to exploit the conditional independence among word units and to make associations about unseen cooccurrences based on observed ones. This naturally provides the attributes required for dealing with data sparseness. Our experiments confirm that the Collocation map makes predictive approximations and avoids overestimating infrequent cooccurrences. One critical drawback of the Collocation map is its time complexity, but it can be useful for applications of limited scope.

[Plot: percentage (y, 0-0.8) versus frequency of bigrams (x, 0-10).]

Figure 3: The distribution of the 378,888 unique bigrams. The first ten frequency classes are shown.

Frequency | Collocation map variance | Frequency-based variance | Reduction
1         | 5.1                      | 12.2                     | 57%
10        | 2.28                     | 4.28                     | 46%
20        | 1.29                     | 5.29                     | 75%
30        | 1.51                     | 3.51                     | 56%
40        | 2.18                     | 3.18                     | 31%
50        | 1.52                     | 2.87                     | 47%
average   | 2.04                     | 4.5                      | 45%

Table 1: Comparison of variances between frequency-based and Collocation map based MI computations.

12 fre, luency based Cl[catic n ma[ 10 MI variance q 4 O. w 0 0 r 0 : uo ~ 0 ~ 0 0 0 0 0 0 0 0 0 0 0 0 0 Ug 0 5 I0 15 20 25 30 35 40 45 50 Frequency f bigrams Figure 4: Variances by frequency based and Cllcatin map based MI cmputatins fr 2000 unique bigrarns. 58

References

Kenneth W. Church and William A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5, 19-54.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61-74.

Ido Dagan, Shaul Marcus, and Shaul Markovitch. 1992. Contextual word similarity and estimation from sparse data. In Proceedings of the AAAI Fall Symposium, Cambridge, MA, 164-171.

Young S. Han, Young G. Han, and Key-Sun Choi. 1992. Recursive Markov chain as a stochastic grammar. In Proceedings of a SIGLEX Workshop, Columbus, Ohio, 22-31.

Young S. Han, Young C. Park, and Key-Sun Choi. 1995. Efficient inferencing for sigmoid Bayesian networks. To appear in Applied Intelligence.

Radford M. Neal. 1992. Connectionist learning of belief networks. Artificial Intelligence, 56, 71-113.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers.

Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In Proceedings of the Annual Meeting of the ACL.