The BBN Crosslingual Topic Detection and Tracking System


Tim Leek, Hubert Jin, Sreenivasa Sista, Richard Schwartz
BBN Technologies, Cambridge, MA

ABSTRACT

This was the first year that the TDT program included a required crosslingual test: English and Mandarin. Most of our work, therefore, was to adapt our tracking and detection systems to work on a corpus of documents in these two languages. To this end, we worked both on quick, adequate translation and on the modifications necessary to our systems to attain good performance in this crosslingual domain. We started by building simple term translation systems, and ended with more complicated ones that made sensible use of knowledge of the target language. Working from a parallel corpus of aligned sentences, we estimated prior term translation probabilities. Additionally, we devised an algorithm for iteratively refining a translation, via word co-occurrence statistics, in order to choose the most consistent translation. The tracking and detection systems themselves were left largely unchanged, with the exception that we estimated score normalization statistics separately for documents in different languages. We were pleasantly surprised to find that the performance on tracking and detection using our translation system was about the same as when we availed ourselves of the supplied machine translations of Mandarin.

1. Translating for TDT

It was our belief that the introduction of crosslingual tasks into the TDT evaluation meant we needed to build a translation system. It is worth noting that both time and resources were limited. We had only a few months to do this work, and were provided the sort of resources you'd expect when asked to work in a sparse language: an errorful translation dictionary, a parallel corpus not quite from the right domain, and various mismatched word lists and concordances. But this seemed to be the main thrust of TDT3: how well and how quickly can you adapt your TDT systems to work on an additional language, given limited resources?
1.1. A Simple Term Translation System

We decided that the first pass at a solution should be to translate all the Mandarin documents into English using a bilingual dictionary and then run our monolingual tracking and detection systems as usual. Obviously, this is not going to give the best result. But, since it is the easiest thing to do, it is important to have it as a benchmark against which to compare other, more elaborate solutions. Our translation system used a very crude algorithm:

1. Segment the original Mandarin document into words.
2. Look up each Mandarin word in the bilingual dictionary. If it is in the dictionary, make a bag of all the English words from all the translations. Else, throw it away.
3. Each Mandarin document is then just a big bag of English words.

We were fortunate to have available two systems for segmenting Mandarin into words, one provided by the LDC and another we at BBN had developed as part of a Mandarin Information Extraction system. Upon inspection, the output of the LDC segmenter appeared more appropriate for input to a translation system. The BBN segmenter tended to leave named entities (persons, locations, organizations, etc.) unsegmented. This is valuable behavior when name-finding, but it often renders these crucial terms untranslatable. In addition to trying these different programs for segmenting the Mandarin, we also experimented with segmenting greedily (always take the longest next word that is in the dictionary) and segmenting all ways at once (all substrings of fewer than M characters).

Notice at least two obvious but significant failings of this sort of simple-minded approach to term translation:

1. We are throwing away too much of the original Mandarin document. Our estimate is that 25% of the words don't translate. Only about 5% of these are due to errors in segmentation. About half are names.
2. By translating a Mandarin word as the entirety of the English half of its dictionary entry, we are weighting the words in the translation in precisely the wrong way. Common words typically translate many ways. Uncommon words typically translate very few ways.
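A minimal sketch of this crude pipeline, before the refinements discussed next; the toy dictionary and the greedy longest-match segmenter are illustrative stand-ins, not the actual LDC or BBN resources:

```python
def greedy_segment(text, dictionary, max_len=4):
    """Greedy segmentation: always take the longest next substring
    that appears in the dictionary; skip unknown characters."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                words.append(text[i:i + n])
                i += n
                break
        else:
            i += 1  # no dictionary entry covers this character: throw it away
    return words

def translate_to_bag(text, dictionary):
    """Translate a document into a bag of English words by emitting
    every English word of every translation of each segmented word."""
    bag = []
    for w in greedy_segment(text, dictionary):
        bag.extend(dictionary[w])
    return bag
```

Note how the second failing above arises directly: a word with many dictionary translations dumps many English words into the bag, so common source words end up over-weighted.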
Together, these two facts conspire to mean that we are increasing the weight for common words and decreasing it for uncommon words. This is a bad idea. Common words are generally not useful for TDT, while uncommon words are generally very important. The first of these problems can only be addressed by increasing the coverage of the Mandarin words by the term translation system. The second problem is best solved by introducing the notion of translating with probability.

1.2. Extended Word Lookup Translation

We can solve the coverage problem by translating and segmenting jointly, i.e. choosing a segmentation that increases dictionary coverage. We decided to start with the LDC segmentation and then repair it. Roughly half of the untranslated words are names. In many cases, the correct translation for a Chinese name into English is simply the Pinyin spelling for the sequence of characters. We made use of this target-language knowledge, as well as a list of Chinese surnames, to refine our dictionary lookup algorithm.

1. If the untranslated word begins with a Chinese surname, write out the surname and given name in Pinyin.
2. Else, greedily subsegment using the dictionary.

This solution does increase coverage; every word in the original now translates. However, we still have the word-weighting problem.

1.3. Probabilistic Term Translation

A coarse solution to the word-weighting problem is simply to mandate that the English translation have the same total number of words as the original Mandarin. We can achieve this by giving each English word a fractional count inversely proportional to the total number of words in the English translation. This will mean that the three-character word that translates only one way (in the bilingual dictionary we had to work with), as "Mao Zedong", will result in two English words with half a count each. Likewise, the two-character word that translates eight ways, as various forms of "haven't" and "to not be", will result in thirteen English words with 1/13 of a count each.

A better solution involves estimating non-uniform prior translation probabilities for these individual words. Not all translations are equally likely. For instance, the Chinese word for America translates ten ways. Clearly, though, "America" is a more likely translation than "yankeedom". We implemented an iterative procedure to estimate non-uniform prior translation probabilities using the observations of aligned sentences in the parallel corpus of Hong Kong Laws:

1. Initialize $P(E_j|C) = \frac{1}{N_t(C)}$
2. Count $C(E_j, C) = \sum_{(S_c, S_e)} \sum_{C \in S_c,\, E_j \in S_e} P(E_j|C)$
3. Re-estimate $P(E_j|C) = a\, P(E_j|C) + (1-a)\, \frac{C(E_j, C)}{\sum_j C(E_j, C)}$, using $a = \frac{x}{x + \sum_j C(E_j, C)}$
4. Iterate.

Here $E_j$ is the $j$th translation for Chinese word $C$, $N_t(C)$ is the number of translations for $C$, $(S_c, S_e)$ are a pair of aligned Chinese and English sentences, and $x$ is a parameter that governs the rate at which probabilities are updated. There are about 230,000 sentences and about 8 million English words in this parallel corpus. Unfortunately, its domain is law, so we can expect it to be useful only for estimating translation probabilities for words likely to occur in laws.
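The iterative re-estimation above can be sketched as follows; the toy corpus, the value of x, and the iteration count are illustrative assumptions, not the settings used on the Hong Kong Laws data:

```python
from collections import defaultdict

def estimate_priors(pairs, dictionary, x=10.0, iters=5):
    """pairs: list of (chinese_sentence_words, english_sentence_words).
    dictionary: Chinese word -> list of English translations."""
    # 1. Initialize P(E_j|C) uniformly: 1 / N_t(C).
    P = {c: {e: 1.0 / len(es) for e in es} for c, es in dictionary.items()}
    for _ in range(iters):
        # 2. Count C(E_j, C): for each aligned pair with C in S_c and
        #    E_j in S_e, accumulate the current probability P(E_j|C).
        counts = defaultdict(lambda: defaultdict(float))
        for s_c, s_e in pairs:
            for c in s_c:
                if c not in P:
                    continue
                for e in P[c]:
                    if e in s_e:
                        counts[c][e] += P[c][e]
        # 3. Re-estimate: interpolate the old prior with the normalized
        #    counts, with a = x / (x + total count), so that words seen
        #    rarely in the parallel text move away from uniform slowly.
        for c, trans in P.items():
            total = sum(counts[c].values())
            if total == 0:
                continue  # word never observed in the corpus: stay uniform
            a = x / (x + total)
            for e in trans:
                trans[e] = a * trans[e] + (1 - a) * counts[c][e] / total
    return P
```

Because the interpolation weights sum to one, each word's translation distribution remains properly normalized across iterations.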
In particular, we noted that our procedure only assigned non-uniform translation probabilities to 4,979 of the 128,365 Mandarin entries. This is not quite as bad as it sounds; 95,511 of the Mandarin entries have only one possible translation. Our re-estimation procedure gave non-uniform translation probabilities to 15% of the words that have more than one translation.

1.4. Co-occurrence Statistics

Additionally, we devised an algorithm for iteratively improving a translation using co-occurrence statistics. When we have alternate translations for a Mandarin word, the algorithm tends to favor those that are consistent with the rest of the document.

1. Estimate co-occurrence probabilities $P(E_1 \in W \mid E_2 \in W)$ from a large background corpus, where $W$ is some window of words.
2. Create an initial translation using the prior term translation probabilities. The translation is a probabilistic bag of words.
3. For each Mandarin word with more than one translation, estimate the posterior probability, $P(E_j \mid C, \tilde{T})$, of each English translation given the original Mandarin word and the rest of the probabilistic translation so far.
4. Iterate, replacing the prior term translation probabilities with the posterior probabilities.

There are many ways of estimating the posterior. One would be to use Bayes' rule to rearrange things until we have quantities we can estimate well:

$$P(E_j \mid C, \tilde{T}) \propto P(E_j \mid C)\, \frac{P(E_j \mid \tilde{T})}{P(E_j)} \qquad (1)$$

$P(E_j \mid C)$ is just the prior translation probability. We can estimate $P(E_j \mid \tilde{T})$ with a generative model. In this model, first, we choose a word $C'$ from the original Mandarin document, according to $P(C' \mid Orig)$. Second, we choose an English translation according to $P(E' \mid C')$. Third and finally, we choose to generate a different English word $E_j$, according to the co-occurrence probability $P(E_j \in W \mid E' \in W)$. The full equation follows:

$$P(E_j \mid \tilde{T}) \approx \sum_{C' \neq C} \sum_{E' \neq E_j} P(E_j \in W \mid E' \in W)\, P(E' \mid C')\, P(C' \mid Orig) \qquad (2)$$

2. Crosslingual TDT Systems

2.1. Tracking System Overview

We have developed several approaches to topic tracking, all Bayesian.
The two most successful are known as the topic spotting (TS) system and the information retrieval (IR) system. These systems are described in more detail in [3].

The TS tracking system is based upon our work developing the OnTopic topic classification system [1]. The raw score is a log-likelihood ratio: it represents how much more probable a document is under the hypothesis that it is on the topic, compared with how probable it is under the null hypothesis, i.e. that it is not relevant to the topic. Assuming that words are generated independently, we can approximate this log-likelihood ratio as

$$score_{TS} = \log \frac{P(D \mid T)}{P(D)} \approx \sum_{w \in D} \log \frac{P(w \mid T)}{P(w)} \qquad (3)$$

$P(w)$ is the probability of $w$ estimated on some large background corpus. $P(w \mid T)$ is formed by pooling the words in documents that have $T$ as a topic, and then re-estimating these probabilities using an iterative EM-like procedure that tends to increase the likelihood of the data given the model.

The IR tracking system is based upon our work developing the BBN IR system [4]. We use the training documents for a topic to form a large query, and compute the posterior probability that a test document is relevant given the query. Using Bayes' rule to rewrite this in terms of quantities that are easier to estimate, we have

$$P(D \text{ is } R \mid T) = \frac{P(T \mid D \text{ is } R)\, P(D \text{ is } R)}{P(T)} \qquad (4)$$

$P(T)$ is constant across test documents. While $P(D \text{ is } R)$, the prior probability that a document is relevant, could be modeled so that it would differ from document to document, we choose to leave it as a constant. So we can safely compose the IR tracking score simply out of the conditional probability of the query $T$ being generated, under the hypothesis that $D$ is a document relevant to the query. We construct $P(T \mid D \text{ is } R)$ as a mixture model, one state generating the words in the query by drawing from the document, according to $P(w \mid D)$, and the other by drawing from a background corpus, according to $P(w)$:

$$score_{IR}(D, T) = \log P(T \mid D \text{ is } R) \qquad (5)$$
$$\approx \sum_{w \in T} \log\left(a\, P(w \mid D) + (1 - a)\, P(w)\right) \qquad (6)$$

where $a$ can be either a constant or a function of features of $w$.

It is a requirement of TDT that the score we give for a test document be comparable across topics. We therefore normalize a document's score using the statistics of the scores of thousands of known off-topic, or "no", documents:

$$score'(D, T) = \frac{score(D, T) - \mu_{no}}{\sigma_{no}} \qquad (7)$$

where $\mu_{no}$ is the mean score of all "no" documents, and $\sigma_{no}$ is the standard deviation of "no" document scores.

Our tracking system also adapts, unsupervised, to test documents extremely likely to be on-topic. Any test document with a normalized score higher than some threshold is added to the set of training examples, and we re-estimate $P(w \mid T)$ with all the examples.

We combine the scores from the TS and IR systems to form our final result, using logistic regression, training the combination weights on previous corpora and tests. This combination system gives the best results.
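Equations (3), (6), and (7) can be sketched as follows; the word probabilities and the mixture weight a below are toy values, not the trained models:

```python
import math

def score_ts(doc, p_w_given_t, p_w):
    """Eq. (3): log-likelihood ratio, assuming words are generated
    independently. doc is a list of words; the dicts hold P(w|T), P(w)."""
    return sum(math.log(p_w_given_t[w] / p_w[w]) for w in doc)

def score_ir(doc_model, query, p_w, a=0.3):
    """Eq. (6): log probability of the query under a mixture of the
    document model P(w|D) and the background model P(w)."""
    return sum(math.log(a * doc_model.get(w, 0.0) + (1 - a) * p_w[w])
               for w in query)

def normalize(score, off_topic_scores):
    """Eq. (7): z-normalize a score against the scores of documents
    known to be off-topic ("no" documents)."""
    n = len(off_topic_scores)
    mean = sum(off_topic_scores) / n
    var = sum((s - mean) ** 2 for s in off_topic_scores) / n
    return (score - mean) / math.sqrt(var)
```

The mixture in `score_ir` keeps the score finite for query words absent from the document, since the background term contributes mass for every word.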
Unless otherwise noted, that will be the result reported here.

2.2. Detection System Overview

The detection system, described in detail elsewhere [2], uses the tracking TS score to measure the distance between a document and a cluster. But here, we must normalize twice if our score is to be comparable both across test documents and across clusters (topics). Only then will it be a score we can use along with a threshold to make a binary decision about whether to add a test document to a cluster or to use it as the seed for a new cluster.

First, we compute the IR scores of many background documents against the cluster, find the mean and standard deviation, $\mu_{NO_D}$ and $\sigma_{NO_D}$, and use equation 7 to give us a score that is normalized with respect to documents certain to be off-topic.

Table 1: Effect of Translation: monolingual Chinese tracking (conditions: Not Translated; Translated by SYSTRAN; Translated by BBN). Translating to English results in a loss of 24% to 30%.

Second, we compute the normalized scores for the test document with respect to many background clusters, and again find the mean and standard deviation, $\mu_{NO_C}$ and $\sigma_{NO_C}$, of those scores. Then we normalize a second time, again using equation 7, now with respect to clusters of various sizes, all of which are unlikely to be on the same topic as the test document. This two-level normalization is critical to detection performance.

2.3. Crosslingual TDT Issues

We claimed, above, that our approach would be to translate all of the Mandarin documents into English and then simply use our monolingual tracking and detection systems as-is. This is almost true. In fact, we keep track of the original language of each document and make use of that information while tracking. We can't really expect the score for a translated Mandarin test document to have the same statistics as the score for an English test document. So, we compute normalization statistics separately for the two languages and amend our normalization formula to

$$score'(D, T) = \frac{score(D, T) - \mu_{no}^{L(D)}}{\sigma_{no}^{L(D)}} \qquad (8)$$

where $L(D)$ is the native language of document $D$.
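The amended formula amounts to keeping separate off-topic score statistics per language; a minimal sketch, with invented language tags and scores:

```python
import math

def language_norm_stats(off_topic):
    """off_topic: list of (language, score) pairs for documents known
    to be off-topic. Returns per-language (mean, std dev)."""
    by_lang = {}
    for lang, s in off_topic:
        by_lang.setdefault(lang, []).append(s)
    stats = {}
    for lang, scores in by_lang.items():
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / len(scores)
        stats[lang] = (mean, math.sqrt(var))
    return stats

def normalized_score(score, lang, stats):
    """Eq. (8): normalize a document's score with the off-topic
    statistics of the document's own native language."""
    mean, std = stats[lang]
    return (score - mean) / std
```

With this in place, a translated Mandarin document and an English document that are equally (un)likely to be on-topic map to comparable normalized scores, even though their raw score distributions differ.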
3. Performance

We present results both for development experiments and for the TDT3 evaluation. Note that Ctrack numbers are incomparable between development experiments, since the corpus, the test, and the evaluation measures were all in flux over the time period.

3.1. Effect of Translation

Our final crosslingual system made use of the re-estimated probabilistic translation dictionary. We had better results when we simply chose the highest-probability translation for a Mandarin word, instead of including all translations with fractional counts. There was not time to run experiments with the co-occurrence re-estimation translation refinement algorithm.

In order to understand the loss due to translation, we ran tracking experiments on just the Mandarin documents in the development corpus (TDT2). We ran our tracking system three times. The first run was without translation. The second was with the supplied SYSTRAN (commercial MT) translation. And the third was with BBN's term translation system. This comparison is in table 1. The performance of the tracking system is worse by 30% relative when we translate with our system and worse by 24% relative when we use the SYSTRAN translation. So it is certainly true that translation is hurting. What is remarkable is that the difference in performance between SYSTRAN and the BBN translation is so small, considering how much effort has gone into developing SYSTRAN and how simple the BBN translation was.

Table 2: Within-Language Score Normalization: tracking results, using the simple translation system (no probabilities), TS tracking system only. Conditions (training / test): ENG / ENG+MAN; MAN / ENG+MAN; ENG+MAN / ENG+MAN, each scored with one-way and two-way normalization. Gain is largest when tracking involves constructing models only from translated Mandarin documents.

It occurred to us as a result of this analysis that it would be worthwhile to spend some time and thought integrating our models of translation and TDT. Because we chose to translate everything into English and then simply work monolingually, we lose in several ways. Obviously, whenever we translate a document, we introduce noise. What is perhaps less obvious is that we are translating far more often than we need to. When we have training in both languages, surely it is preferable always to compare a test document to a model in the correct language, rather than to one built from translated examples. And even when we have training in only one language, once adaptation has discovered an on-topic example in the other language, it will be better to use that for future comparisons. In fact, we concluded, we should take a symmetric view of the problem. We should translate every document into the other language, for when we need to make cross-lingual comparisons. And we should maintain distinct topic models in both languages, being careful always to compare a test document to a model in the appropriate language. This means we would be translating as little as possible and should improve results dramatically.

3.2. Within-Language Score Normalization

Score normalization by language was an important gain (see table 2). When the training was translated Mandarin documents, two-way normalization improved Ctrack by 55%. When the training was English documents, it improved Ctrack by 5%. The dramatic difference between these two gains is probably due to the fact that the development corpus had a very small proportion of Mandarin documents.

3.3. Evaluation Results

The final evaluation results for TDT3 were quite good.
From table 3, we see that the tracking result, as in our development experiments, is about the same when we use our own simple translation system as it is when we use SYSTRAN.

Table 3: Tracking Evaluation Result (training: E, C, E+C; translation: SYS, BBN). Performance is about the same for SYSTRAN and BBN translation. The loss for using Chinese instead of English training is small. There is a gain for adding Chinese training to English. English training is the primary result here.

Table 4: Detection Evaluation Result. Again, performance is about the same for the two kinds of translation. Chinese monolingual performance is better than English monolingual, for some reason.

As we'd expect, our performance is worse the more we translate. But we see that our system is definitely able to make use of the translated examples, achieving a fairly respectable Ctrack when all of the training examples are in Chinese, and always improving, for both SYSTRAN and BBN's translation, when translated Chinese examples are added to the English ones. The detection evaluation result, shown in table 4, shows a similar result as for tracking with respect to translation. Performance using SYSTRAN is about on par with the BBN translation. In particular, it is virtually the same in the multilingual case.

4. Conclusions

We have demonstrated that it is possible to build a full-fledged crosslingual TDT system in a few months with limited resources. It is clear that it is not necessary to have a mature commercial machine translation system embodying deep linguistic knowledge for IR-like applications such as TDT. Our TDT systems did about as well using our translation as they did using the supplied SYSTRAN translation. Simple term translation with a bilingual dictionary and some care about word weighting is as good, for the purposes of TDT, as commercial MT. This is likely because all TDT systems, at some level, just count the number of word matches between documents. Successful TDT systems consider matches for very uncommon words better evidence than matches for common words.
Term translation makes the fewest errors when translating uncommon words. With hindsight, it is obvious that this sort of approach will work fairly well. However, at the same time, it is clear that there is much work left to be done. We have shown that translating results in a loss of about 30% (table 1). So even though our term translation is as good as SYSTRAN's, it could be much better. And it appears that there is a loss of about 50% for working crosslingually (table 4). It is likely that the choice to translate everything into English and work monolingually costs dearly. If this is the case, future work should revolve around implementing the symmetric view of the problem that integrates translation and TDT, presented above. There, we proposed translating every document into the other language and always comparing a test document to a model in its native language, thus effectively minimizing the amount of translation we are forced to do.

5. Acknowledgements

This work was supported by the Defense Advanced Research Projects Agency and monitored by Ft. Huachuca under contract No. DABT63-94-C-0063, and by the Defense Advanced Research Projects Agency and monitored by NRaD under contract No.

The views and findings contained in this material are those of the authors and do not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

References

1. Schwartz, R., Imai, T., Nguyen, L., and Makhoul, J., "A Maximum Likelihood Model for Topic Classification of Broadcast News", in Proc. Eurospeech, Rhodes, Greece, September 1997.
2. Walls, F., Jin, H., Sista, S., and Schwartz, R., "Topic Detection in Broadcast News", in Proceedings of the DARPA Broadcast News Workshop, Herndon, VA, 1999.
3. Jin, H., Schwartz, R., Sista, S., and Walls, F., "Topic Tracking for Radio, TV Broadcast, and Newswire", in Proceedings of the DARPA Broadcast News Workshop, Herndon, VA, 1999.
4. Miller, D., Leek, T., and Schwartz, R., "A Hidden Markov Model Information Retrieval System", in Proceedings of ACM SIGIR, 1999.


UNIVERSITY OF TORONTO Faculty of Arts and Science. December 2005 Examinations STA437H1F/STA1005HF. Duration - 3 hours UNIVERSITY OF TORONTO Faculty of Arts and Scence December 005 Examnatons STA47HF/STA005HF Duraton - hours AIDS ALLOWED: (to be suppled by the student) Non-programmable calculator One handwrtten 8.5'' x

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

Lecture Nov

Lecture Nov Lecture 18 Nov 07 2008 Revew Clusterng Groupng smlar obects nto clusters Herarchcal clusterng Agglomeratve approach (HAC: teratvely merge smlar clusters Dfferent lnkage algorthms for computng dstances

More information

Lecture 4: Universal Hash Functions/Streaming Cont d

Lecture 4: Universal Hash Functions/Streaming Cont d CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced, FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then

More information

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen

Hopfield networks and Boltzmann machines. Geoffrey Hinton et al. Presented by Tambet Matiisen Hopfeld networks and Boltzmann machnes Geoffrey Hnton et al. Presented by Tambet Matsen 18.11.2014 Hopfeld network Bnary unts Symmetrcal connectons http://www.nnwj.de/hopfeld-net.html Energy functon The

More information

8 Derivation of Network Rate Equations from Single- Cell Conductance Equations

8 Derivation of Network Rate Equations from Single- Cell Conductance Equations Physcs 178/278 - Davd Klenfeld - Wnter 2015 8 Dervaton of Network Rate Equatons from Sngle- Cell Conductance Equatons We consder a network of many neurons, each of whch obeys a set of conductancebased,

More information

This column is a continuation of our previous column

This column is a continuation of our previous column Comparson of Goodness of Ft Statstcs for Lnear Regresson, Part II The authors contnue ther dscusson of the correlaton coeffcent n developng a calbraton for quanttatve analyss. Jerome Workman Jr. and Howard

More information

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of

More information

/ n ) are compared. The logic is: if the two

/ n ) are compared. The logic is: if the two STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence

More information

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k. THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

18.1 Introduction and Recap

18.1 Introduction and Recap CS787: Advanced Algorthms Scrbe: Pryananda Shenoy and Shjn Kong Lecturer: Shuch Chawla Topc: Streamng Algorthmscontnued) Date: 0/26/2007 We contnue talng about streamng algorthms n ths lecture, ncludng

More information

= z 20 z n. (k 20) + 4 z k = 4

= z 20 z n. (k 20) + 4 z k = 4 Problem Set #7 solutons 7.2.. (a Fnd the coeffcent of z k n (z + z 5 + z 6 + z 7 + 5, k 20. We use the known seres expanson ( n+l ( z l l z n below: (z + z 5 + z 6 + z 7 + 5 (z 5 ( + z + z 2 + z + 5 5

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

since [1-( 0+ 1x1i+ 2x2 i)] [ 0+ 1x1i+ assumed to be a reasonable approximation

since [1-( 0+ 1x1i+ 2x2 i)] [ 0+ 1x1i+ assumed to be a reasonable approximation Econ 388 R. Butler 204 revsons Lecture 4 Dummy Dependent Varables I. Lnear Probablty Model: the Regresson model wth a dummy varables as the dependent varable assumpton, mplcaton regular multple regresson

More information

THE ROYAL STATISTICAL SOCIETY 2006 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE

THE ROYAL STATISTICAL SOCIETY 2006 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE THE ROYAL STATISTICAL SOCIETY 6 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER I STATISTICAL THEORY The Socety provdes these solutons to assst canddates preparng for the eamnatons n future years and for

More information

STAT 511 FINAL EXAM NAME Spring 2001

STAT 511 FINAL EXAM NAME Spring 2001 STAT 5 FINAL EXAM NAME Sprng Instructons: Ths s a closed book exam. No notes or books are allowed. ou may use a calculator but you are not allowed to store notes or formulas n the calculator. Please wrte

More information

Natural Images, Gaussian Mixtures and Dead Leaves Supplementary Material

Natural Images, Gaussian Mixtures and Dead Leaves Supplementary Material Natural Images, Gaussan Mxtures and Dead Leaves Supplementary Materal Danel Zoran Interdscplnary Center for Neural Computaton Hebrew Unversty of Jerusalem Israel http://www.cs.huj.ac.l/ danez Yar Wess

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Basically, if you have a dummy dependent variable you will be estimating a probability.

Basically, if you have a dummy dependent variable you will be estimating a probability. ECON 497: Lecture Notes 13 Page 1 of 1 Metropoltan State Unversty ECON 497: Research and Forecastng Lecture Notes 13 Dummy Dependent Varable Technques Studenmund Chapter 13 Bascally, f you have a dummy

More information

1 The Mistake Bound Model

1 The Mistake Bound Model 5-850: Advanced Algorthms CMU, Sprng 07 Lecture #: Onlne Learnng and Multplcatve Weghts February 7, 07 Lecturer: Anupam Gupta Scrbe: Bryan Lee,Albert Gu, Eugene Cho he Mstake Bound Model Suppose there

More information

8 Derivation of Network Rate Equations from Single- Cell Conductance Equations

8 Derivation of Network Rate Equations from Single- Cell Conductance Equations Physcs 178/278 - Davd Klenfeld - Wnter 2019 8 Dervaton of Network Rate Equatons from Sngle- Cell Conductance Equatons Our goal to derve the form of the abstract quanttes n rate equatons, such as synaptc

More information

Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting

Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting Onlne Appendx to: Axomatzaton and measurement of Quas-hyperbolc Dscountng José Lus Montel Olea Tomasz Strzaleck 1 Sample Selecton As dscussed before our ntal sample conssts of two groups of subjects. Group

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6 Department of Quanttatve Methods & Informaton Systems Tme Seres and Ther Components QMIS 30 Chapter 6 Fall 00 Dr. Mohammad Zanal These sldes were modfed from ther orgnal source for educatonal purpose only.

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for P Charts. Dr. Wayne A. Taylor

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for P Charts. Dr. Wayne A. Taylor Taylor Enterprses, Inc. Control Lmts for P Charts Copyrght 2017 by Taylor Enterprses, Inc., All Rghts Reserved. Control Lmts for P Charts Dr. Wayne A. Taylor Abstract: P charts are used for count data

More information

Manning & Schuetze, FSNLP (c)1999, 2001

Manning & Schuetze, FSNLP (c)1999, 2001 page 589 16.2 Maxmum Entropy Modelng 589 Mannng & Schuetze, FSNLP (c)1999, 2001 a decson tree that detects spam. Fndng the rght features s paramount for ths task, so desgn your feature set carefully. Exercse

More information

Chapter 6. Supplemental Text Material

Chapter 6. Supplemental Text Material Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.

More information

Probabilistic Information Retrieval CE-324: Modern Information Retrieval Sharif University of Technology

Probabilistic Information Retrieval CE-324: Modern Information Retrieval Sharif University of Technology Probablstc Informaton Retreval CE-324: Modern Informaton Retreval Sharf Unversty of Technology M. Soleyman Fall 2016 Most sldes have been adapted from: Profs. Mannng, Nayak & Raghavan (CS-276, Stanford)

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Density matrix. c α (t)φ α (q)

Density matrix. c α (t)φ α (q) Densty matrx Note: ths s supplementary materal. I strongly recommend that you read t for your own nterest. I beleve t wll help wth understandng the quantum ensembles, but t s not necessary to know t n

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

Bayesian predictive Configural Frequency Analysis

Bayesian predictive Configural Frequency Analysis Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

CHAPTER IV RESEARCH FINDING AND DISCUSSIONS

CHAPTER IV RESEARCH FINDING AND DISCUSSIONS CHAPTER IV RESEARCH FINDING AND DISCUSSIONS A. Descrpton of Research Fndng. The Implementaton of Learnng Havng ganed the whole needed data, the researcher then dd analyss whch refers to the statstcal data

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

Lecture 6: Introduction to Linear Regression

Lecture 6: Introduction to Linear Regression Lecture 6: Introducton to Lnear Regresson An Manchakul amancha@jhsph.edu 24 Aprl 27 Lnear regresson: man dea Lnear regresson can be used to study an outcome as a lnear functon of a predctor Example: 6

More information

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County Smart Home Health Analytcs Sprng 2018 Bayesan Learnng Nrmalya Roy Department of Informaton Systems Unversty of Maryland Baltmore ounty www.umbc.edu Bayesan Learnng ombnes pror knowledge wth evdence to

More information

Uncertainty in measurements of power and energy on power networks

Uncertainty in measurements of power and energy on power networks Uncertanty n measurements of power and energy on power networks E. Manov, N. Kolev Department of Measurement and Instrumentaton, Techncal Unversty Sofa, bul. Klment Ohrdsk No8, bl., 000 Sofa, Bulgara Tel./fax:

More information

Hopfield Training Rules 1 N

Hopfield Training Rules 1 N Hopfeld Tranng Rules To memorse a sngle pattern Suppose e set the eghts thus - = p p here, s the eght beteen nodes & s the number of nodes n the netor p s the value requred for the -th node What ll the

More information

Topic 23 - Randomized Complete Block Designs (RCBD)

Topic 23 - Randomized Complete Block Designs (RCBD) Topc 3 ANOVA (III) 3-1 Topc 3 - Randomzed Complete Block Desgns (RCBD) Defn: A Randomzed Complete Block Desgn s a varant of the completely randomzed desgn (CRD) that we recently learned. In ths desgn,

More information

Mixture o f of Gaussian Gaussian clustering Nov

Mixture o f of Gaussian Gaussian clustering Nov Mture of Gaussan clusterng Nov 11 2009 Soft vs hard lusterng Kmeans performs Hard clusterng: Data pont s determnstcally assgned to one and only one cluster But n realty clusters may overlap Soft-clusterng:

More information

STATISTICS QUESTIONS. Step by Step Solutions.

STATISTICS QUESTIONS. Step by Step Solutions. STATISTICS QUESTIONS Step by Step Solutons www.mathcracker.com 9//016 Problem 1: A researcher s nterested n the effects of famly sze on delnquency for a group of offenders and examnes famles wth one to

More information

Computational Biology Lecture 8: Substitution matrices Saad Mneimneh

Computational Biology Lecture 8: Substitution matrices Saad Mneimneh Computatonal Bology Lecture 8: Substtuton matrces Saad Mnemneh As we have ntroduced last tme, smple scorng schemes lke + or a match, - or a msmatch and -2 or a gap are not justable bologcally, especally

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

Bit Juggling. Representing Information. representations. - Some other bits. - Representing information using bits - Number. Chapter

Bit Juggling. Representing Information. representations. - Some other bits. - Representing information using bits - Number. Chapter Representng Informaton 1 1 1 1 Bt Jugglng - Representng nformaton usng bts - Number representatons - Some other bts Chapter 3.1-3.3 REMINDER: Problem Set #1 s now posted and s due next Wednesday L3 Encodng

More information