Collocation Extraction Using Square Mutual Information Approaches. Received December 2010; revised January 2011

International Journal of Knowledge and Language Processing, www.ijklp.org
KLP International ©2011, ISSN 2191-2734
Volume 2, Number 1, January 2011, pp. 53-58

Collocation Extraction Using Square Mutual Information Approaches

Huarui Zhang 1, Yongwei Zhang 2 and Jingsong Yu 3

1 Institute of Computational Linguistics, Peking University, Beijing, China
hrzhang@pku.edu.cn

2,3 School of Software and Microelectronics, Peking University, Beijing, China
2 zhangywbb@gmail.com, 3 yjs@ss.pku.edu.cn

Received December 2010; revised January 2011

ABSTRACT. Mutual Information (MI) was proposed long ago as a measure of collocation. Although it is still widely applied in various fields, it has the disadvantage of heavily favoring rarely occurring items. A new improved Square Mutual Information approach is proposed to solve this problem. Experimental results show that the precision of the new method is better than that of MI and of other modified approaches, such as the combination of external and internal measures. Another advantage of the new approach is that it remains language independent.

Keywords: Collocation, association measure, square mutual information, improved square mutual information

1. Introduction. The statistical approach to collocation extraction has been a dominant trend for years, from [4, 9, 6] to [5, 7, 1]. Mutual Information (MI) is one of the earliest and most widely used measures, and is referred to by the majority of research papers on collocation extraction. In [8], a total of 82 association measures are empirically tested, 6 of which are mutual information and derived measures. However, the new approach proposed in this paper is not found in that list.

Our main interest lies in the improvement of mutual information related measures. One intuitive motivation is that mutual information originates from information theory, and many information-theoretic approaches have been quite successful in NLP. Another motivation, from the opposite direction, is that mutual information is sometimes considered a poor measure for collocation extraction. Despite its disadvantage of heavily favoring rarely occurring items, we think that MI can be improved to get better performance. We will first review one such attempt to modify MI [2, 3].
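
The bias toward rare items can be made concrete with a small worked example. The sketch below assumes the usual pointwise form of MI from [4], log2(p(xy) / (p(x) p(y))), estimated from raw counts. The corpus size and all counts are hypothetical values chosen for illustration, and the function name `pmi` is not taken from the paper.

```python
import math


def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise MI, log2(p(xy) / (p(x) * p(y))), estimated from raw counts."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))


N = 1_000_000  # hypothetical corpus size (number of bigram positions)

# A pair seen once whose parts are each seen only once (quite possibly noise).
rare = pmi(f_xy=1, f_x=1, f_y=1, n=N)

# A pair seen 500 times whose parts are fairly common words.
frequent = pmi(f_xy=500, f_x=2_000, f_y=1_000, n=N)

print(f"rare pair:     {rare:.2f}")
print(f"frequent pair: {frequent:.2f}")
```

With these counts the single-occurrence pair scores roughly 19.9 and the pair seen 500 times roughly 8.0, so a ranking by MI alone would place the rare pair far above the well-attested one.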

2. Unithood: Chen's approach. Chen [2, 3] calculates a unithood measure by combining an external measure and an internal measure. The external measure is based on two rates: the left dependent rate (LD) and the right dependent rate (RD).

$$ LD(w_1 \ldots w_n) = \frac{\max_{a \in A} f(a\, w_1 \ldots w_n)}{f(w_1 \ldots w_n)} \qquad (1) $$

$$ RD(w_1 \ldots w_n) = \frac{\max_{b \in B} f(w_1 \ldots w_n\, b)}{f(w_1 \ldots w_n)} \qquad (2) $$

where $w = w_1 w_2 \ldots w_n$, $f(w)$ is the frequency of a string $w$, $A$ is the full set of left neighbor elements of $w$, $a$ is any element of $A$, $B$ is the full set of right neighbor elements of $w$, and $b$ is any element of $B$. The external measure, denoted IDR (independent rate), is given by

$$ IDR(w_1 \ldots w_n) = \left(1 - \frac{1}{f(w_1 \ldots w_n)}\right) \left(1 - LD(w_1 \ldots w_n)\right) \left(1 - RD(w_1 \ldots w_n)\right) \qquad (3) $$

The internal measure is based on ConnectRate($w_i w_{i+1}$), which is defined in terms of the probabilities $p(w_i w_{i+1})$, $p(w_i)$ and $p(w_{i+1})$.

[ConnectRate formula: illegible in the source]

The minimum of ConnectRate($w_i w_{i+1}$), denoted MConnectRate($w_1 \ldots w_n$), is the internal measure.

$$ MConnectRate(w_1 \ldots w_n) = \min_{1 \le i \le n-1} ConnectRate(w_i w_{i+1}) $$

The final unithood measure, denoted UnitRate($w_1 \ldots w_n$), is the product of the external measure IDR($w_1 \ldots w_n$) and the internal measure MConnectRate($w_1 \ldots w_n$).

$$ UnitRate(w_1 \ldots w_n) = IDR(w_1 \ldots w_n) \times MConnectRate(w_1 \ldots w_n) $$

It can be seen that ConnectRate($w_i w_{i+1}$) is a transformation of MI and can be derived from MI directly. This suggests that Chen's approach also belongs to the MI family, and we will compare its results with those of our new method.

3. Improved square mutual information: New approach. We add a new term to square MI, which increases the influence of high-frequency combinations on a logarithmic scale. The bigram version is given by

$$ SquareMI(x, y) = \log \frac{f(xy)^2 \, \log(1 + f(xy))}{f(x)\, f(y)} $$

where $x$ and $y$ are the adjacent parts of the combination $xy$, $f(x)$ and $f(y)$ are the frequencies of parts $x$ and $y$, and $f(xy)$ is the frequency of the combination $xy$. The n-gram version is

$$ SquareMI(w_1, \ldots, w_n) = \log \frac{f(w_1 \ldots w_n)^n \, \log(1 + f(w_1 \ldots w_n))}{\prod_{i=1}^{n} f(w_i)} $$

where $w = w_1 w_2 \ldots w_n$, $f(w_i)$ is the frequency of part $w_i$, and $f(w_1 \ldots w_n)$ is the frequency of the combination $w$.

4. Results and Discussion. The evaluation data has two parts. The first part is the People's Daily Corpus (January 1998), segmented and annotated by the Institute of Computational Linguistics, Peking University. The second part is Financial Times (http://www.ftchinese.com/), mainly Chinese text translated from original English text.

The evaluation is based on the following assumption: the connection between collocations and words is similar to that between words and Chinese characters. If a method is suitable for extracting words from Chinese character combinations, then it is suitable for extracting collocations from word combinations.

TABLE 1. Comparison of precisions

Number of collocations   Mutual Information (%)   UnitRate (%)   Square MI (%)
Top 100                  68.00                    86.00          95.00
Top 500                  69.60                    87.58          88.18
Top 1000                 66.70                    81.60          87.20
Top 5000                 63.02                    67.34          76.10
Top 10000                58.46                    58.75          64.75
Top 15000                53.29                    53.55          57.32
Top 21296                47.92                    49.15          50.26

The top 21296 terms are selected for evaluation, in parallel with Chen's approach (denoted UnitRate hereafter) for better comparability, as shown in Table 1.

The precision changes with the number of collocations selected. In Figures 1, 2, and 3, the horizontal axis is the number of collocations (in units of 100), while the vertical axis is precision. From Figure 1 we can see that our improved square mutual information approach is better than both Chen's method and the pointwise mutual information method.

FIGURE 1. Comparison with MI and UnitRate.

In [2], Chen's method achieved higher precision than we obtained by repeating it. One conjecture is that preprocessing and/or postprocessing was done before or after the extraction. After we remove the extracted words that contain Chinese characters from the stop list, the precision curve becomes that of Figure 2.

FIGURE 2. Comparison with UnitRate after filtering.

From Figure 2 we can see that, after the removal of words containing Chinese characters from the stop list, Chen's method gets much closer to our improved square mutual information method.

Figure 3 shows how the precision curve of our improved square mutual information method changes before and after the removal of words containing stop-list Chinese characters. The change is minor, which suggests that our method does better even before filtering is applied. This means that our method is more effective and can be language independent.
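
For reference, the improved square MI score of Section 3 can be sketched in Python as follows. This is a minimal illustration that assumes the n-gram form given there, log( f(w_1…w_n)^n · log(1 + f(w_1…w_n)) / ∏ f(w_i) ), with natural logarithms (the paper does not state the base). The function name `improved_square_mi` and all counts are hypothetical.

```python
import math
from typing import Sequence


def improved_square_mi(f_ngram: int, part_freqs: Sequence[int]) -> float:
    """Score log( f^n * log(1 + f) / prod_i f(w_i) ) for one candidate n-gram.

    f_ngram is the frequency of the whole combination and part_freqs holds
    the frequencies of its n parts. Natural logs are an assumption here.
    """
    n = len(part_freqs)
    if n < 2 or f_ngram < 1 or min(part_freqs) < 1:
        raise ValueError("need at least two parts and positive frequencies")
    # Evaluate in log space so that large counts cannot overflow.
    return (
        n * math.log(f_ngram)
        + math.log(math.log(1 + f_ngram))
        - sum(math.log(f) for f in part_freqs)
    )


# Hypothetical candidates: (combination frequency, frequencies of its parts).
candidates = {
    "rare pair": (1, [1, 1]),
    "frequent pair": (500, [2_000, 1_000]),
    "frequent trigram": (300, [2_000, 1_000, 900]),
}

scores = {name: improved_square_mi(f, parts) for name, (f, parts) in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {score:8.3f}")
```

On these toy counts the pair seen 500 times scores above the pair seen once. A ranking by plain MI would put the single-occurrence pair first instead, since MI is maximal when both parts occur only inside the pair.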

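Precision at a cutoff, as reported in Table 1, can be computed along the following lines. The sketch assumes a ranked list of extracted candidates and a gold set of accepted items (for instance, the words of a segmented corpus, following the assumption stated above). The function name `precision_at_k` and the sample data are hypothetical.

```python
from typing import Iterable, Sequence


def precision_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Percentage of the k highest-ranked candidates that appear in gold."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked[:k]
    if not top:
        raise ValueError("no candidates to evaluate")
    gold_set = set(gold)
    hits = sum(1 for candidate in top if candidate in gold_set)
    return 100.0 * hits / len(top)


# Hypothetical extractor output, best candidate first, and a gold set.
ranked_candidates = ["ab", "cd", "ef", "gh"]
gold_items = {"ab", "ef", "xy"}

for k in (1, 2, 4):
    print(f"Top {k}: {precision_at_k(ranked_candidates, gold_items, k):.2f}%")
```
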
FIGURE 3. Improved Square MI (before and after filtering).

Expert Evaluation: A randomly chosen sample of the result is manually checked by human experts, and the approved percentage is shown in Table 2.

TABLE 2. Comparison of expert evaluation

Number of collocations   UnitRate (%)   Square MI (%)
Top 100                  82             84
Top 500                  72             78
Top 1000                 58             63
Top 3000                 53             56
Top 5000                 40             43
Top 10000                38             38

From these comparisons, we find that our improved square mutual information approach obtains better precision in collocation extraction.

5. Conclusions. The new improved square mutual information approach consistently outperforms the pointwise mutual information method. Although simpler than Chen's approach, our approach is still more effective than Chen's when no filter is applied. Human evaluation on a randomly chosen sample also confirms the advantage of the new approach.

Acknowledgment. This work is partially based on the segmented and annotated Chinese corpus developed by the Institute of Computational Linguistics at Peking University under the leadership of Professor Shiwen YU.

REFERENCES

[1] I. A. Bolshakov, E. I. Bolshakova, A. P. Kotlyarov and A. Gelbukh, Various Criteria of Collocation Cohesion in Internet: Comparison of Resolving Power, Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol.4919, pp.64-72, 2010.
[2] Chen Yirong, The Research on Automatic Chinese Term Extraction Integrated with Unithood and Domain Feature, Master Thesis in Peking University, Beijing, 2005.
[3] Yirong Chen, Qin Lu, Wenjie Li, Zhifang Sui and Luning Ji, A Study on Terminology Extraction Based on Classified Corpora, Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pp.2383-2386, 2006.
[4] K. Church and P. Hanks, Word association norms, mutual information and lexicography, Computational Linguistics, vol.16, no.1, pp.22-29, 1990.
[5] S. Evert, The Statistics of Word Cooccurrences: Word Pairs and Collocations, PhD dissertation, IMS, University of Stuttgart, 2004.
[6] C. Manning and H. Schutze, Foundations of statistical natural language processing, MIT Press, Cambridge, MA, 1999.
[7] B. T. McInnes, Extending the Log Likelihood Measure to Improve Collocation Identification, M.S. Thesis, Department of Computer Science, University of Minnesota, Duluth, 2004.
[8] P. Pecina, Lexical association measures and collocation extraction, Lang Resources & Evaluation, vol.44, pp.137-158, 2010.
[9] J. Pustejovsky, P. Anick, and S. Bergler, Lexical semantic techniques for corpus analysis, Computational Linguistics, vol.19, no.2, pp.331-358, 1993.
