A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation


Zhou GuoDong
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
zhougd@i2r.a-star.edu.sg

R. Dale et al. (Eds.): IJCNLP 2005, LNAI 3651, Springer-Verlag Berlin Heidelberg 2005

Abstract. This paper proposes a chunking strategy to detect unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms (footnote 1) using a maximum matching algorithm. Then a chunking model is applied to detect unknown words by chunking one or more word atoms together according to the word formation patterns of the word atoms. In this paper, a discriminative Markov model, named Mutual Information Independence Model (MIIM), is adopted in chunking. Besides, a maximum entropy model is applied to integrate various types of contexts and resolve the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy model can be significantly reduced without performance decrease. This makes it possible to further improve the performance by considering more types of contexts. Evaluation on the PK and CTB corpora in the First SIGHAN Chinese word segmentation bakeoff shows that our chunking approach successfully detects about 80% of unknown words on both corpora and outperforms the best-reported systems by 8.1% and 7.1% in unknown word detection on them respectively.

Footnote 1. In this paper, word atoms refer to basic building units in words. For example, the word 计算机 (computer) consists of two word atoms: 计算 (computing) and 机 (machine). Generally, word atoms can either occur independently, e.g. 计算 (computing), or only become a part of a word, e.g. 机 (machine) in the word 计算机 (computer).

1 Introduction

Prior to any linguistic analysis of Chinese text, Chinese word segmentation is the necessary first step and one of the major bottlenecks in Chinese information processing, since a Chinese sentence is written as a continuous string of characters without obvious separators (such as blanks) between the words. During the past two decades, this research has been a hot topic in Chinese information processing [1-10]. There exist two major problems in Chinese word segmentation: ambiguity resolution and unknown word detection. While n-gram modeling and/or word co-occurrence has been successfully applied to deal with the ambiguity problems [3, 5, 10, 12, 13], unknown word detection has become the major bottleneck in Chinese word segmentation. Currently, almost all Chinese word segmentation systems rely on a word dictionary. The problem is that when the words stored in the dictionary are insufficient, the system's performance is greatly deteriorated by the presence of words that are unknown to the system. Moreover, manual maintenance of a dictionary is very tedious and time consuming. It is therefore important for a Chinese word segmentation system to identify unknown words from the text automatically.

In the literature, two categories of competing approaches are widely used to detect unknown words (footnote 2): statistical approaches [5, 11, 12, 13, 14, 15] and rule-based approaches [5, 11, 14, 15]. Although rule-based approaches have the advantage of being simple, the complexity and domain dependency of how the unknown words are produced greatly reduce the efficiency of these approaches. On the other hand, statistical approaches have the advantage of being domain-independent [16]. It is interesting to note that many systems apply a hybrid approach [5, 11, 14, 15]. Regardless of the choice of approach, finding a way to automatically detect unknown words has become a crucial issue in Chinese word segmentation and Chinese information processing in general.

[Fig. 1. MMA and unknown word detection by chunking: an example. Rows: input raw sentence; MMA pre-segmentation; unknown word detection. English gloss: "Zhang Jie graduate from JiaoTong University."]

This paper proposes a chunking strategy to cope with unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms (i.e. single-character words and multi-character words) using a maximum matching algorithm (MMA) (footnote 3). Then a chunking model is applied to detect unknown words by chunking one or more word atoms together according to the word formation patterns of the word atoms. Figure 1 gives an example. Here, the problem of unknown word detection is re-cast as chunking one or more word atoms together to form a new word, and a discriminative Markov model, named Mutual Information Independence Model (MIIM), is adopted in chunking.
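The pre-segmentation step described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it performs only greedy forward maximum matching and omits the ambiguity marking and word unigram model that a typical MMA also applies. The function name `max_match`, the `max_len` parameter, the toy dictionary, and the example sentence are all assumptions chosen for illustration.

```python
def max_match(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching (simplified, hypothetical sketch).

    At each position, take the longest multi-character dictionary entry;
    if none matches, emit the single character as a word atom.
    """
    atoms = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary:
                atoms.append(candidate)
                i += length
                break
        else:
            # No multi-character match: fall back to a single character.
            atoms.append(sentence[i])
            i += 1
    return atoms


# Toy dictionary and sentence (assumed for illustration only).
toy_dictionary = {"毕业", "交通", "大学"}
toy_sentence = "张杰毕业于交通大学"
atoms = max_match(toy_sentence, toy_dictionary)
print(atoms)
```

In this toy run the personal name is not in the dictionary, so it is left as two single-character atoms; a chunking model would then be responsible for joining such atoms into one unknown word.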
Besides, a maximum entropy model is applied to integrate various types of contexts and resolve the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy model can be significantly reduced without performance decrease. This makes it possible to further improve the performance by considering more types of contexts in the future. Evaluation on the PK and CTB corpora in the First SIGHAN Chinese word segmentation bakeoff shows that our chunking strategy performs best in unknown word detection on both corpora.

The rest of the paper is organized as follows: Section 2 discusses the details of our chunking strategy in unknown word detection. Experimental results are given in Section 3. Finally, some remarks and conclusions are made in Section 4.

Footnote 2. Some systems [13, 14] focus on proper names due to their importance in Chinese information processing.

Footnote 3. A typical MMA identifies all character sequences which are found in the word dictionary and marks them as words. Those character sequences which can be segmented in more than one way are marked as ambiguous, and a word unigram model is applied to choose the most likely segmentation sequence. The remaining sequences, i.e. those not found in the dictionary, are called fragments and segmented into single characters. In this way, each Chinese sentence is pre-segmented into a sequence of single-character words and multi-character words. For convenience, we call these single-character words and multi-character words in the output of the MMA algorithm word atoms.

2 Unknown Word Detection by Chunking

In this section, we first describe the chunking strategy in unknown word detection of Chinese word segmentation using a discriminative Markov model, called Mutual Information Independence Model (MIIM). Then a maximum entropy model is applied to integrate various types of contexts and resolve the data sparseness problem in MIIM. Finally, an error-driven learning approach is proposed to select useful contexts and reduce the context feature vector dimension.

2.1 Mutual Information Independence Model and Unknown Word Detection

Mutual Information Independence Model

In this paper, we use a discriminative Markov model, called Mutual Information Independence Model (MIIM) and proposed by Zhou et al. [17] (footnote 4), in unknown word detection by chunking.

Footnote 4. We have renamed the discriminative Markov model in [17] as the Mutual Information Independence Model according to the novel pair-wise mutual information independence assumption in the model. Another reason is to distinguish it from the traditional Hidden Markov Model [18] and avoid confusion.

MIIM is derived from a conditional probability model. Given an observation sequence $O_1^n = o_1 o_2 \cdots o_n$, the goal of a conditional probability model is to find a stochastic optimal state (tag) sequence $S_1^n = s_1 s_2 \cdots s_n$ that maximizes:

$$\log P(S_1^n \mid O_1^n) = \log P(S_1^n) + \log \frac{P(S_1^n, O_1^n)}{P(S_1^n)\,P(O_1^n)} \qquad (1)$$

The second term in Equation (1) is the pair-wise mutual information (PMI) between $S_1^n$ and $O_1^n$. In order to simplify the computation of this term, we assume a pair-wise mutual information independence (2):

$$PMI(S_1^n, O_1^n) = \sum_{i=1}^{n} PMI(s_i, O_1^n) \quad \text{or} \quad \log \frac{P(S_1^n, O_1^n)}{P(S_1^n)\,P(O_1^n)} = \sum_{i=1}^{n} \log \frac{P(s_i, O_1^n)}{P(s_i)\,P(O_1^n)} \qquad (2)$$

That is, an individual state is only dependent on the observation sequence $O_1^n$ and independent of other states in the state sequence $S_1^n$. This assumption is reasonable because the dependence among the states in the state sequence $S_1^n$ has already been captured by the first term in Equation (1). Applying Equation (2) to Equation (1), we have Equation (3) (footnote 5):

$$\log P(S_1^n \mid O_1^n) = \sum_{i=2}^{n} PMI(s_i, S_1^{i-1}) + \sum_{i=1}^{n} \log P(s_i \mid O_1^n) \qquad (3)$$

Footnote 5. Details about the derivation are omitted due to space limitation. Please see [17] for more.

We call the model shown in Equation (3) the Mutual Information Independence Model due to its pair-wise mutual information independence assumption shown in Equation (2). The model consists of two sub-models: the state transition model $\sum_{i=2}^{n} PMI(s_i, S_1^{i-1})$ as the first term in Equation (3) and the output model $\sum_{i=1}^{n} \log P(s_i \mid O_1^n)$ as the second term in Equation (3). Here, a variant of the Viterbi algorithm [19] used in decoding the standard Hidden Markov Model (HMM) [18] is implemented to find the most likely state sequence, by replacing the state transition model and the output model of the standard HMM with the state transition model and the output model of the MIIM, respectively.

Unknown Word Detection

For unknown word detection by chunking, a word (known word or unknown word) is regarded as a chunk of one or more word atoms and we have:

$o_i = \langle p_i, w_i \rangle$: $w_i$ is the $i$-th word atom in the sequence of word atoms $W_1^n = w_1 w_2 \cdots w_n$; $p_i$ is the word formation pattern of the word atom $w_i$. Here $p_i$ measures the word formation power of the word atom $w_i$ and consists of:
  o The percentage of $w_i$ occurring as a whole word (rounded to 10%)
  o The percentage of $w_i$ occurring at the beginning of other words (rounded to 10%)
  o The percentage of $w_i$ occurring at the end of other words (rounded to 10%)
  o The length of $w_i$
  o The occurring frequency feature of $w_i$, which is mapped to max(log(frequency), 9).

$s_i$: the states are used to bracket and differentiate various types of words. In this way, Chinese unknown word detection can be regarded as a bracketing process, while differentiation of different word types can help the bracketing process. $s_i$ is structural and consists of three parts:
  o Boundary Category (B): it includes four values: {O, B, M, E}, where O means that the current word atom is a whole word and B/M/E means that the current word atom is at the beginning/in the middle/at the end of a word.
  o Word Category (W): it is used to denote the class of the word. In our system, words are classified into two types: pure Chinese word type and mixed word type (i.e. including English characters and Chinese digits/numbers/symbols).
  o Word Atom Formation Pattern (P): because of the limited number of boundary and word categories, the word atom formation pattern described above is added into the structural state to represent a more accurate state transition model in MIIM while keeping its output model.

Problem with Unknown Word Detection Using MIIM

From Equation (3), we can see that the state transition model of MIIM can be computed by using n-gram modeling [20, 21, 22], where each tag is assumed to be dependent on the N-1 previous tags (e.g. 2). The problem with the above MIIM lies in the data sparseness problem raised by its output model: $\sum_{i=1}^{n} \log P(s_i \mid O_1^n)$. Ideally, we would have sufficient training data for every event whose conditional probability we wish to calculate. Unfortunately, there is rarely enough training data to compute accurate probabilities when decoding on new data. Generally, two smoothing approaches [21, 22, 23] are applied to resolve this problem: linear interpolation and back-off. However, these two approaches only work well when the number of different information sources is very limited. When a few features and/or a long context are considered, the number of different information sources is exponential. This makes smoothing approaches inappropriate in our system. In this paper, the maximum entropy model [24] is proposed to integrate various context information sources and resolve the data sparseness problem in our system. We choose the maximum entropy model for this purpose because it represents the state of the art in the machine learning research community and good implementations of the algorithm are available. Here, we use the open NLP maximum entropy package in our system.
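To make the Boundary Category part of the structural state more concrete, the sketch below derives {O, B, M, E} tags for word atoms when the correct word chunks are known, as they would be when preparing training data. It is an illustrative assumption rather than the paper's code: the function name `boundary_tags`, the input layout (one list of atoms per word), and the toy example are hypothetical, and the Word Category and Word Atom Formation Pattern parts of the state are left out.

```python
def boundary_tags(chunks):
    """Assign a boundary category to every word atom (hypothetical sketch).

    `chunks` is a list of words, each given as the list of its word atoms.
    Returns (atom, tag) pairs with tags from {O, B, M, E}.
    """
    tagged = []
    for atoms in chunks:
        if len(atoms) == 1:
            # A single atom that is a whole word by itself.
            tagged.append((atoms[0], "O"))
        else:
            tagged.append((atoms[0], "B"))                   # beginning
            tagged.extend((a, "M") for a in atoms[1:-1])     # middle, if any
            tagged.append((atoms[-1], "E"))                  # end
    return tagged


# Toy example (assumed): two two-atom words and two single-atom words.
toy_chunks = [["张", "杰"], ["毕业"], ["于"], ["交通", "大学"]]
for atom, tag in boundary_tags(toy_chunks):
    print(atom, tag)
```

Read in the other direction, a decoded tag sequence such as B E O O B E brackets the atoms back into words, which is why unknown word detection can be treated as a bracketing process.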
2.2 Maximum Entropy

The maximum entropy model is a probability distribution estimation technique widely used in recent years for natural language processing tasks. The principle of the maximum entropy model in estimating probabilities is to include as much information as is known from the data while making no additional assumptions. The maximum entropy model returns the probability distribution that satisfies the above property with the highest entropy. Formally, the decision function of the maximum entropy model can be represented as:

$$P(o, h) = \frac{1}{Z(h)} \prod_{j=1}^{k} \alpha_j^{f_j(h, o)} \qquad (4)$$

where $o$ is the outcome, $h$ is the history (the context feature vector in this paper), $Z(h)$ is a normalization function, $\{f_1, f_2, \ldots, f_k\}$ are feature functions and $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ are the model parameters. Each model parameter corresponds to exactly one feature and can be viewed as a "weight" for that feature. All features used in the maximum entropy model are binary, e.g.

$$f_j(h, o) = \begin{cases} 1, & \text{if } o = \text{IndependentWord},\ \text{CurrentWordAtom} = \text{我们 (we)} \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

In order to reliably estimate $P(s_i \mid O_1^n)$ in the output model of MIIM using the maximum entropy model, various context information sources are included in the context feature vector:

  o $p_i$: current word atom formation pattern
  o $p_{i-1} p_i$: previous word atom formation pattern and current word atom formation pattern
  o $p_i p_{i+1}$: current word atom formation pattern and next word atom formation pattern
  o $p_i w_i$: current word atom formation pattern and current word atom
  o $p_{i-1} w_{i-1} p_i$: previous word atom formation pattern, previous word atom and current word atom formation pattern
  o $p_i p_{i+1} w_{i+1}$: current word atom formation pattern, next word atom formation pattern and next word atom
  o $p_{i-1} p_i w_i$: previous word atom formation pattern, current word atom formation pattern and current word atom
  o $p_i w_i p_{i+1}$: current word atom formation pattern, current word atom and next word atom formation pattern
  o $p_{i-1} w_{i-1} p_i w_i$: previous word atom formation pattern, previous word atom, current word atom formation pattern and current word atom
  o $p_i w_i p_{i+1} w_{i+1}$: current word atom formation pattern, current word atom, next word atom formation pattern and next word atom

However, a problem arises when we include all of the above context information in the maximum entropy model: the context feature vector dimension easily becomes too large for the model to handle. One easy solution to this problem is to keep only the frequently occurring contexts in the model. Although this frequency filtering approach is simple, many useful contexts may not occur frequently and are filtered out, while those kept may not be useful. To resolve this problem, we propose an alternative error-driven learning approach that keeps only useful contexts in the model.
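The ten context types listed in this section can be read as feature templates over a window of one atom on each side. The sketch below shows one way such templates might be instantiated for a position i. It is an assumption-laden illustration: the function name `context_features`, the string keys, the `<PAD>` boundary symbol, and the toy pattern labels are all hypothetical, and how sentence boundaries are actually handled is not stated in the text.

```python
PAD = "<PAD>"  # assumed placeholder for positions outside the sentence


def context_features(patterns, atoms, i):
    """Instantiate the ten context types of Section 2.2 at position i.

    `patterns[j]` stands for the formation pattern p_j and `atoms[j]`
    for the word atom w_j (hypothetical sketch).
    """
    def p(j):
        return patterns[j] if 0 <= j < len(patterns) else PAD

    def w(j):
        return atoms[j] if 0 <= j < len(atoms) else PAD

    return {
        "p_i": (p(i),),
        "p_{i-1} p_i": (p(i - 1), p(i)),
        "p_i p_{i+1}": (p(i), p(i + 1)),
        "p_i w_i": (p(i), w(i)),
        "p_{i-1} w_{i-1} p_i": (p(i - 1), w(i - 1), p(i)),
        "p_i p_{i+1} w_{i+1}": (p(i), p(i + 1), w(i + 1)),
        "p_{i-1} p_i w_i": (p(i - 1), p(i), w(i)),
        "p_i w_i p_{i+1}": (p(i), w(i), p(i + 1)),
        "p_{i-1} w_{i-1} p_i w_i": (p(i - 1), w(i - 1), p(i), w(i)),
        "p_i w_i p_{i+1} w_{i+1}": (p(i), w(i), p(i + 1), w(i + 1)),
    }


# Toy input (assumed): opaque pattern labels for three word atoms.
toy_patterns = ["P0", "P1", "P2"]
toy_atoms = ["a", "b", "c"]
features = context_features(toy_patterns, toy_atoms, 1)
print(len(features), features["p_{i-1} w_{i-1} p_i w_i"])
```

Because every template that includes a word atom multiplies the number of distinct contexts by the vocabulary size, the feature space grows quickly, which is the dimensionality problem the paragraph above describes.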
2.3 Context Feature Selection Using Error-Driven Learning

Here, we propose an error-driven learning approach to examine the effectiveness of various contexts and select useful contexts, in order to reduce the size of the context feature vector used in the maximum entropy model for estimating $P(s_i \mid O_1^n)$ in the output model of MIIM. This makes it possible to further improve the performance by incorporating more types of contexts in the future.

Assume $\Phi$ is the container for useful contexts. Given a set of existing useful contexts $\Phi$ and a set of new contexts $\Delta\Phi$, the effectiveness of a new context $C_i \in \Delta\Phi$, $E(\Phi, C_i)$, is measured by the $C_i$-related reduction in errors which results from adding the new context set $\Delta\Phi$ to the useful context set $\Phi$:

$$E(\Phi, C_i) = \#Error(\Phi, C_i) - \#Error(\Phi + \Delta\Phi, C_i) \qquad (6)$$

Here, $\#Error(\Phi, C_i)$ is the number of $C_i$-related chunking errors before $\Delta\Phi$ is added to $\Phi$, and $\#Error(\Phi + \Delta\Phi, C_i)$ is the number of $C_i$-related chunking errors after $\Delta\Phi$ is added to $\Phi$. That is, $E(\Phi, C_i)$ is the number of chunking error corrections made on the context $C_i \in \Delta\Phi$ when $\Delta\Phi$ is added to $\Phi$. If $E(\Phi, C_i) > 0$, we declare that the new context $C_i$ is a useful context and should be added to $\Phi$. Otherwise, the new context $C_i$ is considered useless and discarded.

Given the above error-driven learning approach, we initialize $\Phi = \{p_i\}$ (i.e. we assume all the current word atom formation patterns are useful contexts) and choose one of the other context types as the new context set $\Delta\Phi$, e.g. $\Delta\Phi = \{p_i w_i\}$. Then, we can train two MIIMs with different output models using $\Phi$ and $\Phi + \Delta\Phi$ respectively. Moreover, useful contexts are learnt on the training data in a two-fold way. For each fold, two MIIMs are trained on 50% of the training data and, for each new context $C_i \in \Delta\Phi$, its effectiveness $E(\Phi, C_i)$ is evaluated on the remaining 50% of the training data according to the context effectiveness measure shown in Equation (6). If $E(\Phi, C_i) > 0$, $C_i$ is marked as a useful context and added to $\Phi$. In this way, all the useful contexts in $\Delta\Phi$ are incorporated into the useful context set $\Phi$. Similarly, we can include useful contexts of other context types into the useful context set $\Phi$ one by one. In this paper, the various types of contexts are learnt one by one in the exact same order as shown in Section 2.2.
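The selection rule in Equation (6) reduces to a per-context comparison of two error counts. The following sketch assumes those counts have already been collected on held-out data by decoding with the two models (one using Φ, one using Φ + ΔΦ); training and decoding are not shown. The function name `select_useful_contexts`, the use of `Counter`, and the toy counts are assumptions for illustration only.

```python
from collections import Counter


def select_useful_contexts(errors_before, errors_after, new_contexts):
    """Keep each new context C with E(Phi, C) > 0 (hypothetical sketch).

    errors_before[C]: C-related chunking errors using Phi only.
    errors_after[C]:  C-related chunking errors using Phi + delta-Phi.
    Missing keys count as zero errors because both inputs are Counters.
    """
    useful = set()
    for context in new_contexts:
        effectiveness = errors_before[context] - errors_after[context]
        if effectiveness > 0:
            useful.add(context)
    return useful


# Toy held-out error counts (assumed values, not from the paper).
before = Counter({"c1": 5, "c2": 2, "c3": 0})
after = Counter({"c1": 3, "c2": 2, "c3": 1})
useful = select_useful_contexts(before, after, ["c1", "c2", "c3", "c4"])
print(sorted(useful))
```

In this toy case only `c1` is kept: `c2` corrects nothing, `c3` introduces an error, and `c4` is never involved in an error, so none of them has positive effectiveness.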
Finally, since different types of contexts may have cross-effects, the above process is iterated with the renewed useful context set $\Phi$ until very few useful contexts can be found at each loop. Our experiments show that the iteration converges within four loops.

3 Experimental Results

All of our experiments are evaluated on the PK and CTB benchmark corpora used in the First SIGHAN Chinese word segmentation bakeoff with the closed configuration. That is, only the training data from the particular corpus is used during training. For unknown word detection, the chunking training data is derived by using the same Maximum Matching Algorithm (MMA) to segment each word in the original training data as a chunk of word atoms. This is done in a two-fold way. For each fold, the MMA is trained on 50% of the original training data and then used to segment the remaining 50% of the original training data. Then the MIIM is used to train a chunking model for unknown word detection on the chunking training data. Table 1 shows the details of the two corpora. Here, OOV is defined as the percentage of words in the test corpus not occurring in the training corpus and indicates the out-of-vocabulary rate in the test corpus.

Table 1. Statistics of the corpora used in our evaluation

Corpus                   Abbreviation   OOV     Training Data   Test Data
Beijing University       PK             6.9%    1100K words     17K words
UPENN Chinese Treebank   CTB            18.1%   250K words      40K words

Table 2 shows the detailed performance of our system in unknown word detection and Chinese word segmentation as a whole using the standard scoring script on the test data. In this and subsequent tables, various evaluation measures are provided: precision (P), recall (R), F-measure (F), recall on out-of-vocabulary words ($R_{OOV}$) and recall on in-vocabulary words ($R_{IV}$). It shows that our system achieves precision/recall/F-measure of 93.5%/96.1%/94.8 and 90.5%/90.1%/90.3 on the PK and CTB corpora respectively. In particular, our chunking approach successfully detects 80.5% and 77.6% of unknown words on the PK and CTB corpora respectively.

[Table 2. Detailed performance of our system on the 1st SIGHAN Chinese word segmentation benchmark data. Columns: Corpus, P, R, F, $R_{OOV}$, $R_{IV}$. Rows: PK, CTB.]

Table 3 and Table 4 compare our system with other best-reported systems on the PK and CTB corpora respectively. Table 3 shows that our chunking approach in unknown word detection outperforms others by more than 8% on the PK corpus. It also shows that our system performs comparably with the best reported systems on the PK corpus when the out-of-vocabulary rate is moderate (6.9%). Our performance in Chinese word segmentation as a whole is somewhat pulled down by the lower performance in recalling in-vocabulary words. This may be due to the preference of our chunking strategy for detecting unknown words, which wrongly combines some in-vocabulary words into unknown words.
Such a preference may have a negative effect on Chinese word segmentation as a whole when the gain in unknown word detection fails to compensate for the loss from wrongly combining some in-vocabulary words into unknown words. This happens when the out-of-vocabulary rate is not high, e.g. on the PK corpus. Table 4 shows that our chunking approach in unknown word detection outperforms others by more than 7% on the CTB corpus. It also shows that our system outperforms the other best-reported systems by more than 2% in Chinese word segmentation as a whole on the CTB corpus. This is largely due to the huge gain in unknown word detection when the out-of-vocabulary rate is high (e.g. 18.1% in the CTB corpus), even though our system performs worse on recalling in-vocabulary words than others. Evaluation on both the PK and CTB corpora shows that our chunking approach can successfully detect about 80% of unknown words on corpora with a large range of out-of-vocabulary rates. This suggests the power of using various word formation patterns of word atoms in detecting unknown words. It also demonstrates the effectiveness and robustness of our chunking approach in unknown word detection of Chinese word segmentation and its portability to different genres.

[Table 3. Comparison of our system with other best-reported systems on the PK corpus. Columns: Corpus, P, R, F, $R_{OOV}$, $R_{IV}$. Rows: Ours, Zhang et al [25], Wu [26], Chen [27].]

[Table 4. Comparison of our system with other best-reported systems on the CTB corpus. Columns: Corpus, P, R, F, $R_{OOV}$, $R_{IV}$. Rows: Ours, Zhang et al [25], Duan et al [28].]

Finally, Table 5 and Table 6 compare our error-driven learning approach with the frequency filtering approach in learning useful contexts for the output model of MIIM on the PK and CTB corpora respectively. Due to memory limitation, at most 400K useful contexts are considered in the frequency filtering approach. First, they show that the error-driven learning approach is much more effective than the simple frequency filtering approach. With the same number of useful contexts, the error-driven learning approach outperforms the frequency filtering approach by 7.8%/0.6% and 5.5%/0.8% in $R_{OOV}$ (unknown word detection)/F-measure (Chinese word segmentation as a whole) on the PK and CTB corpora respectively. Moreover, the error-driven learning approach slightly outperforms the frequency filtering approach even in its best configuration, which uses 2.5 and 3.5 times as many useful contexts.
Second, they show that increasing the number of frequently occurring contexts using the frequency filtering approach may not increase the performance. This may be because some of the frequently occurring contexts are noisy or useless and including them may have a negative effect. Third, they show that the error-driven learning approach is effective in learning useful contexts by reducing 96-98% of possible contexts. Finally, the figures inside parentheses show the number of useful patterns shared between the error-driven learning approach and the frequency filtering approach. They show that about 40-50% of useful contexts selected using the error-driven learning approach do not occur frequently in the useful contexts selected using the frequency filtering approach.

Table 5. Comparison of the error-driven learning approach with the frequency filtering approach in learning useful contexts for the output model of MIIM on the PK corpus (total number of possible contexts: 4836K). Columns: Approach, #useful contexts, F, $R_{OOV}$, $R_{IV}$.

Approach                                  #useful contexts
Error-Driven Learning                     98K
Frequency Filtering                       98K (63K)
Frequency Filtering (best performance)    250K (90K)
Frequency Filtering                       400K (94K)

Table 6. Comparison of the error-driven learning approach with the frequency filtering approach in learning useful contexts for the output model of MIIM on the CTB corpus (total number of possible contexts: 1038K). Columns: Approach, #useful contexts, F, $R_{OOV}$, $R_{IV}$.

Approach                                  #useful contexts
Error-Driven Learning                     43K
Frequency Filtering                       43K (2K)
Frequency Filtering (best performance)    150K
Frequency Filtering                       400K (40K)

4 Conclusion

In this paper, a chunking strategy is presented to detect unknown words in Chinese word segmentation by chunking one or more word atoms together according to the various word formation patterns of the word atoms. Besides, a maximum entropy model is applied to integrate various types of contexts and resolve the data sparseness problem in our strategy. Finally, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy model can be significantly reduced without performance decrease. This makes it possible to further improve the performance by considering more types of contexts.
Evaluation on the PK and CTB corpora in the First SIGHAN Chinese word segmentation bakeoff shows that our chunking strategy can detect about 80% of unknown words on both corpora and outperforms the best-reported systems by 8.1% and 7.1% in unknown word detection on them respectively. While our Chinese word segmentation system with chunking-based unknown word detection performs comparably with the best systems on the PK corpus when the out-of-vocabulary rate is moderate (6.9%), our system significantly outperforms others by more than 2% when the out-of-vocabulary rate is high (18.1%). This demonstrates the effectiveness and robustness of our chunking strategy in unknown word detection of Chinese word segmentation and its portability to different genres.

References

1. Jie CY, Liu Y and Liang NY. (1989). On methods of Chinese automatic segmentation, Journal of Chinese Information Processing, 3(1).
2. Li KC, Liu KY and Zhang YK. (1988). Segmenting Chinese word and processing different meanings structure, Journal of Chinese Information Processing, 2(3).
3. Liang NY. (1990). The knowledge of Chinese word segmentation, Journal of Chinese Information Processing, 4(2).
4. Lua KT. (1990). From character to word - An application of information theory, Computer Processing of Chinese & Oriental Languages, 4(4).
5. Lua KT and Gan GW. (1994). An application of information theory in Chinese word segmentation, Computer Processing of Chinese & Oriental Languages, 8(1).
6. Wang YC, Su HJ and Mo Y. (1990). Automatic processing of Chinese words, Journal of Chinese Information Processing, 4(4).
7. Wu JM and Tseng G. (1993). Chinese text segmentation for text retrieval: achievements and problems, Journal of the American Society for Information Science, 44(9).
8. Xu H, He KK and Sun B. (1991). The implementation of a written Chinese automatic segmentation expert system, Journal of Chinese Information Processing, 5(3).
9. Yao TS, Zhang GP and Wu YM. (1990). A rule-based Chinese automatic segmentation system, Journal of Chinese Information Processing, 4(1).
10. Yeh CL and Lee HJ. (1995). Rule-based word identification for Mandarin Chinese sentences - A unification approach, Computer Processing of Chinese & Oriental Languages, 9(2).
11. Nie JY, Jin WY and Marie-Louise Hannan. (1997). A hybrid approach to unknown word detection and segmentation of Chinese, Chinese Processing of Chinese and Oriental Languages, 11(4).
12. Tung CH and Lee HJ. (1994). Identification of unknown word from a corpus, Computer Processing of Chinese & Oriental Languages, 8(Supplement).
13. Chang JS et al. (1994). A multi-corpus approach to recognition of proper names in Chinese Text, Computer Processing of Chinese & Oriental Languages, 8(1).
14. Sun MS, Huang CN, Gao HY and Fang J. (1994). Identifying Chinese Names In Unrestricted Texts, Communications of Chinese and Oriental Languages Information Processing Society, 4(2).
15. Zhou GD and Lua KT. (1997). Detection of Unknown Chinese Words Using a Hybrid Approach, Computer Processing of Chinese & Oriental Language, 11(1).
16. Eugene Charniak. Statistical language learning, The MIT Press.
17. Zhou GuoDong and Su Jian. (2002). Named Entity Recognition Using a HMM-based Chunk Tagger, Proceedings of the Conference on Annual Meeting for Computational Linguistics (ACL 2002), Philadelphia.
18. Rabiner L. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, IEEE 77(2).
19. Viterbi A.J. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm, IEEE Transactions on Information Theory, IT 13(2).
20. Gale W.A. and Sampson G. Good-Turing frequency estimation without tears, Journal of Quantitative Linguistics, 2.
21. Jelinek F. (1989). Self-Organized Language Modeling for Speech Recognition. In Alex Waibel and Kai-Fu Lee (Editors), Readings in Speech Recognition, Morgan Kaufmann.
22. Katz S.M. (1987). Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer, IEEE Transactions on Acoustics, Speech and Signal Processing, 35.
23. Chen and Goodman. (1996). An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association of Computational Linguistics (ACL 1996), Santa Cruz, California, USA.
24. Ratnaparkhi A. (1996). A Maximum Entropy Model for Part-of-Speech Tagging, Proceedings of the Conference on Empirical Methods in Natural Language Processing.
25. Zhang HP, Yu HK, Xiong DY and Liu Q. (2003). HHMM-based Chinese Lexical Analyzer ICTCLAS, Proceedings of 2nd SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan.
26. Wu AD. (2003). Chinese Word Segmentation in MSR-NLP, Proceedings of 2nd SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan.
27. Chen AT. (2003). Chinese Word Segmentation Using Minimal Linguistic Knowledge, Proceedings of 2nd SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan.
28. Duan HM, Bai XJ, Chang BB and Yu SW. (2003). Chinese Word Segmentation at Peking University, Proceedings of 2nd SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan.
