MINI-MICRO SYSTEMS, Vol. 27, No. 1, Jan. 2006

Sentence Boundary Detection in Biomedical Texts Using Context Morphological Features

YU Zhong-hua 1, ZHANG Rong 2, TANG Chang-jie 1, ZUO Jie 1, ZHANG Tian-qing 1
1 (Computer Science School, Sichuan University, Chengdu 610065, China)
2 (Network Education School, Sichuan University, Chengdu 610065, China)
E-mail: yuzhonghua@cs.scu.edu.cn

Abstract: A sentence boundary detection algorithm is proposed for information extraction from biomedical texts, designed around the characteristics of these texts and the special requirements of information extraction. The algorithm is based on context morphological features and supervised learning. In contrast to algorithms developed for sentence boundary detection in common English texts, it uses no special vocabulary or grammatical-level information and decides on sentence boundaries solely from the morphological information of the context words. A maximum entropy detector and an SVM detector are built on these features. Experiments on Medline abstracts show that the algorithm achieves recognition accuracy above 99%, and that the maximum entropy and SVM methods perform approximately equally on sentence boundary disambiguation. The experiments also show that morphological-level information alone, without supplementary vocabulary and grammatical-level information, gives approximately the same performance as other algorithms that use such supplementary information on common English texts.

Key words: natural language processing; biomedical information extraction; sentence boundary detection; machine learning

CLC number: TP391    Document code: A    Article ID: 1000-1220(2006)01-0180-05

Received: 2004-07-13. Grant numbers: 60073046; 20020610007. [Author biographies lost in extraction; the birth years 1967, 1963, 1946, 1977, and 1972 survive.]

1994-2006 China Academic Journal Electronic Publishing House. All rights reserved. http://www.cnki.net

1 [section title lost]

[The Chinese body text did not survive extraction. Surviving fragments: GenBank is mentioned, and five enumerated items carry the citations (1) [1, 2], (2) [3], (3) [4-8], (4) [9], and (5) [10, 11].]
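[Section 1, continued. Surviving fragments: the period is ambiguous as a sentence terminator, as in "3.14", "U.S.", and "The president lives in Washington D.C."; "?" and "!" are also named; [16] is cited; the texts of interest are MEDLINE [17] abstracts; the closing outline refers to Sections 3, 4, 5, and 6.]

2 [section title lost]

[Surviving fragments: three numbered points (1)-(3), the first two posed as questions; citations [12] and [13-15]; the figure 100; 1,570 MEDLINE abstracts; Flex; WSJ (Wall Street Journal) [16]; 12,200; 0.9% (WSJ); a reference to Table 1.]

Table 1 [caption and column headings lost; [13], [14], [15], [16], MEDLINE, and WSJ appear next to the table]

                 Column 1    Column 2
    Precision     99.93%      99.56%
    Recall        71.03%      76.95%
    F-Measure     83.04%      86.81%
    Error-rate    29.02%      16.25%

[A further fragment reads "WSJ 0.39% 22%"; its place in the table is not recoverable.]

[Surviving fragments on evaluation: four measures are named, Precision, Recall, F-Measure, and Error-rate; "Mr." appears as an example; the candidate marks listed are the period, "?", "!", and the closing quotation mark; Information Extraction and Information Retrieval are both mentioned.]

Each candidate mark falls into one of four classes: P-Positive, P-Negative, N-Positive, and N-Negative. Read off the formulas below, the first letter is the true class of the mark (P for a real sentence boundary, N for a non-boundary) and the second word is the detector's decision.

\[
\text{Precision} = \frac{\text{P-Positive}}{\text{P-Positive} + \text{N-Positive}}
\]

\[
\text{Recall} = \frac{\text{P-Positive}}{\text{P-Positive} + \text{P-Negative}}
\]

\[
\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]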
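The four measures named in this section can be computed directly from three of the counts. The sketch below is only an illustration: the class and function names are hypothetical, Error-rate is taken as (P-Negative + N-Positive) / (P-Positive + P-Negative) following the paper's definition, and the counts are made up (they are not the paper's data), chosen so that the output matches the first column of Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryCounts:
    """Counts over candidate marks; class and field names are illustrative."""
    p_positive: int  # true boundaries the detector marked as boundaries
    p_negative: int  # true boundaries the detector missed
    n_positive: int  # non-boundaries the detector marked as boundaries


def precision(c: BoundaryCounts) -> float:
    return c.p_positive / (c.p_positive + c.n_positive)


def recall(c: BoundaryCounts) -> float:
    return c.p_positive / (c.p_positive + c.p_negative)


def f_measure(c: BoundaryCounts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)


def error_rate(c: BoundaryCounts) -> float:
    # Errors of both kinds, relative to the number of true boundaries.
    return (c.p_negative + c.n_positive) / (c.p_positive + c.p_negative)


# Hypothetical counts, not taken from the paper.
counts = BoundaryCounts(p_positive=7103, p_negative=2897, n_positive=5)

for name, fn in [("Precision", precision), ("Recall", recall),
                 ("F-Measure", f_measure), ("Error-rate", error_rate)]:
    print(f"{name}: {100 * fn(counts):.2f}%")
```

With these counts the script prints 99.93%, 71.03%, 83.04%, and 29.02%. Because the denominator of Error-rate is the number of true boundaries rather than the number of candidates, a detector with low recall is penalised heavily even when its precision is near 100%.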
\[
\text{Error-rate} = \frac{\text{P-Negative} + \text{N-Positive}}{\text{P-Positive} + \text{P-Negative}}
\]

[Surviving fragments of the discussion that follows: acronyms; feature F5; the two F-Measure values 86.81% and 83.04%; Medline.]

A function is defined over a word array A (A[1], A[2], ...) and its K-th word W. It returns TRUE or FALSE as follows:

    bool bretvalue = false;
    for (i = K-1; i > 0; i--) {
        if (∃w ((A[i] == strcat(W, w)) && (strlen(w) > 0))) {
            // strcat(W, w) is the concatenation of W and w;
            // strlen(w) is the length of w
            bretvalue = true;
            break;
        }
    }
    return (bretvalue);

4.1 [section title lost]

[Surviving fragments: Shallow Parsing; "Ehrlichia" abbreviated as "E."; acronyms in Medline, with the example ACE (angiotensin-converting enzyme); eleven features F1 to F11, whose definitions are lost.]

Sample Medline abstract (fragment), with each sentence-final period tagged as <S>.</S>:

    Ehrlichia chaffeensis, an obligatory intracellular bacterium of monocytes or macrophages, is the etiologic agent of human monocytic ehrlichiosis<S>.</S> Our previous study showed that gamma interferon (IFN-gamma) added prior to or at early stage of infection inhibited infection of human monocytes with E. chaffeensis; however, after 24 h of infection, IFN-gamma had no antiehrlichial effect<S>.</S> To test whether

Table 2 [caption lost]: the 11 feature values and the Label (+1 or -1) for candidate marks in the sample abstract

    F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11  Label
     0   1   0   0   0   1   1   0   0   0    0    +1
     1   0   0   0   1   1   1   0   0   0    0    -1
     0   1   0   0   0   1   1   0   0   0    0    +1

4.2 [section title lost] [18]
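As an aside on the TRUE/FALSE function given before Section 4.1, the existential test "some w with A[i] == strcat(W, w) and strlen(w) > 0" amounts to asking whether W is a proper prefix of an earlier word. A minimal runnable sketch follows, assuming that A holds the words of the abstract in order and that W is the word in front of the candidate period; this reading is an interpretation of the surviving fragments, and the function name, the 0-based indexing, and the empty-string guard are not from the paper.

```python
from typing import Sequence


def is_proper_prefix_of_earlier_word(words: Sequence[str], k: int) -> bool:
    """Return True if words[k] is a proper prefix of some earlier word.

    Mirrors the loop "for (i = K-1; i > 0; i--)" with 0-based indexing,
    so the scan runs from position k-1 down to position 0.
    """
    w = words[k]
    if not w:
        # Assumption: an empty token never counts as an abbreviation.
        return False
    for i in range(k - 1, -1, -1):
        # "A[i] == strcat(W, w') with strlen(w') > 0" means A[i] starts
        # with W and is strictly longer than W.
        if len(words[i]) > len(w) and words[i].startswith(w):
            return True
    return False


# Tokens taken from the sample abstract, punctuation removed.
tokens = ["Ehrlichia", "chaffeensis", "is", "the", "etiologic", "agent",
          "monocytes", "with", "E", "chaffeensis"]

print(is_proper_prefix_of_earlier_word(tokens, 8))  # "E" vs "Ehrlichia"
print(is_proper_prefix_of_earlier_word(tokens, 9))  # repeated "chaffeensis"
```

The first call returns True because "Ehrlichia" begins with "E" and is longer; the second returns False because the earlier "chaffeensis" is identical rather than longer. This agrees with the second row of Table 2, where the candidate with Label -1 has F5 = 1, although the link between this function and F5 is itself inferred from the fragments.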
[Section 4.2, continued. Surviving fragments: an 80%-20% proportion; references to Table 2 and Table 3.]

Let x denote the context of a candidate mark and y its label. A feature function has the form

\[
f(x, y) =
\begin{cases}
1, & \text{if } y = +1 \text{ and } F_2 = 1 \text{ for } x \\
0, & \text{otherwise}
\end{cases}
\]

Given n feature functions, the model p(y|x) is required to satisfy

\[
\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f_i(x, y) = \sum_{x, y} \tilde{p}(x, y)\, f_i(x, y), \qquad i = 1, \ldots, n
\]

where \tilde{p} is the empirical distribution. Among the models that satisfy these constraints, the one with maximum conditional entropy is chosen:

\[
H(p) = - \sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)
\]

[Surviving fragments: the estimation algorithms GIS (Generalized Iterative Scaling) [19] and IIS (Improved Iterative Scaling) [20]; the toolkits [21, 22]; Amis [21], http://www-tsujii.is.s.u-tokyo.ac.jp/~yusuke/amis/.]

4.3 [section title lost]

[Surviving fragments: Support Vector Machines (SVM), Vapnik [23]; citations [24] and [7]; implementations available on the Internet, SVMlight [25] and LIBSVM [26]; LIBSVM [26], http://www.csie.ntu.edu.tw/~cjlin/libsvm/.]

5 [section title lost]

[Surviving fragments: 1,570 Medline abstracts; the figures 13,851 and 12,200; references to Table 2 and Table 3; results of 99%.]

Table 3 [caption and column headings lost; the two columns presumably correspond to the two detectors]

                 Column 1    Column 2
    Precision     98.55%      99.19%
    Recall        99.59%      99.06%
    F-Measure     99.07%      99.13%
    Error-rate     1.87%       1.75%

References:

[1] Collier N, Nobata C, Tsujii J. Extracting the names of genes and gene products with a hidden Markov model[C]. In: Proc. of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrucken, Germany.
[2] Fukuda et al. Toward information extraction: Identifying protein names from biomedical papers[C]. In: Proc. of the Pacific Symposium on Biocomputing 98 (PSB 98), Hawaii.
[3] Liu H, Johnson S B, Friedman C. Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS[J]. Journal of the American Medical Informatics Association, 2002, 9(6): 621-636.
[4] Chang J, Schutze H, Altman R. Creating an online dictionary of abbreviations from MEDLINE[J]. Journal of the American Medical Informatics Association, 2002, 9(6): 612-620.
[5] Schwartz A, Hearst M. A simple algorithm for identifying abbreviation definitions in biomedical text[C]. In: Proc. of the Pacific Symposium on Biocomputing 2003 (PSB 2003).
[6] Pakhomov S. Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts[C]. In: Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 160-167.
[7] Yu Z, Tsuruoka Y, Tsujii J. Automatic resolution of ambiguous abbreviations in biomedical texts using support vector machines and one sense per discourse hypothesis[C]. In: Proc. of ACM SIGIR 03 Workshop on Text Analysis and Search for Bioinformatics, 57-62.
[8] Yu H et al. Mapping abbreviations to full forms in biomedical articles[J]. Journal of the American Medical Informatics Association, 2002, (9): 262-272.
[9] Castaño J, Zhang J, Pustejovsky J. Anaphora resolution in biomedical literature[C]. International Symposium on Reference Resolution, 2002.
[10] Yu H et al. Automatic extraction of gene and protein synonyms from Medline and journal articles[C]. In: Proc. of AMIA Symp, 2002: 919-923.
[11] Rindflesch L et al. Edgar: Extraction of drugs, genes and relations from the biomedical literature[C]. In: Proc. of the Pacific Symposium on Biocomputing 2000 (PSB 2000).
[12] Aberdeen J et al. Description of the Alembic system used for MUC-6[C]. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Morgan Kaufmann.
[13] Palmer D D, Hearst M A. Adaptive multilingual sentence boundary disambiguation[J]. Computational Linguistics, 1997, 23(3): 241-267.
[14] Reynar J C, Ratnaparkhi A. A maximum entropy approach to identifying sentence boundaries[C]. In: Proceedings of the Fifth ACL Conference on Applied Natural Language Processing (ANLP 97), Washington, D.C., 1997.
[15] Mikheev A. Tagging sentence boundaries[C]. In: NAACL 2000, ACL, 2000, 264-271.
[16] Wang H, Huang W. Bondec-A sentence boundary detector[EB/OL]. http://nlp.stanford.edu/courses/cs224n/2003/fp/huangy/final_project.doc
[17] Medline[EB/OL]. http://www.nlm.nih.gov/
[18] Berger A L et al. A maximum entropy approach to natural language processing[J]. Computational Linguistics, 1996, 22(1): 39-68.
[19] Darroch J N, Ratcliff D. Generalized iterative scaling for log-linear models[J]. The Annals of Mathematical Statistics, 1972, 43(5): 1470-1480.
[20] Della P S et al. Inducing features of random fields[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995, 19(4): 380-393.
[21] http://www-tsujii.is.s.u-tokyo.ac.jp/~yusuke/amis/
[22] http://nlp.stanford.edu/downloads/classifier.shtml
[23] Cortes C, Vapnik V. Support-vector networks[J]. Machine Learning, 1995, 20(11): 273-297.
[24] Thorsten J. Text categorization with support vector machines: Learning with many relevant features[C]. European Conference on Machine Learning (ECML), 1998.
[25] http://www.cs.cornell.edu/people/tj/svm_light/
[26] http://www.csie.ntu.edu.tw/~cjlin/libsvm/