Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms

1 1 1 2 2 30% 70% 70% NTCIR-7 13% 90% 1,000 Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms Itsuki Toyota 1 Yusuke Takahashi 1 Kensaku Makita 1 Takehito Utsuro 2 Mikio Yamamoto 2 Abstract: A bilingual lexicon for technical terms is necessary in the process of translating patent documents. This paper studies a method of generating bilingual lexicon for technical terms from parallel patent documents. In the previous methods of generating bilingual lexicon from parallel patent documents, the portion from which parallel patent sentences are extracted is composed of the parts of Background and Embodiment. However, this portion is about 30% out of the whole Background and Embodiment parts and about 70% are not used. Considering this situation, this paper proposes to generate bilingual lexicon for technical terms from the remaining 70% out of the whole Background and Embodiment parts. The proposed method employs the compositional translation estimation technique which uses an existing bilingual lexicon. We show that, for 13% of the Japanese compound nouns that are not included in the phrase translation table trained with parallel patent sentences nor in the existing bilingual lexicon, translation candidates can be generated through the compositional translation estimation technique, and can be found in the English part of the patent family. On the average, we generate about two pairs of bilingual technical terms per patent family and we achieve over 90% accuracy. Keywords: bilingual lexicon for technical terms, patent family, statistical machine translation cfl 2012 Information Processing Society of Japan 1

1. [10] NTCIR-7 [3] 180 [7] Support Vector Machines (SVMs) [15] [8] [10] 180 [14] 30% 70% NTCIR-7 [12] 13% 90% 1,000 2. NTCIR-7 [3] 180 ( 1 ) 1993-2000 (2) 1 Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, 305 8573, Japan 2 Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba, 305 8573, Japan (3) (4) [14] 3. Moses [6] Moses. (1) (2) (3) (4) ( ). (5) (4) V 1,V 2,...,V n W 1,W 2,...,W m P J (= V p V p ) P E (= W q W q ) P J,P E T J,T E V i,w j p i p q j q P J P E T J,T E 4. 4.1 1 cfl 2012 Information Processing Society of Japan 2

1 1 1,631,099 1,847,945 2,244,117 47,554 41,810 129,420 24,696 23,025 82,087 1 parallel mode * 1 *2 [12] 2 ( JUMAN *3 ) P 2 P 2 B P B S Ver.131 Ver.79 Ver.131 *1 http://www.eijiro.jp/ *2 Ver.79 Ver.131 *3 http://nlp.ist.i.kyoto-u.ac.jp/en/index.php?juman 1 4.2 y S y S y T y S s i y S = s 1,s 2,,s n s i t i y T y T = t 1,t 2,,t n y S,y T y S,y T = s 1,t 1, s 2,t 2,, s n,t n y T y S y T ( ) y T s i,t i q( s i,t i ) y T cfl 2012 Information Processing Society of Japan 3

2 (PSD ) (NPSD ) y T ( Q corpus (y T ) ) y T 2 n q( s i,t i ) Q corpus (y T ) i=1 2 Q(y S,y T ) Q(y S,y T )= y S=s 1,s 2,...,s n i=1 n q( s i,t i ) Q corpus (y T ) s, t q( s, t ) s, t 10 (compo(s) 1) q( s, t ) = log 10 f p ( s, t ) B P log 10 f s ( s, t ) B S compo(s) s f p ( s, t ) P 2 s, t f p ( s, t ) P 2 s, t Q corpus (y T ) y T { 1 Q corpus (y T )= 0 5. NJ 1 B J NJ 2 M J NJ 3 B J M J PSD J NPSD J 2 D E PSD E NPSD E *4 D J = N 1 J,B J,N 2 J,M J,N 3 J B J M J = PSD J,NPSD J D E = PSD E,NPSD E B J M J NPSD J t J *4 cfl 2012 Information Processing Society of Japan 4

2 1,000 ( (%)) 345 (1.5) 2,914 (13.0) 13,165 (58.8) 5,972 (26.7) 22,396 (100) 3 500 ( (%)) 5 (1.0) 71 (14.2) 423 (84.6) 1 (0.2) 500 (100) t J D E NPSD E TranCand(t J,NPSD E ) ) TranCand (t J,NP SD E { = t E NPSD E tj } t E Q(t J,t E ) > 0 TranCand(t J,NPSD E ) CompoTrans max CompoTrans max (t J,NP SD E ) = arg max Q(t J,t E ) t E TranCand(t J,NP SD E) D E NPSD E t E 6. 1,000 1,000 NTCIR-7 2 2,914 500 13,165 5,972 19,137 100 500 3 500 424 (84.8%) 424 423 (99.8%) DB DB 100 4 100 79 (79%) 79 52 (65.8%) 52 cfl 2012 Information Processing Society of Japan 5

4 100 4 () 17 52 27 100 5 52 ( (%)) 6 (11.6) 1 (1.9) and 1 (1.9) 1 (1.9) 43 (82.7) 52 5 52 6 (11.6%) 7. ( ) ( ) IBM [1,2] [6] [9] [10] NTCIR-7 [3] ( 180 ) Support Vector Machines (SVMs) [15] [8] [9] ( ) [5] Google *5 [4] *5 http://www.google.com/ cfl 2012 Information Processing Society of Japan 6

[11],. [13] 8. 70% NTCIR-7 [3] 13% 90% 1,000 [8] 17 pp. 963 966 (2011). [9] Matsumoto, Y. and Utsuro, T.: Lexical Knowledge Acquisition, Handbook of Natural Language Processing (Dale, R., Moisl, H. and Somers, H., eds.), Marcel Dekker Inc., chapter 24, pp. 563 610 (2000). [10] Vol. J93 D, No. 11, pp. 2525 2537 (2010). [11] 13 pp. 879 882 (2007). [12] 11 pp. 17 20 (2005). [13] Vol. 14, No. 2, pp. 33 68 (2007). [14] Utiyama, M. and Isahara, H.: A Japanese-English Patent Parallel Corpus, Proc. MT Summit XI, pp. 475 482 (2007). [15] Vapnik, V. N.: Statistical Learning Theory, Wiley- Interscience (1998). [1] Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L. and Roosin, P. S.: A Statistical Approach to Machine Translation, Computational Linguistics, Vol. 16, No. 2, pp. 79 85 (1990). [2] Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L.: The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, Vol. 19, No. 2, pp. 263 311 (1993). [3] Fujii, A., Utiyama, M., Yamamoto, M. and Utsuro, T.: Overview of the Patent Translation Task at the NTCIR- 7 Workshop, Proc. 7th NTCIR Workshop Meeting, pp. 389 400 (2008). [4] World Wide Web Vol. 47, No. 3, pp. 968 979 (2006). [5] Huang, F., Zhang, Y. and Vogel, S.: Mining Key Phrase Translations from Web Corpora, Proc. HLT/EMNLP, pp. 483 490 (2005). [6] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E.: Moses: Open Source Toolkit for Statistical Machine Translation, Proc. 45th ACL, Companion Volume, pp. 177 180 (2007). [7] Koehn, P., Och, F. J. and Marcu, D.: Statistical Phrase- Based Translation, Proc. HLT-NAACL, pp. 127 133 (2003). cfl 2012 Information Processing Society of Japan 7