Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms


Itsuki Toyota 1, Yusuke Takahashi 1, Kensaku Makita 1, Takehito Utsuro 2, Mikio Yamamoto 2

Abstract: A bilingual lexicon of technical terms is necessary for translating patent documents. This paper studies a method of generating such a lexicon from parallel patent documents. Previous methods extract parallel patent sentences from the Background and Embodiment parts of a patent family. However, the portion from which parallel sentences are actually extracted amounts to only about 30% of the whole Background and Embodiment parts, and the remaining 70% or so is not used. This paper therefore proposes to generate a bilingual lexicon of technical terms from that remaining 70%. The proposed method employs the compositional translation estimation technique, which uses an existing bilingual lexicon. We show that, for 13% of the Japanese compound nouns that are included neither in the phrase translation table trained on parallel patent sentences nor in the existing bilingual lexicon, translation candidates can be generated by compositional translation estimation and can be found in the English part of the patent family. On average, the method generates about two bilingual technical term pairs per patent family, with over 90% accuracy.

Keywords: bilingual lexicon for technical terms, patent family, statistical machine translation

© 2012 Information Processing Society of Japan

1. [10] NTCIR-7 [3] 180 [7] Support Vector Machines (SVMs) [15] [8] [10] 180 [14] 30% 70% NTCIR-7 [12] 13% 90% 1, NTCIR-7 [3] 180 (1) (2) (3) (4) [14]

1 Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
2 Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba, Japan

3. Moses [6] Moses. (1) (2) (3) (4) ( ). (5) (4)

$V_1, V_2, \ldots, V_n$; $W_1, W_2, \ldots, W_m$; $P_J\ (= V_p \cdots V_{p'})$; $P_E\ (= W_q \cdots W_{q'})$; $\langle P_J, P_E \rangle$; $\langle T_J, T_E \rangle$; $V_i$, $W_j$; $p \le i \le p'$, $q \le j \le q'$; $\langle P_J, P_E \rangle$; $\langle T_J, T_E \rangle$

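1 1 1,631,099 1,847,945 2,244,117 47,554 41, ,420 24,696 23,025 82,087 1 parallel mode *1 *2 [12] 2 ( JUMAN *3 ) $P_2$ $P_2$ $B_P$ $B_S$ Ver.131 Ver.79 Ver.131

*1 *2 Ver.79 Ver.131 *3

A source-language term $y_S$ is decomposed into constituents $s_i$:

$$y_S = s_1, s_2, \ldots, s_n$$

Each constituent $s_i$ is translated into $t_i$, and the target-language term $y_T$ is the sequence of these translations:

$$y_T = t_1, t_2, \ldots, t_n$$

$$\langle y_S, y_T \rangle = \langle s_1, t_1 \rangle, \langle s_2, t_2 \rangle, \ldots, \langle s_n, t_n \rangle$$

Each constituent pair $\langle s_i, t_i \rangle$ carries a score $q(\langle s_i, t_i \rangle)$, from which the score of $y_T$ is computed.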
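The decomposition $y_S = s_1, s_2, \ldots, s_n$ can be illustrated with a short sketch. It is only an illustration under assumptions, not the authors' implementation: it assumes the compound noun has already been split into morphemes (the source mentions JUMAN for this step), and the function name `decompositions`, the `max_len` limit, and the placeholder morphemes are hypothetical.

```python
from typing import Iterator, List, Tuple

# A constituent is a contiguous run of morphemes (hypothetical representation).
Constituent = Tuple[str, ...]


def decompositions(tokens: List[str], max_len: int = 3) -> Iterator[List[Constituent]]:
    """Yield every split of `tokens` into contiguous constituents.

    `max_len` caps the number of morphemes per constituent; the cap is an
    assumption made for this sketch, not a value taken from the paper.
    """
    if not tokens:
        yield []
        return
    for k in range(1, min(max_len, len(tokens)) + 1):
        head = tuple(tokens[:k])
        for rest in decompositions(tokens[k:], max_len):
            yield [head] + rest


if __name__ == "__main__":
    # Placeholder morphemes standing in for a segmented Japanese compound noun.
    for d in decompositions(["m1", "m2", "m3"]):
        print(d)
```

Each yielded list is one candidate $s_1, \ldots, s_n$; a scoring step would then look up translations $t_i$ for every constituent.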

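2 (PSD) (NPSD)

The score of one decomposition is the product of its constituent scores, multiplied by the corpus score $Q_{corpus}(y_T)$ of the candidate $y_T$:

$$\prod_{i=1}^{n} q(\langle s_i, t_i \rangle) \cdot Q_{corpus}(y_T)$$

The score $Q(y_S, y_T)$ of the pair sums this quantity over the decompositions of $y_S$:

$$Q(y_S, y_T) = \sum_{y_S = s_1, s_2, \ldots, s_n} \prod_{i=1}^{n} q(\langle s_i, t_i \rangle) \cdot Q_{corpus}(y_T)$$

The score $q(\langle s, t \rangle)$ of a constituent pair $\langle s, t \rangle$ depends on the lexicon that contains the pair:

$$q(\langle s, t \rangle) = \begin{cases} 10^{(\mathrm{compo}(s) - 1)} & (\langle s, t \rangle \text{ in the existing bilingual lexicon}) \\ \log_{10} f_p(\langle s, t \rangle) & (\langle s, t \rangle \text{ in } B_P) \\ \log_{10} f_s(\langle s, t \rangle) & (\langle s, t \rangle \text{ in } B_S) \end{cases}$$

Here $\mathrm{compo}(s)$ is defined on the constituent $s$, and $f_p(\langle s, t \rangle)$ is defined for $\langle s, t \rangle$ with respect to $P_2$.

$Q_{corpus}(y_T)$ indicates whether $y_T$ is found in the corpus:

$$Q_{corpus}(y_T) = \begin{cases} 1 & (y_T \text{ occurs in the corpus}) \\ 0 & (\text{otherwise}) \end{cases}$$

5. $N_J^1$ $B_J$ $N_J^2$ $M_J$ $N_J^3$ $B_J$ $M_J$ $PSD_J$ $NPSD_J$ 2 $D_E$ $PSD_E$ $NPSD_E$ *4

$$D_J = \langle N_J^1, B_J, N_J^2, M_J, N_J^3 \rangle$$

$$\langle B_J, M_J \rangle = \langle PSD_J, NPSD_J \rangle$$

$$D_E = \langle PSD_E, NPSD_E \rangle$$

$\langle B_J, M_J \rangle$ $NPSD_J$ $t_J$

*4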
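The scoring functions above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' code: $\mathrm{compo}(s)$ is assumed to be the number of morphemes in $s$, each constituent pair is represented by a hypothetical tuple `(compo_s, source, freq)`, and the labels `"lexicon"`, `"B_P"` and `"B_S"` are names chosen here for the three cases of $q$.

```python
import math
from typing import Iterable, List, Set, Tuple

# Hypothetical record for one constituent pair <s, t>:
# (assumed compo(s), which lexicon holds the pair, frequency for B_P/B_S pairs)
ConstituentPair = Tuple[int, str, int]


def q(compo_s: int, source: str, freq: int = 0) -> float:
    """Score of one constituent pair, following the three cases of q."""
    if source == "lexicon":
        return float(10 ** (compo_s - 1))
    if source in ("B_P", "B_S"):
        # freq must be positive; a frequency of 1 gives a score of 0.
        return math.log10(freq)
    raise ValueError(f"unknown source: {source}")


def q_corpus(y_t: str, corpus_terms: Set[str]) -> float:
    """1 if the candidate occurs in the corpus, else 0."""
    return 1.0 if y_t in corpus_terms else 0.0


def score_pair(decomps: Iterable[List[ConstituentPair]],
               y_t: str, corpus_terms: Set[str]) -> float:
    """Q(y_S, y_T): sum over decompositions of the product of q, times Q_corpus."""
    total = 0.0
    for decomp in decomps:
        prod = 1.0
        for compo_s, source, freq in decomp:
            prod *= q(compo_s, source, freq)
        total += prod * q_corpus(y_t, corpus_terms)
    return total
```

Because $Q_{corpus}$ multiplies every term, a candidate that never occurs in the corpus scores 0 regardless of how well its constituents match.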

2 1,000

( (%)): 345 (1.5) | 2,914 (13.0) | 13,165 (58.8) | 5,972 (26.7) | 22,396 (100)
( (%)): 5 (1.0) | 71 (14.2) | 423 (84.6) | 1 (0.2) | 500 (100)

For a Japanese term $t_J$ and the English part $NPSD_E$ of $D_E$, the set of translation candidates $\mathrm{TranCand}(t_J, NPSD_E)$ is:

$$\mathrm{TranCand}(t_J, NPSD_E) = \{\, t_E \in NPSD_E \mid Q(t_J, t_E) > 0 \,\}$$

From $\mathrm{TranCand}(t_J, NPSD_E)$, $\mathrm{CompoTrans}_{max}$ selects the candidate with the highest score:

$$\mathrm{CompoTrans}_{max}(t_J, NPSD_E) = \arg\max_{t_E \in \mathrm{TranCand}(t_J, NPSD_E)} Q(t_J, t_E)$$

$D_E$ $NPSD_E$ $t_E$

6. 1,000 1,000 NTCIR-7 2 2, ,165 5,972 19, (84.8%) (99.8%) DB DB (79%) (65.8%) 52
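The candidate set and the arg max above translate directly into code. The following is a sketch under assumptions: the English terms of $NPSD_E$ are assumed to be available as a set of strings, the scoring function $Q$ is passed in as a callable, and the names `tran_cand` and `compo_trans_max` are hypothetical. Returning `None` for an empty candidate set and breaking ties alphabetically are choices made for this sketch only.

```python
from typing import Callable, Optional, Set

# Q(t_J, t_E) supplied by the caller (hypothetical signature).
ScoreFn = Callable[[str, str], float]


def tran_cand(t_j: str, npsd_e_terms: Set[str], score: ScoreFn) -> Set[str]:
    """TranCand: English terms in NPSD_E whose score against t_J is positive."""
    return {t_e for t_e in npsd_e_terms if score(t_j, t_e) > 0}


def compo_trans_max(t_j: str, npsd_e_terms: Set[str], score: ScoreFn) -> Optional[str]:
    """CompoTrans_max: the highest-scoring candidate, or None if there is none."""
    candidates = tran_cand(t_j, npsd_e_terms, score)
    if not candidates:
        return None
    # Iterate in sorted order so that ties are broken deterministically.
    return max(sorted(candidates), key=lambda t_e: score(t_j, t_e))


if __name__ == "__main__":
    # Toy scores for illustration only; these values are not from the paper.
    toy = {("tJ", "a"): 0.0, ("tJ", "b"): 20.0, ("tJ", "c"): 3.0}
    print(compo_trans_max("tJ", {"a", "b", "c"}, lambda j, e: toy.get((j, e), 0.0)))
```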

() ( (%)): 6 (11.6) | 1 (1.9) | and 1 (1.9) | 1 (1.9) | 43 (82.7)

(11.6%)

7. ( ) ( ) IBM [1,2] [6] [9] [10] NTCIR-7 [3] ( 180 ) Support Vector Machines (SVMs) [15] [8] [9] ( ) [5] Google *5 [4]

*5

[11],. [13]

8. 70% NTCIR-7 [3] 13% 90% 1,000

[1] Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L. and Roossin, P. S.: A Statistical Approach to Machine Translation, Computational Linguistics, Vol. 16, No. 2, pp (1990).
[2] Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L.: The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, Vol. 19, No. 2, pp (1993).
[3] Fujii, A., Utiyama, M., Yamamoto, M. and Utsuro, T.: Overview of the Patent Translation Task at the NTCIR-7 Workshop, Proc. 7th NTCIR Workshop Meeting, pp (2008).
[4] World Wide Web Vol. 47, No. 3, pp (2006).
[5] Huang, F., Zhang, Y. and Vogel, S.: Mining Key Phrase Translations from Web Corpora, Proc. HLT/EMNLP, pp (2005).
[6] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E.: Moses: Open Source Toolkit for Statistical Machine Translation, Proc. 45th ACL, Companion Volume, pp (2007).
[7] Koehn, P., Och, F. J. and Marcu, D.: Statistical Phrase-Based Translation, Proc. HLT-NAACL, pp (2003).
[8] 17 pp (2011).
[9] Matsumoto, Y. and Utsuro, T.: Lexical Knowledge Acquisition, Handbook of Natural Language Processing (Dale, R., Moisl, H. and Somers, H., eds.), Marcel Dekker Inc., chapter 24, pp (2000).
[10] Vol. J93-D, No. 11, pp (2010).
[11] 13 pp (2007).
[12] 11 pp (2005).
[13] Vol. 14, No. 2, pp (2007).
[14] Utiyama, M. and Isahara, H.: A Japanese-English Patent Parallel Corpus, Proc. MT Summit XI, pp (2007).
[15] Vapnik, V. N.: Statistical Learning Theory, Wiley-Interscience (1998).
