Language Information RetrievalÓ j

Size: px

Start display at page:

Download "Language Information RetrievalÓ j"

Samuel Carroll
5 years ago
Views:

1 ± i jschang@cs.nthu.edu.tw o ± k j ± j z ± wyw j ± j ÒCross Language Information RetrievalÓ j h ±ÒQuery TranslationÓ ± k h ± ± ± k ± h ± j ± ± ± 1. ± j y k j ± j z ± y p wywj ± 1

2 j ÒCross-Language Information RetrievalÓ q z ± ± j ä ±åòdocument translationó ä hi ±åòquery translationó ÒMcCarley 1999Ó h ± NTCIR-2 i ÒKando et al. 2001Ó h h Assembly Parade Law Parade and Demonstration Constitution Freedom of speech Indemnification Communism Country separation Council of Grand Justices Legislation Amendments h ± dg ÒWord Sense DisambiguationÓ ÒIde and Veronis 1998, Chen and Chang 1998Ó ±ÒPhrase TranslationÓ ± ± ÒText CollectionÓ ± d g ± ÒLexical ChoiceÓ demonstration ± ä å ± ä å 2

3 j u 1. ± ÒGey and Chen 1997, Kwok 2001Ó 2. ÒOard 1999, Kwok 2001Ó u ± ± Kwok ± Michael Jordan ±ä å j ± Chen Ò1999Ó Òoccurrence statisticsó ± µ ± j ± ÒKnight and Graehl 1997, Chang et al. 2001Ó j j p j h ± ± ± ÒExample-based ApproachÓ ± ji Òdata-drivenÓ j i ± IBM Watson Brown Ò1988, 1990, 1993Ó ± j p j k u ± ÒStatistical Alignment and Machine TranslationÓ h ± j h ± j g u ± ÒAlignment ProbabilityÓ 3

4 ± 2. ± ± ± ÒDirect ApproachÓ r ÒTransfer ApproachÓ 1980 j ÒEmpirical ApproachÓ ± ± Brown p Brown S ± T ± ÒTranslation ProbabilityÓ Pr( T S ) g (a) ± ÒLexical Translation ProbabilityÓ Pr( S i T j ) (b) ÒFertility ProbabilityÓ Pr( a b ) (c) ÒDistortion ProbabilityÓ Pr(i j, k, m ) S i S i T j T j a S i b T j k S m T Brown ± k ä å ä åqv ÒExpectation and Maximization 4

5 AlgorithmÓ ä å º ± ä å º ± k EM qv ± ± k ÒNoisy Channel ModelÓ ± N-gram ÒLanguage ModelÓ qv ÒBeam SearchÓ ± 3. ± Brown ± ± ÒalignmentÓ y ± ± S i, i i T j S i j 1 Pr(j i, k, m) 1 Pr(j i, k, m) =0,i i k ± ± A* S 1 S 2 S 3 ± S 1 {T 1, T 2 } S 2 {T 3, T 4 } S 3 {T 5 } A* = (0, 12, 34, 5)Ò 0 Ó k =3 m =5 ± A* 35% A* ÒMaximum Likelihood EstimationÓ 5

6 Pr MLE ( A* )=0.35 Pr( A* )=P(1 1,3,5) P(2 1,3,5) P(3 2,3,5) P(4 2,3,5) P(5 3,3,5) j Ò0.6Ó P(j i,3,5) k Pr( A* ) < (0.6) 5 = << 0.35 Pr(A*)<<Pr MLE ( A* ) w ± ± ± ÒAlignment ProbabilityÓ ± ± S ± T ± PrÒT SÓ g (a) ± ÒLexical Translation ProbabilityÓ Pr( T(A i ) S i ) (b) ± ÒAlignment ProbabilityÓ Pr ( A k, m )=Pr ( A 0, A 1, A 2,,A k k, m ) S i S i T(A i ) T S i A 0 T S f A i T S i f, i>0 k S m T S i T 6

7 B(A, i) E(A, i) k S i T(A i )={T B(A,i), T B(A,i)+1, T E(A,i) } A i ={B(A,i), B(A,i)+1,, E(A,i) } 4. ± k g 1. ± j 2. ± u ± k kn 3. ± j k 4. ± j 4.1 BDCr ÒBDC 1992Ó 3 ± ± i i k 96,156 ± ÒP n, Q n Ó,n=1,N h EM qv ª ± ± Och Ò2000Ó 7

8 IBM 1. Brown EM qv ± 2. ± j jv Pr(j i, k, m )=1/m j 0.5 i 0.5 3I ( i j, k, m) = 1 [1] m k i = k = j = m = S T i k j M Pr( j i,k,m) flight flight flight flight eight eight eight eight ± 2 ± Pr( j i,k,m) Òflight eight, Ó 1 v 1 8

9 E C ± Pr (C E) Pr( C E) N k m n= 1 i= 1 j= 1 = N δ ( E, P ( i)) δ ( C, Q ( j)) Pr( j i, k, m) k m n= 1 i= 1 j= 1 n δ ( E, P ( i)) Pr( j i, k, m) P n (i) P n i Q n (j) Q n j k = P n m = Q n δ(x, y) =1 x = y, δ(x, y) =0 x y n n [2] S i T j i k j m Pr(j i,k,m) Pr(T j S i ) Pr(T j S i ) Pr(j i,k,m) flight flight flight flight eight eight eight eight ± 2 E C E Pr (C E) ± 9

10 4.2 EM qv v EM qv º ÒGreedy MethodÓ ÒP n, Q n Ó 0 ± Ò 2Ó º (threshold) k y ¼ flight eight 2 3 Ò0, 34, 12Ó S i T i i j k m Pr(j i,k,m) Pr(T j S i ) Pr(T j S i ) Pr(j i,k,m) flight flight eight eight Òflight eight, Ó Ò0, 34, 12Ó v - ± k 10

11 ± k m u A count( A ( S, T) ) Pr( A k, m) = [3] count( k = S, m = T ) 4 k m A A 0 A 1 A 2 Pr(A k,m) ± 15 EM qv 601 u 38 u ¼ 11

12 1. ± ± ± % y S T T(A 0 ) S 1 T(A 1 ) S 2 T(A 2 ) T-shaped antenna ƒ T-shaped antenna ƒ X-ray examination X-ray examination irresistible force irresistible force Unwritten law unwritten law Central Asia Central Asia mutual non-interference mutual non-interference undesirable element undesirable element unalterable truth unalterable truth come soon come soon a desperado a desperado 5 5 u v ± 4.2 ± g 12

13 Òbound morphemeó $empty$ i Òdata sparsenessó $any$ Good-Turing Òsmoothing methodó $any$ ± E C PrÒC EÓ flight flight flight flight $empty$ flight flight flight flight flight flight flight flight flight d flight $any$ flight ± 6 flight ± $empty$ $any$ flight $empty$ k ± 4 Ò0,0,1234Ó 13

14 Ò1,0,234Ó $any$ EM qv $empty$ r 4.3 EM qv v jv 1 ± S T A 0 A 1 A 2 T(A 0 ) T(A 1 ) T(A 2 ) Pr(T S,A) flight eight flight eight flight eight $empty$ flight eight flight eight Pr( flight eight) 5 jv ÒS, TÓ A v ± Pr(T S, A) A Pr(T S, A) A A (S i, T(A i )) Pr( T S) = max Pr( T S, A) = max Pr( A k, m) A A* A k i= 1 Pr( T ( A ) S ) i i 14

15 A * = arg max Pr( T S, A) A = arg max Pr( A k, m) A k i= 1 Pr( T( A ) S ) i i [4] k = S, m = T (S, T) =Òflight eight, Ó A ± v A = (0, 12, 34) Pr(T S,A) =Pr(0,12,34 2,4) Pr( flight) Pr( eight) A = (0, 34, 12) Pr(T S,A) =Pr(0,34,12 2,4) Pr( flight) Pr( eight) A = (0, 3, 124) Pr(T S,A) =Pr(0, 3, 124 2,4) Pr( flight) Pr( eight) A = (2, 34, 1) Pr(T S,A) =Pr(2, 34, 1 2,4) Pr( flight) Pr( eight) A = (12, 34, 0) Pr(T S,A) =Pr(12, 34, 0 2,4) Pr( flight) Pr($empty$ eight) 7 Òflight eight, Ó ± 7 A* = (0, 34, 12) 5. ± ± k 10 i EM qv j j 15

16 S T T(A 0 ) T(A 1 ) T(A 2 ) T(A 0 ) T(A 1 ) T(A 2 ) association football delay flip-flop I demodulator fg fg f g Disgraceful act disregard to secret ballot bearer stock false retrieval used car infix operation jv jv jv 8 jv ± ¼ ± j j 1. j n ± association ± ä å ä å association football ± ± äa å 2. ± ± Òlight verbó make take to 3. ± Òflight eight, Ó ¼ u j Brown u 16

17 8 Och (2000) 100 ( ) 2 u S (sure) P (possible) S P S P j (recall) (precision) (error rate) A Ι S recall = S 185 = = A Ι P precision = A 226 = = A Ι S + AΙ P AER( S, P; A) = 1 = 1 = A + S ± 6.1 ± ± i k i Òflight attendant Ó Òflight, Ó Òflight attendant Ó flight $any$ $any$ j 17

18 j œ~ d d œ~ Òflight, Ó ±Òflight, Ó Òflight, Ó ± Òflight, Ó 0 Òattendant, Ó ±Òattendant, Ó Òattendant, Ó Òattendant, Ó Òattendant, $any$ó ± j Òpreservation, Ó Òpreservation, Ó Òpreservation, Ó ± Òflight attendant Ó ± Pr($any$ flight) Pr($any$ eight) m ± Pr(A 2,3) u Ò0, 12, 3Ó Òflight, Ó Òattendant, Ó ± Òflight, Ó Òattendant, Ó j j Òflight, Ó Òattendant, Ó 18

19 LTP º k LTP j 6.2 ± ± ±j ± ± v j j h ± ± ± h ± u 1. ± Òdocument frequencyó ± ± 2. k k ± N-gram qv ± ± ª ± ± h S ± T* Pr( T S) = max Pr( T S, A) A 19

20 T * = arg max Pr( T S) Pr( T ) T = arg max max Pr( T S, A) Pr( T ) T = arg max max Pr( A k, m) T A A k i= 1 Pr( T ( A ) S ) Pr( T ) Pr(T) = ± T k = S m = T Pr(A k, m) Pr(T S) v qv ÒBranch and Bound AlgorithmÓ T* Pr(A k, *) Pr(* S i ) gòsolutionó Òupper boundó Pr(* S i ) Pr(T) N-gram k r g œ ± ƒ ± ÒKupiec 1993Ó ± k º P v ± A ± T j+1, j+m i i arg max Pr( A n( i), m) Pr( T A, j, m k= 1 j+ 1, j+ m = arg max Pr( A n( i), m) Pr( T A, j, m n( i ) = arg max Pr( A n( i), m) Pr( T A, j, m j+ 1, j+ m P, A) S b, e, A) j+ B( A, K ), j+ E( A, k ) 20 S b+ k 1 A n m )

21 Pr( A n, m) A S b+k-1 P k T j+b(a,k),j+e(a,k) P k ± A ± qv S T k (P 1, Q 1 ), (P 2, Q 2 ),,(P k, Q k ) (1). ÒPart of speech taggeró S ÒlemmaÓ (2). Òshallow parsingó Òbasic phraseó P 1, P 2,,P k P I = S b(i), e(i), P b =n(i) (3). P I v ± A ± T j ð (i) 1, j m*(i) ÒBranch and Bound AlgorithmÓ A*(i), j*(i), m*(i) argmax Pr( A n( i), m) A, j, m n( i) k = 1 Pr( T j+ B( A, k ), j+ E ( A, k ) S b+ k 1 ) (4). P I (P I, T j ð (i) 1, j m*(i) ) 7. ± ± h ± j 21

22 Brown v ±ª ƒ ± ± ± j h 1. ± 2. dg h œ ± 3. d Wordnet ± dg ƒf H g ~ ± 1. BDC 1992 The BDC Chinese-English electronic dictionary (version 2.0), Behavior Design Corporation, Taiwan. 2. Brown, P. F., Cocke J., Della Pietra S. A., Della Pietra V. J., Jelinek F., Mercer R. L., and Roosin P. S A Statistical Approach to Language Translation, In Proceedings of the 12th International Conference on Computational Linguistics, Budapest, Hungary, pp

23 3. Brown, P. F., Cocke J., Della Pietra S. A., Della Pietra V. J., Jelinek F., Lafferty J. D., Mercer R. L., and Roosin P. S A Statistical Approach to Machine Translation, Computational Linguistics, 16/2, pp Brown, P. F., Della Pietra S. A., Della Pietra V. J., and Mercer R. L The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19/2, pp Chang, J. S. et al Nathu IR System at NTCIR-2. In Proceedings of the Second NTCIR Workshop Meeting on Evaluation of Chinese and Japanese Text Retrieval and Text Summarization, pp. (5) 49-52, National Institute of Informatics, Japan. 6. Chang, J. S., Ker S. J., and Chen M. H Taxonomy and Lexical Semantics From the Perspective of Machine Readable Dictionary, In Proceedings of the third Conference of the Association for Machine Translation in the Americas (AMTA), pp Chen, H.H., G.W. Bian and W.C. Lin Resolving Translation Ambiguity and Target Polysemy in Cross-Language Information Retrieval. In Proceedings of the 37 th Annual Meeting of the Association for Computation Linguistics, pp Dagan, I., Church K. W. and Gale W. A Robust Bilingual Word Alignment or Machine Aided Translation, In Proceedings of the Workshop on Very Large Corpora Academic and Industrial Perspectives, pp Fung, P. and McKeown K Aligning Noisy Parallel Corpora across Language Groups: Word Pair Feature Matching by Dynamic Time Warping, In Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA), pp , Columbia, Maryland, USA. 10. Gale, W. A. and Church K. W Identifying Word Correspondences in Parallel Texts, In Proceedings of the Fourth DARPA Speech and Natural Language Workshop, pp Gey, F C and A. Chen Phrase Discovery for English and Cross-Language Retrieval at TREC-6. In Proceedings of the 6 th Text Retrieval Evaluation Conference, pp

24 12. Ide, N. and J Veronis Special Issue on Word Sense Disambiguation, editors, Computational Linguistics, 24/ Isabelle, P Machine Translation at the TAUM Group, In M. King, editor, Machine Translation Today: The State of the Art, Proceedings of the Third Lugano Tutorial, pp Kando, Noriko, Kenro Aihara, Koji Eguchi and Hiroyuki Kato Proceedings of the Second NTCIR Workshop Meeting on Evaluation of Chinese and Japanese Text Retrieval and Text Summarization, National Institute of Informatics, Japan. 15. Kay, M. and Röscheisen M Text-Translation Alignment, Technical Report P , Xerox Palo Alto Research Center. 16. Ker, S. J. and Chang J. S A Class-base Approach to Word Alignment, Computational Linguistics, 23/2, pp Knight, K. and J Graehl Machine Transliteration, In Proceedings of the 35 th Annual Meeting of the Association for Computational Linguistics and 8 th Conference of ACL European Chapter, pp Kupiec, Julian An Algorithm for finding noun phrase correspondence in bilingual corpus, In ACL 31, 23/2, pp Kwok, K L NTCIR-2 Chinese, Cross-Language Retrieval Experiments Using PIRCS. In Proceedings of the Second NTCIR Workshop Meeting on Evaluation of Chinese and Japanese Text Retrieval and Text Summarization, pp. (5) 14-20, National Institute of Informatics, Japan. 20. Longman Group 1992 Longman English-Chinese Dictionary of Contemporary English, Published by Longman Group (Far East) Ltd., Hong Kong. 21. McCarley, J. Scott Should we Translate the Documents or the Queries in Cross-Language Information Retrieval? In Proceedings of the 37 th Annual Meeting of the Association for Computation Linguistics, pp Melamed, I. D Automatic Construction of Clean Broad-Coverage Translation Lexicons, In Proceedings of the second Conference of the Association for Machine Translation in the Americas (AMTA), pp Nagao, M Machine Translation: How Far Can it Go? Oxford University Press, Oxford. 24

25 24. Oard, D W and J. Wang Effect of Term Segmentation on Chinese/English Cross-Language Information Retrieval. In Proceedings of the Symposium on String and Processing and Information Retrieval Och, Franz Josef and Hermann Ney Improved Statistical Alignment Models. In Proceedings of the 38 th Annual Meeting of the Association for Computation Linguistics. 26. Pirkola, A The Effect of Query Structure and Dictionary Setups in Dictionary-based Cross-Language Retrieval. In Proceedings of the 21 st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp Smadja, F., McKeown K., and Hatzivassiloglou V Translating Collocations for Bilingual Lexicons: A Statistical Approach, Computational Linguistics, 22/1, pp Utsuro, T., Ikeda H., Yamane M., Matsumoto M., and Nagao M Bilingual Text Matching Using Bilingual Dictionary and Statistics, In Proceedings of the 15th International Conference on Computational Linguistics, pp Wu, D. and Xia X Learning an English-Chinese Lexicon from a Parallel Corpus, In Proceedings of the first Conference of the Association for Machine Translation in the Americas (AMTA), pp

Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms

1 1 1 2 2 30% 70% 70% NTCIR-7 13% 90% 1,000 Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms Itsuki Toyota 1 Yusuke Takahashi 1 Kensaku