Toponym Disambiguation in an English-Lithuanian SMT System with Spatial Knowledge

Size: px
Start display at page:

Download "Toponym Disambiguation in an English-Lithuanian SMT System with Spatial Knowledge"

Transcription

1 Toponym Disambiguation in an English-Lithuanian SMT System with Spatial Knowledge Raivis Skadiņš Tilde SIA Vienibas gatve 75a, Riga, Latvia Tatiana Gornostay Tilde SIA Vienibas gatve 75a, Riga, Latvia Valters Šics Tilde SIA Vienibas gatve 75a, Riga, Latvia Abstract This paper presents an innovative research resulting in the English-Lithuanian statistical factored phrase-based machine translation system with a spatial ontology. The system is based on the Moses toolkit and is enriched with semantic knowledge inferred from the spatial ontology. The ontology was developed on the basis of the GeoNames database (more than toponyms), implemented in the web ontology language (OWL), and integrated into the machine translation process. Spatial knowledge was added as an additional factor in the statistical translation model and used for toponym disambiguation during machine translation. The implemented machine translation approach was evaluated against the baseline system without spatial knowledge. A multifaceted evaluation strategy including automatic metrics, human evaluation and linguistic analysis, was implemented to perform evaluation experiments. The results of the evaluation have shown a slight improvement in the output quality of machine translation with spatial knowledge. 1 Introduction and Background During recent decades the corpus-based strategy has become dominant for machine translation, as it has proven to be more effective both from the point of view of time and labour resources and the quality of the output. The statistical approach has occupied the leading position with the first research results performed in the late 1980s. Since then statistical machine translation (SMT) has become the major focus for many research efforts due to its cost effectiveness doubled with the availability of such open source tools as GI- ZA++ (Och and Ney, 2003) and Moses (Koehn et al., 2007), as well as parallel text resources on the Internet. Pure SMT methods (Brown et al., 1993; Koehn et al., 2003) do not use any linguistic knowledge (e.g. morphological information). As a result, they perform better for analytical languages, such as English, with little inflection. Although English and Lithuanian are Indo- European languages and share some grammatical features, they have a wealth of differences. English belongs to the Germanic language group while Lithuanian belongs to the group of Baltic languages. Also, in the morphological typology English is an analytical language in contrast to a synthetic Lithuanian with a rich set of inflections. SMT for synthetic languages with high inflection (e.g. Lithuanian, Latvian, Russian and others) requires larger amounts of training data and additional knowledge to get the same level of performance. Modern SMT methods use different kinds of additional knowledge (e.g. morphological or syntactical) to build more sophisticated statistical models and improve the output quality of machine translation (see, for example, factored SMT (Koehn et al., 2007), tree-based SMT (Chiang 2007; Marcu et al., 2006; Li et al., 2009); treelet SMT (Quirk et al., 2005). This paper presents an innovative research resulting in an English-Lithuanian statistical factored phrasebased machine translation system based on the Moses toolkit and enriched with semantic knowledge inferred from the spatial ontology. Using semantic knowledge in rule-based machine translation is not new in the field. In SMT, however, there has been little research in this area 1. The implemented SMT system that is de- 1 See, for example, the research on extracting phrasal correspondences that are approximately semantically equivalent for building a full-sentence paraphrasing model that then is applied to a single good reference translation for each sentence in a statistical machine translation development set (Madnani et al., 2008). Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa (Eds.) NODALIDA 2011 Conference Proceedings, pp

2 Raivis Skadiņš, Tatiana Gornostay and Valters Šics scribed in this paper uses semantic knowledge to improve the quality of translation, in particular with regard to the disambiguation of geographical names, or toponyms. Spatial knowledge is added to toponyms in the source text as additional semantic tags, or factors. By adding factors into the source text, the translation accuracy is improved. This is the result of resolving semantic ambiguities in the source language. The first part of the paper overviews the system design including a description of its functionality and implementation with spatial knowledge. In the second part we focus on the system multifaceted evaluation and its results, as well as potential limitations of the system. Finally, we present conclusions and future plans. 2 System Design 2.1 Functionality In the overall machine translation theory and in practice English-Lithuanian toponym translation problems have not been researched before. The core functionality of the presented system is a disambiguation of toponyms during the machine translation process. Toponyms are geographical names, or names of places (hydronyms, oronyms, geonyms, oeconyms, etc.). A natural language is ambiguous and toponyms are not exceptions. This fact makes toponyms difficult for processing (e.g. resolution, cross-language information retrieval, human translation and especially machine translation), and due to their linguistic and extra-linguistic nature toponyms require special treatment (Gornostay and Skadiņa, 2009). There are cases when real-world geographical knowledge is required for the resolution of ambiguous toponyms. The implemented SMT system deals with two types of ambiguity (see Leidner (2007) for the description of possible types of toponym ambiguity). The first type is a referential ambiguity, where a toponym may refer to more than one location of the same type, for example: Georgia as the US state and the country in Caucasus (English); Riga as the populated place and the capital of Latvia and as the populated place in the USA, state Michigan (Latvian); Šveicarija as the village in Lithuania and as the country in Europe (Lithuanian). The second type of ambiguity is a feature type ambiguity, where a toponym may refer to more than one place of a different type, for example: Tanfield refers to the populated place as well as the castle in the United Kingdom (English); Gauja refers to the populated place as well as the river in Latvia (Latvian); Šventoji as the town near the Baltic Sea as well as the name of 3 different rivers in Lithuania (Lithuanian). In the implemented system the two described types of toponym ambiguity are resolved using semantic knowledge inferred from the spatial ontology. 2.2 Baseline SMT System The baseline system was a statistical phrasebased machine translation system based on the Moses toolkit and trained on the following publicly available and proprietary corpora: DGT-TM parallel corpus 2 a publicly available collection of legislative texts in 22 languages of the European Union; OPUS parallel corpus a publicly available collection of texts from the web in different domains 3 (Tiedemann, 2004; Tiedemann, 2009). Localization parallel corpus obtained from translation memories that have been created during the localization of software, user manuals and helps. We also included word and phrase translations from bilingual dictionaries and term translations from EuroTermBank 4 to increase word coverage. Monolingual corpora for the training of language models were prepared from corresponding monolingual parts of parallel corpora, as well as Lithuanian news articles collected from the web. Bilingual and monolingual resources prepared and used for the baseline SMT system development are represented in Table 1. Monolingual corpus Lithuanian side of parallel corpora Units ~4,04 mil We chose the EMEA (medical domain) and KDE4 (IT domain) sentence-aligned corpora

3 Toponym Disambiguation in English-Lithuanian SMT System with Spatial Knowledge Web news ~5,22 mil. Total ~9,26 mil. (filtered) Bilingual corpus Parallel units Localization TM ~5,21 mil. DGT-TM ~1,08 mil. OPUS EMEA ~1,04 mil. Dictionary data ~0,27 mil. EuroTermBank data ~0,1 mil. KDE4 ~0,05 mil. Fiction ~0,01 mil. Total (used for the baseline system) ~7,76 mil. (filtered) Table 1. Training corpora. 2.3 Spatial Ontology The spatial ontology to be integrated into the machine translation process was developed using the ontology language, designed and implemented in the web ontology language (OWL) using RCC-8 properties (Region Connection Calculus) (Randell et al., 1992), and tools developed in the SOLIM project 5. RCC-8 properties are as follows: externally connected (EC), disconnected (DC), covered by/tangential proper part (TPP), inside/non-tangential proper part (NTPP), equal (EQ), partial overlap (PO), covers/tangential proper part inverse (TPPi), and contains/nontangential proper part inverse (NTPPi). The spatial ontology consisted of three subontologies: basic and two language ontologies. The basic ontology contained concepts and spatial properties. The two language ontologies contained English and Lithuanian toponyms. Words in language ontologies were matched with concepts in the basic ontology (e.g. United States, US and USA represent the same concept USA). All locations in language ontologies were represented by a geo info.owl code and lexically represented by a haslexrep relation. A list of instances was created on the basis of the GeoNames database 6 (7 continents, 193 countries, 51 USA states, 6359 USA cities, 6955 Lithuanian place names, 1869 cities from top 10 cities of other countries). The GeoNames database contains information about continents, countries and cities and it contains information about spatial relations between these objects. RCC-8 relations were extracted from the GeoNames database To query the spatial ontology we used the function GetSpatialRelations(A,B) to get spatial knowledge about relations between A and B. This information can be inferred from the spatial ontology, whereas we cannot get false or unknown information, for example: GetSpatialRelations(Georgia,Armenia)= EC only if there is enough information in the ontology to infer this relation; GetSpatialRelations(Georgia,Latvia)= DC if this relation can be inferred; GetSpatialRelations(Georgia, Latvia)=, if there is not enough information in the ontology to infer the DC relation. 2.4 Implemented SMT System with Spatial Knowledge For the implemented system with spatial knowledge we used the same training corpora as for the baseline system, as well as prepared two more corpora from the ontology a translation dictionary (~0,02 mil. units) and spatial relation dictionary (~0,42 mil. units). The developed baseline SMT system was a pure phrase-based SMT system which dealt only with surface forms of words. Its translation model contained simple probabilities like: P(Georgia Gruzija) a probability that Georgia is the English translation of the Lithuanian word Gruzija; P(Georgia Džordžija) a probability that Georgia is the English translation of the Lithuanian word Džordžija. It also contained probabilities for all morphological variants of Lithuanian words and phrases. However, it was difficult to choose the correct Lithuanian translation of a given ambiguous English toponym since both probabilities were similar: P(Georgia Gruzija) P(Georgia Džordžija). The factored phrase-based SMT (Koehn and Hoang, 2007) is an extension of the phrase-based approach. It contains an additional annotation at a lexical unit level. The lexical unit is no longer just a token, but a vector of factors that represent different levels of annotation. The training data (a parallel corpus) has to be annotated with additional factors. For instance, it is possible to add lemma or part-of-speech information on source and target sides. 193

4 Raivis Skadiņš, Tatiana Gornostay and Valters Šics The implemented SMT system was based on the Moses toolkit that features factored translation models allowing the integration of additional layers of data directly into the process of translation. Spatial knowledge was used during training and translation processes as additional semantic factors integrated with the source language data. All toponyms in the source text were analysed and tagged (annotated) with semantic factors (spatial knowledge) inferred from the spatial ontology with a reasoner. For example, a toponym Georgia is ambiguous: it can refer to the USA state or the Caucasian country. See the example sentences: There are Lithuanians living in Georgia, Florida and other states. Experts have failed to travel to Georgia at the Tbilisi airport. In the first sentence Georgia refers to the USA state, while in the second one it refers to the Caucasian country. To resolve this type of ambiguity, spatial knowledge was used to determine spatial relations between corresponding toponyms within one sentence. For example, in the first sentence Georgia was annotated with EC.Florida since that information had been inferred from the spatial ontology (Georgia is externally connected to Florida). In the second sentence Georgia was annotated with NTPPi.Tbilisi (Tbilisi is a city in Georgia). We searched a sentence for toponyms and queried the spatial ontology for their relations. If there were more than two toponyms in a sentence we used just one (the first found, but not DC) annotation to each toponym. Compared with a simple unfactored translation model, that kind of factored translation model contained more useful information for toponym disambiguation since it might contain probabilities like: P(Georgia/EC.Florida Džordžija) a probability that Georgia is the English translation of a Lithuanian word Džordžija given that Georgia is externally connected to Florida; P(Georgia/NTPPi.Tbilisi Gruzija) a probability that Georgia is the English translation of Lithuanian word Gruzija given that Georgia encloses Tbilisi. The translation model with probabilities about words and phrases with spatial knowledge helped to perform more accurate toponym disambiguation, because spatial context was included in the translation model. For example, if we have almost equal probabilities for Georgia, being a translation of both Gruzija and Džordžija in the translation model of the baseline system, probabilities with spatial knowledge are significantly different: P(Georgia/EC.Armenia Gruzija) P(Georgia/EC.Armenia Džordžija) P(Georgia/EC.Florida Džordžija) P(Georgia/EC.Florida Gruzija) Thus, during the machine translation process semantic factors inferred from the spatial ontology provide additional information for the Moses decoder. As a result, it helps in choosing the appropriate translation equivalent. Therefore, SMT training data annotated with the proposed kind of spatial knowledge leads to a better machine translation quality. It should also be mentioned that two SMT systems with spatial knowledge were trained. The first system (later referred as Spatial-8) was trained using corpora annotated with all eight RCC-8 spatial relations. The second system (later referred as Spatial-7) was trained using only seven RCC-8 relations since initial experiments, proved with the linguistic analysis, showed that using the DC:disconnected relation did not help in toponym disambiguation. 3 Evaluation and Limitations A multifaceted strategy with three procedures was applied to the evaluation of the output quality of machine translation performed by the implemented system with spatial knowledge: automatic (black-box) evaluation; human evaluation; linguistic analysis. 3.1 Automatic Evaluation For the automatic evaluation the two most popular and widely used metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) were used. Automatic metrics are cost-effective and do not require much human intervention. They allow comparisons of two and more systems, as well as different versions of one system in the process of its implementation and improvement as many times as necessary. A balanced test set of 500 English sentences was developed for the automatic evaluation purposes. Sentences were manually collected from 194

5 Toponym Disambiguation in English-Lithuanian SMT System with Spatial Knowledge the web and translated into Lithuanian by a professional translator (a reference set to be compared with). The breakdown of topics in the corpus is presented in Table 2. Domain Percentage General information 12% about the EU Specification and manuals 12% Popular scientific 12% and educational Official and legal documents 12% News and magazine articles 24% Information technology 18% Letters 5% Fiction 5% Table 2. Testing set. The procedure of the automatic evaluation consists of several sub-processes and the main idea, in general, is in the comparison of machine translation and reference sets. The higher the automatic scores are, the better the machine translation output quality is. BLEU and NIST scores for the baseline system were 27,35 and 5,90 correspondingly. BLEU and NIST scores for the implemented system with spatial knowledge were 27,97 (BLEU) and 5,97 (NIST) for the system Spatial-8 and 27,47 (BLEU) and 5,91 (NIST) for the system Spatial-7 (see Table 3). System BLEU NIST Baseline 27,35 5,90 Spatial-8 27,97 5,97 Spatial-7 27,47 5,91 Table 3. Results of the automatic evaluation. As a result, a slight improvement in the output quality of machine translation with spatial knowledge can be observed. In general, this improvement is not high and is not sufficient for the objective and an integrated evaluation procedure. Results of the automatic evaluation can be explained so that general-purpose development and evaluation corpora used for the evaluation did not contain many ambiguous geographical names. Therefore, the evaluation with the taskspecific evaluation corpus was performed during the human evaluation. Nevertheless, automatic scores were set as a threshold for further experiments. 3.2 Human Evaluation A test set of 464 English sentences containing ambiguous toponyms was developed for human evaluation purposes. A ranking of translated sentences relative to each other was used for the manual evaluation of systems. This was the official determinant of translation quality used in the 2009 Workshop on Statistical Machine Translation shared tasks (Callison-Burch et al., 2009). A web-based human evaluation environment (Skadiņš et al., 2010) was used where source sentences and translation outputs of the two SMT systems could be uploaded as simple txt files. Once the evaluation of the two systems was set up, a link to the evaluation survey was sent to evaluators. Evaluators were evaluating the systems sentence by sentence. Evaluators saw the source sentence and the translation output of the two SMT systems baseline and the one implemented with spatial knowledge. The frequency of preferring each system based on evaluators answers and a comparison of the sentences was calculated. About 20 evaluators participated, each comparing translations of 50 sentences. The manual comparison of the two systems (Baseline vs. Spatial-8) 7 has shown that the implemented SMT system with spatial knowledge is slightly better than the baseline system: in 50,66% of cases evaluators judged its output to be better than the output of the baseline system. Results of the human evaluation do not allow us to say with certainty either the spatial SMT system is significantly better or it is disambiguating toponyms better, since the difference is not convincing and evaluators have been comparing sentences using subjective criteria and not paying a special attention to the translation of toponyms. 3.3 Linguistic Evaluation of Toponym Disambiguation A detailed linguistic analysis of toponym disambiguation during the machine translation process was performed. The same corpus as for the human evaluation was used and the accuracy of the toponym translation was evaluated. The accuracy of the baseline system was 84,09%. The accuracy of the Spatial-8 system was 83,87%. Since results for the baseline system were better, it was decided to analyse the impact of each spatial relation to toponym disambiguation. It was discovered that the accuracy could be increased to 88.00% if the DC:disconnected relation was ignored (system Spatial-7). 7 The human evaluation of the system Spatial-7 is in progress at the moment and will be presented in the final version of the paper. 195

6 Raivis Skadiņš, Tatiana Gornostay and Valters Šics 4 Conclusions and Future works In the paper we have presented how toponyms can be disambiguated in the process of statistical machine translation using spatial knowledge by the example of the English-Lithuanian system. We have overviewed the system design including the description of its functionality, baseline and implementation with spatial knowledge, as well as focused on the system multifaceted evaluation and its results. We can see that the quality of machine translation can be improved by using the semantic information from the spatial ontology. Nevertheless improvement is not big and further more detailed evaluation would be necessary to assess whether this improvement is statistically significant. It was noticed during linguistic evaluation that some RCC-8 properties seem to be much more useful than others (e.g. EC:externally connected and EQ:equal). But a detailed evaluation of the impact of each relation has not been done yet. The EQ property can be used for machine translation of toponyms which are synonyms, for example, a full name and an abbreviation the United States of America and USA. The same property can be used for the so-called exonyms (names of places used by other groups, not locals) as Praha for its inhabitants and Prague for the English (for other examples, see Leidner (2007)). It should be also noted, that the best version of the implemented system with the spatial ontology is not dealing with DC:disconnected relations, e.g. Georgia is disconnected from California or Hawaii. In this case, other types of information in the spatial ontology may be used in further experiments, e.g. the ontology class State and its instances. Moreover, the spatial ontology was not used for disambiguation of common nouns since they were not represented in the ontology. However, a morpho-syntactic type of toponym ambiguity, when a word itself can be a toponym or a common noun in a language) and its resolution can be performed with the help of the spatial ontology, for example: Hook refers to the populated place in the UK and hook is a common noun (English); Liepa refers to the populated place in Latvia and liepa (lime-tree) is a common noun (Latvian); Batą refers to the populated place in Lithuania and batą (shoe) is a common noun (Lithuanian). The proposed approach to toponym disambiguation is not limited to: machine translation per se and can be regarded as generic, i.e. it can be also applied to other fields of natural language processing, e.g. information retrieval; use of spatial knowledge only: other types of implicit or inferred knowledge can be used in a similar way. References Brown P., Della Pietra S., Della Pietra V., Mercer R The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp Callison-Burch C., Koehn P., Monz C., Schroeder J Findings of the 2009 Workshop on Statistical Machine Translation. Proceedings of the Fourth Workshop on Statistical Machine Translation, 1-28, Athens, Greece. Chiang D Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2), pp Koehn P., Och F. J., Marcu D Statistical phrase based translation. Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT/NAACL). Koehn P. and Hoang H Factored Translation Models. Proceedings of EMNLP 07. Li Z., Callison-Burch C., Dyer C., Ganitkevitch J., Khudanpur S., Schwartz L., Thornton W., Weese J., Zaidan O Joshua: An Open Source Toolkit for Parsing-based Machine Translation. Proceedings of the Workshop on Statistical Machine Translation (WMT09). Marcu D., Wang W., Echihabi A., Knight K SPMT: statistical machine translation with syntactified target language phrases. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, July ACL Workshops. Association for Computational Linguistics, pp Quirk C., Menezes A., Cherry C Dependency Treelet Translation: Syntactically Informed Phrasal SMT. Proceedings of ACL Papineni K., Roukos S., Ward T. et al BLEU: a Method for Automatic Evaluation of Machine 196

7 Toponym Disambiguation in English-Lithuanian SMT System with Spatial Knowledge Translation. ACL 02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, Pensilvania. Morristown, NJ: Association for Computational Linguistics, pp Doddington G Automatic Evaluation of Machine Translation Quality Using N-gram Co- Occurrence Statistics. HLT 2002: Human Language Technology Conference: Proceedings of the Second International Conference on Human Language Technology Research, San Diego, California. San Francisco: Morgan Kaufmann Publishers, pp Tiedemann J. and Nygaard L The OPUS corpus parallel & free. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May Tiedemann J News from OPUS A Collection of Multilingual Parallel Corpora with Tools and Interfaces. Recent Advances in Natural Language Processing, vol. V, John Benjamins, Amsterdam/Philadelphia, pp Randell D. A., Cui Z., Cohn A. G A spatial logic based on regions and connection. Proceedings of the 3 rd International Conference on Knowledge Representation and Reasoning, Morgan Kaufmann, San Mateo, pp Skadiņš R., Goba K. and Šics V Improving SMT for Baltic Languages with Factored Models. Proceedings of the Fourth International Conference Baltic HLT 2010, Riga, Latvia, pp Och F. J. and Ney H A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, (29)1, pp , Koehn P., Federico M., Cowan B., Zens R., Duer C., Bojar O., Constantin A., Herbst E Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the ACL 2007 Demo and Poster Sessions, Prague, pp Madnani Nitin, Resnik Philip, Dorr Bonnie, Schwartz Richard Applying Automatically Generated Semantic Knowledge: A Case Study in Machine Translation. Proceedings of the Symposium on Semantic Knowledge Discovery, Organization and Use. Leidner Jochen L Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names. PhD thesis. Institute for Communicating and Collaborative Systems School of Informatics, University of Edinburgh. Gornostay T. and Skadiņa I English-Latvian Toponym Processing: Translation Strategies and Linguistic Patterns. EAMT-2009: Proceedings of the 13th Annual Conference of the European Association for Machine Translation, May 14-15, Universitat Politècnica de Catalunya, Barcelona, Spain, pp ISSN Vol. 11

Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation

Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation Joern Wuebker, Hermann Ney Human Language Technology and Pattern Recognition Group Computer Science

More information

Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms

Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms 1 1 1 2 2 30% 70% 70% NTCIR-7 13% 90% 1,000 Utilizing Portion of Patent Families with No Parallel Sentences Extracted in Estimating Translation of Technical Terms Itsuki Toyota 1 Yusuke Takahashi 1 Kensaku

More information

Improving Relative-Entropy Pruning using Statistical Significance

Improving Relative-Entropy Pruning using Statistical Significance Improving Relative-Entropy Pruning using Statistical Significance Wang Ling 1,2 N adi Tomeh 3 Guang X iang 1 Alan Black 1 Isabel Trancoso 2 (1)Language Technologies Institute, Carnegie Mellon University,

More information

Multi-Source Neural Translation

Multi-Source Neural Translation Multi-Source Neural Translation Barret Zoph and Kevin Knight Information Sciences Institute Department of Computer Science University of Southern California {zoph,knight}@isi.edu Abstract We build a multi-source

More information

Multi-Source Neural Translation

Multi-Source Neural Translation Multi-Source Neural Translation Barret Zoph and Kevin Knight Information Sciences Institute Department of Computer Science University of Southern California {zoph,knight}@isi.edu In the neural encoder-decoder

More information

Enhanced Bilingual Evaluation Understudy

Enhanced Bilingual Evaluation Understudy Enhanced Bilingual Evaluation Understudy Krzysztof Wołk, Krzysztof Marasek Department of Multimedia Polish Japanese Institute of Information Technology, Warsaw, POLAND kwolk@pjwstk.edu.pl Abstract - Our

More information

Natural Language Processing (CSEP 517): Machine Translation

Natural Language Processing (CSEP 517): Machine Translation Natural Language Processing (CSEP 57): Machine Translation Noah Smith c 207 University of Washington nasmith@cs.washington.edu May 5, 207 / 59 To-Do List Online quiz: due Sunday (Jurafsky and Martin, 2008,

More information

Toponym Disambiguation using Ontology-based Semantic Similarity

Toponym Disambiguation using Ontology-based Semantic Similarity Toponym Disambiguation using Ontology-based Semantic Similarity David S Batista 1, João D Ferreira 2, Francisco M Couto 2, and Mário J Silva 1 1 IST/INESC-ID Lisbon, Portugal {dsbatista,msilva}@inesc-id.pt

More information

Phrase Table Pruning via Submodular Function Maximization

Phrase Table Pruning via Submodular Function Maximization Phrase Table Pruning via Submodular Function Maximization Masaaki Nishino and Jun Suzuki and Masaaki Nagata NTT Communication Science Laboratories, NTT Corporation 2-4 Hikaridai, Seika-cho, Soraku-gun,

More information

Tuning as Linear Regression

Tuning as Linear Regression Tuning as Linear Regression Marzieh Bazrafshan, Tagyoung Chung and Daniel Gildea Department of Computer Science University of Rochester Rochester, NY 14627 Abstract We propose a tuning method for statistical

More information

Learning to translate with neural networks. Michael Auli

Learning to translate with neural networks. Michael Auli Learning to translate with neural networks Michael Auli 1 Neural networks for text processing Similar words near each other France Spain dog cat Neural networks for text processing Similar words near each

More information

arxiv: v1 [cs.cl] 22 Jun 2015

arxiv: v1 [cs.cl] 22 Jun 2015 Neural Transformation Machine: A New Architecture for Sequence-to-Sequence Learning arxiv:1506.06442v1 [cs.cl] 22 Jun 2015 Fandong Meng 1 Zhengdong Lu 2 Zhaopeng Tu 2 Hang Li 2 and Qun Liu 1 1 Institute

More information

Results of the WMT13 Metrics Shared Task

Results of the WMT13 Metrics Shared Task Results of the WMT13 Metrics Shared Task Matouš Macháček and Ondřej Bojar Charles University in Prague, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics machacekmatous@gmail.com

More information

A Discriminative Model for Semantics-to-String Translation

A Discriminative Model for Semantics-to-String Translation A Discriminative Model for Semantics-to-String Translation Aleš Tamchyna 1 and Chris Quirk 2 and Michel Galley 2 1 Charles University in Prague 2 Microsoft Research July 30, 2015 Tamchyna, Quirk, Galley

More information

Results of the WMT14 Metrics Shared Task

Results of the WMT14 Metrics Shared Task Results of the WMT14 Metrics Shared Task Matouš Macháček and Ondřej Bojar Charles University in Prague, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics machacekmatous@gmail.com

More information

Machine Learning for Interpretation of Spatial Natural Language in terms of QSR

Machine Learning for Interpretation of Spatial Natural Language in terms of QSR Machine Learning for Interpretation of Spatial Natural Language in terms of QSR Parisa Kordjamshidi 1, Joana Hois 2, Martijn van Otterlo 1, and Marie-Francine Moens 1 1 Katholieke Universiteit Leuven,

More information

Minimum Bayes-risk System Combination

Minimum Bayes-risk System Combination Minimum Bayes-risk System Combination Jesús González-Rubio Instituto Tecnológico de Informática U. Politècnica de València 46022 Valencia, Spain jegonzalez@iti.upv.es Alfons Juan Francisco Casacuberta

More information

Neural Hidden Markov Model for Machine Translation

Neural Hidden Markov Model for Machine Translation Neural Hidden Markov Model for Machine Translation Weiyue Wang, Derui Zhu, Tamer Alkhouli, Zixuan Gan and Hermann Ney {surname}@i6.informatik.rwth-aachen.de July 17th, 2018 Human Language Technology and

More information

Multi-Task Word Alignment Triangulation for Low-Resource Languages

Multi-Task Word Alignment Triangulation for Low-Resource Languages Multi-Task Word Alignment Triangulation for Low-Resource Languages Tomer Levinboim and David Chiang Department of Computer Science and Engineering University of Notre Dame {levinboim.1,dchiang}@nd.edu

More information

Sampling Alignment Structure under a Bayesian Translation Model

Sampling Alignment Structure under a Bayesian Translation Model Sampling Alignment Structure under a Bayesian Translation Model John DeNero, Alexandre Bouchard-Côté and Dan Klein Computer Science Department University of California, Berkeley {denero, bouchard, klein}@cs.berkeley.edu

More information

TALP at GeoQuery 2007: Linguistic and Geographical Analysis for Query Parsing

TALP at GeoQuery 2007: Linguistic and Geographical Analysis for Query Parsing TALP at GeoQuery 2007: Linguistic and Geographical Analysis for Query Parsing Daniel Ferrés and Horacio Rodríguez TALP Research Center Software Department Universitat Politècnica de Catalunya {dferres,horacio}@lsi.upc.edu

More information

CLRG Biocreative V

CLRG Biocreative V CLRG ChemTMiner @ Biocreative V Sobha Lalitha Devi., Sindhuja Gopalan., Vijay Sundar Ram R., Malarkodi C.S., Lakshmi S., Pattabhi RK Rao Computational Linguistics Research Group, AU-KBC Research Centre

More information

Maja Popović Humboldt University of Berlin Berlin, Germany 2 CHRF and WORDF scores

Maja Popović Humboldt University of Berlin Berlin, Germany 2 CHRF and WORDF scores CHRF deconstructed: β parameters and n-gram weights Maja Popović Humboldt University of Berlin Berlin, Germany maja.popovic@hu-berlin.de Abstract Character n-gram F-score (CHRF) is shown to correlate very

More information

A phrase-based hidden Markov model approach to machine translation

A phrase-based hidden Markov model approach to machine translation A phrase-based hidden Markov model approach to machine translation Jesús Andrés-Ferrer Universidad Politécnica de Valencia Dept. Sist. Informáticos y Computación jandres@dsic.upv.es Alfons Juan-Císcar

More information

Word Alignment by Thresholded Two-Dimensional Normalization

Word Alignment by Thresholded Two-Dimensional Normalization Word Alignment by Thresholded Two-Dimensional Normalization Hamidreza Kobdani, Alexander Fraser, Hinrich Schütze Institute for Natural Language Processing University of Stuttgart Germany {kobdani,fraser}@ims.uni-stuttgart.de

More information

TALP Phrase-Based System and TALP System Combination for the IWSLT 2006 IWSLT 2006, Kyoto

TALP Phrase-Based System and TALP System Combination for the IWSLT 2006 IWSLT 2006, Kyoto TALP Phrase-Based System and TALP System Combination for the IWSLT 2006 IWSLT 2006, Kyoto Marta R. Costa-jussà, Josep M. Crego, Adrià de Gispert, Patrik Lambert, Maxim Khalilov, José A.R. Fonollosa, José

More information

Minimum Error Rate Training Semiring

Minimum Error Rate Training Semiring Minimum Error Rate Training Semiring Artem Sokolov & François Yvon LIMSI-CNRS & LIMSI-CNRS/Univ. Paris Sud {artem.sokolov,francois.yvon}@limsi.fr EAMT 2011 31 May 2011 Artem Sokolov & François Yvon (LIMSI)

More information

A Systematic Comparison of Training Criteria for Statistical Machine Translation

A Systematic Comparison of Training Criteria for Statistical Machine Translation A Systematic Comparison of Training Criteria for Statistical Machine Translation Richard Zens and Saša Hasan and Hermann Ney Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6

More information

Evaluation. Brian Thompson slides by Philipp Koehn. 25 September 2018

Evaluation. Brian Thompson slides by Philipp Koehn. 25 September 2018 Evaluation Brian Thompson slides by Philipp Koehn 25 September 2018 Evaluation 1 How good is a given machine translation system? Hard problem, since many different translations acceptable semantic equivalence

More information

Consensus Training for Consensus Decoding in Machine Translation

Consensus Training for Consensus Decoding in Machine Translation Consensus Training for Consensus Decoding in Machine Translation Adam Pauls, John DeNero and Dan Klein Computer Science Division University of California at Berkeley {adpauls,denero,klein}@cs.berkeley.edu

More information

Structure and Complexity of Grammar-Based Machine Translation

Structure and Complexity of Grammar-Based Machine Translation Structure and of Grammar-Based Machine Translation University of Padua, Italy New York, June 9th, 2006 1 2 Synchronous context-free grammars Definitions Computational problems 3 problem SCFG projection

More information

CLEF 2017: Multimodal Spatial Role Labeling (msprl) Task Overview

CLEF 2017: Multimodal Spatial Role Labeling (msprl) Task Overview CLEF 2017: Multimodal Spatial Role Labeling (msprl) Task Overview Parisa Kordjamshidi 1, Taher Rahgooy 1, Marie-Francine Moens 2, James Pustejovsky 3, Umar Manzoor 1 and Kirk Roberts 4 1 Tulane University

More information

Triplet Lexicon Models for Statistical Machine Translation

Triplet Lexicon Models for Statistical Machine Translation Triplet Lexicon Models for Statistical Machine Translation Saša Hasan, Juri Ganitkevitch, Hermann Ney and Jesús Andrés Ferrer lastname@cs.rwth-aachen.de CLSP Student Seminar February 6, 2009 Human Language

More information

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model

More information

Investigating Connectivity and Consistency Criteria for Phrase Pair Extraction in Statistical Machine Translation

Investigating Connectivity and Consistency Criteria for Phrase Pair Extraction in Statistical Machine Translation Investigating Connectivity and Consistency Criteria for Phrase Pair Extraction in Statistical Machine Translation Spyros Martzoukos Christophe Costa Florêncio and Christof Monz Intelligent Systems Lab

More information

Computing Optimal Alignments for the IBM-3 Translation Model

Computing Optimal Alignments for the IBM-3 Translation Model Computing Optimal Alignments for the IBM-3 Translation Model Thomas Schoenemann Centre for Mathematical Sciences Lund University, Sweden Abstract Prior work on training the IBM-3 translation model is based

More information

Spatial Role Labeling CS365 Course Project

Spatial Role Labeling CS365 Course Project Spatial Role Labeling CS365 Course Project Amit Kumar, akkumar@iitk.ac.in Chandra Sekhar, gchandra@iitk.ac.in Supervisor : Dr.Amitabha Mukerjee ABSTRACT In natural language processing one of the important

More information

Lattice-based System Combination for Statistical Machine Translation

Lattice-based System Combination for Statistical Machine Translation Lattice-based System Combination for Statistical Machine Translation Yang Feng, Yang Liu, Haitao Mi, Qun Liu, Yajuan Lu Key Laboratory of Intelligent Information Processing Institute of Computing Technology

More information

A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions Wei Lu and Hwee Tou Ng National University of Singapore 1/26 The Task (Logical Form) λx 0.state(x 0

More information

Machine Translation Evaluation

Machine Translation Evaluation Machine Translation Evaluation Sara Stymne 2017-03-29 Partly based on Philipp Koehn s slides for chapter 8 Why Evaluation? How good is a given machine translation system? Which one is the best system for

More information

Theory of Alignment Generators and Applications to Statistical Machine Translation

Theory of Alignment Generators and Applications to Statistical Machine Translation Theory of Alignment Generators and Applications to Statistical Machine Translation Raghavendra Udupa U Hemanta K Mai IBM India Research Laboratory, New Delhi {uraghave, hemantkm}@inibmcom Abstract Viterbi

More information

Phrasetable Smoothing for Statistical Machine Translation

Phrasetable Smoothing for Statistical Machine Translation Phrasetable Smoothing for Statistical Machine Translation George Foster and Roland Kuhn and Howard Johnson National Research Council Canada Ottawa, Ontario, Canada firstname.lastname@nrc.gc.ca Abstract

More information

Maximal Lattice Overlap in Example-Based Machine Translation

Maximal Lattice Overlap in Example-Based Machine Translation Maximal Lattice Overlap in Example-Based Machine Translation Rebecca Hutchinson Paul N. Bennett Jaime Carbonell Peter Jansen Ralf Brown June 6, 2003 CMU-CS-03-138 School of Computer Science Carnegie Mellon

More information

Variational Decoding for Statistical Machine Translation

Variational Decoding for Statistical Machine Translation Variational Decoding for Statistical Machine Translation Zhifei Li, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Computer Science Department Johns Hopkins University 1

More information

Shift-Reduce Word Reordering for Machine Translation

Shift-Reduce Word Reordering for Machine Translation Shift-Reduce Word Reordering for Machine Translation Katsuhiko Hayashi, Katsuhito Sudoh, Hajime Tsukada, Jun Suzuki, Masaaki Nagata NTT Communication Science Laboratories, NTT Corporation 2-4 Hikaridai,

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

GEOGRAPHICAL NAMES AS PART OF THE GLOBAL, REGIONAL AND NATIONAL SPATIAL DATA INFRASTRUCTURES

GEOGRAPHICAL NAMES AS PART OF THE GLOBAL, REGIONAL AND NATIONAL SPATIAL DATA INFRASTRUCTURES GEOGRAPHICAL NAMES AS PART OF THE GLOBAL, REGIONAL AND NATIONAL SPATIAL DATA INFRASTRUCTURES Željko HEĆIMOVIĆ, Željka JAKIR, Zvonko ŠTEFAN, Danijela KUKIĆ zeljko.hecimovic@cgi.hr, zeljka.jakir@cgi.hr,

More information

A Comparison of Approaches for Geospatial Entity Extraction from Wikipedia

A Comparison of Approaches for Geospatial Entity Extraction from Wikipedia A Comparison of Approaches for Geospatial Entity Extraction from Wikipedia Daryl Woodward, Jeremy Witmer, and Jugal Kalita University of Colorado, Colorado Springs Computer Science Department 1420 Austin

More information

Feature-Rich Translation by Quasi-Synchronous Lattice Parsing

Feature-Rich Translation by Quasi-Synchronous Lattice Parsing Feature-Rich Translation by Quasi-Synchronous Lattice Parsing Kevin Gimpel and Noah A. Smith Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213,

More information

Mining coreference relations between formulas and text using Wikipedia

Mining coreference relations between formulas and text using Wikipedia Mining coreference relations between formulas and text using Wikipedia Minh Nghiem Quoc 1, Keisuke Yokoi 2, Yuichiroh Matsubayashi 3 Akiko Aizawa 1 2 3 1 Department of Informatics, The Graduate University

More information

An Empirical Study on Computing Consensus Translations from Multiple Machine Translation Systems

An Empirical Study on Computing Consensus Translations from Multiple Machine Translation Systems An Empirical Study on Computing Consensus Translations from Multiple Machine Translation Systems Wolfgang Macherey Google Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043, USA wmach@google.com Franz

More information

Annotation tasks and solutions in CLARIN-PL

Annotation tasks and solutions in CLARIN-PL Annotation tasks and solutions in CLARIN-PL Marcin Oleksy, Ewa Rudnicka Wrocław University of Technology marcin.oleksy@pwr.edu.pl ewa.rudnicka@pwr.edu.pl CLARIN ERIC Common Language Resources and Technology

More information

Learning Features from Co-occurrences: A Theoretical Analysis

Learning Features from Co-occurrences: A Theoretical Analysis Learning Features from Co-occurrences: A Theoretical Analysis Yanpeng Li IBM T. J. Watson Research Center Yorktown Heights, New York 10598 liyanpeng.lyp@gmail.com Abstract Representing a word by its co-occurrences

More information

CLEF 2017: Multimodal Spatial Role Labeling (msprl) Task Overview

CLEF 2017: Multimodal Spatial Role Labeling (msprl) Task Overview CLEF 2017: Multimodal Spatial Role Labeling (msprl) Task Overview Parisa Kordjamshidi 1(B), Taher Rahgooy 1, Marie-Francine Moens 2, James Pustejovsky 3, Umar Manzoor 1, and Kirk Roberts 4 1 Tulane University,

More information

Collaborative NLP-aided ontology modelling

Collaborative NLP-aided ontology modelling Collaborative NLP-aided ontology modelling Chiara Ghidini ghidini@fbk.eu Marco Rospocher rospocher@fbk.eu International Winter School on Language and Data/Knowledge Technologies TrentoRISE Trento, 24 th

More information

Fast Consensus Decoding over Translation Forests

Fast Consensus Decoding over Translation Forests Fast Consensus Decoding over Translation Forests John DeNero Computer Science Division University of California, Berkeley denero@cs.berkeley.edu David Chiang and Kevin Knight Information Sciences Institute

More information

1 Evaluation of SMT systems: BLEU

1 Evaluation of SMT systems: BLEU 1 Evaluation of SMT systems: BLEU Idea: We want to define a repeatable evaluation method that uses: a gold standard of human generated reference translations a numerical translation closeness metric in

More information

Shift-Reduce Word Reordering for Machine Translation

Shift-Reduce Word Reordering for Machine Translation Shift-Reduce Word Reordering for Machine Translation Katsuhiko Hayashi, Katsuhito Sudoh, Hajime Tsukada, Jun Suzuki, Masaaki Nagata NTT Communication Science Laboratories, NTT Corporation 2-4 Hikaridai,

More information

IBM Model 1 for Machine Translation

IBM Model 1 for Machine Translation IBM Model 1 for Machine Translation Micha Elsner March 28, 2014 2 Machine translation A key area of computational linguistics Bar-Hillel points out that human-like translation requires understanding of

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Translation as Weighted Deduction

Translation as Weighted Deduction Translation as Weighted Deduction Adam Lopez University of Edinburgh 10 Crichton Street Edinburgh, EH8 9AB United Kingdom alopez@inf.ed.ac.uk Abstract We present a unified view of many translation algorithms

More information

Phrase-Based Statistical Machine Translation with Pivot Languages

Phrase-Based Statistical Machine Translation with Pivot Languages Phrase-Based Statistical Machine Translation with Pivot Languages N. Bertoldi, M. Barbaiani, M. Federico, R. Cattoni FBK, Trento - Italy Rovira i Virgili University, Tarragona - Spain October 21st, 2008

More information

Generalized Stack Decoding Algorithms for Statistical Machine Translation

Generalized Stack Decoding Algorithms for Statistical Machine Translation Generalized Stack Decoding Algorithms for Statistical Machine Translation Daniel Ortiz Martínez Inst. Tecnológico de Informática Univ. Politécnica de Valencia 4607 Valencia, Spain dortiz@iti.upv.es Ismael

More information

ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation Chin-Yew Lin and Franz Josef Och Information Sciences Institute University of Southern California 4676 Admiralty Way

More information

Geographic Analysis of Linguistically Encoded Movement Patterns A Contextualized Perspective

Geographic Analysis of Linguistically Encoded Movement Patterns A Contextualized Perspective Geographic Analysis of Linguistically Encoded Movement Patterns A Contextualized Perspective Alexander Klippel 1, Alan MacEachren 1, Prasenjit Mitra 2, Ian Turton 1, Xiao Zhang 2, Anuj Jaiswal 2, Kean

More information

Annotating Spatial Containment Relations Between Events. Kirk Roberts, Travis Goodwin, and Sanda Harabagiu

Annotating Spatial Containment Relations Between Events. Kirk Roberts, Travis Goodwin, and Sanda Harabagiu Annotating Spatial Containment Relations Between Events Kirk Roberts, Travis Goodwin, and Sanda Harabagiu Problem Previous Work Schema Corpus Future Work Problem Natural language documents contain a large

More information

Toponymic guidelines of Poland for map editors and other users. Fourth revised edition. Submitted by Poland*

Toponymic guidelines of Poland for map editors and other users. Fourth revised edition. Submitted by Poland* UNITED NATIONS Working Paper GROUP OF EXPERTS ON No. 27 GEOGRAPHICAL NAMES Twenty-sixth session Vienna, 2-6 May 2011 Item 19 of the provisional agenda Toponymic guidelines for map editors and other editors

More information

Internet Engineering Jacek Mazurkiewicz, PhD

Internet Engineering Jacek Mazurkiewicz, PhD Internet Engineering Jacek Mazurkiewicz, PhD Softcomputing Part 11: SoftComputing Used for Big Data Problems Agenda Climate Changes Prediction System Based on Weather Big Data Visualisation Natural Language

More information

Applications of Tree Automata Theory Lecture VI: Back to Machine Translation

Applications of Tree Automata Theory Lecture VI: Back to Machine Translation Applications of Tree Automata Theory Lecture VI: Back to Machine Translation Andreas Maletti Institute of Computer Science Universität Leipzig, Germany on leave from: Institute for Natural Language Processing

More information

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval

More information

Translation as Weighted Deduction

Translation as Weighted Deduction Translation as Weighted Deduction Adam Lopez University of Edinburgh 10 Crichton Street Edinburgh, EH8 9AB United Kingdom alopez@inf.ed.ac.uk Abstract We present a unified view of many translation algorithms

More information

Improved Decipherment of Homophonic Ciphers

Improved Decipherment of Homophonic Ciphers Improved Decipherment of Homophonic Ciphers Malte Nuhn and Julian Schamper and Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department, RWTH Aachen University, Aachen,

More information

Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation

Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation Journal of Machine Learning Research 12 (2011) 1-30 Submitted 5/10; Revised 11/10; Published 1/11 Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation Yizhao

More information

SYNTHER A NEW M-GRAM POS TAGGER

SYNTHER A NEW M-GRAM POS TAGGER SYNTHER A NEW M-GRAM POS TAGGER David Sündermann and Hermann Ney RWTH Aachen University of Technology, Computer Science Department Ahornstr. 55, 52056 Aachen, Germany {suendermann,ney}@cs.rwth-aachen.de

More information

On Terminological Default Reasoning about Spatial Information: Extended Abstract

On Terminological Default Reasoning about Spatial Information: Extended Abstract On Terminological Default Reasoning about Spatial Information: Extended Abstract V. Haarslev, R. Möller, A.-Y. Turhan, and M. Wessel University of Hamburg, Computer Science Department Vogt-Kölln-Str. 30,

More information

Tree as a Pivot: Syntactic Matching Methods in Pivot Translation

Tree as a Pivot: Syntactic Matching Methods in Pivot Translation Tree as a Pivot: Syntactic Matching Methods in Pivot Translation Akiva Miura, Graham Neubig,, Katsuhito Sudoh, Satoshi Nakamura Nara Institute of Science and Technology, Japan Carnegie Mellon University,

More information

Project EuroGeoNames (EGN) Results of the econtentplus-funded period *

Project EuroGeoNames (EGN) Results of the econtentplus-funded period * UNITED NATIONS Working Paper GROUP OF EXPERTS ON No. 33 GEOGRAPHICAL NAMES Twenty-fifth session Nairobi, 5 12 May 2009 Item 10 of the provisional agenda Activities relating to the Working Group on Toponymic

More information

Natural Language Processing (CSE 517): Machine Translation

Natural Language Processing (CSE 517): Machine Translation Natural Language Processing (CSE 517): Machine Translation Noah Smith c 2018 University of Washington nasmith@cs.washington.edu May 23, 2018 1 / 82 Evaluation Intuition: good translations are fluent in

More information

Effectiveness of complex index terms in information retrieval

Effectiveness of complex index terms in information retrieval Effectiveness of complex index terms in information retrieval Tokunaga Takenobu, Ogibayasi Hironori and Tanaka Hozumi Department of Computer Science Tokyo Institute of Technology Abstract This paper explores

More information

Natural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale

Natural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale Natural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale Noah Smith c 2017 University of Washington nasmith@cs.washington.edu May 22, 2017 1 / 30 To-Do List Online

More information

Integrating Morphology in Probabilistic Translation Models

Integrating Morphology in Probabilistic Translation Models Integrating Morphology in Probabilistic Translation Models Chris Dyer joint work with Jon Clark, Alon Lavie, and Noah Smith January 24, 2011 lti das alte Haus the old house mach das do that 2 das alte

More information

Economic and Social Council

Economic and Social Council United Nations Economic and Social Council Distr.: General 6 June 2012 Original: English E/CONF.101/138 Tenth United Nations Conference on the Standardization of Geographical Names New York, 31 July 9

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Probabilistic Inference for Phrase-based Machine Translation: A Sampling Approach. Abhishek Arun

Probabilistic Inference for Phrase-based Machine Translation: A Sampling Approach. Abhishek Arun Probabilistic Inference for Phrase-based Machine Translation: A Sampling Approach Abhishek Arun Doctor of Philosophy Institute for Communicating and Collaborative Systems School of Informatics University

More information

Citation for published version (APA): Andogah, G. (2010). Geographically constrained information retrieval Groningen: s.n.

Citation for published version (APA): Andogah, G. (2010). Geographically constrained information retrieval Groningen: s.n. University of Groningen Geographically constrained information retrieval Andogah, Geoffrey IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from

More information

BIBLIOTECA TÈCNICA DE POLÍTICA LINGÜÍSTICA

BIBLIOTECA TÈCNICA DE POLÍTICA LINGÜÍSTICA Place Names Database of Latvia Zane Cekula DOI: 10.2436/15.8040.01.231 Abstract The Place Names Database of Latvia is being developed at the Latvian Geospatial Information Agency. Microsoft Access 2000

More information

Prenominal Modifier Ordering via MSA. Alignment

Prenominal Modifier Ordering via MSA. Alignment Introduction Prenominal Modifier Ordering via Multiple Sequence Alignment Aaron Dunlop Margaret Mitchell 2 Brian Roark Oregon Health & Science University Portland, OR 2 University of Aberdeen Aberdeen,

More information

Language Modelling. Marcello Federico FBK-irst Trento, Italy. MT Marathon, Edinburgh, M. Federico SLM MT Marathon, Edinburgh, 2012

Language Modelling. Marcello Federico FBK-irst Trento, Italy. MT Marathon, Edinburgh, M. Federico SLM MT Marathon, Edinburgh, 2012 Language Modelling Marcello Federico FBK-irst Trento, Italy MT Marathon, Edinburgh, 2012 Outline 1 Role of LM in ASR and MT N-gram Language Models Evaluation of Language Models Smoothing Schemes Discounting

More information

Exploiting WordNet as Background Knowledge

Exploiting WordNet as Background Knowledge Exploiting WordNet as Background Knowledge Chantal Reynaud, Brigitte Safar LRI, Université Paris-Sud, Bât. G, INRIA Futurs Parc Club Orsay-Université - 2-4 rue Jacques Monod, F-91893 Orsay, France {chantal.reynaud,

More information

National Centre for Language Technology School of Computing Dublin City University

National Centre for Language Technology School of Computing Dublin City University with with National Centre for Language Technology School of Computing Dublin City University Parallel treebanks A parallel treebank comprises: sentence pairs parsed word-aligned tree-aligned (Volk & Samuelsson,

More information

ECONOMIC AND SOCIAL COUNCIL 13 July 2007

ECONOMIC AND SOCIAL COUNCIL 13 July 2007 UNITED NATIONS E/CONF.98/CRP.34 ECONOMIC AND SOCIAL COUNCIL 13 July 2007 Ninth United Nations Conference on the Standardization of Geographical Names New York, 21-30 August 2007 Item 17(b) of the provisional

More information

Gappy Phrasal Alignment by Agreement

Gappy Phrasal Alignment by Agreement Gappy Phrasal Alignment by Agreement Mohit Bansal UC Berkeley, CS Division mbansal@cs.berkeley.edu Chris Quirk Microsoft Research chrisq@microsoft.com Robert C. Moore Google Research robert.carter.moore@gmail.com

More information

The OntoNL Semantic Relatedness Measure for OWL Ontologies

The OntoNL Semantic Relatedness Measure for OWL Ontologies The OntoNL Semantic Relatedness Measure for OWL Ontologies Anastasia Karanastasi and Stavros hristodoulakis Laboratory of Distributed Multimedia Information Systems and Applications Technical University

More information

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar NATURAL LANGUAGE PROCESSING Dr. G. Bharadwaja Kumar Sentence Boundary Markers Many natural language processing (NLP) systems generally take a sentence as an input unit part of speech (POS) tagging, chunking,

More information

B u i l d i n g a n d E x p l o r i n g

B u i l d i n g a n d E x p l o r i n g B u i l d i n g a n d E x p l o r i n g ( Web) Corpora EMLS 2008, Stuttgart 23-25 July 2008 Pavel Rychlý pary@fi.muni.cz NLPlab, Masaryk University, Brno O u t l i n e (1)Introduction to text/web corpora

More information

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark Penn Treebank Parsing Advanced Topics in Language Processing Stephen Clark 1 The Penn Treebank 40,000 sentences of WSJ newspaper text annotated with phrasestructure trees The trees contain some predicate-argument

More information

Monte Carlo techniques for phrase-based translation

Monte Carlo techniques for phrase-based translation Monte Carlo techniques for phrase-based translation Abhishek Arun (a.arun@sms.ed.ac.uk) Barry Haddow (bhaddow@inf.ed.ac.uk) Philipp Koehn (pkoehn@inf.ed.ac.uk) Adam Lopez (alopez@inf.ed.ac.uk) University

More information

Boolean and Vector Space Retrieval Models

Boolean and Vector Space Retrieval Models Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) 1

More information

Cross-Lingual Language Modeling for Automatic Speech Recogntion

Cross-Lingual Language Modeling for Automatic Speech Recogntion GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The

More information

A Data Repository for Named Places and Their Standardised Names Integrated With the Production of National Map Series

A Data Repository for Named Places and Their Standardised Names Integrated With the Production of National Map Series A Data Repository for Named Places and Their Standardised Names Integrated With the Production of National Map Series Teemu Leskinen National Land Survey of Finland Abstract. The Geographic Names Register

More information