Natural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale

Natural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale. Noah Smith, © 2017 University of Washington, nasmith@cs.washington.edu, May 22, 2017

To-Do List: Online quiz due Sunday; A5 due May 28 (Sunday); watch for final exam instructions around May 29 (Monday).

Neural Machine Translation Original idea proposed by Forcada and Ñeco (1997); resurgence in interest starting around 2013. Strong starting point for current work: Bahdanau et al. (2014). (My exposition is borrowed with gratitude from a lecture by Chris Dyer.) This approach eliminates (hard) alignment and phrases. Take care: here, the terms "encoder" and "decoder" are used differently than in the noisy-channel pattern.

High-Level Model $p(E = e \mid F = f) = p(E = e \mid \mathrm{encode}(f)) = \prod_{j=1}^{l} p(e_j \mid e_0, \ldots, e_{j-1}, \mathrm{encode}(f))$. The encoding of the source sentence is a deterministic function of the words in that sentence.
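
To make the factorization concrete, here is a minimal Python sketch (not from the lecture) of scoring a translation by chaining the per-step conditional log-probabilities. The names encode and next_token_logprobs are hypothetical stand-ins for the encoder and the decoder's per-step distribution.

from typing import Callable, Sequence

def sentence_logprob(
    target: Sequence[str],
    source: Sequence[str],
    encode: Callable[[Sequence[str]], object],
    next_token_logprobs: Callable[[Sequence[str], object], dict],
) -> float:
    """Sum log p(e_j | e_0, ..., e_{j-1}, encode(f)) over the target words."""
    context = encode(source)           # deterministic function of the source words
    prefix = ["<s>"]                   # e_0: start symbol
    total = 0.0
    for word in list(target) + ["</s>"]:
        logprobs = next_token_logprobs(prefix, context)  # {word: log-probability}
        total += logprobs[word]
        prefix.append(word)
    return total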

Building Block: Recurrent Neural Network Review from lecture 2! Each input element is understood to be an element of a sequence: $x_1, x_2, \ldots, x_l$. At each timestep $t$: the $t$th input element $x_t$ is processed alongside the previous state $s_{t-1}$ to calculate the new state $s_t$; the $t$th output is a function of the state $s_t$. The same functions are applied at each iteration: $s_t = g_{\mathrm{recurrent}}(x_t, s_{t-1})$, $y_t = g_{\mathrm{output}}(s_t)$.
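
A minimal NumPy sketch of this recurrence, with illustrative (assumed) choices of tanh for g_recurrent and a softmax for g_output; all sizes and parameter names are invented for the example.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 5                        # illustrative sizes

W_x = rng.normal(scale=0.1, size=(d_state, d_in))     # input -> state
W_s = rng.normal(scale=0.1, size=(d_state, d_state))  # state -> state
W_y = rng.normal(scale=0.1, size=(d_out, d_state))    # state -> output

def g_recurrent(x_t, s_prev):
    # s_t = g_recurrent(x_t, s_{t-1})
    return np.tanh(W_x @ x_t + W_s @ s_prev)

def g_output(s_t):
    # y_t = g_output(s_t): here, a softmax over d_out outcomes
    z = W_y @ s_t
    z = z - z.max()                                   # numerical stability
    p = np.exp(z)
    return p / p.sum()

s = np.zeros(d_state)                                 # s_0 = 0
for x in rng.normal(size=(3, d_in)):                  # a length-3 input sequence
    s = g_recurrent(x, s)
    y = g_output(s)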

Neural MT Source-Sentence Encoder [Figure: embedding lookups for the source sentence "Ich möchte ein Bier" feed a forward RNN and a backward RNN (both started from zero states); their states are stacked to form the source sentence encoding.] $F$ is a $d \times m$ matrix encoding the source sentence $f$ (length $m$).
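
One plausible NumPy rendering of the encoder, under the assumption that each column of F concatenates the forward and backward RNN states for one source word; the dimensions and parameter names are illustrative, not the lecture's.

import numpy as np

rng = np.random.default_rng(1)
d_emb, d_hid = 6, 4             # illustrative sizes; here d = 2 * d_hid
m = 4                           # source length, e.g. "Ich möchte ein Bier"
X = rng.normal(size=(m, d_emb)) # embedding lookups for the source words

W_xf = rng.normal(scale=0.1, size=(d_hid, d_emb))   # forward RNN parameters
W_sf = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_xb = rng.normal(scale=0.1, size=(d_hid, d_emb))   # backward RNN parameters
W_sb = rng.normal(scale=0.1, size=(d_hid, d_hid))

def run_rnn(inputs, W_x, W_s):
    s, states = np.zeros(d_hid), []
    for x in inputs:
        s = np.tanh(W_x @ x + W_s @ s)
        states.append(s)
    return states

fwd = run_rnn(X, W_xf, W_sf)               # left-to-right states
bwd = run_rnn(X[::-1], W_xb, W_sb)[::-1]   # right-to-left states, realigned

# Column j of F concatenates the forward and backward states for source word j.
F = np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)], axis=1)
assert F.shape == (2 * d_hid, m)           # F is d x m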

Decoder: Contextual Language Model Two inputs, the previous word and the source sentence context: $p(e_t = v \mid e_1, \ldots, e_{t-1}, f) = [y_t]_v$, where $s_t = g_{\mathrm{recurrent}}(e_{e_{t-1}}, \underbrace{F a_t}_{\text{context}}, s_{t-1})$ and $y_t = g_{\mathrm{output}}(s_t)$. (The forms of the two component $g$s are suppressed; just remember that they (i) have parameters and (ii) are differentiable with respect to those parameters.) The neural language model we discussed earlier (Mikolov et al., 2010) didn't have the context as an input to $g_{\mathrm{recurrent}}$.
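
A sketch of one decoder step, under the assumption that the previous word's embedding and the context F a_t are simply concatenated before the recurrence (the slide leaves the form of g_recurrent unspecified); all sizes and parameter names are illustrative.

import numpy as np

rng = np.random.default_rng(2)
d_emb, d_ctx, d_state, vocab = 6, 8, 10, 50           # illustrative sizes

W_in = rng.normal(scale=0.1, size=(d_state, d_emb + d_ctx))
W_s = rng.normal(scale=0.1, size=(d_state, d_state))
W_y = rng.normal(scale=0.1, size=(vocab, d_state))

def decoder_step(prev_word_emb, context, s_prev):
    """One step: s_t from the previous word's embedding, the context F a_t,
    and s_{t-1}; y_t is a distribution over the target vocabulary."""
    s_t = np.tanh(W_in @ np.concatenate([prev_word_emb, context]) + W_s @ s_prev)
    z = W_y @ s_t
    z = z - z.max()
    y_t = np.exp(z) / np.exp(z).sum()
    return s_t, y_t

s, y = decoder_step(rng.normal(size=d_emb), rng.normal(size=d_ctx), np.zeros(d_state))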

Neural MT Decoder [Figure, built up over several slides: starting from a zero initial state, the decoder computes an attention vector $a_1, \ldots, a_5$ over the columns of the source encoding at each step and emits the target words "I'd like a beer STOP" one at a time.]

Computing Attention Let $V s_{t-1}$ be the expected input embedding for timestep $t$. (Parameters: $V$.) Attention is $a_t = \mathrm{softmax}(F^{\top} V s_{t-1})$. Context is $F a_t$, i.e., a weighted sum of the source words' in-context representations.
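
This slide translates almost directly into NumPy. The transpose on F follows from the dimensions (F is d x m, so F-transpose times a d-vector gives one score per source position); the sizes below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
d, m, d_state = 8, 4, 10                  # F is d x m; s_{t-1} has length d_state

F = rng.normal(size=(d, m))               # source-sentence encoding
V = rng.normal(scale=0.1, size=(d, d_state))
s_prev = rng.normal(size=d_state)

scores = F.T @ (V @ s_prev)               # one score per source position
scores = scores - scores.max()
a_t = np.exp(scores) / np.exp(scores).sum()   # attention weights (sum to 1)
context = F @ a_t                         # weighted sum of the columns of F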

Learning and Decoding $\log p(e \mid \mathrm{encode}(f)) = \sum_{i=1}^{l} \log p(e_i \mid e_{0:i-1}, \mathrm{encode}(f))$ is differentiable with respect to all parameters of the neural network, allowing end-to-end training. Trick: train on shorter sentences first, then add in longer ones. Decoding typically uses beam search.
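
For decoding, here is a generic beam-search sketch (not the lecture's implementation): step_logprobs(prefix) is a hypothetical callable returning log-probabilities of the next word given the prefix and the encoded source, e.g. built from the decoder step above. It omits refinements such as length normalization.

def beam_search(step_logprobs, beam_size=4, max_len=20, stop="</s>"):
    """Keep the beam_size best prefixes by total log-probability at each step."""
    beams = [(0.0, ["<s>"])]                        # (log-probability, prefix)
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            if prefix[-1] == stop:                  # finished hypotheses carry over
                candidates.append((logp, prefix))
                continue
            for word, lp in step_logprobs(prefix).items():
                candidates.append((logp + lp, prefix + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(prefix[-1] == stop for _, prefix in beams):
            break
    return beams[0][1]                              # best-scoring prefix found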

Remarks We covered two approaches to machine translation: phrase-based statistical MT following Koehn et al. (2003), including probabilistic noisy-channel models for alignment (a key preprocessing step; Brown et al., 1993), and neural MT with attention, following Bahdanau et al. (2014). Note two key differences: the noisy channel $p(e) \cdot p(f \mid e)$ vs. the direct model $p(e \mid f)$; alignment as a discrete random variable vs. attention as a deterministic, differentiable function. At the moment, neural MT is winning when you have enough data; if not, phrase-based MT dominates. When monolingual target-language data is plentiful, we'd like to use it! Recent neural models try (Sennrich et al., 2016; Xia et al., 2016; Yu et al., 2017).

Summarization

Automatic Text Summarization Mani (2001) provides a survey from before statistical methods came to dominate; a more recent survey is by Das and Martins (2008). Parallel history to machine translation: noisy channel view (Knight and Marcu, 2002); automatic evaluation (Lin, 2004). Differences: natural data sources are less obvious, and human information needs are less obvious. We'll briefly consider two subtasks: compression and selection.

Sentence Compression as Structured Prediction (McDonald, 2006) Input: a sentence. Output: the same sentence, with some words deleted. McDonald's approach: define a scoring function for compressed sentences that factors locally in the output; he factored into bigrams but considered input parse-tree features. Decoding is dynamic programming (not unlike Viterbi). Learn feature weights from a corpus of compressed sentences, using structured perceptron or similar.
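
A sketch of the bigram-factored dynamic program in the spirit of McDonald (2006); the feature-based scoring is abstracted into a hypothetical score(i, j) for keeping word j immediately after kept word i, with positions 0 and n+1 as start/end sentinels.

def compress(words, score):
    """Best-scoring compression under a bigram-factored score, via dynamic programming."""
    n = len(words)
    best = {0: 0.0}                       # best[j]: best score of a compression ending at j
    back = {0: None}
    for j in range(1, n + 2):             # positions 1..n are words; n+1 is the end sentinel
        best_i, best_val = None, float("-inf")
        for i in best:                    # any earlier kept position (or the start sentinel)
            val = best[i] + score(i, j)
            if val > best_val:
                best_i, best_val = i, val
        best[j], back[j] = best_val, best_i
    kept, j = [], back[n + 1]             # follow backpointers from the end sentinel
    while j:                              # stop at the start sentinel (0)
        kept.append(j)
        j = back[j]
    return [words[j - 1] for j in reversed(kept)]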

Sentence Selection Input: one or more documents and a budget. Output: a within-budget subset of sentences (or passages) from the input. Challenge: diminishing returns as more sentences are added to the summary. Classical greedy method: maximal marginal relevance (Carbonell and Goldstein, 1998). Casting the problem as submodular optimization: Lin and Bilmes (2009). Joint selection and compression: Martins and Smith (2009).
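
A greedy maximal-marginal-relevance sketch in the spirit of Carbonell and Goldstein (1998); relevance and similarity are hypothetical scoring functions, lam trades off relevance against redundancy, and the budget is counted in sentences for simplicity.

def mmr_select(sentences, budget, relevance, similarity, lam=0.7):
    """Greedily add the sentence with the best relevance-minus-redundancy score."""
    selected, remaining = [], list(sentences)
    while remaining and len(selected) < budget:
        def mmr(s):
            redundancy = max((similarity(s, t) for t in selected), default=0.0)
            return lam * relevance(s) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected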

Finale

Mental Health for Exam Preparation (and Beyond) Most lectures included discussion of: representations or tasks (input/output); evaluation criteria; models (often with variations); learning/estimation algorithms; NLP algorithms; practical advice. For each task, keep these elements separate in your mind, and reuse them where possible.

References I
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR, 2014. URL https://arxiv.org/abs/1409.0473.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR, 1998.
Dipanjan Das and André F. T. Martins. A survey of methods for automatic text summarization, 2008.
Mikel L. Forcada and Ramón P. Ñeco. Recursive hetero-associative memories for translation. In International Work-Conference on Artificial Neural Networks, 1997.
Kevin Knight and Daniel Marcu. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107, 2002.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proc. of NAACL, 2003.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proc. of the ACL Workshop: Text Summarization Branches Out, 2004.
Hui Lin and Jeff A. Bilmes. How to select a good training-data subset for transcription: Submodular active selection for sequences. In Proc. of Interspeech, 2009.
Inderjeet Mani. Automatic Summarization. John Benjamins Publishing, 2001.

References II
André F. T. Martins and Noah A. Smith. Summarization with a joint model for sentence extraction and compression. In Proc. of the ACL Workshop on Integer Linear Programming for Natural Language Processing, 2009.
Ryan T. McDonald. Discriminative sentence compression with soft syntactic evidence. In Proc. of EACL, 2006.
Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. of Interspeech, 2010. URL http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_is100722.pdf.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proc. of ACL, 2016. URL http://www.aclweb.org/anthology/p16-1009.
Yingce Xia, Di He, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In NIPS, 2016.
Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. The neural noisy channel. In Proc. of ICLR, 2017.