CS395T: Structured Models for NLP Lecture 19: Advanced NNs I

CS395T: Structured Models for NLP Lecture 19: Advanced NNs I. Greg Durrett

Administrivia: Kyunghyun Cho (NYU) talk Friday 11am, GDC 6.302. Project 3 due today! Final project out today! Proposal due in 1 week. Project presentations December 5/7 (timeslots to be assigned when proposals are turned in). Final project due December 15 (no slip days!)

Project Proposals: ~1 page. Define a problem and give context of related work (at least 3-4 relevant papers). Propose a direction that you think is feasible and outline steps to get there, including what dataset you'll use. Okay to change directions after the proposal is submitted, but run it by me if it's a big change.

This Lecture: Neural CRFs; tagging / NER; parsing.

NER Revisited / Neural CRF Basics. Features in CRFs: I[tag=B-LOC & curr_word=hangzhou], I[tag=B-LOC & prev_word=to], I[tag=B-LOC & curr_prefix=han]; a linear model over features (see the feature-extraction sketch below). Downsides: lexical features mean that words need to be seen in the training data; we can only use limited context windows; and a linear model can't capture feature conjunctions effectively.

LSTMs for NER: Barack Obama will travel to Hangzhou (for the G20 meeting) -> B-PER I-PER O O O B-LOC ... Encoder-decoder (MT-like model): what are the strengths and weaknesses of this model compared to CRFs? Transducer (LM-like model): what are the strengths and weaknesses of this model compared to CRFs?
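To make the indicator features concrete, here is a minimal sketch of this kind of emission feature extraction; the feature-string format and the specific templates are illustrative choices, not the exact ones from the lecture.

```python
def emission_features(tag, words, i):
    """Indicator features for a CRF emission factor, represented as strings.

    A linear model assigns one weight per feature string; the templates
    (current word, previous word, 3-character prefix) are illustrative.
    """
    curr = words[i].lower()
    prev = words[i - 1].lower() if i > 0 else "<s>"
    return [
        f"tag={tag}&curr_word={curr}",
        f"tag={tag}&prev_word={prev}",
        f"tag={tag}&curr_prefix={curr[:3]}",
    ]

words = "Barack Obama will travel to Hangzhou".split()
print(emission_features("B-LOC", words, 5))
# ['tag=B-LOC&curr_word=hangzhou', 'tag=B-LOC&prev_word=to', 'tag=B-LOC&curr_prefix=han']
```

Every such feature only fires for words and prefixes seen in training, which is exactly the sparsity problem the neural models below address.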

LSTMs for NER: a bidirectional transducer model over "Barack Obama will travel to Hangzhou" producing B-PER I-PER O O O B-LOC. What are the strengths and weaknesses of this model compared to CRFs?

Neural CRFs: bidirectional LSTMs (or some NN) compute emission potentials; structural constraints are captured in transition potentials:

$$P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$$

Conventional: $\phi_e(y_i, i, x) = w^\top f_e(y_i, i, x)$. Neural: $\phi_e(y, i, x) = W f(i, x)$, where $f$ and $\phi_e$ are vectors and $\phi_e$ has one entry per label. $f(i, x)$ could be the output of a feedforward neural network looking at the words around position $i$, or the $i$-th output of an LSTM, etc. The neural network computes unnormalized potentials that are consumed and normalized by a structured model. Inference: compute $\phi_e$, then use Viterbi (or beam search).

Computing Gradients. For the linear model, $\partial \phi_e(y_i, i, x) / \partial w = f_e(y_i, i, x)$, and $\partial \mathcal{L} / \partial \phi_{e,i}(s) = -P(y_i = s \mid x) + I[s \text{ is gold}]$ is the error signal, computed with forward-backward; the chain rule says to multiply these together, which gives our update. For the neural model: compute the gradient of $\phi_e$ with respect to the parameters of the neural net and multiply by the same error signal (a small numerical sketch follows this slide).
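Below is a minimal NumPy sketch of the gradient computation described above: forward-backward produces the tag marginals, and the emission error signal is I[s is gold] - P(y_i = s | x). The log-space implementation and the variable names are illustrative assumptions.

```python
import numpy as np

def logsumexp(x, axis):
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)

def forward_backward(phi_e, phi_t):
    """phi_e: (n, L) emission potentials from some NN; phi_t: (L, L) transition potentials.
    Returns log Z and the tag marginals P(y_i = s | x)."""
    n, L = phi_e.shape
    alpha = np.zeros((n, L))   # alpha[i, s]: log-sum over tag prefixes ending in s at i
    beta = np.zeros((n, L))    # beta[i, s]: log-sum over tag suffixes following s at i
    alpha[0] = phi_e[0]
    for i in range(1, n):
        alpha[i] = phi_e[i] + logsumexp(alpha[i - 1][:, None] + phi_t, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(phi_t + phi_e[i + 1][None, :] + beta[i + 1][None, :], axis=1)
    log_Z = logsumexp(alpha[-1], axis=0)
    return log_Z, np.exp(alpha + beta - log_Z)

def emission_error_signal(phi_e, phi_t, gold_tags):
    """dL/dphi_e[i, s] = I[s is gold] - P(y_i = s | x); backprop this into the NN."""
    _, marginals = forward_backward(phi_e, phi_t)
    grad = -marginals
    grad[np.arange(len(gold_tags)), gold_tags] += 1.0
    return grad
```

For the conventional CRF this error signal is multiplied by the sparse feature vector; for the neural CRF it is fed into backpropagation through whatever network produced phi_e.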

FFNN Neural CRF for NER: 1) compute f(x); 2) run forward-backward; 3) compute the error signal; 4) backprop (no knowledge of sequential structure is required on the neural net side). For a window-based feedforward network, $\phi_e = W g(V f(x, i))$ with $f(x, i) = [\mathrm{emb}(x_{i-1}), \mathrm{emb}(x_i), \mathrm{emb}(x_{i+1})]$, i.e. the concatenated embeddings of the previous, current, and next word (e.g. to / Hangzhou / today). Or f(x) looks at the output of an LSTM, or another model. (A sketch of this window-based emission network follows this slide.)

Neural CRFs vs. LSTMs for NER: Barack Obama will travel to Hangzhou -> B-PER I-PER O O O B-LOC. Neural CRFs: bidirectional LSTMs compute emission potentials, and also transition potentials (usually based on sparse features). How does the plain LSTM tagger compare to the neural CRF?
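Here is a minimal PyTorch sketch of the window-based emission network $\phi_e = W g(V f(x, i))$; the tanh nonlinearity, the dimensions, and the boundary-padding convention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WindowEmissionNet(nn.Module):
    """Computes phi_e = W g(V f(x, i)), where f(x, i) concatenates embeddings of
    the previous, current, and next word."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_labels):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.V = nn.Linear(3 * emb_dim, hidden_dim)   # f(x, i) -> hidden layer
        self.W = nn.Linear(hidden_dim, num_labels)    # hidden -> one potential per label

    def forward(self, word_ids):
        # word_ids: (n,) word indices, padded with boundary symbols at both ends
        embs = self.emb(word_ids)                                 # (n, emb_dim)
        f = torch.cat([embs[:-2], embs[1:-1], embs[2:]], dim=-1)  # (n-2, 3*emb_dim)
        return self.W(torch.tanh(self.V(f)))                      # (n-2, num_labels) = phi_e
```

These per-position potentials are handed to the forward-backward / Viterbi routines; as noted above, the network itself needs no knowledge of the sequential structure.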

NLP (Almost) From Scratch (Collobert, Weston, et al. 2008, 2011). WLL: independent classification; SLL: neural CRF. LM1/LM2: pretrained word embeddings from a language model over large corpora.

How do we use a tagger for SRL? It is a tagging problem with respect to a particular verb. We can't do this with feedforward networks efficiently: arguments are too far from the verb to use fixed context window sizes. (Figure from He et al., 2017.)

CNN Neural CRFs: append to each word vector an embedding of the relative position of that word (e.g. -2 -1 0 1 2 for "expected to quicken a bit"). Convolution over the sentence produces a position-dependent representation. Use this for SRL: the verb (predicate) is at position 0, and the CNN looks at the whole sentence relative to the verb (see the sketch after this slide).

CNN NCRFs vs. FFNN NCRFs (conv + pool + FFNN): the sentence approach (CNNs) is comparable to the window approach (FFNNs), except for SRL, where they claim it works much better.
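The following PyTorch sketch illustrates the relative-position idea for SRL: each word embedding is concatenated with an embedding of its distance to the predicate, and a 1-D convolution runs over the whole sentence. Dimensions, the distance cap, and the single convolution layer are illustrative assumptions, not the exact Collobert and Weston architecture.

```python
import torch
import torch.nn as nn

class RelPosCNNEncoder(nn.Module):
    """Position-dependent sentence encoder: word embedding + relative-position
    embedding, followed by a 1-D convolution over the sentence."""
    def __init__(self, vocab_size, emb_dim=50, pos_dim=10, hidden_dim=100,
                 num_labels=20, max_dist=50, kernel=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(2 * max_dist + 1, pos_dim)  # distances in [-max_dist, max_dist]
        self.max_dist = max_dist
        self.conv = nn.Conv1d(emb_dim + pos_dim, hidden_dim, kernel, padding=kernel // 2)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, word_ids, predicate_index):
        n = word_ids.shape[0]
        rel = torch.arange(n) - predicate_index                  # e.g. ... -2 -1 0 1 2 ...
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(rel)], dim=-1)  # (n, emb+pos)
        h = torch.relu(self.conv(x.t().unsqueeze(0))).squeeze(0).t()         # (n, hidden)
        return self.out(h)   # per-position potentials relative to this predicate
```

The resulting position-dependent representations can feed per-position emission potentials or a pooling layer, depending on the task setup.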

How "from scratch" was this? NN+SLL on its own isn't great. LM2 was trained for 7 weeks on Wikipedia+Reuters, which is very expensive! Sparse features were needed to get the best performance on NER+SRL anyway, and there was no use of sub-word features. (Collobert and Weston 2008, 2011)

Neural CRFs with LSTMs (Chiu and Nichols 2015; Lample et al. 2016): a neural CRF using character LSTMs to compute word representations (a sketch follows this slide). Chiu+Nichols use character CNNs instead of LSTMs; Lin/Passos/Luo use external resources like Wikipedia. The LSTM-CRF captures the important aspects of NER: word context (the word-level LSTM), sub-word features (character LSTMs), and outside knowledge (word embeddings).
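A minimal PyTorch sketch of the character-LSTM word representation in the spirit of Lample et al. (2016): the final states of a character BiLSTM are concatenated with a (typically pretrained) word embedding. Dimensions and the per-word interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharLSTMWordRepr(nn.Module):
    """One word's representation: character-BiLSTM final states + word embedding."""
    def __init__(self, num_chars, vocab_size, char_dim=25, char_hidden=25, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(vocab_size, word_dim)  # typically loaded from pretrained vectors

    def forward(self, char_ids, word_id):
        # char_ids: (word_length,) character indices; word_id: 0-dim LongTensor word index
        chars = self.char_emb(char_ids).unsqueeze(0)          # (1, word_length, char_dim)
        _, (h_n, _) = self.char_lstm(chars)                   # h_n: (2, 1, char_hidden)
        char_repr = torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1) # forward + backward final states
        return torch.cat([char_repr, self.word_emb(word_id)], dim=-1)
```

The resulting per-word vectors feed a word-level BiLSTM whose outputs become the emission potentials of the neural CRF.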

Neural CRFs for Parsing

Constituency Parsing: "He wrote a long report on Mars." The PP "on Mars" can attach to the noun ("report on Mars") or to the verb ("wrote ... on Mars"), so the parser has to score competing structures such as VP -> VBD PP.

Discrete Parsing (Taskar et al. 2004; Hall, Durrett, and Klein, ACL 2014): $\mathrm{score}(r, s) = w^\top f(r, s)$ for a rule $r$ anchored at a span $s$ of the sentence, with indicator features such as [left child's last word = report] conjoined with the rule identity. Drawbacks: we need to learn each word's properties individually, and it is hard to learn feature conjunctions (report on X).

Continuous-State Grammars (Socher et al. 2013): powerful nonlinear featurization, but inference is intractable.

Joint Discrete and Continuous Parsing (Durrett and Klein, ACL 2015): combine the discrete score $w^\top f(r, s)$ with a continuous score computed by a neural network over the anchored span.

Joint Discrete and Continuous Parsing (Durrett and Klein, ACL 2015): the score of an anchored rule is $w^\top f(r, s)$, sparse features over the discrete structure, plus a neural network score built from rule embeddings and the words around the anchored span; the discrete and continuous parts are learned jointly (a rough sketch of this combined scoring follows this slide).

The chart remains discrete! Parsing a sentence: run a feedforward pass on the nets, do the discrete feature computation, then run the CKY dynamic program.

Joint Modeling Helps! Penn Treebank dev set F1 (approximate human performance sits near the top of the chart, around 95): Discrete (Hall, Durrett, Klein 2014) 90.8; Continuous (Durrett, Klein 2015) 91.0; Joint (Durrett, Klein 2015) 92.0.
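As a rough illustration only: the sketch below combines a sparse feature score with a neural score built from a rule embedding and a hidden representation of the anchored span. The parameterization (tanh hidden layer, single rule-embedding dot product) is an assumption for illustration, not the precise Durrett and Klein (2015) model.

```python
import numpy as np

def anchored_rule_score(sparse_feats, rule_id, span_word_vecs, w, rule_emb, V, b):
    """sparse_feats: indices of sparse features firing for the anchored rule (r, s)
    span_word_vecs: concatenated embeddings of words around the span
    w: sparse weight vector; rule_emb: (num_rules, d) rule embeddings; V, b: hidden layer"""
    discrete = sum(w[f] for f in sparse_feats)     # w^T f(r, s): discrete structure score
    h = np.tanh(V @ span_word_vecs + b)            # hidden representation of the anchored span
    continuous = rule_emb[rule_id] @ h             # rule embedding dotted with h(s)
    return discrete + continuous
```

These anchored-rule scores fill the chart cells, and CKY then runs exactly as in a purely discrete CRF parser.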

Comparison to Neural Parsers. Penn Treebank test set F1: CVG (Socher+ 2013) 90.4; LSTM (Vinyals+ 2015) 88.3; LSTM ensemble (Vinyals+ 2015) 90.5; Joint (Durrett, Klein 2015) 91.1. Results on 8 languages, SPMRL Treebanks test set F1: Berkeley (Petrov+ 2006) 81.2; FM (Fernández-González and Martins 2015) 84.2; Joint (Durrett, Klein 2015) 85.7.

Dependency Parsing: score each head-child pair in a dependency parse, then use Eisner's algorithm or MST to assemble a parse (e.g. ROOT Casey hugged Kim). Feedforward neural network approach (Pei et al. 2015; Kiperwasser and Goldberg 2016): feed head features and modifier features into an FFNN that outputs the score of a dependency arc. Biaffine approach (Dozat and Manning 2017): condense each head and each modifier into separate vectors and compute the score as $h^\top U m$ (a sketch follows this slide).
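A minimal sketch of biaffine arc scoring in the spirit of Dozat and Manning (2017): separate head and modifier projections of each word's BiLSTM state, then a bilinear score $h^\top U m$ for every pair. The dimensions are illustrative, and the bias terms of the full biaffine form are omitted.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Scores every (head, modifier) pair of a sentence with h^T U m."""
    def __init__(self, lstm_dim=400, arc_dim=500):
        super().__init__()
        self.head_mlp = nn.Linear(lstm_dim, arc_dim)
        self.mod_mlp = nn.Linear(lstm_dim, arc_dim)
        self.U = nn.Parameter(torch.empty(arc_dim, arc_dim))
        nn.init.xavier_uniform_(self.U)

    def forward(self, states):
        # states: (n, lstm_dim) BiLSTM outputs for one sentence (including ROOT)
        h = torch.relu(self.head_mlp(states))   # (n, arc_dim) head representations
        m = torch.relu(self.mod_mlp(states))    # (n, arc_dim) modifier representations
        return h @ self.U @ m.t()               # (n, n): score[i, j] = h_i^T U m_j
```

scores[i, j] is the score of word i heading word j; Eisner's algorithm or MST decoding then assembles the highest-scoring tree from these arc scores.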

Results: the biaffine approach works well, and other neural CRFs are also strong (Dozat and Manning 2017).

Neural CRFs are state-of-the-art for: POS tagging; NER without extra data (Lample et al.); dependency parsing (Dozat and Manning); semantic role labeling (He et al.). Why do they work so well? Word-level LSTMs compute features based on the word plus its context; character LSTMs/CNNs extract features per word; pretrained embeddings capture external semantic information; and the CRF handles the structural aspects of the problem.

Takeaways: any structured model / dynamic program plus any neural network to compute potentials = a neural CRF. We can incorporate transition potentials or other scores over the structure, like grammar rules. Neural CRFs are state-of-the-art for many text analysis tasks.