CS395T: Structured Models for NLP Lecture 19: Advanced NNs I. Greg Durrett


Administrivia Kyunghyun Cho (NYU) talk Friday 11am GDC 6.302 Project 3 due today! Final project out today! Proposal due in 1 week Project presentations December 5/7 (timeslots to be assigned when proposals are turned in) Final project due December 15 (no slip days!)

Project Proposals ~1 page Define a problem, give context of related work (at least 3-4 relevant papers) Propose a direction that you think is feasible and outline steps to get there, including what dataset you'll use Okay to change directions after the proposal is submitted, but run it by me if it's a big change

This Lecture Neural CRFs Tagging / NER Parsing

Neural CRF Basics

NER Revisited B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG Features in CRFs: I[tag=B-LOC & curr_word=hangzhou], I[tag=B-LOC & prev_word=to], I[tag=B-LOC & curr_prefix=han] Linear model over features Downsides: Lexical features mean that words need to be seen in the training data Can only use limited context windows Linear model can't capture feature conjunctions effectively

LSTMs for NER B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG B-PER I-PER O O O for the G20 meeting <s> B-PER I-PER O O O Encoder-decoder (MT-like model) What are the strengths and weaknesses of this model compared to CRFs?

LSTMs for NER B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG B-PER I-PER O O O B-LOC Barack Obama will travel to Hangzhou Transducer (LM-like model) What are the strengths and weaknesses of this model compared to CRFs?

LSTMs for NER B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG B-PER I-PER O O O B-LOC Barack Obama will travel to Hangzhou Bidirectional transducer model What are the strengths and weaknesses of this model compared to CRFs?
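
For concreteness, a minimal PyTorch sketch of the bidirectional transducer above: a BiLSTM reads the sentence and an output layer scores the labels for each token independently (no structured component yet). The class name, dimensions, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bidirectional transducer: per-token label scores, predicted independently (no CRF)."""
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, word_ids):                 # word_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(word_ids))     # (batch, seq_len, 2 * hidden_dim)
        return self.out(h)                       # (batch, seq_len, num_labels)

Trained with per-token cross-entropy, each tag is chosen without looking at its neighbors, which is exactly the gap the neural CRF below fills.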

Neural CRFs B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG Barack Obama will travel to Hangzhou Neural CRFs: bidirectional LSTMs (or some NN) compute emission potentials, capture structural constraints in transition potentials

Neural CRFs
[chain CRF over labels y_1, y_2, ..., y_n]
P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
Conventional: \phi_e(y_i, i, x) = w^\top f_e(y_i, i, x)
Neural: \phi_e(\cdot, i, x) = W f(i, x); f and \phi_e are vectors, with len(\phi_e) = number of labels
f(i, x) could be the output of a feedforward neural network looking at the words around position i, or the i-th output of an LSTM, ...
Neural network computes unnormalized potentials that are consumed and normalized by a structured model
Inference: compute f, use Viterbi (or beam)
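
A minimal sketch of the scoring and Viterbi inference just described, assuming the emission potentials phi_e arrive as a (seq_len, num_labels) tensor from whatever network you like and phi_t is a learned (num_labels, num_labels) transition matrix (function names are made up):

import torch

def sequence_score(emissions, transitions, tags):
    """Unnormalized log score: sum of emission potentials phi_e plus transition potentials phi_t."""
    score = emissions[0, tags[0]]
    for i in range(1, emissions.size(0)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

def viterbi_decode(emissions, transitions):
    """Highest-scoring tag sequence under the CRF."""
    seq_len, num_labels = emissions.shape
    best = emissions[0]                       # best score of any prefix ending in each label
    backptrs = []
    for i in range(1, seq_len):
        # cand[prev, curr] = best path ending in prev, then transitioning to curr
        cand = best.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        best, ptr = cand.max(dim=0)
        backptrs.append(ptr)
    path = [best.argmax().item()]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]].item())
    return list(reversed(path))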

Computing Gradients
P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x)); Conventional: \phi_e(y_i, i, x) = w^\top f_e(y_i, i, x); Neural: \phi_e(\cdot, i, x) = W f(i, x)
\frac{\partial \mathcal{L}}{\partial \phi_{e,i}(s)} = -P(y_i = s \mid x) + I[s \text{ is gold}] (the error signal, computed with forward-backward)
For the linear model, \frac{\partial \phi_{e,i}}{\partial w} = f_e(y_i, i, x); the chain rule says to multiply these together, giving our usual update
For the neural model: compute the gradient of \phi with respect to the parameters of the neural net
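
In practice the forward-backward and chain-rule bookkeeping above can be left to autograd: compute log Z with the forward algorithm, subtract the gold sequence's score, and call backward(). A sketch under the same (seq_len, num_labels) emission layout as before; the sign is flipped relative to the slide because this is the negative log-likelihood:

import torch

def crf_nll(emissions, transitions, gold_tags):
    """-log P(gold | x); backprop through this reproduces the (marginal - indicator) error
    signal on the emission potentials and then continues into the network that produced them."""
    seq_len, num_labels = emissions.shape
    # Forward algorithm: alpha[y] = log sum of exp-scores of all prefixes ending in label y
    alpha = emissions[0]
    for i in range(1, seq_len):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[i]
    log_Z = torch.logsumexp(alpha, dim=0)
    # Score of the gold sequence
    gold = emissions[0, gold_tags[0]]
    for i in range(1, seq_len):
        gold = gold + transitions[gold_tags[i - 1], gold_tags[i]] + emissions[i, gold_tags[i]]
    return log_Z - gold

Here d(loss)/d(emissions[i, s]) = P(y_i = s | x) - I[s is gold], so minimizing this loss performs exactly the update described on the slide.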

Neural CRFs B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG Barack Obama will travel to Hangzhou 1) Compute f(x) 2) Run forward-backward 3) Compute error signal 4) Backprop (no knowledge of sequential structure required)

FFNN Neural CRF for NER B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG
\phi_e = W g(V f(x, i)), where f(x, i) = [emb(x_{i-1}), emb(x_i), emb(x_{i+1})] (embeddings of the previous, current, and next word, e.g. "to Hangzhou today")
Or f(x) looks at the output of an LSTM or another model
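
A sketch of that window-based emission network, with g = tanh and made-up dimensions; it assumes word_ids has already been padded with boundary symbols so every position has a previous and next word:

import torch
import torch.nn as nn

class WindowEmissionNet(nn.Module):
    """phi_e(., i, x) = W g(V f(x, i)), f(x, i) = [emb(x_{i-1}); emb(x_i); emb(x_{i+1})]."""
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.V = nn.Linear(3 * emb_dim, hidden_dim)
        self.W = nn.Linear(hidden_dim, num_labels)

    def forward(self, word_ids):                  # (seq_len,), padded with boundary ids at both ends
        e = self.emb(word_ids)                    # (seq_len, emb_dim)
        f = torch.cat([e[:-2], e[1:-1], e[2:]], dim=-1)   # prev / curr / next word embeddings
        return self.W(torch.tanh(self.V(f)))      # (seq_len - 2, num_labels) emission potentials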

Neural CRFs B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG Barack Obama will travel to Hangzhou Neural CRFs: bidirectional LSTMs compute emission potentials, also transition potentials (usually based on sparse features)

LSTMs for NER B-PER I-PER O O O B-LOC O O O B-ORG O O Barack Obama will travel to Hangzhou today for the G20 meeting. PERSON LOC ORG B-PER I-PER O O O B-LOC Barack Obama will travel to Hangzhou How does this compare to the neural CRF?

NLP (Almost) From Scratch WLL: independent classification; SLL: neural CRF LM1/LM2: pretrained word embeddings from a language model over large corpora Collobert, Weston, et al. 2008, 2011

How do we use a tagger for SRL? Tagging problem with respect to a particular verb Can't do this with feedforward networks efficiently, arguments are too far from the verb to use fixed context window sizes Figure from He et al. (2017)

CNN Neural CRFs Append to each word vector an embedding of the relative position of that word, then conv + pool + FFNN Convolution over the sentence produces a position-dependent representation Use this for SRL: the verb (predicate) is at position 0 and the CNN looks at the whole sentence relative to the verb (e.g. "expected to quicken a bit" with positions -2 -1 0 1 2)
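
A simplified sketch of that idea: concatenate a relative-position embedding (distance to the predicate, clipped) to each word embedding and convolve over the sentence to get per-position emission scores. It omits the max-pooling step of the original sentence approach, and all names and dimensions are illustrative assumptions:

import torch
import torch.nn as nn

class PredicateCNN(nn.Module):
    """Sentence-level CNN emission scorer for SRL-style tagging: each word vector is
    concatenated with an embedding of its position relative to the predicate (verb at 0)."""
    def __init__(self, vocab_size, num_labels, emb_dim=100, pos_dim=25,
                 num_filters=200, window=3, max_dist=50):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(2 * max_dist + 1, pos_dim)
        self.max_dist = max_dist
        self.conv = nn.Conv1d(emb_dim + pos_dim, num_filters, window, padding=window // 2)
        self.out = nn.Linear(num_filters, num_labels)

    def forward(self, word_ids, pred_index):      # word_ids: (seq_len,), pred_index: int
        n = word_ids.size(0)
        rel = torch.arange(n) - pred_index         # relative positions ..., -2, -1, 0, 1, 2, ...
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        x = torch.cat([self.emb(word_ids), self.pos_emb(rel)], dim=-1)  # (seq_len, emb+pos)
        h = torch.relu(self.conv(x.t().unsqueeze(0)))   # (1, num_filters, seq_len)
        return self.out(h.squeeze(0).t())               # (seq_len, num_labels) emission potentials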

CNN NCRFs vs. FFNN NCRFs Sentence approach (CNNs) is comparable to window approach (FFNNs) except for SRL, where they claim it works much better

How from scratch was this? NN+SLL isn't great LM2: trained for 7 weeks on Wikipedia+Reuters, very expensive! Sparse features needed to get best performance on NER+SRL anyway No use of sub-word features Collobert and Weston 2008, 2011

Neural CRFs with LSTMs Neural CRF using character LSTMs to compute word representations Chiu and Nichols (2015), Lample et al. (2016)

Neural CRFs with LSTMs Chiu+Nichols: character CNNs instead of LSTMs Lin/Passos/Luo: use external resources like Wikipedia LSTM-CRF captures the important aspects of NER: word context (LSTM), sub-word features (character LSTMs), outside knowledge (word embeddings) Chiu and Nichols (2015), Lample et al. (2016)
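
A sketch of the character-LSTM word representation used in these taggers: run a small BiLSTM over a word's characters and concatenate its final states with the word embedding. The dimensions follow common choices but are assumptions; the resulting vectors would then feed the word-level BiLSTM whose outputs become the CRF emission potentials.

import torch
import torch.nn as nn

class CharLSTMWordEncoder(nn.Module):
    """Word representation = [word embedding; final states of a character BiLSTM]."""
    def __init__(self, word_vocab, char_vocab, word_dim=100, char_dim=25, char_hidden=25):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True, bidirectional=True)

    def forward(self, word_id, char_ids):          # word_id: scalar tensor, char_ids: (num_chars,)
        _, (h, _) = self.char_lstm(self.char_emb(char_ids).unsqueeze(0))
        char_repr = torch.cat([h[0, 0], h[1, 0]], dim=-1)   # forward and backward final states
        return torch.cat([self.word_emb(word_id), char_repr], dim=-1)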

Neural CRFs for Parsing

Constituency Parsing [figure: parse tree for "He wrote a long report on Mars." with a VP containing a VBD and a PP; the PP "on Mars" can attach to "report" (report on Mars) or to "wrote" (wrote ... on Mars)]

Discrete Parsing score(PP, 2, 5, 7) = w^\top f(PP, 2, 5, 7): score an anchored rule (here a PP anchored at positions 2, 5, 7 of "He wrote a long report on Mars.") with a linear function of sparse features, e.g. left child last word = report & label = PP Drawbacks: Need to learn each word's properties individually Hard to learn feature conjunctions (report on X) Taskar et al. (2004) Hall, Durrett, and Klein (ACL 2014)

Continuous-State Grammars score(PP) over "He wrote a long report on Mars." Powerful nonlinear featurization, but inference is intractable Socher et al. (2013)

Joint Discrete and Continuous Parsing
score(PP, 2, 5, 7) = w^\top f(PP, 2, 5, 7) + s_{PP}^\top W_\ell v(2, 5, 7) for "He wrote a long report on Mars."
The first term is the discrete structure: sparse features f on the anchored rule, weighted by w
The second term is the neural network: v is a continuous feature vector for the anchoring (positions 2, 5, 7), W_\ell is a learned matrix, and s_{PP} is a learned rule embedding (Rule = PP, Parent = ...)
Learned jointly
Durrett and Klein (ACL 2015)
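
A sketch of that joint score under one reading of the slide: a sparse linear term plus a bilinear term coupling a learned rule embedding with a dense feature vector of the anchoring. The class, its arguments, and the exact form of the continuous term are assumptions made for illustration, not the paper's exact parameterization.

import torch
import torch.nn as nn

class JointRuleScorer(nn.Module):
    """score(rule, anchoring) = w^T f(rule, anchoring) + s_rule^T W v(anchoring), where f is a
    sparse indicator vector, v is a dense feature vector of the anchoring, and s_rule is a
    learned rule embedding. The two terms are learned jointly."""
    def __init__(self, num_sparse_feats, num_rules, dense_dim, rule_dim=50):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_sparse_feats))        # discrete weights
        self.rule_emb = nn.Embedding(num_rules, rule_dim)            # s_rule
        self.W = nn.Parameter(torch.randn(rule_dim, dense_dim) * 0.01)

    def forward(self, sparse_feat_ids, rule_id, dense_feats):
        discrete = self.w[sparse_feat_ids].sum()                     # w^T f (indicator features)
        continuous = self.rule_emb(rule_id) @ self.W @ dense_feats   # s_rule^T W v
        return discrete + continuous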

Joint Discrete and Continuous Parsing Chart remains discrete! Each anchored rule gets a discrete + continuous score Parsing a sentence: feedforward pass on nets, discrete feature computation, run CKY dynamic program Durrett and Klein (ACL 2015)

Joint Modeling Helps! Penn Treebank dev set F1 (approx. human performance ~95): Discrete (Hall, Durrett, Klein 2014) 90.8; Continuous (Durrett, Klein 2015) 91.0; Joint (Durrett, Klein 2015) 92.0

Comparison to Neural Parsers Penn Treebank test set F1 (approx. human performance ~95): CVG (Socher et al. 2013) 90.4; LSTM (Vinyals et al. 2015) 88.3; LSTM ensemble (Vinyals et al. 2015) 90.5; Joint (Durrett, Klein 2015) 91.1

Results: 8 languages SPMRL Treebanks test set F1: Berkeley (Petrov et al. 2006) 81.2; FM (Fernandez-Gonzalez and Martins 2015) 84.2; Joint (Durrett, Klein 2015) 85.7

Dependency Parsing Score each head-child pair in a dependency parse, use Eisner's algorithm or MST to assemble a parse Feedforward neural network approach: use features on head and modifier [figure: ROOT Casey hugged Kim; head feats + modifier feats fed to an FFNN that outputs the score of a dependency arc] Pei et al. (2015), Kiperwasser and Goldberg (2016), Dozat and Manning (2017)
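
A sketch of the feedforward arc scorer, assuming each word already has a feature vector (e.g. a BiLSTM state); the resulting arc scores are what Eisner's algorithm or an MST solver would consume. Names and sizes are illustrative.

import torch
import torch.nn as nn

class ArcFFNN(nn.Module):
    """Score of one head-modifier arc from the concatenated head and modifier feature vectors."""
    def __init__(self, feat_dim, hidden_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, head_feats, mod_feats):      # each: (feat_dim,)
        return self.net(torch.cat([head_feats, mod_feats], dim=-1)).squeeze(-1)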

Dependency Parsing Biaffine approach: condense each head and modifier separately, compute score h^\top U m Dozat and Manning (2017)
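
A sketch of the biaffine score h^T U m: project every word's state into separate head and modifier spaces, then one matrix product scores all head-modifier pairs at once. The bias terms of the full Dozat and Manning model are omitted, and dimensions are assumptions.

import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """scores[h, m] = head_repr[h]^T U mod_repr[m] for every pair of positions."""
    def __init__(self, in_dim, arc_dim=500):
        super().__init__()
        self.head_mlp = nn.Linear(in_dim, arc_dim)
        self.mod_mlp = nn.Linear(in_dim, arc_dim)
        self.U = nn.Parameter(torch.randn(arc_dim, arc_dim) * 0.01)

    def forward(self, states):                     # states: (seq_len, in_dim), e.g. BiLSTM outputs
        h = torch.relu(self.head_mlp(states))      # (seq_len, arc_dim) head representations
        m = torch.relu(self.mod_mlp(states))       # (seq_len, arc_dim) modifier representations
        return h @ self.U @ m.t()                  # (seq_len, seq_len) arc scores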

Results Biaffine approach works well (other neural CRFs are also strong) Dozat and Manning (2017)

Neural CRFs State-of-the-art for: POS NER without extra data (Lample et al.) Dependency parsing (Dozat and Manning) Semantic Role Labeling (He et al.) Why do they work so well? Word-level LSTMs compute features based on the word + context Character LSTMs/CNNs extract features per word Pretrained embeddings capture external semantic information CRF handles structural aspects of the problem

Takeaways Any structured model / dynamic program + any neural network to compute potentials = neural CRF Can incorporate transition potentials or other scores over the structure like grammar rules State-of-the-art for many text analysis tasks