NLP Homework: Dependency Parsing with Feed-Forward Neural Network


Submission Deadline: Monday Dec. 11th, 5 pm

1 Background on Dependency Parsing

Dependency trees are one of the main representations used in the syntactic analysis of sentences. (The definitions and explanations in this section are deliberately brief, because you will not implement the parsing actions yourself; they are already provided for you. If you are interested, see the corresponding chapter in the NLP textbook: https://web.stanford.edu/~jurafsky/slp3/14.pdf.) One way to recover a dependency tree for a sentence is through transition-based methods. In transition-based parsing, a tree is converted to a sequence of shift-reduce actions, and the parser decides between the possible actions depending on the current configuration of the partial tree.

[Figure 1: An example dependency tree from the Wall Street Journal treebank for the sentence "I live in New York city." (POS tags: PRP VBP IN NNP NNP NN .; dependency relations: nsubj, prep, pobj, nn, nn, punct, root).]

In this homework, you are going to train a dependency parsing model based on the arc-standard system [2]. In the arc-standard system, each parse state is a configuration C = {σ, β, α}, in which σ is a stack of processed words, β is an ordered list of unprocessed words, and α is a set of already recovered dependency relations j --l--> i (the i-th word is headed by the j-th word with dependency label l). The initial configuration is

    C(0) = {σ = [root_0], β = [w_1, ..., w_n], α = {}}

where root is the dummy root node of the sentence and [w_1, ..., w_n] are the words of the sentence.
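The parsing actions are already implemented in the code provided with this homework, so nothing below needs to be written by you. Purely for intuition, here is a minimal Python sketch of how such a configuration could be represented; the names (Configuration, initial_configuration) are illustrative assumptions, not the interface of the provided decoder.

    # Purely illustrative: the homework code already implements the parser, and these
    # names (Configuration, initial_configuration) are not its actual interface.

    class Configuration:
        def __init__(self, stack, buffer, arcs):
            self.stack = stack    # sigma: processed words (top of the stack = last element)
            self.buffer = buffer  # beta: ordered list of unprocessed words
            self.arcs = arcs      # alpha: set of (head, label, dependent) triples

    def initial_configuration(words):
        # C(0) = {sigma = [root_0], beta = [w_1, ..., w_n], alpha = {}}
        return Configuration(stack=[0], buffer=list(range(1, len(words) + 1)), arcs=set())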

One can theoretically show that the arc-standard parser reaches the terminal state after applying exactly 2n actions, where n is the number of words in the sentence. The final state will be

    C(2n) = {σ = [root], β = [], α = {j --l--> i : 0 ≤ j ≤ n, i ∈ [1, ..., n]}}

There are three main actions that change the state of a configuration in the algorithm: shift, left-arc, and right-arc. For every specific dependency relation (e.g. subject or object), the left-arc and right-arc actions become fine-grained (e.g. RIGHT-ARC:nsubj, RIGHT-ARC:pobj, etc.). Figure 2 shows the conditions for applying those actions. Figure 3 shows a sequence of correct transitions for the tree in Figure 1.

    Initialization     {σ = [root], β = [w_1, ..., w_n], α = {}}
    Termination        {σ = [root], β = [], α = {j --l--> i : 0 ≤ j ≤ n, i ∈ [1, ..., n]}}
    Shift              {σ, i|β, α}    =>  {σ|i, β, α}
    Left-Arc_label     {σ|i|j, β, α}  =>  {σ|j, β, α ∪ {j --label--> i}}
    Right-Arc_label    {σ|i|j, β, α}  =>  {σ|i, β, α ∪ {i --label--> j}}

Figure 2: Initial and terminal states and actions in the arc-standard transition system. The notation σ|i|j means a stack where the top words are σ_0 = j and σ_1 = i.

    Action            σ                                              β                                                 new arc (h --l--> d)
    Shift             [root_0]                                       [I_1, live_2, in_3, New_4, York_5, city_6, ._7]
    Shift             [root_0, I_1]                                  [live_2, in_3, New_4, York_5, city_6, ._7]
    Left-Arc_nsubj    [root_0, I_1, live_2]                          [in_3, New_4, York_5, city_6, ._7]                2 --nsubj--> 1
    Shift             [root_0, live_2]                               [in_3, New_4, York_5, city_6, ._7]
    Shift             [root_0, live_2, in_3]                         [New_4, York_5, city_6, ._7]
    Shift             [root_0, live_2, in_3, New_4]                  [York_5, city_6, ._7]
    Shift             [root_0, live_2, in_3, New_4, York_5]          [city_6, ._7]
    Left-Arc_nn       [root_0, live_2, in_3, New_4, York_5, city_6]  [._7]                                             6 --nn--> 5
    Left-Arc_nn       [root_0, live_2, in_3, New_4, city_6]          [._7]                                             6 --nn--> 4
    Right-Arc_pobj    [root_0, live_2, in_3, city_6]                 [._7]                                             3 --pobj--> 6
    Right-Arc_prep    [root_0, live_2, in_3]                         [._7]                                             2 --prep--> 3
    Shift             [root_0, live_2]                               [._7]
    Right-Arc_punct   [root_0, live_2, ._7]                          []                                                2 --punct--> 7
    Right-Arc_root    [root_0, live_2]                               []                                                0 --root--> 2
    Terminal          [root_0]                                       []

Figure 3: A sample action sequence with the arc-standard actions (Figure 2) for the tree in Figure 1. Each row shows the configuration in which the action is applied and the arc, if any, that the action adds.
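Continuing the illustrative sketch above (again, the provided course code already implements the real transitions), the three actions could be written as follows; left-arc and right-arc operate on the top two words of the stack, exactly as in Figure 2.

    # Purely illustrative, continuing the Configuration sketch above; the provided
    # course code already implements these transitions.

    def shift(c):
        # {sigma, i|beta, alpha} => {sigma|i, beta, alpha}
        c.stack.append(c.buffer.pop(0))

    def left_arc(c, label):
        # {sigma|i|j, beta, alpha} => {sigma|j, beta, alpha U {j --label--> i}}
        j = c.stack.pop()           # top of the stack
        i = c.stack.pop()           # second word on the stack
        c.arcs.add((j, label, i))   # j becomes the head of i
        c.stack.append(j)

    def right_arc(c, label):
        # {sigma|i|j, beta, alpha} => {sigma|i, beta, alpha U {i --label--> j}}
        j = c.stack.pop()
        i = c.stack[-1]             # i stays on the stack
        c.arcs.add((i, label, j))   # i becomes the head of j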

2 Learning a Parser from Transitions

It is straightforward to train a model based on the transitions extracted from the tree of each sentence. A Python script is provided for you to convert dependency trees to training instances, so you need not worry about the details of that conversion.

2.1 Input Layer

We follow the approach of [1, 3] and use three kinds of features:

- Word features: 20 types of word features.
- Part-of-speech features: 20 types of POS features.
- Dependency label features: 12 types of dependency label features.

The data files should have 53 space-separated columns, where the first 20 are word-based features, the next 20 are POS features, the next 12 are dependency label features, and the last column is the gold label (action).

2.2 Embedding Layer

Based on the input layer features φ(D_i) for a data instance D_i, the model should use the following embedding (lookup) parameters:

- Word embedding E_w with dimension d_w. If the training corpus has N_w unique words (including the <null>, <root>, and <unk> symbols), the size of the embedding dictionary will be E_w ∈ R^{d_w × N_w}.
- Part-of-speech embedding E_t with dimension d_t. If the training corpus has N_t unique POS tags (including the <null> and <root> symbols), the size of the embedding dictionary will be E_t ∈ R^{d_t × N_t}.
- Dependency label embedding E_l with dimension d_l. If the training corpus has N_l unique dependency labels (including the <null> and <root> symbols), the size of the embedding dictionary will be E_l ∈ R^{d_l × N_l}.

Thus the output of the embedding layer, e, is the concatenation of all embedding values for the specific data instance; its dimension is

    d_e = 20(d_w + d_t) + 12 d_l

2.3 First Hidden Layer

The output of the embedding layer e should be fed to the first hidden layer with a rectified linear unit (ReLU) activation:

    h_1 = ReLU(W_1 e + b_1) = max{0, W_1 e + b_1}

where W_1 ∈ R^{d_h1 × d_e} and b_1 ∈ R^{d_h1}.
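As a concrete illustration of Sections 2.2 and 2.3, a sketch of the embedding lookup, concatenation, and first hidden layer is given below. It assumes PyTorch purely for concreteness (any of the libraries mentioned in Section 3.3 is fine), and all vocabulary sizes here are dummy values.

    # Minimal sketch of the embedding + first hidden layer, assuming PyTorch;
    # vocabulary sizes are dummies (read the real ones from data/vocabs.*).
    import torch
    import torch.nn as nn

    N_w, N_t, N_l = 1000, 50, 40               # dummy vocabulary sizes
    d_w, d_t, d_l, d_h1 = 64, 32, 32, 200      # dimensions used in Part 1

    E_w = nn.Embedding(N_w, d_w)               # word embedding table
    E_t = nn.Embedding(N_t, d_t)               # POS embedding table
    E_l = nn.Embedding(N_l, d_l)               # dependency label embedding table
    W1 = nn.Linear(20 * (d_w + d_t) + 12 * d_l, d_h1)   # first hidden layer, input size d_e

    # one fake training instance: 20 word ids, 20 POS ids, 12 label ids
    word_ids = torch.randint(0, N_w, (1, 20))
    pos_ids = torch.randint(0, N_t, (1, 20))
    label_ids = torch.randint(0, N_l, (1, 12))

    e = torch.cat([E_w(word_ids).flatten(1),   # concatenate all embedding values
                   E_t(pos_ids).flatten(1),
                   E_l(label_ids).flatten(1)], dim=1)
    h1 = torch.relu(W1(e))                     # h1 = ReLU(W1 e + b1)
    print(h1.shape)                            # torch.Size([1, 200])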

2.4 Second Hidden Layer

The output of the first hidden layer h_1 should be fed to the second hidden layer, again with a rectified linear unit (ReLU) activation:

    h_2 = ReLU(W_2 h_1 + b_2) = max{0, W_2 h_1 + b_2}

where W_2 ∈ R^{d_h2 × d_h1} and b_2 ∈ R^{d_h2}.

2.5 Output Layer

Finally, the output of the second hidden layer h_2 should be fed to the output layer with a softmax activation function:

    q = softmax(V h_2 + γ)

where V ∈ R^{A × d_h2} and γ ∈ R^A, and A is the number of possible actions in the arc-standard algorithm. Note that in decoding you can find the best action by taking the arg max of the output layer without using the softmax function:

    l_i = (V h_2 + γ)_i    and    y = argmax_{i ∈ [1...A]} l_i

The objective function is to minimize the negative log-likelihood of the data given the model parameters (-Σ_i log q_{y_i}). It is usually good practice to minimize the objective over random mini-batches of the data.
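Continuing the sketch from Section 2.3 under the same PyTorch assumption, the second hidden layer, output layer, loss, and greedy decoding could look like this; the number of actions A below is a dummy value.

    # Minimal sketch continuing the example above; illustrative only.
    import torch
    import torch.nn as nn

    d_h1, d_h2, A = 200, 200, 75               # A = number of actions (dummy value here)
    W2 = nn.Linear(d_h1, d_h2)                 # second hidden layer
    V = nn.Linear(d_h2, A)                     # output layer: V h2 + gamma

    h1 = torch.randn(1, d_h1)                  # stand-in for the h1 computed earlier
    h2 = torch.relu(W2(h1))                    # h2 = ReLU(W2 h1 + b2)
    logits = V(h2)                             # l_i = (V h2 + gamma)_i
    q = torch.softmax(logits, dim=1)           # q = softmax(V h2 + gamma)

    # Training objective: negative log-likelihood of the gold action.
    # CrossEntropyLoss combines log-softmax and NLL, so it takes the raw logits.
    gold = torch.tensor([3])                   # dummy gold action index
    loss = nn.CrossEntropyLoss()(logits, gold)

    # Decoding: the arg max of the logits gives the best action without the softmax.
    best_action = logits.argmax(dim=1)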

3 Step by Step Guideline

This section provides a step-by-step guideline for training a deep neural network for transition-based parsing. You can download the code from the following repository: https://github.com/rasoolims/nlp_hw_dep.

3.1 Vocabulary Creation for Features

Deep networks work with vectors, not strings. Therefore, you first have to create vocabularies for the word features, POS features, dependency label features, and parser actions. Run the following command:

    >> python src/gen_vocab.py trees/train.conll data/vocabs

After that, you should have the following files:

- data/vocabs.word: This file contains indices for words. Remember that <null> is not a real word, but since a word feature can sometimes be null, we also learn an embedding for the null word. Note that <root> is the word feature for the root token. Any word that does not occur in this vocabulary should be mapped to the value corresponding to <unk> in this file.
- data/vocabs.pos: Similar to the word vocabulary, but for part-of-speech tags. There is no notion of unknown in POS features.
- data/vocabs.labels: Similar to the word vocabulary, but for dependency labels. There is no notion of unknown in dependency label features.
- data/vocabs.actions: A string-to-index conversion for the actions (this should be used for the output layer).

3.2 Data Generation

The original tree files cannot be used directly for training. Use the following commands to convert the training and development datasets to data files:

    >> python src/gen.py trees/train.conll data/train.data
    >> python src/gen.py trees/dev.conll data/dev.data

Each line is one data instance. The first 20 columns are word-based features, the next 20 columns are POS-based features, the next 12 are dependency label features, and the last column is the action. Remember that when you use these features in your implementation, you should convert them to integer values based on the vocabularies generated in the previous subsection.

3.3 Neural Network Implementation

Now it is time to implement a neural network model. We recommend implementing it with the Dynet library (https://github.com/clab/dynet; take a look at the sample code in its paper, https://arxiv.org/pdf/1701.03980.pdf, and its API reference, https://github.com/clab/dynet/blob/master/examples/jupyter-tutorials/api.ipynb) because we can help you with the details, but if you are more familiar with other libraries such as PyTorch, Chainer, TensorFlow, Keras, Caffe, CNTK, DeepLearning4J, or Theano, that is totally fine. Feel free to use any library or implementation of feed-forward neural networks. You can mimic the implementation style of the POS tagger in the following repository: https://github.com/rasoolims/ff_tagger.
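For Section 3.2, converting the string features of a *.data line into integer indices might look like the sketch below. It assumes each data/vocabs.* file contains one string and its index per line, separated by whitespace; verify this against the files that gen_vocab.py actually produces before relying on it.

    # Sketch only: assumes each data/vocabs.* file has one "string index" pair per line,
    # whitespace separated; check the real format produced by gen_vocab.py before using this.

    def read_vocab(path):
        vocab = {}
        with open(path) as f:
            for line in f:
                token, idx = line.strip().split()
                vocab[token] = int(idx)
        return vocab

    word_vocab = read_vocab('data/vocabs.word')
    pos_vocab = read_vocab('data/vocabs.pos')
    label_vocab = read_vocab('data/vocabs.labels')
    action_vocab = read_vocab('data/vocabs.actions')

    def featurize(line):
        # A *.data line has 53 space-separated columns: 20 words, 20 POS tags,
        # 12 dependency labels, and the gold action.
        fields = line.strip().split()
        words = [word_vocab.get(w, word_vocab['<unk>']) for w in fields[:20]]   # map OOV words to <unk>
        pos = [pos_vocab[p] for p in fields[20:40]]
        labels = [label_vocab[l] for l in fields[40:52]]
        action = action_vocab[fields[52]]
        return words, pos, labels, action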

3.4 Decoding

After you are done training a model, you should be able to parse sentences directly from tree files. We have provided wrappers so that you can run a decoder. Below is a container for your decoder in the depmodel.py file. You can change the contents of the class (especially the score function) to make it work as a correct decoder.

    import os, sys
    from decoder import *

    class DepModel:
        def __init__(self):
            '''
            You can add more arguments, for example actions and model paths.
            You need to load your model here.
            actions: provides indices for actions.
            It has the same order as the data/vocabs.actions file.
            '''
            # if you prefer to have your own index for actions, change this.
            self.actions = ['SHIFT', 'LEFT-ARC:cc', ..., 'RIGHT-ARC:root']
            # write your code here for additional parameters.
            # feel free to add more arguments to the initializer.

        def score(self, str_features):
            '''
            :param str_features: string features
                first 20: words, next 20: POS, next 12: dependency labels.
            DO NOT ADD ANY ARGUMENTS TO THIS FUNCTION.
            :return: list of scores
            '''
            # change this part of the code.
            return [0] * len(self.actions)

    if __name__ == '__main__':
        m = DepModel()
        input_p = os.path.abspath(sys.argv[1])
        output_p = os.path.abspath(sys.argv[2])
        Decoder(m.score, m.actions).parse(input_p, output_p)

If you just run the code as-is, it will generate default trees with very low accuracy (because it gives a zero score to every possible action):

    >> python src/depmodel.py trees/dev.conll outputs/dev.out
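Once you have a trained model, the body of score might look roughly like the following sketch. The attribute names used here (self.network, self.word_vocab, self.pos_vocab, self.label_vocab) are hypothetical and depend entirely on how you structure your own code; they are not part of the provided wrapper.

    # Hypothetical sketch of a trained score function, meant as a replacement body for
    # DepModel.score under the PyTorch assumption of the earlier sketches.
    import torch

    def score(self, str_features):
        # str_features: 52 strings -- 20 words, 20 POS tags, 12 dependency labels.
        words = [self.word_vocab.get(w, self.word_vocab['<unk>']) for w in str_features[:20]]
        pos = [self.pos_vocab[p] for p in str_features[20:40]]
        labels = [self.label_vocab[l] for l in str_features[40:52]]
        with torch.no_grad():                  # no gradients needed at decoding time
            logits = self.network(torch.tensor([words]),
                                  torch.tensor([pos]),
                                  torch.tensor([labels]))
        # The Decoder expects one score per action, in the same order as self.actions.
        return logits[0].tolist()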

4 Homework Questions

This section provides details about the questions. Please put your code as well as the required outputs in the corresponding directories. A blind test set is also provided for you (trees/test.conll). You do not have access to the correct dependencies in the blind test data, but your grade will be affected by the output on that blind set. (Because of the randomness in deep learning, we do not expect a particular performance, but your accuracies should not be far below the numbers that we produce.)

Part 1

Table 1 provides details about the parameters you need to use for this part. Except for the ones mentioned, every other setting should be the default setting of the neural network library that you use. (An illustrative training-loop sketch using these settings follows the steps below.)

    Parameter                               Value
    Trainer                                 Adam algorithm
    Training epochs                         7
    Transfer function                       ReLU
    Word embedding dimension                64
    POS embedding dimension                 32
    Dependency embedding dimension          32
    Minibatch size                          1000
    First hidden layer dimension (d_h1)     200
    Second hidden layer dimension (d_h2)    200

Table 1: Parameter values for the experiments.

1. After training the model for 7 epochs on the training data, run the decoder on the file trees/dev.conll and put the output in trees/dev_part1.conll. (Remember to shuffle your training data at each epoch.)

2. Run the following script to get the unlabeled and labeled dependency accuracies. Report these numbers in your document.

    >> python src/eval.py trees/dev.conll outputs/dev_part1.conll

3. A blind test set is also provided for you (trees/test.conll). Run the decoder with your trained model on the blind test set and output it in trees/test_part1.conll.
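The training-loop sketch referenced above follows the Table 1 settings (Adam, 7 epochs, mini-batches of 1000, shuffling every epoch). It assumes a PyTorch model called model and a train_data list as in the earlier sketches, so treat it as a template rather than a drop-in solution.

    # Sketch of a training loop following Table 1; assumes a PyTorch model `model`
    # (as in the earlier sketches) and `train_data` as a list of
    # (word_ids, pos_ids, label_ids, gold_action) tuples of integer indices.
    import random
    import torch
    import torch.nn as nn

    optimizer = torch.optim.Adam(model.parameters())    # Adam trainer, library defaults otherwise
    loss_fn = nn.CrossEntropyLoss()                      # negative log-likelihood of the gold action

    for epoch in range(7):                               # 7 training epochs
        random.shuffle(train_data)                       # shuffle the training data every epoch
        for start in range(0, len(train_data), 1000):    # mini-batches of 1000 instances
            batch = train_data[start:start + 1000]
            words = torch.tensor([b[0] for b in batch])
            pos = torch.tensor([b[1] for b in batch])
            labels = torch.tensor([b[2] for b in batch])
            gold = torch.tensor([b[3] for b in batch])

            optimizer.zero_grad()
            loss = loss_fn(model(words, pos, labels), gold)
            loss.backward()
            optimizer.step()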

Part 2

This part is roughly the same as Part 1, but with the following differences: d_h1 = 400 and d_h2 = 400. Similar to Part 1, do the following steps:

1. After training the model for 7 epochs on the training data, run the decoder on the file trees/dev.conll and put the output in trees/dev_part2.conll.

2. Run the following script to get the unlabeled and labeled dependency accuracies. Report these numbers in your document.

    >> python src/eval.py trees/dev.conll outputs/dev_part2.conll

3. Run the decoder with your trained model on the blind test set and output it in trees/test_part2.conll.

4. Why do you think the accuracy changed?

Part 3

This part is open-ended, so you are free to change the parameters. Try at least one variation of the model and report its accuracy on the development data. Use the best model to output the trees/dev_part3.conll and trees/test_part3.conll files. The variations can come from the following list, though you are not limited to these (a small sketch of two of these variations appears after the list):

- Changing the activation function, for example using leaky ReLU instead of the original ReLU activation function.
- Initializing word embeddings with pre-trained vectors, for example from https://nlp.stanford.edu/projects/glove/.
- Changing the dimensions of the hidden or embedding layers.
- Using dropout during training.
- Changing the mini-batch size.
- Changing the training algorithm (e.g. SGD, AdaGrad) or changing the updater's learning rate.
- Changing the number of training epochs.

In addition to reporting the accuracies, you should explain your observations and the reasons for any improvement that you achieve.
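As an example of the first and fourth variations, and still under the PyTorch assumption of the earlier sketches, leaky ReLU and dropout could be introduced as follows; the dropout probability here is a free choice, not a prescribed value.

    # Sketch of two possible Part 3 variations: leaky ReLU instead of ReLU,
    # and dropout after each hidden layer (PyTorch assumed, as before).
    import torch.nn as nn
    import torch.nn.functional as F

    dropout = nn.Dropout(p=0.5)    # the dropout probability is a free choice

    def hidden_layers(e, W1, W2, training=True):
        # e: concatenated embeddings; W1, W2: the two hidden nn.Linear layers
        h1 = F.leaky_relu(W1(e))                  # leaky ReLU variation
        if training:
            h1 = dropout(h1)                      # dropout only during training
        h2 = F.leaky_relu(W2(h1))
        if training:
            h2 = dropout(h2)
        return h2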

References

[1] Chen, D., and Manning, C. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 740-750.

[2] Nivre, J. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together (2004), Association for Computational Linguistics, pp. 50-57.

[3] Weiss, D., Alberti, C., Collins, M., and Petrov, S. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2015), Association for Computational Linguistics, pp. 323-333.