CS 224n: Assignment #3

Due date: 2/27 11:59 PM PST (you are allowed to use 3 late days maximum for this assignment)

These questions require thought, but do not require long answers. Please be as concise as possible. We ask that you abide by the university Honor Code and that of the Computer Science department, and make sure that all of your submitted work is done by yourself. Please review any additional instructions posted on the assignment page. When you are ready to submit, please follow the instructions on the course website.

Note: This assignment involves running an experiment that takes an estimated 3-4 hours. Do not start this assignment at the last minute!

Note: In this assignment, the inputs to neural network layers will be row vectors because this is standard practice for TensorFlow (some built-in TensorFlow functions assume the inputs are row vectors). This means the weight matrix of a hidden layer will right-multiply instead of left-multiply its input (i.e., xW + b instead of Wx + b).

A primer on named entity recognition

In this assignment, we will build several different models for named entity recognition (NER). NER is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. In this assignment, for a given word in a context, we want to predict whether it represents one of four categories:

- Person (PER): e.g. Martha Stewart, Obama, Tim Wagner, etc. Pronouns like "he" or "she" are not considered named entities.
- Organization (ORG): e.g. American Airlines, Goldman Sachs, Department of Defense.
- Location (LOC): e.g. Germany, Panama Strait, Brussels, but not unnamed locations like "the bar" or "the farm".
- Miscellaneous (MISC): e.g. Japanese, USD, 1,000, Englishmen.

We formulate this as a 5-class classification problem, using the four classes above and a null class (O) for words that do not represent a named entity (most words fall into this category). For an entity that spans multiple words ("Department of Defense"), each word is separately tagged, and every contiguous sequence of non-null tags is considered to be an entity. Here is a sample sentence (x^(t)) with the named entities tagged above each token (y^(t)), as well as hypothetical predictions produced by a system (ŷ^(t)):

    y^(t):  ORG       ORG        O  O     O   ORG  ORG     ...  O          PER  PER     O
    ŷ^(t):  MISC      O          O  O     O   ORG  O       ...  O          PER  PER     O
    x^(t):  American  Airlines,  a  unit  of  AMR  Corp.,  ...  spokesman  Tim  Wagner  said.

In the above example, the system mistakenly predicted "American" to be of the MISC class and ignored "Airlines" and "Corp.". Altogether, it predicts three entities: "American", "AMR" and "Tim Wagner". To evaluate the quality of an NER system's output, we look at precision, recall and the F1 measure. In particular, we will report precision, recall and F1 at both the token level and the named-entity level. In the former case:

- Precision is calculated as the ratio of correct non-null labels predicted to the total number of non-null labels predicted (in the above example, it would be p = 3/4).
- Recall is calculated as the ratio of correct non-null labels predicted to the total number of correct non-null labels (in the above example, it would be r = 3/6).
- F1 is the harmonic mean of the two: F1 = 2pr/(p + r) (in the above example, it would be F1 = 6/10).

For entity-level F1:

- Precision is the fraction of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. In our example, "AMR" would be marked incorrect because it does not cover the whole entity, i.e. "AMR Corp.", as would "American", and we would get a precision score of 1/3.
- Recall is similarly the fraction of names in the gold standard that appear at exactly the same location in the predictions. Here, we would get a recall score of 1/3.
- Finally, the F1 score is still the harmonic mean of the two, and would be 1/3 in the example.

Our model also outputs a token-level confusion matrix. A confusion matrix is a specific table layout that allows visualization of the classification performance: each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

1. A window into NER (30 points)

Let's look at a simple baseline model that predicts a label for each token separately using features from a window around it.

Figure 1: A sample input sequence

Figure 1 shows an example of an input sequence and the first window from this sequence. Let x = (x^(1), x^(2), ..., x^(T)) be an input sequence of length T and y = (y^(1), y^(2), ..., y^(T)) be an output sequence, also of length T. Here, each element x^(t) and y^(t) is a one-hot vector: x^(t) represents the word at the t-th index of the sentence and y^(t) its label. In a window-based classifier, every input sequence is split into T new data points, each representing a window and its label. A new input is constructed from a window around x^(t) by concatenating w tokens to the left and right of x^(t):

    x̃^(t) = [x^(t-w), ..., x^(t), ..., x^(t+w)],

and we continue to use y^(t) as its label.
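To make this transformation concrete, here is a minimal Python sketch of the windowing step, including the <START>/<END> boundary padding described in the next paragraph. It works on raw strings purely for readability; the starter code's make_windowed_data operates on preprocessed inputs, and the example sentence and helper name here are illustrative assumptions rather than the assignment's actual code.

```python
# Minimal sketch of the windowing transform described above, using raw strings
# for readability. The assignment's make_windowed_data works on preprocessed
# inputs, so treat this only as an illustration of the idea.
def make_windows(tokens, labels, window_size, start_tok="<START>", end_tok="<END>"):
    """Turn one labeled sentence into (window, label) pairs of width 2w+1."""
    padded = [start_tok] * window_size + list(tokens) + [end_tok] * window_size
    return [(padded[t:t + 2 * window_size + 1], label)
            for t, label in enumerate(labels)]

# A hypothetical sentence in the spirit of Figure 1.
sentence = ["Jim", "bought", "300", "shares"]
tags = ["PER", "O", "O", "O"]
for window, label in make_windows(sentence, tags, window_size=1):
    print(window, "->", label)
# ['<START>', 'Jim', 'bought'] -> PER
# ['Jim', 'bought', '300'] -> O   ... and so on.
```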

For windows centered around tokens at the very beginning of a sentence, we add special start tokens (<START>) to the beginning of the window, and for windows centered around tokens at the very end of a sentence, we add special end tokens (<END>) to the end of the window. For example, consider constructing a window around "Jim" in the sentence above. If the window size were 1, we would add a single start token to the window (resulting in a window of [<START>, Jim, bought]). If the window size were 2, we would add two start tokens to the window (resulting in a window of [<START>, <START>, Jim, bought, 300]).

With these, each input and output is of a uniform length (2w + 1 and 1 respectively), and we can use a simple feedforward neural net to predict y^(t) from the windowed input x̃^(t). As a simple but effective model to predict labels from each window, we will use a single hidden layer with a ReLU activation, combined with a softmax output layer and the cross-entropy loss:

    e^(t) = [x^(t-w) E, ..., x^(t) E, ..., x^(t+w) E]
    h^(t) = ReLU(e^(t) W + b_1)
    ŷ^(t) = softmax(h^(t) U + b_2)
    J = CE(y^(t), ŷ^(t)) = -Σ_i y_i^(t) log(ŷ_i^(t)),

where E ∈ R^{V×D} are the word embeddings, h^(t) has dimension H and ŷ^(t) has dimension C; here V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C is the number of classes being predicted (here 5).

(a) (5 points) (written)

i. (2 points) Provide 2 examples of sentences containing a named entity with an ambiguous type (e.g. the entity could either be a person or an organization, or it could either be an organization or not an entity).

Solution: Here are a couple of examples:
- "Spokesperson for Levis, Bill Murray, said ...", where it is ambiguous whether Levis is a person or an organization.
- "Heartbreak is a new virus", where Heartbreak could either be a MISC named entity (it's actually the name of a virus) or simply a common noun.
Basically, we are looking for any situation wherein the sentence contains a word whose entity type cannot be determined from the word alone.

ii. (1 point) Why might it be important to use features apart from the word itself to predict named entity labels?

Solution: Named entities are often rare words, e.g. people's names or "Heartbreak". Using features like casing helps the system generalize.

iii. (2 points) Describe at least two features (apart from the word) that would help in predicting whether a word is part of a named entity or not.

Solution: The casing of words and their parts of speech.

(b) (5 points) (written)

i. (2 points) What are the dimensions of e^(t), W and U if we use a window of size w?

Solution: e^(t) is 1 × (2w + 1)D, W is (2w + 1)D × H, and U is H × C.

ii. (3 points) What is the computational complexity of predicting labels for a sentence of length T?

Solution: For a single window, it requires O((2w + 1)D) operations to compute e^(t), O((2w + 1)DH + H) operations to compute h^(t) and O(HC + C) operations to compute ŷ^(t). In total, it requires O((2w + 1)DHT + HCT) operations to predict labels for the entire sentence. Assuming that D, H ≫ C, O((2w + 1)DHT) is also an acceptable answer.
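To make the shapes and operation counts above concrete, here is a hedged numpy sketch of the window model's forward pass for a single window. All dimensions and values are illustrative assumptions; the assignment itself implements this model in TensorFlow.

```python
import numpy as np

# Illustrative numpy sketch of the window model's forward pass for one window.
# Dimensions are made up; the assignment implements this in TensorFlow.
V, D, H, C, w = 100, 50, 200, 5, 1              # vocab, embedding, hidden, classes, window
rng = np.random.default_rng(0)
E  = rng.standard_normal((V, D))                 # word embeddings, V x D
W  = rng.standard_normal(((2 * w + 1) * D, H))   # hidden weights, (2w+1)D x H
b1 = np.zeros(H)
U  = rng.standard_normal((H, C))                 # output weights, H x C
b2 = np.zeros(C)

window_ids = [3, 17, 42]                         # indices of x^(t-w), ..., x^(t+w)
e = E[window_ids].reshape(-1)                    # e^(t), shape ((2w+1)D,)
h = np.maximum(e @ W + b1, 0.0)                  # h^(t) = ReLU(e^(t) W + b1)
scores = h @ U + b2
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()                             # ŷ^(t) = softmax(h^(t) U + b2)
y = np.eye(C)[2]                                 # a one-hot gold label
loss = -np.sum(y * np.log(y_hat))                # cross-entropy CE(y, ŷ)
print(e.shape, h.shape, y_hat.shape, loss)
```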

(c) (15 points) (code) Implement a window-based classifier model in q1_window.py using this approach. To do so, you will have to:

i. (4 points) Transform a batch of input sequences into a batch of windowed input-output pairs in the make_windowed_data function. You can test your implementation by running python q1_window.py test1.

ii. (6 points) Implement the feed-forward model described above by appropriately completing functions and the variable n_window_features in the WindowModel class. You can test your implementation by running python q1_window.py test2.

iii. (3 points) Complete the training loop in the method NERModel.fit (ner_model.py). The training and validation datasets are already divided, and the outer loop for training multiple epochs is also written. You need to examine the function minibatches (utils.py) and Model.train_on_batch (which NERModel inherits from) and write the training procedure that loops over all minibatches in the dataset.

iv. (2 points) Train your model using the command python q1_window.py train. The code should take only about 2-3 minutes to run, and you should get a development score of at least 81% F1.

The model and its output will be saved to results/window/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model's predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training. Finally, you can interact with your model using:

    python q1_window.py shell -m results/window/<timestamp>/

Deliverable: After your model has trained, copy the window_predictions.conll file from the appropriate results folder into the root folder of your code directory, so that it can be included in your submission.
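For part (c)iii, the training loop you need to write essentially iterates over all minibatches once per epoch and calls the model's train-on-batch routine for each. Below is a hedged, self-contained sketch of that pattern; the stand-in minibatches and train_on_batch used here are assumptions, and the starter code's real versions may have different signatures.

```python
import numpy as np

# Hedged sketch of the inner training loop for part (c)iii. The starter code's
# minibatches() and Model.train_on_batch() are replaced by stand-ins so this
# runs on its own; their real signatures may differ.
def minibatches(examples, batch_size):
    """Yield (inputs, labels) batches from a list of (input, label) pairs."""
    order = np.random.permutation(len(examples))
    for i in range(0, len(examples), batch_size):
        batch = [examples[j] for j in order[i:i + batch_size]]
        inputs, labels = zip(*batch)
        yield list(inputs), list(labels)

def fit_epoch(train_on_batch, train_examples, batch_size=32):
    """One epoch: loop over all minibatches and return the mean batch loss."""
    losses = [train_on_batch(x, y)
              for x, y in minibatches(train_examples, batch_size)]
    return float(np.mean(losses))

# Dummy usage: 100 fake examples and a train_on_batch that returns a constant loss.
examples = [([1, 2, 3], 0) for _ in range(100)]
print(fit_epoch(lambda inputs, labels: 0.5, examples))
```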

(d) (5 points) (written) Analyze the predictions of your model using the files generated above.

i. (1 point) Report your best development entity-level F1 score and the corresponding token-level confusion matrix. Briefly describe what the confusion matrix tells you about the errors your model is making.

Solution: The best development F1 we got on the corpus is 83%. The corresponding token-level gold/guess confusion matrix (over PER, ORG, LOC, MISC and O) shows that the biggest source of confusion for the model is the organization label: many organizations are confused for people or are missed entirely. On the other hand, people seem to be recognized quite well.

ii. (4 points) Describe at least 2 modeling limitations of the window-based model and support these conclusions using examples from your model's output (i.e. identify errors that your model made due to its limitations). You can also support your conclusions using predictions made by your model on examples manually entered through the shell.

Solution: The window-based model cannot use information from neighboring predictions to disambiguate labeling decisions, leading to non-contiguous entity predictions. Here, we concisely display examples with a word/gold/guess notation. For example, in the sentence "Quarter-final results in the Hong/ORG/LOC Kong/ORG/ORG Open/ORG/ORG on Friday", the window around "Kong" can easily identify that it is part of an organization, while the window around "Hong" doesn't know whether it is part of a location or an organization. The window-based model also can't use information from other parts of the sentence. Take this example: "I'm an emotional player, said the 104th-ranked Tarango/PER/LOC." Here the sentence clearly describes Tarango as being a person (a "player"), but this information can't be used by the window-based model.

2. Recurrent neural nets for NER (40 points)

We will now tackle the task of NER by using a recurrent neural network (RNN). See slide 47 ("RNNs can be used for tagging") of Lecture 8 for an illustration of this kind of tagging task. Recall that each RNN cell combines the previous hidden state with the current input using a sigmoid. We then use the hidden state to predict the output at each timestep:

    e^(t) = x^(t) E
    h^(t) = σ(h^(t-1) W_h + e^(t) W_e + b_1)
    ŷ^(t) = softmax(h^(t) U + b_2),

where E ∈ R^{V×D} are the word embeddings, W_h ∈ R^{H×H}, W_e ∈ R^{D×H} and b_1 ∈ R^H are parameters for the RNN cell, and U ∈ R^{H×C} and b_2 ∈ R^C are parameters for the softmax. As before, V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C is the number of classes being predicted (here 5). In order to train the model, we use a cross-entropy loss for every predicted token:

    J = Σ_{t=1}^{T} CE(y^(t), ŷ^(t)),    CE(y^(t), ŷ^(t)) = -Σ_i y_i^(t) log(ŷ_i^(t)).

(a) (4 points) (written)

i. (1 point) How many more parameters does the RNN model have in comparison to the window-based model?

Solution: The RNN model differs from the window-based model in that (a) W_e has only D × H parameters instead of the (2w + 1)D × H parameters of W in the window model, and (b) it has H × H additional parameters for W_h.

ii. (3 points) What is the computational complexity of predicting labels for a sentence of length T (for the RNN model)?

Solution: It takes O(D) operations to compute e^(t), O(H^2 + DH + H) operations to compute each h^(t) and O(HC + C) operations to compute each ŷ^(t). In total, it takes about O((D + H)HT) operations to predict labels for the whole sentence.

(b) (2 points) (written) Recall that the actual score we want to optimize is entity-level F1.

i. (1 point) Name at least one scenario in which decreasing the cross-entropy cost would lead to a decrease in entity-level F1 scores.

Solution: Consider a sentence containing the words/gold labels "The James/MISC scandal/MISC". If you predicted "James/MISC scandal/O" instead of "James/O scandal/O", your cross-entropy loss would decrease because you have predicted one more token's label correctly. On the other hand, your entity-level F1 score would also decrease because you have now predicted another entity incorrectly: your precision goes down while your recall remains the same. Alternative: the training data is skewed, so that the cross-entropy loss optimizes for precision at the cost of a big drop in recall, causing the F1 score to decrease.
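As a concrete reference for the RNN equations above (and for part (c) below), here is a hedged numpy sketch of a single RNN step followed by the softmax prediction. The dimensions and random parameters are illustrative assumptions; the assignment's actual cell is written in TensorFlow.

```python
import numpy as np

# Illustrative numpy sketch of one RNN step plus the softmax prediction.
# Dimensions are made up; the assignment's cell is implemented in TensorFlow.
D, H, C = 50, 100, 5
rng = np.random.default_rng(0)
W_e, W_h = rng.standard_normal((D, H)), rng.standard_normal((H, H))
b1 = np.zeros(H)
U, b2 = rng.standard_normal((H, C)), np.zeros(C)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(e_t, h_prev):
    """e_t: embedded input e^(t), shape (D,); h_prev: h^(t-1), shape (H,)."""
    h = sigmoid(h_prev @ W_h + e_t @ W_e + b1)     # new hidden state h^(t)
    scores = h @ U + b2
    y_hat = np.exp(scores - scores.max())
    return h, y_hat / y_hat.sum()                  # ŷ^(t) = softmax(h^(t) U + b2)

h = np.zeros(H)                                    # h^(0) = 0
for e_t in rng.standard_normal((7, D)):            # a length-7 "sentence" of embeddings
    h, y_hat = rnn_step(e_t, h)
print(y_hat)
```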

ii. (1 point) Why is it difficult to directly optimize for F1?

Solution: For starters, F1 is not differentiable. Secondly, it is hard to directly optimize for F1 because it requires predictions over the entire corpus to compute, making it very difficult to batch and parallelize.

(c) (5 points) (code) Implement an RNN cell using the equations described above in the rnn_cell function of q2_rnn_cell.py. You can test your implementation by running python q2_rnn_cell.py test. Hint: We recommend you initialize the bias terms in the RNN/GRU cell with 0s. You can do this by using tf.constant_initializer.

(d) (8 points) (code/written) Implementing an RNN requires us to unroll the computation over the whole sentence. Unfortunately, each sentence can be of arbitrary length, and this would cause the RNN to be unrolled a different number of times for different sentences, making it impossible to batch process the data. The most common way to address this problem is to pad our input with zeros. Suppose the largest sentence in our input is M tokens long; then, for an input of length T we will need to:

1. Add 0-vectors to x and y to make them M tokens long. These 0-vectors are still one-hot vectors, representing a new NULL token.
2. Create a masking vector (m^(t))_{t=1}^{M} which is 1 for all t ≤ T and 0 for all t > T. This masking vector will allow us to ignore the predictions that the network makes on the padded input. (In our code, we will actually use the Boolean values True and False instead of 1 and 0 for computational efficiency.)

Of course, by extending the input and output by M - T tokens, we might change our loss and hence the gradient updates. In order to tackle this problem, we modify our loss using the masking vector:

    J = Σ_{t=1}^{M} m^(t) CE(y^(t), ŷ^(t)).

i. (3 points) (written) How would the loss and gradient updates change if we did not use masking? How does masking solve this problem?

Solution: Without masking, the loss would also include the cross-entropy of predicting the extra 0-labels, and the gradients from the padded input would flow through the hidden state and affect the learning of the parameters. By masking the loss, we zero out the loss due to these extra 0-labels (and hence their gradients), solving the problem.

ii. (5 points) (code) Implement pad_sequences in your code. You can test your implementation by running python q2_rnn.py test1.

(e) (12 points) (code) Implement the rest of the RNN model, assuming only fixed-length input, by appropriately completing functions in the RNNModel class. This will involve:

1. Implementing the add_placeholders, create_feed_dict, add_embedding and add_training_op functions.
2. Implementing the add_prediction_op operation that unrolls the RNN loop self.max_length times. Remember to reuse variables in your variable scope from the 2nd timestep onwards to share the RNN cell weights W_e and W_h across timesteps.
3. Implementing add_loss_op to handle the mask vector returned in the previous part.

You can test your implementation by running python q2_rnn.py test2.
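For intuition, the padding and masking steps amount to something like the following numpy sketch. The function names, the index-based (rather than one-hot) representation, and the choice of null token/label ids are assumptions; the assignment's pad_sequences and add_loss_op do this with TensorFlow tensors and boolean masks.

```python
import numpy as np

# Hedged sketch of padding plus a masked cross-entropy loss, for intuition only.
# The assignment's pad_sequences / add_loss_op use TensorFlow and may differ.
def pad_sequence(token_ids, label_ids, max_length, null_token=0, null_label=4):
    """Truncate/pad one sentence to max_length and build its boolean mask."""
    T = min(len(token_ids), max_length)
    mask = [True] * T + [False] * (max_length - T)
    tokens = token_ids[:T] + [null_token] * (max_length - T)
    labels = label_ids[:T] + [null_label] * (max_length - T)
    return tokens, labels, mask

def masked_ce(y, y_hat, mask):
    """y, y_hat: (M, C) one-hot gold labels and predictions; mask: (M,) booleans."""
    ce_per_step = -np.sum(y * np.log(y_hat), axis=1)   # CE(y^(t), ŷ^(t)) for each t
    return float(np.sum(ce_per_step * np.asarray(mask, dtype=float)))

tokens, labels, mask = pad_sequence([5, 9, 2], [1, 4, 4], max_length=5)
print(tokens, labels, mask)   # [5, 9, 2, 0, 0] [1, 4, 4, 4, 4] [True, True, True, False, False]
```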

(f) (3 points) (code) Train your model using the command python q2_rnn.py train. Training should take about 1-2 hours on your CPU and only minutes if you use a GPU. You should get a development F1 score of at least 85%.

The model and its output will be saved to results/rnn/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model's predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training. Finally, you can interact with your model using:

    python q2_rnn.py shell -m results/rnn/<timestamp>/

Deliverable: After your model has trained, copy the rnn_predictions.conll file from the appropriate results folder into your code directory so that it can be included in your submission.

(g) (6 points) (written)

i. (3 points) Describe at least 2 modeling limitations of this RNN model and support these conclusions using examples from your model's output.

ii. (3 points) For each limitation, suggest some way you could extend the model to overcome the limitation.

Solution: One limitation is that the model does not get to see into the future while making predictions; e.g. in "New York State University", the first token "New" is likely to be tagged LOC (as in "New York") even though the whole span is an organization. This problem can be solved with a bidirectional RNN. Another limitation is that the model doesn't enforce that adjacent tokens have the same tag (illustrated by the same example above). Introducing pair-wise agreement between adjacent tags (i.e. using a CRF loss) would solve this problem.

3. Grooving with GRUs (30 points)

In class, we learned that a gated recurrent unit (GRU) is an improved RNN cell that greatly reduces the problem of vanishing gradients. Recall that a GRU is described by the following equations:

    z^(t) = σ(x^(t) W_z + h^(t-1) U_z + b_z)
    r^(t) = σ(x^(t) W_r + h^(t-1) U_r + b_r)
    h̃^(t) = tanh(x^(t) W_h + (r^(t) ∘ h^(t-1)) U_h + b_h)
    h^(t) = z^(t) ∘ h^(t-1) + (1 - z^(t)) ∘ h̃^(t),

where z^(t) is considered to be an update gate and r^(t) is considered to be a reset gate. (Colah's blog post on understanding LSTMs, colah.github.io/posts/.../understanding-lstms/, provides a colorful picture of LSTMs and, to an extent, GRUs.) Also, to keep the notation consistent with the GRU, for this problem let the basic RNN cell be described by the equation:

    h^(t) = σ(x^(t) W_h + h^(t-1) U_h + b_h).

To gain some intuition, let's explore the behavior of the basic RNN cell and the GRU on some generated 1-D sequences.

(a) (4 points) (written) Modeling latching behavior. Let's say we are given input sequences starting with a 1 or a 0, followed by n 0s, e.g. 0, 1, 00, 10, 000, 100, etc. We would like our state h to continue to remember what the first character was, irrespective of how many 0s follow.
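Before working through the hand analysis below, here is a hedged numpy sketch of one GRU step that follows the equations above; it may also be a useful reference when implementing q3_gru_cell.py in part (c). The parameter shapes and random values are illustrative assumptions, and the assignment's actual cell is written in TensorFlow.

```python
import numpy as np

# Illustrative numpy sketch of one GRU step following the equations above.
# Shapes/values are made up; the assignment implements the cell in TensorFlow.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step. p holds the parameters W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h."""
    z = sigmoid(x @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])              # update gate
    r = sigmoid(x @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])              # reset gate
    h_tilde = np.tanh(x @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                               # new state h^(t)

D, H = 4, 3
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((D, H)) for name in ["W_z", "W_r", "W_h"]}
params.update({name: rng.standard_normal((H, H)) for name in ["U_z", "U_r", "U_h"]})
params.update({name: np.zeros(H) for name in ["b_z", "b_r", "b_h"]})
h = np.zeros(H)
for x in rng.standard_normal((5, D)):      # a length-5 toy input sequence
    h = gru_step(x, h, params)
print(h)
```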

This scenario can also be described as wanting the neural network to learn the following simple automaton: two states, h = 0 and h = 1; from h = 0, input x = 0 stays at h = 0 and input x = 1 moves to h = 1, while from h = 1 both inputs x = 0, 1 stay at h = 1.

In other words, when the network sees a 1, it should change its state to also be a 1 and stay there. In the following questions, assume that the state is initialized at 0 (i.e. h^(0) = 0), and that all the parameters are scalars. Further, assume that all sigmoid activations and tanh activations are replaced by the indicator function:

    σ(x) → 1 if x > 0, 0 otherwise;    tanh(x) → 1 if x > 0, 0 otherwise.

i. (1 point) Identify values of W_h, U_h and b_h for an RNN cell that would allow it to replicate the behavior described by the automaton above.

Solution: (Since this is only worth 1 point, students do not have to give a detailed description of how they obtained the answer.) For the RNN cell h^(t) = σ(x^(t) W_h + h^(t-1) U_h + b_h), we consider the following four cases:

- h^(t-1) = 0, x^(t) = 0: since we want h^(t) = 0, we need σ(b_h) = 0, i.e. b_h ≤ 0.
- h^(t-1) = 0, x^(t) = 1: since we want h^(t) = 1, we need σ(W_h + b_h) = 1, i.e. W_h + b_h > 0.
- h^(t-1) = 1, x^(t) = 0: since we want h^(t) = 1, we need σ(U_h + b_h) = 1, i.e. U_h + b_h > 0.
- h^(t-1) = 1, x^(t) = 1: since we want h^(t) = 1, we need σ(U_h + W_h + b_h) = 1, i.e. U_h + W_h + b_h > 0 (which already follows from the previous conditions).

Therefore, any W_h, U_h, b_h satisfying b_h ≤ 0, W_h + b_h > 0 and U_h + b_h > 0 will work.
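For instance, W_h = U_h = 1 and b_h = 0 satisfy all of these inequalities. The following short script is a sanity check of that choice (it is not something the assignment asks for); it enumerates the four cases under the indicator activations.

```python
# Sanity check (not required by the assignment): verify that W_h = U_h = 1, b_h = 0
# make the indicator-activation RNN latch onto the first 1 it sees.
def indicator(v):           # stands in for both sigma and tanh in this question
    return 1 if v > 0 else 0

W_h, U_h, b_h = 1, 1, 0

def rnn_step(x, h_prev):
    return indicator(x * W_h + h_prev * U_h + b_h)

for h_prev in (0, 1):
    for x in (0, 1):
        expected = 1 if (h_prev == 1 or x == 1) else 0   # the automaton's behavior
        assert rnn_step(x, h_prev) == expected
print("latching behavior reproduced")
```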

ii. (3 points) Let W_r = U_r = b_r = b_z = b_h = 0 (remember that all of them are scalars in this subquestion). Identify values of W_z, U_z, W_h and U_h for a GRU cell that would allow it to replicate the behavior described by the automaton above.

Solution: Since W_r = U_r = b_r = b_z = b_h = 0, the GRU cell can be simplified to:

    z^(t) = σ(x^(t) W_z + h^(t-1) U_z)
    r^(t) = 0
    h̃^(t) = tanh(x^(t) W_h)
    h^(t) = z^(t) h^(t-1) + (1 - z^(t)) h̃^(t)

We consider the following four cases:

- h^(t-1) = 0, x^(t) = 0: z^(t) = σ(0) = 0 and h̃^(t) = tanh(0) = 0, so h^(t) = 0, which is what we want, with no constraint on the parameters.
- h^(t-1) = 0, x^(t) = 1: z^(t) = σ(W_z), h̃^(t) = tanh(W_h), and h^(t) = (1 - σ(W_z)) tanh(W_h). Since we want h^(t) = 1, we need W_z ≤ 0 and W_h > 0.
- h^(t-1) = 1, x^(t) = 0: z^(t) = σ(U_z), h̃^(t) = tanh(0) = 0, and h^(t) = z^(t) h^(t-1) = σ(U_z). Since we want h^(t) = 1, we need U_z > 0.
- h^(t-1) = 1, x^(t) = 1: z^(t) = σ(W_z + U_z), h̃^(t) = tanh(W_h), and h^(t) = σ(W_z + U_z) + (1 - σ(W_z + U_z)) tanh(W_h). Given W_h > 0, this equals 1 whether the update gate is 0 or 1, so no additional constraint is needed.

Therefore, the conditions W_z, U_z, W_h and U_h should satisfy are: U_z > 0, W_z ≤ 0, W_h > 0, and U_h may be any real number (it is multiplied by the reset gate, which is 0).

(b) (6 points) (written) Modeling toggling behavior. Now, let us try modeling a more interesting behavior. We are now given an arbitrary input sequence, and must produce an output sequence that switches from 0 to 1 and vice versa whenever it sees a 1 in the input. For example, the input sequence 0 1 0 1 0 0 would produce the output sequence 0 1 1 0 0 0. This behavior could be described by the following automaton: two states, h = 0 and h = 1; input x = 0 keeps the state unchanged, while input x = 1 moves from h = 0 to h = 1 and from h = 1 back to h = 0.

Once again, assume that the state is initialized at 0 (i.e. h^(0) = 0), that all the parameters are scalars, and that all sigmoid and tanh activations are replaced by the indicator function.

i. (3 points) Show that a 1-D RNN cannot replicate the behavior described by the automaton above.

Solution: First of all, we need the state to maintain its value when the input is 0:

    σ(0·W_h + 0·U_h + b_h) = 0  ⟹  b_h ≤ 0
    σ(0·W_h + 1·U_h + b_h) = 1  ⟹  U_h + b_h > 0.

Next, the state needs to flip from 0 to 1 when seeing an input of 1, and flip back from 1 to 0 when seeing a 1 again. In equations, this requires that:

    σ(1·W_h + 0·U_h + b_h) = 1  ⟹  W_h + b_h > 0
    σ(1·W_h + 1·U_h + b_h) = 0  ⟹  W_h + U_h + b_h ≤ 0.

From W_h + b_h > 0 and b_h ≤ 0 we must have W_h > 0, but from W_h + U_h + b_h ≤ 0 and U_h + b_h > 0 we must have W_h < 0, which is a contradiction. Hence no choice of scalar parameters replicates the toggling behavior.

ii. (3 points) Let W_r = U_r = b_z = b_h = 0. Identify values of b_r, W_z, U_z, W_h and U_h for a GRU cell that would allow it to replicate the behavior described by the automaton above.

Solution: First, set b_r = 1 so that the reset gate is always open (r^(t) = 1). With the remaining parameters constrained to 0, the cell becomes

    z^(t) = σ(x^(t) W_z + h^(t-1) U_z)
    h̃^(t) = tanh(x^(t) W_h + h^(t-1) U_h)
    h^(t) = z^(t) h^(t-1) + (1 - z^(t)) h̃^(t).

We exploit the update gate to copy the previous state forward when (h^(t-1), x^(t)) = (1, 0), and choose the candidate state h̃^(t) to be 1 when (h^(t-1), x^(t)) = (0, 1) and 0 when (h^(t-1), x^(t)) = (1, 1). One choice that satisfies all four cases is b_r = 1, W_z = -1, U_z = 1, W_h = 1, U_h = -2:

- (h^(t-1), x^(t)) = (0, 0): z = σ(0) = 0 and h̃ = tanh(0) = 0, so h^(t) = 0.
- (h^(t-1), x^(t)) = (0, 1): z = σ(-1) = 0 and h̃ = tanh(1) = 1, so h^(t) = 1.
- (h^(t-1), x^(t)) = (1, 0): z = σ(1) = 1, so h^(t) copies the previous state, h^(t) = 1.
- (h^(t-1), x^(t)) = (1, 1): z = σ(-1 + 1) = 0 and h̃ = tanh(1 - 2) = 0, so h^(t) = 0.

(A short numerical check of these values appears after part (d) below.)

(c) (6 points) (code) Implement the GRU cell described above in q3_gru_cell.py. You can test your implementation by running python q3_gru_cell.py test.

(d) (6 points) (code) We will now use an RNN model to try and learn the latching behavior described in part (a) using TensorFlow's RNN implementation, tf.nn.dynamic_rnn.

i. In q3_gru.py, implement add_prediction_op by applying TensorFlow's dynamic RNN model on the sequence input provided. Also apply a sigmoid function on the final state to normalize the state values between 0 and 1, and implement add_loss_op.

ii. Next, write code to calculate the gradient norm and implement gradient clipping in add_training_op.

iii. Run the program python q3_gru.py predict -c [rnn|gru] [-g] to generate a learning curve for this task for the RNN and GRU models. The -g flag activates gradient clipping. These commands produce plots of the learning dynamics in q3-noclip-<model>.png and q3-clip-<model>.png respectively.

Deliverable: Attach the plots of learning dynamics generated for the GRU and RNN, with and without gradient clipping (4 plots in total), to your write-up.
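As promised above, here is a short sanity check, not required by the assignment, that the parameter values given for part (b)ii reproduce the toggling automaton under the indicator activations:

```python
# Sanity check (not required by the assignment): the part (b)ii values
# b_r = 1, W_z = -1, U_z = 1, W_h = 1, U_h = -2 (with W_r = U_r = b_z = b_h = 0)
# reproduce the toggling automaton under the indicator activations.
def indicator(v):            # stands in for both sigma and tanh in this question
    return 1 if v > 0 else 0

W_r = U_r = b_z = b_h = 0
b_r, W_z, U_z, W_h, U_h = 1, -1, 1, 1, -2

def gru_step(x, h_prev):
    z = indicator(x * W_z + h_prev * U_z + b_z)
    r = indicator(x * W_r + h_prev * U_r + b_r)
    h_tilde = indicator(x * W_h + r * h_prev * U_h + b_h)
    return z * h_prev + (1 - z) * h_tilde

for h_prev in (0, 1):
    for x in (0, 1):
        expected = h_prev if x == 0 else 1 - h_prev   # stay on 0, flip on 1
        assert gru_step(x, h_prev) == expected
print("toggling behavior reproduced")
```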

(e) (5 points) (written) Analyze the graphs obtained above and describe the learning dynamics you see. Make sure you address the following questions:

i. Does either model experience vanishing or exploding gradients? If so, does gradient clipping help?
ii. Which model does better? Can you explain why?

Important: There is some variance in the graphs you get based on the order of initializing variables. It might be the case that you do not see any effect from clipping gradients; that is ok, report your graphs and what you observe. For reference, the order in which we initialize variables for the GRU is W_r, U_r, b_r, W_z, U_z, b_z, W_o, U_o, b_o, and for the RNN cell it is W_h, W_x, b.

Solution: The plots for this question vary a bit based on initialization, and we are evaluating students' descriptions of what they see. Some things that we are looking for: (a) gradients should eventually decrease (or something is definitely fishy); (b) the RNN's gradients vanish quickly, so clipping doesn't really help; (c) for the GRU, gradients are maintained and it can improve on the loss; that said, clipping can prevent these gradients from improving the loss.
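One common way to implement the clipping asked for in part (d)ii is global-norm clipping: if the joint norm of all gradients exceeds a threshold, every gradient is rescaled by the same factor (TensorFlow provides this operation as tf.clip_by_global_norm). The numpy sketch below illustrates the idea; treating it as the assignment's exact scheme is an assumption.

```python
import numpy as np

# Hedged numpy sketch of global-norm gradient clipping; whether this matches
# the starter code's exact scheme is an assumption.
def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one factor if their joint norm is too large."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, [np.linalg.norm(g) for g in clipped])
```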

(f) (3 points) (code) Run the NER model from question 2 using the GRU cell, using the command python q2_rnn.py train -c gru. Training should take about 3-4 hours on your CPU and about 30 minutes on a GPU. (Yes, several hours is a long time, but you are learning to become a Deep Learning researcher, so you need to be able to manage several-hour experiments!) You should get a development F1 score of at least 85%.

The model and its output will be saved to results/gru/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model's predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training. Finally, you can interact with your model using:

    python q2_rnn.py shell -m results/gru/<timestamp>/

Deliverable: After your model has trained, copy the gru_predictions.conll file from the appropriate results folder into your code directory so that it can be included in your submission.
