CS 224n: Assignment #3

Due date: 2/27 11:59 PM PST (you are allowed to use 3 late days maximum for this assignment)

These questions require thought, but do not require long answers. Please be as concise as possible. We ask that you abide by the university Honor Code and that of the Computer Science department, and make sure that all of your submitted work is done by yourself. Please review any additional instructions posted on the assignment page. When you are ready to submit, please follow the instructions on the course website.

Note: This assignment involves running an experiment that takes an estimated 3-4 hours. Do not start this assignment at the last minute!

Note: In this assignment, the inputs to neural network layers will be row vectors because this is standard practice for TensorFlow (some built-in TensorFlow functions assume the inputs are row vectors). This means the weight matrix of a hidden layer will right-multiply instead of left-multiply its input (i.e., xW + b instead of Wx + b).

A primer on named entity recognition

In this assignment, we will build several different models for named entity recognition (NER). NER is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. In this assignment, for a given word in a context, we want to predict whether it represents one of four categories:

- Person (PER): e.g. Martha Stewart, Obama, Tim Wagner, etc. Pronouns like "he" or "she" are not considered named entities.
- Organization (ORG): e.g. American Airlines, Goldman Sachs, Department of Defense.
- Location (LOC): e.g. Germany, Panama Strait, Brussels, but not unnamed locations like "the bar" or "the farm".
- Miscellaneous (MISC): e.g. Japanese, USD, 1,000, Englishmen.

We formulate this as a 5-class classification problem, using the four classes above and a null class (O) for words that do not represent a named entity (most words fall into this category). For an entity that spans multiple words ("Department of Defense"), each word is separately tagged, and every contiguous sequence of non-null tags is considered to be an entity. Here is a sample sentence (x^(t)) with the named entities tagged above each token (y^(t)), as well as hypothetical predictions produced by a system (ŷ^(t)):

    y^(t):  ORG       ORG        O  O     O   ORG  ORG     ...  O          PER  PER     O
    ŷ^(t):  MISC      O          O  O     O   ORG  O       ...  O          PER  PER     O
    x^(t):  American  Airlines,  a  unit  of  AMR  Corp.,  ...  spokesman  Tim  Wagner  said.

In the above example, the system mistakenly predicted "American" to be of the MISC class and ignored "Airlines" and "Corp.". Altogether, it predicts three entities: "American", "AMR" and "Tim Wagner". To evaluate the quality of an NER system's output, we look at precision, recall and the F1 measure. In particular, we will report precision, recall and F1 at both the token level and the named-entity level. In the former case:

- Precision is calculated as the ratio of correct non-null labels predicted to the total number of non-null labels predicted (in the above example, it would be p = 3/4).
- Recall is calculated as the ratio of correct non-null labels predicted to the total number of correct non-null labels (in the above example, it would be r = 3/6).
- F1 is the harmonic mean of the two: F1 = 2pr/(p + r) (in the above example, it would be F1 = 6/10).

For entity-level F1:

- Precision is the fraction of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. In our example, "AMR" would be marked incorrect because it does not cover the whole entity, i.e. "AMR Corp.", as would "American", and we would get a precision score of 1/3.
- Recall is similarly the fraction of names in the gold standard that appear at exactly the same location in the predictions. Here, we would get a recall score of 1/3.
- Finally, the F1 score is still the harmonic mean of the two, and would be 1/3 in the example.

Our model also outputs a token-level confusion matrix. A confusion matrix is a specific table layout that allows visualization of the classification performance: each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

1. A window into NER (30 points)

Let's look at a simple baseline model that predicts a label for each token separately using features from a window around it.

Figure 1: A sample input sequence

Figure 1 shows an example of an input sequence and the first window from this sequence. Let x = (x^(1), x^(2), ..., x^(T)) be an input sequence of length T and y = (y^(1), y^(2), ..., y^(T)) be an output sequence, also of length T. Here, each element x^(t) and y^(t) is a one-hot vector: x^(t) represents the word at the t-th index of the sentence and y^(t) its label. In a window-based classifier, every input sequence is split into T new data points, each representing a window and its label. A new input is constructed from a window around x^(t) by concatenating w tokens to the left and right of x^(t):

    x̃^(t) = [x^(t-w), ..., x^(t), ..., x^(t+w)],

and we continue to use y^(t) as its label.
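To make this transformation concrete, here is a minimal Python sketch of the windowing step, including the <START>/<END> boundary padding described in the next paragraph. It works on raw strings purely for readability; the starter code's make_windowed_data operates on preprocessed inputs, and the example sentence and helper name here are illustrative assumptions rather than the assignment's actual code.

```python
# Minimal sketch of the windowing transform described above, using raw strings
# for readability. The assignment's make_windowed_data works on preprocessed
# inputs, so treat this only as an illustration of the idea.
def make_windows(tokens, labels, window_size, start_tok="<START>", end_tok="<END>"):
    """Turn one labeled sentence into (window, label) pairs of width 2w+1."""
    padded = [start_tok] * window_size + list(tokens) + [end_tok] * window_size
    return [(padded[t:t + 2 * window_size + 1], label)
            for t, label in enumerate(labels)]

# A hypothetical sentence in the spirit of Figure 1.
sentence = ["Jim", "bought", "300", "shares"]
tags = ["PER", "O", "O", "O"]
for window, label in make_windows(sentence, tags, window_size=1):
    print(window, "->", label)
# ['<START>', 'Jim', 'bought'] -> PER
# ['Jim', 'bought', '300'] -> O   ... and so on.
```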

For windows centered around tokens at the very beginning of a sentence, we add special start tokens (<START>) to the beginning of the window, and for windows centered around tokens at the very end of a sentence, we add special end tokens (<END>) to the end of the window. For example, consider constructing a window around "Jim" in the sentence above. If the window size were 1, we would add a single start token to the window (resulting in a window of [<START>, Jim, bought]). If the window size were 2, we would add two start tokens to the window (resulting in a window of [<START>, <START>, Jim, bought, 300]).

With these, each input and output is of a uniform length (2w + 1 and 1 respectively), and we can use a simple feedforward neural net to predict y^(t) from the windowed input x̃^(t). As a simple but effective model to predict labels from each window, we will use a single hidden layer with a ReLU activation, combined with a softmax output layer and the cross-entropy loss:

    e^(t) = [x^(t-w) E, ..., x^(t) E, ..., x^(t+w) E]
    h^(t) = ReLU(e^(t) W + b_1)
    ŷ^(t) = softmax(h^(t) U + b_2)
    J = CE(y^(t), ŷ^(t)) = -Σ_i y_i^(t) log(ŷ_i^(t)),

where E ∈ R^{V×D} are the word embeddings, h^(t) has dimension H and ŷ^(t) has dimension C; here V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C is the number of classes being predicted (here 5).

(a) (5 points) (written)

i. (2 points) Provide 2 examples of sentences containing a named entity with an ambiguous type (e.g. the entity could either be a person or an organization, or it could either be an organization or not an entity).

Solution: Here are a couple of examples:
- "Spokesperson for Levis, Bill Murray, said ...", where it is ambiguous whether Levis is a person or an organization.
- "Heartbreak is a new virus", where Heartbreak could either be a MISC named entity (it's actually the name of a virus) or simply a common noun.
Basically, we are looking for any situation wherein the sentence contains a word whose entity type cannot be determined from the word alone.

ii. (1 point) Why might it be important to use features apart from the word itself to predict named entity labels?

Solution: Named entities are often rare words, e.g. people's names or "Heartbreak". Using features like casing helps the system generalize.

iii. (2 points) Describe at least two features (apart from the word) that would help in predicting whether a word is part of a named entity or not.

Solution: The casing of words and their parts of speech.

(b) (5 points) (written)

i. (2 points) What are the dimensions of e^(t), W and U if we use a window of size w?

Solution: e^(t) is 1 × (2w + 1)D, W is (2w + 1)D × H, and U is H × C.

ii. (3 points) What is the computational complexity of predicting labels for a sentence of length T?

Solution: For a single window, it requires O((2w + 1)D) operations to compute e^(t), O((2w + 1)DH + H) operations to compute h^(t) and O(HC + C) operations to compute ŷ^(t). In total, it requires O((2w + 1)DHT + HCT) operations to predict labels for the entire sentence. Assuming that D, H ≫ C, O((2w + 1)DHT) is also an acceptable answer.
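To make the shapes and operation counts above concrete, here is a hedged numpy sketch of the window model's forward pass for a single window. All dimensions and values are illustrative assumptions; the assignment itself implements this model in TensorFlow.

```python
import numpy as np

# Illustrative numpy sketch of the window model's forward pass for one window.
# Dimensions are made up; the assignment implements this in TensorFlow.
V, D, H, C, w = 100, 50, 200, 5, 1              # vocab, embedding, hidden, classes, window
rng = np.random.default_rng(0)
E  = rng.standard_normal((V, D))                 # word embeddings, V x D
W  = rng.standard_normal(((2 * w + 1) * D, H))   # hidden weights, (2w+1)D x H
b1 = np.zeros(H)
U  = rng.standard_normal((H, C))                 # output weights, H x C
b2 = np.zeros(C)

window_ids = [3, 17, 42]                         # indices of x^(t-w), ..., x^(t+w)
e = E[window_ids].reshape(-1)                    # e^(t), shape ((2w+1)D,)
h = np.maximum(e @ W + b1, 0.0)                  # h^(t) = ReLU(e^(t) W + b1)
scores = h @ U + b2
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()                             # ŷ^(t) = softmax(h^(t) U + b2)
y = np.eye(C)[2]                                 # a one-hot gold label
loss = -np.sum(y * np.log(y_hat))                # cross-entropy CE(y, ŷ)
print(e.shape, h.shape, y_hat.shape, loss)
```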

(c) (15 points) (code) Implement a window-based classifier model in q1_window.py using this approach. To do so, you will have to:

i. (4 points) Transform a batch of input sequences into a batch of windowed input-output pairs in the make_windowed_data function. You can test your implementation by running python q1_window.py test1.

ii. (6 points) Implement the feed-forward model described above by appropriately completing functions and the variable n_window_features in the WindowModel class. You can test your implementation by running python q1_window.py test2.

iii. (3 points) Complete the training loop in the method NERModel.fit (ner_model.py). The training and validation datasets are already divided, and the outer loop for training multiple epochs is also written. You need to examine the function minibatches (utils.py) and Model.train_on_batch (which NERModel inherits from) and write the training procedure that loops over all minibatches in the dataset.

iv. (2 points) Train your model using the command python q1_window.py train. The code should take only about 2-3 minutes to run, and you should get a development score of at least 81% F1.

The model and its output will be saved to results/window/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model's predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training. Finally, you can interact with your model using:

    python q1_window.py shell -m results/window/<timestamp>/

Deliverable: After your model has trained, copy the window_predictions.conll file from the appropriate results folder into the root folder of your code directory, so that it can be included in your submission.
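For part (c)iii, the training loop you need to write essentially iterates over all minibatches once per epoch and calls the model's train-on-batch routine for each. Below is a hedged, self-contained sketch of that pattern; the stand-in minibatches and train_on_batch used here are assumptions, and the starter code's real versions may have different signatures.

```python
import numpy as np

# Hedged sketch of the inner training loop for part (c)iii. The starter code's
# minibatches() and Model.train_on_batch() are replaced by stand-ins so this
# runs on its own; their real signatures may differ.
def minibatches(examples, batch_size):
    """Yield (inputs, labels) batches from a list of (input, label) pairs."""
    order = np.random.permutation(len(examples))
    for i in range(0, len(examples), batch_size):
        batch = [examples[j] for j in order[i:i + batch_size]]
        inputs, labels = zip(*batch)
        yield list(inputs), list(labels)

def fit_epoch(train_on_batch, train_examples, batch_size=32):
    """One epoch: loop over all minibatches and return the mean batch loss."""
    losses = [train_on_batch(x, y)
              for x, y in minibatches(train_examples, batch_size)]
    return float(np.mean(losses))

# Dummy usage: 100 fake examples and a train_on_batch that returns a constant loss.
examples = [([1, 2, 3], 0) for _ in range(100)]
print(fit_epoch(lambda inputs, labels: 0.5, examples))
```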

(d) (5 points) (written) Analyze the predictions of your model using the files generated above.

i. (1 point) Report your best development entity-level F1 score and the corresponding token-level confusion matrix. Briefly describe what the confusion matrix tells you about the errors your model is making.

Solution: The best development F1 we got on the corpus is 83%. The corresponding token-level gold/guess confusion matrix (over PER, ORG, LOC, MISC and O) shows that the biggest source of confusion for the model is the organization label: many organizations are confused for people or are missed entirely. On the other hand, people seem to be recognized quite well.

ii. (4 points) Describe at least 2 modeling limitations of the window-based model and support these conclusions using examples from your model's output (i.e. identify errors that your model made due to its limitations). You can also support your conclusions using predictions made by your model on examples manually entered through the shell.

Solution: The window-based model cannot use information from neighboring predictions to disambiguate labeling decisions, leading to non-contiguous entity predictions. Here, we concisely display examples with a word/gold/guess notation. For example, in the sentence "Quarter-final results in the Hong/ORG/LOC Kong/ORG/ORG Open/ORG/ORG on Friday", the window around "Kong" can easily identify that it is part of an organization, while the window around "Hong" doesn't know whether it is part of a location or an organization. The window-based model also can't use information from other parts of the sentence. Take this example: "I'm an emotional player, said the 104th-ranked Tarango/PER/LOC." Here the sentence clearly describes Tarango as being a person (a "player"), but this information can't be used by the window-based model.

2. Recurrent neural nets for NER (40 points)

We will now tackle the task of NER by using a recurrent neural network (RNN). See slide 47 ("RNNs can be used for tagging") of Lecture 8 for an illustration of this kind of tagging task. Recall that each RNN cell combines the previous hidden state with the current input using a sigmoid. We then use the hidden state to predict the output at each timestep:

    e^(t) = x^(t) E
    h^(t) = σ(h^(t-1) W_h + e^(t) W_e + b_1)
    ŷ^(t) = softmax(h^(t) U + b_2),

where E ∈ R^{V×D} are the word embeddings, W_h ∈ R^{H×H}, W_e ∈ R^{D×H} and b_1 ∈ R^H are parameters for the RNN cell, and U ∈ R^{H×C} and b_2 ∈ R^C are parameters for the softmax. As before, V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C is the number of classes being predicted (here 5). In order to train the model, we use a cross-entropy loss for every predicted token:

    J = Σ_{t=1}^{T} CE(y^(t), ŷ^(t)),    CE(y^(t), ŷ^(t)) = -Σ_i y_i^(t) log(ŷ_i^(t)).

(a) (4 points) (written)

i. (1 point) How many more parameters does the RNN model have in comparison to the window-based model?

Solution: The RNN model differs from the window-based model in that (a) W_e has only D × H parameters instead of the (2w + 1)D × H parameters of W in the window model, and (b) it has H × H additional parameters for W_h.

ii. (3 points) What is the computational complexity of predicting labels for a sentence of length T (for the RNN model)?

Solution: It takes O(D) operations to compute e^(t), O(H^2 + DH + H) operations to compute each h^(t) and O(HC + C) operations to compute each ŷ^(t). In total, it takes about O((D + H)HT) operations to predict labels for the whole sentence.

(b) (2 points) (written) Recall that the actual score we want to optimize is entity-level F1.

i. (1 point) Name at least one scenario in which decreasing the cross-entropy cost would lead to a decrease in entity-level F1 scores.

Solution: Consider a sentence containing the words/gold labels "The James/MISC scandal/MISC". If you predicted "James/MISC scandal/O" instead of "James/O scandal/O", your cross-entropy loss would decrease because you have predicted one more token's label correctly. On the other hand, your entity-level F1 score would also decrease because you have now predicted another entity incorrectly: your precision goes down while your recall remains the same. Alternative: the training data is skewed, so that the cross-entropy loss optimizes for precision at the cost of a big drop in recall, causing the F1 score to decrease.
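As a concrete reference for the RNN equations above (and for part (c) below), here is a hedged numpy sketch of a single RNN step followed by the softmax prediction. The dimensions and random parameters are illustrative assumptions; the assignment's actual cell is written in TensorFlow.

```python
import numpy as np

# Illustrative numpy sketch of one RNN step plus the softmax prediction.
# Dimensions are made up; the assignment's cell is implemented in TensorFlow.
D, H, C = 50, 100, 5
rng = np.random.default_rng(0)
W_e, W_h = rng.standard_normal((D, H)), rng.standard_normal((H, H))
b1 = np.zeros(H)
U, b2 = rng.standard_normal((H, C)), np.zeros(C)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(e_t, h_prev):
    """e_t: embedded input e^(t), shape (D,); h_prev: h^(t-1), shape (H,)."""
    h = sigmoid(h_prev @ W_h + e_t @ W_e + b1)     # new hidden state h^(t)
    scores = h @ U + b2
    y_hat = np.exp(scores - scores.max())
    return h, y_hat / y_hat.sum()                  # ŷ^(t) = softmax(h^(t) U + b2)

h = np.zeros(H)                                    # h^(0) = 0
for e_t in rng.standard_normal((7, D)):            # a length-7 "sentence" of embeddings
    h, y_hat = rnn_step(e_t, h)
print(y_hat)
```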

ii. (1 point) Why is it difficult to directly optimize for F1?

Solution: For starters, F1 is not differentiable. Secondly, it is hard to directly optimize for F1 because it requires predictions over the entire corpus to compute, making it very difficult to batch and parallelize.

(c) (5 points) (code) Implement an RNN cell using the equations described above in the rnn_cell function of q2_rnn_cell.py. You can test your implementation by running python q2_rnn_cell.py test. Hint: We recommend you initialize the bias terms in the RNN/GRU cell with 0s. You can do this by using tf.constant_initializer.

(d) (8 points) (code/written) Implementing an RNN requires us to unroll the computation over the whole sentence. Unfortunately, each sentence can be of arbitrary length, and this would cause the RNN to be unrolled a different number of times for different sentences, making it impossible to batch process the data. The most common way to address this problem is to pad our input with zeros. Suppose the largest sentence in our input is M tokens long; then, for an input of length T we will need to:

1. Add 0-vectors to x and y to make them M tokens long. These 0-vectors are still one-hot vectors, representing a new NULL token.
2. Create a masking vector (m^(t))_{t=1}^{M} which is 1 for all t ≤ T and 0 for all t > T. This masking vector will allow us to ignore the predictions that the network makes on the padded input. (In our code, we will actually use the Boolean values True and False instead of 1 and 0 for computational efficiency.)

Of course, by extending the input and output by M - T tokens, we might change our loss and hence the gradient updates. In order to tackle this problem, we modify our loss using the masking vector:

    J = Σ_{t=1}^{M} m^(t) CE(y^(t), ŷ^(t)).

i. (3 points) (written) How would the loss and gradient updates change if we did not use masking? How does masking solve this problem?

Solution: Without masking, the loss would also include the cross-entropy of predicting the extra 0-labels, and the gradients from the padded input would flow through the hidden state and affect the learning of the parameters. By masking the loss, we zero out the loss due to these extra 0-labels (and hence their gradients), solving the problem.

ii. (5 points) (code) Implement pad_sequences in your code. You can test your implementation by running python q2_rnn.py test1.

(e) (12 points) (code) Implement the rest of the RNN model, assuming only fixed-length input, by appropriately completing functions in the RNNModel class. This will involve:

1. Implementing the add_placeholders, create_feed_dict, add_embedding and add_training_op functions.
2. Implementing the add_prediction_op operation that unrolls the RNN loop self.max_length times. Remember to reuse variables in your variable scope from the 2nd timestep onwards to share the RNN cell weights W_e and W_h across timesteps.
3. Implementing add_loss_op to handle the mask vector returned in the previous part.

You can test your implementation by running python q2_rnn.py test2.
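For intuition, the padding and masking steps amount to something like the following numpy sketch. The function names, the index-based (rather than one-hot) representation, and the choice of null token/label ids are assumptions; the assignment's pad_sequences and add_loss_op do this with TensorFlow tensors and boolean masks.

```python
import numpy as np

# Hedged sketch of padding plus a masked cross-entropy loss, for intuition only.
# The assignment's pad_sequences / add_loss_op use TensorFlow and may differ.
def pad_sequence(token_ids, label_ids, max_length, null_token=0, null_label=4):
    """Truncate/pad one sentence to max_length and build its boolean mask."""
    T = min(len(token_ids), max_length)
    mask = [True] * T + [False] * (max_length - T)
    tokens = token_ids[:T] + [null_token] * (max_length - T)
    labels = label_ids[:T] + [null_label] * (max_length - T)
    return tokens, labels, mask

def masked_ce(y, y_hat, mask):
    """y, y_hat: (M, C) one-hot gold labels and predictions; mask: (M,) booleans."""
    ce_per_step = -np.sum(y * np.log(y_hat), axis=1)   # CE(y^(t), ŷ^(t)) for each t
    return float(np.sum(ce_per_step * np.asarray(mask, dtype=float)))

tokens, labels, mask = pad_sequence([5, 9, 2], [1, 4, 4], max_length=5)
print(tokens, labels, mask)   # [5, 9, 2, 0, 0] [1, 4, 4, 4, 4] [True, True, True, False, False]
```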

(f) (3 points) (code) Train your model using the command python q2_rnn.py train. Training should take about 1-2 hours on your CPU and only minutes if you use a GPU. You should get a development F1 score of at least 85%.

The model and its output will be saved to results/rnn/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model's predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training. Finally, you can interact with your model using:

    python q2_rnn.py shell -m results/rnn/<timestamp>/

Deliverable: After your model has trained, copy the rnn_predictions.conll file from the appropriate results folder into your code directory so that it can be included in your submission.

(g) (6 points) (written)

i. (3 points) Describe at least 2 modeling limitations of this RNN model and support these conclusions using examples from your model's output.

ii. (3 points) For each limitation, suggest some way you could extend the model to overcome the limitation.

Solution: One limitation is that the model does not get to see into the future while making predictions; e.g. in "New York State University", the first token "New" is likely to be tagged LOC (as in "New York") even though the whole span is an organization. This problem can be solved with a bidirectional RNN. Another limitation is that the model doesn't enforce that adjacent tokens have the same tag (illustrated by the same example above). Introducing pair-wise agreement between adjacent tags (i.e. using a CRF loss) would solve this problem.

3. Grooving with GRUs (30 points)

In class, we learned that a gated recurrent unit (GRU) is an improved RNN cell that greatly reduces the problem of vanishing gradients. Recall that a GRU is described by the following equations:

    z^(t) = σ(x^(t) W_z + h^(t-1) U_z + b_z)
    r^(t) = σ(x^(t) W_r + h^(t-1) U_r + b_r)
    h̃^(t) = tanh(x^(t) W_h + (r^(t) ∘ h^(t-1)) U_h + b_h)
    h^(t) = z^(t) ∘ h^(t-1) + (1 - z^(t)) ∘ h̃^(t),

where z^(t) is considered to be an update gate and r^(t) is considered to be a reset gate. (Colah's blog post on understanding LSTMs, colah.github.io/posts/.../understanding-lstms/, provides a colorful picture of LSTMs and, to an extent, GRUs.) Also, to keep the notation consistent with the GRU, for this problem let the basic RNN cell be described by the equation:

    h^(t) = σ(x^(t) W_h + h^(t-1) U_h + b_h).

To gain some intuition, let's explore the behavior of the basic RNN cell and the GRU on some generated 1-D sequences.

(a) (4 points) (written) Modeling latching behavior. Let's say we are given input sequences starting with a 1 or a 0, followed by n 0s, e.g. 0, 1, 00, 10, 000, 100, etc. We would like our state h to continue to remember what the first character was, irrespective of how many 0s follow.
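Before working through the hand analysis below, here is a hedged numpy sketch of one GRU step that follows the equations above; it may also be a useful reference when implementing q3_gru_cell.py in part (c). The parameter shapes and random values are illustrative assumptions, and the assignment's actual cell is written in TensorFlow.

```python
import numpy as np

# Illustrative numpy sketch of one GRU step following the equations above.
# Shapes/values are made up; the assignment implements the cell in TensorFlow.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step. p holds the parameters W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h."""
    z = sigmoid(x @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])              # update gate
    r = sigmoid(x @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])              # reset gate
    h_tilde = np.tanh(x @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                               # new state h^(t)

D, H = 4, 3
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((D, H)) for name in ["W_z", "W_r", "W_h"]}
params.update({name: rng.standard_normal((H, H)) for name in ["U_z", "U_r", "U_h"]})
params.update({name: np.zeros(H) for name in ["b_z", "b_r", "b_h"]})
h = np.zeros(H)
for x in rng.standard_normal((5, D)):      # a length-5 toy input sequence
    h = gru_step(x, h, params)
print(h)
```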

This scenario can also be described as wanting the neural network to learn the following simple automaton: two states, h = 0 and h = 1; from h = 0, input x = 0 stays at h = 0 and input x = 1 moves to h = 1, while from h = 1 both inputs x = 0, 1 stay at h = 1.

In other words, when the network sees a 1, it should change its state to also be a 1 and stay there. In the following questions, assume that the state is initialized at 0 (i.e. h^(0) = 0), and that all the parameters are scalars. Further, assume that all sigmoid activations and tanh activations are replaced by the indicator function:

    σ(x) → 1 if x > 0, 0 otherwise;    tanh(x) → 1 if x > 0, 0 otherwise.

i. (1 point) Identify values of W_h, U_h and b_h for an RNN cell that would allow it to replicate the behavior described by the automaton above.

Solution: (Since this is only worth 1 point, students do not have to give a detailed description of how they obtained the answer.) For the RNN cell h^(t) = σ(x^(t) W_h + h^(t-1) U_h + b_h), we consider the following four cases:

- h^(t-1) = 0, x^(t) = 0: since we want h^(t) = 0, we need σ(b_h) = 0, i.e. b_h ≤ 0.
- h^(t-1) = 0, x^(t) = 1: since we want h^(t) = 1, we need σ(W_h + b_h) = 1, i.e. W_h + b_h > 0.
- h^(t-1) = 1, x^(t) = 0: since we want h^(t) = 1, we need σ(U_h + b_h) = 1, i.e. U_h + b_h > 0.
- h^(t-1) = 1, x^(t) = 1: since we want h^(t) = 1, we need σ(U_h + W_h + b_h) = 1, i.e. U_h + W_h + b_h > 0 (which already follows from the previous conditions).

Therefore, any W_h, U_h, b_h satisfying b_h ≤ 0, W_h + b_h > 0 and U_h + b_h > 0 will work.
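For instance, W_h = U_h = 1 and b_h = 0 satisfy all of these inequalities. The following short script is a sanity check of that choice (it is not something the assignment asks for); it enumerates the four cases under the indicator activations.

```python
# Sanity check (not required by the assignment): verify that W_h = U_h = 1, b_h = 0
# make the indicator-activation RNN latch onto the first 1 it sees.
def indicator(v):           # stands in for both sigma and tanh in this question
    return 1 if v > 0 else 0

W_h, U_h, b_h = 1, 1, 0

def rnn_step(x, h_prev):
    return indicator(x * W_h + h_prev * U_h + b_h)

for h_prev in (0, 1):
    for x in (0, 1):
        expected = 1 if (h_prev == 1 or x == 1) else 0   # the automaton's behavior
        assert rnn_step(x, h_prev) == expected
print("latching behavior reproduced")
```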

ii. (3 points) Let W_r = U_r = b_r = b_z = b_h = 0 (remember that all of them are scalars in this subquestion). Identify values of W_z, U_z, W_h and U_h for a GRU cell that would allow it to replicate the behavior described by the automaton above.

Solution: Since W_r = U_r = b_r = b_z = b_h = 0, the GRU cell can be simplified to:

    z^(t) = σ(x^(t) W_z + h^(t-1) U_z)
    r^(t) = 0
    h̃^(t) = tanh(x^(t) W_h)
    h^(t) = z^(t) h^(t-1) + (1 - z^(t)) h̃^(t)

We consider the following four cases:

- h^(t-1) = 0, x^(t) = 0: z^(t) = σ(0) = 0 and h̃^(t) = tanh(0) = 0, so h^(t) = 0, which is what we want, with no constraint on the parameters.
- h^(t-1) = 0, x^(t) = 1: z^(t) = σ(W_z), h̃^(t) = tanh(W_h), and h^(t) = (1 - σ(W_z)) tanh(W_h). Since we want h^(t) = 1, we need W_z ≤ 0 and W_h > 0.
- h^(t-1) = 1, x^(t) = 0: z^(t) = σ(U_z), h̃^(t) = tanh(0) = 0, and h^(t) = z^(t) h^(t-1) = σ(U_z). Since we want h^(t) = 1, we need U_z > 0.
- h^(t-1) = 1, x^(t) = 1: z^(t) = σ(W_z + U_z), h̃^(t) = tanh(W_h), and h^(t) = σ(W_z + U_z) + (1 - σ(W_z + U_z)) tanh(W_h). Given W_h > 0, this equals 1 whether the update gate is 0 or 1, so no additional constraint is needed.

Therefore, the conditions W_z, U_z, W_h and U_h should satisfy are: U_z > 0, W_z ≤ 0, W_h > 0, and U_h may be any real number (it is multiplied by the reset gate, which is 0).

(b) (6 points) (written) Modeling toggling behavior. Now, let us try modeling a more interesting behavior. We are now given an arbitrary input sequence, and must produce an output sequence that switches from 0 to 1 and vice versa whenever it sees a 1 in the input. For example, the input sequence 0 1 0 1 0 0 would produce the output sequence 0 1 1 0 0 0. This behavior could be described by the following automaton: two states, h = 0 and h = 1; input x = 0 keeps the state unchanged, while input x = 1 moves from h = 0 to h = 1 and from h = 1 back to h = 0.

Once again, assume that the state is initialized at 0 (i.e. h^(0) = 0), that all the parameters are scalars, and that all sigmoid and tanh activations are replaced by the indicator function.

i. (3 points) Show that a 1-D RNN cannot replicate the behavior described by the automaton above.

Solution: First of all, we need the state to maintain its value when the input is 0:

    σ(0·W_h + 0·U_h + b_h) = 0  ⟹  b_h ≤ 0
    σ(0·W_h + 1·U_h + b_h) = 1  ⟹  U_h + b_h > 0.

Next, the state needs to flip from 0 to 1 when seeing an input of 1, and flip back from 1 to 0 when seeing a 1 again. In equations, this requires that:

    σ(1·W_h + 0·U_h + b_h) = 1  ⟹  W_h + b_h > 0
    σ(1·W_h + 1·U_h + b_h) = 0  ⟹  W_h + U_h + b_h ≤ 0.

From W_h + b_h > 0 and b_h ≤ 0 we must have W_h > 0, but from W_h + U_h + b_h ≤ 0 and U_h + b_h > 0 we must have W_h < 0, which is a contradiction. Hence no choice of scalar parameters replicates the toggling behavior.

ii. (3 points) Let W_r = U_r = b_z = b_h = 0. Identify values of b_r, W_z, U_z, W_h and U_h for a GRU cell that would allow it to replicate the behavior described by the automaton above.

Solution: First, set b_r = 1 so that the reset gate is always open (r^(t) = 1). With the remaining parameters constrained to 0, the cell becomes

    z^(t) = σ(x^(t) W_z + h^(t-1) U_z)
    h̃^(t) = tanh(x^(t) W_h + h^(t-1) U_h)
    h^(t) = z^(t) h^(t-1) + (1 - z^(t)) h̃^(t).

We exploit the update gate to copy the previous state forward when (h^(t-1), x^(t)) = (1, 0), and choose the candidate state h̃^(t) to be 1 when (h^(t-1), x^(t)) = (0, 1) and 0 when (h^(t-1), x^(t)) = (1, 1). One choice that satisfies all four cases is b_r = 1, W_z = -1, U_z = 1, W_h = 1, U_h = -2:

- (h^(t-1), x^(t)) = (0, 0): z = σ(0) = 0 and h̃ = tanh(0) = 0, so h^(t) = 0.
- (h^(t-1), x^(t)) = (0, 1): z = σ(-1) = 0 and h̃ = tanh(1) = 1, so h^(t) = 1.
- (h^(t-1), x^(t)) = (1, 0): z = σ(1) = 1, so h^(t) copies the previous state, h^(t) = 1.
- (h^(t-1), x^(t)) = (1, 1): z = σ(-1 + 1) = 0 and h̃ = tanh(1 - 2) = 0, so h^(t) = 0.

(A short numerical check of these values appears after part (d) below.)

(c) (6 points) (code) Implement the GRU cell described above in q3_gru_cell.py. You can test your implementation by running python q3_gru_cell.py test.

(d) (6 points) (code) We will now use an RNN model to try and learn the latching behavior described in part (a) using TensorFlow's RNN implementation, tf.nn.dynamic_rnn.

i. In q3_gru.py, implement add_prediction_op by applying TensorFlow's dynamic RNN model on the sequence input provided. Also apply a sigmoid function on the final state to normalize the state values between 0 and 1, and implement add_loss_op.

ii. Next, write code to calculate the gradient norm and implement gradient clipping in add_training_op.

iii. Run the program python q3_gru.py predict -c [rnn|gru] [-g] to generate a learning curve for this task for the RNN and GRU models. The -g flag activates gradient clipping. These commands produce plots of the learning dynamics in q3-noclip-<model>.png and q3-clip-<model>.png respectively.

Deliverable: Attach the plots of learning dynamics generated for the GRU and RNN, with and without gradient clipping (4 plots in total), to your write-up.
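As promised above, here is a short sanity check, not required by the assignment, that the parameter values given for part (b)ii reproduce the toggling automaton under the indicator activations:

```python
# Sanity check (not required by the assignment): the part (b)ii values
# b_r = 1, W_z = -1, U_z = 1, W_h = 1, U_h = -2 (with W_r = U_r = b_z = b_h = 0)
# reproduce the toggling automaton under the indicator activations.
def indicator(v):            # stands in for both sigma and tanh in this question
    return 1 if v > 0 else 0

W_r = U_r = b_z = b_h = 0
b_r, W_z, U_z, W_h, U_h = 1, -1, 1, 1, -2

def gru_step(x, h_prev):
    z = indicator(x * W_z + h_prev * U_z + b_z)
    r = indicator(x * W_r + h_prev * U_r + b_r)
    h_tilde = indicator(x * W_h + r * h_prev * U_h + b_h)
    return z * h_prev + (1 - z) * h_tilde

for h_prev in (0, 1):
    for x in (0, 1):
        expected = h_prev if x == 0 else 1 - h_prev   # stay on 0, flip on 1
        assert gru_step(x, h_prev) == expected
print("toggling behavior reproduced")
```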

(e) (5 points) (written) Analyze the graphs obtained above and describe the learning dynamics you see. Make sure you address the following questions:

i. Does either model experience vanishing or exploding gradients? If so, does gradient clipping help?
ii. Which model does better? Can you explain why?

Important: There is some variance in the graphs you get based on the order of initializing variables. It might be the case that you do not see any effect from clipping gradients; that is ok, report your graphs and what you observe. For reference, the order in which we initialize variables for the GRU is W_r, U_r, b_r, W_z, U_z, b_z, W_o, U_o, b_o, and for the RNN cell it is W_h, W_x, b.

Solution: The plots for this question vary a bit based on initialization, and we are evaluating students' descriptions of what they see. Some things that we are looking for: (a) gradients should eventually decrease (or something is definitely fishy); (b) the RNN's gradients vanish quickly, so clipping doesn't really help; (c) for the GRU, gradients are maintained and it can improve on the loss; that said, clipping can prevent these gradients from improving the loss.
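One common way to implement the clipping asked for in part (d)ii is global-norm clipping: if the joint norm of all gradients exceeds a threshold, every gradient is rescaled by the same factor (TensorFlow provides this operation as tf.clip_by_global_norm). The numpy sketch below illustrates the idea; treating it as the assignment's exact scheme is an assumption.

```python
import numpy as np

# Hedged numpy sketch of global-norm gradient clipping; whether this matches
# the starter code's exact scheme is an assumption.
def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one factor if their joint norm is too large."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, [np.linalg.norm(g) for g in clipped])
```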

(f) (3 points) (code) Run the NER model from question 2 using the GRU cell, using the command python q2_rnn.py train -c gru. Training should take about 3-4 hours on your CPU and about 30 minutes on a GPU. (Yes, several hours is a long time, but you are learning to become a Deep Learning researcher, so you need to be able to manage several-hour experiments!) You should get a development F1 score of at least 85%.

The model and its output will be saved to results/gru/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model's predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training. Finally, you can interact with your model using:

    python q2_rnn.py shell -m results/gru/<timestamp>/

Deliverable: After your model has trained, copy the gru_predictions.conll file from the appropriate results folder into your code directory so that it can be included in your submission.
