Contents

(75pts) COS495 Midterm
    (15pts) Short answers
    (5pts) Unequal loss
    (15pts) About LSTMs
    (25pts) Modular implementation of neural networks
    (15pts) Design an RNN

(75pts) COS495 Midterm

Your name:

(15pts) Short answers

(2pts) What is the object being embedded (i.e. a vector representing this object is computed) when one uses

the pair-pattern matrix? pairs of words / the semantic relations between them

the word-context matrix? words

(2pts) Bigrams are consecutive tokens. Let x_1 = "dog bites man more" and x_2 = "man bites dog less". Write down the dense bigram indicator feature vectors for this dataset and label the dimensions.

        (dog,bites)  (bites,man)  (man,more)  (man,bites)  (bites,dog)  (dog,less)
x_1          1            1           1            0            0           0
x_2          0            0           0            1            1           1

(3pts) When faced with a big softmax over the vocabulary, training can be slow. Describe and explain the hierarchical softmax used in word2vec for speeding up training. Identify its parameters and explain the savings.

For a given word $w$ and vector $v$ we wish to approximate the softmax

$$P(w \mid v) = \frac{\exp(v_w^T v)}{\sum_{w'} \exp(v_{w'}^T v)}$$

without computing a sum of $V$ inner products in the denominator (which has complexity $O(V)$), where $V$ is the number of words in the vocabulary. Instead we construct a Huffman tree of depth $\log V$ with one word (and its associated embedding) assigned to each leaf node and a parameter vector at every other node. The probability is then approximated as

$$P(w \mid v) \approx \prod_{n \in P(w)} \frac{1}{1 + \exp\left((-1)\, L_{w,n}\, v_n^T v\right)}$$

where $P(w)$ is the path from the root node to $w$ and $L_{w,n}$ is $1$ if $w$ is $n$ or a descendant of the left child of $n$ and $-1$ otherwise. This can be computed in $O(\log V)$ time.
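For concreteness, here is a minimal NumPy sketch of the per-word probability computed along a root-to-leaf path (the list-of-pairs path representation and the function names are assumptions of this sketch, not part of the exam):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(path, v):
    """Approximate P(word | v) along the word's root-to-leaf path.

    `path` is a list of (v_n, L) pairs for the internal nodes on the path:
    v_n is the node's parameter vector and L is +1 if the word lies under
    the left child of n, -1 otherwise. Cost is O(depth) = O(log V).
    """
    p = 1.0
    for v_n, L in path:
        p *= sigmoid(L * np.dot(v_n, v))  # 1 / (1 + exp(-L v_n^T v))
    return p

# toy example: a 2-node path, 5-dimensional embeddings
rng = np.random.default_rng(0)
v = rng.normal(size=5)
path = [(rng.normal(size=5), +1), (rng.normal(size=5), -1)]
print(hierarchical_softmax_prob(path, v))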

GloVe has the objective

$$\sum_{ij} f(X_{ij}) \left( w_i^T w_j + b_i + b_j - \log X_{ij} \right)^2$$

(2pt) What do the indices $i, j$ represent and what is $X_{ij}$?

$i$ and $j$ represent words, and $X_{ij}$ is the number of times they co-occur within a fixed-size window.

(3pt) What is an issue with the alternative objective $\sum_{ij} \left( w_i^T w_j + b_i + b_j - \log X_{ij} \right)^2$?

Since the sum is taken over all word pairs, this expression is infinite/undefined if there exist pairs of words that never co-occur (which is usually the case). This objective also gives the same weight to words that co-occur rarely as to words that co-occur frequently, even though the confidence we have in the latter is much higher.

(3pts) Let $z = Wa$, $W \in \mathbb{R}^{m \times n}$, and consider the backward pass. Suppose we have i.i.d. samples $\frac{dL}{dz_i} \sim N(0, 1)$ and $W_{ij} \sim N(0, \sigma^2)$. We would like $\mathrm{Var}\!\left[\frac{dL}{da_i}\right] = 1$ by setting $\sigma$ appropriately. Compute $\sigma$ and show your steps.

$$\frac{dL}{da_i} = W_{:,i}^T \frac{dL}{dz} = \sum_j W_{ji} \frac{dL}{dz_j}$$

Setting the desired variance to 1 and using the fact that the variance of a product is the product of variances (for mean-zero independent random variables), we have

$$1 = \mathrm{Var}\!\left[\frac{dL}{da_i}\right] = \sum_j \mathrm{Var}[W_{ji}] \, \mathrm{Var}\!\left[\frac{dL}{dz_j}\right] = m \sigma^2 \implies \sigma = \frac{1}{\sqrt{m}}$$
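A quick NumPy sanity check of this choice of $\sigma$ (a sketch with arbitrary toy dimensions):

import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 256, 128, 2000

sigma = 1.0 / np.sqrt(m)                 # the value derived above
samples = []
for _ in range(trials):
    W = rng.normal(0.0, sigma, size=(m, n))
    dz = rng.normal(0.0, 1.0, size=m)    # dL/dz ~ N(0, 1)
    da = W.T @ dz                        # dL/da = W^T dL/dz
    samples.append(da[0])                # track one coordinate, dL/da_0
print(np.var(samples))                   # should be close to 1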

(5pts) Unequal loss

Not all mistakes are created equal. We define $\mathrm{cost}(y, y')$ as the cost when the label is $y$ and the prediction is $y'$. Then the hinge loss is

$$\max\left(0, \; \max_{\tilde y} \left[\mathrm{cost}(y, \tilde y) + s_{\tilde y}\right] - s_y\right).$$

* (3pts) Show that this hinge loss is an upper bound of $\mathrm{cost}(y, y')$.

$y'$ being the prediction means that $y' = \arg\max_{\tilde y} s_{\tilde y}$, and in particular $s_{y'} \geq s_y$. Then

$$\max_{\tilde y} \left[\mathrm{cost}(y, \tilde y) + s_{\tilde y}\right] - s_y \geq \mathrm{cost}(y, y') + s_{y'} - s_y \geq \mathrm{cost}(y, y').$$

Noting $\max(0, x) \geq x$ gives the result.

* (2pts) Give the necessary condition on the $s_i$ for this upper bound to be tight, and an example when it is loose.

For the bound to be tight, the prediction must have the same score as the target, $s_{y'} = s_y$, and everything else must clear the margin: for all $\tilde y$,

$$\mathrm{cost}(y, \tilde y) + s_{\tilde y} \leq \mathrm{cost}(y, y') + s_{y'}.$$

The bound is loose whenever there is a clearly wrong prediction, $s_{y'} > s_y$, or whenever there is a $\hat y$ for which $\mathrm{cost}(y, \hat y) + s_{\hat y} > \mathrm{cost}(y, y') + s_{y'}$.

(15pts) About LSTMs

Here are the LSTM equations and the corresponding illustration.

[Figure: LSTM chain and gate notation; illustration from Chris Olah]

$$\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\tilde c_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde c_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}$$

(5pt) Label each of $f_t, i_t, o_t, \tilde c_t, c_t, c_{t-1}$ on the illustration. Make sure there is no ambiguity in your labeling.

Suppose we have a one-layer LSTM primitive $[h_1, \ldots, h_T] = \mathrm{LSTM}([x_1, \ldots, x_T]) \in \mathbb{R}^{h \times T}$.

(2pts) Represent a two-layer LSTM using LSTM.

$$[h_1, \ldots, h_T] = \mathrm{LSTM}(\mathrm{LSTM}([x_1, \ldots, x_T]))$$
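Not part of the exam, but for concreteness, a minimal NumPy sketch of what the one-layer LSTM primitive might look like when it follows the equations above (the explicit weight shapes and the function signature are assumptions of this sketch):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm(X, Wf, bf, Wi, bi, Wo, bo, Wc, bc):
    """X is a list of input vectors x_1..x_T; each W* has shape (h, h + d)."""
    h_dim = Wf.shape[0]
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    hs = []
    for x in X:
        z = np.concatenate([h, x])       # [h_{t-1}, x_t]
        f = sigmoid(Wf @ z + bf)         # forget gate
        i = sigmoid(Wi @ z + bi)         # input gate
        o = sigmoid(Wo @ z + bo)         # output gate
        c_tilde = np.tanh(Wc @ z + bc)   # candidate cell state
        c = f * c + i * c_tilde          # cell state
        h = o * np.tanh(c)               # hidden state
        hs.append(h)
    return hs

# a two-layer LSTM, as in the answer above, is just the primitive applied
# to its own outputs: lstm(lstm(X, *params1), *params2)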

(3pts) Represent a bidirectional LSTM and use its hidden states to perform a sequence tagging task, where the tag at each timestep depends only on the forward and backward hidden states at that time step. Clearly define your operations.

A BiLSTM can be expressed as

$$[h_1, \ldots, h_T] = \mathrm{Concat}\left(\mathrm{LSTM}([x_1, \ldots, x_T]), \; \mathrm{LSTM}([x_T, \ldots, x_1])\right)$$

where

$$\mathrm{Concat}([a_1, \ldots, a_T], [b_1, \ldots, b_T]) = \left[\begin{pmatrix} a_1 \\ b_1 \end{pmatrix}, \ldots, \begin{pmatrix} a_T \\ b_T \end{pmatrix}\right].$$

We can then train the sequence tagger with a softmax loss at each time step:

$$L([x_1, \ldots, x_T], y_t) = -w_{y_t}^T h_t + \log \sum_{y'} \exp\left(w_{y'}^T h_t\right).$$

The LSTM is supposed to alleviate the vanishing and exploding gradient problem. Let us consider a loss function $L(c_T)$ that depends only on the final step $T$.

(5pts) Express $\frac{dL}{dc_t}$ in terms of $\frac{dL}{dc_{t+1}}$, $\frac{dL}{dh_t}$, and non-recursive terms.

$$\frac{dL}{dc_t} = \frac{\partial c_{t+1}}{\partial c_t} \frac{dL}{dc_{t+1}} + \frac{\partial h_t}{\partial c_t} \frac{dL}{dh_t} = f_{t+1} \circ \frac{dL}{dc_{t+1}} + o_t \circ \tanh'(c_t) \circ \frac{dL}{dh_t}$$

(5pts) By analyzing your expression for $\frac{dL}{dc_1}$, explain why the long-term dependency does not vanish when the forget gate is $f_t = 1$.

Assuming the forget gate is 1, $\frac{dL}{dc_1} = \frac{dL}{dc_T} + \ldots$, so rather than vanishing, the gradient due to the initial state is around the same magnitude as that due to the final state, and the long-term dependency will be reflected in the loss function.
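A tiny numeric illustration of this argument (a sketch that tracks only the $\prod_t f_t$ factor along the cell-state path, i.e. holding the gates fixed and ignoring the contributions through $\frac{dL}{dh_t}$): with $f_t < 1$ the factor decays geometrically in the sequence length, while with $f_t = 1$ it stays exactly 1.

import numpy as np

T = 100
for f in (0.9, 1.0):
    # gradient factor carried from c_T back to c_1 along the cell-state path:
    # dc_T/dc_1 = prod_{t=2..T} f_t  (gates held fixed)
    factor = np.prod(np.full(T - 1, f))
    print(f"f = {f}: dc_T/dc_1 = {factor:.3e}")
# f = 0.9: about 3e-05 (vanishes); f = 1.0: exactly 1 (preserved)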

(25pts) Modular implementation of neural networks

Modules make it easy to construct neural networks with various structures. Each module supports the forward and backward functions, might maintain internal state, and might be composed of other modules. Linear and Sequential are two examples.

class Linear(Module):
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, input):
        return np.dot(self.W, input) + self.b

    def backward(self, gradout):
        return np.dot(self.W.T, gradout)

class Sequential(Module):
    def __init__(self, children):
        self.children = children

    def forward(self, input):
        for child in self.children:
            input = child.forward(input)
        return input

    def backward(self, gradout):
        for child in reversed(self.children):
            gradout = child.backward(gradout)
        return gradout

(3pt) While Linear is computing the right outputs, it is missing a step that would enable learning in Linear itself. What step is missing? What else needs to be stored in order to perform this step?

A step in backward that computes and stores the gradients needed for its own parameter updates ($dL/dW$ and $dL/db$) is missing. To perform it, the input received in forward must also be stored, since $dL/dW = \texttt{gradout} \cdot \texttt{input}^T$. (A concrete sketch appears after the Dropout module below.)

(5pt) Implement the Sigmoid module, which computes the element-wise function $\sigma(x) = \frac{1}{1 + \exp(-x)}$.

# pseudo code or python code are both acceptable
# clarity and precision is required, being able to run is not required
class Sigmoid(Module):
    def forward(self, input):
        self.output = 1. / (1. + np.exp(-input))
        return self.output

    def backward(self, gradout):
        return self.output * (1. - self.output) * gradout

(5pt) Implement the Dropout module (space below)

class Dropout(Module):
    def __init__(self, p):
        self.p = p  # with prob. p, the input stays

    def forward(self, input):
        self.keep = (np.random.rand(*input.shape) < self.p).astype(input.dtype)
        return self.keep * input

    def backward(self, gradout):
        return self.keep * gradout
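Returning to the (3pt) question above: one possible way (a sketch, not the official solution) to add the missing step to Linear is to cache the input in forward and accumulate the parameter gradients in backward; the Module stub is only there to make the sketch self-contained.

import numpy as np

class Module:  # minimal stand-in for the base class used in this section
    pass

class LinearWithGrads(Module):
    def __init__(self, W, b):
        self.W, self.b = W, b
        self.gradW, self.gradb = np.zeros_like(W), np.zeros_like(b)

    def forward(self, input):
        self.input = input                           # stored for the backward pass
        return np.dot(self.W, input) + self.b

    def backward(self, gradout):
        # assumes 1-D input and gradout vectors
        self.gradW += np.outer(gradout, self.input)  # dL/dW = gradout input^T
        self.gradb += gradout                        # dL/db = gradout
        return np.dot(self.W.T, gradout)             # dL/dinput, as before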

(7pt) Implement EMul, which takes some children modules, gives all children the same inputs, and outputs the elementwise product of the outputs of all children.

class EMul(Module):
    def __init__(self, children):
        self.children = children

    def forward(self, input):
        self.factors = [child.forward(input) for child in self.children]
        self.output = np.copy(self.factors[0])
        for factor in self.factors[1:]:
            self.output *= factor
        return self.output

    def backward(self, gradout):
        return sum(child.backward(gradout * self.output / factor)
                   for child, factor in zip(self.children, self.factors))
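The backward rule above can be justified with the product rule. Writing $\odot$ for the elementwise product and $a_k$ for the output of child $k$ (and assuming no $a_k$ is exactly zero; otherwise the product over $j \neq k$ should be formed directly):

$$z = a_1 \odot \cdots \odot a_K, \qquad \frac{\partial L}{\partial a_k} = \frac{\partial L}{\partial z} \odot \prod_{j \neq k} a_j = \texttt{gradout} \odot \frac{z}{a_k},$$

so the gradient with respect to the shared input is

$$\frac{\partial L}{\partial x} = \sum_{k=1}^{K} \mathrm{child}_k.\mathrm{backward}\!\left(\texttt{gradout} \odot \frac{z}{a_k}\right).$$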

(5pt) Use existing modules to implement the following network function by writing a single statement invoking constructors of Sequential, EMul, Sigmoid, and Linear, given weights W1, W2, W3:

$$a_1 = \sigma(W_1 x) \qquad a_2 = \sigma(W_2 x) \qquad z = a_1 \circ a_2 \qquad y = W_3 z$$

Network = Sequential([
    EMul([Sequential([Linear(W1, 0), Sigmoid()]),
          Sequential([Linear(W2, 0), Sigmoid()])]),
    Linear(W3, 0),
])

(15pts) Design an RNN

We would like an RNN to recognize strings of well-balanced parentheses. Well-balanced parentheses are strings such as (()) or (())() where

* the number of ( is the same as the number of ) in the whole string;
* any prefix string does not contain more ) than (.

Let $x = x_1, x_2, \ldots, x_T$ where $x_i \in \mathbb{R}^{2 \times 1}$, with $[0, 1]^T$ for ) and $[1, 0]^T$ for (. The RNN should perform the following updates:

$$h_{i+1} = \mathrm{relu}(W h_i + U x_{i+1})$$

In addition, we start with $h_0$ (dimension 3 with initial value 0 is recommended, but you are free to use anything you like). At the end, a prediction $y = V h_T$ is made, where $y \leq 0$ means balanced and $y > 0$ otherwise.

(7pts) Clearly state your strategy for the RNN.

Store the number of times each character has been seen in the first two dimensions. At each step compute the difference between the two counts in both orders and store them in the next two dimensions (#( - #) and #) - #(). Make the fifth dimension positive if and only if some prefix string has ever had more ) than ( (i.e. whether the fourth dimension has ever been positive). Then the string is not well-balanced exactly when any of the last three dimensions of $h_T$ is positive.

(8pts) Specify explicit values of W, U, V to achieve it.

A solution with 5 states, following the strategy above:

$$h_0 = \mathbf{0} \in \mathbb{R}^5, \quad
W = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \end{pmatrix}, \quad
U = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \\ -1 & 1 \\ 0 & 0 \end{pmatrix}, \quad
V = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \end{pmatrix}$$

Solution with 3 states:

$$h_0 = \mathbf{0} \in \mathbb{R}^3, \quad
W = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1 & 1 & B \end{pmatrix}, \quad
U = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{pmatrix}, \quad
V = \begin{pmatrix} 1 & -1 & B \end{pmatrix}$$

Here the first two dimensions count ( and ), and $h_3$ stores whether the sequence has failed for having too many ); once positive it remains positive ($B$ needs to be big enough, e.g. $B = 2$). If the sequence did not fail for too many )s, $V h_T > 0$ exactly when there are more ( than ).
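As a quick sanity check of the 3-state construction (a sketch; the helper simply iterates the update $h_{i+1} = \mathrm{relu}(W h_i + U x_{i+1})$ from the problem statement on a few example strings):

import numpy as np

B = 2.0
W = np.array([[ 1., 0., 0.],
              [ 0., 1., 0.],
              [-1., 1., B ]])
U = np.array([[ 1., 0.],
              [ 0., 1.],
              [-1., 1.]])
V = np.array([1., -1., B])

def rnn_score(s):
    # encoding from the problem: '(' -> [1, 0], ')' -> [0, 1]
    h = np.zeros(3)
    for ch in s:
        x = np.array([1., 0.]) if ch == '(' else np.array([0., 1.])
        h = np.maximum(0.0, W @ h + U @ x)   # relu(W h + U x)
    return V @ h                             # y <= 0 means balanced

for s in ["(())()", "()", ")(", "((", "())("]:
    print(s, rnn_score(s) <= 0)
# expected: True, True, False, False, False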