Homework 3 COMS 4705 Fall 2017 Prof. Kathleen McKeown


The assignment consists of a programming part and a written part. For the programming part, make sure you have set up the development environment as described in the Prerequisites file. Then download the source code and extract it in the same directory as the embedding file (as mentioned in the Prerequisites). The rest of the details for the programming assignment can be found inside the hw3.ipynb file.

The written portion of the assignment discusses a neural network model that produces word embeddings. To understand word embeddings we suggest the following readings: the class slides and textbook chapter 10. The appendix of this document also gives a high-level overview. You may notice different resources using different notation; the concepts remain the same. For this assignment we use the notation discussed in the section "Skip-gram Neural Network architecture" below. We use upper-case letters to denote matrices and lower-case letters to denote row/column vectors.

1 Skip-gram models

The skip-gram neural network model produces word embeddings by reducing the task of the neural network to predicting the context words for a given input word. Consider the sentence "The quick brown fox jumped over the lazy dog." For training data we define a window, let's say 2, and produce pairs of words by looking within the window on both sides of the current word. Given the input word, we would like the probability of the context words to be higher than that of the rest of the words in the vocabulary. For every sample pair noted in the diagram above, the input is the first word in the sample and the output label is the second word in the sample pair. The input and output words are each converted to a one-hot vector of size vocab_size x 1. A small sketch of how such pairs can be generated is shown below.
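The following minimal Python sketch (illustrative only, not part of the graded code; the whitespace tokenizer and the window size of 2 are assumptions) shows how (input word, context word) training pairs could be produced from a sentence:

# Minimal sketch of skip-gram training-pair generation (illustrative only).
# Assumes a simple whitespace tokenizer and a window of 2 words on each side.
def make_skipgram_pairs(sentence, window=2):
    tokens = sentence.lower().split()
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))   # (input word, context word)
    return pairs

print(make_skipgram_pairs("The quick brown fox jumped over the lazy dog"))
# -> [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]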

Now that we have our training data, let's look at the neural network architecture. It will be a simple feed-forward network with one hidden layer. The hidden-layer weights at the end of training will become our word embeddings!

1.1 Skip-gram Neural Network architecture

1. Input layer - The training samples are broken down into smaller batches, each with batch_size samples. We pass one batch of samples to the neural network per iteration; once we complete the forward propagation and backward propagation, we pass in the next batch. Each one-hot vector is 1 x vocab_size. We stack the input one-hot vectors of all the samples and feed the network the transpose of the horizontally stacked one-hot vectors. The dimensions of the input A_in to our neural network will therefore be vocab_size x batch_size.

2. Fully connected feed-forward layer - The feed-forward (or hidden) layer has hd_1 nodes. The layer applies an affine transformation and an activation to its input values. A_in is the input to the layer and A_1 is the output of the layer. W_1 and b_1 are the weights associated with the hidden layer.

Z_1 = W_1 A_in + b_1
A_1 = f_1(Z_1)

3. Output layer - The output layer has vocab_size nodes. We apply a final affine transformation and a softmax (f_out) activation to obtain the probabilities over the vocabulary. For each sample, we calculate the loss between the predicted vocabulary vector and the label vector.

Z_out = W_out A_1 + b_out
A_out = f_out(Z_out)
Loss = crossentropyloss(y, A_out)

We predict the label by taking the index of the highest probability for each sample:

predicted_label[i] = argmax(A_out.T[i])
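To make the forward pass concrete, here is a minimal NumPy sketch of the computation above. The shapes follow the conventions of this section, but the toy sizes, the random initialization, and the softmax/cross-entropy helpers are assumptions for illustration only, not the code you are asked to submit.

import numpy as np

# Toy sizes for illustration only (the real assignment uses vocab_size = 5,000, etc.).
vocab_size, batch_size, hd_1 = 10, 4, 3

rng = np.random.default_rng(0)

# Stacked one-hot inputs: one column per sample, shape vocab_size x batch_size.
A_in = np.zeros((vocab_size, batch_size))
A_in[rng.integers(vocab_size, size=batch_size), np.arange(batch_size)] = 1.0

W_1   = rng.standard_normal((hd_1, vocab_size))   # hidden-layer weights
b_1   = np.zeros((hd_1, 1))
W_out = rng.standard_normal((vocab_size, hd_1))   # output-layer weights
b_out = np.zeros((vocab_size, 1))

def relu(x):
    return np.maximum(x, 0)

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

Z_1   = W_1 @ A_in + b_1                           # affine transformation, hidden layer
A_1   = relu(Z_1)                                  # activation f_1
Z_out = W_out @ A_1 + b_out                        # affine transformation, output layer
A_out = softmax(Z_out)                             # f_out: per-sample probabilities

# Cross-entropy loss against hypothetical context-word indices y.
y = rng.integers(vocab_size, size=batch_size)
loss = -np.log(A_out[y, np.arange(batch_size)]).mean()

predicted_label = np.argmax(A_out.T, axis=1)       # index of highest probability per sample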

2 Questions

Weights are usually normalized to values in [-1, 1]. For ease of calculation we will be using integers in this assignment.

Question 1. You are given the following values: vocab_size = 5,000, batch_size = 100, hd_1 = 18. Write the dimensions of A_in, W_1, b_1, A_1, W_out, b_out, A_out. (7 points)

Question 2. Forward propagation - You are given integer-valued matrices W_1, b_1, W_out, b_out, A_in and the activation function f_1 defined below. For ease of calculation we will assume f_out is the same as f_1. Calculate Z_1, A_1, Z_out and A_out. (4+2+4+2 points)

f_1(x) = f_out(x) = relu(x) = x if x > 0, 0 if x <= 0

Question 3. Backward propagation - You are given the gradient of the loss with respect to the output, ∂Loss/∂A_out (a small integer matrix). Use W_1, b_1, Z_1, W_out, b_out from Question 2 to solve the parts below.

a. Write the formulas for calculating ∂Loss/∂W_out, ∂Loss/∂b_out and ∂Loss/∂A_out. (2 points each)

b. Calculate the gradients of the Loss with respect to W_1, b_1, W_out, b_out, given the derivative of relu. (5+4+5+4 points)

relu_der(x) = 1 if x > 0, 0 if x <= 0

c. Update W_1, b_1, W_out, b_out using the gradients with a learning rate of 0.01. (1 point each)

Question 4. The GloVe and word2vec libraries provide pre-trained word embeddings with dimensions ranging from 50 to 300. In an NLP application task, what are the trade-offs of using word embeddings of smaller vs. larger dimensions? (5 points)
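As a generic illustration of the update rule in part (c), the sketch below shows a single gradient-descent step with learning rate 0.01. It is not the answer to the question: the parameter and gradient values here are dummy placeholders, and you still need to compute the real gradients in part (b).

import numpy as np

def sgd_update(params, grads, lr=0.01):
    # Vanilla gradient descent: move each parameter against its gradient.
    return [p - lr * g for p, g in zip(params, grads)]

# Dummy 2x2 arrays standing in for a real weight matrix and its gradient.
W = np.ones((2, 2))
dW = np.full((2, 2), 3.0)

(W_new,) = sgd_update([W], [dW])
print(W_new)   # every entry becomes 1 - 0.01 * 3 = 0.97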

APPENDIX

Word Embeddings

In very simple terms, word embeddings are text converted into numbers, and there may be different numerical representations of the same text. A word embedding format generally maps a word, via a dictionary, to a vector. One vector representation of a word is a one-hot encoded vector, where a 1 marks the position corresponding to the word and 0 appears everywhere else. This is just a very simple way to represent a word in vector form. The different types of word embeddings can be broadly classified into two categories:

Frequency-based embeddings - The count vectors and TF-IDF vectors that you have come across in this course so far broadly fall under this category. The dimensionality of such vectors is determined by the vocabulary size of the training data.

Prediction-based embeddings - The frequency-based methods treat each dimension of the vector as independent, although words rarely are. They are not capable of representing semantic or syntactic information: for example, the words "house" and "houses" would be mapped to different dimensions of the vector, and by looking at the vectors themselves you would not be able to conclude any relation between the two words. Vectors produced by these methods also incur high computational cost because of the large matrix multiplications required in neural networks. Word vectors remained limited in their representational power until Mikolov et al. introduced word2vec. The word2vec methods are prediction-based in the sense that they assign probabilities to words using vectors of limited dimensionality, and they proved to be state of the art for tasks like word analogies and word similarity. They were also able to solve analogies like king - man + woman = queen, a result that was considered almost magical.

We can visualize the learned vectors by projecting them down to 2 dimensions using dimensionality reduction techniques. When we inspect these visualizations, it becomes apparent that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another. Some directions of the induced vector space specialize towards certain semantic relationships, e.g. male-female, verb tense, and even country-capital relationships between words, as illustrated in the figure below.
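As a small, hedged illustration of the difference between a one-hot (frequency-style) representation and a dense learned embedding: the toy vocabulary, embedding dimension, and random embedding matrix below are made-up values for demonstration only.

import numpy as np

# Tiny made-up vocabulary; a real model would use thousands of words.
vocab = ["the", "quick", "brown", "fox", "house", "houses"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # One-hot style: dimension equals the vocabulary size, with a single 1.
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Dense embeddings: rows of a learned matrix (random here, purely illustrative).
embed_dim = 4
E = np.random.default_rng(0).standard_normal((len(vocab), embed_dim))

def embedding(word):
    return E[word_to_idx[word]]

# One-hot vectors of "house" and "houses" are orthogonal (cosine similarity 0),
# whereas learned dense vectors can place related words close together.
cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos(one_hot("house"), one_hot("houses")))      # 0.0
print(cos(embedding("house"), embedding("houses")))  # some value in [-1, 1]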

Word2vec is not a single algorithm but a combination of two techniques: CBOW (continuous bag of words) and the skip-gram model. Both are implemented as shallow neural network architectures, and both learn weights that act as word vector representations. For the first part of this assignment we look at the skip-gram architecture.
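If you want to experiment outside the assignment, a library such as gensim exposes both variants. The rough sketch below assumes a toy one-sentence corpus, and parameter names follow recent gensim releases (older versions use `size` instead of `vector_size`); it is not required for the homework.

# Rough sketch using gensim (not required for the homework).
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]]

# sg=1 selects the skip-gram objective; sg=0 selects CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow     = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# The learned vectors live in the model's `wv` attribute, e.g.:
print(skipgram.wv["fox"][:5])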