Natural Language Processing and Recurrent Neural Networks


Natural Language Processing and Recurrent Neural Networks. Pranay Tarafdar, October 19th, 2018

Outline: Introduction to NLP, Word2vec, RNN, GRU, LSTM, Demo

What is NLP? Natural language: a huge amount of information is available in different languages, in the form of text and speech. Processing: the idea is to use computers to understand language and perform useful tasks with it.

Uses of Natural Language Processing: spell checking, keyword search, finding synonyms; automated translation; sentiment analysis of movie reviews; speech recognition; complex question answering; language modelling.

Why is language different? Language is essentially a signalling system by which we convey information. Interestingly, language is mostly discrete/categorical in nature. These signals are communicated in different ways: sound, text, image, gesture. The huge vocabulary results in a sparsity problem when encoding the symbolic/categorical signals. [We will talk about this later!]

Why is NLP difficult? There is inherent ambiguity in human language. Here is an example of a real newspaper (Time Magazine) headline: "The pope's baby steps on gays." Efficiently representing a word as a vector of numbers is challenging. Language conveys information in a sequential manner. [This is where RNNs are going to help us!]

Word vector representation. "Batman will beat Superman with enough preparation time." One easy way to represent "Batman" is with a one-hot vector: [ 0 0 0 0 0 0 1 0 0 0 ]. Step 1: Using a corpus of words, build a dictionary or vocabulary. [This is essentially a vector of words, not necessarily arranged in alphabetical order.] Step 2: The i-th word in that dictionary is represented by the i-th unit vector.
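As a rough illustration (not from the slides), here is a minimal NumPy sketch of building a vocabulary from a toy corpus and producing a one-hot vector; the corpus and variable names are made up for the example.

```python
import numpy as np

corpus = "batman will beat superman with enough preparation time".split()
vocab = sorted(set(corpus))                       # the "dictionary" of words
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return the unit vector for `word` in the vocabulary."""
    vec = np.zeros(len(word_to_index))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("batman", word_to_index))
```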

Issues with one-hot representation. If your dictionary consists of T words, each word is represented by a T-dimensional vector. T is usually very large: roughly 20K (speech), 500K (machine translation), 13M (Google 1TB corpus). This is a localist representation: it doesn't give any inherent notion of association between words.

"Hurricane in Tallahassee" vs. "Hurricane in Florida": if one searches for either of these two phrases, the search engine should return results for the other as well. But in the one-hot representation, the vector representations of "Tallahassee" and "Florida" are orthogonal.

Distributional similarity. The idea of distributional similarity was first coined by the linguist Z. S. Harris (1954). Later this idea was used to represent words by means of their neighbors (Bengio et al., 2003; Mikolov et al., 2013). Neighboring words will now represent "banking": ...debt problems turning into banking crises as has... ...europe needs unified banking regulation to replace the...

word2vec. The general idea is to define a model that relates a center word $w_t$ and its neighboring context words in terms of word vectors, via the conditional probabilities $P(w_{t+j} \mid w_t)$ or $P(w_t \mid w_{t+j})$. Goal: obtain vector representations that maximize these probabilities over the corpus. Skip-gram model: predict context words given a center word. Continuous bag-of-words (CBOW) model: predict the center word from a bag of context words.

Skip-gram. For each position $t = 1, \dots, T$, predict the context words within a window of fixed size $m$, given the center word $w_t$. The likelihood is
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$
The goal is to minimize the objective
$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Two different vectors are used to represent a word $w$: $v_w$ when $w$ is a center word, and $u_w$ when $w$ is a context word. For any $j \ne 0$,
$$P(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}}^{T} v_{w_t})}{\sum_{i=1}^{V} \exp(u_{w_i}^{T} v_{w_t})}$$
This is called the softmax function.
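A minimal NumPy sketch of this softmax probability, with small random matrices standing in for the context vectors $u_w$ and center vectors $v_w$; the names U, Vc, and the toy sizes are assumptions for the illustration.

```python
import numpy as np

V, d = 10, 4                      # vocabulary size and embedding dimension (toy values)
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))       # context ("outside") vectors u_w, one row per word
Vc = rng.normal(size=(V, d))      # center vectors v_w, one row per word

def p_context_given_center(o, c):
    """Softmax probability P(w_o | w_c) for word indices o (context) and c (center)."""
    scores = U @ Vc[c]                        # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()                    # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_context_given_center(o=3, c=7))
```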

Suppose we use a $d$-dimensional vector to represent each word and we have $V$ words. Then
$$\theta = \begin{bmatrix} v_{\text{aardvark}} \\ \vdots \\ v_{\text{zyzzyva}} \\ u_{\text{aardvark}} \\ \vdots \\ u_{\text{zyzzyva}} \end{bmatrix} \in \mathbb{R}^{2dV}$$

The gradient of $J(\theta)$ is calculated from
$$\frac{\partial}{\partial v_{w_t}} \log P(w_{t+j} \mid w_t) = u_{w_{t+j}} - \sum_{i=1}^{V} P(w_i \mid w_t)\, u_{w_i}$$
Update equation:
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_{\theta} J(\theta)$$
or, elementwise,
$$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \frac{\partial}{\partial \theta_j^{\text{old}}} J(\theta)$$

Stochastic Gradient Descent. For a large corpus, computing the entire gradient vector before every update is extremely expensive. Instead, calculate the gradient of the loss $J_t(\theta)$ for one randomly selected center word. Update equation:
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_{\theta} J_t(\theta)$$
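A hedged sketch of one SGD step on a single (center, context) pair, combining the gradient formula above with the softmax; the function name, learning rate, and toy matrices are illustrative, not from the slides.

```python
import numpy as np

def sgd_step_pair(c, o, Vc, U, alpha=0.05):
    """One SGD update on a single (center c, context o) pair for the naive-softmax loss."""
    scores = U @ Vc[c]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # P(w | c) for every word w

    grad_vc = -(U[o] - probs @ U)             # gradient of -log P(o | c) w.r.t. v_c
    grad_U = np.outer(probs, Vc[c])           # expected-context term for every u_w
    grad_U[o] -= Vc[c]                        # minus the observed context word's term

    Vc[c] -= alpha * grad_vc                  # theta_new = theta_old - alpha * grad
    U -= alpha * grad_U
    return Vc, U

rng = np.random.default_rng(0)
V, d = 10, 4
Vc, U = rng.normal(size=(V, d)), rng.normal(size=(V, d))
Vc, U = sgd_step_pair(c=7, o=3, Vc=Vc, U=U)
```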

CBOW. Unlike skip-gram, we predict the center word from its context words. Cost function:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m})$$
The softmax is defined as
$$P(v_{w_t} \mid \hat{u}) = \frac{\exp(v_{w_t}^{T} \hat{u})}{\sum_{i=1}^{V} \exp(v_{w_i}^{T} \hat{u})}$$
where
$$\hat{u} = \frac{u_{t-m} + \dots + u_{t-1} + u_{t+1} + \dots + u_{t+m}}{2m}$$
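A small NumPy sketch of the CBOW prediction step under these formulas; the window of context indices and the array names are illustrative assumptions.

```python
import numpy as np

def cbow_predict(context_ids, U, Vc):
    """P(center word | context) using the averaged context vector u_hat."""
    u_hat = U[context_ids].mean(axis=0)          # (u_{t-m} + ... + u_{t+m}) / 2m
    scores = Vc @ u_hat                          # v_w^T u_hat for every word w
    scores -= scores.max()
    return np.exp(scores) / np.exp(scores).sum() # distribution over the vocabulary

rng = np.random.default_rng(0)
V, d = 10, 4
U, Vc = rng.normal(size=(V, d)), rng.normal(size=(V, d))
print(cbow_predict([1, 2, 4, 5], U, Vc).argmax())   # index of the predicted center word
```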

Recurrent Neural Networks

One-hot word vectors: $x^{(t)} \in \mathbb{R}^{V}$. Word embedding: $e^{(t)} = E x^{(t)}$. Hidden states: $h^{(t)} = g(W_h h^{(t-1)} + W_e e^{(t)} + b)$, where $h^{(0)}$ is the initial hidden state (usually a vector of zeros) and $g$ is some activation function (sigmoid, tanh, ReLU). Output probability: $\hat{y}^{(t)} = \mathrm{softmax}(U h^{(t)} + b_1)$.
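A minimal NumPy sketch of this forward pass for one time step, assuming tanh for $g$; all shapes and names (E, W_h, W_e, U_out, b, b1) are toy values chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 10, 4, 6                     # vocab size, embedding dim, hidden dim (toy values)
E = rng.normal(size=(d, V))            # embedding matrix
W_h = rng.normal(size=(H, H)) * 0.1
W_e = rng.normal(size=(H, d)) * 0.1
U_out = rng.normal(size=(V, H)) * 0.1
b, b1 = np.zeros(H), np.zeros(V)

def rnn_step(x_t, h_prev):
    """One RNN time step: embed the one-hot input, update the hidden state, predict y_hat."""
    e_t = E @ x_t                                   # e^(t) = E x^(t)
    h_t = np.tanh(W_h @ h_prev + W_e @ e_t + b)     # h^(t) = g(W_h h^(t-1) + W_e e^(t) + b)
    scores = U_out @ h_t + b1
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                            # softmax over the vocabulary
    return h_t, y_hat

x0 = np.zeros(V); x0[3] = 1.0                       # one-hot vector for word index 3
h, y_hat = rnn_step(x0, np.zeros(H))                # h^(0) is a vector of zeros
```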

How to train an RNN language model. Get a big corpus of text, which is a sequence of words $x^{(1)}, x^{(2)}, \dots, x^{(T)}$. Compute the output distribution $\hat{y}^{(t)}$ for every time step $t$; this is essentially the probability distribution over the next word given the previously occurred words. The loss function is the cross entropy between $y^{(t)} = x^{(t+1)}$ and $\hat{y}^{(t)}$:
$$J^{(t)}(\theta) = -\sum_{j=1}^{V} y_j^{(t)} \log \hat{y}_j^{(t)}$$

The overall cost function is
$$J(\theta) = \sum_{t=1}^{T} J^{(t)}(\theta)$$
Computing the gradients of $J(\theta)$ and updating the parameters of the network is called back-propagation.
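A short, self-contained sketch of this loss for a toy sequence; since $y^{(t)}$ is one-hot, the per-step cross entropy reduces to $-\log \hat{y}^{(t)}_{\text{target}}$. The distributions and targets below are made up for the example.

```python
import numpy as np

def cross_entropy(y_hat, target_index):
    """J^(t) = -log y_hat[target], because y^(t) is the one-hot vector of the next word."""
    return -np.log(y_hat[target_index])

def sequence_loss(y_hats, targets):
    """Overall cost: sum of the per-time-step cross-entropy losses."""
    return sum(cross_entropy(y_hat, t) for y_hat, t in zip(y_hats, targets))

# toy predicted distributions over a 4-word vocabulary and the true next-word indices
y_hats = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.2, 0.5, 0.2, 0.1])]
targets = [0, 1]
print(sequence_loss(y_hats, targets))
```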

Backpropagation through time. We need to update the parameters using gradient descent. For example, consider the update of $U$. We calculate $\frac{\partial J}{\partial U}$ using
$$\frac{\partial J}{\partial U} = \sum_{t} \frac{\partial J}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial U}$$
$b_1$ can be updated in a similar way.

Next we update $W_h$. Notice that, at each time step $t$, $J(\theta)$ depends on $W_h$ through $h^{(t)}$, which itself depends on $h^{(t-1)}$. So we backpropagate over time steps $t = T, T-1, \dots, 0$, summing the gradients as we go. Update equation:
$$\frac{\partial J^{(t)}}{\partial W_h} = \frac{\partial J}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial W_h}$$

Since $h^{(t)}$ depends on $h^{(t-1)}$, the chain rule gives
$$\frac{\partial J^{(t)}}{\partial W_h} = \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_h}$$
and hence
$$\frac{\partial J}{\partial W_h} = \sum_{t} \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_h}$$

Calculating the gradients of $W_e$ and $b$ is comparatively easier:
$$\frac{\partial J}{\partial W_e} = \sum_{t} \frac{\partial J}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial W_e}, \qquad \frac{\partial J}{\partial b} = \sum_{t} \frac{\partial J}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial b}$$

Vanishing and exploding gradient problem. Derivatives through time can become very small or very large very quickly (Bengio et al., 1994):
$$\frac{\partial J}{\partial W_h} = \sum_{t} \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_h}$$
The factor $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ is a product of $t-k$ Jacobians, so it can shrink or blow up exponentially with the distance $t-k$. In NLP, the vanishing gradient is an issue for long-range dependencies, e.g. "The cat, which already ate a plate full of fish, was full." One trick to deal with this problem is gradient clipping, but it is not an efficient solution: it helps with exploding gradients, not vanishing ones.
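A minimal sketch of gradient clipping by global norm, as a hedged illustration; the threshold and names are made up (in PyTorch one would typically call torch.nn.utils.clip_grad_norm_ instead).

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.full((3, 3), 10.0), np.full(3, -10.0)]    # deliberately large toy gradients
clipped = clip_gradients(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~5.0
```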

Gated Recurrent Unit. An effective way to deal with the vanishing gradient problem is to use more complicated hidden units (Cho et al., 2014). The main idea is to keep around memories that capture long-term dependencies. Essentially, the GRU is a simpler version of the LSTM, which will be discussed later.

The GRU first computes an update gate (another layer!) based on the current input word vector and the hidden state:
$$z^{(t)} = \sigma(W_z x^{(t)} + U_z h^{(t-1)})$$
and a reset gate similarly, but with different parameters:
$$r^{(t)} = \sigma(W_r x^{(t)} + U_r h^{(t-1)})$$
The sigmoid activation is used because we want the gate values to be either close to 0 or close to 1. (Explained later!)

New memory content:
$$\tilde{h}^{(t)} = g(W x^{(t)} + r^{(t)} \circ U h^{(t-1)})$$
If the reset gate is very close to 0, this ignores the previous memory and stores only the new word information. Current time step update:
$$h^{(t)} = z^{(t)} \circ h^{(t-1)} + (1 - z^{(t)}) \circ \tilde{h}^{(t)}$$
where $\circ$ denotes elementwise multiplication.
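A self-contained NumPy sketch of one GRU step following these equations, with tanh for $g$; the parameter layout and toy sizes are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step: update gate, reset gate, new memory, and the mixed hidden state."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory content
    return z * h_prev + (1.0 - z) * h_tilde        # mix old state and new memory

rng = np.random.default_rng(0)
d, H = 4, 6                                        # toy input and hidden sizes
params = [rng.normal(size=(H, d)) * 0.1 if i % 2 == 0 else rng.normal(size=(H, H)) * 0.1
          for i in range(6)]                       # Wz, Uz, Wr, Ur, W, U
h = gru_step(rng.normal(size=d), np.zeros(H), params)
print(h.shape)   # (6,)
```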

If the reset gate is close to 0, the previous hidden state is ignored, which allows the model to drop information that is irrelevant in the future. The update gate controls how much of the past state should matter now: if it is close to 1, the information in that unit is copied through many time steps! Units with short-term dependencies often have very active reset gates.

Long short-term memory (Hochreiter and Schmidhuber, 1997).
Input gate: $i^{(t)} = \sigma(W_i x^{(t)} + U_i h^{(t-1)})$
Output gate: $o^{(t)} = \sigma(W_o x^{(t)} + U_o h^{(t-1)})$
Forget gate: $f^{(t)} = \sigma(W_f x^{(t)} + U_f h^{(t-1)})$
New memory cell: $\tilde{c}^{(t)} = g(W_c x^{(t)} + U_c h^{(t-1)})$

Final memory cell: $c^{(t)} = f^{(t)} \circ c^{(t-1)} + i^{(t)} \circ \tilde{c}^{(t)}$
Final hidden state: $h^{(t)} = o^{(t)} \circ g(c^{(t)})$
Memory cells can keep information intact unless the input makes them forget it or overwrite it with new input. The cell can decide to output the information or just to store it.
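A minimal NumPy sketch of one LSTM step following these equations (tanh for $g$, toy random weights); the names and parameter layout are illustrative, not any library's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: input, output, and forget gates plus the memory cell update."""
    Wi, Ui, Wo, Uo, Wf, Uf, Wc, Uc = params
    i = sigmoid(Wi @ x_t + Ui @ h_prev)            # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)            # output gate
    f = sigmoid(Wf @ x_t + Uf @ h_prev)            # forget gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)      # new memory cell candidate
    c = f * c_prev + i * c_tilde                   # final memory cell
    h = o * np.tanh(c)                             # final hidden state
    return h, c

rng = np.random.default_rng(0)
d, H = 4, 6
params = [rng.normal(size=(H, d)) * 0.1 if k % 2 == 0 else rng.normal(size=(H, H)) * 0.1
          for k in range(8)]                       # Wi, Ui, Wo, Uo, Wf, Uf, Wc, Uc
h, c = lstm_step(rng.normal(size=d), np.zeros(H), np.zeros(H), params)
```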

Discussion. All the RNN models discussed here can only take previous words into consideration; the solution is a bidirectional RNN. RNNs work well for language modelling, machine translation, speech recognition, and named entity tagging.

Thank you