Natural Language Processing and Recurrent Neural Networks Pranay Tarafdar October 19th, 2018
Outline Introduction to NLP Word2vec RNN GRU LSTM Demo
What is NLP? Natural Language? : A huge amount of information is available in different languages, in the form of text and speech. Processing? : The idea is to use computers to understand language and perform useful tasks with it.
Uses of Natural Language Processing Spell checking, keyword search, synonyms Automated translation Sentiment analysis of movie reviews Speech recognition, complex question answering Language modelling
Why is language different? Language is essentially a signalling system by which we convey information. Interestingly, language is mostly discrete/categorical in nature. These signals are communicated in different ways: sound, text, image, gesture. A huge vocabulary results in a sparsity problem for encoding these symbolic/categorical signals. [We will talk about it later!]
Why is NLP difficult? Inherent ambiguity in human language. Here is an example of a real newspaper (Time Magazine) headline: "The pope's baby steps on gays." Efficiently representing a word as a vector of numbers is challenging. Language conveys information in a sequential manner. [This is where RNNs are going to help us!]
Word vector representation Batman will beat Superman with enough preparation time. One easy way to represent Batman is using a one-hot vector: $[\,0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 0\,]$. Step 1: Using a corpus of words, we can build a dictionary or vocabulary. [Which is essentially a vector of words, not necessarily arranged in alphabetic order.] Step 2: The $i$-th word in that dictionary is represented by the $i$-th unit vector.
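As a quick illustration, here is a minimal Python sketch of these two steps on a toy corpus; the corpus and all variable names are hypothetical, not from the slides:

```python
import numpy as np

# Step 1: build a dictionary/vocabulary (a list of unique words) from a toy corpus.
corpus = "batman will beat superman with enough preparation time".split()
vocab = sorted(set(corpus))                    # order need not be alphabetic
word_to_index = {w: i for i, w in enumerate(vocab)}

# Step 2: the i-th word is represented by the i-th unit vector.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("batman"))   # e.g. [1. 0. 0. 0. 0. 0. 0. 0.]
```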
Issues with one-hot representation If your dictionary consists of $T$ words, each word will be represented by a $T \times 1$ dimensional vector. $T$ is usually very large: 20K (speech), 500K (machine translation), 13M (Google 1TB corpus). It is a localist representation: it doesn't give any inherent notion of association between words.
Hurricane in Tallahassee / Hurricane in Florida If one searches for one of these two sentences, the search engine should return results for the other sentence as well. In the one-hot representation, however, the vector representations of Tallahassee and Florida are orthogonal.
Distributional Similarity The idea of distributional similarity was first coined by the linguist Z. S. Harris (1954). Later this idea was used to represent words by means of their neighbors (Bengio et al., 2003; Mikolov et al., 2013). Neighboring words will now represent banking: "...debt problems turning into banking crises as has..." "...europe needs unified banking regulation to replace the..."
word2vec The general idea is to define a model that relates a center word $w_t$ and its neighboring context words $w_{t+j}$ in terms of word vectors, through $P(w_{t+j} \mid w_t)$ or $P(w_t \mid w_{t+j})$. Goal: obtain vector representations that maximize $P(w_{t+j} \mid w_t)$ or $P(w_t \mid w_{t+j})$. Skip-gram model: predict context words given a center word. Continuous bag-of-words (CBOW) model: predict the center word from a bag of context words.
skip-gram For each position $t = 1, \ldots, T$, predict the context words within a window of fixed size $m$, given the center word $w_t$:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$
The goal is to minimize $J(\theta)$, where
$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t)$$
Two different vectors are used to represent each word $w$: $v_w$ when $w$ is a center word, and $u_w$ when $w$ is a context word. For any $j \ne 0$,
$$P(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}}^{\top} v_{w_t})}{\sum_{i=1}^{V} \exp(u_{w_i}^{\top} v_{w_t})}$$
This is called the softmax function.
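A minimal numpy sketch of this softmax, assuming random toy vectors; the sizes $V$ and $d$, the seed, and all names are hypothetical:

```python
import numpy as np

V, d = 10, 4                  # hypothetical vocabulary size and vector dimension
rng = np.random.default_rng(0)
U  = rng.normal(size=(V, d))  # context vectors u_w, one row per word
Vc = rng.normal(size=(V, d))  # center vectors v_w, one row per word

def skipgram_prob(center, context):
    """P(context | center): softmax over u_w^T v_center for all words w."""
    scores = U @ Vc[center]
    exp_scores = np.exp(scores - scores.max())  # shifted for numerical stability
    return exp_scores[context] / exp_scores.sum()

print(skipgram_prob(center=3, context=7))
```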
Suppose we are using a $d$-dimensional vector to represent each word and we have $V$-many words. Then
$$\theta = \begin{bmatrix} v_{\text{aardvark}} \\ \vdots \\ v_{\text{zyzzyva}} \\ u_{\text{aardvark}} \\ \vdots \\ u_{\text{zyzzyva}} \end{bmatrix} \in \mathbb{R}^{2dV}$$
The gradient of $J(\theta)$ is calculated from
$$\frac{\partial}{\partial v_{w_t}} \log P(w_{t+j} \mid w_t) = u_{w_{t+j}} - \sum_{i=1}^{V} P(w_i \mid w_t)\, u_{w_i}$$
Update equation: $\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$, or elementwise, $\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$.
Stochastic Gradient Descent For a large corpus, computing the entire gradient vector is extremely expensive. Instead, calculate the gradient $\nabla_\theta J_t(\theta)$ for one randomly selected center word. Update equation: $\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J_t(\theta)$.
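A self-contained sketch of one such stochastic update for the skip-gram objective on a single (center, context) pair; the gradients follow the formula on the previous slide, and all sizes, the seed, and the names are hypothetical:

```python
import numpy as np

V, d, alpha = 10, 4, 0.05     # vocabulary size, vector dimension, learning rate
rng = np.random.default_rng(1)
U  = rng.normal(size=(V, d))  # context vectors u_w, one row per word
Vc = rng.normal(size=(V, d))  # center vectors v_w, one row per word

def sgd_step(center, context):
    """One stochastic update on a single (center, context) word pair."""
    scores = U @ Vc[center]
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # p[i] = P(w_i | w_center)
    # Gradient of -log P(context | center) w.r.t. the center vector:
    grad_v = -(U[context] - p @ U)        # -(u_context - sum_i P(w_i|w_c) u_i)
    # Gradient w.r.t. the context vectors: outer(p - one_hot(context), v_center)
    err = p.copy()
    err[context] -= 1.0
    grad_U = np.outer(err, Vc[center])
    Vc[center] -= alpha * grad_v          # theta_new = theta_old - alpha * grad
    U[:]       -= alpha * grad_U

sgd_step(center=3, context=7)
```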
CBOW Unlike skip-gram, we predict the center word from the context words. Cost function:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m})$$
The softmax function is defined as
$$P(w_t \mid \hat{u}) = \frac{\exp(v_{w_t}^{\top} \hat{u})}{\sum_{i=1}^{V} \exp(v_{w_i}^{\top} \hat{u})}, \quad \text{where } \hat{u} = \frac{u_{t-m} + \cdots + u_{t-1} + u_{t+1} + \cdots + u_{t+m}}{2m}$$
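For comparison, a sketch of the CBOW probability with the averaged context vector $\hat{u}$; the sizes and names are again toy illustrations:

```python
import numpy as np

V, d = 10, 4                  # hypothetical vocabulary size and vector dimension
rng = np.random.default_rng(2)
U  = rng.normal(size=(V, d))  # context vectors u_w
Vc = rng.normal(size=(V, d))  # center vectors v_w

def cbow_prob(context_ids, center):
    """P(center | context), using the averaged context vector u_hat."""
    u_hat = U[context_ids].mean(axis=0)    # (u_{t-m} + ... + u_{t+m}) / 2m
    scores = Vc @ u_hat                    # v_w^T u_hat for every word w
    p = np.exp(scores - scores.max())
    return p[center] / p.sum()

print(cbow_prob(context_ids=[1, 2, 4, 5], center=3))  # window of size m = 2
```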
Recurrent Neural Networks
One-hot word vectors: $x^{(t)} \in \mathbb{R}^{V}$. Word embedding: $e^{(t)} = E x^{(t)}$. Hidden states: $h^{(t)} = g(W_h h^{(t-1)} + W_e e^{(t)} + b)$, where $g$ is some activation function (sigmoid, tanh, ReLU) and $h^{(0)}$ is the initial hidden state, usually a vector of 0s. Output probability: $\hat{y}^{(t)} = \mathrm{softmax}(U h^{(t)} + b_1)$.
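A minimal numpy sketch of one forward step of this RNN, assuming tanh as the activation $g$ and small hypothetical sizes (the output weights are named `Uo` here only to avoid clashing with the earlier word2vec matrices):

```python
import numpy as np

V, d_e, d_h = 10, 4, 5          # vocabulary, embedding, hidden sizes (hypothetical)
rng = np.random.default_rng(3)
E   = rng.normal(size=(d_e, V)) * 0.1   # embedding matrix E
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_e = rng.normal(size=(d_h, d_e)) * 0.1
b   = np.zeros(d_h)
Uo  = rng.normal(size=(V, d_h)) * 0.1   # output weights (U in the slides)
b1  = np.zeros(V)

def rnn_step(word_id, h_prev):
    x = np.zeros(V); x[word_id] = 1.0       # one-hot x^(t)
    e = E @ x                               # embedding e^(t) = E x^(t)
    h = np.tanh(W_h @ h_prev + W_e @ e + b) # hidden state, with g = tanh
    scores = Uo @ h + b1
    p = np.exp(scores - scores.max())
    y_hat = p / p.sum()                     # softmax output distribution
    return h, y_hat

h = np.zeros(d_h)                           # h^(0): a vector of zeros
for w in [1, 4, 2]:                         # a toy word-id sequence
    h, y_hat = rnn_step(w, h)
```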
How to Train an RNN Language Model Get a big corpus of text, which is a sequence of words $x^{(1)}, x^{(2)}, \ldots, x^{(T)}$. Compute the output distribution $\hat{y}^{(t)}$ for every time step $t$; this is essentially the probability distribution over every word given the previously occurring words. Loss function: cross entropy, with $y^{(t)} = x^{(t+1)}$:
$$J^{(t)}(\theta) = -\sum_{j=1}^{V} y_j^{(t)} \log \hat{y}_j^{(t)}$$
The overall cost function is
$$J(\theta) = \sum_{t=1}^{T} J^{(t)}(\theta)$$
Updating the parameters of the network using the gradients of $J(\theta)$ is called back-propagation.
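A small sketch of the per-step cross-entropy and the summed cost, assuming the output distributions come from a model such as the `rnn_step` sketch above; the Dirichlet draws stand in for real model outputs purely for illustration:

```python
import numpy as np

def cross_entropy(y_hat, target_id):
    """J^(t): minus the log-probability assigned to the true next word."""
    return -np.log(y_hat[target_id])

def total_loss(y_hats, word_ids):
    """J(theta): sum of per-step losses; the target at step t is x^(t+1)."""
    return sum(cross_entropy(y_hats[t], word_ids[t + 1])
               for t in range(len(word_ids) - 1))

rng = np.random.default_rng(4)
y_hats = rng.dirichlet(np.ones(10), size=3)  # three fake output distributions
print(total_loss(y_hats, word_ids=[1, 4, 2, 7]))
```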
Backpropagation through time We need to update the parameters using gradient descent. For example, consider the update of $U$. We need to calculate $\frac{\partial J}{\partial U}$ using
$$\frac{\partial J}{\partial U} = \sum_t \frac{\partial J}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial U}$$
$b_1$ can be updated in a similar way.
Next we are going to update $W_h$. Notice that, at each time step $t$, $J(\theta)$ depends on $W_h$ through $h^{(t)}$, which itself depends on $h^{(t-1)}$. So we are going to backpropagate over time steps $t = T, T-1, \ldots, 0$, summing the gradients as we go. Naively,
$$\frac{\partial J^{(t)}}{\partial W_h} = \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial W_h}$$
but since $h^{(t)}$ depends on $h^{(t-1)}$,
$$\frac{\partial J^{(t)}}{\partial W_h} = \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_h}$$
and therefore
$$\frac{\partial J}{\partial W_h} = \sum_t \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_h}$$
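The repeated factor $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ is what makes this sum delicate. As a sketch (not from the slides), for a tanh recurrence it is a product of per-step Jacobians $\mathrm{diag}(1 - h^{(j)\,2})\, W_h$, and with small random weights its norm shrinks rapidly as $t - k$ grows, previewing the vanishing gradient problem discussed next; all sizes and weights below are hypothetical:

```python
import numpy as np

d_h, T = 5, 20
rng = np.random.default_rng(7)
W_h = rng.normal(size=(d_h, d_h)) * 0.3    # small weights -> shrinking Jacobians
hs = [np.zeros(d_h)]
for t in range(T):                          # run the recurrence (inputs folded
    hs.append(np.tanh(W_h @ hs[-1] + rng.normal(size=d_h)))  # into noise, for brevity)

def dh_dh(t, k):
    """Jacobian dh^(t)/dh^(k): product of per-step Jacobians diag(1-h^2) W_h."""
    J = np.eye(d_h)
    for j in range(k + 1, t + 1):
        J = (np.diag(1.0 - hs[j] ** 2) @ W_h) @ J
    return J

for k in (19, 15, 10, 0):
    print(k, np.linalg.norm(dh_dh(19, k)))  # norm shrinks as t - k grows
```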
Calculating the gradients of $W_e$ and $b$ is comparatively easier:
$$\frac{\partial J}{\partial W_e} = \sum_t \frac{\partial J}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial W_e}, \qquad \frac{\partial J}{\partial b} = \sum_t \frac{\partial J}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial b}$$
Vanishing and Exploding gradient problem Derivatives through time can become very small or very large very quickly (Bengio et al., 1994):
$$\frac{\partial J}{\partial W_h} = \sum_t \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_h}$$
In NLP, the vanishing gradient is an issue for long-range dependencies: "The cat, which already ate a plate full of fish, was full." One trick to deal with this problem is gradient clipping, but it only tames exploding gradients and is not an efficient fix for vanishing ones.
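A minimal sketch of gradient clipping by rescaling to a maximum norm; the threshold here is an arbitrary illustrative choice:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its norm exceeds a threshold.
    This guards against exploding gradients; it does not fix vanishing ones."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])    # norm 50: would blow up a gradient-descent update
print(clip_gradient(g))        # rescaled to norm 5, same direction
```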
Gated Recurrent Unit An efficient way to deal with the vanishing gradient problem is to use more complicated hidden units (Cho et al., 2014). The main idea is to keep around memories to capture long-term dependencies. Essentially this is a simpler version of the LSTM, which will be discussed later.
The GRU first computes an update gate (another layer!) based on the current input word vector and the hidden state:
$$z^{(t)} = \sigma(W_z x^{(t)} + U_z h^{(t-1)})$$
and a reset gate similarly, but with different parameters:
$$r^{(t)} = \sigma(W_r x^{(t)} + U_r h^{(t-1)})$$
The sigmoid activation is used because we want the gate values to be either close to 0 or close to 1. (Explained later!)
New memory content:
$$\tilde{h}^{(t)} = g(W x^{(t)} + r^{(t)} \circ U h^{(t-1)})$$
If the reset gate is very close to 0, this ignores the previous memory and stores only the new word's information. Current time step update:
$$h^{(t)} = z^{(t)} \circ h^{(t-1)} + (1 - z^{(t)}) \circ \tilde{h}^{(t)}$$
If the reset gate is close to 0, the unit ignores the previous hidden state, which allows the model to drop information that is irrelevant in the future. The update gate controls how much of the past state should matter now. If the update gate is close to 1, then we can copy information in that unit through many time steps! Units with short-term dependencies often have very active reset gates.
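Putting the GRU equations together, a sketch of one GRU step in numpy, with elementwise products for the gates; the sizes and random weights are hypothetical:

```python
import numpy as np

d_x, d_h = 4, 5                    # input and hidden sizes (hypothetical)
rng = np.random.default_rng(5)
Wz, Uz = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wr, Ur = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
W,  Uh = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev):
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x + r * (Uh @ h_prev))  # new memory content, g = tanh
    return z * h_prev + (1.0 - z) * h_tilde       # current hidden state h^(t)

h = gru_step(rng.normal(size=d_x), np.zeros(d_h))
```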
Long short-term memory (Hochreiter and Schmidhuber, 1997) Input gate: $i^{(t)} = \sigma(W_i x^{(t)} + U_i h^{(t-1)})$ Output gate: $o^{(t)} = \sigma(W_o x^{(t)} + U_o h^{(t-1)})$ Forget gate: $f^{(t)} = \sigma(W_f x^{(t)} + U_f h^{(t-1)})$ New memory cell: $\tilde{c}^{(t)} = g(W_c x^{(t)} + U_c h^{(t-1)})$
Final memory cell: $c^{(t)} = f^{(t)} \circ c^{(t-1)} + i^{(t)} \circ \tilde{c}^{(t)}$ Final hidden state: $h^{(t)} = o^{(t)} \circ g(c^{(t)})$ Memory cells can keep information intact, unless the input makes them forget it or overwrite it with new input. The cell can decide to output the information or just to store it.
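And analogously, a sketch of one LSTM step, again with tanh for $g$ and hypothetical random weights:

```python
import numpy as np

d_x, d_h = 4, 5                    # input and hidden sizes (hypothetical)
rng = np.random.default_rng(6)
Wi, Ui = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wo, Uo = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wf, Uf = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
Wc, Uc = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(Wi @ x + Ui @ h_prev)         # input gate
    o = sigmoid(Wo @ x + Uo @ h_prev)         # output gate
    f = sigmoid(Wf @ x + Uf @ h_prev)         # forget gate
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev)   # new memory cell candidate
    c = f * c_prev + i * c_tilde              # final memory cell
    h = o * np.tanh(c)                        # final hidden state
    return h, c

h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h))
```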
Discussion All the RNN models discussed here can only take previous words into consideration. Solution: bidirectional RNNs. RNNs are a great tool for language modelling, machine translation, speech recognition, and named entity tagging.
Thank you