Recurrent Neural Networks. Dr. Kira Radinsky CTO SalesPredict Visiting Professor/Scientist Technion. Slides were adapted from lectures by Richard Socher


1 Recurrent Neural Networks Dr. Kira Radinsky CTO SalesPredict Visiting Professor/Scientist Technion Slides were adapted from lectures by Richard Socher

2 Overview Traditional language models, RNNs, RNN language models, important training problems and tricks, vanishing and exploding gradient problems, bidirectional RNNs, RNNs for other sequence tasks

3 Language Models A language model computes a probability for a sequence of words: P(w_1, ..., w_T). Useful for machine translation. Word ordering: p(the cat is small) > p(small the is cat). Word choice: p(walking home after school) > p(walking house after school)

4 Traditional Language Models Probability is usually conditioned on a window of n previous words. An incorrect but necessary Markov assumption! To estimate probabilities, compute counts for unigrams and bigrams (conditioning on one/two previous word(s)): p(w_2 | w_1) = count(w_1, w_2) / count(w_1) and p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)
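For concreteness, a minimal count-based bigram estimator in Python (the toy corpus and function name are my own, not from the slides):

```python
from collections import Counter

def bigram_probability(tokens, w1, w2):
    """Estimate p(w2 | w1) = count(w1, w2) / count(w1) from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return bigrams[(w1, w2)] / unigrams[w1]

tokens = "the cat is small the cat is cute".split()
print(bigram_probability(tokens, "the", "cat"))  # 1.0: in this toy corpus "the" is always followed by "cat"
```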

5 Traditional Language Models Performance improves by keeping counts of higher-order n-grams and doing smoothing and so-called backoff (e.g. if the 4-gram is not found, try the 3-gram, etc.). There are A LOT of n-grams! → Gigantic RAM requirements! Recent state of the art: "Scalable Modified Kneser-Ney Language Model Estimation" by Heafield et al.: "Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens"

6 Original neural language model "A Neural Probabilistic Language Model", Bengio et al. 2003. The i-th output is P(w_t = i | context). Original equations: y = b + W x + U tanh(d + H x), followed by a softmax (where most of the computation happens): P̂(w_t | w_{t-1}, ..., w_{t-n+1}) = e^{y_{w_t}} / Σ_i e^{y_i}. [Figure: each context word index w_{t-n+1}, ..., w_{t-2}, w_{t-1} is a table look-up into a shared matrix C, giving C(w_{t-n+1}), ..., C(w_{t-1}); parameters are shared across words.] Problem: fixed window of context for conditioning :(
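A sketch of the fixed-window forward pass implied by these equations, assuming a vocabulary of size V, embedding dimension m, hidden size h, and a window of n−1 = 3 previous words (all sizes are illustrative, not from the slide):

```python
import numpy as np

V, m, n_minus_1, h = 8000, 50, 3, 100
rng = np.random.default_rng(0)

C = rng.normal(scale=0.1, size=(V, m))               # shared embedding table C
H = rng.normal(scale=0.1, size=(h, n_minus_1 * m))   # input-to-hidden weights
d = np.zeros(h)                                      # hidden bias
U = rng.normal(scale=0.1, size=(V, h))               # hidden-to-output weights
W = rng.normal(scale=0.1, size=(V, n_minus_1 * m))   # direct input-to-output weights
b = np.zeros(V)                                      # output bias

def forward(context_indices):
    """y = b + W x + U tanh(d + H x), followed by a softmax over the vocabulary."""
    x = C[context_indices].reshape(-1)               # concatenate the looked-up embeddings
    y = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(y - y.max())                          # numerically stable softmax
    return e / e.sum()

p_next = forward([17, 42, 7])                        # P(w_t = i | previous 3 words), shape (V,)
```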

7 Recurrent Neural Networks! RNNs tie the weights at each time step and condition the neural network on all previous words; the RAM requirement only scales with the number of words. [Figure: a Recurrent Neural Network (RNN) unrolled over three time steps, with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1} connected by the shared matrix W, and outputs y_{t-1}, y_t, y_{t+1}.]

8 Recurrent Neural Network language model Given a list of word vectors x_1, ..., x_T, at a single time step: h_t = σ(W^(hh) h_{t-1} + W^(hx) x_t) and ŷ_t = softmax(W^(S) h_t), where ŷ_t is the prediction output at time step t, h_{t-1} is the output of the previous step, and x_t is the next word vector in the document.

9 Recurrent Neural Network language model Main idea: we use the same set of W weights at all time steps! Everything else is the same: h_{t-1} is the output of the non-linear function at the previous time step t-1, i.e. the document context score so far; x_t is the column vector of L at index [t] at time step t; W^(hh) is the weights matrix used to condition the output of the previous time step, h_{t-1}; W^(hx) is the weights matrix used to condition the input word vector, x_t; V is the vocabulary.
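A minimal sketch of one time step of this language model in numpy (σ taken to be the logistic sigmoid; the sizes are the ones used in the example two slides below):

```python
import numpy as np

def rnn_lm_step(x_t, h_prev, W_hh, W_hx, W_S):
    """h_t = sigmoid(W_hh h_{t-1} + W_hx x_t); y_hat_t = softmax(W_S h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    scores = W_S @ h_t
    e = np.exp(scores - scores.max())                # numerically stable softmax
    return h_t, e / e.sum()

D_h, d, V = 100, 50, 8000                            # hidden size, word-vector size, vocabulary size
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(D_h, D_h))
W_hx = rng.normal(scale=0.01, size=(D_h, d))
W_S = rng.normal(scale=0.01, size=(V, D_h))
h_t, y_hat_t = rnn_lm_step(rng.normal(size=d), np.zeros(D_h), W_hh, W_hx, W_S)
```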

10 Recurrent Neural Network language model ŷ_t is a probability distribution over the vocabulary, the distribution at each time step t. Essentially, ŷ_t is the prediction of the next word given the document context score so far (i.e. h_{t-1}) and the last observed word vector x_t.

11 RNN Example Vocabulary size C = 8000, hidden layer size H = 100. U, V and W are the parameters of our network that we want to learn from data. Thus, we need to learn a total of 2HC + H^2 = 1,610,000 parameters.
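Checking the arithmetic (assuming U is the H×C input matrix, V the C×H output matrix, and W the H×H recurrent matrix):

```python
C, H = 8000, 100                 # vocabulary size and hidden layer size
n_params = 2 * H * C + H * H     # U: H*C, V: C*H, W: H*H
print(n_params)                  # 1610000
```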

12 Recurrent Neural Network language model Same cross-entropy loss function, but predicting words instead of classes: J^(t)(θ) = − Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}

13 Recurrent Neural Network language model Evaluation could just be the negative of the average log probability over a dataset of size (number of words) T: J = −(1/T) Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}. But more common is perplexity: 2^J. Lower is better!
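A small sketch of both quantities (using base-2 logs so that perplexity is exactly 2^J; with natural logs it would be e^J):

```python
import numpy as np

def cross_entropy(predictions, targets):
    """predictions: (T, V) array of predicted distributions y_hat_t; targets: T correct word indices."""
    T = len(targets)
    return -np.mean(np.log2(predictions[np.arange(T), targets]))

def perplexity(predictions, targets):
    """Perplexity = 2^J; lower is better."""
    return 2.0 ** cross_entropy(predictions, targets)
```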

14 Training RNNs is hard We multiply by the same matrix at each time step during forward prop. [Figure: the unrolled RNN with the shared matrix W, as before.] Ideally, inputs from many time steps ago could modify the output y. Take for example an RNN with 2 time steps! Insightful!

15 The vanishing/exploding gradient problem The RNN goal is to enable propagating context information through faraway time steps. Recurrent neural networks propagate the same W matrix from one time step to the next. [Figure: the unrolled RNN with the shared matrix W.]

16 The vanishing/exploding gradient problem Will the RNN succeed in the same way in the following sentences? For long sentences, the probability that "John" is recognized as the next word decreases with the size of the intervening context.

17 The vanishing gradient problem - Details Similar but simpler RNN formulation: h_t = W f(h_{t-1}) + W^(hx) x_t, ŷ_t = W^(S) f(h_t). The total error is the sum of the errors at each time step t: ∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W. Hardcore chain rule application: ∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W)

18 The vanishing gradient problem - Details Similar to backprop, but a less efficient formulation. Useful for the analysis we'll look at: ∂h_t/∂h_k. More chain rule: ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j-1}. Each partial is a Jacobian matrix.

19 Jacobian - Reminder The Jacobian of a vector-valued function f: R^n → R^m is the m × n matrix of all first-order partial derivatives, with entry (i, j) equal to ∂f_i/∂x_j.

20 The vanishing gradient problem - Details From the previous slide: ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j-1}. Remember: h_j = W f(h_{j-1}) + W^(hx) x_j. To compute the Jacobian, differentiate each element of the matrix: ∂h_j/∂h_{j-1} = W^T diag(f'(h_{j-1})). Check at home that you understand the diag matrix formulation.

21 The vanishing gradient problem - Details Analyzing the norms of the Jacobians yields ‖∂h_j/∂h_{j-1}‖ ≤ ‖W^T‖ ‖diag(f'(h_{j-1}))‖ ≤ β_W β_h, where we define the β's as upper bounds of the norms, so ‖∂h_t/∂h_k‖ ≤ (β_W β_h)^{t-k}. The gradient is a product of Jacobian matrices, each associated with a step in the forward computation. This can become very small or very large quickly [Bengio et al 1994], and the locality assumption of gradient descent breaks down. → Vanishing or exploding gradient
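A minimal numerical illustration of this effect (my own sketch, using a linear recurrence h_t = W h_{t-1} so that ∂h_t/∂h_k is exactly W^(t−k)):

```python
import numpy as np

rng = np.random.default_rng(0)
D_h, steps = 50, 30

def jacobian_product_norm(spectral_norm):
    A = rng.normal(size=(D_h, D_h))
    W = (A + A.T) / 2                          # symmetric, so its norm equals its largest |eigenvalue|
    W *= spectral_norm / np.linalg.norm(W, 2)  # rescale to the desired norm
    return np.linalg.norm(np.linalg.matrix_power(W, steps), 2)

print(jacobian_product_norm(0.9))   # ~0.9**30 ~ 0.04 -> vanishing
print(jacobian_product_norm(1.1))   # ~1.1**30 ~ 17   -> exploding
```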

22 Why is the vanishing gradient a problem? The error at a time step ideally should be able to tell a previous time step, many steps away, to change during backprop. [Figure: the unrolled RNN with the shared matrix W.]

23 IPython Notebook with vanishing gradient example Example of a simple and clean NNet implementation; comparison of sigmoid and ReLU units; a little bit of vanishing gradient.

24 RNN Python Example The model: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t). Our loss: cross-entropy over the predicted word distributions. [Figure: the unrolled network with input weights U, recurrent weights W, output weights V, and outputs o_t.]

25 RNN Python Example: Training Here's an actual training example from our text:

26 RNN Python Example: Parameter Init Initialize the weights randomly in the interval [−1/√n, 1/√n], where n is the number of incoming connections from the previous layer. Forward propagation (predicting word probabilities): each o_t is a vector of probabilities representing the words in our vocabulary, but sometimes, for example when evaluating our model, all we want is the next word with the highest probability.
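A hedged sketch of the initialization and forward pass described here, written as a small numpy class (class and method names are my own; U is input-to-hidden, V hidden-to-output, W hidden-to-hidden):

```python
import numpy as np

class RNNNumpy:
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        self.word_dim, self.hidden_dim, self.bptt_truncate = word_dim, hidden_dim, bptt_truncate
        # Random init in [-1/sqrt(n), 1/sqrt(n)], where n = number of incoming connections
        self.U = np.random.uniform(-1/np.sqrt(word_dim), 1/np.sqrt(word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-1/np.sqrt(hidden_dim), 1/np.sqrt(hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-1/np.sqrt(hidden_dim), 1/np.sqrt(hidden_dim), (hidden_dim, hidden_dim))

    def forward_propagation(self, x):
        """x is a list of word indices; returns per-step word distributions o and hidden states s."""
        T = len(x)
        s = np.zeros((T + 1, self.hidden_dim))   # s[-1] is the initial (zero) hidden state
        o = np.zeros((T, self.word_dim))
        for t in range(T):
            s[t] = np.tanh(self.U[:, x[t]] + self.W @ s[t - 1])   # one-hot input = column lookup
            z = self.V @ s[t]
            o[t] = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))
        return o, s

    def predict(self, x):
        """Sometimes all we want is the highest-probability next word at each step."""
        o, _ = self.forward_propagation(x)
        return np.argmax(o, axis=1)
```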

27 RNN Python Example: Parameter Init For each word in the sentence (in our case: 45), our model made 8000 predictions representing probabilities of the next word (currently random, as U, V, W are all random).

28 RNN Python Example: Loss Calculation For random init: we have C words in our vocabulary, so each word should be (on average) predicted with probability 1/C, which would yield a loss of ln(C) ≈ 8.99 for C = 8000.
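Sanity check of that baseline loss:

```python
import numpy as np

C = 8000
print(np.log(C))   # ~8.987: the expected cross-entropy when every word gets probability 1/C
```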

29 RNN Python Example: Training the RNN with SGD and Backpropagation Through Time (BPTT) The model and loss are as before: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t), with the cross-entropy loss.

30 Training the RNN with SGD and Backpropagation Through Time (BPTT): Calculate the Gradient Find the parameters U, V, W that minimize the total loss L on the training data using SGD. BPTT takes as input a training example (x, y) and outputs the gradients ∂L/∂U, ∂L/∂V, ∂L/∂W.
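A hedged sketch of BPTT for the RNNNumpy sketch above (derived by hand, so it may differ in detail from the code on the slide): given one training example (x, y), it returns the gradients of the cross-entropy loss with respect to U, V and W, truncating the backward pass at bptt_truncate steps.

```python
import numpy as np

def bptt(model, x, y):
    T = len(y)
    o, s = model.forward_propagation(x)
    dLdU, dLdV, dLdW = np.zeros_like(model.U), np.zeros_like(model.V), np.zeros_like(model.W)
    delta_o = o.copy()
    delta_o[np.arange(T), y] -= 1.0                            # gradient of softmax + cross-entropy
    for t in reversed(range(T)):
        dLdV += np.outer(delta_o[t], s[t])
        delta_t = (model.V.T @ delta_o[t]) * (1 - s[t] ** 2)   # back through tanh
        for step in range(t, max(0, t - model.bptt_truncate) - 1, -1):   # truncated backprop through time
            dLdW += np.outer(delta_t, s[step - 1])
            dLdU[:, x[step]] += delta_t
            delta_t = (model.W.T @ delta_t) * (1 - s[step - 1] ** 2)
    return dLdU, dLdV, dLdW
```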

31 Training the RNN with SGD and Backpropagation Through Time (BPTT): Gradient Checking
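A hedged sketch of numerical gradient checking for the BPTT code above: perturb a few random parameters by ±h and compare the centered finite-difference estimate with the analytic gradient. Use a bptt_truncate at least as long as the sentence so the truncated gradient equals the true one.

```python
import numpy as np

def total_loss(model, x, y):
    """Cross-entropy loss of one example under the current parameters."""
    o, _ = model.forward_propagation(x)
    return -np.sum(np.log(o[np.arange(len(y)), y]))

def gradient_check(model, x, y, h=1e-4, tol=1e-3, checks_per_matrix=5):
    analytic = dict(zip("UVW", bptt(model, x, y)))
    for name in "UVW":
        P = getattr(model, name)
        for _ in range(checks_per_matrix):
            idx = np.unravel_index(np.random.randint(P.size), P.shape)
            original = P[idx]
            P[idx] = original + h
            loss_plus = total_loss(model, x, y)
            P[idx] = original - h
            loss_minus = total_loss(model, x, y)
            P[idx] = original                                  # restore the parameter
            numeric = (loss_plus - loss_minus) / (2 * h)
            rel_err = abs(numeric - analytic[name][idx]) / max(1e-8, abs(numeric) + abs(analytic[name][idx]))
            assert rel_err < tol, f"gradient check failed for {name} at {idx}"
```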

32 Training the RNN with SGD and Backpropagation Through Time (BPTT): SGD
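A minimal SGD step built on the BPTT sketch above (the learning rate is an arbitrary choice, not taken from the slides):

```python
def sgd_step(model, x, y, learning_rate=0.005):
    """One stochastic gradient descent update on a single training example."""
    dLdU, dLdV, dLdW = bptt(model, x, y)
    model.U -= learning_rate * dLdU
    model.V -= learning_rate * dLdV
    model.W -= learning_rate * dLdW
```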

33 Training the RNN with SGD and Backpropagation Through Time (BPTT): SGD It seems the loss is decreasing!

34 RNN Python Example: Generate Text
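A hedged sketch of the generation loop: feed the sentence generated so far back into the model and sample the next word from o_t until an end-of-sentence token appears (the start/end token indices and the index_to_word mapping are assumptions):

```python
import numpy as np

def generate_sentence(model, start_idx, end_idx, index_to_word, max_len=50):
    sentence = [start_idx]
    while sentence[-1] != end_idx and len(sentence) < max_len:
        o, _ = model.forward_propagation(sentence)
        next_word = np.random.choice(len(o[-1]), p=o[-1])   # sample from the predicted distribution
        sentence.append(next_word)
    return [index_to_word[i] for i in sentence[1:-1]]       # drop the start/end tokens
```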

35 RNN Python Example: Generate Text A few selected (censored) sentences, with capitalization added, from the model trained on reddit comments: 1. Anyway, to the city scene you're an idiot teenager. 2. What?!!!! ignore! 3. Screw fitness, you're saying: hops 4. Thanks for the advice to keep my thoughts around girls. 5. Yep, please disappear with the terrible generation. What went right? The model successfully learns syntax: it correctly places commas, ends sentences with punctuation, uses multiple exclamation marks, etc. What went wrong? The sentences don't make sense or have grammatical errors! Why? Our vanilla RNN can't generate meaningful text because it's unable to learn dependencies between words that are several steps apart.

36

37 Trick for exploding gradient: clipping trick The solution, first introduced by Mikolov, is to clip gradients to a maximum value. Makes a big difference in RNNs.
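A minimal sketch of norm clipping in the style of Pascanu et al. (Mikolov's original trick instead clips each component to a fixed range, e.g. np.clip(grad, -threshold, threshold)):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient norm exceeds the threshold, rescale the gradient to that norm."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```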

38 Gradient clipping intuition Figure from the paper "On the difficulty of training Recurrent Neural Networks", Pascanu et al. 2013: the error surface of a single-unit RNN run through a small number of time steps, with high-curvature walls. Solid lines: standard gradient descent trajectories. Dashed lines: gradients rescaled to a fixed size.

39 For vanishing gradients: Initialization + ReLUs! Initialize the W^(*) matrices to the identity matrix I and use f(z) = rect(z) = max(z, 0). [Figure: test accuracy vs. training steps on pixel-by-pixel permuted MNIST for LSTM, RNN + Tanh, and RNN + ReLUs (IRNN).] → Huge difference! The initialization idea was first introduced in "Parsing with Compositional Vector Grammars", Socher et al. 2013; new experiments with recurrent neural nets appeared a year ago (!) in "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units", Le et al. 2015.
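A small sketch of that recipe, identity-initialized recurrent weights plus ReLU activations (the input-weight scale is an arbitrary choice):

```python
import numpy as np

def init_irnn(hidden_dim, input_dim, input_scale=0.001):
    W_hh = np.eye(hidden_dim)                                  # recurrent weights = identity matrix I
    W_hx = np.random.normal(scale=input_scale, size=(hidden_dim, input_dim))
    return W_hh, W_hx

def irnn_step(x_t, h_prev, W_hh, W_hx):
    return np.maximum(0.0, W_hh @ h_prev + W_hx @ x_t)         # f(z) = rect(z) = max(z, 0)
```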

40 Perplexity Results KN5 = count-based language model with Kneser-Ney smoothing & 5-grams. Perplexity: the lower the better. Table from the paper "Extensions of recurrent neural network language model" by Mikolov et al. 2011.

41 Problem: Softmax is huge and slow Trick: class-based word prediction: p(w_t | history) = p(c_t | history) p(w_t | c_t) = p(c_t | h_t) p(w_t | c_t). The more classes, the better the perplexity but also the worse the speed: if the number of classes is as large as your vocabulary, you don't gain anything.
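A hedged sketch of the factorization (the data layout, a word-to-class map plus per-class word lists, is my own choice): each step scores the classes plus only the words inside the target word's class, instead of the full vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_based_probability(h_t, word, word_to_class, class_words, W_class, W_word):
    """p(word | h_t) = p(class | h_t) * p(word | class, h_t)."""
    c = word_to_class[word]
    p_class = softmax(W_class @ h_t)                    # distribution over classes
    members = class_words[c]                            # word indices belonging to class c
    p_word_within = softmax(W_word[members] @ h_t)      # distribution over words inside the class
    return p_class[c] * p_word_within[members.index(word)]
```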

42

43 Sequence modeling for other tasks Classify each word: NER, the sentiment of each word in its context, opinionated expressions. Example application and slides from the paper "Opinion Mining with Deep Recurrent Nets" by Irsoy and Cardie 2014.

44 Opinion Mining with Deep Recurrent Nets Goal: classify each word as direct subjective expressions (DSEs) and expressive subjective expressions (ESEs). DSE: explicit mentions of private states or speech events expressing private states. ESE: expressions that indicate sentiment, emotion, etc. without explicitly conveying them.

45 Example Annotation In BIO notation (tags are either begin-of-entity (B_X) or continuation-of-entity (I_X)): The committee, [as usual]_ESE, [has refused to make any statements]_DSE.

46 Approach: Recurrent Neural Network Notation from the paper (so you get used to different ones): h_t = f(W x_t + V h_{t-1} + b), y_t = g(U h_t + c). x represents a token (word) as a vector. y represents the output label (B, I or O); g = softmax! h is the memory, computed from the past memory and the current word; it summarizes the sentence up to that time.
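One step of this network in numpy, as a sketch (f taken to be tanh, g the softmax over the label set):

```python
import numpy as np

def tagger_step(x_t, h_prev, W, V, b, U, c):
    """h_t = f(W x_t + V h_{t-1} + b); y_t = g(U h_t + c)."""
    h_t = np.tanh(W @ x_t + V @ h_prev + b)
    scores = U @ h_t + c
    e = np.exp(scores - scores.max())
    y_t = e / e.sum()            # probabilities over the labels, e.g. {B_DSE, I_DSE, B_ESE, I_ESE, O}
    return h_t, y_t
```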

47 Bidirectional RNNs Problem: for classification you want to incorporate information from words both preceding and following. Ideas? [Figure: a bidirectional RNN layer; h now represents (summarizes) the past and future around a single token.]
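A minimal sketch of the idea: run one RNN left-to-right and another right-to-left over the token vectors, then represent each token by the concatenation of the two hidden states (function names and the parameter-dictionary layout are my own):

```python
import numpy as np

def rnn_step(x_t, h_prev, p):
    return np.tanh(p["W"] @ x_t + p["V"] @ h_prev + p["b"])

def bidirectional_states(xs, params_fwd, params_bwd):
    """xs: list of token vectors; returns one concatenated [forward; backward] state per token."""
    T, D_h = len(xs), params_fwd["V"].shape[0]
    h_fwd = [np.zeros(D_h)]
    for t in range(T):                                  # left-to-right pass
        h_fwd.append(rnn_step(xs[t], h_fwd[-1], params_fwd))
    h_bwd = [np.zeros(D_h)]
    for t in reversed(range(T)):                        # right-to-left pass
        h_bwd.append(rnn_step(xs[t], h_bwd[-1], params_bwd))
    h_bwd = h_bwd[1:][::-1]                             # align backward states with token positions
    return [np.concatenate([f, b]) for f, b in zip(h_fwd[1:], h_bwd)]
```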

48 Deep Bidirectional RNNs [Figure: stacked memory layers h^(1), h^(2), h^(3) between input x and output y.] Each memory layer passes an intermediate sequential representation to the next.

49 Data MPQA 1.2 corpus (Wiebe et al., 2005): 535 news articles (11,111 sentences) manually labeled at the phrase level (with DSEs and ESEs). Evaluation: F1.

50 Evaluation [Figure: F1 vs. number of layers, for two training-set sizes (24k and 200k).] Notice that after several layers the results start to get worse (overfitting?).

51 Summary Recurrent Neural Networks are powerful. Next up: Fancy Recursive Neural Networks! (advanced stuff)
