Neural Network Language Modeling

Size: px

Start display at page:

Download "Neural Network Language Modeling"

Abigayle Newton
5 years ago
Views:

1 Neural Network Language Modeling Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Marek Rei, Philipp Koehn and Noah Smith

2 Course Project Sign up your course project In-class presentation on next Friday, 5 minute each

3 Language Modeling (Recap) Goal: - calculate the probability of a sentence - calculate probability of a word in the sentence

4 N-gram Language Modeling (Recap)

5 Zero Probabilities (Recap) When we have sparse statistics: P(w denied the) 3 allegations 2 reports 1 claims 1 request 7 total allegations reports claims request attack man outcome Steal probability mass to generalize better P(w denied the) 2.5 allegations 1.5 reports 0.5 claims 0.5 request 2 other 7 total allegations reports claims request attack man outcome

6 Smoothing (Recap) Backoff Interpolation Kneser-Ney Smoothing

7 Problems with N-grams Problem 1: N-grams are sparse. - There are V 4 possible 4-grams. With V=10,000, that s grams. - We will only see a tiny fraction of them in our training data.

8 Problems with N-grams Problem 2: Words are independent. - It only map together identical words, but ignore similar or related words. - If we have seen yellow daffodil in the data, we could use the intuition that blue is related to yellow to handle blue daffodils.

9 Vector Representation Let s represent words (or any objects) as vectors. Let s choose them, so that similar words have similar vectors.

10 One-hot Word Vectors Each element represents the word.

11 One-hot Word Vectors That s a very large vector! They tell us very little.

12 Distributed Vectors (Representations) Each element represents a property, and they are shared between the words.

13 Distributed Vectors Groups similar words/objects together.

14 Distributed Vectors Use cosine similarity to calculate similarity between two words

15 Distributed Vectors Use cosine similarity to calculate similarity between two words

16 Distributed Vectors We can infer some information based only on the vector of the word.

17 Distributed Vectors We don t even need to know the labels on the vector elements.

18 Distributed Vectors The vectors are usually not 2 or 3-dimensional. More often dimensions.

19 Neural Network Language Models Represent each word as a vector, and similar words with similar vectors. Idea: similar contexts have similar words so we define a model that aims to predict between a word wt and context words: P(wt context) or P(context wt) Optimize the vectors together with the model, so we end up with vectors that perform well for language modeling (aka representation learning)

20 Artificial Neuron Perceptrons = Single-Layer Neural Networks!

21 Artificial Neuron

22 Neural Network Many neurons connected together Multi-Layer Feed-Forward Networks!

23 Neural Network Usually, a neuron is shown as a single unit

24 Neural Network Or a whole layer of neurons is represented as a block

25 Neuron Activation w/ Vectors

26 Feedforward Activation w/ Matrices The same process applies when activating multiple neurons Now the weights are in a matrix as opposed to a vector Activation f(z) is applied to each neuron separately

27 Feedforward Activation w/ Matrices

28 Neural Network

29 Feedforward Activation 1) Take vector from the previous layer 2) Multiply it with the weight matrix 3) Apply the activation function 4) Repeat

30 Feedforward Activation

31 Exercise

32 Input Layer

33 Hidden Layer Computation

34 Output Layer Computation

35 Error Computed output: y =.76 Correct output: t = 1.0 ) How do we adjust the weights?

36 Back-propagation Training Gradient descent error is a function of the weights we want to reduce the error gradient descent: move towards the error minimum compute gradient! get direction to the error minimum adjust weights towards direction of lower error Back-propagation first adjust last set of weights propagate error back to each previous layer adjust their weights

37 Final Layer Update Linear combination of weights s = P k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 1 2 (t y)2 Derivative of error with regard to one weight w k de = de dy ds dw k dy ds dw k

38 Final Layer Update (1) Linear combination of weights s = P k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 1 2 (t y)2 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k Error E is defined with respect to y de dy = d dy 1 2 (t y)2 = (t y)

39 Final Layer Update (2) Linear combination of weights s = P k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 1 2 (t y)2 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k y with respect to x is sigmoid(s) dy ds = d sigmoid(s) ds = sigmoid(s)(1 sigmoid(s)) = y(1 y)

40 Final Layer Update (3) Linear combination of weights s = P k w kh k Activation function y = sigmoid(s) Error (L2 norm) E = 1 2 (t y)2 Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k x is weighted linear combination of hidden node values h k ds dw k = d X w k h k = h k dw k k

41 Putting it Together Derivative of error with regard to one weight w k de = de dy dw k dy ds ds dw k = (t y) y(1 y) h k error derivative of sigmoid: y 0 Weight adjustment will be scaled by a fixed learning rate µ w k = µ (t y) y 0 h k

42 Multiple Output Nodes Our example only had one output node Typically neural networks have multiple output nodes Error is computed over all j output nodes E = X j 1 2 (t j y j ) 2 Weights k! j are adjusted according to the node they point to w j k = µ(t j y j ) y 0 j h k

43 Hidden Layer Update In a hidden layer, we do not have a target output value But we can compute how much each node contributed to downstream error Definition of error term of each node j =(t j y j ) y 0 j Back-propagate the error term (why this way? there is math to back it up...) i = X j w j i j y 0 i Universal update formula w j k = µ j h k

44 Final Layer Update (example) A B C D E F G.76 Computed output: y =.76 Correct output: t = 1.0 Final layer weight updates (learning rate µ = 10) G =(t y) y 0 =(1.76) =.0434 w GD = µ G h D = =.391 w GE = µ G h E = =.074 w GF = µ G h F = =.434

45 Final Layer Update (example) A B C D E F G.76 Computed output: y =.76 Correct output: t = 1.0 Final layer weight updates (learning rate µ = 10) G =(t y) y 0 =(1.76) =.0434 w GD = µ G h D = =.391 w GE = µ G h E = =.074 w GF = µ G h F = =.434

46 Hidden Layer Update (example) A B C D E F Hidden node D P D = j w j i j y 0 D = w GD G y 0 D = =.0175 w DA = µ D h A = =.175 w DB = µ D h B = =0 w DC = µ D h C = =.175 Hidden node E P E = j w j i j y 0 E = w GE G y 0 E = =.0464 w EA = µ E h A = =.464 etc. G.76

47 Initialization of Weights Weights are initialized randomly e.g., uniformly from interval [ 0.01, 0.01] Glorot and Bengio (2010) suggest for shallow neural networks 1 pn, 1 p n n is the size of the previous layer for deep neural networks p 6 p nj + n j+1, p 6 p nj + n j+1 n j is the size of the previous layer, n j size of next layer

48 Neural Networks for Classification Predict class: one output node per class Training data output: One-hot vector, e.g., ~y =(0, 0, 1) T Prediction predicted class is output node y i with highest value obtain posterior probability distribution by soft-max softmax(y i )= ey i P j ey j

49 Feedforward Neural Network Language Model Input: vector representations of previous words E(wi-3) E(wi-2) E (wi-1) Output: the conditional probability of wj being the next word

50 Feedforward Neural Network Language Model We can also think of the input as a concatenation of the context vectors The hidden layer h is calculated as in previous examples How do we calculate?

51 Feedforward Neural Network Language Model Our output vector o has an element for each possible word wj We take a softmax over that vector How do we calculate?

52 Feedforward Neural Network Language Model Our output vector o has an element for each possible word wj We take a softmax over that vector

53 Feedforward Neural Network Language Model 1) Multiple input vectors with weights 2) Apply the activation function Bengio et al. (2003)

54 Feedforward Neural Network Language Model softmax 3) Multiply hidden vector with output weights tanh 4) Apply softmax to the output vector Now the j-th element in the output vector o j contains the probability of w j being the next word. Bengio et al. (2003)

55 Feedforward Neural Network Language Model softmax Like a log-linear language model with two kinds of features: tanh 1) concatenation of context-word embedding vectors 2) tanh-affine transformation of the above Bengio et al. (2003) one hot vectors

56 Feedforward Neural Network Language Model softmax M u, T tanh tanh W softmax b, A one hot vectors Bengio et al. (2003)

57 Neural Network: Definitions Warning: there is no widely accepted standard notation! Afeedforwardneuralnetworkn is defined by: I A function family that maps parameter values to functions of the form n : R d in! R d out ;typically: I Non-linear I Di erentiable with respect to its inputs I Assembled through a series of a ne transformations and non-linearities, composed together I Symbolic/discrete inputs handled through lookups. I Parameter values I I Typically a collection of scalars, vectors, and matrices We often assume they are linearized into R D

58 A Couple of Useful Functions I softmax : R k! R k hx 1,x 2,...,x k i 7! * e x 1 P k j=1 ex j, e x 2 P k j=1 ex j,..., e x k P k j=1 ex j + I tanh : R! [ 1, 1] e x x 7! ex e x + e x Generalized to be elementwise, sothatitmapsr k! [ 1, 1] k. I Others include: ReLUs, logistic sigmoids, PReLUs,...

59 Feedforward Neural Network Language Model (Bengio et al., 2003) Define the n-gram probability as follows: M u, T tanh W softmax p( hh 1,...,h n 1 i)=n he h1,...,e hn 1 i = 0 0 V + Xn 1 j=1 e hj V > MA j V d d V + W V H H + Xn 1 j=1 e > h j M T j AA d H 11 where each e hj 2 R V is a one-hot vector and H is the number of hidden units in the neural network (a hyperparameter ). b, A Parameters include: I M 2 R V d, which are called embeddings (row vectors), one for every word in V I Feedforward NN parameters b 2 R V, A 2 R (n 1) d V, W 2 R V H, u 2 R H, T 2 R (n 1) d H

60 Breaking It Down M u, T tanh Look up each of the history words h j, 8j 2 {1,...,n 1} in M; keep two copies. Rename the embedding for h j as m hj. W softmax e hj > M = m hj e hj > M = m hj b, A

61 Breaking It Down tanh M u, T Apply an a ne transformation to the second copy of the history-word embeddings (u, T) W softmax m hj b, A u H + Xn 1 j=1 m hj T j d H

62 Breaking It Down tanh M u, T Apply an a ne transformation to the second copy of the history-word embeddings (u, T) andatanh nonlinearity. W softmax m hj 0 u + Xn 1 m hj T j 1 A b, A j=1

63 Breaking It Down tanh M u, T Apply an a ne transformation to everything (b, A, W). b, A W softmax b V + + W V H Xn 1 j=1 m hj A j 0 d V u + Xn 1 j=1 m hj T j 1 A

64 Breaking It Down M u, T tanh Apply a softmax transformation to make the vector sum to one. b, A W softmax 0 b + Xn 1 j=1 m hj A j 0 + W u + Xn 1 j=1 m hj T j 1 1 A A

65 Breaking It Down M u, T tanh W softmax 0 b + Xn 1 j=1 m hj A j 0 + W u + Xn 1 j=1 m hj T j 1 1 A A Like a log-linear language model with two kinds of features: b, A I Concatenation of context-word embeddings vectors m hj I tanh-a ne transformation of the above New parameters arise from (i) embeddings and (ii) a ne transformation inside the nonlinearity.

66 Number of Parameters M u, T b, A tanh {z} {z} W softmax {z} {z} {z } {z} {z} {z } D = {z} V d + {z} V +(n 1)dV + V H {z } {z} + {z} H +(n 1)dH {z } M b A W u T I V (after OOV processing) I d 2 {30, 60} I H 2 {50, 100} I n 1=5 So D = 461V parameters, compared to O(V n ) for classical n-gram models. I Forcing A = 0 eliminated 300V parameters and performed a bit better, but was slower to converge. I If we averaged m hj instead of concatenating, we d get to 221V (this is a variant of continuous bag of words, Mikolov et al., 2013). 15 / 5

67 Why does it work?

Introduction to Neural Networks

Introduction to Neural Networks Philipp Koehn 4 April 205 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models can