Natural Language Processing (CSEP 517): Language Models, Continued


1 Natural Language Processing (CSEP 517): Language Models, Continued. Noah Smith, © 2017 University of Washington, nasmith@cs.washington.edu. April 3, 2017. 1 / 102

2 To-Do List
Online quiz: due Sunday
Print, sign, and return the academic integrity statement (if you haven't already)
Read: Smith (2017); optionally, Jurafsky and Martin (2016), Collins (2011), §2, and Goldberg (2015), §§0–4, if you want to know more about neural networks
A1 now due April 9 (Sunday)
Late policy: four late days
2 / 102

3 Language Models: Definitions
V is a finite set of (discrete) symbols ("words" or possibly characters); V = |V|
V† is the (infinite) set of sequences of symbols from V whose final symbol is the stop symbol
p : V† → R, such that:
  For any x ∈ V†, p(x) ≥ 0
  Σ_{x ∈ V†} p(X = x) = 1
(I.e., p is a proper probability distribution.)
Language modeling: estimate p from examples, x̄_{1:n} = ⟨x̄_1, x̄_2, ..., x̄_n⟩.
Evaluation on test data x̄_{1:m}: perplexity, 2^{−(1/M) Σ_{i=1}^{m} log_2 p(x̄_i)}
3 / 102
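To make the perplexity formula concrete, here is a minimal Python sketch (not from the slides), assuming a hypothetical `log2_prob` function that returns log_2 p(x̄_i) for one tokenized sequence:

```python
import math

def perplexity(test_sequences, log2_prob):
    """Perplexity = 2^(-(1/M) * sum_i log2 p(x_i)), where M is the total
    number of tokens (counting the stop symbol) in the test data."""
    M = sum(len(x) for x in test_sequences)                 # total token count
    total_log2 = sum(log2_prob(x) for x in test_sequences)  # sum_i log2 p(x_i)
    return 2 ** (-total_log2 / M)

# Toy check: a uniform model over a 10-word vocabulary has perplexity 10.
uniform = lambda x: len(x) * math.log2(1.0 / 10)
print(perplexity([["the", "man", "</s>"]], uniform))        # -> 10.0
```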

4 Log-Linear Models: Definitions
We define a conditional log-linear model p(Y | X) as:
  Y is the set of events/outputs (for language modeling, V)
  X is the set of contexts/inputs (for n-gram language modeling, V^{n−1})
  φ : X × Y → R^d is a feature vector function
  w ∈ R^d are the model parameters
p_w(Y = y | X = x) = exp(w · φ(x, y)) / Σ_{y' ∈ Y} exp(w · φ(x, y'))
4 / 102

5 Breaking It Down
p_w(Y = y | X = x) = exp(w · φ(x, y)) / Σ_{y' ∈ Y} exp(w · φ(x, y'))
5 / 102

6 Breaking It Down
p_w(Y = y | X = x) = exp(w · φ(x, y)) / Σ_{y' ∈ Y} exp(w · φ(x, y'))
  linear score: w · φ(x, y)
6 / 102

7 Breaking It Down
p_w(Y = y | X = x) = exp(w · φ(x, y)) / Σ_{y' ∈ Y} exp(w · φ(x, y'))
  linear score: w · φ(x, y)
  nonnegative: exp(w · φ(x, y))
7 / 102

8 Breaking It Down
p_w(Y = y | X = x) = exp(w · φ(x, y)) / Σ_{y' ∈ Y} exp(w · φ(x, y'))
  linear score: w · φ(x, y)
  nonnegative: exp(w · φ(x, y))
  normalizer: Σ_{y' ∈ Y} exp(w · φ(x, y')) = Z_w(x)
8 / 102

9 Breaking It Down
p_w(Y = y | X = x) = exp(w · φ(x, y)) / Σ_{y' ∈ Y} exp(w · φ(x, y'))
  linear score: w · φ(x, y)
  nonnegative: exp(w · φ(x, y))
  normalizer: Σ_{y' ∈ Y} exp(w · φ(x, y')) = Z_w(x)
"Log-linear" comes from the fact that:
  log p_w(Y = y | X = x) = w · φ(x, y) − log Z_w(x), where log Z_w(x) is constant in y.
This is an instance of the family of generalized linear models.
9 / 102
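To make the pieces above concrete, here is a minimal sketch (not from the slides) of computing p_w(y | x), assuming a hypothetical dense feature function `features(x, y)` that returns a numpy vector in R^d and a list `Y` of candidate outputs:

```python
import numpy as np

def log_linear_prob(w, x, y, Y, features):
    """p_w(y | x) = exp(w . phi(x, y)) / sum_{y' in Y} exp(w . phi(x, y'))."""
    scores = np.array([w @ features(x, y_prime) for y_prime in Y])  # linear scores
    scores -= scores.max()       # shift for numerical stability (does not change the ratio)
    exp_scores = np.exp(scores)  # nonnegative terms
    Z = exp_scores.sum()         # normalizer Z_w(x)
    return exp_scores[Y.index(y)] / Z
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but keeps exp from overflowing.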

10 The Geometric View
Suppose we have instance x, Y = {y_1, y_2, y_3, y_4}, and there are only two features, φ_1 and φ_2.
(Figure: the points φ(x, y_1), φ(x, y_2), φ(x, y_3), φ(x, y_4) plotted in the φ_1–φ_2 plane.)
As a simple example, let the two features be binary functions.
10 / 102

11 The Geometric View
Suppose we have instance x, Y = {y_1, y_2, y_3, y_4}, and there are only two features, φ_1 and φ_2.
(Same figure, now with the line w · φ = w_1 φ_1 + w_2 φ_2 = 0 drawn.)
11 / 102

12 The Geometric View
(Same figure.)
distance(w · φ = 0, φ_0) = (w · φ_0) / ||w||_2 ∝ w · φ_0
12 / 102

13 The Geometric View
(Same figure.)
w · φ(x, y_1) > w · φ(x, y_3) > w · φ(x, y_4) > 0 > w · φ(x, y_2)
13 / 102

14 The Geometric View
(Same figure.)
p_w(y_1 | x) > p_w(y_3 | x) > p_w(y_4 | x) > p_w(y_2 | x)
14 / 102

15 The Geometric View
(Figure only.)
15 / 102

16 The Geometric View
(Same figure.)
p_w(y_3 | x) > p_w(y_1 | x) > p_w(y_2 | x) > p_w(y_4 | x)
16 / 102

17 The Geometric View
(Same figure.)
Log-linear parameter estimation tries to choose w so that p_w(Y | x) matches the empirical distribution, c(x, y) / c(x).
17 / 102

18 Why Build Language Models This Way?
Exploit features of histories for sharing of statistical strength and better smoothing (Lau et al., 1993)
Condition the whole text on more interesting variables, like the gender, age, or political affiliation of the author (Eisenstein et al., 2011)
Interpretability! Each feature φ_k controls a factor in the probability (e^{w_k}):
  If w_k < 0, then φ_k makes the event less likely (the factor e^{w_k} < 1).
  If w_k > 0, then φ_k makes the event more likely, by a factor of e^{w_k}.
  If w_k = 0, then φ_k has no effect.
18 / 102

19 Log-Linear n-gram Models
p_w(X = x) = Π_{j=1}^{l} p_w(X_j = x_j | X_{0:j−1} = x_{0:j−1})
           = Π_{j=1}^{l} exp(w · φ(x_{0:j−1}, x_j)) / Z_w(x_{0:j−1})
  (assumption)
           = Π_{j=1}^{l} exp(w · φ(x_{j−n+1:j−1}, x_j)) / Z_w(x_{j−n+1:j−1})
           = Π_{j=1}^{l} exp(w · φ(h_j, x_j)) / Z_w(h_j)
19 / 102

20 Example
The man who knew too ___
  much
  many
  little
  few
  .
  hippopotamus
20 / 102

21 What Features in φ(X_{j−n+1:j−1}, X_j)?
21 / 102

22 What Features in φ(X_{j−n+1:j−1}, X_j)?
Traditional n-gram features: X_{j−1} = the ∧ X_j = man
22 / 102

23 What Features in φ(X_{j−n+1:j−1}, X_j)?
Traditional n-gram features: X_{j−1} = the ∧ X_j = man
Gappy n-grams: X_{j−2} = the ∧ X_j = man
23 / 102

24 What Features in φ(X_{j−n+1:j−1}, X_j)?
Traditional n-gram features: X_{j−1} = the ∧ X_j = man
Gappy n-grams: X_{j−2} = the ∧ X_j = man
Spelling features: X_j's first character is capitalized
24 / 102

25 What Features in φ(X_{j−n+1:j−1}, X_j)?
Traditional n-gram features: X_{j−1} = the ∧ X_j = man
Gappy n-grams: X_{j−2} = the ∧ X_j = man
Spelling features: X_j's first character is capitalized
Class features: X_j is a member of class 132
25 / 102

26 What Features in φ(X_{j−n+1:j−1}, X_j)?
Traditional n-gram features: X_{j−1} = the ∧ X_j = man
Gappy n-grams: X_{j−2} = the ∧ X_j = man
Spelling features: X_j's first character is capitalized
Class features: X_j is a member of class 132
Gazetteer features: X_j is listed as a geographic place name
26 / 102

27 What Features in φ(X_{j−n+1:j−1}, X_j)?
Traditional n-gram features: X_{j−1} = the ∧ X_j = man
Gappy n-grams: X_{j−2} = the ∧ X_j = man
Spelling features: X_j's first character is capitalized
Class features: X_j is a member of class 132
Gazetteer features: X_j is listed as a geographic place name
You can define any features you want!
  Too many features, and your model will overfit
  Too few (good) features, and your model will not learn
27 / 102

28 What Features in φ(X_{j−n+1:j−1}, X_j)?
Traditional n-gram features: X_{j−1} = the ∧ X_j = man
Gappy n-grams: X_{j−2} = the ∧ X_j = man
Spelling features: X_j's first character is capitalized
Class features: X_j is a member of class 132
Gazetteer features: X_j is listed as a geographic place name
You can define any features you want!
  Too many features, and your model will overfit
    Feature selection methods, e.g., ignoring features with very low counts, can help.
  Too few (good) features, and your model will not learn
28 / 102
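For illustration only (nothing here is from the slides), a hypothetical sparse feature function mixing several of these feature types might look like the following; the `GAZETTEER` set, the class lookup, and the feature names are made up:

```python
GAZETTEER = {"seattle", "tacoma", "spokane"}      # toy stand-in for a place-name list
WORD_CLASS = {"man": 132, "woman": 132}           # toy stand-in for a word-clustering table

def features(history, word):
    """Sparse binary features phi(h, x): `history` is the tuple of n-1 previous
    words, `word` is the candidate X_j; returns {feature name: value}."""
    phi = {}
    phi[f"ngram:{history[-1]}^{word}"] = 1.0               # traditional n-gram feature
    if len(history) >= 2:
        phi[f"gappy:{history[-2]}^_^{word}"] = 1.0         # gappy n-gram feature
    if word[:1].isupper():
        phi["spelling:capitalized"] = 1.0                  # spelling feature
    if word in WORD_CLASS:
        phi[f"class:{WORD_CLASS[word]}"] = 1.0             # class feature
    if word.lower() in GAZETTEER:
        phi["gazetteer:place"] = 1.0                       # gazetteer feature
    return phi
```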

29 Feature Engineering Many advances in NLP (not just language modeling) have come from careful design of features. 29 / 102

30 Feature Engineering Many advances in NLP (not just language modeling) have come from careful design of features. Sometimes feature engineering is used pejoratively. 30 / 102

31 Feature Engineering Many advances in NLP (not just language modeling) have come from careful design of features. Sometimes feature engineering is used pejoratively. Some people would rather not spend their time on it! 31 / 102

32 Feature Engineering Many advances in NLP (not just language modeling) have come from careful design of features. Sometimes feature engineering is used pejoratively. Some people would rather not spend their time on it! There is some work on automatically inducing features (Della Pietra et al., 1997). 32 / 102

33 Feature Engineering Many advances in NLP (not just language modeling) have come from careful design of features. Sometimes feature engineering is used pejoratively. Some people would rather not spend their time on it! There is some work on automatically inducing features (Della Pietra et al., 1997). More recent work in neural networks can be seen as discovering features (instead of engineering them). 33 / 102

34 Feature Engineering Many advances in NLP (not just language modeling) have come from careful design of features. Sometimes feature engineering is used pejoratively. Some people would rather not spend their time on it! There is some work on automatically inducing features (Della Pietra et al., 1997). More recent work in neural networks can be seen as discovering features (instead of engineering them). But in much of NLP, there's a strong preference for interpretable features. 34 / 102

35 How to Estimate w?
              n-gram                                          log-linear n-gram
p_θ(x)        Π_{j=1}^{l} θ_{x_j | h_j}                       Π_{j=1}^{l} exp(w · φ(h_j, x_j)) / Z_w(h_j)
Parameters    θ_{v|h} for v ∈ V, h ∈ (V ∪ {boundary})^{n−1}   w_k for k ∈ {1, ..., d}
MLE           θ̂_{v|h} = c(hv) / c(h)                          no closed form
35 / 102

36 MLE for w
Let training data consist of {(h_i, x_i)}_{i=1}^{N}.
36 / 102

37 MLE for w
Let training data consist of {(h_i, x_i)}_{i=1}^{N}.
Maximum likelihood estimation is:
max_{w ∈ R^d} Σ_{i=1}^{N} log p_w(x_i | h_i)
  = max_{w ∈ R^d} Σ_{i=1}^{N} log [ exp(w · φ(h_i, x_i)) / Z_w(h_i) ]
  = max_{w ∈ R^d} Σ_{i=1}^{N} [ w · φ(h_i, x_i) − log Σ_{v ∈ V} exp(w · φ(h_i, v)) ]
    (the inner sum is Z_w(h_i))
37 / 102

38 MLE for w
(Same objective as above.)
This is concave in w.
38 / 102

39 MLE for w
(Same objective as above.)
This is concave in w.
Z_w(h_i) involves a sum over V terms.
39 / 102

40 MLE for w
max_{w ∈ R^d} Σ_{i=1}^{N} [ w · φ(h_i, x_i) − log Z_w(h_i) ]   (call the ith summand f_i(w))
40 / 102

41 MLE for w
max_{w ∈ R^d} Σ_{i=1}^{N} f_i(w), where f_i(w) = w · φ(h_i, x_i) − log Z_w(h_i)
"Hope/fear" view: for each instance i,
  increase the score of the correct output x_i: score(x_i) = w · φ(h_i, x_i)
  decrease the "softened max" score overall: log Σ_{v ∈ V} exp(score(v))
41 / 102

42 MLE for w
max_{w ∈ R^d} Σ_{i=1}^{N} f_i(w), where f_i(w) = w · φ(h_i, x_i) − log Z_w(h_i)
Gradient view:
  ∇_w f_i = φ(h_i, x_i) − Σ_{v ∈ V} p_w(v | h_i) φ(h_i, v)
          = observed features − expected features
Setting this to zero means getting the model's expectations to match empirical observations.
42 / 102
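A minimal sketch (not from the slides) of this gradient, using dict-valued weights and a dict-valued feature function such as the hypothetical `features` sketch earlier:

```python
import math
from collections import defaultdict

def grad_f_i(w, h_i, x_i, V, features):
    """Gradient of f_i(w) = w . phi(h_i, x_i) - log Z_w(h_i):
    observed features minus features expected under p_w(. | h_i)."""
    def score(v):
        return sum(w.get(k, 0.0) * val for k, val in features(h_i, v).items())
    scores = {v: score(v) for v in V}
    m = max(scores.values())
    Z = sum(math.exp(s - m) for s in scores.values())       # stable normalizer
    grad = defaultdict(float)
    for k, val in features(h_i, x_i).items():               # observed features
        grad[k] += val
    for v in V:                                             # minus expected features
        p_v = math.exp(scores[v] - m) / Z
        for k, val in features(h_i, v).items():
            grad[k] -= p_v * val
    return grad
```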

43 MLE for w: Algorithms Batch methods (L-BFGS is popular) Stochastic gradient ascent/descent more common today, especially with special tricks for adapting the step size over time Many specialized methods (e.g., iterative scaling ) 43 / 102

44 Stochastic Gradient Descent
Goal: minimize Σ_{i=1}^{N} f_i(w) with respect to w.
Input: initial value w, number of epochs T, learning rate α
For t ∈ {1, ..., T}:
  Choose a random permutation π of {1, ..., N}.
  For i ∈ {1, ..., N}:
    w ← w − α ∇_w f_{π(i)}(w)
Output: w
44 / 102
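The pseudocode translates almost line for line into Python; a sketch, assuming `grad_f(w, example)` returns the gradient of f_i at the current w as a dict:

```python
import random

def sgd(w, data, grad_f, epochs=10, alpha=0.1):
    """Minimize sum_i f_i(w) by stochastic gradient descent."""
    for _ in range(epochs):                          # t in {1, ..., T}
        order = list(range(len(data)))
        random.shuffle(order)                        # random permutation pi
        for i in order:
            g = grad_f(w, data[i])                   # gradient of f_{pi(i)} at current w
            for k, val in g.items():
                w[k] = w.get(k, 0.0) - alpha * val   # w <- w - alpha * grad
    return w
```

To maximize the log-likelihood objective from the previous slides, apply this to the negated per-instance objective, i.e., pass in the negated gradient.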

45 Avoiding Overfitting
Maximum likelihood estimation:
max_{w ∈ R^d} Σ_{i=1}^{N} [ w · φ(h_i, x_i) − log Z_w(h_i) ]
If φ_j(h, x) is (almost) always positive, we can always increase the objective (a little bit) by increasing w_j toward +∞.
45 / 102

46 Avoiding Overfitting
Maximum likelihood estimation:
max_{w ∈ R^d} Σ_{i=1}^{N} [ w · φ(h_i, x_i) − log Z_w(h_i) ]
If φ_j(h, x) is (almost) always positive, we can always increase the objective (a little bit) by increasing w_j toward +∞.
Standard solution is to add a regularization term:
max_{w ∈ R^d} Σ_{i=1}^{N} [ w · φ(h_i, x_i) − log Σ_{v ∈ V} exp(w · φ(h_i, v)) ] − λ ||w||_p^p
where λ > 0 is a hyperparameter and p = 2 or 1.
46 / 102

47 MLE for w
If we had more time, we'd study this problem more carefully! Here's what you must remember:
  There is no closed form; you must use a numerical optimization algorithm like stochastic gradient descent.
  Log-linear models are powerful but expensive (Z_w(h_i)).
  Regularization is very important; we don't actually do MLE.
    Just like for n-gram models! Only even more so, since log-linear models are even more expressive.
47 / 102

48 Quick Recap
Two kinds of language models so far:
                  n-gram                           log-linear
representation?   h_i is (n−1) previous symbols    featurized representation of h_i, x_i
estimation?       count and normalize              iterative gradient descent
think about?      smoothing                        features
48 / 102

49 Neural Network: Definitions
Warning: there is no widely accepted standard notation!
A feedforward neural network n_ν is defined by:
  A function family that maps parameter values to functions of the form n : R^{d_in} → R^{d_out}; typically:
    Non-linear
    Differentiable with respect to its inputs
    Assembled through a series of affine transformations and non-linearities, composed together
    Symbolic/discrete inputs handled through lookups.
  Parameter values ν
    Typically a collection of scalars, vectors, and matrices
    We often assume they are linearized into R^D
49 / 102

50 A Couple of Useful Functions
softmax : R^k → R^k
  ⟨x_1, x_2, ..., x_k⟩ ↦ ⟨ e^{x_1} / Σ_{j=1}^{k} e^{x_j}, e^{x_2} / Σ_{j=1}^{k} e^{x_j}, ..., e^{x_k} / Σ_{j=1}^{k} e^{x_j} ⟩
tanh : R → [−1, 1]
  x ↦ (e^x − e^{−x}) / (e^x + e^{−x})
  Generalized to be elementwise, so that it maps R^k → [−1, 1]^k.
Others include: ReLUs, logistic sigmoids, PReLUs, ...
50 / 102
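Both functions are one-liners with numpy; a minimal sketch (the max-subtraction inside softmax is a standard numerical-stability trick, not something the slide states):

```python
import numpy as np

def softmax(x):
    """Map a score vector in R^k to a probability vector in R^k."""
    e = np.exp(x - np.max(x))        # shifting by the max leaves the result unchanged
    return e / e.sum()

def tanh(x):
    """Elementwise tanh, mapping R^k to [-1, 1]^k."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(softmax(np.array([1.0, 2.0, 3.0])).sum())                                  # 1.0
print(np.allclose(tanh(np.array([-2.0, 0.0, 2.0])), np.tanh([-2.0, 0.0, 2.0])))  # True
```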

51 One Hot Vectors
Arbitrarily order the words in V, giving each an index in {1, ..., V}.
Let e_i ∈ R^V contain all zeros, with the exception of a 1 in position i.
This is the "one hot" vector for the ith word in V.
51 / 102

52 Feedforward Neural Network Language Model (Bengio et al., 2003)
Define the n-gram probability as follows:
p(· | h_1, ..., h_{n−1}) = n_ν(e_{h_1}, ..., e_{h_{n−1}})
  = softmax( b + Σ_{j=1}^{n−1} e_{h_j} M A_j + W tanh( u + Σ_{j=1}^{n−1} e_{h_j} M T_j ) )
where each e_{h_j} ∈ R^V is a one-hot vector and H is the number of hidden units in the neural network (a "hyperparameter").
Parameters ν include:
  M ∈ R^{V×d}, which are called "embeddings" (row vectors), one for every word in V
  Feedforward NN parameters b ∈ R^V, A ∈ R^{(n−1)d×V}, W ∈ R^{V×H}, u ∈ R^H, T ∈ R^{(n−1)d×H}
52 / 102

53 Breaking It Down
Look up each of the history words h_j, j ∈ {1, ..., n−1} in M; keep two copies: e_{h_j} M and e_{h_j} M.
53 / 102

54 Breaking It Down
Look up each of the history words h_j, j ∈ {1, ..., n−1} in M; keep two copies.
Rename the embedding for h_j as m_{h_j}: e_{h_j} M = m_{h_j}.
54 / 102

55 Breaking It Down
Apply an affine transformation to the second copy of the history-word embeddings (u, T):
  u + Σ_{j=1}^{n−1} m_{h_j} T_j
55 / 102

56 Breaking It Down
Apply an affine transformation to the second copy of the history-word embeddings (u, T) and a tanh nonlinearity:
  tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j )
56 / 102

57 Breaking It Down
Apply an affine transformation to everything (b, A, W):
  b + Σ_{j=1}^{n−1} m_{h_j} A_j + W tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j )
57 / 102

58 Breaking It Down
Apply a softmax transformation to make the vector sum to one:
  softmax( b + Σ_{j=1}^{n−1} m_{h_j} A_j + W tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j ) )
58 / 102

59 Breaking It Down
softmax( b + Σ_{j=1}^{n−1} m_{h_j} A_j + W tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j ) )
Like a log-linear language model with two kinds of "features":
  Concatenation of context-word embedding vectors m_{h_j}
  tanh-affine transformation of the above
New parameters arise from (i) embeddings and (ii) the affine transformation inside the nonlinearity.
59 / 102
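Putting the walkthrough together, here is a minimal numpy sketch of the forward pass; this is not the authors' code, and the sizes V, d, H, n and the random initialization are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H, n = 1000, 30, 50, 6                 # vocab size, embedding dim, hidden units, n-gram order

M = rng.normal(size=(V, d))                  # word embeddings (one row per word)
b = np.zeros(V)
A = rng.normal(size=(n - 1, d, V)) * 0.01    # direct connections; A_j is d x V
W = rng.normal(size=(V, H)) * 0.01
u = np.zeros(H)
T = rng.normal(size=(n - 1, d, H)) * 0.01    # T_j is d x H

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ffnn_lm_probs(history):
    """p(. | h_1, ..., h_{n-1}) for a list of n-1 word indices."""
    m = M[history]                                               # (n-1, d): look up embeddings
    hidden = np.tanh(u + sum(m[j] @ T[j] for j in range(n - 1)))
    scores = b + sum(m[j] @ A[j] for j in range(n - 1)) + W @ hidden
    return softmax(scores)                                       # length-V vector, sums to 1

p = ffnn_lm_probs([1, 2, 3, 4, 5])
print(p.shape, round(float(p.sum()), 6))     # (1000,) 1.0
```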

60 Visualization
(Figure: diagram of the architecture; one-hot inputs pass through the embedding lookup M, then the affine layer (u, T) and tanh, then the affine layer (b, A, W), then softmax.)
60 / 102

61 Number of Parameters
D = Vd (M) + V (b) + (n−1)dV (A) + VH (W) + H (u) + (n−1)dH (T)
For Bengio et al. (2003):
  V (after OOV processing)
  d ∈ {30, 60}
  H ∈ {50, 100}
  n − 1 = 5
So D = 461V parameters, compared to O(V^n) for classical n-gram models.
Forcing A = 0 eliminated 300V parameters and performed a bit better, but was slower to converge.
If we averaged the m_{h_j} instead of concatenating, we'd get to 221V (this is a variant of "continuous bag of words"; Mikolov et al., 2013).
61 / 102
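The 461V figure is just arithmetic with d = 60, H = 100, and n − 1 = 5; a quick check (a sketch that drops the V-independent terms H and (n−1)dH):

```python
d, H, n_minus_1 = 60, 100, 5

# Coefficient of V in D = Vd + V + (n-1)dV + VH + H + (n-1)dH.
coef = d + 1 + n_minus_1 * d + H
print(coef)                      # 461 -> D is roughly 461V
print(coef - n_minus_1 * d)      # 161 -> forcing A = 0 removes the 300V term
print(d + 1 + d + H)             # 221 -> averaging the m_{h_j} instead of concatenating
```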

62 Why does it work? 62 / 102

63 Why does it work?
Historical answer: multiple layers and nonlinearities allow feature combinations a linear model can't get.
63 / 102

64 Why does it work?
Historical answer: multiple layers and nonlinearities allow feature combinations a linear model can't get.
  Suppose we want y = xor(x_1, x_2); this can't be expressed as a linear function of x_1 and x_2.
64 / 102

65 xor Example
(Figure: the four points (x_1, x_2) ∈ {0, 1}² plotted; tuples where y = xor(x_1, x_2) are red, tuples where y ≠ xor(x_1, x_2) are blue.)
65 / 102

66 Why does it work?
Historical answer: multiple layers and nonlinearities allow feature combinations a linear model can't get.
  Suppose we want y = xor(x_1, x_2); this can't be expressed as a linear function of x_1 and x_2. But:
    z = x_1 ∧ x_2
    y = x_1 + x_2 − 2z
66 / 102
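A quick check (a sketch) that this construction reproduces xor on all four inputs:

```python
from itertools import product

for x1, x2 in product((0, 1), repeat=2):
    z = x1 * x2                     # z = x_1 AND x_2 on 0/1 inputs
    y = x1 + x2 - 2 * z             # the construction from the slide
    assert y == (x1 ^ x2)           # matches xor
    print(x1, x2, y)
```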

67 xor Example (D = 13)
Credit: Chris Dyer
(Figure: mean squared error vs. iterations for the two fits
  min_{v,a,W,b} Σ_{x_1 ∈ {0,1}} Σ_{x_2 ∈ {0,1}} ( xor(x_1, x_2) − ( v · ( W x + b ) + a ) )²
and
  min_{v,a,W,b} Σ_{x_1 ∈ {0,1}} Σ_{x_2 ∈ {0,1}} ( xor(x_1, x_2) − ( v · tanh( W x + b ) + a ) )²
where x = [x_1, x_2]ᵀ ∈ R², W ∈ R^{3×2}, v, b ∈ R³, a ∈ R.)
67 / 102

68 Why does it work?
Historical answer: multiple layers and nonlinearities allow feature combinations a linear model can't get.
  Suppose we want y = xor(x_1, x_2); this can't be expressed as a linear function of x_1 and x_2. But: z = x_1 ∧ x_2; y = x_1 + x_2 − 2z.
With high-dimensional inputs, there are a lot of conjunctive features to search through. For log-linear models, Della Pietra et al. (1997) did this, greedily.
68 / 102

69 Why does it work?
(Same points as above, plus:) Neural models seem to smoothly explore lots of approximately-conjunctive features.
69 / 102

70 Why does it work?
(Same points as above, plus:) Modern answer: representations of words and histories are tuned to the prediction problem.
70 / 102

71 Why does it work?
(Same points as above, plus:) Word embeddings: a powerful idea...
71 / 102

72 Important Idea: Words as Vectors
The idea of embedding words in R^d is much older than neural language models.
72 / 102

73 Important Idea: Words as Vectors
The idea of embedding words in R^d is much older than neural language models.
You should think of this as a generalization of the discrete view of V.
73 / 102

74 Important Idea: Words as Vectors
The idea of embedding words in R^d is much older than neural language models.
You should think of this as a generalization of the discrete view of V.
Considerable ongoing research on learning word representations to capture linguistic similarity (Turney and Pantel, 2010); this is known as vector space semantics.
74 / 102

75 Words as Vectors: Example
(Figure: a two-dimensional plot of word vectors for "baby" and "cat".)
75 / 102

76 Words as Vectors: Example
(Figure: the same plot, now also showing "pig" and "mouse".)
76 / 102

77 Parameter Estimation Bad news for neural language models: Log-likelihood function is not concave. So any perplexity experiment is evaluating the model and an algorithm for estimating it. Calculating log-likelihood and its gradient is very expensive (5 epochs took 3 weeks on 40 CPUs). 77 / 102

78 Parameter Estimation Bad news for neural language models: Log-likelihood function is not concave. So any perplexity experiment is evaluating the model and an algorithm for estimating it. Calculating log-likelihood and its gradient is very expensive (5 epochs took 3 weeks on 40 CPUs). Good news: n ν is differentiable with respect to M (from which its inputs come) and ν (its parameters), so gradient-based methods are available. Essential: the chain rule from calculus (sometimes called backpropagation ) Lots more details in Bengio et al. (2003) and (for NNs more generally) in Goldberg (2015). 78 / 102

79 Next Up More examples of neural language models (in brief): The log-bilinear language model Recurrent neural network language models 79 / 102

80 Log-Bilinear Language Model (Mnih and Hinton, 2007)
Define the n-gram probability as follows, for each v ∈ V:
p(v | h_1, ..., h_{n−1}) = exp( ( Σ_{j=1}^{n−1} m_{h_j} A_j + b ) · m_v + c_v ) / Σ_{v' ∈ V} exp( ( Σ_{j=1}^{n−1} m_{h_j} A_j + b ) · m_{v'} + c_{v'} )
where m_{h_j}, m_v, b ∈ R^d and each A_j ∈ R^{d×d}.
80 / 102

81 Log-Bilinear Language Model (Mnih and Hinton, 2007)
(Same definition as above.)
Number of parameters: D = Vd (M) + (n−1)d² (A) + d (b) + V (c)
81 / 102

82 Log-Bilinear Language Model (Mnih and Hinton, 2007)
(Same definition and parameter count.)
The predicted word's probability depends on its vector m_v, not just on the vectors of the history words.
82 / 102

83 Log-Bilinear Language Model (Mnih and Hinton, 2007)
(Same as above, plus:) Training this model involves a sum over the vocabulary (like the log-linear models we saw earlier).
83 / 102

84 Log-Bilinear Language Model (Mnih and Hinton, 2007)
(Same as above, plus:) Later work explored variations to make learning faster.
84 / 102
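A minimal numpy sketch of the log-bilinear prediction rule; this is not the authors' code, and the sizes and random initialization are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, n = 1000, 50, 4                      # vocab size, embedding dim, n-gram order

M = rng.normal(size=(V, d))                # one embedding per word, shared by histories and predictions
A = rng.normal(size=(n - 1, d, d)) * 0.01  # one d x d matrix per history position
b = np.zeros(d)
c = np.zeros(V)

def log_bilinear_probs(history):
    """p(. | h_1, ..., h_{n-1}) for a list of n-1 word indices."""
    pred = b + sum(M[h] @ A[j] for j, h in enumerate(history))   # predicted target vector in R^d
    scores = M @ pred + c                                        # score each v by m_v . pred + c_v
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = log_bilinear_probs([10, 20, 30])
print(p.shape, round(float(p.sum()), 6))   # (1000,) 1.0
```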

85 Observations about Neural Language Models (So Far)
There's no knowledge built in that the most recent word h_{n−1} should generally be more informative than earlier ones. This has to be learned.
In addition to choosing n, one also has to choose dimensionalities like d and H.
Parameters of these models are hard to interpret.
85 / 102

86 Observations about Neural Language Models (So Far)
(Same as above, plus:)
  Example: the ℓ2-norms of A_j and T_j in the feedforward model correspond to the importance of history position j.
  Individual word embeddings can be clustered and dimensions can be analyzed (e.g., Tsvetkov et al., 2015).
86 / 102

87 Observations about Neural Language Models (So Far)
There's no knowledge built in that the most recent word h_{n−1} should generally be more informative than earlier ones. This has to be learned.
In addition to choosing n, one also has to choose dimensionalities like d and H.
Parameters of these models are hard to interpret.
Architectures are not intuitive.
87 / 102

88 Observations about Neural Language Models (So Far)
(Same as above, plus:) Still, impressive perplexity gains got people's interest.
88 / 102

89 Recurrent Neural Network
Each input element is understood to be an element of a sequence: x_1, x_2, ..., x_l
At each timestep t:
  The tth input element x_t is processed alongside the previous state s_{t−1} to calculate the new state s_t.
  The tth output is a function of the state s_t.
  The same functions are applied at each iteration:
    s_t = f_recurrent(x_t, s_{t−1})
    y_t = f_output(s_t)
In RNN language models, words and histories are represented as vectors (respectively, x_t = e_{x_t} and s_t).
89 / 102

90 RNN Language Model
The original version, by Mikolov et al. (2010), used a "simple" RNN architecture along these lines:
  s_t = f_recurrent(e_{x_t}, s_{t−1}) = sigmoid( (e_{x_t} M) A + s_{t−1} B + c )
  y_t = f_output(s_t) = softmax( s_t U )
  p(v | x_1, ..., x_{t−1}) = [y_t]_v
Note: this is not an n-gram (Markov) model!
90 / 102
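A minimal numpy sketch of one step of this simple RNN language model; the dimensionalities and random parameters are placeholder assumptions, not values from Mikolov et al. (2010):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, H = 1000, 30, 50                     # vocab size, embedding dim, state dim

M = rng.normal(size=(V, d))                # embeddings
A = rng.normal(size=(d, H)) * 0.1
B = rng.normal(size=(H, H)) * 0.1
c = np.zeros(H)
U = rng.normal(size=(H, V)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev):
    """Consume word index x_t and the previous state; return (s_t, y_t)."""
    s_t = sigmoid(M[x_t] @ A + s_prev @ B + c)   # f_recurrent
    y_t = softmax(s_t @ U)                       # f_output: distribution over V
    return s_t, y_t

s = np.zeros(H)                            # initial state
for w in [5, 17, 42]:                      # a toy word-index sequence
    s, y = rnn_step(w, s)
print(y.shape, round(float(y.sum()), 6))   # (1000,) 1.0
```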

91 Visualization
(Figure: one RNN timestep; x_t is embedded via M, combined with s_{t−1} through the affine transformation (A, B, c) and a sigmoid to give s_t, which is mapped through U and a softmax to give y_t.)
91 / 102

92 Visualization
(Figure: the same cell unrolled over four timesteps.)
92 / 102

93 Improvements to RNN Language Models The simple RNN is known to suffer from two related problems: Vanishing gradients during learning make it hard to propagate error into the distant past. State tends to change a lot on each iteration; the model forgets too much. Some variants: Stacking these functions to make deeper networks. Sundermeyer et al. (2012) use long short-term memories (LSTMs) and Cho et al. (2014) use gated recurrent units (GRUs) to define f recurrent. Mikolov et al. (2014) engineer the linear transformation in the simple RNN for better preservation. Jozefowicz et al. (2015) used randomized search to find even better architectures. 93 / 102

94 Comparison: Probabilistic vs. Connectionist Modeling
                                Probabilistic            Connectionist
What do we engineer?            features, assumptions    architectures
Theory?                         as N gets large          not really
Interpretation of parameters?   often easy               usually hard
94 / 102

95 Parting Shots 95 / 102

96 Parting Shots I said very little about estimating the parameters. 96 / 102

97 Parting Shots I said very little about estimating the parameters. At present, this requires a lot of engineering. 97 / 102

98 Parting Shots I said very little about estimating the parameters. At present, this requires a lot of engineering. New libraries to help you are coming out all the time. 98 / 102

99 Parting Shots I said very little about estimating the parameters. At present, this requires a lot of engineering. New libraries to help you are coming out all the time. Many of them use GPUs to speed things up. 99 / 102

100 Parting Shots
I said very little about estimating the parameters.
  At present, this requires a lot of engineering.
  New libraries to help you are coming out all the time.
  Many of them use GPUs to speed things up.
This progression is worth reflecting on:
               history:        represented as:
before 1996    (n−1)-gram      discrete
               (n−1)-gram      feature vector
               (n−1)-gram      embedded vector
since 2010     unrestricted    embedded
100 / 102

101 References I
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 2003.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. of EMNLP, 2014.
Michael Collins. Log-linear models, MEMMs, and CRFs, 2011.
Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 1997.
Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. Sparse additive generative models of text. In Proc. of ICML, 2011.
Yoav Goldberg. A primer on neural network models for natural language processing, 2015.
Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proc. of ICML, 2015.
Daniel Jurafsky and James H. Martin. N-grams (draft chapter), 2016.
101 / 102

102 References II
Raymond Lau, Ronald Rosenfeld, and Salim Roukos. Trigger-based language models: A maximum entropy approach. In Proc. of ICASSP, 1993.
Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Proc. of Interspeech, 2010.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proc. of ICLR, 2013.
Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. Learning longer memory in recurrent neural networks, 2014. arXiv preprint.
Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proc. of ICML, 2007.
Noah A. Smith. Probabilistic language models 1.0, 2017.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In Proc. of Interspeech, 2012.
Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. Evaluation of word vector representations by subspace alignment. In Proc. of EMNLP, 2015.
Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 2010.
102 / 102


More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

Better Conditional Language Modeling. Chris Dyer

Better Conditional Language Modeling. Chris Dyer Better Conditional Language Modeling Chris Dyer Conditional LMs A conditional language model assigns probabilities to sequences of words, w =(w 1,w 2,...,w`), given some conditioning context, x. As with

More information

Logistic Regression & Neural Networks

Logistic Regression & Neural Networks Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Vinod Variyam and Ian Goodfellow) sscott@cse.unl.edu 2 / 35 All our architectures so far work on fixed-sized inputs neural networks work on sequences of inputs E.g., text, biological

More information

Deep Learning. Ali Ghodsi. University of Waterloo

Deep Learning. Ali Ghodsi. University of Waterloo University of Waterloo Language Models A language model computes a probability for a sequence of words: P(w 1,..., w T ) Useful for machine translation Word ordering: p (the cat is small) > p (small the

More information

Language Modeling. Hung-yi Lee 李宏毅

Language Modeling. Hung-yi Lee 李宏毅 Language Modeling Hung-yi Lee 李宏毅 Language modeling Language model: Estimated the probability of word sequence Word sequence: w 1, w 2, w 3,., w n P(w 1, w 2, w 3,., w n ) Application: speech recognition

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Sequence to Sequence Models and Attention

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Sequence to Sequence Models and Attention TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Sequence to Sequence Models and Attention Encode-Decode Architectures for Machine Translation [Figure from Luong et al.] In Sutskever

More information

ACCELERATING RECURRENT NEURAL NETWORK TRAINING VIA TWO STAGE CLASSES AND PARALLELIZATION

ACCELERATING RECURRENT NEURAL NETWORK TRAINING VIA TWO STAGE CLASSES AND PARALLELIZATION ACCELERATING RECURRENT NEURAL NETWORK TRAINING VIA TWO STAGE CLASSES AND PARALLELIZATION Zhiheng Huang, Geoffrey Zweig, Michael Levit, Benoit Dumoulin, Barlas Oguz and Shawn Chang Speech at Microsoft,

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Recurrent neural networks

Recurrent neural networks 12-1: Recurrent neural networks Prof. J.C. Kao, UCLA Recurrent neural networks Motivation Network unrollwing Backpropagation through time Vanishing and exploding gradients LSTMs GRUs 12-2: Recurrent neural

More information

Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron)

Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O Connor (http://brenocon.com) 1 Models for

More information