Natural Language Processing

Size: px

Start display at page:

Download "Natural Language Processing"

Charla Potter
5 years ago
Views:

1 Natural Language Processing Info 159/259 Lecture 7: Language models 2 (Sept 14, 2017) David Bamman, UC Berkeley

2 Language Model Vocabulary V is a finite set of discrete symbols (e.g., words, characters); V = V V + is the infinite set of sequences of symbols from V; each sequence ends with STOP x V +

3 Language Model Language modeling is the task of estimating P(w)

4 Language Model arma virumque cano arms man and I sing Arms, and the man I sing [Dryden] I sing of arms and a man When we have choices to make about different ways to say something, a language models gives us a formal way of operationalizing their fluency (in terms of how likely they are exist in the language)

5 Markov assumption bigram model (first-order markov) n i P (w i w i 1 ) P (STOP w n ) trigram model (second-order markov) n i P (w i w i 2,w i 1 ) P (STOP w n 1,w n )

6 Classification A mapping h from input data x (drawn from instance space X) to a label (or labels) y from some enumerable output space Y X = set of all documents Y = {english, mandarin, greek, } x = a single document y = ancient greek

7 Classification A mapping h from input data x (drawn from instance space X) to a label (or labels) y from some enumerable output space Y Y = {the, of, a, dog, iphone, } x = (context) y = word

8 Logistic regression P(y = 1 x, β) = 1 +exp 1 F i=1 x iβ i output space Y = {0, 1}

9 x = feature vector β = coefficients Feature Value Feature β the 0 and 0 bravest 0 love 0 loved 0 genius 0 not 0 fruit 1 BIAS 1 the 0.01 and 0.03 bravest 1.4 love 3.1 loved 1.2 genius 0.5 not -3.0 fruit -0.8 BIAS

10 P(Y = 1 X = x; β) = exp(x β) 1 +exp(x β) = exp(x β) exp(x 0)+exp(x β)

11 Logistic regression P(Y = y X = x; β) = exp(x β y ) y Y exp(x β y ) output space Y = {1,...,K}

12 x = feature vector β = coefficients Feature Value Feature β1 β2 β3 β4 β5 the 0 and 0 bravest 0 love 0 loved 0 genius 0 not 0 fruit 1 BIAS 1 the and bravest love loved genius not fruit BIAS

13 Language Model We can use multi class logistic regression for language modeling by treating the vocabulary as the output space Y = V

14 Unigram LM A unigram language model here would have just one feature: a bias term. Feature βthe βof βa βdog βiphone BIAS

15 Bigram LM Feature Value βthe βof βa βdog βiphone wi-1=the 1 wi-1=and 0 wi-1=brave 0 st wi-1=love 0 wi-1=loved 0 wi-1=geniu 0 s wi-1=not 0 wi-1=fruit 0 BIAS

16 P(w i = dog w i 1 = the) Feature Value βthe βof βa βdog βiphone wi-1=the 1 wi-1=and 0 wi-1=brave 0 st wi-1=love 0 wi-1=loved 0 wi-1=geniu 0 s wi-1=not 0 wi-1=fruit 0 BIAS

17 Trigram LM P(w i = dog w i 2 = and, w i 1 = the) Feature Value wi-2=the ^ wi-1=the 0 wi-2=and ^ wi-1=the 1 wi-2=bravest ^ wi-1=the 0 wi-2=love ^ wi-1=the 0 wi-2=loved ^ wi-1=the 0 wi-2=genius ^ wi-1=the 0 wi-2=not ^ wi-1=the 0 wi-2=fruit ^ wi-1=the 0 BIAS 1

18 Parameters How many parameters do we have with a log-linear trigram language model? How many do we have with a generative trigram LM?

19 Smoothing P(w i = dog w i 2 = and, w i 1 = the) Feature Value second-order features first-order features wi-2=the ^ wi-1=the 0 wi-2=and ^ wi-1=the 1 wi-2=bravest ^ wi-1=the 0 wi-2=love ^ wi-1=the 0 wi-1=the 1 wi-1=and 0 wi-1=bravest 0 wi-1=love 0 BIAS 1

20 L2 regularization N F (β) = log P(y i x i, β) η β 2 j i=1 j=1 we want this to be high but we want this to be small We can do this by changing the function we re trying to optimize by adding a penalty for having values of β that are high This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0. η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data) 20

21 L1 regularization N F (β) = log P(y i x i, β) η β j i=1 j=1 we want this to be high but we want this to be small L1 regularization encourages coefficients to be exactly 0. η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data) 21

22 Richer representations Log-linear models give us the flexibility of encoding richer representations of the context we are conditioning on. We can reason about any observations from the entire history and not just the local context.

23 JACKSONVILLE, Fla. Stressed and exhausted families across the Southeast were assessing the damage from Hurricane Irma on Tuesday, even as flooding from the storm continued to plague some areas, like Jacksonville, and the worst of its wallop was being revealed in others, like the Florida Keys. Officials in Florida, Georgia and South Carolina tried to prepare residents for the hardships of recovery from the

24 Hillary Clinton seemed to add Benghazi to her already-long list of culprits to blame for her upset loss to Donald

26 Hillary Clinton seemed to add Benghazi to her already-long list of culprits to blame for her upset loss to Donald feature classes example ngrams (wi-1, wi-2:wi-1, wi-3:wi-1) wi-2= to, wi= donald gappy ngrams spelling, capitalizaiton class/gazetteer membership w1= hillary and wi-1= donald wi-1 is capitalized and wi is capitalized wi-1 in list of names and wi in list of names 26

27 Classes bigram count bigram count black car 100 <COLOR> car 137 blue car 37 red car 0 Goldberg 2017

28 Tradeoffs Richer representations = more parameters, higher likelihood of overfitting Much slower to train than estimating the parameters of a generative model P(Y = y X = x; β) = exp(x β y ) y Y exp(x β y )

29 Neural LM Simple feed-forward multilayer perceptron (e.g., one hidden layer) input x = vector concatenation of a conditioning context of fixed size k 1 x =[v(w 1 );...; v(w k )] Bengio et al. 2003

30 1 v(w1) v(w1) x =[v(w 1 );...; v(w k )] v(w2) 1 v(w2) v(w3) 1 v(w3) w1 = tried w2 = to w3 = prepare w4 = residents v(w4) 1 one-hot encoding v(w4) distributed representation Bengio et al. 2003

31 W 1 R kd H W 2 R H V b 1 R H b 2 R V h = g(xw 1 + b 1 ) x =[v(w 1 );...; v(w k )] ŷ = softmax(hw 2 + b 2 ) Bengio et al. 2003

32 Softmax P(Y = y X = x; β) = exp(x β y ) y Y exp(x β y )

33 Neural LM conditioning context tried to prepare residents for the hardships of recovery from the y

34 Recurrent neural network RNN allow arbitarily-sized conditioning contexts; condition on the entire sequence history.

35 Recurrent neural network Goldberg 2017

36 Recurrent neural network Each time step has two inputs: x i (the observation at time step i); one-hot vector, feature vector or distributed representation. s i-1 (the output of the previous state); base case: s 0 = 0 vector

37 Recurrent neural network s i = R(x i, s i 1 ) R computes the output state as a function of the current input and previous state y i = O(s i ) O computes the output as a function of the current output state

38 Continuous bag of words s i = R(x i, s i 1 )=s i 1 + x i Each state is the sum of all previous inputs y i = O(s i )=s i Output is the identity

39 Simple RNN g = tanh or relu s i = R(x i, s i 1 )=g(s i 1 W s + x i W x + b) Different weight vectors W transform the previous state and current input before combining W s R H H W x R D H b R H y i = O(s i )=s i Elman 1990, Mikolov 2012

40 RNN LM The output state s i is an H- dimensional real vector; we can transfer that into a probability by passing it through an additional linear transformation followed by a softmax y i = O(s i )= (s i W o + b o ) Elman 1990, Mikolov 2012

41 Training RNNs Given this definition of an RNN: s i = R(x i,s i 1 )=g(s i 1 W s + x i W x + b) y i = O(s i )= (s i W o + b o ) We have five sets of parameters to learn: W s,w x,w o,b,b o

42 Training RNNs we tried to prepare residents At each time step, we make a prediction and incur a loss; we know the true y (the word we see in that position)

43 L( ) y1 W s L( ) y2 W s L( ) y3 W s L( ) y4 W s L( ) y5 W s we tried to prepare residents Training here is standard backpropagation, taking the derivative of the loss we incur at step t with respect to the parameters we want to update

44 Generation context1 context2 generated word As we sample, the words we generate form the new context we condition on START START The START The dog The dog walked dog walked in

45 Generation

46 Conditioned generation In a basic RNN, the input at each timestep is a representation of the word at that position s i = R(x i,s i 1 )=g(s i 1 W s + x i W x + b) But we can also condition on any arbitrary context (topic, author, date, metadata, dialect, etc.) s i = R(x i,s i 1 )=g(s i 1 W s + [x i ; c]w x + b)

47 we tried to prepare residents Each state i encodes information seen until time i and its structure is optimized to predict the next word

48 Character LM Vocabulary V is a finite set of discrete characters When the output space is small, you re putting a lot of the burden on the structure of the model Encode long-range dependencies (suffixes depend on prefixes, word boundaries etc.)

49 Character LM

50 Character LM

51 Character LM

52 Drawbacks Very expensive to train (especially for large vocabulary, though tricks exists cf. hierarchical softmax) Backpropagation through long histories leads to vanishing gradients (cf. LSTMs in a few weeks). But they consistently have some of the strongest performances in perplexity evaluations.

53 Mikolov 2010

54 Melis, Dyer and Blunsom 2017

55 Distributed representations Some of the greatest power 1 in neural network approaches is in the representation of words (and contexts) as lowdimensional vectors We ll talk much more about that on Tuesday.

Applied Natural Language Processing

Applied Natural Language Processing Info 256 Lecture 20: Sequence labeling (April 9, 2019) David Bamman, UC Berkeley POS tagging NNP Labeling the tag that s correct for the context. IN JJ FW SYM IN JJ