Natural Language Processing

1 Natural Language Processing. Info 159/259, Lecture 4: Text classification 3 (Sept 5, 2017). David Bamman, UC Berkeley

3 History of NLP
  Foundational insights, 1940s/1950s
  Two camps (symbolic/stochastic)
  Four paradigms (stochastic, logic-based, NLU, discourse modeling)
  Empiricism and FSM
  Field comes together
  Machine learning (2000-today)
  (J&M 2008, ch. 1)
  Neural networks (~2014-today)

4 Neural networks in NLP
  Language modeling [Mikolov et al. 2010]
  Text classification [Kim 2014; Iyyer et al. 2015]
  Syntactic parsing [Chen and Manning 2014, Dyer et al. 2015, Andor et al. 2016]
  CCG super tagging [Lewis and Steedman 2014]
  Machine translation [Cho et al. 2014, Sutskever et al. 2014]
  Dialogue agents [Sordoni et al. 2015, Vinyals and Le 2015, Ji et al. 2016]
  (for overview, see Goldberg 2017, 1.3.1)

5 Neural networks
  Discrete, high-dimensional representation of inputs (one-hot vectors) -> low-dimensional distributed representations.
  Non-linear interactions of input features.
  Multiple layers to capture hierarchical structure.

6 Neural network libraries

7 Logistic regression
  $$\hat{y} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}$$
  feature   x   β
  not       1   -0.5
  bad       1   -1.7
  movie     0    0.3
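
A minimal sketch of the prediction above in plain Python. The feature order (not, bad, movie) and the values of x and β are read off the slide's table and should be treated as assumptions; `predict` is a name introduced here for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, beta):
    """y_hat = sigmoid(sum_i x_i * beta_i)."""
    return sigmoid(sum(xi * bi for xi, bi in zip(x, beta)))

# Features in the order (not, bad, movie); values assumed from the slide.
x = [1, 1, 0]
beta = [-0.5, -1.7, 0.3]
y_hat = predict(x, beta)
print(round(y_hat, 3))
```

The weighted sum here is -0.5 - 1.7 = -2.2, so the predicted probability of the positive class is small (roughly 0.1).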

8 SGD: Calculate the derivative of some loss function with respect to the parameters we can change, then update those parameters to make predictions on the training data a little less wrong next time.
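
A minimal sketch of one SGD update, assuming logistic regression with the negative log likelihood as the loss. The learning rate, the starting weights, and the helper names `sgd_step` and `nll` are illustrative choices, not values from the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(x, y, beta):
    """Negative log likelihood of one (x, y) pair, with y in {0, 1}."""
    y_hat = sigmoid(sum(xi * bi for xi, bi in zip(x, beta)))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def sgd_step(x, y, beta, lr=0.1):
    """One update: beta_i <- beta_i - lr * d(loss)/d(beta_i)."""
    y_hat = sigmoid(sum(xi * bi for xi, bi in zip(x, beta)))
    # For this loss, d(loss)/d(beta_i) = (y_hat - y) * x_i.
    return [bi - lr * (y_hat - y) * xi for xi, bi in zip(x, beta)]

x, y = [1.0, 1.0, 0.0], 1
beta = [0.0, 0.0, 0.0]
before = nll(x, y, beta)
beta = sgd_step(x, y, beta)
after = nll(x, y, beta)
print(before, after)
```

After the step the loss on this example is lower than before, and the weight of the feature that was 0 in x is unchanged.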

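9 Logistic regression
  $$\hat{y} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}$$
  [Diagram: inputs x1, x2, x3 connect directly to y through weights β1, β2, β3.]
  feature   x   β
  not       1   -0.5
  bad       1   -1.7
  movie     0    0.3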

10 Neural networks. Two core ideas: non-linear activation functions; multiple layers.

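11 [Diagram: a feed-forward network with an input layer (x1, x2, x3), a hidden layer (h1, h2), and an output layer (y). W holds the input-to-hidden weights W1,1 through W3,2; V holds the hidden-to-output weights V1 and V2.]
  *For simplicity, we're leaving out the bias term, but assume most layers have them as well.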

12 [Diagram: the same network (x1, x2, x3 -> h1, h2 -> y, with weights W and V) applied to the input features not, bad, movie.]

13 [Network diagram: x1, x2, x3 -> h1, h2 -> y, with weights W and V.]
  $$h_j = f\left(\sum_{i=1}^{F} x_i W_{i,j}\right)$$
  The hidden nodes are completely determined by the input and weights.

14 [Network diagram: x1, x2, x3 -> h1, h2 -> y, with weights W and V.]
  $$h_1 = f\left(\sum_{i=1}^{F} x_i W_{i,1}\right)$$

15 Activation functions
  $$\sigma(z) = \frac{1}{1 + \exp(-z)}$$
  [Plot of σ(z); axes x and y.]

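16 Logistic regression
  $$\hat{y} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)} \qquad \hat{y} = \sigma\left(\sum_{i=1}^{F} x_i \beta_i\right)$$
  [Diagram: inputs x1, x2, x3 connect directly to y through weights β1, β2, β3.]
  We can think about logistic regression as a neural network with no hidden layers.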

17 Activation functions
  $$\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$$
  [Plot of tanh(z); axes x and y.]

18 Activation functions
  $$\mathrm{rectifier}(z) = \max(0, z)$$
  [Plot of rectifier(z); axes x and y.]
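
The three activation functions on these slides (sigmoid, tanh, rectifier) can be sketched directly from their definitions. This is an illustration only; the sample points are arbitrary.

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)); output in (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def rectifier(z):
    """rectifier(z) = max(0, z); output in [0, infinity)."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), rectifier(z))
```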

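19 [Network diagram: x1, x2, x3 -> h1, h2 -> y, with weights W and V.]
  $$h_1 = \sigma\left(\sum_{i=1}^{F} x_i W_{i,1}\right) \qquad h_2 = \sigma\left(\sum_{i=1}^{F} x_i W_{i,2}\right)$$
  $$\hat{y} = \sigma\left[V_1 h_1 + V_2 h_2\right]$$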

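20 [Network diagram: x1, x2, x3 -> h1, h2 -> y, with weights W and V.]
  $$\hat{y} = \sigma\left[V_1\, \sigma\left(\sum_{i}^{F} x_i W_{i,1}\right) + V_2\, \sigma\left(\sum_{i}^{F} x_i W_{i,2}\right)\right]$$
  We can express y as a function only of the input x and the weights W and V.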
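
A minimal sketch of this forward pass for a network with three inputs and two hidden nodes. The weight values and the name `forward` are hypothetical, chosen only for illustration; bias terms are left out, as in the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, V):
    """x: F inputs; W: F rows by H columns; V: H weights. Returns (h, y_hat)."""
    H = len(V)
    # h_j = sigmoid(sum_i x_i * W[i][j])
    h = [sigmoid(sum(x[i] * W[i][j] for i in range(len(x)))) for j in range(H)]
    # y_hat = sigmoid(V_1 h_1 + V_2 h_2 + ...)
    y_hat = sigmoid(sum(V[j] * h[j] for j in range(H)))
    return h, y_hat

# Hypothetical input and weights.
x = [1.0, 1.0, 0.0]
W = [[-0.5, 1.3],
     [0.4, 0.08],
     [1.7, 3.1]]
V = [4.1, -0.9]
h, y_hat = forward(x, W, V)
print(h, y_hat)
```

Nothing in `forward` depends on anything except x, W and V, which is the point of the slide.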

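21 $$\hat{y} = \sigma\Bigg[V_1 \underbrace{\sigma\left(\sum_{i}^{F} x_i W_{i,1}\right)}_{h_1} + V_2 \underbrace{\sigma\left(\sum_{i}^{F} x_i W_{i,2}\right)}_{h_2}\Bigg]$$
  This is hairy, but differentiable.
  Backpropagation: Given training samples of <x,y> pairs, we can use stochastic gradient descent to find the values of W and V that minimize the loss.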

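22 [Network diagram: x1, x2, x3 -> h1, h2 -> y, with weights W and V.]
  Neural networks are a series of functions chained together:
  $$xW \;\rightarrow\; \sigma(xW) \;\rightarrow\; \sigma(xW)V \;\rightarrow\; \sigma(\sigma(xW)V)$$
  The loss is another function chained on top:
  $$\log\left(\sigma(\sigma(xW)V)\right)$$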

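23 Chain rule
  $$\frac{\partial}{\partial V} \log\left(\sigma(\sigma(xW)V)\right)$$
  Let's take the likelihood for a single training example with label y = 1; we want this value to be as high as possible.
  $$= \underbrace{\frac{\partial \log(\sigma(\sigma(xW)V))}{\partial \sigma(\sigma(xW)V)}}_{A} \cdot \underbrace{\frac{\partial \sigma(\sigma(xW)V)}{\partial \sigma(xW)V}}_{B} \cdot \underbrace{\frac{\partial \sigma(xW)V}{\partial V}}_{C}$$
  $$= \underbrace{\frac{\partial \log(\sigma(hV))}{\partial \sigma(hV)}}_{A} \cdot \underbrace{\frac{\partial \sigma(hV)}{\partial hV}}_{B} \cdot \underbrace{\frac{\partial hV}{\partial V}}_{C}$$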

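24 Chain rule
  $$\underbrace{\frac{\partial \log(\sigma(hV))}{\partial \sigma(hV)}}_{A} \cdot \underbrace{\frac{\partial \sigma(hV)}{\partial hV}}_{B} \cdot \underbrace{\frac{\partial hV}{\partial V}}_{C} = \underbrace{\frac{1}{\sigma(hV)}}_{A} \times \underbrace{\sigma(hV)\,(1 - \sigma(hV))}_{B} \times \underbrace{h}_{C}$$
  $$= (1 - \sigma(hV))\,h = (1 - \hat{y})\,h$$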
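
One way to sanity-check the result (1 - ŷ)h is to compare it with a finite-difference estimate of the same derivative. The sketch below does that for a single example with label y = 1; the values of h and V are hypothetical, and `numeric_grad` is a helper introduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(h, V):
    """log sigma(hV) for one example with label y = 1."""
    return math.log(sigmoid(sum(hj * vj for hj, vj in zip(h, V))))

def analytic_grad(h, V):
    """(1 - y_hat) * h, the product A * B * C from the chain rule."""
    y_hat = sigmoid(sum(hj * vj for hj, vj in zip(h, V)))
    return [(1.0 - y_hat) * hj for hj in h]

def numeric_grad(h, V, eps=1e-6):
    """Central finite differences, one coordinate of V at a time."""
    grad = []
    for j in range(len(V)):
        up = V[:j] + [V[j] + eps] + V[j + 1:]
        down = V[:j] + [V[j] - eps] + V[j + 1:]
        grad.append((log_likelihood(h, up) - log_likelihood(h, down)) / (2 * eps))
    return grad

h = [0.3, 0.8]    # hypothetical hidden activations
V = [0.5, -1.2]   # hypothetical output weights
print(analytic_grad(h, V))
print(numeric_grad(h, V))
```

The two gradients should agree to several decimal places; a mismatch would point to an error in the derivation or the code.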

25 Neural networks
  Tremendous flexibility on design choices (exchange feature engineering for model engineering).
  Articulate model structure and use the chain rule to derive parameter updates.

26 Neural network structures
  [Diagram: inputs x1, x2, x3; hidden nodes h1, h2; a single output y.]
  Output one real value.

27 Neural network structures
  [Diagram: inputs x1, x2, x3; hidden nodes h1, h2; three outputs y with values 0, 1, 0.]
  Multiclass: output 3 values, only one = 1 in training data.

28 Neural network structures
  [Diagram: inputs x1, x2, x3; hidden nodes h1, h2; three outputs y with values 1, 1, 0.]
  Output 3 values, several = 1 in training data.
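
A sketch of how the two three-output structures might differ at the output layer. The slides do not name the output functions; using a softmax when exactly one output is 1 and independent sigmoids when several can be 1 is a common convention and is an assumption here. The scores are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores into a distribution that sums to 1 (one label is 1)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoids(scores):
    """Score each output independently (several labels can be 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

scores = [2.0, -1.0, 0.5]  # hypothetical pre-activation outputs
multiclass = softmax(scores)
multilabel = sigmoids(scores)
print(multiclass)
print(multilabel)
```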

29 Regularization: Increasing the number of parameters increases the possibility of overfitting to the training data.

30 Regularization
  L2 regularization: penalize W and V for being too large.
  Dropout: when training on a <x,y> pair, randomly remove some nodes and weights.
  Early stopping: stop backpropagation before the training error is too small.
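
A minimal sketch of the first two ideas. The penalty strength, drop probability, and function names are illustrative. Rescaling the surviving values by 1 / (1 - drop_prob) ("inverted dropout") is a common convention and an assumption here, not something stated in the lecture.

```python
import random

def l2_penalty(weights, strength):
    """strength * sum of squared weights, added to the loss being minimized."""
    return strength * sum(w * w for w in weights)

def dropout(h, drop_prob, rng):
    """Zero each hidden value with probability drop_prob (training time only).

    Survivors are divided by (1 - drop_prob) so the expected value is unchanged.
    """
    keep = 1.0 - drop_prob
    return [hj / keep if rng.random() < keep else 0.0 for hj in h]

rng = random.Random(0)             # seeded so the example is repeatable
h = [0.2, 0.9, 0.5, 0.7]           # hypothetical hidden activations
dropped = dropout(h, drop_prob=0.5, rng=rng)
penalty = l2_penalty([0.5, -1.0, 2.0], strength=0.01)
print(dropped, penalty)
```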

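31 Deeper networks
  [Diagram: inputs x1, x2, x3 feed a first hidden layer through W1; that layer feeds a second hidden layer through W2; the second hidden layer feeds the output y through V.]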

32 Densely connected layer
  [Diagram: inputs x1 through x7, each connected to every hidden node through W.]
  $$h = \sigma(xW)$$

33 Convolutional networks: With convolutional networks, the same operation (i.e., the same set of parameters) is applied to different regions of the input.

2D Convolution: blurring [Figure: a blurring kernel applied to an image. Source: http://www.wildml.]

34 2D Convolution blurring

35 1D Convolution
  Convolution kernel K = [1/3, 1/3, 1/3]. Applying K to each window of x gives a moving average.
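
A small sketch of the moving-average view of 1D convolution. The signal x is hypothetical, and `convolve_1d` is a helper written for this example (no padding, stride 1, no kernel flip, which makes no difference for this symmetric kernel).

```python
def convolve_1d(x, kernel):
    """Slide the kernel over x and return one weighted sum per window."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

K = [1 / 3, 1 / 3, 1 / 3]
x = [3.0, 6.0, 9.0, 12.0, 3.0]   # hypothetical signal
smoothed = convolve_1d(x, K)
print(smoothed)
```

Each output is the mean of three neighboring inputs, so a signal of length 5 yields 3 smoothed values.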

36 Convolutional networks
  [Diagram: the tokens x1 through x7 are "I hated it I really hated it"; the same weights W are applied to each window.]
  h1 = f(I, hated, it)
  h2 = f(it, I, really)
  h3 = f(really, hated, it)
  $$h_1 = \sigma(x_1 W_1 + x_2 W_2 + x_3 W_3)$$
  $$h_2 = \sigma(x_3 W_1 + x_4 W_2 + x_5 W_3)$$
  $$h_3 = \sigma(x_5 W_1 + x_6 W_2 + x_7 W_3)$$

37 Indicator vector
  Every token is a V-dimensional vector (size of the vocab) with a single 1 identifying the word.
  We'll get to distributed representations of words on 9/19.
  vocab item   indicator
  a            0
  aa           0
  aal          0
  aalii        0
  aam          0
  aardvark     1
  aardwolf     0
  aba          0
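
A minimal sketch of building an indicator vector, using the eight vocabulary items listed on the slide as a toy vocabulary. The function name `indicator` is introduced here for illustration.

```python
vocab = ["a", "aa", "aal", "aalii", "aam", "aardvark", "aardwolf", "aba"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def indicator(word):
    """Return a vector of vocabulary size with a single 1 at the word's id."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

print(indicator("aardvark"))
```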

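38 Convolutional networks
  [Diagram: each token x is a vector the size of the vocab; a window x1, x2, x3 has dimensions (convolutional window size) by (size of vocab), and W = (W1, W2, W3) has one vocab-sized weight vector per window position.]
  $$h_1 = \sigma(x_1 W_1 + x_2 W_2 + x_3 W_3)$$
  $$h_2 = \sigma(x_3 W_1 + x_4 W_2 + x_5 W_3)$$
  $$h_3 = \sigma(x_5 W_1 + x_6 W_2 + x_7 W_3)$$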

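39 Convolutional networks
  [Diagram: each token x is a vector the size of the vocab; a window x1, x2, x3 has dimensions (convolutional window size) by (size of vocab), and W = (W1, W2, W3) has one vocab-sized weight vector per window position.]
  For indicator vectors, we're just adding these numbers together:
  $$h_1 = \sigma\left(W_{1,x_1^{id}} + W_{2,x_2^{id}} + W_{3,x_3^{id}}\right)$$
  (where $x_n^{id}$ specifies the location of the 1 in the vector, i.e., the vocabulary id)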
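
The equivalence on this slide can be checked in a few lines: with indicator inputs, the full dot product over a window equals a sum of one looked-up weight per window position. The weights, vocabulary size (4), and word ids below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def indicator(word_id, vocab_size):
    vec = [0.0] * vocab_size
    vec[word_id] = 1.0
    return vec

def hidden_dense(window, W):
    """Full dot product: sum over window positions k and vocab entries v."""
    total = 0.0
    for k, x in enumerate(window):
        for v, x_v in enumerate(x):
            total += x_v * W[k][v]
    return sigmoid(total)

def hidden_lookup(ids, W):
    """Same value, reading one weight W[k][id] per window position k."""
    return sigmoid(sum(W[k][word_id] for k, word_id in enumerate(ids)))

# Hypothetical weights: 3 window positions by a vocabulary of 4 words.
W = [[0.2, -0.4, 0.1, 0.7],
     [0.5, 0.3, -0.6, 0.0],
     [-0.1, 0.9, 0.4, -0.3]]
ids = [2, 0, 3]
window = [indicator(i, 4) for i in ids]
dense = hidden_dense(window, W)
lookup = hidden_lookup(ids, W)
print(dense, lookup)
```

The lookup version does the same computation without ever materializing the vocabulary-sized vectors.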

40 Pooling
  Down-samples a layer by selecting a single point from some set.
  Max-pooling selects the largest value.
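
A minimal sketch of max pooling over a list of activations. The input values and the choice of non-overlapping windows in `max_pool_windows` are illustrative assumptions.

```python
def max_pool(values):
    """Select the single largest value from the whole set."""
    return max(values)

def max_pool_windows(values, size):
    """Down-sample by taking the max of each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

activations = [1, 7, 3, 2, 0, 5]   # hypothetical layer outputs
print(max_pool(activations))
print(max_pool_windows(activations, 2))
```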

41 Convolutional networks
  [Diagram: convolution over x1 through x7 produces one value per window; max pooling then keeps the largest of those values.]
  This defines one filter.

42 [Diagram: the window x1, x2, x3 and the weights W1, W2, W3 are each vectorized into a single vector, x and W.]
  $$h_1 = \sigma(xW)$$

43 We can specify multiple filters; each filter is a separate set of parameters to be learned.
  [Diagram: four filters Wa, Wb, Wc, Wd applied to the window x1, x2, x3 of the input x1 through x7.]
  $$h_1 = \sigma(xW) \in \mathbb{R}^4$$

45 Convolutional networks
  With max pooling, we select a single number for each filter over all tokens (e.g., with 100 filters, the output of the max pooling stage is a 100-dimensional vector).
  If we specify multiple filters, we can also scope each filter over different window sizes.

46 Zhang and Wallace 2016, "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"

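47 [Diagram: inputs x1 through x7; four filters Wa, Wb, Wc, Wd; output weights V; output y. 5 tokens, 4 filters.]
  Stages: tanh(xW), then max pooling, then the output computed from hV.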
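
A sketch of the whole pipeline for one sentence: convolution with tanh, max pooling per filter, then a sigmoid output. Several details are assumptions made for this example and are not taken from the slides: stride 1, a window of 3, a sigmoid on the final score, random weights, and the mapping of words to ids. Inputs are word ids rather than explicit indicator vectors.

```python
import math
import random

def cnn_forward(token_ids, filters, V, window=3):
    """filters[f][k][v]: weight of filter f at window position k for vocab id v."""
    pooled = []
    for W in filters:
        # One tanh activation per window position (stride 1 is an assumption).
        acts = [math.tanh(sum(W[k][token_ids[start + k]] for k in range(window)))
                for start in range(len(token_ids) - window + 1)]
        pooled.append(max(acts))   # max pooling: one number per filter
    score = sum(p * v for p, v in zip(pooled, V))
    return pooled, 1.0 / (1.0 + math.exp(-score))

rng = random.Random(0)             # seeded so the example is repeatable
vocab_size, n_filters, window = 5, 4, 3
filters = [[[rng.uniform(-1, 1) for _ in range(vocab_size)]
            for _ in range(window)] for _ in range(n_filters)]
V = [rng.uniform(-1, 1) for _ in range(n_filters)]

# "I hated it I really hated it" with hypothetical ids I=0, hated=1, it=2, really=3.
token_ids = [0, 1, 2, 0, 3, 1, 2]
pooled, y_hat = cnn_forward(token_ids, filters, V)
print(pooled, y_hat)
```

With 4 filters the pooled vector has 4 entries regardless of sentence length, which is what lets a fixed-size V sit on top.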

48 CNN as important ngram detector
  Higher-order ngrams are much more informative than just unigrams (e.g., "i don't like this movie" vs. ["I", "don't", "like", "this", "movie"]).
  We can think about a CNN as providing a mechanism for detecting important (sequential) ngrams without having the burden of creating them as unique features.
  Unique ngrams (1-4) in Cornell movie review dataset:
  ngram      unique types
  unigrams   5092
  bigrams    45,220
  trigrams   90,694
  4-grams    ,074,92
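
To see why explicit ngram features become a burden, here is a small sketch that counts unique ngrams of each order. The two toy sentences stand in for a real corpus; nothing here reproduces the Cornell movie review counts.

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unique_counts(sentences, max_n=4):
    """Number of distinct ngram types for each n from 1 to max_n."""
    counts = {}
    for n in range(1, max_n + 1):
        seen = set()
        for tokens in sentences:
            seen.update(ngrams(tokens, n))
        counts[n] = len(seen)
    return counts

sentences = ["i hated it i really hated it".split(),
             "i don't like this movie".split()]
counts = unique_counts(sentences)
print(counts)
```

On a real corpus the number of distinct types grows quickly with n, and each type would need its own feature weight in a bag-of-ngrams model.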

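49 CNN Backprop (V)
  $$L(W, V) = \underbrace{y \log o}_{A} + \underbrace{(1 - y) \log(1 - o)}_{B}$$
  $$\frac{\partial L(W, V)}{\partial V} = \frac{\partial (A + B)}{\partial V} = \frac{\partial A}{\partial V} + \frac{\partial B}{\partial V}$$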
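
A sketch of how the two terms can be worked out, assuming the output is o = σ(hV), where h is the max-pooled vector and σ is the logistic function (an assumption about the output layer). It uses the same chain-rule steps as the earlier y = 1 case.

```latex
\begin{align*}
\frac{\partial A}{\partial V}
  &= y \cdot \frac{1}{o} \cdot o\,(1 - o) \cdot h
   = y\,(1 - o)\,h \\
\frac{\partial B}{\partial V}
  &= (1 - y) \cdot \frac{1}{1 - o} \cdot \bigl(-o\,(1 - o)\bigr) \cdot h
   = -(1 - y)\,o\,h \\
\frac{\partial L(W, V)}{\partial V}
  &= y\,(1 - o)\,h - (1 - y)\,o\,h
   = (y - o)\,h
\end{align*}
```

Setting y = 1 recovers (1 - ŷ)h from the chain rule slides.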

53 You'll derive and implement updates for the rest of the parameters in homework 2.

54 Generative vs. Discriminative models
  Generative models specify a joint distribution over the labels and the data. With this you could generate new data.
  $$P(x, y) = P(y)\, P(x \mid y)$$
  Discriminative models specify the conditional distribution of the label y given the data x. These models focus on how to discriminate between the classes.
  $$P(y \mid x)$$
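
A toy sketch of the relationship between the two quantities: a generative model stores P(y) and P(x | y), and the conditional P(y | x) that a discriminative model targets directly can be recovered from the joint by normalizing. All probabilities and names below are made up for illustration.

```python
# Hypothetical generative model over a label y and a one-word document x.
p_y = {"pos": 0.5, "neg": 0.5}
p_x_given_y = {"pos": {"great": 0.8, "bad": 0.2},
               "neg": {"great": 0.3, "bad": 0.7}}

def joint(x, y):
    """P(x, y) = P(y) * P(x | y)."""
    return p_y[y] * p_x_given_y[y][x]

def posterior(x):
    """P(y | x) = P(x, y) / sum over y' of P(x, y')."""
    evidence = sum(joint(x, y) for y in p_y)
    return {y: joint(x, y) / evidence for y in p_y}

print(posterior("bad"))
```

A discriminative model such as logistic regression parameterizes P(y | x) directly and never models how x itself is distributed.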

55 259 project proposal due 9/26
  Final project by 1 or 2 students involving natural language processing, either focusing on core NLP methods or using NLP in support of an empirical research question.
  Proposal (2 pages):
    outline the work you're going to undertake
    motivate its rationale as an interesting question worth asking
    assess its potential to contribute new knowledge by situating it within related literature in the scientific community (cite 5 relevant sources)
    who is the team and what are each of your responsibilities (everyone gets the same grade)

56 Thursday: Read Hovy and Spruit (2016) and come prepared to discuss!
