Natural Language Processing Info 159/259 Lecture 4: Text classification 3 (Sept 5, 2017) David Bamman, UC Berkeley
. https://www.forbes.com/sites/kevinmurnane/206/04/0/what-is-deep-learning-and-how-is-it-useful
History of NLP Foundational insights, 1940s/1950s Two camps (symbolic/stochastic), 1957-1970 Four paradigms (stochastic, logic-based, NLU, discourse modeling), 1970-1983 Empiricism and FSM (1983-1993) Field comes together (1994-1999) Machine learning (2000-today) Neural networks (~2014-today) [J&M 2008, ch 1]
Neural networks in NLP Language modeling [Mikolov et al. 2010] Text classification [Kim 2014; Iyyer et al. 2015] Syntactic parsing [Chen and Manning 2014; Dyer et al. 2015; Andor et al. 2016] CCG supertagging [Lewis and Steedman 2014] Machine translation [Cho et al. 2014; Sutskever et al. 2014] Dialogue agents [Sordoni et al. 2015; Vinyals and Le 2015; Ji et al. 2016] (for an overview, see Goldberg 2017, 1.3.1)
Neural networks Discrete, high-dimensional representation of inputs (one-hot vectors) -> low-dimensional distributed representations. Non-linear interactions of input features Multiple layers to capture hierarchical structure
Neural network libraries
Logistic regression ŷ = 1 / (1 + exp(−Σ_{i=1}^F x_i β_i)) Example features x and weights β: not (x=1, β=−0.5), bad (x=1, β=−1.7), movie (x=0, β=0.3)
SGD Calculate the derivative of some loss function with respect to parameters we can change, update accordingly to make predictions on training data a little less wrong next time.
Logistic regression ŷ = 1 / (1 + exp(−Σ_{i=1}^F x_i β_i)) [figure: inputs x_1, x_2, x_3 connected to the output y by weights β_1, β_2, β_3] not (x=1, β=−0.5), bad (x=1, β=−1.7), movie (x=0, β=0.3)
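A minimal numpy sketch of this slide's model plus one SGD step of the kind described above (not from the slides; the true label y and the learning rate are made up for illustration):

```python
import numpy as np

# Example feature vector and weights from the slide:
# "not" and "bad" are present (x=1), "movie" is absent (x=0).
x = np.array([1.0, 1.0, 0.0])          # [not, bad, movie]
beta = np.array([-0.5, -1.7, 0.3])     # example coefficients

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression prediction: yhat = sigmoid(sum_i x_i * beta_i)
yhat = sigmoid(x @ beta)
print(yhat)                             # probability of the positive class

# One SGD step on the log loss for a single (x, y) pair:
# the gradient of the log likelihood w.r.t. beta is (y - yhat) * x,
# so we nudge beta a small step in that direction.
y = 0.0                                 # made-up true label for this example
learning_rate = 0.1                     # made-up step size
beta += learning_rate * (y - yhat) * x
```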
Neural networks Two core ideas: Non-linear activation functions Multiple layers
*For simplicity, we're leaving out the bias term, but assume most layers have them as well. [figure: input layer x_1, x_2, x_3 → hidden layer h_1, h_2 via weights W (W_{1,1}, W_{1,2}, W_{2,1}, W_{2,2}, W_{3,1}, W_{3,2}) → output layer y via weights V (V_1, V_2)]
[figure: the same network with example real-valued entries in W and V for the input x = (not: 1, bad: 1, movie: 0)]
h_j = f(Σ_{i=1}^F x_i W_{i,j}): the hidden nodes are completely determined by the input and weights
h_1 = f(Σ_{i=1}^F x_i W_{i,1})
Activation functions σ(z) = 1 / (1 + exp(−z)) [plot: the sigmoid over z ∈ [−10, 10]]
Logistic regression ŷ = 1 / (1 + exp(−Σ_{i=1}^F x_i β_i)) = σ(Σ_{i=1}^F x_i β_i) We can think about logistic regression as a neural network with no hidden layers
Activation functions tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)) [plot: tanh over z ∈ [−10, 10]]
Activation functions rectifier(z) = max(0, z) [plot: the rectifier (ReLU) over z ∈ [−10, 10]]
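A quick sketch (not from the slides) of the three activation functions in numpy, applied elementwise:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # (exp(z) - exp(-z)) / (exp(z) + exp(-z)); range (-1, 1)
    return np.tanh(z)

def relu(z):
    # rectifier: max(0, z)
    return np.maximum(0.0, z)

z = np.linspace(-10, 10, 5)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```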
h_1 = σ(Σ_{i=1}^F x_i W_{i,1}), h_2 = σ(Σ_{i=1}^F x_i W_{i,2}), ŷ = σ(V_1 h_1 + V_2 h_2)
ŷ = σ(V_1 σ(Σ_i x_i W_{i,1}) + V_2 σ(Σ_i x_i W_{i,2})): we can express ŷ as a function only of the input x and the weights W and V
ŷ = σ(V_1 σ(Σ_{i=1}^F x_i W_{i,1}) + V_2 σ(Σ_{i=1}^F x_i W_{i,2})) (the two inner terms are h_1 and h_2) This is hairy, but differentiable Backpropagation: given training samples of <x,y> pairs, we can use stochastic gradient descent to find the values of W and V that minimize the loss.
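A hedged numpy sketch of this forward pass for three inputs, two hidden units, and one output; the weight values here are random placeholders, not anything from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([1.0, 1.0, 0.0])    # input features (e.g., not, bad, movie)
W = rng.normal(size=(3, 2))      # input -> hidden weights (made-up values)
V = rng.normal(size=2)           # hidden -> output weights (made-up values)

h = sigmoid(x @ W)               # h_j = sigma(sum_i x_i W_{i,j})
yhat = sigmoid(h @ V)            # yhat = sigma(V_1 h_1 + V_2 h_2)

# Equivalently, yhat written as a function of x, W, V alone:
yhat2 = sigmoid(sigmoid(x @ W) @ V)
assert np.allclose(yhat, yhat2)
```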
Neural networks are a series of functions chained together: xW → σ(xW) → σ(xW)V → σ(σ(xW)V) The loss is another function chained on top: log(σ(σ(xW)V))
Chain rule ∂/∂V log(σ(σ(xW)V)) Let's take the likelihood for a single training example with label y = 1; we want this value to be as high as possible. = [∂ log(σ(σ(xW)V)) / ∂σ(σ(xW)V)] [∂σ(σ(xW)V) / ∂(σ(xW)V)] [∂(σ(xW)V) / ∂V] (call these A, B, C) = [∂ log(σ(hV)) / ∂σ(hV)] [∂σ(hV) / ∂(hV)] [∂(hV) / ∂V] (writing h = σ(xW))
Chain rule A B C = [∂ log(σ(hV)) / ∂σ(hV)] [∂σ(hV) / ∂(hV)] [∂(hV) / ∂V] = [1 / σ(hV)] [σ(hV)(1 − σ(hV))] [h] = (1 − σ(hV)) h = (1 − ŷ) h
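A small sketch that checks this result numerically, assuming the same tiny network as above with random placeholder weights and a single training example with y = 1: the analytic gradient (1 − ŷ)h should match a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = np.array([1.0, 0.0, 1.0])
W = rng.normal(size=(3, 2))
V = rng.normal(size=2)

def log_likelihood(V):
    # log sigma(sigma(xW) V) for a single example with label y = 1
    return np.log(sigmoid(sigmoid(x @ W) @ V))

# Analytic gradient from the chain rule: dL/dV = (1 - yhat) * h
h = sigmoid(x @ W)
yhat = sigmoid(h @ V)
analytic = (1.0 - yhat) * h

# Finite-difference check
eps = 1e-6
numeric = np.zeros_like(V)
for j in range(len(V)):
    Vp, Vm = V.copy(), V.copy()
    Vp[j] += eps
    Vm[j] -= eps
    numeric[j] = (log_likelihood(Vp) - log_likelihood(Vm)) / (2 * eps)

print(analytic, numeric)   # should agree to several decimal places
```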
Neural networks Tremendous flexibility on design choices (exchange feature engineering for model engineering) Articulate model structure and use the chain rule to derive parameter updates.
Neural network structures [figure: x_1…x_3 → h_1, h_2 → a single output node y] Output one real value
Neural network structures [figure: the same network with three output nodes] Multiclass: output 3 values, only one = 1 in training data
Neural network structures [figure: the same network with three output nodes] Output 3 values, several = 1 in training data
Regularization Increasing the number of parameters = increasing the possibility for overfitting to training data
Regularization L2 regularization: penalize W and V for being too large Dropout: when training on a <x,y> pair, randomly remove some nodes and weights Early stopping: stop backpropagation before the training error gets too small
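A minimal sketch of two of these ideas, using the same toy hidden layer as above; the keep probability and the L2 strength are arbitrary illustrative values, and the dropout here zeroes whole hidden nodes with the usual rescaling, which is one common way to realize the idea on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = np.array([1.0, 1.0, 0.0])
W = rng.normal(size=(3, 2))

h = sigmoid(x @ W)

# Dropout at training time: randomly zero out each hidden node,
# so no single node can be relied on too heavily.
keep_prob = 0.5
mask = rng.random(h.shape) < keep_prob
h_dropped = h * mask / keep_prob        # "inverted dropout" rescaling

# L2 regularization: add a penalty lam * sum(W**2) to the loss
# so that large weights are discouraged.
lam = 0.01
l2_penalty = lam * np.sum(W ** 2)
```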
Deeper networks [figure: input x_1…x_3 → first hidden layer (weights W1) → second hidden layer (weights W2) → output y (weights V)]
Densely connected layer [figure: every input x_1…x_7 connects to every hidden node h_1, h_2] h = σ(xW_1)
Convolutional networks With convolutional networks, the same operation (i.e., the same set of parameters) is applied to different regions of the input
2D Convolution [figure: a convolution kernel, e.g. a blurring kernel, applied to a 2D image] http://www.wildml.com/205//understanding-convolutional-neural-networks-for-nlp/ https://docs.gimp.org/en/plug-in-convmatrix.html
1D Convolution convolve the input sequence x with the kernel K = [1/3, 1/3, 1/3]: the result is a moving average of x
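A tiny sketch of this (the input sequence here is made up): convolving with K = [1/3, 1/3, 1/3] yields a 3-point moving average.

```python
import numpy as np

x = np.array([1.0, 3.0, -1.0, 4.0, 0.0, 2.0])   # made-up input sequence
K = np.array([1/3, 1/3, 1/3])                    # averaging kernel

# "valid" convolution: the same kernel is applied at every position
moving_average = np.convolve(x, K, mode="valid")
print(moving_average)                            # [1.0, 2.0, 1.0, 2.0]
```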
Convolutional networks [figure: tokens "I hated it I really hated it" (x_1…x_7), with h_1 = f(I, hated, it), h_2 = f(it, I, really), h_3 = f(really, hated, it)] h_1 = σ(x_1 W_1 + x_2 W_2 + x_3 W_3) h_2 = σ(x_3 W_1 + x_4 W_2 + x_5 W_3) h_3 = σ(x_5 W_1 + x_6 W_2 + x_7 W_3)
Indicator vector Every token is a V-dimensional vector (size of the vocab) with a single 1 identifying the word. We'll get to distributed representations of words on 9/19. [table: vocab item / indicator vector: a 0, aa 0, aal 0, aalii 0, aam 0, aardvark 1, aardwolf 0, aba 0]
Convolutional networks [figure: the convolutional window (size 3) spans x_1, x_2, x_3, each an indicator vector the size of the vocab; W_1, W_2, W_3 are the corresponding weights, each the size of the vocab] h_1 = σ(x_1 W_1 + x_2 W_2 + x_3 W_3) h_2 = σ(x_3 W_1 + x_4 W_2 + x_5 W_3) h_3 = σ(x_5 W_1 + x_6 W_2 + x_7 W_3)
Convolutional networks For indicator vectors, we're just adding these numbers together: h_1 = σ(W_{1,x_1^id} + W_{2,x_2^id} + W_{3,x_3^id}) (where x_n^id specifies the location of the 1 in the vector, i.e., the vocabulary id)
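A hedged sketch of this point: with one-hot token vectors, each term x_n W_n just selects one entry of W_n, so the convolution at each position is a sum of looked-up weights. The vocabulary, the sigmoid nonlinearity, and the weight values here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab = {"i": 0, "hated": 1, "it": 2, "really": 3}
tokens = ["i", "hated", "it", "i", "really", "hated", "it"]
ids = [vocab[t] for t in tokens]

rng = np.random.default_rng(3)
# One weight vector per position in the window (window size 3),
# each of length |V| since the inputs are indicator vectors.
W1, W2, W3 = rng.normal(size=(3, len(vocab)))

# h_1 = sigma(W1[x1_id] + W2[x2_id] + W3[x3_id]), and so on,
# matching the window positions on the slide.
h1 = sigmoid(W1[ids[0]] + W2[ids[1]] + W3[ids[2]])
h2 = sigmoid(W1[ids[2]] + W2[ids[3]] + W3[ids[4]])
h3 = sigmoid(W1[ids[4]] + W2[ids[5]] + W3[ids[6]])
print(h1, h2, h3)
```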
Pooling Down-samples a layer by selecting a single point from some set [figure: e.g., max(3, 7, 9, 2) = 9] Max-pooling selects the largest value
Convolutional networks [figure: 7 tokens x_1…x_7 → convolution (one value per window position) → max pooling (a single value)] This defines one filter.
[figure: "vectorize" the window by concatenating x_1, x_2, x_3 into one long vector x, and W_1, W_2, W_3 into one long weight vector] h_1 = σ(xW_1)
We can specify multiple filters; each filter is a separate set of parameters to be learned [figure: four filters W_a, W_b, W_c, W_d applied to the same vectorized window] h_1 = σ(xW_1) ∈ ℝ^4
Convolutional networks With max pooling, we select a single number for each filter over all tokens (e.g., with 100 filters, the output of the max pooling stage = a 100-dimensional vector) If we specify multiple filters, we can also scope each filter over different window sizes
Zhang and Wallace 2016, "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"
[figure: the full network: input tokens x_1…x_7, 4 filters W_a…W_d, convolution tanh(xW), max pooling over positions, then an output layer with weights V producing y]
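Putting the pieces together, a sketch of one forward pass (not the homework implementation): one-hot tokens, a window of 3, 4 filters, convolution with tanh, max pooling over positions, and a logistic output layer. The vocabulary and weights are random placeholders, and this version slides the window by one token at every position rather than stepping as in the slide's example windows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
vocab = {"i": 0, "hated": 1, "it": 2, "really": 3}
tokens = ["i", "hated", "it", "i", "really", "hated", "it"]
X = np.zeros((len(tokens), len(vocab)))
X[np.arange(len(tokens)), [vocab[t] for t in tokens]] = 1.0   # one-hot rows

window, n_filters = 3, 4
W = rng.normal(size=(window * len(vocab), n_filters))  # one column per filter
V = rng.normal(size=n_filters)                          # output weights

# Convolution: slide the window over the tokens, vectorize each window,
# and apply every filter to it.
positions = len(tokens) - window + 1
conv = np.zeros((positions, n_filters))
for p in range(positions):
    x_window = X[p:p + window].reshape(-1)   # concatenate 3 one-hot vectors
    conv[p] = np.tanh(x_window @ W)

pooled = conv.max(axis=0)        # max pooling: one number per filter
yhat = sigmoid(pooled @ V)       # output layer
print(pooled.shape, yhat)        # (4,) and a probability
```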
CNN as important ngram detector Higher-order ngrams are much more informative than just unigrams (e.g., "i don't like this movie" vs. [i, don't, like, this, movie]) We can think about a CNN as providing a mechanism for detecting important (sequential) ngrams without having the burden of creating them as unique features. Unique ngrams (1-4) in the Cornell movie review dataset (unique types): unigrams 50,921; bigrams 451,220; trigrams 910,694; 4-grams 1,074,921
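For intuition about how such counts grow, a small sketch of counting unique n-grams in any tokenized corpus; the two example sentences here are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    # all contiguous n-token sequences in one sentence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [["i", "don't", "like", "this", "movie"],
          ["i", "really", "hated", "it"]]

for n in range(1, 5):
    counts = Counter(g for sent in corpus for g in ngrams(sent, n))
    print(n, "unique types:", len(counts))
```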
CNN Backprop (V) L(W, V) = y log(o) + (1 − y) log(1 − o) (call the first term A and the second term B) ∂L(W, V)/∂V = ∂(A + B)/∂V = ∂A/∂V + ∂B/∂V
You'll derive and implement updates for the rest of the parameters in homework 2
Generative vs. Discriminative models Generative models specify a joint distribution over the labels and the data. With this you could generate new data: P(x, y) = P(y) P(x | y) Discriminative models specify the conditional distribution of the label y given the data x. These models focus on how to discriminate between the classes: P(y | x)
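A toy sketch of the distinction (the class priors and word probabilities are invented): a generative model specifies P(y) and P(x | y), so it can generate new (x, y) pairs; a discriminative model only gives P(y | x) and cannot generate x.

```python
import numpy as np

rng = np.random.default_rng(5)

# Generative model: P(y) and P(x | y) for a tiny two-word vocabulary.
p_y = {"pos": 0.5, "neg": 0.5}
p_word_given_y = {"pos": {"good": 0.8, "bad": 0.2},
                  "neg": {"good": 0.3, "bad": 0.7}}

def generate():
    # sample a label, then sample a word conditioned on that label
    y = rng.choice(list(p_y), p=list(p_y.values()))
    words, probs = zip(*p_word_given_y[y].items())
    x = rng.choice(words, p=probs)
    return x, y

print([generate() for _ in range(5)])   # new (x, y) pairs

# Discriminative view: only P(y | x), computed here via Bayes' rule
# (a discriminative model like logistic regression would fit this directly).
def p_pos_given(x):
    joint_pos = p_y["pos"] * p_word_given_y["pos"][x]
    joint_neg = p_y["neg"] * p_word_given_y["neg"][x]
    return joint_pos / (joint_pos + joint_neg)

print(p_pos_given("good"), p_pos_given("bad"))
```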
259 project proposal due 9/26 Final project involving 1 or 2 students, using natural language processing -- either focusing on core NLP methods or using NLP in support of an empirical research question. Proposal (2 pages): outline the work you're going to undertake; motivate its rationale as an interesting question worth asking; assess its potential to contribute new knowledge by situating it within related literature in the scientific community (cite 5 relevant sources); who is the team and what are each of your responsibilities (everyone gets the same grade)
Thursday Read Hovy and Spruit (2016) and come prepared to discuss!