Neural Networks in Structured Prediction November 17, 2015
HWs and Paper
Last homework will be posted soon: a neural net NER tagging model. This is a new structured model.
Paper: Thursday after Thanksgiving (Dec 3). Bring a draft of your paper to class for discussion.
Goals for This Week's Lectures
Overview of neural networks (terminology, basic architectures, learning)
Neural networks in structured prediction:
Option 1: locally nonlinear factors in globally linear models
Option 2: operation sequence models
Option 3: global, nonlinear structured models [speculative]
Neural Nets: Big Ideas
Nonlinear function classes
Learning via backpropagation of errors
Neural networks as feature inducers: in ŷ = Wx + b, what features should we use?
The multitask hypothesis
Recall: Parameters
We want to condition on lots of information, but recall that
p_{X,Y}(x, y) = p_{X|Y=y}(x) · p_Y(y)
requires O(|X||Y| + |Y|) = O(|X||Y|) parameters.
Neural networks let us learn arbitrary joint interactions, but control the number of parameters:
- Control overfitting / improve generalization
- Reduce memory requirements
When to use them?
If you want to condition on a lot of information (entire documents, entire sentences, ...) but you don't know what features are useful, neural networks are great!
Neurons
[Diagram: a single neuron with inputs x_1, ..., x_5 (vector x), weights w_1, ..., w_5 (vector w), and output y]
y = w · x
y = w · x + b   (b is a bias term; it can also be encoded by forcing one of the inputs to always be 1)
y = g(w · x + b)   (g is a nonlinear function that usually has a sigmoid shape)
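A minimal sketch of a single neuron in NumPy; the sigmoid choice for g and the example values are assumptions for illustration, not from the slides.

```python
import numpy as np

def sigmoid(u):
    # A common sigmoid-shaped nonlinearity for g.
    return 1.0 / (1.0 + np.exp(-u))

def neuron(x, w, b):
    # y = g(w . x + b): weighted sum of the inputs, plus a bias, through a nonlinearity.
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 0.5, -0.3, 0.0, 2.0])   # five inputs x_1..x_5
w = np.array([0.2, -0.1, 0.4, 0.0, 0.05])  # five weights w_1..w_5
b = 0.1
print(neuron(x, w, b))
```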
Neural Networks
[Diagram: a layer of neurons mapping inputs x_1, ..., x_5 (vector x) to outputs y_1, y_2, y_3 (vector y) through a weight matrix W]
y = Wx + b
y = g(Wx + b)
Softmax: g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})
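A sketch of one layer y = g(Wx + b) with g = softmax, following the definition above; the shapes and random values are illustrative.

```python
import numpy as np

def softmax(u):
    # g(u)_i = exp(u_i) / sum_i' exp(u_i'); shift by max(u) for numerical stability.
    e = np.exp(u - np.max(u))
    return e / e.sum()

def layer(x, W, b):
    # y = g(Wx + b), here with g = softmax.
    return softmax(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=5)          # 5 inputs
W = rng.normal(size=(3, 5))     # 3 outputs y_1..y_3
b = np.zeros(3)
y = layer(x, W, b)
print(y, y.sum())               # a distribution over outputs, summing to 1
```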
Deep
[Diagram: two layers, x → y → z, with weight matrices W and V]
z = g(Vy + c)
z = g(V h(Wx + b) + c)
Note: if g(x) = h(x) = x, then
z = V(Wx + b) + c = VW x + Vb + c = Ux + d
i.e., without the nonlinearities, stacked layers collapse into a single linear layer.
Predicting Discrete Objects
[Diagram: the output layer z scores the tags DT, NN, VB, JJ]
Softmax: g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})
More Concise
h = tanh(Wx + b)
y = Vh + a
[Diagram: x → W → h → V → y]
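Putting the pieces together, a sketch of the concise two-layer form h = tanh(Wx + b), y = Vh + a; the hidden size and random initialization are arbitrary choices.

```python
import numpy as np

def mlp(x, W, b, V, a):
    # Hidden layer with a tanh nonlinearity, then a linear output layer.
    h = np.tanh(W @ x + b)
    return V @ h + a

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W, b = rng.normal(size=(8, 5)), np.zeros(8)   # hidden size 8 (arbitrary)
V, a = rng.normal(size=(3, 8)), np.zeros(3)
print(mlp(x, W, b, V, a))
```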
Feature Induction
ŷ = Wx + b: what features should we use?
L = Σ_i ||ŷ_i − y_i||²_2
In linear regression the goal is to learn W such that L is minimized on a training set.
ŷ = W g(Vr + c) + b, where the features are x = g(Vr + c)
Compute features of the naive representation r, and put the result in x.
Learn to extract features that produce linear separability.
Feature Induction
ŷ = W g(Vr + c) + b
What can this function represent? If x is big enough, this can represent any function!
This is obviously much more powerful than a linear model.
Joint Interactions?
Feature Induction
Neural networks let us use a fixed number of parameters and avoid the curse of dimensionality!
But they let us learn interactions (conjunctions), if it is helpful to do so. Let the data decide!
Training
Neural networks are supervised: we need pairs of inputs and outputs.
For an LM: input = history, output = word.
Also needed: a differentiable loss function, L(x, y) → ℝ⁺.
Losses sum over training instances.
Any of our favorite losses work here.
Loss Functions
Squared error (compare two vectors): L = ½ ||ŷ − y||²
Cross entropy / log loss: L = −log p(y | x)
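A sketch of both losses; the example targets are made up.

```python
import numpy as np

def squared_error(y_hat, y):
    # L = 1/2 * ||y_hat - y||^2
    return 0.5 * np.sum((y_hat - y) ** 2)

def log_loss(probs, y_index):
    # L = -log p(y | x), where probs is a distribution over classes.
    return -np.log(probs[y_index])

print(squared_error(np.array([0.9, 0.1]), np.array([1.0, 0.0])))
print(log_loss(np.array([0.7, 0.2, 0.1]), y_index=0))
```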
Stochastic Gradient Descent
For i = 1, 2, ...:
Pick a random training example t and compute:
g^{(i)} = ∂L(x_t, y_t)/∂θ evaluated at θ = θ^{(i)}
θ^{(i+1)} = θ^{(i)} − η g^{(i)}
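A sketch of the SGD loop, assuming a helper grad_loss(theta, x, y) that returns ∂L(x, y)/∂θ; that helper, the learning rate, and the data format are placeholders, not from the slides.

```python
import numpy as np

def sgd(theta, data, grad_loss, eta=0.1, steps=1000, seed=0):
    # theta^{(i+1)} = theta^{(i)} - eta * dL(x_t, y_t)/dtheta, for a random example t.
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x_t, y_t = data[rng.integers(len(data))]   # pick a random training example
        theta = theta - eta * grad_loss(theta, x_t, y_t)
    return theta
```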
Computing Derivatives
Training neural networks involves computing derivatives.
The standard algorithm for doing this is called "backpropagation".
It is a variant of automatic differentiation (technically, "reverse-mode AD").
Automatic Differentiation?
A compiler translates a function into a sequence of small operations.
Every small operation (n.b., of a differentiable function) is itself differentiable.
The chain rule tells us how to compute the derivatives of composite functions using the derivatives of the pieces they are composed of.
Let's start with a really simple example.
y = log sin² x
What is the derivative at x_0?

components              range    differential       d-range
y = f(u) = log u        ℝ        dy/du = 1/u        ℝ
u = g(v) = v²           ℝ        du/dv = 2v         ℝ
v = h(x) = sin x        ℝ        dv/dx = cos x      ℝ

dy/dx |_{x=x_0} = dy/du |_{u=g(h(x_0))} · du/dv |_{v=h(x_0)} · dv/dx |_{x=x_0}
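A sketch that evaluates this chain-rule product at a point and checks it against a finite-difference estimate; x_0 = 0.7 is an arbitrary choice.

```python
import numpy as np

x0 = 0.7
v = np.sin(x0)          # v = h(x) = sin x
u = v ** 2              # u = g(v) = v^2
# dy/dx = (dy/du)(du/dv)(dv/dx) = (1/u) * (2v) * cos x
chain = (1.0 / u) * (2.0 * v) * np.cos(x0)

f = lambda x: np.log(np.sin(x) ** 2)
eps = 1e-6
finite_diff = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
print(chain, finite_diff)   # the two estimates should agree closely
```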
In general, for our applications, x in f(x) will be a vector.
y = Σ_{i=1}^{n} (W exp x)_i   where x ∈ ℝ^d and W ∈ ℝ^{n×d}

components                     range     differential              d-range
y = f(u) = Σ_{i=1}^{n} u_i     ℝ         ∂y/∂u = 1                 ℝ^{1×n}
u = g(v) = Wv                  ℝ^n       ∂u/∂v = W                 ℝ^{n×d}
v = h(x) = exp x               ℝ^d       ∂v/∂x = diag(exp x)       ℝ^{d×d}

dy/dx |_{x=x_0} = dy/du |_{u=g(h(x_0))} · du/dv |_{v=h(x_0)} · dv/dx |_{x=x_0}
                  (∈ ℝ^{1×n})            (∈ ℝ^{n×d})               (∈ ℝ^{d×d})
Two Evaluation Strategies
dy/dx |_{x=x_0} = dy/du |_{u=g(h(x_0))} · du/dv |_{v=h(x_0)} · dv/dx |_{x=x_0}   (ℝ^{1×n} · ℝ^{n×d} · ℝ^{d×d})
Forward: accumulate the product starting from the input side (rightmost factors first).
Backward, or adjoint: accumulate the product starting from the output side (leftmost factors first).
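A sketch contrasting the two multiplication orders for the example above; both give the same Jacobian, but with a scalar output the backward (adjoint) order only ever forms vector-matrix products. Sizes and values are arbitrary.

```python
import numpy as np

n, d = 4, 6
rng = np.random.default_rng(0)
W = rng.normal(size=(n, d))
x0 = rng.normal(size=d)

dy_du = np.ones((1, n))            # Jacobian of the sum, 1 x n
du_dv = W                          # Jacobian of Wv, n x d
dv_dx = np.diag(np.exp(x0))        # Jacobian of exp(x), d x d

# Forward order: group from the input side (n x d times d x d, then 1 x n times n x d).
forward = dy_du @ (du_dv @ dv_dx)
# Backward/adjoint order: group from the output side (1 x n times n x d, then 1 x d times d x d).
backward = (dy_du @ du_dv) @ dv_dx
print(np.allclose(forward, backward))   # same result, different cost
```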
Learning with Backprop
Training data (this is XOR):
x1  x2  y*
0   0   0
1   0   1
0   1   1
1   1   0
Models:
y = V(Wx + b) + c
y = V tanh(Wx + b) + c
Objective: L = (y(x) − y*)²
Learning with Backprop
[Plot: ||y − y*||² vs. iteration for the linear model y = V(Wx + b) + c and the nonlinear model y = V tanh(Wx + b) + c]
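A sketch of this experiment: manual backprop through y = V tanh(Wx + b) + c under the squared-error objective, trained by SGD on the XOR data above. The hidden size, learning rate, and initialization are arbitrary choices.

```python
import numpy as np

# XOR data from the slide: inputs (x1, x2) and targets y*.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
Y = np.array([0., 1., 1., 0.])

rng = np.random.default_rng(1)
H, eta = 4, 0.1                                   # hidden size and learning rate (arbitrary)
W, b = rng.normal(scale=0.5, size=(H, 2)), np.zeros(H)
V, c = rng.normal(scale=0.5, size=H), 0.0

for it in range(5000):
    i = rng.integers(4)
    x, y_star = X[i], Y[i]
    # Forward: y = V tanh(Wx + b) + c
    h = np.tanh(W @ x + b)
    y = V @ h + c
    # Backward: L = (y - y*)^2
    dy = 2.0 * (y - y_star)
    dV, dc = dy * h, dy
    dpre = (dy * V) * (1.0 - h ** 2)              # backprop through tanh
    dW, db = np.outer(dpre, x), dpre
    # SGD update
    W -= eta * dW; b -= eta * db; V -= eta * dV; c -= eta * dc

print([float(V @ np.tanh(W @ x + b) + c) for x in X])   # should approach [0, 1, 1, 0]
```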
Concept: Word Embedding
Represent each word in a vocabulary as a vector.
Vectors are learned as parameters.
Words that behave similarly in the model will usually end up with similar embeddings.
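A sketch of an embedding lookup table; the vocabulary, dimensionality, and initialization are made up for illustration (in practice E is learned along with the other parameters).

```python
import numpy as np

vocab = {"John": 0, "books": 1, "flights": 2, "<s>": 3, "</s>": 4}
dim = 8
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), dim))   # one row per word, learned as parameters

def embed(word):
    # Each word is represented by a row of the embedding matrix E.
    return E[vocab[word]]

print(embed("books").shape)   # (8,)
```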
A simple tagging model
[Diagram: x = (START, John, books, flights, STOP); hidden states h_1, h_2, h_3 predict the tags NNP, VBZ, NNS for John, books, flights]
h_t = tanh(C_{-1} x_{t-1} + C_0 x_t + C_{+1} x_{t+1} + b)
p(y_i = y | x, i) = exp(w_y^T h_i + a_y) / Σ_{y' ∈ Y} exp(w_{y'}^T h_i + a_{y'})
p(y | x) = Π_{i=1}^{|y|} p(y_i | x, i)
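A sketch of this window-based tagger's forward pass, following the equations above; the embeddings, tag set size, and parameter values are placeholders.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

def tag_probs(X, C_prev, C_cur, C_next, b, Wt, a):
    # X: (T, d) embeddings of <s>, w_1, ..., w_n, </s>.
    # For each interior position t: h_t = tanh(C_{-1} x_{t-1} + C_0 x_t + C_{+1} x_{t+1} + b),
    # then p(y_t | x, t) = softmax(Wt h_t + a); positions are tagged independently.
    probs = []
    for t in range(1, X.shape[0] - 1):
        h = np.tanh(C_prev @ X[t - 1] + C_cur @ X[t] + C_next @ X[t + 1] + b)
        probs.append(softmax(Wt @ h + a))
    return np.stack(probs)

rng = np.random.default_rng(0)
d, hid, n_tags = 8, 16, 4
X = rng.normal(size=(5, d))                      # <s> John books flights </s>
C_prev, C_cur, C_next = (rng.normal(size=(hid, d)) for _ in range(3))
b, Wt, a = np.zeros(hid), rng.normal(size=(n_tags, hid)), np.zeros(n_tags)
print(tag_probs(X, C_prev, C_cur, C_next, b, Wt, a).shape)   # (3, n_tags)
```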
Simple Tagging Model
Parameters: word embeddings, C_{-1}, C_0, C_{+1}, W, b
Upsides of this model:
Decisions are independent, so likelihood/decoding is cheap (no Viterbi!)
Downsides of this model:
Say that (A B) and (B A) are both good taggings according to the observations, but (A A) and (B B) are bad taggings. Independence hurts us!
Limited context window
Simple Tagging Model
In short: this is not a structured model, it is a bunch of independent classification decisions.
But neural networks are really powerful learners; maybe we don't really need as much structure?
How can we address the finite horizon problem?
Revised Tagging Model
[Diagram: the same tagger, with the tags NNP, VBZ, NNS predicted from hidden states h_1, h_2, h_3 over x = (START, John, books, flights, STOP)]
Recurrent Neural Networks (RNNs)
h_t = f(h_{t-1}, x_t)
c = RNN(x)
[Diagram: starting from an initial state 0, the inputs x_1, x_2, x_3, x_4 are consumed one at a time; the final state is c]
What is a vector representation of a sequence?
Note: numerous definitions exist for f.
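A sketch of one common choice of f, the simple (Elman-style) recurrence h_t = tanh(W_h h_{t-1} + W_x x_t + b); the parameter names and sizes are illustrative, not prescribed by the slides.

```python
import numpy as np

def rnn(X, Wh, Wx, b):
    # h_t = tanh(Wh h_{t-1} + Wx x_t + b), starting from h_0 = 0.
    h = np.zeros(Wh.shape[0])
    for x_t in X:
        h = np.tanh(Wh @ h + Wx @ x_t + b)
    return h   # c = RNN(x): the final state summarizes the sequence

rng = np.random.default_rng(0)
d, hid = 8, 16
X = rng.normal(size=(4, d))          # x_1 .. x_4
Wh, Wx, b = rng.normal(size=(hid, hid)), rng.normal(size=(hid, d)), np.zeros(hid)
print(rnn(X, Wh, Wx, b).shape)       # (16,)
```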
Bidirectional RNN Tagging Model
[Diagram: a forward RNN and a backward RNN, each initialized at 0, run over x = (START, John, books, flights, STOP); their hidden states are used to predict the tags NNP, VBZ, NNS]
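A sketch of the bidirectional version: run one RNN left to right and one right to left, combine the two states at each position (here by concatenation, an assumed choice), and classify each tag independently. All parameters are placeholders.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

def rnn_states(X, Wh, Wx, b):
    # Return the hidden state at every position, starting from h_0 = 0.
    h, states = np.zeros(Wh.shape[0]), []
    for x_t in X:
        h = np.tanh(Wh @ h + Wx @ x_t + b)
        states.append(h)
    return states

def birnn_tagger(X, fwd, bwd, Wt, a):
    f = rnn_states(X, *fwd)              # forward states over x_1..x_T
    r = rnn_states(X[::-1], *bwd)[::-1]  # backward states, re-aligned to positions
    # Each position is tagged from [forward state; backward state], independently.
    return np.stack([softmax(Wt @ np.concatenate([ft, rt]) + a) for ft, rt in zip(f, r)])

rng = np.random.default_rng(0)
d, hid, n_tags = 8, 16, 4
X = rng.normal(size=(3, d))                                        # John books flights
fwd = (rng.normal(size=(hid, hid)), rng.normal(size=(hid, d)), np.zeros(hid))
bwd = (rng.normal(size=(hid, hid)), rng.normal(size=(hid, d)), np.zeros(hid))
Wt, a = rng.normal(size=(n_tags, 2 * hid)), np.zeros(n_tags)
print(birnn_tagger(X, fwd, bwd, Wt, a).shape)                      # (3, n_tags)
```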
BiRNN Tagging Model
In short: this is still not a structured model, it is a bunch of independent classification decisions.
But neural networks are really powerful learners; maybe we don't really need as much structure?
This POS tagger is currently state-of-the-art! Maybe POS tagging doesn't need that much structure.
Next time Recurrent models for adding statistical dependencies among output variables