Neural Networks in Structured Prediction. November 17, 2015

Size: px

Start display at page:

Download "Neural Networks in Structured Prediction. November 17, 2015"

Alaina Dickerson
6 years ago
Views:

1 Neural Networks in Structured Prediction November 17, 2015

2 HWs and Paper Last homework is going to be posted soon Neural net NER tagging model This is a new structured model Paper - Thursday after Thanksgiving (Dec 3) Bring draft of paper to class for discussion

3 Goals for This Week s Lectures Overview of neural networks (terminology, basic architectures, learning) Neural networks in structured prediction: Option 1: locally nonlinear factors in globally linear models Option 2: operation sequence models Option 3: global, nonlinear structured models [speculative]

4 Neural Nets: Big Ideas Nonlinear function classes Learning via backpropagation of errors Neural networks as feature inducers ŷ = Wx + b What features should we use?? The multitask hypothesis

5 Recall: Parameters We want to condition on lots of information, but recall that X,Y (x, y) = X Y =y (x) Y (y) O(xy + y) =O(xy) Neural networks let us learn arbitrary joint interactions, but control the number parameters. - Control overfitting/improve generalization - Reduce memory requirements

6 When to use them? If you want to condition on a lot of information (entire documents, entire sentences, ) But you don t know what features are useful Neural networks are great!

7 Neurons x 1 x 2 x 3 x 4 x 5 x

8 Neurons x 1 x 2 x 3 x 4 x 5 x w 1 w 2 w 3 w 4 w 5 w

9 Neurons x 1 w 1 x 2 x 3 w 2 w 3 y y = x w x 4 w 4 x 5 w 5 x w

10 Neurons x 1 w 1 x 2 x 3 w 2 w 3 y y = x w y = w x + b x 4 w 4 b is a bias term, can also x 5 w 5 be encoded by forcing one of the inputs to always be 1. x w

11 Neurons x 1 w 1 x 2 x 3 w 2 w 3 y y = x w y = w x + b x 4 w 4 y = g(w x + b) x 5 x w 5 w g is a nonlinear function that usually has a sigmoid shape

12 Neurons x 1 x 2 x 3 x 4 x 5 x w 1 w 2 w 3 w 4 w 5 w y

13 Neural Networks x 1 x 2 x 3 x 4 x 5 x

14 Neural Networks x 1 x 2 x 3 x 4 x 5 x w 4,1

15 Neural Networks x 1 x 2 x 3 y 1 y 2 y = Wx + b x 4 y 3 x 5 x W y

16 Neural Networks x 1 x 2 y 1 y = Wx + b x 3 y 2 y = g(wx + b) x 4 y 3 x 5 x W y Soft max g(u) i = exp u i P i 0 exp u i 0

17 Deep x 1 x 2 y 1 x 3 y 2 x 4 y 3 x 5 x W y

18 Deep x 1 x 2 y 1 x 3 y 2 x 4 y 3 x 5 x W y V

19 Deep x 1 z 1 x 2 y 1 z 2 x 3 y 2 z 3 x 4 y 3 z 4 x 5 x W y V z

20 Deep x 1 z 1 z = g(vy + c) x 2 y 1 z 2 x 3 y 2 z 3 x 4 y 3 z 4 x 5 x W y V z

21 Deep x 1 x 2 y 1 z 1 z 2 z = g(vy + c) z = g(vh(wx + b)+c) x 3 y 2 z 3 x 4 y 3 z 4 x 5 x W y V z

22 Deep x 1 x 2 y 1 z 1 z 2 z = g(vy + c) z = g(vh(wx + b)+c) x 3 x 4 y 2 y 3 z 3 Note: if g(x) =h(x) =x x 5 x W y V z 4 z z = V(Wx + b)+c = VWx {z } Ux + Vb + c {z } d

23 Predicting Discrete Objects x 1 z 1 DT x 2 y 1 z 2 NN x 3 y 2 z 3 VB x 4 y 3 z 4 JJ x 5 x W y V z Soft max g(u) i = exp u i P i 0 exp u i 0

24 More Concise tanh h = tanh(wx + b) y = Vh + a x W h V y

25 Feature Induction ŷ = Wx + b What features should we use?? L = X i ŷ i y i 2 2 In linear regression the goal is to learn W such that L is minimized on a training set. ŷ = W g(vr + c) {z } x +b Compute features of naive representation r, put result in x Learn to extract features that produce linear separability

26 Feature Induction ŷ = W g(vr + c) {z } x +b What can this function represent? If x is big enough, this can represent any function! This is obvious much more powerful than a linear model

27 Joint Interactions?

28 Feature Induction Neural networks let us use a fixed number of parameters Avoid the curse of dimensionality! But they let us learn interactions (conjunctions) - if it is helpful to do so. Let the data decide!

29 Training Neural networks are supervised We need pairs of inputs and outputs For LM: input = history, output = word Also needed: a differentiable loss function L(x, y)! R + Losses over training instances sum Any of our favorite losses work here

30 Loss Functions Squared Error (compare two vectors) L = 1 2 y y 2 Cross Entropy / Log Loss L = log p(y x)

31 Loss Functions Squared Error (compare two vectors) L = 1 2 y y 2 Cross Entropy / Log Loss L = log p(y x)

32 Stochastic Gradient Descent for i = 1, 2, Pick random training example t and compute: g (i) t,y t (i+1) = (i) g (i) = (i)

33 Computing Derivatives Training neural networks involves computing derivatives The standard algorithm for doing this is called backpropagation. It is a variant of automatic differentiation (technically reverse-mode AD )

34 Automatic Differentiation? Compiler translates a function into a sequence of small operations Every small operation (nb. of a differentiable function) is itself differentiable The chain rule tells us how to compute the derivatives of composite functions using the derivatives of the pieces they are composed of

35 Let s start with a really simple example. y = log sin 2 x What is the derivative at x 0? components range differential d-range y = f(u) = log u u = g(v) =v 2 v = h(x) =sinx R R R dy du = 1 u du dv =2v dv dx = cos x R R R dy dx x=x0 = dy du u=g(h(x0 )) du dv v=h(x0 ) dv dx x=x0

36 In general, for our applications x in f(x) will be a vector. nx y = (W exp x) i where x 2 R d and W 2 R n d i=1 components range differential d-range nx y = f(u) = i=1 u i u = g(v) =Wv v = h(x) =expx R R n = = diag(exp R 1 n R n d R d d dy dx x=x0 R 1 n R n d R d d = dy du u=g(h(x0 )) du dv v=h(x0 ) dv dx x=x0 2 R n 2 R d 2 R d

37 Two Evaluation Strategies Forward dy dx x=x0 R 1 n R n d R d d = dy du u=g(h(x0 )) du dv v=h(x0 ) dv dx x=x0 2 R n 2 R d 2 R d Backward or Adjoint

38 Two Evaluation Strategies Forward dy dx x=x0 R 1 n R n d R d d = dy du u=g(h(x0 )) du dv v=h(x0 ) dv dx x=x0 2 R n 2 R d 2 R d Backward or Adjoint

39 Learning with Backprop Training data x1 x2 y* Models y = V(Wx + b)+c y = V tanh(wx + b)+c Objective L =(y(x) y )

40 Learning with Backprop y - y* 2 y = V(Wx + b)+c Iteration

41 Learning with Backprop y - y* 2 y = V(Wx + b)+c y = V tanh(wx + b)+c Iteration

42 Concept: Word Embedding Represent each word in a vocabulary as a vector Vectors are learned as parameters Words that behave similarly in the model will usually end up with similar embeddings

43 A simple tagging model x = START start John books flights start STOP h t = tanh (C 1 x t 1 + C 0 x t + C +1 x t+1 + b) p(y i = y x,i)= p(y x) = y Y i=1 exp w y > h i + a y P y 0 2Y exp w> y h i + a y p(y i x,i)

44 A simple tagging model NNP h 1 x = START start John books flights start STOP h t = tanh (C 1 x t 1 + C 0 x t + C +1 x t+1 + b) p(y i = y x,i)= p(y x) = y Y i=1 exp w y > h i + a y P y 0 2Y exp w> y h i + a y p(y i x,i)

45 A simple tagging model NNP VBZ h 1 h 2 x = START start John books flights start STOP h t = tanh (C 1 x t 1 + C 0 x t + C +1 x t+1 + b) p(y i = y x,i)= p(y x) = y Y i=1 exp w y > h i + a y P y 0 2Y exp w> y h i + a y p(y i x,i)

46 A simple tagging model NNP VBZ NNS h 1 h 2 h 3 x = START start John books flights start STOP h t = tanh (C 1 x t 1 + C 0 x t + C +1 x t+1 + b) p(y i = y x,i)= p(y x) = y Y i=1 exp w y > h i + a y P y 0 2Y exp w> y h i + a y p(y i x,i)

47 Simple Tagging Model Parameters: word embeddings, C -1, C 0, C 2, W, b Upsides of this model: Decisions are independent- likelihood/decoding is cheap (no Viterbi!) Downsides of this model Say that (A B) and (B A) are both good taggings according to the observations, but (A A) and (B B) are bad taggings. Independence hurts us! Limited context window

48 Simple Tagging Model In short: this is not a structured model, it is a bunch of independent classification decisions But, neural networks are really powerful learners maybe we don t really need as much structure? How can we address the finite horizon problem?

49 Revised Tagging Model NNP VBZ NNS h 1 h 2 h 3 x = START start John books flights start STOP

50 Revised Tagging Model NNP VBZ NNS h 1 h 2 h 3 x = START start John books flights start STOP

51 Recurrent Neural Networks (RNNs) h t = f(h t 1, x t ) c = RNN(x) 0 c x = START start x 1 x 2 x 3 x 4 What is a vector representation of a sequence? Note: numerous definitions exist for f. x

52 Bidirectional RNN Tagging Model 0 x = START start John books flights start STOP

53 Bidirectional RNN Tagging Model 0 0 x = START start John books flights start STOP

54 Bidirectional RNN Tagging Model NNP VBZ NNS 0 0 x = START start John books flights start STOP

55 BiRNN Tagging Model In short: this is still not a structured model, it is a bunch of independent classification decisions But, neural networks are really powerful learners maybe we don t really need as much structure? This POS tagger is currently state-of-the-art! Maybe POS tagging doesn t need that much structure

56 Next time Recurrent models for adding statistical dependencies among output variables

Lecture 5 Neural models for NLP

CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm