Deep Learning for NLP

1 Deep Learning for NLP Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Greg Durrett

2 Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

3 Sentiment Analysis the movie was very good

4 Sentiment Analysis with Linear Models
Example                                   Label   Feature Type
the movie was very good                     +     Unigrams: I[good]
the movie was very bad                      -     Unigrams: I[bad]
the movie was not bad                       +     Bigrams: I[not bad]
the movie was not very good                 -     Trigrams: I[not very good]
the movie was not really very enjoyable     -     4-grams!

5 Drawbacks
More complex features capture interactions but scale badly (13M unigrams, 1.3B 4-grams in Google n-grams)
Can we do better than seeing every n-gram once in the training data? ("not very good", "not so great", "the movie was not really very enjoyable")
Instead of more complex linear functions, let's use simpler nonlinear functions, namely neural networks

6 Neural Networks: XOR
Let's see how we can use neural nets to learn a simple nonlinear function
Inputs: x_1, x_2 (generally x = (x_1, ..., x_m))
Output: y (generally y = (y_1, ..., y_n))
y = x_1 XOR x_2
[Figure: the four XOR points plotted in the (x_1, x_2) plane]

7 Neural Networks: XOR
y = a_1 x_1 + a_2 x_2   (linear: cannot represent XOR)
y = a_1 x_1 + a_2 x_2 + a_3 tanh(x_1 + x_2)
(tanh looks like an action potential in a neuron)
[Figure: XOR truth table and the tanh curve]

8 Neural Networks: XOR
y = a_1 x_1 + a_2 x_2   (linear: cannot represent XOR)
y = a_1 x_1 + a_2 x_2 + a_3 tanh(x_1 + x_2)
One solution: y = -2x_1 - 2x_2 + 4 tanh(x_1 + x_2)
[Figure: XOR truth table and the resulting decision surface]

9 Neural Networks: XOR
y = -2x_1 - 2x_2 + 4 tanh(x_1 + x_2)
[Figure: the same function over indicator features x_1 = I[not], x_2 = I[good]]
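
To sanity-check this construction, here is a minimal numpy sketch. The coefficients a_1 = a_2 = -2, a_3 = 4 are one working choice reconstructed for this transcription; the slide's exact constants may differ:

    import numpy as np

    def xor_net(x1, x2):
        # Assumed coefficients (reconstructed, not necessarily the
        # slide's exact values): a1 = a2 = -2, a3 = 4.
        return -2 * x1 - 2 * x2 + 4 * np.tanh(x1 + x2)

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, round(float(xor_net(x1, x2)), 2))
    # prints approximately 0.0, 1.05, 1.05, -0.14 -- close to XOR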

10 Neural Networks
Linear model: y = w . x + b
Neural network: y = g(w . x + b), or with a weight matrix, y = g(Wx + b)
g is a nonlinear transformation: warp space, then shift

11 Neural Networks
[Figure: a linear classifier vs. a neural network on the same data]
Linear classification becomes possible because we transformed the space!

12 Deep Neural Networks
(this was our neural net from the XOR example)
y_1 = g(w_1 . x + b_1)

13 Deep Neural Networks
y_1 = g(w_1 . x + b_1)

14 Deep Neural Networks
Input -> Hidden Layer -> Output
z = g(V g(Wx + b) + c)
With y = g(Wx + b) the output of the first layer: z = g(Vy + c)
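
As a concrete illustration, here is a minimal numpy forward pass for this two-layer network (the layer sizes are arbitrary toy values, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    g = np.tanh                      # the nonlinearity g

    # Toy sizes (assumed): 4 inputs, 3 hidden units, 2 outputs.
    W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
    V, c = rng.normal(size=(2, 3)), rng.normal(size=2)

    x = rng.normal(size=4)
    y = g(W @ x + b)                 # output of first layer
    z = g(V @ y + c)                 # z = g(V g(Wx + b) + c)
    print(z.shape)                   # (2,)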

15 Neural Networks
[Figure: a linear classifier vs. a neural network on the same data]
Linear classification becomes possible because we transformed the space!

16 Deep Neural Networks
[Figure: a network with multiple hidden layers]

17 Deep Neural Networks
Input -> Hidden Layer -> Output
z = g(V g(Wx + b) + c)
With no nonlinearity (g = identity): z = V(Wx + b) + c = VWx + Vb + c
Equivalent to a single linear layer z = Ux + d, so depth buys nothing without g
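
A quick numerical check of this collapse (toy sizes assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
    V, c = rng.normal(size=(2, 3)), rng.normal(size=2)
    x = rng.normal(size=4)

    # With g = identity the two layers collapse into one linear map:
    z_two_layer = V @ (W @ x + b) + c
    U, d = V @ W, V @ b + c                      # z = Ux + d
    print(np.allclose(z_two_layer, U @ x + d))   # True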

18 Deep Neural Networks
Input -> Hidden Layer -> Output
Nodes in the hidden layer can learn interactions or conjunctions of features
e.g., with x_1 = I[not] and x_2 = I[good], the hidden unit tanh(x_1 + x_2) in
y = -2x_1 - 2x_2 + 4 tanh(x_1 + x_2) acts like "not OR good"

19 Learning Neural Networks
Input -> Hidden Layer -> Output
We need: change in output w.r.t. hidden, change in output w.r.t. input, change in hidden w.r.t. input
Computing these looks like running this network in reverse (backpropagation)
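
The sketch below computes these derivatives by hand for a one-hidden-layer network with a single linear output (a simplification assumed here to keep the chain rule short):

    import numpy as np

    rng = np.random.default_rng(2)
    W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
    v = rng.normal(size=3)            # single linear output unit
    x = rng.normal(size=4)

    h = np.tanh(W @ x + b)            # hidden layer
    z = v @ h                         # output

    dz_dh = v                         # change in output w.r.t. hidden
    local = dz_dh * (1 - h ** 2)      # tanh'(a) = 1 - tanh(a)^2
    dz_dW = np.outer(local, x)        # gradient w.r.t. the weights
    dz_dx = W.T @ local               # change in output w.r.t. input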

20 Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

21 Feedforward Bag-of-words
y = g(Wx + b)
x: binary indicator vector, length = vocabulary size (features like I[a], I[to], I[not], I[good], I[bad])
W: real-valued matrix, dims = vocabulary size (~10k) x hidden layer size (~100)
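
A minimal sketch of this model (tiny vocabulary and hidden size assumed for readability):

    import numpy as np

    vocab = {"the": 0, "movie": 1, "was": 2, "not": 3, "good": 4, "bad": 5}
    hidden_size = 3                  # ~100 on the slide; tiny here

    rng = np.random.default_rng(3)
    W = rng.normal(size=(hidden_size, len(vocab)))   # hidden x vocab, so y = g(Wx + b)
    b = rng.normal(size=hidden_size)

    def bow_vector(tokens):
        # binary indicator vector, length = vocabulary size
        x = np.zeros(len(vocab))
        for t in tokens:
            x[vocab[t]] = 1.0
        return x

    x = bow_vector("the movie was good".split())
    y = np.tanh(W @ x + b)           # y = g(Wx + b)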

22 Drawbacks to FFBoW
Lots of parameters to learn
Doesn't preserve ordering in the input: "really not very good" vs. "really not very enjoyable"
We don't know the relationship between good and enjoyable

23 Word Embeddings
word2vec: turn each word into a 100-dimensional vector
Context-based embeddings: find a vector predictive of a word's context
Words in similar contexts will end up with similar vectors
[Figure: great, good, enjoyable cluster together; bad, dog, is land elsewhere]
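
The standard way to quantify "similar vectors" is cosine similarity; the sketch below uses made-up 4-dimensional vectors purely for illustration (real word2vec vectors would be ~100-dimensional):

    import numpy as np

    # Toy "embeddings", invented for illustration only.
    emb = {
        "good":      np.array([0.9, 0.1, 0.3, 0.0]),
        "enjoyable": np.array([0.8, 0.2, 0.4, 0.1]),
        "bad":       np.array([-0.7, 0.1, 0.2, 0.0]),
    }

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cosine(emb["good"], emb["enjoyable"]))  # high: similar contexts
    print(cosine(emb["good"], emb["bad"]))        # much lower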

24 Feedforward with word vectors
y = g(Wx + b)
x: concatenated word vectors, length = sentence length (~10) x vector size (~100)
W: hidden layer size (~100) x (sentence length x vector size)
Each x now represents multiple bits of input; can capture word similarity ("the movie was good." vs. "the movie was great.")
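
Concretely, the input is a concatenation of the word vectors (toy sizes assumed below):

    import numpy as np

    rng = np.random.default_rng(4)
    vec_size, sent_len, hidden_size = 5, 4, 3    # tiny stand-ins for ~100/~10/~100
    emb = {w: rng.normal(size=vec_size)
           for w in ["the", "movie", "was", "good"]}

    # Concatenate the word vectors into one fixed-position input.
    x = np.concatenate([emb[w] for w in "the movie was good".split()])

    W = rng.normal(size=(hidden_size, sent_len * vec_size))
    b = rng.normal(size=hidden_size)
    y = np.tanh(W @ x + b)           # y = g(Wx + b)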

25 Feedforward with word vectors
y = g(Wx + b)
"the movie was good." vs. "the movie was very good": inserting "very" shifts every later word to a different position in x
Need our model to be shift-invariant, like bag-of-words is

26 Comparing Architectures
Instead of more complex linear functions, let's use simpler nonlinear functions
Feedforward bag-of-words: didn't take advantage of word similarity, lots of parameters to learn
Feedforward with word vectors: our parameters are attached to particular indices in a sentence
Solution: convolutional neural nets

27 Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

28 Convolutional Networks
[Figure: a "good" filter slides over "the movie was good"; max pooling over the filter outputs gives max = 1.1]

29 Convolutional Networks
[Figure: filters for good, bad, okay, terrible each slide over "the movie was good." and are max-pooled to one value apiece]

30 Convolutional Networks
Input: n vectors of length m each
k filters of length m each
k filter outputs of length 1 each, after max pooling (see the sketch below)
Takes variable-length input and turns it into fixed-length output
Filters are initialized randomly and then learned
Outputs are features for a classifier, or input to another neural net layer
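
A minimal numpy version of this filter-and-max-pool computation. Random vectors and filters stand in for learned ones, and the filters here have width 2 (spanning two words, as in the "not good" example on a later slide); all sizes are toy values:

    import numpy as np

    rng = np.random.default_rng(5)
    m = 5                                    # word vector length (toy)
    sent = [rng.normal(size=m) for _ in "the movie was not good".split()]

    def conv_max(sentence, filt):
        # Slide a width-w filter over the sentence; max-pool to one number.
        w = filt.shape[0]                    # filter width in words
        scores = [np.sum(filt * np.stack(sentence[i:i + w]))
                  for i in range(len(sentence) - w + 1)]
        return max(scores)

    k = 3                                    # number of filters
    filters = [rng.normal(size=(2, m)) for _ in range(k)]
    features = np.array([conv_max(sent, f) for f in filters])
    print(features.shape)                    # (k,) -- fixed length for any sentence length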

31 Convolutional Networks
[Figure: the same filter over "the movie was great"; max = 1.8]
Word vectors for similar words are similar, so convolutional filters will have similar outputs

32 Convolutional Networks
[Figure: a width-2 filter matching "not good" over "the movie was not good."; max = 1.5]
Analogous to bigram features in bag-of-words models

33 Comparing Architectures
Instead of more complex linear functions, let's use simpler nonlinear functions
Convolutional networks let us take advantage of word similarity
Convolutional networks are translation-invariant like bag-of-words
Convolutional networks can capture local interactions with filters of width > 1 (e.g., "not good")

34 Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

35 Sentence Classification
the movie was not good. -> convolutional -> fully connected -> prediction

36 Object Recognition
[Figure: AlexNet (2012) -- convolutional layers followed by fully connected layers]

37 Neural Networks
NNs are built from convolutional layers, fully connected layers, and some other types
Can chain these together into various architectures
Any neural network built this way can be learned from data!

38 Sentence Classification
the movie was not good. -> convolutional -> fully connected -> prediction

39 Sentence Classification
Tasks: movie review sentiment, subjectivity/objectivity detection, product reviews, question type classification
Outperforms highly-tuned bag-of-words models
Taken from Kim (2014)

40 Entity Linking
"Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999 to 2005."
? Lance Edward Armstrong is an American former professional road cyclist
? Armstrong County is a county in Pennsylvania
Conventional approach: compare tf-idf feature vectors for overlap
Convolutional networks can capture many of the same effects: distill notions of topic from n-grams
Francis-Landau, Durrett, and Klein (NAACL 2016)

41 Entity Linking
Source context: "...disqualified Armstrong from his seven consecutive Tour de France wins from 1999 to 2005"
Candidate 1: "Lance Edward Armstrong is an American former professional road cyclist"
Candidate 2: "Armstrong County is a county in Pennsylvania"
Each text is run through a convolutional net to produce a topic vector
Similar topic vectors -> probable link; dissimilar -> improbable link
Francis-Landau, Durrett, and Klein (NAACL 2016)

42 Syntactic Parsing
[Figure: two parse trees for "He wrote a long report on Mars.", differing in whether the PP attaches to the noun ("report on Mars") or to the verb ("wrote ... on Mars")]

43 Chart Parsing
chart value = score(rule) + chart(left child) + chart(right child)
[Figure: an NP -> NP PP chart entry over "He wrote a long report on Mars"]
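
A minimal sketch of this recurrence as max-score CKY over a toy grammar (the grammar and rule scores below are invented purely for illustration):

    from collections import defaultdict

    # Toy weighted grammar (scores made up for illustration).
    unary = {("He", "NP"): 0.0, ("wrote", "VBD"): 0.0, ("reports", "NP"): 0.0}
    binary = {("NP", "VP", "S"): 1.0, ("VBD", "NP", "VP"): 0.5}

    def cky(words):
        n = len(words)
        chart = defaultdict(lambda: float("-inf"))
        for i, w in enumerate(words):
            for (word, sym), s in unary.items():
                if word == w:
                    chart[i, i + 1, sym] = s
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for (L, R, parent), s in binary.items():
                        # chart value = score(rule) + chart(left) + chart(right)
                        total = s + chart[i, k, L] + chart[k, j, R]
                        chart[i, j, parent] = max(chart[i, j, parent], total)
        return chart[0, n, "S"]

    print(cky(["He", "wrote", "reports"]))   # 1.5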

44 Syntactic Parsing
score(NP -> NP PP) = w^T f(NP -> NP PP)
Example feature: f = I[left child's last word = report AND rule = NP -> NP PP]
Features need to combine surface information and syntactic information, but looking at words directly ends up being very sparse

45 Scoring Parses with Neural Nets
score(NP -> NP PP) = s^T v
v: vector representation of the rule being applied, computed by a neural network over the anchored span ("He wrote a long report on Mars" / "... on Jupiter")
Durrett and Klein (ACL 2015)
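
Schematically, this scoring step looks as follows (sizes and the dense feature vector are assumed here, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(6)
    d_feat, d_hidden = 8, 4          # toy sizes (assumed)

    # Hypothetical dense features of the anchored rule (span words, etc.).
    f = rng.normal(size=d_feat)

    W = rng.normal(size=(d_hidden, d_feat))
    s = rng.normal(size=d_hidden)

    v = np.tanh(W @ f)               # vector representation of the rule
    score = s @ v                    # score(rule) = s^T v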

46 Syntactic Parsing
Discrete + continuous scoring of each rule
Parsing a sentence:
Feedforward pass on the nets
Discrete feature computation
Run the CKY dynamic program
Durrett and Klein (ACL 2015)

47 Machine Translation
[Figure: an encoder-decoder built from long short-term memory units, mapping "le chat a mangé" to "the cat ate STOP"]

48 Long Short-Term Memory Networks
Map a sequence of inputs to a sequence of outputs
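
For reference, one step of a standard LSTM cell in numpy (biases omitted and sizes assumed; this is the textbook formulation, not necessarily the exact variant in the MT system shown):

    import numpy as np

    rng = np.random.default_rng(7)
    d = 4                                      # hidden/input size (toy)

    def sigmoid(a):
        return 1 / (1 + np.exp(-a))

    # One parameter matrix per gate, acting on [h_prev; x].
    Wi, Wf, Wo, Wc = (rng.normal(size=(d, 2 * d)) for _ in range(4))

    def lstm_step(x, h_prev, c_prev):
        z = np.concatenate([h_prev, x])
        i, f, o = sigmoid(Wi @ z), sigmoid(Wf @ z), sigmoid(Wo @ z)
        c = f * c_prev + i * np.tanh(Wc @ z)   # gated memory cell
        h = o * np.tanh(c)                     # output for this step
        return h, c

    h = c = np.zeros(d)
    for x in [rng.normal(size=d) for _ in range(3)]:   # a 3-step sequence
        h, c = lstm_step(x, h, c)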

49 Machine Translation
[Figure: the same LSTM encoder-decoder: "le chat a mangé" -> "the cat ate STOP"]
Google is moving towards this architecture; performance is constantly improving compared to phrase-based methods

50 Neural Network Tools
TensorFlow: by Google, actively maintained, bindings for many languages
Theano: University of Montreal, less and less maintained
Torch: Facebook AI Research, Lua

51 Neural Network Tools
[Figure]

52 Word Vector Tools
word2vec: Python code, actively maintained
GloVe: word vectors trained on very large corpora
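
For example, training word vectors with gensim's word2vec implementation looks roughly like this (assumes gensim >= 4 is installed; parameter names differ in older versions, and the two-sentence corpus is a toy stand-in for a real one):

    from gensim.models import Word2Vec

    sentences = [["the", "movie", "was", "good"],
                 ["the", "movie", "was", "enjoyable"]]
    model = Word2Vec(sentences, vector_size=100, min_count=1)
    print(model.wv.most_similar("good", topn=3))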

53 Convolutional Networks
CNNs for sentence classification: Python code, based on a tutorial; trains very quickly

54 Takeaways
Neural networks have several advantages for NLP:
We can use simpler nonlinear functions instead of more complex linear functions
We can take advantage of word similarity
We can build models that are both position-dependent (feedforward neural networks) and position-independent (convolutional networks)
NNs have natural applications to many problems
While conventional linear models often still do well, neural nets are increasingly the state of the art for many tasks
