Better Conditional Language Modeling. Chris Dyer


1 Better Conditional Language Modeling Chris Dyer

2 Conditional LMs
A conditional language model assigns probabilities to sequences of words, w = (w_1, w_2, ..., w_ℓ), given some conditioning context, x. As with unconditional models, it is again helpful to use the chain rule to decompose this probability:
p(w | x) = Π_{t=1}^{ℓ} p(w_t | x, w_1, w_2, ..., w_{t-1})
What is the probability of the next word, given the history of previously generated words and the conditioning context x?
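A minimal sketch of how this factorization is used to score a candidate output: the model only has to expose a next-word distribution. The next_word_logprob interface and the uniform toy model below are hypothetical stand-ins, not part of the lecture.

import math

def score_sequence(next_word_logprob, x, w):
    # Chain rule: log p(w | x) = sum_t log p(w_t | x, w_{<t}).
    total = 0.0
    for t, word in enumerate(w):
        total += next_word_logprob(x, w[:t], word)   # condition on x and the prefix w_{<t}
    return total

# Toy stand-in model: uniform over a 4-word vocabulary, ignoring x and the prefix.
vocab = ["I'd", "like", "a", "beer"]
uniform_logprob = lambda x, prefix, word: math.log(1.0 / len(vocab))
print(score_sequence(uniform_logprob, "Ich möchte ein Bier", ["I'd", "like", "a", "beer"]))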

3 Kalchbrenner and Blunsom 2013
Encoder: c = embed(x) (an embedding of the source sentence), s = Vc
Recurrent decoder:
h_t = g(W[h_{t-1}; w_{t-1}] + s + b)
u_t = P h_t + b'
p(w_t | x, w_{<t}) = softmax(u_t)
Each step combines the recurrent connection h_{t-1}, the embedding of the previous word w_{t-1}, the source encoding s, and a learnt bias.
Recall the unconditional RNN: h_t = g(W[h_{t-1}; w_{t-1}] + b)
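A rough numpy sketch of one step of this recurrent decoder. All sizes and parameter values below are invented for illustration; in the real model W, P, the biases, the embeddings, and s = Vc are learned.

import numpy as np

rng = np.random.default_rng(0)
n, d, V = 8, 6, 10                       # hidden size, embedding size, vocab size (made up)
W = rng.normal(size=(n, n + d))          # recurrent + input weights
b = rng.normal(size=n)
P = rng.normal(size=(V, n))              # output projection
b_out = rng.normal(size=V)               # the learnt output bias b'
s = rng.normal(size=n)                   # stands in for s = Vc, the transformed source encoding

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def decoder_step(h_prev, w_prev_embed):
    # h_t = g(W[h_{t-1}; w_{t-1}] + s + b), with g = tanh
    h = np.tanh(W @ np.concatenate([h_prev, w_prev_embed]) + s + b)
    p = softmax(P @ h + b_out)           # p(w_t | x, w_{<t})
    return h, p

h1, p1 = decoder_step(np.zeros(n), rng.normal(size=d))   # start state and an <s> embedding
print(p1.sum())                                          # ≈ 1.0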

4 K&B 2013: RNN Decoder
[Figure: the recurrent decoder unrolled over the output "tom likes beer </s>": starting from h_0 and <s>, each hidden state h_t feeds a softmax giving p(tom | s, <s>), p(likes | s, <s>, tom), p(beer | s, <s>, tom, likes), p(</s> | s, <s>, tom, likes, beer), with the source encoding s fed into every step.]

5 K&B 2013: Encoder
How should we define c = embed(x)? The simplest model possible: the sum of the source word embeddings,
c = Σ_i x_i
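In code this encoder is just an embedding lookup followed by a sum over positions; a tiny sketch with a made-up vocabulary and dimensions:

import numpy as np

rng = np.random.default_rng(0)
vocab = {"Ich": 0, "möchte": 1, "ein": 2, "Bier": 3}
E = rng.normal(size=(len(vocab), 8))             # one n-dimensional embedding per word type
x = ["Ich", "möchte", "ein", "Bier"]
c = E[[vocab[w] for w in x]].sum(axis=0)         # c = sum_i x_i: word order is lost (bag of words)
print(c.shape)                                   # (8,)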

6 K&B 2013: Problems
The bag of words assumption is really bad (part 1):
"Alice saw Bob." vs. "Bob saw Alice."
"I would like some fresh bread with aged cheese." vs. "I would like some aged bread with fresh cheese."
We are putting a lot of information inside a single vector (part 2).

7 Sutskever et al. (2014)
LSTM encoder: (c_0, h_0) are parameters;
(c_i, h_i) = LSTM(x_i, c_{i-1}, h_{i-1})
The encoding is (c_ℓ, h_ℓ) where ℓ = |x|.
LSTM decoder: w_0 = <s>;
(c_{t+ℓ}, h_{t+ℓ}) = LSTM(w_{t-1}, c_{t+ℓ-1}, h_{t+ℓ-1})
u_t = P h_{t+ℓ} + b
p(w_t | x, w_{<t}) = softmax(u_t)
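A compact numpy sketch of this architecture with a single-layer LSTM and invented sizes; the published system used deep LSTMs, learned embeddings, and of course training, so this only shows the shape of the forward computation.

import numpy as np

rng = np.random.default_rng(0)
n, d, V = 16, 8, 12                             # hidden size, embedding size, vocab size (made up)

def lstm_params():
    return (rng.normal(scale=0.1, size=(4 * n, n + d)), np.zeros(4 * n))

def lstm_step(params, x, c_prev, h_prev):
    W, b = params
    z = W @ np.concatenate([h_prev, x]) + b
    i, f, o, g = np.split(z, 4)                 # input, forget, output gates and candidate
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o, g = sig(i), sig(f), sig(o), np.tanh(g)
    c = f * c_prev + i * g
    return c, o * np.tanh(c)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

enc, dec = lstm_params(), lstm_params()
E_src = rng.normal(scale=0.1, size=(V, d))      # source embeddings
E_tgt = rng.normal(scale=0.1, size=(V, d))      # target embeddings
P, b_out = rng.normal(scale=0.1, size=(V, n)), np.zeros(V)

# Encoder: read the source; (c, h) after the last word is the sentence encoding.
src = [3, 1, 4, 1]                              # toy source word ids
c, h = np.zeros(n), np.zeros(n)
for i in src:
    c, h = lstm_step(enc, E_src[i], c, h)

# Decoder: continue from (c, h), feeding the previous target word (<s> here is id 0).
c, h = lstm_step(dec, E_tgt[0], c, h)
p = softmax(P @ h + b_out)                      # p(w_1 | x)
print(p.argmax(), p.sum())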

8-12 Sutskever et al. (2014)
[Figure, built up over five slides: the encoder reads the source "Aller Anfang ist schwer" (between START and STOP) into a single vector c; the decoder then generates "Beginnings are difficult" one word at a time, finishing with STOP.]

13 Sutskever et al. (2014)
Good: RNNs deal naturally with sequences of various lengths; LSTMs in principle can propagate gradients a long distance; very simple architecture!
Bad: the hidden state has to remember a lot of information!

14-15 Sutskever et al. (2014): Tricks
Read the input sequence backwards: +4 BLEU
[Figure: the encoder-decoder diagram for "Aller Anfang ist schwer" → "Beginnings are difficult".]

16 Sutskever et al. (2014): Tricks
Use an ensemble of J independently trained models.
Ensemble of 2 models: +3 BLEU
Ensemble of 5 models: +4.5 BLEU
Decoder:
(c_{t+ℓ}^{(j)}, h_{t+ℓ}^{(j)}) = LSTM^{(j)}(w_{t-1}, c_{t+ℓ-1}^{(j)}, h_{t+ℓ-1}^{(j)})
u_t^{(j)} = P h_{t+ℓ}^{(j)} + b^{(j)}
u_t = (1/J) Σ_{j'=1}^{J} u_t^{(j')}
p(w_t | x, w_{<t}) = softmax(u_t)
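The ensemble step itself is tiny: average the logits u_t^{(j)} across models before the softmax. A sketch with arbitrary numbers (averaging the probabilities instead of the logits is another common choice, not the one on the slide):

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def ensemble_distribution(per_model_logits):
    # per_model_logits: list of J vectors u_t^{(j)}, one per independently trained model
    u_avg = np.mean(per_model_logits, axis=0)   # u_t = (1/J) * sum_j u_t^{(j)}
    return softmax(u_avg)

logits = [np.array([1.0, 0.2, -0.5]), np.array([0.8, 0.5, -1.0])]
print(ensemble_distribution(logits))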

17 Sutskever et al. (2014): Tricks
Use beam search: +1 BLEU
[Figure: beam search for x = "Bier trinke ich": starting from <s> (logprob = 0), candidate continuations at output positions w_1, w_2, w_3 (e.g. "beer" logprob = -1.82, "I" logprob = -2.11, "drink" logprob = -2.87, "wine" logprob = -5.12, ...) are expanded, and the lowest-scoring partial hypotheses are pruned at each step.]

18 Sutskever et al. (2014): Tricks Use beam search: +1 BLEU Make the beam really big: -1 BLEU (Koehn and Knowles, 2017)
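A minimal beam-search sketch over a generic next-word scorer. The next_logprobs interface and the toy model below are hypothetical; a real decoder would also carry the RNN state along with each hypothesis and typically length-normalizes the scores.

import math

def beam_search(next_logprobs, beam_size, max_len, bos="<s>", eos="</s>"):
    # next_logprobs(prefix) -> dict {word: log p(word | x, prefix)}; hypothetical interface
    beam = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix[-1] == eos:
                candidates.append((prefix, score))   # finished hypotheses are kept as-is
                continue
            for word, lp in next_logprobs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == eos for p, _ in beam):
            break
    return beam[0]

# Toy model that prefers the next word of "I'd like a beer </s>" at each position.
target = ["I'd", "like", "a", "beer", "</s>"]
def toy(prefix):
    want = target[len(prefix) - 1] if len(prefix) <= len(target) else "</s>"
    return {w: (math.log(0.6) if w == want else math.log(0.1)) for w in set(target)}

print(beam_search(toy, beam_size=2, max_len=6))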

19-22 Conditioning with vectors
We are compressing a lot of information in a finite-sized vector.
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!" (Prof. Ray Mooney)
Gradients have a long way to travel. Even LSTMs forget!
What is to be done?

23 Solving the vector bottleneck
Represent a source sentence as a matrix; generate a target sentence from a matrix.
This will solve the capacity problem and the gradient flow problem.

24 Sentences as vectors → matrices
Problem with the fixed-size vector model: sentences are of different sizes, but vectors are of the same size.
Solution: use matrices instead. Fixed number of rows, but the number of columns depends on the number of words. Usually #cols = |f|.

25-28 Sentences as matrices
Example sentences of different lengths: "Ich möchte ein Bier", "Mach's gut", "Die Wahrheiten der Menschen sind die unwiderlegbaren Irrtümer".
Question: How do we build these matrices?

29 Sentences as matrices: with concatenation
Each word type is represented by an n-dimensional vector.
Take all of the vectors for the sentence and concatenate them into a matrix.
Simplest possible model. So simple, no one has bothered to publish how well/badly it works!

30-32 [Figure, built up over three slides: the words x_1 ... x_4 ("Ich möchte ein Bier") are embedded and each column is simply f_i = x_i, giving F ∈ R^{n×|f|}.]

33 Sentences as matrices: with bidirectional RNNs
By far the most widely used matrix representation, due to Bahdanau et al. (2015).
One column per word. Each column (word) has two halves concatenated together:
a forward representation, i.e., a word and its left context;
a reverse representation, i.e., a word and its right context.
Implementation: bidirectional RNNs (GRUs or LSTMs) read f from left to right and from right to left; concatenate the representations.

34-41 [Figure, built up over eight slides: a forward RNN reads x_1 ... x_4 ("Ich möchte ein Bier") producing →h_1 ... →h_4, a backward RNN produces ←h_1 ... ←h_4, and each column of the matrix is the concatenation f_i = [←h_i ; →h_i], giving F ∈ R^{2n×|f|}.]
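A numpy sketch of building F this way, using plain tanh RNNs in place of the GRUs/LSTMs and made-up sizes; no biases or training, just the shape of the computation.

import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 6                                    # hidden size per direction, embedding size
Wf = rng.normal(scale=0.1, size=(n, n + d))    # forward RNN weights
Wb = rng.normal(scale=0.1, size=(n, n + d))    # backward RNN weights

def rnn(W, inputs):
    h, out = np.zeros(n), []
    for x in inputs:
        h = np.tanh(W @ np.concatenate([h, x]))
        out.append(h)
    return out

x = [rng.normal(size=d) for _ in range(4)]     # embeddings of "Ich möchte ein Bier"
h_fwd = rnn(Wf, x)                             # left-to-right states
h_bwd = rnn(Wb, x[::-1])[::-1]                 # right-to-left states, re-aligned to positions
F = np.stack([np.concatenate([hb, hf]) for hb, hf in zip(h_bwd, h_fwd)], axis=1)
print(F.shape)                                 # (2n, |f|) = (16, 4)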

42 Sentences as matrices: where are we in 2018?
There are lots of ways to construct F; more exotic architectures are coming out daily.
Increasingly common goal: get rid of O(|f|) sequential processing steps (i.e., RNNs) during training.
Syntactic information can help (Sennrich & Haddow, 2016; Nadejde et al., 2017), but many more integration strategies are possible.
Try something with phrase types instead of word types? Multi-word expressions are a pain in the neck.

43 Conditioning on matrices
We have a matrix F representing the input; now we need to generate from it.
Bahdanau et al. (2015) and Luong et al. (2015) concurrently proposed using attention for translating from matrix-encoded sentences.
High-level idea: generate the output sentence word by word using an RNN. At each output position t, the RNN receives two inputs (in addition to any recurrent inputs):
a fixed-size vector embedding of the previously generated output symbol e_{t-1};
a fixed-size vector encoding a "view" of the input matrix.
How do we get a fixed-size vector from a matrix that changes over time? Bahdanau et al.: do a weighted sum of the columns of F (i.e., of the words) based on how important they are at the current time step (i.e., just a matrix-vector product F a_t).
The weighting of the input columns at each time step (a_t) is called attention.
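The "view" of the input matrix is literally a matrix-vector product; a toy illustration with made-up numbers:

import numpy as np

F = np.arange(8.0).reshape(2, 4)               # 2 x |f| source matrix (toy values)
a_t = np.array([0.1, 0.6, 0.2, 0.1])           # attention weights over the 4 source words
c_t = F @ a_t                                  # fixed-size context vector, regardless of |f|
print(c_t)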

44-49 Recall RNNs
[Figure, built up over several slides: recap of an RNN decoder unrolled over time, generating "I'd like ..." one word at a time, with each generated word fed back as the next input.]

50-64 [Figure, built up over many slides: attention-based decoding of "Ich möchte ein Bier" into "I'd like a beer". At each output step t the attention weights over the source words are shown (the attention history a_1^T ... a_5^T accumulates row by row), the context c_t = F a_t is computed, the decoder emits the next word ("I'd", "like", "a", "beer"), and generation ends with STOP.]

65 Attention
How do we know what to attend to at each timestep? That is, how do we compute a_t?

66 Computing attention
At each time step (one time step = one output word), we want to be able to attend to different words in the source sentence. We need a weight for every column: this is an |f|-length vector a_t.
Here is a simplified version of Bahdanau et al.'s solution:
Use an RNN to predict the model output; call its hidden states s_t (s_t has a fixed dimensionality, call it m).
At time t, compute the query r_t = V s_{t-1} (V is a learned parameter).
Take the dot product with every column in the source matrix to compute the attention energy u_t = F^T r_t (called e_t in the paper). Since F has |f| columns, u_t has |f| rows.
Exponentiate and normalize to 1: a_t = softmax(u_t) (called α_t in the paper).
Finally, the input source vector for time t is c_t = F a_t.

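A sketch of exactly these steps with the simple dot-product energy; all sizes and values are invented, and in the model V (like everything else) is learned.

import numpy as np

rng = np.random.default_rng(0)
n2, m, flen = 16, 10, 4                        # rows of F, decoder state size, source length
F = rng.normal(size=(n2, flen))                # source matrix, one column per source word
V = rng.normal(scale=0.1, size=(n2, m))        # maps the decoder state to a query of size 2n
s_prev = rng.normal(size=m)                    # decoder state s_{t-1}

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

r_t = V @ s_prev                               # r_t = V s_{t-1}
u_t = F.T @ r_t                                # u_t = F^T r_t, one energy per column
a_t = softmax(u_t)                             # attention weights, sum to 1
c_t = F @ a_t                                  # context vector fed to the decoder RNN
print(a_t.round(3), c_t.shape)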

71 Computing attention: the nonlinear additive attention model
In the actual model, Bahdanau et al. replace the dot product between the columns of F and r_t with an MLP:
u_t = F^T r_t (simple model)
u_t = v^T tanh(WF + r_t) (Bahdanau et al.)
Here, W and v are learned parameters of appropriate dimension, and + broadcasts over the |f| columns of WF.
This can learn more complex interactions. It is unclear if the added complexity is necessary for good performance.

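Only the energy computation changes in the nonlinear variant; a sketch continuing in the same spirit, with W, v, and the attention dimension invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
n2, m, a_dim, flen = 16, 10, 12, 4             # rows of F, decoder state, attention dim, source length
F = rng.normal(size=(n2, flen))
V = rng.normal(scale=0.1, size=(a_dim, m))     # query now lives in the attention dimension
W = rng.normal(scale=0.1, size=(a_dim, n2))    # projects each column of F
v = rng.normal(scale=0.1, size=a_dim)
s_prev = rng.normal(size=m)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

r_t = V @ s_prev
u_t = v @ np.tanh(W @ F + r_t[:, None])        # u_t = v^T tanh(WF + r_t), broadcast over columns
a_t = softmax(u_t)
c_t = F @ a_t
print(u_t.shape, a_t.sum())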

74-77 Putting it all together
F = EncodeAsMatrix(f)    (Part 1 of lecture)
e_0 = <s>
s_0 = w    (learned initial state; Bahdanau uses U ←h_1)
X = WF    (precomputed once: it does not depend on any output decisions)
t = 0
while e_t ≠ </s>:
    t = t + 1
    r_t = V s_{t-1}
    u_t = v^T tanh(X + r_t)    (compute attention; Part 2 of lecture)
    a_t = softmax(u_t)
    c_t = F a_t
    s_t = RNN(s_{t-1}, [e_{t-1}; c_t])    (e_{t-1} here is a learned embedding of the symbol e_{t-1})
    y_t = softmax(P s_t + b)    (P and b are learned parameters)
    e_t | e_{<t} ~ Categorical(y_t)
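A runnable end-to-end sketch of this loop with toy random parameters: a plain tanh RNN stands in for the decoder RNN, a random matrix stands in for EncodeAsMatrix(f), and ids 0/1 play the roles of <s>/</s>. Everything here is invented for illustration and untrained, so the "translation" is random.

import numpy as np

rng = np.random.default_rng(0)
n2, m, a_dim, d, Vb, flen = 16, 12, 10, 8, 6, 4   # all sizes made up
F = rng.normal(size=(n2, flen))                   # stands in for EncodeAsMatrix(f)
E = rng.normal(scale=0.1, size=(Vb, d))           # target embeddings; id 0 = <s>, id 1 = </s>
Wr = rng.normal(scale=0.1, size=(m, m + d + n2))  # decoder RNN weights
Va = rng.normal(scale=0.1, size=(a_dim, m))
Wa = rng.normal(scale=0.1, size=(a_dim, n2))
va = rng.normal(scale=0.1, size=a_dim)
P, b = rng.normal(scale=0.1, size=(Vb, m)), np.zeros(Vb)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

X = Wa @ F                                        # precomputed: does not depend on output decisions
s = rng.normal(scale=0.1, size=m)                 # s_0 (learned in the real model)
e_t, out, t = 0, [], 0                            # e_0 = <s>
while e_t != 1 and t < 20:                        # stop at </s> or after 20 steps
    t += 1
    r = Va @ s
    a = softmax(va @ np.tanh(X + r[:, None]))     # attention weights over the source columns
    c = F @ a                                     # context vector c_t = F a_t
    s = np.tanh(Wr @ np.concatenate([s, E[e_t], c]))
    y = softmax(P @ s + b)
    e_t = rng.choice(Vb, p=y)                     # e_t ~ Categorical(y_t)
    out.append(int(e_t))
print(out)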

78 Attention in MT: Results Add attention to seq2seq translation: +11 BLEU


80 A word about gradients

81-83 [Figure: the attention decoding diagram for "Ich möchte ein Bier" → "I'd like a beer", repeated over three slides to trace how gradients flow back from the outputs through the attention-weighted contexts c_t = F a_t into the source encoder.]

84 Attention and translation
Cho's question: does a translator read and memorize the input sentence/document and then generate the output?
Compressing the entire input sentence into a vector basically says "memorize the sentence".
Common sense experience says translators refer back and forth to the input (also backed up by eyetracking studies).
Should humans be a model for machines?

85 Summary
Attention provides the ability to establish information flow directly from distant parts of the input; it is closely related to pooling operations in convnets (and other architectures).
The traditional attention model seems to care only about content: no obvious bias in favor of diagonals, short jumps, fertility, etc.
Some work has begun to add other structural biases (Luong et al., 2015; Cohn et al., 2016), but there are lots more opportunities.
Factorization into keys and values (Miller et al., 2016; Ba et al., 2016; Gulcehre et al., 2016).
Attention weights provide an interpretation you can look at.

86 Questions?
