Natural Language Understanding. Kyunghyun Cho, NYU & U. Montreal

1 Natural Language Understanding Kyunghyun Cho, NYU & U. Montreal

2 Machine Translation

3 NEURAL MACHINE TRANSLATION
Topics: Statistical Machine Translation
Source: e = (Economic, growth, has, slowed, down, in, recent, years, .)
Target: f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .)
$\log p(f \mid e) = \log p(e \mid f) + \log p(f) + C$, where $C = -\log p(e)$ does not depend on $f$
Translation model (TM): $\log p(e \mid f)$. Fit it with parallel corpora.
Language model (LM): $\log p(f)$. Fit it with monolingual corpora.
The whole task, $\log p(f \mid e)$, is conditional language modelling.
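The decomposition above can be sketched as a scoring rule over candidate translations. This is only an illustration: the two candidate strings and their probabilities are made-up numbers, not output of any real translation or language model.

```python
import math

# Hypothetical toy scores: "tm" stands for log p(e | f) from a translation
# model, "lm" for log p(f) from a language model, for two candidates f.
candidates = {
    "La croissance économique s'est ralentie": {"tm": math.log(0.20), "lm": math.log(0.05)},
    "La économique croissance ralentie s'est": {"tm": math.log(0.25), "lm": math.log(0.001)},
}

def noisy_channel_score(scores):
    """Return log p(e | f) + log p(f); the constant -log p(e) is dropped."""
    return scores["tm"] + scores["lm"]

# Pick the candidate f with the highest combined score.
best = max(candidates, key=lambda f: noisy_channel_score(candidates[f]))
print(best)
```

The second candidate has the higher translation-model score, but the language model penalises its word order, so the fluent candidate is selected.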

4 NEURAL MACHINE TRANSLATION
Topics: Statistical Machine Translation - In Reality
Log-linear model: $\log p(f \mid e) \approx \sum_{n=1}^{N} w_n f_n(e, f) + C$
Feature function $f_n(e, f)$: count-based or linguistics-based, each with a weight $w_n$. The model is learned from parallel and monolingual corpora.
Steps:
(1) Experts engineer useful features
(2) Use a simple log-linear model
(3) Use a strong, external language model
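A minimal sketch of the log-linear score, assuming three illustrative features. The feature names, values and weights below are hypothetical and only show how the weighted sum is formed.

```python
import numpy as np

def log_linear_score(features, weights, const=0.0):
    """Return sum_n w_n * f_n(e, f) + C for one sentence pair (e, f)."""
    return float(np.dot(weights, features) + const)

# Hypothetical feature values f_n(e, f) for one sentence pair:
# [phrase-table log-prob, language-model log-prob, length penalty]
features = np.array([-2.3, -4.1, -1.0])
# Hypothetical weights w_n; in practice these are tuned on held-out data.
weights = np.array([1.0, 0.5, 0.2])

score = log_linear_score(features, weights)
print(score)
```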

5 NEURAL MACHINE TRANSLATION
Topics: Sequence-to-Sequence Learning (Forcada & Ñeco, 1997; Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014)
[Figure: encoder-decoder network]
Encoder, reading e = (Economic, growth, has, slowed, down, in, recent, years, .): 1-of-K coding $w_i$, continuous-space word representation $s_i$, recurrent state $h_i$
Decoder, producing f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .): recurrent state $z_i$, word probability $p_i$, word sample $u_i$

6 NEURAL MACHINE TRANSLATION
Topics: Sequence-to-Sequence Learning, Encoder
Input: e = (Economic, growth, has, slowed, down, in, recent, years, .)
(1) 1-of-K coding of source words: $x_t$
(2) Continuous-space representation: $s_t = W^\top x_t$, where $W \in \mathbb{R}^{|V| \times d}$
(3) Recursively read words: $h_t = f(h_{t-1}, s_t)$, for $t = 1, \dots, T$
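The three encoder steps can be sketched in NumPy as follows. The sizes, the random weights and the choice of a plain tanh recurrence for $f$ are assumptions made for illustration; the slides do not fix the recurrent unit.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 10, 4, 6   # vocabulary size, embedding size, state size (hypothetical)

W = rng.normal(scale=0.1, size=(V, d))   # embedding matrix, W in R^{|V| x d}
U = rng.normal(scale=0.1, size=(n, n))   # recurrent weights (assumed tanh RNN)
A = rng.normal(scale=0.1, size=(n, d))   # input weights (assumed tanh RNN)

def one_hot(index, size):
    """1-of-K coding: a vector of zeros with a single one."""
    x = np.zeros(size)
    x[index] = 1.0
    return x

def encode(word_ids):
    """Return the recurrent states h_1..h_T, one row per source word."""
    h = np.zeros(n)
    states = []
    for idx in word_ids:
        x = one_hot(idx, V)          # (1) 1-of-K coding
        s = W.T @ x                  # (2) continuous-space representation
        h = np.tanh(U @ h + A @ s)   # (3) h_t = f(h_{t-1}, s_t)
        states.append(h)
    return np.stack(states)

H = encode([3, 1, 4, 1, 5])   # hypothetical word indices
print(H.shape)
```

Multiplying $W^\top$ by a one-hot vector simply selects one row of $W$, which is why implementations usually index the embedding matrix directly.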

7 NEURAL MACHINE TRANSLATION
Topics: Sequence-to-Sequence Learning, Decoder
Output: f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .)
(1) Recursively update the memory: $z_{t'} = f(z_{t'-1}, u_{t'-1}, h_T)$
(2) Compute the next word probability: $p(u_{t'} \mid u_{<t'}) \propto \exp(r_{u_{t'}}^\top z_{t'} + b_{u_{t'}})$
(3) Sample a next word. Beam search is a good idea.
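A hedged sketch of the decoder steps, again assuming a plain tanh recurrence and random weights. For brevity it picks the most probable word greedily at step (3) instead of sampling or running beam search; all sizes and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 8, 6   # target vocabulary size and state size (hypothetical)

E = rng.normal(scale=0.1, size=(K, n))    # embeddings of previous target words
Uz = rng.normal(scale=0.1, size=(n, n))   # recurrent weights
C = rng.normal(scale=0.1, size=(n, n))    # weights on the source summary h_T
R = rng.normal(scale=0.1, size=(K, n))    # output word vectors r_u
b = np.zeros(K)                           # output biases b_u

def softmax(a):
    a = a - a.max()   # subtract the maximum for numerical stability
    e = np.exp(a)
    return e / e.sum()

def decoder_step(z_prev, u_prev, h_T):
    z = np.tanh(Uz @ z_prev + E[u_prev] + C @ h_T)   # (1) update the memory
    p = softmax(R @ z + b)                           # (2) next-word probabilities
    return z, p

def greedy_decode(h_T, bos=0, max_len=5):
    z, u, out = np.zeros(n), bos, []
    for _ in range(max_len):
        z, p = decoder_step(z, u, h_T)
        u = int(np.argmax(p))   # (3) greedy choice, a stand-in for sampling
        out.append(u)
    return out

h_T = rng.normal(size=n)   # stand-in for the encoder's last state
words = greedy_decode(h_T)
print(words)
```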

8 NEURAL MACHINE TRANSLATION
Topics: Sequence-to-Sequence Learning, Issue
This is quite an unrealistic model. Why?
"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!" (Ray Mooney)

9 NEURAL MACHINE TRANSLATION
Topics: Attention-based Model
Encoder: bidirectional RNN, producing a set of annotation vectors $\{h_1, h_2, \dots, h_T\}$
Attention-based decoder:
(1) Compute attention weights: $\alpha_{t',t} \propto \exp(e(z_{t'-1}, u_{t'-1}, h_t))$
(2) Weighted sum of the annotation vectors: $c_{t'} = \sum_{t=1}^{T} \alpha_{t',t} h_t$
(3) Use $c_{t'}$ instead of $h_T$
[Figure: attention mechanism between the annotation vectors $h_j$ and the decoder recurrent state $z_i$, with attention weights $a_j$ satisfying $\sum_j a_j = 1$]

10 NEURAL MACHINE TRANSLATION
Topics: Attention-based Model
Encoder: bidirectional RNN, producing a set of annotation vectors $\{h_1, h_2, \dots, h_T\}$
Attention-based decoder:
(1) Compute attention weights: $\alpha_{t',t} \propto \exp(e(z_{t'-1}, u_{t'-1}, h_t))$
(2) Weighted sum of the annotation vectors: $c_{t'} = \sum_{t=1}^{T} \alpha_{t',t} h_t$
(3) Use $c_{t'}$ instead of $h_T$
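The attention steps can be sketched as below. The scoring function $e(\cdot)$ is assumed here to be a small one-hidden-layer network with random weights; the slides leave its form open, so the parameter names `Wz`, `Wu`, `Wh` and `v` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, m = 5, 6, 7   # source length, vector size, scorer hidden size (hypothetical)

H = rng.normal(size=(T, n))    # annotation vectors h_1..h_T
z_prev = rng.normal(size=n)    # previous decoder state z_{t'-1}
u_prev = rng.normal(size=n)    # embedding of the previous word u_{t'-1}

# Parameters of the assumed scoring network e(z, u, h).
Wz = rng.normal(scale=0.1, size=(m, n))
Wu = rng.normal(scale=0.1, size=(m, n))
Wh = rng.normal(scale=0.1, size=(m, n))
v = rng.normal(scale=0.1, size=m)

def attention(z_prev, u_prev, H):
    # (1) one score per annotation vector, normalised with a softmax
    scores = np.tanh(H @ Wh.T + Wz @ z_prev + Wu @ u_prev) @ v
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # (2) weighted sum of the annotation vectors
    context = alpha @ H
    return alpha, context

alpha, context = attention(z_prev, u_prev, H)
print(alpha.round(3), context.shape)
```

The context vector is recomputed at every target position, so the decoder no longer depends on the single summary $h_T$.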

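11 NEURAL MACHINE TRANSLATION
Topics: Few tricks for neural machine translation
Very large target vocabulary (Jean et al., 2015)
$p(y_t \mid y_{<t}, x) = \frac{\exp\left(w_t^\top \phi(y_{t-1}, z_t, c_t)\right)}{\sum_{k: y_k \in V} \exp\left(w_k^\top \phi(y_{t-1}, z_t, c_t)\right)} \approx \frac{\exp\left(w_t^\top \phi(y_{t-1}, z_t, c_t)\right)}{\sum_{k: y_k \in V'} \exp\left(w_k^\top \phi(y_{t-1}, z_t, c_t)\right)}$
Here $V'$ replaces the full target vocabulary $V$ in the normaliser.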
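To illustrate the effect of normalising over a smaller candidate set, the sketch below compares the full softmax with one restricted to a subset that contains the target word. It assumes a uniformly random subset, which is a simplification and not the sampling scheme of Jean et al.; all sizes and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 1000, 16                          # vocabulary size and feature size (hypothetical)
W = rng.normal(scale=0.1, size=(V, n))   # output word vectors w_k
phi = rng.normal(size=n)                 # stand-in for phi(y_{t-1}, z_t, c_t)

def log_prob_over(indices, target):
    """Log-probability of `target`, normalised over `indices` only."""
    logits = W[indices] @ phi
    shift = logits.max()
    log_norm = shift + np.log(np.exp(logits - shift).sum())
    return float(W[target] @ phi - log_norm)

target = 42
full = log_prob_over(np.arange(V), target)

# Random subset of 50 words plus the target (assumed, for illustration).
subset = np.union1d(rng.choice(V, size=50, replace=False), [target])
approx = log_prob_over(subset, target)

print(full, approx)
```

Because the subset normaliser sums over fewer words, the restricted log-probability is never lower than the full one; the saving is that only the subset's logits have to be computed.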

12 NEURAL MACHINE TRANSLATION
Topics: Few tricks for neural machine translation
Deep Fusion of Target Language Model (Gulcehre & Firat et al., 2015)
$p(y_t \mid y_{<t}, x) \propto \exp\left(y_t^\top \left(W_o f_o(z_t^{LM}, g_t z_t^{TM}, y_{t-1}, c_t) + b_o\right)\right)$
[Figure: the language model recurrent state $z_i^{LM}$ and the translation model recurrent state $z_i^{TM}$ are combined through a gate $g$ (deep fusion) to compute the word probability]

13 Attention-based neural machine translation is comparable to phrase-based statistical machine translation.

14 Teaching Machines to Read, Comprehend and Answer
Based on (Hermann et al., 2015; Blunsom, 2015)

15 READING COMPREHENSION
Topics: Teaching machines to read and comprehend
Document (CNN article): The BBC producer allegedly struck by Jeremy Clarkson will not press charges against the Top Gear host, his lawyer said Friday. Clarkson, who hosted one of the most-watched television shows in the world, was dropped by the BBC Wednesday after an internal investigation by the British broadcaster found he had subjected producer Oisin Tymon to an unprovoked physical and verbal attack. ...
Query: Producer X will not press charges against Jeremy Clarkson, his lawyer says.
Answer: Oisin Tymon

16 READING COMPREHENSION
Topics: Teaching machines to read and comprehend, Deep LSTM Reader
Document reader: $h_t = f(h_{t-1}, w_t)$, for all $t = 1, \dots, T$
Summary of the document: $h_T$
Query reader: $z_t = f(z_{t-1}, w'_t)$, for all $t = 1, \dots, T'$
Summary of the query: $z_{T'}$
Answer selection: $p(a \mid \{w_t\}_{t=1}^{T}, \{w'_t\}_{t=1}^{T'}) = g_a(h_T, z_{T'})$
Example: document "Mary went to England", query "X visited England" (slide annotation: "No!!!")

17 READING COMPREHENSION
Topics: Teaching machines to read and comprehend, Attentive Reader
Document reader: BiRNN. Annotation vectors: $\{h_1, h_2, \dots, h_T\}$
Query reader: $z_{T'}$
Attention mechanism: $\alpha_t \propto e(h_t, z_{T'})$
Query-dependent document summary: $c = \sum_{t=1}^{T} \alpha_t h_t$
Answer selection: $p(a \mid \{w_t\}_{t=1}^{T}, \{w'_t\}_{t=1}^{T'}) = g_a(z_{T'}, c)$
Example: document "Mary went to England", query "X visited England"
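A minimal sketch of the Attentive Reader's attention and answer selection. It assumes that $e(h_t, z_{T'})$ is the exponential of a dot product and that $g_a$ is a softmax over a linear layer; both are illustrative choices, and the random vectors stand in for the outputs of the document and query readers.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, A = 6, 5, 4   # document length, state size, candidate answers (hypothetical)

H = rng.normal(size=(T, n))    # document annotation vectors h_1..h_T
z_q = rng.normal(size=n)       # query summary z_{T'}
Wa = rng.normal(scale=0.1, size=(A, 2 * n))   # weights of the assumed g_a

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def attentive_read(H, z_q):
    alpha = softmax(H @ z_q)   # attention weights, dot-product score assumed
    c = alpha @ H              # query-dependent document summary
    return alpha, c

def answer_distribution(z_q, c):
    """Assumed g_a: softmax over a linear map of [z_q; c]."""
    return softmax(Wa @ np.concatenate([z_q, c]))

alpha, c = attentive_read(H, z_q)
p_answer = answer_distribution(z_q, c)
print(alpha.round(3), p_answer.round(3))
```

Plotting `alpha` against the document words is one way to inspect which parts of the document the query attends to.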

18 READING COMPREHENSION
Topics: Teaching machines to read and comprehend, Attentive Reader (Examples)
Visualize the attention

19 Going beyond Natural Languages
Is a human language special?

20 BEYOND NATURAL LANGUAGES
Topics: Beyond Natural Languages, Image Caption Generation
Task: conditional language modelling, e.g. p(two, dolphins, are, diving | image) = ?
Encoder: convolutional network, pretrained as a classifier or autoencoder
Decoder: recurrent neural network (RNN language model) with attention mechanism (Xu et al., 2015)
[Figure: a convolutional neural network produces annotation vectors $h_j$; the attention mechanism assigns attention weights $a_j$ with $\sum_j a_j = 1$; the decoder's recurrent state $z_i$ produces the word sample $u_i$, giving f = (a, man, is, jumping, into, a, lake, .)]

21 BEYOND NATURAL LANGUAGES
Topics: Beyond Natural Languages, Image Caption Generation (Examples)

22 BEYOND NATURAL LANGUAGES
Topics: Beyond Natural Languages, Image Caption Generation (Examples)

[Table fragment from the paper excerpt on slide 23. Rows: + 3-D CNN, + Per-frame CNN, + Both. Values in extraction order: 0.2832, 0.2900, 0.2960, 33.42, 27.89, 27.44. Column headers were not recovered.]

23 BEYOND NATURAL LANGUAGES
Topics: Beyond Languages, Attention Models
End-to-End Speech Recognition (Chorowski et al.; Chan et al., 2015)
Video Description Generation (Yao et al., 2015)
Discrete Optimization (Vinyals et al., 2015)
and many more (Cho et al., 2015 and references therein)

[Paper excerpt shown on the slide]
Fig. 6. The 3-D convolutional network for motion from [23].
Similarly to all the other previous applications of the attention-based model, the attention mechanism applied to the task of video description also provides a straightforward way to inspect the inner workings of the model. See Fig. 7 for some examples.
The other type of encoder in [23] is a so-called 3-D convolutional network, shown in Fig. 6. Unlike the usual convolutional network, which often works only spatially over a two-dimensional image, the 3-D convolutional network applies its (local) filters across the spatial dimensions as well as the temporal dimensions. Furthermore, those filters work not on pixels but on local motion statistics, enabling the model to concentrate on motion rather than appearance. Similarly to the strategy from Sec. II-D, the model was trained on larger video datasets to recognize an action from each video clip, and the activation vectors from the last convolutional layer were used as context. The authors of [23] suggest that this encoder extracts more local temporal structures, complementing the global structures extracted from the frame-wise application of a 2-D convolutional network. The same type of decoder, a conditional RNN-LM, used in [22] was used with the content-based attention mechanism in Eq. (16).
2) Experimental Result: In [23], this approach to video description generation has been tested on two datasets: (1) Youtube2Text [54] and (2) Montreal DVS [55]. They showed that it is beneficial to have both types of encoders together

24 Connectionist Approach to Natural Language Understanding (see the slides of my talk at CVSC 2015)

25 Department of Computer Science, Ph.D. Programme: application deadline 12th December
Center for Data Science, M.Sc. Programme in Data Science: application deadline 4th February

26 Department of Computer Science, M.Sc. Programme in Machine Learning and Data Mining (Macadamia)
Ph.D. Programme: Prof. Tapani Raiko
