Advanced Cutting Edge Research Seminar. Dialogue System with Deep Neural Networks


1 Advanced Cutting Edge Research Seminar: Dialogue System with Deep Neural Networks
Assistant Professor Koichiro Yoshino
Nara Institute of Science and Technology, Augmented Human Communication Laboratory
PRESTO, Japan Science and Technology Agency

2 Course works
1. Basis of spoken dialogue systems
   - Types and modules of spoken dialogue systems
2. Deep learning for spoken dialogue systems
   - Basis of deep learning (deep neural networks)
   - Recent approaches of deep learning for spoken dialogue systems
3. Dialogue management using reinforcement learning
   - Basis of reinforcement learning
   - Statistical dialogue management using an intention dependency graph
4. Dialogue management using deep reinforcement learning
   - Implementation of a deep Q-network for dialogue management

3 Basis of deep neural networks
- Perceptron: a simple binary classifier
- Multi-layer perceptron: a combination of binary classifiers
- Deep neural networks
- Ways to apply deep neural networks: what kind of problem can be solved with a DNN?

4 Simple perceptron
- The simplest unit of neural networks: it takes several inputs x and produces one output y.
- A perceptron makes a single decision given several inputs:
  y = sign( Σ_i w_i φ_i(x) + b )
  where y is the binary output (+1 or -1), w_i is the weight for feature φ_i, φ_i is a feature function on the input x, and b is the bias.
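A minimal sketch of this decision rule in Python (the bag-of-words features and the weights below are illustrative assumptions, not values from the slides):

    import numpy as np

    def perceptron_decide(x, w, b):
        """Single perceptron: y = sign(sum_i w_i * phi_i(x) + b)."""
        score = np.dot(w, x) + b
        return 1 if score >= 0 else -1

    # Illustrative features: phi(x) = [count of "good"/"cool", count of "bad"]
    w = np.array([1.0, -1.0])   # weight per feature
    b = 0.0                     # bias
    print(perceptron_decide(np.array([1, 0]), w, b))  # "this room is good" -> +1
    print(perceptron_decide(np.array([0, 1]), w, b))  # "this room is bad"  -> -1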

5 Properties of the simple perceptron
- Problems the perceptron can solve: linearly separable binary classification.
  Positive: "This room is good", "This building is cool"; Negative: "This room is bad"
  (features such as good/cool vs. bad separate the two classes with a single line)
- Problems the perceptron cannot solve: non-linear problems.
  Positive: "very good", "not bad"; Negative: "very bad", "not good"
  (no single line over the features very, not, good, bad separates these examples)

6 Solutions for non-linear problems
- Use several classifiers: the multi-layer perceptron (MLP), a feed-forward network.
- The classifiers of the 1st layer take the same input but learn different weights.
- If we have two linear separating planes, we can classify the examples of the non-linear problem (x_1: very good, x_2: very bad, x_3: not bad, x_4: not good).

7 Multi-layer perceptron (MLP)
- 1st layer
  - Input: the same as the single perceptron
  - Output: features for the decision of the 2nd layer
  - Each perceptron may learn a mapping from the input feature space to a new feature space (φ_1(x), φ_2(x)); kernel methods do a similar thing
- 2nd layer
  - Input: the features that come from the 1st layer
  - Output: the classification result
- In the new feature space (φ_1(x), φ_2(x)), the examples x_1, ..., x_4 become linearly separable into positive and negative regions.
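A minimal sketch of such a two-layer network on the not/very example above; the weights are hand-picked purely for illustration, not learned:

    import numpy as np

    def sign(v):
        return np.where(v >= 0, 1, -1)

    # Input features phi(x) = [has "not", has "bad"]; weights are hand-picked, not learned.
    W1 = np.array([[ 1.0, -1.0],    # hidden unit 1: fires for "not" without "bad"
                   [-1.0,  1.0]])   # hidden unit 2: fires for "bad" without "not"
    b1 = np.array([-0.5, -0.5])
    w2 = np.array([-1.0, -1.0])     # 2nd layer: positive only if neither hidden unit fires
    b2 = -1.0

    def mlp(x):
        h = sign(W1 @ x + b1)        # 1st layer: map to the new feature space
        return sign(w2 @ h + b2)     # 2nd layer: final decision

    for feats, text in [([0, 0], "very good"), ([1, 1], "not bad"),
                        ([0, 1], "very bad"), ([1, 0], "not good")]:
        print(text, mlp(np.array(feats)))   # +1, +1, -1, -1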

8 Deep neural networks
- A deep neural network (DNN) is a deeply layered multi-layer perceptron: input x_1, ..., x_4, several hidden layers h^1, h^2, h^3, and output y.
- It can learn the mapping between X and y (y = f(X)) even if the mapping is complex.
- The restricted Boltzmann machine (RBM) was a key technique to train the model: pre-train the mapping of each layer (X and H^1, H^1 and H^2, ...) and then fine-tune the entire network with backpropagation.

9 Variation of neural networks: recurrent neural network (RNN)
- A recurrent neural network is a neural network that has a recursion:
  h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
  y^p_t = softmax(W_hy h_t + b_y)
  where tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) and softmax(z)_i = e^{z_i} / Σ_k e^{z_k}.
- This structure works well for sequential input (x_1, x_2, ..., x_t), where t is a time step.
- The hidden state h_{t-1} passed into time step t acts as a memory of the previous inputs.
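A minimal numpy sketch of this recurrence (the dimensions and random inputs are illustrative):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    d_x, d_h, d_y = 4, 8, 3          # illustrative sizes
    rng = np.random.default_rng(0)
    W_xh, W_hh = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
    W_hy, b_h, b_y = rng.normal(size=(d_y, d_h)), np.zeros(d_h), np.zeros(d_y)

    h = np.zeros(d_h)                         # h_0
    for x_t in rng.normal(size=(5, d_x)):     # sequence x_1 ... x_5
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # h_t keeps a memory of the past
        y_t = softmax(W_hy @ h + b_y)              # output distribution at step t
        print(y_t)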

10 Variation of neural networks: convolutional neural network (CNN)
- Pipeline: convolution → activation → max-pooling → full connection and softmax.
- CNNs are a state-of-the-art algorithm for classification.
  c_{i,j} = Σ_{s=0}^{m-1} Σ_{t=0}^{n-1} w_{s,t} x_{i+s,j+t} + b_c
  a_{i,j} = tanh(c_{i,j}),  p_{i,j} = max-pooling over a_{i,j}
  o = tanh(W_po p + b_o),  y = softmax(W_oy o + b_y)
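A minimal sketch of a single-filter convolution, tanh activation, and max-pooling (the input size and filter are illustrative; a real CNN uses many filters and a full softmax layer, as in the formulas above):

    import numpy as np

    def conv2d(x, w, b):
        m, n = w.shape
        H, W = x.shape
        c = np.zeros((H - m + 1, W - n + 1))
        for i in range(c.shape[0]):
            for j in range(c.shape[1]):
                # c_{i,j} = sum_{s,t} w_{s,t} * x_{i+s, j+t} + b
                c[i, j] = np.sum(w * x[i:i+m, j:j+n]) + b
        return c

    x = np.random.default_rng(1).normal(size=(6, 5))   # e.g. sentence length x embedding dim
    w, b = np.ones((3, 5)) / 15.0, 0.0                  # one 3x5 filter
    a = np.tanh(conv2d(x, w, b))                        # activation
    p = a.max()                                         # max-pooling over the feature map
    print(p)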

11 Ways to apply deep learning
- Deep learning can learn the mapping between X and y if we have large-scale aligned data:
  - speech sounds and their phonemes
  - transcribed utterances and dialogue states
  - beliefs and action-value functions
  - states and actions
  - actions and utterances
- Of course it is not so simple, but successful deep learning work solves mapping problems that were hard to solve with existing frameworks.

12 Tasks of spoken dialogue systems
- Speech recognition: "I'd like to take Kintetsu-line from Ikoma stat."
- Spoken language understanding (SLU): $FROM=Ikoma, $LINE=Kintetsu
- Dialogue state tracking: $FROM=Ikoma, $TO_GO=???, $LINE=Kintetsu
- Action decision (DM, using the model and knowledge base): 1 ask $TO_GO, 2 inform $NEXT
- Language generation (LG): "Where will you go?"
- End-to-end dialogue

13 Speech recognition with DNN in the early stage
- Conventional ASR architecture:
  argmax_W P(W|X) = argmax_W P(X|W) P(W)
  where W is the word sequence, X is the speech signal, P(X|W) is the acoustic model, and P(W) is the language model.
- The acoustic model can be a GMM-HMM or a DNN-HMM (phoneme states over acoustic frames x_1, x_2, x_3).

14 Speech recognition with DNN in the early stage
- DNN-HMM simply replaces the generative probability of the GMM for a phoneme with a discriminative probability that classifies the phoneme from the speech frame.
- The rest of the architecture (HMM-based phoneme sequence selection and n-gram language modeling) stayed the same, yet it reduced speech recognition errors by 20-30%.

15 Language model and recurrent neural network
- A language model calculates the likelihood of a word sequence W:
  P(W) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
- Existing language models are N-gram models that approximate the history with the previous words:
  P(W) ≈ Π_i P(w_i | w_{i-N+1}, ..., w_{i-1})
- The same problem can be solved with an RNN:
  h_t = tanh(W_wh w_{t-1} + W_hh h_{t-1} + b_h)
  w_t = softmax(W_hy h_t + b_y)
- The RNN (and its successors) became the state-of-the-art language model.

16 End-to-end speech recognition system
- Early DNN-based speech recognition systems just replaced some modules with deep neural networks, but recent research tries to train the whole model argmax_W P(W|X) end-to-end, including the pre-processing of ASR.
- Ochiai et al., "Multichannel End-to-end Speech Recognition." In Proc. ICML. (Figure cited from the paper.)

17 Problems of spoken language understanding (SLU) and dialogue state tracking (DST)
- SLU: convert the user utterance into machine-readable expressions.
  "I want to take Kintetsu from Ikoma" → Train_info{$FROM=Ikoma, $LINE=Kintetsu}
- DM: decide the next system action from the SLU result and the dialogue history.
  SLU result: Train_info{$FROM=Ikoma, $LINE=Kintetsu}
  History: $FROM=???, $TO_GO=Namba, $LINE=???
  Dialogue state: Train_info{$FROM=Ikoma, $TO_GO=Namba, $LINE=Kintetsu}
  Action decision: 1 inform $NEXT_TRAIN, 2 ask $TO_GO

18 Simple classification for SLU
- A multichannel CNN over Chinese word, Chinese character, and (translated) English word inputs.
- "A Multichannel Convolutional Neural Network for Cross-language Dialog State Tracking." Shi et al., In Proc. IEEE SLT 2016.

19 CNN for classification
- A CNN requires a fixed-size matrix as input, so two techniques are used:
  - Word embedding converts each word (e.g. "He", "doesn't", "have", ..., "artificial", "intelligence", "class") into a fixed-length meaning vector.
  - 0-padding sets the height of the matrix to the maximum sentence length in the training data and fills with 0 if the sentence is shorter than the maximum.
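A minimal sketch of these two preprocessing steps (the vocabulary, embedding table, and maximum length below are illustrative assumptions):

    import numpy as np

    emb_dim, max_len = 4, 7
    vocab = {"he": 0, "doesn't": 1, "have": 2, "artificial": 3, "intelligence": 4, "class": 5}
    E = np.random.default_rng(2).normal(size=(len(vocab), emb_dim))   # word embedding table

    def to_matrix(words):
        # Embedding: each word becomes a fixed-length vector.
        vecs = [E[vocab[w]] for w in words]
        # 0-padding: pad up to the maximum sentence length with zero vectors.
        while len(vecs) < max_len:
            vecs.append(np.zeros(emb_dim))
        return np.stack(vecs)          # fixed-size (max_len x emb_dim) matrix for the CNN

    print(to_matrix(["he", "doesn't", "have"]).shape)   # (7, 4)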

20 Problem definition of SLU and DST
- SLU: find a dialogue frame F given the words of the utterance W:
  argmax_F P(F | W)
  There are many works on this tagging problem: slot filling, domain/intent classification, dialogue act classification, ...
- DST: find a dialogue state S given the sequence of frames F_{1:t}:
  argmax_S P(S | F_{1:t})
- It can be solved as a joint problem:
  P(S | W_{1:t}) = P(S | F_{1:t}) P(F_{1:t} | W_{1:t})
- Can both be solved with the same (sequential) model? RNNs and long short-term memory networks (LSTMs).

21 RNN-based dialogue state tracking
- "Word-Based Dialog State Tracking with Recurrent Neural Networks." Henderson et al., In Proc. SIGDIAL.

22 LSTM-based dialogue state tracking
- Input: the user utterance ("Is there any activity in Singapore?") as a word sequence through word embeddings, plus other features, fed to an LSTM over turns 1..T.
- Output dialogue state: Task: activity{ Area: Singapore, Price range: - }
- "Dialogue State Tracking using Long Short Term Memory Neural Networks." Yoshino et al., In Proc. IWSDS.

23 Relation between belief update and the RNN-based dialogue state tracker
- RNN: output a dialogue state given the sequence of words (utterances).
  h_t = tanh(W_Xh X_t + W_hh h_{t-1} + b_h)
  X_t is the sequence of words at time t (the observation); the dialogue history is propagated through the hidden layer h_{t-1} (the state transition).
  y^p_t = softmax(W_hy h_t + b_y)
  The output is the dialogue state y^p_t with the highest probability (the belief).
- Belief update:
  b_t(s^j) ∝ P(o_t | s^j_t) Σ_{s^i} P(s^j_t | s^i_{t-1}) b_{t-1}(s^i)
  i.e. observation likelihood × state transition × previous belief.
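A minimal sketch of the classical belief update that the RNN tracker approximates (the transition table and observation likelihoods are illustrative):

    import numpy as np

    def belief_update(b_prev, T, obs_lik):
        """b_t(s_j) ∝ P(o_t|s_j) * sum_i P(s_j|s_i) * b_{t-1}(s_i)."""
        b = obs_lik * (T.T @ b_prev)   # predict with the transition model, weight by the observation
        return b / b.sum()             # normalize to a distribution over states

    b_prev = np.array([0.7, 0.2, 0.1])          # previous belief over 3 dialogue states
    T = np.array([[0.8, 0.1, 0.1],              # T[i, j] = P(s_j | s_i)
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
    obs_lik = np.array([0.1, 0.8, 0.1])         # P(o_t | s_j) from the SLU result
    print(belief_update(b_prev, T, obs_lik))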

24 Problem of action decision
- Decide the action a_t given the belief b_t over dialogue states.
  User: "I'd like to go to Namba with Kintetsu"
  Belief over states s_1, ..., s_4, e.g. Train_info{$FROM=Ikoma, $TO_GO=Namba, $LINE=Kintetsu}, Train_info{$FROM=Ikoma, $LINE=Kintetsu}, Train_info{$TO_GO=Namba, $LINE=Kintetsu}, ...
  Action candidates a_t: 1 inform $NEXT_TRAIN, 2 ask $TO_GO
- There are two ways:
  - Find the best policy (policy gradient)
  - Find the best Q-function (Q-network)

25 What is a good action?
- Maximize the expected future reward (value function):
  V*(s_t) = max_π V^π(s_t) = max_a Σ_{s_{t+1}} P(s_{t+1} | s_t, a) [ R(s_t, a, s_{t+1}) + γ V^π(s_{t+1}) ]
  Q^π(s_t, a_t) = Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) [ R(s_t, a_t, s_{t+1}) + γ max_{a_{t+1}} Q^π(s_{t+1}, a_{t+1}) ]
- Policy gradient directly optimizes the score V^π(s_t).
- A Q-network calculates Q^π(s, a) for each action, following the sampling manner of Q-learning.
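A minimal sketch of the Bellman backup behind these definitions, applied to a tiny MDP with known (randomly generated, purely illustrative) transition and reward tables:

    import numpy as np

    n_s, n_a, gamma = 3, 2, 0.9
    rng = np.random.default_rng(3)
    P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] = P(s'|s, a)
    R = rng.normal(size=(n_s, n_a, n_s))               # R[s, a, s']

    V = np.zeros(n_s)
    for _ in range(100):                                # value iteration
        # Q(s,a) = sum_s' P(s'|s,a) [ R(s,a,s') + gamma * V(s') ];  V(s) = max_a Q(s,a)
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V = Q.max(axis=1)
    print(V, Q.argmax(axis=1))   # converged values and greedy actions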

26 Policy gradient
- In policy gradient the policy is not deterministic: π(s, a) is the probability of selecting action a given state s.
  J(θ) = V^{π_θ}(s)
  ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]
- If we learn the parameters θ of the policy π_θ by maximizing J(θ) on existing data, we obtain a policy that maximizes the reward for the existing data sequences.
- We can use deep learning for this parameter learning.
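A minimal sketch of one REINFORCE-style gradient step for a softmax policy with a linear score; the sampled return G stands in for Q^π(s, a), and all values are illustrative:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    n_a, d_s, lr = 3, 4, 0.1
    theta = np.zeros((n_a, d_s))                 # policy parameters

    def grad_log_pi(theta, s, a):
        pi = softmax(theta @ s)                  # pi_theta(.|s)
        g = -np.outer(pi, s)                     # d/dtheta of log pi_theta(a|s)
        g[a] += s
        return g

    # One sampled (state, action, return) tuple; in practice these come from dialogues.
    s, a, G = np.ones(d_s), 1, 2.0
    theta += lr * grad_log_pi(theta, s, a) * G   # grad J = E[ grad log pi(s,a) * Q(s,a) ]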

27 Q-learning
- Premise: if we can calculate Q(s, a) for every pair, we should decide the action a by max_a Q(s, a).
- Problem: we do not know P(s_{t+1} | s_t, a_t), which is needed to calculate Q(s, a).
- Solution: approximate P(s_{t+1} | s_t, a_t) with sampling and update
  Q(s_t, a_t) ← (1 - α) Q(s_t, a_t) + α [ R(s_t, a_t, s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) ]
- The reward is back-propagated from the end of the sampled episode.
- We will build a dialogue manager with this algorithm in the next class.
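A minimal tabular sketch of this update rule; the toy environment step function is an illustrative stand-in, not the dialogue manager of the next class:

    import numpy as np

    n_s, n_a, alpha, gamma, eps = 4, 2, 0.1, 0.9, 0.1
    Q = np.zeros((n_s, n_a))
    rng = np.random.default_rng(4)

    def env_step(s, a):
        """Toy stand-in for the environment: returns (next state, reward, done)."""
        s_next = (s + a + 1) % n_s
        return s_next, (1.0 if s_next == 0 else 0.0), s_next == 0

    for _ in range(500):                         # sampled episodes
        s, done = int(rng.integers(n_s)), False
        while not done:
            a = int(rng.integers(n_a)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env_step(s, a)
            # Q(s,a) <- (1-alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ]
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    print(Q)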

28 Q-network
- Idea: if we can regress the Q-value at each sampling step, learning becomes efficient.
  L(θ_i) = E_{s,a,r,s'}[ (y - Q(s, a))^2 ]
  y = R(s_t, a_t, s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
- Regression, i.e. training the mapping between y and Q(s, a)? Deep learning! (deep Q-network; DQN)
- In dialogue, the input is the belief vector b (e.g. s = s_1: 0.0, s = s_2: 0.9, ..., s = s_n: 0.0), passed through tanh layers to output Q(b, a).
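A minimal sketch of the regression target and squared loss on one sampled transition, using a linear Q-function over the belief vector for brevity (a real DQN would use a multi-layer tanh network and a replay buffer; all values are illustrative):

    import numpy as np

    n_b, n_a, gamma = 5, 3, 0.9                 # belief dimension, number of actions
    rng = np.random.default_rng(5)
    theta = rng.normal(scale=0.1, size=(n_a, n_b))

    def Q(b):                                   # Q(b, .) for all actions
        return theta @ b

    # One sampled transition (b, a, r, b'); in practice drawn from interaction data.
    b, a, r, b_next = rng.random(n_b), 1, 1.0, rng.random(n_b)
    y = r + gamma * Q(b_next).max()             # y = R + gamma * max_a' Q(b', a')
    loss = (y - Q(b)[a]) ** 2                   # L(theta) = (y - Q(b, a))^2
    grad = -2 * (y - Q(b)[a]) * b               # gradient w.r.t. theta[a], y held fixed
    theta[a] -= 0.01 * grad                     # one SGD step
    print(loss)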

29 Joint learning of dialogue state tracking and action decision with deep learning
- "Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning." Zhao et al., In Proc. SIGDIAL, 2016.
- LSTM-based DST results are used as the input of a DQN:
  - LSTM: calculates b_t from the observations.
  - DQN: optimizes Q(b_t, a_{t+1}; θ) by regression.
- The entire network is then fine-tuned.

30 Language generation
- Generate a sentence given a system action: Ask $TO_GO → "Where will you go?"
- Conventional approach: statistical template-based generation; it is still weak against out-of-vocabulary (and out-of-template) inputs.
- Example of such generation: procedural text from flow graphs ("Generation of procedural text from flow graphs," IPSJ Journal), e.g. "Heat/Ac the pot/T with oil/F. Add/Ac celery/F, green onions/F, and garlic/F. Stir-fry/Ac for about one minute/D."

31 RNN language model for generation
- An RNN can generate a sequence of words by using each generated word as its next input (decoder model):
  h_t = tanh(W_wh w_{t-1} + W_hh h_{t-1} + b_h)
  w_t = softmax(W_hy h_t + b_y)
- Example: inputs x_1 = "He", x_2 = "doesn't", x_3 = "have", ..., x_7 = "in" yield predictions y^p_1 = "doesn't", y^p_2 = "have", y^p_3 = "very", ..., y^p_8 = "himself".
- The dimension of the output layer equals the vocabulary size.
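A minimal sketch of this decoding loop: the arg-max word at each step is fed back as the next input. The weights and vocabulary are random/illustrative, so the output is meaningless; the snippet only shows the mechanics:

    import numpy as np

    vocab = ["<EOS>", "rest", "room", "is", "next", "to", "the", "entrance", "."]
    d_h, V = 16, len(vocab)
    rng = np.random.default_rng(6)
    E = rng.normal(size=(V, d_h))                           # input word embeddings
    W_wh, W_hh = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))
    W_hy, b_h, b_y = rng.normal(size=(V, d_h)), np.zeros(d_h), np.zeros(V)

    h, w = np.zeros(d_h), 0                                 # start from <EOS>
    out = []
    for _ in range(10):                                     # cap the output length
        h = np.tanh(W_wh @ E[w] + W_hh @ h + b_h)           # h_t from previous word and state
        w = int((W_hy @ h + b_y).argmax())                  # w_t = argmax softmax(...)
        if vocab[w] == "<EOS>":
            break
        out.append(vocab[w])
    print(" ".join(out))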

32 Decoder with condition: semantically conditioned LSTM
- The recurrent hidden layer and word embeddings control how to say it (the language-model part).
- A 1-hot vector of the dialogue act and slot values controls what to say (the contents).
- "Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems." Wen et al., In Proc. EMNLP.

33 Semantically conditioned LSTM: decoding results for dialogue systems

34 Context-aware NLG
- Sequence-to-sequence modeling of generation.
- Changes the response according to the dialogue context.
- Dusek et al., "A Context-aware Natural Language Generator for Dialogue Systems." In Proc. SIGDIAL.

35 QA style
- The usual pipeline: "I'd like to take Kintetsu-line from Ikoma stat." → SLU ($FROM=Ikoma, $LINE=Kintetsu) → dialogue state ($FROM=Ikoma, $TO_GO=???, $LINE=Kintetsu) → DM with model and knowledge base (1 ask $TO_GO, 2 inform $NEXT) → LG ("Where will you go?").
- QA style skips the management (example-based): the system maps the input directly to a response.

36 Encoder-decoder
- We can combine the encoder and decoder ideas to build a neural network that remembers the input sentence and outputs a response sentence.
- The RNN may remember not only the words but also the order of the words.
- Example: encode "where is the rest room EOS" and decode a reply such as "rest room is next to the entrance. EOS".
- Vinyals, Oriol, and Quoc Le. "A Neural Conversational Model." arXiv preprint, 2015.

37 Attention model
- Gives the decoder an attentional point in the input to be decoded: at each step it decides whether to let each input position through or not (a soft selection via softmax).
- Example: decoding "next to the entrance. EOS" while attending over the encoded input "where is the rest room EOS".

38 Attention model
- The attention weight between encoder position j and decoder step t:
  a_{t,j} = attention_score(h^e_j, h^d_t)
  followed by a softmax over j; this decides how much of each input position to let through.
- Example: generating "next to the entrance. EOS" from the input "where is the rest room EOS".
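A minimal sketch of attention over the encoder states at one decoder step; the dot-product scoring function, dimensions, and random states are illustrative assumptions (the papers use learned scoring functions):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    d_h, T_enc = 8, 6                                  # hidden size, encoder length
    rng = np.random.default_rng(7)
    h_enc = rng.normal(size=(T_enc, d_h))              # h^e_1 ... h^e_T ("where is the rest room EOS")
    h_dec = rng.normal(size=d_h)                       # h^d_t, current decoder state

    scores = h_enc @ h_dec                             # a_{t,j} = attention_score(h^e_j, h^d_t)
    alpha = softmax(scores)                            # normalized attention weights over the input
    context = alpha @ h_enc                            # weighted sum fed to the decoder output
    print(alpha, context.shape)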

39 Chit-chat
- One typical end-to-end modeling of dialogue.
- Serban et al., "Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models." In Proc. AAAI.

40 Memory networks for dialogue systems
- The task is goal-oriented (API calls), but the system is built end-to-end on a memory network.
- The problem was solved perfectly in DSTC6 Track 1.

41 If you are interested in more recent works
- "Deep Learning for Dialogue Systems": a tutorial by Yun-Nung Chen, Asli Celikyilmaz, and Dilek Hakkani-Tur.

42 Summary
- Deep learning has been applied to several tasks of spoken dialogue systems in recent years: speech recognition, understanding, state tracking, action decision, generation, and end-to-end modeling.
- How do we apply deep learning to dialogue (in development)?
  - Clarify your problem to set up the input and output.
  - Find similar systems in existing work from recent (2-3 years) conferences (SIGDIAL, NAACL, ACL, EMNLP, COLING, AAAI, IJCAI, ...).
  - Can you prepare enough paired data of the input and the output?
- How do we apply deep learning to dialogue (in research)?
  - Find a mapping problem that requires high-dimensional or non-linear modeling.
  - Consider the properties of your input and output, even if they cover only part of your problem.
  - Can you prepare enough paired data of the input and the output?

43 Next contents (1/30): dialogue management with Q-learning
- We will see the detailed algorithm and an implementation of a dialogue manager with Q-learning.
- We will discuss the user simulator.
