Review
Task-Oriented Dialogue System (Young, 2000)
http://rsta.royalsocietypublishing.org/content/358/1769/1389.short

The pipeline, with a running example:
- Speech Recognition: speech signal → text hypothesis ("are there any action movies to see this weekend")
- Language Understanding (LU): domain identification, user intent detection, slot filling → semantic frame (request_movie, genre=action, date=this weekend)
- Dialogue Management (DM): dialogue state tracking (DST) and dialogue policy, consulting a backend database/knowledge providers → system action/policy (request_location)
- Natural Language Generation (NLG): system action → text response ("Where are you located?")
Conventional LU
Language Understanding (LU): Pipelined
1. Domain Classification
2. Intent Classification
3. Slot Filling
LU: Domain/Intent Classification
As an utterance classification task: given a collection of utterances u_i with labels c_i, D = {(u_1, c_1), …, (u_n, c_n)} where c_i ∈ C, train a model to estimate labels for new utterances u_k.

Example: "find me a cheap taiwanese restaurant in oakland"
- Domain: Movies | Restaurants | Music | Sports
- Intent: find_movie, buy_tickets (Movies); find_restaurant, find_price, book_table (Restaurants); find_lyrics, find_singer (Music)
Conventional Approach
- Data: dialogue utterances annotated with domains/intents
- Model: a machine learning classification model, e.g. a support vector machine (SVM)
- Prediction: domains/intents
Theory: Support Vector Machine
http://www.csie.ntu.edu.tw/~htlin/mooc/
- SVM is a maximum-margin classifier.
- Input data points are mapped into a high-dimensional feature space where the data is linearly separable.
- Support vectors are the input data points that lie on the margin.
Theory: Support Vector Machine (Multiclass)
http://www.csie.ntu.edu.tw/~htlin/mooc/
- Extended to multiclass with the one-versus-rest approach: SVM_1, …, SVM_k each produce a score S_1, …, S_k for one class.
- The scores are then transformed into probabilities P_1, …, P_k.
- The domain/intent can be decided based on the estimated scores.
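In practice the per-class scores S_1, …, S_k can be mapped to probabilities in several ways (Platt scaling is common for SVMs); a minimal sketch using a softmax over the scores, with made-up score values:

```python
import math

def scores_to_probs(scores):
    """Map one-versus-rest scores S_1..S_k to probabilities P_1..P_k via softmax."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decide(scores, labels):
    """Pick the domain/intent whose estimated probability is highest."""
    probs = scores_to_probs(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]
```

For example, `decide([2.1, -0.3, 0.4], ["Restaurants", "Movies", "Music"])` returns `"Restaurants"`.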
LU: Slot Filling
As a sequence tagging task: given a collection of tagged word sequences S = {((w_{1,1}, w_{1,2}, …, w_{1,n1}), (t_{1,1}, t_{1,2}, …, t_{1,n1})), ((w_{2,1}, w_{2,2}, …, w_{2,n2}), (t_{2,1}, t_{2,2}, …, t_{2,n2})), …} where t_i ∈ M, the goal is to estimate tags for a new word sequence.

Example: flights from Boston to New York today
- Entity tags: O O B-city O B-city I-city O
- Slot tags: O O B-dept O B-arrival I-arrival B-date
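The slot tags above use the BIO scheme (B- begins a slot, I- continues it, O is outside any slot). A small sketch of how slot values can be read off a tagged sequence (the function name is ours, not from a particular toolkit):

```python
def extract_slots(words, tags):
    """Collect (slot, value) pairs from a BIO-tagged word sequence."""
    slots, current_slot, current_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current_slot:
                slots.append((current_slot, " ".join(current_words)))
            current_slot, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            current_words.append(word)  # continuation of the open slot
        else:
            if current_slot:
                slots.append((current_slot, " ".join(current_words)))
            current_slot, current_words = None, []
    if current_slot:
        slots.append((current_slot, " ".join(current_words)))
    return slots
```

On the flights example this recovers dept=Boston, arrival=New York, date=today.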
Conventional Approach
- Data: dialogue utterances annotated with slots
- Model: a machine learning tagging model, e.g. conditional random fields (CRF)
- Prediction: slots and their values
Theory: Conditional Random Fields
- A linear-chain CRF assumes that the label at time step t depends on the label at the previous time step t-1, given the input sequence.
- Training maximizes the log probability log p(y|x) with respect to the parameters λ.
- Slots can be tagged with the y that maximizes p(y|x).
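For a trained linear-chain CRF, the tag sequence maximizing p(y|x) can be found with Viterbi decoding. A minimal sketch over toy emission/transition scores (not tied to any specific CRF toolkit):

```python
def viterbi(emission, transition, tags):
    """Find the tag sequence maximizing summed emission + transition scores.

    emission:   list of {tag: score} dicts, one per time step
    transition: {(prev_tag, tag): score}, missing pairs score 0
    """
    # best[t][tag] = (score of best path ending in tag at step t, backpointer)
    best = [{t: (emission[0][t], None) for t in tags}]
    for step in emission[1:]:
        scores = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + transition.get((p, tag), 0.0))
            scores[tag] = (best[-1][prev][0] + transition.get((prev, tag), 0.0) + step[tag], prev)
        best.append(scores)
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for scores in reversed(best[1:]):
        path.append(scores[path[-1]][1])
    return list(reversed(path))
```

Note how the transition scores let the decoder prefer a tag sequence even when a step's emission score alone would pick a different tag.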
Neural Network Based LU
A Single Neuron
- Inputs x_1, …, x_N with weights w_1, …, w_N and bias b.
- z = w_1 x_1 + w_2 x_2 + … + w_N x_N + b
- Activation function: sigmoid, σ(z) = 1 / (1 + e^(-z))
- Output: y = σ(z)
- w and b are the parameters of this neuron.
A Single Neuron (as a Binary Classifier)
- A single neuron computes a function f: R^N → R.
- Example: is this image the digit "2"? If y ≥ 0.5, output "2"; if y < 0.5, output not "2".
- A single neuron can only handle binary classification.
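The neuron above can be sketched directly (sigmoid activation; the weights in the usage below are placeholders, not trained values):

```python
import math

def neuron(x, w, b):
    """Single neuron: z = w1*x1 + ... + wN*xN + b, output y = sigmoid(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def is_two(x, w, b):
    """Binary decision: y >= 0.5 means "2", otherwise not "2"."""
    return neuron(x, w, b) >= 0.5
```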
A Layer of Neurons
- Handwriting digit classification: a layer of neurons computes f: R^N → R^M.
- A layer of neurons can handle multiple possible outputs, e.g. 10 neurons for 10 classes (y_1: "1 or not", y_2: "2 or not", y_3: "3 or not", …).
- The result is the class whose output is the maximum.
Deep Neural Networks (DNN)
- Fully connected feedforward network f: R^N → R^M.
- Input vector x = (x_1, …, x_N) → Layer 1 → Layer 2 → … → Layer L → output vector y = (y_1, …, y_M).
- A deep NN has multiple hidden layers.
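A forward pass through such a fully connected network can be sketched in plain Python (the weights in the usage below are placeholders for trained parameters):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, weights, biases):
    """One fully connected layer: each row of `weights` belongs to one neuron."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Feedforward pass through a stack of (weights, biases) layers."""
    for weights, biases in layers:
        x = layer(x, weights, biases)
    return x

def predict(x, layers):
    """Class decision: index of the maximum output neuron."""
    y = forward(x, layers)
    return y.index(max(y))
```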
Recurrent Neural Network (RNN)
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- s_t = σ(U x_t + W s_{t-1}), where the nonlinearity σ is typically tanh or ReLU.
- An RNN can learn accumulated sequential (time-series) information: the hidden state at each time step summarizes all previous inputs.
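The recurrence s_t = tanh(U x_t + W s_{t-1}) can be sketched as follows (toy one-unit parameters, not trained):

```python
import math

def rnn_step(x_t, s_prev, U, W):
    """One recurrence step: s_t = tanh(U x_t + W s_prev), elementwise over units."""
    return [math.tanh(sum(u * x for u, x in zip(U[i], x_t)) +
                      sum(w * s for w, s in zip(W[i], s_prev)))
            for i in range(len(U))]

def rnn(xs, U, W):
    """Run the RNN over a sequence from a zero initial state; return all states."""
    s = [0.0] * len(U)
    states = []
    for x_t in xs:
        s = rnn_step(x_t, s, U, W)
        states.append(s)
    return states
```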
Model Training
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- All model parameters can be updated by stochastic gradient descent (SGD).
- At each time step t, the predicted output y_t is compared against the target output.
Backpropagation Through Time (BPTT)
- Forward pass: compute the hidden states s_1, s_2, s_3, s_4 and the outputs o_1, o_2, o_3, o_4.
- Backward pass: compute the gradients of the per-step costs C(4), C(3), C(2), C(1), propagating errors backward through time.
- The model is trained by comparing the correct sequence tags y_1, …, y_4 with the predicted ones.
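The per-step costs C(1)…C(4) are typically cross-entropies between the predicted tag distribution and the correct tag; summed over the sequence they give the training objective that SGD minimizes. A minimal sketch:

```python
import math

def cross_entropy(predicted, target_index):
    """Per-step cost C(t): negative log probability assigned to the correct tag."""
    return -math.log(predicted[target_index])

def sequence_loss(predictions, targets):
    """Total cost over a sequence: sum of the per-step cross-entropies."""
    return sum(cross_entropy(p, t) for p, t in zip(predictions, targets))
```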
Deep Learning Approach
- Data: dialogue utterances annotated with semantic frames (user intents & slots)
- Model: a deep learning model (classification/tagging), e.g. recurrent neural networks (RNN)
- Prediction: user intents, slots and their values
Classification Model
As an utterance classification task: given a collection of utterances u_i with labels c_i, D = {(u_1, c_1), …, (u_n, c_n)} where c_i ∈ C, train a model to estimate labels for new utterances u_k.
- Input: each utterance u_i is represented as a feature vector f_i.
- Output: a domain/intent label c_i for each input utterance.
- Question: how to represent a sentence as a feature vector?
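One simple answer to the question is a bag-of-words count vector over a fixed vocabulary (richer features such as n-grams or embeddings are common); a sketch with a made-up vocabulary:

```python
def bag_of_words(utterance, vocabulary):
    """Represent a sentence as a count vector over a fixed vocabulary."""
    words = utterance.lower().split()
    return [words.count(term) for term in vocabulary]
```

For example, with vocabulary ["cheap", "restaurant", "movie"], the utterance "find me a cheap taiwanese restaurant in oakland" becomes [1, 1, 0].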
Sequence Tagging Model
As a sequence tagging task: given a collection of tagged word sequences S = {((w_{1,1}, …, w_{1,n1}), (t_{1,1}, …, t_{1,n1})), ((w_{2,1}, …, w_{2,n2}), (t_{2,1}, …, t_{2,n2})), …} where t_i ∈ M, the goal is to estimate tags for a new word sequence.
- Input: each word w_{i,j} is represented as a feature vector f_{i,j}.
- Output: a slot label t_{i,j} for each word in the utterance.
- Question: how to represent a word as a feature vector?
Word Representation: Atomic Symbols
- One-hot representation: car → [0 0 0 0 0 0 1 0 0 0]
- Issue: difficult to compute similarity (e.g. comparing car and motorcycle):
  car [0 0 0 0 0 0 1 0 0 0] AND motorcycle [0 0 1 0 0 0 0 0 0 0] = 0
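The similarity problem can be seen directly: the dot product of any two distinct one-hot vectors is always 0, so every pair of different words looks equally dissimilar. A sketch:

```python
def one_hot(index, size):
    """One-hot vector: all zeros except a 1 at the word's index."""
    v = [0] * size
    v[index] = 1
    return v

def dot(a, b):
    """Dot product; for one-hot vectors this is 1 iff the indices match."""
    return sum(x * y for x, y in zip(a, b))
```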
Word Representation: Neighbor-Based
- Low-dimensional dense word embeddings.
- Idea: words with similar meanings often have similar neighbors.
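With dense embeddings, similarity becomes meaningful, e.g. as cosine similarity (the 2-d vectors below are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense word vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
```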
Chinese Input: Unit of Representation
- Character: feed each character to each time step.
  你知道美女與野獸電影的評價如何嗎？ ("Do you know how the movie Beauty and the Beast is rated?")
- Word: word segmentation is required first.
  你 / 知道 / 美女與野獸 / 電影 / 的 / 評價 / 如何 / 嗎
- Question: can the two types of information be fused for better performance?
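The two unit choices can be sketched as follows (the word-level variant assumes an upstream segmenter has already inserted the boundaries):

```python
def char_units(sentence):
    """Character-level input: one time step per character; no segmenter needed."""
    return list(sentence)

def word_units(segmented):
    """Word-level input: relies on an upstream word segmenter's boundaries."""
    return segmented.split(" / ")
```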
LU: Domain/Intent Classification
Deep Neural Networks for Domain/Intent Classification I (Sarikaya et al., 2011)
http://ieeexplore.ieee.org/abstract/document/5947649/
- Deep belief networks (DBN): unsupervised pre-training of the weights, followed by fine-tuning with back-propagation.
- Compared to MaxEnt, SVM, and boosting baselines.
Deep Neural Networks for Domain/Intent Classification II (Tur et al., 2012; Deng et al., 2012)
http://ieeexplore.ieee.org/abstract/document/6289054/; http://ieeexplore.ieee.org/abstract/document/6424224/
- Deep convex networks (DCN): simple classifiers are stacked to learn complex functions.
- Feature selection of salient n-grams.
- Extension to kernel-DCN.
Deep Neural Networks for Domain/Intent Classification III (Ravuri and Stolcke, 2015)
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rnnlm_addressee.pdf
- RNNs and LSTMs for utterance classification.
- Word hashing to deal with the large number of singletons: e.g. Kat → #Ka, Kat, at#; each character n-gram is associated with a bit in the input encoding.
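The word-hashing trick can be sketched as follows: pad the word with boundary marks, collect its character trigrams, and flip one bit per trigram in the input encoding (the index mapping below is a toy stand-in for the full trigram inventory):

```python
def char_trigrams(word):
    """Collect character trigrams of a word padded with boundary marks (#)."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def hash_encoding(word, trigram_index):
    """Multi-hot input vector: one bit per known character trigram."""
    bits = [0] * len(trigram_index)
    for tri in char_trigrams(word):
        if tri in trigram_index:
            bits[trigram_index[tri]] = 1
    return bits
```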
LU: Slot Filling
Recurrent Neural Nets for Slot Tagging I (Yao et al., 2013; Mesnil et al., 2015)
http://131.107.65.14/en-us/um/people/gzweig/pubs/interspeech2013rnnlu.pdf; http://dl.acm.org/citation.cfm?id=2876380
Variations:
a. RNNs with LSTM cells (LSTM)
b. Input with a sliding window of n-grams (LSTM-LA)
c. Bi-directional LSTMs (bLSTM): forward states h_t^f and backward states h_t^b are combined to predict each tag y_t.
Recurrent Neural Nets for Slot Tagging II (Kurata et al., 2016; Simonnet et al., 2015)
http://www.aclweb.org/anthology/d16-1223
- Encoder-decoder networks: encode the whole input sentence first, so the tagger can leverage sentence-level information.
- Attention-based encoder-decoder: uses attention (as in machine translation) in the encoder-decoder network; the attention weights are estimated by a feed-forward network whose inputs at time t are h_t and s_t.
Joint Semantic Frame Parsing
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/is16_multijoint.pdf; https://arxiv.org/abs/1609.01454
- Sequence-based (Hakkani-Tur et al., 2016): slot filling and intent prediction in the same output sequence, e.g. "taiwanese food please <EOS>" → B-type O O FIND_REST, with the intent predicted at the EOS step.
- Parallel (Liu and Lane, 2016): intent prediction and slot filling are performed in two branches.
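In the sequence-based formulation, the target sequence is simply the per-word slot tags with the intent label appended at the EOS step; a sketch of building and splitting such a joint sequence (function names are ours, for illustration):

```python
def joint_targets(slot_tags, intent):
    """Sequence-based joint frame: one slot tag per word, intent at the EOS step."""
    return slot_tags + [intent]

def split_joint(outputs):
    """Recover (slot_tags, intent) from a joint output sequence."""
    return outputs[:-1], outputs[-1]
```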
Milestone 1: Language Understanding
3) Collect and annotate data.
4) Use a machine learning method to train your system:
   - Conventional: SVM for domain/intent classification, CRF for slot filling.
   - Deep learning: LSTM for domain/intent classification and slot filling.
5) Test your system's performance.
Concluding Remarks
A task-oriented dialogue system pipelines speech recognition, language understanding (domain identification, user intent detection, slot filling), dialogue management (dialogue state tracking and dialogue policy, backed by a database/knowledge providers), and natural language generation.