Recent Developments in Statistical Dialogue Systems

Steve Young
Machine Intelligence Laboratory, Information Engineering Division
Cambridge University Engineering Department, Cambridge, UK
Braunschweig, September 2012

Contents
- Review of Basic Ideas and Current Limitations
- Semantic Decoding
- Fast Learning
- Parameter Optimisation and Structure Learning

Spoken Dialog Systems (SDS)
Input path: the user's speech ("I want to find a restaurant") passes through the recognizer (waveforms to words) and the semantic decoder (words to dialog acts, e.g. inform(venuetype=restaurant)) into the dialog control, which consults a database.
Output path: dialog control issues a dialog act such as request(food), which the message generator turns into words ("What kind of food would you like?") and the synthesizer speaks back to the user.

A Statistical Spoken Dialogue System
User: "I want an Italian"
ASR N-best list: "I'd like Italian" {0.4}, "I want an Italian" {0.2}, "I'd like Indian" {0.2}, "In the east" {0.1}
Semantic decoder (trained by supervised learning, driven by the ontology): inform(food=Italian) {0.6}, inform(food=Indian) {0.2}, inform(area=east) {0.1}, null() {0.1}
This evidence is fed into a Bayesian belief network; belief propagation updates the belief state.
A stochastic policy, optimised by reinforcement learning against a reward function (rewards: success/fail), maps the belief state to a decision such as confirm(food=Italian), request(area).
Response generator: "You're looking for an Italian restaurant, whereabouts?"
The whole system is a partially observable Markov decision process (POMDP).

The POMDP SDS Framework
Speech x_t passes through the speech recogniser (words w_t) and the semantic decoder (user acts u_t), giving the observation
    o_t = p(u_t | x_t) = p(u_t | w_t) p(w_t | x_t)
Belief state update (state s_t, previous system action a_{t-1}):
    b_t(s_t) = \eta \, p(o_t | s_t, a_{t-1}) \sum_{s_{t-1}} P(s_t | s_{t-1}, a_{t-1}) \, b_{t-1}(s_{t-1})
Stochastic policy over system actions, with parameters \theta and belief-state features \phi:
    a_t \sim \pi(a_t | b_t) = \frac{e^{\theta \cdot \phi_{a_t}(b_t)}}{\sum_a e^{\theta \cdot \phi_a(b_t)}}
Reward function, maximised with respect to \theta:
    R = E\Big[ \sum_t \sum_{s_t} r(s_t, a_t) \, b_t(s_t) \Big]
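
To make the update and the policy concrete, here is a minimal numerical sketch in Python/NumPy. It is illustrative only: the toy state space, the transition and observation arrays, and the feature functions are hypothetical stand-ins for the real dialogue model.

```python
import numpy as np

def belief_update(b_prev, T, O, a_prev, o):
    """One POMDP belief update:
    b_t(s) = eta * p(o | s, a_prev) * sum_s' P(s | s', a_prev) * b_{t-1}(s').
    T[a][i, j] = P(s_t = j | s_{t-1} = i, a);  O[a][j, o] = p(o | s_t = j, a)."""
    predicted = T[a_prev].T @ b_prev          # marginalise over the previous state
    unnorm = O[a_prev][:, o] * predicted      # weight by the observation likelihood
    return unnorm / unnorm.sum()              # eta normalises to a distribution

def softmax_policy(theta, features, b):
    """Sample a_t ~ pi(a | b), with pi(a | b) proportional to exp(theta . phi_a(b)).
    'features' is a list of functions, one per action, returning phi_a(b)."""
    scores = np.array([theta @ f(b) for f in features])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return np.random.choice(len(features), p=probs), probs

# Toy example: two states, two actions, two observations (all values made up).
T = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.7, 0.3], [0.3, 0.7]])]
O = [np.array([[0.8, 0.2], [0.3, 0.7]]), np.array([[0.6, 0.4], [0.4, 0.6]])]
b = belief_update(np.array([0.5, 0.5]), T, O, a_prev=0, o=1)
features = [lambda b: np.append(b, 1.0), lambda b: np.append(1.0 - b, 1.0)]
a, probs = softmax_policy(np.zeros(3), features, b)
```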

Dialogue State
Tourist information domain: type = {bar, restaurant}, food = {French, Chinese, none}.
The dialogue state is a dynamic Bayesian network with, per slot, a goal node (g_type, g_food), a user act node (u_type, u_food) and a history/memory node (h_type, h_food), with observations (o_type, o_food) at time t and links into the next time slice t+1. User behaviour and recognition/understanding errors are modelled explicitly, and all nodes are conditioned on the previous system action.
J. Williams (2007). "POMDPs for Spoken Dialog Systems." Computer Speech and Language 21(2)

Belief Monitoring (Tracking)
The marginal beliefs over g_type (Bar, Restaurant, none) and g_food (French, Chinese, none) are tracked across turns.
At t=1 the observation is inform(type=bar, food=French) {0.6}, inform(type=restaurant, food=French) {0.3}, so probability mass moves onto type=bar and food=French.
At t=2 the system asks confirm(type=restaurant, food=French) and the user replies affirm() {0.9}, sharpening the belief further.
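
A toy version of this two-turn update for the 'type' slot, assuming a static user goal and treating the SLU confidences directly as evidence weights (the real system uses the full dynamic Bayesian network with memory and history nodes), might look like:

```python
import numpy as np

values = ["bar", "restaurant", "none"]
belief = np.array([1/3, 1/3, 1/3])          # uniform prior over the user's goal for 'type'

def update(belief, likelihood):
    """Static-goal Bayes update: b'(g) proportional to p(evidence | g) * b(g)."""
    post = likelihood * belief
    return post / post.sum()

# Turn 1: SLU hypotheses inform(type=bar) {0.6}, inform(type=restaurant) {0.3}, rest {0.1}.
# Crude evidence model: share the residual mass equally across all values.
belief = update(belief, np.array([0.6 + 0.1 / 3, 0.3 + 0.1 / 3, 0.1 / 3]))

# Turn 2: after confirm(type=restaurant) the user says affirm() with confidence 0.9.
belief = update(belief, np.array([0.05, 0.9, 0.05]))

print(dict(zip(values, belief.round(3))))
```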

Choosing the next action - the Policy
The slot beliefs (g_type over Bar/Restaurant/none, g_food over French/Chinese/none) are quantised into binary feature vectors \phi_1, \phi_2, \phi_3, ..., which together with the policy vector \theta define a stochastic policy over summary actions:
    \pi(a | b, \theta) = \frac{e^{\theta \cdot \phi_a(b)}}{\sum_{a'} e^{\theta \cdot \phi_{a'}(b)}}
A summary action is sampled from this distribution (all possible summary actions: inform, select, confirm, etc.) and then mapped back to a full dialogue act using the belief state, e.g. inform -> inform(type=bar) {0.4}, select -> select(type=bar, type=restaurant).
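
The quantisation into grid features and the mapping from a summary action back to a full act could be sketched as below; the bin count, slot names and act strings are illustrative, not the actual system's.

```python
import numpy as np

def summary_features(belief_type, belief_food, n_bins=5):
    """Quantise the top marginal belief of each slot into a one-hot grid (the phi vectors)."""
    feats = []
    for b in (belief_type, belief_food):
        one_hot = np.zeros(n_bins)
        one_hot[min(int(np.max(b) * n_bins), n_bins - 1)] = 1.0
        feats.append(one_hot)
    return np.concatenate(feats)

def map_summary_action(action, belief_type, type_values):
    """Map a summary action back to a full dialogue act using the belief state."""
    ranked = [type_values[i] for i in np.argsort(belief_type)[::-1]]
    if action == "inform":
        return f"inform(type={ranked[0]})"
    if action == "confirm":
        return f"confirm(type={ranked[0]})"
    if action == "select":
        return f"select(type={ranked[0]}, type={ranked[1]})"
    return f"{action}()"

phi = summary_features(np.array([0.4, 0.35, 0.25]), np.array([0.6, 0.3, 0.1]))
act = map_summary_action("select", np.array([0.4, 0.35, 0.25]), ["bar", "restaurant", "none"])
```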

Let's Go 2010 Control Test Results
[Scatter plot: predicted success rate vs word error rate (WER) for all qualifying systems, with success/fail outcomes marked.]
All baseline qualifying systems: average success = 64.8%, average WER = 42.4%.
Cambridge system: 89% success at 33% WER; baseline: 65% success at 42% WER; another entry: 75% success at 34% WER.
B. Thomson, "Bayesian Update of State for the Let's Go Spoken Dialogue Challenge," SLT 2010

Demo of Cambridge Restaurant Information

Issues with the 2010 System Design
- Poor coverage of the N-best list of semantic hypotheses
- Hand-crafting of the summary belief space
- Slow policy optimisation and reliance on user simulation
- Dependence on hand-crafted dialogue model parameters
- Dependence on a static ontology/database

N-best Semantic Decoding
Conventional N-best decoding: speech -> ASR -> N-best word strings -> semantic decoder (applied N times) -> N-best dialogue acts -> merge duplicates -> M-best dialogue acts (M << N).
Confusion network decoding: speech -> ASR -> word confusion network -> feature extraction (n-gram features) -> semantic decoder (applied once, with optional context) -> M-best dialogue acts (M << N).

Confusion Network Decoder (Mairesse/Henderson)
The word confusion network (e.g. competing hypotheses I / I'd / Is / In, were / want / water / well, Indian / Italian / in) is converted into input features: expected n-gram counts
    x_i = E\{C(\text{n-gram}_i)\}^{1/|\text{n-gram}_i|}, \quad n = 1, 2, 3
together with a binary context vector in which b_j = 1 if item j appeared in the last system act.
A bank of multi-way SVMs, one per semantic item, scores dialogue-act components such as inform, venue=restaurant, area=centre, food=Italian {0.91}, food=Indian {0.42}, area=east {0.1}, from which the M-best semantic trees are assembled.
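
As a sketch of the feature extraction, expected n-gram counts can be accumulated directly from the slot posteriors of a confusion network, with the normalisation following the slide's x_i = E{C(n-gram_i)}^{1/|n-gram_i|}. The tiny confusion network and its posteriors below are made up for illustration, and a real implementation would also append the binary context vector.

```python
import itertools
import numpy as np

def expected_ngram_counts(confnet, n_max=3):
    """Expected n-gram counts over a word confusion network.
    confnet is a list of time slots, each a list of (word, posterior) pairs.
    The expected count of an n-gram spanning slots t..t+n-1 is the product of
    the posteriors of its words, summed over all start positions."""
    counts = {}
    for n in range(1, n_max + 1):
        for t in range(len(confnet) - n + 1):
            for path in itertools.product(*confnet[t:t + n]):
                words = tuple(w for w, _ in path)
                prob = np.prod([p for _, p in path])
                counts[words] = counts.get(words, 0.0) + prob
    # slide's normalisation: take the |n-gram|-th root of the expected count
    return {ng: c ** (1.0 / len(ng)) for ng, c in counts.items()}

confnet = [[("i", 0.7), ("i'd", 0.3)],
           [("want", 0.6), ("like", 0.4)],
           [("italian", 0.8), ("indian", 0.2)]]
features = expected_ngram_counts(confnet)
```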

Confusion Network Decoder Evaluation
[Plots: predicted F-score vs WER and predicted ICE vs WER, comparing the confusion network decoder (+ context) with the Phoenix decoder.]
Comparison of item retrieval on a corpus of 4.8k utterances: N-best hand-crafted Phoenix decoder vs confusion network decoder (trained on 1k utterances).

Live dialogue system    N-best Phoenix    Confusion Net
F-score                 0.80              0.82
ICE                     2.2               1.26
Average reward          10.6              11.15

Discriminative Spoken Language Understanding using Word Confusion Networks, Henderson et al., IEEE SLT 2012

Policy Optimization
Policy parameters are chosen to maximize the expected reward
    J(\theta) = E\Big[ \frac{1}{T} \sum_t r(s_t, a_t) \,\Big|\, \pi_\theta \Big]
Natural gradient ascent works well:
    \tilde{\nabla} J(\theta) = F_\theta^{-1} \nabla J(\theta)
where F_\theta is the Fisher information matrix. The gradient is estimated by sampling dialogues, and in practice the Fisher information matrix does not need to be computed explicitly. This is the Natural Actor-Critic algorithm.
However, it is (a) slow (~100k dialogues) and (b) requires summary-space approximations.
J. Peters and S. Schaal (2008). "Natural Actor-Critic." Neurocomputing 71(7-9)
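
The following is a deliberately simplified sketch of a sampled natural-gradient update, not the full episodic Natural Actor-Critic of Peters and Schaal: the gradient is a plain REINFORCE estimate and the Fisher matrix is approximated by the empirical outer product of score vectors. The grad_log_pi callback and the episode format are assumptions of the sketch.

```python
import numpy as np

def natural_gradient_step(episodes, grad_log_pi, theta, lr=0.1, reg=1e-3):
    """Simplified natural policy-gradient step (not the full Natural Actor-Critic).
    episodes: list of trajectories, each a list of (b, a, r) tuples.
    grad_log_pi(theta, b, a): gradient of log pi_theta(a | b) w.r.t. theta."""
    g = np.zeros_like(theta)
    F = reg * np.eye(len(theta))
    for traj in episodes:
        ret = sum(r for _, _, r in traj)                 # episode return
        for b, a, _ in traj:
            score = grad_log_pi(theta, b, a)
            g += score * ret / len(episodes)             # REINFORCE gradient estimate
            F += np.outer(score, score) / len(episodes)  # empirical Fisher information
    return theta + lr * np.linalg.solve(F, g)            # F^{-1} grad J, no explicit inverse
```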

Q-functions and the SARSA algorithm
Traditional reinforcement learning is commonly based on finding the optimal Q-function:
    Q^*(b, a) = \max_\pi E_\pi\Big[ \sum_{\tau = t+1}^{T} r(b_\tau, a_\tau) \Big]
The optimal deterministic policy is then simply
    \pi^*(b) = \arg\max_a Q^*(b, a)
Q^* can be found sequentially using the SARSA algorithm:
    b = b_0; choose action a \epsilon-greedily from \pi(b)
    for each dialogue turn:
        take action a, observe reward r and next state b'
        choose action a' \epsilon-greedily from \pi(b')
        Q(b, a) <- Q(b, a) + \lambda [ r + Q(b', a') - Q(b, a) ]
        b = b'; a = a'
    end
Eventually, Q -> Q^*.
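
A minimal tabular SARSA loop matching the pseudocode above, over a discretised belief space; the env interface (reset/step) and the use of hashable belief keys are assumptions for the sketch.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=1000, lam=0.1, gamma=1.0, eps=0.1):
    """Tabular SARSA on a (discretised) belief space; 'env' is assumed to expose
    reset() -> b and step(a) -> (b_next, reward, done) (hypothetical interface)."""
    Q = defaultdict(float)

    def egreedy(b):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(b, a)])

    for _ in range(episodes):
        b = env.reset()
        a = egreedy(b)
        done = False
        while not done:
            b_next, r, done = env.step(a)
            a_next = egreedy(b_next)
            # SARSA temporal-difference update towards r + gamma * Q(b', a')
            Q[(b, a)] += lam * (r + gamma * Q[(b_next, a_next)] - Q[(b, a)])
            b, a = b_next, a_next
    return Q
```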

Gaussian Process based Learning (Milica Gasic)
For POMDPs the belief space is continuous and direct representations of Q are intractable. However, Q can be approximated as a zero-mean Gaussian process by designing a kernel to represent the correlations between points in belief x action space. Thus:
    Q(b, a) \sim \mathcal{GP}\big(0,\; k((b, a), (b, a))\big)
Given a sequence of belief-action pairs B_t = [(b_0, a_0), (b_1, a_1), ..., (b_t, a_t)] and rewards r_t = [r_0, r_1, ..., r_t], there is a closed-form solution for the posterior:
    Q(b, a) \mid B_t, r_t \sim \mathcal{N}\big(\bar{Q}(b, a),\; \mathrm{cov}((b, a), (b, a))\big)
This suggests a SARSA-like sequential optimisation:
    b = b_0; choose action a \epsilon-greedily from Q(b, a)
    for each dialogue turn:
        take action a, observe reward r and next state b'
        choose action a' \epsilon-greedily from Q(b', a')
        update the posterior covariance estimate
        b = b'; a = a'
    end
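
The closed-form posterior can be sketched with ordinary GP regression of observed returns onto visited (b, a) points. This is a simplification of true GP-SARSA, which models temporal differences between successive points rather than Monte-Carlo returns, but it shows where the posterior mean and variance used for action selection come from. The kernel matrices are assumed to be supplied by a kernel over belief x action space.

```python
import numpy as np

def gp_q_posterior(K, returns, K_star, k_star_star, noise=0.1):
    """Posterior mean and variance of Q at new (b, a) points, treating observed
    per-visit returns as noisy targets (a simplification of full GP-SARSA).
    K: kernel matrix over visited (b, a) pairs; K_star: kernel between new and
    visited points; k_star_star: prior variances at the new points."""
    A = K + noise * np.eye(len(returns))
    alpha = np.linalg.solve(A, returns)
    mean = K_star @ alpha
    var = k_star_star - np.einsum("ij,ji->i", K_star, np.linalg.solve(A, K_star.T))
    return mean, var
```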

Benefits of GP-SARSA
- Sequential estimation of the distribution of Q (not just Q itself)
- Each new data point can impact the whole distribution via the covariance function, giving very efficient use of training data
- Much faster learning than gradient methods such as the Natural Actor-Critic (NAC)
[Learning-curve comparison on the TownInfo system: GP vs NAC.]
\epsilon-greedy exploration:
    a = argmax_a Q(b, a) with probability 1 - \epsilon; a random action with probability \epsilon

Benefits of GP-SARSA
The variance of Q is known at each stage, allowing more intelligent exploration.
Variance exploration:
    a = argmax_a Q(b, a) with probability 1 - \epsilon; a = argmax_a cov((b, a), (b, a)) with probability \epsilon
Stochastic policy:
    sample Q_i(b, a_i) \sim \mathcal{N}\big(\bar{Q}(b, a_i), \mathrm{cov}((b, a_i), (b, a_i))\big) for each action, then take a = argmax_{a_i} Q_i(b, a_i)
[Plot: average reward vs number of training dialogues on the TownInfo system, for variance exploration and the stochastic policy.]
Well trained within 3k dialogues, and the summary-space mapping is no longer needed.
On-line policy optimisation of SDS via live interaction with human subjects, Gasic et al., ASRU 2011
Gaussian processes for policy optimisation of large scale POMDP-based SDS, Gasic et al., SLT 2012
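
Given the GP posterior mean and variance per action, the two exploration schemes above reduce to a few lines. This is only a sketch; q_mean and q_var are assumed to be per-action arrays produced by the GP posterior.

```python
import numpy as np

def variance_exploration(q_mean, q_var, eps=0.1):
    """With probability 1-eps exploit the posterior mean; otherwise pick the
    action whose Q estimate is most uncertain."""
    if np.random.rand() < eps:
        return int(np.argmax(q_var))
    return int(np.argmax(q_mean))

def stochastic_policy(q_mean, q_var):
    """Sample one Q value per action from its posterior and act greedily on the
    samples (Thompson-sampling-style exploration)."""
    samples = np.random.normal(q_mean, np.sqrt(q_var))
    return int(np.argmax(samples))
```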

Parameter Estimation (Blaise Thomson)
The dialogue model is a dynamic Bayesian network with the following conditional distributions:
    Goal transition model: p(g_{t+1} | g_t, a_t)
    User behaviour: p(u_t | g_t, a_{t-1})
    Recognition errors: p(o_t | u_t)
    Memory/history: p(h_t | u_t, h_{t-1}, a_{t-1})
These parameters are typically hand-crafted. How can we learn them from data?

Factor Graphs and Expectation Propagation
The dialogue model, including its parameters {\theta_i}, is expressed as a factor graph, e.g. a factor f_1(S_1) with S_1 = (a_{t-1}, g_t, u_t). The posterior then factorises as
    b(s_t | o_t, s_{t-1}, a_{t-1}) \propto \prod_{k=1}^{K} f_k(S_k)
- Exact computation is intractable
- It can be approximated using belief propagation; we use Expectation Propagation (EP)
- Using EP, factors can be discrete and continuous
- Hence, parameters can be added and updated simultaneously
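
EP itself is too involved for a short sketch, but the idea of treating a model parameter as an extra node that is updated from data can be illustrated in the simplest conjugate case, e.g. a single confusion probability in the error model p(o_t | u_t) with a Beta prior. This is a hypothetical stand-in for illustration, not the system's actual EP update.

```python
def update_error_parameter(alpha, beta, matched):
    """Beta-Bernoulli update of a single error-model probability p(observation
    matches the true user act). 'matched' is a list of booleans from annotated
    turns; this conjugate update stands in for what EP does jointly over all
    factor parameters."""
    correct = sum(matched)
    alpha += correct
    beta += len(matched) - correct
    return alpha, beta, alpha / (alpha + beta)   # posterior mean estimate

alpha, beta, p_correct = update_error_parameter(1.0, 1.0, [True, True, False, True])
```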

Effect of Parameter Learning on TownInfo System

Structure Learning
The ability to learn parameters can be extended to learn structure. Let G = {g_k} be a set of Bayesian networks (or factor graphs) and D the corpus data. The search loop is:
    1. Compute the evidence p(D | g_k) for all g_k in G (the evidence can be computed efficiently using expectation propagation)
    2. Augment G by perturbing each g_k
    3. Prune G
    4. Stop? If not, repeat; otherwise select the g_k with the highest evidence
Potential to learn: (a) additional values for an existing variable, (b) new hidden variables, (c) new links between variables.
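
A sketch of this search loop, with evidence(g) (an approximation to p(D | g), e.g. from EP) and perturb(g) (local structure changes such as adding a value, a hidden variable, or a link) assumed to be provided by the caller:

```python
def structure_search(initial_graphs, evidence, perturb, beam=5, max_iters=10):
    """Greedy structure search following the slide's loop: score each candidate
    graph by its (approximate) evidence p(D | g), keep the best ones, expand
    them by local perturbations, and stop when the evidence stops improving."""
    candidates = list(initial_graphs)
    best, best_score = None, float("-inf")
    for _ in range(max_iters):
        scored = sorted(candidates, key=evidence, reverse=True)[:beam]   # prune
        if evidence(scored[0]) <= best_score:
            break                                                        # stop criterion
        best, best_score = scored[0], evidence(scored[0])
        candidates = scored + [g2 for g in scored for g2 in perturb(g)]  # augment
    return best
```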

Conclusions
- Statistical dialogue systems based on POMDPs are viable, offer increased robustness to noise, and require no hand-crafting
- Good progress is being made on increasing accuracy and speeding up learning
- Learning directly from human users, rather than depending on user simulators, is now possible
- Current systems are built from static ontologies for closed domains
- Next steps will include building more flexible systems capable of dynamically adapting to new information content
Acknowledgement: this work is the result of a team effort: Catherine Breslin, Milica Gasic, Matt Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and former members of the CUED Dialogue Systems Group.