Scaling up Partially Observable Markov Decision Processes for Dialogue Management

Scaling up Partially Observable Markov Decision Processes for Dialogue Management
Jason D. Williams, 22 July 2005
Machine Intelligence Laboratory, Cambridge University Engineering Department

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

Material in this talk is joint work with Steve Young:
- Jason D. Williams, Pascal Poupart, and Steve Young (2005). Factored Partially Observable Markov Decision Processes for Dialogue Management. 4th Workshop on Knowledge and Reasoning in Practical Dialog Systems, International Joint Conference on Artificial Intelligence (IJCAI), August 2005, Edinburgh.
- Jason D. Williams, Pascal Poupart, and Steve Young (2005). Partially Observable Markov Decision Processes with Continuous Observations for Dialogue Management. In Proceedings of the 6th SigDial Workshop on Discourse and Dialogue, September 2005, Lisbon.

Idealized Human-Computer Dialog
[Figure: example flight-booking dialogue ("Boston to London"), showing the user's actual utterances alongside the recognized slot values (e.g., "From Boston" recognized as from=austin) and the evolving slot states (from=null/stated, to=stated).]
- Tracking the current state = dialogue modelling
- System action selection = dialogue management
- The dialogue state is unobserved: the user's goal, the user's (real) action, and the conversation state
- It must be inferred via observations, which may contain errors

Decompose state variable into 3 models
- The state is factored into the user's goal S_u, the user's action A_u, and the dialogue state S_d, each with its own model
[Figure: two-slice dynamic Bayesian network (times t and t+1) linking S_u, A_u, S_d, the system action A_s, the observation O, the reward r, and the belief state B.]
- The designer specifies r; the mapping A_s = Π(b(S)) seeks to maximize cumulative reward
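To make the factoring concrete, below is a minimal sketch of belief monitoring over the factored state (s_u, s_d), marginalising over the unobserved user action a_u. The array shapes, the model names, and the independence assumptions are illustrative stand-ins for the user goal, user action, conversation, and observation models named on this slide, not the exact formulation used in the talk.

```python
import numpy as np

def belief_update(b, a_s, o,
                  p_goal,      # p_goal[s_u', s_u]            user goal model
                  p_user_act,  # p_user_act[a_u, s_u', a_s]   user action model
                  p_dlg,       # p_dlg[s_d', a_u, s_d, a_s]   conversation model
                  p_obs):      # p_obs[o, a_u]                observation model
    """One step b(s_u, s_d) -> b'(s_u', s_d') after system action a_s and
    observation o, summing out the (unobserved) user action a_u."""
    n_su, n_sd = b.shape
    n_au = p_obs.shape[1]
    b_new = np.zeros_like(b)
    for su2 in range(n_su):
        for sd2 in range(n_sd):
            total = 0.0
            for au in range(n_au):
                for su in range(n_su):
                    for sd in range(n_sd):
                        total += (p_obs[o, au]
                                  * p_user_act[au, su2, a_s]
                                  * p_dlg[sd2, au, sd, a_s]
                                  * p_goal[su2, su]
                                  * b[su, sd])
            b_new[su2, sd2] = total
    return b_new / b_new.sum()   # normalise

# Tiny usage example with random (but properly normalised) models.
rng = np.random.default_rng(0)
n_su, n_sd, n_au, n_as, n_o = 3, 3, 4, 2, 4
p_goal = rng.dirichlet(np.ones(n_su), size=n_su).T
p_user_act = rng.dirichlet(np.ones(n_au), size=(n_su, n_as)).transpose(2, 0, 1)
p_dlg = rng.dirichlet(np.ones(n_sd), size=(n_au, n_sd, n_as)).transpose(3, 0, 1, 2)
p_obs = rng.dirichlet(np.ones(n_o), size=n_au).T
b0 = np.full((n_su, n_sd), 1.0 / (n_su * n_sd))
print(belief_update(b0, a_s=1, o=2, p_goal=p_goal,
                    p_user_act=p_user_act, p_dlg=p_dlg, p_obs=p_obs))
```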

A policy as a partitioning
- A policy is a mapping: situation → action
- One representation is a partitioning of belief space
- 2-dimensional example: a simple state space with 2 user goals, A and B
- Belief space can be written as b(A) (because b(B) = 1 - b(A))
- The policy shows the action to take for each point in belief space: submit(B) at b(A)=0.0, then confirm(B), ask-A-or-B around b(A)=0.5, confirm(A), and submit(A) as b(A) approaches 1.0
- A_s = Π(b(S)); this partitioning is produced by a POMDP optimization method
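As a toy illustration of how such a partitioning arises, the sketch below represents a value function as a handful of alpha-vectors over the two-goal belief space and reads off the maximizing action at a few belief points. The vectors and their values are made-up numbers for this 2-goal example, not the output of an actual solver.

```python
import numpy as np

# Each alpha-vector gives the value of an action as a function of [b(A), b(B)].
alpha_vectors = [
    (np.array([10.0, -20.0]), "submit(A)"),
    (np.array([ 4.0,   1.0]), "confirm(A)"),
    (np.array([ 2.8,   2.8]), "ask-A-or-B"),
    (np.array([ 1.0,   4.0]), "confirm(B)"),
    (np.array([-20.0, 10.0]), "submit(B)"),
]

def policy(b_a):
    """Map a point in belief space (just b(A), since b(B) = 1 - b(A)) to the
    action whose alpha-vector gives the highest expected value."""
    b = np.array([b_a, 1.0 - b_a])
    values = [alpha @ b for alpha, _ in alpha_vectors]
    return alpha_vectors[int(np.argmax(values))][1]

for b_a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(b_a, policy(b_a))
# -> submit(B), confirm(B), ask-A-or-B, confirm(A), submit(A)
```

The regions of belief space where one vector dominates the others are exactly the partition cells pictured on the slide.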

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

POMDPs scale poorly
Belief space, actions, and observations cover N slots and M slot values:

  i    S_u      A_u                  A_s                            O
  1    london   london, to(london)   conf(london), submit(london)   london, to(london)
  2    leeds    leeds, to(leeds)     conf(leeds), submit(leeds)     leeds, to(leeds)
  ...  ...      ..., to(...)         conf(...), submit(...)         ..., to(...)
  M    dover    dover, to(dover)     conf(dover), submit(dover)     dover, to(dover)

- For M=1000 and N=2, there are ~10^6 states, actions, and observations
- The growth factor is O(M^N)
- Belief monitoring can be performed with factoring, but how to optimize?
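A quick back-of-the-envelope check of the growth claim, assuming the flat enumeration described on this slide:

```python
# With M slot values and N slots, the flat user-goal space alone has M**N
# entries (user actions and observations grow the same way).
M, N = 1000, 2
print(f"user goals for N=2: {M**N:,}")   # 1,000,000  (~10^6)
print(f"user goals for N=3: {M**3:,}")   # 1,000,000,000
```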

Intuition of the Summary POMDP
[Figure: the master belief over user goals b(s_u) (london, leeds, bath, ...) is compressed into a summary belief b(s̃_u) over just {best, rest}; the belief over user actions b(a_u) (to london, london, to leeds, ...) is compressed similarly; the dialogue-state belief b(s_d) over {not-stated, stated, confirmed} carries over unchanged as b(s̃_d).]
- Master space: |S_u| = 1000, |A_u| = 2000, |S_d| = 3
- Summary space: |S̃_u| = 2, |S̃_d| = 3
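A minimal sketch of this belief compression, with an illustrative three-city master belief: the user-goal belief collapses to {best, rest}, while the slot-status belief carries over unchanged.

```python
def to_summary(b_su, b_sd):
    """Map the master belief (b_su over slot values, b_sd over slot status)
    to the summary belief over {best, rest} plus the unchanged b_sd."""
    best_value = max(b_su, key=b_su.get)           # most likely user goal
    b_best = b_su[best_value]
    summary = {
        "b_su~": {"best": b_best, "rest": 1.0 - b_best},
        "b_sd~": dict(b_sd),                        # carried over unchanged
    }
    return best_value, summary

b_su = {"london": 0.70, "leeds": 0.20, "bath": 0.10}
b_sd = {"not-stated": 0.1, "stated": 0.8, "confirmed": 0.1}
print(to_summary(b_su, b_sd))
# ('london', {'b_su~': {'best': 0.7, 'rest': 0.3}, 'b_sd~': {...}})
```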

Actions in the summary POMDP
- ask → ask (the ask action maps to itself)
- confirm(london), confirm(leeds), ..., confirm(dover) (M actions) → a single summary action confirm(x), where x = argmax b(s_u)
- submit(london), submit(leeds), ..., submit(dover) (M actions) → a single summary action submit(x), where x = argmax b(s_u)
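A minimal sketch of the reverse mapping from summary actions back to master actions; the action names and belief values here are illustrative.

```python
def to_master_action(summary_action, b_su):
    """Fill in the argument of a summary confirm/submit action with the most
    likely user goal, x = argmax b(s_u); ask maps to itself."""
    x = max(b_su, key=b_su.get)
    if summary_action in ("confirm", "submit"):
        return f"{summary_action}({x})"
    return summary_action

b_su = {"london": 0.7, "leeds": 0.2, "dover": 0.1}
print(to_master_action("confirm", b_su))   # confirm(london)
print(to_master_action("submit", b_su))    # submit(london)
print(to_master_action("ask", b_su))       # ask
```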

Observations in the summary POMDP
Observations in the summary POMDP give a compact indication of how b(s̃) is changing based on s and o. The summary observation has 2 parts:
- õ_1: does argmax b'(s_u) (after incorporating o) equal argmax b(s_u)? → {same, different}
- õ_2: is argmax b'(s_u) consistent with o? → {consistent, inconsistent, no-info}
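A minimal sketch of constructing the two-part summary observation; the consistency test used below (checking whether the top hypothesis appears in the observation string) is a stand-in, not necessarily the test used in the real system.

```python
def summary_observation(b_before, b_after, observation):
    """o~1: did the top user-goal hypothesis change after the update?
       o~2: is the new top hypothesis consistent with the observation?"""
    top_before = max(b_before, key=b_before.get)
    top_after = max(b_after, key=b_after.get)
    o1 = "same" if top_after == top_before else "different"

    if observation is None or observation == "null":
        o2 = "no-info"
    elif top_after in observation:        # e.g. "to(london)" mentions "london"
        o2 = "consistent"
    else:
        o2 = "inconsistent"
    return o1, o2

b_before = {"london": 0.4, "leeds": 0.6}
b_after = {"london": 0.8, "leeds": 0.2}
print(summary_observation(b_before, b_after, "to(london)"))
# ('different', 'consistent')
```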

Summary POMDP method
1. Given the master models p(s'|s, a), p(o'|s', a), and r(s, a), sample to estimate the summary models p(s̃'|s̃, ã), p(õ'|s̃', ã), and r(s̃, ã)
2. Optimize the summary policy π(b(s̃))
3. At run time, maintain the master belief b(s); form the summary belief with b(s̃_d) = b(s_d) and b(s̃_u = best) = max b(s_u); choose ã = π(b(s̃)) and map it to the master action a = (ã, argmax b(s_u))
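Putting the pieces together, here is a run-time sketch of step 3 (steps 1 and 2, sampling the summary dynamics and optimizing the summary policy, are assumed to have been done offline). The summary policy below is a hand-written stand-in rather than a solver output, and the dialogue-state part of the belief is simplified to a single status value.

```python
def summary_belief(b_su, sd_status):
    """b(s~): compress b(s_u) to max b(s_u); carry the slot status over."""
    return {"p_best": max(b_su.values()), "sd": sd_status}

def summary_policy(b_tilde):
    """Hand-written stand-in for the optimised summary policy pi(b(s~))."""
    if b_tilde["sd"] == "not-stated":
        return "ask"
    return "submit" if b_tilde["p_best"] > 0.9 else "confirm"

def choose_master_action(b_su, sd_status):
    """Pick a summary action, then map it back to a master action."""
    a_tilde = summary_policy(summary_belief(b_su, sd_status))
    if a_tilde == "ask":
        return "ask"
    x = max(b_su, key=b_su.get)            # x = argmax b(s_u)
    return f"{a_tilde}({x})"

print(choose_master_action({"london": 0.95, "leeds": 0.05}, "stated"))      # submit(london)
print(choose_master_action({"london": 0.60, "leeds": 0.40}, "stated"))      # confirm(london)
print(choose_master_action({"london": 0.50, "leeds": 0.50}, "not-stated"))  # ask
```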

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

1000-place dialog problem
- The user is travelling from x to y in a world with 1000 cities
- s_u: the user's desired from and to cities, e.g., (a, b)
- a_u, o: yes, no, null, a, from(a), to(a), {from(a), to(b)}, etc.
- s_d: each slot can be not-stated, unconfirmed, or confirmed
- a_s: ask-from, ask-to, conf-from(a), conf-to(b), submit(a, b)
- User goal model: the user has a fixed goal throughout the dialogue
- User action model: estimated from real dialogue data
- Conversation model: deterministically promotes slots
- Speech recognition model: substitutions & deletions with probability p_err
- Reward function: +50 for a correct submit, -50 for a wrong submit, -5 for failure, -3 for confirming something which has not been stated, etc., -1 per turn otherwise
- Solved with Perseus (PBVI)
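A minimal sketch of the reward function as described on this slide; the action representation, the parsing, and the test for "confirming something not stated" are illustrative assumptions, and the -5 failure case is omitted.

```python
def reward(action, user_goal, slot_status):
    """user_goal is (from_city, to_city); slot_status maps 'from'/'to' to
    'not-stated' / 'unconfirmed' / 'confirmed'."""
    if action[0] == "submit":                       # e.g. ("submit", "cambridge", "london")
        return 50 if (action[1], action[2]) == user_goal else -50
    if action[0] in ("conf-from", "conf-to"):
        slot = "from" if action[0] == "conf-from" else "to"
        if slot_status[slot] == "not-stated":
            return -3                               # confirming an un-stated slot
    return -1                                       # per-turn cost otherwise

goal = ("cambridge", "london")
print(reward(("submit", "cambridge", "london"), goal,
             {"from": "confirmed", "to": "confirmed"}))               # 50
print(reward(("conf-to", "leeds"), goal,
             {"from": "not-stated", "to": "not-stated"}))             # -3
print(reward(("ask-from",), goal,
             {"from": "not-stated", "to": "not-stated"}))             # -1
```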

Simulated speech recognition channel
- 2 subjects interact through a simulated speech recognition channel
[Figure: pipeline from the subject through an end-pointer (typical end-pointing model), a typist, and an ASR-noise generator to the wizard.]
- The subject talks into a microphone
- The wizard's screen shows the corrupted input
- Utterances are transcribed by a typist
- Confusions are generated using a lexicon, a phonetic confusion matrix, and a language model

Matthew Stuttle, Jason D. Williams, and Steve Young. A Framework for Wizard-of-Oz Experiments with a Simulated ASR-Channel. ICSLP, 2004, South Korea.

SACTI-1 dialog data corpus
- Tourist information domain
- The wizard has more information than the user; the user is given a specific task
[Figure: map of the fictitious town used in the tasks, showing streets and squares and labelled locations including the fountain, castle, tower, museum, park, cinema, shopping area, post office, and tourist information office.]
- 36 users / 12 wizards
- 144 dialogs, 4071 turns
- 4 different WER targets
- Measures: task completion and Likert-scale questions
- Conversational characteristics broadly similar to those observed in human/machine dialog

Jason D. Williams and Steve Young. Characterizing Task-Oriented Dialog using a Simulated ASR Channel. ICSLP, 2004, South Korea.

User model example
[Figure: estimated distribution P(A_u | A_s, S_u = to(london)) — the probability mass of the user's response (the to/WH portion only: "London", "To London", null, yes, no) for each system action A_s: "Where are you going to?", "To London, is that right?", and "Where are you going from?".]
- Data was divided, and two model sets were created: a training model set and a testing model set
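A minimal sketch of how such a user action model can be estimated by counting over annotated turns; the tiny corpus and the action names below are made up for illustration, whereas the real model was estimated from the dialogue data and split into training and testing model sets.

```python
from collections import Counter, defaultdict

# (system action, user goal, observed user action) triples -- illustrative only.
turns = [
    ("where-to?",          "to(london)", "london"),
    ("where-to?",          "to(london)", "to-london"),
    ("where-to?",          "to(london)", "null"),
    ("confirm-to(london)", "to(london)", "yes"),
    ("confirm-to(leeds)",  "to(london)", "no"),
]

counts = defaultdict(Counter)
for a_s, s_u, a_u in turns:
    counts[(a_s, s_u)][a_u] += 1

def p_user_action(a_u, a_s, s_u):
    """Maximum-likelihood estimate of P(a_u | a_s, s_u) from the counts."""
    c = counts[(a_s, s_u)]
    total = sum(c.values())
    return c[a_u] / total if total else 0.0

print(p_user_action("london", "where-to?", "to(london)"))   # 1/3
```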

[Figure: turn-by-turn belief evolution for an example dialogue — columns show the master beliefs b(s_from) and b(s_to) over cities (Cambridge, Sheffield, Leeds, Oxford, ...) and the corresponding summary beliefs b(s̃_from) and b(s̃_to), prior to the start of the dialog and after each turn below; recognized hypotheses are shown in brackets.]
- Where to? → "London" [recognized: "Leeds"]
- Where from? → "From Cambridge" [recognized: "To Oxford"]
- Where from? → "From Cambridge" [recognized: "From Cambridge"]
- To Leeds, right? → "No, to London" [recognized: "No, from Sheffield"]
- Where to? → "London from Cambridge" [recognized: "London from Cambridge"]

Performance vs. p_err
[Plot: reward gained per turn (y-axis, -1 to 8) against the error probability p_err (x-axis, 0.05 to 0.65), for the policy using the user model built from the training set.]

Performance vs. p_err
[Plot: same axes as above, now comparing the user model built from the training set with the user model built from the test set.]

Scalability vs. direct optimization
[Plot: average/expected return (y-axis, -10 to 8) against M, the number of distinct slot values (x-axis, log scale from 3 to 5000), comparing the summary POMDP with the directly optimized baseline.]

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

Conclusions & future work
For slot-based dialog management, POMDPs:
- Outperform an MDP baseline
- Outperform 3 handcrafted baselines
- Can be scaled as a summary POMDP to many slot values
Interesting future avenues:
- Scale the number of slots to the 10s
- Move beyond slot-based approaches
- Apply to a real system

Thanks!
Jason D. Williams (jdw30@cam.ac.uk)
Joint work with Steve Young (sjy@eng.cam.ac.uk)