Scaling up Partially Observable Markov Decision Processes for Dialogue Management

Scaling up Partially Observable Markov Decision Processes for Dialogue Management
Jason D. Williams, 22 July 2005
Machine Intelligence Laboratory, Cambridge University Engineering Department

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

Material in this talk is joint work with Steve Young:
- Jason D. Williams, Pascal Poupart, and Steve Young (2005). Factored Partially Observable Markov Decision Processes for Dialogue Management. 4th Workshop on Knowledge and Reasoning in Practical Dialog Systems, International Joint Conference on Artificial Intelligence (IJCAI), August 2005, Edinburgh.
- Jason D. Williams, Pascal Poupart, and Steve Young (2005). Partially Observable Markov Decision Processes with Continuous Observations for Dialogue Management. In Proceedings of the 6th SigDial Workshop on Discourse and Dialogue, September 2005, Lisbon.

Idealized Human-Computer Dialog
[Figure: example flight-booking dialogue ("Boston to London"), showing the user's actual utterances alongside the recognized slot values (e.g., "From Boston" recognized as from=austin) and the evolving slot states (from=null/stated, to=stated).]
- Tracking the current state = dialogue modelling
- System action selection = dialogue management
- The dialogue state is unobserved: the user's goal, the user's (real) action, and the conversation state
- It must be inferred via observations, which may contain errors

Decompose state variable into 3 models
- The state is factored into the user's goal S_u, the user's action A_u, and the dialogue state S_d, each with its own model
[Figure: two-slice dynamic Bayesian network (times t and t+1) linking S_u, A_u, S_d, the system action A_s, the observation O, the reward r, and the belief state B.]
- The designer specifies r; the mapping A_s = Π(b(S)) seeks to maximize cumulative reward
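To make the factoring concrete, below is a minimal sketch of belief monitoring over the factored state (s_u, s_d), marginalising over the unobserved user action a_u. The array shapes, the model names, and the independence assumptions are illustrative stand-ins for the user goal, user action, conversation, and observation models named on this slide, not the exact formulation used in the talk.

```python
import numpy as np

def belief_update(b, a_s, o,
                  p_goal,      # p_goal[s_u', s_u]            user goal model
                  p_user_act,  # p_user_act[a_u, s_u', a_s]   user action model
                  p_dlg,       # p_dlg[s_d', a_u, s_d, a_s]   conversation model
                  p_obs):      # p_obs[o, a_u]                observation model
    """One step b(s_u, s_d) -> b'(s_u', s_d') after system action a_s and
    observation o, summing out the (unobserved) user action a_u."""
    n_su, n_sd = b.shape
    n_au = p_obs.shape[1]
    b_new = np.zeros_like(b)
    for su2 in range(n_su):
        for sd2 in range(n_sd):
            total = 0.0
            for au in range(n_au):
                for su in range(n_su):
                    for sd in range(n_sd):
                        total += (p_obs[o, au]
                                  * p_user_act[au, su2, a_s]
                                  * p_dlg[sd2, au, sd, a_s]
                                  * p_goal[su2, su]
                                  * b[su, sd])
            b_new[su2, sd2] = total
    return b_new / b_new.sum()   # normalise

# Tiny usage example with random (but properly normalised) models.
rng = np.random.default_rng(0)
n_su, n_sd, n_au, n_as, n_o = 3, 3, 4, 2, 4
p_goal = rng.dirichlet(np.ones(n_su), size=n_su).T
p_user_act = rng.dirichlet(np.ones(n_au), size=(n_su, n_as)).transpose(2, 0, 1)
p_dlg = rng.dirichlet(np.ones(n_sd), size=(n_au, n_sd, n_as)).transpose(3, 0, 1, 2)
p_obs = rng.dirichlet(np.ones(n_o), size=n_au).T
b0 = np.full((n_su, n_sd), 1.0 / (n_su * n_sd))
print(belief_update(b0, a_s=1, o=2, p_goal=p_goal,
                    p_user_act=p_user_act, p_dlg=p_dlg, p_obs=p_obs))
```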

A policy as a partitioning
- A policy is a mapping: situation → action
- One representation is a partitioning of belief space
- 2-dimensional example: a simple state space with 2 user goals, A and B
- Belief space can be written as b(A) (because b(B) = 1 - b(A))
- The policy shows the action to take for each point in belief space: submit(B) at b(A)=0.0, then confirm(B), ask-A-or-B around b(A)=0.5, confirm(A), and submit(A) as b(A) approaches 1.0
- A_s = Π(b(S)); this partitioning is produced by a POMDP optimization method
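As a toy illustration of how such a partitioning arises, the sketch below represents a value function as a handful of alpha-vectors over the two-goal belief space and reads off the maximizing action at a few belief points. The vectors and their values are made-up numbers for this 2-goal example, not the output of an actual solver.

```python
import numpy as np

# Each alpha-vector gives the value of an action as a function of [b(A), b(B)].
alpha_vectors = [
    (np.array([10.0, -20.0]), "submit(A)"),
    (np.array([ 4.0,   1.0]), "confirm(A)"),
    (np.array([ 2.8,   2.8]), "ask-A-or-B"),
    (np.array([ 1.0,   4.0]), "confirm(B)"),
    (np.array([-20.0, 10.0]), "submit(B)"),
]

def policy(b_a):
    """Map a point in belief space (just b(A), since b(B) = 1 - b(A)) to the
    action whose alpha-vector gives the highest expected value."""
    b = np.array([b_a, 1.0 - b_a])
    values = [alpha @ b for alpha, _ in alpha_vectors]
    return alpha_vectors[int(np.argmax(values))][1]

for b_a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(b_a, policy(b_a))
# -> submit(B), confirm(B), ask-A-or-B, confirm(A), submit(A)
```

The regions of belief space where one vector dominates the others are exactly the partition cells pictured on the slide.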

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

POMDPs scale poorly
Belief space, actions, and observations cover N slots and M slot values:

  i    S_u      A_u                  A_s                            O
  1    london   london, to(london)   conf(london), submit(london)   london, to(london)
  2    leeds    leeds, to(leeds)     conf(leeds), submit(leeds)     leeds, to(leeds)
  ...  ...      ..., to(...)         conf(...), submit(...)         ..., to(...)
  M    dover    dover, to(dover)     conf(dover), submit(dover)     dover, to(dover)

- For M=1000 and N=2, there are ~10^6 states, actions, and observations
- The growth factor is O(M^N)
- Belief monitoring can be performed with factoring, but how to optimize?
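A quick back-of-the-envelope check of the growth claim, assuming the flat enumeration described on this slide:

```python
# With M slot values and N slots, the flat user-goal space alone has M**N
# entries (user actions and observations grow the same way).
M, N = 1000, 2
print(f"user goals for N=2: {M**N:,}")   # 1,000,000  (~10^6)
print(f"user goals for N=3: {M**3:,}")   # 1,000,000,000
```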

Intuition of the Summary POMDP
[Figure: the master belief over user goals b(s_u) (london, leeds, bath, ...) is compressed into a summary belief b(s̃_u) over just {best, rest}; the belief over user actions b(a_u) (to london, london, to leeds, ...) is compressed similarly; the dialogue-state belief b(s_d) over {not-stated, stated, confirmed} carries over unchanged as b(s̃_d).]
- Master space: |S_u| = 1000, |A_u| = 2000, |S_d| = 3
- Summary space: |S̃_u| = 2, |S̃_d| = 3
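A minimal sketch of this belief compression, with an illustrative three-city master belief: the user-goal belief collapses to {best, rest}, while the slot-status belief carries over unchanged.

```python
def to_summary(b_su, b_sd):
    """Map the master belief (b_su over slot values, b_sd over slot status)
    to the summary belief over {best, rest} plus the unchanged b_sd."""
    best_value = max(b_su, key=b_su.get)           # most likely user goal
    b_best = b_su[best_value]
    summary = {
        "b_su~": {"best": b_best, "rest": 1.0 - b_best},
        "b_sd~": dict(b_sd),                        # carried over unchanged
    }
    return best_value, summary

b_su = {"london": 0.70, "leeds": 0.20, "bath": 0.10}
b_sd = {"not-stated": 0.1, "stated": 0.8, "confirmed": 0.1}
print(to_summary(b_su, b_sd))
# ('london', {'b_su~': {'best': 0.7, 'rest': 0.3}, 'b_sd~': {...}})
```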

Actions in the summary POMDP
- ask → ask (the ask action maps to itself)
- confirm(london), confirm(leeds), ..., confirm(dover) (M actions) → a single summary action confirm(x), where x = argmax b(s_u)
- submit(london), submit(leeds), ..., submit(dover) (M actions) → a single summary action submit(x), where x = argmax b(s_u)
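A minimal sketch of the reverse mapping from summary actions back to master actions; the action names and belief values here are illustrative.

```python
def to_master_action(summary_action, b_su):
    """Fill in the argument of a summary confirm/submit action with the most
    likely user goal, x = argmax b(s_u); ask maps to itself."""
    x = max(b_su, key=b_su.get)
    if summary_action in ("confirm", "submit"):
        return f"{summary_action}({x})"
    return summary_action

b_su = {"london": 0.7, "leeds": 0.2, "dover": 0.1}
print(to_master_action("confirm", b_su))   # confirm(london)
print(to_master_action("submit", b_su))    # submit(london)
print(to_master_action("ask", b_su))       # ask
```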

Observations in the summary POMDP
Observations in the summary POMDP give a compact indication of how b(s̃) is changing based on s and o. The summary observation has 2 parts:
- õ_1: does argmax b'(s_u) (after incorporating o) equal argmax b(s_u)? → {same, different}
- õ_2: is argmax b'(s_u) consistent with o? → {consistent, inconsistent, no-info}
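A minimal sketch of constructing the two-part summary observation; the consistency test used below (checking whether the top hypothesis appears in the observation string) is a stand-in, not necessarily the test used in the real system.

```python
def summary_observation(b_before, b_after, observation):
    """o~1: did the top user-goal hypothesis change after the update?
       o~2: is the new top hypothesis consistent with the observation?"""
    top_before = max(b_before, key=b_before.get)
    top_after = max(b_after, key=b_after.get)
    o1 = "same" if top_after == top_before else "different"

    if observation is None or observation == "null":
        o2 = "no-info"
    elif top_after in observation:        # e.g. "to(london)" mentions "london"
        o2 = "consistent"
    else:
        o2 = "inconsistent"
    return o1, o2

b_before = {"london": 0.4, "leeds": 0.6}
b_after = {"london": 0.8, "leeds": 0.2}
print(summary_observation(b_before, b_after, "to(london)"))
# ('different', 'consistent')
```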

Summary POMDP method
1. Given the master models p(s'|s, a), p(o'|s', a), and r(s, a), sample to estimate the summary models p(s̃'|s̃, ã), p(õ'|s̃', ã), and r(s̃, ã)
2. Optimize the summary policy π(b(s̃))
3. At run time, maintain the master belief b(s); form the summary belief with b(s̃_d) = b(s_d) and b(s̃_u = best) = max b(s_u); choose ã = π(b(s̃)) and map it to the master action a = (ã, argmax b(s_u))
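Putting the pieces together, here is a run-time sketch of step 3 (steps 1 and 2, sampling the summary dynamics and optimizing the summary policy, are assumed to have been done offline). The summary policy below is a hand-written stand-in rather than a solver output, and the dialogue-state part of the belief is simplified to a single status value.

```python
def summary_belief(b_su, sd_status):
    """b(s~): compress b(s_u) to max b(s_u); carry the slot status over."""
    return {"p_best": max(b_su.values()), "sd": sd_status}

def summary_policy(b_tilde):
    """Hand-written stand-in for the optimised summary policy pi(b(s~))."""
    if b_tilde["sd"] == "not-stated":
        return "ask"
    return "submit" if b_tilde["p_best"] > 0.9 else "confirm"

def choose_master_action(b_su, sd_status):
    """Pick a summary action, then map it back to a master action."""
    a_tilde = summary_policy(summary_belief(b_su, sd_status))
    if a_tilde == "ask":
        return "ask"
    x = max(b_su, key=b_su.get)            # x = argmax b(s_u)
    return f"{a_tilde}({x})"

print(choose_master_action({"london": 0.95, "leeds": 0.05}, "stated"))      # submit(london)
print(choose_master_action({"london": 0.60, "leeds": 0.40}, "stated"))      # confirm(london)
print(choose_master_action({"london": 0.50, "leeds": 0.50}, "not-stated"))  # ask
```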

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

1000-place dialog problem
- The user is travelling from x to y in a world with 1000 cities
- s_u: the user's desired from and to cities, e.g., (a, b)
- a_u, o: yes, no, null, a, from(a), to(a), {from(a), to(b)}, etc.
- s_d: each slot can be not-stated, unconfirmed, or confirmed
- a_s: ask-from, ask-to, conf-from(a), conf-to(b), submit(a, b)
- User goal model: the user has a fixed goal throughout the dialogue
- User action model: estimated from real dialogue data
- Conversation model: deterministically promotes slots
- Speech recognition model: substitutions & deletions with probability p_err
- Reward function: +50 for a correct submit, -50 for a wrong submit, -5 for failure, -3 for confirming something which has not been stated, etc., -1 per turn otherwise
- Solved with Perseus (PBVI)
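A minimal sketch of the reward function as described on this slide; the action representation, the parsing, and the test for "confirming something not stated" are illustrative assumptions, and the -5 failure case is omitted.

```python
def reward(action, user_goal, slot_status):
    """user_goal is (from_city, to_city); slot_status maps 'from'/'to' to
    'not-stated' / 'unconfirmed' / 'confirmed'."""
    if action[0] == "submit":                       # e.g. ("submit", "cambridge", "london")
        return 50 if (action[1], action[2]) == user_goal else -50
    if action[0] in ("conf-from", "conf-to"):
        slot = "from" if action[0] == "conf-from" else "to"
        if slot_status[slot] == "not-stated":
            return -3                               # confirming an un-stated slot
    return -1                                       # per-turn cost otherwise

goal = ("cambridge", "london")
print(reward(("submit", "cambridge", "london"), goal,
             {"from": "confirmed", "to": "confirmed"}))               # 50
print(reward(("conf-to", "leeds"), goal,
             {"from": "not-stated", "to": "not-stated"}))             # -3
print(reward(("ask-from",), goal,
             {"from": "not-stated", "to": "not-stated"}))             # -1
```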

Simulated speech recognition channel
- 2 subjects interact through a simulated speech recognition channel
[Figure: pipeline from the subject through an end-pointer (typical end-pointing model), a typist, and an ASR-noise generator to the wizard.]
- The subject talks into a microphone
- The wizard's screen shows the corrupted input
- Utterances are transcribed by a typist
- Confusions are generated using a lexicon, a phonetic confusion matrix, and a language model

Matthew Stuttle, Jason D. Williams, and Steve Young. A Framework for Wizard-of-Oz Experiments with a Simulated ASR-Channel. ICSLP, 2004, South Korea.

SACTI-1 dialog data corpus
- Tourist information domain
- The wizard has more information than the user; the user is given a specific task
[Figure: map of the fictitious town used in the tasks, showing streets and squares and labelled locations including the fountain, castle, tower, museum, park, cinema, shopping area, post office, and tourist information office.]
- 36 users / 12 wizards
- 144 dialogs, 4071 turns
- 4 different WER targets
- Measures: task completion and Likert-scale questions
- Conversational characteristics broadly similar to those observed in human/machine dialog

Jason D. Williams and Steve Young. Characterizing Task-Oriented Dialog using a Simulated ASR Channel. ICSLP, 2004, South Korea.

User model example
[Figure: estimated distribution P(A_u | A_s, S_u = to(london)) — the probability mass of the user's response (the to/WH portion only: "London", "To London", null, yes, no) for each system action A_s: "Where are you going to?", "To London, is that right?", and "Where are you going from?".]
- Data was divided, and two model sets were created: a training model set and a testing model set
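A minimal sketch of how such a user action model can be estimated by counting over annotated turns; the tiny corpus and the action names below are made up for illustration, whereas the real model was estimated from the dialogue data and split into training and testing model sets.

```python
from collections import Counter, defaultdict

# (system action, user goal, observed user action) triples -- illustrative only.
turns = [
    ("where-to?",          "to(london)", "london"),
    ("where-to?",          "to(london)", "to-london"),
    ("where-to?",          "to(london)", "null"),
    ("confirm-to(london)", "to(london)", "yes"),
    ("confirm-to(leeds)",  "to(london)", "no"),
]

counts = defaultdict(Counter)
for a_s, s_u, a_u in turns:
    counts[(a_s, s_u)][a_u] += 1

def p_user_action(a_u, a_s, s_u):
    """Maximum-likelihood estimate of P(a_u | a_s, s_u) from the counts."""
    c = counts[(a_s, s_u)]
    total = sum(c.values())
    return c[a_u] / total if total else 0.0

print(p_user_action("london", "where-to?", "to(london)"))   # 1/3
```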

[Figure: turn-by-turn belief evolution for an example dialogue — columns show the master beliefs b(s_from) and b(s_to) over cities (Cambridge, Sheffield, Leeds, Oxford, ...) and the corresponding summary beliefs b(s̃_from) and b(s̃_to), prior to the start of the dialog and after each turn below; recognized hypotheses are shown in brackets.]
- Where to? → "London" [recognized: "Leeds"]
- Where from? → "From Cambridge" [recognized: "To Oxford"]
- Where from? → "From Cambridge" [recognized: "From Cambridge"]
- To Leeds, right? → "No, to London" [recognized: "No, from Sheffield"]
- Where to? → "London from Cambridge" [recognized: "London from Cambridge"]

Performance vs. p_err
[Plot: reward gained per turn (y-axis, -1 to 8) against the error probability p_err (x-axis, 0.05 to 0.65), for the policy using the user model built from the training set.]

Performance vs. p_err
[Plot: same axes as above, now comparing the user model built from the training set with the user model built from the test set.]

Scalability vs. direct optimization
[Plot: average/expected return (y-axis, -10 to 8) against M, the number of distinct slot values (x-axis, log scale from 3 to 5000), comparing the summary POMDP with the directly optimized baseline.]

Outline
- Dialogue management as a POMDP
- Scaling up with the summary POMDP method
- Example problem & results
- Conclusions & future work

Conclusions & future work
For slot-based dialog management, POMDPs:
- Outperform an MDP baseline
- Outperform 3 handcrafted baselines
- Can be scaled as a summary POMDP to many slot values
Interesting future avenues:
- Scale the number of slots to the 10s
- Move beyond slot-based approaches
- Apply to a real system

Thanks!
Jason D. Williams (jdw30@cam.ac.uk)
Joint work with Steve Young (sjy@eng.cam.ac.uk)