
Dialogue management: Parametric approaches to policy optimisation
Milica Gašić
Dialogue Systems Group, Cambridge University Engineering Department

Outline
- Dialogue optimisation as a reinforcement learning task
- Dialogue management as a continuous space Markov decision process
- Summary space
- Simulated user
- RL algorithms for dialogue management: Natural Actor Critic

Elements of dialogue management (over dialogue turns)
- What the system says: actions $a_1, a_2, \ldots, a_{t-1}$
- What the user wants: states $s_1, s_2, \ldots, s_t$
- What the system hears: observations $o_1, o_2, \ldots, o_t$

Dialogue as a control problem
- Input: the distribution over possible states (the belief state), i.e. the output of the belief tracker
- Control: actions that the system takes, i.e. what the system says to the user
- Feedback signal: an estimate of dialogue quality
- Aim: automatically optimise system actions (the dialogue policy)

Belief tracking vs. policy optimisation: belief tracking is concerned with the past of the dialogue, from the initial turn up to the current turn, while the policy is concerned with the future, from the current turn to the final turn.

Dialogue as a partially observable Markov decision process
- Data: noisy observations; reward (a measure of dialogue quality)
- Model: partially observable Markov decision process
- Predictions: optimal system actions in a noisy environment

Theory: Partially observable Markov decision process
- $s_t$: dialogue states
- $o_t$: noisy observations
- $a_t$: system actions
- $r_t$: rewards
- $p(s_{t+1} \mid s_t, a_t)$: transition probability
- $p(o_{t+1} \mid s_{t+1})$: observation probability
- $b(s_t)$: distribution over possible states (the belief state)

Decision making in POMDPs
- Policy: $\pi : B \to A$
- Return: $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$
- Value function (how good is it for the system to be in a particular state?):
  $V^\pi(s) = E_\pi\left[\sum_{k=0}^{T-t} \gamma^k r_{t+k} \mid s_t = s\right] = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{o'} p(o' \mid s') V^\pi(s')$
- Value of a belief state: $V^\pi(b) = \sum_s V^\pi(s) b(s)$
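
A minimal sketch (illustrative values, not from the slides) of the last relation: the value of a belief state is the expectation of the underlying state values under the belief.

```python
import numpy as np

# V^pi(b) = sum_s V^pi(s) b(s): the belief-state value is the
# belief-weighted average of the hidden-state values.
V_s = np.array([0.0, 5.0, 10.0])   # hypothetical state values V^pi(s)
b = np.array([0.2, 0.5, 0.3])      # belief state: distribution over states

V_b = float(b @ V_s)               # sum_s b(s) V^pi(s)
print(V_b)                         # 5.5
```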

Optimising POMDP policy
- Goal: find the value function associated with the optimal policy, i.e. the one that generates the maximal return
- Tractable only for very simple cases [Kaelbling et al., 1998]
- Alternative view: a discrete space POMDP can be viewed as a continuous space MDP whose states are belief states, $b_t = b(s_t)$

Theory: Markov decision process
- $b_t$: belief states (from the tracker)
- $a_t$: system actions
- $r_t$: rewards
- $p(b_{t+1} \mid b_t, a_t)$: transition probability

Dialogue management as a continuous space Markov decision process
- Data: belief states (from the belief tracker); reward (a measure of dialogue quality)
- Model: Markov decision process and reinforcement learning
- Predictions: optimal system actions

Problems
- Size of the optimisation problem: the belief state is large and continuous, and the set of system actions is also large
- Knowledge of the environment (in this case the user): we do not have the transition probabilities
- Where do rewards come from?

Problem: large belief state and action space
Solution: perform optimisation in a reduced space, the summary space, built according to heuristics. A summary function maps the belief space (master space) into the summary space, where a summary policy is learned over summary actions; a master function maps the chosen summary action back to a master action.

Problem: Where do the transition probability and the reward come from?
Solution: learn from real users. In the full pipeline, speech recognition maps the waveform to a distribution over text hypotheses, semantic decoding maps these to a distribution over dialogue acts, belief tracking feeds policy optimisation, and the system responds via natural language generation and speech synthesis, all supported by the ontology.

Problem: Where do the transition probability and the reward come from?
Solution: learn from a simulated user. The simulated user produces a distribution over user dialogue acts, which is passed through belief tracking to policy optimisation; the resulting system dialogue act is fed back to the simulated user, with the ontology shared by both sides.

Elements of the simulated user
- User model: generates the user's response to the system dialogue act
- Error model: corrupts the user act into a distribution over user dialogue acts, which is passed to belief tracking
- Reward model: provides the reward used for policy optimisation
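
The sketch below shows how these components could interact in one training dialogue. It is illustrative only: `user_model`, `error_model`, `belief_tracker`, `policy` and `reward_model` are hypothetical interfaces, not the lecture's code.

```python
# Hypothetical training loop with a simulated user: the user model produces a
# dialogue act, the error model corrupts it into a distribution over acts,
# belief tracking updates the belief state, the policy picks a system act,
# and the reward model scores the dialogue.
def run_dialogue(user_model, error_model, belief_tracker, policy, reward_model):
    belief = belief_tracker.initial_belief()
    system_act = "hello()"                              # opening system dialogue act
    total_reward, done = 0.0, False
    while not done:
        user_act = user_model.respond(system_act)       # true user act
        act_distribution = error_model.corrupt(user_act)  # simulated channel noise
        belief = belief_tracker.update(belief, system_act, act_distribution)
        system_act = policy.select_action(belief)
        reward, done = reward_model.score(user_model, system_act)
        total_reward += reward
    return total_reward
```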

Evaluation metrics / optimisation criteria

Candidate            Issue
User satisfaction    How to measure in a real scenario?
Task completion      Task is hidden.
Dialogue length      Hang up on the user?
Channel accuracy     Endless confirmations.
Repeat usage         Does not always make sense.
Financial benefit    Maybe in industry, but not in research.

Williams, Spoken dialogue systems: challenges and opportunities for research, ASRU (invited talk), 2009
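
A reward commonly used in the dialogue-systems literature (an assumption here, not stated on these slides) combines task completion and dialogue length: a large terminal bonus for success minus a small penalty per turn.

```python
# Illustrative reward: bonus for task success minus a cost per dialogue turn.
def dialogue_reward(num_turns: int, task_success: bool,
                    success_bonus: float = 20.0, turn_penalty: float = 1.0) -> float:
    """Return the total dialogue reward."""
    return (success_bonus if task_success else 0.0) - turn_penalty * num_turns

print(dialogue_reward(num_turns=8, task_success=True))    # 12.0
print(dialogue_reward(num_turns=8, task_success=False))   # -8.0
```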

Theory: Reinforcement learning
- Policy: deterministic $\pi : B \to A$ or stochastic $\pi : B \times A \to [0, 1]$
- Return: $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$
- Value function (how good is it for the system to be in a particular belief state?): $V^\pi(b) = E_\pi\left[\sum_{k=0}^{T-t} \gamma^k r_{t+k} \mid b_t = b\right]$
- Q-function (what is the value of taking action $a$ in belief state $b$ under a policy $\pi$?): $Q^\pi(b, a) = E_\pi\left[\sum_{k=0}^{T-t} \gamma^k r_{t+k} \mid b_t = b, a_t = a\right]$

Theory: Reinforcement learning
- Occupancy frequency: $d^\pi(b) = \sum_t \gamma^t \Pr(b_t = b \mid b_0, \pi)$
- Advantage function: $A^\pi(b, a) = Q^\pi(b, a) - V^\pi(b)$

Theory: Reinforcement learning [Sutton and Barto, 1998]
For discrete state spaces, standard RL approaches can be used to estimate the optimal value function, Q-function or policy $\pi$ (a minimal contrast of the update rules follows below):
- Dynamic programming: model-based learning; updates of the estimates are based on previous estimates
- Monte-Carlo methods: model-free learning; updates of the estimates are based on raw experience
- Temporal-difference methods: model-free learning; updates of the estimates are based on previous estimates
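
An illustrative sketch (not from the slides) contrasting the two model-free updates: Monte-Carlo uses the raw observed return, TD(0) bootstraps from the previous value estimate of the next state.

```python
# Tabular value updates for a discrete state space.
def monte_carlo_update(V, s, observed_return, alpha=0.1):
    """Move V(s) towards the full return observed from state s onwards."""
    V[s] += alpha * (observed_return - V[s])

def td0_update(V, s, r, s_next, gamma=0.99, alpha=0.1):
    """Move V(s) towards the bootstrapped target r + gamma * V(s_next)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {"s0": 0.0, "s1": 0.0}
monte_carlo_update(V, "s0", observed_return=10.0)
td0_update(V, "s0", r=1.0, s_next="s1")
print(V)
```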

Reinforcement learning for dialogue management: options
1. Discretise the belief state/summary space into a grid and apply standard RL algorithms to estimate the value function, Q-function or policy $\pi$ (for example Monte-Carlo control, as in the practical)
2. Apply parametric function approximation to the value function, Q-function or policy $\pi$ and find the optimal parameters using gradient methods (this lecture)
3. Apply non-parametric function approximation to the value function, Q-function or policy $\pi$ (next lecture)

Linear function approximation
Define the summary space as features of the belief space ($\phi$ or $\phi_a$) and parameterise either:
- Value function: $V(b, \theta) \approx \theta^T \phi(b)$
- Q-function: $Q(b, a, \theta) \approx \theta^T \phi(b, a)$
- Policy: $\pi(a \mid b, \omega) = \dfrac{e^{\omega^T \psi(b,a)}}{\sum_{a'} e^{\omega^T \psi(b,a')}}$
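
A minimal sketch of the softmax policy with linear features; the action set and the feature map `psi` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# pi(a | b, omega) proportional to exp(omega^T psi(b, a)).
def softmax_policy(omega, psi, belief, actions):
    """Return a dict mapping each action to its probability under pi(. | b, omega)."""
    scores = np.array([omega @ psi(belief, a) for a in actions])
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(actions, probs))

# Hypothetical action-dependent feature map psi(b, a): belief features + one-hot action.
actions = ["confirm", "request", "inform"]
def psi(belief, action):
    one_hot = np.eye(len(actions))[actions.index(action)]
    return np.concatenate([belief, one_hot])

belief = np.array([0.7, 0.2, 0.1])               # toy belief-state features
omega = np.zeros(len(belief) + len(actions))     # policy parameters
print(softmax_policy(omega, psi, belief, actions))   # uniform with zero weights
```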

Policy gradient
- Find policy parameters that maximise the return: $J(\omega) = E_{\pi(\omega)}[R_0]$
- Update the policy parameters in the direction of the gradient: $\omega \leftarrow \omega + \alpha \nabla J(\omega)$
- The vanilla gradient is given by the policy gradient theorem [Sutton et al., 2000]:
  $\nabla J(\omega) = \int_B d^\pi(b) \sum_a Q^\pi(b, a)\, \pi(b, a, \omega)\, \nabla \log \pi(b, a, \omega)\, db$  (1)
  $= E_{\pi(\omega)}\left[\gamma^t Q^\pi(b_t, a)\, \nabla_\omega \log \pi(b_t, a, \omega)\right]$  (2)
  $= E_{\pi(\omega)}\left[\gamma^t A^\pi(b_t, a)\, \nabla_\omega \log \pi(b_t, a, \omega)\right]$  (3)
- This is not always stable: (large) changes in the parameters can result in unexpected policy moves, and convergence can be very slow.
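
A minimal REINFORCE-style sketch (illustrative, assuming a softmax policy with linear features): Eq. (2) is estimated from a sampled dialogue as $\sum_t \gamma^t R_t \nabla_\omega \log \pi(a_t \mid b_t, \omega)$, with the sampled return standing in for $Q^\pi$.

```python
import numpy as np

def grad_log_softmax(omega, features, action_idx):
    """grad_omega log pi(a | b, omega) for pi proportional to exp(omega^T psi(b, a)).
    `features` is a matrix whose rows are psi(b, a) for each action a."""
    scores = features @ omega
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return features[action_idx] - probs @ features    # psi(b,a) - E_pi[psi(b,.)]

def reinforce_gradient(episode, omega, gamma=0.99):
    """episode: list of (features, action_idx, reward) tuples for one dialogue."""
    rewards = [r for _, _, r in episode]
    grad = np.zeros_like(omega)
    for t, (features, a_idx, _) in enumerate(episode):
        # Return from time t onwards replaces Q^pi(b_t, a_t).
        R_t = sum(gamma ** k * rewards[t + k] for k in range(len(rewards) - t))
        grad += gamma ** t * R_t * grad_log_softmax(omega, features, a_idx)
    return grad

# Toy usage: two turns, two actions, three-dimensional features psi(b, a).
omega = np.zeros(3)
episode = [(np.array([[0.7, 0.2, 1.0], [0.2, 0.7, 1.0]]), 0, -1.0),
           (np.array([[0.6, 0.3, 1.0], [0.3, 0.6, 1.0]]), 1, 20.0)]
omega += 0.01 * reinforce_gradient(episode, omega)    # one gradient ascent step
print(omega)
```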

Natural Actor Critic [Peters and Schaal, 2008, Thomson, 2009]
Actor-critic methods are temporal-difference methods that estimate:
- the actor: the policy that takes actions, parametrised with $\omega$
- the critic: the advantage function that criticises/evaluates the actor's actions, parametrised with $\theta$
In the Natural Actor Critic, a modified form of the gradient, the natural gradient, is used to find the optimal parameters and speed up convergence.

Natural Policy Gradient
Compatible function approximation: the advantage function is parametrised with parameters $\theta$ such that the direction of change is the same as for the policy parameters $\omega$:
$\gamma^t \nabla_\theta A(b, a, \theta) = \nabla_\omega \log \pi(b, a, \omega)$, i.e. $\gamma^t A(b, a, \theta) = \nabla_\omega \log \pi(b, a, \omega)^T \theta$
Then, by replacing in Eq. (3), it can be shown that
$\theta = G_\omega^{-1} \nabla_\omega J(\omega)$
where $G_\omega$ is the Fisher information matrix
$G_\omega = E_{\pi(\omega)}\left[\nabla \log \pi(b, a, \omega)\, \nabla \log \pi(b, a, \omega)^T\right]$
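
A minimal sketch (illustrative, not the lecture's code) of a natural gradient step: estimate the Fisher information matrix from sampled grad-log-policy vectors and precondition the vanilla gradient with its inverse. The damping term is an assumption added for numerical stability.

```python
import numpy as np

def natural_gradient(grad_log_pis, vanilla_grad, damping=1e-3):
    """grad_log_pis: array of shape (num_samples, dim) containing
    grad log pi(a|b, omega) for sampled (b, a) pairs;
    vanilla_grad: the estimated policy gradient grad J(omega)."""
    dim = vanilla_grad.shape[0]
    fisher = grad_log_pis.T @ grad_log_pis / grad_log_pis.shape[0]
    fisher += damping * np.eye(dim)                # regularise for invertibility
    return np.linalg.solve(fisher, vanilla_grad)   # G_omega^{-1} grad J(omega)

# Toy usage with random samples standing in for grad log pi.
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 4))
vanilla = np.array([0.5, -0.2, 0.1, 0.0])
print(natural_gradient(samples, vanilla))
```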

Episodic Natural Actor Critic (Algorithm 1)
Input: parametrisation of $\pi(\omega)$; parametrisation of $\gamma^t A(\theta) = \theta^T \phi$; step size $\alpha > 0$
For each batch of dialogues:
  For each dialogue $k$:
    Execute the dialogue according to the current policy $\pi(\omega)$
    Obtain the sequence of belief states $b_t^k$, actions $a_t^k$ and the return $R^k$
  Critic evaluation: choose $\theta$ and $J$ to minimise $\sum_k \left(\sum_t \gamma^t A(b_t^k, a_t^k, \theta) + J - R^k\right)^2$
  Actor update: $\omega \leftarrow \omega + \alpha \theta$
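
A minimal sketch (illustrative, not the lecture's code) of one batch of Algorithm 1. Under the compatible parametrisation on the previous slide, $\sum_t \gamma^t A(b_t, a_t, \theta) = \left(\sum_t \nabla_\omega \log \pi(b_t, a_t, \omega)\right)^T \theta$, so the critic evaluation is a least-squares fit of $[\theta, J]$ to the observed returns; the actor then steps along $\theta$. The `grad_log_pi` callable is an assumed interface.

```python
import numpy as np

def enac_batch_update(omega, episodes, grad_log_pi, alpha=0.1):
    """episodes: list of (trajectory, R) pairs, where trajectory is a list of
    (belief, action) pairs and R is the dialogue return;
    grad_log_pi(b, a, omega) returns grad_omega log pi(a | b, omega)."""
    rows, returns = [], []
    for trajectory, R in episodes:
        # Per-dialogue compatible features; the discount is absorbed into the
        # parametrisation gamma^t A(b, a, theta) = grad log pi(b, a, omega)^T theta.
        phi = sum(grad_log_pi(b, a, omega) for b, a in trajectory)
        rows.append(np.append(phi, 1.0))           # trailing 1 multiplies the baseline J
        returns.append(R)
    # Critic evaluation: least-squares solution for [theta, J].
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(returns), rcond=None)
    theta = solution[:-1]                          # natural gradient direction
    return omega + alpha * theta                   # actor update
```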

Summary features
- For each concept, the probabilities of its two most likely values, mapped onto a grid
- The number of matching entities in the database (assuming the most likely concept values)
- A parameter is associated with each summary action, concept and concept-level feature
- Parameters can be tied to reduce computational complexity and over-fitting
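
A minimal sketch of extracting such summary features; the data layout (a belief as a per-concept distribution, a database as a list of entities) and the grid size are illustrative assumptions.

```python
import numpy as np

def summary_features(belief, database, grid=0.2):
    """belief: dict concept -> dict value -> probability;
    database: list of dicts mapping concepts to values."""
    features = []
    top_values = {}
    for concept, dist in belief.items():
        probs = sorted(dist.values(), reverse=True)[:2]
        probs += [0.0] * (2 - len(probs))                        # pad if fewer than 2 values
        features.extend(grid * round(p / grid) for p in probs)   # map onto a grid
        top_values[concept] = max(dist, key=dist.get)
    # Number of database entities matching the most likely value of each concept.
    matches = sum(all(entity.get(c) == v for c, v in top_values.items())
                  for entity in database)
    features.append(matches)
    return np.array(features)

belief = {"food": {"chinese": 0.7, "indian": 0.2, "thai": 0.1},
          "area": {"centre": 0.6, "north": 0.4}}
database = [{"food": "chinese", "area": "centre"},
            {"food": "indian", "area": "centre"}]
print(summary_features(belief, database))   # top-2 probabilities per concept + match count
```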

Summary
- Dialogue policy optimisation can be viewed as a reinforcement learning task
- A POMDP can be viewed as a continuous space MDP
- The belief state space can be summarised to reduce computational complexity
- Natural Actor Critic is a temporal-difference algorithm that estimates both the policy (actor) and the advantage function (critic); both are parametrised, and the natural gradient is used to find the direction of steepest ascent

References

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99-134.

Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7):1180-1190.

Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057-1063. MIT Press.

Thomson, B. (2009). Statistical methods for spoken dialogue management. PhD thesis, University of Cambridge.