CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes

Kee-Eung Kim, KAIST EECS Department, Computer Science Division

Markov Decision Processes (MDPs)
A popular model for sequential decision problems, specified by:
- A set of possible world states $s \in S$
- A set of possible actions $a \in A$
- A real-valued reward function $R(s)$
- A transition model $T(s, a, s')$ given by $P(s' \mid s, a)$
Assumes a fully observable environment with a Markovian transition model:
$P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$
Example: Maze Robot
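
To make the tuple above concrete, here is a minimal Python sketch of an MDP container. The class name and field layout are my own, not from the lecture; the reward is stored in the $R(s, a, s')$ form used on the Bellman-equation slides below, of which the slide's $R(s)$ is a special case.

```python
# A minimal sketch of the MDP tuple (class name and field layout are mine, not the lecture's).
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                     # S
    actions: Dict[str, List[str]]                         # A(s): actions available in state s
    reward: Dict[Tuple[str, str, str], float]             # R(s, a, s')
    transition: Dict[Tuple[str, str], Dict[str, float]]   # T(s, a, s') = P(s' | s, a)
    gamma: float = 0.9                                     # discount factor

    def next_state_dist(self, s: str, a: str) -> Dict[str, float]:
        """P(s' | s, a): by the Markov property this depends only on the current s and a."""
        return self.transition[(s, a)]
```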

Finite MDP Example: Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high or low.
Reward = number of cans collected.

Recycling Robot MDP
$S = \{\mathrm{high}, \mathrm{low}\}$
$A(\mathrm{high}) = \{\mathrm{search}, \mathrm{wait}\}$
$A(\mathrm{low}) = \{\mathrm{search}, \mathrm{wait}, \mathrm{recharge}\}$
$R^{\mathrm{search}}$ = expected number of cans while searching
$R^{\mathrm{wait}}$ = expected number of cans while waiting
$R^{\mathrm{search}} > R^{\mathrm{wait}}$
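
As a usage illustration, the recycling-robot MDP can be written out directly in this dictionary style. The slide gives no numeric values, so the probabilities alpha, beta and all reward values below are assumptions made only for illustration; they merely respect $R^{\mathrm{search}} > R^{\mathrm{wait}}$ and the "being rescued is bad" penalty.

```python
# Recycling-robot MDP as plain dictionaries. The probabilities alpha, beta and all reward
# values are illustrative assumptions (the slide gives none); they satisfy R_search > R_wait.
alpha, beta = 0.8, 0.4         # assumed P(battery stays high | search), P(stays low | search)
r_search, r_wait = 2.0, 1.0    # assumed expected number of cans per step
r_rescue = -3.0                # assumed penalty for running flat and being rescued

S = ["high", "low"]
A = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

T = {  # T[(s, a)] = {s': P(s' | s, a)}
    ("high", "search"):   {"high": alpha, "low": 1 - alpha},
    ("high", "wait"):     {"high": 1.0},
    ("low",  "search"):   {"low": beta, "high": 1 - beta},  # with prob 1 - beta: rescued, recharged
    ("low",  "wait"):     {"low": 1.0},
    ("low",  "recharge"): {"high": 1.0},
}
R = {  # R[(s, a, s')] = expected reward for that transition
    ("high", "search", "high"): r_search, ("high", "search", "low"): r_search,
    ("high", "wait",   "high"): r_wait,
    ("low",  "search", "low"):  r_search, ("low",  "search", "high"): r_rescue,
    ("low",  "wait",   "low"):  r_wait,
    ("low",  "recharge", "high"): 0.0,
}
```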

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function of policy $\pi$:
$V^\pi(s) = E_\pi[R_t \mid s_t = s] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$:
$Q^\pi(s, a) = E_\pi[R_t \mid s_t = s, a_t = a] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$

Bellman Equation for $V^\pi$
Basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \cdots) = r_{t+1} + \gamma R_{t+1}$
Bellman equation:
$V^\pi(s) = E_\pi[R_t \mid s_t = s]$
$= E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$
$= E_\pi\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s\right]$
$= \sum_a \pi(s, a) \sum_{s'} T(s, a, s') \left[R(s, a, s') + \gamma\, E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s'\right]\right]$
$= \sum_a \pi(s, a) \sum_{s'} T(s, a, s') \left[R(s, a, s') + \gamma V^\pi(s')\right]$

More on the Bellman Equation
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} T(s, a, s') \left[R(s, a, s') + \gamma V^\pi(s')\right]$
In fact, this is a set of linear equations, one for each state $s$; $V^\pi$ is the unique solution to this system of equations.
Backup diagrams for $V^\pi$ and $Q^\pi$ (diagrams omitted).
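
Since $V^\pi$ is the unique solution of $|S|$ linear equations, it can be computed with one linear solve. Below is a minimal sketch (function and variable names are mine) that builds $P_\pi$ and $r_\pi$ from the dictionary layout used in the sketches above and solves $(I - \gamma P_\pi) v = r_\pi$ with NumPy.

```python
# Minimal sketch: compute V^pi exactly by solving (I - gamma * P_pi) v = r_pi.
# states, actions, T, R follow the dictionary layout of the earlier sketches;
# pi[s][a] is the probability that policy pi takes action a in state s.
import numpy as np

def evaluate_policy(states, actions, T, R, pi, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))   # P_pi[s, s'] = sum_a pi(s, a) T(s, a, s')
    r = np.zeros(n)        # r_pi[s]     = sum_a pi(s, a) sum_s' T(s, a, s') R(s, a, s')
    for s in states:
        for a in actions[s]:
            for s2, p in T[(s, a)].items():
                P[idx[s], idx[s2]] += pi[s][a] * p
                r[idx[s]] += pi[s][a] * p * R[(s, a, s2)]
    v = np.linalg.solve(np.eye(n) - gamma * P, r)
    return dict(zip(states, v))
```

For example, passing the recycling-robot dictionaries above together with pi = {"high": {"search": 0.5, "wait": 0.5}, "low": {"search": 1/3, "wait": 1/3, "recharge": 1/3}} evaluates the equiprobable random policy.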

MDP Example: Gridworld
Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = -1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.
State-value function for the equiprobable random policy; $\gamma = 0.9$.

MDP Example: Golf
State is the ball location.
Reward of -1 for each stroke until the ball is in the hole.
Actions: putt (use putter), driver (use driver).
putt succeeds anywhere on the green.
Value of a state?

Optimal Value Functions
For finite MDPs, policies can be partially ordered: $\pi \geq \pi'$ if and only if $V^\pi(s) \geq V^{\pi'}(s)$ for all $s \in S$.
There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all by $\pi^*$.
Optimal policies share the same optimal state-value function:
$V^*(s) = \max_\pi V^\pi(s)$ for all $s \in S$
Optimal policies also share the same optimal action-value function:
$Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $s \in S$ and $a \in A(s)$
This is the expected return for executing action $a$ in state $s$ and then following an optimal policy.

Optimal Action-Value Function for Golf
We can hit the ball farther with the driver than with the putter, but with less accuracy.
$Q^*(s, \mathrm{driver})$ gives the value of using the driver first, then using whichever actions are best.

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$V^*(s) = \max_{a \in A(s)} Q^*(s, a)$
$= \max_{a \in A(s)} E[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a]$
$= \max_{a \in A(s)} \sum_{s'} T(s, a, s') \left[R(s, a, s') + \gamma V^*(s')\right]$
$V^*$ is the unique solution to this system of nonlinear equations. (Backup diagram omitted.)
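
As a minimal sketch (names mine), a single Bellman optimality backup for one state, using the dictionary layout from the earlier sketches:

```python
# One Bellman optimality backup for a single state (a sketch over the earlier dictionaries):
# V(s) <- max_a sum_s' T(s, a, s') [R(s, a, s') + gamma * V(s')]
def bellman_optimality_backup(s, V, actions, T, R, gamma=0.9):
    return max(
        sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
        for a in actions[s]
    )
```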

Bellman Optimality Equation for Q*
$Q^*(s, a) = E[r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a]$
$= \sum_{s'} T(s, a, s') \left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$
$Q^*$ is the unique solution to this system of nonlinear equations. (Backup diagram omitted.)
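
The corresponding backup for $Q^*$, again as a hedged sketch over the same dictionaries:

```python
# Backup for Q* (same dictionary layout, Q keyed by (s, a)):
# Q(s, a) <- sum_s' T(s, a, s') [R(s, a, s') + gamma * max_a' Q(s', a')]
def bellman_q_backup(s, a, Q, actions, T, R, gamma=0.9):
    return sum(
        p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions[s2]))
        for s2, p in T[(s, a)].items()
    )
```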

Policies: Solution to MDPs
$\pi: S \to A$; $\pi(s)$: recommended action for state $s$.
Following a policy $\pi$: (1) determine the current state $s$; (2) execute action $\pi(s)$; (3) go to step (1).
Evaluating $\pi$: How good is a policy $\pi$ in a state $s$? The total sum of the rewards obtained can be infinite.
Finite horizon vs. infinite horizon; additive rewards vs. discounted rewards:
Additive: $V^\pi([s_0, s_1, s_2, \ldots]) = R(s_0) + R(s_1) + R(s_2) + \cdots$
Discounted: $V^\pi([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$
Optimal policy: a policy $\pi^*$ that maximizes $V^\pi$.
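
The "follow a policy" loop and the discounted return can be sketched as a sampled rollout. This is an illustration only (names mine): it assumes a deterministic policy given as a dict from state to action, uses the $R(s, a, s')$ reward form of the earlier sketches, and truncates the infinite discounted sum at a finite horizon.

```python
# Sketch of following a deterministic policy pi (a dict from state to action) and
# accumulating a discounted return; the infinite sum is truncated at a finite horizon.
import random

def rollout(s0, pi, T, R, gamma=0.9, horizon=100):
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = pi[s]                                    # (2) execute the recommended action
        next_states = list(T[(s, a)])
        probs = list(T[(s, a)].values())
        s2 = random.choices(next_states, weights=probs)[0]
        ret += discount * R[(s, a, s2)]              # add the discounted reward
        discount *= gamma
        s = s2                                       # (1) observe the new current state
    return ret
```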

Optimal Policy from V*
Any policy that is greedy with respect to $V^*$ is an optimal policy.
Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions.
Example: Gridworld.

Optimal Policy from Q*
Given $Q^*$, the agent does not even have to do a one-step-ahead search:
$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$
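
A one-line sketch of this greedy choice, assuming $Q^*$ is stored as a dict keyed by (state, action) pairs:

```python
# pi*(s) = argmax_a Q*(s, a): a pure table lookup, no model needed (sketch; Q keyed by (s, a)).
def greedy_policy(s, Q, actions):
    return max(actions[s], key=lambda a: Q[(s, a)])
```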

Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman optimality equation requires:
- accurate knowledge of the environment dynamics;
- enough space and time to do the computation;
- the Markov property.
How much space and time do we need? Polynomial in the number of states (via dynamic programming methods). BUT the number of states can be huge (e.g., backgammon has about $10^{20}$ states).
We usually have to settle for approximations. Many RL methods (next week) can be understood as approximately solving the Bellman optimality equation.
But for this lecture, we assume we can solve it without approximation.

Value Iteration Algorithm
Suppose that we know the utility $V^{\pi^*}$ of the optimal policy. It should satisfy the Bellman equation:
$V^{\pi^*}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s') V^{\pi^*}(s')$
Then the optimal policy can be derived by
$\pi^*(s) = \arg\max_a \left[R(s) + \gamma \sum_{s'} T(s, a, s') V^{\pi^*}(s')\right]$
Iterative algorithm for solving the Bellman equation:
$V_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s, a, s') V_i(s')$   [Bellman update]
As $i \to \infty$, $V_{i+1}$ converges to the utility of the optimal policy.
Proof sketch: (1) prove that it converges to a vector; (2) prove that that vector satisfies the Bellman equation.
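
A minimal sketch of value iteration over the dictionary-based MDP used in the earlier sketches. It uses the $R(s, a, s')$ reward form rather than the slide's state-only $R(s)$, and it stops when the largest update falls below a tolerance (the tolerance is my addition; the slide only states convergence as $i \to \infty$).

```python
# Sketch of value iteration over the dictionary-based MDP of the earlier sketches.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        V_new = {                                      # one synchronous Bellman update sweep
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
                for a in actions[s]
            )
            for s in states
        }
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:                                # assumed stopping rule (slide: i -> infinity)
            break
    pi = {                                             # greedy policy from the converged values
        s: max(actions[s],
               key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                 for s2, p in T[(s, a)].items()))
        for s in states
    }
    return V, pi
```

Calling value_iteration(S, A, T, R) on the recycling-robot dictionaries above returns both the converged values and a greedy optimal policy.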

Value Iteration Example?

Policy Iteration Algorithm
From value iteration, let $\pi_{i+1}(s)$ be the policy from the $i$-th Bellman update:
$\pi_{i+1}(s) \leftarrow \arg\max_a \left[R(s) + \gamma \sum_{s'} T(s, a, s') V_i(s')\right]$
$\pi_{i+1}$ can be seen as an improvement to the current guess of the optimal policy, derived from an inaccurate guess $V_i$ of the value of following policy $\pi_i$.
Idea: the improvement will be more accurate if $V_i$ is more accurate!
Policy Iteration:
- Policy evaluation: given a policy $\pi_i$, calculate $V_i = V^{\pi_i}$ (solving a set of linear equations).
- Policy improvement: calculate a new MEU (maximum expected utility) policy $\pi_{i+1}$ using one-step look-ahead based on $V_i$ (see above).
If policy evaluation can be done very quickly, policy iteration is typically faster than value iteration.
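
A minimal sketch of policy iteration under the same assumptions: exact policy evaluation via a linear solve (as in the earlier evaluation sketch, but for a deterministic policy), alternated with greedy one-step-lookahead improvement, stopping when the policy no longer changes.

```python
# Sketch of policy iteration: exact evaluation of a deterministic policy (linear solve)
# alternated with greedy one-step-lookahead improvement, until the policy is stable.
import numpy as np

def policy_iteration(states, actions, T, R, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions[s][0] for s in states}            # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi for the current policy.
        n = len(states)
        P, r = np.zeros((n, n)), np.zeros(n)
        for s in states:
            a = pi[s]
            for s2, p in T[(s, a)].items():
                P[idx[s], idx[s2]] += p
                r[idx[s]] += p * R[(s, a, s2)]
        V = dict(zip(states, np.linalg.solve(np.eye(n) - gamma * P, r)))
        # Policy improvement: greedy one-step lookahead based on V.
        new_pi = {
            s: max(actions[s],
                   key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                     for s2, p in T[(s, a)].items()))
            for s in states
        }
        if new_pi == pi:                               # stable policy => optimal
            return V, pi
        pi = new_pi
```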

Summary
Markov decision processes (MDPs): a popular model for sequential decision-making problems under uncertainty.
Strong assumptions: a Markovian transition model and a fully observable environment.
Polynomial-time computation of an optimal policy using dynamic programming.
However, if the state space is too large, dynamic programming is impractical.
Application to dialogue management: What is the state space? The action space? The transition probabilities? The rewards?
Next week we will look at RL solutions.
Reading: Singh, Litman, Kearns, and Walker, "Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System," Journal of Artificial Intelligence Research, 2002.