Point-Based Value Iteration for Constrained POMDPs


Point-Based Value Iteration for Constrained POMDPs
Dongho Kim, Jaesong Lee, Kee-Eung Kim (Department of Computer Science); Pascal Poupart (School of Computer Science)
IJCAI 2011, July 22, 2011

Motivation
[Diagram: the agent acts on the environment and receives observations in pursuit of its goals.]
Partially observable Markov decision processes (POMDPs) [Kaelbling98]:
- Model sequential decision making under partial or uncertain observations.
- A single reward function encodes the immediate utility of executing actions, so different objectives must be manually balanced into that single reward function.
Constrained POMDPs (CPOMDPs):
- Problems with limited resources or multiple objectives.
- Maximize one objective (reward) while constraining the other objectives (costs).
- CPOMDPs have not received as much attention as CMDPs [Altman99]; the exception is a DP method for finding deterministic policies [Isom08].

Motivation
Resource-limited agent, e.g., a battery-equipped robot: accomplish as many goals as possible given a finite amount of energy.
Spoken dialogue system [Williams07]: e.g., minimize the length of the dialogue while guaranteeing a 95% dialogue success rate.
- Reward: -1 for each dialogue turn.
- Cost: +1 for an unsuccessful dialogue, 0 for a successful one.
- Dialogue: s_0 → s_1 → s_2 → ... → s_T with R = -1, C = 0 on every turn; on the final transition, C = +1 if the dialogue is unsuccessful and C = 0 if it is successful.
Goal: maximize $\mathbb{E}\left[\sum_t \gamma^t r_t\right]$ subject to $\mathbb{E}\left[\sum_t \gamma^t c_t\right] \le \hat{c}$.
We propose exact and approximate methods for solving CPOMDPs.
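
To make the objective concrete, here is a minimal Python sketch (mine, not from the slides) of the two discounted sums for a single dialogue trajectory; the reward and cost numbers follow the example above, and averaging over many simulated trajectories would estimate the expectations in the constraint.

    # Discounted return and discounted cost of one trajectory (illustrative numbers).
    def discounted_sum(xs, gamma):
        return sum(gamma ** t * x for t, x in enumerate(xs))

    gamma = 0.95
    rewards = [-1, -1, -1, -1]   # -1 per dialogue turn
    costs = [0, 0, 0, 1]         # +1 only because this dialogue ended unsuccessfully

    value = discounted_sum(rewards, gamma)
    cost = discounted_sum(costs, gamma)
    # A policy is feasible when the cost, averaged over many trajectories, is <= c_hat.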

Suboptimality of deterministic policies in CPOMDPs
Procrastinating student problem (states AdvisorHappy, AdvisorAngry, JobDone; initial belief b_0 = [1, 0, 0]; cost bound γ < ĉ < 1):
- In AdvisorHappy, lazy stays in AdvisorHappy with p = 0.9 and moves to AdvisorAngry with p = 0.1 (R = 0, C = 0 either way); work reaches JobDone with R = 1, C = 1.
- In AdvisorAngry, lazy stays put (R = 0, C = 0); work reaches JobDone with R = 0, C = 1.
Optimal deterministic policy: lazy at t = 0, work at t = 1; value = 0.9γ, cumulative cost = γ.
Optimal randomized policy: at t = 0, work with prob. ĉ and lazy with prob. 1 − ĉ; lazy for all t ≥ 1; value = ĉ, cumulative cost = ĉ (the cost of 1 is incurred with prob. ĉ).
Since γ < ĉ, the randomized policy attains a higher value (ĉ > 0.9γ) while still meeting the cost bound, so the best deterministic policy is suboptimal. The sketch after this slide verifies these figures.
Reward and cost of executing work at timestep t:
t   belief            reward    cost
0   [1, 0, 0]         1         1
1   [0.9, 0.1, 0]     0.9γ      γ
2   [0.81, 0.19, 0]   0.81γ²    γ²
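
As a sanity check, the following sketch (my own, with an arbitrary choice of γ and ĉ satisfying γ < ĉ < 1) reproduces the value and cost figures quoted above and confirms that the randomized policy beats the best deterministic one while staying within the cost bound.

    gamma, c_hat = 0.5, 0.7            # any choice with gamma < c_hat < 1

    # Deterministic policy: lazy at t = 0, work at t = 1.
    det_value = 0.9 * gamma            # work pays off only if the advisor is still happy (p = 0.9)
    det_cost = gamma                   # work always incurs cost 1, discounted once

    # Randomized policy: work with prob. c_hat at t = 0, otherwise lazy forever.
    rand_value = c_hat
    rand_cost = c_hat

    assert det_cost <= c_hat and rand_cost <= c_hat   # both policies are feasible
    assert rand_value > det_value                     # c_hat > gamma > 0.9 * gamma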

Value iteration in CPOMDPs
The value function of a CPOMDP is a set of α-vector pairs, $V = \{\langle \alpha_{i,r}, \alpha_{i,c} \rangle\}_i$, where $\alpha_{i,r}$ and $\alpha_{i,c}$ are the i-th vectors for cumulative reward and cumulative cost, respectively.
[Figure: reward vectors α_{1,r}, α_{2,r}, α_{3,r} over the belief simplex, paired with cost vectors α_{1,c}, α_{2,c}, α_{3,c} and the cost bound ĉ.]
Exact DP update via enumeration:
$\alpha^{a,z}_{i,r}(s) = R(s,a)/|Z| + \gamma \sum_{s' \in S} T(s,a,s')\, O(s',a,z)\, \alpha_{i,r}(s')$
$\alpha^{a,z}_{i,c}(s) = C(s,a)/|Z| + \gamma \sum_{s' \in S} T(s,a,s')\, O(s',a,z)\, \alpha_{i,c}(s')$
$V' = \bigcup_{a \in A} \bigoplus_{z \in Z} \{\langle \alpha^{a,z}_{i,r}, \alpha^{a,z}_{i,c} \rangle\}_i$
This creates exponentially many α-vector pairs, $|V'| = |A|\,|V|^{|Z|}$, so pruning is needed. A sketch of the enumeration step follows below.
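
The enumeration step can be sketched as follows (a minimal Python illustration, assuming dense arrays T[s][a][s2], O[s2][a][z], R[s][a], C[s][a] indexed by integers; none of these names come from the authors' code).

    import itertools

    def dp_update(V, T, O, R, C, gamma, n_states, actions, observations):
        """One exact DP update: project every pair for each (a, z), then cross-sum over z."""
        V_new = []
        for a in actions:
            proj = {z: [] for z in observations}
            for alpha_r, alpha_c in V:
                for z in observations:
                    ar = [R[s][a] / len(observations)
                          + gamma * sum(T[s][a][s2] * O[s2][a][z] * alpha_r[s2]
                                        for s2 in range(n_states))
                          for s in range(n_states)]
                    ac = [C[s][a] / len(observations)
                          + gamma * sum(T[s][a][s2] * O[s2][a][z] * alpha_c[s2]
                                        for s2 in range(n_states))
                          for s in range(n_states)]
                    proj[z].append((ar, ac))
            # Cross-sum over observations: |V|^|Z| combinations for this action.
            for combo in itertools.product(*(proj[z] for z in observations)):
                V_new.append((
                    [sum(p[0][s] for p in combo) for s in range(n_states)],
                    [sum(p[1][s] for p in combo) for s in range(n_states)],
                ))
        return V_new  # exponentially large; pruning (next slides) is still required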

Exact DP update for CPOMDPs
Pruning by mixed integer linear program (MILP) [Isom08]:
- Check whether ⟨α_r, α_c⟩ is dominated by $V = \{\langle \alpha_{i,r}, \alpha_{i,c} \rangle\}_i$.
- Not dominated at belief b: cost at most ĉ, and higher value than the other vectors whose cost is at most ĉ.
- If there exists a belief b where ⟨α_r, α_c⟩ is not dominated, it will not be pruned.
- The check requires Boolean variables, hence an MILP.
[Figure: reward vectors α_{1,r}, α_{2,r}, α_r and cost vectors α_{1,c}, α_{2,c}, α_c plotted against the cost bound ĉ.]
Shortcomings of MILP pruning:
- It considers only deterministic policies; randomized policies (convex combinations of α-vectors) must also be considered.
- It prunes α-vector pairs that violate the cost constraint in each DP update, but satisfying the cost constraint overall does not require it to be satisfied at every time step.
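
The non-dominance test above can be written compactly as follows (this formalization is my own reading of the slide, with Δ(S) the belief simplex; it is not a verbatim program from [Isom08]):

    \exists\, b \in \Delta(S):\quad
      \alpha_c \cdot b \le \hat{c}
      \quad\text{and}\quad
      \alpha_r \cdot b > \alpha_{i,r} \cdot b
      \;\;\text{for all } i \text{ such that } \alpha_{i,c} \cdot b \le \hat{c}.

If such a belief b exists, the pair ⟨α_r, α_c⟩ survives pruning.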

Exact DP update for CPOMDPs
Pruning by a minimax quadratically constrained program (QCP):
- Inner maximization: is ⟨α_r, α_c⟩ dominated at b?
- Outer minimization: where is ⟨α_r, α_c⟩ not dominated?
- Not dominated at b: no convex combination has a higher value with the same or lower cost.
Inner maximization (for a fixed b): find the convex combination that dominates ⟨α_r, α_c⟩ by maximizing the gap, where gap = value of the convex combination − value of α_r. If the gap is positive, ⟨α_r, α_c⟩ is dominated at b.
Outer minimization: find the belief b where ⟨α_r, α_c⟩ is not dominated by minimizing the gap. If the gap is negative at the resulting b, ⟨α_r, α_c⟩ will not be pruned.
[Figure: reward vectors α_r, α_{1,r}, α_{2,r} and cost vectors α_c, α_{1,c}, α_{2,c} illustrating the gap at a belief.]
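
In symbols, the minimax program described above can be stated roughly as follows (my notation, with w the weights of the convex combination; the products of w and b make the constraint quadratic, hence a QCP):

    \min_{b \in \Delta(S)} \;
    \max_{w \ge 0,\ \sum_i w_i = 1} \;
      \Big( \sum_i w_i\, (\alpha_{i,r} \cdot b) \Big) - \alpha_r \cdot b
    \quad\text{s.t.}\quad
    \sum_i w_i\, (\alpha_{i,c} \cdot b) \le \alpha_c \cdot b.

A negative optimal gap exhibits a belief at which no feasible convex combination dominates ⟨α_r, α_c⟩, so the pair is kept.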

Point-based DP for CPOMDPs
Point-based value iteration (PBVI) for standard POMDPs [Pineau06]: maintains the best α-vector for each belief in B = {b_0, b_1, ..., b_q}.
Adapting standard PBVI to CPOMDPs in a simple way:
- Enumerate α-vector pairs and perform pruning confined to B.
- Minimax QCP pruning becomes an LP: for each b ∈ B, find a randomized policy that dominates ⟨α_r, α_c⟩ at b (a sketch of this per-belief LP follows below).
Remaining issues:
- Still many α-vector pairs at each b ∈ B.
- No information on costs at each b ∈ B.
[Figure: sampled beliefs b_0, b_1, b_2 with reward vectors α_r, α_{1,r}, α_{2,r} and cost vectors α_c, α_{1,c}, α_{2,c}.]
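
At a fixed belief the dominance check reduces to an LP. A minimal sketch using scipy (my own helper, not the authors' code): given the candidate pair's value and cost at b and the values/costs of the other pairs at b, look for a convex combination that stays within the candidate's cost and achieves a strictly higher value.

    import numpy as np
    from scipy.optimize import linprog

    def dominated_at_belief(cand_value, cand_cost, values, costs):
        """values[i], costs[i]: the i-th pair's value and cost at the fixed belief."""
        n = len(values)
        res = linprog(
            c=-np.asarray(values),                # maximize the combination's value
            A_ub=[costs], b_ub=[cand_cost],       # ...without exceeding the candidate's cost
            A_eq=[np.ones(n)], b_eq=[1.0],        # convex-combination weights sum to 1
            bounds=[(0.0, 1.0)] * n,
            method="highs",
        )
        return res.success and -res.fun > cand_value + 1e-9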

Admissible cost [Piunovskiy00]
The admissible cost is the expected cumulative cost that can still be incurred in the future.
[Diagram: trajectory s_0 → s_1 → ... → s_t → s_{t+1} → s_{t+2} with discounted costs c_0, γc_1, ..., γ^t c_t, γ^{t+1} c_{t+1}, γ^{t+2} c_{t+2}.]
Expected cumulative cost up to time t: $W_t = \sum_{\tau=0}^{t} \gamma^\tau c_\tau$.
Admissible cost at t + 1: $d_{t+1} = \frac{1}{\gamma^{t+1}} (\hat{c} - W_t)$.
Recursive formulation: $d_{t+1} = \frac{1}{\gamma} (d_t - c_t)$.
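
A quick numerical check (illustrative numbers only) that the recursion matches the definition:

    gamma, c_hat = 0.9, 1.0
    costs = [0.2, 0.0, 0.5]            # illustrative per-step costs c_0, c_1, c_2
    d, W = c_hat, 0.0                  # d_0 = c_hat, no cost incurred yet
    for t, c in enumerate(costs):
        W += gamma ** t * c            # W_t = sum_{tau <= t} gamma^tau * c_tau
        d = (d - c) / gamma            # recursive update for d_{t+1}
        assert abs(d - (c_hat - W) / gamma ** (t + 1)) < 1e-12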

PBVI with admissible cost for CPOMDPs
- Sample belief-admissible-cost pairs B = {(b_0, d_0), (b_1, d_1), ..., (b_q, d_q)}.
- Maintain the best randomized policy for each (b, d) ∈ B, using an LP to find the best convex combination for (b, d).
Point-based DP update:
- For each (b, d) ∈ B, find the best randomized policy at (τ(b, a, z), d_z) for each a, z, where τ(b, a, z) is the updated belief.
- Heuristic: distribute the admissible cost in proportion to the observation probability, i.e., $d_z = \frac{1}{\gamma}\,(d - C(b,a))\, P(z \mid b, a)$ (see the sketch after this slide).
- The LP solution is a convex combination of at most 2 α-vector pairs, so at most 2|B| α-vector pairs are kept.
[Figure: reward vectors α_{1,r}, α_{2,r}, α_{3,r}, cost vectors α_{1,c}, α_{2,c}, α_{3,c}, and admissible costs d_0, d_1 at beliefs b_0, b_1.]
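
A small sketch of the admissible-cost splitting heuristic as reconstructed above (hedged: the formula on the original slide is partly garbled, so treat this as an illustration; P_z maps each observation to P(z | b, a)).

    def branch_admissible_costs(d, cost_ba, P_z, gamma):
        """Split the remaining budget across observation branches, proportionally to P(z | b, a)."""
        return {z: (d - cost_ba) * p / gamma for z, p in P_z.items()}

    # Example: admissible cost 0.8 at b, expected immediate cost 0.1 for action a.
    d_per_branch = branch_admissible_costs(0.8, 0.1, {"z1": 0.7, "z2": 0.3}, gamma=0.95)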

Experiment: quickest change detection
Quickest change detection [Isom08]: minimize the detection delay while constraining the probability of a false alarm; |S| = 3, |A| = 2, |Z| = 3.
Comparison of MILP (deterministic) vs. QCP (randomized) vs. PBVI (randomized):
- MILP and QCP could not perform DP updates beyond 6 and 5 timesteps, respectively.
- PBVI scaled effectively beyond 10 timesteps and performed close to the exact methods.
[State diagram: under NoAlarm, PreChange self-loops with p = 0.99 and moves to PostChange with p = 0.01 (R = 0, C = 0); in PostChange, NoAlarm incurs the per-step detection-delay penalty (C = 0) and Alarm (R = 0, C = 0) moves to PostAlarm; Alarm in PreChange is a false alarm with R = 0, C = 1.]

Experiment: n-city ticketing problem
n-city ticketing problem [Williams07]:
- Figure out the origin and the destination among n cities, and submit the ticket purchase request once sufficient information has been gathered.
- Due to speech recognition errors, the observed user response can differ from the true response.
- Reward of -1 for each timestep; cost of 1 for a wrong ticket.
PBVI results for n = 3, P_e = 0.2 (|S| = 1945, |A| = 16, |Z| = 18): more dialogue turns for smaller ĉ, since more information-gathering steps are needed to be more accurate.

Conclusion
- We showed that optimal policies in CPOMDPs can be randomized.
- We presented exact and approximate methods for CPOMDPs: an exact method with minimax QCP pruning, and an approximate method based on PBVI.
- Both can be extended to multiple constraints and to a different discount factor for each cost function.
Future work:
- Adopting state-of-the-art POMDP solvers with heuristic belief exploration.
- Extension to the average reward and cost criterion.
- Extension to factored CPOMDPs.

References
[Altman99] E. Altman. Constrained Markov Decision Processes. Chapman & Hall/CRC, 1999.
[Isom08] J. D. Isom, S. P. Meyn, and R. D. Braatz. Piecewise linear dynamic programming for constrained POMDPs. In Proc. of AAAI, 2008.
[Kaelbling98] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
[Pineau06] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. JAIR, 27:335-380, 2006.
[Piunovskiy00] A. B. Piunovskiy and X. Mao. Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters, 27(3):119-126, 2000.
[Williams07] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393-422, 2007.