Point-Based Value Iteration for Constrained POMDPs
Dongho Kim, Jaesong Lee, Kee-Eung Kim (Department of Computer Science, KAIST)
Pascal Poupart (School of Computer Science, University of Waterloo)
IJCAI-2011, 2011. 7. 22.
Motivation
[Figure: agent-environment loop in which the agent pursues goals, executes actions, and receives observations from the environment]
Partially observable Markov decision processes (POMDPs) [Kaelbling98]
- Model sequential decision making under partial or uncertain observations.
- A single reward function encodes the immediate utility of executing actions, so different objectives must be manually balanced into one reward function.
Constrained POMDPs (CPOMDPs)
- Problems with limited resources or multiple objectives: maximize one objective (reward) while constraining the other objectives (costs).
- CPOMDPs have not received as much attention as CMDPs [Altman99]. One exception is a DP method for finding deterministic policies [Isom08].
Motivation
Resource-limited agent, e.g., a battery-equipped robot:
- Accomplish as many goals as possible given a finite amount of energy.
Spoken dialogue system [Williams07]:
- e.g., minimize the length of the dialogue while guaranteeing a 95% dialogue success rate.
- Reward: -1 for each dialogue turn.
- Cost: +1 for each unsuccessful dialogue, 0 for each successful dialogue.
[Figure: a dialogue s_0, s_1, s_2, ..., s_T with R = -1, C = 0 at each turn, and a terminal cost of C = +1 for an unsuccessful dialogue, C = 0 for a successful one]
Goal: maximize E[Σ_t γ^t r_t] subject to E[Σ_t γ^t c_t] ≤ ĉ.
We propose exact and approximate methods for solving CPOMDPs.
Suboptimality of deterministic policies in CPOMDPs
Procrastinating student problem:
[Figure: states AdvisorHappy, AdvisorAngry, JobDone. From AdvisorHappy, lazy stays in AdvisorHappy with p = 0.9 and moves to AdvisorAngry with p = 0.1 (R = 0, C = 0 either way); lazy in AdvisorAngry loops with R = 0, C = 0; work leads to JobDone with R = 1, C = 1 from AdvisorHappy and with R = 0, C = 1 from AdvisorAngry]
Initial belief b_0 = [1, 0, 0]; cost bound ĉ with γ < ĉ < 1.
Reward and cost for work at each timestep t:
t | belief          | reward | cost
0 | [1, 0, 0]       | 1      | 1
1 | [0.9, 0.1, 0]   | 0.9γ   | γ
2 | [0.81, 0.19, 0] | 0.81γ² | γ²
Optimal deterministic policy: lazy at t = 0, work at t = 1, giving value = 0.9γ and cumulative cost = γ.
Optimal randomized policy: at t = 0, work with prob. ĉ and lazy with prob. 1 - ĉ; lazy at all t ≥ 1. This gives value = ĉ and cumulative cost = ĉ.
Since ĉ > γ > 0.9γ, the randomized policy attains strictly higher value while still satisfying the cost constraint.
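To make the comparison concrete, here is a minimal Python check of the numbers above (the particular values gamma = 0.95 and c_hat = 0.97 are assumed for illustration; they satisfy γ < ĉ < 1):

```python
# Minimal numeric check of the procrastinating-student example.
# Assumed parameters: gamma = 0.95, cost bound c_hat = 0.97 (gamma < c_hat < 1).
gamma, c_hat = 0.95, 0.97

# Deterministic policy: lazy at t = 0, work at t = 1.
# After one lazy step the belief over (AdvisorHappy, AdvisorAngry, JobDone)
# is [0.9, 0.1, 0]; work then earns reward 1 only from AdvisorHappy but
# incurs cost 1 from both advisor states.
det_value = gamma * 0.9   # 0.9 * gamma
det_cost = gamma * 1.0    # gamma, within the budget since gamma < c_hat

# Randomized policy: work with prob. c_hat at t = 0, otherwise lazy forever.
rand_value = c_hat        # reward 1 with prob. c_hat
rand_cost = c_hat         # cost 1 with prob. c_hat, exactly meets the budget

assert det_cost <= c_hat and rand_cost <= c_hat  # both policies are feasible
print(det_value, rand_value)  # 0.855 < 0.97: randomization is strictly better
```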
Value iteration in CPOMDPs
The value function of a CPOMDP is a set of α-vector pairs V = {⟨α_{i,r}, α_{i,c}⟩}_i, where α_{i,r} and α_{i,c} are the i-th vectors for cumulative reward and cumulative cost, respectively.
[Figure: value vectors α_{1,r}, α_{2,r}, α_{3,r} and cumulative-cost vectors α_{1,c}, α_{2,c}, α_{3,c} over the belief simplex, with the cost bound ĉ]
Exact DP update via enumeration:
α_{i,r}^{a,z}(s) = R(s, a)/|Z| + γ Σ_{s'} T(s, a, s') O(s', a, z) α_{i,r}(s')
α_{i,c}^{a,z}(s) = C(s, a)/|Z| + γ Σ_{s'} T(s, a, s') O(s', a, z) α_{i,c}(s')
V' = ∪_{a∈A} ⊕_{z∈Z} {⟨α_{i,r}^{a,z}, α_{i,c}^{a,z}⟩}_i
This creates exponentially many α-vector pairs, |V'| = |A| |V|^|Z|, so pruning is needed.
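A minimal Python sketch of this enumeration step, without pruning (the function name and the array layouts for R, C, T, O are our own conventions for illustration, not from the paper):

```python
import itertools
import numpy as np

def exact_dp_update(V, R, C, T, O, gamma):
    """One exact CPOMDP backup by enumeration (no pruning).

    V: list of (alpha_r, alpha_c) pairs, each vector of shape (|S|,).
    R, C: reward and cost arrays of shape (|S|, |A|).
    T: transition probabilities T[s, a, s'].
    O: observation probabilities O[s', a, z].
    Returns |A| * |V|^|Z| backed-up pairs.
    """
    num_A = R.shape[1]
    num_Z = O.shape[2]
    V_new = []
    for a in range(num_A):
        # g[z][i] is the pair backed up through action a and observation z.
        g = [[(R[:, a] / num_Z + gamma * T[:, a, :] @ (O[:, a, z] * a_r),
               C[:, a] / num_Z + gamma * T[:, a, :] @ (O[:, a, z] * a_c))
              for (a_r, a_c) in V]
             for z in range(num_Z)]
        # Cross-sum over observations: pick one backed-up pair per z.
        for choice in itertools.product(*g):
            V_new.append((sum(p[0] for p in choice),
                          sum(p[1] for p in choice)))
    return V_new
```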
Exact DP update for CPOMDPs
Pruning by mixed integer linear program (MILP) [Isom08]:
- Checks whether ⟨α_r, α_c⟩ is dominated by V = {⟨α_{i,r}, α_{i,c}⟩}_i.
- ⟨α_r, α_c⟩ is not dominated at b if its cost is at most ĉ and its value is higher than that of the other vectors with cost at most ĉ.
- If there exists a b where ⟨α_r, α_c⟩ is not dominated, it will not be pruned.
[Figure: value vectors α_r, α_{1,r}, α_{2,r} and cost vectors α_c, α_{1,c}, α_{2,c} against the cost bound ĉ; the MILP encodes the check with Boolean variables]
Shortcomings of MILP pruning:
- It considers only deterministic policies; randomized policies (convex combinations of α-vectors) also need to be considered.
- It prunes α-vector pairs violating the cost constraint in each DP update, but satisfying the cost constraint overall does not mean the constraint must be satisfied at every time step.
Exact DP update for CPOMDPs
Pruning by minimax quadratically constrained program (QCP):
- Inner maximization: is ⟨α_r, α_c⟩ dominated at b?
- Outer minimization: where is ⟨α_r, α_c⟩ not dominated?
⟨α_r, α_c⟩ is not dominated at b if no convex combination of vectors in V has higher value with the same or lower cost.
Inner maximization, for fixed b: find the convex combination that dominates ⟨α_r, α_c⟩ by maximizing the gap, where gap = (value of the convex combination at b) - (value of α_r at b). If the gap is positive, ⟨α_r, α_c⟩ is dominated at b.
Outer minimization: find a b where ⟨α_r, α_c⟩ is not dominated by minimizing the gap over b. If the gap is negative at the resulting b, ⟨α_r, α_c⟩ will not be pruned.
[Figure: value vectors α_r, α_{1,r}, α_{2,r} and cost vectors α_c, α_{1,c}, α_{2,c}, illustrating the gap at a belief b]
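For a fixed belief b, the inner maximization is just a linear program over the mixture weights. A minimal sketch using scipy.optimize.linprog (the function name and data layout are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def dominance_gap(b, pair, V):
    """Inner maximization of the minimax check at a fixed belief b.

    Maximizes (value of a convex combination of pairs in V at b) minus
    (value of `pair` at b), subject to the combination having the same
    or lower cost at b.  A positive gap means `pair` is dominated at b.
    """
    alpha_r, alpha_c = pair
    vals = np.array([a_r @ b for (a_r, a_c) in V])   # values at b
    costs = np.array([a_c @ b for (a_r, a_c) in V])  # costs at b
    res = linprog(c=-vals,                                   # max mixture value
                  A_ub=costs[None, :], b_ub=[alpha_c @ b],   # same or lower cost
                  A_eq=np.ones((1, len(V))), b_eq=[1.0],     # weights sum to 1
                  bounds=[(0, None)] * len(V))               # weights >= 0
    if not res.success:        # no feasible mixture: not dominated at b
        return -np.inf
    return -res.fun - alpha_r @ b
```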
Point-based DP for CPOMDPs
Point-based value iteration (PBVI) for standard POMDPs [Pineau06] maintains the best α-vector for each b in B = {b_0, b_1, ..., b_q}.
[Figure: value function over beliefs with sampled points b_0, b_1, b_2]
Adapting standard PBVI to CPOMDPs in a simple way:
- Enumerate α-vector pairs and perform pruning confined to B.
- Minimax QCP pruning becomes an LP for each b in B: find a randomized policy that dominates ⟨α_r, α_c⟩ at b.
[Figure: value vectors α_r, α_{1,r}, α_{2,r} and cost vectors α_c, α_{1,c}, α_{2,c} at the points b_0, b_1, b_2]
Remaining problems:
- Still many α-vectors at each b.
- No information on costs at b.
Admissible cost [Piunovskiy00]
The admissible cost is the expected cumulative cost that can be additionally incurred in the future.
[Figure: trajectory s_0, s_1, ..., s_t, s_{t+1}, s_{t+2} with discounted costs c_0, γc_1, ..., γ^t c_t, γ^{t+1} c_{t+1}, γ^{t+2} c_{t+2}, and admissible costs d_t, d_{t+1}]
Expected cumulative cost up to t: W_t = Σ_{τ=0}^{t} γ^τ c_τ
Admissible cost at t+1: d_{t+1} = (ĉ - W_t) / γ^{t+1}
Recursive formulation: d_{t+1} = (d_t - c_t) / γ
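A quick numeric check, with assumed example numbers, that the recursion matches the direct definition:

```python
# Check d_{t+1} = (d_t - c_t) / gamma against d_{t+1} = (c_hat - W_t) / gamma^{t+1},
# where W_t = sum_{tau=0..t} gamma^tau * c_tau.  All numbers are illustrative.
gamma, c_hat = 0.9, 2.0
costs = [0.5, 0.3, 0.7]

d = c_hat   # d_0 = c_hat: no cost incurred yet
W = 0.0
for t, c in enumerate(costs):
    d = (d - c) / gamma         # recursive update
    W += gamma**t * c           # expected cumulative cost up to t
    assert abs(d - (c_hat - W) / gamma**(t + 1)) < 1e-12
```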
PBVI with admissible cost for CPOMDPs
- Sample belief-admissible cost pairs B = {(b_0, d_0), (b_1, d_1), ..., (b_q, d_q)}.
- Maintain the best randomized policy for each (b, d) in B, using an LP to find the best convex combination at (b, d).
[Figure: value vectors α_{1,r}, α_{2,r}, α_{3,r} and cost vectors α_{1,c}, α_{2,c}, α_{3,c} with the admissible cost d]
Point-based DP update:
- For each (b, d) in B, find the best randomized policy at (τ(b, a, z), d_z) for each a, z.
- Heuristic: distribute the admissible cost in proportion to the observation probability, i.e., d_z = (1/γ)(d - C(b, a)) P(z | b, a).
LP solution: a convex combination of at most 2 α-vector pairs, hence at most 2|B| α-vector pairs in total.
[Figure: per-point value and cost with admissible costs d_0 at b_0 and d_1 at b_1]
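The per-point update can be sketched as a mixture LP plus the heuristic split of the admissible cost. A minimal Python sketch (function names are illustrative, and the d_z formula follows the slide's heuristic as reconstructed above):

```python
import numpy as np
from scipy.optimize import linprog

def best_mixture(b, d, V):
    """LP: best convex combination of alpha-vector pairs at a point (b, d).

    Maximizes value at b subject to the mixture's cumulative cost at b
    not exceeding the admissible cost d.  With one equality constraint
    (weights sum to 1) and one inequality constraint (cost budget), a
    basic optimal solution has at most 2 nonzero weights, which is why
    at most 2 alpha-vector pairs are kept per (b, d) point.
    """
    vals = np.array([a_r @ b for (a_r, a_c) in V])
    costs = np.array([a_c @ b for (a_r, a_c) in V])
    res = linprog(c=-vals,
                  A_ub=costs[None, :], b_ub=[d],
                  A_eq=np.ones((1, len(V))), b_eq=[1.0],
                  bounds=[(0, None)] * len(V))
    return res.x if res.success else None   # mixture weights, or infeasible

def admissible_cost_per_obs(d, C_ba, P_z_ba, gamma):
    """Heuristic split of the admissible cost across observations.

    As reconstructed from the slide: d_z is proportional to the
    observation probability P(z | b, a).  P_z_ba is an array over z.
    """
    return (d - C_ba) / gamma * P_z_ba   # one d_z per observation z
```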
Experiment: Quickest change detection
Quickest change detection [Isom08]: minimize the detection delay while constraining the probability of a false alarm. |S| = 3, |A| = 2, |Z| = 3.
[Figure: states PreChange, PostChange, PostAlarm. NoAlarm keeps PreChange with p = 0.99 and moves to PostChange with p = 0.01 (R = 0, C = 0 either way); NoAlarm in PostChange gives R = -1, C = 0; Alarm in PreChange is a false alarm with R = 0, C = 1; Alarm in PostChange leads to PostAlarm with R = 0, C = 0]
MILP (deterministic) vs. QCP (randomized) vs. PBVI (randomized):
- MILP and QCP could not perform DP updates for more than 6 and 5 timesteps, respectively.
- PBVI scaled effectively to more than 10 timesteps and performed close to the exact methods.
Experiment: n-city ticketing problem
n-city ticketing problem [Williams07]:
- Figure out the origin and the destination among n cities, and submit the ticket purchase request once sufficient information has been gathered.
- Due to speech recognition errors, the observed user response can differ from the true response.
- Reward: -1 for each timestep; cost: 1 for a wrong ticket.
PBVI result for n = 3, P_e = 0.2 (|S| = 1945, |A| = 16, |Z| = 18):
- More dialogue turns for smaller ĉ: the system needs more information-gathering steps to be more accurate.
Conclusion
- We showed that optimal policies in CPOMDPs can be randomized.
- We presented exact and approximate methods for CPOMDPs: an exact method with minimax QCP pruning and an approximate method based on PBVI.
- Both extend to multiple constraints and to a different discount factor for each cost function.
Future work:
- Adopting state-of-the-art POMDP solvers with heuristic belief exploration.
- Extension to the average reward and cost criterion.
- Extension to factored CPOMDPs.
References
[Altman99] E. Altman. Constrained Markov Decision Processes. Chapman & Hall/CRC, 1999.
[Isom08] J. D. Isom, S. P. Meyn, and R. D. Braatz. Piecewise linear dynamic programming for constrained POMDPs. In Proc. of AAAI, 2008.
[Kaelbling98] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
[Pineau06] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. JAIR, 27:335-380, 2006.
[Piunovskiy00] A. B. Piunovskiy and X. Mao. Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters, 27(3):119-126, 2000.
[Williams07] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393-422, 2007.