Piecewise Linear Dynamic Programming for Constrained POMDPs


Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Piecewise Linear Dynamic Programming for Constrained POMDPs

Joshua D. Isom, Sikorsky Aircraft Corporation, Stratford, CT 06615
Sean P. Meyn and Richard D. Braatz, University of Illinois at Urbana-Champaign, Urbana, IL 61801

Abstract

We describe an exact dynamic programming update for constrained partially observable Markov decision processes (CPOMDPs). State-of-the-art exact solution of unconstrained POMDPs relies on implicit enumeration of the vectors in the piecewise linear value function, and on pruning operations to obtain a minimal representation of the updated value function. In dynamic programming for CPOMDPs, each vector takes two valuations, one with respect to the objective function and another with respect to the constraint function. The dynamic programming update consists of finding, for each belief state, the vector that has the best objective function valuation while still satisfying the constraint function. Whereas the pruning operation for an unconstrained POMDP requires solution of a linear program, the pruning operation for CPOMDPs requires solution of a mixed integer linear program.

Background

The partially observable Markov decision process (POMDP) is a model for decision-making under uncertainty with respect both to the current state and to the future evolution of the system. The model generalizes the fully observed Markov decision process, which allows only for uncertainty as to the future evolution of the system. The theory of solution of fully observed Markov decision processes with finite or countable state space is well established for both unconstrained (Puterman 2005) and constrained (Altman 1999) problem formulations. Despite a recent burst of algorithmic development for unconstrained POMDPs (Cassandra 1998; Hansen 1998; Poupart 2005; Feng & Zilberstein 2004; Spaan & Vlassis 2005), there has been relatively little development of algorithmic approaches for the constrained problem. A comprehensive treatment of countable-state constrained Markov decision processes is Altman's monograph (Altman 1999). The central solution concept is a countably infinite linear program, which has desirable theoretical properties but requires LP approximation or state truncation to be of practical use. Complexity reduction is addressed in (Meyn 2007) for network models; approaches include relaxation techniques and value function approximations based on a mean-flow model. Because the indefinite-horizon and infinite-horizon discounted POMDPs considered here are equivalent to Markov decision processes with an uncountably infinite state space, the countable-state results are not directly applicable.

The solution technique that we propose is related to an approach developed by Piunovskiy and Mao (Piunovskiy & Mao 2000). They considered an uncountably infinite-state Markov decision process and proposed augmenting the state space with a state representing the expected value of the constraint function, with a penalty function used to prohibit solutions that violate the constraint. The technique of Piunovskiy and Mao is independent of the representation of the value function, and was illustrated with an example that admitted an exact analytical solution. Our approach is similar in that it disallows, at every step, a solution that would violate the constraint.
However, our method uses a piecewise linear representation of the value function, building on techniques developed for unconstrained POMDPs and allowing for ε-exact numerical solutions. Our value function update can be the basis for value iteration or for policy iteration using Hansen's policy improvement technique.

A constrained partially observable Markov decision process is a 7-tuple

(S, A, Σ, p(s'|s, a), p(z|s), r(s, a), c(s, a)),   (1)

where
- S, A, and Σ are finite sets of states, actions, and observations,
- p(s'|s, a) is the state transition function, giving the probability that state s' is reached from state s on action a,
- p(z|s) is the observation function, giving the probability that observation z ∈ Σ is made in state s,
- r(s, a) is the cost incurred by taking action a in state s, and
- c(s, a) is a constraint function.

A deterministic stationary policy φ for a CPOMDP is a map from all possible observation sequences to an action in A, with the set of all policies given by

Φ = {φ : Σ* → A}.   (2)
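For concreteness in the sketches that follow, the components of this tuple can be collected in a small container. This is an illustrative Python sketch; the array layout (T[a, s, s'], O[s, z], r[s, a], c[s, a]) is our own convention and is not prescribed by the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CPOMDP:
    """The 7-tuple (S, A, Sigma, p(s'|s,a), p(z|s), r(s,a), c(s,a)) as dense arrays."""
    T: np.ndarray        # T[a, s, s'] = p(s'|s, a), shape (|A|, |S|, |S|)
    O: np.ndarray        # O[s, z]     = p(z|s),     shape (|S|, |Sigma|)
    r: np.ndarray        # r[s, a]     = per-step cost
    c: np.ndarray        # c[s, a]     = per-step constraint cost
    lam: float = 1.0     # discount: lam = 1 (indefinite horizon) or 0 < lam < 1

    def validate(self) -> None:
        # Every transition and observation distribution must sum to one.
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.allclose(self.O.sum(axis=1), 1.0)
```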

We consider two problems on CPOMDPs. The indefinite-horizon problem takes the form

minimize over Φ:  E^Φ [ Σ_{t=0}^{T} r(a_t, s_t) ],   (3)

subject to

E^Φ [ Σ_{t=0}^{T} c(a_t, s_t) ] ≤ α  for every initial state s_0,   (4)

where T is the time at which the system enters a set of one or more terminal states and s_0 is the initial state. Conditions that ensure that the dynamic programming operator for an indefinite-horizon problem has a unique fixed point are developed in the literature (Hansen 2007; Patek 2001). For the infinite-horizon discounted case, the constrained problem takes the form

minimize over Φ:  E^Φ [ Σ_{t=0}^{∞} λ^t r(a_t, s_t) ],   (5)

subject to

E^Φ [ Σ_{t=0}^{∞} λ^t c(s_t, a_t) ] ≤ α  for every initial state s_0,   (6)

with discount factor 0 < λ < 1 and initial state s_0. Discounting ensures that the dynamic programming operator is a contraction operator.

Motivating Problems

Below we discuss three examples that motivate the development of techniques for the solution of CPOMDPs. Practical problems that are naturally formulated as CPOMDPs are at least as prevalent as those formulated as unconstrained problems.

Change Detection

A classic example of a constrained indefinite-horizon POMDP is the problem of quickest change detection. The problem is to minimize detection delay subject to a constraint on the probability of false alarm. This problem was studied by Shiryaev (Shiryaev 1963), who elucidated the form of the optimal policy but did not establish numerical techniques for optimal parameterization of the solution. The difficulty of finding the optimal parameterization for the policy, or even of evaluating the probability of false alarm, has been noted in the literature (Tartakovsky & Veeravalli 2004).

An indefinite-horizon constrained Bayes change detection problem is a 5-tuple {Σ, α, f_0, f_1, T} with Σ a finite set of observations, change time parameter 0 < ρ < 1, probability of false alarm constraint 0 < α < 1, pre-change observation probability mass function f_0 : Σ → [0, 1], post-change observation probability mass function f_1 : Σ → [0, 1], and countable time set T. The change time ν is a geometric random variable with parameter ρ, and η is an adapted estimate of the change time. The problem statement is

minimize over φ ∈ Φ:  E[(η − ν − 1)^+]   (7)

subject to

p(η < ν) ≤ α,   (8)

with

Φ = {φ : Σ* → [0, 1]},   (9)

where the distribution of the random variables is given by

p(y_t = z | t < ν) = f_0(z),  z ∈ Σ,   (10)
p(y_t = z | t ≥ ν) = f_1(z),  z ∈ Σ,   (11)
p(η = t | η ≥ t) = φ(y_1, ..., y_t),   (12)
p(ν = t) = ρ(1 − ρ)^{t−1}.   (13)

[Figure 1: State transition diagram for the change detection problem with geometric change time distribution. Arcs are labeled with transition probabilities under the no-alarm / alarm actions. Nodes are labeled with the distribution of observations for the state.]

The problem of quickest change detection can be modeled as an indefinite-horizon POMDP with

S = {PreChange, PostChange, PostAlarm},
A = {Alarm, NoAlarm},
c(PreChange, Alarm) = 1,
r(PostChange, NoAlarm) = 1,   (14)

all other values of r and c zero, and constraint functional

E^Φ [ Σ_{t=0}^{T} c(a_t, s_t) ] ≤ α.   (15)

State transition and observation probabilities are as shown in Figure 1.

Structural Monitoring and Maintenance

Bridges, aircraft airframes, and other structural components in engineered systems are subject to fatigue failures. Nondestructive techniques for inspection of the damage state of a component are imperfect, so the true structural damage state of the component is imperfectly knowable. The maintainer may take several actions in structural monitoring, including inspection, repair, or replacement.
Because the consequences of structural failure are often severe, the structural monitoring and maintenance problem is most naturally formulated as an infinite-horizon discounted POMDP in which the objective is to minimize life-cycle maintenance and replacement costs subject to a constraint on the probability of failure, expressed at least qualitatively in terms like "no more than one in a million." Because the service life of the engineered system is usually uncertain, discounted life-cycle cost objectives and occupation measure constraints are appropriate (Puterman 2005, p. 125).

Opportunistic Spectrum Access

Limits on bandwidth for wireless services motivate dynamic spectrum sharing strategies such as opportunistic spectrum access (OSA), a strategy that allows secondary users to exploit instantaneous spectrum opportunities while limiting the level of interference for primary users. Zhao and Swami proposed a CPOMDP decision-theoretic framework for solution of the OSA problem (Zhao & Swami 2007). In this framework, the state is determined by whether or not the spectrum is in use by the primary user, the observations consist of partial spectrum measurements, the reward function is the number of bits delivered by the secondary user, and the constraint is on the probability of collision with a primary user.

Exact Dynamic Programming Update for CPOMDP

Having defined a CPOMDP and provided some motivating problems, we now describe how to extend dynamic programming techniques for POMDPs to solve CPOMDPs. A POMDP can be transformed into an infinite-state fully observed Markov decision process by defining a new belief state, which is a vector b ∈ [0, 1]^{|S|} with Σ_s b(s) = 1, representing the posterior probability that the system is in each state s ∈ S given the starting state, observation history, and action history. The belief state is a sufficient statistic and can be recursively updated via Bayes' rule. If observation z is made and action a taken, and if the previous belief state vector was b(s), the new belief state vector b'(s') is given by

b'(s') = p(z|s') Σ_s p(s'|a, s) b(s) / Σ_{s,s'} p(z|s') p(s'|a, s) b(s).   (16)

The dynamic programming update for a POMDP is

v_{n+1}(b) = min_{a ∈ A} [ b·r(·, a) + λ Σ_{z ∈ Σ} p(z | b, a) v_n(b'(b, a, z)) ].

Exact methods for efficiently performing the dynamic programming update for POMDPs include incremental pruning (Cassandra 1998) and restricted region incremental pruning (Feng & Zilberstein 2004). A key technique for both of these methods is representation of the value function as a piecewise linear function over the belief simplex. This representation allows for a finite representation of the value function as a set of value function vectors, the elements of which represent the intersection of a hyperplane with a given axis of the belief simplex. Stationary nonrandomized policies resulting from a stationary value function are sufficient for constrained discounted (Piunovskiy & Mao 2000) and indefinite-horizon (Feinberg & Piunovskiy 2000) Markov decision process problems with continuous state space.

Exact techniques for the POMDP dynamic programming update implicitly enumerate all value function vectors for the updated value function, and use a pruning operation to maintain a minimal representation. An explicitly enumerative dynamic programming update that converts a set of value function vectors V into a new set V' is

V' = ∪_{a ∈ A} ⊕_{z ∈ Σ} { v^{a,z,i} | v_i ∈ V },   (17)

where ⊕ denotes the cross sum of sets of vectors, with

v^{a,z,i}(s) = r(s, a)/|Σ| + λ Σ_{s' ∈ S} p(z|s', a) p(s'|s, a) v_i(s'),   (18)

and λ = 1 for the indefinite-horizon problem or 0 < λ < 1 for the infinite-horizon discounted problem. Typically, many of the value function vectors in V' are dominated and superfluous to a minimal value function representation. Pruning is the process of identifying vectors that are not part of the minimal value function representation.
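The belief update of Equation 16 and the explicit enumeration of Equations 17 and 18 translate directly into NumPy. The following sketch assumes the array layout introduced above; it is a naive enumerative update, not the incremental-pruning implementation used for the results in this paper.

```python
import itertools
import numpy as np

def belief_update(b, a, z, T, O):
    """Equation 16: b'(s') is proportional to p(z|s') * sum_s p(s'|s,a) b(s)."""
    unnormalized = O[:, z] * (b @ T[a])
    return unnormalized / unnormalized.sum()

def vectors_az(V, a, z, T, O, r, lam):
    """Equation 18: one vector per predecessor vector v_i, for fixed (a, z)."""
    n_obs = O.shape[1]
    return [r[:, a] / n_obs + lam * (T[a] @ (O[:, z] * v)) for v in V]

def enumerate_update(V, T, O, r, lam):
    """Equation 17: union over actions of the cross sum over observations."""
    n_actions, n_obs = T.shape[0], O.shape[1]
    new_vectors = []
    for a in range(n_actions):
        per_obs = [vectors_az(V, a, z, T, O, r, lam) for z in range(n_obs)]
        for choice in itertools.product(*per_obs):   # one vector chosen per observation
            new_vectors.append(sum(choice))
    return new_vectors
```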
At the heart of the pruning operation is the solution of a linear program that determines whether there is a belief point b at which a given value function vector w is better (lower cost) than all other value function vectors u^i ∈ U. The linear program takes the form

variables: h, b(s) ≥ 0 ∀ s ∈ S
maximize: h
subject to the constraints:
b·(u − w) ≥ h, ∀ u ∈ U,
Σ_{s ∈ S} b(s) = 1.

If the linear program is feasible, then w is a candidate for the minimal value function representation. Incremental pruning interleaves pruning with vector generation to reduce the computational cost associated with pruning for each dynamic programming update. With PR representing the pruning operation, the dynamic programming update is given by

V' = PR( ∪_{a ∈ A} V^a ),
V^a = PR( V^{a,z_1} ⊕ PR( V^{a,z_2} ⊕ ... PR( V^{a,z_{k−1}} ⊕ V^{a,z_k} ) ... ) ),
V^{a,z} = PR( { v^{a,z,i} | v_i ∈ V } ).   (19)

This dynamic programming technique can be extended to the constrained problem by generating two sets of value function vectors with a one-to-one correspondence between the vectors in the two sets. One set is valued with respect to the cost function r, and the other is valued with respect to the constraint function c. We write (v_r^i, v_c^i) for the vectors that correspond to each other in the two sets, and V for the collection of pairs. Let

v_r^{a,z,j}(s) = r(s, a)/|Σ| + λ Σ_{s' ∈ S} p(z|s', a) p(s'|s, a) v_r^j(s')   (20)

and

v_c^{a,z,j}(s) = c(s, a)/|Σ| + λ Σ_{s' ∈ S} p(z|s', a) p(s'|s, a) v_c^j(s').   (21)

Equation 17 is applied separately to {v_r^{a,z,j}(s)} and {v_c^{a,z,j}(s)}, and vectors (v_r^i, v_c^i) in the resulting sets correspond if they were generated from the same action and the same predecessor vector index for each observation.
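As a concrete illustration, the unconstrained dominance test above can be posed with scipy.optimize.linprog. The variable ordering [h, b(1), ..., b(n)] is our own choice, and because linprog minimizes, the objective is the negated h.

```python
import numpy as np
from scipy.optimize import linprog

def lp_dominate(w, U):
    """LP dominance test used in unconstrained pruning (illustrative sketch).

    w and the members of U are NumPy cost vectors over the state space.
    Returns a belief b at which w is at least as good (no higher cost) as
    every vector in U, or None if no such belief exists.
    Decision variables: x = [h, b(1), ..., b(n)]; the LP maximizes h.
    """
    n = len(w)
    if not U:                                   # nothing to compete against
        return np.full(n, 1.0 / n)
    obj = np.zeros(n + 1)
    obj[0] = -1.0                               # maximize h == minimize -h
    # h - b.(u - w) <= 0 for every u in U.
    A_ub = np.array([np.concatenate(([1.0], -(u - w))) for u in U])
    b_ub = np.zeros(len(U))
    A_eq = np.concatenate(([0.0], np.ones(n))).reshape(1, -1)   # sum_s b(s) = 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * n
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.status == 0 and -res.fun >= 0:       # optimal h >= 0
        return res.x[1:]
    return None
```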

To develop a pruning operation for the value function vectors, note that for a vector pair (v_r, v_c) to belong to the minimal set, v_r must have a lower cost b·v_r at some belief b than any other vector u_r^i ∈ U whose pair (u_r^i, u_c^i) satisfies the constraint b·u_c^i ≤ α at b. This concept is illustrated in Figure 2. If a vector u^i does not satisfy the constraint b·u_c^i ≤ α at b, the test vector w need not have a lower cost than it. The constraints are therefore conditional. A standard trick for incorporating conditional constraints into a linear program is to add a discrete variable d_i ∈ {0, 1}, resulting in a mixed integer linear program. Applying this technique to the problem at hand produces the mixed integer linear program

variables: h, b(s) ≥ 0 ∀ s ∈ S, d_i ∈ {0, 1} ∀ i ∈ {1, ..., |U|}
maximize: h
subject to the constraints:
b·v_c ≤ α,
b·(u_r^i − v_r) ≥ h − d_i M, ∀ (u_r^i, u_c^i) ∈ U,
b·u_c^i ≥ d_i α, ∀ (u_r^i, u_c^i) ∈ U,
Σ_{s ∈ S} b(s) = 1,

with M a large positive number. If d_i = 1, then u^i violates the constraint at b, and a candidate vector v_r need not have a lower cost than u_r^i at b. On the other hand, if d_i = 0, then u^i satisfies the constraint at b, and a candidate vector v_r must have a lower cost than u_r^i at b. If the program is feasible, then v_r is a candidate for the lowest-cost minimal value function representation satisfying the constraint. The entire pruning operation for a CPOMDP is presented in Table 1.

[Figure 2: The top plot shows value function vectors with objective function valuation v_r, while the bottom plot shows value function vectors with constraint valuation v_c. Vectors in the top plot are highlighted in the region of the belief space where the vector is the best (lowest cost) vector satisfying the constraint, for α = 0.2.]
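The mixed integer program can be sketched with scipy.optimize.milp (SciPy 1.9 or later). Again, the variable ordering [h, b(1..n), d(1..m)] and the default big-M value are illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def mip_dominate(v_r, v_c, U, alpha, big_m=1e4):
    """Mixed-integer dominance test in the spirit of MIP-DOMINATE (Table 1).

    U is a list of (u_r, u_c) pairs of NumPy arrays.  Returns a belief b at
    which (v_r, v_c) satisfies the constraint and has cost no worse than every
    competitor that also satisfies the constraint at b, or None otherwise.
    Variable vector: [h, b(1..n), d(1..m)], with the d_i binary.
    """
    n, m = len(v_r), len(U)
    cost = np.zeros(1 + n + m)
    cost[0] = -1.0                                  # maximize h == minimize -h
    rows, lbs, ubs = [], [], []
    # b.v_c <= alpha: the candidate pair must satisfy the constraint at b.
    rows.append(np.concatenate(([0.0], v_c, np.zeros(m)))); lbs.append(-np.inf); ubs.append(alpha)
    for i, (u_r, u_c) in enumerate(U):
        e = np.zeros(m); e[i] = 1.0
        # h - b.(u_r - v_r) - M*d_i <= 0: dominance is enforced only when d_i = 0.
        rows.append(np.concatenate(([1.0], -(u_r - v_r), -big_m * e))); lbs.append(-np.inf); ubs.append(0.0)
        # b.u_c - alpha*d_i >= 0: d_i = 1 is allowed only when u^i is at or beyond the constraint at b.
        rows.append(np.concatenate(([0.0], u_c, -alpha * e))); lbs.append(0.0); ubs.append(np.inf)
    # Belief simplex: sum_s b(s) = 1.
    rows.append(np.concatenate(([0.0], np.ones(n), np.zeros(m)))); lbs.append(1.0); ubs.append(1.0)
    constraints = LinearConstraint(np.array(rows), np.array(lbs), np.array(ubs))
    integrality = np.concatenate((np.zeros(1 + n), np.ones(m)))      # only the d_i are integer
    bounds = Bounds(np.concatenate(([-big_m], np.zeros(n + m))),     # cap h so the MIP stays bounded
                    np.concatenate(([big_m], np.ones(n + m))))
    res = milp(c=cost, constraints=constraints, integrality=integrality, bounds=bounds)
    if res.success and -res.fun >= 0:
        return res.x[1:1 + n]
    return None
```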
The CPOMDP pruning operation can be used to perform value iteration with incremental pruning dynamic programming updates via Equation 19. After a dynamic programming update producing a minimal set of vectors V, the value function is given by v(b) = min{ b·v_r : b·v_c ≤ α, (v_r, v_c) ∈ V }. The value function converges to the optimal value function with each successive dynamic programming update.

As an alternative to value iteration, the exact dynamic programming update can be used to perform policy iteration using Hansen's algorithm (Hansen 1998), which uses a deterministic finite-state controller (Q, q_0, Σ, δ, A, α) as the policy representation, where
- Q is a finite set of controller states, one for each vector v ∈ V,
- q_0 ∈ Q is the start state,
- Σ is a finite input alphabet,
- δ is a function from Q × Σ into Q, called the transition function,
- A is a finite output alphabet, and
- α is an output function from Q into A, with α(i) as shorthand for α(q_i).

Because the combination of the system with the controller yields a Markov chain, any policy can be evaluated via the solution of a linear system with coefficient matrix (I − λA), with I an identity matrix, A the system/controller transition matrix, and λ = 1 for the indefinite-horizon problem or 0 < λ < 1 for the infinite-horizon problem. The system state index i and the controller state index j are mapped to a new index k via a one-to-one mapping (i, j) → k, and the coefficients of the controller/system transition matrix A are given by

A((i, j), (i', j')) = Σ_{z ∈ Σ} p(i'|i, α(j)) p(z|i') 1{δ(j, z) = j'}.   (22)

The linear system for finding the objective value function of a policy is (I − λA) x_r = r̄, where r̄_{(i,j)} = r(i, α(j)) and x_{r,(i,j)} is the value function intercept at the ith belief state vertex for the jth value function vector v_r^j. The linear system for finding the constraint value function of a policy is (I − λA) x_c = c̄, where c̄_{(i,j)} = c(i, α(j)) and x_{c,(i,j)} is the value function intercept at the ith belief state vertex for the jth value function vector v_c^j.

Under Hansen's policy iteration algorithm, an initial policy is evaluated with respect to both the objective and constraint functions, and then a dynamic programming update is performed using Equation 19 and the pruning operation described in Table 1. Each vector in the updated value function represents a new candidate state for the finite-state controller. Once new states have been added, the policy is evaluated to produce a new value function, and the algorithm repeats. Each iteration yields an improved finite-state controller that satisfies the problem constraints.
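Policy evaluation for a fixed controller then reduces to two dense linear solves. A minimal sketch under the same assumed array layout follows, with delta[j, z] the controller transition function and act[j] the action emitted in node j; the flat index k = i·|Q| + j is our own choice, and the sketch assumes I − λA is nonsingular (always true for 0 < λ < 1).

```python
import numpy as np

def evaluate_controller(T, O, r, c, delta, act, lam):
    """Solve (I - lam*A) x = cost for the objective and constraint value functions.

    T[a, s, s'] and O[s, z] are the model arrays; delta[j, z] is the controller
    transition function and act[j] the action emitted in controller node j.
    Flat index: k = i * n_q + j for system state i and controller node j.
    """
    n_s, n_q, n_z = T.shape[1], len(act), O.shape[1]
    n = n_s * n_q
    A = np.zeros((n, n))
    r_bar = np.zeros(n)
    c_bar = np.zeros(n)
    for i in range(n_s):
        for j in range(n_q):
            k = i * n_q + j
            a = act[j]
            r_bar[k] = r[i, a]
            c_bar[k] = c[i, a]
            for ip in range(n_s):
                for z in range(n_z):
                    # Equation 22: A((i,j),(i',j')) = sum_z p(i'|i,a) p(z|i') 1{delta(j,z)=j'}
                    kp = ip * n_q + delta[j, z]
                    A[k, kp] += T[a, i, ip] * O[ip, z]
    M = np.eye(n) - lam * A
    return np.linalg.solve(M, r_bar), np.linalg.solve(M, c_bar)
```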

Table 1: Algorithm for pruning value function vectors for a CPOMDP

Procedure: MIP-DOMINATE((w_r, w_c), U)
 1: solve the following mixed integer linear program:
 2:   variables:
 3:     h, b(s) ≥ 0 ∀ s ∈ S,
 4:     d_i ∈ {0, 1} ∀ i ∈ {1, ..., |U|}
 5:   maximize: h
 6:   subject to the constraints:
 7:     b·w_c ≤ α,
 8:     b·(u_r^i − w_r) ≥ h − d_i M,
 9:     b·u_c^i ≥ d_i α, ∀ (u_r^i, u_c^i) ∈ U,
10:     Σ_{s ∈ S} b(s) = 1.
11: if h ≥ 0 then
12:   return b
13: else
14:   return nil
15: end if

Procedure: BEST(b, U)
 1: min ← ∞
 2: for all (u_r, u_c) ∈ U do
 3:   if b·u_c < α then
 4:     if (b·u_r < min) or ((b·u_r = min) and (u_r <_lex w_r)) then
 5:       (w_r, w_c) ← (u_r, u_c)
 6:       min ← b·u_r
 7:     end if
 8:   end if
 9: end for
10: return (w_r, w_c)

Procedure: PR(W)
 1: D ← ∅
 2: while W ≠ ∅ do
 3:   (w_r, w_c) ← any element in W
 4:   b ← MIP-DOMINATE((w_r, w_c), D)
 5:   if b = nil then
 6:     W ← W \ {(w_r, w_c)}
 7:   else
 8:     (w_r, w_c) ← BEST(b, W)
 9:     D ← D ∪ {(w_r, w_c)}
10:     W ← W \ {(w_r, w_c)}
11:   end if
12: end while
13: return D

Example

The solution technique is illustrated with an example. Consider the indefinite-horizon change detection problem with geometric change time parameter ρ = 0.1, observation alphabet size |Σ| = 3, pre-change observation distribution

f_0(z_1) = 0.6, f_0(z_2) = 0.3, f_0(z_3) = 0.1,

and post-change observation distribution

f_1(z_1) = 0.2, f_1(z_2) = 0.4, f_1(z_3) = 0.4.

The problem statement is

minimize over φ ∈ Φ:  v = E[(η − ν − 1)^+]   (23)

subject to

p(η < ν) ≤ α,   (24)

with false alarm probability constraint α = 0.2. Starting with initial value function vectors v_r and v_c, we perform four dynamic programming updates. The results are shown in Figure 3. The policy corresponding to the value function consists of producing an alarm when the probability that the system is in the post-change state exceeds 0.4, with an expected detection delay of approximately 4. Tartakovsky and Veeravalli noted that a conservative policy for the problem of quickest detection consists of producing an alarm when the posterior probability of change exceeds 1 − α (Tartakovsky & Veeravalli 2004). Note that the computed policy is much less conservative (and therefore better valued) than the conservative policy of producing an alarm when the posterior probability of change exceeds 1 − α = 0.8.
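The example can be encoded in the array layout used in the earlier sketches. The transition structure below follows our reading of Figure 1, and the observation distribution assigned to the absorbing PostAlarm state (set to f_1 here) is an assumption of the sketch.

```python
import numpy as np

# Change detection example (rho = 0.1, alpha = 0.2) in the array layout above.
# States: 0 = PreChange, 1 = PostChange, 2 = PostAlarm;
# actions: 0 = NoAlarm, 1 = Alarm; observations: 0, 1, 2 = z1, z2, z3.
rho, alpha = 0.1, 0.2
f0 = np.array([0.6, 0.3, 0.1])
f1 = np.array([0.2, 0.4, 0.4])

T = np.zeros((2, 3, 3))
T[0] = [[1 - rho, rho, 0.0],   # NoAlarm: the change occurs with probability rho
        [0.0,     1.0, 0.0],
        [0.0,     0.0, 1.0]]
T[1] = [[0.0, 0.0, 1.0],       # Alarm: move to the absorbing PostAlarm state
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
O = np.vstack([f0, f1, f1])    # PostAlarm observation row is an assumption (see above)

r = np.zeros((3, 2)); r[1, 0] = 1.0   # unit delay cost per post-change, pre-alarm step
c = np.zeros((3, 2)); c[0, 1] = 1.0   # unit constraint cost for a false alarm

# One Bayes update (Equation 16): start certain of PreChange, take NoAlarm, observe z3.
b = np.array([1.0, 0.0, 0.0])
b_next = O[:, 2] * (b @ T[0])
b_next /= b_next.sum()
print(b_next)                  # posterior puts roughly 0.31 on PostChange
```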

[Figure 3: A minimal representation of the value function for the change detection example after four exact dynamic programming updates. Each vector is presented with its objective valuation v_r and constraint valuation v_c, with b(s_2) the posterior probability that the system is in the post-change state. An approximation of the optimal value function v(b) is shown as dots on the top plot.]

Conclusion

This paper presents an exact dynamic programming update for CPOMDPs using a piecewise linear representation of the value function. The key concept is a modification of the standard pruning algorithm, with a mixed integer linear program replacing a linear program. The dynamic programming update can be used to perform value iteration or policy iteration. The method should be useful for finding solutions to modestly sized sequential decision problems that include uncertainty and constraints. The scaling behavior of the proposed CPOMDP dynamic programming update is expected to be similar to that of exact dynamic programming updates for POMDPs, which exhibit PSPACE complexity for many problems. Although this scaling behavior makes the proposed technique unsuitable for large-scale problems, we hope that the insight provided by an exact numerical technique will spur development of approximate methods for the solution of CPOMDPs.

References

Altman, E. 1999. Constrained Markov Decision Processes. Boca Raton, FL: CRC Press.

Cassandra, A. R. 1998. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. Ph.D. Dissertation, Brown University.

Feinberg, E. A., and Piunovskiy, A. B. 2000. Multiple objective nonatomic Markov decision processes with total reward criteria. Journal of Mathematical Analysis and Applications 247.

Feng, Z., and Zilberstein, S. 2004. Region-based incremental pruning for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence.

Hansen, E. A. 1998. Finite Memory Control of Partially Observable Systems. Ph.D. Dissertation, University of Massachusetts Amherst.

Hansen, E. A. 2007. Indefinite-horizon POMDPs with action-based termination. In Proceedings of the 22nd National Conference on Artificial Intelligence.

Meyn, S. P. 2007. Control Techniques for Complex Networks. Cambridge, UK: Cambridge University Press.

Patek, S. D. 2001. On partially observed stochastic shortest path problems. In Proceedings of the 40th IEEE Conference on Decision and Control.

Piunovskiy, A. B., and Mao, X. 2000. Constrained Markovian decision processes: the dynamic programming approach. Operations Research Letters 27.

Poupart, P. 2005. Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Ph.D. Dissertation, University of Toronto.

Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons.

Shiryaev, A. N. 1963. On optimum methods in quickest detection problems. Theory of Probability and Its Applications 8(1).

Spaan, M. T. J., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24.

Tartakovsky, A. G., and Veeravalli, V. V. 2004. General asymptotic Bayesian theory of quickest change detection. Theory of Probability and Its Applications 49(3).

Zhao, Q., and Swami, A. 2007. A decision-theoretic framework for opportunistic spectrum access. IEEE Wireless Communications 14(4).
