On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers
On the Approximate Solution of POMDP and the Near-Optimality of Finite-State Controllers
Huizhen (Janey) Yu and Dimitri Bertsekas
Lab for Information and Decision Systems, M.I.T.
April 2006

POMDP (Partially Observable Markov Decision Problems)
- Very important in control and other fields
- Hard (because of the infinite state space: the space of beliefs, or conditional state distributions)
- Last frontier for Exact DP
- Great challenge for Approximate DP

Approximation/Finite Parametrization Methods for POMDP
Let's focus on finite parametrizations of finite-state POMDP:
- Discretization/interpolation in belief space
- Problem approximation (use a finite-state MDP to approximate a POMDP; aggregation)
- Piecewise linear approximations (approximate value and policy iteration)
- Optimization over a set of finite-state controllers

Outline of the Talk
- Introduction to POMDP (finite-state, average cost)
- Lower bounds/MDP approximations to the POMDP optimal cost function (a brief summary of our work)
- On near-optimality of the set of finite-state controllers for average cost POMDP

References
- H. Yu, "Approximate Solution Methods for Partially Observable Markov and Semi-Markov Decision Processes," Ph.D. Thesis, Dept. of EECS, MIT, Jan. 2006.
- H. Yu and D. P. Bertsekas, "Discretized Approximations for POMDP with Average Cost," The 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, 2004.
- H. Yu and D. P. Bertsekas, "Near-Optimality of Finite-State Controllers," LIDS Report 2689, MIT, April 2006.

Introduction to POMDP
[Graphical model of discrete-time POMDP: states $S_0, S_1, S_2, \ldots$; observations $Y_1, Y_2, \ldots$; controls $U_0, U_1, \ldots$]
- $S$: state space; $Y$: observation space; $U$: control space
- $g(s, u)$: per-stage cost
- $\xi$: initial distribution of the state
- $\Pi$: history-dependent randomized policies $\pi = \{\mu_0, \mu_1, \ldots\}$, with $\mu_0(\cdot),\ \mu_t(h_t, \cdot) \in \mathcal{P}(U)$
- $h_t = (u_0, y_1, u_1, \ldots, y_t)$: observed history up to time $t$
- $(\xi, \pi)$ induce the stochastic process $\{S_0, U_0, S_1, Y_1, U_1, \ldots\}$ with joint probability distribution $P_{\xi,\pi}$

Expected Cost Criteria
- $k$-stage cost:
$$J_k^\pi(\xi) = E^{P_{\xi,\pi}}\Big[\sum_{t=0}^{k-1} g(S_t, U_t)\Big], \qquad J_k^*(\xi) = \inf_{\pi \in \Pi} J_k^\pi(\xi)$$
- Average cost:
$$J_-^\pi(\xi) = \liminf_{k\to\infty} \tfrac{1}{k} J_k^\pi(\xi), \qquad J_+^\pi(\xi) = \limsup_{k\to\infty} \tfrac{1}{k} J_k^\pi(\xi),$$
$$J_-^*(\xi) = \inf_{\pi \in \Pi} J_-^\pi(\xi), \qquad J_+^*(\xi) = \inf_{\pi \in \Pi} J_+^\pi(\xi)$$

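To make the criteria concrete, here is a minimal Monte Carlo sketch estimating the normalized $k$-stage cost $\tfrac{1}{k} J_k^\pi(\xi)$ of a fixed randomized policy. The toy model (arrays `P`, `Q`, `g`) and the policy are hypothetical placeholders for illustration, not from the talk:

```python
import numpy as np

# Hypothetical toy POMDP: 2 states, 2 observations, 2 controls.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[u][s, s']: transition probabilities under control u
              [[0.5, 0.5], [0.5, 0.5]]])
Q = np.array([[0.8, 0.2], [0.3, 0.7]])    # Q[s, y]: observation probabilities
g = np.array([[1.0, 0.0], [0.0, 1.0]])    # g[s, u]: per-stage cost

def k_stage_cost(policy, xi0, k, runs=2000, seed=0):
    """Monte Carlo estimate of (1/k) J_k^pi(xi0) for a history-dependent policy."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(runs):
        s = rng.choice(2, p=xi0)           # S_0 ~ xi0
        hist = []                          # history stored as (control, observation) pairs
        for _t in range(k):
            u = policy(hist, rng)
            total += g[s, u]
            s = rng.choice(2, p=P[u][s])   # state transition
            y = rng.choice(2, p=Q[s])      # observation of the new state
            hist.append((u, y))
    return total / (runs * k)

# A memoryless randomized policy: each control with probability 1/2.
print(k_stage_cost(lambda h, rng: rng.integers(2), np.array([0.5, 0.5]), k=200))
```
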
Important Analytical Features
- Sufficient statistic for control is $\xi_t$: the belief state (conditional distribution of the state); the belief-update map is sketched below
- $k$-stage policy cost function: linear in $\xi$
- $k$-stage optimal cost function: piecewise linear/concave in $\xi$
- Key questions for average cost: are the lower or upper optimal cost functions flat (constant, i.e., independent of the initial belief $\xi$)? Are they equal?
- In POMDP, a constant optimal average cost across beliefs is far less likely than in MDP

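Since $\xi_t$ is the sufficient statistic, everything that follows works on the belief simplex; the belief transition $\phi_u(\xi, y)$ that appears in the optimality equations below is the standard Bayes filter. A minimal sketch, assuming the same hypothetical `P`, `Q` layout as the simulation snippet above:

```python
import numpy as np

def belief_update(xi, u, y, P, Q):
    """phi_u(xi, y): next belief after applying control u and observing y."""
    pred = xi @ P[u]              # predicted next-state distribution
    unnorm = pred * Q[:, y]       # weight by the observation likelihood
    return unnorm / unnorm.sum()  # normalize (assumes y has positive probability under xi, u)
```
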
Average Cost Optimality Equations
Here $\phi_u(\xi, y)$ denotes the next belief after applying control $u$ and observing $y$, $\bar g(\xi, u) = \sum_s \xi(s)\, g(s, u)$, and the expectations are over the next observation $Y$.
- Multichain: pair of coupled optimality equations
$$J(\xi) = \min_{u \in U} E\{J(\phi_u(\xi, Y))\}, \qquad U(\xi) \stackrel{\mathrm{def}}{=} \arg\min_{u \in U} E\{J(\phi_u(\xi, Y))\},$$
$$J(\xi) + h(\xi) = \min_{u \in U(\xi)} \big[\bar g(\xi, u) + E\{h(\phi_u(\xi, Y))\}\big]. \tag{1}$$
- Unichain: constant average cost DP equation
$$\lambda + h(\xi) = \min_{u \in U} \big[\bar g(\xi, u) + E\{h(\phi_u(\xi, Y))\}\big]. \tag{2}$$
- Bounded solutions $(J, h)$ of (1) $\Rightarrow$ $J_-^*(\cdot) = J_+^*(\cdot) = J(\cdot)$
- Bounded solutions $(\lambda, h)$ of (2) $\Rightarrow$ $J_-^*(\cdot) = J_+^*(\cdot) = \lambda$

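Once the belief space is discretized into a finite grid, the unichain equation (2) can be solved numerically on the resulting finite model. A hedged sketch of textbook relative value iteration for a generic finite unichain MDP (this is the standard iteration, not the talk's specific algorithm):

```python
import numpy as np

def relative_value_iteration(P, g, iters=10000, tol=1e-9):
    """Solve lambda + h(s) = min_u [ g(s,u) + sum_s' P[u][s,s'] h(s') ] for a finite
    unichain MDP; P[u] is |S|x|S|, g is |S|x|U|. Convergence assumes aperiodicity
    (otherwise apply a damping/aperiodicity transformation first)."""
    nS, nU = g.shape
    h = np.zeros(nS)
    for _ in range(iters):
        Th = np.min(g + np.stack([P[u] @ h for u in range(nU)], axis=1), axis=1)
        h_new = Th - Th[0]        # normalize so that h(s_0) = 0, keeping iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return Th[0], h               # Th[0] -> lambda, the constant optimal average cost
```
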
Example: Non-Constant Optimal Average Cost
- States: {1, 2, 3, 4}; Actions: {a, b}
[Transition diagram omitted: a Markov chain on states 1-4 with transition probabilities 1/2 and 1]
- Markov chain recurrent and aperiodic
- Observations: {c, d}; states 1, 3 emit c; states 2, 4 emit d
- Cost: g(1, a) = 1, g(1, b) = 0; g(3, a) = 0, g(3, b) = 1
- For $\bar\xi$ with $\bar\xi(1) = 1$ or $\bar\xi(3) = 1$: $J^*(\bar\xi) = 0$
- For $\xi$ with $\xi(1) = \xi(3) = 1/2$: $J^*(\xi) = 1/3 > 0$
Intuition: states 1 and 3 emit the same observation c, and their costs are complementary across actions, so from the mixed belief the controller cannot always pick the zero-cost action, while from a point-mass belief it can.
Conclusion: non-constant optimal average cost in POMDP.

Existence of Solutions to the Constant AC DP Equation
Sufficient conditions:
- reachability and detectability [Platzman 80]
- interior accessibility, relative interior accessibility [Hsu, Chuang, Arapostathis 05]
Constant AC: not fully understood. Non-constant AC: not much has been done.

Outline of the Talk
- Introduction to POMDP and the average cost problem
- Lower bounds for average cost POMDP
- Near-optimality of finite-state controllers

Lower Bounds for Average Cost POMDP
$$\tilde J(\xi) \le J^*(\xi), \qquad \forall\, \xi \in \mathcal{P}(S),$$
where $\tilde J$ comes from approximating processes (usually finite-state MDPs).
Notes:
- $\tilde J$ is computable when the spaces are finite: the bounds involve discretized approximations and finite state-and-control MDP algorithms
- neither $\tilde J$ nor $J^*$ has to be constant
- use of lower bounds: characterize the optimal performance, provide suboptimal control (see the sketch after this list)
- extension to semi-Markov and constrained cases

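One way the lower-bound machinery "provides suboptimal control", as noted above: treat the differential cost of the solved approximating MDP as a cost-to-go approximation and act by one-step lookahead on the exact belief dynamics. A sketch under stated assumptions: `h_approx` is any callable approximating $h$ on beliefs (e.g., an interpolation of the discretized solution), and the `P`, `Q`, `g` layout matches the earlier snippets:

```python
import numpy as np

def lookahead_control(xi, P, Q, g, h_approx):
    """One-step lookahead: argmin_u [ g_bar(xi, u) + E_y h_approx(phi_u(xi, y)) ]."""
    nU, nY = g.shape[1], Q.shape[1]
    best_u, best_val = 0, np.inf
    for u in range(nU):
        val = xi @ g[:, u]               # g_bar(xi, u): expected per-stage cost
        pred = xi @ P[u]                 # next-state distribution under u
        for y in range(nY):
            py = pred @ Q[:, y]          # probability of observing y
            if py > 0:
                val += py * h_approx(pred * Q[:, y] / py)   # phi_u(xi, y)
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```
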
Outline of the Analysis
[Diagram: POMDP -> Fictitious Process -> Modified Belief MDP -> Approximating POMDP]

Outline of the Talk
- Introduction to POMDP and the average cost problem
- Lower bounds for average cost POMDP
- Near-optimality of finite-state controllers

Finite-State Controllers
[Graphical model: observations $Y_0, Y_1, Y_2, \ldots$, controls $U_0, U_1, \ldots$, and internal states $Z_0, Z_1, Z_2, \ldots$ of the controller]
Key properties and advantages of FSC:
- $\{(S_t, Y_t, Z_t, U_t)\}$ is jointly a Markov chain; $\{(S_t, Y_t, Z_t)\}$ is marginally a Markov chain
- asymptotic behavior well understood; MDP theory applies (see the evaluation sketch below)
- connection with piecewise linear concave approximations (Hansen 1998)

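Because the FSC closes the loop into a finite Markov chain, its average cost can be evaluated exactly rather than by simulation. A minimal sketch, assuming an FSC parametrized by a control rule `nu[z, u]` and an internal transition rule `eta[z, y, z']` (this parametrization is an illustrative assumption, not the talk's notation):

```python
import numpy as np

def fsc_average_cost(P, Q, g, nu, eta, rho0, xi0):
    """Exact average cost of a finite-state controller.
    nu[z, u]: prob of control u in internal state z; eta[z, y, z']: internal-state
    transition after observing y; xi0, rho0: initial state and internal-state dists.
    Assumes the joint (s, z) chain is unichain and aperiodic."""
    nS, nU = g.shape
    nZ, nY = nu.shape[0], Q.shape[1]
    T = np.zeros((nS * nZ, nS * nZ))      # transition matrix of the joint (s, z) chain
    c = np.zeros(nS * nZ)                 # expected per-stage cost at each (s, z)
    for s in range(nS):
        for z in range(nZ):
            i = s * nZ + z
            c[i] = nu[z] @ g[s]
            for u in range(nU):
                for s2 in range(nS):
                    for y in range(nY):
                        w = nu[z, u] * P[u][s, s2] * Q[s2, y]
                        T[i, s2 * nZ:(s2 + 1) * nZ] += w * eta[z, y]
    pi = np.kron(xi0, rho0)               # initial joint distribution over (s, z)
    for _ in range(10000):                # power iteration toward the stationary dist
        pi = pi @ T
    return pi @ c
```
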
Existence of Near-Optimal FSC (Average Cost)
Question: Given $\epsilon > 0$, under what conditions does there exist an $\epsilon$-optimal FSC (with a sufficiently large number of internal states)?
Notes:
- We mean $\epsilon$-optimal simultaneously for all initial beliefs (so flatness of $J^*$ is a necessary condition)
- The existence of an $\epsilon$-optimal FSC for a given single initial belief is an open question
- For discounted problems, there are easy answers: there is no FSC that is simultaneously $\epsilon$-optimal for all initial beliefs, but there is an $\epsilon$-optimal FSC for a given single initial belief

Near-Optimality of Finite-State Controllers
Assumption: finite spaces, $J_-^*$ constant.
Our Main Theorem: $J_+^* = J_-^*$; and for all $\epsilon > 0$, there exists an $\epsilon$-optimal FSC.
Recall the definitions:
$$\pi = \{\mu_0, \mu_1, \ldots\}, \qquad \mu_0(\cdot),\ \mu_t(h_t, \cdot) \in \mathcal{P}(U),$$
$$J_k^\pi(\xi) = E^{P_{\xi,\pi}}\Big[\sum_{t=0}^{k-1} g(S_t, U_t)\Big], \qquad J_-^\pi(\xi) = \liminf_{k\to\infty} \tfrac{1}{k} J_k^\pi(\xi), \qquad J_-^*(\xi) = \inf_{\pi\in\Pi} J_-^\pi(\xi).$$
Lemma: $J_k^\pi(\xi)$ is linear in $\xi$; $J_-^\pi(\xi)$ and $J_-^*(\xi)$ are concave in $\xi$. (Linearity holds because the policy does not depend on $\xi$; concavity of $J_-^\pi$ follows from superadditivity of liminf, and $J_-^*$ is an infimum of concave functions.)
[Figure: two cases of a concave $J_-^\pi$ over the belief simplex relative to the constant $J_-^*$]
1st Key Observation: Suppose $J_-^*$ is constant. Then the first case pictured, $J_-^\pi$ staying uniformly above the flat $J_-^*$, cannot happen for near-optimal $\pi$: there must exist $\pi$ nearly optimal for some interior $\xi$, with $J_-^\pi$ almost flat.

First Result on Independence from the Initial Belief
Proposition: For all $\epsilon > 0$, there exists an $\epsilon$-liminf optimal policy $\pi$ that does not functionally depend on the initial belief $\xi$, i.e.,
$$J_-^\pi(\xi) \le J_-^* + \epsilon, \qquad \forall\, \xi \in \mathcal{P}(S).$$
2nd Key Observation: There must exist $k_0$ with $\tfrac{1}{k_0} J_{k_0}^\pi$ uniformly close to $J_-^\pi = \liminf_{k\to\infty} \tfrac{1}{k} J_k^\pi$.

Finite-Stage Uniform Closeness Lemma
Lemma: For all $\epsilon > 0$, there exist $\pi_0 \in \Pi$ and an integer $k_0$ (depending on $\pi_0$) such that
$$\tfrac{1}{k_0} J_{k_0}^{\pi_0}(\xi) \le J_-^* + \epsilon, \qquad \forall\, \xi \in \mathcal{P}(S).$$
Final construction: build an infinite-stage policy by replication of the finite-stage policy. Form a new policy $\pi_1$ that has a finite-length history window:
$$\pi_0 = \{\mu_0, \mu_1, \ldots, \mu_{k_0-1}, \ldots\},$$
$$\pi_1 = \{\mu_0, \mu_1, \ldots, \mu_{k_0-1}, \mu'_{k_0}, \mu'_{k_0+1}, \ldots, \mu'_{2k_0-1}, \ldots\},$$
$$\mu'_t(h_t, \cdot) = \mu_{k(t)}\big(\delta_{k(t)}(h_t), \cdot\big), \qquad k(t) \stackrel{\mathrm{def}}{=} \mathrm{mod}(t, k_0),$$
where $\delta_{k(t)}$ extracts the last length-$k(t)$ segment of a length-$t$ history $h_t$.
- $\tfrac{1}{k_0} J_{k_0}^{\pi_0}(\cdot)$ uniformly close to $J_-^*$ $\Rightarrow$ $J_-^{\pi_1}(\cdot)$ uniformly close to $J_-^*$
- For finite spaces: $\pi_1$ is an FSC $\Rightarrow$ $J_-^{\pi_1}(\xi) = J_+^{\pi_1}(\xi)$, $\forall\, \xi \in \mathcal{P}(S)$

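In code, the replication step is just a wrapper that restarts the finite-stage policy every $k_0$ steps and truncates the history accordingly; this is what makes $\pi_1$ a finite-history controller. A minimal sketch; the policy interface (a list of per-stage rules acting on histories stored as (control, observation) pairs) is an assumption:

```python
def replicated_policy(mus, k0):
    """Replication construction: pi_1 applies mu_{k(t)}, k(t) = t mod k0, to
    delta_{k(t)}(h_t), the last length-k(t) segment of the history.
    `mus` holds the first k0 decision rules mu_0, ..., mu_{k0-1} of pi_0."""
    def mu_prime(t, h_t):
        k = t % k0                        # k(t) = mod(t, k0)
        tail = h_t[len(h_t) - k:]         # delta_{k(t)}(h_t); empty when k = 0
        return mus[k](tail)               # reuse pi_0's rule on the truncated history
    return mu_prime
```
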
Near-Optimality of FSC (Summary)
Assumption: finite spaces, $J_-^*$ constant.
Theorem: $J_+^* = J_-^*$; for all $\epsilon > 0$, there is an FSC that is $\epsilon$-optimal and does not functionally depend on the initial belief $\xi$.
Note: the $\epsilon$-optimal FSC is also a finite-history controller.
Intuitive implication: if $J^*$ does not depend on the initial belief, old information becomes increasingly obsolete.
More information