6.231 DYNAMIC PROGRAMMING LECTURE 7 LECTURE OUTLINE

Size: px

Start display at page:

Download "6.231 DYNAMIC PROGRAMMING LECTURE 7 LECTURE OUTLINE"

Jayson Randell Strickland
6 years ago
Views:

1 6.231 DYNAMIC PROGRAMMING LECTURE 7 LECTURE OUTLINE DP for imperfect state info Sufficient statistics Conditional state distribution as a sufficient statistic Finite-state systems Examples 1

2 REVIEW: IMPERFECT STATE INFO PROBLEM Instead of knowing x k, we receive observations z 0 = h 0 x 0,v 0, z k = h k x k,u k 1,v k, k 0 I k : information vector available at time k: I 0 = z 0, I k = z 0,z 1,...,z k,u 0,u 1,...,u k 1, k 1 Optimizationoverpolicies π = {µ 0,µ 1,...,µ N 1 }, where µ k I k U k, for all I k and k. Find a policy π that minimizes { } N 1 J π = E g N x N + x 0,w k,vk k=0,...,n 1 k=0 subject to the equations g k x k,µ k I k,w k x k+1 = f k xk,µ k I k,w k, k 0, z 0 = h 0 x 0,v 0, z k = h k xk,µ k 1 I k 1,v k, k 1 2

3 DP algorithm: J k I k = min uk Uk DP ALGORITHM [ { E g k x k,u k,w k x k,w k,zk+1 }] +J k+1 I k,z k+1,u k I k,u k for k = 0,1,...,N 2, and for k = N 1, J N 1 I N 1 = min E g N 1 x N 1,u N 1,w N 1 u N 1 U N 1 [ x N 1,w N 1 { ] } +g N fn 1 x N 1,u N 1,w N 1 I N 1,u N 1 The optimal cost J is given by J = E z0 { J0 z 0 }. 3

4 SUFFICIENT STATISTICS Suppose there is a function S k I k such that the minintheright-hand sideofthe DPalgorithmcan be written in terms of some function H k as min H k uk Uk Sk I k,u k Such a function S k is called a sufficient statistic. An optimal policy obtained by the preceding minimization can be written as µ ki k = µ k Sk I k, where µ k is an appropriate function. Example of a sufficient statistic: S k I k = I k Another important sufficient statistic S k I k = P x I, k k assuming that v k is characterized by a probability distribution P vk x k 1,u k 1,w k 1 4

5 DP ALGORITHM IN TERMS OF P X I K K Filtering Equation: P x k I is generated recursively by a dynamic systemestimator of the k form P x I = Φ k P x I,u k,z k+1 k+1 k+1 k k for a suitable function Φ DP algorithm can be written as J k P x I = min k k [ k E uk Uk x k,w k,zk+1 { gk x k,u k,w k +J k+1 Φk P x I,u k,z k+1 I k,u k k k It is the DP algorithm for a new problem whose state is P x k I also called belief state k }] w k v k u k System x k Measurement x k + 1 = f k x k,u k,w k z k = h k x k,u k - 1,v k u k - 1 z k Delay Actuator µ k P xk I k Estimator φ k - 1 u k - 1 z k 5

6 EXAMPLE: A SEARCH PROBLEM At each period, decide to search or not search a site that may contain a treasure. If we search and a treasure is present, we find it with prob. β and remove it from the site. Treasure s worth: V. Cost of search: C States: treasure present & treasure not present Each search can be viewed as an observation of the state Denote p k : prob. of treasure present at the start of time k with p 0 given. p k evolves at time k according to the equation p k+1 = pk if not search, 0 if search and find treasure, p k1 β if search and no treasure. p k1 β+1 pk This is the filtering equation. 6

7 SEARCH PROBLEM CONTINUED DP algorithm [ J k p k = max 0, C +p k βv p ] k1 β +1 p k β J k+1, p k 1 β+1 p k with J N p N = 0. Can be shown by induction that the functions J k satisfy J k p k 0 if pk C = βv, > 0 if p k > C βv. Furthermore, it is optimal to search at period k if and only if p k βv C expected reward from the next search the cost of the search - a myopic rule 7

8 FINITE-STATE SYSTEMS - POMDP Suppose the system is a finite-state Markov chain, with states 1,...,n. Then the conditional probability distribution is an n-vector P x I k k Pxk = 1 I k,...,px k = n I k The DP algorithm can be executed over the n- dimensional simplex state space is not expanding with increasing k When the control and observation spaces are also finite sets the problem is called a POMDP Partially Observed Markov Decision Problem. For POMDP it turns out that the cost-to-go functions J k in the DP algorithm are piecewise linear and concave Exercise 5.7 Useful in practice both for exact and approximate computation. 8

9 INSTRUCTION EXAMPLE I Teaching a student some item. Possible states are L: Item learned, or L: Item not learned. Possible decisions: T: Terminate the instruction, or T: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item. Possible test outcomes: R: Student gives a correct answer, or R: Student gives an incorrect answer. Probabilistic structure 1 1 L L R t r L 1 - t L 1 - r R Cost of instruction: I per period Cost of terminating instruction: 0 if student has learned the item, and C > 0 if not. 9

10 INSTRUCTION EXAMPLE II Let p k : prob. student has learned the item given the test results so far p k = Px k = L z 0,z 1,...,z k. Filtering equation: Using Bayes rule p k+1 = Φp k,z k+1 = { 1 1 t1 p k 1 1 t1 r1 p k if z k+1 = R, 0 if z k+1 = R. DP algorithm: J k p k = min starting with [ { } ] 1 p k C, I + E Jk+1 Φpk,z k+1 z k+1 J N 1 p N 1 = min [ 1 p N 1 C, I+1 t1 p N 1 C ]. 10

11 INSTRUCTION EXAMPLE III Write the DP algorithm as where J k p k = min [ ] 1 p k C, I +A k p k, A k p k = Pz k+1 = R I k J k+1 Φpk,R +Pz k+1 = R I k J k+1 Φpk,R Can show by induction that A k p are piecewise linear, concave, monotonically decreasing, with A k 1 p A k p A k+1 p, for all p [0,1]. The cost-to-go at knowledge prob. p increases as we come closer to the end of horizon. C I + A N - 1 p I + A N - 2 p I + A N - 3 p I 0 a N - 1 a N - 2 a N - 3 I 1-1 p C 11

12 MIT OpenCourseWare Dynamic Programming and Stochastic Control Fall 2015 For information about citing these materials or our Terms of Use, visit:

6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 9 LECTURE OUTLINE Rollout algorithms Policy improvement property Discrete deterministic problems Approximations of rollout algorithms Model Predictive Control (MPC) Discretization