Optimal Stopping Problems


2.997 Decision Making in Large Scale Systems          MIT, Spring 2004
March 3                                               Handout #9

Lecture Note 5: Optimal Stopping Problems

In the last lecture, we analyzed the behavior of TD(λ) for approximating the cost-to-go function in autonomous systems. Recall that much of the analysis was based on the idea of sampling states according to their stationary distribution. This was done either explicitly, as was assumed in approximate value iteration, or implicitly through the simulation or observation of system trajectories.

It is unclear how this line of analysis can be extended to general controlled systems. In the presence of multiple policies, there are in general multiple stationary state distributions to be considered, and it is not clear which one should be used. Moreover, the dynamic programming operator $T$ may not be a contraction with respect to any such distribution, which invalidates the argument used in the autonomous case.

However, there is a special class of control problems for which an analysis of TD(λ) along the same lines used in the case of autonomous systems is successful. These problems are called optimal stopping problems, and they are characterized by a tuple $(S,\; P : S \times S \to [0,1],\; g_0 : S \to \mathbb{R},\; g_1 : S \to \mathbb{R})$, with the following interpretation. The problem involves a Markov decision process with state space $S$. In each state, two actions are available: stop (action 0) or continue (action 1). Once action 0 (stop) is selected, the system is halted and a final cost $g_0(x)$ is incurred, based on the final state $x$. At each previous time stage, action 1 (continue) is selected and a cost $g_1(x)$ is incurred; in this case, the system transitions from state $x$ to state $y$ with probability $P(x,y)$. Each policy $u$ corresponds to a (random) stopping time $\tau_u$, given by

$$\tau_u = \min\{k : u(x_k) = 0\}.$$

Example 1 (American Options). An American call option is an option to buy a stock at a price $K$, called the strike price, on or before an expiration date, the last time period at which the option can be exercised. The state of such a problem is the stock price $P_k$. Exercising the option corresponds to the stop action and leads to a reward $\max(0, P_k - K)$; not exercising the option corresponds to the continue action and incurs no costs or rewards.

Example 2 (The Secretary Problem). In the secretary problem, a manager needs to hire a secretary. He interviews candidates sequentially and must decide whether to hire each one immediately after the interview. Each interview incurs a certain cost for the hours spent meeting with the candidate, and hiring a given person yields a reward that is a function of that person's abilities.

In the infinite-horizon, discounted-cost case, each policy is associated with a discounted cost-to-go

$$J_u(x) = E\left[\sum_{t=0}^{\tau_u - 1} \alpha^t g_1(x_t) + \alpha^{\tau_u} g_0(x_{\tau_u}) \,\Big|\, x_0 = x\right].$$

We are interested in finding the policy $u$ with optimal cost-to-go

$$J^* = \min_u J_u.$$
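To make these definitions concrete, here is a minimal simulation sketch (not part of the original note) that estimates $J_u(x)$ for a given stopping rule $u$ by Monte Carlo; the three-state transition matrix, the costs, and the discount factor are illustrative placeholders.

```python
import numpy as np

def estimate_Ju(x0, u, P, g0, g1, alpha, n_paths=10000, horizon=500, rng=None):
    """Monte Carlo estimate of J_u(x0) = E[ sum_{t<tau} alpha^t g1(x_t) + alpha^tau g0(x_tau) ]."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]
    total = 0.0
    for _ in range(n_paths):
        x, cost, disc = x0, 0.0, 1.0
        for _ in range(horizon):              # truncate trajectories that never stop
            if u(x) == 0:                     # action 0: stop and pay the terminal cost
                cost += disc * g0[x]
                break
            cost += disc * g1[x]              # action 1: continue, pay g1, and move
            x = rng.choice(n, p=P[x])
            disc *= alpha
        total += cost
    return total / n_paths

# illustrative three-state example (all numbers are placeholders)
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
g0 = np.array([1.0, 0.3, 0.0])     # cost of stopping in each state
g1 = np.array([0.1, 0.1, 0.1])     # per-stage cost of continuing
alpha = 0.9
u = lambda x: 0 if x == 2 else 1   # example rule: stop only in state 2
print(estimate_Ju(0, u, P, g0, g1, alpha))
```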

Bellman's equation for the optimal stopping problem is given by

$$J^* = \min(g_0,\; g_1 + \alpha P J^*) \equiv T J^*.$$

As usual, $T$ is a maximum-norm contraction, and Bellman's equation has a unique solution corresponding to the optimal cost-to-go function $J^*$. Moreover, $T$ is also a weighted Euclidean norm contraction.

Lemma 1. Let $\pi$ be a stationary distribution associated with $P$, i.e., $\pi^T P = \pi^T$, and let $D = \mathrm{diag}(\pi)$ be the diagonal matrix whose diagonal entries correspond to the elements of the vector $\pi$. Then, for all $J$ and $\bar J$, we have

$$\|TJ - T\bar J\|_{2,D} \le \alpha \|J - \bar J\|_{2,D}.$$

Proof: Note that, for any scalars $c_1$, $c_2$ and $c_3$, we have

$$|\min(c_1, c_3) - \min(c_2, c_3)| \le |c_1 - c_2|. \tag{1}$$

It is straightforward to show Eq. (1). Assume without loss of generality that $c_1 \ge c_2$. If $c_1 \ge c_2 \ge c_3$ or $c_3 \ge c_1 \ge c_2$, Eq. (1) holds trivially. The only remaining case is $c_1 \ge c_3 \ge c_2$, and we have

$$|\min(c_1, c_3) - \min(c_2, c_3)| = c_3 - c_2 \le c_1 - c_2.$$

Applying (1) componentwise, we have

$$|TJ - T\bar J| = |\min(g_0, g_1 + \alpha P J) - \min(g_0, g_1 + \alpha P \bar J)| \le \alpha\, |P(J - \bar J)|.$$

It follows that

$$\|TJ - T\bar J\|_{2,D} \le \alpha \|P(J - \bar J)\|_{2,D} \le \alpha \|J - \bar J\|_{2,D},$$

where the last inequality follows from $\|PJ\|_{2,D} \le \|J\|_{2,D}$, which is a consequence of Jensen's inequality and the stationarity of $\pi$.

2.1 TD(λ)

Recall the Q-function

$$Q^*(x, a) = g_a(x) + \alpha \sum_y P_a(x, y) J^*(y).$$

In general control problems, storing $Q^*$ may require substantially more space than storing $J^*$, since $Q^*$ is a function of state-action pairs. However, in the case of optimal stopping problems, storing $Q^*$ requires essentially the same space as $J^*$, since the Q-value of stopping is trivially equal to $g_0(x)$. Hence, in the case of optimal stopping problems, we can set up TD(λ) to learn

$$Q^* = g_1 + \alpha P J^*,$$

the cost of choosing to continue now and behaving optimally afterwards. Note that, assuming the one-stage costs $g_0$ and $g_1$ are known, we can derive an optimal policy from $Q^*$ by comparing the cost of continuing with the cost of stopping, which is simply $g_0$.
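As a quick numerical illustration of the Bellman operator $T$ and the contraction property in Lemma 1, the following sketch (not from the note; the small transition matrix and costs are placeholders) iterates $J \leftarrow TJ = \min(g_0, g_1 + \alpha P J)$ to a fixed point and then recovers the continuation values $Q^* = g_1 + \alpha P J^*$.

```python
import numpy as np

def value_iteration(P, g0, g1, alpha, tol=1e-10, max_iter=100000):
    """Iterate J <- T J = min(g0, g1 + alpha * P J); since T is a contraction, this converges to J*."""
    J = np.zeros(P.shape[0])
    for _ in range(max_iter):
        TJ = np.minimum(g0, g1 + alpha * P @ J)
        if np.max(np.abs(TJ - J)) < tol:
            return TJ
        J = TJ
    return J

# same illustrative three-state example as in the previous sketch
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
g0 = np.array([1.0, 0.3, 0.0])
g1 = np.array([0.1, 0.1, 0.1])
alpha = 0.9

J_star = value_iteration(P, g0, g1, alpha)
Q_star = g1 + alpha * P @ J_star            # cost of continuing and then acting optimally
u_star = np.where(g0 <= Q_star, 0, 1)       # stop (0) wherever stopping is no more costly
print(J_star, Q_star, u_star)
```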

We can express $J^*$ in terms of $Q^*$ as

$$J^* = \min(g_0, Q^*),$$

so that $Q^*$ also satisfies the recursive relation

$$Q^* = g_1 + \alpha P \min(g_0, Q^*).$$

This leads to the following version of TD(λ). Let

$$r_{k+1} = r_k + \gamma_k z_k \left[ g_1(x_k) + \alpha \min\bigl(g_0(x_{k+1}),\, \phi(x_{k+1}) r_k\bigr) - \phi(x_k) r_k \right]$$

and

$$z_k = \alpha \lambda z_{k-1} + \phi(x_k).$$

We can rewrite TD(λ) in terms of the operator $H$, given by

$$HQ = g_1 + \alpha P \min(g_0, Q), \tag{2}$$

and

$$H^\lambda Q = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t H^{t+1} Q,$$

so that

$$r_{k+1} = r_k + \gamma_k z_k \bigl(H^\lambda \Phi r_k - \Phi r_k + w_k\bigr). \tag{3}$$

The following property of $H$ implies that an analysis of TD(λ) for optimal stopping problems along the same lines as the analysis for autonomous systems suffices to establish convergence of the algorithm.

Lemma 2. For all $Q$ and $\bar Q$,

$$\|HQ - H\bar Q\|_{2,D} \le \alpha \|Q - \bar Q\|_{2,D}.$$

Theorem 1 (analogous to autonomous systems). Let $r_k$ be given by (3) and suppose that $P$ is irreducible and aperiodic. Then $r_k \to r^*$ w.p. 1, where $r^*$ satisfies

$$\Phi r^* = \Pi H^\lambda \Phi r^*$$

and

$$\|\Phi r^* - Q^*\|_{2,D} \le \frac{1}{\sqrt{1 - k^2}}\, \|\Pi Q^* - Q^*\|_{2,D}, \tag{4}$$

where $k = \dfrac{\alpha(1-\lambda)}{1 - \alpha\lambda}$.
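For concreteness, here is a minimal single-trajectory implementation sketch of the update above (not from the note); the basis matrix $\Phi$, the diminishing step sizes $\gamma_k$, and the small example are illustrative assumptions.

```python
import numpy as np

def td_lambda_stopping(P, g0, g1, alpha, lam, Phi, n_steps=200000, seed=0):
    """TD(lambda) for optimal stopping: learn Q* ~ Phi r along a single trajectory of P."""
    rng = np.random.default_rng(seed)
    n, m = Phi.shape
    r = np.zeros(m)
    z = np.zeros(m)                            # eligibility trace z_k
    x = rng.integers(n)
    for k in range(n_steps):
        x_next = rng.choice(n, p=P[x])
        z = alpha * lam * z + Phi[x]           # z_k = alpha*lambda*z_{k-1} + phi(x_k)
        # temporal difference: g1(x_k) + alpha*min(g0(x_{k+1}), phi(x_{k+1}) r_k) - phi(x_k) r_k
        d = g1[x] + alpha * min(g0[x_next], Phi[x_next] @ r) - Phi[x] @ r
        gamma_k = 10.0 / (k + 100.0)           # illustrative diminishing step size
        r = r + gamma_k * d * z
        x = x_next
    return r

# placeholder example with two overlapping basis functions
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
g0 = np.array([1.0, 0.3, 0.0])
g1 = np.array([0.1, 0.1, 0.1])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
r_hat = td_lambda_stopping(P, g0, g1, alpha=0.9, lam=0.7, Phi=Phi)
print("approximate continuation values Phi r:", Phi @ r_hat)
```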

We can also place a bound on the loss in performance incurred by using a policy that is greedy with respect to $\Phi r^*$, rather than the optimal policy. Specifically, consider the following stopping rule based on $\Phi r^*$:

$$\tilde u(x) = \begin{cases} \text{stop}, & \text{if } g_0(x) \le (\Phi r^*)(x) \\ \text{continue}, & \text{otherwise.} \end{cases}$$

$\tilde u$ specifies a (random) stopping time $\tilde\tau$, given by

$$\tilde\tau = \min\{k : \phi(x_k) r^* \ge g_0(x_k)\},$$

and the cost-to-go associated with $\tilde u$ is given by

$$\tilde J(x) = E\left[\sum_{t=0}^{\tilde\tau - 1} \alpha^t g_1(x_t) + \alpha^{\tilde\tau} g_0(x_{\tilde\tau}) \,\Big|\, x_0 = x\right].$$

The following theorem establishes a bound on the expected difference between $\tilde J$ and the optimal cost-to-go $J^*$.

Theorem 2.

$$E[\tilde J(x) \mid x \sim \pi] - E[J^*(x) \mid x \sim \pi] \le \frac{2}{(1-\alpha)\sqrt{1 - k^2}}\, \|\Pi Q^* - Q^*\|_{2,D}.$$

Proof: Let $\tilde Q$ be the cost of choosing to continue in the current time stage, followed by using policy $\tilde u$:

$$\tilde Q = g_1 + \alpha P \tilde J.$$

Then we have

$$E[\tilde J(x_0) - J^*(x_0) \mid x_0 \sim \pi] = E[\tilde J(x_1) - J^*(x_1) \mid x_0 \sim \pi] = \pi^T P (\tilde J - J^*) = \sum_{x \in S} \pi(x)\,\bigl(P(\tilde J - J^*)\bigr)(x) = \|P\tilde J - P J^*\|_{1,\pi}$$
$$\le \|P\tilde J - P J^*\|_{2,\pi} \tag{5}$$
$$= \frac{1}{\alpha}\,\|(g_1 + \alpha P \tilde J) - (g_1 + \alpha P J^*)\|_{2,\pi} = \frac{1}{\alpha}\,\|\tilde Q - Q^*\|_{2,\pi}. \tag{6}$$

The first equality uses the stationarity of $\pi$ (if $x_0 \sim \pi$, then $x_1 \sim \pi$), and the identification with the norm $\|\cdot\|_{1,\pi}$ uses the fact that $\tilde J \ge J^*$, so that $P(\tilde J - J^*) \ge 0$. The inequality (5) follows from the fact that, for all $J$, we have

$$\|J\|_{1,\pi}^2 = \bigl(E[|J(x)| : x \sim \pi]\bigr)^2 \le E[J(x)^2 : x \sim \pi] = \|J\|_{2,\pi}^2,$$

where the inequality is due to Jensen's inequality.

Now define the operators $\tilde H$ and $K$, given by

$$(KQ)(x) = \begin{cases} g_0(x), & \text{if } g_0(x) \le (\Phi r^*)(x) \\ Q(x), & \text{otherwise,} \end{cases}$$
$$\tilde H Q = g_1 + \alpha P K Q.$$

Recall the definition of the operator $H$ in (2). Then it is easy to verify the following identities:

$$\tilde H \Phi r^* = H \Phi r^*, \qquad H Q^* = Q^*, \qquad \tilde H \tilde Q = \tilde Q.$$

Moreover, it is also easy to show that $\tilde H$ is a contraction with respect to $\|\cdot\|_{2,\pi}$. Now we have

$$\|\tilde Q - Q^*\|_{2,\pi} \le \|\tilde Q - H\Phi r^*\|_{2,\pi} + \|H\Phi r^* - Q^*\|_{2,\pi} = \|\tilde H \tilde Q - \tilde H \Phi r^*\|_{2,\pi} + \|H Q^* - H\Phi r^*\|_{2,\pi}$$
$$\le \alpha \|\tilde Q - \Phi r^*\|_{2,\pi} + \alpha \|Q^* - \Phi r^*\|_{2,\pi} \le \alpha\bigl(\|\tilde Q - Q^*\|_{2,\pi} + \|Q^* - \Phi r^*\|_{2,\pi}\bigr) + \alpha \|Q^* - \Phi r^*\|_{2,\pi}$$
$$= \alpha \|\tilde Q - Q^*\|_{2,\pi} + 2\alpha \|Q^* - \Phi r^*\|_{2,\pi}.$$

Thus,

$$\|\tilde Q - Q^*\|_{2,\pi} \le \frac{2\alpha}{1 - \alpha}\, \|Q^* - \Phi r^*\|_{2,\pi}. \tag{7}$$

The theorem follows from Theorem 1 and equations (6) and (7).
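The sketch below (not from the note) illustrates the greedy stopping rule $\tilde u$ and the quantity bounded in Theorem 2: it computes $\tilde J$ exactly by a small linear solve on a placeholder example, with illustrative weights $r$ standing in for the output of the TD(λ) sketch above.

```python
import numpy as np

def stationary_dist(P):
    """Stationary distribution pi of P (left eigenvector for eigenvalue 1, normalized)."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def J_of_stopping_rule(stop, P, g0, g1, alpha):
    """Exact cost-to-go of the rule 'stop where stop is True', via a small linear solve."""
    J = np.empty(P.shape[0])
    J[stop] = g0[stop]
    cont = ~stop
    if cont.any():
        A = np.eye(cont.sum()) - alpha * P[np.ix_(cont, cont)]
        b = g1[cont] + alpha * P[np.ix_(cont, stop)] @ g0[stop]
        J[cont] = np.linalg.solve(A, b)
    return J

# placeholder example; in practice r would come from the TD(lambda) sketch above
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
g0 = np.array([1.0, 0.3, 0.0])
g1 = np.array([0.1, 0.1, 0.1])
alpha = 0.9
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
r = np.array([0.8, 0.2])

stop = g0 <= Phi @ r                # greedy rule: stop wherever g0(x) <= (Phi r)(x)
J_tilde = J_of_stopping_rule(stop, P, g0, g1, alpha)
pi = stationary_dist(P)
print("E[J~(x) | x ~ pi] =", pi @ J_tilde)
# subtracting pi @ J_star (from the value iteration sketch) gives the left side of Theorem 2
```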

2.2 Discounted Cost Problems with Control

Our analysis of TD(λ) is based on comparisons with approximate value iteration:

$$\Phi r_{k+1} = \Pi T^\lambda \Phi r_k,$$

where the projection represented by $\Pi$ is defined with respect to the weighted Euclidean norm $\|\cdot\|_{2,\pi}$ (or, equivalently, $\|\cdot\|_{2,D}$) and $\pi$ is the stationary distribution associated with the transition matrix $P$. In controlled problems, there is not a single stationary distribution to be taken into account; rather, for every policy $u$, there is a stationary distribution $\pi_u$ associated with the transition probabilities $P_u$. A natural way of extending AVI to the controlled case is to consider a scheme of the form

$$\Phi r_{k+1} = \Pi_{u_k} T_{u_k} \Phi r_k, \tag{8}$$

where $\Pi_u$ represents the projection based on $\|\cdot\|_{2,\pi_u}$ and $u_k$ is the greedy policy with respect to $\Phi r_k$. Such a scheme is a plausible approximation, for instance, of approximate policy iteration based on TD(λ) trained with system trajectories:

1. Select a policy $u_0$. Let $k = 0$.
2. Fit $J_{u_k} \approx \Phi r_k$ (e.g., via TD(λ) for autonomous systems).
3. Choose $u_{k+1}$ to be greedy with respect to $\Phi r_k$. Let $k = k + 1$. Go back to step 2.

Note that step 2 in the approximate policy iteration presented above involves training over an infinitely long trajectory, under a single policy, in order to perform policy evaluation. Drawing inspiration from asynchronous policy iteration, one may consider performing policy updates before the policy evaluation step is completed. As it turns out, none of these algorithms is guaranteed to converge. In particular, approximate value iteration (8) is not even guaranteed to have a fixed point. For an analysis of approximate value iteration, see [1].

A special situation where AVI and TD(λ) are guaranteed to converge occurs when the basis functions are constant over partitions of the state space.

Theorem 3. Suppose that $\phi_i(x) = 1\{x \in A_i\}$, where $A_i \cap A_j = \emptyset$ for all $i \ne j$. Then

$$\Phi r_{k+1} = \Pi T \Phi r_k$$

converges for any weighted Euclidean projection $\Pi$.

Proof: We will show that, if $\Pi$ is a weighted Euclidean projection and the $\phi_i$ satisfy the assumption of the theorem, then $\Pi$ is a maximum-norm nonexpansion:

$$\|\Pi J - \Pi \bar J\|_\infty \le \|J - \bar J\|_\infty.$$

Let $(\Pi J)(x) = K_i$ if $x \in A_i$, where

$$K_i = \arg\min_r \sum_{x \in A_i} w(x)\bigl(J(x) - r\bigr)^2.$$

Thus

$$K_i = E[J(x) : x \sim w_i], \quad \text{where} \quad w_i(x) = \frac{w(x)}{\sum_{\bar x \in A_i} w(\bar x)}.$$

Therefore

$$|(\Pi J)(x) - (\Pi \bar J)(x)| = \bigl|E[J(x) - \bar J(x) : x \sim w_i]\bigr| \le \|J - \bar J\|_\infty.$$

It follows that $\Pi T$ is a maximum-norm contraction, which ensures convergence of approximate value iteration.
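A minimal sketch of this aggregation scheme (not from the note; the partition, projection weights $w$, and example numbers are illustrative assumptions) runs AVI with indicator basis functions, where the weighted Euclidean projection reduces to averaging $T\Phi r$ over each cell $A_i$.

```python
import numpy as np

def aggregated_avi(P, g0, g1, alpha, partition, w, n_iter=500):
    """AVI  Phi r_{k+1} = Pi T Phi r_k  with indicator basis phi_i(x) = 1{x in A_i}.
    With this basis, the w-weighted projection Pi simply averages T(Phi r) over each cell A_i."""
    m = int(np.max(partition)) + 1
    r = np.zeros(m)
    for _ in range(n_iter):
        J = r[partition]                             # (Phi r)(x) = r_i for x in A_i
        TJ = np.minimum(g0, g1 + alpha * P @ J)      # apply the Bellman operator T
        for i in range(m):
            idx = np.flatnonzero(partition == i)
            r[i] = np.average(TJ[idx], weights=w[idx])
    return r

# placeholder example: three states aggregated into two cells, uniform projection weights
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
g0 = np.array([1.0, 0.3, 0.0])
g1 = np.array([0.1, 0.1, 0.1])
partition = np.array([0, 0, 1])    # A_0 = {0, 1}, A_1 = {2}
w = np.ones(3)
print(aggregated_avi(P, g0, g1, alpha=0.9, partition=partition, w=w))
```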

References

[1] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000.
