
Dynamic Programming and Optimal Control
THIRD EDITION

Dimitri P. Bertsekas
Massachusetts Institute of Technology

Selected Theoretical Problem Solutions
Last Updated 10/1/2008

Athena Scientific, Belmont, Mass.
WWW site for book information and orders: http://www.athenasc.com/

NOTE

This solution set is meant to be a significant extension of the scope and coverage of the book. It includes solutions to all of the book's exercises marked with the symbol www. The solutions are continuously updated and improved, and additional material, including new problems and their solutions, is being added. Please send comments and suggestions for additions and improvements to the author at dimitrib@mit.edu. The solutions may be reproduced and distributed for personal or educational uses.

Solutions Vol. I, Chapter 1

1.16 www

(a) Given a sequence of matrix multiplications

$$M_1 M_2 \cdots M_k M_{k+1} \cdots M_N,$$

we represent it by a sequence of numbers $\{n_1, \dots, n_{N+1}\}$, where $n_k \times n_{k+1}$ is the dimension of $M_k$. Let the initial state be $x_0 = \{n_1, \dots, n_{N+1}\}$. Then choosing the first multiplication to be carried out corresponds to choosing an element from the set $x_0 \setminus \{n_1, n_{N+1}\}$. For instance, choosing $n_2$ corresponds to multiplying $M_1$ and $M_2$, which results in a matrix of dimension $n_1 \times n_3$, and the initial state must be updated to discard $n_2$, the control applied at that stage. Hence at each stage the state represents the dimensions of the matrices resulting from the multiplications done so far. The allowable controls at stage $k$ are $u_k \in x_k \setminus \{n_1, n_{N+1}\}$. The system equation evolves according to

$$x_{k+1} = x_k \setminus \{u_k\}.$$

Note that the control will be applied $N-1$ times, so the horizon of this problem is $N-1$. The terminal state is $x_{N-1} = \{n_1, n_{N+1}\}$ and the terminal cost is 0. The cost at stage $k$ is given by the number of scalar multiplications,

$$g_k(x_k, u_k) = n_a n_{u_k} n_b,$$

where $n_{u_k} = u_k$ and

$$a = \max\{i \in \{1, \dots, N+1\} \mid i < u_k,\ i \in x_k\}, \qquad b = \min\{i \in \{1, \dots, N+1\} \mid i > u_k,\ i \in x_k\}.$$

The DP algorithm for this problem is given by

$$J_{N-1}(x_{N-1}) = 0,$$
$$J_k(x_k) = \min_{u_k \in x_k \setminus \{n_1, n_{N+1}\}} \bigl\{ n_a n_{u_k} n_b + J_{k+1}\bigl(x_k \setminus \{u_k\}\bigr) \bigr\}, \qquad k = 0, \dots, N-2.$$

Now consider the given problem, where $N = 3$ and $M_1$ is $2 \times 10$, $M_2$ is $10 \times 5$, $M_3$ is $5 \times 1$. The optimal order is $M_1 (M_2 M_3)$, requiring 70 multiplications.

(b) In this part we can choose a much simpler state space. Let the state at stage $k$ be given by $\{a, b\}$, where $a, b \in \{1, \dots, N\}$ give the indices of the first and the last matrix in the current partial product. There are two possible controls at each stage, which we denote by L and R. Note that L can be applied only when $a \neq 1$ and R can be applied only when $b \neq N$. The system equation evolves according to

$$x_{k+1} = \begin{cases} \{a-1, b\}, & \text{if } u_k = \text{L}, \\ \{a, b+1\}, & \text{if } u_k = \text{R}, \end{cases} \qquad k = 1, \dots, N-1.$$

The terminal state is $x_N = \{1, N\}$ with cost 0. The cost at stage $k$ is given by

$$g_k(x_k, u_k) = \begin{cases} n_{a-1} n_a n_{b+1}, & \text{if } u_k = \text{L}, \\ n_a n_{b+1} n_{b+2}, & \text{if } u_k = \text{R}, \end{cases} \qquad k = 1, \dots, N-1.$$

For the initial stage, we can take $x_0$ to be the empty set and $u_0 \in \{1, \dots, N\}$. The next state will be $x_1 = \{u_0, u_0\}$, and the cost incurred at the initial stage will be 0 for all possible controls.
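As a concrete check of the count quoted in part (a), the following minimal sketch (not part of the original solution set) runs the classical interval DP for the matrix-chain problem; the function name and the dimension-list encoding are choices of this sketch. It reproduces the 70 multiplications for the $2 \times 10$, $10 \times 5$, $5 \times 1$ instance.

from functools import lru_cache

def matrix_chain(dims):
    """Minimum scalar multiplications to compute M_1 ... M_N,
    where M_k has dimensions dims[k-1] x dims[k]."""
    N = len(dims) - 1

    @lru_cache(maxsize=None)
    def J(a, b):
        # optimal cost of multiplying out the partial product M_a ... M_b
        if a == b:
            return 0
        # s is the split point: the last multiplication joins
        # (M_a ... M_s) and (M_{s+1} ... M_b)
        return min(J(a, s) + J(s + 1, b) + dims[a - 1] * dims[s] * dims[b]
                   for s in range(a, b))

    return J(1, N)

print(matrix_chain([2, 10, 5, 1]))  # -> 70, achieved by M1 (M2 M3)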

1.18 www

Let $t_1 < t_2 < \cdots < t_{N-1}$ denote the times where $g_1(t) = g_2(t)$. Clearly, it is never optimal to switch functions at any other times. We can therefore divide the problem into $N-1$ stages, where we want to determine for each stage $k$ whether or not to switch activities at time $t_k$. Define

$$x_k = \begin{cases} 0, & \text{if on activity } g_1 \text{ just before time } t_k, \\ 1, & \text{if on activity } g_2 \text{ just before time } t_k, \end{cases} \qquad u_k = \begin{cases} 0, & \text{to continue the current activity}, \\ 1, & \text{to switch between activities}. \end{cases}$$

Then the state at time $t_{k+1}$ is simply $x_{k+1} = (x_k + u_k) \bmod 2$, and the profit for stage $k$ is

$$g_k(x_k, u_k) = \int_{t_k}^{t_{k+1}} g_{1 + x_{k+1}}(t)\, dt - u_k c.$$

The DP algorithm is then

$$J_N(x_N) = 0,$$
$$J_k(x_k) = \max_{u_k} \bigl[ g_k(x_k, u_k) + J_{k+1}\bigl((x_k + u_k) \bmod 2\bigr) \bigr].$$

1.21 www

We consider part (b), since part (a) is essentially a special case. We will consider the problem of placing $N-2$ points between the endpoints $A$ and $B$ of the given subarc. We will show that the polygon of maximal area is obtained when the $N-2$ points are equally spaced on the subarc between $A$ and $B$. Based on geometric considerations, we impose the restriction that the angle between any two successive points is no more than $\pi$.

As the subarc is traversed in the clockwise direction, we number sequentially the encountered points as $x_1, x_2, \dots, x_N$, where $x_1$ and $x_N$ are the two endpoints $A$ and $B$ of the arc, respectively. For any point $x$ on the subarc, we denote by $\phi$ the angle between $x$ and $x_N$ (measured clockwise), and we denote by $A_k(\phi)$ the maximal area of a polygon with vertices the center of the circle, the points $x$ and $x_N$, and $N - k - 1$ additional points on the subarc that lie between $x$ and $x_N$. Without loss of generality, we assume that the radius of the circle is 1, so that the area of the triangle that has as vertices two points on the circle and the center of the circle is $(1/2) \sin u$, where $u$ is the angle corresponding to the center. By viewing as state the angle $\phi_k$ between $x_k$ and $x_N$, and as control the angle $u_k$ between $x_k$ and $x_{k+1}$, we obtain the following DP algorithm:

$$A_k(\phi_k) = \max_{0 \le u_k \le \min\{\phi_k, \pi\}} \Bigl[ \tfrac{1}{2} \sin u_k + A_{k+1}(\phi_k - u_k) \Bigr], \qquad k = 1, \dots, N-2. \tag{1}$$

Once $x_{N-1}$ is chosen, there is no issue of further choice of a point lying between $x_{N-1}$ and $x_N$, so we have

$$A_{N-1}(\phi) = \tfrac{1}{2} \sin \phi, \tag{2}$$

using the formula for the area of the triangle formed by $x_{N-1}$, $x_N$, and the center of the circle. It can be verified by induction that the above algorithm admits the closed-form solution

$$A_k(\phi_k) = \tfrac{1}{2} (N - k) \sin\Bigl( \frac{\phi_k}{N - k} \Bigr), \qquad k = 1, \dots, N-1, \tag{3}$$

and that the optimal choice for $u_k$ is given by

$$u_k = \frac{\phi_k}{N - k}.$$

Indeed, the formula (3) holds for $k = N-1$, by Eq. (2). Assuming that Eq. (3) holds for $k+1$, we have from the DP algorithm (1)

$$A_k(\phi_k) = \max_{0 \le u_k \le \min\{\phi_k, \pi\}} H_k(u_k, \phi_k), \tag{4}$$

where

$$H_k(u_k, \phi_k) = \tfrac{1}{2} \sin u_k + \tfrac{1}{2} (N - k - 1) \sin\Bigl( \frac{\phi_k - u_k}{N - k - 1} \Bigr). \tag{5}$$

It can be verified that for a fixed $\phi_k$, and in the range $0 \le u_k \le \min\{\phi_k, \pi\}$, the function $H_k(\cdot, \phi_k)$ is concave (its second derivative is negative) and its derivative is 0 only at the point $u_k = \phi_k / (N - k)$, which must therefore be its unique maximum. Substituting this value of $u_k$ in Eqs. (4) and (5), we obtain

$$A_k(\phi_k) = \tfrac{1}{2} \sin\Bigl( \frac{\phi_k}{N-k} \Bigr) + \tfrac{1}{2} (N - k - 1) \sin\Bigl( \frac{\phi_k - \phi_k/(N-k)}{N - k - 1} \Bigr) = \tfrac{1}{2} (N - k) \sin\Bigl( \frac{\phi_k}{N-k} \Bigr),$$

and the induction is complete. Thus, given an optimally placed point $x_k$ on the subarc with corresponding angle $\phi_k$, the next point $x_{k+1}$ is obtained by advancing clockwise by $\phi_k / (N - k)$. This process, when started at $x_1$ with $\phi_1$ equal to the angle between $x_1$ and $x_N$, yields as the optimal solution an equally spaced placement of the points on the subarc.
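The closed form (3) can be checked numerically against a brute-force discretization of the recursion (1); the sketch below (not part of the original solution; the instance and grid resolution are arbitrary choices) agrees with the formula to the accuracy of the grid.

import math

def A_closed(k, phi, N):
    # closed-form solution (3): A_k(phi) = (1/2)(N - k) sin(phi / (N - k))
    return 0.5 * (N - k) * math.sin(phi / (N - k))

def A_brute(k, phi, N, steps=120):
    # discretize the control u_k over [0, min(phi, pi)] and apply recursion (1)
    if k == N - 1:
        return 0.5 * math.sin(phi)  # Eq. (2)
    hi = min(phi, math.pi)
    return max(0.5 * math.sin(u) + A_brute(k + 1, phi - u, N, steps)
               for u in (hi * i / steps for i in range(steps + 1)))

N, phi = 5, 2.0  # three interior points on a subarc of angle 2 radians
print(A_closed(1, phi, N))   # ~ 0.95885
print(A_brute(1, phi, N))    # ~ the same, up to discretization error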

1.25 www

(a) Consider the problem with the state equal to the number of free rooms. At state $x \ge 1$, with $y$ customers remaining, if the innkeeper quotes a rate $r_i$, the transition probability is $p_i$ to state $x - 1$ (with a reward of $r_i$) and $1 - p_i$ to state $x$ (with a reward of 0). The DP algorithm for this problem starts with the terminal conditions

$$J(x, 0) = J(0, y) = 0, \qquad \text{for all } x \ge 0 \text{ and } y \ge 0,$$

and is given by

$$J(x, y) = \max_{i = 1, \dots, m} \bigl[ p_i \bigl( r_i + J(x-1, y-1) \bigr) + (1 - p_i) J(x, y-1) \bigr], \qquad x \ge 1.$$

From this equation and the terminal conditions, we can compute sequentially $J(1,1), J(1,2), \dots, J(1,y)$ up to any desired integer $y$. Then we can calculate $J(2,1), J(2,2), \dots, J(2,y)$, etc.

We first prove by induction on $y$ that for all $y$ we have

$$J(x, y) \ge J(x-1, y), \qquad x \ge 1.$$

Indeed, this is true for $y = 0$. Assuming this is true for a given $y$, we will prove that $J(x, y+1) \ge J(x-1, y+1)$ for all $x \ge 1$. This relation holds for $x = 1$ since $r_i > 0$. For $x \ge 2$, by using the DP recursion, this relation is written as

$$\max_{i=1,\dots,m} \bigl[ p_i \bigl( r_i + J(x-1, y) \bigr) + (1 - p_i) J(x, y) \bigr] \ge \max_{i=1,\dots,m} \bigl[ p_i \bigl( r_i + J(x-2, y) \bigr) + (1 - p_i) J(x-1, y) \bigr].$$

By the induction hypothesis, each of the terms on the left-hand side is no less than the corresponding term on the right-hand side, so the above relation holds.

The optimal rate is the one that maximizes in the DP algorithm, or equivalently, the one that maximizes

$$p_i r_i + p_i \bigl( J(x-1, y-1) - J(x, y-1) \bigr).$$

The highest rate $r_m$ simultaneously maximizes $p_i r_i$ and minimizes $p_i$. Since $J(x-1, y-1) - J(x, y-1) \le 0$, as proved above, we see that the highest rate simultaneously maximizes $p_i r_i$ and $p_i \bigl( J(x-1, y-1) - J(x, y-1) \bigr)$, and so it maximizes their sum.

(b) The algorithm given is the algorithm of Exercise 1.22 applied to the problem of part (a). Clearly, it is optimal to accept an offer of $r_i$ if $r_i$ is larger than the threshold

$$r(x, y) = J(x, y-1) - J(x-1, y-1).$$
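The DP table of part (a) is straightforward to compute; the sketch below (not part of the original solution; the rates and acceptance probabilities are illustrative placeholders) builds $J(x, y)$ exactly as described above.

def innkeeper_dp(rates, probs, X, Y):
    """J[x][y]: maximal expected revenue with x free rooms and y customers to come.
    Quoting rate rates[i] is accepted with probability probs[i]."""
    J = [[0.0] * (Y + 1) for _ in range(X + 1)]   # J(x, 0) = J(0, y) = 0
    for y in range(1, Y + 1):
        for x in range(1, X + 1):
            J[x][y] = max(p * (r + J[x - 1][y - 1]) + (1 - p) * J[x][y - 1]
                          for r, p in zip(rates, probs))
    return J

# illustrative data: higher rates are accepted with lower probability
J = innkeeper_dp(rates=[100, 200, 300], probs=[0.9, 0.5, 0.2], X=3, Y=5)
print(J[3][5])   # optimal expected revenue from 3 rooms, 5 arrivals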

1.26 www

(a) The total net expected profit from the (buy/sell) investment decisions, after transaction costs are deducted, is

$$E \Bigl\{ \sum_{k=0}^{N-1} \bigl( u_k P_k(x_k) - c |u_k| \bigr) \Bigr\},$$

where

$$u_k = \begin{cases} 1, & \text{if a unit of stock is bought at the } k\text{th period}, \\ -1, & \text{if a unit of stock is sold at the } k\text{th period}, \\ 0, & \text{otherwise}. \end{cases}$$

With a policy that maximizes this expression, we simultaneously maximize the expected total worth of the stock held at time $N$ minus the investment costs (including sale revenues). The DP algorithm is given by

$$J_k(x_k) = \max_{u_k \in \{-1, 0, 1\}} \bigl[ u_k P_k(x_k) - c |u_k| + E \{ J_{k+1}(x_{k+1}) \mid x_k \} \bigr], \qquad J_N(x_N) = 0,$$

where $J_{k+1}(x_{k+1})$ is the optimal expected profit when the stock price is $x_{k+1}$ at time $k+1$. Since $u_k$ influences neither $x_{k+1}$ nor $E \{ J_{k+1}(x_{k+1}) \mid x_k \}$, a decision $u_k \in \{-1, 0, 1\}$ that maximizes $u_k P_k(x_k) - c |u_k|$ at time $k$ is optimal. Since $P_k(x_k)$ is monotonically nonincreasing in $x_k$, it follows that it is optimal to set

$$u_k = \begin{cases} 1, & \text{if } x_k \le \underline{x}_k, \\ -1, & \text{if } x_k \ge \overline{x}_k, \\ 0, & \text{otherwise}, \end{cases}$$

where $\underline{x}_k$ and $\overline{x}_k$ are as in the problem statement. Note that the optimal expected profit $J_k(x_k)$ is given by

$$J_k(x_k) = E \Bigl\{ \sum_{i=k}^{N-1} \max_{u_i \in \{-1, 0, 1\}} \bigl[ u_i P_i(x_i) - c |u_i| \bigr] \Bigr\}.$$

(b) Let $n_k$ be the number of units of stock held at time $k$. If $n_k$ is less than $N - k$ (the number of remaining decisions), then the value $n_k$ should influence the decision at time $k$. We thus take as state the pair $(x_k, n_k)$, and the corresponding DP algorithm takes the form

$$V_k(x_k, n_k) = \begin{cases} \max_{u_k \in \{-1, 0, 1\}} \bigl[ u_k P_k(x_k) - c |u_k| + E \{ V_{k+1}(x_{k+1}, n_k + u_k) \mid x_k \} \bigr], & \text{if } n_k \ge 1, \\ \max_{u_k \in \{0, 1\}} \bigl[ u_k P_k(x_k) - c |u_k| + E \{ V_{k+1}(x_{k+1}, n_k + u_k) \mid x_k \} \bigr], & \text{if } n_k = 0, \end{cases}$$

with

$$V_N(x_N, n_N) = 0.$$

Note that we have

$$V_k(x_k, n_k) = J_k(x_k), \qquad \text{if } n_k \ge N - k,$$

where $J_k(x_k)$ is given by the formula derived in part (a). Using the above DP algorithm, we can calculate $V_{N-1}(x_{N-1}, n_{N-1})$ for all values of $n_{N-1}$, then calculate $V_{N-2}(x_{N-2}, n_{N-2})$ for all values of $n_{N-2}$, etc.

To show the stated property of the optimal policy, we note that $V_k(x_k, n_k)$ is monotonically nondecreasing with $n_k$, since as $n_k$ decreases, the remaining decisions become more constrained. An optimal policy at time $k$ is to buy if

$$P_k(x_k) - c + E \bigl\{ V_{k+1}(x_{k+1}, n_k + 1) - V_{k+1}(x_{k+1}, n_k) \bigm| x_k \bigr\} \ge 0, \tag{1}$$

and to sell if

$$-P_k(x_k) - c + E \bigl\{ V_{k+1}(x_{k+1}, n_k - 1) - V_{k+1}(x_{k+1}, n_k) \bigm| x_k \bigr\} \ge 0. \tag{2}$$

The expected value in Eq. (1) is nonnegative, which implies that if $x_k \le \underline{x}_k$, so that $P_k(x_k) - c \ge 0$, then the buying decision is optimal. Similarly, the expected value in Eq. (2) is nonpositive, which implies that if $x_k < \overline{x}_k$, so that $-P_k(x_k) - c < 0$, then the selling decision cannot be optimal. It is possible that buying at a price greater than $\underline{x}_k$ is optimal, depending on the size of the expected-value term in Eq. (1).

(c) Let $m_k$ be the number of allowed purchase decisions at time $k$, i.e., $m$ plus the number of sale decisions up to $k$, minus the number of purchase decisions up to $k$. If $m_k$ is less than $N - k$ (the number of remaining decisions), then the value $m_k$ should influence the decision at time $k$. We thus take as state the pair $(x_k, m_k)$, and the corresponding DP algorithm takes the form

$$W_k(x_k, m_k) = \begin{cases} \max_{u_k \in \{-1, 0, 1\}} \bigl[ u_k P_k(x_k) - c |u_k| + E \{ W_{k+1}(x_{k+1}, m_k - u_k) \mid x_k \} \bigr], & \text{if } m_k \ge 1, \\ \max_{u_k \in \{-1, 0\}} \bigl[ u_k P_k(x_k) - c |u_k| + E \{ W_{k+1}(x_{k+1}, m_k - u_k) \mid x_k \} \bigr], & \text{if } m_k = 0, \end{cases}$$

with

$$W_N(x_N, m_N) = 0.$$

From this point the analysis is similar to that of part (b).

(d) The DP algorithm takes the form

$$H_k(x_k, m_k, n_k) = \max_{u_k \in \{-1, 0, 1\}} \bigl[ u_k P_k(x_k) - c |u_k| + E \{ H_{k+1}(x_{k+1}, m_k - u_k, n_k + u_k) \mid x_k \} \bigr]$$

if $m_k \ge 1$ and $n_k \ge 1$, and similar formulas apply for the cases where $m_k = 0$ and/or $n_k = 0$ [compare with the DP algorithms of parts (b) and (c)].

(e) Let $r$ be the interest rate, so that $x$ invested dollars at time $k$ will become $(1 + r)^{N-k} x$ dollars at time $N$. Once we redefine the expected profit $P_k(x_k)$ to be

$$P_k(x) = E \{ x_N \mid x_k = x \} - (1 + r)^{N-k} x,$$

the preceding analysis applies.

Solutions Vol. I, Chapter 2

2.4 www

(a) We denote by $P_k$ the OPEN list after $k$ nodes have been removed from OPEN (i.e., after $k$ iterations of the algorithm), and we denote by $d_j^k$ the value of $d_j$ at this time. Let $b_k = \min_{j \in P_k} \{ d_j^k \}$.

First, we show by induction that $b_0 \le b_1 \le \cdots \le b_k$. Indeed, $b_0 = 0$ and $b_1 = \min_j \{ a_{sj} \} \ge 0$, which implies that $b_0 \le b_1$. Next, we assume that $b_0 \le \cdots \le b_k$ for some $k \ge 1$; we shall prove that $b_k \le b_{k+1}$. Let $j_{k+1}$ be the node removed from OPEN during the $(k+1)$st iteration. By assumption, $d_{j_{k+1}}^k = \min_{j \in P_k} \{ d_j^k \} = b_k$, and we also have

$$d_i^{k+1} = \min\bigl\{ d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i} \bigr\}.$$

We have $P_{k+1} = (P_k \setminus \{j_{k+1}\}) \cup N_{k+1}$, where $N_{k+1}$ is the set of nodes $i \notin P_k$ satisfying $d_i^{k+1} = d_{j_{k+1}}^k + a_{j_{k+1} i}$. Therefore,

$$\min_{i \in P_{k+1}} \{ d_i^{k+1} \} = \min\Bigl[ \min_{i \in P_k \setminus \{j_{k+1}\}} \{ d_i^{k+1} \},\ \min_{i \in N_{k+1}} \{ d_i^{k+1} \} \Bigr].$$

Clearly,

$$\min_{i \in N_{k+1}} \{ d_i^{k+1} \} = \min_{i \in N_{k+1}} \bigl\{ d_{j_{k+1}}^k + a_{j_{k+1} i} \bigr\} \ge d_{j_{k+1}}^k.$$

Moreover,

$$\min_{i \in P_k \setminus \{j_{k+1}\}} \{ d_i^{k+1} \} = \min_{i \in P_k \setminus \{j_{k+1}\}} \min\bigl\{ d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i} \bigr\} \ge \min\Bigl[ \min_{i \in P_k \setminus \{j_{k+1}\}} \{ d_i^k \},\ d_{j_{k+1}}^k \Bigr] = \min_{i \in P_k} \{ d_i^k \} = d_{j_{k+1}}^k,$$

because we remove from OPEN the node with the minimum $d_i^k$. It follows that

$$b_{k+1} = \min_{i \in P_{k+1}} \{ d_i^{k+1} \} \ge d_{j_{k+1}}^k = b_k.$$

Now we may prove that once a node exits OPEN, it never re-enters. Indeed, suppose that some node $i$ exits OPEN after the $k^*$th iteration of the algorithm; then $d_i^{k^*-1} = b_{k^*-1}$. If node $i$ re-enters OPEN after the $l$th iteration (with $l > k^*$), then we have

$$d_i^{l-1} > d_i^l = d_{j_l}^{l-1} + a_{j_l i} \ge d_{j_l}^{l-1} = b_{l-1}.$$

On the other hand, since $d_i$ is non-increasing, we have $b_{k^*-1} = d_i^{k^*-1} \ge d_i^{l-1}$. Thus we obtain $b_{k^*-1} > b_{l-1}$, which contradicts the fact that $b_k$ is non-decreasing.

Next, we claim the following: after the $k$th iteration, $d_i^k$ equals the length of the shortest path from $s$ to node $i \in P_k$ under the restriction that all intermediate nodes belong to $C_k$ (the set of nodes that have exited OPEN). The proof is by induction on $k$. For $k = 1$, we have $C_1 = \{s\}$ and $d_i^1 = a_{si}$, and the claim is obviously true. Next, we assume that the claim is true after iterations $1, \dots, k$; we shall show that it is also true after iteration $k+1$. The node $j_{k+1}$ removed from OPEN at the $(k+1)$st iteration satisfies $\min_{i \in P_k} \{ d_i^k \} = d_{j_{k+1}}^k$. Notice now that all neighbors of the nodes in $C_k$ belong either to $C_k$ or to $P_k$. It follows that the shortest path from $s$ to $j_{k+1}$ either goes through $C_k$, or it exits $C_k$, then passes through a node $j \in P_k$, and eventually reaches $j_{k+1}$. In the latter case, the length of this path is at least equal to the length of the shortest path from $s$ to $j$ through $C_k$; by the induction hypothesis, this equals $d_j^k$, which is at least $d_{j_{k+1}}^k$. It follows that, for node $j_{k+1}$ exiting the OPEN list, $d_{j_{k+1}}^k$ equals the length of the shortest path from $s$ to $j_{k+1}$. Similarly, all nodes that have exited previously have their current estimate of $d_i$ equal to the corresponding shortest distance from $s$.

Notice now that $d_i^{k+1} = \min\{ d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i} \}$. For $i \notin P_k$ and $i \in P_{k+1}$, the only neighbor of $i$ in $C_{k+1} = C_k \cup \{j_{k+1}\}$ is node $j_{k+1}$; for such a node $i$, $d_i^k = \infty$, which leads to $d_i^{k+1} = d_{j_{k+1}}^k + a_{j_{k+1} i}$. For $i \ne j_{k+1}$ and $i \in P_k$, the augmentation of $C_k$ by including $j_{k+1}$ offers one more class of paths from $s$ to $i$ through $C_{k+1}$, namely those through $j_{k+1}$. Recall that the shortest path from $s$ to $i$ through $C_k$ has length $d_i^k$ (by the induction hypothesis). Thus

$$d_i^{k+1} = \min\bigl\{ d_i^k,\ d_{j_{k+1}}^k + a_{j_{k+1} i} \bigr\}$$

is the length of the shortest path from $s$ to $i$ through $C_{k+1}$. The fact that each node exits OPEN with its current estimate of $d_i$ equal to its shortest distance from $s$ has been proved in the course of the previous inductive argument.

(b) Since each node enters the OPEN list at most once, the algorithm will terminate in at most $N - 1$ iterations. Updating the $d_i$'s during an iteration and selecting the node to exit OPEN requires $O(N)$ arithmetic operations (i.e., a constant number of operations per node). Thus, the total number of operations is $O(N^2)$.
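The algorithm analyzed above is Dijkstra's method. A minimal sketch follows (not part of the original solution; it uses a binary heap for the minimum selection, which improves on the $O(N^2)$ count of part (b) for sparse graphs, and the example graph is arbitrary).

import heapq

def dijkstra(graph, s):
    """graph[i] = list of (j, a_ij) with a_ij >= 0; returns shortest distances from s."""
    d = {s: 0.0}
    open_list = [(0.0, s)]       # the OPEN list, keyed by the current label d_i
    closed = set()               # the set C_k of nodes that have exited OPEN
    while open_list:
        di, i = heapq.heappop(open_list)
        if i in closed:
            continue             # stale entry: node already exited with its final label
        closed.add(i)
        for j, a_ij in graph.get(i, []):
            if j not in closed and di + a_ij < d.get(j, float("inf")):
                d[j] = di + a_ij
                heapq.heappush(open_list, (d[j], j))
    return d

g = {"s": [("a", 1.0), ("b", 4.0)], "a": [("b", 2.0), ("t", 6.0)], "b": [("t", 1.0)]}
print(dijkstra(g, "s"))   # {'s': 0.0, 'a': 1.0, 'b': 3.0, 't': 4.0}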

2.6 www

Proposition: If there exists a path from the origin to each node in $T$, the modified version of the label correcting algorithm terminates with UPPER $< \infty$ and yields a shortest path from the origin to each node in $T$. Otherwise the algorithm terminates with UPPER $= \infty$.

Proof: The proof is analogous to the proof of Proposition 3.1. To show that this algorithm terminates, we can use the identical argument in the proof of Proposition 3.1.

Now suppose that for some node $t \in T$, there is no path from $s$ to $t$. Then a node $i$ such that $(i, t)$ is an arc cannot enter the OPEN list, because this would establish that there is a path from $s$ to $i$, and therefore also a path from $s$ to $t$. Thus, $d_t$ is never changed and UPPER is never reduced from its initial value of $\infty$.

Suppose now that there is a path from $s$ to each node $t \in T$. Then, since there is a finite number of distinct lengths of paths from $s$ to each $t \in T$ that do not contain any cycles, and each cycle has nonnegative length, there is also a shortest path. For an arbitrary $t \in T$, let $(s, j_1, j_2, \dots, j_k, t)$ be a shortest path and let $d_t^*$ be the corresponding shortest distance. We will show that the value of UPPER upon termination must be equal to $d^* = \max_{t \in T} d_t^*$. Indeed, each subpath $(s, j_1, \dots, j_m)$, $m = 1, \dots, k$, of the shortest path $(s, j_1, \dots, j_k, t)$ must be a shortest path from $s$ to $j_m$. Suppose, to arrive at a contradiction, that the value of UPPER is larger than $d^*$ at termination. Then the same must be true throughout the algorithm, and therefore UPPER will also be larger than the length of all the paths $(s, j_1, \dots, j_m)$, $m = 1, \dots, k$, throughout the algorithm, in view of the nonnegative arc length assumption. If, for each $t \in T$, the parent node $j_k$ were to enter the OPEN list with $d_{j_k}$ equal to the shortest distance from $s$ to $j_k$, UPPER would be set to $d^*$ in Step 2 immediately following the next time the last of the nodes $j_k$ is examined by the algorithm in Step 2. It follows that, for some $t \in T$, the associated parent node $j_k$ will never enter the OPEN list with $d_{j_k}$ equal to the shortest distance from $s$ to $j_k$. Similarly, and using also the nonnegative length assumption, this means that node $j_{k-1}$ will never enter the OPEN list with $d_{j_{k-1}}$ equal to the shortest distance from $s$ to $j_{k-1}$. Proceeding backwards, we conclude that $j_1$ never enters the OPEN list with $d_{j_1}$ equal to the shortest distance from $s$ to $j_1$ [which is equal to the length of the arc $(s, j_1)$]. This happens, however, at the first iteration of the algorithm, and we obtain a contradiction. It follows that at termination, UPPER will be equal to $d^*$.

Finally, it can be seen that, upon termination of the algorithm, the path constructed by tracing the parent nodes backward from $t$ to $s$ has length equal to $d_t^*$ for each $t \in T$. Thus the path is a shortest path from $s$ to $t$.

2.13 www

(a) We first need to show that $d_i^k$, $k \ge 1$, is the length of the shortest $k$-arc path originating at $i$, for $i \ne t$. For $k = 1$,

$$d_i^1 = \min_j c_{ij},$$

which is the length of the shortest arc out of $i$. Assume that $d_j^{k-1}$ is the length of the shortest $(k-1)$-arc path out of $j$. Then

$$d_i^k = \min_j \bigl\{ c_{ij} + d_j^{k-1} \bigr\}.$$

Suppose $d_i^k$ were not the length of the shortest $k$-arc path out of $i$, and let $l$ be the first node of a shorter such path. By the principle of optimality and the induction hypothesis, the length of that path is at least

$$c_{il} + d_l^{k-1} \ge \min_j \bigl\{ c_{ij} + d_j^{k-1} \bigr\} = d_i^k,$$

which contradicts the choice of $d_i^k$ in the DP algorithm. Thus, $d_i^k$ is the length of the shortest $k$-arc path out of $i$.

Since $d_t^k = 0$ for all $k$, once a $k$-arc path out of $i$ reaches $t$ we have $d_i^\kappa = d_i^k$ for all $\kappa \ge k$. But with all arc lengths positive, $d_i^k$ is then just the shortest distance from $i$ to $t$. Clearly, there is some finite $k$ such that the shortest $k$-arc path out of $i$ reaches $t$; if this were not true, the assumption of positive arc lengths would imply that the distance from $i$ to $t$ is infinite. Thus, the algorithm yields the shortest distances in a finite number of steps. The number of steps $N_i$ needed can be estimated as

$$N_i \le \frac{\max_j d_{jt}}{\min_{j,k} c_{jk}},$$

since every arc on a shortest path has length at least $\min_{j,k} c_{jk}$.

(b) Let $d_i^k$ be the distance estimate generated using the initial condition $d_i^0 = \infty$, and let $\underline{d}_i^k$ be the estimate generated using the initial condition $\underline{d}_i^0 = 0$. In addition, let $d_i$ be the shortest distance from $i$ to $t$.

Lemma:

$$\underline{d}_i^k \le \underline{d}_i^{k+1} \le d_i \le d_i^{k+1} \le d_i^k, \tag{1}$$
$$\underline{d}_i^k = d_i = d_i^k \quad \text{for } k \text{ sufficiently large}. \tag{2}$$

Proof: Relation (1) follows from the monotonicity property of DP. Note that $\underline{d}_i^1 \ge \underline{d}_i^0$ and that $d_i^1 \le d_i^0$. Equation (2) follows immediately from the convergence of DP (given $d_i^0 = \infty$) and from part (a).

Proposition: For every $k$ there exists a time $T_k$ such that for all $T \ge T_k$,

$$\underline{d}_i^k \le d_i^T \le d_i^k, \qquad i = 1, 2, \dots, N,$$

where $d_i^T$ denotes the estimate held at node $i$ at time $T$ in the asynchronous algorithm.

Proof: The proof follows by induction. For $k = 0$ the proposition is true, given the positive arc length assumption. Assume it is true for a given $k$. Let $N(i)$ be the set of nodes adjacent to $i$. For every $j \in N(i)$ there exists a time $T_k^j$ such that

$$\underline{d}_j^k \le d_j^T \le d_j^k, \qquad \text{for all } T \ge T_k^j.$$

Let $T'$ be the first time node $i$ updates its distance estimate after all the estimates $d_j^{T_k^j}$, $j \in N(i)$, have arrived. Let $d_{ij}^T$ be the estimate of $d_j$ that node $i$ has at time $T$. Note that this may differ from $d_j^{T_k^j}$, since later estimates from $j$ may have arrived before $T'$. From the Lemma,

$$\underline{d}_j^k \le d_{ij}^T \le d_j^k,$$

which, coupled with the monotonicity of DP, implies

$$\underline{d}_i^{k+1} \le d_i^T \le d_i^{k+1}, \qquad \text{for all } T \ge T'.$$

Since each node never stops transmitting, $T'$ is finite and the proposition is proved.

Using the Lemma, we see that there is a finite $k$ such that $\underline{d}_i^\kappa = d_i = d_i^\kappa$ for all $\kappa \ge k$. Thus, from the Proposition, there exists a finite time $\bar{T}$ such that $d_i^T = d_i$ for all $T \ge \bar{T}$ and all $i$.
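Part (a) is the synchronous Bellman-Ford iteration toward the destination $t$. A minimal sketch (not part of the original solution; node names and arc lengths are illustrative) runs it from both initial conditions of part (b) and reaches the same fixed point, as the Lemma asserts.

def k_arc_iteration(arcs, nodes, t, d0):
    """Iterate d[i] <- min over arcs (i, j) of c_ij + d[j], with d[t] = 0 held
    fixed, until the estimates stop changing; arcs is a list of (i, j, c_ij),
    all c_ij > 0."""
    d = dict(d0)
    while True:
        d_new = {i: min((c + d[j] for (i2, j, c) in arcs if i2 == i),
                        default=float("inf"))
                 for i in nodes if i != t}
        d_new[t] = 0.0
        if d_new == d:
            return d
        d = d_new

arcs = [("1", "2", 1.0), ("2", "t", 1.0), ("1", "t", 3.0)]
nodes = ["1", "2", "t"]
from_above = {i: (0.0 if i == "t" else float("inf")) for i in nodes}  # d_i^0 = inf
from_below = {i: 0.0 for i in nodes}                                  # d_i^0 = 0
print(k_arc_iteration(arcs, nodes, "t", from_above))  # {'1': 2.0, '2': 1.0, 't': 0.0}
print(k_arc_iteration(arcs, nodes, "t", from_below))  # same shortest distances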

Solutions Vol. I, Chapter 3

3.6 www

This problem is similar to the Brachistochrone Problem (Example 4.2) described in the text. As in that problem, we introduce the system $\dot{x} = u$ and have a fixed terminal state problem [$x(0) = a$ and $x(T) = b$]. Letting

$$g(x, u) = \frac{\sqrt{1 + u^2}}{C x},$$

the Hamiltonian is

$$H(x, u, p) = g(x, u) + p u.$$

Minimization of the Hamiltonian with respect to $u$ yields

$$p(t) = -\nabla_u g\bigl( x(t), u(t) \bigr).$$

Since the Hamiltonian is constant along an optimal trajectory, we have

$$g\bigl( x(t), u(t) \bigr) - \nabla_u g\bigl( x(t), u(t) \bigr) u(t) = \text{constant}.$$

Substituting in the expression for $g$, we have

$$\frac{\sqrt{1 + u^2}}{C x} - \frac{u^2}{C x \sqrt{1 + u^2}} = \frac{1}{C x \sqrt{1 + u^2}} = \text{constant},$$

which simplifies to

$$\bigl( x(t) \bigr)^2 \Bigl( 1 + \bigl( \dot{x}(t) \bigr)^2 \Bigr) = \text{constant} = D.$$

Thus an optimal trajectory satisfies the differential equation

$$\dot{x}(t) = \sqrt{ \frac{D - \bigl( x(t) \bigr)^2}{\bigl( x(t) \bigr)^2} }.$$

It can be seen through straightforward calculation that the curve

$$\bigl( x(t) \bigr)^2 + (t - d)^2 = D$$

satisfies this differential equation, and thus the curve of minimum travel time from $A$ to $B$ is an arc of a circle.
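The "straightforward calculation" can be delegated to a computer algebra system; the sketch below (not part of the original solution) verifies symbolically that the circle satisfies $x^2 (1 + \dot{x}^2) = D$.

import sympy as sp

t, d, D = sp.symbols("t d D", positive=True)
x = sp.sqrt(D - (t - d) ** 2)            # the circular arc x(t)^2 + (t - d)^2 = D
expr = x ** 2 * (1 + sp.diff(x, t) ** 2)
print(sp.simplify(expr))                 # prints D: constant along the curve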

3.9 www

We have the system $\dot{x}(t) = A x(t) + B u(t)$, for which we want to minimize the quadratic cost

$$x(T)' Q_T x(T) + \int_0^T \bigl( x(t)' Q x(t) + u(t)' R u(t) \bigr) \, dt.$$

The Hamiltonian here is

$$H(x, u, p) = x' Q x + u' R u + p' (A x + B u),$$

and the adjoint equation is

$$\dot{p}(t) = -A' p(t) - 2 Q x(t),$$

with the terminal condition

$$p(T) = 2 Q_T x(T).$$

Minimizing the Hamiltonian with respect to $u$ yields the optimal control

$$u^*(t) = \arg\min_u \bigl[ x^*(t)' Q x^*(t) + u' R u + p(t)' \bigl( A x^*(t) + B u \bigr) \bigr] = -\tfrac{1}{2} R^{-1} B' p(t).$$

We now hypothesize a linear relation between $x^*(t)$ and $p(t)$:

$$2 K(t) x^*(t) = p(t), \qquad t \in [0, T],$$

and show that $K(t)$ can be obtained by solving the Riccati equation. Substituting this expression for $p(t)$ into the previous equation, we have

$$u^*(t) = -R^{-1} B' K(t) x^*(t).$$

By combining this result with the system equation, we have

$$\dot{x}^*(t) = \bigl( A - B R^{-1} B' K(t) \bigr) x^*(t). \tag{1}$$

Differentiating $2 K(t) x^*(t) = p(t)$ and using the adjoint equation yields

$$2 \dot{K}(t) x^*(t) + 2 K(t) \dot{x}^*(t) = -2 A' K(t) x^*(t) - 2 Q x^*(t).$$

Combining with Eq. (1), we have

$$\dot{K}(t) x^*(t) + K(t) \bigl( A - B R^{-1} B' K(t) \bigr) x^*(t) = -A' K(t) x^*(t) - Q x^*(t),$$

and we thus see that $K(t)$ should satisfy the Riccati equation

$$\dot{K}(t) = -K(t) A - A' K(t) + K(t) B R^{-1} B' K(t) - Q.$$

From the terminal condition $p(T) = 2 Q_T x(T)$, we have $K(T) = Q_T$, from which we can solve for $K(t)$ using the Riccati equation. Once we have $K(t)$, we have the optimal control $u^*(t) = -R^{-1} B' K(t) x^*(t)$. By reversing the previous arguments, this control can then be shown to satisfy all the conditions of the Pontryagin Minimum Principle.
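In practice, $K(t)$ is obtained by integrating the Riccati equation backward from $K(T) = Q_T$. A sketch follows (not part of the original solution; the system matrices, horizon, and integrator defaults are illustrative choices).

import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # double integrator, as an example
B = np.array([[0.0], [1.0]])
Q = np.eye(2); Q_T = np.eye(2); R = np.array([[1.0]]); T = 5.0

def riccati_rhs(t, k_flat):
    K = k_flat.reshape(2, 2)
    dK = -K @ A - A.T @ K + K @ B @ np.linalg.inv(R) @ B.T @ K - Q
    return dK.ravel()

# integrate backward in time, from t = T (where K(T) = Q_T) down to t = 0
sol = solve_ivp(riccati_rhs, (T, 0.0), Q_T.ravel())
K0 = sol.y[:, -1].reshape(2, 2)
print(K0)   # K(0); the optimal feedback at time t is u(t) = -inv(R) B' K(t) x(t)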

Solutions Vol. I, Chapter 4

4.10 www

(a) Clearly, the function $J_N$ is continuous. Assume that $J_{k+1}$ is continuous. We have

$$J_k(x) = \min_{u \in \{0, 1, \dots\}} \bigl\{ cu + L(x + u) + G(x + u) \bigr\},$$

where

$$G(y) = E_{w_k} \bigl\{ J_{k+1}(y - w_k) \bigr\}, \qquad L(y) = E_{w_k} \bigl\{ p \max(0, w_k - y) + h \max(0, y - w_k) \bigr\}.$$

Thus, $L$ is continuous. Since $J_{k+1}$ is continuous, $G$ is continuous for bounded $w_k$. Assume that $J_k$ is not continuous. Then there exists an $\hat{x}$ such that, as $y \to \hat{x}$, $J_k(y)$ does not approach $J_k(\hat{x})$. Let

$$u_y = \arg\min_{u \in \{0, 1, \dots\}} \bigl\{ cu + L(y + u) + G(y + u) \bigr\}.$$

Since $L$ and $G$ are continuous, the discontinuity of $J_k$ at $\hat{x}$ implies

$$\lim_{y \to \hat{x}} u_y \ne u_{\hat{x}}.$$

But since $u_y$ is optimal for $y$,

$$\lim_{y \to \hat{x}} \bigl\{ c u_y + L(y + u_y) + G(y + u_y) \bigr\} < \lim_{y \to \hat{x}} \bigl\{ c u_{\hat{x}} + L(y + u_{\hat{x}}) + G(y + u_{\hat{x}}) \bigr\} = J_k(\hat{x}).$$

This contradicts the optimality of $u_{\hat{x}}$ for $\hat{x}$. Thus, $J_k$ is continuous.

(b) Let

$$Y_k(x) = J_k(x+1) - J_k(x).$$

Clearly $Y_N(x)$ is a non-decreasing function. Assume that $Y_{k+1}(x)$ is non-decreasing. Then

$$\begin{aligned} Y_k(x + \delta) - Y_k(x) = {} & c (u_{x+\delta+1} - u_{x+\delta}) - c (u_{x+1} - u_x) \\ & + \bigl[ L(x + \delta + 1 + u_{x+\delta+1}) - L(x + \delta + u_{x+\delta}) \bigr] - \bigl[ L(x + 1 + u_{x+1}) - L(x + u_x) \bigr] \\ & + \bigl[ G(x + \delta + 1 + u_{x+\delta+1}) - G(x + \delta + u_{x+\delta}) \bigr] - \bigl[ G(x + 1 + u_{x+1}) - G(x + u_x) \bigr]. \end{aligned}$$

Since $J_k$ is continuous, $u_{y+\delta} = u_y$ for $\delta$ sufficiently small. Thus, with $\delta$ small,

$$\begin{aligned} Y_k(x + \delta) - Y_k(x) = {} & \bigl[ L(x + \delta + 1 + u_{x+1}) - L(x + \delta + u_x) \bigr] - \bigl[ L(x + 1 + u_{x+1}) - L(x + u_x) \bigr] \\ & + \bigl[ G(x + \delta + 1 + u_{x+1}) - G(x + \delta + u_x) \bigr] - \bigl[ G(x + 1 + u_{x+1}) - G(x + u_x) \bigr]. \end{aligned}$$

Now, since the control and penalty costs are linear, the optimal order given a stock of $x$ is at most one unit more than the optimal order given a stock of $x + 1$:

$$u_{x+1} \le u_x \le u_{x+1} + 1.$$

If $u_x = u_{x+1} + 1$, then $Y_k(x + \delta) - Y_k(x) = 0$ and we have the desired result. Assume that $u_x = u_{x+1}$. Since $L(x)$ is convex, $L(x+1) - L(x)$ is non-decreasing. Using the assumption that $Y_{k+1}(x)$ is non-decreasing, we have

$$\begin{aligned} Y_k(x + \delta) - Y_k(x) = {} & \underbrace{\bigl[ L(x + \delta + 1 + u_x) - L(x + \delta + u_x) \bigr] - \bigl[ L(x + 1 + u_x) - L(x + u_x) \bigr]}_{\ge 0} \\ & + \underbrace{E_{w_k} \bigl\{ \bigl[ J_{k+1}(x + \delta + 1 + u_x - w_k) - J_{k+1}(x + \delta + u_x - w_k) \bigr] - \bigl[ J_{k+1}(x + 1 + u_x - w_k) - J_{k+1}(x + u_x - w_k) \bigr] \bigr\}}_{\ge 0} \\ \ge {} & 0. \end{aligned}$$

Thus, $Y_k(x)$ is a non-decreasing function of $x$.

(c) From their definition and a straightforward induction it can be shown that $J_k(x)$ and $J_k(x, u)$ are bounded below. Furthermore, since $\lim_{|x| \to \infty} L(x) = \infty$, we obtain $\lim_{|x| \to \infty} J_k(x, 0) = \infty$. From the definition of $J_k(x, u)$, we have

$$J_k(x, u) = J_k(x + 1, u - 1) + c, \qquad u \in \{1, 2, \dots\}. \tag{1}$$

Let $S_k$ be the smallest real number satisfying

$$J_k(S_k, 0) = J_k(S_k + 1, 0) + c. \tag{2}$$

We show that $S_k$ is well defined. If no $S_k$ satisfying Eq. (2) exists, then, because $J_k$ is continuous, we must have either

$$J_k(x, 0) - J_k(x+1, 0) > c, \qquad \text{for all } x \in \mathbb{R},$$

or

$$J_k(x, 0) - J_k(x+1, 0) < c, \qquad \text{for all } x \in \mathbb{R}.$$

The first possibility makes $J_k(x, 0)$ decreasing in $x$, and so contradicts the fact that $\lim_{x \to \infty} J_k(x, 0) = \infty$. The second possibility makes $J_k(x, 0) + cx$ increasing in $x$, so that $\lim_{x \to -\infty} \bigl( J_k(x, 0) + cx \bigr)$ is finite or $-\infty$; however, using the boundedness of $J_{k+1}(x)$ from below, we obtain $\lim_{x \to -\infty} \bigl( J_k(x, 0) + cx \bigr) = \infty$. The contradiction shows that $S_k$ is well defined.

We now derive the form of an optimal policy $\mu_k(x)$. Fix some $x$ and consider first the case $x \ge S_k$. Using the fact that $J_k(x+1, u) - J_k(x, u)$ is a nondecreasing function of $x$, we have for any $u \in \{0, 1, 2, \dots\}$

$$J_k(x+1, u) - J_k(x, u) \ge J_k(S_k + 1, u) - J_k(S_k, u) = J_k(S_k + 1, 0) - J_k(S_k, 0) = -c.$$

Therefore,

$$J_k(x, u+1) = J_k(x+1, u) + c \ge J_k(x, u), \qquad u \in \{0, 1, \dots\},\ x \ge S_k.$$

This shows that $u = 0$ minimizes $J_k(x, u)$ for all $x \ge S_k$.

Now let $x \in [S_k - n, S_k - n + 1)$, $n \in \{1, 2, \dots\}$. Using Eq. (1), we have

$$J_k(x, n + m) - J_k(x, n) = J_k(x + n, m) - J_k(x + n, 0) \ge 0, \qquad m \in \{0, 1, \dots\}, \tag{3}$$

since $x + n \ge S_k$. However, if $u < n$ then $x + u < S_k$ and

$$J_k(x + u + 1, 0) - J_k(x + u, 0) < J_k(S_k + 1, 0) - J_k(S_k, 0) = -c.$$

Therefore,

$$J_k(x, u+1) = J_k(x + u + 1, 0) + (u + 1) c < J_k(x + u, 0) + u c = J_k(x, u), \qquad u < n. \tag{4}$$

Inequalities (3) and (4) show that $u = n$ minimizes $J_k(x, u)$ whenever $x \in [S_k - n, S_k - n + 1)$.

4.18 www

Let the state $x_k$ be defined as

$$x_k = \begin{cases} T, & \text{if the selection has already terminated}, \\ 1, & \text{if the } k\text{th object observed has rank 1 among those observed so far}, \\ 0, & \text{otherwise}. \end{cases}$$

The system evolves according to

$$x_{k+1} = \begin{cases} T, & \text{if } u_k = \text{stop or } x_k = T, \\ w_k, & \text{if } u_k = \text{continue}. \end{cases}$$

The reward function is given by

$$g_k(x_k, u_k, w_k) = \begin{cases} \dfrac{k}{N}, & \text{if } x_k = 1 \text{ and } u_k = \text{stop}, \\ 0, & \text{otherwise}, \end{cases} \qquad g_N(x_N) = \begin{cases} 1, & \text{if } x_N = 1, \\ 0, & \text{otherwise}, \end{cases}$$

where $k/N$ is the probability that an object of rank 1 among the first $k$ is in fact best overall. Note that if termination is selected at stage $k$ and $x_k \ne 1$, then the probability of success is 0. Thus, if $x_k = 0$ it is always optimal to continue. To complete the model we have to determine $P(w_k \mid x_k, u_k) = P(w_k)$ when the control is $u_k = \text{continue}$. At stage $k$, we have already observed $k$ objects from a sorted set. Since we know nothing else about these objects, the new element can, with equal probability, be in any relation with the already observed objects: there are $k + 1$ possible positions for $a_{k+1}$ relative to $a_{i_1} < a_{i_2} < \cdots < a_{i_k}$.

Thus,

$$P(w_k = 1) = \frac{1}{k+1}, \qquad P(w_k = 0) = \frac{k}{k+1}.$$

Proposition: If $k \in S_N = \bigl\{ i \mid \frac{1}{N-1} + \cdots + \frac{1}{i} \le 1 \bigr\}$, then

$$J_k(0) = \frac{k}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{k} \Bigr), \qquad J_k(1) = \frac{k}{N}.$$

Proof: For $k = N-1$,

$$J_{N-1}(0) = \max\Bigl[ \underbrace{0}_{\text{stop}},\ \underbrace{E\{w_{N-1}\} = \tfrac{1}{N}}_{\text{continue}} \Bigr] = \frac{N-1}{N} \cdot \frac{1}{N-1},$$

and $\mu^*_{N-1}(0) = \text{continue}$. Also,

$$J_{N-1}(1) = \max\Bigl[ \underbrace{\tfrac{N-1}{N}}_{\text{stop}},\ \underbrace{E\{w_{N-1}\} = \tfrac{1}{N}}_{\text{continue}} \Bigr] = \frac{N-1}{N},$$

and $\mu^*_{N-1}(1) = \text{stop}$. Note that $N - 1 \in S_N$ for all $N$. Assume the proposition is true for $J_{k+1}(x_{k+1})$. Then

$$J_k(0) = \max\Bigl[ \underbrace{0}_{\text{stop}},\ \underbrace{E\{J_{k+1}(w_k)\}}_{\text{continue}} \Bigr], \qquad J_k(1) = \max\Bigl[ \underbrace{\tfrac{k}{N}}_{\text{stop}},\ \underbrace{E\{J_{k+1}(w_k)\}}_{\text{continue}} \Bigr].$$

Now,

$$E\{J_{k+1}(w_k)\} = \frac{1}{k+1} \cdot \frac{k+1}{N} + \frac{k}{k+1} \cdot \frac{k+1}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{k+1} \Bigr) = \frac{k}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{k} \Bigr).$$

Clearly,

$$J_k(0) = \frac{k}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{k} \Bigr),$$

and $\mu^*_k(0) = \text{continue}$. If $k \in S_N$, then $\frac{1}{N-1} + \cdots + \frac{1}{k} \le 1$, so

$$J_k(1) = \frac{k}{N},$$

and $\mu^*_k(1) = \text{stop}$. Q.E.D.

Proposition: If $k \notin S_N$, then

$$J_k(0) = J_k(1) = \frac{\delta - 1}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{\delta - 1} \Bigr),$$

where $\delta$ is the minimum element of $S_N$.

Proof: For $k = \delta - 1$,

$$J_{\delta-1}(0) = \frac{1}{\delta} \cdot \frac{\delta}{N} + \frac{\delta - 1}{\delta} \cdot \frac{\delta}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{\delta} \Bigr) = \frac{\delta - 1}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{\delta - 1} \Bigr),$$

$$J_{\delta-1}(1) = \max\Bigl[ \frac{\delta - 1}{N},\ \frac{\delta - 1}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{\delta - 1} \Bigr) \Bigr] = \frac{\delta - 1}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{\delta - 1} \Bigr),$$

where the last equality holds because $\delta - 1 \notin S_N$, so that $\frac{1}{N-1} + \cdots + \frac{1}{\delta - 1} > 1$. Hence $\mu^*_{\delta-1}(0) = \mu^*_{\delta-1}(1) = \text{continue}$. Assume the proposition is true for $J_k(x_k)$. Then

$$J_{k-1}(0) = \frac{1}{k} J_k(1) + \frac{k-1}{k} J_k(0) = J_k(0),$$

$$J_{k-1}(1) = \max\Bigl[ \frac{k-1}{N},\ \frac{1}{k} J_k(1) + \frac{k-1}{k} J_k(0) \Bigr] = \max\Bigl[ \frac{k-1}{N},\ \frac{\delta - 1}{N} \Bigl( \frac{1}{N-1} + \cdots + \frac{1}{\delta - 1} \Bigr) \Bigr] = J_k(0),$$

and $\mu^*_{k-1}(0) = \mu^*_{k-1}(1) = \text{continue}$. Q.E.D.

Thus the optimal policy is to continue until the $\delta$th object, where $\delta$ is the minimum integer such that

$$\frac{1}{N-1} + \cdots + \frac{1}{\delta} \le 1,$$

and then stop the first time an object of rank 1 (best so far) is observed.
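The threshold $\delta$ and the resulting success probability are easy to compute; a sketch (not part of the original solution):

def secretary_threshold(N):
    """Smallest delta with 1/(N-1) + ... + 1/delta <= 1. The optimal policy
    passes over the first delta - 1 objects, then stops at the first object
    of rank 1 (best so far). Also returns the success probability
    ((delta - 1)/N) * (1/(N-1) + ... + 1/(delta - 1))."""
    delta = N - 1
    while delta > 1 and sum(1.0 / i for i in range(delta - 1, N)) <= 1.0:
        delta -= 1
    p = (delta - 1) / N * sum(1.0 / i for i in range(delta - 1, N))
    return delta, p

print(secretary_threshold(10))   # (4, 0.3987...): pass 3 objects, then stop greedily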

4.31 www

(a) In order that $A_k x + B_k u + w \in X$ for all $w \in W_k$, it is sufficient that $A_k x + B_k u$ belong to some ellipsoid $\tilde{X}$ such that the vector sum of $\tilde{X}$ and $W_k$ is contained in $X$. The ellipsoid

$$\tilde{X} = \{ z \mid z' F z \le 1 \},$$

where, for some scalar $\beta \in (0, 1)$,

$$F^{-1} = (1 - \beta) \bigl( \Psi^{-1} - \beta^{-1} D_k^{-1} \bigr),$$

has this property (based on the hint, and assuming that $F^{-1}$ is well defined as a positive definite matrix). Thus, it is sufficient that $x$ and $u$ are such that

$$(A_k x + B_k u)' F (A_k x + B_k u) \le 1. \tag{1}$$

In order that, for a given $x$, there exists $u$ with $u' R_k u \le 1$ such that Eq. (1) is satisfied, as well as $x' \Xi x \le 1$, it is sufficient that $x$ be such that

$$\min_{u \in \mathbb{R}^m} \bigl[ x' \Xi x + u' R_k u + (A_k x + B_k u)' F (A_k x + B_k u) \bigr] \le 1, \tag{2}$$

or, by carrying out explicitly the quadratic minimization above,

$$x' K x \le 1,$$

where

$$K = A_k' \bigl( F^{-1} + B_k R_k^{-1} B_k' \bigr)^{-1} A_k + \Xi.$$

The control law

$$\mu(x) = -\bigl( R_k + B_k' F B_k \bigr)^{-1} B_k' F A_k x$$

attains the minimum in Eq. (2) for all $x$, so it achieves reachability.

(b) Follows by iterative application of the results of part (a), starting with $k = N-1$ and proceeding backwards.

(c) Follows from the arguments of part (a).

Solutions Vol. I, Chapter 5

5.1 www

Define

$$y_N = x_N, \qquad y_k = x_k + A_k^{-1} w_k + A_k^{-1} A_{k+1}^{-1} w_{k+1} + \cdots + A_k^{-1} \cdots A_{N-1}^{-1} w_{N-1}.$$

Then

$$y_k = x_k + A_k^{-1} (w_k - x_{k+1}) + A_k^{-1} y_{k+1} = x_k + A_k^{-1} ( - A_k x_k - B_k u_k ) + A_k^{-1} y_{k+1} = -A_k^{-1} B_k u_k + A_k^{-1} y_{k+1},$$

so that

$$y_{k+1} = A_k y_k + B_k u_k.$$

Now, the cost function is the expected value of

$$x_N' Q x_N + \sum_{k=0}^{N-1} u_k' R_k u_k = y_0' K_0 y_0 + \sum_{k=0}^{N-1} \bigl( y_{k+1}' K_{k+1} y_{k+1} - y_k' K_k y_k + u_k' R_k u_k \bigr),$$

where the matrices $K_k$ are generated by the Riccati recursion

$$K_N = Q, \qquad K_k = A_k' \Bigl( K_{k+1} - K_{k+1} B_k \bigl( B_k' K_{k+1} B_k + R_k \bigr)^{-1} B_k' K_{k+1} \Bigr) A_k,$$

and we have used $y_N = x_N$. We have

$$y_{k+1}' K_{k+1} y_{k+1} - y_k' K_k y_k + u_k' R_k u_k = (A_k y_k + B_k u_k)' K_{k+1} (A_k y_k + B_k u_k) - y_k' K_k y_k + u_k' R_k u_k$$
$$= y_k' A_k' K_{k+1} A_k y_k + 2 y_k' A_k' K_{k+1} B_k u_k + u_k' B_k' K_{k+1} B_k u_k - y_k' A_k' K_{k+1} A_k y_k + y_k' A_k' K_{k+1} B_k P_k^{-1} B_k' K_{k+1} A_k y_k + u_k' R_k u_k$$
$$= -2 y_k' L_k' P_k u_k + u_k' P_k u_k + y_k' L_k' P_k L_k y_k = (u_k - L_k y_k)' P_k (u_k - L_k y_k),$$

where

$$P_k = B_k' K_{k+1} B_k + R_k, \qquad L_k = -P_k^{-1} B_k' K_{k+1} A_k.$$

Thus, the cost function can be written as

$$E \Bigl\{ y_0' K_0 y_0 + \sum_{k=0}^{N-1} (u_k - L_k y_k)' P_k (u_k - L_k y_k) \Bigr\}.$$

The problem now is to find $\mu_k^*(I_k)$, $k = 0, 1, \dots, N-1$, that minimize over admissible control laws $\mu_k(I_k)$, $k = 0, 1, \dots, N-1$, the cost function

$$E \Bigl\{ y_0' K_0 y_0 + \sum_{k=0}^{N-1} \bigl( \mu_k(I_k) - L_k y_k \bigr)' P_k \bigl( \mu_k(I_k) - L_k y_k \bigr) \Bigr\}.$$

We do this minimization by first minimizing over $\mu_{N-1}$, then over $\mu_{N-2}$, etc. The minimization over $\mu_{N-1}$ involves just the last term in the sum and can be written as

$$\min_{\mu_{N-1}} E \Bigl\{ \bigl( \mu_{N-1}(I_{N-1}) - L_{N-1} y_{N-1} \bigr)' P_{N-1} \bigl( \mu_{N-1}(I_{N-1}) - L_{N-1} y_{N-1} \bigr) \Bigr\}$$
$$= E \Bigl\{ \min_{u_{N-1}} E \bigl\{ (u_{N-1} - L_{N-1} y_{N-1})' P_{N-1} (u_{N-1} - L_{N-1} y_{N-1}) \bigm| I_{N-1} \bigr\} \Bigr\}.$$

Thus this minimization yields the optimal control law for the last stage:

$$\mu_{N-1}^*(I_{N-1}) = L_{N-1} E \{ y_{N-1} \mid I_{N-1} \}.$$

[Recall here that, generically, $E\{z \mid I\}$ minimizes over $u$ the expression $E\{ (u - z)' P (u - z) \mid I \}$ for any random variable $z$, any conditioning variable $I$, and any positive semidefinite matrix $P$.] The minimization over $\mu_{N-2}$ involves

$$E \Bigl\{ \bigl( \mu_{N-2}(I_{N-2}) - L_{N-2} y_{N-2} \bigr)' P_{N-2} \bigl( \mu_{N-2}(I_{N-2}) - L_{N-2} y_{N-2} \bigr) \Bigr\} + E \Bigl\{ \bigl( E\{y_{N-1} \mid I_{N-1}\} - y_{N-1} \bigr)' L_{N-1}' P_{N-1} L_{N-1} \bigl( E\{y_{N-1} \mid I_{N-1}\} - y_{N-1} \bigr) \Bigr\}.$$

However, as in Lemma 5.2.1, the term $E\{y_{N-1} \mid I_{N-1}\} - y_{N-1}$ does not depend on any of the controls (it is a function of $x_0, w_0, \dots, w_{N-2}, v_0, \dots, v_{N-1}$). Thus the minimization over $\mu_{N-2}$ involves just the first term above, and yields, similarly as before,

$$\mu_{N-2}^*(I_{N-2}) = L_{N-2} E \{ y_{N-2} \mid I_{N-2} \}.$$

Proceeding similarly, we prove that for all $k$,

$$\mu_k^*(I_k) = L_k E \{ y_k \mid I_k \}.$$

Note: The preceding proof can be used to provide a quick proof of the separation theorem for linear-quadratic problems in the case where $x_0, w_0, \dots, w_{N-1}, v_0, \dots, v_{N-1}$ are independent. If the cost function is

$$E \Bigl\{ x_N' Q_N x_N + \sum_{k=0}^{N-1} \bigl( x_k' Q_k x_k + u_k' R_k u_k \bigr) \Bigr\},$$

the preceding calculation can be modified to show that the cost function can be written as

$$E \Bigl\{ x_0' K_0 x_0 + \sum_{k=0}^{N-1} \Bigl( (u_k - L_k x_k)' P_k (u_k - L_k x_k) + w_k' K_{k+1} w_k \Bigr) \Bigr\}.$$

By repeating the preceding proof, we then obtain the optimal control law as

$$\mu_k^*(I_k) = L_k E \{ x_k \mid I_k \}.$$

5.3 www

The control at time $k$ is $(u_k, \alpha_k)$, where $\alpha_k$ is a variable taking value 1 (if the next measurement, at time $k+1$, is of type 1) or 2 (if the next measurement is of type 2). The cost functional is

$$E \Bigl\{ x_N' Q x_N + \sum_{k=0}^{N-1} \bigl( x_k' Q x_k + u_k' R u_k \bigr) + \sum_{k=0}^{N-1} g_{\alpha_k} \Bigr\}.$$

We apply the DP algorithm for $N = 2$. We have, from the Riccati equation,

$$J_1(I_1) = J_1(z_0, z_1, u_0, \alpha_0) = E_{x_1} \bigl\{ x_1' (A' Q A + Q) x_1 \bigm| I_1 \bigr\} + E_{w_1} \{ w_1' Q w_1 \} + \min_{u_1} \bigl[ u_1' (B' Q B + R) u_1 + 2 E\{x_1 \mid I_1\}' A' Q B u_1 \bigr] + \min[ g_1, g_2 ].$$

So

$$\mu_1^*(I_1) = -(B' Q B + R)^{-1} B' Q A \, E\{x_1 \mid I_1\}, \qquad \alpha_1^*(I_1) = \begin{cases} 1, & \text{if } g_1 \le g_2, \\ 2, & \text{otherwise}. \end{cases}$$

Note that the measurement selected at $k = 1$ does not depend on $I_1$. This is intuitively clear, since the measurement $z_2$ will not be used by the controller, so its selection should be based on measurement cost alone, and not on the basis of the quality of estimate. The situation is different once more than one stage is considered.

Using a simple modification of the analysis in Section 5.2 of the text, we have

$$J_0(I_0) = J_0(z_0) = \min_{u_0} \Bigl[ E_{x_0, w_0} \bigl\{ x_0' Q x_0 + u_0' R u_0 + (A x_0 + B u_0 + w_0)' K_0 (A x_0 + B u_0 + w_0) \bigm| z_0 \bigr\} \Bigr]$$
$$+ \min_{\alpha_0} \Bigl[ E_{z_1} \Bigl\{ E_{x_1} \bigl\{ \bigl( x_1 - E\{x_1 \mid I_1\} \bigr)' P_1 \bigl( x_1 - E\{x_1 \mid I_1\} \bigr) \bigm| I_1 \bigr\} \Bigm| z_0, u_0, \alpha_0 \Bigr\} + g_{\alpha_0} \Bigr] + E_{w_1} \{ w_1' Q w_1 \} + \min[ g_1, g_2 ].$$

Note that the minimization of the second bracket is indicated only with respect to $\alpha_0$ and not $u_0$. The reason is that the quantity in the second bracket is the error covariance of the estimation error (weighted by $P_1$) and, as shown in the text, it does not depend on $u_0$. Because all stochastic variables are Gaussian, the quantity in the second bracket does not depend on $z_0$. (The weighted error covariance produced by the Kalman filter is precomputable and depends only on the system and measurement matrices and the noise covariances, not on the measurements received.) In fact,

$$E_{z_1} \Bigl\{ E_{x_1} \bigl\{ \bigl( x_1 - E\{x_1 \mid I_1\} \bigr)' P_1 \bigl( x_1 - E\{x_1 \mid I_1\} \bigr) \bigm| I_1 \bigr\} \Bigm| z_0, u_0, \alpha_0 \Bigr\} = \begin{cases} \operatorname{Tr}\bigl( P_1^{1/2} \Sigma_{1|1}^1 P_1^{1/2} \bigr), & \text{if } \alpha_0 = 1, \\ \operatorname{Tr}\bigl( P_1^{1/2} \Sigma_{1|1}^2 P_1^{1/2} \bigr), & \text{if } \alpha_0 = 2, \end{cases}$$

where $\operatorname{Tr}(\cdot)$ denotes the trace of a matrix, and $\Sigma_{1|1}^1$ ($\Sigma_{1|1}^2$) denotes the error covariance of the Kalman filter estimate if a measurement of type 1 (type 2) is taken at $k = 0$. Thus, at time $k = 0$, the optimal measurement choice does not depend on $z_0$: it is of type 1 if

$$\operatorname{Tr}\bigl( P_1^{1/2} \Sigma_{1|1}^1 P_1^{1/2} \bigr) + g_1 \le \operatorname{Tr}\bigl( P_1^{1/2} \Sigma_{1|1}^2 P_1^{1/2} \bigr) + g_2,$$

and of type 2 otherwise.

5.7 www

(a) We have

$$p_{k+1}^j = P(x_{k+1} = j \mid z_0, \dots, z_{k+1}, u_0, \dots, u_k) = P(x_{k+1} = j \mid I_{k+1}) = \frac{P(x_{k+1} = j, z_{k+1} \mid I_k, u_k)}{P(z_{k+1} \mid I_k, u_k)}$$
$$= \frac{\sum_{i=1}^n P(x_k = i) P(x_{k+1} = j \mid x_k = i, u_k) P(z_{k+1} \mid u_k, x_{k+1} = j)}{\sum_{s=1}^n \sum_{i=1}^n P(x_k = i) P(x_{k+1} = s \mid x_k = i, u_k) P(z_{k+1} \mid u_k, x_{k+1} = s)} = \frac{\sum_{i=1}^n p_k^i p_{ij}(u_k) r_j(u_k, z_{k+1})}{\sum_{s=1}^n \sum_{i=1}^n p_k^i p_{is}(u_k) r_s(u_k, z_{k+1})}.$$

Rewriting $p_{k+1}^j$ in vector form, we have

$$p_{k+1}^j = \frac{r_j(u_k, z_{k+1}) \bigl[ P(u_k)' P_k \bigr]_j}{\sum_{s=1}^n r_s(u_k, z_{k+1}) \bigl[ P(u_k)' P_k \bigr]_s}, \qquad j = 1, \dots, n.$$

Therefore,

$$P_{k+1} = \frac{[ r(u_k, z_{k+1}) ] * [ P(u_k)' P_k ]}{r(u_k, z_{k+1})' P(u_k)' P_k},$$

where $*$ denotes componentwise multiplication of two vectors.

(b) The DP algorithm for this system is

$$J_{N-1}(P_{N-1}) = \min_u \Bigl[ \sum_{i=1}^n p_{N-1}^i \sum_{j=1}^n p_{ij}(u) g_{N-1}(i, u, j) \Bigr] = \min_u \Bigl[ \sum_{i=1}^n p_{N-1}^i \bigl[ G_{N-1}(u) \bigr]_i \Bigr] = \min_u \bigl[ P_{N-1}' G_{N-1}(u) \bigr],$$

and, for $k < N-1$,

$$J_k(P_k) = \min_u \Bigl[ P_k' G_k(u) + \sum_{\theta=1}^q \bigl( r(u, \theta)' P(u)' P_k \bigr) \, J_{k+1} \Bigl( \frac{[ r(u, \theta) ] * [ P(u)' P_k ]}{r(u, \theta)' P(u)' P_k} \Bigr) \Bigr],$$

where $G_k(u)$ is the vector with components $[G_k(u)]_i = \sum_{j=1}^n p_{ij}(u) g_k(i, u, j)$.

(c) For $k = N-1$,

$$J_{N-1}(\lambda P_{N-1}) = \min_u \bigl[ \lambda P_{N-1}' G_{N-1}(u) \bigr] = \lambda \min_u \bigl[ P_{N-1}' G_{N-1}(u) \bigr] = \lambda J_{N-1}(P_{N-1}).$$

Now assume $J_k(\lambda P_k) = \lambda J_k(P_k)$. Then

$$J_{k-1}(\lambda P_{k-1}) = \min_u \Bigl[ \lambda P_{k-1}' G_{k-1}(u) + \sum_{\theta=1}^q \bigl( r(u, \theta)' P(u)' \lambda P_{k-1} \bigr) J_k\bigl( P_k \mid P_{k-1}, u, \theta \bigr) \Bigr]$$
$$= \min_u \Bigl[ \lambda P_{k-1}' G_{k-1}(u) + \lambda \sum_{\theta=1}^q \bigl( r(u, \theta)' P(u)' P_{k-1} \bigr) J_k\bigl( P_k \mid P_{k-1}, u, \theta \bigr) \Bigr] = \lambda J_{k-1}(P_{k-1}). \quad \text{Q.E.D.}$$

[Here $J_k(P_k \mid P_{k-1}, u, \theta)$ denotes $J_k$ evaluated at the next belief generated from $P_{k-1}$ by $(u, \theta)$; note that this next belief is unchanged when $P_{k-1}$ is scaled by $\lambda$.]

For any $u$, $r(u, \theta)' P(u)' P_k$ is a scalar. Therefore, applying the positive homogeneity just established with $\lambda = r(u, \theta)' P(u)' P_k$, we have

$$J_k(P_k) = \min_u \Bigl[ P_k' G_k(u) + \sum_{\theta=1}^q \bigl( r(u, \theta)' P(u)' P_k \bigr) J_{k+1} \Bigl( \frac{[ r(u, \theta) ] * [ P(u)' P_k ]}{r(u, \theta)' P(u)' P_k} \Bigr) \Bigr] = \min_u \Bigl[ P_k' G_k(u) + \sum_{\theta=1}^q J_{k+1} \bigl( [ r(u, \theta) ] * [ P(u)' P_k ] \bigr) \Bigr].$$

(d) For $k = N-1$, we have $J_{N-1}(P_{N-1}) = \min_u [ P_{N-1}' G_{N-1}(u) ]$, and so $J_{N-1}(P_{N-1})$ has the desired form

$$J_{N-1}(P_{N-1}) = \min\bigl[ P_{N-1}' \alpha_{N-1}^1, \dots, P_{N-1}' \alpha_{N-1}^m \bigr],$$

where $\alpha_{N-1}^j = G_{N-1}(u^j)$ and $u^j$ is the $j$th element of the control constraint set. Assume that

$$J_{k+1}(P_{k+1}) = \min\bigl[ P_{k+1}' \alpha_{k+1}^1, \dots, P_{k+1}' \alpha_{k+1}^{m_{k+1}} \bigr].$$

Then, using the expression from part (c) for $J_k(P_k)$,

$$J_k(P_k) = \min_u \Bigl[ P_k' G_k(u) + \sum_{\theta=1}^q J_{k+1} \bigl( [ r(u, \theta) ] * [ P(u)' P_k ] \bigr) \Bigr]$$
$$= \min_u \Bigl[ P_k' G_k(u) + \sum_{\theta=1}^q \min_{m = 1, \dots, m_{k+1}} \bigl( [ r(u, \theta) ] * [ P(u)' P_k ] \bigr)' \alpha_{k+1}^m \Bigr]$$
$$= \min_u \Bigl[ P_k' G_k(u) + \sum_{\theta=1}^q \min_{m = 1, \dots, m_{k+1}} P_k' \Bigl( P(u) \bigl[ r(u, \theta) * \alpha_{k+1}^m \bigr] \Bigr) \Bigr]$$
$$= \min\bigl[ P_k' \alpha_k^1, \dots, P_k' \alpha_k^{m_k} \bigr],$$

where $\alpha_k^1, \dots, \alpha_k^{m_k}$ are all possible vectors of the form

$$G_k(u) + \sum_{\theta=1}^q P(u) \bigl[ r(u, \theta) * \alpha_{k+1}^{m_{u,\theta}} \bigr],$$

as $u$ ranges over the finite set of controls, $\theta$ ranges over the set of observation vector indexes $\{1, \dots, q\}$, and $m_{u,\theta}$ ranges over the set of indexes $\{1, \dots, m_{k+1}\}$. The induction is thus complete.

For a quick way to understand the preceding proof, based on polyhedral concavity notions, note that the conclusion is equivalent to asserting that $J_k(P_k)$ is a positively homogeneous, concave polyhedral function. The preceding induction argument amounts to showing that the DP formula of part (c) preserves the positively homogeneous, concave polyhedral property of $J_{k+1}(P_{k+1})$. This is indeed evident from the formula, since taking minima and nonnegative weighted sums of positively homogeneous, concave polyhedral functions results in a positively homogeneous, concave polyhedral function.
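Part (d) is the classical alpha-vector (piecewise linear) representation of finite-horizon POMDP value functions. A sketch of one exact backup per the last formula above follows (not part of the original solution; the model data are illustrative placeholders, and no pruning of dominated vectors is attempted).

import numpy as np
from itertools import product

def backup(alphas, P, r, G):
    """One DP stage of part (d): from the vectors representing J_{k+1}
    (J_{k+1}(p) = min_m p' alpha_m), build the vectors representing J_k.
    P[u] is the matrix P(u), r[u][theta] the column r(u, theta), G[u] = G_k(u)."""
    new_alphas = []
    q = len(r[0])                      # number of observation values theta
    for u in range(len(G)):
        # one candidate vector per assignment theta -> m(u, theta)
        for ms in product(range(len(alphas)), repeat=q):
            vec = G[u].copy()
            for theta, m in enumerate(ms):
                vec += P[u] @ (r[u][theta] * alphas[m])  # P(u)[r(u,theta) * alpha_m]
            new_alphas.append(vec)
    return new_alphas

# tiny 2-state, 2-control, 2-observation illustration
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.5, 0.5]])]
r = [[np.array([0.7, 0.4]), np.array([0.3, 0.6])]] * 2  # per state, sums to 1 over theta
G = [np.array([1.0, 2.0]), np.array([1.5, 1.5])]
alphas = [G[0], G[1]]                                   # last-stage vectors
print(len(backup(alphas, P, r, G)))                     # |U| * m^q = 2 * 2^2 = 8 vectors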

Solutions Vol. I, Chapter 6

6.8 www

First, we notice that $\alpha$-$\beta$ pruning is applicable only for arcs that point to right children, so that at least one sequence of moves (starting from the current position and ending at a terminal position, that is, one with no children) has already been considered. Furthermore, due to the depth-first search, the score at the ancestor positions has been derived without taking into account the positions that can be reached from the current point.

Suppose now that $\alpha$-pruning applies at a position with Black to play. Then, if the current position were reached (due to a move by White), Black could respond in such a way that the final position would be worse (for White) than it would have been if the current position were not reached. What $\alpha$-pruning saves is the search for even worse positions (emanating from the current position). The reason is that White will never play so that Black reaches the current position, because White certainly has a better alternative. A similar argument applies for $\beta$-pruning.

A second approach: Suppose that it is White's turn to move. We shall prove that a $\beta$-cutoff occurring at the $n$th position will not affect the backed-up score. We have, from the definition of $\beta$,

$$\beta = \min\{ \text{TBS of all ancestors of } n \text{ where Black has the move} \},$$

where TBS denotes the temporary backed-up score. For a cutoff to occur: $\text{TBS}(n) > \beta$. Observe first of all that $\beta = \text{TBS}(n_1)$ for some ancestor $n_1$ where Black has the move. Then there exists a path $n_1, n_2, \dots, n_k, n$. Since it is White's move at $n$, we have $\text{TBS}(n) = \max\{\text{TBS}(n), \text{BS}(n_i)\} > \beta$, where the $n_i$ are the descendants of $n$. Consider now the position $n_k$. Then $\text{TBS}(n_k)$ will either remain unchanged or will increase to a value greater than $\beta$ as a result of the exploration of node $n$. Proceeding similarly, we conclude that $\text{TBS}(n_2)$ will either remain the same or change to a value greater than $\beta$. Finally, at node $n_1$, $\text{TBS}(n_1)$ will not change, since it is Black's turn to move there and Black will choose the move with minimum score. Thus the backed-up score and the choice of the next move are unaffected by the $\beta$-cutoff. A similar argument holds for $\alpha$-pruning.
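For concreteness, a minimal minimax-with-pruning sketch over a game tree given as nested lists (not part of the original solution; leaves are scores, White maximizes, and cutoffs do not change the backed-up score at the root, per the argument above):

import math

def alphabeta(node, alpha=-math.inf, beta=math.inf, white_to_move=True):
    """Returns the backed-up score of the position `node`."""
    if not isinstance(node, list):     # terminal position
        return node
    if white_to_move:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, best)
            if best >= beta:           # beta cutoff: Black would avoid this position
                break
        return best
    else:
        best = math.inf
        for child in node:
            best = min(best, alphabeta(child, alpha, beta, True))
            beta = min(beta, best)
            if best <= alpha:          # alpha cutoff: White would avoid this position
                break
        return best

tree = [[3, 5], [2, [9, 1]], [4, 6]]
print(alphabeta(tree))   # 4: the same value plain minimax would return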

Solutions Vol. I, Chapter 7

7.8 www

A threshold policy is specified by a threshold integer $m$ and has the form: process the orders if and only if their number exceeds $m$. The cost function corresponding to the threshold policy specified by $m$ will be denoted by $J_m$. By Prop. 3.1(c), this cost function is the unique solution of the system of equations

$$J_m(i) = \begin{cases} K + \alpha (1-p) J_m(0) + \alpha p J_m(1), & \text{if } i > m, \\ ci + \alpha (1-p) J_m(i) + \alpha p J_m(i+1), & \text{if } i \le m. \end{cases} \tag{1}$$

Thus for all $i \le m$, we have

$$J_m(i) = \frac{ci + \alpha p J_m(i+1)}{1 - \alpha(1-p)}, \qquad J_m(i-1) = \frac{c(i-1) + \alpha p J_m(i)}{1 - \alpha(1-p)}.$$

From these two equations it follows that for all $i \le m$, we have

$$J_m(i) \le J_m(i+1) \quad \Rightarrow \quad J_m(i-1) < J_m(i). \tag{2}$$

Denote now

$$\gamma = K + \alpha (1-p) J_m(0) + \alpha p J_m(1).$$

Consider the policy iteration algorithm, and a policy $\mu$ that is the successor policy to the threshold policy corresponding to $m$. This policy has the form: process the orders if and only if

$$K + \alpha (1-p) J_m(0) + \alpha p J_m(1) \le ci + \alpha (1-p) J_m(i) + \alpha p J_m(i+1),$$

or equivalently,

$$\gamma \le ci + \alpha (1-p) J_m(i) + \alpha p J_m(i+1).$$

In order for this policy to be a threshold policy, we must have for all $i$

$$\gamma \le c(i-1) + \alpha (1-p) J_m(i-1) + \alpha p J_m(i) \quad \Rightarrow \quad \gamma \le ci + \alpha (1-p) J_m(i) + \alpha p J_m(i+1). \tag{3}$$

This relation holds if the function $J_m$ is monotonically nondecreasing, which from Eqs. (1) and (2) will be true if $J_m(m) \le J_m(m+1) = \gamma$.

Let us assume that the opposite case holds, where $\gamma < J_m(m)$. For $i > m$, we have $J_m(i) = \gamma$, so that

$$ci + \alpha (1-p) J_m(i) + \alpha p J_m(i+1) = ci + \alpha \gamma. \tag{4}$$

We also have

$$J_m(m) = \frac{cm + \alpha p \gamma}{1 - \alpha(1-p)},$$

from which, together with the hypothesis $J_m(m) > \gamma$, we obtain

$$cm + \alpha \gamma > \gamma. \tag{5}$$

Thus, from Eqs. (4) and (5) we have

$$ci + \alpha (1-p) J_m(i) + \alpha p J_m(i+1) > \gamma, \qquad \text{for all } i > m, \tag{6}$$

so that Eq. (3) is satisfied for all $i > m$. For $i \le m$, we have $ci + \alpha (1-p) J_m(i) + \alpha p J_m(i+1) = J_m(i)$, so that the desired relation (3) takes the form

$$\gamma \le J_m(i-1) \quad \Rightarrow \quad \gamma \le J_m(i). \tag{7}$$

To show that this relation holds for all $i \le m$, we argue by contradiction. Suppose that for some $i \le m$ we have $J_m(i) < \gamma \le J_m(i-1)$. Then, since $J_m(m) > \gamma$, there must exist some $i' > i$ such that $J_m(i'-1) < J_m(i')$. But then Eq. (2) would imply that $J_m(j-1) < J_m(j)$ for all $j \le i'$, contradicting the relation $J_m(i) < \gamma \le J_m(i-1)$ assumed earlier. Thus, Eq. (7) holds for all $i \le m$, so that Eq. (3) holds for all $i$. The proof is complete.
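The fixed-point system (1) and the policy-improvement step are easy to set up numerically. The sketch below (not part of the original solution; $K$, $c$, $p$, $\alpha$, and the truncation level are illustrative, and $m \ge 1$ is assumed) solves Eq. (1) as a linear system and prints the successor policy, which comes out as a threshold policy, as the proof guarantees.

import numpy as np

def successor_policy(m, K, c, p, alpha, i_max=30):
    # unknowns: J_m(0..m) and gamma; for i <= m, Eq. (1) reads
    #   (1 - alpha(1-p)) J_m(i) - alpha p J_m(i+1) = c i,   with J_m(m+1) = gamma,
    # and gamma = K + alpha(1-p) J_m(0) + alpha p J_m(1); assumes m >= 1
    n = m + 2                               # columns: J_m(0..m), then gamma
    A = np.zeros((n, n)); b = np.zeros(n)
    for i in range(m + 1):
        A[i, i] = 1 - alpha * (1 - p)
        A[i, i + 1] = -alpha * p            # at i = m this hits the gamma column
        b[i] = c * i
    A[n - 1, n - 1] = 1
    A[n - 1, 0] = -alpha * (1 - p); A[n - 1, 1] = -alpha * p
    b[n - 1] = K
    sol = np.linalg.solve(A, b)
    J = list(sol[:m + 1]); gamma = sol[n - 1]
    J += [gamma] * (i_max + 2 - len(J))     # J_m(i) = gamma for i > m
    # improvement step: process iff gamma <= c i + alpha(1-p) J_m(i) + alpha p J_m(i+1)
    return [gamma <= c * i + alpha * (1 - p) * J[i] + alpha * p * J[i + 1]
            for i in range(i_max + 1)]

print(successor_policy(m=5, K=20.0, c=1.0, p=0.5, alpha=0.9))
# False up to some threshold state, True from there on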

7.12 www

Let Assumption 2.1 hold, and let $\pi = \{\mu_0, \mu_1, \dots\}$ be an admissible policy. Consider also the sets $S_k(i)$ given in the hint, with $S_0(i) = \{i\}$. If $t \in S_n(i)$ for all $\pi$ and $i$, we are done. Otherwise, for some $\pi$ and $i$, and some $k < n$, we must have $S_k(i) = S_{k+1}(i)$ while $t \notin S_k(i)$. For $j \in S_k(i)$, let $m(j)$ be the smallest integer $m$ such that $j \in S_m(i)$. Consider a stationary policy $\mu$ with $\mu(j) = \mu_{m(j)}(j)$ for all $j \in S_k(i)$. For this policy we have, for all $j \in S_k(i)$,

$$p_{jl}\bigl( \mu(j) \bigr) > 0 \quad \Rightarrow \quad l \in S_k(i).$$

This implies that the termination state $t$ is not reachable from the states in $S_k(i)$ under the stationary policy $\mu$, which contradicts Assumption 2.1.

Solutions Vol. II, Chapter 1

1.5

(a) We have

$$\tilde{p}_{ij}(u) = \frac{p_{ij}(u) - m_j}{1 - \sum_{k=1}^n m_k},$$

so that, for each $i$ and $u$,

$$\sum_{j=1}^n \tilde{p}_{ij}(u) = \frac{\sum_{j=1}^n p_{ij}(u) - \sum_{j=1}^n m_j}{1 - \sum_{k=1}^n m_k} = 1,$$

and $\tilde{p}_{ij}(u) \ge 0$ since $p_{ij}(u) \ge m_j$. Therefore, the $\tilde{p}_{ij}(u)$ are transition probabilities.

(b) We have for the modified problem

$$\tilde{J}(i) = \min_{u \in U(i)} \Bigl[ g(i, u) + \alpha \Bigl( 1 - \sum_{k=1}^n m_k \Bigr) \sum_{j=1}^n \tilde{p}_{ij}(u) \tilde{J}(j) \Bigr] = \min_{u \in U(i)} \Bigl[ g(i, u) + \alpha \sum_{j=1}^n p_{ij}(u) \tilde{J}(j) \Bigr] - \alpha \sum_{k=1}^n m_k \tilde{J}(k).$$

Adding $\frac{\alpha}{1 - \alpha} \sum_{k=1}^n m_k \tilde{J}(k)$ to both sides, and using

$$\alpha + \frac{\alpha^2}{1 - \alpha} = \frac{\alpha}{1 - \alpha}, \qquad \sum_{j=1}^n p_{ij}(u) = 1,$$

we obtain

$$\tilde{J}(i) + \frac{\alpha}{1 - \alpha} \sum_{k=1}^n m_k \tilde{J}(k) = \min_{u \in U(i)} \Bigl[ g(i, u) + \alpha \sum_{j=1}^n p_{ij}(u) \Bigl( \tilde{J}(j) + \frac{\alpha}{1 - \alpha} \sum_{k=1}^n m_k \tilde{J}(k) \Bigr) \Bigr].$$

Thus $\tilde{J}(i) + \frac{\alpha}{1 - \alpha} \sum_{k=1}^n m_k \tilde{J}(k)$ satisfies Bellman's equation for the original problem, and by the uniqueness of its solution,

$$\tilde{J}(i) + \frac{\alpha}{1 - \alpha} \sum_{k=1}^n m_k \tilde{J}(k) = J^*(i), \qquad \text{for all } i. \quad \text{Q.E.D.}$$
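The identity of part (b) is easy to test numerically. The sketch below (not part of the original solution) assumes the formulation reconstructed above, namely modified probabilities $\tilde{p}_{ij}(u) = (p_{ij}(u) - m_j)/(1 - \sum_k m_k)$ with modified discount $\alpha(1 - \sum_k m_k)$; all model data are randomly generated.

import numpy as np

def value_iteration(P, g, alpha, iters=2000):
    # P[u] is an n x n transition matrix, g[u] the cost vector for control u
    J = np.zeros(P[0].shape[0])
    for _ in range(iters):
        J = np.min([g[u] + alpha * P[u] @ J for u in range(len(P))], axis=0)
    return J

rng = np.random.default_rng(0)
n, alpha = 4, 0.9
m = np.full(n, 0.05)    # m_j >= 0 with sum < 1; construction below gives p_ij(u) >= m_j
P = [rng.dirichlet(np.ones(n) * 5, size=n) * (1 - m.sum()) + m for _ in range(2)]
g = [rng.random(n) for _ in range(2)]

P_mod = [(Pu - m) / (1 - m.sum()) for Pu in P]
J_star = value_iteration(P, g, alpha)
J_mod = value_iteration(P_mod, g, alpha * (1 - m.sum()))
print(np.max(np.abs(J_star - (J_mod + alpha / (1 - alpha) * m @ J_mod))))  # ~ 0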

1.7

We show that for any bounded function $J : S \to \mathbb{R}$, we have

$$J \le T(J) \quad \Rightarrow \quad T(J) \le F(J), \tag{1}$$
$$J \ge T(J) \quad \Rightarrow \quad T(J) \ge F(J). \tag{2}$$

For any $\mu$, define

$$F_\mu(J)(i) = \frac{g(i, \mu(i)) + \alpha \sum_{j \ne i} p_{ij}(\mu(i)) J(j)}{1 - \alpha p_{ii}(\mu(i))},$$

and note that

$$F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i)) J(i)}{1 - \alpha p_{ii}(\mu(i))}. \tag{3}$$

Fix $\epsilon > 0$. If $J \le T(J)$, let $\mu$ be such that $F_\mu(J) \le F(J) + \epsilon e$. Then, using Eq. (3),

$$F(J)(i) + \epsilon \ge F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i)) J(i)}{1 - \alpha p_{ii}(\mu(i))} \ge \frac{T(J)(i) - \alpha p_{ii}(\mu(i)) T(J)(i)}{1 - \alpha p_{ii}(\mu(i))} = T(J)(i).$$

Since $\epsilon > 0$ is arbitrary, we obtain $F(J)(i) \ge T(J)(i)$. Similarly, if $J \ge T(J)$, let $\mu$ be such that $T_\mu(J) \le T(J) + \epsilon e$. Then, using Eq. (3),

$$F(J)(i) \le F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i)) J(i)}{1 - \alpha p_{ii}(\mu(i))} \le \frac{T(J)(i) + \epsilon - \alpha p_{ii}(\mu(i)) T(J)(i)}{1 - \alpha p_{ii}(\mu(i))} \le T(J)(i) + \frac{\epsilon}{1 - \alpha}.$$

Since $\epsilon > 0$ is arbitrary, we obtain $F(J)(i) \le T(J)(i)$.

From (1) and (2) we see that $F$ and $T$ have the same fixed points, so $J^*$ is the unique fixed point of $F$. Using the definition of $F$, it can be seen that for any scalar $r > 0$ we have

$$F(J + re) \le F(J) + \alpha r e, \qquad F(J) - \alpha r e \le F(J - re). \tag{4}$$

Furthermore, $F$ is monotone, that is,

$$J \le J' \quad \Rightarrow \quad F(J) \le F(J'). \tag{5}$$

For any bounded function $J$, let $r > 0$ be such that

$$J - re \le J^* \le J + re.$$

Applying $F$ repeatedly to this relation and using Eqs. (4) and (5), we obtain

$$F^k(J) - \alpha^k r e \le J^* \le F^k(J) + \alpha^k r e.$$

Therefore $F^k(J)$ converges to $J^*$. From Eqs. (1), (2), and (5) we see that

$$J \le T(J) \quad \Rightarrow \quad T^k(J) \le F^k(J) \le J^*,$$

$$J \ge T(J) \quad \Rightarrow \quad T^k(J) \ge F^k(J) \ge J^*.$$

These relations demonstrate the faster convergence property of $F$ over $T$. As a final result (not explicitly required in the problem statement), we show that for any two bounded functions $J : S \to \mathbb{R}$ and $J' : S \to \mathbb{R}$, we have

$$\max_j \bigl| F(J)(j) - F(J')(j) \bigr| \le \alpha \max_j \bigl| J(j) - J'(j) \bigr|, \tag{6}$$

so $F$ is a contraction mapping with modulus $\alpha$. Indeed, we have

$$F(J)(i) = \min_{u \in U(i)} \biggl[ \frac{g(i, u) + \alpha \sum_{j \ne i} p_{ij}(u) J(j)}{1 - \alpha p_{ii}(u)} \biggr] = \min_{u \in U(i)} \biggl[ \frac{g(i, u) + \alpha \sum_{j \ne i} p_{ij}(u) J'(j) + \alpha \sum_{j \ne i} p_{ij}(u) \bigl( J(j) - J'(j) \bigr)}{1 - \alpha p_{ii}(u)} \biggr] \le F(J')(i) + \alpha \max_j \bigl| J(j) - J'(j) \bigr|, \qquad \text{for all } i,$$

where we have used the fact

$$\frac{\sum_{j \ne i} p_{ij}(u)}{1 - \alpha p_{ii}(u)} = \frac{1 - p_{ii}(u)}{1 - \alpha p_{ii}(u)} \le 1.$$

Thus, we have

$$F(J)(i) - F(J')(i) \le \alpha \max_j \bigl| J(j) - J'(j) \bigr|, \qquad \text{for all } i.$$

The roles of $J$ and $J'$ may be reversed, so we can also obtain

$$F(J')(i) - F(J)(i) \le \alpha \max_j \bigl| J(j) - J'(j) \bigr|, \qquad \text{for all } i.$$

Combining the last two inequalities, we see that

$$\bigl| F(J)(i) - F(J')(i) \bigr| \le \alpha \max_j \bigl| J(j) - J'(j) \bigr|, \qquad \text{for all } i.$$

By taking the maximum over $i$, Eq. (6) follows.

1.9

(a) Since $J, J' \in B(S)$, i.e., they are real-valued, bounded functions on $S$, the infimum and the supremum of their difference are finite. We denote

$$m = \inf_{x \in S} \bigl( J(x) - J'(x) \bigr), \qquad M = \sup_{x \in S} \bigl( J(x) - J'(x) \bigr).$$

Thus

$$m \le J(x) - J'(x) \le M, \qquad \text{for all } x \in S,$$

or

$$J'(x) + m \le J(x) \le J'(x) + M, \qquad \text{for all } x \in S.$$

Now we apply the mapping $T$ to the above inequalities. By property (1), we know that $T$ preserves the inequalities. Thus

$$T(J' + me)(x) \le T(J)(x) \le T(J' + Me)(x), \qquad \text{for all } x \in S.$$

By property (2), we know that

$$T(J')(x) + \min[ a_1 r, a_2 r ] \le T(J' + re)(x) \le T(J')(x) + \max[ a_1 r, a_2 r ].$$

If we replace $r$ by $m$ or $M$, we get the inequalities

$$T(J')(x) + \min[ a_1 m, a_2 m ] \le T(J' + me)(x) \le T(J')(x) + \max[ a_1 m, a_2 m ]$$

and

$$T(J')(x) + \min[ a_1 M, a_2 M ] \le T(J' + Me)(x) \le T(J')(x) + \max[ a_1 M, a_2 M ].$$

Thus

$$T(J')(x) + \min[ a_1 m, a_2 m ] \le T(J)(x) \le T(J')(x) + \max[ a_1 M, a_2 M ],$$

so that

$$\bigl| T(J)(x) - T(J')(x) \bigr| \le \max[ a_1 M, a_2 M, -a_1 m, -a_2 m ].$$

We also have

$$\max[ a_1 M, a_2 M, -a_1 m, -a_2 m ] \le a_2 \max[ |M|, |m| ] \le a_2 \sup_{x \in S} \bigl| J(x) - J'(x) \bigr|.$$

Thus

$$\bigl| T(J)(x) - T(J')(x) \bigr| \le a_2 \sup_{x \in S} \bigl| J(x) - J'(x) \bigr|,$$

from which

$$\sup_{x \in S} \bigl| T(J)(x) - T(J')(x) \bigr| \le a_2 \sup_{x \in S} \bigl| J(x) - J'(x) \bigr|.$$

Thus $T$ is a contraction mapping, since we know by the statement of the problem that $0 \le a_1 \le a_2 < 1$. Since the set $B(S)$ of bounded real-valued functions is a complete linear space, we conclude that the contraction mapping $T$ has a unique fixed point $J^*$, and $\lim_{k \to \infty} T^k(J)(x) = J^*(x)$.

(b) We shall first prove the lower bounds on $J^*(x)$; the upper bounds follow by a similar argument. Since $J, T(J) \in B(S)$, there exists a $c \in \mathbb{R}$ such that

$$J(x) + c \le T(J)(x), \qquad \text{for all } x \in S. \tag{1}$$

We apply $T$ to both sides of (1), and since $T$ preserves inequalities [by assumption (1)], we have, applying the relation of assumption (2),

$$J(x) + \min[ c + a_1 c, c + a_2 c ] \le T(J)(x) + \min[ a_1 c, a_2 c ] \le T(J + ce)(x) \le T^2(J)(x). \tag{2}$$

Similarly, if we apply $T$ again we get

$$J(x) + \min_{i \in \{1,2\}} [ c + a_i c + a_i^2 c ] \le T(J)(x) + \min[ a_1 c + a_1^2 c,\ a_2 c + a_2^2 c ] \le T^2(J)(x) + \min[ a_1^2 c,\ a_2^2 c ] \le T\bigl( T(J) + \min[ a_1 c, a_2 c ] e \bigr)(x) \le T^3(J)(x).$$

Thus, by induction we conclude that

$$J(x) + \min\Bigl[ \sum_{m=0}^k a_1^m c,\ \sum_{m=0}^k a_2^m c \Bigr] \le T(J)(x) + \min\Bigl[ \sum_{m=1}^k a_1^m c,\ \sum_{m=1}^k a_2^m c \Bigr] \le \cdots \le T^k(J)(x) + \min[ a_1^k c,\ a_2^k c ] \le T^{k+1}(J)(x). \tag{3}$$

By taking the limit as $k \to \infty$, and noting that the quantities in the minimizations are monotone, and either nonnegative or nonpositive, we conclude that

$$J(x) + \min\Bigl[ \frac{c}{1 - a_1},\ \frac{c}{1 - a_2} \Bigr] \le T(J)(x) + \min\Bigl[ \frac{a_1 c}{1 - a_1},\ \frac{a_2 c}{1 - a_2} \Bigr] \le \cdots \le T^k(J)(x) + \min\Bigl[ \frac{a_1^k c}{1 - a_1},\ \frac{a_2^k c}{1 - a_2} \Bigr] \le T^{k+1}(J)(x) + \min\Bigl[ \frac{a_1^{k+1} c}{1 - a_1},\ \frac{a_2^{k+1} c}{1 - a_2} \Bigr] \le \cdots \le J^*(x). \tag{4}$$

Finally, we note from (3) that

$$\min[ a_1^k c,\ a_2^k c ] \le T^{k+1}(J)(x) - T^k(J)(x),$$

so that

$$\min[ a_1^k c,\ a_2^k c ] \le \inf_{x \in S} \bigl( T^{k+1}(J)(x) - T^k(J)(x) \bigr).$$

Let $b_{k+1} = \inf_{x \in S} \bigl( T^{k+1}(J)(x) - T^k(J)(x) \bigr)$, so that $\min[ a_1^k c,\ a_2^k c ] \le b_{k+1}$. From this relation we infer that

$$\min\Bigl[ \frac{a_1^{k+1} c}{1 - a_1},\ \frac{a_2^{k+1} c}{1 - a_2} \Bigr] \le \min\Bigl[ \frac{a_1 b_{k+1}}{1 - a_1},\ \frac{a_2 b_{k+1}}{1 - a_2} \Bigr] = c_{k+1}.$$

Therefore,

$$T^k(J)(x) + \min\Bigl[ \frac{a_1^k c}{1 - a_1},\ \frac{a_2^k c}{1 - a_2} \Bigr] \le T^{k+1}(J)(x) + c_{k+1}.$$

This relationship gives, for $k = 1$,

$$T(J)(x) + \min\Bigl[ \frac{a_1 c}{1 - a_1},\ \frac{a_2 c}{1 - a_2} \Bigr] \le T^2(J)(x) + c_2.$$