On Optimization of the Total Present Value of Profits under Semi Markov Conditions
Katehakis, Michael N.
Rutgers Business School, Department of MSIS
180 University, Newark, New Jersey 07102, U.S.A.

Abstract: In this paper we survey theory related to the optimization of semi-Markov processes and apply these techniques to a simple dynamic ferry dispatch problem, in which customers arrive at a ferry according to a Poisson process with rate λ > 0. We formulate this dispatch problem as a two-action semi-Markov decision process and illustrate computationally that an optimal dispatch policy is characterized by a single critical number x_0 such that it is optimal to wait until there are at least x_0 customers on the ferry before a departure occurs.

Key Words: Semi-Markov Optimization, Dynamic Scheduling, Markov Chains.

1 Introduction

In this paper we survey theory related to the optimization of semi-Markov processes and apply these techniques to a simple ferry dispatch problem, in which customers arrive at a ferry according to a Poisson process with rate λ > 0. We formulate this dispatch problem as a two-action semi-Markov decision process and illustrate computationally that an optimal dispatch policy is characterized by a single critical number x_0 such that it is optimal to wait until there are at least x_0 customers on the ferry before a departure occurs.

For related work in this area of dynamic scheduling we refer the reader to Ungureanu et al. [3] through Ungureanu et al. [6]. Further related work can be found in Zhao and Katehakis (2006) and Zhou and Katehakis (2008).

The paper is organized as follows. In Section 2 we survey the concepts of future and present rewards under continuous discounting. In Section 3 we present the main tools for the optimization of a system that can be modelled as a semi-Markov process; herein we follow Derman (1970) and Ross (1970).
In Section 4 we present the model for the ferry dispatch problem and present computationally optimal dispatch policies.

2 Future and Present Values of Rewards.

2.1 Future Values with Compounding.

With compounding, accumulated interest is added back to the principal, so that interest is earned on interest from that moment on. For example, a loan with $100 principal and a monthly interest rate of 1% that has its interest compounded every month would have a balance of $101 at the end of the first month, $102.01 at the end of the second month, etc.

Let t denote the total time in years and n the number of compounding periods per year; note that the total number of compounding periods (for example, a period is a month) in t years is nt. Let ρ be the nominal annual interest rate, expressed as a decimal, e.g., 12% = 0.12. Then ρ/n is the per-period interest rate, and the future value at time t of an initial capital R_0 is as follows (footnote 1: sometimes simple interest is used, in which case the future value is R_s(t) = R_0(1 + tρ)).
R_cn(t) = R_0 (1 + ρ/n)^{nt}.

Note that when the compounding frequency is annual, n is 1 and R_c1(t) = R_0 (1 + ρ)^t. Since the principal R_0 is a coefficient, it is often dropped for simplicity, and the resulting accumulation function is used in interest theory instead. The accumulation function b_cn(t) is:

b_cn(t) = (1 + ρ/n)^{nt}.

As n increases, b_cn(t) approaches an upper limit of e^{ρt}, cf. Figure 1. This limit is called the continuous compounding factor at rate ρ.

Figure 1: Convergence of b_cn(t) to e^{ρt} as n → ∞.

2.2 Present Values with Compounding.

With compound discounting, accumulated interest is subtracted from the balance; so, for example, a loan with a monthly interest rate of 1% that has its interest compounded every month and a balance of $102.01 at the end of the second month would have a balance of $101 at the end of the first month, and a present value of $100 at the start of the time horizon. As before, let t denote the total time in years, n the number of compounding periods per year, and let ρ be the nominal annual interest rate, expressed as a decimal, e.g., 12% = 0.12. Then ρ/n is the per-period interest rate, and under compound discounting the present value at time t of a capital R_0 is:

R_cn(t) = R_0 d_cn(t),

where the discount function d_cn(t) is:

d_cn(t) = (1 − ρ/n)^{nt}.

As n increases, the discount function d_cn(t) approaches an upper limit of e^{−ρt}. The function e^{−ρt} is called the continuous discounting factor at rate ρ.
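The convergence of b_cn(t) to e^{ρt} and of d_cn(t) to e^{−ρt} can be checked numerically. The following is a minimal sketch (the function names are ours, not from the paper):

```python
import math

def future_factor(rho: float, n: int, t: float) -> float:
    """Accumulation factor b_cn(t) = (1 + rho/n)**(n*t)."""
    return (1.0 + rho / n) ** (n * t)

def discount_factor(rho: float, n: int, t: float) -> float:
    """Discount function d_cn(t) = (1 - rho/n)**(n*t)."""
    return (1.0 - rho / n) ** (n * t)

rho, t = 0.12, 1.0
for n in (1, 12, 365, 10_000):
    print(n, future_factor(rho, n, t), discount_factor(rho, n, t))

# The limits are the continuous compounding and discounting factors:
print(math.exp(rho * t), math.exp(-rho * t))
```

For ρ = 0.12 and t = 1 the factors move from 1.12 and 0.88 at n = 1 toward e^{0.12} and e^{−0.12} as n grows.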
3 Optimization in Semi-Markov Processes.

In this section we survey the most important features of a sequential decision process for which the times between transitions are random. Such a process is observed at time 0 and its state is classified as an element of the set X = {0, 1, 2, ...}. If the process has just entered state x at time 0, an action a from a set of available actions A_x must be chosen. Then, as a result of this state-action pair (x, a), the following events unfold:

(i) The time spent in state x (the sojourn time in state x), conditional on the event that the next state visited when the process leaves state x is state y, is a random variable S_{xa} with probability distribution F_{xy,a}.

(ii) The probability that the next state is state y is p_{xy,a}, where ∑_{y∈X} p_{xy,a} = 1.

After the process leaves state x and upon entering a new state y, a new action a from the set of allowable actions in state y, A_y, must be chosen, and steps (i) and (ii) above are repeated ad infinitum. It is further supposed that there is a reward structure associated with the states visited and actions chosen. If action a is chosen when the process enters state x, then:

(i) an immediate reward R(x, a) is earned;

(ii) additional rewards accumulate at a rate r(x, a) per unit of time that the process stays in state x.

Thus, the total reward associated with the state-action pair (x, a) when the process stays in state x for t units of time is given by R(x, a) + t r(x, a), and its present value under continuous discounting is equal to:

R(x, a; t) = R(x, a) + ∫_0^t e^{−ρs} r(x, a) ds.

Remark. When the transition times are identically one, the above is just a Markov decision process; in the general case, it is called a semi-Markov decision process. We also note that if a stationary policy is employed, then the process {X(t), t ≥ 0} is a semi-Markov process, where X(t) represents the state of the process at time t. To avoid trivialities, we will make the following assumption.

Assumption I.
(i) The reward functions R(x, a) and r(x, a) are bounded.

(ii) There exist constants δ > 0 and ε > 0 such that

P(S_{xa} > δ) ≥ ε, for all (x, a). (1)

Note that P(S_{xa} > δ) = ∑_{y∈X} p_{xy,a} F̄_{xy,a}(δ), where we use the notation F̄_{xy,a} = 1 − F_{xy,a}. Thus, Assumption I(ii) states that for every state x and action a there is a positive probability of at least ε that the sojourn time in state x will be greater than δ. Hence, an infinite number of transitions cannot occur in a finite interval.

3.1 Optimization of Present Values.

We assume that rewards are continuously discounted, and the objective is to maximize the expected total present value of a stream of rewards. Note that a reward R received at time t has equivalent present value (at time 0) equal to Re^{−ρt}. Most of these results are well known, cf. [1] and [2], and the theorems will be stated without proof. Let L_S(ρ) denote the Laplace transform of a random variable S, i.e.,
L_S(ρ) = E e^{−ρS}.

Notice that

L_{S_{xa}}(ρ) = ∑_{y∈X} p_{xy}(a) ∫_0^∞ e^{−ρt} dF_{xy,a}(t). (2)

We also define

r(x, a; S_{xa}) = ∫_0^{S_{xa}} r(x, a) e^{−ρs} ds. (3)

Using Eqs. (2) and (3) above, we obtain the following expression for the expected discounted reward R̄(x, a) during the sojourn time S_{xa} in state x when action a is taken:

R̄(x, a) = R(x, a) + r̄(x, a), (4)

where

r̄(x, a) = E r(x, a; S_{xa}) = r(x, a)(1 − E e^{−ρ S_{xa}})/ρ = r(x, a)(1 − L_{S_{xa}}(ρ))/ρ. (5)

Let X_n and A_n be, respectively, the n-th state of the process and the n-th action chosen, n = 1, 2, .... Now, for any deterministic policy π (i.e., a rule for choosing actions as a function of the past observations of states and times) and ρ > 0, the expected total discounted reward over an infinite horizon, w_{ρ,π}(x), when policy π is employed is equal to:

w_{ρ,π}(x) = E_π ( ∑_{n=0}^∞ e^{−ρ ∑_{ν=0}^{n−1} S_{X_ν A_ν}} R̄(X_n, A_n) | X_0 = x )
          = R̄(x, π(x)) + L_{S_{x π(x)}}(ρ) ∑_{y∈X} p_{xy}(π(x)) w_{ρ,π}(y). (6)

The value function is defined as follows:

v_ρ(x) = sup_π { w_{ρ,π}(x) }. (7)

A policy π is optimal if

w_{ρ,π}(x) = v_ρ(x), for all x ∈ X. (8)

The following classic theorems are used to specify the optimal value function and the existence of a simple optimal policy.

Theorem 1 Under Assumption I, the value function v_ρ(x) is the unique solution of the following system of equations:

v_ρ(x) = max_{a∈A(x)} { R̄(x, a) + L_{S_{xa}}(ρ) ∑_{y∈X} p_{xy}(a) v_ρ(y) }. (9)
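Since equation (9) is a fixed-point equation, it can be solved numerically by successive approximation. The following is a minimal sketch on a hypothetical two-state, two-action instance with exponential sojourn times, so that L_{S_{xa}}(ρ) = λ_{xa}/(λ_{xa} + ρ); all numerical values are illustrative and not taken from the paper:

```python
# Successive approximation for equation (9) on a small hypothetical instance.
rho = 0.1

# (state, action) -> (one-step discounted reward Rbar, sojourn rate lam,
#                     transition probabilities {next state: prob})
model = {
    (0, "a"): (1.0, 1.0, {0: 0.5, 1: 0.5}),
    (0, "b"): (0.5, 2.0, {1: 1.0}),
    (1, "a"): (2.0, 1.0, {0: 1.0}),
}

v = {0: 0.0, 1: 0.0}
for _ in range(500):
    # One Bellman backup: v(x) <- max_a { Rbar + L(rho) * sum_y p(y) v(y) }
    v = {
        x: max(
            Rbar + lam / (lam + rho) * sum(p * v[y] for y, p in trans.items())
            for (s, _a), (Rbar, lam, trans) in model.items()
            if s == x
        )
        for x in v
    }

print(v)  # approximate solution of equation (9)
```

Because L_{S_{xa}}(ρ) < 1 under Assumption I, the backup operator is a contraction and the iterates converge geometrically to the unique fixed point.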
Further, if we define a policy π_0 such that for all x ∈ X it chooses the action π_0(x) defined below:

π_0(x) = argmax_{a∈A(x)} { R̄(x, a) + L_{S_{xa}}(ρ) ∑_{y∈X} p_{xy}(a) v_ρ(y) }, (10)

then we have the following theorem.

Theorem 2 The policy π_0 is optimal, i.e.,

w_{ρ,π_0}(x) = v_ρ(x), for all x ∈ X. (11)

Theorem 3 Under Assumption I, the iterates v^ν_ρ(x) produced by (12) below converge to the value function v_ρ(x) as ν → ∞:

v^ν_ρ(x) = max_{a∈A(x)} { R̄(x, a) + L_{S_{xa}}(ρ) ∑_{y∈X} p_{xy}(a) v^{ν−1}_ρ(y) }, (12)

for arbitrary initial values v^0_ρ(x).

4 An Application: The Optimal Ferry Dispatch Problem.

Suppose that customers arrive at a ferry according to a Poisson process with rate λ > 0. At any time t, the decision maker (captain) may depart at a cost of K + tk units, where K is a fixed cost and k is a cost proportional to the delay from the nominal departure time. Suppose also that there is a revenue of R(x) if the ferry picks up all x customers, where R(x) is a bounded, increasing, nonnegative function. The process is assumed to go on indefinitely, and the problem is to select a policy that maximizes the total expected discounted profit for the ferry.

This problem can be formulated as a two-action semi-Markov decision process with states X = {1, 2, ..., S}, where state x means that there are x customers currently on board and S is the capacity of the ferry. Let a_1 denote the action "depart" and let a_0 denote the action "wait". We assume that the process repeats without delay. The parameters of the problem are:

1. Under action a_1: p_{x1}(a_1) = 1, F_{x1,a_1}(t) = 1 − e^{−λt}, L_{S_{x a_1}}(ρ) = λ/(λ + ρ) and R̄(x, a_1) = R(x) − K.

2. Under action a_0: p_{x,x+1}(a_0) = 1, F_{x,x+1,a_0}(t) = 1 − e^{−λt}, L_{S_{x a_0}}(ρ) = λ/(λ + ρ) and R̄(x, a_0) = −k.

3. Also, A(x) = {a_1, a_0} for x = 1, ..., S − 1, and A(S) = {a_1}.

Thus the Bellman optimality conditions of Theorem 1 for x < S are:

v_ρ(x) = max{ T_1(x, a_1), T_0(x, a_0) }, (13)

where

T_1(x, a_1) = R(x) − K + (λ/(λ + ρ)) v_ρ(1)
and

T_0(x, a_0) = −k + (λ/(λ + ρ)) v_ρ(x + 1),

and for x = S they are

v_ρ(S) = T_1(S, a_1) = R(S) − K + (λ/(λ + ρ)) v_ρ(1). (14)

It follows that it is optimal to depart when there are x customers present whenever

R(x) − K + (λ/(λ + ρ)) v_ρ(1) > −k + (λ/(λ + ρ)) v_ρ(x + 1).

Using Theorem 1, we have done indicative computations using the values λ = 1, ρ = 0.1, K = 20, k = 0.1, S = 40, and R(x) = rx, where r = 1.5. In Figure 2 we plot v^ν_ρ(25) and v^ν_ρ(30) versus ν in order to illustrate the convergence of v^ν_ρ(x) to v_ρ(x) as ν → ∞.

Figure 2: Convergence of v^ν_ρ(x) to v_ρ(x) as ν → ∞.

In Figure 3 we illustrate the form of the optimal policy, where we observe that there exists a fixed critical constant x_0 such that π_0(x) = 0 for x < x_0 and π_0(x) = 1 for x ≥ x_0.

Figure 3: Optimal actions π_0(x).

References:

[1] Derman, C. (1970). Finite State Markovian Decision Processes, Academic Press.

[2] Ross, S. M. (1970). Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, CA.

[3] Ungureanu V., Melamed B., Katehakis M.N. and Bradford P.G. (2006). Deferred Assignment Scheduling in Cluster-based Servers. Cluster Computing 9(1).

[4] Ungureanu V., Melamed B., Katehakis M.N. and Bradford P.G. (2006). Class-Dependent Assignment in Cluster-based Servers. SAC 2004.
[5] Ungureanu V., Melamed B. and Katehakis M.N. (2004). The LC Assignment Policy for Cluster-Based Servers. NCA 2004.

[6] Ungureanu V., Melamed B. and Katehakis M.N. (2004). Performance Comparison of Assignment Policies on Cluster-based E-Commerce Servers. WSEAS Transactions. Also in Proceedings of the International Conference on Software Engineering, Parallel and Distributed Systems, February 13-15, 2004, Salzburg, Austria.

[7] Ungureanu V., Melamed B. and Katehakis M.N. (2003). Towards an Efficient Cluster-Based E-Commerce Server. CLUSTER 2003.

[8] Veinott A. F. (1966). On the optimality of (s, S) inventory policies: new conditions and a new proof. SIAM J. Appl. Math.

[9] Zhao Y. and Katehakis M. N. (2006). On the structure of optimal ordering policies for stochastic inventory systems with minimum order quantity. Probability in the Engineering and Informational Sciences.

[10] Zhou B., Katehakis M. N. and Zhao Y. (2007). Effective control policies for stochastic inventory systems with minimum order quantity and linear costs. International Journal of Production Economics, Vol. 106(2).
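As a complement, the indicative computation of Section 4 can be reproduced with a short value-iteration script (a sketch under the paper's parameter values λ = 1, ρ = 0.1, K = 20, k = 0.1, S = 40, R(x) = 1.5x; the variable names and the fixed-iteration stopping rule are ours):

```python
# Value iteration (Theorem 3) for the ferry dispatch equations (13)-(14).
lam, rho = 1.0, 0.1
K, k, S, r = 20.0, 0.1, 40, 1.5
beta = lam / (lam + rho)  # L_{S_xa}(rho) for an Exp(lam) sojourn time

v = [0.0] * (S + 1)  # v[1..S]; index 0 is unused
for _ in range(2000):
    w = v[:]
    for x in range(1, S):
        depart = r * x - K + beta * v[1]   # T_1(x, a_1)
        wait = -k + beta * v[x + 1]        # T_0(x, a_0)
        w[x] = max(depart, wait)
    w[S] = r * S - K + beta * v[1]         # equation (14)
    v = w

# Recover the optimal policy: 1 = depart, 0 = wait.
policy = [0] * (S + 1)
for x in range(1, S):
    policy[x] = 1 if r * x - K + beta * v[1] >= -k + beta * v[x + 1] else 0
policy[S] = 1
x0 = policy.index(1, 1)  # smallest state in which departing is optimal
print("critical number x_0 =", x0)
```

Since β = λ/(λ + ρ) < 1, the iteration is a contraction and converges to v_ρ; the printed x_0 is the critical number illustrated in Figure 3.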
A Queueing System with Queue Length Dependent Service Times, with Applications to Cell Discarding in ATM Networks by Doo Il Choi, Charles Knessl and Charles Tier University of Illinois at Chicago 85 South
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationTHE ON NETWORK FLOW EQUATIONS AND SPLITTG FORMULAS TRODUCTION FOR SOJOURN TIMES IN QUEUEING NETWORKS 1 NO FLOW EQUATIONS
Applied Mathematics and Stochastic Analysis 4, Number 2, Summer 1991, III-I16 ON NETWORK FLOW EQUATIONS AND SPLITTG FORMULAS FOR SOJOURN TIMES IN QUEUEING NETWORKS 1 HANS DADUNA Institut flit Mathematische
More informationLink Models for Packet Switching
Link Models for Packet Switching To begin our study of the performance of communications networks, we will study a model of a single link in a message switched network. The important feature of this model
More informationREINFORCEMENT LEARNING
REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents
More informationStochastic process. X, a series of random variables indexed by t
Stochastic process X, a series of random variables indexed by t X={X(t), t 0} is a continuous time stochastic process X={X(t), t=0,1, } is a discrete time stochastic process X(t) is the state at time t,
More informationInventory Ordering Control for a Retrial Service Facility System Semi- MDP
International Journal of Engineering Science Invention (IJESI) ISS (Online): 239 6734, ISS (Print): 239 6726 Volume 7 Issue 6 Ver I June 208 PP 4-20 Inventory Ordering Control for a Retrial Service Facility
More informationSome notes on Markov Decision Theory
Some notes on Markov Decision Theory Nikolaos Laoutaris laoutaris@di.uoa.gr January, 2004 1 Markov Decision Theory[1, 2, 3, 4] provides a methodology for the analysis of probabilistic sequential decision
More informationOn Stability and Sojourn Time of Peer-to-Peer Queuing Systems
On Stability and Sojourn Time of Peer-to-Peer Queuing Systems Taoyu Li Minghua Chen Tony Lee Xing Li Tsinghua University, Beijing, China. {ldy03@mails.tsinghua.edu.cn,xing@cernet.edu.cn} The Chinese University
More informationChapter 2. Poisson Processes. Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan
Chapter 2. Poisson Processes Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan Outline Introduction to Poisson Processes Definition of arrival process Definition
More informationPBW 654 Applied Statistics - I Urban Operations Research
PBW 654 Applied Statistics - I Urban Operations Research Lecture 2.I Queuing Systems An Introduction Operations Research Models Deterministic Models Linear Programming Integer Programming Network Optimization
More informationPractical Dynamic Programming: An Introduction. Associated programs dpexample.m: deterministic dpexample2.m: stochastic
Practical Dynamic Programming: An Introduction Associated programs dpexample.m: deterministic dpexample2.m: stochastic Outline 1. Specific problem: stochastic model of accumulation from a DP perspective
More information1 Basic concepts from probability theory
Basic concepts from probability theory This chapter is devoted to some basic concepts from probability theory.. Random variable Random variables are denoted by capitals, X, Y, etc. The expected value or
More informationDES and RES Processes and their Explicit Solutions
DES and RES Processes and their Explicit Solutions Michael N Katehakis Dept of Management Science and Information Systems, Rutgers Business School - Newark and New Brunswick, 1 Washington Park Newark,
More informationIntroduction to queuing theory
Introduction to queuing theory Queu(e)ing theory Queu(e)ing theory is the branch of mathematics devoted to how objects (packets in a network, people in a bank, processes in a CPU etc etc) join and leave
More informationSession-Based Queueing Systems
Session-Based Queueing Systems Modelling, Simulation, and Approximation Jeroen Horters Supervisor VU: Sandjai Bhulai Executive Summary Companies often offer services that require multiple steps on the
More information2905 Queueing Theory and Simulation PART III: HIGHER DIMENSIONAL AND NON-MARKOVIAN QUEUES
295 Queueing Theory and Simulation PART III: HIGHER DIMENSIONAL AND NON-MARKOVIAN QUEUES 16 Queueing Systems with Two Types of Customers In this section, we discuss queueing systems with two types of customers.
More informationCPSC 531: System Modeling and Simulation. Carey Williamson Department of Computer Science University of Calgary Fall 2017
CPSC 531: System Modeling and Simulation Carey Williamson Department of Computer Science University of Calgary Fall 2017 Motivating Quote for Queueing Models Good things come to those who wait - poet/writer
More informationQUEUING MODELS AND MARKOV PROCESSES
QUEUING MODELS AND MARKOV ROCESSES Queues form when customer demand for a service cannot be met immediately. They occur because of fluctuations in demand levels so that models of queuing are intrinsically
More informationThis question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.
This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you
More informationMarkov Processes Cont d. Kolmogorov Differential Equations
Markov Processes Cont d Kolmogorov Differential Equations The Kolmogorov Differential Equations characterize the transition functions {P ij (t)} of a Markov process. The time-dependent behavior of the
More informationOptimal Control of an Inventory System with Joint Production and Pricing Decisions
Optimal Control of an Inventory System with Joint Production and Pricing Decisions Ping Cao, Jingui Xie Abstract In this study, we consider a stochastic inventory system in which the objective of the manufacturer
More informationBayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem
Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem Parisa Mansourifard 1/37 Bayesian Congestion Control over a Markovian Network Bandwidth Process:
More informationMarkov Decision Processes and their Applications to Supply Chain Management
Markov Decision Processes and their Applications to Supply Chain Management Jefferson Huang School of Operations Research & Information Engineering Cornell University June 24 & 25, 2018 10 th Operations
More informationInventory Control with Convex Costs
Inventory Control with Convex Costs Jian Yang and Gang Yu Department of Industrial and Manufacturing Engineering New Jersey Institute of Technology Newark, NJ 07102 yang@adm.njit.edu Department of Management
More informationPage 0 of 5 Final Examination Name. Closed book. 120 minutes. Cover page plus five pages of exam.
Final Examination Closed book. 120 minutes. Cover page plus five pages of exam. To receive full credit, show enough work to indicate your logic. Do not spend time calculating. You will receive full credit
More informationproblem. max Both k (0) and h (0) are given at time 0. (a) Write down the Hamilton-Jacobi-Bellman (HJB) Equation in the dynamic programming
1. Endogenous Growth with Human Capital Consider the following endogenous growth model with both physical capital (k (t)) and human capital (h (t)) in continuous time. The representative household solves
More information