Fork-Join Networks in Heavy Traffic: Diffusion Approximations and Control. M.Sc. Research Proposal


Fork-Join Networks in Heavy Traffic: Diffusion Approximations and Control

M.Sc. Research Proposal

Asaf Zviran

Advisors: Prof. Rami Atar, Prof. Avishai Mandelbaum

Faculty of Industrial Engineering and Management, Technion Israel Institute of Technology

August 20,

Contents

1 Introduction
  1.1 Fork-Join Networks: Definition and Some Applications
  1.2 Short Literature Survey
  1.3 Preliminaries and Notations
2 Mathematical Model
  2.1 System Representation
  2.2 State Space Representation
3 Fluid and Diffusion Limits
  3.1 Fluid Limits
  3.2 Diffusion Limits
  3.3 State-Space Implications
  3.4 Heavy Traffic Limit for the Throughput Time
4 Brief Discussion on System Control
  4.1 Length of Stay Estimation: The Snapshot Principle
  4.2 Priorities Control
  4.3 Staffing Control
5 Proposed Research
  5.1 Performance Analysis
  5.2 Stochastic Control

1 Introduction

This research proposal deals with the development of heavy traffic approximations, which enable some practical aspects of analysis and control of system behavior in fork-join networks. Despite the importance of parallel processing of the kind that appears in the fork-join class, there are still few results concerning many practical and important aspects of the system's behavior. In Section 1.1 we give the definition and some examples of fork-join networks and their applications. In Section 1.2 we give a short survey of the work that has been done in developing methods to analyze this family of networks. In Section 1.3 we introduce notations and definitions in preparation for the system analysis in the following sections.

1.1 Fork-Join Networks: Definition and Some Applications

A fork-join network consists of a group of service stations which serve the arriving customers simultaneously and sequentially according to preset deterministic precedence constraints. More specifically, one can think in terms of jobs arriving to the system over time, each job containing different tasks that need to be executed according to the precedence constraints. The job may leave the system only after all its tasks have finished their service. The distinguishing features of this model class are the so-called fork and join constructs. A fork occurs whenever several tasks are being processed at the same time. In the network model, this is represented by a splitting of the job into multiple tasks, which are then sent simultaneously to their respective servers. A join node, on the other hand, corresponds to a task that may not be initiated until several other tasks have been completed. Components are joined only if they correspond to the same job; thus a join is always preceded by a fork. If the last stage of operation consists of multiple tasks, then these tasks regroup into a single job before departing the system.

Fork-join task flow examples. In Fig. 1.1 we can see the process progressing from the arrest of an alleged criminal until getting him to trial (arraignment). As shown, the process consists of three simultaneous paths: the path of the arrestee, the path of the arresting officer, and the path of the arrestee's information through the system. This example is taken from Larson's article [7] on improving the N.Y.C. A-to-A (Arrest-to-Arraignment) system. In Fig. 1.2 we can see the process taking place from the arrival of an order to build a house until the completion of the numerous tasks required in the construction plan.

In the graph, the construction order splits and joins throughout the system until all the tasks are finished; the precedence constraints take the form of a flow chart.

Figure 1.1: Arrest-to-Arraignment process

Figure 1.2: Construction of a house

Fork-join networks are natural models for a variety of processes, including communication and computer systems, manufacturing and project management (as introduced in Fig. 1.2), and service systems (as introduced in Fig. 1.1). A fork-join computer or telecommunication network typically represents the processing of computer programs, data packets, etc., which involve parallel multitasking and the splitting and joining of information. In manufacturing, a fork-join network, called an assembly network, represents the assembly of a product or system which requires several parts that are processed simultaneously at separate workstations or plant locations. Fork-join networks can be found frequently in the health-care system in general, and in hospitals in particular (see Fig. 1.3), in which the patient and his medical file, test results, and insurance policy may split and join in different parts of the process in order to reach the final task, which may be admitting a patient to the wards, starting an operation, etc. Another reason for the need of a fork-join model in hospitals is the necessity to join and synchronize many separate resources (doctors, nurses, room/bed, special equipment) in order to perform one integrated operation. In this research we will try to develop and implement the mathematical approximations needed for some practical aspects of analysis and control of hospital processes.

Figure 1.3: Fork-join network in hospitals: preparation for surgery

1.2 Short Literature Survey

This research focuses on the use of heavy traffic limits, and especially fluid and diffusion limits, in the analysis of stochastic networks. In his book [5] Harrison presents the basics for this work, including the regulator mapping for stochastic processes, the representation of the processes as Reflected Brownian Motion (RBM), and examples of optimal control of Brownian motion. The processes examined in his book were simple buffer flow processes. His method for analyzing stochastic models by Brownian motion was later expanded by him in [6]. Chen and Yao present in their book [3] the method for constructing the regulator mapping for a G/G/1 system and developing the fluid and diffusion limits of the queue-length and workload processes. They show that the diffusion limit is a one-dimensional RBM. These two books ([5] and [3]) have had great impact on the development of the regulator mapping and limit processes in Sections 2.1 and 3. Nguyen developed in her papers [8] and [9] a system and state-space representation for fork-join systems with homogeneous and heterogeneous customer populations. She succeeded in showing that those systems converge weakly in heavy traffic to a multidimensional RBM whose state space is a nonsimple polyhedral cone within the nonnegative orthant. Her system and state-space representations are used in this proposal (Section 2.1) as the basis for the development of the limit processes. In her work Nguyen assumed that all the stations in the system converge to heavy traffic uniformly, which means that state-space collapse and a single bottleneck were not treated by her. Another assumption she used was that the priority discipline is always FCFS. These assumptions are the baseline for the further work being done in our research. Peterson in his PhD thesis [10] worked on diffusion approximations for networks of queues with multiple customer types. He focused on feedforward generalized Jackson systems with a preemptive discipline. He proved a heavy traffic limit theorem for these systems, with a multidimensional RBM limit. Peterson also did not include bottleneck analysis in his research, but he did show state-space collapse in the sense of customer type priorities. Reiman and Simon [2] expanded Peterson's work in the analysis of feedforward and feedback generalized Jackson systems with multiple customer types. They included in their work an analysis of single-bottleneck systems, and showed that such systems converge weakly to a one-dimensional RBM. They also showed the effects of state-space collapse of non-bottleneck stations and state-space collapse of high-priority customers, and presented the notion of the snapshot principle, which is used in Section 4.1. One of the goals of this proposal is to expand their results to fork-join systems. Whitt's paper [11] was helpful for the development of weak convergence of the processes of interest under diffusion scaling. The paper by Banks and Dai [4] was used as a reference for the problems of defining stability for systems which include feedback in their routing constraints.

Cohen, Mandelbaum and Shtub [1] examined control mechanisms for project management in a multi-project environment. They surveyed a variety of buffer management techniques in open and closed systems. In this research we intend to examine the control problem which arises in their work by means of optimal control constructed on the associated limit process.

1.3 Preliminaries and Notations

The main subject of this proposal is the characterization of limiting processes for the queue-length, total job-count and total workload processes in fork-join queues under heavy traffic limits. Limit theorems that we use include the functional strong law of large numbers (FSLLN) and the functional central limit theorem (FCLT). Throughout this proposal we shall follow the convention of using the term fluid limit for the FSLLN limit process and diffusion limit for the FCLT limit process. Heavy traffic limits of the type considered in this proposal, which involve convergence of normalized stochastic processes to a limit process (in this case a multidimensional RBM), utilize the notion of weak convergence of probability measures on metric spaces; the metric space is $D^r[0,1]$, the $r$-dimensional product space of right-continuous functions on $[0,1]$ that have left limits, endowed with the Skorohod topology. Throughout the proposal weak convergence is denoted by $\Rightarrow$, distinguished from almost sure convergence, which is denoted by $\to$. Throughout the proposal we will write $\bar{X}(t)$ for the fluid limit of a process $X(t)$, which satisfies the scaling $\bar{X}^n(t) = \frac{1}{n}X(nt) \to \bar{X}(t)$ u.o.c. as $n \to \infty$, and $\hat{X}(t)$ for the diffusion limit, which satisfies the scaling $\hat{X}^n(t) = n^{1/2}\,[\bar{X}^n(t) - \bar{X}(t)] \Rightarrow \hat{X}(t)$ as $n \to \infty$.
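To make the two scalings concrete, the following short sketch (an illustration of ours, not part of the proposal; parameter values are hypothetical) applies the fluid and diffusion scalings to a simulated Poisson arrival process, whose fluid limit is $\lambda t$.

```python
# Illustrative sketch (not from the proposal): fluid and diffusion scaling of a
# renewal (here Poisson) arrival process N(t), whose fluid limit is lambda*t.
import numpy as np

rng = np.random.default_rng(0)
lam, n, T = 2.0, 10_000, 1.0          # arrival rate, scaling index, time horizon

# Arrival epochs of a Poisson process observed up to time n*T.
interarrivals = rng.exponential(1.0 / lam, size=int(3 * lam * n * T))
arrival_times = np.cumsum(interarrivals)

def N(t):
    """Counting process N(t) = number of arrivals in [0, t]."""
    return np.searchsorted(arrival_times, t, side="right")

t_grid = np.linspace(0.0, T, 201)
fluid_scaled = np.array([N(n * t) / n for t in t_grid])          # \bar N^n(t)
diffusion_scaled = np.sqrt(n) * (fluid_scaled - lam * t_grid)    # \hat N^n(t)

print("max |fluid-scaled - lambda*t|:", np.abs(fluid_scaled - lam * t_grid).max())
# Over many independent replications, the variance of diffusion_scaled[-1]
# approaches lambda * c_a^2 * T (= 2 here, since c_a^2 = 1 for Poisson arrivals).
print("diffusion-scaled value at t = T:", diffusion_scaled[-1])
```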

2 Mathematical Model

In this section we develop the mathematical model for the processes of interest. Our interest will be focused on a class of fork-join networks with single-server stations and the FCFS priority discipline; the customer flow into the system is homogeneous with a deterministic feedforward routing scheme. This system structure preserves the ordering of customer arrivals, meaning the customers' entrance order is preserved throughout the system until departure. This property has good implications for the synchronization of tasks at join nodes. We start in Section 2.1 with a pathwise construction of the job-count process vector and the total workload vector, the key performance measures of the system. In Section 2.2 we develop the state space of the job-count process vector and prove that it is contained in a nonsimple polyhedral cone.

2.1 System Representation

Characterization of the System Structure

The network consists of $J$ single-server stations, indexed by $j \in \{1,\dots,J\}$. We assume that each server works at a constant rate, and we denote by $\tau_j$ the mean completion time of a job at station $j$. The network has an input stream of homogeneous jobs and we denote by $\lambda$ the average arrival rate of new jobs. The processing of each job requires the completion of $J$ tasks. Each task is performed at a specific single-server station, and the task performed at station $j$ is referred to as task $j$. The order in which tasks are performed is specified by a given set of precedence constraints, which may allow some tasks to be performed in parallel and may require that others be performed sequentially. Task $i$ is said to be an immediate predecessor of task $j$ if upon completing task $i$ the job moves immediately to station $j$. The precedence relationships can be expressed via a precedence matrix $P = (P_{ij})$ defined as follows:
$$P_{ij} = \begin{cases} 1, & \text{if task } i \text{ is an immediate predecessor of task } j, \\ 0, & \text{otherwise.} \end{cases} \qquad (2.1)$$
(Because all elements of the precedence matrix $P$ are 0's and 1's, routing is clearly deterministic.) We assume that there is a column and row permutation of $P$ such that the resulting matrix is strictly upper triangular; in terms of the model, this means that we consider only systems in which tasks are not repeated (feedforward systems). Now we will define the buffers in the system.

For every element $P_{ij} = 1$ in the precedence matrix there is an associated buffer, and whenever a service is completed at station $i$, one can think of the departing task as entering a waiting room or buffer preceding server $j$. Notationally, it will be convenient to index the buffers by $k \in \{1,\dots,K\}$, where every buffer $k$ is associated with a different $(i,j)$ pair. For each station $j \in \{1,\dots,J\}$ we define $\beta(j)$ as the set of buffers $k$ that are incident to $j$. It can be seen that the $J$ sets $\{\beta(j)\}$ satisfy $|\beta(j)| \ge 1$ for all $j$ and $K = \sum_{j=1}^{J} |\beta(j)| \ge J$. The property $|\beta(j)| > 1$ characterizes servers $j$ which are join nodes in the system, meaning a task that arrives at a buffer $k \in \beta(j)$ with $|\beta(j)| > 1$ may not be initiated until the other tasks, performed simultaneously at other servers, merge with it in the waiting rooms of $\beta(j)$. We let $s(k) \in \{1,\dots,J\}$ be the source of buffer $k$, that is, the station whose output feeds into buffer $k$; if $k$ is an external arrival buffer then we define $s(k) \equiv 0$. Accordingly we define $s(\beta(j)) \subseteq \{1,\dots,J\}$ as the set of servers which are immediate predecessors of station $j$; if $j$ is an entrance station then we define $s(\beta(j)) \equiv \emptyset$ (the empty set). During the analysis of the system and the development of the limit processes we will use induction on the stations, from the entrance stations to the departure stations. This induction is based on the feedforward nature of the system, since in feedforward systems a station is not affected by the netflow of the stations downstream of it but is affected by the netflow of its predecessors (upstream). The induction is on $d(j)$, the depth of station $j$, which is defined in the following way. For $j$ whose immediate predecessor set is $s(\beta(j)) = \emptyset$, we set $d(j) \equiv 1$. Next, consider a station $j$ such that $d(i)$ has been defined for all stations in its predecessor set $s(\beta(j))$. The depth of station $j$ is given by
$$d(j) \equiv \max\{d(i) : i \in s(\beta(j))\} + 1. \qquad (2.2)$$

System Representation Example

Let us define the system representation for the following system (see Fig. 2.1). The system has 4 single-server stations indexed by $j \in \{1,\dots,4\}$, and 7 buffers indexed by $k \in \{1,\dots,7\}$. The servers' depth vector is $d = [1, 1, 2, 2]$. The system precedence matrix can be expressed as in (2.3). An example of the precedence sets: $\beta(3) = \{3, 4\}$ and $s(4) = 2$.
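As a small illustration of the depth recursion (2.2), the sketch below computes $d(j)$ from a feedforward precedence matrix; the matrix used here is a hypothetical example of ours, not the matrix of Fig. 2.1.

```python
# Illustrative sketch: station depths d(j) from a feedforward precedence matrix P,
# following (2.2). The 4x4 matrix below is a hypothetical example, not (2.3).
import numpy as np

P = np.array([[0, 0, 1, 0],   # P[i, j] = 1 if task i is an immediate predecessor of task j
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])

def depths(P):
    J = P.shape[0]
    preds = [set(np.flatnonzero(P[:, j])) for j in range(J)]   # s(beta(j)) for each station
    d = [None] * J
    while any(v is None for v in d):                           # terminates for feedforward P
        for j in range(J):
            if d[j] is None and all(d[i] is not None for i in preds[j]):
                # Entrance stations get depth 1; otherwise 1 + max depth of the predecessors.
                d[j] = 1 if not preds[j] else 1 + max(d[i] for i in preds[j])
    return d

print(depths(P))   # -> [1, 1, 2, 2] for this hypothetical matrix
```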

Figure 2.1: System representation example

System Primitive Data and Immediate Workload Processes

The primitive data that we use to construct the processes of interest are the i.i.d. sequences of interarrival times and service times. Let $(\Omega, \mathcal{F}, P)$ be a probability space on which are defined sequences of random variables $\{u(i),\, i \ge 1\}$ and $\{v_j(i),\, i \ge 1\}$, $j = 1,\dots,J$, where $u(i)$ and $v_j(i)$ are strictly positive with unit mean. For this proposal we restrict ourselves to the case where each is a sequence of i.i.d. random variables and the $J+1$ sequences are mutually independent. From these sequences, the interarrival times and the service times are constructed by setting the interarrival time of the $i$-th job to be $\lambda^{-1}u(i)$ and its service time at station $j$ to be $\tau_j v_j(i)$. We will assume that initially, i.e., at $t = 0$, the system is empty, meaning that $Q_k(0) = 0$ for all $k$. The flow of jobs through the system is then characterized by two primitives:

External arrival process,
$$N(t) \equiv \max\Big\{k : \sum_{i=0}^{k} \lambda^{-1} u(i) \le t\Big\}. \qquad (2.4)$$

Potential service process,
$$S_j(t) \equiv \max\Big\{k : \sum_{i=0}^{k} \tau_j v_j(i) \le t\Big\}. \qquad (2.5)$$

In addition, we assume (throughout the proposal) that the service discipline is work-conserving, i.e., the stations cannot stay idle if there are complete jobs present at the buffers preceding them. Now, in order to construct pathwise the immediate workloads of the system's stations, we use induction on the stations' depth (2.2). Starting with the entrance stations ($d(j) = 1$), the station arrival process equals the external arrival process, so the immediate arrival process for $\{j : d(j) = 1\}$ is
$$A_j(t) \equiv N(t) = \max\Big\{k : \sum_{i=0}^{k} \lambda^{-1} u(i) \le t\Big\}. \qquad (2.6)$$
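The primitives (2.4)-(2.5) are plain counting processes over i.i.d. sums; the following sketch (our illustration, with hypothetical parameter values) constructs them directly from simulated sequences.

```python
# Illustrative sketch: the primitive processes N(t) of (2.4) and S_j(t) of (2.5),
# built from i.i.d. unit-mean sequences u(i), v_j(i). Parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
lam, tau_j, horizon = 1.0, 0.9, 1000.0

u = rng.exponential(1.0, size=int(3 * lam * horizon))        # unit-mean interarrival factors
v = rng.exponential(1.0, size=int(3 * horizon / tau_j))      # unit-mean service factors

arrival_epochs = np.cumsum(u / lam)          # partial sums of lambda^{-1} u(i)
service_epochs = np.cumsum(tau_j * v)        # partial sums of tau_j v_j(i)

def N(t):                                    # external arrival process (2.4)
    return np.searchsorted(arrival_epochs, t, side="right")

def S_j(t):                                  # potential service process (2.5)
    return np.searchsorted(service_epochs, t, side="right")

t = 500.0
print("N(t) vs lambda*t:   ", N(t), lam * t)
print("S_j(t) vs t/tau_j:  ", S_j(t), t / tau_j)
```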

$M_j(t)$, the immediate workload input process for a station $\{j : d(j) = 1\}$, is defined as the sum of all service times at station $j$ for jobs that enter the network during $[0,t]$:
$$M_j(t) \equiv V_j(N(t)) = \sum_{i=0}^{N(t)} \tau_j v_j(i), \qquad (2.7)$$
where $V_j(N)$ is the partial-sum process associated with the service times at the station, $V_j(N) \equiv \sum_{i=0}^{N} \tau_j v_j(i)$. Next we set the immediate workload netflow process
$$\xi_j(t) = M_j(t) - t. \qquad (2.8)$$
Because $t$ is the potential amount of work that can be processed in $t$ units of time (the stations are single-server), $\xi_j(t)$ is the difference between the workload input and the potential workload output. Let $Q_k(t)$ denote the number of jobs waiting to be processed in buffer $k \in \beta(j)$; we shall refer to $Q_k(t)$ as the queue length process. Then for the $\{j : d(j) = 1\}$ stations it should be clear that $|\beta(j)| = 1$ and the queue length process satisfies
$$Q_k(t) = A_k(t) - S_j(B_j(t)) = A_j(t) - S_j(B_j(t)), \qquad (2.9)$$
$$B_j(t) = \int_0^t 1\Big\{\min_{k \in \beta(j)} Q_k(s) > 0\Big\}\, ds, \qquad (2.10)$$
where $B_j(t)$ denotes the cumulative amount of time the server is busy over the time interval $[0,t]$; hence $S_j(B_j(t))$ is the number of jobs that have departed (after service completion) from station $j$ in the same time interval. We shall refer to $B_j(t)$ as the busy time process of server $j$, and define $D_j(t)$ as the departure process
$$D_j(t) = S_j(B_j(t)) = \max\Big\{k : \sum_{i=0}^{k} \tau_j v_j(i) \le B_j(t)\Big\}. \qquad (2.11)$$
Now we can define the immediate workload process
$$W_j(t) = M_j(t) - B_j(t), \qquad (2.12)$$
i.e., the sum of the impending service times of all jobs that are present in buffers incident to $j$ at time $t$, plus the remaining service time of any task that may be in service at time $t$. In an inductive manner, these definitions can be extended to all stations in the network.

Consider a station $j$ (with $d(j) > 1$) such that all immediate predecessor stations have been treated, i.e., their immediate processes have been defined. For such a station $j$ and for each buffer $k \in \beta(j)$, one defines the immediate arrival process as the departure process from its source:
$$A_k(t) = D_{s(k)}(t), \quad k \in \beta(j); \qquad A_j(t) = \min_{k \in \beta(j)} A_k(t). \qquad (2.13)$$
The rest of the process definitions can be extended in the same manner. In conclusion, we get the following summary of the immediate processes:
$$M_j(t) = \sum_{i=0}^{A_j(t)} \tau_j v_j(i); \quad \xi_j(t) = M_j(t) - t; \quad D_j(t) = S_j(B_j(t)) = \max\Big\{k : \sum_{i=0}^{k} \tau_j v_j(i) \le B_j(t)\Big\}; \quad Q_k(t) = A_k(t) - S_j(B_j(t)); \quad W_j(t) = M_j(t) - B_j(t). \qquad (2.14)$$
That concludes the construction of the system's immediate workload processes.

Multidimensional Regulator Mapping

In this part we apply a centering operation on the immediate workload process of every station $j$ separately, and rewrite it as follows:
$$W_j(t) = \xi_j(t) + I_j(t); \qquad \xi_j(t) = M_j(t) - t; \qquad I_j(t) = t - B_j(t) = \int_0^t 1\Big\{\min_{k \in \beta(j)} Q_k(s) = 0\Big\}\, ds. \qquad (2.15)$$
Here $I_j(t)$ is interpreted as the cumulative amount of time the server is idle during $[0,t]$; we shall refer to $I_j(t)$ as the idle time process.
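A minimal numerical sketch of the centering in (2.15): given a discretized netflow path $\xi_j$, the idle-time and workload paths can be recovered with the one-dimensional reflection (regulator) mapping, whose explicit form is stated in Theorem 2.1 below. The discretization and the parameter values here are our own illustration.

```python
# Illustrative sketch: the one-dimensional regulator applied to a discretized netflow
# path xi_j(t) = M_j(t) - t, as in (2.15); the explicit formula used is the one stated
# in Theorem 2.1 below: I_j(t) = sup_{s<=t} [-xi_j(s)]^+ and W_j = xi_j + I_j.
import numpy as np

rng = np.random.default_rng(2)
lam, tau_j, dt, T = 1.0, 0.95, 0.01, 200.0
grid = np.arange(0.0, T, dt)

# Workload input M_j(t): cumulative service requirement of arrivals in [0, t].
arrivals = np.cumsum(rng.exponential(1.0 / lam, size=int(2 * lam * T)))
services = tau_j * rng.exponential(1.0, size=arrivals.size)
M = np.array([services[:np.searchsorted(arrivals, t, side="right")].sum() for t in grid])

xi = M - grid                                    # netflow, as in (2.15)
I = np.maximum.accumulate(np.maximum(-xi, 0.0))  # running sup of (-xi)^+  -> idle time
W = xi + I                                       # immediate workload, always >= 0

assert (W >= -1e-9).all()                        # the regulator keeps the workload nonnegative
print("fraction of time idle I(T)/T:", I[-1] / T, " vs 1 - rho_j =", 1 - lam * tau_j)
```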

Furthermore, the following relations must hold: for all $t \ge 0$ and all $j$,
$$W_j(t) \ge 0; \qquad dI_j(t) \ge 0,\ I_j(0) = 0; \qquad W_j(t)\, dI_j(t) = 0. \qquad (2.16)$$
In other words, $dI_j(t) \ge 0$ means that $I_j$ is nondecreasing; this is naturally satisfied, since the idle time process $I_j(t)$ is measured cumulatively over time. The condition $W_j(t)\, dI_j(t) = 0$ reflects the work-conserving condition: the stations cannot be idle if there are complete jobs in the stations' queues waiting to be treated. From now on we denote by $W(t)$, $\xi(t)$, $I(t)$ the immediate workload vector, the immediate netflow vector and the idle time vector, respectively. The following theorem states the multidimensional regulator mapping from $\xi(t)$ to $(W(t), I(t))$.

Theorem 2.1 For any $\xi(t) \in D^J$, there exists a unique pair $(W(t), I(t))$ in $D^{2J}$ simultaneously satisfying the following three properties for all $j$:
$$W_j(t) = \xi_j(t) + I_j(t) \ge 0; \qquad dI_j(t) \ge 0,\ I_j(0) = 0; \qquad W_j(t)\, dI_j(t) = 0. \qquad (2.17)$$
Furthermore, $(W_j(t), I_j(t))$ is given by
$$I_j(t) = \Psi(\xi_j)(t) = \sup_{0 \le s \le t} [-\xi_j(s)]^+, \qquad W_j(t) = \Phi(\xi_j)(t) = \xi_j(t) + \sup_{0 \le s \le t} [-\xi_j(s)]^+. \qquad (2.18)$$
We shall call $W(t)$ the reflected process vector of $\xi(t)$ and $I(t)$ the regulator vector of $\xi(t)$.

Total Workload and Job-Count Processes

We now proceed with the construction of the system's total workload and job-count process vectors, which are the key performance measures of the system and the basic ingredients for our limit theorems later on.

In this part we use the external arrival process from (2.4) as the station arrival process for all $j \in \{1,\dots,J\}$. Let us define $L_j(t)$ and $X_j(t)$ as
$$L_j(t) \equiv V_j(N(t)) = \sum_{i=0}^{N(t)} \tau_j v_j(i); \qquad X_j(t) = L_j(t) - t. \qquad (2.19)$$
We shall refer to them as the total workload input process and the total workload netflow process, respectively. From these processes we define $U_j(t)$ and $Z_j(t)$, the total workload process and the total job-count process, respectively:
$$U_j(t) = L_j(t) - B_j(t) = X_j(t) + I_j(t); \qquad Z_j(t) = N(t) - D_j(t), \qquad (2.20)$$
where $I_j(t)$ and $D_j(t)$ are the same idle time process and departure process defined for the immediate workload in (2.15) and (2.11):
$$I_j(t) = t - B_j(t) = \int_0^t 1\Big\{\min_{k \in \beta(j)} Q_k(s) = 0\Big\}\, ds; \qquad D_j(t) = S_j(B_j(t)) = \max\Big\{k : \sum_{i=0}^{k} \tau_j v_j(i) \le B_j(t)\Big\}. \qquad (2.21)$$
The process $U_j(t)$ represents the amount of unfinished work destined for station $j$ that is present anywhere in the system at time $t$. In particular, $U_j(t)$ may contain work corresponding to jobs which at time $t$ are still queued at stations preceding $j$. The process $Z_j(t)$ represents the total number of jobs in the system at time $t$ that still need service at station $j$. It can be seen that $U_j(t)$ is regulated by the same regulator as in Theorem 2.1. In fact it satisfies, for all $j$,
$$U_j(t) = X_j(t) + I_j(t) \ge 0; \qquad dI_j(t) \ge 0,\ I_j(0) = 0; \qquad U_j(t)\, dI_j(t) \ge 0 \ \text{ but } \ W_j(t)\, dI_j(t) = 0. \qquad (2.22)$$
In other words, $W_j(t)$ is a lower bound of $U_j(t)$, and by (2.16) we get $U_j(t) \ge W_j(t) \ge 0$, which gives the first property; $dI_j(t) \ge 0$ and $W_j(t)\, dI_j(t) = 0$ are satisfied in the same way as in (2.16).

Another interesting relation between the total job-count process and the immediate workload process is the following:
$$Z_j(t) = Q_k(t) + Z_{s(k)}(t), \quad k \in \beta(j); \qquad W_j(t) = 0 \ \text{ if } \ \min_{k \in \beta(j)} Q_k(t) = \min_{k \in \beta(j)} \big(Z_j(t) - Z_{s(k)}(t)\big) = 0, \qquad (2.23)$$
since it is clear that the total job count for station $j$ equals the sum of the number of tasks found in buffer $k$ ($k \in \beta(j)$) and the total job count for station $i$ ($i = s(k)$). From now on throughout this proposal we denote by $U(t)$, $X(t)$, $Z(t)$ the total workload vector, the total netflow vector and the total job-count vector, respectively.

2.2 State Space Representation

We now define $S$ as the state space of $Z(t)$, i.e., of the total job-count vector. Let $A$ be a $K \times J$ matrix whose elements $A_{kj}$ are given by
$$A_{kj} = \begin{cases} 1, & \text{if } k \in \beta(j), \\ -1, & \text{if } j = s(k), \\ 0, & \text{otherwise.} \end{cases} \qquad (2.24)$$
Using the matrix $A$, we can now define $S$ as the following polyhedral cone in a $J$-dimensional space:
$$S = \{z \in \mathbb{R}^J : Az \ge 0\}. \qquad (2.25)$$
It can be verified that $S$ is contained in the nonnegative orthant, and that the cone has a total of $K$ distinct faces, where $K$ is the number of buffers in the system. The $k$-th face of the state space is then defined by
$$S_k = \{z \in \mathbb{R}^J : A_k z = 0\}, \qquad (2.26)$$
where $A_k$ is the $k$-th row of the matrix $A$. According to Section 2.1, $K$ can be equal to or greater than $J$. The case in which $K > J$ is associated with a system containing join nodes, since only a join node $j$ can satisfy $|\beta(j)| > 1$. In that case, when $K > J$, the state space is a nonsimple polyhedral cone in the nonnegative orthant. We will now prove that $S$ satisfies the restrictions of the multidimensional regulator and therefore is acceptable as $Z(t)$'s state space.

We can see from definitions (2.24) and (2.23) that $A_k z$ (i.e., the $k$-th row of the matrix $A$ multiplied by the vector $z$) satisfies the following statements:
$$A_k z = z_j - z_{s(k)} = Q_k \quad \text{when } k \in \beta(j);$$
$$A_k z = Q_k \ge 0 \ \ \forall k \in \{1,\dots,K\} \ \Longrightarrow\ W_j = \tau_j \Big\{\min_{k \in \beta(j)} Q_k(t)\Big\} \ge 0 \ \ \forall j \in \{1,\dots,J\}; \qquad (2.27)$$
$$I_j \ \text{increases only if } A_k z = Q_k = 0 \ \text{for at least one } k \in \beta(j).$$
In other words, the second statement means that the space defined by $S = \{z \in \mathbb{R}^J : Az \ge 0\}$ satisfies the regulator restriction $W_j(t) \ge 0$ for all $j$. The third statement shows that the state space faces $S_k = \{z \in \mathbb{R}^J : A_k z = 0\}$ match the events when $dI_j > 0$ for some $j \in \{1,\dots,J\}$. We conclude with the following example of the state space calculation in Fig. 2.2.

Figure 2.2: State space example

As we can see, Example 1 shows a fork-join system whose state-space matrix is of dimension $4 \times 3$, and the associated state space is a nonsimple polyhedral cone in $\mathbb{R}^3$. The cone has a total of 4 distinct faces, which exceeds the space dimension, meaning that the cone is nonsimple. Example 2 shows a simple tandem system whose state space lies in $\mathbb{R}^2$.
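To make the construction of $A$ and the cone $S$ concrete, the following sketch (our illustration, with a hypothetical two-station tandem topology in the spirit of Example 2) builds $A$ from $\beta(\cdot)$ and $s(\cdot)$ and tests whether a given job-count vector lies in $S$.

```python
# Illustrative sketch: build the K x J matrix A of (2.24) from the buffer incidence
# data beta(j) and s(k), and test membership z in S = {z : A z >= 0} of (2.25).
# The topology below is a hypothetical tandem example (buffer 1 external to station 1,
# buffer 2 from station 1 to station 2), in the spirit of Example 2.
import numpy as np

J = 2
beta = {1: [1], 2: [2]}      # beta(j): buffers incident to station j
source = {1: 0, 2: 1}        # s(k): source station of buffer k (0 = external arrivals)
K = len(source)

A = np.zeros((K, J))
for j, buffers in beta.items():
    for k in buffers:
        A[k - 1, j - 1] = 1.0                 # +1 if k in beta(j)
        if source[k] != 0:
            A[k - 1, source[k] - 1] = -1.0    # -1 at column j = s(k)

def in_state_space(z):
    return bool(np.all(A @ z >= -1e-12))      # A z >= 0 componentwise

print(A)                                      # rows of A correspond to the faces S_k of (2.26)
print(in_state_space(np.array([3.0, 5.0])))   # True:  Q_2 = z_2 - z_1 = 2 >= 0
print(in_state_space(np.array([5.0, 3.0])))   # False: Q_2 = z_2 - z_1 = -2 < 0
```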

3 Fluid and Diffusion Limits

In this section we seek heavy traffic limits, which involve the convergence of a sequence of normalized stochastic processes to a limit process. The traffic intensity at station $j$ is defined as $\rho_j \equiv \lambda\tau_j$. The system is said to be stable if $\rho_j < 1$ for all $j \in \{1,\dots,J\}$, and is said to be in heavy traffic if $\rho_j$ approaches 1 fast enough for at least one $j$. The precise formulation of our heavy traffic limit theorem requires the construction of a sequence of systems, indexed by $n$. To construct this sequence we require sequences of positive constants $\{\lambda^{(n)},\, n \ge 1\}$ and $\{\tau_j^{(n)},\, n \ge 1\}$, $j = 1,\dots,J$. In the $n$th system of the sequence, the interarrival times and service times are taken to be $u^{(n)}(i) \equiv u(i)/\lambda^{(n)}$ and $v_j^{(n)}(i) \equiv v_j(i)\,\tau_j^{(n)}$, respectively. Here $\{u(i) : i \ge 1\}$ and $\{v_j(i) : i \ge 1\}$, $j = 1,\dots,J$, are the unit-mean sequences defined in Section 2.1, with the restriction that the sequences are mutually independent and each consists of i.i.d. random variables with squared coefficients of variation $c_a^2$ and $c_{sj}^2$, respectively (the squared coefficient of variation of a random variable is defined to be its variance divided by the square of its mean). Section 3.1 begins with the calculation of the fluid limits for the processes of interest $\{U(t), Z(t), W(t)\}$. In Section 3.2 we calculate the diffusion limit for the processes of interest and prove it is a multidimensional RBM. In Section 3.3 we check the implication of the diffusion limit on the state space of $Z(t)$ and define the concepts of state-space collapse and single-bottleneck systems. In Section 3.4 we use the immediate workload diffusion limit process in order to define the jobs' throughput time in the system.

3.1 Fluid Limits

Under the definitions of $\{u(i) : i \ge 1\}$ and $\{v_j(i) : i \ge 1\}$, $j = 1,\dots,J$, above, it can be seen that $N^{(n)}(t)$ of (2.4) is a renewal process with rate $\lambda^{(n)}$, and $L_j^{(n)}(t)$ of (2.19) is a compound renewal process with rate $\lambda^{(n)}\tau_j^{(n)} = \rho_j^{(n)}$. So the fluid limits of the primitives are
$$\bar N^n(t) \equiv \frac{1}{n} N(nt) \to \lambda t \quad \text{and} \quad \bar L_j^n(t) \equiv \frac{1}{n} L_j(nt) \to \rho_j t \quad \text{u.o.c., as } n \to \infty. \qquad (3.1)$$

Now we apply the scaling to the total workload process in its regulator version:
$$\bar U_j^n(t) = \bar X_j^n(t) + \bar I_j^n(t); \qquad \bar X_j^n(t) = \big(\bar L_j^n(t) - \rho_j^n t\big) + (\rho_j^n - 1)t; \qquad \bar I_j^n(t) = t - \bar B_j^n(t). \qquad (3.2)$$
Letting $n \to \infty$ we get
$$\bar X_j(t) = (\rho_j - 1)t; \qquad \bar B_j^n(t) \to (\rho_j \wedge 1)t, \ \text{so } \bar I_j(t) = (1 - \rho_j)^+ t; \qquad \Rightarrow\ \bar U_j(t) = \bar X_j(t) + \bar I_j(t) = (\rho_j - 1)^+ t. \qquad (3.3)$$
In other words, as seen in (2.22), $W_j(t)$ is a lower bound of $U_j(t)$ and is regulated by $I_j(t)$. Under the assumption that $Q_k(0) = 0$ for all $k$ (empty system at $t = 0$) and $\rho_j \le 1$ for all $j$, the fluid levels of $U_j(t)$ and $W_j(t)$ converge together to zero. In conclusion, we get the following theorem.

Theorem 3.1 Suppose that (3.1) holds. Then
$$(\bar U^n, \bar Z^n, \bar B^n) \to (\bar U, \bar Z, \bar B) \quad \text{u.o.c., as } n \to \infty. \qquad (3.4)$$
If $\rho_j \le 1$ for all $j \in \{1,\dots,J\}$, then
$$\bar B_j(t) = \rho_j t, \quad \bar I_j(t) = (1 - \rho_j)t; \qquad \bar X_j(t) = (\rho_j - 1)t; \qquad \bar U_j(t) = (\rho_j - 1)^+ t \equiv 0; \qquad \bar Z_j(t) = \tau_j^{-1}\bar U_j(t) \equiv 0. \qquad (3.5)$$

3.2 Diffusion Limits

Applying the diffusion scaling to the primitives we get
$$\hat N^n(t) = \sqrt n\,[\bar N^n(t) - \bar N(t)], \ \ \bar N(t) = \lambda t; \qquad \hat S_j^n(t) = \sqrt n\,[\bar S_j^n(t) - \bar S_j(t)], \ \ \bar S_j(t) = \tau_j^{-1}t; \qquad \hat V_j^n(t) = \sqrt n\,[\bar V_j^n(t) - \bar V_j(t)], \ \ \bar V_j(t) = \tau_j t. \qquad (3.6)$$
These are renewal and partial-sum processes, and by applying the functional central limit theorem we get the limit processes
$$\hat N(t) = \sqrt{\lambda c_a^2}\;\mathrm{BM}(0, t), \qquad \hat S_j(t) = \sqrt{\tau_j^{-1} c_{sj}^2}\;\mathrm{BM}(0, t), \qquad \hat V_j(t) = \tau_j c_{sj}\,\mathrm{BM}(0, t). \qquad (3.7)$$
Additionally, by applying the scaling to the total workload input process, which is a compound renewal process, we get
$$\hat L_j^n(t) = \sqrt n\,[\bar L_j^n(t) - \bar L_j(t)], \ \ \bar L_j(t) = \rho_j t; \qquad \hat L_j(t) = \hat V_j(\lambda t) + \tau_j \hat N(t) \ \Rightarrow\ \hat L_j(t) = \sqrt{\Gamma_j}\;\mathrm{BM}(0, t), \quad \Gamma_j = \lambda\tau_j^2\,(c_a^2 + c_{sj}^2). \qquad (3.8)$$
Finally, we apply the scaling to the total workload process in its regulator version:
$$\hat U_j^n(t) = \sqrt n\,[\bar U_j^n(t) - \bar U_j(t)]; \quad \bar U_j(t) \equiv 0 \text{ by the fluid limit, so } \hat U_j^n(t) = \sqrt n\,\bar U_j^n(t) = \sqrt n\,[\bar X_j^n(t) + \bar I_j^n(t)];$$
$$\sqrt n\,\bar X_j^n(t) = \sqrt n\,(\bar L_j^n(t) - \rho_j^n t) + \sqrt n\,(\rho_j^n - 1)t = \hat L_j^n(t) + \sqrt n\,(\rho_j^n - 1)t; \qquad \sqrt n\,\bar I_j^n(t) = \sqrt n\,[t - \bar B_j^n(t)]; \qquad (3.9)$$
$$\Rightarrow\ \hat U_j^n(t) = \hat L_j^n(t) + \sqrt n\,(\rho_j^n - 1)t + \sqrt n\,[t - \bar B_j^n(t)].$$

Letting $n \to \infty$ we get
$$\hat L_j^n(t) \Rightarrow \hat L_j(t) = \sqrt{\lambda\tau_j^2(c_a^2 + c_{sj}^2)}\;\mathrm{BM}(0, t); \qquad \theta_j^n \equiv \sqrt n\,(\rho_j^n - 1) \to \theta_j = \begin{cases} -\infty, & \text{if } \rho_j < 1, \\ \theta_j \in (-\infty, 0], & \text{if } \rho_j = 1; \end{cases} \qquad \sqrt n\,\bar I_j^n(t) = \sqrt n\,[t - \bar B_j^n(t)] \Rightarrow \hat I_j(t). \qquad (3.10)$$
Here, the first limit is the total workload input diffusion limit calculated in (3.8). The second limit is called the heavy traffic condition: it converges to $-\infty < \theta_j \le 0$ if $\rho_j^{(n)}$ approaches 1 fast enough but does not exceed 1. The third limit is the regulator diffusion limit. From these limits we can derive the following theorem.

Theorem 3.2 Suppose that (3.7) and (3.8) hold. Then
$$(\hat U^n, \hat Z^n) \Rightarrow (\hat U, \hat Z), \quad \text{as } n \to \infty, \qquad (3.11)$$
where for all $j \in \{1,\dots,J\}$ the limit $(\hat U_j, \hat Z_j)$ takes the following form (according to (3.9)), depending on the traffic intensity $\rho_j$:

if $\rho_j < 1$, then
$$\hat X_j(t) = \mathrm{BM}(-\infty, \Gamma_j); \qquad \hat I_j(t) \to \infty, \ \ \tfrac{d}{dt}\bar I_j(t) = 1 - \rho_j > 0; \qquad \Rightarrow\ \hat U_j(t) = \hat X_j(t) + \hat I_j(t) \equiv 0, \quad \hat Z_j(t) = \tau_j^{-1}\hat U_j(t) \equiv 0, \quad \forall t; \qquad (3.12)$$

if $\rho_j = 1$, then
$$\hat X_j(t) = \mathrm{BM}(\theta_j, \Gamma_j); \qquad \hat U_j(t) = \hat X_j(t) + \hat I_j(t) = \mathrm{RBM}(\theta_j, \Gamma_j); \qquad \Rightarrow\ \hat Z_j(t) = \tau_j^{-1}\hat U_j(t) = \tau_j^{-1}\,\mathrm{RBM}(\theta_j, \Gamma_j).$$
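The $\rho_j = 1$ case of the theorem can be visualized by reflecting a drifted Brownian path with the same regulator map as in Theorem 2.1; the sketch below (our illustration, with hypothetical parameters) simulates $\mathrm{RBM}(\theta_j, \Gamma_j)$ on a grid.

```python
# Illustrative sketch: simulate RBM(theta_j, Gamma_j), the limit of the total workload
# in Theorem 3.2 when rho_j = 1, by reflecting a Brownian path with drift theta_j and
# variance Gamma_j through the one-dimensional regulator of Theorem 2.1.
# Parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
lam, tau, c2a, c2s = 1.0, 1.0, 1.0, 1.0          # station in heavy traffic: rho = lam*tau = 1
theta = -0.5                                     # heavy-traffic drift, sqrt(n)(rho^n - 1) -> theta
Gamma = lam * tau**2 * (c2a + c2s)               # variance coefficient, as in (3.8)

dt, T = 1e-3, 50.0
t = np.arange(0.0, T, dt)
X_hat = theta * t + np.sqrt(Gamma) * np.cumsum(rng.normal(0.0, np.sqrt(dt), size=t.size))
I_hat = np.maximum.accumulate(np.maximum(-X_hat, 0.0))   # regulator of the limit netflow
U_hat = X_hat + I_hat                                    # RBM(theta, Gamma), stays >= 0

# For theta < 0 the stationary mean of RBM(theta, Gamma) is Gamma / (2*|theta|).
print("time-average of U_hat:", U_hat.mean(), " vs Gamma/(2|theta|) =", Gamma / (2 * abs(theta)))
```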

In other words, the diffusion limit of $\hat U_j(t)$ associated with $\rho_j < 1$ converges to an RBM with infinite drift in the direction of zero. Also, it can be seen that $d\bar I_j(t) > 0$ for all $t$, which means that the first-order (fluid) approximation of the regulator has a strictly positive rate. This means that $\hat W_j(t) \equiv 0$ for all $t$ (by the regulator restriction). As a consequence, $\hat U_j(t)$ converges, together with its lower bound $\hat W_j(t)$, to zero. In contrast, the diffusion limit of $\hat U_j(t)$ associated with $\rho_j = 1$ converges to an RBM with drift $\theta_j > -\infty$ and variance $\Gamma_j$. In conclusion we get the following summary of the diffusion limits.

Theorem 3.3 Suppose that (3.7) and (3.8) hold. Then, for all $j \in \{j : \rho_j = 1\}$, the following relations are satisfied:
$$\hat X_j(t) = \mathrm{BM}(\theta_j, \Gamma_j); \qquad \hat U_j(t) = \hat X_j(t) + \hat I_j(t) = \mathrm{RBM}(\theta_j, \Gamma_j); \qquad \hat Z_j(t) = \tau_j^{-1}\hat U_j(t);$$
$$\hat Q_k(t) = \hat Z_j(t) - \hat Z_{s(k)}(t), \quad \hat Z_0(t) \equiv 0, \quad k \in \beta(j); \qquad \hat Q_k(t) \ge 0 \ \forall t \ge 0 \ \text{(by the regulator restriction)}; \qquad (3.13)$$
$$\hat W_j(t) = \tau_j\Big\{\min_{k \in \beta(j)} \hat Q_k(t)\Big\}; \qquad d\hat I_j(t) \ge 0, \ \hat I_j(0) = 0 \ \forall j; \qquad \hat I_j \ \text{increases only if } A_k\hat z = \hat Q_k = 0 \ \text{for at least one } k \in \beta(j).$$
In conclusion, from now on let $\{\hat U(t), \hat Z(t), \hat Q(t), \hat W(t)\}$ denote the diffusion limits of the total workload vector, the total job-count vector, the queue-length vector and the immediate workload vector, respectively. We now give a full formulation of the total job-count limit process, from which all the other limits can be derived by the simple transformations in Theorem 3.3; the transformations between limit processes are justified by the continuous mapping theorem.

Job-Count Diffusion Limit Process

$$R = \mathrm{diag}(\tau_1^{-1},\dots,\tau_J^{-1}); \quad \mu = R\theta; \quad \Omega = R\Gamma R; \quad S = \{z \in \mathbb{R}^J : Az \ge 0\}; \qquad \hat Z(t) = \text{multidimensional } \mathrm{RBM}(S, \mu, \Omega, R), \qquad (3.14)$$
where, for all $j$, $\sqrt n\,(\rho_j^n - 1) \to \theta_j$, $\hat Z_j \equiv 0$ if $\theta_j = -\infty$, and for all $(i,j)$
$$\Gamma_{ij} = \mathrm{cov}(\hat X_j, \hat X_i) = \begin{cases} 0, & \text{if } \hat Z_j \equiv 0 \text{ or } \hat Z_i \equiv 0, \\ \lambda\tau_j\tau_i\Big(c_a^2 + \dfrac{\mathrm{cov}(\hat V_j, \hat V_i)}{\lambda\tau_j\tau_i}\Big), & \text{otherwise.} \end{cases} \qquad (3.15)$$

3.3 State-Space Implications

When using the job-count diffusion limit on real-life systems, it can be seen that the case where $\rho_j = 1$ for all $j$ is not necessarily a realistic one. In systems with single-class customers and a single possible route, it would mean $\mu_i = \mu_j$ for all $(i,j) \in \{1,\dots,J\}$. In real-life systems we need to apply the diffusion limit to systems in which some of the service stations are in heavy traffic and others are not. This case leads to the following theorem, which is a straight consequence of Theorem 3.2.

Theorem 3.4 Suppose that (3.7) and (3.8) hold. Define a subset of the station indices $\Lambda = \{i \in \{1,\dots,J\} : \rho_i = 1\}$, and a subset of the buffer indices $B = \{k \in \{1,\dots,K\} : \rho_i = 1 \text{ if } k \in \beta(i)\}$. Then the diffusion limit of the job-count process is an RBM of dimension $|\Lambda|$ whose state space $S = \{z \in \mathbb{R}^{|\Lambda|} : Az \ge 0\}$ is defined by the $|B| \times |\Lambda|$ matrix $A$ restricted to the rows in $B$ and the columns in $\Lambda$.

We will refer to the set of servers indexed by $\Lambda$ as the system's critical tasks. This effect is similar to the collapse of non-bottleneck stations shown by Reiman in [2] for Jackson networks.

In fact, it can be shown that the performance measures and state spaces of the system behave as if the system were a degenerate network containing only the servers indexed by $\Lambda$ and the buffers indexed by $B$. We shall refer to this phenomenon as state-space collapse.

Example 3.1 (state-space collapse). In this example we refer to the following system, in which $\Lambda = \{1, 3\}$ and the second station's workload collapses under the diffusion scaling. It can be seen that the original system degenerates into a simple tandem system containing only the servers in $\Lambda$ and the buffers in $B$ ($B = \{1, 3\}$). Additionally, as the job-count process of station 2 converges to zero, it forces the state space to converge to the $\{\hat Z_1, \hat Z_3\}$ plane, reducing the system's dimension.
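As a worked illustration of (3.14)-(3.15) together with Theorem 3.4, the sketch below computes the reflection data $(R, \mu, \Omega)$ of the job-count RBM restricted to the critical set $\Lambda$; the parameter values and the simplification $\mathrm{cov}(\hat V_j, \hat V_i) = 0$ for $i \neq j$ (mutually independent service sequences, as assumed in Section 2.1) are ours.

```python
# Illustrative sketch: the job-count RBM data (R, mu, Omega) of (3.14)-(3.15),
# restricted to the critical set Lambda as in Theorem 3.4. Numbers are hypothetical,
# and cov(V_j, V_i) = 0 for i != j is assumed (mutually independent service sequences).
import numpy as np

lam, c2a = 1.0, 1.0
tau = np.array([1.0, 0.5, 1.0])          # mean service times of stations 1..3
c2s = np.array([1.0, 1.0, 0.5])          # squared coefficients of variation of service
theta = np.array([-0.3, -np.inf, -0.2])  # heavy-traffic drifts; -inf marks rho_j < 1

Lambda = np.flatnonzero(np.isfinite(theta))        # critical tasks: stations with rho_j = 1
tauL, thetaL, c2sL = tau[Lambda], theta[Lambda], c2s[Lambda]

R = np.diag(1.0 / tauL)                            # R = diag(tau_j^{-1})
mu = R @ thetaL                                    # mu = R theta
# Gamma: diagonal lam*tau_j^2*(c_a^2 + c_sj^2); off-diagonal lam*tau_i*tau_j*c_a^2
Gamma = lam * np.outer(tauL, tauL) * c2a + np.diag(lam * tauL**2 * c2sL)
Omega = R @ Gamma @ R                              # Omega = R Gamma R

print("critical set Lambda (0-based):", Lambda)
print("mu =", mu)
print("Omega =\n", Omega)
```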

A special case of the above is the single-bottleneck system. Let us define the diffusion limit of the job-count process for a single-bottleneck system. Assume station $i$ is the bottleneck station; then the limit process is
$$R = \tau_i^{-1}; \quad \mu = \tau_i^{-1}\theta_i; \quad \Omega = \lambda(c_a^2 + c_{si}^2); \quad S = \{z_i \in \mathbb{R} : z_i \ge 0\}; \qquad \hat Z(t) = \text{one-dimensional } \mathrm{RBM}(S, \mu, \Omega, R), \qquad (3.16)$$
which is a one-dimensional RBM whose state space is the nonnegative part of the real line. As we defined before, the bottleneck station is referred to as the system's critical task, which determines the system performance. In the following example the fork-join system (a) converges weakly to its bottleneck station (station 4), which is a G/G/1 system.

Figure 3.1: Single bottleneck example

3.4 Heavy Traffic Limit for the Throughput Time

We conclude this section with the definition of the throughput time, or sojourn time, of a job in the system; this definition follows Nguyen [8]. Let $T(t)$ be the throughput time of the next job to enter the network after time $t$. A formal definition will be developed via an inductive definition of intermediate processes $T_1(t),\dots,T_J(t)$, where $T_j(t)$ is interpreted as the throughput time through station $j$, which is the time interval between the arrival epoch of a job and the moment it completes service at station $j$. Let $\Phi(t)$ be the random process defined by
$$\Phi(t) \equiv \sum_{i=0}^{N(t)+1} \lambda^{-1} u(i). \qquad (3.17)$$

One interprets $\Phi(t)$ as the arrival epoch of the next job to enter the network after time $t$. For each station $j \in \{j : d(j) = 1\}$, let
$$\Phi_j(t) \equiv \Phi(t), \qquad T_j(t) \equiv W_j(\Phi_j(t)). \qquad (3.18)$$
Because station $j \in \{j : d(j) = 1\}$ is among the first stations to be visited, $\Phi(t)$ is the arrival time of this job to station $j$. Furthermore, because jobs are served in a FCFS manner, the amount of time this job must spend at the station is precisely the amount of work found at station $j$ immediately after arrival (which includes the service time associated with the new arrival). Thus $T_j(t)$ is the total sojourn time of the job through station $j$. For the other stations in the network, the random processes $\Phi_j(t)$ and $T_j(t)$ are inductively defined as follows. Suppose that $j$ is a station such that $T_i(t)$ has been defined for each $i \in s(\beta(j))$, and set
$$\Phi_j(t) \equiv \Phi(t) + \max_{i \in s(\beta(j))} T_i(t), \qquad T_j(t) \equiv \max_{i \in s(\beta(j))} T_i(t) + W_j(\Phi_j(t)). \qquad (3.19)$$
Recall that the arrival time of a job is taken to be the time at which its last component arrives (if $j$ is a join node, there could be a gap between the arrival times of the various components of the job). Thus, $\max_{i \in s(\beta(j))} T_i(t)$ is the amount of time that elapses until the job arrives at station $j$, and $\Phi_j(t)$ is precisely its time of arrival. Hence $T_j(t)$ corresponds to the throughput time through station $j$. Setting
$$T(t) \equiv \max_{j \in s(\beta(J+1))} T_j(t), \qquad (3.20)$$
where $J+1$ denotes a virtual departure node whose predecessors are the final-stage stations, one can conclude that $T(t)$ is the total sojourn time corresponding to the next job to enter the system after time $t$. In conclusion, we derive our last diffusion limit, regarding the throughput time process.

Theorem 3.5 Suppose that (3.7) and (3.8) hold. Then
$$(\hat T^{(n)}, \hat T_1^{(n)},\dots,\hat T_J^{(n)}) \Rightarrow (\hat T, \hat T_1,\dots,\hat T_J), \qquad (3.21)$$
where
$$\hat T(t) = \max_{j \in s(\beta(J+1))} \hat T_j(t); \qquad \hat T_j(t) = \max_{i \in s(\beta(j))} \hat T_i(t) + \hat W_j(\Phi_j(t)), \quad \hat T_0 \equiv 0. \qquad (3.22)$$

This is the familiar longest-path functional associated with the critical path analysis of PERT/CPM. In the case of a single-bottleneck system, it can be seen, by using Theorem 3.5 combined with Theorem 3.4, that the following holds:
$$\hat T(t) = \hat W_j(\Phi(t)), \qquad (3.23)$$
where $j$ is the bottleneck station, which means that the bottleneck station (critical task) alone determines the performance of the system with respect to the system's throughput time.

Critical path for a bottleneck system

The fork-join system above (system a) contains three processing routes with the following sets of servers: {1, 3}, {2, 3}, {2, 4}, respectively. In the limit process the workloads at stations {1, 2, 3} converge to zero, which means, according to (3.23), that the system's critical path (or "longest path") should converge with probability one to the third route, servers {2, 4}. The bar chart below presents an analysis of simulation runs of the system over 200,000 days, showing the fraction of time each route is the critical path for different arrival rates. It is clear from the results that as the arrival rate approaches heavy traffic ($\rho \to 1$ when $\lambda \to 0.33$) the third route becomes the critical path with probability one.
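The longest-path functional in (3.22) is a standard PERT/CPM recursion; the sketch below (our illustration, with a hypothetical topology and hypothetical workload values) computes the throughput time from per-station workloads and the predecessor sets $s(\beta(j))$.

```python
# Illustrative sketch: the longest-path (PERT/CPM) recursion of Theorem 3.5 / (3.22):
# T_j = max_{i in s(beta(j))} T_i + W_j, and T of the job = max over the final stations.
# Topology and workload values are hypothetical.
def throughput_time(preds, W, final_stations):
    """preds[j]: list of immediate predecessor stations of j; W[j]: workload at j."""
    T = {}
    def T_j(j):
        if j not in T:
            T[j] = max((T_j(i) for i in preds[j]), default=0.0) + W[j]
        return T[j]
    return max(T_j(j) for j in final_stations), T

# Hypothetical fork-join: stations 1, 2 fork from the arrival; 3 joins 1 and 2; 4 follows 2.
preds = {1: [], 2: [], 3: [1, 2], 4: [2]}
W = {1: 2.0, 2: 1.0, 3: 4.0, 4: 2.5}
total, per_station = throughput_time(preds, W, final_stations=[3, 4])
print(per_station)   # {1: 2.0, 2: 1.0, 3: 6.0, 4: 3.5}
print(total)         # 6.0 -- the critical path goes through 1 -> 3 in this hypothetical instance
```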

4 Brief Discussion on System Control

In this section we look for ways in which the fluid and diffusion limits can be used in order to obtain better estimation and better control of real-life systems' performance measures. This includes a short survey of directions in which work is being done, or may be done by us, in this field, as a preparation for our proposed research. We start in Section 4.1 with a method aimed at dynamically estimating the system throughput time of arriving customers on the basis of the system occupancy at the arrival moment. This method was introduced by Reiman in [2] and will be referred to as the snapshot principle. In Section 4.2 we discuss the ability to improve the system throughput time by using priority methods. In Section 4.3 we discuss the ability to improve the system throughput time by using staffing methods; the proposed method uses the notion of the system's critical tasks in order to check the workload balancing in the system and offers appropriate staffing.

4.1 Length of Stay Estimation: The Snapshot Principle

In his paper [2] Reiman stated the following result: under suitable conditions, in the diffusion time scale the queue length process does not change during a customer's sojourn in the network. We refer to this result as the snapshot principle. By using the snapshot principle with Theorems 3.3 and 3.5 we get the following result, introduced first by Nguyen in [8]. For a customer arriving at $\Phi(t)$, $\hat Q(\Phi(t))$ is given by a deterministic vector $Q$ and
$$\hat W_j(\Phi(t)) = W_j = \tau_j\Big\{\min_{k \in \beta(j)} Q_k\Big\}, \ \text{deterministic for all } j; \qquad \hat T(\Phi(t)) = \max_{j \in s(\beta(J+1))} \hat T_j(\Phi(t)); \qquad \hat T_j(\Phi(t)) = \max_{i \in s(\beta(j))} \hat T_i(\Phi(t)) + W_j. \qquad (4.1)$$

In other words, if one takes a snapshot of the system at the time of the job's arrival, $\Phi(t)$, the deterministic queue-length vector observed in this snapshot can be used to estimate a deterministic counterpart of the immediate workload vector experienced by the customer throughout its processing in the system. Therefore, the throughput time can be calculated by Theorem 3.5 as if $W$ were a deterministic vector. This result reduces fork-join throughput time analysis to a sample-path by sample-path analysis of a PERT/CPM network, in which workload levels take the place of task times. This method of estimation was tested in MATLAB simulation runs for different networks and was found efficient in the sense of small RMSE.

Snapshot Example

Figure 4.1: Snapshot simulation results

This is an analysis of simulation runs in which the snapshot method was checked on the system in Fig. These graphs compare the customer's actual LOS in graph a; the offline calculation of each customer's LOS according to (4.1) in graph b; and the estimator error distribution in graph c.
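A minimal sketch of the snapshot estimator (4.1): a queue-length snapshot taken at the arrival epoch is converted into the deterministic workload vector $W_j = \tau_j \min_{k \in \beta(j)} Q_k$ and fed to the same longest-path recursion as in Section 3.4. The topology, $\tau$ values and snapshot values below are hypothetical, chosen only for illustration.

```python
# Illustrative sketch of the snapshot LOS estimator (4.1): turn a queue-length snapshot
# into W_j = tau_j * min_{k in beta(j)} Q_k and run the longest-path recursion of (3.22).
# Topology, tau and the snapshot values are hypothetical.
def snapshot_los(Q, tau, beta, preds, final_stations):
    W = {j: tau[j] * min(Q[k] for k in beta[j]) for j in beta}      # deterministic workloads
    T = {}
    def T_j(j):
        if j not in T:
            T[j] = max((T_j(i) for i in preds[j]), default=0.0) + W[j]
        return T[j]
    return max(T_j(j) for j in final_stations)

tau   = {1: 2.0, 2: 1.5, 3: 1.0}
beta  = {1: [1], 2: [2], 3: [3, 4]}          # station 3 is a join node with buffers 3 and 4
preds = {1: [], 2: [], 3: [1, 2]}
Q     = {1: 3, 2: 5, 3: 2, 4: 6}             # queue-length snapshot at the arrival epoch

print(snapshot_los(Q, tau, beta, preds, final_stations=[3]))
# W = {1: 6.0, 2: 7.5, 3: 2.0}; LOS estimate = max(6.0, 7.5) + 2.0 = 9.5
```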

4.2 Priorities Control

In their paper [1], Cohen et al. examined the critical-chain (CC) method and other control mechanisms in a multi-project environment. Their research focused on priority disciplines and buffer management control methods. Included in the research are methods without buffer overflow, such as critical-chain and minimum slack, and methods with buffer overflow (i.e., when the controlled buffer is full, arriving customers are forced to leave without being treated), such as a constant number of projects in process and queue size control. The conclusion of the research was that there are differences in system performance between the various methods in heavy traffic, which means that control can possibly improve system performance. In our proposal we will not allow customers to leave the system without being treated, but it can readily be seen that these buffer-overflow methods do improve system performance in heavy traffic by not letting the system approach heavy traffic. In these methods the control keeps $\rho_j < 1$ for all $j$ (by definition) and therefore keeps the queue-length process and the throughput time well behaved, and hence they converge to zero under diffusion scaling. The disadvantage of these methods is that the fraction of clients leaving the system without being treated converges to one in heavy traffic. We would like to focus now on methods which do not allow customers to leave the system without being treated. We claim that in the heavy traffic limit of single-type customers the optimal priority discipline is FCFS, and other priority disciplines can at best match, and otherwise reduce, system performance, in contradiction to Cohen's paper. This claim is based on the following sequence of arguments, which are given here without proof.

1. Under suitable conditions, and according to Section 4.1, we can argue the following: at each time point of the system's operation, the group of customers who are currently being served will experience a deterministic immediate workload vector, which means that there will be one deterministic critical path that determines the customers' LOS in the system. According to this critical path, the LOS will behave as in a pure sequential system. Moreover, in the case of a single-bottleneck system, according to (3.23) the system LOS will behave as in a simple G/G/1 system.

2. In a sequential system of G/G/1 stations with homogeneous customers, the LOS is invariant to the priority discipline. Assume that we want to rearrange a group of $N$ customers in order to get the minimum average LOS. Let us define the customer index $i \in \{1,\dots,N\}$ in the same order as the client processing order.

So for any G/G/1 station,
$$\bar T = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{i} \tau_j, \qquad (4.2)$$
where $\tau_j$ is customer $j$'s mean service time, and for a sequence of G/G/1 stations we get $\bar T = \bar T_1 + \bar T_2 + \cdots$. For homogeneous customers, $\tau_i = \tau_j$ for all $(i,j) \in \{1,\dots,N\}$. This means that $\bar T$ is invariant to the client processing order $i$, and that in the deterministic critical path case the priority discipline does not change the average LOS.

3. Another important factor to be considered is the system's synchronization requirement at the join nodes. Components arriving to a join node are joined only if they correspond to the same job. As a consequence, the waiting time of a job from the moment its first component arrives until it may start processing at join node $j$ can be defined as $\upsilon_j = \max_{i \in \beta(j)} T_i - \min_{i \in \beta(j)} T_i$. It can be seen that $\upsilon_j$ depends on the synchronization of a customer's processing order along different paths through the system, and on the variance of the priority discipline. In that sense FCFS has the advantage of being a low-variance discipline that keeps the order of the customers' arrivals. For an opposite case, one can think of an unsynchronized preemptive priority (in every path a different client type gets the lower priority), whose variance in heavy traffic tends to infinity; it can be seen that the resulting waiting time at the join nodes also goes to infinity, which means that the join node server is idle at times when the buffers are full (with components of different jobs).

This claim was verified in MATLAB simulation runs for different networks and different priority disciplines.

4.3 Staffing Control

Taking into consideration the conclusions from the total workload diffusion limit in Theorem 3.2, and especially the state-space collapse notion defined in Theorem 3.4, we can look from a different point of view at the concept of a system's workload balancing. When the set of critical tasks $\Lambda$ (see Section 3.3) satisfies $|\Lambda| < J$ (where $J$ represents the actual number of service stations in the system), or in a more extreme case, if $|\Lambda| = 1$, a single-bottleneck system, then we can argue that the staffing is not efficiently distributed according to the expected workload. In this case we can say that most of the work is assigned to only a few stations.

We propose a staffing method which enforces the following rule: $\rho_j$ approaches 1 uniformly fast for all $j$. In systems in which all the customers have the same single possible precedence-constraint scheme (a single deterministic matrix $P_{ij}$, (2.1)), the arrival rate is the same at all of the stations' entrances, which leads to the following staffing rule.

Route-Balancing Staffing

Given a system containing $N$ servers and $J$ separate tasks ($J$ stations) which need to be carried out for each arriving customer (naturally $N \ge J$), and given that the network has an input stream of jobs with a single possible precedence-constraint scheme, the staffing $N_j$ should be the result of the following IP problem. Define $\lambda_j^{\mathrm{critical}} = N_j/\tau_j$ for $j \in \{1,\dots,J\}$ and $\bar\lambda^{\mathrm{critical}} = \frac{1}{J}\sum_{j=1}^{J}\lambda_j^{\mathrm{critical}}$; then
$$\min_{N_1,\dots,N_J} \ \frac{1}{J}\sum_{j=1}^{J}\big(\lambda_j^{\mathrm{critical}} - \bar\lambda^{\mathrm{critical}}\big)^2 \quad \text{s.t.} \quad \sum_{j=1}^{J} N_j = N, \qquad N_j \ge 0 \ \text{integer}, \ j \in \{1,\dots,J\}. \qquad (4.3)$$
It is easy to see that this method is optimal in the sense that it sets the heavy traffic condition ($\rho_j = 1$ for at least one $j$) of the system at the highest arrival rate achievable over all staffing methods. A further advantage is that this method may be applied off-line in the design of the system, independently of the unknown arrival rate $\lambda$, and does not need to change on-line unless the servers' service rates vary with time. This method of staffing was checked in MATLAB simulation runs and has shown improvement in system performance in heavy and light traffic. For example, we ran this method on the system in Fig. 2.1 with the server properties $(\tau_j, N_j) = \{(6, 3), (5, 2), (4, 3), (3, 1)\}$, respectively. It can be checked that those properties make the system unbalanced. After applying the route balancing we get the following server properties: $(\tau_j, N_j) = \{(6, 3), (5, 2), (4, 2), (3, 2)\}$, which means we only moved one server from station 3 to station 4. The effect of this change can be seen in the following graph (Fig. 4.2) of the system's throughput time vs. arrival rate.
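A brute-force sketch of the route-balancing rule (4.3) (our illustration, not the MATLAB code used in the proposal): it enumerates all allocations of $N$ servers over the $J$ stations, minimizes the variance of $\lambda_j^{\mathrm{critical}} = N_j/\tau_j$, and breaks near-ties by the resulting heavy-traffic threshold $\min_j \lambda_j^{\mathrm{critical}}$; the $\tau_j$ and $N$ values are those of the example above.

```python
# Illustrative brute-force sketch of the route-balancing staffing rule (4.3):
# enumerate all nonnegative integer allocations with sum N, minimize the variance of
# lambda_j^critical = N_j / tau_j, and break near-ties by the heavy-traffic threshold
# min_j lambda_j^critical. tau and N follow the example discussed above.
from itertools import product
import numpy as np

tau = np.array([6.0, 5.0, 4.0, 3.0])   # mean service times of the four stations in the example
N = 9                                  # total number of servers to allocate

best = None
for alloc in product(range(N + 1), repeat=len(tau)):
    if sum(alloc) != N:
        continue
    rates = np.asarray(alloc) / tau                   # lambda_j^critical = N_j / tau_j
    key = (round(rates.var(), 12), -rates.min())      # objective of (4.3), then prefer a higher threshold
    if best is None or key < best[0]:
        best = (key, alloc, rates)

_, alloc, rates = best
print("allocation N_j:", alloc)
print("critical rates:", rates, "  heavy-traffic threshold:", rates.min())
```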

Figure 4.2: Route-balancing simulation example

In other words, the arrival rate at which the system enters heavy traffic has changed from $\lambda = 0.33$ to $\lambda = 0.4$, but the improvement in the system's throughput time is not restricted to the heavy traffic region, since improvement can be seen in the light traffic region as well, at $\lambda = 0.28$ and below.

5 Proposed Research

In this section we propose directions that expand the work described above. These directions are straightforward consequences of the previous discussion on analyzing system performance measures using fluid and diffusion limits, and on applying control methods which are based on the limiting processes. The following proposals can and will be applied to real-life processes associated with hospital performance measures and management policies.

5.1 Performance Analysis

Single-Type and Single-Bottleneck Systems. As described in Section 3.3, we claim that the limit process collapses into a single G/G/1 station. In our work we shall start by verifying that claim, and also define the notions of critical task and critical path and their implications for system performance. We shall further research the effects of synchronization gaps on system performance, as discussed in Section 4.2, and the dependence between synchronization requirements and the limit process after the system collapses.

Multi-Type and Multi-Route Customers. In her subsequent work (see [9]), Nguyen extended the system representation in order to include feedforward networks of single stations with FCFS discipline populated by multiple job types. In her paper it was shown that the resulting polyhedral region has many more faces than its homogeneous counterpart and that the description of the state space becomes vastly more complicated in this setting. It can be seen that the rise in the system's complexity is a direct consequence of the system's synchronization requirements, which were introduced in Section 4.2. In Reiman [2] and Peterson [10], it was shown that in Jackson networks there exists state-space collapse with respect to customer types. In their work, they show that the high-priority customers' workload process converges to zero under diffusion scaling. In our further work we propose to check the application of that result in fork-join networks and to reduce the system's complexity by using a synchronized preemptive priority discipline. In this way, theoretically, we can degenerate heterogeneous single-bottleneck systems into homogeneous single-type single-station systems, in the same manner as in the state-space collapse introduced in Section 3.3.

Fork-Join and Jackson Mixture. An additional extension of the model is to relax the routing discipline to allow probability-based routing (as in Jackson net-


MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 22 12/09/2013. Skorokhod Mapping Theorem. Reflected Brownian Motion MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.7J Fall 213 Lecture 22 12/9/213 Skorokhod Mapping Theorem. Reflected Brownian Motion Content. 1. G/G/1 queueing system 2. One dimensional reflection mapping

More information

OPEN MULTICLASS HL QUEUEING NETWORKS: PROGRESS AND SURPRISES OF THE PAST 15 YEARS. w 1. v 2. v 3. Ruth J. Williams University of California, San Diego

OPEN MULTICLASS HL QUEUEING NETWORKS: PROGRESS AND SURPRISES OF THE PAST 15 YEARS. w 1. v 2. v 3. Ruth J. Williams University of California, San Diego OPEN MULTICLASS HL QUEUEING NETWORKS: PROGRESS AND SURPRISES OF THE PAST 15 YEARS v 2 w3 v1 w 2 w 1 v 3 Ruth J. Williams University of California, San Diego 1 PERSPECTIVE MQN SPN Sufficient conditions

More information

Section 1.2: A Single Server Queue

Section 1.2: A Single Server Queue Section 12: A Single Server Queue Discrete-Event Simulation: A First Course c 2006 Pearson Ed, Inc 0-13-142917-5 Discrete-Event Simulation: A First Course Section 12: A Single Server Queue 1/ 30 Section

More information

CHAPTER 3 STOCHASTIC MODEL OF A GENERAL FEED BACK QUEUE NETWORK

CHAPTER 3 STOCHASTIC MODEL OF A GENERAL FEED BACK QUEUE NETWORK CHAPTER 3 STOCHASTIC MODEL OF A GENERAL FEED BACK QUEUE NETWORK 3. INTRODUCTION: Considerable work has been turned out by Mathematicians and operation researchers in the development of stochastic and simulation

More information

Transitory Queueing Networks

Transitory Queueing Networks OPERATIONS RESEARCH Vol. 00, No. 0, Xxxxx 0000, pp. 000 000 issn 0030-364X eissn 1526-5463 00 0000 0001 INFORMS doi 10.1287/xxxx.0000.0000 c 0000 INFORMS Authors are encouraged to submit new papers to

More information

Positive Harris Recurrence and Diffusion Scale Analysis of a Push Pull Queueing Network

Positive Harris Recurrence and Diffusion Scale Analysis of a Push Pull Queueing Network Positive Harris Recurrence and Diffusion Scale Analysis of a Push Pull Queueing Network Yoni Nazarathy a,1, Gideon Weiss a,1 a Department of Statistics, The University of Haifa, Mount Carmel 31905, Israel.

More information

Data analysis and stochastic modeling

Data analysis and stochastic modeling Data analysis and stochastic modeling Lecture 7 An introduction to queueing theory Guillaume Gravier guillaume.gravier@irisa.fr with a lot of help from Paul Jensen s course http://www.me.utexas.edu/ jensen/ormm/instruction/powerpoint/or_models_09/14_queuing.ppt

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology .203J/6.28J/3.665J/5.073J/6.76J/ESD.26J Quiz Solutions (a)(i) Without loss of generality we can pin down X at any fixed point. X 2 is still uniformly distributed over the square. Assuming that the police

More information

Managing Service Systems with an Offline Waiting Option and Customer Abandonment: Companion Note

Managing Service Systems with an Offline Waiting Option and Customer Abandonment: Companion Note Managing Service Systems with an Offline Waiting Option and Customer Abandonment: Companion Note Vasiliki Kostami Amy R. Ward September 25, 28 The Amusement Park Ride Setting An amusement park ride departs

More information

A PARAMETRIC DECOMPOSITION BASED APPROACH FOR MULTI-CLASS CLOSED QUEUING NETWORKS WITH SYNCHRONIZATION STATIONS

A PARAMETRIC DECOMPOSITION BASED APPROACH FOR MULTI-CLASS CLOSED QUEUING NETWORKS WITH SYNCHRONIZATION STATIONS A PARAMETRIC DECOMPOSITION BASED APPROACH FOR MULTI-CLASS CLOSED QUEUING NETWORKS WITH SYNCHRONIZATION STATIONS Kumar Satyam and Ananth Krishnamurthy Department of Decision Sciences and Engineering Systems,

More information

Collaboration and Multitasking in Networks: Architectures, Bottlenecks and Capacity

Collaboration and Multitasking in Networks: Architectures, Bottlenecks and Capacity MANUFACTURING & SERVICE OPERATIONS MANAGEMENT Vol. 00, No. 0, Xxxxx 0000, pp. 000 000 issn 1523-4614 eissn 1526-5498 00 0000 0001 INFORMS doi 10.1287/xxxx.0000.0000 c 0000 INFORMS Collaboration and Multitasking

More information

Stability and Rare Events in Stochastic Models Sergey Foss Heriot-Watt University, Edinburgh and Institute of Mathematics, Novosibirsk

Stability and Rare Events in Stochastic Models Sergey Foss Heriot-Watt University, Edinburgh and Institute of Mathematics, Novosibirsk Stability and Rare Events in Stochastic Models Sergey Foss Heriot-Watt University, Edinburgh and Institute of Mathematics, Novosibirsk ANSAPW University of Queensland 8-11 July, 2013 1 Outline (I) Fluid

More information

Stochastic-Process Limits

Stochastic-Process Limits Ward Whitt Stochastic-Process Limits An Introduction to Stochastic-Process Limits and Their Application to Queues With 68 Illustrations Springer Contents Preface vii 1 Experiencing Statistical Regularity

More information

Electronic Companion Fluid Models for Overloaded Multi-Class Many-Server Queueing Systems with FCFS Routing

Electronic Companion Fluid Models for Overloaded Multi-Class Many-Server Queueing Systems with FCFS Routing Submitted to Management Science manuscript MS-251-27 Electronic Companion Fluid Models for Overloaded Multi-Class Many-Server Queueing Systems with FCFS Routing Rishi Talreja, Ward Whitt Department of

More information

HEAVY-TRAFFIC LIMITS FOR STATIONARY NETWORK FLOWS. By Ward Whitt, and Wei You November 28, 2018

HEAVY-TRAFFIC LIMITS FOR STATIONARY NETWORK FLOWS. By Ward Whitt, and Wei You November 28, 2018 Stochastic Systems HEAVY-TRAFFIC LIMITS FOR STATIONARY NETWORK FLOWS By Ward Whitt, and Wei You November 28, 2018 We establish heavy-traffic limits for the stationary flows in generalized Jackson networks,

More information

5 Lecture 5: Fluid Models

5 Lecture 5: Fluid Models 5 Lecture 5: Fluid Models Stability of fluid and stochastic processing networks Stability analysis of some fluid models Optimization of fluid networks. Separated continuous linear programming 5.1 Stability

More information

Queues and Queueing Networks

Queues and Queueing Networks Queues and Queueing Networks Sanjay K. Bose Dept. of EEE, IITG Copyright 2015, Sanjay K. Bose 1 Introduction to Queueing Models and Queueing Analysis Copyright 2015, Sanjay K. Bose 2 Model of a Queue Arrivals

More information

Queues with Many Servers and Impatient Customers

Queues with Many Servers and Impatient Customers MATHEMATICS OF OPERATIOS RESEARCH Vol. 37, o. 1, February 212, pp. 41 65 ISS 364-765X (print) ISS 1526-5471 (online) http://dx.doi.org/1.1287/moor.111.53 212 IFORMS Queues with Many Servers and Impatient

More information

Author's personal copy

Author's personal copy Queueing Syst (215) 81:341 378 DOI 1.17/s11134-15-9462-x Stabilizing performance in a single-server queue with time-varying arrival rate Ward Whitt 1 Received: 5 July 214 / Revised: 7 May 215 / Published

More information

CHAPTER 4. Networks of queues. 1. Open networks Suppose that we have a network of queues as given in Figure 4.1. Arrivals

CHAPTER 4. Networks of queues. 1. Open networks Suppose that we have a network of queues as given in Figure 4.1. Arrivals CHAPTER 4 Networks of queues. Open networks Suppose that we have a network of queues as given in Figure 4.. Arrivals Figure 4.. An open network can occur from outside of the network to any subset of nodes.

More information

E-Companion to Fully Sequential Procedures for Large-Scale Ranking-and-Selection Problems in Parallel Computing Environments

E-Companion to Fully Sequential Procedures for Large-Scale Ranking-and-Selection Problems in Parallel Computing Environments E-Companion to Fully Sequential Procedures for Large-Scale Ranking-and-Selection Problems in Parallel Computing Environments Jun Luo Antai College of Economics and Management Shanghai Jiao Tong University

More information

Chapter 1. Introduction. 1.1 Stochastic process

Chapter 1. Introduction. 1.1 Stochastic process Chapter 1 Introduction Process is a phenomenon that takes place in time. In many practical situations, the result of a process at any time may not be certain. Such a process is called a stochastic process.

More information

Since D has an exponential distribution, E[D] = 0.09 years. Since {A(t) : t 0} is a Poisson process with rate λ = 10, 000, A(0.

Since D has an exponential distribution, E[D] = 0.09 years. Since {A(t) : t 0} is a Poisson process with rate λ = 10, 000, A(0. IEOR 46: Introduction to Operations Research: Stochastic Models Chapters 5-6 in Ross, Thursday, April, 4:5-5:35pm SOLUTIONS to Second Midterm Exam, Spring 9, Open Book: but only the Ross textbook, the

More information

Figure 10.1: Recording when the event E occurs

Figure 10.1: Recording when the event E occurs 10 Poisson Processes Let T R be an interval. A family of random variables {X(t) ; t T} is called a continuous time stochastic process. We often consider T = [0, 1] and T = [0, ). As X(t) is a random variable

More information

On Dynamic Scheduling of a Parallel Server System with Partial Pooling

On Dynamic Scheduling of a Parallel Server System with Partial Pooling On Dynamic Scheduling of a Parallel Server System with Partial Pooling V. Pesic and R. J. Williams Department of Mathematics University of California, San Diego 9500 Gilman Drive La Jolla CA 92093-0112

More information

STABILITY AND STRUCTURAL PROPERTIES OF STOCHASTIC STORAGE NETWORKS 1

STABILITY AND STRUCTURAL PROPERTIES OF STOCHASTIC STORAGE NETWORKS 1 STABILITY AND STRUCTURAL PROPERTIES OF STOCHASTIC STORAGE NETWORKS 1 by Offer Kella 2 and Ward Whitt 3 November 10, 1994 Revision: July 5, 1995 Journal of Applied Probability 33 (1996) 1169 1180 Abstract

More information

ANALYSIS OF THE LAW OF THE ITERATED LOGARITHM FOR THE IDLE TIME OF A CUSTOMER IN MULTIPHASE QUEUES

ANALYSIS OF THE LAW OF THE ITERATED LOGARITHM FOR THE IDLE TIME OF A CUSTOMER IN MULTIPHASE QUEUES International Journal of Pure and Applied Mathematics Volume 66 No. 2 2011, 183-190 ANALYSIS OF THE LAW OF THE ITERATED LOGARITHM FOR THE IDLE TIME OF A CUSTOMER IN MULTIPHASE QUEUES Saulius Minkevičius

More information

Motivated by models of tenant assignment in public housing, we study approximating deterministic fluid

Motivated by models of tenant assignment in public housing, we study approximating deterministic fluid MANAGEMENT SCIENCE Vol. 54, No. 8, August 2008, pp. 1513 1527 issn 0025-1909 eissn 1526-5501 08 5408 1513 informs doi 10.1287/mnsc.1080.0868 2008 INFORMS Fluid Models for Overloaded Multiclass Many-Server

More information

Other properties of M M 1

Other properties of M M 1 Other properties of M M 1 Přemysl Bejda premyslbejda@gmail.com 2012 Contents 1 Reflected Lévy Process 2 Time dependent properties of M M 1 3 Waiting times and queue disciplines in M M 1 Contents 1 Reflected

More information

CPSC 531: System Modeling and Simulation. Carey Williamson Department of Computer Science University of Calgary Fall 2017

CPSC 531: System Modeling and Simulation. Carey Williamson Department of Computer Science University of Calgary Fall 2017 CPSC 531: System Modeling and Simulation Carey Williamson Department of Computer Science University of Calgary Fall 2017 Motivating Quote for Queueing Models Good things come to those who wait - poet/writer

More information

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process

More information

The G/GI/N Queue in the Halfin-Whitt Regime I: Infinite Server Queue System Equations

The G/GI/N Queue in the Halfin-Whitt Regime I: Infinite Server Queue System Equations The G/GI/ Queue in the Halfin-Whitt Regime I: Infinite Server Queue System Equations J. E Reed School of Industrial and Systems Engineering Georgia Institute of Technology October 17, 27 Abstract In this

More information

The Skorokhod reflection problem for functions with discontinuities (contractive case)

The Skorokhod reflection problem for functions with discontinuities (contractive case) The Skorokhod reflection problem for functions with discontinuities (contractive case) TAKIS KONSTANTOPOULOS Univ. of Texas at Austin Revised March 1999 Abstract Basic properties of the Skorokhod reflection

More information

Technical Appendix for: When Promotions Meet Operations: Cross-Selling and Its Effect on Call-Center Performance

Technical Appendix for: When Promotions Meet Operations: Cross-Selling and Its Effect on Call-Center Performance Technical Appendix for: When Promotions Meet Operations: Cross-Selling and Its Effect on Call-Center Performance In this technical appendix we provide proofs for the various results stated in the manuscript

More information

Load Balancing in Distributed Service System: A Survey

Load Balancing in Distributed Service System: A Survey Load Balancing in Distributed Service System: A Survey Xingyu Zhou The Ohio State University zhou.2055@osu.edu November 21, 2016 Xingyu Zhou (OSU) Load Balancing November 21, 2016 1 / 29 Introduction and

More information

Modeling Service Networks with Time-Varying Demand

Modeling Service Networks with Time-Varying Demand Modeling Service with Time-Varying Demand - Performance Approximations and Staffing Controls (with Beixiang He, Liam Huang, Korhan Aras, Ward Whitt) Department of Industrial and Systems Engineering NC

More information

Routing and Staffing in Large-Scale Service Systems: The Case of Homogeneous Impatient Customers and Heterogeneous Servers

Routing and Staffing in Large-Scale Service Systems: The Case of Homogeneous Impatient Customers and Heterogeneous Servers OPERATIONS RESEARCH Vol. 59, No. 1, January February 2011, pp. 50 65 issn 0030-364X eissn 1526-5463 11 5901 0050 informs doi 10.1287/opre.1100.0878 2011 INFORMS Routing and Staffing in Large-Scale Service

More information

Lecture 7: Simulation of Markov Processes. Pasi Lassila Department of Communications and Networking

Lecture 7: Simulation of Markov Processes. Pasi Lassila Department of Communications and Networking Lecture 7: Simulation of Markov Processes Pasi Lassila Department of Communications and Networking Contents Markov processes theory recap Elementary queuing models for data networks Simulation of Markov

More information

(b) What is the variance of the time until the second customer arrives, starting empty, assuming that we measure time in minutes?

(b) What is the variance of the time until the second customer arrives, starting empty, assuming that we measure time in minutes? IEOR 3106: Introduction to Operations Research: Stochastic Models Fall 2006, Professor Whitt SOLUTIONS to Final Exam Chapters 4-7 and 10 in Ross, Tuesday, December 19, 4:10pm-7:00pm Open Book: but only

More information

OPTIMAL CONTROL OF A FLEXIBLE SERVER

OPTIMAL CONTROL OF A FLEXIBLE SERVER Adv. Appl. Prob. 36, 139 170 (2004) Printed in Northern Ireland Applied Probability Trust 2004 OPTIMAL CONTROL OF A FLEXIBLE SERVER HYUN-SOO AHN, University of California, Berkeley IZAK DUENYAS, University

More information

Glossary availability cellular manufacturing closed queueing network coefficient of variation (CV) conditional probability CONWIP

Glossary availability cellular manufacturing closed queueing network coefficient of variation (CV) conditional probability CONWIP Glossary availability The long-run average fraction of time that the processor is available for processing jobs, denoted by a (p. 113). cellular manufacturing The concept of organizing the factory into

More information

Chapter 5. Continuous-Time Markov Chains. Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan

Chapter 5. Continuous-Time Markov Chains. Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan Chapter 5. Continuous-Time Markov Chains Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan Continuous-Time Markov Chains Consider a continuous-time stochastic process

More information

A STAFFING ALGORITHM FOR CALL CENTERS WITH SKILL-BASED ROUTING: SUPPLEMENTARY MATERIAL

A STAFFING ALGORITHM FOR CALL CENTERS WITH SKILL-BASED ROUTING: SUPPLEMENTARY MATERIAL A STAFFING ALGORITHM FOR CALL CENTERS WITH SKILL-BASED ROUTING: SUPPLEMENTARY MATERIAL by Rodney B. Wallace IBM and The George Washington University rodney.wallace@us.ibm.com Ward Whitt Columbia University

More information

Dynamic Scheduling of Multiclass Queueing Networks. Caiwei Li

Dynamic Scheduling of Multiclass Queueing Networks. Caiwei Li Dynamic Scheduling of Multiclass Queueing Networks A Thesis Presented to The Academic Faculty by Caiwei Li In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Industrial

More information

Online Supplement: Managing Hospital Inpatient Bed Capacity through Partitioning Care into Focused Wings

Online Supplement: Managing Hospital Inpatient Bed Capacity through Partitioning Care into Focused Wings Online Supplement: Managing Hospital Inpatient Bed Capacity through Partitioning Care into Focused Wings Thomas J. Best, Burhaneddin Sandıkçı, Donald D. Eisenstein University of Chicago Booth School of

More information

Session-Based Queueing Systems

Session-Based Queueing Systems Session-Based Queueing Systems Modelling, Simulation, and Approximation Jeroen Horters Supervisor VU: Sandjai Bhulai Executive Summary Companies often offer services that require multiple steps on the

More information

STABILITY AND INSTABILITY OF A TWO-STATION QUEUEING NETWORK

STABILITY AND INSTABILITY OF A TWO-STATION QUEUEING NETWORK The Annals of Applied Probability 2004, Vol. 14, No. 1, 326 377 Institute of Mathematical Statistics, 2004 STABILITY AND INSTABILITY OF A TWO-STATION QUEUEING NETWORK BY J. G. DAI, 1 JOHN J. HASENBEIN

More information

Single-Server Service-Station (G/G/1)

Single-Server Service-Station (G/G/1) Service Engineering July 997 Last Revised January, 006 Single-Server Service-Station (G/G/) arrivals queue 000000000000 000000000000 departures Arrivals A = {A(t), t 0}, counting process, e.g., completely

More information

Data-Based Resource-View of Service Networks: Performance Analysis, Delay Prediction and Asymptotics. PhD Proposal. Nitzan Carmeli.

Data-Based Resource-View of Service Networks: Performance Analysis, Delay Prediction and Asymptotics. PhD Proposal. Nitzan Carmeli. Data-Based Resource-View of Service Networks: Performance Analysis, Delay Prediction and Asymptotics PhD Proposal Nitzan Carmeli Advisers: Prof. Avishai Mandelbaum Dr. Galit Yom-Tov Technion - Israel Institute

More information

Stability and Heavy Traffic Limits for Queueing Networks

Stability and Heavy Traffic Limits for Queueing Networks Maury Bramson University of Minnesota Stability and Heavy Traffic Limits for Queueing Networks May 15, 2006 Springer Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo Contents 1 Introduction...............................................

More information

6 Solving Queueing Models

6 Solving Queueing Models 6 Solving Queueing Models 6.1 Introduction In this note we look at the solution of systems of queues, starting with simple isolated queues. The benefits of using predefined, easily classified queues will

More information

Performance analysis of queueing systems with resequencing

Performance analysis of queueing systems with resequencing UNIVERSITÀ DEGLI STUDI DI SALERNO Dipartimento di Matematica Dottorato di Ricerca in Matematica XIV ciclo - Nuova serie Performance analysis of queueing systems with resequencing Candidato: Caraccio Ilaria

More information

Dynamic Control of a Tandem Queueing System with Abandonments

Dynamic Control of a Tandem Queueing System with Abandonments Dynamic Control of a Tandem Queueing System with Abandonments Gabriel Zayas-Cabán 1 Jungui Xie 2 Linda V. Green 3 Mark E. Lewis 1 1 Cornell University Ithaca, NY 2 University of Science and Technology

More information

Equivalent Models and Analysis for Multi-Stage Tree Networks of Deterministic Service Time Queues

Equivalent Models and Analysis for Multi-Stage Tree Networks of Deterministic Service Time Queues Proceedings of the 38th Annual Allerton Conference on Communication, Control, and Computing, Oct. 2000. Equivalent Models and Analysis for Multi-Stage ree Networks of Deterministic Service ime Queues Michael

More information

Introduction to Markov Chains, Queuing Theory, and Network Performance

Introduction to Markov Chains, Queuing Theory, and Network Performance Introduction to Markov Chains, Queuing Theory, and Network Performance Marceau Coupechoux Telecom ParisTech, departement Informatique et Réseaux marceau.coupechoux@telecom-paristech.fr IT.2403 Modélisation

More information

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Mohammad H. Yarmand and Douglas G. Down Department of Computing and Software, McMaster University, Hamilton, ON, L8S

More information

Exercises Stochastic Performance Modelling. Hamilton Institute, Summer 2010

Exercises Stochastic Performance Modelling. Hamilton Institute, Summer 2010 Exercises Stochastic Performance Modelling Hamilton Institute, Summer Instruction Exercise Let X be a non-negative random variable with E[X ]

More information

Scheduling Tasks with Precedence Constraints on Multiple Servers

Scheduling Tasks with Precedence Constraints on Multiple Servers Scheduling Tasks with Precedence Constraints on Multiple Servers Ramtin Pedarsani, Jean Walrand and Yuan Zhong Abstract We consider the problem of scheduling jobs which are modeled by directed acyclic

More information

Advanced Computer Networks Lecture 3. Models of Queuing

Advanced Computer Networks Lecture 3. Models of Queuing Advanced Computer Networks Lecture 3. Models of Queuing Husheng Li Min Kao Department of Electrical Engineering and Computer Science University of Tennessee, Knoxville Spring, 2016 1/13 Terminology of

More information

Monotonicity Properties for Multiserver Queues with Reneging and Finite Waiting Lines

Monotonicity Properties for Multiserver Queues with Reneging and Finite Waiting Lines Monotonicity Properties for Multiserver Queues with Reneging and Finite Waiting Lines Oualid Jouini & Yves Dallery Laboratoire Génie Industriel, Ecole Centrale Paris Grande Voie des Vignes, 92295 Châtenay-Malabry

More information

Asymptotics for Polling Models with Limited Service Policies

Asymptotics for Polling Models with Limited Service Policies Asymptotics for Polling Models with Limited Service Policies Woojin Chang School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332-0205 USA Douglas G. Down Department

More information

A Review of Basic FCLT s

A Review of Basic FCLT s A Review of Basic FCLT s Ward Whitt Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, 10027; ww2040@columbia.edu September 10, 2016 Abstract We briefly review

More information

A Heavy Traffic Approximation for Queues with Restricted Customer-Server Matchings

A Heavy Traffic Approximation for Queues with Restricted Customer-Server Matchings A Heavy Traffic Approximation for Queues with Restricted Customer-Server Matchings (Working Paper #OM-007-4, Stern School Business) René A. Caldentey Edward H. Kaplan Abstract We consider a queueing system

More information

M/G/1 and Priority Queueing

M/G/1 and Priority Queueing M/G/1 and Priority Queueing Richard T. B. Ma School of Computing National University of Singapore CS 5229: Advanced Compute Networks Outline PASTA M/G/1 Workload and FIFO Delay Pollaczek Khinchine Formula

More information

Heavy-Traffic Optimality of a Stochastic Network under Utility-Maximizing Resource Allocation

Heavy-Traffic Optimality of a Stochastic Network under Utility-Maximizing Resource Allocation Heavy-Traffic Optimality of a Stochastic Network under Utility-Maximizing Resource Allocation Heng-Qing Ye Dept of Decision Science, School of Business National University of Singapore, Singapore David

More information

On the static assignment to parallel servers

On the static assignment to parallel servers On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/

More information

State Space Collapse in Many-Server Diffusion Limits of Parallel Server Systems and Applications. Tolga Tezcan

State Space Collapse in Many-Server Diffusion Limits of Parallel Server Systems and Applications. Tolga Tezcan State Space Collapse in Many-Server Diffusion Limits of Parallel Server Systems and Applications A Thesis Presented to The Academic Faculty by Tolga Tezcan In Partial Fulfillment of the Requirements for

More information

Synchronized Queues with Deterministic Arrivals

Synchronized Queues with Deterministic Arrivals Synchronized Queues with Deterministic Arrivals Dimitra Pinotsi and Michael A. Zazanis Department of Statistics Athens University of Economics and Business 76 Patission str., Athens 14 34, Greece Abstract

More information

HITTING TIME IN AN ERLANG LOSS SYSTEM

HITTING TIME IN AN ERLANG LOSS SYSTEM Probability in the Engineering and Informational Sciences, 16, 2002, 167 184+ Printed in the U+S+A+ HITTING TIME IN AN ERLANG LOSS SYSTEM SHELDON M. ROSS Department of Industrial Engineering and Operations

More information

e - c o m p a n i o n

e - c o m p a n i o n OPERATIONS RESEARCH http://dx.doi.org/1.1287/opre.111.13ec e - c o m p a n i o n ONLY AVAILABLE IN ELECTRONIC FORM 212 INFORMS Electronic Companion A Diffusion Regime with Nondegenerate Slowdown by Rami

More information

Blind Fair Routing in Large-Scale Service Systems with Heterogeneous Customers and Servers

Blind Fair Routing in Large-Scale Service Systems with Heterogeneous Customers and Servers OPERATIONS RESEARCH Vol. 6, No., January February 23, pp. 228 243 ISSN 3-364X (print) ISSN 526-5463 (online) http://dx.doi.org/.287/opre.2.29 23 INFORMS Blind Fair Routing in Large-Scale Service Systems

More information

Solution: The process is a compound Poisson Process with E[N (t)] = λt/p by Wald's equation.

Solution: The process is a compound Poisson Process with E[N (t)] = λt/p by Wald's equation. Solutions Stochastic Processes and Simulation II, May 18, 217 Problem 1: Poisson Processes Let {N(t), t } be a homogeneous Poisson Process on (, ) with rate λ. Let {S i, i = 1, 2, } be the points of the

More information

Design, staffing and control of large service systems: The case of a single customer class and multiple server types. April 2, 2004 DRAFT.

Design, staffing and control of large service systems: The case of a single customer class and multiple server types. April 2, 2004 DRAFT. Design, staffing and control of large service systems: The case of a single customer class and multiple server types. Mor Armony 1 Avishai Mandelbaum 2 April 2, 2004 DRAFT Abstract Motivated by modern

More information

Dynamic Matching Models

Dynamic Matching Models Dynamic Matching Models Ana Bušić Inria Paris - Rocquencourt CS Department of École normale supérieure joint work with Varun Gupta, Jean Mairesse and Sean Meyn 3rd Workshop on Cognition and Control January

More information

Part I Stochastic variables and Markov chains

Part I Stochastic variables and Markov chains Part I Stochastic variables and Markov chains Random variables describe the behaviour of a phenomenon independent of any specific sample space Distribution function (cdf, cumulative distribution function)

More information

The Fluid Limit of an Overloaded Processor Sharing Queue

The Fluid Limit of an Overloaded Processor Sharing Queue MATHEMATICS OF OPERATIONS RESEARCH Vol. 31, No. 2, May 26, pp. 316 35 issn 364-765X eissn 1526-5471 6 312 316 informs doi 1.1287/moor.15.181 26 INFORMS The Fluid Limit of an Overloaded Processor Sharing

More information

A Robust Queueing Network Analyzer Based on Indices of Dispersion

A Robust Queueing Network Analyzer Based on Indices of Dispersion A Robust Queueing Network Analyzer Based on Indices of Dispersion Wei You (joint work with Ward Whitt) Columbia University INFORMS 2018, Phoenix November 6, 2018 1/20 Motivation Many complex service systems

More information