
Competitive Evaluation of Switch Architectures

David Hay


Competitive Evaluation of Switch Architectures

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

David Hay

Submitted to the Senate of the Technion - Israel Institute of Technology

Iyyar 5767    Haifa    April 2007

The research thesis was done under the supervision of Prof. Hagit Attiya in the Department of Computer Science.

Hagit, there are no words to express how grateful I am for your help and patient guidance all these years. I feel privileged to have worked with you and learned from your experience, as a researcher, a teacher, but first and foremost as a human being. I felt that you were always available for me, for any question, any thought, or even just for a chat. Among the countless things I learned from you, I especially appreciate how you perfectly balanced guiding me with maintaining my independence as a researcher. No doubt you are the kind of advisor any student dreams of (and much, much more).

I would like to thank my collaborators, Dr. Isaac Keslassy, Gabriel Scalosub and Prof. Jennifer L. Welch, for many helpful and fruitful discussions. The periods we worked together were among the most enjoyable of all my studies. I thank Isaac Keslassy also for organizing my internship at Cisco Systems during the summer; this internship contributed significantly to my research.

I am thankful to my committee members: Prof. Israel Cidon, Dr. Isaac Keslassy, Prof. Yishai Mansour, Prof. Seffi Naor, Prof. Danny Raz and Dr. Adi Rosen. I benefited tremendously from your insights and comments. I would also like to thank all the other people from the Computer Science department with whom I worked and who helped me all these years since my undergraduate studies. Special thanks to Moshe Saikevich for his consistent moral support (and graphic advice and help). Moshe, without you none of this would have happened.

Last but not least, I thank my parents Yael and Yigal Hay, my grandparents Zvi and Zvia Gracy and Lia and Jacob Hay, my brothers Eyal, Roee and Assaf, and the rest of my family, for always being there for me and supporting my decisions and choices.

The generous financial help of the Blankstein and Wolf foundations is gratefully acknowledged.

Contents

Abstract
Abbreviations and Notations
1 Introduction
  1.1 Classification of Switch Architectures
  1.2 Evaluating the Performance of Switch Architectures
    1.2.1 Performance Metrics
  1.3 The Packet Scheduling Process Bottleneck
  1.4 Overview of the Thesis
    1.4.1 Relative Queuing Delay in Parallel Packet Switches
    1.4.2 Packet-mode Scheduling in Combined Input-Output Queued Switches
    1.4.3 Jitter Regulation for Multiple Streams
2 Background
  2.1 CIOQ Switches
  2.2 Output-Queued Switch Emulation
  2.3 Parallel Packet Switches
  2.4 Delay Jitter Regulation
3 Model Definitions
4 Relative Queuing Delay in PPS
  4.1 Summary of Our Results
  4.2 A Model for Parallel Packet Switches
  4.3 Lower Bounds on the Relative Queuing Delay
    4.3.1 General Techniques and Observations
    4.3.2 Lower Bounds for Fully-Distributed Demultiplexing Algorithms
    4.3.3 Lower Bounds for u-RT Demultiplexing Algorithms
  4.4 Upper Bounds on the Relative Queuing Delay
  4.5 Demultiplexing Algorithms with Optimal RQD
    4.5.1 Optimal Fully-Distributed Demultiplexing Algorithms
    4.5.2 Optimal 1-RT Demultiplexing Algorithm
    4.5.3 Optimal u-RT Demultiplexing Algorithm
  4.6 Extensions of the PPS Model
    4.6.1 The Relative Queuing Delay of an Input-Buffered PPS
    4.6.2 Recursive Composition of PPS
5 Packet-Mode Scheduling in CIOQ Switches
  5.1 Our Results
  5.2 A Model for Packet-Mode CIOQ Switches
  5.3 Simple Upper and Lower Bounds on the Relative Queuing Delay
  5.4 Trade-offs between the Speedup and the Relative Queuing Delay
    5.4.1 Matrix Decomposition
    5.4.2 Mimicking an Ideal Shadow Switch with Speedup S
    5.4.3 Mimicking an Ideal Shadow Switch with Speedup S
    5.4.4 Mimicking an Ideal Shadow Switch with Bounded Buffers
  5.5 Simulation Results
6 Jitter Regulation for Multiple Streams
  6.1 Our Results
  6.2 Model Description, Notation, and Terminology
  6.3 Online Multi-Stream Max-Jitter Regulation
  6.4 An Efficient Offline Algorithm
7 Conclusions
Bibliography

List of Figures

1 High-level model of a switch and its bottlenecks
2 Combined Input-Output Queued Switch with Virtual Output-Queuing
3 A 5 × 5 PPS with 2 planes in its center stage, without buffers in the input-ports
4 Illustration of different delay metrics
5 Time-points associated with a cell c ∈ T
6 Illustration of traffic T in the proof of Theorems 1 and …
7 Illustration of traffic T in the proof of Theorem …
8 Illustration of traffic T ∈ T_I in the proof of Theorems 6 and …
9 The number of cells arriving until time-slot t − 1, and still queued in plane k by time-slot τ
10 Illustration of the different cases in the proof of Theorem …
11 A (2, 2, 2)-RPPS with 5 input-ports and 5 output-ports
12 Summary of the results described in Chapter …
13 Illustration of the proof of Theorem …
14 Simulation results for a switch, operating under uniform traffic
15 Simulation results for a switch, operating under spotted traffic
16 Simulation results for a switch, operating under diagonal traffic
17 Trace-driven simulation results
18 Simulation results of the Store&Forward Greedy algorithm
19 The multi-stream jitter regulation model
20 Geometric view of delay jitter
21 Geometric view of the right margin of the release band
22 Illustration of Lemma …

List of Tables

4.1 The relative queuing delay (in time-slots) of a bufferless OC-192 PPS with 1024 ports and speedup …
Illustration of Example …

Abstract

To support the growing need for Internet bandwidth, contemporary backbone routers and switches operate at external rates of 40 Gb/s and with hundreds of ports. At the same time, applications with stringent quality of service (QoS) requirements call for powerful control mechanisms, such as packet scheduling and queue management algorithms. The primary goal of our research is to provide analytic methodologies for designing and evaluating high-capacity, high-speed switches. A unique feature of our approach is a worst-case comparison of switch performance relative to an ideal switch with no architectural limitations. This competitive approach is natural due to the central role of incomplete knowledge, and it can reveal the strengths and weaknesses of the studied mechanisms and indicate important design choices.

We first consider the parallel packet switch (PPS) architecture, in which cells are switched in parallel through intermediate lower-speed switches. We study the effects of this parallelism on the overall performance and present tight bounds on the average queuing delay introduced by the switch relative to an ideal output-queued switch. Our lower bounds hold even if the algorithm in charge of balancing the load among the middle-stage switches is randomized.

We also study how variable-size packets can be scheduled contiguously, without segmentation and reassembly, in a combined input-output queued (CIOQ) switch. This mode of scheduling has recently become very attractive, since most common network protocols (e.g., IP) work with variable-size packets. We present frame-based schedulers that allow a packet-mode CIOQ switch with small speedup to mimic an ideal output-queued switch with bounded relative queuing delay.

A slightly different line of research involves studying how different QoS measures can be guaranteed in a stand-alone environment, where traffic arrives at a regulator that should shape it

to meet the demand. We focus on jitter regulators, which should shape the incoming traffic to be perfectly periodic, and show upper and lower bounds for multiple-stream jitter regulation: in the offline setting, jitter regulation can be solved in polynomial time, while in the online setting buffer augmentation is needed in order to compete with the optimal algorithm; the amount of buffer augmentation depends linearly on the number of streams.

Abbreviations and Notations

N: The number of the switch's ports
R: The external rate of the switch
S: The speedup of the switch
K: The number of planes in a parallel packet switch
r: The internal rate of a parallel packet switch
L_max: The maximum packet size
orig: The input-port at which a cell arrives at the switch
dest: The output-port for which a cell is destined
packet: The packet corresponding to a specific cell
first: The first cell of a packet
last: The last cell of a packet
T: A traffic; the collection of cells arriving at the switch
ta: The time-slot at which a cell arrives at the switch
shift: A cell obtained by shifting another cell by a predetermined number of time-slots
E_SW: An execution of the switch SW
σ: A coin-tosses sequence
tl_SW: The time-slot at which a cell leaves the switch
E_S: The execution of the shadow switch
tl_S: The time-slot at which a cell leaves the shadow switch
delay: The queuing delay of a cell
R: The relative queuing delay of a cell
R_max: The maximum relative queuing delay
R_avg: The average relative queuing delay
R^A_max: The maximum relative queuing delay against adversary A
R^A_avg: The average relative queuing delay against adversary A
J_SW: The delay jitter of switch SW
J_S: The delay jitter of the shadow switch
J: The relative delay jitter
S: The state space of a demultiplexor
plane: The plane through which a cell is sent (in a PPS)
tp: The time-slot at which a cell leaves the plane
succ: The immediate successor of a cell
C: The reachable configuration space of a demultiplexor
A: The number of cells arriving at the switch
?: The imbalance of a plane with respect to an output
Q: The length of a queue
L: The number of a cell leaving a plane
B: The set of reachable buffer states
to: The time-slot at which a cell leaves the demultiplexor
t_CCF: The time-slot at which CCF forwards a cell
L: The set of eligible packet sizes
X: The inter-release time of cells
M: The total number of streams
MJ: The max-jitter of a multi-stream traffic

Chapter 1
Introduction

The rapid increase in the demand for Internet bandwidth and the boost in line rates of contemporary data networks establish the basic nodes at the network core, namely the switches and routers, as one of the network's primary performance bottlenecks. Today, switches and routers are built to operate with link rates of up to 40 Gb/s and hundreds of ports. At the same time, contemporary data networks are required to integrate different types of services (for example, IP traffic with voice and video traffic), implying that the switch (or router) must meet stringent quality of service (QoS) requirements and provide service differentiation between applications. In order to cope with these challenging tasks, all routers and switches are equipped with powerful control mechanisms, such as packet scheduling and queue management algorithms. As switches become larger and faster, robust parallel and distributed architectures are often used; these architectures require additional mechanisms for coordination and load balancing.

Varghese [142, Page 302] identifies three bottlenecks in the design of high-speed switches: the address lookup process, which determines the output link to which a packet is switched; the switching process, which is responsible for forwarding packets from the input-ports to the output-ports; and the packet scheduling process, which takes place at the output-ports and decides how packets leave the outbound links of the switch. Figure 1 depicts the general structure of a packet switch and indicates the locations of these bottlenecks.

[Figure 1: High-level model of a switch and its bottlenecks. The switch has N input-ports and N output-ports, operating at external rate R.]

This thesis focuses primarily on problems arising from the switching process bottleneck (except Chapter 6, which deals with the packet scheduling bottleneck), and aims to provide analytical methodologies for designing and evaluating the related switch control mechanisms. In addition, our results allow comparison between different switch architectures. Generally, given an existing switch architecture in which the line rates, buffering locations and control lines are specified, we offer switching algorithms and evaluate their performance. In addition, we prove inherent limitations of the architecture and point out important design trade-offs. Note that the algorithms and their analysis strongly depend on the investigated switch architecture; therefore, this research involves a large variety of algorithmic problems.

Most of our results are relativistic, in the sense that they are measured in comparison to an optimal switch that is not limited by its architecture. Similar to online algorithms, in which there is no information about future events, the competitive approach taken in this research is natural due to the central role of lack of knowledge.

A primary candidate for an ideal switch is an output-queued switch (see Section 1.1), which is considered optimal with respect to its ability to guarantee different QoS demands. For that reason,

this comparison is often referred to as the ability of a switch to mimic or emulate an output-queued switch [120]. Because such analysis is not burdened by probabilistic assumptions on the incoming traffic, which can be misleading, it reveals the strengths and weaknesses of the studied mechanisms and architectures. In addition, analytic evaluation, and especially worst-case evaluation, is important because it allows QoS demands to be guaranteed (unlike empirical evaluation based on simulations).

In the rest of this chapter, we first discuss in Section 1.1 how to classify different switch architectures. In Section 1.2, we overview the methods and metrics for evaluating the performance of switch architectures. Section 1.3 discusses in more depth the problems arising from the packet scheduling process bottleneck. Section 1.4 overviews the main results of this thesis.

1.1 Classification of Switch Architectures

Karol et al. [78] considered switches without any buffers. In these switches, when m cells destined for the same output-port arrive at the same time-slot, m − 1 cells are dropped and one cell is transmitted over the switch fabric. Even in the case of uniform, well-behaved traffic, such a bufferless switch suffers from a large loss ratio and low throughput. Hence, buffering within the switch is needed to handle conflicts among different flows. The location of the buffers, their size, and their management depend on the specific architecture of the switch, and play a major role in its performance. Therefore, switches are often classified according to their buffering strategy [16]. In the rest of this section we employ such a classification and present common architectures by the location of their buffers.

Output-Queued Switches: In output-queued (OQ) switches, a cell arriving at the switch is immediately transferred to the output-port for which it is destined. At each time-slot, at most one cell leaves each output-port; conflicting cells are queued in a buffer at the output-port. Output-queued switches provide the highest throughput and lowest average cell delay, since cells are queued only when the output-port is transmitting a cell. Furthermore, traffic destined for one output-port does not affect other output-ports, implying that misbehaving flows are easily

isolated. However, since it is possible that in a specific time-slot all input-ports send a cell to the same destination, output-ports are required to operate at the aggregate rate of the input-ports. This implies that the output-queued switch architecture does not scale with the number of external ports, and it is therefore impractical for high-speed switches with a large number of ports.

Shared-Memory Switches: Shared-memory switches are a variant of output-queued switches in which buffers are not dedicated to a specific output-port. Naturally, shared memory is more flexible than dedicated memory, and requires a significantly smaller buffer than output-queued switches, sometimes only two or three times larger than a single output-buffer of an output-queued switch [139]. Since at each time-slot all the input-ports can write to the shared memory and all the output-ports can read from it, the shared memory should operate at the aggregate rate of both the input-ports and the output-ports. Hence, like output-queued switches, these switches are not practical if the switch is large or operates at high speed.

Input-Queued and Combined Input-Output Queued Switches: Input-queued (IQ) switches, with buffering at the input-ports, were suggested to reduce the rate at which memory units are required to operate. Cells arriving at the switch are queued in FIFO input-buffers, and are then forwarded to the appropriate output-port, as dictated by a centralized scheduler. The switch fabric used in IQ switches is a bufferless crossbar, which imposes the following constraint on the scheduler: at each time-slot, at most one cell is forwarded from each input-port and to each output-port.

The most notorious problem in input-queued switches is head-of-line (HOL) blocking, in which a cell destined for an occupied output-port blocks other cells from being forwarded [78]. To eliminate such problems, different buffering policies were suggested. The most common one is virtual output-queuing (VOQ) at the input-ports [133], where each input-buffer is divided into N different FIFO queues according to the cell destination. In this case, the scheduler makes its scheduling decisions based on the cells located at the head of each queue (i.e., N^2 cells). Since

scheduling decisions are typically made at least once every time-slot, the scheduler may become the bottleneck when implementing a high-speed, large switch.

[Figure 2: Combined Input-Output Queued Switch with Virtual Output-Queuing. The switch fabric operates at rate S·R, where S is the speedup of the switch.]

In addition to the virtual output queues, some input-queued switches have speedup; namely, the switch fabric runs S ≥ 1 times faster than the external line rate (where S is the switch speedup), enabling the switch to make S scheduling decisions every time-slot. When S > 1, a certain amount of buffering is needed on the output side of the switch, and therefore such switches are usually referred to as combined input-output queued (CIOQ) switches (Figure 2).

Buffered Crossbar Switches: Recently, CIOQ switches with additional (small) buffers at the crosspoints were also considered. These buffered crossbar or combined input-crosspoint queued (CICQ) switches circumvent the major constraint imposed by the bufferless crossbar fabric and introduce orthogonality between the operations of input-ports and output-ports. This strong property greatly simplifies the design of switching algorithms [38, 138], at the expense of having N^2 additional buffers that must be allocated and managed.
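To make the CIOQ buffering structure concrete, the following sketch (an illustration written for this text, not code from the cited works) models an N × N CIOQ switch with virtual output-queuing: each input-port keeps N FIFO queues, one per output-port, and in each of the S scheduling opportunities of a time-slot the fabric forwards at most one cell per input-port and per output-port.

```python
from collections import deque

class CIOQSwitch:
    """Minimal sketch of an N x N CIOQ switch with virtual output-queuing.

    `match` stands for any scheduling algorithm that returns a set of
    (input, output) pairs forming a matching on the requesting ports.
    """

    def __init__(self, n_ports, speedup=1):
        self.n = n_ports
        self.speedup = speedup
        # voq[i][j]: FIFO of cells waiting at input-port i for output-port j.
        self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        self.out = [deque() for _ in range(n_ports)]  # output buffers (S > 1)

    def arrive(self, i, j, cell):
        self.voq[i][j].append(cell)

    def time_slot(self, match):
        """Run S scheduling opportunities, then transmit one cell per output."""
        for _ in range(self.speedup):
            # The scheduler sees only head-of-line cells: at most N^2 requests.
            requests = {(i, j)
                        for i in range(self.n)
                        for j in range(self.n) if self.voq[i][j]}
            for i, j in match(requests):
                self.out[j].append(self.voq[i][j].popleft())
        # At most one cell leaves each output-port per time-slot.
        return [q.popleft() if q else None for q in self.out]
```

Any scheduler can be plugged in through `match`; a maximal-matching example appears in Chapter 2.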

[Figure 3: A 5 × 5 PPS with 2 planes in its center stage, without buffers in the input-ports.]

Parallel Packet Switches (PPS): Switching cells in parallel is a natural approach to building switches with very high external line rates and a large number of ports. A parallel packet switch (PPS) [74] is a three-stage Clos network [41] with K < N switches in its center stage, also called planes. Each plane is an N × N switch operating at rate r < R; each plane is connected to all input-ports on one side, and to all output-ports on the other side (Figure 3). This model is based on an architecture used in inverse multiplexing systems [53, 56], and especially on the inverse multiplexing for ATM (IMA) technology [12, 36]. Iyer and McKeown [70] also consider a variant of the PPS architecture, called input-buffered PPS, which has finite buffers in its input-ports in addition to buffers in its output-ports.

Two additional architectures have a topology similar to the PPS. The Parallel Switching Architecture (PSA) [109] has several combined input-output queued (CIOQ) switches operating in parallel with no speedup, whereas the Switch-Memory-Switch (SMS) model [121, 122, 127] has M > N parallel memories that reside between the input and output ports.

1.2 Evaluating the Performance of Switch Architectures

Switch architectures are evaluated by their ability to provide different QoS guarantees. Some of the important performance figures are the maximum or average delay of cells, the switch throughput, and the cell loss probability. Contemporary network applications necessitate even more sophisticated

performance metrics (e.g., delay jitter). In Section 1.2.1, we discuss these metrics in detail.

These performance figures can be evaluated under different assumptions on the incoming traffic. A traditional approach is to model the arrival of cells as a stochastic process. The performance figures are derived from the switch behavior in response to the arrival pattern, which is calculated either using analytical probabilistic methods (e.g., traditional queuing theory) or using simulations. The most common assumption is uniform traffic, where cell arrival is an independent and identically-distributed (i.i.d.) Bernoulli process with parameter p, 0 < p ≤ 1, and each cell's destination is chosen uniformly among all the output-ports. The simplicity of uniform traffic makes it attractive for analytical evaluation; however, it usually leads to unrealistic, overly optimistic results compared to real-life traffic. More sophisticated traffic patterns (e.g., on/off traffic [2, 64] or hot-spot traffic patterns [118]) were suggested to model real traffic more accurately. Unfortunately, such models generally tend to be either unrealistically simple or too complex for closed-form analysis.

A contemporary approach is to use restrictive models that only bound the incoming traffic rather than characterize it exactly. Our research focuses on such models, which capture the nature of most known traffic patterns, and yet can be handled analytically. These models are particularly appealing because a switch can be used as part of a network (e.g., the Internet, LANs or WANs) whose traffic characterization can be very different, and may even change over time. Therefore, a restrictive model that captures all traffic patterns at once may yield more meaningful results than stochastic processes, which try to characterize the arrival pattern exactly.

A prime example of a restrictive traffic model is the (R, B)-leaky-bucket model [140]. In this model, it is only required that the combined rate of flows sharing the same input-port or the same output-port does not exceed the external rate R of that port by more than a fixed bound B, which is independent of time [34]. Other examples of restrictive models are the (R, T)-smooth model [60], which was later used by Borodin et al. [28] for adversarial queuing theory and is often referred to as the AQT(R, T) model. Traffic patterns that obey the strong law of large numbers [47] are also widely used recently, since they enable the usage of fluid models [46] for evaluating the switch

behavior, although the arrival processes remain discrete [47].

Our research takes a competitive approach to evaluating the behavior of switch architectures: we compare the performance of a switch against an ideal shadow switch receiving the same incoming traffic [120], which may be unrestricted or obey one of the restrictive models described earlier. As mentioned before, since output-queued switches are considered optimal with respect to their delay and throughput performance, this comparison is referred to as output-queued switch emulation. The measure of how closely a switch mimics the ideal switch depends on the relevant QoS demand. For example, in [14, 70, 74, 86] the performance figure discussed is the queuing delay, and therefore the competitive analysis yields the relative queuing delay; namely, the difference in queuing delay between the evaluated switch and the shadow switch. When switches are allowed to drop cells, and the figure of interest is the number of cells successfully delivered, competitive analysis results in a competitive ratio (or, equivalently, the switch miss-fraction [113]).

1.2.1 Performance Metrics

In this section, we survey the most common performance metrics used in current research to evaluate a single switch. Note that end-to-end evaluations (for example, over IP networks) are outside the scope of this thesis.

Throughput and Stability: The throughput of a switch can be defined in several ways. One common definition is the average number of cells that are successfully transmitted by the switch per time-slot per input-port [16]. In other cases [30], throughput is defined as the maximum rate at which none of the offered frames (in our case, cells) are dropped by the device (in our case, the switch). In this case the throughput is usually normalized to the maximum theoretical rate (namely, R). For example, 100% throughput means that even under maximum load, no cells are dropped by the switch. Note that the relation between the cell loss rate and this notion of throughput is not immediate: in an extreme case, a switch that drops a small number of cells for any incoming traffic (even at the lowest rate) has 0% throughput, while its loss rate is very small.

When discussing throughput, a definition of stability comes in handy [104]. A switch is stable (in the strong sense) if the expected queue lengths do not grow without bound: that is, if for every input-port i and output-port j, lim_{t→∞} E(Q_{i,j}(t)) is finite, where Q_{i,j}(t) is the number of cells from flow (i, j) that are still queued in one of the switch buffers at time t. A switch achieves 100% throughput if and only if it is stable under all admissible traffics, and therefore, with finite buffers, no cells are dropped [105].

A stronger stability measure is the ability of the switch to be work-conserving (greedy) [37, 88, 91]. A work-conserving switch guarantees that if a cell is pending for output-port j at time-slot t, then some cell leaves the switch from output-port j at time-slot t. This property prevents an output-port from being idle unnecessarily, and thereby ensures the switch's stability, maximizes its throughput and minimizes its average cell delay. Note that work conservation is a strictly stronger property than stability, since there are switches (e.g., the parallel packet switch, described in Section 1.1) that are stable but not work-conserving.

Queue Length and Cell Loss Ratio: A more fine-grained measure than stability is bounding the queue lengths (also referred to as backlogs), bounding the expected queue length, or approximating the distribution of the lengths (over time). Since buffer sizes play a major role in both the design and the pricing of the switch, queue length bounds are very important performance figures. Moreover, the figures obtained have great practical importance and usually can be easily translated into other important bounds (e.g., on cell delays and the cell loss ratio). As a simple example, in a work-conserving output-queued switch the maximum buffer size needed for any (R, B) leaky-bucket traffic is B cells [45]. In this case, no cells are dropped and the maximum queue length is B; if the output queues operate under a first-come-first-served (FCFS) policy, the maximum latency is B time-slots.

If buffer sizes are bounded, cells that cannot be stored in the buffers are dropped. The number of cells dropped, compared to the number of cells that arrived at the switch, is captured by the cell loss ratio. Clearly, characterizing the queue lengths is often a first step towards evaluating the cell loss ratio.
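The B-cell bound above is easy to sanity-check by simulation. The sketch below is a minimal illustration, assuming R is normalized to one cell per time-slot per port: it draws an (R, B)-leaky-bucket-compliant arrival sequence for a single output-port and asserts that a work-conserving FIFO output queue never backlogs more than B cells.

```python
import random

def leaky_bucket_arrivals(horizon, burst, seed=0):
    """Per-slot arrival counts at one output-port obeying an (R, B) leaky
    bucket with R normalized to 1 cell per time-slot and B = `burst`."""
    rng = random.Random(seed)
    tokens, counts = burst, []
    for _ in range(horizon):
        a = rng.randint(0, tokens)           # any burst the bucket allows
        tokens = min(tokens - a + 1, burst)  # consume, then refill one token
        counts.append(a)
    return counts

def max_backlog(counts):
    """Peak backlog of a work-conserving FIFO queue serving 1 cell/slot."""
    q = peak = 0
    for a in counts:
        q = max(0, q + a - 1)  # a arrivals, one departure per time-slot
        peak = max(peak, q)
    return peak

B = 8
assert max_backlog(leaky_bucket_arrivals(10_000, B)) <= B
```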

[Figure 4: Illustration of different delay metrics (propagation delay, average delay, maximum delay, and delay jitter) [79, Page 219].]

Cell Delay and Queuing Delay: For a wide class of interactive or time-critical applications (such as voice conversation and other diverse telecommunication services), cell delays are more important than throughput [94]. In such applications excessive latency can inhibit the usability of the cells, and therefore should be avoided. Naturally, the maximum cell delay may occur only in extreme situations, implying that this metric can be overly pessimistic. In such cases, we are interested in the average cell delay.

The dominant causes of delay are fixed propagation delays (e.g., those arising from speed-of-light restrictions) and queuing delays in the switch [49]. Since propagation delays are a fixed property of the topology, cell delay is minimized when queuing delays are minimized. Furthermore, propagation delays are strongly dependent on technology; therefore, a switch architecture is best evaluated through its queuing delay.

Delay Jitter: Another important QoS parameter is the delay jitter, which is sometimes referred to as the cell delay variation. Delay jitter is the difference between the maximal and minimal cell transfer delays. Guaranteeing a certain delay jitter is especially important in interactive communication (such as video or audio streaming). In such applications, bounding the delay jitter is translated into bounds on

the buffer size at the destination. Mansour and Patt-Shamir [100] define the delay jitter to measure how far the difference in delivery times of different packets is from the ideal difference in a perfectly periodic sequence. In the natural case, where the abstract source of the incoming traffic operates in a perfectly periodic manner, both definitions are equivalent.

Other Measures: The ATM Forum defines other quality-of-service parameters, which are not widely used: Cell Misinsertion Rate (CMR), Cell Error Rate (CER), and Severely Errored Cell Block Ratio (SECBR). Our research does not address these measures; we assume that all cells are transmitted over the switch without errors or misinsertions.

1.3 The Packet Scheduling Process Bottleneck

After the switching process bottleneck is resolved (recall Figure 1), incoming packets are stored at their destinations (that is, the buffers of the respective output-ports), awaiting to be scheduled out of the switch. A packet-scheduler, which manages the buffers of a single output-port, is responsible for deciding which packet leaves on the output-port's outgoing link in each time-slot. Depending on the demands placed on the switch, packet-schedulers are geared to ensure the relevant performance metrics among those described in Section 1.2.1.

It is important to notice that, typically, flows from different sources traverse the switch at the same time, and therefore compete for the same switch resources (for example, the switch buffers or the switch's internal transmission lines). One of the key roles of the packet scheduling process is to protect well-behaved flows from misbehaving ones [34]. This is often called flow isolation, and the ability to provide such isolation is one of the most important evaluation criteria for switching architectures. The demand for flow isolation is sometimes formalized by the slightly stronger concept of fairness: when several flows are equally important (i.e., demand the same QoS guarantees), they should be treated fairly by the switch and obtain an equal fraction of the switch resources. Well-known approaches to flow isolation and fairness are allocating per-flow buffers [111]

or using appropriate queuing disciplines (e.g., GPS [115] or WFQ [50, 147]).

The problem of packet scheduling becomes even more difficult when the buffers of the output-port have only bounded size. In such cases, the packet-schedulers cannot handle every traffic pattern and some packets must be dropped. The most common and simple drop mechanism is tail-drop, in which incoming packets are dropped if the buffer is full. However, modern switches and routers often implement more sophisticated drop mechanisms, such as Random Early Detection (RED) [55], that aim at optimizing the performance of TCP traffic.

In this research, we take a competitive analysis point of view also when evaluating the packet scheduling process. We compare the performance of these schedulers to an optimal scheduler that uses the same buffer size but has complete knowledge of future packet arrivals (that is, an offline algorithm). In addition, in order to investigate the trade-off between the buffer size and the scheduler performance, we also investigate resource augmentation scenarios, in which the packet-scheduler compensates for its lack of knowledge by using additional buffers.

Note that the switching problem and the packet scheduling problem are not orthogonal: one can devise switching algorithms that aim at optimizing certain QoS guarantees; such switching algorithms are especially important in IQ switches (recall Section 1.1), in which there are no buffers at the output-ports. However, the problems often become independent if the switching algorithm provides output-queued emulation.

1.4 Overview of the Thesis

1.4.1 Relative Queuing Delay in Parallel Packet Switches

We provide an analysis of the relative queuing delay of cells in a PPS compared to an ideal switch, capturing the influence of parallelism on PPS performance [13, 14]. Our lower and upper bounds on the relative queuing delay depend on the amount and type of information used for balancing the load among the lower-speed switches, and indicate significant differences in PPS performance: sharing even out-dated information among input-ports can greatly improve the switch performance.

An attractive paradigm for balancing load on the average is randomization [17, 108]; even

a very simple strategy ensures, with high probability, a maximum load close to the optimal distribution [61]. Given these successful applications of randomization in traditional load-balancing settings and in other high-bandwidth switches [58, 136], it is tempting to employ randomization in parallel packet switches in order to improve their performance. Nevertheless, we show that randomization does not help to decrease the average relative queuing delay. This surprising result holds because the common practice is that switches should not mis-sequence cells [81]. This property allows an adversary to exploit a transient increase in the relative queuing delay and perpetuate it long enough to increase the average relative queuing delay.

On the positive side, we introduce a generic methodology for analyzing the maximal relative queuing delay by measuring the imbalance between the lower-speed switches. The methodology is used to devise new optimal algorithms that rely on slightly out-dated global information on the switch status. It is also used to provide a complete proof of the maximum relative queuing delay provided by the fractional traffic dispatch algorithm [72, 86]. These results are discussed in Chapter 4.

1.4.2 Packet-mode Scheduling in Combined Input-Output Queued Switches

The need for packet-mode schedulers arises from the fact that in most common network protocols, traffic is comprised of variable-size packets (e.g., IP datagrams), while real-life switches store and transmit packets as fixed-size cells, with fragmentation and reassembly done outside the switch. Packet-mode schedulers consider the linkage between cells that correspond to the same packet and are constrained so that such cells are delivered from the switch contiguously [57]. Packet-aware scheduling schemes avoid the overhead induced by packet segmentation and reassembly, which can become very significant at high speeds.

We devise coarse-grained schedulers that allow a packet-mode combined input-output queued (CIOQ) switch with small speedup to mimic an ideal output-queued switch with bounded relative queuing delay [15]. The schedulers are coarse-grained, making a scheduling decision every certain number of time-slots, and work in a pipelined manner based on matrix decomposition.

Our schedulers demonstrate a trade-off between the switch speedup and the relative queuing

delay incurred while mimicking an output-queued switch. When the switch is allowed to incur a high relative queuing delay, a speedup arbitrarily close to 2 suffices to mimic an ideal output-queued switch. This implies that packet-mode scheduling does not require a higher speedup than cell-based scheduling. The relative queuing delay can be considerably reduced with just a doubling of the speedup. We also show that it is impossible to achieve zero relative queuing delay (that is, perfect emulation), regardless of the switch speedup.

We further evaluate the performance of our scheduler through extensive simulations, both under real-life traffic traces and under various stochastic traffic models. These simulations clearly indicate that in practice this scheduler performs significantly better than its theoretical bound. These results are presented in Chapter 5.

1.4.3 Jitter Regulation for Multiple Streams

We also investigate the packet scheduling process. Specifically, we refer to each output-port as a stand-alone environment with a bounded-size buffer, and study how different QoS measures can be guaranteed for traffic traversing such an environment. A prime example is a jitter regulator, which should shape the incoming traffic to be perfectly periodic, using a bounded-size buffer. While previous work on this topic [100] handles only a single stream, we show upper and lower bounds for multi-stream jitter regulation in offline and online settings [63]: in the offline setting, jitter regulation can be solved in polynomial time, while in the online setting buffer augmentation is needed in order to compete with the optimal algorithm; the amount of buffer augmentation depends linearly on the number of streams. Chapter 6 presents our results on jitter regulation.

Chapter 2
Background

In this chapter we survey the background most relevant to our research. Section 2.1 deals with related work on CIOQ switches. Section 2.2 describes the research done on OQ emulation. In Section 2.3, we present the known results on parallel packet switches and related architectures. We conclude in Section 2.4 by describing prior work on jitter regulation.

2.1 Prior Work on CIOQ Switches

In a combined input-output queued (CIOQ) switch, described in Section 1.1, arriving cells are first stored at the input side of the switch and then forwarded over a crossbar switch fabric to the output side, as dictated by a scheduling algorithm. The switch fabric operates S times faster than the external rate, where S is the speedup of the switch, and imposes the following major constraint on the scheduling algorithm: at each time-slot, at most S cells can be forwarded from any input-port and at most S cells can be sent to any output-port. An alternative way to express this constraint is by defining a scheduling opportunity (or scheduling decision), in which at most one cell is forwarded from each input-port and to each output-port. The speedup S implies that the switch has S scheduling opportunities every time-slot.

A common approach to the scheduling problem in CIOQ switches is to model the switch as a bipartite graph G(t) = (V1, V2, E), where V1 is the set of input-ports, V2 is the set of output-

ports, and an edge (v1, v2) ∈ E exists if and only if there is a cell waiting to be scheduled from input-port v1 to output-port v2 at time-slot t. Note that after each scheduling action, a new graph is obtained.

A solution to the classic maximum size matching problem achieves 100% throughput under uniform i.i.d. traffic, but can lead to instability and unfairness under other traffic patterns [104]. These results can be improved by assigning weights to the edges and solving the maximum weight matching problem. It is shown [104] that if the weights are assigned according to the lengths of the queues (LQF) or the waiting time of the cell at the head of the queue (OCF), 100% throughput can be achieved even for non-uniform traffic. However, these algorithms are infeasible in large and high-speed switches due to the high complexity of maximum weight matching solutions.

To overcome this problem, several algorithms based on maximal matching were proposed. In general, these algorithms operate in iterations, such that in each iteration an unmatched input-port picks an unmatched output-port and adds the edge to the matching, until the matching converges to a maximal matching (usually after at most N iterations). The difference between the maximal-matching-based algorithms is in the way conflicting requests are resolved. In Parallel Iterative Matching (PIM) [5] these requests are resolved randomly (with high probability, O(log N) iterations then suffice for the algorithm to converge), while in iSLIP [103] they are resolved using round-robin pointers. Other maximal-matching-based algorithms include Wave Front Arbiters (WFA) [132], iLQF [105] and iOCF [105].

Nevertheless, the scheduling algorithm's complexity is typically the main performance limitation of CIOQ switches [24], since scheduling decisions are made every time-slot, meaning that the scheduling algorithm must run at least as fast as the external line rate. One approach to overcome this scheduling complexity is randomization, which has proven extremely successful in simplifying the implementation of algorithms. A prime example of a linear-time randomized algorithm for CIOQ switches was presented by Tassiulas [136], who proposed to compare, at each scheduling decision, the weight of the current matching to the weight of a randomly chosen matching. Giaccone et al. [58] later improved this algorithm and showed how such randomized algorithms can achieve good delay performance.
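The request-grant-accept structure shared by these iterative algorithms can be sketched compactly; the code below resolves conflicts randomly in the spirit of PIM (a simplified rendering for illustration, not the published algorithm). Replacing the two random choices with per-port round-robin pointers gives an iSLIP-like scheduler.

```python
import random

def pim_match(requests, n_ports, iterations=None, rng=random):
    """Compute one maximal matching, PIM-style (simplified sketch).

    `requests` is a set of (input, output) pairs with a non-empty VOQ."""
    matching = {}          # input-port -> output-port
    matched_out = set()
    for _ in range(iterations or n_ports):
        # Grant phase: every free output-port grants one requesting free input.
        grants = {}        # input-port -> list of output-ports granting it
        for j in range(n_ports):
            if j in matched_out:
                continue
            candidates = [i for (i, jj) in requests
                          if jj == j and i not in matching]
            if candidates:
                grants.setdefault(rng.choice(candidates), []).append(j)
        if not grants:
            break          # no free output can be granted: matching is maximal
        # Accept phase: every granted input-port accepts one grant at random.
        for i, outs in grants.items():
            j = rng.choice(outs)
            matching[i] = j
            matched_out.add(j)
    return set(matching.items())
```

With `n_ports` fixed to N, a function like `lambda reqs: pim_match(reqs, N)` can serve as the `match` argument of the CIOQ sketch in Chapter 1.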

Another approach to the scheduling problem is matrix decomposition [31]. Such solutions assume that the arrival rate of each flow is known and decompose the arrival traffic rate matrix Λ = [λ_{i,j}] (where λ_{i,j} is the rate of flow (i, j)) into permutation matrices, which are used periodically as scheduling decisions.

A recent approach is to use a coarse-grained scheduler, which makes a scheduling decision every predetermined number of time-slots [24, 114]. In such algorithms, a frame is defined as τ consecutive time-slots, and scheduling decisions are made at the boundaries of these frames. The scheduling decision should encompass all the information the input-port needs in order to schedule cells for τ time-slots. Notice that matrix decomposition techniques are a promising approach for devising such frame-based schedulers, as was previously proposed for optical switching and for satellite-switched time-division multiple access (SS/TDMA) schedulers [83, 95, 137, 145].

Aggarwal et al. [3] combine these three approaches and devise a randomized, coarse-grained algorithm for matrix decomposition. Essentially, they view the matrix decomposition problem as coloring a bipartite multigraph, and propose a randomized edge-coloring algorithm that colors the graph with as few colors as possible. Their algorithm achieves nearly optimal results with very low implementation complexity.

All the above-mentioned schedulers deal only with fixed-size cells. However, some schedulers that handle variable-size packets directly have also been proposed. Previous work [57, 101] considers packet-mode scheduling in an input-queued (IQ) switch with a crossbar fabric and no speedup. It proves analytically that packet-mode IQ switches can achieve 100% throughput, provided that the input traffic is well-behaved; this matches the earlier results on cell-based scheduling [104]. Marsan et al. [101] also show that under low load and small packet-size variance, packet-mode schedulers may achieve a better average packet delay than cell-based schedulers. A different line of research used competitive analysis to evaluate packet-mode scheduling, when each packet has a weight representing its importance and a deadline until which it should be delivered from the switch [62].
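As a concrete illustration of this approach (a sketch of the classical Birkhoff-von Neumann construction, not the specific algorithms of [31] or [3]), the code below decomposes a doubly stochastic rate matrix by repeatedly extracting a perfect matching on its positive entries and subtracting the smallest matched rate:

```python
EPS = 1e-9

def find_perfect_matching(positive):
    """Kuhn's augmenting-path algorithm on the bipartite graph whose edge
    (i, j) exists iff positive[i][j]. Returns output -> input, or None."""
    n = len(positive)
    match_of_output = [-1] * n

    def try_assign(i, seen):
        for j in range(n):
            if positive[i][j] and j not in seen:
                seen.add(j)
                if match_of_output[j] < 0 or try_assign(match_of_output[j], seen):
                    match_of_output[j] = i
                    return True
        return False

    for i in range(n):
        if not try_assign(i, set()):
            return None
    return match_of_output

def birkhoff_decomposition(rates):
    """Decompose a doubly stochastic rate matrix into weighted permutations.

    For such a matrix a perfect matching on the positive entries always
    exists, and each round zeroes at least one entry, so this terminates."""
    m = [row[:] for row in rates]
    n = len(m)
    schedule = []
    while True:
        positive = [[m[i][j] > EPS for j in range(n)] for i in range(n)]
        match_of_output = find_perfect_matching(positive)
        if match_of_output is None:
            break
        perm = [(match_of_output[j], j) for j in range(n)]
        weight = min(m[i][j] for i, j in perm)
        for i, j in perm:
            m[i][j] -= weight
        schedule.append((weight, perm))
    return schedule
```

For example, decomposing the uniform matrix [[0.5, 0.5], [0.5, 0.5]] yields two permutations of weight 0.5 each; a frame-based scheduler would then use each permutation as a crossbar configuration for a share of the frame proportional to its weight.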

2.2 Prior Work on Output-Queued Switch Emulation

The question of whether a feasible switch architecture can emulate an output-queued switch was first raised by Prabhakar and McKeown in the context of combined input-output queued switches [120]. They answered this question in the affirmative and presented the first output-queued emulation algorithm, called the most urgent cell first algorithm (MUCFA), which requires a speedup of 4. Following this seminal paper, other works investigated the speedup required for a CIOQ switch to emulate an output-queued switch [35, 91, 130, 131]. A prime example is the critical cells first (CCF) algorithm [37], which allows a CIOQ switch with speedup of at least 2 to emulate (exactly) an output-queued switch. In addition, a matching lower bound was also proven [37]: a CIOQ switch needs speedup at least 2 − 1/N in order to emulate an output-queued switch. Notice that all these algorithms make no assumptions on the incoming traffic.

The demand for exact emulation is sometimes relaxed to allow the investigated switch architecture to lag behind the OQ switch by a fixed and predetermined relative queuing delay [72]. We refer to such relaxed emulation as OQ switch mimicking. In this context, the ability of a CIOQ switch with small speedup (that is, S < 2) to mimic an OQ switch was investigated in [59]: when the traffic is well-behaved (that is, obeys one of the restrictive models described in Section 1.2), the demand for speedup S ≥ 2 − 1/N can be relaxed at the expense of a bounded relative queuing delay. In Chapter 5, we show that a packet-mode CIOQ switch can provide OQ mimicking.

The ability to emulate or mimic an OQ switch was also investigated for other switch architectures. A recent line of research deals with buffered crossbars (described in Section 1.1). Magill et al. [99] showed that a buffered crossbar with S = 2 can emulate a first-come-first-served (FCFS) output-queued switch under any arrival pattern. Furthermore, they also showed that if the buffers at the crosspoints are of size at least k, more general queuing disciplines can be emulated, namely FCFS with k strict priorities. These results were further improved in [38], showing that an OQ switch with any weighted round-robin scheduler can be emulated using a fully-distributed algorithm (that is, one in which each input-port and each output-port makes independent decisions). Turner [138] investigated packet-mode schedulers in buffered crossbars and showed that a buffered crossbar switch with speedup 2 and crosspoint buffers of size 5·L_max, where L_max is the maximum packet size, can

mimic an output-queued switch with a relative queuing delay of (7/2)·L_max time-slots.

2.3 Prior Work on Parallel Packet Switches

The parallel packet switch architecture was first considered by Iyer et al. [70, 72, 74], who evaluated its ability to mimic output-queued switches. Iyer et al. [74] introduced the Centralized PPS Algorithm (CPA), which allows a PPS with speedup S ≥ 2 to mimic a FCFS output-queued switch with zero relative queuing delay; here, the speedup S of the switch is the ratio of the aggregate capacity of the internal traffic lines connected to an input- or output-port to the capacity of its external line (namely, S = Kr/R). Unfortunately, these algorithms are impractical for real switches, because they gather information from all the input-ports for every scheduling decision. To overcome this problem, Iyer and McKeown [72] suggest a fully-distributed algorithm that works with speedup S = 2 and mimics a FCFS output-queued switch with a relative queuing delay of N·R/r time-slots. Another family of fully-distributed algorithms, called fractional traffic dispatch (FTD) [86], works with switch speedup S ≥ K/⌊K/2⌋, and its relative queuing delay is at least 2N·R/r time-slots. However, these papers did not provide complete and precise proofs for the correctness of the proposed algorithms and their performance.[1]

The requirement for additional speedup can be relaxed by adding buffers in the demultiplexors. For such an input-buffered PPS, Iyer and McKeown [72] suggest a fully-distributed algorithm that allows a PPS with speedup S = 1 to mimic a FCFS output-queued switch with a relative queuing delay of 2N·R/r time-slots.

[1] See Remarks 1 and 2 in Chapter 4 for further details.

2.4 Prior Work on Delay Jitter Regulation

The problem of jitter control has received much attention in recent years, along with the increasing importance of providing QoS guarantees. A prime example is the Differentiated Services (DiffServ) architecture, in which there is a specific requirement to maintain low jitter for Expedited

Forwarding (EF) traffic [49].

Jitter regulators, which capture jitter control mechanisms, use an internal buffer to shape the traffic [79, 100, 149]. These regulators typically use scheduling algorithms that are not work-conserving, i.e., they might delay releasing a cell even if there are cells in the buffer and the outgoing links are not fully utilized. Several algorithms have been proposed with the aim of providing traffic jitter control: a jitter control algorithm that reconstructs the entire sequence at the destination using a predetermined maximum delay bound was proposed in [116]. The Jitter-Earliest-Due-Date algorithm proposed in [143] uses a predetermined maximum delay bound to calculate a deadline for every cell, such that each cell is released precisely upon its deadline. The Stop-and-Go algorithm proposed in [60] uses time frames of predetermined lengths to regulate traffic, such that cells arriving in the middle of a frame are only made available for sending in the following time frame. The Hierarchical-Round-Robin algorithm proposed in [76] uses a framing strategy similar to that of the Stop-and-Go algorithm, but releases are governed by a round-robin policy that sometimes allocates non-utilized release time-slots to other streams. Other jitter control algorithms are surveyed thoroughly in [147]. A slightly different line of research investigated jitter regulation in the combined input-output queued switch architecture, forcing the jitter regulator to obey additional constraints posed by the switching architecture [83].

The problem of jitter control in an adversarial setting was studied by Mansour and Patt-Shamir [100] in a simplified single-stream model, with only a single abstract source. They present an efficient offline algorithm, which computes an optimal release schedule in these settings. They further devise an online algorithm that uses a buffer of size 2B and produces a release schedule with the optimal jitter attainable with a buffer of size B, and then show a matching lower bound on the amount of resource augmentation needed, proving that their online algorithm is optimal in this sense. For the same model, Koga [89] presents an optimal offline algorithm and a nearly optimal online algorithm for the case where a cell can be stored in the buffer for at most a predetermined amount of time.
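To make the non-work-conserving behavior of such regulators concrete, the following single-stream sketch (written for illustration, in the spirit of the regulators above rather than reproducing any cited algorithm) computes the earliest perfectly periodic release schedule for a known arrival sequence, together with the buffer it requires:

```python
def periodic_release_schedule(arrivals, period):
    """Earliest perfectly periodic release schedule for one stream.

    Cell k (0-based) is released at t0 + k * period; choosing
    t0 = max_k(arrivals[k] - k * period) makes every release feasible
    (no cell leaves before it arrives) and gives zero delay jitter
    by construction."""
    t0 = max(a - k * period for k, a in enumerate(arrivals))
    return [t0 + k * period for k in range(len(arrivals))]

def max_buffer_occupancy(arrivals, releases):
    """Largest number of cells simultaneously held by the regulator."""
    events = [(t, 1) for t in arrivals] + [(t, -1) for t in releases]
    occ = peak = 0
    # Arrivals before releases at equal times (pessimistic tie-breaking).
    for _, delta in sorted(events, key=lambda e: (e[0], -e[1])):
        occ += delta
        peak = max(peak, occ)
    return peak

arrivals = [0, 3, 4, 9]
releases = periodic_release_schedule(arrivals, period=2)  # [3, 5, 7, 9]
buffer_needed = max_buffer_occupancy(arrivals, releases)  # 2 cells
```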

The burstiness of traffic is also captured by its rate jitter, which was first defined via the short-term average rate of the traffic [76]. Mansour and Patt-Shamir [100] introduced another definition of rate jitter, which bounds the difference in cell delivery rates at various times. Since the difference in delivery time between two successive cells is the reciprocal of the instantaneous delivery rate, the rate jitter is defined as the difference between the maximal and minimal inter-departure times. Note that delay jitter is more restrictive than rate jitter [100]. Therefore, unlike delay jitter regulators, which completely reconstruct the incoming traffic, rate jitter regulators typically only partially reconstruct the traffic, and are therefore easier to implement [147].
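Both notions reduce to simple computations over a schedule; the sketch below (illustrative code) computes the delay jitter of a single stream and its rate jitter as the spread of inter-departure times:

```python
def delay_jitter(arrivals, departures):
    """Difference between the maximal and minimal per-cell delay."""
    delays = [d - a for a, d in zip(arrivals, departures)]
    return max(delays) - min(delays)

def rate_jitter(departures):
    """Difference between the maximal and minimal inter-departure time,
    the reciprocal view of the instantaneous delivery rate."""
    gaps = [b - a for a, b in zip(departures, departures[1:])]
    return max(gaps) - min(gaps)
```

A perfectly periodic release schedule, such as the one computed above, has zero rate jitter; it also has zero delay jitter only if the arrivals themselves are perfectly periodic, reflecting that delay jitter is the more restrictive measure.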

Chapter 3
Model Definitions

An N × N switch handles either fixed-size cells or variable-size packets arriving at N input-ports at rate R and destined for N output-ports working at the same rate R. Packets (cells) arrive at the input-ports and leave the output-ports in a time-slotted manner (that is, all the switch's external lines are synchronized). For variable-size packets, we refer to each part of a packet that is transmitted during a single time-slot as a single fixed-size cell, and measure the packet size in cell units. Unless otherwise specified, we assume the switch does not drop cells.

For every cell c, ta(c) denotes the time-slot at which cell c arrives at the switch. In addition, we denote by orig(c) and dest(c) the input-port at which c arrives and the output-port for which c is destined. packet(c) denotes the packet that corresponds to cell c; first(p) and last(p) are the first and last cells of packet p. The definitions in the rest of this chapter assume that only fixed-size cells arrive at the switch; however, they can be easily extended to hold also for variable-size packets.

A traffic T is a finite collection of cells, such that no two cells arrive at the same input-port at the same time-slot. A flow (i, j) is the collection of cells sent from input-port i to output-port j.[1] The projection of a traffic T on a set of input-ports I, denoted T|_I, is {c ∈ T : orig(c) ∈ I}. Since for any input-port i and traffic T there are no two cells c1, c2 ∈ T|_i such that ta(c1) = ta(c2), the arrival times of the cells in T|_i induce a total order on them.

[1] It is important to notice that a flow at the switch level may correspond to several flows at the network level, all sharing the same input-port and the same output-port of the switch.
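These definitions translate directly into a small data model; the sketch below (illustrative code, with names chosen to mirror the notation of this chapter) represents cells, flows and projections, and is reused by the sketch at the end of this chapter:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    """A fixed-size cell, mirroring the notation of this chapter."""
    ta: int     # arrival time-slot, ta(c)
    orig: int   # input-port, orig(c)
    dest: int   # output-port, dest(c)

def flow(traffic, i, j):
    """The flow (i, j): all cells sent from input-port i to output-port j."""
    return {c for c in traffic if c.orig == i and c.dest == j}

def project(traffic, inputs):
    """The projection T|_I of traffic T on a set I of input-ports."""
    return {c for c in traffic if c.orig in inputs}
```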

For any cell c, shift(c, t) is a cell with the same origin and destination such that ta(shift(c, t)) = ta(c) + t. The shift operation is used for concatenating two finite traffics T1 and T2, so that T2 starts after the last cell of traffic T1. Formally, T1 ∘ T2 is the traffic T1 ∪ {shift(c, t) : c ∈ T2}, where t = 1 + max{ta(c) : c ∈ T1}.

E_SW(ALG, T) is the execution of the switch, using scheduling algorithm ALG, in response to incoming traffic T. If ALG is a randomized algorithm, we denote by E_SW(ALG, σ, T) the execution E_SW(ALG, T) taking into account the coin-tosses sequence σ obtained by the algorithm. The exact definition of an execution is determined by the switch architecture under investigation. Yet, given the execution E_SW(ALG, σ, T), one can uniquely determine the time-slot at which cell c leaves the switch, for every cell c ∈ T. This time-slot is denoted tl_SW(c, T) (or tl_SW(σ, c, T) if the algorithm is randomized).

The switch is compared to a work-conserving shadow switch that receives the same traffic T and obeys the per-flow FCFS discipline; that is, cells with the same origin and the same destination leave the switch in their arrival order. We denote the execution of the shadow switch in response to traffic T by E_S(T), and the time a cell c ∈ T leaves the shadow switch by tl_S(c, T). Note that tl_S(c, T) ≥ ta(c) + 1.

The relative queuing delay of a cell c ∈ T under a scheduling algorithm ALG and a coin-tosses sequence σ is R(ALG, σ, c, T) = tl_SW(σ, c, T) − tl_S(c, T).

Definition 1 For a traffic T, scheduling algorithm ALG and coin-tosses sequence σ, the maximum relative queuing delay R_max(ALG, σ, T) is max_{c∈T} {R(ALG, σ, c, T)}, and the average relative queuing delay R_avg(ALG, σ, T) is (1/|T|) Σ_{c∈T} R(ALG, σ, c, T).

The maximum relative queuing delay of an algorithm ALG against an adversary A is denoted R^A_max(ALG). Specifically, R^A_max(ALG) ≥ R with probability 1 − δ if adversary A can construct a traffic T such that Pr_σ[R_max(ALG, σ, T) ≥ R] ≥ 1 − δ. The average relative queuing delay of an algorithm ALG against an adversary A is denoted R^A_avg(ALG). Specifically, R^A_avg(ALG) ≥ R with probability 1 − δ if adversary A can construct a traffic T such that Pr_σ[R_avg(ALG, σ, T) ≥ R] ≥ 1 − δ.

If a switch architecture has a scheduling algorithm ALG such that R_max(ALG) = 0, we say that

the switch architecture emulates an ideal switch. In case the switch architecture has a scheduling algorithm ALG for which R_max(ALG) is bounded, we say that the switch architecture mimics an ideal switch.

The per-flow delay jitter of a traffic T under a scheduling algorithm ALG and a coin-tosses sequence σ is the maximal difference in queuing delay between cells originating at the same input-port and destined for the same output-port. Specifically:

Definition 2 For a traffic T, scheduling algorithm ALG and coin-tosses sequence σ, the delay of a cell c in T is delay(ALG, σ, c, T) = tl_SW(σ, c, T) − ta(c). The delay jitter of a flow (i, j) in T is J_SW(ALG, σ, T, i, j) = max_{c∈T_{i,j}} {delay(ALG, σ, c, T_{i,j})} − min_{c∈T_{i,j}} {delay(ALG, σ, c, T_{i,j})}, where T_{i,j} = {c ∈ T : orig(c) = i and dest(c) = j}. The per-flow delay jitter of the traffic T is J_SW(ALG, σ, T) = max_{i,j} {J_SW(ALG, σ, T, i, j)}.

Similarly, let J_S(T) be the per-flow delay jitter of traffic T under the shadow switch. The relative delay jitter is formally defined as follows:

Definition 3 For a traffic T, scheduling algorithm ALG and coin-tosses sequence σ, the relative delay jitter, denoted J, is the difference between the per-flow delay jitter of the switch and the per-flow delay jitter of an optimal work-conserving shadow switch, that is, J(ALG, σ, T) = J_SW(ALG, σ, T) − J_S(T).

Note that if R_max(ALG, σ, T) = 0, then J(ALG, σ, T) = 0 as well.
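Definitions 1-3 amount to straightforward computations over departure schedules. In the sketch below (illustrative code reusing the Cell representation above), `tl_sw` and `tl_s` map each cell to its departure time-slot in the evaluated switch and in the shadow switch, respectively:

```python
def relative_queuing_delay(tl_sw, tl_s):
    """Per-cell R(c), plus R_max and R_avg of Definition 1."""
    r = {c: tl_sw[c] - tl_s[c] for c in tl_sw}
    return r, max(r.values()), sum(r.values()) / len(r)

def per_flow_delay_jitter(tl):
    """J_SW of Definition 2: the largest spread of cell delays over any flow."""
    jitter = 0
    flows = {(c.orig, c.dest) for c in tl}
    for i, j in flows:
        delays = [t - c.ta for c, t in tl.items()
                  if c.orig == i and c.dest == j]
        jitter = max(jitter, max(delays) - min(delays))
    return jitter

def relative_delay_jitter(tl_sw, tl_s):
    """J of Definition 3: J_SW minus the shadow switch's per-flow jitter."""
    return per_flow_delay_jitter(tl_sw) - per_flow_delay_jitter(tl_s)
```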

Chapter 4

Relative Queuing Delay in PPS

One of the key issues in the design of a PPS (recall Figure 3) is balancing the load of switching operations among the middle-stage switches, thereby utilizing the parallel capabilities of the switch. Load balancing is performed by a demultiplexing algorithm, whose goal is to avoid concentrating a disproportionate number of cells in a small number of middle-stage switches.

Demultiplexing algorithms can be classified according to the amount and type of information they use. The strongest demultiplexing algorithms are centralized, and make demultiplexing decisions based on global information about the status of the switch. Unfortunately, these algorithms must operate at a speed proportional to the aggregate incoming traffic rate, and are therefore impractical. At the other extreme, fully-distributed demultiplexing algorithms rely only on the local information of the input-port.¹ Due to their relative simplicity, they are common in contemporary switches. A realistic middle ground is what we call u real-time distributed (u-RT) demultiplexing algorithms, in which a demultiplexing decision is based on the local information and on global information older than u time-slots. Obviously, every fully-distributed algorithm is also a u real-time distributed algorithm.

The relative queuing delay of the PPS (Definition 1) captures the influence of the parallelism of the PPS on the performance of the switch, depending on the different demultiplexing algorithms, and ignores the specific PPS hardware implementation. As we shall prove, the relative queuing delay is determined solely by the balancing of cells among the planes.

¹ These are also called independent demultiplexing algorithms [68].

Randomization is successfully applied in traditional load-balancing settings [17, 61, 108] and in other high-bandwidth switches [58, 136]: even a very simple strategy ensures (with high probability) a maximum load close to that of the optimal distribution. Therefore, it is tempting to employ randomization to reduce the average imbalance between the planes, and thereby reduce the average relative queuing delay.

Our main contributions are lower and upper bounds on the relative queuing delay of the PPS. Our lower bounds hold even when the PPS has to deal only with well-behaved traffics that obey the leaky-bucket model [140], which makes our results stronger. In addition, we show that randomization does not help to decrease the average relative queuing delay. This somewhat surprising result holds due to the requirement that switches should not mis-sequence cells [81]. This property allows an adversary to exploit a transient increase of the relative queuing delay and perpetuate it sufficiently long to increase the average relative queuing delay. On the other hand, we devise a general methodology for bounding the maximum relative queuing delay from above; clearly, this also bounds (from above) the average relative queuing delay.

4.1 Summary of Our Results

Deterministic Lower Bounds

A bufferless PPS (i.e., a PPS without buffers at the input-ports) with a fully-distributed demultiplexing algorithm incurs the highest relative queuing delay and relative delay jitter. If some plane is utilized by all the demultiplexors, we prove a lower bound of (R/r − 1)·N time-slots on the relative queuing delay and the relative delay jitter, where R/r is the ratio between the PPS external and internal rates. Even in the unrealistic and failure-prone case in which the planes are statically partitioned among the demultiplexors, the relative queuing delay and relative delay jitter are at least (R/r − 1)·N/S time-slots. Both lower bounds employ leaky-bucket flows with no bursts.

A bufferless PPS with a u-RT demultiplexing algorithm (for any u) has relative queuing delay and relative delay jitter of at least (1 − u′r/R)·u′N/S time-slots, where u′ = min{u, R/2r}.

Demultiplexor Type                        Plane rate
                                     OC-3     OC-12    OC-48
Fully-distributed, unpartitioned    64,512   15,360    3,072
Fully-distributed, partitioned      32,256    7,680    1,536
1-RT                                   504      480      384
Centralized                              0        0        0

Table 4.1: The relative queuing delay (in time-slots) of a bufferless OC-192 PPS with 1024 ports and speedup 2, as a function of the plane rate.

In contrast, Iyer et al. [72] present a centralized demultiplexing algorithm for a bufferless PPS with speedup S ≥ 2 that achieves zero relative queuing delay.

Our lower bound results show that the PPS architecture does not scale with an increasing number of external ports (see Table 4.1 for specific instances). This is significant since great effort is currently invested in building switches with a large number of ports. Note that large relative queuing delays usually imply that the buffer sizes at the middle-stage switches and at the external ports should be large as well, so that the delayed cells can be queued.

For a bufferless PPS, it is important to notice that using a u-RT demultiplexing algorithm significantly reduces the lower bound on the relative queuing delay, compared with a fully-distributed demultiplexing algorithm. u-RT demultiplexing algorithms correspond to commercially used policies such as arbitrated crossbar switches [132], in which a request is made by the input-port, and the cell is sent once a grant is received back from the arbiter. The separation between the lower bounds implies that employing u-RT demultiplexing algorithms in a PPS (even with a considerably large value of u) may decrease the relative queuing delay dramatically, and still be feasible for high-speed switches.
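To see how the entries of Table 4.1 arise from the bounds above, the following sketch (our helper functions, with the rate ratio R/r passed as an integer) evaluates the lower-bound formulas for an OC-192 PPS with 1024 ports and speedup 2.

```python
def fully_distributed_bound(N, rate_ratio, partitioned, S=2):
    """Deterministic lower bound d*(R/r - 1) time-slots, with d = N for an
    unpartitioned algorithm and d = N/S under a static plane partition."""
    d = N // S if partitioned else N
    return d * (rate_ratio - 1)

def u_rt_bound(N, rate_ratio, u, S=2):
    """u-RT lower bound (1 - u'*r/R) * u'*N/S, where u' = min(u, (R/r)/2)."""
    u_prime = min(u, rate_ratio / 2)
    return (1 - u_prime / rate_ratio) * u_prime * N / S

# OC-192 external rate over OC-3 planes: R/r = 64, N = 1024, S = 2.
print(fully_distributed_bound(1024, 64, partitioned=False))  # 64512
print(fully_distributed_bound(1024, 64, partitioned=True))   # 32256
print(u_rt_bound(1024, 64, u=1))                             # 504.0
```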

Randomized Lower Bounds

We show that an adversary can devise traffic that exhibits, with high probability, a large average relative queuing delay. The exact bounds depend on the type of the adversary, on the exact restriction on the order of cells the switch should respect and, as in the deterministic case, on the locality of the information used for cell demultiplexing.

When the PPS respects the arrival order of cells with the same input-port and the same output-port (that is, a per-flow FCFS discipline) and the adversary is adaptive [110], the bounds are equal (with high probability) to the deterministic lower bounds on the maximum relative queuing delay, for all classes of demultiplexing algorithms. The randomized lower bound also holds with an oblivious adversary, if the PPS obeys a global FCFS policy (that is, all cells destined for the same output-port should leave the switch according to their arrival order) and a fully-distributed demultiplexing algorithm is used.

Matching Upper Bounds

To prove that the lower bounds are tight, we devise a methodology for evaluating the relative queuing delay under global FCFS policies. We show a general upper bound that depends on the difference between the number of cells with the same destination that are sent through a specific plane, and the total number of cells with this destination.

Our methodology is employed to prove that the maximum relative queuing delay of the fractional traffic dispatch (FTD) algorithm [70] is O(N·R/r) time-slots. This matches the lower bound on the average relative queuing delay introduced by fully-distributed demultiplexing algorithms (even when randomization is used). This is the first formal and complete correctness proof for this algorithm.

Remark 1 Iyer and McKeown [70, 72] outline an approach for bounding the relative queuing delay of FTD, but leave a number of details missing [69]; a previous attempt [86] to complete the formal proof and precisely bound the relative queuing delay of FTD turned out to be flawed [85] (see Remark 2 for further details on the mistake in [86]).

By precisely capturing the crucial factors affecting the relative queuing delay, our methodology leads to new algorithms that use global information that is u time-slots old. Their maximum relative queuing delay is O(N) time-slots, asymptotically matching the lower bound on the average relative queuing delay for this class of demultiplexing algorithms (even when randomization can be used).

PPS Model Extensions

One extension to the PPS model is the input-buffered PPS, which has small buffers also in the input-ports. It can support more elaborate demultiplexing algorithms, since an arriving cell can either be transmitted to one of the middle-stage switches or be kept in the input-buffer. We show that under a u-RT demultiplexing algorithm, a switch with speedup S ≥ 2 and input-buffers larger than u can employ a centralized algorithm (e.g., [72]). In contrast, a deterministic fully-distributed demultiplexing algorithm introduces relative queuing delay and relative delay jitter of at least (1 − r/R)·N/S time-slots, for any buffer size, under leaky-bucket flows with no bursts.

A second extension to the PPS model is obtained by implementing the planes themselves, recursively, as parallel packet switches operating at lower rates. We prove lower bounds on the relative queuing delay for the homogeneous recursive PPS, in which all the demultiplexors in all recursion levels are of the same type (e.g., fully-distributed demultiplexors), and for the monotone recursive PPS, in which demultiplexors are allowed to share more information as their rate decreases. These lower bounds generalize the lower bounds for the non-recursive PPS model.

4.2 A Model for Parallel Packet Switches

An N × N PPS is a three-stage Clos network [41], with K < N planes. Each plane is an N × N switch operating at rate r < R, and is connected to all input-ports on one side and to all output-ports on the other side (recall Figure 3). The speedup S = Kr/R captures the switch over-capacity.

A bufferless PPS has no buffers at its input-ports but can store pending cells in its planes and in its output-ports. Each cell arriving at input-port i is immediately sent to one of the planes; the plane through which the cell is sent is determined by a randomized state machine with state set S_i, following some algorithm.

Definition 4 The demultiplexing algorithm of a bufferless input-port i is a function

ALG_i : {1, ..., N} × S_i × COINSPACE → {1, ..., K} × S_i,

which gives a plane number and the next state, according to the incoming cell's destination, the current state, and the result of a coin-toss taken out of a finite and uniform coin-space COINSPACE. (For a deterministic algorithm, |COINSPACE| = 1.)

It is important to notice that the demultiplexing algorithm ALG_i accesses the random coin-tosses one by one. More precisely, the demultiplexing decision of ALG_i at time-slot t depends only on random coins that were tossed up until time-slot t; the coin-tosses up until time-slot t − 1 are incorporated into the state of ALG_i at time-slot t, while the coin-toss of time-slot t appears explicitly in the definition of ALG_i.

We next extend the switch model defined in Chapter 3 to capture the PPS architecture. E_SW(ALG, σ, T) is the execution of a PPS using demultiplexing algorithm ALG in response to incoming traffic T and coin-tosses sequence σ; for all cells in T, the execution indicates the planes the cells are sent through: {⟨c, plane(c, σ, T)⟩ | c ∈ T}. For clarity, we denote this execution by E_PPS(ALG, σ, T).

A state s ∈ S_i is reachable if there is a sequence of coin-tosses σ and a traffic T such that the state-machine reaches state s in execution E_PPS(ALG, σ, T). A switch configuration consists of the states of all state-machines and the contents of all the buffers in the switch. A configuration is reachable if it is reached in some execution of the switch. Since the switch does not have a predetermined initial configuration, we assume that for every pair of reachable configurations C_1, C_2, there is a finite incoming traffic that causes the switch to transit from C_1 to C_2.

The internal lines of the switch operate at rate r < R. For simplicity, we assume that the ratio R/r is an integer. This lower rate r imposes an input constraint on the demultiplexing algorithm [74]: for any two cells c_1, c_2 in traffic T, if orig(c_1) = orig(c_2) and |ta(c_1) − ta(c_2)| < R/r, then plane(c_1, σ, T) ≠ plane(c_2, σ, T).

Since a PPS has no buffers in its input-ports, cells are immediately sent to one of the planes; that is, a cell c traverses the internal link between orig(c) and plane(c, σ, T) at time ta(c) (see Figure 5). We assume that both the planes and the output buffers are FCFS and work-conserving. Let tp(c, σ, T) be the time-slot in which a cell c ∈ T leaves plane(c, σ, T), and denote by tl_PPS(c, σ, T) the time-slot at which it leaves the PPS (that is, tl_PPS(c, σ, T) = tl_SW(c, σ, T), as defined in Chapter 3).
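As an illustration of Definition 4 and the input constraint, here is a small hypothetical sketch (not an algorithm from the thesis) of a fully-distributed demultiplexor that simply round-robins over the planes; since a bufferless PPS receives at most one cell per time-slot at each input-port, any R/r consecutive arrivals get distinct planes whenever K ≥ R/r, so the input constraint holds.

```python
import random

class RoundRobinDemultiplexor:
    """Hypothetical fully-distributed demultiplexor for one input-port:
    it cycles over the K planes in a fixed order.  With at most one cell
    arriving per time-slot, any R/r consecutive cells are assigned R/r
    distinct planes whenever K >= R/r (the input constraint)."""
    def __init__(self, K, seed=None):
        self.K = K
        # The initial state is arbitrary (a stand-in for COINSPACE).
        self.next_plane = random.Random(seed).randrange(K)

    def demultiplex(self, dest):
        """Return a plane for a cell destined for `dest` (ignored here)
        and advance the local state, as in Definition 4."""
        plane = self.next_plane
        self.next_plane = (self.next_plane + 1) % self.K
        return plane
```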

[Figure 5: Time-points associated with a cell c ∈ T: in the PPS, c arrives at orig(c) at time ta(c), leaves plane(c, σ, T) at time tp(c, σ, T), and leaves dest(c) at time tl_PPS(c, σ, T); in the shadow switch, c arrives at ta(c) and leaves at tl_S(c, T).]

The lower rate of the internal links between the planes and the output-ports imposes an output constraint [74]: for every two cells c_1, c_2 in traffic T, if dest(c_1) = dest(c_2) and plane(c_1, σ, T) = plane(c_2, σ, T), then |tp(c_1, σ, T) − tp(c_2, σ, T)| ≥ R/r.

To neglect delays caused by the additional stage of the PPS, a cell can leave the PPS at the same time-slot it arrives at the output-port, provided that no other cell is leaving at this time-slot; i.e., tl_PPS(c, σ, T) ≥ tp(c, σ, T). Note, however, that tp(c, σ, T) ≥ ta(c) + 1.

When it is clear from the context, we omit the traffic T and the coin-tosses sequence σ from the notations plane(c, σ, T), tp(c, σ, T), tl_PPS(c, σ, T), tl_S(c, T) and R(ALG, σ, c, T).

4.3 Lower Bounds on the Relative Queuing Delay

The relative queuing delay of a PPS heavily depends on the information available to the demultiplexing algorithm. Practical demultiplexing algorithms must operate with local, or out-dated, information about the status of the switch: flows waiting at other input-ports, contents of the planes' buffers, etc. As we shall see, such algorithms incur a non-negligible relative queuing delay.

Specifically, in this section we prove lower bounds on the maximum and average relative queuing delay, even when randomization is used. We first show lower bounds for deterministic demultiplexing algorithms. Based on these results, we present lower bounds for randomized demultiplexing algorithms that use an adaptive adversary, which sends cells to the switch at each time-slot based on the algorithm's actions at previous time-slots.

We further show that, under reasonable assumptions, the lower bounds can be extended to hold with an oblivious adversary, which chooses the entire traffic in advance, knowing only the demultiplexing algorithm. Moreover, we show that these lower bounds on the relative queuing delay yield similar lower bounds on the relative delay jitter.

We prove even stronger results, and show that the lower bounds hold even when the traffic is restricted by the (R, B) leaky-bucket model [34, 45]. This model prevents the traffic from flooding the switch by requiring that the combined rate of the flows sharing the same input-port or the same output-port does not exceed the external rate R of that port by more than a fixed bound B, which is independent of time. Specifically, a traffic T is (R, B) leaky-bucket if for any two time-slots t_1 ≤ t_2 and any output-port j, |{c ∈ T | t_1 ≤ ta(c) ≤ t_2 and dest(c) = j}| ≤ (t_2 − t_1) + B.

4.3.1 General Techniques and Observations

A high relative queuing delay is exhibited when cells that are supposed to leave the shadow switch one after the other are concentrated in a single plane. We first describe this scenario given a specific coin-tosses sequence σ, implying that the results hold also for a deterministic demultiplexing algorithm, for which |COINSPACE| = 1.

Definition 5 An execution E_PPS(ALG, σ, T) is (f, s) weakly-concentrating for output-port j and plane k if there is a time-slot t such that:
1. The buffer of output-port j in the shadow switch is empty at time-slot t; and
2. At least f cells destined for output-port j arrive at the switch during time-interval [t, t + s), and f of these cells are sent through plane k.

We call an execution an (f, s) weakly-concentrating execution when the plane k and the output-port j are clear from the context.
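The drain-rate arithmetic behind such concentrations can be checked with a toy computation (a simplified model under the assumptions stated in the comment, not part of the formal proofs): the shadow switch and a single plane empty their queues at different rates, and the gap is exactly the quantity bounded in the next lemma.

```python
def concentration_rqd(f, rate_ratio):
    """Toy model: f cells for one output-port arrive one per time-slot
    (s = f, B = 0) and are all sent through a single plane.  The shadow
    switch emits one cell per slot, while the plane's internal link emits
    one cell every R/r slots, so the last cell lags by f*(R/r) - f slots."""
    t_shadow = f               # slot at which the last cell leaves the shadow switch
    t_pps = f * rate_ratio     # slot at which the last cell leaves the plane
    return t_pps - t_shadow

# f = 1024 concentrated cells, R/r = 64 (OC-192 ports over OC-3 planes):
print(concentration_rqd(1024, 64))   # 64512, matching Table 4.1
```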

The following lemma bounds the relative queuing delay exhibited in (f, s) weakly-concentrating executions:

Lemma 1 For any (R, B) leaky-bucket traffic T, coin-tosses sequence σ, and (f, s) weakly-concentrating execution E_PPS(ALG, σ, T) for output-port j and plane k, the last cell c that is sent from plane k to output-port j in E_PPS(ALG, σ, T) attains R(ALG, σ, c, T) ≥ f·(R/r) − (s + B).

Proof. We compare the queuing delay of the cells arriving in time-interval [t, t + s) in the PPS and in the shadow switch. Since the shadow switch is work-conserving, all f cells leave the switch within exactly f time-slots after the first cell is dispatched. On the other hand, the PPS completes this execution only after at least f·(R/r) time-slots, because f cells are sent to the same plane, and only one cell can be sent from this plane to the output-port every R/r time-slots. Let c be the last of these cells sent from the plane to the output-port. Hence, the relative queuing delay c attains is at least f·(R/r) − f time-slots. Since the incoming traffic is (R, B) leaky-bucket, f ≤ s + B, and therefore R(ALG, σ, c, T) ≥ f·(R/r) − f ≥ f·(R/r) − (s + B) time-slots. □

Lemma 1 implies the following lower bounds on the maximum relative queuing delay and the maximum relative delay jitter:

Lemma 2 For any (R, B) leaky-bucket traffic T, coin-tosses sequence σ, and (f, s) weakly-concentrating execution E_PPS(ALG, σ, T) for output-port j and plane k:
(1) The maximum relative queuing delay R_max(ALG, σ, T) ≥ f·(R/r) − (s + B) time-slots.
(2) There is a traffic T′ such that the relative delay jitter J(ALG, σ, T′) ≥ f·(R/r) − (s + B) time-slots.

Proof. Part (1) immediately follows from Lemma 1.

For part (2), let t be the time-slot defining E_PPS(ALG, σ, T) as an (f, s) weakly-concentrating execution for output-port j and plane k, and let c be the last cell, arriving in interval [t, t + s), that is sent from plane k to output-port j in E_PPS(ALG, σ, T). By the definition of cell c, ta(c) ≤ t + s − 1, while by Lemma 1, tl_PPS(c) ≥ t + f·(R/r). This implies that delay(c, T) ≥ f·(R/r) − s + 1.

Let t′ > tl_PPS(c, T) be the first time-slot after tl_PPS(c, T) in which all the buffers of the PPS are empty. Such a time-slot exists since T is finite and both the planes and the multiplexors of the PPS are work-conserving.

Let T′ = T ∪ {c′}, where c′ is a cell with orig(c′) = orig(c), dest(c′) = dest(c), and ta(c′) = t′. Clearly, cell c′ leaves the PPS exactly one time-slot after its arrival. In addition, all other cells in traffic T′ leave the PPS exactly as in execution E_PPS(ALG, σ, T). Because cell c and cell c′ share the same origin and destination, the maximum delay jitter introduced by the PPS is at least J_PPS(ALG, σ, T′) ≥ (f·(R/r) − s + 1) − 1 = f·(R/r) − s time-slots.

Recall that the maximum buffer size needed for any work-conserving switch to operate under (R, B) leaky-bucket traffic is B. Therefore, a work-conserving switch that serves the incoming cells in an FCFS manner (e.g., an FCFS output-queued switch) introduces queuing delay, and therefore also delay jitter, of at most B time-slots. Thus, the relative delay jitter between the PPS and the shadow switch is at least (f·(R/r) − s) − B = f·(R/r) − (s + B) time-slots, proving (2). □

Another key observation is that if the last cell of a traffic attains relative queuing delay R, then this traffic can be continued so that every added cell attains relative queuing delay at least R, regardless of the random choices made by the demultiplexing algorithm (if any). We first define how a traffic is continued.

A cell c_2 ∈ T is the immediate successor of cell c_1 ∈ T under demultiplexing algorithm ALG, denoted c_2 = succ(c_1, T), if tl_S(c_2, T) = tl_S(c_1, T) + 1 and, for every coin-tosses sequence σ, tl_PPS(c_2, T) > tl_PPS(c_1, T) in the execution E_PPS(ALG, σ, T). Namely, a PPS cannot change the order in which c_1 and c_2 are delivered; this happens, for example, when a PPS follows a per-flow FCFS policy and c_1, c_2 share the same input-port and the same output-port. Generally, the existence of an immediate successor depends on the priority scheme supported by the PPS.

Let c be the last cell in a traffic T, i.e., tl_S(c, T) = max_{c′∈T} {tl_S(c′, T)}. A traffic T′ = {c_0, ..., c_n} is a proper continuation of T if, in the execution of the shadow switch in response to traffic T ∘ T′, all the cells of T′ are delivered one time-slot after the other without any stalls, and the delivery times of the cells of T remain unchanged. Formally, T′ is a proper continuation of T if, in the execution E_S(T ∘ T′), c_0 = succ(c, T ∘ T′), c_i = succ(c_{i−1}, T ∘ T′) for every i, and for every c′ ∈ T, tl_S(c′, T) = tl_S(c′, T ∘ T′) and tl_PPS(c′, T) = tl_PPS(c′, T ∘ T′).

We first examine proper continuations by a single cell:

Lemma 3 For any demultiplexing algorithm ALG, coin-tosses sequence σ, and finite traffic T, if c_1 is the last cell of T and T′ = {c_2} is a proper continuation of T, then R(ALG, σθ, c_2, T ∘ T′) ≥ R(ALG, σ, c_1, T) for any coin-toss θ.

Proof. Since T′ is a proper continuation of T, cell c_2 leaves the shadow switch exactly at time-slot tl_S(c_1, T ∘ T′) + 1, and in addition tl_PPS(c_2, T ∘ T′) ≥ tl_PPS(c_1, T ∘ T′) + 1. Hence,

R(ALG, σθ, c_2, T ∘ T′) = tl_PPS(c_2, T ∘ T′) − tl_S(c_2, T ∘ T′)
  ≥ (tl_PPS(c_1, T ∘ T′) + 1) − (tl_S(c_1, T ∘ T′) + 1)
  = tl_PPS(c_1, T) − tl_S(c_1, T)
  = R(ALG, σ, c_1, T). □

It is important to notice that Lemma 3 holds for any coin-toss θ, and therefore it trivially holds if the demultiplexing algorithm ALG is deterministic.

If the adversary can construct, for every traffic, a proper continuation that is arbitrarily long, then it can construct a traffic that exhibits an average relative queuing delay that matches the maximum relative queuing delay. Intuitively, the adversary waits for a cell c that attains R_max, and then sends many cells which form a proper continuation (whose length depends on the number of cells that arrived before c).

Lemma 4 Fix an adversary A, a demultiplexing algorithm ALG, a coin-tosses sequence σ, and a finite traffic T whose last cell c has R(ALG, σ, c, T) = x. If the adversary A can construct a proper continuation of traffic T whose size is at least |T|·(x − ε)/ε (where ε > 0 is an arbitrarily small constant), then R^A_avg(ALG) ≥ x − ε.

Proof. Let l be the number of cells in traffic T, and let T′ be a proper continuation of T such that |T′| = l·(x − ε)/ε. Applying Lemma 3 |T′| times implies that for every cell b in T′ and any coin-tosses sequence σ_b, R(ALG, σσ_b, b) ≥ R(ALG, σ, c) = x. Hence,

R^A_avg(ALG) ≥ (l·(x − ε)/ε)·x / (l + l·(x − ε)/ε) = x − ε. □

In order to allow constructing proper continuations of a traffic T with high relative queuing delay, we extend Definition 5 so that traffic T ends when a concentration occurs:

Definition 6 An execution E_PPS(ALG, σ, T) is (f, s) strongly-concentrating for output-port j and plane k if it is (f, s) weakly-concentrating and, in addition, traffic T ends at time-slot t + s. For brevity, we call such executions (f, s)-concentrating executions.

4.3.2 Lower Bounds for Fully-Distributed Demultiplexing Algorithms

A fully-distributed demultiplexing algorithm demultiplexes a cell arriving at time-slot t according to the input-port's local information in time-interval [0, t]. Since no information is shared between input-ports, we assume that the state s_i ∈ S_i of demultiplexor i does not change unless a cell arrives at input-port i. Note that demultiplexing algorithms that change their state even without receiving a cell are not considered fully-distributed, because a common clock-tick is shared among all input-ports. (Such algorithms are covered in Section 4.3.3.)

Lower bound for deterministic fully-distributed algorithms

The relative queuing delay of a PPS with a fully-distributed demultiplexing algorithm strongly depends on the number of input-ports that can send a cell, destined for the same output-port, through the same plane. The following definition captures this switch characteristic for deterministic algorithms (Definition 8 extends it to randomized algorithms):

Definition 7 A deterministic demultiplexing algorithm is d-partitioned if there are a plane k and an output-port j such that at least d input-ports send a cell destined for output-port j through plane k in one of their reachable configurations.

We next show that a static partition of the planes among the demultiplexors helps to reduce the relative queuing delay. However, since such partitioning is failure-prone, most existing fully-distributed algorithms are N-partitioned, meaning that each demultiplexor may use each plane in order to send cells to each output-port.

All our results hold for this class of algorithms by substituting d = N.

Theorem 1 Any deterministic d-partitioned fully-distributed demultiplexing algorithm ALG has R_max(ALG) ≥ d(R/r − 1) and J(ALG) ≥ d(R/r − 1) time-slots.

Proof. By the definition of a d-partitioned demultiplexing algorithm, there are an output-port j and a plane k such that at least d demultiplexors send a cell destined for j through k in some reachable configuration. Let I = {i_1, i_2, ..., i_d} be the set of these demultiplexors, and let s_i ∈ S_i be the state of demultiplexor i ∈ I in configuration C_i, just before a cell is sent to plane k.

Consider a traffic T_i that, from an arbitrary reachable configuration C, leads the switch to configuration C_i; such a traffic exists since C and C_i are reachable, and there is a traffic that causes the switch to transit between any two reachable configurations. Let T′_i = (T_i)_i; that is, a traffic in which cells arrive only at input-port i, exactly in the same time-slots as in traffic T_i. Since the demultiplexing algorithm is fully-distributed, demultiplexor i transits into s_i. Note that in T′_i at most one cell arrives at the switch in every time-slot, and therefore this traffic has no bursts.

Now consider the traffic T, which starts with T′_{i_1} ∘ ... ∘ T′_{i_d}, a sequential composition of the traffics T′_i, i ∈ I. T begins from configuration C, and sequentially for every i ∈ I, the same cells arrive at the switch in the same time-slots as in traffic T′_i, until demultiplexor i reaches state s_i. Then, no cells arrive at the switch until all the buffers in all the planes are eventually empty. Finally, d cells destined for output-port j arrive, one after the other, at the different input-ports i ∈ I (one cell in each time-slot). Since the demultiplexing algorithm is fully-distributed, each demultiplexor i ∈ I remains in state s_i, and all d cells are sent through the same plane k (see Figure 6; the last d cells are denoted c_{i_1}, ..., c_{i_d}).

T has no bursts, and the cells c_{i_1}, ..., c_{i_d} arrive during d consecutive time-slots. These cells arrive at the switch after the buffer of output-port j is empty. Thus, by Lemma 2 with f = d, s = d and B = 0, we obtain the stated lower bounds. □
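Continuing the hypothetical RoundRobinDemultiplexor sketch above, the adversary of Theorem 1 can be phrased operationally: since the algorithm is deterministic and fully-distributed, the adversary can drive each demultiplexor, using filler cells addressed to that input-port alone, into a state from which its next cell is sent through the target plane, and then release one cell per time-slot from all d ports. A minimal sketch (assuming the adversary can read the local states, which it can always simulate for a deterministic algorithm):

```python
def theorem1_attack(demuxes, target_plane, j=0):
    """Drive each demultiplexor until its next cell would use target_plane,
    then release one cell per slot, destined for output-port j, from every
    port; all the released cells concentrate in target_plane."""
    for d in demuxes:
        while d.next_plane != target_plane:
            d.demultiplex(j)          # filler cell arriving at this port alone
    return [d.demultiplex(j) for d in demuxes]

demuxes = [RoundRobinDemultiplexor(K=8, seed=i) for i in range(4)]
print(theorem1_attack(demuxes, target_plane=3))   # [3, 3, 3, 3]
```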

[Figure 6: Illustration of traffic T in the proofs of Theorems 1 and 4: after the sequential composition of the traffics T′_i, the cells c_{i_1}, ..., c_{i_d} arrive at the input-ports i_1, ..., i_d, one cell per time-slot.]

If the PPS obeys a per-flow FCFS policy, we get the following lower bound on the average relative queuing delay:

Theorem 2 Any deterministic d-partitioned fully-distributed demultiplexing algorithm ALG has R_avg(ALG) ≥ d(R/r − 1) − ε time-slots, where ε > 0 can be made arbitrarily small.

Proof. Let T be the traffic that causes the maximum relative queuing delay, as described in the proof of Theorem 1, and let c be its last cell. We continue traffic T with a traffic T′ that consists of |T|·(f·(R/r) − (s + B) − ε)/ε cells from orig(c) to dest(c), one cell at each time-slot. T′ is a proper continuation of traffic T, because both the PPS and the shadow switch obey a per-flow FCFS policy and all the cells of T′ share the same input-port and the same output-port. Hence, Lemma 4 implies that R_avg(ALG) ≥ f·(R/r) − (s + B) − ε = d(R/r − 1) − ε. □

Note that the PPS input constraint implies that each demultiplexor must send incoming cells through at least R/r planes. This implies that, even under a static partitioning, each plane is used by (R/r)·N/K = N/S demultiplexors on the average. Hence, there is a plane k that is used by at least N/S demultiplexors in order to dispatch cells destined for a certain output-port j. By substituting d = N/S in Theorems 1 and 2, we get:

Theorem 3 A bufferless PPS with a deterministic fully-distributed demultiplexing algorithm has maximum relative queuing delay and relative delay jitter of at least (N/S)·(R/r − 1) time-slots under leaky-bucket traffic without bursts. Its average relative queuing delay is at least (N/S)·(R/r − 1) − ε, for arbitrarily small ε > 0.

Lower bound for randomized fully-distributed algorithms with an adaptive adversary

We concentrate now on an adaptive adversary, denoted adp, which sends cells to the switch based on the algorithm's actions. For every traffic T, we examine the probability Pr_σ[E_PPS(ALG, σ, T) is (f, s)-concentrating], taken over all coin-tosses sequences σ, that the execution of ALG given T and σ is (f, s)-concentrating.

Another key observation is that if there is a traffic T such that its execution is (f, s)-concentrating with small but non-negligible probability, then an adaptive adversary can construct another execution that is almost always (f, s)-concentrating:

Lemma 5 If from every configuration C there is an (R, B) leaky-bucket traffic T such that Pr_σ[E_PPS(ALG, σ, T) is (f, s)-concentrating] ≥ p > 0, then an adaptive adversary can construct an (R, B) leaky-bucket traffic T′ from C such that Pr_σ[E_PPS(ALG, σ, T′) is (f, s)-concentrating] ≥ 1 − δ, where δ can be made arbitrarily small.

Proof. Fix a configuration C; the adaptive adversary constructs the executions from C iteratively. Denote C_0 = C. Let C_i be the configuration just before iteration i ≥ 0, and denote by T_i a traffic such that, from configuration C_i, Pr_σ[E_PPS(ALG, σ, T_i) is (f, s)-concentrating] ≥ p. The adversary stops if the last execution is indeed (f, s)-concentrating. Otherwise, it concatenates an empty traffic of B time-slots (denoted T_e) and continues to the next iteration.

Since in each iteration the adversary stops with probability at least p, independently of previous iterations, it stops with an (f, s)-concentrating execution within l = ⌈log_{1−p} δ⌉ iterations with probability at least 1 − δ. Since there are B empty time-slots between the arrival of the last cell of traffic T_i and the arrival of the first cell of T_{i+1}, T′ = T_0 ∘ T_e ∘ ... ∘ T_e ∘ T_l has burstiness factor B, and its corresponding execution starting from C is (f, s)-concentrating with probability at least 1 − δ. □

If both the shadow switch and the PPS are per-flow FCFS, an adaptive adversary can always construct an arbitrarily long proper continuation of some traffic T. Therefore, we have:

Lemma 6 If from every configuration C there is an (R, B) leaky-bucket traffic T such that Pr_σ[E_PPS(ALG, σ, T) is (f, s)-concentrating] > 0, then with probability 1 − δ, R^adp_avg(ALG) ≥ f·(R/r) − (s + B) − ε, where ε > 0 and δ > 0 can be made arbitrarily small.

Proof. By Lemma 5, an adaptive adversary can construct a traffic T′ from configuration C such that Pr_σ[E_PPS(ALG, σ, T′) is (f, s)-concentrating] ≥ 1 − δ. Let c be the last cell of T′. Lemma 1 implies that, with probability 1 − δ, the relative queuing delay of c is at least f·(R/r) − (s + B). When the adaptive adversary observes such a concentration event, it continues with a traffic T′′ that consists of |T′|·(f·(R/r) − (s + B) − ε)/ε cells from orig(c) to dest(c), one cell at each time-slot. T′′ is a proper continuation of traffic T′, because both the PPS and the shadow switch obey a per-flow FCFS policy and all the cells of T′′ share the same input-port and the same output-port. Hence, Lemma 4 implies that R^adp_avg(ALG) ≥ f·(R/r) − (s + B) − ε, with probability 1 − δ. □

We now extend Definition 7 to capture randomized demultiplexing algorithms:

Definition 8 A randomized demultiplexing algorithm is d-partitioned if there are a plane k, an output-port j, and a set of input-ports I, such that |I| ≥ d and the following property holds: for every input-port i ∈ I and state s_i ∈ S_i, if at least n_i cells destined for output-port j arrive at input-port i after it is in state s_i, then with probability p_i > 0, input-port i sends at least one cell destined for output-port j through plane k.

We next prove a lower bound for d-partitioned fully-distributed demultiplexing algorithms by showing that it is possible to construct a traffic with no bursts that causes, with non-negligible probability, the algorithm to concentrate d cells in a single plane during a time-interval of d time-slots:

Theorem 4 Any randomized d-partitioned fully-distributed demultiplexing algorithm ALG has, with probability 1 − δ, R^adp_max(ALG) ≥ d(R/r − 1) time-slots and R^adp_avg(ALG) ≥ d(R/r − 1) − ε time-slots, where ε > 0 and δ > 0 can be made arbitrarily small.

Proof. Given ALG, the adversary pre-computes the set I = {i_1, ..., i_d} of d input-ports, the output-port j and the plane k, and, for each input-port i ∈ I, the values n_i and p_i > 0 for which the conditions of Definition 8 hold.

We now construct a traffic similar to the one used in the proof of Theorem 1. Fix a configuration C, and for every i ∈ I, let T_i be a traffic consisting of n_i cells destined for output-port j that arrive one after the other at input-port i. By the definition of n_i, with probability at least p_i, at least one cell of T_i is sent through plane k. Let c_i be the first such cell; it follows that Pr_σ[cell c_i is sent through plane k in E_PPS(ALG, σ, T_i)] ≥ p_i. Let T′_i be the prefix of T_i that ends with cell c_i; that is, T′_i = {c ∈ T_i | ta(c) ≤ ta(c_i)}. Since the probability of sending c_i through plane k in execution E_PPS(ALG, σ, T_i) depends only on cells that arrive at the switch before cell c_i, it follows that for the prefix T′_i, Pr_σ[cell c_i is sent through plane k in E_PPS(ALG, σ, T′_i)] ≥ p_i as well.

Traffic T is defined as follows: T = (T′_{i_1} \ {c_{i_1}}) ∘ ... ∘ (T′_{i_d} \ {c_{i_d}}) ∘ {c_{i_1}} ∘ ... ∘ {c_{i_d}} (recall Figure 6). We next show that, with non-negligible probability taken over all coin-tosses sequences σ, all the cells c_{i_1}, ..., c_{i_d} are sent through plane k in the execution of ALG on traffic T. In traffic T, for each input-port i ∈ I, no cells arrive at input-port i between the cells of T′_i \ {c_i} and c_i. Thus, for each input-port i ∈ I and coin-tosses sequence σ, plane(c_i, T) = plane(c_i, T′_{i_1} ∘ ... ∘ T′_{i_d}). Since the demultiplexors are independent, the probability, taken over all coin-tosses sequences σ, that the last d cells are sent through plane k in execution E_PPS(ALG, σ, T) is at least ∏_{a=1}^{d} p_{i_a} > 0. This implies that execution E_PPS(ALG, σ, T) is (d, d)-concentrating with non-negligible probability. Since T has no bursts, the claim follows immediately from Lemma 6. □

Lower bound for randomized fully-distributed algorithms with an oblivious adversary

We now consider an oblivious adversary, obl, that chooses the entire traffic in advance, knowing only the demultiplexing algorithm ALG. R^obl_max(ALG) and R^obl_avg(ALG) denote the maximum and average relative queuing delay of algorithm ALG against such an adversary. We assume that the PPS and the shadow switch obey a global FCFS policy, i.e., cells that share the same output-port should leave the switch in the order of their arrival (with ties broken arbitrarily). Unlike a per-flow FCFS policy, a global FCFS policy requires cells to leave in order even if they do not share the same origin.

We next extend Theorem 4 to hold with an oblivious adversary, under a global FCFS discipline.

Theorem 5 Any randomized d-partitioned fully-distributed demultiplexing algorithm ALG has R^obl_max(ALG) ≥ d(R/r − 1) time-slots and R^obl_avg(ALG) ≥ d(R/r − 1) − ε time-slots, with probability 1 − δ, where ε > 0 and δ > 0 can be made arbitrarily small.

Proof. Given ALG, the adversary pre-computes the set I = {i_1, ..., i_d} of d input-ports, the output-port j and the plane k, and, for each input-port i ∈ I, the values n_i and p_i for which the conditions of Definition 8 hold. Let p = ∏_{a=1}^{d} (p_{i_a}/n_{i_a}) > 0.

For any input-port i ∈ I, let x_i be a value chosen uniformly at random from {1, ..., n_i}. Let T_i be a traffic consisting of x_i cells from input-port i to output-port j, and let c_i be the last cell of T_i. Traffic T is defined as follows: T = (T_{i_1} \ {c_{i_1}}) ∘ ... ∘ (T_{i_d} \ {c_{i_d}}) ∘ {c_{i_1}} ∘ ... ∘ {c_{i_d}}. Note that traffic T is similar to the traffic T in the proofs of Theorems 1 and 4 (illustrated in Figure 6).

Using traffic T, the adversary constructs a traffic T′ (illustrated in Figure 7) whose average relative queuing delay is at least d(R/r − 1) − ε time-slots with probability 1 − δ (the constants δ, ε > 0 can be made arbitrarily small). The construction has two steps:

Step 1 Concatenate ⌈log_{1−p} δ⌉ instances of traffic T. For each instance, choose independently and uniformly at random the values x_{i_m}, 1 ≤ m ≤ d, from {1, ..., n_{i_m}}. Let l be the total size of these instances.

Step 2 Concatenate a traffic of l·(d(R/r − 1) − ε)/ε cells, such that each cell is sent from an arbitrary input-port i to output-port j.

We first prove that, with non-negligible probability taken over all coin-tosses sequences σ, the execution of ALG on any instance of traffic T = (T_{i_1} \ {c_{i_1}}) ∘ ... ∘ (T_{i_d} \ {c_{i_d}}) ∘ {c_{i_1}} ∘ ... ∘ {c_{i_d}} is (d, d)-concentrating, regardless of the initial configuration.

Claim 1 Pr_σ[E_PPS(ALG, σ, T) sends the last d cells through plane k] ≥ ∏_{a=1}^{d} (p_{i_a}/n_{i_a}) = p > 0.

[Figure 7: Illustration of traffic T′ in the proof of Theorem 5: Step 1 concatenates instances of traffic T; Step 2 appends a long run of cells destined for output-port j, ending with cell c′.]

Proof of claim. For any input-port i ∈ I, denote by T″_i the traffic consisting of n_i cells from input-port i to output-port j. By the definition of n_i, with probability at least p_i, at least one cell of T″_i is sent through plane k. Since x_i is chosen uniformly at random from the values {1, ..., n_i}, this cell is the x_i-th cell (that is, the cell c_i) with probability at least 1/n_i. Note that traffic T_i is a prefix of traffic T″_i; since the demultiplexor is bufferless, the decision through which plane to send the cell c_i is based only on cells arriving at the switch prior to c_i, which implies that cell c_i is sent through k with probability at least p_i/n_i.

In traffic T, for each input-port i ∈ I, no cells arrive at input-port i between the cells of T_i \ {c_i} and c_i. Thus, for each input-port i ∈ I and coin-tosses sequence σ, plane(c_i, T) = plane(c_i, T_{i_1} ∘ ... ∘ T_{i_d}). Since the demultiplexors are independent, the probability, taken over all coin-tosses sequences σ, that execution E_PPS(ALG, σ, T) sends the last d cells through plane k is at least ∏_{a=1}^{d} (p_{i_a}/n_{i_a}) = p > 0. □

In Step 1, the random choices of the T instances are independent. Therefore, a (d, d)-concentration occurs at least once in Step 1 with probability at least 1 − δ. Let c′ be the last cell of the first instance in which a (d, d)-concentration occurs, and let T_1 be the traffic {c ∈ T′ | ta(c) ≤ ta(c′)}. Since E_PPS(ALG, σ, T_1) is (d, d)-concentrating, Lemma 1 implies that R(ALG, σ, c′, T_1) ≥ d(R/r − 1).

Let T_2 = T′ \ T_1. We next show that T_2 is a proper continuation of T_1. Intuitively, this is because the switches are work-conserving with a FCFS policy, and during each interval of size τ exactly τ cells destined for output-port j arrive at the switch (i.e., there are no stalls between cells in traffic T′ = T_1 ∘ T_2).

Formally, consider two cells c_1, c_2 ∈ T′ such that ta(c_2) = ta(c_1) + 1.

The FCFS policy implies that tl_S(c_2, T′) > tl_S(c_1, T′) and tl_PPS(c_2, T′) > tl_PPS(c_1, T′). In addition, by the construction of traffic T′, there is no cell c_3 ∈ T′, other than c_1 and c_2, such that ta(c_1) ≤ ta(c_3) ≤ ta(c_2). Therefore, the FCFS policy and the work-conservation of the shadow switch imply that tl_S(c_2, T′) = tl_S(c_1, T′) + 1. Hence, for every two cells c_1, c_2 ∈ T′, if ta(c_2) = ta(c_1) + 1 then c_2 = succ(c_1, T′); in particular, the first cell of T_2 is the successor of cell c′. Moreover, since the switches follow a FCFS policy, cells of traffic T_2 do not prohibit cells of traffic T_1 from being delivered on time; namely, for any cell c ∈ T_1, tl_S(c, T_1) = tl_S(c, T_1 ∘ T_2) and tl_PPS(c, T_1) = tl_PPS(c, T_1 ∘ T_2).

Since R(ALG, σ, c′, T_1) ≥ d(R/r − 1) and |T_2| ≥ |T_1|·(d(R/r − 1) − ε)/ε, Lemma 4 implies that R^obl_avg(ALG) ≥ d(R/r − 1) − ε and R^obl_max(ALG) ≥ d(R/r − 1), with probability 1 − δ. □

4.3.3 Lower Bounds for u-RT Demultiplexing Algorithms

While fully-distributed demultiplexing algorithms do not use any global information, in practice demultiplexors may be able to gather some information about the switch status (e.g., through dedicated control lines). Therefore, it is important to consider a broader class of demultiplexing algorithms, in which out-dated global information is also used:

Definition 9 A u real-time distributed (u-RT) demultiplexing algorithm demultiplexes a cell, arriving at time t, according to the input-port's local information in time-interval [0, t] and to the switch's global information in time-interval [0, t − u].

The state-transition function of the i-th bufferless demultiplexor operating under a u-RT demultiplexing algorithm is S_i(t) : S_i × C^{t−u+1} × {1, ..., N} × COINSPACE → S_i, where t is the time-slot at which S_i is applied, C is the set of all reachable switch configurations, and C^{t−u+1} is the cross-product of t − u + 1 such sets, one for each time-slot in the interval [0, t − u]. Note that a demultiplexor state-transition may depend on other demultiplexors' state-transitions and on incoming flows to other input-ports, as long as these events occurred at least u time-slots before the state-transition. The state of a demultiplexor can change even if no cell arrives at the input-port.

The additional global information allows to reduce the relative queuing delay. For example, when a 1-RT demultiplexing algorithm receives (R, 0) leaky-bucket traffic, it has full information

about the switch status, and therefore it can emulate a centralized algorithm. Yet, the lack of information about recent events yields a non-negligible relative queuing delay, caused by leaky-bucket traffic with a non-zero burstiness factor, as we shall prove next.

A prominent example of 1-RT demultiplexing algorithms (that is, with u = 1) are demultiplexing algorithms that share only a common clock-tick among the input-ports. Therefore, a demultiplexor with a 1-RT algorithm may change its state even if no cell arrives at its input-port.

Lower bound for deterministic u-RT algorithms

Let u′ = min{u, R/2r}; that is, the minimum between the lag in gathering global information and half the external rate relative to the rate of the planes. We first show a lower bound on the performance of deterministic u-RT algorithms:

Theorem 6 Any deterministic u-RT demultiplexing algorithm ALG has R_max(ALG) ≥ (u′N/S)·(1 − u′r/R) and J(ALG) ≥ (u′N/S)·(1 − u′r/R) time-slots. If the PPS obeys a per-flow FCFS policy, then R_avg(ALG) ≥ (u′N/S)·(1 − u′r/R) − ε time-slots, where ε > 0 can be made arbitrarily small.

Proof. Consider an arbitrary configuration C. Denote by t_0 the time-slot in which the PPS is in configuration C, by x_0 the number of cells that arrived at the PPS until time-slot t_0, and by n_0 the number of cells stored in the PPS buffers at time-slot t_0.

Consider now the empty traffic T_e, in which no cells arrive at the switch at all. We first argue that if T_e is long enough, all the buffers of the switch become empty. Specifically, denote by C_1 the switch configuration at time-slot t_1 = t_0 + n_0 + (u′N/S)·x_0 + 1. If there are still cells stored in one of the buffers at time-slot t_1, then these cells have relative queuing delay of at least (u′N/S)·x_0 time-slots; therefore the average relative queuing delay is more than u′N/S time-slots, and the theorem follows.

Assume now that all the buffers are empty in configuration C_1. Fix an output-port j, and consider the traffic T in which cells destined for j arrive simultaneously at all input-ports at each time-slot in the interval [t_1, t_1 + u′). Note that T is an (R, u′N − u′) leaky-bucket traffic, since for any τ ≥ 1 and time-interval [t, t + τ), the total number of cells arriving at the switch is bounded by τ + (u′N − u′).

[Figure 8: Illustration of traffic T_e ∘ T_I in the proof of Theorems 6 and 8: no cells arrive during [t_0, t_1); during [t_1, t_1 + u′), cells destined for j arrive at the input-ports i ∈ I (time-slots on the horizontal axis).]

Since u′ ≤ (1/2)·(R/r) < R/r, the input constraint implies that two cells arriving at the same input-port are not sent through the same plane. Hence, there is a plane k used by a set I of at least u′N/K input-ports in the execution E_PPS(ALG, T); note that u′N/K ≤ (R/2r)·N/K = N/(2S) ≤ N, since the speedup of a PPS is at least 1.

For every input-port i ∈ I, let c_i ∈ T_i be a cell such that plane(c_i, T_i) = k, where T_i denotes the cells of T arriving at input-port i. Consider the traffic T′_i = {c | c ∈ T_i and ta(c) ≤ ta(c_i)}; that is, T′_i consists of the cells of T_i that arrive at the switch no later than cell c_i. Now consider the traffic T_I = ∪_{i∈I} T′_i (see Figure 8). Note that both T_I and T_e ∘ T_I are (R, u′²N/K − u′) leaky-bucket traffics.

For every input-port i ∈ I, ta(c_i) < t_1 + u′ ≤ t_1 + u, which implies that input-port i does not have global information on the switch status after time-slot t_1. Hence, the executions E_PPS(ALG, T) and E_PPS(ALG, T_I) are equivalent. Therefore, all the input-ports i ∈ I send their last cell to plane k in E_PPS(ALG, T_e ∘ T_I), starting at configuration C. Hence, by Lemma 2, the maximum relative queuing delay and the relative delay jitter are at least R′ = (u′N/S)·(1 − u′r/R) time-slots.

Assume now that the PPS obeys a per-flow FCFS policy. Let c be the last cell of traffic T_e ∘ T_I that attains the maximum relative queuing delay. Consider a traffic T′ that consists of |T_I|·(R′ − ε)/ε cells from orig(c) to dest(c), one cell at each time-slot. T′ is a proper continuation of traffic T_e ∘ T_I; thus, by Lemma 4, R_avg(ALG) ≥ (u′N/S)·(1 − u′r/R) − ε, as required. □

By substituting the minimal value u = 1, we get the following general result:

Corollary 7 Any deterministic u-RT demultiplexing algorithm, u ≥ 1, has relative queuing delay and relative delay jitter of at least (N/S)·(1 − r/R) time-slots, under leaky-bucket traffic with burstiness factor N/K − 1.

Lower bound for randomized u-RT algorithms with an adaptive adversary

We next give a lower bound on the average relative queuing delay of randomized u-RT demultiplexing algorithms. The proof is based on Theorem 6 and Lemma 6:

Theorem 8 Any randomized u-RT demultiplexing algorithm ALG has, with probability 1 − δ, R^adp_max(ALG) ≥ (u′N/S)·(1 − u′r/R) time-slots and R^adp_avg(ALG) ≥ (u′N/S)·(1 − u′r/R) − ε time-slots, where ε > 0 and δ > 0 can be made arbitrarily small.

Proof. Consider an arbitrary configuration C and the traffics T_e and T whose constructions are described in the proof of Theorem 6. It is important to notice that the input constraint implies that for every coin-tosses sequence σ there is a plane k used by a set I of at least u′N/K input-ports in the execution E_PPS(ALG, σ, T). For every input-port i ∈ I, let c_i ∈ T_i be a cell such that plane(c_i, T_i) = k. Consider the traffics T′_i = {c | c ∈ T_i and ta(c) ≤ ta(c_i)} and T_I = ∪_{i∈I} T′_i (recall Figure 8).

For every input-port i ∈ I, ta(c_i) < t_1 + u′ ≤ t_1 + u, which implies that input-port i does not have global information on the switch status after time-slot t_1. Hence, the executions E_PPS(ALG, σ, T) and E_PPS(ALG, σ, T_I) are equivalent. Therefore, with probability at least ∏_{i∈I} (1/|COINSPACE|)^{|T′_i|} ≥ (1/|COINSPACE|)^{u′·|I|} = (1/|COINSPACE|)^{u′²N/K} > 0 (taking |I| = u′N/K), taken over the coin-tosses sequences σ, all the input-ports i ∈ I send their last cell to plane k in E_PPS(ALG, σ, T_e ∘ T_I), starting at configuration C. Hence, configuration C satisfies the conditions of Lemma 6 and the claim follows. □

The question whether the lower bound for u-RT demultiplexing algorithms (Theorem 8) can be extended to hold with an oblivious adversary is left open. The proof technique described in this section will most likely fail to provide such an extension, since the worst-case traffics used to prove lower bounds for u-RT algorithms have bursts. Unfortunately, the burstiness accumulates when concatenating bursty traffics, unless there is a gap of a certain number of time-slots in which no cells arrive at the overloaded output-port.

Large bursts may justify a high queuing delay of cells, and hence result in a low relative queuing delay. On the other hand, a gap in which no cells arrive at the overloaded output-port reduces the relative queuing delay of the cells that arrive immediately after it. This implies that the adversary should identify the concentration and then choose to continue the traffic without a gap (as in Lemma 4).

4.4 Upper Bounds on the Relative Queuing Delay

This section presents a methodology for bounding R_max(ALG, σ, T) for an arbitrary traffic T and coin-tosses sequence σ. We fix some traffic T and omit the notations ALG, σ and T. For simplicity, we assume that T begins after time-slot 0, and that at time-slot 0 (i.e., at "the beginning of time") no cells arrive at the switch, so that all the queues are empty. Our analysis depends on the realistic assumption that the PPS obeys the global FCFS policy.

Cells are queued in a bufferless PPS either within the planes or within the multiplexors residing at the output-ports. A simple situation in which queuing in a multiplexor happens is when the output-port is flooded; but in this case, cells also suffer from a high queuing delay in the shadow switch, and the relative queuing delay is small. A more complicated situation is when a cell arrives at the multiplexor out of order, and should wait for previous cells to arrive from their planes. In this case, the relative queuing delay is a by-product of queuing within the other planes (of some preceding cell): the relative queuing delay of the waiting cell is at most the relative queuing delay of some preceding cell that was queued only in the planes. This observation is captured in the next lemma:

Lemma 7 There is a cell c such that tl_PPS(c) = tp(c) and R(c) = R_max.

Proof. Let c be the first cell to leave the PPS such that R(c) = R_max. Assume that tl_PPS(c) > tp(c); since the multiplexor buffer is work-conserving, at time-slot tl_PPS(c) − 1 another cell c′ leaves the PPS from output-port dest(c). Hence tl_PPS(c′) = tl_PPS(c) − 1, and therefore R(c′) = tl_PPS(c′) − tl_S(c′) = tl_PPS(c) − 1 − tl_S(c′). Since c′ leaves the PPS before c and the shadow switch is FCFS, tl_S(c′) ≤ tl_S(c) − 1. Hence the relative queuing delay of c′ is R(c′) ≥ tl_PPS(c) − tl_S(c) = R(c) = R_max, contradicting the minimality of c. □

Consider a single cell c, and focus on the queuing within plane(c) caused by the lower rate of the link from plane(c) to dest(c). Since both the PPS and the shadow switch are FCFS, cells arriving at the switch after cell c cannot prohibit c from being transmitted on time. We present an upper bound that depends only on the disproportion of the number of cells sent through plane(c) to dest(c). Relating this quantity to the queue lengths at time-slot ta(c) is not immediate, since it is possible that the shadow switch is busy when the plane is idle, and vice versa.

Let A_j(t_1, t_2) be the number of cells destined for output-port j that arrive at the switch during time-interval [t_1, t_2], and let A^k_j(t_1, t_2) be the number of these cells that are sent through plane k. The following definition captures the imbalance between planes:

Definition 10 For a plane k, an output-port j and time-slots 0 ≤ t_1 ≤ t_2:
1. The imbalance of time-interval [t_1, t_2] is Δ^k_j(t_1, t_2) = A^k_j(t_1, t_2) − (r/R)·A_j(t_1, t_2).
2. The imbalance by time-slot t_2 is Δ^k_j(t_2) = max_{t_1 ≤ t_2} {Δ^k_j(t_1, t_2)}.
3. The maximum imbalance is Δ^k_j = max_{t_2} {Δ^k_j(t_2)}.

Clearly, Δ^k_j ≥ Δ^k_j(t_2) ≥ Δ^k_j(t_1, t_2) for every output-port j, plane k and time-slots t_1 ≤ t_2. In addition, the imbalance is superadditive:

Property 1 For every output-port j, plane k and time-slots t_1 ≤ t_2, Δ^k_j(t_2) ≥ Δ^k_j(t_1 − 1) + Δ^k_j(t_1, t_2).

Proof. By Definition 10, there is a time-slot t′_1 ≤ t_1 such that Δ^k_j(t_1 − 1) = Δ^k_j(t′_1, t_1 − 1) = A^k_j(t′_1, t_1 − 1) − (r/R)·A_j(t′_1, t_1 − 1). Since Δ^k_j(t_1, t_2) = A^k_j(t_1, t_2) − (r/R)·A_j(t_1, t_2), we have:

Δ^k_j(t_1, t_2) + Δ^k_j(t_1 − 1) = A^k_j(t′_1, t_1 − 1) + A^k_j(t_1, t_2) − (r/R)·(A_j(t′_1, t_1 − 1) + A_j(t_1, t_2)) = Δ^k_j(t′_1, t_2) ≤ Δ^k_j(t_2). □
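Definition 10 translates directly into a computation over an execution log. A minimal sketch (our own helper, with `arrivals` as (ta, dest, plane) records of an execution and rate_ratio = R/r):

```python
def imbalance_by(arrivals, k, j, rate_ratio, t2):
    """Imbalance by time-slot t2 (Definition 10, item 2): the maximum over
    t1 <= t2 of A^k_j(t1, t2) - (r/R) * A_j(t1, t2)."""
    def A(t1, plane=None):
        # Cells destined for j arriving in [t1, t2] (through `plane`, if given).
        return sum(1 for (ta, dest, p) in arrivals
                   if t1 <= ta <= t2 and dest == j
                   and (plane is None or p == plane))
    return max(A(t1, plane=k) - A(t1) / rate_ratio for t1 in range(t2 + 1))
```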

Let Q_j(t) be the length of the j-th queue in the shadow switch after time-slot t; similarly, let Q^k_j(t) be the length of the j-th queue of plane k of the PPS after time-slot t. Let L^k_j(t_1, t_2) be the number of cells destined for output-port j that leave plane k during time-interval [t_1, t_2]. Note that Q^k_j(t) = A^k_j(0, t) − L^k_j(0, t).

Time-slot t_1 is the beginning of a (k, j) busy period for time-slot t_2 ≥ t_1 if it is the last time-slot before t_2 such that Q^k_j(t_1 − 1) = 0. Note that this notion is well-defined, because at time-slot 0 all the queues are empty. Since Q^k_j(t_1) > Q^k_j(t_1 − 1), a cell c arrives at the switch at time-slot t_1, and therefore exactly one cell destined for j leaves plane k in time-interval (t_1, t_1 + R/r]. This is either cell c itself, or another cell that prohibits c from using the link, and is therefore sent at most R/r time-slots after time-slot t_1. Since the queue is not empty until time-slot t_2, one cell is sent to j exactly every R/r time-slots after the first cell. This implies that the number of cells sent from plane k satisfies L^k_j(t_1, t_2) ≥ ⌊((t_2 − t_1) + 1)/(R/r)⌋.

Remark 2 Khotimsky and Krishnan [86] defined busy periods only with respect to an output-port j. This points to a flaw in their proof, which ignores situations in which the optimal shadow switch is busy sending cells to output-port j while a specific plane of the PPS is idle part of the time [85]. These situations are the main source of complication in our proof.

The following lemma bounds how badly a plane can perform relative to the shadow switch, by comparing their busy periods:

Lemma 8 If Q_j(t − 1) = 0 then for every plane k and for every δ ∈ {0, ..., ⌈Δ^k_j(t − 1)·(R/r)⌉}, L^k_j(0, (t − 1) + δ) ≥ A^k_j(0, t − 1) − ⌈(Δ^k_j(t − 1)·(R/r) − δ)/(R/r)⌉.

Proof. If there is a time-slot t′ ∈ [t − 1, (t − 1) + δ] such that Q^k_j(t′) = 0, then by time-slot t′ no cells destined for j are waiting in plane k. That is, L^k_j(0, (t − 1) + δ) ≥ L^k_j(0, t′) = A^k_j(0, t′) ≥ A^k_j(0, t − 1), and the lemma follows.

Otherwise, let t_2 be the beginning of a (k, j) busy period for time-slot (t − 1) + δ. During time-interval [t_2, (t − 1) + δ], plane k sends a cell to output-port j every R/r time-slots; therefore:

L^k_j(t_2, (t − 1) + δ) ≥ ⌊(t + δ − t_2)/(R/r)⌋    (4.1)

[Figure 9: The number of cells arriving until time-slot t − 1 and still queued in plane k by time-slot τ: the curves A^k_j(0, min{τ, t − 1}) and L^k_j(0, τ), for τ (in time-slots) ranging over [t_2, t + Δ^k_j(t − 1)·(R/r)].]

On the other hand, Q_j(t − 1) = 0 implies that for every time-slot t_3 ≤ t − 1, A_j(t_3, t − 1) ≤ t − t_3 (otherwise, the j-th buffer of the shadow switch is not empty after time-slot t − 1). In particular:

A_j(t_2, t − 1) ≤ t − t_2    (4.2)

Using these inequalities, we bound L^k_j(0, (t − 1) + δ):

L^k_j(0, (t − 1) + δ)
 = L^k_j(0, t_2 − 1) + L^k_j(t_2, (t − 1) + δ)
 ≥ L^k_j(0, t_2 − 1) + ⌊(t + δ − t_2)/(R/r)⌋    by (4.1)
 = A^k_j(0, t_2 − 1) + ⌊(t + δ − t_2)/(R/r)⌋    since Q^k_j(t_2 − 1) = 0
 ≥ A^k_j(0, t_2 − 1) + ⌊(δ + A_j(t_2, t − 1))/(R/r)⌋    by (4.2)
 = A^k_j(0, t_2 − 1) + A^k_j(t_2, t − 1) − ⌈(Δ^k_j(t_2, t − 1)·(R/r) − δ)/(R/r)⌉    by Definition 10
 ≥ A^k_j(0, t − 1) − ⌈(Δ^k_j(t − 1)·(R/r) − δ)/(R/r)⌉. □

By substituting δ = 0 in Lemma 8, we get the following corollary, demonstrating the relation between the imbalance and the queue size at the beginning of a busy period:

Corollary 9 If Q_j(t − 1) = 0 then for every plane k, Q^k_j(t − 1) ≤ max{0, ⌈Δ^k_j(t − 1)⌉}.

We complete the proof by bounding the lag between the time a cell leaves the plane it is sent through and the time it should leave the shadow switch:

Theorem 10 The maximum relative queuing delay of cells destined for output-port j and sent through plane k is bounded by max{0, (R/r)·(Δ^k_j + 1) + B_j}, where B_j is the maximum number of cells destined for output-port j that arrive at the switch in the same time-slot.

Proof. By Lemma 7, it suffices to bound tp(c) − tl_S(c) for every cell c. Since tp(c) − tl_S(c) = (tp(c) − ta(c)) − (tl_S(c) − ta(c)), it suffices to bound the difference between the time a cell spends in the plane, tp(c) − ta(c), and the time it spends in the shadow switch, tl_S(c) − ta(c). Since both switches operate under a FCFS policy, these values depend solely on the corresponding queue lengths when cell c arrives.

Let t_1 be the earliest time-slot such that the buffer of output-port j in the shadow switch is never empty during time-interval [t_1, ta(c)]; if no such time-slot exists, let t_1 = ta(c).

First, we bound tl_S(c) − ta(c) from below. The buffer in the shadow switch is empty at time-slot t_1 − 1, and then the switch is continuously busy during time-interval [t_1, ta(c) − 1], transmitting exactly one cell to output-port j at each time-slot. This implies that Q_j(ta(c) − 1) = A_j(t_1, ta(c) − 1) − (ta(c) − t_1). All the cells in the queue should leave the switch after time-slot ta(c) and before tl_S(c); therefore:

tl_S(c) − ta(c) > A_j(t_1, ta(c) − 1) − (ta(c) − t_1)

Since A_j(ta(c), ta(c)) ≤ B_j, and tl_S(c) − ta(c) is an integer, it follows that:

tl_S(c) − ta(c) ≥ A_j(t_1, ta(c)) − B_j + t_1 − ta(c) + 1    (4.3)

Recall that, by Corollary 9, Q^k_j(t_1 − 1) ≤ max{0, ⌈Δ^k_j(t_1 − 1)⌉}. There are two cases to consider, depending on whether all the cells that were queued in plane k at time-slot t_1 left the plane before the arrival of cell c (see Figure 10):

Case 1: ta(c) ≤ t_1 + Δ^k_j(t_1 − 1)·(R/r). Since plane k is FCFS and work-conserving, it transfers every cell in its queue in exactly R/r time-slots, except cell c, which is considered as transferred in the first time-slot of its transmission:

[Figure 10: Illustration of the two cases in the proof of Theorem 10: in Case 1, cell $c$ arrives before time-slot $t_1 + \lceil\Delta_j^k(t_1-1)\rceil r$, while in Case 2 it arrives after that time-slot.]

Case 1: $ta(c) \le t_1 + \lceil\Delta_j^k(t_1-1)\rceil r$. Since plane $k$ is FCFS and work-conserving, it transfers every cell in its queue in exactly $r$ time-slots, except cell $c$, which is considered as transferred in the first time-slot of its transmission:
$$
\begin{aligned}
tp(c) - ta(c) &\le r\,Q_j^k(ta(c)) + 1 &&\\
&= r\left(A_j^k(0,ta(c)) - L_j^k(0,ta(c))\right) + 1 && \text{by the definition of } Q_j^k(ta(c))\\
&\le r\left(A_j^k(0,ta(c)) - A_j^k(0,t_1-1) + \left\lceil \Delta_j^k(t_1-1) + \tfrac{t_1-ta(c)}{r}\right\rceil\right) + 1 && \text{by Lemma 8}\\
&&& \text{(since } ta(c) \in [t_1, t_1+\lceil\Delta_j^k(t_1-1)\rceil r] \text{ and } L_j^k(0,ta(c)) \ge L_j^k(0,ta(c)-1))\\
&\le r\left(A_j^k(0,ta(c)) - A_j^k(0,t_1-1)\right) + r\,\Delta_j^k(t_1-1) + t_1 - ta(c) + r + 1 &&\\
&= r\,A_j^k(t_1,ta(c)) + r\,\Delta_j^k(t_1-1) + t_1 - ta(c) + r + 1 &&\\
&= A_j(t_1,ta(c)) + r\,\Delta_j^k(t_1,ta(c)) + r\,\Delta_j^k(t_1-1) + t_1 - ta(c) + r + 1 && \text{by Definition 10}\\
&\le A_j(t_1,ta(c)) + r\left(\Delta_j^k(ta(c)) + 1\right) + t_1 - ta(c) + 1 && \text{by Property 1} \qquad (4.4)
\end{aligned}
$$

By (4.4) and (4.3), $tp(c) - tl_S(c) \le r(\Delta_j^k(ta(c)) + 1) + B_j$.

Case 2: $ta(c) > t_1 + \lceil\Delta_j^k(t_1-1)\rceil r$. If $Q_j^k(ta(c)) = 0$ then cell $c$ is immediately delivered to the output-port, i.e., $tp(c) = ta(c)+1 \le tl_S(c)$, and the claim holds since $tp(c) - tl_S(c) \le 0$. If $Q_j^k(ta(c)) > 0$, let $t_2$ be the beginning of a $(k,j)$ busy period for $ta(c)$.

Note that by the choice of $t_2$, $L_j^k(t_2,ta(c)) \ge \left\lfloor \frac{ta(c)-t_2+1}{r}\right\rfloor$. Hence, we have:
$$
\begin{aligned}
tp(c) - ta(c) &\le r\,Q_j^k(ta(c)) + 1 = r\left(A_j^k(t_2,ta(c)) - L_j^k(t_2,ta(c))\right) + 1 && \text{since } Q_j^k(t_2-1) = 0\\
&\le r\,A_j^k(t_2,ta(c)) - r\left\lfloor \tfrac{ta(c)-(t_2-1)}{r}\right\rfloor + 1 && \text{since plane } k \text{ is continuously busy}\\
&\le A_j(t_2,ta(c)) + r\,\Delta_j^k(t_2,ta(c)) - r\left\lfloor \tfrac{ta(c)-(t_2-1)}{r}\right\rfloor + 1 && \text{by Definition 10}\\
&\le A_j(t_1,ta(c)) + r\left(\Delta_j^k(t_2,ta(c))+1\right) + t_1 - ta(c) + (t_2-t_1) - A_j(t_1,t_2-1) + 1 && (4.5)
\end{aligned}
$$

By the choice of $t_1$, the output-buffer of the shadow switch is empty at time-slot $t_1-1$ and not empty during time-interval $[t_1,t_2-1]$. This implies that $(t_2-t_1) \le A_j(t_1,t_2-1)$, and therefore (4.5) implies:
$$tp(c) - ta(c) \le A_j(t_1,ta(c)) + r\left(\Delta_j^k(ta(c))+1\right) + t_1 - ta(c) + 1 \qquad (4.6)$$
By (4.6) and (4.3), $tp(c) - tl_S(c) \le r(\Delta_j^k(ta(c))+1) + B_j$.

4.5 Demultiplexing Algorithms with Optimal RQD

This section presents several demultiplexing algorithms and uses the methodology described in Section 4.4 to bound their relative queuing delay. First, we revisit the fractional traffic dispatch algorithm (FTD) [70] and prove that its relative queuing delay is $(N+1)r$ time-slots. Then, for a PPS with speedup $S > 2$, we introduce a variant of the FTD algorithm that is $\frac{2N}{S}$-partitioned; its relative queuing delay is at most $\left(\frac{2N}{S}+1\right)r + N\left(1-\frac{2}{S}\right)$ time-slots, matching the lower bound for fully-distributed demultiplexing algorithms (Theorem 4). Finally, we present novel 1-RT and u-RT demultiplexing algorithms with relative queuing delay $3N + r$ time-slots (Sections 4.5.2 and 4.5.3). Both algorithms have optimal relative queuing delay for a PPS with constant speedup.

4.5.1 Optimal Fully-Distributed Demultiplexing Algorithms

Iyer and McKeown [70] presented the best-known example of a fully-distributed demultiplexing algorithm. In this algorithm, a window of size $r$ time-slots slides over the sequence of cells of each flow $(i,j)$. The algorithm maintains a window constraint, which ensures that no two cells in the same window are sent through the same plane. An equivalent variant of the algorithm, called the fractional traffic dispatch algorithm (FTD), statically divides each flow into blocks of size $r$ [70, 86]. The demultiplexing algorithm chooses the plane through which a cell is sent arbitrarily from the set of planes that violate neither the window constraint nor the input constraint described in Section 4.2. A speedup of $S \ge 2$ suffices for the algorithm to work correctly [70].

A simple application of Theorem 10 and the fact that $B_j \le N$ shows:

Theorem 11 $R_{avg}(FTD) \le R_{\max}(FTD) \le (N+1)r$.

Proof. Let $A_{ij}(t_1,t_2)$ be the number of cells in flow $(i,j)$ that arrive at the switch during time-interval $[t_1,t_2]$, and $A_{ij}^k(t_1,t_2)$ be the number of these cells that are sent through plane $k$.
$$
\begin{aligned}
\Delta_j^k(t_1,t_2) &= A_j^k(t_1,t_2) - \frac{A_j(t_1,t_2)}{r} && \text{by Definition 10}\\
&= \sum_{i=1}^N A_{ij}^k(t_1,t_2) - \frac{A_j(t_1,t_2)}{r} &&\\
&\le \sum_{i=1}^N \left\lceil \frac{A_{ij}(t_1,t_2)}{r}\right\rceil - \frac{A_j(t_1,t_2)}{r} && \text{due to the window constraint}\\
&\le \sum_{i=1}^N \frac{A_{ij}(t_1,t_2) + r - 1}{r} - \frac{A_j(t_1,t_2)}{r} && \text{since } A_{ij} \text{ and } r \text{ are integers}\\
&= N\,\frac{r-1}{r} &&
\end{aligned}
$$
By Theorem 10, $R_{\max}(FTD) \le (N+1)r$, since $B_j \le N$.

For a PPS with speedup $S > 2$, a $\frac{2N}{S}$-partitioned variant of FTD can yield a better relative queuing delay, matching the lower bounds described in Theorems 4 and 5. In this algorithm, denoted PART-FTD (see pseudo-code in Algorithm 1), demultiplexor $i$ uses only planes $2r\left(\left\lceil \frac{i}{2N/S}\right\rceil - 1\right)+1, \ldots, 2r\left\lceil \frac{i}{2N/S}\right\rceil$. This implies that each demultiplexor uses exactly $2r$ planes, as required for the correctness of FTD, but each plane is used by at most $\frac{2N}{S}$ demultiplexors.

Algorithm 1 Partitioned Fractional Traffic Dispatch (PART-FTD) Algorithm
Local variables at demultiplexor $i$:
  M[N][r]: matrix of values in $\{1,\ldots,2r,\perp\}$, initially all $\perp$
  R[r]: vector of values in $\{1,\ldots,2r,\perp\}$, initially all $\perp$
  S[N+1]: vector of values in $\{0,\ldots,r-1\}$, initially all 0
1: int procedure DISPATCH(cell c)  -- at demultiplexor $i$
2:   $j \leftarrow dest(c)$
3:   $D \leftarrow \{k \in \{1,\ldots,2r\} \mid \exists a \in \{0,\ldots,r-1\},\, M[j-1][a] = k\}$  -- planes that violate the window constraint
4:   $E \leftarrow \{k \in \{1,\ldots,2r\} \mid \exists a \in \{0,\ldots,r-1\},\, R[a] = k\}$  -- planes that violate the input constraint
5:   choose $k \in \{1,\ldots,2r\} \setminus (D \cup E)$
6:   $M[j-1][S[j-1]] \leftarrow k$  -- update for future window-constraint calculations
7:   $R[S[N]] \leftarrow k$  -- update for future input-constraint calculations
8:   $S[j-1] \leftarrow (S[j-1]+1) \bmod r$  -- update pointer for cyclic use of the vector
9:   $S[N] \leftarrow (S[N]+1) \bmod r$  -- update pointer for cyclic use of the vector
10:  return $k + 2r\left(\left\lceil \frac{i}{2N/S}\right\rceil - 1\right)$
11: end procedure

Theorem 12 $R_{avg}(\text{PART-FTD}) \le R_{\max}(\text{PART-FTD}) \le \left(\frac{2N}{S}+1\right)r + N\left(1-\frac{2}{S}\right)$.

Proof. We use the same calculations as in the proof of Theorem 11. The only difference is that
$$\Delta_j^k(t_1,t_2) = A_j^k(t_1,t_2) - \frac{A_j(t_1,t_2)}{r} \le \sum_{i=1}^N \left\lceil \frac{A_{ij}(t_1,t_2)}{r}\right\rceil - \frac{A_j(t_1,t_2)}{r} \le \frac{2N}{S}\cdot\frac{r-1}{r},$$
since at most $\frac{2N}{S}$ demultiplexors can send cells through plane $k$. Therefore, by Theorem 10, $R_{\max}(\text{PART-FTD}) \le \left(\frac{2N}{S}+1\right)r + N\left(1-\frac{2}{S}\right)$.
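To make the pseudo-code concrete, the following is a minimal runnable sketch (not the thesis code) of PART-FTD's DISPATCH procedure; the class name, the 1-indexing of demultiplexors and the toy parameters are our own assumptions, and the choice in Line 5 is resolved by taking the smallest feasible plane.

```python
# A minimal, runnable sketch (not the thesis code) of the PART-FTD dispatch of
# Algorithm 1. Names follow the pseudo-code; r, S and N are assumed to satisfy
# the algorithm's requirements (S > 2, with 2N/S dividing N, so that each group
# of 2N/S demultiplexors owns a private block of 2r planes).

class PartFtdDemux:
    def __init__(self, i, N, r, S):
        self.i, self.N, self.r = i, N, r
        self.group = (i - 1) // (2 * N // S)       # demultiplexors are 1-indexed
        self.M = [[None] * r for _ in range(N)]    # window-constraint history per output
        self.R = [None] * r                        # input-constraint history
        self.S_ptr = [0] * (N + 1)                 # cyclic pointers (S[] in the pseudo-code)

    def dispatch(self, j):
        """Dispatch a cell destined for output-port j; return the global plane index."""
        D = set(self.M[j - 1])                     # planes violating the window constraint
        E = set(self.R)                            # planes violating the input constraint
        k = min(k for k in range(1, 2 * self.r + 1) if k not in D and k not in E)
        self.M[j - 1][self.S_ptr[j - 1]] = k
        self.R[self.S_ptr[self.N]] = k
        self.S_ptr[j - 1] = (self.S_ptr[j - 1] + 1) % self.r
        self.S_ptr[self.N] = (self.S_ptr[self.N] + 1) % self.r
        return k + 2 * self.r * self.group         # offset into this group's plane block

# Example: N = 4 ports, r = 2, S = 4, so each pair of demultiplexors owns 4 planes.
d = PartFtdDemux(i=3, N=4, r=2, S=4)
print([d.dispatch(j=1) for _ in range(4)])         # [5, 6, 7, 5]: no plane repeats inside a block of r cells
```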

4.5.2 Optimal 1-RT Demultiplexing Algorithm

We describe a 1-RT demultiplexing algorithm that matches the lower bound presented in Theorem 8. Informally, Algorithm 2 divides the set of planes into two equal-size sets, $V_0$ and $V_1$, and divides its operation with respect to cells destined for a specific output-port into two phases. In each phase, the algorithm sends cells destined for that output-port through a different set of planes (i.e., $V_0$ or $V_1$). After every time-slot, each input-port collects global information about the switch and uses it to calculate the imbalance for each plane $k$ and each output-port $j$. In the next time-slot, each input-port sends a cell to output-port $j$ only through planes with low (or zero) imbalance. Intuitively, phase $i$ ends when there are no balanced planes left in $V_i$ to use; in the next phase, the demultiplexors use the planes of the set $V_{1-i}$.

To avoid situations in which all the input-ports send cells through the same plane, we divide the input-ports into $\frac{N}{r}$ sets of size $r$, and ensure that under no circumstances do two input-ports in the same set send cells destined for the same output-port through the same plane. This is done by calculating the actions of the other input-ports in the same set as if they indeed received a cell destined for the same output-port.

With respect to each output-port $j$, planes are divided into three levels according to their imbalance (see Definition 10): balanced planes, with imbalance $\Delta_j^k(t) \le 0$; slightly imbalanced planes, whose imbalance satisfies $0 < \Delta_j^k(t) \le \frac{N}{r}$; and extremely imbalanced planes, with imbalance $\Delta_j^k(t) > \frac{N}{r}$. At the beginning of each time-slot, a set of eligible planes, denoted $F[j]$, is calculated for every destination $j$: a plane is eligible for output-port $j$ if it is balanced with respect to output-port $j$, or if it was never extremely imbalanced with respect to output-port $j$ since the last phase change. Phase $i$ is changed to phase $1-i$ when all planes $k \in V_{1-i}$ become balanced (the set $Q[j]$ maintains the planes of $V_{1-i}$ that are still imbalanced; the phase changes when $Q[j] = \emptyset$).

Example 1 Suppose that at time-slot $t = 0$, phase[0] changed from 1 to 0, $\Delta_0^9(0) = 6.5$, and all other planes in $V_1$ have imbalance at most 6.5. In addition, we assume that planes 1 and 2 did not receive any cells before time-slot 0.

The demultiplexors are divided into 4 sets: $\{0,1\}$, $\{2,3\}$, $\{4,5\}$, $\{6,7\}$.

Upon receiving a cell, each demultiplexor calculates the behavior of all demultiplexors in its set that have a smaller index, and ensures that it will not send the cell through the same plane as them. Table 4.2 shows the plane number through which each demultiplexor would have sent a cell destined for output-port 0, had such a cell arrived at the switch. The actual arrivals are marked in framed boxes and are taken into account in the following time-slots.

At time-slot 1, demultiplexor 0 will send a cell through the first plane in $V_0$ (that is, plane 1). On the other hand, demultiplexor 1 must avoid sending its cell through plane 1, and therefore it will use plane 2. Similarly, demultiplexors 2, 4 and 6 will use plane 1, and demultiplexors 3, 5 and 7 will use plane 2.

Algorithm 2 1-RT Algorithm
Constants: $V_0 = \{1,\ldots,\frac{K}{2}\}$; $V_1 = \{\frac{K}{2}+1,\ldots,K\}$
Shared:
  F[N]: N sets of planes, initially all $V_0$  -- cells for $j$ can be sent only through $F[j]$
  R[N][r]: matrix of values in $\{1,\ldots,K,\perp\}$, initially all $\perp$  -- holding input constraints
  t: value in $\{0,\ldots,r-1\}$, initially 0  -- cyclic pointer to matrix R
  Q[N], L[N]: N sets of planes, initially all $\emptyset$
  M[N]: N sets of planes, initially all $\{1,\ldots,K\}$
  phase[N]: vector of values in $\{0,1\}$, initially all 0

1: void procedure ADVANCE-CLOCK()  -- invoked at the beginning of each time-slot
2:   for every $j \in \{1,\ldots,N\}$: CALCULATE(j)
3:   for every $j \in \{1,\ldots,N\}$: $F[j] \leftarrow$ UPDATE(j)
4:   update the matrix R[N][r] according to global information
5:   $t \leftarrow (t+1) \bmod r$
6: end procedure

1: int procedure DISPATCH(cell c)  -- at demultiplexor $i$
2:   $j \leftarrow dest(c)$
3:   $p \leftarrow r\lfloor i/r\rfloor$  -- the smallest index in $i$'s set
4:   set $B \leftarrow \emptyset$
5:   for $x \leftarrow p$ to $i$ do
6:     $E \leftarrow \{k \in \{1,\ldots,K\} \mid \exists a \in \{0,\ldots,r-1\},\, R[x][a] = k\}$
7:     $k \leftarrow \min\left(F[j] \setminus (B \cup E)\right)$
8:     $B \leftarrow B \cup \{k\}$
9:   end for
10:  $R[i][t] \leftarrow k$  -- can be read by other input-ports only in the next time-slot
11:  return $k$
12: end procedure

1: set procedure UPDATE(int j)
2:   set $S \leftarrow F[j]$
3:   $Q[j] \leftarrow Q[j] \setminus M[j]$
4:   if $Q[j] = \emptyset$ then  -- change phase
5:     $Q[j] \leftarrow \{1,\ldots,K\} \setminus M[j]$
6:     $phase[j] \leftarrow 1 - phase[j]$
7:     $S \leftarrow V_{phase[j]}$
8:   else
9:     $S \leftarrow S \setminus L[j]$
10:  end if
11:  return $S$
12: end procedure

1: void procedure CALCULATE(int j)
2:   set $A \leftarrow \{k \mid \Delta_j^k(t) > \frac{N}{r}\}$  -- using global information
3:   $M[j] \leftarrow \{k \mid \Delta_j^k(t) \le 0\}$  -- using global information
4:   $L[j] \leftarrow (L[j] \cup A) \setminus M[j]$
5: end procedure
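The plane-classification logic of procedures CALCULATE and UPDATE can be sketched as follows. This is a minimal runnable illustration (not the thesis code), with our own function names and a toy configuration, assuming the imbalance thresholds described above (balanced $\le 0$, extremely imbalanced $> N/r$).

```python
# A runnable sketch (not the thesis code) of the per-output bookkeeping of
# Algorithm 2: classifying planes by imbalance (procedure CALCULATE) and
# recomputing the eligible set F[j] at a phase change (procedure UPDATE).
# `delta` maps plane -> current imbalance Delta_j^k(t) for a fixed output j.

def calculate(delta, L, N, r):
    """Update the 'was extremely imbalanced' set L for one output-port."""
    A = {k for k, d in delta.items() if d > N / r}   # extremely imbalanced now
    M = {k for k, d in delta.items() if d <= 0}      # balanced now
    L = (L | A) - M
    return L, M

def update(F, Q, L, M, phase, V):
    """Recompute the eligible planes F; flip the phase when Q empties."""
    Q = Q - M
    if not Q:                                        # all of the other set is balanced
        all_planes = V[0] | V[1]
        Q = all_planes - M
        phase = 1 - phase
        F = set(V[phase])
    else:
        F = F - L
    return F, Q, phase

# Toy run: K = 4 planes, V0 = {1, 2}, V1 = {3, 4}, N = 4, r = 2 (so N/r = 2).
V = ({1, 2}, {3, 4})
delta = {1: 3.0, 2: 0.5, 3: -1.0, 4: 0.0}            # plane 1 is extremely imbalanced
L, M = calculate(delta, set(), N=4, r=2)
F, Q, phase = update({1, 2}, {3, 4}, L, M, 0, V)
print(L, F, Q, phase)                                 # L={1}; phase flips: F={3,4}, Q={1,2}
```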

[Table 4.2: Illustration of Example 1. For each time-slot, the table lists the plane number through which each demultiplexor (0 through 7) would have sent a cell destined for output-port 0, had such a cell arrived at the switch, together with the imbalances $\Delta_0^1(t)$ and $\Delta_0^2(t)$. Actual arrivals are marked in framed boxes. No cells destined for other output-ports arrive in this time interval.]

At time-slot 2, demultiplexor 0 cannot use plane 1 due to the input constraint. Therefore, it will use plane 2, and demultiplexor 1 will use plane 1. Plane 1 becomes extremely imbalanced after time-slot 2, and therefore it is not eligible to receive cells for output-port 0 in the following time-slots. Although plane 1 becomes only slightly imbalanced after time-slot 3, Algorithm 2 dictates that it is still not eligible for output-port 0, since the phase has not changed yet.

The phase changes after time-slot 5, because for every plane $k \in V_1$, $\Delta_0^k(5) \le 0$. This implies that planes from the set $V_1$ are used for sending cells destined for output-port 0 in the following time-slots. At this time, $Q[0] = \{1,2\}$, since $\Delta_0^1(5) = 2.5$ and $\Delta_0^2(5) = 0.5$. The phase changes again after time-slot 7, since both $\Delta_0^1(7)$ and $\Delta_0^2(7)$ are not positive.

To prove the correctness of Algorithm 2, we start with two lemmas. The first lemma shows that the imbalance between each plane and each output-port is bounded.

Lemma 9 In Algorithm 2, for every plane $k \in V_0 \cup V_1$ and output-port $j$, $\Delta_j^k < \frac{2N}{r}$.

Proof. Clearly, if $\Delta_j^k(t_3) > \frac{N}{r}$ then $k \in L[j]$ at the beginning of time-slot $t_3+1$ (procedure CALCULATE, Line 4). Therefore, $k \notin F[j]$ at the beginning of time-slot $t_3+1$ (procedure UPDATE, Line 9), and cells are not sent through plane $k$ until a time-slot $t_3' > t_3+1$ in which $\Delta_j^k(t_3'-1) \le 0$.

This observation holds also if the phase changes at the beginning of time-slot $t_3+1$, since $Q[j] = \emptyset$ at Line 4 yields that $V_{phase} \subseteq M[j]$ at Line 7.

For every two input-ports $i_1$ and $i_2$, if $\lfloor i_1/r\rfloor = \lfloor i_2/r\rfloor$, then $i_1$ and $i_2$ do not send cells destined for the same output-port through the same plane in the same time-slot (procedure DISPATCH). This implies that the maximum number of cells destined for the same output-port and sent through the same plane in a single time-slot is $\frac{N}{r}$.

By Definition 10, if a plane does not receive cells destined for output-port $j$ in time-slot $t_1$, then $\Delta_j^k(t_1) \le \Delta_j^k(t_1-1)$. This implies that there is a time-slot $t_1$ in which plane $k$ receives cells destined for $j$ and $\Delta_j^k(t_1) = \Delta_j^k$. In the worst case, $\Delta_j^k(t_1-1) = \frac{N}{r}$ and $k$ receives $\frac{N}{r}$ cells destined for $j$.

Assume towards a contradiction that $\Delta_j^k(t_1) \ge \frac{2N}{r}$. Then there is a time-slot $t_2$ such that $\Delta_j^k(t_2,t_1) \ge \frac{2N}{r}$. Note that $\Delta_j^k(t_1,t_1) < \frac{N}{r}$, since $A_j^k(t_1,t_1) \le \frac{N}{r}$ and $A_j(t_1,t_1) \ge A_j^k(t_1,t_1)$. This implies that $t_2 < t_1$, and therefore, by Definition 10:
$$
\begin{aligned}
\Delta_j^k(t_2,t_1) &= A_j^k(t_2,t_1) - \frac{1}{r}A_j(t_2,t_1)\\
&= A_j^k(t_2,t_1-1) + A_j^k(t_1,t_1) - \frac{1}{r}A_j(t_2,t_1-1) - \frac{1}{r}A_j(t_1,t_1)\\
&= \Delta_j^k(t_2,t_1-1) + \Delta_j^k(t_1,t_1) < \frac{2N}{r},
\end{aligned}
$$
where the last inequality holds since $\Delta_j^k(t_2,t_1-1) \le \Delta_j^k(t_1-1) \le \frac{N}{r}$. This contradicts the choice of $t_2$, and the claim follows.

The second property is a simple conclusion from Lemma 9:

Lemma 10 If $2N$ cells destined for output-port $j$ arrive at a PPS operating under Algorithm 2 during time-interval $[t_1,t_2]$, and none of them is sent through plane $k$, then $\Delta_j^k(t_2) \le 0$.

Proof. By Definition 10, there is a time-slot $t_3$ such that $\Delta_j^k(t_2) = \Delta_j^k(t_3,t_2)$. If $t_3 \ge t_1$, then $\Delta_j^k(t_3,t_2) \le 0$, since $A_j^k(t_3,t_2) = 0$. Otherwise,
$$\Delta_j^k(t_3,t_2) = \Delta_j^k(t_3,t_1-1) + \Delta_j^k(t_1,t_2) = \Delta_j^k(t_3,t_1-1) + A_j^k(t_1,t_2) - \frac{1}{r}A_j(t_1,t_2)$$

By Lemma 9 and Definition 10, $\Delta_j^k(t_3,t_1-1) \le \Delta_j^k \le \frac{2N}{r}$. Since $A_j^k(t_1,t_2) = 0$ and $A_j(t_1,t_2) \ge 2N$, it follows that $\Delta_j^k(t_3,t_2) \le 0$ in this case as well.

The final theorem shows that a speedup of 8 suffices for this demultiplexing algorithm to achieve optimal relative queuing delay. Note that such a high speedup is considered impractical for real switches; yet, Algorithm 2 demonstrates that the lower bound presented in Theorem 8 is tight for $u = 1$.

Theorem 13 Speedup $S = 8$ suffices for Algorithm 2 to work correctly, with maximum relative queuing delay of $3N + r$ time-slots.

Proof. It suffices to show that every time Line 7 of procedure DISPATCH is executed, $F[j] \setminus (B \cup E) \ne \emptyset$, and a plane can be chosen. Clearly, at each step $|B| \le r$ and $|E| < r$; therefore the claim follows if $|F[j]| > 2r$. Since $F[j]$ is changed only by procedure UPDATE(j), it suffices to show that $|F[j]| > 2r$ after any execution of UPDATE(j).

Assume, without loss of generality, that $phase = 0$ after an execution of procedure UPDATE(j) at time-slot $t_1$, and assume, by way of contradiction, that $|F[j]| \le 2r$ at time-slot $t_1$. From Line 7 and the fact that $|V_0| = |V_1| = \frac{K}{2} = \frac{Sr}{2} = 4r > 2r$, it follows that $phase = 0$ also after time-slot $t_1-1$. This implies that $|V_0 \cap L[j]| \ge 2r$.

Denote by $t_2$ the last time-slot in which $phase$ was changed from 1 to 0 ($t_2 = 0$ if no such time-slot exists). At time-slot $t_2$, when executing Line 4, $Q[j]$ is empty, and therefore all planes $k \in V_0$ are in $M[j]$ at time-slot $t_2$. This implies that for every $k \in V_0$, $\Delta_j^k(t_2) \le 0$.

Let $k$ be a plane in $V_0 \cap L[j]$. By the definition of $L[j]$, there is a time-slot $t_3 \in [t_2,t_1]$ such that $\Delta_j^k(t_3) > \frac{N}{r}$. Let $t_4$ be the last time-slot such that $\Delta_j^k(t_3) = \Delta_j^k(t_4,t_3)$. If $t_4 < t_2$ then
$$\Delta_j^k(t_4,t_3) = \Delta_j^k(t_4,t_2) + \Delta_j^k(t_2+1,t_3) \le \Delta_j^k(t_2) + \Delta_j^k(t_2+1,t_3) \le \Delta_j^k(t_2+1,t_3),$$
and therefore $t_4$ is not maximal. Hence $t_4 \ge t_2$ and $[t_4,t_3] \subseteq [t_2,t_1]$. Since $\Delta_j^k(t_4,t_3) = A_j^k(t_4,t_3) - \frac{1}{r}A_j(t_4,t_3) > \frac{N}{r}$ and $A_j(t_4,t_3) \ge A_j^k(t_4,t_3)$, it follows that $A_j^k(t_4,t_3) > \frac{N}{r-1}$.

Because $|V_0 \cap L[j]| \ge 2r$, the number of cells arriving at the switch and destined for $j$ during time-interval $[t_2,t_1]$ is at least $2r \cdot \frac{N}{r-1} > 2N$. Since during time-interval $[t_2,t_1]$ no cells are sent to any plane in $V_1$, Lemma 10 implies that every plane $k \in V_1$ has $\Delta_j^k(t_1) \le 0$, and in particular all planes in $Q[j]$. This yields that $Q[j]$ becomes empty, and the phase changes at least once during time-interval $[t_2,t_1]$, which contradicts the choice of $t_1$ and $t_2$.

By Lemma 9 and Theorem 10, the relative queuing delay of the algorithm is at most $3N + r$.

4.5.3 Optimal u-RT Demultiplexing Algorithm

Algorithm 2 can be used as a building block for u-RT algorithms with $u > 1$. Algorithm 3 runs $u$ instances of Algorithm 2 in a round-robin manner, such that in each time-slot only one instance is active (that is, the $i$-th instance is active in time-slots $i$, $i+u$, $i+2u$, $i+3u$, etc.). Since there are $u$ time-slots between two consecutive times in which the same instance is active, global information from the previous time the instance was active can be shared among the demultiplexors. In addition, each instance of Algorithm 2 has its own set of $8r$ planes, hence Algorithm 3 needs speedup $S = 8u$.

We next bound the imbalance under Algorithm 3:

Lemma 11 In Algorithm 3, for every plane $k$ and output-port $j$, $\Delta_j^k < \frac{2N}{r}$.

Proof. Assume towards a contradiction that there are a traffic $T$, a plane $k$ and an output-port $j$ such that $\Delta_j^k \ge \frac{2N}{r}$. Let $t$ be the first time-slot in which $\Delta_j^k(t) \ge \frac{2N}{r}$, and let $x = t \bmod u$. The choice of $t$ and Algorithm 3 imply that a cell is sent through plane $k$ at time-slot $t$ by instance $x$. Let $T'$ be the traffic consisting of the cells of traffic $T$ handled by instance $x$, that is, $T' = \{c \mid c \in T,\ (ta(c) - x) \bmod u = 0\}$. Let $round(c) = \frac{ta(c)-x}{u}$ be the number of times instance $x$ was active until cell $c$ arrived at the switch.

Consider traffic $\tilde T$ in which each cell $c$ of traffic $T'$ arrives at the switch at time-slot $round(c)$, that is, $\tilde T = \{shift(c, round(c) - ta(c)) \mid c \in T'\}$.

Algorithm 3 u-RT Algorithm
Shared:
  ALG[u]: u instances of Algorithm 2  -- each instance with its own planes and shared variables
  x: value in $\{0,\ldots,u-1\}$, initially $u-1$  -- cyclic pointer to array ALG

1: void procedure ADVANCE-CLOCK()  -- invoked at the beginning of each time-slot
2:   $x \leftarrow (x+1) \bmod u$
3:   ALG[x].ADVANCE-CLOCK()  -- invoke procedure ADVANCE-CLOCK on the $x$-th instance
4: end procedure

1: int procedure DISPATCH(cell c)  -- at demultiplexor $i$
2:   return ALG[x].DISPATCH(c)  -- invoke procedure DISPATCH on the $x$-th instance
3: end procedure

Let $\tilde A_j(t_1,t_2)$ be the number of cells in traffic $\tilde T$ destined for output-port $j$ that arrive at the switch during time-interval $[t_1,t_2]$, and $\tilde A_j^k(t_1,t_2)$ be the number of these cells that are sent through plane $k$ by Algorithm 2. Similarly, following Definition 10, $\tilde\Delta_j^k(t_1,t_2) = \tilde A_j^k(t_1,t_2) - \frac{1}{r}\tilde A_j(t_1,t_2)$, and $\tilde\Delta_j^k(t_2) = \max_{t_1 \le t_2}\tilde\Delta_j^k(t_1,t_2)$.

Since only instance $x$ sends cells to plane $k$, and the dispatching decisions of instance $x$ in response to traffic $T$ are the same as the decisions of Algorithm 2 in response to traffic $\tilde T$, it follows that for every time-slot $t' \le t$, $A_j^k(t',t) = \tilde A_j^k\left(\left\lceil\frac{t'-x}{u}\right\rceil, \left\lfloor\frac{t-x}{u}\right\rfloor\right)$. On the other hand, traffic $\tilde T$ contains a subset of $T$'s cells destined for output-port $j$, and therefore $A_j(t',t) \ge \tilde A_j\left(\left\lceil\frac{t'-x}{u}\right\rceil, \left\lfloor\frac{t-x}{u}\right\rfloor\right)$. This implies that $\tilde\Delta_j^k\left(\left\lfloor\frac{t-x}{u}\right\rfloor\right) \ge \Delta_j^k(t) \ge \frac{2N}{r}$, contradicting Lemma 9.

Lemma 11, Theorem 13 and Theorem 10 immediately imply:

Theorem 14 For any $u \ge 1$ and a PPS with speedup $S = 8u$, there is a u-RT demultiplexing algorithm ALG such that $R_{\max}(ALG) \le 3N + r$.

Note that speedup $S = 8u$ is not feasible in real-life switches; therefore, Algorithm 3 has mainly theoretical importance. We leave for further research the question whether there is an optimal u-RT demultiplexing algorithm that requires a speedup that does not depend on $u$.
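The round-robin wrapper itself is simple enough to sketch directly. The following is a minimal runnable illustration (not the thesis code), with a hypothetical stub standing in for Algorithm 2; any object exposing the two procedures would do.

```python
# A minimal sketch (not the thesis code) of the round-robin wrapper of
# Algorithm 3: u independent instances of the 1-RT algorithm, exactly one of
# which is active per time-slot. `Alg2Stub` is a hypothetical stand-in.

class Alg2Stub:
    def __init__(self, instance_id):
        self.instance_id = instance_id
    def advance_clock(self):
        pass                                 # per-instance bookkeeping would go here
    def dispatch(self, cell):
        return (self.instance_id, cell)      # stand-in for the chosen plane

class URTAlgorithm:
    def __init__(self, u, make_instance=Alg2Stub):
        self.u = u
        self.alg = [make_instance(i) for i in range(u)]  # each with its own planes
        self.x = u - 1                                   # cyclic pointer

    def advance_clock(self):
        self.x = (self.x + 1) % self.u       # the next instance becomes active
        self.alg[self.x].advance_clock()

    def dispatch(self, cell):
        return self.alg[self.x].dispatch(cell)

# The i-th instance is active in time-slots i, i+u, i+2u, ...
sched = URTAlgorithm(u=3)
for t in range(6):
    sched.advance_clock()
    print(t, sched.dispatch(cell="c%d" % t))  # instance ids cycle 0, 1, 2, 0, 1, 2
```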

4.6 Extensions of the PPS model

4.6.1 The Relative Queuing Delay of an Input-Buffered PPS

We extend our definitions from the bufferless PPS model to the case in which there are buffers in the input-ports. In an input-buffered PPS, when a cell arrives, the demultiplexor either sends the cell to one of the planes or keeps it in its buffer. In every time-slot, the demultiplexor may send any number of buffered cells to the planes, provided that the rate constraints on the lines between the input-port and the planes are preserved. In this section, we consider only deterministic demultiplexing algorithms (a discussion of extending the bounds to randomized algorithms appears at the end of the section).

We represent the buffer residing at input-port $i$, of finite size $s$, as a vector $b_i \in \{1,\ldots,N,\perp\}^s$. An element of this vector contains the destination of the cell stored at the corresponding place in the buffer; empty places in the buffer are indicated by $\perp$ in the vector. The size of the buffer at input-port $i$ is denoted $|b_i|$. The demultiplexor state machine is changed to include the state of the input-port buffer: $B_i$ denotes the set of reachable states of the buffer residing at input-port $i$, and we refer to the set of states of the $i$-th demultiplexor as $S_i \times B_i$. A switch configuration also includes the contents of the input-buffers.

Definition 11 The demultiplexing algorithm of the demultiplexor residing at input-port $i$, with an input-buffer, is a function
$$ALG_i : \{1,\ldots,N,\perp\} \times S_i \times B_i \to S_i \times \{1,\ldots,K,\perp\}^{|b_i|+1}$$
which gives the next state and a vector of size $|b_i|+1$, according to the destination of the incoming cell ($\perp$ if no cell arrives), the current state and the contents of the buffer.

The vector of size $|b_i|+1$ returned by the function $ALG_i$ states through which plane to send the cell in the corresponding place in the buffer; the last element of the vector refers to the incoming cell, and $\perp$ indicates that the corresponding cell remains in the buffer.

We denote by $to(c,T)$ the time-slot in which cell $c \in T$ is sent from the input-port to one of the planes; since cells can be queued in the input-port, $to(c,T)$ can be larger than $ta(c)$, unlike in the bufferless PPS model.

When measuring the relative queuing delay in an input-buffered PPS, the queuing of cells both in the input-buffers and in the planes' buffers of the PPS should be compared with the queuing of cells in the output-buffers of the shadow switch.

Generally, input-buffers increase the flexibility of the demultiplexing algorithms, which leads to weaker lower bounds. We prove these lower bounds by constructing $(f,s)$ weakly-concentrating executions (recall Definition 5):

Theorem 15 Any deterministic fully-distributed demultiplexing algorithm ALG, with input-buffers of any size, has $R_{\max}(ALG) \ge \frac{N}{S}\left(1-\frac{1}{r}\right) - \varepsilon$ and $J(ALG) \ge \frac{N}{S}\left(1-\frac{1}{r}\right) - \varepsilon$ time-slots, where $\varepsilon$ can be made arbitrarily small.

Proof. Let $C$ be a switch configuration in which all the buffers in the switch are empty, and denote by $t_0$ the time-slot in which the PPS is in configuration $C$. Denote the state of demultiplexor $i$ in configuration $C$ by $(s_i^0, b_i^0)$; clearly, $b_i^0 = \{\perp\}^{|b_i|}$.

For every input-port $i$, consider traffic $T_i = \{c_i\}$, which consists of a single cell $c_i$ with $orig(c_i) = i$, $dest(c_i) = j$ and $ta(c_i) = t_0$. Note that $to(c_i) \le ta(c_i) + N/S$; otherwise $T_i$ already causes a relative queuing delay greater than $N/S$ time-slots. Let $(s_i^f, b_i^f)$ be the demultiplexor state just before this cell is sent. Clearly, $T_i$ has no bursts, since a single cell arrives at the switch.

Since $K < N$, there exist a plane $k$ and a set of $\lceil N/K\rceil$ demultiplexors $I = \{i_1,\ldots,i_{\lceil N/K\rceil}\}$ such that $plane(c_i) = k$ for every $i \in I$. Now consider another traffic $T = T_{i_1} \circ T_{i_2} \circ \cdots \circ T_{i_{\lceil N/K\rceil}}$; that is, traffic $T$ begins in configuration $C$, and in every time-slot $t \in \{t_0,\ldots,t_0+\lceil N/K\rceil - 1\}$ one cell, destined for output-port $j$, arrives at an input-port $i \in I$. Note that in every time-slot at most one cell arrives at the switch, therefore this traffic has no bursts.

Since ALG is a fully-distributed demultiplexing algorithm, and all the buffers are empty in configuration $C$, a demultiplexor $i \in I$ does not change its state until its first cell arrives. Before $T$ and $T_i$ begin, demultiplexor $i$ is in state $(s_i^0,b_i^0)$, and its individual flow under $T$ is exactly the same as under $T_i$ (only one cell destined for output-port $j$ arrives).

Therefore, demultiplexor $i \in I$ changes its state to $(s_i^f,b_i^f)$ and sends its cell to plane $k$, implying that $E_{PPS}(ALG,T)$ is an $(\lceil N/K\rceil, \lceil N/K\rceil)$ weakly-concentrating execution for output-port $j$ and plane $k$.

Applying Lemma 2 with $f = \lceil N/K\rceil$, $s = \lceil N/K\rceil$ and $B = 0$ yields lower bounds of $\frac{N}{K}r - \frac{N}{K} = \frac{N}{S}\left(1-\frac{1}{r}\right)$ on the maximum relative queuing delay and the relative delay jitter.

Unlike fully-distributed demultiplexing algorithms, the size of the input-buffers affects the relative queuing delay of an input-buffered PPS under u-RT demultiplexing algorithms. A PPS that can store $u$ cells in each input-port is able to support a u-RT demultiplexing algorithm that guarantees a relative queuing delay of at most $u$ time-slots, by simulating the CPA algorithm [74]. Note that CPA assumes the PPS is a global FCFS switch, i.e., cells leave an output-port in FCFS order, regardless of the input-ports from which they originated.

Theorem 16 There is a u-RT demultiplexing algorithm for a global FCFS input-buffered PPS, with buffer size at least $u$ and speedup $S \ge 2$, and a relative queuing delay of at most $u$ time-slots.

This algorithm may be impractical; yet, it demonstrates that a lower bound of $\Omega(N)$ time-slots does not hold when the input-buffers are sufficiently large. When the buffers are smaller than $u$, we show that a global FCFS deterministic input-buffered PPS has a relative queuing delay of $\frac{N}{S}\left(1-\frac{1}{r}\right)$ time-slots, under leaky-bucket traffic with burstiness factor $u\left(\frac{N}{K}-1\right)$:

Theorem 17 Any deterministic u-RT demultiplexing algorithm ALG with input-buffers of size smaller than $u$ has $R_{\max}(ALG) \ge \frac{N}{S}\left(1-\frac{1}{r}\right)$ and $J(ALG) \ge \frac{N}{S}\left(1-\frac{1}{r}\right)$ time-slots.

Proof. Let $C$ be the switch configuration at time $t_0$, and assume that at this time all the buffers in the switch are empty. Let $T_i$ be a traffic comprised of cells $c$ with $orig(c) = i$ and $dest(c) = j$, such that one cell arrives at input-port $i$ in each time-slot, until the first cell destined for output-port $j$ is sent to one of the planes. This takes less than $u$ time-slots, because otherwise input-port $i$ queues in its input-buffer more cells than its capacity. Note that $T_i$ is a leaky-bucket traffic with no bursts. Since the PPS is FCFS, and only cells of traffic $T_i$ arrive at the switch, the first cell to leave input-port $i$'s buffer is the first cell of traffic $T_i$; we denote this cell by $c_i$.

Since $K < N$, there exist a plane $k$ and a set of demultiplexors $I \subseteq \{1,\ldots,N\}$ of size $\lceil N/K\rceil$, such that $plane(c_i) = k$ for every $i \in I$. Let $t = \max\{to(c_i,T_i) \mid i \in I\}$. Now compose all traffics $T_i$ for $i \in I$, and append the time-interval $(t, t+u(\lceil N/K\rceil-1)]$ in which no cells arrive at the switch. $T = \circ_{i\in I} T_i$ denotes the composite traffic, starting from configuration $C$ at time $t_0$, which is an $\left(R, u\left(\frac{N}{K}-1\right)\right)$ leaky-bucket traffic.

Every demultiplexor $i \in I$ goes through the same state transitions in response to $T_i$ and $T$, since composing the traffic does not change the switch configurations in time-interval $[0,t_0]$, $to(c_i) < t_0 + u$, and its local information is identical under $T_i$ and $T$. Hence, demultiplexor $i$ sends the cell $c_i$ to plane $k$ at time-slot $to(c_i,T) = to(c_i,T_i) < t_0 + u$.

Notice that under traffic $T$, in time-interval $[t_0,t_0]$ (that is, the first time-slot), $\lceil N/K\rceil$ cells destined for the same output-port arrive at the switch, and all of them are eventually sent through the same plane. Furthermore, the burst of traffic $T$ during this interval is $\lceil N/K\rceil - 1$. Thus, Lemma 2 with $f = \lceil N/K\rceil$, $s = 1$ and $B = \lceil N/K\rceil - 1$ implies that $R_{\max}(ALG) = J(ALG) \ge \frac{N}{K}r - \frac{N}{K} = \frac{N}{S}\left(1-\frac{1}{r}\right)$, as required.

We leave for future research the question whether these lower bounds apply also to the average relative queuing delay and to randomized algorithms. The major difference between these proofs and our other lower-bound proofs (described in Section 4.3) is that they employ executions in which a concentration occurs at the beginning of the traffic rather than at its very end (that is, weakly-concentrating executions); therefore, it is unclear how a proper continuation of this traffic can be devised. Another interesting future research direction is to devise a methodology and algorithms that match the lower bounds for input-buffered PPS.

4.6.2 Recursive Composition of PPS

Another extension of the PPS model is implementing the planes themselves as PPSs (operating at a lower rate). A $(q,K)$-recursive-PPS ($(q,K)$-RPPS) is defined recursively as follows:

Base case: A $(1,\langle k_1\rangle)$-RPPS is a PPS with $k_1$ planes, operating at external rate $R$ and internal rate $r_1$, as described in Section 4.2.

[Figure 11: A $(2,\langle 2,2\rangle)$-RPPS with 5 input-ports and 5 output-ports.]

Recursion step: A $(q+1, K \circ k_{q+1})$-RPPS is a $(q,K)$-RPPS whose planes are replaced with PPS switches, each with $k_{q+1}$ planes, that operate at external rate $r_q$ and internal rate $r_{q+1}$. Note that $r_q > r_{q+1}$. $K \circ k_{q+1}$ denotes the concatenation of the value $k_{q+1}$ after the vector $K$. This composition is illustrated in Figure 11.

When a cell arrives at a $K$-RPPS, it is demultiplexed through a chain of $q$ demultiplexors (where $q$ is the length of the vector $K$) until it is sent to an output-queued switch. It is important to notice that each demultiplexor in this chain handles traffic originating only from a single input-port. The collection of demultiplexors that handle all flows originating from input-port $i$, denoted $G_i$, forms a tree of height $q$ with $\prod_{i=1}^q K[i]$ leaves; the level of each demultiplexor is its distance from the root of the corresponding tree $G_i$.

In the homogeneous case, where all the demultiplexors in $G_i$ are of the same type, $G_i$ can be considered as a single (yet complex) demultiplexor of this type. Therefore, all the lower-bound results described in Section 4.3 hold after substituting $K$ with $\prod_{i=1}^q K[i]$ and $r$ with $r_q$. For simplicity, we present the results only for $N$-partitioned fully-distributed demultiplexing algorithms:

Corollary 18 Any homogeneous RPPS that uses (randomized) $N$-partitioned fully-distributed demultiplexing algorithms has, with probability $1-\delta$, an average relative queuing delay of at least $N\left(\frac{R}{r_q}-1\right)-\varepsilon$ time-slots, against adaptive and oblivious adversaries, where $\varepsilon > 0$ and $\delta > 0$ can be made arbitrarily small ($\delta = 0$ for deterministic algorithms).

As in Theorem 5, the lower bound against an oblivious adversary holds only if the RPPS obeys a global FCFS policy.

Corollary 19 Any homogeneous RPPS that uses (randomized) u-RT demultiplexing algorithms has, with probability $1-\delta$, an average relative queuing delay of at least $\frac{uN}{S}\left(1-\frac{u r_q}{R}\right)-\varepsilon$ time-slots, against an adaptive adversary, where $S = \frac{r_q}{R}\prod_{i=1}^q K[i]$, $u \le \frac{R}{2r_q}$, and $\varepsilon,\delta > 0$ can be made arbitrarily small ($\delta = 0$ for deterministic algorithms).

These results imply that building a PPS recursively and homogeneously does not improve its relative queuing delay. Note that a similar approach may be applied in order to analyze an input-buffered RPPS; hence, as in Theorems 16 and 17, the lower bounds on the relative queuing delay sometimes depend on the relations between the buffer size and the type of information used.

Since sharing information becomes more feasible as the external rate of a switch decreases, it is interesting to investigate also a monotone $K$-RPPS, in which the switches at levels $1,\ldots,v$ operate under fully-distributed algorithms, the switches at levels $v+1,\ldots,w$ operate under u-RT demultiplexing algorithms, and the switches at levels $w+1,\ldots,q$ are centralized. All demultiplexing algorithms can be either deterministic or randomized. For brevity, we assume all u-RT demultiplexors operate with the same parameter $u$, and identify such a recursive PPS by the tuple $\langle K,v,w,u\rangle$. If all demultiplexors are bufferless, Corollaries 18 and 19 imply the following lower bound:

Corollary 20 A monotone (randomized) $\langle K,v,w,u\rangle$-RPPS has, with probability $1-\delta$, an average relative queuing delay of at least
$$\max\left\{ N\left(\frac{R}{r_v}-1\right),\ \frac{uN}{S'}\left(1-\frac{u r_w}{r_v}\right)\right\} - \varepsilon,$$
against an adaptive adversary, where $S' = \frac{r_w}{r_v}\prod_{i=v+1}^w K[i]$, $u \le \frac{r_v}{2r_w}$, and $\varepsilon,\delta > 0$ can be made arbitrarily small ($\delta = 0$ for deterministic algorithms).

Proof. Consider a single input-port $i$ and the collection of demultiplexors that handles all flows originating from input-port $i$.

On one hand, the demultiplexors at levels $1,\ldots,v$ form a homogeneous fully-distributed demultiplexor. Therefore, Corollary 18 implies that it attains, with probability $1-\delta$, an average relative queuing delay of at least $N\left(\frac{R}{r_v}-1\right)-\varepsilon$ time-slots.

On the other hand, the demultiplexors at levels $v+1,\ldots,w$ form a collection of homogeneous u-RT distributed demultiplexors. Therefore, Corollary 19 implies that each of these demultiplexors attains, with probability $1-\delta$, an average relative queuing delay of at least $\frac{uN}{S'}\left(1-\frac{ur_w}{r_v}\right)-\varepsilon$ time-slots.

Therefore, the overall average relative queuing delay is as claimed.

We leave for further research the construction of algorithms for recursive PPS and the analysis of other combinations of demultiplexing algorithms (e.g., when some of the demultiplexors are bufferless and some have input-buffers).
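As a small numeric illustration of Corollary 20, the following sketch (not from the thesis) evaluates the lower-bound formula for a given monotone RPPS; the function name, the rate encoding and the example values are our own assumptions, and the formula follows the statement above as reconstructed here.

```python
# A small sketch (not from the thesis) evaluating the lower bound of
# Corollary 20 for a monotone <K, v, w, u>-RPPS. K is the vector of per-level
# plane counts; rates[q] is the internal rate r_q, with rates[0] = R.

from math import prod

def monotone_rpps_bound(N, K, rates, v, w, u):
    R, r_v, r_w = rates[0], rates[v], rates[w]
    S = (r_w / r_v) * prod(K[v:w])             # speedup of the u-RT levels v+1..w
    fully_dist = N * (R / r_v - 1)             # contribution of levels 1..v
    u_rt = (u * N / S) * (1 - u * r_w / r_v)   # contribution of levels v+1..w
    return max(fully_dist, u_rt)

# Example: N = 16 ports, 3 levels with 4 planes each, rates halving per level,
# level 1 fully distributed, level 2 u-RT with u = 2, level 3 centralized.
print(monotone_rpps_bound(N=16, K=[4, 4, 4],
                          rates=[16.0, 8.0, 4.0, 2.0], v=1, w=2, u=2))   # 16.0
```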

Chapter 5

Packet-Mode Scheduling in CIOQ Switches

In many network protocols, from very large Wide Area Networks (WANs) to small Networks on Chips (NoCs), traffic is comprised of variable-size packets. A prime example is provided by IP datagrams, whose sizes typically vary from 40 to 1500 bytes [112]. Real-life switches, however, operate with fixed-size cells, which are easier to buffer and to schedule synchronously in an electronic domain. Transmitting packets over cell-based switches requires the use of packet segmentation and reassembly modules, resulting in a significant computation and communication overhead [77]. Cell-based scheduling is expected to turn into an even more crucial problem as the use of optics becomes widespread, since future switches could deal with packets in the optical spectrum and might be unable to afford their segmentation and reassembly.

Cell-based schedulers, unaware of packet boundaries, may cause performance degradation; indeed, packet-aware switches typically have a better drop-rate, since they may reduce the number of retransmissions by ensuring that only complete packets are sent over the switch fabric (cf. [141, Page 44]). Packet-mode schedulers [57, 101] bridge this gap by delivering packets contiguously over the switch fabric, implying that until a packet is fully transmitted, neither its originating port nor its destination port can handle different packets.

It is imperative to explore whether packet-mode schedulers can provide performance guarantees similar to those of cell-based schedulers.

We address this question by focusing on CIOQ switches and investigating whether a packet-mode CIOQ switch can mimic an ideal shadow switch with bounded relative queuing delay.

5.1 Our Results

In this chapter, we present packet-mode schedulers for CIOQ switches that mimic an ideal switch with bounded relative queuing delay. Since such mimicking requires CIOQ switches with a certain speedup, we further investigate the trade-off between the speedup of the switch and its relative queuing delay.

We devise pipelined frame-based schedulers, in which scheduling decisions are made at frame boundaries. Our schedulers and their analysis rely on matrix decomposition techniques. At each frame, a demand matrix, representing the total size of the packets between each input-output pair, is decomposed into permutations that dictate the scheduling decisions in the next frame. The major challenge in these decompositions is ensuring contiguous packet delivery while decomposing the demand matrix into as few permutations as possible.

We show that, contrary to a cell-based CIOQ switch, a packet-mode CIOQ switch cannot exactly emulate an ideal shadow switch (e.g., an output-queued (OQ) switch), whatever the speedup. However, once we allow a bounded relative queuing delay, we find that a packet-mode CIOQ switch does not require a fundamentally higher speedup than a cell-based CIOQ switch. Specifically, we show (Theorem 26) that a speedup of $2 + O\left(\frac{1}{R_{\max}}\right)$ suffices to ensure that a packet-mode CIOQ switch mimics an ideal switch with maximum relative queuing delay $R_{\max} = O(N \cdot \mathrm{lcm}(L_{\max}))$ time-slots, where $L_{\max}$ is the maximum packet size and $\mathrm{lcm}(L_{\max})$ is the least common multiple of $1,\ldots,L_{\max}$. This result also holds in the common case where only a few packet sizes are legal, and the resulting relative queuing delay is then $O(N \cdot \mathrm{lcm}(L))$ time-slots, where $L$ is the restricted set of legal packet sizes. It is important to note that if $L = \{1,\ldots,L_{\max}\}$, then $\mathrm{lcm}(L)$ is exponential in $L_{\max}$, since it is bounded from below by the primorial of $L_{\max}$, $L_{\max}\#$, and from above by the factorial of $L_{\max}$, $L_{\max}!$; both $L_{\max}\#$ and $L_{\max}!$ are exponential in $L_{\max}$.

The relative queuing delay can be significantly reduced with just a doubling of the speedup.

[Figure 12: Summary of our results, plotting speedup against relative queuing delay (logarithmic scale), with points corresponding to Theorem 22, Theorem 25, Theorem 26, Corollary 27 and Corollary 28. The solid line represents emulation of an ideal switch with unbounded buffer size, while the dashed line represents emulation of an ideal switch with buffer size B per output-port.]

We show (Theorem 25) that a speedup of $4 + O\left(\frac{1}{R_{\max}}\right)$ suffices to ensure that a packet-mode CIOQ switch mimics an ideal shadow switch with a more reasonable relative queuing delay of $R_{\max} = O(NL_{\max})$ time-slots. The relative queuing delay can be further reduced to only $L_{\max}-1$ time-slots if the speedup is increased to $2L_{\max}$ (Theorem 22). In addition, we show (Theorem 21) that it is impossible to achieve a relative queuing delay of less than $L_{\max}/2 - 3$, regardless of the speedup used. In particular, packet-mode schedulers cannot exactly emulate OQ switches (with no relative queuing delay).

Finally, we consider mimicking an ideal switch with a bounded buffer size $B$ at each output-port. Extending the matrix decomposition techniques, we show (Corollary 28) that with a smaller speedup of $1 + O\left(\frac{1}{R_{\max}}\right)$ and relative queuing delay $R_{\max} = O(B + N \cdot \mathrm{lcm}(L_{\max}))$, a packet-mode CIOQ switch mimics an ideal shadow switch with buffer size $B$.

Figure 12 summarizes our results and demonstrates the trade-off between the speedup required for switch mimicking and the resulting relative queuing delay.

5.2 A model for packet-mode CIOQ switches

This section extends the model defined in Chapter 3 to capture packet-mode switches in general, and packet-mode CIOQ switches specifically.

In a packet-mode switch, packets of variable size traverse the switch contiguously. The packet size is measured in cell-units, where the minimal packet size is one cell and the maximal packet size is $L_{\max}$ cells. All cells of the same packet arrive at the switch contiguously at the same input-port and are destined for the same output-port. Therefore, we refer to a packet simply as a sequence of cells, and assume that its size is known upon the arrival of its first cell (e.g., the total size is written in the header).

Packet-mode switches are required to ensure that cells of the same packet leave the switch contiguously; that is, cells of the same packet should leave the switch one after the other, with no interleaving of cells from other packets. A packet-mode switch should further provide a relaxed notion of the first-come-first-served (FCFS) discipline: if the last cell of packet $p$ arrives at the switch before the first cell of packet $p'$, and both packets share the same output-port, then all the cells of packet $p$ should leave the switch before the cells of packet $p'$. We denote this partial order of packets by $p \prec p'$ (i.e., packet $p$ should be handled before packet $p'$).

An ideal packet-mode shadow switch (e.g., a packet-mode OQ switch) should also be work-conserving [37, 88, 91]: namely, if a cell is pending for output-port $j$ at time-slot $t$, then some cell leaves the switch from output-port $j$ at time-slot $t$. We denote by $tl_S(c)$ the time-slot at which cell $c$ is delivered by the shadow switch. Contiguous packet delivery implies that for any packet $p = (c_1,\ldots,c_l)$, $tl_S(c_i) = tl_S(c_j) + (i-j)$ for $1 \le j \le i \le l$.

Recall that in a CIOQ switch with speedup $S$, packets arriving at rate $R$ are first buffered at the input side and then forwarded over the switch fabric to the output side, as dictated by a scheduling algorithm (see Figure 2). Packets that arrive at input-port $i$ and are destined for output-port $j$ are stored at the input side of the switch in a separate buffer, which is called a virtual output-queue and denoted $VOQ_{ij}$.

[Figure 13: Illustration of the proof of Theorem 21. Packet $p_1$ arrives at input 1 starting at time-slot 1, while packets $p_2$ and $p_3$ arrive at input 2 at time-slots $R_{\max}+2$ and $R_{\max}+3$, respectively; white packets are destined for output-port 1, while the gray packet is destined for output-port 2.]

The switch fabric operates at rate $S \cdot R$, where $S$ is the speedup of the switch, implying that the switch has $S$ scheduling opportunities (or scheduling decisions) every time-slot.¹ A packet-mode CIOQ switch ensures that if a packet $p$ from input-port $i$ to output-port $j$ consists of the cells $(c_1,\ldots,c_l)$, then after cell $c_1$ is transmitted across the switch fabric, no cells of packets other than $p$ are transmitted from input-port $i$ or to output-port $j$ until cell $c_l$ is transmitted. Naturally, cells of the same packet are transmitted in order.

It is possible that some input-port $i$ starts transmitting cells of a packet $p$ before all the cells of packet $p$ have arrived at the switch. Since the speedup of the switch is typically greater than 1, this may cause the switch to under-utilize its speedup. For example, suppose that the first cell $c_1$ of a packet $p = (c_1,c_2,\ldots,c_l)$ arrives at input-port $i$ at time-slot $ta(c_1)$ and is immediately sent to output-port $j$ in the first scheduling opportunity of time-slot $ta(c_1)$. Since cell $c_2$ arrives at the switch only at time-slot $ta(c_2) = ta(c_1)+1$, no cells can be sent from input-port $i$ or to output-port $j$ for the next $S-1$ scheduling opportunities (even if there are cells of other packets in one of the relevant buffers).

¹ For non-integral speedup values, the speedup $S$ is the average number of such scheduling decisions per time-slot, where at each time-slot the switch makes between $\lfloor S\rfloor$ and $\lceil S\rceil$ scheduling decisions [59].
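The packet-mode fabric constraint just described can be stated operationally as a small checker. The following is a minimal runnable sketch (not from the thesis), with our own encoding of a fabric schedule as a list of scheduling opportunities.

```python
# A minimal sketch (not from the thesis) of the packet-mode fabric constraint:
# once the first cell of a packet is sent from input i to output j, ports i and
# j may carry no other packet's cells until its last cell is sent. `schedule`
# maps each scheduling opportunity to a list of (input, output, packet,
# cell_idx, packet_len) tuples, one per fabric transfer.

def is_packet_mode(schedule):
    busy_in, busy_out = {}, {}          # port -> packet currently holding it
    for opportunity in schedule:
        for (i, j, pkt, idx, length) in opportunity:
            if busy_in.get(i, pkt) != pkt or busy_out.get(j, pkt) != pkt:
                return False            # port is held by an unfinished packet
            if idx < length - 1:        # packet still in flight: hold both ports
                busy_in[i], busy_out[j] = pkt, pkt
            else:                       # last cell: release both ports
                busy_in.pop(i, None)
                busy_out.pop(j, None)
    return True

# Packet "a" (2 cells, from 1 to 1) must not be interleaved with packet "b".
ok  = [[(1, 1, "a", 0, 2)], [(1, 1, "a", 1, 2)], [(2, 1, "b", 0, 1)]]
bad = [[(1, 1, "a", 0, 2)], [(2, 1, "b", 0, 1)], [(1, 1, "a", 1, 2)]]
print(is_packet_mode(ok), is_packet_mode(bad))   # True False
```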

against the conventional wisdom that speedup $N$ solves every problem.

Theorem 21 A packet-mode CIOQ switch cannot mimic an ideal switch with a relative queuing delay $R_{\max} < L_{\max}/2 - 3$ time-slots.

Proof. Assume towards a contradiction that the CIOQ switch mimics an ideal shadow switch with relative queuing delay $R_{\max} < L_{\max}/2 - 3$, and consider the following traffic, comprising only three packets (see Figure 13): at time-slot 1, a packet $p_1$ of size $L_{\max}$ arrives at input-port 1, destined for output-port 1; at time-slot $R_{\max}+2$, another packet, denoted $p_2$, of size 1 arrives at input-port 2, destined for output-port 1; and at time-slot $R_{\max}+3$, a packet $p_3$ of size $L_{\max}$ arrives at input-port 2, destined for output-port 2.

At time-slot 1, packet $p_1$ is the only packet destined for output-port 1; since the shadow switch is work-conserving, the first cell of $p_1$ is delivered by the shadow switch at time-slot 1, implying that it must be delivered by the CIOQ switch by time-slot $R_{\max}+1$. Packet-mode scheduling restricts the switch from delivering cells of other packets to output-port 1 until the last cell of packet $p_1$ is delivered. Since the last cell of packet $p_1$ arrives at the switch at time-slot $L_{\max}$, output-port 1 is busy handling $p_1$ at least until time-slot $L_{\max}$.

Using the same arguments, the first cell of packet $p_3$ must be delivered to output-port 2 by time-slot $2R_{\max}+3$, and input-port 2 is busy handling $p_3$ at least until time-slot $L_{\max}+R_{\max}+2$.

Since $L_{\max} > 2R_{\max}+3$, packet $p_2$ cannot be delivered to output-port 1 until time-slot $L_{\max}+R_{\max}+2$. But packet $p_2$ is delivered by the shadow switch at time-slot $L_{\max}+1$, implying that its relative queuing delay is at least $R_{\max}+1$, contradicting the assumption.

Note that this result holds because the CIOQ switch waits for the cells of the different packets to arrive; under the situation described in the proof of Theorem 21, the switch in fact degrades to working at the external line rate (i.e., with $S=1$), as an IQ switch. The result is therefore consistent with the known result that IQ switches, with speedup 1, cannot emulate output-queued switches [37].

We now show that a CIOQ switch can mimic a shadow switch with a relative queuing delay of $L_{\max}-1$ time-slots, provided it has a sufficiently large speedup of $2L_{\max}$.

The algorithm closely follows the CCF algorithm, which emulates (precisely) a cell-based OQ switch with speedup $S=2$ [37]. Intuitively, multiplying the speedup by the maximum packet size $L_{\max}$ reduces the problem of packet-mode switching to cell-based switching: each cell-based scheduling decision can be mapped to $L_{\max}$ contiguous packet-mode scheduling decisions, implying that a packet can be transmitted contiguously. In addition, a relative queuing delay of $L_{\max}-1$ time-slots allows the scheduler to wait until a packet has fully arrived at the switch before it is scheduled. The following theorem captures this simple result:

Theorem 22 A packet-mode CIOQ switch with speedup $S = 2L_{\max}$ can mimic an ideal shadow switch with a relative queuing delay of $L_{\max}-1$ time-slots.

Proof. For each time-slot $t$, let traffic $T(t)$ be the collection of cells that arrive at the switch by time-slot $t$, and let $T'(t) \subseteq T(t)$ be a traffic comprising only those cells in $T(t)$ that are the first cells of their corresponding packets. Denote by $t_{CCF}(c)$ the time-slot in which the CCF algorithm with speedup $S=2$ schedules a cell $c$ of traffic $T'(t)$ over the switch fabric, and let $tl_{S'}(c)$ be the time-slot in which $c$ leaves a cell-based OQ switch that handles traffic $T'$.

The packet-mode CCF algorithm (PM-CCF) simulates the behavior of the cell-based CCF: for each packet $p$ of traffic $T(t)$, PM-CCF forwards the entire packet $p$ contiguously over the switch fabric at time-slot $t_{PM\text{-}CCF}(p) = t_{CCF}(first(p)) + L_{\max} - 1$.

Since the cell-based CCF works with speedup $S=2$, for each time-slot $t$ there are at most two cells that share the same input-port or output-port and are forwarded over the switch fabric by the cell-based CCF at time-slot $t$. PM-CCF works correctly since it has $2L_{\max}$ scheduling opportunities at each time-slot, and can therefore schedule the packets corresponding to these two cells entirely within the same time-slot $t$. In addition, the contiguous arrival of packets at the input-ports ensures that packet $p$ has fully arrived at the switch by time-slot $t_{CCF}(first(p)) + L_{\max} - 1$.

For each cell $c$ of traffic $T = \bigcup_t T(t)$, $tl_S(c)$ denotes the time-slot in which $c$ leaves the packet-mode shadow switch. Note that $tl_S(c) \ge tl_S(first(packet(c))) \ge tl_{S'}(first(packet(c)))$, because cells belonging to the same packet are delivered in order, and traffic $T' = \bigcup_t T'(t)$ is a subset of traffic $T$. Since the cell-based CCF emulates a cell-based OQ switch, it follows that for each cell $c$ of traffic $T$:
$$tl_S(c) \ge tl_{S'}(first(packet(c))) \ge t_{CCF}(first(packet(c))) = t_{PM\text{-}CCF}(c) - (L_{\max}-1).$$
This implies that every cell $c$ can be delivered from a CIOQ switch operating under packet-mode CCF at time-slot $tl_S(c) + L_{\max} - 1$, and the claim follows.

This result only demonstrates the possibility of mimicking an ideal shadow switch with bounded delay, since a speedup of $S = 2L_{\max}$ is unreasonable in practical switches. Furthermore, this result also shows that although cut-through CIOQ switches (that is, switches that do not wait for packets to fully arrive at the switch before starting to schedule them) may provide a smaller delay in cell-mode scheduling, in packet-mode scheduling it is more profitable to use store-and-forward CIOQ switches, which wait for packets to fully arrive at the switch before scheduling them.

In the rest of this chapter, we show how to achieve a similar result with a smaller speedup, by presenting a tradeoff between the speedup and the relative queuing delay: as the speedup of the switch increases, the relative queuing delay needed for mimicking a shadow switch decreases.

5.4 Tradeoffs between the speedup and the relative queuing delay

Our scheduling algorithms operate in a frame-based, pipelined manner, with scheduling decisions made only at frame boundaries. At each frame boundary, the algorithms first construct several demand matrices, and then decompose these matrices into permutations (or sub-permutations). The algorithms satisfy the demands by scheduling the cells in the next frame according to the resulting permutations.
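The frame-based pipelined operation can be sketched schematically as follows. This is a minimal runnable illustration (not the thesis scheduler); the decomposition routine is a placeholder, and the function names and toy demand matrices are our own.

```python
# A schematic sketch (not from the thesis) of the frame-based pipeline: the
# demand gathered in one frame is decomposed, and the resulting permutations
# drive the fabric in the next frame. `decompose` is any decomposition routine.

def run_frames(demand_per_frame, decompose, tau, speedup):
    """demand_per_frame[k] is the N x N demand matrix gathered in frame k."""
    pending = None                        # decomposition computed one frame earlier
    for k, demand in enumerate(demand_per_frame):
        if pending is not None:
            opportunities = int(speedup * tau)
            assert len(pending) <= opportunities, "frame too small for this speedup"
            for m, perm in enumerate(pending):
                yield (k, m, perm)        # permutation perm drives opportunity m
        pending = decompose(demand)       # pipeline: schedule it in frame k + 1

def trivial_decompose(demand):
    """Stand-in decomposition: peel off one greedy permutation at a time."""
    perms, n = [], len(demand)
    D = [row[:] for row in demand]
    while any(v > 0 for row in D for v in row):
        P = [[0] * n for _ in range(n)]
        rows, cols = set(), set()
        for i in range(n):
            for j in range(n):
                if D[i][j] > 0 and i not in rows and j not in cols:
                    P[i][j] = 1; D[i][j] -= 1; rows.add(i); cols.add(j)
        perms.append(P)
    return perms

frames = [[[2, 0], [0, 1]], [[1, 1], [1, 0]], [[0, 0], [0, 0]]]
for event in run_frames(frames, trivial_decompose, tau=4, speedup=1.0):
    print(event)                          # permutations of frame k run in frame k+1
```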

The algorithms and their analysis rely on some results from matrix theory, which are presented next.

5.4.1 Matrix Decomposition

Definition 12 A permutation $P$ is a 0-1 matrix such that the sum of each row and the sum of each column is exactly 1. A sub-permutation $P$ is a 0-1 matrix such that the sum of each row and the sum of each column is at most 1.

In the rest of this chapter, for simplicity, we refer to sub-permutations as permutations.

The following definition captures the fact that the number of cells that should be scheduled from a single input-port or to a single output-port is bounded:

Definition 13 A matrix $A \in \mathbb{N}^{N\times N}$ is C-bounded if the sum of each row and each column in $A$ is at most $C$.

A classical result says that any C-bounded matrix $A$ can be decomposed into $C$ permutations whose sum dominates $A$:

Theorem 23 (BIRKHOFF-VON NEUMANN DECOMPOSITION [25, 52, 144]) If a matrix $A \in \mathbb{N}^{N\times N}$ is C-bounded by an integer $C$, then there are $C$ permutations $P_1,\ldots,P_C$ such that $A \le \sum_{i=1}^C P_i$.

Note that since all the values in the matrix $A$ are integers, the same result can be obtained using König's Theorem, which bounds the chromatic index of a bipartite graph by its maximum vertex degree [90].

The Birkhoff-von Neumann decomposition implies that every C-bounded demand matrix can be scheduled, cell by cell, in $C$ scheduling opportunities (or, equivalently, in $C/S$ time-slots), where permutation $P_i$ dictates the scheduling in opportunity $i$. However, such a scheduling may violate the packet-mode restrictions, since there is no relation between adjacent permutations in the sequence.

For reasons that will become clear shortly, we are interested in the following class of permutations:

Definition 14 A maximal matching for a matrix $A = [a_{ij}]$ is a permutation matrix $P = [p_{ij}] \le A$ such that if $p_{ij} = 0$ and $a_{ij} > 0$, then there exists $i'$ such that $p_{i'j} = 1$, or $j'$ such that $p_{ij'} = 1$.

Intuitively, a permutation $P \le A$ is a maximal matching for a matrix $A$ if no element can be added to $P$ such that the resulting matrix is still a permutation and is dominated by $A$.

The next theorem shows that if a matrix is decomposed by any sequence of maximal matchings, then the number of permutations needed is at most twice the number needed in Theorem 23. The decomposition of a C-bounded matrix $A$ works iteratively: in each iteration $m$, a maximal matching $P(m)$ for the matrix $A(m-1)$ is found and then subtracted from $A(m-1)$ to form $A(m)$ (negative values are treated as zeros). The procedure stops when $A(m) = 0$. We next show that this happens after at most $2C-1$ iterations, regardless of the choice of the maximal matching in each iteration, implying that the matrix $A$ is decomposed into fewer than $2C$ permutations.

Theorem 24 ([145, THEOREM 2.2]) For every C-bounded matrix $A \in \mathbb{N}^{N\times N}$, the decomposition procedure described above stops after at most $2C-1$ iterations.

Proof. Denote $A(0) = A$, and let $P(m) = [p(m)_{ij}]$ be the maximal matching found in iteration $m$. Let $A(m) = [a(m)_{ij}]$ be the matrix resulting from subtracting the permutation $P(m)$ from the matrix $A(m-1)$. If $A(2C-1) \ne 0$ then there exist $i,j$ such that $a(2C-1)_{ij} > 0$. Let $a(2C-1)_{ij} = k$ and $a(0)_{ij} = l$. This implies that $p(m)_{ij} = 1$ in exactly $l-k$ permutations $P(m)$ ($1 \le m \le 2C-1$), and therefore $p(m)_{ij} = 0$ in $(2C-1) - l + k$ such permutations.

Note that for every $m \le 2C-1$, $a(m)_{ij} > 0$. Therefore, Definition 14 yields that if $p(m)_{ij} = 0$ then there is either $i'$ such that $p(m)_{i'j} = 1$ or $j'$ such that $p(m)_{ij'} = 1$. However, the sum of either row $i$ or column $j$, excluding $a(0)_{ij}$, is at most $C - l$. This implies that $2(C-l) \ge (2C-1) - l + k$, which is a contradiction since $l,k \ge 1$.
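The iterative procedure of Theorem 24 can be sketched directly. The following minimal Python illustration (not the thesis code) uses an arbitrary greedy maximal matching, which is all the theorem requires; the function names and the example matrix are our own.

```python
# A runnable sketch (not the thesis code) of the decomposition behind
# Theorem 24: repeatedly pick any maximal matching of the positive entries and
# subtract it. For a C-bounded integer matrix this stops after at most 2C - 1
# iterations, whichever maximal matchings are picked.

def greedy_maximal_matching(A):
    """Return a maximal (not necessarily maximum) matching of A's positive entries."""
    n = len(A)
    used_rows, used_cols, P = set(), set(), [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if A[i][j] > 0 and i not in used_rows and j not in used_cols:
                P[i][j] = 1                # nothing can be added later in row i / col j
                used_rows.add(i)
                used_cols.add(j)
    return P

def decompose(A):
    """Decompose A into a sequence of maximal matchings (Theorem 24's procedure)."""
    A = [row[:] for row in A]
    perms = []
    while any(v > 0 for row in A for v in row):
        P = greedy_maximal_matching(A)
        perms.append(P)
        for i in range(len(A)):
            for j in range(len(A)):
                A[i][j] = max(0, A[i][j] - P[i][j])   # negative values treated as zero
    return perms

# A 3-bounded 2x2 example: at most 2*3 - 1 = 5 maximal matchings are needed.
A = [[2, 1],
     [1, 2]]
print(len(decompose(A)))   # 3, within the 2C - 1 = 5 bound
```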

5.4.2 Mimicking an Ideal Shadow Switch with Speedup $S \ge 4$

Our schedulers operate by constructing a demand matrix at each frame boundary and then using the result of decomposing this matrix for the scheduling decisions in the next frame. The relative queuing delay of the schedulers corresponds to the size of the frame, while the speedup of the switch is determined by the ratio between the frame size and the number of permutations obtained in the decomposition.

A key insight is that packet-mode shadow switches can be implemented by a push-in-first-out (PIFO) cell-based OQ switch. In such OQ switches, arriving cells are placed in an arbitrary location in their destination's buffer, and the switch always outputs the cells at the head of its buffers [37]. The PIFO policy is an extension of the first-in-first-out (FIFO) policy that can also implement QoS-aware (Quality-of-Service-aware) algorithms, such as WFQ and strict priority. In our case, it allows us to implement packet-mode shadow switches as follows: the first cell of a packet $p$ arriving at the switch is placed at the end of the relevant OQ switch buffer; each consecutive cell $c_i$ of packet $p$ is placed immediately after cell $c_{i-1}$; in each time-slot, the cell at the head of the buffer departs from the switch. Since cells of the same packet are placed one after the other in the buffer, they leave the OQ switch contiguously. In addition, if $p \prec p'$ then the last cell of packet $p$ is placed in the buffer before the first cell of packet $p'$, implying that packet $p$ is served before packet $p'$.

Notice that, using the CCF algorithm, a cell-based CIOQ switch with speedup $S=2$ can emulate a cell-based OQ switch with any PIFO discipline [37], and in particular the above-mentioned discipline. However, the CCF algorithm ensures only that packets depart contiguously from the switch; it does not deliver the packets contiguously over the switch fabric (that is, from the input-ports to the output-ports). Yet, our next algorithms use this underlying CCF algorithm in order to construct the demand matrix of each frame. Let $t_{CCF}(c)$ be the time-slot in which a cell $c$ is forwarded over the switch fabric by this CCF algorithm. Clearly, $t_{CCF}(c) \le tl_S(c)$. We have the next lemma:

Lemma 12 If a scheduling algorithm ALG schedules the cell $last(p)$ of every packet $p$ by time-slot $t_{CCF}(last(p)) + \delta$, then the maximum relative queuing delay of ALG is at most $\delta + L_{\max} - 1$, where $L_{\max}$ is the maximum packet size.

Proof. Consider a cell $c$, let $k$ be its place in $packet(c)$, and let $l$ be the size of $packet(c)$. The contiguous packet delivery in the shadow switch dictates that $tl_S(c) = tl_S(last(packet(c))) - (l-k)$. Let $t_{ALG}(c)$ be the time-slot in which ALG forwards cell $c$ over the switch fabric. Since both ALG and CCF forward the cells of $packet(c)$ in their order within the packet,
$$
\begin{aligned}
t_{ALG}(c) &\le t_{ALG}(last(packet(c))) \le t_{CCF}(last(packet(c))) + \delta \le tl_S(last(packet(c))) + \delta\\
&= tl_S(c) + \delta + l - k \le tl_S(c) + \delta + l - 1 \le tl_S(c) + \delta + L_{\max} - 1.
\end{aligned}
$$
This implies that every cell $c$ is at the output side of the switch by time-slot $tl_S(c) + \delta + L_{\max} - 1$, and therefore ALG can output cell $c$ from the CIOQ switch at time-slot $tl_S(c) + \delta + L_{\max} - 1$. Notice that ALG does not transmit two cells $c$, $c'$ at the same time-slot from the same output-port, since $tl_S(c) + \delta + L_{\max} - 1 = tl_S(c') + \delta + L_{\max} - 1$ implies that $tl_S(c) = tl_S(c')$, contradicting the definition of the shadow switch.

We now explore the trade-off between the speedup $S$ at which the CIOQ switch operates and its relative queuing delay. We devise a frame-based scheduler in which the demand matrix of each frame is built according to the times at which the underlying CCF algorithm forwards cells over the switch fabric. In addition, packets that were not fully forwarded by the CCF algorithm by the frame boundary are queued at the input side of the switch until the next frame. Thus, the CCF algorithm determines which packets should be delivered by a packet-mode CIOQ switch in each frame, as captured by the next definition:

Definition 15 For every input-port $i$, output-port $j$, frame size $\tau$ and frame number $k > 0$, the set of eligible cells of frame $k$, denoted $a_{ij}(\tau,k)$, includes all cells $c \notin \bigcup_{k'<k} a_{ij}(\tau,k')$ such that all cells $c' \in packet(c)$ have $t_{CCF}(c') \le k\tau$. By convention, $a_{ij}(\tau,0) = \emptyset$.
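Definition 15 can be sketched computationally as follows. This is a minimal runnable illustration (not from the thesis), with our own encoding of packets and of the CCF forwarding times.

```python
# A small sketch (not from the thesis) of Definition 15: partitioning cells into
# the eligible sets a_ij(tau, k). `t_ccf` maps each cell to the time-slot at
# which the underlying CCF algorithm forwards it; a packet becomes eligible in
# the first frame whose boundary covers all of its cells.

def eligible_sets(packets, t_ccf, tau, num_frames):
    """packets: list of (i, j, cells); returns {(i, j, k): set of cells}."""
    a = {}
    assigned = set()
    for k in range(1, num_frames + 1):
        for (i, j, cells) in packets:
            if cells[0] in assigned:
                continue                              # already eligible earlier
            if all(t_ccf[c] <= k * tau for c in cells):
                a.setdefault((i, j, k), set()).update(cells)
                assigned.update(cells)
    return a

# Two packets from input 1 to output 1; frame size tau = 4.
packets = [(1, 1, ["p1c1", "p1c2"]), (1, 1, ["p2c1", "p2c2", "p2c3"])]
t_ccf = {"p1c1": 2, "p1c2": 3, "p2c1": 3, "p2c2": 5, "p2c3": 6}
print(eligible_sets(packets, t_ccf, tau=4, num_frames=2))
# packet 1 is eligible in frame 1; packet 2 only in frame 2 (its last cell has t_CCF = 6)
```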

Notice that, by definition, all the cells of a packet p are in the same set of eligible cells. The next lemma bounds the number of cells, sharing an input-port or an output-port, that should be scheduled within the same frame:

Lemma 13 For every input-port i, output-port j, frame size τ and frame number k > 0,

Σ_{j′=1}^{N} |a_ij′(τ, k)| ≤ 2τ + N(L_max − 1) and Σ_{i′=1}^{N} |a_i′j(τ, k)| ≤ 2τ + N(L_max − 1).

Proof. Note that the CCF algorithm works with a CIOQ switch with speedup 2. Thus, the number of cells c that share the same input-port (output-port) and have been forwarded by the CCF within frame k (namely, (k − 1)τ < t_CCF(c) ≤ kτ) is at most 2τ. Since in each virtual output-queue VOQ_{i,j}, all cells of the same packet p are stored one after the other, no cell of a different packet is forwarded by CCF between cells of packet p. Therefore, only cells of one packet are in a_ij(τ, k) and were forwarded by CCF before time-slot (k − 1)τ; we next bound the number of such cells. Since the maximum packet size is L_max and the last cell of each such packet was forwarded by the CCF after time-slot (k − 1)τ, at most L_max − 1 such cells share the same input-port and the same output-port. Thus, the number of such cells that share an input-port (output-port) is at most N(L_max − 1). This implies that both Σ_{j′=1}^{N} |a_ij′(τ, k)| and Σ_{i′=1}^{N} |a_i′j(τ, k)| are bounded by 2τ + N(L_max − 1).

Lemma 13 and Theorem 23 imply that the eligible cells of each frame can be scheduled within 2τ + N(L_max − 1) scheduling opportunities. Unfortunately, the decomposition described in Theorem 23 does not ensure that the packet-mode scheduling constraints are satisfied and therefore cannot be used directly. For example, consider a matrix A = [a_ij]

in which, for example, element a_{1,1} represents a single packet of size 3 and elements a_{2,2}, a_{2,3}, a_{2,4} represent packets of size 2, and a decomposition of A into six permutations that violates the packet-mode constraints: Contiguous transmission of packet a_{1,1} requires that the first three permutations are scheduled contiguously. On the other hand, each permutation i ∈ {1, 2, 3} must also be adjacent to permutation i + 3 in order to ensure contiguous transmission of packet a_{2,i+1}. These requirements cannot be satisfied simultaneously, since they imply that at least one permutation must be adjacent to three other permutations.

To circumvent this problem, we use Theorem 24 and introduce a different decomposition algorithm, which guarantees contiguous packet delivery but requires twice as many scheduling opportunities: At each frame boundary, the algorithm counts the number of cells in each set a_ij(τ, k) and constructs a matrix B(k) = [b_ij] accordingly (namely, b_ij = |a_ij(τ, k)|). Then, the algorithm repeatedly builds maximal matchings for matrix B(k) and maintains contiguous packet delivery in the following manner: If a cell from input-port i to output-port j is forwarded in some iteration of the algorithm, and there are more cells from i to j that were not forwarded yet, then the algorithm keeps the matching between i and j for the next iteration. (This procedure is sometimes called exhaustive service matching [96].) Since the algorithm uses only maximal matchings, Theorem 24 yields that the algorithm needs twice as many iterations as a Birkhoff–von Neumann decomposition in order to decompose matrix B(k). In particular, for every frame size τ, the algorithm needs at most 4τ + 2N(L_max − 1) iterations to complete. This implies that it can mimic an ideal switch with a speedup arbitrarily

close to 4, while attaining a relative queuing delay of O(N · L_max).

Theorem 25 A packet-mode CIOQ switch with speedup S = 4 + (2N(L_max − 1) − 1)/τ can mimic an OQ switch with a relative queuing delay of 2τ + L_max − 2 time-slots.

Proof. Fix a frame size τ and let B(k) = [b_ij] be the N × N matrix such that b_ij = |a_ij(τ, k)|. Lemma 13 implies that the sum of each row and each column of B(k) is at most 2τ + N(L_max − 1). Algorithm 4 works by repeatedly constructing maximal matchings P for matrix B(k). If a cell in the set a_ij(τ, k) is forwarded in some iteration of the algorithm, and there are more cells in a_ij(τ, k) to be forwarded, the algorithm keeps the matching between input-port i and output-port j for the next iteration. Therefore, cells of a specific set are forwarded contiguously. Hence, Definition 15 implies that Algorithm 4 forwards all the cells corresponding to a specific packet contiguously; this clearly satisfies the packet-mode scheduling constraints.

All matchings used by Algorithm 4 are maximal and the sum of each column and each row in B(k) is at most 2τ + N(L_max − 1). Theorem 24 implies that Algorithm 4 needs at most 2(2τ + N(L_max − 1)) − 1 = 4τ + 2N(L_max − 1) − 1 iterations to complete. Thus, with speedup 4 + (2N(L_max − 1) − 1)/τ, the algorithm schedules all cells corresponding to B(k) within the next frame, that is, by time-slot (k + 1)τ.

Consider the last cell last(p) of some packet p. Definition 15 implies that if last(p) ∈ a_ij(τ, k) then t_CCF(last(p)) > (k − 1)τ. Since Algorithm 4 schedules last(p) by time-slot (k + 1)τ, it follows that the relative queuing delay of last(p) is at most 2τ − 1. By Lemma 12, the relative queuing delay is at most 2τ + L_max − 2.

Notice that for switch speedup S > 4, the relative queuing delay induced by this algorithm is (2N(L_max − 1) − 1)/(S − 4) + L_max − 2 time-slots.
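To get a feel for this tradeoff, the short sketch below (illustrative only; the switch parameters are assumed values, not taken from the thesis) evaluates the speedup and delay bounds of Theorem 25 for a few frame sizes:

    def theorem25_speedup(N, L_max, tau):
        # S = 4 + (2N(L_max - 1) - 1) / tau
        return 4 + (2 * N * (L_max - 1) - 1) / tau

    def theorem25_delay(L_max, tau):
        # relative queuing delay bound: 2*tau + L_max - 2 time-slots
        return 2 * tau + L_max - 2

    N, L_max = 32, 50   # assumed switch size and maximum packet size
    for tau in (1000, 5000, 25000):
        print(f"tau={tau}: S={theorem25_speedup(N, L_max, tau):.3f}, "
              f"delay={theorem25_delay(L_max, tau)} slots")

As τ grows, the speedup approaches 4 from above while the delay bound grows linearly in τ, which is exactly the tradeoff the theorem captures.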

Algorithm 4 Coarse-Grained Maximal Matchings

Local Variables:
  B: matrix of values in ℕ, initially B = B(k)
  P: matrix of values in {0, 1}, initially 0

procedure SCHEDULE(matrix B)
  while B ≠ 0 do
    for all P[i][j] do
      if P[i][j] = 1 and B[i][j] = 0 then
        P[i][j] := 0
      end if
    end for
    P := MAX-MATCH(B, P)            ▷ returns a maximal matching of B that dominates P
    for all P[i][j] do
      if P[i][j] = 1 and B[i][j] > 0 then
        forward a cell from input i to output j
      end if
    end for
    B := B − P
    for all B[i][j] do              ▷ avoid negative values in B
      B[i][j] := max{B[i][j], 0}
    end for
  end while
end procedure

matrix procedure MAX-MATCH(matrix B, matrix P)
  while there are i, j such that B[i][j] ≥ 1 and Σ_{j′=1}^{N} P[i][j′] = 0 and Σ_{i′=1}^{N} P[i′][j] = 0 do
    P[i][j] := 1
  end while
  return P
end procedure

5.4.3 Mimicking an Ideal Shadow Switch with Speedup S ≥ 2

Notice that for each frame, the scheduler described in Theorem 25 schedules all eligible cells with the same origin and the same destination contiguously, implying that it in fact treats them as a single packet. Using a more fine-grained scheduler and the Birkhoff–von Neumann decomposition, we now show that a smaller speedup, arbitrarily close to 2, suffices, albeit with a larger relative queuing delay.

This is done in the context of the common situation where packet sizes are restricted to be from

the set L (cf. [112, 128]). Notice that this case generalizes the unrestricted packet size case, where L = {1, ..., L_max}. Let lcm(L) be the least common multiple of all elements in L.

Theorem 26 A packet-mode CIOQ switch with speedup S = 2 + N(L_max · lcm(L) − 1)/τ can mimic an ideal shadow switch with a relative queuing delay of 2τ + L_max − 2 time-slots.

Proof. Fix a frame size τ. For every packet size l ∈ L, let a_ij(τ, l, k) ⊆ a_ij(τ, k) be the set of eligible cells (recall Definition 15) that correspond to packets of size l. Let B(l, k) = [b(l, k)_ij] be the matrix with values b(l, k)_ij = |a_ij(τ, l, k)|/l, that is, the number of eligible packets of size l in frame k.

For every packet size l, the algorithm first tries to concatenate lcm(L)/l packets one after the other in order to get one mega-packet of size lcm(L); each such mega-packet consists of packets of the same size. The matrix B(lcm(L), k) = [b(lcm(L), k)_ij] counts the number of such mega-packets:

b(lcm(L), k)_ij = Σ_{l∈L} ⌊ l · b(l, k)_ij / lcm(L) ⌋.

We first bound the sum of each row and each column of the matrix B(lcm(L), k). Consider some row i of the matrix (the proof for a column j follows analogously):

Σ_{j=1}^{N} b(lcm(L), k)_ij = Σ_{j=1}^{N} Σ_{l∈L} ⌊ l · b(l, k)_ij / lcm(L) ⌋
  ≤ (1/lcm(L)) · Σ_{j=1}^{N} Σ_{l∈L} l · b(l, k)_ij
  = (1/lcm(L)) · Σ_{j=1}^{N} Σ_{l∈L} |a_ij(τ, l, k)|
  = (1/lcm(L)) · Σ_{j=1}^{N} |a_ij(τ, k)|
  ≤ (1/lcm(L)) · (2τ + N(L_max − 1))    (by Lemma 13)

By Theorem 23, the matrix B(lcm(L), k) can be decomposed into ⌈(2τ + N(L_max − 1))/lcm(L)⌉ permutations. Denote by P(lcm(L), k) the set of these permutations.

We now turn to deal with leftover packets. Let the matrices B′(l, k) = [b′(l, k)_ij] count the number of packets of size l that are not concatenated into mega-packets. Note that b′(l, k)_ij ≤ (lcm(L) − 1)/l, since it is the remainder of dividing b(l, k)_ij by lcm(L)/l. This implies that the sum of each row and each column of matrix B′(l, k) is bounded by N(lcm(L) − 1)/l. By Theorem 23, the matrix B′(l, k) can be decomposed into N(lcm(L) − 1)/l permutations. Let P(l, k) be the set of the permutations used to decompose the matrix B′(l, k).

After obtaining these sets of permutations, the algorithm forwards all the mega-packets contiguously by holding each permutation P ∈ P(lcm(L), k) for lcm(L) consecutive iterations. Then, for every l ∈ L, the algorithm holds each permutation P ∈ P(l, k) for l consecutive iterations. Clearly, all the cells of a specific packet are forwarded contiguously, and the algorithm satisfies the packet-mode scheduling constraints. The number of iterations needed for the algorithm to complete is bounded by:

lcm(L) · ⌈(2τ + N(L_max − 1))/lcm(L)⌉ + Σ_{l∈L} l · N(lcm(L) − 1)/l ≤ 2τ + N(L_max − 1) + N · L_max · (lcm(L) − 1).

Thus, with a speedup of 2 + N(L_max · lcm(L) − 1)/τ, the algorithm schedules all cells corresponding to frame k within the next frame. This implies that for each packet p, the maximum relative queuing delay of cell last(p) is less than two frame sizes, namely at most 2τ − 1 time-slots. Hence, Lemma 12 implies that the maximum relative queuing delay is at most 2τ + L_max − 2.

Note that for switch speedup S > 2, the relative queuing delay induced by this algorithm is N(L_max · lcm(L) − 1)/(S − 2) + L_max − 2 time-slots.

Furthermore, it is important to notice that even though the algorithm described in Theorem 26 employs a more sophisticated decomposition than Algorithm 4, both algorithms have the same overall time-complexity: their time-complexity is solely determined by the complexity of the underlying CCF algorithm, which is invoked twice every time-slot (while decomposition is done only once

every τ time-slots).

5.5 Mimicking an Ideal Shadow Switch with Bounded Buffers

In many practical applications, CIOQ switches are required to emulate shadow switches with bounded buffer size. We show that a smaller speedup suffices for mimicking an ideal shadow switch with output buffer B. Intuitively, the reason for this better performance is that an ideal shadow switch with bounded buffers cannot handle all incoming traffic types without dropping cells. Therefore, by using the extra information about the legal incoming traffic types, the CIOQ switch can optimize its scheduling decisions, resulting in a simpler and more efficient scheduling algorithm.

Unlike the previous algorithms, algorithms for bounded mimicking do not rely on the CCF algorithm, and use the following definition and lemma, which are adapted from Definition 15 and Lemma 13:

Definition 16 For every input-port i, output-port j, frame size τ and frame number k > 0, the set of eligible cells of frame k, denoted a_ij(τ, k), is the set of cells c that are delivered successfully by the ideal switch, such that c ∉ ∪_{k′<k} a_ij(τ, k′) and all cells c′ ∈ packet(c) arrive at the switch before time-slot kτ. By convention, a_ij(τ, 0) = ∅.

As in Definition 15, all the cells of each packet p are in the same set of eligible cells. The next lemma bounds the size of these sets.

Lemma 14 For every input-port i, output-port j, frame size τ and frame number k > 0,

Σ_{j′=1}^{N} |a_ij′(τ, k)| ≤ τ + B + N(L_max − 1) and Σ_{i′=1}^{N} |a_i′j(τ, k)| ≤ τ + B + N(L_max − 1).

Proof. Clearly, at most τ cells arrive at each input-port between time-slots (k − 1)τ and kτ. We next show that at most τ + B cells that arrive between time-slots (k − 1)τ and kτ are destined for a single output-port j and are successfully delivered by the shadow switch.

Assume, by way of contradiction, that l_1 > τ + B cells destined for output-port j arrive at the switch within frame k and are not dropped by the shadow switch. Let l_2 ≥ 0 be the number of cells stored in the buffer of output-port j at time-slot (k − 1)τ. By the definition of a switch, at most τ cells are delivered from output-port j between time-slots (k − 1)τ and kτ, hence the number of cells that are stored in the buffer by the end of frame k is at least l_1 + l_2 − τ > B cells, contradicting the fact that the buffer size is B.

Since all cells of the same packet p arrive at the switch contiguously, only cells of one packet are in a_ij(τ, k) and arrived at the switch before time-slot (k − 1)τ. Since the maximum packet size is L_max and the last cell of each such packet arrives after time-slot (k − 1)τ, the number of such cells that share the same input-port and the same output-port is bounded by L_max − 1. Thus, the number of such cells that share the same input-port (output-port) is bounded by N(L_max − 1), and the sum is therefore bounded by τ + B + N(L_max − 1).

In order to mimic an ideal switch, the CIOQ switch drops all cells that are dropped by the shadow switch. By employing Lemma 14 in the proofs of Theorems 25 and 26, respectively, we get the following results:

Corollary 27 A packet-mode CIOQ switch with speedup S = 2 + (2B + 2N(L_max − 1) − 1)/τ can mimic an ideal shadow switch with buffer size B with a relative queuing delay of 2τ + L_max − 2 time-slots.

Corollary 28 A packet-mode CIOQ switch with speedup S = 1 + (B + N(L_max · lcm(L) − 1))/τ can mimic an ideal shadow switch with buffer size B with a relative queuing delay of 2τ + L_max − 2 time-slots.

5.6 Simulation Results

Analytically, the algorithm described in Section 5.4.3 has a prohibitive relative queuing delay of O(N · L_max · lcm(L)) time-slots and therefore has only theoretical importance.² Conversely, Algorithm 4 requires a speedup S ≥ 4 in order to mimic an ideal switch with a reasonable relative queuing delay of O(N · L_max) time-slots.

² Even when lcm(L) = L_max, the relative queuing delay is O(N · L_max²).
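The following sketch (again illustrative; the parameter values and the spare-speedup margin are assumptions) makes the "prohibitive" claim concrete by comparing the delay bounds of the two schedulers at the same spare speedup above their respective thresholds:

    from functools import reduce
    from math import lcm

    N, L_max, margin = 32, 50, 0.5              # assumed switch size, packet sizes, spare speedup
    lcm_L = reduce(lcm, range(1, L_max + 1))    # lcm(L) for unrestricted sizes L = {1, ..., L_max}

    # Theorem 26 at speedup 2 + margin vs. Theorem 25 (Algorithm 4) at 4 + margin
    delay_26 = N * (L_max * lcm_L - 1) / margin + L_max - 2
    delay_25 = (2 * N * (L_max - 1) - 1) / margin + L_max - 2
    print(f"Theorem 26: ~{delay_26:.2e} slots; Theorem 25: ~{delay_25:.2e} slots")

With these (assumed) values, lcm(L) is astronomically large, so Theorem 26's bound dwarfs Theorem 25's by many orders of magnitude.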

In this section, we show that in practice Algorithm 4 outperforms its analytical worst-case bounds (from Theorem 25), implying that even with a modest speedup it achieves a small relative queuing delay. The results are obtained by conducting extensive simulation experiments under various synthetic and trace-driven traffic patterns.

Basically, in order to demonstrate the tradeoff between the speedup and the relative queuing delay, we conduct the following simulation: Given the incoming traffic and a fixed frame size, we measure the loss ratio of packets under various speedup values. Then, we present the speedup required to achieve less than 0.1% packet drop. Note that, unlike our theoretical upper bounds, we allow a small amount of cell drops; this clearly represents real-life situations in which switches are allowed to drop cells in extreme situations. Furthermore, this metric is especially important because the delay incurred in ideal switches (e.g., in OQ switches) is very well-studied, and from the relative queuing delay one can easily derive absolute bounds on the cell or packet delay of a packet-mode CIOQ switch.

We first consider several stochastic traffic patterns, which are generally modeled as ON-OFF processes: The ON period length is chosen according to a specific packet size distribution (that is, each ON period models the arrival of a single packet), while the OFF period is distributed geometrically with some probability p; the parameter p is chosen so that a certain load is achieved. Specifically, we study the following three stochastic traffic patterns. These patterns were also used by Marsan et al. [101] in order to investigate the performance of a packet-mode input-queued switch (with no speedup). It is important to notice that our results are even stronger than real-life performance, since some of the traffic patterns are chosen specifically to reflect starvation and unfairness due to the contiguous forwarding of large packets [101]:

1. Uniform traffic: In this traffic pattern, packet sizes are chosen uniformly at random in the range [1, 192]. For each packet, its destination is chosen uniformly at random among all output-ports. This uniform traffic setting is considered due to its frequent use in simulations and stochastic analysis of switch performance (see Chapter 1.2). Note that the maximum packet size is the Maximum Transmission Unit (MTU) of IP over ATM, measured in ATM cells [10, 101].

2. Spotted traffic: Packet sizes are 100 cells with probability 0.5 and 3 cells with probability

0.5; packet destination is chosen according to a fixed 8 × 8 traffic matrix: each input-port i chooses a destination uniformly at random among all destinations with entry 1 in row i. Since the matrix is doubly stochastic and the sum of each row (column) is five, each input-port sends packets to 5 output-ports, and each output-port receives packets from 5 input-ports. Notice that this specific traffic matrix aims to highlight starvation and loss of throughput due to the contiguous forwarding of large packets [101].

3. Diagonal traffic: Packet destinations are chosen uniformly at random. For every cell c, if orig(c) = dest(c) then the packet size is 100; otherwise, the packet size is 1. In this traffic pattern, the flows on the diagonal of the switching matrix consist only of long packets, while the flows that are not on the diagonal of the switching matrix consist only of short packets. Like the spotted traffic setting, this traffic pattern stresses the effects of contiguously delivering packets of variable sizes [101].

The length of all the simulations is 100,000 time-slots, and they were performed on N × N switches (except the spotted traffic simulations, which were performed on an 8 × 8 switch, as in the setting described in [101]). For each traffic pattern (stochastic or trace-driven), we fix a certain speedup S and a frame size τ. For each frame k, Algorithm 4 constructs a demand matrix B(k) and then decomposes the demand matrix B(k − 1) of the previous frame into a sequence of scheduling decisions. Under the fixed speedup S, Algorithm 4 schedules at most S · τ of these scheduling decisions, and drops all packets with cells in the remaining scheduling decisions. We measure the loss ratio (in terms of packets) under all values of S and τ.

Figure 14: Simulation results for a switch operating under the uniform traffic pattern and different input loads. The results show the required frame size needed to achieve less than a 0.1% packet drop ratio under a specific speedup.

Figures 14, 15 and 16 present the speedup required to achieve less than a 0.1% packet drop ratio under different loads and different stochastic traffic patterns. As expected, the results demonstrate that Algorithm 4 needs a larger speedup in order to achieve a smaller relative queuing delay. Moreover, the results show that as the load of the traffic increases, the speedup required by Algorithm 4 also increases.

Interestingly, these results show that, even in extreme situations, a speedup of less than 2 suffices to achieve ideal switch mimicking with a frame size of only 8 · L_max time-slots. This can be explained by carefully investigating the reasons behind the upper bound of Theorem 25: A speedup S ≥ 4 is required due to frames in which the underlying CCF algorithm forwards 2τ cells from the same input-port or to the same output-port;

Figure 15: Simulation results for an 8 × 8 switch operating under the spotted traffic pattern and different input loads. The results show the required frame size needed to achieve less than a 0.1% packet drop ratio under a specific speedup.

moreover, the additional factor of 2 is caused by a poor selection of maximal matchings, resulting in an inefficient contiguous decomposition, as captured by Theorem 24. Under non-adversarial traffic, these two situations rarely occur in practice, and especially not simultaneously. A relative queuing delay of (2N(L_max − 1) − 1)/(S − 4) + L_max − 2 time-slots occurs in an even more extreme situation: when there is a frame k and an input-port i (output-port j) such that every flow (i, j) has a packet p whose first cell is sent by the underlying CCF algorithm before time-slot (k − 1)τ and whose last cell is sent by the CCF between time-slots (k − 1)τ and kτ. Clearly, this situation hardly ever happens.

We also conducted trace-driven simulations using trace data of TCP traffic over OC-48 links; this trace data was taken from CAIDA [43]. We investigate the performance of Algorithm 4 under this real traffic and show that, also in this non-synthetic case, it performs better than its theoretical upper bounds.

Figure 16: Simulation results for a switch operating under the diagonal traffic pattern and different input loads. The results show the required frame size needed to achieve less than a 0.1% packet drop ratio under a specific speedup.

To the best of our knowledge, these are the first trace-driven simulations of packet-mode CIOQ switches. Figure 17 presents the performance of Algorithm 4 in the trace-driven experiments. We conducted these experiments at a granularity of 30 bytes (that is, the cell unit size is 30 bytes), yielding a maximum packet size, L_max, of 50 cells (i.e., 1500 bytes). Furthermore, we compressed the traffic so that each input-port is fully utilized (that is, 100% load). Compressing the traces to 100% load intuitively represents the worst-case traffic that should be handled by the switch; this intuition is further confirmed by our previous experiments, which show that as the traffic load increases, the required speedup also increases. As in the previous synthetic traffic patterns, these trace-driven simulations also show that Algorithm 4 performs better than its theoretical bounds.

Figure 17: Trace-driven simulation results for a switch operating under 1.0 input load. The results show the required frame size needed to achieve less than a 0.1% packet drop ratio under a specific speedup.

Finally, we compare the performance of Algorithm 4 to two simple greedy algorithms. The cut-through greedy algorithm gets a certain relative queuing delay R_max as a parameter, and ensures that each packet either attains a relative queuing delay of less than R_max or is dropped. Specifically, the algorithm chooses a random maximal matching over all packets arriving at the input side of the switch (even if a packet has not fully arrived at the switch) and, similarly to Algorithm 4, keeps an input-output pair matched until the corresponding packet is fully transmitted. Before a packet is selected for transmission, its relative queuing delay is compared to R_max, and the packet is dropped if it is above the threshold. Our simulations show that the cut-through greedy algorithm never achieves a 0.1% packet drop ratio, regardless of the switch speedup and of the chosen relative queuing delay threshold R_max.

Figure 18: Simulation results of the store&forward greedy algorithm for a switch operating under 1.0 input load and trace-driven traffic. The results show the required speedup needed to achieve less than a 0.1% packet drop ratio under a specific relative queuing delay threshold.

These results coincide with the lower bound described in Theorem 21, implying that a cut-through algorithm cannot mimic an ideal switch.

The store&forward greedy algorithm operates exactly as the cut-through greedy algorithm but schedules only fully-arrived packets. Although this algorithm potentially introduces an additional relative queuing delay of L_max, our trace-driven simulations, described in Figure 18, show that this algorithm converges very fast to less than a 0.1% packet drop ratio, and in fact it outperforms Algorithm 4.

It is important to notice that the actions of the store&forward greedy algorithm and Algorithm 4 are very similar; the main difference is that the store&forward greedy algorithm does not operate in a frame-based manner, and therefore it introduces a smaller relative queuing delay.

Yet, the store&forward greedy algorithm has no known analytical worst-case upper bounds.
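For concreteness, here is a minimal Python sketch of one time-slot of the store&forward greedy algorithm (our own reading of the description above, not the thesis's simulator; the VOQ representation and the shadow-departure map are assumptions):

    import random

    def time_slot(voq, matched, now, shadow_depart, R_max, drops):
        # voq[(i, j)]: FIFO list of (packet_id, cells_left) for fully-arrived
        # packets; matched: dict input -> output of pairs held from earlier
        # slots; shadow_depart[pid]: the packet's departure time in the
        # (assumed given) ideal shadow switch.
        busy_out = set(matched.values())
        pairs = [(i, j) for (i, j), q in voq.items() if q]
        random.shuffle(pairs)                     # one randomized greedy matching pass
        for i, j in pairs:
            if i in matched or j in busy_out:
                continue
            pid, _ = voq[(i, j)][0]
            if now - shadow_depart[pid] > R_max:  # would exceed the RQD threshold
                voq[(i, j)].pop(0)                # drop the whole packet
                drops.append(pid)
                continue
            matched[i] = j                        # hold the pair until the packet ends
            busy_out.add(j)
        for i in list(matched):                   # forward one cell per matched pair
            j = matched[i]
            pid, left = voq[(i, j)][0]
            if left == 1:
                voq[(i, j)].pop(0)                # packet complete: release the pair
                del matched[i]
            else:
                voq[(i, j)][0] = (pid, left - 1)

The held pairs play the same role as the exhaustive-service matchings of Algorithm 4; the difference is that matchings are recomputed every time-slot rather than once per frame.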

Chapter 6

Jitter Regulation for Multiple Streams

The notion of delay jitter (or Cell Delay Variation [11]), defined as the difference between the maximal and minimal end-to-end delays of different cells, captures the smoothness of a traffic stream. The need for efficient mechanisms to provide such smooth and continuous traffic is mostly motivated by the increasing popularity of interactive communication, and in particular video/audio streaming.¹

Controlling traffic distortions within the network, and in particular jitter control, has the effect of moderating the traffic throughout the network [147]. This is important when a service provider in a QoS network must meet service level agreements (SLAs) with its customers. In such cases, moderating high congestion states in switches along the network results in the provider's ability to satisfy the guarantees to more customers [134].

Jitter control mechanisms have been extensively studied in recent years (see Section 2.4). These are usually modelled as jitter regulators that use internal buffers in order to shape the traffic, so that cells leave the regulator in the most periodic manner possible. Upon arrival, cells are stored in the buffer until their planned release time, or until a buffer overflow occurs. This indicates a tradeoff between the buffer size and the best attainable jitter: as buffer space increases, one can expect to obtain a lower jitter.

In this chapter, we investigate the problem of finding an optimal jitter release schedule, given

¹ For example, 6.98 billion video streams were initiated by U.S. users during August 2006, while the U.S. streaming audience increased by 4 percent from July 2006, representing about 64 percent of the total U.S. Internet audience [42].

a predetermined buffer size. This problem was first raised by Mansour and Patt-Shamir [100], who considered only a single-stream setting. In practice, however, jitter regulators handle multiple streams simultaneously and must provide low jitter for each stream separately and independently.

In the multi-stream model, the traffic arriving at the regulator is an interleaving of M streams originating from M independent abstract sources (see Figure 19). Each abstract source i sends a stream of fixed-size cells in a fully periodic manner, with rate R_i; the cells arrive at a jitter regulator after traversing the network. Variable end-to-end delays caused by transient congestion throughout the network may result in such a stream arriving at the regulator in a non-periodic fashion. The regulator knows the value of R_i, and strives to release consecutive cells 1/R_i time units apart, thus re-shaping the traffic into its original form. Moreover, the order in which cells are released by each abstract source is assumed to be respected throughout the network. This implies that the cells of the same stream arrive at the regulator in order (but not necessarily equally spaced), and the regulator should also maintain this order. We refer to this property as the FIFO constraint. Note that the FIFO constraint should be respected in each stream independently, but not necessarily over all incoming traffic. This implies that in the multi-stream model, the order in which cells are released is not known a priori. This lack of knowledge is an inherent difference from the case where there is only one abstract source, and it poses a major difficulty in devising algorithms for multi-stream jitter regulation (as we describe in detail in Section 6.4).

6.1 Our Results

We present algorithms and tight lower bounds for jitter regulation in this multiple-streams environment, in both offline and online settings. This answers a primary open question posed in [100]. We evaluate the performance of a regulator in the multi-stream model by considering the maximum jitter obtained on any stream.

We show that, somewhat surprisingly, the offline problem can be solved in polynomial time. This is done by characterizing a collection of optimal schedules, and showing that their properties can be used to devise an offline algorithm that efficiently finds a release schedule attaining the optimal jitter.

We use a competitive analysis approach in order to examine the online problem.

Figure 19: The multi-stream jitter regulation model.

In this setting, by sizing up the buffer to a size of 2M·B and statically partitioning the buffer equally among the M streams, applying the algorithm described in [100, Algorithm B] to each stream separately yields an algorithm that obtains the optimal max-jitter possible with a buffer of size B. We show that such resource augmentation cannot be avoided, by proving that any online algorithm needs a buffer of size at least M·B in order to obtain a jitter within a bounded factor of the optimal jitter possible with a buffer of size B. We further show that these results also apply when the objective is to minimize the average jitter attained by the M streams. These results indicate that online jitter regulation does not scale well as the number of streams increases, unless the buffer is sized up proportionally.

6.2 Model Description, Notation, and Terminology

We adapt the following definitions from [100]:

Definition 17 Given a traffic T = {c_i | 0 ≤ i ≤ n} such that cell c_i arrives at time ta(c_i), we define the following:

1. A release schedule s for traffic T defines the release time of cells in T. Specifically, for each cell c_i, tl_s(c_i, T) denotes the time at which cell c_i ∈ T is released from the regulator under schedule s.
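To fix ideas, here is a small Python sketch of the jitter objectives described in Section 6.1, stated in terms of release schedules (our own formalization of the chapter's informal definition; the list-based representation of schedules is an assumption):

    def stream_jitter(release_times, rate):
        # For a periodic source of rate R, the n-th cell is emitted at time
        # n / R; its end-to-end delay is its release time minus n / R, and the
        # delay jitter is the difference between the maximal and minimal delays.
        delays = [t - n / rate for n, t in enumerate(release_times)]
        return max(delays) - min(delays)

    def max_jitter(schedules):
        # Multi-stream objective: the maximum jitter over all M streams;
        # schedules is a list of (release_times, rate) pairs, one per stream.
        return max(stream_jitter(ts, r) for ts, r in schedules)

    # a perfectly periodic release (rate 1, one cell per time unit) has zero jitter
    assert stream_jitter([5.0, 6.0, 7.0, 8.0], rate=1.0) == 0.0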

Module 5: CPU Scheduling Module 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 5.1 Basic Concepts Maximum CPU utilization obtained

More information

NICTA Short Course. Network Analysis. Vijay Sivaraman. Day 1 Queueing Systems and Markov Chains. Network Analysis, 2008s2 1-1

NICTA Short Course. Network Analysis. Vijay Sivaraman. Day 1 Queueing Systems and Markov Chains. Network Analysis, 2008s2 1-1 NICTA Short Course Network Analysis Vijay Sivaraman Day 1 Queueing Systems and Markov Chains Network Analysis, 2008s2 1-1 Outline Why a short course on mathematical analysis? Limited current course offering

More information

Online Interval Coloring and Variants

Online Interval Coloring and Variants Online Interval Coloring and Variants Leah Epstein 1, and Meital Levy 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. Email: lea@math.haifa.ac.il School of Computer Science, Tel-Aviv

More information

Stochastic Optimization for Undergraduate Computer Science Students

Stochastic Optimization for Undergraduate Computer Science Students Stochastic Optimization for Undergraduate Computer Science Students Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Reference 2 Outline

More information

Different Scenarios of Concatenation at Aggregate Scheduling of Multiple Nodes

Different Scenarios of Concatenation at Aggregate Scheduling of Multiple Nodes ICNS 03 : The Ninth International Conference on Networking and Services Different Scenarios of Concatenation at Aggregate Scheduling of Multiple Nodes Ulrich Klehmet Kai-Steffen Hielscher Computer Networks

More information

Part 2: Random Routing and Load Balancing

Part 2: Random Routing and Load Balancing 1 Part 2: Random Routing and Load Balancing Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ckc25/teaching Problem: Traffic Routing 2 Suppose you are in charge of transportation. What do

More information

Designing Competitive Online Algorithms via a Primal-Dual Approach. Niv Buchbinder

Designing Competitive Online Algorithms via a Primal-Dual Approach. Niv Buchbinder Designing Competitive Online Algorithms via a Primal-Dual Approach Niv Buchbinder Designing Competitive Online Algorithms via a Primal-Dual Approach Research Thesis Submitted in Partial Fulfillment of

More information

Morning Session Capacity-based Power Control. Department of Electrical and Computer Engineering University of Maryland

Morning Session Capacity-based Power Control. Department of Electrical and Computer Engineering University of Maryland Morning Session Capacity-based Power Control Şennur Ulukuş Department of Electrical and Computer Engineering University of Maryland So Far, We Learned... Power control with SIR-based QoS guarantees Suitable

More information

Real-Time Systems. Lecture #14. Risat Pathan. Department of Computer Science and Engineering Chalmers University of Technology

Real-Time Systems. Lecture #14. Risat Pathan. Department of Computer Science and Engineering Chalmers University of Technology Real-Time Systems Lecture #14 Risat Pathan Department of Computer Science and Engineering Chalmers University of Technology Real-Time Systems Specification Implementation Multiprocessor scheduling -- Partitioned

More information

A Mathematical Model of the Skype VoIP Congestion Control Algorithm

A Mathematical Model of the Skype VoIP Congestion Control Algorithm A Mathematical Model of the Skype VoIP Congestion Control Algorithm Luca De Cicco, S. Mascolo, V. Palmisano Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari 47th IEEE Conference on Decision

More information

Service Control Policies for Packet Switches with Target Output Traffic Profiles

Service Control Policies for Packet Switches with Target Output Traffic Profiles Service Control Policies for Packet Switches with Target Output Traffic Profiles Aditya Dua and icholas Bambos Department of Electrical Engineering Stanford University 35 Serra Mall, Stanford CA 9435 Email:

More information

Scheduling. Uwe R. Zimmer & Alistair Rendell The Australian National University

Scheduling. Uwe R. Zimmer & Alistair Rendell The Australian National University 6 Scheduling Uwe R. Zimmer & Alistair Rendell The Australian National University References for this chapter [Bacon98] J. Bacon Concurrent Systems 1998 (2nd Edition) Addison Wesley Longman Ltd, ISBN 0-201-17767-6

More information

Capacity of a Two-way Function Multicast Channel

Capacity of a Two-way Function Multicast Channel Capacity of a Two-way Function Multicast Channel 1 Seiyun Shin, Student Member, IEEE and Changho Suh, Member, IEEE Abstract We explore the role of interaction for the problem of reliable computation over

More information

Embedded Systems 14. Overview of embedded systems design

Embedded Systems 14. Overview of embedded systems design Embedded Systems 14-1 - Overview of embedded systems design - 2-1 Point of departure: Scheduling general IT systems In general IT systems, not much is known about the computational processes a priori The

More information

Alternatives to competitive analysis Georgios D Amanatidis

Alternatives to competitive analysis Georgios D Amanatidis Alternatives to competitive analysis Georgios D Amanatidis 1 Introduction Competitive analysis allows us to make strong theoretical statements about the performance of an algorithm without making probabilistic

More information

Optimization and Stability of TCP/IP with Delay-Sensitive Utility Functions

Optimization and Stability of TCP/IP with Delay-Sensitive Utility Functions Optimization and Stability of TCP/IP with Delay-Sensitive Utility Functions Thesis by John Pongsajapan In Partial Fulfillment of the Requirements for the Degree of Master of Science California Institute

More information

Real-Time Software Transactional Memory: Contention Managers, Time Bounds, and Implementations

Real-Time Software Transactional Memory: Contention Managers, Time Bounds, and Implementations Real-Time Software Transactional Memory: Contention Managers, Time Bounds, and Implementations Mohammed El-Shambakey Dissertation Submitted to the Faculty of the Virginia Polytechnic Institute and State

More information