Advanced Networking Technologies
Chapter: Routers and Switches - Input-buffered switches and matchings
This chapter is heavily based on material by Holger Karl (Universität Paderborn) and Prof. Yashar Ganjali (University of Toronto)

Content
Head-of-line blocking: the missing analysis
- A Markov model
- A closed-form approach
Virtual output queues revisited
- Under uniform traffic
- Under non-uniform traffic, with known traffic matrix
- Under non-uniform traffic, with unknown traffic matrix
  - Maximum size matching
  - Maximum weight matching
  - Maximal size matching
Input-Queued Switch: How It Works
The switch matches inputs and outputs. Packets are queued at the inputs.
Input-Queued Switch: Speed-Up Advantage
At most one packet leaves from each input (and arrives at each output) ⇒ speed-up = 1, not N.
Speedup: ratio of switch speed to port speed.

Head-of-Line Blocking
Blocked! Blocked! Blocked!
The switch is NOT work-conserving! (There is a packet for an output port, but the port is idle.)
Assumptions for a Simple Analysis
Assumptions:
- Time is slotted, all packets have the same size.
- At each time-slot, at each of the N inputs: i.i.d. packet arrivals with probability ρ.
- Each packet is destined for one of the N outputs uniformly at random.
- By symmetry, consider some given output.
- Scheduling: at each time-slot, each output picks an HoL packet uniformly at random, or is idle if there is no HoL packet for this output.
Question: What throughput ρ (per link) can we get? More precisely: what is the largest ρ so that the queues stay finite?

HoL Blocking in a 2x2 Switch
Balls-and-Bins Model
Saturated switch: assume an infinite number of packets in each queue, all destined to some output u.a.r. (random coloring of packets).
Balls-and-bins model:
- N outputs ↔ N bins
- N HoL packets ↔ N balls
The balls reflect ONLY the HoL packets; the queue state is not relevant.
At each time-slot:
- Remove one ball from each non-empty bin.
- Assign new balls to bins independently and u.a.r. (probability 1/N each).
It does not matter which ball we remove (why?!)
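The balls-and-bins dynamics above are easy to simulate. The following Monte Carlo sketch (not from the slides; the function name and parameters are ours) estimates the per-output saturation throughput: for a 2x2 switch it comes out near 75%, and for larger N it drops toward the limit derived later in the chapter.

```python
import random

def saturated_throughput(n_ports, n_slots=100_000, seed=1):
    """Monte Carlo estimate of per-output throughput of a saturated
    N x N input-queued FIFO switch (balls-and-bins model).
    bins[j] = number of HoL packets currently destined to output j."""
    rng = random.Random(seed)
    bins = [0] * n_ports
    # Start saturated: every input holds a HoL packet with a random destination.
    for _ in range(n_ports):
        bins[rng.randrange(n_ports)] += 1
    departures = 0
    for _ in range(n_slots):
        served = sum(1 for b in bins if b > 0)  # each non-empty bin sends one packet
        departures += served
        for j in range(n_ports):
            if bins[j] > 0:
                bins[j] -= 1
        # Each input that just sent a packet reveals a fresh HoL packet, u.a.r.
        for _ in range(served):
            bins[rng.randrange(n_ports)] += 1
    return departures / (n_slots * n_ports)
```

Note that the invariant "exactly N balls in the system" holds: every departed ball is replaced by the next HoL packet of the same input.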
Markov Chain
There are three states for the bin occupancy: (2,0), (1,1), (0,2).
E.g., (2,0) means both HoL packets are destined to the first output.
We get a discrete-time Markov chain over the states (2,0), (1,1), (0,2).

Transition Probabilities in the Markov Chain
Transition from (2,0): one of the two packets is served; the new HoL packet goes to output 1 or output 2 with probability 1/2 each, so the chain moves to (2,0) or (1,1) with probability 1/2 each.
Transition Probabilities in the Markov Chain
From (1,1): both packets are served; the two new HoL packets give (2,0) w.p. 1/4, (1,1) w.p. 1/2, (0,2) w.p. 1/4. State (0,2) mirrors (2,0).
Equilibrium state distribution: π = {1/4, 1/2, 1/4}
Output throughput = 1 − P(output empty) = 1 − 1/4 = 75%

Side Note: State Collapse
The Markov chain is symmetric, so the states (2,0) and (0,2) can be collapsed into one.
Equilibrium (collapsed) state distribution: (1/2, 1/2); split the collapsed state evenly to get the real state distribution.
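The equilibrium distribution and throughput above can be verified numerically. A minimal sketch (not from the slides) encodes the three-state transition matrix and runs power iteration:

```python
# Transition matrix over the states [(2,0), (1,1), (0,2)].
P = [
    [1/2, 1/2, 0.0],   # from (2,0): new ball hits output 1 or 2 w.p. 1/2 each
    [1/4, 1/2, 1/4],   # from (1,1): two new balls, 4 equally likely outcomes
    [0.0, 1/2, 1/2],   # from (0,2): mirror image of (2,0)
]

pi = [1/3, 1/3, 1/3]
for _ in range(100):  # power iteration converges quickly for this small chain
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

throughput = 1 - pi[2]  # output 1 is idle only in state (0,2)
```

Running this yields pi ≈ (1/4, 1/2, 1/4) and throughput ≈ 0.75, matching the analysis.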
3x3 Switch
Markov chain with the following states: (3,0,0), (0,3,0), (0,0,3), (2,1,0), (2,0,1), (1,2,0), (0,2,1), (0,1,2), (1,0,2), (1,1,1)
State collapse into: (3,0,0), (2,1,0) and (1,1,1)
(Transition diagram with probabilities omitted; see figure.)

3x3 Switch
The equilibrium state distribution gives a per-output throughput of 75% for 2x2 and 68% for 3x3 - but the state space explodes for large N.
Method #2: Closed-Form Equations for Balls in Bins
Suppose (!) the HoL contention at each output behaves as an M/D/1 system.
From teletraffic/performance analysis, the Pollaczek-Khinchine formula gives the mean number in system: E[k] = ρ + ρ² / (2(1 − ρ)).
In saturation, the HoL packets keep E[k] = 1, hence
  ρ + ρ² / (2(1 − ρ)) = 1  ⟺  ρ² − 4ρ + 2 = 0  ⟹  ρ = 2 − √2 ≈ 58.6%
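The quadratic above can be checked in two lines; the sketch below (ours, not from the slides) solves ρ² − 4ρ + 2 = 0 and confirms the root against the M/D/1 mean-queue formula:

```python
import math

# Root of rho^2 - 4*rho + 2 = 0 that lies in [0, 1]:
rho = 2 - math.sqrt(2)

# Sanity check: the M/D/1 mean number in system E[k] = rho + rho^2 / (2(1-rho))
# must equal 1 in saturation.
mean_queue = rho + rho**2 / (2 * (1 - rho))
```

This gives rho ≈ 0.5858, i.e. the well-known 58.6% HoL throughput limit.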
HoL Blocking vs. OQ Switch
(Figure: average delay vs. load; the IQ switch with HoL blocking saturates at 2 − √2 ≈ 58% load, the OQ switch at 100%.)
VOQs: How Packets Move
(Figure: virtual output queues and scheduler.)

Basic Switch Model
(Figure: N x N switch with arrivals A_ij(n), VOQ occupancies Q_ij(n), schedule S(n), and departures D_ij(n).)
Notation: Arrivals
A_ij(n): packet arrivals at input i for output j at time-slot n; A_ij(n) = 0 or 1
λ_ij = E[A_ij(n)]: arrival rate
Λ = [λ_ij]: traffic matrix
A = [A_ij(n)] is admissible iff:
- For all i, Σ_j λ_ij < 1: no input is oversubscribed
- For all j, Σ_i λ_ij < 1: no output is oversubscribed

Notation: Schedule
Q_ij(n): queue size of VOQ(i,j); Q = [Q_ij(n)]
S_ij(n): whether the schedule connects input i to output j; S_ij(n) = 0 or 1
No speedup: each input is connected to at most one output, each output to at most one input.
We will assume that each input is connected to exactly one output, and each output to exactly one input ⇒ S = [S_ij(n)] is a permutation matrix.
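The admissibility condition just stated translates directly into code. A minimal helper (ours, not from the slides):

```python
def is_admissible(lam):
    """Admissibility check from the text: every row sum (input load)
    and every column sum (output load) must be strictly below 1."""
    n = len(lam)
    rows_ok = all(sum(lam[i][j] for j in range(n)) < 1 for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) < 1 for j in range(n))
    return rows_ok and cols_ok
```

For example, uniform traffic with λ_ij = 0.2 on a 4x4 switch is admissible (row and column sums are 0.8), while any row summing to 1.1 is not.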
Scheduling Algorithm
What it does: determine S(n)
How: either using the traffic matrix Λ, or, in most cases, using the queue sizes Q(n) (because Λ is unknown)
Objective: 100% throughput, so that the lines are fully utilized
Secondary objective: minimize packet delays/backlogs

What is 100% Throughput?
Work-conserving scheduler
Definition: if there is one or more packet in the system for an output, then that output is busy; i.e., the system is kept busy at all times.
An output-queued switch is trivially work-conserving. Each output can be modeled as an independent single-server queue: if λ < µ then E[Q_ij(n)] < c for some c. Therefore, we say it achieves 100% throughput.
For fixed-size packets, work-conservation also minimizes average packet delay. Q: What happens when packet sizes vary?
Non-work-conserving scheduler
An input-queued switch is, in general, non-work-conserving. Q: What definitions make sense for 100% throughput?
Common Definitions of 100% Throughput
- Work-conserving
- For all n, i, j: Q_ij(n) < c
- Weaker: for all n, i, j: E[Q_ij(n)] < c. We will focus on this definition.
- Departure rate = arrival rate
Uniform Traffic
Definition: λ_ij = λ for all i, j, i.e., all input-output pairs have the same traffic rate.
Condition for admissible traffic: λ < 1/N
Example: Bernoulli traffic with λ = ρ/N; arrivals at input i are Bernoulli(ρ) and i.i.d.

100% Throughput for Uniform Traffic
Nearly all algorithms in the literature give 100% throughput when traffic is uniform. For example:
- Uniform cyclic
- Random permutation
- Wait-until-full
- Maximum size matching (MSM)
- Maximal size matching (e.g. WFA, PIM, iSLIP)
Uniform Cyclic Scheduling
Each (i, j) pair is served every N time-slots, so each VOQ is an M/D/1 queue with arrival rate λ = ρ/N < 1/N and service rate 1/N: stable for ρ < 1.

Wait Until Full
We don't have to do much at all to achieve 100% throughput when arrivals are Bernoulli i.i.d. uniform. Simulation suggests that the following algorithm leads to 100% throughput.
Wait-until-full:
- If any VOQ is empty, do nothing (i.e. serve no queues).
- If no VOQ is empty, pick a random permutation.
Simple Algorithms with 100% Throughput
Wait-until-full, Uniform Cyclic, Maximal Matching Algorithm (iSLIP), MSM

Uniform Random Scheduling
At each time-slot, pick a schedule u.a.r. among the N cyclic permutations, or among all N! permutations.
Then P(S_ij = 1) = 1/N. Q: why?
Uniform Random Scheduling
Each VOQ then behaves as an M/M/1-like birth-death chain with arrival rate λ = ρ/N and service rate µ = 1/N.
We get: E[Delay] grows proportionally to N; stable when ρ < 1.
Non-Uniform Traffic
Assume an admissible, non-uniform 4x4 traffic matrix Λ (the numeric entries of the slide example are not reproduced here).

Uniform Schedule?
What if we use a uniform schedule? Each VOQ is serviced at rate µ = 1/N = 1/4, but arrivals to VOQ(1,1) have rate λ11 = 0.5.
Arrival rate > departure rate ⇒ switch unstable!
We need to adapt the schedule to the traffic matrix.
Example Scheduling (Trivial)
Assume we know the traffic matrix, it is admissible, and it follows a permutation, e.g.

  Λ = 0.99 · P  for some permutation matrix P.

Then we can simply choose S(n) = P for all n.

Example Scheduling
Assume we know the traffic matrix, and it does not follow a permutation. For example:

  Λ = 0.99 · [[1/2, 1/2, 0, 0],
              [1/2, 1/2, 0, 0],
              [0,   0,   1, 0],
              [0,   0,   0, 1]]

Then we can choose a sequence of service permutations, e.g. S(1) = the identity permutation and S(2) = the permutation that swaps the first two inputs, and either cycle through them or pick one at random each slot.
In general, if we know an admissible Λ, can we pick a sequence S(n) so that λ < µ for every VOQ?
Definitions
Doubly stochastic matrix: an N x N matrix with non-negative entries where all rows and all columns sum to 1.
Doubly sub-stochastic matrix: an N x N matrix with non-negative entries where the sum of entries in each row and each column is less than or equal to 1.

Doubly Stochastic Matrices
Λ is admissible, i.e. doubly sub-stochastic.
Theorem (von Neumann): There exists Λ' = {λ'_ij} such that Λ ≤ Λ', i.e. every element of Λ is at most the corresponding element of Λ', and Λ' is doubly stochastic: Σ_i λ'_ij = Σ_j λ'_ij = 1.
(Numeric 4x4 example omitted.)
Doubly Stochastic Matrices
Fact 1. The set of doubly stochastic matrices is convex and compact (closed and bounded) in R^(N x N).
Fact 2. Any convex, compact set in R^(N x N) has extreme points, and is equal to the convex hull of its extreme points (Krein-Milman theorem).

Doubly Stochastic Matrices
Theorem (Birkhoff): Permutation matrices are the extreme points of the set of doubly stochastic matrices.
In other words: given Λ', there exist K numbers a_k > 0 and K permutation matrices P_k such that Λ' = Σ_k a_k P_k (von Neumann / Birkhoff).
Note: K ≤ N² − 2N + 2
Birkhoff-von Neumann (BvN) Scheduling
BvN decomposition: Λ ⇒ Λ' ⇒ {a_k, P_k}
BvN weighted random scheduling: pick P_k with probability a_k.
Theorem: BvN scheduling achieves 100% throughput.

BvN Example
For a given admissible Λ, how do we find a feasible Λ'? Lots of more or less complicated ways are possible:
- Linear optimization
- Build a linear equation system for the values added to each λ_ij and add additional constraints as needed
- ...
BvN Example
Let's take a doubly stochastic matrix Λ' (the numeric 4x4 example of the slides is not reproduced here). How do we get a valid schedule?

BvN Example
Define a helper matrix of integer weights, e.g. the entries of Λ' scaled by a common denominator. Choose a permutation P (at random or with a strategy) and subtract it from the helper matrix as often as possible; record the resulting coefficient and the remainder matrix.
BvN Example
Repeat! In each step, pick a permutation contained in the support of the remainder, subtract it as often as possible, and record the coefficient a_k, until the remainder is the all-zero matrix.

BvN Example
Now we have written Λ' as a weighted sum of permutation matrices, Λ' = Σ_k a_k P_k (the numeric coefficients of the slide example are not reproduced here).
Note that Σ_{k=1}^{K} a_k = 1.
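The subtract-a-permutation loop above is mechanical enough to automate. The sketch below (ours, not from the slides) finds a permutation in the support of the remainder with Kuhn's augmenting-path matching (Birkhoff's theorem guarantees one exists while the matrix is doubly stochastic) and subtracts it with the largest possible coefficient:

```python
def perfect_matching(support):
    """Kuhn's augmenting-path algorithm: match each row i to a column j
    with support[i][j] == True. Returns match[j] = row assigned to column j."""
    n = len(support)
    match = [-1] * n

    def try_row(i, seen):
        for j in range(n):
            if support[i][j] and not seen[j]:
                seen[j] = True
                if match[j] == -1 or try_row(match[j], seen):
                    match[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            return None
    return match

def bvn_decompose(lam, eps=1e-9):
    """Greedy BvN decomposition of a doubly stochastic matrix:
    returns a list of (a_k, P_k) with sum_k a_k * P_k == lam."""
    n = len(lam)
    lam = [row[:] for row in lam]  # work on a copy
    terms = []
    while max(max(row) for row in lam) > eps:
        support = [[lam[i][j] > eps for j in range(n)] for i in range(n)]
        match = perfect_matching(support)
        a = min(lam[match[j]][j] for j in range(n))  # largest feasible coefficient
        P = [[0] * n for _ in range(n)]
        for j in range(n):
            P[match[j]][j] = 1
            lam[match[j]][j] -= a
        terms.append((a, P))
    return terms
```

For a doubly stochastic input, the returned coefficients sum to 1 and the weighted permutations reconstruct the matrix, exactly as in the worked example.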
Proof: 100% Throughput
Lindley's equation describes the queue evolution from arrivals and departures.
Arrival rate: P(A_ij(n) = 1) = E[A_ij(n)] = λ_ij
Departure rate: the BvN schedule serves VOQ(i,j) with probability Σ_k a_k (P_k)_ij = λ'_ij ≥ λ_ij.
Arrival rate < departure rate ⇒ 100% throughput.
Unknown Traffic Matrix
We want to maximize throughput, but the traffic matrix is unknown ⇒ we cannot use BvN.
Idea/intuition: maximize instantaneous throughput; in other words, transfer as many packets as possible at each time-slot.
Maximum Size Matching (MSM) algorithm, using a request graph:
- A bipartite graph with input ports i and output ports j
- Edge from i to j if Q_ij > 0 (the queue size of the virtual output queue)
- Goal: find the largest number of edges such that each node has at most one edge (in or out)

Maximum Size Matching (MSM)
MSM maximizes instantaneous throughput.
(Figure: request graph with edges where Q_ij(n) > 0 and a corresponding maximum size bipartite match.)
MSM algorithm: among all maximum size matches, pick a random one.
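For small switches, a maximum size match can be found by brute force, which makes the definition concrete (a sketch of ours, not the slides' algorithm; real implementations use the flow formulation discussed next):

```python
from itertools import permutations

def msm(Q):
    """Brute-force maximum size matching for small N: try every
    input -> output permutation and keep one serving the most non-empty VOQs."""
    n = len(Q)
    best, best_size = None, -1
    for perm in permutations(range(n)):
        size = sum(1 for i in range(n) if Q[i][perm[i]] > 0)
        if size > best_size:
            best, best_size = perm, size
    return best, best_size
```

E.g. with Q = [[1, 1], [1, 0]] the match that serves both inputs pairs input 1 with output 2 and input 2 with output 1.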
Implementing MSM
How can we find maximum size matches? By recasting the problem as a network flow problem - a classic math problem.

Question
Is the intuition right? In particular: is it a good idea to pick one of several maximum matchings at random?
Answer: NO! There is a counter-example for which, in a given VOQ(i,j), λ_ij < µ_ij but MSM does not provide 100% throughput.
Counter-example
Consider a non-uniform traffic pattern with Bernoulli i.i.d. arrivals, in which several VOQs carry rate 1/2 − δ (the full rate matrix and the three possible matches S(n) are shown in the slide figure, not reproduced here).
Consider the case when Q12 and Q21 both have arrivals, which happens w.p. (1/2 − δ)². When packets are processed on Q12 or Q21, the equally probable maximum size matches serve input 1 with probability bounded away from 1.
Overall, the service rate µ1 for input 1 then falls below its arrival rate, so the switch is unstable for δ small enough.
Scheduling in Input-Queued Switches
(Figure: N x N switch with arrivals A_ij(n), VOQ occupancies Q_ij(n), and schedule S*(n).)
Problem. Maximum size matching maximizes instantaneous throughput but does not take the VOQ backlogs into account.
Solution. Give higher priority to VOQs which have more packets: look at the weighted version of the bipartite matching problem.

Maximum Weight Matching (MWM)
Assign weights to the edges of the request graph (edge (i,j) with Q_ij(n) > 0 gets weight w_ij), then find the matching with maximum weight.
MWM Scheduling
1. Create the request graph.
2. Find the associated link weights.
3. Find the matching with maximum weight. How?
4. Transfer packets from the ingress lines to the egress lines based on the matching.
Question. How often do we need to calculate the MWM?

Options for Weights
Longest Queue First (LQF): the weight associated with each link is the length of the corresponding VOQ. MWM then tends to give priority to long queues, but does not necessarily serve the longest queue.
Oldest Cell First (OCF): the weight of each link is the waiting time of the HoL packet in the corresponding queue.
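With LQF weights w_ij = Q_ij, the MWM can again be brute-forced on small switches, which illustrates steps 2 and 3 above (a sketch of ours, not a practical implementation; real MWM solvers run in O(N³)):

```python
from itertools import permutations

def mwm(Q):
    """Brute-force maximum weight matching with LQF weights w_ij = Q_ij:
    enumerate all input -> output permutations, keep the heaviest."""
    n = len(Q)
    return max(permutations(range(n)),
               key=lambda perm: sum(Q[i][perm[i]] for i in range(n)))
```

Note how a single very long VOQ dominates the choice: with Q = [[10, 1], [1, 2]] the identity match (weight 12) wins even though it leaves the shorter queues waiting.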
Longest Queue First (LQF)
LQF is the name given to the maximum weight matching with weight w_ij(n) = L_ij(n), the VOQ length. But the name is so bad that people keep the name MWM! LQF doesn't necessarily serve the longest queue.
Theorem. MWM-LQF scheduling provides 100% throughput.
Problem: LQF can leave a short queue unserved indefinitely.
MWM-LQF is very important theoretically: most (if not all) scheduling algorithms that provide 100% throughput for unknown traffic matrices are variants of MWM!

Proof Idea: Use Lyapunov Functions
Basic idea: when queues become large, the MWM schedule tends to give them a negative drift. We will try to show that
  E{L(n+1) − L(n) | L(n)} < 0,
with L(n) = f(Q(n)) being a Lyapunov function and Q(n) the queue lengths written in a single vector.
Using this method we expect to find
  E{L(n+1) − L(n) | L(n)} ≤ −ε ||Q(n)|| + c,
that is, for queues that are long enough there is a negative drift.
Lyapunov Analysis: Simple Example
Suppose we have a long tube with cross section A m², a defined water level at t = 0 s, water poured in at a rate of V m³/s for t > 0 s, and an orifice of area a m² at the bottom. Let h(t) be the water level. What happens?
The volume changes according to the differential equation
  A dh/dt = V − a √(2 g h(t)),
with g the gravitational constant: a larger volume means a higher water level and hence more drainage.
Set L(t) = h(t) to be a Lyapunov function. Then
  dL/dt = (V − a √(2 g L(t))) / A < 0  ⟺  L(t) > V² / (2 g a²),
so the system is stable for any start value.

Lyapunov Functions
How do we use this approach in general?
Find a positive function L(t) that increases with some state of the system. (Some more properties are needed, e.g. it needs to map to a real value, be smooth, etc. In the example, L(t) = √h(t) or L(t) = h(t)² would have been fine, too.)
Show that dL/dt is negative for all L(t) > c. (Note: it may be positive for values below c; if it is exactly zero after some point, the system may be stable depending on the starting conditions.)
There is much more theory behind this.
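The tube example can be simulated to see the drift argument at work. The sketch below (ours; the constants A, a, V and the step size are assumed values) integrates A dh/dt = V − a √(2 g h) with forward Euler and checks that the level settles at the point where inflow equals outflow:

```python
import math

A, a, V, g = 1.0, 0.1, 0.5, 9.81   # assumed tube geometry and inflow
h, dt = 0.0, 0.01                   # start empty, 10 ms Euler steps
for _ in range(100_000):            # simulate 1000 s
    h += dt * (V - a * math.sqrt(2 * g * h)) / A

h_star = V**2 / (2 * g * a**2)      # equilibrium level: inflow = outflow
```

Above h_star the drift is negative, below it positive, so h converges to h_star from any start value, which is exactly the stability statement of the Lyapunov argument.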
Back to the Outline of the Proof
Intuition: can we find S* such that Q^T(n) Λ ≤ Q^T(n) S* for any admissible Λ (treating the matrices as vectors)?
Solution: we look for the worst-case traffic
  Λ* = argmax_Λ Q^T(n) Λ,  subject to  Σ_i λ_ij ≤ 1 ∀j,  Σ_j λ_ij ≤ 1 ∀i,  λ_ij ≥ 0 ∀(i,j),
and compare it with
  S* = argmax_S Q^T(n) S,  subject to  Σ_i S_ij = 1 ∀j,  Σ_j S_ij = 1 ∀i,  S_ij ≥ 0 ∀(i,j).
We know (Birkhoff: "permutations are the extreme points of the doubly stochastic matrices") that the maximum over the first set is attained at a permutation. Therefore, for any admissible Λ,
  Q^T(n) Λ ≤ Q^T(n) S*.

Outline of Proof
1. We know that if we pick S* = argmax_S Q^T(n) S, then Q^T(n)(Λ − S*) ≤ 0.
2. Next we can use this fact to show that
  E{L(n+1) − L(n) | L(n)} ≤ −ε ||Q(n)|| + c,
with L(n) = Q^T(n) Q(n), i.e. a quadratic Lyapunov function.
3. Hence, if Q(n) is large enough, the buffers do not grow: once the queues are long enough, there is an expected single-step downward drift in occupancy.
Note: proof details in the paper by McKeown et al.
LQF Variants
Question: what if w_ij(n) = L_ij(n)² or w_ij(n) = √(L_ij(n))?
What if weight w_ij(n) = W_ij(n) (waiting time)? Preference is given to cells that have waited a long time. Is it stable? We call that algorithm OCF (Oldest Cell First). Remember that it doesn't guarantee to serve the oldest cell!

Summary of MWM Scheduling
MWM-LQF scheduling provides 100% throughput, but it can starve some of the packets.
MWM-OCF scheduling gives 100% throughput, with no starvation.
Question. Are these fast enough to implement in real switches? Not obviously so (recall: 8 ns!). A non-trivial amount of parallelization of these algorithms would be necessary. Or: relax the requirements and look at a less challenging version of the matching problem.
Simulation of a Simple Example
(Simulation figure not reproduced here.)
The Story So Far
Output-queued switches: best performance, but impractical (need a speedup of N).
Input-queued switches: head-of-line blocking → VOQs; known traffic matrix → BvN; unknown traffic matrix → MWM.

Complexity of Maximum Matchings
Maximum size matchings: typical complexity O(N^2.5)
Maximum weight matchings: typical complexity O(N³)
In general: hard to implement in hardware, and slooooow.
Can we find a faster algorithm? No, but we can relax the requirements.
Maximal Matching
A maximal matching is a matching in which adding any edge destroys the matching property.
Realization: a maximal matching can be computed by algorithms in which edges are added one at a time and never removed later - no augmenting paths as in the Ford-Fulkerson network flow (those remove edges added earlier).
Consequence: no input and output are left unnecessarily idle.

Example of Maximal Matching
(Figure: a 6-port request graph with a maximal size matching next to a maximum size matching.)
Properties of Maximal Matchings
In general, a maximal matching is much simpler to implement and has a much faster running time.
A maximal size matching is at least half the size of a maximum size matching. (Why?)
Simplest case: Greedy LQF. Further (more relevant) examples: WFA, PIM, iSLIP.

Greedy LQF
Greedy LQF (Greedy Longest Queue First) is defined as follows: pick the VOQ with the most packets (if there are ties, pick at random among the VOQs that are tied). Say it is VOQ(i1,j1). Then, among all VOQs on free inputs and outputs, again pick the VOQ with the most packets, say VOQ(i2,j2) with i2 ≠ i1 and j2 ≠ j1. Continue likewise until the algorithm converges.
Greedy LQF is also called iLQF (iterative LQF) and Greedy Maximal Weight Matching.
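The procedure just described can be sketched as follows (ours, not from the slides; for simplicity, ties are broken deterministically rather than at random):

```python
def greedy_lqf(Q):
    """Greedy LQF: repeatedly pick the largest remaining VOQ among the
    still-free inputs/outputs; the result is a maximal matching."""
    n = len(Q)
    free_in, free_out = set(range(n)), set(range(n))
    schedule = []
    while True:
        cand = [(Q[i][j], i, j) for i in free_in for j in free_out if Q[i][j] > 0]
        if not cand:                 # no edge can be added: matching is maximal
            return schedule
        _, i, j = max(cand)          # longest remaining VOQ (deterministic tie-break)
        schedule.append((i, j))
        free_in.remove(i)
        free_out.remove(j)
```

For Q = [[3, 1], [2, 2]] it first serves VOQ(1,1) (length 3) and then VOQ(2,2), converging after two of the at most N iterations.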
Properties of Greedy LQF
The algorithm converges in at most N iterations. (Why?)
Greedy LQF results in a maximal size matching. (Why?)
Greedy LQF produces a matching that has at least half the size and half the weight of a maximum weight matching. (Why?)

Wave Front Arbiter (WFA) [Tamir and Chi, 1993]
(Figure: request matrix and resulting match.)
Wave Front Arbiter
(Figure: requests and resulting match; the wave front sweeps the anti-diagonals of the request matrix.)

Wave Front Arbiter Implementation
(Figure: an N x N array of simple combinational logic blocks, one per (input, output) pair.)
Wave Front Arbiter: Wrapped WFA (WWFA)
N steps instead of 2N − 1.
(Figure: requests and match for the wrapped arbiter.)

Properties of Wave Front Arbiters
- The feed-forward (i.e. non-iterative) design lends itself to pipelining.
- Always finds a maximal match.
- Usually requires a mechanism to prevent VOQ(1,1) from getting preferential service.
- In principle, can be distributed over multiple chips.
Parallel Iterative Matching (PIM) [Anderson et al., 1993]
(Figure: one PIM iteration with u.a.r. arbiter selection - F1: Requests, F2: Grant, F3: Accept/Match.)

PIM Properties
- Guaranteed to find a maximal match in at most N iterations. (Why?)
- In each phase, each input and output arbiter can make decisions independently.
- In general, it converges to a maximal match in fewer than N iterations. How many iterations should we run?
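One request/grant/accept round of PIM can be sketched as follows (ours, not the paper's code; `pim_iteration` and its interface are assumed names):

```python
import random

def pim_iteration(requests, rng):
    """One PIM iteration: Request -> Grant -> Accept, all arbiters u.a.r.
    requests[i] = set of outputs input i has cells for."""
    n = len(requests)
    # Grant phase: each requested output independently grants one requester u.a.r.
    grants = {}
    for j in range(n):
        requesters = [i for i in range(n) if j in requests[i]]
        if requesters:
            grants[j] = rng.choice(requesters)
    # Accept phase: each input independently accepts one of its grants u.a.r.
    matches = []
    for i in range(n):
        granted = [j for j in sorted(grants) if grants[j] == i]
        if granted:
            matches.append((i, rng.choice(granted)))
    return matches
```

In a full scheduler this round is repeated on the still-unmatched ports until no more edges can be added, which is what yields a maximal match.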
Parallel Iterative Matching: Convergence Time
Number of iterations to converge: E[U_i] ≤ N / 4^i, hence E[C] ≈ log N, where
  C = number of iterations required to resolve the connections,
  N = number of ports,
  U_i = number of unresolved connections after iteration i.
(Anderson et al., "High-Speed Switch Scheduling for Local Area Networks", 1993.)

Parallel Iterative Matching
(Figure: example PIM run.)
Parallel Iterative Matching
(Figures: performance of PIM with a single iteration and with multiple iterations.)
iSLIP [McKeown, 1999]
(Figure: one iSLIP iteration - F1: Requests, F2: Grant, F3: Accept/Match.)

iSLIP Operation
Grant phase: each output selects the requesting input at its pointer, or the next input in round-robin order. It only updates its pointer if the grant is accepted.
Accept phase: each input selects the granting output at its pointer, or the next output in round-robin order.
Consequence: under high load, the grant pointers tend to move to unique values.
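The grant/accept pointer discipline just described can be sketched as follows (ours, simplified from the published algorithm; function and variable names are assumptions):

```python
def islip(requests, grant_ptr, accept_ptr, n_iter=1):
    """iSLIP sketch. requests[i][j] = True if VOQ(i,j) is non-empty.
    grant_ptr[j] / accept_ptr[i] are the round-robin pointers (mutated
    only when a grant is accepted, as the text prescribes)."""
    n = len(requests)
    matched_in, matched_out = set(), set()
    matches = []
    for _ in range(n_iter):
        # Grant phase: each unmatched output grants the first requesting
        # unmatched input at or after its pointer.
        grants = {}
        for j in range(n):
            if j in matched_out:
                continue
            for k in range(n):
                i = (grant_ptr[j] + k) % n
                if i not in matched_in and requests[i][j]:
                    grants[j] = i
                    break
        # Accept phase: each unmatched input accepts the first granting
        # output at or after its pointer; pointers advance on acceptance.
        for i in range(n):
            if i in matched_in:
                continue
            granting = [j for j, gi in grants.items() if gi == i]
            for k in range(n):
                j = (accept_ptr[i] + k) % n
                if j in granting:
                    matches.append((i, j))
                    matched_in.add(i)
                    matched_out.add(j)
                    accept_ptr[i] = (j + 1) % n
                    grant_ptr[j] = (i + 1) % n
                    break
    return matches
```

With all VOQs loaded on a 2x2 switch and all pointers at 0, the first iteration matches (input 1, output 1), after which the updated pointers desynchronize the arbiters; a second iteration completes the match. This is the pointer-desynchronization effect behind the "unique pointer values under high load" remark.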
iSLIP Properties
- Random under low load, TDM under high load.
- Lowest priority to MRU (most recently used); fair to the outputs.
- Converges in at most N iterations. (On average, simulations suggest fewer than log₂ N.)
- Implementation: N priority encoders.
- 100% throughput for uniform i.i.d. traffic, but some pathological patterns can lead to low throughput.
iSLIP
(Figures: example iSLIP runs.)

iSLIP Implementation
(Figure: N programmable priority encoders, one grant/accept pair per port with log₂ N state bits each, feeding the decision logic.)
Maximal Matches
Maximal matching algorithms are widely used in industry (especially algorithms based on WFA and iSLIP).
PIM and iSLIP are rarely run to completion (i.e. they are sub-maximal).
We will see that a maximal match with a speedup of 2 is stable for non-uniform traffic.

Conclusion
Switch architecture decision: output buffer or input buffer.
Output buffering is conceptually simple, but requires expensive hardware (switching fabric!).
Input buffering replaces hardware by brainware: clever scheduling gives comparable performance with a much simpler/cheaper switching fabric, but it needs non-trivial computational effort.
References
- N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand, "Achieving 100% Throughput in an Input-Queued Switch (Extended Version)," IEEE Transactions on Communications, Vol. 47, No. 8, August 1999.
- A. Mekkittikul and N. McKeown, "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches," IEEE Infocom '98, April 1998, San Francisco.
- A. Schrijver, Combinatorial Optimization - Polyhedra and Efficiency, Springer-Verlag, 2003.
- T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High-Speed Switch Scheduling for Local-Area Networks," ACM Transactions on Computer Systems, 11(4):319-352, November 1993.
- Y. Tamir and H.-C. Chi, "Symmetric Crossbar Arbiters for VLSI Communication Switches," IEEE Transactions on Parallel and Distributed Systems, 4(1):13-27, 1993.
- N. McKeown, "The iSLIP Scheduling Algorithm for Input-Queued Switches," IEEE/ACM Transactions on Networking, 7(2):188-201, April 1999.