Fisher Information in Flow Size Distribution Estimation


Paul Tune, Member, IEEE, and Darryl Veitch, Fellow, IEEE

Abstract: The flow size distribution is a useful metric for traffic modeling and management. Its estimation based on sampled data, however, is problematic. Previous work has shown that flow sampling (FS) offers enormous statistical benefits over packet sampling, but its high resource requirements preclude its use in routers. We present Dual Sampling (DS), a two-parameter family which, to a large extent, provides FS-like statistical performance by approaching FS continuously, with just packet-sampling-like computational cost. Our work utilizes a Fisher information based approach recently used to evaluate a number of sampling schemes, excluding FS, for TCP flows. We revise and extend the approach to make rigorous and fair comparisons between FS, DS and others. We show how DS significantly outperforms other packet based methods, including Sample and Hold, the closest packet sampling-based competitor to FS. We describe a packet sampling-based implementation of DS and analyze its key computational costs to show that router implementation is feasible. Our approach offers insights into numerous issues, including the notion of flow quality for understanding the relative performance of methods, and how and when employing sequence numbers is beneficial. Our work is theoretical with some simulation support and case studies on Internet data.

Index Terms: Fisher information, flow size distribution, Internet measurement, router measurement, sampling.

I. INTRODUCTION

The distribution of flow size, that is the number of packets in a flow, is a useful metric for traffic modelling and management, and is important for security because of the role small flows play in attacks. As is now well known however, its estimation based on sampled data is problematic. Currently, sampling decisions in routers are made on a per-packet basis, with only sampled packets being subsequently assembled into (sampled) flows. Duffield et al. [1] were the first to
point out that simple packet sampling strategies such as 1 in N periodic or i.i.d. (independent, identically distributed) packet sampling have severe limitations, in particular a strong flow length bias which allows the tail of the flow size distribution to be recovered, but dramatically obscures the details of small flows. They explored the use of TCP SYN packets to improve the resolution at the small flow end of the spectrum. Hohn et al. [2], [3] explored these difficulties further and pointed out that flow sampling, where the sampling decision is made directly on flows, resulting in all packets belonging to any sampled flows being collected, has enormous statistical advantages. However, flow sampling has not been pursued further nor found its way into routers, partly because it implies that lookups be performed on every packet, which is very resource intensive.

(The authors are with the Department of E&E Engineering, The University of Melbourne, Australia; e-mail: {lstune@ee, dveitch@}unimelb.edu.au.)

More recently, Ribeiro et al. [4] explored the use of TCP sequence numbers to improve estimation for TCP flows. The idea is that the presence of packets which are not physically sampled can be inferred by noting the increasing byte count given by the sequence number fields of sampled packets. By using the Fisher information as a metric of the effectiveness of sampling in retaining information about the original flow sizes, they showed that this helps greatly to fill in the holes left by packet sampling. However, they did not address whether these techniques outperform flow sampling (FS).

In this paper we revisit FS in the context of TCP flows. Our first contribution is to explain how the approach of [4] can be reformulated and extended to include FS. This provides a framework for our second contribution: proofs that FS outperforms existing methods by a large margin, though they themselves greatly improve upon simple packet sampling. Our results are rigorous, based on explicit calculation and comparison of the Fisher Information matrices of
competing schemes.

With the statistical reputation of FS thus reinforced, the challenge is to find methods which can somehow approach or approximate flow sampling in order to benefit from its information theoretic efficiency, but with lower resource requirements. We show how this can be done.

The computational problem for FS can be described as follows. To capture the variety present in traffic flows and to provide the raw material for a variety of current (and future) metrics, many flows must be sampled. This implies large flow tables, which in turn implies the use of slower but cheaper DRAM rather than the faster but expensive SRAM [5]. However, DRAM is not fast enough to perform lookups for every packet, as required by a straightforward implementation of FS, for today's high capacity links. The question then becomes: how can flow sampling be implemented using per-packet decisions, in other words using some form of packet sampling?

The main contribution of this paper is the introduction of Dual Sampling (DS), a hybrid approach combining the advantages of both packet and flow sampling. It is a two-parameter sampling family which includes FS as a special case and allows FS to be approached continuously, enabling a tradeoff of sampling efficiency against computational cost. Computationally, it can be implemented via a modified form of two-speed or dual packet sampling which circumvents the problem of slow DRAM. There is a cost in terms of wasted samples, but we show that this can be borne in high speed routers. Following [4], DS benefits from the use of TCP sequence numbers although it can also be used without them, and we provide insight into how and when they have an impact. We show rigorously that DS outperforms the methods proposed in [4]. We also compare and contrast DS with the well known Sample and Hold scheme [6]. We show that Sample and Hold performs quite well, though not as well as DS.

Finally, we introduce SYN+SEQ+FIN, another sampling method which enables flow sampling to be perfectly achieved (aside from errors in the mapping of byte to packet counts) at very low computational cost, well below that even of packet sampling. Its disadvantage is that it exploits the TCP FIN field, when not all TCP flows terminate correctly with a FIN packet.

With its explicit use of TCP protocol information in most cases, our work applies to TCP flows only. However, the ideas and results could apply to other kinds of flows provided that suitable substitutes could be found for connection startup (SYN), progress (sequence numbers) and termination (FIN). TCP flows still constitute the overwhelming majority of traffic in the Internet.

The rest of the paper is organized as follows. Section II describes our sampling framework and derives the Fisher information matrix and its inverse explicitly. Section III defines the sampling methods and derives their main properties. Section IV compares the methods theoretically and derives further properties explaining their performance, and Section V explores in more detail how sequence numbers reduce estimation variance. Section VI introduces a simple model for computational cost and uses it to define and solve an optimization problem for sampling performance under constraints. Section VII applies the methods to real Internet data and shows that DS performs favorably with flow sampling in practice, and has better performance than Sample and Hold. A closed-form unbiased estimator is proposed for DS and Sample and Hold which achieves the Cramér-Rao lower bound asymptotically, eliminating the need for iterative optimization algorithms. We conclude and discuss future work in Section VIII.

This paper is an extended and enhanced version of the conference paper [7]. The main additions
relate to the inclusion of the Sample and Hold method throughout the paper, several new theorems and counter-examples on method comparison, and the inclusion of a new major data set.

II. THE SAMPLING FRAMEWORK

In this section we establish a framework to define and analyze sampling techniques applied to an idealized view of TCP flows on a link. Nominally, we imagine that such flows are defined by the usual 5-tuple of origin and destination IP addresses, port numbers, and TCP protocol field, together with a timeout. For the analysis we make a number of simplifying assumptions: (i) each flow begins with a SYN packet and contains no other SYN, (ii) flows are not split (this can occur through timeouts or flow table clearing), (iii) all necessary protocol information (5-tuple, SYN/FIN bits and sequence numbers) can be observed, and (iv) per-flow sequence numbers count packets, not bytes. Assumptions (iii) and (iv) will be discussed/relaxed when we deal with real data in Section VII. Note that we do respect TCP's per-flow random initialization of sequence numbers. Hence their absolute value holds no information on the number of packets in a flow; only differences of sequence numbers matter. This is crucial for the analysis.

A. The Flow Model

We consider a measurement interval containing $N_f$ flows. Let $m_i$ denote the size of flow $i$ (the number of packets it contains). It satisfies $1 \le m_i \le W$, where $1 \le W < \infty$ is the maximum flow size. The total number of packets is $n = \sum_{i=1}^{N_f} m_i$. Let $M_j$ be the number of flows of size $j$, $1 \le j \le W$, that is $M_j = \sum_{i: m_i = j} 1$, and $N_f = \sum_{j=1}^{W} M_j$. The flow size distribution is the set $\boldsymbol{\theta} = \{\theta_1, \theta_2, \ldots, \theta_W\}$ of relative frequencies, that is

$$\theta_j = \frac{M_j}{N_f} \tag{1}$$

where $0 \le \theta_j \le 1$ and $\sum_{j=1}^{W} \theta_j = 1$. Note that $\{M_j\}$ and $\{\boldsymbol{\theta}, N_f\}$ are equivalent and complete descriptions of the flows in the measurement interval. They are sets of deterministic parameters, not random variables: effectively a deterministic flow size model.

Most of the literature on traffic sampling follows the above viewpoint, where the data is deterministic, the only randomness
being introduced through the sampling itself. An exception is the work of Hohn and Veitch ([2], [3]) where randomness arises both through the traffic model and the sampling process, which makes the analysis considerably more difficult, but less generic.

B. General Random Sampling

The effect on flow size of random sampling can be described as follows: from an original flow of size $k$, only $j$ packets, $0 \le j \le k$, are sampled (retained), with probability

$$b_{jk} = \Pr(\text{sampled flow has } j \text{ pkts} \mid \text{original flow has } k \text{ pkts}).$$

The operation of the sampling scheme is entirely defined by the $b_{jk}$, which can be assembled into a $(W+1) \times W$ sampling matrix $\mathbf{B}$, whose $(j+1, k)$-th element is $b_{jk}$. Note that $b_{jk} = 0$ for $j > k$. By definition $\mathbf{B}$ is a (column) stochastic matrix, that is each element obeys $b_{jk} \ge 0$, and each column sums to unity.

The experimental outcome can be described by a set of random variables $\{M'_j, 0 \le j \le W\}$ where $M'_j$ counts the number of sampled flows of size $j$. Writing $m'_i$ for the size of flow $i$ after sampling,

$$M'_j = \sum_{i: m'_i = j} 1 = \sum_{i=1}^{N_f} \mathbb{1}(m'_i = j).$$

Equivalently, let $\boldsymbol{\theta}' = \{\theta'_j, 0 \le j \le W\}$ (note that the index includes 0) denote the empirical distribution of sampled flow sizes, where

$$\theta'_j = \frac{M'_j}{N_f}, \tag{2}$$

with $0 \le \theta'_j \le 1$ and $\sum_{j=0}^{W} \theta'_j = 1$. Note that $\theta'_0 \ge 0$ as some flows may not survive the thinning process. Out of the original $N_f$ flows, only $N'_f = N_f(1 - \theta'_0) = N_f - M'_0$ flows survive sampling. Define a normalized set of fractions of the sampled flow sizes $\boldsymbol{\gamma} = \{\gamma_j, 1 \le j \le W\}$ by

$$\gamma_j = \frac{\theta'_j}{\sum_{k=1}^{W} \theta'_k} = \frac{\theta'_j}{1 - \theta'_0}, \tag{3}$$

where $0 \le \gamma_j \le 1$ and $\sum_{j=1}^{W} \gamma_j = 1$. The set $\boldsymbol{\gamma}$ constitutes the naïve or directly measurable sampled flow size distribution and is equivalent to the distribution of flows conditional on at least a single packet from that flow being sampled. For $j \ge 1$, $\theta'_j$ is related to $\gamma_j$ by $(1 - \theta'_0)\gamma_j = (N'_f / N_f)\gamma_j = \theta'_j$.

C. The Unconditional Formulation

The sampled flow above includes the case $j = 0$, where the flow evaporates. It seems natural to conclude however that such cases cannot be observed. This logically leads to an analysis based on the observation of the $\gamma_j$ defined above where $j \ge 1$, which is effectively conditional: sampled flow distributions given that at least one packet is sampled. This is the approach adopted in [1], [4], [8] and in the literature generally. One of the key differences in our work is that we show that it is possible to observe the $j = 0$ case, leading to an unconditional formulation which enjoys many advantages.

To see how this is possible we return to the general context of $N_f$ flows, each one of which will be sampled in this general sense. As defined above, $N'_f$ is the number of flows of size at least 1 after sampling. The number of evaporated flows is just $N_f - N'_f$, but typically $N_f$ is not known and is regarded as a nuisance parameter which must be estimated. However, it can easily be measured by directly counting the total number of SYN packets, which is just the number of flows $N_f$. For methods which are already assuming an ability to access and perform specific actions based on whether a packet is a SYN or not, the additional assumption of being able to count all SYN packets is a natural one. It is also implementable, as a single additional counter which checks every packet and conditionally increments based on a small number of header bits is not difficult even at the highest speeds [5], as we discuss in more detail in Section VI. In
summary, by knowing $N_f$, every flow gives rise to a sampled flow, each one of which is observable, either directly ($j \ge 1$) or indirectly ($j = 0$). In other words, we can in effect observe the $\theta'_j$ over the full range from $j = 0$ to $j = W$.

The chief advantage of the unconditional formulation is the very simple form of the likelihood function for the experimental outcome for a single flow. This makes the manipulation of the Fisher information far more tractable, leading to new analytic results and insights. The other big advantage is that flow sampling can now be included. In the conditional world flow sampling is perfect by definition: if a flow is sampled at all, all its packets will be, and so there is nothing to do! The unconditional framework allows the missing part of the picture to be included, enabling meaningful comparison.

D. The Sampled Flow Distribution

Our analysis is based on the idea of selecting a typical flow, and that flows are mutually independent (a reasonable assumption if $N_f$ is very large). Since flows are in fact deterministic, this is only meaningful if we introduce a supplementary random variable $U$, uniform over the $N_f$ flows available, which performs the random flow selection. This variable, which acts invisibly behind the scenes (and is rarely discussed), is not part of the random sampling scheme itself, but is essential as it allows the $\theta_k$ parameters to be treated as probabilities, even though they are not. An example is given in Figure 1, which shows $N_f = 12$ flows before and after sampling, followed by a random flow selection. In the interests of clarity a flow of size $j = 3$ after sampling was selected, but it could have been one of the evaporated flows ($j = 0$).

Fig. 1. The flow sampling and selection process. Here a flow is selected which had $k = 5$ packets originally and $j = 3$ after sampling.

With this background established, the discrete distribution of the sampled size $j$ of a typical flow is very simple:

$$d_j = \sum_{k=1}^{W} b_{jk}\theta_k, \qquad 0 \le j \le W. \tag{4}$$

This can be expressed in matrix notation as

$$\mathbf{d} = \mathbf{B}\boldsymbol{\theta} \tag{5}$$

where $\mathbf{d} = [d_0, d_1, d_2, \ldots, d_W]^T$ is a $(W+1) \times 1$ column vector, and $\boldsymbol{\theta}$ a $W \times 1$ column vector. The probability $d_j$ is related to the empirical fraction $\theta'_j$ for $j \ge 0$ by

$$E[\theta'_j] = \frac{E\big[\sum_{i: m'_i = j} 1\big]}{N_f} = \frac{\sum_{i=1}^{N_f} E[\mathbb{1}(m'_i = j)]}{N_f} = \frac{\sum_{i=1}^{N_f} \Pr(m'_i = j)}{N_f} = \frac{N_f d_j}{N_f} = d_j.$$

The likelihood function for the parameters is simply

$$f(j; \boldsymbol{\theta}) = d_j, \qquad 0 \le j \le W. \tag{6}$$

In the conditional framework commonly used, $j = 0$ is missing, and normalization is then needed to ensure probabilities add to one. This implies a division of random variables, which greatly complicates the likelihood.

E. The Fisher Information of a Sampled Flow

The parameter vector $\boldsymbol{\theta}$ is the unknown we would like to estimate from sampled flows. Since here we are not concerned with specific estimators of $\boldsymbol{\theta}$, but with the effectiveness of the underlying sampling scheme, a powerful approach (introduced in [4]) is to use the Fisher information [9, Section 11.10] to assess its efficiency in collecting information about $\boldsymbol{\theta}$.

We first introduce notation that will be used throughout this paper. The expectation of a random variable $X$ is denoted by $E[X]$, and the variance by $\mathrm{Var}(X)$. Matrices are written in bold-face upper case and vectors in bold-face lower case. The transpose of a matrix $\mathbf{A}$ is denoted by $\mathbf{A}^T$; the operator also applies to vectors. The operator $\mathrm{tr}(\mathbf{A})$ denotes the trace of the matrix $\mathbf{A}$. The matrix $\mathbf{I}_n$ denotes the $n \times n$ identity matrix. The vector $\mathbf{1}_n = [1, 1, \ldots, 1]^T$ denotes an $n \times 1$ vector of 1s. The vector $\mathbf{0}_n$ denotes the $n \times 1$ null vector, and the $m \times n$ null matrix is written as $\mathbf{0}_{m \times n}$. Given an $n \times 1$ vector $\mathbf{x}$, $\mathrm{diag}(\mathbf{x})$ denotes an $n \times n$ matrix with diagonal entries $x_1, x_2, \ldots, x_n$.

Definition 1: An $n \times n$ real matrix $\mathbf{M}$ is positive definite iff for all vectors $\mathbf{z} \in \mathbb{R}^n \setminus \{\mathbf{0}_n\}$, $\mathbf{z}^T\mathbf{M}\mathbf{z} > 0$, and is positive semidefinite iff $\mathbf{z}^T\mathbf{M}\mathbf{z} \ge 0$.

We write $\mathbf{A} > 0$ or $0 < \mathbf{A}$ to indicate that $\mathbf{A}$ is positive definite. For two matrices $\mathbf{A}$ and $\mathbf{B}$, we write $\mathbf{A} > \mathbf{B}$ to mean $\mathbf{A} - \mathbf{B} > 0$ in the positive definite sense. Similarly, $\mathbf{A} \ge \mathbf{B}$ and $\mathbf{A} - \mathbf{B} \ge 0$ each mean that $\mathbf{A} - \mathbf{B}$ is positive semidefinite. The operator $|\cdot|$ returns the size of a vector or set. All other notation will be defined when needed.

The Fisher information is useful because its inverse is the Cramér-Rao lower bound (CRLB), which lower-bounds the variance of any unbiased estimator of $\boldsymbol{\theta}$. In fact the Fisher information takes a different form depending on whether constraints are imposed on the $\theta_k$ or not [10]. Inequality constraints are particularly problematic, so we avoid them by assuming that each $\theta_k$ obeys $0 < \theta_k < 1$ (this ensures that the CRLB optimal solution cannot include boundary values, which would create bias and thereby invalidate the use of the unbiased CRLB). Assuming that flows exist for all sizes, i.e. that $\theta_k > 0$ for all $k$, is reasonable given the huge number of
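simultaneously active flows (up to a million) in high end routers. There is one more constraint, the equality constraint $\sum_{k=1}^{W} \theta_k = 1$, which must be included. As this complicates the Fisher information, we first deal with the unconstrained case.

F. The Unconstrained Fisher Information

The Fisher information is based on the likelihood and is defined by

$$\mathbf{J}(\boldsymbol{\theta}) = E\big[(\nabla_{\boldsymbol{\theta}} \log f(j; \boldsymbol{\theta}))(\nabla_{\boldsymbol{\theta}} \log f(j; \boldsymbol{\theta}))^T\big] = \sum_{j=0}^{W} (\nabla_{\boldsymbol{\theta}} \log f(j; \boldsymbol{\theta}))(\nabla_{\boldsymbol{\theta}} \log f(j; \boldsymbol{\theta}))^T d_j. \tag{7}$$

Here $\nabla_{\boldsymbol{\theta}} \log f(j; \boldsymbol{\theta}) = (1/d_j)[b_{j1}, \ldots, b_{jW}]^T$ because of the simple form (6) of the likelihood. This leads to the simple explicit expression

$$(\mathbf{J}(\boldsymbol{\theta}))_{ik} = \sum_{j=0}^{W} \frac{b_{ji} b_{jk}}{d_j},$$

or equivalently

$$\mathbf{J}(\boldsymbol{\theta}) = \mathbf{B}^T \mathbf{D}(\boldsymbol{\theta}) \mathbf{B} \tag{8}$$

where $\mathbf{D}(\boldsymbol{\theta})$ is a diagonal matrix with $(\mathbf{D}(\boldsymbol{\theta}))_{jj} = d_{j-1}^{-1}$, $1 \le j \le W+1$.

We will need to find the inverse of $\mathbf{J}$, but since $\mathbf{B}$ is not square, this cannot be done directly from (8) in terms of the inverse of $\mathbf{B}$. However if we re-express $\mathbf{B}$ as

$$\mathbf{B} = \begin{bmatrix} \mathbf{b}_0^T \\ \tilde{\mathbf{B}} \end{bmatrix} \tag{9}$$

where $\mathbf{b}_0^T = [b_{01}, \ldots, b_{0W}]$ is the top row of $\mathbf{B}$ and

$$\tilde{\mathbf{B}} = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1W} \\ 0 & b_{22} & \cdots & b_{2W} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & b_{WW} \end{bmatrix},$$

then we can write

$$\mathbf{J}(\boldsymbol{\theta}) = \frac{1}{d_0}\mathbf{b}_0\mathbf{b}_0^T + \tilde{\mathbf{J}}(\boldsymbol{\theta}) \tag{10}$$

where $\tilde{\mathbf{J}}(\boldsymbol{\theta}) = \tilde{\mathbf{B}}^T \tilde{\mathbf{D}}(\boldsymbol{\theta}) \tilde{\mathbf{B}}$, $\tilde{\mathbf{D}}(\boldsymbol{\theta}) = \mathrm{diag}(d_1^{-1}, \ldots, d_W^{-1})$, and $d_0 = \mathbf{b}_0^T\boldsymbol{\theta}$. Since $\tilde{\mathbf{B}}$ and $\tilde{\mathbf{D}}(\boldsymbol{\theta})$ are square, the inverse of $\tilde{\mathbf{J}}(\boldsymbol{\theta})$ is just $\tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta}) = \tilde{\mathbf{B}}^{-1}\tilde{\mathbf{D}}^{-1}(\boldsymbol{\theta})(\tilde{\mathbf{B}}^{-1})^T$, that is

$$(\tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta}))_{ik} = \sum_{j=1}^{W} b'_{ij} b'_{kj} d_j \tag{11}$$

where $b'_{jk} = (\tilde{\mathbf{B}}^{-1})_{jk}$. By Lemma 28(ii) in Appendix B, $\tilde{\mathbf{J}}(\boldsymbol{\theta})$ is positive definite. We can now give the inverse of $\mathbf{J}$.

Proposition 2: The inverse of $\mathbf{J}(\boldsymbol{\theta})$ is given by

$$\mathbf{J}^{-1}(\boldsymbol{\theta}) = \tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta}) - \frac{1}{d_0 + \mathbf{b}_0^T\tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta})\mathbf{b}_0}\,\tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta})\mathbf{b}_0\mathbf{b}_0^T\tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta}).$$

Proof: The matrix inversion lemma applies (see Lemma 30 in Appendix A) with $\mathbf{R} = \tilde{\mathbf{J}}$ and $T = 1/d_0$ nonsingular. Since $d_0 + \mathbf{b}_0^T\tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta})\mathbf{b}_0 > \mathbf{b}_0^T\tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta})\mathbf{b}_0 > 0$ as $\tilde{\mathbf{J}}^{-1}$ is positive definite, the result immediately follows.

The diagonal elements of the matrix $\mathbf{J}^{-1}(\boldsymbol{\theta})$ will be important in later sections, and the explicit formula is given below:

$$(\mathbf{J}^{-1}(\boldsymbol{\theta}))_{jj} = \sum_{k=j}^{W} b'^{\,2}_{jk} d_k - \frac{\Big(\sum_{k=j}^{W} d_k b'_{jk} \sum_{l=1}^{k} b_{0l} b'_{lk}\Big)^2}{d_0 + \sum_{k=1}^{W} d_k \Big(\sum_{l=1}^{k} b_{0l} b'_{lk}\Big)^2}. \tag{12}$$

Again, this explicit inverse, valid for any general sampling matrix $\mathbf{B}$, is made possible by the very simple form of the likelihood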
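function in equation (6).

We now specialize the above result for sampling matrices that satisfy particular conditions. Although some of these matrices exhibit a dependence on $\boldsymbol{\theta}$, we drop this dependence for notational simplicity when the context is clear.

Corollary 3: If for some constant $q$ the sampling matrix $\mathbf{B}$ satisfies $\mathbf{b}_0 = q\mathbf{1}_W$ then (setting $p = 1 - q$)

$$\mathbf{J}^{-1} = \tilde{\mathbf{J}}^{-1} - \frac{q}{p}\boldsymbol{\theta}\boldsymbol{\theta}^T.$$

Proof: The given condition implies that $d_0 = \mathbf{b}_0^T\boldsymbol{\theta} = q\mathbf{1}_W^T\boldsymbol{\theta} = q$, and $\mathbf{1}_W^T\tilde{\mathbf{B}}^{-1} = (1/p)\mathbf{1}_W^T$ since $\mathbf{B}$ is column stochastic. Next,

$$\mathbf{b}_0^T\tilde{\mathbf{J}}^{-1}\mathbf{b}_0 = q^2\mathbf{1}_W^T\tilde{\mathbf{J}}^{-1}\mathbf{1}_W = q^2\mathbf{1}_W^T\tilde{\mathbf{B}}^{-1}\tilde{\mathbf{D}}^{-1}(\tilde{\mathbf{B}}^{-1})^T\mathbf{1}_W = \frac{q^2}{p^2}\mathbf{1}_W^T\tilde{\mathbf{D}}^{-1}\mathbf{1}_W = \frac{q^2}{p^2}\sum_{j=1}^{W} d_j = \frac{q^2}{p^2}(1 - d_0) = \frac{q^2}{p}.$$

Let $\tilde{\mathbf{d}} = \tilde{\mathbf{B}}\boldsymbol{\theta} = [d_1, d_2, \ldots, d_W]^T$. Then from Proposition 2,

$$\mathbf{J}^{-1} = \tilde{\mathbf{J}}^{-1} - \frac{1}{d_0 + q^2/p}\tilde{\mathbf{J}}^{-1}\mathbf{b}_0\mathbf{b}_0^T\tilde{\mathbf{J}}^{-1} = \tilde{\mathbf{J}}^{-1} - \frac{q^2/p^2}{q + q^2/p}\tilde{\mathbf{B}}^{-1}\tilde{\mathbf{D}}^{-1}\mathbf{1}_W\mathbf{1}_W^T\tilde{\mathbf{D}}^{-1}(\tilde{\mathbf{B}}^T)^{-1} = \tilde{\mathbf{J}}^{-1} - \frac{q^2/p^2}{q + q^2/p}\tilde{\mathbf{B}}^{-1}\tilde{\mathbf{d}}\tilde{\mathbf{d}}^T(\tilde{\mathbf{B}}^T)^{-1} = \tilde{\mathbf{J}}^{-1} - \frac{q}{p}\boldsymbol{\theta}\boldsymbol{\theta}^T.$$

The matrix $\tilde{\mathbf{J}}$ corresponds to the information carried by the outcomes $1 \le j \le W$ only. We expect $\mathbf{J}$ to carry more information through the knowledge of $N_f$ which gives access to $j = 0$, and therefore $\mathbf{J}^{-1}$ to have reduced uncertainty, corresponding (through the CRLB) to a reduced variance. The following result confirms this intuition (proof in Appendix A).

Theorem 4: An upper bound for $\mathbf{J}^{-1}(\boldsymbol{\theta})$ is

$$\mathbf{J}^{-1}(\boldsymbol{\theta}) \le \tilde{\mathbf{J}}^{-1}(\boldsymbol{\theta}).$$

Equality holds if and only if $\mathbf{b}_0 = \mathbf{0}_{W \times 1}$.

The reduction in uncertainty is given by the second term in the expression for $\mathbf{J}^{-1}(\boldsymbol{\theta})$ in Proposition 2.

G. The Constrained Fisher Information and CRLB

Intuitively, constraints on the parameters should increase the Fisher information since they tell us something more about them, for free. In fact, [11] shows that this is only true for equality constraints. Since we are assuming that $0 < \theta_k < 1$, the only active constraint is $\sum_{k=1}^{W}\theta_k = 1$. Its gradient is

$$\mathbf{G}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}}\, g(\boldsymbol{\theta}) \tag{13}$$

where $g(\boldsymbol{\theta}) = \sum_{j=1}^{W}\theta_j - 1$. The inverse constrained Fisher information [11] is

$$\mathbf{I}^{+} = \mathbf{J}^{-1} - \mathbf{J}^{-1}\mathbf{G}\big(\mathbf{G}^T\mathbf{J}^{-1}\mathbf{G}\big)^{-1}\mathbf{G}^T\mathbf{J}^{-1} \tag{14}$$

where $\mathbf{I}^{+}$ denotes the Moore-Penrose pseudo-inverse [12, Chapter 20] of the constrained Fisher information matrix $\mathbf{I}$. The matrix $\mathbf{I}^{+}$ is of rank $W - 1$ due to the single equality constraint and is thus singular (see [11, Remark 2]). This somewhat formidable expression can be simplified in our case, as we now show.

Lemma 5: $\mathbf{J}\,\mathrm{diag}(\theta_1, \ldots, \theta_W)\mathbf{1}_W = \mathbf{1}_W$.

Proof: The row sum of row $i$ of $\mathbf{J}\,\mathrm{diag}(\theta_1, \ldots, \theta_W)$ is

$$\sum_{k=1}^{W}\sum_{j=0}^{W}\frac{b_{ji}b_{jk}}{d_j}\theta_k = \sum_{j=0}^{W}\frac{b_{ji}}{d_j}\sum_{k=1}^{W}b_{jk}\theta_k = \sum_{j=0}^{W}\frac{b_{ji}}{d_j}d_j = \sum_{j=0}^{W}b_{ji} = 1,$$

since $\mathbf{B}$ is column stochastic.

It is easy to see that $\mathbf{G} = \mathbf{1}_W$. Hence $\mathbf{J}^{-1}\mathbf{G} = \mathbf{J}^{-1}\mathbf{1}_W = \mathrm{diag}(\theta_1, \ldots, \theta_W)\mathbf{1}_W = \boldsymbol{\theta}$ from Lemma 5. It is then straightforward to verify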
that $\big(\mathbf{G}^T\mathbf{J}^{-1}\mathbf{G}\big)$ is simply the number 1, and further that (14) can be reduced to

$$\mathbf{I}^{+} = \mathbf{J}^{-1} - \boldsymbol{\theta}\boldsymbol{\theta}^T. \tag{15}$$

The remarkable thing here is that the constraint term $\boldsymbol{\theta}\boldsymbol{\theta}^T$ depends on $\boldsymbol{\theta}$ only, and so is constant for all sampling matrices $\mathbf{B}$, a great advantage when comparing different methods.

Since we are assuming flows are sampled independently, the Fisher information for $N$ flows is just $N\mathbf{J}$, and the inverse becomes $\mathbf{I}^{+}/N$. For any unbiased estimator $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$, the CRLB then states that

$$E\big[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T\big] \ge \frac{\mathbf{I}^{+}(\boldsymbol{\theta})}{N}. \tag{16}$$

Because of independence we study $N = 1$. In practice all flows are sampled and so $N = N_f$.

Remark 6: There is an interpretation of the simple structure of the matrix $\mathbf{P} = \mathbf{J}^{-1}\mathbf{G}(\mathbf{G}^T\mathbf{J}^{-1}\mathbf{G})^{-1}\mathbf{G}^T\mathbf{J}^{-1}$. The scalar value $\mathbf{1}_W^T\mathbf{I}^{+}\mathbf{1}_W$ is equivalent to $\mathrm{Var}\big(\sum_{i=1}^{W}\hat{\theta}_i\big)$. By the equality constraint, we expect $\sum_{i=1}^{W}\hat{\theta}_i$ for the best estimator to have a variance of 0, since the estimator already knows that $\sum_{i=1}^{W}\theta_i = 1$. This corresponds to a CRLB of zero, namely

$$\mathbf{1}_W^T\mathbf{I}^{+}\mathbf{1}_W = \mathbf{1}_W^T\mathbf{J}^{-1}\mathbf{1}_W - \mathbf{1}_W^T\boldsymbol{\theta}\boldsymbol{\theta}^T\mathbf{1}_W = \mathbf{1}_W^T\mathrm{diag}(\theta_1, \ldots, \theta_W)\mathbf{1}_W - 1 = 0.$$

Thus, $\mathbf{P} = \boldsymbol{\theta}\boldsymbol{\theta}^T$ is the form of the correction term to the unconstrained covariance matrix $\mathbf{J}^{-1}$ needed to satisfy the constraint.

III. THE SAMPLING METHODS

In this section we define the sampling methods we consider and derive their main properties. We begin with methods which have been described elsewhere, including simple packet and flow sampling, as well as others exploiting protocol information, in particular those proposed in [4], [1]. Apart from their inherent interest, we revisit these because in the unconditional framework these methods are now all different to before. More importantly, we also derive inverses analytically, which has not been possible before, and thereby obtain a number of important insights. We also include the widely cited Sample and Hold [6] whose Fisher information has not previously been studied. We then introduce our new method, Dual Sampling.

To better see the connection between the usual framework and ours, recall that $b_{jk}$ is always a conditional probability with
respect to the size $k$ of the original flow. Typically however, it is also made conditional with respect to $j$ (on $j \ge 1$), but we do not do so here. Hence, if $\mathbf{B}_c$ is the usual $j$-conditional matrix, then $\mathbf{B}_c\mathbf{C} = \tilde{\mathbf{B}}$ where $\mathbf{C} = \mathbf{I}_W - \mathrm{diag}(b_{01}, \ldots, b_{0W})$, i.e. the matrix $\mathbf{C}^{-1}$ does the conditioning.

We use the decomposition of (9) to describe each sampling matrix $\mathbf{B}$. In each case we define $\mathbf{B}$ and $\tilde{\mathbf{B}}$, give the inverse $\tilde{\mathbf{B}}^{-1}$ of $\tilde{\mathbf{B}}$, and give explicit expressions for the diagonal terms $(\mathbf{J}^{-1})_{jj}$, or in some cases for the entire inverse $\mathbf{J}^{-1}$. The importance of the diagonal terms will become very clear in Section IV.
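The quantities of Section II are easy to check numerically. The following is a minimal sketch, not the authors' code, assuming NumPy, a small maximum flow size $W = 4$, an arbitrary illustrative distribution $\boldsymbol{\theta}$, and the binomial packet sampling matrix $b_{jk} = \binom{k}{j}p^j q^{k-j}$ of Section III-A. All function names are hypothetical. It builds $\mathbf{J}$ from (8), forms its inverse through Proposition 2, and evaluates the constrained bound (15).

```python
import numpy as np
from math import comb


def ps_sampling_matrix(W, p):
    """(W+1) x W matrix with b_jk = C(k, j) p^j q^(k-j); rows j = 0..W, columns k = 1..W."""
    q = 1.0 - p
    B = np.zeros((W + 1, W))
    for k in range(1, W + 1):
        for j in range(k + 1):
            B[j, k - 1] = comb(k, j) * p**j * q ** (k - j)
    return B


def fisher_information(B, theta):
    """Unconstrained J = B^T D B with D = diag(1/d_j) and d = B theta, as in (8)."""
    d = B @ theta
    return B.T @ np.diag(1.0 / d) @ B


def inverse_via_proposition_2(B, theta):
    """J^{-1} as a rank-one correction of J~^{-1} = B~^{-1} diag(d_1..d_W) (B~^{-1})^T."""
    b0, B_tilde = B[0], B[1:]
    d = B @ theta
    B_tilde_inv = np.linalg.inv(B_tilde)
    J_tilde_inv = B_tilde_inv @ np.diag(d[1:]) @ B_tilde_inv.T
    v = J_tilde_inv @ b0
    return J_tilde_inv - np.outer(v, v) / (d[0] + b0 @ v)


# Illustrative values only (assumptions, not taken from the paper).
W, p = 4, 0.5
theta = np.array([0.4, 0.3, 0.2, 0.1])

B = ps_sampling_matrix(W, p)
J = fisher_information(B, theta)
J_inv = inverse_via_proposition_2(B, theta)
I_plus = J_inv - np.outer(theta, theta)  # constrained inverse, equation (15)

print(np.diag(I_plus))  # per-flow CRLB on each theta_k under packet sampling
```

Under these assumptions the inverse obtained from Proposition 2 should agree with a direct numerical inverse of $\mathbf{J}$, the product $\mathbf{J}\boldsymbol{\theta}$ should be the all-ones vector (Lemma 5), and the entries of $\mathbf{I}^{+}$ should sum to zero (Remark 6).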

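A. Packet Sampling (PS)

By this we mean the simplest form of sampling, i.i.d. packet sampling, where each packet is retained independently with probability $p_p$ and otherwise dropped with $q_p = 1 - p_p$. For the purpose of simplicity, we treat both 1 in N periodic sampling and i.i.d. random sampling under the same framework, as both methods were shown to be statistically indistinguishable in practice [1].

The chief benefit of PS is its simplicity, and the fact that it can be implemented at high speed because a sampling decision can be made without even inspecting the packet. The chief disadvantage is the fact that it has a strong length bias: small flows are very likely to evaporate.

It is easy to see that $b_{jk} = \binom{k}{j}p_p^j q_p^{k-j}$, or

$$\mathbf{B} = \begin{bmatrix} q_p & q_p^2 & q_p^3 & q_p^4 & \cdots & q_p^W \\ p_p & 2p_pq_p & 3p_pq_p^2 & 4p_pq_p^3 & \cdots & \binom{W}{1}p_pq_p^{W-1} \\ 0 & p_p^2 & 3p_p^2q_p & 6p_p^2q_p^2 & \cdots & \binom{W}{2}p_p^2q_p^{W-2} \\ \vdots & & & & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & p_p^W \end{bmatrix}.$$

Before finding the inverse of $\tilde{\mathbf{B}}$, it is first instructive to note the following general results by Strum [13]. Let $\mathbf{B}(x, y)$ be a $(W+1) \times (W+1)$ matrix with the following structure

$$\mathbf{B}(x, y) = \begin{bmatrix} 1 & x & x^2 & \cdots & x^W \\ 0 & y & 2xy & \cdots & \binom{W}{1}x^{W-1}y \\ 0 & 0 & y^2 & \cdots & \binom{W}{2}x^{W-2}y^2 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & y^W \end{bmatrix}$$

which is known as the binomial matrix. Note that $\mathbf{B}(0, 1)$ reduces to the identity matrix. From [13] we have

Lemma 7: If $y \ne 0$, then $\mathbf{B}(x, y)$ is invertible and $[\mathbf{B}(x, y)]^{-1} = \mathbf{B}(-xy^{-1}, y^{-1})$.

Using these results we can find the inverse of $\tilde{\mathbf{B}}$ (recall that $b'_{jk} = (\tilde{\mathbf{B}}^{-1})_{jk}$).

Theorem 8: The inverse of $\tilde{\mathbf{B}}$ is given by

$$\tilde{\mathbf{B}}^{-1} = \begin{bmatrix} p_p^{-1} & -2q_pp_p^{-2} & 3q_p^2p_p^{-3} & \cdots & W(-q_p)^{W-1}p_p^{-W} \\ 0 & p_p^{-2} & -3q_pp_p^{-3} & \cdots & \binom{W}{2}(-q_p)^{W-2}p_p^{-W} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & p_p^{-W} \end{bmatrix},$$

that is $b'_{jk} = (-1)^{k-j}\binom{k}{j}q_p^{k-j}p_p^{-k}$.

Proof: Here $\mathbf{b}_0^T = [q_p, q_p^2, \ldots, q_p^W]$. We have

$$\mathbf{B}(q_p, p_p) = \begin{bmatrix} 1 & \mathbf{b}_0^T \\ \mathbf{0}_W & \tilde{\mathbf{B}} \end{bmatrix}.$$

Since $\mathbf{B}(q_p, p_p)[\mathbf{B}(q_p, p_p)]^{-1} = \mathbf{I}_{W+1}$, for some vector $\mathbf{k}$ we have

$$\begin{bmatrix} 1 & \mathbf{b}_0^T \\ \mathbf{0}_W & \tilde{\mathbf{B}} \end{bmatrix}\begin{bmatrix} 1 & \mathbf{k}^T \\ \mathbf{0}_W & \tilde{\mathbf{B}}^{-1} \end{bmatrix} = \begin{bmatrix} 1 & \mathbf{0}_W^T \\ \mathbf{0}_W & \mathbf{I}_W \end{bmatrix}.$$

Furthermore, from Lemma 7 we have $[\mathbf{B}(q_p, p_p)]^{-1} = \mathbf{B}(-q_pp_p^{-1}, p_p^{-1})$. Thus $\tilde{\mathbf{B}}^{-1}$ is essentially a principal $W \times W$ submatrix of $\mathbf{B}(-q_pp_p^{-1}, p_p^{-1})$.

For PS, $\mathbf{J}^{-1}$ is difficult to write in a compact form and will be omitted. It is however feasible to give $(\mathbf{J}^{-1})_{jj}$ using equation (12):

$$(\mathbf{J}^{-1})_{jj} = \sum_{k=j}^{W}\binom{k}{j}^2 q_p^{2(k-j)}p_p^{-2k}d_k - \frac{\Big(\sum_{k=j}^{W}(-1)^{2k-j-1}\binom{k}{j}d_kq_p^{2k-j}p_p^{-2k}\Big)^2}{\sum_{k=0}^{W}q_p^{2k}p_p^{-2k}d_k}. \tag{17}$$

The above form is derived in Appendix C.

B. Packet Sampling with Sequence Numbers (PS+SEQ)

First PS with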
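parameter $p_p$ is performed as above. Sequence numbers are then used as follows. Let $s_l$ be the lowest sequence number among the sampled packets, and $s_h$ the highest. All packets in between these can now be reliably inferred, hence $j = s_h - s_l + 1$ is the number of sampled packets returned. This is called ALL-seq-sflag in [4].

The chief benefit of PS+SEQ is the fact that a potentially large number of packets can be virtually observed without having to physically sample them. The disadvantage is the additional processing involved to perform the inference. Also, the technique is of limited value if flows are too short (as we discuss later).

If $j = 0, 1$ then sequence numbers cannot help and $b_{jk}$ is as for PS. Otherwise, note that the $j$ packets must occur in a contiguous block bordered by $s_l$ and $s_h$. There are $k - j + 1$ possible positions for such a block, each characterized by $k - j$ unsampled packets outside it and the borders $s_l$ and $s_h$. It follows that $b_{jk} = (k - j + 1)p_p^2q_p^{k-j}$ for $2 \le j \le k$. Hence

$$\mathbf{B} = \begin{bmatrix} q_p & q_p^2 & q_p^3 & q_p^4 & \cdots & q_p^W \\ p_p & 2p_pq_p & 3p_pq_p^2 & 4p_pq_p^3 & \cdots & Wp_pq_p^{W-1} \\ 0 & p_p^2 & 2p_p^2q_p & 3p_p^2q_p^2 & \cdots & (W-1)p_p^2q_p^{W-2} \\ 0 & 0 & p_p^2 & 2p_p^2q_p & \cdots & (W-2)p_p^2q_p^{W-3} \\ \vdots & & & & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & p_p^2 \end{bmatrix}.$$

Theorem 9: The inverse of $\tilde{\mathbf{B}}$ is

$$\tilde{\mathbf{B}}^{-1} = \begin{bmatrix} p_p^{-1} & -2q_pp_p^{-2} & q_p^2p_p^{-2} & 0 & \cdots & 0 \\ 0 & p_p^{-2} & -2q_pp_p^{-2} & q_p^2p_p^{-2} & \cdots & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & p_p^{-2} & -2q_pp_p^{-2} \\ 0 & 0 & \cdots & 0 & 0 & p_p^{-2} \end{bmatrix}.$$

Proof: Observe that $\tilde{\mathbf{B}} = \mathbf{S}\mathbf{T}$ where $\mathbf{S} = \mathrm{diag}(p_p, p_p^2, p_p^2, \ldots, p_p^2)$ is a $W \times W$ matrix and

$$\mathbf{T} = \begin{bmatrix} 1 & 2q_p & 3q_p^2 & 4q_p^3 & \cdots & Wq_p^{W-1} \\ 0 & 1 & 2q_p & 3q_p^2 & \cdots & (W-1)q_p^{W-2} \\ 0 & 0 & 1 & 2q_p & \cdots & (W-2)q_p^{W-3} \\ \vdots & & & & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 \end{bmatrix}.$$

A straightforward computation yields

$$\mathbf{T}^{-1} = \begin{bmatrix} 1 & -2q_p & q_p^2 & 0 & \cdots & 0 \\ 0 & 1 & -2q_p & q_p^2 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 & -2q_p \\ 0 & 0 & \cdots & 0 & 0 & 1 \end{bmatrix},$$

and $\mathbf{S}^{-1} = \mathrm{diag}(p_p^{-1}, p_p^{-2}, p_p^{-2}, \ldots, p_p^{-2})$. Thus, $\tilde{\mathbf{B}}^{-1} = \mathbf{T}^{-1}\mathbf{S}^{-1}$, which proves our result.

The diagonal elements of $\mathbf{J}^{-1}$ are given by

$$(\mathbf{J}^{-1})_{11} = p_p^{-2}d_1 + 4q_p^2p_p^{-4}d_2 + q_p^4p_p^{-4}d_3 - \frac{(q_pp_p^{-2}d_1 + 2q_p^3p_p^{-4}d_2)^2}{r},$$

$$(\mathbf{J}^{-1})_{22} = p_p^{-4}d_2 + 4q_p^2p_p^{-4}d_3 + q_p^4p_p^{-4}d_4 - \frac{q_p^4p_p^{-8}d_2^2}{r},$$

$$(\mathbf{J}^{-1})_{jj} = p_p^{-4}d_j + 4q_p^2p_p^{-4}d_{j+1} + q_p^4p_p^{-4}d_{j+2}, \qquad 3 \le j \le W,$$

where $r = d_0 + q_p^2p_p^{-2}d_1 + q_p^4p_p^{-4}d_2$, and for convenience we set $d_j = 0$ for $j > W$.

C. Packet Sampling with SYN Sampling (PS+SYN)

First PS with parameter $p_p$ is performed as above. A post-processing phase then discards all packets belonging to sampled flows which lack a SYN packet (or more accurately, maps them to sampled flows with $j = 0$). This was introduced in [1] and called SYN-pktct in [4].

The chief benefit of PS+SYN is that the flow length bias of PS is averted by keeping flows based on the presence of the SYN, which is flow length independent. The chief disadvantage is the fact that it is wasteful: if $p_p = 0.01$ then 99% of packets which were initially sampled belong to failed flows and are subsequently discarded!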
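The size of this waste can be made concrete with a short calculation. The following is a minimal sketch under stated assumptions: a single flow of $k$ packets whose first packet is the SYN, and i.i.d. sampling of every packet with probability $p$. The function name is hypothetical. PS physically samples $kp$ packets of the flow on average, while the flow is kept only if its SYN is sampled, in which case the SYN plus an average of $(k-1)p$ further packets are retained.

```python
def ps_syn_discard_ratio(k, p):
    """Ratio of expected discarded packets to expected physically sampled packets
    for one flow of k packets under PS+SYN with sampling probability p."""
    expected_sampled = k * p               # packets taken by plain packet sampling
    expected_kept = p * (1 + (k - 1) * p)  # SYN sampled (prob. p), then SYN plus (k-1)p others
    return 1.0 - expected_kept / expected_sampled


for k in (1, 10, 100, 1000):
    print(k, ps_syn_discard_ratio(k, 0.01))
```

In this simple model the ratio is zero for single-packet flows and approaches $1 - p$ as $k$ grows, so the 99% figure quoted for $p_p = 0.01$ is the large-flow limit: each sampled non-SYN packet is discarded whenever the SYN of its flow was missed, which happens with probability $q_p$.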
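A flow evaporates iff its SYN is not sampled, hence $b_{0k} = q_p$. For $j \ge 1$ the SYN must first be sampled, which occurs with probability $p_p$, and conditional on this $j - 1$ more packets must be sampled from the remaining $k - 1$ using i.i.d. sampling. Hence $b_{jk} = \binom{k-1}{j-1}p_p^jq_p^{k-j}$ for $j \ge 1$, giving

$$\mathbf{B} = \begin{bmatrix} q_p & q_p & q_p & \cdots & q_p \\ p_p & p_pq_p & p_pq_p^2 & \cdots & p_pq_p^{W-1} \\ 0 & p_p^2 & 2p_p^2q_p & \cdots & (W-1)p_p^2q_p^{W-2} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & p_p^W \end{bmatrix}.$$

Theorem 10: The inverse of $\tilde{\mathbf{B}}$ is given by

$$\tilde{\mathbf{B}}^{-1} = \frac{1}{p_p}\begin{bmatrix} 1 & -q_pp_p^{-1} & q_p^2p_p^{-2} & \cdots & (-q_p)^{W-1}p_p^{-W+1} \\ 0 & p_p^{-1} & -2q_pp_p^{-2} & \cdots & (W-1)(-q_p)^{W-2}p_p^{-W+1} \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & p_p^{-W+1} \end{bmatrix},$$

that is $b'_{jk} = (-1)^{k-j}\binom{k-1}{j-1}q_p^{k-j}p_p^{-k}$.

Proof: Note that we can express $\tilde{\mathbf{B}}$ in terms of a $W \times W$ binomial matrix $\mathbf{B}(x, y)$ such that $\tilde{\mathbf{B}} = p_p\mathbf{B}(q_p, p_p)$. Then by Lemma 7, the inverse is given by $\tilde{\mathbf{B}}^{-1} = (1/p_p)\mathbf{B}(-q_pp_p^{-1}, p_p^{-1})$.

It is easy to see that the condition of Corollary 3 is satisfied with $p = p_p$. Hence $\mathbf{J}^{-1} = \tilde{\mathbf{J}}^{-1} - \frac{q_p}{p_p}\boldsymbol{\theta}\boldsymbol{\theta}^T$, and the diagonal entries for $1 \le j \le W$ are

$$(\mathbf{J}^{-1})_{jj} = \sum_{k=j}^{W}\binom{k-1}{j-1}^2q_p^{2(k-j)}p_p^{-2k}d_k - \frac{q_p}{p_p}\theta_j^2. \tag{18}$$

D. Packet Sampling with SYN and SEQ (PS+SYN+SEQ)

First sampling is performed according to PS+SYN with parameter $p_p$, and then on each resulting sampled flow the sequence number post-processing is performed as per PS+SEQ. This is called SYN-seq in [4].

PS+SYN+SEQ is a hybrid of PS+SYN and PS+SEQ and combines the advantages and disadvantages of both.

If $j = 0, 1$ then sequence numbers cannot help and $b_{jk}$ is as for PS+SYN. Otherwise, by combining the arguments above, it is easy to see that $b_{jk} = p_p^2q_p^{k-j}$ for $j > 1$, giving

$$\mathbf{B} = \begin{bmatrix} q_p & q_p & q_p & q_p & \cdots & q_p \\ p_p & p_pq_p & p_pq_p^2 & p_pq_p^3 & \cdots & p_pq_p^{W-1} \\ 0 & p_p^2 & p_p^2q_p & p_p^2q_p^2 & \cdots & p_p^2q_p^{W-2} \\ 0 & 0 & p_p^2 & p_p^2q_p & \cdots & p_p^2q_p^{W-3} \\ \vdots & & & & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & p_p^2 \end{bmatrix}. \tag{19}$$

Theorem 11: The inverse of $\tilde{\mathbf{B}}$ is given by

$$\tilde{\mathbf{B}}^{-1} = \frac{1}{p_p^2}\begin{bmatrix} p_p & -q_p & 0 & \cdots & 0 \\ 0 & 1 & -q_p & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -q_p \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix}.$$

Proof: A straightforward computation shows that $\tilde{\mathbf{B}}\tilde{\mathbf{B}}^{-1} = \mathbf{I}_W$.

Since $\mathbf{b}_0 = q_p\mathbf{1}_W$, Corollary 3 applies and states that $\mathbf{J}^{-1} = \tilde{\mathbf{J}}^{-1} - \frac{q_p}{p_p}\boldsymbol{\theta}\boldsymbol{\theta}^T$. The diagonal elements can be explicitly written, but we defer this to Section III-H.

E. Flow Sampling (FS)

In i.i.d. flow sampling [3], flows are retained independently with probability $p_f$ and otherwise dropped with $q_f = 1 - p_f$.

The chief benefit of FS is the fact that flows which are sampled retain their full complement of packets, eliminating completely the difficulties in inverting sampled flow sizes back to original sizes. The chief disadvantage is that each packet requires a lookup in a flow table to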
see if it belongs to a flow which has been sampled.

A flow evaporates iff its SYN is not sampled, hence $b_{0k} = q_f$. If a flow has been selected, which occurs with probability $p_f$, then conditional on this $j = k$ with certainty, that is $b_{jk} = p_f$ if $j = k$, else 0, for $j \ge 1$:

$$\mathbf{B} = \begin{bmatrix} q_f & q_f & q_f & \cdots & q_f \\ p_f & 0 & 0 & \cdots & 0 \\ 0 & p_f & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & p_f \end{bmatrix}. \tag{20}$$

The inverse of $\tilde{\mathbf{B}}$ is just $\tilde{\mathbf{B}}^{-1} = \mathbf{I}_W/p_f$, and $\mathbf{J}$ takes the elegant form

$$\mathbf{J}(\boldsymbol{\theta}) = q_f\mathbf{1}_W\mathbf{1}_W^T + p_f\,\mathrm{diag}(\theta_1^{-1}, \ldots, \theta_W^{-1}).$$

Clearly Corollary 3 applies with $p = p_f$. Since $\tilde{\mathbf{J}} = p_f\,\mathrm{diag}(\theta_1^{-1}, \ldots, \theta_W^{-1})$, the unconstrained inverse can therefore be expressed as

$$\mathbf{J}^{-1}(\boldsymbol{\theta}) = \frac{1}{p_f}\mathrm{diag}(\boldsymbol{\theta}) + \Big(1 - \frac{1}{p_f}\Big)\boldsymbol{\theta}\boldsymbol{\theta}^T. \tag{21}$$

Remark 12: By using equations (15) and (21), the inverse constrained Fisher information matrix for FS is given by

$$\mathbf{I}^{+} = \frac{1}{p_f}\mathrm{diag}(\boldsymbol{\theta}) - \frac{1}{p_f}\boldsymbol{\theta}\boldsymbol{\theta}^T,$$

with the diagonals being $(\mathbf{I}^{+})_{kk} = (1/p_f)\theta_k(1 - \theta_k)$. This is just an appropriately scaled version (by $1/p_f$) of the inverse Fisher information of a multinomial model. This makes sense given that FS works by picking out whole flows from the original flow set in an i.i.d. fashion and that the complete likelihood function can be modeled using a multinomial model.

F. Packet Sampling with SYN, FIN, SEQ (PS+SYN+FIN+SEQ)

In this scheme SYN packets are retained independently with probability $p_f$ and otherwise dropped with $q_f = 1 - p_f$, and the FIN packets corresponding to sampled SYNs are also sampled, but no others. Sequence numbers are then used to infer flow sizes.

This scheme has two great advantages: like FS, the flows sampled are sampled perfectly, and moreover this could be achieved by physically sampling only two packets per flow, based on looking for SYN and FIN flags on a per-packet basis, which is feasible at high speed. The disadvantage is that a moderate minority of flows do not terminate correctly with a FIN, and/or the FIN may not be observable. Furthermore, flows consisting of a single SYN (such as in a SYN attack) would be entirely missed. For this reason we choose not to study it further.

Information theoretically, PS+SYN+FIN+SEQ is identical to flow sampling provided we assume $\theta_1 \approx 0$.

G. Sample and Hold (SH)

Here packets are first sampled as for PS with probability $p_p$, however for each flow, if a packet is sampled, then all subsequent packets in the flow will be. Hence the
total number of packets sampled is much higher than the parameter $p_p$. The scheme was introduced in [6]. The chief benefit of SH is that provided just a single packet from a flow is PS-sampled, then typically many will be finally sampled. This conditional behaviour is much more effective than methods using SEQ, where at least two packets must be PS-sampled, and even then fewer packets are finally recouped. Essentially SH skips a geometric number of the first packets in a flow and then captures all the rest. It therefore efficiently skips small flows and accurately recovers the size of large flows. This amplified flow length bias (even stronger than for PS) makes it well suited for the heavy hitter problem (i.e. accurately measuring the very largest flows) for which it was originally designed. For flow size estimation more generally however, it is a disadvantage for most flow sizes. The other disadvantage is the need to check, for each packet, whether it belongs to a flow which has already been sampled. This makes it very costly in a true sampling implementation. Indeed Estan and Varghese implemented it using lossy sketching techniques [6].

A flow evaporates iff none of its packets are sampled, hence $b_{0k} = q_p^k$. Otherwise, $b_{jk} = p_p q_p^{k-j}$ for $1 \le j \le k$, thus:

$$B = \begin{bmatrix} q_p & q_p^2 & q_p^3 & q_p^4 & \cdots & q_p^W \\ p_p & p_p q_p & p_p q_p^2 & p_p q_p^3 & \cdots & p_p q_p^{W-1} \\ 0 & p_p & p_p q_p & p_p q_p^2 & \cdots & p_p q_p^{W-2} \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & \cdots & & & 0 & p_p \end{bmatrix} \tag{22}$$

It is interesting to note that the first row is as for PS, the second as for PS+SYN, and subsequent rows like PS+SYN scaled by $1/p_p$.

Theorem 13: The inverse of $\tilde{B}$ is given by

$$\tilde{B}^{-1} = \frac{1}{p_p}\begin{bmatrix} 1 & -q_p & 0 & \cdots & 0 \\ 0 & 1 & -q_p & \ddots & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ & & & 1 & -q_p \\ 0 & \cdots & & 0 & 1 \end{bmatrix}$$

Proof: It is easily verified that $\tilde{B}\tilde{B}^{-1} = I_W$.

It is not difficult to show that Proposition 2 reduces to $J^{-1} = \tilde{J}^{-1} - C$, where the only non-zero element of $C$ is $C_{11} = p_p^{-2} q_p^2 d_1^2\,(p_p^2 d_0 + q_p^2 d_1)^{-1} = d_0 / p_p$, since $q_p d_1 = p_p d_0$. Furthermore, using (11) one can show that $\tilde{J}^{-1}$ is tridiagonal (and symmetric) with upper off-diagonal terms $(\tilde{J}^{-1})_{k,k+1} = -p_p^{-2} q_p d_{k+1}$, $k < W$, and diagonal elements

$$(J^{-1})_{11} = \frac{1}{p_p^2}\left(d_1 + q_p^2 d_2\right) - \frac{1}{p_p} d_0 = \theta_1 + \frac{1}{p_p}\sum_{k=2}^{W} q_p^{k-1}\theta_k \tag{23}$$

$$(J^{-1})_{jj} = \frac{1}{p_p^2}\left(d_j + q_p^2 d_{j+1}\right) = \frac{\theta_j}{p_p} + \frac{1 + q_p}{p_p}\sum_{k=j+1}^{W} q_p^{k-j}\theta_k, \quad 2 \le j \le W - 1,$$

$$(J^{-1})_{WW} = \frac{d_W}{p_p^2} = \frac{\theta_W}{p_p}, \tag{24}$$

where we have used the property of $\tilde{B}$ that $d_j = p_p\theta_j + q_p d_{j+1}$ for $1 \le j \le W - 1$.

H. Dual Sampling (DS)

DS can be defined simply as follows. First, at the packet level it consists of two PS schemes running in parallel, one which operates only on SYN packets with sampling probability $p_f$, and the other only on non-SYN packets with sampling probability $p_p$. In a second phase, sampled flows which lack a SYN are discarded, and sequence numbers are used to infer additional virtual packets, as in PS+SYN+SEQ. Thus, at one level DS is simply a generalization of PS+SYN+SEQ, and reduces to it when $p_f = p_p$. However, the generalization is significant as it also includes FS as the special case $p_p = 1$, and interpolates continuously between the two. This is illustrated in Figure 2, which depicts the $(p_p, p_f)$ parameter space and marks the special cases.

Dual Sampling is dual in two senses. Computationally it can be viewed as the original PS sampling being split into two, at the (low) cost of per-packet switching based on some bit checking to determine which PS applies. Information theoretically, the sampling is now split into two parts with very different natures, each controlled by a dedicated parameter: an FS-like direct sampling of flows, and a PS-like in-flow sampling. Here $p_f$ controls the number of sampled flows, and $p_p$ their quality.

The derivation of the sampling matrix mirrors closely that of PS+SYN+SEQ. The result shows clearly how $p_f$ and $p_p$ act in a modular fashion. The flow sampling component controls the top row of $B$ and factors $\tilde{B}$, whereas the packet sampling component determines the internal structure of $\tilde{B}$:

$$B = \begin{bmatrix} q_f & q_f & q_f & q_f & \cdots & q_f \\ p_f & p_f q_p & p_f q_p^2 & p_f q_p^3 & \cdots & p_f q_p^{W-1} \\ 0 & p_f p_p & p_f p_p q_p & p_f p_p q_p^2 & \cdots & p_f p_p q_p^{W-2} \\ 0 & 0 & p_f p_p & p_f p_p q_p & \cdots & p_f p_p q_p^{W-3} \\ \vdots & & & \ddots & \ddots & \vdots \\ 0 & \cdots & & & 0 & p_f p_p \end{bmatrix}$$

The separation of the FS and PS roles in $B$ is clearly reflected in its inverse:

$$\tilde{B}^{-1} = \frac{1}{p_f p_p}\begin{bmatrix} p_p & -q_p & 0 & \cdots & 0 \\ 0 & 1 & -q_p & \ddots & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ & & & 1 & -q_p \\ 0 & \cdots & & 0 & 1 \end{bmatrix}$$

The similarity between $\tilde{B}^{-1}$ for SH and DS is striking. Indeed, whereas SH uses sampling to select which flows it will focus on and then holds to them, DS reverses these operations and thus could be described as a Hold and Sample scheme. However, although they each combine PS and
FS features, the methods remain significantly different. In particular SH is strongly flow length biased whereas DS is unbiased.

Once again Corollary 3 for the inverse Fisher information holds, showing that $J^{-1} = \tilde{J}^{-1} - \frac{q_f}{p_f}\theta\theta^T$. Similarly to SH, from (11) one can show that $\tilde{J}^{-1}$ is tridiagonal with upper off-diagonal terms $(\tilde{J}^{-1})_{k,k+1} = -(p_f p_p)^{-2} q_p d_{k+1}$, $k < W$. The diagonal elements (here $2 \le j \le W - 1$) are given by

$$(J^{-1})_{11} = \frac{1}{p_f^2 p_p^2}\left(p_p^2 d_1 + q_p^2 d_2\right) - \frac{q_f}{p_f}\theta_1^2 = \frac{\theta_1}{p_f} + \frac{1}{p_f p_p}\sum_{k=2}^{W} q_p^{k-1}\theta_k - \frac{q_f}{p_f}\theta_1^2 \tag{25}$$

$$(J^{-1})_{jj} = \frac{1}{p_f^2 p_p^2}\left(d_j + q_p^2 d_{j+1}\right) - \frac{q_f}{p_f}\theta_j^2 = \frac{\theta_j}{p_f p_p} + \frac{1}{p_f p_p}\sum_{k=j+1}^{W} q_p^{k-j}(1 + q_p)\theta_k - \frac{q_f}{p_f}\theta_j^2$$

$$(J^{-1})_{WW} = \frac{d_W}{p_f^2 p_p^2} - \frac{q_f}{p_f}\theta_W^2 = \frac{\theta_W}{p_f p_p} - \frac{q_f}{p_f}\theta_W^2 \tag{26}$$

By setting $p_f = p_p$ we obtain those for PS+SYN+SEQ.

Fig. 2. Parameter space $(p_p, p_f)$ of DS. Fixing ESR $= p$ constrains the family to the solid (blue) curve $p_f(p_p; p)$, where it reduces to PS+SYN+SEQ at the point where $p_p = p_f$ and to FS at $p_p = 1$. Fixing PPR $= p$ constrains the family to a straight (green) line emanating from $(0, pD)$. For fixed $p$, the constraint curves depend on $D$. Three examples are given for each normalization.

IV. COMPARISONS

In this section we compare and contrast the performance of the different methods, using two normalizations which are the key to a fair comparison. We show that a positive semidefinite comparison holds for certain cases, and justify why we ultimately resort to comparison of the diagonals of the covariance matrix. We provide a partial ranking of the methods. Proofs of most key results in this section are deferred to Appendix D.

A. Normalization

We must first consider how to compare fairly. We do this in two ways, using the following packet-based measures of incoming workload/information:


- Packet Processing Rate (PPR): the rate at which packets are initially being sampled (and hence require some further processing).
- Effective Sampling Rate (ESR): the rate at which packets are arriving to the flow table (and hence become available as information for estimation).

PPR is a measure of the processing speed required by the methods, whereas ESR is a measure of the arrival rate of packets containing information which is actually used by the method. Because the methods which discard flows without SYN packets lose the majority of their packets, an equal PPR comparison greatly disadvantages them. An equal ESR comparison effectively inflates the parameters of such methods to compensate. Note that given a sampling method X, both the PPR and ESR normalizations of X+SEQ and X are identical, as the methods only differ in how the packet data is used, not in how many packets are physically collected.

We now present the normalizations for each method. Here and below we use $D = \sum_{k=1}^{W} k\theta_k \ge 1$ to denote the average flow size.

Table I gives the results for PPR. Both the average sampling rate $p$ as a function of the method parameter(s), and its inverse, the parameter value required to set the average sampling rate at $p$, are shown. The expression $p(p_p, \theta)$ for SH was derived by computing the average number of packets sampled (which is highly dependent on $\theta$) and dividing the result by the average flow size $D$. It is monotonically increasing in $p_p$ with $p(1, \theta) = 1$, and $p/p_p \to \sum_{j=1}^{W} j \sum_{k=j}^{W} \theta_k / D \ge 1$ as $p_p \to 0$, which in the worst case ($\theta_W = 1$) becomes $p/p_p = (1 + W)/2$. We invert $p(p_p, \theta)$ numerically as required for choices of $\theta$, denoted by root($p(p_p)$) in the tables. DS has two parameters. We choose to make $p_p$ the independent one.

Table II gives the results for ESR. Only the methods which involve discarding packets after their initial collection have changed compared to Table I. Note that for PS+SYN, $p/p_p \to 1/D \le 1$ as $p_p \to 0$, and $p(1) = 1$. We are particularly interested in the ESR case for the DS family, and so repeat its ESR normalization here:

$$p_f(p_p; p) = \frac{pD}{p_p(D - 1) + 1} \tag{27}$$

The
denominator $p_p(D - 1) + 1$ is simply the average sampled flow size, conditional on its SYN being sampled. For fixed $p$, this equation gives $p_f$ as a monotonically decreasing, in fact convex, function of $p_p$. Three examples (the blue curves) are given in Figure 2 for different values of $D$. To maintain an ESR fixed at $p$, if we increase $p_p$ we must move down the curve and decrease $p_f$ to compensate. The PPR case is similar but simpler, as the level curves are straight lines (green lines in Figure 2). Note that depending on $p$ and $D$, under both PPR and ESR there are values of $p_f$ and/or $p_p$ that may be disallowed.

To compare methods we also require a performance metric. A natural criterion is the set of diagonal elements of the CRLB, the $k$-th being

$$(I^+)_{kk} = (J^{-1})_{kk} - \theta_k^2, \tag{28}$$

since this is a lower bound on the variance of any unbiased estimator of $\theta_k$. We plot the square root of these values, calculated using the expressions from the previous section.

TABLE I. AVERAGE PPR SAMPLING RATES AND NORMALIZATIONS

| Method(params) | Average sampling rate $p$ = PPR($\cdot$) | Parameter value for rate $p$ |
|---|---|---|
| PS($p_p$) | $p_p$ | $p_p = p$ |
| PS+SYN($p_p$) | $p_p$ | $p_p = p$ |
| FS($p_f$) | $p_f$ | $p_f = p$ |
| SH($p_p$) | $\frac{p_p}{D}\sum_{j=1}^{W} j\,q_p^{-j}\sum_{k=j}^{W} q_p^{k}\theta_k$ | root($p(p_p)$) |
| DS($p_p$, $p_f$) | $p_f\frac{1}{D} + p_p\left(1 - \frac{1}{D}\right)$ | $p_f = pD - p_p(D - 1)$ |

TABLE II. AVERAGE ESR SAMPLING RATES AND NORMALIZATIONS

| Method(params) | Average sampling rate $p$ = ESR($\cdot$) | Parameter value for rate $p$ |
|---|---|---|
| PS($p_p$) | $p_p$ | $p_p = p$ |
| PS+SYN($p_p$) | $\frac{p_p\,(p_p(D - 1) + 1)}{D}$ | $p_p = \frac{-1 + \sqrt{1 + 4pD(D - 1)}}{2(D - 1)}$ |
| FS($p_f$) | $p_f$ | $p_f = p$ |
| SH($p_p$) | $\frac{p_p}{D}\sum_{j=1}^{W} j\,q_p^{-j}\sum_{k=j}^{W} q_p^{k}\theta_k$ | root($p(p_p)$) |
| DS($p_p$, $p_f$) | $\frac{p_f\,(p_p(D - 1) + 1)}{D}$ | $p_f = \frac{pD}{p_p(D - 1) + 1}$ |

Fig. 3. An equal PPR comparison of the CRLB bound for $\theta$ with $p = 0.005$.

Fig. 4. Dependence of the CRLB bound on $p$, PPR comparison.
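The normalizations in Tables I and II can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, `theta[0]` is taken to hold $\theta_1$, and the SH rate is inverted by bisection as one possible realization of root($p(p_p)$), relying on the stated monotonicity of the rate in $p_p$.

```python
from math import sqrt


def mean_flow_size(theta):
    """D = sum_k k * theta_k, where theta[0] holds theta_1."""
    return sum(k * t for k, t in enumerate(theta, start=1))


def ds_pf_for_esr(p, p_p, D):
    """Eq. (27): the p_f that keeps the ESR of DS(p_p, p_f) equal to p."""
    return p * D / (p_p * (D - 1) + 1)


def ds_pf_for_ppr(p, p_p, D):
    """Table I, DS row: solve p = p_f / D + p_p * (1 - 1 / D) for p_f."""
    return p * D - p_p * (D - 1)


def ps_syn_pp_for_esr(p, D):
    """Table II, PS+SYN row: positive root of (D - 1) p_p^2 + p_p - p D = 0."""
    if D == 1:
        return p
    return (-1 + sqrt(1 + 4 * p * D * (D - 1))) / (2 * (D - 1))


def sh_rate(p_p, theta):
    """SH average sampling rate p(p_p, theta): mean packets kept per flow, over D."""
    q_p = 1 - p_p
    kept = 0.0
    for k, t in enumerate(theta, start=1):
        # j packets are kept when the first k - j are skipped and the next is sampled
        kept += t * sum(j * p_p * q_p ** (k - j) for j in range(1, k + 1))
    return kept / mean_flow_size(theta)


def sh_pp_for_rate(p, theta, tol=1e-12):
    """Invert sh_rate by bisection (the rate is increasing in p_p)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sh_rate(mid, theta) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Two consistency checks follow directly from the tables: at $p_p = 1$ equation (27) returns $p_f = p$ (flow sampling), and at the PS+SYN value of $p_p$ it returns $p_f = p_p$ (PS+SYN+SEQ).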


We begin with some instructive examples. Consider the results of Figure 3, where we set $W = 5$ with $\theta = \{0.22, 0.21, 0.20, 0.19, 0.18\}$, for which $D = 2.90$ packets. We use the PPR normalization with $p = 0.005$, corresponding to $p_p = \ldots$ for SH. For DS we give two examples satisfying this normalization: $(p_f, p_p) = (0.001, 0.0443)$ and $(p_f, p_p) = (0.1, \ldots)$. As expected from earlier work, the performance of PS is extraordinarily poor. In agreement with the results of [4] and as expected, the inclusion of SEQ improves it enormously, by orders of magnitude, but it is still orders of magnitude behind FS, which has the lowest standard deviation bound of all.

In a highly counterintuitive result, PS+SYN performs (much!) better than PS, despite the enormous number of wasted packets. We can offer two reasons for this. First, in the unconditional framework information in discarded flows is not entirely wasted: some is recouped by the (observable) increase in the $j = 0$ outcome.$^1$ Second, SYN sampled flows (much like FS) have no flow length bias. Together these lead PS+SYN to perform better in most cases even under PPR.

$^1$ It was pointed out in [4], under a conditional framework, that PS+SYN+SEQ outperforms PS+SEQ on actual traces. Our own numerical evaluation of the CRLB of these methods (under the conditional framework) also shows this in some cases, leading us to conjecture that the same counterintuitive results may hold for the conditional framework.

Three results for DS are given. From best to worst these were with $p_f = 0.001 < p$, the special case PS+SYN+SEQ with $p_f = p$, and $p_f = 0.1 > p$. Finally, SH performs well in this example as its key disadvantage, a strong bias to large flows, does not play a large role for such a small value of $W$.

Using the same PPR based comparison, Figure 4 shows the improvement in CRLB as $p$ increases, as we expect. We see that very high values of $p$ are needed before the performance of FS is approached. Here the free parameter $p_f$ for DS was chosen to guarantee meaningful examples to either side of PS+SYN+SEQ. Specifically, the normalization defines for each value of $p$ a feasible range $[p_f^l, p_f^u]$ for $p_f$ which includes $p_f = p$. For DS1 we set $p_f = p - c(p - p_f^l) < p$, and for DS2 $p_f = p + c(p_f^u - p) > p$, where $c \in [0, 1]$. In the figure we use $c = 0.1$.

The great merit of the ESR normalization is that it allows us to view the methods as being different ways in which the same budget of sampled packets can be allocated among flows. The essential tradeoff is that we can have more sampled flows of poor quality, or fewer of higher quality. In a class with no flow length bias, flow sampling is at one extreme: for a given ESR $= p$ it gives the minimum number of sampled flows, each of perfect quality. Among sub-classes of methods with roughly the same number of flows, we can then distinguish between the finer details of how the holes appear over the flow, which will have an impact on the degree of improvement which the sequence numbers can bring.

Fig. 5. An equal ESR comparison of the CRLB bound for $\theta$ with $p = 0.005$.

Results using the same scenario as Figure 3 but ESR normalized are shown in Figure 5. As expected, all SYN-based methods improve significantly. In particular the DS with the smaller $p_f = 0.001$ now outperforms SH, and is second only to FS.

Another example with a larger $W = 50$ and a truncated exponential distribution for $\theta$ with $D = \ldots$ is given in Figure 6. The same general conclusions hold, with the exception of SH, whose performance is now worse than PS+SYN+SEQ rather than considerably better. This is to be expected at larger $W$, as SH expends its packet budget on the rare largest flows. At realistic $W$ values in network data this effect is further exaggerated.

It is interesting to note in the right plot of Figure 6 that variance decreases as $k$ increases. This is consistent with the fact that the ambiguity inherent in an observation of $j$ sampled packets is lower for larger $j$ and disappears at $j = W$, since this observation can only arise in one way. The dip at very small $k$ we attribute to the influence of the constraint.

B. Positive Semidefinite Comparisons

The examples above compared methods using the CRLB of each $\theta_k$
separately. Let two methods have Fisher information $J_1$ and $J_2$. A more complete, and in a sense ideal, comparison would be to show that $J_1 \ge J_2$, since that would imply $I_1^+ \le I_2^+$, a lower CRLB. What this positive semidefinite comparison really means is that for any linear combination $f(\theta) = a^T\theta = \sum_k a_k\theta_k$ of the parameters, the (bound on the) variance of $f(\theta)$ under method 1 will be less than that under method 2. A geometric interpretation is that the ellipsoid corresponding to unit variance of vectors under $J_1$ lies entirely within that of $J_2$. In this section we provide a number of comparisons of this type between methods, and explain why it is not suitable as a universal basis of comparison.

We first confirm the intuition that methods lose information monotonically in their sampling parameter(s).

Theorem 14: For any method Z surveyed here, if $p_1 \ge p_2$ (for DS, $p_{p,1} \ge p_{p,2}$ and $p_{f,1} \ge p_{f,2}$) then $J_Z(p_1) \ge J_Z(p_2)$ (for DS, $J_Z(p_{p,1}, p_{f,1}) \ge J_Z(p_{p,2}, p_{f,2})$). Equality holds iff $p_1 = p_2$ (for DS, $(p_{p,1}, p_{f,1}) = (p_{p,2}, p_{f,2})$).

The proof (see Appendix D) is a straightforward consequence of closure under sampling for all methods except SH. Next, we confirm that the use of sequence numbers increases Fisher information.

Theorem 15: $J_{Z+SEQ} \ge J_Z$ for each of Z = PS and Z = PS+SYN.

Fig. 6. Comparison of the CRLB bound with $W = 50$. Left: PPR with $p = 0.005$; $p_f = 0.001$, $p_p = \ldots$ for DS ($p_f < p_p$) and $p_f = 0.01$, $p_p = \ldots$ for DS ($p_f > p_p$). Middle: ESR with $p = 0.005$; $p_f = 0.032$, $p_p = 0.1$ for DS ($p_f < p_p$) and $p_f = 0.079$, $p_p = 0.001$ for DS ($p_f > p_p$). Right: Zoom of middle plot, the same legend applies.

Proof: We make use of the data processing inequality (DPI) for Fisher information (see Theorem 31 in Appendix B), which states that if $\theta \to Y \to X$ is a Markov chain, where $X$ is a deterministic function of $Y$, then $J_Y(\theta) \ge J_X(\theta)$, which holds with equality if $X$ is a sufficient statistic of $Y$.

We first consider PS+SEQ and PS. Flows are selected randomly and are represented as a random vector of SEQ numbers $V = \{A, A+1, \ldots, A+K-1\}$ with realization $v = \{a, a+1, \ldots, a+k-1\}$. Here, $K$ represents the flow size (realization $k$), while $A$, the initial SEQ number (realization $a$), is uniform over $[0, N_a - 1]$. All operations on them are modulo $N_a$, thus allowing wraparounds. $A$ is independent of $K$. We also assume that $W \ll N_a$ to avoid problems with multiple wraparounds.

After the PS process, we have the SEQ vector $Y = \{Y_i, 1 \le i \le J\}$ (realization $y = \{y_i, 1 \le i \le j\}$). We define two statistics on $Y$: $J(Y)$, the actual number of raw packets sampled, and $L(Y) = \max_i(Y_i) - \min_i(Y_i) + 1$ (with a sample $l(y)$, and $L(Y) = 0$ if $Y$ is empty), the inferred length of the sequence. The statistic $L(Y)$ disregards $A$ since it takes the difference of SEQ numbers. Note that both statistics are deterministic functions of $Y$ and so form Markov chains $\theta \to Y \to L(Y)$ and $\theta \to Y \to J(Y)$. A straightforward application of DPI yields $J_Y \ge J_{L(Y)}$ and $J_Y \ge J_{J(Y)}$.

We show that $L(Y)$ is a sufficient statistic of $Y$ with respect to $\theta$. We use the Fisher-Neyman factorization theorem [14, Theorem 6.5, p. 35], which states that if the probability mass function of $Y$ takes the form

$$\Pr(Y = y) = h(y)\,g(l(y), \theta), \tag{29}$$

where $h$ is independent of the parameters $\theta$, then $L(Y)$ is a sufficient statistic. Let $m \ge 0$ denote the number of unsampled packets before the first sampled packet. If $J = j > 0$,

$$\Pr(Y = y) = \sum_{k=l}^{W} \theta_k \sum_{m=0}^{k-l} \Pr(A = y_1 - m)\, q_p^{m} \left(p_p^{j} q_p^{l-j}\right) q_p^{k-l-m} = \Pr(A = a)\left(\frac{p_p}{q_p}\right)^{j} \sum_{k=l}^{W} (k - l + 1)\, q_p^{k}\theta_k = \frac{1}{N_a}\left(\frac{p_p}{q_p}\right)^{j} \sum_{k=l}^{W} (k - l + 1)\, q_p^{k}\theta_k,$$

since $A$ is uniform. For the case $j = 0$,

$$\Pr(Y = \emptyset) = \sum_{k=1}^{W} q_p^{k}\theta_k.$$

Clearly, each case satisfies (29), hence $L(Y)$ is a sufficient statistic of $\theta$. By the DPI, since $L(Y)$ is a sufficient statistic, $J_Y = J_{L(Y)}$. From the previous relation $J_Y \ge J_{J(Y)}$, we now have $J_{L(Y)} \ge J_{J(Y)}$. But $J(Y)$ is equivalent to the statistic used in PS, proving the result.

As for PS+SYN sampling, we define an additional random variable $S$ taking value 1 if the SYN packet was sampled, 0 otherwise. $V$ is defined as before. $Y = \emptyset$ if and only if $S = 0$. The same statistics $L(Y)$ and $J(Y)$ are defined. Then, for $J = j > 0$,

$$\Pr(Y = y) = \sum_{k=l}^{W} \theta_k \Pr(A = y_1)\left(p_p^{j} q_p^{l-j}\right) q_p^{k-l} = \Pr(A = a)\left(\frac{p_p}{q_p}\right)^{j} \sum_{k=l}^{W} q_p^{k}\theta_k = \frac{1}{N_a}\left(\frac{p_p}{q_p}\right)^{j} \sum_{k=l}^{W} q_p^{k}\theta_k,$$

and for $j = 0$,

$$\Pr(Y = \emptyset) = \sum_{k=1}^{W} q_p\theta_k = q_p.$$

Once again, each case satisfies (29), hence $L(Y)$ is a sufficient statistic. Thus, by DPI, the same relationship $J_{L(Y)} \ge J_{J(Y)}$ holds, with $J(Y)$ now equivalent to the statistic used in PS+SYN.

Intuitively this result seems obvious: if the additional information afforded by the sequence numbers is available, we should certainly be able to do better by using it. It is tempting to conclude by the same logic that $J_{PS} > J_{PS+SYN}$ under the PPR normalization, since deciding to discard flows without a SYN is also a deterministic transformation. However, the data processing inequality does not apply here because some of the information used by PS+SYN (namely the SYN variable $S$), although available to PS, is not used by

it. Indeed we saw in the figures above that, counterintuitively, PS+SYN can actually outperform PS in terms of the individual variances under PPR, which is a counterexample to the more general positive semidefinite comparison. Under ESR this is even more true, as we saw from the middle plot in Figure 6. We now give another, even more surprising counterexample which teaches an important lesson.

Theorem 16: $J_{FS} \not\ge J_{PS}$.

Proof: The proof is by contradiction via counterexample. Let $W = 2$ with $\theta_1 = \theta_2 = 1/2$, and let $p_p = p_f = p$. Evaluate $1_2^T (J_{FS} - J_{PS}) 1_2$ (the sum of each element of the matrix difference), which we expect to be nonnegative by assumption. It can be shown that this reduces to 2 2q(1 + q2 ) 1 + q 1 + 3q 4q3 2 q q, which can be negative, e.g. when $q = 1/2$, a contradiction.

Using Lemma 27(ii) one can show that $J_{FS} \not\ge J_{PS}$ implies $J_{PS}^{-1} \not\ge J_{FS}^{-1}$, which in turn implies $I_{FS}^+ \not\le I_{PS}^+$ (see Lemma 26). Since we earlier showed that PS is enormously worse than FS for variances, this result is surprising, particularly as experience shows that it is not difficult to find other examples for much larger $W$. We can explain this quandary as follows. When we focus on the variances in isolation we highlight the poor performance of PS. However, linear combinations such as $\sum_k a_k\theta_k$ bring in cross terms, which for PS are negative because of strong ambiguity in its observations. Strong correlations are a bad feature of a covariance matrix of an estimator; however, if they are negative, they can in some cases cancel other positive terms, resulting in a lower total variance. Hence, paradoxically, it is the poor behaviour of PS which prevents $J_{FS} \ge J_{PS}$ from holding.

The last example reveals that a ranking of methods based on a positive semidefinite comparison, although very desirable when true, is not possible in general. We therefore turn to a more generally applicable approach in the next section, which we use for the remainder of the paper.

C. Variance Comparisons: a Partial Ranking

In this section we focus on comparing methods via the
diagonal elements of $I^+$, corresponding to the optimal variances of $\theta_i$ estimates, just as in the figures above. As before, it is sufficient to consider the unconstrained covariance matrix $J^{-1}$, since $I^+ = J^{-1} - \theta\theta^T$ by (15). To the extent possible, we seek to establish a hierarchy between methods based on diagonal comparisons. Thanks to Theorem 15, which applies in particular to diagonal elements, we already have the answer in some cases.

We begin by comparing PS and PS+SYN. The examples given so far showed PS+SYN as enormously superior to PS. However, this is not always the case. A counterexample under the PPR normalization with $p = 0.05$ is given by

$$\theta = \{\theta_1,\ \theta_1,\ \theta_{3 \le k \le 10} = \theta_1/100\}, \quad \theta_1 = 0.4808, \tag{30}$$

for which $D \approx 1.69$. As seen in Figure 7, PS outperforms PS+SYN for $\theta_9$ and $\theta_{10}$, but not otherwise. This makes sense given the bias of PS toward sampling large flows. For these same parameters PS+SYN outperforms PS for all $\theta_k$ under ESR. This is typically the case. For example, with the above parameter set, PS+SYN outperforms PS on all sampling rates. While it may be true that PS outperforms PS+SYN under ESR in some rare situations, we believe this is not the case, and conjecture that PS+SYN defeats PS under ESR for all $\theta$. For small $p$, we can prove that this is the case.

Theorem 17: Under PPR and ESR normalization, for small enough $p$, $(J_{PS+SYN}^{-1})_{jj} \le (J_{PS}^{-1})_{jj}$ for any $\theta$, $j = 1, \ldots, W$.

Fig. 7. PPR comparison between PS and PS+SYN with $p = 0.05$ for a particular distribution highly skewed to small flows.

We now consider the above comparison after both methods have benefitted from the use of sequence numbers.

Theorem 18: Under PPR and ESR normalization, for every $3 \le j \le W$, $(J_{PS+SYN+SEQ}^{-1})_{jj} \le (J_{PS+SEQ}^{-1})_{jj}$.

For each of $j = 1$ and $2$, counterexamples can be found for certain $\theta$ and $p$, under both PPR and ESR. For example, PS+SEQ has lower variance for $\theta_1$ for the parameter set (30) with $p = 0.05$ under both PPR and ESR, and for $\theta_2$ the family $\theta = \{a, 1 - 2a, a\}$ with $a < 0.1$ gives examples for wide ranges of $p$, including as $p \to 0$ for $a$ small enough. The counterexamples occur in atypical situations
where $\theta_1$ or $\theta_2$ are large relative to other $\theta_j$, and can be excluded if these are appropriately controlled. For example, if $p_p < 1/2$ it can be shown that when $\theta_2/\theta_3 < q_p^2$, then PS+SEQ is worse under PPR (and hence ESR). Given that PS+SYN+SEQ defeats PS+SEQ except for perhaps $\theta_1$ or $\theta_2$ under atypical conditions, in general PS+SYN+SEQ can be regarded as superior.

The picture emerging from the above comparisons is that PS+SYN+SEQ is clearly superior to PS and PS+SEQ. Since PS+SYN+SEQ is just a special case of DS, we now consider this family in more detail. We focus on the ESR normalization which, apart from being far more important than PPR, is the key to our optimality result in Section VI. The following is one of our main results, a detailed characterization of the performance of DS under ESR. It can be shown that an analogous result does not hold for PPR.

Theorem 19: The diagonal elements $2 \le j \le W$ of $J_{DS}^{-1}$ under the ESR normalization are monotonically decreasing in $p_p$.

The property holds for $j = 1$ iff the condition

$$\theta_2 \ge \frac{D - 1}{D}\,\theta_1(1 - \theta_1) \tag{31}$$

is satisfied. Also, monotonicity holds when $D = 1$ or $W = 2$.

The above result shows, provided (31) is satisfied, that the variance bounds for each $\theta_k$ under DS drop as $p_p$ increases. It follows that the optimal point in the DS family lies at $p_p = 1$, that is, flow sampling! The superiority of FS clearly demonstrates that the problem of inverting from imperfect sampled flows is so difficult that in the tradeoff between flow quality and quantity, quality wins convincingly.

The exception is for $\theta_1$ when (31) fails, in which case the optimal DS lies to the left of $p_p = 1$. However, this only occurs when $\theta_2/\theta_1$ is extremely small, and even then in most cases the optimal value $p_p^* \approx 1$ unless $D$ is also very large. Consider the region $p_p \approx 1$. By solving for the optimal $p_p$ using (41), we obtain (provided $D > 1$)

$$p_p^* = \sqrt{\frac{\theta_2}{(D - 1)\left[\theta_1(1 - \theta_1) - \theta_2\right]}} \tag{32}$$

For $p_p^*$ to be much smaller than 1, we require $D$ to be large and the ratio $\theta_2/\theta_1$ very small (effectively, $\theta_2$ being negligible). Such a scenario is very unlikely to occur in networks and in any case only affects $\theta_1$, and so it is reasonable to assume that the optimal DS corresponds to FS in practice.

Our next result compares the final method, SH, against FS.

Theorem 20: Under PPR and ESR normalization, for every $2 \le j \le W$, $(J_{FS}^{-1})_{jj} \le (J_{SH}^{-1})_{jj}$. For $j = 1$ the condition holds for any $\theta$ for sufficiently small $p$. Sufficient conditions independent of $p$ are either $W = 2$, or for $W > 2$,

$$\theta_2 \ge \theta_1(1 - \theta_1) \tag{33}$$

The proof for (33) involves bounds which are tight as $p \to 1$. Hence, when the condition is violated and $p$ is large enough, SH can indeed outperform FS for $\theta_1$. An example is given in Figure 8 with $W = 6$ and $p_p = 0.4351$, which is large and reasonably close to $p = 0.5$. However, this effect is difficult to see in distributions with larger $W$, as SH shifts most of its packet budget to larger flows, resulting in $p_p$ being orders of magnitude smaller than $p$ and the bounds leading to the sufficient condition becoming very loose. In all cases of interest therefore, FS can be regarded as superior
to SH at all flow sizes.

Finally, we clarify the relationship between DS and SH more generally. DS is a two-parameter family whose performance varies in a wide range. Indeed, Figure 5 already shows that it can perform better than SH at the FS end of its range, and worse at the other end. The important observation is that in all cases where FS outperforms SH for a given $\theta$, it follows from continuity of the CRLB in $(p_p, p_f(p_p))$ that there are members of the DS family other than FS which do likewise. Theorem 20 shows that this is true in almost all cases under ESR, allowing us to conclude that in general DS outperforms SH at all flow sizes.

The natural counterexample to this general rule is when (31) is satisfied so that the best DS can do is given by FS, and yet $\theta$ is such that SH defeats FS (for $\theta_1$), implying that SH defeats all members of the DS family for $\theta_1$. This can occur if (33) is not satisfied and $p$ is large. Furthermore there are cases where both SH and the optimal DS defeat FS, in which case the comparison is more complex. Since however this battle is contained within a very small and unimportant region of $(\theta, p_p, p_f(p_p))$ space, we do not characterise these counterexamples fully but instead conclude with the following result, which for the first time exhibits specific DS members (other than those in a neighbourhood of FS) which defeat SH.

Theorem 21: Under ESR normalization with rate $p$, if $p \to 0$, then a sufficient condition for $(J_{DS}^{-1})_{jj} \le (J_{SH}^{-1})_{jj}$ for every $1 \le j \le W$ is that $p_{p,DS}$ satisfies ,SH D 1 D 1,SH,DS

Summary: We have confirmed that a ranking of methods based on a positive semidefinite comparison is impossible, since even a diagonal comparison does not yield a simple hierarchy. Nonetheless, ignoring the counterexamples which sometimes occur for the CRLB variance for $\theta_1$ and perhaps $\theta_2$, a clear overall picture emerges from the comparisons above. First, from Theorem 15, the utilization of sequence numbers yields consistent information theoretic dividends. In particular, PS need not be considered further since it has
extremely poor performance and by Theorem 15 PS+SEQ is better. Theorems 17 and 18 show that the use of SYN sampling as a technique to eliminate bias is very powerful, which makes PS+SYN+SEQ the leading candidate, and focusses attention on its generalization, DS. Theorem 19 then shows that DS outperforms PS+SYN+SEQ under the ESR normalization (the most relevant one for estimation performance) provided $p_p$ is larger than the PS+SYN+SEQ value on the same ESR curve, and furthermore that FS is the favored member of DS. Theorem 20 shows that FS is also superior to SH. In conclusion, FS is the best sampling method for all flow sizes.

Fig. 8. PPR and ESR comparison between FS and SH with $p = 0.5$ for a particular distribution highly skewed to small flows.
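The ranking summarized above can be explored numerically from the closed-form diagonals. The sketch below is illustrative only and rests on stated assumptions: it evaluates $(I^+)_{jj} = (J^{-1})_{jj} - \theta_j^2$ using Remark 12 for FS, equations (23)-(24) for SH and (25)-(26) for DS; the function names are hypothetical; and the fifth entry of $\theta$ is taken to be 0.18, the value for which $\theta$ sums to one and $D = 2.90$.

```python
def crlb_fs(theta, p_f):
    """Remark 12: (I+)_kk = theta_k (1 - theta_k) / p_f."""
    return [t * (1 - t) / p_f for t in theta]


def _tail(theta, q_p, j):
    """sum_{k=j+1}^{W} q_p^(k-j) theta_k, where theta[0] holds theta_1."""
    return sum(q_p ** (k - j) * theta[k - 1]
               for k in range(j + 1, len(theta) + 1))


def crlb_sh(theta, p_p):
    """Diagonal of I+ for Sample and Hold, from (23)-(24) and (28)."""
    q_p = 1 - p_p
    bounds = []
    for j, t in enumerate(theta, start=1):
        tail = _tail(theta, q_p, j)
        if j == 1:
            j_inv = t + tail / p_p
        else:
            j_inv = (t + (1 + q_p) * tail) / p_p
        bounds.append(j_inv - t * t)
    return bounds


def crlb_ds(theta, p_p, p_f):
    """Diagonal of I+ for Dual Sampling, from (25)-(26) and (28)."""
    q_p, q_f = 1 - p_p, 1 - p_f
    bounds = []
    for j, t in enumerate(theta, start=1):
        tail = _tail(theta, q_p, j)
        if j == 1:
            j_tilde_inv = t / p_f + tail / (p_f * p_p)
        else:
            j_tilde_inv = (t + (1 + q_p) * tail) / (p_f * p_p)
        j_inv = j_tilde_inv - (q_f / p_f) * t * t  # Corollary 3 correction
        bounds.append(j_inv - t * t)
    return bounds


# Walk along the ESR curve (27) towards flow sampling at p_p = 1.
theta = [0.22, 0.21, 0.20, 0.19, 0.18]  # fifth entry is an assumption
D = sum(k * t for k, t in enumerate(theta, start=1))
p = 0.005
for p_p in (0.01, 0.1, 0.5, 1.0):
    p_f = p * D / (p_p * (D - 1) + 1)
    print(p_p, [round(v ** 0.5, 3) for v in crlb_ds(theta, p_p, p_f)])
```

At $p_p = 1$ the DS expressions collapse to the FS bound of Remark 12, which gives a quick check that the two sets of formulas agree at the flow sampling end of the family.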

This strongly motivates the adoption of sampling methods which approach FS as closely as possible, but which are efficiently implementable. This refocuses attention back to DS as it enjoys both of these properties.

V. ANALYSIS OF THE SEQ DIVIDEND

The utilization of sequence numbers provides additional virtual packets for free. The extent to which this improves our information on $\theta$ clearly depends on parameters. For example, flows with only $k = 1$ or $2$ packets are not helped at all. Intuitively (for example, under PS) it is the flows for which $k \gg 1/p_p$ which will receive the most benefit, since only the packets before the first and after the last physically sampled packets will not be recovered using sequence numbers, and on average there are $2/p_p$ of these since $1/p_p$ is the average gap between two physically sampled packets within a flow.

To quantify the benefit of sequence numbers better we define the effective packet gain, which is the ratio $r = E[\tilde{N}_k]/E[N_k]$, where $\tilde{N}_k$ and $N_k$ are the number of SEQ assisted and unassisted (physical) packets sampled respectively from a flow of size $k$. By definition, $R = \tilde{N}_k/N_k \ge 1$ and so $r \ge 1$, and asymptotically the gain saturates at $r = 1/p_p$. For DS it can be shown (see [15] for more details) that to obtain a ratio $\alpha$ of this maximum gain, a flow must have a size of the order of $\frac{q_p(1 + \alpha)}{p_p(1 - \alpha)}$, suggesting that flows must have a size of roughly $1/p_p$ or more before the sequence number information dividend effectively switches on.

Although intuitively appealing, this is however quite misleading. There is a high variance associated with the random variable $\tilde{N}_k$, which for large flows under PS+SYN+SEQ goes as Var(Ñk) 4 + k(k 2) + (2k 3). Therefore the gain $R$ is not sharply concentrated around its mean and is highly dependent on the sampling parameter. More generally, the sequence number dividend switch is primarily about whether at least two packets are sampled. When $p_p$ is small this is a very rare event, but can nonetheless be responsible for most of the available information about flow size. For example in
Figure 3 the probability of sampling two packets is of the order of $10^{-6}$, and yet the ratio of variance of PS to PS+SEQ is around 400: even rare dividends are worth having! Thus even for smaller to medium sized flows ("mice" and "rabbits"), the use of sequence numbers provides a huge reduction in estimator variance, as shown by numerical and empirical evidence throughout this paper.

It is interesting to compare the SEQ dividend with SH, which implements a kind of extreme SEQ effect in the sense that if only a single packet is sampled, then all subsequent ones will be. This is a more powerful switch than the SEQ mechanism, which requires at least two packets. For large flows however, packets inferred under DS approach this level of packet recovery since, because the SYN packet is sampled, only a geometric number of packets will be missed at the end of the flow, compared to a geometric number at the beginning under SH. Of course, SH achieves this solely from its real packet budget whereas DS does not, which explains why DS outperforms SH even in the latter's area of strength, namely very large flows.

VI. COMPUTATIONAL ISSUES

A. Optimizing DS with Computational Constraints

From our result on the monotonic behaviour of ESR normalized DS, what is optimal in terms of information is clear: simply move down the ESR constraint curve to FS. However, there are other constraints from the resource side which will constrain the parts of the curve that will be accessible. The solution to the joint information/resource optimization problem is therefore clear: move as far down the ESR curve as possible. Our task now is to determine those constraints and hence the region in the $(p_p, p_f)$ plane which is feasible.

The bottlenecks in the implementation of any sampling method are the memory access time and memory size. CPU processing is included in the memory access, since the CPU has to read out values from memory during lookup, performing modifications and write-backs when necessary. These tasks constitute the main portion of what is required
of the CPU for measurement, as the measurement process is basically all about efficient counting [5]. This can be done for example with a hybrid SRAM-DRAM counter architecture [16], [17], [18], where a small amount of (fast but expensive) SRAM is used for the counter and its value periodically exported to a (cheaper but slower) DRAM store. With about 10 ns access time for SRAM, such a counter can be implemented even for OC-768 links which run at 40 Gbps.

We now consider how to optimize DS based on these constraints. Regarding flows, let $T_{max}$ be the maximum flow table size measured in terms of flow records and $\lambda_F$ be the flow arrival rate. As for packets, let $C$ be the capacity of the link, $P$ the size of the smallest packet ($\approx 40$ bytes), and $\tau$ the access time of memory (nominally DRAM).

Our simple analysis of the above bottleneck constraints is based on the following. In terms of packet arrivals we assume the worst case, namely the smallest size packets arriving back-to-back at line rate. By bounding the processing in such a case we guarantee that the front line of packet processing (which occurs at the highest speeds) is not under-dimensioned. In terms of flow arrivals we assume the average case based on the average number of active flows. This can easily be made more conservative by replacing the average by some quantile to take into account fluctuations in flow arrivals. For simplicity, we ignore data export constraints from the measurement center to a central data collection center. We also do not consider rate adaptation based on the traffic condition, although such schemes are compatible with DS.

Consider the processing of a single packet. The SYN bit is first tested to see which sampling parameter will apply. The cost of this is negligible, and if a packet is not sampled no further action is needed. Now consider the cost of a packet which is sampled. Each SYN packet which is sampled is inserted into the flow table. No prior lookup is necessary since it must be the first packet of its flow. Each non-SYN packet which is
sampled must first perform a lookup of its flow-id in the flow table to see if that flow is being tracked. If not, it is discarded.
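The per-packet steps of DS can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `DualSampler`, the dictionary used as a flow table, and the returned action strings are all hypothetical, and a single packet counter per flow is assumed.

```python
import random


class DualSampler:
    """Toy flow table for Dual Sampling (DS) with one packet counter per flow."""

    def __init__(self, p_f, p_p, rng=None):
        self.p_f = p_f              # sampling probability for SYN packets
        self.p_p = p_p              # sampling probability for non-SYN packets
        self.rng = rng or random.Random()
        self.table = {}             # flow_id -> number of sampled packets

    def process(self, flow_id, is_syn):
        """Handle one packet and return the action taken, for illustration."""
        # The SYN bit selects which sampling parameter applies.
        prob = self.p_f if is_syn else self.p_p
        if self.rng.random() >= prob:
            return "skipped"        # not sampled: no memory access at all
        if is_syn:
            # Sampled SYN: first packet of its flow, so insert without lookup.
            self.table[flow_id] = 1
            return "inserted"
        # Sampled non-SYN: a lookup is always needed.
        if flow_id not in self.table:
            return "discarded"      # flow not tracked: lookup only
        self.table[flow_id] += 1    # tracked flow: lookup followed by update
        return "updated"
```

The last branch, a lookup followed by an update, costs two memory operations and is therefore the most expensive case for a sampled packet.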

The cost of this wasteful per-packet implementation of the flow discarding step (inherent to any SYN based method such as DS) is not the bottleneck because the following case is the most expensive of all and is the one we model: a non-SYN packet which is sampled and whose flow is being tracked requires both a lookup followed by an update. Using the above, the constraints are

$$p_f \le p_f^* = \min\left(\frac{T_{\max}}{D\lambda_F}, \frac{P}{\tau C}\right), \qquad p_p \le p_p^* = \frac{P}{2\tau C}. \qquad (34)$$

The constraint $T_{\max}/(D\lambda_F)$ ensures that the average number of tracked flows, $p_f D\lambda_F$ out of the $D\lambda_F$ active flows, does not exceed the flow table size. Constraint $P/(\tau C)$ provides the worst case bound for per-packet processing, for a single operation (insert or update). The factor of 2 that appears in the denominator for $p_p^*$ is to account for the worst case, in which a lookup and update is needed. The analysis is based on the use of a single per-flow counter to track flow size. In practice, there may be more counters needed to track other quantities of interest, necessitating tighter constraints.

In the sequel, our examples consider traffic mixes that have manageable numbers of SYN packets, so that from (34), $p_f^* = T_{\max}/(D\lambda_F)$. This is done mainly to illustrate the relationship between the CRLB and parameter $p_f$. Indeed, at lower link speeds (OC-48 for example), $p_f$ is determined by the number of flows arriving at the measurement point, rather than packet processing time. Secondly, to increase accuracy, from the previous discussions, we would like $p_f < p_p$, i.e. fewer, but better quality sampled flows.

The constraints form a simple region on the $(p_p, p_f)$ plane that is convex, since it is rectangular with a corner at the origin. We want to minimize the variance for each $\theta_k$ subject to these constraints. Since the ESR curve is convex with respect to $p_f$ and $p_p$, the optimal value must lie on the vertex of the convex constraint set [19, Corollary 32.3.1, p. 344]. For this to hold, we require that the optimum for $\theta_k$ on the ESR curve (Section IV-C) is outside the constrained region. This will be the case for all $k$ for any reasonable traffic mix. Under such
conditions, the solution is therefore $p_f = p_f^*$ and $p_p = p_p^*$. We then have a relationship between the flow table size and link capacity and the diagonal elements of $J^{-1}_{DS}$. Since it is apparent from (26) that the diagonals are dominated by $1/p_f$, by substituting the optimal solution, we have for $1 \le j \le W$,

$$(J^{-1}_{DS})_{jj} = O\left(\frac{1}{p_f^*}\right) = O\left(\frac{1}{T_{\max}}\right).$$

Thus, with a larger table size (i.e. more memory available), the variance of the estimator can be reduced. As for the relation to capacity, we assume that $C$ is large and use the approximation $q_p \approx 1$. This can be justified considering that sampling is required beyond OC-3 link speeds. The diagonals become

$$(J^{-1}_{DS})_{11} = \frac{\theta_1(1-\theta_1)}{p_f} + \frac{1}{p_p p_f}\sum_{k=2}^{W}\theta_k + \theta_1^2,$$

$$(J^{-1}_{DS})_{jj} = \frac{\theta_j(1-p_p\theta_j)}{p_p p_f} + \frac{1}{p_p p_f}\sum_{k=j+1}^{W}2\theta_k + \theta_j^2,$$

$$(J^{-1}_{DS})_{WW} = \frac{\theta_W}{p_p p_f} - \frac{1}{p_f}\theta_W^2 + \theta_W^2,$$

which are approximately inversely proportional to $p_p$, and hence are $O(C)$. We conclude that the variance of DS is inversely proportional to memory usage and proportional to the link capacity.

As an example, consider an OC-192 link with $D\lambda_F =$ flows/sec, $T_{\max} = 100{,}000$ and an access time of 100 ns for DRAM. Let us assume a further 100 ns is required for further processes (e.g. sequence number information). Thus, $p_f = 0.1$ and $p_p = 0.08$. With our numerical evaluation on the Leipzig-II trace (discussed in the following section), this would imply that the trace contain at least original flows in order to achieve a standard deviation of $10^{-8}$ or better. If we compare this to PS with a sampling rate of $p = 0.1$, we require a staggering flows to achieve the same performance!
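The operating point implied by (34) is easy to evaluate numerically. The sketch below is illustrative only: the function name `ds_operating_point` is hypothetical, link rates are rounded to 10 and 40 Gbps, both probabilities are capped at 1, and the active-flow count of $10^6$ is an assumed value, back-solved so that $T_{\max}/(D\lambda_F)$ matches the quoted $p_f = 0.1$.

```python
def ds_operating_point(t_max, active_flows, min_packet_bits, tau, capacity):
    """Return (p_f, p_p) at the vertex of the feasible region in (34)."""
    per_op = min_packet_bits / (tau * capacity)   # P / (tau * C): one memory operation
    p_f = min(t_max / active_flows, per_op, 1.0)
    p_p = min(per_op / 2.0, 1.0)                  # lookup + update: two operations
    return p_f, p_p


P_BITS = 40 * 8            # smallest packet, 40 bytes
ACTIVE_FLOWS = 1e6         # assumed D * lambda_F (see the note above)
TAU = 200e-9               # 100 ns DRAM access + 100 ns further processing

# 10 Gbps link with a table of 100,000 flow records.
oc192 = ds_operating_point(100_000, ACTIVE_FLOWS, P_BITS, TAU, 10e9)
# 40 Gbps link with a table of 10,000 flow records.
oc768 = ds_operating_point(10_000, ACTIVE_FLOWS, P_BITS, TAU, 40e9)
print(oc192)   # approximately (0.1, 0.08)
print(oc768)   # approximately (0.01, 0.02)
```

Shrinking `tau` raises only $p_p$ as long as the flow table remains the binding constraint on $p_f$, which is the direction that moves DS toward FS.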
To observe the dependence on memory and link capacity, now consider an OC-768 link instead with $T_{\max} = 10{,}000$. This time, we have $p_f = 0.01$ and $p_p = 0.02$. The number of flows required to achieve the same standard deviation now increases to which is still orders of magnitude better than PS.

If we consider the SRAM-DRAM architecture discussed earlier, with state-of-the-art SRAM having access times of about 5-10 ns, this can only increase the value of $p_p$. Ideally, we still keep $p_f$ low to reduce the number of flow entries while increasing $p_p$ to approach FS performance. Continuing on from the previous example of the OC-192 link and assuming $\tau = 5$ ns, $p_p$ can now approach 1. For the OC-768 example, $p_f = 0.01$ while $p_p = 0.8$, giving huge performance gains.

B Other Issues

Checking for the SYN bit can be done in a simple way by testing the payload type in the IP header and then verifying the presence of the SYN bit. This takes much less effort than deep packet inspection systems. Furthermore, sampling decisions can be implemented using precalculated values, much faster than a straight random number generator implementation.

We address the problem of flow table overload by using the method proposed by Estan et al. [20] by defining discrete measurement time bins, where sampled flows are exported at the end of each time bin. Consequently, overload of the flow table can be avoided at the cost of increased export rate.

VII CASE STUDY ON INTERNET DATA

In this section we test the performance of DS on two traffic traces under the ESR normalization, and compare it to FS and its closest competitor, SH. We also further examine the

benefit of sequence numbers both in a fully empirical setting, and an idealised one. Since we work with real data, we require an unbiased estimator for each method. We propose closed-form estimators that achieve the CRLB asymptotically. We test their statistical performance by examining how closely they approach the empirical CRLB on the traces we used. In general, the empirical results match the theoretical ones from earlier sections well.

A Data Traces

We used two publicly available network traces, Leipzig-II [21] and Abilene-III [22], which are each collections of anonymized packet headers passing through a single router. Since the raw traffic is at packet level, specialized software such as CoralReef [23] is required to reconstruct flows. We modified the software so that only TCP flows were analyzed. Summary statistics of these traces appear in Table III. Both traces are unidirectional, presenting problems when constructing a sequence number function, as elaborated later.

There are many flows whose SYN packet is missing as they began before the measurement interval. To be consistent with our model assumptions, we remove these. A similar situation was encountered in [1]. Fortunately, such flows are in the minority, for example with Leipzig-II they account for only 18%. Furthermore, when sampling the trace, we assume an infinite timeout, that is, flows are expired at the end of the measurement interval. This is in accordance with the fact that we do not consider flow splitting. In practice, timeouts would split a flow, resulting in underestimation of flow size.

Note that some flows may be malformed and have one or more SYN packets within the flow. Potentially, DS may sample one of these packets and treat it as a new flow, instead of part of a longer one. However, such cases are rare. In Leipzig-II, there are only 468 of such flows, or 0.02%, and there are none in Abilene-III. These flows were left in the trace.

TABLE III
SUMMARY OF THE TRACES USED

Trace         Link Capacity   Active TCP Flows   Duration (hh:mm:ss)   D
Leipzig-II    50 Mbps         2,277,052          02:46:
Abilene-III   10 Gbps         23,806,285         00:59:

The value of $D$ in Table III is the actual average flow size. In our experiments, we truncate Leipzig-II to $W = 1000$ and Abilene-III to $W = 2000$, resulting in $D$ being 1.94 and 7.65 packets for each trace respectively. Truncation is performed by discarding all flows with size above $W$, which ensures that the assumption $\theta_k > 0$, $k = 1, 2, \ldots, W$ is met.

B Closed-Form Unbiased Estimators

The analysis of earlier sections was centered on the CRLB, which bounds what is achievable by any unbiased estimator, but it is neither a construction of such an estimator nor even a proof of its existence. When working with real data an actual estimator is required, ideally one that achieves the previously computed CRLB. The maximum likelihood estimator (MLE) is an attractive candidate as it is asymptotically efficient, guaranteeing that its performance approaches the CRLB asymptotically [24]. However, the MLE, although unbiased asymptotically, is in general biased, especially in the small sample regime.

Fig 9 Comparison of DS on Leipzig-II, using $W = 1000$ and varying parameters under ESR normalization with $p = 0.01$.

Fig 10 Comparison of DS on Leipzig-II, using $W = 1000$ and varying parameters under ESR normalization with $p = 0.001$.

We propose the following estimator for FS and DS (or other SYN based methods such as PS+SYN):

$$\hat{\theta} = \tilde{B}^{-1}\,\frac{1}{N_f}\,[M_1, M_2, \ldots, M_W]^T. \qquad (35)$$

This estimator has a natural interpretation as an empirical histogram based on observed sampled flow counts, inverted by $\tilde{B}^{-1}$ to the original flow distribution. It is easy to see that it is unbiased since $E[\hat{\theta}] = \tilde{B}^{-1}\tilde{B}\theta = \theta$. The matrix $\tilde{B}^{-1}$ can be computed explicitly for these methods, as shown in Section III. Our estimator is further motivated by its relation to the MLE, as we now show (all proofs are deferred to Appendix E).
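The histogram-inversion estimator (35) can be illustrated with a small numerical sketch. The sampling matrix built by `ds_matrix` is an assumption made for this example (DS with a perfect sequence number function, where the observed size of a tracked flow is the position of its last sampled packet); the function names and the toy distribution `theta` are hypothetical.

```python
def ds_matrix(W, p_f, p_p):
    """Rows j = 1..W of an assumed DS sampling matrix with perfect sequence numbers.

    Entry [j-1][k-1] is the probability that a flow of k packets is observed
    with size j, taken here as the position of its last sampled packet.
    """
    q_p = 1.0 - p_p
    B = [[0.0] * W for _ in range(W)]
    for k in range(1, W + 1):
        B[0][k - 1] = p_f * q_p ** (k - 1)            # only the SYN is sampled
        for j in range(2, k + 1):
            B[j - 1][k - 1] = p_f * p_p * q_p ** (k - j)
    return B


def solve_upper(B, y):
    """Back substitution for the upper triangular system B x = y."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(B[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / B[i][i]
    return x


def estimate_theta(B, counts, n_flows):
    """Estimator (35): invert the histogram [M_1..M_W] / N_f through B."""
    return solve_upper(B, [m / n_flows for m in counts])


theta = [0.5, 0.3, 0.2]
B = ds_matrix(3, p_f=0.1, p_p=0.5)
n_flows = 1_000_000
# Feed the estimator the expected counts E[M_j] = N_f * (B theta)_j.
expected = [n_flows * sum(B[j][k] * theta[k] for k in range(3)) for j in range(3)]
theta_hat = estimate_theta(B, expected, n_flows)
```

Because the estimator is linear in the counts, applying it to the expected counts returns `theta` itself (up to rounding), which is the unbiasedness property stated above. With real counts the same call applies unchanged, and no iterative procedure is needed.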

Theorem 22: If the sampling matrix of a method satisfies $b_0 = q\mathbf{1}_W$, then the MLE is

$$\hat{\theta} = \tilde{B}^{-1}\left(\frac{p}{\sum_{j=1}^{W} M_j}\right)[M_1, M_2, \ldots, M_W]^T. \qquad (36)$$

By rewriting $\sum_{j=1}^{W} M_j$ as $N_f(1 - M_0/N_f)$ and observing that $M_0/N_f$ converges to $q$ with high probability (this follows from Hoeffding's inequality [25, p. 303]), the MLE (36) tends to our estimator (35), which therefore approaches the CRLB asymptotically. Our proposed estimator only obeys the constraint $\mathbf{1}_W^T\hat{\theta} = 1$ on average, while the estimator in Theorem 22 always obeys the constraint. (Note that (36) remains a viable estimator in the conditional framework, see Remark 32 in Appendix E.)

The following applies to SH.

Theorem 23: The MLE for SH is given by

$$\hat{\theta}_{SH} = \tilde{B}_{SH}^{-1}\,\frac{1}{N_f}\,[p(M_0 + M_1), M_2, \ldots, M_W]^T. \qquad (37)$$

This time the MLE is unbiased (see Appendix E), so we use it directly on the data. It closely resembles the estimator (35), with a slight difference: for the estimate of $\theta_1$, the number of missing flows $M_0$ plays a significant role. This can be interpreted as follows. Since SH works by geometrically skipping the first few packets in a flow, flows of size 1 are those most likely to have been entirely missed. Hence, the simplest way to incorporate a knowledge of $M_0$ is to assume that it arises solely from evaporated flows originally of size 1.

The advantage of simple closed form estimators such as these, which only require a matrix multiplication, is that they eliminate the need for iterative estimation algorithms such as Expectation-Maximization (EM), often employed in the literature [1], [4], [8]. From a computational viewpoint, this is highly advantageous.

Fig 11 Comparison of DS on Abilene-III, using $W = 2000$ and varying parameters under ESR normalization with $p = 0.005$.

C Testing with a Perfect Sequence Number Function

We begin our case study by testing DS with a perfect sequence number function, which returns the exact number of packets between two sampled packets. As we have access to the original unsampled flows, this is easily evaluated. Apart from being a benchmark for
sequence number functions that may be designed in the future, a perfect function allows clean comparisons between alternative methods employing sequence numbers to be made.

In Figure 9, for an ESR of $p = 0.01$, values of $p_p$ from $p_p = 1$ (equivalent to FS) steadily decreasing to $p_p = 0.001$ are shown (DS with $p_f = p_p = 0.019$ corresponds to PS+SYN+SEQ). As $p_p \to 1$ the performance vastly improves. Similar results were observed in Figure 10, where $p = 0.001$.

Fig 12 Comparison of FS, SH ($p_r = 0.0002$) and DS ($p_f = 0.0117$, $p_p = 0.7$) on Leipzig-II, using $W = 1000$ and varying parameters under ESR normalization with $p = 0.01$.

Fig 13 Comparison of FS, SH ($p_r =$ ) and DS ($p_f = 0.0134$, $p_p = 0.7$) on Abilene-III, using $W = 1000$ and varying parameters under ESR normalization with $p = 0.01$.

Fig 14 Comparison of DS on Leipzig-II, using $W = 1000$ and varying parameters under ESR normalization with $p = 0.01$. An imperfect sequence number function is used here.

Fig 15 Comparison of DS on Leipzig-II, using $W = 1000$ and varying parameters under ESR normalization with $p = 0.001$. An imperfect sequence number function is used here.

Fig 16 Comparison of DS on Abilene-III, using $W = 2000$ and varying parameters under ESR normalization with $p = 0.005$. An imperfect sequence number function is used here.

Fig 17 Benefits of using sequence numbers. Three cases are shown: PERFECT uses a perfect sequence number function, MAPPING uses an imperfect function and SEQ OFF uses no sequence number information.

In all cases, a chronic lack of samples at the tail end of the distribution causes inaccurate estimates, with zero samples showing as discontinuities. FS holds to the true distribution the longest, as we expect, and the best DS performs similarly to it. With a much larger sample set in Figure 11, we can see better agreement between the estimates and the original distribution.

We also tested SH, as seen in Figures 12 and 13. To simplify the comparison across the traces, we truncate each to $W = 1000$, so that $D$ becomes 1.94 and 6.45 packets for Leipzig-II and Abilene-III respectively, and use the same ESR value $p = 0.01$. In both cases the performance is much worse for SH than DS, which tracks FS and the true distribution well. In Leipzig-II the performance is very poor almost everywhere, while in Abilene-III the front end of the distribution, up to about flows of size 8, is estimated quite well, before deteriorating badly at larger sizes. The good performance of SH at $j = 1$ can be attributed to the fact that this event is dominated by the sheer number of original flows of size one. Both FS and DS track the distribution much better since both methods have sampled flows of far superior quality to SH.

D Testing with an Imperfect Sequence Number Function

We now test DS with an imperfect sequence number function. Our function is similar
to that outlined in [4]. However, as we do not have statistics of popular TCP payload sizes available, we infer the most likely payload size as follows. If the sequence number difference is divisible by a popular payload size (for example, 1460 bytes), we take this as the most likely payload size. Otherwise, we use the average payload size. This function is subject to errors, especially when a flow has variable payloads; however, it suffices for

our purposes.

In addition to TCP sequence numbers, we exploit IPID numbers. As mentioned in [4], the IPID field of Linux machines is incremented sequentially for each TCP flow every time a packet in the flow is transmitted. Given that the majority of web-servers on the Internet are Linux-based, we exploit IPID numbers to check the accuracy of our estimate when a flow has packets with variable payloads.

Furthermore, the unidirectional nature of the trace presents a significant challenge. As one side in a TCP connection usually transmits more data than the other, some sampled flows may consist mainly of TCP ACK packets, which means that they will have zero-byte TCP payloads. In this case sequence numbers may not be incremented at all, and so will not provide information about the number of bytes transmitted. A solution is to use the TCP acknowledgement numbers instead to infer the number of bytes transmitted from the opposite direction, which would yield an estimate of the number of packets in the TCP ACK flow. This may not be the most ideal solution, as this method would be susceptible to delayed ACKs, thus underestimating the size of the flow.

Modern web browsers rely on maintaining TCP connections rather than initiating new connections, which require more memory. This is the persistent HTTP protocol. The prevalence of this protocol amongst web browsers presents a challenge, since empty payload packets are periodically sent to keep the connection alive. These packets do not increment the sequence number. The best we can do in such cases is to infer using IPID numbers, or possibly counting ACK packets coming in from the other direction, if bidirectional information is available.

Even with the imperfect and relatively simple sequence number function used here, results are consistent with theory. Figures 14 and 15 illustrate this. In both cases, the imperfection of the function affects the accuracy of DS, but not to a large degree. A similar observation applies to the Abilene-III trace, shown in Figure 16, where
the artifacts due to the imperfect function (the sawtooth pattern) are clearly seen.

Finally, Figure 17 illustrates the effect of using sequence numbers in recovering the flow size distribution. The three cases shown are for DS with parameters $p_f =$ and $p_p = 0.7$, with an ESR of $p = 0.01$. The PERFECT case is when DS is given a perfect sequence number function, MAPPING when using our sequence number function, and SEQ OFF when no sequence numbers were used at all. It is apparent that using sequence numbers, even with an imperfect function, provides significantly more information to an estimator.

E Empirical estimator variance

Here, we see how closely the estimator variance matches the CRLB by computing the observed Fisher information [26] of the estimator in Figure 18. To improve readability, smoothing was applied to the tail end of the observed Fisher information, where samples are scarce, using a simple moving average filter with a window size of 100. Even when using an imperfect sequence number function, the variance of the estimator closely matches the CRLB, effectively proving the benefit of sequence numbers.

Fig 18 CRLB versus observed MLE variance of DS with $p_f =$ , $p_p = 0.6$ on Abilene-III.

VIII CONCLUSIONS AND FUTURE WORK

We have re-examined the question of sampling for flow size estimation in the context of TCP flows from a theoretical point of view. We used the Fisher information to examine the inherent potential of a number of sampling methods. Most of these had been examined previously, but we showed how the usual conditional framework can be made unconditional and thereby simplified, which actually changes the sampling methods themselves and their performance. The new framework led to a number of new rigorous results regarding the performance of sampling methods which we studied under two different normalizations. It also enabled flow sampling to be compared to methods using TCP sequence numbers and Sample and Hold for the first time, and we showed that it is far superior to them, except in very special cases which
are not important for network measurement applications.

We introduced a new two parameter family of methods, Dual Sampling, which allows the statistical benefits of flow sampling to be traded off against the computational advantages of packet sampling. We discussed how, as an unbiased Hold and Sample method, it differs from Sample and Hold, and proved that it is superior to it. We argue that the scheme is implementable and offers an efficient way of approaching flow sampling in practice to the extent possible.

We also proposed closed-form unbiased estimators for SYN-based methods and SH which asymptotically achieve the CRLB, saving computational time in the estimation stage. We performed a case study of Dual Sampling and Sample and Hold on two Internet data traces using our proposed estimators, and found results entirely consistent with the theoretical predictions, despite the fact that the function which maps sequence numbers to packet counts introduces a new source of error, and was not highly optimized. Although there is high variation at the tail end of the distribution, our proposed estimator closely matches the CRLB.

In future work, we intend to search over the space of all possible sampling matrices to find an optimal sampling method

for flow size estimation, and to compare it to flow sampling. Our framework is applicable to other traffic metrics such as anomaly detection and a future direction is to extend the work to those areas. It would also be of interest to improve the sequence number mapping function, and also to explore using our approach for the direct estimation of the byte size of flows, for which sequence numbers are more naturally suited, rather than the packet size. Finally, we will develop a more detailed case for DS and its implementability in high speed routers.

APPENDIX A
OUR MATRIX LEMMAS

A General

Lemma 24: The matrix $\tilde{J}(\theta)$ and its inverse $\tilde{J}(\theta)^{-1}$ are symmetric and positive definite.

Proof: For simplicity we omit the $\theta$ dependencies. Recall that $\tilde{J} = \tilde{B}^T\tilde{D}\tilde{B}$ where $\tilde{D}$ is a real diagonal positive definite matrix with $(\tilde{D})_{jj} = d_j^{-1}$ and $\mathrm{rank}(\tilde{D}) = W$. Matrix $\tilde{B}$ is a $W \times W$ matrix and $\mathrm{rank}(\tilde{B}) = W$. It follows that an inverse exists for both $\tilde{D}$ and $\tilde{B}$. Hence, an inverse also exists for $\tilde{J}$ since $\tilde{J}^{-1} = \tilde{B}^{-1}\tilde{D}^{-1}(\tilde{B}^T)^{-1}$. An equivalent expression for $\tilde{J}$ is

$$\tilde{J} = \tilde{B}^T\tilde{D}^{1/2}\tilde{D}^{1/2}\tilde{B} = (\tilde{D}^{1/2}\tilde{B})^T(\tilde{D}^{1/2}\tilde{B}) \qquad (38)$$

since $\tilde{D}^{1/2}$ is symmetric. We can now show that $\tilde{J}$ is symmetric. We have $\tilde{J}^T = [(\tilde{D}^{1/2}\tilde{B})^T(\tilde{D}^{1/2}\tilde{B})]^T = (\tilde{D}^{1/2}\tilde{B})^T(\tilde{D}^{1/2}\tilde{B}) = \tilde{J}$. The form $(\tilde{D}^{1/2}\tilde{B})^T(\tilde{D}^{1/2}\tilde{B})$ is at least positive semidefinite (Lemma 28). However, since $\tilde{J}^{-1} = \tilde{B}^{-1}\tilde{D}^{-1}(\tilde{B}^T)^{-1}$, $\tilde{J}$ is invertible, so it is also positive definite. By definition of symmetric, positive definite matrices, its inverse is also symmetric, positive definite.

Lemma 25: The unconstrained Fisher information matrix $J(\theta)$ and its inverse $J(\theta)^{-1}$ are symmetric and positive definite.

Proof: Recall that $J = B^TDB$ where $D$ is a real diagonal positive definite matrix with $(D)_{jj} = d_{j-1}^{-1}$ and $\mathrm{rank}(D) = W+1$. Matrix $B$ is a $(W+1) \times W$ matrix and $\mathrm{rank}(B) = W$. An equivalent expression for $J$ is $J = B^TD^{1/2}D^{1/2}B = (D^{1/2}B)^T(D^{1/2}B)$ since $D^{1/2}$ is symmetric. Now $J^T = [(D^{1/2}B)^T(D^{1/2}B)]^T = (D^{1/2}B)^T(D^{1/2}B) = J$. From Lemma 28, $(D^{1/2}B)^T(D^{1/2}B)$ is positive semidefinite. Moreover, $J$ is
invertible by Proposition 2, implying it is positive definite. By definition of symmetric, positive definite matrices, its inverse is also symmetric and positive definite.

Lemma 26: $J_1 \ge J_2$ if and only if $I_1^+ \le I_2^+$.

Proof: Since by Lemma 27(ii), $J_1 \ge J_2$ iff $J_1^{-1} \le J_2^{-1}$, then by definition of positive semidefinite matrices, $x^TJ_1^{-1}x \le x^TJ_2^{-1}x$. Hence, this implies $x^T(J_1^{-1} - \theta\theta^T)x \le x^T(J_2^{-1} - \theta\theta^T)x$, implying $I_1^+ \le I_2^+$.

B Proof of Theorem 4

Let $E = (1/d_0)b_0b_0^T$ from (10). It has rank 1 and is therefore positive semidefinite since its eigenvalues are $\mathrm{tr}(E) = \sum_{k=1}^{W} b_{0k}^2/d_0$ with multiplicity 1 and 0 with multiplicity $W-1$. It follows from Lemma 29 that $J(\theta) \ge \tilde{J}(\theta)$ since $J(\theta) = E + \tilde{J}(\theta)$, and from Lemma 27, this implies that $\tilde{J}^{-1}(\theta) - J^{-1}(\theta) \ge 0_{W \times W}$ and therefore $J^{-1}(\theta) \le \tilde{J}^{-1}(\theta)$. Equality can only hold iff $E = 0_{W \times W}$ since this is the only case where all eigenvalues of $E$ are zero. This implies that $b_0 = 0_{W \times 1}$ is required for equality, that is, that no flow can evaporate.

APPENDIX B
OTHER MATRIX LEMMAS

We collect some useful results required in this paper here. This first result comes from [12].

Lemma 27: Let $A$ be an $n \times n$ symmetric positive definite matrix and $B$ an $n \times n$ positive definite matrix. Then
(i) If $B - A$ is positive definite, then so is $A^{-1} - B^{-1}$,
(ii) If $B - A$ is symmetric and positive semidefinite (implying $B$ is symmetric), then $A^{-1} - B^{-1} \ge 0$.

The following theorem appears in [27, Theorem 6.3, p. 161].

Lemma 28: The following statements are equivalent:
(i) $A$ is positive semidefinite;
(ii) $A = B^*B$ for some matrix $B$, where $B^*$ is the conjugate transpose of $B$.

The following result gives more properties of positive semidefinite matrices [27, Theorem 6.5, p. 166].

Lemma 29: Let $A \ge 0$ and $B \ge 0$ have the same size. Then
(i) $A + B \ge B$,
(ii) $A^{1/2}BA^{1/2} \ge 0$,
(iii) $\mathrm{tr}(AB) \le \mathrm{tr}(A)\mathrm{tr}(B)$,
(iv) the eigenvalues of $AB$ are all nonnegative.

The matrix inversion lemma (also known as Woodbury's formula) can be found in [12, Theorem 18.2.8, p. 424].

Lemma 30 (Matrix Inversion Lemma): Let $R$ be an $n \times n$ matrix, $S$ an $n \times m$ matrix, $T$ an $m \times m$ matrix, and $U$ an $m \times n$ matrix. Suppose that $R$ and $T$
are nonsingular. Then,

$$(R + STU)^{-1} = R^{-1} - R^{-1}S(T^{-1} + UR^{-1}S)^{-1}UR^{-1}.$$

The data processing inequality for Fisher information from [28] is as follows.

Theorem 31: If $\Theta \to Y \to X$ satisfies a relation of the form $f(y, x \mid \Theta) = f_\Theta(y)f(x \mid y)$ (i.e. the conditional distribution of $X$ given $Y$ is independent of $\Theta$), then $J_X(\Theta) \le J_Y(\Theta)$, with the deterministic version being $J_{\gamma(Y)}(\Theta) \le J_Y(\Theta)$. Equality holds if $\gamma(Y)$ is a sufficient statistic relative to the family $f_\Theta(y)$, i.e. $\Theta \to \gamma(Y) \to Y$ forms a Markov chain.

APPENDIX C
SAMPLING METHODS

A Proof of equation (17)

Expanding $(1 + (-1))^k$ leads to the useful identity

$$\sum_{l=1}^{k}\binom{k}{l}(-1)^{k-l} = (-1)^{k-1}. \qquad (39)$$
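The alternating binomial identity used in this proof follows because the full sum from $l = 0$ equals $(1 - 1)^k = 0$, so removing the $l = 0$ term $(-1)^k$ leaves $-(-1)^k = (-1)^{k-1}$. A quick numerical check, with a hypothetical helper name, is sketched below.

```python
from math import comb


def alternating_sum(k):
    """Sum over l = 1..k of C(k, l) * (-1)^(k - l)."""
    return sum(comb(k, l) * (-1) ** (k - l) for l in range(1, k + 1))


# The identity holds for every k >= 1; here it is checked for k = 1..10.
for k in range(1, 11):
    assert alternating_sum(k) == (-1) ** (k - 1)
```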

Using (12) and $(\tilde{B}^{-1})_{jk} = (-1)^{k-j}\binom{k}{j}q^{k-j}p^{-k}$ for PS, we have

$$(J^{-1}_{PS})_{jj} = \sum_{k=j}^{W}\binom{k}{j}^2 q^{2(k-j)}p^{-2k}d_k - \frac{\left(\sum_{k=j}^{W}\sum_{l=1}^{k}(-1)^{2k-j-l}\binom{k}{j}\binom{k}{l}q^{2k-j}p^{-2k}d_k\right)^2}{d_0 + \sum_{k=1}^{W}d_k\left(q^kp^{-k}\sum_{l=1}^{k}(-1)^{k-l}\binom{k}{l}\right)^2}.$$

The denominator can be simplified as follows:

$$d_0 + \sum_{k=1}^{W}d_k\left(q^kp^{-k}\sum_{l=1}^{k}(-1)^{k-l}\binom{k}{l}\right)^2 = d_0 + \sum_{k=1}^{W}d_kq^{2k}p^{-2k}\left(\sum_{l=1}^{k}(-1)^{k-l}\binom{k}{l}\right)^2$$
$$= d_0 + \sum_{k=1}^{W}(-1)^{2k-2}d_kq^{2k}p^{-2k}$$
$$= \sum_{k=0}^{W}q^{2k}p^{-2k}d_k,$$

where we used identity (39) in the second line. Similarly, the numerator can be simplified:

$$\sum_{k=j}^{W}\sum_{l=1}^{k}\binom{k}{j}\binom{k}{l}(-1)^{2k-j-l}q^{2k-j}p^{-2k}d_k = \sum_{k=j}^{W}\binom{k}{j}d_k(-1)^{k-j}q^{2k-j}p^{-2k}\sum_{l=1}^{k}(-1)^{k-l}\binom{k}{l}$$
$$= \sum_{k=j}^{W}\binom{k}{j}d_k(-1)^{k-j}q^{2k-j}p^{-2k}(-1)^{k-1}$$
$$= \sum_{k=j}^{W}\binom{k}{j}(-1)^{2k-j-1}d_kq^{2k-j}p^{-2k},$$

where (39) is used in the third line. This proves (17).

APPENDIX D
COMPARISONS

A Proof of Theorem 14

Let $Z$ denote any method except SH, and $Y$ the complete outcome (i.e. the vector of SEQ numbers and the SYN variable in the richest case) of sampling with $Z$ using parameter $p_1$ (for DS $(p_{p,1}, p_{f,1})$). Now sample $Y$ using $Z$ with parameter $p_2/p_1$ to form $X$ (for DS $(p_{p,2}/p_{p,1}, p_{f,2}/p_{f,1})$). Since $X$ is a function only of $Y$ and the new sampling, the DPI for Fisher applies (Theorem 31) (and $X \ne Y$). Furthermore, it is easy to see that $X$ is statistically equivalent to the outcome of sampling the original data using $Z$ with probability $p_1(p_2/p_1) = p_2$ (for DS $(p_{p,1}(p_{p,2}/p_{p,1}), p_{f,1}(p_{f,2}/p_{f,1})) = (p_{p,2}, p_{f,2})$). It follows that $J_Z(p_1) \ge J_Z(p_2)$. Equality holds iff $p_2 = p_1$ (for DS $p_{p,1} = p_{p,2}$ and $p_{f,1} = p_{f,2}$), implying $X = Y$, since sampling inherently discards information.

The above proof does not apply to SH as it does not have a closure property, meaning that $X$ is not equivalent to applying SH with some $p$. We turn instead to a direct method, exploiting the tridiagonal structure of $J^{-1}_{SH}$. Denote $(J^{-1}_{SH})_{ij} = a_{i,j}$, and consider the quadratic form $x^TJ^{-1}_{SH}x$ for any non-zero vector $x \in \mathbb{R}^W$:

$$x^TJ^{-1}_{SH}x = (a_{1,1} + a_{1,2})x_1^2 + (a_{W,W} + a_{W-1,W})x_W^2 + \sum_{j=2}^{W-1}(a_{j,j} + a_{j-1,j} + a_{j,j+1})x_j^2 + \sum_{j=1}^{W-1}(-a_{j,j+1})(x_j - x_{j+1})^2. \qquad (40)$$

From Lemma 5 we know $J^{-1}\mathbf{1}_W = \theta$. It follows that $(a_{1,1} + a_{1,2}) = \theta_1$, $(a_{j,j} + a_{j-1,j} + a_{j,j+1}) = \theta_j$,
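$j = 2, \ldots, W-1$, and $(a_{W,W} + a_{W-1,W}) = \theta_W$, which are each independent of $p$. Consider then the term involving

$$-a_{j,j+1} = \frac{q}{p^2}d_{j+1} = \frac{1}{p}\sum_{k=j+1}^{W}q^{k-j}\theta_k, \qquad 1 \le j < W.$$

Differentiating with respect to $p$, we obtain

$$-\frac{1}{p^2}\sum_{k=j+1}^{W}q^{k-j}\theta_k - \frac{1}{p}\sum_{k=j+1}^{W}(k-j)q^{k-j-1}\theta_k.$$

Since this is negative, it follows that $x^TJ^{-1}_{SH}x$ decreases monotonically in $p$, and so $J^{-1}_{SH}(p_1) \le J^{-1}_{SH}(p_2)$ for any $p_1 \ge p_2$. The result then follows by Lemma 27(ii).

B Proof of Theorem 17

We first consider PPR normalization, where $p_p = p$ for both methods. Recall from (17) that

$$(J^{-1}_{PS})_{jj} = \sum_{k=j}^{W}\binom{k}{j}^2q^{2(k-j)}p^{-2k}d_{k,PS} - \frac{\left(\sum_{k=j}^{W}(-1)^{2k-j-1}\binom{k}{j}d_{k,PS}\,q^{2k-j}p^{-2k}\right)^2}{\sum_{k=0}^{W}q^{2k}p^{-2k}d_{k,PS}}$$
$$\ge \underbrace{\sum_{k=j}^{W}\binom{k}{j}^2q^{2(k-j)}p^{-2k}d_{k,PS}}_{T_1} - q^{-2j}(W+1)W!$$

where the lower bound follows by, in the second term, dropping the first $j$ terms in the denominator and upper bounding $\binom{k}{j}$ by $W!$. Also, from (18),

$$(J^{-1}_{PS+SYN})_{jj} = \underbrace{\sum_{k=j}^{W}\binom{k-1}{j-1}^2q^{2(k-j)}p^{-2k}d_{k,PS+SYN}}_{T_2} - \frac{q}{p}\theta_j^2.$$

Since $\binom{k}{j} = \binom{k-1}{j-1} + \binom{k-1}{j} \ge \binom{k-1}{j-1}$,

$$d_{j,PS} \ge \sum_{k=j}^{W}\binom{k-1}{j-1}p^jq^{k-j}\theta_k = d_{j,PS+SYN}, \qquad j \ge 1,$$

implying that $T_1 \ge T_2$.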

Now under the limit $p \to 0$, for $j \ge 1$, the second term for PS becomes $(W+1)W!$, which is finite, whereas for PS+SYN, $-\frac{q}{p}\theta_j^2 \approx -\frac{1}{p}\theta_j^2$. It follows that $(J^{-1}_{PS+SYN})_{jj} \le (J^{-1}_{PS})_{jj}$. Since $p_p > p$ for PS+SYN under ESR, the result also holds for ESR by Theorem 14.

C Proof of Theorem 18

We begin by examining the simplest case, where $j = W$. We have $(J^{-1}_{PS+SEQ})_{WW} = \theta_W/p^2$ and $(J^{-1}_{PS+SYN+SEQ})_{WW} = \theta_W/p^2 - \frac{q}{p}\theta_W^2$, and so $(J^{-1}_{PS+SYN+SEQ})_{WW} \le (J^{-1}_{PS+SEQ})_{WW}$ under PPR.

Now consider the case $3 \le j \le W-1$. Let $d_{Q,j}$ and $d_{SQ,j}$ be the proportion of sampled flows of size $j$ after sampling by PS+SEQ and PS+SYN+SEQ respectively. For $j \ge 1$, a straightforward comparison would yield $d_{Q,j} \ge d_{SQ,j}$. Intuitively, since more flows are discarded in the PS+SYN+SEQ scheme, the proportion of sampled flows must be less than the proportions of PS+SEQ. Therefore, expressing the diagonals in terms of $d_{Q,j}$ and $d_{SQ,j}$, we have for $3 \le j \le W-1$,

$$(J^{-1}_{PS+SEQ})_{jj} = p^{-4}d_{Q,j} + 4q^2p^{-4}d_{Q,j+1} + q^4p^{-4}d_{Q,j+2}$$

and

$$(J^{-1}_{PS+SYN+SEQ})_{jj} = p^{-4}d_{SQ,j} + q^2p^{-4}d_{SQ,j+1} - \frac{q}{p}\theta_j^2 \le p^{-4}d_{SQ,j} + q^2p^{-4}d_{SQ,j+1},$$

which, by a direct comparison, shows $(J^{-1}_{PS+SYN+SEQ})_{jj} \le (J^{-1}_{PS+SEQ})_{jj}$. Similarly, the ESR comparison follows since the sampling rate for PS+SYN+SEQ must increase, thereby reducing its CRLB.

D Proof of Theorem 19

First consider the simplest case, $j = W$. By substituting (27) into (26) and then differentiating w.r.t. $p_p$,

$$\frac{d}{dp_p}(J^{-1}_{DS})_{WW} = -\frac{\theta_W}{p_p^2Dp} - \frac{(D-1)\theta_W^2}{Dp} < 0,$$

implying $(J^{-1}_{DS})_{WW}$ is monotonically decreasing with $p_p$. For $2 \le j \le W-1$ the derivative $\frac{d}{dp_p}(J^{-1}_{DS})_{jj}$ is given by

$$-\frac{\theta_j}{p_p^2Dp} - \frac{(D-1)\theta_j^2}{Dp} - \frac{1}{p_p^2Dp}\sum_{k=j+1}^{W}q_p^{k-j}(1+q_p)\theta_k - \frac{1}{p_pp_f}\sum_{k=j+1}^{W}q_p^{k-j-1}\big((k-j) + (k-j+1)q_p\big)\theta_k,$$

which is negative since each term is negative for $0 < p_p \le 1$. Finally, for $j = 1$ we have

$$\frac{d}{dp_p}(J^{-1}_{DS})_{11} = \frac{D-1}{Dp}\theta_1(1-\theta_1) - \frac{1}{p_p^2Dp}\left(\sum_{k=2}^{W}q_p^{k-1}\theta_k\right) - \frac{(D-1)p_p+1}{p_pDp}\left(\sum_{k=2}^{W}(k-1)q_p^{k-2}\theta_k\right). \qquad (41)$$

For small values of $p_p$ the expression is dominated by terms in $1/p_p$ and is therefore again negative, but as the first term is positive, for large $p_p$ it may change sign. It is not hard to show that $\frac{d^2}{dp_p^2}(J^{-1}_{DS})_{11} > 0$, so at most one sign change is possible. Setting $p_p = 1$ in (41) yields (31) as the necessary and sufficient condition for this not to occur in the feasible region $p_p \le 1$. The special cases follow simply from
(31) E Proof of Theorem 20 DS ) 11 > 0, so at most one sign change is First consider the case 2 W From (21) and (24) SH ) θ θ θ q θ2 = FS ) since f = and = () under both PPR and ESR Now consider = 1 It is convenient to recall (23) and (21): SH ) 11 = θ W k=2 qk 1 θ k, and FS ) 11 = 1 f θ 1 q f f θ1 2 It follows that FS ) 11 SH ) 11 when W k=2 qk 1 θ k θ 1 (1 θ 1 )(1 ) (42) A sufficient condition imlying (42) is obtained by using the lower bound q θ 2 W θ k and rearranging, yielding k=2 qk 1 θ 2 θ 1 (1 θ 1 )(1 ) + θ 2 (43) Furthermore, since, a more restrictive sufficient condition is given by relacing) the lhs with, which reduces to (1 ) (θ 1 (1 θ 1 ) θ 2 0, which shows that for any 0 1, if θ 2 θ 1 (1 θ 1 ), then (42) holds and FS ) 11 SH ) 11 The condition is satisfied if W = 2 Now let θ be arbitrary and consider the small (and hence small ) limit Then (42) becomes /θ 1, which is always satisfied since F Proof of Theorem 21 To obtain the condition, we look at the comarison of diagonals at = 1 For simlicity, let () and f () denote the samling arameters for DS, and r () the arameter for SH, under ESR normalization We have (see (25) and (23)) and DS ) 11 = θ 1(1 θ 1 ) f + θ f SH ) 11 = θ r k=2 k=2 q k 1 r θ k q k 1 θ k, Using ESR and comaring term-by-term, we would like (D 1) + 1 θ 1 (1 θ 1 ) + θ1 2 D θ 1 (44) Equation (44) yields the uer bound (D 1)/(D 1) For the remaining (summation) terms, note that 1 = (D 1) (45) f D

holds when 0 (D 1)/(D 1). It is sufficient to have r, which implies that (45) is upper bounded by 1/ r and that for all k 0, q k qr k, implying the result. The next step is to check if the condition ensures DS ) SH ) for 2 W. For = W, since when (D 1)/(D 1), DS ) WW (D 1) + 1 θ W θ W /, D and SH ) WW = θ W / r, the condition holds since r. For the other cases, we use the bound DS ) θ f + 1 f and compare term-by-term with SH ) = θ r + 1 r k=+1 k=+1 q k (1 + q )θ k, qr k (1 + q r )θ k. It is then easy to show that r is sufficient.

APPENDIX E
MAXIMUM LIKELIHOOD ESTIMATORS

A. Proof of Theorem 22

The likelihood function for $N_f$ flows is
\[ f(\theta, N_f) = \prod_i f(j_i; \theta) = \prod_{j \ge 0} d_j^{M_j}. \]
The MLE is the $\theta$ which maximizes the log-likelihood
\[ l(\theta, N_f) = \sum_{j \ge 0} M_j \log d_j \]
subject to the constraint $\sum_{k \ge 1} \theta_k = 1$, $\theta_k > 0$, $\forall k$.

The optimization problem admits a feasible solution by the Bolzano-Weierstrass theorem [29, p. 517], since the log-likelihood function is concave and continuous, and optimization is performed over a compact, convex set. Furthermore, the solution obtained will be unique under our assumptions, since the Hessian of the log-likelihood is the Fisher information, which is positive definite given $0 < \theta_k < 1$ for all $k$. Given the assumptions, the method of Lagrange multipliers would yield the optimal solution since strong duality holds as the problem satisfies Slater's constraint qualification [30, Section 5.2.3, p. 226].

The Lagrangian is
\[ L(\theta, \lambda, \nu) = \sum_{j \ge 0} M_j \log d_j - \lambda \Big( \sum_{k \ge 1} \theta_k - 1 \Big) - \nu^T \theta, \]
where the vector $\nu$ has elements $\nu_k \ge 0$ and $\lambda \in \mathbb{R}$. By differentiating with respect to $\theta_k$, $\forall k$, and the multipliers,
\[ \frac{\partial}{\partial \theta_k} L(\theta, \lambda, \nu) = \sum_{j \ge 0} \frac{M_j b_{jk}}{d_j} - \lambda - \nu_k = 0, \]
\[ \frac{\partial}{\partial \lambda} L(\theta, \lambda, \nu) = 1 - \sum_{k \ge 1} \theta_k = 0, \]
\[ \frac{\partial}{\partial \nu_k} L(\theta, \lambda, \nu) = -\theta_k = 0. \]
The second equation is just the equality constraint, while the third yields a solution $\theta = 0_W$, which lies on the boundary, yielding an unbounded solution (observed by substituting the solution into the likelihood function). That leaves the first equation, and by the Karush-Kuhn-Tucker condition, $\nu^T \theta = 0$ [30, Section 5.5.3, p. 243], implying that $\nu = 0_W$ (our assumptions require that the original parameters satisfy $0 < \theta_k < 1$ for all $k$, hence the optimum must lie within the region where the constraints are inactive). Thus, we have, in matrix form,
\[ B^T D \, \mathrm{diag}(M_1, \dots, M_W) 1_W = \lambda 1_W - \frac{M_0}{d_0} b_0, \tag{46} \]
recalling that $D = \mathrm{diag}(d_1^{-1}, \dots, d_W^{-1})$.

We proceed to solve for $\lambda$ using the equality constraint $\sum_{k \ge 1} \theta_k = 1$ and $d = B\theta$, as follows, by multiplying both sides of (46) by $\theta^T$ to obtain
\[ \theta^T B^T D \, \mathrm{diag}(M_1, \dots, M_W) 1_W = \lambda - M_0 \]
\[ d^T D \, \mathrm{diag}(M_1, \dots, M_W) 1_W = \lambda - M_0 \]
\[ 1_W^T \mathrm{diag}(M_1, \dots, M_W) 1_W = \lambda - M_0, \]
implying $\lambda = N_f$.

For methods that pick flows in an unbiased manner, $b_0 = q 1_W$, and thus $B^T 1_W = p 1_W$, implying $(B^{-1})^T 1_W = p^{-1} 1_W$; therefore (46) reduces to
\[ B^T D \, \mathrm{diag}(M_1, \dots, M_W) 1_W = (N_f - M_0) 1_W, \tag{47} \]
which, using $(B^{-1})^T 1_W = p^{-1} 1_W$, can be rewritten as
\[ D^{-1} 1_W = \frac{p}{N_f - M_0} \mathrm{diag}(M_1, \dots, M_W) 1_W \]
\[ B\theta = \frac{p}{\sum_{j=1}^{W} M_j} [M_1, \dots, M_W]^T. \]
The result follows.

Remark 32: In the conditional framework, expressing the log-likelihood function is difficult, due to the fact that normalization of the likelihood involves division by random variables. However, the estimator above would, with high probability, be close to the actual MLE. The flow selection process is a Bernoulli process. The denominator, $\sum_{j=1}^{W} M_j$, encapsulates information about $N_f$, because asymptotically, the deviation between $p N_f$ and $\sum_{j=1}^{W} M_j$ is extremely small, a consequence of the concentration of Bernoulli samples around their mean. This property is not found amongst other methods, such as PS, where samples are biased towards large flows.

B. Proof of Theorem 23

We begin with the optimization equation (46),
\[ B^T D \, \mathrm{diag}(M_1, \dots, M_W) 1_W + \frac{M_0}{d_0} b_0 = N_f 1_W. \]
Using properties of the sampling matrix for SH, we have
\[ D \, \mathrm{diag}(M_1, \dots, M_W) 1_W + \frac{M_0}{d_0} \frac{q}{p} e_1 = N_f (B^{-1})^T 1_W \tag{48} \]
\[ D \, \mathrm{diag}(M_0 + M_1, \dots, M_W) 1_W = N_f (B^{-1})^T 1_W \tag{49} \]
\[ D \, \mathrm{diag}(p(M_0 + M_1), \dots, M_W) 1_W = N_f 1_W. \tag{50} \]
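As a numerical sanity check on the manipulations in (48) to (50), the following NumPy sketch may be helpful. It assumes the usual Sample and Hold model, in which a flow of size $k$ yields $j \ge 1$ sampled packets with probability $b_{jk} = p q^{k-j}$ and is missed entirely with probability $b_{0k} = q^k$. This explicit matrix, the function names `sh_sampling_matrix` and `sh_mle`, and the values chosen for $p$, $W$ and $\theta$ are illustrative assumptions, not taken from the text. Under those assumptions it checks the three SH identities used in the proof and confirms that the closed-form estimate recovers $\theta$ when fed the expected counts.

```python
import numpy as np

def sh_sampling_matrix(p, W):
    """Return (b0, B) for Sample and Hold under the ASSUMED model
    b_{jk} = p * q**(k - j) for 1 <= j <= k, and b_{0k} = q**k."""
    q = 1.0 - p
    k = np.arange(1, W + 1)
    b0 = q ** k                              # row j = 0: flow missed entirely
    exponent = k[None, :] - k[:, None]       # entry [j-1, k-1] holds k - j
    B = np.where(exponent >= 0, p * q ** np.maximum(exponent, 0), 0.0)
    return b0, B

def sh_mle(M, p):
    """Closed-form estimate: solve B theta = [p (M0 + M1), M2, ..., MW] / N_f.
    M is the vector of counts [M0, M1, ..., MW]; N_f is its sum."""
    M = np.asarray(M, dtype=float)
    W = len(M) - 1
    _, B = sh_sampling_matrix(p, W)
    rhs = M[1:].copy()
    rhs[0] = p * (M[0] + M[1])               # merge the j = 0 and j = 1 counts
    return np.linalg.solve(B, rhs / M.sum())

# Illustrative (hypothetical) parameter values.
p, W = 0.3, 5
q = 1.0 - p
theta = np.array([0.4, 0.25, 0.15, 0.12, 0.08])
b0, B = sh_sampling_matrix(p, W)
d0, d = b0 @ theta, B @ theta
e1 = np.eye(W)[0]

# Identity used in (48): b0 = (q / p) B^T e1
print(np.allclose(b0, (q / p) * (B.T @ e1)))
# Identity used in (49): d0 = (q / p) d1
print(np.isclose(d0, (q / p) * d[0]))
# Identity used in (50): (B^{-1})^T 1 = diag(1/p, 1, ..., 1) 1
print(np.allclose(np.linalg.solve(B.T, np.ones(W)), np.r_[1.0 / p, np.ones(W - 1)]))
# Feeding expected counts N_f * [d0, d1, ..., dW] returns theta itself
expected_counts = 1000 * np.concatenate(([d0], d))
print(np.allclose(sh_mle(expected_counts, p), theta))
```

The estimate is obtained with `np.linalg.solve` rather than an explicit inverse, since under the assumed model $B$ is upper triangular with diagonal $p$ and the linear solve is the more stable choice.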


In (48) we use the property $b_0 = \frac{q}{p} B^T e_1$, where $e_i$ is the canonical basis vector, in (49) we use $d_0 = \frac{q}{p} d_1$, and in (50) we use $(B^{-1})^T 1_W = \mathrm{diag}(p^{-1}, 1, \dots, 1) 1_W$. All these properties can be obtained by a straightforward evaluation using the sampling matrix. The final line reduces to
\[ B\theta = \frac{1}{N_f} [p(M_0 + M_1), \dots, M_W]^T, \]
proving the result.

The estimator is unbiased, as by taking the expectation, we obtain $E[p(M_0 + M_1)] = p N_f (d_0 + d_1) = N_f (p + q) d_1 = N_f d_1$, by using the identity $d_1 = \frac{p}{q} d_0$, while clearly $E[M_j] = N_f d_j$ for all $j \ge 2$. Thus $E[\hat{\theta}_{SH}] = B_{SH}^{-1} B_{SH} \theta = \theta$.

REFERENCES

[1] N. Duffield, C. Lund, and M. Thorup, "Estimating Flow Distributions from Sampled Flow Statistics," IEEE/ACM Trans. Networking, vol. 13, no. 5, Oct. 2005.
[2] N. Hohn and D. Veitch, "Inverting Sampled Traffic," in Proc. 2003 ACM SIGCOMM Internet Measurement Conference, Miami, October 2003.
[3] N. Hohn and D. Veitch, "Inverting Sampled Traffic," IEEE/ACM Transactions on Networking, vol. 14, no. 1, pp. 68-80, 2006.
[4] B. Ribeiro, D. Towsley, T. Ye, and J. Bolot, "Fisher information on sampled packets: an application to flow size estimation," in Proc. ACM/SIGCOMM Internet Measurement Conf., Rio de Janeiro, Oct. 2006.
[5] G. Varghese, Network Algorithmics. San Francisco: Elsevier/Morgan Kaufmann, 2005.
[6] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," ACM Transactions on Computer Systems, vol. 21, no. 3, August 2003.
[7] P. Tune and D. Veitch, "Towards Optimal Sampling for Flow Size Estimation," in Proc. ACM SIGCOMM Internet Measurement Conf., Vouliagmeni, Greece, Oct.
[8] L. Yang and G. Michailidis, "Estimation of flow lengths from sampled traffic," in Proc. GLOBECOM, San Francisco, CA, Nov. 2006.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, Inc., 2006.
[10] A. Hero, J. Fessler, and M. Usman, "Exploring estimator bias-variance tradeoffs using the Uniform CR bound," IEEE Trans. Sig. Proc., vol. 44, no. 8, Aug. 1996.
[11] J. D. Gorman and A. O. Hero, "Lower bounds for parametric estimation with constraints," IEEE Trans. Info. Th., vol. 36, no. 6, Nov. 1990.
[12] D. Harville, Matrix Algebra from a Statistician's Perspective. Springer-Verlag, 1997.
[13] J. E. Strum, "Binomial matrices," The Two Year College Mathematics Journal, vol. 8, no. 5, November 1977.
[14] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed., ser. Springer Texts in Statistics. Springer, 1998.
[15] P. Tune and D. Veitch, "Fisher information in flow size distribution estimation: Technical report," Dept. E&EE, The University of Melbourne, Tech. Rep., 2008, copy available upon request.
[16] D. Shah, S. Iyer, B. Prabhakar, and N. McKeown, "Maintaining statistics counters in line cards," IEEE Micro, vol. 22, no. 1, pp. 76-81, 2002.
[17] S. Ramabhadran and G. Varghese, "Efficient implementation of a statistics counter architecture," ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 1, June 2003.
[18] Q. Zhao, J. Xu, and Z. Liu, "Design of a novel statistics counter architecture with optimal space and time efficiency," Proc. of ACM SIGMETRICS 2006, vol. 34, no. 1, June 2006.
[19] R. T. Rockafellar, Convex Analysis, ser. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 1970.
[20] C. Estan, K. Keyes, D. Moore, and G. Varghese, "Building a better NetFlow," in Proc. ACM SIGCOMM 2004, Portland, OR, Aug. 2004.
[21] NLANR, "Leipzig-II Trace Data," http://pma.nlanr.net/special/leip2.html
[22] NLANR, "Abilene-III Trace Data," http://pma.nlanr.net/special/ipls3.html
[23] CoralReef, http://www.caida.org/tools/measurement/coralreef/
[24] S. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Prentice Hall, 1993.
[25] M. Mitzenmacher and E. Upfal, Probability and Computing. Cambridge University Press, 2005.
[26] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, 2nd ed. Wiley Interscience, 2008.
[27] F. Zhang, Matrix Theory: Basic Results and Techniques. Springer-Verlag, 1999.
[28] R. Zamir, "A proof of the Fisher information inequality via a data processing argument," IEEE Transactions on Information Theory, vol. 44, no. 3, May 1998.
[29] A. E. Taylor and W. R. Mann, Advanced Calculus, 3rd ed. John Wiley and Sons, 1983.
[30] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

Paul Tune (M'09) received the B.E. (Hon) and B.Sc. degrees in Electrical and Electronics Engineering and Computer Science respectively, both in 2005, from the University of Melbourne, Australia. He has recently completed a Ph.D. degree at the ARC Special Center for Ultra-Broadband Information Networks (CUBIN) at the University of Melbourne, Australia. His research interests are in communication networks, particularly in traffic inversion and sampling, information theory, statistics and signal processing, specifically sparse signal recovery.

Darryl Veitch (F'10) completed a B.Sc. Hons at Monash University, Australia (1985) and a mathematics Ph.D. from Cambridge, England (1990). He worked at TRL (Telstra, Melbourne), CNET (France Telecom, Paris), KTH (Stockholm), INRIA (Sophia Antipolis, France), Bellcore (New Jersey), RMIT (Melbourne) and EMUlab and CUBIN at The University of Melbourne, where he is a Professorial Research Fellow. His research interests include traffic modelling, parameter estimation, active measurement, traffic sampling, and clock synchronization.
