SAMPLING AND INVERSION Darryl Veitch dveitch@unimelb.edu.au CUBIN, Department of Electrical & Electronic Engineering University of Melbourne Workshop on Sampling the Internet, Paris 2005
A TALK WITH TWO PARTS CHALLENGES IN SAMPLING Introduction Two Consequences CROSS TRAFFIC ESTIMATION AS NON-LINEAR SAMPLING An Inverse Queueing Problem Limits to Inversion: Identifiability From Inversion Theory to Estimation Practice
FIRST BYTES
Given i.i.d. packet sampling, recover first-order packet statistics.
EXAMPLE 1: AVERAGE PACKET RATE $\lambda_X$
Sampling regime is given (not selected)
No shortage of data
Simple inversion: $\hat{\lambda}_X = \lambda_{\text{sampled}} / p$ (rate of the sampled stream divided by the sampling probability $p$)
Here sampling is well adapted to the parameter, so inversion is easy.
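
The rate inversion of Example 1 can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the function names, the seed and the toy traffic (a fixed packet count rather than a Poisson stream) are assumptions made for the example.

```python
import random

def invert_packet_rate(num_sampled, duration, p):
    """Estimate the original packet rate from the number of sampled packets."""
    if not 0 < p <= 1:
        raise ValueError("sampling probability p must lie in (0, 1]")
    return num_sampled / (duration * p)

def simulate(true_rate=1000.0, duration=100.0, p=0.01, seed=1):
    """Toy experiment (assumed setup): keep each packet independently with prob. p."""
    rng = random.Random(seed)
    num_packets = int(true_rate * duration)
    num_sampled = sum(rng.random() < p for _ in range(num_packets))
    return invert_packet_rate(num_sampled, duration, p)
```

Because every packet contributes equally to the rate, the estimate only needs the sampled count, which is why data shortage is not an issue here.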
FIRST PROBLEMS
Given i.i.d. packet sampling, recover flow statistics.
EXAMPLE 2: FLOW SIZE DISTRIBUTION P
Sampling regime is given
Drastic data shortage for the body of P (tail OK)
Simple inversion (sample histogram) is very poor; better methods still struggle
Sampling is not well adapted, so inversion is problematic.
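
Why the body of P suffers while the tail survives can be seen from binomial thinning: a flow of n packets leaves Binomial(n, p) sampled packets. The sketch below is illustrative only; the function names and the toy distribution in the usage note are assumptions.

```python
from math import comb

def prob_flow_seen(n, p):
    """Probability that a flow of n packets leaves at least one sampled packet."""
    return 1.0 - (1.0 - p) ** n

def sampled_size_pmf(P, p):
    """Distribution of the number of sampled packets per flow.

    P maps original flow size n -> probability; each packet is kept with prob. p.
    """
    q = {}
    for n, prob in P.items():
        for j in range(n + 1):
            q[j] = q.get(j, 0.0) + prob * comb(n, j) * p**j * (1 - p) ** (n - j)
    return q
```

For a toy distribution such as `{1: 0.7, 10: 0.2, 100: 0.1}` with `p = 0.1`, most of the mass of `sampled_size_pmf` sits at zero sampled packets, so small flows are mostly never observed, whereas large flows are seen almost surely.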
A SOLUTION: SELECT SAMPLING REGIME
Given i.i.d. flow sampling, recover flow statistics.
EXAMPLE: FLOW SIZE DISTRIBUTION P
Sampling regime is selected
Data shortage for P vanishes
Simple inversion (sample histogram) is good
Matching the sampling regime to the metric is worth considering!
BUT
Comes at a cost
Not necessarily possible
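
With flow sampling, each sampled flow is observed whole, so the normalised histogram of sampled flow sizes estimates P directly. A minimal sketch, assuming a hypothetical helper name, seed and toy flow population:

```python
import random
from collections import Counter

def flow_sample_histogram(flow_sizes, q, seed=0):
    """Keep each flow independently with probability q; return the size histogram."""
    rng = random.Random(seed)
    kept = [n for n in flow_sizes if rng.random() < q]
    total = len(kept)
    if total == 0:
        return {}
    return {n: c / total for n, c in Counter(kept).items()}

# Toy population (assumed): 70% single-packet flows, 20% of size 10, 10% of size 1000.
flows = [1] * 70000 + [10] * 20000 + [1000] * 10000
estimate = flow_sample_histogram(flows, q=0.1)
```

No correction for missed packets is needed, since the sampling decision does not depend on flow size.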
THE BROADER PICTURE
NEED TO CONSIDER
Parameter to measure
Sampling regime: matched to parameter? or data model? preserves needed information?
Inversion task: well posed? is it possible? robust/stable?
Costs: sampling complexity (Cisco..), inversion (real-time?), scalable aggregation (transport to analysis node), failure ($ per unit std)
What can we infer from this?
I: NEED STRUCTURE DETECTORS
SINCE
Cannot match sampling to all parameters, and
Parallelism is limited
Relevant information is generically scarce.
HENCE
Forced to detect weak signals in noise (in most cases)
Essential to exploit unique structure of information
A FLOW-CLUSTER MODEL OF PACKET ARRIVALS
NAIVE INTUITION: CLUSTERS ARE FLOWS
MORE REALISTICALLY: FLOWS INTERLEAVE
REALITY CHECK: CLUSTERS LOST IN FOG
UNASSISTED: WHERE ARE THE CLUSTERS NOW?
FLOWS ARE ESSENTIAL, YET INVISIBLE FLOWS ARE REAL, HAVE IMPACT, YET INVISIBLE WITHOUT side information, or more powerful ways to detect structure in noise.
II: SAMPLING NEEDS A BROADER CONTEXT
SAMPLING IS
The triple {parameter, sampling, inversion}
Any measurements carrying information, followed by inference
Example: active probing is a branch of sampling.
INVERTING DELAY SAMPLES FOR CROSS TRAFFIC
JOINT WORK WITH S. MACHIRAJU, F. BACCELLI, J. BOLOT, A. NUCCI
A FIFO QUEUE:
Packet workload arrives instantaneously
Deterministic service rate $\mu$
PROBE STREAM:
Constant probe service time $x = p/\mu$
Arrivals $\{T_n\}$, departures $\{T'_n\}$, E2E delays $\{D_n = T'_n - T_n\}$
Examine residual delay: $R_n = D_n - x \ge 0$
CROSS TRAFFIC:
A measure $A$ (or process): workload $A(t)$ arrives in $[0, t]$
Think of Poisson packet arrivals with random sizes (e.g. constant or trimodal service time distribution)
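
The queue model above can be simulated directly by tracking the unfinished workload. This is a sketch under stated assumptions: workload is measured in units of service time (so it drains at unit rate), cross traffic arriving at the same instant as a probe is served first, and the function name is hypothetical.

```python
def residual_delays(cross_traffic, probe_times, x):
    """Residual delay R_n seen by each probe in a FIFO queue.

    cross_traffic: list of (arrival_time, service_time) pairs.
    probe_times:   probe arrival times.
    x:             constant probe service time.
    """
    # Tag events so that, at equal times, cross traffic (0) precedes probes (1).
    events = sorted([(t, 0, s) for t, s in cross_traffic] +
                    [(t, 1, x) for t in probe_times])
    workload, now, residuals = 0.0, None, []
    for time, is_probe, service in events:
        if now is not None:
            workload = max(0.0, workload - (time - now))  # drain since last event
        now = time
        if is_probe:
            residuals.append(workload)  # waiting time = D_n - x
        workload += service             # the arrival joins the queue
    return residuals
```

For example, a single cross-traffic packet with service time 5 arriving at time 0, followed by probes at times 2 and 3 with `x = 1`, gives residual delays 3 and 3: the second probe waits behind what is left of the packet plus the first probe.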
THE INVERSE QUEUEING PROBLEM
Given measured delays $\{R_i\}$, what can be learned about $A$?
CONDITION ON TIME-SCALE t
WHY?
Desirable to understand $A$ as a function of timescale
Also necessary technically
LOOK AT CONDITIONAL DELAYS:
Of the sequence $\{R_i\}$, take those for which $T_{n+1} - T_n = t$ (if probes are periodic, all probes qualify)
For a given such $R$, the next probe arrives $t$ later with residual delay $S$.
We study statistics of the pair $(R, S)$
Not limited to Poisson or periodic probe streams!
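
Selecting the conditional pairs is a simple filter over consecutive probes. A minimal sketch; the function name and the tolerance `tol` (needed for real-valued timestamps) are assumptions.

```python
def conditional_pairs(probe_times, residuals, t, tol=1e-9):
    """Return pairs (R, S) for consecutive probes whose spacing equals t."""
    pairs = []
    for n in range(len(probe_times) - 1):
        spacing = probe_times[n + 1] - probe_times[n]
        if abs(spacing - t) <= tol:
            pairs.append((residuals[n], residuals[n + 1]))
    return pairs
```

With a periodic stream of period t every consecutive pair qualifies; with an irregular stream only a subset does, which reduces the data available for each time-scale.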
JOINT DENSITY OF (R, S)
FIGURE: Diagonals are lines $U = u$, where $U = R - S$ is delay variation.
FORWARD EQUATIONS: FROM A TO R
$S = \max[\, x + R + C,\; B \,]$
$C = A(t) - t$
$B = \sup_{0 \le v \le t} \big[ A([v, t)) - (t - v) \big]$
TECHNICAL ASSUMPTION
$R_n$ is independent of $(R_{n-1}, C_n, T_{n+1} - T_n)$
$\{R_n\}$ is an ergodic Markov chain
i.e.: future delays conditionally independent of past, $R$ free to vary.
RESULT
$f_r(s) = P(S = s \mid R = r)$ is determined by the density $h(k, l) = P(B = k, C = l)$
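
The forward equation can be checked on a single inter-probe interval. The sketch below assumes cross traffic is given as discrete arrivals in [0, t) with work measured in time units; the function names are hypothetical. The supremum defining B only needs to be evaluated at arrival instants and at v = t, since between arrivals the bracketed term increases with v.

```python
def burst_and_net(arrivals, t):
    """Compute (B, C) for one interval of length t.

    arrivals: list of (time, work) with 0 <= time < t.
    """
    c = sum(w for _, w in arrivals) - t          # C = A(t) - t
    b = 0.0                                      # value of the bracket at v = t
    for v, _ in arrivals:
        tail_work = sum(w for u, w in arrivals if u >= v)   # A([v, t))
        b = max(b, tail_work - (t - v))
    return b, c

def next_residual(r, x, arrivals, t):
    """S = max(x + R + C, B): residual delay of the probe arriving t later."""
    b, c = burst_and_net(arrivals, t)
    return max(x + r + c, b)
```

If the queue stays busy over the whole interval the first term wins and S depends on R; if the queue empties, S is set by B alone and the memory of R is lost.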
MEANING OF (B, C)
FIGURE: $C = A(t) - t$ is the net workload in an interval of length $t$; $B$ is a measure of burstiness
SUPPORT OF (B, C) DENSITY
FIGURE: Density $h(k, l)$ vanishes outside the yellow strip (axes $B$ and $C$)
EXAMPLE OF (B, C) DENSITY
FIGURE: Example density $h(k, l)$ with marginal $c(l)$; axes k*d (Bytes) and l*d (Bytes)
SYSTEM IDENTIFIABILITY
TWO KINDS OF AMBIGUITY FOR THE INVERSION
Pathwise: knowledge of $\{R_i\}$ does not determine $A$. E.g. probes in the
Same busy period: different packet arrivals with the same total service
Different busy periods: anything between is invisible
Distributional: knowledge of the distribution of $\{R_i\}$ again does not (in general) determine $A$
A RECURSIVE PROCEDURE TO DETERMINE h(k, l)
Condition on $R = r$. From $S = \max[\, x + R + C,\; B \,]$, each conditional probability $f_r(s) = P(S = s \mid R = r)$ corresponds to a simple sum of $h(k, l)$ values.
These expressions can be combined to invert:
$$h(k, l) = \sum_{i=0}^{k-1} \big[ 2 f_{k-l-x}(i) - f_{k-l-x-1}(i) - f_{k-l-x+1}(i) \big] + \big[ f_{k-l-x}(k) - f_{k-l-x+1}(k) \big]$$
provided $k - l - x \ge 1$.
This is almost a full inversion of the joint density!
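
The inversion formula can be exercised on a toy example: build the conditional distributions from a known joint density using the forward equation, then check that the formula returns the original values outside the ambiguity zone. This is an illustrative sketch on an integer lattice; the toy density `H`, the value `X = 1` and the function names are assumptions, not data from the talk.

```python
def forward(h, x, r):
    """f_r(s) = P(S = s | R = r) for S = max(x + r + C, B), with (B, C) ~ h."""
    f = {}
    for (k, l), p in h.items():
        s = max(x + r + l, k)
        f[s] = f.get(s, 0.0) + p
    return f

def invert(f, k, l, x):
    """Recover h(k, l); f(r) must return the dict s -> f_r(s)."""
    r = k - l - x
    if r < 1:
        raise ValueError("(k, l) lies in the ambiguity zone: k - l - x < 1")
    f0, f_minus, f_plus = f(r), f(r - 1), f(r + 1)
    total = sum(2 * f0.get(i, 0.0) - f_minus.get(i, 0.0) - f_plus.get(i, 0.0)
                for i in range(k))
    return total + f0.get(k, 0.0) - f_plus.get(k, 0.0)

# Toy joint density of (B, C) (assumed values, satisfying B >= 0 and C <= B <= C + 3).
X = 1
H = {(0, -3): 0.10, (0, -2): 0.20, (1, -1): 0.15, (2, 0): 0.05,
     (2, 1): 0.10, (3, 1): 0.20, (4, 2): 0.20}

recovered = {(k, l): invert(lambda r: forward(H, X, r), k, l, X)
             for (k, l) in H if k - l - X >= 1}
```

In this toy case the only point that cannot be recovered is (2, 1), where k - l - x = 0; all other entries of `recovered` match `H` up to rounding.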
LINKING (R, S) TO (B, C) DENSITY
FIGURE: Observed $(r, s)$ corresponds to a $(b, c)$ value in the angle (axes $B$ and $C$; marked values $f_{r_1}(s_1)$ and $f_{r_2}(s_2)$).
INVERSION METHOD USING ANGLES FIGURE: Values in the ambiguity zone (top) cannot be resolved.
THE ROLE OF x
Width of ambiguity zone is $x + 1$: probe invasiveness hides system details
However! If $A$ has stationary independent increments:
The partial inversion here is not fundamental
Not only can $h(k, l)$ be recovered for this $t$, but the entire law of the process also
In general, full inversion is inherently impossible
HOW DOES THAT WORK?
The marginal $c(l)$ of $C$ can always be recovered
This is enough to determine the Lévy exponent, which characterises such processes
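
One way to see how the marginal of C can be read off is sketched below. It rests on an assumption not spelled out on the slide: since $B \le C + t$ for a non-negative $A$, any conditioning value $r \ge t - x$ gives $S = x + r + C$ exactly, so $c(l) = f_r(l + x + r)$. The function names and toy density are hypothetical.

```python
def forward(h, x, r):
    """f_r(s) = P(S = s | R = r) for S = max(x + r + C, B), with (B, C) ~ h."""
    f = {}
    for (k, l), p in h.items():
        s = max(x + r + l, k)
        f[s] = f.get(s, 0.0) + p
    return f

def marginal_of_c(f_r, x, r):
    """Shift f_r back by x + r; valid under the assumption r >= t - x."""
    return {s - x - r: p for s, p in f_r.items()}

# Toy joint density of (B, C) with t = 3 (assumed values), probe service x = 1.
T, X = 3, 1
H = {(0, -3): 0.10, (0, -2): 0.20, (1, -1): 0.15, (2, 0): 0.05,
     (2, 1): 0.10, (3, 1): 0.20, (4, 2): 0.20}

r = T - X                      # smallest conditioning value satisfying r >= t - x
c = marginal_of_c(forward(H, X, r), X, r)
```

Here `c` agrees with the marginal obtained by summing `H` over k, including the mass at (2, 1) that the pointwise inversion cannot resolve.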
IMPLEMENTING THE INVERSION METHOD
MAJOR CHALLENGES:
Must condition on $t$ and $r$
Must estimate the $f_r(s)$
Coverage of the $(k, l)$ plane may not be adequate, or may even be missing!
Must map available mass into the strip in the right way
Epicentre of $h(k, l)$ may be far from available mass
But, can exploit a strong assumption to extend effective invertibility to low data availability
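
Estimating the $f_r(s)$ from measured pairs amounts to building one histogram of S per conditioning bin of R. A minimal sketch, assuming a uniform discretisation step `d` and floor binning; the function name is hypothetical and no smoothing is attempted.

```python
from collections import Counter, defaultdict

def estimate_conditional_pmfs(pairs, d):
    """Empirical f_r(s) from (R, S) pairs, with both values binned at step d."""
    counts = defaultdict(Counter)
    for r, s in pairs:
        counts[int(r // d)][int(s // d)] += 1
    return {r_bin: {s_bin: n / sum(c.values()) for s_bin, n in c.items()}
            for r_bin, c in counts.items()}
```

Bins of r that receive few or no pairs yield unreliable or missing estimates, which is the coverage problem listed above.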
AVAILABLE MASS AND h(k, l) (ρ = 0.8)
FIGURE: Available $c(l)$ with $h$ contours at 80% utilization; axes k*d (Bytes) and l*d (Bytes)
AVAILABLE MASS AND h(k, l) (ρ = 0.2)
FIGURE: Available $c(l)$ with $h$ contours at 20% utilization; axes k*d (Bytes) and l*d (Bytes)
ROUTER DATA: ESTIMATING h(k, l)
FIGURE: Left: replayed router data through FIFO, Right: estimation (axes k*d (KB) and l*d (KB))
SUMMARY
CHALLENGES IN SAMPLING:
Sampling and inversion must be structure aware
Sampling is a general program {parameter, sampling, inversion}
CROSS TRAFFIC ESTIMATION:
Cross traffic inversion impossible in general!
Invasiveness is an intrinsic barrier
Detailed partial inversion still possible