Network Performance Tomography


1 Network Performance Tomography
Hung X. Nguyen, Teletraffic Research Centre, University of Adelaide

2 Network Performance Tomography
Inferring link performance using end-to-end probes

3 Network Performance Tomography
Inferring link performance using end-to-end probes, concentrating only on loss rates in this talk

4 End-to-End Loss Rates Are Not Enough
Path transmission rates and link transmission rates are linearly related (φ: path transmission rate, tr: link transmission rate). For the tree where path A->B traverses links e1, e2 and path A->C traverses links e1, e3:

[1 1 0]   [log(tr1)]   [log(φ1)]
[1 0 1] * [log(tr2)] = [log(φ2)]
          [log(tr3)]

Under-determined: the routing matrix does not have full column rank, so link transmission rates cannot be calculated using only end-to-end transmission rates.
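As a quick sanity check on the rank argument, here is a minimal numpy sketch using the routing matrix of this slide's two-path tree:

```python
import numpy as np

# Routing matrix: row 1 = path A->B (links e1, e2),
# row 2 = path A->C (links e1, e3); columns are e1, e2, e3.
A = np.array([[1, 1, 0],
              [1, 0, 1]])

print(np.linalg.matrix_rank(A))  # 2 < 3 unknowns: the system is under-determined
```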

5 Network Performance Tomography
Performance tomography is all about obtaining additional information to solve the inverse problem, and about inference methods given the data.
Assumptions:
- Fixed topology: known routing matrix; all columns are distinct; all columns have at least one 1 entry
- Loss independence: link losses are independent (both temporally and spatially)
(Example tree: paths A->B and A->C share link e1, with φ1 = 0.72 and φ2 = 0.8.)

6 Temporal Correlation Using Multicast
Multicast probes are forwarded to multiple receivers at the branching point, creating strong temporal correlation. The joint outcome at B and C adds a third equation: letting φ12 denote the rate at which a probe reaches both receivers,

[1 1 0]   [log(tr1)]   [log(φ1)]
[1 0 1] * [log(tr2)] = [log(φ2)]
[1 1 1]   [log(tr3)]   [log(φ12)]

The system now has full column rank, so estimates tr̂1, tr̂2, tr̂3 can be computed from the temporal correlation.

7 Inference Algorithm
Maximum likelihood algorithm to infer loss rates on a multicast tree [Caceres et al. 99]. With n probes, let n(01) be the number of probes received only by C, etc., and p(01) = n(01)/n. Define
γ1 = p(11) + p(10) + p(01), γ2 = p(11) + p(10), γ3 = p(11) + p(01).
Then
tr1 = γ2·γ3 / (γ2 + γ3 - γ1), tr2 = 1 - (γ1 - γ2)/γ3, tr3 = 1 - (γ1 - γ3)/γ2.
Can be extended to work with multiple trees [Bu et al. 01].
Limitation: multicast is not widely accessible to end users.
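The closed-form estimator above can be coded directly. A minimal sketch, assuming the two-receiver tree of this slide; the link rates (0.9, 0.95, 0.8) used to simulate outcome counts are made-up test values:

```python
import numpy as np

def minc_two_leaf(n11, n10, n01, n00):
    """Closed-form MLE of [Caceres et al. 99] on a two-receiver multicast
    tree (e1 shared, e2 to B, e3 to C). n11 = probes seen by both B and C,
    n10 = by B only, n01 = by C only, n00 = by neither."""
    n = n11 + n10 + n01 + n00
    p11, p10, p01 = n11 / n, n10 / n, n01 / n
    g1 = p11 + p10 + p01              # P(at least one receiver sees the probe)
    g2 = p11 + p10                    # P(B sees it) = tr1*tr2
    g3 = p11 + p01                    # P(C sees it) = tr1*tr3
    tr1 = g2 * g3 / (g2 + g3 - g1)    # note g2 + g3 - g1 = p11
    tr2 = 1 - (g1 - g2) / g3
    tr3 = 1 - (g1 - g3) / g2
    return tr1, tr2, tr3

# Check against simulated probes with tr = (0.9, 0.95, 0.8)
rng = np.random.default_rng(0)
n = 100_000
root = rng.random(n) < 0.9            # probe survives shared link e1
b = root & (rng.random(n) < 0.95)     # ... and link e2 to B
c = root & (rng.random(n) < 0.8)      # ... and link e3 to C
print(minc_two_leaf((b & c).sum(), (b & ~c).sum(), (~b & c).sum(), (~b & ~c).sum()))
```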

8 Unicast Emulation of Multicast
Use back-to-back unicast packets (a blue packet and a red packet) to create temporal correlation across the shared link.
Additional variable: the conditional probability β1 = Pr(red succeeds | blue succeeds).

9 Inference Algorithms
Treat β as variables: infer the conditional probabilities β together with the link loss rates using the EM (expectation-maximization) method [Coates 00].
n_{b,r}: number of packet pairs (blue, red) in which the blue packet succeeds
m_{b,r}: number of packet pairs (blue, red) in which both packets succeed
Likelihood: l(m_{b,r} | n_{b,r}, p_{b,r}) = C(n_{b,r}, m_{b,r}) · p_{b,r}^{m_{b,r}} · (1 - p_{b,r})^{n_{b,r} - m_{b,r}}, where p_{b,r} = β1 · tr2
Inference: argmax_{tr,β} l(m | n, tr, β)
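For intuition, the binomial MLE of p_{b,r} is just the empirical conditional success rate; the counts below are hypothetical. Note that this statistic only pins down the product β1·tr2, which foreshadows the identifiability problem on the next slide:

```python
# Binomial MLE from packet-pair counts (hypothetical numbers)
n_br = 5000    # pairs in which the blue packet was received
m_br = 4400    # ... and the red packet was received too
p_br = m_br / n_br
print(p_br)    # 0.88 estimates the product β1*tr2: β1 and the link
               # rate cannot be separated from this statistic alone
```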

10 Inference Algorithms (cont.)
Inferring the conditional probabilities β together with the link loss rates has a problem: the model is not identifiable. The estimator is biased, and loss rates of internal links are overestimated.
Alternatively, force perfect correlation [Duffield et al. 01]: send trains so that β1 = Pr(red | blue blue blue) ≈ 1, then apply the multicast-based inference algorithms.
Less accurate than the multicast approach and highly dependent on background traffic.

11 Without Temporal Correlation
In most large-scale systems, temporal correlation is hard to enforce:
- Multicast: may not be available
- Packet pairs/trains: hard to scale, not accurate
End-to-end transmission rates are easy to obtain, but without temporal correlation the identifiability issue remains:

[1 1 0]   [log(tr1)]   [log(φ1)]
[1 0 1] * [log(tr2)] = [log(φ2)]
          [log(tr3)]

12 Pragmatic Goals
- Loss rates of groups of consecutive links [Zhao et al. 06]: e.g., tr1·tr2 and tr1·tr3; in many cases, loss rates can be computed for small sets of links
- Identifying only the worst-performing links [Padmanabhan et al. 03, Duffield 06]: e.g., e1 is congested (tr1 < t_l) while e2 and e3 are good (tr2, tr3 > t_l), where t_l is the link threshold

13 Boolean Tomography
Objective: locate the congested links (links with transmission rate tr < t_l).
Input: routing topologies, probing results, the threshold t_l
Output: the set of congested links
(Example: X1 = 0, X2 = 1, X3 = 0, where X_k = 1 means link e_k is congested.)

14 Assumptions
T1 (Known routing matrix): all columns are distinct; all columns have at least one 1 entry
S1 (Loss independence): all losses are temporally and spatially independent
S2: all flows traversing a link e_i have transmission rate tr_i on that link
(Example tree as before: φ1 = 0.72, φ2 = 0.8.)

15 Explore All Possible Loss Rates
Simulate all possible link transmission probabilities that are consistent with the end-to-end measurements [Padmanabhan et al. 03]:
- Random sampling
- Markov chain Monte Carlo simulation: construct a Markov chain whose stationary distribution is Pr(tr1, tr2, tr3 | probes), yielding samples such as (1, 0.99, 0.75), (0.99, 1, 0.76), (0.999, 0.99, 0.75), ...
Lossy links = links whose sampled rates are mostly below 0.8; here, lossy links = {e3}
Slow to converge; why does it work?
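A minimal sketch of this idea, using a random-walk Metropolis sampler as a stand-in for the paper's sampler; the per-path success counts below are made-up values matching φ1 = 0.72 and φ2 = 0.8:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1, 1, 0], [1, 0, 1]])   # routing matrix (paths x links)
n_probes = 1000
succ = np.array([720, 800])            # per-path successes (φ1=0.72, φ2=0.8)

def log_lik(tr):
    phi = np.exp(A @ np.log(tr))       # path success probabilities
    return np.sum(succ * np.log(phi) + (n_probes - succ) * np.log1p(-phi))

tr = np.full(3, 0.9)                   # initial state; uniform prior assumed
samples = []
for _ in range(20_000):
    # Random-walk proposal clipped to (0,1); the boundary handling is a
    # simplification, not exactly the paper's sampler.
    prop = np.clip(tr + rng.normal(0, 0.02, 3), 1e-6, 1 - 1e-6)
    if np.log(rng.random()) < log_lik(prop) - log_lik(tr):
        tr = prop                      # Metropolis accept
    samples.append(tr)
samples = np.array(samples[5_000:])    # discard burn-in
print((samples < 0.8).mean(axis=0))    # fraction of samples calling each link lossy
```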

16 Boolean System of Equations
Transform the linear equations into Boolean equations [Duffield 03] by replacing (+, x) with (max, min):
Y_i = 1 if φ_i < t_p, Y_i = 0 otherwise; X_k = 1 if tr_k < t_l, X_k = 0 otherwise

[1 1 0]   [X1]   [Y1]
[1 0 1] * [X2] = [Y2]   (product taken in (max, min) algebra)
          [X3]

In other words, Y1 = max(X1, X2) and Y2 = max(X1, X3).
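The Boolean transform is one line of numpy. A small sketch, with hypothetical link rates chosen so that e2 is the only bad link:

```python
import numpy as np

A = np.array([[1, 1, 0],
              [1, 0, 1]], dtype=bool)   # routing matrix (paths x links)
tr = np.array([0.95, 0.72, 0.99])       # hypothetical link transmission rates
t_l = 0.9                               # link threshold

X = tr < t_l                            # Boolean link states: [False, True, False]
Y = (A & X).any(axis=1)                 # (max, min) in place of (+, x)
print(X, Y)                             # path A->B bad, path A->C good
```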

17 Smallest Consistent Failure Set (SCFS)
Assumptions:
1. (Performance separability) a path is bad (φ_i < t_p) if and only if at least one of its links is bad (tr_k < t_l)
2. Bad links are rare
On a tree topology [Duffield 03], blame the topmost link whose entire subtree of receivers is bad.
(Figure: example tree with observed bad/good receivers; details not recoverable from the transcript.)
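A sketch of the SCFS rule on a tree, assuming separability and that only receiver states are observed; the tree and receiver names are illustrative:

```python
def scfs(children, root, bad_receivers):
    """Smallest Consistent Failure Set on a tree [Duffield 03] (sketch).
    children: dict node -> list of child nodes; leaves are receivers.
    bad_receivers: leaves whose path from the root is bad.
    Returns the nodes whose incoming link is inferred congested."""
    def all_bad(v):                    # every receiver below v is bad
        if not children.get(v):
            return v in bad_receivers
        return all(all_bad(c) for c in children[v])

    bad_links = set()
    def walk(v):
        if all_bad(v):
            bad_links.add(v)           # topmost all-bad subtree: blame link into v
        else:
            for c in children.get(v, []):
                walk(c)
    for c in children.get(root, []):
        walk(c)
    return bad_links

# Example: root -> a -> {B, C}; both receivers are bad
tree = {"root": ["a"], "a": ["B", "C"]}
print(scfs(tree, "root", {"B", "C"}))  # {'a'}: the shared link is blamed
```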

18 Performance of SCFS
(Simulation figure.) DR: detection rate; FPR: false positive rate. Tree with 1000 nodes, 1000 probes; bad links have loss rates uniform in [0.05, 1], good links in [0, 0.01].

19 Back to SCFS
Objective: locate congested links, where link e_k is congested iff tr_k < t_l = 0.9.
Example: φ1 = 0.72 and φ2 = 0.8, so both paths are congested (Y1 = 1, Y2 = 1); SCFS declares the shared link e1 congested (X1 = 1) and e2, e3 not congested (X2 = 0, X3 = 0).
Assumptions: congested links are rare, and links are equally likely to be congested. Can be very inaccurate: biased in favor of shared links.

20 Time-Varying Boolean Tomography
Link quality changes with time.

21 Time-Varying Boolean Solution
To overcome the biases, use link state probabilities [Nguyen 07]: p_k = P(X_k = 1) = P(tr_k < t_l)
Methodology:
- Step 1 (learning phase): take multiple snapshots to learn the link state probabilities
- Step 2 (diagnosis phase): determine the most probable set of congested links in the current snapshot
(Example: learn p1 = P(X1 = 1), p2 = P(X2 = 1), p3 = P(X3 = 1); then diagnose a snapshot with Y1 = 1, Y2 = 1.)

22 Identifiability of Link State Probability
Theorem (Identifiability): under the previous assumptions, the link state probabilities p_k can be uniquely learnt from end-to-end measurements if and only if 0 <= p_k < 1.
(Example: p1 = P(X1 = 1) = 0, p2 = P(X2 = 1) = 0.3, p3 = P(X3 = 1) = 0.6.)
The proof follows the technique of Vardi, JASA 1996.

23 Step 1: Estimating Link State Probabilities
Theorem (Identifiability): link state probabilities p_k can be uniquely learnt from end-to-end measurements if 0 <= p_k < 1.
Method of moments (Vardi, JASA 1996) on the snapshot sequences [Y1 Y1 Y1 ...], [Y2 Y2 Y2 ...]:
- E[Y1] = P(Y1 = 1) = 1 - (1 - p1)(1 - p2)
- E[Y2] = P(Y2 = 1) = 1 - (1 - p1)(1 - p3)
- E[Y1·Y2] = p1(1-p2)(1-p3) + p1·p2(1-p3) + p1(1-p2)p3 + p1·p2·p3 + (1-p1)p2·p3 = p1 + (1 - p1)·p2·p3
Non-linear equations: intractable!

24 Step 1: Method of Moments Estimator
Take second-order moments in Boolean algebra:
E[max(Y1, Y2)] = P(max(Y1, Y2) = 1) = P(max(X1, X2, X3) = 1) = 1 - (1 - p1)(1 - p2)(1 - p3)

[1 1 0]   [log(1 - p1)]   [log(1 - E[Y1])]
[1 0 1] * [log(1 - p2)] = [log(1 - E[Y2])]
[1 1 1]   [log(1 - p3)]   [log(1 - E[max(Y1, Y2)])]

A full-rank linear system! We can go up to third-, fourth-, etc. order moments.
Conjecture: second-order moments are enough to obtain a system of full rank in all networks.
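Solving this system is straightforward. A minimal sketch, with empirical moment values chosen to be consistent with p = (0, 0.3, 0.6) from the earlier example:

```python
import numpy as np

# Empirical path-state moments from the learning snapshots (hypothetical
# values consistent with p = (0, 0.3, 0.6)).
EY1, EY2 = 0.30, 0.60                  # P(path A->B bad), P(path A->C bad)
EYmax = 0.72                           # P(at least one path bad)

M = np.array([[1, 1, 0],               # rows: Y1, Y2, max(Y1, Y2)
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
rhs = np.log1p(-np.array([EY1, EY2, EYmax]))   # log(1 - moment)
p = 1 - np.exp(np.linalg.solve(M, rhs))
print(p.round(3))                      # ~ [0, 0.3, 0.6]
```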

25 Step 2: Identifying Congested Links
A simple optimization problem to find the set of congested links:

arg min_X Σ_k X_k · log((1 - p_k)/p_k), X_k ∈ {0, 1}

subject to: every bad path contains at least one bad link.
Example: with learned probabilities p1 = 0, p2 = 0.3, p3 = 0.6 and both paths congested in the current snapshot (φ1 = 0.72, φ2 = 0.8, t_l = 0.9), the solution declares e1 not congested and e2, e3 congested.
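Since the example network has only three links, the optimization can be solved by brute force. A sketch; the 1e-12 floor guarding p_k = 0 is an implementation convenience, not part of the formulation:

```python
import numpy as np
from itertools import product

def diagnose(A, bad_paths, p):
    """Most probable congested set (brute force over link states):
    minimize sum_k X_k*log((1-p_k)/p_k) subject to every bad path
    containing at least one congested link."""
    n_links = A.shape[1]
    w = np.log((1 - p) / np.maximum(p, 1e-12))   # cost of declaring a link bad
    best, best_cost = None, np.inf
    for X in product([0, 1], repeat=n_links):
        X = np.array(X)
        if all((A[i] * X).any() for i in bad_paths):   # cover constraint
            cost = float(w @ X)
            if cost < best_cost:
                best, best_cost = X, cost
    return best

A = np.array([[1, 1, 0], [1, 0, 1]])        # routing matrix
p = np.array([0.0, 0.3, 0.6])               # learned link-state probabilities
print(diagnose(A, bad_paths=[0, 1], p=p))   # [0 1 1]: e2 and e3, not e1
```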

26 Performance
(Simulation figure.) DR: detection rate; FPR: false positive rate. Tree with 1000 nodes, 1000 probes; bad links have loss rates uniform in [0.05, 1], good links in [0, 0.01].

27 Loss Rates on the Internet
Observation 1: most links have negligible loss rates (tr_k = 1); on PlanetLab, more than 80% of end-to-end paths have zero loss rates.
Observation 2: congested links have high loss-rate variances [Paxson Sigcomm 97, Zhang IMC 01].

28 Time-Varying Loss Tomography
Calculate the link loss rates from end-to-end loss rates without using probe temporal correlation.
Input: routing matrix, end-to-end transmission rates (per snapshot)
Output: link transmission rates

29 Assumptions
T1 (Known routing matrix): all columns are distinct; all columns have at least one 1 entry
S1 (Link independence): the tr_k are independent
S2 (Identical sample rates): φ_{i,k} = tr_k (a.s.) for every path i traversing link e_k
S3 (Monotonic relationship between mean and variance of link loss rates): v_k = VAR(log(tr_k)) is a non-decreasing function of E(1 - tr_k)

30 Time-Varying Tomography Solution
To overcome the ill-posed problem, identify links with negligible loss-rate variances: v_k ~ 0 implies tr_k ~ 1.
Methodology:
- Step 1 (learning phase): take multiple snapshots to learn the link variances
- Step 2 (diagnosis phase): recursively assign zero loss rates to links with the smallest variances, reducing the diagnosis equations to a full-rank linear system

31 Step 1: Estimating Link Variances
From the snapshot sequences [φ1 φ1 φ1 ...], [φ2 φ2 φ2 ...], calculate the link variances:

[1 1 0]   [v1]   [VAR(log(φ1))]
[1 0 1] * [v2] = [VAR(log(φ2))]
[1 0 0]   [v3]   [COV(log(φ1), log(φ2))]

A full-rank linear system. Sort the links according to their variances (here v3 >= v2 >= v1): links with smaller variances are less congested.
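A minimal simulation sketch of the variance-estimation step, assuming a hypothetical Gaussian model for the log link rates (the variances 0.001, 0.01, 0.05 and mean -0.05 are made-up):

```python
import numpy as np

rng = np.random.default_rng(2)
v_true = np.array([0.001, 0.01, 0.05])                 # made-up link variances
# Hypothetical Gaussian snapshots of log link rates (mean -0.05)
log_tr = rng.normal(-0.05, np.sqrt(v_true), size=(500, 3))
log_phi = log_tr @ np.array([[1, 1], [1, 0], [0, 1]])  # paths A->B, A->C

C = np.cov(log_phi.T)
rhs = np.array([C[0, 0], C[1, 1], C[0, 1]])  # VAR(log φ1), VAR(log φ2), COV
M = np.array([[1, 1, 0],    # VAR(log φ1) = v1 + v2
              [1, 0, 1],    # VAR(log φ2) = v1 + v3
              [1, 0, 0]])   # COV         = v1 (shared link only)
print(np.linalg.solve(M, rhs).round(4))      # recovers ~v_true up to noise
```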

32 Step 2: Calculating Link Loss Rates
Recursively eliminate good links from the first-order equations: starting from the smallest-variance link, set tr1 ~ 1 (log(tr1) = 0) in

[1 1 0]   [log(tr1)]   [log(φ1)]
[1 0 1] * [log(tr2)] = [log(φ2)]
          [log(tr3)]

until we obtain a full-column-rank system, then solve it using standard linear algebra techniques.
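And a sketch of the elimination step on the running example, assuming the variance ordering from the previous slide (e1 has the smallest variance):

```python
import numpy as np

A = np.array([[1, 1, 0],
              [1, 0, 1]], dtype=float)   # routing matrix
log_phi = np.log([0.72, 0.80])           # one snapshot of end-to-end rates
order = [0, 1, 2]                        # links sorted by increasing variance

cols = list(range(A.shape[1]))
while np.linalg.matrix_rank(A[:, cols]) < len(cols):
    cols.remove(order.pop(0))            # smallest variance: set log(tr) = 0

log_tr = np.zeros(A.shape[1])
sol, *_ = np.linalg.lstsq(A[:, cols], log_phi, rcond=None)
log_tr[cols] = sol
print(np.exp(log_tr).round(3))           # [1.0, 0.72, 0.8]: e1 assumed lossless
```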

33 Summary
Network performance tomography is all about finding extra information to overcome the non-identifiability problem. Two major approaches:
- Using temporal correlation (multicast or unicast)
- Using prior information about link loss rates: Boolean tomography, time-varying tomography
There is still much work to be done before large-scale diagnosis systems can be built using these techniques: topology changes, inaccurate end-to-end measurements.
