Network Simulation
Chapter 5: Traffic Modeling
Prof. Dr. Jürgen Jasperneite

Chapter Overview
1. Basic Simulation Modeling
2. OPNET IT Guru - A Tool for Discrete Event Simulation
3. Review of Basic Probabilities and Statistics
4. Building valid, credible Simulation Models
5. Traffic Modeling
6. Output Data Analysis

Chapter Overview
1. Basic Simulation Modeling
2. OMNeT++ - A Tool for Discrete Event Simulation
3. Review of Basic Probabilities and Statistics
4. Building valid, credible Simulation Models
5. Traffic Modeling
6. Output Data Analysis

Overview
- Introduction
- Quantifying models
- Goodness of fit tests

Introduction
[Figure: Traffic source (workload; load parameters) -> System under study (system parameters) -> metrics]

Introduction
- Part of modeling: deciding which input probability distributions to use as input to the simulation, e.g. for interarrival times, message lengths, message types
- Characterization of traffic is very important
  - Results of a simulation are only as good as its input
  - Inappropriate input distribution(s) can lead to incorrect output and bad decisions
- Many different methods are used to generate traffic sources
- Each method has advantages and disadvantages in terms of
  - Development time
  - Flexibility
  - Accuracy

Introduction
Traffic categories include:
- Statistical sources
  - Exponentially distributed interarrival (IA) times
  - ON-OFF
- Network applications
  - FTP
  - HTTP
  - Voice
  - Video
  - ...
- Captured packet traces (trace-driven simulation)

Simple Statistical Distributions
Statistical distributions are commonly used in performance analysis:
- Poisson (application traffic; a Poisson arrival process has exponentially distributed interarrival times)
- Normal (packet sizes)
- Uniform (destination addresses)

Overview
- Introduction
- Quantifying models
- Goodness of fit tests

Introduction
Usually, observed data on the input quantities are available. Options for using them:

Trace-driven
  Use:  Use actual data values to drive the simulation
  Pros: Valid vis-a-vis the real world; direct
  Cons: Not generalizable

Empirical distribution
  Use:  Use data values to define a "connect-the-dots" distribution
  Pros: Fairly valid; simple; fairly direct
  Cons: May limit the range of generated variates (depending on form)

Fitted standard distribution
  Use:  Use data to fit a classical distribution (exponential, uniform, Poisson, etc.)
  Pros: Generalizable; fills in "holes" in the data
  Cons: May not be valid; may be difficult

Extracting distributions out of traces
- How to overcome the finiteness of a trace? How to characterize a trace in general?
- Consider a trace as a set {X_1, ..., X_n} of individual values
  - Assumption: all samples come from the same distribution
- Construct the empirical distribution function from this set
  - Sort the {X_1, ..., X_n} in increasing order such that X_(1) <= ... <= X_(n)
  - Define a piecewise-linear distribution function:

\[
F(x) =
\begin{cases}
0 & \text{if } x < X_{(1)} \\[4pt]
\dfrac{i-1}{n-1} + \dfrac{x - X_{(i)}}{(n-1)\,\bigl(X_{(i+1)} - X_{(i)}\bigr)} & \text{if } X_{(i)} \le x < X_{(i+1)} \text{ for } i = 1, \dots, n-1 \\[4pt]
1 & \text{if } X_{(n)} \le x
\end{cases}
\]

Empirical distribution - example
- Figure shows an empirical distribution function for six data points
[Figure: empirical distribution; piecewise-linear F(x) rising from 0 to 1 in steps of 0.2 over x from 0 to 25, with the sorted data points X_(1), ..., X_(6) marked on the x-axis]

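The piecewise-linear distribution function defined above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `empirical_cdf` and `sample_empirical` and the six data values are hypothetical, and the code assumes that all data values are distinct. Variates are generated by inverting F (inverse-transform method).

```python
import bisect
import random


def empirical_cdf(sorted_data, x):
    """Piecewise-linear empirical distribution function F(x).

    Assumes sorted_data is in increasing order, has at least two entries,
    and contains distinct values.
    """
    n = len(sorted_data)
    if x < sorted_data[0]:
        return 0.0
    if x >= sorted_data[-1]:
        return 1.0
    # 1-based index i such that X_(i) <= x < X_(i+1)
    i = bisect.bisect_right(sorted_data, x)
    lo, hi = sorted_data[i - 1], sorted_data[i]
    return (i - 1) / (n - 1) + (x - lo) / ((n - 1) * (hi - lo))


def sample_empirical(sorted_data, rng):
    """Draw one variate by inverting F (inverse-transform method)."""
    n = len(sorted_data)
    p = rng.random() * (n - 1)          # position in [0, n-1)
    j = min(int(p), n - 2)              # 0-based index of the left data point
    frac = p - j                        # relative position inside the interval
    return sorted_data[j] + frac * (sorted_data[j + 1] - sorted_data[j])


# Hypothetical six-point trace, sorted in increasing order
data = sorted([2.0, 5.0, 9.0, 12.0, 18.0, 22.0])
rng = random.Random(1)
variates = [sample_empirical(data, rng) for _ in range(1000)]
```

Note that every generated value lies between the smallest and the largest observation, which is exactly the range limitation of empirical distributions.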
Empirical distributions - Discussion
- For realistic sample sizes, few or no data are available for the tail of a distribution
- Empirical distributions as defined above cannot generate values larger than the largest observation X_(n), although such values might be desirable
- Adding an exponential tail to the data is possible and often useful

Traces vs. empirical distributions
- Going from traces to empirical distributions seems quite attractive
  - An infinite number of samples can easily be generated
- Is there a downside?
- Example: suppose you want to use the waiting time of customers in a queue as an input to some other simulation model
  - Trace-driven: generate a long list of individual customer waiting times (either by measurement or by simulation), store this list, and whenever a waiting time is needed, use one entry of this list
  - Empirical distribution function: take the list, compute an empirical distribution, and generate a random variate whenever a waiting time is required

Traces vs. empirical distributions
- Difference?
- Random variates are generated from a distribution one at a time; no information about previous values is stored
  - All values are identically distributed (they come from the same distribution) and are independent of each other
  - The corresponding random variables are called IID (independent and identically distributed) variables
- In a trace, the history of the system, i.e. how the values were generated, is still maintained (though implicitly)
  - Such history can result in a mutual dependence of values
  - Consider a queue: when the person before you has to wait long, it is quite likely that you will also have to wait long
  - Waiting times in a queue are positively correlated
- The correlation structure of traces is destroyed by simulation using empirical distributions!

Traces vs. empirical distributions - Example
- Consider the waiting times in an M/M/1 queue
- Compute the empirical distribution from one simulation run
- Use this empirical distribution to generate random numbers according to it
- The plot shows both distributions; they are reasonably similar
  - We will see shortly what "reasonable" means
[Figure: cumulative distribution function (0 to 1) over waiting time (0 to 40); curves: "Empirical distribution" and "Randomly generated according to empirical distribution"]

Traces vs. empirical distributions - Example
- But: look at the autocorrelation of the two sets of numbers (measured from the simulation and randomly generated):

\[
\rho_{i,i+j} = \frac{C_{i,i+j}}{\sqrt{\sigma_i^2 \, \sigma_{i+j}^2}},
\qquad
C_{i,i+j} = E\bigl[(X_i - \mu_i)(X_{i+j} - \mu_{i+j})\bigr]
\]

- Compare the two graphs
  - Note the slowly decaying autocorrelation for the simulated/measured data
  - Randomly generated data is practically uncorrelated
- Generating random numbers in a direct fashion destroys the correlation structure
[Figure: autocorrelation (-0.2 to 1) over lag j (1 to 11); curves: "Simulated/Measured data" and "Randomly generated data"]

Generalizing empirical distributions
- Empirical distributions are essentially a big set of data points
  - Unwieldy, big description
  - Generating random numbers based on such an empirical distribution is quite time-consuming
  - We will soon see how
- Is it possible to have a more compact, smaller representation?
- Yes: look for an analytically described (closed-form) distribution function that matches the empirical distribution function!

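The loss of correlation described in the M/M/1 example can be reproduced with a short experiment. The sketch below is an illustration under stated assumptions, not the setup used for the slides: waiting times are generated with the Lindley recursion, the arrival and service rates (0.8 and 1.0) are arbitrary choices, and the IID variates are drawn from the trace with replacement instead of from the piecewise-linear empirical distribution. The autocorrelation estimator assumes a stationary series, so one mean and one variance are used for all lags.

```python
import random


def mm1_waiting_times(n, arrival_rate, service_rate, rng):
    """Waiting times of n successive customers in an M/M/1 queue.

    Lindley recursion: W[k+1] = max(0, W[k] + S[k] - A[k+1]),
    with service time S and interarrival time A.
    """
    waits = [0.0]
    for _ in range(n - 1):
        service = rng.expovariate(service_rate)
        interarrival = rng.expovariate(arrival_rate)
        waits.append(max(0.0, waits[-1] + service - interarrival))
    return waits


def autocorrelation(xs, lag):
    """Sample estimate of the lag-j autocorrelation (stationarity assumed)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var


rng = random.Random(42)
trace = mm1_waiting_times(20000, arrival_rate=0.8, service_rate=1.0, rng=rng)
# IID variates with the same marginal distribution as the trace
iid = [rng.choice(trace) for _ in trace]

for lag in (1, 5, 10):
    print(lag,
          round(autocorrelation(trace, lag), 3),
          round(autocorrelation(iid, lag), 3))
```

The trace column should stay clearly positive over several lags, while the resampled column should stay close to zero.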
Fitting empirical distributions
To replace an empirical distribution by an analytically described distribution, the following steps are required:
- Find an analytical distribution that fits the overall shape of the empirical distribution
- As an analytical distribution is usually parameterized, find appropriate values for these parameters
- Determine the quality of the fit

Finding proper families of distributions
- To find a suitable family of analytical distributions, prior knowledge about the underlying empirical distribution is often available
  - E.g., certain assumptions about the arrival process directly result in Poisson distributions, etc.
- Negative selection is also possible: some values have natural upper or lower bounds
  - E.g., values that can only be positive should not be modeled with distributions that take on negative values

Heuristics to choose distributions
- How to choose distributions to fit data when no prior knowledge is available?
- Some heuristics exist
  - Summary statistics
  - Histogram
- Note that most of these heuristics (as well as the procedures to check the quality of a fit) require the underlying data (from which the empirical distribution has been generated) to be independent
  - One means to check independence is the autocovariance

Summary statistics
- Compute summary statistics such as mean, median, variance, coefficient of variation, or skewness (a measure of symmetry) from the original sample
- Compare these results with the properties of a candidate distribution
  - E.g., for symmetric distributions, mean and median are equal
  - For some distributions, the coefficient of variation must be smaller than 1 or exactly equal to 1 (exponential distribution)
- This is mainly a means to quickly weed out inappropriate distributions from a large set of possible ones

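The summary statistics listed above can be computed with the Python standard library alone. The following sketch is an illustration only: the helper name `summary_statistics` and the synthetic exponential sample are assumptions, and the skewness is taken as the third central moment divided by the cubed standard deviation, which is one common definition.

```python
import math
import random
import statistics


def summary_statistics(xs):
    """Mean, median, variance, coefficient of variation and skewness."""
    n = len(xs)
    mean = statistics.fmean(xs)
    variance = statistics.variance(xs, mean)   # sample variance, n - 1 denominator
    std = math.sqrt(variance)
    third_moment = sum((x - mean) ** 3 for x in xs) / n
    return {
        "mean": mean,
        "median": statistics.median(xs),
        "variance": variance,
        "cv": std / mean,                       # coefficient of variation
        "skewness": third_moment / std ** 3,
    }


rng = random.Random(3)
# Hypothetical interarrival times, exponentially distributed with rate 2
sample = [rng.expovariate(2.0) for _ in range(50000)]
stats = summary_statistics(sample)
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```

For this sample the coefficient of variation should be close to 1 and the mean clearly larger than the median, which rules out symmetric candidates and is consistent with an exponential distribution.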
Histograms
- Compute a histogram of the original data
  - Typically, equidistant buckets are useful
- Compare the shape of the histogram with that of the density of candidate distributions
  - Many shapes are quite characteristic and easily recognized
  - Ignore differences in location and scale
- How to choose the width/number k of buckets?
  - Sturges's rule: \( k = 1 + \log_2 n \), where n is the number of data points
  - Better: rely on the visual impression. Aim for a smooth shape with buckets neither too wide (detail is lost, spikes at crucial points could be missed) nor too narrow (small differences are overemphasized)
- A histogram can often indicate whether the density is a sum of two individual densities

Histograms - Example of a multimodal distribution
- Histogram shows data traffic between a Programmable Logic Controller (PLC) and a Human-Machine Interface (HMI)
- Result of a keep-alive function that exchanges packets every 5 seconds

[1] Jasperneite, Jürgen: Analyse und Modellierung von Kommunikationslasten in der Fertigungstechnik. In: at - Automatisierungstechnik, R. Oldenbourg Verlag (49), pp. 206-213, Apr 2001

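Sturges's rule and an equidistant-bucket histogram can be sketched as follows. This is an illustration, not the procedure used for the slides: the function names are hypothetical, and rounding the bucket count up to the next integer is an assumption, since the rule as stated yields a non-integer value for most n.

```python
import math


def sturges_bins(n):
    """Number of buckets by Sturges's rule, rounded up (rounding is an assumption)."""
    return math.ceil(1 + math.log2(n))


def histogram(xs, k):
    """Counts for k equidistant buckets spanning [min(xs), max(xs)]."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    counts = [0] * k
    for x in xs:
        # The maximum value falls on the right edge; put it into the last bucket
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts


values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]     # hypothetical data
edges, counts = histogram(values, 5)
print(sturges_bins(len(values)), counts)
```

In practice the bucket count from the rule is only a starting point; as noted above, the final choice is best made by looking at the resulting shape.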
Overview
- Introduction
- Quantifying models
- Goodness of fit tests

Goodness of fit tests
- Given a hypothesized distribution along with estimated parameters, how can we tell how well this hypothesis matches the real data?
- Heuristic procedures
  - Density/histogram overplots: plot both the empirical histogram and the estimated density function in one graph, look for differences
  - Frequency comparison: plot the empirical histogram and the calculated histogram side by side, look for differences
  - Distribution function difference plot: compute the difference between the empirical and the estimated distribution function and plot this difference. Ideally, the result is a horizontal line at 0
    - Directly comparing two plots of distributions is difficult for most humans
  - Probability/quantile plots: see below

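The data behind a distribution function difference plot can be computed as sketched below. The example is an illustration under assumptions that are not in the slides: the sample is synthetic, the hypothesized family is exponential, its rate is estimated as the reciprocal of the sample mean (the maximum-likelihood estimate), and the function names are hypothetical. The empirical distribution is the piecewise-linear one, whose value at the i-th smallest data point is (i - 1) / (n - 1).

```python
import math
import random


def fit_exponential_rate(xs):
    """Maximum-likelihood estimate of the exponential rate: 1 / sample mean."""
    return len(xs) / sum(xs)


def cdf_differences(xs, fitted_cdf):
    """Pairs (x, F_empirical(x) - F_fitted(x)) at the sorted data points."""
    ordered = sorted(xs)
    n = len(ordered)
    # enumerate() starts at 0, so i / (n - 1) equals (rank - 1) / (n - 1)
    return [(x, i / (n - 1) - fitted_cdf(x)) for i, x in enumerate(ordered)]


rng = random.Random(7)
sample = [rng.expovariate(0.5) for _ in range(2000)]   # hypothetical interarrival times
rate = fit_exponential_rate(sample)
diffs = cdf_differences(sample, lambda x: 1.0 - math.exp(-rate * x))
max_abs = max(abs(d) for _, d in diffs)
print(f"estimated rate: {rate:.3f}, largest absolute difference: {max_abs:.4f}")
```

Plotting the second component of `diffs` against the first gives the difference plot; for a good fit the curve stays close to the horizontal line at 0.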
Q-Q plots
- A way of plotting the difference between two distributions: Q-Q plots
- A quantile is the value of the variable that corresponds to a fixed cumulative frequency
  - First quartile = 0.25 quantile
  - Second quartile = median = 0.5 quantile
  - Third quartile = 0.75 quantile
- Any quantile can be read from the cdf plot

Q-Q plot
A Q-Q plot
- compares two univariate (1) distributions
- is a plot of matching quantiles; a straight line implies that the two distributions have the same shape
- has the units of the data
- emphasizes differences in the tails

(1) Involving one variable, as opposed to two (bivariate) or many (multivariate)

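The points of a normal Q-Q plot can be computed without a plotting library. The sketch below is an illustration with several assumptions that are not from the slides: the plotting positions (i - 0.5) / n are one common convention, the samples are synthetic, and `straightness` is a hypothetical helper that uses the Pearson correlation of the points as a rough numerical stand-in for "the points lie on a straight line".

```python
import math
import random
from statistics import NormalDist


def normal_qq_points(xs):
    """Pairs (theoretical standard-normal quantile, sample quantile)."""
    ordered = sorted(xs)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are an assumed convention
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]


def straightness(points):
    """Pearson correlation of the Q-Q points; near 1 means nearly straight."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(11)
normal_sample = [rng.gauss(4.0, 0.5) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]
r_normal = straightness(normal_qq_points(normal_sample))
r_skewed = straightness(normal_qq_points(skewed_sample))
print(f"normal sample: {r_normal:.4f}, exponential sample: {r_skewed:.4f}")
```

Because the theoretical quantiles are those of the standard normal distribution, a normal sample with a different mean and standard deviation still gives a straight line, only with a different intercept and slope.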
Example: Q-Q plot
[Figure: example Q-Q plot; axes: "Sample" and "Normal"; six numbered points]

Example: Old Faithful inter-eruption times
Data describing times between eruptions from a geyser (in minutes):

3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600,
1.950, 4.350, 1.833, 3.917, 4.200, 1.750, 4.700, 2.167,
1.750, 4.800, 1.600, 4.250, 1.800, 1.750, 3.450, 3.067,
4.533, 3.600, 1.967, 4.083, 3.850, 4.433, 4.300, 4.467,
3.367, 4.033, 3.833, 2.017, 1.867, 4.833, 1.833, 4.783,
4.350, 1.883, 4.567, 1.750, 4.533, 3.317, 3.833, 2.100,
4.633, 2.000, 4.800, 4.716, 1.833, 4.833, 1.733, 4.883,
3.717, 1.667, 4.567, 4.317, 2.233, 4.500, 1.750, 4.800,
1.817, 4.400, 4.167, 4.700, 2.067, 4.700, 4.033, 1.967,
4.500, 4.000, 1.983, 5.067, 2.017, 4.567, 3.883, 3.600,
4.133, 4.333, 4.100, 2.633, 4.067, 4.933, 3.950, 4.517,
2.167, 4.000, 2.200, 4.333, 1.867, 4.817, 1.833, 4.300,
4.667, 3.750, 1.867, 4.900, 2.483, 4.367, 2.100, 4.500,
4.050, 1.867, 4.700, 1.783, 4.850, 3.683, 4.733, 2.300,
4.900, 4.417, 1.700, 4.633, 2.317, 4.600, 1.817, 4.417,
2.617, 4.067, 4.250, 1.967, 4.600, 3.767, 1.917, 4.500,
2.267, 4.650, 1.867, 4.167, 2.800, 4.333, 1.833, 4.383,
1.883, 4.933, 2.033, 3.733, 4.233, 2.233, 4.533, 4.817,
4.333, 1.983, 4.633, 2.017, 5.100, 1.800, 5.033, 4.000,
2.400, 4.600, 3.567, 4.000, 4.500, 4.083, 1.800, 3.967,
2.200, 4.150, 2.000, 3.833, 3.500, 4.583, 2.367, 5.000,
1.933, 4.617, 1.917, 2.083, 4.583, 3.333, 4.167, 4.333,
4.500, 2.417, 4.000, 4.167, 1.883, 4.583, 4.250, 3.767,
2.033, 4.433, 4.083, 1.833, 4.417, 2.183, 4.800, 1.833,
4.800, 4.100, 3.966, 4.233, 3.500, 4.366, 2.250, 4.667,
2.100, 4.350, 4.133, 1.867, 4.600, 1.783, 4.367, 3.850,
1.933, 4.500, 2.383, 4.700, 1.867, 3.833, 3.417, 4.233,
2.400, 4.800, 2.000, 4.150, 1.867, 4.267, 1.750, 4.483,
4.000, 4.117, 4.083, 4.267, 3.917, 4.550, 4.083, 2.417,
4.183, 2.217, 4.450, 1.883, 1.850, 4.283, 3.950, 2.333,
4.150, 2.350, 4.933, 2.900, 4.583, 3.833, 2.083, 4.367,
2.133, 4.350, 2.200, 4.450, 3.567, 4.500, 4.150, 3.817,
3.917, 4.450, 2.000, 4.283, 4.767, 4.533, 1.850, 4.250,
1.983, 2.250, 4.750, 4.117, 2.150, 4.417, 1.817, 4.467

Histogram of eruption data
[Figure: "Histogram of eruptions"; density (0.0 to 0.7) over eruptions (1.5 to 5.0)]

Empirical distribution of eruption data
[Figure: "ecdf(eruptions)"; Fn(x) (0.0 to 1.0) over x (2 to 5)]
- The data are evidently bimodal, so no standard distribution will fit
- What about looking at only part of the data, e.g. the upper part?

Restricted empirical distribution
[Figure: "ecdf(long)"; Fn(x) (0.0 to 1.0) over x (3.0 to 5.0)]
- Looks like a reasonable fit with a normal distribution
- Check with a Q-Q plot!

Q-Q plot for eruption data
[Figure: "Normal Q-Q Plot"; sample quantiles (3.0 to 5.0) over theoretical quantiles (-2 to 2)]
- Reasonable fit, but some differences in the tail
- The shifted mean is not taken into account for the theoretical quantiles
- Example taken from the R manual (see Web page www.r-project.org)

Traffic Modeling
- Introduction
- Quantifying models
- Goodness of fit tests