An Architecture for a WWW Workload Generator

Paul Barford and Mark Crovella
Computer Science Department
Boston University

September 18, 1997

1 Overview

SURGE (Scalable URL Reference Generator) is a WWW workload generator which is based on analytical models of WWW use. It relies on the fact that a great deal of prior work has been done in the analysis of WWW transactions, and that models for many of the important characteristics of WWW use have been developed. The goal of SURGE is to provide a scalable framework which, from the server's perspective, makes document requests which mimic a set of real users. SURGE is intended to be used for benchmarking, network traffic generation and simulation. This paper contains descriptions of the details of the SURGE framework, the additional models which we developed, and the statistical methods we used to parameterize the models.

2 The SURGE Framework

2.1 Structure of SURGE

SURGE is a scalable software framework within which a collection of models for the various components of WWW use are combined. From the top down, SURGE software resides on a set of clients which are connected to a WWW server as illustrated in Figure 1. Each SURGE client executes a set of threads, each of which is an ON/OFF source [15, 9]. Each thread requests a document set which is then transferred by the server (ON time). After receiving the document set, a thread sleeps for some amount of time (OFF time). This ON/OFF characteristic is an important difference between SURGE and other benchmarks such as [1, 2, 3, 4, 5] ([14] also includes OFF times).

[Figure 1: SURGE Architecture. Several SURGE client systems, each running multiple ON/OFF threads, connected over a LAN to a Web server system.]

When SURGE is started on a client system, it begins by populating a number of arrays with data generated by the analytic models of WWW client use. It then spawns a user-designated number of threads which execute the loop shown in pseudo code in Figure 2. URL_List, Inactive_OFF, Active_OFF and ON_Count each represent an array of data. Each array is generated by the collection of models within the SURGE framework which characterize different aspects of WWW use. The arrays are generated in the sequence shown in Figure 3.

2.2 Models Used in SURGE

We began the development of the SURGE framework by looking at the work that had been done in other studies. In particular, models for the following aspects of WWW use had already been proposed:

- A model for the distribution of unique sizes of files requested from servers is suggested in [9].
- A model for the distribution of sizes of all files transferred from servers is suggested in [9] (used in box 3 in Figure 3).
- A model for the popularity of all files requested is suggested in [10] (used in box 1 in Figure 3).
- A model for the temporal locality of files requested is suggested in [6] (used in box 4 in Figure 3).
- A model for the Inactive OFF times is suggested in [9, 12] (used in box 7 in Figure 3).

(1) SLEEP(NEXT Inactive_OFF item)
(2) WHILE (URLs remain in URL_List) {
(3)     I = NEXT ON_Count item
(4)     WHILE (I > 0) {
(5)         D = NEXT URL_List item
(6)         TRANSFER D
(7)         IF (I > 1) SLEEP(NEXT Active_OFF item)
(8)         I = I - 1
        }
(9)     SLEEP(NEXT Inactive_OFF item)
    }

Figure 2: Pseudo code executed by each SURGE thread

[Figure 3: Generation of SURGE Data Arrays. Boxes: (1) Total Requests/File (Popularity), (2) Unique File Sizes, (3) Matching, (4) Sequence Generator producing URL_List, (5) ON Count Generator producing ON_Count, (6) Active OFF Generator producing Active_OFF, (7) Inactive OFF Generator producing Inactive_OFF.]
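The loop in Figure 2 maps directly onto a thread body. The following is a minimal Python sketch of one ON/OFF thread, an illustration under our own assumptions rather than actual SURGE source code: the request is reduced to a plain HTTP GET, and names such as url_list and on_count simply mirror the data arrays described above.

import threading
import time
import urllib.request


def surge_thread(url_list, on_count, active_off, inactive_off):
    """One ON/OFF source: fetch a burst of documents (ON time), then sleep (OFF time)."""
    urls, counts = iter(url_list), iter(on_count)
    act, inact = iter(active_off), iter(inactive_off)

    time.sleep(next(inact))                       # (1) initial Inactive OFF period
    for i in counts:                              # (3) number of documents in this ON period
        for j in range(i):                        # (4)
            try:
                d = next(urls)                    # (5) next URL from URL_List
            except StopIteration:                 # (2) no URLs remain
                return
            urllib.request.urlopen(d).read()      # (6) transfer the document
            if j < i - 1:
                time.sleep(next(act))             # (7) Active OFF between documents
        time.sleep(next(inact))                   # (9) Inactive OFF: user "think time"


def run_client(n_threads, url_list, on_count, active_off, inactive_off):
    """Spawn a user-designated number of ON/OFF threads, as a SURGE client would.

    For simplicity each thread here walks its own iterators over the same
    arrays; a real generator would partition or coordinate this work."""
    threads = [threading.Thread(target=surge_thread,
                                args=(url_list, on_count, active_off, inactive_off))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()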

We began by incorporating these models into SURGE; however, they do not represent the complete set of information required to generate ON/OFF entities that follow the algorithm in Figure 2. We had to develop the following analytic models in order to complete the SURGE framework:

- A model for the unique sizes of files transferred which was accurate over the entire distribution (used in box 2 in Figure 3).
- A model for Active OFF periods (used in box 6 in Figure 3).
- A model for the number of documents transferred during ON periods (used in box 5 in Figure 3).

We believe that a good model for a representative WWW workload generator should include the aforementioned models of WWW use. However, there may be additional characteristics that could be added to this model. For example, [6] also describes spatial locality characteristics for WWW requests. Spatial locality has not been included in SURGE at the present time.

3 Statistical Overview

We used the BU client trace data sets discussed in [10] to develop the three models required to complete SURGE. To develop these models, we used the standard statistical methods described in [16]. These are the following:

1. Use empirical data to determine distributional models (using Q-Q or CDF plots) for each data set using the half sample method [11]. Use logarithmic transformations where necessary to distinguish important characteristics.

2. Estimate parameters for analytic models using maximum likelihood estimators and then test for the accuracy of the model using goodness-of-fit tests. We used the Anderson-Darling (A²) test. This empirical distribution function (EDF) test is a powerful means for analyzing the entire distribution and is suggested as the recommended EDF test for models with unknown parameters [11, 16].

3. If goodness-of-fit tests fail, then use goodness-of-fit metrics such as the λ² test suggested in [16] and described in [17]. This discrepancy measure is used to compare how well analytic models describe an empirical data set. It is a technique that relies on partitioning a data set into bins. We used the method suggested in [16] to select bin size (a sketch of this kind of binned comparison follows this list).

4. Data sets can often contain outliers which do not seem to be part of the distribution and can skew goodness-of-fit analysis. These must be investigated and explained before they are excluded from any analysis.

5. Validate the model using the second half of the sample data, or against other data sets if they are available.
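The binned comparison in step 3 can be illustrated as follows. The exact λ² statistic is defined in [16, 17]; the sketch below substitutes a plain chi-square-style discrepancy over quantile bins, which is only a stand-in for λ² but shows the shape of the procedure (fit each candidate by MLE, bin, compare, rank). The input file name is hypothetical.

import numpy as np
from scipy import stats


def binned_discrepancy(data, dist, params, n_bins=30):
    """Chi-square-style binned discrepancy between data and a fitted model.

    A simplified stand-in for the lambda^2 measure of [16, 17]: bin the data,
    compare observed bin counts with the counts the model predicts, and
    normalize so that different candidate models can be compared."""
    edges = np.unique(np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1)))
    observed, _ = np.histogram(data, bins=edges)
    expected = len(data) * np.diff(dist.cdf(edges, *params))
    expected = np.clip(expected, 1e-9, None)      # guard against empty model bins
    return float(np.sum((observed - expected) ** 2 / expected) / len(data))


def rank_models(data, candidates):
    """Fit each candidate by MLE and return (name, params, discrepancy), best first."""
    results = []
    for name, dist in candidates.items():
        params = dist.fit(data)                   # maximum likelihood estimates
        results.append((name, params, binned_discrepancy(data, dist, params)))
    return sorted(results, key=lambda r: r[-1])


CANDIDATES = {"lognormal": stats.lognorm, "weibull": stats.weibull_min,
              "pareto": stats.pareto, "exponential": stats.expon}

# sizes = np.loadtxt("unique_file_sizes.txt")     # hypothetical trace extract
# print(rank_models(sizes, CANDIDATES))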

We believe that each of the analytic models within SURGE is required to generate a representative workload. However, it is not our intention to argue for invariant properties of any of these models, including those which we developed in this study. To that end, SURGE has been developed as a fully parameterizable tool. Our aim was to complete the models required and to present them as a reasonable approximation for each characteristic. The constant evolution of the WWW means that it is important to understand how each component of SURGE affects the workload generated at the server. The following sections describe the results of the analysis completed for each of the three models.

4 File Size Model

World Wide Web file sizes have been analyzed in a number of studies including [7, 9]. Crovella and Bestavros showed in [9] that the distribution of the set of unique file sizes transferred to users exhibits a "heavy tailed" characteristic. However, examination of the models proposed in this work shows that "heavy tailed" distributions do not accurately describe the entire distribution of the file sizes. In particular, the data shows that the distribution for file sizes less than approximately 100KB deviates from an ideal Pareto distribution. For SURGE, we had to develop a model which more accurately characterized the entire range of unique file sizes.

We began our unique file size model assuming that the heavy tailed characterization of the distribution was accurate. The model was then developed as a hybrid consisting of a new distribution for the body up to a threshold, followed by a heavy tailed model for the upper tail. Our task was to find the correct distribution for the body, its parameters, and the appropriate threshold between the body and upper tail.

4.1 Modeling Results

We began by selecting all of the unique file sizes from the BU client traces. We used half of this sample to develop our model, which gave us a data set with 11,188 points. Our assumption was that the underlying distribution had a specific characteristic which is "contaminated" by the long upper tail.

We generated cumulative distribution function (CDF) plots to determine the most appropriate model fit for the data. We found the best visual fit for this data to be the lognormal distribution. Figure 4 shows the histogram for the log_e transform of the data.

[Figure 4: BU Unique File Size Data (histogram of log(unique file sizes)).]

In addition, the λ² test on the data versus the distributional models (lognormal, Weibull, Pareto, exponential, log-extreme), whose parameters were derived using maximum likelihood estimators (MLE), showed that the best (smallest) λ² value was from the lognormal distribution. Our null hypothesis for the A² goodness-of-fit test for unique file sizes was:

H0: The log transformed unique file size sample comes from the normal distribution N(μ, σ²) with both μ and σ² derived from the sample using MLE.

The result of the A² test run on the test data set showed that we must reject this null hypothesis at any level of significance. The failure of the A² test was due primarily to the fact that a fairly large data set was used for the test. When large data sets are used in EDF tests to measure goodness-of-fit, the null hypothesis is often rejected, because small deviations from the ideal function are exaggerated when sample sizes are large [11, 16]. One way of dealing with this problem is to take random sub-samples of the data and test goodness-of-fit on these samples [16, 8]. On sub-samples of size 100 from our test data set, the A² test returned a goodness-of-fit statistic at the 10% significance level (meaning that with probability 10% the test will erroneously declare the hypothesis as false) for all of the sub-samples. Thus, it appears that the lognormal distribution is a reasonable model for the body of the unique file size distribution.
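That sub-sampling procedure is easy to reproduce. The sketch below (our own rough illustration, not the authors' code) log-transforms the sizes, draws random sub-samples of 100 points, and applies scipy's Anderson-Darling test for normality, whose tabulated critical values account for μ and σ being estimated from the sample; the input file name is hypothetical.

import numpy as np
from scipy import stats


def subsample_anderson(sizes, n_sub=100, n_trials=50, level=10.0, seed=0):
    """Fraction of random sub-samples that pass the A^2 test for log-normality
    at the given significance level (in percent)."""
    rng = np.random.default_rng(seed)
    log_sizes = np.log(np.asarray(sizes, dtype=float))
    passed = 0
    for _ in range(n_trials):
        sub = rng.choice(log_sizes, size=n_sub, replace=False)
        result = stats.anderson(sub, dist='norm')           # A^2 with estimated mu, sigma
        idx = list(result.significance_level).index(level)  # pick the 10% column
        if result.statistic < result.critical_values[idx]:
            passed += 1
    return passed / n_trials


# sizes = np.loadtxt("unique_file_sizes.txt")               # hypothetical trace extract
# print(subsample_anderson(sizes))                          # 1.0 means all sub-samples pass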

Censoring techniques were employed to determine where to split between the lognormal distribution for the body and the heavy tailed distribution in the tail. A sample is said to be right censored if all observations greater than some value are missing. Our sample is right censored since we assumed that it was contaminated with a heavy tail. We used the A² statistic to determine the cutoff point. Figure 5 shows the A² statistic versus the percent of sample data.

[Figure 5: Cutoff Threshold Analysis (A² statistic versus percent of sample used).]

The figure shows that the A² statistic decreases in value as the percent of the sample used decreases from 100% to 93%. It then increased in value until the percent of the sample used decreased to 82%. We believe that the decrease in the A² statistic until the 93% level was due to the effective elimination of the contaminating heavy tail, and that the statistic rose between 93% and 82% because we were eliminating data from the true distribution. Eliminating data below the 82% level begins to make the A² statistic look better simply due to reduced effective sample size. The 93% level for the data gave us a cutoff at approximately 133KB.

Finally, we tested this model against the second half of the sample data and found the λ² value to be very close to the value for the first half of the sample. This means that the proposed model is a reasonably good predictor for the data. Using the 133KB cutoff, we found that 93% of the total number of unique file sizes lie below the 133KB cutoff. This figure, along with the hybrid distributional model, allows us to generate the appropriate number of files for tests in SURGE. The summary for the hybrid, unique file size model is in Table 1. The fit of both the body and the tail of our unique file size model can be seen in the CDF plot for the body and the Log-Log Complementary Distribution (LLCD) plot for the tail in Figures 6 and 7. Both plots are necessary since the CDF tends to obscure discrepancies in the tail and the LLCD tends to do the same for discrepancies in the body.
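The cutoff search itself can be approximated by repeatedly censoring the sample and re-running the A² test on the retained body. The sketch below is only a rough re-creation of that analysis: scipy's anderson re-estimates the parameters from each truncated body and treats it as a complete sample, which is cruder than a proper censored-data EDF test, and the input file name is again hypothetical.

import numpy as np
from scipy import stats


def cutoff_scan(sizes, percents=range(100, 79, -1)):
    """A^2 statistic for the lower p% of the log-size sample, for each p.

    Drop the largest (100 - p)% of observations and test the remainder for
    log-normality; returns (percent, cutoff in bytes, A^2) tuples."""
    log_sizes = np.sort(np.log(np.asarray(sizes, dtype=float)))
    rows = []
    for p in percents:
        body = log_sizes[: int(len(log_sizes) * p / 100)]
        a2 = stats.anderson(body, dist='norm').statistic
        rows.append((p, float(np.exp(body[-1])), a2))
    return rows


# sizes = np.loadtxt("unique_file_sizes.txt")
# for p, cutoff, a2 in cutoff_scan(sizes):
#     print(f"{p:3d}%  cutoff ~ {cutoff / 1024:8.1f} KB   A^2 = {a2:.3f}")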

Component                 Model       Probability Density Function                   Parameters
Body                      Lognormal   p(x) = 1/(xσ√(2π)) · e^(-(ln x - μ)²/(2σ²))    μ = 9.357, σ = 1.318
Tail                      Pareto      p(x) = αk^α x^(-(α+1))                         k = 133K, α = 1.1
Cutoff                    133,225 bytes
Percent of files in body  93%
Percent of files in tail  7%

Table 1: Summary Statistics for SURGE File Size Model

[Figure 6: CDF: Unique File Size Data vs. Model, plotted against log(sizes).]

[Figure 7: LLCD: Unique File Size Data (+) vs. Model (*), log(P[X > x]) versus log(sizes).]
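Drawing synthetic file sizes from this hybrid model amounts to choosing the body with probability 0.93 and the tail otherwise. The sketch below hard-codes the Table 1 parameters and truncates the lognormal body at the cutoff by rejection so the two pieces do not overlap; it is an illustration of the model, not SURGE's generator.

import numpy as np

RNG = np.random.default_rng()


def hybrid_file_sizes(n, p_body=0.93, mu=9.357, sigma=1.318, k=133_225, alpha=1.1):
    """Sample n file sizes: lognormal body below the cutoff k, Pareto tail above it."""
    sizes = np.empty(n)
    in_body = RNG.random(n) < p_body

    # Body: lognormal draws, redrawn until they fall below the cutoff (rejection).
    draws = RNG.lognormal(mu, sigma, in_body.sum())
    while (too_big := draws >= k).any():
        draws[too_big] = RNG.lognormal(mu, sigma, too_big.sum())
    sizes[in_body] = draws

    # Tail: Pareto with shape alpha, scaled so its minimum is the cutoff k.
    sizes[~in_body] = k * (1.0 + RNG.pareto(alpha, n - in_body.sum()))
    return sizes


# Example: a million synthetic unique file sizes and some upper percentiles.
# print(np.percentile(hybrid_file_sizes(1_000_000), [50, 90, 99, 99.9]))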

5 OFF Time Model

A number of models for OFF times have been presented in [7, 9, 12]. The model presented in [9] proposes a structure which consists of two kinds of OFF times. The first, referred to as Active OFF time, is the time needed by the client machine to process transmitted files (interpret, format, and display). The second, referred to as Inactive OFF time, is the time that users take to examine the data that has been transmitted to their browser. We incorporate both of these types of OFF time into SURGE.

5.1 Modeling Results

We use the characterizations given in [9, 12] to define the OFF times for SURGE. The model presented in [9] gives a parameterized model for Inactive OFF time (which we use in SURGE) but not a model for the Active OFF time. Thus, we derived our model for Active OFF times from the BU client trace files. We consider an OFF time to be "Active" if it is less than a threshold time. We considered three different threshold values in our analysis: 1, 5 and 10 seconds. As can be seen in Figure 8, there is a strong clustering of data in the one second threshold histogram which does not continue past approximately one second, as can be seen in the five and ten second threshold histograms.

[Figure 8: Histograms of Active OFF times for 10, 5 and 1 second thresholds.]

From this, we conclude that the principal Active OFF period is less than one second, and we continued our analysis focusing on this data set. The trace data for the one second threshold gave us a half sample of 40,037 elements. We do not consider values from our model for Active OFF times extending beyond the one second threshold since we were not able to distinguish between machine generated and human generated OFF times in our data.
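Extracting the two kinds of OFF time from a trace reduces to thresholding the gaps between consecutive requests. The sketch below assumes a simplified per-user trace of request start times and transfer durations (the actual BU log layout differs); gaps below the threshold are treated as Active OFF and the rest as Inactive OFF.

import numpy as np


def split_off_times(start_times, durations, threshold=1.0):
    """Split one user's inter-request gaps into Active and Inactive OFF times.

    start_times and durations are in seconds and in request order; this is an
    assumed, simplified trace layout.  A gap below threshold is counted as
    Active OFF (browser processing), anything longer as Inactive OFF (think time)."""
    starts = np.asarray(start_times, dtype=float)
    ends = starts + np.asarray(durations, dtype=float)
    gaps = starts[1:] - ends[:-1]             # OFF time between consecutive requests
    gaps = gaps[gaps >= 0]                    # discard overlapping (parallel) requests
    return gaps[gaps < threshold], gaps[gaps >= threshold]


# active, inactive = split_off_times(starts, durs, threshold=1.0)
# print(len(active), "Active OFF samples;", len(inactive), "Inactive OFF samples")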

[Figure 9: CDF: Active OFF Time Model for BU Data versus the Weibull Distribution (the solid line is the empirical distribution function).]

Visual analysis via CDF plots led us to consider both Weibull and Beta distributions as possible models for the Active OFF times. The λ² value for the Beta distribution was 0.85, while the value for the Weibull distribution was smaller; thus, Weibull was selected as the distributional model for the Active OFF time data. The λ² test on the second half of the OFF time data yielded a value of 0.50, which shows that the Weibull model is a good predictor of Active OFF time values. The CDF plot of the data versus the Weibull distribution is shown in Figure 9. The A² test for the Active OFF time data versus the Weibull distribution was run using the following null hypothesis:

H0: Our Active OFF time sample comes from the Weibull distribution W(shape, scale) with both shape and scale derived from the sample using MLE.

We failed to find significance in our test at any level, which we attribute to the relatively large sample size. For random sub-samples of size 100, the A² test returned a goodness-of-fit statistic at the 1% significance level for approximately 50% of the sub-samples. This is additional evidence that the Active OFF time distribution is reasonably modeled by Weibull.

We used the model given in [9] for the Inactive OFF periods. This is a Pareto distribution with α = 1.5 and k = 1. The summary of the complete OFF time model used in SURGE is in Table 2. We placed an upper limit of 30 minutes on the Inactive OFF times generated by our model, because the random generation of values from the Pareto distribution will produce some very large values. We selected this as the limit since less than 0.05% of the measured OFF times from the traces were greater than 30 minutes.
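Generating OFF times from these fitted distributions is straightforward; the sketch below draws Active OFF times from the Weibull model above and Inactive OFF times from a Pareto with k = 1 and α = 1.5 capped at 30 minutes, matching the summary in Table 2 that follows. It is a plain numpy illustration rather than SURGE's own generator.

import numpy as np

RNG = np.random.default_rng()


def active_off_times(n, a=1.46, b=0.382):
    """Active OFF times in seconds: Weibull with scale a and shape b."""
    return a * RNG.weibull(b, n)


def inactive_off_times(n, k=1.0, alpha=1.5, cap=1800.0):
    """Inactive OFF times in seconds: Pareto(k, alpha) capped at 30 minutes."""
    samples = k * (1.0 + RNG.pareto(alpha, n))
    return np.minimum(samples, cap)


# Example: pre-populate the Active_OFF and Inactive_OFF arrays for one client.
# act, inact = active_off_times(100_000), inactive_off_times(100_000)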

Component      Model     Probability Density Function                 Parameters
Active OFF     Weibull   p(x) = (b x^(b-1) / a^b) · e^(-(x/a)^b)      a = 1.46, b = 0.382
Inactive OFF   Pareto    p(x) = αk^α x^(-(α+1))                       k = 1, α = 1.5
Threshold      1 second
Upper limit    1800 seconds

Table 2: Summary Statistics for SURGE OFF Time Model

6 ON Time Model

Based on the model presented in [9], multiple documents can be transferred during any ON period. In order to complete the SURGE framework, a model for the number of documents fetched during any ON period (ON counts) was necessary.

6.1 Modeling Results

Our data for ON counts was extracted from the BU client traces by counting the number of documents fetched by a given user for which the OFF time between documents was less than the one second threshold. This resulted in a half sample data set with 26,142 elements. Initial inspection of this data set showed that its distribution had a long right tail. This led us to test it against standard distributions which also have this characteristic (Lognormal, Log-extreme, Weibull, and Pareto). We also included the Zipf-Estroup discrete distribution, which is the discrete form of the Pareto distribution [13]. Our set of ON counts had a mean value of 2.71 and thus may not have been well approximated by a continuous function, so we also considered discrete distributions as candidates.

Generation of CDF and LLCD plots led us to consider the Pareto distribution the best visual fit for the data. We found that the MLE value for the Pareto parameter gave a model whose tail was not a good visual fit for the ON count data. A least squares estimate of the tail slope in the LLCD plot resulted in an estimate of α which gave a better visual fit for both the body, which can be seen in the CDF plot in Figure 10, and the tail, which can be seen in the LLCD plot in Figure 11.

[Figure 10: CDF: ON Count Model vs. Pareto Distribution, plotted against log(ON counts).]

[Figure 11: LLCD: ON Count Data (+) vs. Pareto Model (*), log(P[X > x]) versus log(data).]

Note that the least squares method gave a k value for the Pareto distribution of 1.5. We know that minimum ON counts are in fact 1.0 (that is, a single document), which is the value used in SURGE. The effect of this value for k is that SURGE will generate ON counts which are smaller on average than the empirical distribution.
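The least-squares tail estimate described above can be reproduced approximately by fitting a line to the empirical LLCD plot and taking the negative of its slope as α. The sketch below is a generic version of that idea (it fits over the whole range rather than a hand-selected tail region), not the authors' exact procedure; the input file name is hypothetical.

import numpy as np


def llcd_alpha(data):
    """Estimate the Pareto tail index alpha from the empirical LLCD plot.

    Sort the data, form the empirical complementary CDF P[X > x], and fit a
    straight line to (log10 x, log10 P[X > x]); -slope estimates alpha."""
    x = np.sort(np.asarray(data, dtype=float))
    ccdf = 1.0 - np.arange(1, len(x) + 1) / len(x)
    keep = ccdf > 0                                   # drop the last point (log of zero)
    slope, _intercept = np.polyfit(np.log10(x[keep]), np.log10(ccdf[keep]), 1)
    return -slope


# on_counts = np.loadtxt("on_counts.txt")
# print("alpha estimate:", llcd_alpha(on_counts))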

The Pareto distribution yielded the best λ² value of the candidate distributions (1.12). The predictive value of the model was tested via the λ² test on the second half of the ON count data. The resulting value of 1.96 indicated that the model was reasonable. The summary of the ON Count model used in SURGE is in Table 3. We added an upper limit parameter to this model since the random generation of values from the Pareto distribution will produce some very large values. The largest document count in our data was 138, and so this is the upper limit on ON counts generated by SURGE.
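ON counts can then be generated by drawing from a Pareto with k = 1 and the fitted α, rounding down to an integer, and capping at 138. The minimal sketch below uses the Table 3 parameters and is, again, only an illustration.

import numpy as np

RNG = np.random.default_rng()


def on_counts(n, k=1.0, alpha=2.354, cap=138):
    """Documents per ON period: Pareto(k, alpha) rounded down, capped at 138."""
    raw = k * (1.0 + RNG.pareto(alpha, n))
    counts = np.floor(raw).astype(int)        # discrete counts with a minimum of 1
    return np.minimum(counts, cap)


# Example: the empirical mean reported for ON counts is 2.71; the generated mean
# will be somewhat lower, as noted above.
# print(on_counts(1_000_000).mean())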

Component     Model    Probability Density Function      Parameters
ON Count      Pareto   p(x) = αk^α x^(-(α+1))            k = 1, α = 2.354
Upper limit   138

Table 3: Summary Statistics for SURGE ON Count Model

Our null hypothesis for the A² test for the ON Count data versus the Pareto distribution was:

H0: Our ON Count sample comes from the Pareto distribution P(k, α) with both k and α derived from the sample using LLCD plot estimation.

We failed to find any significance in our test, and again we attribute this to the relatively large size of the sample. We employed the random sub-sample method to see how well our data fit the Pareto model. For sub-samples of size 100, the A² test returned a goodness-of-fit statistic at the 25% level for a few of the samples. We attribute this to the fact that there were only a few values in the tail of our empirical data and the 100 item sub-samples rarely had values in their tails.

7 SURGE Model Summary

The model distributions and parameters used in SURGE are given in Table 4.

Component            Model       Probability Density Function                   Parameters
File Size - Body     Lognormal   p(x) = 1/(xσ√(2π)) · e^(-(ln x - μ)²/(2σ²))    μ = 9.357, σ = 1.318
File Size - Tail     Pareto      p(x) = αk^α x^(-(α+1))                         k = 133K, α = 1.1
Document Popularity  Zipf
Temporal Locality    Lognormal   p(x) = 1/(xσ√(2π)) · e^(-(ln x - μ)²/(2σ²))    μ = 1.5, σ = 0.80
Active OFF           Weibull     p(x) = (b x^(b-1) / a^b) · e^(-(x/a)^b)        a = 1.46, b = 0.382
Inactive OFF         Pareto      p(x) = αk^α x^(-(α+1))                         k = 1, α = 1.5
ON Count             Pareto      p(x) = αk^α x^(-(α+1))                         k = 1, α = 2.43

Table 4: Summary Statistics for Models used in SURGE

We do not argue that any of these characteristics of WWW use are invariant. We do argue that consideration of each of these characteristics is important in WWW server workload generation. SURGE has been designed to offer the distributions and parameters listed above as a baseline; however, the user can change parameter values as well as distributions for any characteristic.
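Because SURGE is fully parameterizable, the Table 4 entries are best thought of as a baseline configuration that a user can override. The dictionary below is purely illustrative (the names and structure are ours, not SURGE's) and simply records that baseline in a form that downstream generator code could consume.

# Hypothetical baseline configuration mirroring Table 4; every distribution and
# parameter is intended to be overridable by the user.
SURGE_BASELINE = {
    "file_size_body":    {"dist": "lognormal", "mu": 9.357, "sigma": 1.318},
    "file_size_tail":    {"dist": "pareto",    "k": 133_000, "alpha": 1.1},
    "popularity":        {"dist": "zipf"},
    "temporal_locality": {"dist": "lognormal", "mu": 1.5, "sigma": 0.80},
    "active_off":        {"dist": "weibull",   "a": 1.46, "b": 0.382},
    "inactive_off":      {"dist": "pareto",    "k": 1.0, "alpha": 1.5, "cap_seconds": 1800},
    "on_count":          {"dist": "pareto",    "k": 1.0, "alpha": 2.43, "cap": 138},
}

# A user modelling a server with a heavier file-size tail might override one entry:
# config = {**SURGE_BASELINE,
#           "file_size_tail": {"dist": "pareto", "k": 133_000, "alpha": 0.9}}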

In the expanded version of the paper describing SURGE, we will present the importance of each characteristic in detail. This will allow users to scale the various parameters of SURGE to more accurately model the expected workload on their server.

References

[1] Http client.
[2] SPECweb96.
[3] Web.
[4] Webcompare.
[5] WebStone. webstone.html.
[6] V. Almeida, A. Bestavros, M. Crovella, and A. Oliveira. Characterizing reference locality in the WWW. Technical Report TR-96-11, Boston University Department of Computer Science, Boston, MA 02215.
[7] M. F. Arlitt and C. L. Williamson. Web server workload characterization: The search for invariants. In Proceedings of the ACM SIGMETRICS '96 Conference, Philadelphia, PA, April 1996.
[8] Henry Braun. A simple method for testing goodness of fit in the presence of nuisance parameters. Journal of the Royal Statistical Society.
[9] M. E. Crovella and A. Bestavros. Self-similarity in World Wide Web traffic: Evidence and possible causes. In Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, May 1996.
[10] C. A. Cunha, A. Bestavros, and M. E. Crovella. Characteristics of WWW client-based traces. Technical report, Boston University Department of Computer Science.
[11] R. B. D'Agostino and M. A. Stephens, editors. Goodness-of-Fit Techniques. Marcel Dekker, Inc.

[12] S. Deng. Empirical model of WWW document arrivals at access link. In Proceedings of the 1996 IEEE International Conference on Communications, June 1996.
[13] N. Johnson, S. Kotz, and N. Balakrishnan. Discrete Univariate Distributions. John Wiley and Sons, Inc.
[14] Sunil U. Khaunte and John O. Limb. Statistical characterization of a World Wide Web browsing session. Technical report, College of Computing, Georgia Institute of Technology.
[15] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking, 2:1-15.
[16] Vern Paxson. Empirically-derived analytic models of wide-area TCP connections. IEEE/ACM Transactions on Networking.
[17] S. Pederson and M. Johnson. Estimating model discrepancy. Technometrics.
