Wavelet and SiZer analyses of Internet Traffic Data

Size: px

Start display at page:

Download "Wavelet and SiZer analyses of Internet Traffic Data"

April Wilkerson
5 years ago
Views:

1 Wavelet and SiZer analyses of Internet Traffic Data Cheolwoo Park Statistical and Applied Mathematical Sciences Institute Fred Godtliebsen Department of Mathematics and Statistics, University of Tromsø Felix Hernandez-Campos Department of Computer Science, University of North Carolina at Chapel Hill J. S. Marron Department of Statistics and Operations Research, University of North Carolina at Chapel Hill Vitaliana Rondonotti European Central Bank, Germany F. Donelson Smith Department of Computer Science, University of North Carolina at Chapel Hill Stilian Stoev Department of Mathematics and Statistics, Boston University, and Murad Taqqu Department of Mathematics and Statistics, Boston University

2 June 5, Abstract It is important to characterize burstiness of Internet traffic and find the causes for building models that can mimic real traffic. To achieve this goal, exploratory analysis tools and statistical tests are needed, along with new models for aggregated traffic. This talk introduces statistical tools based on wavelets and SiZer (SIgnificance of ZERo crossings of the derivative). The intricate fluctuations of Internet traffic are explored in various respects and lessons from real data analyses are summarized. Introduction In the past decade the study of Internet traffic has attracted the interest of researchers in computer science, engineering, applied probability and statistics. A number of studies in the mid 99 s have shown that the classical models for telephone traffic do not apply to the traffic in modern computer tele-communication networks (see, Leland, Taqqu, Willinger and Wilson (99) and Paxson and Floyd (995)). Statistical analyses indicate that the random fluctuations in the arrival rates, measured in packets or bytes per unit time, are strongly correlated over large time lags, i.e. are long-range dependent. The long-range dependence property of the traffic fluctuations has important implications on the performance, design and dimensioning of the network. Wavelets capture both time and frequency features in the data and often provide a richer picture than the classical Fourier analysis. Since the seminal work of Abry and Veitch (998), they have become a very popular tool for studying the long-range dependence properties of network traffic. The self-similar scaling which is often encountered in the presence of long-range dependence is naturally captured by the wavelet spectrum of the data. This scaling is used to define the wavelet estimator of the Hurst parameter in terms of the slope of the wavelet spectrum. SiZer (SIgnificant ZERo crossing of the derivatives), originally proposed by Chaudhuri and Marron (999), analysis is a visualization method that enables statistical inference to discover meaningful structure within the data, while doing exploratory analysis using statistical smoothing methods. It brings clear and immediate insight into a central scientific issue in exploratory data analysis: Which features observed in a smooth of data are really there? This main question is critical in real data analysis, because discovery of a new feature can often lead to important new scientific insight, while on the other hand scientific effort can be wasted trying to find explanations for spurious features.

3 Data description The trace collection was obtained by placing a network monitor on the high-speed link connecting the University of North Carolina at Chapel Hill (UNC) campus network to the Internet via our Internet service provider (ISP). All units of the university including administration, academic departments, research institutions, and a medical complex (with a teaching hospital that is the center of a regional health-care network) used a single ISP link for Internet connectivity. The user population is large (over 5,) and diverse in their interests and how they use the Internet - including, for example, , instant messaging, student surfing (and music downloading), access to research publications and data, business-to-consumer shopping, and business-to-business services (e.g., purchasing) used by the university administration. There are tens of thousands of on-campus computers (the vast majority of which are Intel architecture PCs running some variant of Microsoft Windows). We used a network monitor to capture traces of the TCP/IP headers on all packets entering the campus network from the ISP. All the traffic entering the campus from the Internet traversed a single full-duplex Gbps Ethernet link from the ISP edge router to the campus aggregation switch. In this configuration, both public Internet and Internet traffic were co-mingled on the one Ethernet link and the only traffic on this link was traffic from the ISP edge router. We placed a monitor on this Gbps Ethernet by passively inserting a fiber splitter to divert some light to the receive port of a Gigabit Ethernet network interface card (NIC) set in promiscuous mode. The tcpdump program was run on the monitoring NIC to collect a trace of TCP/IP packet headers. The trace entry for each packet included a timestamp (with microsecond resolution), the length of the Ethernet frame, and the complete TCP/IP header (which includes the IP length field). The traces were collected during four two-hour tracing intervals each day for seven consecutive days of the week. The collection period was during the second week in April, a particularly high traffic period on our campus coming just a few weeks before the semester ends. More details can be found in Park, Hernandez-Campos, Marron, Rolls, and Smith (). Dependent SiZer A goodness of fit test is a statistical technique to validate a null hypothesis without specifying an alternative hypothesis. It is useful to assess models being considered as possible candidates to explain the data well. In this section, a new goodness of fit test method combined with visual insight is proposed by extending SiZer. We illustrate the usefulness of our method in the challenging context of time series that arise in the statistical analysis of Internet traffic. In this section, we study time series of (the numbers of packets arriving in consecutive millisecond intervals). In other words, our raw data is a time series Y = {Y i, i =,,..., n} of binned packet arrival times, with a bin width of millisecond.

4 Figure : Time series plot of of the Sat time series. The plot shows several spikes, but it is not clear which are meaningful features. Figure displays a full packet count time series of Internet traffic measured at the main internet link of UNC on April, Saturday, from p.m. to about p.m., (Sat). The time series plot in Figure shows a thick spike and two long thin spikes in the middle, and a small spike at the right end. From this plot, however, it is not easy to see which features are meaningful. Figure (a) shows the SiZer plot of the Sat time series. The thin blue curves in the top panel display the family of smooths. These are kernel regression estimates, see e.g. Wand and Jones (995), based on the data, (i, Y i ) for i =,..., 75555, some of which are displayed as green dots (only integer valued, because there are counts). There are several blue curves corresponding to different levels of smoothing, that is bandwidths. Some of these curves are oversmoothed, and some are undersmoothed. The goal of SiZer is to understand which of the features in the smooths can be distinguished from the background white noise. In other words, to determine which features of the blue curves are important and which are just spurious sampling artifacts is a challenging problem. The lower panel of Figure (a) displays the SiZer map of the data. The horizontal locations in the SiZer map are the same as in the top panel, and the vertical locations correspond to the logarithm of bandwidths of the family of smooths shown as blue curves in the top panel. Each pixel shows a color that gives the result of a hypothesis test for the slope of the blue curve, at the point indexed by the horizontal location, and at the bandwidth corresponding to that row. When the slope of the blue curve is significantly positive (negative), the pixel is colored blue (red, respectively). When the slope is not significant, which means the sampling noise is dominant, the pixel has the intermediate color of purple. There is less purple in this map than is typical, because the unusually large sample size of n = 75555, results in a very powerful hypothesis test. There is a fourth SiZer color, that does not appear in Figure (a), which is gray, used to show pixel locations where the data are too sparse for reasonable statistical inference. See Chaudhuri and Marron (999) for a more detailed introduction to SiZer. In Figure (a), the family of smooths indicates that the large spike in the middle dominates the

5 6 (a) SiZer (b) Dependent SiZer Figure : (a) Original SiZer plot of the Sat time series. The top panel suggests that there is a huge spike in the middle. The lower panel shows many features distinguishable from the white noise including the huge spike in the middle. (b) Dependent SiZer plot of the Sat time series. This series is tested against FGN. The spike in the middle is still significantly different from FGN with given parameters while other significant features in the original SiZer plot in (a) disappear, i.e. are consistent with FGN. other features. From the SiZer map, however, there are a lot of blue and red colors at several time and scale locations other than where the huge spike occurs. It clearly shows that the huge spike is significantly different from white noise, but so are all of the other wiggles (large and small) visible in the family of smooths. All these cases of burstiness are flagged as significant because of the very large sample size, and because SiZer is trying to distinguish the data from white noise, but the data have an intrinsic burstiness caused by natural dependence in the time series. In addition to long range dependence, Internet traffic data have been known to exhibit the selfsimilar property, see e.g. Leland, Taqqu, Willinger, and Wilson (99). Fractional Gaussian Noise (FGN) is a fundamental time series model which exhibits both long range dependence and selfsimilarity, see Mandelbrot (988) for example. To understand when such models are appropriate, it is useful to test whether a given time series is consistent with FGN or not. Motivation for the development of dependent SiZer comes from comparing the analysis in Figure (a) with a SiZer analysis of simulated FGN data (with parameters defined later), as shown in Figure (a). While the two families of smooths exhibit very different behavior (a single very big burst for the Sat data, compared with more numerous and homogeneous bursts in the simulated data), the two SiZer maps look at least qualitatively similar, with very many red and blue significant regions. Thus conventional SiZer is not very effective at differentiating between these data sets (with apparently different burstiness properties). Much better differentiation will come from adjusting the SiZer inference, so that burstiness at the level inherent to FGN, as shown (for the same simulated data 5

6 (a) SiZer (b) Dependent SiZer Figure : (a) Original SiZer of a simulated FGN. The SiZer map shows that many features at different scales are flagged as significant. (b) Dependent SiZer of the same simulated FGN. In the SiZer map, almost all pixels are purple, which is consistent with this model. set) in Figure (b), is the null hypothesis, and thus shows up as all purple. When such a method is applied to the Sat time series, as in Figure (b), it becomes clear that most of the burstiness apparent in Figure (a) is of a magnitude expected under an FGN model, but the large central spike is a much different type of burst. Figure (b) and Figure (b), are illustrations of dependent SiZer, which enables us to more specifically compare the data with a given model. This can be done by adjusting the statistical inference, using the autocovariance function of an assumed model. Therefore, dependent SiZer provides not only a goodness of fit test of an assumed model but also visual insight as to how the data differ from an assumed model. Figure (b) shows the dependent SiZer plot (using adjusted statistical inference) of the Sat time series. The coordinate axes the axes and the colors are the same as those of the original SiZer. The only difference is that dependent SiZer compares the data against a null hypothesis of FGN with particular parameters, H =.9 and variance.6. The estimation of these parameters is described in Park, Marron, and Rondonotti (). The SiZer map in the lower panel confirms that the spike in the middle can not be explained by FGN, as expected, because there is a significant increasing trend (flagged as blue) and a decreasing trend (flagged as red). However, in the other regions, there is no significant evidence that they are different from FGN because the SiZer map shows all purple. The input time series to the SiZer analysis in Figure (b) is no longer the full packet count series of n = 75555, but instead has been summarized by binning, to a series of length, for fast calculation of the dependent SiZer inference. The green dots in Figure (b) look much different from those in part (a), because they show this summarized time series. 6

7 (a) (b) Figure : (a) A zoomed dependent SiZer (against FGN) plot of the Sat data over a 6 minute range. The valley in the middle is significantly different from FGN. (b) A further zoomed dependent SiZer plot of the Sat for seconds. Surprisingly, there is a second periodic behavior and which can not be explained by FGN. To further explore this statistically significant spike, we zoom in for a closer look, by restricting attention to the data between the two vertical red lines in the top panel of Figure (b). Figure (a) shows the dependent SiZer within the spike (i.e. between the red lines), which is about 6 minutes long. The family of smooths shows a valley in the middle and the SiZer map indicates that this valley is not typical behavior of FGN with the same parameters above, by the red and blue regions. A further zoomed dependent SiZer, which corresponds to the two vertical red lines in the top panel of Figure (a), is displayed in Figure (b). This extracted subtrace is around seconds long and its dependent SiZer plot surprisingly shows periodic behavior, with a period of about seconds. This behavior is very inconsistent with a FGN model and the SiZer map in the lower panel confirms this by flagging the bumps in the top panel as significant. A likely explanation of this is an intense Internet Protocol (IP) port scan, where a hacker has attempted to find weak points in the system by sending query packets to all of the ports of all possible IP addresses in the UNC domain. This example shows that dependent SiZer is a very useful tool, not only as a goodness of fit test that studies whether the given data are consistent with the model being tested, but also to highlight unusual behavior of the data. This is done through the SiZer combination of visual insight and statistical inference. More details can be found in Park, Marron, and Rondonotti (). 7

8 (a) time (sec) (c) (b) (d) Sj Scale j Figure 5: (a) Time series plot of measured every milliseconds at the link of University of North Carolina, Chapel Hill (UNC) on April, Thursday, from p.m. to p.m., (Thu). Several spikes are shooting up and down. (b) The wavelet spectrum of the Thu time series. There is a bump at the scale j =. (c) and (d) display the dependent SiZer of the Thu time series. Wavelet SiZer Figure 5 (a) displays a time series of binned measured at the link of UNC on April, Thursday, from p.m. to p.m., (Thu). This time series, which is the number of packets arriving at the link every milliseconds, can be written as Y = {Y (i), i =,,..., 76}. The plot shows several spikes shooting up and down. Figure 5 (b) displays what we call the wavelet spectrum of the Thu time series. Here, the wavelet spectrum shows signal power obtained by wavelet coefficients at each scale. See Abry and Veitch (998), or Stoev, Taqqu, Park, and Marron () for more details on the wavelet spectrum. We used Daubechies wavelets, see for example, Daubechies (99), with three zero moments for all wavelet spectra in this paper. The wavelet spectrum of this time series shows a bump at the scale j =. Since the y axis of the spectrum corresponds to the energy of the signal at a given scale, this bump suggests a unusual behavior at j =. This has a serious impact on the estimation of the Hurst parameter (see Stoev, Taqqu, Park, and Marron () for more details). A serious weakness of the wavelet spectrum statistics is that they fail to provide information about the time location of the unusual behavior, because they involve an average through time. Figures 5 (c) and (d) display the dependent SiZer of the Thu time series. The data are tested against a null hypothesis of FGN where the Hurst parameter, H, and the variance are.9 and 85.6 respectively (see Park, Marron, and Rondonotti () for the choice of the parameters). The goal is to see whether FGN can model the data at large time scales, and if not, to see how far the data are from FGN. Because FGN is a stationary series, significant features discovered in the dependent SiZer 8

9 analysis can be considered as nonstationarities. Observe that a sharp valley appears around x = 5 (seconds). Figure 5 (d) displays the SiZer map of the data. The horizontal locations in the SiZer map are the same as in the top panel, and the vertical locations correspond to the same logarithmically spaced bandwidths that were used for the family of smooths (blue curves) in the top panel. Blue (red) colors in the SiZer map indicate that the blue curves in Figure 5 (c) are significantly increasing (decreasing, respectively), at the point indexed by the horizontal location and the bandwidth corresponding to that row, compared to FGN. Purple colors mean that the trend shows no increase or decrease that is significantly different from those typical of FGN at that particular location. The SiZer map in Figure 5 (d) suggests that the valley around x = 5 is different from FGN with the given parameters at fine scales. This indication is not strong in the sense that there are only a few bandwidths (scales) and locations in the SiZer map where the valley around x = 5 is flagged as significant. Since dependent SiZer provides both location and scale information simultaneously, it is very useful to find nonstationary behavior when a stationary model for the noise is assumed. However, it sometimes fails to locate nonstationarities that have short durations. Furthermore, the model must be specified before analyzing the data. Hence, estimation of parameters in the autocovariance function of the specified FGN model for the noise is needed. Nonstationary behavior in the data can be further explored by zooming in on the region of the nonstationarity. Because we have evidence of unusual behavior in the Thu time series from Figure 5, one can split the full time series into several subtraces and apply the wavelet method and dependent SiZer to the subtraces. Since the wavelet spectrum in Figure 5 (b) provides only scale information aggregated over the full time series, we use these subtraces of the data to find more local information about the nonstationarities. The full time series is divided into subtraces, each of which is 6 seconds long. Figure 6 (b) shows the wavelet spectra of subtraces of the Thu time series. Each spectrum is obtained from one of the subtraces. The spectrum with vertical gray lines corresponds to the full time series. The spectrum with the thick black line corresponds to the subtrace marked by red vertical lines in Figure 6 (a), and this subtrace includes the location where nonstationary behavior was flagged in the dependent SiZer plot in Figures 5 (c) and (d). The wavelet spectrum corresponding to the red window in Figure 6 (a) shows that the bump at the scale j = of the full time series comes from this particular subtrace. Figures 6 (c) and (d) display the dependent SiZer plot for this subtrace. The data is tested on FGN with the same parameters as in Figures 5 (c) and (d). This strongly confirms that the valley is significantly different from FGN with the given parameters. Figures 5 and 6 show that by combining a zoomed view with the wavelet spectrum and dependent SiZer, we can find locations where the time series has strong nonstationary behavior. However, there are serious practical limitations to this approach, caused by the lack of location information in the wavelet spectrum, by the need to choose the length of subtraces, by the need to specify an assumed model for the data, and by the need for parameter estimation in dependent SiZer. 9

10 6 (a) (c) (b).5 (d) 8 yj Octave j Figure 6: (a) Time series plot of the Thu time series. The red vertical lines are the boundaries of the 6 second window between x = and x = (seconds). (b) Overlays of the wavelet spectra for each 6 seconds window. The spectrum with vertical lines (gray) corresponds to the full time series, which is the same as (b) in Figure 5. (c) and (d) display the dependent SiZer plot of the 6 second window between the red lines in (a). The subtrace is tested against FGN with H =.9 and the variance Motivated by the idea of combining the wavelet spectrum and dependent SiZer, and by overcoming the difficulties that they cause, we propose a new visualization tool, entitled wavelet SiZer. This tool combines wavelet coefficients with the original SiZer. Wavelet SiZer uses functions of wavelet coefficients at different scales as inputs. Next, SiZer is used to determine where and how they are significantly different from what one would have expected, had the data been stationary. By doing so, wavelet SiZer considers not only the wavelet coefficients at all scales, but also adds location information at each scale obtained from SiZer. One advantage, compared to applying dependent SiZer on the raw time series, is that conventional SiZer can be used on the wavelet coefficients because they are essentially uncorrelated. This removes the burdens of model specification and parameter estimation. Figure 7 shows the wavelet SiZer of the Thu time series. The top panel displays the family of smooths of the full time series. When SiZer is applied directly to the wavelet coefficients, very few features are typically found. It turns out, however, that by applying SiZer to functions of the wavelet coefficients, interesting features in the underlying time series can be found. In particular, the squared wavelet coefficients are useful for finding features of interest. This is illustrated in the remaining rows of Figure 7, where a SiZer analysis is applied to the squared wavelet coefficients. We used Daubechies wavelets with three zero moments for all wavelet SiZer plots in this paper. The second row displays the family of smooths and the SiZer map of the square of the wavelet coefficients at the scales j = (the first and the second column) and j = (the third and the fourth column). In the same way, the

11 x x.5 x 5 x x x.5 5 x x x 68 x 555 x x x 5 6 Figure 7: Wavelet SiZer plots of the Thu time series. The top panel shows the family of smooths of the time series. The second row displays the family of smooths and the SiZer map of the second power wavelet coefficients at the scales j = and j =. In the same way, the other rows display SiZer plots of the second power wavelet coefficients from j = to j =. The SiZer plots at different scales are mostly dominated by four spikes, which are numbered at the j = 7 family of smooths. other rows display SiZer plots of the square of the wavelet coefficients from j = to j =. The larger scales are not presented here because they do not provide further useful information. The SiZer plots in Figure 7 at different scales are mostly dominated by four major spikes, labelled as,,, and in the family of smooths at scale j = 7. Spike appears from j = to j = 5, which are fine scales. Spike comes up at j = 5 and lasts until j =, a relatively coarse scale. Note that this spike matches the location that dependent SiZer had found in Figures 5 and 6. Spikes and appear from j = to j = 9, which cover both fine and coarse scales. Note that spikes,, and could not be observed in Figures 5 and 6, but the wavelet SiZer is able to reveal these hidden local nonstationarities. Having found these regions of strong local nonstationarity, it is natural to investigate more deeply. To do so, we zoom into the original time series. The four regions highlighted in red in Figure 8 (a) correspond to the locations of the four spikes flagged as significant by wavelet SiZer in Figure 7. Figures 8 (b), (c), (d), and (e) show the zoomed time series corresponding to the four red windows in Figure 8 (a), respectively. Figure 8 (b) shows that there is a dropout (absence of signal) for a duration about. second perhaps caused by a short router pause, which is a clear type of nonstationarity. Since this dropout is quickly over with, the spike appears only at fine scales in wavelet SiZer. Figure 8 (f) displays the wavelet spectrum for the full time series with this subtrace (shown in Figure 8 (b)) removed and the remaining two parts concatenated. This does not look different from the original spectrum in Figure 5 (b), suggesting that this feature did not cause the j = bump in the spectrum.

12 5 6 7 (a) (b) (c) 5 time (sec) (d) (e) (f) (g) (i) Sj Scale j Figure 8: (a) Time series plot of the Thu time series. The four intervals hightlighted by the vertical red lines correspond to the significant spikes of the wavelet SiZer in Figure 7. (b), (c), (d), and (e) show the subtraces corresponding to the four red windows in (a), respectively. (f), (g),, and (i) are corresponding wavelet spectra when each red window is excluded from the full time series and the remaining two parts are concatenated. This shows that the j = bump in the spectrum was caused by the dropout shown in (c). Figure 8 (c) shows a long dropout for about 8 seconds, again a strong local nonstationarity. Since this dropout is much longer, the spike appears at much coarser scales in wavelet SiZer in Figure 7. The spike appears at a number of scales because the signal power of a jump discontinuity is spread across a range of wavelet scales. Figure 8 (g) displays the wavelet spectrum of the full time series with the subtrace (shown in Figure 8 (c)) removed as for Figure 8 (f), and the bump at j = in Figure 5 (b) disappears. This suggests that this feature plays an important role in causing the j = bump. Figures 8 (d) and (e) show two unusual drops in both subtraces. Because the sizes of drops are about half of the size of the signal, this could be caused by one of two routers pause. Since some of them are short and some are long, and since they are located relatively close to each other, the wavelet coefficients feel the effect over many scales from fine to coarse in Figure 7. Although Figures 8 and (i) display the wavelet spectra of the full time series with the subtraces (shown in Figure 8 (d) and (e) respectively) removed as above, they are not different from the original spectrum. It is interesting to see that all spikes in wavelet SiZer come from valleys in the full time series, rather than upward spikes. We conclude that the long dropout in Figure 8 (c) causes the bump at j = in the original wavelet spectrum in Figure 5 (b) by comparing the original spectrum with the one in 8 (g) because the bump disappears. Also, additional hidden nonstationarities were found by wavelet coefficient based visualization tools.

13 This example suggests that the idea of combining wavelet coefficients and scale-space methods is a powerful technique for detecting hidden local nonstationary behavior of time series. In fact, the new tools enable us to easily find underlying structures in the original time series that would have been hard to find by simply examining wavelet coefficients. Although we have focused on Internet traffic data in this paper, the described methodology is also applicable and useful for other types of time series with potential local nonstationarities. More details can be found in Park, Godtliebsen, Taqqu, Stoev, and Marron (). References Abry, P. and Veitch, D (998). Wavelet analysis of long range dependent traffic. IEEE Transactions on Information Theory, 5. Chaudhuri, P. and Marron, J. S. (999). SiZer for exploration of structures in curves. Journal of the American Statistical Association 9, Daubechies, I. (99). Ten Lectures on Wavelets. SIAM, Mandelbrot, B. B. (988). The Fractal Geometry of Nature. W. H. Freeman and Co., New York, New York. Leland, W., Taqqu, M., Willinger, W., and Wilson, D. (99). On the Self-Similar Nature of Ethernet Traffic. IEEE/ACM Transactions on Networking, 5. Park, C., Godtliebsen, F., Taqqu, M., Stoev, S., and Marron, J. S. (). Visualization and inference based on wavelet coefficients, SiZer and SiNos. Submitted to Journal of Computational and Graphical Statistics. Park, C., Hernandez-Campos, F., Marron, J. S., Rolls, D., and Smith, F. D. (). Long-range dependence in a changing Internet traffic mix. submitted. Park, C., Marron, J. S. and Rondonotti, V. (). Dependent SiZer: goodness of fit tests for time series models. submitted. Paxson, V. and Floyd, S. (995). Wide Area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on Networking, 6. Stoev, S., Taqqu, M., Park, C. and Marron, J. S. (). Strengths and limitations of the wavelet spectrum of the wavelet spectrum method in the analysis of Internet traffic. Submitted.

Visualization and Inference Based on Wavelet Coefficients, SiZer and SiNos

Visualization and Inference Based on Wavelet Coefficients, SiZer and SiNos Cheolwoo Park, Fred Godtliebsen, Murad Taqqu, Stilian Stoev and J.S. Marron Technical Report #- March 6, This material was based