Modeling Probabilistic Measurement Correlations for Problem Determination in Large-Scale Distributed Systems

2009 29th IEEE International Conference on Distributed Computing Systems

Jing Gao, Guofei Jiang, Haifeng Chen, Jiawei Han
University of Illinois at Urbana-Champaign / NEC Labs America
{jinggao3,hanj}@illinois.edu, {gfj,haifeng}@nec-labs.com

Abstract

With the growing complexity of computer systems, it has become a real challenge to detect and diagnose problems in today's large-scale distributed systems. The correlations between measurements collected across the distributed system usually contain rich information about system behaviors, and thus a reasonable model to describe such correlations is crucially important for detecting and locating system problems. In this paper, we propose a transition probability model based on Markov properties to characterize pairwise measurement correlations. The proposed method can discover both the spatial (across system measurements) and temporal (across observation time) correlations, so the model can successfully represent the system's normal profiles. Problem determination and localization under this framework is fast and convenient. The framework is general enough to discover any type of correlation (e.g., linear or non-linear). Also, model updating, system problem detection, and diagnosis can be conducted effectively and efficiently. Experimental results on real monitoring data collected from three companies' infrastructures show that the proposed method can detect anomalous events and locate the problematic sources.

1. Introduction

Recent years have witnessed the rapid growth of complexity in large-scale information systems. For example, the systems underlying Internet services are integrated with thousands of machines, and thus possess unprecedented capacity to process large volumes of transactions. Therefore, large amounts of system measurements (metrics) can be collected from software log files, system audit events, and network traffic statistics. To provide reliable services, system administrators have to monitor and track the operational status of their infrastructures in real time and fix any problems quickly. Due to the scale and complexity of the system, we have to automate the problem determination process so as to reduce the Mean Time to Recovery (MTTR).

It is a challenging task to automatically detect anomalies in a large system because both the normal and anomalous behaviors are heterogeneous and dynamic. In fact, the widely existing correlations among measurements are very useful for autonomic system management. In this paper, we propose a novel method that can effectively characterize the correlations across different system measurements and observation time. The method captures the complicated and changing normal profiles, and thus can be used to quickly detect and locate system problems.

Figure 1. Measurements as Time Series: (a) IfOutOctetsRate_IF and (b) IfInOctetsRate_IF plotted over time.

Each distributed system usually consists of thousands of components, such as operating systems, databases, and application software. On each component, we are interested in its usage parameters, such as CPU and memory utilization, free disk space, I/O throughput, and so on. Suppose we monitor l measurements for a particular system, and each measurement m_a (1 <= a <= l) is uniquely defined by the component (e.g., database) and the metric (e.g., memory usage). Due to the dynamic nature of the workloads received by the system, the measurement values usually change with time. Therefore, each measurement m_a can be viewed as a time series. We call the set of time series collected from the system the monitoring data.

Correlations are commonly found among the measurements because some outside factors, such as workloads and the number of user requests, may affect them simultaneously. For example, the two measurements shown in Figure 1 are correlated. For the purpose of problem determination, it is essential to check the correlations among measurements instead of monitoring each measurement individually. A sudden increase in the values of a single measurement may not indicate a problem, as shown by the peaks in Figure 1(a) and Figure 1(b); instead, it could be caused by a flood of user requests. By monitoring multiple measurements simultaneously, we can identify this scenario as normal when we find that many measurements' values increase but their correlations remain unchanged.

Figure 2. Measurement Correlations: pair-wise measurement correlations shown in two-dimensional space. (a) a graph of pair-wise correlations over measurements such as network throughput, CPU usage, and memory usage at Servers A, B, and C; (b) linear properties (IfInOctetsRate_IF vs. IfOutOctetsRate_IF); (c) in and out traffic rates on two different machines; (d) arbitrary shapes (CurrentUtilization_PORT vs. IfOutOctetsRate_PORT).

Therefore, profiling measurement correlations can help find the real problems and reduce false positives. We are especially interested in tracking the pair-wise correlations, i.e., the correlations between any two measurements, because they can assist quick problem localization. In Figure 2(a), we illustrate pair-wise correlations using a graph where each node represents a measurement and an edge indicates a correlation. At a certain time point, if all the links leading to a measurement m_a have certain problems, the system administrator can directly locate the problem source, i.e., m_a.

The pair-wise correlations can be roughly divided into linear and nonlinear categories. To observe the correlations more clearly, we extract the values of two measurements m^1 and m^2 at each time point t, and plot (m^1_t, m^2_t) as a point in the two-dimensional space. Figures 2(b)-(d) show measurement values extracted from real systems. Clearly, the measurements in Figure 2(b) (the rate of traffic going in and out of the same machine) exhibit a linear correlation, while Figure 2(c) (in and out traffic rates on two different machines) and Figure 2(d) (PORT throughput and utilization) demonstrate nonlinear relationships. In real monitoring data, we find that nearly half of the measurements have linear relationships with at least one of the other measurements, but the other half only have non-linear ones. Therefore, to model the behavior of the whole system, we need analysis tools that can identify both types of correlations.

Some efforts have been devoted to modeling the linear measurement correlations in distributed systems [1, 2]. Specifically, linear regression models are used to characterize correlations such as the one in Figure 2(b). Once the extracted linear relationship is broken, an alarm is flagged. In [3], the authors assume that the two-dimensional data points come from a Gaussian mixture and use ellipses to model the data clusters, so the points falling outside the cluster boundaries are considered anomalous events, as shown in Figure 2(c). Despite these efforts, there are many problems that restrict the use of such correlation profiling tools in real systems. First, existing work only focuses on one type of correlation, and thus cannot characterize the whole system precisely. Secondly, the assumptions on the form of the data points may not hold (e.g., linear relationships or ellipse-shaped clusters). For example, in Figure 2(d), the data points form arbitrary shapes and cannot be modeled by existing methods. Most importantly, how the data evolve is an important part of the system behavior, so besides spatial correlations, correlations across observation time should also be taken into consideration.

In light of these challenges, we propose a grid-based transition probability model to characterize the correlations between any two measurements in a distributed system. As shown in Figure 2(d), we partition the space into a number of non-overlapping grid cells and map the data points into the corresponding cells. A transition probability matrix is then defined over the two-dimensional grid structure, where each entry V_ij corresponds to the probability of a transition from grid cell c_i to c_j. We initialize both the grid structure and the transition probability matrix from a snapshot of historical monitoring data, e.g., collected during the last month, and adapt them online to distribution changes. We then propose a fitness score to evaluate how well one or all of the measurements are described by the correlation models. Once the fitness score drops below a threshold, it indicates that certain system problems may occur. Our contributions are: 1) We propose a novel probability model to characterize both spatial and temporal correlations among measurements from a distributed system. Based on the model, we develop methods to detect and locate system problems. 2) We make no assumptions on the type of correlations or the data distributions; therefore, the proposed framework is general and can capture the normal behaviors of the entire distributed system. Also, the model is easy to interpret and can assist later human debugging. 3) We demonstrate the proposed approach's ability in system problem detection and diagnosis by experimenting on one month of real monitoring data collected from three companies' IT infrastructures.

We discuss related work in Section 2, and present the probability model in Section 3. Sections 4 and 5 introduce how to compute and use the model. In Sections 6 and 7, we discuss experimental results and conclusions.

2. Related Work

Due to the increase in complexity and scale of current systems, it has become important to utilize the measurement correlation information in system logs for autonomic system management.

Methods have been developed to model correlations of request failures [4] or among server response times [5]. Correlating monitoring data across complex systems has been studied recently, where algorithms are developed to extract system performance invariants [1, 2] and describe non-linear correlations [3]. Our proposed method distinguishes itself from the above methods by modeling both spatial and temporal correlations among measurements. In Markov-model-based failure prediction methods [6], the temporal information is taken into consideration, but they require event-driven sources, such as system errors, as input. Conversely, our method does not require any knowledge about the system states. The problem of anomaly detection has been extensively studied in several research fields. In particular, many algorithms have been developed to identify faults or intrusions in the Internet [7] or wireless networks [8] by examining network traffic data. Different from the above methods, our approach models the data evolution instead of static data points, and thus detects outliers from both spatial and temporal perspectives. In the proposed framework, we partition the two-dimensional data space into grid cells. The idea of space partitioning is motivated by grid-based clustering algorithms [9]. The term "grid" refers to the resulting discretized space, and thus carries a completely different meaning from that in grid computing.

3. Transition Probability Model

Figure 3. Grid Structure; Figure 4. Transitions from c_5; Figure 5. Transition Probability Matrix (an example 9-by-9 matrix of transition probabilities among cells c_1, ..., c_9); Figure 6. The Framework of Correlation Modeling (learn the model from history data; for each new observation, raise an alarm if P(x_t -> x_{t+1}) < delta, otherwise update the model).

At time t, the values of two system measurements m^1 and m^2 can be regarded as a two-dimensional feature vector x_t = (m^1_t, m^2_t). The task is then to build a model M based on the incoming data x_1, x_2, ..., x_t, ... to describe the correlations. Suppose x is drawn from S = A^1 x A^2, a 2-dimensional bounded numerical space. We partition the space S into a grid consisting of non-overlapping rectangular cells. We first partition each of the two dimensions into intervals. A cell is the intersection of intervals from the two dimensions, having the form c = (v^1, v^2), where v^a = [l^a, u^a) is one interval of A^a (a in {1, 2}). A data point x = (m^1, m^2) is contained in cell c if l^a <= m^a < u^a for both a = 1 and a = 2. If A^1 and A^2 are partitioned into s_1 and s_2 intervals, there are altogether s = s_1 * s_2 cells. The collection of all the non-overlapping rectangular cells is called the grid structure: G = {c_1, c_2, ..., c_s}.

We define the probability of having a new observation x_{t+1} based on G. To simplify the problem, we assume that the future observation depends only on the current value and not on any past ones (the Markov property), i.e., P(x_{t+1} | x_t, ..., x_1) = P(x_{t+1} | x_t). The experimental results in Section 6 show that this assumption works well in practice. Suppose x_{t+1} is in c_j and x_t is in c_i; we then approximate P(x_{t+1} | x_t) using P(x_{t+1} in c_j | x_t in c_i), which is the probability of x_{t+1} falling into cell c_j when x_t belongs to cell c_i (c_i, c_j in G). To facilitate later discussions, we use P(x_t -> x_{t+1}) to denote P(x_{t+1} | x_t), and use P(c_i -> c_j) to denote P(x_{t+1} in c_j | x_t in c_i). Since c_i and c_j are drawn from the collection of grid cells G = {c_1, c_2, ..., c_s}, we can define an s-by-s matrix V where V_ij = P(c_i -> c_j). Row i (1 <= i <= s) of the matrix V defines a discrete probability distribution P(c_i -> c_j), with sum_{j=1}^{s} P(c_i -> c_j) = 1, for the transitions from c_i to any cell in the grid (c_j in G).

A snapshot of monitoring data from two measurements is plotted in Figure 3. The feature space is partitioned into nine grid cells c_1, c_2, ..., c_9, and in Figure 5 we show an example 9-by-9 probability matrix V. Suppose x_t is contained in cell c_5; the discrete probability distribution of x_{t+1} given x_t is then characterized by V_51, V_52, ..., V_59. As shown in Figure 4, a higher probability on c_j indicates that x_{t+1} is more likely to jump to c_j when its original location is c_5. Therefore, the model characterizing a pair-wise correlation consists of the grid structure and the probability matrix: M = (G, V). In Section 4, we discuss the methods to initialize and update the model. In Section 5, we describe how to use the model to determine system problems.

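To make the model concrete, here is a minimal sketch of one way to represent M = (G, V) in memory and to look up the transition probability used for alarming. The class name CorrelationModel, the flattened cell ids, and the use of Python's bisect and NumPy are illustrative assumptions, not the authors' implementation.

```python
import bisect
import numpy as np

class CorrelationModel:
    """Sketch of the pair-wise model M = (G, V) for two measurements."""

    def __init__(self, edges1, edges2, V):
        # Sorted interval boundaries of A^1 and A^2 (lengths s1+1 and s2+1),
        # and an (s1*s2) x (s1*s2) row-stochastic transition matrix V.
        self.edges1, self.edges2, self.V = edges1, edges2, np.asarray(V)

    def cell_index(self, x):
        """Return the id of the cell containing x = (m1, m2), or None if x
        lies outside the grid boundary (outlier or distribution change)."""
        m1, m2 = x
        i = bisect.bisect_right(self.edges1, m1) - 1
        j = bisect.bisect_right(self.edges2, m2) - 1
        if not (0 <= i < len(self.edges1) - 1 and 0 <= j < len(self.edges2) - 1):
            return None
        return i * (len(self.edges2) - 1) + j     # flatten (i, j) into one cell id

    def transition_prob(self, x_t, x_next):
        """Approximate P(x_t -> x_{t+1}) by the cell-level P(c_i -> c_j)."""
        ci, cj = self.cell_index(x_t), self.cell_index(x_next)
        if ci is None or cj is None:
            return 0.0                            # outliers get zero probability
        return float(self.V[ci, cj])
```

An alarm would then be raised whenever transition_prob(x_t, x_next) falls below the threshold delta introduced in Section 4.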

4. Model Computation

The framework for learning and updating the correlation probability model for problem determination is depicted graphically in Figure 6. We first initialize the model from a set of history data. The model is then put into use on the continuously flowing monitoring data. Based on the observed x_{t+1} and x_t, the model outputs P(x_t -> x_{t+1}), and if it is below a certain threshold delta, an alarm is flagged. We update the model to incorporate the actual transition made by x_{t+1} if it is normal. Since the model is comprised of the grid structure G and the probability matrix V, we present the learning algorithms for both of them as follows.

4.1. Grid Structure

Figure 7. Initial Grid; Figure 8. Updated Grid.

Initialization. Based on a set of history data {x_t}_{t=1}^{n}, we seek to design a grid structure G defined by a set of grid cells {c_1, c_2, ..., c_s}. Each cell is represented by a rectangle in the two-dimensional space. We compute the grid cells by setting their boundaries on the two dimensions separately. Formally, each cell is defined as the intersection of an interval from each of the two dimensions, and the grid structure is thus represented by {(v^1_i, v^2_j)} for i = 1, ..., s_1 and j = 1, ..., s_2, where v^1_i and v^2_j are intervals of A^1 and A^2 respectively. Now the problem is: for the data mapped onto one dimension a, X^a = {x^a_1, x^a_2, ..., x^a_n}, we wish to discretize A^a into s_a intervals that hold all the data points. We will compute transition probabilities based on the grid structure, so it should reflect the data distribution. Also, the computation needs to be efficient, since multiple pairs of measurements may be watched. Therefore, we propose an efficient approach to partition each dimension into intervals adaptive to the data distribution, based on MAFIA [9], a clustering method. We first get the lower and upper bounds l^a and u^a from X^a and divide [l^a, u^a) into small equal-sized units with unit length z^a. Note that z^a is much smaller than the actual interval size of the grid structure. We count the number of points falling into each unit. Adjacent units are then merged to form an interval if their counts are similar with respect to a threshold, or if both are below a density threshold. The basic idea is to represent dense areas using more cells, while regions with similar probability densities can be represented using one cell because they are likely to have similar transition patterns. If the data are equally distributed, we skip the above procedure and simply divide the dimension into equal-sized intervals. We run the above procedure for each dimension and obtain all the cells by intersecting the intervals of the two dimensions. Figure 7 shows an example of the history data and the grid structure the algorithm generates for it.

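Below is a minimal sketch of the adaptive one-dimensional partitioning just described. The function name and the default values of n_units, sim_ratio, and density_thresh are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def build_intervals(values, n_units=100, sim_ratio=0.3, density_thresh=0.01):
    """Cut one dimension into small equal-sized units, then merge adjacent
    units whose counts are similar, or which are both sparse, into intervals.
    Returns the interval boundaries (including both outer bounds)."""
    lo, hi = float(np.min(values)), float(np.max(values))
    unit_edges = np.linspace(lo, hi, n_units + 1)
    counts, _ = np.histogram(values, bins=unit_edges)
    min_count = density_thresh * len(values)      # density threshold per unit

    edges = [float(unit_edges[0])]
    prev = int(counts[0])
    for k in range(1, n_units):
        cur = int(counts[k])
        similar = abs(cur - prev) <= sim_ratio * max(prev, 1)
        both_sparse = cur < min_count and prev < min_count
        if not (similar or both_sparse):
            edges.append(float(unit_edges[k]))    # start a new interval here
        prev = cur
    edges.append(float(unit_edges[-1]))
    return edges
```

Running build_intervals once per dimension and intersecting the resulting intervals yields the grid structure G.
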
Update. During the online process, most of the time a new observation x_{t+1} falls into one of the cells defined by the grid structure G. However, it is possible that x_{t+1} is outside the boundary defined by G. Then either x_{t+1} is an outlier, or the underlying distribution has changed. We wish to ignore the outliers but adapt the grid structure according to the distribution evolution. However, it is challenging to distinguish between the two cases in a real-time manner. We observe that real data usually evolve gradually, so we assume that the boundary of the grid structure also changes gradually. Therefore, when x_{t+1} is not contained in any cell of G, we only update G if x_{t+1} is close enough to the grid boundary. For each dimension A^a, we compute the average interval size r^a_avg offline during initialization, and suppose the upper bound of G on dimension A^a is u^a. When x^a_{t+1} > u^a for a = 1 or 2, we first check whether x^a_{t+1} <= u^a + lambda^a * r^a_avg, where lambda^a is a parameter indicating the maximum number of intervals that can be added. If this holds, we take it as a signal of potential distribution evolution and add intervals to the dimension until x_{t+1} is contained within the boundary. New cells are incorporated into G as the intersections of the added intervals and the intervals from the other dimension. Note that we do not delete cells having sparse densities, in order to maintain the rectangular shape of the grid structure for fast computation. Figure 8 shows the online data and the accordingly updated grid structure, whereas the offline structure is illustrated in Figure 7. It can be seen that the data evolve along the vertical axis, and thus two more intervals are added to accommodate such changes.

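The online boundary check can be sketched as follows for a single dimension, assuming (as in the text) that only the upper bound needs extending; the lower bound would be handled symmetrically. The function name and return convention are mine.

```python
def maybe_extend_dimension(edges, x_new, r_avg, lam):
    """If x_new exceeds the current upper bound but lies within lam * r_avg
    of it, append intervals of width r_avg until x_new is covered; otherwise
    leave the grid unchanged (treat x_new as an outlier).
    Returns (possibly extended edges, whether x_new is inside the grid)."""
    upper = edges[-1]
    if x_new < upper:
        return edges, True                  # already inside the grid
    if x_new > upper + lam * r_avg:
        return edges, False                 # too far away: likely an outlier
    extended = list(edges)
    while extended[-1] <= x_new:            # potential distribution evolution
        extended.append(extended[-1] + r_avg)
    return extended, True
```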

4.2. Transition Probability Matrix

We seek to compute P(c_i -> c_j) for any c_i and c_j in G, i.e., the transition probability between any pair of cells. One natural solution is to compute the empirical distribution based on the set of monitoring data D: specifically, let P(c_i -> c_j) be the percentage of examples jumping to c_j when they originally stay in c_i. Although the empirical probability can capture most transitions, it may not be accurate on transitions that are under-represented or even unseen in past records. We therefore need to adjust the empirical distribution to make it smooth over the space, so that an unseen transition may still have a chance to occur in the future. Therefore, we introduce a prior into the distribution using the following Bayesian analysis technique [10]:

P(c_i -> c_j | D) = P(D | c_i -> c_j) P(c_i -> c_j) / P(D)

where c_i -> c_j indicates the existence of a transition from cell c_i to c_j, and D is the monitoring data set. The transitions are assumed to be independent of each other. Also, our aim is to infer c_i -> c_j, so the term P(D) is not relevant and can be omitted:

P(c_i -> c_j | D) \propto P(c_i -> c_j) * \prod_{t=1}^{n} P(x_t -> x_{t+1} | c_i -> c_j)    (1)

where n is the size of D. The two steps under this Bayesian framework are: 1) define a prior distribution P(c_i -> c_j) for any c_i and c_j, and 2) update the distribution based on each observed transition from x_t to x_{t+1}. After all data points in D are seen, we obtain the posterior probability P(c_i -> c_j | D). We explain the two steps as follows.

Prior Distribution. Bayesian methods view the transition from c_i to c_j as a random variable having a prior distribution. Observation of the monitoring data converts this into a posterior distribution. When a transition is seldom or never seen in the data, the prior plays an important role. Therefore, the prior should reflect our knowledge of the possible transitions. The question is: given x_t in c_i, which cell is the most probable to contain x_{t+1}? With respect to our assumption that the monitoring data evolve gradually, the transitions would have a spatial closeness tendency, i.e., transitions between nearby cells are more probable than those between cells far away. To support this claim, we check the number of transitions with respect to the cell distance in two days of measurement values. We find that the majority of the transitions occur inside a cell, i.e., the data points simply stay within a certain cell, and most of the remaining transitions occur between a cell and its closest neighbors. As the cell distance increases, it becomes less likely that points move among the cells. Therefore, the spatial closeness tendency assumption is valid. Based on this finding, we define the prior distribution as

P(c_i -> c_j) \propto P(c_i -> c_i) / w^{d(c_i, c_j)}

where d(c_i, c_j) is the distance between c_i and c_j, and w is the rate of probability decrease. If we observe that x_t belongs to c_i, it is most likely that x_{t+1} stays in c_i as well. We set P(c_i -> c_i) to be the highest, and as c_j departs further away from c_i, P(c_i -> c_j) decreases exponentially. From the definition and the constraint that sum_{j=1}^{s} P(c_i -> c_j) = 1, the prior probability of having a transition from c_i to any cell can be computed. An example prior distribution of transitions from one cell to the other cells is shown in Figure 9. It can be seen that the self-transition probability is the highest, followed by the probabilities of transitions to the closest neighbors.

Figure 9. Initial Transitions; Figure 10. Updated Transitions.

Distribution Updates. According to Eq. (1), to update the prior distribution we need to multiply it by P(x_t -> x_{t+1} | c_i -> c_j). If x_{t+1} in fact falls into c_h, we should set P(x_t -> x_{t+1} | c_i -> c_h) to be the highest among all pairs of cells. Also, due to the spatial closeness tendency, it is likely that a future transition can occur from c_i to c_h's neighbors. Again, we assume an exponential decrease of the transition probability with respect to the cell distance and use the following update rule:

P(x_t -> x_{t+1} | c_i -> c_j) \propto P(x_t -> x_{t+1} | c_i -> c_h) / w^{d(c_h, c_j)},  if x_{t+1} in c_h and x_t in c_i    (2)

In Eq. (1), we take the log over all the probabilities, so the updates can be performed using additive operations. Note that we update the transition probability only on normal points, not on outliers with zero probability. The updating equation is applied to the i-th row of the transition probability matrix, where c_i is the cell x_t belongs to. We start the updating procedure from x_1, where P(c_i -> c_j | x_1) is assumed to be the prior P(c_i -> c_j), and repeatedly execute it for the subsequent observations x_2, ..., x_n.

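A compact sketch of the prior initialization and the additive log-space update is given below. The Chebyshev cell distance and the re-normalization after every update are assumptions, since the text does not fix the distance metric d(., .) or the normalization schedule; w > 1 controls how fast the probability decays with distance.

```python
import numpy as np

def cell_distance(i, j, n_cols):
    """Chebyshev distance between cells i and j on the 2-D grid (assumed
    metric); cell ids are flattened row-major with n_cols columns."""
    r1, c1 = divmod(i, n_cols)
    r2, c2 = divmod(j, n_cols)
    return max(abs(r1 - r2), abs(c1 - c2))

def init_log_prior(s, n_cols, w=2.0):
    """log P(c_i -> c_j) decreasing exponentially with cell distance,
    with each row normalized to sum to one (the prior in Eq. (1))."""
    logp = np.empty((s, s))
    for i in range(s):
        for j in range(s):
            logp[i, j] = -cell_distance(i, j, n_cols) * np.log(w)
        logp[i] -= np.log(np.sum(np.exp(logp[i])))   # row normalization
    return logp

def update_row(logp, i, h, n_cols, w=2.0):
    """Additive update of row i after observing x_t in c_i and x_{t+1} in
    c_h (Eq. (2)); re-normalization could also be deferred."""
    s = logp.shape[1]
    for j in range(s):
        logp[i, j] += -cell_distance(h, j, n_cols) * np.log(w)
    logp[i] -= np.log(np.sum(np.exp(logp[i])))
    return logp
```
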
The prior distribution shown in Figure 9 is updated using six days of monitoring data, and the resulting posterior distribution for the same cell is depicted in Figure 10. The prior probability of staying in the same cell is the highest, but it turns out that many transitions to one particular neighboring cell are observed, so the probability of that cell becomes the highest in the posterior.

5. Problem Determination and Localization

In this section, we discuss how to determine problems in a distributed system with l measurements available. Since we build pair-wise correlation models for any two measurements, we have l(l-1)/2 models to characterize all the correlations within the whole system. We propose a fitness score as an indicator of the probability of having system problems; it is defined at the following three levels and measures how well the models fit the monitoring data.

1) Each pair of measurements at a given time: For a pair of measurements m_a and m_b at time t+1, suppose the most up-to-date model derived from the monitoring data from time 1 to t is M^{a,b}_{t+1}, and x represents the two-dimensional feature vector consisting of the measurement values of m_a and m_b. Suppose x_t falls into cell c_i. At time t+1, the model M^{a,b}_{t+1} outputs the transition probability from c_i to any cell c_j in the grid (1 <= j <= s). We define a ranking function pi(c_j): pi(c_j) < pi(c_k) if P(c_i -> c_j) > P(c_i -> c_k). In other words, c_j is ranked higher if the probability of going from c_i to c_j is higher. We then define the fitness score as

Q^{a,b}_{t+1} = 1 - (pi_{M^{a,b}_{t+1}}(c_h) - 1) / s_{M^{a,b}_{t+1}}

where c_h is the cell x_{t+1} actually belongs to, and s_{M^{a,b}_{t+1}} is the number of grid cells in model M^{a,b}_{t+1}. Outliers that lie outside the grid have zero transition probability, and thus their fitness scores are zero as well. Figure 11 illustrates the fitness score computation through an example.

Figure 11. Fitness Score Computation (an example showing, for cells c_1 to c_6, the transition probabilities from the current cell, the resulting ranks pi(c_j), and the corresponding fitness scores).

Suppose x_t is contained in cell c_4, and the transition probabilities from c_4 to the other cells are shown in the left part of the figure. If x_{t+1} is in cell c_5, we first sort the cells according to the transition probability, and c_5 is ranked at the 4th place. Then, to compute the fitness score, we have pi_{M^{a,b}_{t+1}}(c_5) = 4 and s_{M^{a,b}_{t+1}} = 6, so the result is 1/2. To examine the effect of fitness scores, we repeat the above procedure for the other cells and show the results in Figure 11. As can be seen, the fitness score Q measures how well the model M^{a,b}_{t+1} fits the observed monitoring data.

2) Each measurement at a given time: For a measurement m_a (1 <= a <= l), we can derive l-1 different models, each of which characterizes the correlations between m_a and another measurement m_b (b = 1, ..., a-1, a+1, ..., l). At time t+1, the fitness score for m_a is computed as

Q^a_{t+1} = (sum_{b != a} Q^{a,b}_{t+1}) / (l - 1)

where Q^{a,b}_{t+1} is the fitness score for the model built upon m_a and another measurement m_b. The fitness score of a single measurement is thus determined by the fitness of the correlation models constructed for its links to all the other measurements.

3) At a given time: We aggregate the scores of the l measurements into one score Q_{t+1}, which can be used to judge whether there are any problems in the entire system at time t+1. Again, this can be achieved by averaging the fitness scores of all the measurements.

At the finest level, Q^{a,b}_{t+1} only evaluates the correlation model between two measurements (e.g., one link in Figure 2(a), such as the link between CPU usage at Server A and memory usage at Server C). Q^a_{t+1} is the aggregation of Q^{a,b}_{t+1}, i.e., it examines the l-1 links leading to one node. For example, the fitness score for the measurement CPU usage at Server A is computed based on all of its links. Q_{t+1} works for the entire system by aggregating all the fitness scores (e.g., all the links in Figure 2(a)). In general, this evaluation framework provides different granularities of data analysis for system management. We can first merge the fitness scores of all the system components so that the system administrators can monitor a single score for system-wide problems. If the average score deviates from the normal state, the administrators can drill down to Q^a_{t+1} or even Q^{a,b}_{t+1} to locate the specific components where system errors occur. We can expect a high fitness score when the monitoring data can be well explained by the model, whereas anomalies in system performance lead to a low score.

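The three levels of fitness scores can be sketched as below; the rank computation from a matrix row and the dictionary keyed by measurement pairs are bookkeeping assumptions made for illustration.

```python
import numpy as np

def pairwise_fitness(V, cell_t, cell_next):
    """Q^{a,b}_{t+1}: rank the destination cells by the transition
    probabilities from cell_t (rank 1 = most probable) and score the cell
    actually observed at t+1. Outliers outside the grid score zero."""
    if cell_t is None or cell_next is None:
        return 0.0
    row = np.asarray(V)[cell_t]
    s = len(row)
    rank = 1 + int(np.sum(row > row[cell_next]))   # position of the observed cell
    return 1.0 - (rank - 1) / s

def measurement_fitness(pair_scores, a):
    """Q^a_{t+1}: average of Q^{a,b}_{t+1} over the l-1 models involving m_a.
    pair_scores maps frozenset({a, b}) -> Q^{a,b}_{t+1}."""
    scores = [q for pair, q in pair_scores.items() if a in pair]
    return float(np.mean(scores))

def system_fitness(pair_scores, measurements):
    """Q_{t+1}: average the per-measurement scores over the whole system."""
    return float(np.mean([measurement_fitness(pair_scores, a)
                          for a in measurements]))
```

A system administrator would monitor system_fitness first and, when it deviates from the normal state, drill down to measurement_fitness and then to the individual pair scores.
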
6. Experiments

We demonstrate the effectiveness and efficiency of the proposed method through experiments on a large collection of real monitoring data from three companies' infrastructures. Due to privacy issues, we cannot reveal their names and will denote them as A, B, and C in the following discussion. Each company provides a certain Internet service and has over a hundred servers to support user requests every day. On each server, a wide range of system metrics of interest to system administrators are monitored, for example, the amount of free memory, CPU utilization, I/O throughput, etc. A metric obtained from a machine represents a unique measurement; for example, CPU utilization on the machine with IP x.x.x.x is one measurement. We expect that correlations exist among measurements from the same machine, as well as across different machines, because the whole system is usually affected by the number of user requests. For each group, there are roughly 3,000 measurements collected from around 50 machines. We select a subset of measurements from each group and conduct the experiments on all pairs of the selected measurements. To test on the difficult cases, we enforce the following selection criteria: 1) the sampling rate should be reasonably high, at least one sample every 6 minutes; 2) the measurements do not have any linear relationships with other measurements; and 3) the measurements should have high variance during the monitoring period.

We wish to find out how well the proposed transition probability model profiles the system's normal behaviors. To achieve this, we sample a training set to simulate history data, and a test set, which can be regarded as online data, from one month of monitoring data (May 29 to June 27, 2008). We compute a model from the training set and evaluate it on the test set. To examine how the sizes of the training and test sets affect the model performance, we construct the following training and test sets and conduct experiments on all of their combinations for each of the three groups. Training sets: 1) 1 day (May 29), 2) 8 days (May 29-June 5), and 3) 15 days (May 29-June 12). Test sets: 1) 1 day (June 13), 2) 5 days (June 13-June 17), 3) 9 days (June 13-June 21), and 4) 13 days (June 13-June 25).

Problem Determination. In this part, we assess the performance of the proposed method in system problem determination. The distributed systems in use are usually stable and do not have many critical failures. Therefore, we test our method on three pairs of system measurements where potential problems occur, as identified by the system administrators. Based on these events, we can get a general idea of the proposed method's effectiveness in problem determination. Figure 12 depicts the fitness scores for the three pairs of measurements where the ground-truth problems are found. The test set is one day's monitoring data, and the problems are found in the morning (Group A) or in the afternoon (Groups B and C). The two measurements are CurrentUtilization_PORT and ifOutOctetsRate_PORT (Group A), ifOutOctetsRate_PORT and ifInOctetsRate_PORT (Group B), and CurrentUtilization_IF and ifOutOctetsRate_IF (Group C).

Figure 12. Fitness Scores When System Problems Occur (fitness score over one day, from midnight to midnight, for Groups A, B, and C).

It clearly shows that the anomalies identified by the proposed transition probability method are consistent with the ground truth in all three cases. During the period when a problem occurs, we observe a deep downward spike in the plot of the fitness score, which means that the problematic time stamp receives a much lower fitness score compared with normal periods. To provide some intuition about how the method detects these anomalies, we examine the normal and anomalous transitions in the experiment on Group B. From the early morning up to the early afternoon, the values of the two measurements stay within their normal ranges; then an anomalous jump to a grid cell outside these ranges is observed, which leads to the downward spike in the fitness score. After that, the measurements fall into either the normal ranges or one adjacent range, which causes a small disturbance in the fitness scores until 8pm. Finally, the measurements go back to their normal values and the fitness score stabilizes at 1. Note that we omit the transition probabilities here and only give the normal and anomalous transitions to illustrate the basic idea. So the proposed model can help detect system problems as well as investigate their causes.

We also try to identify the specific machine where the problem is located within the whole distributed system. To do so, we compute the average fitness score among measurements collected from the same machine and plot the score distribution across each information system in Figure 14. The locations with low fitness scores are the potential problem sources. Because the monitoring data from the three information systems have different characteristics and distributions, the scales of the fitness scores of the three groups are different. We can see that most of the fitness scores are above a certain threshold within each group, which implies that most of the servers are stable and have few problems. There are only a few servers with low average scores, which the system administrators need to check carefully. For example, in Figure 14, there is only one machine in Group A whose score is far below those of the other machines. We should pay more attention to this server in future monitoring and analysis.

Offline versus Adaptive. In the following experiments, we show the method's performance on all pairs of measurements by analyzing the fitness scores. As discussed, a real distributed system exhibits normal behaviors most of the time; therefore, a good model should predict the system behaviors well and generate a high average fitness score.

Figure 13. Average Fitness Score and Updating Time (Group A): (a) average fitness score and (b) online updating time in seconds, for the different training/test set combinations.

First, we compare the following two methods: offline methods, where the model is derived from the training set offline, and adaptive methods, where the model is initialized from the training set but updated based on the online test set. It would be interesting to see whether online model updating can provide additional benefits over the offline model. In Section 5, we showed that at each sampling point a fitness score Q_{t+1} is computed to reflect the effectiveness of the current model. Therefore, we can evaluate the performance of the offline and adaptive methods by averaging the fitness scores computed according to their generated models at each time stamp. When the model is continuously good, the average fitness score will be high. Due to the space limit, we only show the experimental results on Group A; the experiments on the other two groups have similar patterns. The results are shown in Figure 13(a), where solid and dotted lines represent the adaptive and offline methods respectively. It can be seen that the adaptive method usually improves the fitness score over the offline method, especially when the training set is small. When history data are limited, online updating of the model is necessary. But when we have sufficient history data, the models from offline analysis can predict reasonably well on the test set. When the size of the test set increases, we observe an increase in the fitness scores, which can be explained by the fact that a large sample size usually reduces the variance of the estimates. Typically, the average fitness score is high (above 0.8), indicating that the proposed model captures the transitions in the monitoring data and is capable of predicting the future.

Updating Time. In this part, we evaluate the adaptive method's efficiency.

Figure 14. Q Scores w.r.t. Locations; Figure 15. Q Scores for Nine Days; Figure 16. Q Scores for One Day.

First, once we have a new observation, we simply determine the grid cell it falls into and look up the transition probability matrix to get the prediction, so the time of applying the model to make predictions is negligible. On the other hand, we have relatively more time to spend on offline analysis. Therefore, the time for updating the model online is the most important part of the efficiency analysis. Figure 13(b) shows the online updating time of the adaptive method. When the training samples are sufficient, it takes only seconds to process thousands of monitoring data points, i.e., a few milliseconds per sample, much smaller than the sampling interval (6 minutes). If the period of the training set drops to one day, the updating time increases greatly: because the history data set does not contain enough examples to initialize the model accurately, the model has to be updated frequently online. However, even in the worst case, the updating time per sample is still far below the sampling interval. So the proposed method is efficient and can be embedded in online monitoring tools.

Periodic Patterns. The volume of user requests usually affects the system behaviors: heavier workloads can make the system less predictable. Therefore, when we examine the fitness score at each time stamp Q_{t+1} over a period of 9 days, we find some interesting periodic patterns in Figure 15. We initialize the model using one day's monitoring data, then update and evaluate it on the data from June 13 to June 21. It is obvious that higher fitness scores are obtained during the times when the system is less active, including the weekends. At peak hours, the model has lower fitness scores because the system is heavily affected by the large volume of user requests and is difficult to predict correctly. When more history data are employed in building the initial model, the fitness scores can be improved greatly. To illustrate this, we vary the size of the training set and plot the fitness scores on one day's monitoring data (June 13), shown in Figure 16. When only one day's data are used as the training set, the fitness score drops when heavy workloads increase the prediction complexity. But the model initialized from 15 days of history data greatly improves the stability, with a consistently high fitness score during both peak and non-peak hours. The results suggest that it is important to incorporate more training samples that share similar properties with the online data to learn the initial model.

7. Conclusions

In this paper, we develop a novel statistical approach to characterize the pair-wise interactions among different components in distributed systems. We discretize the feature space of the monitoring data into grid cells and compute the transition probabilities among the cells adaptively according to the monitoring data. Compared with previous system monitoring techniques, the advantages of our approach include: 1) it detects system problems considering both spatial and temporal information; 2) the model can output the problematic measurement ranges, which are useful for human debugging; and 3) the method is fast and can describe both linear and non-linear correlations. Experiments on monitoring data collected from three real distributed systems, involving measurements from around 150 machines, show the effectiveness of the proposed method.

References

[1] G. Jiang, H. Chen, and K. Yoshihira, "Discovering likely invariants of distributed transaction systems for autonomic system management," Cluster Computing, vol. 9, no. 4, pp. 385-399, 2006.
[2] M. A. Munawar, M. Jiang, and P. A. S. Ward, "Monitoring multi-tier clustered systems with invariant metric relationships," in Proc. of SEAMS, 2008, pp. 73-80.
[3] Z. Guo, G. Jiang, H. Chen, and K. Yoshihira, "Tracking probabilistic correlation of monitoring data for fault detection in complex systems," in Proc. of DSN, 2006, pp. 259-268.
[4] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, "Pinpoint: problem determination in large, dynamic Internet services," in Proc. of DSN, 2002, pp. 595-604.
[5] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang, "Towards highly reliable enterprise network services via inference of multi-level dependencies," SIGCOMM Comput. Commun. Rev., vol. 37, no. 4, pp. 13-24, 2007.
[6] F. Salfner and M. Malek, "Using hidden semi-Markov models for effective online failure prediction," in Proc. of SRDS, 2007, pp. 161-174.
[7] X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan, G. Iannaccone, and A. Lakhina, "Detection and identification of network anomalies using sketch subspaces," in Proc. of IMC, 2006, pp. 147-152.
[8] P. Chhabra, C. Scott, E. Kolaczyk, and M. Crovella, "Distributed spatial anomaly detection," in Proc. of INFOCOM, 2008, pp. 1705-1713.
[9] S. Goil, H. Nagesh, and A. Choudhary, "MAFIA: Efficient and scalable subspace clustering for very large data sets," Technical Report, Department of Electrical and Computer Engineering, Northwestern University, 1999.
[10] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis (2nd ed.). Chapman and Hall, 2004.
Yoshihira, Disovering likely invariants of distributed transation systems for autonomi system management, Cluster Computing, vol. 9, no. 4, pp. 385 399, 006. [] M. A. Munawar, M. Jiang, and P. A. S. Ward, Monitoring multi-tier lustered systems with invariant metri relationships, in Pro. of SEAMS, 008, pp. 73 80. [3] Z. Guo, G. Jiang, H. Chen, and K. Yoshihira, Traking probabilisti orrelation of monitoring data for fault detetion in omplex systems, in Pro. of DSN, 006, pp. 59 68. [4] M. Chen, E. Kiiman, E. Fratkin., A. Fox, and E. Brewer, Pinpoint: problem determination in large, dynami internet servies, in Pro. of DSN, 00, pp. 595 604. [5] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang, Towards highly reliable enterprise network servies via inferene of multi-level dependenies, SIGCOMM Comput. Commun. Rev., vol. 37, no. 4, pp. 3 4, 007. [6] F. Salfner and M. Malek, Using hidden semi-markov models for effetive online failure predition, in Pro. of SRDS, 007, pp. 6 74. [7] X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan, G. Iannaone, and A. Lakhina, Detetion and identifiation of network anomalies using sketh subspaes, in Pro. of IMC, 006, pp. 47 5. [8] P. Chhabra, C. Sott, E. Kolazyk, and M. Crovella, Distributed spatial anomaly detetion, in Pro. of INFOCOM, 008, pp. 705 73. [9] S. Goil, H. Nagesh, and A. Choudhary, Mafia: Effiient and salable subspae lustering for very large data sets, in Tehnial Report, Department of Eletrial and Computer Engineering, Northwestern University, 999. [0] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis (nd ed.). Chapman and Hall, 004. 630