Incremental Commute Time Using Random Walks and Online Anomaly Detection


Incremental Commute Time Using Random Walks and Online Anomaly Detection

Nguyen Lu Dang Khoa (1) and Sanjay Chawla (2,3)
(1) Data61, CSIRO, Australia, khoa.nguyen@data61.csiro.au; (2) Qatar Computing Research Institute, HBKU; (3) University of Sydney, Australia, sanjay.chawla@sydney.edu.au

Abstract. Commute time is a random-walk-based metric on graphs and has found widespread successful applications in many domains. However, computing the commute time is expensive, as it involves the eigen decomposition of the graph Laplacian matrix. There have been efforts to approximate the commute time in offline mode. Our interest is inspired by the use of commute time in online mode. We propose an accurate and efficient approximation for computing the commute time in an incremental fashion in order to facilitate real-time applications. An online anomaly detection technique is designed in which the commute time from each newly arriving data point to any point in the current graph can be estimated in constant time, ensuring a real-time response. The proposed approach shows high accuracy and efficiency on many synthetic and real datasets, and takes only 8 milliseconds on average to detect anomalies online on the DBLP graph, which has more than 600,000 nodes and 2 million edges.

Keywords: Commute time, random walk, incremental learning, online anomaly detection

1 Introduction

Commute time is a well-known measure derived from random walks on graphs [10]. The commute time between two nodes i and j in a graph is the expected number of steps that a random walk starting from i will take to visit j and then come back to i for the first time. Commute time has been used as a robust metric for different learning tasks such as clustering [14] and anomaly detection [8]. It has also found widespread applications in personalized search [16], collaborative filtering [3] and image segmentation [14].
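The random-walk definition of commute time can be checked with a direct Monte Carlo simulation. The sketch below uses a hypothetical toy graph (a path a-b-c with unit edge weights, not a graph from the paper) and estimates the commute time between the two leftmost nodes by repeatedly walking there and back:

```python
import numpy as np

# Hypothetical toy graph: a path a-b-c with unit edge weights.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
deg = A.sum(axis=1)
rng = np.random.default_rng(0)

def steps_to_reach(start, target):
    """Steps a random walk takes from `start` until it first reaches `target`."""
    node, steps = start, 0
    while node != target:
        # move to a neighbor with probability proportional to edge weight
        node = rng.choice(len(deg), p=A[node] / deg[node])
        steps += 1
    return steps

# Commute time c_ab = E[steps a -> b] + E[steps b -> a], estimated by simulation.
n_walks = 20000
est = np.mean([steps_to_reach(0, 1) + steps_to_reach(1, 0) for _ in range(n_walks)])
print(round(est, 2))   # close to the exact value 4 for this graph
```

For this path graph the exact value is 4: the walk from the end node reaches the middle in exactly one step, and the expected return trip takes three more steps.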
The fact that the commute time is averaged over all paths (and not just the shortest path) makes it more robust to data perturbations. More advanced measures generally require more expensive computation. Estimating commute time involves the eigen decomposition of the graph Laplacian matrix, resulting in O(n^3) time complexity, which is impractical for

large graphs. Saerens, Pirotte and Fouss [15] used subspace approximation to approximate the commute time. Sarkar and Moore [13] introduced a notion of truncated commute time and a pruning algorithm to find nearest neighbors in the truncated commute time. Recently, Spielman and Srivastava [17] proposed an approximation algorithm to create a structure in nearly linear time so that any pairwise commute time can be approximated in O(log n) time. However, all the above-mentioned approximation techniques work in a batch fashion and therefore have a high computation cost for online applications. We are interested in the following scenario: a dataset or a graph D is given from an underlying domain of interest, such as a network traffic log or a social network graph. A new data point p arrives and we want to determine whether p is an anomaly with respect to D in commute time. A data point is an anomaly if it is far away from its nearest neighbors in the commute time measure (as described in [8]). This particular application requires the computation of commute time in an online fashion. In this paper, we propose a method called iECT to incrementally estimate the commute time, and we use it to design an online anomaly detection application. The method makes use of the recursive definition of commute time in terms of random walk measures. The commute time from a new data point to any data point in the existing data D is computed based on the current commute times among points in D. The method is novel and reveals insights about commute time which are independent of the applications. The contributions of this paper are as follows. We use characteristics of random walk measures to propose a method to estimate the commute time incrementally in constant time. We then design an online anomaly detection technique using the incremental commute time. To the best of our knowledge, this is the first method to estimate the commute time in an online fashion.
The proposed technique is verified by experiments in different applications using several synthetic and real datasets. The experiments show the effectiveness of the proposed methods in terms of accuracy and performance. The methods can be applied directly to graph data and can be used in any application that utilizes the commute time (e.g. classification and graph ranking using commute time). The remainder of the paper is organized as follows. Section 2 reviews notation and concepts related to random walks and commute time, and a method to approximate the commute time offline in large graphs. Section 3 presents a simple motivating example to tie together all the definitions and ideas, and proposes a method to incrementally estimate the commute time. In Section 4, we propose an online anomaly detection algorithm which uses the incremental commute time. We evaluate our approaches using experiments on synthetic and real datasets in Section 5. Sections 6 and 7 cover the related work and a summary of our work.

2 Background

2.1 Random Walks on Graphs and Commute Time

We provide a self-contained introduction to random walks with an emphasis on commute time. Assume we are given a connected, undirected and weighted graph G = (V, E, W).

Definition 1. Let $i$ be a node in $G$ and $N(i)$ be its neighbors. The degree $d_i$ of node $i$ is $d_i = \sum_{j \in N(i)} w_{ij}$. The volume $V_G$ of the graph is defined as $V_G = \sum_{i \in V} d_i$.

Definition 2. The transition matrix $M = (p_{ij})_{i,j \in V}$ of a random walk on $G$ is given by
$$p_{ij} = \begin{cases} \frac{w_{ij}}{d_i} & \text{if } (i,j) \in E \\ 0 & \text{otherwise.} \end{cases}$$

Definition 3. The hitting time $h_{ij}$ is the expected number of steps that a random walk starting at $i$ will take before reaching $j$ for the first time.

Definition 4. The hitting time satisfies the recursion
$$h_{ij} = \begin{cases} 1 + \sum_{l \in N(i)} p_{il} h_{lj} & \text{if } i \neq j \\ 0 & \text{otherwise.} \end{cases}$$

Definition 5. The commute time $c_{ij}$ between two nodes $i$ and $j$ is given by $c_{ij} = h_{ij} + h_{ji}$.

Fact 1. The commute time can be expressed in terms of the Laplacian of $G$:
$$c_{ij} = V_G (l^+_{ii} + l^+_{jj} - 2 l^+_{ij}) = V_G (e_i - e_j)^T L^+ (e_i - e_j) \qquad (1)$$
where $l^+_{ij}$ is the $(i, j)$ element of $L^+$ (the pseudo-inverse of the Laplacian $L$) and $e_i$ is the $|V|$-dimensional column vector with 1 at location $i$ and zero elsewhere [3]. $L^+$ can be computed from the eigensystem of $L$: $L^+ = \sum_{i=2}^{|V|} \frac{1}{\lambda_i} v_i v_i^T$.

2.2 Approximation of the Commute Time Embedding (Batch Mode)

Computing commute time involves the eigen decomposition of the graph Laplacian matrix, which is impractical for large graphs. Recently, Spielman and Srivastava [17] proposed an approximation algorithm utilizing random projection and an SDD solver to create a structure in nearly linear time so that any pairwise commute time can be approximated in $k_{RP} = O(\log n)$ time ($k_{RP}$ is the reduced dimension in the random projection). The fast SDD solver [18] for linear systems is a new class of near-linear time methods for solving a system of equations $Ax = b$ when $A$ is a symmetric diagonally dominant (SDD) matrix.
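Fact 1 can be illustrated in a few lines. The sketch below (using the same assumed toy path graph as before, not a graph from the paper) builds $L^+$ from the eigensystem of $L$, skipping the zero eigenvalue, and evaluates Equation 1:

```python
import numpy as np

# Assumed toy graph: a path a-b-c with unit edge weights.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
deg = A.sum(axis=1)
L = np.diag(deg) - A           # graph Laplacian
V_G = deg.sum()                # graph volume (sum of degrees)

# L^+ from the eigensystem of L: sum over the non-zero eigenvalues only.
lam, v = np.linalg.eigh(L)     # eigenvalues in ascending order; lam[0] ~ 0
Lp = sum((1.0 / lam[i]) * np.outer(v[:, i], v[:, i]) for i in range(1, len(lam)))

def commute_time(i, j):
    """c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j), Equation 1."""
    e = np.zeros(len(deg)); e[i], e[j] = 1.0, -1.0
    return V_G * e @ Lp @ e

print(commute_time(0, 1), commute_time(0, 2))   # 4.0 and 8.0 for this path
```

Equivalently, $c_{ij}$ is the graph volume times the effective resistance between $i$ and $j$ when edges are viewed as resistors, which is how the values 4 and 8 for this path can be checked by hand.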
The idea is based on the fact that $\theta = \sqrt{V_G}\, L^+ B^T W^{1/2}$ is a commute time embedding, in which the commute time $c_{ij}$ is the squared Euclidean distance between points $i$ and $j$ in $\theta$. Here $m$ is the number of edges in $G$, $B$ is the $m \times n$ signed edge-vertex incidence matrix, and $W$ is the $m \times m$ diagonal matrix whose entries are the edge weights. For the details of the embedding creation, refer to [17].
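The embedding identity can be verified numerically. The sketch below (again on an assumed toy path graph) assembles $B$ and $W$, forms $\theta$, and checks that the squared Euclidean distance between two rows of $\theta$ reproduces Equation 1:

```python
import numpy as np

# Assumed toy graph: path a-b-c; edges (0,1) and (1,2) with unit weights.
edges = [(0, 1), (1, 2)]
weights = np.array([1.0, 1.0])
n, m = 3, len(edges)

# Signed edge-vertex incidence matrix B (m x n), diagonal weight matrix W (m x m).
B = np.zeros((m, n))
for k, (u, v) in enumerate(edges):
    B[k, u], B[k, v] = 1.0, -1.0
W = np.diag(weights)

L = B.T @ W @ B                     # Laplacian assembled from B and W
V_G = 2.0 * weights.sum()           # volume = twice the total edge weight
Lp = np.linalg.pinv(L)

# Commute time embedding: each row of theta is one embedded node.
theta = np.sqrt(V_G) * Lp @ B.T @ np.sqrt(W)

dist_sq = np.sum((theta[0] - theta[1]) ** 2)            # squared distance in theta
exact = V_G * (Lp[0, 0] + Lp[1, 1] - 2.0 * Lp[0, 1])    # Equation 1
print(dist_sq, exact)   # both equal the commute time between the two nodes
```

In [17] the rows of this $m$-dimensional embedding are additionally compressed by a random projection down to $k_{RP}$ dimensions; the sketch above omits that step to keep the identity exact.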

3 Incremental Commute Time

3.1 Problem and Scope

Problem: given a dataset or a graph D from an underlying domain of interest, when a new data instance p arrives, we want to compute the commute time from p to any data instance in D.

In a Euclidean space, the insertion of a new point does not change the features of existing points. However, the insertion of a new node in the original feature space or in a graph will change the features of existing points in the commute time embedding space, which is spanned by eigenvectors of the graph Laplacian matrix. Updating the eigensystem of a graph Laplacian is costly and not suitable for online applications. In this work, we use the characteristics of random walk measures to estimate the commute time incrementally in constant time and use it to design online applications.

There are some notes regarding the scope of this work. Firstly, the proposed method is only suitable for applications which do not need to update the training model over time (i.e. representative training data are available). That means we treat the new data points one by one, estimate the corresponding commute times, and leave the trained model intact. Secondly, in the case of graph data, we only deal with node insertion, not node deletion or edge weight updates.

3.2 Motivating Examples

Consider the graph G shown in Figure 1a, where all the edge weights equal 1. The sum of the degrees of the nodes is V_G = 8. We will calculate the commute time c_12 in two different ways.

Fig. 1: (a) 4-node graph; (b) adding node 5. c_12 increases after the addition of node 5 even though the shortest path distance remains unchanged.

1. Using the random walk approach: note that the expected number of steps for a random walk starting at node 1 and returning to it is $V_G/d_1 = 8/1 = 8$ [10]. But the walk from node 1 can only go to node 2 and must return from node 2 to node 1. Thus $c_{12} = 8$.

2. Using the algebraic approach: the Laplacian matrix is L =

and the pseudo-inverse is L+ =

Since $c_{12} = V_G (e_1 - e_2)^T L^+ (e_1 - e_2)$ and $(e_1 - e_2)^T L^+ (e_1 - e_2) = 1$, we get $c_{12} = V_G \cdot 1 = 8$.

Suppose we add a new node (labeled 5) to node 4 with a unit weight, as in Figure 1b. Then $c^{new}_{12} = V_G^{new}/d_1 = 10/1 = 10$. The example in Figure 1b shows that by adding an edge, i.e. making the cluster which contains node 2 denser, $c_{12}$ increases. This shows that the commute time between two nodes captures not only the distance between them (as measured by the edge weights) but also the data densities. For a proof of this claim, see [8]. This property of commute time has been used to simultaneously discover global and local anomalies in data, an important problem in the anomaly detection literature.

In the above example, we exploited the specific topology (a degree-one node) of the graph to calculate the commute time efficiently. This can only work for very specific instances. The general, more widely used but slower approach for computing the commute time is the Laplacian formula of Equation 1. One key contribution of this paper is that, for an incremental computation of commute time, we can use insights from this example to efficiently approximate the commute time using random walks in much more general situations.

3.3 Incremental Estimation of Commute Time

In this section, we derive a new method for computing the commute time in an incremental fashion. This method uses the definition of commute time based on the hitting time. The basic intuition is to expand the hitting time recursion until the random walk has moved a few steps away from the new node and then use the old values. In Section 5 we will show that this method results in remarkable agreement between the batch and online modes. We deal with two cases, shown in Figure 2:

1. Rank one perturbation corresponds to the situation when the new node connects to one node in the existing graph.

2.
Rank k perturbation deals with the situation when the new node has k neighbors in the existing graph.
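Before treating the two cases, the motivating example of Section 3.2 can be replayed numerically. The figure itself is not reproduced here, so the sketch below assumes a topology consistent with the text: node 1 pendant on node 2, nodes 2, 3, 4 forming a triangle (so the sum of degrees is 8), and node 5 then attached to node 4:

```python
import numpy as np

def commute_time(A, i, j):
    """c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j) from the adjacency matrix A."""
    deg = A.sum(axis=1)
    Lp = np.linalg.pinv(np.diag(deg) - A)
    e = np.zeros(len(A)); e[i], e[j] = 1.0, -1.0
    return deg.sum() * e @ Lp @ e

# Assumed 4-node topology (0-indexed nodes 0..3 stand for nodes 1..4):
# node 1 pendant on node 2, and nodes 2-3-4 forming a triangle; V_G = 8.
A4 = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3), (1, 3)]:
    A4[u, v] = A4[v, u] = 1.0
print(commute_time(A4, 0, 1))   # 8.0, matching V_G / d_1

# Attach node 5 to node 4 with unit weight: the volume grows to 10 and
# c_12 grows to 10, although the shortest path from 1 to 2 is unchanged.
A5 = np.zeros((5, 5))
A5[:4, :4] = A4
A5[3, 4] = A5[4, 3] = 1.0
print(commute_time(A5, 0, 1))   # 10.0
```

Any 4-node topology with node 1 of degree one and total volume 8 gives the same values for $c_{12}$, since node 1 is pendant on node 2 in both cases.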

Fig. 2: Rank 1 and rank k perturbation when a new data point arrives.

Rank one perturbation

Proposition 1. Let $i$ be a new node connected by one edge to an existing node $l$ in the graph $G$, let $w_{il}$ be the weight of the new edge, and let $j$ be an arbitrary node in $G$. Then
$$c_{ij} = c^{old}_{lj} + \frac{V_G}{w_{il}} + O\left(\frac{1}{k}\right) \qquad (2)$$
where "old" refers to commute times in the graph $G$ (a $k$-nearest-neighbor graph) before adding $i$.

Proof (sketch). Since the random walk needs to pass through $l$ before reaching $j$, the commute distance from $i$ to $j$ is
$$c_{ij} = c_{il} + c_{lj}. \qquad (3)$$
It is known that
$$c_{il} = \frac{V_G + 2w_{il}}{w_{il}} \qquad (4)$$
where $V_G$ is the volume of graph $G$ [8]. We also know $c_{lj} = h_{jl} + h_{lj}$ and $h_{jl} = h^{old}_{jl}$. The only unknown factor is $h_{lj}$. By definition,
$$h_{lj} = 1 + \sum_{q \in N(l)} p_{lq} h_{qj} = 1 + \sum_{q \in N(l), q \neq i} p_{lq} h_{qj} + p_{li} h_{ij}.$$
Since commute time is robust against small changes or perturbations in the data, we have $h_{qj} \approx h^{old}_{qj}$. Moreover, $p_{lq} = (1 - p_{li}) p^{old}_{lq}$ and $h_{ij} = 1 + h_{lj}$. Therefore,
$$h_{lj} \approx 1 + (1 - p_{li}) \sum_{q \in N(l), q \neq i} p^{old}_{lq} h^{old}_{qj} + p_{li}(1 + h_{lj}) = 1 + (1 - p_{li})(h^{old}_{lj} - 1) + p_{li}(1 + h_{lj}),$$
where the last step uses the recursion $h^{old}_{lj} = 1 + \sum_q p^{old}_{lq} h^{old}_{qj}$. After simplification, $h_{lj} = h^{old}_{lj} + \frac{2p_{li}}{1 - p_{li}}$. Then $c_{lj} \approx h^{old}_{jl} + h^{old}_{lj} + \frac{2p_{li}}{1 - p_{li}}$. Since there is only one edge connecting $i$ to $G$, $i$ is likely an isolated point and thus $p_{li} = O(\frac{1}{k})$ ($G$ is the $k$-nearest-neighbor graph). Then
$$c_{lj} = h^{old}_{jl} + h^{old}_{lj} + O\left(\frac{1}{k}\right) = c^{old}_{lj} + O\left(\frac{1}{k}\right). \qquad (5)$$

As a result, from Equations 3, 4 and 5,
$$c_{ij} = \frac{V_G + 2w_{il}}{w_{il}} + c^{old}_{lj} + O\left(\frac{1}{k}\right) = c^{old}_{lj} + \frac{V_G}{w_{il}} + O\left(\frac{1}{k}\right),$$
where the additive constant 2 is absorbed into the approximation error.

Rank k perturbation. The rank $k$ perturbation analysis is more involved, but the final formulation is an extension of the rank one case.

Proposition 2. Let $l \in G$ denote one of the $k$ neighbors of $i$, and let $j$ be a node in $G$. The approximate commute time between nodes $i$ and $j$ is
$$c_{ij} \approx \sum_{l \in N(i)} p_{il}\, c^{old}_{lj} + \frac{V_G}{d_i} + O\left(\frac{1}{k}\right). \qquad (6)$$
For the proof, see the Appendix in the supplementary document. When $k = 1$ (the rank one case), Equation 6 reduces to Equation 2.

4 Online Applications Using Incremental Commute Time

We return to our original motivation for computing incremental commute time. We are given a dataset D which is representative of the underlying domain of interest. We need to find the nearest neighbors of a new data point p in the commute time metric incrementally, and we want to check whether p is an anomaly in D.

We train on the dataset D using Algorithm 1. First, a mutual k_1-nearest-neighbor graph is constructed from the dataset. This graph connects nodes u and v if u belongs to the k_1 nearest neighbors of v and v belongs to the k_1 nearest neighbors of u [11]. Then the approximate commute time embedding θ is computed as in Section 2.2. Finally, distance-based anomaly detection with the pruning rule proposed by Bay and Schwabacher [2] is applied in θ to find the top N anomalies; that is, the distance-based method uses commute time instead of Euclidean distance. It has been shown that a distance-based approach using commute time can simultaneously identify global, local and even group anomalies in data [8]. The anomaly score used is the average commute time from a data instance to its k_2 nearest neighbors.

Pruning rule [2]: a data point is not an anomaly if its score (e.g. the average distance to its k nearest neighbors) is less than an anomaly threshold.
The threshold can be fixed or adjusted to the score of the weakest anomaly found so far. Using the pruning rule, many non-anomalies can be pruned without carrying out a full nearest-neighbor search. After training, the corresponding graph G, the commute time embedding θ, and the anomaly threshold τ are obtained (τ is the score of the weakest of the top N anomalies). We propose a method, shown in Algorithm 2 and denoted iECT, to detect anomalies online given the trained model. When a new data point p arrives, it is connected to the graph G created in the training phase so that the property of the mutual nearest-neighbor graph is maintained. The commute times are incrementally updated to estimate the anomaly score

Algorithm 1 Approximate Commute Time Distance-Based Anomaly Detection (for training).
Input: data matrix X; the numbers of nearest neighbors k_1 (for building the nearest-neighbor graph) and k_2 (for estimating the anomaly score); the number of random vectors k_RP; the number of anomalies to return N
Output: top N anomalies, anomaly threshold τ
1: Construct a mutual k-nearest-neighbor graph G from the dataset (using k_1)
2: Compute the approximate commute time embedding θ from G
3: Find the top N anomalies using a distance-based technique with the pruning rule described in [2] on θ (using k_2)
4: Return the top N anomalies and the anomaly threshold τ

Algorithm 2 Online Anomaly Detection using the incremental Estimation of Commute Time (iECT).
Input: graph G, the approximate commute time embedding θ and the anomaly threshold τ computed in the training phase, and a new arriving data point p
Output: whether p is an anomaly or not
1: Add p to G, maintaining the property of the mutual nearest-neighbor graph
2: Determine whether p is an anomaly by estimating its anomaly score incrementally using the method described in Section 3.3; use the pruning rule with threshold τ to reduce the computation
3: Return whether p is an anomaly or not

of p using the approach in Section 3.3. The embedding θ is used to compute the commute times c_old. The pruning rule is used as follows: p is not an anomaly if its average distance to its k nearest neighbors is smaller than the anomaly threshold τ. Generally, commute time is robust against small changes or perturbations in the data; therefore, only the anomaly score of the new data point needs to be estimated and compared with the anomaly threshold computed in the training phase. This claim is verified by experiments in Section 5.

Analysis. The incremental estimation of commute time in Section 3.3 requires O(k_RP) for each query of c_old in θ.
So if there are k edges added to the graph due to the addition of a new node, it takes O(k k_RP) for each query of c_ij. As explained earlier, we only need to compute the anomaly score of the new data point. Using the pruning rule with the known anomaly threshold, it takes only an O(k_2) nearest-neighbor search to determine whether the test point is an anomaly, where k_2 is the number of nearest neighbors for estimating the anomaly score. Each commute time query takes O(k k_RP) as described above. Therefore, iECT takes O(k_2 k k_RP) to determine whether a new arriving point is an anomaly. It has been suggested in [19] that $k_{RP} = 2 \ln n / \epsilon^2$ for a distance-distortion tolerance $\epsilon$, which is just 442 for a dataset of

a million data points. Therefore $k_{RP} \ll n$. Since $k, k_2 \ll n$ as well, $O(k_2 k k_{RP}) = O(1)$, resulting in a near constant time complexity for iECT. Note that this constant time complexity of iECT does not depend on the $O(k_{RP})$ cost per query of $c^{old}$ using the method in [17]: if we instead query $c^{old}$ using Equation 1 with just $k_{EV}$ eigenvectors of the Laplacian matrix L (as described in [8]), each query takes only $O(k_{EV})$, again giving a constant time complexity for iECT.

5 Experiments and Results

In this section, we evaluate the effectiveness of the online anomaly detection application using incremental commute time. The experiments were carried out on synthetic as well as real datasets. In all experiments, unless otherwise stated, the numbers of nearest neighbors were k_1 = 10 (for building the nearest-neighbor graph) and k_2 = 20 (for estimating a nearest-neighbor or anomaly score), and the number of random vectors was k_RP = 200 (for creating the commute time embedding). We used Koutis's CMG solver [9], which is available online, as the implementation of the SDD solver for creating the embedding. The choice of parameters was determined from the experiments and is analyzed in Section 5.5. Source code and data can be accessed at 0B6LuuZJnvhFdTldkMmE1clk2T28/view?usp=sharing

5.1 Approach

We split each dataset into a training set and a test set. We trained on the training set to find the top N anomalies and the threshold value τ using Algorithm 1. Then the anomaly score of each instance p in the test set was calculated based on its k_2 neighbors in the training set. If this score was greater than τ, the test instance was reported as an anomaly.
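This score-against-threshold scheme can be sketched generically. The toy code below uses plain Euclidean distances on synthetic points standing in for distances in the commute time embedding; all sizes and the three-sigma threshold rule are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def anomaly_score(point, ref, k2):
    """Average distance from `point` to its k2 nearest points in `ref`."""
    d = np.linalg.norm(ref - point, axis=1)
    return np.sort(d)[:k2].mean()

# Illustrative stand-in for the commute time embedding of a training set.
train = rng.normal(size=(500, 8))
k2 = 20

# Training scores (leave-one-out), then a three-sigma threshold tau.
scores = np.array([anomaly_score(train[i], np.delete(train, i, axis=0), k2)
                   for i in range(len(train))])
tau = scores.mean() + 3.0 * scores.std()

# Score two new arrivals against tau.
normal_pt = rng.normal(size=8)
outlier_pt = rng.normal(size=8) + 10.0      # far away from the training cloud

print(anomaly_score(outlier_pt, train, k2) > tau)   # True: reported as anomaly
print(anomaly_score(normal_pt, train, k2) > tau)
```

The pruning rule corresponds to stopping the sorted-distance scan early once the running average falls below τ; it is omitted here for brevity.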
While searching for the nearest neighbors of p, if its average distance to the nearest neighbors found so far is smaller than τ, we can stop the search since p is not an anomaly (the pruning rule). In practice, it is not trivial to know the number of anomalies N in the training data in advance in order to find the top N and set the anomaly threshold. We investigated a method to set the threshold as follows: in the training phase, we computed the anomaly scores of all the data points and obtained the mean and standard deviation of the scores. Anomalies were data points whose scores were more than three standard deviations above the mean score, and N was the number of anomalies found.

Baseline: in all experiments, the batch method (Algorithm 1) was used as the benchmark, since there is no other method that estimates commute time incrementally. Note that for both the batch and incremental methods, we need to compute only the anomaly score of the new arriving data instance, and pruning was also applied using τ. The difference is that in the batch method, the approximate commute time embedding was recomputed and the anomaly score was estimated in the new embedding space. The incremental method, on the

other hand, estimated the score incrementally using the method described in Section 3.3.

5.2 Synthetic Datasets

We created six synthetic datasets with 1,000, 10,000, 20,000, 30,000, and larger numbers of data points. Each dataset contained several clusters generated from Normal distributions and 100 random points generated from a uniform distribution, which were likely anomalies. The number, sizes and locations of the clusters were chosen randomly. Each dataset was divided into a training set and a test set. There were 100 data points in every test set, and half of them were the random anomalies mentioned above.

Experiments on robustness: We first tested the robustness of commute times between nodes in an existing graph when a new node is introduced. As the commute time c_ij is a measure of expected path distance, the hypothesis is that the addition of a new point has minimal influence on c_ij, and thus the anomaly scores of data points in the existing set are relatively unchanged. Table 1 shows the average, standard deviation, minimum and maximum of the anomaly scores of points in graph G before and after a new data point was added to G. Graph G was created from the training set of the 1,000-point dataset described above, and the result was averaged over the 100 test points in the test set. The anomaly scores of data instances in G do not change much when a new point is added to G (the change in the average score was only about 0.7%).

Table 1: Robustness of commute time. The anomaly scores of data instances in the existing graph G are relatively unchanged when a new point is added to G.

In the following experiments, the change in the eigensystem of the graph Laplacian L of the training data due to the addition of a new node was analyzed.
Figure 3a shows the average changes in the top 50 eigenvalues before and after the addition of each test point of the 1,000-point dataset. The changes are small (mostly less than 1%, and all less than 6%). Figure 3b shows the dot products of the eigenvectors corresponding to the second smallest eigenvalue (the smallest is zero) before and after the addition of each test point. The eigenvectors did not change much after a new node was added to the graph. As Equation 1 shows, since the change in the eigensystem of the Laplacian is small, the commute times between existing training nodes do not change much. All these results show that commute time is a robust measure: a small change or perturbation in the data will not result in large changes in commute times. Therefore, only the anomaly score of the new point needs to be estimated.
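This robustness is what makes the incremental estimate work. As a miniature end-to-end check, the sketch below compares the rank-one estimate of Proposition 1 against full batch recomputation on an assumed small random graph standing in for the kNN graph (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def commute_matrix(A):
    """All-pairs commute times via the Laplacian pseudo-inverse (batch mode)."""
    deg = A.sum(axis=1)
    Lp = np.linalg.pinv(np.diag(deg) - A)
    d = np.diag(Lp)
    return deg.sum() * (d[:, None] + d[None, :] - 2.0 * Lp)

# Assumed random graph standing in for the mutual kNN graph (n=30, ~5 links each).
n, k = 30, 5
A = np.zeros((n, n))
for i in range(n):
    for j in rng.choice([x for x in range(n) if x != i], size=k, replace=False):
        A[i, j] = A[j, i] = 1.0

C_old = commute_matrix(A)
V_G = A.sum()                     # graph volume

# New node attached to a single node l with weight w (rank-one perturbation):
# Proposition 1 gives c_ij ~ c_old[l, j] + V_G / w.
l, w = 0, 1.0
est = C_old[l] + V_G / w

# Batch answer: rebuild the graph with the node actually inserted.
A2 = np.zeros((n + 1, n + 1))
A2[:n, :n] = A
A2[n, l] = A2[l, n] = w
exact = commute_matrix(A2)[n, :n]

rel_err = np.abs(est - exact) / exact
print(rel_err.max())   # small: the incremental estimate tracks the batch values
```

The residual error is exactly the additive constant 2 from Equation 4 plus the (small) change in the old commute times caused by inserting one node, which is why the relative error stays tiny on graphs whose volume is large compared with the new node's degree.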

Fig. 3: Change in the eigensystem when new nodes were added to the graph: (a) eigenvalue changes; (b) eigenvector changes (dot products of the second eigenvectors before and after).

Experiments on effectiveness: We applied iECT to all six datasets mentioned earlier. The effectiveness of iECT and the commute time approximation are reported and discussed below. Table 2 presents the accuracy and performance of iECT on the six synthetic datasets. The average score is the average anomaly score with the pruning rule over the 100 test points; the precision and recall are for the anomalous class; the time is the average time to process each of the 100 test points. iECT captured all the anomalies, had few false alarms, and was much more efficient than the batch method. Note that the scores shown are anomaly scores with the pruning rule, and the scores for anomalies are always much higher than those for normal points; the average scores in the table are therefore dominated by the scores of anomalies.

Table 2: Effectiveness of the incremental method. iECT captured all the anomalies, had few false alarms and was much more efficient than the batch method.

There is an interesting dynamic at play between the pruning rule and the number of anomalies in the data, since there was a high proportion of anomalies in the test set (about 50%). The pruning rule only works for non-anomalies, and therefore the time to process anomalies should be much longer than for other points. Table 3 shows the details of the time to process data

points in the test set. For the batch and iECT methods, the average times to process only anomalies, only other data points (non-anomalies), and all data instances are listed in the table. In the batch method there was not much difference between the time to process anomalies and non-anomalies, since for each new data point the time to create the new commute time embedding was much higher than that of the nearest-neighbor search. For iECT, on the other hand, this gap was very large, so non-anomalies were processed much faster than anomalies. In practice, since most data points are not anomalies, iECT is very efficient. Another cost we have not yet mentioned is the time to update the graph, i.e. the time to add a new data point to an existing graph while maintaining the property of the mutual nearest-neighbor graph. Since we stored the k-d tree corresponding to the training data, the update cost was very low, as shown in Table 3.

Table 3: Performance of the incremental method. In iECT, non-anomalies were processed much faster than anomalies.

5.3 Graph Dataset

In this section, we evaluate the iECT method on a large DBLP co-authorship network to show its scalability. In this graph, nodes are authors and edge weights are the numbers of co-authored papers. Since the graph is not fully connected, we extracted its biggest component, which has 612,949 nodes and 2,345,178 edges in a snapshot taken on December 5, 2011, and which is available online. We randomly chose a test set of 50 nodes and removed them from the graph, ensuring that the graph remained connected. After training, each node was added back into the graph along with its associated edges. We trained the graph using Algorithm 1 and stored the approximate embedding in order to query c_old in the iECT algorithm.
The batch method used the approximate embedding created from the new graph after adding each test point. iECT took 8 milliseconds on average over the 50 test data points to detect whether each test point was an anomaly. The batch method, which is the fastest approximation of commute time to date, required 1,454 seconds on average to process each test data point. This dramatically highlights the constant time complexity of the iECT algorithm and suggests that iECT is highly suitable for computing commute time in an incremental fashion. Since there was no anomaly information in the random test set, we cannot report detection accuracy here. The average anomaly score over all

the test points for iECT was only 8.6% higher than for the batch method. This shows the high accuracy of the iECT approximation even on a very large graph.

5.4 Real Datasets

In this experiment, we report results for online anomaly detection on real datasets from different application domains: network intrusion detection, video surveillance and bridge damage detection.

Spambase dataset: The Spambase dataset provided by the Machine Learning Repository [4] was investigated. There are 4,601 emails in the data with 57 features each. The task is to check whether an email is spam or not. Since the dataset has duplicated data instances, and the numbers of spam and non-spam emails are not imbalanced, we removed duplicates, kept the non-spam emails, and sampled 100 spam emails from the dataset, leaving 2,631 data instances.

Computer network anomaly detection: The dataset is from a wireless mesh network at the University of Sydney deployed by NICTA [20]. A traffic generator was used to simulate traffic on the network. Packets were aggregated into one-minute time bins, and the data were collected over 24 hours. There were 391 origin-destination flows and 1,270 time bins. Anomalies, including DoS attacks and ping floods, were introduced to the network. After removing duplicates in the data, we had 1,193 time-bin instances.

Damage detection on a bridge: The Sydney Harbour Bridge is one of the major bridges in Australia, opened in 1932. As the bridge ages, it is critical to ensure that it stays structurally healthy. There are 800 jack arches on the underside of the deck of the bus lane (lane seven) that need to be monitored. Vibration data caused by passing vehicles were recorded by three-axis accelerometers installed under the deck of lane seven. For this case study, only six instrumented joints were considered (named 1 to 6).
The data were obtained in the period from early August until late October. A known crack existed in joint 4 while the other joints were in good condition. Feature extraction was performed as described in [7]. A dataset was created to include vibration events from all healthy joints and 100 events from the damaged joint (2,523 events in total).

Each dataset was divided into a training set and a test set with 100 data points, except that in the video dataset the test set contained only 38 data objects. The anomaly threshold τ was set based on the training data as the weakest score of the anomalies in the training set. The results of the iECT and batch methods are shown in Table 4: iECT has high detection accuracy and is much more efficient than the batch method, and the commute time scores of iECT and the batch method were quite similar.

5.5 Impact of Parameters

In this section, we investigate how the parameters k_1, k_2 and k_RP affect the effectiveness of the proposed method. Parameters k_1 and k_2 only affect the accuracy of computing commute time in batch mode and were analyzed in [8]. Therefore, this section analyses the impact of k_RP on the incremental commute time.
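The insensitivity to k_RP is what Johnson-Lindenstrauss-style random projection predicts. The sketch below (on synthetic high-dimensional points standing in for the un-projected embedding; all dimensions are illustrative) measures pairwise-distance distortion for several values of k_RP:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative stand-in for an m-dimensional embedding before projection.
n, m = 200, 5000
X = rng.normal(size=(n, m))

def max_distortion(k_rp, n_pairs=200):
    """Worst relative distortion of sampled pairwise squared distances after
    projecting onto k_rp random +/-1 directions (Johnson-Lindenstrauss style)."""
    Q = rng.choice([-1.0, 1.0], size=(m, k_rp)) / np.sqrt(k_rp)
    Y = X @ Q
    a = rng.integers(0, n, size=n_pairs)
    b = (a + rng.integers(1, n, size=n_pairs)) % n          # distinct partners
    d_orig = ((X[a] - X[b]) ** 2).sum(axis=1)
    d_proj = ((Y[a] - Y[b]) ** 2).sum(axis=1)
    return np.abs(d_proj / d_orig - 1.0).max()

for k_rp in (50, 200, 800):
    print(k_rp, max_distortion(k_rp))   # distortion generally shrinks as k_rp grows
```

The distortion decays roughly as $1/\sqrt{k_{RP}}$, which is consistent with a moderate value such as k_RP = 200 already giving accurate results in the experiments above.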

Table 4: The effectiveness of iECT on real datasets. iECT has high detection accuracy and is much more efficient than the batch method.

We conducted an experiment with different values of k_RP for the three real datasets mentioned in the previous section. The results in Figure 4 show that the method can achieve high accuracy with small k_RP and is not sensitive to k_RP.

Fig. 4: Accuracy versus k_RP for (a) Spambase, (b) Network and (c) Bridge. The method achieves high accuracy with small k_RP and is not sensitive to k_RP.

5.6 Summary and Discussion

The experimental results show that iECT can accurately approximate the commute time in constant time. It is much more efficient than the batch method of Algorithm 1. The results on real datasets collected from different domains and applications show a similar tendency, demonstrating the reliability and effectiveness of the proposed method.

One weakness of iECT is that it can only be used in online applications where the update of the graph is the addition of a new node, not an update of edge weights. In the case of updating edge weights, however, the method by Ning et al. [12] can be used. This method incrementally updates the eigenvalues and eigenvectors of the graph Laplacian matrix based on a change of an edge weight on the graph; the new eigenpairs of the Laplacian can then be used to update the commute time.

6 Related Work

Khoa and Chawla [8] proposed a method to find anomalies using commute time. They showed that, unlike Euclidean distance, the commute time between two nodes captures both the distance between them and their densities, so that distance-based methods such as those in [2] can capture both global and local anomalies.

Incremental learning via updates to an eigen decomposition has been studied for a long time. Early work studied the rank-one modification of the symmetric eigen decomposition [5, 6], reducing the original problem to the eigen decomposition of a diagonal matrix. Although these methods give a good approximation of the new eigenpairs, they are not suitable for today's online applications since the update requires at least O(n^2) computation. A more recent approach is based on matrix perturbation theory [1]. It uses a first-order perturbation analysis of the rank-one update of a data covariance matrix to compute the new eigenpairs, and the resulting algorithms run in linear time. An advantage of using the covariance matrix is that, if the perturbation is the insertion of a new point, the size of the covariance matrix is unchanged. The approach therefore cannot be applied directly when the matrix size grows with each insertion: in spectral clustering or commute time based anomaly detection, for example, the size of the graph Laplacian matrix increases when a new point is added to the graph. Ning et al. [12] proposed an incremental approach for spectral clustering to monitor evolving blog communities. It incrementally updates the eigenvalues and eigenvectors of the graph Laplacian after a change of an edge weight, using the first-order error of the generalized eigen system. This algorithm is only suitable for weight updates, not for the addition of a new node.

7 Conclusion

In this paper, we proposed a method to approximate the commute time incrementally and used it to design an online anomaly detection application. The method incrementally estimates the commute time in constant time using properties of random walks and hitting times.
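The hitting-time recursion behind this idea is H(i, j) = 1 + sum_k P(i, k) H(k, j) with H(j, j) = 0, where P is the transition matrix of the random walk, and the commute time is c(i, j) = H(i, j) + H(j, i). A naive fixed-point sketch of this recursion on a toy graph follows; it is for illustration only, since the paper's contribution is truncating the expansion so that each new node costs constant time:

```python
import numpy as np

def hitting_times(W, iters=500):
    """Expected hitting times via the recursion
    H(i, j) = 1 + sum_k P(i, k) * H(k, j), with H(j, j) = 0."""
    P = W / W.sum(axis=1, keepdims=True)  # random-walk transition matrix
    H = np.zeros_like(P)
    for _ in range(iters):
        H = 1.0 + P @ H                   # take one step, then hit j from there
        np.fill_diagonal(H, 0.0)          # boundary condition: already at the target
    return H

# Toy example: unweighted path graph 0 - 1 - 2
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = hitting_times(W)
C = H + H.T   # commute time: go there and come back
```

For this graph H(0, 1) = 1 and H(1, 0) = 3, so c(0, 1) = 4 and c(0, 2) = 8. Note that hitting times are not symmetric; symmetry is only recovered in the commute time.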
The main idea is to expand the hitting time recursion until the random walk has moved a few steps away from the new node and then reuse the old values. The experimental results on synthetic and real datasets show the effectiveness of the proposed approach in terms of both performance and accuracy. iect can incrementally estimate the commute time accurately, resulting in high accuracy on several datasets from different applications. It took only 8 milliseconds on average to process a newly arriving node in a graph with more than 600,000 nodes and two million edges. Moreover, the idea of this work can be extended to other applications which utilize the commute time.

References

1. Agrawal, R.K., Karmeshu: Perturbation scheme for online learning of features: Incremental principal component analysis. Pattern Recogn. 41 (2008)
2. Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: KDD '03: Proc. of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, USA (2003)

3. Fouss, F., Renders, J.M.: Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering 19(3) (2007)
4. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010)
5. Golub, G.H.: Some modified matrix eigenvalue problems. SIAM Review 15(2) (1973)
6. Gu, M., Eisenstat, S.C.: A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl. 15 (1994)
7. Khoa, N.L., Zhang, B., Wang, Y., Chen, F., Mustapha, S.: Robust dimensionality reduction and damage detection approaches in structural health monitoring. Structural Health Monitoring 13(4) (2014)
8. Khoa, N.L.D., Chawla, S.: Robust outlier detection using commute time and eigenspace embedding. In: PAKDD '10: Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin/Heidelberg (2010)
9. Koutis, I., Miller, G.L., Tolliver, D.: Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing. In: Proceedings of the 5th International Symposium on Advances in Visual Computing: Part I. ISVC '09, Springer-Verlag, Berlin, Heidelberg (2009)
10. Lovász, L.: Random walks on graphs: a survey. Combinatorics, Paul Erdős is Eighty 2, 1-46 (1993)
11. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4) (2007)
12. Ning, H., Xu, W., Chi, Y., Gong, Y., Huang, T.: Incremental spectral clustering with application to monitoring of evolving blog communities. In: SIAM Int. Conf. on Data Mining (2007)
13. Sarkar, P., Moore, A.W.: A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In: The 23rd Conference on Uncertainty in Artificial Intelligence (UAI) (2007)
14. Qiu, H., Hancock, E.: Clustering and embedding using commute times. IEEE TPAMI 29(11) (2007)
15. Saerens, M., Fouss, F., Yen, L., Dupont, P.: The principal components analysis of a graph, and its relationships to spectral clustering. In: Proc. of the 15th European Conference on Machine Learning (ECML 2004). Springer-Verlag (2004)
16. Sarkar, P., Moore, A.W., Prakash, A.: Fast incremental proximity search in large graphs. In: Proceedings of the 25th international conference on Machine learning. ICML '08, ACM, New York, NY, USA (2008)
17. Spielman, D.A., Srivastava, N.: Graph sparsification by effective resistances. In: Proceedings of the 40th annual ACM symposium on Theory of computing. STOC '08, ACM, New York, NY, USA (2008)
18. Spielman, D.A., Teng, S.H.: Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. CoRR abs/cs/ (2006)
19. Venkatasubramanian, S., Wang, Q.: The Johnson-Lindenstrauss transform: An empirical study. In: Müller-Hannemann, M., Werneck, R.F.F. (eds.) ALENEX. SIAM (2011)
20. Zaidi, Z.R., Hakami, S., Landfeldt, B., Moors, T.: Real-time detection of traffic anomalies in wireless mesh networks. Wireless Networks (2009)


More information

Graph Metrics and Dimension Reduction

Graph Metrics and Dimension Reduction Graph Metrics and Dimension Reduction Minh Tang 1 Michael Trosset 2 1 Applied Mathematics and Statistics The Johns Hopkins University 2 Department of Statistics Indiana University, Bloomington November

More information

Faloutsos, Tong ICDE, 2009

Faloutsos, Tong ICDE, 2009 Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Random Matrices: Invertibility, Structure, and Applications

Random Matrices: Invertibility, Structure, and Applications Random Matrices: Invertibility, Structure, and Applications Roman Vershynin University of Michigan Colloquium, October 11, 2011 Roman Vershynin (University of Michigan) Random Matrices Colloquium 1 / 37

More information

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint

More information

1 Matrix notation and preliminaries from spectral graph theory

1 Matrix notation and preliminaries from spectral graph theory Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.

More information

Single-tree GMM training

Single-tree GMM training Single-tree GMM training Ryan R. Curtin May 27, 2015 1 Introduction In this short document, we derive a tree-independent single-tree algorithm for Gaussian mixture model training, based on a technique

More information

Diffusion and random walks on graphs

Diffusion and random walks on graphs Diffusion and random walks on graphs Leonid E. Zhukov School of Data Analysis and Artificial Intelligence Department of Computer Science National Research University Higher School of Economics Structural

More information

Limits of Spectral Clustering

Limits of Spectral Clustering Limits of Spectral Clustering Ulrike von Luxburg and Olivier Bousquet Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tübingen, Germany {ulrike.luxburg,olivier.bousquet}@tuebingen.mpg.de

More information

Statistical and Computational Analysis of Locality Preserving Projection

Statistical and Computational Analysis of Locality Preserving Projection Statistical and Computational Analysis of Locality Preserving Projection Xiaofei He xiaofei@cs.uchicago.edu Department of Computer Science, University of Chicago, 00 East 58th Street, Chicago, IL 60637

More information

Spectral Clustering. Guokun Lai 2016/10

Spectral Clustering. Guokun Lai 2016/10 Spectral Clustering Guokun Lai 2016/10 1 / 37 Organization Graph Cut Fundamental Limitations of Spectral Clustering Ng 2002 paper (if we have time) 2 / 37 Notation We define a undirected weighted graph

More information

Spectral clustering. Two ideal clusters, with two points each. Spectral clustering algorithms

Spectral clustering. Two ideal clusters, with two points each. Spectral clustering algorithms A simple example Two ideal clusters, with two points each Spectral clustering Lecture 2 Spectral clustering algorithms 4 2 3 A = Ideally permuted Ideal affinities 2 Indicator vectors Each cluster has an

More information

Linear Spectral Hashing

Linear Spectral Hashing Linear Spectral Hashing Zalán Bodó and Lehel Csató Babeş Bolyai University - Faculty of Mathematics and Computer Science Kogălniceanu 1., 484 Cluj-Napoca - Romania Abstract. assigns binary hash keys to

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Graph and Network Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Methods Learnt Classification Clustering Vector Data Text Data Recommender System Decision Tree; Naïve

More information

Robust Motion Segmentation by Spectral Clustering

Robust Motion Segmentation by Spectral Clustering Robust Motion Segmentation by Spectral Clustering Hongbin Wang and Phil F. Culverhouse Centre for Robotics Intelligent Systems University of Plymouth Plymouth, PL4 8AA, UK {hongbin.wang, P.Culverhouse}@plymouth.ac.uk

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition Yu-Seop Kim 1, Jeong-Ho Chang 2, and Byoung-Tak Zhang 2 1 Division of Information and Telecommunication

More information

Application of Clustering to Earth Science Data: Progress and Challenges

Application of Clustering to Earth Science Data: Progress and Challenges Application of Clustering to Earth Science Data: Progress and Challenges Michael Steinbach Shyam Boriah Vipin Kumar University of Minnesota Pang-Ning Tan Michigan State University Christopher Potter NASA

More information

Spectral Bandits for Smooth Graph Functions with Applications in Recommender Systems

Spectral Bandits for Smooth Graph Functions with Applications in Recommender Systems Spectral Bandits for Smooth Graph Functions with Applications in Recommender Systems Tomáš Kocák SequeL team INRIA Lille France Michal Valko SequeL team INRIA Lille France Rémi Munos SequeL team, INRIA

More information

Lecture: Modeling graphs with electrical networks

Lecture: Modeling graphs with electrical networks Stat260/CS294: Spectral Graph Methods Lecture 16-03/17/2015 Lecture: Modeling graphs with electrical networks Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough.

More information

Robust Laplacian Eigenmaps Using Global Information

Robust Laplacian Eigenmaps Using Global Information Manifold Learning and its Applications: Papers from the AAAI Fall Symposium (FS-9-) Robust Laplacian Eigenmaps Using Global Information Shounak Roychowdhury ECE University of Texas at Austin, Austin, TX

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information