arxiv: v1 [cs.si] 13 Nov 2014


Anomaly Detection in Dynamic Networks of Varying Size

Timothy La Fond (1), Jennifer Neville (1), Brian Gallagher (2)
(1) Purdue University, (2) Lawrence Livermore National Laboratory
{tlafond,neville}@purdue.edu, bgallagher@llnl.gov

ABSTRACT
Dynamic networks, also called network streams, are an important data representation that applies to many real-world domains. Many sets of network data such as e-mail networks, social networks, or internet traffic networks are best represented by a dynamic network due to the temporal component of the data. One important application in the domain of dynamic network analysis is anomaly detection. Here the task is to identify points in time where the network exhibits behavior radically different from a typical time, either due to some event (like the failure of machines in a computer network) or a shift in the network properties. This problem is made more difficult by the fluid nature of what is considered normal network behavior. The volume of traffic on a network, for example, can change over the course of a month or even vary with the time of day without being considered unusual. Anomaly detection tests using traditional network statistics have difficulty in these scenarios due to their Density Dependence: as the volume of edges changes, the value of the statistics changes as well, making it difficult to determine whether a change in signal is due to the traffic volume or to some fundamental shift in the behavior of the network. To more accurately detect anomalies in dynamic networks, we introduce the concept of Density-Consistent network statistics. These statistics are designed to produce results that reflect the state of the network independent of the volume of edges. On synthetically generated graphs, anomaly detectors using these statistics show improved recall when distinguishing graphs drawn from different distributions.
When applied to several real datasets, Density-Consistent statistics recover multiple network events which standard statistics failed to find, and the times flagged as anomalies by Density-Consistent statistics have subgraphs with radically different structure from normal time steps.

1. INTRODUCTION
Network analysis is a broad field, but one of its more important applications is the detection of anomalous or critical events. These anomalies could be a machine failure on a computer network, an example of malicious activity, or the repercussions of some event on a social network's behavior [16, 12]. In this paper, we focus on the task of anomaly detection in a dynamic network where the structure of the network is changing over time. For example, each time step could represent one day's worth of activity on an e-mail network. The goal is then to identify any time steps where the pattern of those communications seems abnormal compared to that of other time steps.

As comparing the communication patterns of two network examples directly is complex, one simple approach is to summarize each network using a network statistic and then compare the statistics. A number of anomaly detection methods rely on these statistics [15, 2, 8]. Another method is to use a network model such as ERGM [13]. However, both of these methods often encounter difficulties when the properties of the network are not static. A typical real-world network experiences many changes in the course of its natural behavior, changes which are not examples of anomalous events. The most common of these is variation in the volume of edges. In the case of an e-mail network where the edges represent messages, the total number of messages could vary based on the time, or there could be random variance in the number of messages sent each day.
The statistics used to measure network properties are usually intended to capture some effect other than the volume of edges: for example, the clustering coefficient is usually treated as a measure of transitivity. However, the common clustering coefficient measure is statistically inconsistent as the density of the network changes. Even on an Erdős-Rényi network, which does not explicitly capture transitive relationships, the clustering coefficient will be greater as the density of the network increases. When statistics vary with the number of edges in the network, it is not valid to compare different network time steps using those statistics unless the number of edges is constant in each time step. A similar effect occurs with network models that employ features or statistics which are size sensitive: [11] show that ERGM models learn different parameters given subsets of the same graph, so even if the network properties are identical, observing a smaller portion of the network leads to learning a different set of parameters.

The purpose of this work is to analytically characterize statistics by their sensitivity to network density, and to offer principled alternatives that are consistent estimators, which empirically give more accurate results on networks with varying densities. The major contributions of this paper are:

- We prove that several commonly used network statistics are Density Dependent and poorly reflect the network behavior if the network size is not constant.
- We offer alternative statistics that are Density Consistent, which measure changes to the distribution of edges regardless of the total number of observed edges.
- We demonstrate through theory and synthetic trials that anomaly detection tests using Density Consistent statistics are better at identifying when the distribution of edges in a network has changed.
- We apply anomaly detection tests using both types of statistics to real data to show that Density Consistent statistics recover more major events of the network stream, while Density Dependent statistics flag many time steps due to a change in the total edge count rather than an identifiable anomaly.
- We analyze the subgraphs that changed the most in the anomalous time steps and demonstrate that Density Consistent statistics are better at finding local features which changed radically during the anomaly.

2. STATISTIC-BASED ANOMALY DETECTION
A statistic-based anomaly detection method is any method which makes its determination of anomalous behavior using network statistics calculated on the graph examples. The actual anomaly detection process can be characterized in the form of a hypothesis test. The network statistics calculated on examples demonstrating normal network behavior form the null distribution, while the statistic on the network being tested for anomalies forms the test statistic.
If the test statistic is not likely to have been drawn from the null distribution, we can reject the null hypothesis (that the test network reflects normal behavior) and conclude that it is anomalous.

Let $G_t = \{V, E_t\}$ be a multigraph that represents a dynamic network, where $V$ is the node set and $E_t$ is the set of edges at time $t$, with $e_{ij,t}$ the number of edges between nodes $i$ and $j$ at time $t$. The edges represent the number of interactions that occurred between the nodes within a discrete window of time. As the number of participating nodes is relatively static compared to the number of communications, we will assume that the node set is constant; in time steps where a node has no communications we will treat it as being part of the network but having zero edges.

Let us define some network statistic $S_k(G_t)$ designed to measure network property $k$, which we will use as the test statistic (e.g., clustering coefficient). Given some set of time steps $t_{L_x} \in \{t_{L_1}, \dots, t_{L_{\max}}\}$ such that the graphs in those times $\{G_{t_{L_x}}\}$ are all examples of normal network behavior, $S_k$ is calculated on each of these learning set examples to estimate an empirical null distribution. If $t_{test}$ is the time step we are testing for abnormality, then the value $S_k(G_{t_{test}})$ is the test statistic. Given a specified significance level $\alpha$, we can find thresholds that reject an $\alpha$ fraction of the null distribution, then reject the null hypothesis (conclude an anomaly is present) if the test statistic $S_k(G_{t_{test}})$ falls outside those bounds. For this work we will use a two-tailed Z-test with a significance level of $\alpha = 0.05$ and thresholds $\phi_{lower}$ and $\phi_{upper}$. Anomalous test cases where the null hypothesis is rejected correspond to true positives; normal cases where the null hypothesis is rejected correspond to false positives.
Likewise, anomalous cases where the null is not rejected correspond to false negatives, and normal cases where the null is not rejected correspond to true negatives.

Deltacon [5] is an example of the statistic-based approach, as are many others [7, 8, 9, 14]. As these models also often incorporate network statistics, we will focus on the statistics themselves in this paper. Moreover, not all methods rely solely on statistics calculated from single networks: there are also delta statistics, which measure the difference between two network examples. Netsimile is an example of such a network comparison statistic [1]. For a dynamic network, these delta statistics are usually calculated between the networks in two consecutive time steps. In practical use for hypothesis testing, these delta statistics function the same as their single network counterparts.

If the network properties being tested are not static with respect to the time $t$, this natural evolution may cause $S_k(G_t)$ to change over time regardless of anomalies, which makes the null distribution invalid. It is useful in these instances to replace the statistic with a detrended version $S_k^{*}(G_t) = S_k(G_t) - f(t)$, where the function $f(t)$ is some fit to the original statistic values. [6] describes how to do dynamic anomaly detection using a linear detrending function, but other functions can be used for the detrending. This detrending operation does not change the overall properties of the statistic, so for the remainder of the paper assume $S_k(G_t)$ refers to a detrended version, if appropriate.

2.1 Common Network Statistics
Listed here are some of the more commonly used network statistics for anomaly detection.

Graph Edit Distance: The graph edit distance (GED) [4] is defined as the number of edits that one network needs to undergo to transform into another network and is a measure of how different the network topologies are.
In the context of a dynamic network the GED is applied as a delta measure to report how quickly the network is changing:

$$GED(G_t) = |V_t| + |V_{t-1}| - 2\,|V_t \cap V_{t-1}| + |E_t| + |E_{t-1}| - 2\,|E_t \cap E_{t-1}| \tag{1}$$

Degree Distribution Difference: Define $D_{i,t} = \sum_{j \neq i} e_{ij,t}$ to be the degree of node $i$, the total number of messages to or from the node. The degree distribution is then a histogram of the number of nodes with a particular degree. Typically real-world networks exhibit a power-law

degree distribution [3], but others are possible. To compare the degree distributions of $t$ and $t-1$, one option is to take the squared difference between the degree counts for each possible degree. We will call this the degree distribution difference (DD):

$$DD(G_t) = \sum_{k=1}^{\max_i(D_{i,t},\, D_{i,t-1})} \Big( \sum_i I(D_{i,t} = k) - \sum_i I(D_{i,t-1} = k) \Big)^2 \tag{2}$$

Other measures can be used to compare the two distributions, but the statistical properties of the squared distance described later extend to other distance measures.

Weighted Clustering Coefficient: The clustering coefficient is intended as a measure of the transitivity of the network. It is a critical property of many social networks, as the odds of interacting with friends of your friends often lead to these triangular relationships. As the standard clustering coefficient is not designed for weighted graphs, we will be analyzing a weighted clustering coefficient, specifically the Barrat weighted clustering coefficient (CB) [10]:

$$CB(G_t) = \sum_{i}^{N} \frac{1}{(k_i - 1)\, s_i} \sum_{j,k} \frac{e_{i,j} + e_{i,k}}{2}\, a_{i,j}\, a_{i,k}\, a_{j,k} \tag{3}$$

where $a_{i,j} = I[e_{i,j} > 0]$, $s_i = \sum_j e_{i,j}$, and $k_i = \sum_j a_{i,j}$. Other weighted clustering coefficients exist, but they behave similarly to the Barrat coefficient.

3. DENSITY DEPENDENCE
To illustrate why dependency of the network statistic on the edge count affects the conclusion of hypothesis tests, we will first investigate statistics that are density dependent.

Definition 3.1. A statistic $S$ is density dependent if the value of $S(G_t)$ is dependent on the density of $G_t$ (i.e., $|E_t|$).

Theorem 3.1 (False Positives for Density Dependent Statistics). Let $L = \{G_{t_{L_x}}\}$ be a learning set of graphs and $G_{t_{test}}$ be the test graph. If $S_k(G_t)$ is monotonically dependent on $|E_t|$ and $\{|E_{t_{L_x}}|\}$ is bounded by finite $|E_{lower}|$ and $|E_{upper}|$, there is some $|E_{t_{test}}|$ that will cause the test case to be rejected regardless of whether the network is an example of an anomaly with respect to property $k$.

Proof.
Let $S_k(G)$ be a network statistic that is a monotonic and divergent function with respect to the number of edges in $G$. Given a set of learning graphs $\{G_{t_L}\}$, the values of $S_k(\{G_{t_L}\})$ are bounded by $\max(S_k(\{G_{t_L}\}))$ and $\min(S_k(\{G_{t_L}\}))$, so the critical points $\phi_{lower}$ and $\phi_{upper}$ of a hypothesis test using this learning set will be within these bounds. Since an increasing $|E_{t_{test}}|$ implies that $S_k(G_{t_{test}})$ increases or decreases, there exists an $|E_{t_{test}}|$ such that $S_k(G_{t_{test}})$ is not within $\phi_{lower}$ and $\phi_{upper}$ and will be rejected by the test.

If changing the number of edges in an observed network changes the output of the statistic, then a test network that differs sufficiently in its number of edges from the learning examples will cause the null hypothesis to be rejected regardless of the other network properties. As for why it is not sufficient to simply label these times as anomalies (due to unusual edge volume): the hypothesis test is designed to test for abnormality in a specific network property. If edge count anomalies also flag anomalies on other network properties, we cannot distinguish the case where both are anomalous from the case where only the message volume is. If just the message volume is unusual, this might simply be an exceptionally busy day where the pattern of communication is roughly the same, just elevated. This is a very different case from one where both the volume and the distribution of edges are unusual.

A second problem occurs when the edge counts in the learning set have high variance. If the statistic is dependent on the number of edges, noise in the edge counts translates to noise in the statistic values, which lowers the statistical power of the test.
Theorem 3.2 (False Negatives for Density Dependent Statistics). For any $S_k(G_{t_{test}})$ calculated on a network that is anomalous with respect to property $k$, if $S_k(G_t)$ is dependent on $|E_t|$ there is some value of the variance of the learning network edge counts such that $S_k(G_{t_{test}})$ is not detected as an anomaly.

Proof. Let $S_k(G_{t_{test}})$ be the test statistic and $\{G_{t_L}\}$ be the set of learning graphs, where the edge count of any learning graph $|E_{t_{L_x}}|$ is drawn according to distribution $\Phi_E$. If $S_k(G_t)$ is a monotonic divergent function of $|E_t|$, then as the variance of $\Phi_E$ increases the variance of $S_k(G_{t_{L_x}})$ increases as well. For a given $\alpha$, the hypothesis test thresholds $\phi_{lower}$ and $\phi_{upper}$ will widen as the variance increases in order to cover a $1 - \alpha$ fraction of the learning set instances. Therefore, for a given $S_k(G_{t_{test}})$ there is some value of $var(\Phi_E)$ such that $\phi_{lower} < S_k(G_{t_{test}}) < \phi_{upper}$.

With a sufficient amount of edge count noise, the statistical power of the anomaly detector drops to zero.

These theorems have been defined using a statistic calculated on a single network, but some statistics are delta measures which are measured on two networks. In these cases, the edge counts of either or both of the networks can cause edge dependency issues.

Lemma 3.3. If a delta statistic $S_k(G_t, G_{t-1})$ is dependent on either $|E_{\min}|(G_t) = \min(|E_t|, |E_{t-1}|)$ or $|E_\Delta|(G_t) = abs(|E_t| - |E_{t-1}|)$, then Theorems 3.1 and 3.2 apply to the delta statistic.

Proof. For some delta statistic $S_k(G_t, G_{t-1})$, if the statistic is dependent on $|E_t|$, $|E_{t-1}|$, or both, Theorems 3.1 and 3.2 apply to any edge count which influences the statistic. If $S_k(G_t, G_{t-1})$ depends on $|E_\Delta| = abs(|E_t| - |E_{t-1}|)$, then as either $|E_t|$ or $|E_{t-1}|$ changes the statistic changes, leading to the problems described in Theorems 3.1 and 3.2. If $S_k(G_t, G_{t-1})$ depends on $|E_{\min}| = \min(|E_t|, |E_{t-1}|)$, then if both $|E_t|$ and $|E_{t-1}|$ increase or decrease the statistic is affected, leading to the same types of errors.
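The two failure modes in Theorems 3.1 and 3.2 can be illustrated with a small sketch. It assumes a toy density-dependent statistic, the raw degree of a single hub node, and compares it with the same quantity divided by the edge count. The function names, edge counts, and degree values are made up for illustration and are not taken from the paper's experiments; the two-tailed Z-test uses the standard critical value 1.96 for α = 0.05.

```python
from statistics import mean, stdev

Z_CRIT = 1.96  # two-tailed critical value for alpha = 0.05


def is_anomaly(null_values, test_value, z_crit=Z_CRIT):
    """Two-tailed Z-test of test_value against an empirical null distribution."""
    mu = mean(null_values)
    sigma = stdev(null_values)
    if sigma == 0:
        # Degenerate null: any deviation from the constant value is rejected.
        return test_value != mu
    return abs(test_value - mu) / sigma > z_crit


def hub_statistics(edge_counts, hub_degrees):
    """Return (raw hub degrees, hub degrees divided by the edge count)."""
    raw = list(hub_degrees)
    normalized = [d / e for d, e in zip(hub_degrees, edge_counts)]
    return raw, normalized


# Theorem 3.1 (false positive): stable learning edge counts, then a test step
# with twice the volume but the same share of edges (about 10%) on the hub.
raw_null, norm_null = hub_statistics([950, 1000, 1050, 980, 1020],
                                     [96, 100, 104, 99, 101])
fp_raw = is_anomaly(raw_null, 201)            # hub degree 201 out of 2000 edges
fp_norm = is_anomaly(norm_null, 201 / 2000)

# Theorem 3.2 (false negative): noisy learning edge counts, then a test step
# with typical volume but 13% of the edges on the hub instead of about 10%.
raw_null2, norm_null2 = hub_statistics([200, 600, 1000, 1400, 1800],
                                       [20, 61, 100, 139, 181])
fn_raw = is_anomaly(raw_null2, 130)           # hub degree 130 out of 1000 edges
fn_norm = is_anomaly(norm_null2, 130 / 1000)

print(fp_raw, fp_norm)  # raw degree flags the busy step; the normalized one does not
print(fn_raw, fn_norm)  # raw degree misses the shift; the normalized one flags it
```

In this toy setting the raw degree rejects a busy but otherwise ordinary step and fails to reject a genuine shift in the hub's share of edges, while the normalized value behaves the opposite way in both cases.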

These theorems show that dependency on edge counts can lead to both false positives and false negatives from anomaly detectors that look for unusual network properties other than edge count. In order to distinguish between the observed edge counts in each time step and the other network properties, we need a more specific data model of the network generation process, with a component for each.

4. DENSITY INDEPENDENCE
Now that we have established the problems with density dependence, we need to define the properties that we would prefer our network statistics to have. To do this we need a more detailed model of how the graph examples were generated.

4.1 Data Model
Let the number of edges in any time step be a random variable $|E_t|$ drawn from distribution $M_E^n(t)$ in times where there is a normal message volume and from distribution $M_E^a(t)$ in times where there is anomalous message volume. Now let the distribution of edges amongst the nodes of the graph be represented by an $N \times N$ matrix $P_t$, where the value of any cell $p_{ij,t}$ is the probability of observing a message between a particular pair of nodes at a particular time. This is a probability distribution, so the total matrix mass sums to 1. Like the edge count, treat this matrix as drawn from distribution $M_P^n(t)$ in normal times and $M_P^{a_k}(t)$ in anomalous times, where $k$ is the network property that is anomalous (for example, an atypical degree distribution).

Any observed network slice can be treated as having been generated by a multinomial sampling process where $|E_t|$ edges are selected independently from the $N \times N$ cells with probabilities $P_t$. Denote the sampling procedure for a graph $G_t$ by $F(|E_t|, P_t)$. In the next section, we will detail how this decomposition into the count of edges and their distribution allows us to define statistics which are not sensitive to the number of edges in the network.
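The sampling procedure $F(|E_t|, P_t)$ can be sketched as follows. This is a minimal illustration, not the authors' implementation: the name `sample_graph`, the dictionary representation of $P_t$, and the three-node distribution are assumptions made for the example.

```python
import random
from collections import Counter


def sample_graph(num_edges, P, rng):
    """Draw num_edges edges independently from the cells of P (multinomial sampling).

    P is a dict mapping node pairs (i, j) to probabilities that sum to 1.
    Returns a Counter of edge multiplicities e_ij.
    """
    pairs = list(P)
    weights = [P[pair] for pair in pairs]
    return Counter(rng.choices(pairs, weights=weights, k=num_edges))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
P_t = {(0, 1): 0.5, (0, 2): 0.3, (1, 2): 0.2}  # hypothetical 3-node distribution

quiet_step = sample_graph(100, P_t, rng)    # same P_t, low volume
busy_step = sample_graph(10000, P_t, rng)   # same P_t, high volume
print(sum(quiet_step.values()), sum(busy_step.values()))
```

Both graphs come from the same edge distribution and differ only in volume, which is the situation a density-consistent statistic should treat as normal.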
4.2 Density Consistency and Unbiasedness
In the above data model, $P_t$ is the distribution of edges in the network, so any property of the network aside from the volume of edges is encapsulated by the $P_t$ matrix. Therefore, a network statistic designed to capture some network property other than edge count should be a function of $P_t$ alone. Let $S_k(P_t)$ be some test statistic designed to capture a network property $k$ of $P_t$ that is independent of the density of $G_t$. Since $P_t$ is not directly observable, we can estimate the statistic with the empirical statistic $\hat{S}_k(G_t)$, where $G_t$ is used to estimate $P_t$ with $\hat{p}_{ij,t} = e_{ij,t} / |E_t|$.

Definition 4.1. A statistic $S$ is density consistent if $\hat{S}(G_t)$ is a consistent estimator of $S(P_t)$.

If $\hat{S}_k(G_t)$ is a consistent estimator of the true value of $S_k(P_t)$, then observing more edge examples should cause the estimated statistic to converge to the true value given $P_t$. More specifically, it is asymptotically consistent with respect to the true value as the number of observed edges increases. Another way to describe this property is that $\hat{S}_k(G_t)$ has some bias term dependent on the edge count, but the bias converges to zero as the edge count increases:

$$\lim_{|E_t| \to \infty} \hat{S}_k(G_t) - S_k(P_t) = 0.$$

Density consistent statistics allow us to perform accurate hypothesis tests as long as a sufficient number of edges are observed in the networks. To begin, we will prove that the rate of false positives does not exceed the selected significance level $\alpha$.

Theorem 4.1 (False Positive Rate for Density Consistent Statistics). As $|E_t| \to \infty$, if $\hat{S}_k(G_t = F(|E_t|, P_t \sim M_P^n(t)))$ converges to $S_k(P_t)$, the probability of a false positive when testing a time with an edge count anomaly $G_{t_{test}} = F(|E_t|, P_t \sim M_P^n(t))$ approaches $\alpha$.

Proof.
Let all learning set graphs $G_{t_{L_x}} = F(|E_{t_{L_x}}| \sim M_E^n(t_{L_x}),\ P_{t_{L_x}} \sim M_P^n(t_{L_x}))$ be drawn from non-anomalous $E$ and $P$ distributions, and let the test instance $G_{t_{test}} = F(|E_{t_{test}}| \sim M_E^a(t_{test}),\ P_{t_{test}} \sim M_P^n(t_{test}))$ be drawn from an anomalous $E$ distribution but a non-anomalous $P$ distribution. If $\hat{S}_k(G_t)$ is a consistent estimator of $S_k(P_t)$, then $\lim_{|E_t| \to \infty} \hat{S}_k(G_t) = S_k(P_t)$. Then as both $|E_{t_{L_x}}|$ and $|E_{t_{test}}|$ increase, all learning set instances and the test set instance approach the distribution $S_k(P_t \sim M_P^n(t))$. As any threshold is as likely to reject a learning set instance as the test instance, the false positive rate approaches $\alpha$.

As the bias converges to zero, graphs created with the same underlying properties will produce statistic values within the same distribution, making the test case come from the same distribution as the null. Even if the test case has an unusual number of edges, as long as the number of edges is not too small there will not be a false positive. Density consistency is also beneficial in the case of false negatives.

Theorem 4.2 (False Negative Rate for Density Consistent Statistics). As $|E_t| \to \infty$, let $\hat{S}_k(G_t = F(|E_t|, P_t^n \sim M_P^n(t)))$ converge to $S_k(P_t^n)$ and $\hat{S}_k(G_t = F(|E_t|, P_t^a \sim M_P^a(t)))$ converge to $S_k(P_t^a)$. If $S_k(P_t^n)$ and $S_k(P_t^a)$ are separable, then the probability of a false negative approaches 0 as $|E_t|$ increases.

Proof. Let all learning set graphs $G_{t_{L_x}} = F(|E_{t_{L_x}}| \sim M_E^n(t_{L_x}),\ P_{t_{L_x}} \sim M_P^n(t_{L_x}))$ be drawn from non-anomalous $E$ and $P$ distributions, and let the test instance $G_{t_{test}} = F(|E_{t_{test}}| \sim M_E^n(t_{test}),\ P_{t_{test}} \sim M_P^{a_k}(t_{test}))$ be drawn from a non-anomalous $E$ distribution but a $P$ distribution that is anomalous on the network property being tested. Let $\hat{S}_k(G_t)$ be a consistent estimator of $S_k(P_t)$. If $S_k(P_t \sim M_P^{a_k}(t))$ and $S_k(P_t \sim M_P^n(t))$ are separable, then

$$\lim_{|E_{t_{test}}| \to \infty} \hat{S}_k(G_{t_{test}}) \quad \text{and} \quad \lim_{|E_{t_{L_x}}| \to \infty} \hat{S}_k(G_{t_{L_x}})$$

converge to two non-overlapping distributions, and the probability of rejection approaches 1 for any $\alpha$.
The statistical power of a density consistent statistic depends only on whether the $P$ matrices of normal and anomalous graphs are separable using the true statistic value: as long as $|E_t|$ is sufficiently large, the bias is small enough that it is not a factor in the rate of false negatives.

A special case of density consistency is density consistent and unbiased, which refers to statistics where, in addition to being consistent, the empirical statistic is also an unbiased estimator of the true $S_k(P_t)$.

Definition 4.2. A statistic $S$ is density unbiased if $\hat{S}(G_t)$ is an unbiased estimator of $S(P_t)$.

Unbiasedness is a desirable property because a density consistent statistic without it may produce errors due to bias when the number of observed edges is low.

4.3 Proposed Density-Consistent Statistics
We will now define a set of Density-Consistent statistics designed to measure network properties similar to the previously described dependent statistics, but without the sensitivity to total network edge count.

Probability Mass Shift: The probability mass shift (MS) is a parallel to GED as a measure of how much change has occurred between the two networks examined. Mass Shift, however, attempts to measure the change in the underlying $P$ distributions and avoids being directly dependent on the edge counts. The probability mass shift between time steps $t$ and $t-1$ is

$$MS(P_t) = \sum_{ij} (p_{ij,t} - p_{ij,t-1})^2 \tag{4}$$

The MS can be thought of as the total edge probability that changes between the edge distributions in $t$ and $t-1$.

Probabilistic Degree Shift: We now propose a density-consistent counterpart to the degree distribution difference. Define the probabilistic degree of a node to be $PD(v_i) = \sum_{j \in V_t,\, i \neq j} p_{ij,t}$. Then, let the probabilistic degree shift of a particular node in $G_t$ be defined as the squared difference of the probabilistic degree of that node in times $t$ and $t-1$. The total probabilistic degree shift (DS) of $G_t$ is then:

$$DS(P_t) = \sum_i \Big( \sum_{j \neq i} p_{ij,t} - \sum_{j \neq i} p_{ij,t-1} \Big)^2 \tag{5}$$

This is a measure of how much the total probability mass of the nodes in the graph changes across a single time step. If the shape of the degree distribution is changing, the probabilistic degrees of the nodes will be changing as well.
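Equations 4 and 5 can be evaluated directly when the $P$ matrices are known. The sketch below is an illustration under stated assumptions: each $P$ is stored as a list of lists whose entries sum to 1, and the function names and the two 3-node matrices are hypothetical.

```python
def mass_shift(P_t, P_prev):
    """Eq. 4: sum over all cells of the squared change in edge probability."""
    n = len(P_t)
    return sum((P_t[i][j] - P_prev[i][j]) ** 2
               for i in range(n) for j in range(n))


def probabilistic_degree(P, i):
    """PD(v_i): total probability mass on cells (i, j) with j != i."""
    return sum(P[i][j] for j in range(len(P)) if j != i)


def degree_shift(P_t, P_prev):
    """Eq. 5: sum over nodes of the squared change in probabilistic degree."""
    n = len(P_t)
    return sum((probabilistic_degree(P_t, i) - probabilistic_degree(P_prev, i)) ** 2
               for i in range(n))


# Hypothetical example: node 0 is the hub at time t-1, node 1 is the hub at time t.
P_prev = [[0.0, 0.25, 0.25],
          [0.25, 0.0, 0.0],
          [0.25, 0.0, 0.0]]
P_t = [[0.0, 0.25, 0.0],
       [0.25, 0.0, 0.25],
       [0.0, 0.25, 0.0]]

ms = mass_shift(P_t, P_prev)
ds = degree_shift(P_t, P_prev)
print(ms, ds)  # 0.25 0.125
```

Neither function takes an edge count as input, which reflects the requirement that a density-consistent statistic be a function of $P_t$ alone.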
Triangle Probability: As the name suggests, the triangle probability (TP) statistic is an approach to capturing the transitivity of the network and an alternative to traditional clustering coefficient measures. Define the triangle probability as:

$$TP(P_t) = \sum_{i,j,k \in V,\ i \neq j \neq k} p_{ij,t}\; p_{ik,t}\; p_{jk,t} \tag{6}$$

5. PROPERTIES OF NETWORK STATISTICS
Now that we have described the different categories of network statistics and their relationship to the network density, we will characterize several common network statistics as density dependent or consistent, comparing them to our proposed alternatives. Figure 1 summarizes our findings.

Statistic               Dependent   Consistent   Unbiased
Graph edit distance         x
Degree distribution         x
Barrat clustering                       x
Mass shift                              x
Degree shift                            x
Triangle probability                    x            x

Figure 1: Statistical properties of previous network statistics and our proposed alternatives.

5.1 Graph Edit Distance
Claim 5.1. GED is a density dependent statistic.

When the edge counts of the two time steps are the same, the GED (Eq. 1) can be thought of as the difference in the distribution of edges in the network. However, the GED is sensitive to $|E_t|$ in two ways: through the change in the number of edges from $t-1$ to $t$, $|E_\Delta| = abs(|E_t| - |E_{t-1}|)$, and through the minimum number of edges in the two time steps, $|E_{\min}| = \min(|E_{t-1}|, |E_t|)$. In both cases the statistic is density dependent, and in fact it diverges as the number of edges increases. The first case is discussed in Theorem 5.1, the second in Theorem 5.2.

Theorem 5.1. As $|E_\Delta| \to \infty$, $GED(G_t) \to \infty$, regardless of $P_t$ and $P_{t-1}$.

Proof. Let $|E_\Delta|$ be $abs(|E_t| - |E_{t-1}|)$. Since the GED corresponds to $|E_t| + |E_{t-1}| - 2\,|E_t \cap E_{t-1}|$, the minimum edit distance between $G_t$ and $G_{t-1}$ occurs when their edge sets overlap maximally and is equal to $|E_\Delta|$. Therefore, as $|E_\Delta|$ increases, even the minimum (i.e., best case) $GED(G_t, G_{t-1})$ also increases.

Theorem 5.2. As $|E_{\min}| \to \infty$, $GED(G_t) \to \infty$ if $P_t \neq P_{t-1}$.

Proof. Let $G_t = F(|E_t|, P_t^a)$ and $G_{t-1} = F(|E_{t-1}|, P_{t-1}^b)$, with $P_t^a \neq P_{t-1}^b$. Select two nodes $i, j$ such that $p^a_{ij,t} \neq p^b_{ij,t-1}$.
The edit distance contributed by those two nodes is $abs(e_{ij,t} - e_{ij,t-1})$. Let $|E_{\min}|$ increase while $|E_\Delta|$ remains constant. As $|E_{\min}|$ increases, the edit distance of the two nodes converges to $abs(|E_{\min}|\, p^a_{ij,t} - |E_{\min}|\, p^b_{ij,t-1}) = |E_{\min}| \cdot abs(p^a_{ij,t} - p^b_{ij,t-1})$. Since every pair of nodes with differing edge probabilities in the two time steps will have increasing edit distance as $|E_{\min}|$ increases, the global edit distance will also increase.

Since the GED measure is the literal count of edges and nodes that differ between the graphs, the statistic is dependent on the difference in size between the two graphs. Even if the graphs are the same size, comparing two large graphs is likely to produce more differences than comparing two very small graphs, due to random chance. In addition, even when small non-anomalous differences occur between the probability distributions of edges in two time steps, variation in the edge count can result in large differences in GED.

5.2 Degree Distribution Difference
Claim 5.2. DD is a density dependent statistic.

The degree distribution is naturally very dependent on the total degree of a network: the average degree of nodes is larger in networks with many edges. The DD measure (Eq. 2) is again sensitive to $|E_t|$ via $|E_\Delta|$ and $|E_{\min}|$. In both cases the statistic is density dependent. The first case, in Theorem 5.3, shows that as $|E_\Delta|$ increases, the DD measure will also increase even if the graphs were generated with the same $P$ probabilities. The second case, in Theorem 5.4, shows that as $|E_{\min}|$ increases, small variations in $P$ will increase the DD measure.

Theorem 5.3. As $|E_\Delta| \to \infty$, $DD(G_t) \to \infty$, regardless of $P_t$ and $P_{t-1}$.

Proof. Pick any node $i$ in $G_t$ and $j$ in $G_{t-1}$. If $|E_t|$ increases and $|E_{t-1}|$ stays the same, the expected degree of $i$ increases while that of $j$ stays the same; likewise the inverse is true if $|E_{t-1}|$ increases and $|E_t|$ stays the same. Thus, as $|E_\Delta|$ increases, the probability of any two nodes having the same degree approaches zero, so the degree distribution difference of the two networks increases with greater $|E_\Delta|$.

Theorem 5.4. As $|E_{\min}| \to \infty$, $DD(G_t) \to \infty$ if $P_t \neq P_{t-1}$.

Proof. Let $PD(i, G_t) = \sum_{k \neq i} p_{ik,t}$ be the probabilistic degree of node $i$ for $G_t = F(|E_t|, P_t)$. Pick any node $i$ in $G_t$ and $j$ in $G_{t-1}$ such that $PD(i, G_t) \neq PD(j, G_{t-1})$. For a constant $|E_\Delta|$, as $|E_{\min}|$ increases, the expected degrees of $i$ and $j$ converge to $D_{i,t} = |E_{\min}| \sum_{k \neq i} p_{ik,t}$ and $D_{j,t-1} = |E_{\min}| \sum_{k \neq j} p_{jk,t-1}$ respectively. This means that the probability of the two nodes having the same degree approaches zero. Since every pair of nodes with differing edge probabilities in the two time steps will have unique degrees, the degree distribution difference will also increase.

If a node has very similar edge probabilities in matrices $P_t$ and $P_{t-1}$, then when few edges are sampled it is likely to have the same degree in both time steps, and thus the impact on the degree distribution difference will be low.
However, as the number of edges increases, even small non-anomalous differences in the $P$ matrices will become more apparent (i.e., the node is likely to be placed in different bins in the degree distribution difference calculation), and the impact on the measure will be larger.

5.3 Weighted Clustering Coefficient
Claim 5.3. CB is a density consistent statistic.

As shown in Theorem 5.5 below, the weighted Barrat clustering coefficient (CB, Eq. 3) is in fact density consistent. However, we will show later that the triangle probability statistic is also density unbiased, which gives more robust results, even on very sparse networks.

Theorem 5.5. $CB(G_t)$ is a consistent estimator of $CB(P_t)$, with a bias that converges to 0.

Proof. For $G_t = F(|E_t|, P_t)$, the number of edges observed on any pair of nodes can be represented using a multinomial distribution

$$\frac{|E_t|!}{e_{ij,t}! \cdots e_{yz,t}!}\; p_{ij,t}^{\,e_{ij,t}} \cdots p_{yz,t}^{\,e_{yz,t}}.$$

As $|E_t| \to \infty$, the rate of sampling a particular node pair $i, j$ is $p_{ij,t}$, so:

$$\lim_{|E_t| \to \infty} \frac{e_{ij,t}}{|E_t|} = p_{ij,t}$$

Let $CB(G_t, i)$ refer to the clustering coefficient of node $i$. Then, in the limit of $|E_t| \to \infty$, it converges to:

$$\lim_{|E_t| \to \infty} CB(G_t, i) = \frac{1}{(k_{i,\lim} - 1) \sum_{j \neq i} |E_t|\, p_{ij,t}} \sum_{j,k} \frac{|E_t|\, p_{ij,t} + |E_t|\, p_{ik,t}}{2} \Big( \lim_{|E_t| \to \infty} I[e_{ij,t} > 0]\, I[e_{ik,t} > 0]\, I[e_{jk,t} > 0] \Big)$$

$$= \frac{1}{(k_{i,\lim} - 1) \sum_{j \neq i} p_{ij,t}} \sum_{j,k} \frac{p_{ij,t} + p_{ik,t}}{2}\; I[p_{ij,t} > 0]\, I[p_{ik,t} > 0]\, I[p_{jk,t} > 0]$$

where $k_{i,\lim} = \sum_{j \neq i} I[p_{ij,t} > 0]$. Since this limit can be expressed in terms of $P_t$ alone, and CB is a sum of the clustering over all nodes, the Barrat clustering coefficient is a density consistent statistic.

If we calculate the expectation of the Barrat clustering, we obtain:

$$E[CB(G_t, i)] = E\Big[ \frac{1}{(k_i - 1)\, s_i} \sum_{j,k} \frac{e_{i,j} + e_{i,k}}{2}\, a_{i,j}\, a_{i,k}\, a_{j,k} \Big]$$

$$= E\Big[ \frac{1}{(k_i - 1)\, s_i} \Big] \sum_{j,k} \frac{E[e_{i,j}] + E[e_{i,k}]}{2}\, E[a_{i,j}]\, E[a_{i,k}]\, E[a_{j,k}]$$

$$= E\Big[ \frac{1}{(k_i - 1)\, s_i} \Big] \sum_{j,k} \frac{|E_t| (p_{ij,t} + p_{ik,t})}{2}\, E[a_{i,j}]\, E[a_{i,k}]\, E[a_{j,k}]$$

$$= E\Big[ \frac{1}{(k_i - 1)\, s_i} \Big] \sum_{j,k} \frac{|E_t| (p_{ij,t} + p_{ik,t})}{2}\, \big(1 - (1 - p_{ij,t})^{|E_t|}\big) \big(1 - (1 - p_{ik,t})^{|E_t|}\big) \big(1 - (1 - p_{jk,t})^{|E_t|}\big)$$

As this does not simplify to the limit of CB, it is not an unbiased estimator, and CB is thus density consistent but not unbiased.
Other weighted clustering coefficients are also available, but they have the same properties as the Barrat statistic.

5.4 Probability Mass Shift
Claim 5.4. MS is a density consistent statistic.

As the true $P$ is unobserved, we cannot calculate the Mass Shift (MS, Eq. 4) statistic exactly and must use the empirical Probability Mass Shift: $\widehat{MS}(G_t) = \sum_{ij} (\hat{p}_{ij,t} - \hat{p}_{ij,t-1})^2$. As shown in Theorem 5.6 below, the bias of this estimator approaches 0 as $|E_t|$ and $|E_{t-1}|$ increase, making this a density consistent statistic.

Theorem 5.6. $\widehat{MS}(G_t)$ is a consistent estimator of $MS(P_t)$, with a bias that converges to 0.

Proof. The expectation of the empirical Mass Shift can be calculated with

$$E\Big[ \sum_{ij} (\hat{p}_{ij,t} - \hat{p}_{ij,t-1})^2 \Big] = E\Big[ \sum_{ij} \Big( \frac{e_{ij,t}}{|E_t|} - \frac{e_{ij,t-1}}{|E_{t-1}|} \Big)^2 \Big]$$

$$= \sum_{ij} E\Big[ \frac{e_{ij,t}^2}{|E_t|^2} + \frac{e_{ij,t-1}^2}{|E_{t-1}|^2} - 2\, \frac{e_{ij,t}\, e_{ij,t-1}}{|E_t|\, |E_{t-1}|} \Big]$$

The expectation of $e_{ij,t}^2$ for any node pair $i, j$ can be written as:

$$E[e_{ij,t}^2] = Var(e_{ij,t}) + E[e_{ij,t}]^2 = Var(Bin(|E_t|, p_{ij,t})) + E[Bin(|E_t|, p_{ij,t})]^2 = |E_t|\, p_{ij,t} (1 - p_{ij,t}) + |E_t|^2\, p_{ij,t}^2$$

The expected empirical mass shift can then be written as:

$$\sum_{ij} E\Big[ \frac{e_{ij,t}^2}{|E_t|^2} + \frac{e_{ij,t-1}^2}{|E_{t-1}|^2} - 2\, \frac{e_{ij,t}\, e_{ij,t-1}}{|E_t|\, |E_{t-1}|} \Big]$$

$$= \sum_{ij} \Big[ \frac{|E_t|\, p_{ij,t} (1 - p_{ij,t}) + |E_t|^2\, p_{ij,t}^2}{|E_t|^2} + \frac{|E_{t-1}|\, p_{ij,t-1} (1 - p_{ij,t-1}) + |E_{t-1}|^2\, p_{ij,t-1}^2}{|E_{t-1}|^2} - 2\, \frac{|E_t|\, p_{ij,t}\; |E_{t-1}|\, p_{ij,t-1}}{|E_t|\, |E_{t-1}|} \Big]$$

$$= \sum_{ij} \Big[ p_{ij,t}^2 - 2\, p_{ij,t}\, p_{ij,t-1} + p_{ij,t-1}^2 + \frac{p_{ij,t} (1 - p_{ij,t})}{|E_t|} + \frac{p_{ij,t-1} (1 - p_{ij,t-1})}{|E_{t-1}|} \Big]$$

$$= \sum_{ij} \Big[ (p_{ij,t} - p_{ij,t-1})^2 + \frac{p_{ij,t} (1 - p_{ij,t})}{|E_t|} + \frac{p_{ij,t-1} (1 - p_{ij,t-1})}{|E_{t-1}|} \Big]$$

As the two additional bias terms converge to 0 as $|E_t|$ and $|E_{t-1}|$ increase, the empirical mass shift is a consistent estimator of the true mass shift, and is density consistent.

We can also improve the rate of convergence by using our empirical estimates of the probabilities to subtract an estimate of the bias from the statistic. We use the following bias-corrected version of the empirical statistic in all experiments:

$$\widehat{MS}^{*}(G_t) = \sum_{ij} \Big[ (\hat{p}_{ij,t} - \hat{p}_{ij,t-1})^2 - \frac{\hat{p}_{ij,t} (1 - \hat{p}_{ij,t})}{|E_t|} - \frac{\hat{p}_{ij,t-1} (1 - \hat{p}_{ij,t-1})}{|E_{t-1}|} \Big] \tag{7}$$

5.5 Probabilistic Degree Shift
Claim 5.5. DS is a density consistent statistic.

Again, since the true $P$ is unobserved, we cannot calculate the Degree Shift (DS, Eq. 5) statistic exactly and must use the empirical Probabilistic Degree Shift: $\widehat{DS}(G_t) = \sum_i \big( \sum_{j \neq i} \hat{p}_{ij,t} - \sum_{j \neq i} \hat{p}_{ij,t-1} \big)^2$. As shown in Theorem 5.7 below, the bias of this estimator approaches 0, making this a density consistent statistic.

Theorem 5.7. $\widehat{DS}(G_t)$ is a consistent estimator of $DS(P_t)$, with a bias that converges to 0.

Proof.
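The expectation of the empirical degree shift can be calculated with (all inner sums run over $j, k \neq i$):

\[ E[\widehat{DS}(G_t)] = E\Big[ \sum_i \Big( \sum_{j \neq i} \hat{p}_{ij,t} - \sum_{j \neq i} \hat{p}_{ij,t-1} \Big)^2 \Big] \]
\[ = E\Big[ \sum_i \Big( \sum_{j} \frac{e_{ij,t}^2}{|E_t|^2} + \sum_{j} \frac{e_{ij,t-1}^2}{|E_{t-1}|^2} + 2 \sum_{j<k} \frac{e_{ij,t}}{|E_t|} \frac{e_{ik,t}}{|E_t|} + 2 \sum_{j<k} \frac{e_{ij,t-1}}{|E_{t-1}|} \frac{e_{ik,t-1}}{|E_{t-1}|} - 2 \sum_{j,k} \hat{p}_{ij,t}\, \hat{p}_{ik,t-1} \Big) \Big] \]
\[ = \sum_i \Big( \sum_{j} p_{ij,t}^2 + \sum_{j} \frac{p_{ij,t}(1 - p_{ij,t})}{|E_t|} + \sum_{j} p_{ij,t-1}^2 + \sum_{j} \frac{p_{ij,t-1}(1 - p_{ij,t-1})}{|E_{t-1}|} + 2 \sum_{j<k} p_{ij,t}\, p_{ik,t} + 2 \sum_{j<k} p_{ij,t-1}\, p_{ik,t-1} - 2 \sum_{j,k} p_{ij,t}\, p_{ik,t-1} \Big) \]
\[ = \sum_i \Big( \sum_{j \neq i} p_{ij,t} - \sum_{j \neq i} p_{ij,t-1} \Big)^2 + \sum_i \sum_{j \neq i} \frac{p_{ij,t}(1 - p_{ij,t})}{|E_t|} + \sum_i \sum_{j \neq i} \frac{p_{ij,t-1}(1 - p_{ij,t-1})}{|E_{t-1}|} . \]

The two additional bias terms converge to 0 as $|E_t|$ and $|E_{t-1}|$ increase, so the statistic is density consistent. By subtracting out the empirical estimate of this bias term we can hasten the convergence. We use the following bias-corrected empirical degree shift in our experiments:

\[ \widehat{DS}^{*}(G_t) = \sum_i \Big[ \Big( \sum_{j \neq i} \hat{p}_{ij,t} - \sum_{j \neq i} \hat{p}_{ij,t-1} \Big)^2 - \Big( \sum_{j \neq i} \frac{\hat{p}_{ij,t}(1 - \hat{p}_{ij,t})}{|E_t|} + \sum_{j \neq i} \frac{\hat{p}_{ij,t-1}(1 - \hat{p}_{ij,t-1})}{|E_{t-1}|} \Big) \Big] \tag{8} \]

5.6 Triangle Probability

Claim 5.6 TP is a density consistent and density unbiased statistic.

Again, since the true $P$ is unobserved, we cannot calculate the Triangle Probability (TP, Eq. 6) statistic exactly and must use the empirical Triangle Probability

\[ \widehat{TP}(G_t) = \sum_{i,j,k \in V,\; i \neq j \neq k} \frac{e_{ij,t}\, e_{ik,t}\, e_{jk,t}}{|E_t|^3} , \]

which is an unbiased estimator of the true statistic (shown below in Theorem 5.8). This means that there is no minimum number of edges necessary to attain an unbiased estimate of the true triangle probability.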
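The bias-corrected Mass Shift (Eq. 7), the bias-corrected Degree Shift (Eq. 8), and the empirical Triangle Probability can all be computed from matrices of edge counts. The following is a minimal NumPy sketch rather than the authors' implementation. It assumes that `counts[i, j]` holds the number of edges observed on node pair (i, j), that the diagonal is zero, and that the total edge count is the sum of all entries; the function names are illustrative.

```python
import numpy as np


def edge_probabilities(counts):
    """Return (p_hat, total), where p_hat[i, j] = e_ij / |E|.

    Assumption: counts[i, j] is the number of edges on pair (i, j),
    the diagonal is zero, and |E| is the sum of all entries.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total <= 0:
        raise ValueError("the graph must contain at least one edge")
    return counts / total, total


def _bias_estimate(p_t, n_t, p_prev, n_prev):
    # Empirical estimate of the two bias terms shared by Eq. 7 and Eq. 8.
    return (p_t * (1 - p_t)).sum() / n_t + (p_prev * (1 - p_prev)).sum() / n_prev


def mass_shift(counts_t, counts_prev):
    """Bias-corrected empirical mass shift between two time steps."""
    p_t, n_t = edge_probabilities(counts_t)
    p_prev, n_prev = edge_probabilities(counts_prev)
    raw = ((p_t - p_prev) ** 2).sum()
    return raw - _bias_estimate(p_t, n_t, p_prev, n_prev)


def degree_shift(counts_t, counts_prev):
    """Bias-corrected empirical degree shift between two time steps."""
    p_t, n_t = edge_probabilities(counts_t)
    p_prev, n_prev = edge_probabilities(counts_prev)
    # Row sums give each node's share of the total edge probability.
    raw = ((p_t.sum(axis=1) - p_prev.sum(axis=1)) ** 2).sum()
    return raw - _bias_estimate(p_t, n_t, p_prev, n_prev)


def triangle_probability(counts):
    """Empirical triangle probability: sum of p_ij * p_ik * p_jk.

    With a zero diagonal, terms with repeated indices vanish, so the
    sum effectively runs over distinct i, j, k only.
    """
    p, _ = edge_probabilities(counts)
    return float(np.einsum("ij,ik,jk->", p, p, p))


if __name__ == "__main__":
    # Two graphs drawn from the same edge distribution but with very
    # different edge counts; the corrected shifts should stay near zero.
    rng = np.random.default_rng(0)
    n_nodes = 5
    probs = rng.random((n_nodes, n_nodes))
    np.fill_diagonal(probs, 0.0)
    probs /= probs.sum()
    small = rng.multinomial(1_000, probs.ravel()).reshape(n_nodes, n_nodes)
    large = rng.multinomial(10_000, probs.ravel()).reshape(n_nodes, n_nodes)
    print(mass_shift(large, small), degree_shift(large, small))
```

Because every statistic is built from normalized counts, doubling the number of observed edges without changing their distribution leaves the estimates close to unchanged, which is the behavior the density-consistency claims describe.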

Theorem 5.8 $\widehat{TP}(G_t)$ is a consistent and unbiased estimator of $TP(P_t)$.

Proof. The expectation of the empirical Triangle Probability can be written as

\[ E[\widehat{TP}(G_t)] = \sum_{ijk} E[\hat{p}_{ij,t}\, \hat{p}_{ik,t}\, \hat{p}_{jk,t}] = \sum_{ijk} E\Big[ \frac{e_{ij,t}\, e_{ik,t}\, e_{jk,t}}{|E_t|^3} \Big] . \]

As the number of edges on any node pair $i, j$ can be represented with a multinomial, the expectation of each is $|E_t|\, p_{ij,t}$. This lets us rewrite the triangle probability as

\[ = \sum_{ijk} \frac{|E_t|\, p_{ij,t} \cdot |E_t|\, p_{ik,t} \cdot |E_t|\, p_{jk,t}}{|E_t|^3} = \sum_{ijk} p_{ij,t}\, p_{ik,t}\, p_{jk,t} . \]

Therefore the empirical triangle probability is an unbiased estimator of the true triangle probability and is a density consistent statistic.

Statistic        1k-2k edges    3k-5k edges    7k-10k edges
Edit Distance    0.01 ± …       … ± …          … ± 0.02
Degree Dist.     … ± …          … ± …          … ± 0.06
Clust. Coef.     … ± …          … ± …          … ± 0.04
Mass Shift       0.51 ± …       … ± …          … ± 0.04
Degree Shift     0.62 ± …       … ± …          … ± 0.08
Triangle Prob.   … ± …          … ± …          … ± 0.03

Figure 3: Recall when applying each statistic to flag synthetically generated anomalies. Each cell is an average over all model parameter pairs.

6. EXPERIMENTS

Now that we have established the properties of density-consistent and density-dependent statistics, we will show the tangible effects of these properties using both synthetic datasets and data from real networks. The purpose of the synthetic data experiments is to show the ability of hypothesis tests using various statistics to distinguish networks that have different distributions of edges but also a random number of observed edges. The real data experiments demonstrate the types of events that generate anomalies as well as the characteristics of the anomalies that hypothesis tests using each statistic are most likely to find.

6.1 Synthetic Data Experiments

To validate the ability of Density-Consistent statistics to detect anomalies more accurately than traditional statistics, we evaluated their performance on sets of synthetically generated graphs. Rather than create a dynamic network, we generated independent sets of graphs using differing model parameters and a random number of edges.
For every combination of model parameters, statistics calculated from graphs of one set of parameters (or pairs of graphs with the same parameters in the case of delta statistics like mass shift) became the null distribution, and statistics calculated on graphs from the other set of parameters (or two graphs, one from each model parameter in the case of delta statistics) became the test examples. Treating the null distribution as normally distributed, the critical points corresponding to a p-value of 0.05 are used to accept or reject each of the test examples. The percentage of test examples that are rejected, averaged over all combinations of model parameters, becomes the recall for that statistic.

Figure 4: Comparison of Mass Shifts (a) and Edit Distances (b) produced by synthetic dataset pairs. The variance of the Edit Distances is due to the variable edge counts of the graphs, and leads to errors when distinguishing the two distributions.

To generate the synthetic graphs, we first sampled a uniform random variable representing the number of edges to insert into the graph. For each statistic we did three sets of trials with 1k-2k, 3k-5k, and 7k-10k edges respectively. For mass shift, edit distance, triangle probability, and clustering coefficient, each synthetic graph was generated according to the edge probabilities of a stochastic blockmodel, using accept/reject sampling to place the selected number of edges amongst the nodes. Each set of models was designed to produce graphs varying a certain property, i.e. triangle probability and clustering coefficient were applied to models with varying transitivity while mass shift and edit distance skew the class probabilities by a certain amount. For the degree distribution and degree shift statistics, rather than using a stochastic blockmodel we assigned degrees to each node by sampling from a power-law distribution with varying parameters, then used a fast Chung-Lu graph generation process to construct the network.
Unlike a standard Chung-Lu process, we allow multiple edges between nodes, and we continue the process until a random number of edges have been inserted rather than inserting a number of edges equal to the sum of the degrees. The recalls are calculated the same way as with the stochastic blockmodels, using pairs of graphs with the same power-law degree distribution as the null set and differing degree distributions as the test set.

Figure 3 shows the average recall of each statistic when applied to all models of a certain range of edges. In general, the more edges that are observed the more reliable the statistics are; however, the statistics we have proposed enjoy an advantage over the traditional statistics for all ranges of network sizes, and in some cases this improvement is as large as 200% or more.
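The evaluation procedure described above (fit a normal distribution to the null statistics, derive critical points for a p-value of 0.05, and report the fraction of test examples rejected) can be sketched as follows. This is an illustrative sketch, not the authors' code. It assumes a two-sided test and the sample standard deviation (`ddof=1`), neither of which is stated in the text, and the function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def critical_points(null_values, alpha=0.05):
    """Lower and upper critical points of a normal fit to the null values.

    Assumptions: the test is two-sided and the spread is estimated with
    the sample standard deviation (ddof=1).
    """
    null_values = np.asarray(null_values, dtype=float)
    mu = null_values.mean()
    sigma = null_values.std(ddof=1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mu - z * sigma, mu + z * sigma


def recall(null_values, test_values, alpha=0.05):
    """Fraction of test statistics that fall outside the critical points."""
    lower, upper = critical_points(null_values, alpha)
    test_values = np.asarray(test_values, dtype=float)
    rejected = (test_values < lower) | (test_values > upper)
    return float(rejected.mean())


if __name__ == "__main__":
    # Null statistics come from same-model graph pairs, test statistics
    # from cross-model pairs; the values here are placeholders.
    null_stats = [0.0, 1.0, 2.0, 3.0, 4.0]
    test_stats = [2.0, 10.0, -5.0, 5.0]
    print(recall(null_stats, test_stats))
```

Averaging `recall` over every ordered pair of model parameter settings would give a single per-statistic value in the spirit of the entries of Figure 3.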

Figure 2: Timelines showing reported anomalies using each statistic for three real world networks: (a) Enron network (time in weeks), (b) University network (time in days), (c) Facebook Univ. subnetwork (time in days). Each panel plots Edge Count, Mass Shift, Degree Shift, Triangle Probability, Degree Distribution, Edit Distance, and Clustering Coefficient. Annotated events include: May 2000, price manipulation strategy "Death Star" implemented; December 2000, Skilling takes over as CEO; May 2001, Mintz sends memorandum to Skilling on LJM paperwork; September 2001, Enron stock begins to falter; February 2002, Skilling testifies before Congress; August 27, classes begin; August 29, semester begins; September 5, Labor Day; November 1, basketball season begins; November 14-16, severe tornado warnings; December 25, Christmas; January 14, classes resume; January 16, Martin Luther King Day; February 14, Valentine's Day; March 12-17, Spring Break.

Figure 4 demonstrates the effect of density consistency using one pair of graph models and the mass shift and edit distance statistics. Each black point is the statistic value calculated from a pair of datasets drawn from the same distribution, while each green point is the value calculated from a pair of datasets drawn from differing distributions; the number of edges sampled from each distribution was chosen uniformly at random from 7k-10k. Clearly the distribution of statistic values calculated on graph pairs from the same distribution is different from the distribution of cross-model pairs for both statistics. However, the randomness of the size of the graphs translates to variance in the calculated edit distance statistics, producing two distributions that are not easily separable and therefore reduced recall.
Mass shift, on the other hand, is nearly unaffected by the edge variance, leading to two very distinct statistic distributions.

6.2 Real Data Experiments

We will now demonstrate how using Density-Consistent statistics as the test statistics for anomaly detectors improves the ability of detectors to find novel anomalies in real-world networks. We analyzed three dynamic networks: one composed of e-mail communication from the Enron corporation during its operation and collapse, one from e-mail communications between university students, and one from the Facebook wall postings of the subnetwork composed of students at a university. In the timelines of Figure 2, solid points represent statistics that we have proposed in this paper while open bubbles represent classic network statistics. Ideally the major events marked with vertical lines will be found as unusual by the detectors.

A hypothesis test using the statistic in question is applied to every time step to determine its category as anomalous or normal. The null distribution used is the set of statistics calculated on all other time steps of the stream. A normal distribution is fit to the statistic values and critical points are selected according to a p-value of 0.05; any time step with a statistic that exceeds the critical points is flagged as anomalous.

Figure 2(a) shows the timeline of events that occurred during the Enron scandal and breakdown. The Density-Consistent statistics are able to recover the critical events of the timeline, including events that standard statistics are not able to find. In particular, the price manipulation strategy that was implemented in the summer of 2000 and the CEO transition in December 2000 are not found by edge-dependent statistics; unlike some of the later events, this strategy was not accompanied by an abnormal number of edges, so edge-dependent statistics produced less of a signal on these points.
In addition, the time steps that are flagged by traditional statistics cluster around a set of edge count anomalies just after the Enron stock begins to crash. This period has an elevated number of messages in general, so the statistics are responding to the number of edges in these time steps rather than changes in other properties of the network.

Figure 2(b) is taken from the communication of students from the summer of 2011 to February. In general the density-consistent statistics flag time steps where certain events are taking place, such as the start of basketball season, Martin Luther King Day, and Valentine's Day. The density-dependent statistics tend to flag more randomly and often coincide with times of unusual total edge count.

Figure 2(c) is constructed from the set of wall postings between members of a university Facebook group. As with the Enron data, there is a large set of time steps with an unusual total edge count that are also flagged by the density-dependent statistics. The density-consistent statistics recover the major events such as the beginning of each semester and the start and end of spring break. Interestingly, they also find unusual activity before and after spring break; project deadlines often fall before or after the break, which may explain this activity.

Let us take a look at the network structures found in the anomalous time steps. As the mass shift, degree shift, and triangle probability statistics are sums over the edges or nodes of the graph, by decomposing the total statistic into the values generated by each edge/node we can select a subgraph which contributes the most to the statistic. As these statistics are designed to measure changes in the probability distribution of edges, this subgraph can be considered as

the portion of the graph experiencing the most change in its edge probabilities.

Figure 5: Anomalies detected by (a) Mass Shift: Enron time steps 31-32, (b) Triangle Probability: E-mail time steps 88-89, and (c) Degree Shift: Facebook time steps [...]. Each plot shows the most unusual subgraph for the prior time step and the anomalous one.

Figure 6: Anomalies detected by (a) Edit Distance: Enron time steps [...], (b) Barrat Clustering: E-mail time steps [...], and (c) Degree Distribution: Facebook time steps [...]. Each plot shows the most unusual subgraph for the prior time step and the anomalous one.

Figure 5 shows the subgraphs of time steps that flag as anomalous in the mass shift, degree shift, or triangle probability; the subgraphs shown account for at least 50% of the total anomaly score. As you can see, the network structures are quite different in the anomalous time steps versus the prior time steps; often nodes that do not communicate at all in the prior time step will have a significant message volume during the anomaly.

Figure 6 shows the subgraphs obtained by decomposing the mass shift, degree shift, or triangle probability on time steps that were flagged as anomalies by the density-dependent statistics but not the density-consistent statistics. Not all of the density-dependent statistics can be easily decomposed into node/edge contributions, which is why we used the density-consistent decomposition; the subgraphs generated should still be the subset of the graph experiencing the most change to its edge probabilities in that time step. The subgraphs discovered in these time steps tend to have less dramatic changes between the anomalous and normal time step, which implies that these time steps were likely flagged due to a change in the total number of edges rather than a major shift in the edge probabilities. Because density-consistent statistics are not sensitive to a global change in edge count, they are better at detecting components of the network that have changed radically in their communication behavior.

7.
CONCLUSIONS

In this paper we demonstrate that in order to draw proper conclusions from analysis of dynamic networks it is necessary to use methods which take into account the changes they exhibit. As most network statistics were designed in an ad-hoc manner to describe informal properties of the network, these statistics are not valid for use in a hypothesis testing approach to outlier detection when the network changes in size. To remedy this we have described the Density-Consistency property for network statistics and shown that statistics that adhere to this property can be used for accurate anomaly detection when the network edge volume is variable. A Density-Consistent statistic should measure some property of the network that is independent of the edge volume, like the distribution of edges or the transitivity. We have also proposed three network statistics, Mass Shift, Degree Shift, and Triangle Probability, to replace the edge-dependent statistics of Graph Edit Distance, Degree Distribution, and Clustering Coefficient. We have proven that our statistics are Density-Consistent and demonstrated using synthetic trials that anomaly detectors using the consistent statistics have superior performance. When applied to real datasets, anomaly detectors utilizing Density-Consistent statistics are often able to recover events that are missed by edge-dependent statistics.
