Hypothesis testing for automated community detection in networks


J. R. Statist. Soc. B (2016) 78, Part 1, pp. 253–273

Hypothesis testing for automated community detection in networks

Peter J. Bickel, University of California at Berkeley, USA, and Purnamrita Sarkar, University of Texas at Austin, USA

[Received November 2013. Revised February 2015]

Summary. Community detection in networks is a key exploratory tool with applications in a diverse set of areas, ranging from finding communities in social and biological networks to identifying link farms in the World Wide Web. The problem of finding communities or clusters in a network has received much attention from statistics, physics and computer science. However, most clustering algorithms assume knowledge of the number of clusters k. We propose to determine k automatically in a graph generated from a stochastic block model by using a hypothesis test of independent interest. Our main contribution is twofold: first, we theoretically establish the limiting distribution of the principal eigenvalue of the suitably centred and scaled adjacency matrix and use that distribution for our test of the hypothesis that a random graph is of Erdős–Rényi (noise) type. Secondly, we use this test to design a recursive bipartitioning algorithm, which naturally uncovers nested community structure. Using simulations and quantifiable classification tasks on real world networks with ground truth, we show that our algorithm outperforms state of the art methods.

Keywords: Asymptotic analysis; Community detection; Hypothesis testing; Networks; Stochastic block model; Tracy–Widom distribution

1. Introduction

Network structured data can be found in many real world problems. Facebook is an undirected network of entities where edges are formed by who knows whom. The World Wide Web is a giant directed network with Web pages as nodes and hyperlinks as edges. Finding community structure in network data is a key ingredient in many graph mining problems. For example, viral marketing targets tightly knit groups in social networks to increase the popularity of a brand or product. There are many clustering algorithms in the computer science and statistics literature. However, most suffer from a common issue: one has to assume that the number of clusters k is known a priori. For labelled data, a common approach for learning k is cross-validation using held-out data. However, cross-validation requires a large amount of computation, and for sparse graphs it is suboptimal to leave out data. In this paper we address this problem via a hypothesis testing framework based on random-matrix theory. This framework naturally leads to a recursive bipartitioning algorithm, yielding a hierarchical clustering structure of the data. For genetic data, Patterson et al. (2006) showed how to combine principal components analysis with random-matrix theory to discover whether the data have cluster structure.

Address for correspondence: Purnamrita Sarkar, Department of Statistics and Data Sciences, College of Natural Sciences, University of Texas at Austin, Austin, TX 78712, USA. purna.sarkar@austin.utexas.edu

© 2015 Royal Statistical Society

This work uses existing results on the limit distribution of the largest eigenvalue of large random covariance matrices. In the standard machine learning literature, where data points are represented by real-valued features, Pelleg and Moore (2000) jointly optimized over the set of cluster locations and the number of cluster centres in the k-means clustering algorithm to maximize the Bayesian information criterion. Hamerly and Elkan (2003) proposed a hierarchical clustering algorithm based on the Anderson–Darling statistic, which tests whether the data assigned to a cluster come from a Gaussian distribution. For network clustering, finding the number of clusters automatically via a series of hypothesis tests has been proposed by Zhao et al. (2011), who presented a label switching algorithm for extracting tight clusters from a graph sequentially, using a characterization of an associative cluster. Although the criterion is not probabilistically based, the stopping rule is based on parametric bootstraps from an underlying probability model. They showed attractive consistency properties of their method under block and related models.

We take a probabilistic approach, considering the problem of finding the number of clusters in a graph generated from a stochastic block model, which is a widely used model for generating labelled graphs (Holland et al., 1983). We begin by constructing a test of the null hypothesis based on the very rapid computation of the largest eigenvalue of an appropriately centred and scaled adjacency matrix. Our null hypothesis is that there is only one cluster, i.e. the network is generated from an Erdős–Rényi $G_{n,p}$-graph, where $n$ denotes the number of nodes and $p$ denotes the probability of linkage between a pair of nodes. Existing literature (Lee and Yin, 2014) can be used to show that this largest eigenvalue asymptotically has the Tracy–Widom distribution. Using recent theoretical results from random-matrix theory, we show that this limit also holds for our statistic when the probability of an edge $p$ is unknown, and the centring and scaling are done using an estimate of $p$. Our theory holds for $p$ constant with respect to $n$, i.e. the dense asymptotic regime where the average degree grows linearly with $n$. We are currently investigating the behaviour of the largest eigenvalue when $p$ decays with $n$.

We show how to obtain Bartlett-type corrections (Bartlett, 1937) for our test statistic when the graph is small or sparse. Although we cannot yet establish theory for this correction, we show its effectiveness by using simulations and labelled real world networks. On quantifiable classification tasks on real world networks with ground truth, our method outperforms McAuley and Leskovec's (2012) algorithm, which has been shown to perform better than known methods for obtaining overlapping clusters in networks. Further, we show that our recursive bipartitioning algorithm gives a multiscale view of smaller communities with different densities nested inside bigger communities. Although our theory applies only to block models, our simulations and data examples show that our method is quite robust to deviations from the block model assumptions (Section 4.1.3).

Our paper is organized as follows. Section 2 gives the background on block and other models to be considered in our theory and simulations. Section 3 presents our main results on the hypothesis test and the recursive bipartitioning scheme.
We present experimental results on simulated networks and labelled real world networks in Section 4. We conclude with a discussion in Section 5.

2. Background

Latent variable models have been explored by many researchers for modelling networks (Raftery et al., 2002; Bickel and Chen, 2009). The general set-up of a latent variable model assigns $n$ latent random variables $Z := (Z_1, Z_2, \ldots, Z_n)$ to the $n$ nodes in a network. These variables take values in a general space $\mathcal{Z}$. The linkage probability between two nodes is specified via a symmetric map $h : \mathcal{Z} \times \mathcal{Z} \to [0, 1]$.

Bickel and Chen (2009) took the $Z_i$ to be independent and identically distributed uniform(0,1) random variables. Raftery et al. (2002) modelled these as positions in some $d$-dimensional latent space. Handcock et al. (2007) proposed to use a mixture of multivariate Gaussian distributions, one for each cluster. A stochastic block model is a special class of these models, where $Z_i$ is a binary length-$k$ vector encoding the membership of a node in a cluster. This has been a widely popular model (Snijders and Nowicki, 1997; Bickel and Chen, 2009; Birmelé and Ambroise, 2011) for modelling community structure in networks.

2.1. Stochastic block model
For our theoretical results we focus on community detection in graphs generated from stochastic block models. Informally, a stochastic block model with $k$ classes assigns latent cluster memberships to every node in a graph. We shall denote by $n$ the number of nodes in a graph. Any pair of nodes has a probability of linkage that depends only on the cluster memberships of its end points, thus leading to stochastic equivalence. Let $Z$ denote an $n \times k$ binary matrix where each row has exactly one 1 and the $i$th column has $n_i$ 1s, i.e. the $i$th class has $n_i$ nodes with $\sum_i n_i = n$. For this paper, we shall assume that $Z$ is fixed and unknown. By definition there are no self-loops. Under this model, the conditional expectation of the adjacency matrix $A$ is
$$E[A \mid Z] = ZBZ^T - \mathrm{diag}(ZBZ^T), \qquad (1)$$
where $\mathrm{diag}(M)$ is the diagonal matrix with $\mathrm{diag}(M)_{ii} = M_{ii}$ for all $i$. $A$ is symmetric and the edges are independent Bernoulli trials. Thus, the subgraph that is induced by the nodes in the $i$th cluster is simply an Erdős–Rényi graph. This stochastic equivalence criterion of a stochastic block model leads to tractable analysis and inference. One may naturally question this criterion while modelling a real network. Depending on the definition of a block, it is possible to have variations within a block based on covariates; however, once we stratify a block by using covariates, it is reasonable to assume that nodes belonging to a stratum of a block behave similarly in terms of making connections, i.e. have stochastic equivalence.

2.2. Degree-corrected stochastic block model
A degree-corrected stochastic block model (Karrer and Newman, 2011) is an extension of a block model which allows for heterogeneity of degrees within a block. To be concrete, in addition to the class membership vectors $Z_i$ present in a block model, this model has an extra set of degree parameters $(\theta_i)_{i=1}^n$. Given these parameters, we have the relationship
$$P(A_{ij} = 1 \mid \theta, Z, B) = \theta_i \theta_j Z_i^T B Z_j. \qquad (2)$$
Zhao et al. (2012) showed that, if the degree parameters $\theta$ take values from a finite discrete set, then a wide range of statistical methods for community detection are consistent. This added condition ensures that $\theta$ cannot take many different values, which would be more difficult to learn from a small data set. Under this condition, however, the degree-corrected model becomes a parametric submodel of the standard stochastic block model, i.e. within a block we have more blocks arising from varied degrees. In Section 4, we show that our hypothesis test can be applied to identify the different degree strata of a degree-corrected block model with one block. We also show with simulations that, for a latent position cluster model, even though the stochastic equivalence condition does not hold, if the mixture components have small variance then our algorithm can identify the clusters correctly.
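To make equations (1) and (2) concrete, the following is a minimal Python sketch (ours, not the authors' code; the function name and interface are illustrative) that draws an adjacency matrix from a stochastic block model, with optional degree parameters:

```python
import numpy as np

def sample_sbm(z, B, theta=None, rng=None):
    """Draw a symmetric adjacency matrix with no self-loops from a
    (degree-corrected) stochastic block model.

    z     : length-n integer array of cluster labels in {0, ..., k-1}
    B     : k x k symmetric matrix of linkage probabilities
    theta : optional length-n degree parameters as in equation (2);
            omitting them recovers the plain block model of equation (1)
    """
    rng = np.random.default_rng(rng)
    n = len(z)
    theta = np.ones(n) if theta is None else np.asarray(theta)
    # P_ij = theta_i * theta_j * B_{z_i, z_j}
    P = np.outer(theta, theta) * B[np.ix_(z, z)]
    # Independent Bernoulli edges on the upper triangle, then symmetrize.
    upper = np.triu(rng.random((n, n)) < P, 1)
    return (upper | upper.T).astype(float)
```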
We conclude this section with a note on the applicability of stochastic block models for modelling real world networks.

2.3. Stochastic block models as a histogram approximation
While many extensions to a simple block model have been proposed (Karrer and Newman, 2011; Airoldi et al., 2008), a parallel line of research has focused on approximating real networks with block models with a growing number of clusters (Olhede and Wolfe, 2014; Airoldi et al., 2013). In particular, Olhede and Wolfe (2014) showed that a block model approximation of an unlabelled network is analogous to the use of histograms as non-parametric summaries of an unknown probability distribution. Varying the number or size of communities is in essence equivalent to varying the number of bins or the bandwidth. Their results imply that, under some mild regularity conditions on the limiting linkage probability function (which is also referred to as the 'graphon'), if we allow $k$ to grow with $n$, then the block model approximation converges to the true graphon (in mean integrated squared error). In addition, Olhede and Wolfe (2014) also showed that block model approximations of student friendship networks (Resnick et al., 1997) and political blog networks (Adamic and Glance, 2005) bring out interesting structure. The political blogs network is a symmetrized network of political blogs linking to each other, with two known (ground truth) clusters, i.e. the conservative and the liberal blogs. Olhede and Wolfe (2014) showed that a block model approximation of this network returns mostly homogeneous (all liberal or all conservative) blocks. For the student friendship network, they showed that the clusters returned by a block model approximation are often homogeneous in terms of available covariate information, e.g. race and school year of the students.

3. The hypothesis test
Deciding whether a stochastic block model has $k$ or $k + 1$ blocks can be thought of as inductively deciding whether there is one block or two. In essence, we develop a hypothesis test to determine whether or not a graph is generated from an Erdős–Rényi model with matching link probability. First we discuss some known properties of Erdős–Rényi graphs. Throughout this paper we assume that the number of clusters $k$ and the edge probabilities are constant, whereas the number of nodes $n$ grows to $\infty$. Thus the average degree grows linearly with $n$.

3.1. Properties of Erdős–Rényi graphs
Let $A$ denote the adjacency matrix of an Erdős–Rényi $(n, p)$ random graph, and let $P := E[A]$. We shall assume that there are no self-loops and hence $A_{ii} = 0$ for all $i$. Under the Erdős–Rényi model, $P$ is given by
$$P = np\,ee^T - pI, \qquad (3)$$
where $e$ is a length-$n$ vector with $e_i = 1/\sqrt{n}$ for all $i$, and $I$ is the $n \times n$ identity matrix. We also introduce the normalized matrix
$$\tilde{A} := \frac{A - P}{\sqrt{(n-1)p(1-p)}}. \qquad (4)$$
The eigenvalues of $\tilde{A}$ are denoted by $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Let us also define the density of the semicircle law. In particular we have the following definition.

Definition 1. Let $\rho_{sc}$ denote the density of the semicircle law, defined as
$$\rho_{sc}(x) := \frac{1}{2\pi}\sqrt{(4 - x^2)_+}, \qquad x \in \mathbb{R}. \qquad (5)$$
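As a quick numerical illustration of equations (3) and (4) (a sketch of ours, not part of the paper), one can check that the largest eigenvalue of the centred and scaled adjacency matrix of a simulated Erdős–Rényi graph sits near the spectral edge 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.5

# Erdos-Renyi(n, p) adjacency matrix with no self-loops.
upper = np.triu(rng.random((n, n)) < p, 1)
A = (upper | upper.T).astype(float)

# Centre by P = np*ee^T - pI (equation (3)) and scale (equation (4)).
P = p * (np.ones((n, n)) - np.eye(n))
A_tilde = (A - P) / np.sqrt((n - 1) * p * (1 - p))

lam = np.linalg.eigvalsh(A_tilde)           # eigenvalues, ascending order
print(lam[-1])                              # close to the spectral edge 2
print(n ** (2 / 3) * (lam[-1] - 2))         # O_P(1): approximately a TW1 draw
```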

For Wigner matrices with entries having a symmetric law, the limiting behaviour of the empirical distribution of the eigenvalues was established by Wigner (1958). This distribution converges weakly to the semicircle law defined in equation (5). Also, Tracy and Widom (1994) proved that, for Gaussian orthogonal ensembles (GOEs), $\lambda_1$ and $\lambda_n$, after suitable shifting and scaling, converge to the Tracy–Widom distribution with index 1 (TW$_1$). Soshnikov (1999) proved that this universal result at the edge of the spectrum also holds for more general distributions, provided that the random variables have symmetric laws of distribution, all their moments are finite and $E[\tilde{A}_{ij}^m] \le (Cm)^m$ for some constant $C$ and all positive integers $m$. This shows that $n^{2/3}(\lambda_1 - 2)$ weakly converges to the limit distribution of GOE matrices, i.e. the Tracy–Widom law with index 1, for $p = \frac{1}{2}$. Recently, Erdős et al. (2012) have removed the symmetry condition and established the edge universality result for general Wigner ensembles. Further, Lee and Yin (2014) gave a necessary and sufficient condition for the limiting Tracy–Widom law, which shows that $n^{2/3}(\lambda_1 - 2)$ converges weakly to TW$_1$ in our setting also.

If we knew the true $p$, it would be easy to frame a hypothesis test which accepts or rejects the null hypothesis that a network is generated from an Erdős–Rényi graph. First we would compute $\theta := n^{2/3}(\lambda_1 - 2)$, and then estimate the $p$-value $P(X \ge \theta)$ from available tables of probabilities for the Tracy–Widom distribution. We reject the null hypothesis if the $p$-value falls below a predefined significance level $\alpha$. However, we do not know the true parameter $p$; we can only estimate it within $O_P(1/n)$ error by computing the proportion of pairs of nodes that form an edge. Let us denote this estimate by $\hat{p}$. Thus the matrix at hand is $A - \hat{P}$, where
$$\hat{P} = n\hat{p}\,ee^T - \hat{p}I. \qquad (6)$$
In this paper, we show that the largest eigenvalue of $A - \hat{P}$ also follows the TW$_1$-law after suitable shifting and scaling.

Theorem 1. Let
$$\tilde{A}' := \frac{A - \hat{P}}{\sqrt{(n-1)\hat{p}(1-\hat{p})}}. \qquad (7)$$
We have the following asymptotic distribution of our test statistic $\theta$:
$$\theta := n^{2/3}\{\lambda_1(\tilde{A}') - 2\} \xrightarrow{d} \mathrm{TW}_1, \qquad (8)$$
where TW$_1$ denotes the Tracy–Widom law with index 1. This is also the limiting law of the largest eigenvalue of GOEs.

We give a proof sketch in Appendix A; the details are deferred to the on-line supplementary material. For consistency we need to show that the above statistic $\theta$ does not have the Tracy–Widom distribution when $A$ is generated from a stochastic block model with $k > 1$ blocks. We show that $\theta \to \infty$ if $A$ is generated from a stochastic block model, as long as the class probability matrix $B$ is diagonally dominant. The diagonal dominance condition leads to clusters with more edges within than across. A similar condition can be found in Zhao et al. (2011) for proving asymptotic consistency of the extraction algorithm for stochastic block models with $k = 2$. Further, Bickel and Chen (2009) also noted that, for $k = 2$, the Newman–Girvan modularity is asymptotically consistent if diagonal dominance holds, although in general less is needed.

Proposition 1. Let $A$ be generated from a stochastic block model with hidden class assignment matrix $Z$ and probability matrix $B$ (as in equation (1)) whose elements are constants with respect to $n$. If $B_{ii} > \sum_{j \ne i} B_{ij}$ for all $i$, we have
$$\theta := n^{2/3}\{\lambda_1(\tilde{A}') - 2\} \ge C n^{7/6}, \qquad (9)$$
where $C$ is a deterministic positive constant independent of $n$.

On the basis of theorem 1, we present a preliminary version of our procedure for calculating the $p$-value of the test statistic (Table 1).

Table 1. Algorithm 1: preliminary hypothesis test

Step 1: $A \leftarrow$ adjacency matrix of $G$
Step 2: $\hat{p} \leftarrow \sum_{i,j} A_{ij}/\{n(n-1)\}$
Step 3: $\tilde{A}' \leftarrow (A - \hat{P})/\sqrt{(n-1)\hat{p}(1-\hat{p})}$
Step 4: $\theta \leftarrow n^{2/3}\{\lambda_1(\tilde{A}') - 2\}$
Step 5: pval $\leftarrow P_{\mathrm{TW}_1}(X > \theta)$

3.2. A small sample correction
Algorithm 1 uses the asymptotic distribution of the test statistic to obtain a $p$-value. Hence its performance on a finite network depends on how quickly the empirical distribution of the statistic approaches the limiting law. We performed simulation experiments to compare the speed of convergence of our test statistic with that of the scaled largest eigenvalue of GOE matrix ensembles. This is simply a reference point in our comparison, since Tracy–Widom distributions were discovered for Gaussian random-matrix ensembles. Our empirical investigation shows that, whereas the largest eigenvalues of GOE matrices converge to the Tracy–Widom distribution quite quickly, those of adjacency matrices do not. Moreover, the convergence is even slower if $p$ is small, which is so for sparse graphs. We elucidate this issue with some simulation experiments. We generate 1,000 GOE matrices $M$, where $M_{ij} \sim N(0, 1)$. In Fig. 1, we plot the empirical density of $n^{2/3}\{\lambda_1(M)/\sqrt{n} - 2\}$ against the true Tracy–Widom density. In Figs 1(a) and 1(b), we plot the GOE cases with $n$ equalling 500 and 1,000 respectively, whereas Figs 1(c) and 1(d) respectively show the Erdős–Rényi cases with $n = 500$, $p = 0.5$, and $n = 500$, $p = 0.05$. This suggests that computing the $p$-value by using the empirical distribution of $\lambda_1$ generated by a parametric bootstrap step would be better than using the limiting Tracy–Widom distribution. However, this would be computationally expensive, since it would have to be carried out at every level of the recursion in algorithm 3 later.

Fig. 1. Empirical distributions of largest eigenvalues plotted against the limiting Tracy–Widom law: (a) GOE matrices with $n = 500$; (b) GOE matrices with $n = 1{,}000$; (c) Erdős–Rényi graphs with $n = 500$ and $p = 0.5$; (d) Erdős–Rényi graphs with $n = 500$ and $p = 0.05$

Instead we note that, if one can learn the shift and scale of the bootstrapped empirical distribution, it can be well approximated by the limiting TW$_1$-law. Hence we propose to run a few simulations to compute the mean and the variance of the distribution, and then shift and scale the test statistic to match the first two moments of the limiting TW$_1$-law. In Fig. 2, we plot the empirical distribution of 1,000 bootstrap replicates. Figs 2(a) and 2(b) show how the empirical distribution of $\lambda_1$ differs from the limiting TW$_1$-law. In Figs 2(c) and 2(d) we show the shifted and scaled version of this empirical distribution, where the mean and variance of the empirical distribution are estimated by using 100 samples drawn from the respective Erdős–Rényi models. We can see that Figs 2(c) and 2(d) are a much better fit to the Tracy–Widom distribution. Finally, in Figs 2(e) and 2(f) we have the corrected empirical distributions where the mean and variance are estimated from 50 random samples. Although this is not as good a fit as Figs 2(c) and 2(d), it is not much worse. We note that these corrections are akin to Bartlett-type corrections (Bartlett, 1937) to likelihood ratio tests, which propose a family of limiting distributions, all scaled variants of the well-known $\chi^2$-limit, and estimate the best fit by using the data at hand. On the basis of this discussion, we now present algorithm 2 (Table 2), which is a modified version of algorithm 1.

Table 2. Algorithm 2: hypothesis test with correction

Step 1: $\hat{p} \leftarrow \sum_{i,j} A_{ij}/\{n(n-1)\}$
Step 2: $\theta \leftarrow n^{2/3}(\lambda_1[(A - \hat{P})/\sqrt{(n-1)\hat{p}(1-\hat{p})}] - 2)$
Step 3: $\mu_{\mathrm{TW}} \leftarrow E_{\mathrm{TW}_1}[X]$
Step 4: $\sigma_{\mathrm{TW}} \leftarrow \sqrt{\mathrm{var}_{\mathrm{TW}_1}(X)}$
Step 5: for $i = 1, \ldots, 50$ do
Step 6:   $A_i \leftarrow$ Erdős–Rényi$(n, \hat{p})$
Step 7:   $\theta_i \leftarrow n^{2/3}[\lambda_1((A_i - \hat{P})/\sqrt{(n-1)\hat{p}(1-\hat{p})}) - 2]$
Step 8: $\hat{\mu}_{n,\hat{p}} \leftarrow$ mean$(\{\theta_i\})$
Step 9: $\hat{\sigma}_{n,\hat{p}} \leftarrow$ standard deviation$(\{\theta_i\})$
Step 10: $\theta' \leftarrow \mu_{\mathrm{TW}} + \{(\theta - \hat{\mu}_{n,\hat{p}})/\hat{\sigma}_{n,\hat{p}}\}\sigma_{\mathrm{TW}}$
Step 11: pval $\leftarrow P_{\mathrm{TW}_1}(X > \theta')$
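For concreteness, here is one possible Python rendering of algorithms 1 and 2 (a sketch under stated assumptions, not the authors' implementation). The TW$_1$ mean and standard deviation below are tabulated constants (approximately $-1.2065$ and $\sqrt{1.6078} \approx 1.268$); an exact TW$_1$ tail probability is not available in standard libraries, so converting the corrected statistic $\theta'$ to a $p$-value is left to a user-supplied table or routine, exactly as in step 11 of Table 2.

```python
import numpy as np

TW1_MEAN, TW1_SD = -1.2065, 1.2680   # tabulated moments of the TW1 law

def test_statistic(A):
    """Algorithm 1, steps 1-4: theta = n^{2/3} {lambda_1(A') - 2}."""
    n = A.shape[0]
    p_hat = A.sum() / (n * (n - 1))
    P_hat = p_hat * (np.ones((n, n)) - np.eye(n))     # equation (6)
    A_prime = (A - P_hat) / np.sqrt((n - 1) * p_hat * (1 - p_hat))
    lam1 = np.linalg.eigvalsh(A_prime)[-1]
    return n ** (2 / 3) * (lam1 - 2)

def corrected_statistic(A, n_boot=50, rng=None):
    """Algorithm 2: moment-match the bootstrap null distribution to TW1."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    p_hat = A.sum() / (n * (n - 1))
    theta = test_statistic(A)
    boots = []
    for _ in range(n_boot):
        # Parametric bootstrap: Erdos-Renyi(n, p_hat) with no self-loops.
        upper = np.triu(rng.random((n, n)) < p_hat, 1)
        boots.append(test_statistic((upper | upper.T).astype(float)))
    mu_hat, sigma_hat = np.mean(boots), np.std(boots)
    return TW1_MEAN + (theta - mu_hat) / sigma_hat * TW1_SD
```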

Fig. 2. Corrected empirical distributions of largest eigenvalues computed using 1,000 bootstrap replicates from an Erdős–Rényi graph with matching parameters against the limiting Tracy–Widom law: (a), (b) original uncorrected empirical distribution; (c), (d) corrected version with shift and scale estimated by using 100 samples; (e), (f) corrected version with shift and scale estimated by using 50 samples; (a), (c), (e) generated from $G_{500, 0.05}$; (b), (d), (f) generated from $G_{2000, 0.05}$

Table 3. Algorithm 3: recursive bipartitioning of networks by using Tracy–Widom theory

Step 1: function RecursiveBipartition$(G, \alpha)$
Step 2:   pval $\leftarrow$ HypothesisTest$(G)$
Step 3:   if pval $< \alpha$ then
Step 4:     $(G_1, G_2) \leftarrow$ Bipartition$(G)$
Step 5:     RecursiveBipartition$(G_1, \alpha)$
Step 6:     RecursiveBipartition$(G_2, \alpha)$

3.3. Recursive algorithm
We are now ready to present the recursive clustering scheme in algorithm 3 (Table 3). For the fourth step in algorithm 3 we use the regularized version of spectral clustering that was introduced in Amini et al. (2013). We want to emphasize that the choice of spectral clustering is not connected to the hypothesis test. One can use any other method for partitioning the graph.

3.4. Relationship to Zhao et al. (2011)
We conclude this section with a brief discussion of the similarities and differences of our work with the method of Zhao et al. (2011). The main difference is that they focused on finding and extracting communities which maximize a ratio-cut-type criterion. We, in contrast, do not prescribe a clustering algorithm. The clustering step in algorithm 3 is not tied to our hypothesis test and can easily be replaced by their community extraction algorithm. Computationally, our hypothesis testing step is faster, because we propose to estimate the mean and variance of the empirical distribution by using the bootstrap, not the distribution itself. This is possible because the limiting distribution is provably Tracy–Widom, and small sample corrections can be made cheaply by generating fewer bootstrap samples. Further, their extraction step is based on a label switching algorithm, which is inherently much slower than a spectral bipartitioning step on the first two eigenvectors of the data matrix, which is what we use. This gives us another computational boost. Finally, another difference is that they perform a sequential extraction; the hypothesis test is applied sequentially on the complement of the communities extracted so far. We, in contrast, find the communities recursively, thus leading to a natural hierarchical clustering. If there is a nested community structure inside an extracted community, the sequential strategy would miss it. There has been interesting work aimed at finding the number of blocks in a stochastic block model, which also does not look for nested structure (Chatterjee, 2015; Lei, 2014). Thus, our method has an added advantage in a restrictive albeit important setting, since many networks naturally have hierarchical cluster structure. We also demonstrate this in our simulated experiments.

4. Experiments
In this section, we present experiments on simulated data (Section 4.1), robustness to deviations from block model properties (Section 4.1.3) and real world networks (Section 4.2).

4.1. Various types of block models
Our simulations show two properties of our hypothesis test. First we show that it can differentiate an Erdős–Rényi graph from one with a small dense cluster planted in it. Secondly we show that, although our theory holds only for probability of linkage $p$ fixed with respect to $n$, our algorithm works for sparse graphs as well.
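Before the individual simulation studies, here is a minimal sketch of the recursion in algorithm 3 (again ours, with illustrative names: `tw1_sf` is an assumed user-supplied TW$_1$ survival function, and a plain sign split on the second eigenvector stands in for the regularized spectral clustering of Amini et al. (2013)); it reuses `corrected_statistic` from the sketch after Table 2:

```python
import numpy as np

def bipartition(A):
    """Split nodes by the sign of the eigenvector of the second-largest
    eigenvalue of A (a crude stand-in for regularized spectral clustering)."""
    _, vecs = np.linalg.eigh(A)        # eigenvalues in ascending order
    return vecs[:, -2] >= 0

def recursive_bipartition(A, nodes, tw1_sf, alpha=0.001, min_size=10):
    """Algorithm 3: keep splitting while the Erdos-Renyi null is rejected."""
    sub = A[np.ix_(nodes, nodes)]
    if len(nodes) <= min_size or tw1_sf(corrected_statistic(sub)) >= alpha:
        return [nodes]                 # null accepted: report one community
    mask = bipartition(sub)
    if mask.all() or (~mask).all():    # degenerate split: stop
        return [nodes]
    return (recursive_bipartition(A, nodes[mask], tw1_sf, alpha, min_size)
            + recursive_bipartition(A, nodes[~mask], tw1_sf, alpha, min_size))
```

Calling `recursive_bipartition(A, np.arange(A.shape[0]), tw1_sf)` returns a flat list of node sets; retaining the recursion tree instead yields the hierarchical structure shown in the figures below.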

4.1.1. Planted small cluster
Using the same set-up as in Zhao et al. (2011), we plant a densely connected small cluster in an Erdős–Rényi graph. In essence we are looking at a stochastic block model with $n = 1{,}000$ and $n_1$ nodes in cluster 1. The block model parameters are $B_{11} = 0.15$ and $B_{22} = B_{12} = 0.05$. We plot error bars from 50 random runs on the $p$-values against increasing $n_1$ in Fig. 3(a) and against increasing $B_{12}$ in Fig. 3(b). A larger $p$-value simply means that the hypothesis test considers the graph to be close to an Erdős–Rényi graph. In Fig. 3(a) we see that the $p$-values decrease as $n_1$ increases from 30 to 100. This is expected, since the planted cluster is easier to detect as $n_1$ grows. In contrast, in Fig. 3(b) we see that the $p$-values increase as $B_{12}$ is increased from 0.04 to 0.1. This is also expected, since the graph is indeed losing its block structure.

4.1.2. Nested stochastic block models
We present a nested stochastic block model with $n_1 = n_2 = 200$ and $n_3 = 600$, where the communities become increasingly dense. Specifically, $B_{11} = B_{22} = \rho a$, $B_{12} = \rho b$, $B_{13} = B_{23} = \rho c$ and $B_{33} = \rho d$, where $a = 0.02$, $b = 0.01$ and $c = 0.001$. As we increase $\rho$ from 0.5 to 5 in steps of 0.5, the average expected degree of an $n = 1{,}000$-node graph increases from 2.8 to 13.8. We plot error bars on $p$-values from 50 random runs. Similarly to Zhao et al. (2011), we use the adjusted Rand index, which is a well-known measure of closeness between two clusterings. Fig. 4 shows that the adjusted Rand index grows as the average degree increases. This also demonstrates that, although our theory holds only for $p$ fixed with respect to $n$, in practice our recursive bipartitioning algorithm works for sparse graphs as well. We used a $p$-value cut-off of 0.001 for the simulation experiments.

Finally, we compare our method with the community extraction method of Zhao et al. (2011) (referred to as algorithm E; our recursive bipartitioning algorithm is referred to as RB). In Table 4 we show the adjusted Rand index scores obtained by using the E and RB algorithms for our nested block model setting with the largest expected degree. In this particular case, the E algorithm first extracts the community containing communities 1 and 2, and then tries to extract another community from the remainder of the graph, leading to poor performance. This accuracy can be improved by replacing their sequential extraction strategy with a recursive one.

Fig. 3. $p$-values computed by using algorithm 2 in simulated networks of $n = 1{,}000$: (a) $B_{11} = 0.15$ and $B_{12} = B_{22} = 0.05$; (b) $n_1 = 100$
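A hypothetical replication of this planted-cluster set-up, reusing `sample_sbm` and `corrected_statistic` from the earlier sketches (the numerical values mirror those quoted above):

```python
import numpy as np

n, n1 = 1000, 100
z = np.array([0] * n1 + [1] * (n - n1))
B = np.array([[0.15, 0.05],
              [0.05, 0.05]])   # dense planted cluster on a 0.05 background

A = sample_sbm(z, B, rng=0)
theta_prime = corrected_statistic(A, rng=1)
print(theta_prime)  # far in the TW1 upper tail, so the null is rejected
```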

Fig. 4. Adjusted Rand index averaged over 50 random runs: a higher value indicates that the estimated clustering is closer to the true clustering

Table 4. Comparison with the community extraction algorithm E, averaged over 50 random runs

Algorithm | Adjusted Rand index
E | 0.55 ± 0.03
RB | 0.88

4.1.3. Robustness to deviation from block model assumptions
We conclude our experimental section with some simulations to demonstrate robustness. We shall first demonstrate this with an example of a degree-corrected stochastic block model with one block, where stochastic equivalence holds within every degree stratum (nodes in the same block with identical expected degrees). Next, we shall show that, even if stochastic equivalence does not hold exactly, i.e. the linkage probabilities of all pairs in a block are not identical but close, our hypothesis test and the recursive partitioning scheme lead to accurate partitions.

Degree-corrected stochastic block model. Consider a degree-corrected block model (equation (2)) where $\theta$ takes three different values. We applied our recursive algorithm, equipped with simple degree clustering, to data generated from this model with one block and the parameters given in equation (10). For identifiability, $\theta$ is arbitrary within a multiplicative constant, and we chose this constant such that the average degree is about 25. Our algorithm assigned 94% of the nodes correctly to their respective degree strata, averaged over 100 random runs. We show the hierarchical cluster structure that was obtained from our algorithm from one such network in Fig. 5. Note that this model is basically a parametric submodel of a block model with three blocks. However, the conditional expectation matrix of this model is rank 1, not rank 3. We give an intuitive explanation of why our algorithm works in this setting.

Fig. 5. Block structure of the adjacency matrix from a degree-corrected stochastic block model and the partitions made by the recursive algorithm (nz denotes twice the number of edges)

It is well known (Füredi and Komlós, 1981) that the principal eigenvector of an Erdős–Rényi graph is closely approximated by the all-1s vector. Intuitively, centring the adjacency matrix removes the contribution of the principal eigenvector. Thus, with $A \sim$ Erdős–Rényi$(n, p)$, the largest eigenvalue $\lambda_1(A)$ equals $np + O_P(1)$, whereas all other eigenvalues are $O_P\{\sqrt{(np)}\}$. However, $\lambda_1(A - \hat{p}11^T)$ is $O_P\{\sqrt{(np)}\}$. Further, we show that our test statistic (a scaled and centred version of the largest eigenvalue) is $O_P(1)$. In contrast, the principal empirical eigenvector of the adjacency matrix $A$ of the degree-corrected model specified above is not well approximated by the all-1s vector. Using standard concentration tools from random-matrix theory (Oliveira, 2009), it can be shown to be close (in Frobenius norm) to the population eigenvector, which is a blockwise constant vector with three blocks arising from the three degree strata. Thus, centring the adjacency matrix does not remove the contribution of the principal eigenvector. Empirically we see that the test statistic is of a larger order than it would be under the Erdős–Rényi model. As a result, the hypothesis test rejects the null hypothesis that the adjacency matrix generated from the degree-corrected model is an Erdős–Rényi graph. The test keeps splitting until we actually reach the subgraphs that are induced by the degree strata, which indeed are Erdős–Rényi graphs. We used $n = 1{,}000$ and $k = 1$ and the following model parameters:
$$\theta_i \propto \begin{cases} 1, & 1 \le i \le 200,\\ 5, & 201 \le i \le 500,\\ 10, & 501 \le i \le 1000. \end{cases} \qquad (10)$$

Latent space models. Next we apply our method to a model akin to latent position cluster models (Handcock et al., 2007). In particular, node $i$ is assigned a latent position $\psi_i$ in a two-dimensional space. The positions of the nodes in the $i$th cluster are generated from an $N(4i, \sigma^2)$ distribution. We specify
$$P(A_{ij} = 1 \mid \psi_i, \psi_j) := \frac{\exp(-\alpha d_{ij} + \beta)}{1 + \exp(-\alpha d_{ij} + \beta)},$$
where $d_{ij} = \|\psi_i - \psi_j\|_2$. Note that the linkage probabilities within a cluster are now no longer identical, but close, depending on the cluster variance $\sigma^2$.

As $\sigma$ increases, the disparity of linkage probabilities within a cluster increases, and we expect our algorithm to split a block further to find homogeneous structure. Also, with increasing $\sigma$ there is more overlap between clusters, leading to harder clustering. Fig. 6(a) shows the clear block structure of the adjacency matrix for $\sigma = 0.1$, whereas Fig. 6(b) shows the diminished block structure for $\sigma = 3.1$. In both Fig. 6(a) and Fig. 6(b) the rows and columns are reordered so that all nodes with latent positions generated from the same Gaussian distribution are placed together. In Fig. 6(c) we show the classification error rate as $\sigma$ grows. For small values of $\sigma$, the model is well approximated by a stochastic block model with four blocks, and our recursive partitioning algorithm finds the four clusters accurately. As $\sigma$ grows, the performance deteriorates as the clustering algorithm finds additional structure to capture the overlaps between the clusters.

Fig. 6. Block structure of latent space graphs, where the latent positions are generated from four Gaussian distributions with identical variance $\sigma^2$ and separated means: (a), (b) block structures for networks generated with small ($\sigma = 0.1$) and large ($\sigma = 3.1$) $\sigma$ respectively (nz denotes twice the number of edges); (c) classification error computed against the data-generating cluster assignments as $\sigma$ grows

4.2. Real networks
Now we present results on real world networks with known labels.

We compare our algorithm's performance with state of the art clustering methods on Facebook ego networks (Section 4.2.1) and on the karate club and political books networks (Section 4.2.2).

4.2.1. Facebook ego networks
We show our results on ego networks manually collected and labelled by McAuley and Leskovec (2012). Here we have a collection of nine networks, each of which is the induced subgraph formed by the neighbours of a node. The central node is called the ego node. The ground truth labels consist of overlapping cluster assignments, also known as circles. The hope is to identify social circles of the ego node by examining the network structure and features of nodes. Whereas McAuley and Leskovec's (2012) work takes node features into account, we work only with the network structure. For every network we remove nodes with zero degree and cluster the remaining nodes. Since ground truth clusters are sometimes incomplete, in the sense that not all nodes are assigned to some cluster, we use the F-score for comparing two clusterings. Consider the ground truth cluster $C$ and the computed cluster $\hat{C}$. The F-measure between these is defined as follows:
$$\mathrm{recall}(C, \hat{C}) = \frac{|C \cap \hat{C}|}{|C|}, \qquad \mathrm{precision}(C, \hat{C}) = \frac{|C \cap \hat{C}|}{|\hat{C}|},$$
$$F(C, \hat{C}) = \frac{2\,\mathrm{precision}(C, \hat{C})\,\mathrm{recall}(C, \hat{C})}{\mathrm{precision}(C, \hat{C}) + \mathrm{recall}(C, \hat{C})}.$$
This was extended to hierarchical clusterings by Larsen and Aone (1999). For ground truth cluster $C_i$, one computes $x_i = \max_j \{F(C_i, \hat{C}_j)\}$, where $\hat{C}_j$ is obtained by flattening out the subtree for node $j$ in the hierarchical clustering tree. The overall F-measure is then obtained by computing the weighted average $\sum_i x_i |C_i| / \sum_j |C_j|$.

For the real data we use a cut-off ($\alpha$ in algorithm 3) of 0.001. We can also stop dividing the graph when the subgraph size falls under a given number, say $n_\beta$. Although we report results without any such stopping condition added, we note that, for $n_\beta = 10$, the F-measures are similar, whereas the numbers of clusters are fewer. In Table 5, we compare our recursive bipartitioning algorithm RB with McAuley and Leskovec's (2012) algorithm, by using the code that was kindly shared by Julian McAuley. We see that we obtain better or comparable F-measures for most of the ego networks.

Table 5. F-measure comparison on nine Facebook ego networks (columns: nodes with non-zero degree; number of ground truth clusters; F-measure of McAuley and Leskovec (2012); number of clusters learned by RB; F-measure of RB)
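A small helper (our sketch, with illustrative names) computing the per-cluster F-measure and the Larsen–Aone weighted average described above:

```python
import numpy as np

def f_measure(truth, computed):
    """F-measure between one ground-truth cluster and one computed cluster,
    each given as a collection of node identifiers."""
    truth, computed = set(truth), set(computed)
    overlap = len(truth & computed)
    if overlap == 0:
        return 0.0
    recall = overlap / len(truth)
    precision = overlap / len(computed)
    return 2 * precision * recall / (precision + recall)

def hierarchical_f_measure(truth_clusters, subtree_clusters):
    """Match each ground-truth cluster C_i to its best flattened subtree
    (x_i = max_j F(C_i, C_hat_j)), then average weighted by |C_i|."""
    sizes = np.array([len(c) for c in truth_clusters], dtype=float)
    best = np.array([max(f_measure(c, t) for t in subtree_clusters)
                     for c in truth_clusters])
    return float((best * sizes).sum() / sizes.sum())
```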

To visualize the cluster structure uncovered by algorithm RB, we present Fig. 7, in which we show a density image of a matrix whose rows and columns are ordered such that all nodes in the same subtree appear consecutively. Thus nodes in every subtree correspond to a diagonal block in Fig. 7(a). Also, a subtree belonging to a parent subtree will give rise to a diagonal block contained inside that of the parent subtree. This helps us to see the hierarchical structure. Further, we shade every diagonal block by using the $\hat{p}$ computed from the subgraph induced by the nodes in the subtree corresponding to it. In Fig. 7(a), we plot this matrix for one of the ego networks on the log-scale. The lighter the shading in a block, the higher the corresponding $\hat{p}$. To match this image with the graph itself, we also plot the adjacency matrix with rows and columns ordered identically in Fig. 7(b). The density plot shows that the hierarchical splits find regions of varied densities.

Fig. 7. (a) Density plot for one ego network with rows ordered to have nodes from the same cluster consecutively and (b) adjacency matrix using the same order

4.2.2. Karate club and the political books network
The karate club data are a well-known network of 34 individuals belonging to a karate club whose members later split into two groups after a disagreement on class fees (Zachary, 1977). These two groups are considered the ground truth communities. In Fig. 8 we present the clusterings that are obtained by using the various algorithms.

Fig. 8. Clusters obtained from the karate club network by using (a) community extraction, (b) pseudolikelihood, (c) recursive bipartitioning with a p-value cut-off of 0.001 and (d) recursive bipartitioning with a p-value cut-off of 0.1

In particular, we show the clusterings that are obtained by using the extraction method (algorithm E) in Fig. 8(a), the pseudolikelihood method (algorithm PL) with $k = 3$ (Amini et al., 2013) in Fig. 8(b), our recursive bipartitioning algorithm RB using a $p$-value cut-off of 0.001 in Fig. 8(c) and finally algorithm RB with a $p$-value cut-off of 0.1 in Fig. 8(d). These results were generated by using the code of Yunpeng Zhao and Aiyou Chen. We see that algorithm E finds the cores of the two communities, whereas algorithm PL puts high degree nodes in one cluster (similarly to the Markov chain Monte Carlo method for fitting a stochastic block model in Zhao et al. (2011)). Our method achieves perfect clustering for a $p$-value cut-off of 0.001. However, our statistic computed from the dark grey group has a $p$-value of about 0.03, which is why we also show the clustering with a larger cut-off. Here the dark grey community is broken further into a clique-like subset of nodes and the rest. We also provide a density plot in Fig. 9(a) and an image of the adjacency matrix with rows and columns ordered similarly to the density plot in Fig. 9(b) to elucidate this issue.

Fig. 9. (a) Density plot of the karate club data with rows ordered to have nodes from the same cluster consecutively and (b) adjacency matrix using the same order

The political books network (Newman, 2006) is an undirected network of 105 books. Two books are connected if they are co-purchased frequently on Amazon. Although ground truth is not available for this data set, the common conjecture (Zhao et al., 2011) is that some books are strongly political, i.e. liberal or conservative, and the others are somewhat in between. Zhao et al. (2011) also showed that existing algorithms give reasonable results with $k = 3$ clusters, and that algorithm E returns the cores of the communities with $k = 2$. We show the clustering obtained by using algorithm PL with $k = 3$ in Fig. 10(a), the two communities that are extracted by algorithm E in Fig. 10(b), the clustering by algorithm RB in Fig. 10(c) and finally our density plot in Fig. 10(d). Algorithm E finds the core set of nodes from the medium grey and dark grey clusters found by algorithm PL. In contrast, algorithm RB breaks the graph into six parts. The first split separates the dark grey nodes from the rest. The second split separates the light grey nodes from the medium grey nodes. The next two splits divide the medium grey nodes and the dark grey nodes into further smaller clusters. We overlay the density plot with the row and column reordered adjacency matrix, so that the brightest pixels correspond to an edge. The ordering simply puts nodes from the same cluster consecutively, and clusters in the same subtree consecutively. Fig. 10 shows the hierarchically nested structure, where we pick up denser subgraphs.

Fig. 10. Clusterings of the political books data: (a) pseudolikelihood; (b) community extraction; (c) recursive bipartitioning; (d) subgraph density plot superimposed with the adjacency matrix

5. Discussion
In this paper we have proposed an algorithm which provably detects the number of blocks in a graph that is generated from a stochastic block model. Using the largest eigenvalue of the suitably shifted and scaled adjacency matrix, we develop a hypothesis test to decide whether the graph is generated from a stochastic block model with more than one block. Our approach is significantly different from existing work because we theoretically establish the limiting distribution of the statistic under the null hypothesis, which in our case is that the graph is an Erdős–Rényi graph. We also propose to obtain small sample corrections on the limiting distribution, which, together with the known form of the limiting law, alleviates the need for expensive parametric bootstrap replicates. Using this hypothesis test we design the recursive bipartitioning algorithm RB, which naturally yields a hierarchical cluster structure. Strictly speaking, we have proved the validity of our bipartitioning algorithm for $k = 2$ only.

The difficulty is that there is apparently no guarantee that, once we have rejected $k = 1$ and partitioned by using ordinary spectral biclustering, the resulting two partitions are disjoint unions of distinct sets of the true blocks. However, it can be shown that, in the dense regime, under diagonal dominance a slight modification of spectral biclustering can split the network into two partitions, such that each partition is a disjoint union of ground truth clusters with probability tending to 1. Hence, for finite $k$, it should be possible to show that, after each test, with high probability we are testing on unions of disjoint sets of the true blocks. This will be argued elsewhere. For this paper, we demonstrate that ordinary spectral clustering works well. On nine real data sets with ground truth from Facebook, algorithm RB outperforms the existing method that has been shown to have the best performance among other state of the art algorithms for finding overlapping clusters. We also show the nested cluster structure of varied densities that is discovered by algorithm RB on the karate club data and the political books data. Our experiments on the karate club and political books networks are not aimed at showing that we find better quality clusters, but that we find interesting structure matching existing work without having to specify $k$. We choose spectral clustering because of its good theoretical properties in the context of block models (Rohe et al., 2011) and its computational scalability.

Acknowledgements
We thank Elizaveta Levina, Yunpeng Zhao, Aiyou Chen and Julian McAuley for sharing their code. We are also grateful to Antti Knowles for pointing out the relevant literature for applying the result on isotropic delocalization of eigenvectors to our setting. This research was funded in part by National Science Foundation Focused Research Group on Networks grant DMS.

Appendix A: Proof of main result
In this section we present proof sketches of theorem 1 and proposition 1. The complete proof with details is included in the on-line supplementary material. Our proof uses machinery developed in random-matrix theory in recent years. For ease of understanding, we shall state some results without rigorous statements; these will be given in detail in the supplementary material. We begin with Weyl's interlacing inequality, which will be used heavily in our proof.

A.1. Weyl's interlacing inequality
Let $B_1$ be an $n \times n$ real symmetric matrix and $B_2 = B_1 + dxx^T$, where $d > 0$ and $x \in \mathbb{R}^n$. Denoting the $i$th largest eigenvalue of a matrix by $\lambda_i(\cdot)$, we have
$$\lambda_n(B_1) \le \lambda_n(B_2) \le \lambda_{n-1}(B_1) \le \cdots \le \lambda_2(B_2) \le \lambda_1(B_1) \le \lambda_1(B_2). \qquad (11)$$
An immediate corollary of this result is that, for $d < 0$,
$$\lambda_n(B_2) \le \lambda_n(B_1) \le \lambda_{n-1}(B_2) \le \cdots \le \lambda_2(B_1) \le \lambda_1(B_2) \le \lambda_1(B_1). \qquad (12)$$
Let $\hat{p} := \sum_{ij} A_{ij}/\{n(n-1)\}$, and let $e$ denote the normalized $n \times 1$ vector of all 1s. As in equation (6), $\hat{P}$ is the empirical version of $P$ (equation (3)).

Lemma 1. Let $\tilde{A}_1 := \tilde{A}' + n(p - \hat{p})ee^T/\sqrt{(n-1)p(1-p)}$. Also let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $\tilde{A}'$ and $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n$ be the eigenvalues of $\tilde{A}_1$. If $p$ is a constant with respect to $n$, we have $|\mu_1 - \lambda_1| = o_P(1/n)$.

A.2. Proof sketch of theorem 1
Let $\lambda_i$ and $v_i$ be the $i$th eigenvalues and eigenvectors of $\tilde{A}'$ respectively, where $\lambda_i \ge \lambda_{i+1}$ for $i \in \{1, \ldots, n-1\}$.

Also, let $\mu_i$ and $u_i$ be the $i$th eigenvalues and eigenvectors of $\tilde{A}_1$ respectively, also arranged in decreasing order of $\mu_i$. Let $G(z) := (\tilde{A}' - zI)^{-1}$ and $G_1(z) := (\tilde{A}_1 - zI)^{-1}$ be the resolvents of $\tilde{A}'$ and $\tilde{A}_1$. Let
$$c_n := \frac{n(\hat{p} - p)}{\sqrt{(n-1)p(1-p)}}.$$
We note that the matrices $\tilde{A}'$ and $\tilde{A}_1$ differ by a random multiple of the all-1s matrix:
$$\tilde{A}' = \tilde{A}_1 + c_n ee^T. \qquad (13)$$
This equation also gives
$$|\lambda_1 - \mu_1| \le |c_n| = O_P(1/\sqrt{n}), \qquad (14)$$
which is true because $\hat{p}$ is the average of $n(n-1)/2$ independent and identically distributed Bernoulli variables, and thus $c_n = O_P(1/\sqrt{n})$ for $p$ constant with respect to $n$. However, this error masks the $n^{-2/3}$-scale of the Tracy–Widom law. Equation (13) also gives the identity
$$e^T\{G(z) - G_1(z)\}e = -c_n\{e^T G(z)e\}\{e^T G_1(z)e\}, \qquad \frac{1}{e^T G(z)e} - \frac{1}{e^T G_1(z)e} = c_n.$$
Since $1/\{e^T G_1(\mu_1)e\} = 0$, we have $e^T G(\mu_1)e = 1/c_n$. Further, using Weyl's interlacing result in Appendix A.1 we see that the eigenvalues of $\tilde{A}'$ and $\tilde{A}_1$ interlace. Since the eigenvalues and eigenvectors of $G(z)$ are given by $1/(\lambda_i - z)$ and $v_i$ respectively, we have
$$1 = c_n e^T G(\mu_1)e = c_n \sum_i \frac{(e^T v_i)^2}{\lambda_i - \mu_1}. \qquad (15)$$
We shall now do a case-by-case analysis.

A.2.1. Case $c_n > 0$
In this case the interlacing result (equation (11)) tells us that $\lambda_1 \ge \mu_1 \ge \lambda_i$ for all $i > 1$. Thus we have
$$\frac{1}{c_n} = \frac{(e^T v_1)^2}{\lambda_1 - \mu_1} - \sum_{i>1} \frac{(e^T v_i)^2}{\mu_1 - \lambda_i} \le \frac{(e^T v_1)^2}{\lambda_1 - \mu_1}, \qquad \text{whence} \qquad \lambda_1 - \mu_1 \le c_n (e^T v_1)^2. \qquad (16)$$

A.2.2. Case $c_n < 0$
In this case the interlacing result (equation (12)) tells us that $0 \le \mu_1 - \lambda_1 \le \mu_1 - \lambda_i$ for all $i > 1$. We now divide the eigenvalues $\lambda_i$ into two groups: one with $\mu_1 - \lambda_i \le 2|c_n|$ (denoted by $S_{c_n}$), and the other with $\mu_1 - \lambda_i > 2|c_n|$. Since $\sum_i (v_i^T e)^2 = 1$, we have
$$\frac{1}{|c_n|} = \sum_i \frac{(e^T v_i)^2}{\mu_1 - \lambda_i} \le \sum_{i \in S_{c_n}} \frac{(e^T v_i)^2}{\mu_1 - \lambda_i} + \frac{1}{2|c_n|}.$$
Further, since $\mu_1 - \lambda_1 \le \mu_1 - \lambda_i$ for all $i > 1$,
$$\mu_1 - \lambda_1 \le 2|c_n| \sum_{i \in S_{c_n}} (e^T v_i)^2. \qquad (17)$$
Let $c_n^- = c_n \mathbf{1}(c_n < 0)$. Combining equations (16) and (17) we see that
$$|\lambda_1 - \mu_1| \le |c_n| \max\Big\{2\sum_{i \in S_{c_n^-}} (e^T v_i)^2,\; (e^T v_1)^2\Big\}. \qquad (18)$$
We can now use the fact that, in the bulk, it is possible to estimate the empirical eigenvalue density of general Wigner ensembles by using the semicircle law (Erdős et al., 2012). Using probabilistic bounds on $\mu_1$ and $\hat{p}$ we can show that
$$|S_{c_n^-}| = \tilde{O}_P(n^{1/4}). \qquad (19)$$

Fig. 11. Semicircle distribution

The details are presented in the on-line supplementary material. Now we shall use another result (theorem 2.16 of Bloemendal et al. (2014)), which shows that, under some broad conditions, the projection of a deterministic vector on any eigenvector of a symmetric Wigner ensemble is uniformly $\tilde{O}_P(1/\sqrt{n})$. Here the $\tilde{O}_P(\xi)$ notation denotes a sequence of random variables which are bounded in probability by some non-negative random variable $\xi$ up to small powers of $n$. This yields
$$\sum_{i \in S_{c_n^-}} (e^T v_i)^2 = \tilde{O}_P(n^{-3/4}). \qquad (20)$$
Since $(e^T v_1)^2 = \tilde{O}_P(1/n)$ by the aforementioned result, equation (18) in conjunction with equation (20) yields $|\lambda_1 - \mu_1| = \tilde{O}_P(n^{-5/4})$. The precise definition of the $\tilde{O}$-notation ensures that $\tilde{O}_P(n^{-5/4})$ is $o_P(1/n)$ for sufficiently large $n$. We make this more precise in the on-line supplementary material. Now theorem 1 follows by a series of simple algebraic manipulations, which we defer to the on-line supplementary material.

A.3. Proof of proposition 1
If $B_{ii} > \sum_{j \ne i} B_{ij}$, then $B$ is a positive definite matrix by diagonal dominance. Hence $ZBZ^T$ is also positive semidefinite. Since we are considering the dense regime of degrees, i.e. where the elements of $B$ are constant with respect to $n$, the $k$ largest eigenvalues of $E[A \mid Z]$ (equation (1)) are of the form $C_i n$, where $C_i$, $1 \le i \le k$, are positive constants. Oliveira (2009) showed that $\lambda_i(A) = \lambda_i(E[A \mid Z]) + O_P[\sqrt{\{n \log(n)\}}]$. Hence, with high probability, the $k$ largest eigenvalues of $A$ will be positive. Using Weyl's interlacing result we have $\lambda_2(A) \le \lambda_1(A - \hat{P}) \le \lambda_1(A)$. Thus, with high probability, $\lambda_1(A - \hat{P}) \ge Cn$ for some positive constant $C$. Thus, for large $n$, $\lambda_1(\tilde{A}') \ge C\sqrt{n}$ with high probability and, since $\theta := n^{2/3}\{\lambda_1(\tilde{A}') - 2\}$, the result is proved.

References

Adamic, L. A. and Glance, N. (2005) The political blogosphere and the 2004 U.S. election: divided they blog. In Proc. 3rd Int. Wrkshp Link Discovery. New York: Association for Computing Machinery.
Airoldi, E. M., Blei, D. M., Fienberg, S. E. and Xing, E. P. (2008) Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9.
Airoldi, E. M., Costa, T. B. and Chan, S. H. (2013) Stochastic blockmodel approximation of a graphon: theory and consistent estimation. In Advances in Neural Information Processing Systems (eds C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Weinberger), vol. 26.
Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013) Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist., 41.
Bartlett, M. S. (1937) Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A, 160.
Bickel, P. J. and Chen, A. (2009) A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natn. Acad. Sci. USA, 106.
Bloemendal, A., Erdős, L., Knowles, A., Yau, H.-T. and Yin, J. (2014) Isotropic local laws for sample covariance and generalized Wigner matrices. Electron. J. Probab., 19.
Chatterjee, S. (2015) Matrix estimation by universal singular value thresholding. Ann. Statist., 43.
Erdős, L., Yau, H.-T. and Yin, J. (2012) Rigidity of eigenvalues of generalized Wigner matrices. Adv. Math., 229.
Füredi, Z. and Komlós, J. (1981) The eigenvalues of random symmetric matrices. Combinatorica, 1.


More information

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

A Random Dot Product Model for Weighted Networks arxiv: v1 [stat.ap] 8 Nov 2016

A Random Dot Product Model for Weighted Networks arxiv: v1 [stat.ap] 8 Nov 2016 A Random Dot Product Model for Weighted Networks arxiv:1611.02530v1 [stat.ap] 8 Nov 2016 Daryl R. DeFord 1 Daniel N. Rockmore 1,2,3 1 Department of Mathematics, Dartmouth College, Hanover, NH, USA 03755

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

1 Tridiagonal matrices

1 Tridiagonal matrices Lecture Notes: β-ensembles Bálint Virág Notes with Diane Holcomb 1 Tridiagonal matrices Definition 1. Suppose you have a symmetric matrix A, we can define its spectral measure (at the first coordinate

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

Communities, Spectral Clustering, and Random Walks

Communities, Spectral Clustering, and Random Walks Communities, Spectral Clustering, and Random Walks David Bindel Department of Computer Science Cornell University 26 Sep 2011 20 21 19 16 22 28 17 18 29 26 27 30 23 1 25 5 8 24 2 4 14 3 9 13 15 11 10 12

More information

arxiv: v1 [math.st] 16 Aug 2011

arxiv: v1 [math.st] 16 Aug 2011 Retaining positive definiteness in thresholded matrices Dominique Guillot Stanford University Bala Rajaratnam Stanford University August 17, 2011 arxiv:1108.3325v1 [math.st] 16 Aug 2011 Abstract Positive

More information

Network Cross-Validation for Determining the Number of Communities in Network Data

Network Cross-Validation for Determining the Number of Communities in Network Data Network Cross-Validation for Determining the Number of Communities in Network Data Kehui Chen and Jing Lei University of Pittsburgh and Carnegie Mellon University August 1, 2016 Abstract The stochastic

More information

An indicator for the number of clusters using a linear map to simplex structure

An indicator for the number of clusters using a linear map to simplex structure An indicator for the number of clusters using a linear map to simplex structure Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep Zuse Institute Berlin ZIB Takustraße 7, D-495 Berlin, Germany

More information

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018 ELE 538B: Mathematics of High-Dimensional Data Spectral methods Yuxin Chen Princeton University, Fall 2018 Outline A motivating application: graph clustering Distance and angles between two subspaces Eigen-space

More information

Matrix estimation by Universal Singular Value Thresholding

Matrix estimation by Universal Singular Value Thresholding Matrix estimation by Universal Singular Value Thresholding Courant Institute, NYU Let us begin with an example: Suppose that we have an undirected random graph G on n vertices. Model: There is a real symmetric

More information

Link Prediction. Eman Badr Mohammed Saquib Akmal Khan

Link Prediction. Eman Badr Mohammed Saquib Akmal Khan Link Prediction Eman Badr Mohammed Saquib Akmal Khan 11-06-2013 Link Prediction Which pair of nodes should be connected? Applications Facebook friend suggestion Recommendation systems Monitoring and controlling

More information

A sequence of triangle-free pseudorandom graphs

A sequence of triangle-free pseudorandom graphs A sequence of triangle-free pseudorandom graphs David Conlon Abstract A construction of Alon yields a sequence of highly pseudorandom triangle-free graphs with edge density significantly higher than one

More information

CS224W: Social and Information Network Analysis

CS224W: Social and Information Network Analysis CS224W: Social and Information Network Analysis Reaction Paper Adithya Rao, Gautam Kumar Parai, Sandeep Sripada Keywords: Self-similar networks, fractality, scale invariance, modularity, Kronecker graphs.

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Learning from Sensor Data: Set II. Behnaam Aazhang J.S. Abercombie Professor Electrical and Computer Engineering Rice University

Learning from Sensor Data: Set II. Behnaam Aazhang J.S. Abercombie Professor Electrical and Computer Engineering Rice University Learning from Sensor Data: Set II Behnaam Aazhang J.S. Abercombie Professor Electrical and Computer Engineering Rice University 1 6. Data Representation The approach for learning from data Probabilistic

More information

Network Representation Using Graph Root Distributions

Network Representation Using Graph Root Distributions Network Representation Using Graph Root Distributions Jing Lei Department of Statistics and Data Science Carnegie Mellon University 2018.04 Network Data Network data record interactions (edges) between

More information

Local Kesten McKay law for random regular graphs

Local Kesten McKay law for random regular graphs Local Kesten McKay law for random regular graphs Roland Bauerschmidt (with Jiaoyang Huang and Horng-Tzer Yau) University of Cambridge Weizmann Institute, January 2017 Random regular graph G N,d is the

More information

Random regular digraphs: singularity and spectrum

Random regular digraphs: singularity and spectrum Random regular digraphs: singularity and spectrum Nick Cook, UCLA Probability Seminar, Stanford University November 2, 2015 Universality Circular law Singularity probability Talk outline 1 Universality

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Random Matrices: Invertibility, Structure, and Applications

Random Matrices: Invertibility, Structure, and Applications Random Matrices: Invertibility, Structure, and Applications Roman Vershynin University of Michigan Colloquium, October 11, 2011 Roman Vershynin (University of Michigan) Random Matrices Colloquium 1 / 37

More information

Modeling heterogeneity in random graphs

Modeling heterogeneity in random graphs Modeling heterogeneity in random graphs Catherine MATIAS CNRS, Laboratoire Statistique & Génome, Évry (Soon: Laboratoire de Probabilités et Modèles Aléatoires, Paris) http://stat.genopole.cnrs.fr/ cmatias

More information

A Generalization of Wigner s Law

A Generalization of Wigner s Law A Generalization of Wigner s Law Inna Zakharevich June 2, 2005 Abstract We present a generalization of Wigner s semicircle law: we consider a sequence of probability distributions (p, p 2,... ), with mean

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Spectral Clustering for Dynamic Block Models

Spectral Clustering for Dynamic Block Models Spectral Clustering for Dynamic Block Models Sharmodeep Bhattacharyya Department of Statistics Oregon State University January 23, 2017 Research Computing Seminar, OSU, Corvallis (Joint work with Shirshendu

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Near-domination in graphs

Near-domination in graphs Near-domination in graphs Bruce Reed Researcher, Projet COATI, INRIA and Laboratoire I3S, CNRS France, and Visiting Researcher, IMPA, Brazil Alex Scott Mathematical Institute, University of Oxford, Oxford

More information

A Bayesian Criterion for Clustering Stability

A Bayesian Criterion for Clustering Stability A Bayesian Criterion for Clustering Stability B. Clarke 1 1 Dept of Medicine, CCS, DEPH University of Miami Joint with H. Koepke, Stat. Dept., U Washington 26 June 2012 ISBA Kyoto Outline 1 Assessing Stability

More information

arxiv: v1 [stat.me] 6 Nov 2014

arxiv: v1 [stat.me] 6 Nov 2014 Network Cross-Validation for Determining the Number of Communities in Network Data Kehui Chen 1 and Jing Lei arxiv:1411.1715v1 [stat.me] 6 Nov 014 1 Department of Statistics, University of Pittsburgh Department

More information

Impact of regularization on Spectral Clustering

Impact of regularization on Spectral Clustering Impact of regularization on Spectral Clustering Antony Joseph and Bin Yu December 5, 2013 Abstract The performance of spectral clustering is considerably improved via regularization, as demonstrated empirically

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu Task: Find coalitions in signed networks Incentives: European

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

Isotropic local laws for random matrices

Isotropic local laws for random matrices Isotropic local laws for random matrices Antti Knowles University of Geneva With Y. He and R. Rosenthal Random matrices Let H C N N be a large Hermitian random matrix, normalized so that H. Some motivations:

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Discrete Signal Processing on Graphs: Sampling Theory

Discrete Signal Processing on Graphs: Sampling Theory IEEE TRANS. SIGNAL PROCESS. TO APPEAR. 1 Discrete Signal Processing on Graphs: Sampling Theory Siheng Chen, Rohan Varma, Aliaksei Sandryhaila, Jelena Kovačević arxiv:153.543v [cs.it] 8 Aug 15 Abstract

More information

Overlapping Communities

Overlapping Communities Overlapping Communities Davide Mottin HassoPlattner Institute Graph Mining course Winter Semester 2017 Acknowledgements Most of this lecture is taken from: http://web.stanford.edu/class/cs224w/slides GRAPH

More information

Learning discrete graphical models via generalized inverse covariance matrices

Learning discrete graphical models via generalized inverse covariance matrices Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Preliminaries and Complexity Theory

Preliminaries and Complexity Theory Preliminaries and Complexity Theory Oleksandr Romanko CAS 746 - Advanced Topics in Combinatorial Optimization McMaster University, January 16, 2006 Introduction Book structure: 2 Part I Linear Algebra

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

Jure Leskovec Joint work with Jaewon Yang, Julian McAuley

Jure Leskovec Joint work with Jaewon Yang, Julian McAuley Jure Leskovec (@jure) Joint work with Jaewon Yang, Julian McAuley Given a network, find communities! Sets of nodes with common function, role or property 2 3 Q: How and why do communities form? A: Strength

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Algebraic Representation of Networks

Algebraic Representation of Networks Algebraic Representation of Networks 0 1 2 1 1 0 0 1 2 0 0 1 1 1 1 1 Hiroki Sayama sayama@binghamton.edu Describing networks with matrices (1) Adjacency matrix A matrix with rows and columns labeled by

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora Scribe: Today we continue the

More information

Laplacian Integral Graphs with Maximum Degree 3

Laplacian Integral Graphs with Maximum Degree 3 Laplacian Integral Graphs with Maximum Degree Steve Kirkland Department of Mathematics and Statistics University of Regina Regina, Saskatchewan, Canada S4S 0A kirkland@math.uregina.ca Submitted: Nov 5,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

More information

Markov Chains and Spectral Clustering

Markov Chains and Spectral Clustering Markov Chains and Spectral Clustering Ning Liu 1,2 and William J. Stewart 1,3 1 Department of Computer Science North Carolina State University, Raleigh, NC 27695-8206, USA. 2 nliu@ncsu.edu, 3 billy@ncsu.edu

More information

Web Structure Mining Nodes, Links and Influence

Web Structure Mining Nodes, Links and Influence Web Structure Mining Nodes, Links and Influence 1 Outline 1. Importance of nodes 1. Centrality 2. Prestige 3. Page Rank 4. Hubs and Authority 5. Metrics comparison 2. Link analysis 3. Influence model 1.

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

Analysis of Spectral Kernel Design based Semi-supervised Learning

Analysis of Spectral Kernel Design based Semi-supervised Learning Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Dissertation Defense

Dissertation Defense Clustering Algorithms for Random and Pseudo-random Structures Dissertation Defense Pradipta Mitra 1 1 Department of Computer Science Yale University April 23, 2008 Mitra (Yale University) Dissertation

More information

1 Matrix notation and preliminaries from spectral graph theory

1 Matrix notation and preliminaries from spectral graph theory Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.

More information

GraphRNN: A Deep Generative Model for Graphs (24 Feb 2018)

GraphRNN: A Deep Generative Model for Graphs (24 Feb 2018) GraphRNN: A Deep Generative Model for Graphs (24 Feb 2018) Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, Jure Leskovec Presented by: Jesse Bettencourt and Harris Chan March 9, 2018 University

More information