Scalable and exact sampling method for probabilistic generative graph models

Data Min Knowl Disc

Scalable and exact sampling method for probabilistic generative graph models

Sebastian Moreno¹, Joseph J. Pfeiffer III², Jennifer Neville³

Received: 29 October 2015 / Accepted: 11 April 2018. The Author(s) 2018

Responsible editor: Kristian Kersting. Sebastian Moreno acknowledges the support of CONICYT + PAI/Concurso nacional de apoyo al retorno de investigadores/as desde el extranjero, convocatoria folio B

Sebastian Moreno sebastian.moreno.araya@gmail.com; Joseph J. Pfeiffer III joelpf@microsoft.com; Jennifer Neville neville@purdue.edu

1 Faculty of Engineering and Science, Universidad Adolfo Ibañez, Santiago, Chile
2 Microsoft, Bellevue, WA 98004, USA
3 Computer Science and Statistics, Purdue University, West Lafayette, IN 47907, USA

Abstract Interest in modeling complex networks has fueled the development of multiple probabilistic generative graph models (PGGMs). PGGMs are statistical methods that model the network distribution and match common characteristics of real-world networks. Recently, scalable sampling algorithms for well-known PGGMs made the analysis of large-scale, sparse networks feasible for the first time. However, it has been demonstrated that these scalable sampling algorithms do not sample from the original underlying distribution, and sometimes produce very unlikely graphs. To address this, we extend the algorithm proposed in Moreno et al. (in: IEEE 14th international conference on data mining, pp , 2014) for a single model and develop a general solution for a broad class of PGGMs. Our approach exploits the fact that PGGMs are typically parameterized by a small set of unique probability values; this enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove bias due to conditional sampling and

2 S. Moreno et al. probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Our new algorithm reduces time complexity by avoiding the expensive rejection sampling step previously necessary, and we demonstrate its generality, by outlining implementations for six different PGGMs. We conduct theoretical analysis and empirical evaluation to demonstrate the strengths of our algorithms. We conclude by sampling a network with over a billion edges in 95 s on a single processor. Keywords Network analysis Network models Social networks Graph generation Scalable sampling 1 Introduction Many interesting complex systems can be modeled as networks, with edges (e.g., s, hyperlinks) connecting vertices (individuals, webpages). Consequently, models of graph structure are useful for representing and reasoning about the underlying properties of these systems. Probabilistic models of graph structure (Frank and Strauss 1986; Wasserman and Pattison 1996; Robins et al. 2007; Wang et al. 2013), which represent the likelihood of observing a particular network structure, can be used for prediction (Pfeiffer et al. 2014) and hypothesis testing (Moreno and Neville 2013). Generative models of network structure (Watts and Strogatz 1998; Barabasi and Albert 1999; Kumar et al. 2000; Karrer and Newman 2009, 2010; Golosovsky and Solomon 2012; Bu et al. 2013a), which can provide sample networks with similar but varying structure, can be used to test conjectures about network processes and analyze the performance of algorithms overlaid on the complex system. As such, there has been a great deal of recent research on statistical models that are both probabilistic and generative, since they provide a principled means to approach a wide range of network science analysis tasks. Notably, scalable (subquadratic in number of nodes) sampling algorithms for statistical models such as Kronecker Product Graph Model (KPGM) (Leskovec and Faloutsos 2007) and Chung Lu (CL) (Chung and Lu 2002) made analysis of largescale, sparse networks feasible for the first time (Leskovec et al. 2010; Pinar et al. 2011). However, in contrast to prior expectations, these scalable sampling algorithms do not sample from the underlying probability distribution defined by the model. Recently, we investigated this issue for KPGM-family models (Moreno et al. 2014). We show that the efficient sampling algorithm from Leskovec et al. (2010), models a different space of graphs, and does not sample networks according to the original distribution. In practice, this sampler generates networks that are unlikely to have been drawn from the model distribution. In this work, we generalize the results of Moreno et al. (2014) to a broad class of statistical models that we refer to as probabilistic generative graph models (PGGMs). These models typically represent graph structure probabilistically with a N v N v probability matrix P, where P ij is the probability of edge existence between nodes i and j, and N v is the number of nodes. To avoid the expensive memory construction of P, well known models use a set of parameters Θ to implicitly represent P. For example, the Erdös Rényi random graph model (Erdos and Renyi 1960) assigns equal

3 Scalable and exact sampling method for probabilistic probability p to every possible edge (i.e., i, j P ij = p), then Θ = p implicitly represent the N v N v probability matrix P. Recent models differ in how they specify the edge probabilities in P. The CL model (Chung and Lu 2002) defines the probability of an edge to be proportional to the degree of the incident nodes, while other models such as KPGM (Leskovec and Faloutsos 2007) and mixed KPGM (Moreno et al. 2010) define P via Kronecker multiplication of a small seed parameter matrix, with itself. To generate a network based on P, naive algorithms, which we will refer to as Independent Probability methods, sample edges independently for every pair of nodes. These methods have quadratic time complexity (i.e., O(Nv 2 )), and thus cannot scale to large networks with millions of vertices. Researchers have developed faster alternatives to sample based on P for sparse networks. These algorithms, which we refer as Edge-by-edge methods, avoid the construction of the full P matrix and the analysis over every pair of nodes, and sample the locations for N e edges (i.e., unsampled i, j pairs are assumed to not be linked). KPGM and CL have associated Edge-by-edge sampling algorithms with complexity Õ(N e ) (Leskovec et al. 2010; Pinar et al. 2011). Although these algorithms sample network efficiently, they do not correctly sample from the underlying distribution represented by P. Particularly, some algorithms sample the N e edges based on P, without verifying collisions among the sampled edges (producing multigraphs). Other algorithms sample the sparse set of edges and spread the probability of previous sampled edges throughout the network, incorrectly increasing the probability of unlikely edges. As a consequence, new edges are not sampled from the correct underlying distribution. To address this issue, we build on the work of Moreno et al. (2014) and develop a general grouped sampling process for any PGGM that uses P implicitly in its representation. We exploit a common property of PGGMs that edges are parameterized by a small set of probability values, which enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove any bias due to conditional sampling and probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Specifically, our main contributions in this paper are: 1. Generalization of the proposed algorithm in (Moreno et al. 2014), to any PGGM. 2. Theoretical analysis to prove the correctness of the approach used in the new algorithms over PGGMs. 3. New faster and general algorithm, which uses geometric distributions to replace the Binomial sampling and select the locations of the sampled edges, avoiding rejection sampling. 4. Detailed time complexity analysis of the generalized algorithms, estimating the cost of the rejection sampling process. To demonstrate the applicability of our grouped sampling approaches, we discuss specific implementations for six different PGGMs. We then validate our theoretical analysis through an empirical evaluation, demonstrating the inaccuracies of previous algorithms as well as the advantages of our approaches. As part of our experiments, we demonstrate the scale of our approach on three real world datasets, including a LiveJournal network with 68 million edges. We then push our sampling algorithm far

4 S. Moreno et al. past the limits of prior work, by sampling networks with over one billion edges in 95 seconds on a single processor. 2 Probabilistic generative graph models Let G = (V, E) be a graph/network 1 with a set of finite vertices V and directed edges E V V, where (i, j) E indicates that a directed edge (or arc) exists between nodes V i, V j V. LetN v = V and N e = E be the number of vertices and edges, respectively. Note that this definition and theory focus in directed graphs, but all of our subsequent theory and algorithms can be applied too undirected graphs by working on edges (i, j), such that i j. Definition 1 Probabilistic generative graph model (PGGM) A PGGM is a statistical model M with parameters Θ that defines a size N v N v matrix P of probabilities. P models the structure of the network through the set of binary random variables E (i, j) i, j {1,...,N v }, where each P ij P represents the probability the directed edge (i, j) exists in the network (i.e., if E (i, j) = 1 then (i, j) E and P(E (i, j) = 1) = P ij ). Typically, the model M does not explicitly represent the full P, but instead provides a construction process to calculate each P ij from the parameters Θ. The number of parameters in Θ differs for each PGGM, and can vary in the range [1, Nv 2].2 However, since the goal is to model the underlying structure of the network, most PGGMs have a small number of parameters (usually less than N v ) to be parsimonious and to avoid overfitting. Moreover, let G be the space of all possible graphs with N v nodes, where G =2 N v 2. Then a PGGM implicitly defines a probability distribution over the space of graphs G, where the likelihood of an observed graph G = (V, E) G can be calculated based on P as: (i, j) E P ij (i, j)/ E (1 P ij). Once M and Θ are specified, allowing an implicit representation of P,itispossible to generate a single network G = (V, E) from G. There are two naive ways to sample a single network from G. The first method enumerates all possible graphs from G, calculates each graph s likelihood, and computes a cumulative density function. Then a number is sampled uniformly from the [0, 1] interval, used to index into the CDF, and return a randomly selected network. Unfortunately, the complexity of this method is O(2 N v 2 Nv 2 ), because it requires enumerating and calculating the likelihood of each possible graph. The second method calculates each P ij and sample the E (i, j) independently, using a Bernoulli distribution with P(E (i, j) = 1) = P ij. The time complexity of this method is O(Nv 2 ), yet this is still prohibitive for sampling networks with more than several thousands of nodes. We refer to this second method as Naive Sampling through the rest of the paper. We omit further discussion of the first method (sampling based on the enumeration of the G graphs) because of the prohibitive time complexity. 1 In this paper we will use graphs and networks interchangeably. 2 In the worst case scenario, Θ is a N v N v matrix that explicitly represents P (Θ P).
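To make the Naive Sampling method described above concrete, here is a minimal Python sketch. It is illustrative only: prob(i, j) is a hypothetical callable standing in for the model-specific computation of P_ij from M(Θ), and it is exactly this O(N_v²) double loop that the rest of the paper seeks to avoid.

```python
import random

def naive_sample(n_v, prob, seed=0):
    """Naive O(N_v^2) sampler: draw every directed edge (i, j) independently
    with probability P_ij = prob(i, j). `prob` is a hypothetical callable
    standing in for the model M(Theta)."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n_v):
        for j in range(n_v):
            if rng.random() < prob(i, j):   # Bernoulli(P_ij) trial
                edges.add((i, j))
    return edges

# Example: Erdos-Renyi, where every pair has the same probability p
g = naive_sample(200, lambda i, j: 0.01)
```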

5 Scalable and exact sampling method for probabilistic 2.1 PGGMs Several new PGGMs have been developed through time, including not only the structure determined by the edges, but also some network characteristics (Kim and Leskovec 2010; Kolda et al. 2014; Benson et al. 2014; Aicher et al. 2015). However, in this work we focus in six of the most important PGGM models of the last years, where every edge is sampled using a Bernoulli distribution (E (i, j) Bernoulli(P ij )). Erdös Rényi random graphs model (RG) RGs assume that every edge in the network have the same sampling probability (Erdös and Rényi 1959; Erdos and Renyi 1960). Then, P is implicitly represented by a single probability p defined by the user (P ij = p i, j {1,...,N v }). Chung Lu model (CL) CLs represent the probability of an edge by the degrees of the incident nodes (Chung and Lu 2002), where expected degrees of a generated graph matches the original network degrees. P is based on the degree distribution of the network, where a single P ij can be calculated as d i d j N e (d i is the degree of node V i ). Stochastic block model (SBM) SBMs represent the structure of a network through N k communities (usually N k V ), using a set of RGs (Holland et al. 1983; Wasserman and Anderson 1987). Each node V i is assigned to only one of the N k community, and the network is modeled by a N k N k probability matrix P S that represents the probability of an edge between partitions. P is represented by P S and node s partition Z S (ZV S i is node V i partition assignment), obtaining P ij =P S (ZV S i, ZV S j ). Block two-level Erdös Rényi model (BTER) BTERs represent a network through the combination of RG and CL models (Seshadhri et al. 2012; Kolda et al. 2014). BTER groups nodes with similar number of edges in clusters of size d k + 1 (all nodes have a minimum degree d k ), and defines a vector P B, representing the probability of an edge among nodes in the same partition (RG model). This step increases the community structure by generating multiple edges among nodes in the same cluster. Note that for large N v, P B N e /3, where P B = N e /3 if the network consists in multiple isolated triangles (three nodes connected among them). Then, the probability between nodes of different clusters is estimated based on the CL model, matching the original degree distribution. Similarly to the previous models, P is represented by a combination of P B and the original degree distribution, where P ij is defined by the RG or CL models (if V i and V j belongs to the same cluster RG, otherwise CL). Kronecker product graph model (KPGM) KPGMs assume that networks are hierarchically organized into communities that grow recursively through a fractal process. To represent the fractal process, KPGM uses a seed matrix Θ of size b b, where θ ij [0, 1] and typically b = 2orb = 3, which is recursively extended using K = log b (N v ) Kronecker multiplications with itself. Then, P is calculated through Θ and K, where P ij = K k=1 Θ uk,v k (1 u k,v k b). As can be observed, the probability of an edge is given by the multiplication of K values from Θ (Leskovec

6 S. Moreno et al. Algorithm 1 Original Naive Sampling Require: M and Θ 1: V ={1,...,N v }, E ={} 2: for i {1,...,N v } do 3: for j {1,...,N v } do 4: Calculate P ij based on M(Θ) 5: if Bernoulli(P ij ) then 6: E = E {(i, j)} 7: Return G = (V, E) and Faloutsos 2007; Leskovec et al. 2010). So, by the commutative property of multiplication, a single probability can appear in many places of P. Mixed-Kronecker product graph model (mkpgm) mkpgm is a generalization of KPGM, where the hierarchical communities are related among them via a set of dependent edge probabilities (Moreno et al. 2010). mkpgm also models a network through the parameters Θ and K as explained in KPGM, but it incorporates a new parameter l {1, 2,..., K } that creates the edge dependencies. Specifically, mkpgm uses the same hierarchical structure than KPGM at the first l levels; then it ties the edge probabilities by sampling each of the following hierarchical layers. This process keeps the marginal edge probabilities among nodes but increases the variability of the generated networks. P is calculated through Θ, K, and l, where P ij = K k=1 Θ uk,v k when all hierarchical layers are sampled, otherwise the edge has no probability of being sampled. 2.2 Sampling from PGGMs We discuss the naive Independent probability sampling algorithms of PGGMs and discuss scalable Edge-by-edge sampling algorithms previously mentioned Naive sampling To sample a graph G from a model M with parameter Θ, a naive algorithm samples the Nv 2 possible edges using Bernoulli(P ij), where P ij is calculated from M(Θ). If Bernoulli(P ij ) = 1, the edge (i, j) is added to G. Complexity Algorithm 1 shows an outline for naive sampling algorithms with time complexity O(M c Nv 2), where M c refers to the time needed to calculate a single edge probability P ij (based on M and Θ). Line 4 has complexity O(M c ). Lines 5 and 6 are O(1), and the loops increase the complexity to O(M c Nv 2 ). Note, that in the majority of the cases O(M c ) = O(1), because P ij is a function of the parameters Θ; reducing the naive sampling time complexity algorithm to O(Nv 2 ). However, it is still does not scale to large networks. Space of graphs Given M(Θ), we will refer to the space of graphs with N v nodes to be G O (i.e., all possible graphs that can be generated by M(Θ)), where o stands for original distribution. Formally, G O ={G = (V, E) such that (V ={1, 2,...,N v })

7 Scalable and exact sampling method for probabilistic and (E V V)}. Note that E is a subset or equal to V V. So, G O defines a space of graphs where only a single edge can exist between every pair of nodes (Diestel 1997). All PGGMs define a probability distribution P O over G O. Using these definitions, the network generation process is associated with sampling a network G from P O, where P O (G) : G O [0, 1]. Considering that for sparse networks N e Nv 2, and due to the O(N v 2) complexity of Algorithm 1, authors have searched for algorithms with sub-quadratic time complexity. Even though, this could sound pointless, because only enumerating or iterating over all possible edges has a time complexity O(Nv 2 ), it is possible by avoiding the edge by edge Bernoulli sampling process. For example, consider the Erdös Rényi Random Graphs model where all edges have the same probability p. Then, instead of iterate over all values of P, which is implictly represented by p, an edge can be generated by sampling two nodes from the discrete uniform distribution [1, N v ] and place an edge between the selected nodes. This example belongs to the Edge-byedge sampling algorithms, which have time complexity proportional to the number of edges (O(N e )). Among the most popular Edge-by-edge sampling algorithms, we can mention the CL (Chung and Lu 2002) and KPGM (Leskovec and Faloutsos 2007). Although these algorithms differ with respect to a chosen model, all of them fall within the following generalized processes Edge by edge without rejection Edge-by-edge sampling algorithms avoid the iteration over all pairs of nodes, and only consider the placement locations of N e sampled edges, with all remaining (i.e., unsampled) pairs of nodes considered unlinked. For this process, an edge (i, j) is sampled proportional to its defined probability (P ij ) and added to the network, where P is implicitly represented through M(Θ). This process is repeated N e times, where N e is defined by the user (or could be estimated). Complexity We give pseudocode for the sampling process in Algorithm 2. Generally speaking and assuming that line 4 is constant, the time complexity of the algorithm is Õ(N e ).Thewhile loop (line 3) dominates the time complexity for scalable algorithms, as line 4 has a sampling time of M s which is usually constant, and lines 5 and 6 have constant time. The entire P is never iterated, generated, or calculated; given that these algorithms use other approaches to draw proportional to P ij, rather than directly calculating P ij. As an example, RG uniformly samples (i, j) at random from the Nv 2 possible edges. Space of graphs Let X = N e be the number of edges to place via the edge-byedge sampling algorithm. Edge-by-edge algorithms are also a sampling mechanism from a probability distribution P EN defined over a space of graphs G EN (X) with a fixed number of N v nodes and X = N e edges (EN stands for edge by edge without rejection). Particularly, G EN (X) ={G = (V, E) is a pair of disjoints sets together with a map E V V V, and ( E =N e = X)}. There are two main differences of G EN (X) with respect to G O.First,G EN (X) is restricted to networks with exactly X edges. Second, the mapping function allows multiple edges between the same two nodes (Diestel 1997).
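Before the formal pseudocode (Algorithm 2, below), here is a minimal Python sketch of this edge-by-edge process for a Chung–Lu style model. It is a sketch under assumptions, not the exact algorithm of Pinar et al. (2011): it takes an undirected-style degree list, draws each endpoint with probability proportional to its degree, and keeps repeated pairs, which is precisely why the result lives in the multigraph space G_EN(X) rather than in G_O.

```python
import random

def cl_edge_by_edge(degrees, seed=0):
    """Edge-by-edge Chung-Lu sketch *without* rejection: place N_e edges by
    drawing both endpoints proportionally to degree. Repeated pairs are kept,
    so the output can be a multigraph (the space G_EN discussed above)."""
    rng = random.Random(seed)
    n_e = sum(degrees) // 2                 # assumed undirected-style convention
    nodes = list(range(len(degrees)))
    edges = []                              # a list, because duplicate pairs may occur
    for _ in range(n_e):
        u = rng.choices(nodes, weights=degrees)[0]   # P(u) proportional to d_u
        v = rng.choices(nodes, weights=degrees)[0]
        edges.append((u, v))
    return edges

# Example: a small skewed degree sequence
multi_edges = cl_edge_by_edge([5, 3, 2, 2, 1, 1, 1, 1])
```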

Algorithm 2 Edge by Edge Without Rejection Sampling
Require: M, Θ, and N_e {can be calculated or defined by the user}
1: V = {1,...,N_v}, E = {}
2: countEdge = 0
3: while countEdge < N_e do
4:   Sample edge (i, j) with probability proportional to P_ij
5:   countEdge++
6:   E = E ∪ {(i, j)}
7: Return G = (V, E)

Edge by edge with rejection As the above edge-by-edge algorithm generates multigraphs (two or more edges can be sampled between the same two nodes), a simple solution is to include a rejection process that re-draws edges that were previously sampled (a short code sketch of this variant is given below).

Complexity The pseudocode for this sampling process is similar to Algorithm 2, but a rejection step (if (i, j) ∉ E) is added after line 4. This rejection process increases the runtime in proportion to the number of rejected edges.

Space of graphs Edge-by-edge with rejection algorithms are also a sampling mechanism from a probability distribution P_ER, where ER stands for edge-by-edge with rejection. The space of graphs for networks with a fixed number of N_v nodes and X = N_e edges is G_ER(X) = {G = (V, E) such that V = {1, 2,...,N_v}, E ⊆ V × V, and |E| = N_e = X}. As with the previous sampling process, G_ER(X) is restricted to a specific number of edges and defines a different space of graphs than G_O. However, its definition does not allow multigraphs.

Prior work claimed that edge-by-edge generation algorithms (with and without rejection) sample from P_O(G) (i.e., P_O(G) = P_EN(G|X) = P_ER(G|X)) while reducing the time complexity to Õ(N_e). However, we theoretically proved in Moreno et al. (2014) that these sampling processes lie far from the true distribution, with the exception of RG models. We empirically corroborate these results in the experiments section.

3 Group probability sampling

This section develops our new scalable and exact Group Probability (GP) sampling algorithm for Probabilistic Generative Graph Models (PGGMs). We begin by defining a new representation for PGGMs, followed by the generalization of the GP algorithm from Moreno et al. (2014). We then analyze its time complexity, estimating the cost of the rejection sampling process. Importantly, we conclude the section by removing the rejection sampling step altogether, introducing a new Geometric Grouped Probability (GGP) sampler. This section discusses the generalized algorithms, while the next section gives detailed implementations for six existing PGGMs.
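As referenced above, the rejection variant of the edge-by-edge generator simply re-draws colliding pairs. A minimal sketch, under the same Chung–Lu style assumptions as the previous example: the output is now a simple graph with exactly N_e distinct edges, but, as argued above, its distribution still differs from the one defined by P.

```python
import random

def cl_edge_by_edge_rejection(degrees, seed=0):
    """Edge-by-edge Chung-Lu sketch *with* rejection: colliding pairs are
    re-drawn, so exactly N_e distinct edges are placed. The result is a simple
    graph, but it is still not distributed according to P."""
    rng = random.Random(seed)
    n_e = sum(degrees) // 2
    nodes = list(range(len(degrees)))
    edges = set()
    while len(edges) < n_e:
        u = rng.choices(nodes, weights=degrees)[0]
        v = rng.choices(nodes, weights=degrees)[0]
        if (u, v) not in edges:     # rejection step: skip previously sampled pairs
            edges.add((u, v))
    return edges
```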

3.1 Representation

For most PGGMs, multiple pairs of nodes have the same probability of edge existence in P. For example, in the extreme case of the Random Graph model (RG), every edge exists with equal probability, i.e., P_ij = P_uv = p, ∀i, j, u, v ∈ {1,...,N_v}. Although more complex PGGMs have different probability values throughout P, a single probability value can appear in many places in P (e.g., P_ij = P_kl where (i, j) ≠ (k, l)). As a result, even without iterating over or calculating P, we can identify the cells in P that correspond to a single unique probability value and treat the associated random variables as a group. In this work, we propose to sample each group of identically distributed edge variables using the Binomial distribution. Namely, for every group of cells in P that corresponds to a unique probability value, we draw the number of edges that should be placed from a Binomial distribution. Then, we determine locations for the chosen number of edges from the locations of those cells in P. For current PGGMs, this entire process can be realized without explicitly iterating over or calculating the full matrix P.

We formally introduce some new notation before describing the algorithm. Let U be the set of unique probability values in P, and |U| be the total number of unique probabilities in P as defined by a PGGM. Let π_k be the kth unique probability in U and T = (T_1, T_2,...,T_|U|) be a vector where T_k is the number of times π_k is repeated in P (i.e., the count of (i, j) pairs such that P_ij = π_k). For example, for RG, U = {π_1}, |U| = 1, T = (T_1), and T_1 = N_v².

3.2 Algorithm

Using this notation, a network is sampled in three steps. First, in a preprocessing step, calculate U and T. Second, for each unique probability π_k ∈ U, sample the number of edges x_k using a Binomial distribution (x_k ∼ Bin(n, p) with n = T_k, p = π_k). Third, place the x_k sampled edges uniformly at random among the cells with probability π_k.

Algorithm 3 describes the pseudocode of GP sampling. Line 2 is the preprocessing step that constructs U and T (1st step). Lines 4-6 determine π_k, T_k, and x_k (2nd step). Lines 8-12 generate and place the x_k edges in the network (3rd step). Line 9 generates the new edge and Lines 10 through 12 add (u, v) to E if it has not been previously sampled (this step avoids multigraphs).

3.3 Complexity

The time complexity of the GP sampling algorithm varies depending on the choice of PGGM. To reflect this, we define a general complexity based on an arbitrary model M, where O(M_p) and O(M_s) represent the time complexity for preprocessing and sampling, respectively, when using the model M in the algorithm.³ Line 2 is

3 Note that M_p includes the time needed to calculate probability values, which we referred to earlier in Sect. 2.2 as M_c.

10 S. Moreno et al. Algorithm 3 Group Probability Sampling Require: M and Θ 1: V ={1,...,N v }, E ={} 2: Construct U and T, based on M(Θ) 3: for k = 1; k ++;k U do 4: Get π k U the k-th unique probability of the set U 5: Get T k T the number of times that π k is repeated in P 6: Draw x k Bin(T k,π k ) 7: countedge=0 8: while count Edge < x k do 9: Sample new edge (u,v)among edges with probability π k 10: if (u,v) / Ethen 11: countedge++ 12: E = E (u,v) 13: Return G = (V, E) the preprocessing step, corresponding to constructing of the sets U and T with time complexity O(M p ). Similarly, Lines 4 and 5 have a cost O(1). For Line 6, the exact sampling from a Binomial(T k,π k ) has a complexity O(T kπ k ) (Devroye 1980). Line 9 is the sampling step, which varies according to the model, so we define its time as O(M s ). Lines 10 and 12 are dependent on the choice of network data structure. In practice, hash tables have a constant lookup/insertion time, so for the remainder of the paper we will assume O(1) lookup/insertion time. 4 Using the above analysis, we proceed to calculate the total complexity of the generalized algorithm. The kth iteration of the while loop (Lines 8 12) is dominated by the total number of generated edges (including the rejected samples), multiplied by O(M s ). The total number of generated edges can be calculated as the summation of x k geometric random variables with different success probabilities. Let X k be the total number of edges to be sampled, such that x k different edges are added to the network. Then X k = x k i=1 X k,i where X k,i is a geometric random variable indicating the number of samples ( until the ith edge is successfully added to the network ( X k,i Geometric Tk (i 1) T k )). Thus the expected number of samples is: x k E[X k,i x k ]= i=1 x k i=1 1 T k (i 1) T k T k = T k i=1 = T k T 1 k x k i i=1 x k 1 T k (i 1) ( ) = T k HTk H Tk x k where H j is the jth harmonic number. In the worst case scenario, x k = T k obtaining a total complexity of O(T k H Tk ), increasing the time complexity of the algorithm. Otherwise, if x k < T k, considering that x k π k T k and H j < log( j) + 1, we obtain i=1 1 i 4 In the worst case, lookups/insertions are O(N v ) for hash tables (when the hash function puts all entries are inserted into the same bucket). To avoid this worst case scenario, one can use a balanced binary tree (e.g., red-black trees) for the implementation, which is O(log N v ).

11 Scalable and exact sampling method for probabilistic ( ) ( T k HTk H Tk x k < Tk log(tk ) log ( T k T k π k)) = Tk log ( 1 π k ) Applying the inequality log(1 + X) <X for X 1 based on the Taylor series expansion: E[X k x k ]< T k log ( 1 π k) <π k T k x k Continuing with the time complexity analysis, the loop from lines 8 12 is O(x k M s ). Adding lines 4 7, incorporating the summation over U, and adding line 2, we obtain a total complexity U O M p + M s = O ( ) M p + M s N e (1) k=1 x k As a result, the final complexity of the algorithm is directly related to: the preprocessing time M p, the sampling time M s, and the number of edges N e. It is important to note that a naive implementation of M has complexity O(Nv 2 ); however, we avoid naive implementations in the next section. 3.4 Analysis The GP algorithm samples networks from a specific probability distribution P GP.Let G GP be the space of graphs for the GP sampling process, defined by {G = (V, E) such that (V ={1, 2,...,N v }) and (E V V)}. Note that G GP G O (the space graph of the original sampling process). Then, the generation of a network can be considered as sampling G from P GP (G), where P GP (G) : G GP [0, 1]. With these definitions, the next theorems prove that our generalized GP algorithm samples networks from a probability distribution as defined by a corresponding PGGM. Theorem 1 Given a valid PGGM M(Θ), and the GP sampling algorithm from Algorithm 3 with edge probabilities defined as p GP ( ), then u,v V p GP (E uv ) = p O (E uv ), where p O ( ) is the edge probability from the original correct, but inefficient, Algorithm 1. Proof Let π uv be the newly defined representation of the probability of an edge (u,v), and T be the number of edges with unique probability π uv, then: p GP (E uv ) = T P(E uv X = k)p(x = k) (2) k=0 where X Bin(T,π uv ). We can then see: P(E uv X = k) = 1 P(E uv X = k) (3)

12 S. Moreno et al. where P(E uv X = k) is the probability of not sampling the edge in the k trials. As every edge is randomly assigned over all possible remaining positions, the probability that the edge (u,v)is not selected in the first trial is 1 T 1. For the second trial, the probability is slightly ( higher because ) T 1 positions are available. Generalizing, P(E uv X = k) = k 1 k ( ) 1 i=1 T 1 i 1 j=1 1 =1 1. Reemploying P(E uv X = k) T (i 1) i=1 in Eq. 3: P(E uv X = k) = 1 k ( ) 1 1 = 1 T (i 1) i=1 = 1 T k T Reemploying P(E uv X = k) in Eq. 2: p GP (E uv ) = T k=0 = k T k i=1 k E[X] T Bin(T,π uv ; X = k) = = π uv T T T ( T i ) T i + 1 = p O (E uv ) Next, as the space of graphs and edge probabilities between the original model and GP are equal, we can prove that graph distributions are equal too. Theorem 2 Given a valid PGGM M(Θ), and the GP sampling algorithm from Algorithm 3 with edge probabilities defined as p GP ( ), then G G GP p GP (G) = p O (G), where p O (G) is the graph probability under the original correct, but inefficient, Algorithm 1. Thus P GP (G) = P O (G). Proof The generation of edges across different unique probability values are independent. Let E k be the set of edges with unique probability π k and E k =x k, then p GP (G) = = U k=1 U k=1 U T k p GP (E k ) = p GP (E k Y = i)p(y = i) k=1 i=1 p GP (E k Y = x k )P(Y = x k ) where Y Bin(π k, T k), and p GP (E k Y = i) is the probability of sampling the set of edges E k ( E k =x k ) given that i edges are generated. If i = x k, then p GP (E k Y = i) = 0. In contrast, p GP (E k Y = x k ) is the summation of probabilities over the x k! possible sequences of edges that generate E k. For example, the set of edges (1, 1) and (1, 2), can be generated by the sequence (1, 1), (1, 2) or (1, 2), (1, 1).LetE ki be the k i edge sequence that generates E k. As all edges are uniformly sampled with repetition, then p(e ki ) = T 1 1 k T k 1 1 T k x k +1.Sop GP(E k Y = x k ) is given by:

13 Scalable and exact sampling method for probabilistic p GP (E k Y = x k ) = x k! i=1 p(e ki ) = x k! 1 T k 1 T k 1 1 T k x k + 1 = x k!(t k x k )! T k! Joining p GP (E k Y = x k ) with P(Y = x k ): p GP (E k Y = x k )P(Y = x k )= 1 ( Tk x k ) Reemploying in p GP (G): U U p GP (G)= p GP (E k ) = k=1 = E uv E E uv / E ( Tk x k ) π x k k = 1 ( Tk x k ) π x k k (1 π k )T k x k = k=1 E uv E p O (E uv ) (1 p O (E uv )) = p O (G) (1 π k )T k x k = π x k k (1 π k )T k x k p GP (E uv ) E uv / E (1 p GP (E uv )) Given that G O = G GP, therefore P O (G) = P GP (G). 3.5 Geometric group probability algorithm To avoid the rejection process of the GP algorithm, we develop a new algorithm keeping the group probability proposal, but using geometric sampling for the Binomial sampling step. Namely, Devroye (1980) proved that sampling from a Binomial distribution with parameters n and p (X Bin(n, p)) is obtained through repeatedly sampling L i Geometric(p) until X+1 i=1 L i > n. Specifically, we utilize geometric sampling to avoid rejection process. At the beginning of the sampling process, we index the T k possible edge locations of an unique probability π k. Then, we generate the jth edge using the index j i=1 L i. For example, if T k = 10, L 1 = 4, L 2 = 2, and L 3 = 5, we generate two edges corresponding to the indexes locations L 1 = 4 and L 1 + L 2 = 6 (a third edge is not generated because 3i=1 L i > 10). Sampling The pseudocode of the new sampling process is given in Algorithm 4. Lines 1 5 are the same as the GP Algorithm 3. Lines 6 9 define the index of the initial edge. While lines 6 and 7 deal with the cases π k = 0 and π k = 1, line 9 deals with 0 <π k < 1 by sampling from a geometric distribution with probability π k via inverse transformation method (Voss 2013). If the corresponding index is below T k,anew edge is generated in the following loop. This loop determines an edge location based on the sampled index (line 11), inserts the edge into the network (line 12), and samples from the geometric distribution to determine a new index (line 13). Note: if π k = 1 the geometric distribution is not relevant ( 1 π k =0) and all edges with π k = 1 are generated.
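Before the formal pseudocode (Algorithm 4, below), the skip-ahead mechanism just described can be written in a few lines. The sketch below is an illustration, not the exact implementation: it yields the indices of the sampled cells among the T_k positions of a group with probability π_k, using the inverse-transform geometric draw from the text.

```python
import math
import random

def skip_sampled_indices(t_k, pi_k, seed=0):
    """Indices (1..T_k) of the cells that receive an edge in a group with
    probability pi_k, generated via geometric skips instead of T_k Bernoulli
    trials. Sketch of the mechanism behind Geometric Group Probability."""
    rng = random.Random(seed)
    indices = []
    if pi_k <= 0.0:
        return indices
    idx = 0
    while True:
        if pi_k >= 1.0:
            idx += 1                                      # every cell is an edge
        else:
            u = rng.random()                              # inverse-transform Geometric(pi_k)
            idx += 1 + int(math.log(1.0 - u) / math.log(1.0 - pi_k))
        if idx > t_k:
            return indices
        indices.append(idx)

# Example: sampled cell indices among T_k = 10 candidates with pi_k = 0.3
print(skip_sampled_indices(10, 0.3))
```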

14 S. Moreno et al. Algorithm 4 Geometric Group Probability Sampling Require: M and Θ 1: V ={1,...,N v }, E ={} 2: Construct U and T based on M(Θ) 3: for k = 1; k ++;k U do 4: 5: Obtain π k U the k-th unique probability of the set U Obtain T k T the number of times that π k is repeated in P 6: 7: if π k == 0ORπ k == 1 then count Edge = T k (1 π k ) + 1 8: else 9: count Edge = 1 + log(1 Uniform(0,1)) log(1 π k ) 10: while count Edge T k do 11: Sample new edge (u,v)among edges with probability π k based on count Edge 12: E = E (u,v) 13: count Edge = count Edge π k 14: Return G = (V, E) log(1 Uniform(0,1)) log(1 π k ) Complexity We analyze the runtime for a general model M. Line 2 is the preprocessing complexity O(M p ), while lines 4 7 lines are O(1). Lines 9 and 13 have complexity O(1) because of the log( ) function. Lines 10 and 12 are O(1), while line 11 is O(M s ) (sampling time according to the PGGM). Then, the loop from lines is O(1 x k M s ) = O(x k M s ). Incorporating, lines 3 13, the summation over U, and preprocessing time, we obtain a final complexity: U O M p + M s k=1 x k = O(M p + M s N e ) (4) The difference with respect to Algorithm 3 is the elimination of the rejection step, reducing the number of sampled edges. While U k=1 x k N e for Algorithm 3 (recall Sect. 3.3), in this new algorithm U k=1 x k is strictly equal to N e. Analysis As this new Geometric Group Probabilistic Sampling Algorithm (GGP) is an improved implementation of GP, it is also provably correct and samples networks from the original probability distribution of their corresponding PGGMs. 3.6 Summary In this section, we proposed two new sampling algorithms: Group Probability (GP) and Geometric Group Probability (GGP). We demonstrated that, given a valid PGGM M, both algorithms are probably correct and sample networks from the original probability distribution defined by M. The time complexity for GP is bounded by O(M p + M s N e ) when all unique probabilities are less than 1 (π i < 1), where O(M p ) and O(M s ) are the time complexities for preprocessing and sampling a sin-
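Putting the pieces of this section together, the following sketch mirrors the overall structure of the grouped geometric sampler. It is a simplified illustration under stated assumptions: unique_probs and counts play the roles of U and T, and decode(k, idx) is a hypothetical model-specific function (the subject of the next section) that maps the idx-th cell of group k to a node pair.

```python
import math
import random

def ggp_sample(unique_probs, counts, decode, seed=0):
    """Grouped geometric sampling sketch: unique_probs[k] plays the role of
    pi_k, counts[k] of T_k, and decode(k, idx) is a hypothetical function
    mapping the idx-th cell (1..T_k) of group k to a node pair (u, v)."""
    rng = random.Random(seed)
    edges = set()
    for k, (pi_k, t_k) in enumerate(zip(unique_probs, counts)):
        if pi_k <= 0.0:
            continue                         # group contributes no edges
        idx = 0
        while True:
            if pi_k >= 1.0:
                idx += 1                     # deterministic group: every cell is an edge
            else:
                u01 = rng.random()           # geometric skip to the next sampled cell
                idx += 1 + int(math.log(1.0 - u01) / math.log(1.0 - pi_k))
            if idx > t_k:
                break
            edges.add(decode(k, idx))        # no rejection: each index is visited once
    return edges

# Erdos-Renyi example: a single group with T_1 = N_v^2 cells and probability p
n_v, p = 1000, 0.005
er_decode = lambda k, idx: ((idx - 1) // n_v, (idx - 1) % n_v)
graph = ggp_sample([p], [n_v * n_v], er_decode)
```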

15 Scalable and exact sampling method for probabilistic gle edge respectively. GGP overcomes the restriction on the probability values and keeps the same time complexity O(M p + M s N e ). Thus, both sampling algorithms can generate a network in time proportional to the number of edges, when the following conditions are fulfilled: (i) the preprocessing time is less than the number of edges (i.e., O(M p ) O(N e )), and (ii) the sampling process is a small constant (i.e., O(M s ) = O(1)). Both conditions are fulfilled for well known PGGMs. In the next section, we will discuss the implementation and complexity details for six different PGGMs. 4 Implementation This section explains the implementation of the group probability sampling algorithms over six different models: Random Graphs, Chung Lu (CL), Stochastic Block Models, Block Two-Level Erdös Rényi, Kronecker Product Graph Model (KPGM), and mixed KPGM. Naive implementation of these algorithms, analyzing each cell from P, has time complexity O(Nv 2 ). However, our implementations avoid a direct calculation and analysis of P, which reduces the time complexity. Specifically, for each model, we detail the preprocessing step (Line 2 of Algorithms 3 and 4) and the edge generation process, and discuss their respective time complexities O(M p ) and O(M s ). Recall, that the preprocessing corresponds to the construction of the sets U and T (unique probabilities and their counts), and we use the first letter of a model as a superscript to indicate the model (e.g. U C is U for CL). 4.1 Random graphs (RG) The implementation of the group probability sampling algorithms for RG is simple. Given Θ = p (the probability among edges), the sets U R ={p} and T R = Nv 2 (the number of edges with probability p) are created. Note that U R = T R =1. Then, the preprocessing time is O(M p ) = O(1). Group probability Each edge is generated through two uniform samplings between 1 and N v, with sampling time complexity O(M s ) = O(1). Reemploying this value at Eq. 1, the final time complexity is O(1 + 1 N e )) = O(N e ). Geometric group probability The edge indexes are determined using count Edge count Edge N v {1,...,Nv 2}. The new indexes are: idx i = +1 and idx j = count Edge (idx i 1) N v + 1, both with time complexity O(1). Applying these values in Eq. 4, the final complexity is O(1 + 1 N e )) = O(N e ). Summary In this model, the proposed algorithms have constant complexity in their preprocessing and sampling time (O(M p ) = O(M s ) = O(1)), obtaining a final time complexity of O(N e ). 4.2 Chung Lu (CL) Given the degree distribution of the network (Θ), the set of unique probabilities U C is given by the cross product of the unique degrees. Let D C be a vector of the unique

16 S. Moreno et al. degree distribution in a network. Then, the ith lowest unique degree with value j is Di C = j. Similarly, let ND C i be the total number of nodes with degree Di C. For example, if D5 C = 10 and N D C 5 = 3 there are 3 nodes with degree 10, the 5th lowest unique degree. The construction of the vectors D C and ND C can be implemented by analyzing each element of the degree distribution, with a time complexity O(N v ). Given Di C, UC is constructed by the cross product D C D C ( U C = D C 2 ), π k = Di C D C j N e for i, j {1,..., D C }. Similarly, T C = (T1 C,... T k C,... T C ), where T C U C k = ND C i ND C j (the product of the number of nodes with degrees i and j), is created by the vector multiplication ND C N D C. Then, the preprocessing time complexity O(M p) is given by O(N v + D C 2 + ND C 2 ) = O( D C 2 ) given that D C = ND C, and most of the time D C 2 is greater than N v To create the final network, the nodes with same degree are generated in groups. So, to generate an edge, we uniformly sample the indexes from ND C i and ND C j,plusa specific offset with respect to the initial position of the nodes with same degree. Group probability Consider the block structure of the grouped nodes. The index generation between nodes with degree i and j is done through the uniform sampling from ND C i and ND C j, with a specific offset (the total number of nodes with degree less than ND C i and ND C j respectively). This sampling edge process has complexity O(M s ) = O(1). Reemploying values at Eq. 1, the final time complexity is O( D C N e ) = O(N e ),if D C 2 N e. Geometric group probability The edge indexes between nodes with degree i and j are determined using count Edge {1,...,ND C i ND C j }. These indexes are given count Edge 1 by: idx i = +1 and idx ND C j = count Edge (idx i 1) ND C j,plus j the offsets with respect to the initial position of the nodes with degree i and j. Both indexes are calculated with complexity O(1) obtaining a sampling time complexity O(M s ) = O(1). Then, the final time complexity is O( D C N e ) = O(N e ),if D C 2 N e. Summary Both algorithms realize the same preprocessing step with time complexity O(M p ) = O( D C 2 ). Similarly, both algorithms have the same sampling time complexity O(M s ) = O(1), obtaining a final time of O(N e ),if D C 2 N e. Note that in most social networks the degree distribution follows a power law distribution reducing considerably the size of D C. 4.3 Stochastic block model (SBM) SBM requires three different parameters, the number of nodes partition K, the node assignment partition Z S (Zi S = j implies that the node i belongs to partition j), and the probability matrix partition P S (Pi, S j = p ij implies that the probability between the partition i and j is p ij ). With these parameters, we construct the vector N K, where

17 Scalable and exact sampling method for probabilistic N Ki is the number of nodes at partition i. This vector of size K 1 can be constructed by iterating over the vector Z S with time complexity O(N v ). Given the parameters, the set U S is simply P S. In contrast, the set T S = (T1 S,...,TS ), where T S U S k = N ki N k j, is the number of elements with probability π k between nodes of partitions i and j is constructed by the vector multiplication N K N K with time complexity O( N K 2 ) = O(K 2 ). Using these time complexities, the final preprocessing time is O(M p ) = O(N v + K 2 ). To create a network, we group nodes with same partition. Then edge indexes, among node partitions i and j, are sampled uniformly from ND S i and ND S j, plus an offset w.r.t. the position of the first nodes belonging to these partitions. Group probability The edge sampling between nodes with partition i and j is through an uniform sampling from N ki and N k j plus an offset (the total number of nodes belonging to the previous i and j partitions). This edge generation process has a complexity of O(M s ) = O(1), resulting in a total complexity time of O(N v + K N e )) = O(N e ),ifk 2 N e. Geometric group probability The edge indexes, between nodes of the partitions i and j, are determined using count Edge {1,...,N ki N k j }. The indexes are: idx i = count Edge 1 N ki +1 and idx j = count Edge (idx i 1) N ki, plus the offsets with respect to the initial position of the nodes in partitions i and j. Both indexes are calculated with a time complexity O(M s ) = O(1), obtaining a final complexity of of O(N v + K N e )) = O(N e ),ifk 2 N e. Summary Similar to CL, both algorithms have the same preprocessing and sampling complexity time (O(M p ) = O(N v + K 2 ) and O(M s ) = O(1) respectively), obtaining a final time of O(N e ),ifk 2 N e. Note that large values of K correspond to over-fitted models, so typically K is chosen to be considerably smaller than N v. 4.4 Block two-level Erdös Rényi (BTER) BTER receives the following parameters: the number of nodes partition K, the node assignment partition Z B (Zi B = j implies that the node i belongs to partition j), and the probability matrix partition P B (Pi B = p i implies that the probability of an edge inside partition i is p i ). Note: because there are two phases for this model, the preprocessing and sampling complexities are defined as M p1, M p2, M s1, and M s2, where the subindexes corresponds to the first or second phase of the algorithm. For the algorithm first phase (applying RG model on each of the K designated clusters): U1 B = PB, U1 B =K, and TB 1 is the vector with number of elements of each cluster (obtained through the analysis of Z B as in SBM). So, the preprocessing time for the first phase is O(M p1 ) = O(1 + N v ) = O(N v ). The second phase starts analyzing the degree distributions of the generated network, dynamically generated in the previous network, to apply the CL model (Sect. 4.2). Recall that U2 B and TB 2 are calculated using DB (vector of unique degrees) and N D (number of nodes with unique degrees). Then, similarly to Sect. 4.2, the preprocessing complexity is O(M p2 ) = O( D B 2 ).
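As a concrete illustration of the first-phase bookkeeping just described, the sketch below builds U and T for BTER's within-cluster (RG) phase from a cluster assignment Z^B and per-cluster probabilities P^B. The names and the directed, self-loop-inclusive cell count n_c² are assumptions made for illustration, not the authors' exact data structures.

```python
from collections import Counter

def bter_phase1_groups(z_b, p_b):
    """Preprocessing sketch for BTER's first (within-cluster RG) phase:
    from cluster assignments z_b (node -> cluster id) and per-cluster edge
    probabilities p_b (cluster id -> probability), build the unique
    probabilities U and their cell counts T. A cluster with n_c nodes is
    assumed to contribute n_c * n_c candidate cells (directed convention)."""
    cluster_sizes = Counter(z_b)
    unique_probs, counts = [], []
    for c, n_c in cluster_sizes.items():
        unique_probs.append(p_b[c])
        counts.append(n_c * n_c)
    return unique_probs, counts

# Example: 5 nodes in two clusters with different within-cluster densities
U1, T1 = bter_phase1_groups([0, 0, 0, 1, 1], {0: 0.9, 1: 0.5})
```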

18 S. Moreno et al. Group probability Edges are generated using RG and CL group probability algorithms, with sampling times O(M s1 ) = O(M s2 ) = O(1). Using these complexities, the first phase is O(N v + N e1 ), where N e1 is the number of edges generated in the first phase. Similarly, the second phase is O( D B 2 + N e2 ). These values give us a total complexity of O(N e1 + N v + D B 2 + N e2 ) = O(N e ),ifn e > D B 2. Geometric group probability The implementation of this algorithm is realized based on the geometric group probabilities algorithms for RG and CL models, with sampling times O(M s1 ) = O(M s2 ) = O(1). While the first phase has time complexity O(N v + N e1 ), the second phase is O( D B 2 + N e2 ), obtaining a total complexity O(N e ),if N e > D B 2. Summary Both algorithms have the same preprocessing and sampling times. Let N ei be the number of edges sampled in the RG (i = 1) and CL (i = 2) phase. For the RG model phase O(M p1 ) = O(N v ) and O(M s1 ) = O(1), obtaining a total time of O(N v + N e1 ). For the CL model phase O(M p2 ) = O( D B 2 ) and O(M s2 ) = O(1), obtaining a total time of O( D B 2 + N e2 ). Then, the final algorithm has complexity O(N e ),ifn e > D B Kronecker product graph model (KPGM) For KPGM, there are two parameters: a matrix of size b b with values between 0 and 1, whichwecallθ, and the number of kronecker multiplication K = log b (N v ).Forthe implementation, we use the representation developed in (Moreno et al. 2014), where P K (V i, V j ) = θ γ θ γ θ γ bb bb, with γ ij {0, 1,..., K } and b bj=1 i=1 γ ij = K. With this representation, U K is generated by the multiplication of all possible combinations of γ ij subject to b bj=1 i=1 γ ij = K (k-combination with repetition problem (Sheldon 2002)). The time complexity of this generation process is given by O(K U K ), where K corresponds to the multiplications to obtain an unique π k, and U K is the number of k-combination with repetition. As it is demonstrated in Moreno et al. (2014), U K < N v for large K (K 7, 10 for b = 2, 3 respectively), which results in complexity O(K N v ). Similarly T K is constructed by all k = K! γ 11k!γ 12k! γ bbk! possible permutations of Ɣ k ={γ 11,...,γ bb }, assuming the value of the factorials known, then the construction of T K is O(K N v ) too. Then, the preprocessing time of the algorithm is O(M p ) = O(K N v ). Group probability Let Λ i =[i 1, i 2,...,i K ] and Λ j =[j 1, j 2,..., j K ] be θ indexes s.t. K l=1 θ Λi (l)λ j (l) = π k. To sample an edge from the T K k positions, we realize a random permutation over σ = {1, 2,...,K } and calculate the indexes by Kl=1 (Λ i, j (σ (l)) 1)b l Both, the random permutation and indexes, have complexity time O(K ), obtaining O(M s ) = O(K ). Reemploying these values in Eq. 1, the total complexity is O(K N v + K N e ) = Õ(N e ). Note that this is the only implementation discussed in Moreno et al. (2014). Geometric group probability The edge generation process must determine the count Edge th permutation of vectors Λ i and Λ j for a given π k. To determine

Λ_{i,j}(l), we calculate all possible permutations using the different θ_ij's and pick the θ_ij's able to generate the countEdge-th permutation. We start by fixing Λ_{i,j}(1) = 1 with θ_11 (assuming θ_11 is used in π_k) and calculate the number of possible positions over the remaining K−1 elements using the other θ_ij's. If this value is greater than countEdge, then Λ_{i,j}(1) = 1 is fixed and the search continues over the next position. Otherwise, the positions are changed to Λ_i(1) = 1 and Λ_j(1) = 2 (parameter θ_12) and the process is repeated. This process is realized K times, assigning a position to all θ_ij's. Then, the edge generation has time complexity O(M_s) = O(K·b²·b²), where K corresponds to the number of assigned positions and b² values are tested per position, each requiring the computation of a permutation count (2! to K! are assumed known). Substituting these values in Eq. 1, the total complexity is O(K·N_v + K·b⁴·N_e) = Õ(N_e).

Summary Even though both algorithms have the same preprocessing time O(M_p) = O(K·N_v), their sampling times differ by a constant factor of b⁴. While the sampling cost is O(K·N_e) for the group probability algorithm, for the geometric group probability algorithm it is O(K·b⁴·N_e). Besides this difference, both algorithms have a total complexity of Õ(N_e) for K > 7 (b = 2) and K > 10 (b = 3), respectively.

4.6 Mixed KPGM (mkpgm)

In addition to the Θ and K parameters from KPGM, mkpgm has a new parameter l that controls the dependencies between edges. The preprocessing step of mkpgm includes the generation of G_l using KPGM (with K = l in this case). This preprocessing step has a time complexity of Õ(N_{e_l}), where N_{e_l} is the number of edges of G_l. Then, we sample each of the remaining layers G_{l+k} based on the proposed algorithm. Given that P^M_{l+k} = G_{l+k−1} ⊗ Θ, we have U^M = U^M_k = Θ = {θ_11, θ_12,...,θ_bb} for all k ∈ {1,...,K−l}, with time complexity O(1). Moreover, for every π_j ∈ U^M_k, T^M_j = |E_{l+k−1}| (the number of edges of the previous layer), so the construction of T^M_k is O(1) too. Given these time complexities, the preprocessing step has time complexity O(M_p) = Õ(N_{e_l}).

Group probability To generate a random edge at layer G_{l+k} using π_k = θ_ij, we pick a random edge (u, v) from G_{l+k−1} and calculate the new indexes as idx_i = u·b + i and idx_j = v·b + j. Given that u, v are stored, the edge generation process has time complexity O(M_s) = O(1). With these values, the final complexity to generate G_K is Õ(N_{e_l} + Σ_{k=1}^{K−l} N_{e_{l+k}}) = Õ((K−l)·N_e) = Õ(N_e). Note that this implementation differs from Moreno et al. (2014), which does not use a rejection process.

Geometric group probability To generate an edge at layer G_{l+k} using π_k = θ_ij, we pick the countEdge-th edge (countEdge ∈ {1,...,|E_{l+k−1}|}) from E_{l+k−1}. Let (u, v) be the selected edge; then the new indexes are idx_i = (u−1)·b + i and idx_j = (v−1)·b + j (O(M_s) = O(1)). Thus, the final complexity is Õ(N_{e_l} + Σ_{k=1}^{K−l} N_{e_{l+k}}) = Õ(N_e).

Summary The preprocessing time of both algorithms is dominated by the generation of the KPGM network G_l with K = l, which implies O(M_p) = Õ(N_{e_l}), where N_{e_l} is the number of edges of G_l.
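To close the implementation section, here is an end-to-end grouped sampler for the simplest multi-group case, the SBM of Sect. 4.3. It is a compact sketch, not the authors' implementation: nodes are assumed to be numbered contiguously by block, the per-group edge count is drawn from a Binomial as in GP, and random.sample stands in for the rejection/geometric placement step when choosing distinct cells.

```python
import random
import numpy as np

def sbm_grouped_sample(block_sizes, p_s, seed=0):
    """Grouped-probability SBM sampler sketch. block_sizes[a] is the number of
    nodes in block a (nodes numbered contiguously by block) and p_s[a][b] is
    the edge probability between blocks a and b. For every block pair the edge
    count is Binomial(T_k, pi_k), and the edges are placed on distinct cells."""
    rng = random.Random(seed)
    np_rng = np.random.default_rng(seed)
    offsets = np.concatenate(([0], np.cumsum(block_sizes)))
    edges = set()
    for a in range(len(block_sizes)):
        for b in range(len(block_sizes)):
            t_k = block_sizes[a] * block_sizes[b]        # cells sharing probability pi_k
            x_k = int(np_rng.binomial(t_k, p_s[a][b]))   # number of edges in this group
            for idx in rng.sample(range(t_k), x_k):      # distinct cell indices
                u = int(offsets[a]) + idx // block_sizes[b]
                v = int(offsets[b]) + idx % block_sizes[b]
                edges.add((u, v))                        # directed convention, self-loops allowed
    return edges

# Example: two blocks of 50 nodes, dense within blocks, sparse between them
graph = sbm_grouped_sample([50, 50], [[0.10, 0.01], [0.01, 0.10]])
```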


More information

Contents. Counting Methods and Induction

Contents. Counting Methods and Induction Contents Counting Methods and Induction Lesson 1 Counting Strategies Investigations 1 Careful Counting... 555 Order and Repetition I... 56 3 Order and Repetition II... 569 On Your Own... 573 Lesson Counting

More information

Graph Detection and Estimation Theory

Graph Detection and Estimation Theory Introduction Detection Estimation Graph Detection and Estimation Theory (and algorithms, and applications) Patrick J. Wolfe Statistics and Information Sciences Laboratory (SISL) School of Engineering and

More information

Copyright 2013 Springer Science+Business Media New York

Copyright 2013 Springer Science+Business Media New York Meeks, K., and Scott, A. (2014) Spanning trees and the complexity of floodfilling games. Theory of Computing Systems, 54 (4). pp. 731-753. ISSN 1432-4350 Copyright 2013 Springer Science+Business Media

More information

Shortest paths with negative lengths

Shortest paths with negative lengths Chapter 8 Shortest paths with negative lengths In this chapter we give a linear-space, nearly linear-time algorithm that, given a directed planar graph G with real positive and negative lengths, but no

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

On the number of cycles in a graph with restricted cycle lengths

On the number of cycles in a graph with restricted cycle lengths On the number of cycles in a graph with restricted cycle lengths Dániel Gerbner, Balázs Keszegh, Cory Palmer, Balázs Patkós arxiv:1610.03476v1 [math.co] 11 Oct 2016 October 12, 2016 Abstract Let L be a

More information

CS 224w: Problem Set 1

CS 224w: Problem Set 1 CS 224w: Problem Set 1 Tony Hyun Kim October 8, 213 1 Fighting Reticulovirus avarum 1.1 Set of nodes that will be infected We are assuming that once R. avarum infects a host, it always infects all of the

More information

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Randomized Algorithms Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Sampling Random Variables

Sampling Random Variables Sampling Random Variables Introduction Sampling a random variable X means generating a domain value x X in such a way that the probability of generating x is in accordance with p(x) (respectively, f(x)),

More information

The number of Euler tours of random directed graphs

The number of Euler tours of random directed graphs The number of Euler tours of random directed graphs Páidí Creed School of Mathematical Sciences Queen Mary, University of London United Kingdom P.Creed@qmul.ac.uk Mary Cryan School of Informatics University

More information

Lecture 6 September 21, 2016

Lecture 6 September 21, 2016 ICS 643: Advanced Parallel Algorithms Fall 2016 Lecture 6 September 21, 2016 Prof. Nodari Sitchinava Scribe: Tiffany Eulalio 1 Overview In the last lecture, we wrote a non-recursive summation program and

More information

Analysis of Algorithms I: Perfect Hashing

Analysis of Algorithms I: Perfect Hashing Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the

More information

Nonparametric Bayesian Matrix Factorization for Assortative Networks

Nonparametric Bayesian Matrix Factorization for Assortative Networks Nonparametric Bayesian Matrix Factorization for Assortative Networks Mingyuan Zhou IROM Department, McCombs School of Business Department of Statistics and Data Sciences The University of Texas at Austin

More information

Network models: random graphs

Network models: random graphs Network models: random graphs Leonid E. Zhukov School of Data Analysis and Artificial Intelligence Department of Computer Science National Research University Higher School of Economics Structural Analysis

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

More information

CS224W: Social and Information Network Analysis

CS224W: Social and Information Network Analysis CS224W: Social and Information Network Analysis Reaction Paper Adithya Rao, Gautam Kumar Parai, Sandeep Sripada Keywords: Self-similar networks, fractality, scale invariance, modularity, Kronecker graphs.

More information

Modeling of Growing Networks with Directional Attachment and Communities

Modeling of Growing Networks with Directional Attachment and Communities Modeling of Growing Networks with Directional Attachment and Communities Masahiro KIMURA, Kazumi SAITO, Naonori UEDA NTT Communication Science Laboratories 2-4 Hikaridai, Seika-cho, Kyoto 619-0237, Japan

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

The Mixed Chinese Postman Problem Parameterized by Pathwidth and Treedepth

The Mixed Chinese Postman Problem Parameterized by Pathwidth and Treedepth The Mixed Chinese Postman Problem Parameterized by Pathwidth and Treedepth Gregory Gutin, Mark Jones, and Magnus Wahlström Royal Holloway, University of London Egham, Surrey TW20 0EX, UK Abstract In the

More information

An average case analysis of a dierential attack. on a class of SP-networks. Distributed Systems Technology Centre, and

An average case analysis of a dierential attack. on a class of SP-networks. Distributed Systems Technology Centre, and An average case analysis of a dierential attack on a class of SP-networks Luke O'Connor Distributed Systems Technology Centre, and Information Security Research Center, QUT Brisbane, Australia Abstract

More information

Lecture 1 and 2: Introduction and Graph theory basics. Spring EE 194, Networked estimation and control (Prof. Khan) January 23, 2012

Lecture 1 and 2: Introduction and Graph theory basics. Spring EE 194, Networked estimation and control (Prof. Khan) January 23, 2012 Lecture 1 and 2: Introduction and Graph theory basics Spring 2012 - EE 194, Networked estimation and control (Prof. Khan) January 23, 2012 Spring 2012: EE-194-02 Networked estimation and control Schedule

More information

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1 Bayes Networks CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 59 Outline Joint Probability: great for inference, terrible to obtain

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

Multiple Choice Tries and Distributed Hash Tables

Multiple Choice Tries and Distributed Hash Tables Multiple Choice Tries and Distributed Hash Tables Luc Devroye and Gabor Lugosi and Gahyun Park and W. Szpankowski January 3, 2007 McGill University, Montreal, Canada U. Pompeu Fabra, Barcelona, Spain U.

More information

Computing Connected Components Given a graph G = (V; E) compute the connected components of G. Connected-Components(G) 1 for each vertex v 2 V [G] 2 d

Computing Connected Components Given a graph G = (V; E) compute the connected components of G. Connected-Components(G) 1 for each vertex v 2 V [G] 2 d Data Structures for Disjoint Sets Maintain a Dynamic collection of disjoint sets. Each set has a unique representative (an arbitrary member of the set). x. Make-Set(x) - Create a new set with one member

More information

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash Equilibrium Price of Stability Coping With NP-Hardness

More information

1 Complex Networks - A Brief Overview

1 Complex Networks - A Brief Overview Power-law Degree Distributions 1 Complex Networks - A Brief Overview Complex networks occur in many social, technological and scientific settings. Examples of complex networks include World Wide Web, Internet,

More information

Northwestern University Department of Electrical Engineering and Computer Science

Northwestern University Department of Electrical Engineering and Computer Science Northwestern University Department of Electrical Engineering and Computer Science EECS 454: Modeling and Analysis of Communication Networks Spring 2008 Probability Review As discussed in Lecture 1, probability

More information

Strongly chordal and chordal bipartite graphs are sandwich monotone

Strongly chordal and chordal bipartite graphs are sandwich monotone Strongly chordal and chordal bipartite graphs are sandwich monotone Pinar Heggernes Federico Mancini Charis Papadopoulos R. Sritharan Abstract A graph class is sandwich monotone if, for every pair of its

More information

Technische Universität Dresden Institute of Numerical Mathematics

Technische Universität Dresden Institute of Numerical Mathematics Technische Universität Dresden Institute of Numerical Mathematics An Improved Flow-based Formulation and Reduction Principles for the Minimum Connectivity Inference Problem Muhammad Abid Dar Andreas Fischer

More information

ACO Comprehensive Exam March 20 and 21, Computability, Complexity and Algorithms

ACO Comprehensive Exam March 20 and 21, Computability, Complexity and Algorithms 1. Computability, Complexity and Algorithms Part a: You are given a graph G = (V,E) with edge weights w(e) > 0 for e E. You are also given a minimum cost spanning tree (MST) T. For one particular edge

More information

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane Randomized Algorithms Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane Sotiris Nikoletseas Professor CEID - ETY Course 2017-2018 Sotiris Nikoletseas, Professor Randomized

More information

Dynamic Approaches: The Hidden Markov Model

Dynamic Approaches: The Hidden Markov Model Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message

More information

Determining the Diameter of Small World Networks

Determining the Diameter of Small World Networks Determining the Diameter of Small World Networks Frank W. Takes & Walter A. Kosters Leiden University, The Netherlands CIKM 2011 October 2, 2011 Glasgow, UK NWO COMPASS project (grant #12.0.92) 1 / 30

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

Network models: dynamical growth and small world

Network models: dynamical growth and small world Network models: dynamical growth and small world Leonid E. Zhukov School of Data Analysis and Artificial Intelligence Department of Computer Science National Research University Higher School of Economics

More information

Conditional Marginalization for Exponential Random Graph Models

Conditional Marginalization for Exponential Random Graph Models Conditional Marginalization for Exponential Random Graph Models Tom A.B. Snijders January 21, 2010 To be published, Journal of Mathematical Sociology University of Oxford and University of Groningen; this

More information

Project in Computational Game Theory: Communities in Social Networks

Project in Computational Game Theory: Communities in Social Networks Project in Computational Game Theory: Communities in Social Networks Eldad Rubinstein November 11, 2012 1 Presentation of the Original Paper 1.1 Introduction In this section I present the article [1].

More information

Probabilistic Near-Duplicate. Detection Using Simhash

Probabilistic Near-Duplicate. Detection Using Simhash Probabilistic Near-Duplicate Detection Using Simhash Sadhan Sood, Dmitri Loguinov Presented by Matt Smith Internet Research Lab Department of Computer Science and Engineering Texas A&M University 27 October

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Dominating Set Counting in Graph Classes

Dominating Set Counting in Graph Classes Dominating Set Counting in Graph Classes Shuji Kijima 1, Yoshio Okamoto 2, and Takeaki Uno 3 1 Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan kijima@inf.kyushu-u.ac.jp

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Binary Search Introduction Problem Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Strategy 1: Random Search Randomly select a page until the page containing

More information

Algebraic Methods in Combinatorics

Algebraic Methods in Combinatorics Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets

More information

Data Structure. Mohsen Arab. January 13, Yazd University. Mohsen Arab (Yazd University ) Data Structure January 13, / 86

Data Structure. Mohsen Arab. January 13, Yazd University. Mohsen Arab (Yazd University ) Data Structure January 13, / 86 Data Structure Mohsen Arab Yazd University January 13, 2015 Mohsen Arab (Yazd University ) Data Structure January 13, 2015 1 / 86 Table of Content Binary Search Tree Treaps Skip Lists Hash Tables Mohsen

More information

3.2 Configuration model

3.2 Configuration model 3.2 Configuration model 3.2.1 Definition. Basic properties Assume that the vector d = (d 1,..., d n ) is graphical, i.e., there exits a graph on n vertices such that vertex 1 has degree d 1, vertex 2 has

More information

HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS

HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS Colin Cooper School of Mathematical Sciences, Polytechnic of North London, London, U.K. and Alan Frieze and Michael Molloy Department of Mathematics, Carnegie-Mellon

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

An Algorithmic Proof of the Lopsided Lovász Local Lemma (simplified and condensed into lecture notes)

An Algorithmic Proof of the Lopsided Lovász Local Lemma (simplified and condensed into lecture notes) An Algorithmic Proof of the Lopsided Lovász Local Lemma (simplified and condensed into lecture notes) Nicholas J. A. Harvey University of British Columbia Vancouver, Canada nickhar@cs.ubc.ca Jan Vondrák

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu November 16, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

On the maximum number of isosceles right triangles in a finite point set

On the maximum number of isosceles right triangles in a finite point set On the maximum number of isosceles right triangles in a finite point set Bernardo M. Ábrego, Silvia Fernández-Merchant, and David B. Roberts Department of Mathematics California State University, Northridge,

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Problem Set 3 Issued: Thursday, September 25, 2014 Due: Thursday,

More information

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 11 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,, a n, b are given real

More information

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th CSE 3500 Algorithms and Complexity Fall 2016 Lecture 10: September 29, 2016 Quick sort: Average Run Time In the last lecture we started analyzing the expected run time of quick sort. Let X = k 1, k 2,...,

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Specification and estimation of exponential random graph models for social (and other) networks

Specification and estimation of exponential random graph models for social (and other) networks Specification and estimation of exponential random graph models for social (and other) networks Tom A.B. Snijders University of Oxford March 23, 2009 c Tom A.B. Snijders (University of Oxford) Models for

More information

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets IEEE Big Data 2015 Big Data in Geosciences Workshop An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets Fatih Akdag and Christoph F. Eick Department of Computer

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu Intro sessions to SNAP C++ and SNAP.PY: SNAP.PY: Friday 9/27, 4:5 5:30pm in Gates B03 SNAP

More information

Quick Sort Notes , Spring 2010

Quick Sort Notes , Spring 2010 Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.

More information

Lecture 5: January 30

Lecture 5: January 30 CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 5: January 30 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They

More information

Clustering bi-partite networks using collapsed latent block models

Clustering bi-partite networks using collapsed latent block models Clustering bi-partite networks using collapsed latent block models Jason Wyse, Nial Friel & Pierre Latouche Insight at UCD Laboratoire SAMM, Université Paris 1 Mail: jason.wyse@ucd.ie Insight Latent Space

More information

Heuristics for The Whitehead Minimization Problem

Heuristics for The Whitehead Minimization Problem Heuristics for The Whitehead Minimization Problem R.M. Haralick, A.D. Miasnikov and A.G. Myasnikov November 11, 2004 Abstract In this paper we discuss several heuristic strategies which allow one to solve

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University TheFind.com Large set of products (~6GB compressed) For each product A=ributes Related products Craigslist About 3 weeks of data

More information

CSE525: Randomized Algorithms and Probabilistic Analysis April 2, Lecture 1

CSE525: Randomized Algorithms and Probabilistic Analysis April 2, Lecture 1 CSE525: Randomized Algorithms and Probabilistic Analysis April 2, 2013 Lecture 1 Lecturer: Anna Karlin Scribe: Sonya Alexandrova and Eric Lei 1 Introduction The main theme of this class is randomized algorithms.

More information

Fundamental Algorithms

Fundamental Algorithms Chapter 2: Sorting, Winter 2018/19 1 Fundamental Algorithms Chapter 2: Sorting Jan Křetínský Winter 2018/19 Chapter 2: Sorting, Winter 2018/19 2 Part I Simple Sorts Chapter 2: Sorting, Winter 2018/19 3

More information

Fundamental Algorithms

Fundamental Algorithms Fundamental Algorithms Chapter 2: Sorting Harald Räcke Winter 2015/16 Chapter 2: Sorting, Winter 2015/16 1 Part I Simple Sorts Chapter 2: Sorting, Winter 2015/16 2 The Sorting Problem Definition Sorting

More information

Uniform generation of random graphs with power-law degree sequences

Uniform generation of random graphs with power-law degree sequences Uniform generation of random graphs with power-law degree sequences arxiv:1709.02674v2 [math.co] 14 Nov 2017 Pu Gao School of Mathematics Monash University jane.gao@monash.edu Abstract Nicholas Wormald

More information

Models, Data, Learning Problems

Models, Data, Learning Problems Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Models, Data, Learning Problems Tobias Scheffer Overview Types of learning problems: Supervised Learning (Classification, Regression,

More information

CS224W: Methods of Parallelized Kronecker Graph Generation

CS224W: Methods of Parallelized Kronecker Graph Generation CS224W: Methods of Parallelized Kronecker Graph Generation Sean Choi, Group 35 December 10th, 2012 1 Introduction The question of generating realistic graphs has always been a topic of huge interests.

More information

Introduction to Randomized Algorithms III

Introduction to Randomized Algorithms III Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November 2017 1 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information