Scalable and exact sampling method for probabilistic generative graph models

Data Min Knowl Disc

Scalable and exact sampling method for probabilistic generative graph models

Sebastian Moreno¹, Joseph J. Pfeiffer III², Jennifer Neville³

Received: 29 October 2015 / Accepted: 11 April 2018. The Author(s) 2018

Responsible editor: Kristian Kersting. Sebastian Moreno acknowledges the support of CONICYT + PAI/Concurso nacional de apoyo al retorno de investigadores/as desde el extranjero, convocatoria folio B

Sebastian Moreno sebastian.moreno.araya@gmail.com; Joseph J. Pfeiffer III joelpf@microsoft.com; Jennifer Neville neville@purdue.edu

1 Faculty of Engineering and Science, Universidad Adolfo Ibañez, Santiago, Chile
2 Microsoft, Bellevue, WA 98004, USA
3 Computer Science and Statistics, Purdue University, West Lafayette, IN 47907, USA

Abstract Interest in modeling complex networks has fueled the development of multiple probabilistic generative graph models (PGGMs). PGGMs are statistical methods that model the network distribution and match common characteristics of real-world networks. Recently, scalable sampling algorithms for well-known PGGMs made the analysis of large-scale, sparse networks feasible for the first time. However, it has been demonstrated that these scalable sampling algorithms do not sample from the original underlying distribution, and sometimes produce very unlikely graphs. To address this, we extend the algorithm proposed in Moreno et al. (in: IEEE 14th international conference on data mining, pp , 2014) for a single model and develop a general solution for a broad class of PGGMs. Our approach exploits the fact that PGGMs are typically parameterized by a small set of unique probability values; this enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove bias due to conditional sampling and

2 S. Moreno et al. probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Our new algorithm reduces time complexity by avoiding the expensive rejection sampling step previously necessary, and we demonstrate its generality, by outlining implementations for six different PGGMs. We conduct theoretical analysis and empirical evaluation to demonstrate the strengths of our algorithms. We conclude by sampling a network with over a billion edges in 95 s on a single processor. Keywords Network analysis Network models Social networks Graph generation Scalable sampling 1 Introduction Many interesting complex systems can be modeled as networks, with edges (e.g., s, hyperlinks) connecting vertices (individuals, webpages). Consequently, models of graph structure are useful for representing and reasoning about the underlying properties of these systems. Probabilistic models of graph structure (Frank and Strauss 1986; Wasserman and Pattison 1996; Robins et al. 2007; Wang et al. 2013), which represent the likelihood of observing a particular network structure, can be used for prediction (Pfeiffer et al. 2014) and hypothesis testing (Moreno and Neville 2013). Generative models of network structure (Watts and Strogatz 1998; Barabasi and Albert 1999; Kumar et al. 2000; Karrer and Newman 2009, 2010; Golosovsky and Solomon 2012; Bu et al. 2013a), which can provide sample networks with similar but varying structure, can be used to test conjectures about network processes and analyze the performance of algorithms overlaid on the complex system. As such, there has been a great deal of recent research on statistical models that are both probabilistic and generative, since they provide a principled means to approach a wide range of network science analysis tasks. Notably, scalable (subquadratic in number of nodes) sampling algorithms for statistical models such as Kronecker Product Graph Model (KPGM) (Leskovec and Faloutsos 2007) and Chung Lu (CL) (Chung and Lu 2002) made analysis of largescale, sparse networks feasible for the first time (Leskovec et al. 2010; Pinar et al. 2011). However, in contrast to prior expectations, these scalable sampling algorithms do not sample from the underlying probability distribution defined by the model. Recently, we investigated this issue for KPGM-family models (Moreno et al. 2014). We show that the efficient sampling algorithm from Leskovec et al. (2010), models a different space of graphs, and does not sample networks according to the original distribution. In practice, this sampler generates networks that are unlikely to have been drawn from the model distribution. In this work, we generalize the results of Moreno et al. (2014) to a broad class of statistical models that we refer to as probabilistic generative graph models (PGGMs). These models typically represent graph structure probabilistically with a N v N v probability matrix P, where P ij is the probability of edge existence between nodes i and j, and N v is the number of nodes. To avoid the expensive memory construction of P, well known models use a set of parameters Θ to implicitly represent P. For example, the Erdös Rényi random graph model (Erdos and Renyi 1960) assigns equal

3 Scalable and exact sampling method for probabilistic probability p to every possible edge (i.e., i, j P ij = p), then Θ = p implicitly represent the N v N v probability matrix P. Recent models differ in how they specify the edge probabilities in P. The CL model (Chung and Lu 2002) defines the probability of an edge to be proportional to the degree of the incident nodes, while other models such as KPGM (Leskovec and Faloutsos 2007) and mixed KPGM (Moreno et al. 2010) define P via Kronecker multiplication of a small seed parameter matrix, with itself. To generate a network based on P, naive algorithms, which we will refer to as Independent Probability methods, sample edges independently for every pair of nodes. These methods have quadratic time complexity (i.e., O(Nv 2 )), and thus cannot scale to large networks with millions of vertices. Researchers have developed faster alternatives to sample based on P for sparse networks. These algorithms, which we refer as Edge-by-edge methods, avoid the construction of the full P matrix and the analysis over every pair of nodes, and sample the locations for N e edges (i.e., unsampled i, j pairs are assumed to not be linked). KPGM and CL have associated Edge-by-edge sampling algorithms with complexity Õ(N e ) (Leskovec et al. 2010; Pinar et al. 2011). Although these algorithms sample network efficiently, they do not correctly sample from the underlying distribution represented by P. Particularly, some algorithms sample the N e edges based on P, without verifying collisions among the sampled edges (producing multigraphs). Other algorithms sample the sparse set of edges and spread the probability of previous sampled edges throughout the network, incorrectly increasing the probability of unlikely edges. As a consequence, new edges are not sampled from the correct underlying distribution. To address this issue, we build on the work of Moreno et al. (2014) and develop a general grouped sampling process for any PGGM that uses P implicitly in its representation. We exploit a common property of PGGMs that edges are parameterized by a small set of probability values, which enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove any bias due to conditional sampling and probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Specifically, our main contributions in this paper are: 1. Generalization of the proposed algorithm in (Moreno et al. 2014), to any PGGM. 2. Theoretical analysis to prove the correctness of the approach used in the new algorithms over PGGMs. 3. New faster and general algorithm, which uses geometric distributions to replace the Binomial sampling and select the locations of the sampled edges, avoiding rejection sampling. 4. Detailed time complexity analysis of the generalized algorithms, estimating the cost of the rejection sampling process. To demonstrate the applicability of our grouped sampling approaches, we discuss specific implementations for six different PGGMs. We then validate our theoretical analysis through an empirical evaluation, demonstrating the inaccuracies of previous algorithms as well as the advantages of our approaches. As part of our experiments, we demonstrate the scale of our approach on three real world datasets, including a LiveJournal network with 68 million edges. We then push our sampling algorithm far

4 S. Moreno et al. past the limits of prior work, by sampling networks with over one billion edges in 95 seconds on a single processor. 2 Probabilistic generative graph models Let G = (V, E) be a graph/network 1 with a set of finite vertices V and directed edges E V V, where (i, j) E indicates that a directed edge (or arc) exists between nodes V i, V j V. LetN v = V and N e = E be the number of vertices and edges, respectively. Note that this definition and theory focus in directed graphs, but all of our subsequent theory and algorithms can be applied too undirected graphs by working on edges (i, j), such that i j. Definition 1 Probabilistic generative graph model (PGGM) A PGGM is a statistical model M with parameters Θ that defines a size N v N v matrix P of probabilities. P models the structure of the network through the set of binary random variables E (i, j) i, j {1,...,N v }, where each P ij P represents the probability the directed edge (i, j) exists in the network (i.e., if E (i, j) = 1 then (i, j) E and P(E (i, j) = 1) = P ij ). Typically, the model M does not explicitly represent the full P, but instead provides a construction process to calculate each P ij from the parameters Θ. The number of parameters in Θ differs for each PGGM, and can vary in the range [1, Nv 2].2 However, since the goal is to model the underlying structure of the network, most PGGMs have a small number of parameters (usually less than N v ) to be parsimonious and to avoid overfitting. Moreover, let G be the space of all possible graphs with N v nodes, where G =2 N v 2. Then a PGGM implicitly defines a probability distribution over the space of graphs G, where the likelihood of an observed graph G = (V, E) G can be calculated based on P as: (i, j) E P ij (i, j)/ E (1 P ij). Once M and Θ are specified, allowing an implicit representation of P,itispossible to generate a single network G = (V, E) from G. There are two naive ways to sample a single network from G. The first method enumerates all possible graphs from G, calculates each graph s likelihood, and computes a cumulative density function. Then a number is sampled uniformly from the [0, 1] interval, used to index into the CDF, and return a randomly selected network. Unfortunately, the complexity of this method is O(2 N v 2 Nv 2 ), because it requires enumerating and calculating the likelihood of each possible graph. The second method calculates each P ij and sample the E (i, j) independently, using a Bernoulli distribution with P(E (i, j) = 1) = P ij. The time complexity of this method is O(Nv 2 ), yet this is still prohibitive for sampling networks with more than several thousands of nodes. We refer to this second method as Naive Sampling through the rest of the paper. We omit further discussion of the first method (sampling based on the enumeration of the G graphs) because of the prohibitive time complexity. 1 In this paper we will use graphs and networks interchangeably. 2 In the worst case scenario, Θ is a N v N v matrix that explicitly represents P (Θ P).
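To make the Naive Sampling method described above concrete, here is a minimal Python sketch. It is illustrative only: prob(i, j) is a hypothetical callable standing in for the model-specific computation of P_ij from M(Θ), and it is exactly this O(N_v²) double loop that the rest of the paper seeks to avoid.

```python
import random

def naive_sample(n_v, prob, seed=0):
    """Naive O(N_v^2) sampler: draw every directed edge (i, j) independently
    with probability P_ij = prob(i, j). `prob` is a hypothetical callable
    standing in for the model M(Theta)."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n_v):
        for j in range(n_v):
            if rng.random() < prob(i, j):   # Bernoulli(P_ij) trial
                edges.add((i, j))
    return edges

# Example: Erdos-Renyi, where every pair has the same probability p
g = naive_sample(200, lambda i, j: 0.01)
```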

5 Scalable and exact sampling method for probabilistic 2.1 PGGMs Several new PGGMs have been developed through time, including not only the structure determined by the edges, but also some network characteristics (Kim and Leskovec 2010; Kolda et al. 2014; Benson et al. 2014; Aicher et al. 2015). However, in this work we focus in six of the most important PGGM models of the last years, where every edge is sampled using a Bernoulli distribution (E (i, j) Bernoulli(P ij )). Erdös Rényi random graphs model (RG) RGs assume that every edge in the network have the same sampling probability (Erdös and Rényi 1959; Erdos and Renyi 1960). Then, P is implicitly represented by a single probability p defined by the user (P ij = p i, j {1,...,N v }). Chung Lu model (CL) CLs represent the probability of an edge by the degrees of the incident nodes (Chung and Lu 2002), where expected degrees of a generated graph matches the original network degrees. P is based on the degree distribution of the network, where a single P ij can be calculated as d i d j N e (d i is the degree of node V i ). Stochastic block model (SBM) SBMs represent the structure of a network through N k communities (usually N k V ), using a set of RGs (Holland et al. 1983; Wasserman and Anderson 1987). Each node V i is assigned to only one of the N k community, and the network is modeled by a N k N k probability matrix P S that represents the probability of an edge between partitions. P is represented by P S and node s partition Z S (ZV S i is node V i partition assignment), obtaining P ij =P S (ZV S i, ZV S j ). Block two-level Erdös Rényi model (BTER) BTERs represent a network through the combination of RG and CL models (Seshadhri et al. 2012; Kolda et al. 2014). BTER groups nodes with similar number of edges in clusters of size d k + 1 (all nodes have a minimum degree d k ), and defines a vector P B, representing the probability of an edge among nodes in the same partition (RG model). This step increases the community structure by generating multiple edges among nodes in the same cluster. Note that for large N v, P B N e /3, where P B = N e /3 if the network consists in multiple isolated triangles (three nodes connected among them). Then, the probability between nodes of different clusters is estimated based on the CL model, matching the original degree distribution. Similarly to the previous models, P is represented by a combination of P B and the original degree distribution, where P ij is defined by the RG or CL models (if V i and V j belongs to the same cluster RG, otherwise CL). Kronecker product graph model (KPGM) KPGMs assume that networks are hierarchically organized into communities that grow recursively through a fractal process. To represent the fractal process, KPGM uses a seed matrix Θ of size b b, where θ ij [0, 1] and typically b = 2orb = 3, which is recursively extended using K = log b (N v ) Kronecker multiplications with itself. Then, P is calculated through Θ and K, where P ij = K k=1 Θ uk,v k (1 u k,v k b). As can be observed, the probability of an edge is given by the multiplication of K values from Θ (Leskovec

6 S. Moreno et al. Algorithm 1 Original Naive Sampling Require: M and Θ 1: V ={1,...,N v }, E ={} 2: for i {1,...,N v } do 3: for j {1,...,N v } do 4: Calculate P ij based on M(Θ) 5: if Bernoulli(P ij ) then 6: E = E {(i, j)} 7: Return G = (V, E) and Faloutsos 2007; Leskovec et al. 2010). So, by the commutative property of multiplication, a single probability can appear in many places of P. Mixed-Kronecker product graph model (mkpgm) mkpgm is a generalization of KPGM, where the hierarchical communities are related among them via a set of dependent edge probabilities (Moreno et al. 2010). mkpgm also models a network through the parameters Θ and K as explained in KPGM, but it incorporates a new parameter l {1, 2,..., K } that creates the edge dependencies. Specifically, mkpgm uses the same hierarchical structure than KPGM at the first l levels; then it ties the edge probabilities by sampling each of the following hierarchical layers. This process keeps the marginal edge probabilities among nodes but increases the variability of the generated networks. P is calculated through Θ, K, and l, where P ij = K k=1 Θ uk,v k when all hierarchical layers are sampled, otherwise the edge has no probability of being sampled. 2.2 Sampling from PGGMs We discuss the naive Independent probability sampling algorithms of PGGMs and discuss scalable Edge-by-edge sampling algorithms previously mentioned Naive sampling To sample a graph G from a model M with parameter Θ, a naive algorithm samples the Nv 2 possible edges using Bernoulli(P ij), where P ij is calculated from M(Θ). If Bernoulli(P ij ) = 1, the edge (i, j) is added to G. Complexity Algorithm 1 shows an outline for naive sampling algorithms with time complexity O(M c Nv 2), where M c refers to the time needed to calculate a single edge probability P ij (based on M and Θ). Line 4 has complexity O(M c ). Lines 5 and 6 are O(1), and the loops increase the complexity to O(M c Nv 2 ). Note, that in the majority of the cases O(M c ) = O(1), because P ij is a function of the parameters Θ; reducing the naive sampling time complexity algorithm to O(Nv 2 ). However, it is still does not scale to large networks. Space of graphs Given M(Θ), we will refer to the space of graphs with N v nodes to be G O (i.e., all possible graphs that can be generated by M(Θ)), where o stands for original distribution. Formally, G O ={G = (V, E) such that (V ={1, 2,...,N v })

7 Scalable and exact sampling method for probabilistic and (E V V)}. Note that E is a subset or equal to V V. So, G O defines a space of graphs where only a single edge can exist between every pair of nodes (Diestel 1997). All PGGMs define a probability distribution P O over G O. Using these definitions, the network generation process is associated with sampling a network G from P O, where P O (G) : G O [0, 1]. Considering that for sparse networks N e Nv 2, and due to the O(N v 2) complexity of Algorithm 1, authors have searched for algorithms with sub-quadratic time complexity. Even though, this could sound pointless, because only enumerating or iterating over all possible edges has a time complexity O(Nv 2 ), it is possible by avoiding the edge by edge Bernoulli sampling process. For example, consider the Erdös Rényi Random Graphs model where all edges have the same probability p. Then, instead of iterate over all values of P, which is implictly represented by p, an edge can be generated by sampling two nodes from the discrete uniform distribution [1, N v ] and place an edge between the selected nodes. This example belongs to the Edge-byedge sampling algorithms, which have time complexity proportional to the number of edges (O(N e )). Among the most popular Edge-by-edge sampling algorithms, we can mention the CL (Chung and Lu 2002) and KPGM (Leskovec and Faloutsos 2007). Although these algorithms differ with respect to a chosen model, all of them fall within the following generalized processes Edge by edge without rejection Edge-by-edge sampling algorithms avoid the iteration over all pairs of nodes, and only consider the placement locations of N e sampled edges, with all remaining (i.e., unsampled) pairs of nodes considered unlinked. For this process, an edge (i, j) is sampled proportional to its defined probability (P ij ) and added to the network, where P is implicitly represented through M(Θ). This process is repeated N e times, where N e is defined by the user (or could be estimated). Complexity We give pseudocode for the sampling process in Algorithm 2. Generally speaking and assuming that line 4 is constant, the time complexity of the algorithm is Õ(N e ).Thewhile loop (line 3) dominates the time complexity for scalable algorithms, as line 4 has a sampling time of M s which is usually constant, and lines 5 and 6 have constant time. The entire P is never iterated, generated, or calculated; given that these algorithms use other approaches to draw proportional to P ij, rather than directly calculating P ij. As an example, RG uniformly samples (i, j) at random from the Nv 2 possible edges. Space of graphs Let X = N e be the number of edges to place via the edge-byedge sampling algorithm. Edge-by-edge algorithms are also a sampling mechanism from a probability distribution P EN defined over a space of graphs G EN (X) with a fixed number of N v nodes and X = N e edges (EN stands for edge by edge without rejection). Particularly, G EN (X) ={G = (V, E) is a pair of disjoints sets together with a map E V V V, and ( E =N e = X)}. There are two main differences of G EN (X) with respect to G O.First,G EN (X) is restricted to networks with exactly X edges. Second, the mapping function allows multiple edges between the same two nodes (Diestel 1997).
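Before the formal pseudocode (Algorithm 2, below), here is a minimal Python sketch of this edge-by-edge process for a Chung–Lu style model. It is a sketch under assumptions, not the exact algorithm of Pinar et al. (2011): it takes an undirected-style degree list, draws each endpoint with probability proportional to its degree, and keeps repeated pairs, which is precisely why the result lives in the multigraph space G_EN(X) rather than in G_O.

```python
import random

def cl_edge_by_edge(degrees, seed=0):
    """Edge-by-edge Chung-Lu sketch *without* rejection: place N_e edges by
    drawing both endpoints proportionally to degree. Repeated pairs are kept,
    so the output can be a multigraph (the space G_EN discussed above)."""
    rng = random.Random(seed)
    n_e = sum(degrees) // 2                 # assumed undirected-style convention
    nodes = list(range(len(degrees)))
    edges = []                              # a list, because duplicate pairs may occur
    for _ in range(n_e):
        u = rng.choices(nodes, weights=degrees)[0]   # P(u) proportional to d_u
        v = rng.choices(nodes, weights=degrees)[0]
        edges.append((u, v))
    return edges

# Example: a small skewed degree sequence
multi_edges = cl_edge_by_edge([5, 3, 2, 2, 1, 1, 1, 1])
```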

Algorithm 2 Edge by Edge Without Rejection Sampling
Require: M, Θ, and N_e {can be calculated or defined by the user}
1: V = {1,...,N_v}, E = {}
2: countEdge = 0
3: while countEdge < N_e do
4:   Sample edge (i, j) with probability proportional to P_ij
5:   countEdge++
6:   E = E ∪ {(i, j)}
7: Return G = (V, E)

Edge by edge with rejection As the above edge-by-edge algorithm generates multigraphs (two or more edges can be sampled between the same two nodes), a simple solution is to include a rejection process that re-draws edges that were previously sampled (a short code sketch of this variant is given below).

Complexity The pseudocode for this sampling process is similar to Algorithm 2, but a rejection step (if (i, j) ∉ E) is added after line 4. This rejection process increases the runtime in proportion to the number of rejected edges.

Space of graphs Edge-by-edge with rejection algorithms are also a sampling mechanism from a probability distribution P_ER, where ER stands for edge-by-edge with rejection. The space of graphs for networks with a fixed number of N_v nodes and X = N_e edges is G_ER(X) = {G = (V, E) such that V = {1, 2,...,N_v}, E ⊆ V × V, and |E| = N_e = X}. As with the previous sampling process, G_ER(X) is restricted to a specific number of edges and defines a different space of graphs than G_O. However, its definition does not allow multigraphs.

Prior work claimed that edge-by-edge generation algorithms (with and without rejection) sample from P_O(G) (i.e., P_O(G) = P_EN(G|X) = P_ER(G|X)) while reducing the time complexity to Õ(N_e). However, we theoretically proved in Moreno et al. (2014) that these sampling processes lie far from the true distribution, with the exception of RG models. We empirically corroborate these results in the experiments section.

3 Group probability sampling

This section develops our new scalable and exact Group Probability (GP) sampling algorithm for Probabilistic Generative Graph Models (PGGMs). We begin by defining a new representation for PGGMs, followed by the generalization of the GP algorithm from Moreno et al. (2014). We then analyze its time complexity, estimating the cost of the rejection sampling process. Importantly, we conclude the section by removing the rejection sampling step altogether, introducing a new Geometric Grouped Probability (GGP) sampler. This section discusses the generalized algorithms, while the next section gives detailed implementations for six existing PGGMs.
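As referenced above, the rejection variant of the edge-by-edge generator simply re-draws colliding pairs. A minimal sketch, under the same Chung–Lu style assumptions as the previous example: the output is now a simple graph with exactly N_e distinct edges, but, as argued above, its distribution still differs from the one defined by P.

```python
import random

def cl_edge_by_edge_rejection(degrees, seed=0):
    """Edge-by-edge Chung-Lu sketch *with* rejection: colliding pairs are
    re-drawn, so exactly N_e distinct edges are placed. The result is a simple
    graph, but it is still not distributed according to P."""
    rng = random.Random(seed)
    n_e = sum(degrees) // 2
    nodes = list(range(len(degrees)))
    edges = set()
    while len(edges) < n_e:
        u = rng.choices(nodes, weights=degrees)[0]
        v = rng.choices(nodes, weights=degrees)[0]
        if (u, v) not in edges:     # rejection step: skip previously sampled pairs
            edges.add((u, v))
    return edges
```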

3.1 Representation

For most PGGMs, multiple pairs of nodes have the same probability of edge existence in P. For example, in the extreme case of the Random Graph model (RG), every edge exists with equal probability, i.e., P_ij = P_uv = p, ∀i, j, u, v ∈ {1,...,N_v}. Although more complex PGGMs have different probability values throughout P, a single probability value can appear in many places in P (e.g., P_ij = P_kl where (i, j) ≠ (k, l)). As a result, even without iterating over or calculating P, we can identify the cells in P that correspond to a single unique probability value and treat the associated random variables as a group. In this work, we propose to sample each group of identically distributed edge variables using the Binomial distribution. Namely, for every group of cells in P that corresponds to a unique probability value, we draw the number of edges that should be placed from a Binomial distribution. Then, we determine locations for the chosen number of edges from the locations of those cells in P. For current PGGMs, this entire process can be realized without explicitly iterating over or calculating the full matrix P.

We formally introduce some new notation before describing the algorithm. Let U be the set of unique probability values in P, and |U| be the total number of unique probabilities in P as defined by a PGGM. Let π_k be the kth unique probability in U and T = (T_1, T_2,...,T_|U|) be a vector where T_k is the number of times π_k is repeated in P (i.e., the count of (i, j) pairs such that P_ij = π_k). For example, for RG, U = {π_1}, |U| = 1, T = (T_1), and T_1 = N_v².

3.2 Algorithm

Using this notation, a network is sampled in three steps. First, in a preprocessing step, calculate U and T. Second, for each unique probability π_k ∈ U, sample the number of edges x_k using a Binomial distribution (x_k ∼ Bin(n, p) with n = T_k, p = π_k). Third, place the x_k sampled edges uniformly at random among the cells with probability π_k.

Algorithm 3 describes the pseudocode of GP sampling. Line 2 is the preprocessing step that constructs U and T (1st step). Lines 4-6 determine π_k, T_k, and x_k (2nd step). Lines 8-12 generate and place the x_k edges in the network (3rd step). Line 9 generates the new edge and Lines 10 through 12 add (u, v) to E if it has not been previously sampled (this step avoids multigraphs).

3.3 Complexity

The time complexity of the GP sampling algorithm varies depending on the choice of PGGM. To reflect this, we define a general complexity based on an arbitrary model M, where O(M_p) and O(M_s) represent the time complexity for preprocessing and sampling, respectively, when using the model M in the algorithm.³ Line 2 is

3 Note that M_p includes the time needed to calculate probability values, which we referred to earlier in Sect. 2.2 as M_c.

10 S. Moreno et al. Algorithm 3 Group Probability Sampling Require: M and Θ 1: V ={1,...,N v }, E ={} 2: Construct U and T, based on M(Θ) 3: for k = 1; k ++;k U do 4: Get π k U the k-th unique probability of the set U 5: Get T k T the number of times that π k is repeated in P 6: Draw x k Bin(T k,π k ) 7: countedge=0 8: while count Edge < x k do 9: Sample new edge (u,v)among edges with probability π k 10: if (u,v) / Ethen 11: countedge++ 12: E = E (u,v) 13: Return G = (V, E) the preprocessing step, corresponding to constructing of the sets U and T with time complexity O(M p ). Similarly, Lines 4 and 5 have a cost O(1). For Line 6, the exact sampling from a Binomial(T k,π k ) has a complexity O(T kπ k ) (Devroye 1980). Line 9 is the sampling step, which varies according to the model, so we define its time as O(M s ). Lines 10 and 12 are dependent on the choice of network data structure. In practice, hash tables have a constant lookup/insertion time, so for the remainder of the paper we will assume O(1) lookup/insertion time. 4 Using the above analysis, we proceed to calculate the total complexity of the generalized algorithm. The kth iteration of the while loop (Lines 8 12) is dominated by the total number of generated edges (including the rejected samples), multiplied by O(M s ). The total number of generated edges can be calculated as the summation of x k geometric random variables with different success probabilities. Let X k be the total number of edges to be sampled, such that x k different edges are added to the network. Then X k = x k i=1 X k,i where X k,i is a geometric random variable indicating the number of samples ( until the ith edge is successfully added to the network ( X k,i Geometric Tk (i 1) T k )). Thus the expected number of samples is: x k E[X k,i x k ]= i=1 x k i=1 1 T k (i 1) T k T k = T k i=1 = T k T 1 k x k i i=1 x k 1 T k (i 1) ( ) = T k HTk H Tk x k where H j is the jth harmonic number. In the worst case scenario, x k = T k obtaining a total complexity of O(T k H Tk ), increasing the time complexity of the algorithm. Otherwise, if x k < T k, considering that x k π k T k and H j < log( j) + 1, we obtain i=1 1 i 4 In the worst case, lookups/insertions are O(N v ) for hash tables (when the hash function puts all entries are inserted into the same bucket). To avoid this worst case scenario, one can use a balanced binary tree (e.g., red-black trees) for the implementation, which is O(log N v ).

11 Scalable and exact sampling method for probabilistic ( ) ( T k HTk H Tk x k < Tk log(tk ) log ( T k T k π k)) = Tk log ( 1 π k ) Applying the inequality log(1 + X) <X for X 1 based on the Taylor series expansion: E[X k x k ]< T k log ( 1 π k) <π k T k x k Continuing with the time complexity analysis, the loop from lines 8 12 is O(x k M s ). Adding lines 4 7, incorporating the summation over U, and adding line 2, we obtain a total complexity U O M p + M s = O ( ) M p + M s N e (1) k=1 x k As a result, the final complexity of the algorithm is directly related to: the preprocessing time M p, the sampling time M s, and the number of edges N e. It is important to note that a naive implementation of M has complexity O(Nv 2 ); however, we avoid naive implementations in the next section. 3.4 Analysis The GP algorithm samples networks from a specific probability distribution P GP.Let G GP be the space of graphs for the GP sampling process, defined by {G = (V, E) such that (V ={1, 2,...,N v }) and (E V V)}. Note that G GP G O (the space graph of the original sampling process). Then, the generation of a network can be considered as sampling G from P GP (G), where P GP (G) : G GP [0, 1]. With these definitions, the next theorems prove that our generalized GP algorithm samples networks from a probability distribution as defined by a corresponding PGGM. Theorem 1 Given a valid PGGM M(Θ), and the GP sampling algorithm from Algorithm 3 with edge probabilities defined as p GP ( ), then u,v V p GP (E uv ) = p O (E uv ), where p O ( ) is the edge probability from the original correct, but inefficient, Algorithm 1. Proof Let π uv be the newly defined representation of the probability of an edge (u,v), and T be the number of edges with unique probability π uv, then: p GP (E uv ) = T P(E uv X = k)p(x = k) (2) k=0 where X Bin(T,π uv ). We can then see: P(E uv X = k) = 1 P(E uv X = k) (3)

12 S. Moreno et al. where P(E uv X = k) is the probability of not sampling the edge in the k trials. As every edge is randomly assigned over all possible remaining positions, the probability that the edge (u,v)is not selected in the first trial is 1 T 1. For the second trial, the probability is slightly ( higher because ) T 1 positions are available. Generalizing, P(E uv X = k) = k 1 k ( ) 1 i=1 T 1 i 1 j=1 1 =1 1. Reemploying P(E uv X = k) T (i 1) i=1 in Eq. 3: P(E uv X = k) = 1 k ( ) 1 1 = 1 T (i 1) i=1 = 1 T k T Reemploying P(E uv X = k) in Eq. 2: p GP (E uv ) = T k=0 = k T k i=1 k E[X] T Bin(T,π uv ; X = k) = = π uv T T T ( T i ) T i + 1 = p O (E uv ) Next, as the space of graphs and edge probabilities between the original model and GP are equal, we can prove that graph distributions are equal too. Theorem 2 Given a valid PGGM M(Θ), and the GP sampling algorithm from Algorithm 3 with edge probabilities defined as p GP ( ), then G G GP p GP (G) = p O (G), where p O (G) is the graph probability under the original correct, but inefficient, Algorithm 1. Thus P GP (G) = P O (G). Proof The generation of edges across different unique probability values are independent. Let E k be the set of edges with unique probability π k and E k =x k, then p GP (G) = = U k=1 U k=1 U T k p GP (E k ) = p GP (E k Y = i)p(y = i) k=1 i=1 p GP (E k Y = x k )P(Y = x k ) where Y Bin(π k, T k), and p GP (E k Y = i) is the probability of sampling the set of edges E k ( E k =x k ) given that i edges are generated. If i = x k, then p GP (E k Y = i) = 0. In contrast, p GP (E k Y = x k ) is the summation of probabilities over the x k! possible sequences of edges that generate E k. For example, the set of edges (1, 1) and (1, 2), can be generated by the sequence (1, 1), (1, 2) or (1, 2), (1, 1).LetE ki be the k i edge sequence that generates E k. As all edges are uniformly sampled with repetition, then p(e ki ) = T 1 1 k T k 1 1 T k x k +1.Sop GP(E k Y = x k ) is given by:

13 Scalable and exact sampling method for probabilistic p GP (E k Y = x k ) = x k! i=1 p(e ki ) = x k! 1 T k 1 T k 1 1 T k x k + 1 = x k!(t k x k )! T k! Joining p GP (E k Y = x k ) with P(Y = x k ): p GP (E k Y = x k )P(Y = x k )= 1 ( Tk x k ) Reemploying in p GP (G): U U p GP (G)= p GP (E k ) = k=1 = E uv E E uv / E ( Tk x k ) π x k k = 1 ( Tk x k ) π x k k (1 π k )T k x k = k=1 E uv E p O (E uv ) (1 p O (E uv )) = p O (G) (1 π k )T k x k = π x k k (1 π k )T k x k p GP (E uv ) E uv / E (1 p GP (E uv )) Given that G O = G GP, therefore P O (G) = P GP (G). 3.5 Geometric group probability algorithm To avoid the rejection process of the GP algorithm, we develop a new algorithm keeping the group probability proposal, but using geometric sampling for the Binomial sampling step. Namely, Devroye (1980) proved that sampling from a Binomial distribution with parameters n and p (X Bin(n, p)) is obtained through repeatedly sampling L i Geometric(p) until X+1 i=1 L i > n. Specifically, we utilize geometric sampling to avoid rejection process. At the beginning of the sampling process, we index the T k possible edge locations of an unique probability π k. Then, we generate the jth edge using the index j i=1 L i. For example, if T k = 10, L 1 = 4, L 2 = 2, and L 3 = 5, we generate two edges corresponding to the indexes locations L 1 = 4 and L 1 + L 2 = 6 (a third edge is not generated because 3i=1 L i > 10). Sampling The pseudocode of the new sampling process is given in Algorithm 4. Lines 1 5 are the same as the GP Algorithm 3. Lines 6 9 define the index of the initial edge. While lines 6 and 7 deal with the cases π k = 0 and π k = 1, line 9 deals with 0 <π k < 1 by sampling from a geometric distribution with probability π k via inverse transformation method (Voss 2013). If the corresponding index is below T k,anew edge is generated in the following loop. This loop determines an edge location based on the sampled index (line 11), inserts the edge into the network (line 12), and samples from the geometric distribution to determine a new index (line 13). Note: if π k = 1 the geometric distribution is not relevant ( 1 π k =0) and all edges with π k = 1 are generated.
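Before the formal pseudocode (Algorithm 4, below), the skip-ahead mechanism just described can be written in a few lines. The sketch below is an illustration, not the exact implementation: it yields the indices of the sampled cells among the T_k positions of a group with probability π_k, using the inverse-transform geometric draw from the text.

```python
import math
import random

def skip_sampled_indices(t_k, pi_k, seed=0):
    """Indices (1..T_k) of the cells that receive an edge in a group with
    probability pi_k, generated via geometric skips instead of T_k Bernoulli
    trials. Sketch of the mechanism behind Geometric Group Probability."""
    rng = random.Random(seed)
    indices = []
    if pi_k <= 0.0:
        return indices
    idx = 0
    while True:
        if pi_k >= 1.0:
            idx += 1                                      # every cell is an edge
        else:
            u = rng.random()                              # inverse-transform Geometric(pi_k)
            idx += 1 + int(math.log(1.0 - u) / math.log(1.0 - pi_k))
        if idx > t_k:
            return indices
        indices.append(idx)

# Example: sampled cell indices among T_k = 10 candidates with pi_k = 0.3
print(skip_sampled_indices(10, 0.3))
```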

14 S. Moreno et al. Algorithm 4 Geometric Group Probability Sampling Require: M and Θ 1: V ={1,...,N v }, E ={} 2: Construct U and T based on M(Θ) 3: for k = 1; k ++;k U do 4: 5: Obtain π k U the k-th unique probability of the set U Obtain T k T the number of times that π k is repeated in P 6: 7: if π k == 0ORπ k == 1 then count Edge = T k (1 π k ) + 1 8: else 9: count Edge = 1 + log(1 Uniform(0,1)) log(1 π k ) 10: while count Edge T k do 11: Sample new edge (u,v)among edges with probability π k based on count Edge 12: E = E (u,v) 13: count Edge = count Edge π k 14: Return G = (V, E) log(1 Uniform(0,1)) log(1 π k ) Complexity We analyze the runtime for a general model M. Line 2 is the preprocessing complexity O(M p ), while lines 4 7 lines are O(1). Lines 9 and 13 have complexity O(1) because of the log( ) function. Lines 10 and 12 are O(1), while line 11 is O(M s ) (sampling time according to the PGGM). Then, the loop from lines is O(1 x k M s ) = O(x k M s ). Incorporating, lines 3 13, the summation over U, and preprocessing time, we obtain a final complexity: U O M p + M s k=1 x k = O(M p + M s N e ) (4) The difference with respect to Algorithm 3 is the elimination of the rejection step, reducing the number of sampled edges. While U k=1 x k N e for Algorithm 3 (recall Sect. 3.3), in this new algorithm U k=1 x k is strictly equal to N e. Analysis As this new Geometric Group Probabilistic Sampling Algorithm (GGP) is an improved implementation of GP, it is also provably correct and samples networks from the original probability distribution of their corresponding PGGMs. 3.6 Summary In this section, we proposed two new sampling algorithms: Group Probability (GP) and Geometric Group Probability (GGP). We demonstrated that, given a valid PGGM M, both algorithms are probably correct and sample networks from the original probability distribution defined by M. The time complexity for GP is bounded by O(M p + M s N e ) when all unique probabilities are less than 1 (π i < 1), where O(M p ) and O(M s ) are the time complexities for preprocessing and sampling a sin-
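Putting the pieces of this section together, the following sketch mirrors the overall structure of the grouped geometric sampler. It is a simplified illustration under stated assumptions: unique_probs and counts play the roles of U and T, and decode(k, idx) is a hypothetical model-specific function (the subject of the next section) that maps the idx-th cell of group k to a node pair.

```python
import math
import random

def ggp_sample(unique_probs, counts, decode, seed=0):
    """Grouped geometric sampling sketch: unique_probs[k] plays the role of
    pi_k, counts[k] of T_k, and decode(k, idx) is a hypothetical function
    mapping the idx-th cell (1..T_k) of group k to a node pair (u, v)."""
    rng = random.Random(seed)
    edges = set()
    for k, (pi_k, t_k) in enumerate(zip(unique_probs, counts)):
        if pi_k <= 0.0:
            continue                         # group contributes no edges
        idx = 0
        while True:
            if pi_k >= 1.0:
                idx += 1                     # deterministic group: every cell is an edge
            else:
                u01 = rng.random()           # geometric skip to the next sampled cell
                idx += 1 + int(math.log(1.0 - u01) / math.log(1.0 - pi_k))
            if idx > t_k:
                break
            edges.add(decode(k, idx))        # no rejection: each index is visited once
    return edges

# Erdos-Renyi example: a single group with T_1 = N_v^2 cells and probability p
n_v, p = 1000, 0.005
er_decode = lambda k, idx: ((idx - 1) // n_v, (idx - 1) % n_v)
graph = ggp_sample([p], [n_v * n_v], er_decode)
```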

15 Scalable and exact sampling method for probabilistic gle edge respectively. GGP overcomes the restriction on the probability values and keeps the same time complexity O(M p + M s N e ). Thus, both sampling algorithms can generate a network in time proportional to the number of edges, when the following conditions are fulfilled: (i) the preprocessing time is less than the number of edges (i.e., O(M p ) O(N e )), and (ii) the sampling process is a small constant (i.e., O(M s ) = O(1)). Both conditions are fulfilled for well known PGGMs. In the next section, we will discuss the implementation and complexity details for six different PGGMs. 4 Implementation This section explains the implementation of the group probability sampling algorithms over six different models: Random Graphs, Chung Lu (CL), Stochastic Block Models, Block Two-Level Erdös Rényi, Kronecker Product Graph Model (KPGM), and mixed KPGM. Naive implementation of these algorithms, analyzing each cell from P, has time complexity O(Nv 2 ). However, our implementations avoid a direct calculation and analysis of P, which reduces the time complexity. Specifically, for each model, we detail the preprocessing step (Line 2 of Algorithms 3 and 4) and the edge generation process, and discuss their respective time complexities O(M p ) and O(M s ). Recall, that the preprocessing corresponds to the construction of the sets U and T (unique probabilities and their counts), and we use the first letter of a model as a superscript to indicate the model (e.g. U C is U for CL). 4.1 Random graphs (RG) The implementation of the group probability sampling algorithms for RG is simple. Given Θ = p (the probability among edges), the sets U R ={p} and T R = Nv 2 (the number of edges with probability p) are created. Note that U R = T R =1. Then, the preprocessing time is O(M p ) = O(1). Group probability Each edge is generated through two uniform samplings between 1 and N v, with sampling time complexity O(M s ) = O(1). Reemploying this value at Eq. 1, the final time complexity is O(1 + 1 N e )) = O(N e ). Geometric group probability The edge indexes are determined using count Edge count Edge N v {1,...,Nv 2}. The new indexes are: idx i = +1 and idx j = count Edge (idx i 1) N v + 1, both with time complexity O(1). Applying these values in Eq. 4, the final complexity is O(1 + 1 N e )) = O(N e ). Summary In this model, the proposed algorithms have constant complexity in their preprocessing and sampling time (O(M p ) = O(M s ) = O(1)), obtaining a final time complexity of O(N e ). 4.2 Chung Lu (CL) Given the degree distribution of the network (Θ), the set of unique probabilities U C is given by the cross product of the unique degrees. Let D C be a vector of the unique

16 S. Moreno et al. degree distribution in a network. Then, the ith lowest unique degree with value j is Di C = j. Similarly, let ND C i be the total number of nodes with degree Di C. For example, if D5 C = 10 and N D C 5 = 3 there are 3 nodes with degree 10, the 5th lowest unique degree. The construction of the vectors D C and ND C can be implemented by analyzing each element of the degree distribution, with a time complexity O(N v ). Given Di C, UC is constructed by the cross product D C D C ( U C = D C 2 ), π k = Di C D C j N e for i, j {1,..., D C }. Similarly, T C = (T1 C,... T k C,... T C ), where T C U C k = ND C i ND C j (the product of the number of nodes with degrees i and j), is created by the vector multiplication ND C N D C. Then, the preprocessing time complexity O(M p) is given by O(N v + D C 2 + ND C 2 ) = O( D C 2 ) given that D C = ND C, and most of the time D C 2 is greater than N v To create the final network, the nodes with same degree are generated in groups. So, to generate an edge, we uniformly sample the indexes from ND C i and ND C j,plusa specific offset with respect to the initial position of the nodes with same degree. Group probability Consider the block structure of the grouped nodes. The index generation between nodes with degree i and j is done through the uniform sampling from ND C i and ND C j, with a specific offset (the total number of nodes with degree less than ND C i and ND C j respectively). This sampling edge process has complexity O(M s ) = O(1). Reemploying values at Eq. 1, the final time complexity is O( D C N e ) = O(N e ),if D C 2 N e. Geometric group probability The edge indexes between nodes with degree i and j are determined using count Edge {1,...,ND C i ND C j }. These indexes are given count Edge 1 by: idx i = +1 and idx ND C j = count Edge (idx i 1) ND C j,plus j the offsets with respect to the initial position of the nodes with degree i and j. Both indexes are calculated with complexity O(1) obtaining a sampling time complexity O(M s ) = O(1). Then, the final time complexity is O( D C N e ) = O(N e ),if D C 2 N e. Summary Both algorithms realize the same preprocessing step with time complexity O(M p ) = O( D C 2 ). Similarly, both algorithms have the same sampling time complexity O(M s ) = O(1), obtaining a final time of O(N e ),if D C 2 N e. Note that in most social networks the degree distribution follows a power law distribution reducing considerably the size of D C. 4.3 Stochastic block model (SBM) SBM requires three different parameters, the number of nodes partition K, the node assignment partition Z S (Zi S = j implies that the node i belongs to partition j), and the probability matrix partition P S (Pi, S j = p ij implies that the probability between the partition i and j is p ij ). With these parameters, we construct the vector N K, where

17 Scalable and exact sampling method for probabilistic N Ki is the number of nodes at partition i. This vector of size K 1 can be constructed by iterating over the vector Z S with time complexity O(N v ). Given the parameters, the set U S is simply P S. In contrast, the set T S = (T1 S,...,TS ), where T S U S k = N ki N k j, is the number of elements with probability π k between nodes of partitions i and j is constructed by the vector multiplication N K N K with time complexity O( N K 2 ) = O(K 2 ). Using these time complexities, the final preprocessing time is O(M p ) = O(N v + K 2 ). To create a network, we group nodes with same partition. Then edge indexes, among node partitions i and j, are sampled uniformly from ND S i and ND S j, plus an offset w.r.t. the position of the first nodes belonging to these partitions. Group probability The edge sampling between nodes with partition i and j is through an uniform sampling from N ki and N k j plus an offset (the total number of nodes belonging to the previous i and j partitions). This edge generation process has a complexity of O(M s ) = O(1), resulting in a total complexity time of O(N v + K N e )) = O(N e ),ifk 2 N e. Geometric group probability The edge indexes, between nodes of the partitions i and j, are determined using count Edge {1,...,N ki N k j }. The indexes are: idx i = count Edge 1 N ki +1 and idx j = count Edge (idx i 1) N ki, plus the offsets with respect to the initial position of the nodes in partitions i and j. Both indexes are calculated with a time complexity O(M s ) = O(1), obtaining a final complexity of of O(N v + K N e )) = O(N e ),ifk 2 N e. Summary Similar to CL, both algorithms have the same preprocessing and sampling complexity time (O(M p ) = O(N v + K 2 ) and O(M s ) = O(1) respectively), obtaining a final time of O(N e ),ifk 2 N e. Note that large values of K correspond to over-fitted models, so typically K is chosen to be considerably smaller than N v. 4.4 Block two-level Erdös Rényi (BTER) BTER receives the following parameters: the number of nodes partition K, the node assignment partition Z B (Zi B = j implies that the node i belongs to partition j), and the probability matrix partition P B (Pi B = p i implies that the probability of an edge inside partition i is p i ). Note: because there are two phases for this model, the preprocessing and sampling complexities are defined as M p1, M p2, M s1, and M s2, where the subindexes corresponds to the first or second phase of the algorithm. For the algorithm first phase (applying RG model on each of the K designated clusters): U1 B = PB, U1 B =K, and TB 1 is the vector with number of elements of each cluster (obtained through the analysis of Z B as in SBM). So, the preprocessing time for the first phase is O(M p1 ) = O(1 + N v ) = O(N v ). The second phase starts analyzing the degree distributions of the generated network, dynamically generated in the previous network, to apply the CL model (Sect. 4.2). Recall that U2 B and TB 2 are calculated using DB (vector of unique degrees) and N D (number of nodes with unique degrees). Then, similarly to Sect. 4.2, the preprocessing complexity is O(M p2 ) = O( D B 2 ).
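As a concrete illustration of the first-phase bookkeeping just described, the sketch below builds U and T for BTER's within-cluster (RG) phase from a cluster assignment Z^B and per-cluster probabilities P^B. The names and the directed, self-loop-inclusive cell count n_c² are assumptions made for illustration, not the authors' exact data structures.

```python
from collections import Counter

def bter_phase1_groups(z_b, p_b):
    """Preprocessing sketch for BTER's first (within-cluster RG) phase:
    from cluster assignments z_b (node -> cluster id) and per-cluster edge
    probabilities p_b (cluster id -> probability), build the unique
    probabilities U and their cell counts T. A cluster with n_c nodes is
    assumed to contribute n_c * n_c candidate cells (directed convention)."""
    cluster_sizes = Counter(z_b)
    unique_probs, counts = [], []
    for c, n_c in cluster_sizes.items():
        unique_probs.append(p_b[c])
        counts.append(n_c * n_c)
    return unique_probs, counts

# Example: 5 nodes in two clusters with different within-cluster densities
U1, T1 = bter_phase1_groups([0, 0, 0, 1, 1], {0: 0.9, 1: 0.5})
```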

18 S. Moreno et al. Group probability Edges are generated using RG and CL group probability algorithms, with sampling times O(M s1 ) = O(M s2 ) = O(1). Using these complexities, the first phase is O(N v + N e1 ), where N e1 is the number of edges generated in the first phase. Similarly, the second phase is O( D B 2 + N e2 ). These values give us a total complexity of O(N e1 + N v + D B 2 + N e2 ) = O(N e ),ifn e > D B 2. Geometric group probability The implementation of this algorithm is realized based on the geometric group probabilities algorithms for RG and CL models, with sampling times O(M s1 ) = O(M s2 ) = O(1). While the first phase has time complexity O(N v + N e1 ), the second phase is O( D B 2 + N e2 ), obtaining a total complexity O(N e ),if N e > D B 2. Summary Both algorithms have the same preprocessing and sampling times. Let N ei be the number of edges sampled in the RG (i = 1) and CL (i = 2) phase. For the RG model phase O(M p1 ) = O(N v ) and O(M s1 ) = O(1), obtaining a total time of O(N v + N e1 ). For the CL model phase O(M p2 ) = O( D B 2 ) and O(M s2 ) = O(1), obtaining a total time of O( D B 2 + N e2 ). Then, the final algorithm has complexity O(N e ),ifn e > D B Kronecker product graph model (KPGM) For KPGM, there are two parameters: a matrix of size b b with values between 0 and 1, whichwecallθ, and the number of kronecker multiplication K = log b (N v ).Forthe implementation, we use the representation developed in (Moreno et al. 2014), where P K (V i, V j ) = θ γ θ γ θ γ bb bb, with γ ij {0, 1,..., K } and b bj=1 i=1 γ ij = K. With this representation, U K is generated by the multiplication of all possible combinations of γ ij subject to b bj=1 i=1 γ ij = K (k-combination with repetition problem (Sheldon 2002)). The time complexity of this generation process is given by O(K U K ), where K corresponds to the multiplications to obtain an unique π k, and U K is the number of k-combination with repetition. As it is demonstrated in Moreno et al. (2014), U K < N v for large K (K 7, 10 for b = 2, 3 respectively), which results in complexity O(K N v ). Similarly T K is constructed by all k = K! γ 11k!γ 12k! γ bbk! possible permutations of Ɣ k ={γ 11,...,γ bb }, assuming the value of the factorials known, then the construction of T K is O(K N v ) too. Then, the preprocessing time of the algorithm is O(M p ) = O(K N v ). Group probability Let Λ i =[i 1, i 2,...,i K ] and Λ j =[j 1, j 2,..., j K ] be θ indexes s.t. K l=1 θ Λi (l)λ j (l) = π k. To sample an edge from the T K k positions, we realize a random permutation over σ = {1, 2,...,K } and calculate the indexes by Kl=1 (Λ i, j (σ (l)) 1)b l Both, the random permutation and indexes, have complexity time O(K ), obtaining O(M s ) = O(K ). Reemploying these values in Eq. 1, the total complexity is O(K N v + K N e ) = Õ(N e ). Note that this is the only implementation discussed in Moreno et al. (2014). Geometric group probability The edge generation process must determine the count Edge th permutation of vectors Λ i and Λ j for a given π k. To determine

Λ_{i,j}(l), we calculate all possible permutations using the different θ_ij's and pick the θ_ij's able to generate the countEdge-th permutation. We start by fixing Λ_{i,j}(1) = 1 with θ_11 (assuming θ_11 is used in π_k) and calculate the number of possible positions over the remaining K−1 elements using the other θ_ij's. If this value is greater than countEdge, then Λ_{i,j}(1) = 1 is fixed and the search continues over the next position. Otherwise, the positions are changed to Λ_i(1) = 1 and Λ_j(1) = 2 (parameter θ_12) and the process is repeated. This process is realized K times, assigning a position to all θ_ij's. Then, the edge generation has time complexity O(M_s) = O(K·b²·b²), where K corresponds to the number of assigned positions and b² values are tested per position, each requiring the computation of a permutation count (2! to K! are assumed known). Substituting these values in Eq. 1, the total complexity is O(K·N_v + K·b⁴·N_e) = Õ(N_e).

Summary Even though both algorithms have the same preprocessing time O(M_p) = O(K·N_v), their sampling times differ by a constant factor of b⁴. While the sampling cost is O(K·N_e) for the group probability algorithm, for the geometric group probability algorithm it is O(K·b⁴·N_e). Besides this difference, both algorithms have a total complexity of Õ(N_e) for K > 7 (b = 2) and K > 10 (b = 3), respectively.

4.6 Mixed KPGM (mkpgm)

In addition to the Θ and K parameters from KPGM, mkpgm has a new parameter l that controls the dependencies between edges. The preprocessing step of mkpgm includes the generation of G_l using KPGM (with K = l in this case). This preprocessing step has a time complexity of Õ(N_{e_l}), where N_{e_l} is the number of edges of G_l. Then, we sample each of the remaining layers G_{l+k} based on the proposed algorithm. Given that P^M_{l+k} = G_{l+k−1} ⊗ Θ, we have U^M = U^M_k = Θ = {θ_11, θ_12,...,θ_bb} for all k ∈ {1,...,K−l}, with time complexity O(1). Moreover, for every π_j ∈ U^M_k, T^M_j = |E_{l+k−1}| (the number of edges of the previous layer), so the construction of T^M_k is O(1) too. Given these time complexities, the preprocessing step has time complexity O(M_p) = Õ(N_{e_l}).

Group probability To generate a random edge at layer G_{l+k} using π_k = θ_ij, we pick a random edge (u, v) from G_{l+k−1} and calculate the new indexes as idx_i = u·b + i and idx_j = v·b + j. Given that u, v are stored, the edge generation process has time complexity O(M_s) = O(1). With these values, the final complexity to generate G_K is Õ(N_{e_l} + Σ_{k=1}^{K−l} N_{e_{l+k}}) = Õ((K−l)·N_e) = Õ(N_e). Note that this implementation differs from Moreno et al. (2014), which does not use a rejection process.

Geometric group probability To generate an edge at layer G_{l+k} using π_k = θ_ij, we pick the countEdge-th edge (countEdge ∈ {1,...,|E_{l+k−1}|}) from E_{l+k−1}. Let (u, v) be the selected edge; then the new indexes are idx_i = (u−1)·b + i and idx_j = (v−1)·b + j (O(M_s) = O(1)). Thus, the final complexity is Õ(N_{e_l} + Σ_{k=1}^{K−l} N_{e_{l+k}}) = Õ(N_e).

Summary The preprocessing time of both algorithms is dominated by the generation of the KPGM network G_l with K = l, which implies O(M_p) = Õ(N_{e_l}), where N_{e_l} is the number of edges of G_l.
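To close the implementation section, here is an end-to-end grouped sampler for the simplest multi-group case, the SBM of Sect. 4.3. It is a compact sketch, not the authors' implementation: nodes are assumed to be numbered contiguously by block, the per-group edge count is drawn from a Binomial as in GP, and random.sample stands in for the rejection/geometric placement step when choosing distinct cells.

```python
import random
import numpy as np

def sbm_grouped_sample(block_sizes, p_s, seed=0):
    """Grouped-probability SBM sampler sketch. block_sizes[a] is the number of
    nodes in block a (nodes numbered contiguously by block) and p_s[a][b] is
    the edge probability between blocks a and b. For every block pair the edge
    count is Binomial(T_k, pi_k), and the edges are placed on distinct cells."""
    rng = random.Random(seed)
    np_rng = np.random.default_rng(seed)
    offsets = np.concatenate(([0], np.cumsum(block_sizes)))
    edges = set()
    for a in range(len(block_sizes)):
        for b in range(len(block_sizes)):
            t_k = block_sizes[a] * block_sizes[b]        # cells sharing probability pi_k
            x_k = int(np_rng.binomial(t_k, p_s[a][b]))   # number of edges in this group
            for idx in rng.sample(range(t_k), x_k):      # distinct cell indices
                u = int(offsets[a]) + idx // block_sizes[b]
                v = int(offsets[b]) + idx % block_sizes[b]
                edges.add((u, v))                        # directed convention, self-loops allowed
    return edges

# Example: two blocks of 50 nodes, dense within blocks, sparse between them
graph = sbm_grouped_sample([50, 50], [[0.10, 0.01], [0.01, 0.10]])
```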


More information

Contents. Counting Methods and Induction

Contents. Counting Methods and Induction Contents Counting Methods and Induction Lesson 1 Counting Strategies Investigations 1 Careful Counting... 555 Order and Repetition I... 56 3 Order and Repetition II... 569 On Your Own... 573 Lesson Counting

More information

Graph Detection and Estimation Theory

Graph Detection and Estimation Theory Introduction Detection Estimation Graph Detection and Estimation Theory (and algorithms, and applications) Patrick J. Wolfe Statistics and Information Sciences Laboratory (SISL) School of Engineering and

More information

Copyright 2013 Springer Science+Business Media New York

Copyright 2013 Springer Science+Business Media New York Meeks, K., and Scott, A. (2014) Spanning trees and the complexity of floodfilling games. Theory of Computing Systems, 54 (4). pp. 731-753. ISSN 1432-4350 Copyright 2013 Springer Science+Business Media

More information

Shortest paths with negative lengths

Shortest paths with negative lengths Chapter 8 Shortest paths with negative lengths In this chapter we give a linear-space, nearly linear-time algorithm that, given a directed planar graph G with real positive and negative lengths, but no

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

On the number of cycles in a graph with restricted cycle lengths

On the number of cycles in a graph with restricted cycle lengths On the number of cycles in a graph with restricted cycle lengths Dániel Gerbner, Balázs Keszegh, Cory Palmer, Balázs Patkós arxiv:1610.03476v1 [math.co] 11 Oct 2016 October 12, 2016 Abstract Let L be a

More information

CS 224w: Problem Set 1

CS 224w: Problem Set 1 CS 224w: Problem Set 1 Tony Hyun Kim October 8, 213 1 Fighting Reticulovirus avarum 1.1 Set of nodes that will be infected We are assuming that once R. avarum infects a host, it always infects all of the

More information

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Randomized Algorithms Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Sampling Random Variables

Sampling Random Variables Sampling Random Variables Introduction Sampling a random variable X means generating a domain value x X in such a way that the probability of generating x is in accordance with p(x) (respectively, f(x)),

More information

The number of Euler tours of random directed graphs

The number of Euler tours of random directed graphs The number of Euler tours of random directed graphs Páidí Creed School of Mathematical Sciences Queen Mary, University of London United Kingdom P.Creed@qmul.ac.uk Mary Cryan School of Informatics University

More information

Lecture 6 September 21, 2016

Lecture 6 September 21, 2016 ICS 643: Advanced Parallel Algorithms Fall 2016 Lecture 6 September 21, 2016 Prof. Nodari Sitchinava Scribe: Tiffany Eulalio 1 Overview In the last lecture, we wrote a non-recursive summation program and

More information

Analysis of Algorithms I: Perfect Hashing

Analysis of Algorithms I: Perfect Hashing Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the

More information

Nonparametric Bayesian Matrix Factorization for Assortative Networks

Nonparametric Bayesian Matrix Factorization for Assortative Networks Nonparametric Bayesian Matrix Factorization for Assortative Networks Mingyuan Zhou IROM Department, McCombs School of Business Department of Statistics and Data Sciences The University of Texas at Austin

More information

Network models: random graphs

Network models: random graphs Network models: random graphs Leonid E. Zhukov School of Data Analysis and Artificial Intelligence Department of Computer Science National Research University Higher School of Economics Structural Analysis

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

More information

CS224W: Social and Information Network Analysis

CS224W: Social and Information Network Analysis CS224W: Social and Information Network Analysis Reaction Paper Adithya Rao, Gautam Kumar Parai, Sandeep Sripada Keywords: Self-similar networks, fractality, scale invariance, modularity, Kronecker graphs.

More information

Modeling of Growing Networks with Directional Attachment and Communities

Modeling of Growing Networks with Directional Attachment and Communities Modeling of Growing Networks with Directional Attachment and Communities Masahiro KIMURA, Kazumi SAITO, Naonori UEDA NTT Communication Science Laboratories 2-4 Hikaridai, Seika-cho, Kyoto 619-0237, Japan

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

The Mixed Chinese Postman Problem Parameterized by Pathwidth and Treedepth

The Mixed Chinese Postman Problem Parameterized by Pathwidth and Treedepth The Mixed Chinese Postman Problem Parameterized by Pathwidth and Treedepth Gregory Gutin, Mark Jones, and Magnus Wahlström Royal Holloway, University of London Egham, Surrey TW20 0EX, UK Abstract In the

More information

An average case analysis of a dierential attack. on a class of SP-networks. Distributed Systems Technology Centre, and

An average case analysis of a dierential attack. on a class of SP-networks. Distributed Systems Technology Centre, and An average case analysis of a dierential attack on a class of SP-networks Luke O'Connor Distributed Systems Technology Centre, and Information Security Research Center, QUT Brisbane, Australia Abstract

More information

Lecture 1 and 2: Introduction and Graph theory basics. Spring EE 194, Networked estimation and control (Prof. Khan) January 23, 2012

Lecture 1 and 2: Introduction and Graph theory basics. Spring EE 194, Networked estimation and control (Prof. Khan) January 23, 2012 Lecture 1 and 2: Introduction and Graph theory basics Spring 2012 - EE 194, Networked estimation and control (Prof. Khan) January 23, 2012 Spring 2012: EE-194-02 Networked estimation and control Schedule

More information

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1 Bayes Networks CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 59 Outline Joint Probability: great for inference, terrible to obtain

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

Multiple Choice Tries and Distributed Hash Tables

Multiple Choice Tries and Distributed Hash Tables Multiple Choice Tries and Distributed Hash Tables Luc Devroye and Gabor Lugosi and Gahyun Park and W. Szpankowski January 3, 2007 McGill University, Montreal, Canada U. Pompeu Fabra, Barcelona, Spain U.

More information

Computing Connected Components Given a graph G = (V; E) compute the connected components of G. Connected-Components(G) 1 for each vertex v 2 V [G] 2 d

Computing Connected Components Given a graph G = (V; E) compute the connected components of G. Connected-Components(G) 1 for each vertex v 2 V [G] 2 d Data Structures for Disjoint Sets Maintain a Dynamic collection of disjoint sets. Each set has a unique representative (an arbitrary member of the set). x. Make-Set(x) - Create a new set with one member

More information

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash Equilibrium Price of Stability Coping With NP-Hardness

More information

1 Complex Networks - A Brief Overview

1 Complex Networks - A Brief Overview Power-law Degree Distributions 1 Complex Networks - A Brief Overview Complex networks occur in many social, technological and scientific settings. Examples of complex networks include World Wide Web, Internet,

More information

Northwestern University Department of Electrical Engineering and Computer Science

Northwestern University Department of Electrical Engineering and Computer Science Northwestern University Department of Electrical Engineering and Computer Science EECS 454: Modeling and Analysis of Communication Networks Spring 2008 Probability Review As discussed in Lecture 1, probability

More information

Strongly chordal and chordal bipartite graphs are sandwich monotone

Strongly chordal and chordal bipartite graphs are sandwich monotone Strongly chordal and chordal bipartite graphs are sandwich monotone Pinar Heggernes Federico Mancini Charis Papadopoulos R. Sritharan Abstract A graph class is sandwich monotone if, for every pair of its

More information

Technische Universität Dresden Institute of Numerical Mathematics

Technische Universität Dresden Institute of Numerical Mathematics Technische Universität Dresden Institute of Numerical Mathematics An Improved Flow-based Formulation and Reduction Principles for the Minimum Connectivity Inference Problem Muhammad Abid Dar Andreas Fischer

More information

ACO Comprehensive Exam March 20 and 21, Computability, Complexity and Algorithms

ACO Comprehensive Exam March 20 and 21, Computability, Complexity and Algorithms 1. Computability, Complexity and Algorithms Part a: You are given a graph G = (V,E) with edge weights w(e) > 0 for e E. You are also given a minimum cost spanning tree (MST) T. For one particular edge

More information

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane Randomized Algorithms Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane Sotiris Nikoletseas Professor CEID - ETY Course 2017-2018 Sotiris Nikoletseas, Professor Randomized

More information

Dynamic Approaches: The Hidden Markov Model

Dynamic Approaches: The Hidden Markov Model Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message

More information

Determining the Diameter of Small World Networks

Determining the Diameter of Small World Networks Determining the Diameter of Small World Networks Frank W. Takes & Walter A. Kosters Leiden University, The Netherlands CIKM 2011 October 2, 2011 Glasgow, UK NWO COMPASS project (grant #12.0.92) 1 / 30

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

Network models: dynamical growth and small world

Network models: dynamical growth and small world Network models: dynamical growth and small world Leonid E. Zhukov School of Data Analysis and Artificial Intelligence Department of Computer Science National Research University Higher School of Economics

More information

Conditional Marginalization for Exponential Random Graph Models

Conditional Marginalization for Exponential Random Graph Models Conditional Marginalization for Exponential Random Graph Models Tom A.B. Snijders January 21, 2010 To be published, Journal of Mathematical Sociology University of Oxford and University of Groningen; this

More information

Project in Computational Game Theory: Communities in Social Networks

Project in Computational Game Theory: Communities in Social Networks Project in Computational Game Theory: Communities in Social Networks Eldad Rubinstein November 11, 2012 1 Presentation of the Original Paper 1.1 Introduction In this section I present the article [1].

More information

Probabilistic Near-Duplicate. Detection Using Simhash

Probabilistic Near-Duplicate. Detection Using Simhash Probabilistic Near-Duplicate Detection Using Simhash Sadhan Sood, Dmitri Loguinov Presented by Matt Smith Internet Research Lab Department of Computer Science and Engineering Texas A&M University 27 October

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Dominating Set Counting in Graph Classes

Dominating Set Counting in Graph Classes Dominating Set Counting in Graph Classes Shuji Kijima 1, Yoshio Okamoto 2, and Takeaki Uno 3 1 Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan kijima@inf.kyushu-u.ac.jp

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26

Problem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Binary Search Introduction Problem Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Strategy 1: Random Search Randomly select a page until the page containing

More information

Algebraic Methods in Combinatorics

Algebraic Methods in Combinatorics Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets

More information

Data Structure. Mohsen Arab. January 13, Yazd University. Mohsen Arab (Yazd University ) Data Structure January 13, / 86

Data Structure. Mohsen Arab. January 13, Yazd University. Mohsen Arab (Yazd University ) Data Structure January 13, / 86 Data Structure Mohsen Arab Yazd University January 13, 2015 Mohsen Arab (Yazd University ) Data Structure January 13, 2015 1 / 86 Table of Content Binary Search Tree Treaps Skip Lists Hash Tables Mohsen

More information

3.2 Configuration model

3.2 Configuration model 3.2 Configuration model 3.2.1 Definition. Basic properties Assume that the vector d = (d 1,..., d n ) is graphical, i.e., there exits a graph on n vertices such that vertex 1 has degree d 1, vertex 2 has

More information

HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS

HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS Colin Cooper School of Mathematical Sciences, Polytechnic of North London, London, U.K. and Alan Frieze and Michael Molloy Department of Mathematics, Carnegie-Mellon

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

An Algorithmic Proof of the Lopsided Lovász Local Lemma (simplified and condensed into lecture notes)

An Algorithmic Proof of the Lopsided Lovász Local Lemma (simplified and condensed into lecture notes) An Algorithmic Proof of the Lopsided Lovász Local Lemma (simplified and condensed into lecture notes) Nicholas J. A. Harvey University of British Columbia Vancouver, Canada nickhar@cs.ubc.ca Jan Vondrák

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu November 16, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

On the maximum number of isosceles right triangles in a finite point set

On the maximum number of isosceles right triangles in a finite point set On the maximum number of isosceles right triangles in a finite point set Bernardo M. Ábrego, Silvia Fernández-Merchant, and David B. Roberts Department of Mathematics California State University, Northridge,

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Problem Set 3 Issued: Thursday, September 25, 2014 Due: Thursday,

More information

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 11 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,, a n, b are given real

More information

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th CSE 3500 Algorithms and Complexity Fall 2016 Lecture 10: September 29, 2016 Quick sort: Average Run Time In the last lecture we started analyzing the expected run time of quick sort. Let X = k 1, k 2,...,

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Specification and estimation of exponential random graph models for social (and other) networks

Specification and estimation of exponential random graph models for social (and other) networks Specification and estimation of exponential random graph models for social (and other) networks Tom A.B. Snijders University of Oxford March 23, 2009 c Tom A.B. Snijders (University of Oxford) Models for

More information

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets IEEE Big Data 2015 Big Data in Geosciences Workshop An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets Fatih Akdag and Christoph F. Eick Department of Computer

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu Intro sessions to SNAP C++ and SNAP.PY: SNAP.PY: Friday 9/27, 4:5 5:30pm in Gates B03 SNAP

More information

Quick Sort Notes , Spring 2010

Quick Sort Notes , Spring 2010 Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.

More information

Lecture 5: January 30

Lecture 5: January 30 CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 5: January 30 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They

More information

Clustering bi-partite networks using collapsed latent block models

Clustering bi-partite networks using collapsed latent block models Clustering bi-partite networks using collapsed latent block models Jason Wyse, Nial Friel & Pierre Latouche Insight at UCD Laboratoire SAMM, Université Paris 1 Mail: jason.wyse@ucd.ie Insight Latent Space

More information

Heuristics for The Whitehead Minimization Problem

Heuristics for The Whitehead Minimization Problem Heuristics for The Whitehead Minimization Problem R.M. Haralick, A.D. Miasnikov and A.G. Myasnikov November 11, 2004 Abstract In this paper we discuss several heuristic strategies which allow one to solve

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University TheFind.com Large set of products (~6GB compressed) For each product A=ributes Related products Craigslist About 3 weeks of data

More information

CSE525: Randomized Algorithms and Probabilistic Analysis April 2, Lecture 1

CSE525: Randomized Algorithms and Probabilistic Analysis April 2, Lecture 1 CSE525: Randomized Algorithms and Probabilistic Analysis April 2, 2013 Lecture 1 Lecturer: Anna Karlin Scribe: Sonya Alexandrova and Eric Lei 1 Introduction The main theme of this class is randomized algorithms.

More information

Fundamental Algorithms

Fundamental Algorithms Chapter 2: Sorting, Winter 2018/19 1 Fundamental Algorithms Chapter 2: Sorting Jan Křetínský Winter 2018/19 Chapter 2: Sorting, Winter 2018/19 2 Part I Simple Sorts Chapter 2: Sorting, Winter 2018/19 3

More information

Fundamental Algorithms

Fundamental Algorithms Fundamental Algorithms Chapter 2: Sorting Harald Räcke Winter 2015/16 Chapter 2: Sorting, Winter 2015/16 1 Part I Simple Sorts Chapter 2: Sorting, Winter 2015/16 2 The Sorting Problem Definition Sorting

More information

Uniform generation of random graphs with power-law degree sequences

Uniform generation of random graphs with power-law degree sequences Uniform generation of random graphs with power-law degree sequences arxiv:1709.02674v2 [math.co] 14 Nov 2017 Pu Gao School of Mathematics Monash University jane.gao@monash.edu Abstract Nicholas Wormald

More information

Models, Data, Learning Problems

Models, Data, Learning Problems Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Models, Data, Learning Problems Tobias Scheffer Overview Types of learning problems: Supervised Learning (Classification, Regression,

More information

CS224W: Methods of Parallelized Kronecker Graph Generation

CS224W: Methods of Parallelized Kronecker Graph Generation CS224W: Methods of Parallelized Kronecker Graph Generation Sean Choi, Group 35 December 10th, 2012 1 Introduction The question of generating realistic graphs has always been a topic of huge interests.

More information

Introduction to Randomized Algorithms III

Introduction to Randomized Algorithms III Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November 2017 1 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information