Lecture 6 (supplemental): Stochastic Block Models
1 Lecture 6 (supplemental): Stochastic Block Models. Aaron Clauset, Assistant Professor of Computer Science, University of Colorado Boulder; External Faculty, Santa Fe Institute. 2017 Aaron Clauset
2 what is structure?
3 what is structure? makes data different from noise makes a network different from a random graph
4 what is structure? makes data different from noise makes a network different from a random graph helps us compress the data describe the network succinctly capture most relevant patterns
5 what is structure? makes data different from noise; makes a network different from a random graph. helps us compress the data: describe the network succinctly, capture the most relevant patterns. helps us generalize, from data we've seen to data we haven't seen: i. from one part of the network to another; ii. from one network to others of the same type; iii. from small scale to large scale (coarse-grained structure); iv. from past to future (dynamics)
6 statistical inference. imagine graph G is drawn from an ensemble or generative model: a probability distribution Pr(G | θ) with parameters θ. θ can be continuous or discrete; θ represents the structure of the graph
7 statistical inference. imagine graph G is drawn from an ensemble or generative model: a probability distribution Pr(G | θ) with parameters θ. θ can be continuous or discrete; θ represents the structure of the graph. inference (MLE): given G, find the θ that maximizes Pr(G | θ). inference (Bayes): compute or sample from the posterior distribution Pr(θ | G)
8 statistical inference. imagine graph G is drawn from an ensemble or generative model: a probability distribution Pr(G | θ) with parameters θ. inference (MLE): given G, find the θ that maximizes Pr(G | θ). inference (Bayes): compute or sample from the posterior distribution Pr(θ | G). if θ is partly known, constrain the inference and determine the rest. if G is partly known, infer θ and use Pr(G | θ) to generate the rest. if the model is a good fit (application dependent), we can generate synthetic graphs structurally similar to G. if part of G has low probability under the model, flag it as a possible anomaly
9 statistical inference. imagine graph G is drawn from an ensemble or generative model: a probability distribution Pr(G | θ) with parameters θ. inference (MLE): given G, find the θ that maximizes Pr(G | θ). inference (Bayes): compute or sample from the posterior distribution Pr(θ | G). statistical inference = a principled approach to learning from data; combines tools from statistics, machine learning, information theory, and statistical physics. if θ is partly known, constrain the inference and determine the rest. if G is partly known, infer θ and use Pr(G | θ) to generate the rest. if the model is a good fit (application dependent), we can generate synthetic graphs structurally similar to G. if part of G has low probability under the model, flag it as a possible anomaly. quantifies uncertainty; separates the model from the learning
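As a minimal illustration of the MLE recipe above (a sketch of my own, using the simplest generative model, the Erdős-Rényi graph G(n, p), rather than any model from these slides): maximizing Pr(G | p) gives p̂ = the observed edge density.

```python
import numpy as np

def er_mle(A):
    """MLE of p for the Erdos-Renyi model G(n, p), given adjacency matrix A.

    Maximizing Pr(G | p) = p^m (1-p)^(n(n-1)/2 - m) over p gives
    p_hat = m / (n choose 2), the observed edge density.
    """
    n = A.shape[0]
    m = A.sum() / 2          # undirected: each edge appears twice in A
    return m / (n * (n - 1) / 2)

# a triangle plus an isolated node: 3 edges among 4 nodes -> p_hat = 0.5
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]])
print(er_mle(A))  # 0.5
```

The same recipe (write down Pr(G | θ), maximize over θ) carries over to the block models below, just with richer θ.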
10 statistical inference: key ideas. interpretability: model parameters have meaning for scientific questions. auxiliary information: node & edge attributes, temporal dynamics (beyond static binary graphs). scalability: fast algorithms for fitting models to big data (methods from physics, machine learning). model selection: which model is better? is this model bad? how many communities? partial or noisy data: extrapolation, interpolation, hidden data, missing data. anomaly detection: low-probability events under the generative model
11 generative models for complex networks. define a parametric probability distribution over networks, Pr(G | θ). generation: given θ, draw G from this distribution. inference: given G, choose the θ that makes G likely. [diagram: model Pr(G | θ) -generation-> data G = (V, E); data -inference-> model]
12 generative models for complex networks. general form: Pr(G | θ) = ∏_{ij} Pr(A_ij | θ), where Pr(A_ij | θ) is the edge generation function; assumptions about structure go into it. consistency, lim_{n→∞} Pr(θ̂ ≠ θ) = 0, requires that edges be conditionally independent [Shalizi, Rinaldo 2011]. two general classes of these models
13 generative models for complex networks. stochastic block models: k types of vertices; Pr(A_ij | M, z) depends only on the node types z_i, z_j. originally invented by sociologists [Holland, Laskey, Leinhardt 1983]. many, many flavors, including: binomial SBM [Holland et al. 1983, Wang & Wong 1987]; simple assortative SBM [Hofman & Wiggins 2008]; mixed-membership SBM [Airoldi et al. 2008]; hierarchical SBM [Clauset et al. 2006, 2008, Peixoto 2014]; fractal SBM [Leskovec et al. 2005]; infinite relational model [Kemp et al. 2006]; degree-corrected SBM [Karrer & Newman 2011]; SBM + topic models [Ball et al. 2011]; SBM + vertex covariates [Mariadassou et al. 2010, Newman & Clauset 2016]; SBM + edge weights [Aicher et al. 2013, 2014, Peixoto 2015]; bipartite SBM [Larremore et al. 2014]; multilayer SBM [Peixoto 2015, Valles-Catala et al. 2016]; and many others
14 generative models for complex networks. latent space models: nodes live in a latent space; Pr(A_ij | f(x_i, x_j)) depends only on vertex-vertex proximity. originally invented by statisticians [Hoff, Raftery, Handcock 2002]. many, many flavors, including: logistic function on vertex features [Hoff et al. 2002]; social status / ranking [Ball, Newman 2013]; nonparametric metadata relations [Kim et al. 2012]; multiplicative attribute graphs [Kim & Leskovec 2010]; nonparametric latent feature model [Miller et al. 2009]; infinite multiple memberships [Morup et al. 2011]; ecological niche model [Williams et al. 2010]; hyperbolic latent spaces [Boguna et al. 2010]; and many others
15 opportunities and challenges. richly annotated data: edge weights, node attributes, time, etc. = new classes of generative models. generalize from n = 1 to an ensemble: useful for model checking, simulating other processes, etc. many familiar techniques: frequentist and Bayesian frameworks make probabilistic statements about observations and models; predicting missing links via leave-k-out cross validation; approximate inference techniques (EM, VB, BP, etc.); sampling techniques (MCMC, Gibbs, etc.). learn from partial or noisy data: extrapolation, interpolation, hidden data, missing data
16 opportunities and challenges. only two classes of models: stochastic block models (categorical latent variables) and latent space models (ordinal / continuous latent variables). bootstrap / resampling for network data: a critical missing piece; depends on what is independent in the data. model comparison: naive AIC, BIC, marginalization, LRT can be wrong for networks. what is the goal of modeling: realistic representation or accurate prediction? model assessment / checking: how do we know a model has done well? what do we check? what is v-fold cross-validation for networks? omit n²/v edges? omit n/v nodes? what?
17 the stochastic block model. each vertex i has a type z_i ∈ {1,...,k} (k vertex types or groups). stochastic block matrix M of group-level connection probabilities: the probability that i, j are connected = M_{z_i,z_j}. community = vertices with the same pattern of inter-community connections. [figure: network and inferred M]
18 the stochastic block model. assortative: edges within groups. disassortative: edges between groups. ordered: linear group hierarchy. core-periphery: dense core, sparse periphery
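To make the generative direction concrete, here is a minimal sketch (not from the slides; function and variable names are my own) that samples a simple undirected graph from an SBM, shown here with an assortative block matrix:

```python
import numpy as np

def sample_sbm(z, M, seed=None):
    """Draw a simple undirected graph: each pair (i, j) gets an edge
    independently with probability M[z_i, z_j]."""
    rng = np.random.default_rng(seed)
    n = len(z)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < M[z[i], z[j]]:
                A[i, j] = A[j, i] = 1
    return A

# assortative block matrix: dense within groups, sparse between
M = np.array([[0.5, 0.05],
              [0.05, 0.5]])
z = np.array([0] * 20 + [1] * 20)
A = sample_sbm(z, M, seed=42)
```

Swapping the diagonal and off-diagonal entries of M gives a disassortative graph; an upper-triangular M gives an ordered hierarchy; a dense first row/column gives core-periphery.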
19 the stochastic block model. likelihood function: the probability of G given labeling z and block matrix M: Pr(G | z, M) = ∏_{(i,j)∈E} M_{z_i,z_j} × ∏_{(i,j)∉E} (1 − M_{z_i,z_j}) (edge probability × non-edge probability)
20 the stochastic block model. likelihood function: the probability of G given labeling z and block matrix M: Pr(G | z, M) = ∏_{(i,j)∈E} M_{z_i,z_j} × ∏_{(i,j)∉E} (1 − M_{z_i,z_j}) = ∏_{r,s} M_{r,s}^{e_{r,s}} (1 − M_{r,s})^{n_r n_s − e_{r,s}} (Bernoulli edges): each block pair is a Bernoulli random graph with parameter M_{r,s}
21 the stochastic block model. the most general SBM: Pr(A | z, θ) = ∏_{i,j} f( A_ij ; θ_{R(z_i,z_j)} ), where A_ij is the value of the adjacency, R is a partition of the adjacencies, f is a probability function, and θ_a is the pattern for a-type adjacencies. Binomial f = simple graphs; Poisson f = multigraphs; Normal f = weighted graphs; etc.
22 the stochastic block model degree-corrected SBM ( f = Poisson) Karrer, Newman, Phys. Rev. E (2011)
23 the stochastic block model. degree-corrected SBM (f = Poisson). key assumption: Pr(i → j) = θ_i θ_j ω_{z_i,z_j}, where ω_{r,s} is the stochastic block matrix and θ_i is the (degree) propensity of node i. likelihood: Pr(A | z, θ, ω) = ∏_{i<j} [ (θ_i θ_j ω_{z_i,z_j})^{A_ij} / A_ij! ] exp( −θ_i θ_j ω_{z_i,z_j} ), maximized by θ̂_i = k_i / Σ_j k_j δ_{z_j,z_i} (the fraction of i's group's stubs on i) and ω̂_{rs} = m_{rs} = Σ_{ij} A_ij δ_{z_i,r} δ_{z_j,s} (the total number of edges between r and s). Karrer, Newman, Phys. Rev. E (2011)
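A sketch of those two MLE formulas in code (function names are my own; assumes every group contains at least one edge endpoint so the normalization is nonzero, and counts ω̂_rr via edge endpoints, so within-group entries count each edge twice, matching the e_rr convention):

```python
import numpy as np

def dcsbm_mle(A, z, k):
    """MLE parameters for the degree-corrected SBM (Karrer & Newman 2011):
    theta_i = k_i / (total degree of i's group), and
    omega_rs = number of edge endpoints running between groups r and s."""
    z = np.asarray(z)
    deg = A.sum(axis=1)
    group_deg = np.array([deg[z == r].sum() for r in range(k)])
    theta = deg / group_deg[z]            # fraction of its group's stubs on i
    omega = np.zeros((k, k))
    for r in range(k):
        for s in range(k):
            omega[r, s] = A[np.ix_(z == r, z == s)].sum()
    return theta, omega

# 4-cycle split into two groups of two: every node holds half its group's stubs
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
theta, omega = dcsbm_mle(A, [0, 0, 1, 1], 2)
```

Plugging θ̂ and ω̂ back into the Poisson likelihood gives the profile likelihood that is then maximized over the labeling z alone.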
24 the stochastic block model comparing SBM vs. DC-SBM : Zachary karate club Karrer, Newman, Phys. Rev. E (2011)
25 the stochastic block model comparing SBM vs. DC-SBM : Zachary karate club SBM leader/follower division DC-SBM social group division Karrer, Newman, Phys. Rev. E (2011)
26 the stochastic block model comparing SBM vs. DC-SBM : Zachary karate club SBM leader/follower division DC-SBM social group division Karrer, Newman, Phys. Rev. E (2011)
27 the stochastic block model. comparing SBM vs. DC-SBM: Zachary karate club. [figure: log-likelihood comparison] SBM: leader/follower division; DC-SBM: social group division. Peel, Larremore, Clauset, forthcoming (2016)
28 the stochastic block model. comparing SBM vs. DC-SBM: Zachary karate club. [figure: log-likelihood landscapes for both models] SBM: leader/follower division; DC-SBM: social group division. Peel, Larremore, Clauset, forthcoming (2016)
29 extending the SBM. many variants! we'll cover three: bipartite community structure, weighted community structure, hierarchical community structure
30 bipartite networks. many networks are bipartite: scientists and papers (co-authorship networks); actors and movies (co-appearance networks); words and documents (topic modeling); plants and pollinators; genes and genomes; etc. assortative structure? Larremore et al., Phys. Rev. E 90 (2014)
31 bipartite networks. many networks are bipartite: scientists and papers (co-authorship networks); actors and movies (co-appearance networks); words and documents (topic modeling); plants and pollinators; genes and genomes; etc. most analyses focus on one-mode projections, which discard information. assortative structure? no, bipartite structure. Larremore et al., Phys. Rev. E 90 (2014)
32 bipartite networks. bipartite stochastic block model (biSBM): exactly the SBM, but the model knows the network is bipartite: if type(z_i) = type(z_j), then require M_{z_i,z_j} = 0. inference proceeds as before. Larremore et al., Phys. Rev. E 90 (2014)
33 bipartite networks. the SBM can learn bipartite structure on its own, but it often overfits or returns mixed-type groups. [figure: likelihood scores, Eq. (16), comparing degree-corrected fits: SBM with K = 5 or K = 4 vs. biSBM with K_a, K_b = 3, 2] Larremore et al., Phys. Rev. E 90 (2014)
34 bipartite networks bipartite stochastic block model (bisbm) how do we know it works well? Larremore et al., Phys. Rev. E 90, (2014)
35 bipartite networks. bipartite stochastic block model (biSBM): how do we know it works well? synthetic networks with planted partitions, ω_planted. easy case: 4×4 easy-to-see communities. hard case: 3×2 hard-to-see communities. Larremore et al., Phys. Rev. E 90 (2014)
36 bipartite networks. bipartite stochastic block model (biSBM): how do we know it works well? synthetic networks with planted partitions, ω_planted. easy case: 4×4 easy-to-see communities. hard case: 3×2 hard-to-see communities. Larremore et al., Phys. Rev. E 90 (2014)
37 bipartite networks. bipartite stochastic block model (biSBM): how do we know it works well? synthetic networks with planted partitions. [figure: normalized mutual information vs. λ for (A) the easy case (4×4 easy-to-see communities) and (B) the hard case (3×2 hard-to-see communities); curves compare the biSBM with and without degree correction against the SBM and degree-corrected SBM applied to both weighted and unweighted one-mode projections] Larremore et al., Phys. Rev. E 90 (2014)
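Recovery in tests like these is scored by the normalized mutual information between the planted and inferred labelings. A small sketch of my own (using the geometric-mean normalization, which is one common choice, not necessarily the paper's):

```python
from collections import Counter
import math

def nmi(a, b):
    """Normalized mutual information I(a; b) / sqrt(H(a) H(b)) between two
    labelings of the same nodes: 1 = perfect recovery (up to relabeling),
    0 = labelings carry no information about each other."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    I = sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
            for (x, y), c in pab.items())
    Ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    Hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    return I / math.sqrt(Ha * Hb) if Ha > 0 and Hb > 0 else 1.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # ≈ 1.0: labels match up to renaming
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # ≈ 0.0: labels are independent
```

NMI is invariant to permuting group labels, which matters because the model's group indices are arbitrary.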
38 bipartite networks. bipartite stochastic block model (biSBM): always finds pure-type communities; more accurate than modeling one-mode projections (even weighted projections); finds communities in both modes. Larremore et al., Phys. Rev. E 90 (2014)
39 bipartite networks. example 1: malaria gene network. [figure: bipartite network of genes and substrings, with inferred groups in both modes] Larremore et al., Phys. Rev. E 90 (2014)
40 bipartite networks. example 2: Southern Women network. [figure: bipartite network of people and the events they attended, with inferred groups in both modes] Larremore et al., Phys. Rev. E 90 (2014)
41 bipartite networks. example 3: IMDb. [figure: bipartite network of 53,158 actors and 39,768 movies; inferred groups track language (Czech/Danish/German; French/Dutch/Cantonese/Greek/Filipino/Finnish; Japanese/Russian; English) and content (Adult)] Larremore et al., Phys. Rev. E 90 (2014)
42 bipartite networks. other approaches: (1) the minimum description length (MDL) principle learns that a network is bipartite ["Parsimonious Module Inference in Large Networks," Peixoto, Phys. Rev. Lett. 110 (2013)]; (2) marginalize over bipartite SBM parameterizations ["Predicting Human Preferences Using the Block Structure of Complex Social Networks," Guimerà, Llorente, Moro, Sales-Pardo, PLOS ONE 7(9), e44620 (2012)]
43 bipartite networks. [first page of Peixoto, "Parsimonious Module Inference in Large Networks," Phys. Rev. Lett. 110 (2013): uses the minimum description length principle to infer the number of blocks without fixing it in advance and to avoid overfitting; derives general bounds on the detectability of any prescribed block structure; shows the maximum number of detectable blocks scales as √N for fixed average degree; and gives an efficient multilevel Monte Carlo inference algorithm with complexity O(τ N log N) when the number of blocks is unknown and O(τ N) when it is known, where τ is the Markov-chain mixing time; illustrated on a large actor-film network with over 10^6 edges and a disassortative, bipartite block structure]
44 weighted networks. most interactions are weighted: frequency of interaction; strength of interaction; outcome of interaction; etc. but! thresholding discards information and can obscure underlying structure
45 weighted networks Aicher et al., J. Complex Networks 3, (2015).
46 weighted networks. [figure: the same weighted network binarized at thresholds 1, 2, 3, and 4 produces qualitatively different structures] Aicher et al., J. Complex Networks 3 (2015).
47 weighted networks. how will the results depend on the threshold? what impact does noise have under thresholding? Thomas & Blitzstein, arXiv (2011).
48 recall: the most general SBM: Pr(A | z, θ) = ∏_{i,j} f( A_ij ; θ_{R(z_i,z_j)} ), where A_ij is the value of the adjacency, R is a partition of the adjacencies, f is a probability function, and θ_a is the pattern for a-type adjacencies. Binomial f = simple graphs; Poisson f = multigraphs; Normal f = weighted graphs; etc.
49 weighted networks. weighted stochastic block model (WSBM): model edge existence and edge weight separately. edge existence: SBM; edge weights: exponential-family distribution. log-likelihood: ln Pr(G | M, z, θ, f) = α ln Pr(G | M, z) + (1 − α) ln Pr(G | θ, z, f), with edge existence governed by M_{z_i,z_j} [binomial distribution] and edge weights by θ_{z_i,z_j} [exponential-family distribution: Poisson, Normal, Gamma, Exponential, Pareto, etc.]. Aicher et al., J. Complex Networks 3 (2015).
50 weighted networks. weighted stochastic block model (WSBM): model edge existence and edge weight separately. log-likelihood: ln Pr(G | M, z, θ, f) = α ln Pr(G | M, z) + (1 − α) ln Pr(G | θ, z, f). mixing parameter α: α = 1, only model edge existence (ignore weights); α = 0, only model edge weights (ignore non-edges). Aicher et al., J. Complex Networks 3 (2015).
51 weighted networks. American football, NFL 2009 season: 32 teams, 2 divisions, 4 subdivisions. edge existence: who plays whom; edge weight: mean score difference. Aicher et al., J. Complex Networks 3 (2015).
52 weighted networks. American football, NFL 2009 season: 32 teams, 2 divisions, 4 subdivisions. SBM (α = 1) recovers the subdivisions perfectly. [figure: block matrix M and labeling z, with the NFC and AFC divisions visible] Aicher et al., J. Complex Networks 3 (2015).
53 weighted networks. American football, NFL 2009 season: 32 teams, 2 divisions, 4 subdivisions. WSBM (α = 0) recovers the team skill hierarchy: top teams; wild cards (Steelers & Dolphins); middle teams; bottom teams. [figure: block matrix M and labeling z] Aicher et al., J. Complex Networks 3 (2015).
54 weighted networks. adding weights to the SBM: what does A_ij = 0 mean? no edge, or an edge with weight 0? an edge, or a non-observed edge? how will we model the distribution of edge weights? edge existences and edge weights may contain different large-scale structure (conference structure vs. skill hierarchy)
55 weighted networks other approaches
56 weighted networks. other approaches: "Inferring the mesoscale structure of layered, edge-valued, and time-varying networks," Peixoto, Phys. Rev. E 92, 042807 (2015). bin the edge weights into C ranges; C bins = C layers of a multilayer SBM; common node labels; each layer has its own block matrix; infer C, M_c, z
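The binning step can be sketched as follows (my own implementation; equal-count quantile bins are one reasonable discretization, not necessarily the paper's, and the actual method also infers C rather than fixing it):

```python
import numpy as np

def bin_edge_weights(weights, C):
    """Discretize edge weights into C quantile bins; each bin index then
    labels the layer of a multilayer SBM that the edge belongs to."""
    edges = np.quantile(weights, np.linspace(0, 1, C + 1))
    # digitize against the interior bin edges gives labels 0..C-1
    return np.digitize(weights, edges[1:-1])

# six flight distances (km) split into three layers, two edges per layer
layers = bin_edge_weights([120, 300, 450, 800, 2500, 9000], C=3)
```

Each layer is then fit with its own block matrix M_c while all layers share the node labeling z.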
57 weighted networks. other approaches (discretized weights = SBM layers). OpenFlights airport network, edge weights = distance (km). [figure: probability density of edge distances (km), with the inferred weight bins (a)-(e) marked] Peixoto, Phys. Rev. E 92, 042807 (2015)
58 weighted networks. other approaches (discretized weights = SBM layers): OpenFlights airport network. Peixoto, Phys. Rev. E 92, 042807 (2015)
59 fin
More informationQuilting Stochastic Kronecker Graphs to Generate Multiplicative Attribute Graphs
Quilting Stochastic Kronecker Graphs to Generate Multiplicative Attribute Graphs Hyokun Yun (work with S.V.N. Vishwanathan) Department of Statistics Purdue Machine Learning Seminar November 9, 2011 Overview
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationBayesian data analysis in practice: Three simple examples
Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to
More informationA graph contains a set of nodes (vertices) connected by links (edges or arcs)
BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and
More informationSTAT 518 Intro Student Presentation
STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible
More informationBayesian Inference and MCMC
Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the
More informationComputational Genomics
Computational Genomics http://www.cs.cmu.edu/~02710 Introduction to probability, statistics and algorithms (brief) intro to probability Basic notations Random variable - referring to an element / event
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationChapter 16. Structured Probabilistic Models for Deep Learning
Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe
More informationDeep unsupervised learning
Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.
More informationModeling heterogeneity in random graphs
Modeling heterogeneity in random graphs Catherine MATIAS CNRS, Laboratoire Statistique & Génome, Évry (Soon: Laboratoire de Probabilités et Modèles Aléatoires, Paris) http://stat.genopole.cnrs.fr/ cmatias
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationIntroduction to Probability and Statistics (Continued)
Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:
More informationNonparametric Bayesian Methods: Models, Algorithms, and Applications (Day 5)
Nonparametric Bayesian Methods: Models, Algorithms, and Applications (Day 5) Tamara Broderick ITT Career Development Assistant Professor Electrical Engineering & Computer Science MIT Bayes Foundations
More informationReview. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationCSCE 478/878 Lecture 6: Bayesian Learning
Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell
More informationIntroduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016
Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An
More informationRecurrent Latent Variable Networks for Session-Based Recommendation
Recurrent Latent Variable Networks for Session-Based Recommendation Panayiotis Christodoulou Cyprus University of Technology paa.christodoulou@edu.cut.ac.cy 27/8/2017 Panayiotis Christodoulou (C.U.T.)
More information3 : Representation of Undirected GM
10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationGaussian Processes (10/16/13)
STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs
More informationOn the errors introduced by the naive Bayes independence assumption
On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of
More informationReview: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)
Case Study 4: Collaborative Filtering Review: Probabilistic Matrix Factorization Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 2 th, 214 Emily Fox 214 1 Probabilistic
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationLatent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationMultilevel Statistical Models: 3 rd edition, 2003 Contents
Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationCS Homework 3. October 15, 2009
CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationNonparametric Bayesian Methods - Lecture I
Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics
More informationDistributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College
Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationDecision-making, inference, and learning theory. ECE 830 & CS 761, Spring 2016
Decision-making, inference, and learning theory ECE 830 & CS 761, Spring 2016 1 / 22 What do we have here? Given measurements or observations of some physical process, we ask the simple question what do
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationProbabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April
Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationIntroduction to Bayesian Learning
Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationChaos, Complexity, and Inference (36-462)
Chaos, Complexity, and Inference (36-462) Lecture 21 Cosma Shalizi 3 April 2008 Models of Networks, with Origin Myths Erdős-Rényi Encore Erdős-Rényi with Node Types Watts-Strogatz Small World Graphs Exponential-Family
More informationBayesian Nonparametrics for Speech and Signal Processing
Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer
More informationSupporting Statistical Hypothesis Testing Over Graphs
Supporting Statistical Hypothesis Testing Over Graphs Jennifer Neville Departments of Computer Science and Statistics Purdue University (joint work with Tina Eliassi-Rad, Brian Gallagher, Sergey Kirshner,
More informationEE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS
EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005 Instructor: Professor Jeff A. Bilmes Uncertainty & Bayesian Networks
More informationCS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016
CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 Plan for today and next week Today and next time: Bayesian networks (Bishop Sec. 8.1) Conditional
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationRaRE: Social Rank Regulated Large-scale Network Embedding
RaRE: Social Rank Regulated Large-scale Network Embedding Authors: Yupeng Gu 1, Yizhou Sun 1, Yanen Li 2, Yang Yang 3 04/26/2018 The Web Conference, 2018 1 University of California, Los Angeles 2 Snapchat
More informationQuantitative Biology II Lecture 4: Variational Methods
10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate
More informationProbabilistic Graphical Networks: Definitions and Basic Results
This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More information11. Learning graphical models
Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical
More informationRonald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California
Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University
More informationUndirected Graphical Models: Markov Random Fields
Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected
More informationCS Lecture 19. Exponential Families & Expectation Propagation
CS 6347 Lecture 19 Exponential Families & Expectation Propagation Discrete State Spaces We have been focusing on the case of MRFs over discrete state spaces Probability distributions over discrete spaces
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationNetwork Event Data over Time: Prediction and Latent Variable Modeling
Network Event Data over Time: Prediction and Latent Variable Modeling Padhraic Smyth University of California, Irvine Machine Learning with Graphs Workshop, July 25 th 2010 Acknowledgements PhD students:
More information