Modeling heterogeneity in random graphs


1 Modeling heterogeneity in random graphs Catherine MATIAS CNRS, Laboratoire Statistique & Génome, Évry (Soon: Laboratoire de Probabilités et Modèles Aléatoires, Paris) cmatias

2 Outline Introduction to random graphs State space models for random graphs Stochastic Block Model (binary or weighted graphs) Parameter estimation and node clustering Identifiability Parameter estimation Clustering convergence results

3 Data: Biological networks Different network types Protein-protein interaction networks (PPI), Metabolic networks, Gene co-expression networks, Gene regulation networks... Some challenges Analyse large and noisy data sets, Identify structures (topological patterns, cliques, node groups, etc.), Model evolution across time, Compare networks between different species,...

4-8 Some models for random graphs Some existing models, advantages and drawbacks
Erdös-Rényi: simple and mathematically well understood, but too homogeneous; edges are i.i.d. $\mathcal{B}(p)$.
Models based on the degree distribution: scale-free property, but only a partial descriptor of the graph, and costly numerical simulations with fixed-degree models; node degrees are fixed and samples are obtained through a re-wiring algorithm.
Generative processes (like preferential attachment): dynamic model, but depends on parameters (initialisation, stopping rule,...); can we characterize the result? Start from a small graph, then add nodes and connect them with higher probability to nodes with large degrees.
Exponential random graphs: natural from a statistical point of view, but big inference issues; $P_\theta(Y = y) = c(\theta)^{-1} \exp(\theta^\top S(y))$.
In this talk, I am interested in modeling heterogeneity, thus I will focus on model-based clustering methods acting on the nodes.
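As a quick illustration of the first and third models above, the following sketch (assuming Python with numpy and networkx available; all parameter values are arbitrary) simulates an Erdös-Rényi graph and a preferential-attachment graph and compares their degree distributions.

```python
# A minimal sketch (numpy + networkx, arbitrary parameters) contrasting the
# homogeneous Erdos-Renyi model with a preferential-attachment model.
import networkx as nx
import numpy as np

n = 1000
er = nx.erdos_renyi_graph(n, p=0.01, seed=0)    # edges i.i.d. B(p)
pa = nx.barabasi_albert_graph(n, m=5, seed=0)   # preferential attachment

for name, g in [("Erdos-Renyi", er), ("Preferential attachment", pa)]:
    deg = np.array([d for _, d in g.degree()])
    # ER degrees concentrate around n*p; PA degrees are heavy-tailed (scale-free).
    print(f"{name}: mean degree = {deg.mean():.1f}, max degree = {deg.max()}")
```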

9 Outline Introduction to random graphs State space models for random graphs Stochastic Block Model (binary or weighted graphs) Parameter estimation and node clustering Identifiability Parameter estimation Clustering convergence results

10 State space models I Observations
Adjacency matrix: $Y = (Y_{ij})_{1 \le i,j \le n}$ characterizing the relations between $n$ distinct individuals or nodes;
Either binary ($Y_{ij} \in \{0, 1\}$) or weighted ($Y_{ij} \in \mathbb{R}^s$);
Directed or undirected ($Y_{ij} = Y_{ji}$), with or without self-loops ($Y_{ii} = 0$).
[Figure: example of an undirected binary graph with no self-loops.]

11 State space models II State space modeling
There exist latent variables $Z_i$ associated to each node $i$;
The $Z_i$'s are assumed to be i.i.d.;
Conditional on the $Z_i$'s, the observations $(Y_{ij})_{1 \le i,j \le n}$ are assumed to be independent;
The conditional distribution of $Y_{ij}$ only depends on $Z_i, Z_j$.
Different cases depending on the latent state space $\mathcal{Z}$
Finite case: $Z_i \in \{1, \dots, K\}$ corresponds to the stochastic block model (SBM);
Continuous case: $Z_i \in \mathbb{R}^k$ or $[0, 1]$.

12 State space models III Common features Network characteristics are summarized through a low dimensional latent space; the classical EM algorithm is not tractable.

13 Likelihood computation (state space r.g. models) I Preliminary comments
As in any state space model, likelihood computation is not tractable, except for very small sample sizes:
$\log P_\theta(Y) = \log \sum_{z} P_\theta(Y, Z = z)$.
One approach is to consider the latent variables as parameters and compute a conditional likelihood; the counterpart is maximisation issues.
The classical solution is the EM algorithm, but it requires that the distribution of $\{Z_i\}$ conditional on $\{Y_{ij}\}$ is easily computed, which is not the case here!
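To make the intractability concrete, here is a brute-force evaluation of $\log \sum_z P_\theta(Y, Z = z)$ for a toy binary SBM, a minimal sketch in plain numpy with hypothetical parameter values: the sum runs over $K^n$ configurations, so it is only feasible for a handful of nodes.

```python
# A minimal sketch (numpy, hypothetical parameters) of the exact SBM log-likelihood
# log P_theta(Y) = log sum_z P_theta(Y, z): the sum has K**n terms.
import itertools
import numpy as np

def sbm_loglik(Y, pi, gamma):
    n, K = Y.shape[0], len(pi)
    total = 0.0
    for z in itertools.product(range(K), repeat=n):   # all K**n configurations
        term = np.prod([pi[z[i]] for i in range(n)])  # prior of the configuration
        for i in range(n):
            for j in range(i + 1, n):
                g = gamma[z[i], z[j]]
                term *= g ** Y[i, j] * (1 - g) ** (1 - Y[i, j])
        total += term
    return np.log(total)

rng = np.random.default_rng(0)
Y = np.triu(rng.integers(0, 2, (6, 6)), 1)
Y = Y + Y.T                                           # toy symmetric binary graph
print(sbm_loglik(Y, pi=np.array([0.5, 0.5]),
                 gamma=np.array([[0.8, 0.1], [0.1, 0.6]])))
```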

14 Likelihood computation (state space r.g. models) II
[Figures: graphical model of $p(\{Z_i\}, \{Y_{ij}\})$, its moralization, and the conditional $p(\{Z_i\} \mid \{Y_{ij}\})$.]
The conditional distribution of the latent variables given the observed ones is not factorized.

15-17 One example: Latent space models [Handcock et al. 07] Model
$Z_i$ i.i.d. vectors in a latent space $\mathbb{R}^k$. Conditional on $\{Z_i\}$, the $\{Y_{ij}\}$ are independent Bernoulli r.v. with
$\operatorname{log-odds}(Y_{ij} = 1 \mid Z_i, Z_j, U_{ij}, \theta) = \theta_0 + \theta_1^\top U_{ij} - \|Z_i - Z_j\|$,
where $\operatorname{log-odds}(A) = \log[P(A)/(1 - P(A))]$, $\{U_{ij}\}$ is a set of covariate vectors and $\theta$ the parameter vector.
Results [Handcock et al. 07] Two-stage maximum likelihood or MCMC procedures are used to infer the model's parameters. Assuming the $Z_i$ are sampled from a mixture of multivariate normals, one may obtain a clustering of the nodes.
Issues No model selection procedure to infer the effective dimension $k$ of the latent space and the number of clusters.
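A minimal simulation sketch of this latent space model (plain numpy; the Euclidean distance, the single scalar covariate and all parameter values are assumptions made for illustration):

```python
# A minimal sketch (numpy, hypothetical parameters) of a latent space graph:
# nodes get i.i.d. positions Z_i in R^k and edges are Bernoulli with
# log-odds theta0 + theta1 * U_ij - ||Z_i - Z_j||.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
theta0, theta1 = 1.0, 0.5
Z = rng.normal(size=(n, k))                                 # latent positions
U = rng.normal(size=(n, n)); U = (U + U.T) / 2              # symmetric scalar covariate

D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)  # pairwise distances
P = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * U - D)))        # edge probabilities

Y = (rng.random((n, n)) < P).astype(int)
Y = np.triu(Y, 1); Y = Y + Y.T                              # undirected, no self-loops
print("edge density:", Y.sum() / (n * (n - 1)))
```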

18 Outline Introduction to random graphs State space models for random graphs Stochastic Block Model (binary or weighted graphs) Parameter estimation and node clustering Identifiability Parameter estimation Clustering convergence results

19 Mixture model approach Idea: probabilistic model-based clustering Assume that the nodes of the graph belong to unobserved groups that describe their connectivity to the other nodes. Advantages Induces heterogeneity in the data while keeping the model simple; a clustering of the nodes is induced by the model; the model encompasses the community detection framework. Motivation/Justification: Szemerédi regularity lemma [Szemerédi 78] Every large enough graph can be divided into subsets of about the same size so that the edges between different subsets behave almost randomly.

20 Stochastic block model (binary graphs)
[Figure: example graph with $n = 10$ nodes and connectivity parameters $\gamma$; e.g. $Z_5 = 1$, $Y_{12} = 1$, $Y_{15} = 0$.]
Binary case (parametric model with $\theta = (\pi, \gamma)$)
$K$ groups (= colors).
$\{Z_i\}_{1 \le i \le n}$ i.i.d. vectors $Z_i = (Z_{i1}, \dots, Z_{iK}) \sim \mathcal{M}(1, \pi)$, with $\pi = (\pi_1, \dots, \pi_K)$ the group proportions. $Z_i$ not observed (latent).
Observations: presence/absence of an edge $\{Y_{ij}\}_{1 \le i < j \le n}$.
Conditional on the $\{Z_i\}$'s, the r.v. $Y_{ij}$ are independent $\mathcal{B}(\gamma_{Z_i Z_j})$.
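A minimal simulation sketch of this binary SBM (plain numpy; the two-group parameter values are arbitrary illustrations, not estimates from any data set):

```python
# A minimal sketch (numpy, hypothetical parameters) simulating a binary SBM:
# latent labels Z_i ~ M(1, pi), then Y_ij | Z independent Bernoulli(gamma[Z_i, Z_j]).
import numpy as np

rng = np.random.default_rng(2)
n = 100
pi = np.array([0.6, 0.4])                      # group proportions
gamma = np.array([[0.25, 0.02],
                  [0.02, 0.30]])               # connectivity matrix

Z = rng.choice(len(pi), size=n, p=pi)          # latent group of each node
P = gamma[Z[:, None], Z[None, :]]              # edge probability of each pair
Y = (rng.random((n, n)) < P).astype(int)
Y = np.triu(Y, 1); Y = Y + Y.T                 # undirected, no self-loops
print("within-group density:", Y[np.ix_(Z == 0, Z == 0)].mean(),
      "between-group density:", Y[np.ix_(Z == 0, Z == 1)].mean())
```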

21 Stochastic block model (weighted graphs)
[Figure: example graph with $n = 10$ nodes and connectivity parameters $\gamma$; e.g. $Z_5 = 1$, $Y_{12} \in \mathbb{R}$, $Y_{15} = 0$.]
Weighted case (parametric model with $\theta = (\pi, \gamma^{(1)}, \gamma^{(2)})$)
Latent variables: idem.
Observations: weights $Y_{ij}$, where $Y_{ij} = 0$ or $Y_{ij} \in \mathbb{R}^s \setminus \{0\}$.
Conditional on the $\{Z_i\}$'s, the random variables $Y_{ij}$ are independent with distribution
$\mu_{Z_i Z_j}(\cdot) = \gamma^{(1)}_{Z_i Z_j} f(\cdot\,, \gamma^{(2)}_{Z_i Z_j}) + (1 - \gamma^{(1)}_{Z_i Z_j})\, \delta_0(\cdot)$.
(Assumption: $f$ has a continuous cdf at zero.)
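The same simulation sketch, adapted to the weighted case; taking $f$ as a unit-variance Gaussian and all parameter values below are assumptions made for illustration:

```python
# A minimal sketch (numpy, hypothetical parameters) of a weighted SBM where an edge
# is present with probability gamma1[Z_i, Z_j] and, if present, carries a weight
# drawn from f(., gamma2[Z_i, Z_j]) taken here as N(gamma2[Z_i, Z_j], 1).
import numpy as np

rng = np.random.default_rng(3)
n = 100
pi = np.array([0.5, 0.5])
gamma1 = np.array([[0.30, 0.05], [0.05, 0.30]])   # sparsity parameters
gamma2 = np.array([[2.0, -1.0], [-1.0, 1.0]])     # means of the weight densities

Z = rng.choice(len(pi), size=n, p=pi)
present = rng.random((n, n)) < gamma1[Z[:, None], Z[None, :]]
weights = rng.normal(loc=gamma2[Z[:, None], Z[None, :]], scale=1.0)
Y = np.where(present, weights, 0.0)
Y = np.triu(Y, 1); Y = Y + Y.T                    # undirected, no self-loops
print("fraction of non-zero weights:", (Y != 0).mean())
```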

22 SBM clustering vs other clusterings SBM clustering The node clustering induced by the model reflects a common connectivity behaviour; many clustering methods instead try to group nodes that belong to the same clique (e.g. community detection).
[Figure: toy example contrasting the SBM clusters with a clustering based on cliques.]

23 Particular cases and generalisations Particular cases
Affiliation model: the connectivity matrix $\gamma$ has only 2 parameters,
$\gamma = \begin{pmatrix} \alpha & \beta & \cdots & \beta \\ \beta & \alpha & \ddots & \vdots \\ \vdots & \ddots & \ddots & \beta \\ \beta & \cdots & \beta & \alpha \end{pmatrix}$
Affiliation with $\alpha > \beta$ = community detection (cliques clustering).
Generalisations
Overlapping groups [Latouche et al. 11, Airoldi et al. 08] for binary graphs;
Adding covariates [Zanghi et al. 10b];
Latent block models (LBM), for array data.

24 From SBM to LBM A graph is encoded through its adjacency matrix. Clustering the nodes corresponds to a simultaneous and identical clustering of the rows and columns. Generalise this to non-square array data, without constraining the row and column groups to be identical. This models bipartite graphs.

25 Latent block models I LBM notation
Observations: array $Y_{n,m} := \{Y_{ij}\}_{1 \le i \le n, 1 \le j \le m}$ with $Y_{ij} \in \mathcal{Y}$;
$K \ge 1$ and $L \ge 1$ the number of row and column groups, respectively;
Group prior distributions $\pi = (\pi_1, \dots, \pi_K)$ over $\mathcal{K} = \{1, \dots, K\}$ and $\rho = (\rho_1, \dots, \rho_L)$ over $\mathcal{L} = \{1, \dots, L\}$, such that $\sum_k \pi_k = \sum_l \rho_l = 1$;
Latent variables $Z^n := (Z_1, \dots, Z_n)$ i.i.d. $\sim \pi$ over $\mathcal{K}$ and $W^m := (W_1, \dots, W_m)$ i.i.d. $\sim \rho$ over $\mathcal{L}$.

26 Latent block models II Two models in the same framework: 2 cases occur
LBM: $\{Z_i\}_{1 \le i \le n}$ and $\{W_j\}_{1 \le j \le m}$ independent.
SBM: $n = m$, $K = L$, $Z_i = W_i$ for all $1 \le i \le n$ and $\pi = \rho$.
Connectivity parameters $\gamma = (\gamma_{kl})_{(k,l) \in \mathcal{K} \times \mathcal{L}}$.
Conditional on $\{Z_i, W_j\}$, the random variables $\{Y_{ij}\}$ are independent, with distribution $Y_{ij} \mid Z_i = k, W_j = l \sim f(\cdot\,; \gamma_{kl})$.
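A minimal simulation sketch of a Bernoulli LBM for a bipartite array (plain numpy; the group counts and parameter values are arbitrary illustrations):

```python
# A minimal sketch (numpy, hypothetical parameters) simulating a Bernoulli LBM:
# row labels Z_i ~ pi, column labels W_j ~ rho, Y_ij | Z, W ~ Bernoulli(gamma[Z_i, W_j]).
import numpy as np

rng = np.random.default_rng(4)
n, m = 80, 120
pi = np.array([0.5, 0.5])                 # K = 2 row groups
rho = np.array([0.3, 0.3, 0.4])           # L = 3 column groups
gamma = np.array([[0.6, 0.1, 0.3],
                  [0.1, 0.7, 0.2]])       # K x L connectivity matrix

Z = rng.choice(len(pi), size=n, p=pi)
W = rng.choice(len(rho), size=m, p=rho)
Y = (rng.random((n, m)) < gamma[Z[:, None], W[None, :]]).astype(int)
print("array density:", Y.mean())
```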

27 Outline Introduction to random graphs State space models for random graphs Stochastic Block Model (binary or weighted graphs) Parameter estimation and node clustering Identifiability Parameter estimation Clustering convergence results

28 Outline Introduction to random graphs State space models for random graphs Stochastic Block Model (binary or weighted graphs) Parameter estimation and node clustering Identifiability Parameter estimation Clustering convergence results

29 Parameter identifiability Problem Obviously, the model may only be identifiable up to a permutation of the group labels. But whether one may uniquely recover the parameter up to label switching is a delicate question. Existing identifiability results Undirected SBM, binary or weighted [Allman et al. 09, Allman et al. 11], Directed and binary SBM [Celisse et al. 12], Overlapping SBM [Latouche et al. 11], Binary LBM [Keribin et al. 13].

30 Outline Introduction to random graphs State space models for random graphs Stochastic Block Model (binary or weighted graphs) Parameter estimation and node clustering Identifiability Parameter estimation Clustering convergence results

31 Parameter estimation I Parameter estimation issue The EM algorithm is not feasible because the latent variables are not independent conditional on the observed ones. Ex (SBM): $P(\{Z_i\}_i \mid \{Y_{ij}\}_{i,j}) \ne \prod_i P(Z_i \mid \{Y_{ij}\}_{i,j})$. Alternatives: Gibbs sampling or variational approximation to EM; composite likelihood approaches for affiliation graphs (binary or weighted) [Ambroise & Matias 10]. About the LBM case Variational methods for binary, Gaussian or Poisson data arrays [Govaert & Nadif 03, Govaert & Nadif 08, Govaert & Nadif 10]; Bayesian framework and Gibbs sampling for binary and Gaussian data [Wyse & Friel 12]; SEM-Gibbs approach (for categorical data) [Keribin et al. 13].

32 Parameter estimation II Model selection The maximum likelihood is not available (thus neither AIC nor BIC); the ICL criterion is used instead [Daudin et al. 08, Keribin et al. 13]; MCMC approach to select the number of LBM groups [Wyse & Friel 12]. Node clustering Automatically performed by the previous algorithms.

33 Variational approximation principle I Likelihood decomposition
$\mathcal{L}_Y(\theta) := \log P(Y; \theta) = \log P(Y, Z; \theta) - \log P(Z \mid Y; \theta)$
and for any distribution $Q$ on $Z$,
$\mathcal{L}_Y(\theta) = \mathbb{E}_Q(\log P(Y, Z; \theta)) + \mathcal{H}(Q) + KL(Q \,\|\, P(Z \mid Y; \theta))$.
EM principle
E-step: maximise the quantity $\mathbb{E}_Q(\log P(Y, Z; \theta^{(t)})) + \mathcal{H}(Q)$ with respect to $Q$. This is equivalent to minimizing $KL(Q \,\|\, P(Z \mid Y; \theta^{(t)}))$ with respect to $Q$.
M-step: keeping now $Q$ fixed, maximize the quantity $\mathbb{E}_Q(\log P(Y, Z; \theta)) + \mathcal{H}(Q)$ with respect to $\theta$ and update the parameter value $\theta^{(t+1)}$ to this maximiser. This is equivalent to maximizing the conditional expectation $\mathbb{E}_Q(\log P(Y, Z; \theta))$ w.r.t. $\theta$.

34 Variational approximation principle II Variational EM
E-step: search for an optimal solution $Q^\star$ within a restricted class of distributions $\mathcal{Q}$, e.g. the class of factorized distributions $Q(Z) = \prod_{i=1}^n Q(Z_i)$:
$Q^\star = \operatorname*{argmin}_{Q \in \mathcal{Q}} KL(Q \,\|\, P(Z \mid Y; \theta^{(t)}))$.
M-step: unchanged, i.e. $\theta^{(t+1)} = \operatorname*{argmax}_\theta \mathbb{E}_{Q^\star}(\log P(Y, Z; \theta))$.
A consequence of $KL \ge 0$ is the lower bound
$\mathcal{L}_Y(\theta) \ge \mathbb{E}_Q(\log P(Y, Z; \theta)) + \mathcal{H}(Q)$,
so that the variational approximation consists in maximizing a lower bound on the log-likelihood. Why does it make sense?

35 Variational approximation in SBM
$\mathcal{Q} = \{Q : Q(Z) = \prod_{i=1}^n Q_i(Z_i)\}$ and $Q_i(Z_i) = \prod_k \tau_{ik}^{Z_{ik}}$;
The minimizer of $KL(Q \,\|\, P(Z \mid Y; \theta^{(t)}))$ in $\mathcal{Q}$ satisfies a fixed point relation
$\tau_{ik} \propto \pi_k \prod_{j \ne i} \prod_{k'} f_{kk'}(Y_{ij})^{\tau_{jk'}}$.
It's a mean field approximation. Once the $\tau_{ik}$ are computed, it's easy to do the M-step (see the sketch below).
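The following is a minimal sketch of the resulting variational EM for a binary SBM (plain numpy, dense O(n²K²) updates with no sparsity tricks; it is an illustrative implementation under these assumptions, not the reference code of the cited papers).

```python
# A minimal sketch (numpy, hypothetical setup) of mean-field variational EM for a
# binary SBM: the E-step iterates tau_ik ∝ pi_k prod_{j!=i, l} f_kl(Y_ij)^tau_jl
# with f_kl the Bernoulli(gamma_kl) density; the M-step re-estimates pi and gamma.
import numpy as np

def vem_sbm(Y, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    tau = rng.dirichlet(np.ones(K), size=n)       # variational parameters tau_ik
    mask = 1.0 - np.eye(n)                        # exclude self-pairs
    for _ in range(n_iter):
        # M-step: group proportions and connectivity probabilities
        pi = tau.mean(axis=0)
        gamma = np.clip((tau.T @ (Y * mask) @ tau) / (tau.T @ mask @ tau),
                        1e-6, 1 - 1e-6)
        # E-step: fixed-point update of tau, computed in log scale for stability
        logf = (Y * mask)[:, :, None, None] * np.log(gamma) \
             + ((1 - Y) * mask)[:, :, None, None] * np.log(1 - gamma)
        log_tau = np.log(pi)[None, :] + np.einsum('ijkl,jl->ik', logf, tau)
        log_tau -= log_tau.max(axis=1, keepdims=True)
        tau = np.exp(log_tau)
        tau /= tau.sum(axis=1, keepdims=True)
    return pi, gamma, tau

# Example usage on a simulated binary adjacency matrix Y (as in the earlier sketch):
# pi_hat, gamma_hat, tau = vem_sbm(Y, K=2); Z_hat = tau.argmax(axis=1)
```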

36 Binary or Weighted Affiliation SBM [Ambroise & Matias 10] I Models (binary/weighted)
$\{Z_i\}_{1 \le i \le n}$ i.i.d. latent vectors $Z_i = (Z_{i1}, \dots, Z_{iK}) \sim \mathcal{M}(1, \pi)$;
Conditional on the $\{Z_i\}$'s, the $Y_{ij}$ are independent;
Binary case: $Y_{ij} \sim \mathcal{B}(\gamma_{in})$ if $Z_i = Z_j$, and $Y_{ij} \sim \mathcal{B}(\gamma_{out})$ if $Z_i \ne Z_j$.
Weighted case: $Y_{ij} \sim p_{in} f(\cdot\,, \gamma_{in}) + (1 - p_{in})\,\delta_0(\cdot)$ if $Z_i = Z_j$, and $Y_{ij} \sim p_{out} f(\cdot\,, \gamma_{out}) + (1 - p_{out})\,\delta_0(\cdot)$ if $Z_i \ne Z_j$.

37 Binary or Weighted Affiliation SBM [Ambroise & Matias 10] II Composite likelihood idea - Weighted case
The present edges $Y_{ij} \ne 0$ follow a mixture distribution:
$Y_{ij} \mid Y_{ij} \ne 0 \sim \{\sum_{q=1}^{Q} \pi_q^2\, p_{in}\}\, f(Y_{ij}; \gamma_{in}) + \{\sum_{q \ne l} \pi_q \pi_l\, p_{out}\}\, f(Y_{ij}; \gamma_{out})$ (up to normalization).
Parameters of a mixture of two continuous distributions are in general identifiable. We form a composite log-likelihood
$\mathcal{L}^c_X(\theta) = \frac{1}{n(n-1)} \sum_{i<j} \log[\alpha_{in} f(Y_{ij}; \gamma_{in}) + \alpha_{out} f(Y_{ij}; \gamma_{out})]$.
If this converges to $\mathbb{E}[\log(\alpha_{in} f(Y_{ij}; \gamma_{in}) + \alpha_{out} f(Y_{ij}; \gamma_{out}))]$, then we can estimate the parameters with $\hat\theta = \operatorname*{argmax}_\theta \mathcal{L}^c_X(\theta)$.
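A minimal sketch of this composite likelihood idea (numpy + scipy; it assumes Gaussian weight densities $f$ with unit variance, restricts the criterion to the non-zero weights, and uses arbitrary starting values, so it illustrates the principle rather than reproducing the cited procedure):

```python
# A minimal sketch (numpy + scipy, hypothetical Gaussian densities) of the composite
# likelihood idea: fit a two-component mixture to the non-zero edge weights.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_composite_loglik(params, w):
    a, mu_in, mu_out = params                       # a = logit of the mixture weight
    alpha_in = 1.0 / (1.0 + np.exp(-a))
    dens = alpha_in * norm.pdf(w, mu_in, 1.0) + (1 - alpha_in) * norm.pdf(w, mu_out, 1.0)
    return -np.mean(np.log(dens + 1e-300))

def fit_weighted_affiliation(Y):
    # Y is assumed to be a weighted adjacency matrix, e.g. simulated as above
    w = Y[np.triu_indices_from(Y, k=1)]
    w = w[w != 0]                                   # keep the present edges only
    res = minimize(neg_composite_loglik, x0=np.array([0.0, 1.0, -1.0]),
                   args=(w,), method="Nelder-Mead")
    return res.x                                    # (logit alpha_in, mu_in, mu_out)
```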

38 Binary or Weighted Affiliation SBM [Ambroise & Matias 10] III Moment methods idea - Binary case
The same idea does not apply directly in the binary case, because $Y_{ij}$ is a mixture of Bernoulli distributions, which is not identifiable! However, mixtures of 3-variate Bernoulli distributions are identifiable (in many cases). Develop the same methodology with
$\mathcal{L}^c_X(\pi, \alpha, \beta) = \frac{1}{n(n-1)(n-2)} \sum_{(i,j,k) \in I_3} \log P(Y_{ij}, Y_{ik}, Y_{jk})$.
For the two approaches to be valid, we need to know whether the composite log-likelihoods converge.

39 Binary or Weighted Affiliation SBM [Ambroise & Matias 10] IV Notation
$\mathbf{i} = (i_1, \dots, i_k)$ a $k$-tuple of nodes,
$Y^{\mathbf{i}} = (Y_{i_1 i_2}, \dots, Y_{i_1 i_k}, Y_{i_2 i_3}, \dots, Y_{i_{k-1} i_k})$ the vector of $p = \binom{k}{2}$ r.v. induced by the nodes $\mathbf{i}$,
$g : \mathcal{Y}^p \to \mathbb{R}^s$ a function, $\hat m_g = \frac{(n-k)!}{n!} \sum_{\mathbf{i} \in I_k} g(Y^{\mathbf{i}})$ and $m_g = \mathbb{E}(g(Y^{(1,\dots,k)}))$.
Theorem For any $k, s \ge 1$ and $p = \binom{k}{2}$ and any measurable function $g : \mathcal{Y}^p \to \mathbb{R}^s$ such that $\mathbb{E}(\|g(Y^{(1,\dots,k)})\|^2) < +\infty$, the estimator $\hat m_g$ is consistent, $\hat m_g \to m_g$ almost surely as $n \to \infty$, as well as asymptotically normal, $\sqrt{n}(\hat m_g - m_g) \to \mathcal{N}(0, \Sigma_g)$ as $n \to \infty$.

40 Outline Introduction to random graphs State space models for random graphs Stochastic Block Model (binary or weighted graphs) Parameter estimation and node clustering Identifiability Parameter estimation Clustering convergence results

41 Convergence issues Why does the variational approximation work? The variational approximation appears to be efficient, both for LBM and SBM. Variational approximation does not converge unless the true posterior $p(Z \mid Y; \gamma)$ is degenerate [Gunawardana & Byrne 05]. Remaining issues What is the (asymptotic) behaviour of the groups posterior distribution? Is it degenerate? Is the variational approximation somehow equivalent to the EM approach? Does maximum likelihood converge in this setting anyway?

42 Maximum likelihood and variational approach Results from [Celisse et al. 12] in the SBM case Variational EM is asymptotically equivalent to classical EM for SBM. Maximum likelihood is convergent in this setup (as the sample size increases).

43 Convergence of the groups posterior distribution (LBM or SBM) Results from [Mariadassou & Matias 13] In general, the groups posterior distribution converges to a Dirac mass (when $n, m \to \infty$). However, when there exist equivalent configurations (= node groups inducing the same likelihood), the posterior converges to a mixture of Dirac masses located at these configurations. In some cases (in particular affiliation), the number of equivalent configurations is larger than the number of label switching configurations. When there are equivalent configurations, the posterior converges to a Dirac mass at the configuration with the largest prior.

44 Equivalent configurations in SBM or LBM
Label switching corresponds to $P_{(\sigma(\pi), \sigma(\gamma))} = P_{(\pi, \gamma)}$ for any permutation $\sigma$ of $\{1, \dots, K\}$;
In classical mixtures, identifiability requires that $\gamma_q \ne \gamma_{q'}$ for any $q \ne q'$;
In SBM or LBM, one may have $\gamma_{ql} = \gamma_{q'l}$ for some $q \ne q'$;
Then, if the matrix $\gamma$ has symmetries, we may have $\sigma(\gamma) = \gamma$ with the model still identifiable if $\pi$ has non-equal entries. Namely $P_{(\pi, \sigma(\gamma))} = P_{(\pi, \gamma)}$;
As a consequence, the ratio between the posterior distributions at $(Z^n, W^m)$ and $\sigma^{-1}(Z^n, W^m)$ does not depend on the data:
$P_{(\pi,\gamma)}(Z^n, W^m \mid Y_{n,m}) \propto \pi(Z^n, W^m) P_\gamma(Y_{n,m} \mid Z^n, W^m) = \pi(Z^n, W^m) P_{\sigma(\gamma)}(Y_{n,m} \mid Z^n, W^m) \propto \frac{\pi(Z^n, W^m)}{\pi(\sigma^{-1}(Z^n, W^m))}\, P_{(\pi,\gamma)}(\sigma^{-1}(Z^n, W^m) \mid Y_{n,m})$.

45 Conclusions Modeling data SBM are natural and powerful models for handling network data. Many variants exist, with overlapping groups or covariates. Data may be binary or weighted, sparse or not, directed or not...; natural generalisation of SBM for matrix data: LBM are handled in the same way. Model-based clustering of the nodes of the graph (or the rows/columns of the array), which encompasses community detection approaches. Theoretical results Convergence results are difficult to obtain but some exist. Variational EM approximations provide good practical results but tend to depend on the initialisation: there is room for improvement!

46 References I
[Airoldi et al. 08] E.M. Airoldi, D.M. Blei, S.E. Fienberg and E.P. Xing. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9, 2008.
[Allman et al. 09] E.S. Allman, C. Matias and J.A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Ann. Statist., 37(6A), 2009.
[Allman et al. 11] E.S. Allman, C. Matias and J.A. Rhodes. Parameter identifiability in a class of random graph mixture models. J. Statist. Planning and Inference, 141(5), 2011.

47 References II
[Ambroise & Matias 10] C. Ambroise and C. Matias. New consistent and asymptotically normal estimators for random graph mixture models. Journal of the Royal Statistical Society: Series B, 74(1):3-35.
[Celisse et al. 12] A. Celisse, J.-J. Daudin and L. Pierre. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Statist., 6, 2012.
[Daudin et al. 08] J.-J. Daudin, F. Picard and S. Robin. A mixture model for random graphs. Stat. Comput., 18(2), 2008.

48 References III
[Gassiat et al. 13] E. Gassiat, A. Cleynen and S. Robin. Finite state space non parametric hidden Markov models are in general identifiable. arXiv preprint, 2013.
[Govaert & Nadif 03] G. Govaert and M. Nadif. Clustering with block mixture models. Pattern Recognition, 36(2), 2003.
[Govaert & Nadif 08] G. Govaert and M. Nadif. Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics and Data Analysis, 52(6), 2008.

49 References IV
[Govaert & Nadif 10] G. Govaert and M. Nadif. Latent block model for contingency table. Communications in Statistics - Theory and Methods, 39(3), 2010.
[Gunawardana & Byrne 05] Gunawardana and Byrne. Convergence theorems for generalized alternating minimization procedures. JMLR, 6, 2005.
[Handcock et al. 07] M.S. Handcock, A.E. Raftery and J.M. Tantrum. Model-based clustering for social networks. J. R. Statist. Soc. A, 170(2), 2007.
[Hoff et al. 02] P.D. Hoff, A.E. Raftery and M.S. Handcock. Latent space approaches to social network analysis. J. Amer. Statist. Assoc., 97(460), 2002.

50 References V
[Keribin et al. 13] C. Keribin, V. Brault, G. Celeux and G. Govaert. Estimation and selection for the latent block model on categorical data. INRIA Research Report 8264, 2013.
[Latouche et al. 11] P. Latouche, E. Birmelé and C. Ambroise. Overlapping stochastic block models with application to the French political blogosphere. Annals of Applied Statistics, 5(1), 2011.
[Mariadassou & Matias 13] M. Mariadassou and C. Matias. Convergence of the groups posterior distribution in latent or stochastic block models. To appear in Bernoulli. HAL preprint, 2013.

51 References VI
[Szemerédi 78] E. Szemerédi. Regular partitions of graphs. Problèmes combinatoires et théorie des graphes (Colloq. Internat. CNRS, Univ. Orsay, Orsay, 1976), Colloq. Internat. CNRS, 260, 1978.
[Wyse & Friel 12] J. Wyse and N. Friel. Block clustering with collapsed latent block models. Stat. Comput., 22, 2012.
[Zanghi et al. 10b] H. Zanghi, S. Volant and C. Ambroise. Clustering based on random graph model embedding vertex features. Pattern Recognition Letters, 31(9), 2010.
