Uncovering structure in biological networks: A model-based approach


1 Uncovering structure in biological networks: A model-based approach
J-J Daudin, F. Picard, S. Robin, M. Mariadassou
UMR INA-PG / ENGREF / INRA, Paris, Mathématique et Informatique Appliquées
Statistics for Biological Sequences group (including E. Birmelé, C. Matias, S. Schbath)
Examples of networks.
Social: who knows whom?
Internet: connections between servers or web pages.
Biological: which protein interacts with which? Which gene regulates which?

2 Uncovering structure in networks: A simple example

3 Random graphs
Notation and definition. Given a set of n vertices (i = 1..n), $X_{ij}$ indicates the presence or absence of a (non-oriented) edge between vertices i and j: $X_{ij} = X_{ji} = \mathbb{1}\{i \leftrightarrow j\}$, $X_{ii} = 0$. The random graph is defined by the joint distribution of all the $\{X_{ij}\}_{i,j}$.
Typical characteristics.
Degree of the vertices: $K_i = \sum_{j \neq i} X_{ij}$.
Clustering coefficient: $c = \Pr\{X_{jk} = 1 \mid X_{ij} = X_{ik} = 1\} = \Pr\{\nabla \mid V\}$.
Diameter: longest of the shortest paths between two vertices.
Number of occurrences of some motifs: distribution of the number of triangles ($\nabla$), V's, stars, etc.
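The characteristics listed above can be computed directly from a symmetric 0/1 adjacency matrix. What follows is a minimal sketch, assuming NumPy; the function names `degrees` and `clustering_coefficient` are illustrative and do not come from the slides. The clustering coefficient is estimated as the share of V's (pairs of edges sharing a vertex) that are closed into a triangle, which is the empirical counterpart of $\Pr\{\nabla \mid V\}$.

```python
import numpy as np

def degrees(X):
    """Degree K_i = sum over j != i of X_ij (X symmetric, zero diagonal)."""
    return X.sum(axis=1)

def clustering_coefficient(X):
    """Empirical Pr{X_jk = 1 | X_ij = X_ik = 1}: closed V's over all V's."""
    K = degrees(X)
    n_v = (K * (K - 1) // 2).sum()           # number of V's centred at each vertex
    n_triangles = np.trace(X @ X @ X) // 6   # each triangle gives 6 closed walks
    # Every triangle closes three V's.
    return 3 * n_triangles / n_v if n_v > 0 else float("nan")

# Toy graph: a triangle (0, 1, 2) plus a pendant vertex 3 attached to vertex 2.
X = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
K = degrees(X)
c = clustering_coefficient(X)
```

On this toy graph there is one triangle and five V's, so three of the five V's are closed.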

4 Erdös-Rényi mixture for graphs (ERMG)
Erdös-Rényi (ER) model. $\{X_{ij}\}$ i.i.d., $\Pr\{i \leftrightarrow j\} = \pi$, so that $K_i \sim \mathcal{B}(n-1, \pi) \approx \mathcal{P}[(n-1)\pi]$ and $c = \pi$. This model fits many real-world networks poorly, probably because of some heterogeneity between the vertices.
Mixture population of vertices. We suppose that the vertices belong to Q groups: $\alpha_q = \Pr\{i \in q\}$, $Z_{iq} = \mathbb{1}\{i \in q\}$.
Conditional distribution of the edges. The edges $\{X_{ij}\}$ are conditionally independent given the groups of the vertices: $X_{ij} \mid \{i \in q, j \in l\} \sim \mathcal{B}(\pi_{ql})$. Here $\pi_{ql} = \pi_{lq}$ is the connection probability between groups q and l. A high value of $\pi_{ql}$ reveals a preferential connectivity between groups q and l.
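The two-stage definition above (draw a group for each vertex, then draw each edge given the two groups) translates into a short simulation. This is a minimal sketch, assuming NumPy; `simulate_ermg` and the parameter values in the usage lines are illustrative choices, not taken from the slides.

```python
import numpy as np

def simulate_ermg(n, alpha, pi, rng):
    """Draw group labels and a symmetric 0/1 adjacency matrix from an ERMG."""
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    z = rng.choice(len(alpha), size=n, p=alpha)        # group of each vertex
    probs = pi[z[:, None], z[None, :]]                 # pi_ql for every pair (i, j)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # one Bernoulli draw per pair i < j
    X = (upper | upper.T).astype(int)                  # symmetrize; diagonal stays 0
    return z, X

# Hypothetical example: two equally likely groups, dense inside and sparse between.
rng = np.random.default_rng(0)
z, X = simulate_ermg(200, alpha=[0.5, 0.5], pi=[[0.3, 0.02], [0.02, 0.3]], rng=rng)
```

Setting Q = 1 (a single group with `alpha=[1.0]`) recovers the ER model.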

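5 Examples
Each entry gives the network, Q, $\pi$ and the clustering coefficient c.
Random: Q = 1; $\pi = p$; $c = p$.
Independent model (product connectivity): Q = 2; $\pi = \begin{pmatrix} a^2 & ab \\ ab & b^2 \end{pmatrix}$; $c = \frac{(a^2 + b^2)^2}{(a + b)^2}$.
Stars: Q = 4; $\pi$ = ( ).
Clusters (affiliation networks): Q = 2; $\pi = \begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 1 \end{pmatrix}$; $c = \frac{1 + 3\varepsilon^2}{(1 + \varepsilon)^2}$.
Scale free network model (Barabasi & Albert, 99) can also be expressed in terms of ERMG.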

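6 Some properties of the ERMG model
Conditional distribution of the degrees: denoting $\pi_q = \sum_l \alpha_l \pi_{ql}$ and $\lambda_q = (n-1)\pi_q$, we have $K_i \mid \{i \in q\} \sim \mathcal{B}(n-1, \pi_q) \approx \mathcal{P}(\lambda_q)$.
Marginal distribution of the degrees: we get a Poisson mixture, $K_i \sim \sum_q \alpha_q \mathcal{B}(n-1, \pi_q) \approx \sum_q \alpha_q \mathcal{P}(\lambda_q)$.
Between-group connectivity. Let $A_{ql}$ denote the number of edges between groups q and l. Its expectation in the ERMG model is $E(A_{ql}) = n(n-1)\,\alpha_q \alpha_l \pi_{ql}$.
Clustering coefficient: $c = \Pr\{\nabla \mid V\} = \Pr\{\nabla\} / \Pr\{V\} = \frac{\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql} \pi_{qm} \pi_{lm}}{\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql} \pi_{qm}}$.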
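The closed-form expressions above are easy to evaluate numerically for given $(\alpha, \pi)$. The sketch below assumes NumPy; `ermg_clustering` and `ermg_expected_degrees` are illustrative names, and the two-group affiliation parameters are a hypothetical example with equal group proportions.

```python
import numpy as np

def ermg_clustering(alpha, pi):
    """c = sum_{q,l,m} a_q a_l a_m pi_ql pi_qm pi_lm / sum_{q,l,m} a_q a_l a_m pi_ql pi_qm."""
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    num = np.einsum("q,l,m,ql,qm,lm->", alpha, alpha, alpha, pi, pi, pi)
    den = np.einsum("q,l,m,ql,qm->", alpha, alpha, alpha, pi, pi)
    return num / den

def ermg_expected_degrees(n, alpha, pi):
    """lambda_q = (n - 1) * sum_l alpha_l pi_ql, the mean degree within group q."""
    return (n - 1) * (np.asarray(pi, dtype=float) @ np.asarray(alpha, dtype=float))

# Hypothetical affiliation network with two equally likely groups.
eps = 0.1
alpha = [0.5, 0.5]
pi = [[1.0, eps], [eps, 1.0]]
c = ermg_clustering(alpha, pi)
lam = ermg_expected_degrees(101, alpha, pi)
```

With a single group and $\pi = p$ the ratio reduces to $p^3 / p^2 = p$, the ER value.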

7 Estimation of the parameters
Denote X the set of edges and Z the set of labels. The aim is to maximize the log-likelihood L(X), which cannot be calculated in practice because it requires summing over all possible Z.
We actually maximize $J(R_X) = L(X) - KL[R_X(Z), \Pr(Z \mid X)] = H(R_X) + \sum_Z R_X(Z)\, L(X, Z)$, where $H(R_X)$ is the entropy of $R_X$: $H(R_X) = -\sum_Z R_X(Z) \ln R_X(Z)$.
The distribution $R_X(Z)$ is chosen to be the best approximation of $\Pr(Z \mid X)$ in terms of Kullback-Leibler (KL) divergence: $R_X(Z) = \Pr(Z \mid X) \Rightarrow J(R_X) = L(X)$.
Algorithm. We alternately optimize $J(R_X)$ with respect to (a) $R_X$ and (b) the parameter estimates $(\hat\alpha, \hat\pi)$. Neither step can decrease $J(R_X)$, which is bounded above since $J(R_X) \le L(X) \le 0$, so the proposed algorithm converges.
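The alternating scheme can be sketched in a few lines under one extra assumption that the slide does not spell out: $R_X$ is restricted to factorized distributions, $R_X(Z) = \prod_i \prod_q \tau_{iq}^{Z_{iq}}$. Under that assumption, step (b) has closed-form updates and step (a) becomes a fixed-point relation on $\tau$. The code below is a hedged illustration, assuming NumPy; the name `variational_em`, the random initialization and the fixed number of passes are all illustrative choices. Updating every $\tau_i$ simultaneously, as done here for brevity, is a simplification and does not by itself guarantee that J increases at each pass.

```python
import numpy as np

def variational_em(X, Q, n_iter=50, seed=0, eps=1e-10):
    """Alternate closed-form parameter updates and one fixed-point pass on tau."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    tau = rng.dirichlet(np.ones(Q), size=n)   # random soft assignment, rows sum to 1
    off = 1.0 - np.eye(n)                     # mask that excludes the pairs i == j
    for _ in range(n_iter):
        # (b) parameters given tau
        alpha = tau.mean(axis=0)
        num = tau.T @ (X * off) @ tau         # expected number of edges between groups
        den = tau.T @ off @ tau               # expected number of pairs between groups
        pi = np.clip(num / np.maximum(den, eps), eps, 1 - eps)
        # (a) tau given parameters:
        # log tau_iq = log alpha_q + sum_{j != i} sum_l tau_jl log b(X_ij; pi_ql) + const
        log_tau = (np.log(np.maximum(alpha, eps))
                   + (X * off) @ tau @ np.log(pi).T
                   + ((1 - X) * off) @ tau @ np.log(1 - pi).T)
        log_tau -= log_tau.max(axis=1, keepdims=True)   # numerical stabilization
        tau = np.exp(log_tau)
        tau /= tau.sum(axis=1, keepdims=True)
    # alpha and pi are the estimates from the last parameter update.
    return tau, alpha, pi

# Toy graph: two disjoint cliques of five vertices each.
X = np.zeros((10, 10), dtype=int)
X[:5, :5] = 1
X[5:, 5:] = 1
np.fill_diagonal(X, 0)
tau, alpha, pi = variational_em(X, Q=2)
```

In practice several random starts are usually compared, since the result of this kind of scheme depends on the initialization.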

8 Why not use the standard E-M algorithm?
E-M is very popular for fitting mixture models. It requires calculating the conditional distribution $\Pr\{Z \mid X\}$, which is intractable here.
Dependency graph (oriented): edge $X_{ij}$ only depends on its two parents $Z_i$ and $Z_j$ (for example, $X_{12}$ depends on $Z_1$ and $Z_2$).
Moral graph (parents are married): conditional on the edges, the labels $Z_i$ all depend on each other.
[Figure: dependency graph and moral graph for the labels $Z_1, Z_2, Z_3$ and the edges $X_{12}, X_{13}, X_{23}$.]
In the moral graph, all the labels are actually neighbours.

9 Choice of the number of groups
We propose a heuristic penalized likelihood criterion inspired by BIC.
The completed log-likelihood $L(X, \hat Z)$ is the sum of
$\sum_i \sum_q \tau_{iq} \log \alpha_q$, which deals with (Q - 1) independent proportions $\alpha_q$ and involves n terms;
$\sum_i \sum_{j>i} \sum_q \sum_l \tau_{iq} \tau_{jl} \log b(X_{ij}; \pi_{ql})$, which deals with Q(Q + 1)/2 probabilities $\pi_{ql}$ and involves n(n - 1)/2 terms.
We propose the following heuristic criterion: $-2 L(X, \hat Z) + (Q - 1) \log n + \frac{Q(Q + 1)}{2} \log\left[\frac{n(n - 1)}{2}\right]$.
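The criterion can be evaluated once $\tau$, $\alpha$ and $\pi$ are available. The sketch below assumes NumPy and reads the criterion in the usual BIC form, $-2 L(X, \hat Z)$ plus one penalty per block of parameters, each scaled by the logarithm of the number of terms it involves; the factor $-2$ and the divisions by 2 are assumptions where the slide text is incomplete. The name `pseudo_bic` is illustrative, and $b(x; \pi)$ is taken to be the Bernoulli probability $\pi^x (1 - \pi)^{1 - x}$.

```python
import numpy as np

def pseudo_bic(X, tau, alpha, pi, eps=1e-10):
    """-2 * completed log-likelihood + (Q-1) log n + Q(Q+1)/2 * log(n(n-1)/2)."""
    n, Q = tau.shape
    alpha = np.clip(alpha, eps, 1.0)
    pi = np.clip(pi, eps, 1 - eps)
    # sum_i sum_q tau_iq log alpha_q
    term_labels = (tau * np.log(alpha)).sum()
    # sum_{i<j} sum_{q,l} tau_iq tau_jl [X_ij log pi_ql + (1 - X_ij) log(1 - pi_ql)]
    upper = np.triu(np.ones((n, n)), k=1)     # keeps the pairs with j > i
    term_edges = ((X * upper) * (tau @ np.log(pi) @ tau.T)).sum() \
        + (((1 - X) * upper) * (tau @ np.log(1 - pi) @ tau.T)).sum()
    penalty = (Q - 1) * np.log(n) + Q * (Q + 1) / 2 * np.log(n * (n - 1) / 2)
    return -2 * (term_labels + term_edges) + penalty

# Toy check with Q = 1 (ER model) on a 4-vertex graph with 4 edges out of 6 pairs.
X = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
crit = pseudo_bic(X, tau=np.ones((4, 1)), alpha=[1.0], pi=[[4 / 6]])
```

With this sign convention, smaller values indicate a better trade-off between fit and number of parameters, so one would compare `crit` across candidate values of Q.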

10 Application to E. coli reaction network
n = 65 vertices (reactions) and 1 78 edges. Reactions i and j are connected if the product of i is the substrate of j (or conversely). Data provided by V. Lacroix and M.-F. Sagot (INRIA Hélix).
Number of groups. Pseudo-BIC selects Q = 1.
Group proportions. [Table: $\alpha_q$ (%).] Many small groups actually correspond to cliques or pseudo-cliques.

11 Dot-plot representation of the graph
Biological interpretation: Groups 1 to gather reactions that all involve the same compound, either as a substrate or as a product. A compound (chorismate, pyruvate, ATP, etc.) can be associated with each group.
[Figure: posterior probabilities $\tau_{iq}$.]

12 Zoom (bottom left)
Submatrix of $\pi_{ql}$. Groups 1 and 16 both involve pyruvate, but only group 1 also involves CO2 (group 7) and acetyl-CoA (group 1).
Vertex degrees $K_i$. Mean degree in the last group: K 1 =

13 Distribution of the degrees
According to the ERMG, the degrees have a Poisson mixture distribution.
[Figure: histogram + mixture distribution; P-P plot.]
Clustering coefficient. Compared values: empirical, ERMG (Q = 6), ERMG (Q = 1), ER (Q = 1).

14 Extension: Motif Statistics
Joint work with E. Birmelé, C. Matias (Evry), E. Roquain, S. Schbath (Jouy-en-Josas).
The organisation of a network can be characterized by the frequency of some specific motifs: chain, loop, hub.
For a given motif m, we denote its count $N_m(G)$ = number of occurrences of the motif m in the graph G.
Question: In a graph G of size n =, the motif m is observed $N_m^{obs}$ = times. Is it exceptional?
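One simple way to make the question concrete is a simulation-based comparison, which is a swapped-in illustration and not the approach of the slides: count the motif in the observed graph, simulate graphs from a reference model, and report how often the simulated count reaches the observed one. The sketch below assumes NumPy, uses the triangle as the motif and the ER model with matched edge density as the reference; all function names are illustrative.

```python
import numpy as np

def count_triangles(X):
    """Number of triangles: trace(X^3) counts each triangle 6 times."""
    return int(np.trace(X @ X @ X)) // 6

def simulate_er(n, p, rng):
    """Symmetric 0/1 adjacency matrix with i.i.d. edges of probability p."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(int)

def empirical_p_value(X, n_sim=500, seed=0):
    """Share of simulated ER graphs with at least as many triangles as X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    p_hat = X.sum() / (n * (n - 1))     # observed edge density
    n_obs = count_triangles(X)
    sims = [count_triangles(simulate_er(n, p_hat, rng)) for _ in range(n_sim)]
    # The +1 terms keep the estimate strictly positive.
    return (1 + sum(s >= n_obs for s in sims)) / (1 + n_sim)

# Toy graph: a triangle (0, 1, 2) plus a pendant vertex 3 attached to vertex 2.
X = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
n_obs = count_triangles(X)
p_val = empirical_p_value(X)
```

The reference model matters: a count that looks exceptional under ER may be unremarkable under a model that accounts for heterogeneity between vertices, such as the ERMG.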

15 Conclusions
Past. The ERMG model is a flexible generalization of the ER model and a promising alternative to the scale-free model. It seems to fit several real-world networks well. It is properly defined, so its properties can be properly studied.
Future. Study the probabilistic properties of the ERMG model (diameter, sampling properties, distribution of the occurrences of specific motifs, etc.). Study the statistical properties of the estimates. Derive a relevant criterion to select the number of groups. Extension to valued graphs: $X_{ij}$ not only 0/1, but some measure of the connection intensity.
