Network diffusion-based analysis of high-throughput data for the detection of differentially enriched modules

Similar documents
DOTTORATO DI RICERCA IN FISICA

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

BIOINFORMATICS ORIGINAL PAPER

ORIE 6334 Spectral Graph Theory September 22, Lecture 11

Computational Network Biology Biostatistics & Medical Informatics 826 Fall 2018

Protein Complex Identification by Supervised Graph Clustering

Supplementary Material

The Role of Network Science in Biology and Medicine. Tiffany J. Callahan Computational Bioscience Program Hunter/Kahn Labs

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search

6.842 Randomness and Computation March 3, Lecture 8

A novel and efficient algorithm for de novo discovery of mutated driver pathways in cancer

Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31

Predicting Protein Functions and Domain Interactions from Protein Interactions

Differential Modeling for Cancer Microarray Data

CS6220: DATA MINING TECHNIQUES

A new approach for biomarker detection using fusion networks

Dynamic modeling and analysis of cancer cellular network motifs

CS6220: DATA MINING TECHNIQUES

Robust Community Detection Methods with Resolution Parameter for Complex Detection in Protein Protein Interaction Networks

CSCI1950 Z Computa3onal Methods for Biology Lecture 24. Ben Raphael April 29, hgp://cs.brown.edu/courses/csci1950 z/ Network Mo3fs

Predicting causal effects in large-scale systems from observational data

Basic modeling approaches for biological systems. Mahesh Bule

1. The growth of a cancerous tumor can be modeled by the Gompertz Law: dn. = an ln ( )

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Data Mining and Analysis: Fundamental Concepts and Algorithms

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Comparative Network Analysis

Spectral Clustering. Guokun Lai 2016/10

Network alignment and querying

Kristina Lerman USC Information Sciences Institute

SUPPLEMENTARY INFORMATION

An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules

ECEN 689 Special Topics in Data Science for Communications Networks

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

Biological Networks Analysis

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. XX, NO. XX, XX 20XX 1

Analysis and Simulation of Biological Systems

Evolutionary dynamics of cancer: A spatial model of cancer initiation

Machine Learning for Data Science (CS4786) Lecture 24

Towards Detecting Protein Complexes from Protein Interaction Data

Gene expression microarray technology measures the expression levels of thousands of genes. Research Article

Network Alignment 858L

Statistical Methods for Analysis of Genetic Data

Detecting temporal protein complexes from dynamic protein-protein interaction networks

Networks & pathways. Hedi Peterson MTAT Bioinformatics

Research Statement on Statistics Jun Zhang

Machine Learning CPSC 340. Tutorial 12

Prediction of double gene knockout measurements

Nested Effects Models at Work

Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving. 22 Element-wise Sampling of Graphs and Linear Equation Solving

LAPLACIAN MATRIX AND APPLICATIONS

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

BIOINFORMATICS. Improved Network-based Identification of Protein Orthologs. Nir Yosef a,, Roded Sharan a and William Stafford Noble b

CMPUT651: Differential Privacy

Principal component analysis (PCA) for clustering gene expression data

Analyzing Microarray Time course Genome wide Data

Data Mining Recitation Notes Week 3

Charalampos (Babis) E. Tsourakakis WABI 2013, France WABI '13 1

Synchronization Transitions in Complex Networks

Protein function prediction via analysis of interactomes

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Diffusion and random walks on graphs

Graph Wavelets to Analyze Genomic Data with Biological Networks

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

Supplementary Figures.

CS Homework 3. October 15, 2009

A Multiobjective GO based Approach to Protein Complex Detection

Identifying Signaling Pathways

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Preface. Contributors

Supplementary Tables and Figures

Bioinformatics: Network Analysis

A Geometric Interpretation of Gene Co-Expression Network Analysis. Steve Horvath, Jun Dong

A Study of Network-based Kernel Methods on Protein-Protein Interaction for Protein Functions Prediction

Lecture: Local Spectral Methods (1 of 4)

FROM UNCERTAIN PROTEIN INTERACTION NETWORKS TO SIGNALING PATHWAYS THROUGH INTENSIVE COLOR CODING

Time varying networks and the weakness of strong ties

LOCAL SEARCH. Today. Reading AIMA Chapter , Goals Local search algorithms. Introduce adversarial search 1/31/14

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Algorithms for the Stabilization of a Leader-Follower formation described via complex Laplacian

Biological networks CS449 BIOINFORMATICS

MTopGO: a tool for module identification in PPI Networks

Learning in Bayesian Networks

Noisy PPI Data: Alarmingly High False Positive and False Negative Rates

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

Predictive Modeling of Signaling Crosstalk... Model execution and model checking can be used to test a biological hypothesis

REVIEW: Waves on a String

Systems biology and biological networks

Functional Characterization and Topological Modularity of Molecular Interaction Networks

Statistical regimes of random laser fluctuations

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

PHP 2510 Probability Part 3. Calculate probability by counting (cont ) Conditional probability. Law of total probability. Bayes Theorem.

1. Background: The SVD and the best basis (questions selected from Ch. 6- Can you fill in the exercises?)

Quantum walk algorithms

A general co-expression network-based approach to gene expression analysis: comparison and applications

Applied Machine Learning Annalisa Marsico

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Web Structure Mining Nodes, Links and Influence

Transcription:

Network diffusion-based analysis of high-throughput data for the detection of differentially enriched modules Matteo Bersanelli 1+, Ettore Mosca 2+, Daniel Remondini 1, Gastone Castellani 1 and Luciano Milanesi 2 1 Department of Physics and Astronomy, Universita di Bologna, Bologna, via B. Pichat 6/2, 40127, Italy 2 Institute of Biomedical Technologies-CNR, Segrate(MI), via Fratelli Cervi 93, 20090, Italy ettore.mosca@itb.cnr.it + these authors contributed equally to this work Supplementary Notes and Figures Supplementary Note 2 Supplementary Figure S1: Network resampling p value 5 Supplementary Figure S2: The choice of ɛ in PRAD SM data 6 Supplementary Figure S3: The choice of ɛ in PRAD GE data 7 Supplementary Figure S4: Comparison of network-based and network-free quantities on FP60 PPIs. 8 Supplementary Figure S5: Comparison of network-based and network-free quantities on GHIASSIAN PPIs. 9 Supplementary Figure S6: Comparison of network-based and network-free quantities on NCBI PPIs. 10 Supplementary Figure S7: Comparison of network-based and network-free quantities on HI PPIs. 11 Supplementary Figure S8: Comparison with other diffusion-based methods 12 Supplementary Figure S9: Relationship between ω and h 13 1

Supplementary Note Network diffusion We are interested in studying the the stationary distributions X dependence on the altered nodes that for each sample are summarized by the vector X 0 where the i-th component is 1 if the i-th molecular entity is altered and 0 otherwise. We consider the network propagation equation X t+1 = αw X t + (1 α)x 0 (1) In equation (1) during each iteration each node receives the information from its neighbors, and also retains its initial information and self-reinforcement is avoided. Moreover the information is spread symmetrically since W is a symmetric matrix. The algorithm convergence is demonstrated using the power extension method; we set X 0 = X(0) and we apply some iterations: X 1 = αw X 0 + (1 α)x 0, X 2 = αw X 1 + (1 α)x 0 Iterating this procedure at step t we get: = αw (αw X 0 + (1 α)x 0) + (1 α)x 0 = (αw ) 2 X 0 + (1 α)(αw + I)X 0 = (αw ) 2 X 0 + (1 α)((αw ) + (αw ) 0 )X 0 t 1 X t = (αw ) t X 0 + (1 α) (αw ) i X 0, Since 0 < α < 1 and the eigenvalues of W are in [ 1; 1], when we take the limit for t we get: i=0 X = (1 α)(i αw ) 1 X 0 (2) This method is successfully exploited by Hofree et al. [1] in their network based approach applied to somatic mutation profiles. We now try to give a physical interpretation of such a diffusive algorithm in order to be able to inherit some helpful concepts from an associated physical model. Physical model The network propagation algorithm can be interpreted as the discrete implementation of a continuous linear dynamical system; we define L α = I αw the symmetrically normalized Laplacian matrix with the perturbation parameter α; and consider the following system of ODEs written in vector form: Ẋ = L αx + (1 α)x 0 (3) We can rewrite equation (3) by defining a new sink parameter γ > 0 so that γ = 1 α α after rescaling time by the parameter α: so that, Ẋ = (L + γi) X + γx 0 (4) where L = I W is the symmetrically normalized Laplacian matrix. Equation (4) is essentially the physical model cited by Vandin [3] for defining a gene prioritization algorithm on the basis of the work of Qi et al. [2]; Qi defines a continuous-time model for the distribution of a hypothetical fluid in the network. All nodes contain no fluid initially. Query nodes are then selected to serve as sources, where the fluid is pumped in at a constant rate. Fluid diffuses from node to node through the network according to the edge connections. The source imput is balanced by fluid loss out of each node at the constant first order rate γ. The stationary solution of the dynamical system is of course equation (2) and, expressed in terms of the constant sink parameter γ: X = γ(l + γi) 1 X 0 (5) 2

The choice of α and γ The choice of parameter α (or equivalently γ) influences the behavior of the diffusive algorithm since α controls how much information is retained in the nodes versus how much tends to be spread in the network. From a physical point of view it is reasonable to assume that α > 0.5, in order to make network topology relevant. Throughout the analysis of several real or toy datasets we find that the importance of the choice of a specific value of 0.5 α < 1 is somehow negligible, in the sense that the results are very similar for different choices of α. However qualitatively it is a good trade off between diffusion rate and computational cost (which increases as α 1) to take the parameter α = 0.7. Discussion The connection of the discrete model described by equation (1) to the continuous model allows us to point out that we are dealing with an open system in which the amount of information in each node depends on the constant rate of the sinks γ and on the query nodes distribution. The system is open in the sense that the conservation of fluid amount from a macroscopic point of view and the probability conservation from a microscopic point of view is not guaranteed. The non conservative system shows that X represents the quantity of fluid remaining after the flow stabilizes; in each node the fluid pumped by the altered nodes is balanced by the constant sinking of the fluid at the constant rate γ. In principle also the amount of overall fluid on the network could differentiate from cases to controls but we observe that on large networks differences in this sense are not sensible; on the other hand the fact that we are dealing with an open system implies that the final overall amount of fluid depends on both the distribution and the number of initially altered species. Hofree et al [1] normalize each diffused profile so that the sum of the fluid in each sample is constant therefore actually interpreting equation (1) as a random walk with restart. In our work we chose not to normalize each profile in this fashon, but we use the non-normalized profiles to define the S scores and the S scores where the actual amount of fluid on each node is taken into account. This allows to better highlight the nodes that gain more fluid in one class versus the other even in patients in the same class that have a different number of altered entities. Network resampling procedure The definition of network resampling p values (p nr) is performed by comparing the objective function Ω with a sample of its local perturbations Ω σ. We first define, as in the main text, the objective function Ω: Ω(n) = S t (n) A n S(n) (6) where A n is the adjacency matrix between the first n top scoring entities and S(n) are the first n scores decreasingly ordered. Equation (6) can be seen as a scalar product between the top highest S, where the matrix A n makes sure that only the scores associated with entities that have at least a connection to the remaining n 1 ones positively contribute to the calculation. The Ω function is non-decreasing since only positive scores are considered. A single perturbation of function (6) is defined as follows: Ω σ(n) = S t (n) A σ(n) S(n) (7) where σ(n) is a random permutation between the first n molecular entities labels, so that A σ(n) is simply a random resampling of the existing connections between the top scoring genes. We specify that the permutations are constructed as follows: we first randomly permute the indexes of all the genes and then we use that random reassignment to define a single Ω σ(n) perturbation; in this fashion we construct perturbations that behave similarly to equation (6), with the advantage of a graphical feedback (for example, see Supplementary Fig. S3). In order to define network resampling p-values we produce a set of k perturbations and for each value n we compute the fraction of times in which the perturbations are above or equal to the objective function Ω. As log as for some of the k permutations it holds that Ω σ(n) Ω(n) it means that the first n top scoring genes are connected enough that a resampling of the connections among them do not alter too much the strength of the subnetwork, while it s 3

reasonable to expect a sensible deviation of the permutations from the objective function when top-scoring genes that are not connected to the previous n 1 ones enter the top of the list. Therefore for each value n we take k different permutations of the indexes {σ 1, σ 2,, σ k } compute: p nr(n) = {j (1,, k) Ωσ j (n) Ω(n)} + 1 (8) k + 1 where we add 1 both at the numerator and at the denominator so that the smallest p value is never null. At this point, according to the model s assumptions, in principle any local minimum of equation (8) could be an interesting choice for cutting the top of the S list; however in our applications we find reasonable to cut at the first local minimum ˆp nr since it often represents the value corresponding to the value n where the perturbations (equation (7)) start to sensibly deviate from the objective function (equation (6)). See supplementary Fig. S3. 4

Supplementary Figure S1: Network resampling p value The deviations of 999 permutations Ω σ from the objective function Ω (top) captures the biological signal within the synthetic module of size M = 50. Such deviation is captured by the first minimum p nr -value (bottom). 5

Supplementary Figure S2: The choice of ɛ in PRAD SM data Scatter plot with f values and S calculated on PRAD SM data for different values of the parameter ɛ. The overlap is calculated over the top 500 genes of the two quantities on STRING PPIs. 6

Supplementary Figure S3: The choice of ɛ in PRAD GE data Scatter plot with lfcp values and Sp calculated on PRAD GE data for different values of the parameter ɛ. The overlap is calculated over the top 500 genes of the two quantities on STRING PPIs. 7

Supplementary Figure S4: Comparison of network-based and network-free quantities on FP60 PPIs. Comparison of network-based and network-free quantities calculated on somatic mutation and gene expression data from PRAD samples associated with two different prognostic groups. (a-b) Scatter plot with network-based (y axis) vs network-free (x axis) gene scores calculated on PRAD SM (a) and GE (b) data; colours indicate the top 500 genes ranked by network-free (red) or network-based (yellow, blue) scores and the overlaps (brown, purple). (c-d) Number of links (y axis, left) and number of connected genes (y axis, right) within the first 500 genes ordered by network ( S, Sp) and network-free ( f, lfcp) gene scores, calculated on PRAD SM (c) and PRAD GE (d) data. (a-d) S and Sp were calculated using FP60 PPIs and, respectively, ɛ = 0.25 and ɛ = 1. (c-d) #: number of links (vertical axis, left) or number of veritces (vertical axis, right). 8

Supplementary Figure S5: Comparison of network-based and network-free quantities on GHIASSIAN PPIs. Comparison of network-based and network-free quantities calculated on somatic mutation and gene expression data from PRAD samples associated with two different prognostic groups. (a-b) Scatter plot with network-based (y axis) vs network-free (x axis) gene scores calculated on PRAD SM (a) and GE (b) data; colours indicate the top 500 genes ranked by network-free (red) or network-based (yellow, blue) scores and the overlaps (brown, purple). (c-d) Number of links (y axis, left) and number of connected genes (y axis, right) within the first 500 genes ordered by network ( S, Sp) and network-free ( f, lfcp) gene scores, calculated on PRAD SM (c) and PRAD GE (d) data. (a-d) S and Sp were calculated using GHIASSIAN PPIs and, respectively, ɛ = 0.25 and ɛ = 1. (c-d) #: number of links (vertical axis, left) or number of veritces (vertical axis, right). 9

Supplementary Figure S6: Comparison of network-based and network-free quantities on NCBI PPIs. Comparison of network-based and network-free quantities calculated on somatic mutation and gene expression data from PRAD samples associated with two different prognostic groups. (a-b) Scatter plot with network-based (y axis) vs network-free (x axis) gene scores calculated on PRAD SM (a) and GE (b) data; colours indicate the top 500 genes ranked by network-free (red) or network-based (yellow, blue) scores and the overlaps (brown, purple). (c-d) Number of links (y axis, left) and number of connected genes (y axis, right) within the first 500 genes ordered by network ( S, Sp) and network-free ( f, lfcp) gene scores, calculated on PRAD SM (c) and PRAD GE (d) data. (a-d) S and Sp were calculated using NCBI PPIs and, respectively, ɛ = 0.25 and ɛ = 1. (c-d) #: number of links (vertical axis, left) or number of veritces (vertical axis, right). 10

Supplementary Figure S7: Comparison of network-based and network-free quantities on HI PPIs. Comparison of network-based and network-free quantities calculated on somatic mutation and gene expression data from PRAD samples associated with two different prognostic groups. (a-b) Scatter plot with network-based (y axis) vs network-free (x axis) gene scores calculated on PRAD SM (a) and GE (b) data; colours indicate the top 500 genes ranked by network-free (red) or network-based (yellow, blue) scores and the overlaps (brown, purple). (c-d) Number of links (y axis, left) and number of connected genes (y axis, right) within the first 500 genes ordered by network ( S, Sp) and network-free ( f, lfcp) gene scores, calculated on PRAD SM (c) and PRAD GE (d) data. (a-d) S and Sp were calculated using HI PPIs and, respectively, ɛ = 0.25 and ɛ = 1. (c-d) #: number of links (vertical axis, left) or number of veritces (vertical axis, right). 11

Supplementary Figure S8: Comparison with other diffusion-based methods (a) Scatter plot with SAM q values and S calculated on PRAD SM data. (b) Network of STRING PPIs formed by the top 300 genes order by SAM q or S; (c) Percentage of st-svm extracted genes recalled on top of the list (from 50 to 1000) of Sp. (d) Network resampling applied to the Sp array suggests to cut around 55 top scoring genes. (e) Overlap between st-svm and our method on STRING PPIs. 12

Supplementary Figure S9: Relationship between ω and h The signal ω as a function of average hills frequency h. The mountains percentage is fixed at m % = 0.1. Values are averaged over ensembles of modules of the same size M. 13

References 1. Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nat. Methods 10, 1108 1115 (2013). 2. Qi, Y., Suhail, Y., Lin, Y. Y., Boeke, J. D. & Bader, J. S. Finding friends and enemies in an enemies-only network: a graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions. Genome Res. 18, 1991 2004 (2008). 3. Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507 522 (2011). 14