Hub Gene Selection Methods for the Reconstruction of Transcription Networks


Hub Gene Selection Methods for the Reconstruction of Transcription Networks
José Miguel Hernández-Lobato (1) and Tjeerd M. H. Dijkstra (2)
(1) Computer Science Department, Universidad Autónoma de Madrid, Spain
(2) Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands
September 21, 2010

Outline
1. Introduction to Transcription Networks
2. Hub Gene Selection Methods
3. Improving the Performance of ARACNE
4. Experiments with Actual Microarray Data
5. Conclusions

Introduction to Transcription Networks

1 Introduction to Transcription Networks
Gene Expression Process: cells contain thousands of proteins, produced at different rates in different situations. Regulatory molecules control the final concentration of each protein.

2 Introduction to Transcription Networks Transcription Regulation Transcription is typically controlled by transcription factors which can be activators or repressors.

3 Introduction to Transcription Networks
Transcription Control Networks... have hubs, or highly connected genes. These are key regulators.

4 Introduction to Transcription Networks Reconstruction of Transcription Control Networks. Main Contributions: Identifying beforehand those genes that are more likely to be hubs can improve the results of existing network reconstruction methods. We propose different methods for identifying hub genes.

Hub Gene Selection Methods

5 A Linear Model for Transcription Control Networks
Under certain assumptions, the log-concentration of mRNA for a gene, [X_i], can be represented as a linear function of the log-concentrations of mRNA for its regulators, [X_j]:

log[X_i] = Σ_{j≠i} b_j log[X_j] + constant,

where the b_j are regression coefficients. No autoregulation!
Extension to account for all the genes in a cell: X ≈ BX + σE, where X contains the mRNA log-concentration values and E is Gaussian noise. The diagonal of B is zero. Hub genes correspond to the columns of B that have more non-zero elements.
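A minimal numerical sketch of this model (dimensions, sparsity pattern and noise level are illustrative assumptions, not values from the talk): sample a column-wise sparse B with zero diagonal, generate X from X ≈ BX + σE, and recover the hubs as the densest columns of B.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k_hubs = 50, 250, 3              # genes, measurements, hubs (toy sizes)

# Column-wise sparse regulation matrix B with a zero diagonal:
# only hub columns carry non-zero regression coefficients.
B = np.zeros((d, d))
hubs = rng.choice(d, size=k_hubs, replace=False)
for j in hubs:
    targets = rng.choice([i for i in range(d) if i != j], size=10, replace=False)
    B[targets, j] = rng.normal(scale=0.3, size=10)

# X = BX + sigma*E  =>  X = (I - B)^{-1} sigma*E.
sigma = 0.1
X = np.linalg.solve(np.eye(d) - B, sigma * rng.normal(size=(d, n)))

# Hub genes are the columns of B with the most non-zero elements.
connectivity = (B != 0).sum(axis=0)
recovered = np.argsort(connectivity)[-k_hubs:]
```

In this toy setting the recovered indices coincide with the planted hub columns by construction; the methods below try to find them from X alone.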

6 Basic Approach and Proposed Methods
Basic Approach: identify the rows of X that are most relevant for solving the multiple linear regression problem X ≈ BX + σE, or equivalently the columns of B that are most likely to have many non-zero elements. Each of these rows or columns represents a different hub gene.
Proposed Methods: we use standard feature selection and sparsity-enforcing techniques:
Automatic Relevance Determination (ARD).
Group Lasso (GL).
Maximum-Relevance Minimum-Redundancy (MRMR).

7 Automatic Relevance Determination (ARD)
Let b_i^j be the element in the j-th column and i-th row of B. The prior for B is

P(B|a) = Π_i [ δ(b_i^i) Π_{j≠i} N(b_i^j | 0, a_j^{-1}) ],   (1)

where a = (a_1, ..., a_d) is a vector of inverse variances, one component for each column of B, and δ is a point mass at zero. ARD maximizes the evidence of the resulting Bayesian model with respect to a. During this process some of the a_j become infinite and the resulting solution for B is column-wise sparse. The optimization is performed in a greedy manner to select the k most important columns.

8 Group Lasso (GL)
A column-wise sparse estimate of B is obtained as the minimizer of

‖X − BX‖_F²  subject to  Σ_{i=1}^d ‖b_i‖_2 ≤ M  and  diag(B) = 0,   (2)

where ‖·‖_F and ‖·‖_2 stand for the Frobenius and ℓ2 norms respectively, b_i is the i-th column of matrix B, and M is a positive regularization parameter. The sum of column-wise ℓ2 norms forces the solution to be column-wise sparse. M is fixed so that only k columns of the resulting estimate are different from zero.
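Problem (2) is often solved in its equivalent penalized form; a minimal proximal-gradient sketch (the penalized formulation, step size and iteration count are my assumptions, not the authors' solver):

```python
import numpy as np

def group_lasso_columns(X, lam, n_iter=300):
    """Column-wise sparse estimate of B for ||X - BX||_F^2 plus an
    l1/l2 penalty lam * sum_i ||b_i||_2 with diag(B) = 0, solved by
    proximal gradient descent (block soft-thresholding of columns)."""
    d = X.shape[0]
    B = np.zeros((d, d))
    G = X @ X.T
    step = 0.5 / np.linalg.norm(G, 2)            # 1/L with L = 2*||X X^T||_2
    for _ in range(n_iter):
        B = B - step * 2.0 * (B @ X - X) @ X.T   # gradient step on the loss
        np.fill_diagonal(B, 0.0)                 # enforce no autoregulation
        norms = np.linalg.norm(B, axis=0)
        shrink = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        B *= shrink                              # block soft-threshold columns
    return B
```

Increasing lam drives more columns of the estimate exactly to zero; sweeping it plays the role of choosing M in (2).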

9 Maximum Relevance Minimum Redundancy (MRMR)
Let x_i = (x_{i1}, ..., x_{in}) be the i-th row of X and let I(x_i, x_j) denote the empirical mutual information for rows i and j. Suppose S is the set of genes already selected by the method. The next gene to be selected is a gene j that is not yet in S and that maximizes

(1/(d−1)) Σ_{i≠j} I(x_i, x_j) − (1/|S|) Σ_{i∈S} I(x_i, x_j),   (3)

where I(x_i, x_j) = −0.5 log(1 − ρ̂_{ij}²) and ρ̂_{ij} is the empirical correlation of the vectors x_i and x_j. The process is repeated until k genes are selected.
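Under the Gaussian assumption I(x_i, x_j) = −0.5 log(1 − ρ̂²_ij), the greedy selection in (3) reduces to correlations; a minimal sketch (function names are illustrative, and the first gene is picked by relevance alone since S starts empty):

```python
import numpy as np

def gaussian_mi_matrix(X):
    """Pairwise MI under a Gaussian assumption: I_ij = -0.5 log(1 - rho_ij^2)."""
    rho = np.corrcoef(X)
    np.fill_diagonal(rho, 0.0)           # drop the infinite self-information
    return -0.5 * np.log(1.0 - rho ** 2)

def mrmr_select(X, k):
    """Greedy MRMR over the rows of X: maximize mean relevance to all other
    genes minus mean redundancy with the already selected set S."""
    d = X.shape[0]
    I = gaussian_mi_matrix(X)
    relevance = I.sum(axis=1) / (d - 1)  # (1/(d-1)) * sum_{i != j} I(x_i, x_j)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        redundancy = I[:, selected].mean(axis=1)  # (1/|S|) * sum_{i in S}
        score = relevance - redundancy
        score[selected] = -np.inf                 # never re-select a gene
        selected.append(int(np.argmax(score)))
    return selected
```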

10 Evaluation of the Hub Gene Selection Methods
SynTReN is used to generate synthetic gene expression data from three networks with 423, 690 and 1330 genes, respectively. 50 gene expression matrices with 250 measurements each are generated from each network. The different methods are configured (by fixing k) to select 5% of the genes. The methods are compared with a random approach that selects 5% of the genes at random. The performance of each method is evaluated by the average connectivity of the selected genes.

Network        Genes  Random     GL         MRMR        ARD
Small E. coli  423    2.54±0.77  7.23±0.40  13.31±0.77  14.01±0.71
Yeast          690    3.14±0.91  9.46±0.51  13.91±1.27  16.51±0.74
Large E. coli  1330   4.22±1.94  8.56±0.72  14.70±2.48  23.48±1.62
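The connectivity criterion used in this comparison is simple to state in code; a sketch under the assumption that the true network is given as a boolean adjacency matrix:

```python
import numpy as np

def average_connectivity(adjacency, selected):
    """Mean number of connections of the selected genes in the true network.

    adjacency: boolean gene-by-gene matrix of the known network;
    selected:  indices of the genes chosen by a hub selection method."""
    degree = np.asarray(adjacency).sum(axis=1)
    return float(degree[selected].mean())
```

A good hub selection method should score well above the random baseline, whose expected value is simply the mean degree of the network.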

11 Genes Selected on the Smallest Network by ARD and GL
[Figure: network plots of the genes selected by ARD (left) and GL (right) on the smallest network.]

Improving the Performance of ARACNE

12 Improving the Performance of ARACNE
Description of ARACNE: ARACNE is a state-of-the-art method for TCN reconstruction. It follows these steps:
1. The empirical mutual information I(x_i, x_j) is computed for every pair of genes i and j.
2. The connection weight w_{ij} for any two genes i and j is initialized as w_{ij} = I(x_i, x_j).
3. The data processing inequality

I(x_i, x_j) ≤ min{I(x_i, x_k), I(x_j, x_k)}   (4)

is checked for all gene triplets i, j and k, and w_{ij} is set to zero to eliminate indirect interactions when the inequality holds for some gene k.
Finally, ARACNE links any two genes i and j when w_{ij} is higher than a threshold θ > 0.
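The three steps can be sketched directly from a matrix of mutual information estimates; a minimal version (the original algorithm's MI estimation and tolerance details are omitted):

```python
import numpy as np

def aracne_sketch(mi, theta):
    """Minimal ARACNE sketch: prune the weakest edge of every gene triplet
    using the data processing inequality, then threshold the survivors.
    mi is a symmetric matrix of pairwise mutual information estimates."""
    d = mi.shape[0]
    W = mi.copy()
    for i in range(d):
        for j in range(i + 1, d):
            for k in range(d):
                if k == i or k == j:
                    continue
                # i-j is the weakest edge of the triplet (i, j, k): indirect.
                if mi[i, j] < min(mi[i, k], mi[j, k]):
                    W[i, j] = W[j, i] = 0.0
                    break
    return W > theta
```

For a chain i — k — j, the indirect edge i-j carries the smallest MI of the triplet, so the DPI removes it while keeping the two direct edges.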

13 Improving the Performance of ARACNE
ARACNE and Noisy Data: noise in the mutual information estimates can significantly affect the performance of ARACNE. ARACNE may fail to remove an indirect interaction between genes i and j when fluctuations in the measurement process make I(x_i, x_j) larger than I(x_i, x_k) or I(x_j, x_k) for those genes k that actually connect i and j. Similarly, ARACNE may mistakenly remove a direct interaction between genes i and j when these fluctuations make I(x_i, x_j) smaller than both I(x_i, x_k) and I(x_j, x_k) for some gene k.

14 Improving the Performance of ARACNE
Improving ARACNE using a Hub Gene Selection Method: after ARACNE has computed the empirical mutual information estimates, these are updated as

I_new(x_i, x_j) = I_old(x_i, x_j) + Δ_{ij}   if i ∈ S_H or j ∈ S_H,
I_new(x_i, x_j) = I_old(x_i, x_j)            otherwise,   (5)

where S_H is a set of hub genes and Δ_{ij} is a positive number that scales with the level of noise in the estimation of I(x_i, x_j). When I(x_i, x_j) = −0.5 log(1 − ρ̂_{ij}²), we can fix a probabilistic upper bound on the noise in I(x_i, x_j) using the delta method, namely

Δ_{ij} = n^{−1/2} |ρ̂_{ij}| Φ^{−1}(1 − γ),   (6)

where γ is a small positive number.
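The bound in (6) needs only the inverse normal CDF; a sketch using the standard library (Φ⁻¹ via statistics.NormalDist; the function name is illustrative):

```python
import math
from statistics import NormalDist

def mi_noise_bound(rho_hat, n, gamma):
    """Delta-method bound Delta_ij = n^{-1/2} |rho_ij| Phi^{-1}(1 - gamma)
    on the noise of the Gaussian MI estimate I = -0.5 * log(1 - rho^2)."""
    return abs(rho_hat) * NormalDist().inv_cdf(1.0 - gamma) / math.sqrt(n)
```

As expected, the bound grows with the strength of the correlation, shrinks with the sample size n, and vanishes as γ approaches 0.5.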

15 Improving the Performance of ARACNE
Evaluation of the New ARACNE Method: the standard ARACNE algorithm is compared with four versions of the modified method, which compute S_H using ARD, GL, MRMR and the random approach. The data used are the expression matrices generated by SynTReN to validate the hub gene selection methods. Performance is measured using the area under the precision-recall curve generated by varying the threshold θ. γ is fixed to the inverse of the number of genes in the network.

                              Standard   New ARACNE with
Network        Genes  ARACNE     Random     GL         MRMR       ARD
Small E. coli  423    0.41±0.03  0.28±0.05  0.41±0.04  0.57±0.04  0.56±0.03
Yeast          690    0.30±0.02  0.16±0.03  0.30±0.02  0.35±0.03  0.39±0.02
Large E. coli  1330   0.16±0.01  0.06±0.01  0.15±0.01  0.18±0.03  0.26±0.02

Experiments with Actual Microarray Data

16 Experiments with Actual Microarray Data
The ARD method is validated on a yeast dataset which includes 247 expression measurements for a total of 5520 genes and is publicly available at the Many Microbe Microarrays Database.

Table: Top ten genes selected by ARD on the yeast dataset.
Rank  Gene     Description
1     YOR224C  RNA polymerase subunit.
2     YPL013C  Mitochondrial ribosomal protein.
3     YGL245W  Glutamyl-tRNA synthetase.
4     YPL012W  Ribosomal RNA processing.
5     YER125W  Ubiquitin-protein ligase.
6     YER092W  Associates with the INO80 chromatin remodeling complex.
7     YBR289W  Subunit of the SWI/SNF chromatin remodeling complex.
8     YBR272C  Involved in DNA mismatch repair.
9     YBR160W  Catalytic subunit of the main cell cycle kinase.
10    YOR215C  Unknown function.

Conclusions

17 Conclusions
Network reconstruction can be improved by predicting beforehand which genes are hubs. In the linear model, hubs correspond to the columns of the regression matrix B with many non-zero elements. We have proposed three hub gene selection methods: ARD, GL and MRMR; the best one is ARD. The performance of ARACNE can be improved by using these methods. The best method, ARD, was validated on a yeast expression dataset.

18 Conclusions
Bibliography
Gardner, T.S., Faith, J.J.: Reverse-engineering transcription control networks. Physics of Life Reviews 2(1), 65-88 (2005)
Van den Bulcke, T., Van Leemput, K., Naudts, B., van Remortel, P., Ma, H., Verschoren, A., De Moor, B., Marchal, K.: SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics 7(1), 43 (2006)
Basso, K., Margolin, A.A., Stolovitzky, G., Klein, U., Dalla-Favera, R., Califano, A.: Reverse engineering of regulatory networks in human B cells. Nature Genetics 37, 382-390 (2005)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68(1), 49-67 (2006)
Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211-244 (2001)
Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8), 1226-1238 (2005)
Faith, J.J., Driscoll, M.E., Fusaro, V.A., Cosgrove, E.J., Hayete, B., Juhn, F.S., Schneider, S.J., Gardner, T.S.: Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Research 36, D866-D870 (2008)
Wipf, D.P., Rao, B.D.: An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Transactions on Signal Processing 55(7) (2007)

19 Conclusions Thank you for your attention!