Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Similar documents
Probabilistic Graphical Models

Probabilistic Graphical Models

High-Throughput Sequencing Course

11 : Gaussian Graphic Models and Ising Models

Equivalent Partial Correlation Selection for High Dimensional Gaussian Graphical Models

Downloaded by Stanford University Medical Center Package from online.liebertpub.com at 10/25/17. For personal use only. ABSTRACT 1.

The lasso: some novel algorithms and applications

High-dimensional covariance estimation based on Gaussian graphical models

A Practical Approach to Inferring Large Graphical Models from Sparse Microarray Data

Graphical Model Selection

DEXSeq paper discussion

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Proximity-Based Anomaly Detection using Sparse Structure Learning

Sparse Inverse Covariance Estimation for High-throughput microrna Sequencing Data in the Poisson Log-Normal Graphical Model

Chapter 17: Undirected Graphical Models

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

An Introduction to Graphical Lasso

Learning Network from High-dimensional Array Data

Learning Gaussian Graphical Models with Unknown Group Sparsity

Fast Regularization Paths via Coordinate Descent

Subject CS1 Actuarial Statistics 1 Core Principles

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Robust Inverse Covariance Estimation under Noisy Measurements

Permutation-invariant regularization of large covariance matrices. Liza Levina

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Inferring Transcriptional Regulatory Networks from High-throughput Data

Linear regression methods

Comparative analysis of RNA- Seq data with DESeq2

Computational Genomics

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models

Learning Markov Network Structure using Brownian Distance Covariance

Machine Learning Linear Classification. Prof. Matteo Matteucci

Predicting causal effects in large-scale systems from observational data

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

A New Method to Build Gene Regulation Network Based on Fuzzy Hierarchical Clustering Methods

Inequalities on partial correlations in Gaussian graphical models

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Introduction to Bioinformatics

Sparse Graph Learning via Markov Random Fields

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

Lasso regression. Wessel van Wieringen

Regularization and Variable Selection via the Elastic Net

Prediction & Feature Selection in GLM

A General Framework for High-Dimensional Inference and Multiple Testing

Efficient Sparse-Group Bayesian Feature Selection for Gene Network Reconstruction

An Imputation-Consistency Algorithm for Biomedical Complex Data Analysis

Approximate inference for stochastic dynamics in large biological networks

Institute of Actuaries of India

25 : Graphical induced structured input/output models

Differential expression analysis for sequencing count data. Simon Anders

25 : Graphical induced structured input/output models

Dispersion modeling for RNAseq differential analysis

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

Generalized Linear Models (1/29/13)

Example: multivariate Gaussian Distribution

Robust and sparse Gaussian graphical modelling under cell-wise contamination

MATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.)

STAT 302 Introduction to Probability Learning Outcomes. Textbook: A First Course in Probability by Sheldon Ross, 8 th ed.

Bayes methods for categorical data. April 25, 2017

Statistical aspects of prediction models with high-dimensional data

Learning the Network Structure of Heterogenerous Data via Pairwise Exponential MRF

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen

MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models

Graphical model inference with unobserved variables via latent tree aggregation

Applied Machine Learning Annalisa Marsico

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Bayesian Learning (II)

Math Review Sheet, Fall 2008

Dimension Reduction. David M. Blei. April 23, 2012

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

Joint Gaussian Graphical Model Review Series I

Classification. Classification is similar to regression in that the goal is to use covariates to predict on outcome.

Extended Bayesian Information Criteria for Gaussian Graphical Models

Regression, Ridge Regression, Lasso

Probabilistic Graphical Models

Graph Wavelets to Analyze Genomic Data with Biological Networks

Factor-Adjusted Robust Multiple Test. Jianqing Fan (Princeton University)

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA

Topics in Graphical Models. Statistical Machine Learning Ryan Haunfelder

Learning Quadratic Variance Function (QVF) DAG Models via OverDispersion Scoring (ODS)

Conditional distributions (discrete case)

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

10708 Graphical Models: Homework 2

Causal Inference: Discussion

Metabolic networks: Activity detection and Inference

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Exam: high-dimensional data analysis January 20, 2014

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA

Discovering molecular pathways from protein interaction and ge

On the inconsistency of l 1 -penalised sparse precision matrix estimation

STA 4273H: Statistical Machine Learning

Statistical Applications in Genetics and Molecular Biology

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.

Stability and the elastic net

Some Statistical Models and Algorithms for Change-Point Problems in Genomics

Learning Objectives for Stat 225

Transcription:

Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1

Paper G. I. Allen and Z. Liu. 2012. A log-linear graphical model for inferring genetic networks from high-throughput sequencing data. ArXiv 1204.3941. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 2

Overview 1 Background 2 Network Reconstruction (GGM and Poisson) 3 Simulation Study 4 Analysis of microrna Data 5 Discussion Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 3

Genetic Networks Most analysis of RNA-Seq data (e.g. differential expression, clustering, classification) ignores the dependencies among genes. In contrast, in genetics networks one is specifically interested in these dependencies. Questions: 1 What are suitable statistical models for dependency networks for count data? 2 How do we learn these networks from actual biological data? Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 4

Graphical Models For microarray data a common way to model dependencies are graphical models, e.g., Gaussian Graphical Models (GGMs). For sequence data a similar approach is needed: Poisson graphical model. Allen and Liu propose a to use a log-linear graphical model (llgm) similar to regression-based GGMs and develop a fast algorithm based on lasso regression suitable for estimation from high-dimensional data. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 5

Previous Work on Poisson Graphical Models There exist some literature on graphical models for count data and contingency tables, for example: Whittaker 1990 Madigan et al 1995 Lauritzen 1996 Hastie et al 2009 However, all these algorithms do not work well for large number of variables. Inference for dimension d > 20 is infeasible. Allen and Liu (2012) address this issue by introducing the llgm algorithm. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 6

Poisson vs. Negative Binomial Model and Preprocessing Allen and Liu use the Poisson distribution rather than the Negative Binomial. But overdispersion is accounted for in preprocessing: 1 genes with zero counts, that are constant or with low variance are filtered out. 2 adjustment for sequence depth via scale factors (e.g. Anders and Huber 2012). 3 power transform X α with α [0; 1] to correct overdispersion. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 7

Log-Linear Model Conventional linear model: with normal error. log linear model: µ = E(Y X i = x i ) = β i x i log µ = log E(Y X i = x i ) = β i x i with Poisson error (in GLM speak: Poisson regression with natural log link function) Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 8

Log-Linear Model: Properties automatically ensures that µ > 0 the predictors x i need not be integers (preprocessing!) effects of predictor are multiplicative, as µ = e β i x i the regression coefficients can be estimated by ML, penalized ML (e.g. lasso or elastic net) or Bayesian approaches. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 9

Gaussian Graphical Model: Basics Starting point: genes X 1,..., X d are jointly normal distributed with mean µ and covariance Σ and corresponding correlation matrix P = (ρ ij ) From P we compute partial correlations P: Ω = P 1 = (ω ij ) ω ij ρ ij = ωii ω jj A vanishing partial correlation coefficient ρ ij = 0 implies (for normal data) conditional independence of gene i and j given all other genes. Non-zero coefficients are represented by edges GGM network. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 10

GGM: Regression View Partial correlation between X 1 and X 2 can also be computed by linear regression: E(X 1 X j = x j ) j 1 = j 1 β 1 j x j E(X 2 X j = x j ) j 2 = j 2 β 2 j x j and r 2 ij = ˆβ 1 2 ˆβ 2 1 Partial correlation is the geometric mean of the two regression coefficients (one for each direction of an edge in a network). Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 11

GGM: Neighborhood Selection Meinshausen and Bühlmann (2006) propose the neighborhood selection approach to inference of GGM networks: for each potential edge between X i and X j estimate the corresponding regression coefficients ˆβ j i and ˆβ i j using L1-penalized regression ( lasso ). lasso has built-in variable selection: coefficients can be exactly zero. include an edge in the graph if both coefficients are non-zero (alternative: if at least one of them is non-zero). Advantages: very fast and can be applied to very high dimensions. Drawback: this procedure does not always produce a consistent global joint distribution (e.g. the resulting implied covariance is not guaranteed to be positive definite. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 12

llgm Algorithm Inspired by GGM neighborhood selection Allen and Liu propose their local llgm (log-linear graphical model) algorithm: use L1-penalized log-linear regression to estimate regression coefficients. optimal regularization parameter is chosen via stability selection (Meinshausen and Bühlmann 2010). construct a llgm network by including an edge between if at least of the two regression coefficients corresponding to an edge is non-zero (union). Alternatively, include an edge only if both coefficients are non-zero (intersection). Advantages: very fast and can be applied to very high dimensions. Drawback: this procedure does not necessarily produce a consistent global Poisson graphical model. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 13

Simulations: Setup Three graphs structures are simulated (50 nodes): 1 hub network 2 scale-free network 3 random network Poisson data with sample size n = 200 for these networks are simulated using an algorithm by Karlis (2003). Comparison with GGM lasso algorithm (directly on count data or on log-transformed count data). Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 14

Simulations: Hub Network Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 15

Simulations: Scale-Free Network Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 16

Simulations: Random Network Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 17

Simulations: Results the llgm algorithm greatly outperforms GGM-based algorithms on hub and scale-free networks. for GGM graphs it does not matter whether the data are log-transformed. for random networks the ROC curves of all methods are approximately equal. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 18

microrna Data Set aim: infer network to discover relationship among micrornas (breast cancer samples). data set: 544 patients and 524 micrornas. after preprocessing and filtering 262 micrornas remained for analysis (n = 544, d = 262). Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 19

Inferred microrna Network Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 20

microrna Network Details Node-degrees follows a power-law (scale free network). many well-known hub genes are recovered. plus additional potentially interesting hub genes. micorna cluster identified without transcript location Biological hypothesis obtained from network reconstruction: mir-379 is a regulatory microrna for breast cancer progression. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 21

Discussion A framework for inferring Poisson graphical networks from count data was developed. Based on Poisson L1-penalized regression combined with neighborhood selection. Applicable to much higher dimension than previous algorithms. The proposed approach clearly outperforms in simulations GGM networks inferred from the same data. Using a microrna data set previously known facts were recovered and new biological hypotheses were generated for further validation. Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 22