Differential Modeling for Cancer Microarray Data


Differential Modeling for Cancer Microarray Data
Omar Odibat, Department of Computer Science
February 1, 2011

Outline
Introduction: cancer microarray data, problem definition
Differential analysis: existing methods, limitations
Differential biclustering: biclustering, proposed algorithm, results
Differential networking: gene networks, proposed algorithm, results
Conclusion

The Biology and the Technology
The central dogma
DNA microarray
Pictures from http://www.ncbi.nlm.nih.gov/ and http://www.dnamicroarray.net/

Example of Cancer Microarray Data
Microarrays measure the expression level of thousands of genes under different conditions.
[Figure: expression matrix; axes: genes, samples]

Sample Types
Tissue type (e.g., normal vs. cancerous)
Subject type (e.g., male vs. female)
Time points (time series data)
All of these support comparative gene expression analysis.
Problem: Find the most significant genes relevant to phenotypic variation.

The Goals of Differential Modeling
[Figure: expression matrix of genes G1 to G7 by samples S1 to S6, split into Group A (normal): S1, S3, S4 and Group B (cancer): S2, S5, S6]
The goal of differential analysis is to answer the following questions:
Which genes are related to cancer?
How are these genes correlated in cancer cells and in normal cells?

Applications of Differential Modeling
Identifying disease-causing genes.
Examining the effects of a certain treatment.
Understanding the different roles played by a given gene in two different kinds of cells.
Comparative gene expression analysis.


Differential Expression (DE)
Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?
Solution: compute a t-test statistic for each gene.
Example: Significance Analysis of Microarrays (SAM) [Tusher et al., 2001]
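The per-gene test above can be sketched as follows. This is a minimal illustration using Welch's t-statistic and the Python standard library; the gene names and expression values are hypothetical, and SAM itself uses a modified statistic with permutation-based significance that is not reproduced here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for one gene: difference of group means over its standard error."""
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances (n - 1)
    std_err = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / std_err


# Hypothetical data: one entry per gene, with expression values per sample group.
expression = {
    "G1": {"A": [1.0, 1.2, 0.9, 1.1], "B": [2.0, 2.3, 1.9, 2.2]},
    "G2": {"A": [1.0, 1.5, 0.5, 1.0], "B": [1.1, 0.9, 1.4, 0.6]},
}

# One statistic per gene, then rank genes by the magnitude of the statistic.
t_scores = {gene: welch_t(v["A"], v["B"]) for gene, v in expression.items()}
ranked = sorted(t_scores, key=lambda gene: abs(t_scores[gene]), reverse=True)
```

Here G1 has clearly separated group means and ranks first, while G2 has equal means in both groups and a statistic near zero.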

Differential Variability (DV)
Question: Is the variance of the expression level of a gene in group A significantly different from its variance in group B?
Solution: compute an F-test statistic for each gene.
Example: AlteredExpression [Prieto et al., 2006]
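A minimal sketch of the variance comparison, assuming the usual F statistic (ratio of the two sample variances, larger over smaller). The sample values are hypothetical, and turning the statistic into a p-value needs the F distribution, which is left out here.

```python
from statistics import variance


def f_statistic(group_a, group_b):
    """Ratio of sample variances (larger over smaller) and its degrees of freedom."""
    var_a, var_b = variance(group_a), variance(group_b)
    if var_a >= var_b:
        return var_a / var_b, (len(group_a) - 1, len(group_b) - 1)
    return var_b / var_a, (len(group_b) - 1, len(group_a) - 1)


# Hypothetical gene: tightly regulated in group A, highly variable in group B.
stable = [1.0, 1.2, 0.9, 1.1]
variable = [0.2, 2.0, 1.1, 3.1]
f_value, dof = f_statistic(stable, variable)
```

A gene like this one can have a large F statistic even when a test on the means alone would miss it.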

Limitations of DE and DV Methods
DE and DV methods study genes individually: they perform a statistical test for each gene separately and do not capture the relationships between genes.
They cannot find differences in the co-expression patterns between normal and disease samples.
It has been shown that some disease genes are highly differentially co-expressed but not differentially expressed.
Therefore, we propose two data mining approaches that study groups of genes: differential biclustering and differential networking.


Clustering
Group similar objects together.
K-means clustering
Hierarchical clustering

Traditional Clustering Algorithms
In traditional clustering methods, such as k-means and hierarchical clustering, similarity is computed across all the features.
These methods fail in two situations:
Only a small set of the genes participates in a cellular process of interest.
An interesting cellular process is active only in a subset of the conditions.

Biclustering (Co-clustering)
The genes are NOT correlated in all of the samples.
The genes are correlated in a subset of the samples.
[Figure: expression level plotted against samples]

Biclustering
Biclustering identifies a subset of objects that are similar under a subset of features.
Example matrix (axes: genes, samples):
5 5 0 0 0 0 0 0 0 0
5 5 0 0 0 0 0 0 0 0
5 5 0 0 0 0 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 7 7 7 0 0 0 0 0 0
0 7 7 7 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
More complicated biclusters: arbitrarily positioned and overlapping biclusters.
These biclusters cannot be identified using traditional clustering algorithms such as k-means or hierarchical clustering.

Applications of Biclustering
Text mining (axes: words, documents):
0 0 0 0 0 0 0 9 9 0
0 3 3 3 3 0 0 9 9 0
0 3 3 3 3 0 0 9 9 0
0 3 3 3 3 0 0 9 9 0
0 0 0 0 0 0 0 9 9 0
0 0 0 4 4 4 0 0 0 0
0 0 0 4 4 4 0 0 0 0
0 0 0 4 4 4 0 0 0 0
Identify subgroups of documents with similar properties relative to subgroups of attributes.
Recommendation system (axes: users, movies):
5 5 0 0 0 0 0 0 0 0
5 5 0 4 4 0 0 0 0 0
5 5 0 4 4 0 1 1 1 1
0 0 0 4 4 0 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 7 7 7 0 0 0 0 0 0
0 7 7 7 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Identify subgroups of customers with similar preferences or behaviors toward a subset of products.

POsitive and NEgative correlation based Overlapping Co-Clustering (PONEOCC)
Main contributions of the PONEOCC algorithm:
1. Ranking-based objective function.
2. Positive and negative correlation.
3. Large overlapping co-clusters.
4. Handling missing values.
Positive and negative correlation:
Positive correlation: similar patterns.
Negative correlation: opposite patterns.
Reference: Omar Odibat and Chandan K. Reddy, "A Generalized Co-clustering Framework for Mining Arbitrarily Positioned Overlapping Co-clusters", In Proceedings of the SIAM International Conference on Data Mining (SDM), Phoenix, AZ, April 2011.

PONEOCC: Model
The Mean Squared Residue (MSR) function is used to measure the homogeneity of a bicluster X.
[Figure: three example biclusters with error = 0, error = 0, and error = 6.1]
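For reference, a minimal sketch of the standard mean squared residue as defined by Cheng and Church: each entry's residue is the entry minus its row mean, minus its column mean, plus the overall mean. The example biclusters are hypothetical, and PONEOCC's ranking-based objective may differ from this plain form.

```python
def mean_squared_residue(bicluster):
    """Standard MSR of a bicluster given as a list of equal-length rows."""
    n_rows, n_cols = len(bicluster), len(bicluster[0])
    row_means = [sum(row) / n_cols for row in bicluster]
    col_means = [sum(row[j] for row in bicluster) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(bicluster):
        for j, value in enumerate(row):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical biclusters: constant values, rows shifted by a constant, and noisy values.
constant = [[5, 5], [5, 5]]
shifted = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
noisy = [[1, 9, 2], [8, 1, 7], [3, 6, 1]]
msr_constant = mean_squared_residue(constant)
msr_shifted = mean_squared_residue(shifted)
msr_noisy = mean_squared_residue(noisy)
```

Constant and additively shifted biclusters have a residue of zero (up to floating-point error), while the noisy one has a clearly positive residue.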

PONEOCC: Main Steps
1. Initialization
2. Core co-clustering
3. Merging
4. Refinement

PONEOCC: Results
Existing co-clustering algorithms:
1. CC [Cheng and Church, 2000]
2. OPSM [Ben-Dor et al., 2003]
3. ISA [Ihmels et al., 2004]
4. ROCC [Deodhar et al., 2009]
Statistical analysis: average of seven data sets; score = 1 - error, with the score in [0, 1].
[Figure: % of significant biclusters plotted against significance level]

PONEOCC: Examples
[Figures: a bicluster with positive correlation; a bicluster with positive and negative correlation]

The DiBiCLUS Algorithm
Main contributions of DiBiCLUS:
I. Incorporating the class labels in the biclustering.
II. Using k-means to quantize the gene values.
III. The ability to find overlapping biclusters.
Goals:
Find the sets of genes that are correlated in one class of conditions but not in the other class.
Find the sets of genes that have different types of correlation in the two classes (positive/negative).
Reference: Omar Odibat, Chandan K. Reddy and Craig N. Giroux, "Differential Biclustering for Gene Expression Analysis", In Proceedings of the ACM Conference on Bioinformatics and Computational Biology (BCB), Niagara Falls, NY, 2010.

Overview of DiBiCLUS
Step 1: Quantization.
Step 2: Finding the differential pairs of genes.
Step 3: Identifying the differential biclusters.
Step 4: Merging highly overlapping biclusters.

Quantization
Original values: 0.35 0.36 0.37 0.93 0.99 1.2 1.29 1.3 1.36 1.39
Quantized values (k=3): -1 -1 -1 0 0 +1 +1 +1 +1 +1
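The quantization step can be sketched as one-dimensional k-means on a single gene's values, with the clusters relabeled from lowest to highest centroid. This is a minimal sketch under assumptions: the function name `quantize`, the evenly spaced initial centroids, and the mapping of cluster ranks to levels centered on 0 are illustrative choices, not details confirmed by the slides.

```python
def quantize(values, k=3, iterations=100):
    """Quantize values with 1-D k-means (k >= 2); levels are centroid ranks centered on 0."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialization: centroids taken at evenly spaced positions of the sorted values.
    centroids = [ordered[(i * (n - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Rank clusters by centroid so that levels increase with expression.
    order = sorted(range(k), key=lambda c: centroids[c])
    rank = {cluster: r for r, cluster in enumerate(order)}
    levels = []
    for v in values:
        nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
        levels.append(rank[nearest] - k // 2)
    return levels


values = [0.35, 0.36, 0.37, 0.93, 0.99, 1.2, 1.29, 1.3, 1.36, 1.39]
levels = quantize(values, k=3)
```

With the values from the slide and k=3, this sketch yields the same three groups as the quantized row above.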

Differential Pairs: Two Criteria
1. Different co-expression type (positive in one class and negative in the other, or vice versa).
2. Same co-expression type in both classes, but
$\frac{\mathrm{sim}_A(g_1, g_2)}{N_A} > \frac{\mathrm{sim}_B(g_1, g_2)}{N_B} + \delta$ or $\frac{\mathrm{sim}_B(g_1, g_2)}{N_B} > \frac{\mathrm{sim}_A(g_1, g_2)}{N_A} + \delta$
$N_A$ is the number of conditions in class A.
$N_B$ is the number of conditions in class B.
$\delta$ is a user threshold (e.g., 25%).

Differential Pairs - Example
Class A:
g1: +1 +2 +1 -2 -1
g2: +1 -1 -2 +1 -1
$\mathrm{sim}_A(g_1, g_2) / N_A = 2/5 = 0.4$
Class B:
g1: +2 -2 -1 -2 +1 -1 +2
g2: +2 -2 -2 -2 +1 -1 +2
$\mathrm{sim}_B(g_1, g_2) / N_B = 6/7 = 0.86$
0.86 > 0.4 + 0.25, so g1 and g2 are considered a differential pair in class B.
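The example above can be reproduced with a short sketch. It assumes that sim counts the conditions in which the two genes have the same quantized value; that reading is inferred from the 2/5 and 6/7 figures rather than stated on the slide. Only the second criterion (same co-expression type, different strength) is covered, and the function names are hypothetical.

```python
def similarity(gene_1, gene_2):
    """Assumed definition: number of conditions where the quantized values agree."""
    return sum(1 for a, b in zip(gene_1, gene_2) if a == b)


def is_differential_pair(g1_a, g2_a, g1_b, g2_b, delta=0.25):
    """True if the normalized similarity differs between the classes by more than delta."""
    frac_a = similarity(g1_a, g2_a) / len(g1_a)
    frac_b = similarity(g1_b, g2_b) / len(g1_b)
    return abs(frac_a - frac_b) > delta


# Quantized values from the slide.
g1_a, g2_a = [+1, +2, +1, -2, -1], [+1, -1, -2, +1, -1]
g1_b, g2_b = [+2, -2, -1, -2, +1, -1, +2], [+2, -2, -2, -2, +1, -1, +2]

sim_a = similarity(g1_a, g2_a)
sim_b = similarity(g1_b, g2_b)
differential = is_differential_pair(g1_a, g2_a, g1_b, g2_b, delta=0.25)
```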

Finding Differential Pairs: Case 1
[Figure: expression profiles of two genes in Class A and in Class B]
The two genes are more strongly correlated in class B than in class A, so they are considered a differential pair.

Finding Differential Pairs: Case 2
[Figure: expression profiles of two genes in Class A and in Class B]
The two genes are negatively correlated in 10 samples in class A but positively correlated in 10 samples in class B.

From Differential Pairs to Differential Biclusters
Keep dividing the differential pairs until all the biclusters are found.
[Figure annotation: this row indicates that g1 and g2 are correlated in s1, s3, s6, s7, s9 and s10]
Bicluster 1: Genes = {1, 2, 3, 5, 7, 9}, Samples = {3, 6, 7}
Bicluster 2: Genes = {1, 4, 6, 8, 10}, Samples = {2, 5, 6, 8, 9}

DiBiCLUS: Results
The prostate cancer dataset:
Class A: an early stage of prostate cancer (low grade), 433 samples.
Class B: a developed stage of prostate cancer (high grade), 208 samples.
[Figures: examples of the biclusters; p-value analysis]

Significance of the Results - Example
The genes are shown to map to a closely related local sub-network in the IPA biological interaction Knowledge Base. This mapping suggests that these three genes function in closely related biological processes associated with the aggressive state of prostate cancer.
[Figure: pathway obtained from the IPA knowledge base for the genes ACTA2, MTA1 and DVL; panels: Class B, Class A]


Gene Networks
Gene networks play a key role in modeling gene activities and in understanding the functions of cells.
Nodes represent genes.
Links represent correlation between genes.
[Figure: example gene network; picture from Nayak et al., Genome Research, 2009]

Building Gene Networks (Reverse Engineering)
Bayesian networks.
Information-theoretic approach.
Boolean networks.
Ordinary differential equations.
Pipeline: DNA microarray, then adjacency matrix, then gene network.
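To make the pipeline from expression data to a network concrete, here is a minimal sketch that uses a simple correlation threshold. This is not one of the four methods listed above; it is the simplest co-expression construction, chosen only for illustration. The gene names, values, and the 0.8 threshold are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coexpression_network(expression, threshold=0.8):
    """Link two genes when the absolute correlation of their profiles reaches the threshold."""
    genes = list(expression)
    adjacency = {gene: set() for gene in genes}
    for i, gene in enumerate(genes):
        for other in genes[i + 1:]:
            if abs(pearson(expression[gene], expression[other])) >= threshold:
                adjacency[gene].add(other)
                adjacency[other].add(gene)
    return adjacency


# Hypothetical expression profiles over four samples.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],
    "G3": [4.0, 3.0, 2.0, 1.0],
    "G4": [1.0, 3.0, 1.0, 3.0],
}
network = coexpression_network(expression)
```

Taking the absolute value keeps strongly negative correlations (G1 and G3 here) as links, in line with the positive and negative co-expression discussed earlier.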

Hubs, Cliques, Centrality

Scale-Free Networks
The vast majority of nodes have only a few connections, and a few nodes are very highly connected.
k: the number of connections.
P(k): the number of nodes with k connections, divided by the total number of nodes.
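The definition of P(k) translates directly into code. A minimal sketch, assuming the network is stored as a dictionary that maps each node to its set of neighbors; the star-shaped example network is hypothetical.

```python
from collections import Counter


def degree_distribution(adjacency):
    """P(k): fraction of nodes that have exactly k connections."""
    degrees = [len(neighbors) for neighbors in adjacency.values()]
    counts = Counter(degrees)
    return {k: counts[k] / len(degrees) for k in sorted(counts)}


# Hypothetical network: one hub connected to five otherwise isolated nodes.
leaves = ["n1", "n2", "n3", "n4", "n5"]
star = {"hub": set(leaves)}
star.update({leaf: {"hub"} for leaf in leaves})
p_k = degree_distribution(star)
```

In a scale-free network, P(k) computed this way falls off roughly as a power law in k, so a log-log plot of P(k) against k is close to a straight line.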

Differential Networking
Comparing the structure of the cancer and control co-expression networks provides insight into disease-specific alterations.
Genes that have a strongly altered connectivity are assumed to play an important role in the disease phenotype.
It uncovers differences in modules and connectivity between data sets.
It reveals genes and pathways that are wired differently in different sample populations.

Differential Networking
Data set A and data set B each go through network construction.
The two networks are then compared by differential network analysis (our contribution).
The output is a ranked gene list.

Differential Networking
How can we identify the genes that are responsible for the changes between two gene networks?
[Figure: two networks with the same nodes but different links]

The Proposed Model
Existing approaches use statistical tests to compare networks based on:
The connectivity of genes (differential genes).
The weight of the edges (differential edges).
Inspired by the power of the PageRank algorithm, we propose a data mining approach: the Differential Genes Ranking algorithm (DiGeR), which ranks the genes based on their contribution to the differences between two gene co-expression networks.

Centrality Measure
Small changes in the expression level of the central genes could significantly alter the interconnection and the topology of the gene network.
[Figure: nodes with low betweenness centrality and high betweenness centrality]
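For readers unfamiliar with betweenness centrality, the following is a minimal sketch of Brandes' algorithm for an unweighted, undirected network stored as a dictionary of neighbor sets. The slides do not say how the centrality was computed, so this is the textbook definition (number of shortest paths through a node, with ties split fractionally) on a hypothetical five-node network.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted, undirected)."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths (sigma) and records predecessors.
        visited_order = []
        preds = {node: [] for node in adjacency}
        sigma = {node: 0 for node in adjacency}
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visited_order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward the source.
        delta = {node: 0.0 for node in adjacency}
        while visited_order:
            w = visited_order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical network: a triangle a-b-c with a tail c-d-e.
network = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
centrality = betweenness(network)
```

Node c bridges the triangle and the tail, so it carries the most shortest paths; a and b carry none.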

Differential Betweenness Centrality
The shaded node has the same degree and the same betweenness centrality in both networks, but the shortest paths that pass through that node differ between the two networks.
We should therefore compare the shortest paths between the networks:
dbc = unique shortest paths.
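One possible reading of "dbc = unique shortest paths" is sketched below: collect the shortest paths that pass through a node in each network and keep those that occur in only one of the two. This interpretation, the function names, and the two toy networks are assumptions for illustration; the slides do not give the exact definition used in DiGeR.

```python
from collections import deque
from itertools import combinations


def shortest_paths(adjacency, source, target):
    """All shortest paths from source to target, each as a tuple of nodes."""
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                preds[w] = []
                queue.append(w)
            if dist[w] == dist[v] + 1:
                preds[w].append(v)
    if target not in dist:
        return []  # target is unreachable from source

    def build(node):
        if node == source:
            return [(source,)]
        return [path + (node,) for p in preds[node] for path in build(p)]

    return build(target)


def paths_through(adjacency, node):
    """Shortest paths between all node pairs that use `node` as an interior node."""
    found = set()
    for s, t in combinations(sorted(adjacency), 2):
        for path in shortest_paths(adjacency, s, t):
            if node in path[1:-1]:
                found.add(path)
    return found


def differential_paths(network_a, network_b, node):
    """Assumed reading of dbc: paths through `node` that occur in only one network."""
    return paths_through(network_a, node) ^ paths_through(network_b, node)


# Hypothetical networks: x has degree 2 and betweenness 1 in both, but bridges different genes.
network_a = {"a": {"x"}, "b": {"x"}, "x": {"a", "b"}, "c": {"d"}, "d": {"c"}}
network_b = {"a": {"b"}, "b": {"a"}, "x": {"c", "d"}, "c": {"x"}, "d": {"x"}}
unique_paths = differential_paths(network_a, network_b, "x")
```

Degree and betweenness alone cannot separate the two networks for node x, but the sets of shortest paths through it are disjoint. Enumerating all shortest paths is only practical for small networks.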

The Proposed Model
[Diagram: Rank, Centrality, Connectivity]

Example
How should the differentially connected genes be ranked?


Example 1 (Prostate Cancer Data)
This set contains highly ranked genes that form a clique in the high tumor grade network but are less connected in the low grade network.
A lower rank value means a more differential gene.

Example 2 (Prostate Cancer Data)
This set contains highly ranked genes that form a clique in the high tumor grade network but are less connected in the low grade network.

Example 3 (Prostate Cancer Data)
This set contains highly ranked genes that form a clique in the low tumor grade network but are less connected in the high grade network.

Conclusion
Differential modeling of microarray data helps associate differences in gene expression profiles with phenotypic differences across conditions.
Differential modeling can find the most significant genes relevant to phenotypic variation and the genes that are related to disease.

Acknowledgments
Advisor: Dr. Chandan Reddy
Collaborator: Dr. Craig N. Giroux, Karmanos Cancer Institute

Email: odibat@wayne.edu