Differential Modeling for Cancer Microarray Data

Omar Odibat, Department of Computer Science, Feb 01, 2011

Outline
Introduction: cancer microarray data; problem definition
Differential analysis: existing methods; limitations
Differential biclustering: biclustering; proposed algorithm; results
Differential networking: gene networks; proposed algorithm; results
Conclusion

The Biology and the Technology
The central dogma (picture from http://www.ncbi.nlm.nih.gov/).
DNA microarray (picture from http://www.dnamicroarray.net/).

Example of Cancer Microarray Data
Microarrays measure the expression level of thousands of genes under different conditions.
[Figure: expression matrix with axes labeled genes and samples.]

Sample Types
Samples can differ by tissue type (e.g., normal vs. cancerous), subject type (e.g., male vs. female), or time point (time series data). Comparing such groups is comparative gene expression analysis.
Problem: find the most significant genes relevant to phenotypic variation.

The Goals of Differential Modeling
[Figure: expression matrix of genes G1 to G7 by samples S1 to S6, split into Group A (normal): S1, S3, S4 and Group B (cancer): S2, S5, S6.]
The goal of differential analysis is to answer the following questions:
Which genes are related to cancer?
How are these genes correlated in cancer cells and in normal cells?

Applications of Differential Modeling
Differential modeling has many applications: identifying disease-causing genes; examining the effects of a certain treatment; understanding the different roles played by a given gene in two different kinds of cells; comparative gene expression analysis.

Differential Expression (DE)
Question: is the mean expression level of a gene in group A significantly different from its mean expression level in group B?
Solution: compute a t-test statistic for each gene.
Example: Significance Analysis of Microarrays (SAM) [Tusher et al., 2001]
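The per-gene test can be sketched in a few lines. This is a minimal illustration using a plain Welch t-statistic from the standard library; SAM itself uses a modified statistic that is not reproduced here. The gene names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for one gene (does not assume equal variances)."""
    var_a, var_b = variance(group_a), variance(group_b)
    std_err = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / std_err


# Hypothetical expression values, keyed by gene name (not from the slides).
normal = {"G1": [2.0, 2.1, 1.9], "G2": [5.0, 5.2, 4.8]}
cancer = {"G1": [2.0, 1.9, 2.1], "G2": [8.0, 8.3, 7.7]}

# One statistic per gene, then rank genes by the size of the difference.
t_stats = {gene: welch_t(normal[gene], cancer[gene]) for gene in normal}
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
print(ranked)
```

Each gene is tested on its own, which is exactly the property the later slides identify as a limitation.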

Differential Variability (DV)
Question: is the variance of the expression level of a gene in group A significantly different from its variance in group B?
Solution: compute an F-test statistic for each gene.
Example: AlteredExpression [Prieto et al., 2006]
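A rough sketch of the variance comparison follows. It computes only the F ratio of the two sample variances for one gene, with the larger variance on top (a convention assumed here); turning the ratio into a p-value and the details of AlteredExpression are left out. The sample values are made up.

```python
from statistics import variance


def f_statistic(group_a, group_b):
    """Ratio of sample variances, larger over smaller, so the result is >= 1."""
    var_a, var_b = variance(group_a), variance(group_b)
    larger, smaller = max(var_a, var_b), min(var_a, var_b)
    if smaller == 0:
        return float("inf")  # one group has no spread at all
    return larger / smaller


# Hypothetical expression values for a single gene in two groups.
group_a = [1.0, 2.0, 3.0]  # sample variance 1.0
group_b = [0.0, 2.0, 4.0]  # sample variance 4.0
f_value = f_statistic(group_a, group_b)
print(f_value)
```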

Limitations of DE & DV Methods
DE and DV methods study genes individually: they perform a statistical test for each gene separately and do not capture the relationships between genes. They cannot find differences in the co-expression patterns between normal and disease samples. It has been shown that some disease genes are highly differentially co-expressed but not differentially expressed.
Therefore, we propose two data mining approaches that study groups of genes: differential biclustering and differential networking.

Clustering
Group similar objects together. Examples: k-means clustering and hierarchical clustering.

Traditional Clustering Algorithms
In traditional clustering methods, such as k-means and hierarchical clustering, the similarity is computed across all the features. These methods fail in two situations:
Only a small set of the genes participates in a cellular process of interest.
An interesting cellular process is active only in a subset of the conditions.

Biclustering (Co-clustering)
The genes are NOT correlated in all of the samples; they are correlated in a subset of the samples.
[Figure: expression level plotted against samples.]

Biclustering identifies a subset of objects that are similar under a subset of features.
[Figure: genes-by-samples matrix containing more complicated biclusters: arbitrarily positioned and overlapping.]
5 5 0 0 0 0 0 0 0 0
5 5 0 0 0 0 0 0 0 0
5 5 0 0 0 0 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 7 7 7 0 0 0 0 0 0
0 7 7 7 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
These biclusters cannot be identified using traditional clustering algorithms such as k-means or hierarchical clustering.

Applications of Biclustering
Text mining (words and documents): identify subgroups of documents with similar properties relative to subgroups of attributes.
0 0 0 0 0 0 0 9 9 0
0 3 3 3 3 0 0 9 9 0
0 3 3 3 3 0 0 9 9 0
0 3 3 3 3 0 0 9 9 0
0 0 0 0 0 0 0 9 9 0
0 0 0 4 4 4 0 0 0 0
0 0 0 4 4 4 0 0 0 0
0 0 0 4 4 4 0 0 0 0
Recommendation system (users and movies): identify subgroups of customers with similar preferences or behaviors toward a subset of products.
5 5 0 0 0 0 0 0 0 0
5 5 0 4 4 0 0 0 0 0
5 5 0 4 4 0 1 1 1 1
0 0 0 4 4 0 1 1 1 1
0 0 0 0 0 0 1 1 1 1
0 7 7 7 0 0 0 0 0 0
0 7 7 7 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

POsitive and NEgative correlation based Overlapping Co-Clustering (PONEOCC)
Main contributions of the PONEOCC algorithm:
1. Ranking-based objective function.
2. Positive and negative correlation.
3. Large overlapping co-clusters.
4. Handling missing values.
Positive and negative correlation: positive correlation means similar patterns; negative correlation means opposite patterns.
Omar Odibat and Chandan K. Reddy, "A Generalized Co-clustering Framework for Mining Arbitrarily Positioned Overlapping Co-clusters", In Proceedings of the SIAM International Conference on Data Mining (SDM), Phoenix, AZ, April 2011.
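To make "similar patterns" and "opposite patterns" concrete, here is a small sketch using the Pearson correlation coefficient. The slides do not say which correlation measure PONEOCC uses, so Pearson is an assumption, and the three gene profiles are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical expression profiles over four samples.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]  # rises with g1: similar pattern
g3 = [4.0, 3.0, 2.0, 1.0]  # falls as g1 rises: opposite pattern

print(pearson(g1, g2))  # close to +1
print(pearson(g1, g3))  # close to -1
```

A method that only looks for positive correlation would group g1 with g2 and miss g3, even though g3 is just as tightly coupled to g1.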

PONEOCC: Model
The Mean Squared Residue (MSR) function is used to measure the homogeneity of a bicluster X.
[Figure: three example biclusters with error = 0, error = 0, and error = 6.1.]
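As a rough illustration, the sketch below implements the standard MSR definition from Cheng and Church: each entry's residue is the entry minus its row mean, minus its column mean, plus the overall mean, and the MSR is the average squared residue. Whether PONEOCC uses exactly this form is not stated on the slide, and the example matrices are made up (they are not the ones in the figure).

```python
def mean_squared_residue(X):
    """Cheng-and-Church mean squared residue of a bicluster given as a list of rows."""
    n_rows, n_cols = len(X), len(X[0])
    row_means = [sum(row) / n_cols for row in X]
    col_means = [sum(X[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = X[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


constant = [[5.0, 5.0], [5.0, 5.0]]                             # constant block
shifted = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]   # rows differ by a shift
noisy = [[1.0, 2.0], [2.0, 9.0]]                                # no consistent pattern

print(mean_squared_residue(constant))  # 0.0
print(mean_squared_residue(shifted))   # 0.0
print(mean_squared_residue(noisy))     # 2.25
```

A score of zero means the rows follow the same pattern up to an additive shift; larger values indicate a less homogeneous bicluster.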

PONEOCC: Main Steps
1. Initialization
2. Core co-clustering
3. Merging
4. Refinement

PONEOCC: Results
Existing co-clustering algorithms:
1. CC [Cheng and Church, 2000]
2. OPSM [Ben-Dor et al., 2003]
3. ISA [Ihmels et al., 2004]
4. ROCC [Deodhar et al., 2009]
Statistical analysis: average of seven data sets; score = 1 - error; score is in [0, 1].
[Figure: % of significant biclusters against significance level.]

PONEOCC: Examples
[Figure: example co-clusters showing positive correlation, and positive and negative correlation.]

The DiBiCLUS Algorithm
Main contributions of DiBiCLUS:
I. Incorporating the class labels in the biclustering.
II. Using k-means to quantize the gene values.
III. The ability to find overlapping biclusters.
Goals:
Find the sets of genes that are correlated in one class of conditions, but not in the other class.
Find the sets of genes that have different types of correlation in the two classes (positive/negative).
Omar Odibat, Chandan K. Reddy and Craig N. Giroux, "Differential Biclustering for Gene Expression Analysis", In Proceedings of the ACM Conference on Bioinformatics and Computational Biology (BCB), Niagara Falls, NY, 2010.

Overview of DiBiCLUS
Step 1: Quantization.
Step 2: Finding the differential pairs of genes.
Step 3: Identifying the differential biclusters.
Step 4: Merging highly overlapping biclusters.

Quantization
Original value    Quantized value (k=3)
0.35              -1
0.36              -1
0.37              -1
0.93               0
0.99               0
1.2               +1
1.29              +1
1.3               +1
1.36              +1
1.39              +1
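The quantization above can be reproduced with a small one-dimensional k-means. This is a sketch under stated assumptions, not the DiBiCLUS implementation: the centroids start evenly spaced between the minimum and maximum value, k is odd and at least 3, and clusters are mapped to symmetric integer levels by the order of their centroids. The function name `quantize` is hypothetical.

```python
def quantize(values, k=3, max_iters=100):
    """Quantize one gene's values with 1-D k-means (k odd and >= 3 is assumed)."""
    lo, hi = min(values), max(values)
    # Assumption: evenly spaced initial centroids make the result deterministic.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iters):
        # Assign each value to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        if new_labels == labels:
            break  # assignments are stable, so k-means has converged
        labels = new_labels
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    # Map clusters to levels -k//2 .. +k//2 in order of increasing centroid.
    order = sorted(range(k), key=lambda c: centroids[c])
    level = {cluster: rank - k // 2 for rank, cluster in enumerate(order)}
    return [level[lab] for lab in labels]


original = [0.35, 0.36, 0.37, 0.93, 0.99, 1.2, 1.29, 1.3, 1.36, 1.39]
print(quantize(original, k=3))
```

With k=3 the ten values fall into the levels -1, 0 and +1 as in the table; the later example with values between -2 and +2 would correspond to k=5.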

Differential Pairs: Two Criteria
1. Different co-expression type (positive in one class and negative in the other one, or vice versa).
2. Same co-expression type in both classes, but
$\frac{sim_A(g_1,g_2)}{N_A} - \frac{sim_B(g_1,g_2)}{N_B} > \delta$ or $\frac{sim_B(g_1,g_2)}{N_B} - \frac{sim_A(g_1,g_2)}{N_A} > \delta$
$N_A$ is the number of conditions in class A. $N_B$ is the number of conditions in class B. $\delta$ is a user threshold (e.g., 25%).

Differential Pairs - Example
Class A ($N_A = 5$):
g1: +1 +2 +1 -2 -1
g2: +1 -1 -2 +1 -1
$sim_A(g_1,g_2) / N_A = 2/5 = 0.4$
Class B ($N_B = 7$):
g1: +2 -2 -1 -2 +1 -1 +2
g2: +2 -2 -2 -2 +1 -1 +2
$sim_B(g_1,g_2) / N_B = 6/7 = 0.86$
0.86 > 0.4 + 0.25, so g1 and g2 are considered a differential pair in class B.
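The arithmetic of this example can be checked with a short sketch. It assumes that sim counts the conditions in which the two quantized profiles take the same value; that reading reproduces the slide's 2/5 and 6/7 but is an inference, not a definition given in the slides. Only the second criterion (same co-expression type, different strength) is covered, and the function names are hypothetical.

```python
def sim(profile_1, profile_2):
    """Number of conditions where two quantized profiles agree (assumed definition)."""
    return sum(1 for a, b in zip(profile_1, profile_2) if a == b)


def differential_class(g1_a, g2_a, g1_b, g2_b, delta=0.25):
    """Return 'A' or 'B' if the pair is differential in that class, else None."""
    ratio_a = sim(g1_a, g2_a) / len(g1_a)
    ratio_b = sim(g1_b, g2_b) / len(g1_b)
    if ratio_a - ratio_b > delta:
        return "A"
    if ratio_b - ratio_a > delta:
        return "B"
    return None


# Quantized values from the example slide.
g1_a, g2_a = [+1, +2, +1, -2, -1], [+1, -1, -2, +1, -1]
g1_b, g2_b = [+2, -2, -1, -2, +1, -1, +2], [+2, -2, -2, -2, +1, -1, +2]

print(sim(g1_a, g2_a), sim(g1_b, g2_b))
print(differential_class(g1_a, g2_a, g1_b, g2_b, delta=0.25))
```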

Finding Differential Pairs: Case 1
[Figure: expression of two genes in Class A and in Class B.]
The two genes are more strongly correlated in class B than in class A, so they are considered a differential pair.

Finding Differential Pairs: Case 2
[Figure: expression of two genes in Class A and in Class B.]
The two genes are negatively correlated in 10 samples of class A but positively correlated in 10 samples of class B.

From Differential Pairs to Differential Biclusters
Keep dividing the differential pairs until all the biclusters are found.
[Figure: table of differential pairs; the highlighted row indicates that g1 and g2 are correlated in s1, s3, s6, s7, s9 and s10.]
Bicluster 1: Genes = {1, 2, 3, 5, 7, 9}, Samples = {3, 6, 7}
Bicluster 2: Genes = {1, 4, 6, 8, 10}, Samples = {2, 5, 6, 8, 9}

DiBiCLUS: Results
The prostate cancer dataset:
Class A: an early stage of prostate cancer (low grade), 433 samples.
Class B: a developed stage of prostate cancer (high grade), 208 samples.
[Figures: examples of the biclusters; p-value analysis.]

Significance of the Results - Example
The genes are shown to be mapped to a closely related local sub-network in the IPA biological interaction knowledge base. This mapping result suggests that these three genes function in closely related biological processes associated with the aggressive state of prostate cancer.
[Figure: Class A and Class B panels; pathway obtained from the IPA knowledge base for the genes ACTA2, MTA1 and DVL.]

Gene Networks
Gene networks play a key role in modeling gene activities and in understanding the functions of cells. Nodes represent genes, and links represent correlation between genes.
[Figure: example gene network with nodes labeled Gene 1 and Gene 2; picture from Nayak et al., Genome Research, 2009.]

Building Gene Networks (Reverse Engineering)
Methods: Bayesian networks; information-theoretic approach; Boolean networks; ordinary differential equations.
Pipeline: DNA microarray -> adjacency matrix -> gene network.
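The step from expression data to an adjacency structure can be illustrated with a deliberately simple stand-in: connect two genes when the absolute Pearson correlation of their profiles reaches a threshold. This is not one of the four methods listed above, and the threshold of 0.8, the function names, and the expression values are all assumptions made for the sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coexpression_network(expr, threshold=0.8):
    """Adjacency sets: link two genes if |correlation| >= threshold."""
    genes = sorted(expr)
    adjacency = {gene: set() for gene in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expr[g], expr[h])) >= threshold:
                adjacency[g].add(h)
                adjacency[h].add(g)
    return adjacency


# Hypothetical expression profiles (gene -> values over four samples).
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],
    "G3": [4.0, 3.0, 2.0, 1.0],
    "G4": [1.0, 3.0, 1.0, 3.0],
}
network = coexpression_network(expr)
print(network)
```

Using the absolute value keeps both positively and negatively correlated pairs as links; G4 ends up isolated because it is only weakly correlated with the other genes.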

Hubs, Cliques, Centrality

Scale-Free Networks
The vast majority of nodes have only a few connections, and a few nodes are very highly connected.
k: the number of connections.
P(k): the number of nodes with k connections, divided by the total number of nodes.
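The definition of P(k) translates directly into code. The sketch below computes it for a network stored as adjacency sets; the star-shaped example network is invented to show one hub and several low-degree nodes.

```python
from collections import Counter


def degree_distribution(adjacency):
    """P(k): fraction of nodes that have exactly k connections."""
    n_nodes = len(adjacency)
    counts = Counter(len(neighbors) for neighbors in adjacency.values())
    return {k: count / n_nodes for k, count in sorted(counts.items())}


# Hypothetical network: one hub connected to four otherwise unconnected genes.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}
p_k = degree_distribution(star)
print(p_k)  # {1: 0.8, 4: 0.2}
```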

Differential Networking
Comparing the structure of the cancer and control co-expression networks provides insight into disease-specific alterations.
Genes that have a strongly altered connectivity are assumed to play an important role in the disease phenotype.
Differential networking uncovers differences in modules and connectivity between data sets, and reveals genes and pathways that are wired differently in different sample populations.

Differential Networking
Data set A -> network construction; data set B -> network construction.
Both networks -> differential network analysis (our contribution) -> ranked gene list.

Differential Networking
How can we identify the genes that are responsible for the changes between two gene networks?
[Figure: two networks with the same nodes but different links.]
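A first, naive way to look at "same nodes but different links" is to list the links unique to each network and the change in each gene's degree. This is only a baseline sketch, not the ranking algorithm proposed in the talk; the two small networks and the function names are hypothetical.

```python
def edge_set(adjacency):
    """Undirected edges of a network stored as adjacency sets."""
    return {frozenset((u, v)) for u in adjacency for v in adjacency[u]}


def differential_edges(adj_a, adj_b):
    """Edges present only in network A and edges present only in network B."""
    edges_a, edges_b = edge_set(adj_a), edge_set(adj_b)
    return edges_a - edges_b, edges_b - edges_a


def degree_change(adj_a, adj_b):
    """Degree in network B minus degree in network A, per gene."""
    return {gene: len(adj_b[gene]) - len(adj_a[gene]) for gene in adj_a}


# Two hypothetical networks over the same four genes.
net_a = {"G1": {"G2"}, "G2": {"G1", "G3"}, "G3": {"G2"}, "G4": set()}
net_b = {"G1": {"G2", "G3", "G4"}, "G2": {"G1"}, "G3": {"G1"}, "G4": {"G1"}}

only_a, only_b = differential_edges(net_a, net_b)
changes = degree_change(net_a, net_b)
print(only_a, only_b, changes)
```

Note that G3 has the same degree in both networks even though its only link changed, which is why degree alone can miss rewired genes.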

The Proposed Model
Existing approaches use statistical tests to compare networks based on the connectivity of genes (differential genes) or the weight of the edges (differential edges).
Inspired by the power of the PageRank algorithm, we propose a data mining approach: the Differential Genes Ranking algorithm (DiGeR), which ranks the genes based on their contribution to the differences between two gene co-expression networks.

Centrality Measure
Small changes in the expression level of the central genes could significantly alter the interconnection and the topology of the gene network.
[Figure: nodes with low betweenness centrality and high betweenness centrality.]

Differential Betweenness Centrality
The shaded node has the same degree and the same betweenness centrality in both networks, but the shortest paths that pass through that node differ between the two networks. We should therefore compare the shortest paths between the networks: dbc = unique shortest paths.
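One possible reading of "dbc = unique shortest paths" is sketched below: collect the shortest paths that pass through a node in each network, and count the paths that appear in one network but not the other. This interpretation, the function names, and the toy networks are assumptions; the slide does not give the exact definition used by DiGeR. The brute-force enumeration is only suitable for small networks.

```python
from collections import deque
from itertools import combinations


def shortest_paths(adjacency, source, target):
    """All shortest paths from source to target, as tuples of nodes."""
    dist, preds = {source: 0}, {source: []}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in sorted(adjacency[u]):
            if w not in dist:
                dist[w] = dist[u] + 1
                preds[w] = [u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                preds[w].append(u)  # another equally short way to reach w

    def build(node):
        if node == source:
            return [(source,)]
        return [path + (node,) for p in preds[node] for path in build(p)]

    return build(target) if target in dist else []


def paths_through(adjacency, node):
    """Shortest paths between other node pairs that pass through `node`."""
    found = set()
    for s, t in combinations(sorted(adjacency), 2):
        if node in (s, t):
            continue
        found.update(p for p in shortest_paths(adjacency, s, t) if node in p)
    return found


def differential_betweenness(adj_a, adj_b, node):
    """Number of shortest paths through `node` found in only one of the networks."""
    return len(paths_through(adj_a, node) ^ paths_through(adj_b, node))


# Hypothetical networks: x has degree 2 and lies on one shortest path in both.
net_a = {"x": {"a", "b"}, "a": {"x"}, "b": {"x"}, "c": {"d"}, "d": {"c"}}
net_b = {"x": {"c", "d"}, "c": {"x"}, "d": {"x"}, "a": {"b"}, "b": {"a"}}
print(differential_betweenness(net_a, net_b, "x"))  # 2
```

In this example x looks identical by degree and by the number of shortest paths it carries, yet the paths themselves (a-x-b versus c-x-d) are entirely different.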

The Proposed Model
[Diagram: rank, centrality, connectivity.]

Example
How to rank the differentially connected genes?

Example

Example 1 (Prostate Cancer Data)
This set contains highly ranked genes that form a clique in the high tumor grade network but are less connected in the low grade network. A lower rank means a more differential gene.

Example 2 (Prostate Cancer Data)
This set contains highly ranked genes that form a clique in the high tumor grade network but are less connected in the low grade network.

Example 3 (Prostate Cancer Data)
This set contains highly ranked genes that form a clique in the low tumor grade network but are less connected in the high grade network.

Conclusion
Differential modeling of microarray data helps associate differences in gene expression profiles with phenotypic differences across conditions.
Differential modeling can find the most significant genes relevant to phenotypic variation and the genes that are related to disease.

Acknowledgments
Advisor: Dr. Chandan Reddy
Collaborator: Dr. Craig N. Giroux, Karmanos Cancer Institute

Email: odibat@wayne.edu