Cluster Analysis of Gene Expression Microarray Data

BIOL 495S / CS 490B / MATH 490B / STAT 490B Introduction to Bioinformatics, April 8, 2002



Data representations

Data are relative measurements, $\log_2(\text{red}/\text{green})$, usually collected for many genes under many experimental conditions.

Tabular representation: for example, rows are genes and columns are different experimental conditions.

Graphical representation with color-coding: high expression ratios are coded red and low ratios are coded green, with the brightness of the color proportional to the magnitude of the differential expression. A ratio of 1 (a log ratio of 0) is black.
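The log-ratio and color-coding conventions above can be sketched in a few lines of Python. This is only an illustration: the function names and the `saturation` cutoff (the absolute log ratio drawn at full brightness) are assumptions, not part of the slides.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


def ratio_color(value, saturation=3.0):
    """Map a log2 ratio to an (R, G, B) tuple with channels in 0..255.

    Positive values shade toward red, negative values toward green, and 0 is
    black. `saturation` is an assumed display setting: any |log2 ratio| at or
    above it is drawn at full brightness.
    """
    brightness = min(abs(value) / saturation, 1.0)
    level = round(255 * brightness)
    if value > 0:
        return (level, 0, 0)
    if value < 0:
        return (0, level, 0)
    return (0, 0, 0)
```

With these definitions, a spot with equal red and green intensity has a log ratio of 0 and is drawn black, while a twofold induction has a log ratio of 1 and is drawn as a dim red.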


Objectives

Identify clusters of genes that share an expression profile. Genes in the same cluster are co-expressed, which hints that a gene of unknown function may share the function of the known genes it clusters with.

Clustering methods

Clustering methods work on expression data for G genes under E experiments, for example the expression of G = 2000 genes at E = 8 time intervals.

Bottom-up approaches begin with each gene in its own cluster.

Top-down approaches begin by selecting a predetermined number of clusters. Genes are then assigned to those clusters so as to minimize variation within clusters and maximize variation between them.
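As one concrete instance of the top-down idea, the sketch below uses k-means: the number of clusters `k` is fixed in advance and genes are reassigned until within-cluster variation stops improving. The slides do not name a specific top-down algorithm, so the choice of k-means, the function names, and the naive seeding (the first `k` profiles) are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Element-wise mean of a non-empty list of equal-length profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def partition(profiles, k, max_iter=100):
    """Assign each profile to one of k clusters and return the label list.

    Assumption: the first k profiles seed the cluster centres. Real analyses
    usually use random or more careful seeding.
    """
    if not 1 <= k <= len(profiles):
        raise ValueError("k must be between 1 and the number of profiles")
    centres = [list(p) for p in profiles[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every gene to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres will not move
        labels = new_labels
        # Move each centre to the mean of the genes assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = mean_profile(members)
    return labels
```

Unlike the bottom-up approach, this procedure yields a flat partition rather than a tree, and its result depends on the chosen `k` and on the seeding.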

Hierarchical clustering method

Hierarchical clustering is a bottom-up approach. It begins with each gene in its own cluster. Clusters are then merged recursively based on similarity, creating a hierarchical, treelike organization.

Many of the same methods are used for both gene expression clustering and phylogeny reconstruction.

Three steps of the hierarchical clustering method by Eisen et al. (1998)

1. Construct a matrix of similarity measures between all pairs of genes.
2. Recursively cluster the genes into a treelike hierarchy.
3. Determine the boundaries between individual clusters.

Correlation coefficient is a measure of similarity of a pair of genes

Let $x_{ge}$ be the relative measure $\log_2(\text{red}/\text{green})$ for gene $g$ in experiment $e$. Generally, $x_{ge}$ is a centered value for a given array, to remove possible fluorescence bias. This centering step makes the mean value of $x_{ge}$ over all genes in an array equal 0, or equivalently the (geometric) mean ratio equal 1.

Let $\bar{x}_g$ be the average (mean) expression level for gene $g$ over the $E$ experimental conditions,

$$\bar{x}_g = \frac{1}{E} \sum_{e=1}^{E} x_{ge}$$
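A minimal sketch of the two averages defined above, assuming the data are held as plain Python lists of log2 ratios (the function names are illustrative): centering works down one array across all genes, while the gene mean works along one gene across all experiments.

```python
def center_array(log_ratios):
    """Subtract the array-wide mean so the centered log2 ratios average to 0."""
    mean = sum(log_ratios) / len(log_ratios)
    return [x - mean for x in log_ratios]


def gene_mean(profile):
    """Mean expression of one gene over its E experiments."""
    return sum(profile) / len(profile)
```

Centering is applied per array before any per-gene statistics are computed, so that a dye bias affecting a whole array does not masquerade as differential expression.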

Let $s_g$ be the standard deviation of the measures for gene $g$ in the $E$ experiments,

$$s_g = \sqrt{\frac{1}{E} \sum_{e=1}^{E} \left( x_{ge} - \bar{x}_g \right)^2}$$

Then for a pair of genes $i$ and $j$ the correlation coefficient $r_{ij}$ is

$$r_{ij} = \frac{1}{E} \sum_{e=1}^{E} \left( \frac{x_{ie} - \bar{x}_i}{s_i} \right) \left( \frac{x_{je} - \bar{x}_j}{s_j} \right)$$

Genes with similar expression profiles will have values of $r_{ij}$ near unity. The correlation coefficient is a measure of the similarity between pairs of genes.
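The correlation coefficient can be computed directly from these definitions. The sketch below is a plain-Python illustration with assumed function names; it divides by $E$ rather than $E - 1$ in the standard deviation, following the definition of $s_g$ given here.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def std(values):
    """Standard deviation that divides by E, as in the definition of s_g."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xi, xj):
    """Correlation coefficient r_ij between two equal-length expression profiles."""
    if len(xi) != len(xj):
        raise ValueError("profiles must cover the same experiments")
    mi, mj = mean(xi), mean(xj)
    si, sj = std(xi), std(xj)
    if si == 0 or sj == 0:
        raise ValueError("correlation is undefined for a flat profile")
    total = sum((a - mi) * (b - mj) for a, b in zip(xi, xj))
    return total / (len(xi) * si * sj)
```

Two profiles that rise and fall together give a value near 1, profiles that move in opposite directions give a value near -1, and a gene whose expression never changes has no defined correlation because its standard deviation is 0.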

The correlation coefficient measures only the similarity of the change patterns of a pair of genes. Because expression is standardized over experiments, it ignores the scale of the changes.
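A short derivation makes the scale-invariance explicit. The rescaled profile $y$ and the constants $a$ and $b$ are introduced here only for illustration: suppose a gene $y$ has the same pattern as gene $i$ but a different amplitude and baseline.

```latex
\begin{align*}
y_e &= a\,x_{ie} + b, \qquad a > 0 \\
\bar{y} &= a\,\bar{x}_i + b, \qquad s_y = a\,s_i \\
\frac{y_e - \bar{y}}{s_y} &= \frac{a\,(x_{ie} - \bar{x}_i)}{a\,s_i} = \frac{x_{ie} - \bar{x}_i}{s_i} \\
r_{yj} &= \frac{1}{E}\sum_{e=1}^{E} \frac{y_e - \bar{y}}{s_y}\cdot\frac{x_{je} - \bar{x}_j}{s_j} = r_{ij}
\end{align*}
```

So a gene induced tenfold and a gene induced twofold are treated as equally similar to a third gene, provided their patterns over the experiments have the same shape.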

We calculate correlation coefficients for each of the $G(G-1)/2$ pairs of genes $i$ and $j$. With the matrix of correlations in hand, we proceed to the clustering algorithm in the following recursion:

1. Find the pair of clusters with the highest correlation and combine the pair into a single cluster.
2. Update the correlation matrix using the average values of the newly combined clusters.
3. Repeat steps 1 and 2 until all genes have been clustered, which takes G-1 merges in total.

Suppose the initial correlation matrix for G = 5 genes is

          2      3      4      5
    1    0.3    0.2    0.8    0.1
    2           0.9    0.1    0.8
    3                  0.2    0.7
    4                         0.1

The highest correlation is the 0.9 observed between genes 2 and 3, so we combine them into a cluster and recompute the correlation matrix:

           23      4      5
    1     0.25    0.8    0.1
    23            0.15   0.75
    4                    0.1

Note that the correlation between the cluster 23 and 4, for example, is the average of the correlations going into the new cluster,

$$r_{23,4} = \frac{1}{2}\left( r_{2,4} + r_{3,4} \right) = \frac{1}{2}(0.1 + 0.2) = 0.15$$

We continue this process, clustering 1 with 4, then 23 and 5. The resulting hierarchy takes the form (((2, 3), 5), (1, 4)):

         +-----------+-----------+
         |                       |
      +--+--+                    |
      |     |                 +--+--+
    +-+-+   |                 |     |
    |   |   |                 |     |
    2   3   5                 1     4
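The recursion and the five-gene example can be reproduced with a short Python sketch. It assumes that the correlation of a new cluster with any other cluster is the equal-weight average of the two merged clusters' correlations, which matches the worked value for $r_{23,4}$; size-weighted averaging is a common alternative and can give different values once clusters of unequal size merge. The function name and the dictionary layout are illustrative choices.

```python
# Pairwise correlations from the five-gene example, keyed by (gene_i, gene_j).
EXAMPLE = {
    (1, 2): 0.3, (1, 3): 0.2, (1, 4): 0.8, (1, 5): 0.1,
    (2, 3): 0.9, (2, 4): 0.1, (2, 5): 0.8,
    (3, 4): 0.2, (3, 5): 0.7,
    (4, 5): 0.1,
}


def average_linkage(pair_corr):
    """Cluster genes bottom-up from a complete dict {(i, j): correlation}.

    Each cluster is a sorted tuple of gene ids. Returns the merges in the
    order they were made, as (cluster_a, cluster_b, correlation) tuples.
    """
    # Start with every gene in its own single-member cluster.
    corr = {frozenset([(i,), (j,)]): r for (i, j), r in pair_corr.items()}
    clusters = {c for pair in corr for c in pair}
    merges = []
    while len(clusters) > 1:
        # Step 1: find and merge the most correlated pair of clusters.
        best = max(corr, key=corr.get)
        a, b = sorted(best)
        merged = tuple(sorted(a + b))
        merges.append((a, b, corr[best]))
        clusters -= {a, b}
        # Step 2: the new cluster's correlation with each remaining cluster
        # is the equal-weight average of the two merged clusters' values.
        for c in clusters:
            r_a = corr.pop(frozenset([a, c]))
            r_b = corr.pop(frozenset([b, c]))
            corr[frozenset([merged, c])] = (r_a + r_b) / 2
        del corr[best]
        clusters.add(merged)
    return merges


if __name__ == "__main__":
    for left, right, r in average_linkage(EXAMPLE):
        print(left, right, round(r, 2))
```

Run on `EXAMPLE`, the merges come out in the order described in the text: 2 with 3, then 1 with 4, then 23 with 5, and finally the two remaining clusters.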

The final step in the clustering process is to determine the boundaries of individual clusters. In the above example, do 2 and 3 form a cluster to the exclusion of 5, or is there a single 235 cluster? In some cases, we define clusters based on shared biological function. For instance, if genes 2, 3, and 5 all encoded heat-shock proteins, it might be sensible to include all of them in a single cluster.
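When no biological criterion is available, one simple alternative is to cut the tree at a fixed correlation threshold. The slides do not prescribe this; the sketch below is an assumption-laden illustration in which only merges made at or above the threshold are kept. The merge list is taken from the five-gene example, with the final value of 0.15 obtained by equal-weight averaging (an assumption about how the update rule is applied to merged clusters).

```python
# Merges from the five-gene example: (members_a, members_b, correlation).
# The last value assumes equal-weight averaging of merged clusters.
MERGES = [
    ((2,), (3,), 0.9),
    ((1,), (4,), 0.8),
    ((2, 3), (5,), 0.75),
    ((1, 4), (2, 3, 5), 0.15),
]


def cut_tree(genes, merges, threshold):
    """Return the clusters formed by merges with correlation >= threshold."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links up to the representative gene of g's cluster.
        while parent[g] != g:
            g = parent[g]
        return g

    for a, b, r in merges:
        if r >= threshold:
            parent[find(b[0])] = find(a[0])

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())
```

In this example the answer to the question above depends on the cutoff: a threshold of 0.8 keeps 2 and 3 together while leaving 5 on its own, whereas a threshold of 0.7 yields a single 235 cluster alongside the 14 cluster.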

An initially disordered set of gene expression profiles can be converted into an immediately intelligible set of clusters by hierarchical clustering. The clusters also suggest functions for the genes of unknown function (UNK).

Hierarchical clustering can be performed on both the genes and the treatments, allowing detection of patterns in two dimensions.

There are many different clustering algorithms, which may or may not give the same tree structure, and there is often no solid statistical support for the clustering these methods produce.

The biological meaning of clusters will be determined by their predictive power and their capacity for generating hypotheses.

We can align cluster data with gene ontology and other genomic annotation, including links to literature databases.

References

Textbook, pp. 141-145.

Eisen, M. B., P. Spellman, P. Brown and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95: 14863-14868.