Analyzing Microarray Time course Genome wide Data

Similar documents
Supplemental Information for Pramila et al. Periodic Normal Mixture Model (PNM)

Estimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree

PROPERTIES OF A SINGULAR VALUE DECOMPOSITION BASED DYNAMICAL MODEL OF GENE EXPRESSION DATA

Singular value decomposition for genome-wide expression data processing and modeling. Presented by Jing Qiu

Advances in microarray technologies (1 5) have enabled

Data visualization and clustering: an application to gene expression data

Missing Value Estimation for Time Series Microarray Data Using Linear Dynamical Systems Modeling

The Continuum Model of the Eukaryotic Cell Cycle: Application to G1-phase control, Rb phosphorylation, Microarray Analysis of Gene Expression,

Exploratory statistical analysis of multi-species time course gene expression

Lecture 5: November 19, Minimizing the maximum intracluster distance

Topographic Independent Component Analysis of Gene Expression Time Series Data

Tree-Dependent Components of Gene Expression Data for Clustering

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

The development and application of DNA and oligonucleotide microarray

Uses of the Singular Value Decompositions in Biology

Modeling Gene Expression from Microarray Expression Data with State-Space Equations. F.X. Wu, W.J. Zhang, and A.J. Kusalik

Modelling Gene Expression Data over Time: Curve Clustering with Informative Prior Distributions.

Bioinformatics. Transcriptome

Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations

Singular Value Decomposition and Principal Component Analysis (PCA) I

An Introduction to the spls Package, Version 1.0

Expression arrays, normalization, and error models

Published: 26 April 2017

A Case Study -- Chu et al. The Transcriptional Program of Sporulation in Budding Yeast. What is Sporulation? CSE 527, W.L. Ruzzo 1

BIOINFORMATICS. Geometry of gene expression dynamics. S. A. Rifkin 1 and J. Kim INTRODUCTION SYSTEM

Detection and Visualization of Subspace Cluster Hierarchies

Estimating Biological Function Distribution of Yeast Using Gene Expression Data

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

Sparse regularization for functional logistic regression models

Regression Model In The Analysis Of Micro Array Data-Gene Expression Detection

Protein Feature Based Identification of Cell Cycle Regulated Proteins in Yeast

Fuzzy Clustering of Gene Expression Data

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Assignment #9: Orthogonal Projections, Gram-Schmidt, and Least Squares. Name:

Learning in Bayesian Networks

Invariant mrna and mitotic protein breakdown solves the Russian Doll problem of the cell cycle

33A Linear Algebra and Applications: Practice Final Exam - Solutions

18-660: Numerical Methods for Engineering Design and Optimization

Superstability of the yeast cell-cycle dynamics: Ensuring causality in the presence of biochemical stochasticity

Math 3191 Applied Linear Algebra

Fractal functional regression for classification of gene expression data by wavelets

Bioconductor Project Working Papers

Learning Causal Networks from Microarray Data

MATH 829: Introduction to Data Mining and Analysis Principal component analysis

Clustering & microarray technology

T H E J O U R N A L O F C E L L B I O L O G Y

Shrinkage-Based Similarity Metric for Cluster Analysis of Microarray Data

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Principal component analysis (PCA) for clustering gene expression data

PRINCIPAL COMPONENTS ANALYSIS TO SUMMARIZE MICROARRAY EXPERIMENTS: APPLICATION TO SPORULATION TIME SERIES

Initiation of DNA Replication Lecture 3! Linda Bloom! Office: ARB R3-165! phone: !

Integration of functional genomics data

Ranking Interesting Subspaces for Clustering High Dimensional Data

networks in molecular biology Wolfgang Huber

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Introduction to clustering methods for gene expression data analysis

Principal Component Analysis (PCA)

Infinite Dimensional Vector Spaces. 1. Motivation: Statistical machine learning and reproducing kernel Hilbert Spaces

Linear Algebra Methods for Data Mining

Principal component analysis, PCA

Introduction to clustering methods for gene expression data analysis

Metabolic networks: Activity detection and Inference

Timing of gene expression responses to environmental changes

Gene expression microarray technology measures the expression levels of thousands of genes. Research Article

Pattern Recognition Letters

LECTURE NOTE #10 PROF. ALAN YUILLE

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Application of random matrix theory to microarray data for discovering functional gene modules

Robot Image Credit: Viktoriya Sukhanova 123RF.com. Dimensionality Reduction

Kernel-Based Principal Component Analysis (KPCA) and Its Applications. Nonlinear PCA

INTEGRATING MICROARRAY DATA BY CONSENSUS CLUSTERING

Issues and Techniques in Pattern Classification

Introduction to Data Mining

A New Method to Build Gene Regulation Network Based on Fuzzy Hierarchical Clustering Methods

PARAMETER UNCERTAINTY QUANTIFICATION USING SURROGATE MODELS APPLIED TO A SPATIAL MODEL OF YEAST MATING POLARIZATION

Measuring TF-DNA interactions

Improving the identification of differentially expressed genes in cdna microarray experiments

Gene expression experiments. Infinite Dimensional Vector Spaces. 1. Motivation: Statistical machine learning and reproducing kernel Hilbert Spaces

Mid-year Report Linear and Non-linear Dimentionality. Reduction. applied to gene expression data of cancer tissue samples

Microarray Data Analysis: Discovery

Advanced Methods for Fault Detection

Modifying A Linear Support Vector Machine for Microarray Data Classification

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation)

Geometric Modeling Summer Semester 2010 Mathematical Tools (1)

Network diffusion-based analysis of high-throughput data for the detection of differentially enriched modules

What is Principal Component Analysis?

Principal Components Analysis. Sargur Srihari University at Buffalo

Standardization and Singular Value Decomposition in Canonical Correlation Analysis

Chapter 11. Correlation and Regression

CLUSTER, FUNCTION AND PROMOTER: ANALYSIS OF YEAST EXPRESSION ARRAY

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Principal Component Analysis

Data Mining Techniques

Lecture 6: Methods for high-dimensional problems

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Computational methods for predicting protein-protein interactions

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)

Statistical Applications in Genetics and Molecular Biology

Gene Expression Data Classification With Kernel Principal Component Analysis

EXPLORING THE IMPACT OF A THREE-BODY INTERACTION ADDED TO THE GRAVITATIONAL POTENTIAL FUNCTION IN THE RESTRICTED THREE-BODY PROBLEM

Transcription:

OR 779 Functional Data Analysis Course Project Analyzing Microarray Time course Genome wide Data Presented by Xin Zhao April 29, 2002 Cornell University

Overview 1. Introduction Biological Background Biological Questions Data Description Our Approach 2. Data Analysis Functional Data Analysis Approach 3. Conclusions 4. Possible Future Ideas 5. Acknowledgements 6. References

Biological Background Refer to Jing Qiu s talk. Cell Cycle Includes G1, S, G2, M, M/G1 five phases in literature. Cell cycle specific Gene Gene epresses periodically over cell cycle. Intuitively, call it periodic gene.

Biological Background (cont.) Yeast Genome Include over 6,000 genes By 1998, 104 known periodic genes Spellman (1998) - Identified 800 periodic genes - 94 known genes were included - Considered as standard Main data source for studying periodic genes

Biological Background (cont.) Gene epression level Measured by cdna-microarray eperiment. Epression ratio between test gene and reference gene. Figure: yeast genome-wide gene epressions over 2 cell cycles Data source: genome www.stanford.edu/cellcycle. Referenced by over 293 published papers so far.

Biological Background (cont.) Figure: yeast genome-wide gene epressions over 2 cell cycles Each curve represents the time series for each gene over 18 sampling points Sampling at 7-min intervals for 120 minutes. The time interval covers approimately 2 cell cycles 4,489 time series in the population ais: sampling points over time. y ais: log 2 (gene epression level)

Biological Background (cont.) Periodic gene classification Peak epression at a specific phase during a cell cycle. Have five groups (G1, S, G2, M, M/G1) e.g., G1 group peak epression at G1 phase Figure: yeast cell genome-wide periodic gene classification (200 genes)

Biological Questions Identification of Periodic Genes Figure: yeast genome-wide periodic genes identification Classification of Periodic Genes Figure: Yeast cell genome-wide periodic gene classification (200 genes)

Data Description Yeast cells synchronized by α-factor arrest Raw data 1,1 d,1 1,, L, M d n, n M Where d = 18, n = 4,489, : log 2 (epression level) for j th gene at i th sampling point. i j

Our Approach Project Goals: Goal 1: Understand population structure. Goal 2: Eplore identification and classification for periodic genes. This is an eploratory analysis Fit in the framework of Functional Data Analysis. Object Space Feature Space (curves data vectors)

Our Approach (cont.) Object Space Feature Space (i.e., curves data vectors) Approach to Goal 1: - PCA for data - PCA for projections onto an appropriate Fourier subspace Approach to Goal 2: Project data onto a 2-dim Fourier subspace - Identify periodic genes - Classify periodic genes

Our Approach (cont.) Feature Space Object Space (i.e., data vectors curves) Visualization of population structures Figure: yeast cell genome-wide periodic gene classification (200 genes)

Functional Data Analysis Object Space View: Overlay plots of curves Recall: Figure: yeast genome-wide gene epressions over 2 cell cycles

Functional Data Analysis (Cont.) Feature Space Data Representation Data vectors Recall: 1,1 d,1 1,, L, M d n, n M Where d = 18, n = 4,489 i, j : log2 (epression level) for j th gene at i th sampling point.

Functional Data Analysis (Cont.) Feature Space Data Representation Center data vector over time centered data = = d i j i j n n d n n d d 1,, 1, 1,1 1 1,1 1,, M L M data matri = n n d d n n, 1,1 1, 1 1,1 L M O M L 18 4,489

Understand Population Structure 1. PCA for data Curve View Graphic: a nice approach to view PCA Figure: spellman_alpha_complete_pca.eps No dominant eigenvalue, as shown clearly in the power plot. 1st PC only eplains about 25% of total energy. 2nd PC only eplains about 16% of total energy. Periodic direction is not the PC directions, but might be a rotation of the PC directions.

Understand Population Structure (Cont.) 2. PCA for projections onto a Fourier subspace Fourier basis B = {sin(iωt), cos(iωt), i = 2, 4, 6, 8, t = 1,, 18} 18 8 2 π Where ω =, T = 18 T Reasons: The time interval covers two cell cycles. The period of periodic genes is consistent with that of a cell cycle. 18 (equally spaced) sampling points available.

Understand Population Structure (Cont.) 2. PCA for projections onto a Fourier subspace Projection matri = B(B T B) -1 B T (data matri) 18 4,489 18 4,489 Perform PCA on the projected data Figure: spellman_alpha_complete_proj_pca.eps

Understand Population Structure (Cont.) 2. PCA for projections onto a Fourier subspace Figure: spellman_alpha_complete_proj_pca.eps The first two PCs are dominant (eplain about 65% of total energy) 1 st PC eplains 37.31% of total energy. 1 st PC direction similar to a sine wave over two periods. 2 nd PC eplains 27.54% of total energy. 2 nd PC direction similar to a cosine wave over two periods.

Understand Population Structure (Cont.) 2. PCA for projections onto a Fourier subspace Projections onto the 2-dim Fourier subspace spanned by {sin(2ωt), cos(2ωt)} captures the main feature of the periodicity in the data. Time shift issue is captured in the subspace: Reason: sin(2ωt + φ) = cos(φ)sin(2ωt) + sin(φ)cos(2ωt) = a 1 sin(2ωt) + a 2 cos(2ωt), where a 1 and a 2 are constants similar to cos(2ωt + φ)

Identification and classification for periodic genes 1. Identification of periodic genes Recall: Subspace spanned by {sin(2ωt), cos(2ωt)} captures periodicity. Idea: Periodic genes have large distance to the origin in the subspace. Q: Which distance is considered as large?

Identification and classification for periodic genes (cont.) 1. Identification of periodic genes Q: Which distance is considered as large? We consider a range of thresholding criteria: Rank all genes in decreasing order according to the distance to the origin in the subspace. Choose a range of thresholding values: First 200, 400, 600, 800, and 1,000 genes Figure: Periodic gene identification scatterplots

Identification and classification for periodic genes (cont.) Figure: Periodic gene identification scatterplots - G1 group is in red - S group is in green - G2 group is in blue - M group is in yellow - M/G1 group is in cyan - Non-periodic genes are in black - Purple lines are boundaries (eplained later)

Identification and classification for periodic genes (Cont.) 2. Classification of periodic genes Idea: set the boundaries for the angles in the 2-dim subspace. Q: Do we know the timing for G1, S, G2, M, M/G1 phases? A: No. Solution: - Initial guess from previous result: Spellman s classification. - Modification using Sizer plot of angles and scatterplot in the subspace.

Identification and classification for periodic genes (Cont.) 2. Classification of periodic genes I. Results by Spellman (1998): Figure: Spellman s classification II. Modification by SiZer plot of first 200 gene angles: Figure: SiZer plot of first 200 gene angles III. Modification by scatterplot in the subspace Figure: periodic genes scatterplot by sizer (200, 400 genes)

Identification and classification for periodic genes (Cont.) 2. Classification of periodic genes Our set of angle boundaries for the five periodic gene groups: Note: M/G1 phase: [0.83, 2.04] G1 phase: [2.04, 3.74] S phase: [3.74, 4.58] G2 phase: [4.58, 5.72] M phase: [5.72, 0.83] This selection came from previous standard results and eyeball eamination of the Sizer plot and scatterplot. It may not be statistically robust.

Identification and classification for periodic genes (Cont.) 2. Classification of periodic genes Figure: compare classification results for 2 thresholds (200, 800) Sizer plot for first 200 genes shows 2 significant bumps, G1 and S groups. The Sizer plot for the first 800 genes shows 2 significant bumps: G1 and G2 groups. Suggests that S group has more highly periodic genes, and G2 group has more low periodic genes. Might have biological interpretation.

Identification and classification for periodic genes (cont.) 2. Classification of periodic genes Kernel density estimator of periodic genes for different thresholds: Figure: Kde plot of periodic genes for 2 thresholds (200, 800) Kde plot of first 200 genes: four bumps for G1, S, G2, and M/G1 groups. Kde plot for first 800 genes: G2 group became most significant. G1 group is significant for each threshold.

Identification and classification for periodic genes (cont.) 2. Classification of periodic genes Figure: Compare percentage of genes in each group for different thresholds G1 group is the largest group for each threshold. S and M groups are the small groups for each threshold. As the threshold increases, percentage of periodic genes - decreases in G1 group - increases in G2 group - relatively stable in S, M, and M/G1 groups There might eist some meaningful biological interpretation.

Visualization of population structures In object space, compare five groups of periodic genes for different thresholds: Plot periodic gene curves for each threshold in each group. Figure: Plot of periodic gene curves for each threshold in G1 group

Conclusions Fourier subspace spanned by {sin(2ωt), cos(2ωt)} captures periodicity. G1 and S groups has more highly periodic genes G2 group has more low periodic genes Periodicity in M, and M/G1 groups is relatively uniformdistributed.

Possible Future Ideas (cont.) I. Apply our approach to different microarray eperiments Current microarray eperiments: Sampling Total Eperiment Cell Cycle Number of Interval Time Time Sampling (minutes) (minutes) (minutes) Points Alpha-factor a 7 120 66 18 CDC 15 a 10 290 110 24 CDC28 b 10 160 85 17 Elutriation a 30 390 390 14 a : data is from Spellman, et al b : data is from Cho, et al

Possible Future Ideas (cont.) II. Patterns should be eperimentally reproducible and statistically significant. Q: How reproducible are the patterns in current microarray eperiments? Genes that are periodic under one synchronization procedure are not necessarily periodic under a different synchronization procedure. - Shedden, Kerby and Cooper, Stephen (2002)

Acknowledgements My deep appreciation to Prof. Steve Marron, under whose guidance this project was conducted. Thank Prof. Martin Wells for his initiation on the project.

References Some publications in this area: Clustering Method: Eisen, M.B, Spellman, P.T., Brown, P.O., and Botstein (1998), Cluster Analysis and display of genome-wide epression patterns, Proc. Natl. Acad. Sci. USA, 95, 14863-14868. Whitfield, M.L., Sherlock, G, Saldanha, A., Murray, JI, Ball, C.A., Aleande K.E., Matese J.C., Perou, C.M., Hurt M.M., Brown, P.O., and Botstein, D. (2002), Identification of genes periodically epressed in the human cell cycle and their epression in tumors, Mol. Biol. Cell, 10,???-??? Corresponse Analysis: Fellenberg, K., Hauser, N.C., Brors, B., Neutzner, A, Hoheisel, J.D., and Vingron, M. (2001), Correspondence analysis applied to microarray data, Proc. Natl. Acad. Sci. USA, 98(19), 10781-10786.

Singular Value Decomposition Method: Holter, N., Mitra, M., Maritan, A., Cieplak, M., Banavar, JR, and Fedoroff, NV. (2000), Fundamental patters underlying gene epression profiles: simplicity from compleity, Proc. Natl. Acad. Sci. USA, 97, 8409-8414. Alter, O., Brown, P.O., and Botstein, D. (2000) Singular value decomposition for genome-wide epression data processing and modeling, Proc. Natl. Acad. Sci. USA, 97(18), 10101-10106 Others: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botsein, D. and Futcher, B. (1998), Comprehensive Identification of Cell Cycle regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization, Molecular Biology of the Cell, 9, 3273 3297. Cho R.J., et al. (1998), A Genome-wide Transcriptional Analysis of the Mitotic Cell Cycle, Molecular Cell, 2, 65-73 Cooper, S. Cell Cycle Analysis and Microarrays, Trends in Genetics, accepted for publication.

Shedden, K. and Cooper, S. (2002), Analysis of cell-cycle-specific gene epression in human cells as determined by microarrays and double-thymidine block synchronization, PNAS, 99(7), 4379-4384