Singular value decomposition for genome-wide expression data processing and modeling. Presented by Jing Qiu


Singular value decomposition for genome-wide expression data processing and modeling
Presented by Jing Qiu
April 23, 2002

Outline
Biological Background
Mathematical Framework: Singular Value Decomposition
  SVD calculation
  Pattern inference
  Data normalization
  Data sorting
Biological Data Analysis I: Elutriation-synchronized cell cycle

Biological Background: Cell cycle regulation
Cell cycle: the program for cell growth and division.
Four broad phases: G1 (and G0), S, G2, and M.
Cell cycle diagram: G1 (cell growth and protein for DNA synthesis), S (DNA replication, two daughter cells), G2 (cell growth and protein synthesis), M (split apart).
Application: cancer (uncontrolled cell growth and proliferation).

Biological Background: Microarray experiment
Diagram of the Central Dogma of Molecular Biology.
100,000 genes in the mammalian genome; each cell expresses 15,000 of these genes.
Each gene is expressed at a different level.
Cell cycle regulated genes have different expression levels across the cycle period.
Microarrays: massively parallel analysis of gene expression.
The process diagram.

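Mathematical Framework: Notation
$\hat{m}$ denotes a matrix; $|v\rangle$ denotes a column vector; $\langle u|$ denotes a row vector.
$\hat{m}|v\rangle$, $\langle u|\hat{m}$, and $\langle u|v\rangle$ all denote inner products; $|v\rangle\langle u|$ denotes an outer product.
$\hat{e}|m\rangle$: the $m$th column of the matrix $\hat{e}$; $\langle n|\hat{e}$: the $n$th row of the matrix $\hat{e}$.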

Mathematical Framework: Data of interest
Data of interest: $\hat{e}$ (dim = $N \times M$) (see Fig 13a).
$N$ rows: $N$ genes of a model organism. $\langle g_n| \equiv \langle n|\hat{e}$: the relative expression of the $n$th gene across the different arrays.
$M$ columns: $M$ different samples (arrays). $|a_m\rangle \equiv \hat{e}|m\rangle$: the genome-wide relative expression measured by the $m$th array.

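Mathematical Framework: SVD of the data
$\hat{e} = \hat{u}\,\hat{\epsilon}\,\hat{v}^T$, where $\hat{\epsilon} = \mathrm{diag}(\epsilon_1, \ldots, \epsilon_L)$ and $\epsilon_1 \ge \epsilon_2 \ge \cdots \ge \epsilon_L \ge 0$.
$\epsilon_l$: the eigenexpression of the $l$th eigengene in the $l$th eigenarray.
$\hat{u}$: the $N$-genes $\times$ $L$-eigenarrays basis set.
$|\alpha_l\rangle \equiv \hat{u}|l\rangle$: the genome-wide expression in the $l$th eigenarray.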

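$\hat{v}^T$: the $M$-eigengenes $\times$ $M$-arrays basis set (that is, $L = M$).
$\langle\gamma_l| \equiv \langle l|\hat{v}^T$: the $l$th eigengene across the different arrays.
The fraction of eigenexpression: $p_l = \epsilon_l^2 / \sum_{k=1}^{L} \epsilon_k^2$.
Shannon entropy: $0 \le d = -\frac{1}{\log L}\sum_{k=1}^{L} p_k \log p_k \le 1$ measures the complexity of the dataset.
$d = 0$: all expression is captured by a single eigengene (eigenarray). $d = 1$: all eigengenes are equally expressed.
Figure 13 and Fig 14.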
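As a rough illustration of the quantities on this slide, the NumPy sketch below computes the SVD of a genes-by-arrays matrix, the fraction of eigenexpression for each eigengene, and the normalized Shannon entropy. The function name `eigenexpression_summary` and the random input matrix are assumptions made for the example; they are not part of the presentation.

```python
import numpy as np

def eigenexpression_summary(e):
    """SVD of an N-genes x M-arrays matrix plus the fractions p_l and entropy d."""
    # Columns of u are eigenarrays, rows of vt are eigengenes,
    # s holds the eigenexpression levels in decreasing order.
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    p = s**2 / np.sum(s**2)              # fraction of eigenexpression
    nonzero = p[p > 0]                   # treat 0 * log(0) as 0
    d = -np.sum(nonzero * np.log(nonzero)) / np.log(len(s))
    return u, s, vt, p, d

# Synthetic stand-in for an expression matrix (hypothetical data).
rng = np.random.default_rng(0)
e = rng.normal(size=(50, 6))
u, s, vt, p, d = eigenexpression_summary(e)
print("fractions of eigenexpression:", np.round(p, 3))
print("Shannon entropy d:", round(float(d), 3))
```

An entropy close to 0 would indicate that one eigengene dominates; a value close to 1 would indicate that the eigengenes contribute about equally.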

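Mathematical Framework: SVD calculation
Diagonalize $\hat{e}^T\hat{e} = \hat{v}\,\hat{\epsilon}^2\,\hat{v}^T$ (dim = $M \times M$) to get $\hat{\epsilon}$ and $\hat{v}^T$.
$\hat{u}$ then follows from $\hat{u} = \hat{e}\,\hat{v}\,\hat{\epsilon}^{-1}$ when $\hat{\epsilon}$ is invertible.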
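The calculation route on this slide can be sketched in NumPy as follows. This is a minimal sketch, assuming the small $M \times M$ matrix being diagonalized is $\hat{e}^T\hat{e}$; the function name `svd_via_correlation` and the tolerance `tol` are hypothetical choices for the example. Eigengenes with a numerically zero eigenexpression are dropped, because the division that recovers the eigenarrays is undefined for them.

```python
import numpy as np

def svd_via_correlation(e, tol=1e-10):
    """Recover u, eps, v^T from the eigendecomposition of e^T e (M x M)."""
    a = e.T @ e                               # symmetric M x M matrix
    evals, v = np.linalg.eigh(a)              # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # reorder to descending
    evals, v = evals[order], v[:, order]
    eps = np.sqrt(np.clip(evals, 0.0, None))  # eigenexpression levels
    keep = eps > tol * eps[0]                 # drop (near-)zero levels
    eps, v = eps[keep], v[:, keep]
    u = (e @ v) / eps                         # u = e v eps^{-1}, column by column
    return u, eps, v.T

rng = np.random.default_rng(1)
e_demo = rng.normal(size=(40, 5))             # hypothetical data
u_demo, eps_demo, vt_demo = svd_via_correlation(e_demo)
```

Working with the $M \times M$ matrix is convenient when there are far more genes than arrays, at the cost of some numerical accuracy compared with a direct SVD routine.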

Mathematical Framework: Pattern inference
An eigengene $|\gamma_l\rangle$ represents a regulatory process when this pattern is biologically interpretable.
$\langle n|\alpha_l\rangle$: the relative amplitude of the $l$th eigengene pattern in the $n$th gene, relative to all other genes.
The corresponding eigenarray $|\alpha_l\rangle$ represents the cellular state which corresponds to this process.

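Mathematical Framework: Data normalization
Normalize the data by filtering out the eigengenes $|\gamma_l\rangle$ which are inferred to represent noise:
$\hat{e} \to \hat{e} - \epsilon_l\,|\alpha_l\rangle\langle\gamma_l| = \hat{u}\,\mathrm{diag}(\epsilon_1, \ldots, \epsilon_{l-1}, 0, \epsilon_{l+1}, \ldots, \epsilon_L)\,\hat{v}^T$, that is, $\epsilon_l \to 0$.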
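A minimal NumPy sketch of this filtering step: the selected eigenexpression levels are set to zero and the matrix is rebuilt. The function name `filter_eigengenes`, the zero-based index convention, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def filter_eigengenes(e, drop):
    """Rebuild e with the eigengenes listed in `drop` (0-based indices) removed."""
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    s_filtered = s.copy()
    s_filtered[list(drop)] = 0.0      # substitute zero for the chosen levels
    return (u * s_filtered) @ vt      # u diag(s_filtered) v^T

# Hypothetical data: a constant offset plus random variation.
rng = np.random.default_rng(2)
e_raw = 5.0 + rng.normal(size=(30, 8))
e_filtered = filter_eigengenes(e_raw, drop=(0,))
```

Because only singular values are zeroed, the remaining eigengenes and eigenarrays of the filtered matrix are unchanged.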

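Mathematical Framework: Data sorting
Sort data by similarity in the expression of any chosen subset of the eigengenes.
Correlation plot: scatter plot of $\langle\gamma_k|g_n\rangle / \sqrt{\langle g_n|g_n\rangle}$ against $\langle\gamma_l|g_n\rangle / \sqrt{\langle g_n|g_n\rangle}$, where:
$r_n = \sqrt{\left(|\langle\gamma_k|g_n\rangle|^2 + |\langle\gamma_l|g_n\rangle|^2\right) / \langle g_n|g_n\rangle}$: amplitude of expression of the $n$th gene in the subspace spanned by $|\gamma_k\rangle$ and $|\gamma_l\rangle$ relative to its overall expression.
$\tan\phi_n = \langle\gamma_k|g_n\rangle / \langle\gamma_l|g_n\rangle$: the phase of the $n$th gene in the transition from the expression pattern $|\gamma_l\rangle$ to $|\gamma_k\rangle$ and back to $|\gamma_l\rangle$.
Sort the genes according to $\phi_n$.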
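The sorting step can be sketched in NumPy as below, assuming the amplitude is the length of each gene's normalized projection onto the two chosen eigengenes and the phase is the angle of that projection. The function name `sort_genes_by_phase`, the zero-based indices `k` and `l`, and the synthetic traveling-wave data are hypothetical.

```python
import numpy as np

def sort_genes_by_phase(e, k=1, l=0):
    """Return gene order, amplitude r_n and phase phi_n in the (gamma_l, gamma_k) plane."""
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    norms = np.linalg.norm(e, axis=1)      # overall expression of each gene
    x = (e @ vt[l]) / norms                # correlation with eigengene l
    y = (e @ vt[k]) / norms                # correlation with eigengene k
    r = np.hypot(x, y)                     # relative amplitude in the subspace
    phi = np.arctan2(y, x) % (2 * np.pi)   # phase in [0, 2*pi)
    return np.argsort(phi), r, phi

# Synthetic genes that oscillate with random phases over one period.
rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=30)       # hypothetical true phases
angles = 2 * np.pi * np.arange(12) / 12          # 12 hypothetical arrays
e_wave = np.cos(angles[None, :] - theta[:, None])
order, r, phi = sort_genes_by_phase(e_wave)
```

The recovered phases agree with the true ones only up to a common rotation and a possible reversal of direction, since the signs and ordering of eigengenes returned by an SVD routine are arbitrary.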

Biological Data Analysis: Data
Elutriation-synchronized cell cycle data of budding yeast.
[...] genes, of which [...] were classified as cell cycle regulated in Spellman et al., 1998.
[...] arrays at intervals of 30 minutes.

Biological Data Analysis: Pattern inference
The first eigengene describes the time-invariant relative expression during the cell cycle (Fig 14a, constantly red). The others show oscillation during the cell cycle.
It captures more than [...]% of the overall relative expression.
Low entropy, $d = [...]$: weak perturbation of a steady state of expression (Figure 14b).
The next three eigengenes capture [...]% of the overall relative expression, respectively.
The time variation of [...] fits a normalized sine function of period $T$ (Fig 14c).

The time variations of [...] and [...] fit a cosine function of period $T$ (Fig 14c).
[...] shows decreasing expression on transition from [...] to [...] min.
[...] shows increasing expression on transition from [...] to [...] min.
Inference: the first eigengene represents experimental additive constants superimposed on a steady gene expression state. [...] represents expression oscillation during a cell cycle, and [...] and [...] represent initial transient increase and decrease in expression in response to the elutriation, respectively.
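One way to check whether an eigengene fits a sinusoid of a given period is an ordinary least-squares fit on sine, cosine, and constant terms. The sketch below is only an illustration of that idea and is not the fitting method used in the presentation; the function name `fit_sinusoid`, the 14 time points, and the 390-minute period are hypothetical values chosen for the example.

```python
import numpy as np

def fit_sinusoid(t, y, period):
    """Least-squares fit of y ~ amplitude * sin(2*pi*t/period + phase) + offset."""
    w = 2 * np.pi / period
    design = np.column_stack([np.sin(w * t), np.cos(w * t), np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    a, b, offset = coef
    amplitude = np.hypot(a, b)
    phase = np.arctan2(b, a)      # a*sin + b*cos = amplitude*sin(w*t + phase)
    residual = y - design @ coef
    return amplitude, phase, offset, residual

# Hypothetical eigengene sampled every 30 minutes.
t = 30.0 * np.arange(14)
y = 0.3 * np.sin(2 * np.pi * t / 390.0 + 0.5) + 0.1
amplitude, phase, offset, residual = fit_sinusoid(t, y, period=390.0)
```

A small residual relative to the amplitude would support reading the eigengene as an oscillation with that period; a large one would suggest a transient or noise pattern instead.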

Biological Data Analysis: Data normalization
Filter out the first eigengene to remove the steady state of expression: $\hat{e}_C = \hat{e} - \epsilon_1\,|\alpha_1\rangle\langle\gamma_1|$, where $e_{C,nm} \equiv \langle n|\hat{e}_C|m\rangle$ and $e_{C,nm}^2$ is the variance in the measured expression of the $n$th gene in the $m$th array.
Figure 15a displays the eigengenes of the variance dataset.
Its dominant eigengene captures more than [...] of the overall information in this dataset.

It describes a weak initial transient increase superimposed on a time-invariant scale of expression variance (maybe a response to the elutriation).
The time-invariant scale of expression variance suggests a steady scale of the data. It also suggests that time-invariant multiplicative constant noise may be superimposed on the data.
Filter out this eigengene, removing the steady scale of expression variance.
The resulting normalized dataset tabulates expression patterns that are approximately centered at the steady state, with variances that are approximately normalized by the steady scale of the expression variance.

The first two eigengenes of the normalized data, $|\gamma_1\rangle_N$ and $|\gamma_2\rangle_N$ (of similar significance), together capture more than [...]% of the overall normalized expression (Figure 1).
The time variations of $|\gamma_1\rangle_N$ and $|\gamma_2\rangle_N$ fit normalized sine and cosine functions of period $T$ and initial phase [...].
Inference: $|\gamma_1\rangle_N$ and $|\gamma_2\rangle_N$ represent cell cycle expression oscillations; assume the eigenarrays $|\alpha_1\rangle_N$ and $|\alpha_2\rangle_N$ represent the corresponding cell cycle cellular states.
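The two filtering passes can be sketched in NumPy as below. The formulas on these slides did not survive, so the sketch rests on explicit assumptions: that the variance dataset is the logarithm of the squared centered expression (so a multiplicative scale becomes an additive pattern that SVD can remove), and that the normalized data are rebuilt by restoring the sign of the centered expression. The function names and the `floor` guard against taking the log of zero are hypothetical.

```python
import numpy as np

def drop_first_eigengene(x):
    """Remove the leading singular component of x."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return x - s[0] * np.outer(u[:, 0], vt[0])

def normalize_expression(e, floor=1e-12):
    """Center on the steady state, then remove the steady scale of variance."""
    e_c = drop_first_eigengene(e)                  # centered expression
    # Assumption: work with log-variance so a constant scale is additive.
    log_var = np.log(np.maximum(e_c**2, floor))
    log_var_c = drop_first_eigengene(log_var)      # steady scale removed
    # Assumption: restore the original sign of each centered value.
    e_n = np.sign(e_c) * np.sqrt(np.exp(log_var_c))
    return e_c, e_n

rng = np.random.default_rng(4)
e_obs = 3.0 + rng.normal(size=(25, 7))             # hypothetical data
e_centered, e_normalized = normalize_expression(e_obs)
```

The eigengenes of `e_normalized` would then be the ones inspected for sine and cosine behaviour.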

Biological Data Analysis: Data sorting
Assume $|\alpha_1\rangle_N$ and $|\alpha_2\rangle_N$ approximately represent all cell cycle cellular states.
All arrays (except [...]) have at least [...]% of their normalized expression in this subspace.
This suggests that $|\alpha_1\rangle_N$ and $|\alpha_2\rangle_N$ are sufficient to approximate the elutriation array expression.
The array order according to phase is similar to the cell cycle time points (an order of the cell cycle progress).
Inferences:
One eigenarray is associated with the cell cycle cellular state of transition from [...] to [...].
Its negative is associated with the transition from [...] to [...].
The other eigenarray is associated with the cell cycle cellular state of transition from [...] to [...].
Its negative is associated with the transition from [...] to [...].
The phase of [...] corresponds to the 30-min delay between the start of the experiment and that of the cell cycle stage [...] (Figure 2b).
Recall that $|\gamma_1\rangle_N$ and $|\gamma_2\rangle_N$ are inferred to approximately represent all cell cycle expression oscillations.

Expect genes with almost all of their normalized expression in this subspace to be cell cycle regulated, and genes with almost none of it to be not regulated by the cell cycle at all.
641 genes (classified as cell cycle regulated) have more than [...]% of their normalized expression in this subspace.
This sorting gives a different classification from that by Spellman et al. (attributed to the poor quality of the elutriation expression data).
With all genes sorted, the gene variations of $|\alpha_1\rangle_N$ and $|\alpha_2\rangle_N$ fit normalized sine and cosine functions (Figure 3).
The sorted and normalized elutriation expression approximately fits a traveling wave of expression varying sinusoidally across both genes and arrays.

Some points
SVD calculation: [...] doesn't necessarily exist.
Data sorting: What if there are more than 2 significant eigengenes? [...] or [...]?
What smoothing method should be used to fit?
Is the disagreement due to the poor quality of the data (too simple)?
Are the first two eigengenes enough to explain all? (They capture only around [...]%, and Figure 2b is not a good projection.)