Principal component analysis (PCA) for clustering gene expression data


Principal component analysis (PCA) for clustering gene expression data
Ka Yee Yeung, Walter L. Ruzzo
Bioinformatics, v17 #9 (2001), pp. 763-774

Outline of talk
Background and motivation
Design of our empirical study
Results
Summary and conclusions

Principal Component Analysis (PCA)
Reduce dimensionality
Retain as much variation as possible
Linear transformation of the original variables
Principal components (PCs) are uncorrelated and ordered
[Figure: data scattered in the plane spanned by PC 1 and PC 2]

Definition of PCs
First principal component: the direction that maximizes the variability of the data when projected on that axis
Second PC: the direction, among those orthogonal to the first, maximizing variability...
Fact: they are eigenvectors of A^T A; the eigenvalues are the variances
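
A minimal sketch (mine, not from the slides), assuming NumPy and a random stand-in data matrix: it checks numerically that the PC directions are eigenvectors of the sample covariance matrix of the mean-centered data (proportional to A^T A for centered A) and that the variance along each PC equals the corresponding eigenvalue.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 observations, 5 variables
Xc = X - X.mean(axis=0)                       # mean-center each variable

S = np.cov(Xc, rowvar=False)                  # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # reorder largest-first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs                              # project the data onto the PCs
print(np.allclose(Z.var(axis=0, ddof=1), eigvals))   # True: variances == eigenvalues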

Motivation
Chu et al. [1998] identified 7 clusters using Eisen et al.'s CLUSTER software (hierarchical centroid-link) on the yeast sporulation data set.
Raychaudhuri et al. [2000] applied PCA to the sporulation data, and claimed that the data showed a unimodal distribution in the space of the first 2 PCs.

PCs in the sporulation data
First two PCs: > 90% of variance
First three PCs: > 95% of variance

"The unimodal distribution of expression in the most informative two dimensions suggests the genes do not fall into well-defined clusters." -- Raychaudhuri et al.

PCA and clustering
Euclidean distance: using all p variables, the Euclidean distance between a pair of genes is unchanged after PCA [Jolliffe 1986]; using m variables (m < p) gives an approximation
Correlation coefficient: no general relationship before and after PCA
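
A quick numerical check of the Jolliffe [1986] fact (a sketch with made-up data, not the authors' code): keeping all p PCs is an orthogonal rotation and leaves pairwise Euclidean distances unchanged, while keeping m < p PCs only approximates them.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt are the PC directions
Z_all = Xc @ Vt.T                                   # scores on all p = 10 components
print(np.allclose(pdist(Xc), pdist(Z_all)))         # True: distances preserved

Z_3 = Xc @ Vt[:3].T                                 # only the first 3 components
print(np.abs(pdist(Xc) - pdist(Z_3)).max())         # nonzero: an approximation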

Intuitively, PCA helps clustering. But... under some assumptions, Chang [1983] showed that the set of PCs with the largest eigenvalues does not necessarily capture the cluster structure.

Outline of talk
Background and motivation
Design of our empirical study
Results
Summary and conclusions

Our empirical study
Goal: compare the clustering results with and without PCA to an external criterion
Expression data set with external criterion
Synthetic data sets
Methodology to compare to an external criterion
Clustering algorithms
Similarity metrics

Ovary data (Michel Schummer)
Randomly selected cDNAs on membrane arrays
Subset of data: 235 clones, 24 experiments (7 from normal tissues, 4 from blood samples, 13 from ovarian cancers)
The 235 clones correspond to 4 genes (sizes 58, 88, 57, 32)
The four genes form the 4 classes (external criterion)

PCA on the ovary data
Number of PCs needed to adequately represent the data: 14 PCs cover 90% of the variation
[Scree graph: eigenvalue vs. component number]
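
The "14 PCs cover 90% of the variation" criterion corresponds to reading the cumulative explained-variance ratio off the scree graph; here is a sketch with a random stand-in for the 235 x 24 ovary matrix (not the real data).

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(235, 24))                   # stand-in for 235 clones x 24 experiments
Xc = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]   # descending eigenvalues
explained = np.cumsum(eigvals) / eigvals.sum()                 # cumulative variance ratio
m_90 = int(np.searchsorted(explained, 0.90)) + 1               # smallest m reaching 90%
print(m_90)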

Synthetic data sets (1): mixture of normal distributions
Compute the mean vector and covariance matrix for each class in the ovary data
Generate a random mixture of normal distributions using the mean vectors, covariance matrices, and size distributions from the ovary data
[Figures: histograms of expression levels for a normal class and a tumor class]
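
A sketch of the mixture-of-normals generator described above. In the study the per-class means, covariances, and sizes come from the ovary classes; here they are fabricated so the snippet runs stand-alone.

import numpy as np

rng = np.random.default_rng(3)

class_sizes = [58, 88, 57, 32]                   # class sizes as in the ovary data
p = 24                                           # number of experiments
means = [rng.normal(size=p) for _ in class_sizes]
covs = []
for _ in class_sizes:                            # random positive-definite covariances
    A = rng.normal(size=(p, p))
    covs.append(A @ A.T / p)

samples, labels = [], []
for k, (n_k, mu, sigma) in enumerate(zip(class_sizes, means, covs)):
    samples.append(rng.multivariate_normal(mu, sigma, size=n_k))
    labels.append(np.full(n_k, k))

X_synth = np.vstack(samples)                     # 235 x 24 synthetic data set
y_synth = np.concatenate(labels)                 # known classes (external criterion)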

Synthetic data sets (2): randomly permuted ovary data
Randomly sample (with replacement) the expression levels within the same class
Empirical distribution preserved, but covariance matrix not preserved
[Figure: data matrix with clones (observations) 1..n as rows, experimental conditions (variables) 1..p as columns, partitioned by class]
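
One plausible implementation of this resampling scheme (a sketch; the exact procedure may differ from the paper's): within each class, every experiment's expression levels are drawn with replacement from the observed values of that experiment in that class, which preserves the per-class empirical distributions but destroys the covariance structure.

import numpy as np

def resample_within_class(X, y, rng):
    """X: clones x experiments matrix; y: class label per clone (row)."""
    X_new = np.empty_like(X)
    for c in np.unique(y):
        rows = np.where(y == c)[0]
        for j in range(X.shape[1]):
            # Sample with replacement from this class's values for experiment j.
            X_new[rows, j] = rng.choice(X[rows, j], size=rows.size, replace=True)
    return X_new

# e.g. X_perm = resample_within_class(X_synth, y_synth, np.random.default_rng(4))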

Clustering algorithms and similarity metrics
CAST [Ben-Dor and Yakhini 1999] with correlation: builds one cluster at a time; adds or removes genes from clusters based on similarity to the genes in the current cluster
k-means with correlation and with Euclidean distance, initialized with hierarchical average-link
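
For the Euclidean-distance case, the k-means variant above (seeded by hierarchical average-link) might look roughly like this SciPy/scikit-learn sketch; it is not the authors' implementation, and the correlation-based variants and CAST would need custom code.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def kmeans_avglink_init(X, k, random_state=0):
    # Hierarchical average-link clustering provides the initial partition.
    Z = linkage(X, method='average', metric='euclidean')
    init_labels = fcluster(Z, t=k, criterion='maxclust')
    # Seed k-means with the centroids of the average-link clusters.
    centers = np.vstack([X[init_labels == c].mean(axis=0)
                         for c in np.unique(init_labels)])
    km = KMeans(n_clusters=centers.shape[0], init=centers, n_init=1,
                random_state=random_state)
    return km.fit_predict(X)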

Outline of talk
Background and motivation
Design of our empirical study
Results
Summary and conclusions

Our approach
Cluster the original data (no PCA)
Apply PCA and cluster with the first m PCs (m = m_0, ..., p)
Compare to the external criterion
Cluster with sets of PCs that give high adjusted Rand indices:
greedy approach: exhaustive search for m_0 components, then greedily add the next component
modified greedy approach
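
A sketch of the greedy component-selection step as I read it from this slide (not the published code): exhaustively pick the best subset of m_0 PCs by adjusted Rand index against the external classes, then greedily add one component at a time.

import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def greedy_pc_selection(scores, y_true, cluster_fn, m0=2, m_max=8):
    """scores: genes x p matrix of PC scores; cluster_fn(X) -> cluster labels."""
    p = scores.shape[1]
    ari = lambda cols: adjusted_rand_score(y_true, cluster_fn(scores[:, list(cols)]))
    # Exhaustive search over all subsets of size m0.
    chosen = list(max(combinations(range(p), m0), key=ari))
    # Greedily add the component that most improves the adjusted Rand index.
    while len(chosen) < m_max:
        best_j = max((j for j in range(p) if j not in chosen),
                     key=lambda j: ari(chosen + [j]))
        chosen.append(best_j)
    return chosen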

Results: ovary data, k-means with correlation
[Plot: adjusted Rand index vs. number of components, for no PCA, first, greedy, and modified greedy]
Adjusted Rand index for the first m (m >= 7) PCs is higher than without PCA
Adjusted Rand index with all 24 PCs is higher than without PCA

Results: ovary data, k-means with Euclidean distance
[Plot: adjusted Rand index vs. number of components, for no PCA, first, greedy, and modified greedy]
Sharp drop in adjusted Rand index from the first 3 to the first 4 PCs

Results: ovary data, CAST with correlation
[Plot: adjusted Rand index vs. number of components, for no PCA, first, greedy, and modified greedy]
Adjusted Rand index on the first components <= without PCA
The greedy or modified greedy approach usually achieves a higher adjusted Rand index than without PCA

Ovary/CAST, correlation [figure]

Real data: did PCA help?
2 data sets, 5 algorithms. A "+" means clustering using the first c PCs helped (for some c).

              CAST    k-means   k-means   Ave-link   Ave-link
              corr.   corr.     dist.     corr.      dist.
Ovary         -       +         -         +          -
Cell cycle    -       -         -         -          -

Synthetic data: did PCA help?
p-values from the Wilcoxon signed rank test (values with p < 5% marked with *)

Synthetic data      Alternative hypothesis   CAST corr.   k-means corr.   k-means dist.   Ave-link corr.   Ave-link dist.
Mixture of normal   no PCA > first           0.039*       0.995           0.268           0.929            0.609
Mixture of normal   no PCA < first           0.969        0.031*          0.760           0.080            0.418
Random resampled    no PCA > first           0.243        0.909           0.824           0.955            0.684
Random resampled    no PCA < first           0.781        0.103           0.200           0.049*           0.337
Cyclic data         no PCA > first           0.023*       NA              0.296           0.053            0.799
Cyclic data         no PCA < first           0.983        NA              0.732           0.956            0.220
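
The paired comparisons behind these p-values can be reproduced in outline with SciPy's one-sided Wilcoxon signed-rank test; the adjusted Rand scores below are fabricated just to make the snippet runnable.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(5)
ari_no_pca = rng.uniform(0.4, 0.8, size=10)               # ARI without PCA, per data set
ari_first = ari_no_pca + rng.normal(0.0, 0.05, size=10)   # ARI with the first m PCs

# One-sided tests matching the two alternative hypotheses in the table.
p_greater = wilcoxon(ari_no_pca, ari_first, alternative='greater').pvalue
p_less = wilcoxon(ari_no_pca, ari_first, alternative='less').pvalue
print(p_greater, p_less)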

Some successes [figures]

Outline of talk
Background and motivation
Design of our empirical study
Results
Summary and conclusions

Summary & conclusions (1)
PCA may not improve cluster quality:
the first PCs may be worse than without PCA
another set of PCs may be better than the first PCs
The effect of PCA depends on the clustering algorithm and similarity metric:
CAST with correlation: first m PCs usually worse than without PCA
k-means with correlation: PCA usually helps
k-means with Euclidean distance: worse after the first few PCs

Summary & conclusions (2)
No general trends in the components chosen by the greedy or modified greedy approach; usually the first 2 components are chosen by the exhaustive search step
Results on the synthetic data are similar to the real data

Bottom line
Successes by other groups make it a technique worth considering, but it should not be applied blindly.

Ka Yee Yeung
Acknowledgements: Michèl Schummer
More info: http://www.cs.washington.edu/homes/{kayee,ruzzo}
UW CSE Computational Biology Group

Mathematical definition of PCA
The k-th PC: $z_k = \sum_{j=1}^{p} a_{k,j}\, x_j$
First PC: maximize $\mathrm{var}(z_1) = a_1^T S a_1$ subject to $a_1^T a_1 = 1$, where $S$ is the covariance matrix
k-th PC: maximize $\mathrm{var}(z_k) = a_k^T S a_k$ subject to $a_k^T a_k = 1$ and $a_k^T a_i = 0$ for all $i < k$
Data matrix: genes (observations) $i = 1, \dots, n$ in rows, experimental conditions (variables) $j = 1, \dots, p$ in columns, with entries $X_{i,j}$ and column vectors $x_j$

More details on PCA
It can be shown that $a_k$ is an eigenvector of $S$ corresponding to the $k$-th largest eigenvalue $\lambda_k$, and $\mathrm{var}(z_k) = \lambda_k$
Use the sample covariance matrix:
$S(j,k) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i,j} - m_{x_j})(x_{i,k} - m_{x_k})$, where $m_{x_j} = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}$
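
Putting the two slides together, here is a compact sketch of PCA exactly as defined above (sample covariance $S$, loading vectors $a_k$, scores $z_k$); the function and variable names are mine, not the authors'.

import numpy as np

def pca_from_covariance(X):
    """X: n x p matrix (genes/observations x experiments/variables)."""
    n, p = X.shape
    m = X.mean(axis=0)                     # m_{x_j}: per-variable means
    Xc = X - m
    S = (Xc.T @ Xc) / (n - 1)              # S(j,k) as defined on the slide
    eigvals, A = np.linalg.eigh(S)         # columns of A are candidate a_k
    order = np.argsort(eigvals)[::-1]      # k-th largest eigenvalue first
    eigvals, A = eigvals[order], A[:, order]
    Z = Xc @ A                             # scores z_k (computed on centered data)
    return eigvals, A, Z                   # var(z_k) equals eigvals[k]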