Model-based clustering and data transformations of gene expression data
Walter L. Ruzzo
University of Washington
UW CSE Computational Biology Group

Toy 2-d Clustering Example
K-Means?
Hierarchical, Average Link
Model-Based (If You Want)

Model-based clustering
Gaussian mixture model: assume each cluster is generated by a multivariate normal distribution.
Cluster k has parameters:
  Mean vector: µ_k
  Covariance matrix: Σ_k
Model-based clustering
[Figure: two 1-d Gaussian clusters with means µ_1, µ_2 and standard deviations σ_1, σ_2, illustrating the mixture-model parameters]

Variance & Covariance
  Variance:     var(x) = E((x − x̄)²)
  Covariance:   cov(x, y) = E((x − x̄)(y − ȳ))
  Correlation:  cor(x, y) = cov(x, y) / (σ_x σ_y)

Gaussian Distributions
  Univariate:   f(x) = 1/√(2πσ²) · e^(−(x − x̄)²/(2σ²))
  Multivariate: f(x) = 1/√((2π)^n |Σ|) · e^(−(1/2)(x − x̄)ᵀ Σ⁻¹ (x − x̄))
where Σ is the variance/covariance matrix: Σ_ij = E((x_i − x̄_i)(x_j − x̄_j))
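As a concrete check of the density and covariance formulas above, here is a small numerical sketch (the function name `mvn_density` and the toy data are illustrative, not from the talk):

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density at x, directly from the slide's formula:
    1/sqrt((2*pi)^n |Sigma|) * exp(-1/2 (x - mu)^T Sigma^-1 (x - mu))."""
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Sample covariance matrix: Sigma_ij = E((x_i - mean_i)(x_j - mean_j)).
# bias=True divides by n, matching the expectation in the definition.
data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
mu_hat = data.mean(axis=0)
sigma_hat = np.cov(data, rowvar=False, bias=True)
```

With n = 1 and σ² = 1 the density at the mean reduces to 1/√(2π) ≈ 0.399, which is a quick sanity check on the normalizing constant.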
Covariance models (Banfield & Raftery 1993)
General parameterization: Σ_k = λ_k D_k A_k D_kᵀ  (λ: volume, A: shape, D: orientation)
  Equal volume spherical model (EI): Σ_k = λ I
  Unequal volume spherical model (VI): Σ_k = λ_k I
  Diagonal model: Σ_k = λ_k B_k, where B_k is diagonal
  Elliptical model (EEE): Σ_k = λ D A Dᵀ
  Unconstrained model (VVV): Σ_k = λ_k D_k A_k D_kᵀ
More flexible models require more parameters.

EM algorithm
General approach to maximum likelihood.
Iterate between E and M steps:
  E step: compute the probability of each observation belonging to each cluster, using the current parameter estimates
  M step: estimate model parameters using the current group membership probabilities
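The E/M iteration described above can be sketched for the simplest covariance model, the equal-volume spherical model Σ_k = λI. This is a minimal illustrative NumPy implementation with a deterministic initialization, not the MCLUST software used in the talk:

```python
import numpy as np

def em_gmm_spherical(X, k, n_iter=50):
    """EM for a k-component Gaussian mixture, equal-volume spherical model
    (Sigma_k = lambda * I).  A minimal sketch of the E/M iteration."""
    n, d = X.shape
    # Deterministic init: pick k points spread along the first coordinate.
    order = np.argsort(X[:, 0])
    mu = X[order[np.linspace(0, n - 1, k).astype(int)]].astype(float)
    lam = X.var()                     # shared spherical variance lambda
    pi = np.full(k, 1.0 / k)          # mixing proportions
    resp = np.full((n, k), 1.0 / k)
    for _ in range(n_iter):
        # E step: probability of each observation belonging to each cluster,
        # using current parameters.  The common normalizing constant of the
        # spherical Gaussians cancels in the normalization below.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        log_r = np.log(pi) - 0.5 * sq / lam
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from membership probabilities.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        lam = (resp * sq).sum() / (n * d)
    return pi, mu, lam, resp
```

On two well-separated synthetic clusters this recovers the component means; the richer models (diagonal, EEE, VVV) change only the M-step estimate of the covariance.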
Advantages of model-based clustering
  Higher quality clusters
  Flexible models
  Model selection: a principled way to choose the right model and the right # of clusters
Bayesian Information Criterion (BIC):
  Approximate Bayes factor: posterior odds for one model against another model
  Roughly: data likelihood, penalized for the number of parameters
  A large BIC score indicates strong evidence for the corresponding model.

Definition of the BIC score
  2 log p(D | M) ≈ 2 log p(D | θ̂, M) − ν log(n) = BIC
where D is the data and M is the model.
The integrated likelihood p(D | M) is hard to evaluate; BIC is an approximation to 2 log p(D | M).
ν: number of parameters to be estimated in model M

Methodology
  Data sets, results, validation
Compare on data sets with external criteria (BIC scores do not require the external criteria).
To compare clusters with an external criterion: Adjusted Rand index (Hubert and Arabie 1985)
  Adjusted Rand index = 1 under perfect agreement
  Two random partitions have an expected index of 0
Compare quality of clusters to those from:
  A leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
  K-Means.
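The BIC definition above amounts to a one-line function. The two model scores below are made up purely to illustrate how the ν log(n) penalty can reverse a ranking based on likelihood alone:

```python
import math

def bic_score(log_likelihood, n_params, n_obs):
    """BIC = 2 log p(D | theta_hat, M) - nu * log(n)."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

# Hypothetical example with n = 235 observations: model B fits slightly
# better (higher log-likelihood) but spends many more parameters, so the
# simpler model A gets the larger (i.e. preferred) BIC score.
bic_a = bic_score(-1000.0, 10, 235)
bic_b = bic_score(-990.0, 40, 235)
```

This is exactly the trade-off the slides exploit: richer covariance models (VVV) fit better, but BIC rewards them only when the likelihood gain outweighs the parameter cost.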
Gene expression data sets
  Ovarian cancer data set (Michèl Schummer, Institute of Systems Biology)
    Subset of data: 235 clones, 24 experiments (cancer/normal tissue samples)
    235 clones correspond to 4 genes
  Yeast cell cycle data (Cho et al. 1998)
    17 time points
    Subset of 384 genes associated with 5 phases of the cell cycle

Synthetic data sets
Both based on the ovary data:
  Randomly resampled ovary data
    For each class, randomly sample the expression levels in each experiment, independently
    Near-diagonal covariance matrix
  Gaussian mixture
    Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data

Results: randomly resampled ovary data
[Figure: adjusted Rand index and BIC vs. number of clusters for the model-based methods and CAST]
  Diagonal model achieves max BIC score (~expected)
  Max BIC at 4 clusters (~expected)
  Max adjusted Rand beats CAST

Results: square root ovary data
[Figure: adjusted Rand index and BIC vs. number of clusters for the model-based methods and CAST]
  Adjusted Rand: max at 4 clusters (> CAST)
  BIC analysis: two of the models show a local max at 4 clusters
  Global max at 8 clusters (the 8 clusters split the 4)
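The two synthetic-data constructions described above can be sketched as follows. Function names are illustrative and the real ovary data is not reproduced here; `X` stands for the expression matrix of one class (rows = clones, columns = experiments):

```python
import numpy as np

def resample_within_class(X, rng):
    """Randomly resampled data: within one class, independently permute the
    expression levels of each experiment (column).  This preserves each
    experiment's marginal distribution but breaks correlations between
    experiments, giving a near-diagonal covariance matrix."""
    out = X.copy()
    for j in range(X.shape[1]):
        out[:, j] = rng.permutation(X[:, j])
    return out

def gaussian_class_sample(X, n_samples, rng):
    """Gaussian-mixture data: draw from a multivariate normal with the
    class's sample mean vector and sample covariance matrix."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    return rng.multivariate_normal(mu, sigma, size=n_samples)
```

Stacking one such block per class yields a full synthetic data set whose true class labels are known, which is what makes the adjusted Rand comparison possible.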
Results: standardized yeast cell cycle data
[Figure: adjusted Rand index and BIC vs. number of clusters for the model-based methods and CAST]
  Adjusted Rand: slightly > CAST at 5 clusters.
  BIC: selects 5 clusters.

Importance of Data Transformation
[Figures: log-transformed yeast cell cycle data; standardized yeast cell cycle data]

Summary and Conclusions
Synthetic data sets:
  With the correct model, model-based clustering does better than a leading heuristic clustering algorithm
  BIC selects the right model & the right # of clusters
Real expression data sets:
  Comparable adjusted Rand indices to CAST
  BIC gives a good hint as to the # of clusters
  Appropriate data transformations increase normality & cluster quality (see paper & web)
Acknowledgements
  Ka Yee Yeung 1, Chris Fraley 2,4, Alejandro Murua 4, Adrian E. Raftery 2
  Michèl Schummer 5 — the ovary data
  Jeremy Tantrum 2 — help with MBC software (… model)
  Chris Saunders 3 — CRE & noise model
  1 Computer Science & Engineering   2 Statistics   3 Genome Sciences
  4 Insightful Corporation   5 Institute of Systems Biology

More Info
http://www.cs.washington.edu/homes/ruzzo

Adjusted Rand Example
              c#1(4)  c#2(5)  c#3(7)  c#4(4)
  class#1(2)    2       0       0       0
  class#2(3)    0       0       0       3
  class#3(5)    1       4       0       0
  class#4(10)   1       1       7       1

  a = C(2,2) + C(3,2) + C(4,2) + C(7,2) = 31
  b = C(4,2) + C(5,2) + C(7,2) + C(4,2) − a = 43 − 31 = 12
  c = C(2,2) + C(3,2) + C(5,2) + C(10,2) − a = 59 − 31 = 28
  d = C(20,2) − a − b − c = 119
  Rand, R = (a + d) / (a + b + c + d) = 0.789
  Adjusted Rand = (R − E(R)) / (1 − E(R)) = 0.469

UW CSE Computational Biology Group
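The worked adjusted Rand example can be verified numerically. This sketch uses the standard Hubert–Arabie contingency-table form of the index, which agrees with the (R − E(R)) / (1 − E(R)) form on the slide; the table is the slide's 4-class × 4-cluster contingency table:

```python
from math import comb

# Contingency table: rows = true classes, columns = clusters.
table = [
    [2, 0, 0, 0],   # class#1 (2)
    [0, 0, 0, 3],   # class#2 (3)
    [1, 4, 0, 0],   # class#3 (5)
    [1, 1, 7, 1],   # class#4 (10)
]
n = sum(sum(row) for row in table)                       # 20 objects
a = sum(comb(nij, 2) for row in table for nij in row)    # pairs together in both
row_pairs = sum(comb(sum(row), 2) for row in table)           # 59
col_pairs = sum(comb(sum(col), 2) for col in zip(*table))     # 43
b = col_pairs - a            # together in a cluster, not in a class
c = row_pairs - a            # together in a class, not in a cluster
d = comb(n, 2) - a - b - c   # apart in both
rand = (a + d) / (a + b + c + d)
expected = row_pairs * col_pairs / comb(n, 2)
ari = (a - expected) / (0.5 * (row_pairs + col_pairs) - expected)
```

Running this reproduces the slide's values: a = 31, b = 12, c = 28, d = 119, R ≈ 0.789, adjusted Rand ≈ 0.469.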