Model-based clustering and data transformations of gene expression data
Walter L. Ruzzo
University of Washington
UW CSE Computational Biology Group

Toy 2-d Clustering Example

K-Means?

Hierarchical, Average Link

Model-Based (If You Want)

Model-based clustering
Gaussian mixture model: assume each cluster is generated by a multivariate normal distribution.
Cluster k has parameters $\theta_k$:
- Mean vector: $\mu_k$
- Covariance matrix: $\Sigma_k$
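For reference, the density this model assumes for each observation is standard mixture-model background not written out on the slide: each point is drawn from one of $K$ multivariate normal components,

$$f(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x;\, \mu_k, \Sigma_k), \qquad \pi_k \ge 0,\ \sum_{k=1}^{K} \pi_k = 1,$$

where the mixing proportion $\pi_k$ is the probability that an observation comes from cluster $k$.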

[Figure: two clusters annotated with means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$, $\sigma_2$]

Variance & Covariance
- Variance: $\mathrm{var}(x) = E((x - \bar{x})^2)$
- Covariance: $\mathrm{cov}(x, y) = E((x - \bar{x})(y - \bar{y}))$
- Correlation: $\mathrm{cor}(x, y) = \dfrac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$

Gaussian Distributions / Variance-Covariance
- Univariate: $\dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \bar{x})^2 / (2\sigma^2)}$
- Multivariate: $\dfrac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2}(x - \bar{x})^T \Sigma^{-1} (x - \bar{x})}$
where $\Sigma$ is the variance/covariance matrix: $\Sigma_{i,j} = E((x_i - \bar{x}_i)(x_j - \bar{x}_j))$
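A quick numerical check of these definitions; this is a toy sketch (the data, test point, and names are illustrative, not from the talk):

```python
# Checking the slide's formulas numerically: the sample covariance matrix
# Sigma_ij = E((x_i - xbar_i)(x_j - xbar_j)) and the multivariate normal pdf.
import numpy as np
from scipy.stats import multivariate_normal

X = np.random.default_rng(0).normal(size=(500, 2))   # toy 2-d data
xbar = X.mean(axis=0)
Sigma = (X - xbar).T @ (X - xbar) / len(X)           # matches the E(...) definition

# Density of a point under N(xbar, Sigma), per the multivariate formula above
x = np.array([0.5, -0.3])
quad = (x - xbar) @ np.linalg.inv(Sigma) @ (x - xbar)
pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))
assert np.isclose(pdf_manual, multivariate_normal.pdf(x, xbar, Sigma))
```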

Covariance models (Banfield & Raftery 1993)
Decomposition: $\Sigma_k = \lambda_k D_k A_k D_k^T$, where $\lambda_k$ controls volume, $A_k$ shape, and $D_k$ orientation.
- Equal volume spherical model: $\Sigma_k = \lambda I$ (~ k-means)
- Unequal volume spherical: $\Sigma_k = \lambda_k I$
- Diagonal model: $\Sigma_k = \lambda_k B_k$, where $B_k$ is diagonal with $|B_k| = 1$
- Elliptical model: $\Sigma_k = \lambda D A D^T$ (shared across clusters)
- Unconstrained model (VVV): $\Sigma_k = \lambda_k D_k A_k D_k^T$
More flexible, but more parameters.

EM algorithm
General approach to maximum likelihood.
Iterate between E and M steps:
- E-step: compute the probability of each observation belonging to each cluster, using the current parameter estimates
- M-step: estimate model parameters, using the current group membership probabilities
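A minimal from-scratch sketch of this EM iteration for the equal-volume spherical model ($\Sigma_k = \lambda I$); this is illustrative numpy/scipy code, not the authors' MBC software, and the function name and defaults are assumptions:

```python
# Minimal EM sketch for a Gaussian mixture with the equal-volume spherical
# model (Sigma_k = lambda * I), assuming data in an (n, d) numpy array.
import numpy as np
from scipy.stats import multivariate_normal

def em_spherical(X, K, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]        # initial means
    lam = X.var()                                  # shared variance (volume) lambda
    pi = np.full(K, 1.0 / K)                       # mixing proportions
    for _ in range(n_iter):
        # E-step: posterior probability z[i, k] that point i belongs to cluster k
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], lam * np.eye(d))
                         for k in range(K)], axis=1)
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from current membership probabilities
        Nk = z.sum(axis=0)
        pi = Nk / n
        mu = (z.T @ X) / Nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # squared distances
        lam = (z * sq).sum() / (n * d)             # single shared volume parameter
    return pi, mu, lam, z
```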

Advantages of model-based clustering
- Higher quality clusters
- Flexible models
- Model selection: a principled way to choose the right model and the right number of clusters

Bayesian Information Criterion (BIC)
- Approximate Bayes factor: posterior odds for one model against another model
- Roughly: data likelihood, penalized for the number of parameters
- A large BIC score indicates strong evidence for the corresponding model

Definition of the BIC score

$2 \log p(D \mid M) \approx 2 \log p(D \mid \hat{\theta}, M) - \nu \log(n) = \mathrm{BIC}$

where $D$ is the data and $M$ is the model. The integrated likelihood $p(D \mid M)$ is hard to evaluate; BIC approximates $2 \log p(D \mid M)$. $\nu$ is the number of parameters to be estimated in model $M$.

Methodology / Data Sets / Results / Validation

Validation methodology
- Compare on data sets with external criteria (BIC scores do not require the external criteria)
- To compare clusters with an external criterion: adjusted Rand index (Hubert and Arabie 1985)
  - Adjusted Rand index = 1 means perfect agreement
  - Two random partitions have an expected index of 0
- Compare quality of clusters to those from:
  - a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
  - k-means
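A sketch of this selection procedure, using scikit-learn's GaussianMixture as a stand-in for the mclust-style covariance models (spherical ~ the spherical models, diag ~ diagonal, full ~ VVV); function names and the max_k default are illustrative, not from the paper's software:

```python
# BIC-based selection of covariance model and number of clusters, plus the
# adjusted Rand index for comparison against external classes.
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

def select_by_bic(X, max_k=10):
    best = None
    for cov_type in ("spherical", "diag", "full"):
        for k in range(1, max_k + 1):
            gm = GaussianMixture(n_components=k, covariance_type=cov_type,
                                 random_state=0).fit(X)
            # Note: sklearn's bic() is -2 log L + nu log(n), so *lower* is
            # better; the slides use the opposite sign (larger is better).
            score = gm.bic(X)
            if best is None or score < best[0]:
                best = (score, cov_type, k, gm)
    return best

# Comparing recovered clusters against an external criterion:
# ari = adjusted_rand_score(true_classes, best[3].predict(X))
```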

Gene expression data sets
- Ovarian cancer data set (Michèl Schummer, Institute of Systems Biology)
  - Subset of data: 235 clones, 24 experiments (cancer/normal tissue samples)
  - The 235 clones correspond to 4 genes
- Yeast cell cycle data (Cho et al. 1998)
  - 17 time points
  - Subset of 384 genes associated with 5 phases of the cell cycle

Synthetic data sets
Both based on the ovary data (both constructions are sketched in code below):
- Randomly resampled ovary data: for each class, randomly sample the expression levels in each experiment, independently
- Gaussian mixture (near covariance matrix): generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data

Results: randomly resampled ovary data
[Figure: BIC (roughly -500 to -3500) and adjusted Rand (roughly 0.3 to 0.9) vs. number of clusters; model-based methods including VVV, and CAST]
- Diagonal model achieves the max BIC score (as expected)
- Max BIC at 4 clusters (as expected)
- Max adjusted Rand beats CAST

Results: square root ovary data
[Figure: adjusted Rand (roughly 0.1 to 0.8) and BIC (roughly -500 to -3000) vs. number of clusters; model-based methods including VVV, and CAST]
- Adjusted Rand: max at 4 clusters (> CAST)
- BIC analysis: several models reach a local max at 4 clusters; the global max is at 8 clusters (the 8 clusters split the 4 classes)
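A hedged sketch of the two synthetic-data constructions described above; `ovary` and `labels` are assumed inputs (a clones-by-experiments matrix and per-clone class labels), and resampling with replacement is one reading of "randomly sample":

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_within_class(ovary, labels):
    """Randomly resampled data: within each class, resample each experiment's
    expression levels independently (destroys between-experiment correlation)."""
    out = ovary.copy()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        for j in range(ovary.shape[1]):                 # each experiment independently
            out[idx, j] = rng.choice(ovary[idx, j], size=len(idx))
    return out

def gaussian_mixture_data(ovary, labels):
    """Gaussian data: sample each class from N(class mean, class covariance)."""
    parts = []
    for c in np.unique(labels):
        block = ovary[labels == c]
        parts.append(rng.multivariate_normal(block.mean(axis=0),
                                             np.cov(block, rowvar=False),
                                             size=len(block)))
    return np.vstack(parts)
```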

Results: standardized yeast cell cycle data
[Figure: adjusted Rand (roughly 0.15 to 0.55) and BIC (roughly -1000 to -17000) vs. number of clusters; model-based methods including VVV, and CAST]
- Adjusted Rand: slightly > CAST at 5 clusters
- BIC: selects 5 clusters

Importance of Data Transformation

log yeast cell cycle data
[Figure]

Standardized yeast cell cycle data
[Figure]

Summary and Conclusions
- Synthetic data sets:
  - With the correct model, model-based clustering does better than a leading heuristic clustering algorithm
  - BIC selects the right model and the right number of clusters
- Real expression data sets:
  - Comparable adjusted Rand indices to CAST
  - BIC gives a good hint as to the number of clusters
  - Appropriate data transformations increase normality and cluster quality (see paper & web)
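The two transformations named above, as a minimal sketch (genes-in-rows orientation is an assumption, and the eps guard is mine, not from the slides):

```python
import numpy as np

def log_transform(X, eps=1e-9):
    # log-transform raw expression ratios; eps guards zeros (an assumption)
    return np.log(X + eps)

def standardize_genes(X):
    # standardize each gene (row) to mean 0 and standard deviation 1
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd
```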

Acknowledgements
- Ka Yee Yeung (1), Chris Fraley (2,4), Alejandro Murua (4), Adrian E. Raftery (2)
- Michèl Schummer (5): the ovary data
- Jeremy Tantrum (2): help with MBC software
- Chris Saunders (3): CRE & noise model
(1) Computer Science & Engineering, (2) Statistics, (3) Genome Sciences, (4) Insightful Corporation, (5) Institute of Systems Biology

More info: http://www.cs.washington.edu/homes/ruzzo

Adjusted Rand Example

              c#1(4)  c#2(5)  c#3(7)  c#4(4)
  class#1(2)    2       0       0       0
  class#2(3)    0       0       0       3
  class#3(5)    1       4       0       0
  class#4(10)   1       1       7       1

$a = \binom{2}{2} + \binom{3}{2} + \binom{4}{2} + \binom{7}{2} = 31$
$b = \binom{4}{2} + \binom{5}{2} + \binom{7}{2} + \binom{4}{2} - a = 43 - 31 = 12$
$c = \binom{2}{2} + \binom{3}{2} + \binom{5}{2} + \binom{10}{2} - a = 59 - 31 = 28$
$d = \binom{20}{2} - a - b - c = 119$
Rand: $R = \dfrac{a + d}{a + b + c + d} = \dfrac{150}{190} \approx 0.789$
Adjusted Rand $= \dfrac{R - E(R)}{1 - E(R)} \approx 0.469$

UW CSE Computational Biology Group
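The worked example can be checked mechanically from the contingency table, using the standard Hubert-Arabie form of the adjusted Rand index:

```python
# Reproducing the worked adjusted Rand example from the contingency table above.
from math import comb
import numpy as np

table = np.array([[2, 0, 0, 0],
                  [0, 0, 0, 3],
                  [1, 4, 0, 0],
                  [1, 1, 7, 1]])   # rows: classes, cols: clusters

n = table.sum()                                              # 20 objects
sum_cells = sum(comb(v, 2) for v in table.ravel())           # a = 31
sum_rows  = sum(comb(v, 2) for v in table.sum(axis=1))       # 59
sum_cols  = sum(comb(v, 2) for v in table.sum(axis=0))       # 43
expected  = sum_rows * sum_cols / comb(n, 2)
ari = (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
print(round(ari, 3))   # 0.469, matching the slide
```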