Model-based clustering and validation techniques for gene expression data

Size: px

Start display at page:

Download "Model-based clustering and validation techniques for gene expression data"

Donald Merritt
5 years ago
Views:

1 Moel-base clustering an valiation techniques for gene expression ata Ka Yee Yeung Department of Microbiology University of Washington, Seattle WA

2 Overview Introuction to microarrays Clustering 11 Valiating clustering results on microarray ata Moel-base clustering using microarray ata Co-expression co-regulation?? 6/8/ Ka Yee Yeung - Lipari

3 Introuction to Microarrays 6/8/ Ka Yee Yeung - Lipari

4 DNA Arrays Measure the Concentration of 1 s of Genes Simultaneously On the surface In solution After Hybriization A B copies of gene A, 1 copy of gene B A B 6/8/ Ka Yee Yeung - Lipari

5 Two-color analysis allows for comparative stuies to be one In solution #1 After Hybriization copies of gene A, 1 copy of gene B In solution # A B copies of gene A, copies of gene B 6/8/ Ka Yee Yeung - Lipari

6 Cell Population #1 Cell Population # Extract mrna Extract mrna Make cdna Label w/ Green Fluor Make cdna Label w/ Re Fluor Co-hybriize.. Scan Slie with DNA from ifferent genes

7 The Gene Chip from Affymetrix 6/8/ Ka Yee Yeung - Lipari 7

8 Affymetrix chips: oligonucleotie arrays PM (Perfect Match) vs. MM (mis-match) 6/8/ Ka Yee Yeung - Lipari 8

9 A gene expression ata set.. Snapshot of activities in the cell Each chip represents an experiment: time course tissue samples (normal/cancer) n genes p experiments X ij 6/8/ Ka Yee Yeung - Lipari 9

10 What is clustering? Group similar objects together Clustering experiments Clustering genes Gene 1 Gene Gene Gene E E E E /8/ Ka Yee Yeung - Lipari 1

11 Applications of clustering gene expression ata Guilt by association E.g. Cluster the genes Ë functionally relate genes Class iscovery E.g. Cluster the experiments Ë iscover new subtypes of tissue samples 6/8/ Ka Yee Yeung - Lipari 11

12 Clustering 11 6/8/ Ka Yee Yeung - Lipari 1

13 What is clustering? Group similar objects together Objects in the same cluster (group) are more similar to each other than objects in ifferent clusters Data exploratory tool Unsupervise metho Do not assume any knowlege of the genes or experiments 6/8/ Ka Yee Yeung - Lipari 1

14 How to efine similarity? 1 X 1 Experiments p X genes n genes genes Y Y n Raw matrix Similarity measure: A measure of pairwise similarity or issimilarity Examples: Correlation coefficient Eucliean istance Similarity matrix 6/8/ Ka Yee Yeung - Lipari 1 n

15 6/8/ Ka Yee Yeung - Lipari 1 Similarity measures Eucliean istance Correlation coefficient 1 ]) [ ] [ Â( - p j j Y j X p j X X where Y j Y X j X Y j Y X j X p j p j p j p j Â Â Â Â ] [, ) ] [ ( ) ] [ ( ) ] [ )( ] [ (

16 Example X 1-1 Y 1 Z -1 1 W X Y Z W - - Correlation (X,Y) 1 Distance (X,Y) Correlation (X,Z) -1 Distance (X,Z).8 Correlation (X,W) 1 Distance (X,W) 1.1 6/8/ Ka Yee Yeung - Lipari 16

17 Lessons from the example Correlation irection only Eucliean istance magnitue & irection 6/8/ Ka Yee Yeung - Lipari 17

18 Clustering algorithms Inputs: Raw ata matrix or similarity matrix Number of clusters or some other parameters Hierarchical vs Partitional algorithms 6/8/ Ka Yee Yeung - Lipari 18

19 Hierarchical Clustering [Hartigan 197] Agglomerative (bottom-up) enrogram Algorithm: Initialize: each item a cluster Iterate: select two most similar clusters merge them Halt: when require number of clusters is reache 6/8/ Ka Yee Yeung - Lipari 19

20 Hierarchical: Single Link cluster similarity similarity of two most similar members - Potentially long an skinny clusters + Fast 6/8/ Ka Yee Yeung - Lipari

21 6/8/ Ka Yee Yeung - Lipari 1 Example: single link Î È Î È (1,) (1,) 1 8 min{ 9,8} }, min{ 9 min{ 1,9} }, min{ 6,} min{ }, min{, 1, (1,),, 1, (1,),, 1, (1,),

22 6/8/ Ka Yee Yeung - Lipari Example: single link Î È 7 (1,,) (1,,) 1 Î È (1,) (1,) Î È min{ 8,} }, min{ 7 min{ 9,7} }, min{, (1,), (1,,),, (1,), (1,,),

23 6/8/ Ka Yee Yeung - Lipari Example: single link Î È 7 (1,,) (1,,) 1 Î È (1,) (1,) Î È }, min{ (1,,), (1,,), 1,,),(,) (

24 Hierarchical: Complete Link cluster similarity similarity of two least similar members + tight clusters - slow 6/8/ Ka Yee Yeung - Lipari

25 6/8/ Ka Yee Yeung - Lipari Example: complete link Î È Î È (1,) (1,) 1 9 max{9,8} }, max{ 1 max{1,9} }, max{ 6 max{6,} }, max{, 1, (1,),, 1, (1,),, 1, (1,),

26 6/8/ Ka Yee Yeung - Lipari 6 Example: complete link Î È Î È (1,) (1,) 1 Î È (,) (1,) (,) (1,) 7 max{7,} }, max{ 1 max{1,9} }, max{,,,(,) (1,), (1,), (1,),(,)

27 6/8/ Ka Yee Yeung - Lipari 7 Example: complete link Î È Î È (1,) (1,) Î È (,) (1,) (,) (1,) 1 1 }, max{ ),(, (1,),(,) 1,,),(,) (

28 Hierarchical: Average Link cluster similarity average similarity of all pairs + tight clusters - slow 6/8/ Ka Yee Yeung - Lipari 8

29 6/8/ Ka Yee Yeung - Lipari 9 Example: average link Î È Î È (1,) (1,) ) ( ) ( 1. 6 ) ( 1, 1, (1,),, 1, (1,),, 1, (1,),

30 6/8/ Ka Yee Yeung - Lipari Example: average link Î È Î È (1,) (1,) 1 Î È 6 9. (,) (1,) (,) (1,) 6 ) ( 1 9 ) ( 1,,,(,),, 1, 1, (1,),(,)

31 6/8/ Ka Yee Yeung - Lipari 1 Example: average link Î È Î È (1,) (1,) Î È 6 9. (,) (1,) (,) (1,) 1 8 ) ( 6 1,,,, 1, 1, 1,,),(,) (

32 Partitional: K-Means [MacQueen 196] 1 6/8/ Ka Yee Yeung - Lipari

33 Details of k-means Iterate until converge: Assign each ata point to the closest centroi Compute new centroi Properties: Converge to local optimum In practice, converge quickly Ten to prouce spherical, equal-size clusters 6/8/ Ka Yee Yeung - Lipari

34 D-clustering Cluster both genes an experiments 6/8/ Ka Yee Yeung - Lipari

35 Summary Definition of clustering Pairwise similarity: Correlation Eucliean istance Clustering algorithms: Hierarchical (single-link, complete-link, average-link) K-means 6/8/ Ka Yee Yeung - Lipari

36 Clustering microarray ata

37 What has been one? Many clustering algorithms have been propose for gene expression ata. For example: Hierarchical clustering algorithms eg. [Eisen et al 1998] K-means eg. [Tavazoie et al. 1999] Self-organizing maps (SOM) eg. [Tamayo et al. 1999] Cluster Affinity Search Technique (CAST) [Ben- Dor, Yakhini 1999] an many others 6/8/ Ka Yee Yeung - Lipari 7

38 Common questions 1. How can I choose between all these clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many microarray experiments o I nee? 6/8/ Ka Yee Yeung - Lipari 8

39 Valiating clustering results [Yeung, Haynor, Ruzzo 1] FOM iea Data sets Results ISI most cite paper in Computer Science (Dec ) 6/8/ Ka Yee Yeung - Lipari 9

40 Valiation techniques [Jain an Dubes 1988] External valiation Require external knowlege Internal valiation Does not require external knowlege Gooness of fit between the ata an the clusters 6/8/ Ka Yee Yeung - Lipari

41 Comparing ifferent heuristicbase algorithms Apply a clustering algorithm to all but one experiment Use the left-out experiment to check the preictive power of the algorithm 6/8/ Ka Yee Yeung - Lipari 1

42 Figure of Merit (FOM) 1 Experiments 1 e m Cluster C 1 genes g Cluster C i n Cluster C k FOM estimates preictive power: measures uniformity of gene expression levels in the left-out conition in clusters forme Low FOM > High preictive power 6/8/ Ka Yee Yeung - Lipari

43 FOM genes Experiments 1 e m 1 g n R(g,e) Cluster C 1 k 1 FOM ( e, k) Â Â( R( g, e) - mc n Cluster C i Cluster C k FOM ( k) i 1 gœc m Â e 1 ajuste FOM(e, k) i FOM ( e, k) FOM(e, k)/ i ( e)) n - k n 6/8/ Ka Yee Yeung - Lipari

44 Clustering Algorithms Partitional CAST (Ben-Dor et al. 1999) k-means (Hartigan 197) Hierarchical single-link average-link Complete-link Ranom (as a control) Ranomly assign genes to clusters 6/8/ Ka Yee Yeung - Lipari

45 Gene expression ata sets Ovary ata (Michel Schummer, Institute of Systems Biology) Subset of ata : clones experiments (cancer/normal tissue samples) clones correspon to genes (classes) Yeast cell cycle ata (Cho et al 1998) 17 time points Subset of 8 genes correspon to phases of cell cycle 6/8/ Ka Yee Yeung - Lipari

46 Synthetic ata sets Mixture of normal istributions base on the ovary ata Generate a multivariate normal istributions with the sample covariance matrix an mean vector of each class in the ovary ata Ranomly resample ovary ata For each class, ranomly sample the expression levels in each experiment Near iagonal covariance matrix 6/8/ Ka Yee Yeung - Lipari 6

47 Ranomly resample synthetic Ovary ata experiments ata set Synthetic ata class genes 6/8/ Ka Yee Yeung - Lipari 7

48 Results: ovary ata FOM 1 CAST ranom average-link single-link complete-link kmeans (init avglink) number of clusters CAST, k-means an complete-link : best performanace 6/8/ Ka Yee Yeung - Lipari 8

49 Results: yeast cell cycle ata 6 FOM 1 1 CAST ranom avglink complete-link single-link 1 1 number of clusters k-means (init avglink) CAST, k-means: best performance 6/8/ Ka Yee Yeung - Lipari 9

50 Results: mixture of normal base on ovary ata FOM CAST ranom average-link single-link complete-link k-means (init avglink) number of clusters At clusters: CAST lowest FOM 6/8/ Ka Yee Yeung - Lipari

51 Results: ranomly resample ovary ata FOM CAST ranom average-link single-link complete-link k-means (init avglink) number of clusters At clusters: CAST lowest FOM 6/8/ Ka Yee Yeung - Lipari 1

52 Summary of Results CAST an k-means prouce higher quality clusters than the hierarchical algorithms Single-link has worse performance among the hierarchical algorithms 6/8/ Ka Yee Yeung - Lipari

53 Software Implementation Comman line C coe: not very userfrienly at the moment Thir party implementation: MEV from TIGR 6/8/ Ka Yee Yeung - Lipari

54 6/8/ Ka Yee Yeung - Lipari

55 Thank-you s FOM work: Davi Haynor (Raiology, UW) Larry Ruzzo (Computer Science, UW) Ovary ata: Michel Schummer (Institute of Systems Biology) 6/8/ Ka Yee Yeung - Lipari

56 Reay for a break?

57 Overview Introuction to microarrays Clustering 11 Valiating clustering results on microarray ata Moel-base clustering using microarray ata Co-expression co-regulation?? 6/8/ Ka Yee Yeung - Lipari 7

58 Common questions 1. How can I choose between all these clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many microarray experiments o I nee? 6/8/ Ka Yee Yeung - Lipari 8

59 Moel-base clustering [Yeung, Fraley, Murua, Raftery, Ruzzo 1] Overview of moel-base clustering Data sets Results Summary an Future Work ISI most cite paper in Computer Science (Jan ) 6/8/ Ka Yee Yeung - Lipari 9

60 Moel-base clustering Gaussian mixture moel: Assume each cluster is generate by the multivariate normal istribution Each cluster k has parameters : Mean vector: m k Covariance matrix: S k Likelihoo for the mixture moel: n G Â 1,..., qg y) t k k ( yi k ) i 1 k 1 L mix ( q f q Data transformations & normality assumption 6/8/ Ka Yee Yeung - Lipari 6

61 EM algorithm General appraoch to maximum likelihoo Iterate between E an M steps: E step: compute the probability of each observation belonging to each cluster using the current parameter estimates M-step: estimate moel parameters using the current group membership probabilities 6/8/ Ka Yee Yeung - Lipari 61

62 Parameterization of the covariance matrix: S k l k D k A k D k T (Banfiel & Raftery 199) Equal variance spherical moel (EI): ~ kmeans S k l I variance orientation shape Unequal variance spherical moel (VI): S k l k I 6/8/ Ka Yee Yeung - Lipari 6

63 Covariance matrix: S k l k D k A k D k T Unconstraine moel (VVV): S k l k D k A k D k T variance orientation shape EEE elliptical moel: S k ldad T Diagonal moel: S k l k B k, where B k is iagonal with B k 1 6/8/ Ka Yee Yeung - Lipari 6

64 Key avantage of the moelbase approach: choose the moel an the number of clusters Bayesian Information Criterion (BIC) A large BIC score inicates strong evience for the corresponing moel. 6/8/ Ka Yee Yeung - Lipari 6

65 Definition of the BIC score log p ( D M ) ª log p( D qˆ, M ) -u log( n) k k k k BIC k The integrate likelihoo p(d M k ) is har to evaluate, where D is the ata, M k is the moel. BIC is an approximation to log p(d M k ) u k : number of parameters to be estimate in moel M k 6/8/ Ka Yee Yeung - Lipari 6

66 Overall Clustering Approach For a given range of # clusters (G) For each moel Apply EM to estimate moel parameters an cluster memberships Compute BIC 6/8/ Ka Yee Yeung - Lipari 66

67 Our Approach Our Goal: To show the moel-base approach has superior performance on: Quality of clusters Number of clusters an moel chosen (BIC) To compare clusters with classes: Ajuste Ran inex (Hubert an Arabie 198) High ajuste Ran inex Ë high agreement Compare the quality of clusters with a leaing heuristic-base algorithm: CAST (Ben- Dor & Yakhini 1999) 6/8/ Ka Yee Yeung - Lipari 67

68 Ajuste Ran inex Compare clusters to classes Consier # pairs of objects Same class Different class Same cluster a b Different cluster c 6/8/ Ka Yee Yeung - Lipari 68

69 6/8/ Ka Yee Yeung - Lipari 69 Example (Ajuste Ran) c#1() c#() c#(7) c#() class#1() class#() class#() 1 class#(1) ˆ Á Ë Ê - - ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê - - ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê c b a a c a b a.69 ) ( 1 ) ( Ajuste Ran.789, R E R E R c a a R Ran

70 Methoology for BIC users: Not require classes Choose the number of clusters an moel Our evaluation methoology: Ajuste Ran Nee classes Assess the agreement of clusters to the classes 6/8/ Ka Yee Yeung - Lipari 7

71 Gene expression ata sets Ovary ata (Michel Schummer, Institute of Systems Biology) Subset of ata : clones experiments (cancer/normal tissue samples) clones correspon to genes (classes) Yeast cell cycle ata (Cho et al 1998) 17 time points Subset of 8 genes correspon to phases of cell cycle 6/8/ Ka Yee Yeung - Lipari 71

72 Synthetic ata sets Mixture of normal istributions base on the ovary ata Generate a multivariate normal istributions with the sample covariance matrix an mean vector of each class in the ovary ata Ranomly resample ovary ata For each class, ranomly sample the expression levels in each experiment Near iagonal covariance matrix 6/8/ Ka Yee Yeung - Lipari 7

73 Ranomly resample synthetic Ovary ata experiments ata set Synthetic ata class genes 6/8/ Ka Yee Yeung - Lipari 7

74 Results: mixture of normal istributions base on ovary ata ( genes).8 Ajuste Ran BIC number of clusters EI VI VVV iagonal CAST EI VI VVV iagonal At clusters, VVV, iagonal, CAST : high ajuste Ran BIC selects VVV at clusters number of clusters 6/8/ Ka Yee Yeung - Lipari 7

75 Results: ranomly resample ovary ata.9.8 Ajuste Ran number of clusters EI VI VVV iagonal CAST EEE Diagonal moel achieves the max ajuste Ran an BIC score (higher than CAST) BIC EI VI iagonal EEE BIC max at clusters Confirms expecte result -1 number of clusters 6/8/ Ka Yee Yeung - Lipari 7

76 Results: square root ovary ata.8.7 Ajuste Ran.6... EI VI VVV iagonal CAST EEE Ajuste Ran: max at EEE clusters (> CAST) number of clusters BIC analysis: EEE an iagonal moels -> first local max at clusters BIC EI VI iagonal EEE Global max -> VI at 8 clusters - number of clusters 6/8/ Ka Yee Yeung - Lipari 76

77 Results: stanarize yeast cell cycle ata.. BIC Ajuste Ran number of clusters EI VI VVV iagonal CAST EEE EI VI iagonal EEE Ajuste Ran: EI slightly > CAST at clusters. BIC: selects EEE at clusters. -17 number of clusters 6/8/ Ka Yee Yeung - Lipari 77

78 Results: log yeast cell cycle ata BIC Ajuste Ran number of clusters number of clusters EI VI VVV iagonal CAST EEE EI VI iagonal EEE CAST achieves much higher ajuste Ran inices than most moel-base approaches (except EEE). BIC scores of EEE much higher than the other moels. 6/8/ Ka Yee Yeung - Lipari 78

79 log yeast cell cycle ata 6/8/ Ka Yee Yeung - Lipari 79

80 Stanarize yeast cell cycle ata 6/8/ Ka Yee Yeung - Lipari 8

81 Synthetic ata sets: Summary With the correct moel, the moel-base approach excels over CAST BIC selects the right moel at the correct number of clusters Real expression ata sets: Comparable quality of clusters to CAST Avantage: BIC gives a hint to the number of clusters 6/8/ Ka Yee Yeung - Lipari 81

82 Software Implementation Software: Mclust available in Splus (Chris Fraley an Arian Raftery) R (Ron Wehrens) Matlab (Angel Martinez an Weny Martinez) 6/8/ Ka Yee Yeung - Lipari 8

83 Future Work Custom refinements to the moelbase implementation: Design moels that incorporate specific information about the experiments, eg. Block iagonal covariance matrix Missing ata Outliers 6/8/ Ka Yee Yeung - Lipari 8

84 Moel-base work: Thank-you s Chris Fraley (Statistics, UW) Alejanro Murua (Statistics, UW) Arian Raftery (Statistics, UW) Larry Ruzzo (Computer Science, UW) Ovary ata: Michel Schummer (Institute of Systems Biology) 6/8/ Ka Yee Yeung - Lipari 8

85 Common questions 1. How can I choose between all these clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many microarray experiments o I nee? 6/8/ Ka Yee Yeung - Lipari 8

86 From co-expression to co-regulation: How many microarray experiments o we nee? [Yeung, Meveovic, Bumgarner: To appear in Genome Biology ] 6/8/ Ka Yee Yeung - Lipari 86

87 From co-expression to co-regulation Motivation: Genes sharing the same transcriptional moules are expecte to prouce similar expression patterns Cluster analysis is often use to ientify genes that have similar expression patterns. Questions: 1. How likely are co-expresse genes regulate by the same transcription factors?. What is the effect of the following factors on the likelihoo: a. Number of microarray experiments b. Clustering algorithm use 6/8/ Ka Yee Yeung - Lipari 87

88 Transcription factor atabases Yeast microarray ata Pre-processing Yeast genes regulate by the same TF s Ranomly sample subsets with E experiments cluster For each pair of genes in the same cluster, evaluate if they share the same TF s 6/8/ Ka Yee Yeung - Lipari 88

89 Yeast transcription factors SCPD (Saccharomyces Cerevisiae Promoter Database) [zhang et al. 1999] List ~ genes that are regulate by 9 transcription factors (TF s) YPD (Yeast Protein Database) Commercial: UW oes not have access Appenix of [Lee et al. ] List genes regulate by each TF from literature as of Nov 1 List ~8 genes that are regulate by 1 TF s 6/8/ Ka Yee Yeung - Lipari 89

90 Comparing YPD an SCPD # istinct ORF s SCP D YPD 8 Common 16 # istinct TF s 18 1 # gene-tf interactions SCPD: 1/9 6% TF s regulate only 1 gene YPD: 17/1 1% TF s regulate only 1 gene In general, the YPD list contains TF s that regulate a higher # genes 6/8/ Ka Yee Yeung - Lipari 9

91 Yeast microarray ata Rosetta s yeast compenium ata [Hughes et al. ] knockout -color experiments Stanfor: Gasch et al. Data [ an 1] cdna array ata uner a variety of environmental stress (eg. heat shock) Total concatenate time course experiments 6/8/ Ka Yee Yeung - Lipari 91

92 Evaluation For each clustering result Count the number of pairs of genes that belong to the same cluster an share a common TF (True positive, TP) # gene pairs from the same clusters an share at least 1common TF's TP rate # gene pairs from the same clusters TP rate may change as a function of # clusters, we compare the TP rate to the TP rate of ranom partitions: Ranomly partition the set of genes 1 times Distribution of TP rate ~ Normal mean m an stanar eviation s Z-score (TP rate - m) / s A high z-score TP rate is significantly higher than those of ranom partitions 6/8/ Ka Yee Yeung - Lipari 9

93 Results: Compenium ata using all experiments To compare the performance of ifferent clustering algorithms 6/8/ Ka Yee Yeung - Lipari 9

94 compenium ata & SCPD (7 E) average-link (corr) IMM average-link (ist) complete-link (corr) Mclust complete-link (ist) 1 z-score # clusters MCLUST an complete-link (corr) prouce relatively high z-scores 6/8/ Ka Yee Yeung - Lipari 9

95 IMM (Infinite Mixture Moel-base) Infinite mixture moel: Each cluster is assume to follow a multivariate normal istribution o NOT assume # clusters Use a Gibbs Sampler to estimate the pairwise probabilities (Pij) for two genes (i,j) to belong to the same cluster To form clusters: cluster Pij with a heuristic clustering algorithm (eg. complete-link) Built-in error moel Assume the repeate measurements are generate from another multivariate Gaussian istribution. 6/8/ Ka Yee Yeung - Lipari 9

96 Results: Compenium ata: effect of # experiments 6/8/ Ka Yee Yeung - Lipari 96

97 Compenium ata & SCPD: Hierarchical complete-link over a range of # clusters meian z-score experiments 1 experiments experiments experiments 1 experiments experiments 7 experiments # clusters Observation: Meian z-score increases as # experiments increases over ifferent # clusters. 6/8/ Ka Yee Yeung - Lipari 97

98 Compenium ata & SCPD: Different clustering algorithms at clusters meian z-score average-link (corr) complete-link (corr) Mclust IMM # experiments Proportions of co-regulate genes increase as # experiments increases Mclust: highest proportions of co-regulate genes 6/8/ Ka Yee Yeung - Lipari 98

99 Compenium ata & YPD (7G): Different clustering algorithms at clusters meian z-score 1 1 average-link (corr) complete-link (corr) Mclust IMM # experiments 6/8/ Ka Yee Yeung - Lipari 99

100 Summary of Results More microarray experiments More likely to fin co-regulate genes!! SCPD/YPD prouces similar results Eucliean istance tens to prouce relatively low z- score compare to correlation using the same algorithm Stanarization greatly improves the performance of moel-base methos Mclust (EI moel) prouces relatively high z-scores IMM oesn t work as well as Mclust. Why?? 6/8/ Ka Yee Yeung - Lipari 1

101 ChIP-CHIP the methoology Transcription factors are crosslinke to genomic DNA DNA is sheare Antiboies immunoprecipitate a specific transcription factor DNA is un-linke, labele an use to interrogate arrays * * 6/8/ Ka Yee Yeung - Lipari 11

102 ChIP ata:r gol stanar [Lee et al. Science ] Chromatin Immunoprecipitation (ChIP): to etect the bining of TF s of interest to integenic sequences in yeast in vivo 16 TF s from YPD (11 TF s in their raw ata) Aopte error moel from [Hughes et al. ] Raw ata (log ratios an p-values for genes/integenic regions to TF s) available p-value cutoff.1 6/8/ Ka Yee Yeung - Lipari 1

103 Comparing ChIP ata an YPD 791 gene-tf interactions from YPD have a common gene an TF in the ChIP ata p-values range # gene-tf interactions % gene-tf interactions < x < < x < < x < < x < < x < < x <. 9. x >. 1.7 p-value TP FN total # ChIP interactions # etecte by ChIP but not in YPD TP % FN % /8/ Ka Yee Yeung - Lipari 1

104 Results: compenium ata & ChIP (1 genes) meian z-score 1 1 average-link (corr) complete-link (corr) Mclust IMM # experiments Very similar results on other atasets as well 6/8/ Ka Yee Yeung - Lipari 1

105 Take home message In orer to reliably infer co-regulation from cluster analysis, we nee lots of ata. 6/8/ Ka Yee Yeung - Lipari 1

106 Limitations Very naïve assumption of co-regulation: genes sharing at last one common transcription factors Yeast ata only Does not take the information limit of microarray atasets into consieration Consier only clustering algorithms in which each gene is assigne to only one cluster Our current stuy oes not provie completely quantitative results: how many experiments are sufficient to achieve x% co-regulation? 6/8/ Ka Yee Yeung - Lipari 16

107 Thank you s Roger Bumgarner (Microbiology, UW) Bumgarner Lab, UW: Tanveer Haier, Tao Peng, Mette Peters, Kyle Serikawa, Caimiao Wei Te Young -- Biochemistry, UW IMM: Mario Meveovic -- Univ of Cincinnati Mclust: Arian Raftery + Chris Fraley (Statistics, UW) 6/8/ Ka Yee Yeung - Lipari 17

108 Summary Question 1. How can I choose between ifferent clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many experiments o I nee? Answer FOM: compare any clustering algorithms on any ataset Moel-base clustering algorithm: high cluster quality estimate number of clusters. More experiments more likely to fin co-regulate genes Moel-base metho J in yeast, ~ experiments 6/8/ Ka Yee Yeung - Lipari 18

109 Meet my collaborators an mentors Davi Haynor Roger Bumgarner 6/8/ Mario Meveovic Arian Raftery Ka Yee Yeung - Lipari Larry Ruzzo 19

110 Key References [Yeung, Haynor, Ruzzo 1] Valiating clustering for gene expression ata. Bioinformatics 17:9-18 [Yeung, Fraley, Murua, Raftery, Ruzzo 1] Moelbase clustering an ata transformations for gene expresison ata. Bioinformatics 17: [Yeung, Meveovic, Bumgarner ] From coexpression to co-regulation: how many microarray experiments o we nee? To appear in Genome Biology. 6/8/ Ka Yee Yeung - Lipari 11

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering Model-based clustering and data transformations of gene expression data Walter L. Ruzzo University of Washington UW CSE Computational Biology Group 2 Toy 2-d Clustering Example K-Means? 3 4 Hierarchical