Model-based clustering and validation techniques for gene expression data

Size: px
Start display at page:

Download "Model-based clustering and validation techniques for gene expression data"

Transcription

1 Moel-base clustering an valiation techniques for gene expression ata Ka Yee Yeung Department of Microbiology University of Washington, Seattle WA

2 Overview Introuction to microarrays Clustering 11 Valiating clustering results on microarray ata Moel-base clustering using microarray ata Co-expression co-regulation?? 6/8/ Ka Yee Yeung - Lipari

3 Introuction to Microarrays 6/8/ Ka Yee Yeung - Lipari

4 DNA Arrays Measure the Concentration of 1 s of Genes Simultaneously On the surface In solution After Hybriization A B copies of gene A, 1 copy of gene B A B 6/8/ Ka Yee Yeung - Lipari

5 Two-color analysis allows for comparative stuies to be one In solution #1 After Hybriization copies of gene A, 1 copy of gene B In solution # A B copies of gene A, copies of gene B 6/8/ Ka Yee Yeung - Lipari

6 Cell Population #1 Cell Population # Extract mrna Extract mrna Make cdna Label w/ Green Fluor Make cdna Label w/ Re Fluor Co-hybriize.. Scan Slie with DNA from ifferent genes

7 The Gene Chip from Affymetrix 6/8/ Ka Yee Yeung - Lipari 7

8 Affymetrix chips: oligonucleotie arrays PM (Perfect Match) vs. MM (mis-match) 6/8/ Ka Yee Yeung - Lipari 8

9 A gene expression ata set.. Snapshot of activities in the cell Each chip represents an experiment: time course tissue samples (normal/cancer) n genes p experiments X ij 6/8/ Ka Yee Yeung - Lipari 9

10 What is clustering? Group similar objects together Clustering experiments Clustering genes Gene 1 Gene Gene Gene E E E E /8/ Ka Yee Yeung - Lipari 1

11 Applications of clustering gene expression ata Guilt by association E.g. Cluster the genes Ë functionally relate genes Class iscovery E.g. Cluster the experiments Ë iscover new subtypes of tissue samples 6/8/ Ka Yee Yeung - Lipari 11

12 Clustering 11 6/8/ Ka Yee Yeung - Lipari 1

13 What is clustering? Group similar objects together Objects in the same cluster (group) are more similar to each other than objects in ifferent clusters Data exploratory tool Unsupervise metho Do not assume any knowlege of the genes or experiments 6/8/ Ka Yee Yeung - Lipari 1

14 How to efine similarity? 1 X 1 Experiments p X genes n genes genes Y Y n Raw matrix Similarity measure: A measure of pairwise similarity or issimilarity Examples: Correlation coefficient Eucliean istance Similarity matrix 6/8/ Ka Yee Yeung - Lipari 1 n

15 6/8/ Ka Yee Yeung - Lipari 1 Similarity measures Eucliean istance Correlation coefficient 1 ]) [ ] [ Â( - p j j Y j X p j X X where Y j Y X j X Y j Y X j X p j p j p j p j     ] [, ) ] [ ( ) ] [ ( ) ] [ )( ] [ (

16 Example X 1-1 Y 1 Z -1 1 W X Y Z W - - Correlation (X,Y) 1 Distance (X,Y) Correlation (X,Z) -1 Distance (X,Z).8 Correlation (X,W) 1 Distance (X,W) 1.1 6/8/ Ka Yee Yeung - Lipari 16

17 Lessons from the example Correlation irection only Eucliean istance magnitue & irection 6/8/ Ka Yee Yeung - Lipari 17

18 Clustering algorithms Inputs: Raw ata matrix or similarity matrix Number of clusters or some other parameters Hierarchical vs Partitional algorithms 6/8/ Ka Yee Yeung - Lipari 18

19 Hierarchical Clustering [Hartigan 197] Agglomerative (bottom-up) enrogram Algorithm: Initialize: each item a cluster Iterate: select two most similar clusters merge them Halt: when require number of clusters is reache 6/8/ Ka Yee Yeung - Lipari 19

20 Hierarchical: Single Link cluster similarity similarity of two most similar members - Potentially long an skinny clusters + Fast 6/8/ Ka Yee Yeung - Lipari

21 6/8/ Ka Yee Yeung - Lipari 1 Example: single link Î È Î È (1,) (1,) 1 8 min{ 9,8} }, min{ 9 min{ 1,9} }, min{ 6,} min{ }, min{, 1, (1,),, 1, (1,),, 1, (1,),

22 6/8/ Ka Yee Yeung - Lipari Example: single link Î È 7 (1,,) (1,,) 1 Î È (1,) (1,) Î È min{ 8,} }, min{ 7 min{ 9,7} }, min{, (1,), (1,,),, (1,), (1,,),

23 6/8/ Ka Yee Yeung - Lipari Example: single link Î È 7 (1,,) (1,,) 1 Î È (1,) (1,) Î È }, min{ (1,,), (1,,), 1,,),(,) (

24 Hierarchical: Complete Link cluster similarity similarity of two least similar members + tight clusters - slow 6/8/ Ka Yee Yeung - Lipari

25 6/8/ Ka Yee Yeung - Lipari Example: complete link Î È Î È (1,) (1,) 1 9 max{9,8} }, max{ 1 max{1,9} }, max{ 6 max{6,} }, max{, 1, (1,),, 1, (1,),, 1, (1,),

26 6/8/ Ka Yee Yeung - Lipari 6 Example: complete link Î È Î È (1,) (1,) 1 Î È (,) (1,) (,) (1,) 7 max{7,} }, max{ 1 max{1,9} }, max{,,,(,) (1,), (1,), (1,),(,)

27 6/8/ Ka Yee Yeung - Lipari 7 Example: complete link Î È Î È (1,) (1,) Î È (,) (1,) (,) (1,) 1 1 }, max{ ),(, (1,),(,) 1,,),(,) (

28 Hierarchical: Average Link cluster similarity average similarity of all pairs + tight clusters - slow 6/8/ Ka Yee Yeung - Lipari 8

29 6/8/ Ka Yee Yeung - Lipari 9 Example: average link Î È Î È (1,) (1,) ) ( ) ( 1. 6 ) ( 1, 1, (1,),, 1, (1,),, 1, (1,),

30 6/8/ Ka Yee Yeung - Lipari Example: average link Î È Î È (1,) (1,) 1 Î È 6 9. (,) (1,) (,) (1,) 6 ) ( 1 9 ) ( 1,,,(,),, 1, 1, (1,),(,)

31 6/8/ Ka Yee Yeung - Lipari 1 Example: average link Î È Î È (1,) (1,) Î È 6 9. (,) (1,) (,) (1,) 1 8 ) ( 6 1,,,, 1, 1, 1,,),(,) (

32 Partitional: K-Means [MacQueen 196] 1 6/8/ Ka Yee Yeung - Lipari

33 Details of k-means Iterate until converge: Assign each ata point to the closest centroi Compute new centroi Properties: Converge to local optimum In practice, converge quickly Ten to prouce spherical, equal-size clusters 6/8/ Ka Yee Yeung - Lipari

34 D-clustering Cluster both genes an experiments 6/8/ Ka Yee Yeung - Lipari

35 Summary Definition of clustering Pairwise similarity: Correlation Eucliean istance Clustering algorithms: Hierarchical (single-link, complete-link, average-link) K-means 6/8/ Ka Yee Yeung - Lipari

36 Clustering microarray ata

37 What has been one? Many clustering algorithms have been propose for gene expression ata. For example: Hierarchical clustering algorithms eg. [Eisen et al 1998] K-means eg. [Tavazoie et al. 1999] Self-organizing maps (SOM) eg. [Tamayo et al. 1999] Cluster Affinity Search Technique (CAST) [Ben- Dor, Yakhini 1999] an many others 6/8/ Ka Yee Yeung - Lipari 7

38 Common questions 1. How can I choose between all these clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many microarray experiments o I nee? 6/8/ Ka Yee Yeung - Lipari 8

39 Valiating clustering results [Yeung, Haynor, Ruzzo 1] FOM iea Data sets Results ISI most cite paper in Computer Science (Dec ) 6/8/ Ka Yee Yeung - Lipari 9

40 Valiation techniques [Jain an Dubes 1988] External valiation Require external knowlege Internal valiation Does not require external knowlege Gooness of fit between the ata an the clusters 6/8/ Ka Yee Yeung - Lipari

41 Comparing ifferent heuristicbase algorithms Apply a clustering algorithm to all but one experiment Use the left-out experiment to check the preictive power of the algorithm 6/8/ Ka Yee Yeung - Lipari 1

42 Figure of Merit (FOM) 1 Experiments 1 e m Cluster C 1 genes g Cluster C i n Cluster C k FOM estimates preictive power: measures uniformity of gene expression levels in the left-out conition in clusters forme Low FOM > High preictive power 6/8/ Ka Yee Yeung - Lipari

43 FOM genes Experiments 1 e m 1 g n R(g,e) Cluster C 1 k 1 FOM ( e, k)  Â( R( g, e) - mc n Cluster C i Cluster C k FOM ( k) i 1 gœc m  e 1 ajuste FOM(e, k) i FOM ( e, k) FOM(e, k)/ i ( e)) n - k n 6/8/ Ka Yee Yeung - Lipari

44 Clustering Algorithms Partitional CAST (Ben-Dor et al. 1999) k-means (Hartigan 197) Hierarchical single-link average-link Complete-link Ranom (as a control) Ranomly assign genes to clusters 6/8/ Ka Yee Yeung - Lipari

45 Gene expression ata sets Ovary ata (Michel Schummer, Institute of Systems Biology) Subset of ata : clones experiments (cancer/normal tissue samples) clones correspon to genes (classes) Yeast cell cycle ata (Cho et al 1998) 17 time points Subset of 8 genes correspon to phases of cell cycle 6/8/ Ka Yee Yeung - Lipari

46 Synthetic ata sets Mixture of normal istributions base on the ovary ata Generate a multivariate normal istributions with the sample covariance matrix an mean vector of each class in the ovary ata Ranomly resample ovary ata For each class, ranomly sample the expression levels in each experiment Near iagonal covariance matrix 6/8/ Ka Yee Yeung - Lipari 6

47 Ranomly resample synthetic Ovary ata experiments ata set Synthetic ata class genes 6/8/ Ka Yee Yeung - Lipari 7

48 Results: ovary ata FOM 1 CAST ranom average-link single-link complete-link kmeans (init avglink) number of clusters CAST, k-means an complete-link : best performanace 6/8/ Ka Yee Yeung - Lipari 8

49 Results: yeast cell cycle ata 6 FOM 1 1 CAST ranom avglink complete-link single-link 1 1 number of clusters k-means (init avglink) CAST, k-means: best performance 6/8/ Ka Yee Yeung - Lipari 9

50 Results: mixture of normal base on ovary ata FOM CAST ranom average-link single-link complete-link k-means (init avglink) number of clusters At clusters: CAST lowest FOM 6/8/ Ka Yee Yeung - Lipari

51 Results: ranomly resample ovary ata FOM CAST ranom average-link single-link complete-link k-means (init avglink) number of clusters At clusters: CAST lowest FOM 6/8/ Ka Yee Yeung - Lipari 1

52 Summary of Results CAST an k-means prouce higher quality clusters than the hierarchical algorithms Single-link has worse performance among the hierarchical algorithms 6/8/ Ka Yee Yeung - Lipari

53 Software Implementation Comman line C coe: not very userfrienly at the moment Thir party implementation: MEV from TIGR 6/8/ Ka Yee Yeung - Lipari

54 6/8/ Ka Yee Yeung - Lipari

55 Thank-you s FOM work: Davi Haynor (Raiology, UW) Larry Ruzzo (Computer Science, UW) Ovary ata: Michel Schummer (Institute of Systems Biology) 6/8/ Ka Yee Yeung - Lipari

56 Reay for a break?

57 Overview Introuction to microarrays Clustering 11 Valiating clustering results on microarray ata Moel-base clustering using microarray ata Co-expression co-regulation?? 6/8/ Ka Yee Yeung - Lipari 7

58 Common questions 1. How can I choose between all these clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many microarray experiments o I nee? 6/8/ Ka Yee Yeung - Lipari 8

59 Moel-base clustering [Yeung, Fraley, Murua, Raftery, Ruzzo 1] Overview of moel-base clustering Data sets Results Summary an Future Work ISI most cite paper in Computer Science (Jan ) 6/8/ Ka Yee Yeung - Lipari 9

60 Moel-base clustering Gaussian mixture moel: Assume each cluster is generate by the multivariate normal istribution Each cluster k has parameters : Mean vector: m k Covariance matrix: S k Likelihoo for the mixture moel: n G Â 1,..., qg y) t k k ( yi k ) i 1 k 1 L mix ( q f q Data transformations & normality assumption 6/8/ Ka Yee Yeung - Lipari 6

61 EM algorithm General appraoch to maximum likelihoo Iterate between E an M steps: E step: compute the probability of each observation belonging to each cluster using the current parameter estimates M-step: estimate moel parameters using the current group membership probabilities 6/8/ Ka Yee Yeung - Lipari 61

62 Parameterization of the covariance matrix: S k l k D k A k D k T (Banfiel & Raftery 199) Equal variance spherical moel (EI): ~ kmeans S k l I variance orientation shape Unequal variance spherical moel (VI): S k l k I 6/8/ Ka Yee Yeung - Lipari 6

63 Covariance matrix: S k l k D k A k D k T Unconstraine moel (VVV): S k l k D k A k D k T variance orientation shape EEE elliptical moel: S k ldad T Diagonal moel: S k l k B k, where B k is iagonal with B k 1 6/8/ Ka Yee Yeung - Lipari 6

64 Key avantage of the moelbase approach: choose the moel an the number of clusters Bayesian Information Criterion (BIC) A large BIC score inicates strong evience for the corresponing moel. 6/8/ Ka Yee Yeung - Lipari 6

65 Definition of the BIC score log p ( D M ) ª log p( D qˆ, M ) -u log( n) k k k k BIC k The integrate likelihoo p(d M k ) is har to evaluate, where D is the ata, M k is the moel. BIC is an approximation to log p(d M k ) u k : number of parameters to be estimate in moel M k 6/8/ Ka Yee Yeung - Lipari 6

66 Overall Clustering Approach For a given range of # clusters (G) For each moel Apply EM to estimate moel parameters an cluster memberships Compute BIC 6/8/ Ka Yee Yeung - Lipari 66

67 Our Approach Our Goal: To show the moel-base approach has superior performance on: Quality of clusters Number of clusters an moel chosen (BIC) To compare clusters with classes: Ajuste Ran inex (Hubert an Arabie 198) High ajuste Ran inex Ë high agreement Compare the quality of clusters with a leaing heuristic-base algorithm: CAST (Ben- Dor & Yakhini 1999) 6/8/ Ka Yee Yeung - Lipari 67

68 Ajuste Ran inex Compare clusters to classes Consier # pairs of objects Same class Different class Same cluster a b Different cluster c 6/8/ Ka Yee Yeung - Lipari 68

69 6/8/ Ka Yee Yeung - Lipari 69 Example (Ajuste Ran) c#1() c#() c#(7) c#() class#1() class#() class#() 1 class#(1) ˆ Á Ë Ê - - ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê - - ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê + ˆ Á Ë Ê c b a a c a b a.69 ) ( 1 ) ( Ajuste Ran.789, R E R E R c a a R Ran

70 Methoology for BIC users: Not require classes Choose the number of clusters an moel Our evaluation methoology: Ajuste Ran Nee classes Assess the agreement of clusters to the classes 6/8/ Ka Yee Yeung - Lipari 7

71 Gene expression ata sets Ovary ata (Michel Schummer, Institute of Systems Biology) Subset of ata : clones experiments (cancer/normal tissue samples) clones correspon to genes (classes) Yeast cell cycle ata (Cho et al 1998) 17 time points Subset of 8 genes correspon to phases of cell cycle 6/8/ Ka Yee Yeung - Lipari 71

72 Synthetic ata sets Mixture of normal istributions base on the ovary ata Generate a multivariate normal istributions with the sample covariance matrix an mean vector of each class in the ovary ata Ranomly resample ovary ata For each class, ranomly sample the expression levels in each experiment Near iagonal covariance matrix 6/8/ Ka Yee Yeung - Lipari 7

73 Ranomly resample synthetic Ovary ata experiments ata set Synthetic ata class genes 6/8/ Ka Yee Yeung - Lipari 7

74 Results: mixture of normal istributions base on ovary ata ( genes).8 Ajuste Ran BIC number of clusters EI VI VVV iagonal CAST EI VI VVV iagonal At clusters, VVV, iagonal, CAST : high ajuste Ran BIC selects VVV at clusters number of clusters 6/8/ Ka Yee Yeung - Lipari 7

75 Results: ranomly resample ovary ata.9.8 Ajuste Ran number of clusters EI VI VVV iagonal CAST EEE Diagonal moel achieves the max ajuste Ran an BIC score (higher than CAST) BIC EI VI iagonal EEE BIC max at clusters Confirms expecte result -1 number of clusters 6/8/ Ka Yee Yeung - Lipari 7

76 Results: square root ovary ata.8.7 Ajuste Ran.6... EI VI VVV iagonal CAST EEE Ajuste Ran: max at EEE clusters (> CAST) number of clusters BIC analysis: EEE an iagonal moels -> first local max at clusters BIC EI VI iagonal EEE Global max -> VI at 8 clusters - number of clusters 6/8/ Ka Yee Yeung - Lipari 76

77 Results: stanarize yeast cell cycle ata.. BIC Ajuste Ran number of clusters EI VI VVV iagonal CAST EEE EI VI iagonal EEE Ajuste Ran: EI slightly > CAST at clusters. BIC: selects EEE at clusters. -17 number of clusters 6/8/ Ka Yee Yeung - Lipari 77

78 Results: log yeast cell cycle ata BIC Ajuste Ran number of clusters number of clusters EI VI VVV iagonal CAST EEE EI VI iagonal EEE CAST achieves much higher ajuste Ran inices than most moel-base approaches (except EEE). BIC scores of EEE much higher than the other moels. 6/8/ Ka Yee Yeung - Lipari 78

79 log yeast cell cycle ata 6/8/ Ka Yee Yeung - Lipari 79

80 Stanarize yeast cell cycle ata 6/8/ Ka Yee Yeung - Lipari 8

81 Synthetic ata sets: Summary With the correct moel, the moel-base approach excels over CAST BIC selects the right moel at the correct number of clusters Real expression ata sets: Comparable quality of clusters to CAST Avantage: BIC gives a hint to the number of clusters 6/8/ Ka Yee Yeung - Lipari 81

82 Software Implementation Software: Mclust available in Splus (Chris Fraley an Arian Raftery) R (Ron Wehrens) Matlab (Angel Martinez an Weny Martinez) 6/8/ Ka Yee Yeung - Lipari 8

83 Future Work Custom refinements to the moelbase implementation: Design moels that incorporate specific information about the experiments, eg. Block iagonal covariance matrix Missing ata Outliers 6/8/ Ka Yee Yeung - Lipari 8

84 Moel-base work: Thank-you s Chris Fraley (Statistics, UW) Alejanro Murua (Statistics, UW) Arian Raftery (Statistics, UW) Larry Ruzzo (Computer Science, UW) Ovary ata: Michel Schummer (Institute of Systems Biology) 6/8/ Ka Yee Yeung - Lipari 8

85 Common questions 1. How can I choose between all these clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many microarray experiments o I nee? 6/8/ Ka Yee Yeung - Lipari 8

86 From co-expression to co-regulation: How many microarray experiments o we nee? [Yeung, Meveovic, Bumgarner: To appear in Genome Biology ] 6/8/ Ka Yee Yeung - Lipari 86

87 From co-expression to co-regulation Motivation: Genes sharing the same transcriptional moules are expecte to prouce similar expression patterns Cluster analysis is often use to ientify genes that have similar expression patterns. Questions: 1. How likely are co-expresse genes regulate by the same transcription factors?. What is the effect of the following factors on the likelihoo: a. Number of microarray experiments b. Clustering algorithm use 6/8/ Ka Yee Yeung - Lipari 87

88 Transcription factor atabases Yeast microarray ata Pre-processing Yeast genes regulate by the same TF s Ranomly sample subsets with E experiments cluster For each pair of genes in the same cluster, evaluate if they share the same TF s 6/8/ Ka Yee Yeung - Lipari 88

89 Yeast transcription factors SCPD (Saccharomyces Cerevisiae Promoter Database) [zhang et al. 1999] List ~ genes that are regulate by 9 transcription factors (TF s) YPD (Yeast Protein Database) Commercial: UW oes not have access Appenix of [Lee et al. ] List genes regulate by each TF from literature as of Nov 1 List ~8 genes that are regulate by 1 TF s 6/8/ Ka Yee Yeung - Lipari 89

90 Comparing YPD an SCPD # istinct ORF s SCP D YPD 8 Common 16 # istinct TF s 18 1 # gene-tf interactions SCPD: 1/9 6% TF s regulate only 1 gene YPD: 17/1 1% TF s regulate only 1 gene In general, the YPD list contains TF s that regulate a higher # genes 6/8/ Ka Yee Yeung - Lipari 9

91 Yeast microarray ata Rosetta s yeast compenium ata [Hughes et al. ] knockout -color experiments Stanfor: Gasch et al. Data [ an 1] cdna array ata uner a variety of environmental stress (eg. heat shock) Total concatenate time course experiments 6/8/ Ka Yee Yeung - Lipari 91

92 Evaluation For each clustering result Count the number of pairs of genes that belong to the same cluster an share a common TF (True positive, TP) # gene pairs from the same clusters an share at least 1common TF's TP rate # gene pairs from the same clusters TP rate may change as a function of # clusters, we compare the TP rate to the TP rate of ranom partitions: Ranomly partition the set of genes 1 times Distribution of TP rate ~ Normal mean m an stanar eviation s Z-score (TP rate - m) / s A high z-score TP rate is significantly higher than those of ranom partitions 6/8/ Ka Yee Yeung - Lipari 9

93 Results: Compenium ata using all experiments To compare the performance of ifferent clustering algorithms 6/8/ Ka Yee Yeung - Lipari 9

94 compenium ata & SCPD (7 E) average-link (corr) IMM average-link (ist) complete-link (corr) Mclust complete-link (ist) 1 z-score # clusters MCLUST an complete-link (corr) prouce relatively high z-scores 6/8/ Ka Yee Yeung - Lipari 9

95 IMM (Infinite Mixture Moel-base) Infinite mixture moel: Each cluster is assume to follow a multivariate normal istribution o NOT assume # clusters Use a Gibbs Sampler to estimate the pairwise probabilities (Pij) for two genes (i,j) to belong to the same cluster To form clusters: cluster Pij with a heuristic clustering algorithm (eg. complete-link) Built-in error moel Assume the repeate measurements are generate from another multivariate Gaussian istribution. 6/8/ Ka Yee Yeung - Lipari 9

96 Results: Compenium ata: effect of # experiments 6/8/ Ka Yee Yeung - Lipari 96

97 Compenium ata & SCPD: Hierarchical complete-link over a range of # clusters meian z-score experiments 1 experiments experiments experiments 1 experiments experiments 7 experiments # clusters Observation: Meian z-score increases as # experiments increases over ifferent # clusters. 6/8/ Ka Yee Yeung - Lipari 97

98 Compenium ata & SCPD: Different clustering algorithms at clusters meian z-score average-link (corr) complete-link (corr) Mclust IMM # experiments Proportions of co-regulate genes increase as # experiments increases Mclust: highest proportions of co-regulate genes 6/8/ Ka Yee Yeung - Lipari 98

99 Compenium ata & YPD (7G): Different clustering algorithms at clusters meian z-score 1 1 average-link (corr) complete-link (corr) Mclust IMM # experiments 6/8/ Ka Yee Yeung - Lipari 99

100 Summary of Results More microarray experiments More likely to fin co-regulate genes!! SCPD/YPD prouces similar results Eucliean istance tens to prouce relatively low z- score compare to correlation using the same algorithm Stanarization greatly improves the performance of moel-base methos Mclust (EI moel) prouces relatively high z-scores IMM oesn t work as well as Mclust. Why?? 6/8/ Ka Yee Yeung - Lipari 1

101 ChIP-CHIP the methoology Transcription factors are crosslinke to genomic DNA DNA is sheare Antiboies immunoprecipitate a specific transcription factor DNA is un-linke, labele an use to interrogate arrays * * 6/8/ Ka Yee Yeung - Lipari 11

102 ChIP ata:r gol stanar [Lee et al. Science ] Chromatin Immunoprecipitation (ChIP): to etect the bining of TF s of interest to integenic sequences in yeast in vivo 16 TF s from YPD (11 TF s in their raw ata) Aopte error moel from [Hughes et al. ] Raw ata (log ratios an p-values for genes/integenic regions to TF s) available p-value cutoff.1 6/8/ Ka Yee Yeung - Lipari 1

103 Comparing ChIP ata an YPD 791 gene-tf interactions from YPD have a common gene an TF in the ChIP ata p-values range # gene-tf interactions % gene-tf interactions < x < < x < < x < < x < < x < < x <. 9. x >. 1.7 p-value TP FN total # ChIP interactions # etecte by ChIP but not in YPD TP % FN % /8/ Ka Yee Yeung - Lipari 1

104 Results: compenium ata & ChIP (1 genes) meian z-score 1 1 average-link (corr) complete-link (corr) Mclust IMM # experiments Very similar results on other atasets as well 6/8/ Ka Yee Yeung - Lipari 1

105 Take home message In orer to reliably infer co-regulation from cluster analysis, we nee lots of ata. 6/8/ Ka Yee Yeung - Lipari 1

106 Limitations Very naïve assumption of co-regulation: genes sharing at last one common transcription factors Yeast ata only Does not take the information limit of microarray atasets into consieration Consier only clustering algorithms in which each gene is assigne to only one cluster Our current stuy oes not provie completely quantitative results: how many experiments are sufficient to achieve x% co-regulation? 6/8/ Ka Yee Yeung - Lipari 16

107 Thank you s Roger Bumgarner (Microbiology, UW) Bumgarner Lab, UW: Tanveer Haier, Tao Peng, Mette Peters, Kyle Serikawa, Caimiao Wei Te Young -- Biochemistry, UW IMM: Mario Meveovic -- Univ of Cincinnati Mclust: Arian Raftery + Chris Fraley (Statistics, UW) 6/8/ Ka Yee Yeung - Lipari 17

108 Summary Question 1. How can I choose between ifferent clustering methos?. Is there a clustering algorithm that works better than the others?. How to choose the number of clusters?. How often o I get biologically meaningful clusters?. How many experiments o I nee? Answer FOM: compare any clustering algorithms on any ataset Moel-base clustering algorithm: high cluster quality estimate number of clusters. More experiments more likely to fin co-regulate genes Moel-base metho J in yeast, ~ experiments 6/8/ Ka Yee Yeung - Lipari 18

109 Meet my collaborators an mentors Davi Haynor Roger Bumgarner 6/8/ Mario Meveovic Arian Raftery Ka Yee Yeung - Lipari Larry Ruzzo 19

110 Key References [Yeung, Haynor, Ruzzo 1] Valiating clustering for gene expression ata. Bioinformatics 17:9-18 [Yeung, Fraley, Murua, Raftery, Ruzzo 1] Moelbase clustering an ata transformations for gene expresison ata. Bioinformatics 17: [Yeung, Meveovic, Bumgarner ] From coexpression to co-regulation: how many microarray experiments o we nee? To appear in Genome Biology. 6/8/ Ka Yee Yeung - Lipari 11

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering Model-based clustering and data transformations of gene expression data Walter L. Ruzzo University of Washington UW CSE Computational Biology Group 2 Toy 2-d Clustering Example K-Means? 3 4 Hierarchical

More information

Principal component analysis (PCA) for clustering gene expression data

Principal component analysis (PCA) for clustering gene expression data Principal component analysis (PCA) for clustering gene expression data Ka Yee Yeung Walter L. Ruzzo Bioinformatics, v17 #9 (2001) pp 763-774 1 Outline of talk Background and motivation Design of our empirical

More information

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013 Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification 10-810: Advanced Algorithms and Models for Computational Biology Optimal leaf ordering and classification Hierarchical clustering As we mentioned, its one of the most popular methods for clustering gene

More information

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling Case Stuy 5: Mixe Membership Moeling LDA Collapse Gibbs Sampler, VariaNonal Inference Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox May 8 th, 05 Emily Fox 05 Task : Mixe

More information

Collapsed Variational Inference for HDP

Collapsed Variational Inference for HDP Collapse Variational Inference for HDP Yee W. Teh Davi Newman an Max Welling Publishe on NIPS 2007 Discussion le by Iulian Pruteanu Outline Introuction Hierarchical Bayesian moel for LDA Collapse VB inference

More information

A Variational Approach to Semi-Supervised Clustering

A Variational Approach to Semi-Supervised Clustering A Variational Approach to Semi-Supervise Clustering Peng Li, Yiming Ying an Colin Campbell Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, Unite Kingom Abstract We present

More information

A Modification of the Jarque-Bera Test. for Normality

A Modification of the Jarque-Bera Test. for Normality Int. J. Contemp. Math. Sciences, Vol. 8, 01, no. 17, 84-85 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1988/ijcms.01.9106 A Moification of the Jarque-Bera Test for Normality Moawa El-Fallah Ab El-Salam

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

Lecture 6: Generalized multivariate analysis of variance

Lecture 6: Generalized multivariate analysis of variance Lecture 6: Generalize multivariate analysis of variance Measuring association of the entire microbiome with other variables Distance matrices capture some aspects of the ata (e.g. microbiome composition,

More information

Modelling Gene Expression Data over Time: Curve Clustering with Informative Prior Distributions.

Modelling Gene Expression Data over Time: Curve Clustering with Informative Prior Distributions. BAYESIAN STATISTICS 7, pp. 000 000 J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (Eds.) c Oxford University Press, 2003 Modelling Data over Time: Curve

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

Inferring Transcriptional Regulatory Networks from High-throughput Data

Inferring Transcriptional Regulatory Networks from High-throughput Data Inferring Transcriptional Regulatory Networks from High-throughput Data Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: EBLUP Unit Level for Small Area Estimation Contents General section... 3 1. Summary... 3 2. General

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

under the null hypothesis, the sign test (with continuity correction) rejects H 0 when α n + n 2 2.

under the null hypothesis, the sign test (with continuity correction) rejects H 0 when α n + n 2 2. Assignment 13 Exercise 8.4 For the hypotheses consiere in Examples 8.12 an 8.13, the sign test is base on the statistic N + = #{i : Z i > 0}. Since 2 n(n + /n 1) N(0, 1) 2 uner the null hypothesis, the

More information

u!i = a T u = 0. Then S satisfies

u!i = a T u = 0. Then S satisfies Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace

More information

The Limits of Multiplexing

The Limits of Multiplexing Article type: Focus Article The Limits of Multiplexing Dan Shen,, D.P. Dittmer 2, an J. S. Marron 3 Keywors Multiplexing, Genomics, nanostring TM, DASL TM, Probability Abstract We were motivate by three

More information

VI. Linking and Equating: Getting from A to B Unleashing the full power of Rasch models means identifying, perhaps conceiving an important aspect,

VI. Linking and Equating: Getting from A to B Unleashing the full power of Rasch models means identifying, perhaps conceiving an important aspect, VI. Linking an Equating: Getting from A to B Unleashing the full power of Rasch moels means ientifying, perhaps conceiving an important aspect, efining a useful construct, an calibrating a pool of relevant

More information

Inverse Theory Course: LTU Kiruna. Day 1

Inverse Theory Course: LTU Kiruna. Day 1 Inverse Theory Course: LTU Kiruna. Day Hugh Pumphrey March 6, 0 Preamble These are the notes for the course Inverse Theory to be taught at LuleåTekniska Universitet, Kiruna in February 00. They are not

More information

Fuzzy Clustering of Gene Expression Data

Fuzzy Clustering of Gene Expression Data Fuzzy Clustering of Gene Data Matthias E. Futschik and Nikola K. Kasabov Department of Information Science, University of Otago P.O. Box 56, Dunedin, New Zealand email: mfutschik@infoscience.otago.ac.nz,

More information

THE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS

THE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS THE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS BY ANDREW F. MAGYAR A issertation submitte to the Grauate School New Brunswick Rutgers,

More information

ELECTRON DIFFRACTION

ELECTRON DIFFRACTION ELECTRON DIFFRACTION Electrons : wave or quanta? Measurement of wavelength an momentum of electrons. Introuction Electrons isplay both wave an particle properties. What is the relationship between the

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

A Hybrid Approach for Modeling High Dimensional Medical Data

A Hybrid Approach for Modeling High Dimensional Medical Data A Hybri Approach for Moeling High Dimensional Meical Data Alok Sharma 1, Gofrey C. Onwubolu 1 1 University of the South Pacific, Fii sharma_al@usp.ac.f, onwubolu_g@usp.ac.f Abstract. his work presents

More information

Bioinformatics. Transcriptome

Bioinformatics. Transcriptome Bioinformatics Transcriptome Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/ Bioinformatics

More information

MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS ABSTRACT KEYWORDS

MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS ABSTRACT KEYWORDS MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS BY BENJAMIN AVANZI, LUKE C. CASSAR AND BERNARD WONG ABSTRACT In this paper we investigate the potential of Lévy copulas as a tool for

More information

arxiv: v1 [hep-ex] 4 Sep 2018 Simone Ragoni, for the ALICE Collaboration

arxiv: v1 [hep-ex] 4 Sep 2018 Simone Ragoni, for the ALICE Collaboration Prouction of pions, kaons an protons in Xe Xe collisions at s =. ev arxiv:09.0v [hep-ex] Sep 0, for the ALICE Collaboration Università i Bologna an INFN (Bologna) E-mail: simone.ragoni@cern.ch In late

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

CONFIRMATORY FACTOR ANALYSIS

CONFIRMATORY FACTOR ANALYSIS 1 CONFIRMATORY FACTOR ANALYSIS The purpose of confirmatory factor analysis (CFA) is to explain the pattern of associations among a set of observe variables in terms of a smaller number of unerlying latent

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

Discovering molecular pathways from protein interaction and ge

Discovering molecular pathways from protein interaction and ge Discovering molecular pathways from protein interaction and gene expression data 9-4-2008 Aim To have a mechanism for inferring pathways from gene expression and protein interaction data. Motivation Why

More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation Binary Discrimination Methos for High Dimensional Data with a Geometric Representation Ay Bolivar-Cime, Luis Miguel Corova-Roriguez Universia Juárez Autónoma e Tabasco, División Acaémica e Ciencias Básicas

More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

Sponsors. AMIA 4915 St. Elmo Ave, Suite 401 Bethesda, MD tel fax

Sponsors.   AMIA 4915 St. Elmo Ave, Suite 401 Bethesda, MD tel fax http://www.amia.org/meetings/f05/ CD Orer Form Post-conference Summary General Information Searchable Program Scheule At-A-Glance Tutorials Nursing Informatics Symposium Exhibitor Information Exhibitor

More information

Homework 2 EM, Mixture Models, PCA, Dualitys

Homework 2 EM, Mixture Models, PCA, Dualitys Homework 2 EM, Mixture Moels, PCA, Dualitys CMU 10-715: Machine Learning (Fall 2015) http://www.cs.cmu.eu/~bapoczos/classes/ml10715_2015fall/ OUT: Oct 5, 2015 DUE: Oct 19, 2015, 10:20 AM Guielines The

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

Estimation of District Level Poor Households in the State of. Uttar Pradesh in India by Combining NSSO Survey and

Estimation of District Level Poor Households in the State of. Uttar Pradesh in India by Combining NSSO Survey and Int. Statistical Inst.: Proc. 58th Worl Statistical Congress, 2011, Dublin (Session CPS039) p.6567 Estimation of District Level Poor Househols in the State of Uttar Praesh in Inia by Combining NSSO Survey

More information

A Review of Multiple Try MCMC algorithms for Signal Processing

A Review of Multiple Try MCMC algorithms for Signal Processing A Review of Multiple Try MCMC algorithms for Signal Processing Luca Martino Image Processing Lab., Universitat e València (Spain) Universia Carlos III e Mari, Leganes (Spain) Abstract Many applications

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity 1 WESD - Weighte Spectral Distance for Measuring Shape Dissimilarity Ener Konukoglu, Ben Glocker, Antonio Criminisi an Kilian M. Pohl Abstract This article presents a new istance for measuring shape issimilarity

More information

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen Bayesian Hierarchical Classification Seminar on Predicting Structured Data Jukka Kohonen 17.4.2008 Overview Intro: The task of hierarchical gene annotation Approach I: SVM/Bayes hybrid Barutcuoglu et al:

More information

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

Inferring Transcriptional Regulatory Networks from Gene Expression Data II Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday

More information

A New Minimum Description Length

A New Minimum Description Length A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

Introduction. A Dirichlet Form approach to MCMC Optimal Scaling. MCMC idea

Introduction. A Dirichlet Form approach to MCMC Optimal Scaling. MCMC idea Introuction A Dirichlet Form approach to MCMC Optimal Scaling Markov chain Monte Carlo (MCMC quotes: Metropolis et al. (1953, running coe on the Los Alamos MANIAC: a feasible approach to statistical mechanics

More information

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Analyzing Tensor Power Method Dynamics in Overcomplete Regime Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical

More information

Lecture 5: November 19, Minimizing the maximum intracluster distance

Lecture 5: November 19, Minimizing the maximum intracluster distance Analysis of DNA Chips and Gene Networks Spring Semester, 2009 Lecture 5: November 19, 2009 Lecturer: Ron Shamir Scribe: Renana Meller 5.1 Minimizing the maximum intracluster distance 5.1.1 Introduction

More information

Data Exploration and Unsupervised Learning with Clustering

Data Exploration and Unsupervised Learning with Clustering Data Exploration and Unsupervised Learning with Clustering Paul F Rodriguez,PhD San Diego Supercomputer Center Predictive Analytic Center of Excellence Clustering Idea Given a set of data can we find a

More information

Similarity Measures for Categorical Data A Comparative Study. Technical Report

Similarity Measures for Categorical Data A Comparative Study. Technical Report Similarity Measures for Categorical Data A Comparative Stuy Technical Report Department of Computer Science an Engineering University of Minnesota 4-92 EECS Builing 200 Union Street SE Minneapolis, MN

More information

Least-Squares Regression on Sparse Spaces

Least-Squares Regression on Sparse Spaces Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction

More information

Gene mapping, linkage analysis and computational challenges. Konstantin Strauch

Gene mapping, linkage analysis and computational challenges. Konstantin Strauch Gene mapping, linkage analysis an computational challenges Konstantin Strauch Institute for Meical Biometry, Informatics, an Epiemiology (IMBIE) University of Bonn E-mail: strauch@uni-bonn.e Genetics an

More information

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection an System Ientification Borhan M Sananaji, Tyrone L Vincent, an Michael B Wakin Abstract In this paper,

More information

Bayesian Estimation of the Entropy of the Multivariate Gaussian

Bayesian Estimation of the Entropy of the Multivariate Gaussian Bayesian Estimation of the Entropy of the Multivariate Gaussian Santosh Srivastava Fre Hutchinson Cancer Research Center Seattle, WA 989, USA Email: ssrivast@fhcrc.org Maya R. Gupta Department of Electrical

More information

Bohr Model of the Hydrogen Atom

Bohr Model of the Hydrogen Atom Class 2 page 1 Bohr Moel of the Hyrogen Atom The Bohr Moel of the hyrogen atom assumes that the atom consists of one electron orbiting a positively charge nucleus. Although it oes NOT o a goo job of escribing

More information

Lecture 6 : Dimensionality Reduction

Lecture 6 : Dimensionality Reduction CPS290: Algorithmic Founations of Data Science February 3, 207 Lecture 6 : Dimensionality Reuction Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will consier the roblem of maing

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information

Deriving ARX Models for Synchronous Generators

Deriving ARX Models for Synchronous Generators Deriving AR Moels for Synchronous Generators Yangkun u, Stuent Member, IEEE, Zhixin Miao, Senior Member, IEEE, Lingling Fan, Senior Member, IEEE Abstract Parameter ientification of a synchronous generator

More information

Spatio-temporal fusion for reliable moving vehicle classification in wireless sensor networks

Spatio-temporal fusion for reliable moving vehicle classification in wireless sensor networks Proceeings of the 2009 IEEE International Conference on Systems, Man, an Cybernetics San Antonio, TX, USA - October 2009 Spatio-temporal for reliable moving vehicle classification in wireless sensor networks

More information

Fast Resampling Weighted v-statistics

Fast Resampling Weighted v-statistics Fast Resampling Weighte v-statistics Chunxiao Zhou Mar O. Hatfiel Clinical Research Center National Institutes of Health Bethesa, MD 20892 chunxiao.zhou@nih.gov Jiseong Par Dept of Math George Mason Univ

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

networks in molecular biology Wolfgang Huber

networks in molecular biology Wolfgang Huber networks in molecular biology Wolfgang Huber networks in molecular biology Regulatory networks: components = gene products interactions = regulation of transcription, translation, phosphorylation... Metabolic

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

Mixture models for analysing transcriptome and ChIP-chip data

Mixture models for analysing transcriptome and ChIP-chip data Mixture models for analysing transcriptome and ChIP-chip data Marie-Laure Martin-Magniette French National Institute for agricultural research (INRA) Unit of Applied Mathematics and Informatics at AgroParisTech,

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

Cost-based Heuristics and Node Re-Expansions Across the Phase Transition

Cost-based Heuristics and Node Re-Expansions Across the Phase Transition Cost-base Heuristics an Noe Re-Expansions Across the Phase Transition Elan Cohen an J. Christopher Beck Department of Mechanical & Inustrial Engineering University of Toronto Toronto, Canaa {ecohen, jcb}@mie.utoronto.ca

More information

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion Hybri Fusion for Biometrics: Combining Score-level an Decision-level Fusion Qian Tao Raymon Velhuis Signals an Systems Group, University of Twente Postbus 217, 7500AE Enschee, the Netherlans {q.tao,r.n.j.velhuis}@ewi.utwente.nl

More information

Estimating Causal Direction and Confounding Of Two Discrete Variables

Estimating Causal Direction and Confounding Of Two Discrete Variables Estimating Causal Direction an Confouning Of Two Discrete Variables This inspire further work on the so calle aitive noise moels. Hoyer et al. (2009) extene Shimizu s ientifiaarxiv:1611.01504v1 [stat.ml]

More information

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers Optimal Variable-Structure Control racking of Spacecraft Maneuvers John L. Crassiis 1 Srinivas R. Vaali F. Lanis Markley 3 Introuction In recent years, much effort has been evote to the close-loop esign

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Regularized extremal bounds analysis (REBA): An approach to quantifying uncertainty in nonlinear geophysical inverse problems

Regularized extremal bounds analysis (REBA): An approach to quantifying uncertainty in nonlinear geophysical inverse problems GEOPHYSICAL RESEARCH LETTERS, VOL. 36, L03304, oi:10.1029/2008gl036407, 2009 Regularize extremal bouns analysis (REBA): An approach to quantifying uncertainty in nonlinear geophysical inverse problems

More information

A COMPARISON OF SMALL AREA AND CALIBRATION ESTIMATORS VIA SIMULATION

A COMPARISON OF SMALL AREA AND CALIBRATION ESTIMATORS VIA SIMULATION SAISICS IN RANSIION new series an SURVEY MEHODOLOGY 133 SAISICS IN RANSIION new series an SURVEY MEHODOLOGY Joint Issue: Small Area Estimation 014 Vol. 17, No. 1, pp. 133 154 A COMPARISON OF SMALL AREA

More information

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture

More information

PLAL: Cluster-based Active Learning

PLAL: Cluster-based Active Learning JMLR: Workshop an Conference Proceeings vol 3 (13) 1 22 PLAL: Cluster-base Active Learning Ruth Urner rurner@cs.uwaterloo.ca School of Computer Science, University of Waterloo, Canaa, ON, N2L 3G1 Sharon

More information

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE Journal of Soun an Vibration (1996) 191(3), 397 414 THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE E. M. WEINSTEIN Galaxy Scientific Corporation, 2500 English Creek

More information

CONTROL CHARTS FOR VARIABLES

CONTROL CHARTS FOR VARIABLES UNIT CONTOL CHATS FO VAIABLES Structure.1 Introuction Objectives. Control Chart Technique.3 Control Charts for Variables.4 Control Chart for Mean(-Chart).5 ange Chart (-Chart).6 Stanar Deviation Chart

More information

A Randomized Approximate Nearest Neighbors Algorithm - a short version

A Randomized Approximate Nearest Neighbors Algorithm - a short version We present a ranomize algorithm for the approximate nearest neighbor problem in - imensional Eucliean space. Given N points {x } in R, the algorithm attempts to fin k nearest neighbors for each of x, where

More information

Exploratory statistical analysis of multi-species time course gene expression

Exploratory statistical analysis of multi-species time course gene expression Exploratory statistical analysis of multi-species time course gene expression data Eng, Kevin H. University of Wisconsin, Department of Statistics 1300 University Avenue, Madison, WI 53706, USA. E-mail:

More information

A Fay Herriot Model for Estimating the Proportion of Households in Poverty in Brazilian Municipalities

A Fay Herriot Model for Estimating the Proportion of Households in Poverty in Brazilian Municipalities Int. Statistical Inst.: Proc. 58th Worl Statistical Congress, 2011, Dublin (Session CPS016) p.4218 A Fay Herriot Moel for Estimating the Proportion of Househols in Poverty in Brazilian Municipalities Quintaes,

More information

Construction of Equiangular Signatures for Synchronous CDMA Systems

Construction of Equiangular Signatures for Synchronous CDMA Systems Construction of Equiangular Signatures for Synchronous CDMA Systems Robert W Heath Jr Dept of Elect an Comp Engr 1 University Station C0803 Austin, TX 78712-1084 USA rheath@eceutexaseu Joel A Tropp Inst

More information

CS9840 Learning and Computer Vision Prof. Olga Veksler. Lecture 2. Some Concepts from Computer Vision Curse of Dimensionality PCA

CS9840 Learning and Computer Vision Prof. Olga Veksler. Lecture 2. Some Concepts from Computer Vision Curse of Dimensionality PCA CS9840 Learning an Computer Vision Prof. Olga Veksler Lecture Some Concepts from Computer Vision Curse of Dimensionality PCA Some Slies are from Cornelia, Fermüller, Mubarak Shah, Gary Braski, Sebastian

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information