Clustering & microarray technology


1 Clustering & microarray technology: a large-scale way to measure gene expression levels. Thanks to Kevin Wayne, Matt Hibbs, & SMD for a few of the slides.

2 Why is expression important? Analogy: DNA corresponds to the blueprints of automobile parts, proteins to the car parts, and phenotypes to the automobiles. Gene expression is the step that leads from DNA to proteins.

3 From Genes to Proteins. Transcription: DNA to mRNA. Translation: mRNA to proteins. DNA -> mRNA -> Ribosome -> Protein.


4 Proteins. Proteins are the workhorses of cells. To understand how cells work is to understand proteins. Understanding proteins and cells is key for finding disease treatments and cures. Modern drug development is centered on affecting proteins (receptors, hormones, etc.). But proteins are hard to study directly, so microarrays look at the mRNA instead.

5 Hybridization. Expression microarrays use the fact that complementary strands will hybridize (attach) to each other.

6 Early cDNA microarray (18,000 clones)

7-14 Microarray Methodology
1. Spot a slide with known sequences (spots A, B, C, D).
2. Take reference mRNA from a reference sample and test mRNA from the test cells.
3. Add green dye to the reference mRNA and red dye to the test mRNA.
4. Add the mRNA to the slide for hybridization.
5. Scan the hybridized array.
6. Read off a value for each spot, e.g. A 1.5, B 0.8, C -1.2, D 0.1.

15 Microarray Outputs. Measure the amounts of green and red dye on each spot. Represent the level of expression as a log ratio between these amounts. (Raw image from Spellman et al., 98.)

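16 Extracting Data. The data form a table with one row per gene and one column per experiment. For each spot, record the Cy3 and Cy5 intensities, the ratio Cy5/Cy3, and $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$. Example: Cy3 = 200 and Cy5 = 10000 give Cy5/Cy3 = 50.00 and $\log_2(\mathrm{Cy5}/\mathrm{Cy3}) = 5.64$.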
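The log-ratio step can be sketched in a few lines of Java. This is only an illustration: the class name `LogRatio` and the method `logRatio` are hypothetical, and the sketch assumes both intensities are positive and already background-corrected.

```java
public class LogRatio {
    // log base 2 of the Cy5/Cy3 intensity ratio for a single spot
    // (hypothetical helper; assumes positive intensities)
    public static double logRatio(double cy3, double cy5) {
        if (cy3 <= 0 || cy5 <= 0)
            throw new IllegalArgumentException("intensities must be positive");
        return Math.log(cy5 / cy3) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Cy5 is 50 times brighter than Cy3
        System.out.println(logRatio(200, 10000));
        // equal intensities in both channels
        System.out.println(logRatio(4800, 4800));
    }
}
```

Taking the logarithm makes the scale symmetric: a spot that is twice as bright in Cy5 scores +1, and a spot that is twice as bright in Cy3 scores -1.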

17 Why microarray analysis: the questions. Large-scale study of biological processes. What is going on in the cell at a certain point in time? What pathways are active, and which genes are involved in these pathways? On the genomic level, what accounts for differences between phenotypes? Which pathways are activated in stress response, and which genes belong to these pathways? Sequence important, but genes have

18 Clustering History: London physician John Snow plotted the outbreak of cholera deaths on a map in the 1850s. The locations indicated that clusters were around certain intersections with polluted wells; this exposed both the problem and the solution! Reference: Nina Mishra, HP Labs. Introduction to Computer Science, Robert Sedgewick and Kevin Wayne.

19 What is clustering? Reordering of vectors in a dataset so that similar patterns are next to each other.

20 Why cluster microarray data? Guilt by association: if unknown gene i is similar in expression to known gene j, maybe they are involved in the same or a related pathway. Dimensionality reduction: datasets are too big to get information out of without reorganizing the data.

21 Botstein & Brown group

22 Clustering Random vs. Biological Data. Challenge: when is clustering real? From Eisen MB, et al, PNAS (25):

23 K-means clustering
1. Define k = number of clusters.
2. Randomly initialize a seed vector for each cluster (could use a random gene vector).
3. Go through all genes, and assign each gene to the cluster which it is most similar to.
4. Recalculate all seed vectors as means (or medians) of the patterns of each cluster.
Repeat 3 & 4 until <stop condition>.

24 K-means clustering: stop conditions. Until the change in seed vectors is less than <constant>. Until all genes get assigned to the same partition twice in a row. Until some minimal number of genes (e.g. 90%) get assigned to the same partition twice in a row.
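The k-means steps and the "same partition twice in a row" stop condition can be sketched as follows. This is a minimal sketch under stated assumptions, not the deck's implementation: the class `KMeans`, the method `cluster`, and the `maxIter` safety cap are hypothetical, similarity is taken to be squared Euclidean distance, seeds are drawn as distinct random gene vectors, and k is assumed to be no larger than the number of genes.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeans {
    // squared Euclidean distance between two expression vectors
    static double dist2(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // returns assign[g] = cluster index of gene g (assumes k <= genes.length)
    public static int[] cluster(double[][] genes, int k, long seed, int maxIter) {
        int n = genes.length, m = genes[0].length;
        Random rnd = new Random(seed);

        // step 2: use k distinct randomly chosen gene vectors as seeds
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        double[][] seeds = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + rnd.nextInt(n - c);
            int t = idx[c]; idx[c] = idx[j]; idx[j] = t;
            seeds[c] = genes[idx[c]].clone();
        }

        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // step 3: assign each gene to the most similar seed
            boolean changed = false;
            for (int g = 0; g < n; g++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(genes[g], seeds[c]) < dist2(genes[g], seeds[best])) best = c;
                if (best != assign[g]) { assign[g] = best; changed = true; }
            }
            // stop: every gene kept its partition twice in a row
            if (!changed) break;

            // step 4: recalculate each seed as the mean of its cluster
            double[][] sum = new double[k][m];
            int[] count = new int[k];
            for (int g = 0; g < n; g++) {
                count[assign[g]]++;
                for (int j = 0; j < m; j++) sum[assign[g]][j] += genes[g][j];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)  // an empty cluster keeps its old seed
                    for (int j = 0; j < m; j++) seeds[c][j] = sum[c][j] / count[c];
        }
        return assign;
    }
}
```

A call such as `KMeans.cluster(genes, 2, 42L, 100)` returns one cluster index per row of `genes`. Different random seeds can give different partitions on real data, which is one reason k-means is usually run several times.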

25 K-means: problems. Have to set k ahead of time. Each gene only belongs to 1 cluster. One cluster has no influence on the others. Genes are assigned to clusters on the basis of all experiments.

26 Can a gene belong to N clusters? Fuzzy clustering: each gene's relationship to a cluster is probabilistic. A gene can belong to many clusters. More biologically realistic, but harder to get to work well and fast. Harder to interpret.

27 Self Organizing Maps (SOM). Similar to k-means, BUT: allow clusters to influence each other.

28 1. Initialize the seeds for each partition (partitions A to F).

29 2. Pick a gene at random, and adjust the closest partition (iteration 1).

30 3. Adjust neighboring partitions within R of the adjusted partition (iteration 1).

31 2. Pick a gene at random, and adjust the closest partition (iteration 2).

32 Self-organizing maps algorithm
1. Partition data (e.g. 3x2 grid).
2. Randomly choose seed vectors for each partition (length = # experiments).
3. Pick a gene i at random, see which partition it is most similar to (e.g. partition A), and modify A's seed vector to be more similar to gene i.
4. Now modify neighboring partitions of A to be more similar to A.
After the map settles down, assign each gene to the most similar partition.

33 Self-organizing maps iterations. At higher iterations, smaller R. At higher iterations, smaller change to partition seeds => the map settles down.
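A rough Java sketch of the SOM loop, with the shrinking R and the shrinking update size, might look like this. Everything named here is an assumption for illustration: the class `SimpleSom`, the linear decay schedule, the starting rate of 0.5, and the choice to pull neighbors toward the picked gene at half the winner's rate (the usual Kohonen-style update, which also moves them closer to the adjusted partition).

```java
import java.util.Random;

public class SimpleSom {
    // index of the seed vector closest (squared Euclidean) to x
    static int closest(double[][] seeds, double[] x) {
        int best = 0;
        double bestD = Double.POSITIVE_INFINITY;
        for (int c = 0; c < seeds.length; c++) {
            double d = 0.0;
            for (int j = 0; j < x.length; j++)
                d += (seeds[c][j] - x[j]) * (seeds[c][j] - x[j]);
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }

    // partitions are laid out on a rows x cols grid; partition c sits at
    // grid position (c / cols, c % cols)
    public static int[] train(double[][] genes, int rows, int cols,
                              int iterations, long seed) {
        int n = genes.length, m = genes[0].length, p = rows * cols;
        Random rnd = new Random(seed);

        // step 2: random seed vectors (here: copies of random genes)
        double[][] w = new double[p][];
        for (int c = 0; c < p; c++) w[c] = genes[rnd.nextInt(n)].clone();

        for (int t = 0; t < iterations; t++) {
            double frac = 1.0 - (double) t / iterations;  // decays from 1 toward 0
            double rate = 0.5 * frac;                     // smaller changes later
            double radius = Math.max(rows, cols) * frac;  // smaller R later

            // step 3: pick a gene at random and find its closest partition
            double[] x = genes[rnd.nextInt(n)];
            int win = closest(w, x);

            // steps 3 and 4: move the winner, and its grid neighbors within
            // the radius at half the rate, toward the picked gene
            for (int c = 0; c < p; c++) {
                double gridDist = Math.hypot(c / cols - win / cols,
                                             c % cols - win % cols);
                if (gridDist > radius) continue;
                double r = (c == win) ? rate : 0.5 * rate;
                for (int j = 0; j < m; j++) w[c][j] += r * (x[j] - w[c][j]);
            }
        }

        // after the map settles: assign each gene to the most similar partition
        int[] assign = new int[n];
        for (int g = 0; g < n; g++) assign[g] = closest(w, genes[g]);
        return assign;
    }
}
```

For the 3x2 grid used in the example, `SimpleSom.train(genes, 3, 2, 1000, 1L)` returns a partition index from 0 to 5 for each gene.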

34 Self Organizing Maps: Result. SOMs result in genes being assigned to partitions of most similar genes. Neighboring partitions are more similar to each other than they are to distant partitions.

35 SOM: problems. Have to set the grid dimensions n and m ahead of time. Each gene only belongs to 1 cluster. Genes are assigned to clusters on the basis of all experiments.

36 Hierarchical clustering. Imposes hierarchical structure on all of the data. Easy visualization of similarities and differences between genes (experiments) and clusters of genes (experiments).

37 How does Hierarchical Clustering work?
1. Compare all expression patterns to each other.
2. Join patterns that are the most similar out of all patterns.
3. Compare joined patterns to all other un-joined patterns.
4. Go to step 2, and repeat until all patterns are joined.
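The four steps above can be sketched as a small agglomerative procedure. This is an illustrative sketch only: the class `HierarchicalClustering` is hypothetical, similarity is assumed to be Euclidean distance, and a joined pattern is represented by the size-weighted mean of its members, which is one of several possible linkage choices. The result is returned as a nested-parentheses string describing the tree.

```java
import java.util.ArrayList;
import java.util.List;

public class HierarchicalClustering {
    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // joins patterns bottom-up and returns the tree as nested parentheses
    public static String cluster(String[] labels, double[][] data) {
        List<String> names = new ArrayList<>();
        List<double[]> centers = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            names.add(labels[i]);
            centers.add(data[i].clone());
            sizes.add(1);
        }
        while (names.size() > 1) {
            // steps 1 and 3: compare all current patterns to each other
            int bi = 0, bj = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < names.size(); i++)
                for (int j = i + 1; j < names.size(); j++) {
                    double d = dist(centers.get(i), centers.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            // step 2: join the most similar pair into one pattern
            int si = sizes.get(bi), sj = sizes.get(bj);
            double[] a = centers.get(bi), b = centers.get(bj);
            double[] merged = new double[a.length];
            for (int k = 0; k < a.length; k++)
                merged[k] = (si * a[k] + sj * b[k]) / (si + sj);
            names.set(bi, "(" + names.get(bi) + "," + names.get(bj) + ")");
            centers.set(bi, merged);
            sizes.set(bi, si + sj);
            // bj > bi, so removing index bj leaves the joined entry in place
            names.remove(bj);
            centers.remove(bj);
            sizes.remove(bj);
        }
        return names.get(0);
    }

    public static void main(String[] args) {
        String[] labels = { "A", "B", "C", "D" };
        double[][] data = { { 0 }, { 1 }, { 10 }, { 12 } };
        System.out.println(cluster(labels, data));  // prints ((A,B),(C,D))
    }
}
```

Each pass rescans all pairs, so this sketch trades speed for clarity; caching the pairwise distances is the obvious next step.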

38-43 Hierarchical clustering (figure slides)

44 Dendrogram. Scientific visualization of a hypothetical sequence of evolutionary events. Leaves = genes. Internal nodes = hypothetical ancestors. Reference:

45 Dendrogram of Human tumors. Tumors in similar tissues cluster together. (Figure: rows Gene 1 to Gene n; color legend: gene over-expressed, gene under-expressed.) Reference: Botstein & Brown group.

46 Analysis and Micro-Optimizations. Running time: proportional to $MN^2$ (the input size is proportional to $MN$). Memory: proportional to $N^2$. Ex. [M = 50, N = 6,000]: takes 280 MB, 48 sec on a fast PC. Micro-optimizations: use float instead of double; store only the lower triangular part of the distance matrix; use squares of distances instead of distances.
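Two of the micro-optimizations, float storage and keeping only the lower triangle, can be combined in one small class. This is a sketch under assumptions: the class `TriangularDistances` and its methods are hypothetical names, and the diagonal is assumed to be zero and is not stored.

```java
public class TriangularDistances {
    private final float[] d;  // entries (i, j) with i > j, stored row by row

    public TriangularDistances(int n) {
        d = new float[n * (n - 1) / 2];
    }

    // position of the pair (i, j), i != j, in the packed array
    private static int index(int i, int j) {
        if (i == j) throw new IllegalArgumentException("diagonal is not stored");
        if (i < j) { int t = i; i = j; j = t; }  // distances are symmetric
        return i * (i - 1) / 2 + j;
    }

    public void set(int i, int j, float value) { d[index(i, j)] = value; }

    public float get(int i, int j) { return i == j ? 0f : d[index(i, j)]; }

    public int storedEntries() { return d.length; }
}
```

For N = 6,000, a full matrix of doubles needs 6,000 x 6,000 x 8 bytes = 288 MB, in line with the figure on the slide, while the packed float array holds 17,997,000 entries, about 72 MB.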

47 Hierarchical clustering: problems. Hard to partition into distinct clusters. Genes are assigned to clusters on the basis of all experiments. Optimizing node ordering is hard (finding the optimal solution is NP-hard). Can be influenced by one strong cluster, which is a problem for gene expression because data in row space is often highly correlated.

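48 Distance Metrics. Choice of distance measure is important for most clustering techniques. Linear measures: Euclidean distance, Pearson correlation. Non-parametric: Spearman correlation, Kendall's tau.

Euclidean distance: $d = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$

Pearson correlation: $r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)$

Spearman correlation: $\rho = 1 - \frac{6\sum_{i=1}^{n}\left[\mathrm{rank}(x_i) - \mathrm{rank}(y_i)\right]^2}{n(n^2 - 1)}$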
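The two correlation measures can be written directly from their formulas. This is a sketch, not a library API: the class `Similarity` and its method names are hypothetical, Pearson uses the population standard deviation (the 1/n form), and the Spearman shortcut formula assumes there are no tied values.

```java
import java.util.Arrays;

public class Similarity {
    // Pearson correlation: mean product of the standardized values
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // rank of each value, 1 = smallest (assumes all values are distinct)
    static double[] ranks(double[] v) {
        double[] sorted = v.clone();
        Arrays.sort(sorted);
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++)
            r[i] = Arrays.binarySearch(sorted, v[i]) + 1;
        return r;
    }

    // Spearman correlation via the rank-difference formula (no ties)
    public static double spearman(double[] x, double[] y) {
        int n = x.length;
        double[] rx = ranks(x), ry = ranks(y);
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += (rx[i] - ry[i]) * (rx[i] - ry[i]);
        return 1.0 - 6.0 * sum / (n * ((double) n * n - 1));
    }
}
```

Because Spearman looks only at ranks, any monotone relationship scores 1: `spearman` on (1, 2, 3, 4) and (1, 8, 27, 64) gives 1 even though the relationship is not linear.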

49 Vector Data Type
Vector data type. Set of values: sequence of N real numbers. Set of operations: distanceTo, scale, plus.
Ex. p = (1, 2, 3, 4), q = (5, 2, 4, 1). dist(p, q) = sqrt(4^2 + 0^2 + 1^2 + 3^2) = sqrt(26) ≈ 5.099. t = 1/4 p + 3/4 q = (4, 2, 3.75, 1.75).

    public static void main(String[] args) {
        double[] pdata = { 1.0, 2.0, 3.0, 4.0 };
        double[] qdata = { 5.0, 2.0, 4.0, 1.0 };
        Vector p = new Vector(pdata);
        Vector q = new Vector(qdata);
        double dist = p.distanceTo(q);
        Vector r = p.scale(1.0/4.0);
        Vector s = q.scale(3.0/4.0);
        Vector t = r.plus(s);
        System.out.println(t);
    }

50-51 Vector: Array Implementation

    public class Vector {
        private int N;           // dimension
        private double[] data;   // components

        // create a vector from the array d
        public Vector(double[] d) {
            N = d.length;
            data = d;
        }

        // return Euclidean distance from this vector a to b
        public double distanceTo(Vector b) {
            Vector a = this;
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                sum += (a.data[i] - b.data[i]) * (a.data[i] - b.data[i]);
            return Math.sqrt(sum);
        }

        // return string representation of this vector
        public String toString() {
            String s = "";
            for (int i = 0; i < N; i++)
                s = s + data[i] + " ";
            return s;
        }

        // return vector sum of this vector a and vector b
        // (could add error checking to ensure dimensions agree)
        public Vector plus(Vector b) {
            Vector a = this;
            double[] d = new double[N];
            for (int i = 0; i < N; i++)
                d[i] = a.data[i] + b.data[i];
            return new Vector(d);
        }
    }
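The slides list `scale` among the operations but do not show it, and they mention dimension checking only as a possibility. One way both could look is sketched below. This is an assumption-laden variant, not the deck's code: the class is named `Vec` to keep it separate from the slide's `Vector`, the `get` accessor and `checkSameDimension` helper are hypothetical, and the constructor makes a defensive copy of its input.

```java
public class Vec {
    private final double[] data;  // components

    // create a vector from a copy of the array d
    public Vec(double[] d) {
        data = d.clone();
    }

    // hypothetical helper: reject vectors of different dimension
    private void checkSameDimension(Vec b) {
        if (data.length != b.data.length)
            throw new IllegalArgumentException("dimensions disagree");
    }

    public double get(int i) { return data[i]; }

    // return Euclidean distance from this vector to b
    public double distanceTo(Vec b) {
        checkSameDimension(b);
        double sum = 0.0;
        for (int i = 0; i < data.length; i++)
            sum += (data[i] - b.data[i]) * (data[i] - b.data[i]);
        return Math.sqrt(sum);
    }

    // return this vector with every component multiplied by factor
    public Vec scale(double factor) {
        double[] d = new double[data.length];
        for (int i = 0; i < data.length; i++) d[i] = factor * data[i];
        return new Vec(d);
    }

    // return vector sum of this vector and b
    public Vec plus(Vec b) {
        checkSameDimension(b);
        double[] d = new double[data.length];
        for (int i = 0; i < data.length; i++) d[i] = data[i] + b.data[i];
        return new Vec(d);
    }

    public static void main(String[] args) {
        Vec p = new Vec(new double[] { 1.0, 2.0, 3.0, 4.0 });
        Vec q = new Vec(new double[] { 5.0, 2.0, 4.0, 1.0 });
        Vec t = p.scale(1.0 / 4.0).plus(q.scale(3.0 / 4.0));
        System.out.println(p.distanceTo(q));  // square root of 26
        System.out.println(t.get(2));         // third component of t, 3.75
    }
}
```

Returning a new `Vec` from `scale` and `plus` keeps vectors immutable, so a seed vector can be shared between clusters without one update silently changing another.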
