Chapter 5: Microarray Techniques

5.2 Analysis of Microarray Data
Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University

Overview
  Normalization
  Clustering

Processing Microarray Data
  Problem 1: extract data from microarrays
  Problem 2: analyze the meaning of data (multiple arrays)
  [Figure: heat map with genes g_1, g_2, ..., g_i, ..., g_m as rows and tests T_j as columns; each cell shows the expression level of g_i under test T_j.]

Normalization

Differentiating Gene Expression
  Ideal data
    R = G for all genes that are not differentiated
    R > G for up-regulated genes (R < G for down-regulated genes)
  Microarray data can be noisy
  Noise due to technology factors:
    o Measurements of R and G may be noisy; two arrays can vary greatly
    o Even a single array can have variations in dye, mRNA, and scanning
  Noise due to biological factors:
    o Sample variability
  [Figure: two scatter plots of R against G with up-regulated and down-regulated regions marked; one panel is labeled "Ideal", the other "More likely".]

Normalizing Expression Levels
  Consider log R and log G to evaluate order-of-magnitude differences
  Normalization: calibrate the R and G fluorescence measurements
  Regression: consider $\log(R/G) - c = \log(aR/G)$, where $c = -\log(a)$
  c is selected to shift the mean log ratio to 0
  Under ideal circumstances this gives the distribution shown in the figure
  Rotating the (log G, log R) axes by 45° gives the A and M axes:
    $M = \log R - \log G = \log(R/G)$
    $A = \tfrac{1}{2}[\log R + \log G] = \log (RG)^{1/2}$
  [Figure: regression line in the (log G, log R) plane, and the same data rotated 45° onto the (A, M) axes.]
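The M/A transform and the mean-centering step above can be sketched in a few lines. This is a minimal illustration, not the course's code: the function names, the choice of base-2 logarithms, and the intensity values are all assumptions made for the example.

```python
import math

def ma_transform(red, green):
    """Return (M, A) lists for paired red/green intensities (base-2 logs assumed)."""
    m_values = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a_values = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m_values, a_values

def center_log_ratios(m_values):
    """Subtract c = mean(M) so that the mean log ratio becomes 0."""
    c = sum(m_values) / len(m_values)
    return [m - c for m in m_values], c

# Made-up intensities: the red channel reads twice as bright at every spot,
# which mimics a pure dye bias with no real differential expression.
red = [200.0, 800.0, 3200.0, 1600.0]
green = [100.0, 400.0, 1600.0, 800.0]

m_values, a_values = ma_transform(red, green)
normalized_m, c = center_log_ratios(m_values)
print("shift c =", c)
print("normalized M =", normalized_m)
```

With base-2 logs, the shift c corresponds to a scale factor a = 2^(-c) applied to the red channel, which matches the relation c = -log(a) on the slide.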

Lowess Normalization
  The relationship between M and A may not be linear
  Lowess (Locally WEighted polynomial regression)
  Normalized M values are the heights of spots from the trend line
  [Figure: M against A before and after the Lowess adjustment.]

Normalizing Data From Two Arrays
  Normalization:
    o Transform to A, M axes
    o Apply Lowess adjustment
    o Use resulting values for the gene expression matrix
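To make "heights of spots from the trend line" concrete, here is a simplified sketch of the idea. It fits a weighted straight line around each spot using tricube weights and subtracts the fitted value. It is an assumption-laden stand-in, not a full Lowess implementation: it omits the robustness iterations, and the names `lowess_fit`, `lowess_normalize`, and the `frac` parameter are hypothetical.

```python
def lowess_fit(a_values, m_values, frac=0.5):
    """Locally weighted linear fit of M at each A (no robustness iterations)."""
    n = len(a_values)
    k = max(2, int(round(frac * n)))  # number of points in each local window
    fitted = []
    for a0 in a_values:
        # Window radius = distance from a0 to its k-th nearest point.
        dists = sorted(abs(a - a0) for a in a_values)
        radius = dists[k - 1] or 1e-12
        # Tricube weights: 1 at a0, falling smoothly to 0 at the window edge.
        weights = [(1 - min(abs(a - a0) / radius, 1.0) ** 3) ** 3 for a in a_values]
        total = sum(weights)
        mean_a = sum(w * a for w, a in zip(weights, a_values)) / total
        mean_m = sum(w * m for w, m in zip(weights, m_values)) / total
        var_a = sum(w * (a - mean_a) ** 2 for w, a in zip(weights, a_values))
        cov = sum(w * (a - mean_a) * (m - mean_m)
                  for w, a, m in zip(weights, a_values, m_values))
        slope = cov / var_a if var_a > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def lowess_normalize(a_values, m_values, frac=0.5):
    """Normalized M = height of each spot above the local trend line."""
    trend = lowess_fit(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]

# Made-up data: an intensity-dependent trend plus one spot that stands out.
a_values = [float(a) for a in range(6, 16)]
m_values = [0.5 * a - 1.0 for a in a_values]
m_values[5] += 2.0
normalized = lowess_normalize(a_values, m_values)
print([round(v, 3) for v in normalized])
```

The spot that was pushed off the trend keeps a clearly positive normalized M, while the spots that only follow the trend end up close to 0. A production analysis would normally rely on a tested library routine rather than this sketch.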

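Differentiated Expression Analysis
  Use the normalized regression
  Fold lines determine the up-regulated and down-regulated regions
  [Figure: M against A with two fold lines; spots above the upper fold line are up-regulated, spots below the lower fold line are down-regulated.]

Hierarchical Clustering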
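A fold-line rule can be sketched as a simple threshold on the normalized M values. The two-fold cutoff, the base-2 logarithm, and the function name `classify_by_fold` are assumptions for illustration; the slides do not state which fold value is used.

```python
import math

def classify_by_fold(normalized_m, fold=2.0):
    """Label each spot using fold lines at M = +log2(fold) and M = -log2(fold)."""
    threshold = math.log2(fold)
    labels = []
    for m in normalized_m:
        if m >= threshold:
            labels.append("up")
        elif m <= -threshold:
            labels.append("down")
        else:
            labels.append("unchanged")
    return labels

# Made-up normalized log ratios for four spots.
labels = classify_by_fold([1.5, -0.2, -1.1, 0.9])
print(labels)
```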

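Heat Map Matrix
  Columns: tests/experiments/samples/conditions T_1, T_2, ..., T_j, ..., T_n
  Rows: genes g_1, g_2, ..., g_i, ..., g_m
  Cell (i, j): expression level of g_i under test T_j
  Row g_i: gene expression profile
  Column T_j: test expression profile
  [Figure: heat map of the gene expression matrix.]

Clustering Analysis
  Gene profile: co-expression
  Test/sample profile: sample similarity
  [Figure: heat map with one gene expression profile (row g_i) and one test expression profile (column T_j) highlighted.]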

Clustering Expression Profiles
  Profile: vector of expression values
    Gene (rows): $g_i = (e_{i1}, e_{i2}, \ldots, e_{in})$
    Test/sample (columns): $T_j = (e_{1j}, e_{2j}, \ldots, e_{mj})$
  [Figure: expression matrix with rows g_1, ..., g_m and columns T_1, ..., T_n; row g_i and column T_j are highlighted.]

Hierarchical Clustering
  Key idea: recursively cluster the closest pair
  E.g., we used this for phylogeny and MSA
  Agglomerative (bottom-up) vs. divisive (top-down)
  Distance metrics*: d(A, B)
    1. Euclidean: $d(A,B) = \sqrt{\sum_{i=1}^{m} (x_{iA} - x_{iB})^2}$
    2. Manhattan: $d(A,B) = \sum_{i=1}^{m} |x_{iA} - x_{iB}|$
    3. Pearson correlation: $r(A,B) = \frac{1}{n} \sum_{T=1}^{n} \left(\frac{x_{AT} - \bar{x}_A}{\sigma_A}\right) \left(\frac{x_{BT} - \bar{x}_B}{\sigma_B}\right)$
  (* Triangle inequality is not required; a semi-metric is sufficient)
  [Figure: distance matrix with entries for A and B.]
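The three measures listed above can be written directly from their definitions. This is a small sketch with hypothetical function names and made-up profiles. The slide lists Pearson correlation as a similarity; turning it into a distance as 1 - r is a common convention and is an assumption here, not something the slide states.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Mean product of standardized values (population standard deviation).

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return sum(((x - mean_a) / sd_a) * ((y - mean_b) / sd_b)
               for x, y in zip(a, b)) / n

def pearson_distance(a, b):
    """Assumed conversion from correlation to distance: 0 when perfectly correlated."""
    return 1.0 - pearson(a, b)

# Two made-up gene profiles: g2 has the same shape as g1 at twice the level.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]
print(euclidean(g1, g2), manhattan(g1, g2), pearson(g1, g2))
```

The example shows why the choice matters: the two profiles are far apart in Euclidean and Manhattan terms, yet perfectly correlated, so a correlation-based distance treats them as co-expressed.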

Hierarchical Clustering
  1) Connect nearest neighbors into a cluster
  2) Compute the distance matrix to the new cluster
  3) Repeat until all are clustered
  [Figure: Gene 1 through Gene 8 before clustering.]
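The three steps above can be sketched as a short agglomerative loop. The slide does not say how the distance to a newly formed cluster is computed, so average linkage with Euclidean distance is assumed here; the function names and the four toy profiles are also hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b):
    """Assumed linkage rule: mean pairwise distance between two clusters."""
    total = sum(euclidean(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(profiles):
    """Repeatedly merge the closest pair of clusters; return the merge history.

    profiles maps a gene name to its expression vector. Each history entry is
    (members_of_first, members_of_second, distance_at_merge).
    """
    clusters = {(name,): [vec] for name, vec in profiles.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        # Steps 1 and 2: scan all cluster pairs for the smallest distance.
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                d = average_linkage(clusters[keys[i]], clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, key_a, key_b = best
        clusters[key_a + key_b] = clusters.pop(key_a) + clusters.pop(key_b)
        merges.append((key_a, key_b, d))
    return merges  # Step 3: the loop ends when a single cluster remains.

# Made-up profiles: g1/g2 are similar, and g3/g4 are similar.
profiles = {
    "g1": [1.0, 1.0],
    "g2": [1.2, 0.9],
    "g3": [5.0, 5.0],
    "g4": [5.1, 5.3],
}
merges = agglomerate(profiles)
for first, second, distance in merges:
    print(first, "+", second, "at distance", round(distance, 3))
```

The merge history is the information a dendrogram draws: each entry is one join, and the merge distance is the height of that join. This version recomputes every pairwise distance on each pass, which is fine for a handful of genes but too slow for a full array.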

  [Figure sequence: the eight genes are merged step by step; the genes end up in the order Gene 7, Gene 1, Gene 2, Gene 4, Gene 5, Gene 3, Gene 8, Gene 6.]

Pros & Cons of Hierarchical Clustering
  Pros
    o It provides useful partitioning of the data
    o Organizes co-expressed genes and similar tests
    o Visual 2D organization of data
  Cons
    o Can be very sensitive to noise
    o Dimensionality may exacerbate sensitivity
    o May not be related to nature (genes are not hierarchical)

An Example: Hierarchical Clustering
  [Figures: three slides of hierarchical clustering examples.]

K-Means Clustering

K-Means Clustering
  Key idea: iterative improvement of clusters
  Start with a random partitioning and improve it
  Initialization:
  1) Select the number of clusters: k = 4
  2) Select k random centroids $\{m_j\}$
  3) Assign genes to the cluster of the closest centroid: classify gene i to cluster $c = \arg\min_j \lVert m_j - g_i \rVert$
  4) Compute new centroids: $m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i$, where $N_c$ is the number of genes in cluster c
  5) Repeat until convergence
  http://www.elet.polimi.
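Steps 1 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the toy data uses k = 2 rather than the slide's k = 4, initial centroids are drawn from the genes themselves, and the `seed`, `initial`, and `max_iter` parameters are hypothetical conveniences rather than part of the slide's description.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (enough for picking the closest centroid)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(genes, k, seed=0, initial=None, max_iter=100):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Step 2: pick k random genes as starting centroids unless some are supplied.
    centroids = [list(c) for c in (initial or rng.sample(genes, k))]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each gene to the cluster of its closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(centroids[j], g))
                      for g in genes]
        if new_labels == labels:  # Step 5: converged, nothing moved.
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [g for g, c in zip(genes, labels) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up expression profiles forming two well-separated groups.
genes = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(genes, k=2, seed=1)
print(labels)
print(centroids)
```

Because the starting centroids are random, the cluster numbering (and, on harder data, the partition itself) can change with the seed. Running several seeds and keeping the tightest result is a common precaution.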

Self Organizing Maps

Self Organizing Maps (SOM) Clustering
  Kohonen 87
  Iterative clustering similar to k-means
  Select the number of clusters k and a grid of k centroids
  Move the grid closer to the points

Initialize:
  A. Select k = 6 clusters
  B. Select random locations for a grid of k = 6 centroids
Iteration:
  Select a random point P
  Identify the centroid NP nearest to P
  Move centroid NP towards point P
  Repeat for a new point Q, with nearest centroid NQ
  Repeat until convergence
  [Figures: one slide per step, showing the grid of centroids moving toward the data points.]
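The initialization and iteration steps can be sketched with a 2 x 3 grid, which gives the k = 6 centroids used on the slides. Several details below are assumptions made for the sketch and are not stated in the slides: grid neighbors of the nearest centroid are also pulled toward the point using a Gaussian falloff, the learning rate decays linearly, and a fixed number of steps stands in for a convergence test. All names and data values are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_node(nodes, point):
    """Grid coordinate of the centroid closest to the point."""
    return min(nodes, key=lambda n: sq_dist(nodes[n], point))

def train_som(points, rows, cols, steps=2000, lr=0.5, radius=1.0, seed=0):
    """Return a dict mapping grid coordinate (row, col) to its centroid."""
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    # Initialize: one centroid per grid node, at a random location in the data range.
    nodes = {(r, c): [rng.uniform(lo[d], hi[d]) for d in range(dim)]
             for r in range(rows) for c in range(cols)}
    for step in range(steps):
        decay = 1.0 - step / steps           # updates shrink as training proceeds
        p = rng.choice(points)               # select a random point P
        winner = nearest_node(nodes, p)      # identify the nearest centroid NP
        for node, centroid in nodes.items():
            grid_d2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
            influence = math.exp(-grid_d2 / (2 * radius ** 2))
            rate = lr * decay * influence
            # Move the centroid towards P (the winner moves the most).
            for d in range(dim):
                centroid[d] += rate * (p[d] - centroid[d])
    return nodes

# Made-up 2-D points in three loose groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
          [1.0, 5.0], [1.1, 5.2], [0.8, 4.9]]
nodes = train_som(points, rows=2, cols=3)
assignments = [nearest_node(nodes, p) for p in points]
print(assignments)
```

The grid coordinates are what distinguish this from k-means: centroids that are neighbors on the grid are dragged along together, so nearby grid cells tend to hold similar profiles.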

Comparison (Based on W. Noble slides)

Comparison of clustering algorithms
  Hierarchical clustering
    + Widely used.
    + Easy to understand.
    + Does not require the number of clusters a priori.
    - Difficult to implement well.
    - Requires post-processing.
    - Unstable.
    - Greediness can lock in early mistakes.
    - Expression data may not be organized hierarchically.
  k-means
    - Less widely used.
    - Requires the number of clusters a priori.
    - Creates unorganized clusters that are hard to interpret.
    + Easy to understand.
    + Easy to implement.
    + Scales well.
    + Stable.
  Self-organizing maps
    - Less widely used.
    - Difficult to understand.
    - Requires the number of clusters a priori.
    + Easy to implement.
    + Scales well.
    + Allows imposition of partial structure.
    + Stable.

What clustering can't do
  Identify differentially regulated genes.
  Account for complex experimental design.
  Provide semantics for discovered clusters.
  Determine whether a pathway is differentially expressed.
  Incorporate prior knowledge about relevant gene groups.
