Microarray data analysis


1 Microarray data analysis September 20, 2006 Jonathan Pevsner, Ph.D. Introduction to Bioinformatics Johns Hopkins School of Public Health

2 Copyright notice Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright 2003 by John Wiley & Sons, Inc. These images and materials may not be used without permission from the publisher. We welcome instructors to use these powerpoints for educational purposes, but please acknowledge the source. The book has a homepage at including hyperlinks to the book chapters.

3 Schedule Today : microarray data analysis (Chapter 7) Friday: computer lab (microarray data analysis) Quiz 6 (chapter 7) opens Monday: continue on microarray data analysis; also SNP microarrays and array CGH Wednesday: Protein analysis (Chapter 8)

4 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

5 The MicroArray Quality Consortium (MAQC) The MAQC Consortium published a series of papers in Nature Biotechnology this week (September 2006, volume 24 issue 9). 20 microarray products and three technologies were evaluated for 12,000 RNA transcripts expressed in human tumor cell lines or brain. There was substantial agreement between sites and platforms for regulated transcripts.

6 MAQC Consortium (2006) Nature Biotechnology 24:

7 MAQC Consortium (2006) Nature Biotechnology 24:

8

9 Affymetrix GeneChip expression array

10 Example of a probe set corresponding to a gene

11 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

12 Microarray data analysis begin with a data matrix (gene expression values versus samples) Fig. 7.1 Page 190

13 Microarray data analysis begin with a data matrix (gene expression values versus samples) Typically, there are many genes (>> 10,000) and few samples (~ 10) Fig. 7.1 Page 190

14 Microarray data analysis begin with a data matrix (gene expression values versus samples) Preprocessing Inferential statistics Descriptive statistics Fig. 7.1 Page 190

15 Microarray data analysis: preprocessing Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as: different labeling efficiencies of Cy3, Cy5 uneven spotting of DNA onto an array surface variations in RNA purity or quantity variations in washing efficiency variations in scanning efficiency Page 191

16 Microarray data analysis: preprocessing The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment. Page 191

17 Data analysis: global normalization Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope. Page 192

18 Data analysis: global normalization Global normalization is used to correct two or more data sets. Example: total fluorescence in the Cy3 channel = 4 million units, and in the Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units, which would artifactually appear to show 2-fold regulation. Page 192

19 Data analysis: global normalization Global normalization procedure Step 1: subtract background intensity values (use a blank region of the array) Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets) Page 192
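The two-step procedure above can be sketched in Python (a minimal sketch; the intensity values and the background levels `bg3` and `bg5` are hypothetical):

```python
# Global normalization sketch: background subtraction, then scaling
# so that the average Cy5/Cy3 ratio across all spots equals 1.

def global_normalize(cy3, cy5, bg3, bg5):
    """Background-subtract each channel, then rescale Cy5 so the
    mean ratio across all spots is 1."""
    c3 = [max(v - bg3, 1e-9) for v in cy3]  # step 1: subtract background
    c5 = [max(v - bg5, 1e-9) for v in cy5]
    mean_ratio = sum(r5 / r3 for r3, r5 in zip(c3, c5)) / len(c3)
    c5 = [v / mean_ratio for v in c5]       # step 2: force average ratio = 1
    return c3, c5

cy3 = [1000.0, 2000.0, 500.0]
cy5 = [2100.0, 4100.0, 1100.0]   # systematically ~2x brighter channel
c3, c5 = global_normalize(cy3, cy5, bg3=100.0, bg5=100.0)
ratios = [b / a for a, b in zip(c3, c5)]
print(sum(ratios) / len(ratios))  # average ratio is now 1.0
```

After normalization, a 2-fold difference that remains reflects the biology rather than a labeling or scanning artifact.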

20 Microarray data preprocessing Some researchers use housekeeping genes for global normalization Visit the Human Gene Expression (HuGE) Index: Page 192

21 Scatter plots Useful to represent gene expression values from two microarray experiments (e.g. control, experimental) Each dot corresponds to a gene expression value Most dots fall along a line Outliers represent up-regulated or down-regulated genes Page 193

22 Scatter plot analysis of microarray data Fig. 7.2 Page 193

23 Differential Gene Expression in Different Tissue and Cell Types (scatter plot panels comparing brain, fibroblast, and astrocyte vs. astrocyte samples)

24 Scatter plot axes: expression level (sample 1) vs. expression level (sample 2), scaled from low to high. Fig. 7.2 Page 193

25 Log-log transformation Fig. 7.3 Page 195

26 Scatter plots Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.
time   behavior      raw ratio   log2 ratio
t=0    basal         1.0          0.0
t=1h   no change     1.0          0.0
t=2h   2-fold up     2.0          1.0
t=3h   2-fold down   0.5         -1.0
Page 194, 197
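The ratio-to-log2 conversion in the time course can be verified directly (a short sketch; the raw ratios are the standard values implied by the fold-change labels):

```python
import math

# log2 transformation of expression ratios: symmetric about zero, so
# 2-fold up (+1) mirrors 2-fold down (-1), and no change maps to 0.
for behavior, ratio in [("basal", 1.0), ("no change", 1.0),
                        ("2-fold up", 2.0), ("2-fold down", 0.5)]:
    print(behavior, math.log2(ratio))
# basal and no change give 0.0; 2-fold up gives 1.0; 2-fold down gives -1.0
```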

27 Plot of log ratio (up vs. down) against mean log intensity, from low to high expression level. Fig. 7.4 Page 196

28 SNOMAD converts array data to scatter plots: a linear-linear plot (EXP vs. CON) and a log-log plot of Log10(Ratio) against Mean(Log10(Intensity)), with 2-fold lines marking EXP > CON and EXP < CON.

29 SNOMAD corrects local variance artifacts: a robust local regression is fit to Log10(Ratio) versus Mean(Log10(Intensity)), and the residuals become the corrected Log10(Ratio), again shown with 2-fold lines (EXP > CON, EXP < CON).

30 SNOMAD describes regulated genes in Z-scores: the corrected Log10(Ratio) is scaled by locally estimated standard deviations of the positive and negative ratios, giving local Z-score contours (e.g. Z = ±1, ±2, ±5) plotted against Mean(Log10(Intensity)).

31 Robust multi-array analysis (RMA) Developed by Rafael Irizarry (Dept. of Biostatistics), Terry Speed, and others. Available as an R package, and in various software packages (including Partek and Iobion Gene Traffic). See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4. There are three steps: [1] Background adjustment based on a normal plus exponential model (no mismatch data are used) [2] Quantile normalization [3] Robustly fitting a log-scale additive model: probe effect + sample effect

32 Precision vs. accuracy: precision is good performance (reproducibility of the result); accuracy is good quality of the result; the goal is precision with accuracy.

33 Robust multi-array analysis (RMA) RMA offers a large increase in precision relative to Affymetrix MAS 5.0 software (plot: SD of log expression vs. average log expression, for MAS 5.0 and RMA).

34 Robust multi-array analysis (RMA) RMA offers accuracy comparable to MAS 5.0 (plot: observed log expression vs. log nominal concentration).

35 RMA adjusts for the effect of GC content (plot: log intensity vs. GC content).

36 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

37 Inferential statistics Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: "There is no difference in signal intensity for the gene expression measurements in normal and diseased samples." The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to p < 0.05. Page 199

38 Inferential statistics A t-test is a commonly used test statistic to assess the difference in mean values between two groups: t = (x̄1 − x̄2) / σ = (difference between mean values) / (variability, or noise). Questions: Is the sample size (n) adequate? Are the data normally distributed? Is the variance of the data known? Is the variance the same in the two groups? Is it appropriate to set the significance level to p < 0.05? Page 199
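A pooled-variance version of this t statistic can be sketched as follows (the expression values are hypothetical; a real analysis would use a statistics library):

```python
import math

def t_statistic(group1, group2):
    """Unpaired two-sample t statistic with pooled variance:
    (mean1 - mean2) / SE, i.e. signal divided by noise."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # sample variances, then the pooled variance estimate
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

normal = [5.1, 4.9, 5.3, 5.0]   # log expression, hypothetical
disease = [6.2, 6.0, 6.5, 6.1]
print(round(t_statistic(disease, normal), 2))  # large t: strong difference
```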

39

40 t-test to determine statistical significance, disease vs. normal: t statistic = (difference between the means of disease and normal) / (variation due to error)

41 ANOVA partitions total data variability. Before partitioning: disease vs. normal + error. After partitioning: disease vs. normal + subject + tissue type + error. F ratio = (variation between DS and normal) / (variation due to error)

42 Inferential statistics
Paradigm                      Parametric test   Nonparametric test
Compare two unpaired groups   Unpaired t-test   Mann-Whitney test
Compare two paired groups     Paired t-test     Wilcoxon test
Compare 3 or more groups      ANOVA
Table 7-2 Page

43 Inferential statistics Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. But you might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated; then you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction: the level for statistical significance is divided by the number of measurements, e.g. the criterion becomes p < (0.05)/10,000, or p < 5 x 10^-6. The Bonferroni correction is generally considered to be too conservative. Page 199
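The Bonferroni arithmetic above is a one-liner (the numbers are those from the text):

```python
# Bonferroni correction sketch: divide the significance level by the
# number of tests performed.
alpha = 0.05
n_genes = 10_000
bonferroni_threshold = alpha / n_genes
print(bonferroni_threshold)  # i.e. p < 5 x 10^-6
```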

44 Inferential statistics: false discovery rate The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery. The expected number of false discoveries equals the p value of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). You can adjust the false discovery rate. For example:
FDR     # regulated transcripts   # false discoveries
0.10    100                       10
0.05    20                        1
Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
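The expected-false-positive bookkeeping from this slide can be sketched as follows (`expected_false_positives` is an illustrative helper name, not from the book):

```python
# Expected false discoveries = p-value threshold x number of genes tested.
def expected_false_positives(p_threshold, n_genes):
    return p_threshold * n_genes

n_genes = 10_000
print(expected_false_positives(0.01, n_genes))  # 100 expected false positives
```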

45 Significance analysis of microarrays (SAM) SAM -- an Excel plug-in (URL: page 202) -- modified t-test -- adjustable false discovery rate Page 200

46 Fig. 7.7 Page 202

47 SAM plot of observed versus expected scores: genes far above the diagonal are upregulated, far below are downregulated. Fig. 7.7 Page 202

48 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

49 Descriptive statistics Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are: -- Euclidean distance -- Pearson coefficient of correlation Page 203

50 Similarity, dissimilarity, and distance Reference for the following slides on clustering: G. Dunn and B.S. Everitt, An Introduction to Mathematical Taxonomy. Mineola, New York: Dover, B.S. Everitt, S. Landau, M. Leese, Cluster Analysis. 4 th edition. London: Arnold, 2001.

51 What is a cluster? A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.

52 Similarity Similarity (s) refers to relatedness. We will refer to the character states (e.g. amino acid residues or gene expression values) of the operational taxonomic units (OTUs, e.g. globins from a particular species). For OTUs i and j, consider two character states: 1 (presence) and 0 (absence).
              OTU i
               1      0
OTU j    1     a      b      a+b
         0     c      d      c+d
               a+c    b+d    p = a+b+c+d

53 Here p is the total number of characters studied, a is the number of characters where both OTUs have value 1, etc. The number of matches is a + d. According to the simple matching coefficient, s_ij = (a + d) / p. That is, the similarity measure s_ij is the ratio of the total number of matches to the total number of characters. The range of scores is 0 (no matches) to 1 (the OTUs match on every character).

54 Similarity: Jaccard's coefficient Jaccard (1908) excluded the number of negative matches, defining the coefficient s_ij = a / (a + b + c). This is the ratio of positive matches to the total number of characters minus the number of negative matches. Why might it be useful to ignore category d? Scoring the shared absence of wings when comparing camels and worms would be absurd (see Sokal and Sneath, 1963).
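The two coefficients can be compared on toy presence/absence vectors (a minimal sketch; the OTU character vectors are hypothetical):

```python
# Simple matching vs. Jaccard coefficient for two binary character vectors.
def match_counts(u, v):
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # both present
    b = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)  # both absent
    return a, b, c, d

def simple_matching(u, v):
    a, b, c, d = match_counts(u, v)
    return (a + d) / (a + b + c + d)   # all matches / total characters

def jaccard(u, v):
    a, b, c, d = match_counts(u, v)
    return a / (a + b + c)             # negative matches (d) excluded

otu_i = [1, 1, 0, 0, 1]
otu_j = [1, 0, 0, 0, 1]
print(simple_matching(otu_i, otu_j))  # 0.8 (4 of 5 characters match)
print(jaccard(otu_i, otu_j))          # 2/3: shared absences are ignored
```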

55 Example of similarity matrices Consider the hypothetical case of four species (OTUs) that have either absence (character 0) or presence (character 1) of certain globin genes. The simple matching coefficient matrix and the Jaccard similarity matrix computed from these characters are both symmetrical, because distance d_ij = d_ji.

56 Example of similarity matrices Conclusion: different similarity coefficients highlight different aspects of the relationships between OTUs, and may produce very different classifications.

57 Similarity, dissimilarity, and distance Similarity (s) refers to relatedness. Dissimilarity (d) may be defined simply as 1 − s. However, some dissimilarity measures have unique, useful properties. Let d_ij = the dissimilarity between OTUs i and j. Symmetry: d_ij = d_ji ≥ 0. Distinguishability of non-identicals: if d_ij ≠ 0 then i ≠ j. Indistinguishability of identicals: for identical OTUs i and j, d_ii = 0. Triangular inequality (or metric inequality): given three OTUs i, j, and k, d_ik ≤ d_ij + d_jk. Dissimilarity coefficients satisfying these properties are also called distances rather than dissimilarities.

58 Distance metrics: Euclidean For OTUs i and j measured on characters a and b, the Euclidean distance d_ij is: d_ij = [(x_ja − x_ia)^2 + (x_jb − x_ib)^2]^1/2. This is the Pythagorean theorem: c^2 = a^2 + b^2, so c = [a^2 + b^2]^1/2.
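The Euclidean distance formula generalizes directly to p characters (a minimal sketch; the expression profiles are hypothetical):

```python
import math

# Euclidean distance between two profiles, generalized from the
# two-character formula to any number of characters.
def euclidean(x, y):
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(x, y)))

gene_i = [3.0, 4.0]
gene_j = [0.0, 0.0]
print(euclidean(gene_i, gene_j))  # 5.0, the 3-4-5 right triangle
```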

59 The importance of using standardized scales Consider four individuals A-D, with age (years) and height recorded both in feet and in centimeters, together with the mean and S.D. of each variable. This example is adapted from L. Kaufman and P.J. Rousseeuw, Finding Groups in Data (Wiley, 1990).

60 The importance of using standardized scales Plots of height (feet) vs. age, height (cm) vs. age, and standardized height vs. standardized age for individuals A-D. Standardized age and height: z_if = (x_if − m_f) / s_f, where z_if = z score (unitless), x_if = each measurement, m_f = mean value of variable f, and s_f = mean absolute deviation (related to the standard deviation).
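The standardization above can be sketched as follows (the example values are hypothetical; note that the slide's s_f is the mean absolute deviation, not the standard deviation):

```python
# Standardization following the slide: z = (x - m) / s, where s is the
# mean absolute deviation.
def standardize(values):
    m = sum(values) / len(values)                      # mean
    s = sum(abs(x - m) for x in values) / len(values)  # mean absolute deviation
    return [(x - m) / s for x in values]

ages = [20, 30, 40, 50]
heights_cm = [180, 165, 170, 175]
print(standardize(ages))  # [-1.5, -0.5, 0.5, 1.5]
print(standardize(heights_cm))
# Both variables are now unitless and on comparable scales, so neither
# dominates a Euclidean distance.
```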

61 Data matrix (20 genes and 3 time points from Chu et al.) Fig. 7.8 Page 205

62 t=2.0 t=0.5 t=0 3D plot (using S-PLUS software) Fig. 7.8 Page 205

63 Descriptive statistics: clustering Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 204

64 Agglomerative clustering a b a,b c d e Adapted from Kaufman and Rousseeuw (1990) Fig. 7.9 Page 206

65 Agglomerative clustering a b a,b c d e d,e Fig. 7.9 Page 206

66 Agglomerative clustering a b a,b c d e d,e c,d,e Fig. 7.9 Page 206

67 Agglomerative clustering a b c d e a,b d,e c,d,e a,b,c,d,e tree is constructed Fig. 7.9 Page 206

68 Divisive clustering a,b,c,d,e Fig. 7.9 Page 206

69 Divisive clustering c,d,e a,b,c,d,e Fig. 7.9 Page 206

70 Divisive clustering c,d,e a,b,c,d,e d,e Fig. 7.9 Page 206

71 Divisive clustering a,b a,b,c,d,e c,d,e d,e Fig. 7.9 Page 206

72 a b c d e a,b d,e Divisive clustering c,d,e a,b,c,d,e tree is constructed Fig. 7.9 Page 206

73 agglomerative a b a,b a,b,c,d,e c d e d,e c,d,e Adapted from Kaufman and Rousseeuw (1990) divisive Fig. 7.9 Page 206

74 Fig. 7.8 Page 205

75 Fig Page 207

76 Agglomerative and divisive clustering sometimes give conflicting results, as shown here Fig Page 207

77 Cluster analysis Cluster analysis allows an explicit separation of data points into groups. Hierarchical clustering may be divided into agglomerative and divisive methods. Agglomerative hierarchical techniques begin with a proximity matrix to fuse n OTUs into groups. The two OTUs with the highest similarity (or smallest distance) are grouped.

78 Cluster analysis: single-linkage clustering There are several methods available to calculate the proximity between a single OTU and a group containing several OTUs (or to calculate the proximity between two groups). In single-linkage clustering, the similarity between two clusters of OTUs is that of their most similar pair.

79 Cluster analysis: single-linkage clustering Consider this dissimilarity matrix for five OTUs (Dunn and Everitt p. 78).

80 Cluster analysis: single-linkage clustering Consider this dissimilarity matrix, D_1, for five OTUs (Dunn and Everitt p. 78). [1] Identify the smallest entry. It is 2 (for OTUs 1 and 2). [2] Join OTUs 1 and 2 as a two-membered cluster. [3] Calculate dissimilarities between this new cluster and the remaining three OTUs (3, 4 and 5), as follows.

81 Cluster analysis: single-linkage clustering [3] continued: d_(12)3 = min{d_13, d_23} = d_23 = 5; d_(12)4 = min{d_14, d_24} = d_24 = 9; d_(12)5 = min{d_15, d_25} = d_25 = 8. [4] Create a new matrix, D_2, from these values.

82 Cluster analysis: single-linkage clustering [5] Repeat steps 1-4 on D_2. Here, identify the smallest distance (it is 3, for OTUs 4 and 5). [6] Create a new two-member cluster (45). [7] Calculate dissimilarities: d_(12)3 = min{d_13, d_23} = d_23 = 5 (calculated previously); d_(12)(45) = min{d_14, d_15, d_24, d_25} = d_25 = 8; d_(45)3 = min{d_34, d_35} = d_34 = 4. [8] Create matrix D_3.

83 Cluster analysis: single-linkage clustering [9] Repeat steps 1-4 on D_3. Here, identify the smallest distance (it is 4, between OTU 3 and cluster (45)). [10] Create a new three-member cluster (345). [11] Calculate dissimilarities: d_(12)(345) = min{d_13, d_14, d_15, d_23, d_24, d_25} = d_23 = 5. In this way, the relationships (distances) of all the members of the original dissimilarity matrix have been described in groups that form part of a single cluster. This cluster is commonly represented graphically as a dendrogram (tree). Note that the clusters were formed in successive levels in an ordered, hierarchical fashion.

84 Cluster analysis: complete-linkage clustering This approach differs from single-linkage clustering because it seeks the most dissimilar pair from each group. Thus in step [3] above, we previously obtained the minimum distance: d_(12)3 = min{d_13, d_23} = d_23 = 5; d_(12)4 = min{d_14, d_24} = d_24 = 9; d_(12)5 = min{d_15, d_25} = d_25 = 8. But in complete-linkage clustering we obtain the maximum distance: d_(12)3 = max{d_13, d_23} = d_13 = 6; d_(12)4 = max{d_14, d_24} = d_14 = 10; d_(12)5 = max{d_15, d_25} = d_15 = 9.
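Single- and complete-linkage agglomeration differ only in whether min or max scores a cluster pair. A naive sketch (the dissimilarity values are inferred from the worked steps above; d_35 is not stated in the text and is assumed to be 5 here):

```python
# Naive agglomerative clustering over a pairwise dissimilarity table.
# linkage = min gives single linkage; linkage = max gives complete
# linkage. O(n^3) overall, fine for small examples.

def agglomerate(dist, linkage):
    """dist maps frozenset({i, j}) -> dissimilarity of items i and j.
    Returns the merge order as a list of sorted merged clusters."""
    clusters = {frozenset([i]) for i in set().union(*dist)}
    merges = []
    while len(clusters) > 1:
        # choose the pair of clusters with the smallest linkage distance
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x != y),
            key=lambda pair: linkage(
                dist[frozenset([i, j])] for i in pair[0] for j in pair[1]
            ),
        )
        clusters = (clusters - {a, b}) | {a | b}
        merges.append(sorted(a | b))
    return merges

# dissimilarities reconstructed from the worked example (d_35 assumed)
d = {frozenset(p): v for p, v in {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5, (4, 5): 3,
}.items()}
print(agglomerate(d, min))  # single linkage: (12), (45), (345), then all
print(agglomerate(d, max))  # complete linkage: same merge order on this matrix
```

With min, the merges reproduce the order of the worked example: (12), then (45), then (345), then the full five-member cluster.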

85 Cluster analysis: complete-linkage clustering There are many other criteria for defining clusters: single-linkage, complete-linkage, group-average, and centroid. They do not always give equivalent clustering patterns. For example, single-linkage clustering may be susceptible to chaining of closely related OTUs, obscuring reasonable cluster structure (figure: single linkage, complete linkage, centroid linkage, and chaining in single linkage).

86 Cluster and TreeView Fig Page 208

87 Cluster and TreeView clustering K means SOM PCA Fig Page 208

88 Cluster and TreeView Fig Page 208

89 Cluster and TreeView Fig Page 208

90 Page 208

91 Fig Page 208

92 Fig Page 208

93 Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000) Fig Page 209

94 Principal components analysis (PCA) An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D For a matrix of m genes x n samples, create a new covariance matrix of size n x n Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs). Page 211

95 Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines: samples are plotted on principal component axis #1 (87% of variance) versus axis #2 (10%). Legend: lead (P), sodium (N), control (C). Page 211

96 Principal components analysis (PCA): objectives: to reduce dimensionality; to determine the linear combination of variables; to choose the most useful variables (features); to visualize multidimensional data; to identify groups of objects (e.g. genes/samples); to identify outliers. Page 211

97 Page 212

98 Page 212

99 Page 212

100 Page 212

101 Why principal components analysis? For Euclidean distance measures, p characters are measured on each OTU (for example, 10,000 gene expression values are measured on each sample of a microarray experiment). In using a Euclidean distance metric, it may be assumed that each OTU may be represented by p orthogonal axes. But what if some of the p measurements are correlated? It may be helpful to explore the dataset further, trying to find uncorrelated composite measures. There may be far fewer than p composite measures. This approach is relevant whether trying to cluster observed characters (e.g. genes) or OTUs (e.g. samples).

102 Principal components analysis (PCA) No distinction is made between dependent and independent variables The purpose is to explore the structure of the relationships among the variables to determine whether a smaller number of underlying factors can account for the observed data to reduce the number of variables Sources: [1] G. Dunn and B.S. Everitt, An Introduction to Mathematical Taxonomy (2004); [2] G.R. Norman and D.L. Streiner, Biostatistics: The Bare Essentials, 2 nd edition (2000).

103 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having properties discussed below. From a data matrix of n genes (e.g. 20,000) by p samples (e.g. 8), a p x p correlation matrix (with 1.0 on the diagonal) is computed.

104 Principal components analysis (PCA) Observed set of characters (e.g. samples, or gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following four properties.

105 Principal components analysis (PCA) Observed set of characters (e.g. samples, or gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p. [1] Each y is a linear combination of the xs:
y_1 = a_11 x_1 + a_12 x_2 + ... + a_1p x_p
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2p x_p
...
y_p = a_p1 x_1 + a_p2 x_2 + ... + a_pp x_p
Each y is called a factor; a series of linear combinations of variables defines each factor. If there are 8 original variables (e.g. samples), we will extract 8 factors and have 8 equations.

106 Principal components analysis (PCA) [1] continued. The x terms in the equations above are the original variables (e.g. x_1 through x_8 for 8 samples).

107 Principal components analysis (PCA) [1] continued. The a terms are weights, with two subscripts: the first shows which y term (factor) the weight is associated with, and the second shows which variable (e.g. sample) it is associated with.

108 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following properties: [2] For each of the coefficients a defining each linear transformation, the sum of their squares is unity: Σ_{j=1..p} a_ij^2 = 1 for i = 1, 2, ..., p. This step is helpful because once the vectors of these coefficients are normalized (scaled) to 1, the latent roots of the matrix (λ_1, λ_2, ..., λ_p) will be interpretable as the sampling variances of y_1, y_2, ..., y_p.

109 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following properties: [3] For all such transformations, y_1 has the greatest variance. We have set up 8 factors (y_1 to y_p) from 8 original variables; the usefulness of this procedure is that the new factors summarize the data in a ranked fashion, from the factor explaining the most variance to the least.

110 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following properties: [4] Of all the transformations uncorrelated with y_1, y_2 has the greatest variance. Thus PCA produces a set of p composite characters that are uncorrelated and ranked from the most to the least variance. Uncorrelated factors or latent variables are also termed orthogonal.
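For p = 2 the principal component variances can be computed by hand from the 2x2 covariance matrix (a minimal sketch with hypothetical data; `pca_2d_variances` is an illustrative helper, not from the book):

```python
import math

# PCA sketch for p = 2 variables: build the 2x2 covariance matrix and
# solve its eigenvalues analytically. The eigenvalues are the variances
# of the two uncorrelated principal components, ranked largest first.
def pca_2d_variances(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of [[sxx, sxy], [sxy, syy]] via trace and determinant
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(tr ** 2 / 4 - det)
    return tr / 2 + root, tr / 2 - root  # (largest, smallest)

# two strongly correlated hypothetical variables
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
lam1, lam2 = pca_2d_variances(xs, ys)
print(round(lam1 / (lam1 + lam2), 3))  # ~0.995: PC1 captures nearly all variance
```

Because the variables are highly correlated, one composite measure (PC1) carries almost all the information, which is the point of the dimensionality reduction.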

111 PCA: Scree plot A scree plot displays eigenvalues (y-axis) versus factors (x-axis). It is used to estimate the number of factors that usefully capture the variance in the data: look for where the plot levels off and exclude the factors beyond that point. Typically, we hope to find as few factors (latent variables) as possible that explain the original data.

112 PCA: geometric interpretation Consider 8 samples (4 control, 4 diseased) for which 40,000 gene expression measurements are obtained. One can do PCA on the samples (and obtain a plot with eight data points) or on the genes (upon transposing the data matrix, obtain a plot with 40,000 data points). In a PCA plot, two points that are close together in PCA space have a similar overall behavior in the original data matrix. For a measurement of p variables (e.g. p = 8), the first principal component is defined by a straight line that traverses the data points in p-dimensional (here 8-dimensional) space.

113 PCA: geometric interpretation Next, the first and second principal components together define a plane that best fits the data points in p-dimensional space; the first three PCs display the objects in three-dimensional space. Note that the units of the PC axes are the percentage of the variance captured along each axis; the first axis always has the largest percentage. If a large proportion of the variance (e.g. >>50%) is explained by the first two or three principal component axes, then the PCA plot has effectively visualized the relations of the data points while reducing the dimensionality from p to one, two, or three.

114 Fig Page 212

115 Fig Page 212

116 Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain Chr 21

117 Illustration of PCA and clustering The following examples use Partek software. Create a spreadsheet consisting of ten proteins (columns). The rows show the number of times each of the 20 amino acids occurs in a protein.

118 Four small human proteins: alpha globin (NP_000549) beta globin (NP_000509) myoglobin (NP_976312) RBP4 (NP_006735)

119 Three huge human proteins: nesprin 1 (8797 amino acids) nebulin (6669 amino acids) mucin 16 (14507 amino acids)

120 Three trans-membrane proteins: ABCD1 (NP_000024) serotonin receptor (NP_000515) acetylcholine transporter (NP_003046)

121 PCA shows that along the first principal component (PC) axis, 60% of the variance is captured, and there are no dramatic groupings of amino acids (plot: PC #1, 60%; PC #2, 13%; labeled points include leu and lys).

122 Complementary to PCA, the relationships of the 20 amino acids can also be visualized by hierarchical clustering

123 The data matrix can be transposed (rows and columns are swapped): columns = 20 amino acids; rows = ten proteins. PCA shows that the three large proteins are different from the other seven proteins. Mucin 16 (>14,000 amino acids) is separable from the other two large proteins on the second PC axis (plot: PC #1, 84.7%; PC #2, 11.3%; labeled points: mucin 16, nebulin, nesprin 1, and the seven other proteins).

124 Hierarchical clustering of ten proteins. Distance metric: Euclidean large proteins small proteins transmembrane proteins

125 Hierarchical clustering of ten proteins. Distance metric: Pearson s dissimilarity large proteins small proteins transmembrane proteins

126 Hierarchical clustering of ten proteins. Distance metric: Rank (Spearman) dissimilarity large proteins small proteins transmembrane proteins

127 Hierarchical clustering of ten proteins. Distance metric: Tanimoto large proteins small proteins transmembrane proteins

128 Hierarchical clustering of ten proteins. Distance metric: Cosine dissimilarity large proteins small proteins transmembrane proteins

129 K-means (partitioning) clustering: select a number of clusters at the outset. Four clusters are better than 1, 2, 3, or 5 by the Davies-Bouldin criterion; the four clusters can then be displayed in PCA space.

130 Multidimensional scaling Euclidean distance function 10 rows = proteins 20 columns = amino acids mucin 16 nebulin 1 nesprin 1

131 Multidimensional scaling Pearson s dissimilarity distance function 10 rows = proteins 20 columns = amino acids

132 Friday s computer lab Practice downloading a dataset (e.g. Chu et al. 1998) from Try making scatter plots in Excel Try doing t-tests and ratio calculations in Excel Try GEO at NCBI for microarray data and visualization


DIMENSION REDUCTION AND CLUSTER ANALYSIS

DIMENSION REDUCTION AND CLUSTER ANALYSIS DIMENSION REDUCTION AND CLUSTER ANALYSIS EECS 833, 6 March 2006 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and resources available at http://people.ku.edu/~gbohling/eecs833

More information

Principal component analysis (PCA) for clustering gene expression data

Principal component analysis (PCA) for clustering gene expression data Principal component analysis (PCA) for clustering gene expression data Ka Yee Yeung Walter L. Ruzzo Bioinformatics, v17 #9 (2001) pp 763-774 1 Outline of talk Background and motivation Design of our empirical

More information

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Low-Level Analysis of High- Density Oligonucleotide Microarray Data Low-Level Analysis of High- Density Oligonucleotide Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley UC Berkeley Feb 23, 2004 Outline

More information

Session 06 (A): Microarray Basic Data Analysis

Session 06 (A): Microarray Basic Data Analysis 1 SJTU-Bioinformatics Summer School 2017 Session 06 (A): Microarray Basic Data Analysis Maoying,Wu ricket.woo@gmail.com Dept. of Bioinformatics & Biostatistics Shanghai Jiao Tong University Summer, 2017

More information

Chapter 5: Microarray Techniques

Chapter 5: Microarray Techniques Chapter 5: Microarray Techniques 5.2 Analysis of Microarray Data Prof. Yechiam Yemini (YY) Computer Science Department Columbia University Normalization Clustering Overview 2 1 Processing Microarray Data

More information

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests PSY 307 Statistics for the Behavioral Sciences Chapter 20 Tests for Ranked Data, Choosing Statistical Tests What To Do with Non-normal Distributions Tranformations (pg 382): The shape of the distribution

More information

Principal Components Analysis. Sargur Srihari University at Buffalo

Principal Components Analysis. Sargur Srihari University at Buffalo Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,

More information

cdna Microarray Analysis

cdna Microarray Analysis cdna Microarray Analysis with BioConductor packages Nolwenn Le Meur Copyright 2007 Outline Data acquisition Pre-processing Quality assessment Pre-processing background correction normalization summarization

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different

More information

Factor analysis. George Balabanis

Factor analysis. George Balabanis Factor analysis George Balabanis Key Concepts and Terms Deviation. A deviation is a value minus its mean: x - mean x Variance is a measure of how spread out a distribution is. It is computed as the average

More information

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo Clustering Lecture 1: Basics Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics Clustering

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

Similarity and Dissimilarity

Similarity and Dissimilarity 1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.

More information

Statistical Methods for Analysis of Genetic Data

Statistical Methods for Analysis of Genetic Data Statistical Methods for Analysis of Genetic Data Christopher R. Cabanski A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements

More information

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Gene expression profiling A quick review Which molecular processes/functions

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Principal Component Analysis (PCA) Theory, Practice, and Examples

Principal Component Analysis (PCA) Theory, Practice, and Examples Principal Component Analysis (PCA) Theory, Practice, and Examples Data Reduction summarization of data with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables. p k n A

More information

Sleep data, two drugs Ch13.xls

Sleep data, two drugs Ch13.xls Model Based Statistics in Biology. Part IV. The General Linear Mixed Model.. Chapter 13.3 Fixed*Random Effects (Paired t-test) ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world

More information

Principal Component Analysis, A Powerful Scoring Technique

Principal Component Analysis, A Powerful Scoring Technique Principal Component Analysis, A Powerful Scoring Technique George C. J. Fernandez, University of Nevada - Reno, Reno NV 89557 ABSTRACT Data mining is a collection of analytical techniques to uncover new

More information

Nemours Biomedical Research Biostatistics Core Statistics Course Session 4. Li Xie March 4, 2015

Nemours Biomedical Research Biostatistics Core Statistics Course Session 4. Li Xie March 4, 2015 Nemours Biomedical Research Biostatistics Core Statistics Course Session 4 Li Xie March 4, 2015 Outline Recap: Pairwise analysis with example of twosample unpaired t-test Today: More on t-tests; Introduction

More information

Supplementary Information

Supplementary Information Supplementary Information For the article"comparable system-level organization of Archaea and ukaryotes" by J. Podani, Z. N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and. Szathmáry (reference numbers

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

University of Florida CISE department Gator Engineering. Clustering Part 1

University of Florida CISE department Gator Engineering. Clustering Part 1 Clustering Part 1 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville What is Cluster Analysis? Finding groups of objects such that the objects

More information

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Gene expression profiling A quick review Which molecular processes/functions

More information

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001).

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference:

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Chapter 2 1 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser University 2011 Han, Kamber, and Pei. All rights reserved.

More information

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS NOTES FROM PRE- LECTURE RECORDING ON PCA PCA and EFA have similar goals. They are substantially different in important ways. The goal

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Lecture 5: November 19, Minimizing the maximum intracluster distance

Lecture 5: November 19, Minimizing the maximum intracluster distance Analysis of DNA Chips and Gene Networks Spring Semester, 2009 Lecture 5: November 19, 2009 Lecturer: Ron Shamir Scribe: Renana Meller 5.1 Minimizing the maximum intracluster distance 5.1.1 Introduction

More information

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop Inferential Statistical Analysis of Microarray Experiments 007 Arizona Microarray Workshop μ!! Robert J Tempelman Department of Animal Science tempelma@msuedu HYPOTHESIS TESTING (as if there was only one

More information

Quantitative Understanding in Biology Principal Components Analysis

Quantitative Understanding in Biology Principal Components Analysis Quantitative Understanding in Biology Principal Components Analysis Introduction Throughout this course we have seen examples of complex mathematical phenomena being represented as linear combinations

More information

Non-parametric methods

Non-parametric methods Eastern Mediterranean University Faculty of Medicine Biostatistics course Non-parametric methods March 4&7, 2016 Instructor: Dr. Nimet İlke Akçay (ilke.cetin@emu.edu.tr) Learning Objectives 1. Distinguish

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Simplifying Drug Discovery with JMP

Simplifying Drug Discovery with JMP Simplifying Drug Discovery with JMP John A. Wass, Ph.D. Quantum Cat Consultants, Lake Forest, IL Cele Abad-Zapatero, Ph.D. Adjunct Professor, Center for Pharmaceutical Biotechnology, University of Illinois

More information

Course Review. Kin 304W Week 14: April 9, 2013

Course Review. Kin 304W Week 14: April 9, 2013 Course Review Kin 304W Week 14: April 9, 2013 1 Today s Outline Format of Kin 304W Final Exam Course Review Hand back marked Project Part II 2 Kin 304W Final Exam Saturday, Thursday, April 18, 3:30-6:30

More information

Clustering. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 373 Genomic Informatics Elhanan Borenstein Some slides adapted from Jacques van Helden The clustering problem The goal of gene clustering process is to partition the genes into distinct

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Statistics Toolbox 6. Apply statistical algorithms and probability models

Statistics Toolbox 6. Apply statistical algorithms and probability models Statistics Toolbox 6 Apply statistical algorithms and probability models Statistics Toolbox provides engineers, scientists, researchers, financial analysts, and statisticians with a comprehensive set of

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley University of Minnesota Mar 30, 2004 Outline

More information

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p.

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. Preface p. xi Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. 6 The Scientific Method and the Design of

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

Statistics: revision

Statistics: revision NST 1B Experimental Psychology Statistics practical 5 Statistics: revision Rudolf Cardinal & Mike Aitken 29 / 30 April 2004 Department of Experimental Psychology University of Cambridge Handouts: Answers

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Small vs. large parsimony A quick review Fitch s algorithm:

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Principal Component Analysis & Factor Analysis. Psych 818 DeShon

Principal Component Analysis & Factor Analysis. Psych 818 DeShon Principal Component Analysis & Factor Analysis Psych 818 DeShon Purpose Both are used to reduce the dimensionality of correlated measurements Can be used in a purely exploratory fashion to investigate

More information

sphericity, 5-29, 5-32 residuals, 7-1 spread and level, 2-17 t test, 1-13 transformations, 2-15 violations, 1-19

sphericity, 5-29, 5-32 residuals, 7-1 spread and level, 2-17 t test, 1-13 transformations, 2-15 violations, 1-19 additive tree structure, 10-28 ADDTREE, 10-51, 10-53 EXTREE, 10-31 four point condition, 10-29 ADDTREE, 10-28, 10-51, 10-53 adjusted R 2, 8-7 ALSCAL, 10-49 ANCOVA, 9-1 assumptions, 9-5 example, 9-7 MANOVA

More information

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric Assumptions The observations must be independent. Dependent variable should be continuous

More information

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In

More information

Fuzzy Clustering of Gene Expression Data

Fuzzy Clustering of Gene Expression Data Fuzzy Clustering of Gene Data Matthias E. Futschik and Nikola K. Kasabov Department of Information Science, University of Otago P.O. Box 56, Dunedin, New Zealand email: mfutschik@infoscience.otago.ac.nz,

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the Character Correlation Zhongyi Xiao Correlation In probability theory and statistics, correlation indicates the strength and direction of a linear relationship between two random variables. In general statistical

More information

Introduction to Principal Component Analysis (PCA)

Introduction to Principal Component Analysis (PCA) Introduction to Principal Component Analysis (PCA) NESAC/BIO NESAC/BIO Daniel J. Graham PhD University of Washington NESAC/BIO MVSA Website 2010 Multivariate Analysis Multivariate analysis (MVA) methods

More information

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST TIAN ZHENG, SHAW-HWA LO DEPARTMENT OF STATISTICS, COLUMBIA UNIVERSITY Abstract. In

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1 2 3 -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1950's. -PCA is based on covariance or correlation

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Statistical Inference

Statistical Inference Statistical Inference Jean Daunizeau Wellcome rust Centre for Neuroimaging University College London SPM Course Edinburgh, April 2010 Image time-series Spatial filter Design matrix Statistical Parametric

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu May 2, 2017 Announcements Homework 2 due later today Due May 3 rd (11:59pm) Course project

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Principal Components Analysis (PCA)

Principal Components Analysis (PCA) Principal Components Analysis (PCA) Principal Components Analysis (PCA) a technique for finding patterns in data of high dimension Outline:. Eigenvectors and eigenvalues. PCA: a) Getting the data b) Centering

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Nonparametric Statistics

Nonparametric Statistics Nonparametric Statistics Nonparametric or Distribution-free statistics: used when data are ordinal (i.e., rankings) used when ratio/interval data are not normally distributed (data are converted to ranks)

More information

Dimensionality Reduction Techniques (DRT)

Dimensionality Reduction Techniques (DRT) Dimensionality Reduction Techniques (DRT) Introduction: Sometimes we have lot of variables in the data for analysis which create multidimensional matrix. To simplify calculation and to get appropriate,

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley Memorial Sloan-Kettering Cancer Center July

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Tools and topics for microarray analysis

Tools and topics for microarray analysis Tools and topics for microarray analysis USSES Conference, Blowing Rock, North Carolina, June, 2005 Jason A. Osborne, osborne@stat.ncsu.edu Department of Statistics, North Carolina State University 1 Outline

More information

Background to Statistics

Background to Statistics FACT SHEET Background to Statistics Introduction Statistics include a broad range of methods for manipulating, presenting and interpreting data. Professional scientists of all kinds need to be proficient

More information

Xiaosi Zhang. A thesis submitted to the graduate faculty. in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Xiaosi Zhang. A thesis submitted to the graduate faculty. in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE GENE EXPRESSION PATTERN ANALYSIS Xiaosi Zhang A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Major: Bioinformatics and Computational

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information