Microarray data analysis


1 Microarray data analysis September 20, 2006 Jonathan Pevsner, Ph.D. Introduction to Bioinformatics Johns Hopkins School of Public Health

2 Copyright notice Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright 2003 by John Wiley & Sons, Inc. These images and materials may not be used without permission from the publisher. We welcome instructors to use these powerpoints for educational purposes, but please acknowledge the source. The book has a homepage at including hyperlinks to the book chapters.

3 Schedule Today : microarray data analysis (Chapter 7) Friday: computer lab (microarray data analysis) Quiz 6 (chapter 7) opens Monday: continue on microarray data analysis; also SNP microarrays and array CGH Wednesday: Protein analysis (Chapter 8)

4 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

5 The MicroArray Quality Consortium (MAQC) The MAQC Consortium published a series of papers in Nature Biotechnology this week (September 2006, volume 24 issue 9). 20 microarray products and three technologies were evaluated for 12,000 RNA transcripts expressed in human tumor cell lines or brain. There was substantial agreement between sites and platforms for regulated transcripts.

6 MAQC Consortium (2006) Nature Biotechnology 24:

7 MAQC Consortium (2006) Nature Biotechnology 24:

8

9 Affymetrix GeneChip expression array

10 Example of a probe set corresponding to a gene

11 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

12 Microarray data analysis begin with a data matrix (gene expression values versus samples) Fig. 7.1 Page 190

13 Microarray data analysis begin with a data matrix (gene expression values versus samples) Typically, there are many genes (>> 10,000) and few samples (~ 10) Fig. 7.1 Page 190

14 Microarray data analysis begin with a data matrix (gene expression values versus samples) Preprocessing Inferential statistics Descriptive statistics Fig. 7.1 Page 190

15 Microarray data analysis: preprocessing Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as: different labeling efficiencies of Cy3, Cy5 uneven spotting of DNA onto an array surface variations in RNA purity or quantity variations in washing efficiency variations in scanning efficiency Page 191

16 Microarray data analysis: preprocessing The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment. Page 191

17 Data analysis: global normalization Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope. Page 192

18 Data analysis: global normalization Global normalization is used to correct two or more data sets. Example: total fluorescence in the Cy3 channel = 4 million units, and in the Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units, which would artifactually appear to show 2-fold regulation. Page 192

19 Data analysis: global normalization Global normalization procedure Step 1: subtract background intensity values (use a blank region of the array) Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets) Page 192
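The two-step procedure above can be sketched in Python (a minimal sketch; the intensity values and the background levels `bg3` and `bg5` are hypothetical):

```python
# Global normalization sketch: background subtraction, then scaling
# so that the average Cy5/Cy3 ratio across all spots equals 1.

def global_normalize(cy3, cy5, bg3, bg5):
    """Background-subtract each channel, then rescale Cy5 so the
    mean ratio across all spots is 1."""
    c3 = [max(v - bg3, 1e-9) for v in cy3]  # step 1: subtract background
    c5 = [max(v - bg5, 1e-9) for v in cy5]
    mean_ratio = sum(r5 / r3 for r3, r5 in zip(c3, c5)) / len(c3)
    c5 = [v / mean_ratio for v in c5]       # step 2: force average ratio = 1
    return c3, c5

cy3 = [1000.0, 2000.0, 500.0]
cy5 = [2100.0, 4100.0, 1100.0]   # systematically ~2x brighter channel
c3, c5 = global_normalize(cy3, cy5, bg3=100.0, bg5=100.0)
ratios = [b / a for a, b in zip(c3, c5)]
print(sum(ratios) / len(ratios))  # average ratio is now 1.0
```

After normalization, a 2-fold difference that remains reflects the biology rather than a labeling or scanning artifact.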

20 Microarray data preprocessing Some researchers use housekeeping genes for global normalization Visit the Human Gene Expression (HuGE) Index: Page 192

21 Scatter plots Useful to represent gene expression values from two microarray experiments (e.g. control, experimental) Each dot corresponds to a gene expression value Most dots fall along a line Outliers represent up-regulated or down-regulated genes Page 193

22 Scatter plot analysis of microarray data Fig. 7.2 Page 193

23 Differential Gene Expression in Different Tissue and Cell Types (scatter plot panels comparing brain, fibroblast, and astrocyte vs. astrocyte samples)

24 Scatter plot axes: expression level (sample 1) vs. expression level (sample 2), scaled from low to high. Fig. 7.2 Page 193

25 Log-log transformation Fig. 7.3 Page 195

26 Scatter plots Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.
time   behavior      raw ratio   log2 ratio
t=0    basal         1.0          0.0
t=1h   no change     1.0          0.0
t=2h   2-fold up     2.0          1.0
t=3h   2-fold down   0.5         -1.0
Page 194, 197
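The ratio-to-log2 conversion in the time course can be verified directly (a short sketch; the raw ratios are the standard values implied by the fold-change labels):

```python
import math

# log2 transformation of expression ratios: symmetric about zero, so
# 2-fold up (+1) mirrors 2-fold down (-1), and no change maps to 0.
for behavior, ratio in [("basal", 1.0), ("no change", 1.0),
                        ("2-fold up", 2.0), ("2-fold down", 0.5)]:
    print(behavior, math.log2(ratio))
# basal and no change give 0.0; 2-fold up gives 1.0; 2-fold down gives -1.0
```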

27 Plot of log ratio (up vs. down) against mean log intensity, from low to high expression level. Fig. 7.4 Page 196

28 SNOMAD converts array data to scatter plots: a linear-linear plot (EXP vs. CON) and a log-log plot of Log10(Ratio) against Mean(Log10(Intensity)), with 2-fold lines marking EXP > CON and EXP < CON.

29 SNOMAD corrects local variance artifacts: a robust local regression is fit to Log10(Ratio) versus Mean(Log10(Intensity)), and the residuals become the corrected Log10(Ratio), again shown with 2-fold lines (EXP > CON, EXP < CON).

30 SNOMAD describes regulated genes in Z-scores: the corrected Log10(Ratio) is scaled by locally estimated standard deviations of the positive and negative ratios, giving local Z-score contours (e.g. Z = ±1, ±2, ±5) plotted against Mean(Log10(Intensity)).

31 Robust multi-array analysis (RMA) Developed by Rafael Irizarry (Dept. of Biostatistics), Terry Speed, and others. Available as an R package, and in various software packages (including Partek and Iobion Gene Traffic). See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4. There are three steps: [1] Background adjustment based on a normal plus exponential model (no mismatch data are used) [2] Quantile normalization [3] Robustly fitting a log-scale additive model: probe effect + sample effect

32 Precision vs. accuracy: precision is good performance (reproducibility of the result); accuracy is good quality of the result; the goal is precision with accuracy.

33 Robust multi-array analysis (RMA) RMA offers a large increase in precision relative to Affymetrix MAS 5.0 software (plot: SD of log expression vs. average log expression, for MAS 5.0 and RMA).

34 Robust multi-array analysis (RMA) RMA offers accuracy comparable to MAS 5.0 (plot: observed log expression vs. log nominal concentration).

35 RMA adjusts for the effect of GC content (plot: log intensity vs. GC content).

36 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

37 Inferential statistics Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: "There is no difference in signal intensity for the gene expression measurements in normal and diseased samples." The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to p < 0.05. Page 199

38 Inferential statistics A t-test is a commonly used test statistic to assess the difference in mean values between two groups: t = (x̄1 − x̄2) / σ = (difference between mean values) / (variability, or noise). Questions: Is the sample size (n) adequate? Are the data normally distributed? Is the variance of the data known? Is the variance the same in the two groups? Is it appropriate to set the significance level to p < 0.05? Page 199
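A pooled-variance version of this t statistic can be sketched as follows (the expression values are hypothetical; a real analysis would use a statistics library):

```python
import math

def t_statistic(group1, group2):
    """Unpaired two-sample t statistic with pooled variance:
    (mean1 - mean2) / SE, i.e. signal divided by noise."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # sample variances, then the pooled variance estimate
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

normal = [5.1, 4.9, 5.3, 5.0]   # log expression, hypothetical
disease = [6.2, 6.0, 6.5, 6.1]
print(round(t_statistic(disease, normal), 2))  # large t: strong difference
```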

39

40 t-test to determine statistical significance, disease vs. normal: t statistic = (difference between the means of disease and normal) / (variation due to error)

41 ANOVA partitions total data variability. Before partitioning: disease vs. normal + error. After partitioning: disease vs. normal + subject + tissue type + error. F ratio = (variation between DS and normal) / (variation due to error)

42 Inferential statistics
Paradigm                      Parametric test   Nonparametric test
Compare two unpaired groups   Unpaired t-test   Mann-Whitney test
Compare two paired groups     Paired t-test     Wilcoxon test
Compare 3 or more groups      ANOVA
Table 7-2 Page

43 Inferential statistics Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. But you might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated; then you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction: the level for statistical significance is divided by the number of measurements, e.g. the criterion becomes p < (0.05)/10,000, or p < 5 x 10^-6. The Bonferroni correction is generally considered to be too conservative. Page 199
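The Bonferroni arithmetic above is a one-liner (the numbers are those from the text):

```python
# Bonferroni correction sketch: divide the significance level by the
# number of tests performed.
alpha = 0.05
n_genes = 10_000
bonferroni_threshold = alpha / n_genes
print(bonferroni_threshold)  # i.e. p < 5 x 10^-6
```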

44 Inferential statistics: false discovery rate The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery. The expected number of false discoveries equals the p value of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). You can adjust the false discovery rate. For example:
FDR     # regulated transcripts   # false discoveries
0.10    100                       10
0.05    20                        1
Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
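The expected-false-positive bookkeeping from this slide can be sketched as follows (`expected_false_positives` is an illustrative helper name, not from the book):

```python
# Expected false discoveries = p-value threshold x number of genes tested.
def expected_false_positives(p_threshold, n_genes):
    return p_threshold * n_genes

n_genes = 10_000
print(expected_false_positives(0.01, n_genes))  # 100 expected false positives
```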

45 Significance analysis of microarrays (SAM) SAM -- an Excel plug-in (URL: page 202) -- modified t-test -- adjustable false discovery rate Page 200

46 Fig. 7.7 Page 202

47 SAM plot of observed versus expected scores: genes far above the diagonal are upregulated, far below are downregulated. Fig. 7.7 Page 202

48 Outline: microarray data analysis Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)

49 Descriptive statistics Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are: -- Euclidean distance -- Pearson coefficient of correlation Page 203

50 Similarity, dissimilarity, and distance Reference for the following slides on clustering: G. Dunn and B.S. Everitt, An Introduction to Mathematical Taxonomy. Mineola, New York: Dover, B.S. Everitt, S. Landau, M. Leese, Cluster Analysis. 4 th edition. London: Arnold, 2001.

51 What is a cluster? A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.

52 Similarity Similarity (s) refers to relatedness. We will refer to the character states (e.g. amino acid residues or gene expression values) of the operational taxonomic units (OTUs, e.g. globins from a particular species). For OTUs i and j, consider two character states: 1 (presence) and 0 (absence).
              OTU i
               1      0
OTU j    1     a      b      a+b
         0     c      d      c+d
               a+c    b+d    p = a+b+c+d

53 Here p is the total number of characters studied, a is the number of characters where both OTUs have value 1, etc. The number of matches is a + d. According to the simple matching coefficient, s_ij = (a + d) / p. That is, the similarity measure s_ij is the ratio of the total number of matches to the total number of characters. The range of scores is 0 (no matches) to 1 (the OTUs match on every character).

54 Similarity: Jaccard's coefficient Jaccard (1908) excluded the number of negative matches, defining the coefficient s_ij = a / (a + b + c). This is the ratio of positive matches to the total number of characters minus the number of negative matches. Why might it be useful to ignore category d? Scoring the shared absence of wings when comparing camels and worms would be absurd (see Sokal and Sneath, 1963).
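The two coefficients can be compared on toy presence/absence vectors (a minimal sketch; the OTU character vectors are hypothetical):

```python
# Simple matching vs. Jaccard coefficient for two binary character vectors.
def match_counts(u, v):
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # both present
    b = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)  # both absent
    return a, b, c, d

def simple_matching(u, v):
    a, b, c, d = match_counts(u, v)
    return (a + d) / (a + b + c + d)   # all matches / total characters

def jaccard(u, v):
    a, b, c, d = match_counts(u, v)
    return a / (a + b + c)             # negative matches (d) excluded

otu_i = [1, 1, 0, 0, 1]
otu_j = [1, 0, 0, 0, 1]
print(simple_matching(otu_i, otu_j))  # 0.8 (4 of 5 characters match)
print(jaccard(otu_i, otu_j))          # 2/3: shared absences are ignored
```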

55 Example of similarity matrices Consider the hypothetical case of four species (OTUs) that have either absence (character 0) or presence (character 1) of certain globin genes. The simple matching coefficient matrix and the Jaccard similarity matrix computed from these characters are both symmetrical, because distance d_ij = d_ji.

56 Example of similarity matrices Conclusion: different similarity coefficients highlight different aspects of the relationships between OTUs, and may produce very different classifications.

57 Similarity, dissimilarity, and distance Similarity (s) refers to relatedness. Dissimilarity (d) may be defined simply as 1 − s. However, some dissimilarity measures have unique, useful properties. Let d_ij = the dissimilarity between OTUs i and j. Symmetry: d_ij = d_ji ≥ 0. Distinguishability of non-identicals: if d_ij ≠ 0 then i ≠ j. Indistinguishability of identicals: for identical OTUs i and j, d_ii = 0. Triangular inequality (or metric inequality): given three OTUs i, j, and k, d_ik ≤ d_ij + d_jk. Dissimilarity coefficients satisfying these properties are also called distances rather than dissimilarities.

58 Distance metrics: Euclidean For OTUs i and j measured on characters a and b, the Euclidean distance d_ij is: d_ij = [(x_ja − x_ia)^2 + (x_jb − x_ib)^2]^1/2. This is the Pythagorean theorem: c^2 = a^2 + b^2, so c = [a^2 + b^2]^1/2.
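The Euclidean distance formula generalizes directly to p characters (a minimal sketch; the expression profiles are hypothetical):

```python
import math

# Euclidean distance between two profiles, generalized from the
# two-character formula to any number of characters.
def euclidean(x, y):
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(x, y)))

gene_i = [3.0, 4.0]
gene_j = [0.0, 0.0]
print(euclidean(gene_i, gene_j))  # 5.0, the 3-4-5 right triangle
```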

59 The importance of using standardized scales Consider four individuals A-D, with age (years) and height recorded both in feet and in centimeters, together with the mean and S.D. of each variable. This example is adapted from L. Kaufman and P.J. Rousseeuw, Finding Groups in Data (Wiley, 1990).

60 The importance of using standardized scales Plots of height (feet) vs. age, height (cm) vs. age, and standardized height vs. standardized age for individuals A-D. Standardized age and height: z_if = (x_if − m_f) / s_f, where z_if = z score (unitless), x_if = each measurement, m_f = mean value of variable f, and s_f = mean absolute deviation (related to the standard deviation).
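The standardization above can be sketched as follows (the example values are hypothetical; note that the slide's s_f is the mean absolute deviation, not the standard deviation):

```python
# Standardization following the slide: z = (x - m) / s, where s is the
# mean absolute deviation.
def standardize(values):
    m = sum(values) / len(values)                      # mean
    s = sum(abs(x - m) for x in values) / len(values)  # mean absolute deviation
    return [(x - m) / s for x in values]

ages = [20, 30, 40, 50]
heights_cm = [180, 165, 170, 175]
print(standardize(ages))  # [-1.5, -0.5, 0.5, 1.5]
print(standardize(heights_cm))
# Both variables are now unitless and on comparable scales, so neither
# dominates a Euclidean distance.
```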

61 Data matrix (20 genes and 3 time points from Chu et al.) Fig. 7.8 Page 205

62 t=2.0 t=0.5 t=0 3D plot (using S-PLUS software) Fig. 7.8 Page 205

63 Descriptive statistics: clustering Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 204

64 Agglomerative clustering a b a,b c d e Adapted from Kaufman and Rousseeuw (1990) Fig. 7.9 Page 206

65 Agglomerative clustering a b a,b c d e d,e Fig. 7.9 Page 206

66 Agglomerative clustering a b a,b c d e d,e c,d,e Fig. 7.9 Page 206

67 Agglomerative clustering a b c d e a,b d,e c,d,e a,b,c,d,e tree is constructed Fig. 7.9 Page 206

68 Divisive clustering a,b,c,d,e Fig. 7.9 Page 206

69 Divisive clustering c,d,e a,b,c,d,e Fig. 7.9 Page 206

70 Divisive clustering c,d,e a,b,c,d,e d,e Fig. 7.9 Page 206

71 Divisive clustering a,b a,b,c,d,e c,d,e d,e Fig. 7.9 Page 206

72 a b c d e a,b d,e Divisive clustering c,d,e a,b,c,d,e tree is constructed Fig. 7.9 Page 206

73 agglomerative a b a,b a,b,c,d,e c d e d,e c,d,e Adapted from Kaufman and Rousseeuw (1990) divisive Fig. 7.9 Page 206

74 Fig. 7.8 Page 205

75 Fig Page 207

76 Agglomerative and divisive clustering sometimes give conflicting results, as shown here Fig Page 207

77 Cluster analysis Cluster analysis allows an explicit separation of data points into groups. Hierarchical clustering may be divided into agglomerative and divisive methods. Agglomerative hierarchical techniques begin with a proximity matrix to fuse n OTUs into groups. The two OTUs with the highest similarity (or smallest distance) are grouped.

78 Cluster analysis: single-linkage clustering There are several methods available to calculate the proximity between a single OTU and a group containing several OTUs (or to calculate the proximity between two groups). In single-linkage clustering, the similarity between two clusters of OTUs is that of their most similar pair.

79 Cluster analysis: single-linkage clustering Consider this dissimilarity matrix for five OTUs (Dunn and Everitt p. 78).

80 Cluster analysis: single-linkage clustering Consider this dissimilarity matrix, D_1, for five OTUs (Dunn and Everitt p. 78). [1] Identify the smallest entry. It is 2 (for OTUs 1 and 2). [2] Join OTUs 1 and 2 as a two-membered cluster. [3] Calculate dissimilarities between this new cluster and the remaining three OTUs (3, 4 and 5), as follows.

81 Cluster analysis: single-linkage clustering [3] continued: d_(12)3 = min{d_13, d_23} = d_23 = 5; d_(12)4 = min{d_14, d_24} = d_24 = 9; d_(12)5 = min{d_15, d_25} = d_25 = 8. [4] Create a new matrix, D_2, from these values.

82 Cluster analysis: single-linkage clustering [5] Repeat steps 1-4 on D_2. Here, identify the smallest distance (it is 3, for OTUs 4 and 5). [6] Create a new two-member cluster (45). [7] Calculate dissimilarities: d_(12)3 = min{d_13, d_23} = d_23 = 5 (calculated previously); d_(12)(45) = min{d_14, d_15, d_24, d_25} = d_25 = 8; d_(45)3 = min{d_34, d_35} = d_34 = 4. [8] Create matrix D_3.

83 Cluster analysis: single-linkage clustering [9] Repeat steps 1-4 on D_3. Here, identify the smallest distance (it is 4, between OTU 3 and cluster (45)). [10] Create a new three-member cluster (345). [11] Calculate dissimilarities: d_(12)(345) = min{d_13, d_14, d_15, d_23, d_24, d_25} = d_23 = 5. In this way, the relationships (distances) of all the members of the original dissimilarity matrix have been described in groups that form part of a single cluster. This cluster is commonly represented graphically as a dendrogram (tree). Note that the clusters were formed in successive levels in an ordered, hierarchical fashion.

84 Cluster analysis: complete-linkage clustering This approach differs from single-linkage clustering because it seeks the most dissimilar pair from each group. Thus in step [3] above, we previously obtained the minimum distance: d_(12)3 = min{d_13, d_23} = d_23 = 5; d_(12)4 = min{d_14, d_24} = d_24 = 9; d_(12)5 = min{d_15, d_25} = d_25 = 8. But in complete-linkage clustering we obtain the maximum distance: d_(12)3 = max{d_13, d_23} = d_13 = 6; d_(12)4 = max{d_14, d_24} = d_14 = 10; d_(12)5 = max{d_15, d_25} = d_15 = 9.
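Single- and complete-linkage agglomeration differ only in whether min or max scores a cluster pair. A naive sketch (the dissimilarity values are inferred from the worked steps above; d_35 is not stated in the text and is assumed to be 5 here):

```python
# Naive agglomerative clustering over a pairwise dissimilarity table.
# linkage = min gives single linkage; linkage = max gives complete
# linkage. O(n^3) overall, fine for small examples.

def agglomerate(dist, linkage):
    """dist maps frozenset({i, j}) -> dissimilarity of items i and j.
    Returns the merge order as a list of sorted merged clusters."""
    clusters = {frozenset([i]) for i in set().union(*dist)}
    merges = []
    while len(clusters) > 1:
        # choose the pair of clusters with the smallest linkage distance
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x != y),
            key=lambda pair: linkage(
                dist[frozenset([i, j])] for i in pair[0] for j in pair[1]
            ),
        )
        clusters = (clusters - {a, b}) | {a | b}
        merges.append(sorted(a | b))
    return merges

# dissimilarities reconstructed from the worked example (d_35 assumed)
d = {frozenset(p): v for p, v in {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5, (4, 5): 3,
}.items()}
print(agglomerate(d, min))  # single linkage: (12), (45), (345), then all
print(agglomerate(d, max))  # complete linkage: same merge order on this matrix
```

With min, the merges reproduce the order of the worked example: (12), then (45), then (345), then the full five-member cluster.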

85 Cluster analysis: complete-linkage clustering There are many other criteria for defining clusters: single-linkage, complete-linkage, group-average, and centroid. They do not always give equivalent clustering patterns. For example, single-linkage clustering may be susceptible to chaining of closely related OTUs, obscuring reasonable cluster structure (figure: single linkage, complete linkage, centroid linkage, and chaining in single linkage).

86 Cluster and TreeView Fig Page 208

87 Cluster and TreeView clustering K means SOM PCA Fig Page 208

88 Cluster and TreeView Fig Page 208

89 Cluster and TreeView Fig Page 208

90 Page 208

91 Fig Page 208

92 Fig Page 208

93 Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000) Fig Page 209

94 Principal components analysis (PCA) An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D For a matrix of m genes x n samples, create a new covariance matrix of size n x n Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs). Page 211

95 Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines: samples are plotted on principal component axis #1 (87% of variance) versus axis #2 (10%). Legend: lead (P), sodium (N), control (C). Page 211

96 Principal components analysis (PCA): objectives: to reduce dimensionality; to determine the linear combination of variables; to choose the most useful variables (features); to visualize multidimensional data; to identify groups of objects (e.g. genes/samples); to identify outliers. Page 211

97 Page 212

98 Page 212

99 Page 212

100 Page 212

101 Why principal components analysis? For Euclidean distance measures, p characters are measured on each OTU (for example, 10,000 gene expression values are measured on each sample of a microarray experiment). In using a Euclidean distance metric, it may be assumed that each OTU may be represented by p orthogonal axes. But what if some of the p measurements are correlated? It may be helpful to explore the dataset further, trying to find uncorrelated composite measures. There may be far fewer than p composite measures. This approach is relevant whether trying to cluster observed characters (e.g. genes) or OTUs (e.g. samples).

102 Principal components analysis (PCA) No distinction is made between dependent and independent variables The purpose is to explore the structure of the relationships among the variables to determine whether a smaller number of underlying factors can account for the observed data to reduce the number of variables Sources: [1] G. Dunn and B.S. Everitt, An Introduction to Mathematical Taxonomy (2004); [2] G.R. Norman and D.L. Streiner, Biostatistics: The Bare Essentials, 2 nd edition (2000).

103 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having properties discussed below. From a data matrix of n genes (e.g. 20,000) by p samples (e.g. 8), a p x p correlation matrix (with 1.0 on the diagonal) is computed.

104 Principal components analysis (PCA) Observed set of characters (e.g. samples, or gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following four properties.

105 Principal components analysis (PCA) Observed set of characters (e.g. samples, or gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p. [1] Each y is a linear combination of the xs:
y_1 = a_11 x_1 + a_12 x_2 + ... + a_1p x_p
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2p x_p
...
y_p = a_p1 x_1 + a_p2 x_2 + ... + a_pp x_p
Each y is called a factor; a series of linear combinations of variables defines each factor. If there are 8 original variables (e.g. samples), we will extract 8 factors and have 8 equations.

106 Principal components analysis (PCA) [1] continued. The x terms in the equations above are the original variables (e.g. x_1 through x_8 for 8 samples).

107 Principal components analysis (PCA) [1] continued. The a terms are weights, with two subscripts: the first shows which y term (factor) the weight is associated with, and the second shows which variable (e.g. sample) it is associated with.

108 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following properties: [2] For each of the coefficients a defining each linear transformation, the sum of their squares is unity: Σ_{j=1..p} a_ij^2 = 1 for i = 1, 2, ..., p. This step is helpful because once the vectors of these coefficients are normalized (scaled) to 1, the latent roots of the matrix (λ_1, λ_2, ..., λ_p) will be interpretable as the sampling variances of y_1, y_2, ..., y_p.

109 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following properties: [3] For all such transformations, y_1 has the greatest variance. We have set up 8 factors (y_1 to y_p) from 8 original variables; the usefulness of this procedure is that the new factors summarize the data in a ranked fashion, from the factor explaining the most variance to the least.

110 Principal components analysis (PCA) Observed set of characters (e.g. gene expression values) x_1, x_2, ..., x_p are transformed to a new set y_1, y_2, ..., y_p having the following properties: [4] Of all the transformations uncorrelated with y_1, y_2 has the greatest variance. Thus PCA produces a set of p composite characters that are uncorrelated and ranked from the most to the least variance. Uncorrelated factors or latent variables are also termed orthogonal.
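For p = 2 the principal component variances can be computed by hand from the 2x2 covariance matrix (a minimal sketch with hypothetical data; `pca_2d_variances` is an illustrative helper, not from the book):

```python
import math

# PCA sketch for p = 2 variables: build the 2x2 covariance matrix and
# solve its eigenvalues analytically. The eigenvalues are the variances
# of the two uncorrelated principal components, ranked largest first.
def pca_2d_variances(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of [[sxx, sxy], [sxy, syy]] via trace and determinant
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(tr ** 2 / 4 - det)
    return tr / 2 + root, tr / 2 - root  # (largest, smallest)

# two strongly correlated hypothetical variables
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
lam1, lam2 = pca_2d_variances(xs, ys)
print(round(lam1 / (lam1 + lam2), 3))  # ~0.995: PC1 captures nearly all variance
```

Because the variables are highly correlated, one composite measure (PC1) carries almost all the information, which is the point of the dimensionality reduction.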

111 PCA: Scree plot A scree plot displays eigenvalues (y-axis) versus factors (x-axis). It is used to estimate the number of factors that usefully capture the variance in the data: look for where the plot levels off and exclude the factors beyond that point. Typically, we hope to find as few factors (latent variables) as possible that explain the original data.

112 PCA: geometric interpretation Consider 8 samples (4 control, 4 diseased) for which 40,000 gene expression measurements are obtained. One can do PCA on the samples (and obtain a plot with eight data points) or on the genes (upon transposing the data matrix, obtain a plot with 40,000 data points). In a PCA plot, two points that are close together in PCA space have a similar overall behavior in the original data matrix. For a measurement of p variables (e.g. p = 8), the first principal component is defined by a straight line that traverses the data points in p-dimensional (here 8-dimensional) space.

113 PCA: geometric interpretation Next, the first and second principal components together define a plane that best fits the data points in p-dimensional space; the first three PCs display the objects in three-dimensional space. Note that the units of the PC axes are the percentage of the variance captured along each axis; the first axis always has the largest percentage. If a large proportion of the variance (e.g. >>50%) is explained by the first two or three principal component axes, then the PCA plot has effectively visualized the relations of the data points while reducing the dimensionality from p to one, two, or three.

114 Fig Page 212

115 Fig Page 212

116 Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain Chr 21

117 Illustration of PCA and clustering The following examples use Partek software. Create a spreadsheet consisting of ten proteins (columns). The rows show the number of times each of the 20 amino acids occurs in a protein.

118 Four small human proteins: alpha globin (NP_000549) beta globin (NP_000509) myoglobin (NP_976312) RBP4 (NP_006735)

119 Three huge human proteins: nesprin 1 (8797 amino acids) nebulin (6669 amino acids) mucin 16 (14507 amino acids)

120 Three trans-membrane proteins: ABCD1 (NP_000024) serotonin receptor (NP_000515) acetylcholine transporter (NP_003046)

121 PCA shows that along the first principal component (PC) axis, 60% of the variance is captured, and there are no dramatic groupings of amino acids (plot: PC #1, 60%; PC #2, 13%; labeled points include leu and lys).

122 Complementary to PCA, the relationships of the 20 amino acids can also be visualized by hierarchical clustering

123 The data matrix can be transposed (rows and columns are swapped): columns = 20 amino acids; rows = ten proteins. PCA shows that the three large proteins are different from the other seven proteins. Mucin 16 (>14,000 amino acids) is separable from the other two large proteins on the second PC axis (plot: PC #1, 84.7%; PC #2, 11.3%; labeled points: mucin 16, nebulin, nesprin 1, and the seven other proteins).

124 Hierarchical clustering of ten proteins. Distance metric: Euclidean large proteins small proteins transmembrane proteins

125 Hierarchical clustering of ten proteins. Distance metric: Pearson s dissimilarity large proteins small proteins transmembrane proteins

126 Hierarchical clustering of ten proteins. Distance metric: Rank (Spearman) dissimilarity large proteins small proteins transmembrane proteins

127 Hierarchical clustering of ten proteins. Distance metric: Tanimoto large proteins small proteins transmembrane proteins

128 Hierarchical clustering of ten proteins. Distance metric: Cosine dissimilarity large proteins small proteins transmembrane proteins

129 K-means (partitioning) clustering: select a number of clusters at the outset. Four clusters are better than 1, 2, 3, or 5 by the Davies-Bouldin criterion; the four clusters can then be displayed in PCA space.

130 Multidimensional scaling Euclidean distance function 10 rows = proteins 20 columns = amino acids mucin 16 nebulin 1 nesprin 1

131 Multidimensional scaling Pearson s dissimilarity distance function 10 rows = proteins 20 columns = amino acids

132 Friday s computer lab Practice downloading a dataset (e.g. Chu et al. 1998) from Try making scatter plots in Excel Try doing t-tests and ratio calculations in Excel Try GEO at NCBI for microarray data and visualization


DIMENSION REDUCTION AND CLUSTER ANALYSIS

DIMENSION REDUCTION AND CLUSTER ANALYSIS DIMENSION REDUCTION AND CLUSTER ANALYSIS EECS 833, 6 March 2006 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and resources available at http://people.ku.edu/~gbohling/eecs833

More information

Principal component analysis (PCA) for clustering gene expression data

Principal component analysis (PCA) for clustering gene expression data Principal component analysis (PCA) for clustering gene expression data Ka Yee Yeung Walter L. Ruzzo Bioinformatics, v17 #9 (2001) pp 763-774 1 Outline of talk Background and motivation Design of our empirical

More information

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Low-Level Analysis of High- Density Oligonucleotide Microarray Data Low-Level Analysis of High- Density Oligonucleotide Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley UC Berkeley Feb 23, 2004 Outline

More information

Session 06 (A): Microarray Basic Data Analysis

Session 06 (A): Microarray Basic Data Analysis 1 SJTU-Bioinformatics Summer School 2017 Session 06 (A): Microarray Basic Data Analysis Maoying,Wu ricket.woo@gmail.com Dept. of Bioinformatics & Biostatistics Shanghai Jiao Tong University Summer, 2017

More information

Chapter 5: Microarray Techniques

Chapter 5: Microarray Techniques Chapter 5: Microarray Techniques 5.2 Analysis of Microarray Data Prof. Yechiam Yemini (YY) Computer Science Department Columbia University Normalization Clustering Overview 2 1 Processing Microarray Data

More information

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests PSY 307 Statistics for the Behavioral Sciences Chapter 20 Tests for Ranked Data, Choosing Statistical Tests What To Do with Non-normal Distributions Tranformations (pg 382): The shape of the distribution

More information

Principal Components Analysis. Sargur Srihari University at Buffalo

Principal Components Analysis. Sargur Srihari University at Buffalo Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,

More information

cdna Microarray Analysis

cdna Microarray Analysis cdna Microarray Analysis with BioConductor packages Nolwenn Le Meur Copyright 2007 Outline Data acquisition Pre-processing Quality assessment Pre-processing background correction normalization summarization

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different

More information

Factor analysis. George Balabanis

Factor analysis. George Balabanis Factor analysis George Balabanis Key Concepts and Terms Deviation. A deviation is a value minus its mean: x - mean x Variance is a measure of how spread out a distribution is. It is computed as the average

More information

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo Clustering Lecture 1: Basics Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics Clustering

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

Similarity and Dissimilarity

Similarity and Dissimilarity 1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.

More information

Statistical Methods for Analysis of Genetic Data

Statistical Methods for Analysis of Genetic Data Statistical Methods for Analysis of Genetic Data Christopher R. Cabanski A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements

More information

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Gene expression profiling A quick review Which molecular processes/functions

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Principal Component Analysis (PCA) Theory, Practice, and Examples

Principal Component Analysis (PCA) Theory, Practice, and Examples Principal Component Analysis (PCA) Theory, Practice, and Examples Data Reduction summarization of data with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables. p k n A

More information

Sleep data, two drugs Ch13.xls

Sleep data, two drugs Ch13.xls Model Based Statistics in Biology. Part IV. The General Linear Mixed Model.. Chapter 13.3 Fixed*Random Effects (Paired t-test) ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world

More information

Principal Component Analysis, A Powerful Scoring Technique

Principal Component Analysis, A Powerful Scoring Technique Principal Component Analysis, A Powerful Scoring Technique George C. J. Fernandez, University of Nevada - Reno, Reno NV 89557 ABSTRACT Data mining is a collection of analytical techniques to uncover new

More information

Nemours Biomedical Research Biostatistics Core Statistics Course Session 4. Li Xie March 4, 2015

Nemours Biomedical Research Biostatistics Core Statistics Course Session 4. Li Xie March 4, 2015 Nemours Biomedical Research Biostatistics Core Statistics Course Session 4 Li Xie March 4, 2015 Outline Recap: Pairwise analysis with example of twosample unpaired t-test Today: More on t-tests; Introduction

More information

Supplementary Information

Supplementary Information Supplementary Information For the article"comparable system-level organization of Archaea and ukaryotes" by J. Podani, Z. N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and. Szathmáry (reference numbers

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

University of Florida CISE department Gator Engineering. Clustering Part 1

University of Florida CISE department Gator Engineering. Clustering Part 1 Clustering Part 1 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville What is Cluster Analysis? Finding groups of objects such that the objects

More information

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Gene expression profiling A quick review Which molecular processes/functions

More information

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001).

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference:

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Chapter 2 1 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser University 2011 Han, Kamber, and Pei. All rights reserved.

More information

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS NOTES FROM PRE- LECTURE RECORDING ON PCA PCA and EFA have similar goals. They are substantially different in important ways. The goal

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Lecture 5: November 19, Minimizing the maximum intracluster distance

Lecture 5: November 19, Minimizing the maximum intracluster distance Analysis of DNA Chips and Gene Networks Spring Semester, 2009 Lecture 5: November 19, 2009 Lecturer: Ron Shamir Scribe: Renana Meller 5.1 Minimizing the maximum intracluster distance 5.1.1 Introduction

More information

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop Inferential Statistical Analysis of Microarray Experiments 007 Arizona Microarray Workshop μ!! Robert J Tempelman Department of Animal Science tempelma@msuedu HYPOTHESIS TESTING (as if there was only one

More information

Quantitative Understanding in Biology Principal Components Analysis

Quantitative Understanding in Biology Principal Components Analysis Quantitative Understanding in Biology Principal Components Analysis Introduction Throughout this course we have seen examples of complex mathematical phenomena being represented as linear combinations

More information

Non-parametric methods

Non-parametric methods Eastern Mediterranean University Faculty of Medicine Biostatistics course Non-parametric methods March 4&7, 2016 Instructor: Dr. Nimet İlke Akçay (ilke.cetin@emu.edu.tr) Learning Objectives 1. Distinguish

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Simplifying Drug Discovery with JMP

Simplifying Drug Discovery with JMP Simplifying Drug Discovery with JMP John A. Wass, Ph.D. Quantum Cat Consultants, Lake Forest, IL Cele Abad-Zapatero, Ph.D. Adjunct Professor, Center for Pharmaceutical Biotechnology, University of Illinois

More information

Course Review. Kin 304W Week 14: April 9, 2013

Course Review. Kin 304W Week 14: April 9, 2013 Course Review Kin 304W Week 14: April 9, 2013 1 Today s Outline Format of Kin 304W Final Exam Course Review Hand back marked Project Part II 2 Kin 304W Final Exam Saturday, Thursday, April 18, 3:30-6:30

More information

Clustering. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 373 Genomic Informatics Elhanan Borenstein Some slides adapted from Jacques van Helden The clustering problem The goal of gene clustering process is to partition the genes into distinct

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Statistics Toolbox 6. Apply statistical algorithms and probability models

Statistics Toolbox 6. Apply statistical algorithms and probability models Statistics Toolbox 6 Apply statistical algorithms and probability models Statistics Toolbox provides engineers, scientists, researchers, financial analysts, and statisticians with a comprehensive set of

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley University of Minnesota Mar 30, 2004 Outline

More information

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p.

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. Preface p. xi Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. 6 The Scientific Method and the Design of

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

Statistics: revision

Statistics: revision NST 1B Experimental Psychology Statistics practical 5 Statistics: revision Rudolf Cardinal & Mike Aitken 29 / 30 April 2004 Department of Experimental Psychology University of Cambridge Handouts: Answers

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden Clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Some slides adapted from Jacques van Helden Small vs. large parsimony A quick review Fitch s algorithm:

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Principal Component Analysis & Factor Analysis. Psych 818 DeShon

Principal Component Analysis & Factor Analysis. Psych 818 DeShon Principal Component Analysis & Factor Analysis Psych 818 DeShon Purpose Both are used to reduce the dimensionality of correlated measurements Can be used in a purely exploratory fashion to investigate

More information

sphericity, 5-29, 5-32 residuals, 7-1 spread and level, 2-17 t test, 1-13 transformations, 2-15 violations, 1-19

sphericity, 5-29, 5-32 residuals, 7-1 spread and level, 2-17 t test, 1-13 transformations, 2-15 violations, 1-19 additive tree structure, 10-28 ADDTREE, 10-51, 10-53 EXTREE, 10-31 four point condition, 10-29 ADDTREE, 10-28, 10-51, 10-53 adjusted R 2, 8-7 ALSCAL, 10-49 ANCOVA, 9-1 assumptions, 9-5 example, 9-7 MANOVA

More information

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric Assumptions The observations must be independent. Dependent variable should be continuous

More information

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In

More information

Fuzzy Clustering of Gene Expression Data

Fuzzy Clustering of Gene Expression Data Fuzzy Clustering of Gene Data Matthias E. Futschik and Nikola K. Kasabov Department of Information Science, University of Otago P.O. Box 56, Dunedin, New Zealand email: mfutschik@infoscience.otago.ac.nz,

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the Character Correlation Zhongyi Xiao Correlation In probability theory and statistics, correlation indicates the strength and direction of a linear relationship between two random variables. In general statistical

More information

Introduction to Principal Component Analysis (PCA)

Introduction to Principal Component Analysis (PCA) Introduction to Principal Component Analysis (PCA) NESAC/BIO NESAC/BIO Daniel J. Graham PhD University of Washington NESAC/BIO MVSA Website 2010 Multivariate Analysis Multivariate analysis (MVA) methods

More information

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST TIAN ZHENG, SHAW-HWA LO DEPARTMENT OF STATISTICS, COLUMBIA UNIVERSITY Abstract. In

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1 2 3 -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1950's. -PCA is based on covariance or correlation

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Statistical Inference

Statistical Inference Statistical Inference Jean Daunizeau Wellcome rust Centre for Neuroimaging University College London SPM Course Edinburgh, April 2010 Image time-series Spatial filter Design matrix Statistical Parametric

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu May 2, 2017 Announcements Homework 2 due later today Due May 3 rd (11:59pm) Course project

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Principal Components Analysis (PCA)

Principal Components Analysis (PCA) Principal Components Analysis (PCA) Principal Components Analysis (PCA) a technique for finding patterns in data of high dimension Outline:. Eigenvectors and eigenvalues. PCA: a) Getting the data b) Centering

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Nonparametric Statistics

Nonparametric Statistics Nonparametric Statistics Nonparametric or Distribution-free statistics: used when data are ordinal (i.e., rankings) used when ratio/interval data are not normally distributed (data are converted to ranks)

More information

Dimensionality Reduction Techniques (DRT)

Dimensionality Reduction Techniques (DRT) Dimensionality Reduction Techniques (DRT) Introduction: Sometimes we have lot of variables in the data for analysis which create multidimensional matrix. To simplify calculation and to get appropriate,

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley Memorial Sloan-Kettering Cancer Center July

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Tools and topics for microarray analysis

Tools and topics for microarray analysis Tools and topics for microarray analysis USSES Conference, Blowing Rock, North Carolina, June, 2005 Jason A. Osborne, osborne@stat.ncsu.edu Department of Statistics, North Carolina State University 1 Outline

More information

Background to Statistics

Background to Statistics FACT SHEET Background to Statistics Introduction Statistics include a broad range of methods for manipulating, presenting and interpreting data. Professional scientists of all kinds need to be proficient

More information

Xiaosi Zhang. A thesis submitted to the graduate faculty. in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Xiaosi Zhang. A thesis submitted to the graduate faculty. in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE GENE EXPRESSION PATTERN ANALYSIS Xiaosi Zhang A thesis submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Major: Bioinformatics and Computational

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information