Homework 4, 36-350: Data Mining SOLUTIONS

I loaded the data and transposed it thus:

library(ElemStatLearn)
data(nci)
nci.t = t(nci)

This means that rownames(nci.t) contains the cell classes, or equivalently colnames(nci). I'll use that a lot, so I'll abbreviate it: nci.classes = rownames(nci.t).

1. Similarity searching for cells

(a) How many features are there in the raw data? Are they discrete or continuous? (If there are some of each, which is which?) If some are continuous, are they necessarily positive?

Answer: This depends on whether the cell classes are counted as features or not; you could defend either answer. If you don't count the cell class, there are 6830 continuous features. Since they are logarithms, they do not have to be positive, and indeed they are not. If you do count the class, there are 6831 features, one of which is discrete (the class). I am inclined not to count the class as really being a feature, since when we get new data it may not have the label.

(b) Describe a method for finding the k cells whose expression profiles are most similar to that of a given cell. Carefully explain what you mean by "similar". How could you evaluate the success of the search algorithm?

Answer: The simplest thing is to call two cells similar if their vectors of expression levels have a small Euclidean distance. The most similar cells are the ones whose expression-profile vectors are closest, geometrically, to the target expression-profile vector.
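As a concrete illustration of that idea (my own sketch, not part of the original solutions; the function name nearest.cells and the default k are mine), assuming nci.t as loaded above:

# Find the k cells whose expression profiles are closest, in Euclidean
# distance, to that of a chosen target cell
# Inputs: row index of the target cell, data matrix (rows = cells), k
# Output: row indices of the k nearest other cells
nearest.cells <- function(target, x = nci.t, k = 5) {
  # Euclidean distance from the target's row to every row of x
  target.row = matrix(x[target, ], nrow = nrow(x), ncol = ncol(x), byrow = TRUE)
  distances = sqrt(rowSums((x - target.row)^2))
  distances[target] = Inf  # never return the target itself
  return(order(distances)[1:k])
}

Looking at nci.classes[nearest.cells(1)] would then show whether the closest matches share the target's class, which is the precision idea used below.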

There are several ways the search could be evaluated. One would be to use the given cell labels as an indication of whether the search results are relevant. That is, one would look at the k closest matches and see (a) what fraction of them are of the same type as the target, which is the precision, and (b) what fraction of cells of that type are in the search results (the recall). Plotting the precision and recall against k gives an indication of how effective the search is. This could then be averaged over all the labeled cells.

Normalizing the vectors by their Euclidean length is possible here, though a little tricky, because the features are the logarithms of the concentration levels, so some of them can be negative numbers, meaning that very little of that gene is being expressed. Also, the reason for normalizing bag-of-words vectors by length was the idea that two documents might have similar meanings or subjects even if they had different sizes. A cell which is expressing all the genes at a low level, however, might well be very different from one which is expressing the same genes at a high level.

(c) Would it make sense to weight genes by inverse cell frequency? Explain.

Answer: No; or at least, I don't see how. Every gene is expressed to some degree by every cell (otherwise the logarithm of its concentration would be negative infinity), so every gene would have an inverse cell frequency of zero. You could fix a threshold and just count whether the expression level is over the threshold, but this would be pretty arbitrary.

However, a related idea would be to scale genes' expression levels by something which indicates how much they vary across cells, such as the range or the variance. Since we have labeled cell types here, we could even compute the average expression level for each type and then scale genes by the variance of these type-averages. That is, genes with a small variance (or range, etc.) should get small weights, since they are presumably uninformative, and genes with a large variance in expression level should get larger weights.
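A minimal sketch of that variance-weighting idea (my own addition, not in the original solutions; the object names are mine), assuming nci.t as above:

# Weight each gene (column) by its variance across cells, so that
# high-variance genes count for more in Euclidean distances
gene.weights = apply(nci.t, 2, var)
nci.weighted = sweep(nci.t, 2, sqrt(gene.weights), "*")

Distances computed on nci.weighted are weighted Euclidean distances on the original features, with weights proportional to the per-gene variances.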

2. k-means clustering

Use the kmeans function in R to cluster the cells, with k = 14 (to match the number of true classes). Repeat this three times to get three different clusterings.

Answer:

clusters.1 = kmeans(nci.t,centers=14)$cluster
clusters.2 = kmeans(nci.t,centers=14)$cluster
clusters.3 = kmeans(nci.t,centers=14)$cluster

I'm only keeping the cluster assignments, because that's all this problem calls for.

(a) Say that k-means makes a lumping error whenever it assigns two cells of different classes to the same cluster, and a splitting error when it puts two cells of the same class in different clusters. Each pair of cells can give an error. (With n objects, there are n(n-1)/2 distinct pairs.)

i. Write a function which takes as inputs a vector of classes and a vector of clusters, and gives as output the number of lumping errors. Test your function by verifying that when the classes are (1, 2, 2), the three clusterings (1, 2, 2), (2, 1, 1) and (4, 5, 6) all give zero lumping errors, but the clustering (1, 1, 1) gives two lumping errors.

Answer: The most brute-force way (Code Example 1) uses nested for loops. This works, but it's not the nicest approach: (1) for loops, let alone nested loops, are extra slow and inefficient in R; (2) we only care about pairs which are in the same cluster, but waste time checking all pairs; (3) it's a bit obscure what we're really doing.

Code Example 2 is faster, cleaner, and more R-ish, using the utility function outer, which (rapidly!) applies a function to pairs of objects constructed from its first two arguments, returning a matrix. Here we're seeing which pairs of objects belong to different classes, and which to the same clusters. (The solutions to the first homework include a version of this trick.) We restrict ourselves to pairs in the same cluster and count how many are of different classes. We divide by two because outer produces rectangular matrices and works with ordered pairs, so (i, j) and (j, i) are counted separately, but the problem asks for unordered pairs.

> count.lumping.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.lumping.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.lumping.errors(c(1,2,2),c(4,5,6))
[1] 0
> count.lumping.errors(c(1,2,2),c(1,1,1))
[1] 2

# Count number of errors made by lumping different classes into
# one cluster
# Uses explicit, nested iteration
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to different classes
#   lumped into one cluster
count.lumping.errors.1 <- function(true.class,guessed.cluster) {
  n = length(true.class)
  # Check that the two vectors have the same length!
  stopifnot(n == length(guessed.cluster))
  # Note: don't require them to have same data type
  # Initialize counter
  num.lumping.errors = 0
  # check all distinct pairs
  for (i in 1:(n-1)) { # think of above-diagonal entries in a matrix
    for (j in (i+1):n) {
      if ((true.class[i] != true.class[j])
          & (guessed.cluster[i] == guessed.cluster[j])) {
        # lumping error: really different but put in same cluster
        num.lumping.errors = num.lumping.errors + 1
      }
    }
  }
  return(num.lumping.errors)
}

Code Example 1: The first way to count lumping errors.

# Count number of errors made by lumping different classes into
# one cluster
# Vectorized
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to different classes
#   lumped into one cluster
count.lumping.errors <- function(true.class,guessed.cluster) {
  n = length(true.class)
  stopifnot(n == length(guessed.cluster))
  different.classes = outer(true.class, true.class, "!=")
  same.clusters = outer(guessed.cluster, guessed.cluster, "==")
  num.lumping.errors = sum(different.classes[same.clusters])/2
  return(num.lumping.errors)
}

Code Example 2: Second approach to counting lumping errors.

# Count number of errors made by splitting a single class into
# multiple clusters
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to the same class
#   split into different clusters
count.splitting.errors <- function(true.class,guessed.cluster) {
  n = length(true.class)
  stopifnot(n == length(guessed.cluster))
  same.classes = outer(true.class, true.class, "==")
  different.clusters = outer(guessed.cluster, guessed.cluster, "!=")
  num.splitting.errors = sum(different.clusters[same.classes])/2
  return(num.splitting.errors)
}

Code Example 3: Counting splitting errors.

ii. Write a function, with the same inputs as the previous one, that calculates the number of splitting errors. Test it on the same inputs. (What should the outputs be?)

Answer: A simple modification of the code from the last part (Code Example 3). Before we can test this, we need to know what the test results should be. The classes (1, 2, 2) and clustering (1, 2, 2) are identical, so obviously there will be no splitting errors. Likewise for comparing (1, 2, 2) and (2, 1, 1). Comparing (1, 2, 2) and (4, 5, 6), the pair of the second and third items are in the same class but split into two clusters: one splitting error. On the other hand, comparing (1, 2, 2) and (1, 1, 1), there are no splitting errors, because everything is put in the same cluster.

> count.splitting.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.splitting.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.splitting.errors(c(1,2,2),c(4,5,6))
[1] 1
> count.splitting.errors(c(1,2,2),c(1,1,1))
[1] 0

iii. How many lumping errors does k-means make on each of your three runs? How many splitting errors?

Answer:

> count.lumping.errors(nci.classes,clusters.1)
[1] 79
> count.lumping.errors(nci.classes,clusters.2)
[1] 96
> count.lumping.errors(nci.classes,clusters.3)
[1] 110
> count.splitting.errors(nci.classes,clusters.1)
[1] 97
> count.splitting.errors(nci.classes,clusters.2)
[1] 97
> count.splitting.errors(nci.classes,clusters.3)
[1] 106

For comparison, there are 64 * 63/2 = 2016 distinct pairs, so the error rate here is pretty good.

(b) Are there any classes which seem particularly hard for k-means to pick up?

Answer: Well-argued qualitative answers are fine, but there are ways of being quantitative. One is the entropy of the cluster assignments for the classes. Start with a confusion matrix:

> confusion = table(nci.classes,clusters.1)
> confusion
             clusters.1
nci.classes    1 2 3 4 5 6 7 8 9 10 11 12 13 14
  BREAST       2 0 0 0 2 0 0 0 0  0  0  2  1  0
  CNS          0 0 0 0 3 0 0 0 2  0  0  0  0  0
  COLON        0 0 1 0 0 0 0 6 0  0  0  0  0  0
  K562A-repro  0 1 0 0 0 0 0 0 0  0  0  0  0  0
  K562B-repro  0 1 0 0 0 0 0 0 0  0  0  0  0  0
  LEUKEMIA     0 1 0 0 0 0 0 0 0  0  0  0  0  5
  MCF7A-repro  0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MCF7D-repro  0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MELANOMA     7 0 0 1 0 0 0 0 0  0  0  0  0  0
  NSCLC        0 4 1 0 0 2 0 0 0  1  0  1  0  0
  OVARIAN      0 3 1 0 0 0 0 0 2  0  0  0  0  0
  PROSTATE     0 0 2 0 0 0 0 0 0  0  0  0  0  0
  RENAL        0 0 0 1 1 7 0 0 0  0  0  0  0  0
  UNKNOWN      0 0 0 1 0 0 0 0 0  0  0  0  0  0
> signif(apply(confusion,1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.97        0.59        0.00        0.00        0.65
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
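The entropy function used above comes from the lecture code (lecture-05.r, as noted in the comments to Code Example 4 below) and is not reproduced in the original solutions. A minimal sketch of an entropy function consistent with how it is called here (my reconstruction, not the original lecture code):

# Entropy, in bits, of a discrete distribution given as a vector of counts
# (reconstruction of the lecture code's entropy function; behavior assumed)
entropy <- function(counts) {
  p = counts / sum(counts)   # convert counts to proportions
  p = p[p > 0]               # 0 log 0 = 0, so drop empty cells
  return(-sum(p * log2(p)))
}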

Of course that's just one k-means run. Here are a few more.

> signif(apply(table(nci.classes,clusters.2),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.00        0.59        0.00        0.00        1.50
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
> signif(apply(table(nci.classes,clusters.3),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        1.50        0.59        0.00        0.00        1.90
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        1.10        1.80        0.65        0.00
      RENAL     UNKNOWN
       0.99        0.00

BREAST and NSCLC have consistently high cluster entropy, indicating that cells of these types tend not to be clustered together.

(c) Are there any pairs of cells which are always clustered together, and if so, are they of the same class?

Answer: Cells number 1 and 2 are always clustered together in my three runs, and are both CNS tumors. However, cells 3 and 4 are always clustered together, too, and one is CNS and one is RENAL.

(d) Variation-of-information metric

i. Calculate, by hand, the variation-of-information distance between the partition (1, 2, 2) and (2, 1, 1); between (1, 2, 2) and (2, 2, 1); and between (1, 2, 2) and (5, 8, 11).

Answer: Write X for the cell of the first partition in each pair and Y for the second. In the first case, when X = 1, Y = 2 and vice versa. So H[X|Y] = 0, H[Y|X] = 0, and the distance between the partitions is also zero. (The two partitions are the same; they've just swapped the labels on the cells.)

For the second pair, we have Y = 2 when X = 1, so H[Y|X = 1] = 0, but Y = 1 or 2 equally often when X = 2, so H[Y|X = 2] = 1. Thus H[Y|X] = (0)(1/3) + (1)(2/3) = 2/3. Symmetrically, X = 2 when Y = 1, so H[X|Y = 1] = 0, but X = 1 or 2 equiprobably when Y = 2, so again H[X|Y] = 2/3, and the distance is 2/3 + 2/3 = 4/3.

In the final case, H[X|Y] = 0, because for each Y there is a single value of X, but H[Y|X] = 2/3, as in the previous case. So the distance is 2/3.

ii. Write a function which takes as inputs two vectors of class or cluster assignments, and returns as output the variation-of-information distance for the two partitions. Test the function by checking that it matches your answers in the previous part.

Answer: The most straightforward way is to make the contingency table or confusion matrix for the two assignment vectors, calculate the two conditional entropies from that, and add them. But I already have a function to calculate mutual information, so I'll re-use that via a little algebra:

    I[X;Y] = H[X] - H[X|Y]
    H[X|Y] = H[X] - I[X;Y]
    H[X|Y] + H[Y|X] = H[X] + H[Y] - 2 I[X;Y]

This is implemented in Code Example 4.

# Variation-of-information distance between two partitions
# Input: two vectors of partition-cell assignments of equal length
# Calls: entropy, word.mutual.info from lecture-05.r
# Output: variation of information distance
variation.of.info <- function(clustering1, clustering2) {
  # Check that we're comparing equal numbers of objects
  stopifnot(length(clustering1) == length(clustering2))
  # How much information do the partitions give about each other?
  mi = word.mutual.info(table(clustering1, clustering2))
  # Marginal entropies
  # The entropy function takes a vector of counts, not assignments,
  # so call table first
  h1 = entropy(table(clustering1))
  h2 = entropy(table(clustering2))
  v.o.i. = h1 + h2 - 2*mi
  return(v.o.i.)
}

Code Example 4: Calculating variation-of-information distance between partitions, based on vectors of partition-cell assignments.

Let's check this out on the test cases.

> variation.of.info(c(1,2,2),c(2,1,1))
[1] 0
> variation.of.info(c(1,2,2),c(2,2,1))
[1] 1.333333
> variation.of.info(c(1,2,2),c(5,8,11))
[1] 0.6666667
> variation.of.info(c(5,8,11),c(1,2,2))
[1] 0.6666667

I threw in the last one to make sure the function is symmetric in its two arguments, as it should be.
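word.mutual.info is also from the lecture code (lecture-05.r) and is not shown in the original solutions. A minimal sketch of a mutual-information function consistent with how it is called above, taking a contingency table of counts and returning the mutual information in bits (my reconstruction, not the lecture code itself):

# Mutual information, in bits, between the row and column variables
# of a contingency table of counts (assumed behavior of the lecture code)
word.mutual.info <- function(counts) {
  joint = counts / sum(counts)                    # joint distribution
  indep = outer(rowSums(joint), colSums(joint))   # product of the marginals
  nonzero = joint > 0                             # 0 log 0 = 0
  return(sum(joint[nonzero] * log2(joint[nonzero] / indep[nonzero])))
}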

iii. Calculate the distances between the k-means clusterings and the true classes, and between the k-means clusterings and each other.

Answer:

> variation.of.info(nci.classes,clusters.1)
[1] 1.961255
> variation.of.info(nci.classes,clusters.2)
[1] 2.026013
> variation.of.info(nci.classes,clusters.3)
[1] 2.288418
> variation.of.info(clusters.1,clusters.2)
[1] 0.5167703
> variation.of.info(clusters.1,clusters.3)
[1] 0.8577165
> variation.of.info(clusters.2,clusters.3)
[1] 0.9224744

The k-means clusterings are all roughly the same distance from the true classes (two bits or so), and noticeably closer to each other than to the true classes (around three-quarters of a bit).

3. Hierarchical clustering

(a) Produce dendrograms using Ward's method, single-link clustering and complete-link clustering. Include both the commands you used and the resulting figures in your write-up. Make sure the figures are legible. (Try the cex=0.5 option to plot.)

Answer: hclust needs a matrix of distances, handily produced by the dist function.

nci.dist = dist(nci.t)
plot(hclust(nci.dist,method="ward"),cex=0.5,xlab="cells",
     main="hierarchical clustering by Ward's method")
plot(hclust(nci.dist,method="single"),cex=0.5,xlab="cells",
     main="hierarchical clustering by single-link method")
plot(hclust(nci.dist,method="complete"),cex=0.5,xlab="cells",
     main="hierarchical clustering by complete-link method")

This produces Figures 1, 2 and 3.

(b) Which cell classes seem to be best captured by each clustering method? Explain.

Answer: Ward's method does a good job of grouping several of the classes together; one more class is captured pretty well too, but gets mixed up with some of the lab cell lines. Single-link is good with one class and decent with another (though that one is confused with a third). There are little sub-trees of cells of the same type, but mixed together with others. The complete-link method has sub-trees for several of the classes, which look pretty much as good as Ward's method's.

(c) Which method best recovers the cell classes? Explain.

Answer: Ward's method looks better: scanning down the figure, a lot more of the adjacent cells are of the same type. Since the dendrogram puts items in the same sub-cluster together, this suggests that the clustering more nearly corresponds to the known cell types.

(d) The hclust command returns an object whose height attribute is the sum of the within-cluster sums of squares. How many clusters does this suggest we should use, according to Ward's method? Explain. (You may find diff helpful.)

Answer:

> nci.wards = hclust(nci.dist,method="ward")
> length(nci.wards$height)
[1] 63
> nci.wards$height[1]
[1] 38.23033

There are 63 heights, corresponding to the 63 joinings, i.e., not including the height (sum-of-squares) of zero when we have 64 clusters, one for each data point.

Figure 1: Hierarchical clustering by Ward's method.

Figure 2: Hierarchical clustering by the single-link method.

Figure 3: Hierarchical clustering by the complete-link method.

Figure 4: Merging costs in Ward's method (merging cost against the number of clusters remaining).

Since the merging costs are differences, we want to add an initial 0 to the sequence:

merging.costs = diff(c(0,nci.wards$height))
plot(64:2,merging.costs,xlab="clusters remaining",ylab="merging cost")

The function diff takes the difference between successive components in a vector, which is what we want here (Figure 4). The heuristic is to stop joining clusters when the cost of doing so goes up a lot; this suggests using only 10 clusters, since going from 10 clusters to 9 is very expensive. (Alternately, use 63 clusters; but that would be silly.)
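To actually extract that 10-cluster solution (my own addition, not part of the original solutions), the tree can be cut with cutree and cross-tabulated against the known classes:

# Cut the Ward's-method tree at 10 clusters and compare to the cell classes
wards.10 = cutree(nci.wards, k = 10)
table(nci.classes, wards.10)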

(e) Suppose you did not know the cell classes. Can you think of any reason to prefer one clustering method over another here, based on their outputs and the rest of what you know about the problem?

Answer: I can't think of anything very compelling. Ward's method has a better sum-of-squares, but of course we don't know in advance that sum-of-squares picks out differences between cancer types with any biological or medical importance.

4. (a) Use prcomp to find the variances (not the standard deviations) associated with the principal components. Include a print-out of these variances, to exactly two significant digits, in your write-up.

Answer:

> nci.pca = prcomp(nci.t)
> nci.variances = (nci.pca$sdev)^2
> signif(nci.variances,2)
 [1] 6.3e+02 3.5e+02 2.8e+02 1.8e+02 1.6e+02 1.5e+02 1.2e+02 1.2e+02
 [9] 1.1e+02 9.2e+01 8.9e+01 8.5e+01 7.8e+01 7.5e+01 7.1e+01 6.8e+01
[17] 6.7e+01 6.2e+01 6.2e+01 6.0e+01 5.8e+01 5.5e+01 5.4e+01 5.0e+01
[25] 5.0e+01 4.7e+01 4.5e+01 4.4e+01 4.3e+01 4.2e+01 4.1e+01 4.0e+01
[33] 3.7e+01 3.7e+01 3.6e+01 3.5e+01 3.4e+01 3.4e+01 3.2e+01 3.2e+01
[41] 3.1e+01 3.0e+01 3.0e+01 2.9e+01 2.8e+01 2.7e+01 2.6e+01 2.5e+01
[49] 2.3e+01 2.3e+01 2.2e+01 2.1e+01 2.0e+01 1.9e+01 1.8e+01 1.8e+01
[57] 1.7e+01 1.5e+01 1.4e+01 1.2e+01 1.0e+01 9.9e+00 8.9e+00 1.7e-28

(b) Why are there only 64 principal components, rather than 6830? (Hint: Read the lecture notes.)

Answer: Because, when n < p, it is necessarily true that the data lie on an n-dimensional subspace of the p-dimensional feature space, so there are only n orthogonal directions along which the data can vary at all. (Actually, one of the variances is very, very small, because any 64 points lie on a 63-dimensional surface, so we really only need 63 directions, and the last variance is within numerical error of zero.) Alternately, we can imagine that there are 6830 - 64 = 6766 other principal components, each with zero variance.

(c) Plot the fraction of the total variance retained by the first q components against q. Include both a print-out of the plot and the commands you used.

Answer:

plot(cumsum(nci.variances)/sum(nci.variances),xlab="q",
     ylab="fraction of variance",
     main = "Fraction of variance retained by first q components")

producing Figure 5. The handy cumsum function takes a vector and returns the cumulative sums of its components (i.e., another vector of the same length). (The same effect can be achieved through lapply or sapply, or through an explicit loop, etc.)
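As a quick sanity check on the denominator in that fraction (my own addition, not in the original solutions): because PCA just rotates the centered data, the principal-component variances should add up to the total variance of the original features.

# The PC variances should sum to the total variance of the gene features
sum(nci.variances)
sum(apply(nci.t, 2, var))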

Figure 5: R^2 vs. q (the fraction of variance retained by the first q components).

(d) Roughly how much variance is retained by the first two principal components?

Answer: From the figure, about 25%. More exactly:

> sum(nci.variances[1:2])
[1] 986.1434
> sum(nci.variances[1:2])/sum(nci.variances)
[1] 0.2319364

so 23% of the variance.

(e) Roughly how many components must be used to keep half of the variance? To keep nine-tenths?

Answer: From the figure, about 10 components keep half the variance, and about 40 components keep nine-tenths. More exactly:

> min(which(cumsum(nci.variances) > 0.5*sum(nci.variances)))
[1] 10
> min(which(cumsum(nci.variances) > 0.9*sum(nci.variances)))
[1] 42

so we need 10 and 42 components, respectively.

(f) Is there any point to using more than 50 components? (Explain.)

Answer: There's very little point from the point of view of capturing variance. The error in reconstructing the expression levels from using 50 components is already extremely small. However, there's no guarantee that, in some other application, the information carried by the 55th component (say) wouldn't be crucial.

(g) How much confidence should you have in results from visualizing the first two principal components? Why?

Answer: Very little; the first two components represent only a small part of the total variance.
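The reconstruction-error claim in (f) can be checked directly (my own addition, not in the original solutions; the function name recons.error is mine). The sketch below rebuilds the data from the first q components and measures the total squared error:

# Total squared error from reconstructing the expression levels
# using only the first q principal components
recons.error <- function(q, pca = nci.pca, x = nci.t) {
  approx = pca$x[, 1:q, drop = FALSE] %*% t(pca$rotation[, 1:q, drop = FALSE])
  approx = sweep(approx, 2, pca$center, "+")   # add the column means back
  return(sum((x - approx)^2))
}

Comparing recons.error(50) to recons.error(2), say, shows how little is left to recover after 50 components.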

Figure 6: PC projections of the cells, labeled by cell type (PC1 on the horizontal axis, PC2 on the vertical).

5. (a) Plot the projection of each cell on to the first two principal components. Label each cell by its type. Hint: See the solutions to the first homework.

Answer: Take the hint!

plot(nci.pca$x[,1:2],type="n")
text(nci.pca$x[,1:2],nci.classes, cex=0.5)

Figure 6 is the result.

(b) One tumor class (at least) forms a cluster in the projection. Say which, and explain your answer.

Answer: MELANOMA; most of the MELANOMA cells cluster in the bottom-center of the figure (PC1 between about 0 and 20, PC2 around -40), with only a few non-MELANOMA cells nearby, and a big gap to the rest of the data. One could also argue for the groups in the center left, in the upper left, above that, and in the top right.

(c) Identify a tumor class which does not form a compact cluster.

Answer: Most of them don't. What I had in mind, though, was BREAST, which is very widely spread.

(d) Of the two classes of tumors you have just named, which will be more easily classified with the prototype method? With the nearest neighbor method?

Answer: BREAST will be badly classified using the prototype method. The prototype will be around the middle of the plot, where there are no breast-cancer cells. Nearest-neighbors can hardly do worse. On the other hand, MELANOMA should work better with the prototype method, because it forms a compact blob.
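One quick way to check the nearest-neighbor side of this (my own addition, not part of the original solutions; it assumes the class package is available) is leave-one-out 1-nearest-neighbor classification on the two-component projection:

# Leave-one-out 1-nearest-neighbor predictions on the first two PCs,
# using knn.cv from the class package
library(class)
knn.preds = knn.cv(nci.pca$x[, 1:2], cl = nci.classes, k = 1)
mean(knn.preds == nci.classes)   # overall leave-one-out accuracy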

6. Combining dimensionality reduction and clustering

(a) Run k-means with k = 14 on the projections of the cells on to the first two principal components. Give the commands you used to do this, and get three clusterings from three different runs of k-means.

pc.clusters.1 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.2 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.3 = kmeans(nci.pca$x[,1:2],centers=14)$cluster

(b) Calculate the number of errors, as with k-means clustering based on all genes.

Answer:

> count.lumping.errors(nci.classes,pc.clusters.1)
[1] 96
> count.lumping.errors(nci.classes,pc.clusters.2)
[1] 92
> count.lumping.errors(nci.classes,pc.clusters.3)
[1] 111
> count.splitting.errors(nci.classes,pc.clusters.1)
[1] 126
> count.splitting.errors(nci.classes,pc.clusters.2)
[1] 128
> count.splitting.errors(nci.classes,pc.clusters.3)
[1] 134

The number of lumping errors has increased a bit, though not very much. The number of splitting errors has gone up rather more, by roughly 30%.

(c) Are there any pairs of cells which are always clustered together? If so, do they have the same cell type?

Answer: For me, the first three cells are all clustered together, and are all CNS.

(d) Does k-means find a cluster corresponding to the cell type you thought would be especially easy to identify in the previous problem? (Explain your answer.)

Answer: Make confusion matrices and look at them.

> table(nci.classes,pc.clusters.1)
             pc.clusters.1
nci.classes    1 2 3 4 5 6 7 8 9 10 11 12 13 14
  BREAST       2 1 0 0 0 0 1 0 0  0  1  1  0  1
  CNS          0 0 0 0 0 0 0 0 0  0  0  0  0  5
  COLON        0 0 0 2 0 0 2 1 0  0  2  0  0  0
  K562A-repro  0 0 0 0 1 0 0 0 0  0  0  0  0  0
  K562B-repro  0 0 0 0 1 0 0 0 0  0  0  0  0  0
  LEUKEMIA     0 0 0 0 4 2 0 0 0  0  0  0  0  0
  MCF7A-repro  0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MCF7D-repro  0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MELANOMA     1 0 0 0 0 0 0 0 0  0  0  1  6  0
  NSCLC        0 0 0 0 0 0 0 2 2  1  0  3  0  1
  OVARIAN      0 0 0 0 0 0 0 1 3  0  2  0  0  0
  PROSTATE     0 0 0 0 0 0 0 0 0  2  0  0  0  0
  RENAL        0 1 2 0 0 0 0 0 5  0  0  1  0  0
  UNKNOWN      0 0 0 0 0 0 0 0 0  0  0  1  0  0

Six of the eight MELANOMA cells are in one cluster here, which has no other members; this matches what I guessed from the projection plot. This is also true in one of the other two runs; in the third, that group of six is split into two clusters of 4 cells and 2.

(e) Does k-means find a cluster corresponding to the cell type you thought would be especially hard to identify? (Again, explain.)

Answer: I thought BREAST would be hard to classify, and it is, being spread across six clusters, all containing cells from other classes. This remains true in the other k-means runs. On the other hand, it wasn't well-clustered with the full data, either.
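For a further comparison (my own addition, not in the original solutions), the variation-of-information function from problem 2(d) can be reused on the PC-based clusterings; the exact distances depend on the random k-means runs, so none are quoted here.

# Variation-of-information distances for the PC-based clusterings
variation.of.info(nci.classes, pc.clusters.1)
variation.of.info(pc.clusters.1, pc.clusters.2)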

7. Combining dimensionality reduction and classification

(a) Calculate the error rate of the prototype method using all the gene features, using leave-one-out cross-validation. You may use the solution code from previous assignments, or your own code. Remember that in this data set, the class labels are the row names, not a separate column, so you will need to modify either your code or the data frame. (Note: this may take several minutes to run. [Why so slow?])

Answer: I find it simpler to add another column for the class labels than to modify the code.

> nci.frame = data.frame(labels = nci.classes, nci.t)
> loocv.prototype(nci.frame)
[1] 0.578125

(The frame-making command complains about duplicated row names, but I suppressed that for simplicity.) 58% accuracy may not sound that great, but remember there are 14 classes; it's a lot better than chance.

Many people tried to use matrix here. The difficulty, as mentioned in the lecture notes, is that a matrix or array, in R, has to have all its components be of the same data type: it can't mix, say, numbers and characters. When you try, it coerces the numbers into characters (because it can do that, but can't go the other way). With data.frame, each column can be of a different data type, so we can mix character labels with numeric features.

This takes a long time to run because it has to calculate 14 class centers, with 6830 dimensions, 64 times. Of course most of the class prototypes are not changing with each hold-out (only the one center of the class to which the hold-out point belongs), so there's a fair amount of wasted effort computationally. A slicker implementation of leave-one-out for the prototype method would take advantage of this, at the cost of specializing the code to that particular classifier method.
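loocv.prototype comes from the solutions to the earlier homework and is not reproduced here. A minimal sketch of what such a function might do, under the same convention that the first column of the data frame holds the class labels (this is my reconstruction of the naive version described above, not the original code):

# Leave-one-out cross-validated accuracy of the prototype
# (nearest class centroid) classifier
# Input: data frame whose first column is the class label
# Output: fraction of cells whose held-out prediction is correct
loocv.prototype.sketch <- function(frame) {
  labels = as.character(frame[, 1])
  features = as.matrix(frame[, -1])
  classes = unique(labels)
  correct = logical(nrow(features))
  for (i in 1:nrow(features)) {
    # one centroid per class, computed without the held-out cell
    centroids = sapply(classes, function(cl) {
      keep = (labels == cl) & (seq_along(labels) != i)
      colMeans(features[keep, , drop = FALSE])
    })
    # squared Euclidean distance from the held-out cell to each centroid
    dists = colSums((centroids - features[i, ])^2)
    correct[i] = (classes[which.min(dists)] == labels[i])
  }
  return(mean(correct))
}

As noted above, recomputing every centroid on each hold-out wastes work; only the held-out cell's own class centroid actually changes.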

(b) Calculate the error rate of the prototype method using the first two, ten and twenty principal components. Include the R commands you used to do this.

Answer: Remember that the x attribute of the object returned by prcomp gives the projections on to the components. We need to turn these into data frames along with the class labels, but since we don't really care about keeping them around, we can do that in the argument to the CV function.

> loocv.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:2]))
[1] 0.421875
> loocv.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:10]))
[1] 0.515625
> loocv.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:20]))
[1] 0.59375

So the accuracy is only slightly reduced, or even slightly increased, by using the principal components rather than the full data.

(c) (Extra credit.) Plot the error rate, as in the previous part, against the number q of principal components used, for q from 2 to 64. Include your code and comment on the graph. (Hint: Write a function to calculate the error rate for arbitrary q, and use sapply.)

Answer: Follow the hint (Code Example 5).

proto.with.q.components = function(q, pcs, classes) {
  pcs = pcs[,1:q]
  frame = data.frame(class.labels=classes, pcs)
  accuracy = loocv.prototype(frame)
  return(accuracy)
}

Code Example 5: Function for calculating the accuracy of the prototype method with a variable number of principal components.

This creates the frame internally. Strictly speaking, the extra arguments to the function aren't needed, but assuming that global variables will exist and have the right properties is always bad programming practice. I could add default values to them (pcs=nci.pca$x, etc.), but that's not necessary because sapply can pass extra named arguments on to the function.

> pca.accuracies = sapply(2:64, proto.with.q.components,
+                         pcs=nci.pca$x, classes=nci.classes)
> pca.accuracies[c(1,9,19)]
[1] 0.421875 0.515625 0.593750
> plot(2:64, pca.accuracies, xlab="q", ylab="prototype accuracy")

The middle command double-checks that the proto.with.q.components function works properly, by giving the same results as we got by hand earlier. (Remember that the first component of the vector pca.accuracies corresponds to q = 2, etc.) The plot is Figure 7. There's a peak in the accuracy for q between about 15 and 20.

Figure 7: Accuracy of the prototype method as a function of the number of components q.