Supp Figures 1A-1D (images not reproduced here)

Supp File 1 (Table: supp file 1.pdf)

Prediction of Chemotherapy Response from Breast Cancer Cell Lines to Human Cancer Expression Data

September 19, 2008

> library(splines)
> library(oompaBase)
> library(mclust)
use of mclust requires a license agreement
> library(nlme)
> library(PreProcess)
> library(ClassComparison)
> library(cluster)
> library(ClassDiscovery)

1 Loading MDACC 133 Array's Gene Expression Data

1.1 Load the expression data

We use the mean-adjusted expression data, i.e. the expression data adjusted to eliminate batch effects.

> set.seed(1000)
> our.dir <- "//mdabam1/bioinfo/private/lajos-chemo-prediction/supplementary/"
> mdacc.file.name <- "MDA133/Mean-Adjusted-raw-MBEI-MDA133.txt"
> MDACC.centered.133 <- read.delim(paste(our.dir, mdacc.file.name,
+     sep = ""), header = TRUE, sep = "\t")
> dim(MDACC.centered.133)
[1]

The first column is the probe set ID. We separate the probe set ID from the expression data and then log-transform the expression data.

> MDACC.centered133.dt <- MDACC.centered.133[, -1]
> row.names(MDACC.centered133.dt) <- MDACC.centered.133[, 1]
> MDACC.133.log.dt <- log2(MDACC.centered133.dt + 1)
> rm(MDACC.centered.133)

1.2 Load associated clinical info

> MDACC.clinical <- "MDA133/MDA133CompleteInfo txt"
> MDACC.133.clinical <- read.delim(paste(our.dir, MDACC.clinical,
+     sep = ""), header = TRUE, sep = "\t")
> all(MDACC.133.clinical$idtext == colnames(MDACC.133.log.dt))
[1] TRUE

The array IDs appear in the same order in the gene expression data and in the associated clinical data. Next, we define a vector labelling the pCR and RD cases. From the clinical information, we know there are 34 pCR cases and 99 RD cases.

> is.pcr <- rep("RD", length(MDACC.133.clinical$pcrtxt))
> is.pcr[MDACC.133.clinical$pcrtxt == "pcr"] <- "pcr"

We introduce a function to compute p-values from correlation coefficients, based on the beta distribution (n is the sample size).

> Beta.function <- function(x, n) {
+     z <- (x + 1)/2
+     y <- pbeta(z, (n - 1)/2, (n - 1)/2)
+     p <- 1 - 2 * abs(y - 1/2)
+     return(p)
+ }

We introduce another function for computing sensitivity, specificity, PPV, and NPV from the DLDA results. To compute these quantities, it is important to define what we test for. In our analysis, we test for resistance (RD): a true positive is an RD case, and a true negative is a pCR case.

> my.function <- function(x) {
+     Sens <- x[1]/(x[1] + x[2])
+     Spec <- x[4]/(x[3] + x[4])
+     PPV <- x[1]/(x[1] + x[3])
+     NPV <- x[4]/(x[2] + x[4])
+     list(Sensitivity = Sens, Specificity = Spec, PPV = PPV, NPV = NPV)
+ }

2 Loading Cell Line's Gene Expression Data

2.1 Load the expression data

> fname <- "cell-line-data/mbei-for-cellline-data-from-cornelia txt"
> chemo.cell.line.dt <- read.table(paste(our.dir, fname, sep = ""),
+     header = T, row.names = NULL, skip = 0, sep = "\t")

The first column is the probe set ID; we separate it from the data.

> ProbeSet.ID <- chemo.cell.line.dt[, 1]
> cellline.dt <- chemo.cell.line.dt[, -1]
> rm(chemo.cell.line.dt)

2.2 Load array information file and replace array ID by cell line names

> infonames <- "cell-line-data/cell-line-info.txt"
> info.file <- read.table(paste(our.dir, infonames, sep = ""),
+     header = T, row.names = NULL, skip = 0, sep = "\t")
> dimnames(info.file)[[2]]
[1] "Number"    "Cell.Line" "Array"     "File.Name"
> dimnames(cellline.dt)[[2]] <- info.file[, 2]

2.3 Data transformation and load other cell line information

> CellLineOtherInfo <- "Documents/cell screening progress note_cl.txt"
> other.info.cell.lines <- read.table(paste(our.dir, CellLineOtherInfo,
+     sep = ""), header = T, row.names = NULL, skip = 0, sep = "\t")

2.4 Clustering for QC

(1) We first define a filter to remove noisy measurements: a probe set is kept only if its maximum across cell lines reaches the 15th percentile of all expression values.

> max.vec <- apply(cellline.log.dt, 1, max)
> q15 <- quantile(as.matrix(cellline.log.dt), 0.15)
> f.vector <- max.vec >= q15

(2) Define a vector to remove control spots from the expression measurements.

> is.not.controls <- rep(TRUE, dim(cellline.log.dt)[[1]])
> is.not.controls[grep("AFFX", dimnames(cellline.log.dt)[[1]])] <- FALSE

After filtering and removing the controls:

> cellline.log.dt <- cellline.log.dt[(f.vector & is.not.controls),
+     ]
> dim(cellline.log.dt)
[1]

We will use this dataset to identify chemo predictors.

(3) Performing cluster analysis

> chemo.hc <- hclust(distanceMatrix(cellline.log.dt, "pearson"),
+     method = "complete")
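The clustering call above relies on distanceMatrix from ClassDiscovery. As a rough base-R illustration of the same idea, the sketch below clusters samples using one minus the Pearson correlation between columns as the distance. This is an assumption about the metric: the package may scale it differently, although a constant scale factor would not change the shape of the tree. The toy matrix and all object names are hypothetical.

```r
set.seed(1)
# Toy log-expression matrix: 50 probe sets (rows) x 6 samples (columns).
# Samples A1-A3 share one expression pattern, B1-B3 share another.
base.a <- rnorm(50)
base.b <- rnorm(50)
toy <- cbind(A1 = base.a + rnorm(50, sd = 0.1),
             A2 = base.a + rnorm(50, sd = 0.1),
             A3 = base.a + rnorm(50, sd = 0.1),
             B1 = base.b + rnorm(50, sd = 0.1),
             B2 = base.b + rnorm(50, sd = 0.1),
             B3 = base.b + rnorm(50, sd = 0.1))

# Pearson distance between samples (columns): 1 - correlation.
pearson.dist <- as.dist(1 - cor(toy))

# Complete-linkage hierarchical clustering, as in the report.
toy.hc <- hclust(pearson.dist, method = "complete")

# Cut the tree into two groups to inspect the top-level split.
groups <- cutree(toy.hc, k = 2)
print(groups)
```

Calling plot(toy.hc) would draw the dendrogram for the toy data.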

[Figure: Cluster Dendrogram of the cell lines; vertical axis: Height.]

Figure 1: Hierarchical clustering using all probe sets. Two distinct clusters can be seen. Compared with the available clinical information, these two clusters seem to be related to ER status (see attached Figure: clustering.pdf).

3 Load GI50 data

> data <- read.table(paste(our.dir, "Documents/krc-parsed.tsv",
+     sep = ""), sep = "\t", header = TRUE)
> data[, "Step"] <- factor(data[, "Step"])
> temp <- read.table(paste(our.dir, "Documents/translateConc.tsv",
+     sep = ""), sep = "\t", header = TRUE)
> concentrations <- temp[, "PowerOfTen"]
> names(concentrations) <- temp[, "TargetConc"]

4 Identifying Genes between Sensitive and Resistant Cell Lines from Gene Expression Data

There are two ways to identify genes: (1) a two-sample t-test between sensitive and resistant cell lines; and (2) the correlation between expression data and GI50 (the drug-treated cell line's data). We apply both approaches for each drug. To perform the t-test, we need to identify sensitive and resistant cell lines for each drug. We discussed this issue in the last meeting and decided to select sensitive and resistant cell lines based on the boxplot of the GI50 values from resampled dose-response curves for each drug (see the report Breast Cancer Cell Line Dose Response, issued 3 August 2007). The identification of sensitive and resistant cell lines was illustrated in the previous report; here we only outline the critical step.

4.1 Paclitaxel

> currentdrug <- "paclitaxel"

For paclitaxel, we decided to use the 8 cell lines with the lowest concentrations as sensitive and the 8 cell lines with the highest concentrations as resistant.

> K <- 8

=== The following code was used to compute one of the quantiles of the GI50 values and produce the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "MDA-MB-435" "Hs578T"     "MDAMB157"   "HBL100"     "AU565"
 [6] "MDA-MB-436" "BT-549"     "MDA-MB-468" "BT483"      "BT20"
[11] "MDA-MB-231" "MDA-MB-453" "MCF-7"      " "          "BT 474"
[16] "MDA-MB-361" "SK-BR-3"    "T47D"       "ZR-751"

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "MDA-MB-435" "Hs578T"     "MDAMB157"   "HBL100"     "AU565"
[6] "MDA-MB-436" "BT-549"     "MDA-MB-468"
> res.cell.lines
[1] "ZR-751"     "T47D"       "SK-BR-3"    "MDA-MB-361" "BT 474"
[6] " "          "MCF-7"      "MDA-MB-453"

=== Get the mean GI50 values

> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL Hs578T MCF-7 MDA-MB-231 MDA-MB-361 MDA-MB-435 MDA-MB-436 MDA-MB MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR

=== (A) Statistical analysis of the cell line's expression measurements to identify genes that are significantly differentially expressed between the sensitive and resistant cell lines, by two-sample t-test.

(1) Making a new data set consisting of the expression data for the selected sensitive and resistant cell lines

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Performing the t-test and identifying genes

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve.]

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> which.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(which.one.significant)

Using FDR = 15%, we identified 2156 predictors.

(3) Ordering the expression data by p-values, selecting the top 100 genes, and performing two-way clustering

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanofsensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanofresistant <- apply(new.dt[, sensitive == FALSE], 1, mean)
> meanofdiff <- -(meanofsensitive - meanofresistant)
> AveFoldChange <- 2^(meanofdiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanofsensitive, meanofresistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.n.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.n.genes.dt[6:dim(top.n.genes.dt)[[2]]]
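To make the fold-change convention in the code above concrete, here is a small worked example with made-up log2 group means (the numbers and object names are hypothetical, not taken from the report). The difference is taken as resistant minus sensitive on the log2 scale, converted to a ratio, and ratios below 1 are reported as negative reciprocals.

```r
# Hypothetical log2-scale group means for three probe sets.
mean.sensitive <- c(g1 = 8, g2 = 10, g3 = 7)
mean.resistant <- c(g1 = 10, g2 = 9, g3 = 7)

# Difference of log2 means, resistant minus sensitive.
mean.diff <- mean.resistant - mean.sensitive

# Back-transform to a ratio on the raw scale.
fold <- 2^mean.diff

# Ratios below 1 are reported as negative reciprocals, so a halving
# in the resistant group shows up as -2 rather than 0.5.
below.one <- fold < 1
fold[below.one] <- -1 / fold[below.one]
print(fold)
```

Here g1 is 4-fold higher in the resistant group, g2 is 2-fold lower (reported as -2), and g3 is unchanged (1).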

Figure: Two-way hierarchical clustering for paclitaxel using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue = sensitive, red = resistant.

(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values.

(1) Computing the correlation between expression measurements and mean GI50 values

First, we order the cell line gene expression data so that the order of the expression data is consistent with the order of the mean GI50 values. Then we compute the correlation between the expression measurements and the mean GI50 measurements.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))

[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Computing p-values and modelling the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve.]

=== (3) Order the data by p-values and select the top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of all cell lines' mean GI50 values. Cell lines whose mean GI50 value is at or below the median are labelled sensitive; those above the median are labelled resistant.

> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] "AU565"      "BT-549"     "BT20"       "BT483"      "HBL100"
 [6] "Hs578T"     "MDA-MB-231" "MDA-MB-435" "MDA-MB-436" "MDAMB157"

Resistant cell lines

> names(x[x == FALSE])
[1] " "          "BT 474"     "MCF-7"      "MDA-MB-361" "MDA-MB-453"
[6] "MDA-MB-468" "SK-BR-3"    "T47D"       "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)
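The median split used above can be illustrated on a handful of made-up values. The cell line names and GI50 numbers below are hypothetical; only the rule (at or below the median counts as sensitive) follows the report's code.

```r
# Hypothetical mean GI50 values (log10 concentration) for five cell lines.
toy.gi50 <- c(lineA = -8.2, lineB = -7.1, lineC = -6.5,
              lineD = -7.9, lineE = -6.9)

# A lower GI50 means less drug is needed, so values at or below the
# median are labelled sensitive and the rest resistant.
is.sensitive <- toy.gi50 <= median(toy.gi50)
toy.labels <- ifelse(is.sensitive, "Sensitive", "Resistant")
print(table(toy.labels))
```

With an odd number of cell lines, the line sitting exactly on the median falls into the sensitive group, which is why the report ends up with 10 sensitive and 9 resistant lines out of 19.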

Figure: Two-way hierarchical clustering for paclitaxel using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue = sensitive, red = resistant.

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction, using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)

To cross-validate on the training data, we apply leave-two-out cross validation, i.e. we select two cell lines from the data (one sensitive and one resistant) as the validation set. We then perform the t-test on the remaining cell lines. As before, we select the top 100 probe sets (ranked by p-values) as predictors. Finally, we apply the selected predictors to predict the validation set. The selection of the top 100 predictors is repeated for each leave-two-out split.

> Leave.two.out <- data.frame(matrix(NA, ncol = K, nrow = 2))
> colnames(Leave.two.out) <- paste("n", 1:K, sep = "")
> rownames(Leave.two.out) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> K
[1] 8
> for (i in 1:K) {
+     M <- 2 * K + 1
+     set1 <- colnames(new.dt)[i]
+     set2 <- colnames(new.dt)[(M - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(new.dt), set)
+     set3
+     training.set <- new.dt[, match(set3, colnames(new.dt))]
+     v1 <- is.sens[match(set3, colnames(new.dt))]
+     ttest <- MultiTtest(training.set, v1 == "Sensitive")
+     ordered.dt <- training.set[order(ttest@p.values), ]
+     N <- 100
+     top.genes.dt <- ordered.dt[1:N, ]
+     test.set <- new.dt[, match(set, colnames(new.dt))]
+     v2 <- is.sens[match(set, colnames(new.dt))]
+     test.set <- data.frame(test.set)
+     rownames(test.set) <- rownames(new.dt)
+     top.gene.test.set <- test.set[match(rownames(top.genes.dt),
+         rownames(test.set)), ]
+     jk <- myfct.dlda(data.train = top.genes.dt, class.train = v1,
+         data.test = data.frame(top.gene.test.set), class.test = v2)
+     Leave.two.out[1, i] <- round(jk[[1]], 2)
+     Leave.two.out[2, i] <- round(jk[[4]], 2)
+ }

The results of cross validation

> Leave.two.out
                    n1 n2 n3 n4 n5 n6 n7 n8
TrainingAccuracy
CVPredictedAccuracy
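The classifier myfct.dlda is not listed in this report, so the following is only a sketch of the leave-two-out scheme on simulated data, not the authors' implementation. It assumes a plain diagonal linear discriminant rule (class means plus a pooled per-gene variance), substitutes base R's t.test for MultiTtest, and pairs column i with column 2K + 1 - i so that each held-out pair contains one line from each group. All function and object names are hypothetical.

```r
# Minimal DLDA: class means plus a pooled per-gene variance.
dlda.fit <- function(x, y) {
  # x: genes in rows, samples in columns; y: class label per column.
  classes <- sort(unique(y))
  means <- sapply(classes, function(k) rowMeans(x[, y == k, drop = FALSE]))
  resid <- x - means[, match(y, classes), drop = FALSE]
  pooled.var <- rowSums(resid^2) / (ncol(x) - length(classes))
  list(classes = classes, means = means, var = pooled.var)
}

dlda.predict <- function(fit, newx) {
  # Assign each column of newx to the class whose mean is closest
  # in variance-scaled squared distance.
  scores <- sapply(seq_along(fit$classes), function(k)
    colSums((newx - fit$means[, k])^2 / fit$var))
  scores <- matrix(scores, ncol = length(fit$classes))
  fit$classes[apply(scores, 1, which.min)]
}

# Simulated data: 200 genes, K sensitive then K resistant samples;
# the first 20 genes are shifted upward in the resistant group.
set.seed(42)
K <- 8
n.genes <- 200
y <- rep(c("Sensitive", "Resistant"), each = K)
x <- matrix(rnorm(n.genes * 2 * K), nrow = n.genes,
            dimnames = list(paste0("g", 1:n.genes), paste0("s", 1:(2 * K))))
x[1:20, y == "Resistant"] <- x[1:20, y == "Resistant"] + 3

top.n <- 10
correct <- logical(0)
for (i in 1:K) {
  held.out <- c(i, 2 * K + 1 - i)  # one sensitive and one resistant column
  train.x <- x[, -held.out]
  train.y <- y[-held.out]
  # Rank genes on the training columns only, so the held-out pair
  # plays no part in selecting the predictors.
  p <- apply(train.x, 1, function(g)
    t.test(g[train.y == "Sensitive"], g[train.y == "Resistant"])$p.value)
  keep <- order(p)[1:top.n]
  fit <- dlda.fit(train.x[keep, ], train.y)
  pred <- dlda.predict(fit, x[keep, held.out, drop = FALSE])
  correct <- c(correct, pred == y[held.out])
}
cv.accuracy <- mean(correct)
print(cv.accuracy)
```

The key design point is that gene selection is repeated inside the loop; selecting genes once on all samples and then cross-validating only the classifier would give an optimistically biased accuracy.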

> apply(Leave.two.out, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(1b) Prediction of the human data set (testing set)

> MDA133.predict <- MDACC.133.log.dt[match(row.names(selected100.dt),
+     row.names(MDACC.133.log.dt)), ]

We re-define the class vector so that the RD cases are resistant and the pCR cases are sensitive. As defined earlier, we test for resistance, i.e. a true positive is RD and a true negative is pCR.

> testing.class <- is.pcr
> testing.class[testing.class == "RD"] <- "Resistant(RD)"
> testing.class[testing.class == "pcr"] <- "Sensitive(pCR)"
> ttest.pred <- myfct.dlda(data.train = selected100.dt, class.train = is.sens,
+     data.test = MDA133.predict, class.test = testing.class)
> names(ttest.pred)
[1] "TrainingAccuracy"              "SummaryTraining"
[3] "IndividualTrainingVsPredicted" "CVPredictedAccuracy"
[5] "ROC"                           "ProbOfClass1"
[7] "FPandTP"                       "SummaryTesting"
[9] "IndividualTestVsPredicted"

(1) Training set classification table

> ttest.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  8                  0
Predicted=Sensitive                  0                  8

(2) Testing set classification table

> ttest.pred[[8]]

                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    17                      7

(3) Prediction accuracy on the testing set

> ttest.pred[[4]]
[1]

(4) Summarizing sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plotting the receiver operating characteristic (ROC) curve.

> ttest.predictors <- my.function(ttest.pred[[8]])
> data.frame(ttest.predictors)
  Sensitivity Specificity PPV NPV
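As a worked illustration of how the four summary measures follow from a 2 x 2 classification table, the sketch below uses made-up counts (not the report's results) laid out like the tables above: rows are predicted classes, columns are true classes, and the resistant class is treated as the positive class. The helper name diagnostics is hypothetical; it mirrors the arithmetic of my.function defined earlier in the report but uses explicit row and column indexing.

```r
# Rows = predicted class, columns = true class, positive class first.
diagnostics <- function(tab) {
  tp <- tab[1, 1]  # predicted resistant, truly resistant
  fp <- tab[1, 2]  # predicted resistant, truly sensitive
  fn <- tab[2, 1]  # predicted sensitive, truly resistant
  tn <- tab[2, 2]  # predicted sensitive, truly sensitive
  c(Sensitivity = tp / (tp + fn),
    Specificity = tn / (fp + tn),
    PPV = tp / (tp + fp),
    NPV = tn / (fn + tn))
}

# Hypothetical counts, filled column by column.
toy.tab <- matrix(c(40, 10, 5, 45), nrow = 2,
                  dimnames = list(Predicted = c("Resistant", "Sensitive"),
                                  Truth = c("Resistant", "Sensitive")))
toy.result <- diagnostics(toy.tab)
print(round(toy.result, 3))
```

Because R stores matrices column by column, the linear indices x[1], x[2], x[3], x[4] used in my.function correspond to the cells [1, 1], [2, 1], [1, 2], [2, 2] used here.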

[Figure: Empirical ROC curve, Sensitivity against False Positive Ratio.]

(2) Prediction, using DLDA with the predictors identified from correlation.

(2a) Cross validation of cell line data. Again, we apply the leave-two-out cross validation approach.

> n <- dim(cell.line.dt)[[2]]
> Leave.two.out.cor <- data.frame(matrix(NA, ncol = (n - 1)/2,
+     nrow = 2))
> colnames(Leave.two.out.cor) <- paste("n", 1:((n - 1)/2), sep = "")
> rownames(Leave.two.out.cor) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> for (i in 1:(n/2)) {
+     set1 <- colnames(cell.line.dt)[i]

+     set2 <- colnames(cell.line.dt)[(n - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(cell.line.dt), set)
+     training.set <- cell.line.dt[, match(set3, colnames(cell.line.dt))]
+     v1 <- is.sens.cor[match(set3, colnames(cell.line.dt))]
+     used.mean.gi50 <- mean.gi50[match(set3, names(mean.gi50))]
+     cor.with.gi50 <- cor(t(training.set), used.mean.gi50, method = "spearman")
+     p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = n)
+     ordered.cor.dt <- training.set[order(p.value.cor), ]
+     N <- 100
+     top.cor.genes.dt <- ordered.cor.dt[1:N, ]
+     test.cor.set <- cell.line.dt[, match(set, colnames(cell.line.dt))]
+     v2 <- is.sens.cor[match(set, colnames(cell.line.dt))]
+     test.cor.set <- data.frame(test.cor.set)
+     rownames(test.cor.set) <- rownames(cell.line.dt)
+     top.gene.test.set <- test.cor.set[match(rownames(top.cor.genes.dt),
+         rownames(test.cor.set)), ]
+     jk <- myfct.dlda(data.train = top.cor.genes.dt, class.train = v1,
+         data.test = top.gene.test.set, class.test = v2)
+     Leave.two.out.cor[1, i] <- round(jk[[1]], 2)
+     Leave.two.out.cor[2, i] <- round(jk[[4]], 2)
+ }

The cross validation results

> Leave.two.out.cor
                    n1 n2 n3 n4 n5 n6 n7 n8 n9
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out.cor, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(2b) Prediction on human data

> MDA133.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.cor.dt),
+     row.names(MDACC.133.log.dt)), ]

> cor.pred <- myfct.dlda(data.train = selected100.cor.dt, class.train = is.sens.cor,
+     data.test = MDA133.cor.pred, class.test = testing.class)

(1) Training set classification table

> cor.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  9                  2
Predicted=Sensitive                  0                  8

(2) Prediction accuracy on the training set

> cor.pred[[1]]
[1]

(3) Testing set classification table

> cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                     9                      7

(4) Prediction accuracy on the testing set

> cor.pred[[4]]
[1]

(5) Summarizing sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plotting the receiver operating characteristic (ROC) curve.

> cor.predictors <- my.function(cor.pred[[8]])
> data.frame(cor.predictors)
  Sensitivity Specificity PPV NPV

[Figure: Empirical ROC curve, Sensitivity against False Positive Ratio.]

Finally, we apply a random approach, i.e. we use randomly identified predictors from the cell line data to predict the human data. The purpose of this analysis is to evaluate the prediction performance obtained with randomly selected chemo predictors from the cell line data.

(1) t-test approach

> K
[1] 8

> random.sen.cl <- names(sample(mess))[1:K]
> random.res.cl <- names(sample(mess))[19:(19 - K + 1)]
> random.sen.cell <- match(random.sen.cl, dimnames(cellline.log.dt)[[2]])
> random.res.cell <- match(random.res.cl, dimnames(cellline.log.dt)[[2]])
> random.dt <- data.frame(cellline.log.dt[, random.sen.cell], cellline.log.dt[,
+     random.res.cell])
> dim(random.dt)
[1]
> temp <- gsub("x ", " ", colnames(random.dt))
> colnames(random.dt) <- temp
> rm(temp)
> sen.vec <- rep(FALSE, ncol(random.dt))
> sen.vec[c(1:length(random.sen.cell))] <- TRUE
> Random.CL.t.test <- MultiTtest(random.dt, sen.vec == TRUE)
> Random.CL.bum <- Bum(Random.CL.t.test@p.values)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve.]

> ordered.random.dt <- random.dt[order(Random.CL.t.test@p.values),
+     ]
> N
[1] 100
> top.n.genes.random.dt <- ordered.random.dt[1:N, ]
> MDA133.predict.Random <- MDACC.133.log.dt[match(row.names(top.n.genes.random.dt),
+     row.names(MDACC.133.log.dt)), ]
> is.sens.random <- rep("Resistant", ncol(random.dt))
> is.sens.random[c(1:length(random.sen.cl))] <- "Sensitive"
> testing.class.random <- is.pcr
> testing.class.random[testing.class.random == "RD"] <- "Resistant(RD)"
> testing.class.random[testing.class.random == "pcr"] <- "Sensitive(pCR)"

> ttest.pred.random <- myfct.dlda(data.train = top.n.genes.random.dt,
+     class.train = is.sens.random, data.test = MDA133.predict.Random,
+     class.test = testing.class.random)

(1) Training set classification table

> ttest.pred.random[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  6                  3
Predicted=Sensitive                  2                  5
> ttest.pred.random[[1]]
[1]
> ttest.pred.random[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    24                      9
> ttest.pred.random[[4]]
[1]
> ttest.predictors.random <- my.function(ttest.pred.random[[8]])
> data.frame(ttest.predictors.random)
  Sensitivity Specificity PPV NPV

[Figure: Empirical ROC curve, Sensitivity against False Positive Ratio.]

(2) Correlation approach

> random.mean.gi50 <- sample(mean.gi50)
> cor.with.gi50.random <- cor(t(cell.line.dt), random.mean.gi50,
+     method = "spearman")
> range(cor.with.gi50.random)
[1]
> p.value.cor.random <- Beta.function(x = cor.with.gi50.random[,
+     1], n = 19)
> cor.random.bum <- Bum(p.value.cor.random)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve.]

> N
[1] 100
> new.random.cor.dt <- data.frame(p.value.cor.random, cor.with.gi50.random,
+     cell.line.dt)
> colnames(new.random.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.random.cor.dt <- new.random.cor.dt[order(new.random.cor.dt$pvalue),
+     ]
> top100.cor.random.dt <- ord.random.cor.dt[1:N, ]
> selected100.random.cor.dt <- top100.cor.random.dt[3:dim(top100.cor.random.dt)[[2]]]

Compute the median GI50 value and define sensitive and resistant cell lines

> random.x <- random.mean.gi50 <= median(random.mean.gi50)
> names(random.x[random.x == TRUE])

 [1] "HBL100"     "BT20"       "MDA-MB-435" "MDA-MB-436" "AU565"
 [6] "MDAMB157"   "Hs578T"     "MDA-MB-231" "BT483"      "BT-549"
> names(random.x[random.x == FALSE])
[1] "ZR-751"     "MCF-7"      " "          "MDA-MB-361" "MDA-MB-453"
[6] "MDA-MB-468" "T47D"       "BT 474"     "SK-BR-3"
> Sens.random <- match(names(random.x[random.x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor.random <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor.random[Sens.random] <- "Sensitive"
> rm(random.x)

Prediction on human data

> MDA133.random.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.random.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> random.cor.pred <- myfct.dlda(data.train = selected100.random.cor.dt,
+     class.train = is.sens.cor.random, data.test = MDA133.random.cor.pred,
+     class.test = testing.class.random)

(1) Training set classification table

> random.cor.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  7                  1
Predicted=Sensitive                  2                  9

(2) Prediction accuracy on the training set

> random.cor.pred[[1]]
[1]

(3) Testing set classification table

> random.cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)

(4) Prediction accuracy on the testing set

> random.cor.pred[[4]]
> random.cor.predicted <- data.frame(unlist(random.cor.pred[[9]][2,
+     ]))
> colnames(random.cor.predicted) <- "Predicted"
> random.cor.predictors <- my.function(random.cor.pred[[8]])
> data.frame(random.cor.predictors)
  Sensitivity Specificity PPV NPV

[Figure: Empirical ROC curve, Sensitivity against False Positive Ratio.]
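The random approach above shuffles the response once. A common extension, sketched here on simulated data and not part of the report, is to repeat the shuffle many times so that the observed gene-response association can be compared against a whole reference distribution. The data, the summary statistic (mean of the ten largest absolute Spearman correlations), and all names are hypothetical choices made for illustration.

```r
set.seed(7)
n.lines <- 19
n.genes <- 500

# Simulated expression (genes x cell lines) and a simulated response.
toy.expr <- matrix(rnorm(n.genes * n.lines), nrow = n.genes)
toy.gi50 <- rnorm(n.lines)

# Make the first 10 genes track the response.
toy.expr[1:10, ] <- toy.expr[1:10, ] +
  matrix(rep(3 * toy.gi50, each = 10), nrow = 10)

# Summary statistic: mean of the 10 largest absolute Spearman correlations.
top.abs.cor <- function(response) {
  r <- cor(t(toy.expr), response, method = "spearman")[, 1]
  mean(sort(abs(r), decreasing = TRUE)[1:10])
}

observed <- top.abs.cor(toy.gi50)

# Shuffling the response breaks any gene-response link while keeping
# the same expression matrix and the same set of response values.
shuffled <- replicate(200, top.abs.cor(sample(toy.gi50)))

# Fraction of shuffles that match or beat the observed statistic.
perm.fraction <- mean(shuffled >= observed)
print(c(observed = observed, perm.fraction = perm.fraction))
```

If predictors chosen from shuffled responses perform about as well as the real ones, the real predictors carry little drug-specific information; a single shuffle can show this only anecdotally.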

4.2 Doxorubicin

> currentdrug <- "doxorubicin"

For doxorubicin, we decided to use the 6 cell lines with the lowest concentrations as sensitive and the 6 cell lines with the highest concentrations as resistant.

> K <- 6

=== The following code was used to compute one of the quantiles of the GI50 values and produce the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "T47D"       "MDA-MB-453" "MDAMB157"   "Hs578T"     "MDA-MB-468"
 [6] "HBL100"     " "          "BT-549"     "BT20"       "SK-BR-3"
[11] "MDA-MB-435" "AU565"      "BT 474"     "ZR-751"     "MCF-7"
[16] "BT483"      "MDA-MB-231" "MDA-MB-436" "MDA-MB-361"

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "T47D"       "MDA-MB-453" "MDAMB157"   "Hs578T"     "MDA-MB-468"
[6] "HBL100"
> res.cell.lines
[1] "MDA-MB-361" "MDA-MB-436" "MDA-MB-231" "BT483"      "MCF-7"
[6] "ZR-751"

=== Get the mean GI50 values

> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL Hs578T MCF-7 MDA-MB-231 MDA-MB-361 MDA-MB-435 MDA-MB-436 MDA-MB MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR

=== (A) Statistical analysis of the cell line's expression measurements to identify genes that are significantly differentially expressed between the sensitive and resistant cell lines, by two-sample t-test.

(1) Making a new data set consisting of the expression data for the selected sensitive and resistant cell lines

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Performing the t-test and identifying genes

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve.]

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> which.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(which.one.significant)

Using FDR = 15%, we identified 0 predictors.

(3) Ordering the expression data by p-values, selecting the top 100 genes, and performing two-way clustering

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanofsensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanofresistant <- apply(new.dt[, sensitive == FALSE], 1, mean)

> meanofdiff <- -(meanofsensitive - meanofresistant)
> AveFoldChange <- 2^(meanofdiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanofsensitive, meanofresistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.n.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.n.genes.dt[6:dim(top.n.genes.dt)[[2]]]

Figure: Two-way hierarchical clustering for doxorubicin using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue = sensitive, red = resistant.

(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values.

(1) Computing the correlation between expression measurements and mean GI50 values

First, we order the cell line gene expression data so that the order of the expression data is consistent with the order of the mean GI50 values. Then we compute the correlation between the expression measurements and the mean GI50 measurements.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))
[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Computing p-values and modelling the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)
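As a sanity check on the beta-distribution approach to correlation p-values, the sketch below compares it against base R's cor.test on simulated data. It relies on a standard result for the Pearson correlation r of n bivariate normal pairs under the null hypothesis: (r + 1)/2 follows a Beta((n - 2)/2, (n - 2)/2) distribution. Beta.function as defined earlier in the report uses (n - 1)/2 as the shape parameter, so its p-values come out slightly smaller than the exact ones. The report also applies the formula to Spearman correlations, for which it is in any case an approximation. The helper name beta.pvalue and the simulated data are hypothetical.

```r
# Two-sided p-value for a correlation via a symmetric Beta distribution,
# with the shape parameter left adjustable.
beta.pvalue <- function(r, shape) {
  z <- (r + 1) / 2
  1 - 2 * abs(pbeta(z, shape, shape) - 1 / 2)
}

set.seed(3)
n <- 19
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
r <- cor(x, y)

p.exact <- beta.pvalue(r, (n - 2) / 2)   # agrees with the t-test on n - 2 df
p.report <- beta.pvalue(r, (n - 1) / 2)  # shape used by Beta.function
p.ttest <- cor.test(x, y)$p.value        # base R reference (Pearson)

print(c(exact = p.exact, report = p.report, cor.test = p.ttest))
```

With n = 19 the two shape choices differ only modestly, but the gap matters most for borderline genes when p-values are ranked or thresholded.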

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve.]

=== (3) Order the data by p-values and select the top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of all cell lines' mean GI50 values. Cell lines whose mean GI50 value is at or below the median are labelled sensitive; those above the median are labelled resistant.

> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] " "          "BT-549"     "BT20"       "HBL100"     "Hs578T"
 [6] "MDA-MB-435" "MDA-MB-453" "MDA-MB-468" "MDAMB157"   "T47D"

Resistant cell lines

> names(x[x == FALSE])
[1] "AU565"      "BT 474"     "BT483"      "MCF-7"      "MDA-MB-231"
[6] "MDA-MB-361" "MDA-MB-436" "SK-BR-3"    "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)

Figure: Two-way hierarchical clustering for doxorubicin using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue = sensitive, red = resistant.

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction, using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)

To cross-validate on the training data, we apply leave-two-out cross validation, i.e. we select two cell lines from the data (one sensitive and one resistant) as the validation set. We then perform the t-test on the remaining cell lines. As before, we select the top 100 probe sets (ranked by p-values) as predictors. Finally, we apply the selected predictors to predict the validation set. The selection of the top 100 predictors is repeated for each leave-two-out split.

> Leave.two.out <- data.frame(matrix(NA, ncol = K, nrow = 2))
> colnames(Leave.two.out) <- paste("n", 1:K, sep = "")
> rownames(Leave.two.out) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> K
[1] 6
> for (i in 1:K) {
+     M <- 2 * K + 1
+     set1 <- colnames(new.dt)[i]
+     set2 <- colnames(new.dt)[(M - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(new.dt), set)
+     set3
+     training.set <- new.dt[, match(set3, colnames(new.dt))]
+     v1 <- is.sens[match(set3, colnames(new.dt))]
+     ttest <- MultiTtest(training.set, v1 == "Sensitive")
+     ordered.dt <- training.set[order(ttest@p.values), ]
+     N <- 100
+     top.genes.dt <- ordered.dt[1:N, ]
+     test.set <- new.dt[, match(set, colnames(new.dt))]
+     v2 <- is.sens[match(set, colnames(new.dt))]
+     test.set <- data.frame(test.set)
+     rownames(test.set) <- rownames(new.dt)
+     top.gene.test.set <- test.set[match(rownames(top.genes.dt),
+         rownames(test.set)), ]
+     jk <- myfct.dlda(data.train = top.genes.dt, class.train = v1,
+         data.test = data.frame(top.gene.test.set), class.test = v2)
+     Leave.two.out[1, i] <- round(jk[[1]], 2)
+     Leave.two.out[2, i] <- round(jk[[4]], 2)
+ }

The results of cross validation

> Leave.two.out
                    n1 n2 n3 n4 n5 n6
TrainingAccuracy
CVPredictedAccuracy

> apply(Leave.two.out, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(1b) Prediction of human data set (testing set)

> MDA133.predict <- MDACC.133.log.dt[match(row.names(selected100.dt),
+     row.names(MDACC.133.log.dt)), ]

Re-define the class vector so that the RD cases are resistant and the pCR cases are sensitive. As defined earlier, we test for resistance; i.e., the true positive is RD and the true negative is pCR.

> testing.class <- is.pcr
> testing.class[testing.class == "RD"] <- "Resistant(RD)"
> testing.class[testing.class == "pcr"] <- "Sensitive(pCR)"
> ttest.pred <- myfct.dlda(data.train = selected100.dt, class.train = is.sens,
+     data.test = MDA133.predict, class.test = testing.class)
> names(ttest.pred)
[1] "TrainingAccuracy"              "SummaryTraining"
[3] "IndividualTrainingVsPredicted" "CVPredictedAccuracy"
[5] "ROC"                           "ProbOfClass1"
[7] "FPandTP"                       "SummaryTesting"
[9] "IndividualTestVsPredicted"

(1) Training set classification table

> ttest.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  6                  0
Predicted=Sensitive                  0                  6

(2) Testing set classification table

> ttest.pred[[8]]

                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)

(3) Prediction accuracy on the testing set

> ttest.pred[[4]]
[1]

(4) Summarizing sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plotting the receiver operating characteristic (ROC) curve.

> ttest.predictors <- my.function(ttest.pred[[8]])
> data.frame(ttest.predictors)
  Sensitivity Specificity PPV NPV
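The body of my.function is not reproduced in this part of the document. As a hedged illustration of the four quantities it reports, the sketch below computes them from a 2 x 2 classification table laid out like the ones above (predicted classes in rows, true classes in columns, with the resistant class treated as positive). `confusion.summary` is a hypothetical stand-in, not the authors' function, and the counts in `toy` are made up.

```r
# Hypothetical helper: rows are predicted classes, columns are true classes,
# and the positive class (resistant) comes first in both.
confusion.summary <- function(tab) {
  TP <- tab[1, 1]; FP <- tab[1, 2]
  FN <- tab[2, 1]; TN <- tab[2, 2]
  c(Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    PPV = TP / (TP + FP),
    NPV = TN / (TN + FN))
}

# Made-up counts for illustration only (filled column by column).
toy <- matrix(c(40, 10, 5, 45), nrow = 2,
              dimnames = list(c("Predicted=Resistant", "Predicted=Sensitive"),
                              c("True=Resistant", "True=Sensitive")))
toy.summary <- confusion.summary(toy)
```

Because the analysis tests for resistance, sensitivity here is the fraction of truly resistant cases that are predicted resistant.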

Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).

(2) Prediction using DLDA with the predictors identified from correlation.

(2a) Cross validation of cell line data. Again, we apply the leave-two-out cross validation approach.

> n <- dim(cell.line.dt)[[2]]
> Leave.two.out.cor <- data.frame(matrix(NA, ncol = (n - 1)/2,
+     nrow = 2))
> colnames(Leave.two.out.cor) <- paste("n", 1:((n - 1)/2), sep = "")
> rownames(Leave.two.out.cor) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> for (i in 1:(n/2)) {
+     set1 <- colnames(cell.line.dt)[i]

+     set2 <- colnames(cell.line.dt)[(n - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(cell.line.dt), set)
+     training.set <- cell.line.dt[, match(set3, colnames(cell.line.dt))]
+     v1 <- is.sens.cor[match(set3, colnames(cell.line.dt))]
+     used.mean.gi50 <- mean.gi50[match(set3, names(mean.gi50))]
+     cor.with.gi50 <- cor(t(training.set), used.mean.gi50, method = "spearman")
+     p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = n)
+     ordered.cor.dt <- training.set[order(p.value.cor), ]
+     N <- 100
+     top.cor.genes.dt <- ordered.cor.dt[1:N, ]
+     test.cor.set <- cell.line.dt[, match(set, colnames(cell.line.dt))]
+     v2 <- is.sens.cor[match(set, colnames(cell.line.dt))]
+     test.cor.set <- data.frame(test.cor.set)
+     rownames(test.cor.set) <- rownames(cell.line.dt)
+     top.gene.test.set <- test.cor.set[match(rownames(top.cor.genes.dt),
+         rownames(test.cor.set)), ]
+     jk <- myfct.dlda(data.train = top.cor.genes.dt, class.train = v1,
+         data.test = top.gene.test.set, class.test = v2)
+     Leave.two.out.cor[1, i] <- round(jk[[1]], 2)
+     Leave.two.out.cor[2, i] <- round(jk[[4]], 2)
+ }

The cross validation results:

> Leave.two.out.cor
                    n1 n2 n3 n4 n5 n6 n7 n8 n9
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out.cor, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(2b) Prediction on human data

> MDA133.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.cor.dt),
+     row.names(MDACC.133.log.dt)), ]

> cor.pred <- myfct.dlda(data.train = selected100.cor.dt, class.train = is.sens.cor,
+     data.test = MDA133.cor.pred, class.test = testing.class)

(1) Training set classification table

> cor.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  8                  0
Predicted=Sensitive                  1                 10

(2) Prediction accuracy on the training set

> cor.pred[[1]]
[1]

(3) Testing set classification table

> cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    43                      6

(4) Prediction accuracy on the testing set

> cor.pred[[4]]
[1]

(5) Summarizing sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plotting the receiver operating characteristic (ROC) curve.

> cor.predictors <- my.function(cor.pred[[8]])
> data.frame(cor.predictors)
  Sensitivity Specificity PPV NPV
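The classifier behind myfct.dlda is not shown in this section. As a rough illustration of what diagonal linear discriminant analysis (DLDA) does, the sketch below assigns each test sample to the class whose per-gene mean is closest after scaling by a pooled per-gene variance. `dlda.sketch` is a hypothetical name, equal class priors are an assumption, and the simulated data are made up; this is not the authors' implementation.

```r
# Minimal DLDA sketch: genes in rows, samples in columns (as in the document).
dlda.sketch <- function(train, train.class, test) {
  train <- as.matrix(train)
  test <- as.matrix(test)
  classes <- sort(unique(train.class))
  # Per-class mean of every gene (genes x classes matrix).
  means <- sapply(classes, function(k) {
    rowMeans(train[, train.class == k, drop = FALSE])
  })
  # Pooled within-class variance of every gene (the diagonal covariance).
  resid <- train - means[, match(train.class, classes), drop = FALSE]
  pooled.var <- rowSums(resid^2) / (ncol(train) - length(classes))
  # Variance-scaled squared distance from each test sample to each class mean.
  dist <- sapply(seq_along(classes), function(j) {
    colSums((test - means[, j])^2 / pooled.var)
  })
  dist <- matrix(dist, nrow = ncol(test))
  classes[apply(dist, 1, which.min)]
}

# Simulated example: 20 genes, 6 samples per class, class means 3 units apart.
set.seed(1)
G <- 20
cls <- rep(c("Resistant", "Sensitive"), each = 6)
shift <- ifelse(cls == "Sensitive", 3, 0)
train <- matrix(rnorm(G * 12), nrow = G) + rep(shift, each = G)
test <- matrix(rnorm(G * 12), nrow = G) + rep(shift, each = G)
pred <- dlda.sketch(train, cls, test)
```

Ignoring the between-gene covariances keeps the rule estimable when there are many more genes than samples, which is the situation with 100 probe sets and fewer than 20 cell lines.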

Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).

Finally, we apply a random approach, i.e., we use randomly identified predictors from the cell line data to predict the human data. The purpose of this analysis is to evaluate the prediction performance obtained with randomly selected chemo predictors from the cell line data.

(1) t-test approach

> K
[1] 6

> random.sen.cl <- names(sample(mess))[1:K]
> random.res.cl <- names(sample(mess))[19:(19 - K + 1)]
> random.sen.cell <- match(random.sen.cl, dimnames(cellline.log.dt)[[2]])
> random.res.cell <- match(random.res.cl, dimnames(cellline.log.dt)[[2]])
> random.dt <- data.frame(cellline.log.dt[, random.sen.cell], cellline.log.dt[,
+     random.res.cell])
> dim(random.dt)
[1]
> temp <- gsub("x ", " ", colnames(random.dt))
> colnames(random.dt) <- temp
> rm(temp)
> sen.vec <- rep(FALSE, ncol(random.dt))
> sen.vec[c(1:length(random.sen.cell))] <- TRUE
> Random.CL.t.test <- MultiTtest(random.dt, sen.vec == TRUE)
> Random.CL.bum <- Bum(Random.CL.t.test@p.values)
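The random baseline above assigns cell lines to pseudo-sensitive and pseudo-resistant groups without regard to their GI50 values. A minimal sketch of that idea in base R follows, under the assumption that a single random permutation is split at both ends; the cell line names `CL1` to `CL19` are placeholders, not the real names.

```r
set.seed(1000)

# Placeholder names (assumption); the real names come from the GI50 table.
cell.lines <- paste0("CL", 1:19)
K <- 6

# One random permutation: the first K entries play the role of "sensitive",
# the last K entries the role of "resistant".
shuffled <- sample(cell.lines)
random.sen <- shuffled[1:K]
random.res <- shuffled[19:(19 - K + 1)]
```

Drawing both groups from the same permutation guarantees that no cell line lands in both groups, which two independent calls to sample() would not ensure.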

Figure: BUM diagnostic plots for the random t-test p-values. Panels: Beta Uniform Mixture (P Values, Density); FDR Control (Desired False Discovery Rate, Significant P Value); Empirical Bayes (Significant P Value, Posterior Probability); ROC Curve (1 - spec, sens).

> ordered.random.dt <- random.dt[order(Random.CL.t.test@p.values),
+     ]
> N
[1] 100
> top.N.genes.Random.dt <- ordered.random.dt[1:N, ]
> MDA133.predict.Random <- MDACC.133.log.dt[match(row.names(top.N.genes.Random.dt),
+     row.names(MDACC.133.log.dt)), ]
> is.sens.random <- rep("resistant", ncol(random.dt))
> is.sens.random[c(1:length(random.sen.cl))] <- "Sensitive"
> testing.class.random <- is.pcr
> testing.class.random[testing.class.random == "RD"] <- "Resistant(RD)"
> testing.class.random[testing.class.random == "pcr"] <- "Sensitive(pCR)"
> ttest.pred.random <- myfct.dlda(data.train = top.N.genes.Random.dt,
+     class.train = is.sens.random, data.test = MDA133.predict.Random,
+     class.test = testing.class.random)

(1) Training set classification table

> ttest.pred.random[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  3                  1
Predicted=Sensitive                  3                  5
> ttest.pred.random[[1]]
[1]
> ttest.pred.random[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)
> ttest.pred.random[[4]]
[1]
> ttest.predictors.random <- my.function(ttest.pred.random[[8]])
> data.frame(ttest.predictors.random)
  Sensitivity Specificity PPV NPV

Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).

(2) Correlation approach

> random.mean.gi50 <- sample(mean.gi50)
> cor.with.gi50.random <- cor(t(cell.line.dt), random.mean.gi50,
+     method = "spearman")
> range(cor.with.gi50.random)
[1]
> p.value.cor.random <- Beta.function(x = cor.with.gi50.random[,
+     1], n = 19)
> cor.random.bum <- Bum(p.value.cor.random)

Figure: BUM diagnostic plots for the random correlation p-values. Panels: Beta Uniform Mixture (P Values, Density); FDR Control (Desired False Discovery Rate, Significant P Value); Empirical Bayes (Significant P Value, Posterior Probability); ROC Curve (1 - spec, sens).

> N
[1] 100
> new.random.cor.dt <- data.frame(p.value.cor.random, cor.with.gi50.random,
+     cell.line.dt)
> colnames(new.random.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.random.cor.dt <- new.random.cor.dt[order(new.random.cor.dt$pvalue),
+     ]
> top100.cor.random.dt <- ord.random.cor.dt[1:N, ]
> selected100.random.cor.dt <- top100.cor.random.dt[3:dim(top100.cor.random.dt)[[2]]]

Compute the median GI50 value and define the sensitive and resistant cell lines.

> random.x <- random.mean.gi50 <= median(random.mean.gi50)
> names(random.x[random.x == TRUE])

 [1] "BT-549"     "MDA-MB-435" "T47D"       "MDA-MB-468" " "
 [6] "BT20"       "MDAMB157"   "HBL100"     "Hs578T"     "MDA-MB-453"
> names(random.x[random.x == FALSE])
[1] "BT483"      "MDA-MB-436" "BT 474"     "MDA-MB-361" "ZR-751"
[6] "AU565"      "MDA-MB-231" "MCF-7"      "SK-BR-3"
> Sens.random <- match(names(random.x[random.x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor.random <- rep("resistant", dim(cell.line.dt)[[2]])
> is.sens.cor.random[Sens.random] <- "Sensitive"
> rm(random.x)

Prediction on human data

> MDA133.random.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.random.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> random.cor.pred <- myfct.dlda(data.train = selected100.random.cor.dt,
+     class.train = is.sens.cor.random, data.test = MDA133.random.cor.pred,
+     class.test = testing.class.random)

(1) Training set classification table

> random.cor.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  6                  3
Predicted=Sensitive                  3                  7

(2) Prediction accuracy on the training set

> random.cor.pred[[1]]
[1]

(3) Testing set classification table

> random.cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)

(4) Prediction accuracy on the testing set

> random.cor.pred[[4]]
[1]
> random.cor.predicted <- data.frame(unlist(random.cor.pred[[9]][2,
+     ]))
> colnames(random.cor.predicted) <- "Predicted"
> random.cor.predictors <- my.function(random.cor.pred[[8]])
> data.frame(random.cor.predictors)
  Sensitivity Specificity PPV NPV

Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).

4.3 Vinorelbine

> currentdrug <- "vinorelbine"

For vinorelbine, we use the 5 cell lines with the lowest concentrations as sensitive and the 5 cell lines with the highest concentrations as resistant.

> K <- 5

=== The following code computes the 0.25 quantile of the GI50 values for each cell line and produces the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "MDA-MB-435" "SK-BR-3"    "Hs578T"     "MDA-MB-453" "MDAMB157"
 [6] "AU565"      "HBL100"     "BT20"       "BT483"      "MDA-MB-436"
[11] "MDA-MB-468" "BT-549"     "ZR-751"     "MCF-7"      "MDA-MB-361"
[16] "T47D"       "MDA-MB-231" "BT 474"     " "

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "MDA-MB-435" "SK-BR-3"    "Hs578T"     "MDA-MB-453" "MDAMB157"
> res.cell.lines
[1] " "          "BT 474"     "MDA-MB-231" "T47D"       "MDA-MB-361"

=== Get the mean GI50 values

> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL100 Hs578T MCF-7 MDA-MB-231 MDA-MB-361 MDA-MB-435 MDA-MB-436 MDA-MB-453 MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR-751

=== (A) Performing statistical analysis on the cell line expression measurements to identify significantly differentially expressed genes between the sensitive and resistant cell lines, by two-sample t-test.

(1) Making a new data set consisting of the expression data of the selected sensitive and resistant cell lines

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Performing the t-test and identifying genes

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)

Figure: BUM diagnostic plots for the t-test p-values. Panels: Beta Uniform Mixture (P Values, Density); FDR Control (Desired False Discovery Rate, Significant P Value); Empirical Bayes (Significant P Value, Posterior Probability); ROC Curve (1 - spec, sens).

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> which.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(which.one.significant)

Using FDR = 15%, we identified 0 predictors.

(3) Ordering the expression data by p-values, selecting the top 100 genes, and performing two-way clustering

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanofsensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanofresistant <- apply(new.dt[, sensitive == FALSE], 1, mean)

> meanOfDiff <- -(meanofsensitive - meanofresistant)
> AveFoldChange <- 2^(meanOfDiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanofsensitive, meanofresistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.n.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.n.genes.dt[6:dim(top.n.genes.dt)[[2]]]

Figure: Two-way hierarchical clustering for vinorelbine using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue = sensitive, red = resistant.
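The signed fold change computed above (AveFoldChange) converts a difference of group means on the log2 scale into a ratio and reports ratios below 1 as negative reciprocals. A small self-contained sketch with made-up log2 means follows; `signed.fold.change` is an illustrative name that does not appear in the original code.

```r
signed.fold.change <- function(mean.sensitive, mean.resistant) {
  # Inputs are per-gene group means on the log2 scale.
  log2.diff <- mean.resistant - mean.sensitive
  fc <- 2^log2.diff
  # Ratios below 1 are reported as negative reciprocals, e.g. 0.5 becomes -2.
  fc[fc < 1] <- -1 / fc[fc < 1]
  fc
}

# Made-up log2 means for three genes.
fc <- signed.fold.change(mean.sensitive = c(5, 7, 6),
                         mean.resistant = c(6, 6, 6))
```

With this convention a value of 2 means twofold higher expression in the resistant group and a value of -2 means twofold higher expression in the sensitive group; no value falls strictly between -1 and 1.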

(B) Next, we identify genes based on the correlation between the expression measurements and the mean GI50 values.

(1) Computing the correlation between expression measurements and mean GI50 values

First, we order the cell line gene expression data so that its column order is consistent with the order of the mean GI50 values. Then we compute the correlation between the expression measurements and the mean GI50 values.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))
[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Computing p-values and modeling the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)
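To make the p-value step concrete, the sketch below repeats the Beta.function defined at the start of the document and evaluates it at three illustrative correlation values with n = 19. The correlation values are made up; only the function body comes from the document.

```r
# Beta.function as defined earlier in the document (n is the sample size).
Beta.function <- function(x, n) {
  z <- (x + 1) / 2                          # map correlation from [-1, 1] to [0, 1]
  y <- pbeta(z, (n - 1) / 2, (n - 1) / 2)   # symmetric Beta distribution function
  1 - 2 * abs(y - 1 / 2)                    # two-sided p-value
}

# Illustrative correlations: strongly negative, zero, strongly positive.
p <- Beta.function(c(-0.8, 0, 0.8), n = 19)
```

The function is symmetric in the sign of the correlation and returns 1 at zero correlation, so ordering genes by this p-value is equivalent to ordering them by decreasing absolute correlation.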

Figure: BUM diagnostic plots for the correlation p-values. Panels: Beta Uniform Mixture (P Values, Density); FDR Control (Desired False Discovery Rate, Significant P Value); Empirical Bayes (Significant P Value, Posterior Probability); ROC Curve (1 - spec, sens).

=== (3) Order the data by p-values and select the top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of the mean GI50 values across all cell lines. Cell lines whose mean GI50 value is less than or equal to the median are assigned as sensitive; those above the median are assigned as resistant.

> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] "AU565"      "BT-549"     "BT483"      "HBL100"     "Hs578T"
 [6] "MDA-MB-435" "MDA-MB-436" "MDA-MB-453" "MDAMB157"   "SK-BR-3"

Resistant cell lines

> names(x[x == FALSE])
[1] " "          "BT 474"     "BT20"       "MCF-7"      "MDA-MB-231"
[6] "MDA-MB-361" "MDA-MB-468" "T47D"       "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)
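The median split above can be reproduced on a toy vector. The sketch below uses five made-up mean GI50 values with placeholder names A to E; neither the names nor the values come from the data.

```r
# Made-up mean GI50 values for five hypothetical cell lines.
toy.gi50 <- c(A = 0.2, B = 1.5, C = 0.7, D = 3.1, E = 0.9)

# At or below the median: sensitive; above the median: resistant.
toy.class <- ifelse(toy.gi50 <= median(toy.gi50), "Sensitive", "Resistant")
```

Because the comparison is "less than or equal", the cell line sitting exactly at the median goes to the sensitive group when the number of cell lines is odd, which is why the 19 cell lines above split into 10 sensitive and 9 resistant.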

Figure: Two-way hierarchical clustering for vinorelbine using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue = sensitive, red = resistant.

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction using DLDA with the predictors identified by t-test.

> is.sens <- rep("resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)
