Supp Figure 1A


Supp Figure 1B

Supp Figure 1C

Supp Figure 1D

Supp File 1 (Table: supp file 1.pdf)

Prediction of Chemotherapy Response from Breast Cancer Cell Lines to Human Cancer Expression Data

September 19, 2008

> library(splines)
> library(oompaBase)
> library(mclust)
use of mclust requires a license agreement see
> library(nlme)
> library(PreProcess)
> library(ClassComparison)
> library(cluster)
> library(ClassDiscovery)

1 Loading MDACC 133 Array's Gene Expression Data

1.1 Load the expression data

We use the mean-adjusted expression data, i.e. the expression data were adjusted to eliminate batch effects.

> set.seed(1000)
> our.dir <- "//mdabam1/bioinfo/private/lajos-chemo-prediction/supplementary/"
> mdacc.file.name <- "MDA133/Mean-Adjusted-raw-MBEI-MDA133.txt"
> MDACC.centered.133 <- read.delim(paste(our.dir, mdacc.file.name,
+     sep = ""), header = TRUE, sep = "\t")
> dim(MDACC.centered.133)
[1]

The first column is the probe set ID. We separate the probe set ID from the expression data and then log-transform the expression data.

> MDACC.centered133.dt <- MDACC.centered.133[, -1]
> row.names(MDACC.centered133.dt) <- MDACC.centered.133[, 1]
> MDACC.133.log.dt <- log2(MDACC.centered133.dt + 1)
> rm(MDACC.centered.133)

1.2 Load associated clinical info

> MDACC.clinical <- "MDA133/MDA133CompleteInfo txt"
> MDACC.133.clinical <- read.delim(paste(our.dir, MDACC.clinical,
+     sep = ""), header = TRUE, sep = "\t")
> all(MDACC.133.clinical$idtext == colnames(MDACC.133.log.dt))
[1] TRUE

The array IDs appear in the same order in the gene expression data and in the associated clinical data. Next, define a label vector for the pCR and RD cases. From the clinical information, we know there are 34 pCR cases and 99 RD cases.

> is.pcr <- rep("RD", length(MDACC.133.clinical$pcrtxt))
> is.pcr[MDACC.133.clinical$pcrtxt == "pcr"] <- "pcr"

Introduce a function to compute p-values from correlation coefficients, based on the beta distribution.

> Beta.function <- function(x, n) {
+     z <- (x + 1)/2
+     y <- pbeta(z, (n - 1)/2, (n - 1)/2)
+     p <- 1 - 2 * abs(y - 1/2)
+     return(p)
+ }

(n is the sample size)

Introduce another function for computing sensitivity, specificity, PPV, and NPV from DLDA results. Note that to compute these quantities it is important to define what we test for. In our analysis, we test for resistance (RD): the true positive is RD and the true negative is pCR.

> my.function <- function(x) {
+     Sens <- x[1]/(x[1] + x[2])
+     Spec <- x[4]/(x[3] + x[4])
+     PPV <- x[1]/(x[1] + x[3])
+     NPV <- x[4]/(x[2] + x[4])
+     list(Sensitivity = Sens, Specificity = Spec, PPV = PPV, NPV = NPV)
+ }

2 Loading Cell Line's Gene Expression Data

2.1 Load the expression data

> fname <- "cell-line-data/mbei-for-cellline-data-from-cornelia txt"
> chemo.cell.line.dt <- read.table(paste(our.dir, fname, sep = ""),
+     header = TRUE, row.names = NULL, skip = 0, sep = "\t")

The first column is the probe set ID; we separate it from the data.

> ProbeSet.ID <- chemo.cell.line.dt[, 1]
> cellline.dt <- chemo.cell.line.dt[, -1]
> rm(chemo.cell.line.dt)

2.2 Load array information file and replace array ID by cell line names

> infonames <- "cell-line-data/cell-line-info.txt"
> info.file <- read.table(paste(our.dir, infonames, sep = ""),
+     header = TRUE, row.names = NULL, skip = 0, sep = "\t")
> dimnames(info.file)[[2]]
[1] "Number"    "Cell.Line" "Array"     "File.Name"
> dimnames(cellline.dt)[[2]] <- info.file[, 2]

2.3 Data transformation and load other cell line information

> CellLineOtherInfo <- "Documents/cell screening progress note_cl.txt"
> other.info.cell.lines <- read.table(paste(our.dir, CellLineOtherInfo,
+     sep = ""), header = TRUE, row.names = NULL, skip = 0, sep = "\t")

2.4 Clustering for QC

(1) First define a filter to remove noisy measurements.

> max.vec <- apply(cellline.log.dt, 1, max)
> q15 <- quantile(as.matrix(cellline.log.dt), 0.15)
> f.vector <- max.vec >= q15

(2) Define a vector to remove control spots from the expression measurements.

> is.not.controls <- rep(TRUE, dim(cellline.log.dt)[[1]])
> is.not.controls[grep("AFFX", dimnames(cellline.log.dt)[[1]])] <- FALSE

After filtering and removing the controls:

> cellline.log.dt <- cellline.log.dt[(f.vector & is.not.controls),
+     ]
> dim(cellline.log.dt)
[1]

We will use this dataset to identify chemo predictors.

(3) Perform cluster analysis.

> chemo.hc <- hclust(distanceMatrix(cellline.log.dt, "pearson"),
+     method = "complete")
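For readers without the OOMPA packages, the clustering step can be approximated in base R. The sketch below is illustrative only: the toy matrix and sample names are made up, and it assumes a correlation distance of the form 1 - r, which may differ from the exact scaling that distanceMatrix applies.

```r
# Toy expression matrix: rows are probe sets, columns are samples (made-up values).
toy.expr <- cbind(A = c(1, 2, 3, 4, 5),
                  B = c(1.1, 2.1, 2.9, 4.2, 5),
                  C = c(5, 4, 3, 2, 1),
                  D = c(5.1, 3.9, 3.2, 2, 0.9))

# Pearson correlation between samples, turned into a distance (assumed form: 1 - r).
toy.dist <- as.dist(1 - cor(toy.expr, method = "pearson"))

# Complete-linkage hierarchical clustering, the linkage used in the report.
toy.hc <- hclust(toy.dist, method = "complete")

# Cut the tree into two groups to see which samples cluster together.
toy.groups <- cutree(toy.hc, k = 2)
print(toy.groups)
```

Because complete linkage depends only on the ordering of the distances, rescaling the distance by a positive constant would not change the tree topology.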

[Figure: Cluster Dendrogram (vertical axis: Height). Leaf labels, left to right: AU565, BT483, MDA MB 453, MDA MB 468, BT 474, MDA MB 361, T47D, BT20, ZR 751, MCF 7, SK BR 3, MDA MB 436, MDA MB, BT 549, MDA MB 435, Hs578T, MDAMB157, HBL100]

Figure 1: Hierarchical clustering using all probe sets. Two distinct clusters can be seen. Compared with the available clinical information, these two clusters seem to be related to ER status (see attached figure: clustering.pdf).

3 Load GI50 data

> data <- read.table(paste(our.dir, "Documents/krc-parsed.tsv",
+     sep = ""), sep = "\t", header = TRUE)
> data[, "Step"] <- factor(data[, "Step"])
> temp <- read.table(paste(our.dir, "Documents/translateConc.tsv",
+     sep = ""), sep = "\t", header = TRUE)
> concentrations <- temp[, "PowerOfTen"]
> names(concentrations) <- temp[, "TargetConc"]

4 Identifying Genes between Sensitive and Resistant Cell Lines from Gene Expression Data

There are two ways to identify genes: (1) a two-sample t-test between sensitive and resistant cell lines; and (2) the correlation between expression data and GI50 (the drug-treated cell lines' data). We apply both approaches for each drug. To perform the t-test, we need to identify sensitive and resistant cell lines for each drug. We discussed this issue in the last meeting and decided to select sensitive and resistant cell lines based on the boxplot of the GI50 values from resampled dose-response curves for each drug (see the report Breast Cancer Cell Line Dose Response, issued 3 August 2007). Identifying sensitive and resistant cell lines was illustrated in the previous report; here we only outline the critical step.

4.1 Paclitaxel

> currentdrug <- "paclitaxel"

For paclitaxel, we decided to use the 8 cell lines with the lowest concentrations as sensitive and the 8 cell lines with the highest concentrations as resistant.

> K <- 8

=== The following code was used to compute one of the quantiles of the GI50 values and produce the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "MDA-MB-435" "Hs578T"     "MDAMB157"   "HBL100"     "AU565"
 [6] "MDA-MB-436" "BT-549"     "MDA-MB-468" "BT483"      "BT20"
[11] "MDA-MB-231" "MDA-MB-453" "MCF-7"      " "          "BT 474"
[16] "MDA-MB-361" "SK-BR-3"    "T47D"       "ZR-751"
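The selection rule described above (rank the cell lines by the 25th percentile of their resampled GI50 values, then take the K lowest as sensitive and the K highest as resistant) can be sketched on toy data. All names and values below are hypothetical and only illustrate the ranking logic; they are not taken from the report's GI50 data.

```r
# Toy GI50 matrix: one row per cell line, one column per resampled estimate.
centers <- c(-9.0, -6.5, -8.0, -7.0, -8.5, -7.5)    # made-up log10 GI50 centers
offsets <- seq(-0.2, 0.2, length.out = 5)           # made-up resampling spread
toy.gi50 <- outer(centers, offsets, "+")
rownames(toy.gi50) <- paste0("line", 1:6)

K <- 2

# 25th percentile of the resampled GI50 values for each cell line.
q25 <- apply(toy.gi50, 1, quantile, 0.25)

# Rank the lines from lowest (most sensitive) to highest (most resistant).
ranked <- names(sort(q25))

sen.toy <- head(ranked, K)        # K lowest quantiles
res.toy <- rev(tail(ranked, K))   # K highest quantiles, most resistant first

print(sen.toy)
print(res.toy)
```

Lines ranked in the middle are left out of the t-test comparison, which is why 2K can be smaller than the number of cell lines.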

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "MDA-MB-435" "Hs578T"     "MDAMB157"   "HBL100"     "AU565"
[6] "MDA-MB-436" "BT-549"     "MDA-MB-468"
> res.cell.lines
[1] "ZR-751"     "T47D"       "SK-BR-3"    "MDA-MB-361" "BT 474"
[6] " "          "MCF-7"      "MDA-MB-453"

=== Get the mean GI50 values

> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL100 Hs578T MCF-7 MDA-MB-231 MDA-MB-361 MDA-MB-435 MDA-MB-436 MDA-MB-453 MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR-751

=== (A) Perform statistical analysis on the cell lines' expression measurements to identify genes that are significantly differentially expressed between the sensitive and resistant cell lines, using a two-sample t-test.

(1) Make a new data set consisting of the expression data of the selected sensitive and resistant cell lines

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Perform the t-test and identify genes

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve]

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> which.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(which.one.significant)

Using FDR = 15%, we identified 2156 predictors.

(3) Order the expression data by p-value, select the top 100 genes, and perform two-way clustering

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanofsensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanofresistant <- apply(new.dt[, sensitive == FALSE], 1, mean)
> meanOfDiff <- -(meanofsensitive - meanofresistant)
> AveFoldChange <- 2^(meanOfDiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanofsensitive, meanofresistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.N.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.N.genes.dt[6:dim(top.N.genes.dt)[[2]]]
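The AveFoldChange step above converts a difference of log2 means (resistant minus sensitive) into a signed fold change: ratios below 1 are reported as the negative reciprocal, so a halving appears as -2 instead of 0.5. A minimal sketch of that conversion follows; the function name signed.fold.change is introduced here for illustration and does not appear in the report.

```r
# Convert log2 differences into signed fold changes.
# Positive values: higher in the resistant group; negative: lower.
signed.fold.change <- function(log2.diff) {
  fc <- 2^log2.diff            # back-transform from the log2 scale to a ratio
  down <- fc < 1               # ratios below 1 indicate lower expression
  fc[down] <- -1 / fc[down]    # report as the negative reciprocal
  fc
}

toy.diff <- c(1, -1, 0, 2)     # made-up log2 differences
toy.fc <- signed.fold.change(toy.diff)
print(toy.fc)
```

With this convention no value falls strictly between -1 and 1, which is worth keeping in mind when plotting or averaging the result.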

[Figure: two-way clustered heatmap. Column labels, left to right: AU565, HBL100, MDA.MB.435, Hs578T, MDAMB157, MDA.MB.436, BT.549, T47D, ZR.751, SK.BR.3, MCF.7, MDA.MB.361, BT.474, X, MDA.MB.468, MDA.MB.453]

Figure: Two-way hierarchical clustering for paclitaxel using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue = sensitive, red = resistant.

(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values

(1) Compute the correlation between expression measurements and mean GI50 values

First, we order the cell line gene expression data so that the order of the expression data is consistent with the order of the mean GI50 values. Then we compute the correlation between the expression measurements and the mean GI50 measurements.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))

[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Compute p-values and model the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve]

=== (3) Order the data by p-values and select the top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of all cell lines' mean GI50 values. Cell lines whose mean GI50 value is at or below the median are assigned as sensitive, and those above the median as resistant.

> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] "AU565"      "BT-549"     "BT20"       "BT483"      "HBL100"
 [6] "Hs578T"     "MDA-MB-231" "MDA-MB-435" "MDA-MB-436" "MDAMB157"

Resistant cell lines

> names(x[x == FALSE])
[1] " "          "BT 474"     "MCF-7"      "MDA-MB-361" "MDA-MB-453"
[6] "MDA-MB-468" "SK-BR-3"    "T47D"       "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)
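The correlation-based p-values used in this part come from the report's Beta.function, which treats (r + 1)/2 as Beta((n - 1)/2, (n - 1)/2) under the null hypothesis. The sketch below reproduces that function and shows its behavior on a few made-up correlation values. For comparison only, it also evaluates the textbook null distribution of a Pearson correlation under normality, which uses shape (n - 2)/2; the helper exact.p and that comparison are additions of this example, not part of the report, and for Spearman correlations either form is an approximation.

```r
# The report's helper: two-sided p-value from a correlation coefficient.
Beta.function <- function(x, n) {
  z <- (x + 1) / 2
  y <- pbeta(z, (n - 1) / 2, (n - 1) / 2)
  1 - 2 * abs(y - 1 / 2)
}

# Hypothetical comparison helper: same construction with shape (n - 2) / 2.
exact.p <- function(x, n) {
  y <- pbeta((x + 1) / 2, (n - 2) / 2, (n - 2) / 2)
  1 - 2 * abs(y - 1 / 2)
}

n <- 19                                  # number of cell lines in the report
r <- c(-0.8, -0.3, 0, 0.3, 0.8)          # made-up correlation values
p <- Beta.function(r, n)
print(signif(p, 3))

# Check the comparison helper against cor.test on a small made-up sample.
x <- 1:8
y <- c(2, 1, 4, 3, 7, 5, 8, 6)
print(all.equal(exact.p(cor(x, y), length(x)), cor.test(x, y)$p.value))
```

The p-value is symmetric in r and equals 1 at r = 0. With the larger shape parameter the report's version gives somewhat smaller p-values than the (n - 2)/2 form for the same nonzero r, although the ranking of genes by p-value is the same under both.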

[Figure: two-way clustered heatmap. Column labels, left to right: BT 549, MDA MB 435, MDA MB 436, AU565, HBL100, Hs578T, MDAMB, MDA MB 231, BT483, MDA MB 453, MDA MB 468, ZR 751, MCF 7, SK BR 3, BT20, T47D, BT 474, MDA MB 361]

Figure: Two-way hierarchical clustering for paclitaxel using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue = sensitive, red = resistant.

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction, using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)

To cross-validate on the training data, we apply leave-two-out cross validation, i.e. we select two cell lines from the data (one sensitive and one resistant) as the validation set. Then we perform the t-test on the remaining cell line data. As before, we select the top 100 probe sets (ranked by p-value) as predictors. Finally, we apply the selected predictors to predict the validation set. The selection of the top 100 predictors is repeated for each leave-two-out split.

> Leave.two.out <- data.frame(matrix(NA, ncol = K, nrow = 2))
> colnames(Leave.two.out) <- paste("n", 1:K, sep = "")
> rownames(Leave.two.out) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> K
[1] 8
> for (i in 1:K) {
+     M <- 2 * K + 1
+     set1 <- colnames(new.dt)[i]
+     set2 <- colnames(new.dt)[(M - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(new.dt), set)
+     set3
+     training.set <- new.dt[, match(set3, colnames(new.dt))]
+     v1 <- is.sens[match(set3, colnames(new.dt))]
+     ttest <- MultiTtest(training.set, v1 == "Sensitive")
+     ordered.dt <- training.set[order(ttest@p.values), ]
+     N <- 100
+     top.genes.dt <- ordered.dt[1:N, ]
+     test.set <- new.dt[, match(set, colnames(new.dt))]
+     v2 <- is.sens[match(set, colnames(new.dt))]
+     test.set <- data.frame(test.set)
+     rownames(test.set) <- rownames(new.dt)
+     top.gene.test.set <- test.set[match(rownames(top.genes.dt),
+         rownames(test.set)), ]
+     jk <- myfct.dlda(data.train = top.genes.dt, class.train = v1,
+         data.test = data.frame(top.gene.test.set), class.test = v2)
+     Leave.two.out[1, i] <- round(jk[[1]], 2)
+     Leave.two.out[2, i] <- round(jk[[4]], 2)
+ }

The results of cross validation:

> Leave.two.out
                    n1 n2 n3 n4 n5 n6 n7 n8
TrainingAccuracy
CVPredictedAccuracy
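The pairing behind the leave-two-out scheme can be checked on column names alone. The sketch below assumes, as the description above states, that each split holds out one sensitive and one resistant cell line, and that the data set lists the K sensitive lines first and the K resistant lines last, so that split i pairs column i with column 2K + 1 - i. The column names are placeholders.

```r
K <- 8

# Placeholder column names: K sensitive lines followed by K resistant lines.
toy.cols <- c(paste0("sens", 1:K), paste0("res", 1:K))

# For split i, hold out column i and its mirror column 2K + 1 - i.
held.out <- t(sapply(1:K, function(i) c(toy.cols[i], toy.cols[2 * K + 1 - i])))
colnames(held.out) <- c("set1", "set2")
print(held.out)

# Every split trains on the remaining 2K - 2 columns.
training.sizes <- sapply(1:K, function(i) length(setdiff(toy.cols, held.out[i, ])))
print(training.sizes)
```

Each cell line is held out exactly once across the K splits, so the K cross-validated accuracies are based on non-overlapping validation pairs.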

> apply(Leave.two.out, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(1b) Prediction of human data set (testing set)

> MDA133.predict <- MDACC.133.log.dt[match(row.names(selected100.dt),
+     row.names(MDACC.133.log.dt)), ]

Relabel the class vector so that the RD cases are resistant and the pCR cases are sensitive. As defined above, we test for resistance, i.e. the true positive is RD and the true negative is pCR.

> testing.class <- is.pcr
> testing.class[testing.class == "RD"] <- "Resistant(RD)"
> testing.class[testing.class == "pcr"] <- "Sensitive(pCR)"
> ttest.pred <- myfct.dlda(data.train = selected100.dt, class.train = is.sens,
+     data.test = MDA133.predict, class.test = testing.class)
> names(ttest.pred)
[1] "TrainingAccuracy"              "SummaryTraining"
[3] "IndividualTrainingVsPredicted" "CVPredictedAccuracy"
[5] "ROC"                           "ProbOfClass1"
[7] "FPandTP"                       "SummaryTesting"
[9] "IndividualTestVsPredicted"

(1) Training set classification table

> ttest.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  8                  0
Predicted=Sensitive                  0                  8

(2) Testing set classification table

> ttest.pred[[8]]

                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    17                      7

(3) Prediction accuracy on testing set

> ttest.pred[[4]]
[1]

(4) Summarize sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plot the receiver operating characteristic (ROC) curve.

> ttest.predictors <- my.function(ttest.pred[[8]])
> data.frame(ttest.predictors)
  Sensitivity Specificity PPV NPV
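The summary statistics above are computed from the 2 x 2 classification table, read in column-major order: with predicted classes in rows and true classes in columns, elements 1 to 4 are the true positives, false negatives, false positives, and true negatives when testing for resistance. The following sketch applies the same arithmetic to a made-up table; the counts are purely illustrative and are not results from the report.

```r
# Same arithmetic as the report's my.function, restated for a standalone run.
my.function <- function(x) {
  Sens <- x[1] / (x[1] + x[2])   # TP / (TP + FN)
  Spec <- x[4] / (x[3] + x[4])   # TN / (FP + TN)
  PPV  <- x[1] / (x[1] + x[3])   # TP / (TP + FP)
  NPV  <- x[4] / (x[2] + x[4])   # TN / (FN + TN)
  list(Sensitivity = Sens, Specificity = Spec, PPV = PPV, NPV = NPV)
}

# Made-up classification table: rows are predictions, columns are true classes.
toy.table <- matrix(c(8, 2, 3, 7), nrow = 2,
                    dimnames = list(c("Predicted=Resistant", "Predicted=Sensitive"),
                                    c("True=Resistant", "True=Sensitive")))
print(toy.table)

toy.summary <- my.function(toy.table)
print(data.frame(toy.summary))
```

If the table were built with true classes in rows instead, elements 2 and 3 would swap roles, so the orientation of the table matters when reusing this helper.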

[Figure: Empirical ROC curve; axes: Sensitivity vs. False Positive Ratio]

(2) Prediction, using DLDA with the predictors identified from correlation.

(2a) Cross validation of cell line data. Again, we apply the leave-two-out cross validation approach.

> n <- dim(cell.line.dt)[[2]]
> Leave.two.out.cor <- data.frame(matrix(NA, ncol = (n - 1)/2,
+     nrow = 2))
> colnames(Leave.two.out.cor) <- paste("n", 1:((n - 1)/2), sep = "")
> rownames(Leave.two.out.cor) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> for (i in 1:(n/2)) {
+     set1 <- colnames(cell.line.dt)[i]
+     set2 <- colnames(cell.line.dt)[(n - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(cell.line.dt), set)
+     training.set <- cell.line.dt[, match(set3, colnames(cell.line.dt))]
+     v1 <- is.sens.cor[match(set3, colnames(cell.line.dt))]
+     used.mean.gi50 <- mean.gi50[match(set3, names(mean.gi50))]
+     cor.with.gi50 <- cor(t(training.set), used.mean.gi50, method = "spearman")
+     p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = n)
+     ordered.cor.dt <- training.set[order(p.value.cor), ]
+     N <- 100
+     top.cor.genes.dt <- ordered.cor.dt[1:N, ]
+     test.cor.set <- cell.line.dt[, match(set, colnames(cell.line.dt))]
+     v2 <- is.sens.cor[match(set, colnames(cell.line.dt))]
+     test.cor.set <- data.frame(test.cor.set)
+     rownames(test.cor.set) <- rownames(cell.line.dt)
+     top.gene.test.set <- test.cor.set[match(rownames(top.cor.genes.dt),
+         rownames(test.cor.set)), ]
+     jk <- myfct.dlda(data.train = top.cor.genes.dt, class.train = v1,
+         data.test = top.gene.test.set, class.test = v2)
+     Leave.two.out.cor[1, i] <- round(jk[[1]], 2)
+     Leave.two.out.cor[2, i] <- round(jk[[4]], 2)
+ }

The cross validation results:

> Leave.two.out.cor
                    n1 n2 n3 n4 n5 n6 n7 n8 n9
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out.cor, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(2b) Prediction on human data

> MDA133.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.cor.dt),
+     row.names(MDACC.133.log.dt)), ]

> cor.pred <- myfct.dlda(data.train = selected100.cor.dt, class.train = is.sens.cor,
+     data.test = MDA133.cor.pred, class.test = testing.class)

(1) Training set classification table

> cor.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  9                  2
Predicted=Sensitive                  0                  8

(2) Prediction accuracy on training set

> cor.pred[[1]]
[1]

(3) Testing set classification table

> cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                     9                      7

(4) Prediction accuracy on testing set

> cor.pred[[4]]
[1]

(5) Summarize sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plot the receiver operating characteristic (ROC) curve.

> cor.predictors <- my.function(cor.pred[[8]])
> data.frame(cor.predictors)
  Sensitivity Specificity PPV NPV
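The report's myfct.dlda function is not listed, so its internals are unknown. As background, the following is a minimal sketch of diagonal linear discriminant analysis (DLDA) in base R, assuming equal class priors and a pooled per-gene variance. The functions dlda.fit and dlda.predict and the toy data are hypothetical and are not the authors' implementation.

```r
# Fit DLDA: per-class gene means and a pooled within-class variance per gene.
# train: numeric matrix, genes in rows, samples in columns.
dlda.fit <- function(train, labels) {
  classes <- sort(unique(labels))
  means <- sapply(classes, function(k) rowMeans(train[, labels == k, drop = FALSE]))
  resid <- train - means[, match(labels, classes), drop = FALSE]
  pooled.var <- rowSums(resid^2) / (ncol(train) - length(classes))
  list(classes = classes, means = means, pooled.var = pooled.var)
}

# Predict: assign each test sample to the class with the smallest
# variance-scaled squared distance to the class mean.
dlda.predict <- function(fit, test) {
  scores <- sapply(seq_along(fit$classes), function(j)
    colSums((test - fit$means[, j])^2 / fit$pooled.var))
  scores <- matrix(scores, nrow = ncol(test))
  fit$classes[apply(scores, 1, which.min)]
}

# Made-up training data: 3 genes, 2 sensitive and 2 resistant samples.
toy.train <- cbind(s1 = c(1.0, 5.0, 3.0), s2 = c(1.2, 5.1, 2.9),
                   r1 = c(4.0, 2.0, 3.1), r2 = c(4.1, 1.8, 3.0))
toy.labels <- c("Sensitive", "Sensitive", "Resistant", "Resistant")
toy.test <- cbind(a = c(1.1, 4.9, 3.0), b = c(3.9, 2.1, 3.0))

toy.fit <- dlda.fit(toy.train, toy.labels)
toy.pred <- dlda.predict(toy.fit, toy.test)
print(toy.pred)
```

Because the covariance matrix is taken to be diagonal, each gene contributes independently to the score, which keeps the classifier usable when there are far more genes than samples, as with the 100 probe sets and fewer than 20 cell lines here.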

[Figure: Empirical ROC curve; axes: Sensitivity vs. False Positive Ratio]

Finally, we apply a random approach, i.e. we use randomly identified predictors from the cell line data to predict the human data. The purpose of this analysis is to evaluate the prediction performance obtained with randomly selected chemo predictors from the cell line data.

(1) t-test approach

> K
[1] 8

> random.sen.cl <- names(sample(mess))[1:K]
> random.res.cl <- names(sample(mess))[19:(19 - K + 1)]
> random.sen.cell <- match(random.sen.cl, dimnames(cellline.log.dt)[[2]])
> random.res.cell <- match(random.res.cl, dimnames(cellline.log.dt)[[2]])
> random.dt <- data.frame(cellline.log.dt[, random.sen.cell], cellline.log.dt[,
+     random.res.cell])
> dim(random.dt)
[1]
> temp <- gsub("x ", " ", colnames(random.dt))
> colnames(random.dt) <- temp
> rm(temp)
> sen.vec <- rep(FALSE, ncol(random.dt))
> sen.vec[c(1:length(random.sen.cell))] <- TRUE
> Random.CL.t.test <- MultiTtest(random.dt, sen.vec == TRUE)
> Random.CL.bum <- Bum(Random.CL.t.test@p.values)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve]

> ordered.random.dt <- random.dt[order(Random.CL.t.test@p.values),
+     ]
> N
[1] 100
> top.n.genes.random.dt <- ordered.random.dt[1:N, ]
> MDA133.predict.Random <- MDACC.133.log.dt[match(row.names(top.n.genes.random.dt),
+     row.names(MDACC.133.log.dt)), ]
> is.sens.random <- rep("Resistant", ncol(random.dt))
> is.sens.random[c(1:length(random.sen.cl))] <- "Sensitive"
> testing.class.random <- is.pcr
> testing.class.random[testing.class.random == "RD"] <- "Resistant(RD)"
> testing.class.random[testing.class.random == "pcr"] <- "Sensitive(pCR)"

> ttest.pred.random <- myfct.dlda(data.train = top.n.genes.random.dt,
+     class.train = is.sens.random, data.test = MDA133.predict.Random,
+     class.test = testing.class.random)

(1) Training set classification table

> ttest.pred.random[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  6                  3
Predicted=Sensitive                  2                  5
> ttest.pred.random[[1]]
[1]
> ttest.pred.random[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    24                      9
> ttest.pred.random[[4]]
[1]
> ttest.predictors.random <- my.function(ttest.pred.random[[8]])
> data.frame(ttest.predictors.random)
  Sensitivity Specificity PPV NPV

[Figure: Empirical ROC curve; axes: Sensitivity vs. False Positive Ratio]

(2) Correlation approach

> random.mean.gi50 <- sample(mean.gi50)
> cor.with.gi50.random <- cor(t(cell.line.dt), random.mean.gi50,
+     method = "spearman")
> range(cor.with.gi50.random)
[1]
> p.value.cor.random <- Beta.function(x = cor.with.gi50.random[,
+     1], n = 19)
> cor.random.bum <- Bum(p.value.cor.random)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve]

> N
[1] 100
> new.random.cor.dt <- data.frame(p.value.cor.random, cor.with.gi50.random,
+     cell.line.dt)
> colnames(new.random.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.random.cor.dt <- new.random.cor.dt[order(new.random.cor.dt$pvalue),
+     ]
> top100.cor.random.dt <- ord.random.cor.dt[1:N, ]
> selected100.random.cor.dt <- top100.cor.random.dt[3:dim(top100.cor.random.dt)[[2]]]

Compute the median GI50 value and define sensitive and resistant cell lines

> random.x <- random.mean.gi50 <= median(random.mean.gi50)
> names(random.x[random.x == TRUE])

 [1] "HBL100"     "BT20"       "MDA-MB-435" "MDA-MB-436" "AU565"
 [6] "MDAMB157"   "Hs578T"     "MDA-MB-231" "BT483"      "BT-549"
> names(random.x[random.x == FALSE])
[1] "ZR-751"     "MCF-7"      " "          "MDA-MB-361" "MDA-MB-453"
[6] "MDA-MB-468" "T47D"       "BT 474"     "SK-BR-3"
> Sens.random <- match(names(random.x[random.x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor.random <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor.random[Sens.random] <- "Sensitive"
> rm(random.x)

Prediction on human data

> MDA133.random.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.random.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> random.cor.pred <- myfct.dlda(data.train = selected100.random.cor.dt,
+     class.train = is.sens.cor.random, data.test = MDA133.random.cor.pred,
+     class.test = testing.class.random)

(1) Training set classification table

> random.cor.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  7                  1
Predicted=Sensitive                  2                  9

(2) Prediction accuracy on training set

> random.cor.pred[[1]]
[1]

(3) Testing set classification table

> random.cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)

(4) Prediction accuracy on testing set

> random.cor.pred[[4]]
> random.cor.predicted <- data.frame(unlist(random.cor.pred[[9]][2,
+     ]))
> colnames(random.cor.predicted) <- "Predicted"
> random.cor.predictors <- my.function(random.cor.pred[[8]])
> data.frame(random.cor.predictors)
  Sensitivity Specificity PPV NPV

[Figure: Empirical ROC curve; axes: Sensitivity vs. False Positive Ratio]

4.2 Doxorubicin

> currentdrug <- "doxorubicin"

For doxorubicin, we decided to use the 6 cell lines with the lowest concentrations as sensitive and the 6 cell lines with the highest concentrations as resistant.

> K <- 6

=== The following code was used to compute one of the quantiles of the GI50 values and produce the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "T47D"       "MDA-MB-453" "MDAMB157"   "Hs578T"     "MDA-MB-468"
 [6] "HBL100"     " "          "BT-549"     "BT20"       "SK-BR-3"
[11] "MDA-MB-435" "AU565"      "BT 474"     "ZR-751"     "MCF-7"
[16] "BT483"      "MDA-MB-231" "MDA-MB-436" "MDA-MB-361"

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "T47D"       "MDA-MB-453" "MDAMB157"   "Hs578T"     "MDA-MB-468"
[6] "HBL100"
> res.cell.lines
[1] "MDA-MB-361" "MDA-MB-436" "MDA-MB-231" "BT483"      "MCF-7"
[6] "ZR-751"

=== Get the mean GI50 values

> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL100 Hs578T MCF-7 MDA-MB-231 MDA-MB-361 MDA-MB-435 MDA-MB-436 MDA-MB-453 MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR-751

=== (A) Perform statistical analysis on the cell lines' expression measurements to identify genes that are significantly differentially expressed between the sensitive and resistant cell lines, using a two-sample t-test.

(1) Make a new data set consisting of the expression data of the selected sensitive and resistant cell lines

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Perform the t-test and identify genes

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve]

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> which.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(which.one.significant)

Using FDR = 15%, we identified 0 predictors.

(3) Order the expression data by p-value, select the top 100 genes, and perform two-way clustering

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanofsensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanofresistant <- apply(new.dt[, sensitive == FALSE], 1, mean)

> meanOfDiff <- -(meanofsensitive - meanofresistant)
> AveFoldChange <- 2^(meanOfDiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanofsensitive, meanofresistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.N.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.N.genes.dt[6:dim(top.N.genes.dt)[[2]]]

[Figure: two-way clustered heatmap. Column labels, left to right: MDA.MB.453, MDA.MB.468, T47D, MDAMB157, Hs578T, HBL100, BT483, MDA.MB.361, MDA.MB.436, MDA.MB.231, MCF.7, ZR.751]

Figure: Two-way hierarchical clustering for doxorubicin using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue = sensitive, red = resistant.

(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values

(1) Compute the correlation between expression measurements and mean GI50 values

First, we order the cell line gene expression data so that the order of the expression data is consistent with the order of the mean GI50 values. Then we compute the correlation between the expression measurements and the mean GI50 measurements.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))
[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Compute p-values and model the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)

[Figure: BUM diagnostic plots with panels Beta Uniform Mixture, FDR Control, Empirical Bayes, and ROC Curve]

=== (3) Order the data by p-values and select the top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of all cell lines' mean GI50 values. Cell lines whose mean GI50 value is at or below the median are assigned as sensitive, and those above the median as resistant.

> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] " "          "BT-549"     "BT20"       "HBL100"     "Hs578T"
 [6] "MDA-MB-435" "MDA-MB-453" "MDA-MB-468" "MDAMB157"   "T47D"

Resistant cell lines

> names(x[x == FALSE])
[1] "AU565"      "BT 474"     "BT483"      "MCF-7"      "MDA-MB-231"
[6] "MDA-MB-361" "MDA-MB-436" "SK-BR-3"    "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)

[Figure: two-way clustered heatmap. Column labels, left to right: BT 549, HBL100, Hs578T, MDAMB157, MDA MB 435, T47D, BT483, MDA MB 453, MDA MB 468, AU565, BT20, BT 474, MDA MB 361, ZR 751, MCF 7, SK BR 3, MDA MB 231, MDA MB 436]

Figure: Two-way hierarchical clustering for doxorubicin using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue = sensitive, red = resistant.

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction, using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)

To cross-validate on the training data, we apply leave-two-out cross validation, i.e. we select two cell lines from the data (one sensitive and one resistant) as the validation set. Then we perform the t-test on the remaining cell line data. As before, we select the top 100 probe sets (ranked by p-value) as predictors. Finally, we apply the selected predictors to predict the validation set. The selection of the top 100 predictors is repeated for each leave-two-out split.

> Leave.two.out <- data.frame(matrix(NA, ncol = K, nrow = 2))
> colnames(Leave.two.out) <- paste("n", 1:K, sep = "")
> rownames(Leave.two.out) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> K
[1] 6
> for (i in 1:K) {
+     M <- 2 * K + 1
+     set1 <- colnames(new.dt)[i]
+     set2 <- colnames(new.dt)[(M - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(new.dt), set)
+     set3
+     training.set <- new.dt[, match(set3, colnames(new.dt))]
+     v1 <- is.sens[match(set3, colnames(new.dt))]
+     ttest <- MultiTtest(training.set, v1 == "Sensitive")
+     ordered.dt <- training.set[order(ttest@p.values), ]
+     N <- 100
+     top.genes.dt <- ordered.dt[1:N, ]
+     test.set <- new.dt[, match(set, colnames(new.dt))]
+     v2 <- is.sens[match(set, colnames(new.dt))]
+     test.set <- data.frame(test.set)
+     rownames(test.set) <- rownames(new.dt)
+     top.gene.test.set <- test.set[match(rownames(top.genes.dt),
+         rownames(test.set)), ]
+     jk <- myfct.dlda(data.train = top.genes.dt, class.train = v1,
+         data.test = data.frame(top.gene.test.set), class.test = v2)
+     Leave.two.out[1, i] <- round(jk[[1]], 2)
+     Leave.two.out[2, i] <- round(jk[[4]], 2)
+ }

The results of cross validation:

> Leave.two.out
                    n1 n2 n3 n4 n5 n6
TrainingAccuracy
CVPredictedAccuracy

> apply(Leave.two.out, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(1b) Prediction of human data set (testing set)

> MDA133.predict <- MDACC.133.log.dt[match(row.names(selected100.dt),
+     row.names(MDACC.133.log.dt)), ]

Re-define the class vector so that the RD cases are resistant and the pCR cases are sensitive. As defined earlier, we test for resistance, i.e. the true positive is RD and the true negative is pCR.

> testing.class <- is.pcr
> testing.class[testing.class == "RD"] <- "Resistant(RD)"
> testing.class[testing.class == "pcr"] <- "Sensitive(pCR)"
> ttest.pred <- myfct.dlda(data.train = selected100.dt, class.train = is.sens,
+     data.test = MDA133.predict, class.test = testing.class)
> names(ttest.pred)
[1] "TrainingAccuracy"              "SummaryTraining"
[3] "IndividualTrainingVsPredicted" "CVPredictedAccuracy"
[5] "ROC"                           "ProbOfClass1"
[7] "FPandTP"                       "SummaryTesting"
[9] "IndividualTestVsPredicted"

(1) Training set classification table

> ttest.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  6                  0
Predicted=Sensitive                  0                  6

(2) Testing set classification table

> ttest.pred[[8]]

                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)

(3) Prediction accuracy on testing set

> ttest.pred[[4]]
[1]

(4) Summarizing sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plotting the receiver operating characteristic (ROC) curve.

> ttest.predictors <- my.function(ttest.pred[[8]])
> data.frame(ttest.predictors)
  Sensitivity Specificity PPV NPV
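How the four summary measures follow from a 2 x 2 classification table can be sketched as below. The helper `confusion.metrics` is hypothetical and is not the report's `my.function`, whose body is not shown here; it assumes rows are predictions, columns are true classes, and the first row and column hold the positive class (resistant, RD). The counts are made up for illustration.

```r
# Hypothetical helper (not the report's my.function): rows are predictions,
# columns are true classes, and the first row/column is the positive class.
confusion.metrics <- function(tab) {
    TP <- tab[1, 1]; FP <- tab[1, 2]
    FN <- tab[2, 1]; TN <- tab[2, 2]
    c(Sensitivity = TP / (TP + FN),
      Specificity = TN / (TN + FP),
      PPV = TP / (TP + FP),
      NPV = TN / (TN + FN))
}

# Made-up counts for illustration only (filled column by column).
toy <- matrix(c(40, 20, 10, 30), nrow = 2,
              dimnames = list(c("Predicted=Resistant", "Predicted=Sensitive"),
                              c("True=Resistant", "True=Sensitive")))
round(confusion.metrics(toy), 3)
```

Because resistance is treated as the positive class, sensitivity here is the fraction of truly resistant cases that are predicted resistant.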

Figure: Empirical ROC curve (Sensitivity vs. False Positive Ratio).

(2) Prediction using DLDA with the predictors identified from correlation.

(2a) Cross validation of cell line data. Again, we apply the leave-two-out cross validation approach.

> n <- dim(cell.line.dt)[[2]]
> Leave.two.out.cor <- data.frame(matrix(NA, ncol = (n - 1)/2,
+     nrow = 2))
> colnames(Leave.two.out.cor) <- paste("n", 1:((n - 1)/2), sep = "")
> rownames(Leave.two.out.cor) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> for (i in 1:(n/2)) {
+     set1 <- colnames(cell.line.dt)[i]

+     set2 <- colnames(cell.line.dt)[(n - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(cell.line.dt), set)
+     training.set <- cell.line.dt[, match(set3, colnames(cell.line.dt))]
+     v1 <- is.sens.cor[match(set3, colnames(cell.line.dt))]
+     used.mean.gi50 <- mean.gi50[match(set3, names(mean.gi50))]
+     cor.with.gi50 <- cor(t(training.set), used.mean.gi50, method = "spearman")
+     p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = n)
+     ordered.cor.dt <- training.set[order(p.value.cor), ]
+     N <- 100
+     top.cor.genes.dt <- ordered.cor.dt[1:N, ]
+     test.cor.set <- cell.line.dt[, match(set, colnames(cell.line.dt))]
+     v2 <- is.sens.cor[match(set, colnames(cell.line.dt))]
+     test.cor.set <- data.frame(test.cor.set)
+     rownames(test.cor.set) <- rownames(cell.line.dt)
+     top.gene.test.set <- test.cor.set[match(rownames(top.cor.genes.dt),
+         rownames(test.cor.set)), ]
+     jk <- myfct.dlda(data.train = top.cor.genes.dt, class.train = v1,
+         data.test = top.gene.test.set, class.test = v2)
+     Leave.two.out.cor[1, i] <- round(jk[[1]], 2)
+     Leave.two.out.cor[2, i] <- round(jk[[4]], 2)
+ }

The cross validation results:

> Leave.two.out.cor
                    n1 n2 n3 n4 n5 n6 n7 n8 n9
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out.cor, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(2b) Prediction on human data

> MDA133.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.cor.dt),
+     row.names(MDACC.133.log.dt)), ]

> cor.pred <- myfct.dlda(data.train = selected100.cor.dt, class.train = is.sens.cor,
+     data.test = MDA133.cor.pred, class.test = testing.class)

(1) Training set classification table

> cor.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  8                  0
Predicted=Sensitive                  1                 10

(2) Prediction accuracy on training set

> cor.pred[[1]]
[1]

(3) Testing set classification table

> cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    43                      6

(4) Prediction accuracy on testing set

> cor.pred[[4]]
[1]

(5) Summarizing sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plotting the receiver operating characteristic (ROC) curve.

> cor.predictors <- my.function(cor.pred[[8]])
> data.frame(cor.predictors)
  Sensitivity Specificity PPV NPV

Figure: Empirical ROC curve (Sensitivity vs. False Positive Ratio).

Finally, we apply a random approach, i.e. we use randomly identified predictors from the cell line data to predict the human data. The purpose of this analysis is to evaluate the prediction performance obtained with randomly selected chemo predictors from the cell line data.

(1) t-test approach

> K
[1] 6

> random.sen.cl <- names(sample(mess))[1:K]
> random.res.cl <- names(sample(mess))[19:(19 - K + 1)]
> random.sen.cell <- match(random.sen.cl, dimnames(cellline.log.dt)[[2]])
> random.res.cell <- match(random.res.cl, dimnames(cellline.log.dt)[[2]])
> random.dt <- data.frame(cellline.log.dt[, random.sen.cell], cellline.log.dt[,
+     random.res.cell])
> dim(random.dt)
[1]
> temp <- gsub("x ", " ", colnames(random.dt))
> colnames(random.dt) <- temp
> rm(temp)
> sen.vec <- rep(FALSE, ncol(random.dt))
> sen.vec[c(1:length(random.sen.cell))] <- TRUE
> Random.CL.t.test <- MultiTtest(random.dt, sen.vec == TRUE)
> Random.CL.bum <- Bum(Random.CL.t.test@p.values)

Figure: BUM diagnostic plots. Panels: Beta Uniform Mixture (Density vs. P Values); FDR Control (Significant P Value vs. Desired False Discovery Rate); Empirical Bayes (Significant P Value vs. Posterior Probability); ROC Curve (sens vs. 1 - spec).

> ordered.random.dt <- random.dt[order(Random.CL.t.test@p.values),
+     ]
> N
[1] 100
> top.N.genes.Random.dt <- ordered.random.dt[1:N, ]
> MDA133.predict.Random <- MDACC.133.log.dt[match(row.names(top.N.genes.Random.dt),
+     row.names(MDACC.133.log.dt)), ]
> is.sens.random <- rep("Resistant", ncol(random.dt))
> is.sens.random[c(1:length(random.sen.cl))] <- "Sensitive"
> testing.class.random <- is.pcr
> testing.class.random[testing.class.random == "RD"] <- "Resistant(RD)"
> testing.class.random[testing.class.random == "pcr"] <- "Sensitive(pCR)"

> ttest.pred.random <- myfct.dlda(data.train = top.N.genes.Random.dt,
+     class.train = is.sens.random, data.test = MDA133.predict.Random,
+     class.test = testing.class.random)

(1) Training set classification table

> ttest.pred.random[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  3                  1
Predicted=Sensitive                  3                  5
> ttest.pred.random[[1]]
[1]
> ttest.pred.random[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)
> ttest.pred.random[[4]]
[1]
> ttest.predictors.random <- my.function(ttest.pred.random[[8]])
> data.frame(ttest.predictors.random)
  Sensitivity Specificity PPV NPV

Figure: Empirical ROC curve (Sensitivity vs. False Positive Ratio).

(2) Correlation approach

> random.mean.gi50 <- sample(mean.gi50)
> cor.with.gi50.random <- cor(t(cell.line.dt), random.mean.gi50,
+     method = "spearman")
> range(cor.with.gi50.random)
[1]
> p.value.cor.random <- Beta.function(x = cor.with.gi50.random[,
+     1], n = 19)
> cor.random.bum <- Bum(p.value.cor.random)
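The permutation step in the correlation approach can be sketched on toy data. This is an illustration only: the cell line names A..E, the gene names, and all values are made up, and the seed is arbitrary. It shows two properties of base R that matter here: `sample()` on a named vector keeps each name attached to its value, and `cor()` pairs its two arguments by position rather than by name.

```r
set.seed(1)  # arbitrary seed, for reproducibility of the sketch only

# Toy data: 5 hypothetical cell lines with made-up response values.
gi50 <- c(A = 0.1, B = 0.4, C = 0.9, D = 1.6, E = 2.5)
expr <- matrix(rnorm(3 * 5), nrow = 3,
               dimnames = list(paste0("gene", 1:3), names(gi50)))

shuffled <- sample(gi50)

# sample() keeps each name attached to its value, so only the position changes.
print(shuffled)

# cor() pairs columns of expr with entries of shuffled by position, not by name.
rho <- cor(t(expr), shuffled, method = "spearman")
print(rho)
```

One consequence is that any later step that looks up cell lines through the names of the shuffled vector recovers the original name-to-value pairing, whereas position-based steps see the permuted order.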

Figure: BUM diagnostic plots. Panels: Beta Uniform Mixture (Density vs. P Values); FDR Control (Significant P Value vs. Desired False Discovery Rate); Empirical Bayes (Significant P Value vs. Posterior Probability); ROC Curve (sens vs. 1 - spec).

> N
[1] 100
> new.random.cor.dt <- data.frame(p.value.cor.random, cor.with.gi50.random,
+     cell.line.dt)
> colnames(new.random.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.random.cor.dt <- new.random.cor.dt[order(new.random.cor.dt$pvalue),
+     ]
> top100.cor.random.dt <- ord.random.cor.dt[1:N, ]
> selected100.random.cor.dt <- top100.cor.random.dt[3:dim(top100.cor.random.dt)[[2]]]

Compute the median GI50 value and define sensitive and resistant cell lines.

> random.x <- random.mean.gi50 <= median(random.mean.gi50)
> names(random.x[random.x == TRUE])

[1] "BT-549"     "MDA-MB-435" "T47D"       "MDA-MB-468" " "
[6] "BT20"       "MDAMB157"   "HBL100"     "Hs578T"     "MDA-MB-453"
> names(random.x[random.x == FALSE])
[1] "BT483"      "MDA-MB-436" "BT 474"     "MDA-MB-361" "ZR-751"
[6] "AU565"      "MDA-MB-231" "MCF-7"      "SK-BR-3"
> Sens.random <- match(names(random.x[random.x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor.random <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor.random[Sens.random] <- "Sensitive"
> rm(random.x)

Prediction on human data

> MDA133.random.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.random.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> random.cor.pred <- myfct.dlda(data.train = selected100.random.cor.dt,
+     class.train = is.sens.cor.random, data.test = MDA133.random.cor.pred,
+     class.test = testing.class.random)

(1) Training set classification table

> random.cor.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  6                  3
Predicted=Sensitive                  3                  7

(2) Prediction accuracy on training set

> random.cor.pred[[1]]
[1]

(3) Testing set classification table

> random.cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)

(4) Prediction accuracy on testing set

> random.cor.pred[[4]]
> random.cor.predicted <- data.frame(unlist(random.cor.pred[[9]][2,
+     ]))
> colnames(random.cor.predicted) <- "Predicted"
> random.cor.predictors <- my.function(random.cor.pred[[8]])
> data.frame(random.cor.predictors)
  Sensitivity Specificity PPV NPV

Figure: Empirical ROC curve (Sensitivity vs. False Positive Ratio).

4.3 Vinorelbine

> currentdrug <- "vinorelbine"

For vinorelbine, we use the 5 cell lines with the lowest concentrations as sensitive and the 5 cell lines with the highest concentrations as resistant.

> K <- 5

=== The following code computes one of the quantiles of the GI50 values and produces the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "MDA-MB-435" "SK-BR-3"    "Hs578T"     "MDA-MB-453" "MDAMB157"
 [6] "AU565"      "HBL100"     "BT20"       "BT483"      "MDA-MB-436"
[11] "MDA-MB-468" "BT-549"     "ZR-751"     "MCF-7"      "MDA-MB-361"
[16] "T47D"       "MDA-MB-231" "BT 474"     " "

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "MDA-MB-435" "SK-BR-3"    "Hs578T"     "MDA-MB-453" "MDAMB157"
> res.cell.lines
[1] " "          "BT 474"     "MDA-MB-231" "T47D"       "MDA-MB-361"

=== Get the mean GI50 values

> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL100 Hs578T MCF-7 MDA-MB-231 MDA-MB-361 MDA-MB-435 MDA-MB-436 MDA-MB-453 MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR-751

=== (A) Performing statistical analysis on the cell line expression measurements to identify genes that are significantly differentially expressed between the sensitive and resistant cell lines, by two-sample t-test.

(1) Making a new data set consisting of the selected sensitive and resistant cell line expression data

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Performing the t-test and identifying genes

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)

Figure: BUM diagnostic plots. Panels: Beta Uniform Mixture (Density vs. P Values); FDR Control (Significant P Value vs. Desired False Discovery Rate); Empirical Bayes (Significant P Value vs. Posterior Probability); ROC Curve (sens vs. 1 - spec).

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> whcih.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(whcih.one.significant)

Using FDR = 15%, we identified 0 predictors.

(3) Ordering the expression data by p-values, selecting the top 100 genes, and performing two-way clustering

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanofsensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanofresistant <- apply(new.dt[, sensitive == FALSE], 1, mean)

> meanOfDiff <- -(meanofsensitive - meanofresistant)
> AveFoldChange <- 2^(meanOfDiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanofsensitive, meanofresistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.n.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.n.genes.dt[6:dim(top.n.genes.dt)[[2]]]

Figure: Two-way hierarchical clustering for vinorelbine using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue = sensitive, red = resistant.
Column labels: X, MDA.MB.231, T47D, BT.474, MDA.MB.361, MDAMB157, MDA.MB.435, Hs578T, SK.BR.3, MDA.MB.453.
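The signed fold-change convention used for AveFoldChange can be sketched on a few made-up values. The probe set names p1..p3 and the numbers are hypothetical; the arithmetic mirrors the steps in the code for this section (log2 difference taken as resistant minus sensitive, then ratios below 1 replaced by their negative reciprocal).

```r
# Made-up mean log2 expression values for three hypothetical probe sets.
mean.sensitive <- c(p1 = 8.0, p2 = 6.0, p3 = 7.0)
mean.resistant <- c(p1 = 9.0, p2 = 5.0, p3 = 7.0)

# log2 difference, resistant minus sensitive.
log2.diff <- mean.resistant - mean.sensitive
fold <- 2^log2.diff

# Ratios below 1 are reported as negative reciprocals (e.g. 0.5 becomes -2).
down <- fold < 1
fold[down] <- -1 / fold[down]
fold
```

With this convention a value of 2 means twofold higher in resistant lines, -2 means twofold lower, and no value falls strictly between -1 and 1.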

(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values.

(1) Computing the correlation between expression measurements and mean GI50 values

First, we order the cell line gene expression data so that the column order is consistent with the order of the mean GI50 values. Then we compute the correlation between the expression measurements and the mean GI50 measurements.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))
[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Computing p-values and modeling the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)
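The behaviour of the correlation-to-p-value conversion can be checked on a few values. The block below restates `Beta.function` as it is defined earlier in the report and evaluates it at hand-picked correlations; the test values are chosen for illustration and do not come from the data.

```r
# Restatement of Beta.function as defined earlier in the report:
# map a correlation x in [-1, 1] to z in [0, 1], then take a two-sided
# tail probability under a symmetric beta distribution.
Beta.function <- function(x, n) {
    z <- (x + 1)/2
    y <- pbeta(z, (n - 1)/2, (n - 1)/2)
    p <- 1 - 2 * abs(y - 1/2)
    return(p)
}

# Hand-picked correlations, with n = 19 as in this section.
r <- c(-1, -0.5, 0, 0.5, 1)
p <- Beta.function(r, n = 19)
data.frame(correlation = r, p.value = p)
```

The p-value is 1 at zero correlation, falls to 0 at a correlation of +1 or -1, and is the same for +r and -r, so ordering genes by this p-value ranks them by the absolute size of the correlation.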

Figure: BUM diagnostic plots. Panels: Beta Uniform Mixture (Density vs. P Values); FDR Control (Significant P Value vs. Desired False Discovery Rate); Empirical Bayes (Significant P Value vs. Posterior Probability); ROC Curve (sens vs. 1 - spec).

=== (3) Order the data by p-values and select the top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of the mean GI50 values across all cell lines. Cell lines whose mean GI50 value is at or below the median are assigned as sensitive; those above the median are assigned as resistant.

> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] "AU565"      "BT-549"     "BT483"      "HBL100"     "Hs578T"
 [6] "MDA-MB-435" "MDA-MB-436" "MDA-MB-453" "MDAMB157"   "SK-BR-3"

Resistant cell lines

> names(x[x == FALSE])
[1] " "          "BT 474"     "BT20"       "MCF-7"      "MDA-MB-231"
[6] "MDA-MB-361" "MDA-MB-468" "T47D"       "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)
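The median split that defines the two classes can be sketched on a small made-up vector. The names L1..L5 and the values are hypothetical; the comparison uses `<=`, as in the code for this section, so a cell line sitting exactly at the median is labelled sensitive.

```r
# Made-up mean GI50 values for five hypothetical cell lines.
mean.gi50.toy <- c(L1 = 0.2, L2 = 1.5, L3 = 0.7, L4 = 3.0, L5 = 0.7)

# TRUE for lines at or below the median, FALSE above it.
is.low <- mean.gi50.toy <= median(mean.gi50.toy)
classes <- ifelse(is.low, "Sensitive", "Resistant")
classes
table(classes)
```

With an odd number of cell lines the median line itself joins the sensitive group, which is consistent with the 10 sensitive and 9 resistant lines listed for the 19 cell lines here.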

Figure: Two-way hierarchical clustering for vinorelbine using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue = sensitive, red = resistant.
Column labels: MDA MB 231, T47D, ZR 751, MCF 7, SK BR 3, BT20, BT 474, MDA MB 361, BT483, MDA MB 468, AU565, MDA MB 453, MDA MB 436, BT 549, MDA MB 435, HBL100, Hs578T, MDAMB157.

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)


More information

ProCoNA: Protein Co-expression Network Analysis

ProCoNA: Protein Co-expression Network Analysis ProCoNA: Protein Co-expression Network Analysis David L Gibbs October 30, 2017 1 De Novo Peptide Networks ProCoNA (protein co-expression network analysis) is an R package aimed at constructing and analyzing

More information

Similarities of Ordered Gene Lists. User s Guide to the Bioconductor Package OrderedList

Similarities of Ordered Gene Lists. User s Guide to the Bioconductor Package OrderedList for Center Berlin Genome Based Bioinformatics Max Planck Institute for Molecular Genetics Computational Diagnostics Group @ Dept. Vingron Ihnestrasse 63-73, D-14195 Berlin, Germany http://compdiag.molgen.mpg.de/

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Bradley Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org

More information

Intro. to Tests for Differential Expression (Part 2) Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 3.

Intro. to Tests for Differential Expression (Part 2) Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 3. Intro. to Tests for Differential Expression (Part 2) Utah State University Spring 24 STAT 557: Statistical Bioinformatics Notes 3.4 ### First prepare objects for DE test ### (as on slide 3 of Notes 3.3)

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Bayesian variable selection and classification with control of predictive values

Bayesian variable selection and classification with control of predictive values Bayesian variable selection and classification with control of predictive values Eleni Vradi 1, Thomas Jaki 2, Richardus Vonk 1, Werner Brannath 3 1 Bayer AG, Germany, 2 Lancaster University, UK, 3 University

More information

OECD QSAR Toolbox v.3.3. Step-by-step example of how to build a userdefined

OECD QSAR Toolbox v.3.3. Step-by-step example of how to build a userdefined OECD QSAR Toolbox v.3.3 Step-by-step example of how to build a userdefined QSAR Background Objectives The exercise Workflow of the exercise Outlook 2 Background This is a step-by-step presentation designed

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

2p or not 2p: Tuppence-based SERS for the detection of illicit materials

2p or not 2p: Tuppence-based SERS for the detection of illicit materials SUPPLEMENTARY INFORMATION 2p or not 2p: Tuppence-based SERS for the detection of illicit materials Figure S1. Deposition of silver (Grey target) demonstrated on a post-1992 2p coin. Figure S2. Raman spectrum

More information

Handout 1: Predicting GPA from SAT

Handout 1: Predicting GPA from SAT Handout 1: Predicting GPA from SAT appsrv01.srv.cquest.utoronto.ca> appsrv01.srv.cquest.utoronto.ca> ls Desktop grades.data grades.sas oldstuff sasuser.800 appsrv01.srv.cquest.utoronto.ca> cat grades.data

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Brad Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org 12

More information

EBSeq: An R package for differential expression analysis using RNA-seq data

EBSeq: An R package for differential expression analysis using RNA-seq data EBSeq: An R package for differential expression analysis using RNA-seq data Ning Leng, John Dawson, and Christina Kendziorski October 14, 2013 Contents 1 Introduction 2 2 Citing this software 2 3 The Model

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Clustering & microarray technology

Clustering & microarray technology Clustering & microarray technology A large scale way to measure gene expression levels. Thanks to Kevin Wayne, Matt Hibbs, & SMD for a few of the slides 1 Why is expression important? Proteins Gene Expression

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different

More information

Principal component analysis (PCA) for clustering gene expression data

Principal component analysis (PCA) for clustering gene expression data Principal component analysis (PCA) for clustering gene expression data Ka Yee Yeung Walter L. Ruzzo Bioinformatics, v17 #9 (2001) pp 763-774 1 Outline of talk Background and motivation Design of our empirical

More information

CSC2515 Assignment #2

CSC2515 Assignment #2 CSC2515 Assignment #2 Due: Nov.4, 2pm at the START of class Worth: 18% Late assignments not accepted. 1 Pseudo-Bayesian Linear Regression (3%) In this question you will dabble in Bayesian statistics and

More information

Package hierdiversity

Package hierdiversity Version 0.1 Date 2015-03-11 Package hierdiversity March 20, 2015 Title Hierarchical Multiplicative Partitioning of Complex Phenotypes Author Zachary Marion, James Fordyce, and Benjamin Fitzpatrick Maintainer

More information

Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation

Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation IIM Joint work with Christoph Bernau, Caroline Truntzer, Thomas Stadler and

More information

EECS564 Estimation, Filtering, and Detection Exam 2 Week of April 20, 2015

EECS564 Estimation, Filtering, and Detection Exam 2 Week of April 20, 2015 EECS564 Estimation, Filtering, and Detection Exam Week of April 0, 015 This is an open book takehome exam. You have 48 hours to complete the exam. All work on the exam should be your own. problems have

More information

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering Model-based clustering and data transformations of gene expression data Walter L. Ruzzo University of Washington UW CSE Computational Biology Group 2 Toy 2-d Clustering Example K-Means? 3 4 Hierarchical

More information

Summarize Abnormality Counts

Summarize Abnormality Counts Summarize Abnormality Counts Kevin R. Coombes 10 September 2011 Contents 1 Executive Summary 1 1.1 Introduction......................................... 1 1.1.1 Aims/Objectives..................................

More information

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments by Gordon K. Smyth (as interpreted by Aaron J. Baraff) STAT 572 Intro Talk April 10, 2014 Microarray

More information

Package dhga. September 6, 2016

Package dhga. September 6, 2016 Type Package Title Differential Hub Gene Analysis Version 0.1 Date 2016-08-31 Package dhga September 6, 2016 Author Samarendra Das and Baidya Nath Mandal

More information

Ligand Scout Tutorials

Ligand Scout Tutorials Ligand Scout Tutorials Step : Creating a pharmacophore from a protein-ligand complex. Type ke6 in the upper right area of the screen and press the button Download *+. The protein will be downloaded and

More information

Aplicable methods for nondetriment. Dr José Luis Quero Pérez Assistant Professor Forestry Department University of Cordoba (Spain)

Aplicable methods for nondetriment. Dr José Luis Quero Pérez Assistant Professor Forestry Department University of Cordoba (Spain) Aplicable methods for nondetriment findings Dr José Luis Quero Pérez Assistant Professor Forestry Department University of Cordoba (Spain) Forest Ecophysiology Water relations Photosynthesis Forest demography

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Brad Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org 11

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018

Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018 Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals BayesMP Zhiguang Huo 1, Chi Song 2, George Tseng

More information

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University Multiple Testing Hoang Tran Department of Statistics, Florida State University Large-Scale Testing Examples: Microarray data: testing differences in gene expression between two traits/conditions Microbiome

More information

Tools and topics for microarray analysis

Tools and topics for microarray analysis Tools and topics for microarray analysis USSES Conference, Blowing Rock, North Carolina, June, 2005 Jason A. Osborne, osborne@stat.ncsu.edu Department of Statistics, North Carolina State University 1 Outline

More information

Classification. Classification is similar to regression in that the goal is to use covariates to predict on outcome.

Classification. Classification is similar to regression in that the goal is to use covariates to predict on outcome. Classification Classification is similar to regression in that the goal is to use covariates to predict on outcome. We still have a vector of covariates X. However, the response is binary (or a few classes),

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Meta-analysis for Microarray Experiments

Meta-analysis for Microarray Experiments Meta-analysis for Microarray Experiments Robert Gentleman, Markus Ruschhaupt, Wolfgang Huber, and Lara Lusa April 25, 2006 1 Introduction The use of meta-analysis tools and strategies for combining data

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Preview from Notesale.co.uk Page 3 of 63

Preview from Notesale.co.uk Page 3 of 63 Stem-and-leaf diagram - vertical numbers on far left represent the 10s, numbers right of the line represent the 1s The mean should not be used if there are extreme scores, or for ranks and categories Unbiased

More information

1 The Squared Ranks Test for Variances

1 The Squared Ranks Test for Variances 1 The Squared Ranks Test for Variances Data The data consist of two random samples. Let X 1, X 2,, X n denote the random sample of size n from population 1 and let Y 1, Y 2,, Y m, denote the random sample

More information

The First Thing You Ever Do When Receive a Set of Data Is

The First Thing You Ever Do When Receive a Set of Data Is The First Thing You Ever Do When Receive a Set of Data Is Understand the goal of the study What are the objectives of the study? What would the person like to see from the data? Understand the methodology

More information

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using

More information

Massive Event Detection. Abstract

Massive Event Detection. Abstract Abstract The detection and analysis of events within massive collections of time-series has become an extremely important task for timedomain astronomy. In particular, many scientific investigations (e.g.

More information

Jian WANG, PhD. Room A115 College of Fishery and Life Science Shanghai Ocean University

Jian WANG, PhD. Room A115 College of Fishery and Life Science Shanghai Ocean University Jian WANG, PhD j_wang@shou.edu.cn Room A115 College of Fishery and Life Science Shanghai Ocean University Useful Links Slides: http://sihua.us/biostatistics.htm Datasets: http://users.monash.edu.au/~murray/bdar/index.html

More information

Outline Challenges of Massive Data Combining approaches Application: Event Detection for Astronomical Data Conclusion. Abstract

Outline Challenges of Massive Data Combining approaches Application: Event Detection for Astronomical Data Conclusion. Abstract Abstract The analysis of extremely large, complex datasets is becoming an increasingly important task in the analysis of scientific data. This trend is especially prevalent in astronomy, as large-scale

More information

The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data

The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data Sarah Reese July 31, 2013 Batch effects are commonly observed systematic non-biological variation between groups of samples

More information