Supp Figure 1A
Supp Figure 1B
Supp Figure 1C
Supp Figure 1D
Supp File 1 (Table: supp file 1.pdf)

Prediction of Chemotherapy Response from Breast Cancer Cell Lines to Human Cancer Expression Data
September 19, 2008

> library(splines)
> library(oompaBase)
> library(mclust)
use of mclust requires a license agreement see
> library(nlme)
> library(PreProcess)
> library(ClassComparison)
> library(cluster)
> library(ClassDiscovery)

1 Loading MDACC 133 Array's Gene Expression Data

1.1 Load the expression data

We use the mean-adjusted expression data, i.e. the expression data adjusted to eliminate batch effects.

> set.seed(1000)
> our.dir <- "//mdabam1/bioinfo/private/lajos-chemo-prediction/supplementary/"
> mdacc.file.name <- "MDA133/Mean-Adjusted-raw-MBEI-MDA133.txt"
> MDACC.centered.133 <- read.delim(paste(our.dir, mdacc.file.name,
+     sep = ""), header = TRUE, sep = "\t")
> dim(MDACC.centered.133)
[1]
The first column is the probe set ID. We separate the probe set ID from the expression data and then log-transform the expression data.

> MDACC.centered133.dt <- MDACC.centered.133[, -1]
> row.names(MDACC.centered133.dt) <- MDACC.centered.133[, 1]
> MDACC.133.log.dt <- log2(MDACC.centered133.dt + 1)
> rm(MDACC.centered.133)

1.2 Load associated clinical info

> MDACC.clinical <- "MDA133/MDA133CompleteInfo txt"
> MDACC.133.clinical <- read.delim(paste(our.dir, MDACC.clinical,
+     sep = ""), header = TRUE, sep = "\t")
> all(MDACC.133.clinical$idtext == colnames(MDACC.133.log.dt))
[1] TRUE

The array IDs appear in the same order in the gene expression data and in the associated clinical data. Next, define a label vector for pCR and RD cases. From the clinical information, we know there are 34 pCR cases and 99 RD cases.

> is.pcr <- rep("RD", length(MDACC.133.clinical$pcrtxt))
> is.pcr[MDACC.133.clinical$pcrtxt == "pcr"] <- "pcr"

Introduce a function to compute p-values from correlation coefficients, based on the beta distribution.

> Beta.function <- function(x, n) {
+     z <- (x + 1)/2
+     y <- pbeta(z, (n - 1)/2, (n - 1)/2)
+     p <- 1 - 2 * abs(y - 1/2)
+     return(p)
+ }

(n is the sample size.)

Introduce another function for computing sensitivity, specificity, PPV, and NPV from DLDA results. Note that to compute these quantities, it is important to define what we test for. In our analysis, we test for resistance (RD): a true positive is RD, and a true negative is pCR.
> my.function <- function(x) {
+     Sens <- x[1]/(x[1] + x[2])
+     Spec <- x[4]/(x[3] + x[4])
+     PPV <- x[1]/(x[1] + x[3])
+     NPV <- x[4]/(x[2] + x[4])
+     list(Sensitivity = Sens, Specificity = Spec, PPV = PPV, NPV = NPV)
+ }

2 Loading Cell Line's Gene Expression Data

2.1 Load the expression data

> fname <- "cell-line-data/mbei-for-cellline-data-from-cornelia txt"
> chemo.cell.line.dt <- read.table(paste(our.dir, fname, sep = ""),
+     header = T, row.names = NULL, skip = 0, sep = "\t")

The first column is the probe set ID; we separate it from the data.

> ProbeSet.ID <- chemo.cell.line.dt[, 1]
> cellline.dt <- chemo.cell.line.dt[, -1]
> rm(chemo.cell.line.dt)

2.2 Load array information file and replace array ID by cell line names

> infonames <- "cell-line-data/cell-line-info.txt"
> info.file <- read.table(paste(our.dir, infonames, sep = ""),
+     header = T, row.names = NULL, skip = 0, sep = "\t")
> dimnames(info.file)[[2]]
[1] "Number" "Cell.Line" "Array" "File.Name"
> dimnames(cellline.dt)[[2]] <- info.file[, 2]

2.3 Data transformation and Load other cell line information

> CellLineOtherInfo <- "Documents/cell screening progress note_cl.txt"
> other.info.cell.lines <- read.table(paste(our.dir, CellLineOtherInfo,
+     sep = ""), header = T, row.names = NULL, skip = 0, sep = "\t")
2.4 Clustering for QC

(1) First, define a filter to remove noisy measurements: a probe set is kept only if its maximum expression across cell lines reaches the 15th percentile of all measurements.

> max.vec <- apply(cellline.log.dt, 1, max)
> q15 <- quantile(as.matrix(cellline.log.dt), 0.15)
> f.vector <- max.vec >= q15

(2) Define a vector to remove control spots from the expression measurements.

> is.not.controls <- rep(TRUE, dim(cellline.log.dt)[[1]])
> is.not.controls[grep("AFFX", dimnames(cellline.log.dt)[[1]])] <- FALSE

After filtering and removing the controls:

> cellline.log.dt <- cellline.log.dt[(f.vector & is.not.controls),
+     ]
> dim(cellline.log.dt)
[1]

We will use this dataset to identify chemo predictors.

(3) Perform the cluster analysis.

> chemo.hc <- hclust(distanceMatrix(cellline.log.dt, "pearson"),
+     method = "complete")
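The clustering call above relies on distanceMatrix from the ClassDiscovery package. As a rough illustration only, the sketch below reproduces the same idea with base R, assuming the Pearson metric is a decreasing linear function of the correlation between columns (here taken as 1 - r). The toy matrix and all names in it (toy, pearson.dist, toy.hc, groups) are hypothetical and are not part of the report.

```r
# Hypothetical toy data: 50 probe sets (rows) by 6 cell lines (columns).
set.seed(1)
toy <- matrix(rnorm(50 * 6), nrow = 50,
              dimnames = list(paste0("probe", 1:50), paste0("line", 1:6)))

# Assumed Pearson distance between columns (samples): 1 - correlation.
pearson.dist <- as.dist(1 - cor(toy, method = "pearson"))

# Complete-linkage hierarchical clustering, as in the report.
toy.hc <- hclust(pearson.dist, method = "complete")

# Cut the tree into two groups, mirroring the two clusters read off a dendrogram.
groups <- cutree(toy.hc, k = 2)
print(groups)
```

Rescaling the distance by a positive constant would leave the complete-linkage tree topology unchanged, so the exact scaling used by the package should not affect which samples cluster together.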
[Figure: Cluster Dendrogram of the cell lines (y-axis: Height).]

Figure 1: Hierarchical clustering using all probe sets. Two distinct clusters can be seen. Correlated with the available clinical information, these two clusters seem to be related to ER status (see attached Figure: clustering.pdf).

3 Load GI50 data

> data <- read.table(paste(our.dir, "Documents/krc-parsed.tsv",
+     sep = ""), sep = "\t", header = TRUE)
> data[, "Step"] <- factor(data[, "Step"])
> temp <- read.table(paste(our.dir, "Documents/translateConc.tsv",
+     sep = ""), sep = "\t", header = TRUE)
> concentrations <- temp[, "PowerOfTen"]
> names(concentrations) <- temp[, "TargetConc"]
4 Identifying Genes between Sensitive and Resistant Cell Lines from Gene Expression Data

There are two ways to identify genes: (1) from a two-sample t-test between sensitive and resistant cell lines; and (2) from the correlation between expression data and GI50 (the drug-treated cell line data). We apply both approaches for each drug. To perform the t-test, we need to identify sensitive and resistant cell lines for each drug. We discussed this issue in the last meeting and decided to select sensitive and resistant cell lines based on the boxplot of the GI50 values from resampled dose-response curves for each drug (see the report Breast Cancer Cell Line Dose Response, issued 3 August 2007). The identification of sensitive and resistant cell lines was illustrated in the previous report; here we only outline the critical step.

4.1 Paclitaxel

> currentdrug <- "paclitaxel"

For paclitaxel, we decided to use the 8 cell lines with the lowest concentrations as sensitive and the 8 cell lines with the highest concentrations as resistant.

> K <- 8

=== The following code was used to compute one of the quantiles of the GI50 values and produce the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "MDA-MB-435" "Hs578T" "MDAMB157" "HBL100" "AU565"
 [6] "MDA-MB-436" "BT-549" "MDA-MB-468" "BT483" "BT20"
[11] "MDA-MB-231" "MDA-MB-453" "MCF-7" " " "BT 474"
[16] "MDA-MB-361" "SK-BR-3" "T47D" "ZR-751"
=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "MDA-MB-435" "Hs578T" "MDAMB157" "HBL100" "AU565"
[6] "MDA-MB-436" "BT-549" "MDA-MB-468"
> res.cell.lines
[1] "ZR-751" "T47D" "SK-BR-3" "MDA-MB-361" "BT 474"
[6] " " "MCF-7" "MDA-MB-453"

=== Get the mean GI50 values

> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL100 Hs578T MCF-7 MDA-MB-231 MDA-MB-361
MDA-MB-435 MDA-MB-436 MDA-MB-453 MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR-751

=== (A) Perform statistical analysis on the cell line expression measurements to identify significantly differentially expressed genes between the sensitive and resistant cell lines, by two-sample t-test.

(1) Make a new data set consisting of the expression data of the selected sensitive and resistant cell lines.

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Perform the t-test and identify genes.
> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)

[Figure: Beta-uniform mixture (BUM) diagnostic plots. Panels: Beta Uniform Mixture (density vs. P Values), FDR Control (Significant P Value vs. Desired False Discovery Rate), Empirical Bayes (Significant P Value vs. Posterior Probability), ROC Curve (sens vs. 1 - spec).]

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> which.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(which.one.significant)

Using FDR = 15%, we identified 2156 predictors.

(3) Order the expression data by p-value, select the top 100 genes, and perform two-way clustering.
> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanOfSensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanOfResistant <- apply(new.dt[, sensitive == FALSE], 1, mean)
> meanOfDiff <- -(meanOfSensitive - meanOfResistant)
> AveFoldChange <- 2^(meanOfDiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanOfSensitive, meanOfResistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.N.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.N.genes.dt[6:dim(top.N.genes.dt)[[2]]]
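The fold-change step above converts a difference of mean log2 expression into a signed fold change: ratios below 1 are reported as negative reciprocals, so a four-fold decrease reads as -4 rather than 0.25. A minimal sketch of that conversion follows; the function name signed.fold.change and the input values are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Signed fold change from mean log2 expression values (resistant relative to
# sensitive), following the same sign convention as -(sensitive - resistant).
signed.fold.change <- function(mean.sens, mean.res) {
  log2.diff <- mean.res - mean.sens
  fc <- 2^log2.diff
  down <- fc < 1
  fc[down] <- -1 / fc[down]   # report down-regulation as a negative fold change
  fc
}

# Hypothetical log2 means for three probe sets; log2 differences are 1, -2, 0.
fc <- signed.fold.change(mean.sens = c(5, 7, 6), mean.res = c(6, 5, 6))
print(fc)   # 2 -4 1
```

Note that under this convention no value falls strictly between -1 and 1, and a probe set with no change is reported as 1.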
[Figure: two-way clustering heat map of the selected cell lines.]

Figure: Two-way hierarchical clustering for paclitaxel using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue=sensitive, red=resistant

(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values.

(1) Compute the correlation between expression measurements and mean GI50 values.

First, order the cell line gene expression data so that the order of the expression data is consistent with the order of the mean GI50 values. Then compute the correlation between the expression measurements and the mean GI50 measurements.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))
[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Compute p-values and model the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)

[Figure: Beta-uniform mixture (BUM) diagnostic plots. Panels: Beta Uniform Mixture (density vs. P Values), FDR Control (Significant P Value vs. Desired False Discovery Rate), Empirical Bayes (Significant P Value vs. Posterior Probability), ROC Curve (sens vs. 1 - spec).]

=== (3) Order the data by p-values and select top 100 genes
> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of the mean GI50 values across all cell lines. Cell lines whose mean GI50 value is less than or equal to the median are assigned as sensitive; those above the median are assigned as resistant.

> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] "AU565" "BT-549" "BT20" "BT483" "HBL100"
 [6] "Hs578T" "MDA-MB-231" "MDA-MB-435" "MDA-MB-436" "MDAMB157"

Resistant cell lines

> names(x[x == FALSE])
[1] " " "BT 474" "MCF-7" "MDA-MB-361" "MDA-MB-453"
[6] "MDA-MB-468" "SK-BR-3" "T47D" "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)
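The median split used above can be sketched on a small example. With an odd number of cell lines, the cell line sitting exactly at the median is labelled sensitive because the comparison is `<=`, which is why 19 cell lines split into 10 sensitive and 9 resistant. The GI50 values and cell line names below are hypothetical, not values from this report.

```r
# Hypothetical mean GI50 values (e.g. on a log10 concentration scale).
toy.gi50 <- c(A = -8.2, B = -7.1, C = -6.5, D = -7.9, E = -6.0)

# TRUE for cell lines at or below the median, i.e. labelled sensitive.
is.sensitive <- toy.gi50 <= median(toy.gi50)

# Character labels in the same order as the input vector.
toy.labels <- rep("Resistant", length(toy.gi50))
toy.labels[is.sensitive] <- "Sensitive"
names(toy.labels) <- names(toy.gi50)

print(toy.labels)
print(table(toy.labels))
```

Here the median is -7.1, so A, B and D are labelled sensitive and C and E resistant.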
[Figure: two-way clustering heat map of all cell lines.]

Figure: Two-way hierarchical clustering for paclitaxel using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue=sensitive, red=resistant

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction, using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)
To cross-validate on the training data, we apply leave-two-out cross validation, i.e. we select two cell lines from the data (one sensitive and one resistant) as the validation set. We then perform the t-test on the remaining cell line data. As before, we select the top 100 probe sets (ranked by p-values) as predictors. Finally, we apply the selected predictors to predict the validation set. The selection of the top 100 predictors is repeated for each leave-two-out split.

> Leave.two.out <- data.frame(matrix(NA, ncol = K, nrow = 2))
> colnames(Leave.two.out) <- paste("n", 1:K, sep = "")
> rownames(Leave.two.out) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> K
[1] 8
> for (i in 1:K) {
+     M <- 2 * K + 1
+     set1 <- colnames(new.dt)[i]
+     set2 <- colnames(new.dt)[(M - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(new.dt), set)
+     set3
+     training.set <- new.dt[, match(set3, colnames(new.dt))]
+     v1 <- is.sens[match(set3, colnames(new.dt))]
+     ttest <- MultiTtest(training.set, v1 == "Sensitive")
+     ordered.dt <- training.set[order(ttest@p.values), ]
+     N <- 100
+     top.genes.dt <- ordered.dt[1:N, ]
+     test.set <- new.dt[, match(set, colnames(new.dt))]
+     v2 <- is.sens[match(set, colnames(new.dt))]
+     test.set <- data.frame(test.set)
+     rownames(test.set) <- rownames(new.dt)
+     top.gene.test.set <- test.set[match(rownames(top.genes.dt),
+         rownames(test.set)), ]
+     jk <- myfct.dlda(data.train = top.genes.dt, class.train = v1,
+         data.test = data.frame(top.gene.test.set), class.test = v2)
+     Leave.two.out[1, i] <- round(jk[[1]], 2)
+     Leave.two.out[2, i] <- round(jk[[4]], 2)
+ }

The results of cross validation

> Leave.two.out
                    n1 n2 n3 n4 n5 n6 n7 n8
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out, 1, mean)
TrainingAccuracy CVPredictedAccuracy

(1b) Prediction of human data set (testing set)

> MDA133.predict <- MDACC.133.log.dt[match(row.names(selected100.dt),
+     row.names(MDACC.133.log.dt)), ]

Re-define the class vector so that the RD cases are resistant and the pCR cases are sensitive. As defined earlier, we test for resistance, i.e. a true positive is RD and a true negative is pCR.

> testing.class <- is.pcr
> testing.class[testing.class == "RD"] <- "Resistant(RD)"
> testing.class[testing.class == "pcr"] <- "Sensitive(pCR)"
> ttest.pred <- myfct.dlda(data.train = selected100.dt, class.train = is.sens,
+     data.test = MDA133.predict, class.test = testing.class)
> names(ttest.pred)
[1] "TrainingAccuracy" "SummaryTraining"
[3] "IndividualTrainingVsPredicted" "CVPredictedAccuracy"
[5] "ROC" "ProbOfClass1"
[7] "FPandTP" "SummaryTesting"
[9] "IndividualTestVsPredicted"

(1) Training set classification table

> ttest.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  8                  0
Predicted=Sensitive                  0                  8

(2) Testing set classification table

> ttest.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    17                      7

(3) Prediction accuracy on the testing set

> ttest.pred[[4]]
[1]

(4) Summarize sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plot the receiver operating characteristic (ROC) curve.

> ttest.predictors <- my.function(ttest.pred[[8]])
> data.frame(ttest.predictors)
Sensitivity Specificity PPV NPV
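To make the indexing in my.function concrete: R stores a 2 x 2 table column by column, so with predicted classes in rows and true classes in columns, x[1] is the true positive count, x[2] the false negatives, x[3] the false positives, and x[4] the true negatives (positive = resistant). The sketch below applies the same formulas to a hypothetical table; the counts and the name diagnostic.summary are illustrative assumptions, not results from this report.

```r
# Same formulas as my.function, written with named intermediate counts.
diagnostic.summary <- function(x) {
  TP <- x[1]; FN <- x[2]; FP <- x[3]; TN <- x[4]   # column-major order
  list(Sensitivity = TP / (TP + FN),
       Specificity = TN / (FP + TN),
       PPV = TP / (TP + FP),
       NPV = TN / (FN + TN))
}

# Hypothetical classification table: rows = predicted, columns = true class.
toy.table <- matrix(c(40, 10, 5, 20), nrow = 2,
                    dimnames = list(c("Predicted=Resistant", "Predicted=Sensitive"),
                                    c("True=Resistant", "True=Sensitive")))

toy.summary <- diagnostic.summary(toy.table)
print(data.frame(toy.summary))
# Sensitivity = 40/50 = 0.8, Specificity = 20/25 = 0.8,
# PPV = 40/45, NPV = 20/30
```

If the table were laid out with true classes in rows instead, the roles of x[2] and x[3] would swap, so the orientation of the classifier's summary table is worth checking before applying these formulas.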
[Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).]

(2) Prediction, using DLDA with the predictors identified from correlation.

(2a) Cross validation of cell line data. Again, we apply the leave-two-out cross validation approach.

> n <- dim(cell.line.dt)[[2]]
> Leave.two.out.cor <- data.frame(matrix(NA, ncol = (n - 1)/2,
+     nrow = 2))
> colnames(Leave.two.out.cor) <- paste("n", 1:((n - 1)/2), sep = "")
> rownames(Leave.two.out.cor) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> for (i in 1:(n/2)) {
+     set1 <- colnames(cell.line.dt)[i]
+     set2 <- colnames(cell.line.dt)[(n - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(cell.line.dt), set)
+     training.set <- cell.line.dt[, match(set3, colnames(cell.line.dt))]
+     v1 <- is.sens.cor[match(set3, colnames(cell.line.dt))]
+     used.mean.gi50 <- mean.gi50[match(set3, names(mean.gi50))]
+     cor.with.gi50 <- cor(t(training.set), used.mean.gi50, method = "spearman")
+     p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = n)
+     ordered.cor.dt <- training.set[order(p.value.cor), ]
+     N <- 100
+     top.cor.genes.dt <- ordered.cor.dt[1:N, ]
+     test.cor.set <- cell.line.dt[, match(set, colnames(cell.line.dt))]
+     v2 <- is.sens.cor[match(set, colnames(cell.line.dt))]
+     test.cor.set <- data.frame(test.cor.set)
+     rownames(test.cor.set) <- rownames(cell.line.dt)
+     top.gene.test.set <- test.cor.set[match(rownames(top.cor.genes.dt),
+         rownames(test.cor.set)), ]
+     jk <- myfct.dlda(data.train = top.cor.genes.dt, class.train = v1,
+         data.test = top.gene.test.set, class.test = v2)
+     Leave.two.out.cor[1, i] <- round(jk[[1]], 2)
+     Leave.two.out.cor[2, i] <- round(jk[[4]], 2)
+ }

The cross validation results

> Leave.two.out.cor
                    n1 n2 n3 n4 n5 n6 n7 n8 n9
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out.cor, 1, mean)
TrainingAccuracy CVPredictedAccuracy

(2b) Prediction on human data

> MDA133.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> cor.pred <- myfct.dlda(data.train = selected100.cor.dt, class.train = is.sens.cor,
+     data.test = MDA133.cor.pred, class.test = testing.class)

(1) Training set classification table

> cor.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  9                  2
Predicted=Sensitive                  0                  8

(2) Prediction accuracy on the training set

> cor.pred[[1]]
[1]

(3) Testing set classification table

> cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                     9                      7

(4) Prediction accuracy on the testing set

> cor.pred[[4]]
[1]

(5) Summarize sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and plot the receiver operating characteristic (ROC) curve.

> cor.predictors <- my.function(cor.pred[[8]])
> data.frame(cor.predictors)
Sensitivity Specificity PPV NPV
[Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).]

Finally, we apply a random approach, i.e. we use randomly identified predictors from the cell line data to predict the human data. The purpose of this analysis is to evaluate the prediction performance obtained with randomly selected chemo predictors from the cell line data.

(1) t-test approach

> K
[1] 8
> random.sen.cl <- names(sample(mess))[1:K]
> random.res.cl <- names(sample(mess))[19:(19 - K + 1)]
> random.sen.cell <- match(random.sen.cl, dimnames(cellline.log.dt)[[2]])
> random.res.cell <- match(random.res.cl, dimnames(cellline.log.dt)[[2]])
> random.dt <- data.frame(cellline.log.dt[, random.sen.cell], cellline.log.dt[,
+     random.res.cell])
> dim(random.dt)
[1]
> temp <- gsub("x ", " ", colnames(random.dt))
> colnames(random.dt) <- temp
> rm(temp)
> sen.vec <- rep(FALSE, ncol(random.dt))
> sen.vec[c(1:length(random.sen.cell))] <- TRUE
> Random.CL.t.test <- MultiTtest(random.dt, sen.vec == TRUE)
> Random.CL.bum <- Bum(Random.CL.t.test@p.values)
[Figure: Beta-uniform mixture (BUM) diagnostic plots. Panels: Beta Uniform Mixture (density vs. P Values), FDR Control (Significant P Value vs. Desired False Discovery Rate), Empirical Bayes (Significant P Value vs. Posterior Probability), ROC Curve (sens vs. 1 - spec).]

> ordered.random.dt <- random.dt[order(Random.CL.t.test@p.values),
+     ]
> N
[1] 100
> top.N.genes.Random.dt <- ordered.random.dt[1:N, ]
> MDA133.predict.Random <- MDACC.133.log.dt[match(row.names(top.N.genes.Random.dt),
+     row.names(MDACC.133.log.dt)), ]
> is.sens.random <- rep("Resistant", ncol(random.dt))
> is.sens.random[c(1:length(random.sen.cl))] <- "Sensitive"
> testing.class.random <- is.pcr
> testing.class.random[testing.class.random == "RD"] <- "Resistant(RD)"
> testing.class.random[testing.class.random == "pcr"] <- "Sensitive(pCR)"
> ttest.pred.random <- myfct.dlda(data.train = top.N.genes.Random.dt,
+     class.train = is.sens.random, data.test = MDA133.predict.Random,
+     class.test = testing.class.random)

(1) Training set classification table

> ttest.pred.random[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  6                  3
Predicted=Sensitive                  2                  5
> ttest.pred.random[[1]]
[1]
> ttest.pred.random[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    24                      9
> ttest.pred.random[[4]]
[1]
> ttest.predictors.random <- my.function(ttest.pred.random[[8]])
> data.frame(ttest.predictors.random)
Sensitivity Specificity PPV NPV
[Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).]

(2) Correlation approach

> random.mean.gi50 <- sample(mean.gi50)
> cor.with.gi50.random <- cor(t(cell.line.dt), random.mean.gi50,
+     method = "spearman")
> range(cor.with.gi50.random)
[1]
> p.value.cor.random <- Beta.function(x = cor.with.gi50.random[,
+     1], n = 19)
> cor.random.bum <- Bum(p.value.cor.random)
[Figure: Beta-uniform mixture (BUM) diagnostic plots. Panels: Beta Uniform Mixture (density vs. P Values), FDR Control (Significant P Value vs. Desired False Discovery Rate), Empirical Bayes (Significant P Value vs. Posterior Probability), ROC Curve (sens vs. 1 - spec).]

> N
[1] 100
> new.random.cor.dt <- data.frame(p.value.cor.random, cor.with.gi50.random,
+     cell.line.dt)
> colnames(new.random.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.random.cor.dt <- new.random.cor.dt[order(new.random.cor.dt$pvalue),
+     ]
> top100.cor.random.dt <- ord.random.cor.dt[1:N, ]
> selected100.random.cor.dt <- top100.cor.random.dt[3:dim(top100.cor.random.dt)[[2]]]

Compute the median GI50 value and define sensitive and resistant cell lines

> random.x <- random.mean.gi50 <= median(random.mean.gi50)
> names(random.x[random.x == TRUE])
 [1] "HBL100" "BT20" "MDA-MB-435" "MDA-MB-436" "AU565"
 [6] "MDAMB157" "Hs578T" "MDA-MB-231" "BT483" "BT-549"
> names(random.x[random.x == FALSE])
[1] "ZR-751" "MCF-7" " " "MDA-MB-361" "MDA-MB-453"
[6] "MDA-MB-468" "T47D" "BT 474" "SK-BR-3"
> Sens.random <- match(names(random.x[random.x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor.random <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor.random[Sens.random] <- "Sensitive"
> rm(random.x)

Prediction on human data

> MDA133.random.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.random.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> random.cor.pred <- myfct.dlda(data.train = selected100.random.cor.dt,
+     class.train = is.sens.cor.random, data.test = MDA133.random.cor.pred,
+     class.test = testing.class.random)

(1) Training set classification table

> random.cor.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  7                  1
Predicted=Sensitive                  2                  9

(2) Prediction accuracy on the training set

> random.cor.pred[[1]]
[1]

(3) Testing set classification table

> random.cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)
(4) Prediction accuracy on the testing set

> random.cor.pred[[4]]
> random.cor.predicted <- data.frame(unlist(random.cor.pred[[9]][2,
+     ]))
> colnames(random.cor.predicted) <- "Predicted"
> random.cor.predictors <- my.function(random.cor.pred[[8]])
> data.frame(random.cor.predictors)
Sensitivity Specificity PPV NPV

[Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).]
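Both the observed and the permuted correlation analyses in this section rely on Beta.function to turn a correlation coefficient into a two-sided p-value. As a quick sanity check of how that helper behaves, the sketch below evaluates it on a few hypothetical correlation values with n = 19 (the number of cell lines); the vector toy.r is an illustrative assumption, not data from this report.

```r
# Beta.function as defined earlier in this report.
Beta.function <- function(x, n) {
  z <- (x + 1) / 2                        # map correlation from [-1, 1] to [0, 1]
  y <- pbeta(z, (n - 1) / 2, (n - 1) / 2) # symmetric beta CDF
  p <- 1 - 2 * abs(y - 1 / 2)             # two-sided p-value
  return(p)
}

# Hypothetical correlation coefficients, symmetric around zero.
toy.r <- c(-0.8, -0.4, 0, 0.4, 0.8)
toy.p <- Beta.function(toy.r, n = 19)
print(round(toy.p, 4))
```

Because the beta distribution used here is symmetric, a correlation of zero gives a p-value of 1, correlations of equal magnitude and opposite sign give the same p-value, and the p-value shrinks as the magnitude of the correlation grows.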
4.2 Doxorubicin

> currentdrug <- "doxorubicin"

For doxorubicin, we decided to use the 6 cell lines with the lowest concentrations as sensitive and the 6 cell lines with the highest concentrations as resistant.

> K <- 6

=== The following code was used to compute one of the quantiles of the GI50 values and produce the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "T47D" "MDA-MB-453" "MDAMB157" "Hs578T" "MDA-MB-468"
 [6] "HBL100" " " "BT-549" "BT20" "SK-BR-3"
[11] "MDA-MB-435" "AU565" "BT 474" "ZR-751" "MCF-7"
[16] "BT483" "MDA-MB-231" "MDA-MB-436" "MDA-MB-361"

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "T47D" "MDA-MB-453" "MDAMB157" "Hs578T" "MDA-MB-468"
[6] "HBL100"
> res.cell.lines
[1] "MDA-MB-361" "MDA-MB-436" "MDA-MB-231" "BT483" "MCF-7"
[6] "ZR-751"

=== Get the mean GI50 values
> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL100 Hs578T MCF-7 MDA-MB-231 MDA-MB-361
MDA-MB-435 MDA-MB-436 MDA-MB-453 MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR-751

=== (A) Perform statistical analysis on the cell line expression measurements to identify significantly differentially expressed genes between the sensitive and resistant cell lines, by two-sample t-test.

(1) Make a new data set consisting of the expression data of the selected sensitive and resistant cell lines.

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Perform the t-test and identify genes.

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)
[Figure: Beta-uniform mixture (BUM) diagnostic plots. Panels: Beta Uniform Mixture (density vs. P Values), FDR Control (Significant P Value vs. Desired False Discovery Rate), Empirical Bayes (Significant P Value vs. Posterior Probability), ROC Curve (sens vs. 1 - spec).]

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> which.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(which.one.significant)

Using FDR = 15%, we identified 0 predictors.

(3) Order the expression data by p-value, select the top 100 genes, and perform two-way clustering.

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanOfSensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanOfResistant <- apply(new.dt[, sensitive == FALSE], 1, mean)
> meanOfDiff <- -(meanOfSensitive - meanOfResistant)
> AveFoldChange <- 2^(meanOfDiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanOfSensitive, meanOfResistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.N.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.N.genes.dt[6:dim(top.N.genes.dt)[[2]]]

[Figure: two-way clustering heat map of the selected cell lines.]

Figure: Two-way hierarchical clustering for doxorubicin using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue=sensitive, red=resistant
(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values.

(1) Compute the correlation between expression measurements and mean GI50 values.

First, order the cell line gene expression data so that the order of the expression data is consistent with the order of the mean GI50 values. Then compute the correlation between the expression measurements and the mean GI50 measurements.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))
[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Compute p-values and model the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)
[Figure: Beta-uniform mixture (BUM) diagnostic plots. Panels: Beta Uniform Mixture (density vs. P Values), FDR Control (Significant P Value vs. Desired False Discovery Rate), Empirical Bayes (Significant P Value vs. Posterior Probability), ROC Curve (sens vs. 1 - spec).]

=== (3) Order the data by p-values and select top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of the mean GI50 values across all cell lines. Cell lines whose mean GI50 value is less than or equal to the median are assigned as sensitive; those above the median are assigned as resistant.
> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] " " "BT-549" "BT20" "HBL100" "Hs578T"
 [6] "MDA-MB-435" "MDA-MB-453" "MDA-MB-468" "MDAMB157" "T47D"

Resistant cell lines

> names(x[x == FALSE])
[1] "AU565" "BT 474" "BT483" "MCF-7" "MDA-MB-231"
[6] "MDA-MB-361" "MDA-MB-436" "SK-BR-3" "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)
[Figure: two-way clustering heat map of all cell lines.]

Figure: Two-way hierarchical clustering for doxorubicin using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue=sensitive, red=resistant

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction, using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)
To cross-validate on the training data, we apply leave-two-out cross validation, i.e. we select two cell lines from the data (one sensitive and one resistant) as the validation set. We then perform the t-test on the remaining cell line data. As before, we select the top 100 probe sets (ranked by p-values) as predictors. Finally, we apply the selected predictors to predict the validation set. The selection of the top 100 predictors is repeated for each leave-two-out split.

> Leave.two.out <- data.frame(matrix(NA, ncol = K, nrow = 2))
> colnames(Leave.two.out) <- paste("n", 1:K, sep = "")
> rownames(Leave.two.out) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> K
[1] 6
> for (i in 1:K) {
+     M <- 2 * K + 1
+     set1 <- colnames(new.dt)[i]
+     set2 <- colnames(new.dt)[(M - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(new.dt), set)
+     set3
+     training.set <- new.dt[, match(set3, colnames(new.dt))]
+     v1 <- is.sens[match(set3, colnames(new.dt))]
+     ttest <- MultiTtest(training.set, v1 == "Sensitive")
+     ordered.dt <- training.set[order(ttest@p.values), ]
+     N <- 100
+     top.genes.dt <- ordered.dt[1:N, ]
+     test.set <- new.dt[, match(set, colnames(new.dt))]
+     v2 <- is.sens[match(set, colnames(new.dt))]
+     test.set <- data.frame(test.set)
+     rownames(test.set) <- rownames(new.dt)
+     top.gene.test.set <- test.set[match(rownames(top.genes.dt),
+         rownames(test.set)), ]
+     jk <- myfct.dlda(data.train = top.genes.dt, class.train = v1,
+         data.test = data.frame(top.gene.test.set), class.test = v2)
+     Leave.two.out[1, i] <- round(jk[[1]], 2)
+     Leave.two.out[2, i] <- round(jk[[4]], 2)
+ }

The results of cross validation

> Leave.two.out
                    n1 n2 n3 n4 n5 n6
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out, 1, mean)
TrainingAccuracy CVPredictedAccuracy

(1b) Prediction of human data set (testing set)

> MDA133.predict <- MDACC.133.log.dt[match(row.names(selected100.dt),
+     row.names(MDACC.133.log.dt)), ]

Re-define the class vector so that the RD cases are resistant and the pCR cases are sensitive. As defined earlier, we test for resistance, i.e. a true positive is RD and a true negative is pCR.

> testing.class <- is.pcr
> testing.class[testing.class == "RD"] <- "Resistant(RD)"
> testing.class[testing.class == "pcr"] <- "Sensitive(pCR)"
> ttest.pred <- myfct.dlda(data.train = selected100.dt, class.train = is.sens,
+     data.test = MDA133.predict, class.test = testing.class)
> names(ttest.pred)
[1] "TrainingAccuracy" "SummaryTraining"
[3] "IndividualTrainingVsPredicted" "CVPredictedAccuracy"
[5] "ROC" "ProbOfClass1"
[7] "FPandTP" "SummaryTesting"
[9] "IndividualTestVsPredicted"

(1) Training set classification table

> ttest.pred[[2]]
                    TrainSet=Resistant TrainSet=Sensitive
Predicted=Resistant                  6                  0
Predicted=Sensitive                  0                  6

(2) Testing set classification table

> ttest.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)

(3) Prediction accuracy on testing set

> ttest.pred[[4]]
[1]

(4) Summary of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and a plot of the receiver operating characteristic (ROC) curve.

> ttest.predictors <- my.function(ttest.pred[[8]])
> data.frame(ttest.predictors)
  Sensitivity Specificity PPV NPV
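The body of myfct.dlda is not listed in this part of the document. As a rough guide to what a diagonal linear discriminant analysis (DLDA) classifier does, a minimal base-R sketch is given here. It assumes equal class priors and a pooled per-gene variance; `dlda.sketch` and the simulated matrices are hypothetical and are not the function or data used in the analysis.

```r
# Minimal DLDA sketch (hypothetical helper, not the document's myfct.dlda).
# train, test: numeric matrices with genes in rows and samples in columns.
dlda.sketch <- function(train, cls, test) {
  cls <- factor(cls)
  lev <- levels(cls)
  # Per-class mean of every gene (genes x classes)
  mu <- sapply(lev, function(k) rowMeans(train[, cls == k, drop = FALSE]))
  # Pooled within-class variance of every gene (the "diagonal" covariance)
  ss <- sapply(lev, function(k) {
    x <- train[, cls == k, drop = FALSE]
    rowSums((x - rowMeans(x))^2)
  })
  s2 <- rowSums(ss) / (ncol(train) - length(lev))
  # Variance-scaled squared distance of each test sample to each class mean
  d <- sapply(lev, function(k) colSums((test - mu[, k])^2 / s2))
  d <- matrix(d, ncol = length(lev))
  # Assign each test sample to the nearest class mean
  lev[apply(d, 1, which.min)]
}

# Simulated, well-separated data: 20 genes, 6 + 6 training and 2 + 2 test samples
set.seed(1)
genes <- 20
train <- cbind(matrix(rnorm(genes * 6, mean = 0), genes),
               matrix(rnorm(genes * 6, mean = 5), genes))
cls <- rep(c("Sensitive", "Resistant"), each = 6)
test <- cbind(matrix(rnorm(genes * 2, mean = 0), genes),
              matrix(rnorm(genes * 2, mean = 5), genes))
pred <- dlda.sketch(train, cls, test)
pred
```

The classifier ignores correlations between genes, which keeps it usable when there are far more probe sets than cell lines.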
Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).

(2) Prediction, using DLDA with the predictors identified from correlation.

(2a) Cross validation of cell line data. Again, we apply the leave-two-out cross validation approach.

> n <- dim(cell.line.dt)[[2]]
> Leave.two.out.cor <- data.frame(matrix(NA, ncol = (n - 1)/2,
+     nrow = 2))
> colnames(Leave.two.out.cor) <- paste("n", 1:((n - 1)/2), sep = "")
> rownames(Leave.two.out.cor) <- c("TrainingAccuracy", "CVPredictedAccuracy")
> for (i in 1:(n/2)) {
+     set1 <- colnames(cell.line.dt)[i]
+     set2 <- colnames(cell.line.dt)[(n - i)]
+     set <- c(set1, set2)
+     set3 <- setdiff(colnames(cell.line.dt), set)
+     training.set <- cell.line.dt[, match(set3, colnames(cell.line.dt))]
+     v1 <- is.sens.cor[match(set3, colnames(cell.line.dt))]
+     used.mean.gi50 <- mean.gi50[match(set3, names(mean.gi50))]
+     cor.with.gi50 <- cor(t(training.set), used.mean.gi50, method = "spearman")
+     p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = n)
+     ordered.cor.dt <- training.set[order(p.value.cor), ]
+     N <- 100
+     top.cor.genes.dt <- ordered.cor.dt[1:N, ]
+     test.cor.set <- cell.line.dt[, match(set, colnames(cell.line.dt))]
+     v2 <- is.sens.cor[match(set, colnames(cell.line.dt))]
+     test.cor.set <- data.frame(test.cor.set)
+     rownames(test.cor.set) <- rownames(cell.line.dt)
+     top.gene.test.set <- test.cor.set[match(rownames(top.cor.genes.dt),
+         rownames(test.cor.set)), ]
+     jk <- myfct.dlda(data.train = top.cor.genes.dt, class.train = v1,
+         data.test = top.gene.test.set, class.test = v2)
+     Leave.two.out.cor[1, i] <- round(jk[[1]], 2)
+     Leave.two.out.cor[2, i] <- round(jk[[4]], 2)
+ }

The cross validation results

> Leave.two.out.cor
                    n1 n2 n3 n4 n5 n6 n7 n8 n9
TrainingAccuracy
CVPredictedAccuracy
> apply(Leave.two.out.cor, 1, mean)
   TrainingAccuracy CVPredictedAccuracy

(2b) Prediction on human data

> MDA133.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> cor.pred <- myfct.dlda(data.train = selected100.cor.dt, class.train = is.sens.cor,
+     data.test = MDA133.cor.pred, class.test = testing.class)

(1) Training set classification table

> cor.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  8                  0
Predicted=Sensitive                  1                 10

(2) Prediction accuracy on training set

> cor.pred[[1]]
[1]

(3) Testing set classification table

> cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)                    43                      6

(4) Prediction accuracy on testing set

> cor.pred[[4]]
[1]

(5) Summary of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and a plot of the receiver operating characteristic (ROC) curve.

> cor.predictors <- my.function(cor.pred[[8]])
> data.frame(cor.predictors)
  Sensitivity Specificity PPV NPV
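To make the four summary measures concrete, the sketch below computes them from a 2 x 2 classification table with resistant treated as the positive class, as stated earlier in the document. The helper `class.metrics` is hypothetical (the body of my.function is not shown in this section), and the worked example uses the training-set table reported above for the correlation-based predictors, not the testing-set table.

```r
# Hypothetical helper; rows = predicted class, columns = true class,
# with the positive class (resistant) first in both dimensions.
class.metrics <- function(tab) {
  TP <- tab[1, 1]; FP <- tab[1, 2]
  FN <- tab[2, 1]; TN <- tab[2, 2]
  c(Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    PPV = TP / (TP + FP),
    NPV = TN / (TN + FN))
}

# Training-set table for the correlation-based predictors (filled column-wise)
tab <- matrix(c(8, 1, 0, 10), nrow = 2,
              dimnames = list(c("Predicted=Resistant", "Predicted=Sensitive"),
                              c("TrainSet=Resistant", "TrainSet=Sensitive")))
m <- class.metrics(tab)
round(m, 3)
```

For this table the sensitivity is 8/9, the specificity and PPV are both 1, and the NPV is 10/11.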
Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).

Finally, we apply a random approach, i.e. we use randomly selected predictors from the cell line data to predict the human data. The purpose of this analysis is to evaluate the prediction performance obtained with randomly selected chemo predictors from cell line data.

(1) t-test approach

> K
[1] 6
> random.sen.cl <- names(sample(mess))[1:K]
> random.res.cl <- names(sample(mess))[19:(19 - K + 1)]
> random.sen.cell <- match(random.sen.cl, dimnames(cellline.log.dt)[[2]])
> random.res.cell <- match(random.res.cl, dimnames(cellline.log.dt)[[2]])
> random.dt <- data.frame(cellline.log.dt[, random.sen.cell], cellline.log.dt[,
+     random.res.cell])
> dim(random.dt)
[1]
> temp <- gsub("x ", " ", colnames(random.dt))
> colnames(random.dt) <- temp
> rm(temp)
> sen.vec <- rep(FALSE, ncol(random.dt))
> sen.vec[c(1:length(random.sen.cell))] <- TRUE
> Random.CL.t.test <- MultiTtest(random.dt, sen.vec == TRUE)
> Random.CL.bum <- Bum(Random.CL.t.test@p.values)
Figure: BUM summary plots (panels: Beta Uniform Mixture, FDR Control, Empirical Bayes, ROC Curve).

> ordered.random.dt <- random.dt[order(Random.CL.t.test@p.values),
+     ]
> N
[1] 100
> top.n.genes.random.dt <- ordered.random.dt[1:N, ]
> MDA133.predict.Random <- MDACC.133.log.dt[match(row.names(top.n.genes.random.dt),
+     row.names(MDACC.133.log.dt)), ]
> is.sens.random <- rep("Resistant", ncol(random.dt))
> is.sens.random[c(1:length(random.sen.cl))] <- "Sensitive"
> testing.class.random <- is.pcr
> testing.class.random[testing.class.random == "RD"] <- "Resistant(RD)"
> testing.class.random[testing.class.random == "pcr"] <- "Sensitive(pCR)"
> ttest.pred.random <- myfct.dlda(data.train = top.n.genes.random.dt,
+     class.train = is.sens.random, data.test = MDA133.predict.Random,
+     class.test = testing.class.random)

(1) Training set classification table

> ttest.pred.random[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  3                  1
Predicted=Sensitive                  3                  5
> ttest.pred.random[[1]]
[1]
> ttest.pred.random[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)
> ttest.pred.random[[4]]
[1]
> ttest.predictors.random <- my.function(ttest.pred.random[[8]])
> data.frame(ttest.predictors.random)
  Sensitivity Specificity PPV NPV
Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).

(2) Correlation approach

> random.mean.gi50 <- sample(mean.gi50)
> cor.with.gi50.random <- cor(t(cell.line.dt), random.mean.gi50,
+     method = "spearman")
> range(cor.with.gi50.random)
[1]
> p.value.cor.random <- Beta.function(x = cor.with.gi50.random[,
+     1], n = 19)
> cor.random.bum <- Bum(p.value.cor.random)
Figure: BUM summary plots (panels: Beta Uniform Mixture, FDR Control, Empirical Bayes, ROC Curve).

> N
[1] 100
> new.random.cor.dt <- data.frame(p.value.cor.random, cor.with.gi50.random,
+     cell.line.dt)
> colnames(new.random.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.random.cor.dt <- new.random.cor.dt[order(new.random.cor.dt$pvalue),
+     ]
> top100.cor.random.dt <- ord.random.cor.dt[1:N, ]
> selected100.random.cor.dt <- top100.cor.random.dt[3:dim(top100.cor.random.dt)[[2]]]

Compute the median GI50 value and define the sensitive and resistant cell lines

> random.x <- random.mean.gi50 <= median(random.mean.gi50)
> names(random.x[random.x == TRUE])
 [1] "BT-549"     "MDA-MB-435" "T47D"       "MDA-MB-468" " "
 [6] "BT20"       "MDAMB157"   "HBL100"     "Hs578T"     "MDA-MB-453"
> names(random.x[random.x == FALSE])
[1] "BT483"      "MDA-MB-436" "BT 474"     "MDA-MB-361" "ZR-751"
[6] "AU565"      "MDA-MB-231" "MCF-7"      "SK-BR-3"
> Sens.random <- match(names(random.x[random.x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor.random <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor.random[Sens.random] <- "Sensitive"
> rm(random.x)

Prediction on human data

> MDA133.random.cor.pred <- MDACC.133.log.dt[match(row.names(selected100.random.cor.dt),
+     row.names(MDACC.133.log.dt)), ]
> random.cor.pred <- myfct.dlda(data.train = selected100.random.cor.dt,
+     class.train = is.sens.cor.random, data.test = MDA133.random.cor.pred,
+     class.test = testing.class.random)

(1) Training set classification table

> random.cor.pred[[2]]
                    TrainSet=Resistant TrianSet=Sensitive
Predicted=Resistant                  6                  3
Predicted=Sensitive                  3                  7

(2) Prediction accuracy on training set

> random.cor.pred[[1]]
[1]

(3) Testing set classification table

> random.cor.pred[[8]]
                         TestSet=Resistant(RD) TestSet=Sensitive(pCR)
Predicted=Resistant(RD)
Predicted=Sensitive(pCR)
(4) Prediction accuracy on testing set

> random.cor.pred[[4]]
> random.cor.predicted <- data.frame(unlist(random.cor.pred[[9]][2,
+     ]))
> colnames(random.cor.predicted) <- "Predicted"
> random.cor.predictors <- my.function(random.cor.pred[[8]])
> data.frame(random.cor.predictors)
  Sensitivity Specificity PPV NPV

Figure: Empirical ROC curve (x-axis: False Positive Ratio; y-axis: Sensitivity).
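The idea behind the random baseline for the correlation approach can be illustrated on simulated data. The sketch below is not the analysis itself: `expr`, `gi50`, and the matrix sizes are hypothetical stand-ins for cell.line.dt and mean.gi50. It ranks genes by absolute Spearman correlation, which gives the same ordering as ranking by the p-values from Beta.function (up to ties), since that p-value decreases as the absolute correlation grows.

```r
set.seed(1000)
n.lines <- 19; n.genes <- 500                      # hypothetical sizes
expr <- matrix(rnorm(n.genes * n.lines), n.genes)  # stand-in for cell.line.dt
gi50 <- rnorm(n.lines)                             # stand-in for mean.gi50

# Permuting the response breaks any real gene/response association
perm.gi50 <- sample(gi50)
rho <- cor(t(expr), perm.gi50, method = "spearman")[, 1]

# "Top 100" genes under the permuted response: selected by chance alone
top.random <- order(abs(rho), decreasing = TRUE)[1:100]
range(rho)
```

Predictors chosen this way carry no information about drug response, so their performance on the human data serves as a reference point for the predictors selected from the real GI50 values.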
4.3 Vinorelbine

> currentdrug <- "vinorelbine"

For vinorelbine, we use the 5 cell lines with the lowest concentrations as sensitive and the 5 cell lines with the highest concentrations as resistant.

> K <- 5

=== The following code computes one of the quantiles of the GI50 values and produces the boxplot

> stem <- data.frame(t(gi50val))
> colnames(stem) <- rownames(averaged)
> mess <- apply(stem, 2, quantile, 0.25)
> stem <- stem[, order(mess)]
> names(mess[order(mess)])
 [1] "MDA-MB-435" "SK-BR-3"    "Hs578T"     "MDA-MB-453" "MDAMB157"
 [6] "AU565"      "HBL100"     "BT20"       "BT483"      "MDA-MB-436"
[11] "MDA-MB-468" "BT-549"     "ZR-751"     "MCF-7"      "MDA-MB-361"
[16] "T47D"       "MDA-MB-231" "BT 474"     " "

=== Obtain sensitive and resistant cell lines

> sen.cell.lines <- names(mess[order(mess)])[1:K]
> res.cell.lines <- names(mess[order(mess)])[19:(19 - K + 1)]
> sen.cell.lines
[1] "MDA-MB-435" "SK-BR-3"    "Hs578T"     "MDA-MB-453" "MDAMB157"
> res.cell.lines
[1] " "          "BT 474"     "MDA-MB-231" "T47D"       "MDA-MB-361"

=== Get the mean GI50 values
> mean.gi50 <- apply(gi50val, 1, mean)
> mean.gi50
AU565 BT-549 BT 474 BT20 BT483 HBL Hs578T MCF-7 MDA-MB-231 MDA-MB-361 MDA-MB-435 MDA-MB-436 MDA-MB MDA-MB-468 MDAMB157 SK-BR-3 T47D ZR

=== (A) Statistical analysis of the cell line expression measurements to identify genes that are significantly differentially expressed between the sensitive and resistant cell lines, using a two-sample t-test.

(1) Making a new data set consisting of the expression data of the selected sensitive and resistant cell lines

> sensitive.cell <- match(sen.cell.lines, dimnames(cellline.log.dt)[[2]])
> resistant.cell <- match(res.cell.lines, dimnames(cellline.log.dt)[[2]])
> new.dt <- data.frame(cellline.log.dt[, sensitive.cell], cellline.log.dt[,
+     resistant.cell])
> dim(new.dt)
[1]

(2) Performing the t-test and identifying genes

> sensitive <- rep(FALSE, ncol(new.dt))
> sensitive[c(1:length(sensitive.cell))] <- TRUE
> CellLine.t.test <- MultiTtest(new.dt, sensitive == TRUE)
> CellLine.bum <- Bum(CellLine.t.test@p.values)
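For readers without the ClassComparison package at hand, the gene-by-gene two-sample t-test can be approximated in base R. The sketch below is an illustration on simulated data, not the MultiTtest implementation; the matrix `x`, the planted gene, and the choice of `var.equal = TRUE` (an equal-variance test) are assumptions made for the example.

```r
set.seed(1)
x <- matrix(rnorm(50 * 10), nrow = 50)         # hypothetical: 50 genes x 10 cell lines
is.sensitive <- rep(c(TRUE, FALSE), each = 5)  # first 5 columns sensitive
x[1, is.sensitive] <- x[1, is.sensitive] + 10  # plant one differential gene

# One t-test per row, collecting the statistic and the p-value
row.t <- t(apply(x, 1, function(g) {
  tt <- t.test(g[is.sensitive], g[!is.sensitive], var.equal = TRUE)
  c(t = unname(tt$statistic), p = tt$p.value)
}))

# Genes ordered by p-value, smallest first
head(row.t[order(row.t[, "p"]), ], 3)
```

Ordering the rows by p-value, as done later for the top 100 genes, puts the planted gene first.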
Figure: BUM summary plots (panels: Beta Uniform Mixture, FDR Control, Empirical Bayes, ROC Curve).

> sig.pvalues <- cutoffSignificant(CellLine.bum, FDR.cutoff, by = "FDR")
> sig.pvalues <- round(sig.pvalues, 5)
> whcih.one.significant <- selectSignificant(CellLine.bum, FDR.cutoff,
+     by = "FDR")
> identified <- sum(whcih.one.significant)

Using FDR = 15%, we identified 0 predictors.

(3) Ordering the expression data by p-value, selecting the top 100 genes, and performing two-way clustering

> Tscore <- CellLine.t.test@t.statistics
> pvalue <- CellLine.t.test@p.values
> meanofsensitive <- apply(new.dt[, sensitive == TRUE], 1, mean)
> meanofresistant <- apply(new.dt[, sensitive == FALSE], 1, mean)
> meanofdiff <- -(meanofsensitive - meanofresistant)
> AveFoldChange <- 2^(meanofdiff)
> AveFoldChange[AveFoldChange < 1] <- -(1/(AveFoldChange[AveFoldChange <
+     1]))
> result.dt <- data.frame(pvalue, Tscore, meanofsensitive, meanofresistant,
+     AveFoldChange, new.dt)
> ordered.dt <- result.dt[order(result.dt$pvalue), ]
> N <- 100
> top.n.genes.dt <- ordered.dt[1:N, ]
> selected100.dt <- top.n.genes.dt[6:dim(top.n.genes.dt)[[2]]]

Figure: Two-way hierarchical clustering for vinorelbine using the top 100 genes (ranked by p-values computed from the t-test). Color bar: blue = sensitive, red = resistant.
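The signed fold-change convention used above can be checked on a few hand-picked numbers. In this small sketch the vector `log2.diff` is hypothetical; it plays the role of the resistant-minus-sensitive difference of mean log2 expression.

```r
log2.diff <- c(2, 0, -1, -3)     # hypothetical mean log2 differences
fold <- 2^log2.diff              # 4, 1, 0.5, 0.125
down <- fold < 1
fold[down] <- -(1 / fold[down])  # 0.5 becomes -2, 0.125 becomes -8
fold
```

A value of 4 therefore means four-fold higher in the resistant lines, and a value of -8 means eight-fold lower; no value falls strictly between -1 and 1.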
(B) Next, we identify genes based on the correlation between expression measurements and mean GI50 values.

(1) Computing the correlation between expression measurements and mean GI50 values

First, we order the cell line gene expression data so that the order of the expression data is consistent with the order of the mean GI50 values. Then we compute the correlation between the expression measurements and the mean GI50 values.

> cell.line.dt <- cellline.log.dt[, order(dimnames(cellline.log.dt)[[2]])]
> all(names(mean.gi50) == colnames(cell.line.dt))
[1] TRUE
> cor.with.gi50 <- cor(t(cell.line.dt), mean.gi50, method = "spearman")
> range(cor.with.gi50)
[1]

=== (2) Computing p-values and modeling the resulting p-values by BUM

> p.value.cor <- Beta.function(x = cor.with.gi50[, 1], n = 19)
> cor.bum <- Bum(p.value.cor)
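As a quick check of how the correlation-to-p-value conversion behaves, the sketch below repeats the Beta.function definition given near the start of this document and applies it to a few hypothetical correlation values with n = 19.

```r
# Repeated from the definition given earlier in this document
Beta.function <- function(x, n) {
  z <- (x + 1) / 2
  y <- pbeta(z, (n - 1) / 2, (n - 1) / 2)
  p <- 1 - 2 * abs(y - 1 / 2)
  return(p)
}

r <- c(-0.8, -0.3, 0, 0.3, 0.8)  # hypothetical correlation coefficients
p <- Beta.function(r, n = 19)
round(p, 4)
```

The p-value is two-sided: it equals 1 at zero correlation, is the same for a correlation and its negative, and shrinks as the absolute correlation grows, so ordering genes by p-value is the same as ordering them by absolute correlation.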
Figure: BUM summary plots (panels: Beta Uniform Mixture, FDR Control, Empirical Bayes, ROC Curve).

=== (3) Order the data by p-value and select the top 100 genes

> new.cor.dt <- data.frame(p.value.cor, cor.with.gi50, cell.line.dt)
> colnames(new.cor.dt) <- c("pvalue", "Correlation", colnames(cell.line.dt))
> ord.cor.dt <- new.cor.dt[order(new.cor.dt$pvalue), ]
> N <- 100
> top100.cor.genes.dt <- ord.cor.dt[1:N, ]
> selected100.cor.dt <- top100.cor.genes.dt[3:dim(top100.cor.genes.dt)[[2]]]

(4) Assign sensitive and resistant cell lines based on the median of the mean GI50 values across all cell lines. Cell lines whose mean GI50 value is at or below the median are assigned as sensitive; cell lines above the median are assigned as resistant.
> x <- mean.gi50 <= median(mean.gi50)

Sensitive cell lines

> names(x[x == TRUE])
 [1] "AU565"      "BT-549"     "BT483"      "HBL100"     "Hs578T"
 [6] "MDA-MB-435" "MDA-MB-436" "MDA-MB-453" "MDAMB157"   "SK-BR-3"

Resistant cell lines

> names(x[x == FALSE])
[1] " "          "BT 474"     "BT20"       "MCF-7"      "MDA-MB-231"
[6] "MDA-MB-361" "MDA-MB-468" "T47D"       "ZR-751"

Define a vector for the sensitive and resistant cell lines

> Sens <- match(names(x[x == TRUE]), dimnames(cell.line.dt)[[2]])
> is.sens.cor <- rep("Resistant", dim(cell.line.dt)[[2]])
> is.sens.cor[Sens] <- "Sensitive"
> rm(x)
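The median split above can be illustrated on a toy vector. The five GI50 values below are hypothetical; the point of the sketch is that, with an odd number of cell lines, the line sitting exactly at the median falls on the sensitive side because the comparison is `<=`.

```r
gi50 <- c(A = -7.2, B = -6.1, C = -8.0, D = -5.5, E = -6.8)  # hypothetical values
is.sensitive <- gi50 <= median(gi50)     # the median here is -6.8 (line E)
status <- ifelse(is.sensitive, "Sensitive", "Resistant")
table(status)
```

This matches the split reported above, where 19 cell lines yield 10 sensitive and 9 resistant lines.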
Figure: Two-way hierarchical clustering for vinorelbine using the top 100 genes (ranked by p-values computed from the correlation coefficient). Color bar: blue = sensitive, red = resistant.

Next, we use the predictors identified from the cell line measurements to predict the MDACC 133 arrays.

(1) Prediction, using DLDA with the predictors identified by t-test.

> is.sens <- rep("Resistant", ncol(new.dt))
> is.sens[c(1:length(sensitive.cell))] <- "Sensitive"

(1a) Cross validation of cell line data (training set)
Advanced Statistical Methods: Beyond Linear Regression
Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi
More informationLesson 11. Functional Genomics I: Microarray Analysis
Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)
More informationProbability and Statistics. Terms and concepts
Probability and Statistics Joyeeta Dutta Moscato June 30, 2014 Terms and concepts Sample vs population Central tendency: Mean, median, mode Variance, standard deviation Normal distribution Cumulative distribution
More informationGS Analysis of Microarray Data
GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org kcoombes@mdanderson.org
More informationPackage plw. R topics documented: May 7, Type Package
Type Package Package plw May 7, 2018 Title Probe level Locally moderated Weighted t-tests. Version 1.40.0 Date 2009-07-22 Author Magnus Astrand Maintainer Magnus Astrand
More informationBiochip informatics-(i)
Biochip informatics-(i) : biochip normalization & differential expression Ju Han Kim, M.D., Ph.D. SNUBI: SNUBiomedical Informatics http://www.snubi snubi.org/ Biochip Informatics - (I) Biochip basics Preprocessing
More informationUnivariable Screening by ROC curve analysis
Univariable Screening by RO curve analysis Binary response: rank genes according to their differential expression between control sample and target sample use summary measures based on Receiver Operating
More informationNon-specific filtering and control of false positives
Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Hypothesis testing Machine Learning CSE546 Kevin Jamieson University of Washington October 30, 2018 2018 Kevin Jamieson 2 Anomaly detection You are
More informationGene Expression an Overview of Problems & Solutions: 3&4. Utah State University Bioinformatics: Problems and Solutions Summer 2006
Gene Expression an Overview of Problems & Solutions: 3&4 Utah State University Bioinformatics: Problems and Solutions Summer 006 Review Considering several problems & solutions with gene expression data
More informationcdna Microarray Analysis
cdna Microarray Analysis with BioConductor packages Nolwenn Le Meur Copyright 2007 Outline Data acquisition Pre-processing Quality assessment Pre-processing background correction normalization summarization
More informationIntroduction to analyzing NanoString ncounter data using the NanoStringNormCNV package
Introduction to analyzing NanoString ncounter data using the NanoStringNormCNV package Dorota Sendorek May 25, 2017 Contents 1 Getting started 2 2 Setting Up Data 2 3 Quality Control Metrics 3 3.1 Positive
More informationMath 475. Jimin Ding. August 29, Department of Mathematics Washington University in St. Louis jmding/math475/index.
istical A istic istics : istical Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html August 29, 2013 istical August 29, 2013 1 / 18 istical A istic
More informationExpression Data Exploration: Association, Patterns, Factors & Regression Modelling
Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Exploring gene expression data Scale factors, median chip correlation on gene subsets for crude data quality investigation
More informationLecture Network analysis for biological systems
Lecture 11 2014 Network analysis for biological systems Anja Bråthen Kristoffersen Biological Networks Gene regulatory network: two genes are connected if the expression of one gene modulates expression
More informationRNASeq Differential Expression
12/06/2014 RNASeq Differential Expression Le Corguillé v1.01 1 Introduction RNASeq No previous genomic sequence information is needed In RNA-seq the expression signal of a transcript is limited by the
More informationHigh-throughput Testing
High-throughput Testing Noah Simon and Richard Simon July 2016 1 / 29 Testing vs Prediction On each of n patients measure y i - single binary outcome (eg. progression after a year, PCR) x i - p-vector
More informationA moment-based method for estimating the proportion of true null hypotheses and its application to microarray gene expression data
Biostatistics (2007), 8, 4, pp. 744 755 doi:10.1093/biostatistics/kxm002 Advance Access publication on January 22, 2007 A moment-based method for estimating the proportion of true null hypotheses and its
More informationBayesian Estimation of Bipartite Matchings for Record Linkage
Bayesian Estimation of Bipartite Matchings for Record Linkage Mauricio Sadinle msadinle@stat.duke.edu Duke University Supported by NSF grants SES-11-30706 to Carnegie Mellon University and SES-11-31897
More informationProbability and Statistics. Joyeeta Dutta-Moscato June 29, 2015
Probability and Statistics Joyeeta Dutta-Moscato June 29, 2015 Terms and concepts Sample vs population Central tendency: Mean, median, mode Variance, standard deviation Normal distribution Cumulative distribution
More informationSubject CS1 Actuarial Statistics 1 Core Principles
Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and
More informationReview. More Review. Things to know about Probability: Let Ω be the sample space for a probability measure P.
1 2 Review Data for assessing the sensitivity and specificity of a test are usually of the form disease category test result diseased (+) nondiseased ( ) + A B C D Sensitivity: is the proportion of diseased
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationStatistical analysis of microarray data: a Bayesian approach
Biostatistics (003), 4, 4,pp. 597 60 Printed in Great Britain Statistical analysis of microarray data: a Bayesian approach RAPHAEL GTTARD University of Washington, Department of Statistics, Box 3543, Seattle,
More informationSupervised Classification for Functional Data Using False Discovery Rate and Multivariate Functional Depth
Supervised Classification for Functional Data Using False Discovery Rate and Multivariate Functional Depth Chong Ma 1 David B. Hitchcock 2 1 PhD Candidate University of South Carolina 2 Associate Professor
More informationParametric Empirical Bayes Methods for Microarrays
Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions
More informationMetric Predicted Variable With One Nominal Predictor Variable
Metric Predicted Variable With One Nominal Predictor Variable Tim Frasier Copyright Tim Frasier This work is licensed under the Creative Commons Attribution 4.0 International license. Click here for more
More informationFundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new
More informationPackage multivariance
Package multivariance January 10, 2018 Title Measuring Multivariate Dependence Using Distance Multivariance Version 1.1.0 Date 2018-01-09 Distance multivariance is a measure of dependence which can be
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationOrange Visualization Tool (OVT) Manual
Orange Visualization Tool (OVT) Manual This manual describes the features of the tool and how to use it. 1. Contents of the OVT Once the OVT is open (the first time it may take some seconds), it should
More informationAppendix F. Computational Statistics Toolbox. The Computational Statistics Toolbox can be downloaded from:
Appendix F Computational Statistics Toolbox The Computational Statistics Toolbox can be downloaded from: http://www.infinityassociates.com http://lib.stat.cmu.edu. Please review the readme file for installation
More informationSta$s$cs for Genomics ( )
Sta$s$cs for Genomics (140.688) Instructor: Jeff Leek Slide Credits: Rafael Irizarry, John Storey No announcements today. Hypothesis testing Once you have a given score for each gene, how do you decide
More informationPackage GeneExpressionSignature
Package GeneExpressionSignature September 6, 2018 Title Gene Expression Signature based Similarity Metric Version 1.26.0 Date 2012-10-24 Author Yang Cao Maintainer Yang Cao , Fei
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationCross model validation and multiple testing in latent variable models
Cross model validation and multiple testing in latent variable models Frank Westad GE Healthcare Oslo, Norway 2nd European User Meeting on Multivariate Analysis Como, June 22, 2006 Outline Introduction
More informationGene Selection Using GeneSelectMMD
Gene Selection Using GeneSelectMMD Jarrett Morrow remdj@channing.harvard.edu, Weilianq Qiu stwxq@channing.harvard.edu, Wenqing He whe@stats.uwo.ca, Xiaogang Wang stevenw@mathstat.yorku.ca, Ross Lazarus
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationProbability and Discrete Distributions
AMS 7L LAB #3 Fall, 2007 Objectives: Probability and Discrete Distributions 1. To explore relative frequency and the Law of Large Numbers 2. To practice the basic rules of probability 3. To work with the
More informationModel Assessment and Selection: Exercises
Practical DNA Microarray Analysis Berlin, Nov 2003 Model Assessment and Selection: Exercises Axel Benner Biostatistik, DKFZ, D-69120 Heidelberg, Germany Introduction The exercises provided here are of
More informationOutline. Introduction to SpaceStat and ESTDA. ESTDA & SpaceStat. Learning Objectives. Space-Time Intelligence System. Space-Time Intelligence System
Outline I Data Preparation Introduction to SpaceStat and ESTDA II Introduction to ESTDA and SpaceStat III Introduction to time-dynamic regression ESTDA ESTDA & SpaceStat Learning Objectives Activities
More informationModel Accuracy Measures
Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses
More informationManual: R package HTSmix
Manual: R package HTSmix Olga Vitek and Danni Yu May 2, 2011 1 Overview High-throughput screens (HTS) measure phenotypes of thousands of biological samples under various conditions. The phenotypes are
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationLow-Level Analysis of High- Density Oligonucleotide Microarray Data
Low-Level Analysis of High- Density Oligonucleotide Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley UC Berkeley Feb 23, 2004 Outline
More informationGlossary for the Triola Statistics Series
Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling
More informationPDF-4+ Tools and Searches
PDF-4+ Tools and Searches PDF-4+ 2018 The PDF-4+ 2018 database is powered by our integrated search display software. PDF-4+ 2018 boasts 72 search selections coupled with 125 display fields resulting in
More informationDiagnostics. Gad Kimmel
Diagnostics Gad Kimmel Outline Introduction. Bootstrap method. Cross validation. ROC plot. Introduction Motivation Estimating properties of an estimator. Given data samples say the average. x 1, x 2,...,
More informationPDF-4+ Tools and Searches
PDF-4+ Tools and Searches PDF-4+ 2019 The PDF-4+ 2019 database is powered by our integrated search display software. PDF-4+ 2019 boasts 74 search selections coupled with 126 display fields resulting in
More informationALDEx: ANOVA-Like Differential Gene Expression Analysis of Single-Organism and Meta-RNA-Seq
ALDEx: ANOVA-Like Differential Gene Expression Analysis of Single-Organism and Meta-RNA-Seq Andrew Fernandes, Gregory Gloor, Jean Macklaim July 18, 212 1 Introduction This guide provides an overview of
More informationINTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP
INTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP Personal Healthcare Revolution Electronic health records (CFH) Personal genomics (DeCode, Navigenics, 23andMe) X-prize: first $10k human genome technology
More informationContents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1
Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services
More informationncounter PlexSet Data Analysis Guidelines
ncounter PlexSet Data Analysis Guidelines NanoString Technologies, Inc. 530 airview Ave North Seattle, Washington 98109 USA Telephone: 206.378.6266 888.358.6266 E-mail: info@nanostring.com Molecules That
More informationGS Analysis of Microarray Data
GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org kcoombes@mdanderson.org
More informationProCoNA: Protein Co-expression Network Analysis
ProCoNA: Protein Co-expression Network Analysis David L Gibbs October 30, 2017 1 De Novo Peptide Networks ProCoNA (protein co-expression network analysis) is an R package aimed at constructing and analyzing
More informationSimilarities of Ordered Gene Lists. User s Guide to the Bioconductor Package OrderedList
for Center Berlin Genome Based Bioinformatics Max Planck Institute for Molecular Genetics Computational Diagnostics Group @ Dept. Vingron Ihnestrasse 63-73, D-14195 Berlin, Germany http://compdiag.molgen.mpg.de/
More informationGS Analysis of Microarray Data
GS01 0163 Analysis of Microarray Data Keith Baggerly and Bradley Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org
More informationIntro. to Tests for Differential Expression (Part 2) Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 3.
Intro. to Tests for Differential Expression (Part 2) Utah State University Spring 24 STAT 557: Statistical Bioinformatics Notes 3.4 ### First prepare objects for DE test ### (as on slide 3 of Notes 3.3)
More informationConsistent high-dimensional Bayesian variable selection via penalized credible regions
Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable
More informationGS Analysis of Microarray Data
GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org
More informationGS Analysis of Microarray Data
GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org