Manual: R package HTSmix

Olga Vitek and Danni Yu
May 2

1 Overview

High-throughput screens (HTS) measure phenotypes of thousands of biological samples under various conditions. The phenotypes are subject to substantial biological variation, and to technical variation due to batches and plates. To achieve high throughput, the screens cannot fully implement the fundamental principles of statistical experimental design, in particular replication and randomization of the order of the replicates throughout the screen. Distinguishing perturbation-induced changes in the phenotypes from stochastic variation is therefore challenging and requires adequate statistical methodology.

HTSmix is a software package for the interpretation of such high-throughput screens measuring low-dimensional quantitative phenotypes. HTSmix represents the structure of signals in the screen using linear mixed models, normalizes and summarizes the phenotypes to make them comparable across samples, and outputs a list of hits while controlling the False Discovery Rate. The methodology is appropriate for experimental designs with at least two control samples profiled throughout the screen. The current implementation has been extensively tested on screens with ionomic phenotypes. A future release will be tested on screens employing other technologies.

HTSmix requires the following R packages: 1. locfdr; 2. fdrtool; 3. lme4; 4. cellHTS2.

Reference: D. Yu, J. Danku, I. Baxter, S. Kim, O. K. Vatamaniuk, D. E. Salt, O. Vitek. Noise reduction in genome-wide perturbation screens using linear mixed-effect models. Submitted to Bioinformatics.

Contact: ovitek@stat.purdue.edu, dyu@stat.purdue.edu

2 Data structures

2.1 Input data structures

The input data structure is an object of class data.frame with the following columns:

Required columns: sample name (e.g. mutant strain), position index in a plate (e.g. tube number), plate index (e.g. tray), and the quantitative values of the phenotype (one column per dimension for multivariate phenotypes). The tube number (or tube index) within a plate is required to identify each replicate in the plate.

Optional columns: index of a batch (where a batch contains several plates), a covariate or a confounding variable (e.g. the optical density score quantifying the sample growth rate), and the column index and the row index of a sample on a plate.

2.2 Example dataset

As an example, the package contains measurements from a perturbation screen quantifying ionomic profiles of 1127 single-gene knock-out diploid strains of yeast. The complete information about this and related datasets is available online.

> data(rawkod)
> rawkod[1:4,]
[Output omitted: the first four rows of rawkod, for the lines BY4743, BLANK, YAL043C and YAL041W, with the columns od, tube, line, tray, run_batch, Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34 and Zn66.]
> # Names of columns containing each dimension of the multivariate phenotype
> elekod
 [1] "Ca44"  "Cd111" "Co59"  "Cu65"  "Fe57"  "K39"   "Mg25"  "Mn55"  "Mo95"
[10] "Na23"  "Ni60"  "P31"   "S34"   "Zn66"
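The input layout described in Sec. 2.1 can be illustrated with a small simulated data frame. This is only a sketch in base R: the object names toy and elements, the plate layout and all numeric values are hypothetical and are not part of HTSmix; only the column names follow the rawkod example.

```r
# Hypothetical toy screen: 2 batches x 2 plates x 3 tubes; all values are simulated.
set.seed(1)
toy <- expand.grid(tube = 1:3, tray = 1:2, run_batch = 1:2)

# Assumption: trays are numbered across the whole screen, so ids are unique.
toy$tray <- (toy$run_batch - 1) * 2 + toy$tray

# One control and two mutants per plate (sample names borrowed from the manual).
toy$line <- rep(c("BY4743", "YAL043C", "YAL041W"), times = 4)

toy$od   <- round(runif(nrow(toy), 0.8, 1.2), 2)  # optional covariate
toy$Ca44 <- rnorm(nrow(toy), mean = 50, sd = 5)   # one phenotype dimension
toy$Zn66 <- rnorm(nrow(toy), mean = 20, sd = 2)   # another phenotype dimension
elements <- c("Ca44", "Zn66")                     # plays the role of elekod

# Check the required columns: sample name, tube, plate, and phenotype dimensions.
required <- c("line", "tube", "tray", elements)
stopifnot(is.data.frame(toy), all(required %in% names(toy)))

# Each (tray, tube) pair must identify exactly one replicate.
stopifnot(!any(duplicated(toy[, c("tray", "tube")])))

str(toy)
```

A data frame organized this way has the same shape as rawkod, with one row per tube and one column per phenotype dimension.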

The columns in the rawkod dataset are defined as follows: od is the optional quantitative covariate (optical density), tube is the position of the sample in the plate, line is the name of the silenced gene (i.e. the identifier of the biological sample of interest), tray is the id of the tray, and run_batch is the id of the batch. The remaining columns contain the quantified mineral nutrients and trace elements (i.e. the dimensions of the multivariate phenotype) profiled in this screen.

3 Normalization

3.1 Basic batch-plate normalization

The function norm2ctr{HTSmix} performs the batch-plate normalization using a linear mixed-effect model. We denote by $X_{gkbp}$ a scored univariate phenotype, where g is the mutant gene, k is the replicate sample of that mutant, b is the batch index and p is the plate index. For multivariate phenotypes we consider each dimension separately, and use the convention that $X_{gkbp}$ represents one particular dimension. The basic normalization model is then specified as

$$X_{gkbp} = \mu_g + B_{gb} + P(B)_{gp} + \varepsilon_{gkbp} \qquad (1)$$

$$B_{gb} \sim N(0, \sigma^2_{B_g}), \quad P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g})$$

where $B_{gb}$ is the batch effect, $P(B)_{gp}$ is the plate effect nested within the batch, and $\varepsilon_{gkbp}$ is the combination of the biological and technical variation. $B_{gb}$, $P(B)_{gp}$ and $\varepsilon_{gkbp}$ are independent. The normalized phenotype is defined as

$$r_{gkbp} = X_{gkbp} - \left[ \hat{B}_{1b} + \hat{P}(B)_{1p} \right] \qquad (2)$$

where the index 1 refers to the control sample used for normalization.

The code below shows how the example dataset can be normalized with respect to the control sample "BY4743" separately for each element. The argument excludestrains allows the user to provide a vector of names of the samples which should be excluded from normalization. The argument dimname specifies all the columns containing the multivariate phenotype.

> norm.x <- norm2ctr(indata = rawkod, ref.ctr1 = "BY4743",
+     excludestrains = c("BLANK", "YLR396C", "YPR065W"),
+     dimname = elekod, batch = "run_batch", tray = "tray")

3.2 Adjustment for a covariate

The function norm2covariate{HTSmix} extends the normalization with an adjustment for a covariate using a linear regression model. If we denote by gr the growth rate normalized with
the same control, then a linear model can be fit to estimate a single linear relationship between the confounding factor and the phenotype across all the biological samples

$$r_{gkbp} = \beta_0 + \beta_1 \, gr_{gkbp} + \varepsilon_{gkbp}, \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g}) \qquad (3)$$

and the adjusted normalized values are obtained as

$$\tilde{r}_{gkbp} = r_{gkbp} - \hat{\beta}_1 \, gr_{gkbp} \qquad (4)$$

The code below shows how the example dataset, initially normalized with respect to the control sample "BY4743", can be further adjusted with respect to optical density. The argument excludestrains specifies a vector of sample names to exclude, and dimname specifies all the dimensions of the phenotype. The argument ref.conf specifies the column of the data structure that contains the covariate of interest. The function normalizes the covariate to the control, and then adjusts the normalized phenotypes to the normalized covariate.

> norm.covariate <- norm2covariate(indata = norm.x, ref.ctr1 = "BY4743",
+     excludestrains = "BLANK", dimname = elekod, batch = "run_batch",
+     tray = "tray", ref.conf = "od")

3.3 Within-plate normalization of row and column effects

Measurements in a plate can be subject to systematic effects of rows and columns. The function norm2cr{HTSmix} extends the linear mixed-effect modeling to account for these effects:

$$X_{gkbp} = \mu_g + R_{ip} + C_{jp} + B_{gb} + P(B)_{gp} + \varepsilon_{gkbp} \qquad (5)$$

$$\sum_i R_{ip} = 0, \quad \sum_j C_{jp} = 0, \quad P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad B_{gb} \sim N(0, \sigma^2_{B_g}), \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g})$$

where $R_{ip}$ and $C_{jp}$ are the deviations of row i and column j on the pth plate, and the remaining notation is as above. The code below illustrates this normalization as applied to the data structure norm.x in Sec. 3.1. The same procedure can be applied to the data normalized as in Sec. 3.2. The argument dimname specifies all the dimensions of the multivariate phenotype.

> norm.cr <- norm2cr(norm.x, dimname = elekod, partialtitle = "kod", makeplot = TRUE)

When the option makeplot is set to TRUE, the code produces a graphical visualization of the importance of the row and column effects suggested by Malo et al. [1], while adding "partialtitle" to the title. The plot for the example dataset is shown in Fig. 1. As can be seen, the column and the row effects have similar variation, and the median values are close to 1. This indicates only a mild effect of rows and columns on this dataset.

[1] Malo, N., et al. Statistical practice in high-throughput screening data analysis, Nature Biotechnology, 24.

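The row and column terms of Eq. (5), and the ratios plotted in Fig. 1, can be illustrated on a single simulated plate. The sketch below is not the norm2cr implementation: it ignores the random batch and plate effects, assumes a complete 8 x 12 plate, and estimates the sum-to-zero row and column effects by centered marginal means, which is the least-squares solution for a complete two-way layout. The function name row_col_effects and all data are hypothetical.

```r
set.seed(2)
n_row <- 8
n_col <- 12                                        # hypothetical 96-well layout
row_shift <- seq(-0.5, 0.5, length.out = n_row)    # simulated row trend
plate <- outer(row_shift, rep(0, n_col), "+") +
  matrix(rnorm(n_row * n_col), n_row, n_col)

# Additive decomposition: value = grand mean + row effect + column effect + residual.
row_col_effects <- function(m) {
  grand <- mean(m)
  r_eff <- rowMeans(m) - grand                     # row effects, sum to zero
  c_eff <- colMeans(m) - grand                     # column effects, sum to zero
  resid <- m - grand - outer(r_eff, c_eff, "+")    # what rows and columns do not explain
  list(row = r_eff, col = c_eff, resid = resid,
       # Ratios of median absolute deviations, in the spirit of Fig. 1.
       ratio_row = mad(r_eff) / mad(as.vector(resid)),
       ratio_col = mad(c_eff) / mad(as.vector(resid)))
}

fit <- row_col_effects(plate)
round(c(row = fit$ratio_row, col = fit$ratio_col), 2)
```

In a real screen the ratios would be computed per plate and per phenotype dimension, and their distribution across plates is what the boxplots summarize.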
Figure 1: Visual representation of the row and column effects, produced by norm2cr{HTSmix} for the rawkod dataset. X-axis: dimensions of the multivariate phenotype. Y-axis: ratios between the median absolute deviations of the column effects and of the model residuals (in blue), and ratios between the median absolute deviations of the row effects and of the model residuals (in yellow). The boxplots summarize the ratios across all plates.

3.4 Export of the results into cellHTS-class

The function HTSmix2cellHTS{HTSmix} exports the raw or the normalized data into a data structure of class cellHTS that is compatible with the package cellHTS2. As a result, other normalization and summarization steps implemented in cellHTS2 can be applied to these datasets directly. For consistency with the arguments of cellHTS, negatives specifies a vector of names of negative controls, and positives specifies a vector of names of positive controls. The argument others specifies additional types of samples, e.g. unknown and BLANK in the case of the example dataset. Each dimension of the multivariate phenotype is exported into a separate object of this class, and dimname1 specifies the relevant dimension.

> res <- HTSmix2cellHTS(norm.x, negatives = c("BY4743", "YDL227C"),
+     positives = c("YLR396C", "YPR065W"), others = "BLANK", dimname1 = "Ca44")

The following example illustrates how data normalized with HTSmix can be exported into cellHTS-class for summarization with cellHTS2. We generate a histogram of the summary statistics of element Ca44 obtained with cellHTS2 after normalization in HTSmix (Fig. 2 (a)).

> # Export normalized phenotypes
> res <- HTSmix2cellHTS(norm.covariate,
+     negatives = c("BY4743", "YDL227C"), positives = c("YLR396C", "YPR065W"),
+     others = "BLANK", dimname1 = "Ca44")
> # Score and summarize replicates in cellHTS2
> res1 <- scoreReplicates(res, sign = "-", method = "zscore")
> res2 <- summarizeReplicates(res1, summary = "mean")
> # Generate histogram
> hist(Data(res2), xlab = "test statistics in cellHTS", main = "")

[Figure 2: two histogram panels, (a) and (b), each with the y-axis "Frequency". The x-axis of panel (a) is "test statistics in cellHTS". The plot is annotated with MLE and CME estimates of delta, sigma and p0.]

Figure 2: (a) Distribution of the summary statistics of the element Ca44 of the example dataset, normalized using HTSmix and summarized using cellHTS2. (b) Distribution of the Z statistics for all elements of the example dataset combined, obtained with HTSmix as described in Sec. 5.

4 Summarization

4.1 Calculation of per-sample summary of the quantitative phenotype

The function dscore{HTSmix} estimates the residual variation in the normalized phenotypes using a second control. It expresses the normalized phenotypes $r_{gkbp}$ of sample g in terms of the random effects of a second linear model

$$r_{gkbp} = \mu_g + P(B)_{gp} + B_{gb} + \varepsilon_{gkbp} \qquad (6)$$

$$P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad B_{gb} \sim N(0, \sigma^2_{B_g}),$$
$$\varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g}), \quad \text{for } g = 2, 3, 4, 5, \ldots$$

For samples profiled in a single plate, the summary phenotype of mutant g is $\mu_g$, and its estimate is equivalent to the average of the observed phenotypes over all replicates, $\bar{r}_g$. The associated estimated variance is

$$\widehat{Var}(\bar{r}_g) = \hat{\sigma}^2_{P_g} + \hat{\sigma}^2_{B_g} + \hat{\sigma}^2_{\varepsilon_g} / n_g \qquad (7)$$

where $n_g$ is the number of within-plate replicate samples of the mutant g. The parameter $\sigma^2_{\varepsilon_g}$ is estimated by the sample variance $s^2_{\varepsilon_g}$, and plug-in estimates of $\sigma^2_{P_g}$ and $\sigma^2_{B_g}$ are obtained from a second control. The overall per-sample summary score $D_g$ is then defined as

$$D_g = \bar{r}_g \Big/ \sqrt{\hat{\sigma}^2_{P_2} + \hat{\sigma}^2_{B_2} + s^2_{\varepsilon_g} / n_g} \qquad (8)$$

where the subscript 2 denotes the estimates obtained from the second control.

The code below illustrates the use of the function dscore{HTSmix} for the summarization of mutant-wise phenotypes, one dimension at a time, for the example dataset. The arguments have the same interpretation as in the other functions of HTSmix.

> d.mut <- dscore(indata = norm.covariate,
+     ref.ctr1 = "BY4743", ref.ctr2 = "YDL227C", excludestrains = "BLANK",
+     dimname = elekod, batch = "run_batch", tray = "tray")
> names(d.mut)
[1] "dmut"     "mean"     "var"      "freq"     "var.ctr2"
> head(d.mut$dmut)
[Output omitted: D scores for the lines BY4743, YAL001C, YAL003W, YAL025C, YAL032C and YAL033W, with one column per element (Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66).]
4.2 Export of the results into cellHTS-class

The summary D score or the test statistics calculated in HTSmix can also be exported into an object of class cellHTS, separately for each dimension of a multivariate phenotype, using the function HTSmix2cellHTS{HTSmix}.

> z.mut = zscore(mutfile = d.mut$dmut,
+     center = TRUE, ref.ctr1 = "BY4743", ref.ctr2 = "YDL227C",
+     excludestrains = NULL,
+     partialtitle = "kod", ele = elekod)
> res <- HTSmix2cellHTS(z.mut, negatives = c("BY4743", "YDL227C"),
+     positives = c("YLR396C", "YPR065W"), others = "BLANK",
+     summarized = TRUE, dimname1 = "Ca44")

Here the argument summarized indicates whether the input data are the result of summarization, and the argument dimname1 indicates the column name of the dimension of the phenotype of interest. The remaining arguments have the same meaning as above.

5 Determination of hits

Finally, the package implements the detection of hits among the normalized and summarized phenotypes, while controlling the False Discovery Rate (FDR) at the desired level. First, getzrr2con{HTSmix} calculates a per-sample standardized summary score, called the Z statistic, which is comparable across all the dimensions of the phenotype:

$$Z_g = \frac{D_g - \mathrm{median}(D_g)}{\mathrm{median}\left( \left| D_g - \mathrm{median}(D_g) \right| \right) \cdot C} \qquad (9)$$

Here $C = 1/\Phi^{-1}(3/4)$ is a normalizing constant for a robust unbiased estimation of the scale [2].

> # Get mutant-wise Z statistics
> z.mut = zscore(mutfile = d.mut$dmut,
+     center = TRUE, ref.ctr1 = "BY4743", ref.ctr2 = "YDL227C",
+     partialtitle = "kod", ele = elekod)

[2] Hoaglin, D., et al. Understanding robust and exploratory data analysis, John Wiley & Sons.

> head(z.mut)
[Output omitted: Z statistics for the lines YAL001C, YAL003W, YAL025C, YAL032C, YAL033W and YAL034W-A, with one column per element (Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66).]

The approach by Efron [3], implemented in the R package locfdr, is then used to determine the cutoff of $Z_g$ that controls the FDR at the level FDRcutoff. The histogram in Fig. 2 (b) visualizes the distribution of the statistics.

> z.locfdr <- gethits(z.mut, FDRcutoff = 0.05, partialtitle = "KOd", makeplot = TRUE)

[3] Efron, B. Microarrays, empirical Bayes, and the two-groups model, Statistical Science, 23, 1-22.
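The robust standardization of Eq. (9) can be written in a few lines of base R. This is a sketch of the formula only, applied to one simulated phenotype dimension; it is not the package's own code, and the function name robust_z and the simulated D scores are hypothetical.

```r
# Robust standardization of Eq. (9): center by the median, scale by the
# median absolute deviation multiplied by C = 1 / qnorm(3/4).
robust_z <- function(d) {
  C <- 1 / qnorm(3 / 4)                           # approximately 1.4826
  center <- median(d, na.rm = TRUE)
  scale  <- median(abs(d - center), na.rm = TRUE) * C
  (d - center) / scale
}

set.seed(3)
d_scores <- c(rnorm(200), 8, -7)                  # simulated D scores with two outliers
z <- robust_z(d_scores)

# stats::mad() applies the same constant (1.4826) by default, so the two
# computations agree up to the rounding of that constant.
all.equal(z, (d_scores - median(d_scores)) / mad(d_scores), tolerance = 1e-4)
```

Because the median and the median absolute deviation are insensitive to a small fraction of extreme values, true hits do not inflate the scale estimate the way they would inflate a standard deviation.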
