Manual: R package HTSmix
Olga Vitek and Danni Yu
May 2,

1 Overview

High-throughput screens (HTS) measure phenotypes of thousands of biological samples under various conditions. The phenotypes are subject to substantial biological variation, and to technical variation due to batches and plates. To enable high throughput, the screens cannot fully implement the fundamental principles of statistical experimental design, in particular replication and randomization of the order of the replicates throughout the screen. Distinguishing perturbation-induced changes in the phenotypes from stochastic variation is therefore challenging, and requires adequate statistical methodology.

HTSmix is a software package for the interpretation of such high-throughput screens measuring low-dimensional quantitative phenotypes. HTSmix represents the structure of signals in the screen using linear mixed models, normalizes and summarizes the phenotypes to make them comparable across samples, and outputs a list of hits while controlling the False Discovery Rate. The methodology is appropriate for experimental designs with at least two control samples profiled throughout the screen. The current implementation is extensively tested on screens with ionomic phenotypes. A future release will be tested on screens employing other technologies.

HTSmix requires the following R packages: 1. locfdr; 2. fdrtool; 3. lme4; 4. cellHTS2.

Reference: D. Yu, J. Danku, I. Baxter, S. Kim, O. K. Vatamaniuk, D. E. Salt, O. Vitek. Noise reduction in genome-wide perturbation screens using linear mixed-effect models. Submitted to Bioinformatics.

ovitek@stat.purdue.edu, dyu@stat.purdue.edu
2 Data structures

2.1 Input data structures

The input data structure is an object of class data.frame with the following columns:

Required columns: sample name (e.g. mutant strain), position index in a plate (e.g. tube number), plate index (e.g. tray), and the quantitative values of the phenotype (one column per dimension for multivariate phenotypes). The tube number (or tube index) within a plate is required to identify each replicate in a plate.

Optional columns: index of a batch (where a batch contains several plates), a covariate or a confounding variable (e.g. the optical density score quantifying the sample growth rate), and the column index and the row index of a sample on a plate.

2.2 Example dataset

As an example, the package contains measurements from a perturbation screen quantifying ionomic profiles of 1127 single-gene knock-out diploid strains of yeast. The complete information about this and related datasets is available at

> data(rawkod)
> rawkod[1:4,]
  [output columns: od, tube, line, tray, run_batch, Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66; rows shown for lines BY4743, BLANK, YAL043C, YAL041W; numeric values omitted]
>
> # Names of columns containing each dimension of the multivariate phenotype
> elekod
 [1] "Ca44"  "Cd111" "Co59"  "Cu65"  "Fe57"  "K39"   "Mg25"  "Mn55"  "Mo95"
[10] "Na23"  "Ni60"  "P31"   "S34"   "Zn66"
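The column layout described in Sec. 2.1 can be illustrated with a small made-up data frame. This is only a sketch: the column names mirror those of rawkod, all values are simulated, the object names toy, required and phenotype_cols are hypothetical, and it has not been checked that the HTSmix functions accept this toy object.

```r
# Toy input in the layout of Sec. 2.1 (all values are simulated, not real data).
set.seed(1)
toy <- data.frame(
  line      = rep(c("BY4743", "YDL227C", "YAL001C", "BLANK"), times = 2), # sample name
  tube      = rep(1:4, times = 2),            # position index within a plate
  tray      = rep(1:2, each = 4),             # plate index
  run_batch = rep(1L, 8),                     # optional: batch index
  od        = runif(8, 0.4, 0.9),             # optional: covariate (optical density)
  Ca44      = rnorm(8, mean = 50, sd = 5),    # phenotype, dimension 1
  Cd111     = rnorm(8, mean = 1, sd = 0.1),   # phenotype, dimension 2
  stringsAsFactors = FALSE
)

required       <- c("line", "tube", "tray")
phenotype_cols <- c("Ca44", "Cd111")          # plays the role of elekod

# Basic sanity checks before any normalization
stopifnot(all(c(required, phenotype_cols) %in% names(toy)))
stopifnot(all(vapply(toy[phenotype_cols], is.numeric, logical(1))))
# Each (tray, tube) pair should identify exactly one measurement
stopifnot(!any(duplicated(toy[c("tray", "tube")])))
```

Checks of this kind are cheap and catch the most common layout problems (a missing required column, a non-numeric phenotype column, or a duplicated position within a plate) before model fitting.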
The columns in the rawkod dataset are defined as follows: od is the optional quantitative covariate (optical density), tube is the position of the sample in the plate, line is the name of the silenced gene (i.e. the identifier of the biological sample of interest), tray is the id of the tray, and run_batch is the id of the batch. The remaining columns contain the quantified mineral nutrients and trace elements (i.e. the dimensions of the multivariate phenotype) profiled in this screen.

3 Normalization

3.1 Basic batch-plate normalization

The function norm2ctr{HTSmix} performs the batch-plate normalization with a linear mixed-effect modeling procedure. We denote by $X_{gkbp}$ a scored univariate phenotype, where $g$ is the mutant gene, $k$ is the replicate sample of that mutant, $b$ is the batch index and $p$ is the plate index. For multivariate phenotypes we consider each dimension separately, and use the convention that $X_{gkbp}$ represents one particular dimension. The basic normalization model is then specified as

$$X_{gkbp} = \mu_g + B_{gb} + P(B)_{gp} + \varepsilon_{gkbp} \qquad (1)$$

$$B_{gb} \sim N(0, \sigma^2_{B_g}), \quad P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g})$$

where $B_{gb}$ is the batch effect, $P(B)_{gp}$ is the plate effect nested within the batch, and $\varepsilon_{gkbp}$ is the combination of the biological and technical variation. $B_{gb}$, $P(B)_{gp}$ and $\varepsilon_{gkbp}$ are independent. The normalized phenotype is defined as

$$r_{gkbp} = X_{gkbp} - \left[\hat{B}_{1b} + \hat{P}(B)_{1p}\right] \qquad (2)$$

where the subscript 1 refers to the control sample used for normalization.

The code below shows how the example dataset can be normalized with respect to the control sample "BY4743" separately for each element. The argument excludestrains allows the user to provide a vector of names of the samples which should be excluded from normalization. The argument dimname specifies all the columns containing the multivariate phenotype.
> norm.x <- norm2ctr(indata = rawkod, ref.ctr1="BY4743",
+ excludestrains=c("BLANK", "YLR396C", "YPR065W"),
+ dimname=elekod, batch="run_batch", tray="tray")

3.2 Adjustment for a covariate

The function norm2covariate{HTSmix} extends the normalization with an adjustment for a covariate using a linear regression model. If we denote by $gr_{gkbp}$ the growth rate normalized with the same control, then a linear model can be fit to estimate a single linear relationship between the confounding factor and the phenotype across all the biological samples

$$r_{gkbp} = \beta_0 + \beta_1 \, gr_{gkbp} + \varepsilon_{gkbp}, \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g}) \qquad (3)$$

and the adjusted normalized values are obtained as

$$\tilde{r}_{gkbp} = r_{gkbp} - \hat{\beta}_1 \, gr_{gkbp} \qquad (4)$$

The code below shows how the example dataset, initially normalized with respect to the control sample "BY4743", can be further adjusted with respect to optical density. The argument excludestrains specifies a vector of sample names to exclude, and dimname specifies all the dimensions of the phenotype. The argument ref.conf specifies the column of the data structure that contains the covariate of interest. The function normalizes the covariate to the control, and then adjusts the normalized phenotypes to the normalized covariate.

> norm.covariate <- norm2covariate(indata = norm.x, ref.ctr1 = "BY4743",
+ excludestrains="BLANK", dimname = elekod, batch= "run_batch",
+ tray = "tray", ref.conf="od")

3.3 Within-plate normalization of row and column effects

Measurements in a plate can be subject to systematic effects of rows and columns. The function norm2cr{HTSmix} extends the linear mixed-effect model with fixed row and column effects to account for them:

$$X_{gkbp} = \mu_g + R_{ip} + C_{jp} + B_{gb} + P(B)_{gp} + \varepsilon_{gkbp} \qquad (5)$$

$$\sum_i R_{ip} = 0, \quad \sum_j C_{jp} = 0, \quad P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad B_{gb} \sim N(0, \sigma^2_{B_g}), \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g})$$

where $R_{ip}$ and $C_{jp}$ are the deviations of row $i$ and column $j$ on the $p$th plate, and the remaining notation is as above.

The code below illustrates this normalization as applied to the data structure norm.x from Sec. 3.1. The same procedure can be applied to the data normalized as in Sec. 3.2. The argument dimname specifies all the dimensions of the multivariate phenotype.
> norm.cr <- norm2cr(norm.x, dimname=elekod, partialtitle="kod", makeplot=TRUE)

When the option makeplot is set to TRUE, the code produces a graphical visualization of the importance of the row and column effects, as suggested by Malo et al. [1], and adds the value of partialtitle to the plot title. The plot for the example dataset is shown in Fig. 1. As can be seen, the column and the row effects have a similar variation, and the median values of the ratios are close to 1. This indicates only a mild effect of rows and columns on this dataset.

[1] Malo, N., et al. Statistical practice in high-throughput screening data analysis. Nature Biotechnology, 24.
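The row and column terms of Eq. (5), and the ratios displayed in Fig. 1, can be sketched for a single simulated plate using base R only. This is a simplified illustration and not the norm2cr implementation: it drops the random batch and plate terms, fits only fixed row and column effects with sum-to-zero contrasts, and uses simulated data. The names plate, row_shift, ratio_row and ratio_col are hypothetical.

```r
# One simulated 8 x 12 plate with a mild row gradient and no column effect.
set.seed(42)
n_row <- 8
n_col <- 12
plate <- expand.grid(row = factor(1:n_row), col = factor(1:n_col))
row_shift <- seq(-0.5, 0.5, length.out = n_row)   # simulated row effects, sum to 0
plate$x <- 10 + row_shift[as.integer(plate$row)] + rnorm(nrow(plate), sd = 1)

# Fixed row and column effects under the constraints sum(R) = 0, sum(C) = 0
fit <- lm(x ~ row + col, data = plate,
          contrasts = list(row = "contr.sum", col = "contr.sum"))

# With contr.sum, the effect of the last level is minus the sum of the others
cf <- coef(fit)
row_eff <- cf[grep("^row", names(cf))]
row_eff <- c(row_eff, -sum(row_eff))
col_eff <- cf[grep("^col", names(cf))]
col_eff <- c(col_eff, -sum(col_eff))

# Ratios of median absolute deviations: effects versus model residuals
ratio_row <- mad(row_eff) / mad(residuals(fit))
ratio_col <- mad(col_eff) / mad(residuals(fit))
```

Repeating this per plate and per phenotype dimension, and summarizing the ratios with boxplots, would give a display in the spirit of the figure; ratios well above 1 would point to row or column effects that dominate the residual variation.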
Figure 1: Visual representation of the row and column effects, produced by norm2cr{HTSmix} for the rawkod dataset. X-axis: dimensions of the multivariate phenotype. Y-axis: ratios between median absolute deviations of column effects and model residuals (in blue), and ratios between median absolute deviations of row effects and model residuals (in yellow). The boxplots summarize the ratios across all plates.

3.4 Export of the results into cellHTS-class

The function HTSmix2cellHTS{HTSmix} exports the raw or the normalized data into a data structure of class cellHTS that is compatible with the package cellHTS2. As a result, other normalization and summarization steps implemented in cellHTS2 can be applied to these datasets directly. For consistency with the arguments of cellHTS, negatives specifies a vector of names of negative controls, and positives specifies a vector of names of positive controls. others specifies additional types of samples, e.g. unknown and BLANK in the case of the example dataset. Each dimension of the multivariate phenotype is exported into a separate object of this class, and dimname1 specifies the relevant dimension.

> res <- HTSmix2cellHTS(norm.x, negatives=c("BY4743","YDL227C"),
+ positives=c("YLR396C","YPR065W"), others="BLANK", dimname1="Ca44");

The following example illustrates how data normalized with HTSmix can be exported into cellHTS-class for summarization with cellHTS2. We generate a histogram of the summary statistics of element Ca44 obtained with cellHTS2 after normalization in HTSmix (Fig. 2 (a)).

> # Export normalized phenotypes
> res <- HTSmix2cellHTS(norm.covariate,
+ negatives=c("BY4743","YDL227C"), positives=c("YLR396C","YPR065W"),
+ others="BLANK", dimname1="Ca44");
> # Score and summarize replicates in cellHTS2
> res1 <- scoreReplicates(res, sign="-", method="zscore")
> res2 <- summarizeReplicates(res1, summary="mean")
> # Generate histogram
> hist(Data(res2), xlab="test statistics in cellHTS", main="");

Figure 2: (a) Distribution of the summary statistics of the element Ca44 of the example dataset, normalized using HTSmix and summarized using cellHTS2. (b) Distribution of the Z statistics for all elements of the example dataset combined, obtained with HTSmix as described in Sec. 5.

4 Summarization

4.1 Calculation of per-sample summary of the quantitative phenotype

The function dscore{HTSmix} estimates the residual variation in the normalized phenotypes using a second control. It expresses the normalized phenotypes $r_{gkbp}$ of sample $g$ in terms of random effects in a second linear model

$$r_{gkbp} = \mu_g + P(B)_{gp} + B_{gb} + \varepsilon_{gkbp} \qquad (6)$$

$$P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad B_{gb} \sim N(0, \sigma^2_{B_g}), \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g}), \quad \text{for } g = 2, 3, 4, 5, \ldots$$

For samples profiled in a single plate, the summary phenotype of mutant $g$ is $\mu_g$, and its estimate is equivalent to the average $\bar{r}_g$ of the observed phenotypes over all replicates. The associated estimated variation is

$$\widehat{Var}(\bar{r}_g) = \hat{\sigma}^2_{P_g} + \hat{\sigma}^2_{B_g} + \hat{\sigma}^2_{\varepsilon_g} / n_g \qquad (7)$$

where $n_g$ is the number of within-plate replicate samples of the mutant $g$. The parameter $\hat{\sigma}^2_{\varepsilon_g}$ is estimated by the sample variance $s^2_{\varepsilon_g}$, and plug-in estimates of $\sigma^2_{P_g}$ and $\sigma^2_{B_g}$ are obtained from a second control. The overall per-sample summary score $D_g$ is then defined as

$$D_g = \bar{r}_g \Big/ \sqrt{\hat{\sigma}^2_{P_g} + \hat{\sigma}^2_{B_g} + s^2_{\varepsilon_g} / n_g} \qquad (8)$$

The code below illustrates the use of the function dscore{HTSmix} for summarization of mutant-wise phenotypes, dimension by dimension, for the example dataset. The arguments have the same interpretation as in the other functions of HTSmix.

> d.mut <- dscore(indata = norm.covariate,
+ ref.ctr1="BY4743", ref.ctr2="YDL227C", exludedstrains ="BLANK",
+ dimname=elekod, batch="run_batch", tray="tray")
> names(d.mut);
[1] "dmut" "mean" "var" "freq" "var.ctr2"
> head(d.mut$dmut)
  [output columns: Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66; rows: BY4743, YAL001C, YAL003W, YAL025C, YAL032C, YAL033W; numeric values omitted]
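The arithmetic behind Eqs. (7) and (8) can be sketched in a few lines of base R. This is an illustration under stated assumptions and not the dscore implementation: the function d_score and its arguments are hypothetical, the plate and batch variance components are passed in as plain numbers standing in for the plug-in estimates from the second control, and the replicate values are made up.

```r
# D score for one sample and one phenotype dimension, following Eqs. (7)-(8).
# r         : normalized replicate values of the sample (numeric vector)
# var_plate : plug-in estimate of the plate variance (assumed given)
# var_batch : plug-in estimate of the batch variance (assumed given)
d_score <- function(r, var_plate, var_batch) {
  n <- length(r)
  stopifnot(n >= 2)                  # the sample variance needs >= 2 replicates
  r_bar  <- mean(r)                  # estimate of mu_g
  s2_eps <- var(r)                   # sample variance of the replicates
  r_bar / sqrt(var_plate + var_batch + s2_eps / n)
}

# Made-up replicate values and variance components
r_g <- c(0.8, 1.1, 0.9, 1.2)
d_g <- d_score(r_g, var_plate = 0.02, var_batch = 0.01)
```

Because the plate and batch components do not shrink with the number of replicates, adding replicates reduces only the last term of the denominator; for samples with few replicates the plug-in components can dominate the score.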
4.2 Export of the results into cellHTS-class

The summary D score or the test statistics calculated in HTSmix can also be exported into an object of class cellHTS, separately for each dimension of a multivariate phenotype, using the function HTSmix2cellHTS{HTSmix}.

> z.mut = zscore(mutfile = d.mut$dmut,
+ center = TRUE, ref.ctr1= "BY4743", ref.ctr2="YDL227C",
+ excludestrains=NULL,
+ partialtitle="kod", ele=elekod )
> res <- HTSmix2cellHTS(z.mut, negatives=c("BY4743","YDL227C"),
+ positives=c("YLR396C","YPR065W"), others="BLANK",
+ summarized=TRUE, dimname1="Ca44")

Here the argument summarized indicates whether the input data are the result of summarization, and the argument dimname1 indicates the column name of the dimension of the phenotype of interest. The remaining arguments have the same meaning as above.

5 Determination of hits

Finally, the package implements the detection of hits among the normalized and summarized phenotypes, while controlling the False Discovery Rate (FDR) at the desired level. First, getzrr2con{HTSmix} calculates a per-sample standardized summary score, called the Z statistic, which is comparable across all the dimensions of the phenotype:

$$Z_g = \frac{D_g - \mathrm{median}(D_g)}{\mathrm{median}\left(\left|D_g - \mathrm{median}(D_g)\right|\right) \cdot C} \qquad (9)$$

Here $C = 1/\Phi^{-1}(3/4)$ is a normalizing constant for a robust unbiased estimation of the scale [2].

> # Get mutant-wise Z statistics
> z.mut = zscore(mutfile = d.mut$dmut,
+ center = TRUE, ref.ctr1= "BY4743", ref.ctr2="YDL227C",
+ partialtitle="kod", ele=elekod )

[2] Hoaglin, D., et al. Understanding robust and exploratory data analysis. John Wiley & Sons.
> head(z.mut)
  [output columns: Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66; rows: YAL001C, YAL003W, YAL025C, YAL032C, YAL033W, YAL034W-A; numeric values omitted]

The approach by Efron [3], implemented in the R package locfdr, is then used to determine the cutoff of $Z_g$ that controls the FDR at the level FDRcutoff. The histogram in Fig. 2 (b) visualizes the distribution of the statistics.

> z.locfdr <- gethits(z.mut, FDRcutoff = 0.05, partialtitle ="KOd", makeplot=TRUE)

[3] Efron, B. Microarrays, Empirical Bayes, and the two-groups model. Statistical Science, 23, 1-22.
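The robust standardization in Eq. (9) can be written directly in base R. This is a minimal sketch and not the package's own function: robust_z is a hypothetical name, the input scores are simulated, and the FDR-controlling cutoff from locfdr is not reproduced here.

```r
# Robust Z statistic of Eq. (9): center by the median and scale by the
# median absolute deviation times C = 1 / qnorm(3/4) (about 1.4826).
robust_z <- function(d) {
  C <- 1 / qnorm(3 / 4)
  centered <- d - median(d, na.rm = TRUE)
  centered / (median(abs(centered), na.rm = TRUE) * C)
}

# Simulated D scores: mostly null values plus two artificial outliers
set.seed(7)
d <- c(rnorm(200), 8, -9)
z <- robust_z(d)

# Cross-check against stats::mad(), which applies the same scaling constant
z_ref <- (d - median(d)) / mad(d, constant = 1 / qnorm(3 / 4))
stopifnot(isTRUE(all.equal(z, z_ref)))
```

Because both the center and the scale are based on medians, a handful of extreme scores (the candidate hits) barely affects the standardization of the remaining samples, which is the reason for preferring this form over the mean and standard deviation.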
More informationPackage CorrMixed. R topics documented: August 4, Type Package
Type Package Package CorrMixed August 4, 2016 Title Estimate Correlations Between Repeatedly Measured Endpoints (E.g., Reliability) Based on Linear Mixed-Effects Models Version 0.1-13 Date 2015-03-08 Author
More informationPackage gma. September 19, 2017
Type Package Title Granger Mediation Analysis Version 1.0 Date 2018-08-23 Package gma September 19, 2017 Author Yi Zhao , Xi Luo Maintainer Yi Zhao
More informationarxiv: v1 [stat.me] 30 Dec 2017
arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw
More informationPrincipal Component Analysis, A Powerful Scoring Technique
Principal Component Analysis, A Powerful Scoring Technique George C. J. Fernandez, University of Nevada - Reno, Reno NV 89557 ABSTRACT Data mining is a collection of analytical techniques to uncover new
More informationSPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA
SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA D. Pokrajac Center for Information Science and Technology Temple University Philadelphia, Pennsylvania A. Lazarevic Computer
More informationOptimal normalization of DNA-microarray data
Optimal normalization of DNA-microarray data Daniel Faller 1, HD Dr. J. Timmer 1, Dr. H. U. Voss 1, Prof. Dr. Honerkamp 1 and Dr. U. Hobohm 2 1 Freiburg Center for Data Analysis and Modeling 1 F. Hoffman-La
More informationThe Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)
The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE
More information= Prob( gene i is selected in a typical lab) (1)
Supplementary Document This supplementary document is for: Reproducibility Probability Score: Incorporating Measurement Variability across Laboratories for Gene Selection (006). Lin G, He X, Ji H, Shi
More informationHOMEWORK ANALYSIS #2 - STOPPING DISTANCE
HOMEWORK ANALYSIS #2 - STOPPING DISTANCE Total Points Possible: 35 1. In your own words, summarize the overarching problem and any specific questions that need to be answered using the stopping distance
More informationFast and Accurate Causal Inference from Time Series Data
Fast and Accurate Causal Inference from Time Series Data Yuxiao Huang and Samantha Kleinberg Stevens Institute of Technology Hoboken, NJ {yuxiao.huang, samantha.kleinberg}@stevens.edu Abstract Causal inference
More informationA variance-stabilizing transformation for gene-expression microarray data
BIOINFORMATICS Vol. 18 Suppl. 1 00 Pages S105 S110 A variance-stabilizing transformation for gene-expression microarray data B. P. Durbin 1, J. S. Hardin, D. M. Hawins 3 and D. M. Roce 4 1 Department of
More informationINTRODUCTION TO BASIC LINEAR REGRESSION MODEL
INTRODUCTION TO BASIC LINEAR REGRESSION MODEL 13 September 2011 Yogyakarta, Indonesia Cosimo Beverelli (World Trade Organization) 1 LINEAR REGRESSION MODEL In general, regression models estimate the effect
More informationPackage RATest. November 30, Type Package Title Randomization Tests
Type Package Title Randomization Tests Package RATest November 30, 2018 A collection of randomization tests, data sets and examples. The current version focuses on three testing problems and their implementation
More informationNon-specific filtering and control of false positives
Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview
More informationMax. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes
Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the
More informationMethods Used for Estimating Statistics in EdSurvey Developed by Paul Bailey & Michael Cohen May 04, 2017
Methods Used for Estimating Statistics in EdSurvey 1.0.6 Developed by Paul Bailey & Michael Cohen May 04, 2017 This document describes estimation procedures for the EdSurvey package. It includes estimation
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationOverview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database
Overview - MS Proteomics in One Slide Obtain protein Digest into peptides Acquire spectra in mass spectrometer MS masses of peptides MS/MS fragments of a peptide Results! Match to sequence database 2 But
More informationAnalysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science
1 Analysis and visualization of protein-protein interactions Olga Vitek Assistant Professor Statistics and Computer Science 2 Outline 1. Protein-protein interactions 2. Using graph structures to study
More informationDifferentially Private Linear Regression
Differentially Private Linear Regression Christian Baehr August 5, 2017 Your abstract. Abstract 1 Introduction My research involved testing and implementing tools into the Harvard Privacy Tools Project
More informationDESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective
DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective Second Edition Scott E. Maxwell Uniuersity of Notre Dame Harold D. Delaney Uniuersity of New Mexico J,t{,.?; LAWRENCE ERLBAUM ASSOCIATES,
More informationA measurement error model approach to small area estimation
A measurement error model approach to small area estimation Jae-kwang Kim 1 Spring, 2015 1 Joint work with Seunghwan Park and Seoyoung Kim Ouline Introduction Basic Theory Application to Korean LFS Discussion
More informationComputational Biology Course Descriptions 12-14
Computational Biology Course Descriptions 12-14 Course Number and Title INTRODUCTORY COURSES BIO 311C: Introductory Biology I BIO 311D: Introductory Biology II BIO 325: Genetics CH 301: Principles of Chemistry
More informationExam: high-dimensional data analysis January 20, 2014
Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish
More informationStatistics for Differential Expression in Sequencing Studies. Naomi Altman
Statistics for Differential Expression in Sequencing Studies Naomi Altman naomi@stat.psu.edu Outline Preliminaries what you need to do before the DE analysis Stat Background what you need to know to understand
More informationA Short Course in Basic Statistics
A Short Course in Basic Statistics Ian Schindler November 5, 2017 Creative commons license share and share alike BY: C 1 Descriptive Statistics 1.1 Presenting statistical data Definition 1 A statistical
More informationPCA and admixture models
PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1
More informationPredicting causal effects in large-scale systems from observational data
nature methods Predicting causal effects in large-scale systems from observational data Marloes H Maathuis 1, Diego Colombo 1, Markus Kalisch 1 & Peter Bühlmann 1,2 Supplementary figures and text: Supplementary
More information