Manual: R package HTSmix

Olga Vitek and Danni Yu (ovitek@stat.purdue.edu, dyu@stat.purdue.edu)
May 2

1 Overview

High-throughput screens (HTS) measure phenotypes of thousands of biological samples under various conditions. The phenotypes are subject to substantial biological variation, and to technical variation due to batches and plates. To achieve high throughput, the screens cannot fully implement the fundamental principles of statistical experimental design, in particular replication and randomization of the order of the replicates throughout the screen. Distinguishing perturbation-induced changes in the phenotypes from stochastic variation is therefore challenging, and requires adequate statistical methodology.

HTSmix is a software package for the interpretation of such high-throughput screens measuring low-dimensional quantitative phenotypes. HTSmix represents the structure of signals in the screen using linear mixed models, normalizes and summarizes the phenotypes to make them comparable across samples, and outputs a list of hits while controlling the False Discovery Rate. The methodology is appropriate for experimental designs with at least two control samples profiled throughout the screen. The current implementation is extensively tested on screens with ionomic phenotypes. A future release will be tested on screens employing other technologies.

HTSmix requires the following R packages:
1. locfdr
2. fdrtool
3. lme4
4. cellHTS2

Reference: D. Yu, J. Danku, I. Baxter, S. Kim, O. K. Vatamaniuk, D. E. Salt, O. Vitek. Noise reduction in genome-wide perturbation screens using linear mixed-effect models. Submitted to Bioinformatics.
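Before loading HTSmix it can help to confirm that the four packages named above are available. The snippet below is a minimal sketch using base R only; it is not part of HTSmix, and the variable names (required, installed, missing_pkgs) are illustrative.

```r
# Packages named in the Overview; cellHTS2 is distributed through Bioconductor.
required <- c("locfdr", "fdrtool", "lme4", "cellHTS2")

# requireNamespace() returns FALSE when a package is absent and does not
# attach anything to the search path.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```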

2 Data structures

2.1 Input data structures

The input data structure is an object of class data.frame with the following columns:

Required columns: sample name (e.g. mutant strain), position index in a plate (e.g. tube number), plate index (e.g. tray), and the quantitative values of the phenotype (one column per dimension for multivariate phenotypes). The tube number (or tube index) within a plate is required to identify each replicate in a plate.

Optional columns: index of a batch (where a batch contains several plates), a covariate or a confounding variable (e.g. the optical density score quantifying the sample growth rate), and the column index and the row index of a sample on a plate.

2.2 Example dataset

As an example, the package contains measurements from a perturbation screen quantifying ionomic profiles of 1127 single-gene knock-out diploid strains of yeast. The complete information about this and related datasets is available online.

> data(rawkod);
> rawkod[1:4,]
[Output omitted: four rows, for the lines BY4743, BLANK, YAL043C and YAL041W, with the columns od, tube, line, tray, run_batch, Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34 and Zn66.]
> # Names of columns containing each dimension of the multivariate phenotype
> elekod
 [1] "Ca44"  "Cd111" "Co59"  "Cu65"  "Fe57"  "K39"   "Mg25"  "Mn55"  "Mo95"
[10] "Na23"  "Ni60"  "P31"   "S34"   "Zn66"
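To make the expected layout concrete, the sketch below builds a small data.frame with the same kinds of columns as the example dataset. All values are simulated and the object names (toy, eletoy) are hypothetical; only the column roles (sample name, tube, tray, batch, covariate, one column per phenotype dimension) come from the description above.

```r
set.seed(1)

# Hypothetical layout: 2 batches x 2 trays per batch x 4 tubes per tray.
toy <- expand.grid(tube = 1:4, tray = 1:2, run_batch = 1:2)
toy$tray <- toy$tray + 2 * (toy$run_batch - 1)   # make tray ids unique across batches

# Required columns: sample name, position in plate (tube), plate index (tray).
toy$line <- rep(c("BY4743", "YDL227C", "YAL001C", "BLANK"), times = 4)

# Optional covariate (optical density) and two phenotype dimensions.
toy$od   <- runif(nrow(toy), 0.5, 1.5)
toy$Ca44 <- rnorm(nrow(toy), mean = 50, sd = 5)
toy$Zn66 <- rnorm(nrow(toy), mean = 20, sd = 2)

toy <- toy[, c("od", "tube", "line", "tray", "run_batch", "Ca44", "Zn66")]
eletoy <- c("Ca44", "Zn66")   # plays the role of elekod for this toy table
str(toy)
```

Each control strain appears once per tray here, which mirrors the requirement that controls be profiled throughout the screen.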

The columns in this dataset are defined as follows: od is the optional quantitative covariate (optical density), tube is the position of the sample in the plate, line is the name of the silenced gene (i.e. the identifier of the biological sample of interest), tray is the id of the tray, and run_batch is the id of the batch. The remaining columns contain the quantified mineral nutrients and trace elements (i.e. the dimensions of the multivariate phenotype) profiled in this screen.

3 Normalization

3.1 Basic batch-plate normalization

The function norm2ctr{HTSmix} performs the batch-plate normalization with a linear mixed-effect modeling procedure. We denote by $X_{gkbp}$ a scored univariate phenotype, where g is the mutant gene, k is the replicate sample of that mutant, b is the batch index and p is the plate index. For multivariate phenotypes we consider each dimension separately, and use the convention that $X_{gkbp}$ represents one particular dimension. The basic normalization model is then specified as

$$X_{gkbp} = \mu_g + B_{gb} + P(B)_{gp} + \varepsilon_{gkbp} \tag{1}$$
$$B_{gb} \sim N(0, \sigma^2_{B_g}), \quad P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g})$$

where $B_{gb}$ is the batch effect, $P(B)_{gp}$ is the plate effect nested within the batch, and $\varepsilon_{gkbp}$ is the combination of the biological and technical variation. $B_{gb}$, $P(B)_{gp}$ and $\varepsilon_{gkbp}$ are independent. The normalized phenotype is defined as

$$r_{gkbp} = X_{gkbp} - [\hat{B}_{1b} + \hat{P}(B)_{1p}] \tag{2}$$

where the index 1 refers to the control sample used for normalization.

The code below shows how the example dataset can be normalized with respect to the control sample "BY4743" separately for each element. The argument excludestrains allows the user to provide a vector of names of the samples which should be excluded from normalization. The argument dimname specifies all the columns containing the multivariate phenotype.

> norm.x <- norm2ctr(indata = rawkod, ref.ctr1="BY4743",
+     excludestrains=c("BLANK", "YLR396C", "YPR065W"),
+     dimname=elekod, batch="run_batch", tray="tray")

3.2 Adjustment for a covariate

The function norm2covariate{HTSmix} extends the normalization with an adjustment for a covariate using a linear regression model. If we denote by $gr_{gkbp}$ the growth rate normalized with the same control, then a linear model can be fit to estimate a single linear relationship between the confounding factor and the phenotype across all the biological samples

$$r_{gkbp} = \beta_0 + \beta_1 \, gr_{gkbp} + \varepsilon_{gkbp}, \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g}) \tag{3}$$

and the adjusted normalized values are obtained as

$$\tilde{r}_{gkbp} = r_{gkbp} - \hat{\beta}_1 \, gr_{gkbp} \tag{4}$$

The code below shows how the example dataset, initially normalized with respect to the control sample "BY4743", can be further adjusted with respect to optical density. The argument excludestrains specifies a vector of sample names to exclude, and dimname specifies all the dimensions of the phenotype. The argument ref.conf specifies the column of the data structure that contains the covariate of interest. The function normalizes the covariate to the control, and then adjusts the normalized phenotypes to the normalized covariate.

> norm.covariate <- norm2covariate(indata = norm.x, ref.ctr1 = "BY4743",
+     excludestrains="BLANK", dimname = elekod, batch= "run_batch",
+     tray = "tray", ref.conf="od")

3.3 Within-plate normalization of row and column effects

Measurements in a plate can be subject to systematic effects of rows and columns. The function norm2cr{HTSmix} extends the linear mixed-effect modeling to account for these effects:

$$X_{gkbp} = \mu_g + R_{ip} + C_{jp} + B_{gb} + P(B)_{gp} + \varepsilon_{gkbp} \tag{5}$$
$$\sum_i R_{ip} = 0, \quad \sum_j C_{jp} = 0, \quad P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad B_{gb} \sim N(0, \sigma^2_{B_g}), \quad \varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g})$$

where $R_{ip}$ and $C_{jp}$ are the deviations of row i and column j on the pth plate, and the remaining notation is as above. The code below illustrates this normalization as applied to the data structure norm.x in Sec. 3.1. The same procedure can be applied to the data normalized as in Sec. 3.2. The argument dimname specifies all the dimensions of the multivariate phenotype.
> norm.cr <- norm2cr(norm.x, dimname=elekod, partialtitle="kod", makeplot=TRUE)

When the option makeplot is set to TRUE, the code produces the graphical visualization of the importance of the row and column effects suggested by Malo et al. [1], and adds the value of partialtitle to the plot title. The plot for the example dataset is shown in Fig. 1. As can be seen, the column and the row effects have a similar variation, and the median values are close to 1. This indicates only a mild effect of rows and columns on this dataset.

[1] Malo, N., et al. Statistical practice in high-throughput screening data analysis, Nature Biotechnology, 24.
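As a rough illustration of Eqs. (1)-(4), the sketch below simulates a small screen, fits batch and plate random effects to the control sample with lme4 (one of the packages HTSmix depends on), subtracts the predicted effects from every well, and then removes a common covariate slope with lm(). It is not the code of norm2ctr or norm2covariate, whose internals are not shown in this manual. The simulated design, the column names (sample, plate, gr, x, r, r_adj), the reading of Eq. (2) as a subtraction, and the use of predicted random effects as the estimates are all assumptions.

```r
library(lme4)
set.seed(42)

# Simulated screen (hypothetical): 3 batches, 4 plates nested in each batch,
# 6 wells per plate.
design <- expand.grid(well = 1:6, plate_in_batch = 1:4, batch = 1:3)
# Unique plate labels, so (1 | plate) is the plate-within-batch effect.
design$plate <- interaction(design$batch, design$plate_in_batch, drop = TRUE)
design$batch <- factor(design$batch)

batch_eff <- rnorm(nlevels(design$batch), sd = 2)
plate_eff <- rnorm(nlevels(design$plate), sd = 1)
names(plate_eff) <- levels(design$plate)

# Wells 1-2 hold the control; wells 3-6 hold a mutant shifted by +3 units.
design$sample <- ifelse(design$well <= 2, "control", "mutant")
design$gr <- rnorm(nrow(design))   # covariate, assumed already normalized
design$x <- 10 + 3 * (design$sample == "mutant") +
  batch_eff[as.integer(design$batch)] +
  unname(plate_eff[as.character(design$plate)]) +
  0.5 * design$gr + rnorm(nrow(design), sd = 0.5)

# Eq. (1), fitted to the control sample only.
ctrl <- subset(design, sample == "control")
fit <- lmer(x ~ 1 + (1 | batch) + (1 | plate), data = ctrl)

# Predicted random effects stand in for B-hat and P(B)-hat in Eq. (2).
b_hat <- ranef(fit)$batch[as.character(design$batch), "(Intercept)"]
p_hat <- ranef(fit)$plate[as.character(design$plate), "(Intercept)"]
design$r <- design$x - (b_hat + p_hat)

# Eqs. (3)-(4): one common slope across all samples, then remove its contribution.
beta1 <- coef(lm(r ~ gr, data = design))[["gr"]]
design$r_adj <- design$r - beta1 * design$gr
```

With so few batches the batch variance is poorly determined and lmer may report a singular fit; in a real screen the control is profiled on many more plates.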

Figure 1: Visual representation of the row and column effects, produced by norm2cr{HTSmix} for the rawkod dataset. X axis: dimensions of the multivariate phenotype. Y axis: ratios between the median absolute deviations of column effects and of model residuals (in blue), and ratios between the median absolute deviations of row effects and of model residuals (in yellow). The boxplots summarize the ratios across all plates.

3.4 Export of the results into cellHTS-class

The function HTSmix2cellHTS{HTSmix} exports the raw or the normalized data into a data structure of class cellHTS that is compatible with the package cellHTS2. As a result, other normalization and summarization steps implemented in cellHTS2 can be applied to these datasets directly. For consistency with the arguments of cellHTS, negatives specifies a vector of names of negative controls, and positives specifies a vector of names of positive controls. The argument others specifies additional types of samples, e.g. unknown and BLANK in the case of the example dataset. Each dimension of the multivariate phenotype is exported into a separate object of this class, and dimname1 specifies the relevant dimension.

> res <- HTSmix2cellHTS(norm.x, negatives=c("BY4743","YDL227C"),
+     positives=c("YLR396C","YPR065W"), others="BLANK", dimname1="Ca44");

The following example illustrates how data normalized with HTSmix can be exported into cellHTS-class for summarization with cellHTS2. We generate a histogram of the summary statistics of element Ca44 obtained with cellHTS2 after normalization in HTSmix (Fig. 2 (a)).

> # Export normalized phenotypes
> res <- HTSmix2cellHTS(norm.covariate,
+     negatives=c("BY4743","YDL227C"), positives=c("YLR396C","YPR065W"),
+     others="BLANK", dimname1="Ca44");
> # Score and summarize replicates in cellHTS2
> res1 <- scoreReplicates(res, sign="-", method="zscore")
> res2 <- summarizeReplicates(res1, summary="mean")
> # Generate histogram
> hist(Data(res2), xlab="test statistics in cellHTS", main="");

Figure 2: (a) Distribution of the summary statistics of the element Ca44 of the example dataset, normalized using HTSmix and summarized using cellHTS2. (b) Distribution of the Z statistics for all elements of the example dataset combined, obtained with HTSmix as described in Sec. 5.

4 Summarization

4.1 Calculation of per-sample summary of the quantitative phenotype

The function dscore{HTSmix} estimates the residual variation in the normalized phenotypes using a second control. It expresses the normalized phenotypes $r_{gkbp}$ of sample g in terms of the random effects of a second linear model

$$r_{gkbp} = \mu_g + P(B)_{gp} + B_{gb} + \varepsilon_{gkbp}, \tag{6}$$
$$P(B)_{gp} \sim N(0, \sigma^2_{P_g}), \quad B_{gb} \sim N(0, \sigma^2_{B_g}),$$

$$\varepsilon_{gkbp} \sim N(0, \sigma^2_{\varepsilon_g}), \quad \text{for } g = 2, 3, 4, 5, \ldots$$

For samples profiled in a single plate, the summary phenotype of mutant g is $\mu_g$, and its estimate is equivalent to the average of the observed phenotypes over all replicates, $\bar{r}_g$. The associated estimated variation is

$$\widehat{Var}(\bar{r}_g) = \hat{\sigma}^2_{P_g} + \hat{\sigma}^2_{B_g} + \hat{\sigma}^2_{\varepsilon_g} / n_g \tag{7}$$

where $n_g$ is the number of within-plate replicate samples of the mutant g. The parameter $\hat{\sigma}^2_{\varepsilon_g}$ is estimated by the sample variance $s^2_{\varepsilon_g}$, and plug-in estimates of $\sigma^2_{P_g}$ and $\sigma^2_{B_g}$ are obtained from a second control. The overall per-sample summary score $D_g$ is then defined as

$$D_g = \bar{r}_g \Big/ \sqrt{\hat{\sigma}^2_{P} + \hat{\sigma}^2_{B} + s^2_{\varepsilon_g} / n_g} \tag{8}$$

The code below illustrates the use of the function dscore{HTSmix} for the summarization of mutant-wise phenotypes on a dimension-by-dimension basis for the example dataset. The arguments have the same interpretation as in the other functions of HTSmix.

> d.mut <- dscore(indata = norm.covariate,
+     ref.ctr1="BY4743", ref.ctr2="YDL227C", excludestrains="BLANK",
+     dimname=elekod, batch="run_batch", tray="tray")
> names(d.mut);
[1] "dmut"     "mean"     "var"      "freq"     "var.ctr2"
> head(d.mut$dmut)
[Output omitted: a table of D scores with one row per sample (BY4743, YAL001C, YAL003W, YAL025C, YAL032C, YAL033W) and one column per element, Ca44 through Zn66.]
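The arithmetic of Eqs. (7) and (8) can be sketched in a few lines of base R. The example below is illustrative only and is not the implementation of dscore. It assumes that the denominator of Eq. (8) is the square root of the variance in Eq. (7), and that the plate and batch variance components have already been estimated from the second control. The replicate values, the two variance components and the function name d_score are hypothetical.

```r
# Hypothetical inputs: normalized replicates of one mutant on a single plate,
# and variance components estimated beforehand from the second control.
r_g      <- c(2.1, 1.7, 2.4, 1.9)
sigma2_P <- 0.30   # plate variance (plug-in estimate)
sigma2_B <- 0.50   # batch variance (plug-in estimate)

d_score <- function(r, sigma2_P, sigma2_B) {
  n_g      <- length(r)
  s2_eps   <- var(r)                               # sample variance of the replicates
  var_mean <- sigma2_P + sigma2_B + s2_eps / n_g   # Eq. (7)
  mean(r) / sqrt(var_mean)                         # Eq. (8), square root assumed
}

D_g <- d_score(r_g, sigma2_P, sigma2_B)
D_g
```

Because the plate and batch components do not shrink with the number of replicates, adding replicates within one plate reduces only the last term of the variance.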

4.2 Export of the results into cellHTS-class

The summary D score or the test statistics calculated in HTSmix can also be exported into an object of class cellHTS, separately for each dimension of a multivariate phenotype, using the function HTSmix2cellHTS{HTSmix}.

> z.mut = zscore(mutfile = d.mut$dmut,
+     center = TRUE, ref.ctr1= "BY4743", ref.ctr2="YDL227C",
+     excludestrains=NULL,
+     partialtitle="kod", ele=elekod )
> res <- HTSmix2cellHTS(z.mut, negatives=c("BY4743","YDL227C"),
+     positives=c("YLR396C","YPR065W"), others="BLANK",
+     summarized=TRUE, dimname1="Ca44")

Here the argument summarized indicates whether the input data are the result of summarization, and the argument dimname1 indicates the column name of the dimension of the phenotype of interest. The remaining arguments have the same meaning as above.

5 Determination of hits

Finally, the package implements the detection of hits among the normalized and summarized phenotypes, while controlling the False Discovery Rate (FDR) at the desired level. First, getzrr2con{HTSmix} calculates a per-sample standardized summary score, called the Z statistic, which is comparable across all the dimensions of the phenotype:

$$Z_g = \frac{D_g - \mathrm{median}(D_g)}{\mathrm{median}(|D_g - \mathrm{median}(D_g)|) \cdot C} \tag{9}$$

Here $C = 1/\Phi^{-1}(3/4)$ is a normalizing constant for a robust unbiased estimation of the scale [2].

> # Get mutant-wise Z statistics
> z.mut = zscore(mutfile = d.mut$dmut,
+     center = TRUE, ref.ctr1= "BY4743", ref.ctr2="YDL227C",
+     partialtitle="kod", ele=elekod )

[2] Hoaglin, D., et al. Understanding robust and exploratory data analysis, John Wiley & Sons.
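Eq. (9) is a median-centered score scaled by the median absolute deviation, and can be reproduced with base R. The sketch below is not the code of zscore or getzrr2con; the function name z_stat and the simulated D scores are hypothetical.

```r
z_stat <- function(d) {
  C <- 1 / qnorm(3 / 4)                       # normalizing constant of Eq. (9)
  centered <- d - median(d, na.rm = TRUE)
  centered / (median(abs(centered), na.rm = TRUE) * C)
}

set.seed(7)
# Hypothetical D scores for one element: mostly null, with two extreme samples.
d <- c(rnorm(200), 6, -7)
z <- z_stat(d)

# stats::mad() computes the same scale when given the same constant;
# its default constant, 1.4826, is C rounded to four decimals.
same_as_mad <- all.equal(z, (d - median(d)) / mad(d, constant = 1 / qnorm(3 / 4)))
same_as_mad
```

Using the median and the median absolute deviation instead of the mean and standard deviation keeps a handful of strong hits from inflating the scale that all samples are judged against.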

> head(z.mut)
[Output omitted: a table of Z statistics with one row per sample (YAL001C, YAL003W, YAL025C, YAL032C, YAL033W, YAL034W-A) and one column per element, Ca44 through Zn66.]

The approach by Efron [3], implemented in the R package locfdr, is then used to determine the cutoff of $Z_g$ that controls the FDR at the level FDRcutoff. The histogram in Fig. 2 (b) visualizes the distribution of the statistics.

> z.locfdr <- gethits(z.mut, FDRcutoff = 0.05, partialtitle ="KOd", makeplot=TRUE )

[3] Efron, B. Microarrays, Empirical Bayes, and the two-groups model, Statistical Science, 23, 1-22.
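For readers who want to see the underlying step, the sketch below calls locfdr directly on simulated Z statistics and flags the samples whose estimated local false discovery rate falls below a cutoff. How gethits translates FDRcutoff into a cutoff on Z is not described in this manual, so thresholding the local fdr values is an assumption made for illustration; the simulated data and the variable names are hypothetical.

```r
library(locfdr)
set.seed(3)

# Hypothetical Z statistics: 3000 null samples plus 100 shifted ones.
z <- c(rnorm(3000), rnorm(100, mean = 4))

# plot = 0 suppresses the diagnostic histogram; other arguments keep their defaults.
fit <- locfdr(z, plot = 0)

# Assumed decision rule: call a hit when the estimated local fdr is at most 0.05.
FDRcutoff <- 0.05
hits <- fit$fdr <= FDRcutoff
sum(hits)
```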


More information

Fractal functional regression for classification of gene expression data by wavelets

Fractal functional regression for classification of gene expression data by wavelets Fractal functional regression for classification of gene expression data by wavelets Margarita María Rincón 1 and María Dolores Ruiz-Medina 2 1 University of Granada Campus Fuente Nueva 18071 Granada,

More information

ANOVA approach. Investigates interaction terms. Disadvantages: Requires careful sampling design with replication

ANOVA approach. Investigates interaction terms. Disadvantages: Requires careful sampling design with replication ANOVA approach Advantages: Ideal for evaluating hypotheses Ideal to quantify effect size (e.g., differences between groups) Address multiple factors at once Investigates interaction terms Disadvantages:

More information

discovery rate control

discovery rate control Optimal design for high-throughput screening via false discovery rate control arxiv:1707.03462v1 [stat.ap] 11 Jul 2017 Tao Feng 1, Pallavi Basu 2, Wenguang Sun 3, Hsun Teresa Ku 4, Wendy J. Mack 1 Abstract

More information

High-throughput Testing

High-throughput Testing High-throughput Testing Noah Simon and Richard Simon July 2016 1 / 29 Testing vs Prediction On each of n patients measure y i - single binary outcome (eg. progression after a year, PCR) x i - p-vector

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Linear Mixed Models for Longitudinal Data Yan Lu April, 2018, week 15 1 / 38 Data structure t1 t2 tn i 1st subject y 11 y 12 y 1n1 Experimental 2nd subject

More information

A Simple, Graphical Procedure for Comparing Multiple Treatment Effects

A Simple, Graphical Procedure for Comparing Multiple Treatment Effects A Simple, Graphical Procedure for Comparing Multiple Treatment Effects Brennan S. Thompson and Matthew D. Webb May 15, 2015 > Abstract In this paper, we utilize a new graphical

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

MSc / PhD Course Advanced Biostatistics. dr. P. Nazarov

MSc / PhD Course Advanced Biostatistics. dr. P. Nazarov MSc / PhD Course Advanced Biostatistics dr. P. Nazarov petr.nazarov@crp-sante.lu 2-12-2012 1. Descriptive Statistics edu.sablab.net/abs2013 1 Outline Lecture 0. Introduction to R - continuation Data import

More information

Ch3. TRENDS. Time Series Analysis

Ch3. TRENDS. Time Series Analysis 3.1 Deterministic Versus Stochastic Trends The simulated random walk in Exhibit 2.1 shows a upward trend. However, it is caused by a strong correlation between the series at nearby time points. The true

More information

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R Statistical Clustering of Vesicle Patterns Mirko Birbaumer Rmetrics Workshop 3th July 2008 1 / 23 Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R Mirko

More information

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING Submitted to the Annals of Statistics CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING By Jingshu Wang, Qingyuan Zhao, Trevor Hastie, Art B. Owen Stanford University We consider large-scale studies

More information

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments by Gordon K. Smyth (as interpreted by Aaron J. Baraff) STAT 572 Intro Talk April 10, 2014 Microarray

More information

Package CorrMixed. R topics documented: August 4, Type Package

Package CorrMixed. R topics documented: August 4, Type Package Type Package Package CorrMixed August 4, 2016 Title Estimate Correlations Between Repeatedly Measured Endpoints (E.g., Reliability) Based on Linear Mixed-Effects Models Version 0.1-13 Date 2015-03-08 Author

More information

Package gma. September 19, 2017

Package gma. September 19, 2017 Type Package Title Granger Mediation Analysis Version 1.0 Date 2018-08-23 Package gma September 19, 2017 Author Yi Zhao , Xi Luo Maintainer Yi Zhao

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

Principal Component Analysis, A Powerful Scoring Technique

Principal Component Analysis, A Powerful Scoring Technique Principal Component Analysis, A Powerful Scoring Technique George C. J. Fernandez, University of Nevada - Reno, Reno NV 89557 ABSTRACT Data mining is a collection of analytical techniques to uncover new

More information

SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA

SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA D. Pokrajac Center for Information Science and Technology Temple University Philadelphia, Pennsylvania A. Lazarevic Computer

More information

Optimal normalization of DNA-microarray data

Optimal normalization of DNA-microarray data Optimal normalization of DNA-microarray data Daniel Faller 1, HD Dr. J. Timmer 1, Dr. H. U. Voss 1, Prof. Dr. Honerkamp 1 and Dr. U. Hobohm 2 1 Freiburg Center for Data Analysis and Modeling 1 F. Hoffman-La

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

= Prob( gene i is selected in a typical lab) (1)

= Prob( gene i is selected in a typical lab) (1) Supplementary Document This supplementary document is for: Reproducibility Probability Score: Incorporating Measurement Variability across Laboratories for Gene Selection (006). Lin G, He X, Ji H, Shi

More information

HOMEWORK ANALYSIS #2 - STOPPING DISTANCE

HOMEWORK ANALYSIS #2 - STOPPING DISTANCE HOMEWORK ANALYSIS #2 - STOPPING DISTANCE Total Points Possible: 35 1. In your own words, summarize the overarching problem and any specific questions that need to be answered using the stopping distance

More information

Fast and Accurate Causal Inference from Time Series Data

Fast and Accurate Causal Inference from Time Series Data Fast and Accurate Causal Inference from Time Series Data Yuxiao Huang and Samantha Kleinberg Stevens Institute of Technology Hoboken, NJ {yuxiao.huang, samantha.kleinberg}@stevens.edu Abstract Causal inference

More information

A variance-stabilizing transformation for gene-expression microarray data

A variance-stabilizing transformation for gene-expression microarray data BIOINFORMATICS Vol. 18 Suppl. 1 00 Pages S105 S110 A variance-stabilizing transformation for gene-expression microarray data B. P. Durbin 1, J. S. Hardin, D. M. Hawins 3 and D. M. Roce 4 1 Department of

More information

INTRODUCTION TO BASIC LINEAR REGRESSION MODEL

INTRODUCTION TO BASIC LINEAR REGRESSION MODEL INTRODUCTION TO BASIC LINEAR REGRESSION MODEL 13 September 2011 Yogyakarta, Indonesia Cosimo Beverelli (World Trade Organization) 1 LINEAR REGRESSION MODEL In general, regression models estimate the effect

More information

Package RATest. November 30, Type Package Title Randomization Tests

Package RATest. November 30, Type Package Title Randomization Tests Type Package Title Randomization Tests Package RATest November 30, 2018 A collection of randomization tests, data sets and examples. The current version focuses on three testing problems and their implementation

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the

More information

Methods Used for Estimating Statistics in EdSurvey Developed by Paul Bailey & Michael Cohen May 04, 2017

Methods Used for Estimating Statistics in EdSurvey Developed by Paul Bailey & Michael Cohen May 04, 2017 Methods Used for Estimating Statistics in EdSurvey 1.0.6 Developed by Paul Bailey & Michael Cohen May 04, 2017 This document describes estimation procedures for the EdSurvey package. It includes estimation

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Overview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database

Overview - MS Proteomics in One Slide. MS masses of peptides. MS/MS fragments of a peptide. Results! Match to sequence database Overview - MS Proteomics in One Slide Obtain protein Digest into peptides Acquire spectra in mass spectrometer MS masses of peptides MS/MS fragments of a peptide Results! Match to sequence database 2 But

More information

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science 1 Analysis and visualization of protein-protein interactions Olga Vitek Assistant Professor Statistics and Computer Science 2 Outline 1. Protein-protein interactions 2. Using graph structures to study

More information

Differentially Private Linear Regression

Differentially Private Linear Regression Differentially Private Linear Regression Christian Baehr August 5, 2017 Your abstract. Abstract 1 Introduction My research involved testing and implementing tools into the Harvard Privacy Tools Project

More information

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective Second Edition Scott E. Maxwell Uniuersity of Notre Dame Harold D. Delaney Uniuersity of New Mexico J,t{,.?; LAWRENCE ERLBAUM ASSOCIATES,

More information

A measurement error model approach to small area estimation

A measurement error model approach to small area estimation A measurement error model approach to small area estimation Jae-kwang Kim 1 Spring, 2015 1 Joint work with Seunghwan Park and Seoyoung Kim Ouline Introduction Basic Theory Application to Korean LFS Discussion

More information

Computational Biology Course Descriptions 12-14

Computational Biology Course Descriptions 12-14 Computational Biology Course Descriptions 12-14 Course Number and Title INTRODUCTORY COURSES BIO 311C: Introductory Biology I BIO 311D: Introductory Biology II BIO 325: Genetics CH 301: Principles of Chemistry

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

Statistics for Differential Expression in Sequencing Studies. Naomi Altman

Statistics for Differential Expression in Sequencing Studies. Naomi Altman Statistics for Differential Expression in Sequencing Studies Naomi Altman naomi@stat.psu.edu Outline Preliminaries what you need to do before the DE analysis Stat Background what you need to know to understand

More information

A Short Course in Basic Statistics

A Short Course in Basic Statistics A Short Course in Basic Statistics Ian Schindler November 5, 2017 Creative commons license share and share alike BY: C 1 Descriptive Statistics 1.1 Presenting statistical data Definition 1 A statistical

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Predicting causal effects in large-scale systems from observational data

Predicting causal effects in large-scale systems from observational data nature methods Predicting causal effects in large-scale systems from observational data Marloes H Maathuis 1, Diego Colombo 1, Markus Kalisch 1 & Peter Bühlmann 1,2 Supplementary figures and text: Supplementary

More information