Statistical Issues in Preprocessing Quantitative Bottom-Up LCMS Proteomics Data


Statistical Issues in Preprocessing Quantitative Bottom-Up LCMS Proteomics Data. Jean-Eudes Dazard, xd101@case.edu, Case Western Reserve University, Center for Proteomics and Bioinformatics. November 05, 2009

Background: Label-Free LC/MS Analysis Workflow. Control (C) and treatment (T) samples undergo digestion, raw data acquisition by LC/MS/MS, data preprocessing, statistical analysis, biomarker discovery, and network analysis. After preprocessing, the data form a matrix of aligned peptide features (Diffset 1, ..., Diffset p) by LC/MS runs (Run 1, ..., Run n): each cell holds a peak intensity or NA when missing, and each diffset carries annotation fields (consensus mass, consensus time, consensus sequence, consensus sequence score, protein IP, protein description). 2

Outline: Missingness Issues; Transformation Issues. Not covered here: peak alignment, peptide identification, and statistical inference. 3

Missingness Issues: Dataset Structure and Missing Data. The preprocessed dataset is a matrix of peak intensities for diffsets 1, ..., p across runs 1, ..., n, containing many NA entries, plus annotation rows (consensus mass, consensus time, consensus sequence, consensus sequence score, protein IP, protein description). Topics: extent, implications, and sources of missingness, and state-of-the-art treatment. 4

Missingness Issues: Extent of the Issue. In general, ~5-10% overall missingness is considered small and is common in omics data (e.g. in some DNA microarray datasets). In contrast, large to very large amounts of missing values commonly occur in LCMS proteomics datasets:

Study           Initial                Step #1               Step #2               Step #3
                # Diffsets  % Miss.    # Diffsets  % Miss.   # Diffsets  % Miss.   # Diffsets  % Miss.
Diabetes        10177       77.9       1931        53.2      1783        51.5      1429        42.3
CACTI (pilot)   7446        96.9       1555        54.9      1362        52.8      914         38.1
CACTI           25323       88.9       6556        79.0      3915        72.8      2584        60.8
IPS             14253       71.4       2025        37.7      1253        21.7      1088        12.3
Restenosis      15701       75.3       2283        38.6      1949        31.9      1553        18.7

This amount of missing data is consistent with what others have observed (~50%), e.g. Karpievitch et al. (2009). 5

Missingness Issues: Extent of the Issue. A reviewer's comment: "It is very difficult to accept conclusions based on a dataset having 40% of missing values [...], no matter how sophisticated the statistics are." Because many of the algorithms for missing-value imputation were initially developed in the context of microarray data, their performance was usually not tested beyond the range of [0, 20%] missingness (assuming Missing At Random; more on this later). But recall that in our case we deal with variable missingness within a range of roughly [20%, 60%]. 6

Missingness Issues: Extent of the Issue. It is true that every imputation method will break down beyond a certain rate of missingness. However, this breaking point has not been systematically investigated and remains unknown for most of the available methods. In fact, two recent studies have investigated the comparative behavior of a few methods in this context, with high and even extremely high missingness of up to 40% and 60%, respectively (Oba et al. 2003; Kim et al., 2004). In each study, the performance of several methods did not degrade up to these rates. So it is conceivable and practically possible to deal with at least, say, 40% overall missing entries. 7

Missingness Issues: Implications. Missing values prevent some statistical procedures from being performed at all (e.g. PCA) or without further data processing (e.g. ANOVA, linear models). We may also have increasing difficulty in fitting certain GLM models. When analyses are possible, estimates and tests may be prone to extreme bias, depending in part on the mechanism of missingness. Missing data (even if missing at random) will generally result in reduced precision in estimation or reduced power in statistical tests. 8

Missingness Issues: Sources of Missingness. Overview of quantification approaches in LC-MS based proteomic experiments (Mueller et al. 2007). Label-free quantification extracts peptide signals by tracking isotopic patterns along their chromatographic elution profile. The corresponding signal in LC-MS run 2 is then found by comparing the coordinates (m/z, Tr, z) of the peptide. Multiple LC-MS datasets are aligned to one another using ChromPeaks, after which LC-MS features (peptides) can be matched to a sequence database. 9

Missingness Issues: Sources of Missingness. Even after peak clustering/alignment, there will still be peaks missing from some of the samples. This can occur simply because a peptide was not present in a sample, or because not every peak was detected, identified, and aligned in all samples. The peak list (red map) is generated by peak detection and alignment of multiple LC-MS runs (gray maps). So, in shotgun proteomics measurements, many peptides observed in some samples are not observed in others, resulting in widespread missing values. 10

Missingness Issues: Sources of Missingness. Furthermore, there are two independent mechanisms by which an intensity may not be observed (Karpievitch et al., 2009): (i) the peak was not observed because the peptide's abundance is below the instrument detection threshold => left-censored data with non-ignorable missingness; (ii) missingness due to ionization inefficiencies, ion-suppression effects, and other technical factors (Tang et al., 2004) => completely random missingness. Whenever there is informative (non-ignorable) missingness, care must be taken when handling the missing values to avoid biasing abundance estimates (Old et al., 2005; Wang et al. 2006a; Karpievitch et al., 2009). 11

Missingness Issues: Exploratory Data Analysis of Missingness. Evidence Against Random Missingness. The systematic pattern reflects the fact that low-abundance peptides are more likely to have missing peaks: 12

Missingness Issues: Exploratory Data Analysis of Missingness. Evidence Against Random Missingness. More than 34% of peptides/diffsets contain the maximal number of 10 missing values (out of 15 positions), an event of nearly zero probability under the Missing At Random (MAR) assumption. The distribution of missing events by peptide is NOT uniform across peptides/diffsets or proteins: 13

Missingness Issues: Exploratory Data Analysis of Missingness. Evidence Against Random Missingness. Test of the random nature of the missingness: let X_j = #{observed missing counts per peptide/diffset j}. Under a random process of missingness, it is reasonable to assume that X_j is a count random variable of rare events, i.e. distributed as a Poisson random variable. Hypothesis testing (e.g. a Kolmogorov-Smirnov test): H_0: X_j ~ Poisson(λ). There is strong evidence to reject the null hypothesis. 14
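
A minimal sketch of this check in R, under the assumption that the data sit in a runs-by-peptides matrix with NA marking missing peaks (the toy matrix below is illustrative, not the slides' data):

```r
set.seed(1)
## Toy runs x peptides intensity matrix; NA marks a missing peak.
intensity <- matrix(rlnorm(15 * 200), nrow = 15, ncol = 200)
intensity[sample(length(intensity), 0.4 * length(intensity))] <- NA

missing_counts <- colSums(is.na(intensity))   # X_j = # missing values per peptide/diffset
lambda_hat     <- mean(missing_counts)        # Poisson rate estimate
## H0: X_j ~ Poisson(lambda); ties warnings are expected for a discrete distribution,
## and a chi-squared goodness-of-fit test is a common alternative.
ks.test(missing_counts, "ppois", lambda = lambda_hat)
```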

Missingness Issues: Treatment of Missingness. We have shown that the missing-data mechanism of LCMS proteomics data is intensity-dependent => Missing Not At Random (MNAR). The probability of a value being missing will generally depend on the observed and unobserved values (had they been observed). This is to be distinguished from the other missingness mechanisms (MCAR and MAR), which occur here more rarely and at random. So, missingness should be reduced as much as possible and treated, but carefully: prefiltering will be limited in scope; account for informative missingness in peak intensities to allow unbiased, model-based, protein-level (Karpievitch et al., 2009) or peptide-level (Wang et al. 2006a; ours, 2009) estimation and inference (more later). 15

Missingness Issues: Treatment of Missingness. Prefiltering Mistakes. For instance, complete-case analysis has the appeal of simplicity, but: (i) loss of power due to dropping a large amount of information: if variables (peptides/diffsets) with missing values are just discarded or ignored, a substantial loss of power is introduced because information is simply lost!

Study        # Diffsets Total   # Complete   %
Diabetes     10177              209          2.05
CACTI        7446               148          1.99
IPS          14253              593          4.16
Restenosis   15701              655          4.17

(ii) substantial bias is induced in inferences (i.e. results not applicable to the population) by ignoring possible systematic differences between the complete and incomplete cases (violation of the MCAR assumption). 16

Missingness Issues: Treatment of Missingness. Imputation Mistakes. Almost all existing imputation methods available for high-dimensional data assume MAR (KNN, SVD, EM, etc.) => none of these is applicable here! To mention a few: row average (Little, 1987), singular value decomposition and k-nearest neighbors (Troyanskaya, 2001), Bayesian variable selection (Zhou, 2003), local least squares imputation (Kim, 2005), support vector regression (Wang, 2006b), logical set imputation (Jornsten, 2007), etc. 17

Missingness Issues: Treatment of Missingness. Example of Upward Bias in Mean and Downward Bias in Variance Estimation. [Figure: peptide intensities (abundance) for treatment and control samples plotted against peptide number, with the instrument detection threshold marked; estimate #1 of the control group mean (complete data or row average) and estimate #2 (minimum value) are both offset from the true sample group mean of controls, illustrating bias #1 and bias #2.] 18

Missingness Issues: Prefiltering LCMS Data. Step #1: Initial Preselection. The initial dataset contains aligned peptide/diffset peak intensities (ChromPeaks), uniquely identified by diffset IDs. But many do not have any annotation, not even a consensus peptide sequence (analogous to the EST situation in mRNA microarray data) => just ignore them (at this point). 19

Missingness Issues: Prefiltering LCMS Data. Step #1: Initial Preselection. Peptides/diffsets that are retained have: (i) a completely annotated consensus peptide sequence (score > 0); (ii) a consensus peptide sequence score above a lower-bound cutoff percentile. The percentile is chosen to remove the lowest mode in the peptide sequence score distribution, or the flat part of the cumulative distribution function (e.g. 79th percentile, score > 14.95). Summary statistics of missing values: p = 1555 peptides/diffsets selected; 54.9% overall missingness (was 96.9%!). 20
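
A minimal R sketch of this preselection, assuming per-diffset annotation vectors `sequence` and `score` and the runs-by-diffsets matrix `intensity` (hypothetical names; the 79th-percentile cutoff mirrors the example above):

```r
## Keep diffsets with an annotated consensus sequence (score > 0) and a sequence
## score above a percentile cutoff chosen from the score distribution.
has_sequence <- !is.na(sequence) & score > 0
cutoff       <- quantile(score[has_sequence], probs = 0.79, na.rm = TRUE)
keep         <- has_sequence & score > cutoff
intensity_s1 <- intensity[, keep]                 # runs x selected diffsets
mean(is.na(intensity_s1)) * 100                   # overall % missingness after step #1
```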

Missingness Issues: Prefiltering LCMS Data. Step #2: Peptide Intensity Summarization. A summarization that removes feature-extraction and peak-alignment errors: integrate (over retention time and m/z ratio) the area under the peak intensities of all peptides/diffsets with an identical consensus sequence (in length and amino-acid composition). This summarization preselects those peptides/diffsets having a unique IP number and annotation (with adjusted intensities). Summary statistics of missing values: p = 1362 peptides/diffsets selected; 52.8% overall missingness. 21

Missingness Issues: Prefiltering LCMS Data. Step #3: Missingness Proportion Reduction. Let X_j = #{observed missing counts per peptide/diffset j}. Select those peptides/diffsets for which the upper bound n_• of X_j satisfies two criteria simultaneously: (criterion #1) n_• must be less than the total sample size n minus half the minimal sample size n_k over the experimental groups k = 1, ..., G (assuming min_k(n_k) > 1 and n > 1); (criterion #2) n_• should ideally maximize the difference between the total number of remaining variables after selection and the total number of missing values they carry (see next), where n_k denotes the sample size of the k-th group G_k, k = 1, ..., G, such that n = Σ_{k=1}^{G} n_k. 22

Missingness Issues: Prefiltering LCMS Data. Step #3: Missingness Proportion Reduction. X_j ∈ [x_•, n_•] ⊆ [0, n] for each j = 1, ..., p. By criterion #1: n_• ≤ n − ⌊min_k(n_k)/2⌋, where ⌊·⌋ denotes the floor function. By criterion #2: n_• is the value that maximizes the difference between the number of retained peptides/diffsets and the number of missing values they carry, where Y_j = y_j is the observed abundance of the j-th peptide. Example: the maximal number of missing values allowed per peptide/diffset lies in [4, 7]; use e.g. the nearest integer-mean value n_• = 6, the minimum value n_• = 4, or the maximum value n_• = 7. Summary statistics of missing values (n_• = 6): p = 914 peptides/diffsets selected (512 proteins); 38.1% overall missingness (3134 occurrences). 23
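
A minimal R sketch of step #3, assuming the runs-by-diffsets matrix `intensity_s1` from the previous step and a per-run group factor `group` (hypothetical names); n_max = 6 mirrors the example above:

```r
n       <- nrow(intensity_s1)                    # total number of runs
n_k     <- table(group)                          # per-group sample sizes
n_bound <- n - floor(min(n_k) / 2)               # criterion #1 upper bound on n_max
n_max   <- 6                                     # chosen within the allowed range, <= n_bound

x_miss  <- colSums(is.na(intensity_s1))          # X_j = # missing values per diffset
keep    <- x_miss <= n_max
intensity_s3 <- intensity_s1[, keep]
c(selected = sum(keep),
  pct_missing = 100 * mean(is.na(intensity_s3))) # summary after step #3
```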

Missingness Issues: Model-Based Imputation. Probability Model. To account for the non-random nature of the missing events in this type of data, we use a probability model adapted from Wang et al. (2006a) to describe artefactual missing events. It makes inferences on the missing values of one sample based on information from other similar samples (technical replicates or nearest neighbors). A missing event is artefactual when the peptide exists in the sample but no signal has been detected (unobserved). Based on the intensities of those peptides observed in both samples, the scale difference of the overall abundances between sample 1 and sample 2 can be estimated and used to approximate the missing intensity of peptide 1 in sample 1. 24

Missingness Issues: Model-Based Imputation. Probability Model. Formally, let Z_j be the indicator variable of the j-th peptide in one sample i (dependence on the subscript i is dropped): Z_j = 1 if the j-th peptide exists in the sample, 0 otherwise. Let X_j be the true abundance of that peptide; then X_j = 0 if Z_j = 0, and X_j ~ f if Z_j = 1, where f denotes the (unknown) p.d.f. of X_j. The observed abundance Y_j of the j-th peptide given (X_j, Z_j, d) satisfies: Y_j = 0 if Z_j = 0; Y_j = 0 if Z_j = 1 and X_j < d; Y_j = X_j if Z_j = 1 and X_j ≥ d, where d is the minimum detection level of the LCMS instrument. 25

Missingness Issues: Model-Based Imputation. Probability Model. Under the MNAR assumption, the missingness mechanism is described by the probability of an artefactual missing event of the j-th peptide (denoted M_j = {Z_j = 1, Y_j = 0}) given that no signal is observed: Pr(M_j | Y_j = 0). Using Bayes' rule, one derives an estimate of the conditional expectation of the true intensity X_j given that it is unobservable (i.e. Y_j = 0): E(X_j | Y_j = 0) = E(X_j | X_j < d, Z_j = 1) Pr(M_j | Y_j = 0). From these derivations, using Bayes' rule, given the detection-level parameter d, the distribution function f, and the observed abundances Y, it is possible to estimate E(X_j | X_j < d, Z_j = 1) and the probability Pr(M_j | Y_j = 0), and in turn E(X_j | Y_j = 0). 26
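
A minimal illustrative sketch of this conditional-mean imputation in R, not the authors' implementation: it assumes a log-normal f fitted to the observed intensities of one peptide across similar samples, a known detection level d, and an externally supplied estimate p_artefact of Pr(M_j | Y_j = 0):

```r
impute_censored_mean <- function(y_obs, d, p_artefact) {
  mu  <- mean(log(y_obs))                       # log-normal fit on observed intensities
  sig <- sd(log(y_obs))
  ld  <- log(d)
  ## E(X | X < d, Z = 1) for X ~ LogNormal(mu, sig^2): truncated log-normal mean
  num <- exp(mu + sig^2 / 2) * pnorm((ld - mu - sig^2) / sig)
  den <- pnorm((ld - mu) / sig)
  e_trunc <- if (den > 0) num / den else d / 2  # fallback if no mass below d
  e_trunc * p_artefact                          # E(X | Y = 0)
}

## Toy usage with made-up numbers:
impute_censored_mean(y_obs = c(120, 95, 150, 110), d = 80, p_artefact = 0.7)
```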

Missingness Issues: R Implementations. R package in preparation (J-E Dazard's R source code): 1. EDA of missingness: Explore.lcms(.); 2. Prefiltering: Select.lcms(.); 3. Imputation of missing values: Impute.lcms(.) (implementation based on Wang et al. 2006a). 27

Outline: Missingness Issues; Transformation Issues. 28

Transformation Issues: Why do we often need to transform the data? To remove sources of systematic bias and variation in the measured intensities due to experimental artifacts or the technology employed (non-random or technical); e.g. in 2D-DIGE: dye effects, intensity dependence, spatial effects; in LCMS: LC-MS run, label, and intensity effects. To satisfy subsequent model assumptions (independence of observations, normality, homoscedasticity, ...). In biomarker discovery, this is essential to allow proper comparisons between experimental conditions and for statistical inference (e.g. aimed at identifying differentially expressed variables across experimental conditions). It is usually a simpler, and often the only feasible, approach compared with trying to model these systematic effects. 29

Transformation Issues: Normalization. In normalization (standardization), one wants to compensate for location and scale effects between different samples (or groups) of the data. A wide variety of normalization approaches have been proposed (e.g. LOESS, VSN, CSS, Quantile). Location normalization techniques all adjust the average of the data: (i) total intensity normalization: a single, global, multiplicative adjustment so that the average intensity is zero; assumes that most variables do not respond to the experimental conditions; (ii) median adjustment: essentially the same without filtering; a global adjustment such that the median intensity is zero; (iii) intensity-dependent normalization using local estimation: a smooth best-fit curve is calculated for the dependence of the log-ratio (M) on the overall geometric mean (A) of intensities, and the normalized log-ratios are the residuals from this curve, e.g. global M-A "loess" normalization (locally weighted regression estimation). "Scale" normalization techniques adjust the range of the data, rather than the center of the distribution. 30
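
A minimal R sketch of the intensity-dependent (M-A loess) normalization for a pair of runs; `x1` and `x2` are assumed to be positive intensity vectors for the same peptides in two runs, and the toy data are illustrative only:

```r
ma_loess_normalize <- function(x1, x2, span = 0.4) {
  M   <- log2(x1) - log2(x2)                    # log-ratio
  A   <- (log2(x1) + log2(x2)) / 2              # overall (geometric-mean) intensity
  fit <- loess(M ~ A, span = span)              # smooth best-fit curve of M on A
  list(A = A, M = residuals(fit))               # normalized log-ratios = residuals
}

## Toy usage with an intensity-dependent bias built in:
set.seed(2)
x2   <- rlnorm(500, meanlog = 6)
x1   <- x2 * 2^(0.3 + 0.1 * log2(x2)) * rlnorm(500, sdlog = 0.1)
norm <- ma_loess_normalize(x1, x2)
```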

Transformation Issues. Often misunderstood: "normalization" is used generically. It usually includes standardization, but not necessarily all of the following procedures: 1. Normality transformation: in general, statistical models assume normality of the measurements (or errors), rely on a set of assumptions about the ideal form of the data, and attempt to transform the data to be consistent with that ideal form (e.g. for continuous data: log, Box-Cox). One can do away with this if n >> 1 (a CLT result). 2. Homoscedastic transformation: statistical procedures rely on an equality-of-variances assumption => crucial for inference. 3. Variance stabilization: high-throughput data (p >> n) often exhibit a complex dependence between the mean and the standard deviation, with standard deviations increasing severely with the means. Clearly, simple normalizations do not necessarily stabilize the variance of the data across variables => crucial for improved statistical inference. 31

Transformation Issues: Regularization. The premise: because many variables are measured simultaneously, it is very likely that most of them behave similarly and share similar parameters (more later). The idea: take advantage of the parallel nature of high-throughput data by borrowing information across similar variables (pooling of information). A majority of authors have used regularization techniques and shown that shrinkage estimators can significantly improve the accuracy of inferences, especially in p >> n situations; see also the James-Stein shrinkage estimator (1961). Empirical Bayes approaches applied to variance estimation amount to combining residual sample variances across variables and shrinking them towards a pooled estimate. Posterior estimators follow distributions with augmented degrees of freedom => greater statistical power and far more stable inference, especially when p >> n. 32

Transformation Issues: Variance Stabilization and Regularization via Clustering. Let Y_{i,j} be the expression measurement of peptide/diffset (variable) j in sample i. Using the individual sample standard deviation σ̂_j of variable j (gene, peptide, protein, ...) as the population estimate to scale each response, Y*_{i,j} = Y_{i,j} / σ̂_j, gives a common transformed standard deviation σ̂*_j = 1 to each variable. This is likely to be a poor transformation of the data because, even if a standard-deviation-1 model is true, we still expect sampling variability of σ̂_j around 1. 33

Transformation Issues: Variance Stabilization and Regularization via Clustering. A commonality to each of the previous types of methods involves shrinkage of the sample variance to a global value by a form of variance regularization. However, the assumption of an equal-variance (homoscedastic) model, where one pooled common estimator is used for all variables, is unrealistic. In fact, in large datasets where p >> n, generating shrinkage estimators towards a common global value used for all variables (such as variable-by-variable z-scores to transform the data) can lead to misleading inferences, mostly due to overfitting (Callow et al. 2000; Cui et al. 2005; Efron et al. 2001; Ishwaran et al. 2003; Papana et al. 2006; Storey 2002; 2003a; 2003b; Tusher et al. 2001). 34

Transformation Issues: Variance Stabilization and Regularization via Clustering. So, to ensure that assumptions are satisfied and to avoid pitfalls in inference, one has to properly standardize the data. We propose using an adaptive regularization technique. Ideas: 1) use the information contained in the estimated population mean μ̂_j to get a better estimate of the population standard deviation σ̂_j for each variable j (recall the mutual mean-variance dependence, and Stein's result, following his theory on inadmissibility (1956), that the standard sample variance is improved by a shrinkage estimator using information contained in the sample mean (1964)); 2) borrow information across variables to get better estimates of the population standard deviations (and population means) to normalize with => look for homogeneity of variances (and means) across variables. 35

Transformation Issues: Variance Stabilization and Regularization via Clustering. By exploiting the above ideas of variance-mean dependence and of local pooling of information from similar variables, we can generate joint adaptive shrinkage estimators of the mean and the variance. Perform a (bi-dimensional) clustering of the variables into C clusters with similar variances (and means), where the population variance σ²_j and mean μ_j of each variable j are estimated using information from the cluster the variable belongs to: σ̂²(l_j), the cluster mean of the sample variances, and μ̂(l_j), the cluster mean of the sample means, where l_j ∈ {1, ..., C} is the cluster membership indicator of variable j. Use these within-cluster shared estimates {σ̂²(l_j), μ̂(l_j)}, j = 1, ..., p, to standardize the variables. 36

Transformation Issues: Variance Stabilization and Regularization via Clustering. In practice, the cluster mean of sample variances for variable j is σ̂²(l_j) = (1 / #{j' : l_{j'} = l_j}) Σ_{j' : l_{j'} = l_j} σ̂²_{j'}, and the cluster mean of sample means for variable j is μ̂(l_j) = (1 / #{j' : l_{j'} = l_j}) Σ_{j' : l_{j'} = l_j} μ̂_{j'}, where l_j ∈ {1, ..., C} is the cluster membership indicator of variable j, l_j = Σ_{l=1}^{C} l · I(cluster(j) = C_l); μ̂_j is the sample mean of variable j, μ̂_j = (1/n) Σ_{i=1}^{n} Y_{i,j}; and σ̂²_j is the unbiased sample variance of variable j, σ̂²_j = (1/(n−1)) Σ_{i=1}^{n} (Y_{i,j} − μ̂_j)². 37
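
A minimal R sketch of this cluster-based regularized standardization, not the authors' RegStd function: the per-variable (mean, sd) pairs are clustered with k-means and each variable is standardized by its cluster means (pooling sds instead of variances is a simplification for the sketch); `Y` is an assumed n x p samples-by-variables matrix:

```r
regularized_standardize <- function(Y, C = 10, nstart = 100) {
  mu  <- colMeans(Y)                                    # per-variable sample means
  sdv <- apply(Y, 2, sd)                                # per-variable sample sds
  km  <- kmeans(scale(cbind(mu, sdv)), centers = C, nstart = nstart)
  mu_shrunk <- ave(mu,  km$cluster)                     # cluster means of sample means
  sd_shrunk <- ave(sdv, km$cluster)                     # cluster means of sample sds
  sweep(sweep(Y, 2, mu_shrunk, "-"), 2, sd_shrunk, "/") # standardized data
}

## Toy usage:
set.seed(3)
Y_toy <- matrix(rnorm(10 * 1000, sd = rep(runif(1000, 0.5, 3), each = 10)), nrow = 10)
Y_std <- regularized_standardize(Y_toy, C = 5)
```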

Transformation Issues: Variance Stabilization and Regularization via Clustering. Recall that we need to cluster the variables' (gene, protein, peptide, ...) sample means and standard deviations into C clusters. Clustering is done e.g. using the K-means partitioning algorithm with 1000 replicated random start seedings. A major challenge in cluster analysis is the estimation of the true number of clusters in a dataset. How do we determine an estimate of that number? The gap statistic is a method for estimating this number (Tibshirani, 2001). Here, we devised a modified version of the gap statistic for determining the optimal number of clusters C in the combined set {(μ̂_j, σ̂_j)}, j = 1, ..., p, of sample means and standard deviations. 38

Transformation Issues: Gap Statistic for Clustering. Let C = {C_l}, where C_l denotes the l-th cluster, l = 1, ..., C, containing p_l variables, such that p = Σ_{l=1}^{C} p_l. Assume that for a given cluster configuration C with C clusters the data have been centered and standardized to have within-cluster means and standard deviations of 0 and 1, respectively. Most methods for estimating the true number of clusters are based on the pooled within-cluster dispersion, defined as W_p(l) = Σ_{l'=1}^{l} D_{l'} / (2 p_{l'}), where D_{l'} = Σ_{j, j' ∈ C_{l'}} d_{j,j'} is the sum of pairwise squared Euclidean distances between the (μ̂, σ̂) pairs of all variables in cluster C_{l'}. 39

Transformation Issues: Gap Statistic for Clustering. An estimate of the true number of clusters is usually obtained by identifying a kink in the plot of W_p(l) as a function of l; the gap statistic is an automatic way of locating this kink. Our version of the gap statistic compares the curve of log W_p(l) to its expected value under an appropriate null reference distribution with true (i) mean 0 and (ii) standard deviation 1 (e.g. a standard Gaussian distribution N(0,1)). Define a similarity statistic by the absolute value of the gap between the two curves: Gap_p(l) = | E*_p[log W*_p(l)] − log W_p(l) |, where E*_p and W*_p(l) denote, respectively, the expectation and the pooled within-cluster dispersion under a sample of size p from the reference distribution. 40

Transformation Issues: Gap Statistic for Clustering. The estimated true number of clusters l̂ is the smallest value of l for which the similarity between the two distributions is maximal, i.e. for which the gap statistic between the two curves is minimal after assessing its sampling distribution: l̂ = min{ argmin_l Gap_p(l) }. In practice, we estimate E*_p[log W*_p(l)], W*_p(l), and the sampling distribution of Gap_p(l) by drawing, say, B = 100 Monte Carlo replicates from the reference distribution. If we let L̄(l) = (1/B) Σ_{b=1}^{B} log W*_{p,b}(l) denote the estimate of E*_p[log W*_p(l)], then the corresponding gap statistic estimate is Ĝap_p(l) = | L̄(l) − log W_p(l) |. 41
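
A minimal R sketch of this modified gap statistic, under the assumptions that clustering is done by k-means on the p x 2 matrix of per-variable (mean, sd) pairs and that the null reference is N(0,1); `ms` is an assumed p x 2 matrix cbind(mu_hat, sd_hat):

```r
pooled_within_dispersion <- function(X, cluster) {
  sum(sapply(split(seq_len(nrow(X)), cluster), function(idx) {
    D <- 2 * sum(dist(X[idx, , drop = FALSE])^2)   # sum of pairwise squared distances
    D / (2 * length(idx))
  }))
}

gap_curve <- function(ms, l_max = 15, B = 100) {
  ms   <- scale(ms)
  logW <- sapply(1:l_max, function(l)
    log(pooled_within_dispersion(ms, kmeans(ms, centers = l, nstart = 25)$cluster)))
  logW_ref <- replicate(B, {
    ref <- matrix(rnorm(length(ms)), ncol = ncol(ms))   # reference draw from N(0,1)
    sapply(1:l_max, function(l)
      log(pooled_within_dispersion(ref, kmeans(ref, centers = l, nstart = 5)$cluster)))
  })
  abs(rowMeans(logW_ref) - logW)                        # Gap_p(l) for l = 1, ..., l_max
}
## Estimated number of clusters: the (smallest) minimizer, e.g. which.min(gap_curve(ms)).
```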

Transformation Issues: Simulation Setup. Consider a synthetic dataset drawn from Rocke & Durbin's additive and multiplicative error model (Rocke et al. 2001): y = α + μ e^η + ν, where η ~ N(0, σ²_η) is the multiplicative error and ν ~ N(0, σ²_ν) is the additive error (1). Consider a multi-group problem where, for each variable j = 1, ..., p (gene, peptide, protein, ...) and each group k = 1, ..., G, a slight derivation of (1) gives the individual response (signal, intensity): Y_{i,j} = α_k + μ_{k,j} e^{η_{i,j}} + ν_{i,j} for samples i with group membership indicator k_i = k, where η_{i,j} ~ N(0, σ²_{η_k}) and ν_{i,j} ~ N(0, σ²_{ν_k}) independently. 42

Transformation Issues: Simulation Setup. This model ensures two things simultaneously: 1) the sample variance of a variable is proportional to the square of its mean signal; from the delta method we get Var(Y_{i,j}) ≈ μ²_{k,j} e^{σ²_{η_k}} (e^{σ²_{η_k}} − 1) + σ²_{ν_k}. In fact, for small values of μ_{k,j} the signal Y_{i,j} is approximately normally distributed, Y_{i,j} ≈ N(α_k, σ²_{ν_k}), while for large values of μ_{k,j} the signal is approximately log-normally distributed, (Y_{i,j} − α_k) ~ LogNormal(log(μ_{k,j}), σ²_{η_k}). 2) variances are unequal across groups: for k ≠ k', Var(Y_{i,j}) = μ²_{k,j} Var(e^{η_{i,j}}) + σ²_{ν_k} for i such that k_i = k differs from Var(Y_{i,j}) = μ²_{k',j} Var(e^{η_{i,j}}) + σ²_{ν_{k'}} for i such that k_i = k'. 43

Transformation Issues: Gene Clustering Results (Synthetic Dataset). Parameters: groups G = 2; sample size n_1 = n_2 = 5; dimensionality p = 1000; α_1 = α_2 = 15; σ_{η_1} = 0.1, σ_{η_2} = 0.2; σ_{ν_1} = 1, σ_{ν_2} = 3. Simulated means: with 80% probability μ_{1,j} = μ_{2,j} = 0; with 20% probability μ_{1,j} ~ Exp(λ_1) and μ_{2,j} ~ Exp(λ_2) iid, with λ_1, λ_2 ~ U[1,10]. 44
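
A minimal R sketch of this synthetic dataset, assuming the parameter mapping read off the slide above (α_k the additive offset, σ_{η_k} the multiplicative-error sd, σ_{ν_k} the additive-error sd); it is not the authors' exact simulation code:

```r
set.seed(4)
p <- 1000; G <- 2; n_k <- c(5, 5)
alpha <- c(15, 15); sigma_eta <- c(0.1, 0.2); sigma_nu <- c(1, 3)

## Simulated means: 80% of variables null (mu = 0 in both groups), 20% with
## group-specific exponential means whose rates are drawn from U[1, 10].
de <- runif(p) < 0.20
mu <- matrix(0, nrow = G, ncol = p)
for (k in 1:G) mu[k, de] <- rexp(sum(de), rate = runif(1, 1, 10))

## Rocke & Durbin model: Y_ij = alpha_k + mu_kj * exp(eta_ij) + nu_ij
Y <- do.call(rbind, lapply(1:G, function(k) {
  t(sapply(seq_len(n_k[k]), function(i)
    alpha[k] + mu[k, ] * exp(rnorm(p, 0, sigma_eta[k])) + rnorm(p, 0, sigma_nu[k])))
}))
group <- rep(1:G, times = n_k)   # 10 runs (rows) x 1000 variables (columns)
```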

Transformation Issues Normalization Results (Synthetic Dataset) 45

Transformation Issues: Normality Tests Results (Synthetic Dataset). Normality assumption: Shapiro-Wilk test for normality within each group k = 1, ..., G: H_0: Y_{i,j} ~ N(μ_{k,j}, σ²_{k,j}) for i such that k_i = k, j = 1, ..., p. 46
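
A minimal R sketch of these per-feature, per-group normality checks, reusing the `Y` and `group` objects from the simulation sketch above:

```r
sw_pvals <- sapply(seq_len(ncol(Y)), function(j)
  sapply(split(Y[, j], group), function(y) shapiro.test(y)$p.value))
## sw_pvals is a G x p matrix of Shapiro-Wilk p-values; small values flag non-normality.
rowMeans(sw_pvals < 0.05)   # proportion of features rejecting normality, per group
```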

Transformation Issues: Variance Stabilization (Synthetic Dataset). Homoscedasticity assumption: robust Levene test: H_0: σ²_{1,j} = σ²_{2,j} = ... = σ²_{G,j}, j = 1, ..., p. The SD vs. rank(SD) plot allows one to visually verify whether the assumption is satisfied or not. 47
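
A minimal R sketch of a robust (Brown-Forsythe type) Levene test per feature, an ANOVA on absolute deviations from the group medians; it reuses `Y` and `group` from above, and car::leveneTest would be a packaged alternative:

```r
levene_bf <- function(y, g) {
  z <- abs(y - ave(y, g, FUN = median))        # deviations from group medians
  anova(lm(z ~ factor(g)))[["Pr(>F)"]][1]      # p-value of the group effect
}
lev_pvals <- apply(Y, 2, levene_bf, g = group)
mean(lev_pvals < 0.05)                         # fraction of features with unequal group variances
```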

Transformation Issues: Variance Stabilization (Synthetic Dataset). The Mean-SD scatterplot allows one to visually verify whether there is a dependence of the variance on the mean. The black dotted line depicts the running median estimator (window width 10%). If there is no variance-mean dependence, then this line should be approximately horizontal. 48

Transformation Issues: Variance Stabilization (Synthetic Dataset). Comparative QQ-plots of the observed pooled standard deviations vs. the standard deviations expected under a homoscedastic model (e.g. data from a standard Gaussian distribution N(0,1)). Two-sided Kolmogorov-Smirnov (K-S) test across variables: H_0: the observed variances (σ̂²_1, σ̂²_2, ..., σ̂²_p) follow the distribution expected under the homoscedastic N(0,1) reference. 49

Transformation Issues: Extension: Regularized Test Statistics. Using regularized estimates of the population mean and variance of each variable (gene, peptide, protein, ...) not only stabilizes the variance but can also be used in new regularized test statistics to increase statistical power. Recall: G groups k = 1, ..., G of samples and C clusters l = 1, ..., C of variables, giving, for each variable j in group k, group-specific estimates μ̂_{k,j} and σ̂²_{k,j} and their cluster-regularized counterparts μ̂(l_{k,j}) and σ̂²(l_{k,j}). 50

Transformation Issues: Extension: Regularized Test Statistics (e.g. t-tests). Consider now a two-group problem (G = 2). Define for variable j in group k: (i) a cluster mean of the group sample means, μ̂_k(l_{k,j}) = (1 / #{j' : l_{k,j'} = l_{k,j}}) Σ_{j' : l_{k,j'} = l_{k,j}} μ̂_{k,j'}, and (ii) a cluster mean of the group sample variances, σ̂²_k(l_{k,j}) = (1 / #{j' : l_{k,j'} = l_{k,j}}) Σ_{j' : l_{k,j'} = l_{k,j}} σ̂²_{k,j'}, where l_{k,j} ∈ {1, ..., C} is the cluster membership indicator of variable j (in the k-th group). Define a regularized unequal-variance t-statistic: t_j = (μ̂_1(l_{1,j}) − μ̂_2(l_{2,j})) / sqrt( σ̂²_1(l_{1,j})/n_1 + σ̂²_2(l_{2,j})/n_2 ). 51
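
A minimal R sketch of this regularized unequal-variance t-statistic: per-group means and variances are replaced by within-cluster averages, with clusters found per group by k-means on the (mean, sd) pairs. It reuses `Y` and `group` from the simulation sketch, and the number of clusters C is an assumption (the slides choose it with the gap statistic):

```r
regularized_t <- function(Y, group, C = 10) {
  stats_k <- lapply(1:2, function(k) {
    Yk <- Y[group == k, , drop = FALSE]
    mu <- colMeans(Yk); v <- apply(Yk, 2, var)
    cl <- kmeans(scale(cbind(mu, sqrt(v))), centers = C, nstart = 50)$cluster
    list(mu = ave(mu, cl), v = ave(v, cl), n = nrow(Yk))  # cluster-regularized estimates
  })
  a <- stats_k[[1]]; b <- stats_k[[2]]
  (a$mu - b$mu) / sqrt(a$v / a$n + b$v / b$n)
}
t_reg <- regularized_t(Y, group, C = 5)
```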

Transformation Issues: Power of Regularization. Compare regular vs. regularized t-test performance under the various procedures. Each procedure has a different cutoff value for identifying differentially expressed variables, so comparisons were calibrated by using the top 20% of variables ranked by the absolute value of their test statistic. Monte Carlo estimates of false positives, false negatives, and total misclassifications based on B = 100 replicated synthetic datasets: F̂P = #{variables called significant that are truly not differentially expressed}; F̂N = #{variables called not significant that are truly differentially expressed}; M̂ = F̂P + F̂N.

             Log scale                    Raw scale
             F̂P      F̂N      M̂          F̂P      F̂N      M̂
t-test       65.53   66.41   131.94     68.77   71.71   140.48
t-cart       70.52   71.40   141.92     60.51   63.46   123.96
t-rs         56.46   57.35   113.81     59.08   62.03   121.11
t-vsn        61.62   62.51   124.13     63.72   66.67   130.39
t-loess      69.11   69.99   139.10     63.94   72.29   141.62
t-css        67.13   68.02   135.15     69.12   72.06   141.18
t-quant      69.03   69.91   138.94     70.96   73.90   144.86

Notice the loss of power occurring with common normalization procedures that do not guarantee a stabilization of variance, or when the mean is not accounted for during regularization. Quantile normalization looked great, yet yields the worst inferences! 52

Transformation Issues: Advantage of Using a Joint Shrinkage Estimator of Mean/Variance. t-rs (based on a joint shrinkage estimator of the mean and variance) yields fewer false positives and more true positives than t-cart (based on a variance-only shrinkage estimator): 53

Transformation Issues: Summary. Summary table of quality-control checks: the rows are the transformations (None (raw data), Logarithm, Regularized Standardization (RS), CART-Stabilization Regularization (CART), Generalized Logarithm (VSN), Local Polynomial Regression (LOESS), Cubic Smoothing Splines (CSS), Quantile (QUANT)); the columns are the assumptions checked (normalization/standardization of samples, normality of features (by group), group variance homogenization (homoscedasticity), variance stabilization of features, and inferences (test statistics)). Use regularization procedures, e.g. the RS transformation or the CART variance-stabilization procedure, because they are the only transformations that do a good job in terms of feature-wise variance stabilization and inferences, which are essential. They also do not deteriorate the data, and they achieve normality after a log transformation. 54

Transformation Issues: R Implementations.
Regularized, CART: RecSplit(.), H. Ishwaran's (CCF) R code, based on Papana et al. (2006).
Regularized, RS: RegStd(.), J-E Dazard's R code (R package in preparation).
Unregularized, VSN: justvsn(.) from the R package vsn, Huber W. et al. (2002).
Unregularized, LOESS: normalize.loess(.) from the R package affy.
Unregularized, CSS: normalize.qspline(.) from the R package affy, Workman C. et al. (2002).
Unregularized, QUANT: normalize.quantiles(.) from the R package preprocessCore, Bolstad, B. M. et al. (2003).
55
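
A minimal usage sketch of the unregularized alternatives listed above (the Bioconductor packages are assumed to be installed); `Y` is the samples-by-variables matrix from the simulation sketch, transposed because these functions expect features in rows:

```r
# BiocManager::install(c("preprocessCore", "affy", "vsn"))
library(preprocessCore); library(affy); library(vsn)

X <- t(Y)                                      # features (rows) x samples (columns)
X_quant <- normalize.quantiles(X)              # quantile normalization (Bolstad et al. 2003)
X_loess <- normalize.loess(X, log.it = TRUE)   # cyclic loess normalization
X_vsn   <- justvsn(X)                          # variance-stabilizing normalization (Huber et al. 2002)
X_css   <- normalize.qspline(X)                # cubic smoothing splines (Workman et al. 2002)
```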

References (by order of appearance):
Karpievitch Y. et al. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics 2009, 25(16):2028-2034.
Oba S, Sato MA, Takemasa I, Monden M, Matsubara K, Ishii S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19(16):2088-2096.
Kim KY, Kim BJ, Yi GS. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics 2004, 5:160.
Mueller et al. An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J Proteome Res 2008, 7(1):51-61.
Tang K. et al. Charge competition and the linear dynamic range of detection in electrospray ionization mass spectrometry. J Am Soc Mass Spectrom 2004, 15(10):1416-1423.
Old W. et al. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol Cell Proteomics 2005, 4(10):1487-1502.
Wang P. et al. Normalization regarding non-random missing values in high-throughput mass spectrometry data. Pac Symp Biocomput 2006a, 11:315-326.
Little R, Rubin D. Statistical Analysis With Missing Data. New York: Wiley; 1987.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6):520-525.
Zhou X, et al. Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics 2003, 19(17):2302-2307.
Kim H, et al. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 2005, 21(2):187-198.
Wang X, et al. Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinformatics 2006b, 7:32.
Jornsten R, et al. A meta-data based method for DNA microarray imputation. BMC Bioinformatics 2007, 8:109.
Callow MJ. et al. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res 2000, 10(12):2022-2029.
Cui X. et al. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005, 6(1):59-75.
Efron B. et al. Empirical Bayes analysis of a microarray experiment. J Amer Stat Assoc 2001, 96:1151-1160.
Ishwaran H, Rao JS. Detecting differentially expressed genes in microarrays using Bayesian model selection. J Amer Stat Assoc 2003, 98(462):438-455.
Papana A. et al. CART variance stabilization and regularization for high-throughput genomic data. Bioinformatics 2006, 22(18):2254-2261.
Storey JD. A direct approach to false discovery rates. J R Statist Soc B 2002, 64(3):479-498.
Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics 2003a, 31:2013-2035.
Storey JD. et al. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. J R Statist Soc B 2003b, 66:187-205.
Tusher VG. et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9):5116-5121.
Rocke et al. A model for measurement error for gene expression arrays. J Comput Biol 2001, 8:557-569.
Bolstad BM. et al. A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics 2003, 19(2):185-193.
Workman C. et al. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biology 2002, 3(9):research0048.
Huber W. et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18(Suppl 1):S96-S104.
Durbin et al. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18(S1):S105-S110.
Stein C. Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. 1964, volume 16. Springer Netherlands.
Tibshirani R. Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc B 2001, 63:411-423.
56