Statistical Issues in Preprocessing Quantitative Bottom-Up LCMS Proteomics Data


1 Statistical Issues in Preprocessing Quantitative Bottom-Up LCMS Proteomics Data Jean-Eudes DAZARD Case Western Reserve University Center for Proteomics and Bioinformatics November 05, 2009

2 Background Label-Free LC/MS Analysis Workflow: samples (control C, treatment T) -> Digestion -> Raw Data Acquisition (LC/MS/MS) -> Data Preprocessing -> Statistical Analysis -> Biomarker Discovery -> Network Analysis. [Figure: the preprocessed data matrix, with columns Diffset 1, ..., Diffset p and rows Intensity Run 1, ..., Intensity Run n, Consensus Mass, Consensus Time, Consensus Sequence, Consensus Sequence Score, Protein IP, Protein Description; many intensity entries are NA.]

3 Outline Missingness Issues Transformation Issues Not covered here: peak alignment, peptide identification, and statistical inferences

4 Missingness Issues Dataset Structure and Missing Data [Figure: the diffset-by-run data matrix of slide 2 (Diffset 1, ..., Diffset p vs. Intensity Runs 1, ..., n plus consensus annotations), with widespread NA entries.] Extent, implications, sources of missingness, and state-of-the-art treatment.

5 Missingness Issues Extent of the Issue In general, ~5-10% overall missingness is considered small and is common in omics data (e.g. in some DNA microarray datasets). In contrast, large to very large amounts of missing values commonly occur in LCMS proteomics datasets. [Table: number of diffsets and % missingness for the Diabetes, CACTI (pilot), CACTI, IPS and Restenosis studies, initially and after prefiltering steps #1-#3; the numeric entries were lost in extraction.] This amount of missing data is consistent with what others have observed (~50%), e.g. (Karpievitch et al., 2009).

6 Missingness Issues Extent of the Issue A reviewer's comment: "It is very difficult to accept conclusions based on a dataset having 40% of missing values [...], no matter how sophisticated the statistics are." Because many of the different algorithms for missing-value imputation were initially developed in the context of microarray data, their performance was usually not tested beyond the range [0, 20%] of missingness (assuming Missing At Random; more on this later). But recall that in our case we deal with variable missingness within a range of roughly [20%, 60%].

7 Missingness Issues Extent of the Issue It is true that every imputation method will break down beyond a certain rate of missingness. However, this breaking point has not been systematically investigated and remains unknown for most of the available methods. In fact, two recent studies have investigated the comparative behavior of a few methods under high and even extremely high missingness, of up to 40% and 60% respectively (Oba et al. 2003; Kim et al., 2004). In each study, the performance of several methods did not degrade up to these rates. So it is conceivable, and practically possible, to deal with at least ~40% overall missing entries.

8 Missingness Issues Implications Missing values prevent some statistical procedures from being performed at all (e.g. PCA) or without further data processing (e.g. ANOVA, linear models). We may have increasing difficulty in fitting certain GLMs, etc. When analyses are possible, estimates/tests may be prone to extreme bias, depending in part on the mechanism of missingness. Missing data (even if missing at random) will generally result in reduced precision in estimation or reduced power in statistical tests.

9 Missingness Issues Sources of Missingness Overview of quantification approaches in LC-MS based proteomic experiments (Mueller et al. 2007). Label-free quantification extracts peptide signals by tracking isotopic patterns along their chromatographic elution profile. The corresponding signal in LC-MS run 2 is then found by comparing the coordinates (m/z, Tr, z) of the peptide (blue). Multiple LC-MS datasets are aligned to one another using ChromPeaks, after which LC-MS features (peptides) can be matched to a database.

10 Missingness Issues Sources of Missingness Even after peak clustering/alignment, there will still be peaks that are missing from some of the samples. That can occur simply because a peptide was not present in a sample, or because not every peak was detected, identified and aligned in all samples. [Figure: a peak list (red map) generated by peak detection and alignment of multiple LC-MS runs (gray maps).] So, in shotgun proteomics measurements, many peptides that are observed in some samples are not observed in others, resulting in widespread missing values.

11 Missingness Issues Sources of Missingness Furthermore, there are two independent mechanisms by which an intensity will not be observed (Karpievitch et al., 2009): (i) the peak was not observed because the peptide is present at an abundance level below the instrument detection threshold => left-censored data with non-ignorable missingness; (ii) ionization inefficiencies, ion-suppression effects and other technical factors (Tang et al., 2004) => completely random missingness. Whenever there is informative (non-ignorable) missingness, care must be taken when handling the missing values to avoid biasing abundance estimates (Old et al., 2005; Wang et al. 2006a; Karpievitch et al., 2009).

12 Missingness Issues Exploratory Data Analysis of Missingness Evidence Against Random Missingness The systematic pattern reflects the fact that low-abundance peptides are more likely to have missing peaks: [Figure]

13 Missingness Issues Exploratory Data Analysis of Missingness Evidence Against Random Missingness More than 34% of peptides/diffsets contain the maximal number of 10 missing values (out of 15 positions) => an almost zero-probability event under the Missing At Random (MAR) assumption. The distribution of missing events by peptide is NOT uniform across peptides/diffsets or proteins: [Figure]

14 Missingness Issues Exploratory Data Analysis of Missingness Evidence Against Random Missingness Test of the random nature of the missingness: let $X_j$ = #{observed missing counts per peptide/diffset $j$}. Under a random process of missingness, it is reasonable to assume that $X_j$ is a count r.v. of rare events, i.e. distributed as a Poisson random variable. Hypothesis testing (e.g. a KS test): $H_0: X_j \sim \mathrm{Poisson}(\lambda)$. Strong evidence to reject the null hypothesis.
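
A minimal sketch of such a test in R, on toy counts (the data vector is hypothetical, and the KS test is only approximate for discrete distributions):

```r
## Sketch: goodness-of-fit of per-diffset missing counts to a Poisson law.
## 'x' is hypothetical toy data; in practice it would be the vector of
## observed missing counts per peptide/diffset.
set.seed(1)
x <- c(rpois(800, lambda = 2), rep(10, 200))  # counts with an excess at the maximum
lambda.hat <- mean(x)                         # MLE of the Poisson rate
## KS test against Poisson(lambda.hat); ties warnings are expected for
## discrete data (a chi-squared GOF test is a common alternative).
ks.test(x, "ppois", lambda = lambda.hat)
```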

15 Missingness Issues Treatment of Missingness We have shown that the missing mechanism of LCMS proteomics data is intensity-dependent => Missing Not At Random (MNAR). The probability of a value being missing generally depends on the observed and unobserved values (had they been observed). This is to be distinguished from the other missingness mechanisms (MCAR and MAR), which occur more rarely here and at random. So missingness should be reduced as much as possible and treated, but carefully: prefiltering will be limited in scope; and one should account for informative missingness in peak intensities to allow unbiased, model-based, protein-level (Karpievitch et al., 2009) or peptide-level (Wang et al. 2006a, or ours, 2009) estimation and inference (more later).

16 Missingness Issues Treatment of Missingness Prefiltering Mistakes For instance, complete-data analysis has its simplicity, but: (i) loss of power due to dropping a large amount of information. If variables (peptides/diffsets) with missing values are just discarded or ignored, a substantial loss of power is introduced because information is simply lost! [Table: total vs. complete number of diffsets (and %) for the Diabetes, CACTI, IPS and Restenosis studies; the numeric entries were lost in extraction.] (ii) substantial bias induction in inferences (i.e. results not applicable to the population) due to ignoring possible systematic differences between the complete and incomplete cases (violation of the MCAR assumption).
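
A toy illustration of how quickly complete cases vanish at realistic missingness rates (the matrix dimensions and the 40% rate are assumptions for illustration, not the talk's data):

```r
## Toy illustration of information loss under complete-case analysis.
set.seed(1)
X <- matrix(rnorm(1000 * 15), nrow = 1000, ncol = 15)  # 1000 diffsets x 15 runs
X[sample(length(X), size = 0.4 * length(X))] <- NA     # ~40% missing entries
mean(complete.cases(X))                                # fraction of fully observed diffsets
```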

17 Missingness Issues Treatment of Missingness Imputation Mistakes Almost all existing imputation methods available for high-dimensional data assume MAR (KNN, SVD, EM, etc.) => none of these are applicable here! To mention a few: row average (Little, 1987), Singular Value Decomposition and k-Nearest Neighbors (Troyanskaya, 2001), Bayesian variable selection (Zhou, 2003), local least squares imputation (Kim, 2005), Support Vector Regression (Wang, 2006b), logical set imputation (Jornsten, 2007), etc.

18 Missingness Issues Treatment of Missingness Example of Upward Bias in Mean and Downward Bias in Variance Estimation [Figure: peptide intensity (abundance) vs. peptide #, for treatment (+) and control samples, with the instrument detection threshold drawn as a horizontal line. Estimate #1 of the control group mean (complete data or row average) lies above the true control group mean (bias #1, upward); estimate #2 (minimum-value imputation) lies below it (bias #2, downward).]
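
The two biases are easy to reproduce under an assumed Gaussian abundance model with detection threshold d (a sketch; `global.min` stands in for a dataset-wide minimum intensity and is an assumption):

```r
## Sketch: two naive estimates of a control-group mean under left-censoring.
set.seed(1)
d <- 2.3                                  # hypothetical detection threshold
x <- rnorm(1e4, mean = 2.5, sd = 1)       # true abundances of one peptide
obs <- ifelse(x >= d, x, NA)              # intensities below d go unobserved
global.min <- 0.1                         # assumed dataset-wide minimum intensity
mean(x)                                   # true group mean (about 2.5)
mean(obs, na.rm = TRUE)                   # estimate #1 (complete data / row average): biased upward
mean(ifelse(is.na(obs), global.min, obs)) # estimate #2 (minimum-value imputation): biased downward
```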

19 Missingness Issues Prefiltering LCMS Data Step #1 Initial Preselection The initial dataset contains aligned peptide/diffset peak intensities (ChromPeaks), uniquely identified by diffset IDs. But many do not have any annotation, not even a consensus peptide sequence (the "EST" situation in mRNA microarray data) => just ignore them (at this point). [Figure: the diffset-by-run data matrix as on slide 2.]

20 Missingness Issues Prefiltering LCMS Data Step #1 Initial Preselection Peptides/diffsets that are retained have: (i) a completely annotated consensus peptide sequence (score > 0); (ii) a consensus peptide sequence score above a lower-bound cutoff percentile. The percentile is chosen to remove the lowest mode in the peptide sequence score distribution, or the flat part of the cumulative distribution function (e.g. the 79th percentile, score > 14.95). Summary statistics of missing values: p = 1555 peptides/diffsets selected; 54.9% overall missingness (was 96.9%!).

21 Missingness Issues Prefiltering LCMS Data Step #2 Peptide Intensity Summarization A summarization that removes feature extraction and peak alignment errors: integrate (over retention time and m/z ratio) the area under the peak intensities of all peptides/diffsets with identical consensus sequence (in length and amino-acid composition). This summarization preselects those peptides/diffsets having a unique IP number and annotation (with adjusted intensities). [Figure: the summarized diffset-by-run matrix.] Summary statistics of missing values: p = 1362 peptides/diffsets selected; 52.8% overall missingness.

22 Missingness Issues Prefiltering LCMS Data Step #3 Missingness Proportion Reduction Let $X_j$ = #{observed missing counts per peptide/diffset $j$}. Select those peptides/diffsets for which the upper bound on $X_j$ satisfies two criteria simultaneously: (i) Criterion #1: the bound must be less than the total sample size $n$ minus half the minimal group sample size $\min_k(n_k)$, $k = 1, \ldots, G$ (assuming $\min_k(n_k) > 1$ and $n > 1$); (ii) Criterion #2: the bound should ideally maximize the difference between the total number of remaining variables after selection and the total number of missing values (see next slide). Here $n_k$ denotes the $k$-th group sample size of group $G_k$, $k = 1, \ldots, G$, such that $\sum_{k=1}^{G} n_k = n$. [Figure: the diffset-by-run matrix, Intensity Runs 1-9.]

23 Missingness Issues Prefiltering LCMS Data Step #3 Missingness Proportion Reduction Here $X_j \in \{0, \ldots, n\}$ for each $j = 1, \ldots, p$. By criterion #1: $n^{(1)} \le n - \lfloor \min_k(n_k)/2 \rfloor - 1$, where $\lfloor \cdot \rfloor$ denotes the floor function. By criterion #2: $n^{(2)} = \arg\max_x g(x, y)$, with $g(x, y) = \#\{j : x_j \le x\} - \sum_{\{j : x_j \le x\}} x_j$, where $x_j$ is the observed missing count and $y_j$ the observed abundance of the $j$-th peptide. Example: maximal number of missing values allowed per peptide/diffset: the two criteria give a minimum value n = 4 and a maximum value n = 7; use e.g. the nearest-integer mean value n = 6. Summary statistics of missing values (n = 6): p = 914 peptides/diffsets selected (512 proteins); 38.1% overall missingness (3134 occurrences).
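
A sketch of criterion #2 as described verbally above; the scoring function is my reading of the slide's $g(x, y)$, not code from the talk:

```r
## Sketch of criterion #2: choose the per-diffset missingness cap x maximizing
## (# diffsets retained) - (total # of missing values among those retained).
## 'miss' is hypothetical: the vector of observed missing counts per diffset.
crit2 <- function(miss, n.max) {
  caps <- 0:n.max
  g <- sapply(caps, function(x) sum(miss <= x) - sum(miss[miss <= x]))
  caps[which.max(g)]
}
```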

24 Missingness Issues Model-Based Imputation Probability Model To account for the non-random nature of the missing events in this type of data, we use a probability model adapted from Wang et al. (2006a) to describe artefactual missing events. It makes inferences on the missing values of one sample based on the information from other similar samples (technical replicates or nearest neighbors). An artefactual missing event is one where the peptide exists in the sample but no signal has been detected (unobserved). Based on the intensities of those peptides observed in both samples, the scale difference of the overall abundances between Sample 1 and Sample 2 can be estimated and used to approximate the missing intensity of Peptide 1 in Sample 1.

25 Missingness Issues Model-Based Imputation Probability Model Formally, let $Z_j$ be the indicator variable of the $j$-th peptide in one sample $i$ (dependence on the subscript $i$ is dropped): $Z_j = 1$ if the $j$-th peptide exists in the sample, 0 otherwise. Let $X_j$ be the true abundance of that peptide; then $X_j = 0$ if $Z_j = 0$, and $X_j \sim f_{X_j \mid Z_j}$ if $Z_j = 1$, where $f$ denotes the (unknown) p.d.f. of $X_j$. The observed abundance $Y_j$ of the $j$-th peptide given $(X_j, Z_j, d)$ satisfies: $Y_j = 0$ if $Z_j = 0$; $Y_j = 0$ if $Z_j = 1$ and $X_j < d$; $Y_j = X_j$ if $Z_j = 1$ and $X_j \ge d$, where $d$ is the minimum detection level of the LCMS instrument.

26 Missingness Issues Model-Based Imputation Probability Model Under the MNAR assumption, the missingness mechanism is described by the probability of an artefactual missing event for the $j$-th peptide (denoted $M_j = \{Z_j = 1, Y_j = 0\}$) given that no signal is observed: $\Pr(M_j \mid Y_j = 0)$. Using Bayes' rule, one derives an estimate of the conditional expectation of the true intensity $X_j$ given that it is unobserved (i.e. $Y_j = 0$): $E(X_j \mid Y_j = 0) = E(X_j \mid X_j < d, Z_j = 1)\,\Pr(M_j \mid Y_j = 0)$. From these derivations, given the detection-level parameter $d$, the distribution function $f$, and the observed abundances $Y$, it is possible to estimate $E(X_j \mid X_j < d, Z_j = 1)$ and $\Pr(M_j \mid Y_j = 0)$, and in turn $E(X_j \mid Y_j = 0)$.
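
For instance, if one assumed a Gaussian form for $f$, the left-truncated conditional mean has a closed form; a minimal sketch (the Normal choice is an illustration, not the talk's $f$):

```r
## Sketch: E(X | X < d, Z = 1) under a hypothetical Normal(mu, sigma) abundance
## model, via the inverse Mills ratio of a left-truncated Gaussian.
cond.mean.below <- function(mu, sigma, d) {
  a <- (d - mu) / sigma
  mu - sigma * dnorm(a) / pnorm(a)        # closed-form truncated mean
}
cond.mean.below(mu = 5, sigma = 1, d = 4) # expected abundance of a censored peak
```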

27 Missingness Issues R Implementations R package in prep. (J-E Dazard's R source code): 1. EDA of missingness: Explore.lcms(.); 2. Prefiltering: Select.lcms(.); 3. Imputation of missing values: Impute.lcms(.) (implementation based on Wang et al. 2006a).

28 Outline Missingness Issues Transformation Issues

29 Transformation Issues Why do we often need to transform the data? To remove sources of systematic bias and variation in the measured intensities due to experimental artifacts or the technology employed (non-random or technical); e.g. in 2D-DIGE: dye effects, intensity dependence, spatial effects; in LCMS: LC-MS run, label, intensity, etc. To satisfy subsequent model assumptions (independence of observations, normality, homoscedasticity, ...). In biomarker discovery, this is essential to allow proper comparisons between experimental conditions and for statistical inferences (e.g. aimed at identifying differentially expressed variables across experimental conditions). It is usually simpler, and often the only feasible, approach compared with trying to model these systematic effects.

30 Transformation Issues In normalization (standardization), one wants to compensate for location and scale effects between different samples (or groups) of the data. A wide variety of normalization approaches have been proposed (e.g. LOESS, VSN, CSS, Quantile). Location normalization techniques all adjust the average of the data: Total intensity normalization: a single, global, multiplicative adjustment => the average intensity is zero; assumes that most variables do not respond to experimental conditions. Median adjustment: essentially the same without filtering; a global adjustment such that the median intensity is zero. Intensity-dependent normalization using local estimation: a smooth best-fit curve is calculated for the dependence of the log-ratio (M) on the overall geometric mean (A) of intensities; normalized log-ratios are given as the residuals from this curve, e.g. global M-A "loess" normalization via locally weighted regression. "Scale" normalization techniques adjust the range of the data, rather than the center of the distribution.
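
A minimal sketch of such a global M-A loess normalization for two runs x and y (the function name and loess defaults are mine):

```r
## Sketch: global M-A loess normalization of run y against run x.
ma.loess <- function(x, y) {
  M <- log2(y) - log2(x)                  # log-ratio
  A <- (log2(x) + log2(y)) / 2            # average log-intensity
  fit <- loess(M ~ A)                     # smooth intensity-dependent trend
  list(M = residuals(fit), A = A)         # normalized log-ratios = residuals
}
```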

31 Transformation Issues Often misunderstood: "normalization" is used generically. It usually includes standardization, but not necessarily all of the following procedures: 1. Normality transformation: in general, statistical models assume normality of the measurements (or errors); one relies on a set of assumptions about the ideal form of the data and attempts to transform the data to be consistent with that ideal form (e.g. for continuous data: log(.), Box-Cox, ...). One can do away with this if n >> 1 (CLT result). 2. Homoscedastic transformation: statistical procedures rely on the equality-of-variances assumption => crucial for inferences. 3. Variance stabilization: often high-throughput data (p >> n) exhibit a complex dependence between the mean and the standard deviation, with standard deviations severely increasing with the means. Clearly, simple normalizations do not necessarily stabilize the variance of the data across variables => crucial for improved statistical inferences.

32 Transformation Issues Regularization: The premise: because many variables are measured simultaneously, it is very likely that most of them will behave similarly and share similar parameters (more later). The idea: take advantage of the parallel nature of high-throughput data by borrowing information across similar variables (pooling of information). A majority of authors have used regularization techniques and shown that shrinkage estimators can significantly improve the accuracy of inferences, especially in p >> n situations. See also the James-Stein shrinkage estimator (1961). Empirical Bayes approaches applied to variance estimation amount to combining residual sample variances across variables and shrinking them towards a pooled estimate. Posterior estimators follow distributions with augmented degrees of freedom => greater statistical power and far more stable inference, especially when p >> n.

33 Transformation Issues Variance Stabilization and Regularization Via Clustering Let $Y_{i,j}$ be the expression measurement of peptide/diffset (variable) $j$ in sample $i$. Using the individual sample standard deviation $\hat{\sigma}_j$ of variable $j$ (gene, peptide, protein, ...) as the population estimate to scale each response, $Y^*_{i,j} = \hat{\sigma}_j^{-1} Y_{i,j}$, will give a common transformed standard deviation $\sigma^* = 1$ to each variable, a transformation of the data that is likely to be poor. Because, even if a standard-deviation-1 model is true, we still expect sampling variability of $\hat{\sigma}_j$ around 1.

34 Transformation Issues Variance Stabilization and Regularization Via Clustering A commonality of the previous types of methods is shrinkage of the sample variance to a global value by a form of variance regularization. However, the assumption of an equal-variance (homoscedastic) model, where a pooled common estimator is used for all variables, is unrealistic. In fact, in large datasets where p >> n, generating shrinkage estimators toward a common global value used for all variables (such as variable-by-variable z-scores to transform the data) can lead to misleading inferences, mostly due to overfitting (Callow et al. 2000; Cui et al. 2005; Efron et al. 2001; Ishwaran et al. 2003; Papana et al. 2006; Storey 2002; 2003a; 2003b; Tusher et al. 2001).

35 Transformation Issues Variance Stabilization and Regularization Via Clustering So, to ensure that assumptions are satisfied and to avoid pitfalls in inference, one has to properly standardize the data. We propose using an adaptive regularization technique. Ideas: 1) use the information contained in the estimated population mean to get a better estimate of the population standard deviation $\hat{\sigma}_j$ for each variable $j$ (recall the mutual mean-variance dependence, and C. Stein's result (from his theory on inadmissibility, 1956) that the standard sample variance is improved by a shrinkage estimator using information contained in the sample mean (1964)); 2) borrow information across variables to get better estimates of the population standard deviations (and population means) to normalize with => look for homogeneity of variances (and means) across variables.

36 Transformation Issues Variance Stabilization and Regularization Via Clustering By exploiting the above ideas of variance-mean dependence and of local pooling of information from similar variables, we can generate joint adaptive shrinkage estimators of the mean and the variance. Perform a (bi-dimensional) clustering of the variables into C clusters with similar variances (and means), where the population variance $\sigma^2_j$ and mean $\mu_j$ of each variable are estimated using information from the cluster the variable belongs to: $\hat{\sigma}^2(l_j)$ <- cluster mean of the sample variances; $\hat{\mu}(l_j)$ <- cluster mean of the sample means, where $l_j \in \{1, \ldots, C\}$ is the cluster membership indicator of variable $j$. Use these within-cluster shared estimates $\{\hat{\sigma}^2(l_j), \hat{\mu}(l_j)\}_{j=1}^{p}$ to standardize the variables.

37 Transformation Issues Variance Stabilization and Regularization Via Clustering In practice, the cluster mean of the sample variances for variable $j$ is given by: $\hat{\sigma}^2(l_j) = \frac{1}{\#\{j' : l_{j'} = l_j\}} \sum_{\{j' : l_{j'} = l_j\}} \hat{\sigma}^2_{j'}$. The cluster mean of the sample means for variable $j$ is: $\hat{\mu}(l_j) = \frac{1}{\#\{j' : l_{j'} = l_j\}} \sum_{\{j' : l_{j'} = l_j\}} \hat{\mu}_{j'}$, where $l_j \in \{1, \ldots, C\}$ is the cluster membership indicator of variable $j$: $l_j = \sum_{l=1}^{C} l \cdot I(\mathrm{cluster}(j) = C_l)$; $\hat{\mu}_j$ is the sample mean of variable $j$: $\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} Y_{i,j}$; and $\hat{\sigma}^2_j$ is the unbiased sample variance of variable $j$: $\hat{\sigma}^2_j = \frac{1}{n-1} \sum_{i=1}^{n} (Y_{i,j} - \hat{\mu}_j)^2$.
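
A compact sketch of the resulting standardization, using k-means on the standardized (mean, SD) pairs; function and argument names are mine, and C would in practice come from the gap-statistic criterion described next:

```r
## Sketch: regularized standardization via clustering of (mean, sd) pairs.
## Y: p x n matrix of intensities (variables in rows); C: number of clusters.
reg.std <- function(Y, C = 10, nstart = 100) {
  mu <- rowMeans(Y)                        # per-variable sample means
  s2 <- apply(Y, 1, var)                   # per-variable sample variances
  cl <- kmeans(scale(cbind(mu, sqrt(s2))), centers = C, nstart = nstart)$cluster
  mu.c <- ave(mu, cl)                      # within-cluster mean of sample means
  sd.c <- sqrt(ave(s2, cl))                # within-cluster mean of sample variances
  (Y - mu.c) / sd.c                        # cluster-shared standardization
}
```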

38 Transformation Issues Variance Stabilization and Regularization Via Clustering Recall that we need to perform a clustering of the variable (gene, protein, peptide, ...) sample means and standard deviations into C clusters. Clustering is done e.g. using the K-means partitioning clustering algorithm with 1000 replicated random start seedings. A major challenge in cluster analysis is the estimation of the true number of clusters in a dataset. How do we determine an estimate of that number? The gap statistic is a method for estimating this number (Tibshirani, 2001). Here, we devised a modified version of the gap statistic for determining the optimal number of clusters C in the combined set $\{\hat{\mu}_j, \hat{\sigma}_j\}_{j=1}^{p}$ of sample means and standard deviations.

39 Transformation Issues Gap Statistic for Clustering Let $\mathcal{C} = \{C_l\}_{l=1}^{C}$, where $C_l$ denotes the $l$-th set, $l = 1, \ldots, C$, such that $\sum_{l=1}^{C} p_l = p$. Assume that for a given cluster configuration $\mathcal{C}$ with C clusters, the data have been centered and standardized to have within-cluster means and standard deviations of 0 and 1, respectively. Most methods for estimating the true number of clusters are based on the pooled within-cluster dispersion, defined as $W_p(l) = \sum_{l'=1}^{l} \frac{D_{l'}}{2 p_{l'}}$, where $D_l = \sum_{\{j : l_j = l\}} \sum_{\{j' : l_{j'} = l\}} d_{j,j'}$ is the sum of pairwise squared Euclidean distances $d_{j,j'}$ between all pairs $(\hat{\mu}_j, \hat{\sigma}_j)$ of variables in cluster $C_l$.

40 Transformation Issues Gap Statistic for Clustering An estimate of the true number of clusters is usually obtained by identifying a kink in the plot of $W_p(l) = f(l)$. The gap statistic is an automatic way of locating this kink. Our version of the gap statistic compares the curve of $\log W_p(l)$ to its expected value under an appropriate null reference distribution with true (i) mean 0 and (ii) standard deviation 1 (e.g. a standard Gaussian distribution N(0,1)). Define a similarity statistic by the absolute value of the gap between the two curves: $\mathrm{Gap}_p(l) = \left| E^*_p\{\log W^*_p(l)\} - \log W_p(l) \right|$, where $E^*_p$ and $W^*_p(l)$ denote, respectively, the expectation and the pooled within-cluster dispersion under a sample of size p from the reference distribution.

41 Transformation Issues Gap Statistic for Clustering The estimated true number $\hat{l}$ of clusters is the smallest value of $l$ for which the similarity between the two distributions is maximal, i.e. for which the gap statistic between the two curves is minimal after assessing its sampling distribution: $\hat{l} = \min\{\arg\min_l \widehat{\mathrm{Gap}}_p(l)\}$. In practice, we estimate $E^*_p$, $W^*_p(l)$ and the sampling distribution of $\widehat{\mathrm{Gap}}_p(l)$ by drawing, say, B = 100 Monte Carlo replicates from our reference distribution. If we let $\bar{L} = \frac{1}{B} \sum_{b=1}^{B} \log W^{*b}_p(l)$ denote the estimate of $E^*_p\{\log W^*_p(l)\}$, then the corresponding gap statistic estimate is: $\widehat{\mathrm{Gap}}_p(l) = |\bar{L} - \log W_p(l)|$.
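
A rough sketch of this search; for simplicity it uses the k-means within-cluster sum of squares in place of the talk's pairwise-distance form of $W_p(l)$, and draws the reference from N(0,1) as in the talk:

```r
## Sketch: modified gap-statistic search over l = 1..L clusters for the
## p x 2 matrix Z of (mean, sd) pairs, with B reference draws from N(0,1).
gap.search <- function(Z, L = 15, B = 100) {
  logW <- function(X, l)
    log(sum(kmeans(X, centers = l, nstart = 25)$withinss))
  sapply(1:L, function(l) {
    ref <- replicate(B, logW(matrix(rnorm(length(Z)), nrow = nrow(Z)), l))
    abs(mean(ref) - logW(Z, l))           # |gap| between reference and observed
  })
}
## The estimated number of clusters is then the smallest l minimizing the gap,
## e.g. which.min(gap.search(Z)).
```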

42 Transformation Issues Simulation Setup Consider a synthetic dataset drawn from Rocke & Durbin's additive and multiplicative error model (Rocke et al. 2001): $y = \mu e^{\sigma\eta} + \nu\varepsilon$, where $\eta, \varepsilon \sim$ iid N(0,1) (1), with multiplicative ($\eta$) and additive ($\varepsilon$) error terms. Consider a multi-group problem, for each variable j = 1, ..., p (gene, peptide, protein, ...) and each group k = 1, ..., G. From a slight derivation of (1), the individual response (signal, intensity) can be written as: $Y_{i,j} = \mu_{k_i,j}\, e^{\sigma_{k_i} \eta_{i,j}} + \nu_{k_i}\, \varepsilon_{i,j}$, where $k_i \in \{1, \ldots, G\}$ is the group membership indicator of sample $i$, and $\eta_{i,j}, \varepsilon_{i,j} \sim$ iid N(0,1).

43 Transformation Issues Simulation Setup This model ensures two things simultaneously: 1) the sample variance of a variable is proportional to the square of its mean signal; from the delta method, $\mathrm{Var}(Y_{i,j}) \approx E(Y_{i,j})^2 (e^{\sigma_k^2} - 1) + \nu_k^2$. In fact, for small values of $\mu_{k,j}$ the signal $Y_{i,j}$ is approximately normally distributed, while for large values of $\mu_{k,j}$ the signal $Y_{i,j}$ is approximately lognormal: $Y_{i,j} \sim \mathrm{LogNormal}(\log(\mu_{k,j}), \sigma_k^2)$. 2) variances are unequal across groups: for $k \ne k'$, $\mathrm{Var}(Y_{i,j} : k_i = k) = \mu_{k,j}^2 \mathrm{Var}(e^{\sigma_k \eta_{i,j}}) + \nu_k^2 \ne \mu_{k',j}^2 \mathrm{Var}(e^{\sigma_{k'} \eta_{i,j}}) + \nu_{k'}^2 = \mathrm{Var}(Y_{i,j} : k_i = k')$.
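
A sketch of a draw from this error model (the function name is mine; the parameter values anticipate the setup on the next slide):

```r
## Sketch: draw n signals from the additive + multiplicative error model
## Y = mu * exp(sigma * eta) + nu * epsilon, with eta, epsilon ~ iid N(0,1).
sim.rocke.durbin <- function(n, mu, sigma, nu) {
  mu * exp(sigma * rnorm(n)) + nu * rnorm(n)
}
set.seed(1)
y1 <- sim.rocke.durbin(n = 5, mu = 15, sigma = 0.1, nu = 1)  # group 1
y2 <- sim.rocke.durbin(n = 5, mu = 15, sigma = 0.2, nu = 3)  # group 2
```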

44 Transformation Issues Gene Clustering Results (Synthetic Dataset) Parameters: groups G = 2; sample sizes $n_1 = n_2 = 5$; dimensionality p = 1000; $\sigma_1 = 0.1$, $\sigma_2 = 0.2$; $\nu_1 = 1$, $\nu_2 = 3$. Simulated means: with 80% probability, $\mu_{1,j} = \mu_{2,j} = 15$ (non-differentially expressed); with 20% probability, $\mu_{k,j} \sim$ iid $\mathrm{Exp}(\lambda_k)$, $\lambda_k \sim U[1,10]$, k = 1, 2. [Figure: clustering results.]

45 Transformation Issues Normalization Results (Synthetic Dataset) [Figure]

46 Transformation Issues Normality Test Results (Synthetic Dataset) Normality assumption: Shapiro-Wilk test for normality within group k = 1, ..., G: $H_0: Y_{i,j} \sim N(\mu_{k,j}, \sigma_{k,j}^2)$, for $i : k_i = k$; $j = 1, \ldots, p$. [Figure]
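
A per-variable, per-group Shapiro-Wilk sweep might look as follows (a sketch; Y and grp are assumed inputs, not the talk's objects):

```r
## Sketch: Shapiro-Wilk normality test within each group, one test per variable.
## Y: p x n matrix of intensities; grp: factor of group labels, length n.
sw.pvals <- function(Y, grp) {
  sapply(levels(grp), function(g)
    apply(Y[, grp == g, drop = FALSE], 1, function(y) shapiro.test(y)$p.value))
}
```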

47 Transformation Issues Variance Stabilization (Synthetic Dataset) Homoscedasticity assumption: robust Levene test: $H_0: \sigma_{1,j} = \sigma_{2,j} = \cdots = \sigma_{G,j}$, $j = 1, \ldots, p$. The SD vs. rank(SD) plot allows one to visually verify whether the assumption is satisfied or not. [Figure]
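
A sketch of the per-variable robust Levene (Brown-Forsythe) test, reusing the hypothetical Y and grp from the previous sketch; it assumes the car package is available, which is not part of the talk's code:

```r
## Sketch: robust Levene test per variable via car::leveneTest, which centers
## at the median by default (Brown-Forsythe variant).
library(car)
lev.pvals <- function(Y, grp) {
  apply(Y, 1, function(y) leveneTest(y ~ grp)$"Pr(>F)"[1])
}
```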

48 Transformation Issues Variance Stabilization (Synthetic Dataset) The mean-SD scatterplot allows one to visually verify whether there is a dependence of the variance on the mean. The black dotted line depicts the running median estimator (window width 10%). If there is no variance-mean dependence, then this line should be approximately horizontal. [Figure]

49 Transformation Issues Variance Stabilization (Synthetic Dataset) Comparative QQ-plots of observed pooled standard deviations vs. expected standard deviations under a homoscedastic model (e.g. a standard Gaussian distribution N(0,1)). Two-sided Kolmogorov-Smirnov (K-S) test across variables: $H_0: (\sigma_1, \ldots, \sigma_p) \sim N(0,1)$. [Figure]

50 Transformation Issues Extension: Regularized Test Statistics Using regularized estimates of the population mean and variance of each variable (gene, peptide, protein, ...) not only stabilizes the variance, but can also be used in new regularized test statistics to increase statistical power. Recall: G groups k = 1, ..., G of samples, and C clusters l = 1, ..., C of variables; within each cluster l and group k, the per-variable estimates $\hat{\mu}_{k,j}$ and $\hat{\sigma}^2_{k,j}$ are pooled into cluster-level estimates $\hat{\mu}(l_{k,j})$ and $\hat{\sigma}^2(l_{k,j})$. [Diagram]

51 Transformation Issues Extension: Regularized Test Statistics (e.g. t-tests) Consider now a two-group problem (G = 2). Define, for variable $j$ in group $k$, (i) a cluster mean of the group sample means and (ii) a cluster mean of the group sample variances: $\hat{\mu}_k(l_j) = \frac{1}{\#\{j' : l_{j'} = l_j\}} \sum_{\{j' : l_{j'} = l_j\}} \hat{\mu}_{k,j'}$ and $\hat{\sigma}^2_k(l_j) = \frac{1}{\#\{j' : l_{j'} = l_j\}} \sum_{\{j' : l_{j'} = l_j\}} \hat{\sigma}^2_{k,j'}$, where $l_{k,j} \in \{1, \ldots, C\}$ is the cluster membership indicator of variable $j$ (in the $k$-th group). Define a regularized unequal-variance t-statistic: $t_j = \dfrac{\hat{\mu}_1(l_j) - \hat{\mu}_2(l_j)}{\sqrt{\hat{\sigma}^2_1(l_j)/n_1 + \hat{\sigma}^2_2(l_j)/n_2}}$.
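
A sketch of this statistic, with cluster labels cl assumed given per variable (in the talk they come from the k-means/gap-statistic step); the names are mine:

```r
## Sketch: regularized unequal-variance t-statistic with cluster-pooled group
## means and variances. Y1, Y2: p x n_k matrices; cl: cluster label per variable.
reg.t <- function(Y1, Y2, cl) {
  m1 <- ave(rowMeans(Y1), cl)              # cluster means of group-1 sample means
  m2 <- ave(rowMeans(Y2), cl)
  v1 <- ave(apply(Y1, 1, var), cl)         # cluster means of group-1 sample variances
  v2 <- ave(apply(Y2, 1, var), cl)
  (m1 - m2) / sqrt(v1 / ncol(Y1) + v2 / ncol(Y2))
}
```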

52 Transformation Issues Power of Regularization Compare the performance of regular and regularized t-tests under various procedures. Each procedure has a different cutoff value for identifying differentially expressed variables => comparisons were calibrated by using the top 20% of variables ranked by the absolute value of their test statistic. Monte Carlo estimates of False Positive, False Negative and total Misclassification rates based on B = 100 replicated synthetic datasets: $\widehat{FP}$ = #{variables called significant | variable is truly not differentially expressed}; $\widehat{FN}$ = #{variables called not significant | variable is truly differentially expressed}; $\widehat{M} = \widehat{FP} + \widehat{FN}$. [Table: $\widehat{FP}$, $\widehat{FN}$ and $\widehat{M}$ on the log and raw scales for t-test, t-CART, t-RS, t-VSN, t-LOESS, t-CSS and t-Quant; the numeric entries were lost in extraction.] Notice the loss of power occurring with common normalization procedures that do not guarantee variance stabilization, or when the mean is not accounted for during regularization. Quantile normalization looked great, yet yields the worst inferences!

53 Transformation Issues Advantage of Using a Joint Shrinkage Estimator of Mean/Variance t-RS (based on a joint shrinkage estimator of mean/variance) yields fewer false positives and more true positives than t-CART (based on a single variance shrinkage estimator). [Figure]

54 Transformation Issues Summary [Summary table of quality-control checks: for each transformation (None (raw data), Logarithm, Regularized Standardization (RS), CART Stabilization-Regularization, Generalized Logarithm (VSN), Local Polynomial Regression (LOESS), Cubic Smoothing Splines (CSS), Quantile (QUANT)), whether it satisfies: normalization (standardization) of samples, normality of features (by group), group variance homogenization (homoscedasticity), variance stabilization of features, and inferences (test statistics); the check marks were lost in extraction.] Use regularization procedures, e.g. the RS transformation or the CART variance stabilization procedure, because they are the only transformations that do a good job in terms of feature-wise variance stabilization and inference, which are essential. They also do not deteriorate the data, while achieving normality after a log transformation.

55 Transformation Issues R Implementations Reg. CART: RecSplit(.), H. Ishwaran's (CCF) R code based on Papana et al. (2006). Reg. RS: RegStd(.), J-E Dazard's R code (R package in prep.). Unreg. VSN: R package vsn, justvsn(.), Huber W. et al. (2002). Unreg. LOESS: R package affy, normalize.loess(.). Unreg. CSS: R package affy, normalize.qspline(.), Workman C. et al. (2002). Unreg. QUANT: R package preprocessCore, normalize.quantiles(.), Bolstad, B. M. et al. (2003).

56 References (By order of appearance)
Karpievitch, Y. et al. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics 2009, 25(16).
Oba, S., Sato, M.A., Takemasa, I., Monden, M., Matsubara, K., Ishii, S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19(16).
Kim, K.Y., Kim, B.J., Yi, G.S. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics 2004, 5:160.
Mueller et al. An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J Proteome Res 2008, 7(1).
Tang, K. et al. Charge competition and the linear dynamic range of detection in electrospray ionization mass spectrometry. J Am Soc Mass Spectrom 2004, 15(10).
Old, W. et al. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol Cell Proteomics 2005, 4(10).
Wang, P. et al. Normalization regarding non-random missing values in high-throughput mass spectrometry data. Pac Symp Biocomput 2006a, 11.
Little, R., Rubin, D. Statistical Analysis With Missing Data. New York: Wiley; 1987.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6).
Zhou, X. et al. Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics 2003, 19(17).
Kim, H. et al. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 2005, 21(2).
Wang, X. et al. Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinformatics 2006b, 7:32.
Jornsten, R. et al. A meta-data based method for DNA microarray imputation. BMC Bioinformatics 2007, 8:109.
Callow, M.J. et al. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res 2000, 10(12).
Cui, X. et al. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005, 6(1).
Efron, B. et al. Empirical Bayes analysis of a microarray experiment. J Amer Stat Assoc 2001, 96.
Ishwaran, H., Rao, J.S. Detecting differentially expressed genes in microarrays using Bayesian model selection. J Amer Stat Assoc 2003, 98(462).
Papana, A. et al. CART variance stabilization and regularization for high-throughput genomic data. Bioinformatics 2006, 22(18).
Storey, J.D. A direct approach to false discovery rates. J R Statist Soc B 2002, 64(3).
Storey, J.D. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics 2003a, 31.
Storey, J.D. et al. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. J R Statist Soc B 2003b, 66.
Tusher, V.G. et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9).
Rocke et al. A model for measurement error for gene expression arrays. J Comput Biol 2001, 8.
Bolstad, B.M. et al. A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics 2003, 19(2).
Workman, C. et al. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biology 2002, 3(9):research0048.
Huber, W. et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18 Suppl 1.
Durbin et al. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18(S1):S105-S110.
Stein, C. Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean. 1964, volume 16. Springer Netherlands.
Tibshirani, R. Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc B 2001, 63.


More information

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data 1 Lecture 3: Mixture Models for Microbiome data Outline: - Mixture Models (Negative Binomial) - DESeq2 / Don t Rarefy. Ever. 2 Hypothesis Tests - reminder

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Chapter 10. Semi-Supervised Learning

Chapter 10. Semi-Supervised Learning Chapter 10. Semi-Supervised Learning Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Outline

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

Mixtures and Hidden Markov Models for analyzing genomic data

Mixtures and Hidden Markov Models for analyzing genomic data Mixtures and Hidden Markov Models for analyzing genomic data Marie-Laure Martin-Magniette UMR AgroParisTech/INRA Mathématique et Informatique Appliquées, Paris UMR INRA/UEVE ERL CNRS Unité de Recherche

More information

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee May 13, 2005

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Statistical analysis of isobaric-labeled mass spectrometry data

Statistical analysis of isobaric-labeled mass spectrometry data Statistical analysis of isobaric-labeled mass spectrometry data Farhad Shakeri July 3, 2018 Core Unit for Bioinformatics Analyses Institute for Genomic Statistics and Bioinformatics University Hospital

More information

Proteome-wide label-free quantification with MaxQuant. Jürgen Cox Max Planck Institute of Biochemistry July 2011

Proteome-wide label-free quantification with MaxQuant. Jürgen Cox Max Planck Institute of Biochemistry July 2011 Proteome-wide label-free quantification with MaxQuant Jürgen Cox Max Planck Institute of Biochemistry July 2011 MaxQuant MaxQuant Feature detection Data acquisition Initial Andromeda search Statistics

More information

Lecture: Mixture Models for Microbiome data

Lecture: Mixture Models for Microbiome data Lecture: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data Outline: - - Sequencing thought experiment Mixture Models (tangent) - (esp. Negative Binomial) - Differential abundance

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Manual: R package HTSmix

Manual: R package HTSmix Manual: R package HTSmix Olga Vitek and Danni Yu May 2, 2011 1 Overview High-throughput screens (HTS) measure phenotypes of thousands of biological samples under various conditions. The phenotypes are

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research

profileanalysis Innovation with Integrity Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research profileanalysis Quickly pinpointing and identifying potential biomarkers in Proteomics and Metabolomics research Innovation with Integrity Omics Research Biomarker Discovery Made Easy by ProfileAnalysis

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES BIOL 458 - Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES PART 1: INTRODUCTION TO ANOVA Purpose of ANOVA Analysis of Variance (ANOVA) is an extremely useful statistical method

More information

Diagnostics. Gad Kimmel

Diagnostics. Gad Kimmel Diagnostics Gad Kimmel Outline Introduction. Bootstrap method. Cross validation. ROC plot. Introduction Motivation Estimating properties of an estimator. Given data samples say the average. x 1, x 2,...,

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

BST 226 Statistical Methods for Bioinformatics David M. Rocke. January 22, 2014 BST 226 Statistical Methods for Bioinformatics 1

BST 226 Statistical Methods for Bioinformatics David M. Rocke. January 22, 2014 BST 226 Statistical Methods for Bioinformatics 1 BST 226 Statistical Methods for Bioinformatics David M. Rocke January 22, 2014 BST 226 Statistical Methods for Bioinformatics 1 Mass Spectrometry Mass spectrometry (mass spec, MS) comprises a set of instrumental

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information