SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1

Size: px

Start display at page:

Download "SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1"

Cornelius Little
5 years ago
Views:

1 SPH 247 Statistical Analysis of Laboratory Data April 28, 2015 SPH 247 Statistics for Laboratory Data 1

2 Outline RNA-Seq for differential expression analysis Statistical methods for RNA-Seq: Structure and choices Criteria for comparison of statistical methods Results of standard packages on three data set Null performance Source of differences in the results Input data Tests Variance estimation Normalization Conclusions April 28, 2015 SPH 247 Statistics for Laboratory Data 2

3 RNA-Seq Gene expression is the transcription of the DNA in a gene into mrna, which (in many cases) is later translated into a protein. We can measure expression of a single gene with PCR or other assays. Gene expression arrays measure expression of many genes simultaneously using spots each of which contains a matching sequence to the gene sequence to be detected. But this can only detect what we already suspected might be there. April 28, 2015 SPH 247 Statistics for Laboratory Data 3

4 RNA-Seq For RNA-Seq, the RNA in the sample is reverse transcribed into the corresponding DNA sequence. Then the DNA fragments are sequenced (in an NGS sequencer, usually Illumina ) Each fragment is mapped to the reference genome The data to be analyzed are the number of fragments mapping to each gene in a table where the columns are samples and the rows are genes. April 28, 2015 SPH 247 Statistics for Laboratory Data 4

5 RNA-Seq This mapping can be complex We can choose to estimate isoforms or not We can choose how to handle ambiguous reads (omit or spread across genes) We can then use statistical analysis to determine when there is significantly more expression in one condition or another. This may or may not be better than an expression array depending on goals. April 28, 2015 SPH 247 Statistics for Laboratory Data 5

6 Analysis of RNA-Seq Data For each gene/exon/isoform (we will say gene from now on), and for each sample, we have a count of fragments mapping to that gene. In principle, we need to test whether the counts from one group are significantly larger than another. Or we may have more than one factor or variable that could be associated. In practice, we may (probably) need to normalize the samples first, and may need to import some information across genes. April 28, 2015 SPH 247 Statistics for Laboratory Data 6

7 Read Counts We assume that bioinformatic analysis has been conducted on the original RNA-Seq data for each sample and converted into reads K ij per gene i per sample j. If the sample were a big soup of mapped fragments, and a sample were taken of it, then the statistical distribution of the K ij would be multinomial with a very high dimension, yielding a vector of length 20,000 or so with average counts per item varying over many orders of magnitude. April 28, 2015 SPH 247 Statistics for Laboratory Data 7

8 Distribution of Read Counts Generally, in a multinomial distribution, the counts per variable are correlated, because a count in one cell eliminates a count in another, but in this case weakly so. So approximately, the individual count values would be binomial, with a very small proportion p. This in turn is approximately Poisson, which might be a plausible starting place. April 28, 2015 SPH 247 Statistics for Laboratory Data 8

9 Poisson Problems The Poisson distribution has a single parameter λ, which controls both the mean λ and the variance λ. For a given gene i, the Poisson parameter λ i depends on the actual count of transcripts mapped to gene i in the soup, the total number of transcripts in the soup, and factors describing biases in production of reads from a transcript (including but not limited to the size of the gene). A hypothesis test of whether sample 1 differs in expression of gene i from sample 2, is a useless test two samples always differ. It is not a test of treatment vs. control. April 28, 2015 SPH 247 Statistics for Laboratory Data 9

10 Poisson Problems If X and Y are in dependent Poisson variates with parameters λ X and λ Y, then the hypothesis H 0 : λ X = λ Y can be tested by (for example) the statistic (X Y)/2L, where L = X + Y. This is because Var(X Y) = λ X + λ Y. But if X is from a treatment sample and Y from a control sample, this is in no way a reasonable test of whether treatment differs from control. It does not take biological variability into account at all. April 28, 2015 SPH 247 Statistics for Laboratory Data 10

11 Freezer Cells grown to 50% confluence in 8 dishes 4 treated with TNF-α 4 controls Other (biological) variability Actual transcript count for one gene in 8 dishes Poisson variability? RNA-Seq mapped read count for one gene in 8 dishes April 28, 2015 SPH 247 Statistics for Laboratory Data 11

12 Poisson Variability Is at best the technical variation Even then, the approximation is poor A better model for the overall experimental variability is perhaps a mixture of Poissons. Each sample under the same condition has a different actual transcript count before sampling, which means that λ ij for gene i and sample j differs from sample to sample. So the actual observed mapped reads count for a given gene across samples in the same condition can better be considered a mixture of Poisson random variables. April 28, 2015 SPH 247 Statistics for Laboratory Data 12

13 The Negative Binomial Given a series of Bernoulli trials (coin flips) with chance of success p on each trial, if we let X be the number of successes at the time when r th failure occurs, then X has a negative binomial distribution NB(r;p). For example, if r is 3, and the sequence is SSFSFSSSSFSSF, then X is 7 and the length of the sequence is 10. The mean of X is pr/(1-p) and the variance is pr/(1-p) 2. April 28, 2015 SPH 247 Statistics for Laboratory Data 13

14 The Negative Binomial An more useful formulation is that a negative binomial can be written as a gamma mixture of Poisson random variables. If an observed count is Poisson, with a random mean λ that differs from time to time and which is gamma distributed across samples, then the count overall is negative binomial. April 28, 2015 SPH 247 Statistics for Laboratory Data 14

15 Gamma Mixtures The gamma distribution (same as the chi-squared) is an asymmetric distribution defined on [0, ) with two parameters, the shape k and the scale θ. The mean is kθ and the variance is kθ 2. NB(r; p) is produced by a gamma mixture of Poissons with k = r and θ = p/(1-p). Otherwise put, if we can estimate the mean μ and variance σ 2, then the parameters of the negative binomial can be determined from those. April 28, 2015 SPH 247 Statistics for Laboratory Data 15

16 The negative binomial has two parameters: pr, arising from the original definition of NB k, θ from the gamma mixture µσ, 2 µα, mean and dispersion parameter 1/ α = r is sometimes called the size. 2 2 σ = µ + αµ April 28, 2015 SPH 247 Statistics for Laboratory Data 16

17 Generalized Linear Models The type of predictive model one uses depends on a number of issues; one is the type of response. Measured values such as quantity of a protein, age, weight usually can be handled in an ordinary linear regression model, possibly after a log transformation. Patient survival, which may be censored, calls for a different method (survival analysis, Cox regression). April 28, 2015 SPH 247 Statistics for Laboratory Data 17

18 If the response is binary, then can we use logistic regression models If the response is a count, we can use Poisson regression If the count has a higher variance than is consistent with the Poisson, we can use a negative binomial or quasipoisson Other forms of response can generate other types of generalized linear models One type of count data occurs in proteomics: number of unique peptide fragments mapping to the given protein. A similar count is the number of reads in RNA-Seq mapping to a particular gene. April 28, 2015 SPH 247 Statistics for Laboratory Data 18

19 Generalized Linear Models We need a linear predictor of the same form as in linear regression βx In theory, such a linear predictor can generate any type of number as a prediction, positive, negative, or zero We choose a suitable distribution for the type of data we are predicting (normal for any number, gamma for positive numbers, binomial for binary responses, Poisson for counts) We create a link function which maps the mean of the distribution onto the set of all possible linear prediction results, which is the whole real line (-, ). The inverse of the link function takes the linear predictor to the actual prediction April 28, 2015 SPH 247 Statistics for Laboratory Data 19

20 Ordinary linear regression has identity link (no transformation by the link function) and uses the normal distribution If one is predicting an inherently positive quantity, one may want to use the log link since e x is always positive. An alternative to using a generalized linear model with an log link, is to transform the data using the log or maybe glog. This is a device that works well with measurement data and may be usable in other cases April 28, 2015 SPH 247 Statistics for Laboratory Data 20

21 R glm() Families Family gaussian binomial Gamma inverse.gaussian poisson negative binomial quasi quasibinomial quasipoisson Links identity, log, inverse logit, probit, cauchit, log, cloglog inverse, identity, log 1/mu^2, inverse, identity, log log, identity, sqrt log, identity, sqrt (uses glm.nb()) identity, logit, probit, cloglog, inverse, log, 1/mu^2 and sqrt logit, probit, identity, cloglog, inverse, log, 1/mu^2 and sqrt log, identity, logit, probit, cloglog, inverse, 1/mu^2 and sqrt April 28, 2015 SPH 247 Statistics for Laboratory Data 21

22 R glm() Link Functions Links Domain Range identity (, ) (, ) log (0, ) (, ) inverse (0, ) (0, ) logit (0, 1) (, ) probit (0, 1) (, ) cloglog (0, 1) (, ) 1/mu^2 (0, ) (0, ) sqrt (0, ) (0, ) η = Xβ = g( µ ) = µ η = Xβ = g( µ ) = log( µ ) η = Xβ = g( µ ) = 1/ µ η = Xβ = g µ = Φ p 1 ( ) ( ) η = Xβ = g( µ ) = log( log(1 p)) 2 η = Xβ = g( µ ) = 1/ µ η = Xβ = g( µ ) = µ ( ) η = X β = g( µ ) = log p/ (1 p) April 28, 2015 SPH 247 Statistics for Laboratory Data 22

23 Possible Means 0 Link = Log - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 23

24 Possible Means 0 Inverse Link = e x - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 24

25 Logistic Regression: response is binary, coded as 0/1, with mean x p e ln = xβ px ( ) = 1 p 1 e n k n k f( k) = p (1 p) k E( k) = np V( k) = τnp(1 p) where τ = 1 if not overdispersed β xβ p Poisson Regression: response is a count, with mean λ ln( λ) = xβ λ = e k λ e f( k) = k! Ek ( ) = λ λ xβ V ( k) = τλ where τ = 1 if not overdispersed April 28, 2015 SPH 247 Statistics for Laboratory Data 25

26 Poisson Regression: response is a count, with mean λ ln( λ) = xβ λ = e k λ e f ( k) = k! Ek ( ) = λ λ xβ V( k) = τλ where τ = 1 if not overdispersed Negative Binomial Regression: response is a count, with mean λ ln( λ) = xβ λ = e x+ r 1 f( k) = p x rp Ek ( ) = λ = 1 p V k x xβ (1 p) n x = λ+ c λ = λ+ r λ τ = ( ) where 1 if not overdispersd e April 28, 2015 SPH 247 Statistics for Laboratory Data 26

27 Logistic Regression Suppose we are trying to predict a binary variable (patient has ovarian cancer or not, patient is responding to therapy or not) We can describe this by a 0/1 variable in which the value 1 is used for one response (patient has ovarian cancer) and 0 for the other (patient does not have ovarian cancer We can then try to predict this response April 28, 2015 SPH 247 Statistics for Laboratory Data 27

28 For a given patient, a prediction can be thought of as a kind of probability that the patient does have ovarian cancer. As such, the prediction should be between 0 and 1. Thus ordinary linear regression is not suitable The logit transform takes a number which can be anything, positive or negative, and produces a number between 0 and 1. Thus the logit link is useful for binary data April 28, 2015 SPH 247 Statistics for Laboratory Data 28

29 Possible Means 0 1 Link = Logit - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 29

30 Possible Means 0 1 Inverse Link = inverse logit - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 30

31 p logit ( p) = log if p 0 then logit( p) if p 1 then logit( p) 1 p e 1+ e x logit ( x) = if x then logit ( x) 0 if x then logit ( x) 1 x x x x e e e x x x log 1+ e log 1 e x = + = 1 x x 1 x log + e = log ( e ) = x e + e e x 1 x x + e + e 1+ e April 28, 2015 SPH 247 Statistics for Laboratory Data 31

32 April 28, 2015 SPH 247 Statistics for Laboratory Data 32

33 April 28, 2015 SPH 247 Statistics for Laboratory Data 33

34 Overdispersion A binomial random variable X with n observations and probability of success p has mean np and variance np(1 p). Could the variance be systematically larger? It could if p varies from trial to trial. The worst case is if X = n with probability p and X = 0 with probability 1 p. Then the mean is still np but the variance is now n 2 p(1 p) >> np(1 p). If we have overdispersion, we can use the quasibinomial family, which estimates the variance rather than assumes that it is np(1 p). April 28, 2015 SPH 247 Statistics for Laboratory Data 34

35 Poisson Distributions The Poisson distribution can be used to model unbounded count data, 0, 1, 2, 3, An example would be the number of cases of sepsis in each hospital in a city in a given month. The Poisson distribution has a single parameter λ, which is the mean of the distribution and also the variance. The standard deviation is λ April 28, 2015 SPH 247 Statistics for Laboratory Data 35

36 Poisson Regression If the mean λ of the Poisson distribution depends on variables x 1, x 2,, x p then we can use a generalized linear model with Poisson distribution and log link. We have that log(λ) is a linear function of x 1, x 2,, x p. This works pretty much like logistic regression, and is used for data in which the count has no specific upper limit (number of cases of lung cancer at a hospital) whereas logistic regression would be used when the count is the number out of a total (number of emergency room admissions positive for C. dificile out of the known total of admissions). With overdispersion, we can use the quasipoisson family or the negative binomial. April 28, 2015 SPH 247 Statistics for Laboratory Data 36

37 Existing Methods for RNA-Seq Existing methods often contain complex combinations or filtering, normalization, transformation, and variance estimation before any statistical tests are performed. The methods are often poorly documented and change rapidly and often substantially between frequent versions, without any push notice. The results of different methods such as DESeq, edger, and voom/limma can differ substantially. It appears that most methods produce large numbers of false positives. April 28, 2015 SPH 247 Statistics for Laboratory Data 37

38 Bottomly Data Two inbred mouse strains, C7BL6/6J (10 animals) and DBA/2J (11 animals) Gene expression in striatal neurons by RNA-Seq and expression arrays. 36,536 unique genes of which 11,870 had a count of at least 10 across the 21 samples. This data set has been used in other comparisons We compare three packages on this data set. Later we construct data sets where the null hypothesis is true. We see what some simple methods can do. April 28, 2015 SPH 247 Statistics for Laboratory Data 38

39 Statistical Tests Example gene from Bottomly data ENSMUSG C57BL/6J 2, 7, 9, 11, 3, 7, 1, 1, 2, 1 Median 2.5, Mean 4.4, sd 3.75, var DBA/2J 8, 6, 12, 10, 19, 15, 17, 6, 20, 6, 14 Median 12, Mean 12.09, sd 5.28, var April 28, 2015 SPH 247 Statistics for Laboratory Data 39

40 April 28, 2015 SPH 247 Statistics for Laboratory Data 40

41 Poisson Analysis One of the first ideas was that the counts might be Poisson distributed and that we could use this for statistical tests. In this case, if the null hypothesis was true, then the 21 observations came from the same Poisson distribution with observed mean and with variance also estimated at (Poisson distributions have mean and variance that are equal). This implies that the variance of the mean of the 10 C7BL/6J counts should be 8.429/10 and the variance of the mean of the 11 DBA/2J counts should be 8.429/11 So we can use z = ( )/ (8.429/ /11) = 6.06 to test the difference April 28, 2015 SPH 247 Statistics for Laboratory Data 41

42 Poisson Analysis This turns out to be a terrible idea. Frequently, the variance of the data is much larger than the mean. We call this overdispersion. C57BL/6J Mean 4.4, Variance DBA/2J Mean 12.09, Variance Also, Poisson analysis can be done with one treatment and one control, and any belief that statistical evidence can be gathered with no biological replicates is evidence of terminal delusion! April 28, 2015 SPH 247 Statistics for Laboratory Data 42

43 Over-dispersed Poisson Analysis In this case, we estimate the means and the variances from the data using either the negative binomial distribution or simply an over-dispersed Poisson (quasi-poisson). These may yield similar results, but need not. In the negative binomial case, we model the data so that, for a given gene g and /sample i, the count y has E( y ) ig Var( y 2 2 ig ig ig For the quasi-poisson, Va r( y ig = µ ) ) ig = µ + c µ = θµ 2 ig ig April 28, 2015 SPH 247 Statistics for Laboratory Data 43

44 > summary(glm(y1~strain,family=poisson)) Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) < 2e-16 *** straindba/2j e-09 *** (Dispersion parameter for poisson family taken to be 1) > anova(glm(y1~strain,family=poisson),test="chisq") Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL strain e-10 *** First test is Wald test, second is likelihood ratio test. But Poisson assumption is not valid. April 28, 2015 SPH 247 Statistics for Laboratory Data 44

45 > summary(glm(y1~strain,family=quasipoisson)) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) e-06 *** straindba/2j ** (Dispersion parameter for quasipoisson family taken to be ) > anova(glm(y1~strain,family=quasipoisson),test="chisq") Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL strain *** April 28, 2015 SPH 247 Statistics for Laboratory Data 45

46 > summary(glm.nb(y1~strain)) Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) e-13 *** straindba/2j e-05 *** (Dispersion parameter for Negative Binomial(5.1558) family taken to be 1) > anova(glm.nb(y1~strain),test="chisq") Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL strain e-05 *** April 28, 2015 SPH 247 Statistics for Laboratory Data 46

47 P-values ( 10 5 ) Wald LR Quasi-Poisson Negative Binomial Both the Wald test and the likelihood ratio test are valid asymptotically. The quasi-poisson and negative binomial should be similar asymptotically, depending on the variances In finite samples there can be substantial differences April 28, 2015 SPH 247 Statistics for Laboratory Data 47

48 > t.test(y1~strain) Welch Two Sample t-test t = , df = , p-value = alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: sample estimates: mean in group C57BL/6J mean in group DBA/2J Similar results to that using the negative binomial or quasi-poisson. No assumption about relationship of variance to mean. Fits separate variances to the two groups. April 28, 2015 SPH 247 Statistics for Laboratory Data 48

49 Is the Test Statistic Important? Not too much if the Negative Binomial is True Simulated negative binomial count data Two groups of 10 Mean 5 and variance 6 10,000 trials Test Reject at 5% Reject at 1% Reject at 0.1% t-test 4.9% 0.90% 0.11% Wilcoxon Test 4.6% 0.74% 0.03% NB Likelihood Ratio Test 5.2% 0.95% 0.13% NB Wald Test 5.7% 1.16% 0.22% 4.8% 5.2% 0.90% 1.11% 0.07% 0.12% April 28, 2015 SPH 247 Statistics for Laboratory Data 49

50 825 Genes Significant by at least one method 486 by all methods More significant genes by edger Is this greater power or is the test producing false positives? April 28, 2015 SPH 247 Statistics for Laboratory Data 50

51 Null Behavior 100 randomly selected subsets of size 6 out of the 10 C7BL6 mice and 100 randomly selected subsets of size 6 from the DBA/2J In each subset, 3 assigned to treatment and 3 to control at random. The null hypothesis is on the average true. Given the 11,870 genes with large enough counts, we would expect about 0.01% to be significant at the p = level. We have 2 1,187,000 tests, so there should be about 237 rejections for each method April 28, 2015 SPH 247 Statistics for Laboratory Data 51

52 Significant at p = Null Hypothesis is True Estimator Observed Expected Over-Rejection DESeq DESeq edger edger-robust limma-voom April 28, 2015 SPH 247 Statistics for Laboratory Data 52

53 Important Factors These methods share some features but differ in their implementation. Most of them treat the data as over-dispersed Poisson, or specifically negative binomial They may use a test based on the negative binomial distribution, but limma-voom uses standard regression/anova with variance-based weights Since sample sizes tend to be small, they all have the possibility to replace the variance estimate from a given gene with a smoothed estimate based on all the genes. Normalization may be needed due to the differing total numbers of reads April 28, 2015 SPH 247 Statistics for Laboratory Data 53

54 Factors That May Cause False Positives Normalization procedure Normalization driven by high-abundance transcripts can cause spurious significance of low abundance ones. Variance estimation Many methods borrow strength from other genes to get better variance estimates. This may cause bias Statistical test used Many choices Nature of the data May require filtering April 28, 2015 SPH 247 Statistics for Laboratory Data 54

55 Normalization Normalization is commonly used in differential expression analysis with microarrays. Normalization for RNA-Seq is often couched in terms like library size as if we should divide by the total (mapped) fragment count This can be problematic because it depends on only a few genes. Mostly, normalization in RNA-Seq is done with a single constant per sample, though this is unusual with expression arrays April 28, 2015 SPH 247 Statistics for Laboratory Data 55

56 Library Size Normalization Using the total fragment count is problematic because highly expressed genes will provide most of the fragments Four genes with expression 10,000, 100, 150, 200 in condition A and 20,000, 100, 150, 200 in condition B. Normalized fragment counts use total fragment counts of 10,450 and 20,450 and can be normalized to 15,450 Normalized fragment counts are 14,785, 148, 222, 296 in condition A and 15,110, 76, 113, 151 in condition B, so upregulation of gene 1 has been turned into down-regulation of the other three. Fold changes are 2.0, 1.0, 1.0, 1.0 before normalization and 1.02, 0.51, 0.51, 0.51 after. April 28, 2015 SPH 247 Statistics for Laboratory Data 56

57 Normalization Methods Total count Quantile Normalization of other signal based methods Geometric normalization (Cuffdiff2/DESeq version) For each gene, compute the geometric mean of the total fragment count across libraries Library size is the median across genes of the total fragment count divided by the geometric mean fragment count. In our 4-gene example, the geometric means are 14,142, 100, 150, 200, the ratios for A are 0.707, 1, 1, 1 and for B are 1.414, 1, 1, 1, so the size factors are the medians, namely 1 and 1 Median normalization: divide each count by the median count of that sample (and multiply by the mean of the medians) April 28, 2015 SPH 247 Statistics for Laboratory Data 57

58 k fragment count for gene i in library j ij m 1/ m m 1 gi = ki ν = ki ν ν = 1 m ν = 1 s j exp log( ) kij = median i g i Cuffdiff2 first normalizes replicates under the same conditions giving an internal library size of s j Then the arithmetic mean of the scaled gene counts for each gene is used to compute an external library size of η j. This is a possible source of problems, the scale of which is unknown April 28, 2015 SPH 247 Statistics for Laboratory Data 58

59 Variance Estimation Many RNA-Seq experiments are small. Small studies have low power Very small p-values are needed to pass the false discovery filter So the default for small studies is no results Suppose two means differ by one standard deviation With 10,000 genes, the Bonferroni level is With two groups of 3, there are ~4df and corresponds to almost 50 standard deviations April 28, 2015 SPH 247 Statistics for Laboratory Data 59

60 Multiplicity Problems Suppose we have 10,000 genes and can run a test on each one. If we use a criterion of p = 0.05, then we will have 500 false positives. We can help this with a lower value of p, such as (~10 fp) If we pick p = 0.05 /10,000 then fp = 0 with probability this is too strict Mostly, we use false discovery rate control which maps the vector of p-values to a vector of adjusted p-values. On the average, the set of tests with FDR adjusted p-values below 10% contains at most 10% false discoveries. But this still requires very small p-values to be significant. These can be very hard to achieve with small sample sizes. April 28, 2015 SPH 247 Statistics for Laboratory Data 60

61 Variance Estimation If we use tests that reference only the data from the specific gene, then usually variance estimation is not a problem. But with small sample sizes, the power is low, so there is a temptation to improve the variance estimates by smoothing or shrinkage. The variance of a negative binomial increases with the mean in a way controlled by a variance parameter. We can smooth the plot of the sample variance vs. the mean and use the smoothed estimate instead of the per-gene estimate. April 28, 2015 SPH 247 Statistics for Laboratory Data 61

62 A Poisson random variable with parameter λ has µ = λ σ 2 = λ A negative binomial random variable can be seen as a mixture of Poissons with varying λ µ = λ σ = λ+ c λ σ / µ = λ + c c s x cˆ = if this is positive 2 x If we smooth a plot of the variance or the square CV or cˆ 2 vs. the mean we obtain an average estimate of c for the given value of µ We can use the individual variance, the smoothed variance, or a compromise April 28, 2015 SPH 247 Statistics for Laboratory Data 62

63 April 28, 2015 SPH 247 Statistics for Laboratory Data 63

64 April 28, 2015 SPH 247 Statistics for Laboratory Data 64

65 Variance Estimation The sample variance is an unbiased estimate of the population variance A smoothed variance will be biased down or up depending on the data point While this can reduce the MSE of estimating the variance, it may increase false positives and false negatives for tests based on those variance estimates This is a possible source of the differences in results in various methods of analysis. April 28, 2015 SPH 247 Statistics for Laboratory Data 65

66 Empirical Bayes The original inspiration for the idea of variance shrinkage is from a result of Charles Stein and implementation in the form of empirical Bayes analysis. Empirical Bayes means that the prior, instead of being given in advance, is estimated also from the data. This can sometimes be an effective method for reducing the mean square error of estimation. We go through this in the case of measured response (as in arrays) not counted response, where the math is only approximate. April 28, 2015 SPH 247 Statistics for Laboratory Data 66

67 Suppose we have n observations from a normal distribution 2 ~ (, ),1 x i N µσ i n then x is the 'best' estimator of µ in a variety of ways: Minimum Variance Unbiased Estimator Minimum Mean Square Erro r Estimator Maximum Likelihood Estimator What if we have multiple distributions? x N µ σ i n,1 j m 2 ij ~ ( j, j ),1 j The estimates x j for µ are still MVUE and MLE, but not MMSQEE. j This idea is due to Charles Stein and later Willard James. Suppose that for simplicity we assume that all the variances are known and that µ θτ 2 j ~ N(, ) f( µ j, xj) f( xj µ j) f( µ j) f( µ j xj) = = which is itself a normal distribution with mean f( x ) f( x ) j kx k k n j j + (1 ) θ where = τ / ( τ + σ j / j) a weighted average of the aggregate mean and the individual mean 2 but this is only helpful if we know or can estimate θ and τ April 28, 2015 SPH 247 Statistics for Laboratory Data 67

68 µ θτ x x j 2 ~ N(, ) 2 j µ j ~ N( µ j, σ ) j is normally distributed E( x) = E [ E ( x µ )] = E [ µ ] = θ µ x µ µ 2 2 V( x) = V( x µ ) + V( µ ) = σ / n+ τ ˆ θ = x ˆ τ = Vˆ ( x) se ( x) 2 2 pooled or use maximum likelihood A similar setup works for estimation of variances. We have the individual variance estimates and a hypothesized prior distribution for the true variances. The conjugate prior is the inverse gamma. We use the posterior estimate of the individual variance, and estimate the parameters of the prior. Empirical Bayes. April 28, 2015 SPH 247 Statistics for Laboratory Data 68

69 We have p measurements (genes) and n arrays and a linear model to apply to each gene. The tests (F- or T- ) use the MSE or its square root as the denominator. If we have only a small df for each gene, then we may want to 'borrow strength' The true residual variance (MSE) for gene i is σ The observed residual variance (MSE) for gene i is s ~ σ χ ( ν) / ν where ν is the residual degrees of freedom i i 2 This has a gamma distribution with mean τ = σ and shape a = ν / 2. If the τ = σ i 2 i i 2 i i are random and distributed from an inverse gamma distribution (reciprocal of a gamma) then this has param eters α and β or α and η = αβ ( η is the mean of the reciprocal of τ, which is sometimes called the precision). The density is f ( ταη ;, ) τ e = τ e α 1 α / ητ α 1 1/ τβ April 28, 2015 SPH 247 Statistics for Laboratory Data 69

70 2 The posterior distribution of τ given s i is proporti e τ 1/ τβ * where 2 = which, like the prior, is inverse gamma with parameters xν + 2 α / η ν/2+ α+ 1 * β * α ν α * β η = /2+ 2 = xν + 2 α / η = αβ = * * * ν + 2α xν + 2 α / η x = onal to 1 xν + 2 α / η 2 ν 1 2α = = s * i + η ν + 2α ν + 2α η ν + 2α So our estimate of the MSE is a weighted average of the observed MSE and the prior (reciprocal mean precision) with weights equal to the degrees of freedom April 28, 2015 SPH 247 Statistics for Laboratory Data 70

71 The observed gene-specific variance s distribution with mean τ and a = ν / 2, and if the prior for τ is inverse i 2 i has a gamma (chi-squared) gamma with parameters α and β, then it can be shown that the overall distribution of the observed gene-specific variances is 2 αβ si = F ν,2 α Given the collection of gene-specific variances, we can estimate the parameters α and β of the prior by maximum likelihood or by comparing moments. If the residual degrees of freedom is large, then it is probably best to use the gene-specific p-values. If it is small, there can be a substantial gain in power using the posterior p-values (p-values from the F- or t- test using the posterior MSE). April 28, 2015 SPH 247 Statistics for Laboratory Data 71

72 Lowess Normalization If we have vectors x 1, x 2,, x n each of length k, we can mean normalize them by computing the sample means m 1, m 2,, m n of each, and the grand mean m, and then replacing x i by x i m i +m. Intensity based normalization does this separately for small, medium, large genes, for possibly many bands, or in a running average. Lowess normalization orders the genes by mean expression and does a lowess smooth of each gene, subtracts that, and adds the mean of the lowess smooths. April 28, 2015 SPH 247 Statistics for Laboratory Data 72

73 April 28, 2015 SPH 247 Statistics for Laboratory Data 73

74 Conclusions Analysis of RNA-Seq data is still in the early stages of development. Existing programs vary substantially in the results and it is unclear which is the most reliable More work is needed that focuses on the fundamental components of analysis, particularly choice of test statistics, variance estimation, and sample normalization. April 28, 2015 SPH 247 Statistics for Laboratory Data 74

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis March 18, 2016 UVA Seminar RNA Seq 1 RNA Seq Gene expression is the transcription of the